
This project served a very specific, narrow purpose, so it is unlikely to come in handy elsewhere.

A friend wanted some way to import the table at https://die-deutsche-wirtschaft.de/das-ranking-der-groessten-mittelstaendler-deutschlands/ into a spreadsheet. However, the table on the website can be neither copied nor even selected, which makes the task rather difficult. So he took screenshots of the table, saved them as PDF files and sent those to me.

From these PDF files, I extracted only the pages containing the table and collated them into a single file called combined.pdf. The script here parses that PDF file and writes each row of the table as an individual record to the output.tsv file.
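As a rough illustration of the output step only, here is a minimal sketch of writing records as tab-separated lines. It is not the code from this repository: the function names `formatRecord` and `writeTSV` and the sample rows are hypothetical, and the sketch assumes the rows have already been extracted from the PDF. It prints to standard output rather than to output.tsv.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// fieldCleaner replaces characters that would break the TSV layout.
var fieldCleaner = strings.NewReplacer("\t", " ", "\r", " ", "\n", " ")

// formatRecord joins the fields of one record with tabs.
// Tabs and line breaks inside a field are replaced with spaces so that
// each record stays on exactly one line.
func formatRecord(fields []string) string {
	cleaned := make([]string, len(fields))
	for i, f := range fields {
		cleaned[i] = fieldCleaner.Replace(f)
	}
	return strings.Join(cleaned, "\t")
}

// writeTSV writes every record, including the last one, on its own line.
func writeTSV(w io.Writer, records [][]string) error {
	bw := bufio.NewWriter(w)
	for _, rec := range records {
		if _, err := fmt.Fprintln(bw, formatRecord(rec)); err != nil {
			return err
		}
	}
	return bw.Flush()
}

func main() {
	// Hypothetical rows; the real table's columns are not described here.
	records := [][]string{
		{"1", "Example GmbH", "Berlin"},
		{"2", "Sample AG", "Hamburg"},
	}
	if err := writeTSV(os.Stdout, records); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Sanitising tabs and line breaks inside fields is a design choice of this sketch: it keeps the one-record-per-line property intact at the cost of altering such fields slightly.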

The script isn't perfect. It sometimes fails to separate columns (but it tells you which rows are affected, so you can fix them manually), and it sometimes omits the last line at the end of a page (which you need to check for manually). It also doesn't write the very last record on a separate line, which likewise needs to be fixed manually.
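One way such rows could be detected is by counting columns after splitting. The sketch below is an assumption, not the script's actual logic: it treats runs of two or more whitespace characters as column separators, and `splitColumns`, the expected column count of 3 and the sample lines are all hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// columnGap matches two or more consecutive whitespace characters.
// Treating such a run as a column separator is an assumption of this sketch.
var columnGap = regexp.MustCompile(`\s{2,}`)

// splitColumns splits one line of extracted text into fields and reports
// whether the number of fields equals the expected column count.
func splitColumns(line string, want int) ([]string, bool) {
	fields := columnGap.Split(strings.TrimSpace(line), -1)
	return fields, len(fields) == want
}

func main() {
	const wantColumns = 3 // hypothetical column count
	lines := []string{
		"1   Example GmbH   Berlin",
		"2   Sample AG Hamburg", // the gap before the last column has collapsed
	}
	for i, line := range lines {
		fields, ok := splitColumns(line, wantColumns)
		if !ok {
			// Report the row so that it can be fixed by hand.
			fmt.Printf("row %d: got %d columns, want %d: %q\n",
				i+1, len(fields), wantColumns, line)
		}
	}
}
```

With these sample lines, only the second row is reported, because the missing gap leaves it with two fields instead of three.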

After making all the necessary manual adjustments, run check.go to perform a quick sanity check on the final TSV file.
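What check.go verifies is not described here, so the following is only a sketch of what a sanity check on a TSV file might look like. It assumes that every row should have the same number of columns as the first row and that no field should be empty. The function name `checkLines` and the built-in sample data are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// checkLines returns one message per problem found. The first non-empty
// line determines the expected number of columns.
func checkLines(lines []string) []string {
	var problems []string
	want := -1
	for i, line := range lines {
		if line == "" {
			problems = append(problems, fmt.Sprintf("line %d: empty line", i+1))
			continue
		}
		fields := strings.Split(line, "\t")
		if want == -1 {
			want = len(fields)
		}
		if len(fields) != want {
			problems = append(problems,
				fmt.Sprintf("line %d: got %d columns, want %d", i+1, len(fields), want))
			continue
		}
		for j, f := range fields {
			if strings.TrimSpace(f) == "" {
				problems = append(problems,
					fmt.Sprintf("line %d: column %d is empty", i+1, j+1))
			}
		}
	}
	return problems
}

func main() {
	// Without an argument, a small hypothetical sample is checked;
	// otherwise the given file is read, e.g. output.tsv.
	text := "1\tExample GmbH\tBerlin\n2\tSample AG Hamburg\n"
	if len(os.Args) > 1 {
		data, err := os.ReadFile(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		text = strings.ReplaceAll(string(data), "\r\n", "\n")
	}
	// Drop a single trailing newline so it is not reported as an empty line.
	text = strings.TrimSuffix(text, "\n")
	problems := checkLines(strings.Split(text, "\n"))
	for _, p := range problems {
		fmt.Println(p)
	}
	if len(problems) == 0 {
		fmt.Println("no problems found")
	}
}
```

Assuming the sketch is saved under a hypothetical name such as sanity.go, it could be run against the real file with `go run sanity.go output.tsv`.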

The final TSV file can then be directly opened in your favourite spreadsheet program.

Repository files:

README.md
check.go
combined.pdf
main.go
output.tsv


gkotian/pdfs
