gkotian/pdfs

This project served a very specific, narrow purpose, so it is unlikely to come in handy elsewhere.

A friend wanted to import the table at https://die-deutsche-wirtschaft.de/das-ranking-der-groessten-mittelstaendler-deutschlands/ into a spreadsheet. However, the table on the website can be neither copied nor selected, which makes the task rather difficult. So he took screenshots of the tables, saved them as PDF files and sent me the PDFs.

From the various PDF files, I extracted only the pages containing the tables and collated them into a single file called combined.pdf. The script here parses that file and writes each table row as an individual record to output.tsv.

The script isn't perfect. It sometimes fails to separate columns (but reports which rows are affected, so you can fix them manually), and it sometimes omits the last line of a page (which you need to check for manually). It also doesn't write the very last record on a separate line, which likewise needs to be fixed manually.

After making all the necessary manual adjustments, run check.go to perform a quick sanity check on the final TSV file.
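
For illustration only, here is a minimal sketch of what such a sanity check could look like. It is not the contents of check.go. The function name checkRecords is hypothetical, and the sketch assumes that every record in output.tsv should have the same number of tab-separated columns as the first one. Rows with unseparated columns would then show up with too few columns, and a record merged onto another line with too many.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// checkRecords returns one message for every line whose number of
// tab-separated fields differs from wantCols. (Hypothetical helper,
// not taken from check.go.)
func checkRecords(lines []string, wantCols int) []string {
	var problems []string
	for i, line := range lines {
		got := len(strings.Split(line, "\t"))
		if got != wantCols {
			problems = append(problems,
				fmt.Sprintf("line %d: got %d columns, want %d", i+1, got, wantCols))
		}
	}
	return problems
}

func main() {
	data, err := os.ReadFile("output.tsv")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Drop the trailing newline so it does not produce an empty last line.
	lines := strings.Split(strings.TrimRight(string(data), "\r\n"), "\n")

	// Assumption: the first line has the correct number of columns.
	wantCols := len(strings.Split(lines[0], "\t"))

	problems := checkRecords(lines, wantCols)
	for _, p := range problems {
		fmt.Println(p)
	}
	if len(problems) > 0 {
		os.Exit(1)
	}
	fmt.Println("all records have", wantCols, "columns")
}
```

Because the expected column count is taken from the first line, that line should be verified by hand before relying on the result.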

The final TSV file can then be directly opened in your favourite spreadsheet program.
