malexmave/pdok-mirror

pdok-mirror

A system for automatically mirroring all documents from the parliamentary documentation system (pdok) of the German Bundestag. It downloads a copy of every document to the local hard drive (~66 GB of PDFs at the time of writing), creates a .txt version of each one for good measure, and then optionally uploads the PDFs to Archive.org (a feature you will most likely not need, as I am already doing that).

Yes, but... why?

Why not?

Also, given the current trends towards electing populists who would much rather see certain documents scrubbed from the archives, it can never hurt to have a backup of the history of your democracy somewhere safe.

Legal considerations

In Germany, the documents this software automatically downloads are not covered by copyright, as they are official state documents (§ 5 Abs. 2 UrhG). However, you are required to:

Setup

Install all dependencies (see below) using pip and your distribution's package manager (the latter only if you want the pdf->text conversion). If you want to use the internetarchive functionality, run ia configure and enter your internetarchive credentials (but again, you probably don't need that, as I'm already doing that).
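To verify the setup, a small script can check that each dependency is importable and that the command-line tools are on the PATH. The following is only a sketch and not part of pdok-mirror: the helper names missing_modules and missing_tools are made up for illustration, and it uses nothing beyond the Python standard library.

```python
import importlib.util
import shutil

# Import names of the Python dependencies. Note that the python-magic
# package is imported as "magic".
PYTHON_MODULES = ["peewee", "requests", "internetarchive", "magic"]

# Command-line tools: pdftotext is optional (pdf->text conversion),
# ia is the CLI used for "ia configure".
CLI_TOOLS = ["pdftotext", "ia"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_tools(names):
    """Return the executables that are not on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_modules(PYTHON_MODULES):
        print(f"missing Python module: {name}")
    for name in missing_tools(CLI_TOOLS):
        print(f"missing command-line tool: {name}")
```

If the script prints nothing, everything was found. A missing pdftotext only matters if you want the text conversion, and a missing ia only matters if you want to upload to Archive.org.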

Dependencies

  • peewee (as the ORM for the local database - licensed under the MIT License)
  • requests (to download the files - licensed under the Apache License 2.0)
  • internetarchive (to upload to archive.org - licensed under the AGPLv3)
  • python-magic (to check the MIME types of downloaded files - licensed under the MIT License)
  • pdftotext installed as a CLI application (for pdf->text conversion - optional, part of poppler-utils, not a python library)
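Because pdftotext is an external program rather than a Python library, the conversion step amounts to calling it as a subprocess. The sketch below shows one way this could look. It is an assumption for illustration, not the code pdok-mirror actually uses: the function names txt_path_for and pdf_to_text are hypothetical, as is the choice to write the .txt file next to the PDF.

```python
import shutil
import subprocess
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the path of a .txt file sitting next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_text(pdf_path):
    """Convert one PDF to text with the pdftotext CLI.

    Returns the path of the text file, or None if pdftotext is not
    installed (the conversion is optional).
    """
    if shutil.which("pdftotext") is None:
        return None
    out_path = txt_path_for(pdf_path)
    # pdftotext takes the input PDF followed by the output text file.
    subprocess.run(["pdftotext", str(pdf_path), str(out_path)], check=True)
    return out_path
```

Returning None when the tool is missing keeps the conversion optional, so a mirror run without poppler-utils can still download the PDFs.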

License

As we use the internetarchive library, which is licensed under the AGPLv3, this software is also licensed under the AGPLv3. See LICENSE.txt for details.
