Latest commit 7731488 Nov 5, 2013 William Palmer (BL) Add pdfs from Govdocs1 that are potentially broken

PDF files from Govdocs1 that may cause errors

About these files

These are test PDF files from Govdocs1. About 130,000 PDF files (all the PDFs among the first ~500k files in Govdocs1) were tested against PDF software, and the files collected here are the ones that caused problems. They may well be broken, but that is what makes them potentially useful for stress-testing.

Description

  • error_set_1 contains PDF files that failed in one particular way
  • error_set_2 contains PDF files that failed in a different way
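
As a starting point for stress-testing, the two folders can be walked with a small harness like the sketch below. This is only an illustration: this README does not say which PDF software or checks were used to select the files, so the header/trailer check here is an assumption and not the original test. The function names `quick_pdf_check` and `scan_folder` are hypothetical, and the sketch assumes the files carry a `.pdf` extension.

```python
from pathlib import Path


def quick_pdf_check(path):
    """Return a list of structural problems found in one file (empty if none).

    Assumption: this is a crude sanity check, not the test used to build
    error_set_1 / error_set_2.
    """
    data = Path(path).read_bytes()
    problems = []
    # A PDF should carry the "%PDF-" header near the start of the file.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header")
    # A complete PDF should have an "%%EOF" marker near the end of the file.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker")
    return problems


def scan_folder(folder):
    """Map each .pdf file under `folder` to its list of problems."""
    results = {}
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        try:
            results[str(pdf)] = quick_pdf_check(pdf)
        except OSError as exc:
            # Keep going: one unreadable file should not stop the whole run.
            results[str(pdf)] = [f"unreadable: {exc}"]
    return results


if __name__ == "__main__":
    for name in ("error_set_1", "error_set_2"):
        for path, problems in scan_folder(name).items():
            print(path, "OK" if not problems else "; ".join(problems))
```

To exercise a real PDF parser instead, replace the body of `quick_pdf_check` with a call into the library under test and record any exception it raises. Keeping the per-file `try`/`except` in `scan_folder` means a single bad file cannot abort the run.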

License

All PDF files in this folder and its subfolders are copied from Govdocs1.

More information about these files can be found at http://digitalcorpora.org/corpora/files

Relevant quotes:

We have created and released a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by performing searches for words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two, for documents of specified file types that resided on web servers in the .gov domain using the Yahoo and Google search engines.

If you decide to use this corpus in published research, the appropriate citation is: Garfinkel, Farrell, Roussev and Dinolt, Bringing Science to Digital Forensics with Standardized Forensic Corpora, DFRWS 2009, Montreal, Canada