- Add pdfbox-app-2.0.7.jar to the path.
- Gather PDF files in pdf_dir (default ~/testfiles).
- Run once only: prodigy dataset -a "Peter Williams" blank "Blank pages corpus"
- Run pdf_page_summaries to find pages that contain text marks and pages that contain non-text marks.
- Create the preprocessed corpus in summary_dir (default ~/testdata.pages1) with: python make_page_corpus.py
- To examine pages with only text marks and decide which ones are effectively blank, use: prodigy textcat.teach blank en_core_web_lg all.pages.jsonl --label BLANK
1. Pages with no marks are blank. This won't change.
2. Pages with non-text marks are not blank.
3. A human panel will go through pages with only text marks and decide whether they are blank.
4. Use ML to model the panel's decisions from step 3.
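
The first three rules above can be sketched as a small triage function. This is only an illustration: the `PageSummary` fields and the `Verdict` names are hypothetical and are not taken from the output of pdf_page_summaries.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    BLANK = "blank"
    NOT_BLANK = "not blank"
    NEEDS_REVIEW = "needs review"  # text marks only: human panel, later ML


@dataclass
class PageSummary:
    # Hypothetical per-page counts; the real summary format may differ.
    text_marks: int
    non_text_marks: int


def triage(page: PageSummary) -> Verdict:
    """Apply the fixed rules; defer text-only pages to review."""
    if page.non_text_marks > 0:
        return Verdict.NOT_BLANK
    if page.text_marks == 0:
        return Verdict.BLANK
    return Verdict.NEEDS_REVIEW
```

Only pages that come back as `NEEDS_REVIEW` would need to go to the panel or the model.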
Effectively blank pages are pages that have only text marks, such as:
- "Page intentionally left blank"
- Page number only
- Watermark only
- Page number and watermark
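
Two of the cases listed above can be approximated from the extracted text alone. The sketch below is a rough heuristic, not the project's method: the function name, the regular expressions, and the category labels are all assumptions. Watermarks are not handled, since they are hard to recognise from text content alone.

```python
import re
from typing import Optional

# Assumed patterns; tune them against the real corpus.
INTENTIONALLY_BLANK = re.compile(r"intentionally\s+(left\s+)?blank", re.IGNORECASE)
PAGE_NUMBER = re.compile(
    r"^(page\s+)?(\d+|[ivxlcdm]+)(\s+of\s+\d+)?$", re.IGNORECASE
)


def text_only_category(text: str) -> Optional[str]:
    """Return a guess at why a text-only page is effectively blank, or None."""
    normalized = " ".join(text.split())  # collapse whitespace and newlines
    if not normalized:
        return None
    if INTENTIONALLY_BLANK.search(normalized):
        return "intentionally blank"
    if PAGE_NUMBER.match(normalized):
        return "page number only"
    return None
```

The Roman-numeral branch will misfire on short words made only of those letters (for example "mix"), so a result from this function is a hint for the reviewer rather than a decision.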
- Fix page number detection for ~/testdata/Year_8_Pythagoras_Booklet.pdf; use text location as a hint.
- PowerPoint decks where successive pages are supersets of the previous page: can we find the last page in such a sequence?
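
One possible approach to the superset question, assuming each page can be reduced to a set of hashable marks (for example text lines together with their positions). That representation and the function name are assumptions made for this sketch.

```python
from typing import List, Set


def last_pages_of_builds(pages: List[Set[str]]) -> List[int]:
    """Return indices of pages that the following page does not extend.

    A page counts as an intermediate build step when it is non-empty and
    the next page contains every one of its marks.
    """
    last = []
    for i, marks in enumerate(pages):
        is_final = i + 1 == len(pages)
        extended = (not is_final) and bool(marks) and pages[i + 1] >= marks
        if not extended:
            last.append(i)
    return last


deck = [{"Title"}, {"Title", "a"}, {"Title", "a", "b"}, {"Other"}]
print(last_pages_of_builds(deck))  # [2, 3]
```

Exact set containment is strict: if extracted text or coordinates shift slightly between build steps, a fuzzy comparison would be needed instead.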
- http://localhost:8000/jkraaijeveld_thesis.pdf#page=3
- http://localhost:8000/lantz.dissertation.pdf#page=2
- http://localhost:8000/Preservation%20of%20privacy%20in%20public-key%20cryptography.pdf#page=4
- http://localhost:8000/10.1.1.458.9390.pdf#page=19
- http://localhost:8000/talk_Eval.pdf#page=56
- http://localhost:8000/23-parallel-scan.pdf#page=21
- No marks except near the top and bottom of the page
- Fewer than 3 lines (usually)
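
These two hints could be combined into a single check. The sketch assumes each text line comes with a vertical position normalised to 0.0 (top of page) through 1.0 (bottom). The 10% margin, the parameter names, and the function name are guesses, not values from the project.

```python
from typing import List, Tuple

# (y_fraction, text): a hypothetical normalised line representation.
Line = Tuple[float, str]


def only_header_footer_marks(
    lines: List[Line], margin: float = 0.1, max_lines: int = 3
) -> bool:
    """True if the page has fewer than max_lines lines, all in the margins."""
    if len(lines) >= max_lines:
        return False
    return all(y <= margin or y >= 1.0 - margin for y, _ in lines)
```

Because "fewer than 3 lines" holds only usually, `max_lines` is best treated as a tunable hint and not as a hard rule.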