Processes warcs with pdfs, making them warcs with just plain text for further processing.

jelmervdl/pdfwarc2warc

pdfwarc2warc

Convert a warc filled with pdf records to a warc with plain text records.

Installation

This project only depends on warcio and pd3f (currently on our own branch with a couple of bugfixes).

pip3 install -r requirements.txt

Usage

docker-compose up -d
./pdfwarc2warc.py pdf.warc.gz > text.warc.gz

Make sure Parsr is running and reachable over HTTP. The default location is localhost:3001; change it with the --parsr-location option.

The main thread reads records into a queue, from which worker threads (their number is set with -j or --threads) pick records. Each worker extracts the PDF from the record, submits it to Parsr, processes the response with pd3f, and puts the resulting text back into the record payload. The record is then placed on a second queue to be written to stdout (or to the file specified with --output).
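The two-queue flow described above can be sketched with the standard library alone. This is a minimal illustration, not the code in pdfwarc2warc.py: run_pipeline and convert are hypothetical names, convert stands in for the Parsr and pd3f steps, and results are collected in a list instead of being written to a WARC.

```python
import queue
import threading

SENTINEL = object()  # marks the end of a queue


def run_pipeline(records, convert, threads=4):
    """Push records through `threads` workers and collect the results.

    `convert` is a hypothetical stand-in for "send PDF to Parsr, run pd3f".
    Output order is not guaranteed to match input order.
    """
    todo = queue.Queue(maxsize=threads * 2)  # bounded, so the reader cannot run far ahead
    done = queue.Queue()
    written = []

    def worker():
        while True:
            record = todo.get()
            if record is SENTINEL:
                done.put(SENTINEL)  # tell the writer this worker has finished
                break
            try:
                done.put(convert(record))
            except Exception:
                continue  # recoverable error: skip this record

    def writer():
        finished = 0
        while finished < threads:
            item = done.get()
            if item is SENTINEL:
                finished += 1
            else:
                written.append(item)  # the real tool writes a WARC record here

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    writer_thread = threading.Thread(target=writer)
    for thread in workers:
        thread.start()
    writer_thread.start()

    for record in records:  # the main thread acts as the reader
        todo.put(record)
    for _ in workers:
        todo.put(SENTINEL)  # one end marker per worker

    for thread in workers:
        thread.join()
    writer_thread.join()
    return written
```

The bounded input queue is a design choice of this sketch: it keeps memory use flat when reading is faster than conversion.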

Error handling

A lot of things can go wrong when processing PDF files. Generally, records that produce errors are skipped, unless the error is unrecoverable (e.g. Parsr cannot be contacted, or a filesystem error occurs).

To aid debugging of the PDF processing (mainly done by pd3f), the --dump-errors path/to/folder option writes, for each unknown error, the exception traceback, the PDF, and the Parsr response (if there was one) to a folder, keyed by the WARC record ID.
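As a rough idea of what such a dump could look like, here is a sketch that writes one sub-folder per failing record. The function name dump_error, the file names, and the way the record ID is turned into a folder name are all assumptions made for illustration; the layout produced by the actual --dump-errors option may differ.

```python
import json
import re
import traceback
from pathlib import Path


def dump_error(folder, record_id, exc, pdf_bytes, parsr_response=None):
    """Write traceback, PDF and optional Parsr response for one failed record."""
    # WARC record IDs look like <urn:uuid:...>; make them safe as a folder name.
    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", record_id).strip("_")
    target = Path(folder) / safe_id
    target.mkdir(parents=True, exist_ok=True)

    trace = traceback.format_exception(type(exc), exc, exc.__traceback__)
    (target / "traceback.txt").write_text("".join(trace))
    (target / "document.pdf").write_bytes(pdf_bytes)
    if parsr_response is not None:  # Parsr may have failed before responding
        (target / "parsr.json").write_text(json.dumps(parsr_response))
    return target
```

Called from an except block, this keeps everything needed to reproduce a failure in one place.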

Some errors are caught proactively before the Parsr output is passed on to pd3f, mainly empty documents and documents without any text.
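A check of this kind might look like the sketch below. It assumes the Parsr response is a JSON object with a "pages" list, where each page has "elements" and each element has a "content" field that is either nested elements or a string. That shape is an assumption here, and EmptyDocumentError, has_text, and check_document are hypothetical names, not the ones used in this project.

```python
class EmptyDocumentError(Exception):
    """Raised when a parsed document has no pages or no text (hypothetical)."""


def has_text(node):
    """Return True if any nested "content" holds a non-blank string."""
    if isinstance(node, str):
        return bool(node.strip())
    if isinstance(node, list):
        return any(has_text(child) for child in node)
    if isinstance(node, dict):
        return has_text(node.get("content"))
    return False  # None, numbers, and anything else carry no text


def check_document(doc):
    """Reject documents that would give the text extraction nothing to work with."""
    pages = doc.get("pages") or []
    if not pages:
        raise EmptyDocumentError("document has no pages")
    if not any(has_text(page.get("elements")) for page in pages):
        raise EmptyDocumentError("document has no text")
```

Raising a dedicated exception lets the caller skip the record as a known case rather than dumping it as an unknown error.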
