pdfwarc2warc

Convert a WARC file containing PDF records into a WARC file containing plain-text records.

Installation

This project only depends on warcio and pd3f (currently on our own branch with a couple of bugfixes).

pip3 install -r requirements.txt

Usage

docker-compose up -d
./pdfwarc2warc.py pdf.warc.gz > text.warc.gz

Make sure Parsr is running and reachable over HTTP (by default at localhost:3001; change this with the --parsr-location option).
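A quick way to confirm that something is listening at the Parsr location before starting a long run is a plain TCP connection test. The sketch below is not part of pdfwarc2warc.py; the function name `parsr_is_reachable` is hypothetical, and it assumes the location is given in host:port form like the default. It only checks that the port accepts connections, not that the listener is actually Parsr.

```python
import socket


def parsr_is_reachable(location: str = "localhost:3001", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Assumes `location` looks like "host:port". This only shows that
    something is listening there, not that it is Parsr.
    """
    host, _, port = location.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError: connection refused, timed out, or host not found.
        # ValueError: the port part was missing or not a number.
        return False
```

Calling `parsr_is_reachable()` with no arguments tests the default location.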

The main thread reads records into a queue, from which worker threads (their number is set with -j or --threads) pick records. Each worker extracts the PDF from the record, submits it to Parsr, processes the response with pd3f, and puts the resulting text back into the record payload. The record is then put on a second queue to be written to stdout (or to the file given with --output).
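The reader, worker, and writer arrangement described above can be sketched with the standard library alone. This is an illustration of the pattern, not the script's actual code: `run_pipeline`, `convert`, and `write` are hypothetical names, `convert` stands in for the Parsr and pd3f steps, and it is assumed not to raise.

```python
import queue
import threading

SENTINEL = object()  # unique marker meaning "no more records"


def run_pipeline(records, convert, write, threads=4):
    """Read records, convert them in worker threads, and write the results."""
    todo = queue.Queue(maxsize=threads * 2)  # reader -> workers (bounded)
    done = queue.Queue()                     # workers -> writer

    def worker():
        while True:
            record = todo.get()
            if record is SENTINEL:
                break
            done.put(convert(record))  # assumed not to raise in this sketch
        done.put(SENTINEL)  # tell the writer this worker has finished

    def writer():
        finished = 0
        while finished < threads:  # stop once every worker has signed off
            record = done.get()
            if record is SENTINEL:
                finished += 1
            else:
                write(record)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    writer_thread = threading.Thread(target=writer)
    for thread in workers + [writer_thread]:
        thread.start()

    for record in records:  # the main thread acts as the reader
        todo.put(record)
    for _ in workers:
        todo.put(SENTINEL)  # one sentinel per worker

    for thread in workers + [writer_thread]:
        thread.join()
```

Because several workers run at once, records may reach the writer in a different order than they were read. The bounded input queue keeps the reader from loading the whole archive into memory when the workers are slower than the reader.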

Error handling

A lot of things can go wrong when processing PDF files. In general, a record that produces an error is skipped, unless the error is unrecoverable (e.g. Parsr cannot be contacted, or a filesystem error occurs).
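The skip-or-stop policy can be illustrated as follows. This is a sketch under assumptions, not the script's implementation: `UnrecoverableError` and `convert_all` are hypothetical names, and it assumes that unrecoverable errors abort the run and that connection and filesystem failures surface as `OSError` (which `ConnectionError` is a subclass of).

```python
class UnrecoverableError(Exception):
    """Hypothetical marker for errors that should stop the whole run."""


def convert_all(records, convert):
    """Yield converted records, skipping those that fail in a recoverable way."""
    for record in records:
        try:
            yield convert(record)
        except (UnrecoverableError, OSError):
            # The service is unreachable or the filesystem failed:
            # carrying on would fail for every later record too.
            raise
        except Exception:
            # A problem with this one record only: skip it.
            continue
```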

To help debug the PDF processing (mainly done by pd3f), the --dump-errors path/to/folder option writes, for each unknown error, the exception traceback, the PDF, and the Parsr response (if there was one) to that folder, keyed by the WARC record ID.
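A dump of this kind might look like the sketch below. The function name `dump_error`, the file naming scheme, and the sanitising of the record ID are all assumptions made for illustration; the files the script really writes may be laid out differently.

```python
import re
import traceback
from pathlib import Path


def dump_error(folder, record_id, exc, pdf_bytes, parsr_response=None):
    """Write the traceback, the PDF, and any Parsr response to `folder`."""
    # Record IDs such as "<urn:uuid:...>" contain characters that are
    # awkward in file names, so replace anything unusual with "_".
    key = re.sub(r"[^A-Za-z0-9._-]+", "_", record_id).strip("_")
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)

    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    (target / f"{key}.traceback.txt").write_text(trace, encoding="utf-8")
    (target / f"{key}.pdf").write_bytes(pdf_bytes)
    if parsr_response is not None:  # Parsr may not have answered at all
        (target / f"{key}.parsr.json").write_bytes(parsr_response)
    return key
```

Keeping all files for one record under a shared key makes it easy to replay a single failing PDF outside the pipeline.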

Some errors are already caught proactively before the Parsr output is passed on to pd3f, mainly those caused by empty documents or documents without any text.
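A pre-check for documents without text could be sketched like this. The JSON shape is an assumption: the sketch treats the Parsr output as a dict with a "pages" list, where each page has "elements" and each element has a "content" field that is either a string or a list of nested elements. The names `has_text` and `is_empty_document` are hypothetical, and the checks the script really performs may differ.

```python
def has_text(node) -> bool:
    """Return True if any nested "content" field holds a non-blank string."""
    if isinstance(node, dict):
        content = node.get("content")
        if isinstance(content, str):
            return bool(content.strip())
        # Otherwise look into nested content and child elements, if present.
        return has_text(content) or has_text(node.get("elements"))
    if isinstance(node, list):
        return any(has_text(item) for item in node)
    return False  # None, numbers, and other leaf values carry no text


def is_empty_document(parsr_output: dict) -> bool:
    """Return True for output with no pages or with no text on any page."""
    pages = parsr_output.get("pages") or []
    return not any(has_text(page) for page in pages)
```

Records for which `is_empty_document` returns True can then be skipped before any further processing is attempted.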
