Queues like it's 1995!
BLOBPROC is a less kafkaesque version of the PDF postprocessing found in sandcrawler, which is part of the IA Scholar infrastructure. Specifically, it is designed to process and persist documents with a minimum of external components and little to no state.
BLOBPROC currently ships with two CLI programs:
- blobprocd exposes an HTTP server that receives binary data and stores it in a spool folder
- blobproc scans the spool folder, executes postprocessing tasks on each PDF and removes the file from the spool once a best-effort processing of the file is done; it is invoked periodically by a systemd timer (a sketch of such a timer follows this list)
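A minimal sketch of such a timer and service pair; the unit names, schedule and paths are assumptions, the units shipped with the OS package may look different:

# blobproc.timer (sketch, names and schedule are assumptions)
[Unit]
Description=Run blobproc over the spool folder periodically

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target

# blobproc.service (sketch)
[Unit]
Description=Process PDFs found in the blobproc spool folder

[Service]
Type=oneshot
ExecStart=/usr/bin/blobproc -spool /var/spool/blobproc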
In our case, PDF data may come from:
- a Heritrix crawl, via a ScriptedProcessor
- (wip) a WARC file, a crawl collection or similar
- in general, any process that can deposit a file in the spool folder or send an HTTP request to blobprocd
In our case blobproc will execute the following tasks (a rough sketch of a single pass over one file follows the list):
- send the PDF to GROBID and store the result in S3, using the grobidclient Go library
- generate text from the PDF via pdftotext and store the result in S3 (SeaweedFS)
- generate a thumbnail from the PDF via pdftoppm and store the result in S3 (SeaweedFS)
- find all web links in the PDF text and send them to a crawl API (wip)
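The following is only a rough sketch of one pass over a single file, assuming pdftotext and pdftoppm are installed; GROBID and S3 persistence are left out and none of the names are the actual blobproc API:

// Sketch only: derive plain text and a first-page thumbnail from a single
// PDF using external tools, similar in spirit to what blobproc does per file.
package main

import (
    "context"
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "time"
)

func processOne(ctx context.Context, filename string) (text, thumb []byte, err error) {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Minute) // cf. the -T flag
    defer cancel()
    // Plain text extraction; "-" sends the extracted text to stdout.
    text, err = exec.CommandContext(ctx, "pdftotext", filename, "-").Output()
    if err != nil {
        return nil, nil, fmt.Errorf("pdftotext: %w", err)
    }
    // Render the first page as PNG; with -singlefile the output is <root>.png.
    dir, err := os.MkdirTemp("", "thumb")
    if err != nil {
        return nil, nil, err
    }
    defer os.RemoveAll(dir)
    root := filepath.Join(dir, "page1")
    if err := exec.CommandContext(ctx, "pdftoppm", "-png", "-singlefile",
        "-f", "1", "-l", "1", filename, root).Run(); err != nil {
        return nil, nil, fmt.Errorf("pdftoppm: %w", err)
    }
    thumb, err = os.ReadFile(root + ".png")
    return text, thumb, err
}

func main() {
    text, thumb, err := processOne(context.Background(), os.Args[1])
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Printf("text: %d bytes, thumbnail: %d bytes\n", len(text), len(thumb))
}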
More tasks can be added by extending blobproc itself. A focus remains on simple deployment via an OS distribution package. By pushing various parts into library functions (or external packages like grobidclient), the main processing routine shrinks to about 100 lines of code (as of 08/2024). Currently, both blobproc and blobprocd run on a dual-core 2nd gen Xeon with 24GB of RAM; blobprocd has received up to 100 requests/s and wrote PDFs to a rotational disk.
Current throughput is about 5 PDFs/s. GROBID may be able to handle up to 10 PDFs/s per instance. To reprocess, say, 200M PDFs in less than a month, we would need about 10 GROBID instances.
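Back of the envelope: 200,000,000 PDFs in 30 days is about 200M / (30 × 86,400 s) ≈ 77 PDFs/s; at roughly 10 PDFs/s per instance that is around 8 GROBID instances, so 10 leaves some headroom.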
- receive blobs over HTTP, e.g. from Heritrix, curl or some backfill process
- regularly scan the spool directory and process the files found there
Server component.
$ blobprocd -h
Usage of blobprocd:
-T duration
server timeout (default 15s)
-access-log string
server access logfile, none if empty
-addr string
host port to listen on (default "0.0.0.0:8000")
-debug
switch to log level DEBUG
-log string
structured log output file, stderr if empty
-spool string
(default "/home/tir/.local/share/blobproc/spool")
-version
show version
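A possible invocation, using only the flags listed above and placeholder paths; the second line shows how a client might deposit a PDF, but the exact upload endpoint is not documented here and the URL path is an assumption:

$ blobprocd -addr 0.0.0.0:8000 -spool /var/spool/blobproc -access-log /var/log/blobprocd-access.log
$ curl --data-binary @file.pdf http://localhost:8000/ # upload path is a guess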
Processing command line tool.
$ blobproc -h
blobproc - process and persist PDF derivatives
Emit JSON with locally extracted data:
$ blobproc -f file.pdf | jq .
Flags
-P run processing in parallel (exp)
-T duration
subprocess timeout (default 5m0s)
-debug
more verbose output
-f string
process a single file (local tools only), for testing
-grobid-host string
grobid host, cf. https://is.gd/3wnssq (default "http://localhost:8070")
-grobid-max-filesize int
max file size to send to grobid in bytes (default 268435456)
-k keep files in spool after processing, mainly for debugging
-logfile string
structured log output file, stderr if empty
-s3-access-key string
S3 access key (default "minioadmin")
-s3-endpoint string
S3 endpoint (default "localhost:9000")
-s3-secret-key string
S3 secret key (default "minioadmin")
-spool string
(default "/home/tir/.local/share/blobproc/spool")
-version
show version
-w int
number of parallel workers (default 4)
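A spool processing run against an external GROBID and S3 endpoint might be started like this; hosts, credentials and paths are placeholders, and only flags listed above are used:

$ blobproc -spool /var/spool/blobproc -grobid-host http://localhost:8070 -s3-endpoint localhost:9000 -s3-access-key minioadmin -s3-secret-key minioadmin -w 8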
The initial, unoptimized version would process about 25 PDFs/minute, or 36K PDFs/day. We were able to crawl much faster than that, e.g. we reached 63GB of captured data (not all of it PDF) after about 4 hours. GROBID should be able to handle up to 10 PDFs/s.
A parallel walker could process about 300 PDFs/minute, which would match the inflow generated by one Heritrix crawl node.
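A minimal sketch of such a parallel walker, with a fixed worker pool (cf. the -w flag); the per-file step is a placeholder and this is not the blobproc implementation:

// Sketch of a parallel spool walker: one goroutine walks the directory,
// a fixed number of workers process the files; names are illustrative.
package main

import (
    "fmt"
    "io/fs"
    "os"
    "path/filepath"
    "sync"
)

func main() {
    spool := os.Args[1] // spool directory to walk
    paths := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ { // number of parallel workers
        wg.Add(1)
        go func() {
            defer wg.Done()
            for p := range paths {
                // Placeholder for the per-file work: pdftotext, pdftoppm,
                // GROBID, S3 uploads, then removal from the spool.
                fmt.Println("would process", p)
            }
        }()
    }
    err := filepath.WalkDir(spool, func(p string, d fs.DirEntry, werr error) error {
        if werr != nil || d.IsDir() {
            return werr
        }
        paths <- p
        return nil
    })
    close(paths)
    wg.Wait()
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}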
- tasks will run in parallel, e.g. text and thumbnail generation and GROBID all run in parallel, but we process files one at a time for now
- we should be able to configure a pool of GROBID hosts to send requests to (see the sketch after this list)
- point to a CDX file, a crawl collection or similar and have all PDF files sent to BLOBPROC, even if this may take days or weeks
- for each file placed into the spool, try to record the URL-SHA1 pair somewhere
- pluggable write backend for testing, e.g. just log what would happen
- log performance measures
- Grafana
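The GROBID host pool item could be as simple as round-robin selection over a list of base URLs; a sketch with hypothetical names, not part of blobproc today:

// Sketch: round-robin over a configurable list of GROBID base URLs,
// safe for concurrent use by multiple workers.
package main

import (
    "fmt"
    "sync/atomic"
)

type hostPool struct {
    hosts []string
    next  atomic.Uint64
}

// Pick returns the next host in round-robin order.
func (p *hostPool) Pick() string {
    i := p.next.Add(1)
    return p.hosts[int(i)%len(p.hosts)]
}

func main() {
    pool := &hostPool{hosts: []string{
        "http://localhost:8070",
        "http://grobid-2.example.internal:8070", // placeholder host
    }}
    for i := 0; i < 4; i++ {
        fmt.Println(pool.Pick())
    }
}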
This tool should cover most of the following areas from sandcrawler:
run_grobid_extract
run_pdf_extract
run_persist_grobid
run_persist_pdftext
run_persist_thumbnail
Including references workers.
Performance: processing 1605 PDFs, 1515 successful, at 2.23 docs/s when processed in parallel via fd ... -x, or about 200K docs per day.
real 11m0.767s
user 73m57.763s
sys 5m55.393s
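The exact fd invocation is elided above; a comparable pattern, with a placeholder directory, could look like the following (fd runs one job per core by default):

$ fd -e pdf . /path/to/pdfs -x blobproc -f {}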