PDF Trio

Git Repo: http://github.com/internetarchive/pdf_trio

Blog Post: https://towardsdatascience.com/making-a-production-classifier-ensemble-2d87fbf0f486

License: Apache-2.0

pdf_trio is a Machine Learning (ML) service that combines three distinct classifiers to predict whether a PDF document is a "research publication". It exposes an HTTP API that receives PDF files via POST and returns classification confidence scores. There is also an experimental endpoint for fast classification based on URL strings alone.

This project was developed at the Internet Archive as part of efforts to collect and provide perpetual access to research documents on the world wide web. Initial funding was provided by the Andrew W. Mellon Foundation under the "Ensuring the Persistent Access of Long Tail Open Access Journal Literature" project.

This system was originally designed, trained, and implemented by Paul Baclace. See CONTRIBUTORS.txt for details.

Quickstart with Docker Compose

These instructions describe how to run the pdf_trio API service locally using docker and pre-trained TensorFlow and fastText machine learning models. They assume you have docker installed locally, as well as basic command line tools (bash, wget, etc.).

Download the trained model files (about 1.6 GB, fetched from archive.org):

./fetch_models.sh

Run docker-compose: this command builds a docker image for the API from scratch and runs it. It also fetches and runs two tensorflow-serving back-end daemons. This requires several GB of RAM.

docker-compose up

You can then submit a PDF for classification, for example:

curl localhost:3939/classify/research-pub/all -F pdf_content=@tests/files/research/hidden-technical-debt-in-machine-learning-systems__2015.pdf
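The same request can be made from Python. Below is a minimal sketch using the requests library, assuming the service is running locally on port 3939 as in the docker-compose setup and treating the response as a JSON document of confidence scores:

import requests

# Hedged client sketch: POST a local PDF to the classifier API.
pdf_path = "tests/files/research/hidden-technical-debt-in-machine-learning-systems__2015.pdf"
with open(pdf_path, "rb") as f:
    resp = requests.post(
        "http://localhost:3939/classify/research-pub/all",
        files={"pdf_content": f},
    )
resp.raise_for_status()
print(resp.json())  # classification confidence scores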

To re-build the API docker image (e.g., if you make local code changes):

docker-compose up --build --force-recreate --no-deps api

Development

The default python dependency management system for this project is pipenv, though it is also possible to use conda (see directions later in this document).

To install dependencies on a Debian Buster Linux machine:

sudo apt install -y poppler-utils imagemagick libmagickcore-6.q16-6-extra ghostscript netpbm gsfonts wget
pip3 install pipenv
pipenv install --dev --deploy

Download model files:

./fetch_models.sh

Use the default local configuration:

cp example.env .env

Run just the tensorflow-serving back-end daemons using docker-compose like:

docker-compose up tfserving

Unit tests partially mock the back-end tensorflow-serving daemons, and any tests that do call these daemons are automatically skipped if the daemons are not available locally. To run the tests:

pipenv run python -m pytest
pipenv run pylint -E pdf_trio tests/*.py

# with coverage:
pipenv run pytest --cov --cov-report html

Background

The purpose of this project is to identify research works for richer cataloging in production at the Internet Archive. Research works are not always found in well-curated locations with good metadata that can be used to enrich indexing and search. Ongoing work at the Internet Archive will use this classifier ensemble to curate "long tail" research articles in multiple languages published by small publishers. Low-volume publishing is inversely correlated with longevity, so the goal is to preserve the research works from these sites to ensure they are not lost.

The performance target is to classify a PDF in about one second or less on average; this implementation achieves that goal when multiple parallel requests are made on a multi-core machine without a GPU.

The URL classifier is best used as a "true vs. unknown" signal: if the classification is non-research ('other'), find out more rather than assuming the document is not a research work. The motivation is to have a quick check that can be applied during crawling; a high confidence threshold is used to avoid false positives.
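As an illustration of that policy, here is a sketch using the fastText Python bindings directly against the URL model; the label name and threshold are assumptions and should be adjusted to match the actual trained model:

import fasttext

# Load the URL model (the path comes from the FT_URL_MODEL env var).
model = fasttext.load_model("pdf_trio_models/fastText_models/url_dataset20000_20190817.bin")

def looks_like_research(url, threshold=0.9):
    # predict() returns the top label and its probability.
    labels, probs = model.predict(url, k=1)
    # "__label__research" is a hypothetical label name. Only a high-confidence
    # research hit counts as True; anything else means "unknown", not "not research".
    return labels[0] == "__label__research" and probs[0] >= threshold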

Design

  • REST API based on python Flask
  • Deep Learning neural networks
    • Run with tensorflow serving for high throughput
    • CNN for image classification
    • BERT for text classification using a multilingual model
  • FastText linear classifier
    • Full text 'bag of words' classification for high-throughput
    • URL-based classification
  • PDF training data preparation scripts for each kind of sub-classifier

Two other repos are relied upon but not copied into this repo because they are useful standalone; this repo can be considered the 'roll-up' that integrates the ensemble.

This PDF classifier can be re-purposed for other binary cases simply by using different training data.

PDF Challenges

PDFs are challenging because they can:

  • be pure images of text, with no easily extractable text at all
  • range from 1-page position papers to 100+ page theses
  • have citations either at the end of the document or interspersed as footnotes

We decided to avoid using OCR for text extraction for speed reasons and because of the challenge of multiple languages.

We address these challenges with an ensemble of classifiers that use confidence values to cover all the cases. There are still some edge cases, but the incidence rate is at most a few percent.
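For illustration only (this is not the service's actual aggregation policy), here is a sketch of how per-classifier confidences might be reported together with a simple combined score:

# Hypothetical helper: the field names and the plain averaging rule are
# assumptions; the API itself returns the individual confidence scores.
def combine_confidences(scores):
    # scores, e.g.: {"image": 0.92, "bert": 0.88, "linear": 0.95}
    overall = sum(scores.values()) / len(scores)
    return {"components": scores, "ensemble_mean": overall}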

API Configuration and Deployment

The following env vars must be defined to run this API service:

  • FT_MODEL: full path to the fastText model for the linear (full text) classifier
  • FT_URL_MODEL: full path to the fastText model for the URL classifier
  • TF_IMAGE_SERVER_URL: base API URL for the image tensorflow-serving process
  • TF_BERT_SERVER_URL: base API URL for the BERT tensorflow-serving process
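For example, a .env file (compare example.env) might look like the following; the paths and ports are illustrative and must match your model download location and tensorflow-serving setup:

FT_MODEL=/srv/pdf_trio_models/fastText_models/dataset20000_20190818.bin
FT_URL_MODEL=/srv/pdf_trio_models/fastText_models/url_dataset20000_20190817.bin
TF_IMAGE_SERVER_URL=http://localhost:8501
TF_BERT_SERVER_URL=http://localhost:8601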

Backend Service Dependency Setup

These directions assume you are running in an Ubuntu Xenial (16.04 LTS) virtual machine.

sudo apt-get install -y poppler-utils imagemagick libmagickcore-6.q16-2-extra ghostscript netpbm gsfonts-other
conda create --name pdf_trio python=3.7 --file requirements.txt
conda activate pdf_trio

Edit /etc/ImageMagick/policy.xml to change:

<policy domain="coder" rights="none" pattern="PDF" />

To:

<policy domain="coder" rights="read" pattern="PDF" />

We expect ImageMagick 6.x; when 7.x is used, the binary is no longer called convert (it is magick).
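One way to apply the policy change non-interactively (assuming the policy file is at the path shown above; on some systems it is /etc/ImageMagick-6/policy.xml):

sudo sed -i 's/rights="none" pattern="PDF"/rights="read" pattern="PDF"/' /etc/ImageMagick/policy.xml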

Backend Components for Serving

  • fastText-python is used for hosting fastText models
  • tensorflow-serving for image and BERT inference

Tensorflow-Serving via Docker

TensorFlow Serving is used to host the neural network models, and we prefer its REST API because that significantly reduces complexity on the client side (see the sketch after the list below).

  • install on Ubuntu (the package is named docker.io to distinguish Docker from a preexisting package 'docker', a window manager extension):
    • sudo apt-get install docker.io
  • get a docker image from the docker hub (might need to be careful about v1.x to v2.x transition)
    • sudo docker pull tensorflow/serving:latest
    • NOTE: I am running docker at system level (not user), so sudo is needed for operations, YMMV
  • see what is running:
    • sudo docker ps
  • stop a running container, using the ID shown in the docker ps output:
    • sudo docker stop 955d531040b2
  • to start tensorflow-serving:
    • ./start_bert_serving.sh
    • ./start_image_classifier_serving.sh
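To illustrate why the REST interface keeps the client simple, a prediction is a single JSON POST to the serving daemon. The port, model name, and input shape below are assumptions and must match the actual serving configuration in the start scripts:

import requests

# Sketch of a tensorflow-serving REST prediction call against the image model.
# The base URL and model name are placeholders; the instance shape must match
# the exported model's input signature.
dummy_instance = [[[0.0, 0.0, 0.0]] * 224 for _ in range(224)]  # placeholder image tensor
payload = {"instances": [dummy_instance]}
resp = requests.post(
    "http://localhost:8501/v1/models/image_classifier:predict",
    json=payload,
)
resp.raise_for_status()
print(resp.json()["predictions"])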

Training and Models

These are covered in detail under data_prep:

  • fastText (needed for training)
  • bert variant repo for training
  • tf_image_classifier repo

Models

Sample models for research-pub classification are available at Internet Archive under https://archive.org/download/pdf_trio_models.

A handy Python package will fetch the files and directory structure (the layout is required by tensorflow-serving). You can also use curl and carefully recreate the directory structure by hand, of course. The full set of models is about 1.6 GB.
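If fetch_models.sh is not convenient, one alternative (assuming the "handy Python package" referred to is the internetarchive package) is its ia command-line tool, which preserves the item's directory layout:

pip3 install internetarchive
ia download pdf_trio_models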

Here is a summary of the model files, directories, and env vars to specify the paths:

  • BERT_MODEL_PATH (1.3GB): pdf_trio_models/bert_models, used by start_bert_serving.sh
  • IMAGE_MODEL_PATH (87MB): pdf_trio_models/pdf_image_classifier_model, used by start_image_classifier_serving.sh
  • FT_MODEL (600MB): pdf_trio_models/fastText_models/dataset20000_20190818.bin, used by start_api_service.sh
  • FT_URL_MODEL (202MB): pdf_trio_models/fastText_models/url_dataset20000_20190817.bin, used by start_api_service.sh

See Data Prep for details on preparing training data and how to train each classifier.
