Testing Python PDF libraries and making conclusions...
sample_data
.gitignore
README.md
cleaner.sh
libraries_tests.py
plot.py
requirements.txt
run.py
utils.py

Python-PDF libraries performance tests

Benchmarks for Python libraries that read PDF files. The tests currently cover:

  • getting the number of pages in each PDF.
  • ... some day ...
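As an illustration only, a page-count benchmark of this kind could be structured as in the sketch below. It is not the code from `libraries_tests.py`: the names `time_page_counts` and `naive_page_count` are hypothetical, and the byte-scanning counter is a rough stand-in for a real library call (it misses pages stored in compressed object streams). Any callable that takes a path and returns a page count could be passed in its place.

```python
import re
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple

# Matches "/Type /Page" but not "/Type /Pages" (the page-tree node).
PAGE_OBJECT = re.compile(rb"/Type\s*/Page\b")


def naive_page_count(path: Path) -> int:
    """Hypothetical stand-in counter: scan the raw bytes for page objects."""
    return len(PAGE_OBJECT.findall(Path(path).read_bytes()))


def time_page_counts(
    counter: Callable[[Path], int],
    paths: Iterable[Path],
) -> Dict[str, Tuple[int, float]]:
    """Run `counter` on every path and record (page count, seconds taken)."""
    results: Dict[str, Tuple[int, float]] = {}
    for path in paths:
        start = time.perf_counter()
        pages = counter(path)
        elapsed = time.perf_counter() - start
        results[str(path)] = (pages, elapsed)
    return results
```

Passing each library's page-count function through the same timing wrapper keeps the measurements comparable, since only the call itself falls inside the timed region.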

Current stable version: v1.0

Release date: 07.08.2019

Author:

Maciej Januszewski (maciek@mjanuszewski.pl)

Prerequisites:

  • First, start an Apache Tika server (required for the Tika tests):
docker pull logicalspark/docker-tikaserver
docker run -d -p 9998:9998 logicalspark/docker-tikaserver
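
Before starting the tests it may help to confirm that the server is reachable. The sketch below is one way to do that with the standard library only. It assumes the server listens on `localhost:9998`, as in the `docker run` command above, and uses Tika server's `/version` endpoint; the helper name `tika_is_up` is hypothetical and not part of this repository.

```python
import urllib.request


def tika_is_up(base_url: str = "http://localhost:9998", timeout: float = 3.0) -> bool:
    """Return True if a GET to <base_url>/version answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: connection refused,
        # timeout, DNS failure, or a non-2xx response all count as "not up".
        return False


if __name__ == "__main__":
    print("Tika server reachable:", tika_is_up())
```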

Sample PDF data:

https://drive.google.com/open?id=1Xb99gWgynHO02e2YvAyX0dsnfUmWJwJD

Running:

./run.py /path/to/pdfs_data/ > /dev/null 2>&1 # suppress printed output

Sample plot outputs:

- Final statistics - overall processing time:

https://maciekj.pl/media/plots/pdfs_performance_final_stats_bar.html (scatter plot generated by Plotly)

- Final statistics - bar chart:

https://maciekj.pl/media/plots/pdfs_performance_bar.html (box plot generated by Plotly)
