JSON serialization and deserialization in the Python standard library can be slow. Fortunately, several alternative implementations exist, but their performance depends on the type of data (size, complexity) being (de)serialized. This is why it's useful to benchmark the different libraries on realistic data and on the hardware where they will eventually run. As the results below show, `orjson` is the clear winner.
JSON schema validation is also slow, so looking for alternatives there is useful as well, but none of them seem satisfactory for now (see the validation section below).
We tested 5 JSON libraries on 3 datasets, for both deserialization (load) and serialization (dump).
Each of the 3 datasets contains 1000 records, selected from the top results of the following searches:
- most-cited: most cited records in INSPIRE.
- random: records with earliest date >= 2015.
- many-authors: records with more than 200 authors and earliest date >= 2015.
Note that the latter two datasets are effectively random, as no sort order was specified.
The datasets are too large to be included in this repo, but the `fetch_data.py` script to generate them is provided.
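As an illustration, fetching such datasets from the public INSPIRE REST API could look like the following sketch. The endpoint is real, but the query strings, pagination details, and output paths are assumptions rather than the repo's actual code:

```python
import json

import requests

# Illustrative sketch only: the real fetch_data.py may differ.
# The exact query syntax used below is an assumption.
API_URL = "https://inspirehep.net/api/literature"


def fetch_dataset(filename, query="", sort=None, total=1000, page_size=100):
    records = []
    page = 1
    while len(records) < total:
        params = {"q": query, "size": page_size, "page": page}
        if sort:
            params["sort"] = sort
        response = requests.get(API_URL, params=params)
        response.raise_for_status()
        hits = response.json()["hits"]["hits"]
        if not hits:  # no more results
            break
        records.extend(hits)
        page += 1
    with open(filename, "w") as f:
        json.dump(records[:total], f)


fetch_dataset("most-cited.json", sort="mostcited")
fetch_dataset("random.json", query="earliest_date >= 2015")
```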
We (de)serialized all records in each of these datasets 5 times with each library and aggregated the results (`tests.py` script). In order to run the benchmark on realistic hardware, we built Docker images (see `Dockerfile`) and created jobs to run it on our Kubernetes cluster (see `kubernetes.yml`). The results are then processed with `analyze_results.py` to generate the table and plot in the next subsection.
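In essence, the timing loop looks something like this sketch (the actual `tests.py` may differ in how it aggregates and stores results):

```python
import importlib
import time

LIBRARIES = ["json", "orjson", "ujson", "rapidjson", "simplejson"]
REPETITIONS = 5


def benchmark(records):
    """Time load/dump of all records with each library, 5 times each."""
    results = []
    for name in LIBRARIES:
        lib = importlib.import_module(name)
        # note: orjson.dumps returns bytes while the others return str,
        # but loads accepts both in all five libraries.
        serialized = [lib.dumps(record) for record in records]
        for _ in range(REPETITIONS):
            start = time.perf_counter()
            for record in records:
                lib.dumps(record)
            results.append((name, "dump", time.perf_counter() - start))
            start = time.perf_counter()
            for blob in serialized:
                lib.loads(blob)
            results.append((name, "load", time.perf_counter() - start))
    return results
```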
Dataset | Library | Operation | Average time for full dataset (s) |
---|---|---|---|
random | json | load | 2.57 |
random | orjson | load | 1.68 |
random | ujson | load | 2.27 |
random | rapidjson | load | 2.75 |
random | simplejson | load | 2.13 |
many-authors | json | load | 76.37 |
many-authors | orjson | load | 52.81 |
many-authors | ujson | load | 73.79 |
many-authors | rapidjson | load | 87.52 |
many-authors | simplejson | load | 63.69 |
most-cited | json | load | 5.24 |
most-cited | orjson | load | 3.56 |
most-cited | ujson | load | 5.01 |
most-cited | rapidjson | load | 5.93 |
most-cited | simplejson | load | 4.38 |
random | json | dump | 3.18 |
random | orjson | dump | 0.66 |
random | ujson | dump | 1.83 |
random | rapidjson | dump | 1.87 |
random | simplejson | dump | 4.13 |
many-authors | json | dump | 90.98 |
many-authors | orjson | dump | 18.10 |
many-authors | ujson | dump | 52.41 |
many-authors | rapidjson | dump | 50.74 |
many-authors | simplejson | dump | 113.31 |
most-cited | json | dump | 6.39 |
most-cited | orjson | dump | 1.24 |
most-cited | ujson | dump | 3.65 |
most-cited | rapidjson | dump | 3.47 |
most-cited | simplejson | dump | 8.05 |
The green bar is the average, the orange bar is the median. The boxes go from first to third quartile, and the whiskers from minimum to maximum.
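Box plots with these properties can be generated with matplotlib along these lines (a sketch of the plotting step in `analyze_results.py`, not necessarily its exact code):

```python
import matplotlib.pyplot as plt


def plot_timings(timings, filename="results.png"):
    """Box plot of {library name: list of measured times}."""
    fig, ax = plt.subplots()
    ax.boxplot(
        list(timings.values()),
        labels=list(timings.keys()),
        whis=(0, 100),  # whiskers span the full min-max range
        showmeans=True,
        meanline=True,
        meanprops={"color": "green"},  # green line: average
        medianprops={"color": "orange"},  # orange line: median
    )
    ax.set_ylabel("time (s)")
    fig.savefig(filename)
```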
`orjson` is the clear winner. Its performance when serializing large records is particularly impressive.
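One caveat when switching: `orjson` deviates slightly from the standard library API, so it is not a pure drop-in replacement. For example:

```python
import json

import orjson

record = {"control_number": 12345, "titles": [{"title": "example"}]}

# orjson.dumps returns bytes, whereas json.dumps returns str,
# so callers that expect a str need an explicit decode.
assert isinstance(orjson.dumps(record), bytes)
assert isinstance(json.dumps(record), str)
text = orjson.dumps(record).decode("utf-8")

# orjson.loads accepts bytes as well as str.
assert orjson.loads(text) == record
```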
We're currently using `jsonschema` for validation. We looked at two other libraries:
`fastjsonschema` is a pure-Python implementation which precompiles the schema into a validator, so it is in principle much faster than the library we're currently using. While this is true for small records, where we found a 10x performance increase, a very large record resulted in very similar running times. Investigating further, we found a critical bug preventing us from using it right now.
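Its precompilation API looks like this (with a toy schema, not an actual INSPIRE one):

```python
import fastjsonschema

# compile() generates a specialized validation function once; calling it
# repeatedly is then much cheaper than re-interpreting the schema.
schema = {
    "type": "object",
    "properties": {"titles": {"type": "array"}},
    "required": ["titles"],
}
validate = fastjsonschema.compile(schema)

validate({"titles": []})  # passes, returns the validated document
try:
    validate({})  # missing required key
except fastjsonschema.JsonSchemaException as exc:
    print(exc.message)
```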
`jsonschema-rs` is a Rust implementation which posts very impressive benchmarks. However, it doesn't work for us currently, as it doesn't allow custom `format` validators. We should look at it again when it's more mature.
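For context, this is the kind of custom `format` validator we rely on with the current `jsonschema` library (the format name here is a hypothetical example):

```python
from jsonschema import Draft4Validator, FormatChecker

checker = FormatChecker()


@checker.checks("orcid")  # hypothetical custom format, for illustration
def check_orcid(value):
    return isinstance(value, str) and len(value) == 19


schema = {"type": "string", "format": "orcid"}
validator = Draft4Validator(schema, format_checker=checker)
print(list(validator.iter_errors("not an orcid")))
```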