These statistics serve as an example how to use
(one of the utilities provided by the project gemmi)
to quickly extract data from mmCIF files.
The data is extracted from a local copy of the PDB archive in mmCIF format (we rsync it once a week).
gemmi-grep in the
extract.sh script to extract all the needed
metadata (and some more).
Reading all the archive (34GB of compressed files) takes 20-30min.
The output is redirected to a file (
grep.out) that is later used
to prepare JSON files used in our web pages.
./process.py >data.json makes a concise JSON file that includes only:
- entries deposited since 2008 (aribitrary cut-off),
- obtained using X-ray crystallography,
- in case of group depositions (PDB has now only a dozen of such groups) we take one entry from each group.
The web app itself is contained in a single file (index.html). It depends on three external libraries: dc.js, d3.js and crossfilter.
Here is an interactive demo (each plot can be used as a filter):
Synchrotron work patterns
./coldates.py >calendar.json writes the data used in calendar.html:
It uses a separate C++ program that needs to be compiled:
$ g++-7 -O3 -I../gemmi/include -I../gemmi/third_party -o tagstat tagstat.cpp -lz
Then it can be used to update the data:
$ ./tagstat components.cif > ccd-tags.tsv $ ./tagstat $PDB_DIR/structures/divided/mmCIF > mmcif-tags.tsv $ ./tagstat $PDB_DIR/structures/divided/structure_factors > sf-tags.tsv $ sed -i s"/ on 20..-..-../ on $(date -Idate)/" tags.html
$ curl -O ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif.gz $ ./resistat.py $PDB_DIR/structures/divided/mmCIF > residues.json