This is a test project on handling PDBx/mmCIF files.
The mmCIF format, based on CIF, itself based on STAR, is rather complex to parse using Linux command-line tools.
Using CCTBX it's easy to build a CIF parser in C++. So I did that to produce what I call the pCIF format (or parsable CIF). It has simple parsing rules: each type of data begins with a certain character at the beginning of the line.
| Line begins with | For |
|---|---|
> |
data segment start |
#_ |
table (loop) start |
\t (tab character) |
table data (yes, there's an extra, empty column) |
^ |
save frame start (not used in mmCIF) |
$ |
save frame end (not used in mmCIF) |
Data in tables is tab-delimited. Values that had newlines in mmCIF are stripped of them, and any sequence of whitespace in mmCIF is converted to a single space character (' ') here.
This converts input CIF format files into pCIF format. Either by running:
cat file.cif | pcif
or by running:
pcif file.cif
Each script has a usage comment inside. Most of the scripts use awk (some specifically
reference mawk for speed).
The tabview program is highly recommended for viewing tab-delimited files (pCIF, for
example). See https://github.com/firecat53/tabview for details.
The scripts are written in and for Linux. They should also run fine on Windows, given that
you have soe Bourne-like shell available. I use Git for Windows, which has bash included.
You would have to replace references to mawk with awk, though.
See the LICENSE file for binding license information. Copyright (C) 2015 Yuval Sedan