1015 binary cif support #1040

papillot · 2024-04-27T20:49:44Z

This PR adds support for parsing Binary CIF files and changes the RCSB data source provider to use this new format instead of the deprecated MMTF format.

Changes made

Added the molstar library as a dependency to use the CIF/Binary CIF parsing code from this project. Thanks to tree-shaking, only the import tree needed by the CIF reader is included, but this still causes a noticeable increase in the overall bundle size.
The code for reading Binary CIF is the same as the code used for reading CIF files. The only difference is the streamer invocation.
Added code to parse "Updated mmCIF files" from PDBe. These contain connectivity information, which makes it possible to get the proper bond orders.
Added PDBe as a new datasource. Data can be loaded from PDBe using the pseudo-protocol pdbe://4hhb, which downloads an uncompressed Binary CIF file with the full connectivity.
The nglviewer will download PDB files from PDBe by default, unless an AlphaFold code is used (I couldn't find a way to download a Binary CIF file from PDBe for AlphaFold structures).
The Jest test runner has been replaced with Vitest. This was due to a problem with Jest when parsing the cif-parser file. The import from molstar is not an import of bundled code but of an ES module, which is not supported natively by Node and requires a transformer. However, Jest does not transform files from node_modules. Despite trying various approaches, I could not make it work and resorted to Vitest, which worked out of the box.

Fixes

Secondary structure in files from AlphaFold is now handled
PDB codes are no longer limited to 4 characters when importing from RCSB, and AlphaFold codes can be used as well

Comments

Small benchmark, using PDB entry 5z6y (a relatively small structure, GFP):

| Format | Provider | Download size |
|--------|----------|---------------|
| mmtf   | RCSB     | 17.2kb*       |
| bcif   | PDBe     | 182kb         |
| bcif   | RCSB     | 33.4kb*       |

(*) RCSB response is gzipped

Despite the claim that bcif achieves better compression, there are still some caveats, and generally speaking, forcing the transition from mmtf to bcif creates regressions (as well as some improvements for specific use cases where the extra data content is relevant).


fredludlow · 2024-05-03T12:35:16Z

Thanks again - will have a proper look asap. Just to note, some of our examples are already failing due to this, e.g.
https://nglviewer.org/ngl/?script=showcase/viruses

EDIT: To clarify, failing because they can't grab the mmtf file, not because of changes in this PR!

fredludlow · 2024-05-03T13:34:50Z

Hmm, 3nap seems to be causing me some problems (might be a horror-show example as it's a virus) in the symmetry processing

papillot · 2024-05-03T15:52:38Z

I did not implement the alphaCarbondsOnly flag, so this might be it.
I'll look more closely at the issue with the mmtf file. The code allowed downloading backbone-only structures. I haven't looked at whether the same exists for bcif files.

papillot · 2024-05-04T18:31:04Z

Fixed:

(it seems the bug was already there in the previous cif parser)

fredludlow · 2024-05-20T16:49:34Z

Okay - looks good, I've gone through all the parser examples and they all work except for:

parser/map (which breaks on parsing 4UJD.cif.gz - I've tested with recent versions of this entry in case it was edited in the repo or something, but no luck)
parser/validation (parsing the 3PQR.cif from the data directory)

I can dig some more into these but thought I'd flag first as it might be something simple when you're familiar with the code.

fredludlow · 2024-05-20T16:51:18Z

Apologies, first one that wasn't working is parser/map (edited above, previously said ccp4, but adding this comment in case you're following by email too)

fredludlow · 2024-05-20T16:52:54Z

Oh, and you should make yourself the author for pdbe-datasource.ts!

papillot · 2024-05-21T19:37:13Z

Good catch @fredludlow !

The issue with the 4UJD.cif.gz file was due to how compression is handled. The streamer returns an ArrayBuffer, which needs to be converted to a string to be processed by the CIF parsing library.
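A minimal sketch of that conversion, assuming the decompressed payload is UTF-8 text and that the standard `TextDecoder` API is available. The helper name `toText` is hypothetical and is not taken from the PR:

```typescript
// Hypothetical helper: normalise streamer output so that a text-based
// CIF parser always receives a string.
function toText (data: string | ArrayBuffer | Uint8Array): string {
  if (typeof data === 'string') return data
  // TextDecoder accepts both ArrayBuffer and typed-array views.
  return new TextDecoder('utf-8').decode(data)
}

// Build an ArrayBuffer holding a tiny CIF-like text, as a decompressed
// .cif.gz payload might look after the streamer has inflated it.
const encoded = new TextEncoder().encode('data_demo\n_entry.id demo\n')
const buffer = new ArrayBuffer(encoded.byteLength)
new Uint8Array(buffer).set(encoded)

const text = toText(buffer)
```

Strings pass through untouched, so the same call can sit in front of the parser whether or not the file was compressed.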

The second one is a bug in the handling of altlocs (in fact, they were not processed correctly).

Both were pretty major issues. Maybe we should add more tests to better cover this code?

fredludlow · 2024-05-22T22:51:03Z

Can confirm both those are now working for me.

I've got a local PDB mirror and am running a script to try NGL.autoLoad on every mmCIF formatted entry - if this works there may still be other classes of bug, but it would definitely be reassuring.

Happy for you to merge this in the meantime (and thank you again!)

fredludlow · 2024-05-22T22:55:49Z

Hmm, 7a4p is causing issues

papillot · 2024-05-22T23:23:08Z

That's a tricky file: one of the chains (identifier U, entity id 20) is missing from the coordinates block. It is reported elsewhere as a 3 aa chain.
So that's a missing null check I think.

fredludlow · 2024-05-23T06:33:38Z

7a4p was the only one that threw an error / rejected the promise. There were approx 250 entries where the spacegroup was either undefined or one that isn't recognized (P b c a, P 21 21 2 A, P 1 21/n 1, P n n a, C 4 21 2 and F 4 2 2 - putting these here in case this comes up in the future), but I don't think that's related to the parser.

For reference, script is here: https://gist.github.com/fredludlow/e0a2a4af29d902350c872162315538d1

ppillot · 2024-05-24T15:38:17Z

Thanks @fredludlow that's so useful!

mmCIF files use the struct_conf table to define the alpha helices, whereas the sheets are defined in the struct_sheet_range table. AlphaFold ModelCIF files contain every DSSP assignment in the struct_conf table, using DSSP mmCIF codes (such as `TURN_TY1_P1`).
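As an illustration of reading every assignment from a single table, here is a rough sketch that sorts struct_conf codes into coarse classes by their prefix. The function name, the returned labels and the prefix rule are assumptions made for this example; the parser in this PR may classify the codes differently:

```typescript
type SecondaryStructure = 'helix' | 'sheet' | 'turn' | 'other'

// Hypothetical classifier: looks only at the prefix of the code, so both
// type ids (HELX_RH_AL_P) and numbered ids (TURN_TY1_P1) are accepted.
function classifyConfCode (code: string): SecondaryStructure {
  const upper = code.toUpperCase()
  if (upper.indexOf('HELX') === 0) return 'helix'
  if (upper.indexOf('STRN') === 0) return 'sheet'
  if (upper.indexOf('TURN') === 0) return 'turn'
  return 'other' // e.g. BEND, or anything this sketch does not know about
}

const classes = ['HELX_RH_AL_P', 'STRN', 'TURN_TY1_P1', 'BEND'].map(classifyConfCode)
```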

This table is available in "Updated mmCIF files" distributed by PDBe. In this commit, the list of bonds defined for each residue is stored in a new dictionary in a ChemCompMap object. Bonds in the mmCIF file are defined using atom names (e.g. CA), which need to be converted into indices in the atomList of a given residue type. The atomList contains a list of indices of AtomTypes from the structure AtomMap. Given an atom index from the atomList, the AtomMap.get(idx) method returns an AtomType object that contains the atomname property.
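The name-to-index conversion can be sketched as follows. This is a simplified stand-in: the residue type is reduced to a plain array of atom names instead of going through the atomList and AtomMap, and all type and function names here are hypothetical:

```typescript
interface NamedBond { atom1: string, atom2: string, order: number }
interface IndexedBond { index1: number, index2: number, order: number }

// Convert bonds expressed with atom names into bonds expressed with the
// position of each atom inside one residue type.
function resolveBonds (atomNames: string[], bonds: NamedBond[]): IndexedBond[] {
  const indexByName: Record<string, number> = Object.create(null)
  atomNames.forEach((name, i) => { indexByName[name] = i })

  const resolved: IndexedBond[] = []
  for (const bond of bonds) {
    const index1 = indexByName[bond.atom1]
    const index2 = indexByName[bond.atom2]
    // Skip bonds to atoms that are not present in this residue
    // (for instance a terminal OXT that was not modelled).
    if (index1 === undefined || index2 === undefined) continue
    resolved.push({ index1, index2, order: bond.order })
  }
  return resolved
}

// Backbone atoms of a glycine residue and the bonds a component
// dictionary might list for them (illustrative values).
const glyBonds = resolveBonds(['N', 'CA', 'C', 'O'], [
  { atom1: 'N', atom2: 'CA', order: 1 },
  { atom1: 'CA', atom2: 'C', order: 1 },
  { atom1: 'C', atom2: 'O', order: 2 },
  { atom1: 'C', atom2: 'OXT', order: 1 }
])
```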

Previously, the default was to use the mmtf format served by RCSB. This format contained the full connectivity, which is currently missing from the bcif files distributed by RCSB. PDBe distributes "Updated" mmCIF files containing this data. The same content is available in their bcif files.

Jest cannot import code from ES modules, which is the case for the modules from MolStar (not bundled). Jest fails with a message suggesting to tweak the Jest config using the transformIgnorePatterns property. After much trial and research, I was not able to make it work and decided to switch the test runner to Vitest, which solved the issue.

panda-byte · 2024-06-04T14:18:53Z

I'm not sure if the data source is set up properly for this. The following doesn't work for me (on a development server):

new Stage(...).loadFile('rcsb://5z6y');

This tries to access http://models.rcsb.org/5z6y.bcif.gz, but that returns a 301 Moved Permanently, referring to the new location https://models.rcsb.org/5z6y.bcif.gz (using HTTPS!), which works, when used explicitly in the code. On the other hand, shouldn't the API described by RCSB be utilized instead? However, there seems to be no option for compression. The https://models.rcsb.org/5z6y.bcif.gz API seems to be the best option after all, even though it doesn't offer any options (I think that's just the download link on their website, right?).

papillot · 2024-06-04T15:25:11Z

> I'm not sure if the data source is set up properly for this. The following doesn't work for me (on a development server):
>
> new Stage(...).loadFile('rcsb://5z6y');
>
> This tries to access http://models.rcsb.org/5z6y.bcif.gz, but that returns a 301 Moved Permanently, referring to the new location https://models.rcsb.org/5z6y.bcif.gz (using HTTPS!), which works, when used explicitly in the code. On the other hand, shouldn't the API described by RCSB be utilized instead? However, there seems to be no option for compression. The https://models.rcsb.org/5z6y.bcif.gz API seems to be the best option after all, even though it doesn't offer any options (I think that's just the download link on their website, right?).

The protocol part (http:// vs https://) comes from the current location (i.e. the server that serves the current page, here your development server). We should make this always https then (I think it's already the case for PDBe).
Regarding the compression, are you referring to compression headers? Not sure they are necessary as the file is already transmitted as a compressed stream?
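To make the difference concrete, here is a small sketch contrasting a URL whose protocol is inherited from the page with one that always uses https. Both function names are hypothetical and this is not the datasource code itself; only the host and file suffix come from the URLs quoted above:

```typescript
const RCSB_MODELS_HOST = 'models.rcsb.org'

// Behaviour as described above: the protocol follows the page that hosts
// the viewer, so a dev server on http:// produces an http:// request.
function inheritedProtocolUrl (pageProtocol: string, id: string): string {
  const protocol = pageProtocol === 'https:' ? 'https:' : 'http:'
  return `${protocol}//${RCSB_MODELS_HOST}/${id}.bcif.gz`
}

// Proposed behaviour: always request over https, whatever serves the page.
function httpsOnlyUrl (id: string): string {
  return `https://${RCSB_MODELS_HOST}/${id}.bcif.gz`
}

const fromDevServer = inheritedProtocolUrl('http:', '5z6y')
const fixedUrl = httpsOnlyUrl('5z6y')
```

In a browser, the first argument would typically come from `window.location.protocol`.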

panda-byte · 2024-06-05T15:14:58Z

Oh, I see. I think that would be good.

Regarding the compression: I was referring to the RCSB model API, which offers several endpoints to make requests to. I was wondering if instead of using links like https://models.rcsb.org/5z6y.bcif.gz directly, maybe this API should be queried instead (like the full endpoint), as it seems to be the "official", documented way. But it apparently doesn't offer any compression options, so maybe using the direct download link is still better.

papillot · 2024-06-05T16:16:01Z

So, I've just checked and the https://models.rcsb.org endpoint does a good job with sending the data stream compressed using the HTTP headers:

Here is the download size compared with the resource size

I'll make a fix for the https vs http

ppillot · 2024-06-23T19:57:10Z

> Oh, I see. I think that would be good.
>
> Regarding the compression: I was referring to the RCSB model API, which offers several endpoints to make requests to. I was wondering if instead of using links like https://models.rcsb.org/5z6y.bcif.gz directly, maybe this API should be queried instead (like the full endpoint), as it seems to be the "official", documented way. But it apparently doesn't offer any compression options, so maybe using the direct download link is still better.

@panda-byte #1043 has been merged and published as v2.3.1 with the http/https fix for rcsb

papillot marked this pull request as draft April 27, 2024 21:04

papillot force-pushed the 1015-Binary-Cif-support branch from d93a5e3 to 2c96719 Compare April 28, 2024 21:03

papillot marked this pull request as ready for review April 28, 2024 21:19

fredludlow approved these changes May 22, 2024

papillot added 12 commits May 24, 2024 11:40

Types inference improvements: Structure in StructureParser

27af64d

Add molstar as a dependency (importing mol-io)

12b3a39

Allow async parsing methods

06f9638

The CIF reader from Mol* is wrapped in async calls, which requires making the _parse function async in the binary cif parser.

Parsing single model binary cif from RCSB

5c8512c

Implement core and chem_comp schema

d22fde3

Handle asTrajectory and firstModelOnly flags

0a98d4c

Update RCSB datasource to default to binary cif files

562bcee

Add PDBe server as datasource

5142a68

remove cif-parser.ts file

7b4520c

papillot added 17 commits May 24, 2024 11:42

Rename bcif-parser to cif-parser

e753c61

type Structure.extraData.cif

913bc68

update alphafold datasource suffix

6d5036f

ebi.ac.uk does not support requests made on http:// protocol

0be4bdd

fix import

36ea60f

fix: origx property is malformed

38136c1

fix: parse connections is broken

b17f3de

fix default export must be CifParser (backwards compatibility)

3445e65

remove unused code

c3acca1

fix: float(i) returns 0 if value is missing

e2b9ec0

valueKind has 3 values: 0 if present, 1 if not present ('.' in Cif), 2 if unknown ('?' in Cif)
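A minimal sketch of how the value kind can guard a numeric read. The `FloatField` interface below is a simplified stand-in written for this example, not the actual field type from the CIF library, and `floatOrUndefined` is a hypothetical helper:

```typescript
// The three kinds listed in the commit message.
const ValueKind = { Present: 0, NotPresent: 1, Unknown: 2 } as const

// Simplified stand-in for a CIF column that can be read as floats.
interface FloatField {
  float (row: number): number
  valueKind (row: number): number
}

// Return undefined instead of a misleading 0 when the value is '.' or '?'.
function floatOrUndefined (field: FloatField, row: number): number | undefined {
  return field.valueKind(row) === ValueKind.Present ? field.float(row) : undefined
}

// Toy field backed by raw CIF tokens, mimicking "float() returns 0 if missing".
function fieldFromTokens (tokens: string[]): FloatField {
  return {
    valueKind: row =>
      tokens[row] === '.' ? ValueKind.NotPresent
        : tokens[row] === '?' ? ValueKind.Unknown
          : ValueKind.Present,
    float: row => {
      const value = parseFloat(tokens[row])
      return isNaN(value) ? 0 : value
    }
  }
}

const occupancy = fieldFromTokens(['1.5', '.', '?', '0'])
```

With this guard, a genuine `0` in the file (row 3) stays distinguishable from a missing value (rows 1 and 2).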

4CWU superseded by 6CGV

72e1106

https://www.rcsb.org/structure/removed/4CWU

fix: operator expression with parentheses breaks parser

0bca727

When splitting `(1,2,6,10,23,24)` against `(`, the first item is an empty string. The fix consists in filtering-out falsy values from the split array.
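The behaviour is easy to reproduce in isolation. The sketch below is a simplified, hypothetical version of such a split: it only handles comma-separated ids, not ranges, and is not the parser code from the commit:

```typescript
// '(1,2,6,10,23,24)'.split('(') yields ['', '1,2,6,10,23,24)']:
// the empty string before the leading parenthesis has to be dropped.
function parseOperatorGroups (expression: string): string[][] {
  return expression
    .split('(')
    .filter(Boolean) // remove empty items produced by the split
    .map(group => group.replace(')', '').split(','))
}

const singleGroup = parseOperatorGroups('(1,2,6,10,23,24)')
const twoGroups = parseOperatorGroups('(X0)(1,2)')
```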

fix: compressed cif files are parsed as binary

1974924

fix: alt-loc is always empty

f1e6471

The CIF library returns `0` when a string column is converted to an int array. The fix here is to map the string array from the column using the String.charCodeAt() function.
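A sketch of that mapping, assuming that rows without an alternate location come back from the column as empty strings (this depends on the library and is an assumption here). The function name is hypothetical:

```typescript
// Store each alt-loc label as the character code of its first character,
// keeping 0 for rows that have no alternate location.
function altLocCodes (altLocs: string[]): Uint8Array {
  const codes = new Uint8Array(altLocs.length)
  altLocs.forEach((label, i) => {
    codes[i] = label.length > 0 ? label.charCodeAt(0) : 0
  })
  return codes
}

const codes = altLocCodes(['', 'A', 'B'])
const labels = Array.from(codes).map(c => (c === 0 ? '' : String.fromCharCode(c)))
```

`String.fromCharCode` reverses the mapping when the label is needed again.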

update file header

ac675d2

fix: some entities might be defined but not have any coordinates

cc34e1d

In that case the `chainIndexDict` does not have the corresponding key. This fix still creates the corresponding `Entity` but with an empty chain list.
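The idea can be sketched with simplified types. `EntityRow`, `EntityRecord` and `buildEntities` are hypothetical names and the data is illustrative; only the fallback to an empty chain list mirrors the fix described above:

```typescript
interface EntityRow { id: string, description: string }
interface EntityRecord extends EntityRow { chainIndexList: number[] }

// chainIndexDict maps an entity id to the indices of its chains. An entity
// that has no coordinates never gets a key, so the lookup falls back to [].
function buildEntities (
  rows: EntityRow[],
  chainIndexDict: Record<string, number[]>
): EntityRecord[] {
  return rows.map(row => ({
    id: row.id,
    description: row.description,
    chainIndexList: chainIndexDict[row.id] || []
  }))
}

const entities = buildEntities(
  [
    { id: '1', description: 'entity with coordinates' },
    { id: '20', description: 'entity declared without coordinates' }
  ],
  { 1: [0, 1] }
)
```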

papillot force-pushed the 1015-Binary-Cif-support branch from 63dbf30 to cc34e1d Compare May 24, 2024 15:51

ppillot merged commit 3d7c96a into nglviewer:master May 24, 2024
2 checks passed

ppillot mentioned this pull request May 24, 2024

RCSB is deprecating MMTF file format and will stop serve it in July 2024 #1015

Closed
