documenting stats and fasta, see facebookresearch#366 and facebookres…
tomsercu authored and harryhaemin committed Feb 24, 2023
1 parent 13e8290 commit 1c33793
Showing 1 changed file with 5 additions and 2 deletions.
7 changes: 5 additions & 2 deletions scripts/atlas/README.md
@@ -13,16 +13,19 @@ The high quality structures are around 1TB in size.

The full database is available as PDB structures and is 15TB in size.

We also provide a metadata dataframe [stats.parquet](https://dl.fbaipublicfiles.com/esmatlas/v0/stats.parquet) loadable via pandas: `df = pd.read_parquet('stats.parquet')`
The dataframe has `617051007` rows; the file is 6.0GB and has md5 hash `3948a44562b6bd4c184167465eec17de`.
This dataframe has 5 columns:
- `id` is the MGnify ID
- `ptm` is the predicted TM score
- `plddt` is the predicted average lddt
- `num_conf` is the number of residues with plddt > 0.7
- `len` is the total residues in the protein
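For illustration, here is a minimal sketch of working with the stats dataframe in pandas. The IDs and values below are synthetic (the real file is ~6GB), and the `frac_conf` column and filter thresholds are assumptions, not part of the official release:

```python
import pandas as pd

# In practice you would load the real file (~6GB), optionally restricting columns:
#   df = pd.read_parquet("stats.parquet", columns=["id", "ptm", "plddt", "num_conf", "len"])
# Here we build a tiny synthetic frame with the same schema to illustrate.
df = pd.DataFrame({
    "id":       ["MGYP000000000001", "MGYP000000000002", "MGYP000000000003"],  # made-up IDs
    "ptm":      [0.81, 0.42, 0.93],
    "plddt":    [0.885, 0.55, 0.912],
    "num_conf": [190, 40, 300],
    "len":      [200, 150, 310],
})

# Fraction of confidently predicted residues (num_conf counts residues
# with pLDDT > 0.7, per the column descriptions above).
df["frac_conf"] = df["num_conf"] / df["len"]

# Example filter: keep structures with high predicted TM score and
# mostly-confident residues. Thresholds here are arbitrary.
high_quality = df[(df["ptm"] > 0.7) & (df["frac_conf"] > 0.9)]
print(high_quality["id"].tolist())
```

On the real dataframe the same filter expression applies unchanged; loading only the columns you need keeps memory usage down.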

In parallel with `stats.parquet`, the sequences can be downloaded as a FASTA file: [atlas.fasta](https://dl.fbaipublicfiles.com/esmatlas/v0/full/atlas.fasta).
The FASTA file has `617051007` records matching the stats file, is 114GB in size, and has md5 hash `dc45f4383536c93f9d871facac7cca93`.
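Since the FASTA file is 114GB, any record-level check should stream it rather than load it into memory. A minimal sketch (not official tooling; the sample records below are invented for illustration):

```python
import io

def count_fasta_records(handle):
    """Count FASTA records by streaming line-by-line.

    Each record begins with a '>' header line, so counting headers
    counts records without holding the file in memory.
    """
    return sum(1 for line in handle if line.startswith(">"))

# Tiny in-memory example; real usage would be:
#   with open("atlas.fasta") as f:
#       n = count_fasta_records(f)   # should equal 617051007
sample = io.StringIO(">MGYP01\nMKV\n>MGYP02\nGGS\nAAT\n")
print(count_fasta_records(sample))  # 2
```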

We recommend using `s5cmd` or `aria2c` to download files (installable via anaconda).
We will provide a list of paths to facilitate downloading.
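One way this could look in practice, sketched with standard `aria2c` and `md5sum` usage (the file names and flags are illustrative assumptions; the actual download commands are given below):

```shell
# Sketch: build an input list for aria2c from the two files documented above.
cat > atlas_urls.txt <<'EOF'
https://dl.fbaipublicfiles.com/esmatlas/v0/stats.parquet
https://dl.fbaipublicfiles.com/esmatlas/v0/full/atlas.fasta
EOF

# aria2c -i atlas_urls.txt -x 8 -s 8   # parallel download (requires aria2c)

# After downloading, verify against the md5 hashes listed above:
cat > atlas_md5.txt <<'EOF'
3948a44562b6bd4c184167465eec17de  stats.parquet
dc45f4383536c93f9d871facac7cca93  atlas.fasta
EOF
# md5sum -c atlas_md5.txt   # uncomment once both files are present
```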

**To download any of the structures provided, please use this `aria2c` command**
