documenting stats and fasta, see facebookresearch#366 and facebookres…
tomsercu authored and harryhaemin committed Feb 24, 2023
1 parent 13e8290 commit 1c33793
Showing 1 changed file with 5 additions and 2 deletions.
7 changes: 5 additions & 2 deletions scripts/atlas/README.md
@@ -13,16 +13,19 @@ The high quality structures are around 1TB in size.

The full database is available as PDB structures and is 15TB in size.

We also provide a metadata dataframe [stats.parquet](https://dl.fbaipublicfiles.com/esmatlas/v0/stats.parquet) loadable via pandas: `df = pd.read_parquet('stats.parquet')`
The dataframe has `617051007` rows; the file is 6.0GB and has md5 hash `3948a44562b6bd4c184167465eec17de`.
This dataframe has 5 columns:
- `id` is the MGnify ID
- `ptm` is the predicted TM score
- `plddt` is the predicted average lddt
- `num_conf` is the number of residues with plddt > 0.7
- `len` is the total residues in the protein
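For illustration, here is a minimal sketch of working with the stats dataframe in pandas. The IDs and values below are synthetic (the real file is ~6GB), and the `frac_conf` column and filter thresholds are assumptions, not part of the official release:

```python
import pandas as pd

# In practice you would load the real file (~6GB), optionally restricting columns:
#   df = pd.read_parquet("stats.parquet", columns=["id", "ptm", "plddt", "num_conf", "len"])
# Here we build a tiny synthetic frame with the same schema to illustrate.
df = pd.DataFrame({
    "id":       ["MGYP000000000001", "MGYP000000000002", "MGYP000000000003"],  # made-up IDs
    "ptm":      [0.81, 0.42, 0.93],
    "plddt":    [0.885, 0.55, 0.912],
    "num_conf": [190, 40, 300],
    "len":      [200, 150, 310],
})

# Fraction of confidently predicted residues (num_conf counts residues
# with pLDDT > 0.7, per the column descriptions above).
df["frac_conf"] = df["num_conf"] / df["len"]

# Example filter: keep structures with high predicted TM score and
# mostly-confident residues. Thresholds here are arbitrary.
high_quality = df[(df["ptm"] > 0.7) & (df["frac_conf"] > 0.9)]
print(high_quality["id"].tolist())
```

On the real dataframe the same filter expression applies unchanged; loading only the columns you need keeps memory usage down.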

In parallel with `stats.parquet`, the sequences can be downloaded as a FASTA file: [atlas.fasta](https://dl.fbaipublicfiles.com/esmatlas/v0/full/atlas.fasta).
The FASTA file has `617051007` records matching the stats file, is 114GB in size, and has md5 hash `dc45f4383536c93f9d871facac7cca93`.
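Since the FASTA file is 114GB, any record-level check should stream it rather than load it into memory. A minimal sketch (not official tooling; the sample records below are invented for illustration):

```python
import io

def count_fasta_records(handle):
    """Count FASTA records by streaming line-by-line.

    Each record begins with a '>' header line, so counting headers
    counts records without holding the file in memory.
    """
    return sum(1 for line in handle if line.startswith(">"))

# Tiny in-memory example; real usage would be:
#   with open("atlas.fasta") as f:
#       n = count_fasta_records(f)   # should equal 617051007
sample = io.StringIO(">MGYP01\nMKV\n>MGYP02\nGGS\nAAT\n")
print(count_fasta_records(sample))  # 2
```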

We recommend using `s5cmd` or `aria2c` to download files (installable via anaconda).
We will provide a list of paths to facilitate downloading.
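One way this could look in practice, sketched with standard `aria2c` and `md5sum` usage (the file names and flags are illustrative assumptions; the actual download commands are given below):

```shell
# Sketch: build an input list for aria2c from the two files documented above.
cat > atlas_urls.txt <<'EOF'
https://dl.fbaipublicfiles.com/esmatlas/v0/stats.parquet
https://dl.fbaipublicfiles.com/esmatlas/v0/full/atlas.fasta
EOF

# aria2c -i atlas_urls.txt -x 8 -s 8   # parallel download (requires aria2c)

# After downloading, verify against the md5 hashes listed above:
cat > atlas_md5.txt <<'EOF'
3948a44562b6bd4c184167465eec17de  stats.parquet
dc45f4383536c93f9d871facac7cca93  atlas.fasta
EOF
# md5sum -c atlas_md5.txt   # uncomment once both files are present
```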

**To download any of the structures provided, please use this `aria2c` command**
