
Building RNA3DB from scratch

Marcell Szikszai edited this page Mar 6, 2024 · 1 revision

If you wish to build your own version of RNA3DB from scratch, follow the steps below carefully.

Installation

If you want to install and reproduce RNA3DB, simply run:

$ git clone https://github.com/marcellszi/rna3db.git && cd rna3db
$ python -m pip install -e .

We recommend Python 3.10.* for running RNA3DB.

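The following non-Python dependencies are also required to reproduce all steps: Infernal (which provides cmscan, used in the homology search step) and MMseqs2 (used in the clustering step).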
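Before starting a long run, it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of RNA3DB: it assumes the executables are named `cmscan` (Infernal) and `mmseqs` (MMseqs2), the tools referenced later on this page, and that they are expected on your `PATH`. The function name `find_missing_tools` is hypothetical.

```python
import shutil

# Assumed executable names: Infernal's cmscan and MMseqs2's mmseqs.
REQUIRED_TOOLS = ("cmscan", "mmseqs")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools were found.")
```

If MMseqs2 is installed somewhere outside your `PATH`, the clustering step accepts an explicit path instead (see the --mmseqs_binary_path flag below).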

Downloading required data

First, you must download the mmCIF files to scan for RNAs. We scan all crystal structures in the PDB for RNA chains for our data set.

To make this simple, we provide scripts/download_pdb_mmcif.sh, which downloads all crystal structures in the PDB. You can run it via:

$ scripts/download_pdb_mmcif.sh data/cif/

Note: This step requires downloading and extracting 343GB+ of gzip files.

Before parsing mmCIF files, a modifications_cache.json must be generated. This file contains conversions from three_letter_code to one_letter_code for modified residues. To generate it, download the latest version of components.cif from the Chemical Component Dictionary, then run:

$ scripts/generate_modifications_cache.py path/to/components.cif data/modifications_cache.json
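To illustrate the kind of mapping this cache holds, here is a minimal sketch that reads `_chem_comp.three_letter_code` and `_chem_comp.one_letter_code` from CCD-style key/value lines. It is not the implementation of scripts/generate_modifications_cache.py: the function name `build_modifications_map` and the inline sample text are hypothetical, the line-based parsing ignores most of the mmCIF syntax, and the actual layout of modifications_cache.json may differ.

```python
def build_modifications_map(cif_text):
    """Map three-letter codes to one-letter codes from CCD-style key/value lines."""
    mapping = {}
    three = one = None
    for raw_line in cif_text.splitlines():
        parts = raw_line.split()
        if not parts:
            continue
        if parts[0].startswith("data_"):
            # A new component block starts; forget the previous values.
            three = one = None
        elif parts[0] == "_chem_comp.three_letter_code" and len(parts) == 2:
            three = parts[1]
        elif parts[0] == "_chem_comp.one_letter_code" and len(parts) == 2:
            one = parts[1]
        # "?" marks a missing value in mmCIF, so such components are skipped.
        if three and one and one != "?":
            mapping[three] = one
    return mapping


# Illustrative sample only; real components.cif blocks contain many more fields.
SAMPLE = """\
data_PSU
_chem_comp.id                 PSU
_chem_comp.one_letter_code    U
_chem_comp.three_letter_code  PSU
data_HOH
_chem_comp.id                 HOH
_chem_comp.one_letter_code    ?
_chem_comp.three_letter_code  HOH
"""

print(build_modifications_map(SAMPLE))
```

In this sample only the first component has a one-letter code, so only that entry ends up in the mapping.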

Extracting RNAs from PDB

The next step is to parse the mmCIF files and extract all RNA chains.

$ python -m rna3db parse data/pdb_mmcif/mmcif_files output/parse.json

Homology search

As the next step, we perform homology search on all RNAs in the PDB. However, if you don't need this information for all chains, you can significantly reduce the number of searched sequences (thereby speeding up the search) by filtering out redundant, short, or low-resolution sequences first. See the Filtering step below.

We must generate a FASTA file containing the RNAs we want to scan. We provide a convenient script to do this via:

$ scripts/json_to_fasta.py output/parse.json output/parse.fasta
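Conceptually, this conversion writes one FASTA record per chain. The sketch below shows the idea under a loud assumption: it treats the parsed JSON as a flat dictionary from chain ID to an object with a `sequence` field. That layout, the example chain ID, and the function name `chains_to_fasta` are hypothetical, so check your own parse.json before relying on it.

```python
def chains_to_fasta(chains):
    """Render {chain_id: {"sequence": ...}} as FASTA text, one record per chain."""
    return "".join(
        f">{chain_id}\n{info['sequence']}\n" for chain_id, info in chains.items()
    )


# Hypothetical example entries; the real parse.json may be structured differently.
example = {
    "1abc_A": {"sequence": "GGAUCC"},
    "2xyz_B": {"sequence": "ACGU"},
}

print(chains_to_fasta(example), end="")
```

Using the chain ID as the FASTA header keeps it easy to match hits in the cmscan output back to individual chains.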

Next, we must use Infernal's cmscan to do homology search. First, we need to download the latest version of Rfam's covariance models. Then we can simply run:

$ cmscan -o output/cmscans/parse.o --tbl output/cmscans/parse.tbl data/Rfam.cm output/parse.fasta

To attempt to find hits for those chains without significant hits, we rescan them with the --max flag. First, generate a FASTA of chains without hits in parse.tbl:

$ scripts/get_nohits.py output/parse.json output/nohits.fasta output/cmscans/parse.tbl
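The idea behind this step can be sketched as follows. In cmscan's tabular output, comment lines start with `#` and each hit row lists the target model first and the query sequence name in the third whitespace-separated column. The sketch treats any row as a hit and does not apply a significance threshold; it is not the implementation of scripts/get_nohits.py, and the function names and the sample row are hypothetical.

```python
def queries_with_hits(tbl_text):
    """Collect query names (third column) from cmscan tabular output."""
    hits = set()
    for line in tbl_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) >= 3:
            hits.add(fields[2])
    return hits


def chains_without_hits(chain_ids, tbl_text):
    """Return the chain IDs that never appear as a query in the table."""
    hits = queries_with_hits(tbl_text)
    return [chain_id for chain_id in chain_ids if chain_id not in hits]


# Hypothetical table content with a single hit row.
SAMPLE_TBL = """\
#target name  accession query name  accession mdl mdl from mdl to seq from seq to strand trunc pass gc   bias score E-value inc description of target
tRNA          RF00005   1abc_A      -         cm  1        71     1        76     +      no    1    0.56 0.0  71.5  1.1e-15 !   tRNA
#
"""

print(chains_without_hits(["1abc_A", "2xyz_B"], SAMPLE_TBL))
```

The chains returned by such a check are the ones that would be written to the FASTA file used for the second, more sensitive scan.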

Then run cmscan again (this time with the --max flag):

$ cmscan --max -o output/cmscans/nohits.o --tbl output/cmscans/nohits.tbl data/Rfam.cm output/nohits.fasta

Note: Both of these scans are prohibitively slow to run on most consumer-grade hardware. We recommend running this step on a compute cluster. The latest homology search took 110 hours on a single Intel Xeon Platinum 8358 processor with 32 cores.

Filtering

This step can be performed before the homology search if you do not need homology search results for the chains that are filtered out. In either case, to filter a set of raw chains, you can run:

$ python -m rna3db filter output/parse.json output/filter.json
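As a rough illustration of what filtering out redundant, short, or low-resolution chains means, consider the sketch below. It is not RNA3DB's filter: the thresholds `min_length` and `max_resolution` are hypothetical placeholder values, the `sequence` and `resolution` fields are an assumed layout, and redundancy is approximated here by exact sequence identity.

```python
def filter_chains(chains, min_length=10, max_resolution=4.0):
    """Keep chains that are long enough, well resolved, and not exact duplicates."""
    kept = {}
    seen_sequences = set()
    for chain_id, info in chains.items():
        sequence = info["sequence"]
        if len(sequence) < min_length:
            continue  # too short
        if info["resolution"] > max_resolution:
            continue  # larger value in angstroms means lower resolution
        if sequence in seen_sequences:
            continue  # exact duplicate of a chain that was already kept
        seen_sequences.add(sequence)
        kept[chain_id] = info
    return kept


# Hypothetical input entries.
raw = {
    "a": {"sequence": "GGGAAACCCU", "resolution": 2.0},
    "b": {"sequence": "GGGAAACCCU", "resolution": 2.5},    # duplicate of "a"
    "c": {"sequence": "GGA", "resolution": 1.5},           # too short
    "d": {"sequence": "ACGUACGUACGU", "resolution": 8.0},  # low resolution
}

print(list(filter_chains(raw)))
```

Fewer chains after filtering means fewer sequences to scan, which is why running this step first can shorten the homology search.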

Clustering

Note: This step requires the .tbl files produced by Infernal's cmscan. See the Homology search step above.

Once you have completed the filtering step, you must cluster by both sequence and structure. To do this, you can run,

$ python -m rna3db cluster output/filter.json output/cluster.json --tbl_dir output/cmscans/

where output/cmscans is your folder that contains all .tbl files used for homology. If RNA3DB cannot find MMseqs2, you may need to manually provide a path to the binary with the --mmseqs_binary_path flag.

Split

The final step is to create a training/testing split. This can be done by running,

$ python -m rna3db split output/cluster.json output/split.json
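The point of splitting after clustering is that related chains stay together, so no cluster contributes to both sets. The sketch below shows one simple way to do that; it is not RNA3DB's split algorithm. The function name `split_by_cluster`, the `train_ratio` parameter, the output keys, and the cluster layout are all hypothetical.

```python
def split_by_cluster(clusters, train_ratio=0.7):
    """Assign whole clusters to train or test, largest clusters first."""
    total = sum(len(members) for members in clusters.values())
    target = train_ratio * total
    train, test = {}, {}
    n_train = 0
    for name in sorted(clusters, key=lambda k: len(clusters[k]), reverse=True):
        members = clusters[name]
        if n_train + len(members) <= target:
            train[name] = members
            n_train += len(members)
        else:
            test[name] = members
    return {"train": train, "test": test}


# Hypothetical clusters of chain IDs.
example_clusters = {
    "cluster_1": ["a", "b", "c"],
    "cluster_2": ["d", "e"],
    "cluster_3": ["f"],
}

result = split_by_cluster(example_clusters, train_ratio=0.5)
print(sorted(result["train"]), sorted(result["test"]))
```

Because clusters are assigned as units, the achieved ratio only approximates `train_ratio`, but no chain's cluster is ever shared between the two sets.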