rnascan is a (mostly) Python suite for scanning RNA sequences and secondary structures with sequence and secondary structure position frequency matrices (PFMs). Secondary structure is represented as weights over different secondary structure contexts, much as a PFM represents weights over nucleotides or amino acids. This lets secondary structures be scanned in the same way that PFMs are used to scan nucleotide sequences, and it allows for some flexibility in the structure, as found in the Boltzmann distribution of secondary structures.
The secondary structure alphabet is as follows:
- B - bulge loop
- E - external (unpaired) RNA
- H - hairpin loop
- L - left paired RNA (i.e., a '(' in dot-bracket format)
- M - multiloop
- R - right paired RNA (i.e., a ')' in dot-bracket format)
- T - internal loop
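To make the alphabet concrete, here is a minimal sketch that labels each position of a dot-bracket string with one of the seven letters. The function name `annotate_contexts` and the loop-classification rules (a loop with no inner helix is a hairpin, one inner helix with unpaired bases on one side is a bulge and on both sides an internal loop, two or more inner helices is a multiloop) are assumptions based on the usual loop definitions, not code taken from rnascan.

```python
def annotate_contexts(dot_bracket):
    """Map a dot-bracket string to a string over the alphabet B, E, H, L, M, R, T."""
    n = len(dot_bracket)
    partner = [-1] * n  # partner[i] = index paired with i, or -1 if unpaired
    stack = []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])

    # Paired positions are L or R; unpaired positions default to external (E).
    labels = ["E"] * n
    for i in range(n):
        if partner[i] > i:
            labels[i] = "L"
        elif partner[i] != -1:
            labels[i] = "R"

    # Visit the loop closed by each base pair (i, j) and label its unpaired positions.
    for i in range(n):
        j = partner[i]
        if j <= i:
            continue  # unpaired position or closing bracket
        unpaired, inner_pairs = [], []
        k = i + 1
        while k < j:
            if partner[k] == -1:
                unpaired.append(k)
                k += 1
            else:
                inner_pairs.append((k, partner[k]))
                k = partner[k] + 1  # skip over the enclosed helix
        if not inner_pairs:
            context = "H"  # hairpin loop
        elif len(inner_pairs) == 1:
            k, l = inner_pairs[0]
            # Unpaired bases on both sides: internal loop; on one side: bulge.
            context = "T" if (k > i + 1 and l < j - 1) else "B"
        else:
            context = "M"  # multiloop
        for pos in unpaired:
            labels[pos] = context
    return "".join(labels)


hairpin = annotate_contexts("..((...))..")
```

A single structure gives one letter per position. The averaged profiles used by rnascan instead carry a weight for every letter at every position.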
The rnascan suite consists of two tools:
- run_folding: Calculate an average structural context profile of an RNA sequence by folding overlapping 100 nt subsequences and averaging across them.
- rnascan: Scan RNA sequences and secondary structures with sequence and secondary structure PFMs.
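The averaging step described for run_folding can be illustrated with a small sketch. It assumes that each folded window yields one row of seven weights (in the order B, E, H, L, M, R, T) per position, and that positions covered by several windows take the mean of those rows. The function name `average_profiles` and the input layout are hypothetical, and the way run_folding actually chooses window offsets is not specified here.

```python
ALPHABET = "BEHLMRT"


def average_profiles(seq_len, windows):
    """Average per-window structure profiles into one profile for the full sequence.

    windows is a list of (start, profile) pairs, where profile is a list of
    rows (one per window position) holding one weight per letter of ALPHABET.
    """
    totals = [[0.0] * len(ALPHABET) for _ in range(seq_len)]
    coverage = [0] * seq_len  # number of windows covering each position
    for start, profile in windows:
        for offset, row in enumerate(profile):
            pos = start + offset
            coverage[pos] += 1
            for col, weight in enumerate(row):
                totals[pos][col] += weight
    averaged = []
    for pos in range(seq_len):
        if coverage[pos] == 0:
            averaged.append([0.0] * len(ALPHABET))  # position not covered by any window
        else:
            averaged.append([t / coverage[pos] for t in totals[pos]])
    return averaged


# Toy example: two overlapping windows of width 2 on a sequence of length 3.
all_bulge = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
all_external = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
profile = average_profiles(3, [(0, [all_bulge, all_bulge]),
                               (1, [all_external, all_external])])
```

In the toy example the middle position is covered by both windows, so its weights are split evenly between B and E.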
Read the following steps to install rnascan. If you do not plan on using the
run_folding tool to fold sequences, you may skip the steps with an asterisk (*).
1. Install ViennaRNA*
To predict secondary structures, the program RNAfold from the ViennaRNA package is used. Please follow the installation instructions on the ViennaRNA website.
2. Download rnascan source
git clone git@github.com:morrislab/rnascan.git
cd rnascan
3. Compile secondary structure parser C++ script*
The compiled binary must be saved in a location where it can be executed (i.e., one that is listed in your PATH environment variable). Here, we use the user's local ~/bin directory:
g++ -o ~/bin/parse_secondary_structure scripts/parse_secondary_structure.cpp
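Before running run_folding, it can help to confirm that both external programs are reachable through PATH. The following is a small sketch using the standard library's `shutil.which` (Python 3.3+). The helper name `missing_tools` is hypothetical and not part of rnascan.

```python
import shutil


def missing_tools(names):
    """Return the subset of program names that cannot be found in PATH."""
    return [name for name in names if shutil.which(name) is None]


# RNAfold comes from ViennaRNA; parse_secondary_structure is the binary compiled above.
missing = missing_tools(["RNAfold", "parse_secondary_structure"])
if missing:
    print("Not found in PATH: " + ", ".join(missing))
else:
    print("All external tools found.")
```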
4. Install rnascan Python components
This package requires Python 2.7+ or Python 3.5+. To install the package, run the following:
python setup.py install
# alternatively, for user-specific installation:
python setup.py install --user
Dependencies (pandas, numpy, and biopython) will be automatically downloaded and installed, if necessary.
For full documentation of options, refer to the help message of each command using the -h option.
run_folding sequences.fasta /path/to/output_dir
The second argument, /path/to/output_dir, is the directory where the average structure profiles will be saved. One file is written per FASTA record.
Scanning can be performed in four modes:
- Sequence only (using -p to specify the sequence PFM)
- Structure only (using -q to specify the structure PFM)
- Sequence and structure (using both -p and -q)
- Sequence and averaged structure (using both -p and -q, with a directory of averaged structure profiles)
Here are some example commands using minimal options:
# To run a test sequence
rnascan -p pfm_seq.txt -t AGTTCCGGTCCGGCAGAGATCGCG > hits.tab

# Sequence-only (use -p)
rnascan -p pfm_seq.txt sequences.fasta > hits.tab

# Structure-only (use -q)
rnascan -q pfm_struct.txt structures.fasta > hits.tab

# Sequence and structure
rnascan -p pfm_seq.txt -q pfm_struct.txt sequences.fasta structures.fasta > hits.tab

# Sequence and averaged structure
rnascan -p pfm_seq.txt -q pfm_struct.txt sequences.fasta averaged_structures/ > hits.tab
Note that in the last example, the second positional argument is the path to a
directory containing the average structure profiles generated by run_folding.
rnascan will look inside the directory and automatically search for the
structure profile files.
To print the score at every position, set the threshold to -inf using the
-m option. To change the number of processing cores, use the -c option:
rnascan -p pfm_seq.txt -q pfm_struct.txt -m ' -inf' -c 8 sequences.fasta averaged_structures/ > hits.tab
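To illustrate why a threshold of -inf reports every position, here is a minimal sketch of PFM scanning with log-odds scores against a background. The scoring formula, the function name `scan`, and the toy matrix are assumptions chosen for illustration; rnascan's own scoring and default threshold may differ.

```python
import math


def scan(sequence, pfm, background, threshold):
    """Return (start, score) for every window whose log-odds score exceeds threshold.

    pfm is a list of dicts (one per motif position) mapping symbol -> probability.
    All probabilities are assumed to be strictly positive.
    """
    width = len(pfm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = 0.0
        for offset, column in enumerate(pfm):
            symbol = sequence[start + offset]
            score += math.log2(column[symbol] / background[symbol])
        if score > threshold:
            hits.append((start, score))
    return hits


# Toy two-column motif preferring "AC", scored against a uniform background.
toy_pfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "U": 0.1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}

strong_hits = scan("GACU", toy_pfm, uniform, threshold=0.0)
all_positions = scan("GACU", toy_pfm, uniform, threshold=float("-inf"))
```

With a threshold of 0.0 only the window starting at the "AC" match is kept, while `float("-inf")` keeps one score per window because every finite score is greater than negative infinity.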
Computing background probabilities
rnascan computes the background probabilities from the input
sequences at the beginning of the run. To apply a uniform
background instead, use the -u option:
rnascan -p pfm_seq.txt -u sequences.fasta > hits.tab
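The difference between the two kinds of background can be sketched as follows, assuming the background is simply the relative frequency of each symbol across all input sequences. The function names and the "ACGU" alphabet are illustrative assumptions rather than rnascan's internals.

```python
from collections import Counter


def background_probabilities(sequences, alphabet="ACGU"):
    """Estimate symbol frequencies across all sequences (symbols outside alphabet are ignored)."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no symbols from the alphabet were found")
    return {ch: counts[ch] / float(total) for ch in alphabet}


def uniform_background(alphabet="ACGU"):
    """Give every symbol the same probability."""
    return {ch: 1.0 / len(alphabet) for ch in alphabet}


estimated = background_probabilities(["AACG", "UU"])
uniform = uniform_background()
```

The same idea applies to structure input by passing the structure alphabet (for example "BEHLMRT") instead of the nucleotide alphabet.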
To compute the background probabilities of a set of input sequences and save them
for future use, use the --bgonly option:
rnascan -p pfm_seq.txt --bgonly sequences.fasta > background.txt
rnascan -q pfm_struct.txt --bgonly structures.fasta > background.txt
In this mode, rnascan computes the background probabilities, writes them to standard output (in the form of a Python dictionary), and exits; no scanning is performed. To re-use a saved background later, pass the background file with the --bg_seq option (for a sequence background) or the --bg_struct option (for a structure background):
rnascan -p pfm_seq.txt --bg_seq background.txt sequences.fasta > hits.tab
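Because the saved background is printed as a Python dictionary, it can also be inspected from your own scripts. The sketch below assumes the file holds a single dictionary literal mapping symbols to probabilities. The exact layout written by rnascan is not shown here, so treat the example text and the helper name `parse_background` as hypothetical.

```python
import ast


def parse_background(text):
    """Parse a background saved as a Python dictionary literal."""
    # ast.literal_eval only evaluates literals, so it is safer than eval().
    background = ast.literal_eval(text.strip())
    if not isinstance(background, dict):
        raise ValueError("expected a dictionary literal")
    return background


# Hypothetical file contents; in practice, read them from background.txt.
example_text = "{'A': 0.3, 'C': 0.2, 'G': 0.2, 'U': 0.3}\n"
background = parse_background(example_text)
```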
Cook, K.B., Vembu, S., Ha, K.C.H., Zheng, H., Laverty, K.U., Hughes, T.R., Ray, D., Morris, Q.D., 2017. RNAcompete-S: Combined RNA sequence/structure preferences for RNA binding proteins derived from a single-step in vitro selection. Methods 126, 18–28. http://www.sciencedirect.com/science/article/pii/S1046202317300312