Workflow
Preprocessing
Unless otherwise stated, functions come from src/preprocessing/preprocess.py.
-
Extract seq-name and chain-ID from source extracts
- Input:
- Website extract files (e.g. data/user/input/prosite_extract.txt)
- Output:
data/internal/pname_cid_map.pkl
- Description:
- Place the prosite or ioncom extract in /data/user/input/.
- Run parse_extracts(source, filename) in src/preprocessing/preprocess.py, specifying the source (ioncom or prosite) and the name of the extract file. This extracts the sequence names and chain IDs to be processed.
-
(Optional) Download relevant .pdb files from the RCSB server
- Input:
data/internal/pname_cid_map.pkl
- Internet connection
- Output:
- Populated data/internal/pdb_files/
- Description:
Downloads the corresponding .pdb files from the RCSB server, then deletes entries in pname_cid_map whose .pdb files are not in the folder.
- Run download_pdb()
- Run trim_pnames_based_on_pdb()
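The download-and-trim step above can be sketched as follows. This is only an illustration, not the project's implementation: the helper names pdb_url and trim_to_downloaded are hypothetical, and the sketch assumes that pname_cid_map is a dict mapping a PDB ID to a chain ID and that each file is stored as <pdb_id>.pdb. The URL pattern is the one listed under Data files.

```python
from pathlib import Path

# URL pattern taken from the "Data files" notes; everything else is assumed.
RCSB_VIEW_URL = "https://files.rcsb.org/view/{pdb_id}.pdb"


def pdb_url(pdb_id):
    """Build the download URL for one PDB ID (hypothetical helper)."""
    return RCSB_VIEW_URL.format(pdb_id=pdb_id)


def trim_to_downloaded(pname_cid_map, pdb_dir):
    """Keep only entries whose .pdb file exists in pdb_dir.

    Assumes pname_cid_map is a dict of PDB ID -> chain ID and that files
    are named <pdb_id>.pdb; IDs are compared case-insensitively.
    """
    available = {path.stem.lower() for path in Path(pdb_dir).glob("*.pdb")}
    return {
        name: cid
        for name, cid in pname_cid_map.items()
        if name.lower() in available
    }
```

Trimming after the download, rather than failing on the first missing file, lets later steps run on whatever subset of structures was retrieved.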
-
Create sequence .fasta file
- Input:
data/internal/pname_cid_map.pkl
- Output:
data/internal/seqs.fasta
- Description:
The motif-finding binaries require the sequences to be in a .fasta file.
- Run create_seq()
-
Filter short sequences
- Input:
data/internal/seqs.fasta
- Populated data/internal/pdb_files/
- Output:
- Updated data/internal/seqs.fasta
- Description:
Sequences shorter than the desired motif length (30 residues) can cause errors during the motif search, so they need to be dropped.
- Run filter_seq_file()
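As a rough illustration of the filtering step above, the sketch below drops FASTA records shorter than 30 residues. It is not the code behind filter_seq_file(): the function names read_fasta, filter_short and to_fasta are hypothetical, and the sketch works on plain FASTA text only, without consulting the .pdb files.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records, min_len=30):
    """Keep records with at least min_len residues (30 = motif length)."""
    return [(header, seq) for header, seq in records if len(seq) >= min_len]


def to_fasta(records):
    """Serialise (header, sequence) pairs back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

A typical use would read data/internal/seqs.fasta, pass the records through filter_short, and write the result of to_fasta back to the same path.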
-
(Optional) Create seed sequence file for converge
- Input:
data/user/input/ioncom_binding_sites.txt
- Output:
data/internal/seed_seqs.fasta
- Description:
The motif-finding binary converge requires seed sequences from which it generates its initial set of motifs.
- Place the ioncom binding-site file in /data/user/input/.
- Run make() in src/preprocessing/make_conv_seed_seqs.py.
-
Run motif-search binary to find motif positions
-
Input:
data/internal/seqs.fasta
- (Optional) Populated data/internal/pdb_files/
- (Optional) Provided motif file (e.g. data/user/input/meme.txt)
- (Optional) data/internal/seed_seqs.fasta
-
Output:
data/internal/motif_pos.pkl
-
Description:
This finds the positions of the desired motif for each sequence-chain. There are three implemented ways of running this locally:
- Motifs can be derived from scratch using meme. This generates both the motif file and the motif positions. Run find(process='meme', num_p=<num_processors>) in src/preprocessing/motif_finder.py.
- Motifs can be found using a given motif file. First, put the motif file (in MEME format) in data/user/input/<filename>. Then, run find(process='mast', motif_fname=<filename>, num_p=<num_processors>) in src/preprocessing/motif_finder.py.
- Motifs can be derived from scratch using converge, which also provides the motif file and positions. Run make(input_fname=<filename>, num_p=<num_processors>) in src/preprocessing/make_conv_seed_seqs.py.
Because the motif-finding process has a long run-time, it is recommended to run this step on a server. Instructions for doing so are in [1] below.
-
Descriptor Generation
-
Calculate descriptor properties
- Input:
data/internal/motif_pos.pkl
- Populated data/internal/pdb_files/
- Output:
data/internal/descrs.pkl
- Description:
This calculates the descriptor properties for each motif. Run calculate() in src/descr/descr_main.py.
-
Visualise properties
- Input:
data/internal/descrs.pkl
- Output:
- (Optional) data/user/output/
- Description:
Plots for different descriptor properties can be generated via src/utils/plots.py. Run each plot_<something>(save=False) as needed, and set save=True to keep the generated plots in the output folder.
Tests
-
Generate Reference Output
/tests/src/setup_ref.py
-
Visualise Reference Output
/tests/src/plot_ref.py
-
Checks against reference output
/tests/src/test_motif_finder.py
/tests/src/test_descr_main.py
Data files
/data
/tmp: Created during runtime. Should be deleted at the end of a run, except when debugging. Does not get deleted for tests that fail.
/input
/ioncom
allsulfate.txt: Raw sequence-binding_site match, for mg, in dataIonCom.zip, downloaded from https://zhanglab.ccmb.med.umich.edu/IonCom/ >> download dataset used to...
ioncom.txt: allid_reso3.0_len50_nr40.txt in dataIonCom, shows the list of sequences. (To be deprecated eventually.)
/mg_full
mg_50.fasta: From uniprot, uniref50 for seqs with MG as co-factor/ligand.
mg_100.fasta: uniref100 for MG cofactor seqs.
/pdb_files: Stored .pdb files. Both tests and main should use this folder, since downloading takes a while. Automatically downloaded from the RCSB server, via the link https://files.rcsb.org/view/{1ABC}.pdb
/prosite
prosite_extract.txt: Copy-pasted from the html (inspect source code) of the prosite website (https://prosite.expasy.org/cgi-bin/pdb/pdb_structure_list.cgi?src=PS00018).
/internal
fasta_template.fasta: Used for running mast when we only want the seqlogo and do not actually care about matching motifs.
meme.txt: Motif file for Calcium EF-hand.
Linter
pylint, mostly following the Google style guide, with some additional checks disabled.
TODO:
- pdb_list from prosite needs to be extracted too...?