Descriptor of Elementary Function

Workflow

Preprocessing

Unless otherwise stated, functions come from src/preprocessing/preprocess.py.

  1. Extract seq-name and chain-ID from source extracts

    • Input:
      1. Website extract files (e.g. data/user/input/prosite_extract.txt)
    • Output:
      1. data/internal/pname_cid_map.pkl
    • Description:
      1. Place the prosite or ioncom extract in data/user/input/.
      2. Run parse_extracts(source, filename) in src/preprocessing/preprocess.py, specifying the source (ioncom or prosite) and the name of the extract file. This extracts the sequence names and chain IDs to be processed.
  2. (Optional) Download relevant .pdb files from the RCSB server

    • Input:
      1. data/internal/pname_cid_map.pkl
      2. Internet connection
    • Output:
      1. Populated data/internal/pdb_files/
    • Description:
      This downloads the corresponding .pdb files from the RCSB server, then removes entries from pname_cid_map whose .pdb files are missing from the folder.
      1. Run download_pdb()
      2. Run trim_pnames_based_on_pdb()
  3. Create sequence .fasta file

    • Input:
      1. data/internal/pname_cid_map.pkl
    • Output:
      1. data/internal/seqs.fasta
    • Description:
      The motif-finding binaries require the sequences to be in a .fasta file.
      1. Run create_seq()
  4. Filter short sequences

    • Input:
      1. data/internal/seqs.fasta
      2. Populated data/internal/pdb_files/
    • Output:
      1. Updated data/internal/seqs.fasta
    • Description:
      Sequences shorter than the desired motif length (30 residues) can lead to errors when performing the motif search, and need to be dropped.
      1. Run filter_seq_file()
  5. (Optional) Create seed sequence file for converge

    • Input:
      1. data/user/input/ioncom_binding_sites.txt
    • Output:
      1. data/internal/seed_seqs.fasta
    • Description:
      The motif-finding binary converge requires seed sequences from which it generates its initial set of motifs.
      1. Place the ioncom binding-site file in data/user/input/.
      2. Run make() in src/preprocessing/make_conv_seed_seqs.py.
  6. Run motif-search binary to find motif positions

    • Input:

      1. data/internal/seqs.fasta
      2. (Optional) Populated data/internal/pdb_files/
      3. (Optional) Provided motif file (e.g. data/user/input/meme.txt)
      4. (Optional) data/internal/seed_seqs.fasta
    • Output:

      1. data/internal/motif_pos.pkl
    • Description:
      This finds the positions of the desired motif for each sequence-chain. There are three implemented ways of running this locally:

      1. Motifs can be derived from scratch using meme. This generates both the motif file and the motif positions. Run find(process='meme', num_p=<num_processors>) in src/preprocessing/motif_finder.py.
      2. Motifs can be found using a given motif file. First, put the motif file (in MEME format) in data/user/input/<filename>. Then, run find(process='mast', motif_fname=<filename>, num_p=<num_processors>) in src/preprocessing/motif_finder.py.
      3. Motifs can be derived from scratch using converge, which also provides the motif file and positions. Run make(input_fname=<filename>, num_p=<num_processors>) in src/preprocessing/make_conv_seed_seqs.py.

      Because the motif-finding process has a long run-time, it is recommended to run this step on a server. Instructions for doing so are in [1] below.
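As a rough illustration of steps 4 and 6, the sketch below drops sequences shorter than the 30-residue motif length and then builds (without running) a command line for meme or mast. It is not the repository's implementation: read_fasta, drop_short and motif_command are hypothetical helper names, the output directory argument is an assumption, and the flags the project actually passes to the binaries are not shown in this README. The converge binary is left out because its command line is not described here.

```python
MOTIF_LEN = 30  # motif length quoted in step 4


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_short(records, min_len=MOTIF_LEN):
    """Keep only records with at least min_len residues."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def motif_command(process, seq_path, out_dir, motif_path=None, num_p=1):
    """Build the argv list for a MEME Suite binary (assumed flag choice)."""
    if process == "meme":
        cmd = ["meme", str(seq_path), "-protein", "-oc", str(out_dir),
               "-w", str(MOTIF_LEN)]
        if num_p > 1:
            cmd += ["-p", str(num_p)]  # parallel build of meme only
        return cmd
    if process == "mast":
        if motif_path is None:
            raise ValueError("mast needs a motif file in MEME format")
        return ["mast", str(motif_path), str(seq_path), "-oc", str(out_dir)]
    raise ValueError(f"unsupported process: {process!r}")
```

The resulting list could be handed to subprocess.run once the binaries are installed; building it separately keeps the command easy to inspect or to copy into a job script on a server.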

Descriptor Generation

  1. Calculate descriptor properties

    • Input:
      1. data/internal/motif_pos.pkl
      2. Populated data/internal/pdb_files/
    • Output:
      1. data/internal/descrs.pkl
    • Description:
      This calculates the descriptor properties for each motif. Run calculate() in src/descr/descr_main.py.
  2. Visualise properties

    • Input:
      1. data/internal/descrs.pkl
    • Output:
      1. (Optional) data/user/output/
    • Description:
      Plots for different descriptor properties can be generated via src/utils/plots.py. Run each plot_<something>(save=False) as needed, and set save=True to keep the generated plots in the output folder.

Tests

  • Generate Reference Output

    • /tests/src/setup_ref.py
  • Visualise Reference Output

    • /tests/src/plot_ref.py
  • Check Against Reference Output

    • /tests/src/test_motif_finder.py
    • /tests/src/test_descr_main.py

Data files

  • /data
    • /tmp: Created during runtime and deleted at the end of a run, unless kept for debugging. It is not deleted when a test fails.
    • /input
      • /ioncom
        • allsulfate.txt: Raw sequence to binding-site matches, for mg, from dataIonCom.zip, downloaded from https://zhanglab.ccmb.med.umich.edu/IonCom/ >> download dataset used to...
        • ioncom.txt: allid_reso3.0_len50_nr40.txt from dataIonCom; lists the sequences. (To be deprecated eventually.)
      • /mg_full
        • mg_50.fasta: From UniProt; UniRef50 for sequences with MG as cofactor/ligand.
        • mg_100.fasta: UniRef100 for sequences with MG as cofactor.
      • /pdb_files: Stored .pdb files. Both tests and main should read from this folder, since downloading takes a while. Files are downloaded automatically from the RCSB server, via the link https://files.rcsb.org/view/{1ABC}.pdb
      • /prosite
      • /internal
        • fasta_template.fasta: Used for running mast when only the seqlogo is wanted and matching for motifs does not matter.
        • meme.txt: Motif file for Calcium EF-hand.
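The link pattern listed for /pdb_files, together with the download and trim pass in preprocessing step 2, can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code behind download_pdb() or trim_pnames_based_on_pdb(): the helper names are hypothetical, and the map is assumed to be a dict from PDB ID to chain ID, which the README does not confirm.

```python
import os

# URL pattern quoted in the /pdb_files entry; {pdb_id} stands for e.g. 1ABC.
PDB_URL = "https://files.rcsb.org/view/{pdb_id}.pdb"


def pdb_url(pdb_id):
    """Return the download link for one PDB ID."""
    return PDB_URL.format(pdb_id=pdb_id)


def present_ids(pdb_dir):
    """List the PDB IDs that already have a .pdb file in pdb_dir."""
    return {
        os.path.splitext(name)[0]
        for name in os.listdir(pdb_dir)
        if name.endswith(".pdb")
    }


def trim_missing(pname_cid_map, available_ids):
    """Drop map entries whose .pdb file is not available.

    pname_cid_map is assumed (hypothetically) to map PDB ID -> chain ID.
    """
    return {
        pdb_id: chain_id
        for pdb_id, chain_id in pname_cid_map.items()
        if pdb_id in available_ids
    }
```

Keeping the folder scan (present_ids) apart from the filtering (trim_missing) means the trimming logic can be checked without touching the disk or the network.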

Linter

  • pylint, mostly following the Google style guide, with some additional checks disabled.

TODO:

  1. pdb_list from prosite needs to be extracted too...?