ndaniels/HomologyTesting
Contents of this directory
==========================

Files in this directory include:

* add-good-chains-to-whitelist.py
* analyze-smurf.sh
* argumentparsers.py
* blastp-to-muscle.sh
* gargamel.py
* generate-hmm.py
* generate-matt-alignments.py
* generate-negative-controls.py
* generate-positive-controls.py
* get-good-chains.sh
* smurf-query.sh
* smurf-train.sh

How to use scripts in this directory
====================================

The analyze-smurf.sh script is intended to run everything necessary to generate multiple alignments, HMM files, and positive and negative controls using both smurf and hmmer as aligners, and then analyze the results. However, the results analysis script has not yet been written, so for now analyze-smurf.sh only generates the results, by running generate-matt-alignments.py, generate-hmm.py, generate-positive-controls.py and generate-negative-controls.py with smurf and hmmer as aligners.

The four scripts generate-matt-alignments.py, generate-hmm.py, generate-positive-controls.py and generate-negative-controls.py are intended to be used together in that order. The gargamel.py module contains common functionality for these scripts.

To generate multiple alignments from all proteins in a given superfamily, leaving one family out at a time, use generate-matt-alignments.py. If the multiple alignment fails because matt refuses to analyze certain chains, use the get-good-chains.sh script to list the chains matt can use, then use the add-good-chains-to-whitelist.py script to add those good chains to the whitelist in this directory. NOTE: this should only happen with recent versions of matt.

To generate HMM files from the multiple alignment output generated by matt, use generate-hmm.py. Use this script only after running generate-matt-alignments.py.

To run smurf or hmmer queries on every protein not aligned by matt but in that same superfamily, use generate-positive-controls.py. Use this script only after running generate-matt-alignments.py and generate-hmm.py.

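The ordering constraints among the four generate-*.py scripts can be written down as a small dependency table. The sketch below is only an illustration and is not part of this directory: it records the prerequisites stated in this README and uses Python's standard graphlib module (Python 3.9 or later) to compute a valid run order. The names PREREQUISITES and run_order are hypothetical. The sketch does not invoke the scripts, because their command-line arguments are not documented here.

```python
from graphlib import TopologicalSorter

# Each script maps to the scripts that must have completed before it runs.
# The script names come from this README; the table itself is illustrative.
PREREQUISITES = {
    "generate-matt-alignments.py": [],
    "generate-hmm.py": ["generate-matt-alignments.py"],
    "generate-positive-controls.py": [
        "generate-matt-alignments.py",
        "generate-hmm.py",
    ],
    "generate-negative-controls.py": [
        "generate-matt-alignments.py",
        "generate-hmm.py",
        "generate-positive-controls.py",
    ],
}


def run_order(prerequisites):
    """Return the scripts ordered so that each follows its prerequisites."""
    return list(TopologicalSorter(prerequisites).static_order())


if __name__ == "__main__":
    for step, script in enumerate(run_order(PREREQUISITES), start=1):
        print(f"{step}. {script}")
```

Because every script depends on all the ones before it, only one order satisfies the table, and it matches the order given above.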
To run smurf or hmmer queries on every protein not in that superfamily, use the generate-negative-controls.py script. Use this script only after running generate-matt-alignments.py, generate-hmm.py and generate-positive-controls.py.

To simply generate an HMM file from matt for use with smurf, use the smurf-train.sh script. To use smurf to query this generated HMM file with specific protein chains, use the smurf-query.sh script.

Purpose of files
================

* add-good-chains-to-whitelist.py - Given a file containing a list of usable PDB IDs and associated chain letters, this script adds the PDB IDs and chains read from that good chains input file to the whitelist file. The format of the whitelist file is as follows: each line consists of a PDB ID, which is an alphanumeric string of length 4, followed by a colon, followed by one or more letters identifying a chain (for example, A, B, C, etc.). Repeated PDB IDs are NOT allowed in this file. Behavior on repeated PDB IDs is undefined. Lines starting with hashes are comments and are ignored.
* analyze-smurf.sh -
* argumentparsers.py -
* blastp-to-muscle.sh -
* gargamel.py - The alignment-test.py and query-negative-controls.py scripts import this module in order to access the common functions which it defines, which are useful for running matt/smurf on certain sets of protein chains.
* generate-hmm.py -
* generate-matt-alignments.py -
* generate-negative-controls.py - Runs smurf queries on all proteins NOT in the superfamily aligned by the generate-matt-alignments.py script. This script relies on the successful completion of the generate-matt-alignments.py and generate-hmm.py scripts, and on the directory structure which they output. generate-negative-controls.py requires some command-line arguments.
  For more information on which command-line arguments this script accepts, run the command:

      generate-negative-controls.py --help

* generate-positive-controls.py -
* get-good-chains.sh - Given a file containing Matt's output as redirected from stdout, this script will output (to stdout) a list of protein chains usable by Matt. The format of the output is as follows: each line consists of a PDB ID, which is an alphanumeric string of length 4, followed by a colon, followed by a single letter identifying a chain (for example, A, B, C, etc.). Repeated PDB IDs are allowed in this file. Lines starting with hashes are comments and are ignored. NOTE: Currently, the downside to this method of determining which chains are good and which are bad is that we have to run Matt once and let it fail before we can determine which chains are good.
* smurf-train.sh - Given a file containing a list of PDB files, this script runs Matt, SMURF-preparse, and hmmbuild on the specified set of PDB files to create a hidden Markov model for use with SMURF, specifically for use with the smurf-query.sh script. WARNING: Matt is finicky, and may fail. If Matt fails, the rest of the script will cause segmentation faults.
* smurf-query.sh - Given a hidden Markov model and a FASTA file containing a protein structure to query, this script runs SMURF to determine an alignment to the consensus template specified in the hidden Markov model generated by the smurf-train.sh script.

Contact
=======

Jeffrey Finkelstein <jeffrey.finkelstein@gmail.com>
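
The two line formats described under "Purpose of files" (one chain letter per line in the get-good-chains.sh output, one or more chain letters per PDB ID in the whitelist) can be illustrated with a short merge routine. This is a minimal sketch under stated assumptions and is not the code in add-good-chains-to-whitelist.py: the names parse_entries and merge_into_whitelist are hypothetical, and the sketch assumes that there is no whitespace around the colon, that blank lines may be skipped, and that PDB IDs are compared exactly as written. Comment lines are ignored on input and are not carried over to the result.

```python
import re
from itertools import chain

# One entry per line: a 4-character alphanumeric PDB ID, a colon, then one
# or more chain letters. Assumption: no whitespace around the colon.
ENTRY = re.compile(r"^([A-Za-z0-9]{4}):([A-Za-z]+)$")


def parse_entries(lines):
    """Yield (pdb_id, chain_letters) pairs, skipping blanks and '#' comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = ENTRY.match(line)
        if match is None:
            raise ValueError(f"malformed entry: {line!r}")
        yield match.group(1), match.group(2)


def merge_into_whitelist(whitelist_lines, good_chain_lines):
    """Merge good chains into the whitelist, keeping one line per PDB ID."""
    merged = {}
    entries = chain(parse_entries(whitelist_lines),
                    parse_entries(good_chain_lines))
    for pdb_id, letters in entries:
        seen = merged.setdefault(pdb_id, [])
        for letter in letters:
            if letter not in seen:  # keep each chain letter only once
                seen.append(letter)
    return [f"{pdb_id}:{''.join(letters)}" for pdb_id, letters in merged.items()]


if __name__ == "__main__":
    # Made-up PDB IDs, used purely for illustration.
    whitelist = ["# existing whitelist", "1abc:A", "2xyz:AB"]
    good_chains = ["1abc:B", "1abc:A", "3def:C"]
    for entry in merge_into_whitelist(whitelist, good_chains):
        print(entry)
```

Collecting chains in a dictionary keyed by PDB ID keeps the result free of repeated PDB IDs, which the whitelist format forbids, even though the good-chains input may repeat an ID once per chain.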
About
cross-validation framework for remote homology detection