GitHub - ndaniels/HomologyTesting: cross-validation framework for remote homology detection


Repository files:

gargamel
BuriedProbability.csv
ExposedProbability.csv
README
TODO
add-good-chains-to-whitelist.py
add_pvalue_to_hmm.rb
analyze-mrfy.sh
analyze-smurf-lite.sh
analyze-smurf.sh
analyze_beta_interleave.rb
beta_mutation.rb
beta_pair.rb
blast_augment_training.rb
blastp-to-muscle-usage.sh
blastp-to-muscle.sh
generate-all-matt-alignments.py
generate-complete-hmm.py
generate-csv.py
generate-hmm.py
generate-matt-alignments.py
generate-mrf.py
generate-negative-controls.py
generate-negative-targets.py
generate-positive-controls.py
generate-positive-targets.py
generate-training-targets.py
generate_runs.rb
generate_runs_all.rb
generate_targets.sh
get-good-chains.sh
get_superfamilies.rb
hive.rb
hive_smurf_lite.rb
mrfy_simev_preparse.rb
parse_smurf_results.rb
reorder_muscle_output.rb
run-complete-hmm.py
run-smurf.sh
run_profile_smurf.sh
run_profile_smurf_usage.sh
sim_ev.rb
smurf-query.sh
smurf-train.sh
smurf_lite_preparse.rb
smurf_lite_simplify.rb
stockholm.rb
train-smurf-on-alignments.py
train_smurf.sh
whitelist


Contents of this directory
==========================

Files in this directory include:

* add-good-chains-to-whitelist.py
* analyze-smurf.sh
* argumentparsers.py
* blastp-to-muscle.sh
* gargamel.py
* generate-hmm.py
* generate-matt-alignments.py
* generate-negative-controls.py
* generate-positive-controls.py
* get-good-chains.sh
* smurf-query.sh
* smurf-train.sh

How to use scripts in this directory
====================================

The analyze-smurf.sh script runs everything needed to generate multiple
alignments, HMM files, and positive and negative controls, using both smurf and
hmmer as aligners. It does this by running generate-matt-alignments.py,
generate-hmm.py, generate-positive-controls.py and
generate-negative-controls.py. The script is also meant to analyze the results,
but the results analysis script has not yet been written, so for now it only
generates them.

The four scripts generate-matt-alignments.py, generate-hmm.py,
generate-positive-controls.py and generate-negative-controls.py are intended to
be used together in that order. The gargamel.py module contains common
functionality for these scripts.

To generate multiple alignments from all proteins in a given superfamily,
leaving one family out at a time, use generate-matt-alignments.py.
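
The leave-one-family-out scheme can be sketched as follows. This is only an
illustration of the splitting idea, not the code in
generate-matt-alignments.py. The function name, the dictionary layout and the
family and chain identifiers are all assumptions made for the example.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_family_out(
    superfamily: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_family, training_chains, held_out_chains) per family.

    `superfamily` maps a family name to the chains it contains. The layout
    is hypothetical; the real scripts may organize their input differently.
    """
    for held_out in sorted(superfamily):
        # Every chain from the other families goes into the alignment.
        training = [
            chain
            for family in sorted(superfamily)
            if family != held_out
            for chain in superfamily[family]
        ]
        # Chains of the held-out family are the ones left to query later.
        yield held_out, training, list(superfamily[held_out])


if __name__ == "__main__":
    # Hypothetical family and chain identifiers, for illustration only.
    example = {
        "family-1": ["1abc:A", "2def:B"],
        "family-2": ["3ghi:A"],
        "family-3": ["4jkl:C"],
    }
    for held_out, training, queries in leave_one_family_out(example):
        print(held_out, training, queries)
```

Each split aligns the chains of all families but one. The chains of the
held-out family are the proteins "not aligned by matt but in that same
superfamily" that serve as positive controls.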

If the multiple alignment fails because matt refuses to analyze certain chains,
use the get-good-chains.sh script to list the chains matt can use. Then use the
add-good-chains-to-whitelist.py script to add those good chains to the
whitelist in this directory. NOTE: this should only happen with recent versions
of matt.

To generate HMM files from the multiple alignment output generated by matt, use
generate-hmm.py. Use this script only after running
generate-matt-alignments.py.

To run smurf or hmmer queries on every protein not aligned by matt but in that
same superfamily, use generate-positive-controls.py. Use this script only after
running generate-matt-alignments.py and generate-hmm.py.

To run smurf or hmmer queries on every protein not in that superfamily, use the
generate-negative-controls.py script. Use this script only after running
generate-matt-alignments.py, generate-hmm.py and generate-positive-controls.py.
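
The ordering constraints above can be written down as a small check. This is a
sketch only: the helper names are hypothetical and none of this code is part of
the repository. It encodes the order as stated in this section, in which the
negative controls come after the positive controls.

```python
from typing import List

# Order stated in this README; each script assumes the earlier ones have run.
PIPELINE: List[str] = [
    "generate-matt-alignments.py",
    "generate-hmm.py",
    "generate-positive-controls.py",
    "generate-negative-controls.py",
]


def prerequisites(script: str) -> List[str]:
    """Return the scripts that must have completed before `script` runs."""
    if script not in PIPELINE:
        raise ValueError(f"unknown pipeline script: {script}")
    return PIPELINE[: PIPELINE.index(script)]


def missing_prerequisites(script: str, completed: List[str]) -> List[str]:
    """Return the prerequisites of `script` that are not in `completed`."""
    done = set(completed)
    return [step for step in prerequisites(script) if step not in done]


if __name__ == "__main__":
    print(missing_prerequisites(
        "generate-negative-controls.py",
        ["generate-matt-alignments.py"],
    ))
```

A wrapper could call missing_prerequisites before launching a step and stop
early if the returned list is not empty.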

To simply generate an HMM file from matt for use with smurf, use the
smurf-train.sh script. To use smurf to query this generated HMM file with
specific protein chains, use the smurf-query.sh script.

Purpose of files
================

* add-good-chains-to-whitelist.py - Given a file containing a list of useable
  PDB IDs and associated chain letters, this script adds to the whitelist file
  the PDB IDs and chains read from the good chains input file. The format of
  the whitelist file is as follows: each line consists of a PDB ID, which is an
  alphanumeric string of length 4, followed by a colon, followed by one or more
  letters identifying a chain (for example, A, B, C, etc.). Repeated PDB IDs 
  are NOT allowed in this file. Behavior on repeated PDB IDs is undefined. 
  Lines starting with hashes are comments and are ignored.
  
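* analyze-smurf.sh - Runs generate-matt-alignments.py, generate-hmm.py,
  generate-positive-controls.py and generate-negative-controls.py with smurf
  and hmmer as aligners. Analysis of the results is not yet implemented.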

* argumentparsers.py - 

* blastp-to-muscle.sh - 

* gargamel.py - The alignment-test.py and query-negative-controls.py scripts
  import this module in order to access the common functions which it defines,
  which are useful for running matt/smurf on certain sets of protein chains.

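* generate-hmm.py - Generates HMM files from the multiple alignment output
  generated by matt. Use this script only after running
  generate-matt-alignments.py.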

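* generate-matt-alignments.py - Generates multiple alignments from all proteins
  in a given superfamily, leaving one family out at a time.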

* generate-negative-controls.py - Runs smurf queries on all proteins NOT in the
  superfamily aligned by the generate-matt-alignments.py script. This script
  relies on the successful completion of generate-matt-alignments.py and
  generate-hmm.py scripts, and the directory structure which they output.

  generate-negative-controls.py requires some command-line arguments. For more
  information on which command-line arguments this script accepts, run the
  command:

    generate-negative-controls.py --help

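* generate-positive-controls.py - Runs smurf or hmmer queries on every protein
  not aligned by matt but in the same superfamily. Use this script only after
  running generate-matt-alignments.py and generate-hmm.py.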

* get-good-chains.sh - Given a file containing the output from Matt redirected
  from stdout, this script will output (to stdout) a list of protein chains
  useable by Matt. The format of the output is as follows: each line consists
  of a PDB ID, which is an alphanumeric string of length 4, followed by a
  colon, followed by a single letter identifying a chain (for example, A, B, C,
  etc.). Repeated PDB IDs are allowed in this file. Lines starting with hashes
  are comments and are ignored.

  NOTE: Currently, the downside to this method of determining which chains are 
  good and which are bad is that we have to run Matt once and let it fail 
  before we can determine which chains are good.

* smurf-train.sh - Given a file containing a list of PDB files, this script
  runs Matt, SMURF-preparse, and hmmbuild on the specified set of PDB files to
  create a hidden Markov model for use with SMURF, specifically for use with 
  the smurf-query.sh script. WARNING: Matt is finicky, and may fail. If Matt
  fails, the rest of the script will cause some segmentation faults.

* smurf-query.sh - Given a hidden Markov model and a FASTA file containing a
  protein sequence to query, this script runs SMURF to determine an alignment
  to the consensus template specified in the hidden Markov model generated by 
  the smurf-train.sh script.
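
The two file formats described above (one chain letter per line and repeated
PDB IDs allowed in the get-good-chains.sh output; one or more chain letters per
line and no repeated PDB IDs in the whitelist) suggest a merge along the
following lines. This is a minimal sketch, not the code in
add-good-chains-to-whitelist.py. The function names are hypothetical, and
skipping blank lines and treating PDB IDs as case-sensitive are assumptions.

```python
import re
from typing import Dict, Iterable, List

# One entry per line: 4-character alphanumeric PDB ID, colon, chain letters.
ENTRY = re.compile(r"^([0-9A-Za-z]{4}):([A-Za-z]+)$")


def parse_entries(lines: Iterable[str]) -> Dict[str, str]:
    """Map each PDB ID to its chain letters, merging repeated IDs."""
    chains: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments are ignored; skipping blanks is an assumption
        match = ENTRY.match(line)
        if match is None:
            raise ValueError(f"malformed entry: {line!r}")
        pdb_id, letters = match.groups()
        merged = chains.get(pdb_id, "")
        for letter in letters:
            if letter not in merged:
                merged += letter
        chains[pdb_id] = merged
    return chains


def merge_into_whitelist(
    whitelist: Iterable[str], good_chains: Iterable[str]
) -> List[str]:
    """Return whitelist lines with one line per PDB ID after the merge."""
    merged = parse_entries(list(whitelist) + list(good_chains))
    return [f"{pdb_id}:{letters}" for pdb_id, letters in merged.items()]


if __name__ == "__main__":
    # Hypothetical PDB IDs and chains, for illustration only.
    old = ["# existing whitelist", "1abc:A"]
    new = ["1abc:B", "1abc:A", "2xyz:C"]
    print("\n".join(merge_into_whitelist(old, new)))
```

Merging by PDB ID keeps the result free of repeated IDs, which the whitelist
format requires. Comment lines are dropped from the output in this sketch.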

Contact
=======

Jeffrey Finkelstein <jeffrey.finkelstein@gmail.com>