kiharalab/contactPFP

ContactPFP

A protein function prediction method based on predicted contact information.

How to Run

Preparation

Install the required software first. The steps below use HHblits, trRosetta, and gr-align.

Generate Contact-graph Database

In this step, all Swiss-Prot proteins are converted into contact graphs.

Preprocessing

Edit the Swiss-Prot paths in DB/01-preprocessing.sh, then run it. The script performs the following steps:

  1. Extract the unique sequences from Swiss-Prot.
  2. Filter sequences by length (L = 20 to 2000).
  3. Split them into single-sequence FASTA files.
  • By default, the files are split across several directories, each holding up to 1000 files.
  • Duplicates are recorded in a file so that the results can be "extended" to them afterwards. You will end up with about 470K FASTA files in the FASTA/\d{3} directories.
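
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the contents of DB/01-preprocessing.sh: the function names, the output file names, the inclusive length bounds, and the shape of the duplicates mapping are all assumptions. Only the length range, the limit of 1000 files per directory, and the three-digit directory names come from the description above.

```python
from pathlib import Path


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def preprocess(records, min_len=20, max_len=2000):
    """Deduplicate by sequence, then keep sequences within the length range.

    Returns (unique, duplicates). `unique` is a list of (header, sequence).
    `duplicates` maps each kept header to the headers of later records
    that have exactly the same sequence.
    """
    first_seen = {}   # sequence -> header of the first record with it
    duplicates = {}   # representative header -> duplicate headers
    for header, seq in records:
        if seq in first_seen:
            duplicates.setdefault(first_seen[seq], []).append(header)
        else:
            first_seen[seq] = header
    unique = [(h, s) for s, h in first_seen.items()
              if min_len <= len(s) <= max_len]
    kept = {h for h, _ in unique}
    duplicates = {h: d for h, d in duplicates.items() if h in kept}
    return unique, duplicates


def write_split(unique, out_dir, per_dir=1000):
    """Write one FASTA file per sequence, at most `per_dir` per directory."""
    out_dir = Path(out_dir)
    for i, (header, seq) in enumerate(unique):
        sub = out_dir / f"{i // per_dir:03d}"   # 000, 001, 002, ...
        sub.mkdir(parents=True, exist_ok=True)
        # Hypothetical file naming; the real script may name files differently.
        (sub / f"{i:06d}.fasta").write_text(f">{header}\n{seq}\n")
```

Keeping the duplicates mapping alongside the unique set means the expensive later steps run once per distinct sequence, while results can still be reported for every original entry.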

Generate MSAs

Generate an MSA file (.a3m) for each protein using HHblits. Running DB/10-msa.sh generates all of the files, but it takes a very long time. You will probably need a compute cluster and about 2 TB of disk space to store all of the .a3m files.

Generate contact predictions

trRosetta uses those MSAs to calculate distance predictions. Running DB/20-tr.sh runs the prediction for all of the proteins. This step is much faster on GPUs, and it generates about 38 TB of .npz files in total.
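
For planning storage on a subset of the database, the totals quoted above can be turned into rough per-protein averages. The sketch below is only back-of-the-envelope arithmetic: it assumes decimal units (1 TB = 10^6 MB) and uses the approximate figure of 470K proteins from the preprocessing step. Actual file sizes vary with sequence length and MSA depth.

```python
def per_protein_mb(total_tb, n_proteins):
    """Average size per protein in MB, using decimal units (1 TB = 1e6 MB)."""
    return total_tb * 1e6 / n_proteins


N_PROTEINS = 470_000  # approximate number of FASTA files after preprocessing

a3m_mb = per_protein_mb(2, N_PROTEINS)    # .a3m files, about 2 TB in total
npz_mb = per_protein_mb(38, N_PROTEINS)   # .npz files, about 38 TB in total

print(f".a3m: about {a3m_mb:.1f} MB per protein on average")
print(f".npz: about {npz_mb:.1f} MB per protein on average")
```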

Convert contact predictions to contact graphs

trRosetta outputs a probability for each distance bin, so DB/30-convert.sh converts these probabilities into a binary contact graph using a distance cutoff. The cutoff is set to 12 Angstrom, but you can change it to another value. After this step, you will have a database containing 2 files (. and .) for each protein. These files and dup.txt are used in the prediction step.
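
One way to turn distance-bin probabilities into a binary contact graph is sketched below. It is an illustration of the idea, not the logic of DB/30-convert.sh: the function names, the rule of summing the probability mass below the cutoff, and the `min_prob` threshold of 0.5 are assumptions. Only the 12 Angstrom cutoff comes from the text above.

```python
import numpy as np


def to_contact_graph(dist_probs, bin_upper_edges, cutoff=12.0, min_prob=0.5):
    """Convert per-pair distance-bin probabilities into a binary contact map.

    dist_probs:      array of shape (L, L, B) with probabilities over B bins
    bin_upper_edges: length-B sequence with the upper distance bound of each
                     bin in Angstrom (np.inf for a "no contact" bin)
    cutoff:          residue pairs closer than this count as contacts
    min_prob:        assumed threshold on the summed probability
    """
    dist_probs = np.asarray(dist_probs, dtype=float)
    within = np.asarray(bin_upper_edges, dtype=float) <= cutoff
    # Probability that the pair distance falls in any bin below the cutoff.
    p_contact = dist_probs[..., within].sum(axis=-1)
    contacts = p_contact >= min_prob
    np.fill_diagonal(contacts, False)  # a residue is not its own contact
    return contacts


def edge_list(contacts):
    """Return the contacts as sorted (i, j) pairs with i < j."""
    i, j = np.nonzero(np.triu(contacts, k=1))
    return list(zip(i.tolist(), j.tolist()))
```

To apply this to real trRosetta output you need the bin layout of the distance array in the .npz file. As far as we know, it has 37 bins, with the first meaning "no contact" and the rest covering 2 to 20 Angstrom in 0.5 Angstrom steps, but check this against the trRosetta version you run.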

Now, you are all set!

Make a function prediction for a query protein

If you have a single FASTA query saved as query.fasta, run the following command to make a prediction.

$ prediction/predict.sh query.fasta

The script runs the following steps:

  1. HHblits
  2. trRosetta
  3. convert into graph
  4. rank by gr-align
  5. post-processing

When the script finishes, the result is written to output/[query]/[query].prediction. It is a 3-column table listing GO ID, GO category, and confidence score.
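
The prediction file can be loaded for downstream analysis with a few lines of Python. The sketch below assumes the three columns are separated by whitespace and that the score is a plain number. The class and function names are hypothetical, and the column separator should be checked against a real output file.

```python
from collections import defaultdict
from typing import NamedTuple


class GoPrediction(NamedTuple):
    go_id: str       # GO ID, first column
    category: str    # GO category, second column
    score: float     # confidence score, third column


def parse_predictions(text):
    """Parse 3-column prediction text into GoPrediction records, in file order."""
    predictions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        go_id, category, score = fields
        try:
            predictions.append(GoPrediction(go_id, category, float(score)))
        except ValueError:
            continue  # e.g. a header row whose third field is not a number
    return predictions


def group_by_category(predictions):
    """Group predictions by GO category, preserving file order within a group."""
    groups = defaultdict(list)
    for p in predictions:
        groups[p.category].append(p)
    return dict(groups)
```

For a query named query.fasta, the input would be read with something like `Path("output/query/query.prediction").read_text()`, following the output path pattern described above.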

Reference

Yuki Kagaya, et al., "ContactPFP: Protein function prediction using predicted contact information." (in preparation)
