ContactPFP

Protein Function Prediction method based on predicted contact information.

How to Run

Preparation

Install the required software first.

Software
- python3
  - tested on python 3.6.7
- trRosetta
- HHblits
- GR-align
- seqkit
- GNU parallel

Sequence databases
- swissprot
- Uniclust30
  - required to generate MSA with HHblits

Generate Contact-graph Database

In this step, all proteins in swissprot are converted into contact graphs.

preprocessing

Edit the swissprot paths in DB/01-preprocessing.sh, then run it. The script performs the following steps:

- take out unique sequences from swissprot
- filter sequences by length (L = 20 to 2000)
- split into single FASTA files

By default, the files are split across several directories, each holding up to 1000 files.
Duplicate sequences are recorded in a file so that results can be "extended" to them afterwards. You will find about 470K FASTA files in the FASTA/\d{3} directories.
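
The three preprocessing steps can be sketched in plain Python as below. This is only an illustration of the logic, not the contents of DB/01-preprocessing.sh (which presumably relies on seqkit); the function names, the `.fasta` file naming, and the in-memory duplicate table are all assumptions made for the example.

```python
from pathlib import Path


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def preprocess(text, min_len=20, max_len=2000):
    """Return (unique, dups) after deduplication and length filtering."""
    first_id_by_seq = {}  # sequence -> ID of the first record with that sequence
    dups = {}             # representative ID -> IDs of identical sequences
    for header, seq in read_fasta(text):
        seq_id = header.split()[0]
        if seq in first_id_by_seq:
            dups.setdefault(first_id_by_seq[seq], []).append(seq_id)
        else:
            first_id_by_seq[seq] = seq_id
    # Keep only representatives whose length is within [min_len, max_len].
    unique = [
        (seq_id, seq)
        for seq, seq_id in first_id_by_seq.items()
        if min_len <= len(seq) <= max_len
    ]
    return unique, dups


def write_split(unique, out_dir, per_dir=1000):
    """Write one FASTA file per record, at most per_dir files per directory."""
    out_dir = Path(out_dir)
    for i, (seq_id, seq) in enumerate(unique):
        sub = out_dir / f"{i // per_dir:03d}"  # 000, 001, 002, ...
        sub.mkdir(parents=True, exist_ok=True)
        (sub / f"{seq_id}.fasta").write_text(f">{seq_id}\n{seq}\n")
```

Keeping the duplicate table keyed by the representative ID is what makes the later "extend" step possible: a prediction made for the representative can be copied to every ID listed under it.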

Generate MSAs

Generate an MSA file (.a3m) for each protein using HHblits. Running DB/10-msa.sh generates all of the files, but it takes a very long time. You will likely need a computing cluster and about 2 TB of disk space to store all .a3m files.
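
For orientation, a per-protein HHblits call might be assembled as in the sketch below. The helper name, the iteration count, the CPU count, and the choice to write the .a3m next to the FASTA file are assumptions for illustration; DB/10-msa.sh may well use different options.

```python
from pathlib import Path


def hhblits_command(fasta, database, n_iter=3, cpu=4):
    """Build an HHblits argument list for one FASTA file (hypothetical helper)."""
    fasta = Path(fasta)
    return [
        "hhblits",
        "-i", str(fasta),                         # input sequence
        "-oa3m", str(fasta.with_suffix(".a3m")),  # output MSA in a3m format
        "-d", str(database),                      # database prefix, e.g. Uniclust30
        "-n", str(n_iter),                        # number of search iterations (assumed)
        "-cpu", str(cpu),                         # threads per job (assumed)
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, or joined into one line per protein and fed to GNU parallel.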

generate Contact predictions

Using those MSAs, trRosetta calculates the distance predictions. Running DB/20-tr.sh runs the prediction for all of the proteins. This step is much faster on GPUs, and it generates ~38 TB of .npz files in total.

convert contact predictions to contact graph

Because trRosetta outputs a probability for each distance bin, DB/30-convert.sh converts these probabilities into a binary contact graph using a distance cutoff. The cutoff is set to 12 Angstrom, but you can change it to another value. After this step, you will have a database that contains 2 files (. and .) for each protein. These files and dup.txt are used in the prediction step.
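
The conversion from distance-bin probabilities to a binary contact graph can be sketched as follows. This is not the code in DB/30-convert.sh. It assumes the trRosetta layout in which the distance array has shape (L, L, 37), bin 0 means "no contact", and bins 1 to 36 cover 2 to 20 Angstrom in 0.5 Angstrom steps; the 0.5 probability threshold and both function names are also assumptions.

```python
import numpy as np


def dist_to_contacts(dist, cutoff=12.0, prob_threshold=0.5,
                     first_edge=2.0, bin_width=0.5):
    """Turn an (L, L, n_bins) distance distribution into a boolean contact map.

    Assumed layout: bin 0 is "no contact"; bin k (k >= 1) covers
    [first_edge + (k - 1) * bin_width, first_edge + k * bin_width).
    """
    # Number of distance bins that lie entirely below the cutoff.
    n_close = int(round((cutoff - first_edge) / bin_width))
    # Probability that the residue pair is closer than the cutoff.
    p_contact = dist[:, :, 1:1 + n_close].sum(axis=-1)
    contacts = p_contact >= prob_threshold
    np.fill_diagonal(contacts, False)  # a residue is not in contact with itself
    return contacts


def contacts_to_edges(contacts):
    """List each contact once as an (i, j) pair with i < j."""
    i, j = np.nonzero(np.triu(contacts, k=1))
    return list(zip(i.tolist(), j.tolist()))


# Toy 3-residue example with made-up probabilities.
dist = np.zeros((3, 3, 37))
dist[0, 1, 5] = dist[1, 0, 5] = 0.9    # 4.0-4.5 Angstrom bin: contact
dist[0, 1, 0] = dist[1, 0, 0] = 0.1
dist[1, 2, 30] = dist[2, 1, 30] = 1.0  # 16.5-17.0 Angstrom bin: beyond cutoff
dist[0, 2, 0] = dist[2, 0, 0] = 1.0    # "no contact" bin
edges = contacts_to_edges(dist_to_contacts(dist))
```

With a real prediction, `dist` would be read from the .npz file (for example `np.load(path)["dist"]`, assuming that key name); the resulting edge list is the kind of graph that is then written to the database.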

Now, you are all set!

make a function prediction of query protein

If you have a single query sequence in FASTA format saved as query.fasta, run the following command to make a prediction.

$ prediction/predict.sh query.fasta

The script runs the following steps:

- HHblits
- trRosetta
- convert into graph
- rank by GR-align
- post-processing

When it finishes, you will find the result in output/[query]/[query].prediction. It is a 3-column table showing the GO ID, GO category, and confidence score.
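
A small reader for the result file might look like the sketch below. It assumes the three columns are separated by whitespace, which the description above does not confirm; the function name, the category codes, and the scores in the sample are made up for illustration.

```python
def parse_predictions(text):
    """Parse 3-column rows (GO ID, GO category, score), highest score first."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        go_id, category, score = fields
        try:
            rows.append((go_id, category, float(score)))
        except ValueError:
            continue  # skip a header line, if the file has one
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up sample in the assumed layout.
sample = """\
GO:0005737 C 0.41
GO:0005524 F 0.87
"""
predictions = parse_predictions(sample)
top_go_id = predictions[0][0]
```

For a real run, replace `sample` with the contents of output/[query]/[query].prediction, read with `open(path).read()`.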

Reference

Yuki Kagaya, et al., "ContactPFP: Protein function prediction using predicted contact information." (in preparation)
