Protein Function Prediction method based on predicted contact information.
Install the required software first.
- software
- python3
- tested on python 3.6.7
- trRosetta
- HHblits
- GR-align
- seqkit
- GNU parallel
- sequence databases
- swissprot
- Uniclust30
- required to generate MSA with HHblits
This step converts all proteins in swissprot into contact graphs.
Edit the swissprot paths in DB/01-preprocessing.sh, then run the script.
The script performs the following steps:
- take out unique sequences from swissprot
- keep only sequences whose length L is between 20 and 2000 residues
- split into single FASTA files
- by default, the files are split across several directories, each holding up to 1000 files.
- duplicate sequences are recorded in a file so that the results can be "extended" to them afterwards.
You will find about 470K FASTA files in the FASTA/\d{3} directories.
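
The preprocessing steps above can be sketched in Python as follows. This is only an illustration of the logic, not the contents of DB/01-preprocessing.sh (which presumably relies on seqkit). The function names, the use of the first word of the FASTA header as the file name, and the inclusive length bounds are all assumptions made for the example.

```python
from pathlib import Path


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def preprocess(records, min_len=20, max_len=2000):
    """Deduplicate by sequence, then filter by length.

    Returns (unique, dups). unique is a list of (id, sequence) pairs.
    dups maps each kept id to the ids of identical sequences that were dropped.
    The length bounds are treated as inclusive (an assumption).
    """
    first_id = {}  # sequence -> id of its first occurrence
    dups = {}      # representative id -> ids of dropped duplicates
    for header, seq in records:
        seq_id = header.split()[0]
        if seq in first_id:
            dups.setdefault(first_id[seq], []).append(seq_id)
        else:
            first_id[seq] = seq_id
    unique = [(i, s) for s, i in first_id.items()
              if min_len <= len(s) <= max_len]
    kept = {i for i, _ in unique}
    dups = {k: v for k, v in dups.items() if k in kept}
    return unique, dups


def write_split(unique, out_dir, per_dir=1000):
    """Write one FASTA file per sequence, at most per_dir files per directory."""
    for n, (seq_id, seq) in enumerate(unique):
        sub = Path(out_dir) / "{:03d}".format(n // per_dir)
        sub.mkdir(parents=True, exist_ok=True)
        (sub / (seq_id + ".fasta")).write_text(">{}\n{}\n".format(seq_id, seq))
```

Keeping the duplicate map separate from the FASTA files means the expensive MSA and distance-prediction steps run once per unique sequence, and the results can later be copied to the identical entries.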
Generate an MSA file (.a3m) for each protein using HHblits.
Running DB/10-msa.sh generates all of the files, but it takes a very long time.
You will likely need a computer cluster and about 2 TB of disk space to store all the .a3m files.
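
For a single sequence, an HHblits call can be wrapped from Python roughly as shown below. This is a hedged sketch, not what DB/10-msa.sh actually runs: the helper names are hypothetical, and the iteration count and thread count are placeholder values. Only the standard hhblits options -i, -oa3m, -d, -n, and -cpu are used.

```python
import subprocess


def hhblits_command(fasta, a3m, database, iterations=3, cpus=4):
    """Build an hhblits command line that writes an .a3m alignment."""
    return [
        "hhblits",
        "-i", str(fasta),        # query sequence in FASTA format
        "-oa3m", str(a3m),       # output alignment in A3M format
        "-d", str(database),     # database prefix, e.g. the Uniclust30 path
        "-n", str(iterations),   # number of search iterations (placeholder)
        "-cpu", str(cpus),       # number of CPU threads (placeholder)
    ]


def run_hhblits(fasta, a3m, database, **kwargs):
    """Run hhblits; raises CalledProcessError if the program fails."""
    subprocess.run(hhblits_command(fasta, a3m, database, **kwargs), check=True)
```

Building the argument list separately from running it makes it easy to print the commands first, or to hand them to GNU parallel or a cluster scheduler instead of running them one by one.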
trRosetta then predicts inter-residue distances from those MSAs.
Running DB/20-tr.sh runs the prediction for all of the proteins.
This step is much faster on GPUs, and it generates ~38 TB of .npz files in total.
Since trRosetta outputs a probability for each distance bin, DB/30-convert.sh converts these probabilities into a binary contact graph using a distance cutoff.
The cutoff is set to 12 Angstrom by default, but you can change it to another value.
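
The conversion can be illustrated with the sketch below. It is not the code in DB/30-convert.sh. It assumes the distance output is an L x L x 37 array (for example the "dist" entry of the .npz file) in which bin 0 means "no contact" and bins 1 to 36 cover 2 to 20 Angstrom in 0.5 Angstrom steps. The probability threshold of 0.5 and the function names are hypothetical choices made for the example.

```python
def contact_graph(dist_probs, cutoff=12.0, min_prob=0.5,
                  first_edge=2.0, bin_width=0.5):
    """Convert an L x L x B distance-bin probability map into a 0/1 contact map.

    Assumed layout: bin 0 is "no contact"; bin k >= 1 covers the range
    [first_edge + (k - 1) * bin_width, first_edge + k * bin_width).
    Works on nested lists and on NumPy arrays alike.
    """
    # Number of distance bins that lie entirely below the cutoff.
    n_bins = int(round((cutoff - first_edge) / bin_width))
    length = len(dist_probs)
    contacts = [[0] * length for _ in range(length)]
    for i in range(length):
        for j in range(length):
            if i == j:
                continue  # no self-contacts
            # Probability that the predicted distance is below the cutoff.
            p_contact = sum(dist_probs[i][j][1:1 + n_bins])
            if p_contact >= min_prob:
                contacts[i][j] = 1
    return contacts


def edge_list(contacts):
    """List the (i, j) residue pairs with i < j that are in contact."""
    n = len(contacts)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if contacts[i][j]]
```

With the assumed binning, a 12 Angstrom cutoff sums the first 20 distance bins (2 to 12 Angstrom). Changing the cutoff only changes how many bins are summed.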
After this step, you will have a database that contains 2 files (. and .) for each protein.
Those files and dup.txt will be used in the prediction step.
Now, you are all set!
If you have a single FASTA query saved as query.fasta, run the following command to make a prediction.
$ prediction/predict.sh query.fasta
The script runs the following steps:
- HHblits
- trRosetta
- convert into graph
- rank by GR-align
- post-processing
When the script finishes, you will find the result in output/[query]/[query].prediction.
It is a 3-column table listing the GO ID, GO category, and confidence score.
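
A prediction file can be loaded for downstream analysis with a few lines of Python. The sketch below assumes the three columns are separated by whitespace and that the third column is numeric. The delimiter, the category labels, and the helper names are assumptions, so adjust them to the actual file.

```python
from collections import namedtuple

Prediction = namedtuple("Prediction", ["go_id", "category", "score"])


def parse_predictions(text):
    """Parse 3-column lines (GO ID, GO category, score) into Prediction rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        try:
            score = float(fields[2])
        except ValueError:
            continue  # skip lines without a numeric score, e.g. a header
        rows.append(Prediction(fields[0], fields[1], score))
    return rows


def top_predictions(rows, category=None, min_score=0.0):
    """Return rows filtered by category and score, highest score first."""
    kept = [r for r in rows
            if r.score >= min_score
            and (category is None or r.category == category)]
    return sorted(kept, key=lambda r: r.score, reverse=True)
```

For example, top_predictions(parse_predictions(open(path).read()), min_score=0.5) keeps only the terms with a confidence score of at least 0.5, where path points to the .prediction file.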
Yuki Kagaya, et al., "ContactPFP: Protein function prediction using predicted contact information." (in preparation)