Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints.
See our paper in Nature Communications for more. Please cite the paper if you use DMPfold.
You can also run DMPfold via the PSIPRED web server. This is a good way to get models for a few sequences, but if you want to run DMPfold on many sequences we strongly recommend you run it locally. The server version of DMPfold has restrictions on run time and uses parameters that give faster runs, so should not be used to benchmark DMPfold.
As DMPfold makes use of a lot of different software, installation can be a little fiddly. However, we have aimed to make it as straightforward as possible. These instructions should work for a Linux system:
- Make sure you have Python 3 with PyTorch 0.4 or later, NumPy and SciPy installed. GPU setup is optional for PyTorch - it won't speed things up much because running the network isn't a time-consuming step. DMPfold has been tested on Python 3.6 and 3.7. The command python3 should point to the Python that you want to use.
- Install HH-suite and the uniclust30 database, unless you are getting your alignments from elsewhere.
- Install FreeContact.
- Install CCMpred.
- Install MODELLER, which requires a license key. Only the Python package is required so this can be installed with conda install modeller -c salilab.
- Install CNS. We found we had to follow all of the steps in this comment to get CNS working: set MXFPEPS2 in machvar.inc to 8192, remove the -fastm flag in the make file, set MXRTP in rtf.inc in the source directory to 4000 and in machvar.f add WRITE (6,'(I6,E10.3,E10.3)') I, ONEP, ONEM just above line 67, which looks like IF (ONE .EQ. ONEP .OR. ONE .EQ. ONEM) THEN. We also had to install the flex-devel package via our system package manager. In addition, you should change two values in cns_solve_1.3/modules/nmr/readdata to larger numbers to allow DMPfold to run on larger structures. Change the nrestraints = 20000 line to something like nrestraints = 50000 and increase the value on the nassign 1600 line in a similar way.
- Download and patch the required CNS scripts by changing into the cnsfiles directory and running
- Install CD-HIT, which is usually as simple as a clone and make. CD-HIT is not required if you don't need to predict the TM-score of generated models.
- Install the legacy BLAST software, in particular makemat. We may update this to BLAST+ in the future.
- Other software is pre-compiled and included here (PSIPRED, PSICOV, various utility scripts with the code in src). This should run okay but may need separate compilation using the makefile if issues arise. Some other standard programs, such as the csh shell, are assumed.
- Change lines 10/13-15/18/21/24 in seq2maps.csh, lines 11/14/17/20 in aln2maps.csh, lines 4/7 in bin/runpsipredandsolvwithdb, lines 10/13 in run_dmpfold.sh and lines 7/10 in predict_tmscore.sh to point to the installed locations of the above software. You can also set the number of cores to use in aln2maps.csh. This sets the number of cores for HHblits, PSICOV, FreeContact and CCMpred - the script will run faster with this set to a value larger than 1 (e.g. 4 or 8).
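Before editing the script paths, it can help to confirm which of the dependencies listed above are visible from your shell. The following is a minimal sketch, not part of DMPfold: the executable names in ASSUMED_EXECUTABLES are guesses at typical install names and may differ on your system, and the version check only confirms Python 3.6 or later (DMPfold itself has been tested on 3.6 and 3.7).

```python
import importlib.util
import shutil
import sys

# Python packages named in the installation steps above.
REQUIRED_MODULES = ["torch", "numpy", "scipy", "modeller"]

# Assumption: typical executable names; adjust them to match your installation.
ASSUMED_EXECUTABLES = ["python3", "csh", "hhblits", "freecontact",
                       "ccmpred", "cd-hit", "makemat"]


def check_environment(modules=REQUIRED_MODULES, executables=ASSUMED_EXECUTABLES):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    report = {}
    report["python>=3.6"] = sys.version_info >= (3, 6)
    for name in modules:
        # find_spec locates a top-level module without importing it.
        report["module:" + name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which searches the directories on PATH.
        report["exe:" + exe] = shutil.which(exe) is not None
    return report


if __name__ == "__main__":
    for requirement, found in check_environment().items():
        print(("OK      " if found else "MISSING ") + requirement)
```

A tool reported as missing is not necessarily a problem: the DMPfold scripts use the paths you set in them, so software installed outside PATH will still work once those lines are edited.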
Here we give an example of running DMPfold on Pfam family PF10963.
First you need to generate the input files for DMPfold (PF10963.21c and PF10963.map in this example). This can be done in one of two ways:
- From a single sequence: csh seq2maps.csh example/PF10963.fasta to run HHblits, PSIPRED, SOLVPRED, PSICOV, FreeContact, CCMpred and alnstats.
- From an alignment: csh aln2maps.csh example/PF10963.aln to run PSIPRED, SOLVPRED, PSICOV, FreeContact, CCMpred and alnstats. The file PF10963.aln has one sequence per line with the ungapped target sequence as the first line.
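If your alignments come from elsewhere, a quick format check before running aln2maps.csh can save a failed run. The sketch below is a hypothetical helper, not part of DMPfold. It checks the two properties stated above (one sequence per line, ungapped target first) and additionally assumes that every row has the same length as the target, which is usual for this kind of alignment file but is not stated in this README.

```python
def read_aln(lines):
    """Parse alignment lines: one sequence per line, target sequence first.

    Returns (target, sequences) and raises ValueError on format problems.
    """
    # Blank lines are ignored; everything else is treated as a sequence row.
    sequences = [line.strip() for line in lines if line.strip()]
    if not sequences:
        raise ValueError("alignment is empty")
    target = sequences[0]
    if "-" in target:
        raise ValueError("the first line must be the ungapped target sequence")
    for number, seq in enumerate(sequences, start=1):
        if seq.startswith(">"):
            raise ValueError(f"sequence {number} looks like a FASTA header")
        # Assumption: all rows are aligned to the target and share its length.
        if len(seq) != len(target):
            raise ValueError(
                f"sequence {number} has length {len(seq)}, expected {len(target)}"
            )
    return target, sequences
```

For the example family this could be called as `read_aln(open("example/PF10963.aln"))`, which returns the target sequence and the full list of rows if the file passes the checks.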
Then run sh run_dmpfold.sh example/PF10963.fasta PF10963.21c PF10963.map ./PF10963 to run DMPfold, where the last parameter is an output directory that will be created.
Running sh run_dmpfold.sh example/PF10963.fasta PF10963.21c PF10963.map ./PF10963 5 20 instead runs 5 iterations with 20 models per iteration (the defaults are 3 and 50).
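When running DMPfold locally on many sequences, it may be convenient to build these command lines in a script. The sketch below is one possible wrapper, assuming it is launched from the directory containing run_dmpfold.sh. The function names are hypothetical, and requiring the iteration and model counts to be given together is a conservative choice based on the two invocations shown above, not a documented rule.

```python
import subprocess


def dmpfold_command(fasta, file_21c, file_map, out_dir, iterations=None, models=None):
    """Build the argument list for run_dmpfold.sh."""
    command = ["sh", "run_dmpfold.sh", fasta, file_21c, file_map, out_dir]
    # Only the forms shown in the README are generated: no counts, or both counts.
    if (iterations is None) != (models is None):
        raise ValueError("give both iterations and models, or neither")
    if iterations is not None:
        command += [str(iterations), str(models)]
    return command


def run_dmpfold(*args, **kwargs):
    """Run DMPfold and raise CalledProcessError if the script exits with an error."""
    subprocess.run(dmpfold_command(*args, **kwargs), check=True)
```

For example, `dmpfold_command("example/PF10963.fasta", "PF10963.21c", "PF10963.map", "./PF10963", 5, 20)` produces the second command above as a list of arguments.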
The final model is final_1.pdb. Other structures, with names up to final_5.pdb, may or may not be generated depending on whether they are significantly different.
Many other files are generated totalling around 100 MB - these should be deleted to save disk space if you are running DMPfold on many sequences.
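The clean-up can be scripted when processing many sequences. The following is a minimal sketch rather than a DMPfold tool. Which files to keep is an assumption: the default patterns retain the final models and the rawdistpred files, since rawdistpred.1 is an input to predict_tmscore.sh, so adjust them to your needs. The function defaults to a dry run so that nothing is deleted until you have checked the list.

```python
import fnmatch
from pathlib import Path

# Assumption: keep the final models and the raw distance predictions.
DEFAULT_KEEP = ("final_*.pdb", "rawdistpred.*")


def clean_output_dir(out_dir, keep=DEFAULT_KEEP, dry_run=True):
    """Delete files directly inside out_dir whose names match no keep pattern.

    Returns the sorted names of the files that were removed or, in a dry run,
    would be removed.
    """
    removed = []
    for path in sorted(Path(out_dir).iterdir()):
        if not path.is_file():
            continue  # subdirectories are left untouched
        if any(fnmatch.fnmatch(path.name, pattern) for pattern in keep):
            continue
        removed.append(path.name)
        if not dry_run:
            path.unlink()
    return removed
```

Calling `clean_output_dir("./PF10963")` lists the candidates for deletion, and `clean_output_dir("./PF10963", dry_run=False)` removes them.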
To predict the TM-score of a DMPfold model using our trained predictor, run
sh predict_tmscore.sh example/PF10963.fasta PF10963.aln PF10963/final_1.pdb PF10963/rawdistpred.1.
If this predictor estimates that a model has a TM-score of at least 0.5 then there is an 83% chance of this being the case according to cross-validation of the Pfam validation set.
See Supplementary Figure 1 in the paper for estimates of run time. It takes around 3 hours on a single core to carry out a complete DMPfold run for a 200 residue protein, but this can occasionally be much longer due to PSICOV not converging. 8 GB of memory is generally sufficient to run DMPfold but more may be required for larger proteins.
Figure 5 in the paper gives some data on how DMPfold performs with respect to sequence length. Sequences up to around 600 residues in length can be modelled accurately, with performance degrading above this.
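For large batches it may be worth flagging long targets before committing compute time to them. The sketch below is a hypothetical pre-filter, not part of DMPfold. It reads FASTA records and separates those above a length threshold, using the approximate 600-residue figure quoted above as a default rather than as a hard cutoff.

```python
MAX_RELIABLE_LENGTH = 600  # approximate figure from the text, not a hard limit


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, max_length=MAX_RELIABLE_LENGTH):
    """Return two lists of (header, length): within the limit, and above it."""
    within, above = [], []
    for header, seq in records:
        (within if len(seq) <= max_length else above).append((header, len(seq)))
    return within, above
```

Sequences in the second list can still be run, but lower accuracy should be expected for them.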
Models for the 1,475 Pfam families modelled in the paper can be downloaded here. Additional models for the remainder of the dark Pfam families can be downloaded here (some were not modelled due to small sequence alignments). Alignments for the Pfam families without available templates can be downloaded here. The format is one sequence per line with the ungapped target sequence as the first line.
The directory pfam in this repository contains text files with the lists from Figure 4A of the paper, target sequences for modelled families and data for modelled families (sequence length, effective sequence count, distogram satisfaction scores, estimated TM-score and probability TM-score >= 0.5).
The list of PDB chains used for training can be found here.