This project implements a protein structure prediction machine learning model based on the Alphafold 2 and Alphafold 3 papers, designed to be trainable on a single mid-tier GPU.
See ogchen.github.io/nanofold for project documentation and this blog post for a project write-up.
- Leverages the Alphafold 3 architecture, which is significantly more efficient than the equivalent Alphafold 2 modules.
- Restricts the problem space to monomer protein chains to reduce the training data required.
- Reduces GPU memory usage with gradient checkpointing (see the sketch after this list).
- Training is done in bfloat16, further reducing the GPU memory footprint.
- Uses `torch.compile` for JIT compilation to speed up training.
- Stores input features in Apache Arrow's IPC format to handle datasets larger than available RAM.
- Integration with MLFlow to monitor training metrics and manage model checkpoints.
- Compression of dataset using sparse matrices to save disk space.
- Docker images for training and data processing pipeline.
- CI for running Python tests with GitHub Actions.
- Support for development within containers.
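For illustration, here is a minimal sketch of how gradient checkpointing, bfloat16 training, and `torch.compile` fit together in PyTorch. The `Block` module is hypothetical and stands in for Nanofold's actual modules:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical stand-in for one of Nanofold's model blocks.
class Block(torch.nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim)
        )

    def forward(self, x):
        # Gradient checkpointing: recompute activations during the backward
        # pass instead of storing them, trading compute for memory.
        return checkpoint(self.net, x, use_reentrant=False)

model = torch.compile(Block().cuda())  # JIT-compile for faster steps

x = torch.randn(8, 256, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).sum()  # forward pass runs in bfloat16
loss.backward()
```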
The data processing pipeline (entry point at `nanofold/preprocess/__main__.py`) performs the following steps:
- Parses mmCIF files from the Protein Data Bank for protein chain details, including the residue sequence and atom coordinates (see the parsing sketch after this list).
- For each protein chain, it searches the small BFD and Uniclust30 genetic databases for proteins with similar residue sequences. The results are combined to form the multiple sequence alignment (MSA).
- Using the MSA, it searches another database (PDB70) to find "templates": proteins that are structurally similar.
- Dumps all input features to an Arrow IPC file, ready for the training pipeline.
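As a rough illustration of the first step, here is a minimal parsing sketch using the gemmi library (Nanofold's actual parser and feature extraction may differ; the input file name is hypothetical):

```python
import gemmi

# Read one mmCIF file and pull out each chain's sequence and CA coordinates.
structure = gemmi.read_structure("1abc.cif")  # hypothetical input file
model = structure[0]  # first model in the structure
for chain in model:
    sequence = gemmi.one_letter_code([res.name for res in chain])
    ca_coords = [
        (atom.pos.x, atom.pos.y, atom.pos.z)
        for res in chain
        for atom in res
        if atom.name == "CA"  # alpha-carbon position per residue
    ]
    print(chain.name, sequence, len(ca_coords))
```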
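The final dump step can be pictured with pyarrow's IPC writer. The schema below is illustrative only; the real feature set is defined in `nanofold/preprocess`:

```python
import pyarrow as pa

# Illustrative schema; Nanofold's actual features differ.
schema = pa.schema([
    ("chain_id", pa.string()),
    ("sequence", pa.string()),
    ("coords", pa.list_(pa.float32())),
])

# The IPC file format lets record batches stream to disk one at a time,
# so the full dataset never has to fit in RAM.
with pa.OSFile("features.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        batch = pa.record_batch(
            [
                pa.array(["1ABC_A"]),
                pa.array(["MKTV"]),
                pa.array([[1.0, 2.0, 3.0]], type=pa.list_(pa.float32())),
            ],
            schema=schema,
        )
        writer.write_batch(batch)

# Reading back memory-maps the file instead of loading it wholesale.
with pa.memory_map("features.arrow", "rb") as source:
    table = pa.ipc.open_file(source).read_all()
```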
Nanofold largely implements the model algorithms detailed in Alphafold's Supplementary Information, with a few key exceptions:
- To simplify the problem, Nanofold only considers protein chains in isolation: all details regarding ligands, DNA, RNA, and other small molecules are ignored. Furthermore, only single-chain proteins (monomers) are supported.
- Alphafold 3 implements additional auxiliary heads, i.e. the model is trained to predict various metrics such as the predicted local distance difference test (pLDDT). These are ignored in Nanofold.
The relevant code can be found in `nanofold/train`.
The following tools are required for the download script:
```bash
sudo apt install aria2 jq
```
Choose a download directory and a cut-off date for PDB files. Note: adjust `DATA_DIR` in `docker/.env` if you choose a path other than `$HOME/data`.
Download and unzip mmCIF files that were deposited before the cut-off date with the following invocation:

```bash
export DATA_DIR=~/data
export CUTOFF_DATE=1989-01-01
mkdir -p $DATA_DIR/pdb && ./scripts/download_pdb.sh $DATA_DIR/pdb $CUTOFF_DATE
```
Download and unzip small BFD (17GB) with

```bash
wget https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz -P $DATA_DIR
gzip -d $DATA_DIR/bfd-first_non_consensus_sequences.fasta.gz
```
Download and unzip Uniclust30 with

```bash
aria2c https://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/uniclust30_2016_03.tgz -d $DATA_DIR
tar -xf $DATA_DIR/uniclust30_2016_03.tgz -C $DATA_DIR
```
Download and unzip PDB70 (56GB) for template search with

```bash
wget https://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/pdb70_from_mmcif_200401.tar.gz -P $DATA_DIR
mkdir -p $DATA_DIR/pdb70 && tar -xf $DATA_DIR/pdb70_from_mmcif_200401.tar.gz -C $DATA_DIR/pdb70
```
Requires the NVIDIA Container Toolkit for GPU support within containers.
Build the Docker images with

```bash
docker-compose -f docker/docker-compose.preprocess.yml build
docker-compose -f docker/docker-compose.train.yml build
```
Run the preprocessing script with

```bash
docker-compose -f docker/docker-compose.preprocess.yml run --rm preprocess python -m nanofold.preprocess -m /data/pdb/ -c /preprocess/ -o /preprocess/features.arrow --small_bfd /data/bfd-first_non_consensus_sequences.fasta --pdb70 /data/pdb70/pdb70 --uniclust30 /data/uniclust30_2016_03/uniclust30_2016_03
```
Note: this step can take a long time depending on the number of input proteins. One potential future adaptation would be to test the speed of MMseqs2, or to use precomputed MSAs.
Run the training script for `$N` epochs:

```bash
docker-compose -f docker/docker-compose.train.yml run --rm train python -m nanofold.train -c config/config.json -i /preprocess/features.arrow --mlflow --max-epoch $N
```
The training script will also spin up an MLFlow server. The dashboard can be accessed at localhost:8000 to monitor training metrics.
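As a reference for what the integration looks like, here is a minimal MLFlow logging sketch; the metric name, artifact name, and tracking URI are illustrative assumptions, not Nanofold's exact schema:

```python
import mlflow
import torch

mlflow.set_tracking_uri("http://localhost:8000")  # assumed server address
mlflow.set_experiment("nanofold")

with mlflow.start_run() as run:
    for epoch in range(3):
        train_loss = 1.0 / (epoch + 1)  # placeholder metric value
        mlflow.log_metric("train_loss", train_loss, step=epoch)

    # Persist a checkpoint as a run artifact so training can later be
    # resumed from this run's ID.
    torch.save({"epoch": epoch, "model_state": {}}, "checkpoint.pt")
    mlflow.log_artifact("checkpoint.pt")
    print("run id:", run.info.run_id)
```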
To resume training from an MLFlow checkpoint, identify the corresponding `$RUNID` and run:

```bash
docker-compose -f docker/docker-compose.train.yml run --rm train python -m nanofold.train -r $RUNID -i /preprocess/features.arrow --mlflow --max-epoch $N
```
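Conceptually, resuming boils down to fetching the checkpoint artifact for that run. A hedged sketch, assuming the checkpoint was logged as `checkpoint.pt` (the actual artifact name is defined by the training script):

```python
import mlflow
import torch

mlflow.set_tracking_uri("http://localhost:8000")  # assumed server address
run_id = "..."  # the $RUNID shown in the MLFlow dashboard

# Download the checkpoint artifact logged under this run.
local_path = mlflow.artifacts.download_artifacts(
    run_id=run_id, artifact_path="checkpoint.pt"
)
state = torch.load(local_path)
print("resuming from epoch", state["epoch"])
```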
Run the PyTorch profiler:

```bash
docker-compose -f docker/docker-compose.train.yml run --rm -v $DATA_DIR:/data train python -m nanofold.profile -c config/config.json -i /preprocess/features.arrow --mode time --mode memory
```
The profiler spits out `trace.json` and `snapshot.pickle` files in the mounted `/data/` volume.
Load `trace.json` into chrome://tracing, and `snapshot.pickle` into pytorch.org/memory_viz.
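The two outputs correspond to PyTorch's time profiler and CUDA memory snapshot APIs. A minimal sketch of how such files are produced (Nanofold's `nanofold.profile` module may configure this differently):

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(256, 256).cuda()
x = torch.randn(64, 256, device="cuda")

# Time profiling: record CPU and CUDA activity, then export a Chrome trace.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x).sum().backward()
prof.export_chrome_trace("trace.json")  # load into chrome://tracing

# Memory profiling: record allocation history, then dump a snapshot for
# pytorch.org/memory_viz.
torch.cuda.memory._record_memory_history()
model(x).sum().backward()
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```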
Refer to this GitHub comment if the profiler is complaining with `CUPTI_ERROR_NOT_INITIALIZED`.
Run tests with

```bash
docker run --rm --gpus all train pytest tests/train
docker run --rm preprocess pytest tests/preprocess
```