This project is part of the Advanced Machine Learning course at Heidelberg University and is situated in the area of COVID-19 research. Its goal is to predict the next possible mutations of the COVID-19 virus.
During genome replication, random mutations can occur. As a consequence, the encoded protein sequence may change, which can lead to different behavior. If such a change increases the virus's fitness, it is likely passed on to the next generation.
A well-known example of a mutating virus is SARS-CoV-2. The developed vaccines raise hope for an end of the pandemic. However, this only holds if the vaccines, which were developed against the wild type of SARS-CoV-2, remain effective against new mutations. To enable fast responses to newly arising mutations, it would be helpful to know possible next mutations in advance. This can influence the treatment and prevention of disease by enabling the development of countermeasures and preventive measures ahead of time.
Machine Learning, and especially Deep Learning, has enabled improvements in many different domains. This work applies Deep Learning to virus genome mutation prediction. Since genome sequences can be treated as text data, methods from the NLP field can be applied. The success of Deep Learning for NLP tasks has already been shown in various areas such as text generation, text summarization, and translation.
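Treating a genome sequence as text typically starts with tokenization, for example into overlapping k-mers (a sliding window of fixed length). The following is a minimal sketch of this idea; the function name and the choice of k are illustrative and not taken from this project's code:

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list:
    """Split a genome sequence into overlapping k-mers via a sliding window."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Example: a short nucleotide fragment
tokens = kmer_tokenize("ATGGCT", k=3)
# → ['ATG', 'TGG', 'GGC', 'GCT']
```

The resulting tokens can then be mapped to a vocabulary and fed to standard NLP models, just like words in a sentence.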
Our research question is whether a Machine Learning model can be trained to predict the next possible SARS-CoV-2 mutations. In this work, we propose three novelties:
- Model architecture: A new GAN-based architecture influenced by Berman et al. Our novelty is the use of transformers instead of LSTMs in the seq2seq model
- Dataset: Generation of a dataset for SARS-CoV-2, consisting of 9199 parent-child data instances
- Application domain: The training of the network for SARS-CoV-2
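Each dataset instance pairs a parent sequence with its mutated child, so the seq2seq model learns the mapping parent → child. A minimal sketch of how such a parent-child instance might be represented; the class and field names are illustrative assumptions, not this project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ParentChildPair:
    """One training instance: a parent sequence and its mutated child."""
    parent: str  # sequence before mutation
    child: str   # sequence after mutation

    def mutated_positions(self) -> list:
        """Indices where parent and child differ (substitutions only)."""
        return [i for i, (p, c) in enumerate(zip(self.parent, self.child)) if p != c]

# Example: a single substitution at position 2
pair = ParentChildPair(parent="MFVFLV", child="MFAFLV")
print(pair.mutated_positions())  # → [2]
```

In training, the generator would consume the parent sequence and be rewarded for producing plausible children, while the discriminator distinguishes generated children from real ones.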
First install:
- Anaconda to create isolated python environments
- NVIDIA GeForce Experience to update your local graphics driver
- CUDA 11.1 to utilize GPUs during training (see installation guide)
- NVIDIA cuDNN v8.2.1 to accelerate DNN implementations on GPU (see installation guide)
Then simply run:

```shell
conda env create
conda activate aml
```
Pretrained models for the generator and the discriminator can be found here. Please refer to the images of the folder structures below to see how to integrate the model checkpoints into the training or evaluation routines.
Choose the routine to run in the main.py file.
In case you need to add additional dependencies, adapt the environment.yml file and run:
```shell
conda env update
```
List all installed dependencies with their versions:
```shell
conda list
```
Run isort from project root:
```shell
isort .
```
Open TensorBoard with:

```shell
tensorboard --logdir ./src/training/tensorboard/pretraining/
tensorboard --logdir ./src/training/tensorboard/training/discriminator/
tensorboard --logdir ./src/training/tensorboard/training/generator/
```
Due to the regulations of the GISAID platform, the raw data sources and the dataset are not part of this repository. The structure of the data folder, into which the raw data can be inserted, is shown in the following image.
As part of the dataset generation, ncov is used for preprocessing. The repository is cloned automatically for usage and deleted afterwards. For documentation purposes, the structure of the repository with its input and output files is shown in the following image.
During the training phase, model checkpoints are saved and loss plots are written to TensorBoard. The structure of the training folder is shown in the following image.
For details about the project (e.g. the used dataset, the chosen Machine Learning approach, and the results), see the report.
In case you are new to PyTorch, see the following PyTorch Tutorial repository, by which parts of the code were inspired.
- Felix Hausberger
- Nils Krehl