This project is part of the Advanced Machine Learning course at Heidelberg University and is situated in the area of COVID-19 research. Its goal is to predict the next possible mutations of the COVID-19 virus.
During genome replication, random mutations can occur. As a consequence, the encoded protein sequence may change, which can lead to different behavior. If such a change increases the virus's fitness, it is likely passed on to the next generation.
A well-known example of a mutating virus is SARS-CoV-2. The developed vaccines raise hope for an end of the pandemic. However, this only holds if the vaccines, which were developed against the wild type of SARS-CoV-2, remain effective against new mutations. To enable fast responses to newly arising mutations, it would be helpful to know possible next mutations in advance. This can influence the treatment and prevention of disease by enabling the development of countermeasures and preventive measures ahead of time.
Machine Learning, and especially Deep Learning, has enabled improvements in many different domains. This work applies Deep Learning to virus genome mutation prediction. Since genome sequences can be treated as text data, methods from the NLP field can be applied. The success of Deep Learning for NLP tasks has already been shown in various areas such as text generation, text summarization, and translation.
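Treating a genome sequence as text typically starts with tokenization, for example into overlapping k-mers (a sliding window of fixed length). The following is a minimal sketch of this idea; the function name and the choice of k are illustrative and not taken from this project's code:

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list:
    """Split a genome sequence into overlapping k-mers via a sliding window."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Example: a short nucleotide fragment
tokens = kmer_tokenize("ATGGCT", k=3)
# → ['ATG', 'TGG', 'GGC', 'GCT']
```

The resulting tokens can then be mapped to a vocabulary and fed to standard NLP models, just like words in a sentence.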
Our research question is whether a Machine Learning model can be trained to predict the next possible SARS-CoV-2 mutations. In this work, we propose three novelties:
- Model architecture: A new GAN-based architecture influenced by Berman et al. Our novelty is the use of transformers instead of LSTMs in the seq2seq model
- Dataset: Generation of a dataset for SARS-CoV-2, consisting of 9199 parent-child data instances
- Application domain: The training of the network for SARS-CoV-2
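Each dataset instance pairs a parent sequence with its mutated child, so the seq2seq model learns the mapping parent → child. A minimal sketch of how such a parent-child instance might be represented; the class and field names are illustrative assumptions, not this project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ParentChildPair:
    """One training instance: a parent sequence and its mutated child."""
    parent: str  # sequence before mutation
    child: str   # sequence after mutation

    def mutated_positions(self) -> list:
        """Indices where parent and child differ (substitutions only)."""
        return [i for i, (p, c) in enumerate(zip(self.parent, self.child)) if p != c]

# Example: a single substitution at position 2
pair = ParentChildPair(parent="MFVFLV", child="MFAFLV")
print(pair.mutated_positions())  # → [2]
```

In training, the generator would consume the parent sequence and be rewarded for producing plausible children, while the discriminator distinguishes generated children from real ones.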
First install:
- Anaconda to create isolated python environments
- NVIDIA GeForce Experience to update your local graphics driver
- CUDA 11.1 to utilize GPUs during training (see installation guide)
- NVIDIA cuDNN v8.2.1 to accelerate DNN implementations on GPU (see installation guide)
Then simply run:

```shell
conda env create
conda activate aml
```
Pretrained models for the generator and the discriminator can be found here. Please refer to the images of the folder structures below to see how to integrate the model checkpoints into the training or evaluation routines.
Choose the routine to run in the main.py file.
In case you need to add additional dependencies, adapt the environment.yml file and run:
```shell
conda env update
```
List all installed dependencies with their versions:
```shell
conda list
```
Run isort from project root:
```shell
isort .
```
Open TensorBoard with:

```shell
tensorboard --logdir ./src/training/tensorboard/pretraining/
tensorboard --logdir ./src/training/tensorboard/training/discriminator/
tensorboard --logdir ./src/training/tensorboard/training/generator/
```
Due to the regulations of the GISAID platform, the raw data sources and the dataset are not part of this repository. The structure of the data folder, into which the raw data can be inserted, is shown in the following image.
As part of the dataset generation, ncov is used for preprocessing. The repository is cloned automatically for usage and deleted afterwards. For documentation purposes, the structure of the repository with its input and output files is shown in the following image.
During the training phase, model checkpoints are saved and loss plots are written to TensorBoard. The structure of the training folder is shown in the following image.
For details about the project (e.g. the used dataset, the chosen Machine Learning approach, and the results), see the report.
In case you are new to PyTorch, see the following PyTorch Tutorial repository, by which parts of the code were inspired.
- Felix Hausberger
- Nils Krehl