This is a package to fit and test variational autoencoder (VAE) models for T cell receptor sequences.
It is described in the paper Deep generative models for T cell receptor protein sequences by Kristian Davidsen, Branden J Olson, William S DeWitt III, Jean Feng, Elias Harkins, Philip Bradley and Frederick A Matsen IV.
Setting up your environment
Conda is the canonical way to prepare your environment and is required to run the pipeline, although it is not strictly a dependency for fitting and using VAEs with this code. These instructions assume that you have installed Conda.
If you want to see the entire environment preparation process, see the Dockerfile. However, if you simply want to train and use vampire models, you only need to execute
conda env create -f install/environment.yml
This will create a
vampire Conda environment which you can enter and use for running vampire.
If you also want to compare repertoires using sumrep, you will need to run the R installation steps in the Dockerfile.
We also provide an
install/environment-olga.yml to make a Conda environment in which one can run OLGA.
After setting up your environment (if you followed the steps above you'll need to
conda activate vampire), run
pip install .
in the repository to install vampire.
If you want to use sumrep, see
install/test.sh for additional install instructions.
To get started, check out the demonstration script in
vampire/demo/demo.sh, which will show you how models and training parameters are specified.
To run the main pipeline on sample data, try running
scons -n inside the
Execute the commands on example data by running
You can run these in parallel using the
-j flag for scons.
Note that this pipeline runs on a very small data set (mixing training and testing) purely as an example; it does not give an appropriately trained model.
In order to run on your own data, use
python util.py split-repertoires to split your repertoires into train and test.
This will make a JSON file pointing to various paths.
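The actual splitting is done by util.py; purely as an illustration of the idea, the sketch below shuffles a list of repertoire paths, splits it into train and test sets, and prints the result as JSON. The function name `split_paths`, the JSON keys, and the file names are hypothetical and are not taken from vampire.

```python
import json
import random


def split_paths(paths, test_fraction=0.5, seed=0):
    """Shuffle repertoire paths and split them into train and test lists.

    Hypothetical helper for illustration only; not part of vampire.
    """
    shuffled = sorted(paths)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return {"train_paths": shuffled[n_test:], "test_paths": shuffled[:n_test]}


if __name__ == "__main__":
    # Made-up file names standing in for real repertoire files.
    example = [f"repertoire_{i}.tsv" for i in range(6)]
    print(json.dumps(split_paths(example), indent=2))
```

Fixing the seed keeps the split reproducible, so the same repertoires end up in the test set on every run.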
You can run the pipeline on those data by running
Note that the frequency estimation pipeline is run using
The pipeline includes a
--clusters flag that, if used, will attempt to submit jobs to a SLURM cluster with the specified name.
If you have access to a cluster with a different cluster scheduler, hopefully you can modify the
execute.py script accordingly.
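For readers adapting the pipeline to another scheduler, the general pattern is to wrap each shell command in the scheduler's submission command. The sketch below shows this pattern for SLURM's sbatch, using its `--clusters`, `--wait`, and `--wrap` options. It is not the contents of execute.py, and the function name `build_submit_command` is hypothetical.

```python
import shlex


def build_submit_command(command, clusters=None):
    """Return an argument list that runs `command` locally or through sbatch.

    Hypothetical helper for illustration only; not part of vampire.
    """
    if clusters is None:
        # No cluster requested: run the command in a local shell.
        return ["bash", "-c", command]
    # SLURM: submit to the named cluster and block until the job finishes.
    # For a different scheduler, replace this list with its submit command.
    return ["sbatch", f"--clusters={clusters}", "--wait", f"--wrap={command}"]


if __name__ == "__main__":
    # "mycluster" is a placeholder cluster name.
    print(shlex.join(build_submit_command("echo hello", clusters="mycluster")))
```

Returning an argument list rather than a single string means the result can be passed directly to `subprocess.run` without additional shell quoting.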
The documentation consists of
- the demonstration script
- the two pipelines, which will give you commands to try
- command line help, which is accessed for example via
tcr-vae train --help
- lots of docstrings in the source code
Please get in touch if anything isn't clear.
- Our preprocessing scripts exclude TCRBJ2-5, which Adaptive annotates badly, and TCRBJ2-7, which appears to be problematic for OLGA.
- We use Adaptive gene names and sequences, but will extend to more flexible TCR gene sets in the future.
- Original version (immortalized in the
original branch) by Kristian Davidsen.
- Pedantic rewrite, sconsery, extension, additional models, and comparative evaluation by Frederick "Erick" Matsen.
- Contributions from Phil Bradley, Will DeWitt, Jean Feng, Eli Harkins, and Branden Olson.
This project uses YAPF for code formatting with a format defined in
You can easily run yapf on all the files with a call to
yapf -ir . in the root directory.
Code is also checked using flake8.
# noqa comments cause lines to be ignored by flake8.
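As a small illustration of how these markers look in practice, the snippet below suppresses two standard flake8 warnings. A bare `# noqa` silences every warning on its line, while `# noqa: CODE` silences only the listed code. The snippet is made up for this example and does not come from the vampire source.

```python
# The import below is intentionally unused; flake8 would report F401.
import json  # noqa: F401

# Without the marker, flake8 would flag this line as too long (E501).
LONG_MESSAGE = "This string literal is deliberately long so that the whole line exceeds the default flake8 limit of 79 characters."  # noqa: E501
```

Listing the specific code is usually preferable, since other problems on the same line are still reported.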