This GitHub repository is designed to aggregate data about various phenotypes of the SARS-CoV-2 spike that may be of value for forecasting the virus's evolution.
The notebook draws on different data sources for the effects of mutations on SARS-CoV-2 phenotypes. Currently, these data sources are:
- full spike pseudovirus deep mutational scanning
- RBD yeast-display deep mutational scanning
- EVEscape predictions
The idea is that if you are interested in making forecasts of viral evolution, this repository provides a way to obtain up-to-date data on how mutations have been measured or predicted to affect spike phenotypes.
The repository consists of a Jupyter notebook (SARS2-spike-predictor-phenos.ipynb) that can be run with an appropriate YAML configuration file (e.g., config.yaml) to tabulate both the effects of mutations on spike phenotypes and the predicted phenotypes of different SARS-CoV-2 clades.
This notebook reads in data contained within this repository about how mutations affect spike phenotypes: look at the mutation_phenotype_csvs key in config.yaml to understand these data sources.
The spike phenotype data themselves are in ./data/, and the README in that subdirectory provides more explanation.
The notebook reads those data and then generates four output files, which by default are as follows:
- results/mutation_phenotypes.csv: how individual amino-acid mutations (Wuhan-Hu-1 numbering) affect the spike phenotypes.
- results/mutation_phenotypes_randomized.csv: a version of the mutation phenotypes in which the phenotypes are randomized among mutations, for different random number seeds.
- results/clade_phenotypes.csv: the predicted phenotypes of different SARS-CoV-2 clades, along with the clade parents, their mutations, whether they are descendants of key ancestor clades, and their estimated growth rates (where available).
- results/clade_phenotypes_randomized.csv: the clade phenotypes generated from the randomized mutation phenotypes.
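As a quick way to see what the output files contain, here is a minimal sketch that reads a CSV with Python's standard library and reports its columns. The helper name read_csv_rows is hypothetical and not part of this repository. The column names are not listed here, so the sketch prints them rather than assuming them.

```python
import csv
from pathlib import Path


def read_csv_rows(path):
    """Return (column_names, rows) for a CSV file, with each row as a dict."""
    with Path(path).open(newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


if __name__ == "__main__":
    # Default output locations listed above; adjust if your config differs.
    for name in ["mutation_phenotypes.csv", "clade_phenotypes.csv"]:
        path = Path("results") / name
        if path.exists():
            columns, rows = read_csv_rows(path)
            print(f"{path}: {len(rows)} rows, columns = {columns}")
```

All values come back as strings, so convert numeric columns yourself (or use a dataframe library if you prefer).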
If you want, you can just use the values in those CSVs. However, although the input data on the spike predictor phenotypes in this repository are updated only occasionally (when new data become available), new clades are constantly being designated and their estimated growth rates are updated daily, so the clade phenotype estimates are constantly changing.
Therefore, if you want to get the latest predictions, your best bet is to clone this repository (perhaps as a submodule), and then run the notebook yourself.
After you have obtained the repo, first build the conda environment in environment.yml, then activate it with:
conda activate SARS2-spike-predictor-phenos
Then run the Jupyter notebook SARS2-spike-predictor-phenos.ipynb using papermill with:
papermill -p config_yaml config.yaml SARS2-spike-predictor-phenos.ipynb results/SARS2-spike-predictor-phenos.ipynb
Note that you can pass a custom configuration file to the notebook using the -p config_yaml <configuration YAML> option, so you can use a different configuration than the default one in config.yaml.
In particular, if you want reproducible output, you should specify specific versions for the pango_json and pango_growth_json keys in the YAML rather than just the latest versions as in the default config.yaml.
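If you drive the pipeline from a script, the papermill call above can be assembled programmatically. The following is a minimal sketch, not part of this repository: the function names and the file name my_config.yaml are hypothetical, and the command simply mirrors the papermill invocation shown above with a different configuration file.

```python
import subprocess


def papermill_command(
    config_yaml,
    notebook="SARS2-spike-predictor-phenos.ipynb",
    output=None,
):
    """Build the papermill command line for a given configuration YAML."""
    if output is None:
        # Same default output location as the command shown above.
        output = f"results/{notebook}"
    return ["papermill", "-p", "config_yaml", config_yaml, notebook, output]


def run_notebook(config_yaml):
    """Run the notebook with the given config (requires the conda env to be active)."""
    subprocess.run(papermill_command(config_yaml), check=True)
```

For example, run_notebook("my_config.yaml") would execute the notebook with a custom configuration in which pango_json and pango_growth_json point to fixed versions.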
In addition to the CSV files in ./results/, running the notebook creates an interactive plot that allows you to look at scatter plots of the phenotypes for clades. That plot is placed in docs/index.html, and is rendered on GitHub Pages at https://jbloomlab.github.io/SARS2-spike-predictor-phenos/.
When making predictions, there is always a danger of over-fitting or failing to account for phylogenetic correlations in a way that makes phenotypes seem more predictive of evolution than they really are. Therefore, the pipeline creates files (as described above) that randomize effects among mutations and generates clade phenotype predictions from these randomized data. You should always compare the accuracy of predictions made with the actual non-randomized data to the accuracy of those made with these randomized data: if the actual data are not any more predictive than the randomized data, then somehow you are overfitting or neglecting to account for phylogenetic correlations.
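One simple way to make this comparison is sketched below, under stated assumptions: predictive accuracy is measured as the Pearson correlation between a clade phenotype and clade growth rate, and the same correlation is computed for each randomization. The function names are hypothetical, and the choice of correlation as the accuracy metric is an illustration rather than a method prescribed by this repository. The functions take plain lists because how you extract the values for each random seed depends on the columns of the CSVs.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def empirical_p_value(actual_r, randomized_rs):
    """Fraction of randomizations at least as correlated as the actual data.

    Uses the standard +1 correction so the estimate is never exactly zero.
    """
    n_at_least = sum(r >= actual_r for r in randomized_rs)
    return (n_at_least + 1) / (len(randomized_rs) + 1)
```

If the correlation for the actual data is not clearly larger than most of the correlations for the randomized data (that is, the empirical p-value is not small), the phenotype is not adding predictive value beyond what the randomized controls give.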
Each new run of this pipeline on GitHub has a tag indicating the date it was run as YYYY-MM-DD.
In addition, the CHANGELOG describes updates such as adding new data.
This repository is maintained by Jesse Bloom.
Thanks to:
- Cornelius Roemer for maintaining the Pango clade sequence definitions used by this repo.
- The Bedford lab for maintaining the clade growth estimates used by this repo.
- The generators of the data that go into the various spike phenotype predictors incorporated into this repo:
  - Bernadeta Dadonaite and the Bloom lab for the full-spike deep mutational scanning:
  - The Starr lab for the RBD yeast-display deep mutational scanning of ACE2 affinity and RBD expression:
  - The Cao lab for the RBD yeast-display antibody-escape deep mutational scanning, which is incorporated in the Bloom lab antibody escape calculator:
  - The Marks lab for the EVEscape values: