Phenotypes of SARS-CoV-2 spike mutants with possible predictive power for forecasting evolution

Overview

This GitHub repository is designed to aggregate data about various phenotypes of the SARS-CoV-2 spike that may be of value for forecasting the virus's evolution.

The notebook draws on different data sources for the effects of mutations on SARS-CoV-2 phenotypes. Currently, these data sources are:

The idea is that if you are interested in making forecasts of viral evolution, this repository provides a way to obtain up-to-date data on how mutations have been measured or predicted to affect spike phenotypes.

Running the notebook in the repo to get phenotypic predictions

The repository consists of a Jupyter notebook (SARS2-spike-predictor-phenos.ipynb) that can be run with an appropriate YAML configuration file (e.g., config.yaml) to tabulate both the effects of mutations on spike phenotypes and the predicted phenotypes of different SARS-CoV-2 clades.

The notebook reads in data contained within this repository about how mutations affect spike phenotypes: see the mutation_phenotype_csvs key in config.yaml for the list of these data sources. The spike phenotype data themselves are in ./data/, and the README in that subdirectory provides more explanation.

The notebook reads those data and then generates four output files, which by default are as follows:

results/mutation_phenotypes.csv: how individual amino-acid mutations (Wuhan-Hu-1 numbering) affect the spike phenotypes.
results/mutation_phenotypes_randomized.csv: a version of the mutation phenotypes in which the phenotypes are randomized among mutations, repeated for different random number seeds.
results/clade_phenotypes.csv: the predicted phenotypes of different SARS-CoV-2 clades, along with the clade parents, their mutations, whether they are descendants of key ancestor clades, and their estimated growth rates (where available).
results/clade_phenotypes_randomized.csv: the clade phenotypes generated from the randomized mutation phenotypes.
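
As a quick check after running the notebook, the four output files can be inspected without assuming anything about their columns. The following is a minimal sketch using only the Python standard library; the helper name summarize_csv is hypothetical and not part of this repository, and the file paths are simply the defaults listed above.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, number_of_data_rows) for a CSV file.

    Hypothetical helper for illustration; not part of the repository.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        n_rows = sum(1 for _ in reader)  # iterating also reads the header row
        columns = list(reader.fieldnames or [])
    return columns, n_rows


if __name__ == "__main__":
    # Default output file names as described in this README.
    for name in [
        "mutation_phenotypes.csv",
        "mutation_phenotypes_randomized.csv",
        "clade_phenotypes.csv",
        "clade_phenotypes_randomized.csv",
    ]:
        path = Path("results") / name
        if path.exists():
            columns, n_rows = summarize_csv(path)
            print(f"{path}: {n_rows} rows; columns = {columns}")
        else:
            print(f"{path} not found; run the notebook first")
```

Printing the header first is a deliberate choice here: the actual column names are defined by the notebook, so it is safer to look them up than to hard-code them.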

If you want, you can just use the values in those CSVs. However, the clade phenotype estimates change constantly: although the spike predictor phenotype data in this repository are only updated occasionally (when new data become available), new clades are designated all the time and their estimated growth rates are updated daily.

Therefore, if you want to get the latest predictions, your best bet is to clone this repository (perhaps as a submodule), and then run the notebook yourself.

After you have obtained the repo, first build the conda environment specified in environment.yml (for example, with conda env create -f environment.yml), then activate it with:

conda activate SARS2-spike-predictor-phenos

Then run the Jupyter notebook SARS2-spike-predictor-phenos.ipynb using papermill with:

papermill -p config_yaml config.yaml SARS2-spike-predictor-phenos.ipynb results/SARS2-spike-predictor-phenos.ipynb

Note that you can pass a custom configuration file to the notebook using the -p config_yaml <configuration YAML> option, so you can use a different configuration than the default one in config.yaml. In particular, if you want reproducible output, you should specify fixed versions for the pango_json and pango_growth_json keys in the YAML rather than the latest versions used in the default config.yaml.
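
If you run the notebook from a script (for instance, from a parent repository that includes this one as a submodule), the papermill command can be assembled programmatically. The sketch below only builds and prints the same command shown above with a different configuration file; the function name papermill_command and the file name my_config.yaml are hypothetical placeholders.

```python
import shlex
import subprocess

NOTEBOOK = "SARS2-spike-predictor-phenos.ipynb"


def papermill_command(config_yaml, notebook=NOTEBOOK, output_dir="results"):
    """Build the papermill command line for a given configuration YAML.

    Hypothetical helper: it mirrors the command in this README, with the
    executed notebook written into output_dir.
    """
    output_notebook = f"{output_dir}/{notebook}"
    return ["papermill", "-p", "config_yaml", config_yaml, notebook, output_notebook]


def run_notebook(config_yaml, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    cmd = papermill_command(config_yaml)
    print(shlex.join(cmd))
    if not dry_run:
        # Requires the conda environment to be active so papermill is on PATH.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # my_config.yaml is an assumed copy of config.yaml with pinned
    # pango_json and pango_growth_json values.
    run_notebook("my_config.yaml", dry_run=True)
```

Keeping dry_run=True by default means the script can be checked without triggering a full notebook run.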

Interactive plot of phenotypes

In addition to the CSV files in ./results/, running the notebook creates an interactive plot that allows you to look at scatter plots of the phenotypes for clades. That plot is placed in docs/index.html, and is rendered on GitHub Pages at https://jbloomlab.github.io/SARS2-spike-predictor-phenos/.

Importance of the randomized phenotypes

When making predictions, there is always a danger of overfitting or failing to account for phylogenetic correlations in a way that makes phenotypes seem more predictive of evolution than they really are. Therefore, the pipeline creates files (as described above) that randomize effects among mutations and generates clade phenotype predictions from these randomized data. You should always compare the accuracy of predictions made with the actual non-randomized data to the accuracy of those made with the randomized data: if the actual data are no more predictive than the randomized data, then you are likely overfitting or neglecting to account for phylogenetic correlations.
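
One way to make this comparison concrete is to score the actual data and each randomization with the same metric, then ask how often a randomization scores at least as well as the actual data. The sketch below is only an illustration under stated assumptions: it uses Pearson correlation between a clade phenotype and clade growth rate as a stand-in metric (your forecasting model will likely use something else), and all numbers in the demo are made-up toy values, not results from this repository.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def randomization_p_value(actual_score, randomized_scores):
    """Fraction of randomizations scoring >= the actual data.

    Uses the (k + 1) / (n + 1) correction so the value is never zero.
    """
    n_at_least = sum(1 for s in randomized_scores if s >= actual_score)
    return (n_at_least + 1) / (len(randomized_scores) + 1)


if __name__ == "__main__":
    # Toy values for illustration only.
    growth = [0.1, 0.4, 0.2, 0.8, 0.5]
    actual_phenotype = [1.0, 2.1, 1.4, 3.9, 2.6]
    randomized_phenotypes = {  # keyed by a hypothetical random seed
        1: [2.6, 1.0, 3.9, 1.4, 2.1],
        2: [1.4, 3.9, 2.6, 1.0, 2.1],
        3: [3.9, 2.6, 1.0, 2.1, 1.4],
    }
    actual = pearson_r(actual_phenotype, growth)
    randomized = [pearson_r(p, growth) for p in randomized_phenotypes.values()]
    print("actual r:", actual)
    print("randomized r:", randomized)
    print("empirical p:", randomization_p_value(actual, randomized))
```

A small empirical p-value suggests the actual phenotypes carry more signal than the randomized ones under this metric; a large one is the warning sign described above.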

Versioning

Each new run of this pipeline on GitHub has a tag indicating the date it was run as YYYY-MM-DD. In addition, the CHANGELOG describes updates such as adding new data.

Acknowledgments

This repository is maintained by Jesse Bloom.

Thanks to:

Cornelius Roemer for maintaining the Pango clade sequence definitions used by this repo.
The Bedford lab for maintaining the clade growth estimates used by this repo.
The generators of the data that go into the various spike phenotype predictors incorporated into this repo:
- Bernadeta Dadonaite and the Bloom lab for the full-spike deep mutational scanning:
  - Dadonaite et al (2023)
- The Starr lab for the RBD yeast-display deep mutational scanning of ACE2 affinity and RBD expression:
  - Taylor and Starr (2023)
- The Cao lab for the RBD yeast-display antibody-escape deep mutational scanning, which is incorporated in the Bloom lab antibody escape calculator:
- The Marks lab for the EVEscape values:
  - Thadani et al (2023)
