Data and scripts for COVID-19

This GitHub repo contains data and scripts relevant to COVID-19, which is the disease caused by the virus SARS-CoV-2. For a full description of our efforts, please see https://www.aicures.mit.edu/.

Note that since relatively little data for SARS-CoV-2 is available, most of the data in this repo is for SARS-CoV (responsible for the 2002/3 SARS outbreak) and other related coronaviruses. The hope is that models trained on this data will be able to retain their predictive ability on SARS-CoV-2.

Although the data contained in this repo can be used by any model, we have primarily been working with the message passing neural network model chemprop. Our trained models are available on http://chemprop.csail.mit.edu/predict and the predictions from these models on the Broad Repurposing Hub are available in predictions/.

data/

SARS-CoV data

AID1706_binarized_sars.csv - (N = 290,726; hits = 405) In-vitro assay that detects inhibition of SARS-CoV 3CL protease via fluorescence from PubChem AID1706.
evaluation_set_v2.csv - (N = 5,671; hits = 41) An evaluation set for SARS-CoV 3CL protease containing 41 experimentally validated hits along with 5,630 molecules from the Broad Repurposing Hub which are treated as non-hits. There is no overlap with AID1706_binarized_sars.csv.
AID1706_binarized_sars_full_eval_actives.csv - (N = 290,767; hits = 446) AID1706_binarized_sars.csv combined with the 41 validated hits from evaluation_set_v2.csv.
PLpro.csv - (N = 233,891; hits = 697) Bioassay that detects activity against SARS-CoV in yeast models via PL protease inhibition. Combines PubChem data from AID652038 and AID485353.
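
The N and hits figures quoted for each dataset can be re-derived with a few lines of Python. The sketch below is only illustrative: the label column name `activity` and the `smiles,activity` layout are assumptions, not something this README specifies, so check the header of the file you are counting.

```python
import csv
import io

def count_hits(csv_file, label_column="activity"):
    """Return (N, hits) for a binary-labelled CSV opened as a text file.

    label_column is an assumed name; pass the real header of your file.
    """
    reader = csv.DictReader(csv_file)
    n = hits = 0
    for row in reader:
        n += 1
        value = row[label_column].strip()
        # Labels may be written as "1" or "1.0"; blank labels are not hits.
        if value and float(value) == 1.0:
            hits += 1
    return n, hits

# Tiny in-memory stand-in for a real file such as data/mpro_xchem.csv.
example = io.StringIO("smiles,activity\nCCO,0\nc1ccccc1,1\nCC(=O)O,0\n")
print(count_hits(example))  # prints (3, 1)
```

For a file on disk, open it with `open(path, newline="")` and pass the handle in place of the `io.StringIO` object.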

SARS-CoV-2 data

mpro_xchem.csv - (N = 880; hits = 78) Fragments screened for 3CL protease binding using crystallography techniques. Data is sourced from the Diamond Light Source group.
amu_sars_cov_2_in_vitro.csv - (N = 1,484; hits = 88) FDA-approved compounds screened against SARS-CoV-2 in vitro. Data is sourced from the paper "In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication".
ellinger.csv - (N = 5,632; hits = 67) Compounds screened against SARS-CoV-2 in vitro. Data is sourced from the paper "Identification of inhibitors of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells using a large scale drug repurposing collection".

Data extracted from literature

corona_literature_idex.csv - (N = 101) FDA-approved drugs that are mentioned in generic coronavirus literature. Drug to SMILES mapping is generated through the PubChem idex service and may contain multiple SMILES for generic drug names. These are not guaranteed to be effective against any targets; they simply appear in the literature.

Catalogues of drugs that can be screened for repurposing

broad_repurposing_library.csv - (N = 6,111) Compounds from the Broad Repurposing Hub, many of which are FDA-approved.
external_library.csv - (N = 861) A set of FDA-approved drugs.
expanded_external_library.csv - (N = 2,661) A larger set of FDA-approved drugs, but not a strict superset of external_library.csv.

Other property prediction data

ecoli.csv - (N = 2,335; hits = 120) Compounds which have been screened for inhibitory activity against E. coli, from the paper "A Deep Learning Approach to Antibiotic Discovery".

splits.zip

Contains train/dev/test splits (using a scaffold split) of some of the above datasets for benchmarking purposes.
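
For readers unfamiliar with scaffold splits, the idea is that all molecules sharing a scaffold land in the same subset, so the dev and test sets contain chemotypes unseen during training. The sketch below is a simplified, deterministic variant and not chemprop's own implementation. It assumes the scaffold key of each molecule has already been computed (for example with RDKit's Murcko scaffold utilities); the function name and the 80/10/10 fractions are illustrative assumptions.

```python
from collections import defaultdict

def scaffold_split(scaffolds, fractions=(0.8, 0.1, 0.1)):
    """Split molecule indices so that no scaffold spans two subsets.

    scaffolds[i] is the precomputed scaffold key of molecule i.
    Returns (train, dev, test) lists of indices.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Place the largest scaffold groups first so they fill the training set.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffolds)
    train_cap = fractions[0] * n
    dev_cap = fractions[1] * n
    train, dev, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(dev) + len(group) <= dev_cap:
            dev.extend(group)
        else:
            test.extend(group)
    return train, dev, test

# Hypothetical scaffold keys for ten molecules.
keys = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, dev, test = scaffold_split(keys)
print(len(train), len(dev), len(test))
```

Because whole groups are assigned at once, the realized subset sizes only approximate the requested fractions.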

raw_data/

Original raw data files and format conversions.

predictions/

Predictions made by trained models on some of the repurposing datasets. See the README inside the predictions/ directory for details.

plots/

t-SNE plots comparing the datasets. Note that in the plots, "sars_pos" and "sars_neg" refer to any hits or non-hits, respectively, across both AID1706_binarized_sars.csv and PLpro.csv.

conversions/

Files for converting between SMILES, CID, and name. Obtained from https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi

similarity_computations/

The nearest neighbor computations from each test set to the training set.
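
As a rough illustration of what a nearest neighbor computation between a test molecule and the training set looks like, here is a minimal sketch. This README does not state which similarity measure or fingerprint was used, so Tanimoto similarity over sets of fingerprint on-bit indices is an assumption, and both function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two collections of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def nearest_training_neighbor(test_fp, train_fps):
    """Return (index, similarity) of the training fingerprint closest to test_fp."""
    best_index, best_similarity = -1, -1.0
    for index, train_fp in enumerate(train_fps):
        similarity = tanimoto(test_fp, train_fp)
        if similarity > best_similarity:
            best_index, best_similarity = index, similarity
    return best_index, best_similarity

# Hypothetical fingerprints, each given as the set of bits that are switched on.
train_fps = [{1, 2, 3}, {2, 3, 4, 5}, {9}]
print(nearest_training_neighbor({2, 3, 4}, train_fps))  # prints (1, 0.75)
```

A low maximum similarity for many test molecules suggests the test set is chemically distant from the training data.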

scripts/

Various data processing scripts for reuse/reproducibility.

statistics/

Statistics about overlap between the SMILES strings of various datasets.
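
An overlap statistic of this kind can be sketched as a set intersection over SMILES strings. The function name and return format below are illustrative assumptions rather than what the scripts in this repo do. Note that the comparison is on exact strings: two different SMILES spellings of the same molecule will not match unless both datasets were canonicalized the same way first.

```python
def smiles_overlap(smiles_a, smiles_b):
    """Count unique SMILES strings shared by, and exclusive to, two datasets."""
    set_a, set_b = set(smiles_a), set(smiles_b)
    return {
        "shared": len(set_a & set_b),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
    }

# Hypothetical SMILES lists standing in for two of the CSV files.
library = ["CCO", "c1ccccc1", "CC(=O)O", "CCO"]
screen = ["c1ccccc1", "CCN"]
print(smiles_overlap(library, screen))
```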

interpretation/

t-SNE plots of chemical rationales extracted (using this code) from a model trained on the combined AID1706 and PLpro datasets.

old/

Older versions of files from when we combined AID1706 data with other data that was unhelpful.

Experiment Commands

These commands are for running experiments using chemprop and should be run from the main directory in the chemprop repo. You may need to modify some paths depending on your directory structure. The commands below assume you are using AID1706_binarized_sars.csv but can be modified to work with any of the datasets.

Generating RDKit features

To speed up experiments, you can pre-generate RDKit features using the script save_features.py in chemprop/scripts. The ../../ relative paths in the command below assume you run it from inside that directory:

python save_features.py \
    --data_path ../../coronavirus_data/data/AID1706_binarized_sars.csv \
    --save_path ../../coronavirus_data/features/AID1706_binarized_sars.npz \
    --features_generator rdkit_2d_normalized

By default this will run feature generation using parallel processing. On occasion the parallel processing gets stuck near the end of feature generation, so if this happens, just kill the process and restart with the --sequential flag. This will pick up where the parallel version stopped and will finish correctly.

Training and testing

python train.py \
    --data_path ../coronavirus_data/data/AID1706_binarized_sars.csv \
    --dataset_type classification \
    --save_dir ../coronavirus_data/ckpt/AID1706_binarized_sars \
    --features_path ../coronavirus_data/features/AID1706_binarized_sars.npz \
    --no_features_scaling \
    --split_type scaffold_balanced \
    --quiet

The data splitting mechanism in chemprop is seeded so that this will reproduce the same train/dev/test split as in splits.zip.

Class balance

To run experiments with class balance, switch to the class_weights branch of chemprop (git checkout class_weights) and add the --class_balance flag. This will train with an equal number of positives and negatives in each batch.
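
To make the idea of balanced batches concrete, here is a minimal sketch of a sampler that puts equal numbers of positives and negatives in every batch. It is not the code from the class_weights branch. Oversampling the positives with replacement is an assumption about how a rare class could be stretched across batches, and the function name is hypothetical.

```python
import random

def balanced_batches(labels, batch_size, seed=0):
    """Yield lists of indices with batch_size // 2 positives and negatives each."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    if batch_size < 2 or batch_size % 2:
        raise ValueError("batch_size must be an even number >= 2")

    rng = random.Random(seed)
    half = batch_size // 2
    rng.shuffle(negatives)
    # Walk through the negatives once; leftover negatives that cannot fill
    # half a batch are dropped.
    for start in range(0, len(negatives) - half + 1, half):
        batch_negatives = negatives[start:start + half]
        # Assumption: positives are rare, so draw them with replacement.
        batch_positives = rng.choices(positives, k=half)
        yield batch_positives + batch_negatives

# Hypothetical labels: 3 hits among 23 molecules.
labels = [1] * 3 + [0] * 20
for batch in balanced_batches(labels, batch_size=4):
    assert sum(labels[i] for i in batch) == 2
```

With a hit rate as low as that of AID1706_binarized_sars.csv (405 of 290,726), each positive is reused many times per pass over the negatives under this scheme.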

Multi-task training for SARS-CoV-2 3CLpro and SARS-CoV 3CLpro

This experiment combines data on the 3CLpro target from SARS-CoV-2 (mpro_xchem.csv) and SARS-CoV (AID1706_binarized_sars.csv).

5-fold cross validation performance is 0.850 +/- 0.022.

python multitask.py \
    --data_path data/mpro_xchem.csv \
    --source_data_path data/AID1706_binarized_sars.csv \
    --dataset_type classification \
    --save_dir ckpt/
