A SARS-CoV-2 Drug Disease Gene Association Dataset


Inspired by the White House Office of Science and Technology Policy's machine-readable coronavirus literature collection⁴, we noticed a lack of high-quality, machine-readable datasets containing drug, gene, and disease associations for SARS-CoV-2. We decided to build and release an open-source dataset so the community can explore two key questions:

  1. What existing drugs have interaction potential with SARS-CoV-2?

  2. What diseases and symptoms overlap with target genes that have interaction potential with SARS-CoV-2?

To address these questions, we compiled an integrated dataset from the following publications and resources:

  • A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing¹

    • The authors cloned, tagged, and expressed 26 of the 29 SARS-CoV-2 proteins in human cells, then used affinity-purification mass spectrometry to identify the human proteins physically associated with each viral protein
  • The Drug Repurposing Hub: a next-generation drug library and information resource²

    • Hand-curated collection of 4,707 compounds with literature-reported gene targets
  • The DisGeNET knowledge platform for disease genomics: 2019 update³

    • Standardized database of disease-associated genes and phenotypes from multiple sources
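The way the three sources fit together can be sketched as a join on the human gene symbol. The snippet below is a minimal illustration and not the pipeline used to build the dataset: the records are made-up toy values, and the field names `Drug`, `Target`, and `Gene`, along with the helpers `index_by` and `integrate`, are hypothetical. It assumes a left join, so every bait-prey interaction is kept even when no drug or disease matches.

```python
from collections import defaultdict

# Toy records standing in for the three sources. All values are made up
# for illustration and are NOT taken from the real dataset.
interactions = [  # source 1: viral bait -> human prey gene
    {"Bait": "BAIT_A", "PreyGene": "GENE1"},
    {"Bait": "BAIT_B", "PreyGene": "GENE2"},
]
drug_targets = [  # source 2: compound -> literature-reported gene target
    {"Drug": "drug_x", "Target": "GENE1"},
    {"Drug": "drug_y", "Target": "GENE3"},
]
gene_diseases = [  # source 3: gene -> associated disease
    {"Gene": "GENE1", "diseaseName": "disease_p"},
    {"Gene": "GENE2", "diseaseName": "disease_q"},
]


def index_by(records, key):
    """Group records by the value stored under `key`."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec[key]].append(rec)
    return grouped


def integrate(interactions, drug_targets, gene_diseases):
    """Left-join drugs and diseases onto each bait-prey interaction."""
    drugs_by_gene = index_by(drug_targets, "Target")
    diseases_by_gene = index_by(gene_diseases, "Gene")
    rows = []
    for inter in interactions:
        gene = inter["PreyGene"]
        # A single empty placeholder keeps genes that have no match.
        drugs = drugs_by_gene.get(gene) or [{"Drug": None}]
        diseases = diseases_by_gene.get(gene) or [{"diseaseName": None}]
        for drug in drugs:
            for disease in diseases:
                rows.append({**inter,
                             "Drug": drug["Drug"],
                             "diseaseName": disease["diseaseName"]})
    return rows


rows = integrate(interactions, drug_targets, gene_diseases)
```

In this toy run, `drug_y` is dropped because its target is not a prey gene, while `GENE2` is kept with an empty drug field.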

Dataset Description

Our integrated dataset links 332 human genes, whose encoded proteins interact with tagged SARS-CoV-2 proteins in human cells, to 181 drugs or chemical compounds (with their SMILES strings) and 953 diseases. The columns of the integrated dataset are described below. The superscript on each column name indicates the source dataset, numbered as in the reference list.

The data is made available as a single, integrated zipped .csv file.
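As a rough sketch of reading such a file with only the Python standard library, the function below opens a .zip archive, checks that it contains exactly one .csv, and returns the rows as dictionaries keyed by column header. The function name `load_rows` is hypothetical, and the file encoding (UTF-8, optionally with a byte-order mark) is an assumption about the released file.

```python
import csv
import io
import zipfile


def load_rows(zip_path):
    """Read the single .csv inside a .zip archive into a list of dicts."""
    with zipfile.ZipFile(zip_path) as archive:
        csv_names = [name for name in archive.namelist()
                     if name.lower().endswith(".csv")]
        if len(csv_names) != 1:
            raise ValueError(f"expected exactly one .csv, found {csv_names}")
        with archive.open(csv_names[0]) as raw:
            # Assumption: the file is UTF-8; utf-8-sig also tolerates a BOM.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            # Materialize the rows before the archive is closed.
            return list(csv.DictReader(text))
```

Calling `load_rows` with the path of the downloaded archive yields one dictionary per row. All values arrive as strings, so numeric columns need an explicit conversion before comparison.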

| Column Name | Description |
| --- | --- |
| Bait-Prey Information (Bait)¹ | Tagged viral protein expressed as bait |
| Preys¹ | Identifier for the prey gene |
| PreyGene¹ | Gene corresponding to the human protein identified as an interactor with a tagged viral protein. This is the key join column that links all of the drug, gene, protein, and disease information together. |
| MIST¹ | Protein-protein interaction score as computed by MiST (Mass spectrometry interaction STatistics) |
| Saint_BFDR¹ | False discovery rate as computed by SAINTexpress (Significance Analysis of INTeractome software) |
| AvgSpec¹ | Average spectral counts |
| Uniprot Protein Description¹ | Description of the host protein from the UniProt database |
| Uniprot Protein Function¹ | Function of the host protein from the UniProt database |
| Structures (PDB)¹ | Structures available in the Protein Data Bank (PDB), a three-dimensional structural database for large biological molecules |
| Uniprot Function in Disease¹ | Function in disease of the host protein from the UniProt database |
| Drug²/Compound¹ | Name of the drug or compound |
| Drug Status² | Drug development stage (e.g., Preclinical, Clinical Trial, Approved) |
| Pubchem_cid² | PubChem compound ID (CID) of the drug |
| SMILES¹,² | SMILES string of the drug or compound |
| ZINC_ID¹ | ZINC database ID |
| Broad_id² | Broad Institute identifier for the drug |
| Pert_iname² | The internal (CMap-designated) name of a perturbagen. By convention, CMap uses the HUGO gene symbol for genetic perturbations. |
| Purity² | Purity of the drug |
| Expected mass² | Expected mass of the drug |
| InChIKey² | Hashed International Chemical Identifier (InChIKey) of the compound |
| Pathway¹ | Name of the pathway associated with the gene |
| DSI³ | Disease specificity index |
| DPI³ | Disease pleiotropy index |
| diseaseName³ | Name of the disease |
| diseaseType³ | DisGeNET disease type: disease, phenotype, or group |
| diseaseSemanticType³ | UMLS Semantic Type(s) of the disease |
| Score³ | DisGeNET score for the gene-disease association |
| CAS number⁵ | Unique numerical identifier assigned by the Chemical Abstracts Service (CAS) to every chemical substance described in the open scientific literature |
| year of approval⁵ | Year the drug was approved |
| Status⁵ | Label indicating whether the drug is discontinued |
| routes of administration⁵ | Route of administration |
| Volume of distribution (VD)⁵ | Distribution volume in liters |
| Clearance (Cl)⁵ | Volume of blood, serum, or plasma completely cleared of drug per unit of time (liters/hour) |
| Plasma Protein Binding (PPB)⁵ | Plasma protein binding assay |
| Half-life (t1/2)⁵ | Half-life in hours |
| F⁵ | Bioavailability as a percentage |
| Cmax⁵ | Maximum (or peak) serum concentration |
| Cmax Unit⁵ | Units of the maximum (or peak) serum concentration |
| Tmax⁵ | Time at which Cmax is attained, in hours |
| comment on solubility⁵ | Solubility of the drug |
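The two key questions can be approached by filtering rows on the interaction scores and then grouping on the join column. The sketch below works under several assumptions: the rows are dictionaries of strings, the header names `PreyGene`, `MIST`, `Saint_BFDR`, and `diseaseName` match the table above, the drug column is called `Drug` (the exact header in the released file may differ), and the cutoffs are illustrative parameters rather than recommended values. The toy rows are made up, and the helpers `confident` and `group_values` are hypothetical.

```python
from collections import defaultdict


def confident(rows, min_mist, max_bfdr):
    """Keep rows whose MiST score and SAINTexpress BFDR pass the cutoffs."""
    kept = []
    for row in rows:
        try:
            mist, bfdr = float(row["MIST"]), float(row["Saint_BFDR"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric scores
        if mist >= min_mist and bfdr <= max_bfdr:
            kept.append(row)
    return kept


def group_values(rows, key_col, value_col):
    """Map each key to the sorted distinct non-empty values seen with it."""
    grouped = defaultdict(set)
    for row in rows:
        key, value = row.get(key_col), row.get(value_col)
        if key and value:
            grouped[key].add(value)
    return {key: sorted(values) for key, values in grouped.items()}


# Toy rows with made-up values, shaped like rows read from the .csv file.
rows = [
    {"PreyGene": "GENE1", "MIST": "0.95", "Saint_BFDR": "0.00",
     "Drug": "drug_x", "diseaseName": "disease_p"},
    {"PreyGene": "GENE1", "MIST": "0.95", "Saint_BFDR": "0.00",
     "Drug": "drug_x", "diseaseName": "disease_q"},
    {"PreyGene": "GENE2", "MIST": "0.40", "Saint_BFDR": "0.20",
     "Drug": "drug_y", "diseaseName": "disease_p"},
]

kept = confident(rows, min_mist=0.7, max_bfdr=0.05)  # illustrative cutoffs
drugs_by_gene = group_values(kept, "PreyGene", "Drug")            # question 1
genes_by_disease = group_values(kept, "diseaseName", "PreyGene")  # question 2
```

Grouping on sets avoids double counting, since a gene linked to several diseases repeats the same drug across multiple rows.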

Future Work and Next Steps

We are looking for additional collaborators who are interested in a targeted expansion of this dataset to other data related to SARS-CoV-2 and therapeutic candidates. Specifically, the next expansion will add homologous SARS-CoV genes and their interactions with host proteins, which will allow us to compare the specificity of prey genes and potential therapeutics between SARS-CoV and SARS-CoV-2. We will also curate additional disease-symptom associations to discover molecular links to the wide range of symptoms across hosts.

This dataset will be versioned and regularly updated as new information about SARS-CoV-2 becomes available.


1. Gordon, David E., et al. "A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing." BioRxiv (2020).

2. Corsello, Steven M., et al. "The Drug Repurposing Hub: a next-generation drug library and information resource." Nature Medicine 23.4 (2017): 405-408.

3. Piñero, Janet, et al. "The DisGeNET knowledge platform for disease genomics: 2019 update." Nucleic Acids Research 48.D1 (2020): D845-D855.

4. "CORD-19: Semantic Scholar." CORD-19 | Semantic Scholar,

5. Douguet, D. "Data sets representative of the Structures and Experimental Properties of FDA-approved Drugs." ACS Med Chem Lett. 9.3 (2018): 204-209. doi: 10.1021/acsmedchemlett.7b00462. Accessed April 15, 2020.