A SARS-CoV-2 Drug Disease Gene Association Dataset
Inspired by the White House Office of Science and Technology Policy's machine-readable coronavirus literature collection [4], we noticed a lack of high-quality, machine-readable datasets containing drug, gene, and disease associations for SARS-CoV-2. We decided to build and release an open-source dataset so that the community can explore two key questions:
What existing drugs have interaction potential with SARS-CoV-2?
What diseases and symptoms overlap with target genes that have interaction potential with SARS-CoV-2?
To address these questions, we compiled an integrated dataset from the following publications and resources:
A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing [1]
- Tagged, cloned, and expressed 26 of 29 SARS-CoV-2 proteins in human cells. Human proteins physically associated with the viral proteins were identified using affinity-purification mass spectrometry.
The Drug Repurposing Hub: a next-generation drug library and information resource [2]
- Hand-curated collection of 4,707 compounds with literature-reported gene targets.
The DisGeNET knowledge platform for disease genomics: 2019 update [3]
- Standardized database of disease-associated genes and phenotypes from multiple sources.
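Because all three resources can be keyed on a human gene symbol, one way such an integration could be assembled is a sequence of joins on that symbol. The sketch below uses pandas with tiny placeholder frames. The frame names, the `target` and `geneSymbol` column names, and every value are illustrative assumptions, not the actual pipeline or data behind this dataset.

```python
import pandas as pd

# Placeholder stand-ins for the three sources (all values are invented).
ppi = pd.DataFrame({
    "Bait": ["viral_protein_1", "viral_protein_2"],
    "PreyGene": ["GENE_A", "GENE_B"],
})
drugs = pd.DataFrame({
    "Pert_iname": ["drug_x", "drug_y"],
    "target": ["GENE_A", "GENE_C"],      # hypothetical column name
})
diseases = pd.DataFrame({
    "geneSymbol": ["GENE_A", "GENE_B"],  # hypothetical column name
    "diseaseName": ["disease_1", "disease_2"],
})

# Left joins keep every prey gene, even when no drug or disease is known.
integrated = (
    ppi
    .merge(drugs, how="left", left_on="PreyGene", right_on="target")
    .merge(diseases, how="left", left_on="PreyGene", right_on="geneSymbol")
    .drop(columns=["target", "geneSymbol"])
)
print(integrated)
```

Left joins are used here so that a prey gene with no known drug still appears in the result with a missing value; an inner join would silently drop it.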
Our integrated dataset links 332 human genes encoding proteins that interact with tagged SARS-CoV-2 proteins in human cells to 181 drug/chemical compounds (with associated SMILES) and 953 diseases. The columns of the integrated dataset are described below. The source of each column is identified by its reference number (1, 2, 3, or 5).
The data is made available as a single, integrated zipped .csv file.
|Column|Source|Description|
|---|---|---|
|Bait-Prey Information (Bait)|1|Tagged viral expressed protein|
|Preys|1|Identifier for the prey gene|
|PreyGene|1|Gene corresponding to the associated human protein identified as an interactor with a tagged viral protein. This is a key join column that links all of the drug, gene, protein, and disease information together.|
|MIST|1|Protein-protein interaction score as computed by MiST (Mass spectrometry interaction STatistics)|
|Saint_BFDR|1|False discovery rate as computed by SAINTexpress (Significance Analysis of INTeractome software)|
|AvgSpec|1|Average spectral counts|
|Uniprot Protein Description|1|Description of host protein from Uniprot database|
|Uniprot Protein Function|1|Function of host protein from Uniprot database|
|Structures (PDB)|1|Structures from the Protein Data Bank (PDB), a three-dimensional structural database for large biological molecules|
|Uniprot Function in Disease|1|Function in disease of host protein from Uniprot database|
|Drug/Compound|Drug: 2; Compound: 1|Name of drug or compound|
|Drug Status|2|Drug development stage (e.g., Preclinical, Clinical Trial, Approved)|
|Pubchem_cid|2|Pubchem ID of drug|
|SMILES|1, 2|SMILES of drug or compound|
|ZINC_ID|1|ZINC database ID|
|Broad_id|2|Broad Institute identifier for the drug|
|Pert_iname|2|The internal (CMap-designated) name of a perturbagen. By convention, for genetic perturbations CMap uses the HUGO gene symbol.|
|Purity|2|Purity of drug|
|Expected mass|2|Expected mass of drug|
|InChIKey|2|Hashed International Chemical Identifier (InChIKey) of compound|
|Pathway|1|Name of associated pathway from gene|
|DSI|3|Disease specificity index|
|DPI|3|Disease pleiotropy index|
|diseaseName|3|Name of the disease|
|diseaseType|3|DisGeNET disease type: disease, phenotype, or group|
|diseaseSemanticType|3|UMLS Semantic Type(s) of the disease|
|Score|3|DisGeNET score for the gene-disease association|
|CAS number|5|Unique numerical identifier assigned by the Chemical Abstracts Service (CAS) to every chemical substance described in the open scientific literature|
|year of approval|5|Year the drug was approved|
|Status|5|Label indicating whether the drug is discontinued|
|routes of administration|5|Route of administration|
|Volume of distribution (VD)|5|Distribution volume in liters|
|Clearance (Cl)|5|Volume of blood, serum, or plasma completely cleared of drug per unit of time (liters/hour)|
|Plasma Protein Binding (PPB)|5|Plasma protein binding assay|
|Half-life (t1/2)|5|Half-life in hours|
|F|5|Bioavailability as a percentage|
|Cmax|5|Maximum (or peak) serum concentration|
|Cmax Unit|5|Units of maximum (or peak) serum concentration|
|Tmax|5|Time at which Cmax is attained, in hours|
|comment on solubility|5|Solubility of drug|
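As a rough illustration of how the columns above might be used to approach the two key questions, the sketch below groups drugs and diseases by `PreyGene` with pandas. It assumes the CSV headers match the column names listed above without the reference numbers (for example `Drug/Compound` and `Saint_BFDR`), which should be checked against the actual file. The function names, the cutoff values, and the toy rows are hypothetical.

```python
import pandas as pd

def load_dataset(path):
    """Read the zipped CSV; pandas can open a zip archive holding one file."""
    return pd.read_csv(path, compression="zip")

def candidate_drugs(df, max_bfdr=0.05):
    """Question 1: drugs linked to prey genes with confident interactions.

    max_bfdr is a hypothetical cutoff, not one prescribed by the dataset.
    """
    confident = df[df["Saint_BFDR"] <= max_bfdr]
    pairs = confident[["PreyGene", "Drug/Compound"]].dropna().drop_duplicates()
    return pairs.groupby("PreyGene")["Drug/Compound"].apply(sorted).to_dict()

def associated_diseases(df, min_score=0.3):
    """Question 2: diseases tied to each prey gene above a score cutoff.

    min_score is likewise a hypothetical cutoff on the DisGeNET score.
    """
    strong = df[df["Score"] >= min_score]
    pairs = strong[["PreyGene", "diseaseName"]].dropna().drop_duplicates()
    return pairs.groupby("PreyGene")["diseaseName"].apply(sorted).to_dict()

# Toy rows with placeholder values, standing in for the real file.
toy = pd.DataFrame({
    "PreyGene": ["GENE_A", "GENE_A", "GENE_B", "GENE_B"],
    "Saint_BFDR": [0.0, 0.0, 0.2, 0.2],
    "Drug/Compound": ["drug_x", "drug_y", "drug_z", None],
    "diseaseName": ["disease_1", "disease_1", "disease_2", "disease_3"],
    "Score": [0.6, 0.6, 0.1, 0.4],
})

drug_map = candidate_drugs(toy)
disease_map = associated_diseases(toy)
print(drug_map)
print(disease_map)
```

With the real file, `load_dataset` would replace the toy frame. Because the integrated table repeats each gene once per drug and disease combination, the duplicate pairs are dropped before grouping.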
Future Work and Next Steps
We are looking for additional collaborators who are interested in a targeted expansion of this dataset to other data related to SARS-CoV-2 and therapeutic candidates. Specifically, this dataset will next be expanded to include homologous SARS-CoV genes and their interactions with host proteins, which will allow us to compare the specificity of prey genes and potential therapeutics between SARS-CoV and SARS-CoV-2. We will also curate additional disease-symptom associations to discover molecular links to the wide range of symptoms across hosts.
This dataset will be versioned and regularly updated as new information about SARS-CoV-2 becomes available.
1. Gordon, David E., et al. "A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing." bioRxiv (2020).
2. Corsello, Steven M., et al. "The Drug Repurposing Hub: a next-generation drug library and information resource." Nature medicine 23.4 (2017): 405-408.
3. Piñero, Janet, et al. "The DisGeNET knowledge platform for disease genomics: 2019 update." Nucleic acids research 48.D1 (2020): D845-D855.
4. "CORD-19: Semantic Scholar." CORD-19 | Semantic Scholar, pages.semanticscholar.org/coronavirus-research.
5. Douguet, D. "Data Sets Representative of the Structures and Experimental Properties of FDA-Approved Drugs." ACS medicinal chemistry letters 9.3 (2018): 204-209. doi: 10.1021/acsmedchemlett.7b00462. Accessed April 15, 2020. https://chemoinfo.ipmc.cnrs.fr/TMP/tmp.32127/e-Drug3D_1930_PK_v2.txt