In this script, we will reconcile the database [of Vaccine Candidates against COVID-19 of the Milken Institute](https://covid-19tracker.milkeninstitute.org/#about) to Wikidata.

First, we will just check the info for a vaccine candidate already on Wikidata: [Ad5-nCoV](https://www.wikidata.org/wiki/Q96695265)


In [1]:
import pandas as pd

df = pd.read_csv("COVID-19 Tracker-Treatments and Vaccines.csv")

The database contains treatments and vaccines. For now, we are only interested in vaccine candidates. Let's filter.

In [3]:
vax_df = df[df["Treatment vs. Vaccine"] == "Vaccine"]

In [4]:
vax_df.head(3)

Unnamed: 0,Developer / Researcher,Treatment vs. Vaccine,Product Category,Stage of Development,Anticipated Next Steps,Product Description,Clinical Trials for COVID-19,Funder,Published Results,Clinical Trials for Other Diseases (T only) / Related Use or Platform (V only),FDA-Approved Indications,Sources,Date Last Updated
261,Scancell/ University of Nottingham,Vaccine,DNA-based,Pre-clinical,Phase I to start in Q1 2021,DNA,,Unknown,,Same platform as vaccine candidates for cancer,,https://docs.google.com/document/d/1Y4nCJJ4njz...,4/28/2020
262,Entos Pharmaceuticals/ Cytiva,Vaccine,DNA-based,Pre-clinical,Phase I/II to start in late July 2020,DNA; Covigenix,,Canadian Institutes of Health Research (CIHR)/...,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/1/2020
263,BioNet Asia,Vaccine,DNA-based,Pre-clinical,Unknown,DNA,,Unknown,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,4/20/2020


Apparently the names are in the "Product Description" column. Maybe we can find Ad5-nCoV there.

In [5]:
list(set(vax_df["Product Description"]))[0:5]

['Non-replicating viral vector; MVA-S encoded',
 'Avian paramyxovirus vector (APMV)',
 'LV-SMENP-DC Dendritic cells modified with lentiviral vector expressing synthetic minigene based on domains of selected viral proteins; administered with antigen-specific cytotoxic T lymphocytes',
 'Non-replicating viral vector; dendritic cell-based vaccine',
 'Protein subunit, recombinant S1-Fc fusion protein']

Hmm, the names are a bit messy: they are mixed with descriptions. Let's look at them only for the candidates that have clinical trials listed.

In [6]:
vax_df_with_clinical_trials = vax_df.dropna(subset = ["Clinical Trials for COVID-19"])

In [7]:
vax_df_with_clinical_trials.head()

Unnamed: 0,Developer / Researcher,Treatment vs. Vaccine,Product Category,Stage of Development,Anticipated Next Steps,Product Description,Clinical Trials for COVID-19,Funder,Published Results,Clinical Trials for Other Diseases (T only) / Related Use or Platform (V only),FDA-Approved Indications,Sources,Date Last Updated
264,Inovio Pharmaceuticals/Beijing Advaccine Biote...,Vaccine,DNA-based,Phase I,"Phase I initial data released June 30, 2020; P...",DNA plasmid vaccine with electroporation; INO-...,NCT04336410,Coalition for Epidemic Preparedness (CEPI) / G...,Inovio (http://ir.inovio.com/news-releases/new...,Same platform as multiple vaccine candidates,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/1/2020
272,"Genexine Consortium (GenNBio, International Va...",Vaccine,DNA-based,Phase I,Phase I to start in July 2020,DNA vaccine (GX-19),NCT04445389,Unknown,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/1/2020
276,Beijing Institute of Biological Products/ Sino...,Vaccine,Inactivated virus,Phase II,"In Phase II, June 2020",Inactivated,ChiCTR2000032459,Unknown,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,6/12/2020
277,Wuhan Institute of Biological Products/ Sinopharm,Vaccine,Inactivated virus,Phase II,Phase III trial approved to start in United Ar...,Inactivated,ChiCTR2000031809,Unknown,(https://www.cnbg.com.cn/content/details_12_55...,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,6/25/2020
279,"Institute of Medical Biology, Chinese Academy ...",Vaccine,Inactivated virus,Phase II,Phase II began June 2020,Inactivated,NCT04412538,Unknown,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,6/23/2020


I believe I will have to manually add some vaccine labels, but some info is present. Let's look at it again:

In [8]:
list(set(vax_df_with_clinical_trials["Product Description"]))

['RNA; mRNA',
 'DNA plasmid vaccine with electroporation; INO-4800',
 'Non-replicating viral vector; Adenovirus Type 5 vector (Ad5-nCoV)',
 'DNA vaccine (GX-19)',
 'RNA; LNP-encapsulated mRNA (mRNA 1273)',
 '3 LNP-mRNAs; BNT162 \n',
 'CoronaVac; Inactivated (inactivated + alum); PiCoVacc',
 'Non-replicating viral vector; AZD 1222 (formerly ChAdOx1)',
 'RNA; LNP-nCoVsaRNA',
 'LV-SMENP-DC Dendritic cells modified with lentiviral vector expressing synthetic minigene based on domains of selected viral proteins; administered with antigen-specific cytotoxic T lymphocytes',
 'Artificial antigen-presenting cells modified with lentiviral vector expressing synthetic minigene based on domains of selected viral proteins',
 'Adjuvanted recombinant protein (RBD-Dimer)',
 'mRNA',
 'Adeno-based',
 'Inactivated',
 'Protein subunit; NVX-CoV2373; Full-length recombinant SARS COV-2 glycoprotein nanoparticle vaccine adjuvanted with Matrix M',
 'Protein subunit, native like trimeric subunit spike protein']
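Several of these descriptions carry the candidate name in parentheses, for example `(Ad5-nCoV)` and `(GX-19)`. As a rough sketch, a regular expression can pull out the parenthesised part. The helper name `extract_parenthesised_name` is hypothetical, and the "name is in the last pair of parentheses" rule is only a heuristic assumed here, not something the dataset guarantees.

```python
import re


def extract_parenthesised_name(description):
    """Return the text inside the last pair of parentheses, or None."""
    matches = re.findall(r"\(([^()]+)\)", description)
    return matches[-1].strip() if matches else None


examples = [
    "Non-replicating viral vector; Adenovirus Type 5 vector (Ad5-nCoV)",
    "DNA vaccine (GX-19)",
    "RNA; mRNA",
]
for text in examples:
    # Prints Ad5-nCoV, GX-19 and None for the three examples.
    print(extract_parenthesised_name(text))
```

The heuristic misfires on entries such as `'Non-replicating viral vector; AZD 1222 (formerly ChAdOx1)'`, where it returns `formerly ChAdOx1`, so the extracted labels would still need a manual check.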

In [9]:
ad5_row = vax_df_with_clinical_trials[vax_df_with_clinical_trials["Product Description"] == 'Non-replicating viral vector; Adenovirus Type 5 vector (Ad5-nCoV)']
ad5_row

Unnamed: 0,Developer / Researcher,Treatment vs. Vaccine,Product Category,Stage of Development,Anticipated Next Steps,Product Description,Clinical Trials for COVID-19,Funder,Published Results,Clinical Trials for Other Diseases (T only) / Related Use or Platform (V only),FDA-Approved Indications,Sources,Date Last Updated
294,CanSino Biologics/Beijing Institute of Biotech...,Vaccine,Non-replicating viral vector,Phase II,Phase II started April 2020; initial results r...,Non-replicating viral vector; Adenovirus Type ...,NCT04313127 ChiCTR2000030906 ChiCTR2000031781 ...,Unknown,The Lancet (https://www.thelancet.com/journals...,Same platform as vaccine candidates for EBOV,,https://docs.google.com/document/d/1Y4nCJJ4njz...,6/1/2020


Now let's think about how to add the information to Wikidata. 

[developer (P178)](https://www.wikidata.org/wiki/Property:P178)

[vaccine for (P1924)](https://www.wikidata.org/wiki/Property:P1924)

The product category is currently represented in "instance of" statements. In this case, it would be an instance of [adenovirus-based vaccine (Q96841548)](https://www.wikidata.org/wiki/Q96841548)

I am not sure how to link the vaccine to the clinical trials that describe it. The ID properties are used on the clinical trial items themselves: [NCT ID (P3098)](https://www.wikidata.org/wiki/Property:P3098) and [Chinese Clinical Trial Registry ID (P8064)](https://www.wikidata.org/wiki/Property:P8064).

The most likely option is to add the vaccine as a main subject of each clinical trial item.
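The "Clinical Trials for COVID-19" cell mixes identifiers from both registries, separated by spaces. A minimal sketch of sorting them by property follows. It assumes that NCT IDs look like `NCT` plus eight digits and that the registry IDs in this file look like `ChiCTR` plus digits; `classify_trial_ids` is a hypothetical helper. Once a trial item is found through one of these IDs, the link to the vaccine could be a main subject (P921) statement on the trial item.

```python
import re

# Assumed ID shapes; tokens that match neither pattern are ignored.
REGISTRY_PATTERNS = {
    "P3098": re.compile(r"^NCT\d{8}$"),
    "P8064": re.compile(r"^ChiCTR\d+$"),
}


def classify_trial_ids(cell):
    """Return (property, trial ID) pairs for each recognised token."""
    pairs = []
    for token in cell.split():
        for prop, pattern in REGISTRY_PATTERNS.items():
            if pattern.match(token):
                pairs.append((prop, token))
    return pairs


print(classify_trial_ids("NCT04313127 ChiCTR2000030906 ChiCTR2000031781"))
# [('P3098', 'NCT04313127'), ('P8064', 'ChiCTR2000030906'), ('P8064', 'ChiCTR2000031781')]
```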

Published results can be reconciled to Wikidata items and added as [described by source (P1343)](https://www.wikidata.org/wiki/Property:P1343) statements.




In [71]:
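import time

import requests


def get_first_wikidata_match(word_to_search):
    """Return the QID of the first Wikidata search hit, or "Not found"."""
    time.sleep(0.3)  # short pause between calls to go easy on the API
    api_url = "https://www.wikidata.org/w/api.php"
    params = {
        "action": "wbsearchentities",
        "language": "en",
        "search": word_to_search,
        "format": "json",
    }
    # Passing params lets requests URL-encode names containing "&", "#", etc.
    search_result = requests.get(api_url, params=params)

    try:
        qid = search_result.json()["search"][0]["id"]
    except (KeyError, IndexError, ValueError):
        # No hits, an error payload, or a response that is not JSON.
        qid = "Not found"
    return qid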
    

In [80]:
target_item = "Q96695265"
for index, row in ad5_row.iterrows():
    devs = row["Developer / Researcher"].split("/")
    for dev in devs:
        qid_dev = get_first_wikidata_match(dev)
        print(target_item + "|P178|" + qid_dev + "|S854|" + '"' + 'https://covid-19tracker.milkeninstitute.org/' + '"') 

Q96695265|P178|Q91016085|S854|"https://covid-19tracker.milkeninstitute.org/"
Q96695265|P178|Not found|S854|"https://covid-19tracker.milkeninstitute.org/"
Q96695265|P178|Q1437507|S854|"https://covid-19tracker.milkeninstitute.org/"
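The second line has `Not found` where a QID should be, so it is not a usable statement. A minimal sketch of separating usable lines from names that need manual reconciliation follows; `build_developer_statements` and the placeholder developer names are hypothetical, while the QIDs and the source URL are the ones printed above.

```python
def build_developer_statements(target_item, matches, source_url):
    """Split (name, qid) matches into statement lines and unmatched names."""
    statements, unmatched = [], []
    for name, qid in matches:
        if qid == "Not found":
            unmatched.append(name)  # keep for manual review
            continue
        statements.append(f'{target_item}|P178|{qid}|S854|"{source_url}"')
    return statements, unmatched


# Placeholder names; in the notebook they would come from the split developer cell.
matches = [
    ("developer A", "Q91016085"),
    ("developer B", "Not found"),
    ("developer C", "Q1437507"),
]
statements, unmatched = build_developer_statements(
    "Q96695265", matches, "https://covid-19tracker.milkeninstitute.org/"
)
print("\n".join(statements))
print("Needs manual review:", unmatched)
```

Keeping the unmatched names in a separate list makes it easy to reconcile them by hand before adding them.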
