In this script, we will reconcile the Milken Institute's [database of vaccine candidates against COVID-19](https://covid-19tracker.milkeninstitute.org/#about) to Wikidata.

First, we will just check the info for a vaccine candidate already on Wikidata: [Ad5-nCoV](https://www.wikidata.org/wiki/Q96695265)


In [2]:
import pandas as pd

df = pd.read_csv("COVID-19 Tracker-Treatments and Vaccines.csv")

The database contains treatments and vaccines. For now, we are only interested in vaccine candidates. Let's filter.

In [3]:
vax_df = df[df["Treatment vs. Vaccine"] == "Vaccine"]

In [4]:
vax_df.head(3)

Unnamed: 0,Developer / Researcher,Treatment vs. Vaccine,Product Category,Stage of Development,Anticipated Next Steps,Product Description,Clinical Trials for COVID-19,Funder,Published Results,Clinical Trials for Other Diseases (T only) / Related Use or Platform (V only),FDA-Approved Indications,Sources,Date Last Updated
263,Scancell/ University of Nottingham/ Nottingham...,Vaccine,DNA-based,Pre-clinical,Phase I to start in Q1 2021,DNA plasmid vaccine RBD&N,,Unknown,,Same platform as vaccine candidates for cancer,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/8/2020
264,Entos Pharmaceuticals/ Cytiva,Vaccine,DNA-based,Pre-clinical,Phase I/II to start in late July 2020,DNA; Covigenix,,Canadian Institutes of Health Research (CIHR)/...,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/1/2020
265,BioNet Asia,Vaccine,DNA-based,Pre-clinical,Unknown,DNA,,Unknown,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,4/20/2020


Apparently the names are in the "Product Description" column. Maybe we can find Ad5-nCoV there.

In [5]:
list(set(vax_df["Product Description"]))[0:5]

['Non-replicating viral vector; parainfluenza virus 5 (PIV5)-based vaccine expressing the spike protein',
 'Protein subunit EPV-CoV-19',
 'VLP (CoVLP)+ Adjuvant (CpG 1018)',
 'Adeno-based',
 'Protein Subunit S, N, M & S1 protein']

Hmm, the names are a bit messy: they are mixed with descriptions. Let's look at them only for the candidates with clinical trials.

In [6]:
vax_df_with_clinical_trials = vax_df.dropna(subset = ["Clinical Trials for COVID-19"])

In [7]:
vax_df_with_clinical_trials.head()

Unnamed: 0,Developer / Researcher,Treatment vs. Vaccine,Product Category,Stage of Development,Anticipated Next Steps,Product Description,Clinical Trials for COVID-19,Funder,Published Results,Clinical Trials for Other Diseases (T only) / Related Use or Platform (V only),FDA-Approved Indications,Sources,Date Last Updated
266,Inovio Pharmaceuticals/Beijing Advaccine Biote...,Vaccine,DNA-based,Phase I,"Phase I initial data released June 30, 2020; P...",DNA plasmid vaccine with electroporation; INO-...,"NCT04336410, NCT04447781",Coalition for Epidemic Preparedness (CEPI) / G...,Inovio (http://ir.inovio.com/news-releases/new...,Same platform as multiple vaccine candidates,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/6/2020
271,Zydus Cadila Healthcare Limited,Vaccine,DNA-based,Pre-clinical,Phase I/II to start in July 2020,DNA plasmid,CTRI/2020/07/026352,Unknown,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/8/2020
273,Osaka University/ AnGes/ Takara Bio/ Cytiva,Vaccine,DNA-based,Pre-clinical,Phase I to start in July 2020,DNA plasmid,JapicCTI-205328,Unknown,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/9/2020
274,"Genexine Consortium (GenNBio, International Va...",Vaccine,DNA-based,Phase I,Phase I to start in July 2020,DNA vaccine (GX-19),NCT04445389,Unknown,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/1/2020
279,Beijing Institute of Biological Products/ Sino...,Vaccine,Inactivated virus,Phase II,"In Phase II, June 2020",Inactivated,ChiCTR2000032459,Unknown,,,,https://docs.google.com/document/d/1Y4nCJJ4njz...,6/12/2020


I believe I will have to manually add some vaccine labels, but some info is present. Let's look at it again:

In [8]:
list(set(vax_df_with_clinical_trials["Product Description"]))

['Inactivated (inactivated + alum); CoronaVac (formerly PiCoVacc)',
 'LV-SMENP-DC Dendritic cells modified with lentiviral vector expressing synthetic minigene based on domains of selected viral proteins; administered with antigen-specific cytotoxic T lymphocytes',
 'Artificial antigen-presenting cells modified with lentiviral vector expressing synthetic minigene based on domains of selected viral proteins',
 'DNA plasmid',
 'Non-replicating viral vector; AZD 1222 (formerly ChAdOx1)',
 'Protein subunit; Full-length recombinant SARS COV-2 glycoprotein nanoparticle vaccine adjuvanted with Matrix M (NVX-CoV2373)',
 '3 LNP-mRNAs; BNT162 \n',
 'VLP; plant-derived VLP',
 'Protein subunit; recombinant spike protein with Advax adjuvant (COVAX-19)',
 'RNA; LNP-encapsulated mRNA (mRNA 1273)',
 'DNA plasmid vaccine with electroporation; INO-4800',
 'Protein subunit, native like trimeric subunit spike protein',
 'RNA; mRNA',
 'Non-replicating viral vector; Adenovirus Type 5 vector (Ad5-nCoV)',
 'A
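
Several descriptions carry the candidate's name in trailing parentheses, e.g. `(Ad5-nCoV)`. As a rough sketch (the helper `extract_candidate_name` is hypothetical and not part of the original notebook), a regular expression could pull out such names as a starting point for manual review. It is only a heuristic: a description like `CoronaVac (formerly PiCoVacc)` would yield `formerly PiCoVacc`, and descriptions without parentheses yield nothing.

```python
import re

def extract_candidate_name(description):
    """Return the text inside the last pair of parentheses, or None.

    Hypothetical helper: a heuristic, not a reliable parser.
    """
    matches = re.findall(r"\(([^()]+)\)", description)
    return matches[-1].strip() if matches else None

print(extract_candidate_name(
    "Non-replicating viral vector; Adenovirus Type 5 vector (Ad5-nCoV)"))
print(extract_candidate_name("DNA plasmid"))
```

Any name obtained this way would still need to be checked by hand before being used as a Wikidata label.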

In [9]:
ad5_row = vax_df_with_clinical_trials[vax_df_with_clinical_trials["Product Description"] == 'Non-replicating viral vector; Adenovirus Type 5 vector (Ad5-nCoV)']
ad5_row

Unnamed: 0,Developer / Researcher,Treatment vs. Vaccine,Product Category,Stage of Development,Anticipated Next Steps,Product Description,Clinical Trials for COVID-19,Funder,Published Results,Clinical Trials for Other Diseases (T only) / Related Use or Platform (V only),FDA-Approved Indications,Sources,Date Last Updated
300,CanSino Biologics/Beijing Institute of Biotech...,Vaccine,Non-replicating viral vector,Phase II,"Approved for military use in China on June 25,...",Non-replicating viral vector; Adenovirus Type ...,NCT04313127 ChiCTR2000030906 ChiCTR2000031781 ...,Unknown,The Lancet (https://www.thelancet.com/journals...,Same platform as vaccine candidates for EBOV,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/8/2020


Now let's think about how to add the information to Wikidata. 

[developer (P178)](https://www.wikidata.org/wiki/Property:P178)

[vaccine for (P1924)](https://www.wikidata.org/wiki/Property:P1924)

The product category is currently represented in "instance of" statements. In this case, it would be an instance of [adenovirus-based vaccine (Q96841548)](https://www.wikidata.org/wiki/Q96841548)

I am not sure how to link the vaccine to the clinical trials that describe it. The ID properties are used on the clinical trial items themselves:

* [NCT ID (P3098)](https://www.wikidata.org/wiki/Property:P3098)
* [Chinese Clinical Trial Registry ID (P8064)](https://www.wikidata.org/wiki/Property:P8064)

The link can probably be made by adding the vaccine as the main subject (P921) of the clinical trial item.

Published results can be reconciled to Wikidata items and added as [described by source (P1343)](https://www.wikidata.org/wiki/Property:P1343).




In [10]:
import requests
import time
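def get_first_wikidata_match(word_to_search):
    """Return the QID of the first wbsearchentities hit, or "Not found"."""
    time.sleep(0.3)  # pause between calls to avoid hammering the Wikidata API
    api_url = "https://www.wikidata.org/w/api.php"
    # Passing params lets requests URL-encode names that contain "&", "#", etc.
    params = {"action": "wbsearchentities", "language": "en",
              "search": word_to_search, "format": "json"}
    search_result = requests.get(api_url, params=params)

    index_to_return = 0
    try:
        qid = search_result.json()["search"][index_to_return]["id"]
    except (KeyError, IndexError, ValueError):
        # No hit, an API error payload, or a response that is not JSON
        qid = "Not found"
    return qid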
    

In [11]:
from datetime import datetime
today = datetime.now()
today_wikidata_format = today.strftime("+%Y-%m-%dT00:00:00Z/11")

ref_url = "|S854|" + '"' + 'https://covid-19tracker.milkeninstitute.org/' + '"'
retrieved_in = "|S813|" + today_wikidata_format


target_item = "Q96695265"

for index, row in ad5_row.iterrows():
    devs = row["Developer / Researcher"].split("/")
    for dev in devs:
        qid_dev = get_first_wikidata_match(dev)
        developer = "|P178|" + qid_dev
        print(target_item + developer + ref_url + retrieved_in)

Q96695265|P178|Q91016085|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-11T00:00:00Z/11
Q96695265|P178|Not found|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-11T00:00:00Z/11
Q96695265|P178|Q1437507|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-11T00:00:00Z/11
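
Each printed line follows the pipe-separated QuickStatements pattern `item|property|value`, followed by a reference made of `S854` (reference URL) and `S813` (retrieved date, where the trailing `/11` marks day precision). As a minimal sketch, the string concatenation could be wrapped in small helpers; `wikidata_day` and `qs_statement` are hypothetical names, not part of the original notebook.

```python
from datetime import datetime

def wikidata_day(date):
    """Format a date as a QuickStatements time value with day precision (/11)."""
    return date.strftime("+%Y-%m-%dT00:00:00Z/11")

def qs_statement(item, prop, value, ref_url, retrieved):
    """Build one statement with an S854 (reference URL) / S813 (retrieved) reference."""
    reference = ["S854", '"' + ref_url + '"', "S813", wikidata_day(retrieved)]
    return "|".join([item, prop, value] + reference)

line = qs_statement("Q96695265", "P178", "Q91016085",
                    "https://covid-19tracker.milkeninstitute.org/",
                    datetime(2020, 7, 11))
print(line)
```

Note that a value of `Not found`, as in the second line printed above, is not a valid QID, so such lines should be dropped or resolved by hand before submitting the batch.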


Only one of the candidates with trials actually has an article as a describing source. It will be added manually:

In [12]:
print("Q96695265|P1343|Q95818623"+ ref_url + retrieved_in)

Q96695265|P1343|Q95818623|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-11T00:00:00Z/11


Let's add new items with:

* vaccine for COVID-19 (|P1924|Q84263196)
* instance of a candidate vaccine (|P31|Q28051899)
* instance of whatever type of vaccine it is

In [13]:
vax_category_to_wikidata = {
    "DNA-based" : "Q578537",
    "Inactivated virus": "Q3560939",
    "Non-replicating viral vector": "Q96841548",
    "Protein subunit":"Q97153933",
    "RNA-based vaccine":"Q97153934",
    "Virus-like particle":"Q58623657"  
}

vax_df_with_clinical_trials["wd_category"] = vax_df_with_clinical_trials["Product Category"].map(vax_category_to_wikidata)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
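
The message above is pandas' `SettingWithCopyWarning`: the new column is being assigned to a DataFrame that was derived from another one by filtering. A minimal sketch of one common way to avoid it, assuming the filtered frame is meant to be independent of the original, is to call `.copy()` right after filtering. The toy data and the names `toy_df`, `with_trials` and `category_to_qid` are illustrative only.

```python
import pandas as pd

# Toy data standing in for the tracker CSV (values are illustrative only)
toy_df = pd.DataFrame({
    "Product Category": ["DNA-based", "Inactivated virus", "DNA-based"],
    "Clinical Trials for COVID-19": ["NCT04336410", None, "NCT04445389"],
})

# .copy() makes the filtered frame independent of toy_df,
# so adding a column to it is unambiguous
with_trials = toy_df.dropna(subset=["Clinical Trials for COVID-19"]).copy()

category_to_qid = {"DNA-based": "Q578537"}
with_trials["wd_category"] = with_trials["Product Category"].map(category_to_qid)
print(with_trials)
```

Categories that are missing from the mapping come out of `.map` as `NaN`, which is worth checking with `with_trials["wd_category"].isna()` before building statements.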


In [14]:
devs = [dev.replace(",", "/") for dev in vax_df_with_clinical_trials["Developer / Researcher"]]

devs = [dev.split("/")[0] for dev in devs]

vax_df_with_clinical_trials["wd_enlabel"] = [dev + " COVID-19 vaccine candidate" for dev in devs]
vax_df_with_clinical_trials["wd_enlabel"].values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


array(['Inovio Pharmaceuticals COVID-19 vaccine candidate',
       'Zydus Cadila Healthcare Limited  COVID-19 vaccine candidate',
       'Osaka University COVID-19 vaccine candidate',
       'Genexine Consortium (GenNBio COVID-19 vaccine candidate',
       'Beijing Institute of Biological Products COVID-19 vaccine candidate',
       'Wuhan Institute of Biological Products COVID-19 vaccine candidate',
       'Institute of Medical Biology COVID-19 vaccine candidate',
       'Sinovac COVID-19 vaccine candidate',
       'CanSino Biologics COVID-19 vaccine candidate',
       'Consortium of the Jenner Institute COVID-19 vaccine candidate',
       'Gamaleya  Research Institute COVID-19 vaccine candidate',
       'Novavax COVID-19 vaccine candidate',
       'Vaxine Pty Ltd COVID-19 vaccine candidate',
       'Anhui Zhifei Longcom Biopharmaceutical COVID-19 vaccine candidate',
       'Clover Biopharmaceuticals Inc. COVID-19 vaccine candidate',
       'Moderna COVID-19 vaccine candidate',
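
Some of these labels carry artifacts of the split: repeated spaces (`Gamaleya  Research Institute`) and an unbalanced parenthesis (`Genexine Consortium (GenNBio`). A possible cleanup, sketched under the assumption that an unclosed parenthetical can simply be dropped, is shown below; `clean_label` is a hypothetical helper and not part of the original notebook.

```python
import re

def clean_label(developers):
    """Build an English label from the first developer listed (hypothetical helper)."""
    first = re.split(r"[/,]", developers)[0]   # keep the text before the first "/" or ","
    if first.count("(") > first.count(")"):
        # Drop a parenthetical that was cut off by the split
        first = first[:first.rfind("(")]
    first = " ".join(first.split())            # collapse repeated whitespace
    return first + " COVID-19 vaccine candidate"

print(clean_label("Genexine Consortium (GenNBio, International Va..."))
print(clean_label("Gamaleya  Research Institute"))
```

Balanced parentheses, as in `People's Liberation Army (PLA) Academy of Military Sciences`, are left untouched by this sketch.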
      

A few of these were already on Wikidata. Let's make it clear which ones.

In [15]:
vax_in_wikidata = ['Moderna COVID-19 vaccine candidate',
                   "Consortium of the Jenner Institute COVID-19 vaccine candidate", 
                   "Inovio Pharmaceuticals COVID-19 vaccine candidate",
                   "CanSino Biologics COVID-19 vaccine candidate" ]

In [19]:
'''
Print QuickStatements for the creation of new items about vaccines
'''

with open("vax_with_clinical_trial.qs", "w+") as f:
    from datetime import datetime
    today = datetime.now()
    today_wikidata_format = today.strftime("+%Y-%m-%dT00:00:00Z/11")

    ref_url = "|S854|" + '"' + 'https://covid-19tracker.milkeninstitute.org/' + '"'
    retrieved_in = "|S813|" + today_wikidata_format


    for index,row in vax_df_with_clinical_trials.iterrows():
        if row["wd_enlabel"] not in vax_in_wikidata:
            print( row["wd_enlabel"] )

            refs = ref_url + retrieved_in 

            instance_of_candidate_qs = "LAST" + "|P31|" + "Q28051899" + refs

            vax_category = row["wd_category"]
            instance_of_category_qs = "LAST" + "|P31|" + vax_category + refs

            en_label_qs = "LAST" + "|Len|" + '"' + row["wd_enlabel"] + '"'

            en_description_qs  = "LAST" + "|Den|" + '"' + "candidate vaccine against COVID-19" + '"'

            vaccine_for_covid19_qs = "LAST" + "|P1924|Q84263196" + refs

            f.write("CREATE" + "\n")
            f.write(instance_of_candidate_qs + "\n")
            f.write(instance_of_category_qs + "\n")
            f.write(en_label_qs + "\n")
            f.write(vaccine_for_covid19_qs + "\n")
            f.write(en_description_qs + "\n")

            devs = row["Developer / Researcher"].split("/")
            for dev in devs:
                qid_dev = get_first_wikidata_match(dev)

                if qid_dev !="Not found":
                    developer_qs =  "LAST" + "|P178|" + qid_dev + refs
                    f.write(developer_qs + "\n")

# The with block closes the file automatically; no explicit f.close() is needed.

        

Zydus Cadila Healthcare Limited  COVID-19 vaccine candidate
Osaka University COVID-19 vaccine candidate
Genexine Consortium (GenNBio COVID-19 vaccine candidate
Beijing Institute of Biological Products COVID-19 vaccine candidate
Wuhan Institute of Biological Products COVID-19 vaccine candidate
Institute of Medical Biology COVID-19 vaccine candidate
Sinovac COVID-19 vaccine candidate
Gamaleya  Research Institute COVID-19 vaccine candidate
Novavax COVID-19 vaccine candidate
Vaxine Pty Ltd COVID-19 vaccine candidate
Anhui Zhifei Longcom Biopharmaceutical COVID-19 vaccine candidate
Clover Biopharmaceuticals Inc. COVID-19 vaccine candidate
CureVac COVID-19 vaccine candidate
Imperial College London COVID-19 vaccine candidate
BioNTech COVID-19 vaccine candidate
People's Liberation Army (PLA) Academy of Military Sciences COVID-19 vaccine candidate
Medicago Inc. COVID-19 vaccine candidate
Shenzhen Geno-Immune Medical Institute COVID-19 vaccine candidate


TypeError: must be str, not float
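
The `TypeError` most likely comes from the line that builds `instance_of_category_qs`: `Series.map` leaves `NaN` (a float) for any product category missing from `vax_category_to_wikidata`, and a float cannot be concatenated to a string. A minimal sketch of a guard follows, assuming the category statement can simply be skipped for such rows; the helper name `category_statement` is hypothetical.

```python
def category_statement(wd_category, refs):
    """Return the "instance of <category>" line, or None when the category is unmapped."""
    if not isinstance(wd_category, str):
        # Unmapped categories show up as NaN (a float) after Series.map
        return None
    return "LAST" + "|P31|" + wd_category + refs

refs = '|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-11T00:00:00Z/11'
print(category_statement("Q96841548", refs))
print(category_statement(float("nan"), refs))
```

Inside the loop, the returned line would be written to the file only when it is not `None`. The alternative is to add the missing category to the mapping once a suitable Wikidata item is identified.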

After adding the vaccines, the goal is to link the clinical trials to the vaccine items themselves.
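
A rough sketch of that linking step is given below, assuming the link is expressed as main subject (P921) on the trial item and that the QID of each trial item has already been looked up. The function names and the `trial_qids` dictionary are hypothetical, and `"Q0"` is a placeholder rather than a real item.

```python
import re

# Prefix of the registry ID -> Wikidata property used on trial items
REGISTRY_PROPERTIES = {"NCT": "P3098", "ChiCTR": "P8064"}

def split_trial_ids(cell):
    """Split a "Clinical Trials for COVID-19" cell on commas and whitespace."""
    return [part for part in re.split(r"[,\s]+", cell.strip()) if part]

def registry_property(trial_id):
    """Return the ID property for a trial ID, or None for registries not listed above."""
    for prefix, prop in REGISTRY_PROPERTIES.items():
        if trial_id.startswith(prefix):
            return prop
    return None

def main_subject_statements(cell, vaccine_qid, trial_qids):
    """Yield "trial|P921|vaccine" lines for the trials whose QID is known."""
    for trial_id in split_trial_ids(cell):
        trial_qid = trial_qids.get(trial_id)
        if trial_qid is not None:
            yield trial_qid + "|P921|" + vaccine_qid

trial_qids = {"NCT04313127": "Q0"}  # "Q0" is a placeholder, not a real item
for line in main_subject_statements("NCT04313127 ChiCTR2000030906",
                                    "Q96695265", trial_qids):
    print(line)
```

`registry_property` indicates which ID property to query when resolving a trial ID to its Wikidata item; trials from other registries (for example the `CTRI/...` or `JapicCTI-...` IDs seen earlier) would need to be handled separately.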