In this script, we will reconcile the Milken Institute's [database of COVID-19 vaccine candidates](https://covid-19tracker.milkeninstitute.org/#about) to Wikidata.

First, we will just check the info for a vaccine candidate already on Wikidata: [Ad5-nCoV](https://www.wikidata.org/wiki/Q96695265)


In [1]:
import pandas as pd

df = pd.read_csv("COVID-19 Tracker-Treatments and Vaccines.csv")
vax_df = df[df["Treatment vs. Vaccine"] == "Vaccine"]

vax_df_with_clinical_trials = vax_df.dropna(subset = ["Clinical Trials for COVID-19"])

In [2]:
ad5_row = vax_df_with_clinical_trials[vax_df_with_clinical_trials["Product Description"] == 'Non-replicating viral vector; Adenovirus Type 5 vector (Ad5-nCoV)']
ad5_row

Unnamed: 0,Developer / Researcher,Treatment vs. Vaccine,Product Category,Stage of Development,Anticipated Next Steps,Product Description,Clinical Trials for COVID-19,Funder,Published Results,Clinical Trials for Other Diseases (T only) / Related Use or Platform (V only),FDA-Approved Indications,Sources,Date Last Updated
300,CanSino Biologics/Beijing Institute of Biotech...,Vaccine,Non-replicating viral vector,Phase II,"Approved for military use in China on June 25,...",Non-replicating viral vector; Adenovirus Type ...,NCT04313127 ChiCTR2000030906 ChiCTR2000031781 ...,Unknown,The Lancet (https://www.thelancet.com/journals...,Same platform as vaccine candidates for EBOV,,https://docs.google.com/document/d/1Y4nCJJ4njz...,7/8/2020


Now let's think about how to add this information to Wikidata.

The developers can be added with [developer (P178)](https://www.wikidata.org/wiki/Property:P178).

The targeted disease can be added with [vaccine for (P1924)](https://www.wikidata.org/wiki/Property:P1924).

The product category is currently represented in "instance of" statements. In this case, the candidate would be an instance of [adenovirus-based vaccine (Q96841548)](https://www.wikidata.org/wiki/Q96841548).

It is not yet clear how best to link a vaccine to the clinical trials that study it. The ID properties below are used on the clinical trial items themselves:

* [NCT ID (P3098)](https://www.wikidata.org/wiki/Property:P3098)
* [Chinese Clinical Trial Registry ID (P8064)](https://www.wikidata.org/wiki/Property:P8064)

One option is to add the vaccine as the main subject (P921) of the clinical trial.

Published results can be reconciled to Wikidata items and added as [described by source (P1343)](https://www.wikidata.org/wiki/Property:P1343).




In [3]:
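import requests
import time


def get_first_wikidata_match(word_to_search):
    """Return the QID of the first wbsearchentities hit, or "Not found"."""
    time.sleep(0.3)  # short pause between calls to go easy on the API
    params = {
        "action": "wbsearchentities",
        "language": "en",
        "search": word_to_search,  # requests URL-encodes spaces and symbols
        "format": "json",
    }
    search_result = requests.get("https://www.wikidata.org/w/api.php", params=params)

    try:
        # Take the top-ranked hit; it may not be the intended entity.
        return search_result.json()["search"][0]["id"]
    except (KeyError, IndexError, ValueError):
        return "Not found"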
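
Taking the first search hit is convenient, but it can pick the wrong entity, and some names match nothing at all. A minimal sketch for reviewing a few candidates before accepting one is shown here. `summarize_matches` is a hypothetical helper that is not part of this notebook, and the sample payload is made up to mirror the `search` list that `wbsearchentities` returns:

```python
def summarize_matches(payload, limit=3):
    """Return (id, label, description) tuples for the top search hits.

    `payload` is the decoded JSON of a wbsearchentities response.
    Missing labels or descriptions are returned as empty strings.
    """
    hits = payload.get("search", [])[:limit]
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in hits
    ]


# Made-up payload for illustration only; the QIDs are placeholders.
sample_payload = {
    "search": [
        {"id": "Q1", "label": "Example Biologics", "description": "company"},
        {"id": "Q2", "label": "Example Biologics (film)"},
    ]
}

for qid, label, description in summarize_matches(sample_payload):
    print(qid, label, description, sep=" | ")
```

Seeing the label and description next to each QID makes it easier to spot a homonym before it ends up in a statement.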
    

In [4]:
from datetime import datetime

# QuickStatements date format; "/11" marks day precision.
today = datetime.now()
today_wikidata_format = today.strftime("+%Y-%m-%dT00:00:00Z/11")

# Reference parts: reference URL (S854) and retrieved date (S813).
ref_url = '|S854|"https://covid-19tracker.milkeninstitute.org/"'
retrieved_in = "|S813|" + today_wikidata_format

target_item = "Q96695265"

for index, row in ad5_row.iterrows():
    devs = row["Developer / Researcher"].split("/")
    for dev in devs:
        qid_dev = get_first_wikidata_match(dev)
        developer = "|P178|" + qid_dev
        print(target_item + developer + ref_url + retrieved_in)

Q96695265|P178|Q91016085|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q96695265|P178|Not found|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q96695265|P178|Q1437507|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11


Only one of the candidates with trials actually has an article as a describing source. This will be added manually:

In [5]:
print("Q96695265|P1343|Q95818623"+ ref_url + retrieved_in)

Q96695265|P1343|Q95818623|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
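
Every statement printed so far has the same shape: subject, property, value, then a reference made of S854 (reference URL) and S813 (retrieved date). As a sketch, that assembly could live in one small function. `qs_statement`, `TRACKER_URL` and the parameter names are hypothetical and not part of the notebook:

```python
TRACKER_URL = "https://covid-19tracker.milkeninstitute.org/"


def qs_statement(subject, prop, value, retrieved, source_url=TRACKER_URL):
    """Build one pipe-separated QuickStatements line with a URL reference.

    `retrieved` is a date already in QuickStatements form,
    for example "+2020-07-12T00:00:00Z/11" (day precision).
    """
    reference = f'|S854|"{source_url}"|S813|{retrieved}'
    return f"{subject}|{prop}|{value}{reference}"


print(qs_statement("Q96695265", "P1343", "Q95818623", "+2020-07-12T00:00:00Z/11"))
```

Called with the values from the cell above, it reproduces the line printed there.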


Let's add new items with:

* vaccine for COVID-19 (|P1924|Q84263196)
* instance of a candidate vaccine (|P31|Q28051899)
* instance of whatever type of vaccine it is

In [6]:
# Map the tracker's "Product Category" values to Wikidata vaccine-type items.
vax_category_to_wikidata = {
    "DNA-based": "Q578537",
    "Inactivated virus": "Q3560939",
    "Non-replicating viral vector": "Q96841548",
    "Protein subunit": "Q97153933",
    "RNA-based vaccine": "Q97153934",
    "Virus-like particle": "Q58623657",
}

# Categories that are missing from the mapping become NaN in "wd_category".
vax_df_with_clinical_trials["wd_category"] = vax_df_with_clinical_trials["Product Category"].map(vax_category_to_wikidata)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
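
The warning appears because `vax_df_with_clinical_trials` was derived from another DataFrame by filtering, so pandas cannot guarantee whether the assignment touches a view or a copy of the original. One common way to avoid it is to take an explicit copy before adding columns. The sketch below uses a tiny made-up frame rather than the tracker CSV, and the names `with_trials` and `category_to_qid` are illustrative only:

```python
import pandas as pd

# Made-up stand-in for the tracker data.
df = pd.DataFrame({
    "Product Category": ["DNA-based", "Inactivated virus", "Something else"],
    "Clinical Trials for COVID-19": ["NCT00000000", None, "NCT00000001"],
})

# .copy() makes the filtered frame independent of df, so adding a column
# to it is unambiguous.
with_trials = df.dropna(subset=["Clinical Trials for COVID-19"]).copy()

category_to_qid = {"DNA-based": "Q578537", "Inactivated virus": "Q3560939"}
with_trials["wd_category"] = with_trials["Product Category"].map(category_to_qid)

print(with_trials)
```

The row whose category is not in the mapping ends up with a missing value in `wd_category`, which matters later when the value is concatenated into a statement.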


In [7]:
devs = [dev.replace(",", "/") for dev in vax_df_with_clinical_trials["Developer / Researcher"]]

devs = [dev.split("/")[0] for dev in devs]

vax_df_with_clinical_trials["wd_enlabel"] = [dev + " COVID-19 vaccine candidate" for dev in devs]
vax_df_with_clinical_trials["wd_enlabel"].values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


array(['Inovio Pharmaceuticals COVID-19 vaccine candidate',
       'Zydus Cadila Healthcare Limited  COVID-19 vaccine candidate',
       'Osaka University COVID-19 vaccine candidate',
       'Genexine Consortium (GenNBio COVID-19 vaccine candidate',
       'Beijing Institute of Biological Products COVID-19 vaccine candidate',
       'Wuhan Institute of Biological Products COVID-19 vaccine candidate',
       'Institute of Medical Biology COVID-19 vaccine candidate',
       'Sinovac COVID-19 vaccine candidate',
       'CanSino Biologics COVID-19 vaccine candidate',
       'Consortium of the Jenner Institute COVID-19 vaccine candidate',
       'Gamaleya  Research Institute COVID-19 vaccine candidate',
       'Novavax COVID-19 vaccine candidate',
       'Vaxine Pty Ltd COVID-19 vaccine candidate',
       'Anhui Zhifei Longcom Biopharmaceutical COVID-19 vaccine candidate',
       'Clover Biopharmaceuticals Inc. COVID-19 vaccine candidate',
       'Moderna COVID-19 vaccine candidate',
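
Several of the labels above carry doubled spaces (for example the Zydus Cadila and Gamaleya entries), because the developer names are used exactly as they appear in the CSV. A possible cleanup, sketched with a hypothetical `make_label` helper, collapses runs of whitespace before appending the suffix. The second input string is invented for illustration:

```python
def make_label(developer_field, suffix="COVID-19 vaccine candidate"):
    """Build an English label from the first developer in the field.

    Developers are separated by "/" or ","; only the first one is kept,
    and whitespace is collapsed so the label has single spaces.
    """
    first_dev = developer_field.replace(",", "/").split("/")[0]
    cleaned = " ".join(first_dev.split())
    return f"{cleaned} {suffix}"


print(make_label("Gamaleya  Research Institute"))
print(make_label("Zydus Cadila Healthcare Limited , Another Developer"))
```

If labels are normalized this way, any list of labels used for lookups has to be built from the same normalized strings.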
      

A few of these were already on Wikidata. Let's make it clear which ones.

In [8]:
vax_in_wikidata = ['Moderna COVID-19 vaccine candidate',
                   "Consortium of the Jenner Institute COVID-19 vaccine candidate", 
                   "Inovio Pharmaceuticals COVID-19 vaccine candidate",
                   "CanSino Biologics COVID-19 vaccine candidate" ]

In [19]:
'''
Print QuickStatements for the creation of new items about vaccines
'''
from datetime import datetime

today = datetime.now()
today_wikidata_format = today.strftime("+%Y-%m-%dT00:00:00Z/11")

ref_url = '|S854|"https://covid-19tracker.milkeninstitute.org/"'
retrieved_in = "|S813|" + today_wikidata_format
refs = ref_url + retrieved_in

# The with block closes the file on exit, so no explicit f.close() is needed.
with open("vax_with_clinical_trial.qs", "w+") as f:
    for index, row in vax_df_with_clinical_trials.iterrows():
        if row["wd_enlabel"] not in vax_in_wikidata:
            print(row["wd_enlabel"])

            instance_of_candidate_qs = "LAST|P31|Q28051899" + refs

            # wd_category is NaN (a float) when the product category has no
            # entry in the mapping; the concatenation then raises a TypeError.
            vax_category = row["wd_category"]
            instance_of_category_qs = "LAST|P31|" + vax_category + refs

            en_label_qs = 'LAST|Len|"' + row["wd_enlabel"] + '"'
            en_description_qs = 'LAST|Den|"candidate vaccine against COVID-19"'
            vaccine_for_covid19_qs = "LAST|P1924|Q84263196" + refs

            # CREATE starts a new item; LAST refers to the item just created.
            f.write("CREATE" + "\n")
            f.write(instance_of_candidate_qs + "\n")
            f.write(instance_of_category_qs + "\n")
            f.write(en_label_qs + "\n")
            f.write(vaccine_for_covid19_qs + "\n")
            f.write(en_description_qs + "\n")

            devs = row["Developer / Researcher"].split("/")
            for dev in devs:
                qid_dev = get_first_wikidata_match(dev)

                if qid_dev != "Not found":
                    developer_qs = "LAST|P178|" + qid_dev + refs
                    f.write(developer_qs + "\n")

        

Zydus Cadila Healthcare Limited  COVID-19 vaccine candidate
Osaka University COVID-19 vaccine candidate
Genexine Consortium (GenNBio COVID-19 vaccine candidate
Beijing Institute of Biological Products COVID-19 vaccine candidate
Wuhan Institute of Biological Products COVID-19 vaccine candidate
Institute of Medical Biology COVID-19 vaccine candidate
Sinovac COVID-19 vaccine candidate
Gamaleya  Research Institute COVID-19 vaccine candidate
Novavax COVID-19 vaccine candidate
Vaxine Pty Ltd COVID-19 vaccine candidate
Anhui Zhifei Longcom Biopharmaceutical COVID-19 vaccine candidate
Clover Biopharmaceuticals Inc. COVID-19 vaccine candidate
CureVac COVID-19 vaccine candidate
Imperial College London COVID-19 vaccine candidate
BioNTech COVID-19 vaccine candidate
People's Liberation Army (PLA) Academy of Military Sciences COVID-19 vaccine candidate
Medicago Inc. COVID-19 vaccine candidate
Shenzhen Geno-Immune Medical Institute COVID-19 vaccine candidate


TypeError: must be str, not float
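
The `TypeError` is most likely raised when `wd_category` is `NaN`: `Series.map` leaves `NaN`, a float, for any product category that is not a key of `vax_category_to_wikidata`, and a float cannot be concatenated to a string. A minimal guard, assuming it is acceptable to write the category statement only when a mapping exists, could look like this (`category_statement` is a hypothetical helper, not part of the notebook):

```python
import math


def category_statement(category_qid, refs):
    """Return the "instance of" line for a vaccine type, or None if unmapped.

    Unmapped categories arrive as NaN (a float) after Series.map.
    """
    if not isinstance(category_qid, str):
        return None
    return "LAST|P31|" + category_qid + refs


refs = '|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11'
print(category_statement("Q96841548", refs))
print(category_statement(math.nan, refs))
```

In the loop, the line would then be written only when the helper returns a string; the other option is to extend the mapping so that every category in the tracker has a Wikidata item.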

After adding the vaccines, the goal is to link the clinical trials to the vaccine items themselves.

In [9]:
wd_items = []


for index,row in vax_df_with_clinical_trials.iterrows():
    wd_items.append(get_first_wikidata_match(row["wd_enlabel"]))

In [10]:
vax_df_with_clinical_trials["wd_items"] = wd_items

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [11]:


from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd


def get_clinical_trial_item_from_nct(nct):
    """Return the QID of the item whose NCT ID (P3098) equals nct, or "Not found"."""
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

    query = f"""
    SELECT ?item ?itemLabel
    WHERE
    {{
        ?item wdt:P3098 "{nct}" .
        SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" }}
    }}
    """
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)

    try:
        results = sparql.query().convert()
        results_df = pd.json_normalize(results["results"]["bindings"])
        # item.value is an entity URI; the QID is its last path segment.
        return results_df[["item.value"]].values[0][0].split("/")[4]
    except Exception:
        # No match or a failed request: report it like the search helper does.
        return "Not found"


In [12]:
today = datetime.now()
today_wikidata_format = today.strftime("+%Y-%m-%dT00:00:00Z/11")

ref_url = '|S854|"https://covid-19tracker.milkeninstitute.org/"'
retrieved_in = "|S813|" + today_wikidata_format
refs = ref_url + retrieved_in

for index, row in vax_df_with_clinical_trials.iterrows():
    # Trial IDs are separated by spaces; strip any commas attached to them.
    trial_ids = row["Clinical Trials for COVID-19"].split(" ")

    for trial_id in trial_ids:
        trial_id = trial_id.replace(",", "")
        clinical_trial_item = get_clinical_trial_item_from_nct(trial_id)

        if clinical_trial_item != "Not found":
            # trial -> research intervention (P4844) -> vaccine
            print(clinical_trial_item + "|P4844|" + row["wd_items"] + refs)
            # trial -> main subject (P921) -> vaccine
            print(clinical_trial_item + "|P921|" + row["wd_items"] + refs)
            # vaccine -> described by source (P1343) -> trial
            print(row["wd_items"] + "|P1343|" + clinical_trial_item + refs)



Q90693591|P4844|Q96695266|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q90693591|P921|Q96695266|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q96695266|P1343|Q90693591|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q97047732|P4844|Q96695266|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q97047732|P921|Q96695266|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q96695266|P1343|Q97047732|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q96055759|P4844|Q97154232|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q96055759|P921|Q97154232|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q97154232|P1343|Q96055759|S854|"https://covid-19tracker.milkeninstitute.org/"|S813|+2020-07-12T00:00:00Z/11
Q92274099|P4844|Q97154233|S854|
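
The trial column mixes identifiers from different registries (the Ad5-nCoV row lists both `NCT…` and `ChiCTR…` IDs), while the query above only looks at P3098. A sketch of choosing the property from the identifier's prefix follows. `registry_property` and `split_trial_ids` are hypothetical helpers, the sample string is adapted from the Ad5-nCoV row, and the sketch assumes that ChiCTR identifiers are stored under P8064 in the same form as they appear in the tracker:

```python
REGISTRY_PROPERTIES = {
    "NCT": "P3098",     # NCT ID
    "ChiCTR": "P8064",  # Chinese Clinical Trial Registry ID
}


def registry_property(trial_id):
    """Return the Wikidata property for a trial ID, or None if unrecognised."""
    for prefix, prop in REGISTRY_PROPERTIES.items():
        if trial_id.startswith(prefix):
            return prop
    return None


def split_trial_ids(cell):
    """Split the tracker's trial column on whitespace and drop commas."""
    return [token.replace(",", "") for token in cell.split() if token.strip(",")]


for trial_id in split_trial_ids("NCT04313127 ChiCTR2000030906, ChiCTR2000031781"):
    print(trial_id, registry_property(trial_id))
```

The returned property could then be substituted for the hard-coded `wdt:P3098` in the query, so that ChiCTR trials are looked up as well.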

In [76]:
vax_df_with_clinical_trials[["wd_enlabel", "wd_items"]]

Unnamed: 0,wd_enlabel,wd_items
266,Inovio Pharmaceuticals COVID-19 vaccine candidate,Not found
271,Zydus Cadila Healthcare Limited COVID-19 vacc...,Q97154000
273,Osaka University COVID-19 vaccine candidate,Q97154226
274,Genexine Consortium (GenNBio COVID-19 vaccine ...,Q97154228
279,Beijing Institute of Biological Products COVID...,Q97154229
280,Wuhan Institute of Biological Products COVID-1...,Q97154230
282,Institute of Medical Biology COVID-19 vaccine ...,Q97154232
285,Sinovac COVID-19 vaccine candidate,Q97154233
300,CanSino Biologics COVID-19 vaccine candidate,Q96695265
309,Consortium of the Jenner Institute COVID-19 va...,Q95042269


'Q90693591'