
## How is this notebook structured?


### Part zero: An explanation of what Wikidata is and what we can do once the statements are available in an open, interlinked data format.



### Second part: Conversion of statements into Wikidata format.
* 2.0 -> A quick intro to Wikidata.
* Goals:
  * Manually convert the statements that originated from the amazing work of other Kagglers into Wikidata format.
  * Figure out ways to automatically extract relationships between terms from these datasets.



# Part 2.0 --> Wikidata: why it can help accelerate information exchange and synthesis
​
​
Wikidata is an open database that anyone can edit, run by the Wikimedia Foundation (the same organization that runs Wikipedia). Even though open editing may lead to mistakes here and there, it also enables fast correction and updating of information, which is essential when novel information arises on a daily basis.

Long story short: statements on Wikidata allow the application of reasoning and inference algorithms and the integration of diverse kinds of knowledge in one place. Just the kind of multidisciplinarity we want for *actually* understanding Covid-19.
​
​
#### What are statements in Wikidata 
​
​
Statements, that is what we like. Statements in Wikidata are composed of three parts that form a triple: an item, a property and a value. Let's go step by step.
​
#### **Items**:
​
The first of the three is called an [item](https://www.wikidata.org/wiki/Help:Items). Items can be pretty much anything, from physical entities to classes of objects. This gets much clearer with examples:

* The city of Wuhan is an item --> [Wuhan](https://www.wikidata.org/wiki/Q11746)
* The SARS-CoV-2 strain is an item --> [SARS-CoV-2](https://www.wikidata.org/wiki/Q82069695)
* The coronavirus pandemic in each specific place is an item --> [COVID-19 pandemic in Italy](https://www.wikidata.org/wiki/Q84104992)

Pretty much everything that has its own Wikipedia page has a Wikidata item. And many items don't have a Wikipedia page, so almost everything can be represented in Wikidata.

(Wikidata is an ongoing effort, and I believe no one is really sure of what *cannot* be modeled there.)
​
Great, so we have items! **Anyone** can create items, which means that if something is not there *you* can create it! And from the point you create the item, anyone in the world can use it. Amazing, isn't it?
​
That is already useful for semantically tagging sentences. For example (from Wikipedia):
​
*SARS-CoV-2 is the cause of the ongoing pandemic of coronavirus disease 2019 (COVID-19).*

We have the items [SARS-CoV-2](https://www.wikidata.org/wiki/Q82069695) and [COVID-19 pandemic](https://www.wikidata.org/wiki/Q81068910) in Wikidata, which represent some of the concepts there.
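To make the idea of semantic tagging concrete, here is a minimal sketch in plain Python. The tiny `LEXICON` and the `tag_sentence` helper are made up for illustration (they are not part of any Wikidata tool); only the QIDs come from the links in this notebook.

```python
import re

# Hypothetical mini-lexicon: surface forms mapped to QIDs linked in this notebook.
LEXICON = {
    "SARS-CoV-2": "Q82069695",
    "Wuhan": "Q11746",
}


def tag_sentence(sentence, lexicon=LEXICON):
    """Return (term, QID) pairs for every lexicon term found in the sentence."""
    tags = []
    for term, qid in lexicon.items():
        # Match the term only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, sentence):
            tags.append((term, qid))
    return tags


sentence = (
    "SARS-CoV-2 is the cause of the ongoing pandemic of "
    "coronavirus disease 2019 (COVID-19)."
)
print(tag_sentence(sentence))
```

A real pipeline would need a much larger lexicon and some disambiguation, but the output has the same shape: terms in the text paired with Wikidata identifiers.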
​
Each item is identified by the letter Q followed by a number, assigned in the order in which items are created.

But this is not enough: we want to **link** concepts! That is where properties come in.
​
​
#### **Properties and Values**:
​
Properties are rigorous ways to describe items. For example, every major city has a population number from local census. That is something that we want to have on Wikidata. 
So, in Wikidata, links are made between items and specific values. [Wuhan](https://www.wikidata.org/wiki/Q11746), for example, is described by a series of statements such as:
​
* [Capital city of](https://www.wikidata.org/wiki/Property:P1376) --> [Hubei](https://www.wikidata.org/wiki/Q46862)
​
* [Population](https://www.wikidata.org/wiki/Property:P1082) --> 11 895 000
​
* [Timezone](https://www.wikidata.org/wiki/Property:P421) --> [UTC+8](https://www.wikidata.org/wiki/Q6985)
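One way to picture these statements is as plain (item, property, value) triples. The sketch below is only an illustration of that structure: the `Statement` class and the `values_for` helper are hypothetical names, and the identifiers are the ones linked in the list above.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    item: str   # QID of the thing being described
    prop: str   # property ID (P...)
    value: str  # another QID or a literal value


WUHAN = "Q11746"

# The three statements about Wuhan listed above, written as triples.
statements = [
    Statement(WUHAN, "P1376", "Q46862"),    # capital city of --> Hubei
    Statement(WUHAN, "P1082", "11895000"),  # population
    Statement(WUHAN, "P421", "Q6985"),      # timezone --> UTC+8
]


def values_for(item, prop, stmts):
    """Return every value attached to `item` through property `prop`."""
    return [s.value for s in stmts if s.item == item and s.prop == prop]


print(values_for(WUHAN, "P1376", statements))
```

Because the value of one triple can be the item of another, a pile of such triples naturally forms a graph.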
​
​
There are thousands of properties linking items to values. Values can themselves be items, so a statement can link two items together. This is super cool, because you end up with an interconnected network of knowledge. And this knowledge network is formal enough for machines to understand.
​
* *Wait, what do you mean by understand it?*
​
What I mean is that all this knowledge is available and searchable in many user friendly formats. 
​
#### Using Wikidata statements
​
There are many tools built on top of Wikidata. Many Wikipedia pages have automatic infoboxes derived from Wikidata statements, for example. 
​
The one of greatest use for us here is the SPARQL query service, which is a way to query the Wikidata database either from the command line (via API calls) or [directly in your browser](https://query.wikidata.org/).

A few examples of queries related to Covid-19 on Wikidata:
* [Image grid of individuals who have died from COVID-19](https://w.wiki/LZF)
* [List of coronaviruses](https://w.wiki/LZG) 
​
With this query service you can combine different properties and ask questions such as:
* "Which molecules interact with the ACE2 receptor and are used as a treatment for some disease?"
* "Which viruses are catalogued as the cause of a pandemic item?"

As Wikidata has information about all kinds of stuff, queries are limited only by the user's imagination (and database completeness, of course). That's why it is important to keep Wikidata updated with information that is as accurate and complete as possible.
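As a rough sketch of what an API call to the query service can look like, the snippet below builds a simple query from a property and a value and sends it with the Python standard library. The helper names (`build_query`, `run_query`) and the User-Agent string are placeholders of mine; the property and item IDs are the "capital city of" and Hubei identifiers used earlier in this notebook.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(prop, value_qid, limit=10):
    """Build a SPARQL query for items linked to `value_qid` through `prop`."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:{prop} wd:{value_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def run_query(query):
    """Send the query to the public endpoint and return the result bindings."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        # Placeholder User-Agent; replace it with something that identifies you.
        headers={"User-Agent": "wikidata-notebook-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


# Which items are the capital city of (P1376) Hubei (Q46862)?
query = build_query("P1376", "Q46862")
print(query)
# rows = run_query(query)  # uncomment to actually call the live service
```

The network call is left commented out so that the cell runs offline; the printed query can also be pasted directly into the browser interface.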

# Part 2.1 -->  Conversion of statements to Wikidata 



Example of statement:
    
From [David Mezzetti's notebook](https://www.kaggle.com/davidmezzetti/cord-19-virus-genetics-origin-and-evolution#Evidence-that-livestock-could-serve-as-a-reservoir-after-the-epidemic-appears-to-be-over.):

"Their research suggests that bats may be the reservoir host for 2019-nCoV." [Sun et al](https://www.medrxiv.org/content/10.1101/2020.02.18.20024539v2)

The equivalent Wikidata statement in a [QuickStatements](https://www.wikidata.org/wiki/Help:QuickStatements)-ready format:

[Q82069695](https://www.wikidata.org/wiki/Q82069695)|[P1605](https://www.wikidata.org/wiki/Property:P1605)|[Q28425](https://www.wikidata.org/wiki/Q28425)|[S854](https://www.wikidata.org/wiki/Property:P854)|"https://www.medrxiv.org/content/10.1101/2020.02.18.20024539v2"
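Writing such lines by hand gets tedious quickly, so here is a small sketch of a helper that assembles them. `to_quickstatement` is a hypothetical function of mine, not part of QuickStatements; it simply joins the parts with `|` the way the line above does, and it assumes the only reference we ever attach is a reference URL.

```python
def to_quickstatement(item, prop, value, reference_url=None):
    """Build one pipe-separated QuickStatements line, optionally with a source URL."""
    parts = [item, prop, value]
    if reference_url is not None:
        # S854 is the source form of P854 (reference URL); string values are double-quoted.
        parts += ["S854", f'"{reference_url}"']
    return "|".join(parts)


line = to_quickstatement(
    "Q82069695",  # SARS-CoV-2
    "P1605",
    "Q28425",
    "https://www.medrxiv.org/content/10.1101/2020.02.18.20024539v2",
)
print(line)
```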





# Work in progress

To do:

* Get statements from different notebooks in Kaggle.
* Make a Wikidata gold standard.
* Use spaCy to extract known Wikidata claims from matching statements.
* Build models to try and automate the statement-to-claim transition.

# Part 2.2 -->  Making sure articles themselves are represented in Wikidata.


Step by step:

* Check the articles described in metadata.csv;
* Get Wikidata IDs for those articles;
* Add the ones that are not present in Wikidata.
    




In [1]:
import pandas as pd

In [11]:
metadata = pd.read_csv("./input/CORD-19-research-challenge/metadata.csv")



Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,url
45764,xupeqtx3,a186e1e74616d4936c8de93c42a857c4cb9d1edf,Elsevier,One-tube smart genetic testing via coupling is...,10.1016/j.aca.2020.01.068,PMC7094703,32145848.0,els-covid,Abstract Urgent demand for portable diagnosis ...,2020-04-15,"Guo, Lulu; Lu, Baiyang; Dong, Qing; Tang, Yida...",Analytica Chimica Acta,,,True,custom_license,https://doi.org/10.1016/j.aca.2020.01.068
45765,f6mlvl56,efd9f0bbc3ac52b299b2799aa8d72cd9a5b55ccf,Elsevier,Sumoylation of the nucleocapsid protein of sev...,10.1016/j.febslet.2005.03.039,PMC7094623,15848177.0,els-covid,Abstract Severe acute respiratory syndrome cor...,2005-04-25,"Li, Frank Qisheng; Xiao, Han; Tam, James P.; L...",FEBS Letters,,,True,custom_license,https://doi.org/10.1016/j.febslet.2005.03.039
45766,8f6gpdy2,f9d941d30a663db32ceabe367cf36b6f3c2c744c; 1f19...,Elsevier,Modulation of influenza virus replication by a...,10.1016/j.antiviral.2008.05.008,PMC2614658,18585796.0,els-covid,"Abstract In recent years, increasing levels of...",2008-11-30,"Hoffmann, H.-Heinrich; Palese, Peter; Shaw, Me...",Antiviral Research,,,True,custom_license,https://doi.org/10.1016/j.antiviral.2008.05.008
45767,b7e9grj0,889ba9338ea71cd42c3bc675db30a1928d487f43; d38e...,Elsevier,Relative immunogenicity and protection potenti...,10.1016/j.vaccine.2008.01.024,PMC2288748,18291562.0,els-covid,"Summary Yersinia Pestis outer proteins, plasmi...",2008-03-20,"Wang, Shixia; Joshi, Swati; Mboudjeka, Innocen...",Vaccine,,,True,custom_license,https://doi.org/10.1016/j.vaccine.2008.01.024
45768,6b1y7yxg,f81692543d3e35858911cea48c298bfa23b20bc6,Elsevier,Quality of life and psychological status in su...,10.1016/j.jpsychores.2005.08.020,PMC7094294,16650592.0,els-covid,Abstract Background Little is known about the ...,2006-05-31,"Kwek, Seow-Khee; Chew, Wuen-Ming; Ong, Kian-Ch...",Journal of Psychosomatic Research,,,True,custom_license,https://doi.org/10.1016/j.jpsychores.2005.08.020
45769,4360s2yu,289deae0b2050aa259a05ba84565a4df82fa099a,Elsevier,Personal Protective Equipment: Protecting Heal...,10.1016/j.clinthera.2015.07.007,PMC4661082,26452427.0,els-covid,Abstract Purpose The recent Ebola epidemic tha...,2015-11-01,"Fischer, William A.; Weber, David J.; Wohl, Da...",Clinical Therapeutics,,,True,custom_license,https://doi.org/10.1016/j.clinthera.2015.07.007
45770,66jumbir,21a4369f83891bf6975dd916c0aa495d5df8709e,Elsevier,Viruses and asthma,10.1016/j.bbagen.2011.01.012,PMC3130828,21291960.0,els-covid,Abstract Background Viral respiratory infectio...,2011-11-30,"Dulek, Daniel E.; Peebles, R. Stokes",Biochimica et Biophysica Acta (BBA) - General ...,,,True,custom_license,https://doi.org/10.1016/j.bbagen.2011.01.012
45771,3wk36h9p,,Elsevier,Why the WHO won't use the p-word,10.1016/s0262-4079(20)30474-7,,,els-covid,"There are no criteria for a pandemic, but covi...",2020-03-07,"MacKenzie, Debora",New Scientist,,#5716,False,custom_license,https://doi.org/10.1016/s0262-4079(20)30474-7
45772,0ujw0gak,,WHO,"Communication, transparency key as Canada face...",10.1503/cmaj.1095846,PMC7030882,32071113.0,unk,,2020-02-17,"Glauser, Wendy",Canadian Medical Association Journal,1953688000.0,#4117,False,,https://doi.org/10.1503/cmaj.1095846
45773,28vx9w58,3369a14e1d116943f48b3a33597796c9802de279; f523...,PMC,Searching for animal models and potential targ...,10.1016/j.onehlt.2017.03.001,PMC5454147,28616501.0,cc-by-nc-nd,Emerging and re-emerging pathogens represent a...,2017-03-03,"Vergara-Alert, Júlia; Vidal, Enric; Bensaid, A...",One Health,,,True,noncomm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...


In [13]:
metadata.head(10)

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,url
0,vho70jcx,f056da9c64fbf00a4645ae326e8a4339d015d155,biorxiv,SIANN: Strain Identification by Alignment to N...,10.1101/001727,,,biorxiv,Next-generation sequencing is increasingly bei...,2014-01-10,Samuel Minot; Stephen D Turner; Krista L Ternu...,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/001727
1,i9tbix2v,daf32e013d325a6feb80e83d15aabc64a48fae33,biorxiv,Spatial epidemiology of networked metapopulati...,10.1101/003889,,,biorxiv,An emerging disease is one infectious epidemic...,2014-06-04,Lin WANG; Xiang Li,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/003889
2,62gfisc6,f33c6d94b0efaa198f8f3f20e644625fa3fe10d2,biorxiv,Sequencing of the human IG light chain loci fr...,10.1101/006866,,,biorxiv,Germline variation at immunoglobulin gene (IG)...,2014-07-03,Corey T Watson; Karyn Meltz Steinberg; Tina A ...,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/006866
3,058r9486,4da8a87e614373d56070ed272487451266dce919,biorxiv,Bayesian mixture analysis for metagenomic comm...,10.1101/007476,,,biorxiv,Deep sequencing of clinical samples is now an ...,2014-07-25,Sofia Morfopoulou; Vincent Plagnol,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/007476
4,wich35l7,eccef80cfbe078235df22398f195d5db462d8000,biorxiv,Mapping a viral phylogeny onto outbreak trees ...,10.1101/010389,,,biorxiv,Developing methods to reconstruct transmission...,2014-11-11,Stephen P Velsko; Jonathan E Allen,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/010389
5,z3tgnzth,c41fdb2efd6d61384a92a84cbba3f8233629a41b,biorxiv,The infant airway microbiome in health and dis...,10.1101/012070,,,biorxiv,The nasopharynx (NP) is a reservoir for microb...,2014-12-02,Shu Mei Teo; Danny Mok; Kym Pham; Merci Kusel;...,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/012070
6,1xxrnpg3,1dd898b5ca1ae70ec0e3cad89fc87a165002a99e,biorxiv,Using heterogeneity in the population structur...,10.1101/017178,,,biorxiv,"ABSTRACTIn 2013, U.S. swine producers were con...",2015-03-27,Eamon B. O’Dea; Harry Snelson; Shweta Bansal,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/017178
7,8ilzm51q,33565294e6bc67fb7ee14dcae6cfdb08148f4ea5,biorxiv,"Big city, small world: Density, contact rates,...",10.1101/018481,,,biorxiv,Macroscopic descriptions of populations common...,2015-04-27,Moritz U. G. Kraemer; T. Alex Perkins; Derek A...,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/018481
8,wafvnbdu,3461d71f6890f7e5ba53bf168be3945cdb16d901,biorxiv,MERS-CoV recombination: implications about the...,10.1101/020834,,,biorxiv,Recombination is a process that unlinks neighb...,2015-06-12,Gytis Dudas; Andrew Rambaut,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/020834
9,4xocqn6o,1f9d3f9a1a0e8db6a086e0a2b5ba50cf9f235dae,biorxiv,On the causes of evolutionary transition:trans...,10.1101/027722,,,biorxiv,A pattern in which nucleotide transitions are ...,2015-09-28,Arlin Stoltzfus; Ryan W. Norris,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/027722


In [73]:
# For starters, we will only work with articles that have a pubmed_id

metadata_only_with_pubmed_ids = metadata[~metadata["pubmed_id"].isnull()]
metadata_only_with_pubmed_ids.reset_index(inplace = True) 
metadata_only_with_pubmed_ids.head()

Unnamed: 0,index,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,url
0,1218,65b267ic,00e5a723d44eb9f2698c38b518eff85c00f9753b,CZI,Six weeks into the 2019 coronavirus disease (C...,10.1097/cm9.0000000000000760,,32097202.0,cc-by-nc-nd,,2020,"Harypursat, Vijay; Chen, Yao-Kai",Chin Med J (Engl),3005943000.0,#1985,True,noncomm_use_subset,https://doi.org/10.1097/cm9.0000000000000760
1,1219,5pqkuwb2,0938d2fb07611897abf38cea727ddbeea77b73d9,CZI,Backcalculating the Incidence of Infection wit...,10.3390/jcm9030657,,32121356.0,cc-by,To understand the time-dependent risk of infec...,2020,"Nishiura, Hiroshi",J Clin Med,3005847000.0,#3329,True,comm_use_subset,https://doi.org/10.3390/jcm9030657
2,1222,o877uul1,12d267205009c178b6a50506db717ff650d93415,CZI,A pneumonia outbreak associated with a new cor...,10.1038/s41586-020-2012-7,,32015507.0,cc-by,"Since the SARS outbreak 18 years ago, a large ...",2020,"Zhou, Peng; Yang, Xing-Lou; Wang, Xian-Guang; ...",Nature,3004280000.0,#246,True,comm_use_subset,https://doi.org/10.1038/s41586-020-2012-7
3,1223,cja8i0hw,140e6d0298bfcd1e825a4b81dcabc50d1658357a,CZI,The Novel Coronavirus: A Bird's Eye View,10.15171/ijoem.2020.1921,,32020915.0,cc-by-nc-sa,"The novel coronavirus (2019-nCoV) outbreak, wh...",2020,"Habibzadeh, Parham; Stoneman, Emily K.",Int J Occup Environ Med,3004736000.0,#319,True,noncomm_use_subset,https://doi.org/10.15171/ijoem.2020.1921
4,1224,g9wmlvnq,147de820d90c0ce89fb5ae6836ea1794b808fdf2,CZI,Voice from China: nomenclature of the novel co...,10.1097/cm9.0000000000000787,,32118646.0,cc-by-nc-nd,,2020,,Chin Med J (Engl),2966143000.0,#3475,True,noncomm_use_subset,https://doi.org/10.1097/cm9.0000000000000787


In [61]:
from wikidataintegrator import wdi_core


    
    
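def get_qid_by_pubmedid(pubmed_id):
    """Return the QID of the Wikidata item with this PubMed ID (P698), or 'not found'."""
    query_code = '''

    SELECT ?item ?itemLabel 
    WHERE 
    {
      ?item wdt:P698 "''' + str(pubmed_id) + '''" .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }

    '''
    query_results = wdi_core.WDItemEngine.execute_sparql_query(query_code)

    try:
        # The item URI looks like http://www.wikidata.org/entity/Q...; keep only the QID.
        wikidata_item = query_results['results']['bindings'][0]
        return wikidata_item["item"]['value'].split("/")[4]
    except (KeyError, IndexError, TypeError):
        # No binding came back for this PubMed ID.
        return 'not found'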
    

In [54]:
get_qid_by_pubmedid(15848177)


'Q33292110'

In [None]:
import time

wikidata_ids = []
metadata_only_with_pubmed_ids.info()

for index, row in metadata_only_with_pubmed_ids.iterrows():
    pubmed_id = round(row["pubmed_id"])
    qid = get_qid_by_pubmedid(pubmed_id)
    wikidata_ids.append(qid)
    time.sleep(0.2)
    if index % 100 == 0:
        print(index)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34641 entries, 0 to 34640
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   index                        34641 non-null  int64  
 1   cord_uid                     34641 non-null  object 
 2   sha                          24743 non-null  object 
 3   source_x                     34641 non-null  object 
 4   title                        34641 non-null  object 
 5   doi                          31772 non-null  object 
 6   pmcid                        25325 non-null  object 
 7   pubmed_id                    34641 non-null  float64
 8   license                      34641 non-null  object 
 9   abstract                     31145 non-null  object 
 10  publish_time                 34641 non-null  object 
 11  authors                      34322 non-null  object 
 12  journal                      31987 non-null  object 
 13  Microsoft Academ

In [77]:
metadata_only_with_pubmed_ids["wikidata_id"] = wikidata_ids


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [81]:
metadata_only_with_pubmed_ids.to_csv("metadata_annotated_with_qids.csv")
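Sending one request per article is simple but slow for tens of thousands of rows. A possible alternative, sketched here under my own assumptions, is to ask for many PubMed IDs in a single query using a SPARQL `VALUES` clause. The helpers `build_batch_query`, `chunks` and `bindings_to_mapping` are hypothetical; each query string could be passed to `wdi_core.WDItemEngine.execute_sparql_query` as before, and a sensible batch size would need to be found by trial.

```python
def build_batch_query(pubmed_ids):
    """Build one SPARQL query that looks up several PubMed IDs (P698) at once."""
    # pandas stores the IDs as floats (e.g. 15848177.0), so cast to int first.
    values = " ".join(f'"{int(pmid)}"' for pmid in pubmed_ids)
    return (
        "SELECT ?item ?pmid WHERE {\n"
        f"  VALUES ?pmid {{ {values} }}\n"
        "  ?item wdt:P698 ?pmid .\n"
        "}"
    )


def chunks(seq, size):
    """Yield consecutive slices of `seq` holding at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def bindings_to_mapping(bindings):
    """Turn SPARQL JSON bindings into a {pubmed_id: QID} dictionary."""
    return {
        b["pmid"]["value"]: b["item"]["value"].rsplit("/", 1)[-1]
        for b in bindings
    }


example_ids = [15848177.0, 32145848.0, 18585796.0]
for batch in chunks(example_ids, 2):
    print(build_batch_query(batch))
```

IDs that are missing from the resulting mapping are the ones that would have been marked as 'not found' by the loop above.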

Okay, now we have a series of articles that would be awesome to have on Wikidata. Let's look at what information we have on them.


In [89]:
qless_articles = metadata_only_with_pubmed_ids[metadata_only_with_pubmed_ids["wikidata_id"] == "not found"]

In [92]:
sum(metadata_only_with_pubmed_ids["wikidata_id"] == "not found")

2928

In [88]:
qless_articles.head(3)

Unnamed: 0,index,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,url,wikidata_id
0,1218,65b267ic,00e5a723d44eb9f2698c38b518eff85c00f9753b,CZI,Six weeks into the 2019 coronavirus disease (C...,10.1097/cm9.0000000000000760,,32097202.0,cc-by-nc-nd,,2020,"Harypursat, Vijay; Chen, Yao-Kai",Chin Med J (Engl),3005943000.0,#1985,True,noncomm_use_subset,https://doi.org/10.1097/cm9.0000000000000760,not found
1,1219,5pqkuwb2,0938d2fb07611897abf38cea727ddbeea77b73d9,CZI,Backcalculating the Incidence of Infection wit...,10.3390/jcm9030657,,32121356.0,cc-by,To understand the time-dependent risk of infec...,2020,"Nishiura, Hiroshi",J Clin Med,3005847000.0,#3329,True,comm_use_subset,https://doi.org/10.3390/jcm9030657,not found
3,1223,cja8i0hw,140e6d0298bfcd1e825a4b81dcabc50d1658357a,CZI,The Novel Coronavirus: A Bird's Eye View,10.15171/ijoem.2020.1921,,32020915.0,cc-by-nc-sa,"The novel coronavirus (2019-nCoV) outbreak, wh...",2020,"Habibzadeh, Parham; Stoneman, Emily K.",Int J Occup Environ Med,3004736000.0,#319,True,noncomm_use_subset,https://doi.org/10.15171/ijoem.2020.1921,not found


Now our goal is to map each of the columns to QuickStatements-compatible statements.

* *title* = The item label and https://www.wikidata.org/wiki/Property:P1476 (string processing)
* *doi* = P356 (string processing)
* *pubmed_id* = P698 (string processing)
* *journal* = P1433 (string to item matching)
* *authors* =
  * P2093 (author name string) (string processing)
  * P50 (author) (string to item matching, perhaps using https://github.com/arthurpsmith/author-disambiguator)
* *publish_time* = P577 (external matching)
* *volume* = P478 (external matching)
* *page* = P304 (external matching)
* *number* = P433 (external matching)
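For the columns that only need string processing, the mapping could be sketched roughly as follows. This is an untested-against-Wikidata illustration: `row_to_quickstatements` and `is_missing` are hypothetical helpers, the title is assumed to be in English, double quotes inside titles are crudely swapped for single quotes, and the columns that need item matching or external data (journal, P50, publish_time, volume, page, number) are left out.

```python
def is_missing(value):
    """None, empty strings and NaN (the only value not equal to itself) count as missing."""
    return value is None or value == "" or value != value


def row_to_quickstatements(row):
    """Turn one metadata row (dict or pandas Series) into QuickStatements lines."""
    # Simplification: avoid breaking the quoting by replacing inner double quotes.
    title = str(row["title"]).replace('"', "'")
    lines = [
        "CREATE",
        f'LAST|Len|"{title}"',        # English label (assumes an English title)
        f'LAST|P1476|en:"{title}"',   # title, as monolingual text
    ]
    doi = row.get("doi")
    if not is_missing(doi):
        # Wikidata expects DOIs in upper case.
        lines.append(f'LAST|P356|"{str(doi).upper()}"')
    pubmed_id = row.get("pubmed_id")
    if not is_missing(pubmed_id):
        # The column is read as float (32020915.0); Wikidata stores a plain string.
        lines.append(f'LAST|P698|"{int(float(pubmed_id))}"')
    authors = row.get("authors")
    if not is_missing(authors):
        for name in str(authors).split(";"):
            lines.append(f'LAST|P2093|"{name.strip()}"')
    return lines


example_row = {
    "title": "The Novel Coronavirus: A Bird's Eye View",
    "doi": "10.15171/ijoem.2020.1921",
    "pubmed_id": 32020915.0,
    "authors": "Habibzadeh, Parham; Stoneman, Emily K.",
}
print("\n".join(row_to_quickstatements(example_row)))
```

The generated lines should be reviewed by hand before being submitted, since anything sent to QuickStatements ends up as a public edit.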

