# Integration of human markers from PanglaoDB to Wikidata

- Tiago Lubiana
- University of São Paulo ID: 8945857


This notebook contains the code used for:
- matching the PanglaoDB markers dataset to Wikidata classes
- sending POST requests to the Wikidata API to update its database.


I've used the pandas module to handle dataframes. 

The gene and cell type reference datasets include the dictionaries for matching cell types and genes on PanglaoDB to previously existing classes on Wikidata. 



In [1]:
import pandas as pd 

gene_reference = pd.read_csv("../data/human_gene_reference_from_panglao_to_wikidata_04_11_2020.csv")

cell_type_reference = pd.read_csv("../data/cell_type_reference_from_panglao_to_wikidata_31_10_2020.csv")

markers = pd.read_csv("../data/PanglaoDB_markers_27_Mar_2020.tsv", sep="\t")

PanglaoDB contains markers for both human and mouse cell types. This work focuses only on the human cell types. 

We therefore trim the markers data frame so that it contains only the human-related assertions. 

In [2]:
# Keep rows whose species field mentions "Hs"; na=False treats missing values as non-matches.
human_markers = markers[markers["species"].str.contains("Hs", na=False)]
human_markers_lean = human_markers[["official gene symbol", "cell type"]]
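
The species filter can be tried on a small stand-in table before running it on the full file. The sketch below is illustrative only: the rows are made up, and the species codes ("Mm Hs", "Hs", "Mm") are an assumption about how the `species` column is encoded.

```python
import pandas as pd

# Toy stand-in for the PanglaoDB markers table (illustrative rows, not real data).
toy_markers = pd.DataFrame(
    {
        "species": ["Mm Hs", "Hs", "Mm", None],
        "official gene symbol": ["CEBPA", "CSF1R", "MUC1", "OMP"],
        "cell type": ["Hepatocytes", "Microglia", "Ductal cells", "Neurons"],
    }
)

# Rows that mention "Hs" are kept; missing species values count as non-matches.
is_human = toy_markers["species"].str.contains("Hs", na=False)
toy_human = toy_markers.loc[is_human, ["official gene symbol", "cell type"]]

print(toy_human)
```

Only the first two rows survive: the mouse-only row and the row with a missing species value are dropped.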


Before this work, Wikidata did not contain human-specific cell types. 

The reference dictionary was constructed for species-neutral cell types.

In a later step, human-specific cell types were added to Wikidata. 

The SPARQL query below asks the Wikidata endpoint for human-specific cell types together with their species-neutral superclasses. 



In [3]:
from wikidata2df import wikidata2df

query = """
SELECT ?item ?itemLabel ?superclass
WHERE
{
?item wdt:P31 wd:Q189118. # instance of cell type
?item wdt:P703 wd:Q15978631. # found in taxon Homo sapiens

?item wdt:P279 ?superclass. # subclass of ?superclass
?superclass  wdt:P31 wd:Q189118. # ?superclass is a cell type

?item rdfs:label ?itemLabel.
FILTER(LANG(?itemLabel) = "en") # keep one (English) label per item

}
"""

dataframe_to_join = wikidata2df(query)
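
For readers unfamiliar with what comes back from a SPARQL endpoint, the sketch below flattens a response in the standard SPARQL JSON results format into plain rows. It is not a description of how `wikidata2df` works internally; `bindings_to_rows` is a hypothetical helper, and the sample response is hand-written from the first rows shown in this notebook.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def bindings_to_rows(response):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    variables = response["head"]["vars"]
    rows = []
    for binding in response["results"]["bindings"]:
        row = {}
        for var in variables:
            # A variable may be unbound in a given row; keep it as None.
            value = binding.get(var, {}).get("value")
            # Shorten entity URIs to bare QIDs, e.g. ".../entity/Q463418" -> "Q463418".
            if value is not None and value.startswith(ENTITY_PREFIX):
                value = value[len(ENTITY_PREFIX):]
            row[var] = value
        rows.append(row)
    return rows


# Hand-written sample in the SPARQL 1.1 JSON results layout.
sample_response = {
    "head": {"vars": ["item", "itemLabel", "superclass"]},
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": ENTITY_PREFIX + "Q101404861"},
                "itemLabel": {"type": "literal", "xml:lang": "en", "value": "human fibroblast"},
                "superclass": {"type": "uri", "value": ENTITY_PREFIX + "Q463418"},
            }
        ]
    },
}

rows = bindings_to_rows(sample_response)
print(rows)
```

A list of dicts like this can be passed straight to `pd.DataFrame` if a data frame is needed.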

In [4]:
cell_type_reference = cell_type_reference.merge(dataframe_to_join, left_on="wikidata", right_on="superclass")

cell_type_reference.head()

Unnamed: 0.1,Unnamed: 0,panglao,wikidata,item,superclass,itemLabel
0,0,Basal cells,Q101062513,Q101404853,Q101062513,human basal cell
1,1,Trigeminal neurons,Q101062590,Q101404857,Q101062590,human trigeminal neuron
2,2,Juxtaglomerular cells,Q2596226,Q101404858,Q2596226,human juxtaglomerular cell
3,3,Pancreatic stellate cells,Q1164962,Q101404859,Q1164962,human pancreatic stellate cell
4,4,Fibroblasts,Q463418,Q101404861,Q463418,human fibroblast


In [5]:
human_markers_lean = human_markers_lean.merge(cell_type_reference, left_on="cell type", right_on="panglao")[["official gene symbol", "cell type", "item"]]
human_markers_lean.columns = ["official gene symbol", "cell type", "cell type id"]


human_markers_lean = human_markers_lean.merge(gene_reference, left_on="official gene symbol", right_on="panglao")[["official gene symbol", "cell type", "cell type id", "wikidata"]]
human_markers_lean.columns = ["official gene symbol", "cell type (general)", "cell type id (human)", "gene id"]

human_markers_lean = human_markers_lean.drop_duplicates()
human_markers_lean.head()

Unnamed: 0,official gene symbol,cell type (general),cell type id (human),gene id
0,CEBPA,Adipocyte progenitor cells,Q101404942,Q17861031
1,CEBPA,Basophils,Q101405089,Q17861031
2,CEBPA,Hepatoblasts,Q101404910,Q17861031
3,CEBPA,Hepatocytes,Q101405101,Q17861031
4,CEBPA,Microglia,Q101404888,Q17861031
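
The merges above are inner joins, so any PanglaoDB cell type or gene without an entry in the reference tables is dropped silently. One way to see what falls out is a left join with `indicator=True`. The sketch below uses made-up toy tables ("Unmapped cells" is a hypothetical label), not the real reference files.

```python
import pandas as pd

# Toy marker rows; "Unmapped cells" has no counterpart in the reference table.
toy_markers = pd.DataFrame(
    {
        "official gene symbol": ["CEBPA", "CSF1R", "MUC1"],
        "cell type": ["Hepatocytes", "Microglia", "Unmapped cells"],
    }
)
toy_reference = pd.DataFrame(
    {
        "panglao": ["Hepatocytes", "Microglia"],
        "item": ["Q101405101", "Q101404888"],
    }
)

# A left join keeps every marker row; the "_merge" column records whether a match was found.
audit = toy_markers.merge(
    toy_reference, how="left", left_on="cell type", right_on="panglao", indicator=True
)
unmatched = audit.loc[audit["_merge"] == "left_only", "cell type"].unique().tolist()

print(unmatched)
```

The same check applies to the gene merge by swapping in the gene symbol column and the gene reference table.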


In [6]:
human_markers_lean.to_csv("../src/human_markers_to_add_to_wikidata_27_11_2020.csv")

Now that we have the formatted CSV with the proper identifiers, the next step is to add the assertions to Wikidata's database. 

For that, we will use WikidataIntegrator, a Python wrapper for the Wikidata API that lets us add triples to Wikidata. Each statement is linked to a reference, and the reference is itself expressed as property-value pairs. 

This code should not be run again, as the information is already on Wikidata. It is shown here for reference: 


```
from wikidataintegrator import wdi_core, wdi_login
from wikidataintegrator.wdi_helpers import try_write
import os
import pandas as pd
import pprint
from IPython.display import clear_output
from getpass import getpass

WBUSER = getpass(prompt="username:")  
WBPASS = getpass(prompt='Enter your password: ')  
login = wdi_login.WDLogin(WBUSER, WBPASS)
```

After entering the username and password at the prompts, each row of the table is written to Wikidata as a statement with its reference: 

```
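for i, row in human_markers_lean.iterrows():
    s = row["cell type id (human)"]   # subject: human cell type item
    p = "P8872"                       # property: has marker
    o = row["gene id"]                # object: gene item
    r1 = "P813"                       # reference property: retrieved
    or1 = "+2020-11-27T00:00:00Z"     # day precision is passed to WDTime below
    r2 = "P854"                       # reference property: reference URL
    or2 = "https://panglaodb.se/markers.html"
    r3 = "P248"                       # reference property: stated in
    or3 = "Q99936939"                 # PanglaoDB

    # The three reference values form a single reference block.
    reference = [wdi_core.WDTime(or1, prop_nr=r1, precision=11, is_reference=True),
                 wdi_core.WDUrl(or2, prop_nr=r2, is_reference=True),
                 wdi_core.WDItemID(or3, prop_nr=r3, is_reference=True)]

    statements = [wdi_core.WDItemID(value=o, prop_nr=p, references=[reference])]

    # append_value keeps marker statements already on the item instead of replacing them.
    item = wdi_core.WDItemEngine(wd_item_id=s, data=statements, append_value=[p])

    item.write(login)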

```
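
The mapping from a table row to a statement and its reference can also be laid out as plain data, which allows a dry run without touching the API. The sketch below is only an illustration of that mapping: `plan_statement` and the dictionary layout are hypothetical and are not part of WikidataIntegrator.

```python
import re

# Wikidata item identifiers look like "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def plan_statement(cell_type_id, gene_id):
    """Describe one 'has marker' statement and its reference as a plain dict."""
    for qid in (cell_type_id, gene_id):
        if not QID_PATTERN.match(qid):
            raise ValueError(f"not a Wikidata item id: {qid!r}")
    return {
        "subject": cell_type_id,
        "property": "P8872",  # has marker
        "value": gene_id,
        "reference": {
            "P248": "Q99936939",  # stated in: PanglaoDB
            "P854": "https://panglaodb.se/markers.html",  # reference URL
            "P813": "+2020-11-27T00:00:00Z",  # retrieved
        },
    }


# Two (cell type id, gene id) pairs taken from the table shown earlier.
pairs = [("Q101404942", "Q17861031"), ("Q101405089", "Q17861031")]
planned = [plan_statement(cell_type_id, gene_id) for cell_type_id, gene_id in pairs]

for statement in planned:
    print(statement["subject"], statement["property"], statement["value"])
```

Validating the identifiers up front means a malformed value, such as a gene symbol in place of a QID, fails before any write is attempted.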

To check that the edits went through, we retrieve the statements back from Wikidata via SPARQL. 


In [7]:
from wikidata2df import wikidata2df

query_for_markers = """
SELECT DISTINCT ?cellTypeLabel ?markerLabel ?dbLabel ?date
WHERE
{
  
     ?cellType p:P8872 ?statement.
   
     ?statement ps:P8872 ?marker.
     ?statement prov:wasDerivedFrom ?refnode.
     
     ?refnode   pr:P813 ?date.
     ?refnode   pr:P248 wd:Q99936939.
     ?refnode   pr:P248 ?db.
  
     ?cellType rdfs:label ?cellTypeLabel.
     ?marker   rdfs:label ?markerLabel.
     ?db rdfs:label ?dbLabel.

     FILTER(LANG(?cellTypeLabel) = "en")
     FILTER(LANG(?markerLabel) = "en")
     FILTER(LANG(?dbLabel) = "en")

}
"""

panglao_markers_on_wikidata = wikidata2df(query_for_markers)

In [8]:
panglao_markers_on_wikidata.head()

Unnamed: 0,dbLabel,date,cellTypeLabel,markerLabel
0,PanglaoDB,2020-11-27T00:00:00Z,human ductal cell,CTSH
1,PanglaoDB,2020-11-27T00:00:00Z,human ductal cell,AMBP
2,PanglaoDB,2020-11-27T00:00:00Z,human ductal cell,MUC1
3,PanglaoDB,2020-11-27T00:00:00Z,human ductal cell,KRT20
4,PanglaoDB,2020-11-27T00:00:00Z,human microglia,CSF1R


Now that the PanglaoDB markers are available on Wikidata as Linked Open Data, we can make queries that were not possible before.


Thanks to other reconciliation projects, Wikidata already contains information about genes, including their relations to Gene Ontology terms. 

Integrating PanglaoDB into the Wikidata ecosystem allows us to ask questions like:  

"Which human cell types are related to neurogenesis via their markers?"

In [9]:
from wikidata2df import wikidata2df

query_for_neurogenesis = """
SELECT ?geneLabel ?cellTypeLabel ?processLabel
WHERE 
{
  ?protein wdt:P682 wd:Q1456827. # protein is involved in the biological process neurogenesis
  ?protein wdt:P702 ?gene.       # protein encoded by gene
  
  {?gene wdt:P31 wd:Q277338.}    # gene is an instance of a pseudogene 
  UNION                          # or
  {?gene wdt:P31 wd:Q7187.}      # gene is an instance of a gene
  ?gene wdt:P703 wd:Q15978631.   # gene is found in taxon Homo sapiens
  
  ?cellType wdt:P8872 ?gene.     # cell type has marker gene
  
  ?cellType rdfs:label ?cellTypeLabel.
  ?gene   rdfs:label ?geneLabel.
  wd:Q1456827 rdfs:label ?processLabel.

  FILTER(LANG(?cellTypeLabel) = "en")
  FILTER(LANG(?geneLabel) = "en")
  FILTER(LANG(?processLabel) = "en")

}
"""

cell_types_related_to_neurogenesis = wikidata2df(query_for_neurogenesis)

In [10]:
cell_types_related_to_neurogenesis.head()

Unnamed: 0,cellTypeLabel,geneLabel,processLabel
0,human purkinje neuron,OMP,neurogenesis
1,human olfactory epithelial cell,OMP,neurogenesis
2,human neuron,OMP,neurogenesis
3,human delta cell,PCSK9,neurogenesis
4,human loop of henle cell,PCSK9,neurogenesis
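
The result table can be summarised further with pandas. The sketch below rebuilds only the five rows shown above by hand, so the counts are illustrative and do not reflect the full query result.

```python
import pandas as pd

# The five rows displayed above, re-entered by hand for illustration.
shown_rows = pd.DataFrame(
    {
        "cellTypeLabel": [
            "human purkinje neuron",
            "human olfactory epithelial cell",
            "human neuron",
            "human delta cell",
            "human loop of henle cell",
        ],
        "geneLabel": ["OMP", "OMP", "OMP", "PCSK9", "PCSK9"],
        "processLabel": ["neurogenesis"] * 5,
    }
)

# Number of distinct cell types that each neurogenesis-related gene marks.
cell_types_per_gene = shown_rows.groupby("geneLabel")["cellTypeLabel"].nunique()

# Number of distinct neurogenesis-related marker genes per cell type.
genes_per_cell_type = shown_rows.groupby("cellTypeLabel")["geneLabel"].nunique()

print(cell_types_per_gene)
print(genes_per_cell_type)
```

On the full result, sorting `genes_per_cell_type` in descending order would rank cell types by how many of their markers are annotated with neurogenesis.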
