Notebook for registering the migration of HuRI protein-protein interactions to Wikidata. 



Relevant items/values: 

* The item for the article ("A reference map of the human binary protein interactome"): Q91971850

* The dataset URL: http://www.interactome-atlas.org/data/HuRI.tsv

The dataset is licensed under CC-BY 4.0, and the source is cited on Wikidata. I am not sure how this license applies to the individual statements.
I can always revert the batch if that ends up being a problem.



In [1]:
import pandas as pd

In [11]:
huri = pd.read_csv("http://www.interactome-atlas.org/data/HuRI.tsv", sep="\t", header=None)

In [18]:

huri.columns = ["gene A", "gene B"]
huri.head()


,gene A,gene B
0,ENSG00000000005,ENSG00000061656
1,ENSG00000000005,ENSG00000099968
2,ENSG00000000005,ENSG00000104765
3,ENSG00000000005,ENSG00000105383
4,ENSG00000000005,ENSG00000114455


A table mapping Ensembl gene IDs to the Wikidata items for genes and the proteins these genes encode was built with a SPARQL query (https://w.wiki/UGb).
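For readers who cannot open the short link, here is a rough sketch of what such a query and its post-processing might look like. The query text is an assumption and not necessarily the one behind https://w.wiki/UGb; it relies on P594 (Ensembl gene ID), P688 (encodes) and P703 (found in taxon). `fetch_bindings` and `bindings_to_rows` are hypothetical helper names, and the variable names simply mirror the CSV columns used later in this notebook.

```python
import json
import urllib.parse
import urllib.request

# Assumed query, NOT necessarily the one behind https://w.wiki/UGb.
# P594 = Ensembl gene ID, P688 = encodes, P703 = found in taxon,
# Q15978631 = Homo sapiens.
QUERY = """
SELECT ?gene ?geneLabel ?ensemble_symbol ?protein WHERE {
  ?gene wdt:P594 ?ensemble_symbol ;
        wdt:P703 wd:Q15978631 ;
        wdt:P688 ?protein .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def fetch_bindings(query: str) -> list:
    """Send the query to the Wikidata Query Service and return the raw bindings."""
    url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "huri-notebook-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def bindings_to_rows(bindings: list) -> list:
    """Flatten SPARQL JSON bindings into plain dicts of column -> value."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in bindings
    ]


# Offline example: one hand-written binding in the SPARQL JSON results layout.
sample = [{
    "ensemble_symbol": {"type": "literal", "value": "ENSG00000000005"},
    "protein": {"type": "uri", "value": "http://www.wikidata.org/entity/Q21134652"},
}]
rows = bindings_to_rows(sample)
print(rows)
```

The resulting rows could be passed to `pd.DataFrame(rows)` and saved as CSV. The network call is kept in its own function so the parsing step can be tried without querying the live service.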

In [8]:
ensemble_to_wikidata = pd.read_csv("sparql_ensg_gene_protein.csv")

Some genes code for more than one protein. That is awesome, biologically, but it makes it hard to make precise statements about protein-protein interactions: HuRI identifies interactors by Ensembl gene ID, so for such genes it is unclear which protein is involved.

That is why we will keep only genes that are reported to encode exactly one protein.
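The filtering idea can be illustrated on a toy table. This is a minimal sketch with made-up gene labels and protein names (all hypothetical), using `drop_duplicates` with `keep=False` so that a gene appearing in more than one row is removed entirely rather than reduced to one arbitrary protein.

```python
import pandas as pd

# Toy mapping (hypothetical labels): GENE2 encodes two proteins.
toy = pd.DataFrame({
    "geneLabel": ["GENE1", "GENE2", "GENE2", "GENE3"],
    "protein": ["protA", "protB1", "protB2", "protC"],
})

# keep=False drops every row whose geneLabel occurs more than once,
# so GENE2 disappears completely instead of keeping its first protein.
unique = toy.drop_duplicates(subset="geneLabel", keep=False)
kept_genes = unique["geneLabel"].tolist()
print(kept_genes)
```

With the default `keep="first"`, GENE2 would stay in the table paired with one of its proteins, which is exactly the ambiguity this step is meant to avoid.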

In [24]:
ensemble_to_wikidata_unique_protein_per_gene = ensemble_to_wikidata.drop_duplicates(subset="geneLabel", keep=False)


# Take an explicit copy so that the assignment below does not raise a SettingWithCopyWarning.
ensemble_to_unique_protein_qid = ensemble_to_wikidata_unique_protein_per_gene[["ensemble_symbol", "protein"]].copy()

# The "protein" column holds entity URLs of the form http://www.wikidata.org/entity/Q...; keep only the QID.
ensemble_to_unique_protein_qid["protein"] = [url.split("/")[4] for url in ensemble_to_unique_protein_qid["protein"]]


In [35]:
huri_with_proteins_A = huri.merge(ensemble_to_unique_protein_qid, left_on="gene A", right_on="ensemble_symbol")

huri_with_proteins_A = huri_with_proteins_A[["gene A", "gene B", "protein"]]

huri_with_proteins_A.columns = ["gene A", "gene B", "protein A"]

In [38]:
huri_with_proteins_A_and_B = huri_with_proteins_A.merge(ensemble_to_unique_protein_qid, left_on="gene B", right_on="ensemble_symbol")

In [41]:
huri_with_proteins_A_and_B = huri_with_proteins_A_and_B[["gene A", "gene B", "protein A", "protein"]]
huri_with_proteins_A_and_B.columns = ["gene A", "gene B", "protein A", "protein B"]

In [43]:
huri_with_proteins_A_and_B

,gene A,gene B,protein A,protein B
0,ENSG00000000005,ENSG00000061656,Q21134652,Q21122808
1,ENSG00000000005,ENSG00000099968,Q21134652,Q21100296
2,ENSG00000088782,ENSG00000099968,Q21112136,Q21100296
3,ENSG00000089356,ENSG00000099968,Q21132070,Q21100296
4,ENSG00000099219,ENSG00000099968,Q21130397,Q21100296
...,...,...,...,...
23409,ENSG00000233822,ENSG00000272196,Q21110989,Q22677759
23410,ENSG00000233822,ENSG00000277075,Q21110989,Q21119044
23411,ENSG00000233822,ENSG00000278463,Q21110989,Q21119044
23412,ENSG00000255009,ENSG00000284662,Q21135161,Q21133408


In [62]:
from datetime import datetime

# QuickStatements date value; the trailing /11 means day precision.
today = datetime.now()
today_wikidata_format = today.strftime("+%Y-%m-%dT00:00:00Z/11")

# Reference parts: S854 (reference URL), S813 (retrieved), S248 (stated in).
reference_url = '|S854|"http://www.interactome-atlas.org/download"'
reference_access_date = "|S813|" + today_wikidata_format
reference_article = "|S248|Q91971850"
# Qualifier: P459 (determination method).
determination_method = "|P459|Q1337844"


with open("./huri.qs", "w") as file:
    for _, row in huri_with_proteins_A_and_B.iterrows():
        qid_A = row["protein A"]
        qid_B = row["protein B"]

        # P129 (physically interacts with) is written in both directions.
        statement = qid_A + "|P129|" + qid_B + determination_method + reference_url + reference_access_date + reference_article
        print(statement, file=file)
        statement_reverse = qid_B + "|P129|" + qid_A + determination_method + reference_url + reference_access_date + reference_article
        print(statement_reverse, file=file)
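Before submitting the batch, it may be worth checking that every generated line has the expected layout. The following is a minimal sketch, assuming the pipe-separated command layout written by the cell above; `QS_LINE`, `is_valid_line` and `find_invalid_lines` are hypothetical helper names and not part of QuickStatements, and the date in the example line is illustrative.

```python
import re

# Expected layout of each line written to huri.qs:
# subject|P129|object|P459|method|S854|"url"|S813|date|S248|article
QS_LINE = re.compile(
    r'^Q\d+\|P129\|Q\d+'
    r'\|P459\|Q\d+'
    r'\|S854\|"[^"|]+"'
    r'\|S813\|\+\d{4}-\d{2}-\d{2}T00:00:00Z/11'
    r'\|S248\|Q\d+$'
)


def is_valid_line(line: str) -> bool:
    """Return True if the line matches the expected command layout."""
    return QS_LINE.match(line.strip()) is not None


def find_invalid_lines(lines):
    """Return (line_number, text) pairs for lines that do not match."""
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if not is_valid_line(text)
    ]


# Example line; the access date is illustrative only.
example = (
    'Q21134652|P129|Q21122808|P459|Q1337844'
    '|S854|"http://www.interactome-atlas.org/download"'
    '|S813|+2020-06-01T00:00:00Z/11|S248|Q91971850'
)
print(is_valid_line(example))
```

To check the real file, something like `find_invalid_lines(open("./huri.qs"))` could be run after the export cell; an empty list would mean every line matches the pattern.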


        
    