# Check whether APID has been updated with HuRI data

High-quality binary PPI data is needed for this project. The APID human interactome dataset, restricted to interactions proven by at least one binary method (binary interactomes), is therefore going to be used.

http://cicblade.dep.usal.es:8080/APID/init.action

At the start of the project, a new high-quality PPI dataset (HuRI) was released. This file checks whether the APID database has been updated with data from the HuRI dataset.

http://www.interactome-atlas.org/download

In [1]:
import pandas as pd
import numpy as np

In [2]:
#open both datasets as pandas DataFrame
df_huri = pd.read_csv('../../Data/HuRi/HuRI_04_05_2020.tsv', header=None, sep='\t')
df_apid = pd.read_csv('../../Data/APID/HUMAN_INTACT_LVL2_FILTER_INT-SCECIES_04_05_2020.txt', sep='\t')

In [3]:
df_apid.head(3)

Unnamed: 0,InteractionID,UniprotID_A,UniprotName_A,GeneName_A,UniprotID_B,UniprotName_B,GeneName_B,ExpEvidences,Methods,Publications,3DStructures,CurationEvents
0,1818,P54727,RD23B_HUMAN,RAD23B,P55036,PSMD4_HUMAN,PSMD4,9,8,7,0,17
1,1819,P55036,PSMD4_HUMAN,PSMD4,Q9UMX0,UBQL1_HUMAN,UBQLN1,9,5,6,0,15
2,1826,Q9UMX0,UBQL1_HUMAN,UBQLN1,Q16186,ADRM1_HUMAN,ADRM1,3,4,2,0,8


In [4]:
df_huri.head(3)

Unnamed: 0,0,1
0,ENSG00000000005,ENSG00000061656
1,ENSG00000000005,ENSG00000099968
2,ENSG00000000005,ENSG00000104765


### ID conversion

The IDs in the APID and HuRI datasets do not match, so they are converted to a common format before they are compared. This is most easily done by converting every protein to an Ensembl gene ID. BioMart is used for the conversion.

https://www.ensembl.org/biomart/martview/

The GeneName_A and GeneName_B columns of the APID database contain a mix of gene names, gene synonyms, UniProt IDs and UniProt gene symbols. A conversion dataset is created to map each of these identifier types to an Ensembl gene ID.

http://www.ensembl.org/biomart/martview/8edb1f3d5a7570a15d2fba0b3a0f842e?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.external_synonym|hsapiens_gene_ensembl.default.feature_page.external_gene_name|hsapiens_gene_ensembl.default.feature_page.uniprot_gn_symbol|hsapiens_gene_ensembl.default.feature_page.uniprot_gn_id&FILTERS=&VISIBLEPANEL=resultspanel


In [5]:
df_conversion = pd.read_csv('../../Data/BioMart/human_gene_id_apid_huri_conversion_05_05_2020.txt', sep='\t')

In [6]:
df_conversion.head(3)

Unnamed: 0,Gene stable ID,Gene Synonym,Gene name,UniProtKB Gene Name symbol,UniProtKB Gene Name ID
0,ENSG00000276191,CRF-R,CRHR1,CRHR1,P34998
1,ENSG00000276191,CRF-R,CRHR1,CRHR1,J3KSM0
2,ENSG00000276191,CRF-R,CRHR1,CRHR1,A0A0A0MQZ1


In [16]:
#Gene name =>> ENSG id
conversion_dict = pd.Series(df_conversion['Gene stable ID'].values,
                                  index=df_conversion['Gene name']).to_dict()
#Gene Synonym =>> ENSG id
conversion_dict_synonym = pd.Series(df_conversion['Gene stable ID'].values,
                                  index=df_conversion['Gene Synonym']).to_dict()
#UniProtKB Gene Name ID =>> ENSG id
conversion_dict_uni = pd.Series(df_conversion['Gene stable ID'].values,
                                  index=df_conversion['UniProtKB Gene Name ID']).to_dict()
#UniProtKB Gene Name symbol =>> ENSG id
conversion_dict_uni_gene = pd.Series(df_conversion['Gene stable ID'].values,
                                  index=df_conversion['UniProtKB Gene Name symbol']).to_dict()
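
One caveat of building the lookup tables this way: when an index label occurs more than once, `to_dict()` keeps only the value of the last row, so an identifier that BioMart lists under several Ensembl genes is silently resolved to one of them. The following is a minimal sketch, on made-up rows and placeholder IDs (not taken from the real export), of how such ambiguous identifiers could be counted before relying on the dictionaries.

```python
import pandas as pd

# Made-up rows in the same layout as the BioMart export (illustrative only)
toy = pd.DataFrame({
    'Gene stable ID': ['ENSG_A', 'ENSG_B', 'ENSG_C'],
    'Gene name': ['GENE1', 'GENE1', 'GENE2'],
})

# Number of distinct Ensembl IDs listed for each gene name
ids_per_name = toy.groupby('Gene name')['Gene stable ID'].nunique()
ambiguous = ids_per_name[ids_per_name > 1]

# Same construction as in the cell above: the last row wins for a duplicated key
lookup = pd.Series(toy['Gene stable ID'].values,
                   index=toy['Gene name']).to_dict()

print(ambiguous)
print(lookup)
```

If many identifiers turn out to be ambiguous, the overlap computed later should be read as approximate.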

In [17]:
def convert(x):
    """Convert a gene identifier to an Ensembl gene (ENSG) id.

    Parameters
    ----------
    x : str or NaN
        Gene name, gene synonym, UniProt id or UniProt gene symbol.

    Returns
    -------
    str or NaN
        The ENSG id, or NaN if x is missing or cannot be found.
    """
    #missing gene names stay missing
    if pd.isna(x):
        return np.nan
    #try the lookup tables in order of priority
    if x in conversion_dict:
        return conversion_dict[x]
    if x in conversion_dict_uni:
        return conversion_dict_uni[x]
    if x in conversion_dict_synonym:
        return conversion_dict_synonym[x]
    if x in conversion_dict_uni_gene:
        return conversion_dict_uni_gene[x]

    return np.nan
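
The same priority lookup can also be written without a Python-level function. The sketch below is an alternative and not the approach used in this notebook. It uses `Series.map` with a dictionary, which yields NaN for unknown keys, followed by `fillna` to fall back to the next table. The two small dictionaries are hypothetical stand-ins for the conversion dictionaries, filled with ID pairs that appear in the outputs of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the conversion dictionaries
by_name = {'RAD23B': 'ENSG00000119318'}
by_uniprot = {'P55036': 'ENSG00000159352'}

names = pd.Series(['RAD23B', 'P55036', 'urf-ret', np.nan])

# Try each table in priority order; fillna only touches entries still missing
mapped = names.map(by_name).fillna(names.map(by_uniprot))
print(mapped)
```

Identifiers that are in none of the tables, as well as missing names, stay NaN, which matches the behaviour of `convert`.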

In [18]:
#create new columns for the ENSG ids
df_apid['stabel_id_A'] = df_apid['GeneName_A'].apply(convert)
df_apid['stabel_id_B'] = df_apid['GeneName_B'].apply(convert)

In [19]:
df_apid.head(3)

Unnamed: 0,InteractionID,UniprotID_A,UniprotName_A,GeneName_A,UniprotID_B,UniprotName_B,GeneName_B,ExpEvidences,Methods,Publications,3DStructures,CurationEvents,stabel_id_A,stabel_id_B
0,1818,P54727,RD23B_HUMAN,RAD23B,P55036,PSMD4_HUMAN,PSMD4,9,8,7,0,17,ENSG00000119318,ENSG00000159352
1,1819,P55036,PSMD4_HUMAN,PSMD4,Q9UMX0,UBQL1_HUMAN,UBQLN1,9,5,6,0,15,ENSG00000159352,ENSG00000135018
2,1826,Q9UMX0,UBQL1_HUMAN,UBQLN1,Q16186,ADRM1_HUMAN,ADRM1,3,4,2,0,8,ENSG00000135018,ENSG00000130706


In [20]:
#remove the PPIs with one or two gene names missing
df_apid_rem_nan_genes = df_apid[~df_apid[['GeneName_A', 'GeneName_B']].isnull().any(axis=1)]
df_apid_rem_nan_genes.isna().sum()

InteractionID       0
UniprotID_A         0
UniprotName_A       0
GeneName_A          0
UniprotID_B         0
UniprotName_B       0
GeneName_B          0
ExpEvidences        0
Methods             0
Publications        0
3DStructures        0
CurationEvents      0
stabel_id_A        90
stabel_id_B       111
dtype: int64

In [21]:
#Most of the genes that are left cannot be found in BioMart
df_ENSG_nan = df_apid_rem_nan_genes[df_apid_rem_nan_genes[['stabel_id_A', 'stabel_id_B']].isnull().any(axis=1)]
df_ENSG_nan.head(3)

Unnamed: 0,InteractionID,UniprotID_A,UniprotName_A,GeneName_A,UniprotID_B,UniprotName_B,GeneName_B,ExpEvidences,Methods,Publications,3DStructures,CurationEvents,stabel_id_A,stabel_id_B
998,60173,Q6PK50,Q6PK50_HUMAN,HSP90AB1,Q15850,Q15850_HUMAN,urf-ret,1,1,1,0,1,ENSG00000096384,
999,60174,Q15850,Q15850_HUMAN,urf-ret,Q16543,CDC37_HUMAN,CDC37,1,1,1,0,1,,ENSG00000105401
1257,60454,Q6N074,Q6N074_HUMAN,DKFZp686N224,Q9H6T3,RPAP3_HUMAN,RPAP3,1,1,1,0,1,,ENSG00000005175
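
To see which gene names are responsible for the missing ENSG ids, both sides of each interaction can be stacked into one table and the unmapped names counted. This is a minimal sketch on the three rows shown above. The helper column names `name` and `ensg` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# The three rows from the output above, reduced to the relevant columns
toy = pd.DataFrame({
    'GeneName_A': ['HSP90AB1', 'urf-ret', 'DKFZp686N224'],
    'GeneName_B': ['urf-ret', 'CDC37', 'RPAP3'],
    'stabel_id_A': ['ENSG00000096384', np.nan, np.nan],
    'stabel_id_B': [np.nan, 'ENSG00000105401', 'ENSG00000005175'],
})

# Stack the A and B sides into one (name, ensg) table
side_a = toy[['GeneName_A', 'stabel_id_A']].rename(
    columns={'GeneName_A': 'name', 'stabel_id_A': 'ensg'})
side_b = toy[['GeneName_B', 'stabel_id_B']].rename(
    columns={'GeneName_B': 'name', 'stabel_id_B': 'ensg'})
sides = pd.concat([side_a, side_b], ignore_index=True)

# Count how often each unmapped gene name occurs
unmapped_counts = sides.loc[sides['ensg'].isna(), 'name'].value_counts()
print(unmapped_counts)
```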


In [22]:
#keep only the PPIs for which both ENSG ids were found
df_apid_ENSG = df_apid_rem_nan_genes[~df_apid_rem_nan_genes[['stabel_id_A', 'stabel_id_B']].isnull().any(axis=1)]
df_apid_ENSG = df_apid_ENSG[['stabel_id_A', 'stabel_id_B']]
df_apid_ENSG.head(3)

Unnamed: 0,stabel_id_A,stabel_id_B
0,ENSG00000119318,ENSG00000159352
1,ENSG00000159352,ENSG00000135018
2,ENSG00000135018,ENSG00000130706


### Check whether the PPIs are the same

Each PPI is treated as an unordered pair of genes, and the pairs from APID and HuRI are checked for presence in both datasets. If more than 90% of the HuRI PPIs were found in the APID dataset, it could be concluded that the APID database had been updated with the HuRI dataset. The intersection contains 5896 pairs, which is about 11% of the 52548 HuRI PPIs; the printed fraction of about 9% is relative to the 65192 APID PPIs. Both values are far below the threshold, so it can be concluded that the APID database was not updated with this dataset.

In [24]:
#Create a list of sorted tuples
apid_ppis = df_apid_ENSG.values
apid_ppis = [tuple(sorted(x)) for x in apid_ppis]
huri_ppis = list(df_huri.values)
huri_ppis = [tuple(sorted(x)) for x in huri_ppis]

#Intersection of the HuRi and APID PPIs
huri_apid_intersect = list(set(apid_ppis) & set(huri_ppis))

In [25]:
print('HuRi APID intersect:', len(huri_apid_intersect))
print('APID PPIs:', len(apid_ppis))
print('HuRi PPIs:', len(huri_ppis))
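#note: the denominator is the number of APID PPIs, so this is the share of APID pairs that also occur in HuRI
print('Corrected Fraction HuRi PPIs in APID:', len(huri_apid_intersect)/len(apid_ppis))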

HuRi APID intersect: 5896
APID PPIs: 65192
HuRi PPIs: 52548
Corrected Fraction HuRi PPIs in APID: 0.0904405448521291
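
Because the choice of denominator changes the reported fraction, it can help to report the overlap in both directions and on deduplicated pairs. The following is a minimal sketch with made-up identifiers (`G1`, `G2`, ...), not the notebook's data. Each pair is sorted so that (A, B) and (B, A) count as the same interaction.

```python
# Toy undirected pairs with made-up IDs (illustrative only)
apid_pairs = [('G1', 'G2'), ('G2', 'G1'), ('G3', 'G4')]
huri_pairs = [('G2', 'G1'), ('G5', 'G6'), ('G7', 'G8'), ('G9', 'G9')]

# Sort each pair so both orientations collapse, then deduplicate with a set
apid_set = {tuple(sorted(p)) for p in apid_pairs}
huri_set = {tuple(sorted(p)) for p in huri_pairs}
shared = apid_set & huri_set

# Share of HuRI pairs found in APID, and share of APID pairs found in HuRI
frac_huri_in_apid = len(shared) / len(huri_set)
frac_apid_in_huri = len(shared) / len(apid_set)

print('HuRI pairs in APID:', frac_huri_in_apid)
print('APID pairs in HuRI:', frac_apid_in_huri)
```

The first fraction is the one that answers whether APID contains the HuRI data.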
