# Get the UniProt ids of the HuRI data

At the start of the project, a new high-quality (HQ) protein-protein interaction (PPI) dataset, HuRI, was released.

https://www.nature.com/articles/s41586-020-2188-x <br>
http://www.interactome-atlas.org/download

The download page offers a tsv file that contains only gene combinations, while we want the UniProtKB ids.
It also offers a PSI-MI file in which more information is stored.

The following GitHub page explains how to interpret the PSI-MI file:

https://github.com/HUPO-PSI/miTab/blob/master/PSI-MITAB27Format.md

This notebook tries to extract the UniProt ids from the PSI-MI file.
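For orientation: an identifier cell in a MITAB file has the form `database:identifier`, optionally followed by a description in parentheses, and several alternatives can be joined with `|`. The sketch below splits such a cell into `(database, identifier)` pairs. The helper name `parse_mitab_ids` is hypothetical and not part of the HuRI material, and the sketch assumes unquoted identifiers like the ones in this file.

```python
def parse_mitab_ids(cell):
    """Split a MITAB identifier cell into (database, identifier) pairs."""
    pairs = []
    for entry in cell.split('|'):
        if entry == '-':  # MITAB uses '-' for an empty field
            continue
        # split on the first ':' only, so 'psi-mi:MI:1112' keeps 'MI:1112' intact
        database, _, identifier = entry.partition(':')
        # drop an optional trailing description such as '(author assigned name)'
        identifier = identifier.split('(')[0]
        pairs.append((database, identifier))
    return pairs


print(parse_mitab_ids('uniprotkb:Q68D86-1'))
print(parse_mitab_ids('ensembl:ENST00000380750.7|ensembl:ENSP00000370126.3|ensembl:ENSG00000205572.9'))
```

Only the database prefix is needed later on to tell UniProt ids apart from other identifier types.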

In [1]:
import pandas as pd
import numpy as np
import itertools

In [2]:
# Import huri PSI-MI file
df_huri = pd.read_csv('../Data/HuRi/HuRI_04_05_2020.psi', sep='\t', header=None)

In [42]:
df_huri.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
0,uniprotkb:P0DP25,uniprotkb:A0A087WXN0,ensembl:ENST00000291295.13|ensembl:ENSP0000029...,ensembl:ENST00000612316.4|ensembl:ENSP00000481...,human orfeome collection:1(author assigned name),human orfeome collection:56859(author assigned...,psi-mi:MI:1112(two hybrid prey pooling approach),Luck et al.(2019),unassigned1304,taxid:9606(Homo Sapiens),...,-,-,-,-,gal4 dna binding domain:n-n (DB domain (n-term...,gal4 activation domain:n-n (AD domain (n-termi...,-,-,psi-mi:MI1180(partial DNA sequence identificat...,psi-mi:MI1180(partial DNA sequence identificat...
1,uniprotkb:Q68D86-1,uniprotkb:Q9HD26-2,ensembl:ENST00000360242.9|ensembl:ENSP00000353...,ensembl:ENST00000052569.10|ensembl:ENSP0000005...,human orfeome collection:54581(author assigned...,human orfeome collection:121(author assigned n...,psi-mi:MI:1112(two hybrid prey pooling approach),Luck et al.(2019),unassigned1304,taxid:9606(Homo Sapiens),...,-,-,-,-,gal4 dna binding domain:n-n (DB domain (n-term...,gal4 activation domain:n-n (AD domain (n-termi...,-,-,psi-mi:MI1180(partial DNA sequence identificat...,psi-mi:MI1180(partial DNA sequence identificat...


In [3]:
# filter out PPIs from the test screens
df_huri_no_test = df_huri[~df_huri[27].str.contains('test')]

In [4]:
# sort to filter redundant combinations
df_uniprot_huri = pd.DataFrame(df_huri[[0, 1]].apply(lambda x: sorted(x), axis=1).to_list())
df_uniprot_huri_no_test = pd.DataFrame(df_huri_no_test[[0, 1]].apply(lambda x: sorted(x), axis=1).to_list())

In [5]:
# filter redundant combinations
df_uniprot_comb = df_uniprot_huri.drop_duplicates()
df_uniprot_comb_no_test = df_uniprot_huri_no_test.drop_duplicates()
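The two cells above treat an interaction as an unordered pair: sorting the two ids within each row makes A-B and B-A identical, so `drop_duplicates` removes both kinds of repeats. A minimal sketch on toy data follows, using `np.sort` as a vectorized alternative to the row-wise `apply`. The toy ids are placeholders, not real accessions.

```python
import numpy as np
import pandas as pd

# toy data: rows 0 and 1 describe the same pair in opposite order
toy = pd.DataFrame({0: ['uniprotkb:B', 'uniprotkb:A', 'uniprotkb:C'],
                    1: ['uniprotkb:A', 'uniprotkb:B', 'uniprotkb:A']})

# sort the two identifiers within each row so that (B, A) becomes (A, B)
sorted_pairs = pd.DataFrame(np.sort(toy[[0, 1]].to_numpy(), axis=1))

# identical rows now represent the same unordered pair
unique_pairs = sorted_pairs.drop_duplicates()
print(len(toy), 'rows ->', len(unique_pairs), 'unique unordered pairs')
```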

In [6]:
# Now something odd is happening.
# The numbers do not match the paper's claim.
print('The paper PPI claim: 52569')
print('PSI-MI PPIs number:', len(df_uniprot_comb))
print('PSI-MI PPIs number without test results:', len(df_uniprot_comb_no_test))

The paper PPI claim: 52569
PSI-MI PPIs number: 52929
PSI-MI PPIs number without test results: 52237


In [7]:
# Let's see if we can find the reason using the other tsv file that HuRI provides on its website.
# This file only contains the ENSG ids of interactors A and B.

# load the ENSG data
df_huri_ensg =  pd.read_csv('../Data/HuRi/HuRI_genes_04_05_2020.tsv', sep='\t', header=None)

In [8]:
# the number of combinations doesn't match the paper, but maybe there are redundant gene combinations
# due to splice variation
df_huri_ensg = pd.DataFrame(df_huri_ensg[[0, 1]].apply(lambda x: sorted(x), axis=1).to_list())
print('HuRi gene combination number:', len(df_huri_ensg.drop_duplicates()))

HuRi gene combination number: 52548


In [9]:
# This is annoying, since we now need the ENSG mapping of the UniProt ids
# to figure out why we have fewer PPIs than the paper claims.
# The ENSG ids of each interactor are in the PSI file, but apparently
# some of the UniProt ids map to multiple ENSG ids.
df_huri_no_test.iloc[113,2]

'ensembl:ENST00000380750.7|ensembl:ENSP00000370126.3|ensembl:ENSG00000205572.9|ensembl:ENST00000354833.7|ensembl:ENSP00000346892.3|ensembl:ENSG00000172058.15'

In [10]:
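# show that there can be multiple ENSGs in one cell:
# if the split list length is greater than 2, multiple ENSGs are found
# note: despite its name, check_one_ensg_id returns True when a cell holds more than one ENSG id
check_one_ensg_id = lambda x: len(x.split('ENSG')) > 2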
print('multiple ENSG ids column 2?', True in df_huri[2].apply(check_one_ensg_id).unique())

multiple ENSG ids column 2? True


In [11]:
# now count the rows that have multiple ENSGs in at least one of the two columns
df_multiple_ensg = df_huri_no_test[df_huri_no_test[2].apply(check_one_ensg_id) | 
                                   df_huri_no_test[3].apply(check_one_ensg_id)]
print('Number of entries with multiple ENSG ids:', len(df_multiple_ensg))

Number of entries with multiple ENSG ids: 903
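The split-based check above can also be written as an explicit count of `ENSG` occurrences, which some readers may find easier to follow. A minimal sketch on two example cells is shown below. The second cell is the one printed earlier in this notebook, and the first is its first half. The names `n_ensg` and `has_multiple` are only illustrative.

```python
import pandas as pd

cells = pd.Series([
    # one ENST/ENSP/ENSG triplet
    'ensembl:ENST00000380750.7|ensembl:ENSP00000370126.3|ensembl:ENSG00000205572.9',
    # two triplets, hence two ENSG ids
    'ensembl:ENST00000380750.7|ensembl:ENSP00000370126.3|ensembl:ENSG00000205572.9|'
    'ensembl:ENST00000354833.7|ensembl:ENSP00000346892.3|ensembl:ENSG00000172058.15',
])

# count how often 'ENSG' occurs in each cell
n_ensg = cells.str.count('ENSG')

# more than one occurrence means the interactor maps to several genes
has_multiple = n_ensg > 1
print(pd.DataFrame({'n_ensg': n_ensg, 'has_multiple': has_multiple}))
```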


In [12]:
# They have max one uniprot mapping
check_multiple_uniprot_ids = lambda x: len(x.split('uniprotkb')) > 2
print('multiple UniProt ids column 0?', True in df_huri[0].apply(check_multiple_uniprot_ids).unique())
print('multiple UniProt ids column 1?', True in df_huri[1].apply(check_multiple_uniprot_ids).unique())

multiple UniProt ids column 0? False
multiple UniProt ids column 1? False


In [13]:
# Let's create all possible combinations of ENSG ids to see
# if this number matches the tsv file.

list_of_genes_A = df_huri_no_test[2].apply(
                            lambda x: [i[i.find(':')+1:] for i in x.split('|')[2::3]]
                ).to_list()

list_of_genes_B = df_huri_no_test[3].apply(
                            lambda x: [i[i.find(':')+1:] for i in x.split('|')[2::3]]
                ).to_list()
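The `[2::3]` slice above relies on every interactor cell listing its identifiers in repeating ENST, ENSP, ENSG order. A slightly more defensive sketch selects the identifiers by their prefix instead, so it does not depend on position. The helper name `ensembl_ids_by_type` is hypothetical, and whether the ordering assumption ever fails in the HuRI file is not checked here.

```python
def ensembl_ids_by_type(cell, prefix):
    """Return the identifiers in a '|'-separated cell that start with `prefix`."""
    ids = []
    for entry in cell.split('|'):
        # strip the database name, e.g. 'ensembl:ENSG...' -> 'ENSG...'
        identifier = entry.split(':', 1)[-1]
        if identifier.startswith(prefix):
            ids.append(identifier)
    return ids


cell = ('ensembl:ENST00000380750.7|ensembl:ENSP00000370126.3|ensembl:ENSG00000205572.9|'
        'ensembl:ENST00000354833.7|ensembl:ENSP00000346892.3|ensembl:ENSG00000172058.15')

print(ensembl_ids_by_type(cell, 'ENSG'))
print(ensembl_ids_by_type(cell, 'ENSP'))
```

For cells that follow the expected order, this returns the same lists as the positional slices `[2::3]` (ENSG) and `[1::3]` (ENSP).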

In [14]:
# create a list with tuples of all combinations of ENSG ids and sort the tuples
all_combinations = [list(itertools.product(a, b)) for a,b in zip(list_of_genes_A,list_of_genes_B)]
all_combinations_list = [tuple(sorted(x)) for x in itertools.chain.from_iterable(all_combinations)]

print('Total rows in the PSI file:', len(df_huri_no_test))
print('Possible ENSG id combinations with redundancy:', len(all_combinations_list))

Total rows in the PSI file: 167924
Possible ENSG id combinations with redundancy: 168956
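The number of combinations is larger than the number of rows because a row whose interactor maps to several ENSG ids contributes one pair per combination. A small sketch of the same expansion on toy data (the gene labels `G1` to `G4` are placeholders):

```python
import itertools

# one list of gene ids per row, for interactor A and interactor B
genes_a = [['G1', 'G2'], ['G3']]
genes_b = [['G4'], ['G1']]

# expand every row into all A x B pairs and sort each pair
pairs = [tuple(sorted(p))
         for a, b in zip(genes_a, genes_b)
         for p in itertools.product(a, b)]

print(len(genes_a), 'rows ->', len(pairs), 'pairs')
print(pairs)
```

Here the first row yields two pairs and the second row yields one, so two rows expand to three pairs.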


In [15]:
# So there is our number, it matches :D
# Now we know where the ENSG combination number in the tsv file comes from.

print('Unique ENSG id combinations from tsv:', len(df_huri_ensg.drop_duplicates()))
print('Unique ENSG id combinations from PSI:', len(set(all_combinations_list)))

Unique ENSG id combinations from tsv: 52548
Unique ENSG id combinations from PSI: 52548


In [16]:
# Every row in the HuRI PSI-MI file has one UniProt id for each interactor.
# The number of unique UniProt id combinations is lower than the paper claims, which is odd.
# Maybe they used all unique combinations of ENSP ids to get to this number?

list_of_proteins_A = df_huri_no_test[2].apply(
                            lambda x: [i[i.find(':')+1:] for i in x.split('|')[1::3]]
                ).to_list()

list_of_proteins_B = df_huri_no_test[3].apply(
                            lambda x: [i[i.find(':')+1:] for i in x.split('|')[1::3]]
                ).to_list()

In [17]:
#create a list with tuples of all combinations of ENSP ids and sort the tuples
all_comb_prot = [list(itertools.product(a, b)) for a,b in zip(list_of_proteins_A,list_of_proteins_B)]
all_comb_prot_list = [tuple(sorted(x)) for x in itertools.chain.from_iterable(all_comb_prot)]

In [35]:
# create a DataFrame with unique ENSP combinations
ensp_comb_huri_final = pd.DataFrame(all_comb_prot_list).drop_duplicates()
# remove version number
ensp_comb_huri_final.iloc[:,0] = ensp_comb_huri_final.iloc[:,0].apply(lambda x: x.split('.')[0])
ensp_comb_huri_final.iloc[:,1] = ensp_comb_huri_final.iloc[:,1].apply(lambda x: x.split('.')[0])
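One thing to keep in mind with the cell above: duplicates are dropped before the version suffix is removed. If the same protein pair occurred with different version numbers, both rows would survive. Whether this happens in the HuRI file is not checked here. The sketch below only illustrates the effect of the ordering on made-up placeholder ids.

```python
import pandas as pd

# placeholder ids: the same pair, but interactor A appears with two versions
pairs = pd.DataFrame([('ENSP00000000001.1', 'ENSP00000000002.3'),
                      ('ENSP00000000001.2', 'ENSP00000000002.3')])


def strip_version(col):
    """Remove the '.<version>' suffix from every id in a column."""
    return col.map(lambda x: x.split('.')[0])


# order used above: drop duplicates first, strip versions afterwards
dedup_then_strip = pairs.drop_duplicates().apply(strip_version)

# alternative order: strip versions first, then drop duplicates
strip_then_dedup = pairs.apply(strip_version).drop_duplicates()

print('dedup then strip:', len(dedup_then_strip))
print('strip then dedup:', len(strip_then_dedup))
```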

In [36]:
# Nope, it's not the number we are looking for.
# We are somehow missing 20 PPIs.
print('Paper PPIs claim: 52569')
print('Unique protein combinations using ENSP ids from PSI_MI:', len(ensp_comb_huri_final))

Paper PPIs claim: 52569
Unique protein combinations using ENSP ids from PSI_MI: 52549


In [38]:
# Let's see how many UniProt id combinations we can actually make.
# All combinations where at least one of the two ids is not a UniProt id are filtered out.
# .copy() avoids pandas' SettingWithCopyWarning when the columns are overwritten below.
uniprot_comb_huri_final = df_uniprot_comb_no_test[df_uniprot_comb_no_test\
                                            .apply(lambda x: all('uniprotkb' in i for i in x), axis=1)].copy()
# remove the 'uniprotkb:' prefix from each value
uniprot_comb_huri_final.iloc[:,0] = uniprot_comb_huri_final.iloc[:,0].apply(lambda x: x.split(':')[1])
uniprot_comb_huri_final.iloc[:,1] = uniprot_comb_huri_final.iloc[:,1].apply(lambda x: x.split(':')[1])

In [40]:
print('Paper PPIs claim: 52569')
print('Unique protein combinations using UniProt ids from PSI_MI:', len(uniprot_comb_huri_final))
print('Unique protein combinations using ENSP ids from PSI_MI:', len(ensp_comb_huri_final))

Paper PPIs claim: 52569
Unique protein combinations using UniProt ids from PSI_MI: 51373
Unique protein combinations using ENSP ids from PSI_MI: 52549


In [54]:
# ENSP and UniProt have different numbers of unique ids
all_unique_uniprot_ids = pd.Series(pd.concat([uniprot_comb_huri_final[0],uniprot_comb_huri_final[1]]).unique())
all_unique_ensp_ids = pd.Series(pd.concat([ensp_comb_huri_final[0],ensp_comb_huri_final[1]]).unique())

print('Number of unique UniProt ids:', len(all_unique_uniprot_ids))
print('Number of unique ENSP ids:', len(all_unique_ensp_ids))

Number of unique UniProt ids: 8148
Number of unique ENSP ids: 8274


# RESULT
From this point on I accept that we are missing 20 PPIs. The number of unique gene (ENSG) combinations extracted from the PSI-MI file matches the gene combination tsv file (52548). Applying the same method to the ENSP ids gives 52549 unique protein combinations, 20 fewer than the 52569 PPIs the paper claims.

There are now two things we can do:

1. Use the UniProt id combinations. This means that we have 51373 PPIs (1196 fewer than the paper claims).<br>
2. Use the ENSP id combinations. This means that we have 52549 PPIs (20 fewer than the paper claims).
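To put the two options side by side, the shortfall relative to the paper can be computed directly from the counts reported above. This is plain arithmetic on the numbers printed in this notebook. The variable names are only illustrative.

```python
paper_claim = 52569                          # PPIs reported in the paper
counts = {'UniProt': 51373, 'ENSP': 52549}   # unique pairs found in this notebook

for id_type, n in counts.items():
    missing = paper_claim - n
    # report the absolute shortfall and its share of the paper's total
    print(f'{id_type}: {n} PPIs, {missing} fewer than the paper ({missing / paper_claim:.2%})')
```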

In [52]:
uniprot_comb_huri_final.to_csv('../Data/Interactome/uniprot_ids_unique_combinations_huri.csv',
                               header=False, index=False)
ensp_comb_huri_final.to_csv('../Data/Interactome/ensp_ids_unique_combinations_huri.csv',
                               header=False, index=False)

all_unique_uniprot_ids.to_csv('../Data/Interactome/all_unique_uniprot_ids.csv',
                               header=False, index=False)
all_unique_ensp_ids.to_csv('../Data/Interactome/all_unique_ensp_ids.csv',
                               header=False, index=False)

