# Get the UniProt ids of the HuRI data

At the start of the project, a new high-quality (HQ) protein-protein interaction (PPI) dataset was released (HuRI).

- https://www.nature.com/articles/s41586-020-2188-x
- http://www.interactome-atlas.org/download

The download page offers a TSV file that lists only gene combinations, whereas we want the UniProtKB ids.
There is also a PSI-MI file that stores more information.

The following GitHub page explains how to interpret the PSI-MI file:

- https://github.com/HUPO-PSI/miTab/blob/master/PSI-MITAB27Format.md

This script tries to extract the UniProt ids from the PSI-MI file.

It turns out that BioMart can map more ENSP ids to UniProt than the UniProt mapping itself, so the BioMart mapping is done first.
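The interactor columns of the PSI-MI file hold values of the form `database:accession`, as the sample rows in this notebook show. A minimal sketch of splitting such a value into its two parts; the helper name `split_identifier` is hypothetical and not part of the original script.

```python
def split_identifier(value):
    """Split a 'database:accession' string into (database, accession)."""
    database, _, accession = value.partition(':')
    return database, accession


# Two identifiers taken from the sample rows of this notebook
examples = ['uniprotkb:Q68D86-1', 'ensembl:ENSP00000462298.1']
parsed = [split_identifier(v) for v in examples]

for database, accession in parsed:
    print(database, accession)
```

Entries whose database part is `ensembl` are the ones that still need a UniProt mapping.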

In [2]:
import pandas as pd
import numpy as np
import itertools

In [3]:
# Import huri PSI-MI file
df_huri = pd.read_csv('../Data/HuRi/HuRI_04_05_2020.psi', sep='\t', header=None)
# filter PPIs from the test screens
df_huri_no_test = df_huri[~df_huri[27].str.contains('test')]

In [8]:
df_huri_no_test.iloc[0:20, 0:2]

| | 0 | 1 |
|---|---|---|
| 0 | uniprotkb:P0DP25 | uniprotkb:A0A087WXN0 |
| 1 | uniprotkb:Q68D86-1 | uniprotkb:Q9HD26-2 |
| 2 | uniprotkb:Q13515 | uniprotkb:Q9UJW9 |
| 3 | uniprotkb:P30049 | uniprotkb:Q05519-2 |
| 4 | ensembl:ENSP00000462298.1 | uniprotkb:P43220 |
| 5 | uniprotkb:P57771-2 | uniprotkb:Q8TAS1-2 |
| 6 | uniprotkb:Q12981-4 | uniprotkb:Q8TBP5 |
| 7 | uniprotkb:Q3LI64 | uniprotkb:P49639 |
| 8 | uniprotkb:P07947 | uniprotkb:Q9UKT9-1 |
| 9 | uniprotkb:Q9H0Q3-1 | uniprotkb:Q14973 |


In [7]:
# Some entries in the UniProt id columns are still Ensembl protein (ENSP) ids
# because no UniProt id was assigned to them.
# We should check if we can map these Ensembl ids to UniProtKB ids
all_proteins = df_huri_no_test[[0,1]].values.reshape(-1)

ensmble_ids = pd.Series([x.split(':')[1] for x in set(all_proteins) if x.split(':')[0] == 'ensembl'])
# Write all Ensembl ids to csv
ensmble_ids.to_csv('../Data/HuRi/ensmble_ids_without_uniprot_id.csv', header=False, index=False)
# BioMart can map some of them to UniProt
df_mapping = pd.read_csv('../Data/IDMapping/huri_ensemble_ids_mapping_without_uniprot.tsv', sep='\t')

In [8]:
mapping_dict_swiss = pd.Series(df_mapping.iloc[:,1].values, index=df_mapping.iloc[:,0]).to_dict()
mapping_dict_trembl = pd.Series(df_mapping.iloc[:,2].values, index=df_mapping.iloc[:,0]).to_dict()

In [10]:
def convert(x):
    # drop the database prefix, e.g. 'uniprotkb:' or 'ensembl:'
    x = x.split(':')[1]

    # only Ensembl protein ids need mapping; prefer Swiss-Prot over TrEMBL
    # pd.notna skips ids whose mapping is missing (None or NaN)
    if x in ensmble_ids.to_list():
        if pd.notna(mapping_dict_swiss.get(x)):
            x = mapping_dict_swiss[x]
        elif pd.notna(mapping_dict_trembl.get(x)):
            x = mapping_dict_trembl[x]

    return x

conv_ids = pd.DataFrame({0:df_huri_no_test[0].apply(convert), 1:df_huri_no_test[1].apply(convert)})
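The same "Swiss-Prot first, TrEMBL as fallback" preference can be sketched without a row-wise function. This is only an illustration on made-up data: the column names and all ids below are hypothetical, and the ids are assumed to have their database prefix already removed.

```python
import pandas as pd

# Hypothetical BioMart export: ENSP id, Swiss-Prot id, TrEMBL id
toy_mapping = pd.DataFrame({
    'ensp': ['ENSP_A.1', 'ENSP_B.1', 'ENSP_C.1'],
    'swissprot': ['P11111', None, None],
    'trembl': ['A0A000AAA0', 'A0A000BBB0', None],
})

# Prefer Swiss-Prot and fall back to TrEMBL where Swiss-Prot is missing
preferred = toy_mapping['swissprot'].fillna(toy_mapping['trembl'])

# Keep only the ENSP ids that received a mapping
lookup = {k: v for k, v in zip(toy_mapping['ensp'], preferred) if pd.notna(v)}

# Ids without a lookup entry (UniProt ids, unmapped ENSP ids) stay unchanged
ids = pd.Series(['P99999', 'ENSP_A.1', 'ENSP_B.1', 'ENSP_C.1'])
converted = ids.map(lookup).fillna(ids)
print(converted.to_list())
```

Building the lookup once avoids rebuilding a list of ids for every row, which matters for a file with many rows.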

In [11]:
conv_all_proteins = conv_ids.values.reshape(-1)
conv_ensmble_ids = pd.Series([x for x in set(conv_all_proteins) if 'ENSP' in x])

print('Number of ENSP ids in the UniProt id columns', len(ensmble_ids))
print('Number of ENSP ids left after BioMart mapping', len(conv_ensmble_ids))
# add 6 because six proteins were not found at all during the mapping,
# so they are missing from df_mapping and from the mapping dicts
print('Number of ENSP ids that could not be mapped', 6 + sum(df_mapping.iloc[:,1:3].isna().all(axis=1)))

Number of ENSP ids in the UniProt id columns 58
Number of ENSP ids left after BioMart mapping 24
Number of ENSP ids that could not be mapped 24


In [16]:
# These ENSP ids could not be converted
# Let's check them manually, either by removing the version suffix
# or by sequence comparison
for x in conv_ensmble_ids:
    print(x)

ENSP00000445323.1
ENSP00000457957.1
ENSP00000482225.1
ENSP00000493046.1
ENSP00000493281.1
ENSP00000473509.2
ENSP00000324909.5
ENSP00000391869.3
ENSP00000482011.1
ENSP00000489519.1
ENSP00000405973.1
ENSP00000465194.1
ENSP00000490118.1
ENSP00000476265.1
ENSP00000470609.1
ENSP00000463483.1
ENSP00000467185.1
ENSP00000408168.2
ENSP00000493302.1
ENSP00000426769.1
ENSP00000408321.1
ENSP00000493122.1
ENSP00000411822.1
ENSP00000415200.2


In [18]:
## remove version
for x in conv_ensmble_ids:
    print(x.split('.')[0])

ENSP00000445323
ENSP00000457957
ENSP00000482225
ENSP00000493046
ENSP00000493281
ENSP00000473509
ENSP00000324909
ENSP00000391869
ENSP00000482011
ENSP00000489519
ENSP00000405973
ENSP00000465194
ENSP00000490118
ENSP00000476265
ENSP00000470609
ENSP00000463483
ENSP00000467185
ENSP00000408168
ENSP00000493302
ENSP00000426769
ENSP00000408321
ENSP00000493122
ENSP00000411822
ENSP00000415200
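Removing the version suffix can also be done on a whole Series at once. A minimal sketch with pandas, using two ids from the list above plus one id that has no version suffix; the variable names are illustrative only.

```python
import pandas as pd

versioned = pd.Series(['ENSP00000473509.2', 'ENSP00000324909.5', 'ENSP00000445323'])

# Remove a trailing '.<digits>' version suffix; ids without a suffix are kept as-is
unversioned = versioned.str.replace(r'\.\d+$', '', regex=True)
print(unversioned.to_list())
```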


In [8]:
# sort to filter redundant combinations
df_uniprot_huri = pd.DataFrame(df_huri[[0, 1]].apply(lambda x: sorted(x), axis=1).to_list())
df_uniprot_huri_no_test = pd.DataFrame(df_huri_no_test[[0, 1]].apply(lambda x: sorted(x), axis=1).to_list())

In [9]:
# filter redundant combinations
df_uniprot_comb = df_uniprot_huri.drop_duplicates()
df_uniprot_comb_no_test = df_uniprot_huri_no_test.drop_duplicates()
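Sorting each pair before dropping duplicates makes the comparison independent of interactor order, so A-B and B-A count as one interaction. A small sketch of that idea on toy data; `np.sort` is used here as one possible alternative to the row-wise `apply`, and the pair labels are made up.

```python
import numpy as np
import pandas as pd

pairs = pd.DataFrame([['B', 'A'], ['A', 'B'], ['A', 'C'], ['C', 'C']])

# Sort the two values within each row so (B, A) becomes (A, B)
sorted_pairs = pd.DataFrame(np.sort(pairs.values, axis=1))

# Identical rows now collapse into a single interaction
unique_pairs = sorted_pairs.drop_duplicates().reset_index(drop=True)
print(unique_pairs)
```

Self-interactions such as C-C are kept, because they are still distinct pairs.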

In [10]:
# Now something odd is happening.
# The numbers do not match the paper's claim.
print('The paper PPI claim: 52569')
# print('PSI-MI PPIs number:', len(df_uniprot_comb))
print('PSI-MI PPIs number without test results:', len(df_uniprot_comb_no_test))

The paper PPI claim: 52569
PSI-MI PPIs number without test results: 52237


In [11]:
# Let's see if we can find the reason using the other tsv file HuRI provides on the website
# This file only contains the ENSG ids of interactor A and B

# load the ENSG data
df_huri_ensg =  pd.read_csv('../Data/HuRi/HuRI_genes_04_05_2020.tsv', sep='\t', header=None)

In [12]:
# the number of combinations doesn't match the paper, but maybe there are redundant gene combinations
# due to splice variation
df_huri_ensg = pd.DataFrame(df_huri_ensg[[0, 1]].apply(lambda x: sorted(x), axis=1).to_list())
print('HuRi gene combination number:', len(df_huri_ensg.drop_duplicates()))

HuRi gene combination number: 52548


In [32]:
# To figure out why we have fewer PPIs than the paper claims,
# we now need the ENSG mapping of each UniProt id.
# The ENSG ids of each interactor are in the PSI file, but apparently
# some UniProt ids map to multiple ENSG ids.
df_huri_no_test.iloc[113,0:2]

0    uniprotkb:O75920-1
1      uniprotkb:A1L3X0
Name: 114, dtype: object

In [14]:
# show that there can be multiple ENSG ids in one cell
# if the split list has more than 2 parts, the cell contains more than one 'ENSG'
check_one_ensg_id = lambda x: len(x.split('ENSG')) > 2
print('multiple ENSG ids column 2?', True in df_huri[2].apply(check_one_ensg_id).unique())

multiple ENSG ids column 2? True


In [15]:
# now see how many rows have multiple ENSG ids in one of the two columns
df_multiple_ensg = df_huri_no_test[df_huri_no_test[2].apply(check_one_ensg_id) | 
                                   df_huri_no_test[3].apply(check_one_ensg_id)]
print('Number of entries with multiple ENSG ids:', len(df_multiple_ensg))

Number of entries with multiple ENSG ids: 903
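The check above counts how often the substring `ENSG` occurs in an alias field. A small sketch of the same idea with a regular expression. The alias string is made up, and its layout (transcript, protein and gene id repeated in groups of three, separated by `|`) is an assumption inferred from the slicing used in this notebook, not taken from the file specification.

```python
import re

# Made-up alias field with two transcript|protein|gene triplets
alias = (
    'ensembl:ENST00000000001.1|ensembl:ENSP00000000001.1|ensembl:ENSG00000000001.1|'
    'ensembl:ENST00000000002.1|ensembl:ENSP00000000002.1|ensembl:ENSG00000000002.1'
)


def count_ensg(field):
    """Count the ENSG ids that occur in one alias field."""
    return len(re.findall(r'ENSG\d+', field))


n_genes = count_ensg(alias)
has_multiple = n_genes > 1
print(n_genes, has_multiple)
```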


In [16]:
# Each interactor has at most one UniProt id
check_multiple_uniprot_ids = lambda x: len(x.split('uniprotkb')) > 2
print('multiple UniProt ids column 0?', True in df_huri[0].apply(check_multiple_uniprot_ids).unique())
print('multiple UniProt ids column 1?', True in df_huri[1].apply(check_multiple_uniprot_ids).unique())

multiple UniProt ids column 0? False
multiple UniProt ids column 1? False


In [17]:
# Let's create all possible combinations of ENSG ids to see
# if this number matches the tsv file.

list_of_genes_A = df_huri_no_test[2].apply(
                            lambda x: [i[i.find(':')+1:] for i in x.split('|')[2::3]]
                ).to_list()

list_of_genes_B = df_huri_no_test[3].apply(
                            lambda x: [i[i.find(':')+1:] for i in x.split('|')[2::3]]
                ).to_list()

In [18]:
# create a list with tuples of all combinations of ENSG ids and sort the tuples
all_combinations = [list(itertools.product(a, b)) for a,b in zip(list_of_genes_A,list_of_genes_B)]
all_combinations_list = [tuple(sorted(x)) for x in itertools.chain.from_iterable(all_combinations)]

print('Total rows in the PSI file:', len(df_huri_no_test))
print('Possible ENSG id combinations with redundancy:', len(all_combinations_list))

Total rows in the PSI file: 167924
Possible ENSG id combinations with redundancy: 168956
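When an interactor carries more than one gene id, a single row expands into several gene pairs, which is why the number of combinations exceeds the number of rows. A minimal sketch of that expansion with `itertools.product` on toy data; the gene labels are placeholders.

```python
import itertools

# Each row: (gene ids of interactor A, gene ids of interactor B)
rows = [
    (['G1', 'G2'], ['G3']),  # expands into two pairs
    (['G3'], ['G1']),        # same pair as (G1, G3) after sorting
    (['G4'], ['G4']),        # self-interaction
]

# Expand every row into all A x B pairs and sort within each pair
expanded = [
    tuple(sorted(pair))
    for genes_a, genes_b in rows
    for pair in itertools.product(genes_a, genes_b)
]

print('rows:', len(rows))
print('pairs with redundancy:', len(expanded))
print('unique pairs:', len(set(expanded)))
```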


In [19]:
# So there is our number: it matches.
# Now we know where the ENSG combination number comes from.

print('Unique ENSG id combinations from tsv:', len(df_huri_ensg.drop_duplicates()))
print('Unique ENSG id combinations from PSI:', len(set(all_combinations_list)))

Unique ENSG id combinations from tsv: 52548
Unique ENSG id combinations from PSI: 52548


In [20]:
# Every row in the HuRI PSI-MI file has one UniProt id for each interactor.
# The number of unique UniProt id combinations is lower than the paper claims, which is unexpected.
# Maybe the paper counted all unique combinations of ENSP ids to get to this number?

list_of_proteins_A = df_huri_no_test[2].apply(
                            lambda x: [i[i.find(':')+1:] for i in x.split('|')[1::3]]
                ).to_list()

list_of_proteins_B = df_huri_no_test[3].apply(
                            lambda x: [i[i.find(':')+1:] for i in x.split('|')[1::3]]
                ).to_list()

In [21]:
#create a list with tuples of all combinations of ENSP ids and sort the tuples
all_comb_prot = [list(itertools.product(a, b)) for a,b in zip(list_of_proteins_A,list_of_proteins_B)]
all_comb_prot_list = [tuple(sorted(x)) for x in itertools.chain.from_iterable(all_comb_prot)]

In [22]:
# create a DataFrame with unique ENSP combinations
ensp_comb_huri_final = pd.DataFrame(all_comb_prot_list).drop_duplicates()
ensp_comb_huri_final.iloc[:,0] = ensp_comb_huri_final.iloc[:,0].apply(lambda x: x.split('.')[0])
ensp_comb_huri_final.iloc[:,1] = ensp_comb_huri_final.iloc[:,1].apply(lambda x: x.split('.')[0])

In [23]:
# Nope, it's not the number we are looking for
# We are somehow missing 20 PPIs (52569 - 52549)
print('Paper PPIs claim: 52569')
print('Unique protein combinations using ENSP ids from PSI_MI:', len(ensp_comb_huri_final))

Paper PPIs claim: 52569
Unique protein combinations using ENSP ids from PSI_MI: 52549


In [24]:
# filter all combinations without a uniprot id
uniprot_comb_huri_final = conv_ids[~conv_ids[[0,1]]\
                                .apply(lambda x: 'ENSP' in x[0] or 'ENSP' in x[1],  axis=1)]
uniprot_comb_huri_final = pd.DataFrame(uniprot_comb_huri_final[[0, 1]].apply(lambda x: sorted(x), axis=1).to_list())
uniprot_comb_huri_final = uniprot_comb_huri_final.drop_duplicates()

In [30]:
# get all unique combinations after BioMart mapping using columns 0 and 1
# (assign only to total_comb_map so that df_huri_ensg is not overwritten)
total_comb_map = pd.DataFrame(conv_ids[[0, 1]].apply(lambda x: sorted(x), axis=1).to_list())

print('Paper PPIs claim: 52569')
print('Number of unique combinations in columns 0 and 1 of the psi file:', len(total_comb_map.drop_duplicates()))
print('Unique protein combinations using UniProt ids from PSI_MI:', len(uniprot_comb_huri_final))
print('Unique protein combinations using ENSP ids from PSI_MI:', len(ensp_comb_huri_final))

Paper PPIs claim: 52569
Number of unique combinations in columns 0 and 1 of the psi file: 52236
Unique protein combinations using UniProt ids from PSI_MI: 51767
Unique protein combinations using ENSP ids from PSI_MI: 52549


In [26]:
# The ENSP and UniProt id sets contain different numbers of unique ids
all_unique_uniprot_ids = pd.Series(pd.concat([uniprot_comb_huri_final[0],uniprot_comb_huri_final[1]]).unique())
all_unique_ensp_ids = pd.Series(pd.concat([ensp_comb_huri_final[0],ensp_comb_huri_final[1]]).unique())

print('Paper claim of unique protein: 8275')
print('Number of unique proteins in columns 0 and 1 of the psi file:', len(set(conv_ids.values.reshape(-1))))
print('Number of unique UniProt ids:', len(all_unique_uniprot_ids))
print('Number of unique ENSP ids:', len(all_unique_ensp_ids))

Paper claim of unique protein: 8275
Number of unique proteins in columns 0 and 1 of the psi file: 8215
Number of unique UniProt ids: 8184
Number of unique ENSP ids: 8274


In [27]:
# We are interested in the first two columns, which should hold UniProt ids
# Some of the values in these columns are Ensembl ids that BioMart could not convert
# So which proteins are missing from the final UniProt id list?
compare_to = set(conv_ids.values.reshape(-1))

# use a set for fast membership tests
kept_uniprot_ids = set(all_unique_uniprot_ids)

# get all missing proteins
missing_proteins = [x for x in compare_to if x not in kept_uniprot_ids]
len(missing_proteins)

31

In [28]:
missing_proteins

['ENSP00000408321.1',
 'ENSP00000470609.1',
 'Q9Y2P8-1',
 'ENSP00000467185.1',
 'ENSP00000411822.1',
 'ENSP00000476265.1',
 'Q5THR3-2',
 'ENSP00000426769.1',
 'ENSP00000415200.2',
 'ENSP00000493046.1',
 'ENSP00000493281.1',
 'ENSP00000408168.2',
 'ENSP00000324909.5',
 'Q96RK0',
 'ENSP00000391869.3',
 'Q8NA57',
 'ENSP00000457957.1',
 'ENSP00000482225.1',
 'ENSP00000489519.1',
 'ENSP00000482011.1',
 'Q6NS38-1',
 'ENSP00000493302.1',
 'Q99569-2',
 'Q8IX01-3',
 'ENSP00000405973.1',
 'ENSP00000473509.2',
 'ENSP00000445323.1',
 'ENSP00000490118.1',
 'ENSP00000465194.1',
 'ENSP00000493122.1',
 'ENSP00000463483.1']
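The missing list mixes two kinds of ids: ENSP ids that could not be mapped, and UniProt ids that dropped out because all their pairs were filtered. A short sketch of computing and splitting such a list with set operations, using a few ids from the output above; the set of kept ids is a made-up stand-in.

```python
# Ids before filtering (a few values from the output above plus one kept id)
all_ids = {'ENSP00000408321.1', 'Q9Y2P8-1', 'Q96RK0', 'P0DP25'}
# Hypothetical set of ids that survive the filtering
kept_ids = {'P0DP25'}

# Set difference gives the ids that dropped out; sort for a stable order
missing = sorted(all_ids - kept_ids)

missing_ensp = [x for x in missing if x.startswith('ENSP')]
missing_uniprot = [x for x in missing if not x.startswith('ENSP')]

print('unmapped ENSP ids:', missing_ensp)
print('UniProt ids lost with their pairs:', missing_uniprot)
```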

# RESULT
From this point on I accept that we are missing 20 PPIs. The number of unique gene (ENSG) combinations in the PSI-MI file matches the gene combination tsv file (52548). When I use the same method to get all unique ENSP id combinations, the result is 52549, which is 20 fewer than the 52569 PPIs the paper claims.

There are now two things we can do:

1. Use the UniProt id combinations. This means that we have 51767 PPIs (802 fewer than the paper claims).
2. Use the ENSP id combinations. This means that we have 52549 PPIs (20 fewer than the paper claims). We are also missing one protein here (8274 instead of 8275). It could be that one protein with 20 PPIs was removed.
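The gaps quoted in the two options follow directly from the counts printed earlier in this notebook. A small sketch of the arithmetic; the dictionary keys are just labels chosen for this example.

```python
paper_ppis = 52569

# Unique combinations found in this notebook
found = {'uniprot': 51767, 'ensp': 52549}

# Difference between the paper's claim and each id type
gaps = {id_type: paper_ppis - count for id_type, count in found.items()}
print(gaps)
```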

In [29]:
uniprot_comb_huri_final.to_csv('../Data/Interactome/uniprot_ids_unique_combinations_huri.csv',
                               header=False, index=False)
ensp_comb_huri_final.to_csv('../Data/Interactome/ensp_ids_unique_combinations_huri.csv',
                               header=False, index=False)

all_unique_uniprot_ids.to_csv('../Data/Interactome/all_unique_uniprot_ids_huri.csv',
                               header=False, index=False)
all_unique_ensp_ids.to_csv('../Data/Interactome/all_unique_ensp_ids_huri.csv',
                               header=False, index=False)