# Merge APID and HuRi

Two options:

- everything to UniProtKB
- everything to ENSP

The UniProt Retrieve/ID mapping tool (https://www.uniprot.org/uploadlists/) is used for the conversion:
- HuRi | ENSP => UniProtKB (Data/Interactome/all_unique_ENSP_ids_huri.csv):
    - Output: 8122 out of 8274 Ensembl Protein identifiers were successfully mapped to 8062 UniProtKB IDs in the table below.
- HuRi | UniProtID AC/ID => UniProtKB (Data/Interactome/all_unique_UniProt_ids_huri.csv):
    - Output: 8168 out of 8184 UniProtKB AC/ID identifiers were successfully mapped to 8164 UniProtKB IDs in the table below.
- APID | UniProtID AC/ID => ENSP
    - Output: 12,414 out of 13,346 identifiers from UniProtKB AC/ID were successfully mapped to 33,333 Ensembl Protein IDs.
- APID | UniProtID AC/ID => UniProtKB
    - Output: 13346 out of 13346 UniProtKB AC/ID identifiers were successfully mapped to 13317 UniProtKB IDs in the table below.
    
Based on these mapping results, UniProtKB is the better target: we only lose 143 proteins (the original HuRi dataset has 8275 unique proteins). With ENSP we would lose 932 proteins (13,346 - 12,414 unmapped APID identifiers). The APID UniProtID AC/ID => ENSP mapping also returns more ENSP IDs (33,333) than the number of identifiers submitted (13,346), so the mapping is one-to-many. This makes ENSP inconvenient to use.
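
The comparison can be re-derived from the counts reported by the mapping tool. The sketch below is only arithmetic on the figures quoted above; the dictionary keys are labels chosen here for illustration. It counts unmapped input identifiers per run, which need not equal the protein-level loss quoted in the text.

```python
# (mapped, submitted) input identifiers per mapping run, as reported by the tool
runs = {
    "HuRi ENSP -> UniProtKB": (8122, 8274),
    "HuRi AC/ID -> UniProtKB": (8168, 8184),
    "APID AC/ID -> ENSP": (12414, 13346),
    "APID AC/ID -> UniProtKB": (13346, 13346),
}

# Unmapped identifiers = submitted - mapped
unmapped = {name: submitted - mapped for name, (mapped, submitted) in runs.items()}

for name, lost in unmapped.items():
    print(f"{name}: {lost} unmapped")
```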

Which proteins are lost?
How many PPIs are lost?
How many PPIs after merging?

In [12]:
import pandas as pd
import numpy as np

In [13]:
df_apid = pd.read_csv('../Data/Interactome/uniprot_ids_unique_combinations_apid.csv', header=None)
df_huri = pd.read_csv('../Data/Interactome/uniprot_ids_unique_combinations_huri.csv', header=None)
df_apid_mapping = pd.read_csv('../Data/IDMapping/UniprotKB_ACID_to_UniProtKB_apid_07_05_2020.tab',  sep='\t')
df_huri_mapping = pd.read_csv('../Data/IDMapping/UniProtKB_ACID_to_UniProtKB_huri_07_05_2020.tab',  sep='\t')

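# The first column can hold several comma-separated input IDs per row.
# Give every input ID its own row, paired with the mapped ID from the second column.
df_apid_mapping = pd.concat([pd.Series(row.iloc[1], row.iloc[0].split(','))
                    for _, row in df_apid_mapping.iterrows()]).reset_index()
df_huri_mapping = pd.concat([pd.Series(row.iloc[1], row.iloc[0].split(','))
                    for _, row in df_huri_mapping.iterrows()]).reset_index()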

df_apid_mapping.columns = [0,1]
df_huri_mapping.columns = [0,1]
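
The row-by-row `pd.concat` above splits comma-separated input IDs into one row per ID. A minimal sketch of the same reshaping on a toy table, using `str.split` plus `DataFrame.explode` as an alternative; the column names `From` and `To` and the identifiers are made up for illustration and are not taken from the mapping files.

```python
import pandas as pd

# Toy stand-in for a mapping table; column names and IDs are illustrative only
toy_mapping = pd.DataFrame({
    "From": ["P12345,Q11111", "O00001"],
    "To": ["P12345", "O00001"],
})

# Split the comma-separated inputs into lists, then give each input its own row
exploded = (
    toy_mapping.assign(From=toy_mapping["From"].str.split(","))
    .explode("From")
    .reset_index(drop=True)
)

# Dictionary from each submitted identifier to its mapped identifier
toy_dict = dict(zip(exploded["From"], exploded["To"]))
print(toy_dict)
```

On large mapping tables this avoids building one small `Series` per row, though the result should be the same.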

In [14]:
mapping_dict_apid = pd.Series(df_apid_mapping.iloc[:,1].values, index=df_apid_mapping.iloc[:,0]).to_dict()
mapping_dict_huri = pd.Series(df_huri_mapping.iloc[:,1].values, index=df_huri_mapping.iloc[:,0]).to_dict()
mapping_dict = {**mapping_dict_apid, **mapping_dict_huri}
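
When two dictionaries are merged with `{**a, **b}`, the value from the second one silently wins for any shared key. A small sketch, on toy dictionaries with made-up identifiers, of how one might check that the two mappings never disagree before merging:

```python
# Toy dictionaries; identifiers and values are illustrative only
map_a = {"P1": "P1", "P2-1": "P2"}
map_b = {"P2-1": "P2X", "P3": "P3"}

# The right-hand dictionary overrides the left-hand one on shared keys
merged = {**map_a, **map_b}

# Shared keys whose mapped values differ between the two sources
conflicts = {
    key: (map_a[key], map_b[key])
    for key in map_a.keys() & map_b.keys()
    if map_a[key] != map_b[key]
}
print(merged["P2-1"], conflicts)
```

An empty `conflicts` dictionary would mean the merge order does not matter.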

In [15]:
len(mapping_dict_apid)

13346

In [16]:
not_found = []

def convert(x):
    # Return the mapped UniProtKB ID; record x and return NaN when no mapping exists
    if x in mapping_dict:
        return mapping_dict[x]
    not_found.append(x)
    return np.nan

df_apid[0], df_apid[1] = df_apid[0].apply(convert), df_apid[1].apply(convert) 
df_huri[0], df_huri[1] = df_huri[0].apply(convert), df_huri[1].apply(convert) 
intersect = df_apid.merge(df_huri, how='inner', on=[0,1])
print(set(not_found))

{'Q8TCX5-2', 'Q96AP0-1', 'Q9Y2S0-1', 'Q8TBF2-8', 'Q99873-4', 'Q9ULD5-2', 'O60344-3', 'P05177-2', 'Q96QH2-2', 'Q9P0W5-4', 'Q99674-1', 'O60645-3', 'Q9NPA5-1', 'Q9BV68-2', 'Q8NDD1-3', 'Q9UJX0-2'}
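
The lookup-with-fallback above can also be written with `Series.map`, which yields NaN for keys missing from the dictionary. A minimal sketch on toy data; the identifiers and the dictionary are hypothetical, not taken from the APID or HuRi files.

```python
import pandas as pd

# Toy mapping and toy column of identifiers (illustrative only)
toy_map = {"P1": "P1", "P2-2": "P2"}
ids = pd.Series(["P1", "P2-2", "Q9-9"])

# Identifiers absent from the dictionary become NaN
converted = ids.map(toy_map)

# Collect the identifiers that could not be converted
missing = sorted(set(ids[converted.isna()]))
print(missing)
```

This keeps the unconverted identifiers without relying on a global list that is mutated inside the function.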


## Manual check of the IDs that could not be converted

Each isoform is characterized by a unique identifier, which is composed of the primary accession number of the entry, followed by a dash and a number.

Example: P04150-2

When alternative protein sequences differ significantly, UniProt creates separate entries and lists all isoforms in each of them. Consequently, isoforms produced from a single gene listed in one entry may have identifiers derived from different primary accession numbers.
Examples: P42166, P42167
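
The identifier format described above (primary accession, dash, isoform number) can be taken apart with plain string handling. A minimal sketch; `split_isoform` is a hypothetical helper introduced here, not part of the notebook.

```python
def split_isoform(identifier):
    """Split 'P04150-2' into ('P04150', 2); plain accessions get None as isoform."""
    accession, dash, number = identifier.partition("-")
    return accession, (int(number) if dash else None)


print(split_isoform("P04150-2"))
print(split_isoform("P42166"))
```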

All 16 missing isoforms had not yet been added to the UniProt database:
- Q8TCX5-2 
- Q96AP0-1
- Q9Y2S0-1 
- Q8TBF2-8 
- Q99873-4 
- Q9ULD5-2 
- O60344-3 
- P05177-2  
- Q96QH2-2 
- Q9P0W5-4 
- Q99674-1 
- O60645-3 
- Q9NPA5-1
- Q9BV68-2
- Q8NDD1-3
- Q9UJX0-2

In [21]:
df_apid_huri = pd.concat([df_apid,df_huri])
df_apid_huri_no_na = df_apid_huri.dropna()
df_apid_huri_sorted = pd.DataFrame(df_apid_huri_no_na[[0, 1]].apply(lambda x: sorted(x), axis=1).to_list())
df_apid_huri_unique = df_apid_huri_sorted.drop_duplicates()
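
Sorting each pair before dropping duplicates treats (A, B) and (B, A) as the same interaction. A small sketch of that idea on toy pairs, using `np.sort` along the row axis as an alternative to the row-wise `apply`; the protein names are placeholders.

```python
import numpy as np
import pandas as pd

# Toy interaction pairs; (B, A) and (A, B) describe the same undirected PPI
pairs = pd.DataFrame([["B", "A"], ["A", "B"], ["A", "C"]])

# Sort the two partners within each row so the order no longer matters
sorted_pairs = pd.DataFrame(np.sort(pairs.to_numpy(), axis=1))

# Identical rows now collapse into one interaction
unique_pairs = sorted_pairs.drop_duplicates().reset_index(drop=True)
print(unique_pairs)
```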

In [22]:
# Summary of PPI counts before and after ID mapping
print('Original number of unique PPIs HuRi 52569')
print('Original number of unique PPIs APID', len(df_apid))
print()
print('Number of unique PPIs HuRi after mapping', len(df_huri.dropna()))
print('Number of unique PPIs APID after mapping', len(df_apid.dropna()))
print('Number of combined unique PPIs APID|HURI after mapping', len(df_apid_huri_unique))
print()
print('Number of missing PPIS HuRi', 52569-len(df_huri.dropna()))
print('Number of missing PPIS APID', len(df_apid)-len(df_apid.dropna()))

Original number of unique PPIs HuRi 52569
Original number of unique PPIs APID 66206

Number of unique PPIs HuRi after mapping 51289
Number of unique PPIs APID after mapping 66206
Number of combined unique PPIs APID|HURI after mapping 111518

Number of missing PPIS HuRi 1280
Number of missing PPIS APID 0
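
The printed counts also give a rough idea of how much the two datasets overlap. The sketch below is only arithmetic on the numbers above. It assumes that every pair removed by `drop_duplicates` is shared between HuRi and APID; pairs duplicated within a single dataset after mapping would also be counted, so the result is best read as an upper bound on the overlap.

```python
# Counts printed above
huri_after_mapping = 51289
apid_after_mapping = 66206
combined_unique = 111518

# Rows before order-independent deduplication
stacked = huri_after_mapping + apid_after_mapping

# Pairs collapsed by deduplication (upper bound on PPIs shared by both datasets)
collapsed = stacked - combined_unique
print(stacked, collapsed)
```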


In [24]:
unique_proteins_apid = df_apid[0].dropna().to_list() + df_apid[1].dropna().to_list()
unique_proteins_huri = df_huri[0].dropna().to_list() + df_huri[1].dropna().to_list()
unique_proteins_huri_apid = unique_proteins_apid + unique_proteins_huri

print('Original number of unique proteins HuRi 8275')
print('Original number of unique proteins APID 13317')
print()
print('Number of unique proteins HuRi after mapping', len(set(unique_proteins_huri)))
print('Number of unique proteins APID after mapping', len(set(unique_proteins_apid)))
print('Number of combined unique proteins APID|HURI after mapping', len(set(unique_proteins_huri_apid)))

Original number of unique proteins HuRi 8275
Original number of unique proteins APID 13317

Number of unique proteins HuRi after mapping 8131
Number of unique proteins APID after mapping 13317
Number of combined unique proteins APID|HURI after mapping 15340
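
Because the protein counts above are set sizes, the number of proteins present in both datasets follows from inclusion-exclusion, |A ∩ B| = |A| + |B| - |A ∪ B|. A short sketch using only the printed numbers:

```python
# Set sizes printed above
huri_proteins = 8131
apid_proteins = 13317
combined_proteins = 15340

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
shared_proteins = huri_proteins + apid_proteins - combined_proteins

# Proteins contributed by only one of the two datasets
huri_only = huri_proteins - shared_proteins
apid_only = apid_proteins - shared_proteins
print(shared_proteins, huri_only, apid_only)
```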
