# Protein ID Mapping to HGNC Symbols

This notebook demonstrates the process of mapping protein IDs to HGNC (HUGO Gene Nomenclature Committee) symbols. The workflow includes loading protein (or phosphosite) data, extracting protein IDs, mapping these IDs using different databases, and updating our dataset with the new identifiers.

## Workflow Steps:
1. Setup and import necessary libraries and modules.
2. Load the dataset containing protein IDs.
3. Extract protein IDs from the dataset.
4. Map protein IDs to HGNC symbols using UniProt and HGNC databases.
5. Update the dataset with mapped HGNC symbols.
6. Analyze and summarize the results.

Let's begin by setting up our environment.

In [1]:
# Import necessary libraries
import pandas as pd
import os
import logging

# Import functions from the id_mapping module
import src.id_mapping as idm

# Configure logging
logging.basicConfig(level=logging.INFO)

# Set working directory
os.chdir('../')

## Loading Protein Data

We start by loading our dataset containing Area Under the Peak (AUP) data from LC-MS/MS phosphoproteomics experiments. Rows represent phosphosites (or features), and columns represent samples. Understanding the structure of our data is crucial for the subsequent mapping process.

In [2]:
# Load the dataset
aup = pd.read_csv('resources/raw_data/phosphodata_aup.tsv', sep='\t', index_col=0)

# Display the first few rows of the dataset
aup.head()

Substrate,b1790p079_PP1_MCF7_AZD1480_r1,b1790p079_PP1_MCF7_AZD1480_r2,b1790p079_PP1_MCF7_AZD5363_r1,b1790p079_PP1_MCF7_AZD5363_r2,b1790p079_PP1_MCF7_BX_r1,b1790p079_PP1_MCF7_BX_r2,b1790p079_PP1_MCF7_CHIR_r1,b1790p079_PP1_MCF7_CHIR_r2,b1790p079_PP1_MCF7_DMSO_r1,b1790p079_PP1_MCF7_DMSO_r2,...,b1790p077_PP2_MCF7_KU_r1,b1790p077_PP2_MCF7_KU_r2,b1790p077_PP2_MCF7_PD_r1,b1790p077_PP2_MCF7_PD_r2,b1790p077_PP2_MCF7_PF_r1,b1790p077_PP2_MCF7_PF_r2,b1790p077_PP2_MCF7_TORIN_r1,b1790p077_PP2_MCF7_TORIN_r2,b1790p077_PP2_MCF7_TRAMETINIB_r1,b1790p077_PP2_MCF7_TRAMETINIB_r2
1A24_HUMAN(S356),269296800.0,350275200.0,233906500.0,147323800.0,229208800.0,199261200.0,222675500.0,203111100.0,172431200.0,195279900.0,...,194267300.0,244622400.0,122890700.0,278181800.0,234142800.0,188540200.0,160101400.0,287738000.0,461097000.0,460718600.0
1A24_HUMAN(S359),269296800.0,350275200.0,233906500.0,147323800.0,229208800.0,199261200.0,222675500.0,203111100.0,172431200.0,195279900.0,...,194267300.0,244622400.0,122890700.0,278181800.0,234142800.0,188540200.0,160101400.0,287738000.0,461097000.0,460718600.0
1B39_HUMAN(M1),769594700.0,523120200.0,1278819000.0,701672800.0,63178680.0,328856200.0,767254600.0,886455900.0,3021734000.0,1579953000.0,...,0.0,0.0,527955500.0,1183298000.0,1879283000.0,0.0,1289702000.0,1528154000.0,991706000.0,165826200.0
1B39_HUMAN(M4),769594700.0,523120200.0,1278819000.0,701672800.0,63178680.0,328856200.0,767254600.0,886455900.0,3021734000.0,1579953000.0,...,0.0,0.0,527955500.0,1183298000.0,1879283000.0,0.0,1289702000.0,1528154000.0,991706000.0,165826200.0
AAAS(S495),2664712000.0,3635279000.0,3285959000.0,2743536000.0,3822861000.0,4393195000.0,3287022000.0,3540691000.0,2247902000.0,6012226000.0,...,8963979000.0,8351130000.0,9330759000.0,7767012000.0,7900387000.0,7211204000.0,8041161000.0,8921806000.0,8499730000.0,7352127000.0


## Extracting Protein IDs

Because the row identifiers in our phosphoproteomics dataset are phosphosite IDs, we first need to extract the unique protein IDs from the index before mapping. These protein IDs are then used to fetch the corresponding HGNC symbols. The mapping functions (`get_uniprot_id_map`, `get_hgnc_id_map`) handle this internally, but the step is shown here explicitly as an example:

In [3]:
# Extract unique protein IDs
extracted_ids = idm.extract_protein_ids(aup.index)

# Display extracted IDs
print('Phosphosite IDs: ', aup.index.to_list()[:5])
print('Protein IDs: ', extracted_ids[:5])

Phosphosite IDs:  ['1A24_HUMAN(S356)', '1A24_HUMAN(S359)', '1B39_HUMAN(M1)', '1B39_HUMAN(M4)', 'AAAS(S495)']
Protein IDs:  ['1A24_HUMAN', '1B39_HUMAN', 'AAAS', 'AAGAB', 'AAK1']
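
To make the extraction step concrete, here is a minimal sketch of how protein IDs could be derived from phosphosite IDs. It is not the actual `idm.extract_protein_ids` implementation: the helper names `strip_site` and `unique_protein_ids` are hypothetical, and the sketch assumes every site annotation is a single uppercase letter followed by digits in trailing parentheses, as in the sample output above.

```python
import re

# Hypothetical helpers for illustration; not the idm module's own code.
# Assumption: site annotations look like "(S356)" at the end of the ID.
_SITE_SUFFIX = re.compile(r"\([A-Z]\d+\)$")


def strip_site(phosphosite_id):
    """Remove a trailing '(S356)'-style site annotation, if present."""
    return _SITE_SUFFIX.sub("", phosphosite_id)


def unique_protein_ids(phosphosite_ids):
    """Return protein IDs in first-seen order, without duplicates."""
    return list(dict.fromkeys(strip_site(p) for p in phosphosite_ids))


sites = ['1A24_HUMAN(S356)', '1A24_HUMAN(S359)', '1B39_HUMAN(M1)', 'AAAS(S495)']
print(unique_protein_ids(sites))
# → ['1A24_HUMAN', '1B39_HUMAN', 'AAAS']
```

Using `dict.fromkeys` keeps the first-seen order, which makes the result easier to compare against the original index than a plain `set` would.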


## 1. Mapping Protein IDs to HGNC Symbols Using UniProt (UniProtMapper)

We use the UniProt database to map our protein IDs to HGNC symbols. This mapping is crucial for standardizing our dataset. The mapping is performed with UniProtMapper, a Python wrapper for UniProt's Retrieve/ID Mapping RESTful API.

In [4]:
# Map protein IDs using UniProt
uniprot_map = idm.get_uniprot_id_map(
    aup.index.to_list(), 
    from_id="UniProtKB_AC-ID", 
    to_id="Gene_Name", 
    tool='uniprot_protmapper', 
    phosphosites=True, 
    return_df=True)

# Display the dataset, which is unchanged at this point
# (use uniprot_map.head() to inspect the mapping results themselves)
aup.head()

Setting fields to `None` to retrieve all available fields...


Fetched: 3 / 500
Fetched: 2 / 500
Fetched: 6 / 500
Fetched: 4 / 500
Fetched: 0 / 500
Fetched: 3 / 500
Fetched: 3 / 500
Fetched: 2 / 500


INFO:root:
Mapped 30 out of 4593 proteins from UniProtKB_AC-IDs to Gene_Names.


Fetched: 5 / 500
Fetched: 2 / 93


Substrate,b1790p079_PP1_MCF7_AZD1480_r1,b1790p079_PP1_MCF7_AZD1480_r2,b1790p079_PP1_MCF7_AZD5363_r1,b1790p079_PP1_MCF7_AZD5363_r2,b1790p079_PP1_MCF7_BX_r1,b1790p079_PP1_MCF7_BX_r2,b1790p079_PP1_MCF7_CHIR_r1,b1790p079_PP1_MCF7_CHIR_r2,b1790p079_PP1_MCF7_DMSO_r1,b1790p079_PP1_MCF7_DMSO_r2,...,b1790p077_PP2_MCF7_KU_r1,b1790p077_PP2_MCF7_KU_r2,b1790p077_PP2_MCF7_PD_r1,b1790p077_PP2_MCF7_PD_r2,b1790p077_PP2_MCF7_PF_r1,b1790p077_PP2_MCF7_PF_r2,b1790p077_PP2_MCF7_TORIN_r1,b1790p077_PP2_MCF7_TORIN_r2,b1790p077_PP2_MCF7_TRAMETINIB_r1,b1790p077_PP2_MCF7_TRAMETINIB_r2
1A24_HUMAN(S356),269296800.0,350275200.0,233906500.0,147323800.0,229208800.0,199261200.0,222675500.0,203111100.0,172431200.0,195279900.0,...,194267300.0,244622400.0,122890700.0,278181800.0,234142800.0,188540200.0,160101400.0,287738000.0,461097000.0,460718600.0
1A24_HUMAN(S359),269296800.0,350275200.0,233906500.0,147323800.0,229208800.0,199261200.0,222675500.0,203111100.0,172431200.0,195279900.0,...,194267300.0,244622400.0,122890700.0,278181800.0,234142800.0,188540200.0,160101400.0,287738000.0,461097000.0,460718600.0
1B39_HUMAN(M1),769594700.0,523120200.0,1278819000.0,701672800.0,63178680.0,328856200.0,767254600.0,886455900.0,3021734000.0,1579953000.0,...,0.0,0.0,527955500.0,1183298000.0,1879283000.0,0.0,1289702000.0,1528154000.0,991706000.0,165826200.0
1B39_HUMAN(M4),769594700.0,523120200.0,1278819000.0,701672800.0,63178680.0,328856200.0,767254600.0,886455900.0,3021734000.0,1579953000.0,...,0.0,0.0,527955500.0,1183298000.0,1879283000.0,0.0,1289702000.0,1528154000.0,991706000.0,165826200.0
AAAS(S495),2664712000.0,3635279000.0,3285959000.0,2743536000.0,3822861000.0,4393195000.0,3287022000.0,3540691000.0,2247902000.0,6012226000.0,...,8963979000.0,8351130000.0,9330759000.0,7767012000.0,7900387000.0,7211204000.0,8041161000.0,8921806000.0,8499730000.0,7352127000.0


## Updating the DataFrame with Mapped IDs

After mapping our protein IDs to HGNC symbols, we update our original dataset with these new identifiers.

In [5]:
# Update the DataFrame with HGNC symbols
aup_m1 = idm.index_ids_to_hgnc_symbols(aup, database='uniprot_protmapper', phosphosites=True)

# Display the updated DataFrame
aup_m1.head()

Setting fields to `None` to retrieve all available fields...


Fetched: 3 / 500
Fetched: 2 / 500
Fetched: 6 / 500
Fetched: 4 / 500
Fetched: 0 / 500
Fetched: 3 / 500
Fetched: 3 / 500
Fetched: 2 / 500


INFO:root:
Mapped 30 out of 4593 proteins from UniProtKB_AC-IDs to Gene_Names.
INFO:root:Transformed 36 out of 14448 phosphosites.


Fetched: 5 / 500
Fetched: 2 / 93


Unnamed: 0,b1790p079_PP1_MCF7_AZD1480_r1,b1790p079_PP1_MCF7_AZD1480_r2,b1790p079_PP1_MCF7_AZD5363_r1,b1790p079_PP1_MCF7_AZD5363_r2,b1790p079_PP1_MCF7_BX_r1,b1790p079_PP1_MCF7_BX_r2,b1790p079_PP1_MCF7_CHIR_r1,b1790p079_PP1_MCF7_CHIR_r2,b1790p079_PP1_MCF7_DMSO_r1,b1790p079_PP1_MCF7_DMSO_r2,...,b1790p077_PP2_MCF7_KU_r1,b1790p077_PP2_MCF7_KU_r2,b1790p077_PP2_MCF7_PD_r1,b1790p077_PP2_MCF7_PD_r2,b1790p077_PP2_MCF7_PF_r1,b1790p077_PP2_MCF7_PF_r2,b1790p077_PP2_MCF7_TORIN_r1,b1790p077_PP2_MCF7_TORIN_r2,b1790p077_PP2_MCF7_TRAMETINIB_r1,b1790p077_PP2_MCF7_TRAMETINIB_r2
1A24_HUMAN(S356),269296800.0,350275200.0,233906500.0,147323800.0,229208800.0,199261200.0,222675500.0,203111100.0,172431200.0,195279900.0,...,194267300.0,244622400.0,122890700.0,278181800.0,234142800.0,188540200.0,160101400.0,287738000.0,461097000.0,460718600.0
1A24_HUMAN(S359),269296800.0,350275200.0,233906500.0,147323800.0,229208800.0,199261200.0,222675500.0,203111100.0,172431200.0,195279900.0,...,194267300.0,244622400.0,122890700.0,278181800.0,234142800.0,188540200.0,160101400.0,287738000.0,461097000.0,460718600.0
1B39_HUMAN(M1),769594700.0,523120200.0,1278819000.0,701672800.0,63178680.0,328856200.0,767254600.0,886455900.0,3021734000.0,1579953000.0,...,0.0,0.0,527955500.0,1183298000.0,1879283000.0,0.0,1289702000.0,1528154000.0,991706000.0,165826200.0
1B39_HUMAN(M4),769594700.0,523120200.0,1278819000.0,701672800.0,63178680.0,328856200.0,767254600.0,886455900.0,3021734000.0,1579953000.0,...,0.0,0.0,527955500.0,1183298000.0,1879283000.0,0.0,1289702000.0,1528154000.0,991706000.0,165826200.0
AAAS(S495),2664712000.0,3635279000.0,3285959000.0,2743536000.0,3822861000.0,4393195000.0,3287022000.0,3540691000.0,2247902000.0,6012226000.0,...,8963979000.0,8351130000.0,9330759000.0,7767012000.0,7900387000.0,7211204000.0,8041161000.0,8921806000.0,8499730000.0,7352127000.0
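
The index update can be pictured as a label-by-label rename that swaps the protein part and keeps the site annotation. The sketch below is only an illustration of that idea, not the code behind `idm.index_ids_to_hgnc_symbols`; the function name `rename_labels` is hypothetical, and the toy site `(S5)` is made up.

```python
import re

# Hypothetical helper for illustration; not the idm module's own code.
# Splits "PROTEIN(S123)" into the protein part and an optional site part.
_LABEL = re.compile(r"^(?P<protein>.+?)(?P<site>\([A-Z]\d+\))?$")


def rename_labels(labels, mapping):
    """Replace the protein part of each label via `mapping`, keeping the site."""
    renamed = []
    for label in labels:
        match = _LABEL.match(label)
        protein = match.group("protein")
        site = match.group("site") or ""
        # Unmapped proteins fall through unchanged.
        renamed.append(mapping.get(protein, protein) + site)
    return renamed


toy_labels = ["AIM1_HUMAN(S103)", "ASUN_HUMAN(S5)", "AAAS(S495)"]
toy_mapping = {"AIM1_HUMAN": "CRYBG1", "ASUN_HUMAN": "INTS13"}
print(rename_labels(toy_labels, toy_mapping))
# → ['CRYBG1(S103)', 'INTS13(S5)', 'AAAS(S495)']
```

With a pandas DataFrame, the same idea would amount to assigning `df.index = rename_labels(df.index, mapping)` on a copy of the data.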


## 2. Mapping Protein IDs to HGNC Symbols Using UniProt (UniProt Search)

Even after the previous ID mapping step, some UniProt entries remain unmapped, likely because they are historical entry names that the standard UniProt ID Mapping service cannot resolve. A conventional UniProt search, however, includes entry history. We therefore select all IDs that still carry the UniProt entry-name tag (`_HUMAN`), run a UniProt search for each of them, and take the gene name associated with the first search result. Because only the top hit is used, the resulting mappings are worth spot-checking.

In [6]:
# Identify UniProt IDs that have not been mapped and contain the "_HUMAN" tag.
filtered_ids = [id_ for id_ in aup_m1.index.to_list() if '_HUMAN' in id_]
print(f"Sample of filtered IDs for further mapping: {filtered_ids[:5]}")

# Use the UniProt search to map these filtered IDs to gene names.
uniprot_map = idm.get_uniprot_id_map(
    filtered_ids, 
    from_id="UniProtKB_AC-ID", 
    to_id="Gene_Name", 
    tool='uniprot_search', 
    phosphosites=True, 
    return_df=True
)
# Display the first few mappings to verify the process
uniprot_map.head()

Sample of filtered IDs for further mapping: ['1A24_HUMAN(S356)', '1A24_HUMAN(S359)', '1B39_HUMAN(M1)', '1B39_HUMAN(M4)', 'AIM1_HUMAN(S103)']
Processed 35/114 identifiers.

ERROR:root:
FA21D_HUMAN doesn't exist in UniProtKB database.


Processed 109/114 identifiers.

INFO:root:
Mapped 105 out of 114 proteins from UniProtKB_AC-IDs to Gene_Names.


Processed 114/114 identifiers.

Unnamed: 0,From,To
1,1B39_HUMAN,SYNDIG1
2,AIM1_HUMAN,CRYBG1
3,ASUN_HUMAN,INTS13
4,BICR1_HUMAN,BICDL1
5,CA106_HUMAN,INAVA
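
The "first search result" logic can be sketched independently of the network call. The example below is an assumption-laden illustration, not what `idm.get_uniprot_id_map` does internally: it assumes the public UniProt REST search endpoint and its JSON layout (`results` → `genes` → `geneName` → `value`), and the helper names `build_search_url` and `first_gene_name` are hypothetical. Whether this endpoint resolves historical entry names the same way as the website search has not been verified here.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public UniProt REST API (not confirmed by this notebook).
SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(entry_name, size=1):
    """Build a search URL that asks for the top `size` hits as JSON."""
    params = {"query": entry_name, "format": "json", "size": size}
    return f"{SEARCH_URL}?{urlencode(params)}"


def first_gene_name(payload):
    """Return the primary gene name of the first hit, or None if absent."""
    results = payload.get("results", [])
    if not results:
        return None
    genes = results[0].get("genes", [])
    if not genes:
        return None
    return genes[0].get("geneName", {}).get("value")


# Hand-written payload mimicking the assumed response shape (no network access).
mock_payload = {"results": [{"genes": [{"geneName": {"value": "CRYBG1"}}]}]}
print(build_search_url("AIM1_HUMAN"))
print(first_gene_name(mock_payload))
```

Returning `None` for empty results mirrors the case logged above, where an identifier such as `FA21D_HUMAN` is not found and has to be handled later.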


In [7]:
# Preparing to update our DataFrame with the new mappings.
# Splitting the original DataFrame into two parts:
# 1. Entries corresponding to the filtered IDs for updating.
# 2. The rest of the entries that don't need updating.
aup_m2a = aup_m1.loc[filtered_ids]
aup_m2b = aup_m1.drop(filtered_ids)

# Update the entries with the newly mapped HGNC symbols using UniProt search results.
aup_m2a = idm.index_ids_to_hgnc_symbols(aup_m2a, database='uniprot_search', phosphosites=True)

# Reassemble the DataFrame by concatenating the updated part with the unchanged part.
aup_m2 = pd.concat([aup_m2a, aup_m2b])

# Display the updated DataFrame to confirm the successful mapping and integration.
aup_m2.head()

Processed 34/114 identifiers.

ERROR:root:
FA21D_HUMAN doesn't exist in UniProtKB database.


Processed 112/114 identifiers.

INFO:root:
Mapped 105 out of 114 proteins from UniProtKB_AC-IDs to Gene_Names.
INFO:root:Transformed 266 out of 282 phosphosites.


Processed 114/114 identifiers.

Unnamed: 0,b1790p079_PP1_MCF7_AZD1480_r1,b1790p079_PP1_MCF7_AZD1480_r2,b1790p079_PP1_MCF7_AZD5363_r1,b1790p079_PP1_MCF7_AZD5363_r2,b1790p079_PP1_MCF7_BX_r1,b1790p079_PP1_MCF7_BX_r2,b1790p079_PP1_MCF7_CHIR_r1,b1790p079_PP1_MCF7_CHIR_r2,b1790p079_PP1_MCF7_DMSO_r1,b1790p079_PP1_MCF7_DMSO_r2,...,b1790p077_PP2_MCF7_KU_r1,b1790p077_PP2_MCF7_KU_r2,b1790p077_PP2_MCF7_PD_r1,b1790p077_PP2_MCF7_PD_r2,b1790p077_PP2_MCF7_PF_r1,b1790p077_PP2_MCF7_PF_r2,b1790p077_PP2_MCF7_TORIN_r1,b1790p077_PP2_MCF7_TORIN_r2,b1790p077_PP2_MCF7_TRAMETINIB_r1,b1790p077_PP2_MCF7_TRAMETINIB_r2
1A24_HUMAN(S356),269296800.0,350275208.7,233906500.0,147323791.5,229208800.0,199261200.0,222675485.3,203111064.7,172431200.0,195279900.0,...,194267298.8,244622442.6,122890720.9,278181800.0,234142800.0,188540200.0,160101400.0,287738000.0,461097000.0,460718561.8
1A24_HUMAN(S359),269296800.0,350275208.7,233906500.0,147323791.5,229208800.0,199261200.0,222675485.3,203111064.7,172431200.0,195279900.0,...,194267298.8,244622442.6,122890720.9,278181800.0,234142800.0,188540200.0,160101400.0,287738000.0,461097000.0,460718561.8
SYNDIG1(M1),769594700.0,523120228.8,1278819000.0,701672838.7,63178680.0,328856200.0,767254583.2,886455926.9,3021734000.0,1579953000.0,...,0.0,0.0,527955514.4,1183298000.0,1879283000.0,0.0,1289702000.0,1528154000.0,991706000.0,165826164.4
SYNDIG1(M4),769594700.0,523120228.8,1278819000.0,701672838.7,63178680.0,328856200.0,767254583.2,886455926.9,3021734000.0,1579953000.0,...,0.0,0.0,527955514.4,1183298000.0,1879283000.0,0.0,1289702000.0,1528154000.0,991706000.0,165826164.4
CRYBG1(S103),18501120000.0,423470396.9,16676920000.0,268909936.1,103438300.0,7651674000.0,525139303.1,0.0,232604200.0,58119150.0,...,0.0,0.0,0.0,0.0,0.0,5767428.0,0.0,0.0,3752771.0,0.0


## 3. Mapping Protein IDs to HGNC Symbols Using HGNC (HGNC Multi-symbol checker)

The next crucial step is to align all gene names with the currently approved HGNC symbols. To achieve this, we use the HGNC Multi-symbol checker, which is accessible via its API. This tool recognises outdated gene names or aliases and maps them to their current, approved symbols.

By using this API, we not only update our dataset with the latest nomenclature but also retain information on the original state of each symbol, categorizing them as:
- Already approved (`Match type = symbol`)
- Successfully mapped from a previous name or alias (`Match type = prev_symbol`)
- Unmatched, indicating no current approved symbol could be found (`Match type = unmatched`).
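
As a small illustration of how these three categories might be used downstream, the sketch below tallies match types and separates out the entries that need renaming or manual review. The records are toy data: `OLDNAME1` and `NEWNAME1` are made-up placeholders, and the column names simply mirror the table shown in the next cell.

```python
from collections import Counter

# Toy records in the same shape as the mapping table; values are illustrative.
rows = [
    {"Input": "SYNDIG1", "Match type": "symbol", "Approved symbol": "SYNDIG1"},
    {"Input": "OLDNAME1", "Match type": "prev_symbol", "Approved symbol": "NEWNAME1"},
    {"Input": "FA21D_HUMAN", "Match type": "unmatched", "Approved symbol": None},
]

# How many inputs fall into each category.
counts = Counter(row["Match type"] for row in rows)

# Inputs that must be renamed to their approved symbol.
renames = {
    row["Input"]: row["Approved symbol"]
    for row in rows
    if row["Match type"] == "prev_symbol"
}

# Inputs with no approved symbol, to be investigated manually.
needs_review = [row["Input"] for row in rows if row["Match type"] == "unmatched"]

print(dict(counts))
print(renames)
print(needs_review)
```

On the actual `hgnc_map` DataFrame, `hgnc_map['Match type'].value_counts()` would give the same kind of tally.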

In [8]:
# Utilize the HGNC API to map all gene names in our dataset to the current approved HGNC symbols.
# This includes detecting and updating previous gene names or aliases to their current approved forms.
hgnc_map = idm.get_hgnc_id_map(aup_m2.index, phosphosites=True, return_df=True)

# This DataFrame includes approved, mapped, and unmatched symbols.
hgnc_map.head()

INFO:root:Starting to fetch symbols for 4585 identifiers...


Processed 4572/4585 identifiers

INFO:root:
Mapped 123 out of 4585 proteins to approved HGNC gene symbols.


Processed 4585/4585 identifiers

Unnamed: 0,Input,Match type,Approved symbol
0,SYNDIG1,symbol,SYNDIG1
1,CFAP58,symbol,CFAP58
2,CRYBG1,symbol,CRYBG1
3,INTS13,symbol,INTS13
4,WDCP,symbol,WDCP


In [9]:
# Update the DataFrame with the HGNC-approved symbols.
# This step ensures that our dataset's gene names are standardized according to the latest nomenclature.
aup_m3 = idm.index_ids_to_hgnc_symbols(aup_m2, database='hgnc', phosphosites=True)

# Display the updated DataFrame to confirm the successful application of current approved HGNC symbols.
aup_m3.head()

INFO:root:Starting to fetch symbols for 4585 identifiers...


Processed 4583/4585 identifiers

INFO:root:
Mapped 123 out of 4585 proteins to approved HGNC gene symbols.
INFO:root:Transformed 323 out of 14448 phosphosites.


Processed 4585/4585 identifiers

Unnamed: 0,b1790p079_PP1_MCF7_AZD1480_r1,b1790p079_PP1_MCF7_AZD1480_r2,b1790p079_PP1_MCF7_AZD5363_r1,b1790p079_PP1_MCF7_AZD5363_r2,b1790p079_PP1_MCF7_BX_r1,b1790p079_PP1_MCF7_BX_r2,b1790p079_PP1_MCF7_CHIR_r1,b1790p079_PP1_MCF7_CHIR_r2,b1790p079_PP1_MCF7_DMSO_r1,b1790p079_PP1_MCF7_DMSO_r2,...,b1790p077_PP2_MCF7_KU_r1,b1790p077_PP2_MCF7_KU_r2,b1790p077_PP2_MCF7_PD_r1,b1790p077_PP2_MCF7_PD_r2,b1790p077_PP2_MCF7_PF_r1,b1790p077_PP2_MCF7_PF_r2,b1790p077_PP2_MCF7_TORIN_r1,b1790p077_PP2_MCF7_TORIN_r2,b1790p077_PP2_MCF7_TRAMETINIB_r1,b1790p077_PP2_MCF7_TRAMETINIB_r2
1A24_HUMAN(S356),269296800.0,350275208.7,233906500.0,147323791.5,229208800.0,199261200.0,222675485.3,203111064.7,172431200.0,195279900.0,...,194267298.8,244622442.6,122890720.9,278181800.0,234142800.0,188540200.0,160101400.0,287738000.0,461097000.0,460718561.8
1A24_HUMAN(S359),269296800.0,350275208.7,233906500.0,147323791.5,229208800.0,199261200.0,222675485.3,203111064.7,172431200.0,195279900.0,...,194267298.8,244622442.6,122890720.9,278181800.0,234142800.0,188540200.0,160101400.0,287738000.0,461097000.0,460718561.8
SYNDIG1(M1),769594700.0,523120228.8,1278819000.0,701672838.7,63178680.0,328856200.0,767254583.2,886455926.9,3021734000.0,1579953000.0,...,0.0,0.0,527955514.4,1183298000.0,1879283000.0,0.0,1289702000.0,1528154000.0,991706000.0,165826164.4
SYNDIG1(M4),769594700.0,523120228.8,1278819000.0,701672838.7,63178680.0,328856200.0,767254583.2,886455926.9,3021734000.0,1579953000.0,...,0.0,0.0,527955514.4,1183298000.0,1879283000.0,0.0,1289702000.0,1528154000.0,991706000.0,165826164.4
CRYBG1(S103),18501120000.0,423470396.9,16676920000.0,268909936.1,103438300.0,7651674000.0,525139303.1,0.0,232604200.0,58119150.0,...,0.0,0.0,0.0,0.0,0.0,5767428.0,0.0,0.0,3752771.0,0.0


## 4. Manual Mapping of Unmatched Protein IDs to HGNC Symbols

After automating the mapping process as much as possible, our dataset now includes updated HGNC symbols. The next step involves identifying IDs that lack an approved HGNC gene symbol, so we can investigate and manually map them. We begin by identifying and exporting these unmatched IDs to a file. Subsequently, we incorporate a manually curated mapping of these IDs from an external source to update our dataframe.

In [10]:
# Filter for unmatched IDs from the HGNC mapping process.
# These IDs did not find a corresponding approved HGNC gene symbol and need manual investigation.
missing_hgnc_symbols = hgnc_map.loc[hgnc_map['Match type'] == 'unmatched', 'Input'].sort_values().reset_index(drop=True)

# Display the missing HGNC symbols for review.
print(missing_hgnc_symbols)

# Export the list of missing HGNC symbols to a TSV file for external manual mapping.
# This file serves as a reference for investigating and manually mapping these IDs.
missing_hgnc_symbols.to_csv('workspace/missing_hgnc_symbols.tsv', index=False, header=False)

0      1A24_HUMAN
1          AKT1.2
2     CJ012_HUMAN
3     FA21D_HUMAN
4     FRG1B_HUMAN
5          Sep-02
6          Sep-05
7          Sep-06
8          Sep-07
9          Sep-09
10    YA043_HUMAN
11    YJ005_HUMAN
12    YL004_HUMAN
13    YS060_HUMAN
14    YV023_HUMAN
Name: Input, dtype: object


In [11]:
# Load the manually mapped HGNC symbols from an external file.
# This file contains the mappings curated through external investigation.
mapped_hgnc_symbols = pd.read_csv('resources/external/mapped_hgnc_symbols.tsv', sep='\t')

# Create a mapping dictionary from the manually mapped symbols for conversion.
mapping_dict = dict(zip(mapped_hgnc_symbols['From'], mapped_hgnc_symbols['To']))

# Update the DataFrame with manually mapped HGNC symbols.
# This step integrates our manual mapping efforts into the dataset.
aup_m4 = idm.update_df_index_with_mappings(aup_m3, mapping_dict, phosphosites=True)

# Sort the updated DataFrame by index to organize the entries.
aup_m4.sort_index(inplace=True)

# Display the updated DataFrame with both automatically and manually mapped HGNC symbols.
aup_m4.head()

INFO:root:Transformed 27 out of 14448 phosphosites.


Unnamed: 0,b1790p079_PP1_MCF7_AZD1480_r1,b1790p079_PP1_MCF7_AZD1480_r2,b1790p079_PP1_MCF7_AZD5363_r1,b1790p079_PP1_MCF7_AZD5363_r2,b1790p079_PP1_MCF7_BX_r1,b1790p079_PP1_MCF7_BX_r2,b1790p079_PP1_MCF7_CHIR_r1,b1790p079_PP1_MCF7_CHIR_r2,b1790p079_PP1_MCF7_DMSO_r1,b1790p079_PP1_MCF7_DMSO_r2,...,b1790p077_PP2_MCF7_KU_r1,b1790p077_PP2_MCF7_KU_r2,b1790p077_PP2_MCF7_PD_r1,b1790p077_PP2_MCF7_PD_r2,b1790p077_PP2_MCF7_PF_r1,b1790p077_PP2_MCF7_PF_r2,b1790p077_PP2_MCF7_TORIN_r1,b1790p077_PP2_MCF7_TORIN_r2,b1790p077_PP2_MCF7_TRAMETINIB_r1,b1790p077_PP2_MCF7_TRAMETINIB_r2
AAAS(S495),2664712000.0,3635279000.0,3285959000.0,2743536000.0,3822861000.0,4393195000.0,3287022000.0,3540691000.0,2247902000.0,6012226000.0,...,8963979000.0,8351130000.0,9330759000.0,7767012000.0,7900387000.0,7211204000.0,8041161000.0,8921806000.0,8499730000.0,7352127000.0
AAGAB(M297),0.0,0.0,65261450.0,0.0,0.0,0.0,5477754.0,4922887.0,0.0,0.0,...,26542590.0,59023130.0,77617110.0,39163530.0,42308140.0,21611060.0,0.0,31617880.0,173287600.0,221985200.0
AAGAB(S310),1219400000.0,1578639000.0,1083647000.0,946774700.0,185089700.0,306555600.0,123130700.0,63371190.0,227542100.0,84980690.0,...,708279500.0,827131000.0,533886900.0,385552100.0,932632400.0,631139500.0,682125500.0,511506600.0,990284300.0,766505200.0
AAGAB(S311),1219400000.0,1578639000.0,1083647000.0,946774700.0,185089700.0,306555600.0,123130700.0,63371190.0,227542100.0,84980690.0,...,1593237000.0,1712089000.0,1237525000.0,1012566000.0,1602475000.0,1309217000.0,1377144000.0,1177742000.0,1742970000.0,1431783000.0
AAK1(S14),862491300.0,1414261000.0,1916668000.0,1086577000.0,875104500.0,155949100.0,958328800.0,850320200.0,1115720000.0,1117170000.0,...,1392150000.0,646504000.0,1100319000.0,649593600.0,1397288000.0,1449160000.0,1454729000.0,1980154000.0,842194200.0,1566950000.0
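
For the final "analyze and summarize" step, one simple check is to count how many proteins remain in the index and whether any still carry the `_HUMAN` tag. The sketch below is a hypothetical helper (`summarize_unmapped` is not part of the `idm` module) run on toy labels; the site `(S12)` is made up, and the sketch makes no claim about what the real dataset contains after manual mapping.

```python
def summarize_unmapped(index_labels, tag="_HUMAN"):
    """Count unique proteins and list those still ending with `tag`."""
    # Drop the site annotation by cutting at the first "(".
    proteins = list(dict.fromkeys(label.split("(")[0] for label in index_labels))
    leftover = [protein for protein in proteins if protein.endswith(tag)]
    return {"proteins": len(proteins), "still_tagged": leftover}


toy_index = ["AAAS(S495)", "AAGAB(M297)", "AAGAB(S310)", "YA043_HUMAN(S12)"]
print(summarize_unmapped(toy_index))
# → {'proteins': 3, 'still_tagged': ['YA043_HUMAN']}
```

Called as `summarize_unmapped(aup_m4.index)`, it would give a quick indication of whether another round of manual curation is needed.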


## Conclusion and Next Steps

In this notebook, we demonstrated a workflow for mapping protein IDs to HGNC symbols by combining UniProt ID mapping, UniProt search, the HGNC Multi-symbol checker, and a manually curated mapping for the remaining unmatched IDs. This process is essential for standardizing protein identifiers in our dataset.

### Next Steps
Use the standardized dataset for further biological analysis, including:
- Data preprocessing
- Quality control to identify and exclude low-quality samples and phosphosites
- Normalisation
- Differential phosphosite occupancy analysis (DPOA)
- Estimating missing fold changes
- Calculating signal intensity-dependent p-values (sid-scores)

## References
- [UniProt - ID mapping](https://www.uniprot.org/id-mapping)
- [UniProtMapper](https://pypi.org/project/uniprot-id-mapper/)
- [UniProt - Search](https://www.uniprot.org/)
- [HGNC - Multi-symbol checker](https://www.genenames.org/tools/multi-symbol-checker/)