# Secretory Pathway Features Retrieval
This notebook contains all the steps required to generate the Sec Recon dataset from Human Gene Symbols. It retrieves identifiers for Human, Mouse, and CHO, followed by subcellular localization and protein complex information. All gathered data is then compiled into the **"Secretory Pathway Recon" Google Sheet**.

### Load packages and define datasets

In [1]:
import numpy as np
import pandas as pd
from Bio import Entrez
import Request_Utilis
from google_sheet import GoogleSheet

Entrez.email = "a.antonakoudis@sartorius.com"

In [2]:
##### ----- Generate datasets from Google Sheet ----- #####

#Credential file
KEY_FILE_PATH = 'credentials.json'

#CHO Network Reconstruction + Recon3D_v3 Google Sheet ID
SPREADSHEET_ID = '1DaAdZlvMYDqb7g31I5dw-ZCZH52Xj_W3FnQMFUzqmiQ'

# Initialize the GoogleSheet object
gsheet_file = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet
sec_recon_sheet = 'SecRecon'
complex_info_sheet = 'Complex Information'

sec_recon = gsheet_file.read_google_sheet(sec_recon_sheet)
complex_info = gsheet_file.read_google_sheet(complex_info_sheet)

# Create a copy of the datasets
sec_recon_dc = sec_recon.copy()
complex_info_dc = complex_info.copy()

## 1. Retrieve Human, CHO, and Mouse Entrez IDs
Here we use the function **get_entrez_id** from the **Request Utilis** module to fetch the Human Entrez IDs, and then use these as input to retrieve the corresponding CHO and Mouse identifiers.

### 1.1 Human Entrez ID

In [3]:
# Update Human Entrez IDs
for i,row in sec_recon_dc.iterrows():
    if pd.isnull(row['HUMAN ENTREZID']) or row['HUMAN ENTREZID'] == '':
        human_entrez = Request_Utilis.get_entrez_id(row['GENE SYMBOL'])
        sec_recon_dc.at[i, 'HUMAN ENTREZID'] = human_entrez

if not sec_recon_dc.equals(sec_recon):
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print("Google Sheet updated.")
else:
    print('Human Entrez IDs are up-to-date')

Human Entrez IDs are up-to-date
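
The implementation of `get_entrez_id` is not shown in this notebook. As a rough illustration only, a symbol-to-ID lookup of this kind could be built on Biopython's `Entrez.esearch`. The names `build_gene_query` and `lookup_entrez_id` are hypothetical, and the choice of search fields is an assumption, not the module's actual logic.

```python
def build_gene_query(symbol, organism="Homo sapiens"):
    """Build an NCBI Gene search term restricted to one organism."""
    return f'{symbol}[Gene Name] AND "{organism}"[Organism]'


def lookup_entrez_id(symbol, email, organism="Homo sapiens"):
    """Return the first matching Gene ID as a string, or None if nothing matches.

    Hypothetical stand-in for Request_Utilis.get_entrez_id; needs network access.
    """
    from Bio import Entrez  # imported lazily so the query builder works offline

    Entrez.email = email
    handle = Entrez.esearch(db="gene", term=build_gene_query(symbol, organism), retmax=1)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    ids = record["IdList"]
    return ids[0] if ids else None
```

Taking only the first hit is a simplification; a symbol that is also an alias of another gene may need stricter matching.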


### 1.2 CHO Entrez IDs from other databases
Before running the **get_gene_ids** function on CHO genes, we first populate as many CHO Entrez IDs as possible from our own ortholog mapping, which was compiled from several databases.

In [4]:
# Map Human IDs to CHO IDs from the "cho2human_mapping" dataset

cho2human_mapping = pd.read_csv("Orthologs/cho2human_mapping.tsv", sep='\t')
cho2human_mapping2 = pd.read_excel("Orthologs/orthologs.xlsx", index_col=0)
cho2human_mapping2['Human GeneID'] = pd.to_numeric(cho2human_mapping2['Human GeneID'], errors='coerce')
cho2human_mapping2['Human GeneID'] = cho2human_mapping2['Human GeneID'].astype('Int64')

cho_id_lookup = dict(zip(cho2human_mapping['HUMAN_ID'], cho2human_mapping['CHO_ID'])) #convert to dict for mapping
cho_id_lookup2 = dict(zip(cho2human_mapping2['Human GeneID'], cho2human_mapping2['CHO GeneID'])) #convert to dict for mapping

for index, row in sec_recon_dc.iterrows():
    if pd.isna(row['CHO ENTREZID']) or row['CHO ENTREZID'] == '':
        try:
            human_id = int(row['HUMAN ENTREZID'])
        except (TypeError, ValueError):
            # Report the raw cell value; human_id is not assigned when int() fails
            print(f"{row['HUMAN ENTREZID']!r} is not a valid Human Entrez ID")
            continue
        # Try the primary mapping first, then fall back to the secondary one
        cho_id = cho_id_lookup.get(human_id)
        if cho_id is None:
            cho_id = cho_id_lookup2.get(human_id)
        if cho_id is not None:
            sec_recon_dc.at[index, 'CHO ENTREZID'] = cho_id

if not sec_recon_dc.equals(sec_recon):
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print("Google Sheet updated on CHO Entrez IDs from cho2human dataset")
else:
    print('CHO Entrez IDs from "cho2human_mapping" dataset are up-to-date')

Google Sheet updated on CHO Entrez IDs from cho2human dataset
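
The fallback logic of this step (primary mapping first, secondary mapping second, curated values never overwritten) can be checked on toy data. The sketch below is a minimal illustration; `fill_cho_ids` and all IDs in the demo are hypothetical and are not part of the notebook or of the ortholog files.

```python
import pandas as pd


def fill_cho_ids(df, *lookups):
    """Fill empty 'CHO ENTREZID' cells from Human -> CHO lookup dicts, tried in order."""
    out = df.copy()
    for idx, row in out.iterrows():
        current = row["CHO ENTREZID"]
        if not (pd.isna(current) or current == ""):
            continue  # keep values that are already present
        try:
            human_id = int(row["HUMAN ENTREZID"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric Human ID: nothing to look up
        for lookup in lookups:
            cho_id = lookup.get(human_id)
            if cho_id is not None:
                out.at[idx, "CHO ENTREZID"] = cho_id
                break  # the first mapping that knows the gene wins
    return out


# Toy data: made-up IDs, used only to exercise the fallback order
demo = pd.DataFrame(
    {"HUMAN ENTREZID": ["25", "27", "", "207"], "CHO ENTREZID": ["", "999", "", ""]},
    dtype=object,
)
primary = {25: 100001}
secondary = {25: 555, 207: 100002}
filled = fill_cho_ids(demo, primary, secondary)
print(filled)
```

Passing the lookups in priority order keeps the precedence explicit, so adding a third mapping source would not require another nested branch.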


### 1.3 CHO and Mouse Entrez IDs 
Finally, we run the **get_gene_ids** function to retrieve the CHO and Mouse Entrez IDs that are still missing, mapping orthologs from the Human Entrez IDs.

In [5]:
## -- CHO Entrez IDs -- ##

for index, row in sec_recon_dc.iterrows():
    if pd.isna(row['CHO ENTREZID']) or row['CHO ENTREZID'] == '':
        human_id = row['HUMAN ENTREZID']
        cho_ortholog_EntrezID = Request_Utilis.get_gene_ids(human_id, '10029')
        if cho_ortholog_EntrezID is not None:
            sec_recon_dc.at[index, 'CHO ENTREZID'] = cho_ortholog_EntrezID
            
if not sec_recon_dc.equals(sec_recon):
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print("Google Sheet updated on CHO Entrez IDs from NIH database")
else:
    print('CHO Entrez IDs from NIH database are up-to-date')

No accession for gene 58477
Google Sheet updated on CHO Entrez IDs from NIH database


In [6]:
## -- Mouse Entrez IDs -- ##

loop_counter = 0
update_threshold = 50

for index, row in sec_recon_dc.iterrows():
    if pd.isna(row['MOUSE ENTREZID']) or row['MOUSE ENTREZID'] == '':
        human_id = row['HUMAN ENTREZID']
        mouse_ortholog_EntrezID = Request_Utilis.get_gene_ids(human_id, '10090')
        if mouse_ortholog_EntrezID is not None:
            sec_recon_dc.at[index, 'MOUSE ENTREZID'] = mouse_ortholog_EntrezID
            loop_counter += 1

        if loop_counter >= update_threshold:
            if not sec_recon_dc.equals(sec_recon):
                gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
                print(f"Google Sheet updated on Mouse Entrez IDs from NIH database after {loop_counter} updates")
            else:
                print('Mouse Entrez IDs from NIH database are up-to-date')
            loop_counter = 0

# Check if there are any remaining updates after exiting the loop
if loop_counter > 0 and not sec_recon_dc.equals(sec_recon):
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print(f"Google Sheet updated on Mouse Entrez IDs from NIH database after {loop_counter} updates")


No accession for gene 6729
Google Sheet updated on Mouse Entrez IDs from NIH database after 1 updates
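
The counter-and-threshold pattern used here (write to the sheet every 50 changes, then once more for the remainder) recurs in most of the cells that follow. One possible way to factor it out is sketched below; `BatchedWriter` is a hypothetical helper, not something provided by `google_sheet` or `Request_Utilis`.

```python
class BatchedWriter:
    """Call `flush` once every `threshold` recorded changes, plus once at the end."""

    def __init__(self, flush, threshold=50):
        self.flush = flush          # callable that receives the number of pending changes
        self.threshold = threshold
        self.pending = 0

    def record(self):
        """Register one change and flush when the threshold is reached."""
        self.pending += 1
        if self.pending >= self.threshold:
            self.close()

    def close(self):
        """Flush whatever is still pending; safe to call when nothing is pending."""
        if self.pending:
            self.flush(self.pending)
            self.pending = 0


# Demo with a list standing in for the Google Sheet
writes = []
writer = BatchedWriter(writes.append, threshold=50)
for _ in range(120):
    writer.record()
writer.close()
print(writes)
```

In the notebook, the `flush` callable would wrap `gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)`, which limits how much work is lost if a long retrieval loop is interrupted.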


## 2. Ensembl IDs
In this section we retrieve Ensembl IDs from the NIH database using the **Gene_Info_from_EntrezID** function from the Request Utilis module. We also retrieve additional identifiers to fill in missing data in our dataset.

### 2.1 Human Ensembl IDs and Extra Identifiers
Here we retrieve the Human Ensembl IDs, gene aliases, and gene names.

In [7]:
# Collect missing information from NIH database

updates = []
for i, gene in sec_recon_dc.iterrows():
    human_entrezID = gene['HUMAN ENTREZID']
    gene_symbol = gene['GENE SYMBOL']
    if gene['ALIAS'] == '' or gene['GENENAME'] == '' or gene['HUMAN ENSEMBL'] == '':
        print(gene_symbol)
        try:
            org, gene_symbol, gene_name, gene_synonyms, gene_ensemble, gene_products = Request_Utilis.Gene_Info_from_EntrezID(human_entrezID)
            updates.append((i, gene_synonyms, gene_name, gene_ensemble))
        except ValueError:
            print(f'No valid Entrez ID for gene {gene_symbol}')

# Apply the updates outside the loop
for i, gene_synonyms, gene_name, gene_ensemble in updates:
    sec_recon_dc.at[i, 'ALIAS'] = gene_synonyms
    sec_recon_dc.at[i, 'GENENAME'] = gene_name
    sec_recon_dc.at[i, 'HUMAN ENSEMBL'] = gene_ensemble
    
sec_recon_dc['ALIAS'] = sec_recon_dc['ALIAS'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
sec_recon_dc['GENENAME'] = sec_recon_dc['GENENAME'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
sec_recon_dc['HUMAN ENSEMBL'] = sec_recon_dc['HUMAN ENSEMBL'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
 
    
if not sec_recon_dc.equals(sec_recon):
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print("Google Sheet updated.")
else:
    print('Human identifiers are up-to-date')

HSP90AA4P
Gene HSP90AA4P has no products
Gene HSP90AA4P has no products
Gene HSP90AA4P has no products
HSP90AA5P
Gene HSP90AA5P has no products
Gene HSP90AA5P has no products
HSP90AB4P
NAT8B
RNF126
SLC35A4
SLC35A5
AUP1
CAPN12
CAPN13
CASP12
FMO1
OST4
RNF121
SCAP
SDF2L1
SORCS2
TRAM1L1
TRAM2
VPS37B
VPS37C
TRIM25
Google Sheet updated.


### 2.2 CHO and Mouse Ensembl IDs and Gene Symbols
Using the same function, we retrieve Ensembl IDs and Gene Symbols for CHO and Mouse.

In [8]:
## -- CHO Ensembl IDs and Gene Symbol -- ##

loop_counter = 0
update_threshold = 50

# Collect missing information for CHO identifiers
for i, gene in sec_recon_dc.iterrows():
    cho_entrezID = str(gene['CHO ENTREZID'])
    if cho_entrezID != '':
        if (pd.isna(gene['CHO ENSEMBL']) or gene['CHO ENSEMBL'] == '') or (pd.isna(gene['CHO GENE SYMBOL']) or gene['CHO GENE SYMBOL'] == ''):
            try:
                org, gene_symbol, gene_name, gene_synonyms, gene_ensemble, gene_products = Request_Utilis.Gene_Info_from_EntrezID(cho_entrezID)
                if (pd.isna(gene['CHO GENE SYMBOL']) or gene['CHO GENE SYMBOL'] == ''):
                    sec_recon_dc.at[i, 'CHO GENE SYMBOL'] = gene_symbol
                if (pd.isna(gene['CHO ENSEMBL']) or gene['CHO ENSEMBL'] == ''):
                    sec_recon_dc.at[i, 'CHO ENSEMBL'] = gene_ensemble
            except ValueError:
                # gene_symbol is not reassigned when the call fails, so report the row's own symbol
                print(f"No valid Entrez ID for gene {gene['GENE SYMBOL']}")
            loop_counter += 1

            if loop_counter >= update_threshold:
                if not sec_recon_dc.equals(sec_recon):
                    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
                    print(f"Google Sheet updated on CHO Ensembl IDs after {loop_counter} updates")
                else:
                    print('CHO Ensembl IDs are up-to-date')
                loop_counter = 0

# Check if there are any remaining updates after exiting the loop
if loop_counter > 0 and not sec_recon_dc.equals(sec_recon):
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print(f"Google Sheet updated on CHO Ensembl IDs after {loop_counter} updates")

No ENSEMBL ID for gene A4gnt
No ENSEMBL ID for gene Abo
No ENSEMBL ID for gene Adrm1
No ENSEMBL ID for gene Agap3
No valid Entrez ID for gene Agap3
No ENSEMBL ID for gene LOC103160092
No ENSEMBL ID for gene Arpc2
No ENSEMBL ID for gene LOC103161689
No ENSEMBL ID for gene B4galt4
No ENSEMBL ID for gene Bag1
No ENSEMBL ID for gene Bet1l
No ENSEMBL ID for gene LOC100760153
No ENSEMBL ID for gene Chpf
No ENSEMBL ID for gene Chpf2
No ENSEMBL ID for gene Chst10
No ENSEMBL ID for gene Chst14
No ENSEMBL ID for gene LOC103162159
No ENSEMBL ID for gene Copz2
No ENSEMBL ID for gene Dnajc5
Gene ID 107977555; An error occurred: HTTP Error 400: Bad Request
No ENSEMBL ID for gene Dnajc5g
No ENSEMBL ID for gene Dpm2
No ENSEMBL ID for gene LOC103163675
No ENSEMBL ID for gene Fut7
No ENSEMBL ID for gene Galnt16
No ENSEMBL ID for gene Gbgt1
No ENSEMBL ID for gene LOC100764057
No ENSEMBL ID for gene Get1
Gene ID 100769304; An error occurred: HTTP Error 400: Bad Request
No ENSEMBL ID for gene Get4
No ENSEM

In [9]:
## -- Mouse Ensembl IDs and Gene Symbol-- ##

loop_counter = 0
update_threshold = 50

# Collect missing information for Mouse identifiers
for i, gene in sec_recon_dc.iterrows():
    mouse_entrezID = str(gene['MOUSE ENTREZID'])
    if mouse_entrezID != '':
        if (pd.isna(gene['MOUSE ENSEMBL']) or gene['MOUSE ENSEMBL'] == '') or (pd.isna(gene['MOUSE GENE SYMBOL']) or gene['MOUSE GENE SYMBOL'] == ''):
            try:
                org, gene_symbol, gene_name, gene_synonyms, gene_ensemble, gene_products = Request_Utilis.Gene_Info_from_EntrezID(mouse_entrezID)
                if (pd.isna(gene['MOUSE GENE SYMBOL']) or gene['MOUSE GENE SYMBOL'] == ''):
                    sec_recon_dc.at[i, 'MOUSE GENE SYMBOL'] = gene_symbol
                if (pd.isna(gene['MOUSE ENSEMBL']) or gene['MOUSE ENSEMBL'] == ''):
                    sec_recon_dc.at[i, 'MOUSE ENSEMBL'] = gene_ensemble
            except ValueError:
                # gene_symbol is not reassigned when the call fails, so report the row's own symbol
                print(f"No valid Entrez ID for gene {gene['GENE SYMBOL']}")
            loop_counter += 1

            if loop_counter >= update_threshold:
                if not sec_recon_dc.equals(sec_recon):
                    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
                    print(f"Google Sheet updated on Mouse Ensembl IDs after {loop_counter} updates")
                else:
                    print('Mouse Ensembl IDs are up-to-date')
                loop_counter = 0

# Check if there are any remaining updates after exiting the loop
if loop_counter > 0 and not sec_recon_dc.equals(sec_recon):
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print(f"Google Sheet updated on Mouse Ensembl IDs after {loop_counter} updates")

Google Sheet updated on Mouse Ensembl IDs after 2 updates


## 3. Uniprot IDs
In this section we retrieve all the Uniprot IDs linked to each gene Entrez ID from NIH database, using the **Gene_Info_from_EntrezID** function from the Request Utilis module.

In [10]:
## -- Human Uniprot IDs -- ##

loop_counter = 0
update_threshold = 50

# Collect missing Human Uniprot IDs
for i, gene in sec_recon_dc.iterrows():
    human_entrezID = str(gene['HUMAN ENTREZID'])
    if human_entrezID != '':
        if (pd.isna(gene['HUMAN UNIPROT']) or gene['HUMAN UNIPROT'] == ''):
            try:
                org, gene_symbol, gene_name, gene_synonyms, gene_ensemble, gene_products = Request_Utilis.Gene_Info_from_EntrezID(human_entrezID)
                unique_uniprotids = list(set([item for sublist in [x[2] for x in gene_products] for item in sublist]))
                sec_recon_dc.at[i, 'HUMAN UNIPROT'] = unique_uniprotids
                print(loop_counter+1, gene_symbol, human_entrezID, unique_uniprotids)
            except ValueError:
                # gene_symbol is not reassigned when the call fails, so report the row's own symbol
                print(f"No valid Entrez ID for gene {gene['GENE SYMBOL']}")
            loop_counter += 1

            if loop_counter >= update_threshold:
                if not sec_recon_dc.equals(sec_recon):
                    sec_recon_dc['HUMAN UNIPROT'] = sec_recon_dc['HUMAN UNIPROT'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
                    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
                    print(f"Google Sheet updated on Human Uniprot IDs after {loop_counter} updates")
                else:
                    print('HUMAN Uniprot IDs are up-to-date')
                loop_counter = 0

# Check if there are any remaining updates after exiting the loop
if loop_counter > 0 and not sec_recon_dc.equals(sec_recon):
    sec_recon_dc['HUMAN UNIPROT'] = sec_recon_dc['HUMAN UNIPROT'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print(f"Google Sheet updated on Human Uniprot IDs after {loop_counter} updates")

1 ALG1L1P 200810 []
Gene HSP90AA4P has no products
Gene HSP90AA4P has no products
Gene HSP90AA4P has no products
2 HSP90AA4P 3323 []
Gene HSP90AA5P has no products
Gene HSP90AA5P has no products
3 HSP90AA5P 730211 []
Gene HSP90AB2P has no products
Gene HSP90AB2P has no products
4 HSP90AB2P 391634 []
Gene HSP90AB3P has no products
Gene HSP90AB3P has no products
5 HSP90AB3P 3327 []
6 HSP90AB4P 664618 []
7 MICALCL 84953 []
8 NAT8B 51471 []
9 SEC1P 653677 []
10 MICALCL 84953 []
11 DNAJB3 414061 []
12 TRIM25 7706 ['Q59GW5', 'Q14258']
Google Sheet updated on Human Uniprot IDs after 12 updates
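
The nested comprehension above flattens the UniProt accessions of every gene product and removes duplicates. A more readable equivalent is sketched below, assuming (as the cell does) that each product is a sequence whose third element is a list of accessions. The helper names and the placeholder transcript and protein labels are hypothetical; the two accessions are the ones printed for TRIM25 above.

```python
def unique_uniprot_ids(gene_products):
    """Collect distinct UniProt accessions from product tuples, in sorted order."""
    ids = {accession for product in gene_products for accession in product[2]}
    return sorted(ids)  # sorting makes the sheet content reproducible between runs


def to_cell(value, sep=", "):
    """Join list values into one string; leave any other value unchanged."""
    return sep.join(value) if isinstance(value, list) else value


# Placeholder product records: only the third element matters here
products = [
    ("transcript_1", "protein_1", ["Q14258", "Q59GW5"]),
    ("transcript_2", "protein_2", ["Q14258"]),
]
ids = unique_uniprot_ids(products)
print(to_cell(ids))
```

Because `list(set(...))` has no guaranteed order, sorting avoids spurious differences when the result is later compared against the sheet with `DataFrame.equals`.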


In [11]:
## -- CHO Uniprot IDs -- ##

loop_counter = 0
update_threshold = 50

# Collect missing information for CHO identifiers
for i, gene in sec_recon_dc.iterrows():
    cho_entrezID = str(gene['CHO ENTREZID'])
    if cho_entrezID != '':
        if (pd.isna(gene['CHO UNIPROT']) or gene['CHO UNIPROT'] == ''):
            try:
                org, gene_symbol, gene_name, gene_synonyms, gene_ensemble, gene_products = Request_Utilis.Gene_Info_from_EntrezID(cho_entrezID)
                unique_uniprotids = list(set([item for sublist in [x[2] for x in gene_products] for item in sublist]))
                sec_recon_dc.at[i, 'CHO UNIPROT'] = unique_uniprotids
                print(loop_counter+1, gene_symbol, cho_entrezID, unique_uniprotids)
            except ValueError:
                # gene_symbol is not reassigned when the call fails, so report the row's own symbol
                print(f"No valid Entrez ID for gene {gene['GENE SYMBOL']}")
            loop_counter += 1

            if loop_counter >= update_threshold:
                if not sec_recon_dc.equals(sec_recon):
                    sec_recon_dc['CHO UNIPROT'] = sec_recon_dc['CHO UNIPROT'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
                    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
                    print(f"Google Sheet updated on CHO Uniprot IDs after {loop_counter} updates")
                else:
                    print('CHO Uniprot IDs are up-to-date')
                loop_counter = 0

# Check if there are any remaining updates after exiting the loop
if loop_counter > 0 and not sec_recon_dc.equals(sec_recon):
    sec_recon_dc['CHO UNIPROT'] = sec_recon_dc['CHO UNIPROT'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print(f"Google Sheet updated on CHO Uniprot IDs after {loop_counter} updates")

No valid Entrez ID for gene TRIM25
No ENSEMBL ID for gene LOC103160092
2 LOC103160092 103160092 []
No ENSEMBL ID for gene LOC103161689
3 LOC103161689 103161689 []
No ENSEMBL ID for gene LOC103162159
4 LOC103162159 103162159 []
No ENSEMBL ID for gene LOC103163675
5 LOC103163675 103163675 []
No ENSEMBL ID for gene Get1
6 Get1 100754848 []
No ENSEMBL ID for gene LOC103162307
7 LOC103162307 103162307 []
No ENSEMBL ID for gene LOC107977047
8 LOC107977047 107977047 []
No ENSEMBL ID for gene LOC103160380
9 LOC103160380 103160380 []
No ENSEMBL ID for gene LOC103163485
10 LOC103163485 103163485 []
No ENSEMBL ID for gene Manba
11 Manba 103158768 []
12 Mir484 102466552 []
Gene ID 100764449; An error occurred: HTTP Error 400: Bad Request
No ENSEMBL ID for gene LOC100764449
Gene LOC100764449 has no products
Gene LOC100764449 has no products
13 LOC100764449 100764449 []
14 LOC100770246 100770246 []
No valid Entrez ID for gene LOC100770246
No ENSEMBL ID for gene LOC103161711
16 LOC103161711 103161711

In [12]:
## -- Mouse Uniprot IDs -- ##

loop_counter = 0
update_threshold = 50

# Collect missing Mouse Uniprot IDs
for i, gene in sec_recon_dc.iterrows():
    mouse_entrezID = str(gene['MOUSE ENTREZID'])
    if mouse_entrezID != '':
        if (pd.isna(gene['MOUSE UNIPROT']) or gene['MOUSE UNIPROT'] == ''):
            try:
                org, gene_symbol, gene_name, gene_synonyms, gene_ensemble, gene_products = Request_Utilis.Gene_Info_from_EntrezID(mouse_entrezID)
                unique_uniprotids = list(set([item for sublist in [x[2] for x in gene_products] for item in sublist]))
                sec_recon_dc.at[i, 'MOUSE UNIPROT'] = unique_uniprotids
                print(loop_counter+1, gene_symbol, mouse_entrezID, unique_uniprotids)
            except ValueError:
                # gene_symbol is not reassigned when the call fails, so report the row's own symbol
                print(f"No valid Entrez ID for gene {gene['GENE SYMBOL']}")
            loop_counter += 1

            if loop_counter >= update_threshold:
                if not sec_recon_dc.equals(sec_recon):
                    sec_recon_dc['MOUSE UNIPROT'] = sec_recon_dc['MOUSE UNIPROT'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
                    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
                    print(f"Google Sheet updated on Mouse Uniprot IDs after {loop_counter} updates")
                else:
                    print('Mouse Uniprot IDs are up-to-date')
                loop_counter = 0

# Check if there are any remaining updates after exiting the loop
if loop_counter > 0 and not sec_recon_dc.equals(sec_recon):
    sec_recon_dc['MOUSE UNIPROT'] = sec_recon_dc['MOUSE UNIPROT'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print(f"Google Sheet updated on Mouse Uniprot IDs after {loop_counter} updates")

Gene Gm29788 has no products
1 Gm29788 101056148 []
2 Trim25 217069 ['Q61510', 'Q3TU94', 'Q5SU71', 'Q5SU70', 'Q3U3X7']
Google Sheet updated on Mouse Uniprot IDs after 2 updates


## 4. Subcellular Localization
The subcellular localization is assigned in two steps. First, we map localizations to the genes using the data provided in the paper "[Global organelle profiling reveals subcellular localization and remodeling at proteome scale](https://www.biorxiv.org/content/10.1101/2023.12.18.572249v1)". Then, for genes that are still unannotated, we use the **get_subcellular_localization** function from the Request Utilis module, with the Human Uniprot IDs retrieved previously as input.

In [13]:
# Generate "subcell_dict" for direct mapping into our dataset
subcell = pd.read_csv("Input/subcellular_localization.csv")
subcell_dict = dict(zip(subcell['Gene_name_canonical'], subcell['consensus graph-based annotation (this study)']))

# Standardization of the subcellular compartments to be merged with the compartments in the Sec Recon dataset
compartment_names = {
    'early_endosome': 'Early Endosome',
    'centrosome': 'Centrosome',
    'ER': 'Endoplasmic Reticulum',
    'mitochondrion': 'Mitochondria',
    'stress_granule': 'Stress Granule',
    'unclassified': None,
    'peroxisome': 'Peroxisome',
    '14-3-3_scaffold': None,
    'recycling_endosome': 'Recycling Endosome',
    'plasma_membrane': 'Plasma Membrane',
    'lysosome': 'Lysosome',
    'translation': 'Translation',
    'actin_cytoskeleton': 'Actin Cytoskeleton',
    'cytosol': 'Cytosol',
    'nucleus': 'Nucleus',
    'p-body': 'P-Body',
    'nucleolus': 'Nucleolus',
    'proteasome': 'Proteasome',
}

# Labels that are not listed above (e.g. 'ERGIC', 'trans-Golgi', 'Golgi') are kept unchanged
for key, value in subcell_dict.items():
    subcell_dict[key] = compartment_names.get(value, value)

# Map subcellular localization to the dataset
sec_recon_dc['Subcellular Localization'] = sec_recon_dc.apply(lambda row: row['Subcellular Localization'] 
                                                              if pd.notna(row['Subcellular Localization']) 
                                                              else subcell_dict.get(row['GENE SYMBOL'], np.nan), axis=1)


# Update the Google Sheet file
if not sec_recon_dc.equals(sec_recon):
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print('Google Sheet updated on Subcellular Localization from "subcellular_localization.csv" dataset')
else:
    print('Subcellular Localizations from "subcellular_localization.csv" dataset are up-to-date')

Google Sheet updated on Subcellular Localization from "subcellular_localization.csv" dataset


In [14]:
#Retrieval of Subcellular localizations from Uniprot

loop_counter = 0
update_threshold = 50

for i, row in sec_recon_dc.iterrows():
    gene = row['GENE SYMBOL']
    # Subcellular compartments are extracted using the Human Uniprot ID
    uniprot_ids = row['HUMAN UNIPROT'].split(", ")
    if (pd.isna(row['Subcellular Localization']) or row['Subcellular Localization'] == ''):
        if uniprot_ids != ['']:
            new_sub_loc = []  # reset for every gene so a previous gene's result is never reused
            for uni_id in uniprot_ids:
                sub_loc = Request_Utilis.get_subcellular_localization(uni_id)
                if sub_loc is not None:
                    for sloc in sub_loc:
                        # Standardization of the subcellular compartments to be included in the Sec Recon dataset
                        match_found = False
                        if sloc.startswith('Recycling endosome'):
                            sloc = 'Recycling Endosome'
                            match_found = True
                        elif sloc.startswith('Late endosome'):
                            sloc = 'Late Endosome'
                            match_found = True
                        elif sloc.startswith('Endosome membrane'):
                            sloc = 'Endosome'
                            match_found = True
                        elif sloc.startswith('Early endosome'):
                            sloc = 'Early Endosome'
                            match_found = True
                        elif sloc.startswith('Endoplasmic Reticulum-Golgi'): 
                            sloc = 'ERGIC'    
                            match_found = True
                        elif sloc.startswith('Endoplasmic reticulum'):
                            sloc = 'Endoplasmic Reticulum'
                            match_found = True
                        elif 'COPII' in sloc:
                            sloc = 'ERGIC'    
                            match_found = True
                        elif 'cytoskeleton' in sloc:
                            sloc = 'Actin Cytoskeleton'
                            match_found = True
                        elif sloc.startswith('Cytoplasm'):
                            sloc = 'Cytoplasm'
                            match_found = True
                        elif 'trans-Golgi' in sloc:
                            sloc = 'trans-Golgi'
                            match_found = True
                        elif 'cis-Golgi' in sloc:
                            sloc = 'cis-Golgi'
                            match_found = True
                        elif sloc.startswith('Golgi apparatus'):
                            sloc = 'Golgi'
                            match_found = True
                        elif 'nucleolus' in sloc:
                            sloc = 'Nucleolus'
                            match_found = True
                        elif sloc.startswith('Nucleus'):
                            sloc = 'Nucleus'
                            match_found = True
                        elif sloc.startswith('Mitochondrion'):
                            sloc = 'Mitochondria'
                            match_found = True
                        elif sloc == 'Membrane' or sloc == 'Cell membrane':
                            sloc = 'Plasma Membrane'
                            match_found = True
                        elif sloc.startswith('Lysosome'):
                            sloc = 'Lysosome'
                            match_found = True
                        elif sloc == 'Secreted':
                            match_found = True
                        if not match_found:
                            continue
                            
                        new_sub_loc.append(sloc)
                            
                    break
            print(f'Subcellular localization of {gene} is {list(set(new_sub_loc))}')
            sec_recon_dc.at[i, 'Subcellular Localization'] = list(set(new_sub_loc))
            loop_counter += 1
            
            # After 50 iterations of the loop, update the Google Sheet file
            if loop_counter >= update_threshold:
                if not sec_recon_dc.equals(sec_recon):
                    sec_recon_dc['Subcellular Localization'] = sec_recon_dc['Subcellular Localization'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
                    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
                    print(f"Google Sheet updated on Subcellular Localizations from Uniprot after {loop_counter} updates")
                else:
                    print('Subcellular Localizations from Uniprot are up-to-date')
                loop_counter = 0

# Check if there are any remaining updates after exiting the loop
if loop_counter > 0 and not sec_recon_dc.equals(sec_recon):
    sec_recon_dc['Subcellular Localization'] = sec_recon_dc['Subcellular Localization'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print(f"Google Sheet updated on Subcellular Localizations from Uniprot after {loop_counter} updates")

Subcellular localization of ART1 is []
Subcellular localization of RYR1 is []
Subcellular localization of RYR2 is []
Subcellular localization of RYR3 is []
Subcellular localization of TBC1D12 is []
Subcellular localization of BIM is []
Subcellular localization of CASQ2 is []
Subcellular localization of CNIH3 is []
Subcellular localization of TRIM25 is ['Cytoplasm']
Google Sheet updated on Subcellular Localizations from Uniprot after 9 updates
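
The long `if`/`elif` chain above is essentially an ordered list of matching rules in which the first match wins. A table-driven version is sketched below with only a subset of the prefix rules copied from the cell. The function names are hypothetical, and the example input strings are illustrative rather than taken from a real UniProt response.

```python
# (prefix, standardized label); order matters because the first match wins
PREFIX_RULES = [
    ("Recycling endosome", "Recycling Endosome"),
    ("Late endosome", "Late Endosome"),
    ("Early endosome", "Early Endosome"),
    ("Endoplasmic reticulum", "Endoplasmic Reticulum"),
    ("Golgi apparatus", "Golgi"),
    ("Nucleus", "Nucleus"),
    ("Mitochondrion", "Mitochondria"),
    ("Lysosome", "Lysosome"),
]


def standardize_location(raw):
    """Return the standardized compartment for one raw label, or None if no rule applies."""
    for prefix, label in PREFIX_RULES:
        if raw.startswith(prefix):
            return label
    return None


def standardize_all(raw_locations):
    """Standardize several raw labels, dropping unmatched ones and duplicates."""
    labels = {standardize_location(raw) for raw in raw_locations}
    labels.discard(None)
    return sorted(labels)


example = [
    "Endoplasmic reticulum membrane",
    "Golgi apparatus membrane",
    "Endoplasmic reticulum lumen",
    "Virion",
]
print(standardize_all(example))
```

The substring rules of the original cell (for example `'COPII' in sloc`) would need a second rule type; they are left out here to keep the sketch short.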


## 5. Complex Information
Here we map protein complex information from the **CORUM** database (https://mips.helmholtz-muenchen.de/corum/#download) onto the Human, Mouse, and CHO Entrez IDs.

In [15]:
# Load complexes df from the CORUM database
complexes = pd.read_excel("Input/CORUM download 2022_09_12.xlsx")

In [16]:
# Create a dictionary with the gene Entrez IDs as keys and all the complexes associated to each ID as values
subunit_complex_dict = {}

for _, row in complexes.iterrows():
    subunit_ids = str(row['subunits(Entrez IDs)']).split(';')
    complex_name = row['ComplexName']
    for subunit_id in subunit_ids:
        if subunit_id.strip():  # Check if the subunit ID is not empty
            if subunit_id not in subunit_complex_dict:
                # Initialize with a list containing the current complex name
                subunit_complex_dict[subunit_id] = [complex_name]
            else:
                # If the complex name is not already in the list for this ID, append it
                if complex_name not in subunit_complex_dict[subunit_id]:
                    subunit_complex_dict[subunit_id].append(complex_name)
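
For reference, the same index can be written more compactly with `collections.defaultdict`. The sketch below uses the two CORUM column names from the cell above on a made-up table; `build_subunit_index` is a hypothetical name, and unlike the cell it strips whitespace from the keys before storing them.

```python
from collections import defaultdict

import pandas as pd


def build_subunit_index(complexes):
    """Map each subunit Entrez ID (as a string) to the list of complexes containing it."""
    index = defaultdict(list)
    for ids, name in zip(complexes["subunits(Entrez IDs)"], complexes["ComplexName"]):
        for subunit_id in str(ids).split(";"):
            subunit_id = subunit_id.strip()
            # Skip empty fragments and avoid listing the same complex twice
            if subunit_id and name not in index[subunit_id]:
                index[subunit_id].append(name)
    return dict(index)


# Made-up complexes and IDs, only to show the resulting structure
toy = pd.DataFrame(
    {
        "ComplexName": ["Complex A", "Complex B", "Complex A"],
        "subunits(Entrez IDs)": ["1;2", "2; 3", "1;2"],
    }
)
index = build_subunit_index(toy)
print(index)
```

The keys are strings, so a later lookup with an integer Entrez ID would silently miss; converting the lookup value with `str(...)` first is a cheap safeguard.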

In [17]:
# Add the complex information to each Entrez ID in the SecRecon dataset

# Initialize an empty set to store unique complex information
unique_complexes = set()

for i,row in sec_recon_dc.iterrows():
    # Add complex information for Human
    human_entrez = row['HUMAN ENTREZID']
    hcmpls = subunit_complex_dict.get(human_entrez, 'nan')
    if hcmpls != 'nan':
        print(f'Human: {human_entrez}, {hcmpls}')
        sec_recon_dc.at[i, 'HUMAN PROTEIN COMPLEX'] = hcmpls
        # Add each item in the list to the set
        unique_complexes.update(hcmpls)
    
    # Add complex information for Mouse
    mouse_entrez = row['MOUSE ENTREZID']
    mcmpls = subunit_complex_dict.get(mouse_entrez, 'nan')
    if mcmpls != 'nan':
        print(f'Mouse: {mouse_entrez}, {mcmpls}')
        sec_recon_dc.at[i, 'MOUSE PROTEIN COMPLEX'] = mcmpls
        # Add each item in the list to the set
        unique_complexes.update(mcmpls)
        
    # Add complex information for CHO
    cho_entrez = row['CHO ENTREZID']
    ccmpls = subunit_complex_dict.get(cho_entrez, 'nan')
    if ccmpls != 'nan':
        print(f'CHO: {cho_entrez}, {ccmpls}')
        sec_recon_dc.at[i, 'CHO PROTEIN COMPLEX'] = ccmpls
        # Add each item in the list to the set
        unique_complexes.update(ccmpls)
        
sec_recon_dc['HUMAN PROTEIN COMPLEX'] = sec_recon_dc['HUMAN PROTEIN COMPLEX'].apply(lambda x: '; '.join(x) if isinstance(x, list) else x)
sec_recon_dc['MOUSE PROTEIN COMPLEX'] = sec_recon_dc['MOUSE PROTEIN COMPLEX'].apply(lambda x: '; '.join(x) if isinstance(x, list) else x)
sec_recon_dc['CHO PROTEIN COMPLEX'] = sec_recon_dc['CHO PROTEIN COMPLEX'].apply(lambda x: '; '.join(x) if isinstance(x, list) else x)

if not sec_recon_dc.equals(sec_recon):
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print("Google Sheet updated on Protein Complexes")
else:
    print('Protein Complexes are up-to-date')

Human: 25, ['BRCA1-cABL complex', 'c-Abl-cortactin-nmMLCK complex', 'c-Abl-CAS-Abi1 complex']
Mouse: 11350, ['Cdk5-c-Abl-Cables complex', 'Abl1-Dok1-Nck1 complex', 'Abl1-Abl2-Crk-Unc119 complex']
Human: 27, ['ABL2-HRAS-RIN1 complex']
Human: 9744, ['ACAP1-CLTC-SLC2A4 complex']
Human: 10097, ['Arp2/3 protein complex']
Human: 10096, ['Arp2/3 protein complex', 'ARP3-WHAMM complex']
Mouse: 347722, ['Agap1-Ap3b1-Ap3m1-Ap3m2-Ap3s2 complex']
Human: 207, ['YBX1-AKT1 complex', 'AR-AKT-APPL complex', 'TCL1(trimer)-AKT1 complex', 'NOS3-HSP90-AKT complex, VEGF induced', 'G alpha-13-Hax-1-cortactin-Rac complex', 'AKT1-FANCI-FANCD2-PHLPP1-PHLPP2-USP1-WDR48 complex', 'AKT1-FOXO1-WDFY2 complex', 'AKT1-NQO2 complex', 'AKT1-APPL1-HDAC3 complex', 'AKT1-HSPB1-MAPK14-MAPKAPK2 complex', 'AKT1-MAPK14-MAPKAPK2 complex', 'AKT1-CDC37-CDK4-HSPA1A-HSP90AA1-NR3C1-PPP5C-RAF1-TSC1-TSC2 complex']
Mouse: 11651, ['MCC complex (p85alpha, p110alpha, Akt,14-3-3theta, beta-catenin)', 'Arrb2-Akt-Ppp2ca complex', 'Akt-Arrb2-P

In [18]:
# Specify the columns to keep
columns_to_keep = ["ComplexName","subunits(Entrez IDs)","GO ID", "GO description","FunCat ID","FunCat description","Complex comment","subunits(Gene name)"]

subset_df = complexes[complexes['ComplexName'].isin(unique_complexes)]
subset_df = subset_df[columns_to_keep]
subset_df.reset_index(drop=True, inplace=True)

subset_df.to_csv("subset_df.csv", index=False)

## 6. Secreted Proteins
In this section we use Supplementary Table 2 from the paper [The Human Secretome](https://www.science.org/doi/10.1126/scisignal.aaz0274) to flag secreted proteins in our reconstruction. Genes whose annotated category is anything other than "Intracellular or membrane-bound" are marked with 1 in the **SecP** column.

In [14]:
# Load secretome df
secretome = pd.read_excel("Input/human_secretome.xlsx")
# Subset of all the genes that are not considered intracellular or membrane-bound (secreted)
secretome = secretome[secretome['Annotated category'] != 'Intracellular or membrane-bound']
# Generate list of all secreted genes
secretome_list = list(secretome['Gene name'])

In [17]:
for i, row in sec_recon_dc.iterrows():
    gene_symbol = row['GENE SYMBOL']
    if gene_symbol in secretome_list:
        sec_recon_dc.loc[i, 'SecP'] = 1
        print(gene_symbol)

ABO
B3GAT1
CHSY1
COPA
FAM20A
FAM20C
FUCA2
GLT1D1
GPLD1
LGALS9
MGAT4A
P4HB
PKDCC
PLOD3
PPIA
PPIB
QSOX1
ST6GAL1
XYLT1
XYLT2


In [18]:
if not sec_recon_dc.equals(sec_recon):
    gsheet_file.update_google_sheet(sec_recon_sheet, sec_recon_dc)
    print("Google Sheet updated on Secreted Proteins Information")
else:
    print('Secreted Proteins Information is up-to-date')

Google Sheet updated on Secreted Proteins Information
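
The row-by-row membership test in this section can also be expressed as a single vectorized operation with `Series.isin`. The sketch below is a minimal illustration on toy data. `flag_secreted` is a hypothetical name, and the category string `"placeholder secreted category"` is invented for the demo; only `"Intracellular or membrane-bound"` comes from the notebook.

```python
import pandas as pd


def flag_secreted(genes, secretome):
    """Return a copy of `genes` with SecP set to 1 for symbols annotated as secreted."""
    secreted = secretome.loc[
        secretome["Annotated category"] != "Intracellular or membrane-bound", "Gene name"
    ]
    out = genes.copy()
    out.loc[out["GENE SYMBOL"].isin(set(secreted)), "SecP"] = 1
    return out


# Toy inputs; the category of each gene here is illustrative only
genes = pd.DataFrame({"GENE SYMBOL": ["P4HB", "SCAP", "PPIB"], "SecP": ["", "", ""]}, dtype=object)
secretome = pd.DataFrame(
    {
        "Gene name": ["P4HB", "PPIB", "SCAP"],
        "Annotated category": [
            "placeholder secreted category",
            "placeholder secreted category",
            "Intracellular or membrane-bound",
        ],
    }
)
flagged = flag_secreted(genes, secretome)
print(flagged)
```

Building a set first keeps the membership test fast, and working on a copy leaves the input frame available for the `equals` comparison used before each sheet update.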
