### Example 01

Comparative Analysis of Transcriptomic Response of *Escherichia coli* K-12 MG1655 to Nine Representative Classes of Antibiotics

doi: 10.1128/spectrum.00317-23

**Ref:** Bie L, Zhang M, Wang J, Fang M, Li L, Xu H, Wang M. Comparative Analysis of Transcriptomic Response of Escherichia coli K-12 MG1655 to Nine Representative Classes of Antibiotics. Microbiol Spectr. 2023 Feb 28;11(2):e0031723. doi: 10.1128/spectrum.00317-23. Epub ahead of print. PMID: 36853057; PMCID: PMC10100721.

- This example covers only the IPM antibiotic.

Objective: extract the background gene set and the upregulated genes for functional enrichment analysis.

In [1]:
import pandas as pd

In [5]:
#Get the data
df = pd.read_excel("spectrum.00317-23-s0002.xlsx")

In [30]:
# The first data row holds the real column names; promote it to the header
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
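The header promotion above can be wrapped in a small helper that also re-infers column dtypes, since every column is left as `object` after the first row is promoted. This is only a sketch: the helper name `promote_first_row` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def promote_first_row(raw: pd.DataFrame) -> pd.DataFrame:
    """Use the first data row as column names and drop it from the body."""
    out = raw.iloc[1:].copy()
    out.columns = raw.iloc[0].tolist()
    out = out.reset_index(drop=True)
    # Columns are 'object' dtype after promotion; let pandas re-infer them.
    return out.infer_objects()

# Toy frame that mimics a sheet whose real header sits in the first data row
toy = pd.DataFrame([
    ["Gene_id", "IPM_readcount(IPMvsH2O)"],
    ["b0001", 12.5],
    ["b0002", 0.0],
])
clean = promote_first_row(toy)
print(clean.dtypes)
```

If the sheet layout is stable, `pd.read_excel(path, header=1)` should give the same column names in a single call.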

In [31]:
# Select the columns of interest
columns_to_keep = ['Gene_id',
    'IPM_readcount(IPMvsH2O)',
    'H2O_readcount(IPMvsH2O)',
    'log2FoldChange(IPMvsH2O)',
    'pval(IPMvsH2O)',
    'padj(IPMvsH2O)',
    'significant(IPMvsH2O)',
    'Genename'
]

# Working copy restricted to those columns (dff is used by the cells below)
dff = df[columns_to_keep].copy()

#### Background genes

All expressed genes (read count ≥ 1 in the IPM sample)

In [8]:
#Select expressed genes for background
dff_background = dff[dff['IPM_readcount(IPMvsH2O)'] >= 1]
dff_background = dff_background.dropna()
len(dff_background)

4164
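Note that `dropna()` without arguments removes a row when any column is missing, so an expressed gene with, for example, a missing adjusted p-value is also dropped from the background. The sketch below uses invented toy data to contrast that behaviour with `dropna(subset=...)`. Which of the two is appropriate for the real table is a judgment call that this example does not settle.

```python
import pandas as pd

# Toy data (invented values) with one missing adjusted p-value
toy = pd.DataFrame({
    "Gene_id": ["b0001", "b0002", "b0003"],
    "IPM_readcount(IPMvsH2O)": [10.0, 0.0, 5.0],
    "padj(IPMvsH2O)": [0.01, 0.5, None],
})

# Same expression filter as in the notebook
expressed = toy[toy["IPM_readcount(IPMvsH2O)"] >= 1]

# Strict: drop a row if ANY column is missing
strict = expressed.dropna()

# Lenient: only require the ID and the read count to be present
lenient = expressed.dropna(subset=["Gene_id", "IPM_readcount(IPMvsH2O)"])

print(len(strict), len(lenient))
```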

In [9]:
# Build the background list: unique gene IDs, original order preserved
background = list(dict.fromkeys(dff_background["Gene_id"]))

with open('background_ex1.txt', 'w') as file:
    for gene in background:
        file.write(f"{gene}\n")

In [10]:
print(len(dff_background["Gene_id"]))
with open("background_ex1.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()
print(len(lines))

4164
4164
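The deduplicate, write, and re-read pattern used above recurs for each gene list in this notebook, so it could be factored into one function. The following is a minimal sketch under that assumption. The name `write_unique` and the demo file name are hypothetical.

```python
import tempfile
from pathlib import Path

def write_unique(values, path):
    """Write unique values (first-seen order) one per line; return the list written."""
    unique = list(dict.fromkeys(values))
    Path(path).write_text("".join(f"{v}\n" for v in unique), encoding="utf-8")
    return unique

# Demo in a temporary directory so no real output file is touched
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "background_demo.txt"
    written = write_unique(["b0001", "b0002", "b0001"], out)
    read_back = out.read_text(encoding="utf-8").splitlines()

print(written, read_back)
```

Comparing `written` with `read_back` gives the same sanity check as comparing the two printed lengths, and it additionally confirms that the contents match.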


#### Upregulated genes (genes of interest)

In the data presented in the article, each gene carries a significance label: UP (log2 fold change of at least 1), DOWN (log2 fold change of at most -1), FALSE (the gene meets neither threshold), or NA (data are missing or annotation failed due to an error).
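The labelling rule described above can be sketched as a small function. It applies only the fold-change thresholds stated here. Any additional criterion the article may use (such as an adjusted p-value cutoff) is not modelled, and the function name `classify` is an assumption made for illustration.

```python
import math

def classify(log2fc):
    """Label a gene from its log2 fold change, using only the thresholds stated in the text."""
    if log2fc is None or math.isnan(log2fc):
        return "NA"      # missing value
    if log2fc >= 1:
        return "UP"
    if log2fc <= -1:
        return "DOWN"
    return "FALSE"       # neither threshold met

labels = [classify(x) for x in (2.3, -1.0, 0.4, float("nan"))]
print(labels)
```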

In [32]:
df_up = dff[dff['significant(IPMvsH2O)'] == 'UP'][['Gene_id', 'Genename', 'significant(IPMvsH2O)']]
# df_up

In [12]:
len(df_up[df_up["Genename"] == "--"])

7

In [13]:
# Requires ResPathExplorer: !pip install git+https://github.com/lais-carvalho/ResPathExplorer.git
from ResPathExplorer.mapper_KeggFunctions import get_gene_name_by_kegg_id

In [14]:
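# Gene IDs whose name is the placeholder '--' in the supplementary table
list_gene_ids = df_up[df_up['Genename'] == '--']['Gene_id'].tolist()
dict_g = {}           # Gene_id -> name returned by KEGG ('' if none)
not_found_list = []   # Gene_ids whose lookup raised an error

for id_g in list_gene_ids:
    kegg_id = "eco:" + id_g   # KEGG gene IDs are prefixed with the organism code

    try:
        name = get_gene_name_by_kegg_id(kegg_id)
    except Exception:
        not_found_list.append(id_g)
        continue

    dict_g[id_g] = name if name else ""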

In [15]:
len(dict_g)

7

In [16]:
not_found_list

[]

In [17]:
keys_with_empty_values = [k for k, v in dict_g.items() if v == ""]
keys_with_empty_values

[]
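Because the lookup above depends on a network call, it can be useful to separate the loop from the lookup function so the logic can be checked offline. The sketch below does that. `resolve_names`, `fake_lookup`, and `fake_db` are hypothetical names, and the stand-in only assumes a callable that takes one KEGG ID string, which is how the notebook calls `get_gene_name_by_kegg_id`.

```python
def resolve_names(gene_ids, lookup, organism="eco"):
    """Return (names, failed).

    names  maps gene ID -> name ('' when the lookup returns nothing)
    failed lists gene IDs whose lookup raised an exception
    """
    names, failed = {}, []
    for gene_id in gene_ids:
        try:
            name = lookup(f"{organism}:{gene_id}")
        except Exception:
            failed.append(gene_id)
            continue
        names[gene_id] = name or ""
    return names, failed

# Hypothetical offline stand-in for the real KEGG lookup
fake_db = {"eco:b0001": "thrL", "eco:b0002": None}

def fake_lookup(kegg_id):
    return fake_db[kegg_id]  # raises KeyError for unknown IDs

names, failed = resolve_names(["b0001", "b0002", "b9999"], fake_lookup)
print(names, failed)
```

In the notebook, the real function could be passed in place of `fake_lookup`.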

In [18]:
# Upregulated genes: unique gene IDs, original order preserved
upregulated = list(dict.fromkeys(df_up["Gene_id"]))

with open('upregulated_ex1.txt', 'w') as file:
    for gene in upregulated:
        file.write(f"{gene}\n")

In [19]:
print(len(df_up["Gene_id"]))
with open("upregulated_ex1.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()
print(len(lines))

589
589


#### Get upregulated (genes of interest) gene names

For the CARDAnalysis and VFDBAnalysis, only genes with a registered name in the KEGG database were selected. Genes lacking an official gene name were excluded, even if they had a KEGG gene ID.

In [20]:
df_upf = df_up.copy()
mask = df_upf['Genename'] == '--'
df_upf.loc[mask, 'Genename'] = df_upf.loc[mask, 'Gene_id'].map(dict_g)

In [21]:
mask = df_upf['Genename'].isna()
df_upf.loc[mask, 'Genename'] = df_upf.loc[mask, 'Gene_id']
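The two cells above first replace the `--` placeholder with the KEGG name and then fall back to the gene ID where no name was found. A toy illustration of that pattern follows. The values and the `lookup` dictionary are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gene_id": ["b0001", "b0002", "b0003"],
    "Genename": ["thrL", "--", "--"],
})
lookup = {"b0002": "thrA"}  # b0003 is deliberately absent from the lookup

# Step 1: replace the placeholder with the looked-up name (NaN when absent)
mask = toy["Genename"] == "--"
toy.loc[mask, "Genename"] = toy.loc[mask, "Gene_id"].map(lookup)

# Step 2: fall back to the gene ID wherever the name is still missing
toy["Genename"] = toy["Genename"].fillna(toy["Gene_id"])

print(toy)
```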

In [22]:
# Resolve gene names that occur more than once by looking each gene up in KEGG
for d in df_upf[df_upf['Genename'].duplicated(keep=False)]["Gene_id"]:
    kegg_id = "eco:" + d
    name = get_gene_name_by_kegg_id(kegg_id)
    df_upf.loc[df_upf['Gene_id'] == d, 'Genename'] = name

In [23]:
df_upf[df_upf['Genename'].duplicated(keep=False)]

Empty DataFrame
Columns: [Gene_id, Genename, significant(IPMvsH2O)]
Index: []
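The duplicate check relies on `duplicated(keep=False)`, which flags every member of a duplicated group, not only the later occurrences. A short sketch with invented gene names shows what a non-empty result would look like:

```python
import pandas as pd

names = pd.Series(["yjbE", "insB", "insB", "thrL"])

# keep=False marks all occurrences of a repeated value
dups = names[names.duplicated(keep=False)]

print(dups.tolist())
```

An empty result, as in the notebook output, means every remaining gene name is unique.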


In [24]:
# Unique gene names of the upregulated genes, original order preserved
upregulated_GNames = list(dict.fromkeys(df_upf["Genename"]))

with open('upregulated_GNames_ex1.txt', 'w') as file:
    for gene in upregulated_GNames:
        file.write(f"{gene}\n")

In [25]:
print(len(df_upf["Genename"]))
with open("upregulated_GNames_ex1.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()
print(len(lines))

589
589


In [29]:
df_upf.to_excel("genenames.xlsx")