# KEGG enrichment of stable genes

This notebook takes the sets of most and least stable genes and performs a KEGG enrichment analysis to determine whether any KEGG pathways are significantly over-represented in either set.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import pandas as pd
import numpy as np
import scipy.stats
import statsmodels.stats.multitest
from scripts import paths, utils, annotations

In [2]:
# Load KEGG pathway data
pao1_pathway_filename = "https://raw.githubusercontent.com/greenelab/adage/7a4eda39d360b224268921dc1f2c14b32788ab16/Node_interpretation/pseudomonas_KEGG_terms.txt"

In [3]:
pao1_pathways = annotations.load_format_KEGG(pao1_pathway_filename)
print(pao1_pathways.shape)
pao1_pathways.head()

(169, 2)


0,1,2
KEGG-Pathway-pae00072: Synthesis and degradation of ketone bodies,10,"{PA4785, PA2003, PA1999, PA3589, PA2000, PA200..."
KEGG-Pathway-pae00071: Fatty acid degradation,32,"{PA4785, PA2815, PA4994, PA5427, PA1525, PA534..."
KEGG-Pathway-pae00903: Limonene and pinene degradation,9,"{PA1737, PA1748, PA4899, PA2475, PA3014, PA333..."
KEGG-Pathway-pae00380: Tryptophan metabolism,27,"{PA4785, PA2081, PA4163, PA3589, PA2001, PA102..."
KEGG-Pathway-pae00900: Terpenoid backbone biosynthesis,16,"{PA4557, PA4785, PA3803, PA3627, PA4669, PA456..."


In [4]:
# Load transcriptional similarity df
# These are the subset of genes that we will consider
pao1_similarity_scores_filename = "pao1_similarity_scores.tsv"

pao1_similarity_scores = pd.read_csv(
    pao1_similarity_scores_filename, sep="\t", header=0, index_col=0
)

In [5]:
pao1_similarity_scores.head()

PAO1 id,PA14 homolog id,Transcriptional similarity across strains,P-value,Name,label,comparison
PA0984,PA14_59230,0.469691,9.403422000000001e-292,,,
PA0680,PA14_55500,0.672592,0.0,hxcV,,
PA0847,PA14_53310,0.767508,0.0,,,
PA0164,PA14_02050,0.703763,0.0,,,
PA1160,PA14_49400,0.228437,2.819471e-64,,,


In [6]:
# Get most and least stable core genes based on label
pao1_most_stable_genes = list(
    pao1_similarity_scores[pao1_similarity_scores["label"] == "most stable"].index
)
pao1_least_stable_genes = list(
    pao1_similarity_scores[pao1_similarity_scores["label"] == "least stable"].index
)

In [7]:
# For each KEGG pathway, perform stat test, save p-values to get corrected p-values, report stats per pathway
def KEGG_enrichment_of_stable_genes(similarity_score_df, gene_list, kegg_df):
    """
    This function performs a KEGG enrichment using most or least stable genes,
    provided in `gene_list`
    """

    all_genes = set(similarity_score_df.index)
    module_genes = set(gene_list)
    not_module_genes = all_genes.difference(module_genes)

    rows = []
    # Find the KEGG pathway with significant over-representation
    for kegg_name in kegg_df.index:
        num_kegg_genes = kegg_df.loc[kegg_name, 1]
        kegg_genes = set(kegg_df.loc[kegg_name, 2])
        not_kegg_genes = all_genes.difference(kegg_genes)

        # Make contingency table
        # `gene_list` holds either the most or the least stable genes
        # ---------------------| in gene list | not in gene list
        # in KEGG pathway      | # genes      | # genes
        # not in KEGG pathway  | # genes      | # genes
        module_kegg_genes = module_genes.intersection(kegg_genes)
        not_module_kegg_genes = not_module_genes.intersection(kegg_genes)
        module_not_kegg_genes = module_genes.intersection(not_kegg_genes)
        not_module_not_kegg_genes = not_module_genes.intersection(not_kegg_genes)

        observed_contingency_table = np.array(
            [
                [len(module_kegg_genes), len(not_module_kegg_genes)],
                [len(module_not_kegg_genes), len(not_module_not_kegg_genes)],
            ]
        )
        # Fisher's exact test
        oddsr, pval = scipy.stats.fisher_exact(
            observed_contingency_table, alternative="greater"
        )
        # chi2 test will not accept 0 counts for the contingency table
        # chi2, pval, dof, expected_counts = scipy.stats.chi2_contingency(
        #    observed_contingency_table
        # )
        # print(oddsr, pval)

        rows.append(
            {
                "enriched KEGG pathway": kegg_name,
                "odds ratio": oddsr,
                "p-value": pval,
                "num shared genes": len(module_kegg_genes),
                "size gene set": len(module_genes),
                "size KEGG pathway": num_kegg_genes,
            }
        )

    enrichment_df = pd.DataFrame(rows)

    # Get corrected pvalues
    (
        reject_,
        pvals_corrected_,
        alphacSidak,
        alphacBonf,
    ) = statsmodels.stats.multitest.multipletests(
        enrichment_df["p-value"].values,
        alpha=0.05,
        method="fdr_bh",
        is_sorted=False,
    )

    enrichment_df["corrected p-value"] = pvals_corrected_

    return enrichment_df
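
The one-sided Fisher's exact test used above can be checked by hand on a small example. The sketch below is a standard-library re-derivation, not the `scipy.stats.fisher_exact` implementation: it computes the p-value as the upper tail of a hypergeometric distribution, which is what `alternative="greater"` corresponds to for a 2x2 table. The gene names, set sizes, and the helper `one_sided_fisher` are hypothetical and chosen only for illustration.

```python
from math import comb


def one_sided_fisher(table):
    """Return (sample odds ratio, P(X >= a)) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n_kegg = a + b  # genes in the pathway (first row total)
    n_module = a + c  # genes in the gene list (first column total)
    n_total = a + b + c + d
    # Upper hypergeometric tail: probability of seeing at least `a`
    # gene-list members among the pathway genes by chance.
    upper = min(n_kegg, n_module)
    pval = sum(
        comb(n_module, k) * comb(n_total - n_module, n_kegg - k)
        for k in range(a, upper + 1)
    ) / comb(n_total, n_kegg)
    odds = (a * d) / (b * c) if b * c > 0 else float("inf")
    return odds, pval


# Hypothetical toy data: 100 genes, 20 in the gene list, 10 in the pathway
all_genes = {f"g{i}" for i in range(100)}
module_genes = {f"g{i}" for i in range(20)}
kegg_genes = {f"g{i}" for i in range(8)} | {"g20", "g21"}

not_module_genes = all_genes - module_genes
not_kegg_genes = all_genes - kegg_genes

# Same layout as the contingency table built in the function above
table = [
    [len(module_genes & kegg_genes), len(not_module_genes & kegg_genes)],
    [len(module_genes & not_kegg_genes), len(not_module_genes & not_kegg_genes)],
]
odds, pval = one_sided_fisher(table)
print(table, odds, pval)
```

Here 8 of the 10 pathway genes fall in the gene list, where about 2 would be expected by chance, so the odds ratio is large and the one-sided p-value is small.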

In [8]:
pao1_most_stable_enrichment = KEGG_enrichment_of_stable_genes(
    pao1_similarity_scores, pao1_most_stable_genes, pao1_pathways
)

In [9]:
pao1_least_stable_enrichment = KEGG_enrichment_of_stable_genes(
    pao1_similarity_scores, pao1_least_stable_genes, pao1_pathways
)

In [10]:
print(pao1_most_stable_enrichment.shape)
pao1_most_stable_enrichment.sort_values(by="corrected p-value").head()

(169, 7)


Unnamed: 0,enriched KEGG pathway,odds ratio,p-value,num shared genes,size gene set,size KEGG pathway,corrected p-value
168,"KEGG-Module-M00178: Ribosome, bacteria",11.948434,8.019412000000001e-17,29,481,56,6.776403e-15
50,KEGG-Pathway-pae03010: Ribosome,11.948434,8.019412000000001e-17,29,481,68,6.776403e-15
61,KEGG-Pathway-pae00970: Aminoacyl-tRNA biosynth...,13.025751,1.037838e-09,15,481,90,5.846488e-08
162,KEGG-Module-M00360: Aminoacyl-tRNA biosynthesi...,14.996914,5.020267e-09,13,481,22,2.121063e-07
38,KEGG-Pathway-pae03060: Protein export,11.583686,4.504319e-06,9,481,18,0.000152246


In [11]:
print(pao1_least_stable_enrichment.shape)
pao1_least_stable_enrichment.sort_values(by="corrected p-value").head()

(169, 7)


Unnamed: 0,enriched KEGG pathway,odds ratio,p-value,num shared genes,size gene set,size KEGG pathway,corrected p-value
106,KEGG-Module-M00452: CusS-CusR (copper toleranc...,14.587678,0.003046,3,214,8,0.514792
92,KEGG-Module-M00193: Putative spermidine/putres...,8.097946,0.010628,3,214,12,0.898051
108,KEGG-Module-M00222: Phosphate transport system,0.0,1.0,0,214,5,1.0
109,KEGG-Module-M00053: Pyrimidine deoxyribonuleot...,0.0,1.0,0,214,9,1.0
110,KEGG-Module-M00050: Guanine ribonucleotide bio...,0.0,1.0,0,214,6,1.0
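
The corrected p-values in this table follow from the Benjamini-Hochberg rule: with 169 pathways tested, the smallest raw p-value is multiplied by 169/1 and the second smallest by 169/2. The sketch below is a plain-Python version of that adjustment, not the `statsmodels` implementation. The two smallest p-values are copied from the table as displayed; setting the remaining 167 values to 1.0 is an assumption, since only the top five rows are shown.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Two smallest p-values from the table; the rest are placeholders (assumption)
raw_pvals = [0.003046, 0.010628] + [1.0] * 167
adjusted_pvals = benjamini_hochberg(raw_pvals)
print(adjusted_pvals[:3])
```

Under that assumption the first two adjusted values come out near 0.51 and 0.90, in line with the table up to rounding of the displayed p-values. A raw p-value of 0.003 therefore does not survive correction across 169 tests.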


In [12]:
# TO DO: Remove 'compare' when we decide which input to use
# Save
pao1_most_stable_enrichment.to_csv("pao1_most_stable_enrichment_compare.tsv", sep="\t")
pao1_least_stable_enrichment.to_csv(
    "pao1_least_stable_enrichment_compare.tsv", sep="\t"
)

**Takeaway:**
* There do not appear to be any significantly enriched KEGG pathways among the least stable genes (the smallest corrected p-value is about 0.51).
    * What does this mean about the role of these least stable core genes? Maybe they are spread across multiple pathways?
    * Based on the dataframe created in the [previous notebook](2_find_KEGG_associations.ipynb), it looks like many least stable core genes are not found in any KEGG pathway, but some are found in many KEGG pathways: https://docs.google.com/spreadsheets/d/1SqEyBvutfbsOTo4afg9GiEzP32ZKplkN1a6MpAQBvZI/edit#gid=1943176121
* The most stable core genes are significantly enriched in KEGG pathways including Ribosome (commonly enriched in humans), secretion systems, and metabolism/Krebs cycle.
    * These KEGG pathways represent some of the essential functions for Pa, so it makes sense that they are enriched amongst the set of stable core genes whose transcriptional relationships don’t vary across strains.
    * Only some metabolic pathways are enriched and not others. Is that interesting?
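
One way to follow up on whether the least stable genes are spread across pathways is to count how many pathways each gene belongs to. The sketch below uses hypothetical pathway and gene names (`toy_pathways`, `toy_least_stable`). In this notebook the equivalent inputs would presumably be the gene sets in `pao1_pathways` and the list `pao1_least_stable_genes`, but that mapping is an assumption.

```python
from collections import Counter

# Hypothetical pathway-to-gene-set mapping
toy_pathways = {
    "pathway_1": {"gene_a", "gene_b", "gene_c"},
    "pathway_2": {"gene_b", "gene_c"},
    "pathway_3": {"gene_c", "gene_d"},
}
# Hypothetical list of least stable genes
toy_least_stable = ["gene_b", "gene_c", "gene_e"]

# Count, for every gene, the number of pathways that contain it
membership = Counter(gene for genes in toy_pathways.values() for gene in genes)

# Genes absent from every pathway get a count of 0
pathways_per_gene = {gene: membership.get(gene, 0) for gene in toy_least_stable}
num_unannotated = sum(1 for n in pathways_per_gene.values() if n == 0)
print(pathways_per_gene, num_unannotated)
```

A large share of zero counts would suggest that the lack of enrichment partly reflects missing annotation rather than genes being scattered across many pathways.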