# Metabolic subsystems enrichment analysis

## Hypergeometric test
The hypergeometric test is based on the hypergeometric distribution, a discrete probability distribution that gives the probability of drawing exactly k specific objects when n objects are selected without replacement from a population of size M that contains N specific objects.

Hypergeometric test:
$$P(X \geq k) = 1 - \mathrm{hypergeom.cdf}(k-1, M, n, N)$$

* k: number of differentially active reactions in a subsystem,
* n: number of differentially active reactions in a model,
* N: number of reactions in a subsystem,
* M: number of reactions in a model.
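
As a minimal sketch of the formula above, the cell below evaluates the test for one made-up subsystem. All four counts are hypothetical values chosen only for illustration; they do not come from Recon3D. It also compares the `1 - cdf` form with `hypergeom.sf`, SciPy's survival function, which gives the same upper-tail probability and is generally more accurate for very small p-values.

```python
from scipy.stats import hypergeom

# Hypothetical counts, for illustration only
M = 1000  # reactions in the model
n = 100   # differentially active reactions in the model
N = 50    # reactions in the subsystem
k = 12    # differentially active reactions in the subsystem

# Upper-tail probability P(X >= k), written as in the formula above
p_cdf = 1 - hypergeom.cdf(k - 1, M, n, N)

# Same probability through the survival function: sf(k - 1) = P(X > k - 1)
p_sf = hypergeom.sf(k - 1, M, n, N)

# The distribution is symmetric in n and N, so swapping them gives the same p-value
p_swapped = hypergeom.sf(k - 1, M, N, n)

print(p_cdf, p_sf, p_swapped)
```

With these counts about 5 differentially active reactions are expected in the subsystem by chance (N * n / M), so observing 12 yields a small p-value.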

In [None]:
import pandas as pd
import numpy as np

from scipy.stats import hypergeom
#import statsmodels.stats.multitest as multi
from helpers import bh

import cobra

import os.path


## Basic settings

In [None]:
folder_enrich = 'enrichment'

## Metabolic subsystems
The SBML representation of Recon3D does not include subsystem data. These data can be obtained from the model in `mat` (MATLAB) format; uncomment the code below to extract them.

In [None]:
#model = cobra.io.read_sbml_model(os.path.join('models', 'Recon3D.xml'))

In [None]:
#model_mat = cobra.io.load_matlab_model(os.path.join('models', 'Recon3D.mat'))

In [None]:
#reactions_subsystems = {}
#for r in model_mat.reactions:
#    reactions_subsystems[r.id] = r.subsystem

In [None]:
#df_subsystems = pd.DataFrame()
#df_subsystems['subsystem'] = reactions_subsystems.values()
#df_subsystems['reaction'] = reactions_subsystems.keys()
#df_subsystems.to_csv(os.path.join('models', 'subsystems.csv'), index=False)

In our case, the subsystem data are already stored in a separate file.

In [None]:
df_subsystems = pd.read_csv(os.path.join('models','subsystems.csv'))
subsystems = df_subsystems.subsystem.unique()
df_subsystems.head()

Differential reaction activities have been calculated in the previous step. Let's read these data:

In [None]:
folder_enrich

In [None]:
df_reactions = pd.read_csv(os.path.join(folder_enrich, 'reactions.csv'))

## Hypergeometric test
We will calculate the p-values for all subsystems:

In [None]:
df_enrichment = pd.DataFrame(columns=["subsystem", "p_up", "p_down", "q_up", "q_down", "enrichment", "p_changed", "q_changed", "changed"])
df_enrichment["subsystem"] = subsystems

M = len(df_reactions) # number of different reactions in pairs of models
n_up = sum(df_reactions.enrichment == 1) # number of upregulated reactions in models
n_down = sum(df_reactions.enrichment == -1)  # number of downregulated reactions in models
n_changed = sum(df_reactions.changed == 1)  # number of changed reactions in models

for subsystem in subsystems:
    subsystem_reactions = df_subsystems.loc[df_subsystems.subsystem == subsystem,'reaction'].values
    df_sub = df_reactions[df_reactions['reaction'].isin(subsystem_reactions)]
        
    # option 1 (used here): count only the subsystem's reactions that remain in df_reactions
    N = len(df_sub) # number of reactions in a subsystem
    # option 2 (alternative): count all reactions of the subsystem in the original model
    # N = len(df_subsystems[df_subsystems.subsystem == subsystem])
    k_up = sum(df_sub.enrichment == 1)# number of upregulated reactions in a subsystem
    k_down = sum(df_sub.enrichment == -1)# number of downregulated reactions in a subsystem
    k_changed = sum(df_sub.changed == 1)# number of changed reactions in a subsystem
    
    if n_up:         
        p_up = 1 - hypergeom.cdf(k_up-1, M, n_up, N)                
    else:
        p_up = 1.0
        
    if n_down:         
        p_down = 1 - hypergeom.cdf(k_down-1, M, n_down, N)                
    else:
        p_down = 1.0
        
    if n_changed:
        p_changed = 1 - hypergeom.cdf(k_changed-1, M, n_changed, N)
    else:
        p_changed = 1.0
        
    df_enrichment.loc[df_enrichment["subsystem"] == subsystem, 'p_up'] = p_up
    df_enrichment.loc[df_enrichment["subsystem"] == subsystem, 'p_down'] = p_down
    df_enrichment.loc[df_enrichment["subsystem"] == subsystem, 'p_changed'] = p_changed
    

    
df_enrichment['q_up'] = bh(df_enrichment['p_up'])
df_enrichment['q_down'] = bh(df_enrichment['p_down'])
df_enrichment['q_changed'] = bh(df_enrichment['p_changed'])

    
df_enrichment.loc[(df_enrichment['q_up']<0.05) & (df_enrichment['q_up']<df_enrichment['q_down']),'enrichment'] = 1
df_enrichment.loc[(df_enrichment['q_down']<0.05) & (df_enrichment['q_down']<=df_enrichment['q_up']),'enrichment'] = -1
df_enrichment.loc[(df_enrichment['q_changed']<0.05),'changed'] = 1

df_enrichment = df_enrichment.fillna(0)
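
The q-values above come from `bh` in the local `helpers` module, which is not shown in this notebook. Its name and the commented-out `statsmodels` import suggest a Benjamini-Hochberg correction, but that is an assumption. Under that assumption, a minimal sketch of such a correction could look as follows; `bh_sketch` is a hypothetical name and is not the actual helper.

```python
import numpy as np

def bh_sketch(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Hypothetical stand-in for helpers.bh, whose implementation is not shown.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # indices that sort p ascending
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i for rank i
    # Make the adjusted values non-decreasing in rank, working from the largest p down
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(scaled, 1.0)             # cap at 1 and restore input order
    return q

q_example = bh_sketch([0.01, 0.04, 0.03, 0.5])
print(q_example)
```

An established alternative is `statsmodels.stats.multitest.multipletests` with `method='fdr_bh'`, which returns the adjusted p-values as its second output.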




In [None]:
df_enrichment

In [None]:
df_enrichment.to_csv(os.path.join(folder_enrich, 'subsystems.csv'), index=False)