# Identify generic genes and pathways

Studies have found that some genes are more likely to be differentially expressed even across a wide range of experimental designs. These generic genes and subsequent pathways are not necessarily specific to the biological process being studied but instead represent a more systematic change. 

We have developed an approach, outlined below, to automatically identify these generic genes and pathways. We validated that this simulation approach can identify generic genes and pathways in the analysis notebooks [human_general_analysis](../human_general_analysis/) and [human_cancer_analysis](../human_cancer_analysis/).

This notebook applies this approach to identify generic genes and pathways in the pseudomonas compendium. 

**Steps to identify generic genes:**
1. Simulate N gene expression experiments using [ponyo](https://github.com/ajlee21/ponyo)
2. Perform DE analysis to get association statistics for each gene

   In this case, the DE analysis is based on the experimental design of the template experiment, described in the previous [notebook](1_process_pseudomonas_data.ipynb). The template experiment is [GEOD-33245](https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-33245/?s_sortby=col_8&s_sortorder=ascending), which contains multiple comparisons, including WT vs *crc* mutants and WT vs *cbr* mutants in different conditions. The DE analysis therefore compares WT vs mutant.
3. For each gene, aggregate statistics across all simulated experiments
4. Rank genes based on this aggregated statistic

**Steps to identify generic gene sets (pathways):**
1. Using the same simulated experiments from above, perform gene set enrichment analysis (GSEA). This analysis determines whether the genes in a gene set are clustered towards the beginning or the end of a ranked list of genes, indicating a correlation with change in expression. Genes are ranked by a DE association statistic such as log fold change.
2. For each gene set (pathway), aggregate statistics across all simulated experiments
3. Rank gene sets based on this aggregated statistic
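
As a rough illustration of the aggregate-then-rank steps in both lists, the sketch below summarizes a per-gene statistic across a few toy simulated experiments and ranks genes by the aggregated value. The column names (`gene_id`, `run`, `abs_logFC`), the toy values, and the choice of the median as the ranking statistic are assumptions made for this example. The notebook itself relies on `calc.aggregate_stats` and `calc.rank_genes_or_pathways`, whose internals are not shown here.

```python
import pandas as pd

# Toy DE results from three hypothetical simulated experiments (long format).
simulated_stats = pd.DataFrame({
    "gene_id": ["PA0001", "PA0002", "PA0003"] * 3,
    "run": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "abs_logFC": [0.2, 1.5, 0.7, 0.4, 1.1, 0.9, 0.3, 1.3, 0.5],
})

# Aggregate each gene's statistic across all simulated experiments.
summary = (
    simulated_stats
    .groupby("gene_id")["abs_logFC"]
    .agg(["median", "mean", "std", "count"])
)

# Rank genes so that the largest aggregated statistic gets the highest rank.
summary["rank"] = summary["median"].rank(ascending=True)

print(summary.sort_values("rank", ascending=False))
```

Genes that sit near the top of this ranking across many simulated experiments are the candidates for being generic.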

In [1]:
%load_ext autoreload
%load_ext rpy2.ipython
%autoreload 2

import os
import sys
import glob
import pandas as pd
import numpy as np
import seaborn as sns
import pickle
from rpy2.robjects import pandas2ri
pandas2ri.activate()

from ponyo import utils, simulate_expression_data
from generic_expression_patterns_modules import calc, process

np.random.seed(123)



In [2]:
# Read in config variables
base_dir = os.path.abspath(os.path.join(os.getcwd(),"../"))

config_file = os.path.abspath(os.path.join(base_dir,
                                           "configs",
                                           "config_pseudomonas_33245.tsv"))
params = utils.read_config(config_file)

In [3]:
# Load params
local_dir = params["local_dir"]
dataset_name = params['dataset_name']
NN_architecture = params['NN_architecture']
num_runs = params['num_simulated']
project_id = params['project_id']
metadata_col_id = params['metadata_colname']
processed_template_filename = params['processed_template_filename']
normalized_compendium_filename = params['normalized_compendium_filename']
scaler_filename = params['scaler_filename']
col_to_rank_genes = params['rank_genes_by']
col_to_rank_pathways = params['rank_pathways_by']
statistic = params['gsea_statistic']

# Metadata file that flags which samples to keep or drop
sample_id_metadata_filename = os.path.join(
    base_dir,
    dataset_name,
    "data",
    "metadata",
    f"{project_id}_process_samples.tsv")

# Load pickled scaler
with open(scaler_filename, "rb") as scaler_file:
    scaler = pickle.load(scaler_file)

In [4]:
# Output files
gene_summary_filename = os.path.join(
    base_dir, 
    dataset_name, 
    f"generic_gene_summary_{project_id}.tsv"
)
pathway_summary_filename = os.path.join(
    base_dir, 
    dataset_name, 
    f"generic_pathway_summary_{project_id}.tsv"
)

### Simulate experiments using selected template experiment
Workflow:

1. Get the gene expression data for the selected template experiment
2. Encode this experiment into a latent space using the trained VAE model
3. Linearly shift the encoded template experiment in the latent space
4. Decode the samples. This results in a new experiment
5. Repeat steps 1-4 to get multiple simulated experiments
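
The workflow above can be sketched with a stand-in linear encoder and decoder in place of the trained VAE. Everything in this example is an assumption made for illustration rather than ponyo's implementation: the random projection, the toy data, and the way the new latent location is drawn from the spread of the encoded compendium. The point it shows is that one shift vector is applied to every sample, so the relationships between samples in the template experiment are preserved.

```python
import numpy as np

rng = np.random.default_rng(123)
n_samples, n_genes, latent_dim = 6, 50, 4

# Stand-in linear "encoder" and "decoder"; a trained VAE would replace these.
encoder_weights = rng.normal(size=(n_genes, latent_dim))
decoder_weights = np.linalg.pinv(encoder_weights)


def encode(expression):
    return expression @ encoder_weights


def decode(latent):
    return latent @ decoder_weights


# Hypothetical normalized compendium and template experiment.
compendium = rng.uniform(size=(200, n_genes))
template = rng.uniform(size=(n_samples, n_genes))


def simulate_experiment(template, compendium, rng):
    encoded_template = encode(template)
    encoded_compendium = encode(compendium)

    # Assumption: draw a new centroid from the spread of the encoded compendium.
    new_centroid = rng.normal(
        encoded_compendium.mean(axis=0), encoded_compendium.std(axis=0)
    )

    # Apply the same shift to every sample, then decode back to gene space.
    shift = new_centroid - encoded_template.mean(axis=0)
    return decode(encoded_template + shift)


simulated = [simulate_experiment(template, compendium, rng) for _ in range(3)]
print(simulated[0].shape)
```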

In [5]:
"""# Simulate multiple experiments
# This step creates the following files in "<local_dir>/pseudo_experiment/" directory:
#   - selected_simulated_data_<project_id>_<n>.txt
#   - selected_simulated_encoded_data_<project_id>_<n>.txt
#   - template_normalized_data_<project_id>_test.txt
# in which "<n>" is an integer in the range of [0, num_runs-1]
os.makedirs(os.path.join(local_dir, "pseudo_experiment"), exist_ok=True)
for run_id in range(num_runs):
    simulate_expression_data.shift_template_experiment(
        normalized_compendium_filename,
        project_id,
        metadata_col_id,
        NN_architecture,
        dataset_name,
        scaler,
        local_dir,
        base_dir,
        run_id)"""


In [6]:
"""# This step modifies the following files:
# "<local_dir>/pseudo_experiment/selected_simulated_data_<project_id>_<n>.txt"
if os.path.exists(sample_id_metadata_filename):
    # Read in metadata
    metadata = pd.read_csv(sample_id_metadata_filename, sep='\t', header=0, index_col=0)
    
    # Get samples to be dropped
    sample_ids_to_drop = list(metadata[metadata["processing"] == "drop"].index)

    process.subset_samples(
        sample_ids_to_drop,
        num_runs,
        local_dir,
        project_id
    )"""


### Differential expression analysis

In [7]:
# Load metadata file with grouping assignments for samples
metadata_filename = os.path.join(
    base_dir,
    dataset_name,
    "data",
    "metadata",
    f"{project_id}_groups.tsv")

In [8]:
# Check whether ordering of sample ids is consistent between gene expression data and metadata
process.compare_and_reorder_samples(processed_template_filename, metadata_filename)

sample ids are ordered correctly


In [9]:
# Create subdirectory: "<local_dir>/DE_stats/"
os.makedirs(os.path.join(local_dir, "DE_stats"), exist_ok=True)

In [10]:
%%R -i base_dir -i metadata_filename -i project_id -i processed_template_filename -i local_dir

source(paste0(base_dir, '/generic_expression_patterns_modules/DE_analysis.R'))

get_DE_stats_limma(metadata_filename,
                   project_id, 
                   processed_template_filename,
                   "template",
                   local_dir,
                   "real")



In [11]:
# Check whether ordering of sample ids is consistent between gene expression data and metadata
for i in range(num_runs):
    simulated_data_filename = os.path.join(
        local_dir,
        "pseudo_experiment",
        f"selected_simulated_data_{project_id}_{i}.txt")
        
    process.compare_and_reorder_samples(simulated_data_filename, metadata_filename)

sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly


In [12]:
%%R -i metadata_filename -i project_id -i base_dir -i local_dir -i num_runs -o num_sign_DEGs_simulated

source(paste0(base_dir,'/generic_expression_patterns_modules/DE_analysis.R'))

num_sign_DEGs_simulated <- c()

for (i in 0:(num_runs-1)){
    simulated_data_filename <- paste(
        local_dir, 
        "pseudo_experiment/selected_simulated_data_",
        project_id,
        "_", 
        i,
        ".txt",
        sep=""
    )
    
    run_output <- get_DE_stats_limma(
        metadata_filename,
        project_id, 
        simulated_data_filename,
        "simulated",
        local_dir,
        i
    )
    num_sign_DEGs_simulated <- c(num_sign_DEGs_simulated, run_output)
}






### Rank genes

In [13]:
# Concatenate simulated experiments
simulated_DE_stats_all = process.concat_simulated_data(local_dir, num_runs, project_id, 'DE')

print(simulated_DE_stats_all.shape)

(138725, 7)


In [14]:
# Take absolute value of logFC and t statistic
simulated_DE_stats_all = process.abs_value_stats(simulated_DE_stats_all)

In [15]:
# Aggregate statistics across all simulated experiments
simulated_DE_summary_stats = calc.aggregate_stats(
    col_to_rank_genes,
    simulated_DE_stats_all,
    'DE'
)

In [16]:
# Load association statistics for template experiment
template_DE_stats_filename = os.path.join(
    local_dir,
    "DE_stats",
    "DE_stats_template_data_"+project_id+"_real.txt")

template_DE_stats = pd.read_csv(
    template_DE_stats_filename,
    header=0,
    sep='\t',
    index_col=0)

# Take absolute value of logFC and t statistic
template_DE_stats = process.abs_value_stats(template_DE_stats)

# Rank genes in template experiment
template_DE_stats = calc.rank_genes_or_pathways(
    col_to_rank_genes,
    template_DE_stats,
    True
)

In [17]:
# Rank genes in simulated experiments
simulated_DE_summary_stats = calc.rank_genes_or_pathways(
    col_to_rank_genes,
    simulated_DE_summary_stats,
    False
)

### Gene summary table

In [18]:
summary_gene_ranks = process.generate_summary_table(
    template_DE_stats,
    simulated_DE_summary_stats,
    col_to_rank_genes,
    local_dir
)

summary_gene_ranks.head()

(5549, 13)




| | Gene ID | Adj P-value (Real) | Rank (Real) | Test statistic (Real) | Median adj p-value (simulated) | Rank (simulated) | Mean test statistic (simulated) | Std deviation (simulated) | Number of experiments (simulated) | abs(Z score) |
|---|---|---|---|---|---|---|---|---|---|---|
| PA3600 | PA3600 | 7.068633e-08 | 5549.0 | 6.137566 | 0.658483 | 5386.0 | 0.29825 | 0.181704 | 6 | 32.136455 |
| PA4470 | PA4470 | 1.17603e-08 | 5548.0 | 6.016552 | 0.715044 | 4864.5 | 0.206083 | 0.145293 | 6 | 39.991398 |
| PA3601 | PA3601 | 1.100335e-08 | 5547.0 | 5.836628 | 0.678337 | 4229.0 | 0.171667 | 0.100604 | 6 | 56.309776 |
| PA2426 | PA2426 | 3.642444e-08 | 5546.0 | 5.533709 | 0.62367 | 5355.0 | 0.377667 | 0.264477 | 6 | 19.495214 |
| PA4469 | PA4469 | 3.642444e-08 | 5545.0 | 5.423924 | 0.710824 | 3832.0 | 0.252917 | 0.277097 | 6 | 18.661361 |
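
The `abs(Z score)` values shown above are consistent with the absolute difference between the real test statistic and the mean simulated test statistic, divided by the simulated standard deviation. The sketch below reproduces the first two rows under that assumption. How `process.generate_summary_table` computes the column internally is not shown in this notebook, so treat the formula as inferred from the displayed numbers.

```python
import pandas as pd

# Values copied from the first two rows of the summary table (rounded as displayed).
summary = pd.DataFrame(
    {
        "Test statistic (Real)": [6.137566, 6.016552],
        "Mean test statistic (simulated)": [0.29825, 0.206083],
        "Std deviation (simulated)": [0.181704, 0.145293],
    },
    index=["PA3600", "PA4470"],
)

# Assumed formula: |real - mean(simulated)| / std(simulated)
summary["abs(Z score)"] = (
    (summary["Test statistic (Real)"] - summary["Mean test statistic (simulated)"])
    / summary["Std deviation (simulated)"]
).abs()

print(summary["abs(Z score)"])
```

A large value means the gene changes far more in the template experiment than it typically does across the simulated experiments, which points to an experiment-specific rather than generic signal.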


In [19]:
# Add gene name as column to summary dataframe
summary_gene_ranks = process.add_pseudomonas_gene_name_col(summary_gene_ranks, base_dir)

In [20]:
# Create `gene_summary_filename`
summary_gene_ranks.to_csv(gene_summary_filename, sep='\t')

### GSEA 
**Goal:** To detect modest but coordinated changes in prespecified sets of related genes (i.e., genes that are in the same pathway or share the same GO term).

1. Rank all genes using DE association statistics. In this case, we used p-values to rank genes because ranking by logFC returned an error, which still needs to be investigated.
2. Compute an enrichment score (ES) for each gene set, defined as the maximum deviation from zero of a running-sum statistic computed while walking down the ranked list. The ES therefore indicates whether the genes in a gene set are clustered towards the beginning or the end of the ranked list (indicating a correlation with change in expression).
3. Estimate the statistical significance of the ES with a phenotype-based permutation test, which produces a null distribution for the ES (i.e., scores based on permuted phenotypes).
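
To make the enrichment score concrete, here is a minimal sketch of the running-sum calculation from the original GSEA formulation, with hits weighted by the absolute ranking statistic. The function name, its arguments, and the toy gene names are hypothetical and are not taken from `GSEA_analysis.R`. The permutation step that produces p-values is left out.

```python
import numpy as np


def enrichment_score(ranked_genes, ranked_scores, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    `ranked_genes` and `ranked_scores` must already be sorted from the
    most positive to the most negative ranking statistic.
    """
    ranked_scores = np.asarray(ranked_scores, dtype=float)
    in_set = np.isin(ranked_genes, list(gene_set))
    if not in_set.any() or in_set.all():
        raise ValueError("gene set must cover some, but not all, ranked genes")

    # Step up at each gene in the set, weighted by its ranking statistic.
    hit_weights = np.abs(ranked_scores) ** weight * in_set
    hit_steps = hit_weights / hit_weights.sum()

    # Step down by a constant amount at each gene outside the set.
    miss_steps = (~in_set) / (~in_set).sum()

    running_sum = np.cumsum(hit_steps - miss_steps)

    # The ES is the running-sum value with the largest deviation from zero.
    return running_sum[np.argmax(np.abs(running_sum))]


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(genes, scores, {"g1", "g2"})
es_bottom = enrichment_score(genes, scores, {"g5", "g6"})
print(es_top, es_bottom)
```

A gene set concentrated at the top of the list gets a positive ES, and one concentrated at the bottom gets a negative ES.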

In [21]:
# Load pathway data
adage_kegg_DB_filename = "https://raw.githubusercontent.com/greenelab/adage/master/Node_interpretation/pseudomonas_KEGG_terms.txt"

In [22]:
adage_kegg_DB = pd.read_csv(adage_kegg_DB_filename, sep="\t", header=None)
adage_kegg_DB.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | KEGG-Pathway-pae00072: Synthesis and degradati... | 10 | PA2553;PA2000;PA2011;PA1999;PA2001;PA3925;PA17... |
| 1 | KEGG-Pathway-pae00071: Fatty acid degradation ... | 32 | PA5427;PA1821;PA2553;PA1737;PA1027;PA3014;PA25... |
| 2 | KEGG-Pathway-pae00903: Limonene and pinene deg... | 9 | PA1821;PA1737;PA1027;PA3014;PA3426;PA4899;PA24... |
| 3 | KEGG-Pathway-pae00380: Tryptophan metabolism -... | 27 | PA1821;PA3366;PA2080;PA0421;PA2553;PA2081;PA17... |
| 4 | KEGG-Pathway-pae00900: Terpenoid backbone bios... | 16 | PA2553;PA4044;PA2001;PA3925;PA3803;PA4043;PA17... |


In [23]:
adage_kegg_DB.drop(columns=[1], inplace=True)
adage_kegg_DB.head()

| | 0 | 2 |
|---|---|---|
| 0 | KEGG-Pathway-pae00072: Synthesis and degradati... | PA2553;PA2000;PA2011;PA1999;PA2001;PA3925;PA17... |
| 1 | KEGG-Pathway-pae00071: Fatty acid degradation ... | PA5427;PA1821;PA2553;PA1737;PA1027;PA3014;PA25... |
| 2 | KEGG-Pathway-pae00903: Limonene and pinene deg... | PA1821;PA1737;PA1027;PA3014;PA3426;PA4899;PA24... |
| 3 | KEGG-Pathway-pae00380: Tryptophan metabolism -... | PA1821;PA3366;PA2080;PA0421;PA2553;PA2081;PA17... |
| 4 | KEGG-Pathway-pae00900: Terpenoid backbone bios... | PA2553;PA4044;PA2001;PA3925;PA3803;PA4043;PA17... |


In [24]:
adage_kegg_DB[2] = adage_kegg_DB[2].str.split(";").str.join("\t")

In [25]:
adage_kegg_DB.head()

| | 0 | 2 |
|---|---|---|
| 0 | KEGG-Pathway-pae00072: Synthesis and degradati... | PA2553\tPA2000\tPA2011\tPA1999\tPA2001\tPA3925... |
| 1 | KEGG-Pathway-pae00071: Fatty acid degradation ... | PA5427\tPA1821\tPA2553\tPA1737\tPA1027\tPA3014... |
| 2 | KEGG-Pathway-pae00903: Limonene and pinene deg... | PA1821\tPA1737\tPA1027\tPA3014\tPA3426\tPA4899... |
| 3 | KEGG-Pathway-pae00380: Tryptophan metabolism -... | PA1821\tPA3366\tPA2080\tPA0421\tPA2553\tPA2081... |
| 4 | KEGG-Pathway-pae00900: Terpenoid backbone bios... | PA2553\tPA4044\tPA2001\tPA3925\tPA3803\tPA4043... |


In [26]:
adage_kegg_DB.shape

(169, 2)

In [27]:
import csv
adage_kegg_DB.to_csv("adage_kegg_DB_tmp_filename.gmt", 
                     quoting=csv.QUOTE_NONE,
                     escapechar="\\",
                     index=False,
                     header=False,
                     sep="\t"
                    )

In [28]:
with open("adage_kegg_DB_tmp_filename.gmt", "r") as f:
    temp = f.read()

In [29]:
with open("adage_kegg_DB_process_filename.gmt", "w") as of:
    of.write(temp.replace("\\", ""))

In [30]:
adage_kegg_DB_processed_filename = "adage_kegg_DB_process_filename.gmt"

In [31]:
# The pathway data needs to be formatted as a tab-delimited matrix
# with columns = KEGG pathway name, description, gene ids
# Each gene id is tab-separated
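
One way to produce a file in this layout without writing, re-reading, and stripping escape characters is to build each line by hand. The sketch below does this for a toy annotation table with the same three columns as `adage_kegg_DB` before the count column is dropped. The `write_gmt` helper, the `"NA"` description placeholder, and the output filename are assumptions made for this example. Whether `find_enriched_pathways` expects a description field is not shown in this notebook, so check the parsing in `GSEA_analysis.R` before swapping this in.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the KEGG annotation table:
# column 0 = pathway name, 1 = gene count, 2 = ';'-separated gene ids.
kegg = pd.DataFrame({
    0: ["pathway_A", "pathway_B"],
    1: [3, 2],
    2: ["PA0001;PA0002;PA0003", "PA0004;PA0005"],
})


def write_gmt(annotations, out_filename, description="NA"):
    """Write one line per pathway: name, description, then tab-separated gene ids."""
    with open(out_filename, "w") as out_file:
        for name, genes in zip(annotations[0], annotations[2]):
            fields = [name, description] + genes.split(";")
            out_file.write("\t".join(fields) + "\n")


gmt_filename = os.path.join(tempfile.gettempdir(), "toy_pathways.gmt")
write_gmt(kegg, gmt_filename)

with open(gmt_filename) as gmt_file:
    print(gmt_file.read())
```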

In [32]:
%%R -i base_dir -i template_DE_stats_filename -i adage_kegg_DB_processed_filename -i statistic -o template_enriched_pathways

source(paste0(base_dir, '/generic_expression_patterns_modules/GSEA_analysis.R'))
template_enriched_pathways <- find_enriched_pathways(template_DE_stats_filename, adage_kegg_DB_processed_filename, statistic)




In [33]:
print(template_enriched_pathways.shape)
template_enriched_pathways[template_enriched_pathways['padj'] < 0.05].sort_values(by='padj')

(169, 8)


| | pathway | pval | padj | ES | NES | nMoreExtreme | size | leadingEdge |
|---|---|---|---|---|---|---|---|---|
| 168 | KEGG-Module-M00178: Ribosome, bacteria | 0.000113 | 0.001228 | 0.900619 | 6.085628 | 0.0 | 55 | |
| 113 | KEGG-Module-M00335: Sec (secretion) system | 0.000146 | 0.001228 | 0.730965 | 2.579776 | 0.0 | 10 | |
| 70 | KEGG-Pathway-pae00250: Alanine, aspartate and ... | 0.000122 | 0.001228 | 0.460515 | 2.615437 | 0.0 | 34 | |
| 68 | KEGG-Pathway-pae00400: Phenylalanine, tyrosine... | 0.000125 | 0.001228 | 0.483272 | 2.557964 | 0.0 | 28 | |
| 61 | KEGG-Pathway-pae00970: Aminoacyl-tRNA biosynth... | 0.000126 | 0.001228 | 0.638701 | 3.293676 | 0.0 | 26 | |
| 59 | KEGG-Pathway-pae00240: Pyrimidine metabolism -... | 0.000117 | 0.001228 | 0.513021 | 3.180929 | 0.0 | 43 | |
| 50 | KEGG-Pathway-pae03010: Ribosome - Pseudomonas ... | 0.000113 | 0.001228 | 0.900619 | 6.085628 | 0.0 | 55 | |
| 48 | KEGG-Pathway-pae00330: Arginine and proline me... | 0.000109 | 0.001228 | 0.350134 | 2.550105 | 0.0 | 69 | |
| 145 | KEGG-Module-M00009: Citrate cycle (TCA cycle, ... | 0.000129 | 0.001228 | 0.655828 | 3.176966 | 0.0 | 22 | |
| 46 | KEGG-Pathway-pae00230: Purine metabolism - Pse... | 0.000108 | 0.001228 | 0.405511 | 3.040558 | 0.0 | 76 | |


In [34]:
# Create "<local_dir>/GSEA_stats/" subdirectory
os.makedirs(os.path.join(local_dir, "GSEA_stats"), exist_ok=True)

In [35]:
%%R -i base_dir -i project_id -i local_dir -i adage_kegg_DB_processed_filename -i num_runs -i statistic

source(paste0(base_dir, '/generic_expression_patterns_modules/GSEA_analysis.R'))

# New files created: "<local_dir>/GSEA_stats/GSEA_stats_simulated_data_<project_id>_<n>.txt"
for (i in 0:(num_runs-1)) {
    simulated_DE_stats_file <- paste(local_dir, 
                                     "DE_stats/DE_stats_simulated_data_", 
                                     project_id,
                                     "_", 
                                     i,
                                     ".txt",
                                     sep = "")
    
    out_file <- paste(local_dir, 
                     "GSEA_stats/GSEA_stats_simulated_data_",
                     project_id,
                     "_",
                     i,
                     ".txt", 
                     sep = "")
    
    enriched_pathways <- find_enriched_pathways(simulated_DE_stats_file, adage_kegg_DB_processed_filename, statistic) 
    
    # Remove column with leading edge since it's causing parsing issues
    write.table(as.data.frame(enriched_pathways[1:7]), file = out_file, row.names = FALSE, sep = "\t")
}

### Rank pathways 

In [36]:
# Concatenate simulated experiments
simulated_GSEA_stats_all = process.concat_simulated_data(local_dir, num_runs, project_id, 'GSEA')
simulated_GSEA_stats_all.set_index('pathway', inplace=True)
print(simulated_GSEA_stats_all.shape)

(4225, 6)


In [37]:
# Aggregate statistics across all simulated experiments
simulated_GSEA_summary_stats = calc.aggregate_stats(
    col_to_rank_pathways,
    simulated_GSEA_stats_all,
    'GSEA'
)

simulated_GSEA_summary_stats.head()

| pathway | padj (median) | padj (mean) | padj (std) | padj (count) |
|---|---|---|---|---|
| KEGG-Module-M00002: Glycolysis, core module involving three-carbon compounds | 0.003091 | 0.008388 | 0.01108 | 25 |
| KEGG-Module-M00008: Entner-Doudoroff pathway, glucose-6P => glyceraldehyde-3P + pyruvate | 0.04985 | 0.076868 | 0.074792 | 25 |
| KEGG-Module-M00009: Citrate cycle (TCA cycle, Krebs cycle) | 0.001348 | 0.001369 | 0.00037 | 25 |
| KEGG-Module-M00010: Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate | 0.001865 | 0.003223 | 0.002802 | 25 |
| KEGG-Module-M00011: Citrate cycle, second carbon oxidation, 2-oxoglutarate => oxaloacetate | 0.001348 | 0.001434 | 0.000546 | 25 |


In [38]:
# Load association statistics for template experiment
# (drop the last column, leadingEdge, and index by pathway)
template_GSEA_stats = template_enriched_pathways.iloc[:, :-1].set_index('pathway')

# Rank pathways in template experiment
template_GSEA_stats = calc.rank_genes_or_pathways(
    col_to_rank_pathways,
    template_GSEA_stats,
    True
)

In [39]:
# Rank pathways in simulated experiments
simulated_GSEA_summary_stats = calc.rank_genes_or_pathways(
    col_to_rank_pathways,
    simulated_GSEA_summary_stats,
    False
)

### Pathway summary table

In [40]:
# Create intermediate file: "<local_dir>/gene_summary_table_<col_to_rank_pathways>.tsv"
summary_pathway_ranks = process.generate_summary_table(
    template_GSEA_stats,
    simulated_GSEA_summary_stats,
    col_to_rank_pathways,
    local_dir
)

summary_pathway_ranks.head()

(169, 12)


| pathway | Gene ID | Adj P-value (Real) | Rank (Real) | Test statistic (Real) | Median adj p-value (simulated) | Rank (simulated) | Mean test statistic (simulated) | Std deviation (simulated) | Number of experiments (simulated) | abs(Z score) |
|---|---|---|---|---|---|---|---|---|---|---|
| KEGG-Module-M00178: Ribosome, bacteria | KEGG-Module-M00178: Ribosome, bacteria | 0.001228 | 159.0 | 0.001228 | 0.001348 | 162.5 | 0.001369 | 0.00037 | 25 | 0.379764 |
| KEGG-Pathway-pae03060: Protein export - Pseudomonas aeruginosa PAO1 | KEGG-Pathway-pae03060: Protein export - Pseudo... | 0.001228 | 159.0 | 0.001228 | 0.001348 | 162.5 | 0.001447 | 0.000507 | 25 | 0.432351 |
| KEGG-Pathway-pae00020: Citrate cycle (TCA cycle) - Pseudomonas aeruginosa PAO1 | KEGG-Pathway-pae00020: Citrate cycle (TCA cycl... | 0.001228 | 159.0 | 0.001228 | 0.001348 | 162.5 | 0.001369 | 0.00037 | 25 | 0.379764 |
| KEGG-Module-M00156: Cytochrome c oxidase, cbb3-type | KEGG-Module-M00156: Cytochrome c oxidase, cbb3... | 0.001228 | 159.0 | 0.001228 | 0.003548 | 143.0 | 0.059462 | 0.173213 | 25 | 0.336202 |
| KEGG-Module-M00157: F-type ATPase, prokaryotes and chloroplasts | KEGG-Module-M00157: F-type ATPase, prokaryotes... | 0.001228 | 159.0 | 0.001228 | 0.001348 | 162.5 | 0.001369 | 0.000371 | 25 | 0.381636 |


In [41]:
# Create `pathway_summary_filename`
summary_pathway_ranks.to_csv(pathway_summary_filename, sep='\t')