# Process _P. aeruginosa_ KEGG pathways

This notebook downloads _P. aeruginosa_ KEGG pathways from [here](https://raw.githubusercontent.com/greenelab/adage/master/Node_interpretation/pseudomonas_KEGG_terms.txt), which are the pathways used in the [ADAGE paper](https://msystems.asm.org/content/1/1/e00025-15), and formats them as input to PLIER. PLIER expects pathway data as a gene x pathway matrix, where a value is 1 if the gene is contained within the pathway and 0 otherwise.

This formatted pathway data is used in the [PLIER R script](../generic_expression_patterns_modules/plier_util.R).
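
To make the target layout concrete, here is a minimal sketch of a gene x pathway matrix built from toy inputs. The gene ids and pathway names (`toy_gene_ids`, `toy_pathways`) are made up for illustration and are not taken from the KEGG file.

```python
import pandas as pd

# Hypothetical toy inputs, for illustration only
toy_gene_ids = ["PA0001", "PA0002", "PA0003"]
toy_pathways = {
    "toy_pathway_A": ["PA0001", "PA0003"],
    "toy_pathway_B": ["PA0002"],
}

# Start from an all-zero gene x pathway matrix, then mark members with 1
toy_matrix = pd.DataFrame(0, index=toy_gene_ids, columns=list(toy_pathways))
for pathway_name, member_genes in toy_pathways.items():
    toy_matrix.loc[member_genes, pathway_name] = 1

print(toy_matrix)
```

Rows are genes and columns are pathways, so each column sum gives the number of genes assigned to that pathway.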

In [1]:
import pandas as pd

In [2]:
# Input file locations
# Raw-file URLs pinned to specific commits, so that pandas reads the tab-delimited files directly
expression_filename = "https://raw.githubusercontent.com/greenelab/adage/2575a60804218db7f91402b955371bb60e5b00d6/Data_collection_processing/Pa_compendium_02.22.2014.pcl"
pa_kegg_pathway_filename = "https://raw.githubusercontent.com/greenelab/adage/7a4eda39d360b224268921dc1f2c14b32788ab16/Node_interpretation/pseudomonas_KEGG_terms.txt"

In [3]:
# Load expression data to get gene ids
expression_data = pd.read_csv(expression_filename, sep="\t", index_col=0, header=0)
expression_data.head()

Unnamed: 0,05_PA14000-4-2_5-10-07_S2.CEL,54375-4-05.CEL,AKGlu_plus_nt_7-8-09_s1.CEL,anaerobic_NO3_1.CEL,anaerobic_NO3_2.CEL,control1aerobic_Pae_G1a.CEL,control1_anaerobic_Pae_G1a.CEL,control2aerobic_Pae_G1a.CEL,control2_anaerobic_Pae_G1a.CEL,control3aerobic_Pae_G1a.CEL,...,Van_Delden_Kohler_0311_BAL6+_1.CEL,Van_Delden_Kohler_0311_BAL6_2.CEL,Van_Delden_Kohler_0311_BAL6+_2.CEL,Van_Delden_Kohler_0311_BAL6_3.CEL,Van_Delden_Kohler_0311_BAL6+_3.CEL,Van_Delden_Kohler_0311_PT5_1.CEL,Van_Delden_Kohler_0311_PT5_2.CEL,Van_Delden_Kohler_0311_PT5_3.CEL,WT12935-18-05.CEL,WT12935-4-05.CEL
PA0001,9.62009,9.327996,9.368599,9.083292,8.854901,7.709114,8.977267,7.660104,8.918712,7.841023,...,8.079553,8.195997,7.77628,7.684584,7.909227,7.695878,8.40424,8.299717,9.204897,9.266802
PA0002,10.575783,10.781977,10.596248,9.89705,9.931392,9.838418,10.566976,9.875493,10.36016,10.230597,...,10.266889,10.198538,10.199793,10.140484,10.235538,10.336794,10.605097,10.684966,10.482059,10.712017
PA0003,9.296287,9.169988,9.714517,8.068471,8.167126,8.203564,8.656296,7.638618,8.682695,7.767134,...,8.281785,8.434625,8.210999,8.063184,8.10316,8.598781,8.443917,8.574174,8.743051,8.986237
PA0004,9.870074,10.269239,9.487155,7.310218,7.526595,9.255719,9.829098,9.158528,9.603645,9.379108,...,9.39371,8.709487,9.182668,8.797223,9.144579,9.05061,9.056112,9.314661,9.824223,10.4221
PA0005,8.512268,7.237999,7.804147,6.723634,6.864015,7.350254,8.188652,6.733706,8.349532,7.598661,...,6.579707,7.105952,6.556796,6.50488,6.422697,6.90833,6.988165,7.609215,7.50743,7.284721


In [4]:
# Load Pa KEGG pathway data
pa_kegg_pathway = pd.read_csv(
    pa_kegg_pathway_filename, sep="\t", index_col=0, header=None
)
pa_kegg_pathway.head()

0,1,2
KEGG-Pathway-pae00072: Synthesis and degradation of ketone bodies - Pseudomonas aeruginosa PAO1,10,PA2553;PA2000;PA2011;PA1999;PA2001;PA3925;PA17...
KEGG-Pathway-pae00071: Fatty acid degradation - Pseudomonas aeruginosa PAO1,32,PA5427;PA1821;PA2553;PA1737;PA1027;PA3014;PA25...
KEGG-Pathway-pae00903: Limonene and pinene degradation - Pseudomonas aeruginosa PAO1,9,PA1821;PA1737;PA1027;PA3014;PA3426;PA4899;PA24...
KEGG-Pathway-pae00380: Tryptophan metabolism - Pseudomonas aeruginosa PAO1,27,PA1821;PA3366;PA2080;PA0421;PA2553;PA2081;PA17...
KEGG-Pathway-pae00900: Terpenoid backbone biosynthesis - Pseudomonas aeruginosa PAO1,16,PA2553;PA4044;PA2001;PA3925;PA3803;PA4043;PA17...
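
Judging from the preview above, each row of the pathway file appears to hold a pathway name, a gene count, and a semicolon-separated list of gene ids. The sketch below parses a toy file with that layout and checks that the listed ids agree with the count. The column names `num_genes` and `gene_ids` are assumptions introduced here for readability; the notebook itself keeps the default integer column labels 1 and 2.

```python
import io

import pandas as pd

# Hypothetical two-row file mimicking the three-column, tab-delimited layout
toy_file = io.StringIO(
    "toy_pathway_A\t2\tPA0001;PA0003\n"
    "toy_pathway_B\t1\tPA0002\n"
)

toy_kegg = pd.read_csv(
    toy_file,
    sep="\t",
    header=None,
    names=["pathway", "num_genes", "gene_ids"],  # assumed names, not in the file
    index_col="pathway",
)

# Split the semicolon-separated string into a Python list per pathway
toy_kegg["gene_list"] = toy_kegg["gene_ids"].str.split(";")

# Check that the number of listed ids agrees with the count column
listed_matches_declared = bool(
    (toy_kegg["gene_list"].str.len() == toy_kegg["num_genes"]).all()
)
print(listed_matches_declared)
```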


In [5]:
# Create new dataframe
# gene x pathway

gene_ids = list(expression_data.index)
pathway_names = list(pa_kegg_pathway.index)

out_pathway = pd.DataFrame(data=0, index=gene_ids, columns=pathway_names)

out_pathway

Unnamed: 0,KEGG-Pathway-pae00072: Synthesis and degradation of ketone bodies - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00071: Fatty acid degradation - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00903: Limonene and pinene degradation - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00380: Tryptophan metabolism - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00900: Terpenoid backbone biosynthesis - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00660: C5-Branched dibasic acid metabolism - Pseudomonas aeruginosa PAO1,"KEGG-Pathway-pae00260: Glycine, serine and threonine metabolism - Pseudomonas aeruginosa PAO1",KEGG-Pathway-pae00780: Biotin metabolism - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae02060: Phosphotransferase system (PTS) - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00364: Fluorobenzoate degradation - Pseudomonas aeruginosa PAO1,...,KEGG-Module-M00436: Sulfonate transport system,KEGG-Module-M00300: Putrescine transport system,KEGG-Module-M00200: Putative sorbitol/mannitol transport system,"KEGG-Module-M00360: Aminoacyl-tRNA biosynthesis, prokaryotes",KEGG-Module-M00238: D-Methionine transport system,KEGG-Module-M00208: Glycine betaine/proline transport system,"KEGG-Module-M00176: Assimilatory sulfate reduction, sulfate => H2S","KEGG-Module-M00570: Isoleucine biosynthesis, threonine => 2-oxobutanoate => isoleucine","KEGG-Module-M00572: Pimeloyl-ACP biosynthesis, BioC-BioH pathway, malonyl-ACP => pimeloyl-ACP","KEGG-Module-M00178: Ribosome, bacteria"
PA0001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0003,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0004,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0005,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0006,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0007,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0008,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0009,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0010,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# Fill in dataframe such that
# 1 if gene is contained within the pathway and 0 otherwise

# Set of gene ids for fast membership checks
gene_id_set = set(gene_ids)

for pathway_i in pathway_names:
    gene_ids_contained = pa_kegg_pathway.loc[pathway_i, 2].split(";")
    # Keep only the part of each gene id before the first "."
    gene_ids_contained = [gene_id.split(".")[0] for gene_id in gene_ids_contained]

    # Filter to gene ids that are contained within expression index
    gene_ids_contained = [
        gene_id for gene_id in gene_ids_contained if gene_id in gene_id_set
    ]

    out_pathway.loc[gene_ids_contained, pathway_i] = 1

out_pathway.head()

Unnamed: 0,KEGG-Pathway-pae00072: Synthesis and degradation of ketone bodies - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00071: Fatty acid degradation - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00903: Limonene and pinene degradation - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00380: Tryptophan metabolism - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00900: Terpenoid backbone biosynthesis - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00660: C5-Branched dibasic acid metabolism - Pseudomonas aeruginosa PAO1,"KEGG-Pathway-pae00260: Glycine, serine and threonine metabolism - Pseudomonas aeruginosa PAO1",KEGG-Pathway-pae00780: Biotin metabolism - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae02060: Phosphotransferase system (PTS) - Pseudomonas aeruginosa PAO1,KEGG-Pathway-pae00364: Fluorobenzoate degradation - Pseudomonas aeruginosa PAO1,...,KEGG-Module-M00436: Sulfonate transport system,KEGG-Module-M00300: Putrescine transport system,KEGG-Module-M00200: Putative sorbitol/mannitol transport system,"KEGG-Module-M00360: Aminoacyl-tRNA biosynthesis, prokaryotes",KEGG-Module-M00238: D-Methionine transport system,KEGG-Module-M00208: Glycine betaine/proline transport system,"KEGG-Module-M00176: Assimilatory sulfate reduction, sulfate => H2S","KEGG-Module-M00570: Isoleucine biosynthesis, threonine => 2-oxobutanoate => isoleucine","KEGG-Module-M00572: Pimeloyl-ACP biosynthesis, BioC-BioH pathway, malonyl-ACP => pimeloyl-ACP","KEGG-Module-M00178: Ribosome, bacteria"
PA0001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0003,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0004,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PA0005,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
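
As an alternative to the explicit loop, the same matrix can be sketched with `pd.crosstab` on a long table of (gene, pathway) pairs. This is not the approach used in this notebook, only an illustration on toy data; `toy_gene_ids` and `toy_gene_strings` are hypothetical stand-ins for `gene_ids` and column 2 of `pa_kegg_pathway`.

```python
import pandas as pd

# Hypothetical toy inputs; "PA9999" is absent from the gene list on purpose
toy_gene_ids = ["PA0001", "PA0002", "PA0003"]
toy_gene_strings = pd.Series(
    {"toy_pathway_A": "PA0001;PA0003;PA9999", "toy_pathway_B": "PA0002.1"}
)

# One row per (gene, pathway) pair, keeping the part of the id before "."
pairs = pd.DataFrame(
    [
        (gene_id.split(".")[0], pathway_name)
        for pathway_name, gene_string in toy_gene_strings.items()
        for gene_id in gene_string.split(";")
    ],
    columns=["gene", "pathway"],
)

# Drop genes missing from the expression index and any repeated pairs
pairs = pairs[pairs["gene"].isin(toy_gene_ids)].drop_duplicates()

# Count pairs (0 or 1 after de-duplication), then restore the full
# gene and pathway ordering, filling absent labels with 0
toy_membership = pd.crosstab(pairs["gene"], pairs["pathway"]).reindex(
    index=toy_gene_ids, columns=toy_gene_strings.index, fill_value=0
)
print(toy_membership)
```

The `reindex` step matters: genes or pathways with no surviving pairs would otherwise be missing from the result instead of appearing as all-zero rows or columns.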


In [7]:
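# Check the first pathway: the number of genes marked 1 should equal
# the number of gene ids listed for that pathway
pathway_name_test = pa_kegg_pathway.index[0]
print(pathway_name_test)
num_genes_observed = out_pathway[pathway_name_test].sum()
num_genes_expected = len(pa_kegg_pathway.loc[pathway_name_test, 2].split(";"))
assert (
    num_genes_observed == num_genes_expected
), f"Observed {num_genes_observed} genes, expected {num_genes_expected}"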

KEGG-Pathway-pae00072: Synthesis and degradation of ketone bodies - Pseudomonas aeruginosa PAO1
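
The assertion above covers only the first pathway. A minimal sketch of the same comparison across every pathway follows, using toy data. The helper `find_count_mismatches` is hypothetical, and applying it to this notebook (for example as `find_count_mismatches(out_pathway, pa_kegg_pathway[1])`) assumes that column 1 of the pathway file holds the gene count.

```python
import pandas as pd


def find_count_mismatches(membership, declared_counts):
    """Return pathways whose column sum differs from the declared gene count."""
    observed = membership.sum(axis=0)
    comparison = pd.DataFrame(
        {
            "observed": observed,
            "declared": declared_counts.reindex(observed.index),
        }
    )
    return comparison[comparison["observed"] != comparison["declared"]]


# Hypothetical toy data: toy_pathway_A declares 3 genes but only 2 are marked
toy_membership = pd.DataFrame(
    {"toy_pathway_A": [1, 0, 1], "toy_pathway_B": [0, 1, 0]},
    index=["PA0001", "PA0002", "PA0003"],
)
toy_declared = pd.Series({"toy_pathway_A": 3, "toy_pathway_B": 1})

mismatches = find_count_mismatches(toy_membership, toy_declared)
print(mismatches)
```

A mismatch is not necessarily a bug: genes that are listed for a pathway but absent from the expression index are filtered out when the matrix is filled, which lowers the observed count.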


In [8]:
# Manually check that per-pathway gene counts match the counts in the pathway file (next cell)
out_pathway.sum().head(10)

KEGG-Pathway-pae00072: Synthesis and degradation of ketone bodies - Pseudomonas aeruginosa PAO1    10
KEGG-Pathway-pae00071: Fatty acid degradation - Pseudomonas aeruginosa PAO1                        32
KEGG-Pathway-pae00903: Limonene and pinene degradation - Pseudomonas aeruginosa PAO1                9
KEGG-Pathway-pae00380: Tryptophan metabolism - Pseudomonas aeruginosa PAO1                         27
KEGG-Pathway-pae00900: Terpenoid backbone biosynthesis - Pseudomonas aeruginosa PAO1               16
KEGG-Pathway-pae00660: C5-Branched dibasic acid metabolism - Pseudomonas aeruginosa PAO1           14
KEGG-Pathway-pae00260: Glycine, serine and threonine metabolism - Pseudomonas aeruginosa PAO1      49
KEGG-Pathway-pae00780: Biotin metabolism - Pseudomonas aeruginosa PAO1                             21
KEGG-Pathway-pae02060: Phosphotransferase system (PTS) - Pseudomonas aeruginosa PAO1                5
KEGG-Pathway-pae00364: Fluorobenzoate degradation - Pseudomonas aeruginosa PAO1   

In [9]:
pa_kegg_pathway.head(10)

0,1,2
KEGG-Pathway-pae00072: Synthesis and degradation of ketone bodies - Pseudomonas aeruginosa PAO1,10,PA2553;PA2000;PA2011;PA1999;PA2001;PA3925;PA17...
KEGG-Pathway-pae00071: Fatty acid degradation - Pseudomonas aeruginosa PAO1,32,PA5427;PA1821;PA2553;PA1737;PA1027;PA3014;PA25...
KEGG-Pathway-pae00903: Limonene and pinene degradation - Pseudomonas aeruginosa PAO1,9,PA1821;PA1737;PA1027;PA3014;PA3426;PA4899;PA24...
KEGG-Pathway-pae00380: Tryptophan metabolism - Pseudomonas aeruginosa PAO1,27,PA1821;PA3366;PA2080;PA0421;PA2553;PA2081;PA17...
KEGG-Pathway-pae00900: Terpenoid backbone biosynthesis - Pseudomonas aeruginosa PAO1,16,PA2553;PA4044;PA2001;PA3925;PA3803;PA4043;PA17...
KEGG-Pathway-pae00660: C5-Branched dibasic acid metabolism - Pseudomonas aeruginosa PAO1,14,PA0878;PA0883;PA2035;PA3506;PA4180;PA0882;PA46...
"KEGG-Pathway-pae00260: Glycine, serine and threonine metabolism - Pseudomonas aeruginosa PAO1",49,PA5495;PA1757;PA0421;PA5131;PA0400;PA0399;PA03...
KEGG-Pathway-pae00780: Biotin metabolism - Pseudomonas aeruginosa PAO1,21,PA0500;PA0420;PA0182;PA5524;PA0504;PA1806;PA05...
KEGG-Pathway-pae02060: Phosphotransferase system (PTS) - Pseudomonas aeruginosa PAO1,5,PA3560;PA3562;PA3761;PA4464;PA0337
KEGG-Pathway-pae00364: Fluorobenzoate degradation - Pseudomonas aeruginosa PAO1,7,PA2507;PA2515;PA2516;PA2682;PA2517;PA2518;PA2509


In [10]:
# Save dataframe
# This will be used in plier_util.R

out_pathway.to_csv("pa_kegg_pathway_processed.tsv", sep="\t")
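
Before handing the file to downstream code, it can be worth confirming that it reads back unchanged. The sketch below round-trips a toy matrix through a temporary TSV file; the file name and the toy matrix are hypothetical, and the `read_csv` arguments are one reasonable choice, not necessarily how `plier_util.R` loads the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy gene x pathway matrix
toy_membership = pd.DataFrame(
    {"toy_pathway_A": [1, 0, 1], "toy_pathway_B": [0, 1, 0]},
    index=["PA0001", "PA0002", "PA0003"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_pathway_processed.tsv")
    # Same call pattern as the save step: tab-delimited, gene ids as first column
    toy_membership.to_csv(toy_path, sep="\t")
    # Read back, using the first column as the index and the first row as the header
    reloaded = pd.read_csv(toy_path, sep="\t", index_col=0, header=0)

round_trip_ok = reloaded.equals(toy_membership)
print(round_trip_ok)
```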