# PiMP Demo

This notebook demonstrates how to retrieve the data for an analysis from the Polyomics integrated Metabolomics Pipeline (PiMP) and how to run pathway analysis on the retrieved data using PALS.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import pathlib
import pickle

sys.path.append('..')

In [3]:
import pandas as pd

### Load PALS

Before running the import below, please install PALS using pip or pipenv.

`pip install pals-pathway` or `pipenv install pals-pathway`

In [4]:
from pals.pimp_tools import *

### Set API token

You need a token to access data from PiMP. If you already have one, set it in the variable below.

Alternatively, the token can be read from the environment variable *PIMP_API_TOKEN*. In this case, use `get_pimp_API_token_from_env()` to read it from the environment.

In [5]:
token = get_pimp_API_token_from_env()
# token = 'xxx' # set your token here
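
For readers curious about what reading a token from the environment involves, here is a minimal sketch using only the standard library. `read_token_from_env` is a hypothetical helper written for illustration. It is not the PALS implementation of `get_pimp_API_token_from_env()`, which may behave differently when the variable is missing.

```python
import os


def read_token_from_env(var_name="PIMP_API_TOKEN", environ=None):
    """Return the API token stored in an environment variable.

    Hypothetical helper for illustration only; not part of PALS.
    """
    # Accept an explicit mapping so the function can be tried without
    # touching the real environment.
    environ = os.environ if environ is None else environ
    token = environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return token


# Example with an explicit mapping instead of the real environment
print(read_token_from_env(environ={"PIMP_API_TOKEN": "my-secret-token"}))
```

Failing early with a clear error is a design choice of this sketch: a missing token is then reported before any request is sent to the server.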

If you have an account at http://polyomics.mvls.gla.ac.uk/, you can use the sample code below to generate a new token for your user account.

In [6]:
# username = 'xxx' # PiMP username
# password = 'xxx' # PiMP password
# host = PIMP_HOST # server address and port
# token = get_authentication_token(host, username, password)

### Load Data

Please enter the analysis id below. Here we use an example analysis of some beer data.

In [7]:
analysis_id = 1321 # example beer analysis

#### Filtered annotations

The convenience method `download_from_pimp` fetches the MS1 intensities, the **filtered** MS1 annotations, and the experimental design for the analysis. The results are cached in the temp folder, so they do not need to be retrieved again the next time the method is called.

The annotation results have been filtered as follows:
- Only KEGG annotations are kept.
- Only annotations for peaks that have been identified, or that are annotated with adduct type M+H or M-H, are kept.
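
The filtering rule can be sketched as a small predicate. This is an illustration under assumptions, not the code PALS runs: the field names `db`, `adduct` and `identified` are borrowed from the columns returned by `get_ms1_peaks`, the adduct condition is read as "M+H or M-H", and the rows are made up for the example.

```python
# Adduct types kept by the filter, as described in the text
KEPT_ADDUCTS = {"M+H", "M-H"}


def keep_annotation(row, database="kegg"):
    """Return True if an annotation row passes the filter described above."""
    # Rule 1: only annotations from the requested database
    if row["db"] != database:
        return False
    # Rule 2: the peak is identified, or its adduct is M+H or M-H
    return bool(row["identified"]) or row["adduct"] in KEPT_ADDUCTS


# Made-up rows for illustration
rows = [
    {"pid": 1, "db": "kegg", "adduct": "M+H", "identified": False},
    {"pid": 2, "db": "kegg", "adduct": "M+ACN+H", "identified": False},
    {"pid": 3, "db": "kegg", "adduct": "M+ACN+H", "identified": True},
    {"pid": 4, "db": "hmdb", "adduct": "M-H", "identified": False},
]

kept = [r["pid"] for r in rows if keep_annotation(r)]
print(kept)  # [1, 3]
```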

In [8]:
int_df, annotation_df, experimental_design = download_from_pimp(token, PIMP_HOST, analysis_id, 'kegg')

2020-06-09 22:07:59.797 | DEBUG    | pals.pimp_tools:download_from_pimp:119 - Trying to load data from temp file: C:\Users\joewa\AppData\Local\Temp\pimp_analysis_1321.p
2020-06-09 22:07:59.799 | DEBUG    | pals.pimp_tools:download_from_pimp:123 - Retrieving data for analysis 1321 from PiMP
2020-06-09 22:08:01.684 | DEBUG    | pals.pimp_tools:get_data:33 - http://polyomics.mvls.gla.ac.uk/export/get_ms1_intensities?analysis_id=1321
2020-06-09 22:08:21.265 | DEBUG    | pals.pimp_tools:get_data:33 - http://polyomics.mvls.gla.ac.uk/export/get_ms1_peaks?analysis_id=1321
2020-06-09 22:08:34.335 | DEBUG    | pals.pimp_tools:get_data:33 - http://polyomics.mvls.gla.ac.uk/export/get_experimental_design?analysis_id=1321
2020-06-09 22:08:34.336 | DEBUG    | pals.pimp_tools:download_from_pimp:132 - Caching analysis data for next use
2020-06-09 22:08:34.337 | DEBUG    | pals.common:save_obj:96 - Saving <class 'dict'> to C:\Users\joewa\AppData\Local\Temp\pimp_analysis_1321.p
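
The log above shows the caching pattern: try the temp file first, otherwise retrieve the data and save it for next time. A minimal sketch of that pattern follows. `load_or_fetch` and `fake_fetch` are hypothetical names; only the file name pattern `pimp_analysis_<id>.p` is taken from the log, and the real PALS code may differ in its details.

```python
import os
import pickle
import tempfile


def load_or_fetch(analysis_id, fetch, cache_dir=None):
    """Load cached data for an analysis, or fetch and cache it."""
    cache_dir = tempfile.gettempdir() if cache_dir is None else cache_dir
    path = os.path.join(cache_dir, f"pimp_analysis_{analysis_id}.p")
    if os.path.exists(path):
        # Cache hit: no network access needed
        with open(path, "rb") as f:
            return pickle.load(f)
    # Cache miss: retrieve the data, then store it for the next call
    data = fetch(analysis_id)
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data


calls = []


def fake_fetch(analysis_id):
    """Stand-in for a real download; records how often it is called."""
    calls.append(analysis_id)
    return {"analysis_id": analysis_id}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch(1321, fake_fetch, cache_dir=tmp)
    second = load_or_fetch(1321, fake_fetch, cache_dir=tmp)

# The second call is served from the cache, so fake_fetch runs only once
print(first == second, len(calls))
```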


In [9]:
int_df

row_id,Beer_1_full1.mzXML,Beer_1_full2.mzXML,Beer_1_full3.mzXML,Beer_2_full1.mzXML,Beer_2_full2.mzXML,Beer_2_full3.mzXML,Beer_3_full1.mzXML,Beer_3_full2.mzXML,Beer_3_full3.mzXML,Beer_4_full1.mzXML,Beer_4_full2.mzXML,Beer_4_full3.mzXML
3033929,2.235291e+09,2.000478e+09,2.170697e+09,2.242760e+09,2.279882e+09,1.959480e+09,2.079356e+09,2.110473e+09,2.243653e+09,1.817065e+09,1.746443e+09,1.779827e+09
3033930,4.433491e+07,4.287387e+07,4.894853e+07,4.760448e+07,4.217280e+07,3.908452e+07,3.825778e+07,3.770192e+07,4.087189e+07,3.330477e+07,3.153630e+07,3.102410e+07
3033931,1.723985e+09,1.764235e+09,1.585143e+09,1.543961e+09,1.579320e+09,1.555666e+09,1.698130e+09,1.481824e+09,1.508645e+09,1.642510e+09,1.723919e+09,1.697806e+09
3033932,6.254237e+08,6.503417e+08,5.914975e+08,4.635929e+08,4.298382e+08,4.038747e+08,4.292837e+08,3.708761e+08,4.778932e+08,3.903165e+08,4.080995e+08,4.309892e+08
3033933,1.075022e+09,9.293474e+08,1.092635e+09,1.130720e+09,1.118146e+09,1.192834e+09,1.231442e+09,1.262046e+09,1.460653e+09,1.009838e+09,9.085111e+08,9.967176e+08
...,...,...,...,...,...,...,...,...,...,...,...,...
3041299,1.431211e+04,6.565678e+03,1.478325e+04,1.620252e+04,1.748920e+04,1.284756e+04,3.306687e+04,2.476216e+04,2.869417e+04,2.231166e+04,2.164017e+04,2.727751e+04
3041300,2.273721e+04,2.905976e+04,2.756565e+04,3.080164e+04,2.427240e+04,2.517871e+04,2.718543e+04,2.905361e+04,3.170420e+04,2.597168e+04,3.066904e+04,2.884984e+04
3041301,1.760107e+04,2.674373e+04,2.165889e+04,2.121242e+04,1.737357e+04,2.039137e+04,2.296079e+04,2.001743e+04,2.664124e+04,1.791359e+04,1.642459e+04,2.349759e+04
3041302,2.228221e+04,1.860160e+04,1.578596e+04,1.222031e+04,1.554959e+04,1.674346e+04,1.368053e+04,1.475503e+04,1.730055e+04,1.529614e+04,1.682113e+04,1.909567e+04


In [10]:
annotation_df

row_id,entity_id
3033929,C00148
3036581,C00148
3036855,C00148
3038249,C00148
3033929,C00163
...,...
3040926,C20522
3040929,C20582
3041077,C20499
3041172,C20504


In [11]:
experimental_design

{'comparisons': [{'case': 'beer1', 'control': 'beer2', 'name': 'beer1/beer2'},
  {'case': 'beer3', 'control': 'beer4', 'name': 'beer3/beer4'}],
 'groups': {'beer4': ['Beer_4_full3.mzXML',
   'Beer_4_full2.mzXML',
   'Beer_4_full1.mzXML'],
  'beer3': ['Beer_3_full3.mzXML', 'Beer_3_full2.mzXML', 'Beer_3_full1.mzXML'],
  'beer2': ['Beer_2_full3.mzXML', 'Beer_2_full1.mzXML', 'Beer_2_full2.mzXML'],
  'beer1': ['Beer_1_full2.mzXML', 'Beer_1_full1.mzXML', 'Beer_1_full3.mzXML']}}
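
The experimental design is a plain dictionary, so the sample files belonging to a comparison can be looked up directly. The sketch below copies the dictionary printed above. `samples_for_comparison` is a hypothetical helper, not a PALS function.

```python
experimental_design = {
    "comparisons": [
        {"case": "beer1", "control": "beer2", "name": "beer1/beer2"},
        {"case": "beer3", "control": "beer4", "name": "beer3/beer4"},
    ],
    "groups": {
        "beer4": ["Beer_4_full3.mzXML", "Beer_4_full2.mzXML", "Beer_4_full1.mzXML"],
        "beer3": ["Beer_3_full3.mzXML", "Beer_3_full2.mzXML", "Beer_3_full1.mzXML"],
        "beer2": ["Beer_2_full3.mzXML", "Beer_2_full1.mzXML", "Beer_2_full2.mzXML"],
        "beer1": ["Beer_1_full2.mzXML", "Beer_1_full1.mzXML", "Beer_1_full3.mzXML"],
    },
}


def samples_for_comparison(design, comparison_name):
    """Return (case_samples, control_samples) for a named comparison."""
    for comparison in design["comparisons"]:
        if comparison["name"] == comparison_name:
            groups = design["groups"]
            # Sort so the result does not depend on the stored order
            return sorted(groups[comparison["case"]]), sorted(groups[comparison["control"]])
    raise KeyError(comparison_name)


case_cols, control_cols = samples_for_comparison(experimental_design, "beer1/beer2")
print(case_cols)
print(control_cols)
```

The returned names match the column names of `int_df`, so they could be used to select the case and control columns of the intensity dataframe.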

#### Get all information associated with MS1 peaks

The following code demonstrates how to use `get_ms1_peaks` to load all the information associated with MS1 peaks. This includes the polarity, formula, adducts, InChIKeys, and all the annotations obtained by matching to compound databases and fragmentation databases.

In [15]:
df = get_ms1_peaks(token, PIMP_HOST, analysis_id)

2020-06-09 22:11:57.888 | DEBUG    | pals.pimp_tools:get_data:33 - http://polyomics.mvls.gla.ac.uk/export/get_ms1_peaks?analysis_id=1321


In [17]:
df.set_index('pid') # display the dataframe indexed by 'pid'; this returns a new dataframe and leaves df unchanged

pid,sec_id,mass,rt,polarity,cmpd_id,formula,adduct,identified,rc_id,compound,db,identifier,frank_annot,inchikey
3033929,1,116.070550,577.986827,positive,1,C5H9NO2,M+H,False,15367695,Pterolactam,hmdb,HMDB34208,"{'frank_cmpd_name': 'L-Proline', 'inchikey': N...",VULIHENHKGDFAB-UHFFFAOYSA-N
3036581,2653,157.097190,469.781817,positive,1,C5H9NO2,M+ACN+H,False,15390525,Pterolactam,hmdb,HMDB34208,,VULIHENHKGDFAB-UHFFFAOYSA-N
3036855,2927,157.097154,569.557760,positive,1,C5H9NO2,M+ACN+H,False,15392567,Pterolactam,hmdb,HMDB34208,,VULIHENHKGDFAB-UHFFFAOYSA-N
3038249,4321,114.055969,577.210902,negative,1,C5H9NO2,M-H,False,15402468,Pterolactam,hmdb,HMDB34208,,VULIHENHKGDFAB-UHFFFAOYSA-N
3033929,1,116.070550,577.986827,positive,2,C5H9NO2,M+H,True,15367696,L-Proline,hmdb,HMDB00162,"{'frank_cmpd_name': 'L-Proline', 'inchikey': N...",ONIBWKKTOPOVIA-BYPYZUCNSA-N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3040926,6998,139.051335,413.253209,negative,16575,C6H8N2O2,M-H,False,15415837,Dihydrourocanate,kegg,C20522,,
3040929,7001,215.056411,309.828219,negative,16576,C9H12O6,M-H,False,15415838,cis-(Homo)3-aconitate,kegg,C20582,,
3041077,7149,371.114080,282.045021,negative,16577,C20H20O7,M-H,False,15416317,(1'S)-Averantin,kegg,C20499,,
3041172,7244,385.093374,258.330449,negative,16578,C20H18O8,M-H,False,15416475,Versicolorone,kegg,C20504,,


#### Load fragmentation data

The following code shows how to use `get_ms2_peaks` to load MS2 fragmentation data for the same beer analysis. The results can be returned in two different formats. When `as_dataframe` is set to False, we get a dictionary with two keys: `spectra` and `num_spectra`.

In [18]:
frags = get_ms2_peaks(token, PIMP_HOST, analysis_id, as_dataframe=False)

2020-06-09 22:14:02.746 | DEBUG    | pals.pimp_tools:get_data:33 - http://polyomics.mvls.gla.ac.uk/export/get_ms2_peaks?analysis_id=1321&as_dataframe=False


In [20]:
frags.keys()

dict_keys(['spectra', 'num_spectra'])

Accessing the returned dictionary with the key `spectra` will return a list of fragmentation (MS2) spectra. Below there are 1801 of them. 

In [28]:
spectra = frags['spectra']
len(spectra)

1801

Print the first spectrum. Each spectrum is a list of three elements:

- The first entry is an identifier of the form `peak_<fragset_id>_<ms1_id>`
- The second entry is the m/z of the MS1 peak having `<ms1_id>`
- The last entry is a list of fragment peaks, where each entry is itself a list of `[<ms2_mz>, <ms2_intensity>]`

In [29]:
spectra[0]

['peak_780_4715862', 121.0720052098, [[61.0397377014, 348610.9375]]]
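
A spectrum in this format can be unpacked into named fields. The sketch below uses the spectrum printed above. `parse_spectrum` is a hypothetical helper, and it assumes the identifier always has exactly the three underscore-separated parts described in the list.

```python
def parse_spectrum(spectrum):
    """Split a [identifier, ms1_mz, fragments] list into named fields."""
    identifier, ms1_mz, fragments = spectrum
    # The identifier has the form peak_<fragset_id>_<ms1_id>
    _, fragset_id, ms1_id = identifier.split("_")
    return {
        "fragset_id": int(fragset_id),
        "ms1_id": int(ms1_id),
        "ms1_mz": ms1_mz,
        "fragments": [tuple(fragment) for fragment in fragments],
    }


first = ['peak_780_4715862', 121.0720052098, [[61.0397377014, 348610.9375]]]
parsed = parse_spectrum(first)
print(parsed["fragset_id"], parsed["ms1_id"], len(parsed["fragments"]))
```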

Print the second spectrum.

In [30]:
spectra[1]

['peak_780_4715863',
 434.1868723338,
 [[53.0387763977, 5160.0629882812],
  [57.0336952209, 50454.26953125],
  [61.0284957886, 137408.390625],
  [69.0337753296, 18039.81640625],
  [70.0651550293, 29612.14453125],
  [73.0284881592, 46669.90234375],
  [75.0439376831, 23252.744140625],
  [81.0332489014, 8270.5234375],
  [85.0283050537, 115575.171875],
  [90.0551300049, 21890.22265625],
  [91.0391616821, 92297.5078125],
  [93.0544128418, 109123.015625],
  [97.0283126831, 66010.9296875],
  [127.0385513306, 43543.83984375],
  [145.0498199463, 65391.8671875],
  [163.0604553223, 118952.3125]]]
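
A common first step with a fragment list like this is to find the base peak (the most intense fragment) and express the other intensities relative to it. The sketch below treats the second value of each pair as an intensity, which matches the `ms2_intensity` column returned when `as_dataframe=True`. The helper names are hypothetical, and only four of the fragments printed above are used to keep the example short.

```python
def base_peak(fragments):
    """Return the [mz, intensity] pair with the highest intensity."""
    return max(fragments, key=lambda fragment: fragment[1])


def relative_intensities(fragments):
    """Scale intensities so that the base peak is 100."""
    _, top = base_peak(fragments)
    return [(mz, 100.0 * intensity / top) for mz, intensity in fragments]


# A subset of the fragments of the second spectrum
fragments = [
    [57.0336952209, 50454.26953125],
    [61.0284957886, 137408.390625],
    [85.0283050537, 115575.171875],
    [163.0604553223, 118952.3125],
]

mz, intensity = base_peak(fragments)
print(mz, intensity)
for fragment_mz, relative in relative_intensities(fragments):
    print(f"{fragment_mz:.4f} {relative:.1f}")
```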

Alternatively, `get_ms2_peaks` can return the data as a dataframe with one row per fragment peak if `as_dataframe` is set to True.

In [24]:
frags_df = get_ms2_peaks(token, PIMP_HOST, analysis_id, as_dataframe=True)

2020-06-09 22:15:12.438 | DEBUG    | pals.pimp_tools:get_data:33 - http://polyomics.mvls.gla.ac.uk/export/get_ms2_peaks?analysis_id=1321&as_dataframe=True


In [25]:
frags_df

Unnamed: 0,fragset_id,ms1_id,ms1_mz,ms1_rt,ms1_intensity,ms2_id,ms2_mz,ms2_intensity
0,780,4715862,121.072005,1409.288045,9.238152e+05,4716840,61.039738,348610.937500
1,780,4715863,434.186872,648.729416,9.178455e+05,4730542,53.038776,5160.062988
2,780,4715863,434.186872,648.729416,9.178455e+05,4730543,57.033695,50454.269531
3,780,4715863,434.186872,648.729416,9.178455e+05,4730544,61.028496,137408.390625
4,780,4715863,434.186872,648.729416,9.178455e+05,4730545,69.033775,18039.816406
...,...,...,...,...,...,...,...,...
20558,780,4732245,173.045610,400.535494,2.730342e+06,4734141,111.044861,17060.833984
20559,780,4732245,173.045610,400.535494,2.730342e+06,4734142,129.055725,41687.929688
20560,780,4732245,173.045610,400.535494,2.730342e+06,4734143,130.088058,28831.666016
20561,780,4732245,173.045610,400.535494,2.730342e+06,4734144,155.034790,10526.595703
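
The two formats carry the same information: grouping the dataframe rows by `ms1_id` recovers one fragment list per MS1 peak. A minimal sketch using plain tuples is given below. The values are copied from the first five rows of the preview above, and `group_by_ms1` is a hypothetical helper.

```python
from collections import defaultdict

# (ms1_id, ms2_mz, ms2_intensity) from the first five rows of the preview
rows = [
    (4715862, 61.039738, 348610.937500),
    (4715863, 53.038776, 5160.062988),
    (4715863, 57.033695, 50454.269531),
    (4715863, 61.028496, 137408.390625),
    (4715863, 69.033775, 18039.816406),
]


def group_by_ms1(rows):
    """Collect (ms2_mz, ms2_intensity) pairs per MS1 peak id."""
    grouped = defaultdict(list)
    for ms1_id, ms2_mz, ms2_intensity in rows:
        grouped[ms1_id].append((ms2_mz, ms2_intensity))
    return dict(grouped)


spectra_by_ms1 = group_by_ms1(rows)
for ms1_id, fragment_list in spectra_by_ms1.items():
    print(ms1_id, len(fragment_list))
```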


### Perform PALS analysis using the loaded data

The following code demonstrates how to perform pathway analysis on the data loaded above.

In [36]:
from pals.feature_extraction import DataSource
from pals.common import *
from pals.PLAGE import PLAGE
from pals.ORA import ORA
from pals.GSEA import GSEA

#### PALS analysis using KEGG database exported from Reactome

Perform a pathway analysis in offline mode using the KEGG database exported from Reactome. Note that only metabolic pathways can be analysed in this mode.

In [37]:
ds = DataSource(int_df, annotation_df, experimental_design, DATABASE_REACTOME_KEGG, 
                reactome_species=REACTOME_SPECIES_HOMO_SAPIENS, reactome_metabolic_pathway_only=True)

2020-06-09 22:27:39.195 | DEBUG    | pals.feature_extraction:__init__:42 - Using COMPOUND as database
2020-06-09 22:27:39.196 | DEBUG    | pals.loader:load_data:83 - Loading ..\pals\data\reactome\metabolic_pathways\COMPOUND\Homo sapiens.json.zip
2020-06-09 22:27:39.219 | DEBUG    | pals.feature_extraction:__init__:55 - Mapping pathway to unique ids
2020-06-09 22:27:39.221 | DEBUG    | pals.feature_extraction:__init__:69 - Creating dataset to pathway mapping
2020-06-09 22:27:40.308 | DEBUG    | pals.feature_extraction:__init__:97 - Computing unique id counts


In [38]:
plage = PLAGE(ds)
pathway_df = plage.get_pathway_df()

2020-06-09 22:27:40.480 | DEBUG    | pals.feature_extraction:change_zero_peak_ints:307 - Setting the zero intensity values in the dataframe
2020-06-09 22:27:40.489 | DEBUG    | pals.feature_extraction:change_zero_peak_ints:309 - 0
2020-06-09 22:27:40.520 | DEBUG    | pals.feature_extraction:standardize_intensity_df:276 - Scaling the data across the sample: zero mean and unit variance
2020-06-09 22:27:40.625 | DEBUG    | pals.PLAGE:get_plage_activity_df:84 - Mean values of the rows in the DF is [ 0.  0. -0. ... -0. -0. -0.]
2020-06-09 22:27:40.626 | DEBUG    | pals.PLAGE:get_plage_activity_df:85 - Variance in the rows of the DF is [1. 1. 1. ... 1. 1. 1.]
2020-06-09 22:27:40.742 | DEBUG    | pals.PLAGE:set_up_resample_plage_p_df:96 - Calculating plage p-values with resampling
2020-06-09 22:27:40.742 | DEBUG    | pals.PLAGE:set_up_resample_plage_p_df:103 - Comparison beer1/beer2
2020-06-09 22:27:40.743 | DEBUG    | pals.PLAGE:set_up_resample_plage_p_df:111 - Resampling 0/1000
2020-06-09 2

In [39]:
pathway_df.sort_values('COMPOUND beer1/beer2 comb_p', ascending=True, inplace=True)
pathway_df

Unnamed: 0,pw_name,beer1/beer2 p-value,beer3/beer4 p-value,unq_pw_F,tot_ds_F,F_coverage,sf,exp_F,Ex_Cov,COMPOUND beer1/beer2 comb_p,COMPOUND beer3/beer4 comb_p
R-HSA-2024096,HS-GAG degradation,0.014365,0.058665,5,1,20.00,0.876218,1.70,34.00,0.014365,0.058665
R-HSA-8964208,Phenylalanine metabolism,0.017967,0.062464,15,5,33.33,0.619436,5.10,34.00,0.017967,0.062464
R-HSA-71240,Tryptophan catabolism,0.018545,0.042759,27,14,51.85,0.036229,9.17,33.96,0.018545,0.042759
R-HSA-1362409,Mitochondrial iron-sulfur cluster biogenesis,0.024064,0.071653,3,1,33.33,0.713289,1.02,34.00,0.024064,0.071653
R-HSA-351143,Agmatine biosynthesis,0.028512,0.089659,5,2,40.00,0.552519,1.70,34.00,0.028512,0.089659
...,...,...,...,...,...,...,...,...,...,...,...
R-HSA-8850843,Phosphate bond hydrolysis by NTPDase proteins,1.000000,0.029983,15,4,26.67,0.810647,5.10,34.00,1.000000,0.029983
R-HSA-9634600,"Regulation of glycolysis by fructose 2,6-bisph...",1.000000,0.055705,6,2,33.33,0.663094,2.04,34.00,1.000000,0.055705
R-HSA-141334,PAOs oxidise polyamines to amines,1.000000,0.916366,7,1,14.29,0.946866,2.38,34.00,1.000000,0.916366
R-HSA-174362,Transport and synthesis of PAPS,1.000000,0.071501,5,1,20.00,0.876218,1.70,34.00,1.000000,0.071501
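
The log above reports that intensities are scaled to zero mean and unit variance before the PLAGE scores are computed. As a rough illustration of that scaling step only, here is a standard-library sketch applied to a single row of values. It assumes the population standard deviation is used and is not the PALS preprocessing code, which may include further steps.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale a sequence of values to zero mean and unit variance."""
    mean = fmean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    if sd == 0:
        # A constant row carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(value - mean) / sd for value in values]


# Four intensity values taken from one row of int_df shown earlier
row = [6.254237e+08, 6.503417e+08, 5.914975e+08, 4.635929e+08]
scaled = standardize(row)
print([round(value, 3) for value in scaled])
```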


#### PALS analysis of compounds by connecting to Reactome

Perform a pathway analysis in online mode by connecting to a local Reactome database. All pathways (metabolic and non-metabolic) can be analysed in this mode. Note that you need to have Reactome and Neo4j installed locally for this to work! The example below still passes `reactome_metabolic_pathway_only=True`, so it is restricted to metabolic pathways.

It is also possible to analyse transcript and protein data in this manner.

In [40]:
ds = DataSource(int_df, annotation_df, experimental_design, DATABASE_REACTOME_KEGG, 
                reactome_species=REACTOME_SPECIES_HOMO_SAPIENS, reactome_metabolic_pathway_only=True, reactome_query=True)

2020-06-09 22:27:48.459 | DEBUG    | pals.feature_extraction:__init__:42 - Using COMPOUND as database
2020-06-09 22:27:48.460 | DEBUG    | pals.loader:load_data:55 - Retrieving data for Homo sapiens from Reactome COMPOUND metabolic_pathway_only=True
2020-06-09 22:27:53.317 | DEBUG    | pals.feature_extraction:__init__:55 - Mapping pathway to unique ids
2020-06-09 22:27:53.319 | DEBUG    | pals.feature_extraction:__init__:69 - Creating dataset to pathway mapping
2020-06-09 22:27:54.659 | DEBUG    | pals.feature_extraction:__init__:97 - Computing unique id counts


In [42]:
plage = PLAGE(ds)
pathway_df = plage.get_pathway_df()

2020-06-09 22:30:04.746 | DEBUG    | pals.feature_extraction:change_zero_peak_ints:307 - Setting the zero intensity values in the dataframe
2020-06-09 22:30:04.754 | DEBUG    | pals.feature_extraction:change_zero_peak_ints:309 - 0
2020-06-09 22:30:04.808 | DEBUG    | pals.feature_extraction:standardize_intensity_df:276 - Scaling the data across the sample: zero mean and unit variance
2020-06-09 22:30:04.969 | DEBUG    | pals.PLAGE:get_plage_activity_df:84 - Mean values of the rows in the DF is [ 0.  0. -0. ... -0. -0. -0.]
2020-06-09 22:30:04.971 | DEBUG    | pals.PLAGE:get_plage_activity_df:85 - Variance in the rows of the DF is [1. 1. 1. ... 1. 1. 1.]
2020-06-09 22:30:05.146 | DEBUG    | pals.PLAGE:set_up_resample_plage_p_df:96 - Calculating plage p-values with resampling
2020-06-09 22:30:05.148 | DEBUG    | pals.PLAGE:set_up_resample_plage_p_df:103 - Comparison beer1/beer2
2020-06-09 22:30:05.149 | DEBUG    | pals.PLAGE:set_up_resample_plage_p_df:111 - Resampling 0/1000
2020-06-09 2

In [43]:
pathway_df.sort_values('COMPOUND beer1/beer2 comb_p', ascending=True, inplace=True)
pathway_df

Unnamed: 0,pw_name,beer1/beer2 p-value,beer3/beer4 p-value,unq_pw_F,tot_ds_F,F_coverage,sf,exp_F,Ex_Cov,COMPOUND beer1/beer2 comb_p,COMPOUND beer3/beer4 comb_p
R-HSA-2024096,HS-GAG degradation,0.012480,0.054547,5,1,20.00,0.876218,1.70,34.00,0.012480,0.054547
R-HSA-8964208,Phenylalanine metabolism,0.015723,0.058184,15,5,33.33,0.619436,5.10,34.00,0.015723,0.058184
R-HSA-71240,Tryptophan catabolism,0.015745,0.037832,27,14,51.85,0.036229,9.17,33.96,0.015745,0.037832
R-HSA-1362409,Mitochondrial iron-sulfur cluster biogenesis,0.020677,0.065959,3,1,33.33,0.713289,1.02,34.00,0.020677,0.065959
R-HSA-351143,Agmatine biosynthesis,0.025333,0.084390,5,2,40.00,0.552519,1.70,34.00,0.025333,0.084390
...,...,...,...,...,...,...,...,...,...,...,...
R-HSA-8850843,Phosphate bond hydrolysis by NTPDase proteins,1.000000,0.025813,15,4,26.67,0.810647,5.10,34.00,1.000000,0.025813
R-HSA-9634600,"Regulation of glycolysis by fructose 2,6-bisph...",1.000000,0.050298,6,2,33.33,0.663094,2.04,34.00,1.000000,0.050298
R-HSA-141334,PAOs oxidise polyamines to amines,1.000000,0.915108,7,1,14.29,0.946866,2.38,34.00,1.000000,0.915108
R-HSA-174362,Transport and synthesis of PAPS,1.000000,0.066861,5,1,20.00,0.876218,1.70,34.00,1.000000,0.066861
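
Once the results are available, a typical next step is to keep the pathways whose combined p-value falls below a threshold in every comparison. The sketch below works on plain tuples copied from the table above. The threshold of 0.05 is a conventional choice assumed here, not something prescribed by PALS, and `significant_in_all` is a hypothetical helper.

```python
# (pathway id, name, beer1/beer2 comb_p, beer3/beer4 comb_p) from the table above
results = [
    ("R-HSA-2024096", "HS-GAG degradation", 0.012480, 0.054547),
    ("R-HSA-71240", "Tryptophan catabolism", 0.015745, 0.037832),
    ("R-HSA-8850843", "Phosphate bond hydrolysis by NTPDase proteins", 1.000000, 0.025813),
    ("R-HSA-141334", "PAOs oxidise polyamines to amines", 1.000000, 0.915108),
]


def significant_in_all(rows, threshold=0.05):
    """Return the names of pathways below the threshold in every comparison."""
    return [name for _, name, *p_values in rows if all(p < threshold for p in p_values)]


print(significant_in_all(results))  # ['Tryptophan catabolism']
```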


#### ORA Analysis

Perform a pathway analysis using ORA.

In [44]:
ds = DataSource(int_df, annotation_df, experimental_design, DATABASE_PIMP_KEGG)

2020-06-09 22:30:11.706 | DEBUG    | pals.feature_extraction:__init__:42 - Using PiMP_KEGG as database
2020-06-09 22:30:11.707 | DEBUG    | pals.loader:load_data:41 - Loading C:\Users\joewa\Work\git\PALS\pals\data\PiMP_KEGG.json.zip
2020-06-09 22:30:11.733 | DEBUG    | pals.feature_extraction:__init__:55 - Mapping pathway to unique ids
2020-06-09 22:30:11.738 | DEBUG    | pals.feature_extraction:__init__:69 - Creating dataset to pathway mapping
2020-06-09 22:30:12.745 | DEBUG    | pals.feature_extraction:__init__:97 - Computing unique id counts


In [45]:
ora = ORA(ds)
pathway_df = ora.get_pathway_df()

2020-06-09 22:30:12.964 | DEBUG    | pals.ORA:get_pathway_df:35 - Calculating ORA
2020-06-09 22:30:12.965 | DEBUG    | pals.feature_extraction:change_zero_peak_ints:307 - Setting the zero intensity values in the dataframe
2020-06-09 22:30:12.969 | DEBUG    | pals.feature_extraction:change_zero_peak_ints:309 - 0
2020-06-09 22:30:20.460 | DEBUG    | pals.ORA:get_pathway_df:98 - Correcting for multiple t-tests
2020-06-09 22:30:20.470 | DEBUG    | pals.feature_extraction:_calculate_coverage_df:329 - Calculating dataset formula coverage


In [46]:
pathway_df.sort_values('PiMP_KEGG beer1/beer2 comb_p', ascending=True, inplace=True)
pathway_df

Unnamed: 0,pw_name,beer1/beer2 p-value,beer3/beer4 p-value,PiMP_KEGG beer1/beer2 comb_p,PiMP_KEGG beer3/beer4 comb_p,unq_pw_F,tot_ds_F,F_coverage
map00350,Tyrosine metabolism,3.621191e-16,1.681454e-16,8.183892e-14,3.800087e-14,53,39,73.58
map00330,Arginine and proline metabolism,1.775486e-12,1.686603e-12,2.006299e-10,1.905861e-10,79,50,63.29
map00290,"Valine, leucine and isoleucine biosynthesis",9.591121e-12,2.028069e-11,7.225312e-10,1.527812e-09,17,16,94.12
map00400,"Phenylalanine, tyrosine and tryptophan biosynt...",2.824918e-10,7.042130e-10,1.596078e-08,3.978804e-08,30,22,73.33
map00052,Galactose metabolism,5.693949e-09,1.159169e-08,1.838332e-07,2.619722e-07,21,17,80.95
...,...,...,...,...,...,...,...,...
map04111,Cell cycle - yeast,1.000000e+00,3.342697e-01,1.000000e+00,5.474271e-01,2,1,50.00
map00860,Porphyrin and chlorophyll metabolism,9.998171e-01,9.999102e-01,1.000000e+00,1.000000e+00,89,7,7.87
map00071,Fatty acid degradation,9.932555e-01,9.773278e-01,1.000000e+00,1.000000e+00,37,3,8.11
map00364,Fluorobenzoate degradation,8.903434e-01,9.064737e-01,1.000000e+00,1.000000e+00,20,2,10.00
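
As background, over-representation analysis is commonly based on a hypergeometric test: given how many compounds are in the dataset and how many of them are significant, how surprising is the number of significant compounds that fall in one pathway? The sketch below computes that tail probability with the standard library. It illustrates the general technique with made-up counts and is not necessarily the exact computation performed by `ORA` in PALS.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N items from M, of which n belong to the pathway."""
    upper = min(n, N)
    total = comb(M, N)
    # Sum the probabilities of observing k, k+1, ..., upper pathway members
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / total


# Made-up counts: 100 compounds in total, 10 in the pathway,
# 20 significant compounds, 6 of which are in the pathway
p_value = hypergeom_sf(k=6, M=100, n=10, N=20)
print(p_value)
```

A small value means the pathway contains more significant compounds than expected by chance. When many pathways are tested, the resulting p-values are usually corrected for multiple testing.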


#### GSEA Analysis

Perform a pathway analysis using GSEA.

In [47]:
ds = DataSource(int_df, annotation_df, experimental_design, DATABASE_PIMP_KEGG)

2020-06-09 22:30:20.751 | DEBUG    | pals.feature_extraction:__init__:42 - Using PiMP_KEGG as database
2020-06-09 22:30:20.753 | DEBUG    | pals.loader:load_data:41 - Loading C:\Users\joewa\Work\git\PALS\pals\data\PiMP_KEGG.json.zip
2020-06-09 22:30:20.849 | DEBUG    | pals.feature_extraction:__init__:55 - Mapping pathway to unique ids
2020-06-09 22:30:20.858 | DEBUG    | pals.feature_extraction:__init__:69 - Creating dataset to pathway mapping
2020-06-09 22:30:22.076 | DEBUG    | pals.feature_extraction:__init__:97 - Computing unique id counts


In [48]:
gsea = GSEA(ds)
pathway_df = gsea.get_pathway_df()

2020-06-09 22:30:22.182 | DEBUG    | pals.GSEA:__init__:38 - GSEA initialised with num_resamples=1000 and ranking_method=signal_to_noise
2020-06-09 22:30:22.302 | DEBUG    | pals.GSEA:get_pathway_df:54 - Calculating GSEA
2020-06-09 22:30:22.302 | DEBUG    | pals.feature_extraction:change_zero_peak_ints:307 - Setting the zero intensity values in the dataframe
2020-06-09 22:30:22.307 | DEBUG    | pals.feature_extraction:change_zero_peak_ints:309 - 0
2020-06-09 22:30:22.364 | DEBUG    | pals.GSEA:get_pathway_df:83 - Running comparison case=beer1 control=beer2
2020-06-09 22:30:38.537 | DEBUG    | pals.GSEA:get_pathway_df:83 - Running comparison case=beer3 control=beer4
2020-06-09 22:30:52.166 | DEBUG    | pals.feature_extraction:_calculate_coverage_df:329 - Calculating dataset formula coverage


In [49]:
pathway_df.sort_values('PiMP_KEGG beer1/beer2 comb_p', ascending=True, inplace=True)
pathway_df

Unnamed: 0,pw_name,beer1/beer2 p-value,PiMP_KEGG beer1/beer2 comb_p,beer1/beer2 ES_score,beer3/beer4 p-value,PiMP_KEGG beer3/beer4 comb_p,beer3/beer4 ES_score,unq_pw_F,tot_ds_F,F_coverage
map05200,Pathways in cancer,0.009615,0.028312,0.730293,0.037037,0.014600,-0.652592,15,4,26.67
map05211,Renal cell carcinoma,0.181818,0.041053,0.603136,0.000000,0.004258,-0.827485,3,2,66.67
map07015,Local analgesics,0.000000,0.079320,-0.736213,0.764286,0.906582,-0.313698,5,3,60.00
map05143,African trypanosomiasis,0.000000,0.079921,-0.663471,0.763158,0.652122,0.285540,7,3,42.86
map00680,Methane metabolism,0.084746,0.081692,-0.275637,0.155689,0.885624,-0.305568,72,20,27.78
...,...,...,...,...,...,...,...,...,...,...
map04140,Regulation of autophagy,0.881481,0.991998,0.408636,0.040404,0.281491,0.836363,2,1,50.00
map00960,"Tropane, piperidine and pyridine alkaloid bios...",0.987097,0.995855,0.140311,0.387500,0.461098,0.201227,51,25,49.02
map00740,Riboflavin metabolism,0.710983,0.996434,0.259770,0.016667,0.017864,0.615319,23,7,30.43
map00920,Sulfur metabolism,0.889831,0.997492,0.223884,0.669355,0.902932,-0.312494,27,7,25.93
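
The log above shows that GSEA was initialised with `ranking_method=signal_to_noise`. In the usual GSEA formulation, the signal-to-noise ratio of a feature is the difference of the group means divided by the sum of the group standard deviations. A minimal sketch of that formula follows. It uses the sample standard deviation and omits any adjustment for very small deviations, so it should be read as an illustration rather than the PALS implementation.

```python
from statistics import fmean, stdev


def signal_to_noise(case, control):
    """Signal-to-noise ratio: (mean_case - mean_control) / (sd_case + sd_control)."""
    noise = stdev(case) + stdev(control)
    if noise == 0:
        raise ValueError("Both groups have zero standard deviation")
    return (fmean(case) - fmean(control)) / noise


# Made-up intensities for one feature in two groups of three replicates
score = signal_to_noise([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(round(score, 4))
```

Features are then ranked by this score, with positive values indicating higher levels in the case group.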
