This notebook is a tutorial for running PET.
Based on our benchmark, the optimal setting for running PET consists of three steps:
1. Perform differential expression analysis with DESeq2.
2. Perform pathway analysis with the underlying methods.
3. Combine the results using PET.

Files required:
1. Data matrix; raw read counts are strongly recommended.
2. Pathway file in GMT format.
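
For reference, a GMT file is tab-delimited with one gene set per line: the set name, a description field (often a URL), then the member genes. The sketch below shows one way to read that layout. `read_gmt` is a hypothetical helper written for illustration and is not part of PET's `helper` module; the gene sets are made up.

```python
import io


def read_gmt(handle):
    """Parse GMT text into {pathway_name: set_of_genes}.

    Hypothetical helper for illustration; not part of PET's helper module.
    """
    pathways = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        pathways[name] = {g for g in genes if g}  # drop empty trailing fields
    return pathways


# Toy two-pathway input (made-up gene sets, not the KEGG file).
toy_gmt = io.StringIO(
    "PATHWAY_A\thttp://example.org/a\tTP53\tBRCA1\tEGFR\n"
    "PATHWAY_B\tna\tMYC\tKRAS\n"
)
pathways = read_gmt(toy_gmt)
```

With a real file, the same function could be called on an open file handle instead of the `io.StringIO` object.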

Example data:
An example read count matrix and a KEGG pathway file are provided in the example/ directory.

In [1]:
from fisher_test import run_fisher_test
from helper import *
import os, sys
import numpy as np

In [3]:
# create result directory
out_dir = 'example/'
create_dir(out_dir)

In [4]:
# Perform the DESeq2 analysis; a template DESeq2 script is provided in deseq2_template.R.
# DESeq2 requires raw read counts: the first column holds gene names and the remaining columns hold group1 samples followed by group2 samples, in order.
# Helper function to format the DESeq2 script; the result file must have a .csv suffix.
format_deseq2_script(read_count_file_path='example/example_data.txt', sample_num=62, group1_id='poor_survival',
                     group2_id='better_survival', group1_num=33, group2_num=29, 
                     result_file_path='example/deseq_result.csv', script_name='deseq_analysis.R')

DESeq2 script written to deseq_analysis.R
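
Because the groups are assigned by column position, it can be worth checking the matrix header before running the script. The sketch below assumes a tab-delimited file with a header row whose first column is the gene name; `split_sample_columns` is a hypothetical helper, not part of PET.

```python
def split_sample_columns(header_line, group1_num, group2_num, sep="\t"):
    """Return (group1_samples, group2_samples) taken by position from the header.

    Hypothetical helper; assumes a delimited header with gene names in column 1.
    """
    columns = header_line.rstrip("\n").split(sep)
    samples = columns[1:]  # first column holds gene names
    expected = group1_num + group2_num
    if len(samples) != expected:
        raise ValueError(
            f"expected {expected} sample columns, found {len(samples)}"
        )
    # group1 samples come first, group2 samples follow
    return samples[:group1_num], samples[group1_num:]


# Toy header with 3 + 2 samples (made-up sample names).
header = "gene\ts1\ts2\ts3\ts4\ts5\n"
group1, group2 = split_sample_columns(header, group1_num=3, group2_num=2)
```

For the example data, the equivalent check would use group1_num=33 and group2_num=29, which add up to the 62 samples passed as sample_num.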


In [11]:
# run DESeq2 script
! Rscript deseq_analysis.R

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min


Attaching package: ‘S4Vectors’

The followin

class: DESeqDataSet 
dim: 53657 62 
metadata(1): version
assays(1): counts
rownames(53657): TSPAN6 TNMD ... LINC01144 AC007389.5
rowData names(0):
colnames(62): TCGA_AA_3850.Tumor.Rep244 TCGA_AA_3845.Tumor.Rep208 ...
  TCGA_DM_A1HA.Tumor.Rep410 TCGA_DM_A1HB.Tumor.Rep417
colData names(1): condition
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 5411 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

In [6]:
# Convert the DESeq2 result (CSV file only) to a .rnk file for GSEA. Check the direction of the DESeq2 result (log2FoldChange column) first:
# if log2FoldChange >= 0 means up-regulation, keep direction=1; otherwise, pass direction=-1.
generate_rank_file(deseq_result_file='example/deseq_result.csv', out_file='example/prerank.rnk', direction=1)
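
GSEAPreranked reads a two-column, tab-delimited .rnk file (gene, score). The sketch below illustrates how such a ranking could be derived from a DESeq2 result table and how a direction flag changes it. The scoring metric here, sign(log2FoldChange) × -log10(pvalue), is an assumption chosen because it is a common convention; the metric actually used by `generate_rank_file` is not shown in this notebook. `make_rank_rows` is a hypothetical name.

```python
import csv
import io
import math


def make_rank_rows(csv_handle, direction=1):
    """Return [(gene, score)] sorted from highest to lowest score.

    Assumed metric: direction * sign(log2FoldChange) * -log10(pvalue).
    Hypothetical helper; not PET's generate_rank_file.
    """
    reader = csv.DictReader(csv_handle)
    gene_col = reader.fieldnames[0]  # gene names are in the first column
    rows = []
    for rec in reader:
        try:
            lfc = float(rec["log2FoldChange"])
            pval = float(rec["pvalue"])
        except ValueError:
            continue  # DESeq2 writes NA for genes it could not test
        if math.isnan(lfc) or math.isnan(pval):
            continue
        pval = max(pval, 1e-300)  # avoid log10(0)
        score = direction * math.copysign(1.0, lfc) * -math.log10(pval)
        rows.append((rec[gene_col], score))
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows


# Toy DESeq2-style table (made-up values).
toy_csv = (
    "gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj\n"
    "A,10,2.0,0.5,4,0.001,0.01\n"
    "B,10,-1.0,0.5,-2,0.01,0.05\n"
    "C,10,NA,NA,NA,NA,NA\n"
)
ranked = make_rank_rows(io.StringIO(toy_csv), direction=1)
flipped = make_rank_rows(io.StringIO(toy_csv), direction=-1)
```

Passing direction=-1 reverses the order of the list, which is why the sign convention of log2FoldChange should be confirmed before running GSEA.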

In [9]:
# To run GSEA, download the GSEA command-line tool from: http://www.gsea-msigdb.org/gsea/downloads.jsp
! ~/GSEA_test/GSEA_cmd/gsea-cli.sh GSEAPreranked -gmx example/c2.cp.kegg.v2023.1.Hs.symbols.gmt -collapse No_Collapse -mode Max_probe -norm meandiv -nperm 1000 -rnk example/prerank.rnk  -scoring_scheme weighted -rpt_label example_test   -create_svgs false -include_only_symbols true -make_sets true -plot_top_x 5 -rnd_seed timestamp -set_max 500 -set_min 15 -zip_report false -out example/gsea_out_example/  

Using system JDK.
466      [INFO  ] - Parameters passing to GSEAPreranked.main:
467      [INFO  ] - rnk	example/prerank.rnk
468      [INFO  ] - gmx	example/c2.cp.kegg.v2023.1.Hs.symbols.gmt
468      [INFO  ] - rpt_label	example_test
468      [INFO  ] - collapse	No_Collapse
468      [INFO  ] - zip_report	false
468      [INFO  ] - gui	false
468      [INFO  ] - out	example/gsea_out_example/
468      [INFO  ] - mode	Max_probe
468      [INFO  ] - norm	meandiv
468      [INFO  ] - nperm	1000
468      [INFO  ] - scoring_scheme	weighted
468      [INFO  ] - include_only_symbols	true
468      [INFO  ] - make_sets	true
468      [INFO  ] - plot_top_x	5
468      [INFO  ] - rnd_seed	timestamp
468      [INFO  ] - create_svgs	false
468      [INFO  ] - set_max	500
468      [INFO  ] - set_min	15
692      [INFO  ] - Made Vdb dir JIT: /Users/luopin/GSEA_test/ENCODE_RNA-seq/PET/mar10
713      [INFO  ] - Begun importing: RankedList from: example/prerank.rnk
1269     [INFO  ] - Your current version of GSEA is

GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
shuffleGeneSet for GeneSet 91/176 nperm: 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
shuffleGeneSet for GeneSet 96/176 nperm: 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorted_scored: 501 / 1000
shuffleGeneSet for GeneSet 101/176 nperm: 1000
GeneSetCohorted: 501 / 1000
GeneSetCohorte

In [4]:
# Before running the methods, prune the pathways: drop pathways with more than max_gene_num or fewer than min_gene_num genes, and remove any gene that is not present in the expression matrix.
# Keep the gene number limits the same as in the GSEA command (set_min / set_max).
prune_gmt(file_name='example/c2.cp.kegg.v2023.1.Hs.symbols.gmt', 
          out_file_name='example/c2.cp.kegg.v2023.1.Hs.symbols.cleaned.gmt', 
          expr_matrix_file='example/example_data.txt', 
          min_gene_num=15, max_gene_num=500)

53657 genes in the expression matrix
KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_GLOBO_SERIES removed due to gene number
KEGG_NON_HOMOLOGOUS_END_JOINING removed due to gene number
KEGG_CIRCADIAN_RHYTHM_MAMMAL removed due to gene number
KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_BIOSYNTHESIS removed due to gene number
KEGG_TAURINE_AND_HYPOTAURINE_METABOLISM removed due to gene number
KEGG_FOLATE_BIOSYNTHESIS removed due to gene number
KEGG_LIMONENE_AND_PINENE_DEGRADATION removed due to gene number
KEGG_SULFUR_METABOLISM removed due to gene number
KEGG_RIBOSOME removed due to gene number
KEGG_PROTEASOME removed due to gene number
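
The pruning rule described above can be sketched as follows. The order of operations (restrict each pathway to expressed genes first, then apply the size limits) is an assumption, and `prune_pathways` is a hypothetical stand-in; the actual `prune_gmt` may differ in detail.

```python
def prune_pathways(pathways, expressed_genes, min_gene_num=15, max_gene_num=500):
    """Keep pathways whose expressed-gene count lies within [min, max].

    Hypothetical stand-in for illustration; not PET's prune_gmt.
    """
    kept = {}
    for name, genes in pathways.items():
        present = set(genes) & expressed_genes  # remove genes missing from the matrix
        if min_gene_num <= len(present) <= max_gene_num:
            kept[name] = present
    return kept


# Toy input (made-up genes) with small limits so the effect is visible.
expressed = {"G1", "G2", "G3", "G4"}
toy_pathways = {
    "SMALL": {"G1", "X"},            # 1 expressed gene, below the minimum
    "OK": {"G1", "G2", "G3", "Y"},   # 3 expressed genes, kept
    "BIG": {"G1", "G2", "G3", "G4"}, # 4 expressed genes, above the maximum
}
pruned = prune_pathways(toy_pathways, expressed, min_gene_num=2, max_gene_num=3)
```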


In [None]:
# Perform the Fisher test.
# The Fisher test takes a set of genes of interest and a pathway file.
# We perform this step for both up- and down-regulated genes, sorted by DESeq2 p-value.
run_fisher_test
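
As background for this step, a one-sided Fisher's exact test for over-representation is equivalent to the upper tail of a hypergeometric distribution. The sketch below computes that p-value for one pathway using only the standard library. It illustrates the general test, not the implementation of `run_fisher_test`, whose arguments are not shown here; `fisher_enrichment_p` and the toy gene names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(genes_of_interest, pathway_genes, background_genes):
    """One-sided p-value: P(overlap >= observed) under the hypergeometric null.

    Hypothetical helper for illustration; not PET's run_fisher_test.
    """
    background = set(background_genes)
    interest = set(genes_of_interest) & background
    pathway = set(pathway_genes) & background
    N = len(background)           # all genes considered
    K = len(pathway)              # pathway genes in the background
    n = len(interest)             # genes of interest in the background
    k = len(interest & pathway)   # observed overlap
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    return tail / comb(N, n)


# Toy example: 10 background genes, a 4-gene pathway, 3 genes of interest.
background = [f"g{i}" for i in range(10)]
pathway = ["g0", "g1", "g2", "g3"]
interest = ["g0", "g1", "g2"]
p_value = fisher_enrichment_p(interest, pathway, background)
```

In the toy example all three genes of interest fall inside the pathway, so the tail has a single term: C(4,3) × C(6,0) / C(10,3) = 4/120.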