This notebook is a tutorial for running PET.
Based on our benchmark, the recommended workflow for running PET consists of 3 steps:
1. perform differential expression analysis with DESeq2
2. perform pathway analysis with the underlying methods
3. combine the results using PET

Files required:
1. Data matrix (raw read counts strongly recommended).
2. Pathway file in gmt format.

Example data:\
We provide an example read count matrix (text file) and an example gmt pathway file (KEGG).

Notes:\
Although a template script and command are provided for running DESeq2 and GSEA, users are free to use any differential expression analysis method and any GSEA mode, as long as the output format stays consistent so that the results can be parsed correctly.
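
For orientation, the gmt pathway file listed above can be read with a few lines of Python. The sketch below assumes the usual gmt layout of one gene set per line (set name, a description field, then the member genes, all tab-separated). The function name `read_gmt` and the toy pathways are made up for illustration and are not part of PET.

```python
import io

def read_gmt(handle):
    """Parse gmt text into {pathway_name: set_of_genes}."""
    pathways = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        pathways[name] = {g for g in genes if g}
    return pathways

# Toy gmt content (hypothetical pathways, not taken from KEGG)
toy_gmt = io.StringIO(
    "PATHWAY_A\tdescription\tGENE1\tGENE2\tGENE3\n"
    "PATHWAY_B\tdescription\tGENE2\tGENE4\n"
)
pathways = read_gmt(toy_gmt)
print(sorted(pathways["PATHWAY_A"]))
```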

In [None]:
%load_ext autoreload
%autoreload 2

In [12]:
from fisher_test import run_fisher_test
from enrichr import run_enrichr
from PET import run_PET
from helper import *
import os, sys
import numpy as np
import scipy.stats as st

In [None]:
# create result directory
out_dir = 'example_new/'
create_dir(out_dir)

**Step 1. Differential expression analysis with DESeq2**\
Based on the benchmark, pathway analysis methods perform best when given a pre-ranked gene list. Here, we rank genes by the p-value from the DESeq2 results, which outperformed the other ranking metrics in the benchmark, and use that ranking as input to the next step.\
\
A template DESeq2 script in R is provided in deseq2_template.R. PyDESeq2 is installed automatically with the PET environment. By default, DESeq2 is run through the R script; to use the Python version instead, specify method='python'.\
\
Note: **Raw read counts** are required for DESeq2. The first column must contain the gene names, and the remaining columns must contain the samples of cond1, cond2, ..., condN **in order**. List the conditions in the same order in the contrast file, one condition per row, in the format 'condition_name\tsample_num'.
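
To make the layout concrete, the sketch below expands a contrast file into one condition label per sample column. The sample numbers (3, 3 and 4) and the name `expand_contrast` are hypothetical; only the 'condition_name\tsample_num' row format comes from the note above.

```python
import io

def expand_contrast(handle):
    """Turn 'condition_name<TAB>sample_num' rows into one label per sample column."""
    labels = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        name, num = line.split("\t")
        labels.extend([name] * int(num))
    return labels

# Hypothetical contrast file: three conditions with 3, 3 and 4 samples
contrast = io.StringIO("cond1\t3\ncond2\t3\ncond3\t4\n")
labels = expand_contrast(contrast)

# Header of a matching count matrix: gene name first, then samples in the same order
header = ["gene"] + [f"sample{i + 1}" for i in range(len(labels))]
for sample, cond in zip(header[1:], labels):
    print(sample, cond)
```

If the number of labels does not match the number of sample columns in the count matrix, the contrast file and the matrix are out of sync.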

In [13]:
# run differential expression analysis through DESeq2 (R version)
# to save only one comparison, provide group=[condition1, condition2] and result_file_path
# by default, PET saves the results of all comparisons to result_dir_path
run_DEseq2(expr_file=out_dir+'example_data.txt', contrast_matrix_file='example_contrast.txt', result_dir_path=out_dir, method='R', script_file_name=out_dir+'deseq_analysis.all.R')

DESeq2 script written to example_new/deseq_analysis.all.R
Start running R script


Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min


Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loadi

class: DESeqDataSet 
dim: 53657 10 
metadata(1): version
assays(1): counts
rownames(53657): TSPAN6 TNMD ... LINC01144 AC007389.5
rowData names(0):
colnames(10): TCGA.AA.3850 TCGA.AA.3845 ... TCGA.DM.A1HA TCGA.DM.A1HB
colData names(1): condition


estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing


[1] "Comparing cond1 vs. cond2"
[1] "Result saved to  example_new//cond1.vs.cond2.csv"
[1] "Comparing cond1 vs. cond3"
[1] "Result saved to  example_new//cond1.vs.cond3.csv"
[1] "Comparing cond2 vs. cond3"
[1] "Result saved to  example_new//cond2.vs.cond3.csv"
Analysis Done!


In [None]:
# Alternatively, use python version DESeq2 

In [None]:
# convert a DESeq2 result (CSV file only) to a .rnk file for GSEA
# check the direction of the DESeq2 result first (log2FoldChange column):
# if log2FoldChange >= 0 means up-regulation, keep direction=1; otherwise, pass direction=-1

generate_rank_file(deseq_result_file=out_dir+'con', out_file=out_dir+'prerank.rnk', direction=1)
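
One common way to turn DESeq2 output into a signed, p-value-based ranking is `sign(log2FoldChange) * -log10(pvalue)`. The sketch below assumes that metric, the DESeq2 column names `log2FoldChange` and `pvalue`, and a gene column called `gene`. Whether `generate_rank_file` uses exactly this formula is not stated in this notebook, and `make_rank` is a hypothetical name.

```python
import csv
import io
import math

def make_rank(csv_handle, direction=1):
    """Return [(gene, score)] sorted from most up- to most down-regulated."""
    ranked = []
    for row in csv.DictReader(csv_handle):
        try:
            pval = float(row["pvalue"])
            lfc = float(row["log2FoldChange"])
        except ValueError:
            continue  # DESeq2 writes NA for genes it could not test
        if math.isnan(pval) or math.isnan(lfc):
            continue
        pval = max(pval, 1e-300)  # avoid log10(0)
        sign = 1.0 if lfc * direction >= 0 else -1.0
        ranked.append((row["gene"], sign * -math.log10(pval)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Toy DESeq2-like table (made-up values)
toy_csv = io.StringIO(
    "gene,log2FoldChange,pvalue\n"
    "GENE1,2.0,0.001\n"
    "GENE2,-1.5,0.01\n"
    "GENE3,0.3,NA\n"
    "GENE4,0.1,0.5\n"
)
ranked = make_rank(toy_csv, direction=1)
for gene, score in ranked:
    print(f"{gene}\t{score:.4f}")  # .rnk layout: gene<TAB>score
```

Passing `direction=-1` flips the sign of every score, which corresponds to the case where a positive log2FoldChange means down-regulation in the condition of interest.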

In [None]:
# To run GSEA, download the GSEA command line tool from http://www.gsea-msigdb.org/gsea/downloads.jsp
# change the path to gsea-cli.sh as needed
! ~/GSEA_test/GSEA_cmd/gsea-cli.sh GSEAPreranked -gmx example/c2.cp.kegg.v2023.1.Hs.symbols.gmt -collapse No_Collapse -mode Max_probe -norm meandiv -nperm 1000 -rnk example/prerank.rnk  -scoring_scheme weighted -rpt_label example_test   -create_svgs false -include_only_symbols true -make_sets true -plot_top_x 5 -rnd_seed timestamp -set_max 500 -set_min 15 -zip_report false -out example/  
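
If the shell command is hard to maintain, the same call can be assembled in Python so that the paths and gene set size limits live in one place. This is only a sketch: `build_gsea_command` is a hypothetical helper, and every flag and value is copied from the command above.

```python
import os
import shlex

def build_gsea_command(gsea_cli, gmt_file, rnk_file, out_dir, label,
                       set_min=15, set_max=500, nperm=1000):
    """Assemble the GSEAPreranked call as an argument list."""
    return [
        os.path.expanduser(gsea_cli), "GSEAPreranked",
        "-gmx", gmt_file,
        "-rnk", rnk_file,
        "-rpt_label", label,
        "-out", out_dir,
        "-collapse", "No_Collapse",
        "-mode", "Max_probe",
        "-norm", "meandiv",
        "-nperm", str(nperm),
        "-scoring_scheme", "weighted",
        "-create_svgs", "false",
        "-include_only_symbols", "true",
        "-make_sets", "true",
        "-plot_top_x", "5",
        "-rnd_seed", "timestamp",
        "-set_max", str(set_max),
        "-set_min", str(set_min),
        "-zip_report", "false",
    ]

cmd = build_gsea_command(
    gsea_cli="~/GSEA_test/GSEA_cmd/gsea-cli.sh",
    gmt_file="example/c2.cp.kegg.v2023.1.Hs.symbols.gmt",
    rnk_file="example/prerank.rnk",
    out_dir="example/",
    label="example_test",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the GSEA command line tool is installed.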

Before running the other methods, we prune the pathway file. Pruning removes every gene that is not present in the expression matrix and drops pathways with fewer than min_gene_num or more than max_gene_num genes.

In [None]:
# keep the gene set size limits the same as in the GSEA command (-set_min 15, -set_max 500)
pruned_pathway_dict, gene_universe = prune_gmt(file_name='example/c2.cp.kegg.v2023.1.Hs.symbols.gmt', 
                                        out_file_name='example/c2.cp.kegg.v2023.1.Hs.symbols.cleaned.gmt', 
                                        expr_matrix_file='example/example_data.txt', 
                                        min_gene_num=15, max_gene_num=500)
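
The pruning logic can be pictured with the minimal sketch below. It assumes the size filter is applied after genes missing from the expression matrix have been removed; the order used by `prune_gmt` is not stated here, and `prune_pathways` and the toy data are hypothetical.

```python
def prune_pathways(pathways, expressed_genes, min_gene_num=15, max_gene_num=500):
    """Keep only expressed genes, then drop pathways outside the size limits."""
    pruned = {}
    for name, genes in pathways.items():
        kept = genes & expressed_genes
        if min_gene_num <= len(kept) <= max_gene_num:
            pruned[name] = kept
    return pruned

# Toy input with small limits so the effect is visible
toy_pathways = {
    "A": {"g1", "g2", "g3"},
    "B": {"g1", "gX"},                     # gX is not expressed, leaving 1 gene
    "C": {"g1", "g2", "g3", "g4", "g5"},   # too large for max_gene_num=4
}
expressed = {"g1", "g2", "g3", "g4", "g5"}
pruned = prune_pathways(toy_pathways, expressed, min_gene_num=2, max_gene_num=4)
print(pruned)
```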

The Fisher test takes a set of genes of interest and a pathway file (here, only the pathway dict). We'll perform this step for both the top up-regulated and the top down-regulated genes, sorted by DESeq2 p-value.

In [None]:
# for extracting the top DEGs, we'll simply query the rank file here
up_regulated_gene_set = extract_top_gene(rank_file='example/prerank.rnk', num_gene=200, direction=1)
down_regulated_gene_set = extract_top_gene(rank_file='example/prerank.rnk', num_gene=200, direction=-1)
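
As a rough picture of what querying the rank file means, the sketch below takes the genes at either end of a ranked list. It assumes the .rnk file holds one `gene<TAB>score` pair per line with positive scores for up-regulated genes; `top_genes` is a hypothetical stand-in, not the implementation of `extract_top_gene`.

```python
def top_genes(ranked, num_gene, direction=1):
    """Pick num_gene genes from the top (direction=1) or bottom (direction=-1)."""
    ordered = sorted(ranked, key=lambda item: item[1], reverse=(direction == 1))
    return {gene for gene, _ in ordered[:num_gene]}

# Toy ranking (made-up scores)
toy_rank = [("G1", 3.0), ("G2", 1.2), ("G3", -0.5), ("G4", -2.7)]
up = top_genes(toy_rank, num_gene=2, direction=1)
down = top_genes(toy_rank, num_gene=2, direction=-1)
print(sorted(up), sorted(down))
```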

In [None]:
# run fisher test for both sets of genes against the pathway file
run_fisher_test(pathway_dict=pruned_pathway_dict, gene_set=up_regulated_gene_set, 
                gene_universe_num=len(gene_universe), out_file_name='example/fisher_up.txt')
run_fisher_test(pathway_dict=pruned_pathway_dict, gene_set=down_regulated_gene_set, 
                gene_universe_num=len(gene_universe), out_file_name='example/fisher_down.txt')
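
For readers who want to see the statistic itself, a one-sided Fisher exact test for enrichment is the upper tail of the hypergeometric distribution. The sketch below computes it with the standard library only. `fisher_enrichment_p` and the toy numbers are illustrative, and the output format of `run_fisher_test` is not reproduced.

```python
from math import comb

def fisher_enrichment_p(overlap, pathway_size, gene_set_size, universe_size):
    """One-sided P(X >= overlap) under the hypergeometric distribution."""
    max_overlap = min(pathway_size, gene_set_size)
    total = comb(universe_size, gene_set_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, gene_set_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

def fisher_for_pathways(pathway_dict, gene_set, gene_universe_num):
    """Apply the test to every pathway and return {pathway: p_value}."""
    return {
        name: fisher_enrichment_p(len(genes & gene_set), len(genes),
                                  len(gene_set), gene_universe_num)
        for name, genes in pathway_dict.items()
    }

# Toy case: universe of 20 genes, pathway of 5, gene set of 4, overlap of 2
p = fisher_enrichment_p(overlap=2, pathway_size=5, gene_set_size=4, universe_size=20)
print(round(p, 4))
```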

Running enrichr is similar to running the Fisher test: it requires a set of genes of interest and a pathway file.\
To calculate the final enrichment score, enrichr requires a permutation step; **we recommend setting permutation_num to 1000**.\
If the permutation file already exists, it is reused; otherwise, a new permutation is performed. **NOTE**: the permutation step might take some time.

In [None]:
run_enrichr(pathway_dict=pruned_pathway_dict, gene_set=up_regulated_gene_set,
            gene_universe=gene_universe, out_file_name='example/enrichr_up.txt', 
            permutation_num=1000, permutation_file_name='example/enrichr_kegg_permutation_1000.txt')
run_enrichr(pathway_dict=pruned_pathway_dict, gene_set=down_regulated_gene_set,
            gene_universe=gene_universe, out_file_name='example/enrichr_down.txt', 
            permutation_num=1000, permutation_file_name='example/enrichr_kegg_permutation_1000.txt')
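
The reuse of an existing permutation file follows a load-or-compute pattern, sketched below. Note the assumptions: the null here is a simplified one (mean and standard deviation of the overlap with random gene sets) rather than the statistic enrichr actually uses, the cache is stored as JSON, and `permutation_null` and `load_or_run` are hypothetical names.

```python
import json
import os
import random
import statistics
import tempfile

def permutation_null(pathways, universe, set_size, n_perm, seed=0):
    """Mean and std of the overlap between each pathway and random gene sets."""
    rng = random.Random(seed)
    genes = sorted(universe)
    overlaps = {name: [] for name in pathways}
    for _ in range(n_perm):
        sample = set(rng.sample(genes, set_size))
        for name, members in pathways.items():
            overlaps[name].append(len(sample & members))
    return {name: [statistics.mean(v), statistics.pstdev(v)]
            for name, v in overlaps.items()}

def load_or_run(path, **kwargs):
    """Reuse the cached permutation result if present, otherwise compute and save it."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    null = permutation_null(**kwargs)
    with open(path, "w") as fh:
        json.dump(null, fh)
    return null

# Toy data: 50 genes, two pathways, random gene sets of size 8
universe = {f"g{i}" for i in range(50)}
toy_pathways = {"P1": {f"g{i}" for i in range(10)},
                "P2": {f"g{i}" for i in range(5, 30)}}
with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "permutation_100.json")
    args = dict(pathways=toy_pathways, universe=universe, set_size=8, n_perm=100)
    first = load_or_run(cache, **args)   # computed and written to disk
    second = load_or_run(cache, **args)  # read back from the cache
print(first == second)
```

Because the cache is keyed only by file name, it should be regenerated whenever the pathway file, the gene universe, or the gene set size changes.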

After getting results from the three underlying methods (GSEA, Fisher test, and enrichr), the last step is to run PET. We'll perform this step for both up- and down-regulated genes.

In [None]:
run_PET(fisher_result_file='example/fisher_up.txt', enrichr_result_file= 'example/enrichr_up.txt', 
        gsea_result_dir='example/example_test.GseaPreranked.1678566691197/', gsea_label='pos', 
        pathway_dict=pruned_pathway_dict, result_file='example/PET_output_enriched_poor_survival.txt')

run_PET(fisher_result_file='example/fisher_down.txt', enrichr_result_file= 'example/enrichr_down.txt', 
        gsea_result_dir='example/example_test.GseaPreranked.1678566691197/', gsea_label='neg', 
        pathway_dict=pruned_pathway_dict, result_file='example/PET_output_enriched_better_survival.txt')
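
The GSEA result directory name ends in a number that changes on every run, so hard-coding it in `gsea_result_dir` is fragile. The sketch below looks up the newest matching directory instead. It assumes the suffix is a numeric timestamp, as the example name suggests, and `latest_gsea_dir` is a hypothetical helper.

```python
import glob
import os
import tempfile

def latest_gsea_dir(out_dir, label):
    """Return the '<label>.GseaPreranked.<number>' directory with the largest number."""
    pattern = os.path.join(out_dir, f"{label}.GseaPreranked.*")
    candidates = [
        p for p in glob.glob(pattern)
        if os.path.isdir(p) and p.rsplit(".", 1)[1].isdigit()
    ]
    if not candidates:
        raise FileNotFoundError(f"no GSEA result directory matching {pattern}")
    return max(candidates, key=lambda p: int(p.rsplit(".", 1)[1]))

# Demo with two fake result directories
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("1678566691197", "1678566699999"):
        os.mkdir(os.path.join(tmp, f"example_test.GseaPreranked.{stamp}"))
    found_name = os.path.basename(latest_gsea_dir(tmp, "example_test"))
print(found_name)
```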


PET writes its results to a tab-delimited text file, which users can sort by their preferred criteria. Based on the benchmark we developed, the most reliable ordering is by **average rank**, which is what the PET rank is based on by default. The PET FDR, which measures significance, is also a strong indicator in the analysis.
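
To illustrate what sorting by average rank means, the sketch below ranks pathways within each method and averages those ranks. The toy p-values, the tie handling (tied pathways share the mean of their positions), and the function names are assumptions for illustration; this is not PET's implementation, and the PET FDR is not covered.

```python
def rank_ascending(scores):
    """Rank names by value, 1 = smallest; ties share the average of their positions."""
    ordered = sorted(scores, key=scores.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for name in ordered[i:j + 1]:
            ranks[name] = avg
        i = j + 1
    return ranks

def average_rank(method_pvalues):
    """Average the per-method ranks over pathways reported by every method."""
    per_method = [rank_ascending(p) for p in method_pvalues.values()]
    shared = set.intersection(*(set(r) for r in per_method))
    return {name: sum(r[name] for r in per_method) / len(per_method)
            for name in shared}

# Made-up p-values for three pathways from three methods
toy = {
    "fisher": {"A": 0.001, "B": 0.04, "C": 0.5},
    "enrichr": {"A": 0.002, "B": 0.3, "C": 0.03},
    "gsea": {"A": 0.01, "B": 0.02, "C": 0.6},
}
avg = average_rank(toy)
for name in sorted(avg, key=avg.get):
    print(name, round(avg[name], 2))
```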