This notebook is a tutorial for running PET.
From the benchmark, we identified the optimal setting for running PET, which consists of 3 steps:
1. perform differential expression analysis with DESeq2
2. perform pathway analysis with underlying methods
3. combine the result using PET

Files required:
1. Data matrix; raw read counts are strongly recommended.
2. Pathway file in GMT format.

Example data:\
An example read count matrix (text file) and a KEGG pathway file in GMT format are provided.

Notes:\
Template scripts and commands for running DESeq2 and GSEA are provided, but users are free to use any differential expression analysis method and GSEA mode, as long as the output format stays consistent so that the results can be parsed correctly.

In [None]:
import pandas as pd
%load_ext autoreload
%autoreload 2

In [124]:
from fisher_test import run_ora
from enrichr import run_enrichr
from PET import run_PET
from helper import *

In [None]:
# create result directory
out_dir = 'example_new/'
create_dir(out_dir)

**Step 1. Differential expression analysis with DESeq2**\
Based on our benchmark, the best way to run any pathway analysis method is to provide it with a pre-ranked gene list. Here, we use the p-values from the DESeq2 results as the input to the next step, since ranking by p-value performed better than the other ranking metrics in the benchmark.\
\
We provide two ways to run DESeq2:
* The original [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) in R, the default option. The template script is provided in [deseq2_template.R](https://github.com/hedgehug/PET/blob/main/deseq2_template.R). When method='R' is specified, a new R script tailored to the user input is written and executed.
* [PyDESeq2](https://pydeseq2.readthedocs.io/en/latest/) in Python, used when method='Python' is specified.

\
Please note that we did observe ***inconsistency*** between the results of DESeq2 in R and PyDESeq2, so we suggest keeping the default setting, the R version of DESeq2.
\
Note: **Raw read counts** are highly recommended for DESeq2. Please format the gene expression matrix as a tab-delimited text file in which the first column holds the gene names and the remaining columns hold the expression values for cond1, cond2, ..., condN **in order**. In the contrast file, list the conditions in that same order, one per row, in the format 'condition_name\tsample_num'.
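
As a rough illustration of the two input files described above, the sketch below writes a toy expression matrix and a matching contrast file, then checks that the number of sample columns equals the total declared in the contrast file. The gene names, counts, file names, and the presence of a header row are all assumptions made for this example; please compare against the provided example files for the exact layout PET expects.

```python
import csv
import os
import tempfile

# Hypothetical toy layout: 2 conditions with 3 replicates each, columns grouped in order.
conditions = [("cond1", 3), ("cond2", 3)]
genes = {
    "GENE_A": [10, 12, 11, 30, 28, 33],  # made-up raw read counts
    "GENE_B": [5, 7, 6, 5, 6, 4],
}

def write_inputs(out_dir):
    """Write a toy expression matrix and contrast file; return their paths."""
    expr_path = os.path.join(out_dir, "toy_expression.txt")
    contrast_path = os.path.join(out_dir, "toy_contrast.txt")

    # Header (assumed): gene column, then one column per sample, grouped by condition.
    header = ["gene"]
    for name, n in conditions:
        header += [f"{name}_rep{i + 1}" for i in range(n)]

    with open(expr_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for gene, counts in genes.items():
            writer.writerow([gene] + counts)

    # Contrast file: one 'condition_name<TAB>sample_num' row per condition, in column order.
    with open(contrast_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        for name, n in conditions:
            writer.writerow([name, n])
    return expr_path, contrast_path

def check_inputs(expr_path, contrast_path):
    """Return True if every row has one gene column plus the declared number of samples."""
    with open(contrast_path, newline="") as fh:
        total = sum(int(row[1]) for row in csv.reader(fh, delimiter="\t") if row)
    with open(expr_path, newline="") as fh:
        rows = list(csv.reader(fh, delimiter="\t"))
    return all(len(row) == total + 1 for row in rows)

with tempfile.TemporaryDirectory() as tmp:
    expr_path, contrast_path = write_inputs(tmp)
    inputs_ok = check_inputs(expr_path, contrast_path)
    print(inputs_ok)
```

A quick check like this can catch a mismatch between the contrast file and the matrix before DESeq2 is run.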

In [None]:
# run differential expression analysis through default DESeq2 R version

# By default, PET will save all comparisons results to result_dir_path
run_DEseq2(expr_file=out_dir+'example_data.txt', contrast_matrix_file='example_contrast.txt', result_dir_path=out_dir, method='R', script_file_name=out_dir+'deseq_analysis.all.R')

# to save only one comparison, please provide groups=[condition1, condition2] and result_file_path
run_DEseq2(expr_file=out_dir+'example_data.txt', contrast_matrix_file='example_contrast.txt', groups= ['cond2', 'cond1'], result_file_path=out_dir+'cond2.vs.cond1.deseq_result.txt', method='R', script_file_name=out_dir+'deseq_analysis.cond2.vs.cond1.R')

In [None]:
# To run PyDESeq2, similar to DESeq2 R version, please specify method='Python'

# By default, PET will save all comparisons results to result_dir_path
run_DEseq2(expr_file=out_dir+'example_data.txt', contrast_matrix_file='example_contrast.txt', result_dir_path=out_dir, method='Python', cpu_num=10)

# to save only one comparison, please provide groups=[condition1, condition2] and result_file_path
run_DEseq2(expr_file=out_dir+'example_data.txt', contrast_matrix_file='example_contrast.txt', groups= ['cond2', 'cond1'], result_file_path=out_dir+'cond2.vs.cond1.deseq_result.txt', method='Python', cpu_num=10)

In [None]:
# convert the DESeq2 result (CSV file only) to a .rnk file for GSEA; please check the direction of the DESeq2 result (log2FoldChange column)
# if log2FoldChange >= 0 means up-regulation, keep direction=1; otherwise, pass direction=-1

generate_rank_file(deseq_result_file=out_dir+'cond1.vs.cond2.csv', out_file=out_dir+'cond1.vs.cond2.rnk', direction=1)

**Step 2: Run GSEA**\
Based on our benchmark, the best practice is to run GSEA in preranked mode, with -log10(p-value)*sign(logFC) as the weight for each gene. We provide two options for running GSEA:
* [GSEAPY](https://gseapy.readthedocs.io/en/latest/introduction.html), default option.
* [GSEA command line (all platforms)](http://www.gsea-msigdb.org/gsea/downloads.jsp).

The results of the two approaches are highly consistent. If you use the GSEA command line version, please specify the path to gsea-cli.sh.
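
To make the ranking weight concrete, here is a minimal sketch of the -log10(p-value)*sign(logFC) score with a `direction` switch like the one described for the rank file. The function name `rank_score`, the clamping of zero p-values, and the toy values are assumptions for illustration; this is not the code inside `generate_rank_file`.

```python
import math

def rank_score(pvalue, log2fc, direction=1):
    """Signed score: -log10(p-value) * sign(log2FC) * direction."""
    # Clamp p-values of 0 to a tiny positive number so log10 stays finite (an assumption).
    p = max(pvalue, 1e-300)
    sign = (log2fc > 0) - (log2fc < 0)  # +1, 0 or -1
    return -math.log10(p) * sign * direction

# Made-up values for illustration: (gene, p-value, log2FoldChange)
toy = [("GENE_A", 0.001, 2.0), ("GENE_B", 0.5, -0.3), ("GENE_C", 0.01, -1.5)]

# Sort from the most up-regulated to the most down-regulated gene.
ranked = sorted(
    ((gene, rank_score(p, fc)) for gene, p, fc in toy),
    key=lambda item: item[1],
    reverse=True,
)
for gene, score in ranked:
    print(f"{gene}\t{score:.3f}")
```

Strongly significant up-regulated genes end up at the top of the list and strongly significant down-regulated genes at the bottom, which is the ordering a preranked GSEA run works from.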

In [None]:
# To run GSEA with GSEAPY, default option
run_GSEA(prerank_file_path=out_dir+'cond1.vs.cond2.rnk', out_dir=out_dir, thread_num=10,
         pathway_file='example_new/c2.cp.kegg.v2023.1.Hs.symbols.gmt', plot=False,
         min_size=15, max_size=500)

In [None]:
# To run GSEA command line, please specify gsea_cli_path
run_GSEA(prerank_file_path=out_dir+'cond1.vs.cond2.rnk', out_dir=out_dir,
         pathway_file='example_new/c2.cp.kegg.v2023.1.Hs.symbols.gmt', gsea_out_label='cond1.vs.cond2',
         min_size=15, max_size=500, gsea_cli_path='/Users/luopin/GSEA_test/GSEA_cmd/gsea-cli.sh', method='cli')

Before running the other methods, we prune the pathways: after removing any gene that is not present in the expression matrix, pathways with gene_num > max_num or gene_num < min_num are removed.
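
The pruning rule can be sketched as follows. This is a simplified stand-in, not the implementation of `prune_gmt`: the helper names and toy pathways are hypothetical, and the size bounds are set very small so the effect is visible.

```python
def parse_gmt_line(line):
    """GMT rows are tab-separated: pathway name, description, then gene symbols."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[2:]

def prune_pathways(pathways, expressed_genes, min_size=15, max_size=500):
    """Keep pathways whose size, after dropping unexpressed genes, is within bounds."""
    expressed = set(expressed_genes)
    pruned = {}
    for name, genes in pathways.items():
        kept = sorted(set(genes) & expressed)
        if min_size <= len(kept) <= max_size:
            pruned[name] = kept
    return pruned

# Toy GMT content (made up).
lines = [
    "PATH_A\tna\tG1\tG2\tG3\n",
    "PATH_B\tna\tG1\tG9\n",
    "PATH_C\tna\tG8\tG9\n",
]
pathways = dict(parse_gmt_line(line) for line in lines)

pruned = prune_pathways(
    pathways, expressed_genes=["G1", "G2", "G3", "G4"], min_size=2, max_size=3
)
print(pruned)
```

Only PATH_A survives here: PATH_B keeps a single expressed gene and PATH_C keeps none, so both fall below the minimum size.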

In [None]:
# please keep the gene number settings the same as in the GSEA call
pruned_pathway_dict, gene_universe = prune_gmt(file_name='example_new/c2.cp.kegg.v2023.1.Hs.symbols.gmt', 
                                        out_file_name='example_new/c2.cp.kegg.v2023.1.Hs.symbols.cleaned.gmt', 
                                        expr_matrix_file='example_new/example_data.txt', 
                                        min_gene_num=15, max_gene_num=500)

The Fisher test (ORA) takes a set of genes of interest and a pathway file (here, as a dict only). We perform this step for both the top up-regulated and the top down-regulated genes, sorted by DESeq2 p-value.
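
For readers unfamiliar with over-representation analysis, the sketch below computes the one-sided hypergeometric tail probability, which is equivalent to a one-sided Fisher exact test on the 2x2 overlap table. The function name and the numbers are hypothetical, and `run_ora` may differ in its details (for example in how the gene universe is counted or how p-values are adjusted).

```python
from math import comb

def ora_pvalue(overlap, pathway_size, deg_num, universe_size):
    """P(X >= overlap) when deg_num genes are drawn from the universe without replacement."""
    upper = min(pathway_size, deg_num)
    total = comb(universe_size, deg_num)
    tail = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, deg_num - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total

# Made-up numbers: 1000 genes in the universe, a pathway of 50 genes,
# 40 genes of interest, 6 of which fall in the pathway.
p = ora_pvalue(overlap=6, pathway_size=50, deg_num=40, universe_size=1000)
print(f"{p:.4g}")
```

A small p-value means the pathway contains more genes of interest than expected by chance given the pathway size and the size of the gene universe.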

In [75]:
# for extracting the top DEGs, we'll query the DESeq2 result file
# please adjust the thresholds as needed and keep the same direction as before
# since the samples in the toy example are randomly sampled replicates, we keep very loose thresholds
deg_dict = extract_top_gene(deseq_result_file=out_dir+'cond1.vs.cond2.csv', 
                            num_gene=200, pval_threshold=0.5, padj_threshold=0.5,
                            fc_threshold=1, basemean_threshold=2,
                            direction=1)


Based on:
p-value <=  0.5
adjusted p-value <=  0.5
fold change threshold >=  1
base Mean >=  2
top N =  200
40 up-regulated DEGs
14 down-regulated DEGs


In [127]:
# run ORA for both sets of genes against the pathway file
run_ora(pathway_dict=pruned_pathway_dict, deg_dict = deg_dict,
                gene_universe_num=len(gene_universe), out_dir=out_dir, out_file_prefix='cond1.vs.cond2')


Running ORA for genes in  up
Results written to  example_new//cond1.vs.cond2.up.ora_result.txt
********************
Running ORA for genes in  down
Results written to  example_new//cond1.vs.cond2.down.ora_result.txt
********************


Running enrichr is similar to running the Fisher test: it requires a set of genes of interest and a pathway file.\
To calculate the final enrichment score, enrichr requires a permutation step; **we recommend setting permutation_num to 1000**.\
If the permutation file already exists, it is reused; if not, a new permutation is performed. **NOTE**: the permutation step might take some time.
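
The reuse-or-recompute behaviour of the permutation file can be pictured with the caching sketch below. What PET actually stores in its permutation file, and in which format, is not described here; random gene sets saved as JSON are a stand-in, and all names are hypothetical.

```python
import json
import os
import random
import tempfile

def load_or_run_permutation(path, gene_universe, set_size, perm_num, seed=0):
    """Reuse a saved permutation file when present; otherwise generate and save one."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh), True  # True: loaded from disk
    rng = random.Random(seed)
    universe = sorted(gene_universe)  # fixed order so the seed gives reproducible draws
    perms = [rng.sample(universe, set_size) for _ in range(perm_num)]
    with open(path, "w") as fh:
        json.dump(perms, fh)
    return perms, False  # False: freshly computed

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_permutation.json")
    universe = {f"G{i}" for i in range(50)}
    first, first_cached = load_or_run_permutation(path, universe, set_size=5, perm_num=1000)
    second, second_cached = load_or_run_permutation(path, universe, set_size=5, perm_num=1000)
    print(first_cached, second_cached, first == second)
```

Because the file is reused whenever it exists, it makes sense to keep one permutation file per pathway collection and permutation number, as the file name in the cell below suggests.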

In [None]:
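# NOTE: up_regulated_gene_set and down_regulated_gene_set are not defined earlier in this notebook;
# set them to the up- and down-regulated gene sets extracted above (e.g. from deg_dict) before running this cell
run_enrichr(pathway_dict=pruned_pathway_dict, gene_set=up_regulated_gene_set,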
            gene_universe=gene_universe, out_file_name='example/enrichr_up.txt', 
            permutation_num=1000, permutation_file_name='example/enrichr_kegg_permutation_1000.txt')
run_enrichr(pathway_dict=pruned_pathway_dict, gene_set=down_regulated_gene_set,
            gene_universe=gene_universe, out_file_name='example/enrichr_down.txt', 
            permutation_num=1000, permutation_file_name='example/enrichr_kegg_permutation_1000.txt')

After getting results from three underlying methods, the last step is to run PET. We'll perform this step for both up and down-regulated genes. 

In [None]:
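# NOTE: the paths below are placeholders; point them to the ORA, enrichr and GSEA outputs generated in the previous steps
run_PET(fisher_result_file='example/fisher_up.txt', enrichr_result_file= 'example/enrichr_up.txt', 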
        gsea_result_dir='example/example_test.GseaPreranked.1678566691197/', gsea_label='pos', 
        pathway_dict=pruned_pathway_dict, result_file='example/PET_output_enriched_poor_survival.txt')

run_PET(fisher_result_file='example/fisher_down.txt', enrichr_result_file= 'example/enrichr_down.txt', 
        gsea_result_dir='example/example_test.GseaPreranked.1678566691197/', gsea_label='neg', 
        pathway_dict=pruned_pathway_dict, result_file='example/PET_output_enriched_better_survival.txt')


The result of PET is a tab-delimited text file, which users can sort by their preferred criteria. Based on the benchmark we developed, the most reliable approach is to sort the pathways by **average rank**, which by default generates the PET rank. The PET FDR, a measure of significance, is also a strong indicator in the analysis.
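
To illustrate what sorting by average rank means, the sketch below ranks a few pathways within each method by p-value and then averages the ranks across methods. It is a toy illustration with made-up p-values and hypothetical function names, not PET's implementation, and it does not reproduce the PET FDR.

```python
def tied_ranks(pvalues):
    """Rank by ascending p-value (1 = most significant); ties share their average rank."""
    ordered = sorted(pvalues, key=pvalues.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of pathways with the same p-value.
        while j + 1 < len(ordered) and pvalues[ordered[j + 1]] == pvalues[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks

def average_rank(method_pvalues):
    """Average each pathway's rank over all methods."""
    per_method = [tied_ranks(p) for p in method_pvalues.values()]
    names = per_method[0].keys()
    return {n: sum(r[n] for r in per_method) / len(per_method) for n in names}

# Made-up p-values for three pathways from three methods.
toy = {
    "ora": {"PATH_A": 0.01, "PATH_B": 0.20, "PATH_C": 0.03},
    "enrichr": {"PATH_A": 0.02, "PATH_B": 0.50, "PATH_C": 0.01},
    "gsea": {"PATH_A": 0.001, "PATH_B": 0.04, "PATH_C": 0.30},
}
avg = average_rank(toy)
order = sorted(avg, key=avg.get)
print(order)
```

In this toy case PATH_A ranks first or second in every method and therefore comes out on top, which is the kind of cross-method agreement that an average rank rewards.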