## Apply enrichment method

This notebook plugs in other gene set enrichment methods to demonstrate that our method, SOPHIE, can be inserted into different pipelines and work with other enrichment methods.

In [1]:
%load_ext autoreload
%load_ext rpy2.ipython
%autoreload 2

import os
import sys
import pandas as pd
import numpy as np
import pickle

from rpy2.robjects import pandas2ri
pandas2ri.activate()

from ponyo import utils
from generic_expression_patterns_modules import ranking

np.random.seed(123)


In [2]:
# Read in config variables
base_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))

config_filename = os.path.abspath(
    os.path.join(base_dir, "configs", "config_human_general.tsv")
)

params = utils.read_config(config_filename)

In [3]:
# Load params
local_dir = params["local_dir"]
project_id = params['project_id']
statistic = params['gsea_statistic']
hallmark_DB_filename = params["pathway_DB_filename"]
num_runs = params["num_simulated"]
dataset_name = params['dataset_name']

# Select enrichment method
# enrichment_method is one of ["GSEA", "GSVA", "ROAST", "CAMERA", "ORA"]
# If enrichment_method == "GSEA" then use "padj" to rank
# If enrichment_method == "GSVA" then use "ES" to rank
# If enrichment_method == "ROAST" or "CAMERA" then use "FDR" to rank
# If enrichment_method == "ORA" then use "p.adjust" to rank
enrichment_method = "GSVA"
col_to_rank_pathways = "ES"
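
The pairing between enrichment method and ranking column listed in the comments above can be kept in a single lookup, so the two settings cannot drift apart. This is a minimal sketch: `RANK_COLUMN_BY_METHOD` and `get_rank_column` are hypothetical names introduced for illustration and are not part of the `ranking` module.

```python
# Hypothetical helper (not part of the notebook's modules):
# maps each enrichment method to the column used to rank pathways.
RANK_COLUMN_BY_METHOD = {
    "GSEA": "padj",
    "GSVA": "ES",
    "ROAST": "FDR",
    "CAMERA": "FDR",
    "ORA": "p.adjust",
}


def get_rank_column(method):
    """Return the ranking column for `method`, or raise if the method is unknown."""
    try:
        return RANK_COLUMN_BY_METHOD[method]
    except KeyError:
        raise ValueError(
            f"Unknown enrichment method {method!r}; "
            f"expected one of {sorted(RANK_COLUMN_BY_METHOD)}"
        ) from None


enrichment_method = "GSVA"
col_to_rank_pathways = get_rank_column(enrichment_method)
print(col_to_rank_pathways)
```

A misspelled method name then fails immediately, rather than later when the ranking step looks for a column that is not in the output.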

In [4]:
# Load DE stats directory
DE_stats_dir = os.path.join(local_dir, "DE_stats")

# Template experiment gene expression
template_expression_filename = os.path.join(base_dir, dataset_name, params["processed_template_filename"])

# Template experiment DE stats
template_DE_stats_filename = os.path.join(
    DE_stats_dir,
    f"DE_stats_template_data_{project_id}_real.txt"
)

# Metadata file with sample grouping to define comparison
metadata_filename = os.path.join(
    base_dir,
    dataset_name,
    "data",
    "metadata",
    f"{project_id}_groups.tsv"
)

## Enrichment methods
* [ROAST](https://pubmed.ncbi.nlm.nih.gov/20610611/) (rotation gene set tests) performs focused gene set testing, in which interest centers on a few gene sets rather than a large collection of gene sets (available in limma).
* [CAMERA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3458527/) (Correlation Adjusted MEan RAnk gene set test) estimates the variance inflation factor associated with inter-gene correlation and incorporates it into parametric or rank-based test procedures (available in limma).
* [GSVA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3618321/) (Gene Set Variation Analysis) calculates sample-wise gene set enrichment scores as a function of genes inside and outside the gene set. This method is well-suited for assessing gene set variation across a dichotomous phenotype (Bioconductor package GSVA).
* [ORA](https://www.rdocumentation.org/packages/clusterProfiler/versions/3.0.4/topics/enricher) (over-representation analysis) uses the hypergeometric test to determine whether a pathway is significantly over-represented in the selected set of differentially expressed genes (DEGs). Here we use the clusterProfiler library, but there are multiple options for this analysis. See [slide 6](https://docs.google.com/presentation/d/1t4rK7UiLAeIKIzeRJK-YzspNUfGM-8nuRCcevh2lx34/edit?usp=sharing)
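
To make the hypergeometric test behind ORA concrete, here is a minimal sketch of the upper-tail p-value for one pathway. It is not the clusterProfiler implementation; the function name `ora_p_value` and the toy numbers are assumptions made for illustration, and no multiple-testing correction is applied.

```python
from math import comb


def ora_p_value(n_background, n_pathway, n_de, n_overlap):
    """Upper-tail hypergeometric p-value: P(overlap >= n_overlap).

    n_background: number of genes in the background (universe)
    n_pathway:    number of background genes annotated to the pathway
    n_de:         number of differentially expressed genes (DEGs)
    n_overlap:    number of DEGs that belong to the pathway
    """
    max_overlap = min(n_pathway, n_de)
    total = comb(n_background, n_de)
    # Sum the probability of every overlap at least as large as the observed one
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_de - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (made up for illustration):
# 10 of 200 DEGs fall in a 50-gene pathway, out of 5000 background genes.
p = ora_p_value(n_background=5000, n_pathway=50, n_de=200, n_overlap=10)
print(f"p = {p:.3g}")
```

With 200 DEGs out of 5000 genes, a 50-gene pathway would contain about 2 DEGs by chance, so an overlap of 10 yields a small p-value.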

In [5]:
# Create "<local_dir>/GSA_stats/" subdirectory
os.makedirs(os.path.join(local_dir, "GSA_stats"), exist_ok=True)

In [6]:
# Load pathway data
hallmark_DB_filename = params["pathway_DB_filename"]

**Apply enrichment to template experiment**

See supplementary tables: https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbz158/5722384

In [7]:
%%R -i base_dir -i local_dir -i project_id -i template_expression_filename -i hallmark_DB_filename -i metadata_filename -i enrichment_method -o template_enriched_pathways

source(paste0(base_dir, '/generic_expression_patterns_modules/other_enrichment_methods.R'))

out_filename <- paste(local_dir, 
                      "GSA_stats/",
                      enrichment_method,
                      "_stats_template_data_",
                      project_id,
                      "_real.txt", 
                      sep = "")

if (enrichment_method == "GSVA"){
    
    template_enriched_pathways <- find_enriched_pathways_GSVA(
        template_expression_filename,
        hallmark_DB_filename
    )
} else if (enrichment_method == "ROAST"){
    
    template_enriched_pathways <- find_enriched_pathways_ROAST(
        template_expression_filename,
        metadata_filename,
        hallmark_DB_filename
    )
} else if (enrichment_method == "CAMERA"){
    
    template_enriched_pathways <- find_enriched_pathways_CAMERA(
        template_expression_filename,
        metadata_filename,
        hallmark_DB_filename
    )
} else if (enrichment_method == "ORA"){
    
    template_enriched_pathways <- find_enriched_pathways_ORA(
        template_expression_filename,
        metadata_filename, 
        hallmark_DB_filename
    )
}
write.table(as.data.frame(template_enriched_pathways), file = out_filename, row.names = F, sep = "\t")


Estimating GSVA scores for 50 gene sets.
Computing observed enrichment scores
Estimating ECDFs with Poisson kernels
Using parallel with 6 cores

In [8]:
# Quick check
print(template_enriched_pathways.shape)
template_enriched_pathways

NameError: name 'template_enriched_pathways' is not defined

**Apply enrichment to simulated experiments**

Note: GSA takes a while to run (for 2 simulated experiments it took ~30 minutes)

In [None]:
%%R -i project_id -i local_dir -i hallmark_DB_filename -i metadata_filename -i num_runs -i base_dir -i enrichment_method

source(paste0(base_dir, '/generic_expression_patterns_modules/other_enrichment_methods.R'))

for (i in 0:(num_runs-1)){
    simulated_expression_filename <- paste(local_dir, 
                                           "pseudo_experiment/selected_simulated_data_",
                                           project_id,
                                           "_", 
                                           i,
                                           "_processed.txt",
                                           sep = "")

    out_filename <- paste(local_dir,
                          "GSA_stats/",
                          enrichment_method,
                          "_stats_simulated_data_",
                          project_id,
                          "_",
                          i,
                          ".txt", 
                          sep = "")
    
    if (enrichment_method == "GSVA"){
        enriched_pathways <- find_enriched_pathways_GSVA(
            simulated_expression_filename, 
            hallmark_DB_filename
        ) 
        write.table(as.data.frame(enriched_pathways), file = out_filename, row.names = F, sep = "\t")
        print("in GSVA")
    }
    else if (enrichment_method == "ROAST"){
        enriched_pathways <- find_enriched_pathways_ROAST(
            simulated_expression_filename,
            metadata_filename,
            hallmark_DB_filename
        ) 
        write.table(as.data.frame(enriched_pathways), file = out_filename, row.names = F, sep = "\t")
        print("in ROAST")
    }
    else if (enrichment_method == "CAMERA"){
        enriched_pathways <- find_enriched_pathways_CAMERA(
            simulated_expression_filename,
            metadata_filename, 
            hallmark_DB_filename
        ) 
        write.table(as.data.frame(enriched_pathways), file = out_filename, row.names = F, sep = "\t")
        print("in CAMERA")
    }
    else if (enrichment_method == "ORA"){
        enriched_pathways <- find_enriched_pathways_ORA(
            simulated_expression_filename,
            metadata_filename, 
            hallmark_DB_filename
        ) 
        write.table(as.data.frame(enriched_pathways), file = out_filename, row.names = F, sep = "\t")
        print("in ORA")
    }
}
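
Because the loop above is slow, it can help to check which simulated experiments already have an output file before rerunning it. The sketch below rebuilds the input and output file names used in the R loop. The helper names `simulated_paths` and `runs_missing_output` and the placeholder arguments are hypothetical; in the notebook the real values come from `params`.

```python
import os


def simulated_paths(local_dir, project_id, enrichment_method, num_runs):
    """Yield (run index, input path, output path), mirroring the names built in the R loop."""
    for i in range(num_runs):
        in_path = os.path.join(
            local_dir,
            "pseudo_experiment",
            f"selected_simulated_data_{project_id}_{i}_processed.txt",
        )
        out_path = os.path.join(
            local_dir,
            "GSA_stats",
            f"{enrichment_method}_stats_simulated_data_{project_id}_{i}.txt",
        )
        yield i, in_path, out_path


def runs_missing_output(local_dir, project_id, enrichment_method, num_runs):
    """Return indices of simulated experiments whose output file does not exist yet."""
    return [
        i
        for i, _, out_path in simulated_paths(
            local_dir, project_id, enrichment_method, num_runs
        )
        if not os.path.exists(out_path)
    ]


# Placeholder values for illustration only
missing = runs_missing_output("example_local_dir", "EXAMPLE_ID", "GSVA", 3)
print(missing)
```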

## Format enrichment output

Each method yields a different output format, so we need to format the data before we can rank and summarize it.

In [None]:
%%R -i hallmark_DB_filename -o hallmark_DB_names
library("GSA")

hallmark_DB <- GSA.read.gmt(hallmark_DB_filename)

hallmark_DB_names <- as.data.frame(hallmark_DB$geneset.names)

In [None]:
ranking.format_enrichment_output(
    local_dir, 
    project_id, 
    enrichment_method, 
    hallmark_DB_names,
    num_runs
)

## Rank pathways

In [None]:
analysis_type = "GSA"

template_GSEA_stats_filename = os.path.join(
    local_dir,
    "GSA_stats",
    f"{enrichment_method}_stats_template_data_{project_id}_real.txt"    
)
template_GSEA_stats, simulated_GSEA_summary_stats = ranking.process_and_rank_genes_pathways(
    template_GSEA_stats_filename,
    local_dir,
    num_runs,
    project_id,
    analysis_type,
    col_to_rank_pathways,
    enrichment_method
)

In [None]:
simulated_GSEA_stats_filename = os.path.join(
    local_dir,
    "GSA_stats",
    f"{enrichment_method}_stats_simulated_data_{project_id}_0.txt")
    
simulated = pd.read_csv(simulated_GSEA_stats_filename, sep="\t")
simulated.head()

## Pathway summary table

In [None]:
# Create intermediate file: "<local_dir>/gene_summary_table_<col_to_rank_pathways>.tsv"
summary_pathway_ranks = ranking.generate_summary_table(
    template_GSEA_stats_filename,
    template_GSEA_stats,
    simulated_GSEA_summary_stats,
    col_to_rank_pathways,
    local_dir,
    'pathway',
    params
)

summary_pathway_ranks.sort_values(by="Z score", ascending=False).head()
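
The table above is sorted by its "Z score" column. As a rough illustration of what such a score expresses, the sketch below compares one pathway's template statistic with the distribution of the same statistic across simulated experiments. This assumes the score is (template value minus simulated mean) divided by the simulated standard deviation; the exact computation inside `ranking.generate_summary_table` is not shown in this notebook, and the function name `z_score`, the use of the sample standard deviation, and the toy values are all assumptions.

```python
from statistics import mean, stdev


def z_score(template_value, simulated_values):
    """Z score of the template statistic relative to the simulated distribution."""
    mu = mean(simulated_values)
    sigma = stdev(simulated_values)  # sample standard deviation (assumed choice)
    if sigma == 0:
        return float("nan")  # undefined when the simulated values do not vary
    return (template_value - mu) / sigma


# Toy values (made up): one pathway's statistic in the template experiment
# and in 5 simulated experiments.
template_es = 0.6
simulated_es = [0.1, 0.2, 0.15, 0.25, 0.05]
print(round(z_score(template_es, simulated_es), 2))
```

A large positive score means the pathway's statistic in the template experiment is well above what the simulated experiments typically produce.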

In [None]:
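# Create `pathway_summary_filename` (not defined earlier in this notebook; the path below is an assumed placeholder)
pathway_summary_filename = os.path.join(local_dir, f"pathway_summary_{project_id}_{enrichment_method}.tsv")
summary_pathway_ranks.to_csv(pathway_summary_filename, sep='\t')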