In [None]:
from sctoolbox.utils.jupyter import bgcolor, _compare_version

# change the background of input cells
bgcolor("PowderBlue", select=[2, 6, 8, 11, 14])

nb_name = "GSEA.ipynb"

_compare_version(nb_name)

# Gene Set Enrichment Analysis (GSEA)
<hr style="border:2px solid black"> </hr>

## 1 - Description

**Requires an anndata object with precomputed marker genes. Marker genes can be generated with the marker gene notebook (`general_notebooks/group_markers.ipynb`).**

**Move this notebook into the notebook folder (e.g. `rna_analysis/notebooks/`) of the respective analysis before using it!**

The main function of this notebook is to perform enrichment analysis, described as "\[...\] a computational method for inferring knowledge about an input gene set by comparing it to annotated gene sets representing prior biological knowledge."\[[source](https://maayanlab.cloud/Enrichr/help#background)\]. In other words, the goal is to collect enriched GO pathways for clusters of cells (e.g. cell types) based on cluster-defining sets of marker genes.  
Available methods in this notebook are [Enrichr](https://maayanlab.cloud/Enrichr/) and [GSEA prerank](https://www.genepattern.org/modules/docs/GSEAPreranked/1#gsc.tab=0), both of which are implemented in [GSEApy](https://github.com/zqfang/GSEApy).
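
To make the idea of over-representation concrete, here is a minimal sketch of a hypergeometric tail test on the overlap between a marker gene list and one annotated gene set. It is an illustration only: the function name and its arguments are hypothetical, and neither Enrichr nor GSEApy is claimed to compute its statistics exactly this way.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, query_size, background_size):
    """Probability of seeing at least `overlap` annotated genes by chance.

    overlap:         marker genes that are also in the gene set
    set_size:        genes in the annotated gene set
    query_size:      marker genes of the cluster
    background_size: all genes that could have been drawn
    """
    total = comb(background_size, query_size)
    upper = min(set_size, query_size)
    # Sum the probabilities of every overlap that is at least as large.
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 8 of 50 marker genes fall into a 200-gene pathway,
# drawn from a background of 20000 genes.
print(hypergeom_pvalue(8, 200, 50, 20000))
```

A small p-value means the overlap is larger than expected by chance. Real tools additionally correct for testing many gene sets at once.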

---

## 2 - Setup

In [None]:
import sctoolbox.utils as utils
import sctoolbox.tools as tools
import sctoolbox.plotting as pl
from sctoolbox import settings

import pandas as pd
import gseapy as gp
import tqdm
import matplotlib.pyplot as plt
import warnings

---

## 3 - General Input

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
# sctoolbox settings
settings.adata_input_dir = "../adatas/"
settings.adata_output_dir = "../adatas/"
settings.figure_dir = "../figures/GSEA/"
settings.log_file = "../logs/GSEA_log.txt"
last_notebook_adata = "anndata_marker.h5ad"

# Define the dataset species!
organism = "human"

---

## 4 - Load anndata

In [None]:
adata = utils.adata.load_h5ad(last_notebook_adata)

with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    display(adata)
    display(adata.obs)
    display(adata.var)

---

## 5 - Select library

**Molecular Function**  
Molecular-level activities performed by gene products. Molecular function terms describe activities that occur at the molecular level, such as “catalysis” or “transport”. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when, or in what context the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products (i.e. a protein or RNA), but some activities are performed by molecular complexes composed of multiple gene products. Examples of broad functional terms are catalytic activity and transporter activity; examples of narrower functional terms are adenylate cyclase activity or Toll-like receptor binding. To avoid confusion between gene product names and their molecular functions, GO molecular functions are often appended with the word “activity” (a protein kinase would have the GO molecular function protein kinase activity).  

**Cellular Component**  
A location, relative to cellular compartments and structures, occupied by a macromolecular machine. There are two ways in which the gene ontology describes locations of gene products: (1) the cellular anatomical entities, in which a gene product carries out a molecular function. Cellular anatomical entities includes cellular structures such as the plasma membrane and the cytoskeleton, as well as membrane-enclosed cellular compartments such as the mitochondrion, and (2) the stable macromolecular complexes of which they are parts, e.g., the clathrin complex.  

**Biological Process**  
The larger processes, or ‘biological programs’ accomplished by multiple molecular activities. Examples of broad biological process terms are DNA repair or signal transduction. Examples of more specific terms are pyrimidine nucleobase biosynthetic process or glucose transmembrane transport. Note that a biological process is not equivalent to a pathway. At present, the GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.

-- https://geneontology.org/docs/ontology-documentation/

**List of available libraries**

In [None]:
[db for db in gp.get_library_name(organism) if db.startswith("GO")]

**List of available `marker_keys`**

In [None]:
[k for k in adata.uns.keys() if k.startswith("rank_")]
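
The marker key selected in the next cell should belong to the chosen clustering. A minimal sketch of such a consistency check is shown below. The helper `marker_key_matches` is hypothetical (not part of sctoolbox) and assumes the key ends with the clustering name, optionally followed by `_filtered`.

```python
def marker_key_matches(marker_key, clustering, suffix="_filtered"):
    """Return True if `marker_key` appears to belong to `clustering`.

    Assumption: keys look like <prefix>_<clustering> or
    <prefix>_<clustering>_filtered.
    """
    if marker_key.endswith(suffix):
        base = marker_key[: -len(suffix)]
    else:
        base = marker_key
    return base.endswith("_" + clustering)


print(marker_key_matches("rank_feature_leiden_0.1_filtered", "leiden_0.1"))
```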

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
# key for marker table in adata.uns
marker_key = "rank_feature_leiden_0.1_filtered"  # The marker_key should match the clustering below.
clustering = "leiden_0.1"  # The marker_key is usually formatted as rank_genes_<clustering> or rank_genes_<clustering>_filtered.
pvals_adj_tresh = 0.05

# Select method. Available options: 'prerank', 'enrichr'
method = "prerank"

# Choose public library to use
library_name = "GO_Biological_Process_2023"

# If custom gene sets and background should be used set here.
# The public library will be ignored if custom_gene_set is given.
custom_gene_set = None  # {"Pathway 1": ["Gene1", "Gene2",...], ...}

# enrichr specific parameters
# To use a custom background for the public gene set library only set custom_background.
custom_background = None  # ["Gene 1", "Gene 2", ....]

# prerank specific parameters
threads = 4  # Number of threads used by the prerank function
min_size = 5  # Minimum allowed number of genes from a gene set that are also in the data set.
max_size = 1000  # Maximum allowed number of genes from a gene set that are also in the data set.
permutation_num = 1000  # Number of permutations.
seed = 0  # Seed for prerank run
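
The effect of `min_size` and `max_size` can be pictured as a filter on the overlap between each gene set and the genes present in the data set. The following is a minimal sketch under that reading. `filter_gene_sets` is a hypothetical helper, not the GSEApy implementation, and it uses the same dictionary layout as `custom_gene_set`.

```python
def filter_gene_sets(gene_sets, data_genes, min_size=5, max_size=1000):
    """Keep gene sets whose overlap with the data set is within the limits.

    gene_sets:  {"Pathway 1": ["Gene1", "Gene2", ...], ...}
    data_genes: iterable of gene names present in the data set
    """
    data_genes = set(data_genes)
    kept = {}
    for name, genes in gene_sets.items():
        overlap = data_genes.intersection(genes)
        if min_size <= len(overlap) <= max_size:
            kept[name] = sorted(overlap)
    return kept


# Hypothetical toy input.
example_sets = {"Pathway 1": ["A", "B", "C"], "Pathway 2": ["A", "X"]}
print(filter_gene_sets(example_sets, ["A", "B", "C", "D"], min_size=2, max_size=3))
```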

---

## 6 - Run gene set enrichment analysis

In [None]:
combined = tools.gsea.gene_set_enrichment(adata,
                                          method=method,
                                          marker_key=marker_key,
                                          organism=organism,
                                          pvals_adj_tresh=pvals_adj_tresh,
                                          gene_sets=custom_gene_set,
                                          background=custom_background,
                                          library_name=library_name,
                                          seed=seed)
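
For intuition on what the `prerank` method measures, here is a simplified sketch of the running-sum enrichment score used by preranked GSEA. It is not the GSEApy code: the function name and the `weight` argument are assumptions for illustration, and permutation testing and normalization (NES) are left out.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Maximum deviation from zero of a weighted running sum.

    ranked:   list of (gene, score) tuples, sorted from highest to lowest score
    gene_set: set of gene names that form the pathway
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    norm = sum(abs(score) ** weight for (_, score), hit in zip(ranked, hits) if hit)
    if norm == 0 or n_miss == 0:
        return 0.0  # no usable hits, or nothing outside the gene set

    running, best = 0.0, 0.0
    for (_, score), hit in zip(ranked, hits):
        # Step up at pathway genes (weighted by score), step down otherwise.
        running += abs(score) ** weight / norm if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list: the pathway genes sit at the top of the ranking.
toy_ranking = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0)]
print(enrichment_score(toy_ranking, {"A", "B"}))
```

A positive score indicates that the gene set is concentrated at the top of the ranking, a negative score that it is concentrated at the bottom.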

---

## 7 - Plotting
<hr style="border:2px solid black"> </hr>

### 7.1 - Dotplot

The dotplot shows the top enriched pathways per cluster. The size of a dot indicates the fraction of genes in the cluster that match the pathway, and its color indicates statistical significance (a higher value on the color scale means a more significant pathway).
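
The selection of the top pathways per cluster can be sketched as follows. This is a plain-Python illustration, assuming each result row has a `Cluster` column and that smaller values in the chosen significance column are more significant. The helper name is hypothetical and the plotting function may select terms differently internally.

```python
from collections import defaultdict


def top_terms_per_cluster(rows, column="FDR q-val", top_term=10):
    """Return the `top_term` most significant rows for each cluster.

    rows: list of dicts, each with at least "Cluster", "Term" and `column`
    """
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["Cluster"]].append(row)
    return {
        cluster: sorted(items, key=lambda r: r[column])[:top_term]
        for cluster, items in by_cluster.items()
    }


# Hypothetical toy table.
toy_rows = [
    {"Cluster": "0", "Term": "T1", "FDR q-val": 0.04},
    {"Cluster": "0", "Term": "T2", "FDR q-val": 0.01},
    {"Cluster": "1", "Term": "T3", "FDR q-val": 0.20},
]
for cluster, terms in top_terms_per_cluster(toy_rows, top_term=1).items():
    print(cluster, [t["Term"] for t in terms])
```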

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
# Dotplot
figsize = (10, 10)  # Set figure size for dotplot
top_term = 10  # Number of pathways shown per cluster
size = 2  # Size scale for dots

# Custom column to be plotted.
# If None uses default values:
#     prerank -> FDR q-val
#     enrichr -> Adjusted P-value
column = None

In [None]:
if column is None:
    column = "FDR q-val" if method == "prerank" else "Adjusted P-value"

---

In [None]:
for reg in ["UP", "DOWN"]:
    comb = combined[combined["UP_DW"] == reg]
    if not comb.empty:
        with warnings.catch_warnings():
            # hide future warnings until gseapy fixes them
            warnings.filterwarnings(action='ignore', message=".*Series.replace.*|.*chained assignment.*")

            ax = gp.dotplot(
                comb,
                column=column,
                figsize=figsize,
                x='Cluster',
                title=f"Top {top_term} {reg} regulated Pathways per Cluster",
                cmap=plt.cm.autumn_r,
                size=size,
                show_ring=True,
                top_term=top_term,
                xticklabels_rot=45
            )
        ax.set_xlabel("")
        plt.tight_layout()
        plt.savefig(f"{settings.figure_dir}/GSEA_dotplot_top{top_term}_{reg}_pathways_per_cluster.pdf", dpi=300)

---

### 7.2 - Term dotplot
The term dotplot focuses on a single term/pathway and thus shows individual genes instead of pathways on the y-axis. A Z-Score is applied to the mean gene expression per cluster to highlight differences in expression between the clusters (x-axis). A pathway of interest can be selected using the plot above.
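
The Z-score step can be illustrated for a single gene as below. This is a minimal sketch with hypothetical names. It uses the population standard deviation, which is an assumption and may differ from what the plotting function uses.

```python
from statistics import mean, pstdev


def zscore_across_clusters(mean_expr):
    """Z-score one gene's mean expression across clusters.

    mean_expr: {"cluster name": mean expression of the gene in that cluster}
    """
    values = list(mean_expr.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # The gene is expressed identically everywhere: nothing to highlight.
        return {cluster: 0.0 for cluster in mean_expr}
    return {cluster: (value - mu) / sigma for cluster, value in mean_expr.items()}


# Hypothetical mean expression of one gene in three clusters.
print(zscore_across_clusters({"0": 1.0, "1": 2.0, "2": 3.0}))
```

After scaling, each gene has mean 0 across clusters, so genes with very different absolute expression levels become comparable in one plot.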

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
# Term dotplot
term = "Actin Filament Organization (GO:0007015)"  # The GO term of interest.
groups = None  # Only show selected groups on x-axis e.g. ["a", "b"]. None to show all.
figsize_term_dot = None  # Set figure size for the term dotplot, e.g. (10, 5).
save_term_dot = f"term_dotplot_{term}_{clustering}.pdf"
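
GO terms contain spaces, parentheses and a colon, and these end up in the file name built above. Colons in particular are not allowed in file names on some systems. If that causes problems, the term can be sanitized first. The following sketch uses a hypothetical helper, `safe_filename_part`, that is not part of sctoolbox.

```python
import re


def safe_filename_part(text):
    """Replace every run of characters outside [A-Za-z0-9._-] with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", text).strip("_")


example_term = "Actin Filament Organization (GO:0007015)"
print(f"term_dotplot_{safe_filename_part(example_term)}_leiden_0.1.pdf")
# → term_dotplot_Actin_Filament_Organization_GO_0007015_leiden_0.1.pdf
```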

---

In [None]:
gene_col = "Lead_genes" if method == "prerank" else "Genes"

In [None]:
if term:
    pl.gsea.term_dotplot(term=term,
                         term_table=combined[combined["UP_DW"] == "UP"],
                         adata=adata,
                         groupby=clustering,
                         gene_col=gene_col,
                         groups=groups,
                         figsize=figsize_term_dot,
                         save=save_term_dot)

### 7.3 - Network plot
The network plot shows connections between enriched pathways per cluster. In the plot the node size corresponds to the percentage of gene overlap in a certain term of interest. The colour of the node corresponds to the significance of the enriched terms and the edge size corresponds to the number of genes that overlap between two connected nodes.
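
How edges between pathways arise from shared genes can be sketched as follows. The example assumes that the genes of each term are stored as one `;`-separated string, as in the `Lead_genes` or `Genes` column. The helper name is hypothetical and this is not the plotting function's own code.

```python
from itertools import combinations


def overlap_edges(term_genes, sep=";"):
    """Count shared genes for every pair of terms.

    term_genes: {"term name": "GENE1;GENE2;..."}
    Returns {(term_a, term_b): number of shared genes} for pairs that overlap.
    """
    gene_sets = {
        term: {gene for gene in genes.split(sep) if gene}
        for term, genes in term_genes.items()
    }
    edges = {}
    for term_a, term_b in combinations(sorted(gene_sets), 2):
        shared = len(gene_sets[term_a] & gene_sets[term_b])
        if shared:
            edges[(term_a, term_b)] = shared
    return edges


# Hypothetical toy input: T1 and T2 share two genes, T3 shares none.
print(overlap_edges({"T1": "A;B;C", "T2": "B;C;D", "T3": "E"}))
# → {('T1', 'T2'): 2}
```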

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
# Significance cutoff applied to sig_column (e.g. "FDR q-val").
# All terms with a sig_column value > cutoff are filtered out.
cutoff = 0.05

scale = 1  # Scale factor for node positions
figsize_network = None  # Set figure size for the network plot, e.g. (10, 5). Set to None to use default.
ncols = 3  # Number of columns for network plots
save_network = "pathway_network.pdf"

# Column containing the significance of enriched terms.
# If None uses default values:
#     prerank -> FDR q-val
#     enrichr -> Adjusted P-value
# Available options are:
#     - for prerank: 'FDR q-val', 'NOM p-val'
#     - for enrichr: 'Adjusted P-value', 'P-value'
sig_column = None

In [None]:
if sig_column is None:
    sig_column = "FDR q-val" if method == "prerank" else "Adjusted P-value"
    
score_col = "NES" if method == "prerank" else "Combined Score"

---

In [None]:
pl.gsea.gsea_network(combined[combined["UP_DW"] == "UP"],
                     score_col=score_col,
                     sig_col=sig_column,
                     cutoff=cutoff,
                     scale=scale,
                     figsize=figsize_network,
                     ncols=ncols,
                     save=save_network)