In [7]:
from sctoolbox.utils.jupyter import bgcolor, _compare_version
from sctoolbox import settings

nb_name = "GSEA.ipynb"

_compare_version(nb_name)

Some function might not work!


# Gene Set Enrichment Analysis (GSEA)
<hr style="border:2px solid black"> </hr>

## 1 - Description

**Note: You need to have run the marker gene notebook before using the GSEA notebook!**

The main purpose of this notebook is to find the enriched GO pathways per cluster. The marker genes of each cluster are used as the gene set input.  
This notebook uses [Enrichr](https://maayanlab.cloud/Enrichr/), which is accessed through [GSEApy](https://github.com/zqfang/GSEApy).    

Enrichr is a web-based tool for analysing gene sets and returns any enrichment of common annotated biological features.
Enrichment analysis is a computational method for inferring knowledge about an input gene set by comparing it to annotated gene sets representing prior biological knowledge. See <https://maayanlab.cloud/Enrichr/> for further details.  
-- https://cran.r-project.org/web/packages/enrichR/index.html
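
To make the idea of "comparing an input gene set to annotated gene sets" concrete, the following is a minimal sketch of an over-representation test based on the hypergeometric distribution, which is equivalent to a one-sided Fisher's exact test (the test Enrichr's documentation describes for its p-values). The function name `enrichment_pvalue` and all numbers are assumptions chosen for illustration; this is not the code that Enrichr or GSEApy runs.

```python
from math import comb


def enrichment_pvalue(n_background, n_pathway, n_input, n_overlap):
    """Upper-tail hypergeometric p-value: P(overlap >= n_overlap).

    Hypothetical helper for illustration only.
    """
    total = comb(n_background, n_input)
    upper = min(n_pathway, n_input)
    # Sum the probabilities of observing n_overlap or more pathway genes
    # when drawing n_input genes at random from the background.
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_input - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers (made up): 20,000 background genes, a pathway of 100 genes,
# 50 marker genes, 5 of which belong to the pathway.
p = enrichment_pvalue(20000, 100, 50, 5)
print(f"p = {p:.3g}")
```

With these toy numbers only about 0.25 pathway genes would be expected by chance, so an overlap of 5 yields a small p-value. Real analyses additionally correct for testing many pathways, which is why the notebook filters on adjusted p-values.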

---

## 2 - Setup

In [8]:
import sctoolbox.utils as utils
import sctoolbox.tools as tools
import sctoolbox.plotting as pl

import pandas as pd
import gseapy as gp
import tqdm
import matplotlib.pyplot as plt

---

## 3 - General Input

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [9]:
%bgcolor PowderBlue

# sctoolbox settings
settings.adata_input_dir = "../adatas/"
settings.adata_output_dir = "../adatas/"
settings.figure_dir = "../figures/GSEA/"
settings.log_file = "../logs/GSEA_log.txt"
last_notebook_adata = "anndata_0A.h5ad"

organism = "human"

# key for marker table in adata.uns
marker_key = "rank_genes_leiden_0.1_filtered" 
pvals_adj_tresh = 0.05



---

## 4 - Load anndata

In [None]:
adata = utils.adata.load_h5ad(last_notebook_adata)

with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    display(adata)
    display(adata.obs)
    display(adata.var)

---

## 5 - Select library

**Molecular Function**  
Molecular-level activities performed by gene products. Molecular function terms describe activities that occur at the molecular level, such as “catalysis” or “transport”. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when, or in what context the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products (i.e. a protein or RNA), but some activities are performed by molecular complexes composed of multiple gene products. Examples of broad functional terms are catalytic activity and transporter activity; examples of narrower functional terms are adenylate cyclase activity or Toll-like receptor binding. To avoid confusion between gene product names and their molecular functions, GO molecular functions are often appended with the word “activity” (a protein kinase would have the GO molecular function protein kinase activity).  

**Cellular Component**  
A location, relative to cellular compartments and structures, occupied by a macromolecular machine. There are two ways in which the gene ontology describes locations of gene products: (1) the cellular anatomical entities, in which a gene product carries out a molecular function. Cellular anatomical entities includes cellular structures such as the plasma membrane and the cytoskeleton, as well as membrane-enclosed cellular compartments such as the mitochondrion, and (2) the stable macromolecular complexes of which they are parts, e.g., the clathrin complex.  

**Biological Process**  
The larger processes, or ‘biological programs’ accomplished by multiple molecular activities. Examples of broad biological process terms are DNA repair or signal transduction. Examples of more specific terms are pyrimidine nucleobase biosynthetic process or glucose transmembrane transport. Note that a biological process is not equivalent to a pathway. At present, the GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.

-- https://geneontology.org/docs/ontology-documentation/

**List of available libraries**

In [None]:
[db for db in gp.get_library_name(organism) if db.startswith("GO")]

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [10]:
%bgcolor PowderBlue

# Choose public library to use
library_name = "GO_Biological_Process_2023"

# To use custom gene sets and/or a custom background, set them here.
# The public library is ignored if custom_gene_set is given.
# To use a custom background with the public gene set library, set only custom_background.
custom_gene_set = None     # {"Pathway 1": ["Gene1", "Gene2",...], ...}
custom_background = None   # ["Gene 1", "Gene 2", ....]
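
The dictionary format expected for `custom_gene_set` can be built from a GMT file, a common tab-separated format in which each line holds a set name, a description, and then the genes. The following is a minimal sketch under that assumption; `parse_gmt` and the example lines are hypothetical and not part of sctoolbox or GSEApy.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into {set_name: [genes]}.

    Hypothetical helper for illustration only.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]  # drop empty fields
    return gene_sets


# Made-up example content; a real file could be read with open(path).
example_gmt = [
    "Pathway 1\tdemo description\tGene1\tGene2\tGene3\n",
    "Pathway 2\tdemo description\tGene2\tGene4\n",
]
example_gene_set = parse_gmt(example_gmt)
print(example_gene_set)
```

The resulting dictionary has the same shape as the commented example next to `custom_gene_set`. A custom background could be, for instance, the list of measured genes (`list(adata.var_names)`); whether that choice is appropriate depends on the analysis.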

---

## 6 - Run enrichr

In [None]:
combined = tools.gsea.enrichr_marker_genes(adata,
                                           marker_key=marker_key,
                                           organism=organism,
                                           pvals_adj_tresh=pvals_adj_tresh,
                                           gene_sets=custom_gene_set,
                                           background=custom_background,
                                           library_name=library_name)

---

## 7 - Plotting
<hr style="border:2px solid black"> </hr>

### 7.1 - Dotplot

The dotplot shows all pathways as dots per cluster.  
The size of the dot indicates the fraction of genes in the cluster that match the pathway.  
The color of the dot indicates the adjusted p-value.

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [11]:
%bgcolor PowderBlue

# Dotplot
figsize = (8, 20)  # Set figure size for dotplot
top_term = 10      # Number of pathways shown per cluster
size = 5           # Size scale for dots

---

In [None]:
for reg in ["UP", "DOWN"]:
    comb = combined[combined["UP_DW"] == reg]
    if not comb.empty:
        ax = gp.dotplot(comb,
                        figsize=figsize,
                        x='Cluster',
                        title=f"Top {top_term} {reg} regulated Pathways per Cluster",
                        cmap=plt.cm.autumn_r,
                        size=size,
                        show_ring=True,
                        top_term=top_term,
                        xticklabels_rot=45
                       )
        ax.set_xlabel("")
        plt.tight_layout()
        plt.savefig(f"{settings.figure_dir}/GSEA_dotplot_top{top_term}_{reg}_pathways_per_cluster.pdf", dpi=300)

---

### 7.2 - Term dotplot
The term dotplot focuses on a single term/pathway.
It shows the mean gene expression and z-score per group.

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [12]:
%bgcolor PowderBlue

# Term dotplot
term = "Actin Filament Organization (GO:0007015)"  # GO term
groupby = "clustering"   # Key of group column e.g. 'clustering'
groups = None            # Subset on groups
figsize_term_dot = None  # Set figure size for term dotplot
save_term_dot = f"term_dotplot_{term}_{groupby}.pdf"
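
Because the GO term contains spaces, parentheses, and a colon, the file name derived from it may be rejected on some file systems (a colon is not allowed in Windows file names, for example). The following is a minimal sketch of a sanitizing step; `safe_filename` is a hypothetical helper, not an sctoolbox function.

```python
import re


def safe_filename(name, replacement="_"):
    """Replace runs of characters outside [A-Za-z0-9._-] with `replacement`.

    Hypothetical helper for illustration only.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", replacement, name)
    return cleaned.strip(replacement)


example_term = "Actin Filament Organization (GO:0007015)"
example_groupby = "clustering"
example_save = f"term_dotplot_{safe_filename(example_term)}_{example_groupby}.pdf"
print(example_save)
# prints "term_dotplot_Actin_Filament_Organization_GO_0007015_clustering.pdf"
```

The term itself is left unchanged for the plotting call; only the file name is sanitized.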

---

In [None]:
if term:
    pl.gsea.term_dotplot(term=term,
                         term_table=combined[combined["UP_DW"] == "UP"],
                         adata=adata,
                         groupby=groupby,
                         groups=groups,
                         figsize=figsize_term_dot,
                         save=save_term_dot)