[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/enrichr.ipynb)

# Gene Ontology (GO)

Pathways represent interconnected molecular networks of signaling cascades that govern critical cellular processes. They provide insight into the mechanisms of cellular behavior, disease progression, and treatment response. In an R&D organization, managing pathways across different datasets is crucial for identifying potential therapeutic targets and intervention strategies.

In this notebook, we manage a pathway registry based on the "2023 GO Biological Process" ontology. We'll walk you through the steps of registering pathways and linking them to genes.

In the following [Standardize metadata on-the-fly](analysis-registries) notebook, we'll demonstrate how to perform a pathway enrichment analysis and track the dataset with LaminDB.

## Setup

```{warning}

Please ensure that you have created or loaded a LaminDB instance before running the remaining part of this notebook!

This notebook follows [CellTypist], which populates the CellType registry.
```

In [None]:
!lamin init --storage ./use-cases-registries --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import gseapy as gp

lb.settings.organism = "human"  # globally set organism

## Fetch GO pathways annotated with human genes using Enrichr

First, we fetch the "GO_Biological_Process_2023" pathways for humans using [GSEApy](https://github.com/zqfang/GSEApy), which wraps [GSEA](https://www.gsea-msigdb.org/gsea/index.jsp) and [Enrichr](https://maayanlab.cloud/Enrichr/).

In [None]:
go_bp = gp.get_library(name="GO_Biological_Process_2023", organism="Human")
print(f"Number of pathways {len(go_bp)}")

In [None]:
go_bp["ATF6-mediated Unfolded Protein Response (GO:0036500)"]

Parse the ontology ID out of each key and convert the library into the format `{ontology_id: (name, genes)}`:

In [None]:
def parse_ontology_id_from_keys(key):
    """Parse out the ontology id.

    "ATF6-mediated Unfolded Protein Response (GO:0036500)" -> ("GO:0036500", "ATF6-mediated Unfolded Protein Response")
    """
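    # the ontology ID is the last space-separated token, wrapped in parentheses
    ontology_id = key.split(" ")[-1].replace("(", "").replace(")", "")
    name = key.replace(f" ({ontology_id})", "")
    return (ontology_id, name)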
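
The split-based parser above assumes that every key ends with a single space-separated `(GO:...)` token. As an alternative sketch, a regular expression can anchor on the trailing ID explicitly and fail loudly on keys with a different shape. The names `GO_KEY_PATTERN` and `parse_go_key` are illustrative assumptions and are not part of GSEApy or LaminDB.

```python
import re

# Hypothetical pattern: "<name> (GO:<digits>)" with the ID at the very end of the key.
GO_KEY_PATTERN = re.compile(r"^(?P<name>.+) \((?P<ontology_id>GO:\d+)\)$")


def parse_go_key(key: str) -> tuple[str, str]:
    """Return (ontology_id, name), or raise ValueError if the key has another shape."""
    match = GO_KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"Unexpected key format: {key!r}")
    return match["ontology_id"], match["name"]


print(parse_go_key("ATF6-mediated Unfolded Protein Response (GO:0036500)"))
```

Raising on unexpected keys trades convenience for safety: a malformed key stops the run instead of producing a wrong ontology ID.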

In [None]:
go_bp_parsed = {}

for key, genes in go_bp.items():
    # avoid shadowing the built-in `id`
    ontology_id, name = parse_ontology_id_from_keys(key)
    go_bp_parsed[ontology_id] = (name, genes)

In [None]:
go_bp_parsed["GO:0036500"]
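
Because `go_bp_parsed` is keyed by ontology ID, two library keys that shared an ID would silently overwrite each other. A minimal sketch of a uniqueness check follows, run on made-up toy entries rather than real annotations. `toy_library` and `find_duplicate_ids` are hypothetical names used only for illustration.

```python
from collections import Counter

# Toy stand-in for the Enrichr library (made-up entries, not real annotations).
toy_library = {
    "Process A (GO:0000001)": ["GENE1", "GENE2"],
    "Process B (GO:0000002)": ["GENE2", "GENE3"],
    "Process A Duplicate (GO:0000001)": ["GENE4"],
}


def find_duplicate_ids(library: dict) -> list:
    """Return the ontology IDs that appear in more than one library key."""
    # The ID is the last space-separated token, wrapped in parentheses.
    ids = [key.split(" ")[-1].strip("()") for key in library]
    return sorted(id_ for id_, count in Counter(ids).items() if count > 1)


print(find_duplicate_ids(toy_library))
```

An empty result on the real library would suggest that keying by ontology ID loses no entries.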

## Register pathway ontology in LaminDB

In [None]:
bionty = lb.Pathway.public()

In [None]:
bionty

Next, we register all the pathways and genes in LaminDB to finally link pathways to genes.

### Register pathway terms

To register the pathways we make use of `.from_values` to directly parse the annotated GO pathway ontology IDs into LaminDB.

In [None]:
pathway_records = lb.Pathway.from_values(go_bp_parsed.keys(), lb.Pathway.ontology_id)

In [None]:
lb.Pathway.from_public(ontology_id="GO:0015868")

In [None]:
ln.save(pathway_records, parents=False)  # not recursing through parents

### Register gene symbols

Similarly, we use `.from_values` to register all pathway-associated genes with LaminDB.

In [None]:
all_genes = {g for genes in go_bp.values() for g in genes}

In [None]:
gene_records = lb.Gene.from_values(all_genes, lb.Gene.symbol)

In [None]:
gene_records[:3]

In [None]:
ln.save(gene_records);

### Link pathway to genes

Now that we are tracking all pathway and gene records, we can link them to make the pathways even more queryable.

In [None]:
gene_records_ids = {record.symbol: record for record in gene_records}

In [None]:
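for pathway_record in pathway_records:
    # gene symbols annotated to this pathway
    pathway_genes = go_bp_parsed[pathway_record.ontology_id][1]
    # keep only symbols that have a gene record, so no None is passed to .set()
    pathway_genes_records = [
        gene_records_ids[gene] for gene in pathway_genes if gene in gene_records_ids
    ]
    pathway_record.genes.set(pathway_genes_records)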
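
The linking loop looks up each pathway's gene symbols in `gene_records_ids`. Symbols without a registered record cannot be linked, so it can help to list them first. The sketch below uses plain dictionaries with made-up symbols in place of LaminDB records. `find_unmatched_symbols`, `toy_parsed`, and `toy_known` are hypothetical names.

```python
def find_unmatched_symbols(pathway_to_genes: dict, known_symbols: set) -> dict:
    """Map each pathway ID to the gene symbols that have no registered record."""
    unmatched = {}
    for pathway_id, (_name, genes) in pathway_to_genes.items():
        missing = [gene for gene in genes if gene not in known_symbols]
        if missing:
            unmatched[pathway_id] = missing
    return unmatched


# Toy data in the same {ontology_id: (name, genes)} shape as go_bp_parsed.
toy_parsed = {
    "GO:0000001": ("Process A", ["GENE1", "GENE2"]),
    "GO:0000002": ("Process B", ["GENE2", "GENE3"]),
}
toy_known = {"GENE1", "GENE2"}  # stands in for the keys of gene_records_ids

print(find_unmatched_symbols(toy_parsed, toy_known))
```

In the notebook, the same function could presumably be called with `go_bp_parsed` and `set(gene_records_ids)`.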

Now genes are linked to pathways. For example, for the last pathway processed in the loop:

In [None]:
pathway_record.genes.list("symbol")

Move on to the next analysis: [Standardize metadata on-the-fly](analysis-registries.ipynb)