[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/celltypist.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/laminlabs/lamin-usecases/main?labpath=lamin-usecases%2Fdocs%2Fcelltypist.ipynb)

# CellTypist

Cell types classify cells based on public and private knowledge from studying transcription, morphology, function, and other properties.
Established cell types have well-characterized markers and properties; however, cell subtypes and states are continuously being discovered, refined and better understood.

In this notebook, we register the immune cell type vocabulary from [CellTypist](https://www.celltypist.org), a computational tool used for cell type classification in scRNA-seq data.

In the following [Standardize metadata on-the-fly](analysis-registries) notebook, we'll demonstrate how to curate datasets analyzed with CellTypist enrichment analysis and track them with LaminDB.

## Setup

Install the `lamindb` Python package:
```shell
pip install 'lamindb[jupyter,bionty]'
```

In [None]:
!lamin load use-cases-registries

In [None]:
# filter warnings from celltypist
import warnings

warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")

In [None]:
import lamindb as ln
import bionty as bt

## Access CellTypist records ![](https://img.shields.io/badge/Access-10b981) 

As a first step, we read in CellTypist's immune cell encyclopedia:

In [None]:
import pandas as pd
description = "CellTypist Pan Immune Atlas v2: basic cell type information"
celltypist_source_v2_url = "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"

celltypist_df = pd.read_excel(celltypist_source_v2_url)

It provides an `ontology_id` of the public Cell Ontology for the majority of records.

In [None]:
celltypist_df.head()

The "Cell Ontology ID" is associated with multiple "Low-hierarchy cell types":

In [None]:
celltypist_df.set_index(["Cell Ontology ID", "Low-hierarchy cell types"]).head(10)
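
The one-to-many relationship can be made explicit by grouping names by ontology ID. The following is a minimal plain-Python sketch; the IDs and names are made-up placeholders standing in for rows of `celltypist_df`, not values read from the CellTypist table.

```python
from collections import defaultdict

# Hypothetical (ontology ID, low-hierarchy name) pairs standing in for table rows
rows = [
    ("ID:1", "Example cells A"),
    ("ID:1", "Example cells B"),
    ("ID:2", "Example cells C"),
]

# Collect all names that share the same ontology ID
names_by_id = defaultdict(list)
for ontology_id, name in rows:
    names_by_id[ontology_id].append(name)

# Keep only the IDs that map to more than one name
shared_ids = {k: v for k, v in names_by_id.items() if len(v) > 1}
print(shared_ids)
```

IDs that appear in `shared_ids` are the ones for which the CellTypist name alone cannot serve as a unique key.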

## Validate CellTypist records ![](https://img.shields.io/badge/Validate-10b981) 

For any cell type record that can be validated against the public Cell Ontology, we want to make sure that it actually is validated.

This prevents us from referring to the same cell type with different identifiers.

We need a `Bionty` object for this:

In [None]:
bionty = bt.CellType.public()
bionty

We can now validate the `"Cell Ontology ID"` column:

In [None]:
bionty.inspect(celltypist_df["Cell Ontology ID"], bionty.ontology_id);

This looks good!

But when inspecting the names, most of them don't validate:

In [None]:
bionty.inspect(celltypist_df["Low-hierarchy cell types"], bionty.name);

A search tells us that terms named in the plural in CellTypist appear in the singular in the Cell Ontology:

In [None]:
celltypist_df["Low-hierarchy cell types"][0]

In [None]:
bionty.search(celltypist_df["Low-hierarchy cell types"][0]).head(2)

Let's strip the trailing `"s"` and check whether more names validate. Yes, they do!

In [None]:
bionty.inspect(
    [i.rstrip("s") for i in celltypist_df["Low-hierarchy cell types"]],
    bionty.name,
);
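
The effect of stripping the plural can be illustrated with exact string matching against a small reference set. This is only a sketch of the idea, not how `inspect` is implemented; `strip_plural`, the reference names, and the query names are assumptions chosen for illustration.

```python
def strip_plural(name: str) -> str:
    """Remove a single trailing "s", if present."""
    return name[:-1] if name.endswith("s") else name


# Hypothetical reference names (singular) and query names (plural)
reference_names = {"T cell", "B cell"}
query = ["T cells", "B cells", "Cycling cells"]

# Exact matches before and after removing the plural "s"
before = [n for n in query if n in reference_names]
after = [n for n in map(strip_plural, query) if n in reference_names]
print(len(before), len(after))

# rstrip("s") removes *all* trailing "s" characters, not just one
print("glass".rstrip("s"), strip_plural("glass"))
```

Note that `rstrip("s")` is safe for names that end in a single `"s"`, but it would over-trim any name that ends in a double `"s"`.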

Every "low-hierarchy cell type" has an ontology ID, and most "high-hierarchy cell types" also appear as "low-hierarchy cell types" in the CellTypist table. Four, however, don't, and therefore don't have an ontology ID.

In [None]:
high_terms = celltypist_df["High-hierarchy cell types"].unique()
low_terms = celltypist_df["Low-hierarchy cell types"].unique()

high_terms_nonval = set(high_terms).difference(low_terms)
high_terms_nonval

## Register CellTypist records ![](https://img.shields.io/badge/Register-10b981) 

Let's first add the "High-hierarchy cell types" as a column `"parent"`.

This enables LaminDB to populate the `parents` and `children` fields, which will enable you to query for hierarchical relationships.

In [None]:
celltypist_df["parent"] = celltypist_df.pop("High-hierarchy cell types")

# if high and low terms are the same, no parents
celltypist_df.loc[
    (celltypist_df["parent"] == celltypist_df["Low-hierarchy cell types"]), "parent"
] = None

# rename columns, drop markers
celltypist_df.drop(columns=["Curated markers"], inplace=True)
celltypist_df.rename(
    columns={"Low-hierarchy cell types": "ct_name", "Cell Ontology ID": "ontology_id"},
    inplace=True,
)
celltypist_df.columns = celltypist_df.columns.str.lower()

# add standardized names for each ontology_id
celltypist_df["name"] = bionty.df().loc[celltypist_df["ontology_id"]].name.values
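
The rule for the `parent` column can be sketched without pandas: a term that equals its own high-hierarchy term gets no parent. The rows below are illustrative values, not read from the CellTypist table.

```python
# Illustrative (high-hierarchy term, low-hierarchy term) pairs
rows = [
    ("B cells", "B cells"),
    ("B cells", "Memory B cells"),
]

records = []
for high, low in rows:
    # A term that is its own high-hierarchy term has no parent
    parent = None if high == low else high
    records.append({"ct_name": low, "parent": parent})

for record in records:
    print(record)
```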

In [None]:
celltypist_df.head(2)

Now, let's create records from the public ontology:

In [None]:
public_records = bt.CellType.from_values(
    celltypist_df.ontology_id, bt.CellType.ontology_id
)
ln.save(public_records)

Let's now amend the public ontology records so that they retain any additional annotations that CellTypist provides.

In [None]:
public_records_dict = {r.ontology_id: r for r in public_records}

for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    try:
        record.add_synonym(row["ct_name"])
    except SystemExit:
        pass

### Add parent-child relationships of the records from CellTypist

We still need to add the remaining 4 high-hierarchy terms:

In [None]:
list(high_terms_nonval)

Let's get the top hits from a search:

In [None]:
for term in list(high_terms_nonval):
    print(f"Term: {term}")
    display(bionty.search(term).head(2))

So we decide to:

- Add "T cells" to the synonyms of the public "T cell" record
- Add "Erythroid" to the synonyms of the public "erythroid lineage cell" record
- Create the remaining 2 terms using only their names (we think "B cell flow" shouldn't be identified with "B cell")

In [None]:
for name in high_terms_nonval:
    if name == "T cells":
        record = bt.CellType.from_public(name="T cell")
        record.add_synonym(name)
        record.save()
    elif name == "Erythroid":
        record = bt.CellType.from_public(name="erythroid lineage cell")
        record.add_synonym(name)
        record.save()
    else:
        record = bt.CellType(name=name)
        record.save()
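
The branching in the loop above can also be expressed as a lookup table, which keeps the curation decisions in one place. This is a plain-Python sketch of the decision logic only; `plan_registration` and the action labels are hypothetical and do not touch the registry.

```python
# Mapping taken from the loop above: CellTypist term -> public Cell Ontology name
public_name_for = {"T cells": "T cell", "Erythroid": "erythroid lineage cell"}


def plan_registration(term):
    """Return (action, target name) for a high-hierarchy term."""
    if term in public_name_for:
        # The term becomes a synonym of an existing public record
        return ("add_synonym", public_name_for[term])
    # Otherwise a new record is created from the name alone
    return ("create", term)


for term in ["T cells", "Erythroid", "B-cell lineage"]:
    print(term, "->", plan_registration(term))
```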

In [None]:
high_terms_nonval

In [None]:
bt.CellType(name="B-cell lineage").save()

Now let's add the parent records:

In [None]:
celltypist_df["parent"] = bt.CellType.standardize(celltypist_df["parent"])

In [None]:
for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    if row["parent"] is not None:
        parent_record = bt.CellType.filter(name=row["parent"]).one()
        record.parents.add(parent_record)
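
Once parent links exist, hierarchical queries amount to walking those links upward. The sketch below shows the idea on a plain dictionary; the mapping and the `ancestors` helper are assumptions for illustration and are not part of LaminDB.

```python
# Hypothetical child -> parents mapping, mirroring the links added via `parents.add`
parents = {
    "memory B cell": ["B cell"],
    "B cell": ["lymphocyte"],
    "lymphocyte": [],
}


def ancestors(term, parents):
    """Collect all ancestors of a term by following parent links upward."""
    seen = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen


print(ancestors("memory B cell", parents))
```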

## Access the registry

The CellTypist records registered above are now available in LaminDB.
To retrieve the full ontology table as a pandas DataFrame, we can use `.df()`:

In [None]:
bt.CellType.df()

This enables us to look for cell types by creating a lookup object from our new `CellType` registry.

In [None]:
db_lookup = bt.CellType.lookup()

In [None]:
db_lookup.memory_b_cell

View the cell type hierarchy:

In [None]:
db_lookup.memory_b_cell.view_parents()

Access parents of a record:

In [None]:
db_lookup.memory_b_cell.parents.list()

Move on to the next registry: [GO pathways](enrichr)