# Manage a cell type registry

## Background

Cell types are categories that classify cells based on characteristics and behaviors, including gene expression patterns, morphology, and functional properties. This classification enables researchers to explore cellular diversity, comprehend cellular heterogeneity, and gain valuable insights into the specific roles and interactions of different cell types.

In the dynamic world of biotech and pharmaceutical research, where numerous single-cell datasets are generated, the ability to seamlessly query and integrate datasets across different internal groups based on, for example, specific cell types proves exceptionally valuable.

In this notebook we are creating a cell type registry for all cell types that [CellTypist](https://www.celltypist.org) supports. CellTypist is a powerful computational tool for cell type classification in single-cell RNA sequencing data. It assigns cell types based on gene expression profiles within heterogeneous cell populations. We will further use CellTypist to classify cell types of a previously unannotated dataset and ingest the dataset with Lamin. Finally, we will demonstrate how to fetch datasets with cell type queries using Lamin.

## Setup

```{warning}

Please ensure that you have created or loaded a LaminDB instance before running the remaining part of this notebook!
```

In [None]:
# A lamindb instance containing Bionty schema (skip if you already loaded your instance)
import lamindb as ln

ln.setup.init(storage="./celltypist", schema="bionty")

In [None]:
# Filter warnings from celltypist
import warnings

warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")

In [None]:
import lamindb as ln
import lnschema_bionty as lb

import celltypist
import pandas as pd

ln.settings.verbosity = 3  # show hints

In [None]:
# public Cell Ontology reference

celltype_bt = lb.CellType.bionty()  # equivalent to bionty.CellType()
celltype_bt

The cells above import `lamindb` and `lnschema_bionty`, which connects [Bionty](https://github.com/laminlabs/bionty) with [LaminDB](https://github.com/laminlabs/lamindb). This lets us map cell types against ontologies and create SQL records in LaminDB so that they become queryable.

## Create an in-house CellType registry of CellTypist terms based on the public Cell Ontology

### Fetching CellTypist's immune cell encyclopedia

As a first step we read in CellTypist's immune cell encyclopedia. For the majority of its terms it provides a mapped Cell Ontology (CL) `ontology_id`.

In [None]:
celltypist_df = pd.read_excel(
    "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"
)
celltypist_df

Note that some values of "Cell Ontology ID" are associated with multiple "Low-hierarchy cell types":

In [None]:
celltypist_df.set_index(["Cell Ontology ID", "Low-hierarchy cell types"]).head(10)
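To list only the IDs that are shared by several labels, the rows can be grouped by ID. The sketch below shows the idea in plain Python on made-up rows; the IDs and labels are placeholders, not real Cell Ontology terms.

```python
from collections import defaultdict

# Made-up (ontology ID, label) pairs; the IDs are placeholders, not real CL terms.
rows = [
    ("CL:X1", "Label A"),
    ("CL:X1", "Label B"),
    ("CL:X2", "Label C"),
]

# Group the labels by their ontology ID.
labels_by_id = defaultdict(list)
for ontology_id, label in rows:
    labels_by_id[ontology_id].append(label)

# Keep only the IDs that are shared by more than one label.
shared = {oid: labels for oid, labels in labels_by_id.items() if len(labels) > 1}
print(shared)
```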

CellTypist uses two hierarchy levels of cell types (`High-hierarchy cell types` and `Low-hierarchy cell types`). Taking the set difference between the two levels shows that 4 of the `High-hierarchy cell types` do not appear among the low-hierarchy terms and are therefore not mapped to the Cell Ontology:

In [None]:
high_terms = celltypist_df["High-hierarchy cell types"].unique()
low_terms = celltypist_df["Low-hierarchy cell types"].unique()

high_terms_umapped = set(high_terms).difference(low_terms)
high_terms_umapped
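The cell above relies on a set difference: terms that occur at the high level but not at the low level. A minimal sketch with made-up labels (not taken from the CellTypist table) contrasts this with an intersection:

```python
# Made-up labels standing in for the two hierarchy columns.
high_terms = ["B cells", "T cells", "B cells"]
low_terms = ["B cells", "Naive B cells", "Helper T cells"]

# Terms present at both levels.
in_both = set(high_terms).intersection(low_terms)

# Terms present only at the high level; these have no row of their own.
only_high = set(high_terms).difference(low_terms)

print(in_both, only_high)
```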

### Check compliance with the public Cell Ontology

We want to ensure that the cell types of any dataset are eventually mapped against the Cell Ontology and queryable. Let's first run a few inspections to see how well the CellTypist terms align with it.

All ontology IDs provided by CellTypist can be mapped to the public Cell Ontology:

In [None]:
celltype_bt.inspect(celltypist_df["Cell Ontology ID"], celltype_bt.ontology_id);

However, when inspecting the names, most of them don't match:

In [None]:
celltype_bt.inspect(celltypist_df["Low-hierarchy cell types"], celltype_bt.name);

A search shows that many terms that are plural in CellTypist are singular in the Cell Ontology:

In [None]:
celltypist_df["Low-hierarchy cell types"][0]

In [None]:
celltype_bt.search(celltypist_df["Low-hierarchy cell types"][0], top_hit=True)

Let's strip the trailing `s`; more names are now mappable:

In [None]:
celltype_bt.inspect(
    [i.rstrip("s") for i in celltypist_df["Low-hierarchy cell types"]],
    celltype_bt.name,
);
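One caveat is that `str.rstrip("s")` removes every trailing `s`, not just one. The sketch below compares it with a hypothetical helper, `strip_plural`, that drops at most one character. Neither approach handles irregular plurals, so this is only an illustration of the difference.

```python
def strip_plural(name: str) -> str:
    """Drop exactly one trailing "s", if present (hypothetical helper)."""
    return name[:-1] if name.endswith("s") else name


# For a simple plural both approaches agree.
print("B cells".rstrip("s"))    # B cell
print(strip_plural("B cells"))  # B cell

# For a word ending in a double "s" they differ.
print("class".rstrip("s"))      # cla
print(strip_plural("class"))    # clas
```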

### Register CellTypist cell type encyclopedia in LaminDB

Let's first add the "High-hierarchy cell types" as a column "parent":

In [None]:
celltypist_df["parent"] = celltypist_df.pop("High-hierarchy cell types")

# if High and Low terms are the same, there is no parent
celltypist_df.loc[
    (celltypist_df["parent"] == celltypist_df["Low-hierarchy cell types"]), "parent"
] = None

# rename columns, drop markers
celltypist_df.drop(columns=["Curated markers"], inplace=True)
celltypist_df.rename(
    columns={"Low-hierarchy cell types": "name", "Cell Ontology ID": "ontology_id"},
    inplace=True,
)
celltypist_df.columns = celltypist_df.columns.str.lower()

In [None]:
celltypist_df.head(2)

In [None]:
public_records = lb.CellType.from_values(
    celltypist_df.ontology_id, lb.CellType.ontology_id
)

In [None]:
# use ontology_id as keys

public_records_dict = {r.ontology_id: r for r in public_records}

In [None]:
records_names = {}

for _, row in celltypist_df.iterrows():
    name = row["name"]
    ontology_id = row["ontology_id"]
    public_record = public_records_dict[ontology_id]

    # if both name and ontology_id match public record, use public record
    if name.lower() == public_record.name.lower():
        records_names[name] = public_record
        continue
    else:  # when ontology_id matches the public record and name doesn't match
        # if singular form of the Celltypist name matches public name
        if name.lower().rstrip("s") == public_record.name.lower():
            # add the Celltypist name to the synonyms of the public ontology record
            public_record.add_synonym(name)
            records_names[name] = public_record
            continue
        if public_record.synonyms is not None:
            synonyms = [s.lower() for s in public_record.synonyms.split("|")]
            # if any of the public synonyms matches the CellTypist name
            if any(
                [
                    i.lower() in {name.lower(), name.lower().rstrip("s")}
                    for i in synonyms
                ]
            ):
                # add the Celltypist name to the synonyms of the public ontology record
                public_record.add_synonym(name)
                records_names[name] = public_record
                continue

        # create a record only based on Celltypist metadata
        records_names[name] = lb.CellType(
            name=name, ontology_id=ontology_id, description=row.description
        )
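The branching in the loop above can be summarized as four outcomes: exact name match, plural match, synonym match, or a new record. The sketch below reproduces that decision on a hypothetical stand-in class, `PublicTerm`, with made-up names. It does not use the LaminDB API and only mirrors the comparison logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PublicTerm:
    """Hypothetical stand-in for a public ontology record."""

    name: str
    synonyms: Optional[str] = None  # pipe-separated, as in the loop above


def match_kind(name: str, public: PublicTerm) -> str:
    """Return which branch of the matching logic a CellTypist name would take."""
    lowered = name.lower()
    candidates = {lowered, lowered.rstrip("s")}
    if lowered == public.name.lower():
        return "exact"  # reuse the public record as is
    if lowered.rstrip("s") == public.name.lower():
        return "plural"  # reuse the public record, add a synonym
    if public.synonyms is not None:
        synonyms = {s.lower() for s in public.synonyms.split("|")}
        if synonyms & candidates:
            return "synonym"  # reuse the public record, add a synonym
    return "new"  # create a record from CellTypist metadata only


# Made-up terms that exercise each branch.
print(match_kind("T cell", PublicTerm("T cell")))
print(match_kind("T cells", PublicTerm("T cell")))
print(match_kind("ABC", PublicTerm("long public name", "ABC|another alias")))
print(match_kind("Unlisted cells", PublicTerm("long public name")))
```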

Some records are created by adding the CellTypist name to the synonyms of the public record:

In [None]:
records_names["GMP"]

Other records are created based on CellTypist metadata only:

In [None]:
records_names["Age-associated B cells"]

Let's save them to our database (you will notice that parent records from the public ontology are saved as well):

In [None]:
records = list(records_names.values())

ln.save(records)

### Add parent-child relationships of the records from CellTypist

We still need to add the remaining 4 high-hierarchy terms:

In [None]:
high_terms_umapped

Other than "T cells", we didn't find good matches in the public ontology.

In [None]:
search_results = []
for term in high_terms_umapped:
    search_results.append(celltype_bt.search(term, top_hit=True))

search_results

So we decided to:

- Add the "T cells" to the synonyms of the public "T cell" record
- Create the remaining 3 terms using only their names

In [None]:
for name in high_terms_umapped:
    if name == "T cells":
        record = lb.CellType.from_bionty(name="T cell")
        record.add_synonym(name)
        record.save()
    else:
        record = lb.CellType(name=name)
        record.save()
    records_names[name] = record

Now let's add the parent records:

In [None]:
for _, row in celltypist_df.iterrows():
    record = records_names[row["name"]]
    if row["parent"] is not None:
        parent_record = records_names[row["parent"]]
        record.parents.add(parent_record)

## Access the in-house CellType registry

The previously added CellTypist ontology registry is now available in LaminDB.
To retrieve the full ontology table as a pandas DataFrame we can use {func}`docs:lamindb.select`:

In [None]:
lb.CellType.select().df()

This enables us to look for cell types by creating a lookup object from our new `CellType` registry.

In [None]:
db_lookup = lb.CellType.lookup()

In [None]:
db_lookup.memory_b_cell

Access parents of a record:

In [None]:
db_lookup.memory_b_cell.parents.all()

In [None]:
db_lookup.memory_b_cell.parents.all()[0].parents.all()
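Chaining `.parents.all()` walks up the hierarchy one level at a time. The same idea, generalized to all ancestors, can be sketched with a plain dictionary. The mapping and the helper `ancestors` below are hypothetical and independent of LaminDB.

```python
# Hypothetical child -> parents mapping; the labels are placeholders.
parents = {
    "Cell C": ["Cell B"],
    "Cell B": ["Cell A"],
    "Cell A": [],
}


def ancestors(term, parents):
    """Collect all ancestors of a term, nearest first (breadth-first)."""
    seen = []
    queue = list(parents.get(term, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(parents.get(current, []))
    return seen


print(ancestors("Cell C", parents))
```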

## Annotate a dataset with cell types using CellTypist

### Annotate cell types predicted with CellTypist

We now demonstrate how simple it is to predict and add cell types to LaminDB with CellTypist.
Our dataset of choice is a simple sample dataset together with a sample model.

In [None]:
input_file = celltypist.samples.get_sample_csv()
input_file

In [None]:
predictions = celltypist.annotate(
    input_file, model="Immune_All_Low.pkl", majority_voting=True
)

Now that we've predicted all cell types, we create an [AnnData](https://anndata.readthedocs.io/en/latest) object that we will eventually track with LaminDB.

In [None]:
adata_annotated = predictions.to_adata()

In [None]:
adata_annotated.obs

Create cell type records using the "predicted_labels" as names:

In [None]:
celltypes = lb.CellType.from_values(
    adata_annotated.obs.predicted_labels, lb.CellType.name
)
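Because `predicted_labels` holds one label per cell, the same label typically occurs many times. A small sketch on made-up labels shows how the distinct labels and their frequencies could be inspected before creating records:

```python
from collections import Counter

# Made-up per-cell labels standing in for adata_annotated.obs.predicted_labels.
predicted_labels = ["Label A", "Label B", "Label A", "Label A"]

# Count how many cells carry each label.
counts = Counter(predicted_labels)

# Distinct labels, in order of first appearance.
unique_labels = list(counts)

print(unique_labels)
print(counts.most_common(1))
```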

In [None]:
celltypes[:2]

### Track the annotated dataset in LaminDB

Let's enable tracking of the current notebook as the transform of this file using {func}`docs:lamindb.track`:

In [None]:
ln.track()

Create a file record for the AnnData object using {func}`docs:lamindb.File`.
We also give the dataset a descriptive key, which makes it easier to identify and to query for later.

In [None]:
file_annotated = ln.File(adata_annotated, key="sample_cell_by_gene-celltypist.h5ad")

In [None]:
ln.save(file_annotated)

Link cell types to the file record:

In [None]:
file_annotated.cell_types.set(celltypes)

Now we can track the file and search for it, for example by querying for a specific cell type.

In [None]:
ln.select(ln.File).filter(cell_types=db_lookup.tcm_naive_helper_t_cells).df()

Or find the notebook in which the file was annotated by CellTypist:

In [None]:
ln.select(ln.Transform).filter(files__name__icontains="celltypist").df()

## Conclusion

Lamin makes it easy to annotate cell types with ontology information and to track any datasets with such annotated cell types.
It does not matter whether the cell types were already part of an ontology or are newly found; Lamin supports both use cases.

## Try it yourself

This notebook is available at [https://github.com/laminlabs/lamin-examples](https://github.com/laminlabs/lamin-examples).