# Manage cell type ontology

## Background

Cell types are categories that classify cells based on characteristics and behaviors, including gene expression patterns, morphology, and functional properties. This classification enables researchers to explore cellular diversity, comprehend cellular heterogeneity, and gain valuable insights into the specific roles and interactions of different cell types.

Biotech and pharmaceutical research generates large numbers of single-cell datasets. Being able to query and integrate these datasets across internal groups, for example by specific cell types, is therefore valuable.

In this notebook we are creating a cell type registry for all cell types that [CellTypist](https://www.celltypist.org) supports. CellTypist is a powerful computational tool for cell type classification in single-cell RNA sequencing data. It assigns cell types based on gene expression profiles within heterogeneous cell populations. We will further use CellTypist to classify cell types of a previously unannotated dataset and ingest the dataset with Lamin. Finally, we will demonstrate how to fetch datasets with cell type queries using Lamin.

## Setup

In [None]:
# suppress a warning that is triggered when using celltypist
import warnings

warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")

In [None]:
import celltypist
import pandas as pd

## Creating the CellTypist cell type registry

### Fetching CellTypist's immune cell encyclopedia

As a first step, we read in CellTypist's immune cell encyclopedia. It provides a mapped Cell Ontology (CL) `ontology_id` for the majority of terms.

In [None]:
celltypist_df = pd.read_excel(
    "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"
)
celltypist_df

CellTypist uses two hierarchy levels of cell types (`High-hierarchy cell types` and `Low-hierarchy cell types`). By taking the set difference between the two levels, we learn that 4 terms of the `High-hierarchy cell types` do not appear among the `Low-hierarchy cell types` and are therefore not mapped to the Cell Ontology:

In [None]:
high_terms = set(celltypist_df["High-hierarchy cell types"].unique())
low_terms = set(celltypist_df["Low-hierarchy cell types"].unique())

In [None]:
high_terms_umapped = high_terms.difference(low_terms)
high_terms_umapped
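
To make the set logic concrete, here is a minimal sketch on a hypothetical toy table. The term names are made up for illustration and are not taken from the CellTypist encyclopedia. It shows that `difference` keeps the high-level terms that have no identically named low-level entry, whereas `intersection` would return the terms present at both levels.

```python
# Toy stand-ins for the two hierarchy columns (hypothetical values).
high_level = ["B cells", "B cells", "T cells", "T cells", "Erythroid"]
low_level = [
    "B cells",
    "Memory B cells",
    "Helper T cells",
    "Regulatory T cells",
    "Erythrocytes",
]

high_set, low_set = set(high_level), set(low_level)

# Terms that appear only at the high level.
only_high = sorted(high_set.difference(low_set))
# Terms that appear at both levels.
shared = sorted(high_set.intersection(low_set))

print(only_high)
print(shared)
```

Only the terms in `only_high` would need separate handling, because the shared terms are already covered by the low-level rows.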

We want to ensure that all cell types of any dataset are eventually mapped against the Cell Ontology and queryable. Therefore, we will later create Lamin records for these unmapped terms. As a next step, we register all `Low-hierarchy cell types` with Lamin.

### Register CellTypist cell type encyclopedia in LaminDB

```{warning}

Please ensure that you have created or loaded a LaminDB instance before running the remaining part of this notebook!
```

In [None]:
# A lamindb instance containing Bionty schema (skip if you already loaded your instance)

!lamin init --storage celltypist --schema bionty

Next we import `lamindb` and `lnschema_bionty`, which connects [Bionty](https://github.com/laminlabs/bionty) with [LaminDB](https://github.com/laminlabs/lamindb). This lets us map cell types against ontologies and create SQL records within LaminDB so that they become queryable.

In [None]:
import lamindb as ln
from lnschema_bionty import CellType

celltype_bionty = CellType.bionty()  # equivalent to bionty.CellType()

In [None]:
# Check out which ontology of cell types is used in bionty
celltype_bionty

Let's check the fields of the `CellType` table in `lnschema-bionty`. All `CellType` fields are accessible via auto-completion.

In [None]:
CellType._meta.fields

We'll use {func}`docs:lamindb.parse` to parse the `Low-hierarchy cell types` from the `celltypist_df` into Bionty Schema records that are ready to be added to our LaminDB instance. Note how the DataFrame values correspond to the fields of `CellType`.

In [None]:
records = ln.parse(
    celltypist_df,
    {
        "Low-hierarchy cell types": CellType.name,
        "Description": CellType.definition,
        "Cell Ontology ID": CellType.ontology_id,
    },
)

In [None]:
len(records)

In [None]:
records[:2]

All records now contain a unique record ID, the proper cell type name, the Cell Ontology ID, the cell type definition and the operation ID that generated the records.
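
The column-to-field mapping can be pictured with a small, library-free sketch. Everything in it is hypothetical: `ToyCellType` is a stand-in dataclass, the rows and IDs are invented, and the loop only illustrates the idea of mapping columns to record fields. It is not how LaminDB parses data internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCellType:
    """Hypothetical stand-in for a CellType record."""

    name: str
    definition: Optional[str] = None
    ontology_id: Optional[str] = None


# Column name -> record field, mirroring the mapping used in this section.
column_to_field = {
    "Low-hierarchy cell types": "name",
    "Description": "definition",
    "Cell Ontology ID": "ontology_id",
}

# Two toy rows; all values (including the IDs) are invented.
rows = [
    {
        "Low-hierarchy cell types": "Toy cell A",
        "Description": "first toy description",
        "Cell Ontology ID": "TOY:0001",
    },
    {
        "Low-hierarchy cell types": "Toy cell B",
        "Description": "second toy description",
        "Cell Ontology ID": None,
    },
]

# Build one record per row by renaming each column to its target field.
toy_records = [
    ToyCellType(**{field: row[column] for column, field in column_to_field.items()})
    for row in rows
]

print(toy_records[0])
```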

### Create records for High-hierarchy cell types

As mentioned above, 4 `High-hierarchy cell types` are not present in the `Low-hierarchy cell types` and do not have ontology metadata.

In [None]:
high_terms_umapped

Here we have 2 options to add these 4 terms:

1. Annotate terms with ontology metadata using Bionty's lookup function
2. Create records without metadata

Let's look up `ontology_id` for T cells from the [Cell Ontology (cl)](https://www.ebi.ac.uk/ols/ontologies/cl).
For this purpose, we create a `lookup` instance from the `celltype_bionty` object.
This allows us to use autocompletion to find our cell type of interest, here T cell as an example.

In [None]:
celltype_bionty_lookup = celltype_bionty.lookup()
celltype_bionty_lookup.T_cell

Alternatively, you can search for a standard term using fuzzy string matching.
This will automatically search for the best match in the complete Cell Ontology.

In [None]:
celltype_bionty.fuzzy_match("T cells", celltype_bionty.name)
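
As a rough intuition for what fuzzy matching does, the sketch below uses Python's standard-library `difflib` on a hypothetical handful of term names. It is an assumption-laden illustration only: Bionty's matching may use a different algorithm and scoring, and `closest_term` is a helper invented for this example.

```python
from difflib import get_close_matches

# Hypothetical subset of ontology term names.
ontology_names = ["T cell", "B cell", "natural killer cell", "erythrocyte"]


def closest_term(query, names, cutoff=0.6):
    """Return the best case-insensitive fuzzy match for `query`, or None."""
    lowered = {name.lower(): name for name in names}
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


print(closest_term("T cells", ontology_names))
```

With this toy list, the plural query "T cells" resolves to the singular term "T cell", which is the kind of near-miss that fuzzy matching is meant to absorb.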

Now, we can create a new CellType record for "T cells" with metadata from the "T cell" lookup:

In [None]:
record_t_cell = CellType.from_bionty(celltype_bionty_lookup.T_cell)
record_t_cell

If you want the record name to be exactly "T cells", as in the CellTypist ontology, you can change it and add the name "T cell" as a synonym:

In [None]:
record_t_cell.name = "T cells"
record_t_cell.add_synonym("T cell")

record_t_cell

Add to the records list:

In [None]:
records.append(record_t_cell)

For the remaining 3 terms, we directly create records without additional metadata to highlight the difference between records with and without metadata:

In [None]:
records.append(CellType(name="B-cell lineage"))
records.append(CellType(name="Cycling cells"))
records.append(CellType(name="Erythroid"))

In [None]:
records[-3:]

These 3 records lack Cell Ontology IDs and cell type descriptions. However, when new cell types are discovered, a corresponding Cell Ontology term may not yet exist, and Lamin supports this use case as well.

All that is left to do is to add these records to our LaminDB instance using {func}`docs:lamindb.save`.

In [None]:
ln.save(records);

### Accessing the CellTypist ontology registry in LaminDB

The previously added CellTypist ontology registry is now available in LaminDB.
To retrieve the full ontology table as a Pandas DataFrame we can use {func}`docs:lamindb.select`:

In [None]:
ln.select(CellType).df()

This enables us to look for cell types by creating a lookup object from our new `CellType` registry.

In [None]:
db_lookup = CellType.lookup()

In [None]:
db_lookup.Memory_B_cells
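
Lookup attributes such as `Memory_B_cells` are identifier-friendly versions of the record names. The sketch below shows one way such keys could be derived. The normalization rule in `to_attribute_key` is an assumption made for illustration and is not taken from the LaminDB source, and the registry content is a toy stand-in.

```python
import re
from types import SimpleNamespace


def to_attribute_key(name):
    """Turn a free-text name into an identifier-like key (assumed convention)."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")


# Hypothetical registry content: name -> record (a plain dict here).
toy_registry = {
    "Memory B cells": {"name": "Memory B cells"},
    "Tcm/Naive helper T cells": {"name": "Tcm/Naive helper T cells"},
}

# Expose each record as an attribute so that editors can autocomplete it.
toy_lookup = SimpleNamespace(
    **{to_attribute_key(name): record for name, record in toy_registry.items()}
)

print(toy_lookup.Memory_B_cells["name"])
print(toy_lookup.Tcm_Naive_helper_T_cells["name"])
```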

## Annotate a dataset with cell types using CellTypist

### Annotate cell types predicted with CellTypist

We now demonstrate how simple it is to predict and add cell types to LaminDB with CellTypist.
Our dataset of choice is a simple sample dataset together with a sample model.

In [None]:
input_file = celltypist.samples.get_sample_csv()
input_file

In [None]:
predictions = celltypist.annotate(
    input_file, model="Immune_All_Low.pkl", majority_voting=True
)

Now that we've predicted all cell types, we create an [AnnData](https://anndata.readthedocs.io/en/latest) object that we will eventually track with LaminDB.

In [None]:
adata_annotated = predictions.to_adata()

In [None]:
adata_annotated.obs

Parse cell type labels as we've seen above.

In [None]:
celltypes = ln.parse(adata_annotated.obs.predicted_labels, CellType.name)

In [None]:
celltypes[:2]
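
Per-cell predictions repeat the same labels many times, so only the unique labels matter when linking cell types. A minimal sketch of a sanity check, assuming hypothetical predicted labels and registry names, is to count the labels and list any that have no registry entry:

```python
from collections import Counter

# Hypothetical per-cell predictions and registry names.
predicted_labels = [
    "Memory B cells",
    "Memory B cells",
    "Tcm/Naive helper T cells",
    "Unknown toy label",
]
registry_names = {"Memory B cells", "Tcm/Naive helper T cells", "T cells"}

# Number of cells per predicted label.
label_counts = Counter(predicted_labels)
# Predicted labels without a matching registry name.
unregistered = sorted(set(label_counts) - registry_names)

print(label_counts.most_common(1))
print(unregistered)
```

An empty `unregistered` list would indicate that every predicted label already has a matching name in the registry.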

### Track the annotated dataset in LaminDB

Let's enable tracking of the current notebook as the transform of this file using {func}`docs:lamindb.track`:

In [None]:
ln.track()

Create a file record for the AnnData object using {func}`docs:lamindb.File`.
We also give the dataset a name, which adds clarity and can be queried for.

In [None]:
file_annotated = ln.File(adata_annotated, name="sample_cell_by_gene-celltypist")

In [None]:
ln.save(file_annotated)

Link cell types to the file record:

In [None]:
file_annotated.cell_types.set(celltypes)

The file is now tracked, and we can search for it, for example by querying for a specific cell type.

In [None]:
# filter on the name of the looked-up record rather than on the record object itself
ln.select(ln.File).filter(
    cell_types__name=db_lookup.Tcm_Naive_helper_T_cells.name
).df()

Or find the notebook in which the file was annotated by CellTypist:

In [None]:
ln.select(ln.Transform).filter(files__name__icontains="celltypist").df()

## Conclusion

Lamin makes it easy to annotate cell types with ontology information and to track any datasets with such annotated cell types.
It does not matter whether the cell types were already part of an ontology or are newly found: Lamin supports both use cases.

## Try it yourself

This notebook is available at [https://github.com/laminlabs/lamin-examples](https://github.com/laminlabs/lamin-examples).