# Create a cell type registry in LaminDB from the CellTypist ontology

[Publication](https://www.science.org/doi/10.1126/science.abl5197)

In [None]:
# Suppress the numba 'nopython' deprecation warnings triggered when importing celltypist
import warnings

warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")

In [None]:
import celltypist
import pandas as pd

## CellTypist ontology reference

CellTypist's Cell Type Encyclopedia maps the majority of its terms to a Cell Ontology (cl) `ontology_id`.

In [None]:
celltypist_ontology = {
    "v1": "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v1/tables/Basic_celltype_information.xlsx",
    "v2": "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx",
}

Let's read in the metadata table of v2:

In [None]:
ref = pd.read_excel(celltypist_ontology["v2"])

ref

Four high-hierarchy terms do not appear among the low-hierarchy cell types and are therefore not annotated with Cell Ontology (cl) IDs:

In [None]:
high_terms = set(ref["High-hierarchy cell types"].unique())
low_terms = set(ref["Low-hierarchy cell types"].unique())

In [None]:
high_terms.difference(low_terms)
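
To make the set logic explicit, here is a minimal sketch on toy data (the term sets below are illustrative, not the contents of `ref`). It shows that `difference` keeps only the terms that exist at the high level of the hierarchy and nowhere at the low level:

```python
# Toy term sets for illustration only; the real sets are built from `ref`.
toy_high_terms = {"T cells", "B cells", "Cycling cells"}
toy_low_terms = {"B cells", "Memory B cells", "Regulatory T cells"}

# Terms present at the high level but absent from the low level.
# Equivalent to: toy_high_terms - toy_low_terms
only_high = toy_high_terms.difference(toy_low_terms)

print(sorted(only_high))  # ['Cycling cells', 'T cells']
```

Terms found this way have no row of their own in the table, which is why they carry no ontology ID.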

## Register CellTypist cell type ontology in LaminDB



In [None]:
# Initialize a LaminDB instance with the bionty schema (skip if you have already loaded your instance)
# The lamin1 schema is needed for linking CellType directly to File

!lamin init --storage celltypist --schema bionty,lamin1

In [None]:
import lamindb as ln
from lnschema_bionty import CellType

bionty_celltype = CellType.bionty()

In [None]:
# Check the fields of CellType table in lnschema-bionty
# CellType fields are accessible via auto-completion `CellType.`

CellType.__fields__.keys()

### Create records for Low-hierarchy cell types

In [None]:
records = ln.parse(
    ref,
    {
        "Low-hierarchy cell types": CellType.name,
        "Description": CellType.definition,
        "Cell Ontology ID": CellType.ontology_id,
    },
)

In [None]:
records[:2]
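
Conceptually, the mapping passed to `ln.parse` pairs each table column with a record field. The sketch below mimics that idea with plain dictionaries. It is not how `ln.parse` is implemented; the rows, the placeholder IDs, and the helper `rename_columns` are hypothetical and serve only to illustrate the column-to-field mapping:

```python
# Hypothetical rows shaped like the CellTypist table (values are placeholders).
rows = [
    {
        "Low-hierarchy cell types": "Example cell A",
        "Description": "toy description A",
        "Cell Ontology ID": "CL:placeholder-A",
    },
    {
        "Low-hierarchy cell types": "Example cell B",
        "Description": "toy description B",
        "Cell Ontology ID": "CL:placeholder-B",
    },
]

# Column name in the table -> field name on the record.
column_to_field = {
    "Low-hierarchy cell types": "name",
    "Description": "definition",
    "Cell Ontology ID": "ontology_id",
}


def rename_columns(row, mapping):
    """Return a dict keyed by field name, keeping only the mapped columns."""
    return {field: row[column] for column, field in mapping.items()}


parsed = [rename_columns(row, column_to_field) for row in rows]
print(parsed[0])
```

Each resulting dictionary holds the keyword arguments one would expect a record for that row to carry.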

### Create records for High-hierarchy cell types

Only 4 High-hierarchy cell types are not present among the Low-hierarchy cell types, and these terms don't have ontology metadata:

In [None]:
high_terms.difference(low_terms)

There are 2 options for adding these 4 terms:

1. Annotate the terms with ontology metadata using the bionty lookup function
2. Create records without metadata

In [None]:
# Let's look up ontology_id for T cells from the Cell Ontology (cl)

lookup = bionty_celltype.lookup()
lookup.T_cell

In [None]:
# Alternatively, you can search for a standard term by fuzzy string matching

bionty_celltype.fuzzy_match("T cells", bionty_celltype.name)
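
To illustrate what fuzzy string matching does in general, the sketch below uses `difflib` from the Python standard library. This is an assumption-laden stand-in, not bionty's implementation: the vocabulary is a toy list and `closest_terms` is a hypothetical helper.

```python
from difflib import get_close_matches

# Toy vocabulary; the real search runs against Cell Ontology names.
standard_names = [
    "T cell",
    "B cell",
    "erythroid lineage cell",
    "natural killer cell",
]


def closest_terms(query, vocabulary, n=3, cutoff=0.6):
    """Return up to `n` vocabulary entries most similar to `query`, best first."""
    # Compare case-insensitively, but return the original spelling.
    lowered = {name.lower(): name for name in vocabulary}
    hits = get_close_matches(query.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


matches = closest_terms("T cells", standard_names)
print(matches[0])  # T cell
```

The plural "T cells" differs from the standard name "T cell" by a single character, so it ranks first.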

Now, we can create a new CellType record for "T cells" with metadata from the "T cell" lookup:

In [None]:
record_t_cell = CellType(lookup.T_cell)
record_t_cell

If you want the record name to be exactly "T cells", as in the CellTypist ontology, you can change it and add the name "T cell" as a synonym:

(synonyms are concatenated with "|")

In [None]:
record_t_cell.name = "T cells"
record_t_cell.synonyms += "|T cell"

record_t_cell
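
String concatenation with `+=` assumes the record already has a non-empty synonyms string. A more defensive variant could look like the sketch below, where `add_synonym` is a hypothetical helper (not part of LaminDB or bionty) and the synonym strings are example values:

```python
def add_synonym(synonyms, new_synonym, sep="|"):
    """Append `new_synonym` to a `sep`-delimited string, skipping duplicates.

    `synonyms` may be None or empty, in which case only the new synonym is returned.
    """
    existing = [s for s in (synonyms or "").split(sep) if s]
    if new_synonym not in existing:
        existing.append(new_synonym)
    return sep.join(existing)


print(add_synonym("T-cell|T lymphocyte", "T cell"))  # T-cell|T lymphocyte|T cell
print(add_synonym(None, "T cell"))                   # T cell
print(add_synonym("T cell", "T cell"))               # T cell
```

With such a helper, the assignment would read `record_t_cell.synonyms = add_synonym(record_t_cell.synonyms, "T cell")`.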

Add to the records list:

In [None]:
records.append(record_t_cell)

For the remaining 3 terms, we directly create records without additional metadata:

In [None]:
records.append(CellType(name="B-cell lineage"))
records.append(CellType(name="Cycling cells"))
records.append(CellType(name="Erythroid"))

In [None]:
records[-3:]

In [None]:
ln.add(records);

### CellTypist ontology registry in LaminDB

To retrieve the full ontology table:

In [None]:
ln.select(CellType).df()

Look up a name via auto-completion:

In [None]:
db_lookup = CellType.lookup()

In [None]:
db_lookup.Memory_B_cells

## Annotate a dataset with cell types using CellTypist

In [None]:
input_file = celltypist.samples.get_sample_csv()

input_file

In [None]:
predictions = celltypist.annotate(input_file, majority_voting=True)

In [None]:
adata_annotated = predictions.to_adata()

In [None]:
adata_annotated.obs

Parse cell type labels:

In [None]:
celltypes = ln.parse(adata_annotated.obs.predicted_labels, CellType.name)

In [None]:
celltypes[:2]

## Track the annotated dataset in LaminDB

Let's enable tracking of the current notebook as the transform of this file:

In [None]:
ln.track()

Create a file record from the AnnData object:

In [None]:
file_annotated = ln.File(adata_annotated, name="sample_cell_by_gene-celltypist")

Link cell types to the file record:

In [None]:
file_annotated.cell_types = celltypes

In [None]:
ln.add(file_annotated);

Now we can find the file by querying for one of its cell types:

In [None]:
ln.select(ln.File).join(ln.File.cell_types).where(
    CellType.name == db_lookup.Tcm_Naive_helper_T_cells
).all()
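
The query above relies on a many-to-many link between files and cell types. As a rough mental model, the sketch below reproduces the same kind of join with `sqlite3` from the standard library. The table and column names are hypothetical and do not reflect LaminDB's actual schema; the rows are toy data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE file (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cell_type (id INTEGER PRIMARY KEY, name TEXT);
    -- Link table: one row per (file, cell type) pair
    CREATE TABLE file_cell_type (
        file_id INTEGER REFERENCES file(id),
        cell_type_id INTEGER REFERENCES cell_type(id)
    );
    INSERT INTO file VALUES (1, 'sample_cell_by_gene-celltypist'), (2, 'another-file');
    INSERT INTO cell_type VALUES (1, 'Tcm/Naive helper T cells'), (2, 'Memory B cells');
    INSERT INTO file_cell_type VALUES (1, 1), (1, 2), (2, 2);
    """
)

# Find every file linked to a given cell type name.
query = """
SELECT file.name
FROM file
JOIN file_cell_type ON file_cell_type.file_id = file.id
JOIN cell_type ON cell_type.id = file_cell_type.cell_type_id
WHERE cell_type.name = ?
ORDER BY file.name
"""
files = [row[0] for row in con.execute(query, ("Tcm/Naive helper T cells",))]
con.close()

print(files)  # ['sample_cell_by_gene-celltypist']
```

Only the first file is linked to the queried cell type, so it is the only one returned.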