# Annotation

## Motivation

To understand your data better and make use of existing knowledge, it is important to figure out the "cellular identity" of each of the cells in your data. The process of labeling groups of cells in your data based on known (or sometimes unknown) cellular phenotypes is called "cell annotation". Whereas there are many ways to annotate your cells (e.g. based on batch, disease, sex and more), in this notebook we will focus on the annotation of "cell types".<br>
So what is a cell type? Biologists use the term cell type to denote a cellular phenotype that is robust across datasets, identifiable based on expression of specific markers (i.e. proteins or gene transcripts), and often linked to specific functions. For example, a plasma B cell is a type of white blood cell that secretes antibodies used to fight pathogens and it can be identified using specific markers. Knowing which cell types are in your sample is essential in understanding your data. For example, knowing that there are specific immune cell types in a tumor or unusual hematopoietic stem cells in your bone marrow sample can be a valuable insight into the disease you might be studying.<br>
However, like with any categorization the size of categories and the borders drawn between them are partly subjective and can change over time, e.g. because new technologies allow for a higher resolution view of cells, or because specific "sub-phenotypes" that were not considered biologically meaningful are found to have important biological implications (see e.g. {cite}`anno:KadurLakshminarasimhaMurthy2022`). Cell types are therefore often further classified into "subtypes" or "cell states" (e.g. activated versus resting) and some researchers use the term "cell identity" to avoid this sometimes arbitrary distinction of cell types, cell subtypes and cell states. For a more detailed discussion of this topic, we recommend the review by Wagner et al. {cite}`anno:Wagner2016` and the recently published review by Zeng {cite}`anno:ZENG20222739`.<br>
Similarly, multiple cell types can be part of a single continuum, where one cell type might transition or differentiate into another. For example, in hematopoiesis cells differentiate from a stem cell into a specific immune cell type. Although hard borders between early and late stages of this differentiation are often drawn, the state of these cells can more accurately be described by the differentiation coordinate between the less and more differentiated cellular phenotypes. We will discuss differentiation and cellular trajectories in subsequent chapters.<br>
So how do we go about annotating cells in single-cell data? There are multiple ways to do it and we will give an overview of different approaches below. As we are working with transcriptomic data, each of these methods is ultimately based on the expression of specific genes or gene sets, or general transcriptomic similarity between cells. 

## Environment setup

We'll filter out some deprecation and performance warnings that do not affect our code:

In [89]:
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

from numba.core.errors import NumbaDeprecationWarning

warnings.simplefilter("ignore", category=NumbaDeprecationWarning)

Load the needed modules:

In [90]:
import urllib.request
from pathlib import Path

import celltypist
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
# import scarches as sca
import seaborn as sns
from celltypist import models
from scipy.sparse import csr_matrix

One more pandas warning to filter:

In [91]:
warnings.filterwarnings("ignore", category=pd.errors.PerformanceWarning)

We will continue working with the scRNA-seq dataset that we earlier preprocessed and will now annotate it.

Set figure parameters:

In [92]:
sc.set_figure_params(figsize=(5, 5))

## Load data

Let's read in the toy dataset we will use for this tutorial. It includes a single sample ("site4-donor8") of the data also used in other parts of the book. Moreover, cells that didn't pass QC have already been removed.

In [None]:
# this cell is tagged 'parameters' to use papermill
input_file = 'adata_compressed.h5ad'
count_layer_to_use = 'soupX_counts'

In [None]:
adata = sc.read_h5ad(input_file, backed=False)
adata

## Manual annotation

The classical or oldest way to perform cell type annotation is based on a single or small set of marker genes known to be associated with a particular cell type. This approach dates back to "pre-scRNA-seq times", when single cell data was low dimensional (e.g. FACS data with gene panels consisting of no more than 30-40 genes). It is a fast and transparent way to annotate your data. However, when no unique markers exist for a specific cell type (which is often the case) this approach can get more complicated and even less objective, with combinations of markers or expression thresholds necessary for proper annotation. A robust set of marker genes and prior knowledge or annotation experience can help here, but the approach comes with the risk of unclear and subjective decision-making. 

In this setting, the data is often clustered before annotation, so that we can annotate groups of cells instead of making a per-cell call. This is not only less laborious, but also more robust to noise: a single cell might not have a count for a specific marker even if it was expressed in that cell, simply due to the inherent sparsity of single cell data. Clustering enables the detection of cells highly similar in overall gene expression and can therefore account for drop-outs at single cell level. 

Finally, there are two angles from which to approach the marker-gene based annotation. One option is to work from a table of marker genes for all the cell types you expect in your data and check in which clusters those markers are expressed. The other option is to check which genes are highly expressed in the clusters you defined and then check if they are associated with known cell types or states. If necessary, one can move back and forth between those approaches. We will show examples of both below.

### From markers to cluster annotation

Let's get started with the known-marker based approach. We will first list a set of markers for cell types in the bone marrow here that is based on literature: previous papers that study specific cell types and subtypes and report marker genes for those cell types. Note that markers at the protein level (e.g. used for FACS) sometimes do not work as well in transcriptomic data, hence using markers from RNA-based papers is often more likely to work. Moreover, sometimes markers in one dataset do not turn out to work as well in other datasets. Ideally a marker set is therefore validated across multiple datasets. Finally, it is often useful to work together with experts: as a bioinformatician, try to team up with a biologist who has more extensive knowledge of the tissue, the biology, the expected cell types and markers etc.

In [None]:
# read in csv table
cms_genes = pd.read_csv("resources/cms_genes.csv")
cms_genes

In [None]:
# convert cms_genes into dictionary
cms_genes_dict = {}
for i, row in cms_genes.iterrows():
    if row["CMS"] not in cms_genes_dict:
        cms_genes_dict[row["CMS"]] = []
    cms_genes_dict[row["CMS"]].append(row["gene"])
cms_genes_dict

Subset to only the markers that were detected in our data. We will loop through all cell types and keep only the genes that we find in our adata object as markers for that cell type. This will prevent errors once we start plotting.

In [96]:
marker_genes_in_data = {}
for ct, markers in cms_genes_dict.items():
    markers_found = []
    for marker in markers:
        if marker in adata.var.index:
            markers_found.append(marker)
    marker_genes_in_data[ct] = markers_found

To see where these markers are expressed we can work with a 2-dimensional visualization of the data, such as a UMAP. We'll construct that UMAP here based on the scran-normalized count data, using only the highly deviant genes. Note that we first perform a PCA on the normalized counts to reduce dimensionality of the data before we generate the UMAP.

Our raw counts remain available in `adata.layers[count_layer_to_use]`, so we will still have access to them later if needed. We now set our `adata.X` to the scran-normalized, log-transformed counts that were derived from that layer.

In [97]:
adata.X = adata.layers[f"scran_normalization_of_{count_layer_to_use}"]

<div class="alert alert-block alert-info">
<b>Important variable:</b> The normalized matrix is based on the following count matrix:</div>

In [None]:
count_layer_to_use

We furthermore set our `adata.var.highly_variable` to the highly deviant genes. Scanpy uses this var column in downstream calculations, such as the PCA below.

In [98]:
adata.var["highly_variable"] = adata.var["highly_deviant"]

In [128]:
# keep only the markers that are also flagged as highly variable
# (marker_genes_in_data already contains only genes present in adata.var)
marker_genes_in_data_hv = {
    ct: [m for m in markers if adata.var.loc[m, "highly_variable"]]
    for ct, markers in marker_genes_in_data.items()
}

Now perform PCA. We use the highly deviant genes (set as "highly variable" above) to reduce noise and strengthen signal in our data and set number of components to the default n=50. 50 is on the high side for data of a single sample, but it will ensure that we don't ignore important variation in our data.

In [99]:
sc.tl.pca(adata, n_comps=50, use_highly_variable=True)

Calculate the neighbor graph based on the PCs:

In [100]:
sc.pp.neighbors(adata)

And use that neighbor graph to calculate a 2-dimensional UMAP embedding of the data:

In [101]:
sc.tl.umap(adata)

Now show expression of the markers using the calculated UMAP. We'll limit ourselves to B/plasma cell subtypes for this example. Note from the marker dictionary above that there are three negative markers in our list: IGHD and IGHM for B1 B, and PAX5 for plasmablasts, meaning that these cell types are expected to express those markers at low levels or not at all.

Let's list the marker sets we want to plot (these are keys of the marker dictionary above):

In [102]:
cms_tps = [
    "CMS1",
    "CMS2",
    "CMS3",
    "CMS4",
]

And now plot one UMAP per marker for each of these sets. Note that we can only plot the markers that are present in our data; here we further restrict the plots to markers flagged as highly variable.

In [None]:
for ct in cms_tps:
    print(f"{ct.upper()}:")  # print cell subtype name
    sc.pl.umap(
        adata,
        color=marker_genes_in_data_hv[ct],
        vmin=0,
        vmax="p99",  # set vmax to the 99th percentile of the gene count instead of the maximum, to prevent outliers from making expression in other cells invisible. Note that this can cause problems for extremely lowly expressed genes.
        sort_order=False,  # do not plot highest expression on top, to not get a biased view of the mean expression among cells
        frameon=False,
        cmap="Reds",  # or choose another color map e.g. from here: https://matplotlib.org/stable/tutorials/colors/colormaps.html
    )
    print("\n\n\n")  # print white space for legibility
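The effect of `vmax="p99"` in the plotting call above can be sketched with plain NumPy. This is only an illustration on simulated values (the distribution and the outlier values are made up); it shows why a percentile-based upper limit keeps a few extreme cells from stretching the color scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expression of one gene: 1000 ordinary cells plus 5 extreme outliers.
expression = np.concatenate([rng.exponential(1.0, size=1000), np.full(5, 100.0)])

vmax_max = expression.max()               # upper color limit without clipping
vmax_p99 = np.percentile(expression, 99)  # upper color limit at the 99th percentile

# Values above the limit are drawn in the darkest color instead of stretching the scale.
clipped = np.clip(expression, 0, vmax_p99)

print(f"max: {vmax_max:.1f}, 99th percentile: {vmax_p99:.1f}")
```

With the maximum as the limit, almost all cells would fall into the lightest part of the color map; with the percentile limit, the ordinary range of values uses the full color scale.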

As you can see, even markers for a single cell type are often expressed in different subsets of the data, i.e. individual markers are often not uniquely expressed in a single cell type. Rather, it is the intersection of those subsets that will tell you where your cell type of interest is. 

Another thing you might notice is that markers are often sparsely expressed, i.e. it is often only a subset of cells of a cell type in which a marker was detected. This is due to the nature of scRNA-seq data: we only sequence a small subset of the total amount of RNA molecules in the cell and due to this subsampling we will sometimes not sample transcripts from specific genes in a cell even if they were expressed in that cell. Therefore, we do not annotate single cells based on a minimum expression threshold of e.g. a set of markers. Instead, we first subdivide the data into groups of similar cells (i.e. "partition" the data) by clustering, thereby accounting for "missing transcripts" of single genes and rather grouping based on overall transcriptomic similarity. We can then annotate those clusters based on their overall marker expression patterns. 
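The subsampling effect described above can be illustrated with a small simulation. This is a minimal sketch and not part of the analysis: the number of cells, the true transcript count per cell and the capture probability are made-up values, and binomial sampling is a simplifying assumption about how transcripts are captured.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells = 1000         # hypothetical number of cells of one cell type
true_transcripts = 20  # hypothetical true marker transcripts per cell
capture_prob = 0.05    # hypothetical fraction of transcripts that gets sequenced

# Every cell truly expresses the marker, but only a subsample is observed.
observed = rng.binomial(true_transcripts, capture_prob, size=n_cells)

frac_zero = np.mean(observed == 0)
expected_mean = true_transcripts * capture_prob

print(f"cells with zero observed counts: {frac_zero:.0%}")
print(f"mean observed count: {observed.mean():.2f} (expected {expected_mean:.2f})")
```

Under these assumptions a substantial fraction of cells shows no counts for the marker although every simulated cell expresses it (in expectation 0.95^20, i.e. about 36%), whereas the group mean stays close to its expected value. This is the reason why annotating groups of similar cells is more robust than thresholding single cells.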

Let us cluster our data now. We will use the Leiden algorithm {cite}`anno:Traag2019` as discussed in the Clustering chapter to define a grouping of our data into similar subsets of cells:

In [114]:
sc.tl.leiden(adata, resolution=1, key_added="leiden_1")

In [None]:
sc.pl.umap(adata, color="leiden_1")

You might notice that this partitioning of the data is rather coarse and some of the marker expression patterns we saw above are not captured by this clustering. We can therefore try a higher resolution clustering by changing the resolution parameter of our clustering:

In [116]:
sc.tl.leiden(adata, resolution=2, key_added="leiden_2")

In [None]:
sc.pl.umap(adata, color="leiden_2")

Or with cluster numbers in the UMAP:

In [None]:
sc.pl.umap(adata, color="leiden_2", legend_loc="on data")

This clustering is a lot finer and will help us annotate the data in more detail. You can play around with the resolution parameter to find the setting that best captures the marker expression patterns you observe.

Scrolling back up, you will see that clusters 4 and 6 are the clusters consistently expressing Naive CD20+ B cell markers. We can also visualize this using a dotplot:

In [119]:
# cms_markers = {
#     ct: [m for m in ct_markers if m in adata.var.index]
#     for ct, ct_markers in cms_genes_dict.items()
#     if ct in cms_tps
# }

In [None]:
sc.pl.dotplot(
    adata,
    groupby="leiden_2",
    var_names=marker_genes_in_data_hv,
    standard_scale="var",  # standard scale: normalize each gene to range from 0 to 1
)
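As a rough sketch of what `standard_scale="var"` means conceptually: for each gene, the per-cluster mean expression is shifted and scaled so that it ranges from 0 to 1 across clusters. The table below is a toy example with hypothetical gene and cluster names; scanpy's own implementation may differ in details.

```python
import pandas as pd

# Hypothetical mean expression per cluster (rows) and gene (columns).
mean_expr = pd.DataFrame(
    {"geneA": [0.1, 2.0, 4.1], "geneB": [10.0, 30.0, 20.0]},
    index=["cluster0", "cluster1", "cluster2"],
)

# Min-max scale each gene (column) across clusters.
shifted = mean_expr - mean_expr.min(axis=0)
scaled = (shifted / shifted.max(axis=0)).fillna(0)  # genes that are constant become 0

print(scaled)
```

The scaling makes genes with very different dynamic ranges comparable in one plot, but it also hides absolute expression levels: a gene that is lowly expressed everywhere still reaches 1 in its highest cluster.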

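Using a combination of visual inspection of the UMAPs and the dotplot above we can now start annotating the data. Instead of labeling clusters by hand, we compute a simple score per cell for each marker set and assign the label with the highest score.

First, let's check how many marker genes are available for each set: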

In [None]:
# length of the gene list for each cell type
gene_list_len = {k: len(v) for k, v in marker_genes_in_data.items()}
gene_list_len

In [138]:
# Calculate scores manually as the mean expression of each marker set.
for ct in cms_tps:
    genes = marker_genes_in_data[ct]
    # np.asarray(...).ravel() yields a 1-D array for both sparse and dense .X
    adata.obs[f"{ct}_score"] = np.asarray(adata[:, genes].X.mean(axis=1)).ravel()

In [None]:
# plot the scores
sc.pl.umap(adata, color=["CMS1_score", "CMS2_score", "CMS3_score", "CMS4_score"], ncols=2)

In [None]:
# assign the cell type based on the highest score
adata.obs["CMS"] = adata.obs[["CMS1_score", "CMS2_score", "CMS3_score", "CMS4_score"]].idxmax(axis=1)
# replace CMS1_score, CMS2_score, CMS3_score, CMS4_score with CMS1, CMS2, CMS3, CMS4
adata.obs["CMS"] = adata.obs["CMS"].str.replace("_score", "")

# plot
sc.pl.umap(adata, color="CMS")

In [None]:
# use cmscaller or cmsclassifier later for this purpose to have a statistically more robust method
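Assigning the label with the highest mean score always returns a label, even when two scores are nearly tied. The following minimal sketch, using hypothetical score values, additionally records the margin between the best and the second-best score so that ambiguous cells can be flagged. The threshold `min_margin` is an arbitrary illustrative choice, not a recommended value.

```python
import numpy as np
import pandas as pd

# Hypothetical per-cell scores; in the notebook these would be the *_score columns.
scores = pd.DataFrame(
    {
        "CMS1_score": [0.90, 0.30, 0.50],
        "CMS2_score": [0.20, 0.35, 0.10],
        "CMS3_score": [0.10, 0.32, 0.45],
        "CMS4_score": [0.05, 0.10, 0.20],
    },
    index=["cell_a", "cell_b", "cell_c"],
)

# Label with the highest score, as in the assignment above.
labels = scores.idxmax(axis=1).str.replace("_score", "", regex=False)

# Margin between the highest and the second-highest score per cell.
sorted_scores = np.sort(scores.to_numpy(), axis=1)
margin = pd.Series(sorted_scores[:, -1] - sorted_scores[:, -2], index=scores.index)

min_margin = 0.1  # arbitrary illustrative threshold
labels_flagged = labels.where(margin >= min_margin, "ambiguous")

print(labels_flagged)
```

Cells whose two best scores are close are reported as "ambiguous" rather than being forced into one group, which gives a first impression of how decisive the simple mean score is.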

### From cluster differentially expressed genes to cluster annotation

Conversely, we can calculate marker genes per cluster and then look up whether we can link those marker genes to any known biology such as cell types and/or states. For marker gene calculation of clusters simple methods such as the Wilcoxon rank-sum test are thought to perform best {cite}`anno:Pullin2022.05.09.490241`. Importantly, as the definition of the clusters is based on the same data as used for these statistical tests, the p-values of these tests will be artificially small, i.e. the significance is inflated, as also described here {cite}`anno:ZHANG2019383`.
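The inflation of significance can be illustrated without any real data. In the sketch below, values of one hypothetical gene are drawn from a single homogeneous population, "clusters" are then defined by splitting at the median of those same values, and the resulting rank-sum statistic is compared to the one obtained with labels that are independent of the data. The rank-sum z-score is computed by hand with a normal approximation and without tie correction to keep the example free of extra dependencies; it is an illustration under these assumptions, not scanpy's implementation.

```python
import numpy as np


def rank_sum_z(x, y):
    """Normal-approximation z-score of the Wilcoxon rank-sum statistic (no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = np.argsort(np.argsort(np.concatenate([x, y]))) + 1  # ranks 1..n
    u = ranks[:n1].sum() - n1 * (n1 + 1) / 2  # Mann-Whitney U of x
    mean_u = n1 * n2 / 2
    sd_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u


rng = np.random.default_rng(0)
expr = rng.normal(size=200)  # one homogeneous population, no true subgroups

# "Clusters" defined on the same values that are tested afterwards.
in_cluster = expr > np.median(expr)
z_double_dip = rank_sum_z(expr[in_cluster], expr[~in_cluster])

# Labels of the same sizes, but assigned independently of the data.
random_labels = rng.permutation(in_cluster)
z_independent = rank_sum_z(expr[random_labels], expr[~random_labels])

print(f"z (clusters from same data): {z_double_dip:.1f}")
print(f"z (independent labels):      {z_independent:.1f}")
```

Although there is no real structure in the simulated data, the test on the data-derived split yields the largest z-score possible for this sample size (about 12), which would correspond to an extremely small p-value. P-values from cluster marker tests are therefore better used for ranking genes than as evidence that the clusters are real.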

Let's calculate the differentially expressed genes for every cluster, compared to the rest of the cells in our adata:

In [141]:
sc.tl.rank_genes_groups(
    adata, groupby="leiden_2", method="wilcoxon", key_added="dea_leiden_2"
)

We can visualize expression of the top differentially expressed genes per cluster with a standard scanpy dotplot:

In [None]:
sc.pl.rank_genes_groups_dotplot(
    adata, groupby="leiden_2", standard_scale="var", n_genes=5, key="dea_leiden_2"
)

As you can see above, a lot of the differentially expressed genes are highly expressed in multiple clusters. We can filter the differentially expressed genes to select for more cluster-specific differentially expressed genes:

In [147]:
# sc.tl.filter_rank_genes_groups(
#     adata,
#     min_in_group_fraction=0.1,
#     max_out_group_fraction=0.3,
#     key="dea_leiden_2",
#     key_added="dea_leiden_2_filtered",
# )

# probably a problem with this scanpy version, see https://github.com/scverse/scanpy/issues/3443
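The two fraction thresholds in the commented-out call above can be read as follows: a gene is kept for a cluster only if it is detected in at least `min_in_group_fraction` of the cells inside the cluster and in at most `max_out_group_fraction` of the cells outside it. The sketch below applies this idea to a toy count matrix. The helper `passes_fraction_filter` is hypothetical, and scanpy's function applies further criteria (such as a fold-change threshold) that are not modeled here.

```python
import numpy as np


def passes_fraction_filter(
    counts, in_group, min_in_group_fraction=0.1, max_out_group_fraction=0.3
):
    """Hypothetical helper: boolean mask over genes (columns of `counts`)."""
    detected = counts > 0
    frac_in = detected[in_group].mean(axis=0)    # fraction of in-group cells with counts
    frac_out = detected[~in_group].mean(axis=0)  # fraction of remaining cells with counts
    return (frac_in >= min_in_group_fraction) & (frac_out <= max_out_group_fraction)


# Toy data: 6 cells x 3 genes; the first three cells form the cluster of interest.
counts = np.array(
    [
        [5, 1, 0],
        [3, 2, 0],
        [4, 0, 0],
        [0, 1, 0],
        [0, 3, 1],
        [0, 2, 0],
    ]
)
in_group = np.array([True, True, True, False, False, False])

keep = passes_fraction_filter(counts, in_group)
print(keep)
```

Only the first gene is specific to the cluster in this toy example: the second gene is detected in all cells outside the cluster, and the third is not detected inside the cluster at all.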

Visualize the filtered genes:

In [148]:
# sc.pl.rank_genes_groups_dotplot(
#     adata,
#     groupby="leiden_2",
#     standard_scale="var",
#     n_genes=5,
#     key="dea_leiden_2_filtered",
# )

Let's take a look at cluster 12, which seems to have a set of relatively unique markers including CDK6, ETV6, NKAIN2, and GNAQ. Some googling tells us that NKAIN2 and ETV6 are hematopoietic stem cell markers {cite}`anno:SHI20222234` {cite}`anno:Wang1998-rx` (NKAIN2 was also present in our list above). In the UMAP we can see that these genes are expressed throughout cluster 12: 

In [None]:
adata.var.columns

In [None]:
sc.pl.umap(
    adata,
    color=["CDK1", "CMS2_score", "leiden_2"],
    vmax="p99",
    legend_loc="on data",
    frameon=False,
    cmap="Reds",
)

This highlights how complicated marker-based annotation is: it is sensitive to the cluster resolution you choose, the robustness and uniqueness of the marker sets you have, and your knowledge of the cell types to be expected in your data.

For this reason, the field is partly trying to move away from manual cluster annotation and rather moving towards automated annotation algorithms instead. The rest of this tutorial will focus on those options.

Before we move on, note that the labels assigned above remain stored in `adata.obs["CMS"]`, so that we can compare them with the automated annotations later.

## Automated annotation

### General remarks

The remaining methods in this tutorial are automated rather than manual. They are based on different principles, sometimes requiring pre-defined sets of markers, other times trained on pre-existing full scRNA-seq datasets. As discussed below, the resulting annotations can be of varying quality. It is therefore important to regard these methods as a starting point rather than an end-point of the annotation process. See also several reviews {cite}`anno:PASQUINI2021961`, {cite}`anno:Abdelaal2019` for a more elaborate discussion of automated annotation methods.

As said, the quality of automatically generated annotations can vary. More specifically, the quality of the annotations depends on:
1) The type of classifier chosen: Previous benchmark studies have shown that different types of classifiers often perform comparably, with neural network-based methods generally not outperforming general-purpose models such as support vector machines or linear regression models {cite}`anno:Abdelaal2019`, {cite}`anno:PASQUINI2021961`, {cite}`anno:Huang2021`.<br>
2) The quality of the data that the classifier was trained on. If the training data was not well annotated or annotated at low resolution, the classifier will do the same. Similarly, if the training data and/or its annotation was noisy, the classifier might not perform well.<br>
3) The similarity of your own data to the data that the classifier was trained on. For example, if the classifier was trained on a drop-seq single cell dataset and your data is 10X single nucleus rather than single cell drop-seq, this might worsen the quality of the annotation. Classifiers trained on cross-dataset atlases including a diversity of datasets might give more robust and better quality annotations than classifiers trained on a single dataset. An example is the CellTypist (an automated annotation method that will be discussed more extensively below) classifier trained on the Human Lung Cell Atlas {cite}`anno:Sikkema2023` which includes 14 different lung datasets. This model is likely to perform better on new lung data than a model that was trained on a single lung dataset.  

The aforementioned points highlight possible disadvantages of using classifiers, depending on the training data and model type. Nonetheless, there are several important advantages of using pre-trained classifiers to annotate your data. First, it is a fast and easy way to annotate your data. The annotation requires neither downloading nor preprocessing the training data and sometimes merely involves uploading your data to a web portal. Second, these methods don't rely on a partitioning of your data into clusters, as the manual annotation does. Third, pre-trained classifiers enable you to directly leverage the knowledge and information from previous studies, such as a high quality annotation. And finally, using such classifiers can help with harmonizing cell-type definitions across a field, thereby clearing the path towards a field-wide consensus on these definitions.

Finally, as these classifiers are often less transparent than e.g. manual marker-based annotation, a good uncertainty measure quantifying annotation uncertainty will improve the quality and usability of the method. We will discuss this more extensively further down.

### Marker gene-based classifiers

One class of automated cell type annotation methods relies on a predefined set of marker genes. Cells are classified into cell types based on their expression levels of these marker genes. Examples of such methods are Garnett {cite}`anno:Pliner2019` and CellAssign {cite}`anno:Zhang2019`. The more robust and generalizable the set of marker genes these models are based on, the better the model will perform. However, like with other models they are likely to be affected by batch effect-related differences between the data the model was trained on and the data that needs to be labeled. One of the advantages of these methods compared to models based on larger gene sets (see below) is that they are more transparent: we know on the basis of which genes the classification is done.<br>
We will not show an example of marker-based classifiers in this notebook, but encourage you to explore these yourself if you are interested.

### Classifiers based on a wider set of genes

It is worth noting that the methods discussed so far use only a small subset of the genes detected in the data: often a set of only 1 to ~10 marker genes per cell type is used. An alternative approach is to use a classifier that takes as input a larger set of genes (several thousands or more), thereby making more use of the breadth of scRNA-seq data. Such classifiers are trained on previously annotated datasets or atlases. Examples of these are CellTypist {cite}`anno:Conde2022` (see also https://www.celltypist.org, where data can be uploaded to a portal to get automated cell annotations) and Clustifyr {cite}`anno:Fu2020`. 

Let's try out CellTypist on our data. Based on the CellTypist tutorial (https://www.celltypist.org/tutorials) we know we need to prepare our data so that counts are normalized to 10,000 counts per cell, then log1p-transformed:

In [34]:
adata_celltypist = adata.copy()  # make a copy of our adata
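# set .X to a copy of the raw counts, so that the in-place normalization
# below cannot modify the counts layer stored in the original adata
adata_celltypist.X = adata_celltypist.layers[count_layer_to_use].copy()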
sc.pp.normalize_total(
    adata_celltypist, target_sum=10**4
)  # normalize to 10,000 counts per cell
sc.pp.log1p(adata_celltypist)  # log-transform
# make .X dense instead of sparse, for compatibility with celltypist:
adata_celltypist.X = adata_celltypist.X.toarray()
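What these two preprocessing steps do to the counts can be sketched with plain NumPy on a toy matrix. The count values are hypothetical; the sketch only mirrors the arithmetic (scaling every cell to 10,000 total counts, followed by a natural-log `log1p`), not scanpy's implementation.

```python
import numpy as np

# Hypothetical raw counts: 3 cells x 4 genes with different sequencing depths.
counts = np.array(
    [
        [10, 0, 5, 5],
        [100, 0, 50, 50],
        [1, 1, 1, 1],
    ],
    dtype=float,
)

target_sum = 1e4
depth = counts.sum(axis=1, keepdims=True)  # total counts per cell
normalized = counts / depth * target_sum   # every cell now sums to 10,000
log_normalized = np.log1p(normalized)      # natural log of (1 + x)

# Sanity check: undoing the log and summing per cell should give the target sum.
recovered_sums = np.expm1(log_normalized).sum(axis=1)
print(recovered_sums)
```

The first two cells differ only in sequencing depth and end up with identical values after normalization. The same `expm1` check can be applied to a prepared matrix to confirm that it is in the expected format before it is passed to a classifier.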

We'll now download the CellTypist models we want to use (the download cell is commented out here, which assumes the models are already available locally):

In [171]:
# models.download_models(
#     force_update=False,
#     model=["Human_Colorectal_Cancer.pkl", "Cells_Intestinal_Tract.pkl"],
# )

Let's try out both the `Human_Colorectal_Cancer` and the `Cells_Intestinal_Tract` models. In the code below, we treat the annotation from the first model as the finer one and the annotation from the second model as the coarser one:

In [178]:
model_crc = models.Model.load(model="Human_Colorectal_Cancer.pkl")
model_gut = models.Model.load(model="Cells_Intestinal_Tract.pkl")

For each of these, we can list the cell types it includes to check whether the cell types we expect in our data are covered:

In [None]:
model_gut.cell_types

In [None]:
model_crc.cell_types

Looks like the models include many different immune cell type progenitors!

Now let's run the models. First the coarse one:

In [None]:
predictions_gut = celltypist.annotate(
    adata_celltypist, model=model_gut, majority_voting=True
)

Transform the predictions to adata to get the full output...

In [182]:
predictions_gut_adata = predictions_gut.to_adata()

...and copy the results to our original AnnData object:

In [183]:
adata.obs["celltypist_cell_label_coarse"] = predictions_gut_adata.obs.loc[
    adata.obs.index, "majority_voting"
]
adata.obs["celltypist_conf_score_coarse"] = predictions_gut_adata.obs.loc[
    adata.obs.index, "conf_score"
]
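The `majority_voting=True` option used above refines the per-cell predictions within local subclusters: as we understand it, the cells are over-clustered and each subcluster receives its dominant predicted label. The toy example below sketches this idea with pandas. The column names mirror those in the CellTypist output, but the data are hypothetical, and CellTypist's actual procedure includes details (such as how the subclusters are found and how subclusters without a clear majority are handled) that are not modeled here.

```python
import pandas as pd

# Hypothetical per-cell predictions and subcluster memberships.
cells = pd.DataFrame(
    {
        "over_clustering": ["0", "0", "0", "0", "1", "1", "1"],
        "predicted_labels": [
            "B cell", "B cell", "T cell", "B cell",
            "T cell", "T cell", "Monocyte",
        ],
    }
)

# Most frequent predicted label within each subcluster.
dominant = cells.groupby("over_clustering")["predicted_labels"].agg(
    lambda s: s.value_counts().idxmax()
)

# Every cell inherits the dominant label of its subcluster.
cells["majority_voting"] = cells["over_clustering"].map(dominant)

print(cells)
```

Isolated predictions that disagree with their neighborhood are overruled in this scheme. This smooths noisy per-cell calls, but it can also erase rare cell types that do not form their own subcluster.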

Now the same for the finer annotations:

In [None]:
predictions_crc = celltypist.annotate(
    adata_celltypist, model=model_crc, majority_voting=True
)

In [185]:
predictions_crc_adata = predictions_crc.to_adata()

In [186]:
adata.obs["celltypist_cell_label_fine"] = predictions_crc_adata.obs.loc[
    adata.obs.index, "majority_voting"
]
adata.obs["celltypist_conf_score_fine"] = predictions_crc_adata.obs.loc[
    adata.obs.index, "conf_score"
]

Now plot:

In [None]:
sc.pl.umap(
    adata,
    color=["celltypist_cell_label_coarse", "celltypist_conf_score_coarse"],
    frameon=False,
    sort_order=False,
    wspace=1,
)

In [None]:
sc.pl.umap(
    adata,
    color=["celltypist_cell_label_fine", "celltypist_conf_score_fine"],
    frameon=False,
    sort_order=False,
    wspace=1,
)

One way of getting a feeling for the quality of these annotations is by looking if the observed cell type similarities correspond to our expectations:

In [196]:
# do only if more than one cell type is present
if len(adata.obs["celltypist_cell_label_fine"].unique()) > 1:
    sc.pl.dendrogram(adata, groupby="celltypist_cell_label_fine")

In [None]:
# do only if more than one cell type is present
if len(adata.obs["celltypist_cell_label_coarse"].unique()) > 1:
    sc.pl.dendrogram(adata, groupby="celltypist_cell_label_coarse")

This dendrogram partly reflects prior knowledge on cell type relations (e.g. B cells largely clustering together), but we also observe some unexpected patterns: Tcm/Naive helper T cells cluster with erythroid cells and macrophages rather than with the other T cells. This is a red flag! Possibly, the Tcm/Naive helper T cell annotations are wrong.

Now let's take a look at our earlier manual annotations:

In [None]:
sc.pl.umap(
    adata,
    color="celltypist_cell_label_fine",
    frameon=False,
    palette=["blue"],
)

sc.pl.umap(
    adata,
    color=["CMS"],
    frameon=False,
    palette=["red", "blue", "orange"],
)

You can see that our naive B cell annotation corresponds well to part of the automatic naive B cell annotation. Similarly, part of what we called transitional B cells is called "small pre-B cells" in their annotations and our B1 B cells correspond to their memory B cells, which is encouraging!

However, you'll also notice that our HSC + MK/E prog cluster is annotated as a mixture of T cells and HSCs/multipotent progenitors in their fine annotation, hence these annotations are partly contradictory. Looking at the confidence score of both annotations, we see that the annotation of the larger part of the cells is done with relatively low confidence, which is a useful indication that these annotations cannot be copied without careful validation and manual reviewing!

See here the breakdown of cluster 12 in terms of fine celltypist labels:

In [None]:
pd.crosstab(adata.obs.leiden_2, adata.obs.celltypist_cell_label_fine).loc[
    "12", :
].sort_values(ascending=False)

In the coarser CellTypist labels we observe different patterns: our cluster 12 is mostly annotated as B cells or Megakaryocyte precursors, again only partly corresponding to our annotations.

In [None]:
pd.crosstab(adata.obs.leiden_2, adata.obs.celltypist_cell_label_coarse).loc[
    "12", :
].sort_values(ascending=False)
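When clusters differ strongly in size, fractions are often easier to compare than raw counts. The sketch below uses the `normalize="index"` option of `pd.crosstab` on a toy table; the cluster and label values are hypothetical.

```python
import pandas as pd

# Hypothetical cluster assignments and predicted labels for six cells.
obs = pd.DataFrame(
    {
        "cluster": ["12", "12", "12", "12", "3", "3"],
        "label": ["HSC", "HSC", "T cell", "HSC", "B cell", "B cell"],
    }
)

# Each row (cluster) is divided by its total, so values are fractions per cluster.
fractions = pd.crosstab(obs["cluster"], obs["label"], normalize="index")

print(fractions.loc["12"].sort_values(ascending=False))
```

The same option could be added to the crosstab calls above to obtain, per cluster, the fraction of cells assigned to each label.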

Finally, store your adata object if desired:

In [226]:
# update / save to the same file (this overwrites the input file)
adata.write_h5ad(input_file, compression="gzip")

## References

```{bibliography}
:filter: docname in docnames
:labelprefix: anno
```

## Contributors
We gratefully acknowledge the contributions of:
### Authors
- Lisa Sikkema
- Maren Büttner
### Reviewers
- Lukas Heumos