# Normalization

## Motivation

Up to this point, we removed low-quality cells, ambient RNA contamination and doublets from the dataset and the data is available as a count matrix in the form of a numeric matrix of shape cells x genes. These counts represent the capture, reverse transcription and sequencing of a molecule in the scRNA-seq experiment. Each of these steps adds a degree of variability to the measured count depth for identical cells, so the difference in gene expression between cells in the count data might simply be due to sampling effects. This means that the dataset and therefore the count matrix still contains widely varying variance terms. Analyzing the dataset is often challenging as many statistical methods assume data with uniform variance structure. 

```{admonition} Gamma-Poisson distribution
A theoretically and empirically established model for UMI data is the Gamma-Poisson distribution which implies a quadratic mean-variance relation with $Var[Y] = \mu + \alpha \mu^2$ with mean $\mu$ and overdispersion $\alpha$. For $\alpha=0$ this is the Poisson distribution and $\alpha$ describes the additional variance on top of the Poisson. 
```
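
The quadratic mean-variance relation can be checked numerically. The following is a minimal sketch, assuming a Gamma-Poisson mixture parameterized by mean `mu` and overdispersion `alpha`. The helper `sample_gamma_poisson` and the chosen parameter values are illustrative and not part of the pipeline.

```python
import numpy as np


def sample_gamma_poisson(mu, alpha, size, rng):
    """Draw counts whose Poisson rate is itself Gamma distributed."""
    # A Gamma with shape 1/alpha and scale alpha*mu has mean mu and variance alpha*mu^2.
    rates = rng.gamma(shape=1.0 / alpha, scale=alpha * mu, size=size)
    return rng.poisson(rates)


rng = np.random.default_rng(0)
mu, alpha = 5.0, 0.5  # illustrative values
y = sample_gamma_poisson(mu, alpha, size=200_000, rng=rng)

expected_var = mu + alpha * mu**2  # quadratic mean-variance relation
print(f"empirical mean: {y.mean():.2f}, empirical variance: {y.var():.2f}")
print(f"expected variance mu + alpha * mu^2: {expected_var:.2f}")
```

With `alpha` close to zero the empirical variance approaches the mean, which is the Poisson case described above.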

The preprocessing step of "normalization" aims to adjust the raw counts in the dataset for variable sampling effects by scaling the observable variance to a specified range. Several normalization techniques are used in practice varying in complexity. They are mostly designed in such a way that subsequent analysis tasks and their underlying statistical methods are applicable. 

A recent benchmark published by Ahlmann-Eltze and Huber{cite}`Ahlmann-Eltze2023` compared 22 different transformations for single-cell data. The benchmark evaluated the normalization techniques based on how well the resulting cell graph overlaps with the ground truth. We would like to highlight that a complete benchmark which also compares the impact of the normalization on a variety of downstream analysis tasks is still outstanding. We advise analysts to choose the normalization carefully and always with the subsequent analysis task in mind. 

This chapter introduces three different normalization techniques: the shifted logarithm transformation, scran normalization and the analytic approximation of Pearson residuals. The shifted logarithm is beneficial for stabilizing variance for subsequent dimensionality reduction and identification of differentially expressed genes. Scran was extensively tested and used for batch correction tasks, and analytic Pearson residuals are well suited for selecting biologically variable genes and identifying rare cell types. 

We first import all required Python packages and load the dataset for which we filtered low quality cells, removed ambient RNA and scored doublets. 



In [None]:
import scanpy as sc
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import anndata2ri
import logging
from scipy.sparse import issparse

import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro

# Settings
sc.settings.verbosity = 0
sc.settings.set_figure_params(dpi=60, facecolor="white", frameon=False, transparent=True)
seed = 10
np.random.seed(seed)

rcb.logger.setLevel(logging.ERROR)
ro.pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython

In [None]:
# this cell is tagged 'parameters' to use papermill
input_file = '/data/cephfs-1/home/users/cemo10_c/work/scRNA/scRNA_preprocessing_pipeline/results/preprocessing_archive2/CE_SC_5FU_Conti_2/quality_control.h5ad'
count_layer = 'soupX_counts'
output_file = ''
qc_method = 'scAutoQC' # 'theislab_tutorial' or 'scAutoQC' (do not filter in this notebook/step) or 'None' 

<div class="alert alert-block alert-info">
<b>Important variable:</b> The following matrix will be used as count matrix:</div>

In [None]:
count_layer

In [None]:
adata = sc.read_h5ad(input_file, backed = False)
adata

In [None]:
adata.X = adata.layers[count_layer].copy()

If our quality control method is the Theislab tutorial, we now filter our AnnData object based on the `outlier` and `mt_outlier` columns in `.obs`. With scAutoQC this filtering has already been done, so the step is skipped.

In [None]:
if qc_method == 'theislab_tutorial':
    print(f"Total number of cells: {adata.n_obs}")
    adata = adata[(~adata.obs.outlier) & (~adata.obs.mt_outlier)].copy()
    print(f"Number of cells after filtering of low quality cells: {adata.n_obs}")

We can now inspect the distribution of the raw counts which we already calculated during quality control. This step can be skipped in a standard single-cell analysis pipeline, but it is helpful for understanding the different normalization concepts. 

In [None]:
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False)

## Shifted logarithm 

The first normalization technique we will introduce is the shifted logarithm which is based on the delta method {cite}`dorfman1938note`. The delta method applies a nonlinear function $f(Y)$ to the raw counts $Y$ and aims to make the variances across the dataset more similar. 

The shifted logarithm tackles this by 

$$f(y) = \log(\frac{y}{s}+y_0)$$ 

with $y$ being the raw counts, $s$ being a so-called size factor and $y_0$ describing a pseudo-count. The size factors are determined for each cell to account for variations in sampling effects and different cell sizes. The size factor for a cell $c$ can be calculated by 

$$s_c = \frac{\sum_g y_{gc}}{L}$$ 

with $g$ indexing different genes and $L$ describing a target sum. There are different approaches to determine the size factors from the data. We will leverage the scanpy default in this section with $L$ being the median raw count depth in the dataset. Many analysis templates use fixed values for $L$, for example $L=10^5$, or $L=10^6$, the latter resulting in values commonly known as counts per million (CPM). These values may seem like an arbitrary choice, but they matter: they can imply much larger overdispersions than typically seen in single-cell datasets. 

```{admonition} Overdispersion
Overdispersion describes the presence of a greater variability in the dataset than one would expect.
```
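
The two formulas above can be traced on a small example. This is a minimal sketch with made-up counts, assuming a pseudo-count of $y_0 = 1$ and $L$ set to the median count depth; it only illustrates the arithmetic and does not replace the scanpy functions used in this chapter.

```python
import numpy as np

# Toy count matrix (cells x genes); the values are made up for illustration.
counts = np.array(
    [
        [10, 0, 5, 5],
        [40, 4, 16, 20],
        [2, 1, 0, 7],
    ],
    dtype=float,
)

depth = counts.sum(axis=1)          # total counts per cell
target_sum = np.median(depth)       # L: median count depth
size_factors = depth / target_sum   # s_c = sum_g y_gc / L

scaled = counts / size_factors[:, None]
shifted_log = np.log(scaled + 1.0)  # f(y) = log(y / s + y_0) with y_0 = 1

print("size factors:", size_factors)
print("depth after scaling:", scaled.sum(axis=1))
```

After scaling, every cell has the same total count $L$, so the remaining differences between cells are no longer driven by count depth.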

The shifted logarithm is a fast normalization technique, outperforms other methods for uncovering the latent structure of the dataset (especially when followed by principal component analysis) and is beneficial for stabilizing variance for subsequent dimensionality reduction and identification of differentially expressed genes. We will now inspect how to apply this normalization method to our dataset. The first step, scaling the counts by the size factors, can be conveniently called with scanpy by running `sc.pp.normalize_total` with `target_sum=None`. We set the `inplace` parameter to `False` as we want to explore three different normalization techniques in this tutorial. The second step applies `sc.pp.log1p` to the scaled counts, which gives us the first normalized count matrix.


In [None]:
scales_counts = sc.pp.normalize_total(adata, target_sum=None, inplace=False)
# log1p transform
# name for new layer
layer_name = "log1p_norm" + "_of_" + count_layer
adata.layers[layer_name] = sc.pp.log1p(scales_counts["X"], copy=True)

We can now inspect how the distribution of our counts changed after we applied the shifted logarithm and compare it to the total count from our raw (but filtered) dataset.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False, ax=axes[0])
axes[0].set_title("Total counts")
p2 = sns.histplot(adata.layers[layer_name].sum(1), bins=100, kde=False, ax=axes[1])
axes[1].set_title("Shifted logarithm")
plt.show()

## Scran normalization

A second normalization method, which is also based on the delta method, is Scran's pooling-based size factor estimation method. Scran follows the same principles as the shifted logarithm by calculating $f(y) = \log(\frac{y}{s}+y_0)$ with $y$ being the raw counts, $s$ the size factor and $y_0$ describing a pseudo-count. The only difference is that Scran leverages a deconvolution approach to estimate the size factors: cells are partitioned into pools, pool-based size factors are estimated using a linear regression over genes, and these are then deconvolved into cell-specific size factors. This approach aims to better account for differences in count depths across all cells present in the dataset.

Scran was extensively tested for batch correction tasks and can be easily called with the respective R package.
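
The pooling and deconvolution idea can be illustrated on a toy example. The sketch below is not scran's implementation: it assumes noise-free expected counts, overlapping pools formed by sliding windows over an arbitrary cell ordering, and ordinary least squares. All variable names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 12, 50
theta = rng.uniform(0.5, 2.0, n_cells)   # hypothetical true per-cell scaling
lam = rng.uniform(1.0, 20.0, n_genes)    # hypothetical gene abundances
expected = np.outer(theta, lam)          # cells x genes, noise-free for clarity

reference = expected.mean(axis=0)        # average pseudo-cell used as reference

rows, pool_factors = [], []
for size in (3, 5):                      # two pool sizes, pools overlap
    for start in range(n_cells):
        members = (start + np.arange(size)) % n_cells
        pooled = expected[members].sum(axis=0)
        # Pool-based size factor: median ratio of pooled profile to reference.
        pool_factors.append(np.median(pooled / reference))
        row = np.zeros(n_cells)
        row[members] = 1.0
        rows.append(row)

# Each pool factor is the sum of the size factors of its member cells,
# so the cell-specific factors solve a linear system.
A = np.vstack(rows)
b = np.array(pool_factors)
size_factors, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.round(size_factors, 3))
print(np.round(theta / theta.mean(), 3))
```

In this idealized setting the recovered size factors equal the true scaling up to a constant. Pooling matters in real data because summing cells reduces the number of zeros that make per-cell ratios unreliable.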

In [None]:
from scipy.sparse import csr_matrix, issparse

In [None]:
%%R
library(scran)
library(BiocParallel)

scran requires a coarse clustering input to improve size factor estimation performance. In this tutorial, we use a simple preprocessing approach and cluster the data at a low resolution to get an input for the size factor estimation. The basic preprocessing includes assuming all size factors are equal (library size normalization, here to scanpy's default target sum, the median count depth) and log-transforming the count data.

In [None]:
# Preliminary clustering for differentiated normalisation
adata_pp = adata.copy()
sc.pp.normalize_total(adata_pp)
sc.pp.log1p(adata_pp)
sc.pp.pca(adata_pp, n_comps=15)
sc.pp.neighbors(adata_pp)
sc.tl.leiden(adata_pp, key_added="groups")

We now add `data_mat` and our computed groups into our R environment. 

In [None]:
# scran expects raw counts (genes x cells); adata_pp.X is already log-normalized
data_mat = adata.X.T
# convert to CSC if possible. See https://github.com/MarioniLab/scran/issues/70
if issparse(data_mat):
    if data_mat.nnz > 2**31 - 1:
        data_mat = data_mat.tocoo()
    else:
        data_mat = data_mat.tocsc()
ro.globalenv["data_mat"] = data_mat
ro.globalenv["input_groups"] = adata_pp.obs["groups"]

We can now also delete the copy of our anndata object, as we obtained all objects needed in order to run scran. 

In [None]:
del adata_pp

We now compute the size factors based on the groups of cells we calculated before. 

In [None]:
%%R -o size_factors

size_factors = sizeFactors(
    computeSumFactors(
        SingleCellExperiment(list(counts = data_mat)),
        clusters = input_groups,
        min.mean = 0.1,
        BPPARAM = MulticoreParam()
    )
)

We save `size_factors` in `.obs` and are now able to normalize the data and subsequently apply a log1p transformation.

In [None]:
adata.obs["size_factors"] = size_factors
scran = adata.X / adata.obs["size_factors"].values[:, None]
layer_name = "scran_normalization" + "_of_" + count_layer
adata.layers[layer_name] = csr_matrix(sc.pp.log1p(scran))

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False, ax=axes[0])
axes[0].set_title("Total counts")
p2 = sns.histplot(
    adata.layers[layer_name].sum(1), bins=100, kde=False, ax=axes[1]
)
axes[1].set_title("log1p with Scran estimated size factors")
plt.show()

## Analytic Pearson residuals

The third normalization technique we are introducing in this chapter is the analytic approximation of Pearson residuals. This normalization technique was motivated by the observation that cell-to-cell variation in scRNA-seq data might confound biological heterogeneity with technical effects. The method utilizes Pearson residuals from 'regularized negative binomial regression' to calculate a model of technical noise in the data. It explicitly adds the count depth as a covariate in a generalized linear model. {cite}`norm:germain_pipecomp_2020` showed in an independent comparison of different normalization techniques that this method removed the impact of sampling effects while preserving cell heterogeneity in the dataset. Notably, analytic Pearson residuals do not require downstream heuristic steps like pseudo count addition or log-transformation.

The output of this method is a matrix of normalized values that can be positive or negative. A negative residual for a cell and gene indicates that fewer counts were observed than expected given the gene's average expression and the cell's sequencing depth. A positive residual indicates that more counts were observed than expected. Analytic Pearson residuals are implemented in scanpy and can be calculated directly on the raw count matrix.
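
To make the sign and scale of the residuals concrete, here is a minimal NumPy sketch of the analytic formula. It assumes a fixed overdispersion `theta = 100` and clipping to $\pm\sqrt{n}$ with $n$ cells, both chosen for illustration, and it assumes that every cell and gene has at least one count. The function name is hypothetical; the scanpy function used below should be preferred in practice.

```python
import numpy as np


def analytic_pearson_residuals(counts, theta=100.0):
    """Pearson residuals under a negative binomial null model with fixed overdispersion."""
    counts = np.asarray(counts, dtype=float)
    cell_totals = counts.sum(axis=1, keepdims=True)  # sequencing depth per cell
    gene_totals = counts.sum(axis=0, keepdims=True)  # total counts per gene
    # Expected count if a gene's share of all counts were the same in every cell.
    mu = cell_totals @ gene_totals / counts.sum()
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    # Clip extreme values to +/- sqrt(number of cells).
    clip = np.sqrt(counts.shape[0])
    return np.clip(residuals, -clip, clip)


rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=2.0, size=(50, 20))  # toy cells x genes matrix
residuals = analytic_pearson_residuals(toy_counts)
print(residuals.min(), residuals.max())
```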


In [None]:
analytic_pearson = sc.experimental.pp.normalize_pearson_residuals(adata, inplace=False)
layer_name = "analytic_pearson_residuals" + "_of_" + count_layer
adata.layers[layer_name] = csr_matrix(analytic_pearson["X"])

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
p1 = sns.histplot(adata.obs["total_counts"], bins=100, kde=False, ax=axes[0])
axes[0].set_title("Total counts")
try:
    p2 = sns.histplot(
        adata.layers[layer_name].sum(1), bins=100, kde=False, ax=axes[1]
    )
except Exception as e:
    print(f"An error occurred: {e}")
axes[1].set_title("Analytic Pearson residuals")
plt.show()

We applied different normalization techniques to our dataset and saved them as separate layers to our anndata object. Depending on the downstream analysis task it can be favourable to use a differently normalized layer and assess the result.

## References

```{bibliography}
:filter: docname in docnames
:labelprefix: norm
```

## Contributors

We gratefully acknowledge the contributions of:

### Authors

* Anna Schaar

### Reviewers

* Lukas Heumos

(pre-processing:feature-selection)=
# Feature selection

## Motivation

We now have a normalized data representation that still preserves biological heterogeneity but with reduced technical sampling effects in gene expression. Single-cell RNA-seq datasets usually contain up to 30,000 genes and so far we only removed genes that are not detected in at least 20 cells. However, many of the remaining genes are not informative and contain mostly zero counts. Therefore, a standard preprocessing pipeline involves the step of feature selection which aims to exclude uninformative genes which might not represent meaningful biological variation across samples. 

:::{figure-md} Feature selection

<img src="https://www.sc-best-practices.org/_images/feature_selection.jpeg" alt="Feature selection" class="bg-primary mb-1" width="800px">

Feature selection generally describes the process of only selecting a subset of relevant features which can be the most informative, most variable or most deviant ones. 

:::


Usually, the scRNA-seq experiment and resulting dataset focuses on one specific tissue and hence, only a small fraction of genes is informative and biologically variable. Traditional approaches and pipelines either compute the coefficient of variation (highly variable genes) or the average expression level (highly expressed genes) of all genes to obtain 500-2000 selected genes and use these features for their downstream analysis steps. However, these methods are highly sensitive to the normalization technique used before. As mentioned earlier, a former preprocessing workflow included normalization with CPM and subsequent log transformation. But as log-transformation is not possible for exact zeros, analysts often add a small *pseudo count*, e.g., 1 (log1p), to all normalized counts before log transforming the data. Choosing the pseudo count, however, is arbitrary and can introduce biases to the transformed data. This arbitrariness also affects the feature selection, as the observed variability depends on the chosen pseudo count. A small pseudo count value close to zero increases the variance of genes with zero counts {cite}`Townes2019`. 

Germain et al. instead propose to use *deviance* for feature selection, which works on raw counts {cite}`fs:germain_pipecomp_2020`. Deviance can be computed in closed form and quantifies whether genes show a constant expression profile across cells, as such genes are not informative. Genes with constant expression are described by a multinomial null model, which is approximated by the binomial deviance. Highly informative genes will have a high deviance value, which indicates a poor fit by the null model (i.e., they don't show constant expression across cells). The method then ranks all genes by their deviance values and retains only the highly deviant genes. 

As mentioned before, deviance can be computed in closed form and is provided within the R package scry.
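
For intuition, the closed-form binomial deviance can be written in a few lines of NumPy. This is a sketch under stated assumptions, not the scry implementation: the input is a dense cells x genes matrix of raw counts, every gene has at least one count, and the function name and toy values are hypothetical.

```python
import numpy as np
from scipy.special import xlogy  # xlogy(x, y) = x * log(y), defined as 0 when x == 0


def binomial_deviance_per_gene(counts):
    """Per-gene binomial deviance for a cells x genes matrix of raw counts."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)              # total counts per cell
    pi = counts.sum(axis=0, keepdims=True) / n.sum()   # gene's share under the null model
    rest = n - counts
    terms = xlogy(counts, counts / (n * pi)) + xlogy(rest, rest / (n * (1 - pi)))
    return 2 * terms.sum(axis=0)


# Toy data: gene 0 always makes up 10% of a cell's counts, genes 1 and 2 vary.
toy = np.array(
    [
        [10, 50, 40],
        [20, 10, 170],
        [10, 5, 85],
        [20, 100, 80],
    ]
)
deviance = binomial_deviance_per_gene(toy)
print(np.round(deviance, 2))
```

The gene with a constant expression share is fit perfectly by the null model and gets a deviance of zero, while the variable genes receive large values and would be ranked first.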

We start by setting up our environment.

In [None]:
import scanpy as sc
import anndata2ri
import logging
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro

rcb.logger.setLevel(logging.ERROR)
ro.pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython

In [None]:
%%R
library(scry)

Next, we set `adata.X` back to the raw counts stored in the `count_layer` layer, because deviance works on raw counts.

In [None]:
adata.X = adata.layers[count_layer].copy()

Similar to before, we transpose the count matrix to genes x cells so that it can be passed to our R environment. 

In [None]:
data = adata.X.T

We can now directly call feature selection with deviance on the non-normalized counts matrix and export the binomial deviance values as a vector. 

In [None]:
%%R -i data -o binomial_deviance
binomial_deviance = devianceFeatureSelection(data)

As a next step, we sort the vector, select the top 4,000 highly deviant genes and save them as an additional column in `.var` as 'highly_deviant'. We additionally save the computed binomial deviance in case we want to sub-select a different number of highly deviant genes afterwards. 

In [None]:
idx = binomial_deviance.argsort()[-4000:]
mask = np.zeros(adata.var_names.shape, dtype=bool)
mask[idx] = True

adata.var["highly_deviant"] = mask
adata.var["binomial_deviance"] = binomial_deviance

Last, we visualise the feature selection results. We use a scanpy function to compute the mean and dispersion for each gene across all cells.

In [None]:
sc.pp.highly_variable_genes(adata, layer="scran_normalization" + "_of_" + count_layer)

We inspect our results by plotting dispersion versus mean for the genes and color by 'highly_deviant'.

In [None]:
ax = sns.scatterplot(
    data=adata.var, x="means", y="dispersions", hue="highly_deviant", s=5
)
ax.set_xlim(None, 1.5)
ax.set_ylim(None, 3)
plt.show()

We observe that genes with a high mean expression are selected as highly deviant. This is in agreement with empirical observations by {cite}`Townes2019`.

## References

```{bibliography}
:filter: docname in docnames
:labelprefix: fs
```

## Contributors

We gratefully acknowledge the contributions of:

### Authors

* Anna Schaar

### Reviewers

* Lukas Heumos


(pre-processing:dimensionality-reduction)=
# Dimensionality Reduction

As previously mentioned, scRNA-seq is a high-throughput sequencing technology that produces datasets with high dimensions in the number of cells and genes. This immediately points to the fact that scRNA-seq data suffers from the 'curse of dimensionality'. 

```{admonition} Curse of dimensionality
The curse of dimensionality was first brought up by R. Bellman {cite}`bellman1957dynamic` and describes the problem that in theory high-dimensional data contains more information, but in practice this is not the case. Higher dimensional data often contains more noise and redundancy and therefore adding more information does not provide benefits for downstream analysis steps. 
```

Not all genes are informative and are important for the task of cell type clustering based on their expression profiles. We already aimed to reduce the dimensionality of the data with feature selection, as a next step one can further reduce the dimensions of single-cell RNA-seq data with dimensionality reduction algorithms. These algorithms are an important step during preprocessing to reduce the data complexity and for visualization. Several dimensionality reduction techniques have been developed and used for single-cell data analysis.

:::{figure-md} Dimensionality reduction

<img src="https://www.sc-best-practices.org/_images/dimensionality_reduction.jpeg" alt="Dimensionality reduction" class="bg-primary mb-1" width="800px">

Dimensionality reduction embeds the high-dimensional data into a lower dimensional space. The low-dimensional representation still captures the underlying structure of the data while having as few as possible dimensions. Here we visualize a three dimensional object projected into two dimensions. 

:::

Xiang et al. compared the stability, accuracy and computing cost of 10 different dimensionality reduction methods in an independent comparison {cite}`Xiang2021`. They propose to use t-distributed stochastic neighbor embedding (t-SNE) as it yielded the best overall performance. Uniform manifold approximation and projection (UMAP) showed the highest stability and best separated the original cell populations. An additional dimensionality reduction worth mentioning in this context is principal component analysis (PCA) which is still widely used.

Generally, t-SNE and UMAP are very robust and mostly equivalent if specific choices for the initialization are selected {cite}`Kobak2019`.

All aforementioned methods are implemented in scanpy.


We will use a normalized representation of the dataset for dimensionality reduction and visualization, specifically the shifted logarithm. 

In [None]:
adata.X = adata.layers["log1p_norm" + "_of_" + count_layer].copy()

We start with:

## PCA

In our dataset each cell is a vector in an `n_var`-dimensional vector space spanned by some orthonormal basis. As scRNA-seq suffers from the 'curse of dimensionality', we know that not all features are important to understand the underlying dynamics of the dataset and that there is an inherent redundancy{cite}`grun2014validation`. PCA creates a new set of uncorrelated variables, so-called principal components (PCs), via an orthogonal transformation of the original dataset. The PCs are linear combinations of the features in the original dataset and are ranked in decreasing order of variance, so that the first PC captures the largest possible variance. PCs with the lowest variance are discarded to reduce the dimensionality of the data while losing as little information as possible.

PCA offers the advantage that it is highly interpretable and computationally efficient. However, as scRNA-seq datasets are rather sparse due to dropout events and therefore highly non-linear, visualization with the linear dimensionality reduction technique PCA is not very appropriate. PCA is typically used to select the top 10-50 PCs which are used for downstream analysis tasks.
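
The orthogonal transformation and the ranking by variance can be sketched with a singular value decomposition. The example below uses simulated data with two latent factors; the data and variable names are illustrative, and in this chapter the actual computation is done with `sc.pp.pca`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "cells x genes" matrix: 10 correlated features driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 10))

# PCA via SVD of the centered data; rows of Vt are the principal axes.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_variance = S**2 / (X.shape[0] - 1)  # sorted in decreasing order
explained_ratio = explained_variance / explained_variance.sum()

n_pcs = 2
X_pca = X_centered @ Vt[:n_pcs].T  # coordinates of each cell on the top PCs

print(np.round(explained_ratio[:4], 3))
```

Because the toy data has only two latent factors, the first two PCs capture almost all of the variance and the remaining components mostly describe noise.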


In [None]:
# setting highly variable as highly deviant to use scanpy 'use_highly_variable' argument in sc.pp.pca
adata.var["highly_variable"] = adata.var["highly_deviant"]
sc.pp.pca(adata, svd_solver="arpack", use_highly_variable=True)

In [None]:
sc.pl.pca_scatter(adata, color="total_counts")


## t-SNE

t-SNE is a graph based, non-linear dimensionality reduction technique which projects the high dimensional data onto 2D or 3D components. The method defines a Gaussian probability distribution based on the high-dimensional Euclidean distances between data points. Subsequently, a Student t-distribution is used to recreate the probability distribution in a low dimensional space where the embeddings are optimized using gradient descent.
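
The two probability distributions and the objective that gradient descent minimizes can be sketched as follows. This is a simplified illustration, not the algorithm used by `sc.tl.tsne`: it assumes a single global Gaussian bandwidth `sigma` instead of the per-point bandwidths that t-SNE calibrates via the perplexity, and it only evaluates the Kullback-Leibler divergence for a random low-dimensional layout without optimizing it. All function names are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff**2).sum(axis=-1)


def gaussian_affinities(X, sigma=1.0):
    """High-dimensional similarities from a Gaussian kernel, normalized to sum to 1."""
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()


def student_t_affinities(Y):
    """Low-dimensional similarities from a Student t kernel with one degree of freedom."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()


def kl_divergence(P, Q):
    """KL(P || Q), the quantity that the embedding optimization decreases."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))


rng = np.random.default_rng(0)
X_high = rng.normal(size=(30, 10))  # toy high-dimensional points
Y_low = rng.normal(size=(30, 2))    # random 2D layout before any optimization
P = gaussian_affinities(X_high, sigma=3.0)
Q = student_t_affinities(Y_low)
loss = kl_divergence(P, Q)
print(f"KL divergence of the random layout: {loss:.3f}")
```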

In [None]:
sc.tl.tsne(adata, use_rep="X_pca")

In [None]:
sc.pl.tsne(adata, color="total_counts")


## UMAP

UMAP is a graph based, non-linear dimensionality reduction technique and principally similar to t-SNE. It constructs a high dimensional graph representation of the dataset and optimizes the low-dimensional graph representation to be structurally as similar as possible to the original graph.


PCA was already computed above, so we now calculate a neighborhood graph on the PCA representation and subsequently the UMAP embedding.

In [None]:
sc.pp.neighbors(adata)
sc.tl.umap(adata)

In [None]:
sc.pl.umap(adata, color="total_counts")

## Inspecting quality control metrics 

We can now also inspect the quality control metrics we calculated previously in our PCA, t-SNE or UMAP plot and potentially identify low-quality cells.

In [None]:
# only non-boolean columns
non_bool_cols = [col for col in adata.obs.columns if not adata.obs[col].dtype == bool]

sc.pl.umap(
    adata,
    color = non_bool_cols
)

As we can observe, cells with a high doublet score are projected to the same region in the UMAP. We will keep them in the dataset for now but might re-visit our quality control strategy later.

## References

```{bibliography}
:filter: docname in docnames
```

## Contributors

We gratefully acknowledge the contributions of:

### Authors

* Anna Schaar

### Reviewers

* Lukas Heumos


(cellular-structure:clustering)=
# Clustering

## Motivation

Preprocessing and visualization enabled us to describe our scRNA-seq dataset and reduce its dimensionality. Up to this point, we embedded and visualized cells to understand the underlying properties of our dataset. However, they are still rather abstractly defined. The next natural step in single-cell analysis is the identification of cellular structure in the dataset. 

In scRNA-seq data analysis, we describe cellular structure in our dataset by finding cell identities that relate to known cell states or cell cycle stages. This process is usually called cell identity annotation. For this purpose, we structure cells into clusters to infer the identity of similar cells. Clustering itself is a common unsupervised machine learning problem. 
We can derive clusters by minimizing the intra-cluster distance in the reduced expression space. In this case, the expression space determines the gene expression similarity of cells with respect to a dimensionality-reduced representation. This lower dimensional representation is, for example, determined with a principal-component analysis and the similarity scoring is then based on Euclidean distances. 

The KNN graph consists of nodes reflecting the cells in the dataset. We first calculate a Euclidean distance matrix on the PC-reduced expression space for all cells and then connect each cell to its K most similar cells. Usually, K is set to values between 5 and 100 depending on the size of the dataset. The KNN graph reflects the underlying topology of the expression data by representing dense regions with respect to expression space also as densely connected regions in the graph {cite}`wolf_paga_2019`. Dense regions in the KNN-graph are detected by community detection methods like Leiden and Louvain{cite}`blondel_fast_2008`. 
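
The construction of the KNN graph can be sketched directly from this description. The following is a minimal brute-force version on simulated PC coordinates; `knn_graph` is a hypothetical helper, and `sc.pp.neighbors` uses approximate search and additionally weights the edges, which this sketch omits.

```python
import numpy as np


def knn_graph(X, k):
    """Connect each row of X to its k nearest rows by Euclidean distance."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    adjacency = np.zeros(dist.shape, dtype=bool)
    rows = np.repeat(np.arange(X.shape[0]), k)
    adjacency[rows, neighbors.ravel()] = True
    return neighbors, adjacency


rng = np.random.default_rng(0)
# Two well separated toy populations in a 5-dimensional "PC space".
pcs = np.vstack([rng.normal(0, 0.5, (20, 5)), rng.normal(5, 0.5, (20, 5))])
neighbors, adjacency = knn_graph(pcs, k=5)
print(neighbors[0])
```

In this toy example all neighbors of a cell come from its own population, so the two populations form two densely connected regions of the graph.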

The Leiden algorithm is an improved version of the Louvain algorithm which outperformed other clustering methods for single-cell RNA-seq data analysis ({cite}`du_systematic_2018, freytag_comparison_2018, weber_comparison_2016`). Since the Louvain algorithm is no longer maintained, using Leiden instead is preferred. 

We, therefore, propose to use the Leiden algorithm{cite}`traag_louvain_2019` on single-cell k-nearest-neighbour (KNN) graphs to cluster single-cell datasets. 

Leiden creates clusters by taking into account the number of links between cells in a cluster versus the overall expected number of links in the dataset. 

:::{figure-md} clustering

<img src="https://www.sc-best-practices.org/_images/clustering.jpeg" alt="Clustering Overview" class="bg-primary mb-1" width="800px">

The Leiden algorithm computes a clustering on a KNN graph obtained from the PC reduced expression space. It starts with an initial partition where each node forms its own community. Next, the algorithm moves single nodes from one community to another to find a partition, which is then refined. Based on a refined partition an aggregate network is generated, which is again refined until no further improvements can be obtained, and the final partition is reached. 

:::


The starting point is a singleton partition in which each node functions as its own community (a). As a next step, the algorithm creates partitions by moving individual nodes from one community to another (b), which is refined afterwards to enhance the partitioning (c). The refined partition is then aggregated to a network (d). Subsequently, the algorithm moves again individual nodes in the aggregate network (e), until refinement does no longer change the partition (f). All steps are repeated until the final clustering is created and partitions no longer change.

The Leiden module has a resolution parameter which allows the user to determine the scale of the partition and therefore the coarseness of the clustering. A higher resolution parameter leads to more clusters. The algorithm additionally allows efficient sub-clustering of particular clusters in the dataset by sub-setting the KNN graph. Sub-clustering enables the user to identify cell-type specific states within clusters or a finer cell type labeling{cite}`wagner_revealing_2016`, but can also lead to patterns that are only due to noise present in the data.
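
The comparison of observed and expected links, and the role of the resolution parameter, can be illustrated with a modularity-style quality function. This sketch only scores a given partition and does not implement the Leiden optimization; it assumes an undirected, unweighted graph, and the function name and toy graph are hypothetical.

```python
import numpy as np


def modularity(adjacency, labels, resolution=1.0):
    """Observed minus expected within-cluster links, scaled by the total number of links."""
    adjacency = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = adjacency.sum(axis=1)
    two_m = degrees.sum()  # twice the number of edges
    same_cluster = labels[:, None] == labels[None, :]
    expected = resolution * np.outer(degrees, degrees) / two_m
    return float(((adjacency - expected) * same_cluster).sum() / two_m)


# Toy graph: two triangles (nodes 0-2 and 3-5) joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

q_two = modularity(A, [0, 0, 0, 1, 1, 1])  # one cluster per triangle
q_one = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one cluster
q_two_high_res = modularity(A, [0, 0, 0, 1, 1, 1], resolution=2.0)
print(q_two, q_one, q_two_high_res)
```

Splitting the graph into its two triangles scores higher than a single cluster. Increasing the resolution raises the expected number of links and thereby penalizes large clusters, which is why higher values favor more and smaller communities.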

As mentioned before, the Leiden algorithm is implemented in scanpy.

## Clustering with our dataset

We will focus on the scran normalized version of the dataset in this notebook as recommended in the preprocessing chapter to better identify substates of individual cells. 

In [None]:
adata.X = adata.layers["scran_normalization" + "_of_" + count_layer].copy()

The Leiden algorithm leverages a KNN graph on the reduced expression space. We can calculate the KNN graph on a lower-dimensional gene expression representation with the scanpy function `sc.pp.neighbors`. We call this function on the top 30 principal components as these capture most of the variance in the dataset. Visualizing the clustering can help us to understand the results, so we embed our cells with UMAP. More details can be found in the {ref}`pre-processing:dimensionality-reduction` chapter.

In [None]:
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.umap(adata)

We can now call the Leiden algorithm. 

In [None]:
sc.tl.leiden(adata)

The default resolution parameter in scanpy is 1.0. However, in many cases the analyst may want to try different resolution parameters to control the coarseness of the clustering. Hence, we recommend saving the clustering result under a specified key which indicates the selected resolution.

In [None]:
sc.tl.leiden(adata, key_added="leiden_res0_25", resolution=0.25)
sc.tl.leiden(adata, key_added="leiden_res0_5", resolution=0.5)
sc.tl.leiden(adata, key_added="leiden_res1", resolution=1.0)

We now visualize the different clustering results obtained with the Leiden algorithm at different resolutions. As we can see, the resolution heavily influences how coarse our clustering is. Higher resolution parameters lead to more communities, i.e. more identified clusters, while lower resolution parameters lead to fewer communities. The resolution parameter therefore controls how densely clustered regions in the KNN-embedding are grouped together by the algorithm. This will become especially important for annotating the clusters. 

In [None]:
sc.pl.umap(
    adata,
    color=["leiden_res0_25", "leiden_res0_5", "leiden_res1"],
    legend_loc="on data",
)

We can now clearly see the impact of different resolutions on the clustering result. For a resolution of 0.25, the clustering is much coarser and the algorithm detected fewer communities. Additionally, clustered regions are less dense compared to the clustering obtained at a resolution of 1.0. 

We would like to highlight again that distances between the displayed clusters must be interpreted with caution. As the UMAP embedding is in 2D, distances are not necessarily captured well between all points. We recommend not interpreting distances between clusters visualized on UMAP embeddings.

## Key takeaways

1. Use Leiden community detection on a single-cell KNN graph.
2. Sub-clustering with different resolution parameters allows the user to focus on more detailed substructures in the dataset to potentially identify finer cell states. 

## References

```{bibliography}
:filter: docname in docnames
```

## Contributors

We gratefully acknowledge the contributions of:

### Authors

* Anna Schaar

### Reviewers

* Lukas Heumos

# Annotation

## Motivation

To understand your data better and make use of existing knowledge, it is important to figure out the "cellular identity" of each of the cells in your data. The process of labeling groups of cells in your data based on known (or sometimes unknown) cellular phenotypes is called "cell annotation". Whereas there are many ways to annotate your cells (e.g. based on batch, disease, sex and more), in this notebook we will focus on the annotation of "cell types".<br>
So what is a cell type? Biologists use the term cell type to denote a cellular phenotype that is robust across datasets, identifiable based on expression of specific markers (i.e. proteins or gene transcripts), and often linked to specific functions. For example, a plasma B cell is a type of white blood cell that secretes antibodies used to fight pathogens and it can be identified using specific markers. Knowing which cell types are in your sample is essential in understanding your data. For example, knowing that there are specific immune cell types in a tumor or unusual hematopoietic stem cells in your bone marrow sample can be a valuable insight into the disease you might be studying.<br>
However, like with any categorization the size of categories and the borders drawn between them are partly subjective and can change over time, e.g. because new technologies allow for a higher resolution view of cells, or because specific "sub-phenotypes" that were not considered biologically meaningful are found to have important biological implications (see e.g. {cite}`anno:KadurLakshminarasimhaMurthy2022`). Cell types are therefore often further classified into "subtypes" or "cell states" (e.g. activated versus resting) and some researchers use the term "cell identity" to avoid this sometimes arbitrary distinction of cell types, cell subtypes and cell states. For a more detailed discussion of this topic, we recommend the review by Wagner et al. {cite}`anno:Wagner2016` and the recently published review by Zeng {cite}`anno:ZENG20222739`.<br>
Similarly, multiple cell types can be part of a single continuum, where one cell type might transition or differentiate into another. For example, in hematopoiesis cells differentiate from a stem cell into a specific immune cell type. Although hard borders between early and late stages of this differentiation are often drawn, the state of these cells can more accurately be described by the differentiation coordinate between the less and more differentiated cellular phenotypes. We will discuss differentiation and cellular trajectories in subsequent chapters.<br>
So how do we go about annotating cells in single-cell data? There are multiple ways to do it and we will give an overview of different approaches below. As we are working with transcriptomic data, each of these methods is ultimately based on the expression of specific genes or gene sets, or general transcriptomic similarity between cells. 

## Environment setup

We'll filter out some deprecation and performance warnings that do not affect our code:

In [None]:
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

from numba.core.errors import NumbaDeprecationWarning

warnings.simplefilter("ignore", category=NumbaDeprecationWarning)

Load the needed modules:

In [None]:
import urllib.request
from pathlib import Path

import celltypist
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
# import scarches as sca
import seaborn as sns
from celltypist import models
from scipy.sparse import csr_matrix

One more pandas warning to filter:

In [None]:
warnings.filterwarnings("ignore", category=pd.errors.PerformanceWarning)

We will continue working with the scRNA-seq dataset that we earlier preprocessed and will now annotate it.

Set figure parameters:

In [None]:
sc.set_figure_params(figsize=(5, 5))

## Load data

Let's read in the toy dataset we will use for this tutorial. It includes a single sample ("site4-donor8") of the data also used in other parts of the book. Moreover, cells that didn't pass QC have already been removed.

## Manual annotation

The classical or oldest way to perform cell type annotation is based on a single or small set of marker genes known to be associated with a particular cell type. This approach dates back to "pre-scRNA-seq times", when single cell data was low dimensional (e.g. FACS data with gene panels consisting of no more than 30-40 genes). It is a fast and transparent way to annotate your data. However, when no unique markers exist for a specific cell type (which is often the case) this approach can get more complicated and even less objective, with combinations of markers or expression thresholds necessary for proper annotation. A robust set of marker genes and prior knowledge or annotation experience can help here, but the approach comes with the risk of unclear and subjective decision-making. 

In this setting, the data is often clustered before annotation, so that we can annotate groups of cells instead of making a per-cell call. This is not only less laborious, but also more robust to noise: a single cell might not have a count for a specific marker even if it was expressed in that cell, simply due to the inherent sparsity of single cell data. Clustering enables the detection of cells highly similar in overall gene expression and can therefore account for drop-outs at single cell level. 

Finally, there are two angles from which to approach the marker-gene based annotation. One option is to work from a table of marker genes for all the cell types you expect in your data and check in which clusters those markers are expressed. The other option is to check which genes are highly expressed in the clusters you defined and then check if they are associated with known cell types or states. If necessary, one can move back and forth between those approaches. We will show examples of both below.

### From markers to cluster annotation

Let's get started with the known-marker based approach. We will first list a set of markers for cell types in the bone marrow here that is based on literature: previous papers that study specific cell types and subtypes and report marker genes for those cell types. Note that markers at the protein level (e.g. used for FACS) sometimes do not work as well in transcriptomic data, hence using markers from RNA-based papers is often more likely to work. Moreover, sometimes markers in one dataset do not turn out to work as well in other datasets. Ideally a marker set is therefore validated across multiple datasets. Finally, it is often useful to work together with experts: as a bioinformatician, try to team up with a biologist who has more extensive knowledge of the tissue, the biology, the expected cell types and markers etc.

In [None]:
# read in csv table
cms_genes = pd.read_csv("resources/cms_genes.csv")
cms_genes

In [None]:
# convert cms_genes into dictionary
cms_genes_dict = {}
for i, row in cms_genes.iterrows():
    if row["CMS"] not in cms_genes_dict:
        cms_genes_dict[row["CMS"]] = []
    cms_genes_dict[row["CMS"]].append(row["gene"])
cms_genes_dict
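The same conversion can also be written as a single pandas `groupby`. The sketch below uses a small made-up table (the gene names are placeholders, not the contents of `resources/cms_genes.csv`) and assumes the same `CMS` and `gene` column names as above:

```python
import pandas as pd

# Toy stand-in for the marker table; the rows are placeholders for illustration only.
toy_cms_genes = pd.DataFrame(
    {
        "CMS": ["CMS1", "CMS1", "CMS2", "CMS3"],
        "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    }
)

# Collect the genes of each subtype into a list, then convert to a plain dict.
toy_cms_genes_dict = toy_cms_genes.groupby("CMS")["gene"].apply(list).to_dict()
print(toy_cms_genes_dict)
```

Both versions keep the genes in the order in which they appear in the table.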

Subset to only the markers that were detected in our data. We will loop through all cell types and keep only the genes that we find in our adata object as markers for that cell type. This will prevent errors once we start plotting.

In [None]:
marker_genes_in_data = {}
for ct, markers in cms_genes_dict.items():
    markers_found = []
    for marker in markers:
        if marker in adata.var.index:
            markers_found.append(marker)
    marker_genes_in_data[ct] = markers_found
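When filtering markers like this, it can be useful to also keep track of which markers were dropped, so that a missing gene does not go unnoticed. A minimal sketch with hypothetical inputs (`toy_markers` and `toy_var_index` are made up and stand in for the marker dictionary and `adata.var.index`):

```python
import pandas as pd

# Hypothetical inputs: a marker dictionary and the gene index of a dataset.
toy_markers = {"CMS1": ["GENE_A", "GENE_B"], "CMS2": ["GENE_C", "GENE_X"]}
toy_var_index = pd.Index(["GENE_A", "GENE_C", "GENE_D"])

# Keep, per subtype, only the markers present in the dataset (order preserved).
toy_markers_in_data = {
    ct: [m for m in markers if m in toy_var_index]
    for ct, markers in toy_markers.items()
}

# Record what was dropped, so missing markers can be inspected.
toy_markers_missing = {
    ct: [m for m in markers if m not in toy_var_index]
    for ct, markers in toy_markers.items()
}
print(toy_markers_in_data)
print(toy_markers_missing)
```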

To see where these markers are expressed we can work with a 2-dimensional visualization of the data, such as a UMAP. We'll construct that UMAP here based on the scran-normalized count data, using only the highly deviant genes. Note that we first perform a PCA on the normalized counts to reduce dimensionality of the data before we generate the UMAP.

Our raw counts remain stored in `adata.layers` (under the layer name held in `count_layer`), so that we will still have access to them later if needed. We now set our `adata.X` to the scran-normalized, log-transformed counts.

In [None]:
adata.X = adata.layers[f"scran_normalization_of_{count_layer}"].copy()

We furthermore set our `adata.var.highly_variable` to the highly deviant genes. Scanpy uses this var column in downstream calculations, such as the PCA below.

In [None]:
adata.var["highly_variable"] = adata.var["highly_deviant"]

In [None]:
# find genes in marker_genes_in_data that are highly variable
marker_genes_in_data_hv = {}
for ct, markers in marker_genes_in_data.items():
    markers_found = []
    for marker in markers:
        if marker in adata.var.index:
            if adata.var.loc[marker, "highly_variable"]:
                markers_found.append(marker)
    marker_genes_in_data_hv[ct] = markers_found

Now perform PCA. We use the highly deviant genes (set as "highly variable" above) to reduce noise and strengthen signal in our data and set number of components to the default n=50. 50 is on the high side for data of a single sample, but it will ensure that we don't ignore important variation in our data.

In [None]:
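# Note: newer scanpy releases deprecate `use_highly_variable` in favor of `mask_var="highly_variable"`.
sc.tl.pca(adata, n_comps=50, use_highly_variable=True)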

Calculate the neighbor graph based on the PCs:

In [None]:
sc.pp.neighbors(adata)

And use that neighbor graph to calculate a 2-dimensional UMAP embedding of the data:

In [None]:
sc.tl.umap(adata)

Now show expression of the markers using the calculated UMAP. We'll limit ourselves to B/plasma cell subtypes for this example. Note from the marker dictionary above that there are three negative markers in our list: IGHD and IGHM for B1 B, and PAX5 for plasmablasts, meaning that these cell types are expected to express those markers only lowly or not at all.

Let's list the B cell subtypes we want to show the markers for:

In [None]:
cms_tps = [
    "CMS1",
    "CMS2",
    "CMS3",
    "CMS4",
]

And now plot one UMAP per marker for each of the B cell subtypes. Note that we can only plot the markers that are present in our data.

In [None]:
# for ct in cms_tps:
#     print(f"{ct.upper()}:")  # print cell subtype name
#     sc.pl.umap(
#         adata,
#         color=marker_genes_in_data_hv[ct],
#         vmin=0,
#         vmax="p99",  # set vmax to the 99th percentile of the gene count instead of the maximum, to prevent outliers from making expression in other cells invisible. Note that this can cause problems for extremely lowly expressed genes.
#         sort_order=False,  # do not plot highest expression on top, to not get a biased view of the mean expression among cells
#         frameon=False,
#         cmap="Reds",  # or choose another color map e.g. from here: https://matplotlib.org/stable/tutorials/colors/colormaps.html
#     )
#     print("\n\n\n")  # print white space for legibility

As you can see, even markers for a single cell type are often expressed in different subsets of the data, i.e. individual markers are often not uniquely expressed in a single cell type. Rather, it is the intersection of those subsets that will tell you where your cell type of interest is. 

Another thing you might notice is that markers are often sparsely expressed, i.e. it is often only a subset of cells of a cell type in which a marker was detected. This is due to the nature of scRNA-seq data: we only sequence a small subset of the total amount of RNA molecules in the cell and due to this subsampling we will sometimes not sample transcripts from specific genes in a cell even if they were expressed in that cell. Therefore, we do not annotate single cells based on a minimum expression threshold of e.g. a set of markers. Instead, we first subdivide the data into groups of similar cells (i.e. "partition" the data) by clustering, thereby accounting for "missing transcripts" of single genes and rather grouping based on overall transcriptomic similarity. We can then annotate those clusters based on their overall marker expression patterns. 
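The effect of this subsampling can be illustrated with a small simulation. The sketch below is a toy model, not a description of any real protocol: it assumes every cell contains the same number of marker transcripts (`true_molecules`) and that each molecule is captured independently with probability `capture_rate`; both values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells = 500
true_molecules = 20  # assumed marker transcripts per cell (every cell expresses the gene)
capture_rate = 0.05  # assumed fraction of molecules that is captured and sequenced

# Each molecule is captured independently, so the observed counts are binomial.
observed = rng.binomial(true_molecules, capture_rate, size=n_cells)

frac_zero = np.mean(observed == 0)  # cells in which the marker was not detected
cluster_mean = observed.mean()  # pooled signal across the whole group of cells

print(f"fraction of cells with zero counts: {frac_zero:.2f}")
print(f"mean count across the group: {cluster_mean:.2f}")
```

Although every simulated cell expresses the marker, a sizeable fraction of cells shows zero counts, while the group-level mean remains clearly above zero. This is why annotating groups of similar cells is more robust than thresholding single cells.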

Let us cluster our data now. We will use the Leiden algorithm {cite}`anno:Traag2019` as discussed in the Clustering chapter to define a grouping of our data into similar subsets of cells:

In [None]:
sc.tl.leiden(adata, resolution=1, key_added="leiden_1")

In [None]:
sc.pl.umap(adata, color="leiden_1")

You might notice that this partitioning of the data is rather coarse and some of the marker expression patterns we saw above are not captured by this clustering. We can therefore try a higher resolution clustering by changing the resolution parameter of our clustering:

In [None]:
sc.tl.leiden(adata, resolution=2, key_added="leiden_2")

In [None]:
sc.pl.umap(adata, color="leiden_2")

Or with cluster numbers in the UMAP:

In [None]:
sc.pl.umap(adata, color="leiden_2", legend_loc="on data")

This clustering is a lot finer and will help us annotate the data in more detail. You can play around with the resolution parameter to find the setting that best captures the marker expression patterns you observe.

Scrolling back up, you will see that clusters 4 and 6 are the clusters consistently expressing Naive CD20+ B cell markers. We can also visualize this using a dotplot:

In [None]:
# cms_markers = {
#     ct: [m for m in ct_markers if m in adata.var.index]
#     for ct, ct_markers in cms_genes_dict.items()
#     if ct in cms_tps
# }

In [None]:
sc.pl.dotplot(
    adata,
    groupby="leiden_2",
    var_names=marker_genes_in_data_hv,
    standard_scale="var",  # standard scale: normalize each gene to range from 0 to 1
)
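The `standard_scale="var"` option rescales every gene separately. A minimal pandas sketch of that idea, assuming the scaling is a per-gene min-max over the per-cluster mean expression (this is our reading of the option, not a copy of scanpy's implementation; the table is made up):

```python
import pandas as pd

# Hypothetical mean expression per cluster (rows) and gene (columns).
group_means = pd.DataFrame(
    {"GENE_A": [0.0, 2.0, 4.0], "GENE_B": [10.0, 10.5, 11.0]},
    index=["0", "1", "2"],
)

# Min-max scale each gene (column) to the range [0, 1] across clusters.
shifted = group_means - group_means.min(axis=0)
scaled = (shifted / shifted.max(axis=0)).fillna(0)
print(scaled)
```

Note how `GENE_B`, which barely changes between clusters, ends up with the same scaled profile as `GENE_A`. A scaled dotplot therefore shows relative rather than absolute differences between clusters.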

Using a combination of visual inspection of the UMAPs and the dotplot above we can now start annotating the clusters:

Let's visualize our annotations so far:

In [None]:
# length of the gene list for each cell type
gene_list_len = {k: len(v) for k, v in marker_genes_in_data.items()}
gene_list_len

In [None]:
# Calculate scores manually as the mean expression of each subtype's marker genes.
for cms in ["CMS1", "CMS2", "CMS3", "CMS4"]:
    genes = marker_genes_in_data[cms]
    # np.asarray(...).ravel() yields a flat per-cell vector for both dense and sparse .X
    adata.obs[f"{cms}_score"] = np.asarray(adata[:, genes].X.mean(axis=1)).ravel()

In [None]:
# plot the scores
sc.pl.umap(adata, color=["CMS1_score", "CMS2_score", "CMS3_score", "CMS4_score"], ncols=2)

In [None]:
# assign the cell type based on the highest score
adata.obs["CMS"] = adata.obs[["CMS1_score", "CMS2_score", "CMS3_score", "CMS4_score"]].idxmax(axis=1)
# replace CMS1_score, CMS2_score, CMS3_score, CMS4_score with CMS1, CMS2, CMS3, CMS4
adata.obs["CMS"] = adata.obs["CMS"].str.replace("_score", "")

# plot
sc.pl.umap(adata, color="CMS")
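Assigning the label with the highest score always returns a label, even when two scores are nearly tied. One simple way to flag such cases is to also look at the margin between the best and second-best score. The sketch below works on a made-up expression table with hypothetical gene sets; the margin is only an illustrative heuristic, not a validated uncertainty measure:

```python
import numpy as np
import pandas as pd

# Hypothetical normalized expression: 4 cells x 4 genes.
expr = pd.DataFrame(
    [
        [3.0, 2.0, 0.0, 0.0],
        [0.0, 1.0, 4.0, 2.0],
        [1.0, 1.0, 1.0, 1.2],
        [0.0, 0.0, 0.5, 3.0],
    ],
    columns=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    index=["cell1", "cell2", "cell3", "cell4"],
)
gene_sets = {"CMS1": ["GENE_A", "GENE_B"], "CMS2": ["GENE_C", "GENE_D"]}

# Score = mean expression of the genes in each set.
scores = pd.DataFrame(
    {ct: expr[genes].mean(axis=1) for ct, genes in gene_sets.items()}
)

# Label = set with the highest score; margin = gap between best and second best.
labels = scores.idxmax(axis=1)
sorted_scores = np.sort(scores.to_numpy(), axis=1)
margin = pd.Series(sorted_scores[:, -1] - sorted_scores[:, -2], index=scores.index)
print(pd.DataFrame({"label": labels, "margin": margin}))
```

Here `cell3` receives a label although its two scores are almost identical, which the small margin makes visible.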

In [None]:
# use cmscaller or cmsclassifier later for this purpose to have a statistically more robust method

### From cluster differentially expressed genes to cluster annotation

Conversely, we can calculate marker genes per cluster and then look up whether we can link those marker genes to any known biology such as cell types and/or states. For marker gene calculation of clusters simple methods such as the Wilcoxon rank-sum test are thought to perform best {cite}`anno:Pullin2022.05.09.490241`. Importantly, as the definition of the clusters is based on the same data as used for these statistical tests, the p-values of these tests will be inflated as also described here {cite}`anno:ZHANG2019383`.
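The inflation caused by defining groups and testing them on the same data can be illustrated with a toy simulation. The sketch below draws a single "gene" from pure noise, "clusters" the cells by splitting at the median, and then compares the two groups. To stay NumPy-only, it uses a Welch t statistic instead of the Wilcoxon rank-sum test; the qualitative effect is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Null data: one "gene" with no real group structure at all.
expression = rng.normal(size=1000)


def welch_t(a, b):
    """Welch t statistic for the difference in means of two samples."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


# "Clustering" on the same data: split cells at the median expression.
is_high = expression > np.median(expression)
t_double_dip = welch_t(expression[is_high], expression[~is_high])

# Control: the same group sizes, but assigned at random.
random_groups = rng.permutation(is_high)
t_random = welch_t(expression[random_groups], expression[~random_groups])

print(f"t statistic, groups defined from the data: {t_double_dip:.1f}")
print(f"t statistic, random groups: {t_random:.1f}")
```

The groups derived from the data yield a very large test statistic even though there is no true structure, whereas randomly assigned groups do not. P-values from cluster marker tests should therefore be used for ranking genes rather than read as calibrated evidence.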

Let's calculate the differentially expressed genes for every cluster, compared to the rest of the cells in our adata:

In [None]:
sc.tl.rank_genes_groups(
    adata, groupby="leiden_2", method="wilcoxon", key_added="dea_leiden_2"
)

We can visualize expression of the top differentially expressed genes per cluster with a standard scanpy dotplot:

In [None]:
sc.pl.rank_genes_groups_dotplot(
    adata, groupby="leiden_2", standard_scale="var", n_genes=5, key="dea_leiden_2"
)

As you can see above, a lot of the differentially expressed genes are highly expressed in multiple clusters. We can filter the differentially expressed genes to select for more cluster-specific differentially expressed genes:

In [None]:
# sc.tl.filter_rank_genes_groups(
#     adata,
#     min_in_group_fraction=0.1,
#     max_out_group_fraction=0.3,
#     key="dea_leiden_2",
#     key_added="dea_leiden_2_filtered",
# )

# commented out: likely a problem with the installed scanpy version, see https://github.com/scverse/scanpy/issues/3443

Visualize the filtered genes:

In [None]:
# sc.pl.rank_genes_groups_dotplot(
#     adata,
#     groupby="leiden_2",
#     standard_scale="var",
#     n_genes=5,
#     key="dea_leiden_2_filtered",
# )

Let's take a look at cluster 12, which seems to have a set of relatively unique markers including CDK6, ETV6, NKAIN2, and GNAQ. Some googling tells us that NKAIN2 and ETV6 are hematopoietic stem cell markers {cite}`anno:SHI20222234` {cite}`anno:Wang1998-rx` (NKAIN2 was also present in our list above). In the UMAP we can see that these genes are expressed throughout cluster 12: 

In [None]:
adata.var.columns

In [None]:
sc.pl.umap(
    adata,
    color=["CDK1", "CMS2_score", "leiden_2"],
    vmax="p99",
    legend_loc="on data",
    frameon=False,
    cmap="Reds",
)

This highlights how complicated marker-based annotation is: it is sensitive to the cluster resolution you choose, the robustness and uniqueness of the marker sets you have, and your knowledge of the cell types to be expected in your data.

For this reason, the field is partly trying to move away from manual cluster annotation and rather moving towards automated annotation algorithms instead. The rest of this tutorial will focus on those options.

Before we move on, store the final bit of annotation information in our adata:

## Automated annotation

### General remarks

The remainder of the discussed methods will be methods for automated, rather than manual annotation of your data. Unlike the method showcased above, each of these methods enables you to annotate your data in an automated way. They are based on different principles, sometimes requiring pre-defined sets of markers, other times trained on pre-existing full scRNA-seq datasets. As discussed below, the resulting annotations can be of varying quality. It is therefore important to regard these methods as a starting point rather than an end-point of the annotation process. See also several reviews {cite}`anno:PASQUINI2021961`, {cite}`anno:Abdelaal2019` for a more elaborate discussion of automated annotation methods.

As said, the quality of automatically generated annotations can vary. More specifically, the quality of the annotations depends on:
1) The type of classifier chosen: Previous benchmark studies have shown that different types of classifiers often perform comparably, with neural network-based methods generally not outperforming general-purpose models such as support vector machines or linear regression models {cite}`anno:Abdelaal2019`, {cite}`anno:PASQUINI2021961`, {cite}`anno:Huang2021`.<br>
2) The quality of the data that the classifier was trained on. If the training data was not well annotated or annotated at low resolution, the classifier will do the same. Similarly, if the training data and/or its annotation was noisy, the classifier might not perform well.<br>
3) The similarity of your own data to the data that the classifier was trained on. For example, if the classifier was trained on a drop-seq single cell dataset and your data is 10X single nucleus rather than single cell drop-seq, this might worsen the quality of the annotation. Classifiers trained on cross-dataset atlases including a diversity of datasets might give more robust and better quality annotations than classifiers trained on a single dataset. An example is the CellTypist (an automated annotation method that will be discussed more extensively below) classifier trained on the Human Lung Cell Atlas {cite}`anno:Sikkema2023` which includes 14 different lung datasets. This model is likely to perform better on new lung data than a model that was trained on a single lung dataset.  

The aforementioned points highlight possible disadvantages of using classifiers, depending on the training data and model type. Nonetheless, there are several important advantages of using pre-trained classifiers to annotate your data. First, it is a fast and easy way to annotate your data. The annotation requires neither downloading nor preprocessing the training data and sometimes merely involves uploading your data to an online portal. Second, these methods don't rely on a partitioning of your data into clusters, as the manual annotation does. Third, pre-trained classifiers enable you to directly leverage the knowledge and information from previous studies, such as a high quality annotation. And finally, using such classifiers can help with harmonizing cell-type definitions across a field, thereby clearing the path towards a field-wide consensus on these definitions. 

Finally, as these classifiers are often less transparent than e.g. manual marker-based annotation, a good uncertainty measure quantifying annotation uncertainty will improve the quality and usability of the method. We will discuss this more extensively further down.

### Marker gene-based classifiers

One class of automated cell type annotation methods relies on a predefined set of marker genes. Cells are classified into cell types based on their expression levels of these marker genes. Examples of such methods are Garnett {cite}`anno:Pliner2019` and CellAssign {cite}`anno:Zhang2019`. The more robust and generalizable the set of marker genes these models are based on, the better the model will perform. However, like with other models they are likely to be affected by batch effect-related differences between the data the model was trained on and the data that needs to be labeled. One of the advantages of these methods compared to models based on larger gene sets (see below) is that they are more transparent: we know on the basis of which genes the classification is done.<br>
We will not show an example of marker-based classifiers in this notebook, but encourage you to explore these yourself if you are interested.

### Classifiers based on a wider set of genes

It is worth noting that the methods discussed so far use only a small subset of the genes detected in the data: often a set of only 1 to ~10 marker genes per cell type is used. An alternative approach is to use a classifier that takes as input a larger set of genes (several thousands or more), thereby making more use of the breadth of scRNA-seq data. Such classifiers are trained on previously annotated datasets or atlases. Examples of these are CellTypist {cite}`anno:Conde2022` (see also https://www.celltypist.org, where data can be uploaded to a portal to get automated cell annotations) and Clustifyr {cite}`anno:Fu2020`. 

Let's try out CellTypist on our data. Based on the CellTypist tutorial (https://www.celltypist.org/tutorials) we know we need to prepare our data so that counts are normalized to 10,000 counts per cell, then log1p-transformed:

In [None]:
adata_celltypist = adata.copy()  # make a copy of our adata
adata_celltypist.X = adata.layers[count_layer].copy()  # set adata.X to raw counts
sc.pp.normalize_total(
    adata_celltypist, target_sum=10**4
)  # normalize to 10,000 counts per cell
sc.pp.log1p(adata_celltypist)  # log-transform
# make .X dense instead of sparse, for compatibility with celltypist:
adata_celltypist.X = adata_celltypist.X.toarray()
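To make explicit what this preprocessing does, here is a minimal NumPy sketch of the same two steps on a made-up count matrix (the numbers are placeholders). It also shows a quick sanity check: undoing the log transform should recover 10,000 counts per cell.

```python
import numpy as np

# Hypothetical raw counts: 3 cells x 4 genes.
counts = np.array(
    [
        [10, 0, 5, 5],
        [100, 50, 25, 25],
        [1, 1, 0, 0],
    ],
    dtype=float,
)

target_sum = 1e4
# Scale every cell so that its counts sum to target_sum ...
per_cell_total = counts.sum(axis=1, keepdims=True)
normalized = counts / per_cell_total * target_sum
# ... then apply the natural-log transform log(1 + x).
log_normalized = np.log1p(normalized)

# Sanity check: undoing the log recovers 10,000 counts per cell.
print(np.expm1(log_normalized).sum(axis=1))
```

The same check, `np.expm1(X).sum(axis=1)`, can be applied to a dense `adata_celltypist.X` to confirm that the matrix is in the expected format before running the classifier.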

We'll now download the CellTypist models we want to use:

In [None]:
# models.download_models(
#     force_update=False,
#     model=["Human_Colorectal_Cancer.pkl", "Cells_Intestinal_Tract.pkl"],
# )

Let's try out both the `Human_Colorectal_Cancer` and `Cells_Intestinal_Tract` models. Below, we treat the `Cells_Intestinal_Tract` model as the coarser annotation and the `Human_Colorectal_Cancer` model as the finer one:

In [None]:
model_crc = models.Model.load(model="Human_Colorectal_Cancer.pkl")
model_gut = models.Model.load(model="Cells_Intestinal_Tract.pkl")

For each of these, we can see which cell types it includes to see if bone marrow cell types are included:

In [None]:
model_gut.cell_types

In [None]:
model_crc.cell_types

Looks like the models include many different immune cell type progenitors!

Now let's run the models. First the coarse one:

In [None]:
predictions_gut = celltypist.annotate(
    adata_celltypist, model=model_gut, majority_voting=True
)

Transform the predictions to adata to get the full output...

In [None]:
predictions_gut_adata = predictions_gut.to_adata()

...and copy the results to our original AnnData object:

In [None]:
adata.obs["celltypist_cell_label_gut"] = predictions_gut_adata.obs.loc[
    adata.obs.index, "majority_voting"
]
adata.obs["celltypist_conf_score_gut"] = predictions_gut_adata.obs.loc[
    adata.obs.index, "conf_score"
]

Now the same for the finer annotations:

In [None]:
predictions_crc = celltypist.annotate(
    adata_celltypist, model=model_crc, majority_voting=True
)

In [None]:
predictions_crc_adata = predictions_crc.to_adata()

In [None]:
adata.obs["celltypist_cell_label_crc"] = predictions_crc_adata.obs.loc[
    adata.obs.index, "majority_voting"
]
adata.obs["celltypist_conf_score_crc"] = predictions_crc_adata.obs.loc[
    adata.obs.index, "conf_score"
]

Now plot:

In [None]:
sc.pl.umap(
    adata,
    color=["celltypist_cell_label_gut", "celltypist_conf_score_gut"],
    frameon=False,
    sort_order=False,
    wspace=1,
)

In [None]:
sc.pl.umap(
    adata,
    color=["celltypist_cell_label_crc", "celltypist_conf_score_crc"],
    frameon=False,
    sort_order=False,
    wspace=1,
)

One way of getting a feeling for the quality of these annotations is by looking if the observed cell type similarities correspond to our expectations:

In [None]:
# do only if more than one cell type is present
if len(adata.obs["celltypist_cell_label_crc"].unique()) > 1:
    sc.pl.dendrogram(adata, groupby="celltypist_cell_label_crc")

In [None]:
# do only if more than one cell type is present
if len(adata.obs["celltypist_cell_label_gut"].unique()) > 1:
    sc.pl.dendrogram(adata, groupby="celltypist_cell_label_gut")

This dendrogram partly reflects prior knowledge on cell type relations (e.g. B cells largely clustering together), but we also observe some unexpected patterns: Tcm/Naive helper T cells cluster with erythroid cells and macrophages rather than with the other T cells. This is a red flag! Possibly, the Tcm/Naive helper T cell annotations are wrong.

Now let's take a look at our earlier manual annotations:

In [None]:
sc.pl.umap(
    adata,
    color="celltypist_cell_label_crc",
    frameon=False,
    palette=["blue"],
)

sc.pl.umap(
    adata,
    color=["CMS"],
    frameon=False,
    palette=["red", "blue", "orange"],
)

You can see that our naive B cell annotation corresponds well to part of the automatic naive B cell annotation. Similarly, part of what we called transitional B cells is called "small pre-B cells" in their annotations and our B1 B cells correspond to their memory B cells, which is encouraging!

However, you'll also notice that our HSC + MK/E prog cluster is annotated as a mixture of T cells and HSCs/multipotent progenitors in their fine annotation, hence these annotations are partly contradictory. Looking at the confidence score of both annotations, we see that the annotation of the larger part of the cells is done with relatively low confidence, which is a useful indication that these annotations cannot be copied without careful validation and manual reviewing!
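One simple way to act on the confidence scores is to mask labels below a cutoff and to check which labels are most affected. The sketch below uses a made-up table that mimics the label and confidence columns stored above; the column names, labels, and the cutoff of 0.5 are assumptions for illustration, not recommended values:

```python
import pandas as pd

# Hypothetical per-cell predictions, mimicking the label and confidence columns above.
toy_obs = pd.DataFrame(
    {
        "label": ["B cell", "T cell", "T cell", "Epithelial"],
        "conf_score": [0.95, 0.30, 0.55, 0.10],
    },
    index=["cell1", "cell2", "cell3", "cell4"],
)

conf_threshold = 0.5  # assumed cutoff for illustration; not a recommended value

# Keep the label only where the classifier was reasonably confident.
toy_obs["label_filtered"] = toy_obs["label"].where(
    toy_obs["conf_score"] >= conf_threshold, other="Uncertain"
)

# Share of cells per original label that fall below the cutoff.
is_uncertain = toy_obs["conf_score"] < conf_threshold
frac_uncertain = is_uncertain.groupby(toy_obs["label"]).mean()
print(toy_obs)
print(frac_uncertain)
```

Labels with a high share of low-confidence cells are good candidates for manual review.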

See here the breakdown of cluster 12 in terms of fine celltypist labels:

In [None]:
pd.crosstab(adata.obs.leiden_2, adata.obs.celltypist_cell_label_crc).loc[
    "12", :
].sort_values(ascending=False)

In the coarser CellTypist labels we observe different patterns: our cluster 12 is mostly annotated as B cells or Megakaryocyte precursors, again only partly corresponding to our annotations.

In [None]:
pd.crosstab(adata.obs.leiden_2, adata.obs.celltypist_cell_label_gut).loc[
    "12", :
].sort_values(ascending=False)
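Raw counts in such a breakdown are hard to compare between clusters of different size. `pd.crosstab` can return per-cluster fractions instead via `normalize="index"`. A minimal sketch on a made-up table (the `cluster` and `label` columns are placeholders for the Leiden and CellTypist columns used above):

```python
import pandas as pd

# Hypothetical cluster assignments and predicted labels for seven cells.
toy = pd.DataFrame(
    {
        "cluster": ["0", "0", "0", "1", "1", "1", "1"],
        "label": ["A", "A", "B", "B", "B", "B", "C"],
    }
)

# Fraction of each label within every cluster (each row sums to 1).
composition = pd.crosstab(toy["cluster"], toy["label"], normalize="index")
print(composition.loc["1"].sort_values(ascending=False))
```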

Finally, store your adata object if desired:

In [None]:
adata.write_h5ad(output_file)

## References

```{bibliography}
:filter: docname in docnames
:labelprefix: anno
```

## Contributors
We gratefully acknowledge the contributions of:
### Authors
- Lisa Sikkema
- Maren Büttner
### Reviewers
- Lukas Heumos

In [None]:
# are the counts still integers? (They should be)
print(adata.X[0:5, 0:5].todense())
print(adata.layers['counts'][0:5, 0:5].todense())
adata.raw[0:5, 0:5]