# Using CellTypist for cell type classification
This notebook demonstrates cell type classification of scRNA-seq query data: the most likely cell type labels are retrieved from either the built-in CellTypist models or user-trained custom models.

Only the main steps and key parameters are introduced in this notebook. Refer to the detailed [Usage](https://github.com/Teichlab/celltypist#usage) documentation if you want to learn more.

## Install CellTypist

In [19]:
!pip install celltypist



In [20]:
import scanpy as sc

In [21]:
import celltypist
from celltypist import models

## Download a scRNA-seq dataset of 2,000 immune cells

In [22]:
# Not run; the original demo data of 2,000 immune cells.
# adata_2000 = sc.read('celltypist_demo_folder/demo_2000_cells.h5ad', backup_url = 'https://celltypist.cog.sanger.ac.uk/Notebook_demo_data/demo_2000_cells.h5ad')
# Load a locally stored dataset instead (the variable name `adata_2000` is kept from the demo).
adata_2000 = sc.read("/media/hieunguyen/GSHD_HN01/outdir/BrainMet_SeuratV5/integrate_BrainMet_datasets/integrated_v0.2/seurat2anndata/from_12_output/integrated_BrainMet_dataset.output.s8/integrated_BrainMet_dataset.output.s8_harmony.cluster.0.5.h5ad")

Only considering the two last: ['.5', '.h5ad'].


The original CellTypist demo dataset includes 2,000 cells and 18,950 genes collected from different studies, thereby showing the practical applicability of CellTypist. The dataset loaded in this run is larger, as its shape below shows.

In [23]:
adata_2000.shape

(46258, 20514)

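CellTypist requires the expression matrix (`adata_2000.X`) to hold log1p-normalised expression scaled to 10,000 counts per cell (this matrix can alternatively be stashed in `.raw.X`). Reversing the log1p transform and summing per cell, as in the next cell, should therefore give values close to 10,000.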

In [24]:
adata_2000.X.expm1().sum(axis = 1)

matrix([[3030.],
        [3101.],
        [2905.],
        ...,
        [3956.],
        [4751.],
        [4408.]])
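
The per-cell totals shown above lie roughly between 2,900 and 4,800 rather than at 10,000, which suggests this particular matrix may need renormalising before annotation. The following is a minimal sketch on toy NumPy data and is not part of CellTypist. The helper names `check_log1p_cp10k` and `log1p_cp10k` and the toy counts are hypothetical.

```python
import numpy as np

def check_log1p_cp10k(log_matrix, target=10_000, rtol=0.01):
    """Undo log1p, sum per cell (row), and flag rows whose total is close to `target`."""
    totals = np.expm1(np.asarray(log_matrix, dtype=float)).sum(axis=1)
    return totals, np.isclose(totals, target, rtol=rtol)

def log1p_cp10k(counts, target=10_000):
    """Scale each cell (row) to `target` total counts, then apply log1p.

    Assumes every cell has at least one count (no all-zero rows).
    """
    counts = np.asarray(counts, dtype=float)
    per_cell = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / per_cell * target)

# Toy raw counts: 3 cells x 4 genes (made-up numbers).
raw = np.array([[10, 0, 5, 5], [1, 1, 1, 1], [0, 30, 0, 10]])

normalised = log1p_cp10k(raw)
totals, ok = check_log1p_cp10k(normalised)
print(totals)  # each total is close to 10,000
print(ok)      # every cell passes the check

# A matrix that was only log1p-transformed does not pass.
_, ok_unscaled = check_log1p_cp10k(np.log1p(raw))
print(ok_unscaled)
```

With an `AnnData` holding raw counts, the usual scanpy equivalents are `sc.pp.normalize_total(adata, target_sum = 1e4)` followed by `sc.pp.log1p(adata)`.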

Some pre-assigned cell type labels are also in the data, which will be compared to the predicted labels from CellTypist later.

In [25]:
adata_2000.obs

barcode,orig.ident,nCount_RNA,nFeature_RNA,name,stage,percent.mt,percent.ribo,No.Exprs,log10GenesPerUMI,nCount_decontX,...,PrimaryTumor,nCount_SCT,nFeature_SCT,cca.cluster.0.5,seurat_clusters,rpca.cluster.0.5,harmony.cluster.0.5,barcode,UMAP_1,UMAP_2
AAACCTGTCTTTAGTC_NS_02,BrainMet_SeuratV5_GSE131907_NS_02,3152,1297,NS_02,GSE131907_NS_02,3.458122,22.747462,1,0.889771,2972.026068,...,,3030,1288,7,10,6,10,AAACCTGTCTTTAGTC_NS_02,6.517687,4.592167
AAACGGGAGCTTTGGT_NS_02,BrainMet_SeuratV5_GSE131907_NS_02,3361,1223,NS_02,GSE131907_NS_02,3.332342,22.909848,1,0.875501,3143.914241,...,,3101,1218,0,12,0,12,AAACGGGAGCTTTGGT_NS_02,-3.119733,6.495010
AAACGGGCATACGCCG_NS_02,BrainMet_SeuratV5_GSE131907_NS_02,2910,1030,NS_02,GSE131907_NS_02,3.573883,31.030928,1,0.869784,2792.763364,...,,2905,1024,4,3,0,3,AAACGGGCATACGCCG_NS_02,-3.296767,0.531751
AAAGATGAGAGGTAGA_NS_02,BrainMet_SeuratV5_GSE131907_NS_02,3634,1117,NS_02,GSE131907_NS_02,2.394056,38.580077,1,0.856102,3430.073606,...,,3165,1108,1,1,1,1,AAAGATGAGAGGTAGA_NS_02,-2.616868,-2.451068
AAAGATGAGGCGTACA_NS_02,BrainMet_SeuratV5_GSE131907_NS_02,2789,992,NS_02,GSE131907_NS_02,2.832556,29.006812,1,0.869701,2588.613492,...,,2866,990,0,0,0,0,AAAGATGAGGCGTACA_NS_02,-6.262387,3.076460
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSE193745_GSE193745_TTTGCGCCATGTAAGA-1_17,GSE193745,6524,2005,merge17samples,,1.486818,6.100552,1,0.865671,6508.388275,...,Melanoma,4466,1959,3,6,8,6,GSE193745_GSE193745_TTTGCGCCATGTAAGA-1_17,6.989755,-1.927703
GSE193745_GSE193745_TTTGCGCGTCGCATCG-1_17,GSE193745,15939,4273,merge17samples,,3.507121,18.301023,1,0.863954,15289.379068,...,Melanoma,4060,1946,6,3,0,3,GSE193745_GSE193745_TTTGCGCGTCGCATCG-1_17,-4.178997,0.325401
GSE193745_GSE193745_TTTGGTTAGGTGTTAA-1_17,GSE193745,13440,3335,merge17samples,,3.005952,18.080357,1,0.853381,10514.332278,...,Melanoma,3956,1496,10,7,8,7,GSE193745_GSE193745_TTTGGTTAGGTGTTAA-1_17,10.318720,-1.993642
GSE193745_GSE193745_TTTGGTTCAAATACAG-1_17,GSE193745,9480,3083,merge17samples,,4.641350,16.529536,1,0.877330,9379.084767,...,Melanoma,4751,2542,2,4,3,4,GSE193745_GSE193745_TTTGGTTCAAATACAG-1_17,-6.674124,-1.595920


## Assign cell type labels using a CellTypist built-in model
In this section, we show the procedure of transferring cell type labels from built-in models to the query dataset.

Download the latest CellTypist models.

In [26]:
# Enabling `force_update = True` will overwrite existing (old) models.
models.download_models(force_update = True)

📜 Retrieving model list from server https://celltypist.cog.sanger.ac.uk/models/models.json
📚 Total models in list: 54
📂 Storing models in /home/hieunguyen/.celltypist/data/models
💾 Downloading model [1/54]: Immune_All_Low.pkl
💾 Downloading model [2/54]: Immune_All_High.pkl
💾 Downloading model [3/54]: Adult_COVID19_PBMC.pkl
💾 Downloading model [4/54]: Adult_CynomolgusMacaque_Hippocampus.pkl
💾 Downloading model [5/54]: Adult_Human_MTG.pkl
💾 Downloading model [6/54]: Adult_Human_PancreaticIslet.pkl
💾 Downloading model [7/54]: Adult_Human_PrefrontalCortex.pkl
💾 Downloading model [8/54]: Adult_Human_Skin.pkl
💾 Downloading model [9/54]: Adult_Human_Vascular.pkl
💾 Downloading model [10/54]: Adult_Mouse_Gut.pkl
💾 Downloading model [11/54]: Adult_Mouse_OlfactoryBulb.pkl
💾 Downloading model [12/54]: Adult_Pig_Hippocampus.pkl
💾 Downloading model [13/54]: Adult_RhesusMacaque_Hippocampus.pkl
💾 Downloading model [14/54]: Autopsy_COVID19_Lung.pkl
💾 Downloading model [15/54]: COVID19_HumanChallenge_Bl

All models are stored in `models.models_path`.

In [27]:
models.models_path

'/home/hieunguyen/.celltypist/data/models'

Get an overview of the models and what they represent.

In [28]:
models.models_description()

👉 Detailed model information can be found at `https://www.celltypist.org/models`


| | model | description |
|---|---|---|
| 0 | Immune_All_Low.pkl | immune sub-populations combined from 20 tissue... |
| 1 | Immune_All_High.pkl | immune populations combined from 20 tissues of... |
| 2 | Adult_COVID19_PBMC.pkl | peripheral blood mononuclear cell types from C... |
| 3 | Adult_CynomolgusMacaque_Hippocampus.pkl | cell types from the hippocampus of adult cynom... |
| 4 | Adult_Human_MTG.pkl | cell types and subtypes (10x-based) from the a... |
| 5 | Adult_Human_PancreaticIslet.pkl | cell types from pancreatic islets of healthy a... |
| 6 | Adult_Human_PrefrontalCortex.pkl | cell types and subtypes from the adult human d... |
| 7 | Adult_Human_Skin.pkl | cell types from human healthy adult skin |
| 8 | Adult_Human_Vascular.pkl | vascular populations combined from multiple ad... |
| 9 | Adult_Mouse_Gut.pkl | cell types in the adult mouse gut combined fro... |


Choose the model you want to employ, for example, the model with all tissues combined containing low-hierarchy (high-resolution) immune cell types/subtypes.

In [29]:
# Indeed, the `model` argument defaults to `Immune_All_Low.pkl`.
model = models.Model.load(model = 'Immune_All_Low.pkl')

Show the model meta information.

In [30]:
model

CellTypist model with 98 cell types and 6639 features
    date: 2022-07-16 00:20:42.927778
    details: immune sub-populations combined from 20 tissues of 18 studies
    source: https://doi.org/10.1126/science.abl5197
    version: v2
    cell types: Age-associated B cells, Alveolar macrophages, ..., pDC precursor
    features: A1BG, A2M, ..., ZYX

This model contains 98 cell states.

In [31]:
model.cell_types

array(['Age-associated B cells', 'Alveolar macrophages', 'B cells',
       'CD16+ NK cells', 'CD16- NK cells', 'CD8a/a', 'CD8a/b(entry)',
       'CMP', 'CRTAM+ gamma-delta T cells', 'Classical monocytes',
       'Cycling B cells', 'Cycling DCs', 'Cycling NK cells',
       'Cycling T cells', 'Cycling gamma-delta T cells',
       'Cycling monocytes', 'DC', 'DC precursor', 'DC1', 'DC2', 'DC3',
       'Double-negative thymocytes', 'Double-positive thymocytes', 'ELP',
       'ETP', 'Early MK', 'Early erythroid', 'Early lymphoid/T lymphoid',
       'Endothelial cells', 'Epithelial cells', 'Erythrocytes',
       'Erythrophagocytic macrophages', 'Fibroblasts',
       'Follicular B cells', 'Follicular helper T cells', 'GMP',
       'Germinal center B cells', 'Granulocytes', 'HSC/MPP',
       'Hofbauer cells', 'ILC', 'ILC precursor', 'ILC1', 'ILC2', 'ILC3',
       'Intermediate macrophages', 'Intestinal macrophages',
       'Kidney-resident macrophages', 'Kupffer cells',
       'Large pre-B cell

Transfer cell type labels from this model to the query dataset using [celltypist.annotate](https://celltypist.readthedocs.io/en/latest/celltypist.annotate.html).

In [37]:
# Not run; predict cell identities using this loaded model.
#predictions = celltypist.annotate(adata_2000, model = model, majority_voting = True)
# Compute a 50-component PCA on all genes; the result is stored in `adata_2000.obsm['X_pca']`.
sc.pp.pca(adata_2000, n_comps = 50, use_highly_variable = False)
# Alternatively, just specify the model name (recommended as this ensures the model is intact every time it is loaded).
predictions = celltypist.annotate(adata_2000, model = 'Immune_All_Low.pkl', majority_voting = True)

🔬 Input data has 46258 cells and 20514 genes
🔗 Matching reference genes in the model
🧬 5853 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 20
🗳️ Majority voting the predictions
✅ Majority voting done!


By default (`majority_voting = False`), CellTypist will infer the identity of each query cell independently. This leads to raw predicted cell type labels, and usually finishes within seconds or minutes depending on the size of the query data. You can also turn on the majority-voting classifier (`majority_voting = True`), which refines cell identities within local subclusters after an over-clustering approach at the cost of increased runtime.
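
The idea behind majority voting can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the helper `majority_vote` is hypothetical, ties are broken by first occurrence, and the exact rule CellTypist applies within each subcluster may differ.

```python
from collections import Counter, defaultdict

def majority_vote(predicted_labels, subclusters):
    """Give every cell the most frequent predicted label within its subcluster."""
    votes = defaultdict(Counter)
    for label, cluster in zip(predicted_labels, subclusters):
        votes[cluster][label] += 1
    # Pick the dominant label of each subcluster.
    winners = {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}
    return [winners[cluster] for cluster in subclusters]

# Toy per-cell predictions and over-clustering assignments (made-up values).
predicted = ["NK cells", "CD16+ NK cells", "NK cells", "Regulatory T cells", "Regulatory T cells"]
clusters = ["21", "21", "21", "204", "204"]

voted = majority_vote(predicted, clusters)
print(voted)
# ['NK cells', 'NK cells', 'NK cells', 'Regulatory T cells', 'Regulatory T cells']
```

The second cell changes from `CD16+ NK cells` to `NK cells` because two of the three cells in its subcluster carry that label.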

The results include the predicted cell type labels (`predicted_labels`), the over-clustering result (`over_clustering`), and the predicted labels after majority voting in local subclusters (`majority_voting`). Note that in `predicted_labels`, each query cell gets its inferred label by choosing the most probable cell type among all possible cell types in the given model.
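
Choosing the most probable cell type amounts to a row-wise argmax over a cells-by-cell-types probability matrix. The sketch below uses a made-up matrix and three cell type names for illustration. It does not call CellTypist, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical cell types and per-cell probabilities (rows: cells, columns: cell types).
cell_types = np.array(["B cells", "NK cells", "Regulatory T cells"])
probabilities = np.array([
    [0.10, 0.85, 0.05],
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
])

# For each cell, take the column with the highest probability.
best = probabilities.argmax(axis=1)
predicted_labels = cell_types[best]
best_probability = probabilities.max(axis=1)

print(predicted_labels.tolist())  # ['NK cells', 'B cells', 'Regulatory T cells']
print(best_probability.tolist())  # [0.85, 0.7, 0.8]
```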

In [38]:
predictions.predicted_labels

| barcode | predicted_labels | over_clustering | majority_voting |
|---|---|---|---|
| AAACCTGTCTTTAGTC_NS_02 | NK cells | 21 | NK cells |
| AAACGGGAGCTTTGGT_NS_02 | CD16+ NK cells | 21 | NK cells |
| AAACGGGCATACGCCG_NS_02 | NK cells | 21 | NK cells |
| AAAGATGAGAGGTAGA_NS_02 | Regulatory T cells | 193 | Tcm/Naive helper T cells |
| AAAGATGAGGCGTACA_NS_02 | NK cells | 21 | NK cells |
| ... | ... | ... | ... |
| GSE193745_GSE193745_TTTGCGCCATGTAAGA-1_17 | Intermediate macrophages | 208 | Alveolar macrophages |
| GSE193745_GSE193745_TTTGCGCGTCGCATCG-1_17 | Tem/Trm cytotoxic T cells | 38 | Tem/Trm cytotoxic T cells |
| GSE193745_GSE193745_TTTGGTTAGGTGTTAA-1_17 | DC2 | 236 | Classical monocytes |
| GSE193745_GSE193745_TTTGGTTCAAATACAG-1_17 | Regulatory T cells | 204 | Regulatory T cells |


Transform the prediction result into an `AnnData`.

In [39]:
# Get an `AnnData` with predicted labels embedded into the cell metadata columns.
adata = predictions.to_adata()

Compared to `adata_2000`, the new `adata` has additional prediction information in `adata.obs` (`predicted_labels`, `over_clustering`, `majority_voting` and `conf_score`). Of note, all these columns can be prefixed with a specific string by setting `prefix` in [to_adata](https://celltypist.readthedocs.io/en/latest/celltypist.classifier.AnnotationResult.html#celltypist.classifier.AnnotationResult.to_adata).

In [40]:
adata.obs

barcode,orig.ident,nCount_RNA,nFeature_RNA,name,stage,percent.mt,percent.ribo,No.Exprs,log10GenesPerUMI,nCount_decontX,...,seurat_clusters,rpca.cluster.0.5,harmony.cluster.0.5,barcode,UMAP_1,UMAP_2,predicted_labels,over_clustering,majority_voting,conf_score
AAACCTGTCTTTAGTC_NS_02,BrainMet_SeuratV5_GSE131907_NS_02,3152,1297,NS_02,GSE131907_NS_02,3.458122,22.747462,1,0.889771,2972.026068,...,10,6,10,AAACCTGTCTTTAGTC_NS_02,6.517687,4.592167,NK cells,21,NK cells,0.057789
AAACGGGAGCTTTGGT_NS_02,BrainMet_SeuratV5_GSE131907_NS_02,3361,1223,NS_02,GSE131907_NS_02,3.332342,22.909848,1,0.875501,3143.914241,...,12,0,12,AAACGGGAGCTTTGGT_NS_02,-3.119733,6.495010,CD16+ NK cells,21,NK cells,0.983818
AAACGGGCATACGCCG_NS_02,BrainMet_SeuratV5_GSE131907_NS_02,2910,1030,NS_02,GSE131907_NS_02,3.573883,31.030928,1,0.869784,2792.763364,...,3,0,3,AAACGGGCATACGCCG_NS_02,-3.296767,0.531751,NK cells,21,NK cells,0.192141
AAAGATGAGAGGTAGA_NS_02,BrainMet_SeuratV5_GSE131907_NS_02,3634,1117,NS_02,GSE131907_NS_02,2.394056,38.580077,1,0.856102,3430.073606,...,1,1,1,AAAGATGAGAGGTAGA_NS_02,-2.616868,-2.451068,Regulatory T cells,193,Tcm/Naive helper T cells,0.081304
AAAGATGAGGCGTACA_NS_02,BrainMet_SeuratV5_GSE131907_NS_02,2789,992,NS_02,GSE131907_NS_02,2.832556,29.006812,1,0.869701,2588.613492,...,0,0,0,AAAGATGAGGCGTACA_NS_02,-6.262387,3.076460,NK cells,21,NK cells,0.339205
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSE193745_GSE193745_TTTGCGCCATGTAAGA-1_17,GSE193745,6524,2005,merge17samples,,1.486818,6.100552,1,0.865671,6508.388275,...,6,8,6,GSE193745_GSE193745_TTTGCGCCATGTAAGA-1_17,6.989755,-1.927703,Intermediate macrophages,208,Alveolar macrophages,0.098222
GSE193745_GSE193745_TTTGCGCGTCGCATCG-1_17,GSE193745,15939,4273,merge17samples,,3.507121,18.301023,1,0.863954,15289.379068,...,3,0,3,GSE193745_GSE193745_TTTGCGCGTCGCATCG-1_17,-4.178997,0.325401,Tem/Trm cytotoxic T cells,38,Tem/Trm cytotoxic T cells,0.208177
GSE193745_GSE193745_TTTGGTTAGGTGTTAA-1_17,GSE193745,13440,3335,merge17samples,,3.005952,18.080357,1,0.853381,10514.332278,...,7,8,7,GSE193745_GSE193745_TTTGGTTAGGTGTTAA-1_17,10.318720,-1.993642,DC2,236,Classical monocytes,0.578944
GSE193745_GSE193745_TTTGGTTCAAATACAG-1_17,GSE193745,9480,3083,merge17samples,,4.641350,16.529536,1,0.877330,9379.084767,...,4,3,4,GSE193745_GSE193745_TTTGGTTCAAATACAG-1_17,-6.674124,-1.595920,Regulatory T cells,204,Regulatory T cells,0.999452


In addition to the added metadata, the neighborhood graph constructed during over-clustering is also stored in `adata` (if a pre-calculated neighborhood graph is already present in the `AnnData`, this graph construction step is skipped).
This graph can be used to derive cell embeddings, such as UMAP coordinates.

In [41]:
# If the UMAP or any cell embeddings are already available in the `AnnData`, skip this command.
sc.tl.umap(adata)

Visualise the prediction results.

In [42]:
sc.pl.umap(adata, color = ['cell_type', 'predicted_labels', 'majority_voting'], legend_loc = 'on data')

KeyError: 'Could not find key cell_type in .var_names or .obs.columns.'

<Figure size 2183.4x480 with 0 Axes>
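
The `KeyError` above arises because the dataset loaded in this run has no `cell_type` column in `adata.obs`; that column belongs to the original demo data. One way to avoid the error is to keep only the keys that exist before plotting. The helper below is a hypothetical sketch in plain Python. It checks `.obs` column names only and ignores gene names in `.var_names`, which `sc.pl.umap` would also accept.

```python
def existing_keys(requested, available):
    """Split the requested colour keys into those present in `available` and those missing."""
    available = set(available)
    present = [key for key in requested if key in available]
    missing = [key for key in requested if key not in available]
    return present, missing

# A subset of the column names visible in the `adata.obs` output above.
obs_columns = ["orig.ident", "harmony.cluster.0.5", "predicted_labels",
               "over_clustering", "majority_voting", "conf_score"]

present, missing = existing_keys(["cell_type", "predicted_labels", "majority_voting"], obs_columns)
print(present)  # ['predicted_labels', 'majority_voting']
print(missing)  # ['cell_type']
```

In the notebook, `available` would be `adata.obs.columns`, and the plot call would become `sc.pl.umap(adata, color = present, legend_loc = 'on data')`. An existing clustering column such as `harmony.cluster.0.5` could stand in for the missing reference labels.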

Actually, you may not need to explicitly convert `predictions` output by `celltypist.annotate` into an `AnnData` as above. A more useful way is to use the visualisation function [celltypist.dotplot](https://celltypist.readthedocs.io/en/latest/celltypist.dotplot.html), which quantitatively compares the CellTypist prediction result (e.g. `majority_voting` here) with the cell types pre-defined in the `AnnData` (here `cell_type`). You can also change the value of `use_as_prediction` to `predicted_labels` to compare the raw prediction result with the pre-defined cell types.

In [None]:
celltypist.dotplot(predictions, use_as_reference = 'cell_type', use_as_prediction = 'majority_voting')

For each pre-defined cell type (each column from the dot plot), this plot shows how it can be 'decomposed' into different cell types predicted by CellTypist (rows).
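
The decomposition described above can also be tabulated directly. The sketch below uses `pandas.crosstab` on a made-up table with the same two columns, and is not a reimplementation of `celltypist.dotplot`. Each column (pre-defined cell type) sums to 1 across the predicted labels in the rows.

```python
import pandas as pd

# Toy metadata with reference labels and CellTypist-style predictions (made-up values).
obs = pd.DataFrame({
    "cell_type":       ["T", "T", "T",  "B", "B", "NK"],
    "majority_voting": ["T", "T", "NK", "B", "B", "NK"],
})

# Rows: predicted labels; columns: pre-defined cell types; values: fraction per column.
fractions = pd.crosstab(obs["majority_voting"], obs["cell_type"], normalize="columns")
print(fractions)
```

Here one third of the pre-defined `T` cells are predicted as `NK`, so the `T` column splits into 2/3 `T` and 1/3 `NK`.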

## Assign cell type labels using a custom model
In this section, we show the procedure of generating a custom model and transferring labels from the model to the query data.

Use the previously downloaded dataset of 2,000 immune cells as the training set.

In [None]:
adata_2000 = sc.read('celltypist_demo_folder/demo_2000_cells.h5ad', backup_url = 'https://celltypist.cog.sanger.ac.uk/Notebook_demo_data/demo_2000_cells.h5ad')

Download another scRNA-seq dataset of 400 immune cells as a query.

In [None]:
adata_400 = sc.read('celltypist_demo_folder/demo_400_cells.h5ad', backup_url = 'https://celltypist.cog.sanger.ac.uk/Notebook_demo_data/demo_400_cells.h5ad')

Derive a custom model by training on the data with the [celltypist.train](https://celltypist.readthedocs.io/en/latest/celltypist.train.html) function.

In [None]:
# The `cell_type` in `adata_2000.obs` will be used as cell type labels for training.
new_model = celltypist.train(adata_2000, labels = 'cell_type', n_jobs = 10, feature_selection = True)

Refer to the function [celltypist.train](https://celltypist.readthedocs.io/en/latest/celltypist.train.html) for what each parameter means, and to the [usage](https://github.com/Teichlab/celltypist#usage) for details of model training.

This custom model can be handled in the same way as the built-in CellTypist models. First, save this model locally.

In [None]:
# Save the model.
new_model.write('celltypist_demo_folder/model_from_immune2000.pkl')

You can load this model with `models.Model.load`.

In [None]:
new_model = models.Model.load('celltypist_demo_folder/model_from_immune2000.pkl')

Next, we use this model to predict the query dataset of 400 immune cells.

In [None]:
# Not run; predict the identity of each input cell with the new model.
#predictions = celltypist.annotate(adata_400, model = new_model, majority_voting = True)
# Alternatively, just specify the model path (recommended as this ensures the model is intact every time it is loaded).
predictions = celltypist.annotate(adata_400, model = 'celltypist_demo_folder/model_from_immune2000.pkl', majority_voting = True)

In [None]:
adata = predictions.to_adata()

In [None]:
sc.tl.umap(adata)

In [None]:
sc.pl.umap(adata, color = ['cell_type', 'predicted_labels', 'majority_voting'], legend_loc = 'on data')

In [None]:
celltypist.dotplot(predictions, use_as_reference = 'cell_type', use_as_prediction = 'majority_voting')

## Examine expression of cell type-driving genes

Each model can be examined in terms of the driving genes for each cell type. Note that these genes depend only on the model, that is, on the training dataset.
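
As a rough intuition for what "driving genes" means, the sketch below assumes a linear classifier with one weight per gene per cell type and ranks genes by that weight. This is an assumption made for illustration. The weights are made up, and the helper `top_markers` is hypothetical rather than CellTypist's own implementation.

```python
import numpy as np

def top_markers(coefficients, genes, cell_types, cell_type, top_n=3):
    """Return the `top_n` genes with the largest weights for `cell_type`."""
    row = list(cell_types).index(cell_type)
    # Sort gene indices by descending weight and keep the first `top_n`.
    order = np.argsort(coefficients[row])[::-1][:top_n]
    return np.asarray(genes)[order].tolist()

# Toy weight matrix (rows: cell types, columns: genes); values are made up.
genes = ["CPA3", "TPSAB1", "KIT", "CD3E", "MS4A1"]
cell_types = ["B cells", "Mast cells", "T cells"]
coef = np.array([
    [-0.5, -0.4, -0.2, -0.3,  1.2],
    [ 1.1,  0.9,  0.7, -0.6, -0.8],
    [-0.4, -0.3, -0.1,  1.3, -0.5],
])

mast_genes = top_markers(coef, genes, cell_types, "Mast cells")
print(mast_genes)  # ['CPA3', 'TPSAB1', 'KIT']
```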

In [None]:
# Any model can be inspected.
# Here we load the previously saved model trained from 2,000 immune cells.
model = models.Model.load(model = 'celltypist_demo_folder/model_from_immune2000.pkl')

In [None]:
model.cell_types

Extract the top three driving genes of `Mast cells` using the [extract_top_markers](https://celltypist.readthedocs.io/en/latest/celltypist.models.Model.html#celltypist.models.Model.extract_top_markers) method.

In [None]:
top_3_genes = model.extract_top_markers("Mast cells", 3)
top_3_genes

In [None]:
# Check expression of the three genes in the training set.
sc.pl.violin(adata_2000, top_3_genes, groupby = 'cell_type', rotation = 90)

In [None]:
# Check expression of the three genes in the query set.
# Here we use `majority_voting` from CellTypist as the cell type labels for this dataset.
sc.pl.violin(adata_400, top_3_genes, groupby = 'majority_voting', rotation = 90)