# Use scaLR for cell type classification

This notebook demonstrates cell type classification for scRNA-seq query data by identifying the most probable cell type labels using either built-in models or user-trained custom models.

## Install scaLR

### For a smooth run of scaLR, create a conda environment with Python 3.9

In [None]:
# -y skips the interactive confirmation prompt, which a notebook cell cannot answer
!conda create -y -n scaLR_env python=3.9
!conda activate scaLR_env
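One caveat when running these commands from a notebook: each `!` command runs in its own temporary subshell, so an environment activated with `!conda activate` does not carry over to later cells. The sketch below illustrates that behaviour with the standard `subprocess` module on a POSIX shell; the variable name `DEMO_ENV` is purely illustrative.

```python
import subprocess

def run_in_fresh_shell(command: str) -> str:
    """Run `command` in a new shell process and return its stdout, stripped."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.strip()

# The first shell sets a variable and exits; its state is then gone.
run_in_fresh_shell("export DEMO_ENV=scaLR_env")

# A second, independent shell does not see the variable.
second = run_in_fresh_shell("echo ${DEMO_ENV:-unset}")

# Setting and reading the variable inside the same shell does work.
same = run_in_fresh_shell("export DEMO_ENV=scaLR_env && echo ${DEMO_ENV:-unset}")

print(second)
print(same)
```

A common workaround is to activate the environment in a terminal before launching Jupyter, or to select a kernel installed in that environment. Prefixing a command with `conda run -n scaLR_env` is another option that may fit, depending on the setup.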

### Then install PyTorch using the command below

In [None]:
!pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118

### Last step: clone the git repository and install the required packages

In [None]:
!pip install -r requirements.txt

In [None]:
!conda activate scaLR_env

## Download a scRNA-seq dataset from CELLxGENE and explore the dataset
This data belongs to the study "Early infection response of the first trimester human placenta at single-cell scale" [https://doi.org/10.1101/2023.01.02.522155]

In [None]:
!wget https://datasets.cellxgene.cziscience.com/6882dc72-5194-4117-9763-e1f2ade4d062.h5ad

This dataset includes 158,978 cells and 36,398 genes.

In [16]:
import scanpy as sc

# Read the downloaded AnnData file
adata = sc.read_h5ad("6882dc72-5194-4117-9763-e1f2ade4d062.h5ad")
# Show the obs table, i.e. the per-cell metadata
adata.obs

Unnamed: 0,n_counts,n_genes,percent_mito,hpi,stage,phase,donor_id,MFgenotype,infection,stage_perInfection,...,celltype_annotation,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage,observation_joinid
Pla_HDBR13007974_AAACCCAAGCGTTGTT,6432.0,1904,0.019590,24h,UI_24h,G1,scDonor_Tg2_mother,Maternal,UI,UI_Tg_24h,...,F,fibroblast,10x 3' v3,normal,Homo sapiens,female,placenta,unknown,unknown,<Lg-hOinxE
Pla_HDBR13007974_AAACCCAAGTAGTCAA,49221.0,5525,0.045489,24h,UI_24h,G1,scDonor_Tg1_fetus,Fetal,UI,UI_Tg_24h,...,VCT_fusing,placental villous trophoblast,10x 3' v3,normal,Homo sapiens,unknown,placenta,unknown,Carnegie stage 22,T=ikU*Fig%
Pla_HDBR13007974_AAACCCACAATGAACA,9243.0,3032,0.045332,24h,UI_24h,G1,scDonor_Tg2_mother,Maternal,UI,UI_Tg_24h,...,HBC,Hofbauer cell,10x 3' v3,normal,Homo sapiens,female,placenta,unknown,unknown,dTRvLePpDn
Pla_HDBR13007974_AAACCCACAGAGAGGG,7753.0,2803,0.031214,24h,UI_24h,G2M,scDonor_Tg2_mother,Maternal,UI,UI_Tg_24h,...,HBC,Hofbauer cell,10x 3' v3,normal,Homo sapiens,female,placenta,unknown,unknown,>!cTlf84nF
Pla_HDBR13007974_AAACCCACAGTAGAAT,14361.0,3982,0.043799,24h,UI_24h,G1,scDonor_Tg1_fetus,Fetal,UI,UI_Tg_24h,...,HBC,Hofbauer cell,10x 3' v3,normal,Homo sapiens,unknown,placenta,unknown,Carnegie stage 22,;L-Xm-1k-(
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Pla_HDBR13661576_TTTGTTGAGAGTACCG,50722.0,7538,0.071074,48h,Lm_48h,G1,Hrv236_fetus,Fetal,Lm,Lm_48h,...,VCT,placental villous trophoblast,10x 3' v3,listeriosis,Homo sapiens,unknown,placenta,unknown,Carnegie stage 14,E2iYWNWS?7
Pla_HDBR13661576_TTTGTTGAGCATATGA,16358.0,4066,0.086808,48h,Lm_48h,S,Hrv236_fetus,Fetal,Lm,Lm_48h,...,HBC,Hofbauer cell,10x 3' v3,listeriosis,Homo sapiens,unknown,placenta,unknown,Carnegie stage 14,5_j%>-!e?=
Pla_HDBR13661576_TTTGTTGCATCCCGTT,12800.0,4219,0.018906,48h,Lm_48h,S,Hrv236_fetus,Fetal,Lm,Lm_48h,...,PV,perivascular cell,10x 3' v3,listeriosis,Homo sapiens,unknown,placenta,unknown,Carnegie stage 14,31EIW>nS6G
Pla_HDBR13661576_TTTGTTGTCAGCTGAT,21082.0,5860,0.049094,48h,Lm_48h,S,Hrv236_fetus,Fetal,Lm,Lm_48h,...,Endo_f,endothelial cell of placenta,10x 3' v3,listeriosis,Homo sapiens,unknown,placenta,unknown,Carnegie stage 14,cfs=?TLXb#


In [19]:
# Show the infection stage recorded for each cell (the display is truncated to the first and last rows)
adata.obs['stage_perInfection']

Pla_HDBR13007974_AAACCCAAGCGTTGTT    UI_Tg_24h
Pla_HDBR13007974_AAACCCAAGTAGTCAA    UI_Tg_24h
Pla_HDBR13007974_AAACCCACAATGAACA    UI_Tg_24h
Pla_HDBR13007974_AAACCCACAGAGAGGG    UI_Tg_24h
Pla_HDBR13007974_AAACCCACAGTAGAAT    UI_Tg_24h
                                       ...    
Pla_HDBR13661576_TTTGTTGAGAGTACCG       Lm_48h
Pla_HDBR13661576_TTTGTTGAGCATATGA       Lm_48h
Pla_HDBR13661576_TTTGTTGCATCCCGTT       Lm_48h
Pla_HDBR13661576_TTTGTTGTCAGCTGAT       Lm_48h
Pla_HDBR13661576_TTTGTTGTCTGGTTGA       Lm_48h
Name: stage_perInfection, Length: 158978, dtype: category
Categories (12, object): ['Lm_24h', 'Lm_48h', 'Pf_24h', 'Pf_48h', ..., 'UI_Pf_24h', 'UI_Pf_48h', 'UI_Tg_24h', 'UI_Tg_48h']
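Counting the cells per category is often the first step in summarising a categorical column like this one. The sketch below does so with the standard library on the ten labels visible in the truncated output, so the counts are illustrative only. On the full dataset, `adata.obs['stage_perInfection'].value_counts()` should give the equivalent pandas summary.

```python
from collections import Counter

# The ten labels visible in the truncated output (first five and last five cells).
visible_labels = ["UI_Tg_24h"] * 5 + ["Lm_48h"] * 5

def count_per_category(labels):
    """Return a dict mapping each category to its number of cells."""
    return dict(Counter(labels))

counts = count_per_category(visible_labels)
for category, n_cells in sorted(counts.items()):
    print(f"{category}: {n_cells}")
```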

In [3]:
adata.obs['donor_id']

NameError: name 'adata' is not defined

In [None]:
# Show the var table, i.e. the per-gene details
adata.var

### This dataset appears to consist of 158,978 samples with 7 cell types and 15 possible cell type annotations, with a total of 36,398 genes
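The relationship between the two label columns can be checked by grouping the fine-grained `celltype_annotation` values under each `cell_type`. The sketch below uses only the label pairs visible in the `adata.obs` preview, so it covers a subset of the categories. With the full table, `adata.obs.groupby('cell_type')['celltype_annotation'].unique()` is one way to get the same view.

```python
from collections import defaultdict

# (celltype_annotation, cell_type) pairs copied from the adata.obs preview.
preview_pairs = [
    ("F", "fibroblast"),
    ("VCT_fusing", "placental villous trophoblast"),
    ("HBC", "Hofbauer cell"),
    ("VCT", "placental villous trophoblast"),
    ("PV", "perivascular cell"),
    ("Endo_f", "endothelial cell of placenta"),
]

def group_annotations(pairs):
    """Map each coarse cell type to the sorted list of fine annotations seen with it."""
    grouped = defaultdict(set)
    for annotation, cell_type in pairs:
        grouped[cell_type].add(annotation)
    return {cell_type: sorted(annotations) for cell_type, annotations in grouped.items()}

grouped = group_annotations(preview_pairs)
for cell_type, annotations in sorted(grouped.items()):
    print(f"{cell_type}: {annotations}")
```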

In [None]:
# to get the list of obs columns
adata.obs.columns.tolist()

In [None]:
# to get the list of gene_id or names if available in the index
adata.var.index.tolist()

In [21]:
# Run scaLR using user-defined configs

In [1]:
!conda activate scaLR_env

/bin/bash: line 1: conda: command not found


In [2]:
!python pipeline.py --config config.yml --log

Traceback (most recent call last):
  File "/home/saurabh/gitlabclone/single_cell_classification/pipeline.py", line 4, in <module>
    from config.utils import load_config
  File "/home/saurabh/gitlabclone/single_cell_classification/config/__init__.py", line 1, in <module>
    from .utils import load_config
  File "/home/saurabh/gitlabclone/single_cell_classification/config/utils.py", line 3, in <module>
    from scalr.utils.file import read_yaml
  File "/home/saurabh/gitlabclone/single_cell_classification/scalr/__init__.py", line 1, in <module>
    from . import data
  File "/home/saurabh/gitlabclone/single_cell_classification/scalr/data/__init__.py", line 1, in <module>
    from .data_split import split_data, generate_train_val_test_split
  File "/home/saurabh/gitlabclone/single_cell_classification/scalr/data/data_split.py", line 6, in <module>
    import anndata as ad
ModuleNotFoundError: No module named 'anndata'
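
The traceback stops at a missing `anndata` import, which suggests the pipeline was launched from an interpreter where the requirements were not installed. A small pre-flight check like the sketch below can report missing packages before the pipeline starts. The list of module names is an assumption based on the imports visible in this notebook and in the traceback, not scaLR's official requirements.

```python
import importlib.util

def find_missing(module_names):
    """Return the top-level modules that the current interpreter cannot locate."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Assumed list, taken from imports visible in this notebook and the traceback.
required = ["anndata", "scanpy", "torch"]

missing = find_missing(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules are importable.")
```

If the check reports missing modules, the notebook kernel is probably not running inside `scaLR_env`.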


In [None]:
import celltypist
from celltypist import models

# Not run; predict cell identities using this loaded model.
#predictions = celltypist.annotate(adata_2000, model = model, majority_voting = True)
# Alternatively, just specify the model name (recommended as this ensures the model is intact every time it is loaded).
predictions = celltypist.annotate(adata_2000, model = 'Immune_All_Low.pkl', majority_voting = True)

By default (`majority_voting = False`), CellTypist will infer the identity of each query cell independently. This leads to raw predicted cell type labels, and usually finishes within seconds or minutes depending on the size of the query data. You can also turn on the majority-voting classifier (`majority_voting = True`), which refines cell identities within local subclusters after an over-clustering approach at the cost of increased runtime.
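
The majority-voting idea can be sketched in a few lines: within each over-clustered subcluster, every cell is reassigned the label that is most frequent among that subcluster's raw predictions. This is a simplified illustration with made-up cells and cluster ids, not CellTypist's actual implementation, which includes further details not shown here.

```python
from collections import Counter, defaultdict

def majority_vote(raw_labels, cluster_ids):
    """Return one label per cell: the most common raw label within its subcluster."""
    labels_by_cluster = defaultdict(list)
    for label, cluster in zip(raw_labels, cluster_ids):
        labels_by_cluster[cluster].append(label)
    # most_common(1) returns [(label, count)] for the dominant label of a cluster.
    dominant = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }
    return [dominant[cluster] for cluster in cluster_ids]

# Hypothetical raw predictions for six cells in two subclusters.
raw_labels = ["HBC", "HBC", "VCT", "VCT", "VCT", "HBC"]
cluster_ids = [0, 0, 0, 1, 1, 1]

voted = majority_vote(raw_labels, cluster_ids)
print(voted)
```

In this toy input, the single outlier in each subcluster is overruled by its two neighbours, which is the smoothing effect majority voting is meant to provide.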

The results include the predicted cell type labels (`predicted_labels`), the over-clustering result (`over_clustering`), and the predicted labels after majority voting in local subclusters (`majority_voting`). Note that in `predicted_labels`, each query cell gets its inferred label by choosing the most probable cell type among all possible cell types in the given model.
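
Choosing the most probable cell type for a cell amounts to taking the arg-max over that cell's per-type scores. A minimal sketch with hypothetical probabilities (the numbers and the selection of cell type names are illustrative only):

```python
def most_probable_label(probabilities):
    """Return the cell type with the highest probability for a single cell."""
    return max(probabilities, key=probabilities.get)

# Hypothetical per-cell probabilities over three cell types.
cells = {
    "cell_1": {"fibroblast": 0.10, "Hofbauer cell": 0.85, "perivascular cell": 0.05},
    "cell_2": {"fibroblast": 0.70, "Hofbauer cell": 0.20, "perivascular cell": 0.10},
}

predicted = {cell: most_probable_label(probs) for cell, probs in cells.items()}
print(predicted)
```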

In [None]:
predictions.predicted_labels

Transform the prediction result into an `AnnData`.

In [None]:
# Get an `AnnData` with predicted labels embedded into the cell metadata columns.
adata = predictions.to_adata()

Compared to `adata_2000`, the new `adata` has additional prediction information in `adata.obs` (`predicted_labels`, `over_clustering`, `majority_voting` and `conf_score`). Of note, all these columns can be prefixed with a specific string by setting `prefix` in [to_adata](https://celltypist.readthedocs.io/en/latest/celltypist.classifier.AnnotationResult.html#celltypist.classifier.AnnotationResult.to_adata).

In [None]:
adata.obs

In addition to this added metadata, the neighborhood graph constructed during over-clustering is also stored in `adata` (if a pre-calculated neighborhood graph is already present in the `AnnData`, the graph construction step is skipped).
This graph can be used to derive cell embeddings, such as UMAP coordinates.

In [None]:
# If the UMAP or any cell embeddings are already available in the `AnnData`, skip this command.
sc.tl.umap(adata)

Visualise the prediction results.

In [None]:
sc.pl.umap(adata, color = ['cell_type', 'predicted_labels', 'majority_voting'], legend_loc = 'on data')

Actually, you may not need to explicitly convert `predictions` output by `celltypist.annotate` into an `AnnData` as above. A more useful way is to use the visualisation function [celltypist.dotplot](https://celltypist.readthedocs.io/en/latest/celltypist.dotplot.html), which quantitatively compares the CellTypist prediction result (e.g. `majority_voting` here) with the cell types pre-defined in the `AnnData` (here `cell_type`). You can also change the value of `use_as_prediction` to `predicted_labels` to compare the raw prediction result with the pre-defined cell types.

In [None]:
celltypist.dotplot(predictions, use_as_reference = 'cell_type', use_as_prediction = 'majority_voting')

For each pre-defined cell type (each column from the dot plot), this plot shows how it can be 'decomposed' into different cell types predicted by CellTypist (rows).
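
The quantity behind such a plot can be sketched as a table of fractions: for each pre-defined cell type, the share of its cells assigned to each predicted label. The example below computes this with the standard library on hypothetical labels. It illustrates the idea only and is not how `celltypist.dotplot` is implemented.

```python
from collections import Counter, defaultdict

def decompose(reference_labels, predicted_labels):
    """For each reference cell type, return the fraction of its cells per predicted label."""
    counts = defaultdict(Counter)
    for reference, predicted in zip(reference_labels, predicted_labels):
        counts[reference][predicted] += 1
    return {
        reference: {label: n / sum(per_label.values()) for label, n in per_label.items()}
        for reference, per_label in counts.items()
    }

# Hypothetical labels for six cells.
reference_labels = ["T cell", "T cell", "T cell", "T cell", "B cell", "B cell"]
predicted_labels = ["T cell", "T cell", "T cell", "NK cell", "B cell", "B cell"]

fractions = decompose(reference_labels, predicted_labels)
for reference, per_label in fractions.items():
    print(reference, per_label)
```

A reference cell type whose fractions concentrate on a single predicted label indicates good agreement, while fractions spread over several labels point to ambiguity or finer substructure.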

## Assign cell type labels using a custom model
In this section, we show the procedure of generating a custom model and transferring labels from the model to the query data.

Use the previously downloaded dataset of 2,000 immune cells as the training set.

In [None]:
adata_2000 = sc.read('celltypist_demo_folder/demo_2000_cells.h5ad', backup_url = 'https://celltypist.cog.sanger.ac.uk/Notebook_demo_data/demo_2000_cells.h5ad')

Download another scRNA-seq dataset of 400 immune cells as a query.

In [None]:
adata_400 = sc.read('celltypist_demo_folder/demo_400_cells.h5ad', backup_url = 'https://celltypist.cog.sanger.ac.uk/Notebook_demo_data/demo_400_cells.h5ad')

Derive a custom model by training the data using the [celltypist.train](https://celltypist.readthedocs.io/en/latest/celltypist.train.html) function.

In [None]:
# The `cell_type` in `adata_2000.obs` will be used as cell type labels for training.
new_model = celltypist.train(adata_2000, labels = 'cell_type', n_jobs = 10, feature_selection = True)

Refer to the function [celltypist.train](https://celltypist.readthedocs.io/en/latest/celltypist.train.html) for what each parameter means, and to the [usage](https://github.com/Teichlab/celltypist#usage) for details of model training.

This custom model can be manipulated as with other CellTypist built-in models. First, save this model locally.

In [None]:
# Save the model.
new_model.write('celltypist_demo_folder/model_from_immune2000.pkl')

You can load this model with `models.Model.load`.

In [None]:
new_model = models.Model.load('celltypist_demo_folder/model_from_immune2000.pkl')

Next, we use this model to predict the query dataset of 400 immune cells.

In [None]:
# Not run; predict the identity of each input cell with the new model.
#predictions = celltypist.annotate(adata_400, model = new_model, majority_voting = True)
# Alternatively, just specify the model path (recommended as this ensures the model is intact every time it is loaded).
predictions = celltypist.annotate(adata_400, model = 'celltypist_demo_folder/model_from_immune2000.pkl', majority_voting = True)

In [None]:
adata = predictions.to_adata()

In [None]:
sc.tl.umap(adata)

In [None]:
sc.pl.umap(adata, color = ['cell_type', 'predicted_labels', 'majority_voting'], legend_loc = 'on data')

In [None]:
celltypist.dotplot(predictions, use_as_reference = 'cell_type', use_as_prediction = 'majority_voting')

## Examine expression of cell type-driving genes

Each model can be examined in terms of the driving genes for each cell type. Note that these genes depend only on the model, that is, on the training dataset.

In [None]:
# Any model can be inspected.
# Here we load the previously saved model trained from 2,000 immune cells.
model = models.Model.load(model = 'celltypist_demo_folder/model_from_immune2000.pkl')

In [None]:
model.cell_types

Extract the top three driving genes of `Mast cells` using the [extract_top_markers](https://celltypist.readthedocs.io/en/latest/celltypist.models.Model.html#celltypist.models.Model.extract_top_markers) method.

In [None]:
top_3_genes = model.extract_top_markers("Mast cells", 3)
top_3_genes
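
Conceptually, for a linear classifier the driving genes of a cell type are the genes with the largest coefficients for that class. The sketch below ranks genes this way on a hypothetical coefficient table. The gene names and values are illustrative, and the real `extract_top_markers` method may differ in detail.

```python
def top_genes(coefficients, n):
    """Return the `n` genes with the largest coefficients, highest first."""
    ranked = sorted(coefficients, key=coefficients.get, reverse=True)
    return ranked[:n]

# Hypothetical per-gene coefficients for one cell type.
mast_cell_coefficients = {
    "GENE_A": 0.42,
    "GENE_B": 1.30,
    "GENE_C": -0.75,
    "GENE_D": 0.98,
    "GENE_E": 0.05,
}

top_3 = top_genes(mast_cell_coefficients, 3)
print(top_3)
```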

In [None]:
# Check expression of the three genes in the training set.
sc.pl.violin(adata_2000, top_3_genes, groupby = 'cell_type', rotation = 90)

In [None]:
# Check expression of the three genes in the query set.
# Here we use `majority_voting` from CellTypist as the cell type labels for this dataset.
sc.pl.violin(adata_400, top_3_genes, groupby = 'majority_voting', rotation = 90)