# How to Train a Spatial CellTypist Model using scRNAseq Data

#### **Prerequisites**: 
- A virtual environment (e.g., a conda environment) with CellTypist from the Teichlab installed (see [the GitHub repository](https://github.com/Teichlab/celltypist))
- A preprocessed, cell-type-annotated single-cell dataset saved in AnnData format

For this workflow, I will use the **Lung** dataset (available [here]()) from the Segger paper (Heidari, Moorman et al. [*bioRxiv*](https://www.biorxiv.org/content/10.1101/2025.03.14.643160v1) 2025, [GitHub](https://github.com/dpeerlab/segger-analysis/)).

In [1]:
import celltypist as ct
import ast
from matplotlib import pyplot as plt
from pathlib import Path
import seaborn as sns
from tqdm import tqdm
import scanpy as sc
import pandas as pd
import numpy as np
import scipy as sp
import warnings
import json
import sys
import os

## Training a CellTypist Model on scRNAseq Data
For a more in-depth explanation of CellTypist and model training, refer to this [SAIL GitHub](https://github.com/joadams1/celltypist/blob/main/celltypist/How%20To%20Train%20a%20CellTypist%20Model.ipynb).

In some cases, it is appropriate to train a CellTypist model for spatial data on an annotated scRNA-seq dataset. Before doing so, you need to make the scRNA-seq data resemble Xenium data. Most importantly, Xenium captures far fewer counts per cell than scRNA-seq, so to make the two comparable you need to downsample the scRNA-seq counts to Xenium-level depth.
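
To make the idea of downsampling concrete, here is a minimal pure-Python sketch of drawing a fixed number of counts from one cell without replacement. The helper `downsample_counts`, the gene names, and the count values are hypothetical and only illustrate the principle; the workflow itself relies on `sc.pp.downsample_counts`, which may differ in implementation details.

```python
import random
from collections import Counter

def downsample_counts(counts, target_total, seed=0):
    """Subsample a gene -> count mapping to at most `target_total` counts."""
    total = sum(counts.values())
    if total <= target_total:
        # Cells already at or below the target depth are left unchanged.
        return dict(counts)
    # Expand to one entry per observed molecule, then draw without replacement.
    molecules = [gene for gene, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    tally = Counter(rng.sample(molecules, target_total))
    return {gene: tally.get(gene, 0) for gene in counts}

# Hypothetical cell with 1,000 counts spread over three genes.
cell = {"EPCAM": 400, "PTPRC": 50, "COL1A1": 550}
down = downsample_counts(cell, target_total=100)
print(down, sum(down.values()))
```

Each gene keeps roughly its original share of the total, but low-abundance genes can drop to zero, which is the sparsity a model trained for Xenium data needs to tolerate.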

In [None]:
# NSCLC Atlas
filepath_ad = 'data_spatial/core_nsclc_atlas_panel_only.h5ad'
ad_atlas = sc.read_h5ad(filepath_ad)

# Downsample raw counts to 100 per cell, then normalize each cell to 100 total counts
ad_atlas.X = ad_atlas.layers['count'].copy()
sc.pp.downsample_counts(ad_atlas, counts_per_cell=100)
ad_atlas.layers['norm_100'] = ad_atlas.X.copy()
sc.pp.normalize_total(ad_atlas, layer='norm_100', target_sum=1e2)

# Log-transform (log1p) the normalized counts
ad_atlas.layers['lognorm_100'] = ad_atlas.layers['norm_100'].copy()
if 'log1p' in ad_atlas.uns:
    del ad_atlas.uns['log1p']
sc.pp.log1p(ad_atlas, layer='lognorm_100')
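
As a small worked example of what the normalization and log steps do to a single cell, the sketch below uses plain Python lists. The helper names and count values are hypothetical; the sketch only mirrors the arithmetic (scale each cell to a fixed total, then apply `log(1 + x)`) and is not scanpy's implementation.

```python
import math

def normalize_total(counts, target_sum):
    """Scale one cell's counts so that they sum to `target_sum`."""
    total = sum(counts)
    if total == 0:
        # Assumption for this sketch: an empty cell stays all zeros.
        return [0.0 for _ in counts]
    return [c * target_sum / total for c in counts]

def log1p_transform(values):
    """Apply log(1 + x) to every value."""
    return [math.log1p(v) for v in values]

cell_a = [5, 0, 15, 30]     # 50 counts in total
cell_b = [20, 0, 60, 120]   # 200 counts, same proportions as cell_a

norm_a = normalize_total(cell_a, target_sum=100)
norm_b = normalize_total(cell_b, target_sum=100)
lognorm_a = log1p_transform(norm_a)

print(norm_a)      # both cells map to the same normalized profile
print(lognorm_a)
```

After normalization the two cells are identical even though their raw depths differ by a factor of four, and zeros remain zeros after the log transform.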

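To balance the cell types and avoid losing rare ones during model training, you can subsample your data to at most 2,000 cells per cell type. Sampling with replacement and then dropping duplicates keeps cell types that have fewer than 2,000 cells, which would otherwise raise an error.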

In [None]:
gb = ad_atlas.obs.groupby('cell_type')
sample = gb.sample(2000, replace=True).index.drop_duplicates()
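
The effect of sampling with replacement and then dropping duplicates can be seen in a toy example. The sketch below reimplements the idea in pure Python; the function `balanced_sample`, the cell IDs, and the cell-type labels are hypothetical, and it is meant as an illustration rather than a replacement for the pandas call.

```python
import random

def balanced_sample(labels, n_per_type, seed=0):
    """Draw `n_per_type` cells per type with replacement and keep unique IDs.

    `labels` maps cell ID -> cell type.
    """
    rng = random.Random(seed)
    by_type = {}
    for cell_id, cell_type in labels.items():
        by_type.setdefault(cell_type, []).append(cell_id)
    picked = set()
    for ids in by_type.values():
        # Drawing with replacement never fails, even when a type has
        # fewer than n_per_type cells; the set removes repeated draws.
        picked.update(rng.choices(ids, k=n_per_type))
    return sorted(picked)

# Hypothetical labels: one abundant type and one rare type.
labels = {f"cell{i}": "T cell" for i in range(1000)}
labels.update({f"rare{i}": "Mast cell" for i in range(5)})

sample_ids = balanced_sample(labels, n_per_type=50)
n_t = sum(labels[c] == "T cell" for c in sample_ids)
n_mast = sum(labels[c] == "Mast cell" for c in sample_ids)
print(n_t, n_mast)
```

Abundant types are capped at `n_per_type` cells (usually slightly fewer, since repeated draws collapse), while rare types contribute at most the cells they actually have, so the result is more balanced but not perfectly equal across types.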

Now that you have adjusted the scRNA-seq data, you can train a CellTypist Model. 

In [10]:
# Train on the log-normalized, downsampled counts
ad_atlas.X = ad_atlas.layers['lognorm_100']

ct_model = ct.train(
    ad_atlas[sample],
    labels='cell_compartment',
    check_expression=False,
    n_jobs=32,
    max_iter=100,
)

filepath_ct = 'models/nsclc_celltypist_model.pkl'  # replace with your path
ct_model.write(filepath_ct)

🍳 Preparing data before training
🔬 Input data has 23569 cells and 425 genes
⚖️ Scaling input data
🏋️ Training data using logistic regression
✅ Model training done!


Now you can use this model to annotate other datasets. For a walkthrough of how to do so, see the [`Using Xenium CellTypist Models`](https://github.com/joadams1/spatial_celltypist/blob/main/notebooks/Using%20Xenium%20CellTypist%20Model.ipynb) notebook.