# Cell2Sen Tutorial

This tutorial shows you how to process your data, generate embeddings, and generate perturbed gene profiles using the Cell2Sen model from the helical repo.

The model is implemented using the open-source weights available on Hugging Face and the Gemma 2 2B model.

In [1]:
from helical.utils.downloader import Downloader
from pathlib import Path

downloader = Downloader()
downloader.download_via_link(
    Path("yolksac_human.h5ad"),
    "https://huggingface.co/datasets/helical-ai/yolksac_human/resolve/main/data/17_04_24_YolkSacRaw_F158_WE_annots.h5ad?download=true",)

2025-11-21 18:04:22,970 - INFO:datasets:PyTorch version 2.7.0 available.
2025-11-21 18:04:22,986 - INFO:datasets:Polars version 1.33.0 available.


# Process the dataset

In [2]:
import anndata as ad

adata = ad.read_h5ad("./yolksac_human.h5ad")
# We subset to 10 cells and 200 genes
n_cells = 10
n_genes = 200
adata = adata[:n_cells, :n_genes].copy()

# We can specify the perturbation for each cell in the anndata here, or pass a list later to get_perturbations
perturbation_column = "perturbation"
adata.obs[perturbation_column] = ["IFNg"] * n_cells

print(adata.shape)
n_cells = adata.n_obs
print(n_cells)

(10, 200)
10


To generate cell sentences, we first import the configuration and then instantiate the model.

In [None]:
from helical.models.c2s import Cell2Sen
from helical.models.c2s import Cell2SenConfig

# when calling the model class both the model and weights are downloaded - we can choose the model size ("2B" vs "27B" Gemma model)
# if you would like to use 4-bit quantization for reduced memory usage, set use_quantization=True in the config
# on GPU devices, you can also use flash attention 2 by setting use_flash_attn=True in the config
# provide max_genes to only select the top genes in the ranked list
# See the config file for more details

config = Cell2SenConfig(batch_size=8, perturbation_column=perturbation_column, model_size="2B", use_quantization=True)
cell2sen_model = Cell2Sen(configurer=config)

2025-11-21 18:04:26,620 - INFO:helical.models.c2s.model:Using SDPA for attention implementation - default for CPU
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.40s/it]
2025-11-21 18:04:34,613 - INFO:helical.models.c2s.model:Successfully loaded model


Now we can call `process_data()`, which returns a Hugging Face dataset. The processed dataset includes `fit_parameters` (when `return_fit=True` in `Cell2SenConfig`), which capture a linear relationship between log gene rank and expression values, fitted over the non-zero expression region as `expr_value = slope * log_rank + intercept`, where `log_rank = log10(rank + 1)`. In the output below `fit_parameters` is `None` because `return_fit` was not set in the config. In the config above, you can also specify `max_genes` to keep only the top genes in the ranked list.

In [4]:
processes_dataset = cell2sen_model.process_data(adata)
print(processes_dataset[0])

2025-11-21 18:04:34,617 - INFO:helical.models.c2s.model:Processing data

Processing cells: 100%|██████████| 10/10 [00:00<00:00, 55553.70it/s]
2025-11-21 18:04:34,728 - INFO:helical.models.c2s.model:Successfully processed data


{'cell_sentence': 'ABCC3 ABCA1 ABHD14B ABCC6 ABCC4 A2M ABCC2 ABHD11 ABHD5 ABL2 ABLIM1 AC002377.1 ABT1 ABCB10 AARSD1', 'fit_parameters': None, 'organism': 'unknown', 'perturbations': 'IFNg'}
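
To make the ranking and the linear fit described above more concrete, here is a minimal NumPy sketch. It is not the helical implementation: the helper names `to_cell_sentence` and `fit_rank_to_expression`, the toy gene names and values, the 1-based ranks, and the choice to drop zero-expression genes are all assumptions made for illustration.

```python
import numpy as np


def to_cell_sentence(expression, gene_names, max_genes=None):
    """Rank genes by descending expression and join the expressed ones into a sentence."""
    expression = np.asarray(expression, dtype=float)
    order = np.argsort(-expression, kind="stable")  # highest expression first
    order = order[expression[order] > 0]            # assumption: keep non-zero genes only
    if max_genes is not None:
        order = order[:max_genes]                   # keep only the top-ranked genes
    sentence = " ".join(gene_names[i] for i in order)
    return sentence, expression[order]


def fit_rank_to_expression(ranked_values):
    """Least-squares fit of expr_value = slope * log10(rank + 1) + intercept."""
    ranks = np.arange(1, len(ranked_values) + 1)    # assumption: ranks start at 1
    log_rank = np.log10(ranks + 1)
    slope, intercept = np.polyfit(log_rank, ranked_values, deg=1)
    return slope, intercept


genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]    # hypothetical gene names
values = [0.0, 5.0, 1.0, 3.0]                       # hypothetical expression values

sentence, ranked = to_cell_sentence(values, genes)
slope, intercept = fit_rank_to_expression(ranked)
print(sentence)
print(slope, intercept)
```

Because expression decreases as rank increases, the fitted slope is negative. A fit of this kind is what would let expression values be approximated again from the gene order of a cell sentence.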


# Embeddings

The processed dataset can be used to generate cell embeddings for expression profiles in the adata. 

In [5]:
# set output_attentions=True to get the attention maps - this will return attentions for each layer in the model per head

embeddings = cell2sen_model.get_embeddings(processes_dataset)

# embeddings, attentions = cell2sen_model.get_embeddings(processes_dataset, output_attentions=True)

print(embeddings.shape)


2025-11-21 18:04:34,732 - INFO:helical.models.c2s.model:Extracting embeddings from dataset
Processing embeddings: 100%|██████████| 10/10 [00:23<00:00,  2.35s/it]
2025-11-21 18:04:58,229 - INFO:helical.models.c2s.model:Successfully extracted embeddings


(10, 2304)
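
One common way to use embeddings of this shape is to compare cells with each other. The sketch below computes a cell-by-cell cosine similarity matrix with NumPy. It uses random numbers as a stand-in for the real embeddings and assumes that `get_embeddings` returns something `np.asarray` can convert to a 2D array; `cosine_similarity_matrix` is a hypothetical helper, not part of helical.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine similarity between the rows of a 2D array."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero-length vectors
    return unit @ unit.T


# Stand-in for the (10, 2304) embeddings printed above; replace with the real array.
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(10, 2304))

similarity = cosine_similarity_matrix(demo_embeddings)
print(similarity.shape)
```

Each entry `similarity[i, j]` lies between -1 and 1, and the diagonal is 1 because every cell is identical to itself.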


# Perturbations

By providing perturbation labels we can also generate perturbed gene profiles. If no perturbation list is provided, the perturbation column of the anndata (set via `perturbation_column` in the config) is used.

In [None]:
# Using the anndata perturbation column defined in the config
perturbed_dataset, perturbed_cell_sentences = cell2sen_model.get_perturbations(processes_dataset)
print(perturbed_cell_sentences[0])

2025-11-21 18:04:58,240 - INFO:helical.models.c2s.model:Generating perturbed cell sentences
Processing valid perturbations:   0%|          | 0/10 [00:00<?, ?it/s]

In [None]:
# Providing a list of perturbations will override the anndata perturbation column - make sure the list is the same length as the dataset
perturbed_dataset, perturbed_cell_sentences = cell2sen_model.get_perturbations(processes_dataset, perturbations_list=["IFNg"] * n_cells)
print(perturbed_cell_sentences[0])
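
Since a cell sentence is a space-separated list of genes in rank order, one simple way to inspect a perturbation is to compare the original and perturbed sentences gene by gene. The following is a sketch in plain Python; `compare_cell_sentences` is a hypothetical helper and both example sentences are made up for illustration, not model output.

```python
def first_ranks(genes):
    """Map each gene to the position of its first occurrence in the sentence."""
    ranks = {}
    for position, gene in enumerate(genes):
        ranks.setdefault(gene, position)  # ignore repeated genes in generated text
    return ranks


def compare_cell_sentences(original, perturbed):
    """Return genes gained, genes lost, and rank shifts for genes present in both."""
    orig_rank = first_ranks(original.split())
    pert_rank = first_ranks(perturbed.split())
    gained = [g for g in pert_rank if g not in orig_rank]
    lost = [g for g in orig_rank if g not in pert_rank]
    # Negative shift: the gene moved up the ranking after the perturbation.
    shift = {g: pert_rank[g] - orig_rank[g] for g in orig_rank if g in pert_rank}
    return gained, lost, shift


original_sentence = "ABCC3 ABCA1 A2M"   # hypothetical original sentence
perturbed_sentence = "A2M ABCC3 STAT1"  # hypothetical perturbed sentence

gained, lost, shift = compare_cell_sentences(original_sentence, perturbed_sentence)
print(gained, lost, shift)
```

With the real outputs, the original sentence would come from `processes_dataset[i]["cell_sentence"]` and the perturbed one from `perturbed_cell_sentences[i]`, assuming both are plain strings as in the printed examples.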