# Cell2Sen Tutorial

This tutorial shows you how to process your data and generate embeddings and perturbed gene profiles using the Cell2Sen models from the helical repo.

The model is implemented using the open-source weights available on Hugging Face and the Gemma 2-2b model.

In [1]:
from helical.utils.downloader import Downloader
from pathlib import Path

downloader = Downloader()
downloader.download_via_link(
    Path("yolksac_human.h5ad"),
    "https://huggingface.co/datasets/helical-ai/yolksac_human/resolve/main/data/17_04_24_YolkSacRaw_F158_WE_annots.h5ad?download=true",
)

2025-11-26 15:35:50,266 - INFO:datasets:PyTorch version 2.7.0 available.
2025-11-26 15:35:50,267 - INFO:datasets:Polars version 0.20.31 available.
2025-11-26 15:35:50,583 - INFO:helical.utils.downloader:Starting to download: 'https://huggingface.co/datasets/helical-ai/yolksac_human/resolve/main/data/17_04_24_YolkSacRaw_F158_WE_annots.h5ad?download=true'
yolksac_human.h5ad: 100%|██████████| 553M/553M [00:04<00:00, 116MB/s]  


# Process the dataset

In [2]:
import anndata as ad

adata = ad.read_h5ad("./yolksac_human.h5ad")
# We subset to the first 10 cells (all genes are kept)
n_cells = 10
adata = adata[:n_cells].copy()

# We can specify the perturbation for each cell in the anndata here, or pass them later to get_perturbations
perturbation_column = "perturbation"
adata.obs[perturbation_column] = ["IFNg"] * n_cells

print(adata.shape)
n_cells = adata.n_obs
print(n_cells)

(10, 37318)
10


To generate cell sentences, we first import the configuration and then instantiate the model.

In [None]:
from helical.models.c2s import Cell2Sen
from helical.models.c2s import Cell2SenConfig
import torch

# when calling the model class both the model and weights are downloaded - we can choose the model size ("2B" vs "27B" Gemma model)
# if you would like to use 4-bit quantization for reduced memory usage, set use_quantization=True in the config
# on GPU devices, you can also use flash attention 2 by setting use_flash_attn=True in the config
# provide max_genes to only select the top genes in the ranked list and save computation time
# See the config file for more details


# You can provide a custom prompt to the model, depending on your specific task. Below is an example of how to structure
# such a prompt (this is also the default prompt used if you do not pass anything). Test your prompt and evaluate the
# results before relying on them: the model was trained on a limited set of prompt types and may not respond to yours as expected.
custom_prompt = """
        You are given a list of genes in descending order of expression levels in a {organism} cell. \n
        Genes: {cell_sentence} \n
        Using this information, predict the cell type. Answer:  
    """

config = Cell2SenConfig(
    batch_size=8, 
    perturbation_column=perturbation_column, 
    model_size="2B", 
    device="cuda" if torch.cuda.is_available() else "cpu",
    use_quantization=True,
    max_genes=50,
    aggregation_type="mean_pool",
    embedding_prompt_template=custom_prompt)

cell2sen_model = Cell2Sen(configurer=config)

2025-11-26 15:36:18,575 - INFO:helical.models.c2s.model:Using SDPA for attention implementation - default for CPU
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.55s/it]
2025-11-26 15:36:25,397 - INFO:helical.models.c2s.model:Successfully loaded model
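
The `{organism}` and `{cell_sentence}` placeholders in the prompt template are filled in for each cell. The following is a minimal sketch of that substitution, assuming plain `str.format` is used (an assumption; the library may do it differently). The `fill_prompt` helper and the toy values are hypothetical and not part of helical.

```python
# Hypothetical helper: fill the two placeholders used by the prompt template.
# Assumes plain str.format substitution, which may differ from the library.
def fill_prompt(template: str, cell_sentence: str, organism: str = "unknown") -> str:
    return template.format(organism=organism, cell_sentence=cell_sentence)


template = (
    "You are given a list of genes in descending order of expression "
    "levels in a {organism} cell.\n"
    "Genes: {cell_sentence}\n"
    "Using this information, predict the cell type. Answer:"
)

# Toy cell sentence with three genes, for illustration only.
prompt = fill_prompt(template, "MALAT1 MEG3 TTR", organism="human")
print(prompt)
```

Printing one filled prompt like this is a quick way to check a custom template for stray whitespace or a misspelled placeholder before running the model.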


Now we can call `process_data()`, which returns a Hugging Face dataset. When `return_fit=True` is set in `Cell2SenConfig`, the processed dataset includes `fit_parameters`, which capture a linear relationship between log gene rank and expression value, fitted over the non-zero expression region as `expr_value = slope * log_rank + intercept`, where `log_rank = log10(rank + 1)`. The config above does not set `return_fit`, so `fit_parameters` is `None` in the output below. In the config, you can also set `max_genes` to keep only the top genes in the ranked list.
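
To make the rank-expression fit concrete, here is a minimal pure-Python sketch of an ordinary least-squares fit of `expr_value = slope * log10(rank + 1) + intercept` over the non-zero values. It is not the library's implementation: the `fit_rank_expression` helper is hypothetical, and the zero-based rank and the toy expression values are assumptions of this sketch.

```python
import math


def fit_rank_expression(expression):
    """Least-squares fit of expr_value = slope * log10(rank + 1) + intercept.

    Only non-zero values are used; rank 0 is the most expressed gene
    (the zero-based rank is an assumption of this sketch).
    """
    ranked = sorted((v for v in expression if v > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two non-zero expression values")
    xs = [math.log10(rank + 1) for rank in range(len(ranked))]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ranked) / len(ranked)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ranked))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Toy expression vector (made-up values); zeros are ignored by the fit.
slope, intercept = fit_rank_expression([0.0, 5.0, 3.0, 0.0, 2.0, 1.0])
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Because expression decreases as rank increases, the fitted slope is negative. With the two parameters, an approximate expression value can be recovered from a gene's position in the cell sentence.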

In [8]:
processes_dataset = cell2sen_model.process_data(adata)
print(processes_dataset[0])
print(f'#Genes in sample: {len(processes_dataset[0]["cell_sentence"].split(" "))}')

2025-11-26 15:38:31,931 - INFO:helical.models.c2s.model:Processing data
Processing cells: 100%|██████████| 10/10 [00:00<00:00, 3826.22it/s]
2025-11-26 15:38:31,951 - INFO:helical.models.c2s.model:Successfully processed data


{'cell_sentence': 'MALAT1 MEG3 TTR AFP APOA1 MT-CO1 ALB APOE MT-CO2 KLF6 APOC3 VTN APOA2 NEAT1 TF MTRNR2L12 SERPINA1 MT-CYB MEG8 MT-ATP6 MT1G MT-CO3 APOB MT-ND6 SAT1 APOC2 MT-ND4 RPS15 CST3 APOM RPS18 H3F3B MT1H GPC3 JUND FADS1 TIMP3 APOC1 MT-ND4L FLVCR1 EEF1A1 RPL41 RPL37A MT-ND1 JUN H19 PHACTR2 RPL13 UBC REEP6', 'fit_parameters': None, 'organism': 'unknown', 'perturbations': 'IFNg'}
#Genes in sample: 50
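
As the output shows, a cell sentence is a space-separated list of gene names ordered by descending expression and truncated to `max_genes`. A minimal sketch of that construction follows. The `make_cell_sentence` helper is hypothetical, the counts are made up, and dropping zero-expression genes and the tie-breaking rule are assumptions rather than the library's documented behaviour.

```python
def make_cell_sentence(gene_names, expression, max_genes=None):
    """Rank genes by descending expression and join their names with spaces."""
    # Keep only expressed genes (dropping zeros is an assumption of this sketch).
    expressed = [
        (name, value) for name, value in zip(gene_names, expression) if value > 0
    ]
    # sorted() is stable, so tied genes keep their original order.
    ranked = sorted(expressed, key=lambda pair: pair[1], reverse=True)
    if max_genes is not None:
        ranked = ranked[:max_genes]
    return " ".join(name for name, _ in ranked)


# Toy example with made-up counts for five genes.
genes = ["ALB", "MALAT1", "TTR", "AFP", "GATA1"]
counts = [12.0, 250.0, 40.0, 40.0, 0.0]
sentence = make_cell_sentence(genes, counts, max_genes=3)
print(sentence)  # MALAT1 TTR AFP
```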


# Embeddings

The processed dataset can be used to generate one embedding per cell for the expression profiles in `adata`.

In [None]:
# set output_attentions=True to get the attention maps - this will return attentions for each layer in the model per head

embeddings = cell2sen_model.get_embeddings(processes_dataset)

# embeddings, attentions, gene_order = cell2sen_model.get_embeddings(processes_dataset, output_attentions=True)

print(embeddings.shape)


2025-11-26 15:38:39,235 - INFO:helical.models.c2s.model:Extracting embeddings from dataset
Processing embeddings: 100%|██████████| 10/10 [00:00<00:00, 21.55it/s]
2025-11-26 15:38:39,715 - INFO:helical.models.c2s.model:Successfully extracted embeddings


(10, 2304)
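
The result holds one 2304-dimensional vector per cell. With `aggregation_type="mean_pool"`, the per-token hidden states are presumably averaged into that single vector. The sketch below shows mean pooling over non-padding tokens in plain Python. It illustrates the general technique under that assumption and is not the library's code; `mean_pool` and the toy values are hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: 3 tokens with 2-dimensional hidden states; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Masking matters here: without it, the padding vector would dominate the average.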


# Perturbations

By providing perturbation labels, we can also generate perturbed gene profiles. If no perturbation list is passed, the labels stored in the anndata column named by `perturbation_column` in the config are used.

In [None]:
# Using the anndata perturbation column defined in the config
perturbed_dataset, perturbed_cell_sentences = cell2sen_model.get_perturbations(processes_dataset)
print(perturbed_cell_sentences[0])

In [None]:
# Providing a list of perturbations will override the anndata perturbation column - make sure the list is the same length as the dataset
perturbed_dataset, perturbed_cell_sentences = cell2sen_model.get_perturbations(processes_dataset, perturbations_list=["IFNg"] * n_cells)
print(perturbed_cell_sentences[0])
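
Since the list must contain exactly one label per cell, it can help to build and check it before calling the model. The sketch below uses a hypothetical `build_perturbation_list` helper that is not part of helical. The `"control"` label is only a placeholder and has not been checked against the labels the model was trained on.

```python
def build_perturbation_list(n_cells, default, overrides=None):
    """Return one perturbation label per cell, in dataset order.

    `overrides` maps a cell index to a label that replaces `default`.
    """
    labels = [default] * n_cells
    for index, label in (overrides or {}).items():
        if not 0 <= index < n_cells:
            raise IndexError(f"cell index {index} is outside 0..{n_cells - 1}")
        labels[index] = label
    return labels


n_cells = 10
# "control" is a placeholder label used only for illustration.
perturbations = build_perturbation_list(n_cells, "IFNg", overrides={0: "control"})
assert len(perturbations) == n_cells  # one label per cell
print(perturbations[:3])  # ['control', 'IFNg', 'IFNg']
```

The resulting list can then be passed as `perturbations_list=perturbations` to `get_perturbations`, as in the cell above.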