# Cell2Sen Tutorial

This tutorial shows you how to process your data, generate embeddings, and generate perturbed gene profiles using the Cell2Sen model from the helical repo.

The model is implemented using the open-source weights available on Hugging Face and the Gemma 2-2b model.

In [1]:
from helical.utils.downloader import Downloader
from pathlib import Path

downloader = Downloader()
downloader.download_via_link(
    Path("yolksac_human.h5ad"),
    "https://huggingface.co/datasets/helical-ai/yolksac_human/resolve/main/data/17_04_24_YolkSacRaw_F158_WE_annots.h5ad?download=true",
)

INFO:datasets:PyTorch version 2.6.0 available.
INFO:datasets:Polars version 1.33.0 available.
INFO:helical.utils.downloader:Starting to download: 'https://huggingface.co/datasets/helical-ai/yolksac_human/resolve/main/data/17_04_24_YolkSacRaw_F158_WE_annots.h5ad?download=true'
yolksac_human.h5ad: 100%|██████████| 553M/553M [00:00<00:00, 1.72GB/s]


# Process the dataset

In [2]:
import anndata as ad

adata = ad.read_h5ad("./yolksac_human.h5ad")
# We subset to 10 cells and 200 genes
n_cells = 10
n_genes = 200
adata = adata[:n_cells, :n_genes].copy()

# We can specify the perturbation for each cell in the anndata here, or pass it later to get_perturbations
perturbation_column = "perturbation"
adata.obs[perturbation_column] = ["IFNg"] * n_cells

print(adata.shape)
n_cells = adata.n_obs
print(n_cells)

(10, 200)
10


To generate cell sentences, we first import the configuration and then instantiate the model.

In [None]:
from helical.models.c2s import Cell2Sen
from helical.models.c2s import Cell2SenConfig

# Instantiating the model class downloads both the model and its weights; model_size selects the Gemma model ("2B" or "27B")
# To use 4-bit quantization for reduced memory usage, set use_quantization=True in the config
config = Cell2SenConfig(batch_size=8, perturbation_column=perturbation_column, model_size="2B")
cell2sen_model = Cell2Sen(configurer=config)

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  5.90it/s]
INFO:helical.models.c2s.model:Successfully loaded model
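
The memory effect of 4-bit quantization can be estimated with simple arithmetic. The sketch below is not part of helical: the helper `approx_weight_memory_gb` is hypothetical, the parameter counts are nominal values taken from the "2B" and "27B" labels (the actual checkpoints may differ somewhat), and only weight storage is counted, ignoring activations and any quantization overhead.

```python
def approx_weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Rough weight-only memory estimate in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter counts (assumption, derived from the model size labels).
nominal_params = {"2B": 2e9, "27B": 27e9}

estimates = {
    name: {
        "16-bit": approx_weight_memory_gb(n, 16),
        "4-bit": approx_weight_memory_gb(n, 4),
    }
    for name, n in nominal_params.items()
}

for name, est in estimates.items():
    print(f"{name}: ~{est['16-bit']:.1f} GB at 16-bit, ~{est['4-bit']:.1f} GB at 4-bit")
```

Under these assumptions the weights need about a quarter of the 16-bit footprint when stored at 4 bits per parameter, which is why quantization matters most for the larger model.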


Now we can call `process_data()`, which returns a Hugging Face dictionary. The processed dataset includes `fit_parameters` (when `return_fit=True` in `Cell2SenConfig`), which capture a linear relationship between gene rank and log10-transformed expression, fitted over the non-zero expression region as `log_expr_value = slope * rank + intercept`, where `log_expr_value = log10(expr_value + 1)`.

In [4]:
processes_dataset = cell2sen_model.process_data(adata)
print(processes_dataset[0])

INFO:helical.models.c2s.model:Processing data
Processing cells: 100%|██████████| 10/10 [00:00<00:00, 34635.05it/s]
INFO:helical.models.c2s.model:Successfully processed data


{'cell_sentence': 'ABCC3 ABCA1 ABHD14B ABCC4 A2M ABCC2 ABCC6 AARSD1 ABLIM1 ABL2 ABT1 AC002377.1 ABCB10 ABHD11 ABHD5 ABCC5-AS1 ABCA9 ABCC5 ABCB1 ABCB11 ABCB4 ABCC12 ABCC11 ABCC10 ABCC1 ABCB9 ABCB6 ABCB7 ABCB8 ABCB5 ABHD13 ABHD12B ABHD12 ABHD11-AS1 ABHD10 ABHD1 ABCG8 ABCG5 ABCG4 ABCG2 ABCA9-AS1 ABCF3 ABCF2.1 ABCF2 ABCF1 ABCE1 ABCD4 ABCD3 ABCD2 ABCD1 ABCC9 ABCC8 ABCG1 AARS1 AARS AARD AAR2 AANAT AAMP AAMDC AAK1 AAGAB AC002407.1 AC002401.4 ABCA7 AC002401.1 AC002398.9 AC002398.2 AC002398.13 AC002398.12 AAED1 AADAT AADACL2-AS1 AADAC AACS AAAS A4GNT AC002401.3 ABCA8 ABHD14A ABCA6 ABCA5 ABCA4 ABCA3 ABCA2 ABCA13 ABCA12 ABCA10 ABC7-42404400C24.1 AARS2 ABC11-4932300O16.1 ABAT ABALON AB015752.3 AB015752.1 AATK-AS1 AATK AATF AATBC AASS AASDHPPT AASDH ABC12-49244600F4.4 AC002094.2 AC002091.2 AC002091.1 AC002076.10 AC002076.1 AC002074.1 AC002072.1 AC002070.1 AC002066.1 AC002064.5 AC002059.1 ABHD15-AS1 AC002044.1 AC001226.7 AC001226.2 AC001226.1 AC000403.4 AC000403.1 AC000124.1 AC000123.4 AC000123.1 AC
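
To make the rank-to-expression fit concrete, here is a minimal NumPy sketch of the idea described above. It is not helical's implementation: the function `cell_sentence_and_fit` is hypothetical, ranks are assumed to start at 1, and zero-expression genes are assumed to stay at the end of the sentence while being excluded from the fit.

```python
import numpy as np


def cell_sentence_and_fit(expr, gene_names):
    """Order genes by descending expression and fit log10(expr + 1) against rank."""
    expr = np.asarray(expr, dtype=float)
    # Stable sort so that ties keep their original gene order.
    order = np.argsort(-expr, kind="stable")
    sentence = " ".join(gene_names[i] for i in order)

    ranked = expr[order]
    nonzero = ranked > 0  # fit only over the non-zero expression region
    ranks = np.arange(1, len(ranked) + 1)[nonzero]
    log_expr = np.log10(ranked[nonzero] + 1)

    # Least-squares line: log_expr_value = slope * rank + intercept
    slope, intercept = np.polyfit(ranks, log_expr, deg=1)
    return sentence, slope, intercept


# Toy counts chosen so that log10(expr + 1) is exactly 3, 2, 1 at ranks 1, 2, 3.
genes = ["G1", "G2", "G3", "G4", "G5"]
counts = [0, 99, 9, 0, 999]
sentence, slope, intercept = cell_sentence_and_fit(counts, genes)
print(sentence)
print(round(slope, 6), round(intercept, 6))
```

With such a fit, an approximate expression value can be recovered from a gene's position in the sentence as `10 ** (slope * rank + intercept) - 1`.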

# Embeddings

The processed dataset can be used to generate cell embeddings for the expression profiles in `adata`.

In [5]:
# Set output_attentions=True to also get the attention maps: attentions for each layer in the model, per head

embeddings = cell2sen_model.get_embeddings(processes_dataset)

# embeddings, attentions = cell2sen_model.get_embeddings(processes_dataset, output_attentions=True)

print(embeddings.shape)


INFO:helical.models.c2s.model:Extracting embeddings from dataset
Processing embeddings: 100%|██████████| 10/10 [00:00<00:00, 10.42it/s]
INFO:helical.models.c2s.model:Successfully extracted embeddings


(10, 2304)
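
One common use of such embeddings is comparing cells to each other. The sketch below computes pairwise cosine similarity with NumPy. It assumes the embeddings can be treated as a 2D array of shape `(n_cells, hidden_dim)`, as the printed shape suggests, and uses random stand-in data rather than real model output. The helper `cosine_similarity_matrix` is hypothetical and not part of helical.

```python
import numpy as np


def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between the rows of a 2D embedding array."""
    emb = np.asarray(emb, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = emb / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Random stand-in with the same shape as the embeddings printed above.
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(10, 2304))

sim = cosine_similarity_matrix(demo_embeddings)
print(sim.shape)
```

With real output, `demo_embeddings` would be replaced by the array returned from `get_embeddings`.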


# Perturbations

By providing the perturbation labels, we can also generate perturbed gene profiles. If no perturbation list is provided, the perturbation column of the underlying anndata is used.

In [6]:
# Using the anndata perturbation column defined in the config
perturbed_dataset, perturbed_cell_sentences = cell2sen_model.get_perturbations(processes_dataset)
print(perturbed_cell_sentences[0])

INFO:helical.models.c2s.model:Generating perturbed cell sentences
Processing valid perturbations: 100%|██████████| 10/10 [00:41<00:00,  4.13s/it]
INFO:helical.models.c2s.model:Successfully generated perturbed cell sentences


MALAT1 TMSB4X GM11478 RPS14 RPS19 RPL37A RPLP0 RPL32 RPS24 RPS18 RPLP1 RPL41 RPS5 RPL39 RPS15A RPL13 RPS27RT RPLP2 RPS3 RPL37 RPL14 GM11808 RPS11 RPL34 RPS9 RPL8 RPS20 RPL26 RPS3A1 RPL4 RPS15 GM15427 RPL13A-PS1 RPL35A GM2000 RPS21 RPS26 RPL10A-PS1 RPS10 GM6030 RPL23 GM9843 RPS10-PS1 RPS19-PS6 RPL37RT RPS16 GM10076 RPL18A RPS
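
Since a cell sentence is a whitespace-separated list of gene names in rank order, an original and a perturbed sentence can be compared with plain string handling. The sketch below uses toy sentences, and the helpers `first_ranks`, `top_k_overlap` and `rank_shifts` are hypothetical, not helical functions. Repeated gene names are assumed to count at their first position.

```python
def first_ranks(sentence):
    """Map each gene to the 0-based position of its first occurrence."""
    ranks = {}
    for position, gene in enumerate(sentence.split()):
        ranks.setdefault(gene, position)
    return ranks


def top_k_overlap(sentence_a, sentence_b, k=20):
    """Jaccard overlap between the first k genes of two cell sentences."""
    top_a = set(sentence_a.split()[:k])
    top_b = set(sentence_b.split()[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


def rank_shifts(before, after):
    """Rank change (after minus before) for genes present in both sentences."""
    rank_before = first_ranks(before)
    rank_after = first_ranks(after)
    shared = rank_before.keys() & rank_after.keys()
    return {gene: rank_after[gene] - rank_before[gene] for gene in shared}


# Toy sentences, not model output.
original = "A B C D"
perturbed = "B A D E"
overlap = top_k_overlap(original, perturbed, k=3)
shifts = rank_shifts(original, perturbed)
print(overlap)
print(sorted(shifts.items()))
```

A negative shift means the gene moved toward the front of the sentence, that is, toward higher relative expression.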


In [7]:
# Providing a list of perturbations overrides the anndata perturbation column; make sure the list has the same length as the dataset
perturbed_dataset, perturbed_cell_sentences = cell2sen_model.get_perturbations(processes_dataset, perturbations_list=["IFNg"] * n_cells)
print(perturbed_cell_sentences[0])

INFO:helical.models.c2s.model:Generating perturbed cell sentences
Processing valid perturbations: 100%|██████████| 10/10 [01:14<00:00,  7.48s/it]
INFO:helical.models.c2s.model:Successfully generated perturbed cell sentences


MALAT1 TMSB4X GM11478 RPS14 RPS19 RPL37A RPLP0 RPL32 RPS24 RPS18 RPLP1 RPL41 RPS5 RPL39 RPS15A RPL13 RPS27RT RPLP2 RPS3 RPL37 RPL14 GM11808 RPS11 RPL34 RPS9 RPL8 RPS20 RPL26 RPS3A1 RPL4 RPS15 GM15427 RPL13A-PS1 RPL35A GM2000 RPS21 RPS26 RPL10A-PS1 RPS10 GM6030 RPL23 GM9843 RPS10-PS1 RPS19-PS6 RPL37RT RPS16 GM10076 RPL18A RPS
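
Because the perturbation list must contain one entry per cell, it can help to build it with a small helper that keeps the length correct by construction. The sketch below is an illustration only: `build_perturbation_list` is a hypothetical helper, and the label `"other_label"` is a placeholder, not a perturbation name known to be supported by the model.

```python
def build_perturbation_list(n_cells, default, overrides=None):
    """Return one perturbation label per cell, with optional per-cell overrides."""
    labels = [default] * n_cells
    for index, label in (overrides or {}).items():
        if not 0 <= index < n_cells:
            raise IndexError(f"cell index {index} is outside 0..{n_cells - 1}")
        labels[index] = label
    return labels


# "other_label" is a placeholder (assumption), used only to show the override.
labels = build_perturbation_list(10, "IFNg", overrides={0: "other_label"})
print(len(labels), labels[:3])
```

The resulting list could then be passed as `perturbations_list` in the call shown above.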
