# Cell2Sen Tutorial

This tutorial shows you how to process your data and generate embeddings and perturbed gene profiles using the Cell2Sen model from the helical repo.

The model is implemented using the open-source weights available on Hugging Face and the Gemma 2-2b model.

In [None]:
from helical.utils.downloader import Downloader
from pathlib import Path

downloader = Downloader()
downloader.download_via_link(
    Path("yolksac_human.h5ad"),
    "https://huggingface.co/datasets/helical-ai/yolksac_human/resolve/main/data/17_04_24_YolkSacRaw_F158_WE_annots.h5ad?download=true",
)

# Process the dataset

In [None]:
import anndata as ad

adata = ad.read_h5ad("./yolksac_human.h5ad")
# We subset to 10 cells and 200 genes
n_cells = 10
n_genes = 200
adata = adata[:n_cells, :n_genes].copy()

# We can specify the perturbation for each cell in the anndata here, or pass a list later to get_perturbations
perturbation_column = "perturbation"
adata.obs[perturbation_column] = ["IFNg"] * n_cells

print(adata.shape)
n_cells = adata.n_obs
print(n_cells)

To generate cell sentences, we first import the configuration and then instantiate the model.

In [None]:
from helical.models.c2s import Cell2Sen
from helical.models.c2s import Cell2SenConfig

# when calling the model class both the model and weights are downloaded - we can choose the model size ("2B" vs "27B" Gemma model)
# if you would like to use 4-bit quantization for reduced memory usage, set use_quantization=True in the config
config = Cell2SenConfig(batch_size=8, perturbation_column=perturbation_column, model_size="2B", use_quantization=True)
cell2sen_model = Cell2Sen(configurer=config)

Now we can call `process_data()`, which returns a Hugging Face dictionary. When `return_fit=True` is set in `Cell2SenConfig`, the processed dataset also includes `fit_parameters`. These describe a linear relationship between gene rank and log10-transformed expression, fitted over the non-zero expression region as `log_expr_value = slope * rank + intercept`, where `log_expr_value = log10(expr_value + 1)`.

In [None]:
processes_dataset = cell2sen_model.process_data(adata)
print(processes_dataset[0])
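
To make the fit described above concrete, here is a minimal NumPy sketch on toy counts. It is not helical's implementation: the function name `fit_rank_to_log_expression`, the toy gene names, and the choice to start ranks at 0 are assumptions made for illustration. It orders genes by descending expression, joins the names of the non-zero genes into a space-separated cell sentence, and fits `log10(expr_value + 1)` against rank with `np.polyfit`.

```python
import numpy as np


def fit_rank_to_log_expression(expr_values):
    """Fit log10(expr + 1) = slope * rank + intercept over non-zero genes.

    Illustrative only: ranks start at 0 here, which is an assumption.
    """
    expr = np.asarray(expr_values, dtype=float)
    # Order genes from most to least expressed (stable, so ties keep input order).
    order = np.argsort(-expr, kind="stable")
    sorted_expr = expr[order]
    # Restrict the fit to the non-zero expression region.
    nonzero = sorted_expr > 0
    ranks = np.arange(len(sorted_expr))[nonzero]
    log_expr = np.log10(sorted_expr[nonzero] + 1.0)
    slope, intercept = np.polyfit(ranks, log_expr, deg=1)
    return slope, intercept, order, nonzero


# Hypothetical gene names and counts for a single cell.
gene_names = np.array(["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E", "GENE_F"])
toy_counts = np.array([0, 120, 3, 45, 0, 9])

slope, intercept, order, nonzero = fit_rank_to_log_expression(toy_counts)

# A cell sentence lists the expressed genes in descending order of expression.
cell_sentence = " ".join(gene_names[order][nonzero])
print(cell_sentence)

# Approximate expression values can be recovered from ranks by inverting the fit.
ranks = np.arange(int(nonzero.sum()))
approx_expr = 10 ** (slope * ranks + intercept) - 1
print(slope, intercept)
```

Because expression decreases as rank increases, the fitted slope is negative. The inverse mapping in the last lines shows how a slope and intercept allow approximate expression values to be reconstructed from gene positions in a sentence.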

# Embeddings

The processed dataset can be used to generate cell embeddings for the expression profiles in `adata`.

In [None]:
# set output_attentions=True to get the attention maps - this will return attentions for each layer in the model per head

embeddings = cell2sen_model.get_embeddings(processes_dataset)

# embeddings, attentions = cell2sen_model.get_embeddings(processes_dataset, output_attentions=True)

print(embeddings.shape)
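
Once embeddings are available, one common next step is to compare cells with each other. The sketch below computes a cosine similarity matrix with NumPy. It assumes the embeddings can be converted to a 2D array of shape `(n_cells, hidden_dim)`; the random `stand_in_embeddings` array and the helper name `cosine_similarity_matrix` are placeholders for illustration, not part of helical.

```python
import numpy as np


def cosine_similarity_matrix(emb):
    """Return the pairwise cosine similarity between rows of a 2D array."""
    emb = np.asarray(emb, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = emb / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Stand-in for the real embeddings: 10 cells, 16 dimensions (hypothetical sizes).
rng = np.random.default_rng(0)
stand_in_embeddings = rng.normal(size=(10, 16))

similarity = cosine_similarity_matrix(stand_in_embeddings)
print(similarity.shape)
```

With real embeddings, replace `stand_in_embeddings` with the array returned by `get_embeddings`.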


# Perturbations

By providing perturbation labels, we can also generate perturbed gene profiles. If no perturbation list is provided, the perturbation column of the underlying anndata (set via `perturbation_column` in the config) is used.

In [None]:
# Using the anndata perturbation column defined in the config
perturbed_dataset, perturbed_cell_sentences = cell2sen_model.get_perturbations(processes_dataset)
print(perturbed_cell_sentences[0])

In [None]:
# Providing a list of perturbations overrides the anndata perturbation column.
# Make sure the list has the same length as the dataset.
perturbed_dataset, perturbed_cell_sentences = cell2sen_model.get_perturbations(processes_dataset, perturbations_list=["IFNg"] * n_cells)
print(perturbed_cell_sentences[0])
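
Since the perturbation list must match the dataset length, it can help to check this before calling the model. The helper below is a small sketch of such a check; `validate_perturbations` is a hypothetical name and is not part of helical.

```python
def validate_perturbations(perturbations, n_cells):
    """Return the labels as a list, or raise if the length does not match."""
    labels = list(perturbations)
    if len(labels) != n_cells:
        raise ValueError(
            f"Expected {n_cells} perturbation labels, got {len(labels)}."
        )
    return labels


n_cells = 10
perturbations_list = validate_perturbations(["IFNg"] * n_cells, n_cells)
print(len(perturbations_list))
```

The validated list can then be passed as `perturbations_list` to `get_perturbations`.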