# Tutorial Notebook 10: Finetuning for Perturbation Response Prediction

In this tutorial, we will demonstrate how to finetune a Cell2Sentence (C2S) model for perturbation response prediction tasks. This is a critical task in single-cell analysis, where the goal is to predict how a cell's gene expression profile changes in response to a specific perturbation (e.g., a genetic knockout or a drug treatment).

We will treat this as a "translation" task in natural language: translating a cell (in cell sentence format) from its basal (control) state to its perturbed state, conditioned on the perturbation applied.

At a high level, we will:
1. Load a public single-cell perturbation dataset.
2. Write a custom prompt template for perturbation prediction.
3. Subclass the `PromptFormatter` class to create pairs of control and perturbed cells.
4. Finetune a pretrained C2S-Scale model on this new task.
5. Generate a prediction with our new finetuned model to see it in action.

First, let's import the necessary libraries. We'll need standard data science libraries, single-cell analysis tools, and modules from the `cell2sentence` and `transformers` packages.

In [1]:
# Python built-in libraries
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" 
os.environ["WORLD_SIZE"] = "1"

import pickle
import random
from datetime import datetime
from collections import Counter, defaultdict

# Third-party libraries
from datasets import Dataset
import numpy as np
import torch
from transformers import TrainingArguments, AutoModelForCausalLM
from tqdm import tqdm

# Single-cell libraries
import anndata
import scanpy as sc

In [2]:
# Cell2Sentence imports
import cell2sentence as cs
from cell2sentence.prompt_formatter import get_cell_sentence_str, PromptFormatter

In [3]:
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)

# Load Perturbation Data

For this tutorial, you will need a single-cell dataset that includes both control and perturbed cells. The data should be in an `AnnData` object. Crucially, your `.obs` dataframe must contain:
- A column that distinguishes control cells from perturbed cells, e.g., `adata.obs['condition']`
    - Values can be something like 'control' or 'non-targeting' for control cells, and 'perturbed' or the perturbation name for the perturbed cells

For this example, we use a public genetic perturbation dataset of Jurkat cells (Nadig et al., 2025). To use your own dataset, replace `DATA_PATH` with the path to your preprocessed data file.
- Paper: https://www.nature.com/articles/s41588-025-02169-3
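If your own dataset still holds raw counts, it needs normalizing before use. The following is a minimal NumPy sketch of library-size normalization followed by a log1p transform, not the preprocessing used for the Jurkat data. The function name, the `target_sum=1e4` scaling, and the toy counts are assumptions for illustration. In practice, scanpy's `sc.pp.normalize_total` and `sc.pp.log1p` (which accepts a `base` argument) are the usual route.

```python
import numpy as np

def normalize_and_log1p(counts, target_sum=1e4, base=10):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x) in the given base."""
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    scaled = counts / totals * target_sum
    return np.log1p(scaled) / np.log(base)

# Toy matrix: 2 cells x 3 genes of raw UMI counts (made-up values)
toy_counts = np.array([[0, 9, 1], [5, 5, 0]])
toy_log = normalize_and_log1p(toy_counts)
print(toy_log.round(3))
```

Zero counts stay at zero after the transform, so the set of expressed genes per cell is unchanged.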

<font color='red'>Ensure your data is preprocessed and normalized (e.g., using log1p transformation) before proceeding.</font>

In [4]:
# Replace this with the actual path to your dataset, if using a custom dataset
DATA_PATH = "/home/sr2464/scratch/C2S_API_Testing/Data/jurkat.h5ad"
adata = anndata.read_h5ad(DATA_PATH)
adata

AnnData object with n_obs × n_vars = 21412 × 18080
    obs: 'batch_var', 'cell_type', 'target_gene', 'gene_id', 'mitopercent', 'UMI_count'

In [5]:
adata.obs.head()

cell_barcode,batch_var,cell_type,target_gene,gene_id,mitopercent,UMI_count
AAACCCAAGCACCAGA-42,jurkat42,jurkat,EIF4B,ENSG00000063046,0.037697,4828.0
AAACCCAAGCAGATAT-4,jurkat4,jurkat,DAXX,ENSG00000204209,0.063256,9675.0
AAACCCAAGCTAGATA-2,jurkat2,jurkat,METTL3,ENSG00000165819,0.069885,16985.0
AAACCCAAGCTGGAGT-37,jurkat37,jurkat,non-targeting,non-targeting,0.055775,20475.0
AAACCCAAGGGCCTCT-51,jurkat51,jurkat,non-targeting,non-targeting,0.047619,12642.0


In [6]:
target_gene_counter = Counter(adata.obs['target_gene'])
print(len(target_gene_counter))

69


In [7]:
target_gene_counter.most_common(20)

[('non-targeting', 12013),
 ('TFAM', 2555),
 ('EIF4B', 461),
 ('JAZF1', 265),
 ('HIRA', 233),
 ('SMARCB1', 204),
 ('TAF13', 200),
 ('KDM1A', 195),
 ('MBTPS1', 192),
 ('LZTR1', 192),
 ('METTL3', 177),
 ('DNMT1', 177),
 ('SDC1', 156),
 ('TADA1', 144),
 ('MED12', 143),
 ('USF2', 141),
 ('ARPC2', 140),
 ('PTPN1', 136),
 ('PHF10', 136),
 ('NDUFB4', 133)]

This data contains both control cells (labeled `non-targeting`) and cells under different genetic knockouts.
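As a quick sanity check, we can count how many cells are controls and how many carry a perturbation. The helper below is a small sketch. Its name and the toy labels are hypothetical, while the Jurkat totals are taken from the outputs above.

```python
from collections import Counter

def summarize_conditions(labels, control_label="non-targeting"):
    """Return (n_control, n_perturbed, n_perturbations) for a list of condition labels."""
    counts = Counter(labels)
    n_control = counts.get(control_label, 0)
    n_perturbed = sum(counts.values()) - n_control
    n_perturbations = len(counts) - (1 if control_label in counts else 0)
    return n_control, n_perturbed, n_perturbations

# Toy labels (made up) standing in for adata.obs['target_gene']
toy_labels = ["non-targeting", "TFAM", "non-targeting", "EIF4B", "TFAM"]
toy_summary = summarize_conditions(toy_labels)
print(toy_summary)  # (2, 3, 2)

# Same arithmetic with the totals reported above for the Jurkat data
n_perturbed_jurkat = 21412 - 12013   # perturbed cells
n_perturbations_jurkat = 69 - 1      # every label except 'non-targeting'
print(n_perturbed_jurkat, n_perturbations_jurkat)  # 9399 68
```

With one training pair per perturbed cell, the number of perturbed cells is also the number of prompt-response pairs we should expect.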

In [8]:
adata.var.head()

SAMD11
NOC2L
KLHL17
PLEKHN1
PERM1


In [9]:
adata.X

<21412x18080 sparse matrix of type '<class 'numpy.float32'>'
	with 69336411 stored elements in Compressed Sparse Row format>

In [10]:
# View a few nonzero values
adata.X.data[:10]

array([1.6483034, 1.6483034, 1.6483034, 1.6483034, 1.6483034, 1.6483034,
       1.6483034, 1.6483034, 1.6483034, 1.6483034], dtype=float32)

In [11]:
adata.X.max()

8.333305

The expression is already preprocessed and log1p transformed. If we wanted to convert generated cell sentences back to expression vectors, we would typically use a log1p transform with base 10 for Cell2Sentence training, since base 10 gives a more linear relationship between log rank and log expression in many single-cell datasets.
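To make the rank-to-expression idea concrete, here is a sketch that fits a straight line between log10(rank) and log-transformed expression for one cell, then uses the fit to map ranks back to expression. It illustrates the linear relationship only and is not necessarily how the `cell2sentence` package performs the inversion. The function names and the synthetic cell are assumptions.

```python
import numpy as np

def fit_rank_to_expression(expr):
    """Fit expr ~ slope * log10(rank) + intercept over the nonzero genes of one cell."""
    expr = np.asarray(expr, dtype=np.float64)
    nonzero = np.sort(expr[expr > 0])[::-1]  # descending expression
    ranks = np.arange(1, nonzero.size + 1)
    slope, intercept = np.polyfit(np.log10(ranks), nonzero, deg=1)
    return slope, intercept

def ranks_to_expression(n_genes, slope, intercept):
    """Predict expression for ranks 1..n_genes, clipped at zero."""
    ranks = np.arange(1, n_genes + 1)
    return np.clip(slope * np.log10(ranks) + intercept, 0.0, None)

# Synthetic cell whose (log-transformed) expression is exactly linear in log10(rank)
true_slope, true_intercept = -1.5, 4.0
toy_expr = true_slope * np.log10(np.arange(1, 51)) + true_intercept
slope, intercept = fit_rank_to_expression(toy_expr)
reconstructed = ranks_to_expression(50, slope, intercept)
print(round(slope, 3), round(intercept, 3))
```

On real data the fit is only approximate, so reconstructed values should be read as estimates.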

For this tutorial, we will use this processed data as is, to train our model to generate target cell sentences.

# Cell2Sentence Conversion

Now, we will convert the gene expression data in our `AnnData` object into cell sentences. This process creates a Hugging Face Arrow dataset, which is used in our LLM training.
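Conceptually, a cell sentence is the list of a cell's expressed genes ordered from highest to lowest expression. The sketch below shows that idea on a toy vector. It is a simplified stand-in, not the library's implementation, and the function name and tie-breaking rule (ties keep their original gene order) are assumptions.

```python
def expression_to_cell_sentence(expr, gene_names, delimiter=" "):
    """Order genes by decreasing expression and join the names of the nonzero ones."""
    nonzero = [(value, name) for value, name in zip(expr, gene_names) if value > 0]
    # Sort by descending expression; Python's sort is stable, so ties keep input order
    nonzero.sort(key=lambda pair: -pair[0])
    return delimiter.join(name for _, name in nonzero)

toy_genes = ["SAMD11", "NOC2L", "KLHL17", "PLEKHN1", "PERM1"]
toy_expr = [0.0, 2.3, 0.7, 0.0, 1.1]
toy_sentence = expression_to_cell_sentence(toy_expr, toy_genes)
print(toy_sentence)  # NOC2L PERM1 KLHL17
```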

In [12]:
# We'll keep all relevant columns for our new task
adata_obs_cols_to_keep = adata.obs.columns.tolist()
adata_obs_cols_to_keep

['batch_var',
 'cell_type',
 'target_gene',
 'gene_id',
 'mitopercent',
 'UMI_count']

In [13]:
# Create Arrow dataset and vocabulary
arrow_ds, vocabulary = cs.CSData.adata_to_arrow(
    adata=adata, 
    random_state=SEED, 
    sentence_delimiter=' ',
    label_col_names=adata_obs_cols_to_keep
)
arrow_ds

100%|██████████| 21412/21412 [00:11<00:00, 1899.14it/s]


Dataset({
    features: ['cell_name', 'cell_sentence', 'batch_var', 'cell_type', 'target_gene', 'gene_id', 'mitopercent', 'UMI_count'],
    num_rows: 21412
})

# Custom Prompt Formatting for Perturbation Prediction

This is the core of our tutorial. For this dataset, we have a single large pool of control cells (labeled 'non-targeting') and multiple groups of perturbed cells, each with a specific `target_gene`. Our goal is to create training pairs where each perturbed cell is matched with a randomly selected control cell. Note that for different perturbation applications, there may be better ways of pairing control and perturbed cells.
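As one example of an alternative pairing scheme, the sketch below draws each control from the same batch as the perturbed cell and falls back to the global pool when that batch has no controls. Whether batch matching helps depends on the dataset, so treat it as an assumption rather than a recommendation. The function name and toy inputs are hypothetical; in this dataset the batch labels would come from the `batch_var` column.

```python
import random
from collections import defaultdict

def pair_with_batch_matched_controls(batches, labels, control_label="non-targeting", seed=1234):
    """For each perturbed cell index, pick a control index from the same batch when possible."""
    rng = random.Random(seed)
    controls_by_batch = defaultdict(list)
    all_controls = []
    for idx, (batch, label) in enumerate(zip(batches, labels)):
        if label == control_label:
            controls_by_batch[batch].append(idx)
            all_controls.append(idx)
    if not all_controls:
        raise ValueError("No control cells found.")

    pairs = []
    for idx, (batch, label) in enumerate(zip(batches, labels)):
        if label == control_label:
            continue
        # Fall back to the global pool if this batch has no controls
        pool = controls_by_batch.get(batch) or all_controls
        pairs.append((rng.choice(pool), idx))
    return pairs

toy_batches = ["b1", "b1", "b2", "b2", "b3"]
toy_labels = ["non-targeting", "TFAM", "non-targeting", "EIF4B", "TFAM"]
toy_pairs = pair_with_batch_matched_controls(toy_batches, toy_labels)
print(toy_pairs)
```

Each tuple is `(control_index, perturbed_index)`; the last perturbed cell sits in a batch without controls, so its control comes from the global pool.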

We will define a custom prompt structure that frames our task for the LLM. The input will contain the **control cell sentence** and the **perturbation name**. The model's expected output (the response) will be the **perturbed cell sentence**.

First, let's define our prompt templates.

In [14]:
# The input provides the control cell and the perturbation, asking for the perturbed result.
custom_input_prompt_template = """Given the following cell sentence of {num_genes} expressed genes representing a cell's basal state, predict the cell sentence after applying the perturbation: {perturbation_name}.
Control cell sentence: {control_cell_sentence}.

Perturbed cell sentence:"""

# The answer is simply the target cell sentence.
answer_template = "{perturbed_cell_sentence}."
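To see what the templates produce, we can fill them with toy values. The snippet below redefines the same template text under different variable names so that it runs on its own. The three-gene sentences are made up for illustration.

```python
input_template = (
    "Given the following cell sentence of {num_genes} expressed genes representing a cell's "
    "basal state, predict the cell sentence after applying the perturbation: {perturbation_name}.\n"
    "Control cell sentence: {control_cell_sentence}.\n\n"
    "Perturbed cell sentence:"
)
response_template = "{perturbed_cell_sentence}."

toy_prompt = input_template.format(
    num_genes=3,
    perturbation_name="TFAM",
    control_cell_sentence="ACTB TMSB4X MT-CO3",
)
toy_response = response_template.format(perturbed_cell_sentence="MT-CO3 ACTB TMSB4X")

print(toy_prompt)
print(toy_response)
```

The prompt ends right where the model is expected to start generating, and the response supplies the perturbed sentence followed by a period.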

To apply this template, we need to create pairs of (control cell, perturbed cell) for each perturbation. We'll create a custom `PerturbationPromptFormatter` by subclassing the base `PromptFormatter`.

Our custom `format_hf_ds` function will:
1.  First, iterate through the entire dataset to create a list of all control cell indices.
2.  Simultaneously, it will group the indices of all perturbed cells into a dictionary, with the perturbation name (`target_gene`) as the key.
3.  Then, it will iterate through each perturbation group and, for every perturbed cell, randomly select a control cell from the global pool to form a pair.
4.  Finally, it will format these pairs into the input prompts and expected responses for the model.

In [15]:
from collections import defaultdict

class PerturbationPromptFormatter(PromptFormatter):
    def __init__(self,
        task_name,
        input_prompt,
        answer_template,
        top_k_genes, 
        perturbation_col='target_gene',
        control_label='non-targeting'
    ):
        """
        Initializes the custom prompt formatter.

        Args:
            task_name (str): The name for this task.
            input_prompt (str): The template for the model's input.
            answer_template (str): The template for the model's expected response.
            top_k_genes (int): The number of top genes to include in the cell sentence.
            perturbation_col (str): The column name in the dataset that contains perturbation info.
            control_label (str): The label used to identify control cells in the perturbation_col.
        """
        super().__init__()
        self.task_name = task_name
        self.input_prompt = input_prompt
        self.answer_template = answer_template
        self.top_k_genes = top_k_genes
        self.perturbation_col = perturbation_col
        self.control_label = control_label
        assert top_k_genes > 0, "'top_k_genes' must be an integer > 0"

    def format_hf_ds(self, hf_ds):
        """
        Custom formatting function for perturbation prediction. This function creates pairs of
        (control, perturbed) cells by sampling from a global pool of control cells.
        """
        model_inputs_list = []
        responses_list = []
        
        # 1. Separate all cells into a global control pool and a dict of perturbed cells
        control_indices = []
        pert_to_indices = defaultdict(list)

        print("Grouping cells by perturbation and identifying global controls...")
        for i, sample in enumerate(hf_ds):
            if sample[self.perturbation_col] == self.control_label:
                control_indices.append(i)
            else:
                pert_to_indices[sample[self.perturbation_col]].append(i)
        
        assert len(control_indices) > 0, "No control cells found. Cannot create pairs."
        print(f"Found {len(control_indices)} control cells.")
        print(f"Found {len(pert_to_indices)} unique perturbations.")

        # 2. Create prompt-response pairs by iterating through perturbed cells
        print("Creating control-perturbed pairs...")
        for pert_name, perturbed_indices in tqdm(pert_to_indices.items()):
            for perturbed_idx in perturbed_indices:
                # Pair each perturbed cell with a random control cell from the global pool
                control_idx = random.choice(control_indices)
                
                control_sample = hf_ds[control_idx]
                perturbed_sample = hf_ds[perturbed_idx]

                # Format control cell sentence
                control_sentence, num_genes_str = get_cell_sentence_str(
                    control_sample,
                    num_genes=self.top_k_genes
                )
                # Format perturbed cell sentence
                perturbed_sentence, _ = get_cell_sentence_str(
                    perturbed_sample,
                    num_genes=self.top_k_genes
                )

                # Format the model input string using the perturbation name
                model_input_str = self.input_prompt.format(
                    num_genes=num_genes_str,
                    perturbation_name=pert_name,
                    control_cell_sentence=control_sentence
                )
                
                # Format the response string
                response_str = self.answer_template.format(
                    perturbed_cell_sentence=perturbed_sentence
                )

                model_inputs_list.append(model_input_str)
                responses_list.append(response_str)

        # Create the final Hugging Face Dataset
        ds_split_dict = {
            "sample_type": [self.task_name] * len(model_inputs_list),
            "model_input": model_inputs_list,
            "response": responses_list,
        }
        ds = Dataset.from_dict(ds_split_dict)
        return ds

Let's instantiate our custom formatter and apply it to our dataset to see the result.

In [16]:
task_name = "perturbation_prediction"
prompt_formatter = PerturbationPromptFormatter(
    task_name=task_name,
    input_prompt=custom_input_prompt_template,
    answer_template=answer_template,
    top_k_genes=200 # Using top 200 genes for this example. For real applications, ideal to use all nonzero expressed genes if possible.
)

In [17]:
# Format the dataset
formatted_ds = prompt_formatter.format_hf_ds(arrow_ds)
formatted_ds

Grouping cells by perturbation and identifying global controls...
Found 12013 control cells.
Found 68 unique perturbations.
Creating control-perturbed pairs...


100%|██████████| 68/68 [00:03<00:00, 20.85it/s]


Dataset({
    features: ['sample_type', 'model_input', 'response'],
    num_rows: 9399
})

Note that if you want a train/test split of the data, do it before formatting: set aside a subset of control cells along with held-out perturbed cells (or entire held-out perturbations), then format each split separately.
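A split of this kind could be sketched as follows: hold out a fraction of whole perturbations plus a matching share of control cells, and return row indices for each side. The function name, the 20% holdout fraction, and the toy labels are assumptions for illustration only.

```python
import random

def split_by_perturbation(labels, control_label="non-targeting", holdout_frac=0.2, seed=1234):
    """Hold out whole perturbations and a matching share of control cells."""
    rng = random.Random(seed)
    perturbations = sorted({label for label in labels if label != control_label})
    n_holdout = min(len(perturbations), max(1, round(len(perturbations) * holdout_frac)))
    heldout = set(rng.sample(perturbations, n_holdout))

    control_idx = [i for i, label in enumerate(labels) if label == control_label]
    rng.shuffle(control_idx)
    n_test_controls = round(len(control_idx) * holdout_frac)
    test_controls = set(control_idx[:n_test_controls])

    train_idx, test_idx = [], []
    for i, label in enumerate(labels):
        in_test = (i in test_controls) if label == control_label else (label in heldout)
        (test_idx if in_test else train_idx).append(i)
    return train_idx, test_idx, heldout

# Toy labels (made up): 10 controls and 5 perturbations
toy_labels = (["non-targeting"] * 10 + ["TFAM"] * 3 + ["EIF4B"] * 3
              + ["DAXX"] * 2 + ["HIRA"] * 2 + ["JAZF1"] * 2)
train_idx, test_idx, heldout = split_by_perturbation(toy_labels)
print(len(train_idx), len(test_idx), heldout)
```

The resulting index lists could then be applied with `arrow_ds.select(train_idx)` and `arrow_ds.select(test_idx)` before formatting each split.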

In [18]:
# Inspect a formatted sample
print("--- Formatted Sample ---")
print("#----Model input:----#")
print(formatted_ds[0]["model_input"], "\n")
print("#----Response:----#")
print(formatted_ds[0]["response"])

--- Formatted Sample ---
#----Model input:----#
Given the following cell sentence of 200 expressed genes representing a cell's basal state, predict the cell sentence after applying the perturbation: EIF4B.
Control cell sentence: MT-ATP6 MT-CO3 TMSB4X MT-ND4 MT-ND1 MT-CO2 ACTB MT-ND2 MT-ND3 MT-CYB RACK1 HSP90AB1 HSP90AA1 NME2 HSPD1 H2AFZ YBX1 ACTG1 FTH1 TUBA1B TUBB LDHB PFN1 NCL FTL PRDX1 EEF1B2 HIST1H4C BTF3 ADA CFL1 CHCHD2 SET ENO1 UBA52 H3F3A ARHGDIB EIF1 EEF1G PPP1R14B GSTP1 CALR HINT1 SERF2 HSPE1 XRCC5 TMSB10 STMN1 CD3D B2M EEF2 MT-ND6 PPIA HIST1H1D MIF H3F3B ANP32B HNRNPU UBC SERBP1 CDK6 EIF4A1 TYMS UQCRH PSMA7 HNRNPA2B1 SFPQ GPX4 HMGB2 PRRC2C SIVA1 FUS SUB1 LCP1 SLC25A3 SNRPB SRSF3 CSDE1 PCLAF EIF3G EIF3A MYL6 HES4 HMGN1 SLC25A5 COX4I1 MKI67 NDUFA13 HNRNPD PEBP1 PPP1CA GABARAP PRDX5 HNRNPA3 TIMM13 PKM POMP PCNA NUDC STOML2 GADD45GIP1 TMA7 PPA1 CORO1A HNRNPK ARPP19 UQCRB ATP5F1B SNRPD1 DDX5 ARPC3 ISG15 PSMA4 SEC61B COX8A PSMA1 FABP5 TMEM160 SNRPE PRPF40A MT-ND5 ODC1 COX6A1 SH3BGRL

Now that our custom formatter is ready, we'll wrap our original `arrow_ds` in a `CSData` object. The `fine_tune()` method will use this object and our custom formatter to prepare the data for training.

In [19]:
# Save directory for Huggingface dataset
c2s_save_dir = "/home/sr2464/scratch/C2S_API_Testing/Data/perturbation_tutorial/perturbation_tutorial"
c2s_save_name = "jurkat_perturbation_c2s"

In [20]:
csdata = cs.CSData.csdata_from_arrow(
    arrow_dataset=arrow_ds,  # Unformatted cell sentence dataset; fine_tune() re-applies the prompt formatter itself
    vocabulary=vocabulary,
    save_dir=c2s_save_dir,
    save_name=c2s_save_name,
    dataset_backend="arrow"
)
print(csdata)

Saving the dataset (0/1 shards):   0%|          | 0/21412 [00:00<?, ? examples/s]

CSData Object; Path=/home/sr2464/scratch/C2S_API_Testing/Data/perturbation_tutorial/perturbation_tutorial/jurkat_perturbation_c2s, Format=arrow


# Load a Pretrained Cell2Sentence Model

We will start with a C2S model that has already been pretrained on a diverse set of single-cell datasets. This provides a strong foundation of biological knowledge. The `C2S-Scale-Pythia-1b-pt` and newer C2S-Scale models are good general-purpose models to start from.

In [21]:
model_name_or_path = "vandijklab/C2S-Scale-Pythia-1b-pt"
save_dir = "/home/sr2464/scratch/C2S_API_Testing/Cache_Dir/perturbation_tutorial"
save_name = "perturbation_pythia_1B"

In [22]:
csmodel = cs.CSModel(
    model_name_or_path=model_name_or_path,
    save_dir=save_dir,
    save_name=save_name
)
print(csmodel)

Using device: cuda
CSModel Object; Path=/home/sr2464/scratch/C2S_API_Testing/Cache_Dir/perturbation_tutorial/perturbation_pythia_1B


# Finetune for Perturbation Prediction

Now, we'll finetune our model on the perturbation prediction task. We'll define our `TrainingArguments` and then call the `fine_tune()` method, passing in our `csdata` object and the `PerturbationPromptFormatter` instance we created.

For this tutorial, we'll run for a small number of steps (`max_steps=500`). For a full finetuning run, you would typically train for several epochs.

In [23]:
datetimestamp = datetime.now().strftime('%Y-%m-%d-%H_%M_%S')
output_dir = os.path.join(csmodel.save_dir, f"{datetimestamp}_finetune_{task_name}")
print(output_dir)

/home/sr2464/scratch/C2S_API_Testing/Cache_Dir/perturbation_tutorial/2025-10-14-23_33_53_finetune_perturbation_prediction


In [24]:
# Create the output directory, including any missing parent directories
os.makedirs(output_dir, exist_ok=True)

In [25]:
train_args = TrainingArguments(
    bf16=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    logging_steps=50,
    lr_scheduler_type="cosine",
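    num_train_epochs=1,  # Ignored in this run because max_steps is set below
    eval_steps=50,
    evaluation_strategy="steps",  # Named eval_strategy in recent transformers releases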
    save_steps=100,
    save_strategy="steps",
    output_dir=output_dir,
    max_steps=500  # Shortened for tutorial purposes
)
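To put `max_steps=500` in context, the rough arithmetic below estimates how much data the model sees with these settings. It is a back-of-the-envelope sketch: the variable names are ours, and 9399 is the size of the full formatted dataset, so the actual training split is smaller.

```python
import math

per_device_batch = 2
grad_accum = 4
n_devices = 1        # matches WORLD_SIZE = "1" set at the top of the notebook
max_steps = 500
n_formatted = 9399   # full formatted dataset; the training split is a subset of this

effective_batch = per_device_batch * grad_accum * n_devices
samples_seen = effective_batch * max_steps
steps_per_epoch_upper = math.ceil(n_formatted / effective_batch)

print(effective_batch, samples_seen, steps_per_epoch_upper)  # 8 4000 1175
```

So 500 steps cover about 4000 prompt-response pairs, which is at least roughly 43% of one epoch.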



In [26]:
csmodel.fine_tune(
    csdata=csdata,
    task=task_name,
    train_args=train_args,
    loss_on_response_only=True, # We only want to calculate loss on the predicted sentence
    top_k_genes=200,  # Use top 200 genes for this example, normally would use full cell sentence (all nonzero expressed genes) if possible
    prompt_formatter=prompt_formatter  # Pass in our custom prompt formatter
)

Grouping cells by perturbation and identifying global controls...
Found 12013 control cells.
Found 68 unique perturbations.
Creating control-perturbed pairs...


100%|██████████| 68/68 [00:03<00:00, 20.78it/s]


Reloading model from path on disk: /home/sr2464/scratch/C2S_API_Testing/Cache_Dir/perturbation_tutorial/perturbation_pythia_1B


Map (num_proc=3):   0%|          | 0/9399 [00:00<?, ? examples/s]

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Starting training. Output directory: /home/sr2464/scratch/C2S_API_Testing/Cache_Dir/perturbation_tutorial/2025-10-14-23_33_53_finetune_perturbation_prediction
Selecting 500 samples of eval dataset to shorten validation loop.


max_steps is given, it will override any value given in num_train_epochs
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33msyeda5688[0m ([33msyed-a-rizvi[0m). Use [1m`wandb login --relogin`[0m to force relogin


Finetuning completed. Updated model saved to disk at: /home/sr2464/scratch/C2S_API_Testing/Cache_Dir/perturbation_tutorial/2025-10-14-23_33_53_finetune_perturbation_prediction


# Generate Predictions with the Finetuned Model

After finetuning, let's load our new model and use it to predict the response to a perturbation for a cell from our test set.

In [27]:
final_ckpt_path = os.path.join(output_dir, "checkpoint-500")
final_ckpt_path

'/home/sr2464/scratch/C2S_API_Testing/Cache_Dir/perturbation_tutorial/2025-10-14-23_33_53_finetune_perturbation_prediction/checkpoint-500'

In [28]:
save_dir

'/home/sr2464/scratch/C2S_API_Testing/Cache_Dir/perturbation_tutorial'

In [29]:
# Load the finetuned model from the final checkpoint written during training
finetuned_model = cs.CSModel(
    model_name_or_path=final_ckpt_path,  # checkpoint-500 inside output_dir
    save_dir=save_dir,
    save_name="perturbation_predictor_finetuned_final"
)

Using device: cuda


In [30]:
finetuned_model.save_path

'/home/sr2464/scratch/C2S_API_Testing/Cache_Dir/perturbation_tutorial/perturbation_predictor_finetuned_final'

In [31]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [32]:
final_model = AutoModelForCausalLM.from_pretrained(
    finetuned_model.save_path,
    cache_dir=os.path.join(csmodel.save_dir, ".cache"),
    trust_remote_code=True
).to(device)

In [33]:
# Load the dataset split created inside fine_tune(), saved to the output directory
with open(os.path.join(output_dir, 'data_split_indices_dict.pkl'), 'rb') as f:
    data_split_indices_dict = pickle.load(f)

data_split_indices_dict.keys()

dict_keys(['train', 'val', 'test'])

In [34]:
# Print a few indices of test samples
data_split_indices_dict['test'][:10]

[7, 29, 33, 35, 51, 54, 65, 70, 115, 116]

In [35]:
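# Select a few held-out test samples. formatted_ds was built in In [17], so the randomly
# drawn control cells in these prompts may differ from the pairings made inside fine_tune().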
formatted_test_ds = formatted_ds.select(data_split_indices_dict['test'][:10])
formatted_test_ds

Dataset({
    features: ['sample_type', 'model_input', 'response'],
    num_rows: 10
})

In [36]:
# Select a sample from the test set for inference
sample_idx = 0
inference_prompt = formatted_test_ds[sample_idx]['model_input']
ground_truth_response = formatted_test_ds[sample_idx]['response']

print("--- Inference Prompt ---")
print(inference_prompt)

--- Inference Prompt ---
Given the following cell sentence of 200 expressed genes representing a cell's basal state, predict the cell sentence after applying the perturbation: EIF4B.
Control cell sentence: ACTB MT-ATP6 MT-CO3 MT-CO2 TMSB4X TUBA1B MT-ND4 HSP90AA1 HIST1H4C TMSB10 MT-CYB ACTG1 TUBB RACK1 PFN1 H3F3A ARHGDIB YBX1 MT-ND1 CFL1 H2AFZ B2M FTH1 HSP90AB1 CENPF UBA52 MIF MT-ND3 LDHB NCL PPIA SERF2 HNRNPA2B1 EEF1B2 CDK6 PSMA7 HSPD1 TOP2A KPNA2 GSTP1 BTF3 MT-ND2 CD3D SET CORO1A UBB ANP32B HNRNPU STMN1 EIF4A1 CCT3 EEF1G EEF2 SOX4 EIF1 BANF1 HNRNPD ADA COX4I1 MKI67 COX6C SUMO2 LCP1 SLC25A3 PKM CHCHD2 HINT1 OAZ1 NDUFS6 ENO1 TUBB4B HMGB2 MACF1 NME1 LSM4 SELENOH HSPE1 ERH NUCKS1 HMGN1 DDX5 NDUFB10 HIST1H1E GPX4 NDUFA13 GTF3A MT-ND6 PTGES3 ARL6IP1 RBMX PGK1 CTCF ATP5MC3 PSMB2 SLC25A5 PRDX5 COX7C SPTBN1 BPTF ANXA1 UBE2C PRDX1 SRSF10 NDUFA12 SFPQ EIF3G TOMM22 SIVA1 PHB2 CCT2 PA2G4 PCBP2 SUB1 CCT6A RHOA ASPM YWHAE ATP5MC1 XRCC5 SMC4 SRSF3 YBX3 HNRNPAB ATP5F1B CSDE1 SNRPB IFI16 SNRPD1 PAFAH1B

In [37]:
# Generate the prediction
predicted_response = finetuned_model.generate_from_prompt(
    model=final_model,
    prompt=inference_prompt,
    max_num_tokens=800 # max number of tokens to generate, ~4 tokens per gene
)

In [38]:
print("\n--- Ground Truth Perturbed Cell ---")
print(ground_truth_response)
print("\n--- Predicted Perturbed Cell ---")
print(predicted_response)


--- Ground Truth Perturbed Cell ---
MT-ATP6 MT-CO3 MT-CO2 MT-ND4 MT-CYB MT-ND1 MT-ND2 ACTB HSP90AA1 TMSB4X YBX1 EEF1B2 MT-ND3 RACK1 HSP90AB1 EEF1G MIF NME2 HIST1H4C TUBA1B NCL TUBB ADA ENO1 STMN1 H2AFZ PFN1 H3F3A CFL1 LDHB HINT1 HSPD1 C1QBP HSPE1 UBA52 SERF2 ACTG1 PPIA B2M CALR HNRNPA2B1 ARHGDIB GSTP1 SET MT-ND5 BTF3 CCT2 CHCHD2 NUCB2 XRCC5 PGK1 CD3D HNRNPU SUMO2 PPP1R14B HNRNPD UQCRH FDPS ALYREF SIVA1 DNAJA1 SLC25A6 ARPC2 FTL TYMS DUT COX4I1 SNRPB DDX5 PRKDC SLC25A3 PSMA7 CD3G SLC25A5 UBC ATP5F1E MYL6 ATP5F1B CCT6A PCLAF CDK6 H3F3B EIF4A1 EEF2 HMGB2 GUK1 THRAP3 HNRNPDL SERBP1 FABP5 EIF2S2 NUCKS1 HNRNPA3 HMGN1 COX7C SFPQ NDUFA13 HSPA8 OAZ1 MARCKSL1 DEK SELENOH FTH1 SRM SNRPF EIF1 DYNLL1 XRCC6 MT-ND6 RSL1D1 ANP32B EIF3E PSMA3 NDUFS5 ERH ATP5MC3 ARPC3 GLUL YBX3 FUS SNRPD1 YWHAB ATP5MC1 PRDX1 PSMA4 SOX4 NDUFAB1 PPA1 EIF5 NUDC STOML2 NME4 SRSF7 MZB1 H1FX NOP56 TPM4 NME1 NASP EIF3I ATP6V1F SF3B2 TRIR CCNI ARPC1B COX6C SLIRP SNRPC BRK1 ARPC5 ATP5MF SRSF3 CD7 HNRNPR GNAI2 CARHSP1 PPIG ATP5MG
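To go beyond eyeballing the two sentences, we could score a prediction against its ground truth. The sketch below computes two simple measures: the Jaccard overlap of the gene sets, and a Spearman rank correlation over the genes both sentences share. These metrics and helper names are our own assumptions, not an evaluation protocol from the Cell2Sentence package, and the toy sentences are made up.

```python
def parse_cell_sentence(sentence):
    """Split a cell sentence into gene names, dropping a trailing period and duplicates."""
    seen, genes = set(), []
    for gene in sentence.strip().rstrip(".").split():
        if gene not in seen:
            seen.add(gene)
            genes.append(gene)
    return genes

def jaccard_overlap(pred_genes, true_genes):
    """Jaccard index between the two gene sets (order ignored)."""
    pred, true = set(pred_genes), set(true_genes)
    union = pred | true
    return len(pred & true) / len(union) if union else 0.0

def spearman_on_shared(pred_genes, true_genes):
    """Spearman correlation of rank order over genes present in both (duplicate-free) lists."""
    pred_set, true_set = set(pred_genes), set(true_genes)
    shared_true_order = [g for g in true_genes if g in pred_set]
    shared_pred_order = [g for g in pred_genes if g in true_set]
    n = len(shared_true_order)
    if n < 2:
        return float("nan")
    pred_rank = {g: r for r, g in enumerate(shared_pred_order)}
    d_squared = sum((r - pred_rank[g]) ** 2 for r, g in enumerate(shared_true_order))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

toy_truth = parse_cell_sentence("MT-ATP6 MT-CO3 ACTB TMSB4X YBX1.")
toy_pred = parse_cell_sentence("MT-CO3 MT-ATP6 ACTB YBX1 RACK1.")
toy_jaccard = jaccard_overlap(toy_pred, toy_truth)
toy_rho = spearman_on_shared(toy_pred, toy_truth)
print(round(toy_jaccard, 3), round(toy_rho, 3))  # 0.667 0.8
```

In the notebook, the same helpers could be applied to `predicted_response` and `ground_truth_response`, ideally averaged over many test samples rather than a single cell.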

# Conclusion

In this tutorial, you learned how to finetune a Cell2Sentence model for a custom task: perturbation response prediction. By creating a `PerturbationPromptFormatter`, we were able to structure our data into control-perturbation-response triplets, enabling the model to learn the complex transcriptional changes that occur upon perturbation.

This approach is flexible and can be adapted to other experimental designs. With a full training run and evaluation on held-out data, a model finetuned this way could support in-silico experiments, such as virtual screens of genetic perturbations or predicting the effects of new compounds.