# 6. Advanced Use Cases & Troubleshooting

In this final notebook, we'll explore:
1. Synthetic cell generation.
2. Common troubleshooting tips.

## 6.1. Generating Synthetic Cells
We can prompt the model to create a new cell's gene list based on a desired cell type. This can help with data augmentation or exploration.

We will begin by importing the necessary libraries. These include Python's built-in libraries, third-party libraries for handling numerical computations, progress tracking, and specific libraries for single-cell RNA sequencing data and C2S operations.

In [None]:
# Python built-in libraries
import os
import pickle
import random
from collections import Counter

# Third-party libraries
import numpy as np
from tqdm import tqdm

# Single-cell libraries
import anndata
import scanpy as sc

# Cell2Sentence imports
import cell2sentence as cs
from cell2sentence.tasks import generate_cells_conditioned_on_cell_type
from cell2sentence.utils import (
    post_process_generated_cell_sentences,
    reconstruct_expression_from_cell_sentence
)

In [None]:
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)

In [None]:
import torch
torch.cuda.empty_cache()

### Load Data

Next, we will load the preprocessed dataset from tutorial 0. This dataset has already been filtered and normalized, so it is ready for transformation into cell sentences.

<font color='red'>If you are using your own dataset, please make sure you have completed the preprocessing steps in tutorial 0 before running the following code.</font> Ensure that <font color='gold'>DATA_PATH</font> points to where your preprocessed data was saved in tutorial 0.

In [None]:
DATA_PATH = "./data/pbmc3k_final.h5ad"

In [None]:
adata = anndata.read_h5ad(DATA_PATH)
adata

In [None]:
adata.obs.head()

In [None]:
#adata.var.head()

In [None]:
sc.pl.umap(
    adata,
    color="cell_type",
    size=8,
    title="PBMC 3K UMAP",
)

### Cell2Sentence Conversion

In this section, we will transform our AnnData object containing our single-cell dataset into a Cell2Sentence (C2S) dataset by calling the functions of the CSData class in the C2S code base. Full documentation for the functions of the CSData class can be found in the documentation page of C2S.

In [None]:
adata_obs_cols_to_keep = ["cell_type","organism"]

In [None]:
# Create CSData object
arrow_ds, vocabulary = cs.CSData.adata_to_arrow(
    adata=adata, 
    random_state=SEED, 
    sentence_delimiter=' ',
    label_col_names=adata_obs_cols_to_keep
)

In [None]:
arrow_ds

For this exercise, we will keep only the top 100 genes of each cell sentence.

In [None]:
k = 100  # replace with your desired number of genes

arrow_ds = arrow_ds.map(lambda x: {"cell_sentence": " ".join(x["cell_sentence"].split()[:k])})

In [None]:
sample_idx = 2000
len(arrow_ds[sample_idx]['cell_sentence'].split())
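The truncation step above can be illustrated on a toy example without loading any data. This is a minimal sketch: the helper name `truncate_cell_sentence` and the toy sentence are made up for illustration and are not part of the C2S API.

```python
def truncate_cell_sentence(sentence: str, k: int, delimiter: str = " ") -> str:
    """Keep only the first k genes of a rank-ordered cell sentence."""
    genes = sentence.split(delimiter)
    # Slicing past the end is safe: shorter sentences are returned unchanged.
    return delimiter.join(genes[:k])


# Toy cell sentence (illustrative gene order, not taken from the dataset)
toy_sentence = "MALAT1 B2M TMSB4X RPL10 RPS27 CD3D"
top3 = truncate_cell_sentence(toy_sentence, k=3)
print(top3)
```

Because cell sentences are ordered from highest to lowest expression, keeping the first `k` tokens keeps the `k` most highly expressed genes.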

In [None]:
c2s_save_dir = "./c2s_api_testing"  # C2S dataset will be saved into this directory
c2s_save_name = "PBMC_3K_tutorial3"  # This will be the name of our C2S dataset on disk

In [None]:
cs_data = cs.CSData.csdata_from_arrow(
    arrow_dataset=arrow_ds, 
    vocabulary=vocabulary,
    save_dir=c2s_save_dir,
    save_name=c2s_save_name,
    dataset_backend="arrow"
)

In [None]:
print(cs_data)

In [None]:
len(cs_data.get_sentence_strings())

Along with the model checkpoint, we saved the indices of the train, val, and test set cells. After loading our C2S model, we will use these indices to select the test set cells for inference.

### Load C2S Model

Now, we will load the C2S model that we will use for cell generation. For this tutorial, this model will be the last checkpoint of the training session from <font color="red">tutorial notebook 4</font>, where we finetuned our cell type prediction model specifically on our immune tissue dataset. We will specify the same save_dir as we used during training.
- <font color="red">Note:</font> If you are using your own data for this tutorial, make sure to switch out to the model checkpoint which you saved in tutorial notebook 3.
- If you want to annotate cell types without finetuning your own C2S model, then tutorial notebook 6 demonstrates how to load the C2S-Pythia-410M cell type prediction foundation model and use it to predict cell types without any finetuning.

We can define our CSModel object with our pretrained cell type prediction model as follows, specifying the same save_dir as we used in tutorial 3:

In [None]:
# Define CSModel object
cell_type_prediction_model_path = "./c2s_api_testing/csmodel_tutorial_3/2025-04-03-14_05_10_finetune_cell_type_prediction/checkpoint-330"

save_dir = "./c2s_api_testing/csmodel_tutorial_3"

save_name = "cell_type_pred_pythia_410M_2"
csmodel = cs.CSModel(
    model_name_or_path=cell_type_prediction_model_path,
    save_dir=save_dir,
    save_name=save_name
)

We will also load the data split indices saved alongside the C2S model checkpoint, so that we know which cells were part of the training and validation sets. We will do inference on unseen test set cells, which make up 10% of the original data.

In [None]:
base_path = os.path.dirname(cell_type_prediction_model_path)  # directory containing the checkpoint folder
print(cell_type_prediction_model_path)
print(base_path)

In [None]:
with open(os.path.join(base_path, 'data_split_indices_dict.pkl'), 'rb') as f:
    data_split_indices_dict = pickle.load(f)
data_split_indices_dict.keys()

In [None]:
print(len(data_split_indices_dict["train"]))
print(len(data_split_indices_dict["val"]))
print(len(data_split_indices_dict["test"]))
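Before using the split indices, it can be worth checking that the splits do not overlap and that their proportions match expectations. The following is a minimal sketch on a toy dictionary. The helper `check_splits` is hypothetical, and it assumes each split is stored as a list of integer row indices, as the `.select(...)` call later in this notebook suggests.

```python
def check_splits(split_indices: dict, n_cells: int) -> dict:
    """Return the fraction of cells in each split; raise if splits are inconsistent."""
    seen = set()
    fractions = {}
    for name, indices in split_indices.items():
        idx_set = set(indices)
        if len(idx_set) != len(indices):
            raise ValueError(f"Split '{name}' contains duplicate indices")
        if seen & idx_set:
            raise ValueError(f"Split '{name}' overlaps with an earlier split")
        if any(i < 0 or i >= n_cells for i in idx_set):
            raise ValueError(f"Split '{name}' has indices outside [0, {n_cells})")
        seen |= idx_set
        fractions[name] = len(indices) / n_cells
    return fractions


# Toy stand-in for data_split_indices_dict (10 cells, 80/10/10 split)
toy_splits = {"train": [0, 1, 2, 3, 4, 5, 6, 7], "val": [8], "test": [9]}
fractions = check_splits(toy_splits, n_cells=10)
print(fractions)
```

With the real data, one would pass `data_split_indices_dict` and `len(arrow_ds)` instead of the toy values.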

Select the test set cells from the full arrow dataset:

In [None]:
arrow_ds

In [None]:
test_ds = arrow_ds.select(data_split_indices_dict["test"])
test_ds

Here, we do not need to create a CSData object. We can simply supply the cell types from our test dataset to the cell generation function from tasks.py.

### Generate Cells Conditioned on Cell Type

Now that we have loaded our finetuned cell generation model and have our test set cells, we will generate cells using the generate_cells_conditioned_on_cell_type() function from tasks.py, which takes as input a C2S model object as well as a list of cell types to prompt C2S to generate. The function returns one generated cell sentence for each cell type supplied and handles prompt formatting for us.

In [None]:
cell_types_to_generate = test_ds["cell_type"]

In [None]:
print(len(cell_types_to_generate))
cell_types_to_generate[:3]

In [None]:
inference_batch_size = 8

In [None]:
generated_cells = generate_cells_conditioned_on_cell_type(
    csmodel=csmodel, 
    cell_types_list=cell_types_to_generate, 
    n_genes=100, 
    organism="Homo sapiens",
    inference_batch_size=inference_batch_size,
    max_num_tokens=1024,
    use_flash_attn=False,  # at smaller sequence lengths (< 1024), flash attention doesn't significantly benefit text generation.
    do_sample=True,
    top_k=50,
    top_p=0.95,
)

The function generates one cell for each cell type we provided (264 cells here), mimicking the cell type frequency in the real test set. We can check the number of generated cells and save them below:

In [None]:
len(generated_cells)
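Since one cell is generated per requested label, the generated set inherits the label frequencies of the list we passed in. A small sketch of how those frequencies could be inspected with `Counter` follows. The helper `label_frequencies` and the toy labels are illustrative only; with real data one would pass `cell_types_to_generate`.

```python
from collections import Counter


def label_frequencies(labels):
    """Return the relative frequency of each label in a list."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy stand-in for cell_types_to_generate
toy_requested = ["CD4 T cells", "CD4 T cells", "B cells", "NK cells"]
freqs = label_frequencies(toy_requested)
for label, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{label}: {freq:.2f}")
```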

In [None]:
with open(os.path.join(c2s_save_dir, 'generated_cells.pkl'), 'wb') as f:  # save next to the C2S dataset
    pickle.dump(generated_cells, f)

Here, we view a few of our generated cell sentences:

In [None]:
generated_cells[:5]

The model outputs a string of gene symbols in rank order. You can check if known CD4 T cell markers appear near the top.
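One way to do this check programmatically is sketched below. It assumes each generated cell is a plain string of space-separated gene symbols. The helper `marker_ranks`, the toy sentence, and the marker list are illustrative choices, not output from the model or part of the C2S API.

```python
def marker_ranks(cell_sentence, markers, top_n=20):
    """Return {marker: rank} for markers found within the first top_n genes (1-based ranks)."""
    first_rank = {}
    for rank, gene in enumerate(cell_sentence.split(), start=1):
        # Generated sentences may repeat genes; keep the first (highest) rank.
        first_rank.setdefault(gene, rank)
    return {
        marker: first_rank[marker]
        for marker in markers
        if marker in first_rank and first_rank[marker] <= top_n
    }


# Toy generated sentence and an illustrative marker list
toy_generated = "MALAT1 B2M IL7R RPS27 CD3D LTB CCR7"
hits = marker_ranks(toy_generated, markers=["IL7R", "CD3D", "CD8A"], top_n=5)
print(hits)
```

Markers that are absent or ranked below `top_n` are simply omitted from the result, so an empty dictionary means none of the markers appeared near the top.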

### Converting Synthetic Sentences Back to Expression
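C2S includes methods to approximate expression levels from ranks, although this is somewhat experimental. The `post_process_generated_cell_sentences` and `reconstruct_expression_from_cell_sentence` utilities imported from `cell2sentence.utils` at the top of this notebook are intended for this step. We'll skip a full demo here, but see the official docs for details.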
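To give a feel for the idea, here is a hand-written sketch of mapping ranks back to approximate expression values. It is not the library's implementation. It assumes a linear relationship between log10(rank) and normalized expression, and the `slope` and `intercept` values are placeholders that would normally be fitted on the original dataset. The function name `sketch_rank_to_expression` is hypothetical.

```python
import math


def sketch_rank_to_expression(cell_sentence, slope, intercept):
    """Approximate expression per gene as intercept + slope * log10(rank).

    Repeated genes are skipped so that each gene keeps its first (highest) rank.
    """
    expression = {}
    rank = 0
    for gene in cell_sentence.split():
        if gene in expression:
            continue  # ignore duplicates produced during generation
        rank += 1
        # Clip at zero, since normalized expression values are non-negative.
        expression[gene] = max(0.0, intercept + slope * math.log10(rank))
    return expression


# Toy sentence with a repeated gene; slope and intercept are placeholder values
toy_sentence = "MALAT1 B2M MALAT1 IL7R"
expr = sketch_rank_to_expression(toy_sentence, slope=-1.0, intercept=3.0)
for gene, value in expr.items():
    print(f"{gene}: {value:.3f}")
```

Genes that do not appear in the sentence would be treated as having zero expression when building a full expression vector over the vocabulary.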

## 6.2. Troubleshooting
1. **Slow inference**: Use a smaller model, ensure GPU usage if possible.
2. **Weird or unknown labels**: The model might not have learned that cell type. Double-check gene name formats. Possibly switch to a specialized model or do partial fine-tuning.
3. **Out-of-memory**: Reduce batch size, use GPU with larger memory, or use a smaller model.
4. **Installation errors**: Make sure `pip install cell2sentence` succeeded; check Python version.
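For the out-of-memory case, one generic pattern is to retry with a smaller batch size. The sketch below is framework-agnostic. `run_with_batch_backoff` and `fake_generate` are hypothetical names, and `fake_generate` only simulates a memory failure. With a real model one would wrap the generation call instead, keeping in mind that catching `RuntimeError` is broad and may also catch unrelated errors.

```python
def run_with_batch_backoff(run_fn, items, batch_size, min_batch_size=1):
    """Call run_fn(items, batch_size), halving batch_size after each memory-related error.

    Returns (result, batch_size_that_worked). Re-raises once min_batch_size also fails.
    """
    while True:
        try:
            return run_fn(items, batch_size), batch_size
        except (MemoryError, RuntimeError):
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"Retrying with batch size {batch_size}")


def fake_generate(items, batch_size):
    """Stand-in for a generation call: pretends batches larger than 2 do not fit in memory."""
    if batch_size > 2:
        raise MemoryError("simulated out-of-memory")
    return [f"generated:{item}" for item in items]


results, used_batch_size = run_with_batch_backoff(
    fake_generate, ["B cells", "NK cells"], batch_size=8
)
print(used_batch_size, results)
```

Note that each retry restarts the whole call, so for long runs it may be preferable to process the cell type list in chunks and save intermediate results.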

## That's It!
You've completed a tour of Cell2Sentence for single-cell data. Feel free to experiment further and share feedback or questions.

*Happy analyzing your scRNA-seq data with LLMs!*