# 2. Data Preprocessing & Converting to Cell2Sentence Format

In this notebook, we will:
1. Load a sample preprocessed single-cell dataset (PBMC 3k from Scanpy).
2. Convert the default log1p data transformation (which uses the natural logarithm) to base 10.
3. Convert the data into 'cell sentences' using Cell2Sentence.

## Learning Objectives
- Understand how to handle scRNA-seq data with AnnData.
- Apply the data transformation to allow reverse encoding of cell sentences to transcriptome profiles.
- Generate cell sentences with the top genes for each cell.

## 2.1. Load the PBMC3k dataset
We'll use the built-in processed dataset from Scanpy. It contains ~2700 peripheral blood mononuclear cells.


In [None]:
import os
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import random
from collections import Counter
import pandas as pd

# Cell2Sentence imports
import cell2sentence as cs
from cell2sentence.utils import benchmark_expression_conversion, reconstruct_expression_from_cell_sentence

import tqdm as notebook_tqdm

In [None]:
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)

In [None]:
# Load dataset
adata = sc.datasets.pbmc3k_processed()
adata

The AnnData object typically has:
- `.X` for the gene expression matrix
- `.obs` for cell metadata
- `.var` for gene metadata


In [None]:
adata.obs.head()

Rename the `louvain` column of `adata.obs` to `cell_type`:

In [None]:
adata.obs.rename(columns={'louvain': 'cell_type'}, inplace=True)

Add an `organism` column with the value 'Homo sapiens' to `adata.obs`:

In [None]:
adata.obs["organism"] = "Homo sapiens"

In [None]:
adata.var.head()

In [None]:
sc.pl.umap(
    adata, color="cell_type", legend_loc="on data", title="", frameon=False
)

## 2.2. Data preprocessing note
Cell2Sentence deviates from the standard preprocessing and normalization pipeline in only one respect: the log transformation uses base 10 rather than the natural logarithm.
The processed PBMC 3k dataset was transformed with the default log1p, which uses the natural logarithm. To convert the values to base 10, divide them by ln(10). If your AnnData object is stored in `adata` and the log1p-transformed data is in `adata.X`, you can do:

````python
import numpy as np
adata.X = adata.X / np.log(10)
````

This works because for any value y = ln(x+1), you have log10(x+1) = ln(x+1) / ln(10). Make sure to verify that adata.X stores the transformed data and that any downstream analysis expects the new log scale.
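As a quick sanity check of the change-of-base identity above, here is a minimal sketch using only the standard library. The `counts` values are hypothetical toy numbers, not taken from the PBMC dataset.

```python
import math

# Toy normalized counts for one cell (hypothetical values).
counts = [0.0, 1.0, 9.0, 99.0]

# What the default log1p stores: ln(x + 1).
natural_log1p = [math.log1p(x) for x in counts]

# Change of base: divide by ln(10) to obtain log10(x + 1).
converted = [y / math.log(10) for y in natural_log1p]

# Reference values computed directly in base 10.
direct = [math.log10(x + 1) for x in counts]

for x, c, d in zip(counts, converted, direct):
    print(f"x={x:5.1f}  converted={c:.4f}  direct={d:.4f}")
```

The two columns should agree up to floating-point error, which is why dividing `adata.X` by `np.log(10)` is sufficient.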



In [None]:
adata.X = adata.X / np.log(10)

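Set all negative values to 0.0. Log-transformed counts cannot be negative, so any negative entries in `adata.X` (for example, from an earlier scaling step) are clipped before creating cell sentences: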

In [None]:
import numpy as np
from scipy.sparse import issparse

if issparse(adata.X):
    adata.X.data[adata.X.data < 0] = 0.0
else:
    adata.X[adata.X < 0] = 0.0

In [None]:
SAVE_PATH = "./data/pbmc3k_final.h5ad"

In [None]:
adata.write_h5ad(SAVE_PATH)

## 2.3. Converting to Cell2Sentence (CSData)
Now that we have preprocessed and normalized data loaded, we will perform the conversion to cell sentences. In this section, we will transform our AnnData object containing our single-cell dataset into a Cell2Sentence (C2S) dataset by calling the functions of the CSData class in the C2S code base. Full documentation for the functions of the CSData class can be found in the documentation page of C2S.

First, we define which columns in `adata.obs` we would like to keep in our C2S dataset. The `cell_type` column (renamed from 'louvain' above) and the `organism` column will be useful to keep, so we define a list with these labels. We will then transform the `AnnData` object into an arrow dataset and create 'cell sentences' with the genes in rank order.

In [None]:
adata.obs.head()

In [None]:
adata_obs_cols_to_keep = ["cell_type","organism"]

Now, we create an arrow dataset by calling the adata_to_arrow() class method of the CSData class. This returns a Hugging Face Dataset backed by PyArrow (see https://huggingface.co/docs/datasets/en/about_arrow), together with a vocabulary of gene names.

In [None]:
# Convert the AnnData object into an arrow dataset and a gene vocabulary
arrow_ds, vocabulary = cs.CSData.adata_to_arrow(
    adata=adata, 
    random_state=SEED, 
    sentence_delimiter=' ',
    label_col_names=adata_obs_cols_to_keep
)

Let's examine the arrow dataset which was created:

In [None]:
arrow_ds

We can see that our 2638 cells have now been converted into rows of a Dataset object. The metadata columns of our adata object have been preserved, and two new columns have been added: cell_name and cell_sentence. These columns contain unique cell identifiers as well as cell sentences, respectively. Each cell sentence consists of a string of space-separated gene names, in order of descending expression value. For more details about the cell sentence creation process, please refer to the C2S paper.
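The ranking step described above can be sketched in plain Python. This is a simplified illustration, not the Cell2Sentence implementation: `toy_cell_sentence`, the gene list, and the expression values are hypothetical, and the seeded shuffle used to break ties is an assumption of this sketch.

```python
import random

def toy_cell_sentence(expression, gene_names, delimiter=" ", seed=1234):
    """Rank expressed genes by descending expression and join their names."""
    rng = random.Random(seed)
    # Keep only genes with nonzero expression.
    expressed = [(value, name) for value, name in zip(expression, gene_names) if value > 0]
    # Shuffle first so that ties are broken randomly but reproducibly;
    # Python's sort is stable, so tied genes keep their shuffled order.
    rng.shuffle(expressed)
    expressed.sort(key=lambda pair: pair[0], reverse=True)
    return delimiter.join(name for _, name in expressed)

genes = ["CD3E", "LYZ", "MS4A1", "NKG7"]
cell = [0.0, 2.3, 0.7, 1.1]  # CD3E is not expressed in this toy cell
sentence = toy_cell_sentence(cell, genes)
print(sentence)  # LYZ NKG7 MS4A1
```

Unexpressed genes are dropped, so the sentence length equals the number of nonzero genes in the cell.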

We can look at one arrow dataset example as follows:

In [None]:
sample_idx = 0
arrow_ds[sample_idx]

When we print out an entire sample, we can see that it is a Python dictionary. The cell sentence contains a sentence of gene names ordered by descending expression level, giving a rank-based gene name representation of the cell. The rest of the columns of adata.obs which were specified also show up in the dataset sample.

This dataset format will allow us to work with cell sentence datasets in an efficient manner. For more details on the cell sentence transformation, please review the Cell2Sentence paper: https://openreview.net/pdf?id=EWt5wsEdvc

In [None]:
len(arrow_ds[sample_idx]["cell_sentence"].split(" "))  # Cell 0 has 1838 nonzero expressed genes, yielding a sentence of 1838 gene names separated by spaces.

Next, we will examine the vocabulary which was generated:

In [None]:
print(type(vocabulary))
print(len(vocabulary))

We can see that vocabulary is an OrderedDict of gene features, corresponding to the original 1838 genes in our adata object. The OrderedDict lists the gene features present in our single-cell dataset and also stores the number of cells in which each gene was expressed.

In [None]:
list(vocabulary.items())[:10]
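To make the structure of the vocabulary concrete, here is a minimal sketch that builds a comparable mapping from a toy expression matrix. The helper `toy_vocabulary` and the data are hypothetical; the real vocabulary is produced by `adata_to_arrow()`.

```python
from collections import OrderedDict

def toy_vocabulary(expression_rows, gene_names):
    """Map each gene name to the number of cells with nonzero expression."""
    vocab = OrderedDict((name, 0) for name in gene_names)
    for row in expression_rows:
        for name, value in zip(gene_names, row):
            if value > 0:
                vocab[name] += 1
    return vocab

genes = ["CD3E", "LYZ", "MS4A1"]
cells = [
    [1.2, 0.0, 0.0],
    [0.0, 2.3, 0.4],
    [0.8, 1.9, 0.0],
]
vocab = toy_vocabulary(cells, genes)
print(list(vocab.items()))  # [('CD3E', 2), ('LYZ', 2), ('MS4A1', 1)]
```

The key order follows the gene order of the input, which matters later when expression vectors are reconstructed.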

## CSData creation
Now that our AnnData object is converted into an arrow dataset, we can create a CSData object to wrap around our arrow dataset. This will help us manage the arrow dataset, keeping it saved on disk and out of memory until we need the data for inference or finetuning.

In [None]:
c2s_save_dir = "./c2s_api_testing"  # C2S dataset will be saved into this directory
c2s_save_name = "PBMC_3K_tutorial1"  # This will be the name of our C2S dataset on disk

In [None]:
csdata = cs.CSData.csdata_from_arrow(
    arrow_dataset=arrow_ds, 
    vocabulary=vocabulary,
    save_dir=c2s_save_dir,
    save_name=c2s_save_name,
    dataset_backend="arrow"
)

In [None]:
print(csdata)

The csdata object simply saves our arrow dataset onto disk and keeps a reference to the path. This wrapper class will work in concert with other classes such as CSModel and task functions to load the dataset whenever necessary, so that we avoid holding the C2S dataset in memory when it is not necessary.

We can retrieve and view cell sentences by calling the get_sentence_strings() function:

In [None]:
cell_sentences_list = csdata.get_sentence_strings()

In [None]:
len(cell_sentences_list)

In [None]:
def print_first_N_genes(cell_sentence_str: str, top_k_genes: int, delimiter: str = " "):
    """Helper function to print K genes of a cell sentence."""
    print(delimiter.join(cell_sentence_str.split(delimiter)[:top_k_genes]))

In [None]:
print_first_N_genes(cell_sentences_list[0], top_k_genes=100)

In [None]:
print_first_N_genes(cell_sentences_list[1], top_k_genes=100)

## Cell Sentence Transformation Benchmarking
We have successfully converted our single-cell dataset into cell sentences using the conversion functions. However, it would be useful to know how well the conversion worked and how much expression information was lost when we switched from exact expression values to a rank ordering of genes.

In the C2S paper, a strong linear relationship was found between the log of the rank of a gene and its normalized expression value. We can similarly examine our rank transformation and reconstruction ability of the original expression by calling a rank transformation benchmarking utility function. This function will:

- Fit a linear model on the ranks and expression of the original data, which can be used to reconstruct expression from rank.
- Save plots of log rank vs. log expression and of log expression vs. expression reconstructed from rank.

First, we define a path where the plots for the benchmarking and reconstruction will be saved:

In [None]:
output_path = os.path.join(c2s_save_dir, c2s_save_name)
output_path

In [None]:
transformation_benchmarking_save_name = "inverse_transformation_testing_tutorial_2"

We can call the benchmarking function with our output directory and the normalized expression matrix of our AnnData object. To avoid benchmarking on too many data points, we set sample_size to 1024, so the rank transformation is benchmarked on a sample of 1024 cells.

In [None]:
from scipy.sparse import csr_matrix

# Convert adata.X to a sparse matrix
sparse_matrix = csr_matrix(adata.X)

benchmark_expression_conversion(
    benchmark_output_dir=output_path,
    save_name=transformation_benchmarking_save_name,
    normalized_expression_matrix=sparse_matrix,
    sample_size=1024,
)

Now, we can retrieve the slope and intercept of the linear model that was fit to predict expression from rank:

In [None]:
metrics_df = pd.read_csv(os.path.join(output_path, transformation_benchmarking_save_name + "_benchmark", "c2s_transformation_metrics.csv"))
metrics_df.shape

In [None]:
metrics_df

We can see here the slope and intercept of the linear model which was fit on the log rank versus normalized expression on our sample of cells. Furthermore, we can see correlation statistics of the inverse reconstruction, where the linear model predicts the original expression based on the rank of the gene.

We can see that the linear model achieves an R^2 of 0.88. This indicates that most of the variance in the data is preserved when converting to rank-ordered cell sentences and then recovering the expression from rank. This allows us to use cell sentences and LLMs without worrying about losing too much information when converting back to expression.

In [None]:
slope = metrics_df.iloc[0]["slope"]
intercept = metrics_df.iloc[0]["intercept"]
print("slope:", slope)
print("intercept:", intercept)
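To illustrate what the slope, intercept, and R^2 represent, the following sketch fits an ordinary least-squares line to expression versus log rank for a single toy cell. It assumes a base-10 logarithm of the 1-based rank; `fit_line` and the expression values are hypothetical and independent of `benchmark_expression_conversion`.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Toy cell: expression values already sorted in descending order (hypothetical).
expression_by_rank = [2.0, 1.6, 1.3, 1.2, 1.0, 0.9]
# Assumption: base-10 log of the 1-based rank.
log_ranks = [math.log10(rank) for rank in range(1, len(expression_by_rank) + 1)]

toy_slope, toy_intercept, toy_r2 = fit_line(log_ranks, expression_by_rank)
print(f"slope={toy_slope:.3f} intercept={toy_intercept:.3f} R^2={toy_r2:.3f}")
```

The slope is negative because expression decreases as rank increases, and the intercept is the predicted expression of the top-ranked gene (log rank 0).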

## Reconstruct Cell Expression From Cell Sentences
To further see the ability of the linear model to reconstruct original gene expression from rank in the cell sentences, in this section we will reconstruct expression vectors from cell sentences and visualize them against the original data.

First, we need to create a list of the gene names in our vocabulary. This will determine the ordering of genes in the expression vector we reconstruct:

In [None]:
vocab_list = list(vocabulary.keys())
print(len(vocab_list))
vocab_list[:4]

Now, we will first reconstruct a single expression vector:

In [None]:
print(len(cell_sentences_list))
print_first_N_genes(cell_sentences_list[0], top_k_genes=100)

In [None]:
expression_vector = reconstruct_expression_from_cell_sentence(
    cell_sentence_str=cell_sentences_list[0],
    delimiter=" ",
    vocab_list=vocab_list,
    slope=slope,
    intercept=intercept,
)

In [None]:
print(type(expression_vector))
print(expression_vector.shape)
print(expression_vector.dtype)

In [None]:
expression_vector

In [None]:
expression_vector.sum()

In [None]:
print(len(cell_sentences_list[0].split(" ")))
print(np.nonzero(expression_vector)[0].shape)

We can see that the function reconstruct_expression_from_cell_sentence() has performed the inverse reconstruction on the cell sentence, using the rank of each gene in the cell sentence to predict its original expression using the linear model we fitted earlier:

- predicted_expression = intercept + (slope * log(rank_of_gene))
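The formula above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: `toy_reconstruct`, the toy vocabulary, and the slope and intercept values are hypothetical, and the base-10 logarithm of the 1-based rank is assumed.

```python
import math

def toy_reconstruct(cell_sentence, vocab_list, slope, intercept, delimiter=" "):
    """Predict expression from rank; genes absent from the sentence stay at 0."""
    gene_to_index = {gene: i for i, gene in enumerate(vocab_list)}
    vector = [0.0] * len(vocab_list)
    for rank, gene in enumerate(cell_sentence.split(delimiter), start=1):
        # Assumption: base-10 log of the 1-based rank.
        vector[gene_to_index[gene]] = intercept + slope * math.log10(rank)
    return vector

toy_vocab = ["CD3E", "LYZ", "MS4A1", "NKG7"]
toy_vector = toy_reconstruct("LYZ NKG7 MS4A1", toy_vocab, slope=-1.0, intercept=2.0)
print([round(v, 3) for v in toy_vector])  # [0.0, 2.0, 1.523, 1.699]
```

The output vector follows the order of the vocabulary, not the order of the sentence, which is why the vocabulary order must match the gene order of the original matrix.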

We can now repeat this and reconstruct the entire original dataset:

In [None]:
from tqdm import tqdm

all_reconstructed_expression_vectors = []
for idx in tqdm(range(len(cell_sentences_list))):
    expression_vector = reconstruct_expression_from_cell_sentence(
        cell_sentence_str=cell_sentences_list[idx],
        delimiter=" ",
        vocab_list=vocab_list,
        slope=slope,
        intercept=intercept,
    )
    all_reconstructed_expression_vectors.append(expression_vector)

all_reconstructed_expression_vectors = np.stack(all_reconstructed_expression_vectors)

In [None]:
all_reconstructed_expression_vectors.shape

Let's now make a new AnnData object, copying the .obs and .var from our original adata but using our reconstructed expression vectors as the expression matrix:

In [None]:
import scipy.sparse  # import the submodule explicitly so scipy.sparse is available

all_reconstructed_expression_vectors = scipy.sparse.csr_array(all_reconstructed_expression_vectors)
all_reconstructed_expression_vectors

In [None]:
import anndata

reconstructed_adata = anndata.AnnData(
    X=all_reconstructed_expression_vectors,
    obs=adata.obs.copy(),
    var=adata.var.copy()
)
reconstructed_adata

Quickly verify that the original adata.var gene list ordering matches the vocab_list which we reconstructed vectors with:

In [None]:
adata.var.head()

In [None]:
vocab_list[:5]
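Comparing the first few entries by eye only checks the head of each list. A programmatic comparison over the full lists is a stricter check. The helper below is a hypothetical sketch demonstrated on toy lists.

```python
def first_mismatch(left, right):
    """Return the first index where two sequences differ, or None if they are identical."""
    for i, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return i
    # All shared positions match; a length difference is still a mismatch.
    if len(left) != len(right):
        return min(len(left), len(right))
    return None

print(first_mismatch(["A", "B", "C"], ["A", "B", "C"]))  # None
print(first_mismatch(["A", "B", "C"], ["A", "C", "B"]))  # 1
```

In this notebook, calling `first_mismatch(list(adata.var_names), vocab_list)` should return `None` if the two orderings agree.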

## Plotting Reconstructed Expression Vectors
Now we will plot original data and reconstructed expression vectors side by side, to verify that the cell sentence transformation has preserved most of the original variance of the data.

First, we will remove the extra attributes of our original adata object, since we will need to create a new joint UMAP.

In [None]:
del adata.uns
del adata.obsm
del adata.varm
del adata.obsp

In [None]:
adata

In [None]:
adata.obs["c2s_data_label"] = ["Original Data"] * adata.obs.shape[0]
reconstructed_adata.obs["c2s_data_label"] = ["Reconstructed From Cell Sentences"] * reconstructed_adata.obs.shape[0]

In [None]:
combined_adata = anndata.concat([adata, reconstructed_adata], axis=0)
combined_adata

In [None]:
combined_adata.obs_names_make_unique()

In [None]:
combined_adata.var = adata.var.copy()

In [None]:
combined_adata.obs = combined_adata.obs[["cell_type", "c2s_data_label"]]


In [None]:
combined_adata

We can now run PCA, Scanpy's neighbors algorithm, and then the UMAP algorithm:

In [None]:
sc.tl.pca(combined_adata)

In [None]:
sc.pp.neighbors(combined_adata)

In [None]:
sc.tl.umap(combined_adata)

In [None]:
combined_adata

In [None]:
combined_adata[combined_adata.obs["c2s_data_label"] == "Reconstructed From Cell Sentences", :]

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 4.5))
sc.pl.umap(
    combined_adata[combined_adata.obs["c2s_data_label"] == "Original Data", :],
    color="cell_type",
    size=8,
    title="Original PBMC 3k Data",
    show=False,
    ax=ax1
)
sc.pl.umap(
    combined_adata[combined_adata.obs["c2s_data_label"] == "Reconstructed From Cell Sentences", :],
    color="cell_type",
    size=8,
    title="Reconstructed PBMC 3k Data",
    show=False,
    ax=ax2
)
plt.tight_layout()
plt.show()
plt.close()

# Standalone view of the reconstructed data. No `ax` is passed here, because
# `ax2` belongs to the figure that was already closed above.
sc.pl.umap(
    combined_adata[combined_adata.obs["c2s_data_label"] == "Reconstructed From Cell Sentences", :],
    color="cell_type",
    size=8,
    title="Reconstructed From Cell Sentences",
)

Now our data is ready for LLM-based annotation or other tasks in the next notebook.

[Go to Notebook 3 →](./3_Annotation_with_LLM.ipynb)