# 1. Introduction to Cell2Sentence and Environment Setup

In this notebook, we introduce the concept of Large Language Models (LLMs) in single-cell transcriptomics, specifically focusing on the [Cell2Sentence (C2S)](https://github.com/vandijklab/cell2sentence) framework. We also provide instructions to set up your environment for running the subsequent hands-on tutorials.

## Learning Objectives
1. Understand the rationale for using LLMs in single-cell data interpretation.
2. Review how Cell2Sentence converts gene expression profiles into text.
3. Install and verify the required Python packages, including the `cell2sentence` library.

Let's get started!

## 1.1. Background: LLMs in Single-Cell Transcriptomics

Single-cell RNA sequencing (scRNA-seq) datasets often contain tens of thousands of cells with diverse gene expression profiles, all of which need to be annotated. Traditional approaches rely on clustering and prior knowledge of marker genes, which can be time-consuming.

Large Language Models (LLMs) such as GPT, BERT variants, or domain-specific models can assist here because they are good at recognizing patterns in token sequences. **Cell2Sentence (C2S)** bridges numeric gene expression data and text-based LLMs by converting each cell's expression profile into a 'sentence' of gene names, ordered from highest to lowest expression. This approach:
- Allows cell type annotation via natural language classification.
- Enables generative modeling for synthetic cells.
- Facilitates marker identification and text-based data mining.
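
To make the conversion concrete, here is a minimal sketch of the rank-ordering idea in plain Python. It is not the `cell2sentence` API: the function name `to_cell_sentence`, the `top_k` parameter, the choice to skip zero-expression genes, and the toy counts are all assumptions made for illustration.

```python
def to_cell_sentence(expression, gene_names, top_k=None):
    """Return gene names ordered by descending expression, skipping zeros.

    Hypothetical helper for illustration only; not part of cell2sentence.
    """
    # sorted() is stable, so tied genes keep their original relative order.
    order = sorted(range(len(expression)), key=lambda i: -expression[i])
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    if top_k is not None:
        ranked = ranked[:top_k]
    return " ".join(ranked)


# Toy cell with made-up counts for five genes.
genes = ["CD3E", "MS4A1", "LYZ", "NKG7", "GNLY"]
counts = [12, 0, 3, 7, 3]

sentence = to_cell_sentence(counts, genes)
print(sentence)  # CD3E NKG7 LYZ GNLY
```

The resulting string is what a text-based model would see in place of the numeric expression vector.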


## 1.2. Environment Setup

We'll use a Python 3.8+ environment (Conda recommended) and install the `cell2sentence` library from PyPI. The installation also pulls in dependencies such as `scanpy`, `torch`, and `transformers`.

### 1.2.1 Creating a Conda environment
```bash
conda create -n cell2sentence_env python=3.8 -y
conda activate cell2sentence_env
```

### 1.2.2 Installing Cell2Sentence
```bash
pip install cell2sentence
```

*(If you don't use Conda, just ensure your Python version is 3.8+ and run the pip install.)*
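
If you are unsure which interpreter you are running, a quick check along these lines may help. This is a small sketch using only the standard library; the helper name `python_is_supported` is made up for this example.

```python
import sys

MIN_VERSION = (3, 8)  # minimum version stated in this notebook


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


print("Python executable:", sys.executable)
print("Python version:", sys.version.split()[0])
print("Meets 3.8+ requirement?", python_is_supported())
```

The printed executable path should point inside your environment (for example, the `cell2sentence_env` Conda environment created above).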

### 1.2.3 Verifying Installation
Open a Python interpreter or use the cell below to verify:


```python
import cell2sentence as c2s
import scanpy as sc
import torch
import transformers

print("Cell2Sentence version:", c2s.__version__)
print("Scanpy version:", sc.__version__)
print("PyTorch version:", torch.__version__)
print("Transformers version:", transformers.__version__)

print("CUDA available?", torch.cuda.is_available())
```

Example output:

```text
Cell2Sentence version: 0.0.2
Scanpy version: 1.9.8
PyTorch version: 2.4.1+cu121
Transformers version: 4.46.3
CUDA available? True
```


If everything imports without error, you're good to go! If you have a CUDA-compatible GPU and the right drivers installed, `torch.cuda.is_available()` should return `True`.

If you see any errors, please double-check that your environment is active and that all dependencies are installed.
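
One way to narrow down an import error is to ask Python which packages it can locate without actually importing them. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `find_missing` and the `required` list are assumptions based on the imports used in this notebook.

```python
import importlib.util
import sys


def find_missing(packages):
    """Return the top-level package names that cannot be located."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


# Packages imported in the verification cell of this notebook.
required = ["cell2sentence", "scanpy", "torch", "transformers"]
missing = find_missing(required)

print("Interpreter in use:", sys.executable)
print("Missing packages:", missing if missing else "none")
```

If the interpreter path is not the one from your environment, the environment is probably not active; if a package is listed as missing, reinstalling with `pip install cell2sentence` inside the active environment is a reasonable next step.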

## Next Steps
You're now ready to proceed to the data preprocessing and annotation workflows. Head over to the next notebook to learn how to load a sample dataset, filter it, and convert it into cell sentences.

[Go to Notebook 2 →](./2_Preprocessing_and_Cell2Sentence.ipynb)