# Tutorial Notebook 9: Natural Language Interpretation of Single-Cell Datasets with a C2S Foundation Model

In this tutorial notebook, we demonstrate how to generate natural language interpretations of single-cell datasets using a Cell2Sentence (C2S) model. We use a pretrained C2S foundation model to generate descriptive summaries for sets of cells, treating the task as natural language interpretation of complex biological data.

This notebook will guide you through the following steps:
1.  Load an immune tissue single-cell dataset from Domínguez Conde et al. (preprocessed in tutorial notebook 0).
2.  Load a pretrained C2S model.
3.  Format multi-cell inputs into prompts for natural language interpretation.
4.  Use the C2S model to generate insightful summaries for different sets of cells.

We will begin by importing the necessary libraries. These include standard Python libraries for data manipulation, as well as specific modules from the `cell2sentence` package for handling single-cell data and C2S models.

In [1]:
import os
import json
import random

from datasets import Dataset
import numpy as np
import anndata
import scanpy as sc
import torch
from tqdm import tqdm

In [2]:
import cell2sentence as cs

In [3]:
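SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)  # also seed PyTorch, since text generation later in this notebook uses sampling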

# Load Data

Next, we will load the preprocessed single-cell dataset from tutorial 0. This dataset has been filtered and normalized, making it ready for conversion into cell sentences.

<font color='red'>If you are using your own dataset, please run the preprocessing steps in Tutorial 0 before executing the following code.</font> Make sure the file path in <font color='gold'>DATA_PATH</font> is set to your preprocessed data file.

In [4]:
DATA_PATH = "/home/sr2464/scratch/C2S_API_Testing/Data/dominguez_conde_immune_tissue_two_donors_preprocessed_tutorial_0.h5ad"
adata = anndata.read_h5ad(DATA_PATH)
adata

AnnData object with n_obs × n_vars = 29773 × 23944
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_name', 'ensembl_id', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    uns: 'batch_condition_colors', 'cell_type_colors', 'log1p', 'neighbors', 'pca', 'tissue_colors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

In [5]:
adata.obs.head()

Unnamed: 0,cell_type,tissue,batch_condition,organism,assay,sex,n_genes,n_genes_by_counts,total_counts,total_counts_mt,pct_counts_mt
Pan_T7935490_AAACCTGCAAATTGCC,CD4-positive helper T cell,ileum,A29,Homo sapiens,10x 5' v1,female,2191,2191,6542.0,327.0,4.998472
Pan_T7935490_AAACGGGCATCTGGTA,"CD8-positive, alpha-beta memory T cell",ileum,A29,Homo sapiens,10x 5' v1,female,2046,2046,5871.0,429.0,7.307103
Pan_T7935490_AAACGGGTCTTGCATT,"CD8-positive, alpha-beta memory T cell",ileum,A29,Homo sapiens,10x 5' v1,female,2129,2129,7248.0,337.0,4.649559
Pan_T7935490_AAAGCAATCATCGCTC,"CD8-positive, alpha-beta memory T cell",ileum,A29,Homo sapiens,10x 5' v1,female,1262,1262,3927.0,305.0,7.766743
Pan_T7935490_AAAGTAGCAGTCACTA,gamma-delta T cell,ileum,A29,Homo sapiens,10x 5' v1,female,2248,2248,6574.0,1083.0,16.473988


In [6]:
adata.obs = adata.obs[["cell_type", "tissue", "batch_condition", "organism", "sex"]]
adata

AnnData object with n_obs × n_vars = 29773 × 23944
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'sex'
    var: 'gene_name', 'ensembl_id', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    uns: 'batch_condition_colors', 'cell_type_colors', 'log1p', 'neighbors', 'pca', 'tissue_colors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

# Cell2Sentence Conversion

In this section, we will convert our AnnData object into a format that the C2S model can understand. This involves transforming the gene expression data of each cell into a "cell sentence," which is a string of gene names ordered by their expression level. We will also prepare the vocabulary from our dataset.
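To make the idea of a cell sentence concrete, here is a minimal sketch of the rank-ordering step on a toy expression vector. The helper `to_cell_sentence` and the toy values are hypothetical and only illustrate the concept. The sketch assumes that genes with zero expression are dropped and that ties keep their original order; `cs.CSData.adata_to_arrow` may handle these details differently.

```python
import numpy as np

def to_cell_sentence(expression, gene_names, delimiter=" "):
    """Hypothetical helper: order gene names by descending expression."""
    expression = np.asarray(expression, dtype=float)
    # Stable sort on the negated values gives descending order with ties kept in input order
    order = np.argsort(-expression, kind="stable")
    # Assumption: genes with zero expression are left out of the sentence
    expressed = [gene_names[i] for i in order if expression[i] > 0]
    return delimiter.join(expressed)

# Toy example with four genes; GAPDH has zero counts and is dropped
toy_genes = ["CD3E", "MALAT1", "GAPDH", "IL7R"]
toy_counts = [2.0, 9.0, 0.0, 5.0]
toy_sentence = to_cell_sentence(toy_counts, toy_genes)
print(toy_sentence)  # MALAT1 IL7R CD3E
```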

In [7]:
adata_obs_cols_to_keep = adata.obs.columns.tolist()

# Create Arrow dataset and vocabulary from AnnData
arrow_ds, vocabulary = cs.CSData.adata_to_arrow(
    adata=adata,
    random_state=SEED,
    sentence_delimiter=' ',
    label_col_names=adata_obs_cols_to_keep
)
arrow_ds

100%|██████████| 29773/29773 [00:09<00:00, 2995.86it/s]


Dataset({
    features: ['cell_name', 'cell_sentence', 'cell_type', 'tissue', 'batch_condition', 'organism', 'sex'],
    num_rows: 29773
})

In [8]:
arrow_ds[0]

{'cell_name': 'Pan_T7935490_AAACCTGCAAATTGCC',
 'cell_sentence': 'RPLP1 ACTB EEF1A1 HSP90AA1 TMSB4X B2M FTH1 KLF6 HSPA1B MALAT1 RPS12 HSPA8 RPL13 MT-CO1 ATF3 MT-CO2 RPL41 TPT1 MT-CO3 RPS19 HLA-B RPL10 RPS4X RPL28 MT-CYB DUSP1 RPL30 MT-ND4L RPS15 FOS RPL34 RPS2 RPLP2 MT-ND3 RPS18 RPS8 TRBV7-2 RPL32 RPS3 ANXA1 RPL11 HLA-C RPS27 ACTG1 UBC RPL3 RPL37 RPLP0 MT-ATP6 JUNB RPS28 RPL18 UBB MT-ATP8 RPS14 RPL39 PFN1 GAPDH HSPA1A RPL18A SRGN RPS27A RPL26 RPL19 RPS15A HLA-A DNAJB1 RPS3A CREM RPS13 MT-ND1 RPL21 RPS25 BTG2 RPL35A FAU RPL8 RPL7A RPS24 RPS6 RPS16 RACK1 NFKBIA RGS1 RPL29 CALM1 RPL9 RPL37A MT-ND5 TNFAIP3 RPS23 IL7R RPL36A PTMA NFKBIZ UBA52 EIF1 CRIP1 CORO1A RPL14 HSP90AB1 RPL10A CXCR4 RPL4 EEF1B2 RPL36 RPS9 RPL27 NACA VIM H3-3B RPS7 HSPH1 ATP5F1E HLA-E RPL17 RPSA MYL12A RPL12 CD69 TAGAP RPL35 RPS29 RPL6 SARAF ZFP36L2 MT-ND4 ARHGDIB BTG1 RPS21 EEF1D PNRC1 EEF1G HSPA5 FYB1 CD3E IFITM1 RNASEK EEF2 MT-ND2 FTL S100A4 JUN IFITM2 CYTIP OST4 LAPTM5 RPL36AL PLAAT4 PFDN5 SAMSN1 DNAJA1 EIF4A1 FXYD5

# Group cells into multi-cell samples

Here, we group together cells that share the same batch (donor) and tissue label.

In [9]:
def get_multi_cell_groupings(hf_ds):
    batch_tissue_to_cell_indices_dict = {}
    for sample_idx in range(hf_ds.num_rows):
        # Load sample, get batch sample and tissue type
        sample = hf_ds[sample_idx]
        batch_sample = sample["batch_condition"]
        tissue_type = sample["tissue"]

        # If new batch sample and tissue type combination is found, add to dictionary
        if (batch_sample, tissue_type) not in batch_tissue_to_cell_indices_dict:
            batch_tissue_to_cell_indices_dict[(batch_sample, tissue_type)] = []

        # Add sample index to dictionary
        batch_tissue_to_cell_indices_dict[(batch_sample, tissue_type)].append(sample_idx)
    return batch_tissue_to_cell_indices_dict
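The same grouping can be sketched on plain Python lists with `collections.defaultdict`, which removes the explicit "is this key new?" check. The function name `group_indices_by_batch_and_tissue` and the toy labels are hypothetical. With a Hugging Face dataset, the two label lists could likely be read once via column access (for example `hf_ds["batch_condition"]`) instead of loading every row, but that is an assumption to verify against your `datasets` version.

```python
from collections import defaultdict

def group_indices_by_batch_and_tissue(batches, tissues):
    """Hypothetical helper: map (batch, tissue) -> list of row indices."""
    groups = defaultdict(list)
    for idx, key in enumerate(zip(batches, tissues)):
        groups[key].append(idx)
    # Convert to a plain dict so missing keys raise KeyError later on
    return dict(groups)

# Toy labels for four cells
toy_batches = ["A29", "A29", "A31", "A29"]
toy_tissues = ["ileum", "lung", "spleen", "ileum"]
toy_groups = group_indices_by_batch_and_tissue(toy_batches, toy_tissues)
print(toy_groups)
```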

In [10]:
batch_tissue_to_cell_indices_dict = get_multi_cell_groupings(arrow_ds)

In [11]:
print("(batch sample, tissue type): [list of cell indices with this combination]")
for key in batch_tissue_to_cell_indices_dict.keys():
    print(f"{key}:", batch_tissue_to_cell_indices_dict[key][:10])

(batch sample, tissue type): [list of cell indices with this combination]
('A29', 'ileum'): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
('A29', 'lung'): [346, 347, 348, 349, 350, 351, 352, 353, 354, 355]
('A29', 'thoracic lymph node'): [3547, 3548, 3549, 3550, 3551, 3552, 3553, 3554, 3555, 3556]
('A29', 'mesenteric lymph node'): [8117, 8118, 8119, 8120, 8121, 8122, 8123, 8124, 8125, 8126]
('A29', 'bone marrow'): [10955, 10956, 10957, 10958, 10959, 10960, 10961, 10962, 10963, 10964]
('A29', 'skeletal muscle tissue'): [12365, 12366, 12367, 12368, 12369, 12370, 12371, 12372, 12373, 12374]
('A29', 'liver'): [12381, 12382, 12383, 12384, 12385, 12386, 12387, 12388, 12389, 12390]
('A29', 'spleen'): [13041, 13042, 13043, 13044, 13045, 13046, 13047, 13048, 13049, 13050]
('A31', 'spleen'): [16664, 16665, 16666, 16667, 16668, 16669, 16670, 16671, 16672, 16673]
('A31', 'ileum'): [18835, 18836, 18837, 18838, 18839, 18840, 18841, 18842, 18843, 18844]
('A31', 'omentum'): [18868, 18869, 18870, 18871, 18872, 18873,

In [12]:
def get_multi_cell_arrow_ds(hf_ds, batch_tissue_to_cell_indices_dict):
    """
    Function to take in a Huggingface cell sentence arrow dataset and a dictionary 
    mapping a tuple of (batch sample, tissue type) to a list of cell indices, and 
    return a new arrow dataset containing the multi-cell groupings.

    Arguments:
        hf_ds: Huggingface arrow dataset containing cell sentences to group into multi-cells.
        batch_tissue_to_cell_indices_dict: Dictionary mapping a tuple of (batch sample, tissue type) to a list of cell indices.
    
    Returns:
        multi_cell_groupings_ds: Huggingface arrow dataset containing multi-cell groupings.
    """
    # Initialize list to store multi-cell groupings, each element is a list of 5 cell indices
    multi_cell_groupings_list = []
    batch_sample_labels = []
    tissue_type_labels = []
    organism_labels = []

    # Loop through each sample in original dataset - keeps number of overall samples (roughly) the same
    for sample_idx in tqdm(range(hf_ds.num_rows)):
        sample = hf_ds[sample_idx]
        batch_sample = sample["batch_condition"]
        tissue_type = sample["tissue"]

        # Retrieve list of cell indices for this batch sample and tissue type
        sample_key = (batch_sample, tissue_type)
        all_cell_indices = batch_tissue_to_cell_indices_dict[sample_key]
        if len(all_cell_indices) < 5:
            # If there are fewer than 5 cells for this batch sample and tissue type, we will sample with replacement.
            #  Note that another option here would be to just skip this sample.
            sampled_cell_indices = random.choices(all_cell_indices, k=5)
        else:
            # If there are 5 or more cells for this batch sample and tissue type, we will sample without replacement
            sampled_cell_indices = random.sample(all_cell_indices, k=5)

        # Add list of sampled cell indices to multi-cell groupings list
        multi_cell_groupings_list.append(sampled_cell_indices)
        batch_sample_labels.append(sample["batch_condition"])
        tissue_type_labels.append(sample["tissue"])
        organism_labels.append(sample["organism"])  # all cells should be of the same organism

    multi_cell_groupings_ds = Dataset.from_dict({
        "multi_cell_groupings": multi_cell_groupings_list,
        "batch_label": batch_sample_labels,
        "tissue": tissue_type_labels,
        "organism": organism_labels,
    })
    return multi_cell_groupings_ds
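The sampling rule described in the comments above (with replacement when a group has fewer than five cells, without replacement otherwise) can be illustrated in isolation. This is a small sketch using only the standard library; `sample_group` and the toy index lists are hypothetical names introduced for illustration.

```python
import random

def sample_group(cell_indices, group_size=5, rng=random):
    """Hypothetical helper: draw `group_size` cell indices from one group."""
    if len(cell_indices) < group_size:
        # Too few cells: duplicates are unavoidable, so sample with replacement
        return rng.choices(cell_indices, k=group_size)
    # Enough cells: every sampled index is distinct
    return rng.sample(cell_indices, k=group_size)

rng = random.Random(0)  # local generator so the global seed is left untouched
small_group = sample_group([10, 11, 12], rng=rng)
large_group = sample_group(list(range(100)), rng=rng)
print(small_group, large_group)
```

Passing a dedicated `random.Random` instance is a design choice that keeps the sketch from changing the notebook's global random state.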

In [13]:
multi_cell_arrow_ds = get_multi_cell_arrow_ds(arrow_ds, batch_tissue_to_cell_indices_dict)
multi_cell_arrow_ds

100%|██████████| 29773/29773 [00:02<00:00, 12727.38it/s]


Dataset({
    features: ['multi_cell_groupings', 'batch_label', 'tissue', 'organism'],
    num_rows: 29773
})

In [14]:
multi_cell_arrow_ds[0]

{'multi_cell_groupings': [225, 59, 3, 46, 298],
 'batch_label': 'A29',
 'tissue': 'ileum',
 'organism': 'Homo sapiens'}

In [15]:
for cell_idx in multi_cell_arrow_ds[0]["multi_cell_groupings"]:
    print(f"Cell {cell_idx}:", arrow_ds[cell_idx]["batch_condition"], arrow_ds[cell_idx]["tissue"])

Cell 225: A29 ileum
Cell 59: A29 ileum
Cell 3: A29 ileum
Cell 46: A29 ileum
Cell 298: A29 ileum
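The visual check above can also be written as an assertion-style test. The sketch below works on a plain list of dictionaries standing in for rows of the cell sentence dataset; `group_is_homogeneous` and the toy records are hypothetical.

```python
def group_is_homogeneous(records, group_indices, keys=("batch_condition", "tissue")):
    """Hypothetical helper: True if all cells in the group share the same labels."""
    labels = {tuple(records[i][k] for k in keys) for i in group_indices}
    return len(labels) == 1

# Toy records standing in for dataset rows
toy_records = [
    {"batch_condition": "A29", "tissue": "ileum"},
    {"batch_condition": "A29", "tissue": "ileum"},
    {"batch_condition": "A29", "tissue": "lung"},
]
same_group = group_is_homogeneous(toy_records, [0, 1])
mixed_group = group_is_homogeneous(toy_records, [0, 2])
print(same_group, mixed_group)  # True False
```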


# Prompt Formatting for Natural Language Interpretation

We will format prompts that provide the model with multiple cell sentences and ask it to generate a summary of the biological information contained within them. This is a powerful feature of C2S, allowing researchers to gain high-level insights from their data in a human-readable format.

In [16]:
from cell2sentence.prompt_formatter import C2SMultiCellPromptFormatter

In [17]:
task_name = "natural_language_interpretation"
prompt_formatter = C2SMultiCellPromptFormatter(
    task=task_name,
    top_k_genes=200
)
prompt_formatter

<cell2sentence.prompt_formatter.C2SMultiCellPromptFormatter at 0x1550aaab0610>

In [18]:
# We would like to generate natural language summaries at inference for groups of cells.
# For the input sample, we will put an empty string as a placeholder for the abstract, so that
#  prompt formatting code is not affected.
multi_cell_arrow_ds = multi_cell_arrow_ds.add_column("abstract", [""] * multi_cell_arrow_ds.num_rows)
multi_cell_arrow_ds[0]

{'multi_cell_groupings': [225, 59, 3, 46, 298],
 'batch_label': 'A29',
 'tissue': 'ileum',
 'organism': 'Homo sapiens',
 'abstract': ''}

In [19]:
formatted_ds = prompt_formatter.format_hf_ds(
    hf_ds=arrow_ds,
    multi_cell_indices_ds=multi_cell_arrow_ds
)
formatted_ds

Dataset({
    features: ['sample_type', 'model_input', 'response'],
    num_rows: 29773
})

In [20]:
formatted_ds[0]

{'sample_type': 'natural_language_interpretation',
 'model_input': 'Consider these 200 highly expressed genes, ordered by descending expression, from 5 Homo sapiens cells. Using this, your task is to generate an abstract summary which summarizes the biological insights contained in these cells.\nCell 1:\nTMSB4X MT-CO1 CCL4 B2M MT-CO2 MALAT1 ACTB RPLP1 MT-CYB MT-CO3 HSP90AA1 MTRNR2L12 MT-ND3 EEF1A1 CCL5 MT-ATP8 HSPA1B RPS12 MT-ATP6 RPL41 RPL10 RPL28 HLA-B MT-ND4L RPS19 RPS27 RPL13 HLA-C TPT1 SH3BGRL3 JUN MT-ND2 MT-ND1 RPL37 RPS18 RPL19 RPS4X RPL30 RPS2 RPS15A IL32 HSPA8 ACTG1 CD69 XCL1 RPL26 DNAJB1 RPS15 KLF6 RPS28 RPL39 TMSB10 S100A4 RPL34 RPS14 RPL29 FTH1 GAPDH RPLP2 BTG1 MT-ND4 ATF3 RPL18A RPS3 FAU RPS27A RPS3A RPL32 RPL12 RPL21 NKG7 RPL37A RPS21 RPS24 PFN1 RPS13 RPL3 RPL11 IFITM1 CD8A RPL9 RPSA RPS23 CD3D MT-ND5 TNFAIP3 RPL35A RPS6 RPL18 RPLP0 RPL27 ZFP36 IFNG PTMA RPL8 RPS25 TRBV4-1 HLA-A CD52 PHLDA1 RPL17 VIM RPL14 RPS8 UBB H3-3B TSC22D3 SRGN RPL23A RPL35 RPS7 HSPA1A RPS16 RGS2 RP

In [21]:
print(formatted_ds[0]['model_input'])

Consider these 200 highly expressed genes, ordered by descending expression, from 5 Homo sapiens cells. Using this, your task is to generate an abstract summary which summarizes the biological insights contained in these cells.
Cell 1:
TMSB4X MT-CO1 CCL4 B2M MT-CO2 MALAT1 ACTB RPLP1 MT-CYB MT-CO3 HSP90AA1 MTRNR2L12 MT-ND3 EEF1A1 CCL5 MT-ATP8 HSPA1B RPS12 MT-ATP6 RPL41 RPL10 RPL28 HLA-B MT-ND4L RPS19 RPS27 RPL13 HLA-C TPT1 SH3BGRL3 JUN MT-ND2 MT-ND1 RPL37 RPS18 RPL19 RPS4X RPL30 RPS2 RPS15A IL32 HSPA8 ACTG1 CD69 XCL1 RPL26 DNAJB1 RPS15 KLF6 RPS28 RPL39 TMSB10 S100A4 RPL34 RPS14 RPL29 FTH1 GAPDH RPLP2 BTG1 MT-ND4 ATF3 RPL18A RPS3 FAU RPS27A RPS3A RPL32 RPL12 RPL21 NKG7 RPL37A RPS21 RPS24 PFN1 RPS13 RPL3 RPL11 IFITM1 CD8A RPL9 RPSA RPS23 CD3D MT-ND5 TNFAIP3 RPL35A RPS6 RPL18 RPLP0 RPL27 ZFP36 IFNG PTMA RPL8 RPS25 TRBV4-1 HLA-A CD52 PHLDA1 RPL17 VIM RPL14 RPS8 UBB H3-3B TSC22D3 SRGN RPL23A RPL35 RPS7 HSPA1A RPS16 RGS2 RPL36 CORO1A RPL13A GZMA RACK1 RPL15 RPL4 NR4A3 RAC2 RPS9 RPL7A ARHGDIB 
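Judging from the printed prompt above, a multi-cell prompt consists of an instruction line followed by a `Cell i:` header and the truncated cell sentence for each cell. The sketch below rebuilds that layout for toy inputs. `build_multi_cell_prompt` is a hypothetical function, not part of the `cell2sentence` API, and it uses a single fixed instruction template, whereas the library's formatter may choose among several templates.

```python
def build_multi_cell_prompt(cell_sentences, organism="Homo sapiens", top_k_genes=200):
    """Hypothetical helper mirroring the prompt layout printed above."""
    header = (
        f"Consider these {top_k_genes} highly expressed genes, ordered by descending "
        f"expression, from {len(cell_sentences)} {organism} cells. Using this, your task "
        "is to generate an abstract summary which summarizes the biological insights "
        "contained in these cells."
    )
    lines = [header]
    for cell_number, sentence in enumerate(cell_sentences, start=1):
        # Keep only the first top_k_genes gene names of each cell sentence
        top_genes = sentence.split(" ")[:top_k_genes]
        lines.append(f"Cell {cell_number}:")
        lines.append(" ".join(top_genes))
    return "\n".join(lines)

toy_prompt = build_multi_cell_prompt(
    ["MALAT1 B2M CD3E IL7R", "CD74 HLA-DRA MS4A1"],
    top_k_genes=3,
)
print(toy_prompt)
```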

# Load C2S Model

Now, we will load a pretrained C2S model that has been trained on a large corpus of single-cell data and biological text. This model has learned to understand the "language" of cells and can be used for various interpretation tasks. We will use the `CSModel` class from the `cell2sentence` library to load the model.

In [22]:
# Define the path to your pretrained model and a directory to save model-related files
model_path = "vandijklab/C2S-Scale-Pythia-1b-pt"
save_dir = "/home/sr2464/scratch/C2S_API_Testing/Cache_Dir"
save_name = "natural_language_interpretation_1B_model"

# Initialize the CSModel object
csmodel = cs.CSModel(
    model_name_or_path=model_path,
    save_dir=save_dir,
    save_name=save_name
)

Using device: cuda


In [23]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [24]:
from transformers import AutoModelForCausalLM

In [25]:
model = AutoModelForCausalLM.from_pretrained(
    os.path.join(save_dir, save_name),
    cache_dir=os.path.join(save_dir, ".cache"),
    trust_remote_code=True
).to(device)
model

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 2048)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-15): 16 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=2048, out_features=6144, bias=True)
          (dense): Linear(in_features=2048, out_features=2048, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=2048, out_features=8192, bias=True)
          (dense_4h_to_h): Linear(in_features=8192, out_features=2048, bias=True

# Generate Natural Language Interpretations of Cell Subsets

With our model loaded, we can now ask it to interpret different subsets of cells from our dataset.

Here, we select one multi-cell sample from our dataset to interpret.

In [26]:
multi_cell_arrow_ds[10000]

{'multi_cell_groupings': [9744, 10316, 10819, 9040, 8557],
 'batch_label': 'A29',
 'tissue': 'mesenteric lymph node',
 'organism': 'Homo sapiens',
 'abstract': ''}

Note that the Domínguez Conde et al. dataset contains immune cells taken from many different tissues around the body. This sample is a group of 5 immune cells from the mesenteric lymph node of one of the donors (A29).

In [27]:
# Retrieve the formatted prompt for the selected group of cells
input_prompt = formatted_ds[10000]['model_input']

print("--- Input Prompt ---")
print(input_prompt)

--- Input Prompt ---
Examine the top 200 genes ordered by descending expression level in 5 Homo sapiens cells. Based on this information, generate an abstract summary which summarizes the biological insights contained in these cells.
Cell 1:
CD74 EEF1A1 MT-ND3 HLA-DRA RPS27 IGHV5-51 RPLP1 MT-CO2 RPL41 FOS RPL13 RPS19 MT-CO1 RPL32 RPL10 RPS12 RPS8 RPL30 ACTB MT-CYB HLA-DRB1 CXCR4 B2M MALAT1 RPS15A RPL8 RPL28 MT-ATP8 RPL35A RPS3A RPL18A MT-ND4L DUSP1 RPS4X TMSB4X RPL39 MT-CO3 TPT1 RPS27A RPS15 RPS28 RPS3 RPS21 RPS13 RACK1 RPL37 RPS18 RPL35 MT-ATP6 RPL26 RPS25 RPL29 CD69 RPL21 RPL19 RPS2 FAU RPL23A RPL34 RPL15 RPS5 BTG1 MT-ND4 CD52 CD37 RPL36 RPS14 RPS9 MTRNR2L12 CD79A RPLP0 JUN HLA-B RPS6 HLA-DPA1 EIF1 RPS23 RPSA RPS7 RPL27 RPL11 TCL1A HLA-DRB5 H3-3B SLC2A3 TXNIP RPS24 RPS29 RPL37A IGKC RPLP2 RPL7A CD83 CYBA PABPC1 RPL18 HLA-DQB1 MT-ND5 RPL9 RPL38 RPL10A RPL5 MT-ND1 HLA-A FTL RPL6 HLA-E EEF2 PTMA TSC22D3 FTH1 TAGAP UBC MS4A1 TMSB10 RPL36A PCBP1 LY9 ASH1L SARAF UBA52 ZFP36L1 NACA RABAC1 E

Finally, we will use our loaded C2S model to generate a natural language summary for the selected group of cells.

In [28]:
# Generate interpretation for the mesenteric lymph node cells
natural_language_interpretation = csmodel.generate_from_prompt(
    model=model,
    prompt=input_prompt,
    do_sample=True,
    max_num_tokens=1024,
    temperature=0.7,
    top_k=30,
    top_p=0.9
)

In [30]:
print("--- Interpretation for cells ---")
print(natural_language_interpretation)

--- Interpretation for cells ---
The study used single-cell sequencing to analyze immune cells in 16 human tissues from 12 donors, generating a dataset of ~360,000 cells. They developed CellTypist, a machine learning tool for precise cell type annotation, and identified tissue-specific features and clonal architecture of T and B cells. This approach provides a foundation for identifying highly resolved immune cell types using a common reference dataset and antigen receptor sequencing..


The summary produced by the model correctly recovers some details of the study this sample comes from, because this dataset (immune cells from Domínguez Conde et al.) was part of the C2S pretraining data. Note that the output describes the source study as a whole rather than the five specific lymph node cells in the prompt. Sampling with a different temperature, or running the model on other datasets, can give different interpretations.
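To give some intuition for how `temperature`, `top_k`, and `top_p` shape sampled text, the sketch below applies the three steps to a toy vector of next-token logits using NumPy. It illustrates the general technique under simplifying assumptions and is not the code path used by `generate_from_prompt` or by Hugging Face Transformers; `filtered_probs` and the toy logits are hypothetical.

```python
import numpy as np

def filtered_probs(logits, temperature=0.7, top_k=30, top_p=0.9):
    """Hypothetical helper: token probabilities after temperature, top-k and top-p."""
    scaled = np.asarray(logits, dtype=float) / temperature
    # Top-k: mask everything below the k-th largest logit
    if top_k < scaled.size:
        kth_largest = np.sort(scaled)[-top_k]
        scaled = np.where(scaled < kth_largest, -np.inf, scaled)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative probability reaches top_p
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    keep_count = int(np.searchsorted(cumulative, top_p)) + 1
    mask = np.zeros(probs.shape, dtype=bool)
    mask[order[:keep_count]] = True
    probs = np.where(mask, probs, 0.0)
    return probs / probs.sum()

toy_logits = [2.0, 1.0, 0.0, -1.0]
sharp = filtered_probs(toy_logits, temperature=0.5, top_k=4, top_p=1.0)
flat = filtered_probs(toy_logits, temperature=2.0, top_k=4, top_p=1.0)
truncated = filtered_probs(toy_logits, temperature=0.7, top_k=2, top_p=0.9)
print(sharp.round(3), flat.round(3), truncated.round(3))
```

A lower temperature concentrates probability on the most likely token, while `top_k` and `top_p` remove unlikely tokens before sampling, which is why changing these settings can change the generated summary.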

# Conclusion

In this tutorial, we have seen how a pretrained Cell2Sentence model can be used to generate natural language interpretations of single-cell datasets. By providing the model with a collection of cell sentences, it can produce a high-level summary of the biological information contained within those cells. This capability is a powerful tool for researchers to quickly gain insights into their data and understand the biological context of cell clusters in an accessible way.

The ability to interpret complex single-cell data in natural language opens up new avenues for data exploration and hypothesis generation. As C2S models become more sophisticated, we can expect them to provide even more detailed and nuanced interpretations, further bridging the gap between computational analysis and biological understanding.