# Label Transfer Enhancement with scGALA

This tutorial demonstrates how to improve Seurat's label transfer capabilities by replacing its default cell alignment method with scGALA. We'll walk through:

1. Running a standard Seurat label transfer workflow
2. Identifying and extracting the cell alignments (anchors)
3. Enhancing these alignments using scGALA
4. Comparing the performance of the original and enhanced approaches

## Background

Label transfer is a common task in single-cell analysis, where cell type annotations from a reference dataset are transferred to a new (query) dataset. Seurat accomplishes this through a process that identifies "anchors" between datasets. These anchors are pairs of cells that likely represent the same biological state.

While Seurat's default anchor-finding approach works well, scGALA can improve these alignments by better capturing the underlying biological relationships between cells from different datasets.

## Dataset Description

In this tutorial, we're working with:

- **Reference dataset**: A well-annotated single-cell RNA-seq dataset with established cell type labels
- **Query dataset**: A new dataset where we want to transfer the reference cell type annotations

Both datasets are loaded as Seurat objects from RDS files and can be accessed from [Figshare](https://figshare.com/articles/dataset/Label_Transfer_Example_Data/28728617). Our goal is to accurately assign cell types from the reference to the query cells.

## External Reference

Because Seurat's workflow is split into explicit, separate steps, we recommend the External Reference strategy: export the cell alignments, enhance them outside Seurat, and load them back.

First, we need to identify the cell alignment step (keywords to look for: MNN, Alignment, CCA, Anchor, Correspondence). In Seurat, this is the `FindTransferAnchors` step. We then export its original cell alignments and enhance them using scGALA.
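
To make the "MNN" keyword concrete, here is a minimal sketch of mutual nearest neighbors on toy 2-D points. It only illustrates the general idea behind alignment steps of this kind. Seurat's anchor finding adds further filtering and scoring, and the function names and coordinates below are hypothetical.

```python
from math import dist


def nearest(points_from, points_to, k):
    """For each point in points_from, return the indices of its k nearest points in points_to."""
    result = []
    for p in points_from:
        order = sorted(range(len(points_to)), key=lambda j: dist(p, points_to[j]))
        result.append(set(order[:k]))
    return result


def mutual_nearest_neighbors(ref, query, k=1):
    """Return (ref_index, query_index) pairs that are in each other's k-nearest sets."""
    ref_to_query = nearest(ref, query, k)
    query_to_ref = nearest(query, ref, k)
    return [
        (i, j)
        for i in range(len(ref))
        for j in sorted(ref_to_query[i])
        if i in query_to_ref[j]
    ]


# Toy embeddings (made up): three reference cells and four query cells
ref = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
query = [(0.2, 0.1), (5.1, 4.8), (5.3, 5.2), (20.0, 20.0)]

pairs = mutual_nearest_neighbors(ref, query, k=1)
print(pairs)
```

The third reference cell picks a query cell as its nearest neighbor, but that query cell prefers a different reference cell, so no pair is formed for it. This mutual requirement is what keeps one-sided matches out of the alignment.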

### Understanding the Preprocessing Steps

Before running label transfer, we need to prepare the reference dataset:

1. **NormalizeData**: Log-normalizes the expression data
2. **FindVariableFeatures**: Identifies genes with high variability across cells
3. **ScaleData**: Scales and centers the expression values
4. **RunPCA**: Performs dimensionality reduction via principal component analysis
5. **FindNeighbors**: Constructs a K-nearest neighbor (KNN) graph

These steps are essential for creating a clean, well-structured reference that will serve as the basis for our label transfer.

In [None]:
# Load necessary libraries for analysis
library(Seurat)
library(ggplot2)
library(pheatmap)
library(caret)

In [None]:
# Load reference and query datasets from RDS files
data.ref <- readRDS("./ref_adata.rds")
data.query <- readRDS("./query_adata.rds")

data.ref

An object of class Seurat 
18064 features across 3206 samples within 1 assay 
Active assay: RNA (18064 features, 0 variable features)
 2 layers present: counts, data
 1 dimensional reduction calculated: umap

In [3]:
# Copy the known cell types into a common `labels` column in both objects
data.ref$labels <- data.ref$cell_type
data.query$labels <- data.query$cell_type
# Verify the changes
head(data.ref@meta.data[, c('cell_type', "labels")])


Unnamed: 0_level_0,cell_type,labels
Unnamed: 0_level_1,<fct>,<fct>
B4_GTGGAAGCAATATGGTATGTTGAC-1,fibroblast of lung,fibroblast of lung
B4_GTGGAAGCAGATAACTATGTTGAC-1,malignant cell,malignant cell
B1_AAACAAGCATTTGGAGACTTTAGG-1,cytotoxic T cell,cytotoxic T cell
B4_GTGGACCAGGCTGTGAATGTTGAC-1,smooth muscle cell,smooth muscle cell
B4_GTTAAGGGTGGTTATCATGTTGAC-1,malignant cell,malignant cell
B4_GTTAATGAGTCATGAAATGTTGAC-1,natural T-regulatory cell,natural T-regulatory cell


In [4]:
# pre-process dataset (without integration)
data.ref <- NormalizeData(data.ref)
data.ref <- FindVariableFeatures(data.ref)
data.ref <- ScaleData(data.ref)
data.ref <- RunPCA(data.ref)
data.ref <- FindNeighbors(data.ref, dims = 1:30)
# data.ref <- FindClusters(data.ref)

Centering and scaling data matrix

PC_ 1 
Positive:  ENSG00000121270, ENSG00000164434, ENSG00000179593, ENSG00000116133, ENSG00000084234, ENSG00000100867, ENSG00000127324, ENSG00000167642, ENSG00000178372, ENSG00000236699 
	   ENSG00000155368, ENSG00000142949, ENSG00000008394, ENSG00000176046, ENSG00000135226, ENSG00000170421, ENSG00000127249, ENSG00000204580, ENSG00000039068, ENSG00000156284 
	   ENSG00000119888, ENSG00000120833, ENSG00000124107, ENSG00000164120, ENSG00000110921, ENSG00000182054, ENSG00000140263, ENSG00000102287, ENSG00000091138, ENSG00000096060 
Negative:  ENSG00000142156, ENSG00000011465, ENSG00000204262, ENSG00000038427, ENSG00000166147, ENSG00000113140, ENSG00000130635, ENSG00000060718, ENSG00000140937, ENSG00000133110 
	   ENSG00000164932, ENSG00000182492, ENSG00000182326, ENSG00000112769, ENSG00000159674, ENSG00000163430, ENSG00000139329, ENSG00000100234, ENSG00000035862, ENSG00000091986 
	   ENSG00000107796, ENSG00000113721, ENSG00000122641, ENSG00000137809, EN

### Seurat's Original Label Transfer Workflow

Seurat's label transfer involves three main steps:

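1. **FindTransferAnchors**: Identifies "anchor" pairs between the reference and query datasets. With the call used below, the query is projected onto the reference PCA (`reference.reduction = "pca"`) rather than running canonical correlation analysis (CCA)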
2. **TransferData**: Uses these anchors to transfer labels from reference to query
3. **AddMetaData**: Adds the transferred labels to the query object's metadata

In this section, we'll perform these steps with Seurat's default parameters to establish a baseline for comparison.
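
As a rough intuition for how anchors carry labels across datasets, the sketch below assigns each query cell the label with the largest summed anchor score. This is a deliberately simplified stand-in and not Seurat's `TransferData` algorithm, which also weights anchors for query cells that are not part of any anchor pair. The function name and toy values are hypothetical.

```python
from collections import defaultdict


def vote_labels(anchors, ref_labels):
    """Score-weighted vote over anchors.

    anchors: iterable of (ref_index, query_index, score) with 0-based indices.
    Returns {query_index: label} for query cells that appear in at least one anchor.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for ref_idx, query_idx, score in anchors:
        votes[query_idx][ref_labels[ref_idx]] += score
    return {q: max(tally, key=tally.get) for q, tally in votes.items()}


# Toy reference labels and anchors (made up for illustration)
ref_labels = ["fibroblast of lung", "malignant cell", "cytotoxic T cell", "malignant cell"]
anchors = [
    (0, 0, 0.9), (1, 0, 0.3),               # query cell 0
    (1, 1, 0.5), (2, 1, 0.6), (3, 1, 0.2),  # query cell 1
    (2, 2, 0.8),                            # query cell 2
]

predicted = vote_labels(anchors, ref_labels)
print(predicted)
```

For query cell 1, the two "malignant cell" anchors together outweigh the single higher-scoring "cytotoxic T cell" anchor, which shows why both the set of anchors and their scores matter for the final labels.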

In [6]:
# Normalize the query, find transfer anchors against the reference, and transfer the labels
data.query <- NormalizeData(data.query)
data.anchors <- FindTransferAnchors(reference = data.ref, query = data.query, dims = 1:30,
    reference.reduction = "pca")
predictions <- TransferData(anchorset = data.anchors, refdata = data.ref$labels, dims = 1:30)
data.query <- AddMetaData(data.query, metadata = predictions)

Projecting cell embeddings

Finding neighborhoods

Finding anchors

	Found 1049 anchors

Finding integration vectors

Finding integration vector weights

Predicting cell labels



In [None]:
# Calculate accuracy by comparing predicted labels with known labels
data.query$prediction.match <- data.query$predicted.id == data.query$labels
table(data.query$prediction.match)


FALSE  TRUE 
 2103  5380 

In [8]:
# Save the data.query$predicted.id
write.csv(data.query$predicted.id, file = "predicted_id_ori.csv", row.names = TRUE)

## Evaluate the original label transfer

### Evaluating Label Transfer Performance

We'll evaluate the accuracy of our label transfer by comparing the predicted cell types with the actual cell types in our query dataset. The key metrics we'll examine are:

1. **Overall accuracy**: Percentage of cells correctly labeled
2. **Kappa statistic**: Measures agreement while accounting for chance
3. **Per-class accuracy**: How well each cell type is identified

These metrics will allow us to compare the performance of Seurat's default approach with our scGALA-enhanced method.
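
For readers who want to see what the first two metrics measure, here is a small sketch that computes overall accuracy and Cohen's Kappa from two label vectors. The tutorial itself relies on `caret::confusionMatrix`, and the labels below are toy values rather than tutorial data.

```python
from collections import Counter


def accuracy_and_kappa(truth, predicted):
    """Return (overall accuracy, Cohen's kappa) for two equal-length label lists."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    # Agreement expected by chance, from the marginal label frequencies
    expected = sum(truth_counts[c] * pred_counts[c] for c in truth_counts) / (n * n)
    # Undefined when expected == 1 (both vectors contain a single, identical label)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Toy labels (made up)
truth = ["T", "T", "B", "B", "B", "NK"]
pred = ["T", "B", "B", "B", "B", "T"]

acc, kappa = accuracy_and_kappa(truth, pred)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

Kappa falls below accuracy here because a classifier that mostly predicts the most common label would already agree with the truth fairly often by chance.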

In [9]:
library(caret)

# Build the confusion matrix; mode = 'everything' also reports precision, recall, and F1 per class
conf_matrix <- confusionMatrix(data = factor(data.query$predicted.id), reference = factor(data.query$labels), mode = 'everything')

“longer object length is not a multiple of shorter object length”
“Levels are not in the same order for reference and data. Refactoring data to match.”


In [10]:
print(conf_matrix$overall)

      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
     0.7189630      0.6120946      0.7086276      0.7291276      0.4327142 
AccuracyPValue  McnemarPValue 
     0.0000000            NaN 


In [11]:
# Put predictions and true labels on one shared set of factor levels.
# (Assigning levels() directly would relabel values by position rather than by name.)
shared_levels <- union(levels(factor(data.query$labels)), unique(data.query$predicted.id))
predicted <- factor(data.query$predicted.id, levels = shared_levels)
truth <- factor(data.query$labels, levels = shared_levels)

# Create a confusion matrix (rows = predictions, columns = reference labels)
conf_matrix <- confusionMatrix(data = predicted, reference = truth, mode = 'everything')

# Extract the confusion matrix table
conf_matrix_table <- conf_matrix$table

# Row-normalize the counts so that each predicted class sums to 1
accuracy_matrix <- prop.table(conf_matrix_table, margin = 1)
write.csv(accuracy_matrix, file = "accuracy_matrix_ori.csv", row.names = TRUE)


## Export the original cell alignments

### Understanding Seurat Anchors

Anchors are the foundation of Seurat's label transfer approach. Each anchor consists of:

1. **cell1**: Index of a cell in the reference dataset
2. **cell2**: Index of a cell in the query dataset
3. **score**: Confidence score for this anchor pair (higher is better)

We'll export these anchors to a CSV file for processing by scGALA. Note that the indices in Seurat start at 1, which is important when interfacing with Python-based tools that typically use 0-indexed arrays.
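
On the Python side, the index offset can be made explicit as in the sketch below. It assumes the three-column layout shown in this section, and `load_anchors_zero_based` is a hypothetical helper, not part of scGALA, which may handle the conversion internally.

```python
import csv
import io

# Same layout as the CSV produced by write.csv(..., row.names = FALSE)
csv_text = '''"cell1","cell2","score"
16,4283,0.3089832
16,3810,0.5064166
36,2924,0.851925
'''


def load_anchors_zero_based(handle):
    """Read (cell1, cell2, score) rows and shift the 1-based R indices to 0-based."""
    reader = csv.DictReader(handle)
    return [
        (int(row["cell1"]) - 1, int(row["cell2"]) - 1, float(row["score"]))
        for row in reader
    ]


anchors = load_anchors_zero_based(io.StringIO(csv_text))
print(anchors[0])
```

With a real export, the `io.StringIO` object would be replaced by an opened file handle for the anchors CSV.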

In [12]:
head(data.anchors@anchors)

cell1,cell2,score
16,4283,0.3089832
16,3810,0.5064166
16,4522,0.5064166
36,2924,0.851925
36,1200,0.851925
36,1581,0.8272458


In [None]:
write.csv(data.anchors@anchors, file = "anchors_ori.csv", row.names = FALSE)

## Running scGALA Enhancement

In a separate Python environment, we'll enhance the anchors using scGALA. The process involves:

1. Loading the original anchors from the CSV file
2. Converting the reference and query Seurat objects to AnnData objects
3. Applying scGALA's alignment enhancement algorithm
4. Saving the enhanced anchors back to a CSV file

scGALA provides two main functions for this purpose:

- `mod_seurat_anchors()`: A complete pipeline for processing Seurat anchors
- `compute_anchor_score()`: For custom anchor computation

For this tutorial, we'll assume the Python processing has already been completed, and we'll load the enhanced anchors from the saved CSV file.

**Note**: When running this tutorial yourself, you would need to execute the Python code below in a separate script:

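```python
# Import path assumed from the package name; check your scGALA installation
from scGALA import mod_seurat_anchors

# Paths are placeholders: point them at the exported anchors CSV and at the
# reference/query datasets converted to AnnData (.h5ad)
mod_seurat_anchors(anchors_ori="temp/anchors.csv", adata1='temp/adata1.h5ad', adata2='temp/adata2.h5ad')
```
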
### Integrating Enhanced Anchors back into Seurat

Now that we have enhanced anchors from scGALA, we need to:

1. Load the enhanced anchors from the CSV file
2. Adjust indices if needed (Python uses 0-based indexing while R uses 1-based)
3. Replace the original anchors in the Seurat object
4. Re-run the TransferData function with the enhanced anchors

This approach allows us to leverage Seurat's existing transfer infrastructure while benefiting from scGALA's improved cell alignment.
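
Before handing anchors back to R, a bounds check can catch off-by-one mistakes early. The sketch below assumes 0-based anchors on the Python side and uses the cell counts seen in this tutorial (3206 reference cells, 7483 query cells). `to_one_based` is a hypothetical helper, not part of scGALA.

```python
def to_one_based(anchors, n_ref, n_query):
    """Validate 0-based (ref, query, score) anchors and shift them to 1-based indices."""
    converted = []
    for ref_idx, query_idx, score in anchors:
        if not (0 <= ref_idx < n_ref and 0 <= query_idx < n_query):
            raise ValueError(f"anchor ({ref_idx}, {query_idx}) is out of range")
        converted.append((ref_idx + 1, query_idx + 1, score))
    return converted


# A few 0-based anchors for illustration
enhanced = [(0, 4284, 0.974359), (0, 4308, 0.1538462), (1, 884, 0.7435897)]

one_based = to_one_based(enhanced, n_ref=3206, n_query=7483)
print(one_based[0])
```

An index equal to the number of cells is valid in R but out of range in Python, so a failure here usually means the offset was applied twice or not at all.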

In [None]:
# Load the scGALA-enhanced anchors
anchors_new <- read.csv("anchors_mod0903_noprior.csv")
# Shift indices from 0-based (Python) to 1-based (R)
anchors_new$cell1 <- anchors_new$cell1 + 1
anchors_new$cell2 <- anchors_new$cell2 + 1

head(anchors_new)

Unnamed: 0_level_0,cell1,cell2,score
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
1,1,4285,0.974359
2,1,4309,0.1538462
3,1,4532,0.8717949
4,2,885,0.7435897
5,2,1581,0.2564103
6,2,2771,0.8461538


In [26]:
# Replace the original anchors with the enhanced ones (indices must be 1-based)
data.anchors@anchors <- anchors_new

predictions <- TransferData(anchorset = data.anchors, refdata = data.ref$labels, dims = 1:30)
data.query <- AddMetaData(data.query, metadata = predictions)

data.query$prediction.match <- data.query$predicted.id == data.query$labels
table(data.query$prediction.match)

Finding integration vectors

Finding integration vector weights

Predicting cell labels




FALSE  TRUE 
 1311  6172 

In [27]:
write.csv(data.query$predicted.id, file = "predicted_id_mod_noprior.csv", row.names = TRUE)
write.csv(data.query$labels, file = "real_labels.csv", row.names = TRUE)


## Evaluate the enhanced label transfer

In [28]:
conf_matrix <- confusionMatrix(data = factor(data.query$predicted.id), reference = factor(data.query$labels), mode='everything')


“longer object length is not a multiple of shorter object length”
“Levels are not in the same order for reference and data. Refactoring data to match.”


In [29]:
print(conf_matrix$overall)

      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
     0.8248029      0.7813291      0.8159972      0.8333548      0.4327142 
AccuracyPValue  McnemarPValue 
     0.0000000            NaN 


In [30]:
# Put predictions and true labels on one shared set of factor levels.
# (Assigning levels() directly would relabel values by position rather than by name.)
shared_levels <- union(levels(factor(data.query$labels)), unique(data.query$predicted.id))
predicted <- factor(data.query$predicted.id, levels = shared_levels)
truth <- factor(data.query$labels, levels = shared_levels)

# Create a confusion matrix (rows = predictions, columns = reference labels)
conf_matrix <- confusionMatrix(data = predicted, reference = truth, mode = 'everything')

# Extract the confusion matrix table
conf_matrix_table <- conf_matrix$table

# Row-normalize the counts so that each predicted class sums to 1
accuracy_matrix <- prop.table(conf_matrix_table, margin = 1)


In [31]:
write.csv(accuracy_matrix, file = "accuracy_matrix_mod_noprior.csv", row.names = TRUE)

### Comparing Original vs Enhanced Results

Let's compare the performance metrics between the original Seurat approach and our scGALA-enhanced method:

| Metric | Original Seurat | scGALA-enhanced |
|--------|----------------|-----------------|
| Accuracy | 71.9% | 82.5% |
| Kappa | 0.612 | 0.781 |

On this dataset, the enhanced approach raises overall accuracy by about 10.6 percentage points (71.9% to 82.5%) and Kappa from 0.612 to 0.781, indicating better cell type assignment.
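
The accuracy figures can be reproduced from the FALSE/TRUE match tables printed earlier. The short sketch below redoes that arithmetic, and the helper name is hypothetical.

```python
def accuracy(n_false, n_true):
    """Share of query cells whose predicted label matches the known label."""
    return n_true / (n_false + n_true)


# Counts taken from the prediction.match tables in this tutorial
original = accuracy(n_false=2103, n_true=5380)
enhanced = accuracy(n_false=1311, n_true=6172)
gain_points = (enhanced - original) * 100

print(f"original={original:.4f} enhanced={enhanced:.4f} gain={gain_points:.1f} points")
```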

We can also examine the confusion matrix to see per-class improvements, which provides insight into which cell types benefit most from the enhanced alignment.
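
One way to look at per-class changes is to compare per-class recall (the share of cells of each true type that were labeled correctly) between the two runs. The sketch below uses made-up toy labels and a hypothetical helper. With real data, the label vectors would come from the CSV files written earlier in the tutorial.

```python
from collections import Counter


def per_class_recall(truth, predicted):
    """For each true label, the fraction of its cells that received the correct label."""
    totals = Counter(truth)
    hits = Counter(t for t, p in zip(truth, predicted) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels (made up for illustration)
truth = ["malignant cell"] * 4 + ["cytotoxic T cell"] * 2
original = ["malignant cell", "malignant cell", "cytotoxic T cell", "cytotoxic T cell",
            "cytotoxic T cell", "malignant cell"]
enhanced = ["malignant cell", "malignant cell", "malignant cell", "cytotoxic T cell",
            "cytotoxic T cell", "cytotoxic T cell"]

recall_original = per_class_recall(truth, original)
recall_enhanced = per_class_recall(truth, enhanced)
gains = {label: recall_enhanced[label] - recall_original[label] for label in recall_original}
print(gains)
```

Sorting `gains` by value would show which cell types benefit most from the enhanced alignment and whether any got worse.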

## Conclusion

In this tutorial, we've demonstrated how to enhance Seurat's label transfer functionality using scGALA:

1. We ran the standard Seurat label transfer workflow to establish a baseline
2. We extracted the original cell alignments (anchors)
3. We enhanced these alignments using scGALA (in an external Python process)
4. We integrated the enhanced alignments back into the Seurat workflow
5. We compared the performance, showing substantial improvement

This approach allows you to benefit from scGALA's advanced alignment capabilities while still using Seurat's familiar interface and visualization tools.

### Next Steps

- Try different parameters in the scGALA enhancement
- Apply this approach to your own datasets
- Explore scGALA's other integration capabilities beyond label transfer