# Figure 2: Tessera Cluster and DEG Analysis

This notebook analyzes spatial transcriptomics data to identify and annotate tissue regions. The workflow involves loading pre-processed data, integrating different data types, performing clustering, and identifying differentially expressed genes (DEGs) to characterize distinct cellular neighborhoods.

## 1. Setup and Initialization

### 1.1 Load Required Libraries and Set Initial Configuration

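Here, we load the R packages for data manipulation (`tidyverse`), single-cell analysis (`Seurat`, `presto`, `singlecellmethods`), sparse matrices (`Matrix`), parallel computing (`future`, `furrr`), and statistical modeling (`lme4`). `data.table` is not attached here; it is called via `data.table::` or loaded in the cells that need it. We also define a helper function to set plot dimensions, initialize the random seed for reproducibility, and source the project's clustering helpers (`cluster_utils.R`).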

In [None]:
# --- Load Libraries ---
# Used for single-cell data handling and methods
require(singlecellmethods)
# Fast differential expression analysis for single-cell data
require(presto)
# For fitting linear mixed-effects models
require(lme4)
# Framework for parallel processing in R
require(future)
# The core toolkit for single-cell genomics
require(Seurat)
# purrr-style map functions (e.g. future_map) that run on the future backend
require(furrr)
# A collection of packages for data science (e.g., dplyr, ggplot2)
require(tidyverse)
# Sparse matrix classes and methods
require(Matrix)

# --- Initial Configuration ---
# Helper function to set the output plot size in the notebook
fig.size <- function(height, width, res=400) {
    options(repr.plot.height = height, repr.plot.width = width, repr.plot.res = res)
}

# Set a random seed for reproducible results
set.seed(1)

#source('Tessera tiles/Tessera utils/libs.R')
#source('Tessera tiles/Tessera utils/utils_plotting.R')
#source('Tessera tiles/Tessera utils/utils.R')
source('../Tessera tiles/Tessera utils/cluster_utils.R')
#source('Tessera tiles/Tessera utils/utils_cygnus.R')

## 2. Data Loading and Preprocessing

### 2.1 Load and Standardize Pathology Regions

We start by loading the pathology region annotations from a CSV file. The `pathology_region` column is standardized by collapsing every entry that contains "Tumor" or "tumor" into a single "Tumor" category. The MS and MMR status of patient C107 are also overridden. A random sample of rows is displayed to inspect the result.

In [None]:
# Define the path to the pathology annotations file
pathology_file_path <- '../Pathology annotations/figure2/pathology_regions_postQC.csv'

# Read the data and standardize the 'pathology_region' column
pathology_regions <- data.table::fread(pathology_file_path) %>%
    mutate(pathology_region = ifelse(grepl(pathology_region, pattern = 'Tumor|tumor'),
                                     yes = 'Tumor',
                                     no = pathology_region)) %>%
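    # C107: set MS status to MSS and MMR status to MMRp (matching the Status override applied to tile_metadata later)
    mutate(MSstatus = case_when(PatientID == 'C107' ~ 'MSS', .default = MSstatus)) %>%
    mutate(MMRstatus = case_when(PatientID == 'C107' ~ 'MMRp', .default = MMRstatus))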

# Display a random sample of 20 rows to verify the data
slice_sample(pathology_regions, n = 20)
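
As a minimal sketch of the standardization step on toy data (the labels below are made up for illustration and are not taken from the CSV), `grepl()` flags every label containing "Tumor" or "tumor", and `ifelse()` collapses those to a single category:

```r
# Toy annotation labels (hypothetical values)
regions <- c("Tumor", "tumor core", "Invasive Tumor", "Normal mucosa", "Stroma")

# Any label containing "Tumor" or "tumor" becomes "Tumor"; all others are kept
standardized <- ifelse(grepl("Tumor|tumor", regions), "Tumor", regions)

table(standardized)
```

Because the match is a substring match, compound labels such as "tumor core" are also collapsed, which is worth keeping in mind if sub-regions of the tumor need to stay distinct.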

### 2.2 Inspect Unique Sample and Patient Identifiers

To understand the structure of the dataset, we extract and display the unique `sample_name` and `PatientID` values from the pathology data. This helps verify that all expected samples and patients are present.

In [None]:
# Display unique sample names
pathology_regions$sample_name %>% unique

# Display unique patient IDs
pathology_regions$PatientID %>% unique

In [None]:
# Cross-tabulate patients against MMR and MS status to confirm each patient has a single status
table(pathology_regions$PatientID, pathology_regions$MMRstatus)
table(pathology_regions$PatientID, pathology_regions$MSstatus)

### 2.3 Load Harmonized MERFISH Data

Next, we load the main harmonized MERFISH dataset, which is stored as a Seurat object. This object contains the gene expression data and associated metadata across all samples.

In [None]:
# Define the path to the harmonized MERFISH Seurat object
merfish_file_path <- '../Harmony and UMAP embeddings of MERFISH cells/harmonized_merfish_20241105.rds'

# Read the Seurat object
merged_merfish <- readr::read_rds(merfish_file_path)

# Print a summary of the Seurat object
merged_merfish

### 2.4 Standardize Sample Identifiers

To ensure consistency between the MERFISH data and the pathology annotations, we standardize the sample identifiers (`orig.ident`) in the Seurat object's metadata. This is crucial for merging datasets later.

In [None]:
# Standardize sample names for patient G4659
merged_merfish@meta.data$orig.ident[merged_merfish@meta.data$orig.ident %in% c('G4659-CP-MET', 'G4659-CP-MET_VMSC04701')] <- 'G4659'

# Verify the unique identifiers after standardization
unique(merged_merfish@meta.data$orig.ident)

### 2.5 Verify Identifier Consistency

We check that every identifier in the MERFISH data also appears in the `PatientID` column of the `pathology_regions` table. The result is one logical value per MERFISH identifier; any `FALSE` marks an identifier that would be lost during data integration.

In [None]:
# Check if MERFISH patient IDs are present in the pathology data's PatientID column
c(merged_merfish@meta.data$orig.ident %>% unique) %in% c(pathology_regions$PatientID %>% unique)
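
A vector of `TRUE`/`FALSE` values does not say which identifiers are missing. The sketch below, using hypothetical identifiers, shows one way to list mismatches in both directions with `setdiff()`:

```r
# Hypothetical identifiers for illustration only
merfish_ids   <- c("C107", "G4659", "C110")
pathology_ids <- c("C107", "G4659", "C123")

# IDs present in one table but missing from the other
missing_in_pathology <- setdiff(merfish_ids, pathology_ids)
missing_in_merfish   <- setdiff(pathology_ids, merfish_ids)

if (length(missing_in_pathology) > 0) {
    message("In MERFISH but not in pathology: ",
            paste(missing_in_pathology, collapse = ", "))
}
if (length(missing_in_merfish) > 0) {
    message("In pathology but not in MERFISH: ",
            paste(missing_in_merfish, collapse = ", "))
}
```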

### 2.6 Load and Process Tessera Spatial Data

This section reads the outputs from Tessera, a tool for analyzing spatial data. We read multiple `.rds` files, each corresponding to a different sample, and combine them into a single data frame.

#### 2.6.1 Identify and Name Tessera Output Files

First, we locate all processed Tessera files (`_processed_cygnus...rds`) in the specified directory. The file paths are named using their corresponding sample IDs for easy access.

In [None]:
# Directory containing the Tessera output files
dir <- '../Tessera tiles/Tessera processed results'

# List all files matching the pattern for processed Tessera data
fnames <- list.files(
    path = dir,
    pattern = '_processed_cygnus.*\\.rds$',
    full.names = TRUE
)

# Derive sample names from the file names and use them to name the file list
names(fnames) <- sub(x = basename(fnames),
                     pattern = '_processed_cygnus_20241020\\.rds$',
                     replacement = "")

# Display the named file paths
fnames
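
The naming step can be tried on its own. This sketch uses two hypothetical file paths that follow the naming scheme above and, as an assumption, a date-agnostic suffix pattern so that files from other processing dates would also be handled:

```r
# Hypothetical file paths following the naming scheme used above
paths <- c(
    "../Tessera tiles/Tessera processed results/C107_processed_cygnus_20241020.rds",
    "../Tessera tiles/Tessera processed results/G4659-CP-MET_VMSC04701_processed_cygnus_20241020.rds"
)

# basename() drops the directory; sub() strips the processing suffix
sample_ids <- sub("_processed_cygnus_.*\\.rds$", "", basename(paths))
names(paths) <- sample_ids

sample_ids
```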

#### 2.6.2 Collect and Combine Tile Metadata

We now read the spatial metadata (`pts`) from each Tessera object in parallel. The results are combined into a single data frame (`tile_metadata`), and unique identifiers for aggregates (`agg_id`) and hubs (`hub_ids`) are created by prefixing them with their sample ID.

In [None]:
# Set up a multisession plan for parallel processing with 10 workers
plan(multisession, workers = 10)

# Time the process of reading and combining tile metadata
system.time({
    tile_metadata <- future_map(fnames, function(fname) {
        # Read the Tessera object from file
        obj <- readr::read_rds(fname)
        # Return the 'pts' dataframe containing tile metadata
        return(obj$dmt$pts)
    }) %>%
    # Combine all data frames into one
    do.call(rbind, .) %>%
    # Create globally unique IDs for aggregates and hubs
    mutate(
        agg_id = paste0(SampleID, '_', agg_id),
        hub_ids = paste0(SampleID, '_', hub_ids)
    )
})

# Display the dimensions and a sample of the combined metadata
dim(tile_metadata)
sample_n(tile_metadata, 20)

#### 2.6.3 Add and Standardize Patient ID in Tile Metadata

A `PatientID` column is added to the `tile_metadata` by extracting it from the `SampleID`. Identifiers are cleaned to ensure they match the main `pathology_regions` table.

In [None]:
# Extract PatientID from SampleID by removing suffixes
tile_metadata$PatientID <- gsub(tile_metadata$SampleID, pattern = '_.*', replacement = '')

# Standardize the 'G4659-CP-MET' identifier to 'G4659' for consistency
tile_metadata$PatientID[tile_metadata$PatientID == 'G4659-CP-MET'] <- 'G4659'
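
The following sketch walks through the same two steps on toy input. `C107_S1` is a hypothetical sample ID; the two G4659 forms are the ones that appear earlier in this notebook. It also shows why a one-step alternative with an alternation pattern needs `gsub()` rather than `sub()`:

```r
# Toy sample IDs ("C107_S1" is hypothetical)
sample_ids <- c("C107_S1", "G4659-CP-MET", "G4659-CP-MET_VMSC04701")

# Step 1: drop everything from the first underscore onward
patient_ids <- gsub("_.*", "", sample_ids)

# Step 2: map the remaining non-standard form to the canonical patient ID
patient_ids[patient_ids == "G4659-CP-MET"] <- "G4659"
patient_ids

# One-step alternative: sub() replaces only the first match, gsub() replaces all
one_match   <- sub("_.*|-CP-MET", "", "G4659-CP-MET_VMSC04701")
all_matches <- gsub("_.*|-CP-MET", "", "G4659-CP-MET_VMSC04701")
c(sub = one_match, gsub = all_matches)
```

With `sub()`, only `-CP-MET` is removed and the `_VMSC04701` suffix survives, so the resulting ID would not match the `PatientID` values in `pathology_regions`.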

### 2.7 Filter Data to High-Quality Samples

We filter the `tile_metadata` to include only those patients that are also present in the `pathology_regions` table, ensuring we only analyze high-quality, annotated samples.

In [None]:
# Identify patients in tile_metadata that are NOT in pathology_regions (and will be excluded)
setdiff(unique(tile_metadata$PatientID), unique(pathology_regions$PatientID))

# Filter the tile metadata to retain only patients present in the pathology regions data
tile_metadata <- tile_metadata %>%
    filter(PatientID %in% (pathology_regions$PatientID %>% unique))

In [None]:
# Override the Status of patient C107 to 'MMRp'
tile_metadata$Status[tile_metadata$PatientID == 'C107'] <- 'MMRp'

## 3. Data Aggregation and Integration

### 3.1 Collect Hub Gene Expression Counts

Here, we extract the aggregated gene expression counts for each spatial hub from the Tessera objects. This is done in parallel for efficiency. The column names are updated to be globally unique.

In [None]:
# Set up a multisession plan for parallel processing
plan(multisession, workers = 10)

# Time the process of collecting and combining aggregated counts
system.time({
    agg_counts <- future_map(names(fnames), function(fname) {
        sampleID <- fname
        message(sampleID)
        
        # Read the Tessera object
        obj <- readr::read_rds(fnames[fname])
        
        # Extract the aggregated counts matrix
        agg_counts <- obj$aggs$counts
        
        # Create unique column names by prepending the sample ID
        colnames(agg_counts) <- paste0(sampleID, '_', obj$aggs$meta_data$id)
        
        return(agg_counts)
    }) %>%
    # Combine all count matrices into a single sparse matrix
    do.call(cbind, .)
})

# Display the dimensions and a preview of the combined matrix
dim(agg_counts)
agg_counts[1:10, 1:20]

# Perform garbage collection to free up memory
gc()
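
`cbind()` combines matrices by row position, not by row name, so the combined matrix is only meaningful if every sample lists its genes in the same order. The sketch below, on toy sparse matrices with hypothetical gene and hub names, shows one way to enforce a common gene order before binding. Whether the Tessera outputs already share a gene order is not something this notebook states, so treat this as a defensive check rather than a required step:

```r
library(Matrix)

# Two toy sparse count matrices (genes x hubs); all names are hypothetical
genes <- c("GeneA", "GeneB", "GeneC")
m1 <- Matrix(c(1, 0, 2, 0, 3, 0), nrow = 3, sparse = TRUE,
             dimnames = list(genes, c("S1_1", "S1_2")))
m2 <- Matrix(c(0, 4, 0), nrow = 3, sparse = TRUE,
             dimnames = list(rev(genes), "S2_1"))  # genes in a different order

mats <- list(m1, m2)

# Require the same gene set everywhere, then reorder to the first matrix's order
ref_genes <- rownames(mats[[1]])
stopifnot(all(vapply(mats, function(m) setequal(rownames(m), ref_genes), logical(1))))
mats <- lapply(mats, function(m) m[ref_genes, , drop = FALSE])

combined <- do.call(cbind, mats)
dim(combined)
```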

### 3.2 Collect Hub Metadata

Similarly, we collect the metadata for each spatial hub from the Tessera objects. This metadata includes spatial coordinates, cluster assignments, and other annotations.

In [None]:
# Set up a multisession plan for parallel processing
plan(multisession, workers = 10)

# Time the process of collecting and combining hub metadata
system.time({
    agg_metadata <- future_map(names(fnames), function(fname) {
        sampleID <- fname
        message(sampleID)
        
        # Read the Tessera object
        obj <- readr::read_rds(fnames[fname])
        
        # Extract the aggregate metadata
        agg_metadata <- obj$aggs$meta_data
        
        # Create unique IDs and add the sample ID as a column
        agg_metadata$id <- paste0(sampleID, '_', agg_metadata$id)
        agg_metadata$SampleID <- sampleID
        
        return(agg_metadata)
    }) %>%
    # Combine all metadata frames into one
    do.call(rbind, .)
})

# Display dimensions and a sample of the combined metadata
dim(agg_metadata)
agg_metadata[1:10, ]

# Perform garbage collection to free up memory
gc()

### 3.3 Filter Aggregated Data to High-Quality Samples

We filter the aggregated metadata to the samples listed in `pathology_regions` (matching `SampleID` to `sample_name`), and then subset the counts matrix to the same hubs.

In [None]:
# data.table provides fast in-place operations and joins
library(data.table)

# Convert both data frames to data.tables by reference (no copy is made)
setDT(agg_metadata)
setDT(pathology_regions)

# 1. Derive PatientID from SampleID. gsub() (rather than sub()) is needed so that
#    both '-CP-MET' and any '_...' suffix are stripped from the same ID.
agg_metadata[, PatientID := gsub(pattern = "_.*|-CP-MET", replacement = "", x = SampleID)]

# 2. Collect the unique annotated sample names, renamed to match agg_metadata
valid_samples_dt <- unique(pathology_regions[, .(SampleID = sample_name)])

# 3. Keep only hubs from annotated samples (inner join used as a filter)
agg_metadata <- agg_metadata[valid_samples_dt, on = .(SampleID), nomatch = 0]

In [None]:
unique(agg_metadata$SampleID) %in% unique(tile_metadata$SampleID)
unique(pathology_regions$sample_name) %in% unique(tile_metadata$SampleID)
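
The join-based filter used in this section can be illustrated on toy tables (all IDs below are hypothetical). An inner join against a table of unique keys keeps the same rows as an `%in%` filter, although the row order may differ:

```r
library(data.table)

# Toy hub metadata and annotated samples (hypothetical IDs)
hubs <- data.table(id = c("S1_1", "S1_2", "S2_1", "S3_1"),
                   SampleID = c("S1", "S1", "S2", "S3"))
annotated <- data.table(SampleID = c("S1", "S3"))

# Inner join used as a filter: rows of `hubs` without a match are dropped
kept_join <- hubs[annotated, on = .(SampleID), nomatch = NULL]

# Equivalent filter that preserves the original row order of `hubs`
kept_in <- hubs[SampleID %in% annotated$SampleID]

setequal(kept_join$id, kept_in$id)
```

The key table must be de-duplicated first (as done with `unique()` above); a repeated key would repeat the matching rows in the join result.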

In [None]:
# # Add and standardize PatientID in the aggregate metadata
# agg_metadata <- agg_metadata %>% 
#     mutate(PatientID = gsub(agg_metadata$SampleID, pattern = "_.*|-CP-MET", replacement = "")) %>% 
#     filter(PatientID %in% (pathology_regions$PatientID %>% unique))

# # Check the number of unique sample and patient IDs to ensure consistency
# print(unique(agg_metadata$SampleID) %>% length)
# print(unique(agg_metadata$PatientID) %>% length)
# print(unique(pathology_regions$PatientID) %>% length)

# Filter the counts matrix to match the filtered metadata
agg_counts = agg_counts[, agg_metadata$id]

# Confirm the final dimensions of the counts matrix
dim(agg_counts)

In [None]:
dim(agg_counts)
dim(agg_metadata)

## 4. Spatial Analysis and Clustering

### 4.1 Filter Tiles to Tumor Regions

To focus the analysis on tumor-specific interactions, we filter the spatial tiles to include only those located within annotated "Tumor" regions.

In [None]:
# # Join tile metadata with pathology regions to get region info for each cell
# DF <- pathology_regions %>% 
#     select(sample_cell, pathology_region) %>% 
#     distinct() %>%
#     rename(cell_id = sample_cell) %>% 
#     right_join(., tile_metadata) %>% 
#     right_join(., agg_metadata %>% select(id, PatientID) %>% rename(agg_id = id))

# # Identify tiles that are located within tumor regions
# tumor_tiles <- DF %>% 
#     select(agg_id, pathology_region) %>%
#     distinct() %>%
#     filter(grepl(pathology_region, pattern = 'Tumor|tumor'))

# # Filter the aggregate metadata to keep only tumor tiles
# agg_metadata <- left_join(agg_metadata %>% filter(id %in% tumor_tiles$agg_id),
#                         tumor_tiles %>% rename(id = agg_id))

# # Filter the tile metadata to keep only cells within tumor regions
# tumor_cells <- DF %>% 
#     select(cell_id, pathology_region) %>%
#     distinct() %>%
#     filter(grepl(pathology_region, pattern = 'Tumor|tumor'))

# tile_metadata <- left_join(tile_metadata %>% filter(cell_id %in% tumor_cells$cell_id),
#                          tumor_cells)

# # Finally, filter the expression counts to match the tumor-only tiles
# agg_counts <- agg_counts[, agg_metadata$id]

library(data.table)

# Ensure all tables are data.tables for high performance
setDT(pathology_regions)
setDT(tile_metadata)
setDT(agg_metadata)

# 1. Get the list of 'cell_id's located in tumor regions.
# Labels were already standardized to 'Tumor' when loading; the pattern is kept as a safeguard.
tumor_cell_ids <- unique(pathology_regions[grepl(pathology_region, pattern = 'Tumor|tumor'), .(cell_id = sample_cell)])

# 2. Filter 'tile_metadata' to keep only tumor cells using a fast join.
# This is the equivalent of a semi-join.
tile_metadata <- tile_metadata[tumor_cell_ids, on = .(cell_id), nomatch = 0]

# 3. Get the unique aggregate IDs (hubs) from the *already filtered* tile_metadata
tumor_agg_ids <- unique(tile_metadata[, .(id = agg_id)])

# 4. Filter 'agg_metadata' using the much smaller list of tumor aggregate IDs
agg_metadata <- agg_metadata[tumor_agg_ids, on = .(id), nomatch = 0]

# 5. Finally, filter the counts matrix using the column names from the filtered agg_metadata
# This is very fast as it's just subsetting columns by name.
agg_counts <- agg_counts[, agg_metadata$id]

### 4.2 Integrate Harmony Embeddings with Tile Metadata

We extract the Harmony embeddings from the `merged_merfish` object. These embeddings represent the integrated, batch-corrected cellular profiles. We then merge them with our `tile_metadata` to link each cell's profile to its spatial location and hub assignment.

In [None]:
# Extract Harmony embeddings and join with tile metadata
harmonyEmbeddings <- Embeddings(merged_merfish, 'harmony') %>%
    as.data.frame() %>%
    tibble::rownames_to_column(var = 'cell_id') %>%
    right_join(., y = tile_metadata, by = "cell_id")

# Display the dimensions and a sample of the resulting data frame
dim(harmonyEmbeddings)
sample_n(harmonyEmbeddings, 20)
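
A right join keeps every row of `tile_metadata`, so any cell without a Harmony embedding ends up with `NA` in the embedding columns. The sketch below uses toy data and base R `merge()` (swapped in here for `right_join()` to keep the example dependency-free; column names and IDs are hypothetical) to show how such cells can be listed:

```r
# Toy embeddings and tile metadata (hypothetical cell IDs)
emb <- data.frame(cell_id = c("c1", "c2", "c3"),
                  harmony_1 = c(0.1, 0.2, 0.3))
tiles <- data.frame(cell_id = c("c2", "c3", "c4"),
                    agg_id = c("S1_1", "S1_1", "S1_2"))

# Keep every tile row (right join); cells without an embedding get NA
joined <- merge(emb, tiles, by = "cell_id", all.y = TRUE)

# Cells that would contribute NA to a hub average
cells_without_embedding <- joined$cell_id[is.na(joined$harmony_1)]
cells_without_embedding
```

A hub average computed with `mean()` without `na.rm = TRUE` becomes `NA` as soon as one of its cells lacks an embedding, so an empty result here is a useful precondition for the aggregation step.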

### 4.3 Aggregate Embeddings by Spatial Hub

To analyze at the tissue-neighborhood level, we average the Harmony embeddings of all cells within each spatial hub (`agg_id`). This gives us a single representative embedding for each tile.

In [None]:
# # Time the aggregation process
# system.time({
#     aggregatedEmbeddings <- harmonyEmbeddings %>%
#         # Convert from wide to long format
#         pivot_longer(cols = colnames(Embeddings(merged_merfish, 'harmony'))) %>%
#         # Group by hub ID and embedding dimension
#         group_by(agg_id, name) %>%
#         # Calculate the mean value for each dimension
#         summarize(hpca = mean(value)) %>%
#         ungroup() %>%
#         # Pivot back to a wide format (hubs x embeddings)
#         pivot_wider(names_from = name, values_from = hpca) %>%
#         # Set the hub IDs as row names
#         tibble::column_to_rownames(var = 'agg_id')
# })

library(data.table)

# Ensure harmonyEmbeddings is a data.table
setDT(harmonyEmbeddings)

# Get the names of the columns you want to average
embedding_cols <- colnames(Embeddings(merged_merfish, 'harmony'))

system.time({
    # Perform the grouped aggregation using data.table's fast syntax
    aggregatedEmbeddings_dt <- harmonyEmbeddings[, lapply(.SD, mean), by = agg_id, .SDcols = embedding_cols]
    
    # Convert back to a data.frame with row names to match the original output
    aggregatedEmbeddings <- as.data.frame(aggregatedEmbeddings_dt)
    rownames(aggregatedEmbeddings) <- aggregatedEmbeddings$agg_id
    aggregatedEmbeddings$agg_id <- NULL
})

# Display the dimensions and a sample of the aggregated embeddings
dim(aggregatedEmbeddings)
sample_n(as.data.frame(aggregatedEmbeddings), 10)
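
To make the averaging concrete, here is a small base R sketch on made-up values. It computes the same per-hub means through `rowsum()` (a per-group sum) divided by the number of cells per hub, which is an alternative formulation and not the method used in the cell above:

```r
# Toy embedding matrix: 5 cells x 2 dimensions (values are made up)
emb <- matrix(c(1, 3, 5, 7, 9,
                2, 4, 6, 8, 10),
              ncol = 2,
              dimnames = list(paste0("cell", 1:5), c("harmony_1", "harmony_2")))
hub <- c("S1_1", "S1_1", "S1_2", "S1_2", "S1_2")

# Sum per hub, then divide each row by the number of cells in that hub
sums   <- rowsum(emb, group = hub)
n_cell <- as.vector(table(hub)[rownames(sums)])
hub_means <- sums / n_cell

hub_means
```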


## 5. UMAP Dimensionality Reduction and Visualization

### 5.1 Compute UMAP on Aggregated Embeddings

We run the UMAP algorithm on the aggregated hub embeddings to visualize the high-dimensional data in two dimensions. This helps reveal the structure and relationships between different tissue neighborhoods.

In [None]:
# Set a random seed for reproducible UMAP results
set.seed(1)

# Time the UMAP computation
system.time({
    U <- uwot::umap(
        aggregatedEmbeddings,
        min_dist = 0.05,
        spread = 0.30,
        ret_extra = 'fgraph', # Return the graph for clustering
        fast_sgd = TRUE
    )
})

# Rename columns for clarity
colnames(U$embedding) <- c('HUMAP1', 'HUMAP2')
# Assign rownames to the UMAP graph
rownames(U$fgraph) <- colnames(U$fgraph) <- rownames(aggregatedEmbeddings)

### 5.2 Visualize UMAP of Spatial Hubs

We create scatter plots of the UMAP results to visualize the relationships between the aggregated tiles (hubs).

#### 5.2.1 Basic UMAP Plot

A simple plot showing the distribution of all hubs in the UMAP space.

In [None]:
# Set figure size
fig.size(10, 10)
require(scattermore)
# Create a scatter plot of the UMAP embeddings
scattermoreplot(
    U$embedding[, 'HUMAP1'],
    U$embedding[, 'HUMAP2'],
    main = 'UMAP embeddings of aggregates (tiles)'
)

#### 5.2.2 UMAP Colored by Sample ID

We color the points by their sample of origin to check for batch effects or sample-specific structures.

In [None]:
# Create a data frame for plotting
umapEmbeddings <- U$embedding %>%
    as.data.frame() %>%
    tibble::rownames_to_column(var = 'agg_id') %>%
    left_join(., tile_metadata %>%
              select(SampleID, 'agg_id') %>%
              distinct())

# Set figure size
fig.size(10, 10)

# Plot UMAP with points colored by SampleID
ggplot(sample_n(umapEmbeddings, nrow(umapEmbeddings))) +
    geom_scattermore(aes(HUMAP1, HUMAP2, color = SampleID), shape = '.') +
    guides(colour = guide_legend(override.aes = list(size = 10, shape = 16))) +
    theme(aspect.ratio = 1) +
    NULL

#### 5.2.3 Faceted UMAP by Sample ID

To inspect each sample individually, we create a faceted plot where each panel corresponds to one sample.

In [None]:
# Set figure size for a large faceted plot
fig.size(40, 40)
require(gghighlight)
# Create faceted UMAP plot
ggplot(sample_n(umapEmbeddings, nrow(umapEmbeddings))) +
    geom_scattermore(aes(HUMAP1, HUMAP2, color = SampleID), shape = '.') +
    guides(colour = guide_legend(override.aes = list(size = 10, shape = 16))) +
    theme(aspect.ratio = 1) +
    facet_wrap(~SampleID) +
    # Highlight points within each facet
    gghighlight::gghighlight() +
    NULL

## 6. Leiden Clustering and Annotation

### 6.1 Cluster Hubs using the Leiden Algorithm

Using the graph generated during the UMAP step, we apply the Leiden algorithm to cluster the spatial hubs into distinct communities, representing different types of tissue microenvironments.

In [None]:
# Set a random seed for reproducibility
set.seed(1)

# Define the clustering resolution
res <- 0.1
message(res)

# Time the Leiden clustering process
system.time({
    umapEmbeddings[, paste0('leiden_', as.character(res))] <- do_leiden_one(
        U$fgraph,
        resolution = res,
        n_starts = 3,
        n_iterations = 3,
        verbose = FALSE
    )
})

# Print the unique cluster IDs found
print(unique(umapEmbeddings[, paste0('leiden_', as.character(res))]))

### 6.2 Visualize Leiden Clusters on UMAP

We plot the UMAP again, but this time coloring the points by their assigned Leiden cluster ID to visualize the partitioning of the data.

In [None]:
# Helper function to plot clusters on the UMAP
plot_cluster_on_UMAP <- function(res, umapEmbeddings, prefix = 'leiden_') {
    fig.size(10, 10)
    require(ggthemes)
    
    # Convert resolution column to a factor
    umapEmbeddings$res <- as.factor(umapEmbeddings[, paste0(prefix, as.character(res))])
    
    # Create the plot
    leiden_umap <- ggplot(sample_n(umapEmbeddings, nrow(umapEmbeddings)),
                          aes(HUMAP1, HUMAP2, color = res, fill = res)) +
        geom_scattermore() +
        scale_color_manual(name = 'Leiden',
                           values = c(tableau_color_pal(palette = 'Tableau 20')(20), 'black', 'red', 'navyblue')) +
        scale_fill_manual(name = 'Leiden',
                           values = c(tableau_color_pal(palette = 'Tableau 20')(20), 'black', 'red', 'navyblue')) +
        guides(colour = guide_legend(override.aes = list(size = 10, shape = 16))) +
        ggtitle(paste0('Res: ', as.character(res))) +
        theme(aspect.ratio = 1, legend.position = 'right') +
        NULL
        
    return(leiden_umap)
}


# Generate the plot for the chosen resolution, store it for the combined figure below, and display it
leiden_umap <- plot_cluster_on_UMAP(res = 0.1, umapEmbeddings = umapEmbeddings)
leiden_umap

### 6.3 Annotate and Visualize High-Level Clusters

Based on subsequent analysis (not shown), we manually annotate the identified Leiden clusters into high-level categories: "Tumor," "Non-epithelial," and "Granulocyte cap." We then visualize these final annotated clusters on the UMAP.

In [None]:
# Set plot dimensions
options(repr.plot.height = 7, repr.plot.width = 16, repr.plot.res = 300)

# Create a combined plot showing the original and the annotated clusters.
# Combining two ggplot objects with `+` requires patchwork.
# Note: 'leiden_umap' is the cluster plot created in the previous cell.
require(patchwork)
leiden_umap + (
    plot_cluster_on_UMAP(
        res = 0.1,
        umapEmbeddings = umapEmbeddings %>%
            mutate(leiden_0.1 = forcats::fct_recode(!!!c('Tumor' = '1', 'Non-epithelial' = '2', 'Granulocyte cap' = '3'),
                                                    leiden_0.1))
    ) +
    ggtitle('') +
    scale_fill_manual(
        name = 'High level\nclusters',
        values = c('Tumor' = '#f5e663', 'Non-epithelial' = '#1e3888', 'Granulocyte cap' = 'red')
    ) +
    scale_color_manual(
        name = 'High level\nclusters',
        values = c('Tumor' = '#f5e663', 'Non-epithelial' = '#1e3888', 'Granulocyte cap' = 'red')
    ) +
    NULL
)


In [None]:
fig.size(5,5,res = 400)
(
    plot_cluster_on_UMAP(
        res = 0.1,
        umapEmbeddings = umapEmbeddings %>%
            mutate(leiden_0.1 = forcats::fct_recode(!!!c('Tumor' = '1', 'Non-epithelial' = '2', 'Granulocyte cap' = '3'),
                                                    leiden_0.1))
    ) +
    ggtitle('') +
    scale_fill_manual(
        name = 'High level\nclusters',
        values = c('Tumor' = 'gold', 'Non-epithelial' = 'darkblue', 'Granulocyte cap' = 'red')
    ) +
    scale_color_manual(
        name = 'High level\nclusters',
        values = c('Tumor' = 'gold', 'Non-epithelial' = 'darkblue', 'Granulocyte cap' = 'red')
    ) +
    cowplot::theme_half_open(10) +
    NULL
)
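
The relabeling itself can be sketched in base R with a named lookup vector. The cluster assignments below are hypothetical; the label mapping is the one used in this section:

```r
# Toy cluster assignments (hypothetical)
cluster <- c("1", "3", "2", "1", "2")

# Map cluster IDs to the high-level labels used in this section
labels <- c("1" = "Tumor", "2" = "Non-epithelial", "3" = "Granulocyte cap")
annotated <- factor(unname(labels[cluster]), levels = unname(labels))

table(annotated)
```

Unlike `forcats::fct_recode()`, which leaves unlisted levels unchanged, a lookup like this turns any cluster ID missing from `labels` into `NA`, which makes unexpected clusters easy to spot.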

## 7. Analyzing Cell Lineage Composition on the UMAP
Now that we have clustered the spatial hubs, we want to understand their cellular composition. In this section, we will aggregate the cell type counts for each hub and visualize how different cell lineages are distributed across the UMAP plot. This will allow us to assign biological meaning to the spatial clusters we identified.

### 7.1 Aggregate Cell Type Counts by Hub
For each sample, we read the fine-grained cell type counts (`counts_lvl2`), keep only the hubs retained in `agg_metadata`, and reshape the filtered matrix into long format (one row per cell type and hub). Filtering before reshaping keeps the intermediate tables small. The mapping from fine-grained cell types (`type_lvl2`) to broader lineages (`type_lvl1`) and the per-lineage summation follow in Section 7.2.

In [None]:
# plan(multisession, workers = 10)

# system.time({agg_lvl2 = future_map(names(fnames), function(fname){
#     sampleID = fname
#     message(sampleID)
#     obj = readr::read_rds(fnames[fname]) # read in object
#     agg_lvl2 = obj$aggs$counts_lvl2 %>% 
#         as.matrix %>% 
#         as.data.frame() %>% 
#         tibble::rownames_to_column(var = 'type_lvl2')
#     colnames(agg_lvl2) = c('type_lvl2', 
#                            paste0(sampleID, 
#                                   '_', 
#                                   obj$aggs$meta_data$id))
#     agg_lvl2 = agg_lvl2 %>% 
#         pivot_longer(cols = paste0(sampleID, 
#                                    '_', 
#                                    obj$aggs$meta_data$id), 
#                      values_to = 'counts_lvl2') %>% 
#         mutate(SampleID = sampleID)
#     return(agg_lvl2)
# }) %>% do.call(rbind, .)})
# dim(agg_lvl2)
# agg_lvl2 = agg_lvl2 %>% filter(name %in% agg_metadata$id)
# sample_n(agg_lvl2, 10)
# gc()

library(data.table)

# --- 1. Get the definitive list of hubs to keep BEFORE the loop ---
# Using a simple vector is efficient for repeated lookups.
hubs_to_keep <- agg_metadata$id

# --- 2. Process files in parallel: filter first, then reshape ---
plan(multicore, workers = 10)

system.time({
    agg_lvl2_list <- future_map(names(fnames), function(fname) {
        # Read the object for one sample
        obj <- readr::read_rds(fnames[fname])
        counts_lvl2 <- obj$aggs$counts_lvl2
        
        # Create the full, unique column names for the hubs in this sample
        original_colnames <- paste0(fname, '_', obj$aggs$meta_data$id)
        
        # Identify which of this sample's hubs are in our master list
        cols_to_keep <- intersect(original_colnames, hubs_to_keep)
        
        # If this sample has no relevant hubs, skip it
        if (length(cols_to_keep) == 0) {
            return(NULL)
        }
        
        # **PERFORMANCE KEY**: Subset the matrix while it's still wide and small
        # Match the original colnames to the ones we want to keep
        counts_lvl2_filtered <- counts_lvl2[, match(cols_to_keep, original_colnames), drop = FALSE]
        colnames(counts_lvl2_filtered) <- cols_to_keep

        # Convert the small, filtered matrix to a data.table
        # (as.matrix() first, in case the counts are stored as a sparse Matrix)
        dt <- as.data.table(as.matrix(counts_lvl2_filtered), keep.rownames = "type_lvl2")
        
        # Reshape to long format: one row per (cell type, hub) pair
        melted_dt <- melt(dt,
                          id.vars = "type_lvl2",
                          variable.name = "name",
                          value.name = "counts_lvl2",
                          variable.factor = FALSE) # keep hub IDs as character, not factor
                          
        # Add the SampleID
        melted_dt[, SampleID := fname]
        
        return(melted_dt)
    }, .options = furrr_options(seed = TRUE))
    
    # --- 3. Combine the list of smaller data.tables into one ---
    # **PERFORMANCE KEY**: `rbindlist` is much faster than `do.call(rbind, .)`
    agg_lvl2 <- rbindlist(agg_lvl2_list)
})

dim(agg_lvl2)
sample_n(agg_lvl2, 10)
gc()
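
The filter-then-reshape pattern can be seen on a toy matrix (cell type and hub names are hypothetical). Subsetting the columns first means the long table only ever contains the hubs of interest:

```r
library(data.table)

# Toy cell-type x hub count matrix (names are hypothetical)
counts <- matrix(c(5, 0, 1, 2, 3, 0), nrow = 2,
                 dimnames = list(c("Epi", "Plasma"), c("S1_1", "S1_2", "S1_3")))
hubs_to_keep <- c("S1_1", "S1_3")

# Subset columns while the data is still a wide matrix
keep <- intersect(colnames(counts), hubs_to_keep)
dt <- as.data.table(counts[, keep, drop = FALSE], keep.rownames = "type_lvl2")

# Reshape to long format: one row per (cell type, hub) pair
long <- melt(dt, id.vars = "type_lvl2", variable.name = "name",
             value.name = "counts_lvl2", variable.factor = FALSE)
long
```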

### 7.2 Prepare Data for Visualization
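Now we join the long-format cell type counts with the UMAP coordinates and map each `type_lvl2` to its lineage (`type_lvl1`), folding `Endo` into `Strom` and splitting `Bplasma` into `B` and `Plasma`. Counts are summed per hub and lineage and pivoted to one row per hub. We then determine the dominant (highest-count) lineage for each hub, which we will use to color the points on the UMAP.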

In [None]:
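# Join counts with UMAP coordinates, collapse cell types to lineages,
# sum counts per hub and lineage, and pivot to one row per hub
.temp = agg_lvl2 %>% 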
    left_join(., umapEmbeddings %>% rename(name = agg_id)) %>%
    left_join(., tile_metadata %>%
        select(type_lvl1, type_lvl2) %>%
        distinct %>%
        mutate(type_lvl1 = ifelse(type_lvl1 == 'Endo', 
                                  yes = 'Strom',
                                  no = ifelse(type_lvl1 == 'Bplasma',
                                             yes = ifelse(type_lvl2 == 'Plasma', 
                                                         yes = 'Plasma',
                                                         no = 'B'),
                                             no = type_lvl1
                                             )
                                 ))
    ) %>%
    group_by(name, HUMAP1, HUMAP2, type_lvl1, .drop = FALSE) %>%
    summarize(counts_lvl1 = sum(counts_lvl2)) %>%
    na.omit() %>%
    ungroup() %>%
    pivot_wider(names_from = type_lvl1, values_from = counts_lvl1)
head(.temp)

In [None]:
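# Dominant lineage per hub: the first lineage (in the order listed below)
# whose count equals the row-wise maximum; ties go to the earlier lineage
.temp = .temp %>%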
  mutate(max_col = case_when(
    B == pmax(B, Plasma, Epi, Myeloid, Strom, TNKILC) ~ "B",
    Plasma == pmax(B, Plasma, Epi, Myeloid, Strom, TNKILC) ~ "Plasma",
    Epi == pmax(B, Plasma, Epi, Myeloid, Strom, TNKILC) ~ "Epi",
    Myeloid == pmax(B, Plasma, Epi, Myeloid, Strom, TNKILC) ~ "Myeloid",
    Strom == pmax(B, Plasma, Epi, Myeloid, Strom, TNKILC) ~ "Strom",
    TNKILC == pmax(B, Plasma, Epi, Myeloid, Strom, TNKILC) ~ "TNKILC"
  )) %>%
  mutate(type_lvl1_counts = pmax(B, Plasma, Epi, Myeloid, Strom, TNKILC))
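
The same dominant-lineage logic can be written compactly with base R `max.col()`. This is an alternative sketch on made-up counts, not the code used above, and it assumes there are no missing counts. With `ties.method = "first"`, ties are resolved in column order, matching the order of the `case_when()` branches:

```r
# Toy hub x lineage count matrix (values are made up)
lineages <- c("B", "Plasma", "Epi", "Myeloid", "Strom", "TNKILC")
counts <- matrix(c(0, 1, 9, 2, 3, 0,
                   4, 0, 1, 4, 0, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("S1_1", "S1_2"), lineages))

# Column index of each row maximum; ties go to the first column in `lineages`
top_idx <- max.col(counts, ties.method = "first")
dominant <- lineages[top_idx]
dominant_count <- counts[cbind(seq_len(nrow(counts)), top_idx)]

data.frame(hub = rownames(counts), dominant, dominant_count)
```

In the second toy hub, `B` and `Myeloid` are tied at 4, and `B` is reported because it comes first.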

### 7.3 Visualize Lineage Enrichment on UMAP
Finally, we generate the plots. The first plot shows the UMAP faceted by lineage, illustrating where each cell type is most abundant. The second plot provides a summary view, coloring each hub by its single most dominant lineage.

In [None]:
head(.temp)

In [None]:
# --- Plot 1: Faceted UMAP showing enrichment of each lineage ---
options(repr.plot.height = 10, repr.plot.width = 12, repr.plot.res = 300)
umap_lineage_df = .temp %>% 
    select(!type_lvl1_counts)  %>%
    mutate(dominant_lineage = max_col)
    
# Reshape the lineage count columns to long format for faceting
umap_lineage_df %>%
    pivot_longer(cols = all_of(c('B', 'Plasma', 'Epi', 'Myeloid', 'Strom', 'TNKILC')), names_to = 'lineage', values_to = 'count') %>%
    ggplot(., aes(HUMAP1, HUMAP2, color = log10(1 + count))) +
    geom_scattermore() +
    facet_wrap(~lineage) +
    scale_color_viridis_c() +
    gghighlight::gghighlight() + # Highlights the data within each facet
    theme_minimal(base_size = 14) +
    theme(aspect.ratio = 1, axis.text = element_blank(), axis.title = element_blank())

# --- Plot 2: UMAP colored by the single dominant lineage ---
options(repr.plot.height = 7, repr.plot.width = 8, repr.plot.res = 300)

ggplot(data = umap_lineage_df %>% na.omit, aes(HUMAP1, HUMAP2, color = dominant_lineage)) +
    geom_scattermore(pointsize = 2.5) +
    scale_color_manual(
        name = 'Dominant Lineage',
        values = c('Epi' = '#CA49FC', 'Strom' = '#00D2D0', 'Myeloid' = '#FFB946',
                   'Mast' = '#F4ED57', 'Plasma' = '#61BDFC', 'B' = '#0022FA', 'TNKILC' = '#FF3420')
    ) +
    guides(colour = guide_legend(override.aes = list(size = 8, shape = 16))) +
    theme_minimal(base_size = 14) +
    theme(aspect.ratio = 1)

## 8. Split of Tessera clusters and annotated tumor regions in the tissue specimens

In [None]:
pathology_regions %>% select(sample_cell, cell_id, PatientID, pathology_region) %>% head
umapEmbeddings %>% head
tile_metadata %>% head

In [None]:
fig.size(4,4,500)
pathology_regions %>% 
    select(sample_cell, PatientID, pathology_region, MMRstatus) %>%
    rename(cell_id = sample_cell) %>%
    left_join(., tile_metadata %>% select(cell_id, agg_id)) %>%
    left_join(., umapEmbeddings %>% 
              select(agg_id, leiden_0.1) %>% 
              mutate(leiden_0.1 = as.vector(leiden_0.1))) %>%
    mutate(tessera_annotation = case_when(leiden_0.1 == '1' ~ 'Epithelial-enriched',
                                          leiden_0.1 == '2' ~ 'Stromal-enriched',
                                          leiden_0.1 == '3' ~ 'Granulocyte cap',
                                          pathology_region != 'Tumor' ~ 'Not annotated tumor',
                                          # `agg_id == NA` always evaluates to NA, so missing tile ids are tested with is.na()
                                          (is.na(agg_id) | agg_id == 'NA') & pathology_region == 'Tumor' ~ 'Annotated tumor, not assigned to tile',
                                          .default = NA
                                          )) %>%
    filter(tessera_annotation != 'Annotated tumor, not assigned to tile') %>% # remove cells that were not assigned to a tessera tile 
    mutate(tessera_annotation = factor(tessera_annotation, ordered = TRUE, levels = rev(c('Epithelial-enriched', 'Stromal-enriched', 'Granulocyte cap', 'Not annotated tumor')))) %>% 
    ggplot() +
        geom_bar(aes(x = PatientID, fill = tessera_annotation), position = 'fill') +
        facet_grid(~MMRstatus, scales = 'free', space = 'free') +
        cowplot::theme_half_open(10) +
        scale_fill_manual(values = c("Epithelial-enriched" = "darkblue",
                   "Stromal-enriched" = "gold",
                    "Granulocyte cap" = "red",
                    "Not annotated tumor" = "lightgrey"
        ), name = 'Types of regions    ') +
        theme(legend.position = 'right', 
              axis.text = element_text(size = 10), 
              axis.text.x = element_text(angle = 90, hjust = 0, vjust = 0)) +
        labs(x = 'Specimens', y = 'Tessera regions') 
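
Missing tile ids need some care inside `case_when()`: any comparison with `NA` returns `NA` rather than `TRUE`, so only `is.na()` reliably detects cells that were never assigned to a tile. The toy example below is a minimal sketch of that behaviour; the table `toy` and the label "Assigned to tile" are made up for illustration and are not part of the analysis.

```r
library(dplyr)

# Toy data (hypothetical): NA marks cells that were never assigned to a tile
toy <- tibble(
  agg_id = c("t1", NA, NA),
  pathology_region = c("Tumor", "Tumor", "Normal")
)

# A comparison with NA is NA for every row, so it can never select a branch
all(is.na(toy$agg_id == NA))

# is.na() gives a proper TRUE/FALSE test for missing ids
toy <- toy %>%
  mutate(status = case_when(
    pathology_region != "Tumor" ~ "Not annotated tumor",
    is.na(agg_id) ~ "Annotated tumor, not assigned to tile",
    .default = "Assigned to tile"
  ))
toy$status
```

Rows that match no branch fall through to `.default`, and a later `filter(x != 'label')` also drops rows where `x` is `NA`.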

# Data for figure 1H

In [None]:
pathology_regions %>% 
    select(sample_cell, PatientID, pathology_region, MMRstatus) %>%
    rename(cell_id = sample_cell) %>%
    left_join(., tile_metadata %>% select(cell_id, agg_id)) %>%
    left_join(., umapEmbeddings %>% 
              select(agg_id, leiden_0.1) %>% 
              mutate(leiden_0.1 = as.vector(leiden_0.1))) %>%
    mutate(tessera_annotation = case_when(leiden_0.1 == '1' ~ 'Epithelial-enriched',
                                          leiden_0.1 == '2' ~ 'Stromal-enriched',
                                          leiden_0.1 == '3' ~ 'Granulocyte cap',
                                          pathology_region != 'Tumor' ~ 'Not annotated tumor',
                                          # `agg_id == NA` always evaluates to NA, so missing tile ids are tested with is.na()
                                          (is.na(agg_id) | agg_id == 'NA') & pathology_region == 'Tumor' ~ 'Annotated tumor, not assigned to tile',
                                          .default = NA
                                          )) %>%
    filter(tessera_annotation != 'Annotated tumor, not assigned to tile') %>% # remove cells that were not assigned to a tessera tile 
    filter(tessera_annotation != 'Not annotated tumor') %>% # remove cells outside the annotated tumor regions
    mutate(tessera_annotation = factor(tessera_annotation, ordered = TRUE, levels = rev(c('Epithelial-enriched', 'Stromal-enriched', 'Granulocyte cap', 'Not annotated tumor')))) %>%
    group_by(PatientID, tessera_annotation) %>%
    summarize(n = n()) %>%
    mutate(percent = 100*n/sum(n)) %>%
    ungroup %>%
    group_by(tessera_annotation) %>%
    summarize(mean = round(mean(percent), 2), median = round(median(percent), 2), sd = round(sd(percent), 2), 
                                            min = round(min(percent), 2), max = round(max(percent), 2)) 
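
The summary above is a two-stage calculation: percentages are first computed within each specimen, and only then summarized across specimens, so every specimen carries equal weight regardless of how many cells it contributes. A minimal sketch on made-up data (the table `toy` and its region labels "A" and "B" are hypothetical):

```r
library(dplyr)

# Hypothetical cells: patient P1 contributes 4 cells, patient P2 contributes 2
toy <- tibble(
  PatientID = c("P1", "P1", "P1", "P1", "P2", "P2"),
  region    = c("A",  "A",  "A",  "B",  "A",  "B")
)

# Stage 1: percentage of each region within each patient
per_patient <- toy %>%
  count(PatientID, region) %>%
  group_by(PatientID) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()

# Stage 2: summarize the per-patient percentages for each region
across_patients <- per_patient %>%
  group_by(region) %>%
  summarize(mean = mean(percent), median = median(percent), sd = sd(percent))
across_patients
```

Here region "A" makes up 75% of P1 and 50% of P2, so its mean across patients is 62.5%, whereas pooling all cells first would give 4/6, about 66.7%.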

## 9. Enriched lineages in each cluster

## 10. Enriched cell states in each cluster

## get degs - first pass
- First, collect the aggregated (agg) counts for each sample.
- Then collect the matching metadata.
- Then (perhaps after downsampling) fit the GLMM.

### add cluster information to agg metadata

In [None]:
res = 0.1
agg_metadata = left_join(agg_metadata %>% as.data.frame() %>% select(-shape), # dropping the geometry column for convenience
                         umapEmbeddings %>% as.data.frame() %>%
                             select(agg_id, SampleID, paste0('leiden_', as.character(res))), 
                             join_by('id' == 'agg_id', 'SampleID' == 'SampleID'))
head(agg_metadata, 20)

In [None]:
dim(agg_metadata)
head(agg_metadata)
length(unique(agg_metadata$id))
repeatedTile = table(agg_metadata$id) %>%
    as.data.frame %>%
    filter(Freq > 1) %>%
    pull(Var1)
agg_metadata %>% filter(id %in% repeatedTile) %>% pull(PatientID) %>% unique

# Inspect the metadata rows of the repeated tile ids identified above
agg_metadata %>% filter(id %in% repeatedTile) %>% head

In [None]:
require(presto)
require(data.table)
require(Matrix)
set.seed(1)
agg_metadata = as.data.frame(agg_metadata)# %>% mutate(pathology_region = ifelse(grepl(pathology_region, pattern = 'Tumor|tumor'), yes = 'Tumor', no = pathology_region)) %>% distinct
rownames(agg_metadata) = agg_metadata$id

In [None]:
agg_counts = agg_counts[, agg_metadata$id]

In [None]:
dim(agg_counts)

## add log umi information for each tile to the metadata table 

In [None]:
agg_metadata$logUMI = log(colSums(agg_counts))
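
The per-tile `logUMI` value is used further down as an offset in the Poisson model. An offset puts the comparison on the scale of a gene's share of a tile's total counts, so tiles with more UMIs do not look enriched simply because they are larger. The sketch below illustrates this with a plain fixed-effects `glm()` on simulated data; it is a simplification for illustration, not the mixed model that presto fits, and all variable names are hypothetical.

```r
set.seed(1)

# Simulated tiles (hypothetical): two clusters with very different library sizes
n_tiles <- 200
cluster <- rep(c("a", "b"), each = n_tiles / 2)
total_umi <- round(runif(n_tiles, min = 500, max = 5000))

# The gene makes up 1% of counts in cluster a and 2% in cluster b (a 2-fold change)
gene_fraction <- ifelse(cluster == "b", 0.02, 0.01)
y <- rpois(n_tiles, lambda = total_umi * gene_fraction)

# offset(log(total_umi)) fixes the coefficient of log library size at 1,
# so the cluster effect is a log ratio of fractions rather than of raw counts
fit <- glm(y ~ cluster + offset(log(total_umi)), family = poisson)

# Convert the natural-log coefficient to log2; it should be close to 1
log2_fc <- unname(coef(fit)["clusterb"]) / log(2)
round(log2_fc, 2)
```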

In [None]:
gc()

In [None]:
anyNA(agg_metadata)
head(agg_metadata)

## get degs

In [None]:
require(furrr)
require(future)
require(presto)
plan(multisession)
require(singlecellmethods)
agg_metadata$res = agg_metadata[,paste0('leiden_', as.character(res))]
pb = presto::collapse_counts(
    counts_mat = agg_counts, 
    meta_data = agg_metadata,
    c('SampleID', "res"), 
    min_cells_per_group = 3
)
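
Conceptually, pseudobulking sums the counts of all tiles that share the same combination of grouping variables (here sample and cluster) and drops groups with too few tiles. The base R sketch below shows that idea on a made-up 3-gene, 4-tile matrix; it illustrates the concept only and is not presto's implementation, which may differ in detail. All object names are hypothetical.

```r
# Hypothetical counts: 3 genes (rows) x 4 tiles (columns)
counts <- matrix(
  c(1, 0, 2,
    3, 1, 0,
    0, 4, 1,
    2, 2, 2),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("tile", 1:4))
)
meta <- data.frame(
  SampleID = c("s1", "s1", "s2", "s2"),
  res      = c("1",  "1",  "1",  "2")
)

# One pseudobulk group per SampleID x cluster combination
group <- paste(meta$SampleID, meta$res, sep = "_")

# rowsum() sums rows by group, so transpose to tiles x genes and back again
pb <- t(rowsum(t(counts), group = group))

# Keep only groups with at least `min_tiles` tiles
min_tiles <- 2
keep <- names(which(table(group) >= min_tiles))
pb_filtered <- pb[, keep, drop = FALSE]
pb
```

For large sparse matrices, a sparse matrix product with a group indicator matrix is the usual way to get the same sums without densifying the data.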

In [None]:
require(future)
require(furrr)
system.time({presto_res = presto.presto(
    y ~ 1 + (1|res) + (1|SampleID/res) + offset(logUMI), 
    design = pb$meta_data, #metadata, 
    response = pb$counts_mat, #counts,
    size_varname = "logUMI", 
    effects_cov = 'res',
    ncore = 10, 
    min_sigma = .05, 
    family = 'poisson',
    nsim = 1000
)})
contrasts_mat = make_contrast.presto(presto_res, 'res')
effects_marginal = contrasts.presto(presto_res, contrasts_mat, one_tailed = TRUE) %>% 
    #dplyr::mutate(cluster = contrast) %>% 
    dplyr::mutate(
        ## convert stats to log2 for interpretability 
        logFC = sign(beta) * log2(exp(abs(beta))),
        SD = log2(exp(sigma)),
        zscore = logFC / SD
    ) %>% 
    #dplyr::select(cluster, feature, logFC, SD, zscore, pvalue) %>% 
    arrange(pvalue)
effects_marginal$fdr = p.adjust(effects_marginal$pvalue, method = 'BH')
effects_marginal = data.table(effects_marginal)
#data.table::fwrite(effects_marginal, 'marginal_effects_tissue_regions_20250211.csv')
sample_n(effects_marginal, 20)
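
The model coefficients are on the natural-log scale, and the conversion used above is equivalent to dividing by `log(2)`. The z-score is unchanged by the conversion because numerator and denominator are rescaled by the same factor. A small check with made-up numbers (`beta`, `sigma`, and `pvalue` below are illustrative values, not results):

```r
# Illustrative coefficients and standard errors on the natural-log scale
beta  <- c(-1.2, 0, 0.7)
sigma <- c(0.3, 0.2, 0.1)

# Conversion as written in the notebook, and the equivalent direct form
logFC_notebook <- sign(beta) * log2(exp(abs(beta)))
logFC_direct   <- beta / log(2)
all.equal(logFC_notebook, logFC_direct)

# The z-score is the same on either scale
SD <- log2(exp(sigma))
all.equal(logFC_notebook / SD, beta / sigma)

# Benjamini-Hochberg adjustment on illustrative p-values
pvalue <- c(0.001, 0.02, 0.04, 0.5)
fdr <- p.adjust(pvalue, method = "BH")
fdr
```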

In [None]:
options(repr.matrix.max.cols=100, repr.matrix.max.rows=100)
effects_marginal %>%
    filter(pvalue < 0.2) %>%
    group_by(contrast) %>%
    mutate(rank = dense_rank(desc(logFC))) %>% # rank 1 = largest logFC, so the table lists the top upregulated features
    ungroup %>%
    filter(rank < 20) %>%
    select(contrast, feature, rank) %>%
    pivot_wider(names_from = contrast, values_from = feature) %>%
    arrange(rank)

## cell type composition of clusters

In [None]:
system.time({agg_lvl2_wide = agg_lvl2 %>%
    select(name, type_lvl2, counts_lvl2) %>%
    pivot_wider(names_from = name, 
                values_from = counts_lvl2,
               values_fill = 0) %>%
    tibble::column_to_rownames(var = 'type_lvl2')
})
dim(agg_lvl2_wide)
agg_lvl2_wide[1:10, 1:10]

In [None]:
head(agg_metadata)

In [None]:
agg_lvl2 = agg_lvl2 %>% 
    left_join(., agg_metadata %>% select(id, paste0('leiden_', as.character(res))) %>% rename("name" = "id", 'res' = paste0('leiden_', as.character(res))))

In [None]:
sample_n(agg_lvl2, 10)

## formal cell state enrichment

In [None]:
.metadata = agg_lvl2 %>%
    na.omit() %>%
    group_by(name, .drop = FALSE) %>%
    mutate(cells_in_agg = sum(counts_lvl2)) %>%
    mutate(type_lvl2_as_percent_of_agg = 100*counts_lvl2/cells_in_agg) %>%
    ungroup() %>%
    as.data.frame
slice_sample(.metadata %>%
    filter(type_lvl2_as_percent_of_agg > 0), n = 20)

In [None]:
.counts = .metadata %>%
    select(name, type_lvl2_as_percent_of_agg, type_lvl2) %>%
    pivot_wider(names_from = type_lvl2, values_fill = 0, values_from = type_lvl2_as_percent_of_agg) %>%
    as.data.frame %>%
    tibble::column_to_rownames('name') %>%
    as.matrix
dim(.counts)
.counts[1:10, 1:10]

In [None]:
require(presto)
.metadata = .metadata %>% select(name, res) %>% distinct() %>% as.data.frame
rownames(.metadata) = .metadata$name
.counts = .counts[rownames(.metadata), ]
enriched_cell_states = presto::wilcoxauc(X = .counts %>% t, y = .metadata$res)
sample_n(enriched_cell_states, 10)
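
`wilcoxauc()` reports, among other statistics, an AUC for each feature and group. The AUC is closely tied to the Wilcoxon rank-sum statistic: it is the fraction of (in-group, out-group) pairs in which the in-group value is larger. The base R sketch below reproduces that relationship on six made-up values with no ties; it illustrates the statistic and is not presto's code.

```r
# Illustrative values for one feature in two groups
x <- c(5, 7, 9, 1, 2, 8)
g <- c("a", "a", "a", "b", "b", "b")

n_a <- sum(g == "a")
n_b <- sum(g == "b")

# Mann-Whitney U for group a from the rank sum
r <- rank(x)
U <- sum(r[g == "a"]) - n_a * (n_a + 1) / 2

# AUC: probability that a random value from a exceeds a random value from b
auc <- U / (n_a * n_b)
auc

# Base R's wilcox.test() reports the same U as its W statistic
W <- unname(wilcox.test(x[g == "a"], x[g == "b"])$statistic)
W
```

An AUC of 0.5 means no separation between the group and the rest, and values near 1 mean the feature is almost always higher inside the group.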

In [None]:
.temp = enriched_cell_states %>%
    select(group, feature, logFC) %>%
    pivot_wider(names_from = feature, values_from = logFC) %>%
    tibble::column_to_rownames('group') %>%
    as.matrix
dim(.temp)

In [None]:
.rowAnno = agg_metadata %>%
    select(res, id) %>%
    summarize(n = n(), .by = res)
rownames(.rowAnno) = .rowAnno$res
.rowAnno

In [None]:
.pval = enriched_cell_states %>%
    select(group, feature, padj, logFC) %>%
    mutate(padj = ifelse(padj < 0.05 & logFC > 5, yes = '*', no = '')) %>%
    select(!logFC) %>%
    pivot_wider(names_from = group, values_from = padj) %>%
    tibble::column_to_rownames('feature') %>%
    as.matrix %>%
    t
dim(.pval)

In [None]:
set.seed(1)
.roworder = hclust(dist(.temp)^2, "cen")
names(.roworder)
.roworder$order
.colorder = hclust(dist(t(.temp))^2, "ave")
names(.colorder)
.colorder$order
.temp = .temp[.roworder$order, .colorder$order]
.pval = .pval[.roworder$order, .colorder$order]
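
Because the heatmap is later drawn with clustering switched off, the row and column order must be fixed beforehand, and the same permutation has to be applied to every companion matrix (here the significance labels). A minimal sketch of that pattern with a made-up matrix `m` and label matrix `labels`, both hypothetical:

```r
# Hypothetical effect-size matrix with two obvious row groups
m <- matrix(
  c(1, 1.1, 5, 5.2,
    2, 2.1, 9, 9.1),
  nrow = 4,
  dimnames = list(c("r1", "r2", "r3", "r4"), c("c1", "c2"))
)

# Companion matrix of labels with the same dimnames
labels <- ifelse(m > 4, "*", "")

# Cluster rows on squared Euclidean distance with centroid linkage
row_hc <- hclust(dist(m)^2, method = "centroid")

# Apply the identical permutation to both matrices so the labels stay aligned
m_ordered      <- m[row_hc$order, , drop = FALSE]
labels_ordered <- labels[row_hc$order, , drop = FALSE]

identical(rownames(m_ordered), rownames(labels_ordered))
```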

In [None]:
knn_mid_renamed = fread('Labeled MERFISH data/knn_cell_state_labels.csv') %>% as.data.frame
rownames(knn_mid_renamed) = knn_mid_renamed$knn_renamed_cell_states
knn_mid_renamed %>% sample_n(10)

In [None]:
.temp

In [None]:
options(repr.plot.res = 500, repr.plot.width = 12, repr.plot.height = 6)
require(ComplexHeatmap)
require(circlize)
set.seed(1)
colAnno = knn_mid_renamed$knn_coarse
names(colAnno) = knn_mid_renamed$knn_renamed_cell_states
ha1 = HeatmapAnnotation(
    which = 'column', 
    Lineage = colAnno[colnames(.temp)],  #.colorder$order
    col = list(Lineage = c('Epi' = '#CA49FC',
        'Strom' = '#00D2D0',
        'Myeloid' = '#FFB946',
        'Mast' = '#F4ED57',
        'Plasma' = '#61BDFC',
        'B' = '#0022FA',
        'TNKILC' = '#FF3420'
        )),
    annotation_legend_param = list(Lineage = list(nrow = 3, direction = 'horizontal')))
.rowAnno = .rowAnno[rownames(.temp),]
ha2 = HeatmapAnnotation(
        `log10(Count)` = anno_barplot(log10(1 + .rowAnno$n)), # apply the log10 transform named in the label
        annotation_name_rot = 0,
        which = 'row'
    )
col_fun = colorRamp2(c(min(.temp), 0, 2, max(.temp)), c('white', 'white', scales::muted('navyblue'), scales::muted('navyblue')))
h1 = ComplexHeatmap::Heatmap(
                        heatmap_legend_param = list(direction = 'horizontal'),
                        col = col_fun,
                        cluster_rows = FALSE,
                        cluster_columns = FALSE,
                        top_annotation = ha1,
                        #right_annotation = ha2, # weird issue with plot dimensions - this SOMETIMES works
                        cell_fun = function(j, i, x, y, width, height, fill) {grid.text(.pval[i, j], x, y, gp = gpar(col = 'red', fontsize = 10))}, # gpar() sets text color via `col`
                        name = 'logFC',
                        column_names_side = 'top',
                        show_column_dend = FALSE,
                        show_row_dend = FALSE,
                        matrix = .temp,
                        row_names_side = 'left')
draw(h1,
     merge_legend = TRUE, 
     heatmap_legend_side = "bottom", 
     annotation_legend_side = "bottom")

# patient composition

In [None]:
fig.size(5,5)
patient_composition_1 = ggplot(agg_metadata) +
    geom_bar(aes(y = SampleID, fill = res), position = 'fill') +
    scale_fill_manual(values = c(ggthemes::tableau_color_pal(palette = 'Tableau 20')(20), 'black')) +
    NULL
patient_composition_1

In [None]:
fig.size(5,5)
patient_composition_2 = ggplot(agg_metadata) +
    geom_bar(aes(y = res, fill = SampleID), position = 'fill') +
    scale_fill_manual(values = ggthemes::tableau_color_pal(palette = 'Tableau 20')(20)) +
    NULL
patient_composition_2

## 11. Assemble figure 2

## 12. Figure 4: Overlay CXCL mask annotations on Tessera tiles

### 12.1 Read in hub annotations

In [None]:
mask_annotations = data.table::fread('CXCR3L mask annotations/single_cell_annots_tessera_niches_clean.csv')
mask_annotations$SampleID = mask_annotations$sample_name
mask_annotations$SampleID[mask_annotations$sample_name == 'G4659'] <- 'G4659-CP-MET_VMSC04701'
mask_annotations$SampleID[mask_annotations$sample_name == 'G4659_Beta8'] <- 'G4659-CP-MET_Beta8'
mask_annotations$sample_cell = paste0(mask_annotations$SampleID, '_', mask_annotations$cell_id)
sample_n(mask_annotations, 20)

In [None]:
colnames(mask_annotations)
mask_annotations$MMRstatus[mask_annotations$PatientID == 'C107'] %>% unique
mask_annotations$MSstatus[mask_annotations$PatientID == 'C107'] %>% unique

### 12.2 Join mask annotations to tile_metadata and agg_metadata

In [None]:
tile_metadata = tile_metadata %>% 
    #select(!c("cxcl_pos_tile.x", "cxcl_pos_tile.y", "cxcl_pos_tile")) %>%
    #select(!c( "cxcl_pos_tile")) %>%
    left_join(., mask_annotations %>% 
              select(tessera_tile_id, cxcl_pos_tile) %>% 
              distinct %>% 
              rename(agg_id = tessera_tile_id), 
              by = "agg_id")
sample_n(tile_metadata, 20)

In [None]:
head(agg_metadata)

In [None]:
agg_metadata = agg_metadata %>% 
    as.data.frame %>%
    #mutate(shape = as.vector(shape)) %>% # remember that we dropped the geometry column earlier
    #select(!c("cxcl_pos_tile.x", "cxcl_pos_tile.y", "cxcl_pos_tile")) %>%
    #select(!c( "cxcl_pos_tile")) %>%
    left_join(., mask_annotations %>% 
              select(tessera_tile_id, cxcl_pos_tile) %>% 
              distinct %>% 
              rename(id = tessera_tile_id) %>% 
              as.data.frame, 
              by = "id") 
agg_metadata %>% head()

### 12.3 View hub+ tiles on the UMAP plot 

In [None]:
table(agg_metadata$cxcl_pos_tile, useNA = 'always')

In [None]:
umapEmbeddings %>%
    left_join(., mask_annotations %>% select(tessera_tile_id, cxcl_pos_tile) %>% distinct %>% rename(agg_id = tessera_tile_id)) %>%
    pull(cxcl_pos_tile) %>%
    table(., useNA = 'always')

In [None]:
umapEmbeddings %>%
    left_join(., agg_metadata %>% select(id, cxcl_pos_tile) %>% distinct %>% rename(agg_id = id)) %>%
    pull(cxcl_pos_tile) %>%
    table(., useNA = 'always')

In [None]:
fig.size(7,7,400)
umapEmbeddings %>%
    left_join(., agg_metadata %>% select(id, cxcl_pos_tile) %>% distinct %>% rename(agg_id = id)) %>%
    ggplot() +
        geom_scattermore(aes(x = HUMAP1, y = HUMAP2, color = cxcl_pos_tile)) +
        cowplot::theme_half_open(10) +
        theme(aspect.ratio = 1, legend.position = 'top') +
        scale_color_manual(values = c('CXCL_pos' = 'red', 'CXCL_neg' = 'grey'), name = 'Hub+ tiles') +
        ggtitle('Representation of hub+ tiles across Tessera regions') +
        NULL

### 12.4 View split of hub+ annotations between specimens

In [None]:
fig.size(height = 5, width = 10, res = 400)
umapEmbeddings %>%
    left_join(., agg_metadata %>% select(id, cxcl_pos_tile, PatientID) %>% distinct %>% rename(agg_id = id)) %>%
    left_join(., mask_annotations %>% select(SampleID, MMRstatus) %>% distinct) %>%
    left_join(., umapEmbeddings %>% 
        select(agg_id, leiden_0.1) %>% 
        mutate(leiden_0.1 = as.vector(leiden_0.1))) %>%
    mutate(tessera_annotation = case_when(leiden_0.1 == '1' ~ 'Epithelial-enriched',
        leiden_0.1 == '2' ~ 'Stromal-enriched',
        leiden_0.1 == '3' ~ 'Granulocyte cap',
        # pathology_region != 'Tumor' ~ 'Not annotated tumor',
        # pathology_region != 'Tumor' ~ 'Not annotated tumor',
        # agg_id == 'NA' & pathology_region == 'Tumor' ~ 'Annotated tumor, not assigned to tile',
        # agg_id == NA & pathology_region == 'Tumor' ~ 'Annotated tumor, not assigned to tile',
        .default = 'Other'
    )) %>%
    ggplot() +
        geom_bar(aes(x = PatientID, fill = cxcl_pos_tile), position = 'fill') +
        facet_grid(~ interaction(tessera_annotation, MMRstatus, sep = '\n'), scales = 'free', space = 'free') +
        #facet_grid(MMRstatus ~ tessera_annotation, scales = 'free', space = 'free') +
        cowplot::theme_half_open(10) +
        theme(strip.background = element_rect(fill = NA, color = 'black'), 
              legend.position = 'top', 
              axis.text = element_text(size = 10), 
              axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1),
             strip.text.x = element_text(angle = 90)
             ) +
        labs(x = 'Specimens', y = 'Tessera regions') +
        scale_fill_manual(values = c('CXCL_pos' = 'red', 'CXCL_neg' = 'grey'), name = 'Proportion of Hub+ tiles') +
        NULL

## Cache

In [None]:
agg_metadata = agg_metadata %>%
    left_join(., mask_annotations %>% select(SampleID, MMRstatus) %>% distinct) %>%
    left_join(., umapEmbeddings %>% 
         select(agg_id, leiden_0.1) %>% 
         rename(id = agg_id) %>%
         mutate(leiden_0.1 = as.vector(leiden_0.1))) %>%
    mutate(tessera_annotation = case_when(leiden_0.1 == '1' ~ 'Epithelial-enriched',
        leiden_0.1 == '2' ~ 'Stromal-enriched',
        leiden_0.1 == '3' ~ 'Granulocyte cap',
        # pathology_region != 'Tumor' ~ 'Not annotated tumor',
        # pathology_region != 'Tumor' ~ 'Not annotated tumor',
        # agg_id == 'NA' & pathology_region == 'Tumor' ~ 'Annotated tumor, not assigned to tile',
        # agg_id == NA & pathology_region == 'Tumor' ~ 'Annotated tumor, not assigned to tile',
        .default = 'Other'
    )) 
agg_metadata %>%
    sample_n(20)
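
The mapping from Leiden cluster to region label is written out in several cells. One way to keep those copies in sync is a small lookup helper, sketched below. The names `cluster_labels` and `annotate_cluster` are hypothetical and do not exist in the notebook or its utility scripts.

```r
# Cluster-to-label lookup taken from the case_when() calls in this notebook
cluster_labels <- c(
  "1" = "Epithelial-enriched",
  "2" = "Stromal-enriched",
  "3" = "Granulocyte cap"
)

# Hypothetical helper: returns the label for each cluster, or `default` when
# the cluster is missing or not in the lookup
annotate_cluster <- function(cluster, default = "Other") {
  out <- unname(cluster_labels[as.character(cluster)])
  out[is.na(out)] <- default
  out
}

annotate_cluster(factor(c("1", "3", "7", NA)))
```

Inside a pipeline this could be used as `mutate(tessera_annotation = annotate_cluster(leiden_0.1))`, which gives the same labels as the `case_when()` version with `.default = 'Other'`.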

In [None]:
tile_metadata = tile_metadata %>%
    left_join(., mask_annotations %>% select(SampleID, MMRstatus) %>% distinct) %>%
    left_join(., umapEmbeddings %>% 
         select(agg_id, leiden_0.1) %>% 
         mutate(leiden_0.1 = as.vector(leiden_0.1))) %>%
    mutate(tessera_annotation = case_when(leiden_0.1 == '1' ~ 'Epithelial-enriched',
        leiden_0.1 == '2' ~ 'Stromal-enriched',
        leiden_0.1 == '3' ~ 'Granulocyte cap',
        # pathology_region != 'Tumor' ~ 'Not annotated tumor',
        # pathology_region != 'Tumor' ~ 'Not annotated tumor',
        # agg_id == 'NA' & pathology_region == 'Tumor' ~ 'Annotated tumor, not assigned to tile',
        # agg_id == NA & pathology_region == 'Tumor' ~ 'Annotated tumor, not assigned to tile',
        .default = 'Other'
    ))  

tile_metadata = tile_metadata %>%
    mutate(type_lvl1 = as.factor(type_lvl1)) %>%
    mutate(type_lvl1 = case_when(
        type_lvl1 == 'Endo' ~ 'Strom',
        type_lvl2 == 'Plasma' ~ 'Plasma',
        grepl(type_lvl2, pattern = '^B') ~ 'B',
        .default = type_lvl1
    )) 

tile_metadata %>% 
    pull(type_lvl1) %>%
    table

tile_metadata %>%
    sample_n(20)

In [None]:
readr::write_rds(file = 'Tessera tiles/Tessera processed results/agg_metadata_2025-07-22.rds', x = agg_metadata)
readr::write_rds(file = 'Tessera tiles/Tessera processed results/tile_metadata_2025-07-22.rds', x = tile_metadata)

# Pervasive stromal networks allow immune cells access to epithelium
## Pervasive - % of cells in epithelial tiles that are within 100 µm of a stromal band
## Immune cell trafficking - % of immune cells found in stromal bands

In [None]:
temp = readr::read_rds(file = 'Tessera tiles/Tessera processed results/tile_metadata_2025-07-22.rds')
temp$type_lvl1[temp$type_lvl2 == 'Mast'] = 'Mast' 
head(temp)

In [None]:
temp %>%
    filter(type_lvl1 %in% c('B', 'Myeloid', 'Plasma', 'TNKILC', 'Mast')) %>%
    group_by(tessera_annotation) %>%
    summarize(n = n()) %>%
    ungroup %>%
    mutate(percent = round(100*n/sum(n), 2))

In [None]:
ids = unique(temp$SampleID) #[cells$MMRstatus == 'MMRd'])

ids
# List the interface files once so that the file names and the sample match use the same directory.
# Assumes the objects live under '../Tessera tiles/'; adjust the prefix if the notebook runs from the project root.
interface_dir = '../Tessera tiles/Spatial objects for tumor-stromal interfaces in all MERFISH samples/'
interface_files = list.files(path = interface_dir, pattern = '_tumor_stromal_interfaces.rds', full.names = TRUE)
interfaces = map(ids, function(.id) {
    fname = normalizePath(interface_files[grepl(interface_files, pattern = .id, fixed = TRUE)])
    readRDS(fname)
})
names(interfaces) = ids
require(sf)
glimpse(interfaces[[1]])

In [None]:
options(future.globals.maxSize = 1e10)
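ids = unique(temp$SampleID)
temp = data.table::as.data.table(temp) # the temp[SampleID == .id] and .(X, Y) syntax below requires a data.table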
system.time({
    temp_list = future_map(ids, function(.id) {
        temp1 = temp[SampleID == .id]
        interfaces1 = interfaces[[.id]]
        #summarize_cells_by_interface_proximity_2(cells[SampleID == .id], interfaces[[.id]])    
        pts = st_as_sf(temp1[, .(X, Y)], coords = c('X', 'Y'))
        geos_pts = geos::as_geos_geometry(pts$geometry)
        geos_lines = geos::as_geos_geometry(interfaces1$x[1:nrow(interfaces1)])
        
        nearest_interfaces_idx = geos::geos_nearest(geos_pts, geos_lines)
        
        temp1$closest_interface_type = interfaces1$Type_of_Interface[nearest_interfaces_idx]
        
        temp1$dist_interface = geos::geos_distance(geos_pts, geos_lines[nearest_interfaces_idx])
        
        ## Assign sign to distances
        temp1$dist_interface_signed = fifelse(
            temp1$tessera_annotation == 'Stromal-enriched',
            -temp1$dist_interface,
            temp1$dist_interface
        )
        return(temp1)
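    }, .options = furrr::furrr_options(seed = TRUE, packages = c('data.table', 'sf', 'geos'))) # attach the packages used inside the function on each worker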
})

In [None]:
names(temp_list) = ids

In [None]:
temp_df = rbindlist(temp_list)
head(temp_df)

In [None]:
colnames(temp_df)

In [None]:
temp_df$tessera_annotation %>% unique

In [None]:
temp_df %>%
    filter(tessera_annotation == 'Epithelial-enriched') %>%
    mutate(isWithin100mu = dist_interface_signed < 100) %>%
    group_by(isWithin100mu) %>%
    summarize(n = n()) %>%
    ungroup %>%
    mutate(percent = round(100*n/sum(n), 2))
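
The proximity calculation has three steps: find the nearest interface for each cell, measure the distance to it, and give the distance a negative sign for cells on the stromal side. The sketch below reproduces those steps on four made-up points and two straight interfaces. It uses `sf::st_nearest_feature()` and `sf::st_distance()` in place of the geos calls above, and it assumes coordinates are in micrometres; all object names are hypothetical.

```r
library(sf)

# Hypothetical cells with their region labels
cells <- data.frame(
  X = c(0, 5, 10, 0),
  Y = c(1, -3, 150, 390),
  tessera_annotation = c("Epithelial-enriched", "Stromal-enriched",
                         "Epithelial-enriched", "Epithelial-enriched")
)
pts <- st_as_sf(cells, coords = c("X", "Y"))

# Two hypothetical interfaces: horizontal lines at y = 0 and y = 400
interfaces <- st_sfc(
  st_linestring(rbind(c(-20, 0), c(20, 0))),
  st_linestring(rbind(c(-20, 400), c(20, 400)))
)

# Step 1: index of the nearest interface for each cell
idx <- st_nearest_feature(pts, interfaces)

# Step 2: distance from each cell to its own nearest interface
d <- as.numeric(st_distance(pts, interfaces[idx], by_element = TRUE))

# Step 3: negative distances on the stromal side, positive elsewhere
d_signed <- ifelse(cells$tessera_annotation == "Stromal-enriched", -d, d)

# Percentage of epithelial cells within 100 units of an interface
is_epi <- cells$tessera_annotation == "Epithelial-enriched"
pct_within_100 <- 100 * mean(d_signed[is_epi] < 100)
pct_within_100
```

In this toy layout two of the three epithelial cells lie within 100 units of an interface, so the result is about 66.7%.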

In [None]:
getwd()

In [None]:
sessionInfo()