# Spatial Analysis of Cell Proximity to MSI vs MSS Tumor-Stroma Interfaces

This notebook analyzes the spatial distribution of different cell types relative to defined biological interfaces between tumor and stromal regions. The primary goal is to determine if certain cell types are enriched or depleted near these interfaces and whether this pattern differs based on the patient's Mismatch Repair (MMR) status (MMRd/MSI vs. MMRp/MSS).

## Analysis Workflow

The notebook is structured into the following key stages:

1.  **Setup & Configuration**: Load all necessary R packages and configure the parallel processing environment.
2.  **Function Definitions**: Define a collection of R functions for performing the core statistical analysis, including:
    - Distance calculations and cell binning.
    - Empirical Bayes shrinkage to stabilize proportion estimates.
    - Meta-analysis to combine results across multiple samples.
3.  **Data Loading & Preprocessing**: Load cell spatial coordinates and interface geometries from external files and perform initial data cleaning.
4.  **Distance Calculation**: For each cell in every sample, calculate the distance to the nearest interface and generate binned count matrices.
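5.  **Global Meta-Analysis**: Define the master function that meta-analyzes each cell type within the MMRd and MMRp groups, compares the two groups per distance bin, and reports p-values and log2 fold changes.
6.  **Execution & Export**: Run the analysis across all cell types and write the per-bin results and count tables to disk.
7.  **Visualization**: Generate faceted plots to compare cell distribution profiles between the two MMR status groups.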

--- 
## 1. Setup: Load Libraries and Configure Environment

This initial code block loads all the R packages required for the analysis. It also configures `future` for parallel processing, so the per-sample distance calculation in Section 4 can run on several cores at once.

In [None]:
# Suppress package startup messages for a cleaner console output.
suppressPackageStartupMessages({
    # Core data manipulation libraries
    library(data.table) # Fast data manipulation
    library(dplyr)      # User-friendly data manipulation verbs
    library(purrr)      # Functional programming tools
    
    # Spatial analysis libraries
    library(sf)         # Modern standard for spatial data in R
    library(geos)       # High-performance geometry operations
    
    # Plotting and visualization libraries
    library(ggplot2)    # The premier plotting library in R
    library(ggthemes)   # Additional themes for ggplot2
    library(patchwork)  # For combining multiple ggplot objects
    
    # Parallel processing and utility libraries
    library(glue)       # For easy string interpolation
    library(future)     # Framework for parallel processing
    library(furrr)      # Combines purrr's mapping functions with future's parallel backend
})

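# Run futures in forked R processes, one worker per available core by default.
# `multicore` is not supported on Windows or inside RStudio, where `future` falls
# back to sequential processing; `plan(multisession)` is the portable alternative.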
plan(multicore)

# A helper function to easily set the dimensions of plots generated in the environment.
fig.size <- function(h, w) {
    options(repr.plot.height = h, repr.plot.width = w)
}

--- 
## 2. Function Definitions

This section contains all the custom functions used throughout the analysis pipeline. They are grouped by their role in the workflow.

### Spatial Processing and Binning

In [None]:
#' @title Calculate Cell Counts in Distance Bins from an Interface
#' @description This function takes spatial coordinates of cells and interface lines,
#'   calculates the signed distance of each cell to the nearest interface, and
#'   groups cells into discrete distance bins. It returns a matrix of cell
#'   type counts per bin for a single sample.
#' @param cells A data.table containing cell information, including 'X'/'Y' coordinates,
#'   cell type ('type_lvl3'), and a region annotation ('tessera_annotation').
#' @param interfaces An sf object containing interface geometries (e.g., LINESTRINGs).
#' @return A matrix where rows are distance bins (e.g., "(-5,0]") and columns
#'   are cell types ('type_lvl3'), with values representing cell counts.
get_bins = function(cells, interfaces) {
    # Convert data.frame coordinates to a spatial 'sf' object
    pts = st_as_sf(cells[, .(X, Y)], coords = c('X', 'Y'))
    
    # Use 'geos' for high-performance spatial operations
    geos_pts = geos::as_geos_geometry(pts$geometry)
    # st_geometry() returns the geometry column whatever its name (here 'x')
    geos_lines = geos::as_geos_geometry(sf::st_geometry(interfaces))
    
    # Find the nearest interface for each cell
    nearest_interfaces = geos::geos_nearest(geos_pts, geos_lines)
    
    # Calculate the distance to that nearest interface
    cells$dist_interface = geos::geos_distance(geos_pts, geos_lines[nearest_interfaces])
    
    # Assign a sign to the distance based on tissue region (stroma vs. other)
    cells$dist_interface_signed = case_when(
        cells$tessera_annotation == 'Stromal-enriched' ~ -cells$dist_interface,
        TRUE ~ cells$dist_interface
    )
    
    # Bin cells into 5µm distance intervals
    cells$dist_bin = cut(cells$dist_interface_signed, seq(-100, 100, by = 5), include.lowest = TRUE)

    # Create the final count matrix
    counts = cells[
        !is.na(dist_bin)
    ] %>%
        with(table(dist_bin, type_lvl3)) %>%
        data.table() %>%
        dcast(dist_bin ~ type_lvl3, value.var = 'N') %>%
        dplyr::mutate(dist_bin = factor(dist_bin, levels(cells$dist_bin))) %>%
        arrange(dist_bin) %>%
        tibble::column_to_rownames('dist_bin') %>%
        as.matrix()
    
    return(counts)
}
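
To make the binning convention concrete, the sketch below applies the same `cut()` call to a handful of made-up distances. The distance values and the `'Tumor'` region label are illustrative assumptions, not data from this study. Negative distances stand for cells on the stromal side, and anything beyond ±100 falls outside the breaks and becomes `NA`, which is why `get_bins` drops `NA` bins before tabulating.

```r
# Toy unsigned distances and region labels (hypothetical values)
dist_interface <- c(2.5, 7, 100, 0, 130, 42)
region <- c('Stromal-enriched', 'Tumor', 'Stromal-enriched',
            'Tumor', 'Tumor', 'Stromal-enriched')

# Stromal cells get negative distances, everything else stays positive
signed <- ifelse(region == 'Stromal-enriched', -dist_interface, dist_interface)

# Same breaks as get_bins: 40 bins of width 5 spanning [-100, 100]
breaks <- seq(-100, 100, by = 5)
bins <- cut(signed, breaks, include.lowest = TRUE)

data.frame(signed, bin = bins)
nlevels(bins)                   # number of bins
levels(bins)[c(1, 20, 21, 40)]  # outermost bins and the two bins flanking zero
```

Because the intervals are closed on the right, a cell at distance exactly 0 lands in the last stromal-side bin. Bins 20 and 21 flank the interface, which is where the plots later in the notebook draw their vertical line (x = 20.5).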

### Empirical Bayes and Meta-Analysis Functions

In [None]:
#' @title Estimate Beta Prior Parameters from Data (Robustly)
#' @description Implements the method of moments to estimate the `alpha` and
#'   `beta` parameters of a Beta distribution that best fits the observed
#'   distribution of proportions. This version includes checks for edge cases
#'   like small sample sizes or zero-count bins to prevent errors.
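#' @param k Integer vector of counts for the cell type(s) of interest, one per bin.
#' @param n Integer vector of total cell counts per bin, same length as `k`.
#' @return A list containing `alpha` and `beta`. Returns `alpha=0`, `beta=0` if a
#'   prior cannot be estimated, which reduces the posterior mean to the raw
#'   proportion `k / n` (the MLE).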
estimate_beta_prior <- function(k, n) {
    # Handle cases with insufficient data to estimate a prior
    if (length(k) <= 1) return(list(alpha = 0, beta = 0))

    # Filter out bins with zero cells to avoid division-by-zero errors
    valid_bins <- n > 0
    if (sum(valid_bins) <= 1) return(list(alpha = 0, beta = 0))
    k_valid <- k[valid_bins]
    n_valid <- n[valid_bins]
    
    # Method of Moments calculation
    p_hat <- k_valid / n_valid
    mean_p <- mean(p_hat)
    var_p <- var(p_hat)
    mean_n <- mean(n_valid)
    var_true <- var_p - mean_p * (1 - mean_p) / mean_n
    
    # Handle numerical artifacts where estimated variance is not positive
    if (is.na(var_true) || var_true <= 0) return(list(alpha = 0, beta = 0))
    
    # Solve for the nu parameter
    nu <- mean_p * (1 - mean_p) / var_true - 1
    
    # ROBUSTNESS FIX: If nu is negative, the estimate is unstable and can lead
    # to negative variance. Fall back to the non-informative prior (MLE).
    if (nu <= 0) {
        return(list(alpha = 0, beta = 0))
    }
    
    # Solve for alpha and beta
    list(alpha = mean_p * nu, beta = (1 - mean_p) * nu)
}


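#' @title Calculate Empirical Bayes Summaries for Count Data
#' @description Uses an empirical Bayes approach to "shrink" noisy estimates from
#'   bins with little data towards a more stable global average.
#' @param k Counts of the cell type(s) of interest per bin.
#' @param n Total cell counts per bin.
#' @param bin_lvls Bin labels, in plotting order.
#' @param model Label stored in the output. The estimates below always use the
#'   Beta-Binomial posterior, whichever label is passed.
#' @return A data.table with detailed statistics for each bin.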
empirical_bayes_summary <- function(k, n, bin_lvls, model = "binomial") {
    model <- match.arg(model, c("mle", "binomial", "poisson"))
    if (length(k) != length(n)) stop("Input vectors 'k' and 'n' must have the same length.")
    
    prior <- estimate_beta_prior(k, n)
    est <- (k + prior$alpha) / (n + prior$alpha + prior$beta)
    var <- ((k + prior$alpha) * (n - k + prior$beta)) /
           ((n + prior$alpha + prior$beta)^2 * (n + prior$alpha + prior$beta + 1))
    
    df = data.table(
        dist_bin = factor(bin_lvls, levels = bin_lvls),
        model = model, count = k, size = n, estimate = est, variance = var,
        alpha = prior$alpha, beta = prior$beta
    )

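    # One-sided z-test of estimate > 0 (upper-tail normal probability)
    df[, p := exp(pnorm(estimate / sqrt(variance), lower.tail = FALSE, log.p = TRUE))]
    # p.adjust() defaults to the Holm correction
    df[, padj := p.adjust(p)]
    df[, asterisk := ifelse(padj < 0.01, "*", "")]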
    
    return(df)
}

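#' @title Perform Meta-Analysis for a Given Cell Type
#' @description Orchestrates the analysis across multiple samples for a single cell type.
#' @param counts_list Named list of per-sample count matrices (bins x cell types).
#' @param .types Character vector of column names to sum into one cell type.
#' @return A data.table with the meta-analyzed estimate, variance, and p-values per distance bin.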
get_stats = function(counts_list, .types) {
    df_list = imap(counts_list, function(counts, .id) {
        empirical_bayes_summary(
            rowSums(counts[, .types, drop = FALSE]),
            rowSums(counts),
            rownames(counts),
            'binomial'
        )
    })
    
    df = bind_rows(df_list, .id = 'SampleID')[
        , .(SampleID, dist_bin, estimate, variance)
    ][
        , meta_ashr(estimate, variance), dist_bin
    ]
    
    df[, p := exp(pnorm(estimate / sqrt(variance), lower.tail = FALSE, log.p = TRUE))]
    df[, padj := p.adjust(p)]
    df[, asterisk := case_when(is.na(padj) ~ '', padj < 0.01 ~ "*", TRUE ~ '')]
    
    return(df[])
}

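#' @title Perform Meta-Analysis using Adaptive Shrinkage
#' @description Uses `ashr` to combine effect estimates, weighting by precision.
#' @param p_vec Per-sample estimates for one distance bin.
#' @param var_vec Per-sample variances of those estimates.
#' @return A one-row data.table with the pooled `estimate` and `variance`.
meta_ashr <- function(p_vec, var_vec) {
    ash_fit = ashr::ash(betahat = p_vec, sebetahat = sqrt(var_vec), method = "fdr", mixcompdist = 'normal')
    # Normalized inverse-variance weights from the posterior SDs (1e-8 guards against division by zero)
    w = prop.table(1 / (ash_fit$result$PosteriorSD^2 + 1e-8))
    data.table(
        estimate = sum(w * ash_fit$result$PosteriorMean),
        # Weighted average of the posterior variances, i.e. approximately their
        # harmonic mean; t_test_and_lfc() later divides this by the number of samples
        variance = sum(w * ash_fit$result$PosteriorSD^2)
    )
}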
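
As a worked illustration of the shrinkage step, the sketch below repeats the method-of-moments prior fit and the Beta-Binomial posterior mean on invented counts (`k` and `n` are hypothetical). It is a simplified stand-in for `estimate_beta_prior` and `empirical_bayes_summary`, without their edge-case handling.

```r
# Hypothetical per-bin data: k cells of the type of interest out of n cells
k <- c(1, 30, 5, 45, 0, 12)
n <- c(4, 200, 100, 150, 50, 300)

# Method of moments for a Beta(alpha, beta) prior on the bin proportions
p_hat <- k / n
m <- mean(p_hat)
# Subtract the expected binomial sampling noise from the observed variance
v <- var(p_hat) - m * (1 - m) / mean(n)
stopifnot(v > 0)             # otherwise no prior can be estimated
nu <- m * (1 - m) / v - 1    # prior "sample size", alpha + beta
alpha <- m * nu
beta  <- (1 - m) * nu

# Posterior mean: the raw proportion pulled toward the prior mean m
shrunk <- (k + alpha) / (n + alpha + beta)

round(data.frame(n, raw = p_hat, shrunk), 3)
```

The posterior mean equals `(n * p_hat + nu * m) / (n + nu)`, so a bin with few cells (here the first, with `n = 4`) moves much further toward the global mean than a bin with hundreds of cells.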

### Utility Functions

In [None]:
#' @title Standardize Matrix Columns in a List
#' @description Ensures all matrices in a list have the exact same set of columns.
standardize_matrix_columns <- function(mat_list) {
    all_cols <- sort(unique(unlist(lapply(mat_list, colnames))))
    lapply(mat_list, function(mat) {
        missing_cols <- setdiff(all_cols, colnames(mat))
        if (length(missing_cols) > 0) {
            zeros <- matrix(0, nrow = nrow(mat), ncol = length(missing_cols),
                            dimnames = list(rownames(mat), missing_cols))
            mat <- cbind(mat, zeros)
        }
        mat[, all_cols, drop = FALSE]
    })
}

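#' @title Calculate Welch's T-test and Log2 Fold Change from Summary Statistics
#' @description Performs a t-test when only summary statistics are available.
#' @param mu1,mu2 Group means.
#' @param var1,var2 Group variances.
#' @param n1,n2 Group sample sizes (each must exceed 1 for the degrees of freedom).
#' @return A list with the two-sided `p_value` and `log2_fold_change` (`log2(mu1 / mu2)`).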
t_test_and_lfc <- function(mu1, var1, n1, mu2, var2, n2) {
  se_diff <- sqrt(var1 / n1 + var2 / n2)
  t_stat <- (mu1 - mu2) / se_diff
  df_num <- (var1 / n1 + var2 / n2)^2
  df_denom <- ((var1 / n1)^2) / (n1 - 1) + ((var2 / n2)^2) / (n2 - 1)
  df <- df_num / df_denom
  p_value <- 2 * pt(-abs(t_stat), df)
  lfc <- log2((mu1) / (mu2))
  return(list(p_value = p_value, log2_fold_change = lfc))
}
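
A quick sanity check on the summary-statistic formulas: fed the sample mean, sample variance, and sample size of raw data, they should reproduce `t.test(..., var.equal = FALSE)`. The sketch below restates the formulas in a compact helper (`welch_from_summary` is a hypothetical name) and compares against base R on invented proportions.

```r
# Compact re-statement of the Welch formulas (hypothetical helper name)
welch_from_summary <- function(mu1, var1, n1, mu2, var2, n2) {
  se2 <- var1 / n1 + var2 / n2
  t_stat <- (mu1 - mu2) / sqrt(se2)
  # Welch-Satterthwaite degrees of freedom
  df <- se2^2 / ((var1 / n1)^2 / (n1 - 1) + (var2 / n2)^2 / (n2 - 1))
  list(t = t_stat, df = df, p_value = 2 * pt(-abs(t_stat), df))
}

# Invented per-sample proportions for two groups
x <- c(0.12, 0.15, 0.09, 0.20, 0.14)
y <- c(0.05, 0.07, 0.11, 0.06)

from_summary <- welch_from_summary(mean(x), var(x), length(x),
                                   mean(y), var(y), length(y))
reference <- t.test(x, y, var.equal = FALSE)

all.equal(from_summary$p_value, reference$p.value)
log2(mean(x) / mean(y))   # log2 fold change of the group means
```

The agreement holds when the inputs are moments of raw observations. In this notebook the function receives meta-analyzed estimates and their variances instead, so the p-values rest on those variances playing the role of a per-sample variance.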

--- 
## 3. Data Loading and Preprocessing

We load the main cell data and the interface geometries. Cell type annotations are simplified to group related subtypes (`type_lvl3`), which forms the basis of our analysis.
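
The label simplification can be previewed on a small character vector. In this sketch the labels are toy values (`'Epi-Goblet'` in particular is invented for illustration); the two rules mirror the `case_when` and `gsub` calls used in the preprocessing.

```r
# Toy level-2 labels (hypothetical mix)
type_lvl2 <- c('Epi-Stem_TAlike', 'Epi-Goblet', 'Myeloid-ISGlow',
               'Myeloid-ISG', 'Tcd8-GZMK')

# Merge the ISG-low myeloid state into Myeloid-ISG
type_lvl2 <- ifelse(type_lvl2 == 'Myeloid-ISGlow', 'Myeloid-ISG', type_lvl2)

# Collapse everything from 'Epi' onward into a single 'Epi' class
type_lvl3 <- gsub('Epi.*', 'Epi', type_lvl2)

table(type_lvl3)
```

The pattern `'Epi.*'` is unanchored, so it would also rewrite a label that merely contains `Epi` further along; `'^Epi.*'` would restrict the rule to labels that start with it.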

In [None]:
tiles_to_omit = read.csv('../Tessera tiles/Tessera processed results/tiles_to_exclude_from_interface_analysis.csv') %>%
    filter(tiles_to_exclude_from_interface_analysis != '') %>%
    pull(agg_id)
length(tiles_to_omit)
head(tiles_to_omit)

In [None]:
# Load the cell-level table once; the MMR map and the cell data both derive from it
cells = readr::read_rds('../Tessera tiles/Tessera processed results/tile_metadata_2025-07-22.rds') %>%
    as.data.table()

# Sample-to-MMR-status mapping (a data.table, so `mmr_map[MMRstatus == ...]` works later)
mmr_map = cells %>%
    select(c('PatientID', 'SampleID', 'MMRstatus')) %>%
    distinct() %>%
    as.data.table()

# Give mast cells their own level-1 lineage
cells$type_lvl1[cells$type_lvl2 == 'Mast'] = 'Mast'

# Simplify cell type annotations
cells <- cells %>%
    filter(!agg_id %in% tiles_to_omit) %>%
    mutate(type_lvl2 = case_when(type_lvl2 == 'Myeloid-ISGlow' ~ 'Myeloid-ISG', .default = type_lvl2)) %>%
    mutate(type_lvl3 = type_lvl2) %>%
    #mutate(type_lvl3 = gsub(type_lvl2, pattern = '-prolif', replacement = '')) %>% # |high|low|-PD1
    mutate(type_lvl3 = gsub(type_lvl3, pattern = 'Epi.*', replacement = 'Epi')) %>% 
    select(c('PatientID', 'SampleID', 'MMRstatus', 'X', 'Y', 'tessera_annotation', 'type_lvl3', 'type_lvl1', 'type_lvl2', 'cell_id', 'cxcl_pos_tile'))

glimpse(cells)

In [None]:
# Load interface geometry files for each sample
ids = unique(cells$SampleID)
interface_dir = '../Tessera tiles/Spatial objects for tumor-stromal interfaces in all MERFISH samples/'
interface_files = list.files(path = interface_dir, pattern = '_tumor_stromal_interfaces.rds', full.names = TRUE)
interfaces = map(ids, function(.id) {
    # Pick the file whose path contains this sample ID
    fname = normalizePath(interface_files[grepl(interface_files, pattern = .id)])
    readRDS(fname)
})
names(interfaces) = ids

glimpse(interfaces[[1]])

--- 
## 4. Per-Sample Distance Calculation

This is the main computational step. We use `future_map` to run the `get_bins` function in parallel for each sample. This generates a list of binned cell count matrices, which are then standardized to ensure consistent structure for the meta-analysis.
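
The standardization matters because a cell type that is absent from one sample produces no column in that sample's matrix. The sketch below shows the idea on two toy matrices (all names and values are hypothetical). It pads by filling a zero matrix, a slightly different formulation from the `cbind` approach in `standardize_matrix_columns` that should give the same result.

```r
# Two toy count matrices with different cell-type columns (hypothetical values)
bins <- c('(-5,0]', '(0,5]')
m1 <- matrix(c(3, 1, 0, 2), nrow = 2, dimnames = list(bins, c('Epi', 'Mast')))
m2 <- matrix(c(5, 4), nrow = 2, dimnames = list(bins, 'Pericyte'))
mats <- list(S1 = m1, S2 = m2)

# Union of all column names, in a fixed (sorted) order
all_cols <- sort(unique(unlist(lapply(mats, colnames))))

padded <- lapply(mats, function(m) {
  out <- matrix(0, nrow = nrow(m), ncol = length(all_cols),
                dimnames = list(rownames(m), all_cols))
  out[, colnames(m)] <- m   # copy observed counts, leave the rest at zero
  out
})

padded$S2
```

After padding, every matrix has the same columns in the same order, so a column name can be used to index any sample.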

In [None]:
# Set a higher limit for global variables when using parallel processing
options(future.globals.maxSize = 1e10)

# Run get_bins for each sample in parallel
system.time({
    counts_list = future_map(ids, function(.id) {
        get_bins(cells[SampleID == .id], interfaces[[.id]])    
    }, .options = furrr::furrr_options(seed=TRUE))
    names(counts_list) = ids
})
    
# Standardize matrices to ensure consistent columns across all samples
counts_list = standardize_matrix_columns(counts_list)

In [None]:
counts_list %>% length

In [None]:
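#' @title Plot One Sample's Distance Profile for a Set of Cell Types
#' @description Plots the shrunken percentage of `.types` cells per distance bin
#'   with 95% intervals; the grey vertical line marks the interface (between bins 20 and 21).
interface_plot = function(counts, .types, est_model=c('binomial', 'poisson', 'mle')) {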
    est_model <- match.arg(est_model)
    df = empirical_bayes_summary(
        rowSums(counts[, .types, drop = FALSE]),
        rowSums(counts),
        rownames(counts),
        est_model
    ) 

    ## get max y value for plotting 
    ymax = 100 * max(df$estimate + 1.96 * sqrt(df$variance))
    
    p1 = ggplot(df, aes(dist_bin, 100 * estimate)) + 
        geom_vline(xintercept = c(20.5), size = 2, linetype = 1, color = 'grey') + 
        geom_point(aes(size = size)) + 
        geom_errorbar(aes(ymin = 100 * (estimate - 1.96 * sqrt(variance)), ymax = 100 * (estimate + 1.96 * sqrt(variance))), width = 0) + 
        geom_hline(yintercept = 0) + 
        geom_line(data = . %>% dplyr::mutate(dist_bin = as.numeric(dist_bin))) + 
        theme_bw(base_size = 16) + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
        labs(y = '% of all cells', x = 'Distance Window', size = '# Cells', subtitle = 'mean & 95% CI, *padj<0.01', title = paste(.types, collapse = '; ')) + 
        geom_text(aes(y = 100 * (estimate + 1.96 * sqrt(variance)), label = asterisk), size = 6, vjust = 0) + 
        annotate("text", x = 0.5, y = ymax + .05, label = 'Stromal Side', hjust = 0, size = 6) + 
        annotate("text", x = 40.5, y = ymax + .05, label = 'Epithelial Side', hjust = 1, size = 6) + 
        NULL
    return(p1)
}

In [None]:
.types = grep('Endo-ven', colnames(counts_list$C110), value = TRUE)

In [None]:
fig.size(18, 32)
require(patchwork)
# imap(counts_list[mmr_map[MMRstatus == 'MMRp']$SampleID], function(counts, .id) {    
imap(counts_list[mmr_map[MMRstatus == 'MMRd']$SampleID], function(counts, .id) {    
    interface_plot(counts, .types, 'binomial') + labs(title = glue('{.id} (MMRd)'))    
}) %>% wrap_plots() + plot_annotation(title = paste0(.types, collapse =  ', '))

In [None]:
fig.size(18, 32)
require(patchwork)
imap(counts_list[mmr_map[MMRstatus == 'MMRp']$SampleID], function(counts, .id) {    
    interface_plot(counts, .types, 'binomial') + labs(title = glue('{.id} (MMRp)'))    
}) %>% wrap_plots()  + plot_annotation(title = paste0(.types, collapse =  ', '))

In [None]:
.types = grep('Epi-Stem_TAlike', colnames(counts_list$C110), value = TRUE)
.types

In [None]:
fig.size(18, 32)
require(patchwork)
# imap(counts_list[mmr_map[MMRstatus == 'MMRp']$SampleID], function(counts, .id) {    
imap(counts_list[mmr_map[MMRstatus == 'MMRd']$SampleID], function(counts, .id) {    
    interface_plot(counts, .types, 'binomial') + labs(title = glue('{.id} (MMRd)'))    
}) %>% wrap_plots() + plot_annotation(title = paste0(.types, collapse =  ', '))

In [None]:
fig.size(18, 32)
require(patchwork)
imap(counts_list[mmr_map[MMRstatus == 'MMRp']$SampleID], function(counts, .id) {
    interface_plot(counts, .types, 'binomial') + labs(title = glue('{.id} (MMRp)'))
}) %>% wrap_plots() + plot_annotation(title = paste0(.types, collapse =  ', '))

--- 
## 5. Global Meta-Analysis: Comparing MMRd vs. MMRp

This master function automates the entire statistical comparison. It iterates through a list of cell types, runs the meta-analysis comparing MMRd vs. MMRp samples for each type, performs global FDR correction, and combines all results into a set of tidy data frames ready for plotting and interpretation.
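
To see what the global correction changes, the sketch below adjusts six invented p-values (two hypothetical cell types, three bins each) in a single Benjamini-Hochberg pass and, for comparison only, separately within each cell type. In `p.adjust`, `method = 'fdr'` is an alias for `'BH'`.

```r
# Hypothetical p-values from two cell types, three distance bins each
p_values <- c(typeA_bin1 = 0.001, typeA_bin2 = 0.02, typeA_bin3 = 0.30,
              typeB_bin1 = 0.004, typeB_bin2 = 0.04, typeB_bin3 = 0.80)

# Global correction: one Benjamini-Hochberg pass over all six tests
padj_global <- p.adjust(p_values, method = 'fdr')

# Per-type correction, for comparison only
group <- sub('_bin.*', '', names(p_values))
padj_by_type <- unlist(lapply(split(p_values, group), p.adjust, method = 'fdr'),
                       use.names = FALSE)

data.frame(p = p_values, padj_global, padj_by_type)
```

In this toy example the global adjustment is at least as large as the per-type one for every test, which is the trade-off for controlling the false discovery rate over the whole family of comparisons.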

In [None]:
#' @title Run a Global Analysis Comparing MMR Status
#' @description This master function automates the entire statistical comparison.
run_global_mmr_analysis <- function(types_list, counts_list, mmr_map) {
  
  # Dynamically calculate sample sizes to make the analysis robust
  n_msi <- length(unique(mmr_map[MMRstatus == 'MMRd']$SampleID))
  n_mss <- length(unique(mmr_map[MMRstatus == 'MMRp']$SampleID))
  
  # Iterate over the simplified cell type names (`type_lvl3`)
  results_by_type <- purrr::imap(types_list, function(.x, .y) {
    .types <- .y # Use the name of the list element (the correct type) for subsetting
    
    # Run meta-analysis for each group
    df_MSI <- get_stats(counts_list[mmr_map[MMRstatus == 'MMRd']$SampleID], .types)
    df_MSS <- get_stats(counts_list[mmr_map[MMRstatus == 'MMRp']$SampleID], .types)
    
    # Combine results for direct comparison
    df <- bind_rows(list(MSI = df_MSI, MSS = df_MSS), .id = 'Status')
    
    # Reshape and run Welch's t-test on the meta-analyzed estimates
    df_stat <- dcast(df, dist_bin ~ Status, value.var = c('estimate', 'variance'))[
      , c('p', 'log2_fold_change') := t_test_and_lfc(estimate_MSI, variance_MSI, n_msi, estimate_MSS, variance_MSS, n_mss), dist_bin
    ]
    
    return(list(MSI_data = df_MSI, MSS_data = df_MSS, stats_data = df_stat))
  })
  
  # Restructure the list of lists into a more usable format
  transposed_results <- purrr::transpose(results_by_type)
  all_MSI_df <- dplyr::bind_rows(transposed_results$MSI_data, .id = "cell_type")
  all_MSS_df <- dplyr::bind_rows(transposed_results$MSS_data, .id = "cell_type")
  summary_stats <- dplyr::bind_rows(transposed_results$stats_data, .id = "cell_type")
  
  # Perform global FDR correction across all p-values from all tests
  summary_stats[, padj_global := p.adjust(p, method = 'fdr')]
  summary_stats[, asterisk := fifelse(padj_global < 0.01, "*", "")]
  summary_stats[, height := max(estimate_MSI + 1.96 * sqrt(variance_MSI), estimate_MSS + 1.96 * sqrt(variance_MSS)), by = .(cell_type, dist_bin)]
  
  # Return the final, tidy list of results
  return(list(summary_stats = summary_stats, MSI_results = all_MSI_df, MSS_results = all_MSS_df))
}

--- 
## 6. Execute Analysis and Prepare for Visualization

We now execute the main analysis function and prepare the final data frames for plotting.

In [None]:
# Create the list of cell types to analyze
cellTypes = cells %>% select(type_lvl2, type_lvl3) %>% distinct
type_list <- lapply(split(cellTypes$type_lvl2, cellTypes$type_lvl3), unique)

# Run the analysis
final_results <- run_global_mmr_analysis(type_list, counts_list, mmr_map)

# Prepare a combined data frame for plotting
df_plot <- bind_rows(list(MSI = final_results$MSI_results, MSS = final_results$MSS_results), .id = 'Status')

df_plot = df_plot %>%
    group_by(cell_type) %>%
    mutate(ymax = 100 * max(estimate + 1.96 * sqrt(variance))) %>%
    ungroup
head(df_plot)

# Create a list for grouping cell types by lineage for faceted plots
lineage_list <- cells %>% select(type_lvl1, type_lvl3) %>% distinct %>% {split(.$type_lvl3, .$type_lvl1)}

head(final_results$summary_stats)

final_results$MSS_results %>% fwrite(., 'input_data/MSS_results.csv')

## Prepare a table of cell counts in spatial bins

In [None]:
head(cells)

In [None]:
lapply(counts_list, function(x) x %>%
    as.data.frame() %>%
    tibble::rownames_to_column('bin')) %>%
    rbindlist(idcol = 'sample') %>%
    head

In [None]:
require(forcats)
require(tidyverse)
find_midpoint <- function(interval_string) {
  # 1. Remove parentheses and brackets using gsub
  # The alternation matches a literal "(", "[", ")" or "]"; each one is
  # escaped with \\ because these characters are regex metacharacters.
  cleaned_string <- gsub("\\(|\\[|\\)|\\]", "", interval_string)
  
  # 2. Split the string by the comma
  # strsplit returns a list, so we take the first element [[1]]
  num_strings <- strsplit(cleaned_string, ",")[[1]]
  
  # 3. Convert character vector to numbers and calculate the mean
  midpoint <- mean(as.numeric(num_strings))
  
  return(midpoint)
}
allCounts = lapply(counts_list, function(x) x %>%
    as.data.frame() %>%
    tibble::rownames_to_column('bin')) %>%
rbindlist(idcol = 'sample') %>%
mutate(midpoint = unlist(lapply(bin, find_midpoint))) 

allCounts$total_counts_per_bin = allCounts %>% select(!c(bin, sample, midpoint)) %>% rowSums
allCounts$total_TNKILC_per_bin = allCounts %>% select(all_of(lineage_list[['TNKILC']])) %>% rowSums

# Sum counts per patient, bin, and cell type, then write one column per cell type
allCounts %>%
    pivot_longer(cols = !c(bin, sample, midpoint, total_counts_per_bin, total_TNKILC_per_bin)) %>%
    mutate(name = gsub(pattern = 'Epi.*', replacement = 'Epi', x = name)) %>%
    left_join(cells %>% select(PatientID, SampleID, MMRstatus) %>% rename(sample = SampleID) %>% distinct(), by = 'sample') %>%
    group_by(name, bin, PatientID, MMRstatus, midpoint) %>%
    summarize(value = sum(value), .groups = 'drop') %>%
    pivot_wider(values_from = value, names_from = name) %>%
    write.csv('counts_in_bins_MMRp_MMRd.csv')
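
The midpoint parsing can also be done in one vectorized pass. The sketch below (the helper name `bin_midpoints` is hypothetical) strips the brackets from `cut()`-style labels and averages the two bounds. It is an alternative formulation, not the notebook's own function.

```r
bin_midpoints <- function(labels) {
  # Drop '(', '[', ')' and ']' and split "lo,hi" into its two numbers
  cleaned <- gsub("\\(|\\[|\\)|\\]", "", labels)
  bounds <- strsplit(cleaned, ",", fixed = TRUE)
  vapply(bounds, function(b) mean(as.numeric(b)), numeric(1))
}

labels <- c('[-100,-95]', '(-5,0]', '(0,5]', '(95,100]')
bin_midpoints(labels)
```

Returning a numeric vector directly avoids the `unlist(lapply(...))` wrapper, and `vapply` fails loudly if a label does not parse to a single number.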

--- 
## 7. Visualization

In this final section, we generate faceted plots to visualize the results. We create plots that group cell types by their major lineage (e.g., T-cells, Myeloid cells) to compare their distribution profiles between MMRd (MSI) and MMRp (MSS) samples.

In [None]:
df_plot$cell_type[df_plot$cell_type %in% lineage_list[['TNKILC']]] %>% unique

### Epi

In [None]:
fig.size(h = 9, w = 16)
options(repr.plot.res = 300)
lineage = 'Epi'

# Create the plot
ggplot(
    df_plot %>%
        filter(cell_type %in% lineage_list[[lineage]]),
    # Global aesthetics shared by all layers
    aes(x = dist_bin, y = 100 * estimate, color = Status)
) +
    geom_vline(xintercept = 20.5, size = 2, linetype = 1, color = 'grey') +
    geom_point() +
    geom_errorbar(
        aes(ymin = 100 * (estimate - 1.96 * sqrt(variance)), ymax = 100 * (estimate + 1.96 * sqrt(variance))),
        width = 0,
        show.legend = FALSE
    ) +
    geom_hline(yintercept = 0) +
    geom_line(data = . %>% dplyr::mutate(dist_bin = as.numeric(dist_bin)), show.legend = FALSE) +
    cowplot::theme_half_open(10) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(
        y = 'Percent of all cells',
        x = 'Distance Window',
        size = '# Cells',
        subtitle = 'IVW meta-analysis; mean & 95% CI, *padj<0.01',
        title = lineage
    ) +
    geom_text(
        data = final_results$summary_stats %>%
            filter(cell_type %in% lineage_list[[lineage]]),
        aes(y = 100 * height, label = asterisk),
        size = 6,
        vjust = 0.2,
        show.legend = FALSE,
        color = 'black'
    ) +
    geom_text(
        data = df_plot %>%
            filter(cell_type %in% lineage_list[[lineage]]) %>%
            select(cell_type, ymax) %>%
            distinct() %>%
            mutate(ymax = ymax + ymax / 3, x = 32, label = '\nEpi-\nenriched'),
        aes(label = label, y = ymax, x = x),
        color = 'black',
        size = 3
    ) +
    geom_text(
        data = df_plot %>%
            filter(cell_type %in% lineage_list[[lineage]]) %>%
            select(cell_type, ymax) %>%
            distinct() %>%
            mutate(ymax = ymax + ymax / 3, x = 8, label = '\nStroma-\nenriched'),
        aes(label = label, y = ymax, x = x),
        color = 'black',
        size = 3
    ) +
    scale_color_manual(
        name = "MMR Status",
        values = c("MSI" = "red", "MSS" = "blue"),
        labels = c("MSI" = "MMRd (MSI)", "MSS" = "MMRp (MSS)")
    ) +
    guides(color = guide_legend(override.aes = list(size = 2, shape = 16))) +
    facet_wrap(~cell_type, scales = 'free') +
    theme(
        aspect.ratio = 0.5,
        axis.text.x = element_text(size = 4),
        strip.background = element_rect(fill = NA),
        strip.text = element_text(size = 10, face = 'bold', color = 'black'),
        title = element_text(size = 10),
        legend.position = 'top',
        legend.text = element_text(size = 10)
    )

### TNKILCs

In [None]:
fig.size(h = 9, w = 16)
options(repr.plot.res = 300)
lineage = 'TNKILC'
ggplot(df_plot %>% 
       filter(cell_type %in% lineage_list[[lineage]]) %>% 
       mutate(cell_type = factor(cell_type, ordered = TRUE, levels = lineage_list[[lineage]])),
    aes(dist_bin, 100 * estimate, color = Status)) + 
    geom_vline(xintercept = c(20.5), size = 2, linetype = 1, color = 'grey') + 
    geom_point() + 
    geom_errorbar(aes(ymin = 100 * (estimate - 1.96 * sqrt(variance)), ymax = 100 * (estimate + 1.96 * sqrt(variance))), width = 0, show.legend = FALSE) + 
    geom_hline(yintercept = 0) + 
    geom_line(data = . %>% dplyr::mutate(dist_bin = as.numeric(dist_bin)), show.legend = FALSE) + 
    cowplot::theme_half_open(10) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(y = 'Percent of all cells', x = 'Distance Window', size = '# Cells', 
         subtitle = 'IVW meta-analysis; mean & 95% CI, *padj<0.01', title = lineage) + 
    geom_text(
        data = final_results$summary_stats %>% 
            filter(cell_type %in% lineage_list[[lineage]]) %>%
            mutate(cell_type = factor(cell_type, ordered = TRUE, levels = lineage_list[[lineage]])),
        aes(y = 100 * height, label = asterisk), size = 6, vjust = .2, show.legend = FALSE,
        color = 'black'
    ) + 
    geom_text(data = df_plot %>%
                  filter(cell_type %in% lineage_list[[lineage]]) %>%
                  mutate(cell_type = factor(cell_type, ordered = TRUE, levels = lineage_list[[lineage]])) %>%
                  select(cell_type, ymax) %>% distinct() %>% mutate(ymax = ymax + ymax/3, x = 32, label = '\nEpi-\nenriched'), 
                  aes(label = label, y = ymax, x = x), color = 'black', size = 3) +
    geom_text(data = df_plot  %>% 
                  filter(cell_type %in% lineage_list[[lineage]]) %>% 
                  mutate(cell_type = factor(cell_type, ordered = TRUE, levels = lineage_list[[lineage]])) %>%
                  select(cell_type, ymax) %>% distinct() %>% mutate(ymax = ymax + ymax/3, x = 8, label = '\nStroma-\nenriched'), 
                  aes(label = label, y = ymax, x = x), color = 'black', size = 3) +
    scale_color_manual(
        name = "MMR Status",
        values = c("MSI" = "red", "MSS" = "blue"),
        labels = c("MSI" = "MMRd (MSI)", "MSS" = "MMRp (MSS)")
    ) + 
    guides(color = guide_legend(override.aes = list(size = 2, shape = 16))) + 
    facet_wrap(~cell_type, scales = 'free') +
    theme(aspect.ratio = 0.5, axis.text.x = element_text(size = 4), strip.background = element_rect(fill = NA), strip.text = element_text(size = 10, face = 'bold', color = 'black'), title = element_text(size = 10), legend.position = 'top', legend.text=element_text(size=10)) +
    guides(fill = guide_legend(override.aes = list(nrow = 1, shape = 16))) +
    NULL

### Myeloid

In [None]:
lineage_list[['Myeloid']]

In [None]:
df_plot$cell_type %>% unique

In [None]:
lineage_list[['Myeloid']] %in% df_plot$cell_type 

In [None]:
fig.size(h = 9, w = 16)
options(repr.plot.res = 300)
lineage = 'Myeloid'
# Facet order follows the order of cell types in lineage_list
myeloid_order <- lineage_list[['Myeloid']]

ggplot(df_plot %>% 
       filter(cell_type %in% lineage_list[[lineage]]) %>%
       mutate(cell_type = factor(cell_type, ordered = TRUE, levels = myeloid_order)), 
       aes(dist_bin, 100 * estimate, color = Status)) + 
    geom_vline(xintercept = c(20.5), size = 2, linetype = 1, color = 'grey') + 
    geom_point() + 
    geom_errorbar(aes(ymin = 100 * (estimate - 1.96 * sqrt(variance)), ymax = 100 * (estimate + 1.96 * sqrt(variance))), width = 0, show.legend = FALSE) + 
    geom_hline(yintercept = 0) + 
    geom_line(data = . %>% dplyr::mutate(dist_bin = as.numeric(dist_bin)), show.legend = FALSE) + 
    cowplot::theme_half_open(10) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(y = 'Percent of all cells', x = 'Distance Window', size = '# Cells', subtitle = 'IVW meta-analysis; mean & 95% CI, *padj<0.01', title = lineage) + 
    geom_text(
        data = final_results$summary_stats %>% 
               filter(cell_type %in% lineage_list[[lineage]]) %>%
               mutate(cell_type = factor(cell_type, ordered = TRUE, levels = myeloid_order)),
        aes(y = 100 * height, label = asterisk), size = 6, vjust = .2, show.legend = FALSE,
        color = 'black'
    ) + 
    geom_text(data = df_plot %>% 
                  filter(cell_type %in% lineage_list[[lineage]]) %>% 
                  mutate(cell_type = factor(cell_type, ordered = TRUE, levels = myeloid_order)) %>% 
                  select(cell_type, ymax) %>% distinct() %>% mutate(ymax = ymax + ymax/3, x = 32, label = '\nEpi-\nenriched'), 
                  aes(label = label, y = ymax, x = x), color = 'black', size = 3) +
    geom_text(data = df_plot %>% 
                  filter(cell_type %in% lineage_list[[lineage]]) %>% 
                  mutate(cell_type = factor(cell_type, ordered = TRUE, levels = myeloid_order)) %>% 
                  select(cell_type, ymax) %>% distinct() %>% mutate(ymax = ymax + ymax/3, x = 8, label = '\nStroma-\nenriched'), 
                  aes(label = label, y = ymax, x = x), color = 'black', size = 3) +
    scale_color_manual(
        name = "MMR Status",
        values = c("MSI" = "red", "MSS" = "blue"),
        labels = c("MSI" = "MMRd (MSI)", "MSS" = "MMRp (MSS)")
    ) + 
    guides(color = guide_legend(override.aes = list(size = 2, shape = 16))) + 
    facet_wrap(~cell_type, scales = 'free') +
    theme(aspect.ratio = 0.5, axis.text.x = element_text(size = 4), strip.background = element_rect(fill = NA), strip.text = element_text(size = 10, face = 'bold', color = 'black'), title = element_text(size = 10), legend.position = 'top', legend.text=element_text(size=10)) +
    guides(fill = guide_legend(override.aes = list(nrow = 1, shape = 16))) +
    NULL

In [None]:
# Manual facet ordering for stromal cell types
strom_order = c("Fibro-BMP", "Fibro-CCL2", "Fibro-StemNiche", "Fibro-MMP3", "Fibro-CXCL14", "Fibro-GREM1", "Fibro-myo", "SmoothMuscle", "Pericyte", "Endo-art", "Endo-cap", "Endo", "Endo-tip", "Endo-ven", "Endo-lymph", "Schwann")
# Sanity check: list any Strom cell types that are missing from the manual ordering
lineage_list[['Strom']][!lineage_list[['Strom']] %in% strom_order]

### Strom

In [None]:
fig.size(h = 9, w = 16)
options(repr.plot.res = 300)
lineage = 'Strom'
strom_order = lineage_list[['Strom']] #c("Fibro-BMP", "Fibro-CCL2", "Fibro-StemNiche", "Fibro-MMP3", "Fibro-CXCL14", "Fibro-GREM1", "Fibro-myo", "SmoothMuscle", "Pericyte", "Endo-art", "Endo-cap", "Endo", "Endo-tip", "Endo-ven", "Endo-lymph", "Schwann")
ggplot(df_plot %>% 
       filter(cell_type %in% lineage_list[[lineage]]) %>%
       mutate(cell_type = factor(cell_type, ordered = TRUE, levels = strom_order)), 
       aes(dist_bin, 100 * estimate, color = Status)) + 
    geom_vline(xintercept = c(20.5), size = 2, linetype = 1, color = 'grey') + 
    geom_point() + 
    geom_errorbar(aes(ymin = 100 * (estimate - 1.96 * sqrt(variance)), ymax = 100 * (estimate + 1.96 * sqrt(variance))), width = 0, show.legend = FALSE) + 
    geom_hline(yintercept = 0) + 
    geom_line(data = . %>% dplyr::mutate(dist_bin = as.numeric(dist_bin)), show.legend = FALSE) + 
    cowplot::theme_half_open(10) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(y = 'Percent of all cells', x = 'Distance Window', size = '# Cells', subtitle = 'IVW meta-analysis; mean & 95% CI, *padj<0.01', title = lineage) + 
    geom_text(
        data = final_results$summary_stats %>% 
               filter(cell_type %in% lineage_list[[lineage]]) %>%
               mutate(cell_type = factor(cell_type, ordered = TRUE, levels = strom_order)),
        aes(y = 100 * height, label = asterisk), size = 6, vjust = .2, show.legend = FALSE,
        color = 'black'
    ) + 
    geom_text(data = df_plot %>% 
                  filter(cell_type %in% lineage_list[[lineage]]) %>% 
                  mutate(cell_type = factor(cell_type, ordered = TRUE, levels = strom_order)) %>% 
                  select(cell_type, ymax) %>% distinct() %>% mutate(ymax = ymax + ymax/3, x = 32, label = '\nEpi-\nenriched'), 
                  aes(label = label, y = ymax, x = x), color = 'black', size = 3) +
    geom_text(data = df_plot %>% 
                  filter(cell_type %in% lineage_list[[lineage]]) %>% 
                  mutate(cell_type = factor(cell_type, ordered = TRUE, levels = strom_order)) %>% 
                  select(cell_type, ymax) %>% distinct() %>% mutate(ymax = ymax + ymax/3, x = 8, label = '\nStroma-\nenriched'), 
                  aes(label = label, y = ymax, x = x), color = 'black', size = 3) +
    scale_color_manual(
        name = "MMR Status",
        values = c("MSI" = "red", "MSS" = "blue"),
        labels = c("MSI" = "MMRd (MSI)", "MSS" = "MMRp (MSS)")
    ) + 
    guides(color = guide_legend(override.aes = list(size = 2, shape = 16))) + 
    facet_wrap(~cell_type, scales = 'free') +
    theme(aspect.ratio = 0.5, axis.text.x = element_text(size = 4), strip.background = element_rect(fill = NA), strip.text = element_text(size = 10, face = 'bold', color = 'black'), title = element_text(size = 10), legend.position = 'top', legend.text=element_text(size=10)) +
    guides(fill = guide_legend(override.aes = list(nrow = 1, shape = 16))) +
    NULL


### B, Plasma, Mast

In [None]:
fig.size(h = 9, w = 16)
options(repr.plot.res = 300)
lineage = 'B|Plasma|Mast'
ggplot(df_plot%>% filter(cell_type %in% unlist(lineage_list[grepl(names(lineage_list), pattern = 'B|Plasma|Mast')])), aes(dist_bin, 100 * estimate, color = Status)) + 
    geom_vline(xintercept = c(20.5), size = 2, linetype = 1, color = 'grey') + 
    geom_point() + 
    geom_errorbar(aes(ymin = 100 * (estimate - 1.96 * sqrt(variance)), ymax = 100 * (estimate + 1.96 * sqrt(variance))), width = 0, show.legend = FALSE) + 
    geom_hline(yintercept = 0) + 
    geom_line(data = . %>% dplyr::mutate(dist_bin = as.numeric(dist_bin)), show.legend = FALSE) + 
    cowplot::theme_half_open(10) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(y = 'Percent of all cells', x = 'Distance Window', size = '# Cells', subtitle = 'IVW meta-analysis; mean & 95% CI, *padj<0.01', title = lineage) + #, title = paste(.types, collapse = '; ')) + 
    geom_text(
        data = final_results$summary_stats %>% filter(cell_type %in% unlist(lineage_list[grepl(names(lineage_list), pattern = 'B|Plasma|Mast')])),
        aes(y = 100 * height, label = asterisk), size = 6, vjust = .2, show.legend = FALSE,
        color = 'black'
    ) + 
    geom_text(data = df_plot %>% filter(cell_type %in% unlist(lineage_list[grepl(names(lineage_list), pattern = 'B|Plasma|Mast')])) %>% select(cell_type, ymax) %>% distinct() %>% mutate(ymax = ymax + ymax/3, x = 32, label = '\nEpi-\nenriched'), 
              aes(label = label, y = ymax, x = x), color = 'black', size = 3) +
    geom_text(data = df_plot %>% filter(cell_type %in% unlist(lineage_list[grepl(names(lineage_list), pattern = 'B|Plasma|Mast')])) %>% select(cell_type, ymax) %>% distinct() %>% mutate(ymax = ymax + ymax/3, x = 8, label = '\nStroma-\nenriched'), 
              aes(label = label, y = ymax, x = x), color = 'black', size = 3) +
    #annotate("text", x = 0.5, y = ymax + .05, label = 'Stromal Side', hjust = 0, size = 6) + 
    #annotate("text", x = 40.5, y = ymax + .05, label = 'Epithelial Side', hjust = 1, size = 6) + 
    scale_color_manual(
        name = "MMR Status",
        values = c("MSI" = "red", "MSS" = "blue"),
        labels = c("MSI" = "MMRd (MSI)", "MSS" = "MMRp (MSS)")
    ) + 
    guides(color = guide_legend(override.aes = list(size = 2, shape = 16))) + 
    facet_wrap(~cell_type, scales = 'free') +
    theme(aspect.ratio = 0.5, axis.text.x = element_text(size = 4), strip.background = element_rect(fill = NA), strip.text = element_text(size = 10, face = 'bold', color = 'black'), title = element_text(size = 10), legend.position = 'top', legend.text=element_text(size=10)) +
    guides(fill = guide_legend(override.aes = list(nrow = 1, shape = 16))) +
    NULL

### Key cell states

In [None]:
fig.size(h = 3, w = 14)
options(repr.plot.res = 500)
states_of_interest = c('Myeloid-ISG', 'Tcd4-Treg', 'Tcd8-CXCL13', 'Tcd8-HOBIT', 'Fibro-BMP')
ggplot(df_plot%>% filter(cell_type %in% states_of_interest), aes(dist_bin, 100 * estimate, color = Status)) + 
    geom_vline(xintercept = c(20.5), size = 2, linetype = 1, color = 'grey') + 
    geom_point() + 
    geom_errorbar(aes(ymin = 100 * (estimate - 1.96 * sqrt(variance)), ymax = 100 * (estimate + 1.96 * sqrt(variance))), width = 0, show.legend = FALSE) + 
    geom_hline(yintercept = 0) + 
    geom_line(data = . %>% dplyr::mutate(dist_bin = as.numeric(dist_bin)), show.legend = FALSE) + 
    cowplot::theme_half_open(7) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(y = 'Percent of all cells', x = 'Distance Window', size = '# Cells', subtitle = 'IVW meta-analysis; mean & 95% CI, *padj<0.01', title = '') + #, title = paste(.types, collapse = '; ')) + 
    geom_text(
        data = final_results$summary_stats %>% filter(cell_type %in% states_of_interest),
        aes(y = 100 * height, label = asterisk), size = 6, vjust = .2, show.legend = FALSE,
        color = 'black'
    ) + 
    geom_text(data = df_plot %>% filter(cell_type %in% states_of_interest) %>% select(cell_type, ymax) %>% distinct() %>% mutate(ymax = ymax + ymax/3, x = 32, label = '\nEpi-\nenriched'), 
              aes(label = label, y = ymax, x = x), color = 'black', size = 3) +
    geom_text(data = df_plot %>% filter(cell_type %in% states_of_interest) %>% select(cell_type, ymax) %>% distinct() %>% mutate(ymax = ymax + ymax/3, x = 8, label = '\nStroma-\nenriched'), 
              aes(label = label, y = ymax, x = x), color = 'black', size = 3) +
    scale_color_manual(
        name = "MMR Status",
        values = c("MSI" = "red", "MSS" = "blue"),
        labels = c("MSI" = "MMRd (MSI)", "MSS" = "MMRp (MSS)")
    ) + 
    guides(color = guide_legend(override.aes = list(size = 2, shape = 16))) + 
    facet_wrap(~cell_type, scales = 'free', nrow = 1) +
    theme(aspect.ratio = 0.5, axis.text.x = element_text(size = 4), strip.background = element_rect(fill = NA), strip.text = element_text(size = 10, face = 'bold', color = 'black'), title = element_text(size = 10), legend.position = 'top', legend.text=element_text(size=10)) +
    guides(fill = guide_legend(override.aes = list(nrow = 1, shape = 16))) +
    NULL

## 8. Examine the spatial patterning of lineages

### `get_lineage_bins`

This is the core data processing function. For a given sample, it takes cell coordinates and interface geometries as input and performs the following steps:
- Finds the nearest interface for each cell and calculates the distance to it.
- Assigns a sign to the distance: negative for cells in a 'Stromal-enriched' region, positive otherwise.
- Bins the cells into 5 µm intervals of signed distance, from -100 µm to +100 µm; cells farther than 100 µm from any interface are dropped.
- Returns a single count matrix with distance bins as rows and lineages (`type_lvl1`) as columns.
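
The binning step can be checked on a handful of made-up distances. The sketch below is not part of the pipeline: the `signed_dist` values are invented, and it uses only base R's `cut` with the same breaks as the function. It shows that the breaks give 40 bins and that the interface (distance 0) sits between bins 20 and 21, which is presumably why the plots in this notebook draw their reference line at `xintercept = 20.5`.

```r
# Toy signed distances in µm (negative = stromal side, positive = other side).
signed_dist <- c(-100, -7.5, -0.1, 0, 0.1, 12, 100, 150)

# Same breaks as in get_lineage_bins: 41 break points define 40 bins of width 5.
breaks <- seq(-100, 100, by = 5)
dist_bin <- cut(signed_dist, breaks, include.lowest = TRUE)

nlevels(dist_bin)        # number of bins
levels(dist_bin)[20:21]  # the two bins flanking the interface
as.integer(dist_bin)     # bin index per cell; NA for cells beyond 100 µm
```

Cells whose distance falls outside the breaks get `NA` and are removed by the `!is.na(dist_bin)` filter in the function.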

In [None]:
#' @title Calculate Cell Counts in Distance Bins from an Interface
#' @description This function takes spatial coordinates of cells and interface lines,
#'   calculates the signed distance of each cell to the nearest interface, and
#'   groups cells into discrete distance bins. It returns a matrix of cell
#'   type counts per bin for a single sample.
#' @param cells A data.table containing cell information, including 'X'/'Y' coordinates,
#'   lineage ('type_lvl1'), and a region annotation ('tessera_annotation').
#' @param interfaces An sf object containing interface geometries (e.g., LINESTRINGs).
#' @return A matrix where rows are distance bins (e.g., "(-5,0]") and columns
#'   are lineages ('type_lvl1'), with values representing cell counts.
get_lineage_bins = function(cells, interfaces) {
    # Convert data.frame coordinates to a spatial 'sf' object
    pts = st_as_sf(cells[, .(X, Y)], coords = c('X', 'Y'))
    
    # Use 'geos' for high-performance spatial operations
    geos_pts = geos::as_geos_geometry(pts$geometry)
    geos_lines = geos::as_geos_geometry(interfaces$x[1:nrow(interfaces)])
    
    # Find the nearest interface for each cell
    nearest_interfaces = geos::geos_nearest(geos_pts, geos_lines)
    
    # Calculate the distance to that nearest interface
    cells$dist_interface = geos::geos_distance(geos_pts, geos_lines[nearest_interfaces])
    
    # Assign a sign to the distance based on tissue region (stroma vs. other)
    cells$dist_interface_signed = case_when(
        cells$tessera_annotation == 'Stromal-enriched' ~ -cells$dist_interface,
        TRUE ~ cells$dist_interface
    )
    
    # Bin cells into 5µm distance intervals
    cells$dist_bin = cut(cells$dist_interface_signed, seq(-100, 100, by = 5), include.lowest = TRUE)

    # Create the final count matrix
    counts = cells[
        !is.na(dist_bin)
    ] %>%
        with(table(dist_bin, type_lvl1)) %>%
        data.table() %>%
        dcast(dist_bin ~ type_lvl1, value.var = 'N') %>%
        dplyr::mutate(dist_bin = factor(dist_bin, levels(cells$dist_bin))) %>%
        arrange(dist_bin) %>%
        tibble::column_to_rownames('dist_bin') %>%
        as.matrix()
    
    return(counts)
}

### 9. Main Analysis: Calculate Distances and Bin Counts

This is the main computational step. We use `future_map` to run the `get_lineage_bins` function in parallel for each sample. This generates a list with one element per sample, each holding that sample's matrix of binned lineage counts.

In [None]:
options(future.globals.maxSize = 1e10)

system.time({
    counts_list = future_map(ids, function(.id) {
        get_lineage_bins(cells[SampleID == .id], interfaces[[.id]])    
    }, .options = furrr::furrr_options(seed=TRUE))
    names(counts_list) = ids
})

### 10. Post-processing: Standardize Data

After calculating the counts, we use the `standardize_matrix_columns` utility function to ensure that all count matrices have the exact same set of cell type columns, which is essential for the downstream meta-analysis.
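
`standardize_matrix_columns` is defined elsewhere in this notebook. As a rough illustration of the idea only, here is a minimal base-R sketch with a hypothetical helper, `pad_columns`, that pads each matrix to the union of all column names and fills missing cell types with zeros. The toy matrices are invented, and the real utility may differ in details such as column order.

```r
# Hypothetical helper (not the notebook's function): pad every matrix to the
# union of column names, filling absent cell types with zero counts.
pad_columns <- function(mats) {
    all_cols <- sort(unique(unlist(lapply(mats, colnames))))
    lapply(mats, function(m) {
        out <- matrix(0, nrow = nrow(m), ncol = length(all_cols),
                      dimnames = list(rownames(m), all_cols))
        out[, colnames(m)] <- m
        out
    })
}

# Two toy samples; sample "s2" contains no Mast cells.
bins <- c("(-5,0]", "(0,5]")
s1 <- matrix(c(3, 1, 0, 2), nrow = 2, dimnames = list(bins, c("B", "Mast")))
s2 <- matrix(c(5, 4), nrow = 2, dimnames = list(bins, "B"))

padded <- pad_columns(list(s1 = s1, s2 = s2))
padded$s2  # now has a "Mast" column of zeros
```

With identical columns in every sample, the per-sample matrices can be compared or stacked cell type by cell type without name matching.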

In [None]:
# standardize columns
counts_list = standardize_matrix_columns(counts_list)

### 11. Global Analysis Across All Cell Types

Now we run the main analysis function, `run_global_mmr_analysis`. This function iterates through every lineage, performs the meta-analysis comparing MMRd (MSI) and MMRp (MSS) samples, calculates statistics, and returns a set of clean data frames ready for plotting.
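
The plots in this notebook are labelled "IVW meta-analysis; mean & 95% CI" and draw `estimate ± 1.96 * sqrt(variance)`. The sketch below shows a generic inverse-variance-weighted (IVW) pooling of per-sample proportions for one cell type in one distance bin. It is an assumption about the general technique, not the notebook's implementation: the counts are invented, the variance is a plain binomial one, and the empirical Bayes shrinkage mentioned in the workflow overview is left out.

```r
# Invented counts for one cell type in one distance bin, one entry per sample.
k <- c(12, 30, 8)      # cells of the type in the bin
n <- c(200, 400, 100)  # all cells in the bin

p <- k / n             # per-sample proportion
v <- p * (1 - p) / n   # simple binomial variance of each proportion

# Inverse-variance weights: more precise samples contribute more.
w <- 1 / v
estimate <- sum(w * p) / sum(w)
variance <- 1 / sum(w)

# 95% confidence interval, matching the error bars used in the plots.
ci <- estimate + c(-1, 1) * 1.96 * sqrt(variance)
c(estimate = estimate, lower = ci[1], upper = ci[2])
```

The pooled variance is always smaller than the smallest per-sample variance, which is why the pooled error bars are tighter than those of any single sample.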

In [None]:
# Create a list of lineages to iterate over
cellTypes = cells %>% 
    select(type_lvl1) %>% 
    distinct

# Named list with one entry per lineage (name and value are both the lineage label)
type_list <- lapply(split(cellTypes$type_lvl1, cellTypes$type_lvl1), unique)

# Run the full analysis
final_results <- run_global_mmr_analysis(type_list, counts_list, mmr_map)

# Display the glimpse of the main summary table
glimpse(final_results$summary_stats)

### 12. Visualization

In this section, we generate plots to visualize the results. We create faceted plots with one panel per lineage (e.g., TNKILC, Myeloid) to compare distribution profiles between MMRd (MSI) and MMRp (MSS) samples.

In [None]:
final_results$MSS_results %>%
    fwrite(., 'input_data/MSS_lineages.csv')

In [None]:
# Prepare data for plotting by combining MSI and MSS results
df = bind_rows(list(MSI = final_results$MSI_results, MSS = final_results$MSS_results), .id = 'Status') 

# Calculate y-axis limits for plotting
df <- df %>%
    group_by(cell_type) %>%
    mutate(ymax = 100 * max(estimate + 1.96 * sqrt(variance))) %>%
    ungroup

# Create a list for grouping cell types by lineage
# (at this level each lineage maps to itself)
lineage_list <- cells %>% 
    select(type_lvl1) %>% 
    distinct %>%
    {split(.$type_lvl1, .$type_lvl1)}

### 13. Plot: All Lineages

In [None]:
head(df)
unique(df$cell_type)

In [None]:
fig.size(h = 9, w = 16)
options(repr.plot.res = 300)

# Define the desired order for the facets
lineage_order <- c("TNKILC", "Epi", "Strom", "Myeloid", "B", "Plasma", "Mast")

# Prepare the data for the geom_text layers beforehand for clarity
text_data_asterisk <- final_results$summary_stats %>% 
    mutate(cell_type = factor(cell_type, ordered = TRUE, levels = lineage_order))

text_data_epi <- df %>% 
    mutate(cell_type = factor(cell_type, ordered = TRUE, levels = lineage_order)) %>% 
    select(cell_type, ymax) %>% 
    distinct() %>% 
    mutate(ymax = ymax + ymax/3, x = 32, label = '\nEpi-\nenriched')

text_data_stroma <- df %>% 
    mutate(cell_type = factor(cell_type, ordered = TRUE, levels = lineage_order)) %>% 
    select(cell_type, ymax) %>% 
    distinct() %>% 
    mutate(ymax = ymax + ymax/3, x = 8, label = '\nStroma-\nenriched')

# Create the plot
ggplot(
    data = df %>% 
    mutate(cell_type = factor(cell_type, ordered = TRUE, levels = lineage_order)), 
    aes(dist_bin, 100 * estimate, color = Status)
) + 
    geom_vline(xintercept = c(20.5), size = 2, linetype = 1, color = 'grey') + 
    geom_point() + 
    geom_errorbar(aes(ymin = 100 * (estimate - 1.96 * sqrt(variance)), ymax = 100 * (estimate + 1.96 * sqrt(variance))), width = 0, show.legend = FALSE) + 
    geom_hline(yintercept = 0) + 
    geom_line(data = . %>% dplyr::mutate(dist_bin = as.numeric(dist_bin)), show.legend = FALSE) + 
    cowplot::theme_half_open(10) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(y = 'Percent of all cells', x = 'Distance Window', size = '# Cells', subtitle = 'IVW meta-analysis; mean & 95% CI, *padj<0.01', title = 'Lineages') + 
    geom_text(
        data = text_data_asterisk,
        aes(y = 100 * height, label = asterisk), 
        size = 6, vjust = .2, show.legend = FALSE, color = 'black'
    ) + 
    geom_text(
        data = text_data_epi, 
        aes(label = label, y = ymax, x = x), 
        color = 'black', size = 3
    ) +
    geom_text(
        data = text_data_stroma, 
        aes(label = label, y = ymax, x = x), 
        color = 'black', size = 3
    ) +
    scale_color_manual(
        name = 'Status: ', 
        values = c('MSI' = 'red', 'MSS' = 'blue'), 
        labels = c('MSI' = 'MMRd', 'MSS' = 'MMRp')
    ) + 
    guides(color = guide_legend(override.aes = list(size = 2, shape = 16))) + 
    facet_wrap(~cell_type, scales = 'free') +
    theme(
        aspect.ratio = 0.5, 
        axis.text.x = element_text(size = 4), 
        strip.background = element_rect(fill = NA), 
        strip.text = element_text(size = 10, face = 'bold', color = 'black'), 
        title = element_text(size = 10), 
        legend.position = 'top', 
        legend.text = element_text(size = 10)
    ) +
    guides(fill = guide_legend(override.aes = list(nrow = 1, shape = 16))) +
    NULL

# 9. TNKILC subtypes as a proportion of the TNKILC lineage

In [None]:
head(cells)

In [None]:
# Set a higher limit for global variables when using parallel processing
options(future.globals.maxSize = 1e10)

# Run get_bins for each sample in parallel
system.time({
    counts_list = future_map(ids, function(.id) {
        get_bins(cells[SampleID == .id & type_lvl1 == 'TNKILC'], interfaces[[.id]])    
    }, .options = furrr::furrr_options(seed=TRUE))
    names(counts_list) = ids
})
    
# Standardize matrices to ensure consistent columns across all samples
counts_list = standardize_matrix_columns(counts_list)

## 6. Global Analysis Across All Cell Types

Now we rerun the main analysis function, `run_global_mmr_analysis`, this time on the TNKILC-only counts. It iterates through every TNKILC cell type, performs the meta-analysis comparing MMRd (MSI) and MMRp (MSS) samples, calculates statistics, and returns a set of clean data frames ready for plotting.
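
Because the counts in this section come only from cells with `type_lvl1 == 'TNKILC'`, each proportion is presumably taken relative to the TNKILC cells in a bin rather than to all cells. The base-R sketch below illustrates that change of denominator on an invented bin-by-subtype matrix; it does not reproduce what the analysis function does internally.

```r
# Invented counts for one sample: rows are distance bins, columns are subtypes.
counts <- matrix(c(4, 6, 10, 10, 6, 4), nrow = 2,
                 dimnames = list(c("(-5,0]", "(0,5]"),
                                 c("Tcd4-Treg", "Tcd8-CXCL13", "Tcd8-HOBIT")))

# Denominator: all lineage cells in each bin.
lineage_total <- rowSums(counts)

# Row-wise proportions; a length-nrow vector is recycled down each column.
prop <- counts / lineage_total
prop
rowSums(prop)  # each bin sums to 1 within the lineage
```

A subtype can therefore rise as a share of its lineage near the interface even when the lineage as a whole becomes rarer there.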

In [None]:
# Create the list of cell types to analyze
cellTypes = cells %>% filter(type_lvl1 == 'TNKILC') %>% select(type_lvl2, type_lvl3) %>% distinct
type_list <- lapply(split(cellTypes$type_lvl2, cellTypes$type_lvl3), unique)

# Run the analysis
final_results <- run_global_mmr_analysis(type_list, counts_list, mmr_map)

# Prepare a combined data frame for plotting
df_plot <- bind_rows(list(MSI = final_results$MSI_results, MSS = final_results$MSS_results), .id = 'Status')

df_plot = df_plot %>%
    group_by(cell_type) %>%
    mutate(ymax = 100 * max(estimate + 1.96 * sqrt(variance))) %>%
    ungroup
head(df_plot)

# Create a list for grouping cell types by lineage for faceted plots
lineage_list <- cells %>% select(type_lvl1, type_lvl3) %>% distinct %>% {split(.$type_lvl3, .$type_lvl1)}

head(final_results$summary_stats)

final_results$MSS_results %>% fwrite(., 'input_data/MSS_results_TNKILCs_as_prop_of_lineage.csv')

In [None]:
sessionInfo()