# Spatial Analysis of Cell Proximity to Interfaces

This notebook analyzes the spatial distribution of different cell types relative to defined biological interfaces. The primary goal is to determine if certain cell types are enriched or depleted near specific types of interfaces (e.g., 'hub positive' vs. 'hub negative').

The workflow consists of several key stages:
1.  **Function Definitions**: A collection of R functions for performing the core statistical analysis, including distance calculations, empirical Bayes shrinkage, and meta-analysis.
2.  **Data Loading & Preprocessing**: Loading cell spatial coordinates and interface geometries from external files.
3.  **Distance Calculation**: For each cell, calculating the distance to the nearest interface and classifying it by interface type.
4.  **Statistical Analysis**: Applying a meta-analysis across multiple samples to get robust estimates of cell type proportions at different distances from the interfaces.
5.  **Visualization**: Generating plots to visualize the results and compare cell distributions between different interface types.

## 1. Setup: Load Libraries

First, we load all the necessary R packages for the analysis. This includes libraries for data manipulation (`data.table`, `dplyr`, `purrr`), spatial analysis (`sf`, `geos`), parallel computing (`furrr`), and plotting (`ggplot2`).

In [None]:
suppressPackageStartupMessages({
    library(tidyverse)
    library(data.table)
    library(sf)
    library(purrr)
    library(ggplot2)
    library(ggthemes)
    library(geos)
    library(glue)
    library(furrr)
    library(future)
    library(dplyr)
    library(patchwork)
})

# Set up parallel processing to speed up computations
# (`multicore` forks the R session; on Windows it falls back to sequential execution)
plan(multicore)

# Helper function to set plot dimensions
fig.size <- function(h, w) {
    options(repr.plot.height = h, repr.plot.width = w)
}

## 2. Function Definitions

This section contains all the custom functions used throughout the analysis pipeline.

### `summarize_cells_by_interface_proximity_2`

This is the core data processing function. For a given sample, it takes cell coordinates and interface geometries as input and performs the following steps:
- Calculates the distance from each cell to its nearest interface.
- Annotates each cell with the type of that nearest interface.
- Signs the distance: negative for cells in a 'Stromal-enriched' region, positive for all other cells.
- Bins the cells into 5 µm intervals from -100 to 100 µm; cells outside this range are dropped.
- Relabels `type_lvl3` with the cell's hub status (`cxcl_pos_tile`), or with 'MMRp' for cells from MMRp samples.
- Returns a named list of matrices, one per interface type, each holding the cell counts per distance bin (rows) and `type_lvl3` category (columns).
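
The signing and binning steps can be sketched on toy numbers without any spatial libraries. This is a minimal illustration that assumes the distances have already been computed; the vectors `dist_um` and `is_stromal` are hypothetical stand-ins for the `dist_interface` and `tessera_annotation` columns.

```r
# Hypothetical unsigned distances (in µm) from four cells to their nearest interface
dist_um    <- c(3.2, 12.0, 47.5, 101)
# Hypothetical flag: TRUE if the cell sits in a 'Stromal-enriched' region
is_stromal <- c(TRUE, FALSE, TRUE, FALSE)

# Stromal cells get negative distances, all other cells positive ones
dist_signed <- ifelse(is_stromal, -dist_um, dist_um)

# 5 µm bins from -100 to 100; values outside the breaks become NA
dist_bin <- cut(dist_signed, breaks = seq(-100, 100, by = 5), include.lowest = TRUE)

print(data.frame(dist_signed, dist_bin))
```

The last cell lies more than 100 µm from its interface, so it falls outside the breaks and receives `NA`. This is why the function removes rows with a missing `dist_bin` before counting.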

In [None]:
tile_metadata = readr::read_rds('../Tessera tiles/Tessera processed results/tile_metadata_2025-07-22.rds') #'tile_metadata_2025-03-27.rds')
head(tile_metadata)

# Plot the number of hub+ and hub- cells around hub+ and hub- interfaces, the number of cells around MMRp interfaces - Supplementary Figure 4 

# function definitions

In [None]:
summarize_cells_by_interface_proximity_2 = function(cells, interfaces) {
    # retains all cells around hub+ and hub- interfaces.
    
    ## Get distances and closest interface type
    pts = st_as_sf(cells[, .(X, Y)], coords = c('X', 'Y'))
    geos_pts = geos::as_geos_geometry(pts$geometry)
    geos_lines = geos::as_geos_geometry(interfaces$x[1:nrow(interfaces)])
    
    nearest_interfaces_idx = geos::geos_nearest(geos_pts, geos_lines)
    
    cells$closest_interface_type = interfaces$Type_of_Interface[nearest_interfaces_idx]
    
    cells$dist_interface = geos::geos_distance(geos_pts, geos_lines[nearest_interfaces_idx])
    
    ## Assign sign to distances
    cells$dist_interface_signed = fifelse(
        cells$tessera_annotation == 'Stromal-enriched',
        -cells$dist_interface,
        cells$dist_interface
    )
    
    ## Assign cells to 5um bins
    dist_breaks = seq(-100, 100, by = 5)
    cells$dist_bin = cut(cells$dist_interface_signed, breaks = dist_breaks, include.lowest = TRUE)

    # --- ROBUST SUMMARIZATION --

    # cells = cells %>% filter(
    #     (closest_interface_type == 'CXCLpos tumor & CXCLpos stroma' & cxcl_pos_tile == 'CXCL_pos') | (closest_interface_type == 'CXCLneg tumor & CXCLneg stroma' & cxcl_pos_tile == 'CXCL_neg')        
    # )

    cells = cells %>%
        mutate(type_lvl3 = case_when(MMRstatus == "MMRp" ~ 'MMRp',
                                    .default = cxcl_pos_tile
                                    ))

    cells_in_range = cells[!is.na(dist_bin)]
    
    if (nrow(cells_in_range) == 0) {
        warning("No cells found within the -100 to 100µm distance range.")
        return(list())
    }

    all_interface_types = unique(cells$closest_interface_type)
    cells_in_range[, closest_interface_type := factor(closest_interface_type, levels = all_interface_types)]

    counts_long = cells_in_range[, .N, by = .(closest_interface_type, dist_bin, type_lvl3)]

    counts_wide = dcast(counts_long,
                        closest_interface_type + dist_bin ~ type_lvl3,
                        value.var = "N",
                        fill = 0,
                        drop = FALSE)

    result_list = split(counts_wide, by = "closest_interface_type")

    result_list = lapply(result_list, function(dt) {
        row_names = dt$dist_bin
        count_cols = setdiff(names(dt), c("closest_interface_type", "dist_bin"))
        mat = as.matrix(dt[, ..count_cols])
        rownames(mat) = row_names
        return(mat)
    })

    return(result_list)
}

#' @title Calculate Cell Counts in Distance Bins from an Interface
#' @description This function takes spatial coordinates of cells and interface lines,
#'   calculates the signed distance of each cell to the nearest interface, and
#'   groups cells into discrete distance bins. It returns a matrix of cell
#'   type counts per bin for a single sample.
#' @param cells A data.table containing cell information, including 'X'/'Y' coordinates,
#'   cell type ('type_lvl3'), and a region annotation ('tessera_annotation').
#' @param interfaces An sf object containing interface geometries (e.g., LINESTRINGs).
#' @return A matrix where rows are distance bins (e.g., "(-5,0]") and columns
#'   are cell types ('type_lvl3'), with values representing cell counts.
get_bins = function(cells, interfaces) {
    # Convert data.frame coordinates to a spatial 'sf' object
    pts = st_as_sf(cells[, .(X, Y)], coords = c('X', 'Y'))
    
    # Use 'geos' for high-performance spatial operations
    geos_pts = geos::as_geos_geometry(pts$geometry)
    geos_lines = geos::as_geos_geometry(interfaces$x[1:nrow(interfaces)])
    
    # Find the nearest interface for each cell
    nearest_interfaces = geos::geos_nearest(geos_pts, geos_lines)
    
    # Calculate the distance to that nearest interface
    cells$dist_interface = geos::geos_distance(geos_pts, geos_lines[nearest_interfaces])
    
    # Assign a sign to the distance based on tissue region (stroma vs. other)
    cells$dist_interface_signed = case_when(
        cells$tessera_annotation == 'Stromal-enriched' ~ -cells$dist_interface,
        TRUE ~ cells$dist_interface
    )
    
    # Bin cells into 5µm distance intervals
    cells$dist_bin = cut(cells$dist_interface_signed, seq(-100, 100, by = 5), include.lowest = TRUE)

    # Create the final count matrix
    counts = cells[
        !is.na(dist_bin)
    ] %>%
        with(table(dist_bin, type_lvl3)) %>%
        data.table() %>%
        dcast(dist_bin ~ type_lvl3, value.var = 'N') %>%
        dplyr::mutate(dist_bin = factor(dist_bin, levels(cells$dist_bin))) %>%
        arrange(dist_bin) %>%
        tibble::column_to_rownames('dist_bin') %>%
        as.matrix()
    
    return(counts)
}

### Empirical Bayes Functions

This group of functions implements an empirical Bayes statistical framework. The core idea is to improve estimates for individual groups (here, distance bins) by "borrowing strength" from all other groups. This shrinks noisy estimates from bins with little data towards a more stable global average.
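
As a rough illustration of the shrinkage, the sketch below fits a Beta prior by the method of moments and computes the posterior mean for each bin. The counts `k` and `n` are hypothetical, and the variable names (`alpha_hat`, `beta_hat`, `p_shrunk`) are chosen for this example only.

```r
# Hypothetical counts for four distance bins: k cells of interest out of n cells
k <- c(2, 30, 45, 1)
n <- c(10, 100, 150, 2)

p_hat  <- k / n
mean_p <- mean(p_hat)

# Method of moments: subtract the expected binomial sampling noise from the
# observed variance of the raw proportions to estimate the between-bin variance
var_between <- var(p_hat) - mean_p * (1 - mean_p) / mean(n)
stopifnot(var_between > 0)  # otherwise no informative prior can be fitted

nu        <- mean_p * (1 - mean_p) / var_between - 1
alpha_hat <- mean_p * nu
beta_hat  <- (1 - mean_p) * nu

# Posterior mean after updating Beta(alpha_hat, beta_hat) with k successes in n trials
p_shrunk <- (k + alpha_hat) / (n + alpha_hat + beta_hat)

print(round(data.frame(n, raw = p_hat, shrunk = p_shrunk), 3))
```

The bin with only two cells moves most of the way towards the average proportion, while the bin with 150 cells barely changes.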

In [None]:
# Estimates the parameters (alpha, beta) for a Beta prior distribution using the
# method of moments. This prior is used for modeling proportions (binomial data).
estimate_beta_prior <- function(k, n) {
    if (length(k) <= 1) {
        return(list(alpha = 0, beta = 0))
    }

    valid_bins <- n > 0
    k_valid <- k[valid_bins]
    n_valid <- n[valid_bins]
    
    if (length(k_valid) <= 1) {
        return(list(alpha = 0, beta = 0))
    }

    p_hat <- k_valid / n_valid
    mean_p <- mean(p_hat)
    var_p <- var(p_hat)
    mean_n <- mean(n_valid)
    
    var_true <- var_p - mean_p * (1 - mean_p) / mean_n
    
    if (is.na(var_true) || var_true <= 0) {
        return(list(alpha = 0, beta = 0))
    }
    
    nu <- mean_p * (1 - mean_p) / var_true - 1
    list(alpha = mean_p * nu, beta = (1 - mean_p) * nu)
}

# Calculates per-bin summary statistics with empirical Bayes shrinkage.
# Only the "binomial" model is implemented; "mle" and "poisson" stop with an error.
empirical_bayes_summary <- function(k, n, bin_lvls, model = c("binomial", "mle", "poisson")) {
    model <- match.arg(model)
    if (length(k) != length(n)) stop("Input lengths must match.")
    
    est <- k / n
    
    if (model == "binomial") {
        prior <- estimate_beta_prior(k, n)
        est <- (k + prior$alpha) / (n + prior$alpha + prior$beta)
        var <- ((k + prior$alpha) * (n - k + prior$beta)) /
               ((n + prior$alpha + prior$beta)^2 * (n + prior$alpha + prior$beta + 1))
               
    } else {
       stop("Only binomial model is fully implemented in this notebook version.")
    }
    
    df = data.table(
        dist_bin = factor(bin_lvls, bin_lvls),
        model = model,
        count = k,
        size = n,
        estimate = est,
        variance = var,
        alpha = prior$alpha,
        beta = prior$beta
    )

    df[, p := exp(pnorm(estimate / sqrt(variance), lower.tail = FALSE, log.p = TRUE))]
    df[, padj := p.adjust(p)]
    df[, asterisk := ifelse(padj < 0.01, "*", "")]
    
    return(df)
}

### Meta-Analysis and Statistical Functions

These functions handle the statistical aggregation and testing across multiple samples.
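
The pooling step in `meta_ashr` can be illustrated without the `ashr` dependency. The sketch below is a simplification rather than the notebook's method: it applies the same precision weights directly to hypothetical per-sample estimates and skips the adaptive shrinkage that `ashr::ash` performs first.

```r
# Hypothetical per-sample proportions for one distance bin, with their variances
estimate <- c(0.10, 0.14, 0.12)
variance <- c(0.0004, 0.0016, 0.0009)

# Precision weights, normalised to sum to 1 (same form as in meta_ashr)
w <- prop.table(1 / (variance + 1e-8))

pooled_estimate <- sum(w * estimate)
# Like meta_ashr, report the weighted average of the per-sample variances
pooled_variance <- sum(w * variance)

print(c(estimate = pooled_estimate, variance = pooled_variance))
```

The most precise sample (the first one here) dominates the pooled estimate. A textbook fixed-effect meta-analysis would instead report `1 / sum(1 / variance)` as the pooled variance, which decreases as samples are added.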

In [None]:
# `meta_ashr`: Performs a meta-analysis using the `ashr` package to combine estimates.
meta_ashr <- function(p_vec, var_vec) {
    ash_fit = ashr::ash(betahat = p_vec, sebetahat = sqrt(var_vec), method = "fdr", mixcompdist = 'normal')
    w = prop.table(1 / (ash_fit$result$PosteriorSD^2 + 1e-8))
    data.table(
        estimate = sum(w * ash_fit$result$PosteriorMean),
        variance = sum(w * ash_fit$result$PosteriorSD^2)
    )
}

# `get_stats`: The main driver for the meta-analysis. It takes a list of count matrices,
# calculates empirical Bayes summaries for each, and then combines them using `meta_ashr`.
get_stats = function(counts_list, .types) {
    df_list = imap(counts_list, function(counts, .id) {
        empirical_bayes_summary(
            rowSums(counts[, .types, drop = FALSE]),
            rowSums(counts),
            rownames(counts),
            'binomial'
        )
    })
    
    df = bind_rows(df_list, .id = 'SampleID')[
        , .(SampleID, dist_bin, estimate, variance)
    ][
        , meta_ashr(estimate, variance), dist_bin
    ]
    
    df[, p := exp(pnorm(estimate / sqrt(variance), lower.tail = FALSE, log.p = TRUE))]
    df[, padj := p.adjust(p)]
    df[, asterisk := case_when(
        is.na(padj) ~ '',
        padj < 0.01 ~ "*",
        TRUE ~ ''
    )]
    
    df[]
}

# `t_test_and_lfc`: Calculates a Welch's t-test and log2 fold change from summary statistics.
t_test_and_lfc <- function(mu1, var1, n1, mu2, var2, n2) {
  se_diff <- sqrt(var1 / n1 + var2 / n2)
  t_stat <- (mu1 - mu2) / se_diff
  df_num <- (var1 / n1 + var2 / n2)^2
  df_denom <- ((var1 / n1)^2) / (n1 - 1) + ((var2 / n2)^2) / (n2 - 1)
  df <- df_num / df_denom
  p_value <- 2 * pt(-abs(t_stat), df)
  lfc <- log2((mu1) / (mu2 ))
  
  return(list(
    p_value = p_value,
    log2_fold_change = lfc
  ))
}

### Utility and Plotting Functions

In [None]:
# `standardize_matrix_columns`: A utility to ensure all matrices in a list have the same columns.
standardize_matrix_columns <- function(mat_list) {
    all_cols <- sort(unique(unlist(lapply(mat_list, colnames))))
    
    lapply(mat_list, function(mat) {
        missing_cols <- setdiff(all_cols, colnames(mat))
        if (length(missing_cols) > 0) {
            zeros <- matrix(0, nrow = nrow(mat), ncol = length(missing_cols),
                            dimnames = list(rownames(mat), missing_cols))
            mat <- cbind(mat, zeros)
        }
        mat[, all_cols, drop = FALSE]
    })
}

# `interface_plot`: A plotting function to visualize the results for a single sample.
interface_plot = function(counts, .types, est_model=c('binomial', 'poisson', 'mle')) {
    est_model <- match.arg(est_model)
    df = empirical_bayes_summary(
        rowSums(counts[, .types, drop = FALSE]),
        rowSums(counts),
        rownames(counts),
        est_model
    ) 

    ymax = 100 * max(df$estimate + 1.96 * sqrt(df$variance))
    
    ggplot(df, aes(dist_bin, 100 * estimate)) + 
        geom_vline(xintercept = c(20.5), linewidth = 2, linetype = 1, color = 'grey') + 
        geom_point(aes(size = size)) + 
        geom_errorbar(aes(ymin = 100 * (estimate - 1.96 * sqrt(variance)), ymax = 100 * (estimate + 1.96 * sqrt(variance))), width = 0) + 
        geom_hline(yintercept = 0) + 
        geom_line(data = . %>% dplyr::mutate(dist_bin = as.numeric(dist_bin))) + 
        theme_bw(base_size = 16) + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
        labs(y = '% of all cells', x = 'Distance Window', size = '# Cells', subtitle = 'mean & 95% CI, *padj<0.01', title = paste(.types, collapse = '; ')) + 
        geom_text(aes(y = 100 * (estimate + 1.96 * sqrt(variance)), label = asterisk), size = 6, vjust = 0) + 
        annotate("text", x = 0.5, y = ymax + .05, label = 'Stromal Side', hjust = 0, size = 6) + 
        annotate("text", x = 40.5, y = ymax + .05, label = 'Epithelial Side', hjust = 1, size = 6) + 
        NULL
}

# `run_global_hub_analysis`: The main analysis wrapper that runs the entire pipeline for all cell types.
run_global_hub_analysis <- function(types_list, counts_list) {
  n_hubPos = sum(grepl('hubPos', names(counts_list)))
  n_hubNeg = sum(grepl('hubNeg', names(counts_list)))

  results_by_type <- purrr::imap(types_list, function(.x, .y) {
    .types <- .y
    df_hubPos <- get_stats(counts_list[grepl('hubPos', names(counts_list))], .types)
    df_hubNeg <- get_stats(counts_list[grepl('hubNeg', names(counts_list))], .types)
    df <- bind_rows(list(hubPos = df_hubPos, hubNeg = df_hubNeg), .id = 'Status')
    df_stat <- dcast(df, dist_bin ~ Status, value.var = c('estimate', 'variance'))[
      , c('p', 'log2_fold_change') := t_test_and_lfc(estimate_hubPos, variance_hubPos, n_hubPos, estimate_hubNeg, variance_hubNeg, n_hubNeg), dist_bin
    ]
    return(list(
      hubPos_data = df_hubPos,
      hubNeg_data = df_hubNeg,
      stats_data = df_stat
    ))
  })
  
  transposed_results <- purrr::transpose(results_by_type)
  all_hubPos_df <- dplyr::bind_rows(transposed_results$hubPos_data, .id = "cell_type")
  all_hubNeg_df <- dplyr::bind_rows(transposed_results$hubNeg_data, .id = "cell_type")
  summary_stats <- dplyr::bind_rows(transposed_results$stats_data, .id = "cell_type")
  
  summary_stats[, padj_global := p.adjust(p, method = 'fdr')]
  summary_stats[, height := max(estimate_hubPos + 1.96 * sqrt(variance_hubPos), estimate_hubNeg + 1.96 * sqrt(variance_hubNeg)), by = .(cell_type, dist_bin)]
  summary_stats[, asterisk := fifelse(padj_global < 0.01, "*", "")]
  
  return(list(
    summary_stats = summary_stats,
    hubPos_results = all_hubPos_df,
    hubNeg_results = all_hubNeg_df
  ))
}

#' @title Run a Global Analysis Comparing MMR Status
#' @description This master function automates the entire statistical comparison.
run_global_mmr_analysis <- function(types_list, counts_list, mmr_map) {
  
  # Dynamically calculate sample sizes to make the analysis robust
  n_msi <- length(unique(mmr_map[MMRstatus == 'MMRd']$SampleID))
  n_mss <- length(unique(mmr_map[MMRstatus == 'MMRp']$SampleID))
  
  # Iterate over the simplified cell type names (`type_lvl3`)
  results_by_type <- purrr::imap(types_list, function(.x, .y) {
    .types <- .y # Use the name of the list element (the correct type) for subsetting
    
    # Run meta-analysis for each group
    df_MSI <- get_stats(counts_list[mmr_map[MMRstatus == 'MMRd']$SampleID], .types)
    df_MSS <- get_stats(counts_list[mmr_map[MMRstatus == 'MMRp']$SampleID], .types)
    
    # Combine results for direct comparison
    df <- bind_rows(list(MSI = df_MSI, MSS = df_MSS), .id = 'Status')
    
    # Reshape and run Welch's t-test on the meta-analyzed estimates
    df_stat <- dcast(df, dist_bin ~ Status, value.var = c('estimate', 'variance'))[
      , c('p', 'log2_fold_change') := t_test_and_lfc(estimate_MSI, variance_MSI, n_msi, estimate_MSS, variance_MSS, n_mss), dist_bin
    ]
    
    return(list(MSI_data = df_MSI, MSS_data = df_MSS, stats_data = df_stat))
  })
  
  # Restructure the list of lists into a more usable format
  transposed_results <- purrr::transpose(results_by_type)
  all_MSI_df <- dplyr::bind_rows(transposed_results$MSI_data, .id = "cell_type")
  all_MSS_df <- dplyr::bind_rows(transposed_results$MSS_data, .id = "cell_type")
  summary_stats <- dplyr::bind_rows(transposed_results$stats_data, .id = "cell_type")
  
  # Perform global FDR correction across all p-values from all tests
  summary_stats[, padj_global := p.adjust(p, method = 'fdr')]
  summary_stats[, asterisk := fifelse(padj_global < 0.01, "*", "")]
  summary_stats[, height := max(estimate_MSI + 1.96 * sqrt(variance_MSI), estimate_MSS + 1.96 * sqrt(variance_MSS)), by = .(cell_type, dist_bin)]
  
  # Return the final, tidy list of results
  return(list(summary_stats = summary_stats, MSI_results = all_MSI_df, MSS_results = all_MSS_df))
}

## 3. Data Loading and Preprocessing

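Here, we load the cell metadata and the list of tiles to exclude, followed by the interface geometries. We then clean the annotations: cells in excluded tiles are dropped, `Myeloid-ISGlow` is merged into `Myeloid-ISG`, and the `-prolif` suffix is stripped from `type_lvl2` to form the simplified `type_lvl3` category.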

In [None]:
tiles_to_omit = read.csv('../Tessera tiles/Tessera processed results/tiles_to_exclude_from_interface_analysis.csv') %>%
    filter(tiles_to_exclude_from_interface_analysis != '') %>%
    pull(agg_id)
length(tiles_to_omit)
head(tiles_to_omit)

In [None]:
# Load cell data
cells = readr::read_rds('../Tessera tiles/Tessera processed results/tile_metadata_2025-07-22.rds') 
cells$type_lvl1[cells$type_lvl2 == 'Mast'] = 'Mast' 


# Simplify cell type annotations
cells <- cells %>%
    filter(!agg_id %in% tiles_to_omit) %>%
    mutate(type_lvl2 = case_when(type_lvl2 == 'Myeloid-ISGlow' ~ 'Myeloid-ISG', .default = type_lvl2)) %>%
    mutate(type_lvl3 = gsub(type_lvl2, pattern = '-prolif', replacement = '')) %>% # |high|low|-PD1
    #mutate(type_lvl3 = gsub(type_lvl3, pattern = 'Epi.*', replacement = 'Epi')) %>% 
    select(c('PatientID', 'SampleID', 'MMRstatus', 'X', 'Y', 'tessera_annotation', 'type_lvl3', 'type_lvl1', 'type_lvl2', 'cell_id', 'cxcl_pos_tile'))

glimpse(cells)

## Load interface data for each sample

In [None]:
# List the interface files once, then pick the file whose path contains each sample ID
interface_dir = '../Tessera tiles/Spatial objects for tumor-stromal interfaces in all MERFISH samples/'
interface_files = list.files(path = interface_dir, pattern = '_tumor_stromal_interfaces.rds', full.names = TRUE)
ids = unique(cells$SampleID) #[cells$MMRstatus == 'MMRd'])
interfaces = map(ids, function(.id) {
    fname = normalizePath(interface_files[grepl(interface_files, pattern = .id)])
    stopifnot(length(fname) == 1) # expect exactly one interface file per sample
    readRDS(fname)
})
names(interfaces) = ids

glimpse(interfaces[[1]])

## 4. Main Analysis: Calculate Distances and Bin Counts

This is the main computational step. We use `future_map` to run `summarize_cells_by_interface_proximity_2` in parallel, once per MMRd sample. The result is a list with one element per sample, each holding the binned cell counts for that sample's interface types.
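
The shape of the resulting object can be sketched with base R on a toy table. This is an illustration only: it runs sequentially with `lapply` instead of `future_map`, and the data frame `toy_cells` and its values are hypothetical.

```r
# Hypothetical pre-binned cells from two samples
toy_cells <- data.frame(
  SampleID = c("S1", "S1", "S1", "S2", "S2"),
  dist_bin = factor(c("(-5,0]", "(0,5]", "(0,5]", "(-5,0]", "(0,5]"),
                    levels = c("(-5,0]", "(0,5]")),
  type     = factor(c("CXCL_neg", "CXCL_pos", "CXCL_pos", "CXCL_pos", "CXCL_neg"),
                    levels = c("CXCL_neg", "CXCL_pos"))
)

# One bins-by-types count matrix per sample, returned as a named list
counts_by_sample <- lapply(split(toy_cells, toy_cells$SampleID), function(d) {
  unclass(table(dist_bin = d$dist_bin, type = d$type))
})

print(counts_by_sample$S1)
```

In the notebook, each sample's element is itself a list holding one such matrix per interface type.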

In [None]:
options(future.globals.maxSize = 1e10)
ids = unique(cells$SampleID[cells$MMRstatus == 'MMRd'])
system.time({
    counts_list = future_map(ids, function(.id) {
        summarize_cells_by_interface_proximity_2(cells[SampleID == .id], interfaces[[.id]])    
    }, .options = furrr::furrr_options(seed=TRUE))
    names(counts_list) = ids
})

In [None]:
mmr_map = cells  %>%
    select(c('PatientID', 'SampleID', 'MMRstatus')) %>%
    distinct()

## 5. Post-processing: Stratify and Standardize Data

After calculating the counts, we separate them based on the interface type ('hub positive' vs. 'hub negative'). We then use the `standardize_matrix_columns` utility function to ensure that all count matrices have the exact same set of cell type columns, which is essential for the downstream meta-analysis.
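
The column standardization can be sketched on two toy matrices. The helper name `pad_columns` and the matrices `m1` and `m2` are hypothetical; the logic follows the same zero-padding idea as `standardize_matrix_columns`.

```r
# Two hypothetical count matrices whose cell-type columns differ
m1 <- matrix(c(3, 1, 0, 2), nrow = 2,
             dimnames = list(c("(-5,0]", "(0,5]"), c("CXCL_neg", "CXCL_pos")))
m2 <- matrix(c(4, 5), nrow = 2,
             dimnames = list(c("(-5,0]", "(0,5]"), "CXCL_pos"))

all_cols <- sort(union(colnames(m1), colnames(m2)))

pad_columns <- function(mat, cols) {
  missing_cols <- setdiff(cols, colnames(mat))
  # Append an all-zero column for every cell type the matrix lacks
  zeros <- matrix(0, nrow = nrow(mat), ncol = length(missing_cols),
                  dimnames = list(rownames(mat), missing_cols))
  # Reorder so every matrix ends up with identical column order
  cbind(mat, zeros)[, cols, drop = FALSE]
}

m2_padded <- pad_columns(m2, all_cols)
print(m2_padded)
```

After padding, `rowSums` and column subsetting by cell type behave identically for every sample, which is what the meta-analysis relies on.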

In [None]:
# Separate lists for hub positive and hub negative interfaces
hubPos_counts_list = lapply(counts_list, function(x){return(x[['CXCLpos tumor & CXCLpos stroma']])})
names(hubPos_counts_list) = paste0(names(counts_list), '_hubPos')

hubNeg_counts_list = lapply(counts_list, function(x){return(x[['CXCLneg tumor & CXCLneg stroma']])})
names(hubNeg_counts_list) = paste0(names(counts_list), '_hubNeg')

# Combine them back into a single list and standardize columns
counts_list = c(hubPos_counts_list, hubNeg_counts_list)
counts_list = standardize_matrix_columns(counts_list)

In [None]:
names(counts_list)

## Plot the distribution of hub+ cells around hub+ interfaces

In [None]:
.types = grep('CXCL_pos', colnames(counts_list$C110_hubPos), value = TRUE)
.types

In [None]:
fig.size(18, 32)
require(patchwork)
# imap(counts_list[mmr_map[MMRstatus == 'MMRp']$SampleID], function(counts, .id) {    
imap(hubPos_counts_list, function(counts, .id) {    
    interface_plot(counts, .types, 'binomial') + labs(title = glue('{.id} (Hub+)'))    
}) %>% wrap_plots() + plot_annotation(title = 'Hub+ cells around Hub+ interfaces')

## Plot the distribution of hub+ cells around hub- interfaces

In [None]:
fig.size(18, 32)
require(patchwork)
# imap(counts_list[mmr_map[MMRstatus == 'MMRp']$SampleID], function(counts, .id) {    
imap(hubNeg_counts_list, function(counts, .id) {    
    interface_plot(counts, .types, 'binomial') + labs(title = glue('{.id} (Hub+)'))    
}) %>% wrap_plots() + plot_annotation(title = 'Hub+ cells around Hub- interfaces')

## 6. Global Analysis Across All Cell Types

Now we run the main analysis function, `run_global_hub_analysis`. For each cell category (here, the hub-status labels stored in `cxcl_pos_tile`), it meta-analyzes the samples separately for hub-positive and hub-negative interfaces, compares the two groups with a Welch's t-test in each distance bin, applies a global FDR correction, and returns a set of tables ready for plotting.
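
The two statistical steps inside this wrapper, a Welch's t-test computed from summary statistics and a global FDR correction, can be checked on toy data. This is a sketch under stated assumptions: `welch_from_summary` is a hypothetical helper that repeats the formula used in `t_test_and_lfc`, and the simulated values and p-values are made up.

```r
# Welch's t-test p-value from group means, variances, and sizes
welch_from_summary <- function(mu1, var1, n1, mu2, var2, n2) {
  se2    <- var1 / n1 + var2 / n2
  t_stat <- (mu1 - mu2) / sqrt(se2)
  # Welch-Satterthwaite degrees of freedom
  df <- se2^2 / ((var1 / n1)^2 / (n1 - 1) + (var2 / n2)^2 / (n2 - 1))
  2 * pt(-abs(t_stat), df)
}

# Hypothetical raw values, used only to verify the formula against t.test()
set.seed(1)
x <- rnorm(8, mean = 0.30, sd = 0.05)
y <- rnorm(6, mean = 0.20, sd = 0.08)

p_summary <- welch_from_summary(mean(x), var(x), length(x),
                                mean(y), var(y), length(y))
p_builtin <- t.test(x, y)$p.value  # Welch is the default (var.equal = FALSE)
print(all.equal(p_summary, p_builtin))

# Global Benjamini-Hochberg correction across hypothetical per-bin p-values
p_values <- c(0.0001, 0.003, 0.02, 0.5)
padj     <- p.adjust(p_values, method = "fdr")
flag     <- ifelse(padj < 0.01, "*", "")
print(data.frame(p_values, padj, flag))
```

Note that the notebook feeds this test the meta-analyzed estimates and their variances rather than raw per-sample values, so its p-values should be read with that in mind.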

In [None]:
cells$cxcl_pos_tile %>% unique

In [None]:
# Create a list of cell types to iterate over
cellTypes = cells %>% 
    mutate(type_lvl3 = cxcl_pos_tile) %>%
    mutate(type_lvl2 = cxcl_pos_tile) %>%
    select(type_lvl2, type_lvl3) %>% 
    filter(!type_lvl2 %in% c('Plasma',  'Mast')) %>% # 'Epi',
    distinct
cellTypes
type_list <- lapply(split(cellTypes$type_lvl2, cellTypes$type_lvl3), unique)
type_list

# Run the full analysis
final_results <- run_global_hub_analysis(type_list, counts_list)

# Display the glimpse of the main summary table
glimpse(final_results$summary_stats)

## 7. Visualization

In this final section, we generate plots to visualize the results. The panels are faceted by interface type (hub-positive vs. hub-negative), and within each panel the cell categories are drawn as separate colored lines across the distance bins so that their distribution profiles can be compared.

In [None]:
# MSS_results = fread('MSS_results.csv')  # THIS COMES FROM THE OUTPUT OF A PREVIOUS NOTEBOOK Step1_MSI_vs_MSS_interfaces.ipynb - RUN THAT FIRST!
# head(MSS_results)

In [None]:
# Prepare data for plotting by combining hubPos and hubNeg results
df = bind_rows(list(hubPos = final_results$hubPos_results, 
                    hubNeg = final_results$hubNeg_results
                   ), .id = 'Status') 

# Calculate y-axis limits for plotting
df <- df %>%
    group_by(cell_type) %>%
    mutate(ymax = 100 * max(estimate + 1.96 * sqrt(variance))) %>%
    ungroup

# Create a list for grouping cell types by lineage
lineage_list <- list('CXCL_pos' = 'CXCL_pos', 'CXCL_neg' = 'CXCL_neg')
lineage_list

In [None]:
# 1. Define and create the list first
cell_type_list <- lineage_list

# 2. Now, pipe the created list into the other functions
order_of_cell_types <- cell_type_list %>%
    unlist() %>%
    str_wrap(string = ., width = 10, whitespace_only = FALSE)

# 3. View the final output
order_of_cell_types

In [None]:
find_midpoint <- function(interval_string) {
  # 1. Remove parentheses and brackets using gsub
  # The pattern is an alternation of the four escaped delimiters: ( [ ) ]
  cleaned_string <- gsub("\\(|\\[|\\)|\\]", "", interval_string)
  
  # 2. Split the string by the comma
  # strsplit returns a list, so we take the first element [[1]]
  num_strings <- strsplit(cleaned_string, ",")[[1]]
  
  # 3. Convert character vector to numbers and calculate the mean
  midpoint <- mean(as.numeric(num_strings))
  
  return(midpoint)
}

In [None]:
midpoints_of_bins = final_results$summary_stats$dist_bin %>%
    unique %>%
    lapply(., find_midpoint) %>%
    unlist()
names(midpoints_of_bins) = unique(final_results$summary_stats$dist_bin)
# fct_recode() expects new = "old" pairs, so swap to: midpoint = "bin label"
midpoints_of_bins <- setNames(names(midpoints_of_bins), midpoints_of_bins)
print(midpoints_of_bins)

In [None]:
final_results$summary_stats = final_results$summary_stats %>%
    mutate(dist_bin = factor(dist_bin)) %>%
    mutate(midpoint = forcats::fct_recode(dist_bin, !!!midpoints_of_bins)) %>%
    mutate(midpoint = as.vector(midpoint)) %>%
    mutate(midpoint = as.numeric(midpoint))

df = df %>%
    mutate(dist_bin = factor(dist_bin)) %>%
    mutate(midpoint = forcats::fct_recode(dist_bin, !!!midpoints_of_bins)) %>%
    mutate(midpoint = as.vector(midpoint)) %>%
    mutate(midpoint = as.numeric(midpoint))

final_results$summary_stats %>%
    select(dist_bin, midpoint) %>%
    distinct

In [None]:
final_results$summary_stats$cell_type %>% unique %>% str_wrap(string = ., width = 10, whitespace_only = FALSE)

# Supplementary Figure 4c

In [None]:
head(df)

In [None]:
order_of_cell_types

In [None]:
fig.size(h = 2, w = 13)
options(repr.plot.res = 400)
require(ggh4x)
require(ggragged)
# union() keeps the levels unique if this cell is run more than once
order_of_cell_types = union(order_of_cell_types, 'MMRp')
manual_breaks <- c(
    "[-100,-95]", "(-75,-70]", "(-50,-45]", "(-25,-20]",
    "(0,5]", "(25,30]", "(50,55]", "(75,80]", "(95,100]"
)
order_of_cell_types
# Prepare the data for the geom_text layers beforehand for clarity
text_data_asterisk <- final_results$summary_stats %>% 
    na.omit() %>%
    mutate(cell_type = str_wrap(string = cell_type, width = 10, whitespace_only = FALSE)) %>%
    filter(cell_type %in% order_of_cell_types) %>%
    mutate(cell_type = factor(cell_type, ordered = TRUE, levels = order_of_cell_types)) 

text_data_epi <- df %>% 
    mutate(Status = case_when(Status == 'hubPos' ~ 'Hub+\ninterface', Status == 'hubNeg' ~ 'Hub-\ninterface')) %>%
    mutate(Status = factor(Status, levels = c('Hub+\ninterface', 'Hub-\ninterface'))) %>%
    na.omit() %>%
    mutate(cell_type = str_wrap(string = cell_type, width = 10, whitespace_only = FALSE)) %>%
    filter(cell_type %in% order_of_cell_types) %>%
    mutate(cell_type = factor(cell_type, ordered = TRUE, levels = order_of_cell_types)) %>%
    select(cell_type, ymax) %>% 
    distinct() %>% 
    mutate(ymax = ymax + ymax/3, x = 32, label = '\nEpi-\nenriched')

text_data_stroma <- df %>% 
    na.omit() %>%
    mutate(Status = case_when(Status == 'hubPos' ~ 'Hub+\ninterface', Status == 'hubNeg' ~ 'Hub-\ninterface')) %>%
    mutate(Status = factor(Status, levels = c('Hub+\ninterface', 'Hub-\ninterface'))) %>%
    mutate(cell_type = str_wrap(string = cell_type, width = 10, whitespace_only = FALSE)) %>%
    filter(cell_type %in% order_of_cell_types) %>%
    mutate(cell_type = factor(cell_type, ordered = TRUE, levels = order_of_cell_types)) %>% 
    select(cell_type, ymax) %>% 
    distinct() %>% 
    mutate(ymax = ymax + ymax/3, x = 8, label = '\nStroma-\nenriched') 

# Create the plot
supp_fig_4c = df %>% 
    mutate(Status = case_when(Status == 'hubPos' ~ 'Hub+\ninterface', Status == 'hubNeg' ~ 'Hub-\ninterface')) %>%
    mutate(Status = factor(Status, levels = c('Hub+\ninterface', 'Hub-\ninterface'))) %>%
    mutate(cell_type = str_wrap(string = cell_type, width = 10, whitespace_only = FALSE)) %>%
    filter(cell_type %in% order_of_cell_types) %>%
    mutate(cell_type = factor(cell_type, ordered = TRUE, levels = order_of_cell_types)) %>%
    mutate(dist_bin = fct_reorder(dist_bin, midpoint)) %>%
    ggplot(
        data = ., 
        aes(dist_bin, 100 * estimate, 
            color = cell_type, 
            fill = cell_type)
    ) + 
    geom_vline(xintercept = c(20.5), linewidth = 0.5, linetype = 1, color = 'red') + 
    geom_line(show.legend = FALSE, aes(group = cell_type), alpha = 1, key_glyph = 'point', linewidth = 0.25) +
    geom_point(show.legend = FALSE, shape = '.') + 
    geom_errorbar(
            aes(ymin = 100 * (estimate - 1.96 * sqrt(variance)), ymax = 100 * (estimate + 1.96 * sqrt(variance))), 
            show.legend = FALSE, alpha = 0.5, linewidth = 0.25) + 
    labs(y = 'Percent of all cells', x = 'Distance from the interface (\U03BCm)', 
         title = 'Percent Hub+ and Hub- cells around Hub+ and Hub- interfaces', subtitle = 'IVW meta-analysis; mean & 95% CI, *padj<0.01') +    
    # geom_text(inherit.aes = FALSE,
    #     data = text_data_asterisk,
    #     aes(dist_bin, y = 100 * height, label = asterisk), 
    #     size = 2, vjust = .2, show.legend = FALSE, color = 'black'
    # ) + 
    scale_color_manual(drop = FALSE,
        name = 'Cell assignment to CXCR3L mask: ', 
        values = c('CXCL_pos' = 'darkred', 
                  'CXCL_neg' = '#595959',
                   'MMRp' = '#2866a0'
                  ),
        labels = c('CXCL_pos' = 'In CXCR3L mask', 
                   'CXCL_neg' = 'Outside CXCR3L mask',
                   'MMRp' = 'MMRp'
                  
                  )
    ) + 
        guides(color = guide_legend(
            nrow = 1,              # nrow is a guide_legend() argument, not an aesthetic
            override.aes = list(
                shape = 16,        # Shape 16 is a solid circle
                size = 3           # Make the circle a visible size
            ))) +
    scale_fill_manual(drop = FALSE,
        name = 'Cell assignment to CXCR3L mask: ', 
        values = c('CXCL_pos' = 'darkred', 
                  'CXCL_neg' = '#595959',
                   'MMRp' = '#2866a0'
                  ),
        labels = c('CXCL_pos' = 'In CXCR3L mask', 
                   'CXCL_neg' = 'Outside CXCR3L mask',
                   'MMRp' = 'MMRp'
                  
                  )
    ) + 
        guides(fill = guide_legend(
            nrow = 1,              # nrow is a guide_legend() argument, not an aesthetic
            override.aes = list(
                shape = 16,        # Shape 16 is a solid circle
                size = 3           # Make the circle a visible size
            ))) +
    facet_wrap2(~Status, scales = 'free_y', axes = "all", remove_labels = "x", nrow = 1) +
    cowplot::theme_half_open(7) + 
    theme(
        panel.spacing = unit(0, "cm"), 
        axis.text.x = element_text(angle = 90, hjust = 1, size = 7),
        strip.background = element_rect(fill = NA), 
        strip.text = element_text(size = 7, color = 'black'), # face = 'bold', 
        title = element_text(size = 7), 
        legend.position = 'top', 
        legend.text = element_text(size = 7)
    ) +
    scale_x_discrete(breaks = manual_breaks) + 
    NULL

supp_fig_4c
ggsave(filename = 'supplementary_fig_4_panel_C.pdf', 
       plot = supp_fig_4c, width = 3, height = 2, units = 'in')
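
The subtitle of this panel refers to an IVW (inverse-variance-weighted) meta-analysis, and the error bars span `estimate ± 1.96 * sqrt(variance)`. As a rough illustration of where such an `estimate` and `variance` could come from, the sketch below pools per-sample proportions with fixed-effect inverse-variance weights. The function `ivw_pool_sketch` and its toy inputs are hypothetical; this is not the implementation used by the notebook's own functions, which also apply empirical Bayes shrinkage that is omitted here.

```r
# Minimal sketch of fixed-effect inverse-variance-weighted (IVW) pooling.
# `estimates` and `variances` are hypothetical per-sample values for one
# cell type in one distance bin; they are not taken from the notebook's data.
ivw_pool_sketch <- function(estimates, variances) {
    stopifnot(length(estimates) == length(variances), all(variances > 0))
    w <- 1 / variances                       # more precise samples get larger weights
    pooled <- sum(w * estimates) / sum(w)    # weighted mean of the per-sample estimates
    pooled_var <- 1 / sum(w)                 # variance of the pooled estimate
    half_width <- qnorm(0.975) * sqrt(pooled_var)   # about 1.96 standard errors
    c(estimate = pooled,
      variance = pooled_var,
      ci_low   = pooled - half_width,
      ci_high  = pooled + half_width)
}

res <- ivw_pool_sketch(
    estimates = c(0.10, 0.14, 0.12),
    variances = c(0.0004, 0.0009, 0.0004)
)
print(res)
```

The pooled estimate always lies between the smallest and largest per-sample estimate, and its variance shrinks as more samples are added, which is why the pooled confidence intervals are narrower than those of any single sample.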

In [None]:
df$cell_type %>% unique

In [None]:
df %>% 
    mutate(Status = case_when(Status == 'hubPos' ~ 'Hub+\ninterface', Status == 'hubNeg' ~ 'Hub-\ninterface')) %>%
    mutate(cell_type = str_wrap(string = cell_type, width = 10, whitespace_only = FALSE)) %>%
    filter(cell_type %in% order_of_cell_types) %>%
    mutate(cell_type = factor(cell_type, ordered = TRUE, levels = order_of_cell_types)) %>%
    mutate(dist_bin = fct_reorder(dist_bin, midpoint)) %>%
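    # Average each cell type's estimate within a distance bin (e.g. over both
    # interface types), then sum across cell types within each bin
    group_by(cell_type, dist_bin) %>%
    summarize(estimate = mean(estimate), .groups = 'drop') %>%
    group_by(dist_bin) %>%
    summarize(estimate = sum(estimate)) # should be very close to 1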

# Supplementary figure 4d

In [None]:
mmr_map$SampleID[mmr_map$MMRstatus == 'MMRp']

In [None]:
# Set a higher limit for global variables when using parallel processing
options(future.globals.maxSize = 1e10)

# Run get_bins for each MMRp sample in parallel
mmrp_ids = mmr_map$SampleID[mmr_map$MMRstatus == 'MMRp']
system.time({
    counts_list_MMRp = future_map(mmrp_ids, function(.id) {
        # Overwrite type_lvl3 with MMRstatus so that cells are tallied by MMR status
        get_bins(cells[SampleID == .id] %>% mutate(type_lvl3 = MMRstatus), interfaces[[.id]])
    }, .options = furrr::furrr_options(seed = TRUE))
    names(counts_list_MMRp) = mmrp_ids
})

# Standardize matrices to ensure consistent columns across all samples
counts_list_MMRp = standardize_matrix_columns(counts_list_MMRp)

In [None]:
counts_list_MMRp %>% names

In [None]:
# Convert a named list of per-sample count matrices into one long table:
# one row per (Patient, measurement_bin), with the list names stored in `Patient`
counts_list_to_df = function(counts_list) {
    counts_list %>%
        lapply(function(x) {
            as.data.frame(x) %>%
                tibble::rownames_to_column(var = 'measurement_bin')
        }) %>%
        rbindlist(idcol = 'Patient')
}

hubNeg_counts_df = counts_list_to_df(hubNeg_counts_list) %>%
    mutate(Patient = gsub(pattern = '_hubNeg', replacement = '', x = Patient)) %>%
    mutate(Interface = 'Hub- cells around Hub- interfaces') %>%
    mutate(cells_used_in_interface_plots = CXCL_neg) %>%
    select(-c(`CXCL_pos`, `CXCL_neg`))
glimpse(hubNeg_counts_df)

hubPos_counts_df = counts_list_to_df(hubPos_counts_list) %>%
    mutate(Patient = gsub(pattern = '_hubPos', replacement = '', x = Patient)) %>%
    mutate(Interface = 'Hub+ cells around Hub+ interfaces') %>%
    mutate(cells_used_in_interface_plots = CXCL_pos) %>%
    select(-c(`CXCL_pos`, `CXCL_neg`))
glimpse(hubPos_counts_df)

MMRp_counts_df = counts_list_to_df(counts_list_MMRp) %>%
    mutate(Interface = 'MMRp cells around MMRp interfaces') %>%
    rename(cells_used_in_interface_plots = `MMRp`)
glimpse(MMRp_counts_df)

In [None]:
all_counts_df = rbindlist(list(hubNeg_counts_df, hubPos_counts_df, MMRp_counts_df), use.names = TRUE) %>%
    mutate(midpoint = unlist(lapply(measurement_bin, find_midpoint)))
glimpse(all_counts_df)
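
The `midpoint` column is what orders the text bin labels along the x axis in the plots below. `find_midpoint` itself is not shown in this section, so the following is only a sketch of what such a helper could look like, assuming labels of the form `(a,b]` or `[a,b]` like those in `measurement_bin`. The name `bin_midpoint_sketch` is hypothetical.

```r
# Hypothetical stand-in for a midpoint helper: parse an interval label such as
# "(-75,-70]" and return the mean of its two bounds.
bin_midpoint_sketch <- function(label) {
    # Drop the opening and closing brackets, then split the bounds on the comma
    stripped <- gsub("\\[|\\]|\\(|\\)", "", label)
    bounds <- as.numeric(strsplit(stripped, ",", fixed = TRUE)[[1]])
    stopifnot(length(bounds) == 2, !anyNA(bounds))
    mean(bounds)
}

labels <- c("[-100,-95]", "(-5,0]", "(0,5]", "(95,100]")
midpoints <- vapply(labels, bin_midpoint_sketch, numeric(1), USE.NAMES = FALSE)
print(data.frame(measurement_bin = labels, midpoint = midpoints))
```

Sorting by a numeric midpoint avoids the alphabetical ordering that character labels would otherwise get, where `(-25,-20]` would sort before `(-5,0]`.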

In [None]:
head(all_counts_df)

In [None]:
fig.size(2, 4)
all_counts_df %>%
    ggplot() +
        # log10(1 + n) keeps bins with zero cells finite, matching the final figure
        geom_line(aes(x = midpoint, y = log10(1 + cells_used_in_interface_plots), group = Patient, color = Patient)) +
        facet_wrap2(~Interface, scales = 'free_y', axes = "all", remove_labels = "x", nrow = 1) +
        cowplot::theme_half_open(7) + 
        theme(
            panel.spacing = unit(0, "cm"), 
            axis.text.x = element_text(angle = 90, hjust = 1, size = 7),
            strip.background = element_rect(fill = NA), 
            strip.text = element_text(size = 7, color = 'black'), # face = 'bold', 
            title = element_text(size = 7), 
            legend.position = 'top', 
            legend.text = element_text(size = 7)
        ) +
        guides(fill = guide_legend(override.aes = list(nrow = 1, shape = 16))) +
        guides(color = guide_legend(override.aes = list(nrow = 1, shape = 16))) +
        #facet_ragged_rows(rows = vars(type_lvl1), cols = vars(cell_type)) +
        NULL

In [None]:
all_counts_df$measurement_bin %>% unique

In [None]:
all_counts_df$Interface %>% unique
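
The plots in this section draw a red vertical line at `xintercept = 20.5` on a discrete x axis. Assuming the distance bins are 5 µm wide and span -100 to 100 µm (this matches the bin labels used for `manual_breaks`, but it is an assumption about how `get_bins` builds them), there are 40 ordered bins, and position 20.5 falls between `(-5,0]` and `(0,5]`, that is, at the interface itself. A small base-R sketch of that reasoning:

```r
# Assumption: bins are 5 um wide from -100 to 100 um, closed on the right,
# with the lowest bin closed on both sides (as the labels suggest).
breaks <- seq(-100, 100, by = 5)
bin_levels <- levels(cut(0, breaks = breaks, include.lowest = TRUE))

n_bins <- length(bin_levels)                      # number of distance bins
first_positive <- which(bin_levels == "(0,5]")    # first bin on the positive side
interface_x <- first_positive - 0.5               # halfway between the bins flanking 0

print(n_bins)
print(bin_levels[c(first_positive - 1, first_positive)])
print(interface_x)
```

If the bin width or range changes, the hard-coded 20.5 would need to be recomputed in the same way.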

In [None]:
# Label only a subset of the distance bins on the x axis to keep it readable
manual_breaks <- c(
    "[-100,-95]", "(-75,-70]", "(-50,-45]", "(-25,-20]",
    "(0,5]", "(25,30]", "(50,55]", "(75,80]", "(95,100]"
)
# fig.size(2, 4) 

supp_fig_4d = all_counts_df %>%
    # mutate(Interface = case_when(
    #     Interface == 'Hub+' ~ 'Hub+ cells around Hub+ interfaces',
    #     Interface == 'Hub-' ~ 'Hub- cells around Hub- interfaces',
    #     Interface == 'MMRp' ~ 'MMRp cells around MMRp interfaces'
    # )) %>%
    mutate(Interface = factor(Interface, 
                              levels = rev(c( 'MMRp cells around MMRp interfaces', 
                                              'Hub- cells around Hub- interfaces',
                                             'Hub+ cells around Hub+ interfaces'
                                            )))) %>%
    mutate(interface_title = fct_recode(Interface, 'MMRp' = 'MMRp cells around MMRp interfaces' , 
                                              'Hub-\ninterface' = 'Hub- cells around Hub- interfaces',
                                              'Hub+\ninterface' = 'Hub+ cells around Hub+ interfaces')) %>%
    ggplot() +
        geom_boxplot(key_glyph = 'point',
            aes(color = Interface,
                x = reorder(measurement_bin, midpoint),
                y = log10(1 + cells_used_in_interface_plots)),
            
            outlier.size = 0.5,
            linewidth = 0.25 
        ) +
        geom_vline(xintercept = 20.5, linewidth = 0.5, linetype = 1, color = 'red') +         
        facet_wrap2(~interface_title, scales = 'free_y', axes = "all", remove_labels = "x", nrow = 1) +
        scale_x_discrete(breaks = manual_breaks) +
        cowplot::theme_half_open(7) + 
        theme(
            panel.spacing = unit(0, "cm"), 
            axis.text.x = element_text(angle = 90, hjust = 1, size = 7),
            strip.background = element_rect(fill = NA), 
            strip.text = element_text(size = 7, color = 'black'), # face = 'bold', 
            title = element_text(size = 7), 
            legend.position = 'top', 
            legend.text = element_text(size = 7)
        ) +
        labs(y = 'log10(# cells)', x = 'Distance from the interface (\U03BCm)') +     
        # scale_color_manual(drop = FALSE,
        #     name = 'Cell assignment to CXCR3L mask: ', 
        #     values = c('Hub+ interface' = '#D55E00', 'Hub- interface' = '#009E73', 'MMRp interface' = 'grey'), 
        #     labels = c('Hub+' = 'Hub+ MMRd', 'Hub-' = 'Hub- MMRd', 'MMRp' = 'MMRp')
        # ) +
    scale_color_manual(drop = FALSE,
        name = 'Cell assignment to CXCR3L mask: ', 
        values = c('Hub+ cells around Hub+ interfaces' = 'darkred', 
                  'Hub- cells around Hub- interfaces' = '#595959',
                   'MMRp cells around MMRp interfaces' = '#2866a0'
                  ),
        labels = c('Hub+ cells around Hub+ interfaces' = 'In CXCR3L mask', 
                   'Hub- cells around Hub- interfaces' = 'Outside CXCR3L mask',
                   'MMRp cells around MMRp interfaces' = 'MMRp'
                  
                  )
    ) + 
        guides(color = guide_legend(
            nrow = 1,              # nrow is a guide_legend() argument, not an aesthetic
            override.aes = list(
                shape = 16,        # Shape 16 is a solid circle
                size = 3           # Make the circle a visible size
            ))) +
        NULL

supp_fig_4d

ggsave(filename = 'supplementary_fig_4_panel_D.pdf', 
       plot = supp_fig_4d, width = 4, height = 2, units = 'in')

In [None]:
fig.size(3, 7)
library(patchwork)
(supp_fig_4c + ggtitle('', subtitle = '') ) + supp_fig_4d + plot_layout(design = 'AABBB', guides = 'collect') & theme(legend.position = 'bottom')

In [None]:
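# Build the combined figure once and save it with ggsave(), as for the individual panels;
# a bare plot expression between pdf() and dev.off() is not reliably drawn to the file
supp_fig_4cd = (supp_fig_4c + ggtitle('', subtitle = '')) + supp_fig_4d +
    plot_layout(design = 'AABBB', guides = 'collect') &
    theme(legend.position = 'bottom')
ggsave(filename = 'supplementary_figure_4_c_and_d.pdf',
       plot = supp_fig_4cd, width = 7, height = 3, units = 'in')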