# Statistical Analysis of Differential Expression

## Sample-Level Generalized Linear Mixed Models (GLMM) 
### see ../meta_analysis_GLMMs/glmm_fine_20250112.ipynb  and ../meta_analysis_GLMMs/glmm_coarse_20250703.ipynb 

To identify cluster-specific marker genes while accounting for spatial and technical sources of variation, we performed a Generalized Linear Mixed Model (GLMM) analysis separately for each biological sample. For each sample, raw gene counts were modeled using a Poisson error distribution with a canonical log link function. The model specification included the total number of UMIs (log-transformed) as an offset to normalize for sequencing depth. To capture spatial dependencies, we included random intercepts for the coarse k-nearest neighbor (k-NN) clusters (knn_coarse) and the field of view (fov), nested as (1 | fov / knn_coarse). This hierarchical structure accounts for the correlation of gene expression within spatial neighborhoods and imaging fields. Marginal effects for each cluster were estimated using the presto package, identifying genes with differential expression relative to the grand mean. Genes expressed in fewer than 3 cells per group were excluded prior to modeling.
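
A minimal sketch of a model with this structure for a single gene, fitted with `lme4::glmer` on simulated data. The simulated data frame, its column names (`y`, `logUMI`, `fov`, `knn_coarse`), and all effect sizes are assumptions chosen for illustration; the analysis itself fits the models with presto on the real counts.

```r
library(lme4)

set.seed(1)
n <- 600
toy <- data.frame(
  fov        = factor(sample(paste0("fov", 1:6), n, replace = TRUE)),
  knn_coarse = factor(sample(paste0("c", 1:3), n, replace = TRUE)),
  logUMI     = log(round(runif(n, 200, 2000)))
)

# Simulated random intercepts: one per fov, one per cluster within fov
u_fov    <- rnorm(nlevels(toy$fov), sd = 0.3)
nested   <- interaction(toy$fov, toy$knn_coarse, drop = TRUE)
u_nested <- rnorm(nlevels(nested), sd = 0.5)

# Poisson counts with a log link; logUMI enters as an offset (coefficient fixed at 1)
eta   <- -6 + toy$logUMI + u_fov[as.integer(toy$fov)] + u_nested[as.integer(nested)]
toy$y <- rpois(n, lambda = exp(eta))

# (1 | fov / knn_coarse) expands to (1 | fov) + (1 | fov:knn_coarse)
fit <- glmer(
  y ~ 1 + (1 | fov / knn_coarse) + offset(logUMI),
  data   = toy,
  family = poisson(link = "log")
)

fixef(fit)    # intercept on the log scale (log rate per UMI)
VarCorr(fit)  # estimated variances of the two random intercepts
```

Because the offset fixes the coefficient of `logUMI` at 1, the intercept and random effects describe expression per UMI rather than raw counts.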

## DerSimonian-Laird Meta-Analysis

### ../meta_analysis_cluster_markers/meta_analysis_of_fine_type_markers.ipynb 

To identify robust consensus markers across all biological replicates, sample-level summary statistics (log-transformed coefficients $\beta$ and standard errors $\sigma$) were combined using a random-effects meta-analysis. We employed the DerSimonian-Laird (DL) estimator to calculate the between-study variance ($\tau^2$). For each feature within each cluster, we computed two estimates:

- Fixed Effects (FE) Model: Assumes a single true effect size shared by all samples, weighting estimates by the inverse of their within-sample variance ($w_{FE} = 1/\sigma^2$).
- Random Effects (RE) Model: Incorporates both within-sample variance and between-sample heterogeneity ($\tau^2$), weighting estimates as $w_{RE} = 1/(\sigma^2 + \tau^2)$.

Heterogeneity was assessed using Cochran's $Q$ statistic. To avoid false positives driven by outlier samples, final feature prioritization was based on the Random Effects Z-score ($Z_{RE} = \beta_{RE} / \sigma_{RE}$), which penalizes genes with high inter-sample disagreement. P-values were adjusted for multiple hypothesis testing using the Benjamini-Hochberg (FDR) procedure. Features were considered significant if they exhibited an FDR < 0.05 in the Random Effects model.
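
The estimates described above can be sketched for one feature in one cluster as follows. This is an illustrative, self-contained helper (`dl_meta` is a hypothetical name, and the input values are made up); it follows the standard DerSimonian-Laird formulas rather than reproducing the notebook's own function.

```r
# DerSimonian-Laird meta-analysis for a single feature within a single cluster.
# beta, sigma: per-sample effect sizes and standard errors (hypothetical inputs).
dl_meta <- function(beta, sigma) {
  k        <- length(beta)
  w_fe     <- 1 / sigma^2
  beta_fe  <- sum(w_fe * beta) / sum(w_fe)
  sigma_fe <- sqrt(1 / sum(w_fe))

  # Cochran's Q: weighted squared deviations from the fixed-effects mean
  Q <- sum(w_fe * (beta - beta_fe)^2)

  # DL estimate of the between-sample variance, truncated at zero
  denom <- sum(w_fe) - sum(w_fe^2) / sum(w_fe)
  tau2  <- if (k > 1) max(0, (Q - (k - 1)) / denom) else 0

  w_re     <- 1 / (sigma^2 + tau2)
  beta_re  <- sum(w_re * beta) / sum(w_re)
  sigma_re <- sqrt(1 / sum(w_re))
  z_re     <- beta_re / sigma_re

  c(beta_fe = beta_fe, sigma_fe = sigma_fe, Q = Q, tau2 = tau2,
    beta_re = beta_re, sigma_re = sigma_re, z_re = z_re,
    p_re = 2 * pnorm(-abs(z_re)))
}

# Three samples agree and one disagrees: tau2 > 0 widens sigma_re and shrinks z_re
res <- dl_meta(beta = c(1.0, 1.2, 0.9, -0.5), sigma = c(0.2, 0.2, 0.2, 0.2))
round(res, 3)
```

With equal standard errors the FE and RE point estimates coincide, so the only effect of heterogeneity in this toy case is the larger RE standard error, which is what lowers the RE Z-score.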

## Key References:

- GLMM/Presto: https://rdrr.io/github/immunogenomics/presto/f/vignettes/getting-started.Rmd
- Meta-Analysis: DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled clinical trials, 7(3), 177-188.

In [None]:
require(tidyverse)
require(data.table)
require(ComplexHeatmap)
require(circlize)
require(Seurat)
require(scales)
require(readxl)
require(patchwork)
require(sf)
require(ggpubr)
require(ggthemes)
require(harmony)
require(presto)
require(glue)
require(e1071)
require(caTools)
require(class)
require(lme4)
require(singlecellmethods)
require(future)
require(furrr)
require(gghighlight)
options(future.globals.maxSize = 1000 * 1024 ^2)
set.seed(1)
options(repr.plot.res = 500)

## function to combine analyses

In [None]:
#' Perform DerSimonian-Laird Meta-Analysis on Cluster Markers
#'
#' This function performs a meta-analysis of feature expression statistics across
#' samples (e.g., biological replicates or studies) within defined clusters. It computes
#' both Fixed Effects (FE) and Random Effects (RE) estimates. The Random Effects
#' model uses the DerSimonian-Laird estimator for between-study variance (\eqn{\tau^2}).
#'
#' @param obj A data.frame or object coercible to a data.table.
#'   The input object is expected to contain the following columns:
#'   \itemize{
#'     \item \code{beta}: The effect size (e.g., log fold change) for the feature.
#'     \item \code{sigma}: The standard error of the effect size.
#'     \item \code{cluster}: The cluster identity (e.g., cell type).
#'     \item \code{sampleID}: The identifier for the sample/replicate.
#'     \item \code{feature}: The feature identifier (e.g., gene name).
#'   }
#'
#' @return A \code{data.table} containing the meta-analysis results, with one row per
#'   feature and cluster, sorted by the Random Effects Z-score in descending order. Key output columns include:
#'   \itemize{
#'     \item \code{beta_fe}, \code{beta_re}: Weighted effect sizes for fixed and random effects.
#'     \item \code{sigma_fe}, \code{sigma_re}: Standard errors for the aggregate estimates.
#'     \item \code{z_fe}, \code{z_re}: Z-scores for fixed and random effects.
#'     \item \code{p_fe}, \code{p_re}: Two-tailed P-values.
#'     \item \code{fdr_fe}, \code{fdr_re}: Benjamini-Hochberg corrected FDR values.
#'   }
#'
#' @import data.table
#' @import dplyr
#'
#' @export
dsl <- function(obj) {
  
  # Convert input to data.table to leverage reference updating (:=) and fast aggregation
  cluster_markers <- data.table::data.table(obj)[
    
    ## 1. Clean up input data
    ## Remove existing Wald test statistics if they exist to prevent confusion 
    ## with the new meta-analysis stats being calculated.
    , `:=`(zscore = NULL, pvalue = NULL)
    
  ][
    ## 2. Regularize Standard Errors (Sigma)
    ## Apply a floor to the standard error (sigma) at 0.5. 
    ## This prevents features with extremely small variance (often artifacts in 
    ## single-cell data or dropouts) from dominating the weights.
    , sigma := pmax(0.5, sigma) 
    
  ][
    ## 3. Calculate Fixed Effects Weights
    ## Weight is the inverse of the variance (1 / sigma^2).
    ## Studies/samples with lower variance get higher weight.
    , w_fe := 1 / (sigma^2) 
    
  ][
    ## 4. Calculate Cochran's Q Statistic
    ## Q is the weighted sum of squared deviations of the per-sample effects 
    ## (beta) from the fixed-effects weighted mean. It is computed separately 
    ## for each feature within each cluster (one row per sample in each group).
    , Q := sum(w_fe * (beta - sum(w_fe * beta) / sum(w_fe))^2)
    , by = .(feature, cluster)
    
  ][
    ## 5. Calculate Between-Study Variance (Tau-squared)
    ## Uses the DerSimonian-Laird (DL) estimator.
    ## tau2 represents the variance of the true effect sizes across samples.
    ## If Q < degrees of freedom (N-1), tau2 is set to 0. Groups with a single 
    ## sample (.N == 1) also get tau2 = 0, since heterogeneity is not estimable.
    , tau2 := if (.N > 1) max(0, (Q[1] - (.N - 1)) / (sum(w_fe) - (sum(w_fe^2) / sum(w_fe)))) else 0
    , by = .(feature, cluster)
    
  ][
    ## 6. Calculate Random Effects Weights
    ## RE weights incorporate both within-study variance (sigma^2) and 
    ## between-study variance (tau2).
    , w_re := 1 / (sigma^2 + tau2)
    
  ][
    ## 7. Aggregate Results per Feature and Cluster
    ## Compute the weighted average of betas and the standard error of the 
    ## weighted mean for both Fixed and Random effects models.
    , .(
      # Fixed Effects Estimate: Weighted mean using w_fe
      beta_fe = sum(w_fe * beta) / sum(w_fe),
      # Fixed Effects SE: sqrt(1 / sum(weights))
      sigma_fe = sqrt(1 / sum(w_fe)),
      
      # Random Effects Estimate: Weighted mean using w_re
      beta_re = sum(w_re * beta) / sum(w_re),
      # Random Effects SE: sqrt(1 / sum(weights))
      sigma_re = sqrt(1 / sum(w_re))
    )
    , by = .(feature, cluster)
    
  ] %>%
    
    ## 8. Calculate Z-scores
    ## Z = Effect Size / Standard Error
    dplyr::mutate(
      z_fe = beta_fe / sigma_fe,
      z_re = beta_re / sigma_re
    ) %>%
    
    ## 9. Calculate P-values
    ## Two-tailed P-value based on the normal distribution.
    dplyr::mutate(
      p_fe = 2 * pnorm(-abs(z_fe)),
      p_re = 2 * pnorm(-abs(z_re))
    ) %>%
    
    ## 10. Multiple Hypothesis Correction
    ## Adjust P-values using the Benjamini-Hochberg (BH) method to control FDR.
    dplyr::mutate(
      fdr_fe = p.adjust(p_fe, 'BH'),
      fdr_re = p.adjust(p_re, 'BH')
    ) %>%
    
    ## 11. Final Sorting
    ## Sort the results by Random Effects Z-score (descending) to highlight 
    ## the most significant upregulated features first.
    dplyr::arrange(-z_re)
}


# Function to find per sample DEGs for top-level lineages

In [None]:
doGLMM_coarse = function(obj, effects_cov, filename){

    temp = GetAssayData(obj, slot = 'counts')
    varyingGenes = rownames(temp[apply(temp, 1, function(x){length(unique(x)) > 3}),])
    rm(temp)
    obj = obj[varyingGenes, ]
    pb = presto::collapse_counts(
        GetAssayData(obj, slot = 'counts'), 
        obj@meta.data, 
        c('orig.ident', 'fov', 'ClusterTop'), 
        min_cells_per_group = 3
    )

    pb$exprs_norm = pb$exprs_norm[rownames(pb$counts_mat), colnames(pb$counts_mat)]

    system.time({
    presto_res = presto::presto.presto(
        y ~ 1 + (1|ClusterTop) +  (1|fov/ClusterTop) + offset(logUMI), 
        pb$meta_data, 
        pb$counts_mat,
        size_varname = "logUMI", 
        effects_cov = 'ClusterTop',
        ncore = 1, 
        min_sigma = .05,
        family = "poisson",
        nsim = 1000 
    )})

    readr::write_rds(presto_res, filename)

    contrasts_mat = make_contrast.presto(
        presto_res, 
        var_contrast = effects_cov
    )
    
    effects_marginal = contrasts.presto(
    presto_res, 
    contrasts_mat, 
    one_tailed = TRUE
    ) %>% 
    dplyr::mutate(cluster = contrast) %>% 
    dplyr::mutate(
        logFC = sign(beta) * log2(exp(abs(beta))), # convert stats to log2 for interpretability 
        SD = log2(exp(sigma)),
        zscore = logFC / SD
    ) %>%
    arrange(pvalue)

    effects_marginal$fdr = p.adjust(effects_marginal$pvalue, method = 'BH')
    effects_marginal$corr_fdr = effects_marginal$fdr
    effects_marginal$corr_fdr[effects_marginal$fdr == 0] = min(effects_marginal$fdr[effects_marginal$fdr != 0])
    effects_marginal$`-log10_fdr` = (-1) * log10(effects_marginal$corr_fdr) 

    # Per-gene mean expression and, for each cluster, the fraction of cells with zero counts.
    # NOTE: meanExp is computed for inspection only; it is not part of the return value.
    meanExp = rowMeans(GetAssayData(obj, slot = 'data')) 
    meanExp = data.frame(feature = names(meanExp), meanExp = meanExp)
    counts = GetAssayData(obj, slot = 'counts')
    for (cluster in unique(effects_marginal$cluster)) {
        cells = rownames(obj@meta.data)[obj@meta.data$ClusterTop == cluster]
        # number of zeros per gene = cells in cluster minus cells with a non-zero count
        n_zeros = length(cells) - Matrix::rowSums(counts[, cells, drop = FALSE] > 0)
        meanExp[, cluster] = n_zeros / length(cells)
    }
    return(effects_marginal)
}
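
The base conversion used in the function above can be checked numerically. The values below are hypothetical natural-log coefficients and standard errors, not output from the model; the sketch only illustrates that both conversions are a rescaling by `1 / log(2)`.

```r
beta  <- c(-1.5, -0.2, 0, 0.7, 2.3)   # hypothetical natural-log coefficients
sigma <- c(0.4, 0.1, 0.3, 0.2, 0.5)   # hypothetical standard errors

logFC <- sign(beta) * log2(exp(abs(beta)))
SD    <- log2(exp(sigma))

# Both conversions are a rescaling by 1 / log(2)
all.equal(logFC, beta / log(2))
all.equal(SD, sigma / log(2))

# The common factor cancels, so the z-score is unchanged by the change of base
all.equal(logFC / SD, beta / sigma)
```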

# Function to prepare a marker heatmap

In [None]:
makeMarkerHeatmap = function(cluster_markers, cluster_rows = TRUE, cluster_columns = TRUE, row_km = NULL, column_km = NULL, row_names_fontsize = 18, width = 14, row_names_width = 20){
    cluster_markers = cluster_markers %>% 
        filter(beta_fe > 0)

    genes_to_show = cluster_markers %>% 
        group_by(cluster) %>%
        top_n(n = 10, wt = beta_re) %>%
        pull(feature) %>%
        unique

    mat = cluster_markers %>%
        filter(feature %in% genes_to_show) %>%
        group_by(cluster) %>%
        arrange(desc(beta_re), .by_group = TRUE) %>%
        mutate(feature = as.factor(feature), cluster = as.factor(cluster)) %>%
        pivot_wider(id_cols = 'cluster', 
        values_from = 'beta_re',
        names_from = 'feature', 
        values_fill = 0) %>%
        mutate(cluster = str_wrap(cluster, width = row_names_width)) %>%
        column_to_rownames(var = 'cluster') %>%
        as.matrix
    
    h1 = Heatmap(mat,
        name = 'log2 FC', 
        col = circlize::colorRamp2(c(0, max(mat)), c("white", "darkblue")),
        width = unit(width, 'in'), 
        rect_gp = gpar(col = "lightgrey", lwd = 2), 
        border = TRUE, 
        cluster_rows = cluster_rows,
        cluster_columns = cluster_columns,
        row_km = row_km,
        column_km = column_km,
        heatmap_legend_param = list(direction = "horizontal", 
        title_position = "lefttop", 
        legend_width = unit(10, "cm")),
        row_names_gp = gpar(fontsize = row_names_fontsize),
        column_title_gp = gpar(fontsize = 0),
        show_column_dend = FALSE,
        show_row_dend = FALSE
    ) 
    h1 = grid.grabExpr(draw(h1, heatmap_legend_side = 'bottom'), padding = unit(c(10, 2, 2, 2), "mm"))
    h1 = patchwork::wrap_elements(full=h1)
    return(h1)
}
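
The selection and reshaping steps inside the function above can be illustrated on a toy table. The data are invented, only one gene per cluster is kept instead of ten, and `slice_max()` is used here in place of `top_n()` (which dplyr documents as superseded); treat it as a sketch of the reshaping logic, not of the plotting.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical meta-analysis output: 2 clusters x 3 genes
toy <- tibble(
  cluster = rep(c("A", "B"), each = 3),
  feature = rep(c("g1", "g2", "g3"), times = 2),
  beta_fe = c(2.0, 0.5, -1.0, -0.3, 1.5, 0.8),
  beta_re = c(1.8, 0.4, -0.9, -0.2, 1.4, 0.7)
)

# Keep positive fixed-effects estimates only
pos <- toy %>% filter(beta_fe > 0)

# Top gene per cluster by random-effects estimate
genes_to_show <- pos %>%
  group_by(cluster) %>%
  slice_max(beta_re, n = 1) %>%
  pull(feature) %>%
  unique()

# Long-to-wide: clusters as rows, selected genes as columns, missing pairs filled with 0
mat <- pos %>%
  filter(feature %in% genes_to_show) %>%
  pivot_wider(id_cols = "cluster", names_from = "feature",
              values_from = "beta_re", values_fill = 0) %>%
  column_to_rownames("cluster") %>%
  as.matrix()

mat
```

Cells filled with 0 correspond to cluster and gene pairs that were dropped by the `beta_fe > 0` filter, so a white cell in the heatmap can mean either "no effect" or "negative effect".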

## Function to get marginal effects for fine types

In [None]:
# function to get marginal effects

getMarginalEffects = function(filename, dir_path) {
    presto_res = readr::read_rds(filename)

    print(head(presto_res))
    contrasts_mat = make_contrast.presto(presto_res,
                                       var_contrast = "knn_renamed_cell_states")
    
    effects_marginal = contrasts.presto(presto_res,
                                      contrasts_mat,
                                      one_tailed = TRUE) %>%
    dplyr::mutate(cluster = contrast) %>%
    dplyr::mutate(
      logFC = sign(beta) * log2(exp(abs(beta))),
      # convert stats to log2 for interpretability
      SD = log2(exp(sigma)),
      zscore = logFC / SD
    ) %>%
    arrange(pvalue)
    
    effects_marginal$fdr = p.adjust(effects_marginal$pvalue, method = 'BH')
    effects_marginal$corr_fdr = effects_marginal$fdr
    effects_marginal$corr_fdr[effects_marginal$fdr == 0] = min(effects_marginal$fdr[effects_marginal$fdr != 0])
    effects_marginal$`-log10_fdr` = (-1) * log10(effects_marginal$corr_fdr)


    #dir_path <- "../meta_analysis_GLMMs/fine_type_GLMM/"

    # Check and create
    if (!dir.exists(dir_path)) {
      # recursive = TRUE allows creating nested paths (e.g., "folder/subfolder")
      dir.create(dir_path, recursive = TRUE)}
    
    new_filename = gsub(filename, pattern = 'glmm.rds', replacement = 'fine_marginal_effects.csv' ) %>% gsub(x = ., pattern = 'meta_analysis_GLMMs', replacement = 'meta_analysis_GLMMs/fine_type_GLMM') 
    data.table::fwrite(effects_marginal, new_filename)
    return(new_filename)
    
}
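
The FDR handling in the function above replaces exact zeros before taking the logarithm. A small sketch with hypothetical p-values (the first has underflowed to 0) shows why this keeps `-log10(FDR)` finite:

```r
pvalue <- c(0, 1e-300, 1e-8, 0.01, 0.2, 0.9)  # hypothetical p-values

fdr <- p.adjust(pvalue, method = "BH")

# Replace exact zeros with the smallest non-zero FDR so that -log10() stays finite
corr_fdr <- fdr
corr_fdr[fdr == 0] <- min(fdr[fdr != 0])
neg_log10_fdr <- -log10(corr_fdr)

is.finite(-log10(fdr))    # the underflowed entry would give Inf
is.finite(neg_log10_fdr)  # all finite after the replacement
```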

# Coarse type markers

In [None]:
cluster_markers = list.files('../meta_analysis_GLMMs/', #/n/data1/bwh/medicine/korsunsky/lab/mup728/CRC_MERFISH_niches/labeled_seurat_objects
           full.names = TRUE,
           recursive = TRUE)
# grepl() takes a regular expression, not a shell glob: escape the dot and anchor the suffix
cluster_markers = cluster_markers[grepl(x = cluster_markers, pattern = 'coarse_marginal_effects\\.csv$')]

cluster_markers = map(cluster_markers, function(x){
    sampleID = gsub(x = x, 
                    pattern = '__coarse_marginal_effects.csv', 
                    replacement = "")
    sampleID = gsub(sampleID, pattern = '.*\\/', replacement = '')
    #message(sampleID)
    return(data.table::fread(x) %>% mutate(sampleID = sampleID))
}) %>% do.call(rbind, .) 
fwrite(cluster_markers, 'per_sample_coarse_lineage_cluster_markers.csv')
cluster_markers %>% dim
cluster_markers %>% head
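
The sample identifier is derived from the file path by two substitutions. The sketch below traces them on a hypothetical path (the sample name `sampleA` is invented) and shows an equivalent formulation with `basename()`.

```r
x <- "../meta_analysis_GLMMs//sampleA__coarse_marginal_effects.csv"  # hypothetical path

# Step 1: drop the file suffix; step 2: drop everything up to the last "/"
sampleID <- gsub(x = x, pattern = "__coarse_marginal_effects.csv", replacement = "")
sampleID <- gsub(sampleID, pattern = ".*\\/", replacement = "")
sampleID

# Equivalent: take the file name first, then remove the suffix as a literal string
sampleID_alt <- sub("__coarse_marginal_effects.csv", "", basename(x), fixed = TRUE)
identical(sampleID, sampleID_alt)
```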

In [None]:
cluster_markers = dsl(cluster_markers)
cluster_markers %>% dim
cluster_markers %>% head

In [None]:
fwrite(cluster_markers, 'meta_analyzed_coarse_lineage_cluster_markers.csv')

In [None]:
options(repr.plot.height = 4, repr.plot.width = 15.5)
cluster_markers = fread('meta_analyzed_coarse_lineage_cluster_markers.csv')
makeMarkerHeatmap(cluster_markers, column_km = 6)

# Collect per-sample fine type markers

In [None]:
cluster_markers = list.files('../meta_analysis_GLMMs/',
           full.names = TRUE,
           recursive = TRUE) 
cluster_markers = cluster_markers[grepl(x = cluster_markers, 
                                        pattern = '_TNKILC_marginal_effects.csv|_B_marginal_effects.csv|_Myeloid_marginal_effects.csv|_Strom_marginal_effects.csv|_Epi_marginal_effects.csv')]
cluster_markers = map(cluster_markers, function(x){
    lineage = gsub(x, pattern = '_marginal_effects.csv', replacement = '')
    lineage = gsub(lineage, pattern = '.*_', replacement = '')
    return(data.table::fread(x) %>% 
           mutate(lineage = lineage) %>%
           mutate(sampleID = gsub(x = x, 
                                                           pattern = '_.*_marginal_effects.csv|\\.\\/meta_analysis_GLMMs\\/\\/', 
                                                        replacement = "")))}) %>%
    rbindlist()
unique(cluster_markers$lineage)
data.table::fwrite(cluster_markers, file = 'collected_per_sample_within_lineage_fine_type_markers.csv')
sample_n(cluster_markers, 5)

## TNKILCs

### load cluster markers

In [None]:
cluster_markers = list.files('../meta_analysis_GLMMs/',
           full.names = TRUE,
           recursive = TRUE) 
cluster_markers = cluster_markers[grepl(x = cluster_markers, pattern = '_TNKILC_marginal_effects.csv')]
cluster_markers = map(cluster_markers, function(x){
    return(data.table::fread(x) %>% mutate(sampleID = gsub(x = x, 
                                                           pattern = '_TNKILC_marginal_effects.csv|\\.\\/meta_analysis_GLMMs\\/\\/', 
                                                        replacement = "")))
}) %>% do.call(rbind, .)
cluster_markers %>% dim
cluster_markers %>% head

### run dsl

In [None]:
cluster_markers = dsl(cluster_markers)
cluster_markers %>% write.csv(file = 'TNKILC_meta_analysis_markers.csv')
cluster_markers %>% dim
cluster_markers %>% head

In [None]:
cluster_markers = fread("TNKILC_meta_analysis_markers.csv")
head(cluster_markers)

### plot markers

In [None]:
options(repr.plot.height = 7, repr.plot.width = 18)
makeMarkerHeatmap(cluster_markers, column_km = 6, width = 14, row_names_fontsize = 12, row_names_width = 30)

## B

### load cluster markers

In [None]:
cluster_markers = list.files('../meta_analysis_GLMMs/',
           full.names = TRUE,
           recursive = TRUE) 
cluster_markers = cluster_markers[grepl(x = cluster_markers, pattern = '_B_marginal_effects.csv')]
cluster_markers = map(cluster_markers, function(x){
    return(data.table::fread(x) %>% mutate(sampleID = gsub(x = x, 
                                                           pattern = '_B_marginal_effects.csv|\\.\\/meta_analysis_GLMMs\\/\\/', 
                                                        replacement = "")))
}) %>% do.call(rbind, .)
cluster_markers %>% dim
cluster_markers %>% head

# cluster_markers = list.files('/n/data1/bwh/medicine/korsunsky/lab/mup728/CRC_MERFISH/Fine_typing_with_weighted_KNN/B/',
#            full.names = TRUE,
#            recursive = TRUE)
# cluster_markers = cluster_markers[grepl(x = cluster_markers, pattern = '*fovs.*.csv')]
# cluster_markers = map(cluster_markers, function(x){
#     return(data.table::fread(x) %>% mutate(sampleID = gsub(x = x, 
#                                                            pattern = '.*effects_marginal_merfish_fovs|.csv', 
#                                                         replacement = "")))
# }) %>% do.call(rbind, .)
# cluster_markers %>% dim
# cluster_markers %>% head

### run dsl

In [None]:
cluster_markers = dsl(cluster_markers)
cluster_markers %>% write.csv(file = 'B_meta_analysis_markers.csv')
cluster_markers %>% dim
cluster_markers %>% head

In [None]:
cluster_markers = fread("B_meta_analysis_markers.csv")
head(cluster_markers)

### plot markers

In [None]:
options(repr.plot.height = 5, repr.plot.width = 18)
mat = cluster_markers %>% 
group_by(cluster) %>%
top_n(n = 10, wt = beta_re) %>%
group_by(cluster) %>%
arrange(desc(beta_re), .by_group = TRUE) %>%
mutate(feature = as.factor(feature), cluster = as.factor(cluster)) %>%
pivot_wider(id_cols = 'cluster', 
            values_from = 'beta_re', 
            names_from = 'feature', 
            values_fill = 0) %>%
mutate(cluster = str_wrap(cluster, width = 30)) %>%
column_to_rownames(var = 'cluster') %>%
as.matrix
h1 = Heatmap(mat,
        name = 'beta_re', 
        col = circlize::colorRamp2(c(0, max(mat)), c("white", "darkblue")), 
        #col = circlize::colorRamp2(c(min(mat), 0, max(mat)), c(muted('blue'),'white', muted("red"))), 
        width = unit(13, 'in'), 
        rect_gp = gpar(col = "lightgrey", lwd = 2), 
        border = TRUE, 
        cluster_rows = FALSE,
        cluster_columns = FALSE,
        heatmap_legend_param = list(direction = "horizontal", 
            title_position = "lefttop", 
            legend_width = unit(10, "cm")),
        row_names_gp = gpar(fontsize = 14)
       ) 
draw(h1, padding = unit(c(5, 5, 5, 5), "mm"), heatmap_legend_side = 'bottom')

In [None]:
options(repr.plot.height = 4, repr.plot.width = 18)
cluster_markers = read.csv('B_meta_analysis_markers.csv')
makeMarkerHeatmap(cluster_markers, column_km = 6, width = 14, row_names_fontsize = 14, row_names_width = 20)

## Epi

### load cluster markers

In [None]:
cluster_markers = list.files('../meta_analysis_GLMMs/',
           full.names = TRUE,
           recursive = TRUE) 
cluster_markers = cluster_markers[grepl(x = cluster_markers, pattern = '_Epi_marginal_effects.csv')]
cluster_markers = map(cluster_markers, function(x){
    return(data.table::fread(x) %>% mutate(sampleID = gsub(x = x, 
                                                           pattern = '_Epi_marginal_effects.csv|\\.\\/meta_analysis_GLMMs\\/\\/', 
                                                        replacement = "")))
}) %>% do.call(rbind, .)
cluster_markers %>% dim
cluster_markers %>% head

# cluster_markers = list.files('/n/data1/bwh/medicine/korsunsky/lab/mup728/CRC_MERFISH/Fine_typing_with_weighted_KNN/Epi/',
#            full.names = TRUE,
#            recursive = TRUE)
# cluster_markers = cluster_markers[grepl(x = cluster_markers, pattern = '*fovs.*.csv')]
# cluster_markers = map(cluster_markers, function(x){
#     return(data.table::fread(x) %>% mutate(sampleID = gsub(x = x, 
#                                                            pattern = '.*effects_marginal_merfish_fovs|.csv', 
#                                                         replacement = "")))
# }) %>% do.call(rbind, .)
# cluster_markers %>% dim
# cluster_markers %>% head

### run dsl

In [None]:
cluster_markers = dsl(cluster_markers)
cluster_markers %>% dim
cluster_markers %>% head
cluster_markers %>% write.csv(file = 'Epi_meta_analysis_markers.csv')

### plot markers

In [None]:
options(repr.plot.height = 6, repr.plot.width = 18)
cluster_markers = fread("Epi_meta_analysis_markers.csv")
head(cluster_markers)
makeMarkerHeatmap(cluster_markers, column_km = 7, width = 14, row_names_fontsize = 14, row_names_width = 20)

## Myeloid

### load cluster markers

In [None]:
cluster_markers = list.files('../meta_analysis_GLMMs/',
           full.names = TRUE,
           recursive = TRUE) 
cluster_markers = cluster_markers[grepl(x = cluster_markers, pattern = '_Myeloid_marginal_effects.csv')]
cluster_markers = map(cluster_markers, function(x){
    return(data.table::fread(x) %>% mutate(sampleID = gsub(x = x, 
                                                           pattern = '_Myeloid_marginal_effects.csv|\\.\\/meta_analysis_GLMMs\\/\\/', 
                                                        replacement = "")))
}) %>% do.call(rbind, .)
cluster_markers %>% dim
cluster_markers %>% head

# cluster_markers = list.files('/n/data1/bwh/medicine/korsunsky/lab/mup728/CRC_MERFISH/Fine_typing_with_weighted_KNN/Myeloid/',
#            full.names = TRUE,
#            recursive = TRUE)
# cluster_markers = cluster_markers[grepl(x = cluster_markers, pattern = '*fovs.*.csv')]
# cluster_markers = map(cluster_markers, function(x){
#     return(data.table::fread(x) %>% mutate(sampleID = gsub(x = x, 
#                                                            pattern = '.*effects_marginal_merfish_fovs|.csv', 
#                                                         replacement = "")))
# }) %>% do.call(rbind, .)
# cluster_markers %>% dim
# cluster_markers %>% head

### run dsl

In [None]:
cluster_markers = dsl(cluster_markers)
cluster_markers %>% dim
cluster_markers %>% head
cluster_markers %>% write.csv(file = 'Myeloid_meta_analysis_markers.csv')

### plot markers

In [None]:
options(repr.plot.height = 7, repr.plot.width = 19)
cluster_markers = fread("Myeloid_meta_analysis_markers.csv")
cluster_markers$cluster = gsub(pattern = "\\_", replacement = " ", x = cluster_markers$cluster)
head(cluster_markers)
makeMarkerHeatmap(cluster_markers, column_km = 7, width = 15, row_names_fontsize = 12, row_names_width = 30)

## Strom

### load cluster markers

In [None]:
cluster_markers = list.files('../meta_analysis_GLMMs/',
           full.names = TRUE,
           recursive = TRUE) 
cluster_markers = cluster_markers[grepl(x = cluster_markers, pattern = '_Strom_marginal_effects.csv')]
cluster_markers = map(cluster_markers, function(x){
    return(data.table::fread(x) %>% mutate(sampleID = gsub(x = x, 
                                                           pattern = '_Strom_marginal_effects.csv|\\.\\/meta_analysis_GLMMs\\/\\/', 
                                                        replacement = "")))
}) %>% do.call(rbind, .)
cluster_markers %>% dim
cluster_markers %>% head

# cluster_markers = list.files('/n/data1/bwh/medicine/korsunsky/lab/mup728/CRC_MERFISH/Fine_typing_with_weighted_KNN/Strom/',
#            full.names = TRUE,
#            recursive = TRUE)
# cluster_markers = cluster_markers[grepl(x = cluster_markers, pattern = '*fovs.*.csv')]
# cluster_markers = map(cluster_markers, function(x){
#     return(data.table::fread(x) %>% mutate(sampleID = gsub(x = x, 
#                                                            pattern = '.*effects_marginal_merfish_fovs|.csv', 
#                                                         replacement = "")))
# }) %>% do.call(rbind, .)
# cluster_markers %>% dim
# cluster_markers %>% head

### run dsl

In [None]:
cluster_markers = dsl(cluster_markers)
cluster_markers %>% dim
cluster_markers %>% head
cluster_markers %>% write.csv(file = 'Strom_meta_analysis_markers.csv')

In [None]:
cluster_markers = fread("Strom_meta_analysis_markers.csv")
head(cluster_markers)

### plot markers

In [None]:
options(repr.plot.height = 7, repr.plot.width = 20)
cluster_markers = fread("Strom_meta_analysis_markers.csv")
cluster_markers$cluster = gsub(pattern = "\\_", replacement = " ", x = cluster_markers$cluster)
head(cluster_markers)
makeMarkerHeatmap(cluster_markers, column_km = 7, width = 16, row_names_fontsize = 12, row_names_width = 35)

# Collect meta-analyzed fine type markers

In [None]:
per_sample_markers = data.table::fread('collected_per_sample_within_lineage_fine_type_markers.csv')
head(per_sample_markers)
state_to_lineage = per_sample_markers %>% select(contrast, lineage) %>% distinct %>% rename(cluster = contrast)
state_to_lineage

In [None]:
# Anchor the pattern so that only the five per-lineage result files are read
list.files(pattern = '^(TNKILC|B|Epi|Myeloid|Strom)_meta_analysis_markers\\.csv$') %>%
    lapply(., FUN = data.table::fread) %>%
    rbindlist() %>%
    select(!V1) %>%
    left_join(., state_to_lineage, by = 'cluster') %>%
    data.table::fwrite(., file = 'collected_meta_analyzed_within_lineage_fine_type_markers.csv')

In [None]:
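# Preview of the collected table (same pipeline as above, without writing to disk)
list.files(pattern = '^(TNKILC|B|Epi|Myeloid|Strom)_meta_analysis_markers\\.csv$') %>%
    lapply(., FUN = data.table::fread) %>%
    rbindlist() %>%
    select(!V1) %>%
    left_join(., state_to_lineage, by = 'cluster') %>%
    head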
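
The `pattern` argument of `list.files()` is a regular expression that may match anywhere in a file name, so short alternatives such as `B` can pick up unrelated files. The sketch below compares an unanchored and an anchored pattern on a vector of file names; the last entry is a hypothetical unrelated file.

```r
files <- c(
  "TNKILC_meta_analysis_markers.csv",
  "B_meta_analysis_markers.csv",
  "Epi_meta_analysis_markers.csv",
  "Myeloid_meta_analysis_markers.csv",
  "Strom_meta_analysis_markers.csv",
  "DEG_Backup_notes.txt"   # hypothetical unrelated file containing a capital "B"
)

loose  <- "TNKILC|B|Epi|Myeloid|Strom"
strict <- "^(TNKILC|B|Epi|Myeloid|Strom)_meta_analysis_markers\\.csv$"

files[grepl(loose, files)]   # also returns the unrelated file
files[grepl(strict, files)]  # returns only the five per-lineage result files
```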

## mean expression

In [None]:
annotated_merged = readr::read_rds('/n/data1/bwh/medicine/korsunsky/lab/mup728/CRC_MERFISH_niches/labeled_seurat_objects/annotated_merged_merfish.rds')
annotated_merged

### myeloid markers

In [None]:
cluster_markers = fread("Myeloid_meta_analysis_markers.csv")
head(cluster_markers)

In [None]:
myeloid_data = subset(annotated_merged, ClusterTop == 'Myeloid')

In [None]:
myeloid_data

In [None]:
markers = cluster_markers %>% 
group_by(cluster) %>%
top_n(n = 10, wt = beta_re) %>%
group_by(cluster) %>%
arrange(desc(beta_re), .by_group = TRUE) %>%
mutate(feature = as.factor(feature), cluster = as.factor(cluster)) %>%
pull(feature) %>%
unique() %>%
as.vector()
markers

In [None]:
pb = AverageExpression(myeloid_data, group.by = 'cleaned_fine_types', slot = 'counts')
pb
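
Conceptually, averaging raw counts by group amounts to a per-group row mean. The base R sketch below illustrates that idea on an invented count matrix; it assumes that `AverageExpression` with `slot = 'counts'` reduces to an arithmetic mean of counts within each group, which should be checked against the installed Seurat version.

```r
# Hypothetical 3 genes x 6 cells count matrix and per-cell labels
counts <- matrix(
  c(0, 2, 4, 1, 1, 1,
    5, 5, 5, 0, 0, 3,
    1, 0, 2, 2, 4, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("g", 1:3), paste0("cell", 1:6))
)
labels <- c("t1", "t1", "t1", "t2", "t2", "t2")

# Mean count per gene within each label (genes x groups)
avg <- sapply(split(seq_along(labels), labels), function(idx) {
  rowMeans(counts[, idx, drop = FALSE])
})
avg
```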

In [None]:
pb = pb$RNA[markers,]
dim(pb)
pb[1:5, 1:5]

In [None]:
pb = as.matrix(pb) %>% t

In [None]:
pb %>% head

In [None]:
range(pb)

In [None]:
options(repr.plot.width = 20, repr.plot.height = 10)
h1 = Heatmap(as.matrix(pb),
        name = 'avg_counts', 
        col = circlize::colorRamp2(c(0, max(pb)), c("white", "darkred")), 
        width = unit(17, 'in'), 
        rect_gp = gpar(col = "lightgrey", lwd = 2), 
        border = TRUE, 
        cluster_rows = FALSE,
        cluster_columns = FALSE,
        heatmap_legend_param = list(direction = "horizontal", 
            title_position = "lefttop", 
            legend_width = unit(10, "cm")),
        row_names_gp = gpar(fontsize = 14)
       ) 
draw(h1, padding = unit(c(5, 5, 5, 5), "mm"), heatmap_legend_side = 'bottom')

### stromal markers

In [None]:
strom_markers = fread("Strom_meta_analysis_markers.csv")
head(strom_markers)

In [None]:
strom_data = subset(annotated_merged, ClusterTop == 'Strom')
strom_data

In [None]:
strom_markers = strom_markers %>% 
group_by(cluster) %>%
top_n(n = 10, wt = beta_re) %>%
group_by(cluster) %>%
arrange(desc(beta_re), .by_group = TRUE) %>%
mutate(feature = as.factor(feature), cluster = as.factor(cluster)) %>%
pull(feature) %>%
unique() %>%
as.vector()
strom_markers

In [None]:
pb = AverageExpression(strom_data, group.by = 'cleaned_fine_types', slot = 'counts')
pb

In [None]:
pb = pb$RNA[strom_markers,]
dim(pb)
pb[1:5, 1:5]

In [None]:
pb = as.matrix(pb) %>% t

In [None]:
pb %>% head

In [None]:
range(pb)

In [None]:
options(repr.plot.width = 20, repr.plot.height = 10)
h1 = Heatmap(as.matrix(pb),
        name = 'avg_counts', 
        col = circlize::colorRamp2(c(0, max(pb)), c("white", "darkred")), 
        width = unit(17, 'in'), 
        rect_gp = gpar(col = "lightgrey", lwd = 2), 
        border = TRUE, 
        cluster_rows = FALSE,
        cluster_columns = FALSE,
        heatmap_legend_param = list(direction = "horizontal", 
            title_position = "lefttop", 
            legend_width = unit(10, "cm")),
        row_names_gp = gpar(fontsize = 14)
       ) 
draw(h1, padding = unit(c(5, 5, 5, 5), "mm"), heatmap_legend_side = 'bottom')

### epi markers

In [None]:
epi_markers = fread("Epi_meta_analysis_markers.csv")
head(epi_markers)

In [None]:
epi_data = subset(annotated_merged, ClusterTop == 'Epi')
epi_data

In [None]:
markers = epi_markers %>% 
group_by(cluster) %>%
top_n(n = 10, wt = beta_re) %>%
group_by(cluster) %>%
arrange(desc(beta_re), .by_group = TRUE) %>%
mutate(feature = as.factor(feature), cluster = as.factor(cluster)) %>%
pull(feature) %>%
unique() %>%
as.vector()
markers