In [None]:
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In [None]:
# Install Google Colab dependencies
# Note: this can take 30+ minutes (many of the dependencies include C++ code, which needs to be compiled)

# First install `sf`, `ragg` and `textshaping` and their system dependencies:
system("apt-get -y update && apt-get install -y  libudunits2-dev libgdal-dev libgeos-dev libproj-dev libharfbuzz-dev libfribidi-dev")
install.packages("sf")
install.packages("textshaping")
install.packages("ragg")

# Install system dependencies of some other R packages that Voyager either imports or suggests:
system("apt-get install -y libfribidi-dev libcairo2-dev libmagick++-dev")

# Install Voyager from Bioconductor:
install.packages("BiocManager")
BiocManager::install(version = "release", ask = FALSE, update = FALSE, Ncpus = 2)
BiocManager::install("scater")
system.time(
  BiocManager::install("Voyager", dependencies = TRUE, Ncpus = 2, update = FALSE)
)

# Additional dependencies for this notebook
install.packages("reticulate")
#system("pip3 install gget")
packageVersion("Voyager")

# Introduction

For areal spatial data, the spatial neighborhood graph is used to indicate proximity, and is required for all spatial analysis methods in the package `spdep`. One of the methods to find the spatial neighborhood graph is k nearest neighbors, which is also commonly used in gene expression PCA space for graph-based clustering of cells in non-spatial scRNA-seq data. Then what if we use the k nearest neighbors graph in PCA space rather than histological space for "spatial" analyses for non-spatial scRNA-seq data?

Here we try such an analysis on a human peripheral blood mononuclear cells (PBMC) scRNA-seq dataset, which doesn't originally have histological spatial organization. These are the packages loaded for the analysis:

In [None]:
library(Voyager)
library(SpatialFeatureExperiment)
library(SpatialExperiment)
library(DropletUtils)
library(BiocNeighbors)
library(scater)
library(scran)
library(bluster)
library(BiocParallel)
library(scuttle)
library(stringr)
library(BiocSingular)
library(spdep)
library(patchwork)
library(dplyr)
library(reticulate)
theme_set(theme_bw())

In [None]:
# Specify Python version to use gget
PY_PATH <- Sys.which("python")
use_python(PY_PATH)
py_config()

# Load gget
gget <- import("gget")

Here we download the filtered [Cell Ranger gene count matrix from the 10X website](https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_v3_nextgem?). The empty droplets have already been removed.

In [None]:
if (!dir.exists("filtered_feature_bc_matrix")) {
    download.file("https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_v3_nextgem/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz", destfile = "5kpbmc.tar.gz", quiet = TRUE)
    system("tar -xzf 5kpbmc.tar.gz")
}

This is loaded into R as a `SingleCellExperiment` (SCE) object.

In [None]:
(sce <- read10xCounts("filtered_feature_bc_matrix/"))

In [None]:
colnames(sce) <- sce$Barcode

# Quality control (QC)
Here we perform some basic QC to remove low-quality cells with a high proportion of mitochondrially encoded counts.

In [None]:
is_mito <- str_detect(rowData(sce)$Symbol, "^MT-")
sum(is_mito)

In [None]:
sce <- addPerCellQCMetrics(sce, subsets = list(mito = is_mito))
names(colData(sce))

The `addPerCellQCMetrics()` function computes the total UMI counts detected per cell (`sum`), number of genes detected per cell (`detected`), `sum` and `detected` for mitochondrial counts, and percentage of mitochondrial counts per cell.
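
As a rough illustration of what these metrics mean, the sketch below recomputes them in base R on a small made-up count matrix. The matrix, gene symbols, and cell names are hypothetical. The column names mirror the ones used in this notebook, but the code is only a sketch of the arithmetic, not the `scuttle` implementation.

```r
# Toy count matrix: 5 genes (rows) x 3 cells (columns); all values are made up
counts <- matrix(c(5, 0, 2,
                   0, 3, 1,
                   1, 1, 0,
                   4, 0, 6,
                   0, 2, 2),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(c("CD3E", "LYZ", "MS4A1", "MT-CO1", "MT-ND1"),
                                 c("cell1", "cell2", "cell3")))

# Flag mitochondrially encoded genes by their symbol prefix
is_mito <- grepl("^MT-", rownames(counts))
mito_counts <- counts[is_mito, , drop = FALSE]

qc <- data.frame(
    sum = colSums(counts),                        # total UMI counts per cell
    detected = colSums(counts > 0),               # genes with non-zero counts per cell
    subsets_mito_sum = colSums(mito_counts),      # mitochondrial UMI counts per cell
    subsets_mito_detected = colSums(mito_counts > 0)
)
# Percentage of each cell's counts that come from mitochondrial genes
qc$subsets_mito_percent <- 100 * qc$subsets_mito_sum / qc$sum
qc
```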

In [None]:
plotColData(sce, "sum") +
    plotColData(sce, "detected") +
    plotColData(sce, "subsets_mito_percent")

Here a 2D histogram is plotted to better show point density on the plot.

In [None]:
plotColData(sce, x = "sum", y = "detected", bins = 100)

In [None]:
plotColData(sce, x = "sum", y = "subsets_mito_percent", bins = 100)

Keep cells with less than 20% mitochondrial counts, then remove genes that have no counts in the remaining cells:

In [None]:
sce <- sce[, sce$subsets_mito_percent < 20]
sce <- sce[rowSums(counts(sce)) > 0,]

# Basic non-spatial analyses
Here we normalize the data, perform PCA, cluster the cells, and find marker genes for the clusters.

In [None]:
#clusts <- quickCluster(sce)
#sce <- computeSumFactors(sce, cluster = clusts)
#sce <- sce[, sizeFactors(sce) > 0]
sce <- logNormCounts(sce)
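
For intuition, library size normalization followed by a log transform can be sketched in base R as below. This assumes size factors proportional to each cell's total counts and scaled to a mean of 1, a pseudocount of 1, and a base 2 logarithm. The toy matrix is made up, and the sketch is not a substitute for `logNormCounts()`.

```r
# Toy count matrix: 3 genes x 3 cells (made-up values)
counts <- matrix(c(4, 0, 2,
                   0, 6, 2,
                   4, 6, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3)))

# Size factors: library sizes scaled so that their mean is 1
lib_size <- colSums(counts)
size_factors <- lib_size / mean(lib_size)

# Divide each cell (column) by its size factor, add a pseudocount of 1, take log2
logcounts_toy <- log2(sweep(counts, 2, size_factors, FUN = "/") + 1)
logcounts_toy
```

In this toy example, `gene1` has 4 counts in `cell1` and 2 counts in `cell3`, but `cell3` has half the library size, so both end up with the same normalized value.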

Use the highly variable genes for PCA:

In [None]:
dec <- modelGeneVar(sce, lowess = FALSE)
hvgs <- getTopHVGs(dec, n = 2000)

In [None]:
set.seed(29)
sce <- runPCA(sce, ncomponents = 30, BSPARAM = IrlbaParam(),
              subset_row = hvgs, scale = TRUE)
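
The elbow plot in the next step shows the percentage of variance explained by each PC. As a small sketch of where such numbers come from, the code below runs an exact PCA with base R's `prcomp()` on simulated data. This swaps in `prcomp()` for the approximate IRLBA algorithm used above, and the simulated matrix is purely illustrative.

```r
set.seed(29)
# Simulated log-expression matrix: 20 genes (rows) x 50 cells (columns)
mat <- matrix(rnorm(20 * 50), nrow = 20)
# Shift 5 genes down in the first 25 cells and up in the last 25 cells,
# so that there is some structure for the first PC to pick up
shift <- rep(c(-2, 2), each = 25)
mat[1:5, ] <- sweep(mat[1:5, ], 2, shift, FUN = "+")

# prcomp() expects observations (cells) in rows; scale. = TRUE standardizes genes
pca <- prcomp(t(mat), center = TRUE, scale. = TRUE)

# Percentage of variance explained by each PC
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
round(head(pct_var, 5), 1)
```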

How many PCs shall we use for further analyses?

In [None]:
ElbowPlot(sce, ndims = 30)

Variance explained drops sharply from PC1 and again from PC4 and then levels off. Plot genes with the largest loadings of the top 4 PCs:

In [None]:
plotDimLoadings(sce, swap_rownames = "Symbol")

To keep a little more information, we use 10 PCs, after which variance explained levels off even more.

In [None]:
sce$cluster <- clusterRows(reducedDim(sce, "PCA")[,1:10],
                           BLUSPARAM = SNNGraphParam(cluster.fun = "leiden",
                                                     k = 10,
                                                     cluster.args = list(
                                                         resolution=0.5,
                                                         objective_function = "modularity"
                                                     )))

Here we plot the cells in the first 4 PCs in a matrix plot. The diagonals are density plots of the number of cells as projected on each PC. The x axis should correspond to the columns of the matrix plot, and the y axis should correspond to the rows, so the plot in row 1 column 2 has PC2 in the x axis and PC1 in the y axis. The cells are colored by clusters found in the previous code chunk.

In [None]:
plotPCA(sce, ncomponents = 4, color_by = "cluster")

How many cells are there in each cluster?

In [None]:
table(sce$cluster)

Then we use the conventional Wilcoxon rank sum test to find marker genes for each cluster. The test compares each cluster to the rest of the cells, and only genes more highly expressed in the cluster compared to the other cells are considered. The result is a list of data frames, where each data frame corresponds to one cluster. Areas under the receiver operating characteristic (ROC) curve (AUC), distinguishing each cluster vs. any other cluster, are also included. The closer the AUC is to 1 the better, while 0.5 means no better than random guessing. The false discovery rate (FDR) column contains the Benjamini-Hochberg corrected p-values. Genes in these data frames are already sorted by p-values.

In [None]:
markers <- findMarkers(sce, groups = colData(sce)$cluster,
                       test.type = "wilcox", pval.type = "all", direction = "up")
markers[[4]]
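
To make the AUC and FDR columns more concrete, here is a minimal base R sketch for a single gene and a single two-group comparison. The expression values and p-values are made up. The sketch only illustrates the link between the Wilcoxon rank sum statistic and the AUC, and does not reproduce how `findMarkers()` combines pairwise comparisons.

```r
# Made-up log-expression values of one gene in two groups of cells
in_cluster <- c(3, 4, 5)
other_cells <- c(1, 2, 3.5)

# AUC: probability that a random in-cluster cell has higher expression than a
# random cell from the other group, with ties counted as one half
pairs_gt <- outer(in_cluster, other_cells, ">")
pairs_eq <- outer(in_cluster, other_cells, "==")
auc <- mean(pairs_gt + 0.5 * pairs_eq)

# The same number from the Wilcoxon rank sum statistic W divided by n1 * n2
w <- wilcox.test(in_cluster, other_cells)$statistic
auc_from_w <- unname(w) / (length(in_cluster) * length(other_cells))

# Benjamini-Hochberg adjustment of a vector of made-up p-values
p_adj <- p.adjust(c(0.001, 0.01, 0.03, 0.5), method = "BH")

c(auc = auc, auc_from_w = auc_from_w)
p_adj
```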

See how specific the top markers are to each cluster:

In [None]:
top_markers <- unlist(lapply(markers, function(x) head(rownames(x), 1)))
top_markers_symbol <- rowData(sce)[top_markers, "Symbol"]

In [None]:
plotExpression(sce, top_markers_symbol, x = "cluster", swap_rownames = "Symbol",
               point_fun = function(...) list())

We can use the [gget info](https://pachterlab.github.io/gget/info.html) module from the [gget](https://pachterlab.github.io/gget/) package to get additional information for these marker genes. For example, their [NCBI](https://www.ncbi.nlm.nih.gov/) description:

In [None]:
gget_info <- gget$info(top_markers)

rownames(gget_info) <- gget_info$ensembl_gene_name
select(gget_info, ncbi_description)

# "Spatial" analyses for QC metrics
Here we find the k nearest neighbor graph in PCA space for Moran's I. We are not using `spdep` to build the graph since its `nb2listwdist()` function for distance based edge weighting requires 2-3 dimensional spatial coordinates while the coordinates here have 10 dimensions. Here, inverse distance weighting is used for edge weights.

In [None]:
foo <- findKNN(reducedDim(sce, "PCA")[,1:10], k=10, BNPARAM=AnnoyParam())
# Split by row
foo_nb <- asplit(foo$index, 1)
dmat <- 1/foo$distance
# Row normalize the weights
dmat <- sweep(dmat, 1, rowSums(dmat), FUN = "/")
glist <- asplit(dmat, 1)
# Sort based on index
ord <- lapply(foo_nb, order)
foo_nb <- lapply(seq_along(foo_nb), function(i) foo_nb[[i]][ord[[i]]])
class(foo_nb) <- "nb"
glist <- lapply(seq_along(glist), function(i) glist[[i]][ord[[i]]])

listw <- list(style = "W",
              neighbours = foo_nb,
              weights = glist)
class(listw) <- "listw"
attr(listw, "region.id") <- colnames(sce)
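
The weighting scheme above can be checked on a toy example. The sketch below uses a brute-force exact search on 5 made-up points rather than the approximate Annoy search used above, then applies the same inverse distance weighting and row normalization. The coordinates are hypothetical.

```r
# Toy coordinates: 5 points in 3 dimensions (made up; stands in for PCA space)
coords <- rbind(c(0, 0, 0),
                c(1, 0, 0),
                c(0, 2, 0),
                c(3, 3, 0),
                c(3, 3, 1))
k <- 2

# Brute-force k nearest neighbors from the full Euclidean distance matrix
d <- as.matrix(dist(coords))
diag(d) <- Inf  # a point is not its own neighbor
knn_index <- unname(t(apply(d, 1, function(x) order(x)[seq_len(k)])))
knn_dist <- unname(t(apply(d, 1, function(x) sort(x)[seq_len(k)])))

# Inverse distance weights, then row normalization so each row sums to 1
w <- 1 / knn_dist
w <- sweep(w, 1, rowSums(w), FUN = "/")

knn_index
w
```

Point 1 has neighbors at distances 1 and 2, so the closer neighbor receives weight 2/3 and the farther one 1/3. Note that duplicated points would give a distance of 0 and therefore an infinite weight, so real data may need a check for that case.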

Because there is no histological space, we convert the SCE object into `SpatialFeatureExperiment` (SFE) to use the spatial analysis and plotting functions from `Voyager`, and pretend that the first 2 PCs are the histological space.

In [None]:
(sfe <- toSpatialFeatureExperiment(sce, spatialCoords = reducedDim(sce, "PCA")[,1:2],
                                  spatialCoordsNames = NULL))

Add the k nearest neighbor graph to the SFE object:

In [None]:
colGraph(sfe, "knn10") <- listw

## Moran's I

In [None]:
sfe <- colDataMoransI(sfe, c("sum", "detected", "subsets_mito_percent"))
colFeatureData(sfe)[c("sum", "detected", "subsets_mito_percent"),]

For total UMI counts (sum) and genes detected (detected), Moran's I is quite strong, while it's positive but weaker for percentage of mitochondrial counts. The second column, K, is kurtosis of the feature of interest. 
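
For readers unfamiliar with the statistic, the following base R sketch computes global Moran's I from its textbook definition on a toy ring graph, where each node has its two ring neighbors with weight 0.5 each. The graph and values are made up, and the helper names `ring_weights()` and `moran_i()` are hypothetical, not part of `Voyager` or `spdep`.

```r
# Row-normalized weights for a ring graph: each node is linked to its two neighbors
ring_weights <- function(n) {
    W <- matrix(0, n, n)
    for (i in seq_len(n)) {
        left <- if (i == 1) n else i - 1
        right <- if (i == n) 1 else i + 1
        W[i, c(left, right)] <- 0.5
    }
    W
}

# Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)
moran_i <- function(x, W) {
    z <- x - mean(x)
    (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

W <- ring_weights(8)
x_smooth <- c(1, 2, 3, 4, 4, 3, 2, 1)      # neighbors have similar values
x_alternating <- rep(c(1, 4), times = 4)   # neighbors always differ

i_smooth <- moran_i(x_smooth, W)
i_alternating <- moran_i(x_alternating, W)
c(smooth = i_smooth, alternating = i_alternating)
```

The smooth pattern gives a positive value (0.7 here), while the alternating pattern gives the most negative value possible on this graph (-1).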

## Moran plot

How about the local variations on the k nearest neighbors graph? In the Moran plot, the x axis is the value for each cell, and the y axis is the average value among neighboring cells in the graph weighted by edge weights. The slope of the fitted line is Moran's I. Sometimes there are clusters in this plot, showing different kinds of neighborhoods.

In [None]:
sfe <- colDataUnivariate(sfe, "moran.plot", c("sum", "detected", "subsets_mito_percent"))
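
The statement that the slope of the fitted line is Moran's I can be checked numerically when the weights are row normalized. Below is a minimal sketch on a made-up ring graph with simulated values. The helper `ring_weights()` is hypothetical and only serves this illustration.

```r
# Row-normalized weights for a ring graph: each node is linked to its two neighbors
ring_weights <- function(n) {
    W <- matrix(0, n, n)
    for (i in seq_len(n)) {
        left <- if (i == 1) n else i - 1
        right <- if (i == n) 1 else i + 1
        W[i, c(left, right)] <- 0.5
    }
    W
}

set.seed(1)
n <- 30
W <- ring_weights(n)
x <- cumsum(rnorm(n))  # simulated values with some autocorrelation along the ring

# y axis of the Moran plot: weighted average of each node's neighbors
lag_x <- as.vector(W %*% x)

# Slope of the least squares line through the Moran scatter plot
slope <- unname(coef(lm(lag_x ~ x))[2])

# Moran's I from its definition; with row-normalized weights, n / S0 = 1
z <- x - mean(x)
moran <- sum(z * as.vector(W %*% z)) / sum(z^2)

c(slope = slope, moran = moran)
```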

The dashed lines are the averages on the x and y axes.

In [None]:
moranPlot(sfe, "sum", color_by = "cluster")

While most cells are in the cluster around the average, there is a cluster of cells with lower total counts whose neighbors also have lower total counts. There also is a cluster of cells with higher total counts whose neighbors also have higher total counts. These clusters seem to be somewhat related to some gene expression based clusters.

In [None]:
moranPlot(sfe, "detected", color_by = "cluster")

In [None]:
moranPlot(sfe, "subsets_mito_percent", color_by = "cluster")

There is one main cluster on this plot for the number of genes detected and for the percentage of mitochondrial counts. However, cells are somewhat separated by gene expression clusters. This is not surprising because the gene expression clusters are also based on the k nearest neighbor graph. Cluster 4 cells have a higher percentage of mitochondrial counts and so do their neighbors. 

## Local Moran's I
Also see local Moran's I for these 3 QC metrics:

In [None]:
sfe <- colDataUnivariate(sfe, "localmoran", c("sum", "detected", "subsets_mito_percent"))
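
As a sketch of what is computed per cell, local Moran's I can be written as the centered value of a cell times the weighted average of its neighbors' centered values, divided by the variance. The code below assumes the variance uses the n denominator and that weights are row normalized. The toy ring graph, values, and helper names are made up for illustration.

```r
# Row-normalized weights for a ring graph: each node is linked to its two neighbors
ring_weights <- function(n) {
    W <- matrix(0, n, n)
    for (i in seq_len(n)) {
        left <- if (i == 1) n else i - 1
        right <- if (i == n) 1 else i + 1
        W[i, c(left, right)] <- 0.5
    }
    W
}

# Local Moran's I: I_i = z_i * sum_j w_ij z_j / m2, where m2 = sum(z^2) / n
local_moran <- function(x, W) {
    z <- x - mean(x)
    m2 <- sum(z^2) / length(x)
    z * as.vector(W %*% z) / m2
}

W <- ring_weights(8)
x <- c(1, 2, 3, 4, 4, 3, 2, 1)
ii <- local_moran(x, W)
ii

# With row-normalized weights, the local values average to the global Moran's I
mean(ii)
```

Under these assumptions, the extreme values at the ends of the smooth pattern get the highest local Moran's I (1.2), the values near the mean get lower ones (0.2), and the average (0.7) equals the global statistic.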

Here, we don't have a histological space. So how can we visualize the local "spatial" statistics? [UMAP is bad](https://www.biorxiv.org/content/10.1101/2021.08.25.457696v1), but in this case PCA can somewhat separate the clusters. We can use the first 2 PCs as if they are the histological space. As a reference, we plot the metrics themselves and the clusters in the first 2 PCs. 

In [None]:
plotSpatialFeature(sfe, c("sum", "detected", "subsets_mito_percent", "cluster"))

Plot the local Moran's I for these metrics in the first 2 PCs:

In [None]:
plotLocalResult(sfe, "localmoran", c("sum", "detected", "subsets_mito_percent"), 
                colGeometryName = "centroids",
                divergent = TRUE, diverge_center = 0, ncol = 2)

However, what if there is no good 2D representation of the data for easy plotting? Remember that here the k nearest neighbor graph was computed on the first 10 PCs rather than the first 2 PCs. The graph is not tied to the 2D representation. We can still plot histograms to show the distribution and scatter plots to compare the same local metric for different variables, which can be colored by another variable such as cluster. This may be added to the next release of Voyager. For now, we add the results of interest to `colData(sfe)` and use the existing `colData` plotting functions from `scater` and `Voyager`.

In [None]:
localResultAttrs(sfe, "localmoran", "sum")

In [None]:
sfe$sum_localmoran <- localResult(sfe, "localmoran", "sum")[,"Ii"]
sfe$detected_localmoran <- localResult(sfe, "localmoran", "detected")[,"Ii"]
sfe$pct_mito_localmoran <- localResult(sfe, "localmoran", "subsets_mito_percent")[,"Ii"]

In [None]:
# Colorblind friendly palette
data("ditto_colors")

In [None]:
plotColDataFreqpoly(sfe, c("sum_localmoran", "detected_localmoran", 
                           "pct_mito_localmoran"), bins = 50, 
                    color_by = "cluster") +
    scale_y_log10() +
    annotation_logticks(sides = "l")

The y axis is log transformed (hence the warning when some bins have no cells) so that the cells in the long tail remain visible, because most cells don't have very strong local Moran's I. Cells in cluster 7 have high local Moran's I in total UMI counts and genes detected, which means that they tend to be more homogeneous in these QC metrics.

How do local Moran's I for these QC metrics relate to each other?

In [None]:
plotColData(sfe, x = "sum_localmoran", y = "detected_localmoran", 
            color_by = "cluster") +
    scale_color_manual(values = ditto_colors)

Cells more locally homogeneous in total UMI counts are also more homogeneous in number of genes detected, which is not surprising given the correlation between the two.

In [None]:
plotColData(sfe, x = "sum_localmoran", y = "pct_mito_localmoran", 
            color_by = "cluster") +
    scale_color_manual(values = ditto_colors)

For local Moran's I, `sum` vs percentage mitochondrial counts shows a more interesting pattern, highlighting clusters 4 and 7 as in the Moran plots.

How does local Moran's I relate to the value itself?

In [None]:
plotColData(sfe, x = "sum", y = "sum_localmoran", color_by = "cluster") +
    geom_density2d(data = as.data.frame(colData(sfe)),
                   mapping = aes(x = sum, y = sum_localmoran), color = "blue", 
                   linewidth = 0.3) +
    scale_color_manual(values = ditto_colors)

In this case, generally cells with higher total counts also tend to have higher local Moran's I in total counts. However, there is another wing where cells with lower total counts have slightly higher local Moran's I in total counts and there's a central value of total counts with near 0 local Moran's I. The density contour shows that cells are concentrated at that central value.

## Local spatial heteroscedasticity (LOSH)

LOSH indicates heterogeneity around each cell in the k nearest neighbor graph.

In [None]:
sfe <- colDataUnivariate(sfe, "LOSH", c("sum", "detected", "subsets_mito_percent"))
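
For intuition, here is a base R sketch of LOSH following our reading of the definition by Ord and Getis: each cell's squared deviation from its local weighted mean is averaged over the cell's neighbors and scaled by the global average of those squared deviations. The exponent `a = 2`, the toy ring graph, and the helper names are assumptions for illustration, and the sketch has not been checked line by line against the `spdep` implementation.

```r
# Row-normalized weights for a ring graph: each node is linked to its two neighbors
ring_weights <- function(n) {
    W <- matrix(0, n, n)
    for (i in seq_len(n)) {
        left <- if (i == 1) n else i - 1
        right <- if (i == n) 1 else i + 1
        W[i, c(left, right)] <- 0.5
    }
    W
}

# LOSH sketch: H_i = sum_j w_ij e_j / (h1 * sum_j w_ij),
# where e_j = |x_j - local mean at j|^a and h1 = mean(e)
losh <- function(x, W, a = 2) {
    w_sum <- rowSums(W)
    local_mean <- as.vector(W %*% x) / w_sum
    e <- abs(x - local_mean)^a
    h1 <- mean(e)
    as.vector(W %*% e) / (h1 * w_sum)
}

W <- ring_weights(8)
x <- c(0, 0, 0, 0, 10, 0, 0, 0)  # a single spike in an otherwise flat signal
h <- losh(x, W)
round(h, 3)
```

In this toy example, LOSH is highest for the two nodes next to the spike, whose neighborhoods mix the spike with flat values, and 0 for nodes far from it.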

In [None]:
plotLocalResult(sfe, "LOSH", c("sum", "detected", "subsets_mito_percent"), 
                colGeometryName = "centroids", ncol = 2)

Here we make the same non-spatial plots for LOSH as in local Moran's I.

In [None]:
localResultAttrs(sfe, "LOSH", "sum")

In [None]:
sfe$sum_losh <- localResult(sfe, "LOSH", "sum")[,"Hi"]
sfe$detected_losh <- localResult(sfe, "LOSH", "detected")[,"Hi"]
sfe$pct_mito_losh <- localResult(sfe, "LOSH", "subsets_mito_percent")[,"Hi"]

In [None]:
plotColDataFreqpoly(sfe, c("sum_losh", "detected_losh", 
                           "pct_mito_losh"), bins = 50, 
                     color_by = "cluster") +
    scale_y_log10() +
    annotation_logticks(sides = "l")

Here, clusters 2 and 6 tend to be more locally heterogeneous. How do total counts and genes detected relate in LOSH?

In [None]:
plotColData(sfe, x = "sum_losh", y = "detected_losh", color_by = "cluster") +
    scale_color_manual(values = ditto_colors)

While generally cells higher in LOSH in total counts are also higher in LOSH in genes detected, there are some outliers that are very high in both, with more heterogeneous neighborhoods. Absolute distance to the neighbors is not taken into account when the adjacency matrix is row normalized. It would be interesting to see if those outliers tend to be further away from their 10 nearest neighbors, or in a region in the PCA space where cells are further apart.

How does total counts itself relate to its LOSH?

In [None]:
plotColData(sfe, x = "sum", y = "sum_losh", color_by = "cluster") +
    scale_color_manual(values = ditto_colors)

There does not seem to be a clear relationship in this case. 

# "Spatial" analyses for gene expression
First, we need to reorganize the differential expression results:

In [None]:
top_markers_df <- lapply(seq_along(markers), function(i) {
    out <- markers[[i]][markers[[i]]$FDR < 0.05, c("FDR", "summary.AUC")]
    if (nrow(out)) out$cluster <- i
    out
})
top_markers_df <- do.call(rbind, top_markers_df)
top_markers_df$symbol <- rowData(sce)[rownames(top_markers_df), "Symbol"]

## Moran's I

In [None]:
sfe <- runMoransI(sfe, features = hvgs, BPPARAM = MulticoreParam(2))

The results are added to `rowData(sfe)`. The NA's are for non-highly variable genes, as Moran's I was only computed for highly variable genes here.

In [None]:
rowData(sfe)

How are the Moran's I's for highly variable genes distributed? Also, where are the top cluster marker genes in this distribution?

In [None]:
plotRowDataHistogram(sfe, "moran_sample01", bins = 50) +
    geom_vline(data = as.data.frame(rowData(sfe)[top_markers,]) |> 
                   mutate(index = seq_along(top_markers)),
               aes(xintercept = moran_sample01, color = index)) +
    scale_color_continuous(breaks = scales::breaks_width(2))

The top marker genes all have quite positive Moran's I on the k nearest neighbor graph. It would also be interesting to color this histogram by gene sets. Since the k nearest neighbor graph was found in PCA space, which is based on gene expression, as expected, Moran's I with this graph is mostly positive, although often not that strong. A small number of genes have slightly negative Moran's I. What do the top genes look like in PCA?

In [None]:
top_moran <- head(rownames(sfe)[order(rowData(sfe)$moran_sample01, decreasing = TRUE)], 4)
plotSpatialFeature(sfe, top_moran, ncol = 2)

In [None]:
top_moran_symbol <- rowData(sfe)[top_moran, "Symbol"]
plotExpression(sfe, top_moran_symbol, swap_rownames = "Symbol")

They are all marker genes for the same cluster, cluster 9. Perhaps these genes have high Moran's I because they are specific to a cell type. Then how does the Moran's I relate to cluster AUC and cluster differential expression p-value?

In [None]:
# See if markers are unique to clusters
anyDuplicated(rownames(top_markers_df))

In [None]:
top_markers_df$moran <- rowData(sfe)[rownames(top_markers_df), "moran_sample01"]
top_markers_df$log_p_adj <- -log10(top_markers_df$FDR)
top_markers_df$cluster <- factor(top_markers_df$cluster, 
                                 levels = seq_len(length(unique(top_markers_df$cluster))))

How does the differential expression p-value relate to Moran's I?

In [None]:
as.data.frame(top_markers_df) |> 
    ggplot(aes(log_p_adj, moran)) +
    geom_point(aes(color = cluster)) +
    geom_smooth(method = "lm") +
    scale_color_manual(values = ditto_colors)

Generally, more significant marker genes tend to have higher Moran's I. This is not surprising because the clusters and Moran's I here are both based on the k nearest neighbor graph.

In [None]:
as.data.frame(top_markers_df) |> 
    ggplot(aes(summary.AUC, moran)) +
    geom_point(aes(color = cluster)) +
    geom_smooth(method = "lm") +
    scale_color_manual(values = ditto_colors)

Similarly, genes with higher AUC tend to have higher Moran's I. Generally speaking, across clusters, genes more specific to a cluster tend to have higher Moran's I.

Let's use permutation testing to see if Moran's I is statistically significant:

In [None]:
sfe <- runUnivariate(sfe, "moran.mc", features = top_markers, nsim = 200)
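
The idea behind the permutation test can be sketched in base R: shuffle the values among the nodes many times, recompute Moran's I each time, and compare the observed value to this null distribution. The ring graph, the simulated signal, and the helper names below are made up, and the p-value formula is the usual one-sided Monte Carlo estimate rather than a copy of the `spdep` code.

```r
# Row-normalized weights for a ring graph: each node is linked to its two neighbors
ring_weights <- function(n) {
    W <- matrix(0, n, n)
    for (i in seq_len(n)) {
        left <- if (i == 1) n else i - 1
        right <- if (i == n) 1 else i + 1
        W[i, c(left, right)] <- 0.5
    }
    W
}

# Global Moran's I for a weights matrix W
moran_i <- function(x, W) {
    z <- x - mean(x)
    (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

set.seed(29)
n <- 50
W <- ring_weights(n)
x <- sin(2 * pi * seq_len(n) / n)  # smooth signal, so neighbors are similar

obs <- moran_i(x, W)

# Null distribution: Moran's I after randomly reassigning values to nodes
nsim <- 200
sims <- replicate(nsim, moran_i(sample(x), W))

# One-sided Monte Carlo p-value for positive spatial autocorrelation
p_value <- (sum(sims >= obs) + 1) / (nsim + 1)
c(observed = obs, p_value = p_value)
```

With `nsim` simulations, the smallest possible p-value is `1 / (nsim + 1)`, so more simulations are needed to resolve very small p-values.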

In [None]:
top_markers_symbol

In [None]:
plotMoranMC(sfe, top_markers, swap_rownames = "Symbol")

They all seem to be very significant. 

The correlogram computes Moran's I for higher orders of neighbors (neighbors of neighbors, and so on), so the order of neighbors can serve as a proxy for distance.

In [None]:
system.time({
    sfe <- runUnivariate(sfe, "sp.correlogram", top_markers, order = 6, 
                     zero.policy = TRUE, BPPARAM = MulticoreParam(2))
})
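
To illustrate what "higher order neighbors" means, the sketch below finds order-k neighbors on a toy path graph as the nodes whose shortest path length is exactly k, and then computes Moran's I at each order. The graph, the values, and the helper names are hypothetical. How rows without neighbors are treated here (left as all-zero rows) is a simplifying assumption and may differ from what `spdep` does.

```r
# Path graph 1 - 2 - 3 - 4 - 5 - 6 as a binary adjacency matrix
n <- 6
A <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
    A[i, i + 1] <- 1
    A[i + 1, i] <- 1
}

# Order-k neighbors via breadth-first expansion from every node at once
neighbors_by_order <- function(A, max_order) {
    n <- nrow(A)
    reached <- diag(n) > 0   # nodes already assigned to a lower order (or self)
    frontier <- diag(n)      # nodes reached at the previous order
    out <- vector("list", max_order)
    for (k in seq_len(max_order)) {
        newly <- ((frontier %*% A) > 0) & !reached
        out[[k]] <- newly * 1
        reached <- reached | newly
        frontier <- newly * 1
    }
    out
}

# Moran's I using row-normalized weights built from a binary neighbor matrix
moran_at_order <- function(x, nb) {
    rs <- rowSums(nb)
    W <- nb / ifelse(rs > 0, rs, 1)  # rows without neighbors stay all zero
    z <- x - mean(x)
    (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

nb <- neighbors_by_order(A, max_order = 3)
x <- 1:6  # a linear gradient along the path
i_by_order <- sapply(nb, function(m) moran_at_order(x, m))
round(i_by_order, 3)
```

For this gradient, Moran's I is clearly positive for direct neighbors and decays as the order increases, which is the kind of pattern the correlogram displays.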

In [None]:
plotCorrelogram(sfe, top_markers, swap_rownames = "Symbol")

We see different patterns of decay in spatial autocorrelation and different length scales of spatial autocorrelation. CLU is a marker gene very specific to the smallest cluster, so higher order neighbors are very likely to be from other clusters. Marker genes for the other larger clusters with hundreds of cells nevertheless display different patterns in the correlogram.

## Local Moran's I

In [None]:
sfe <- runUnivariate(sfe, "localmoran", features = top_markers)

In [None]:
plotLocalResult(sfe, "localmoran", top_markers, colGeometryName = "centroids", 
                divergent = TRUE, diverge_center = 0, ncol = 3,
                swap_rownames = "Symbol")

We will also plot the histograms, but for now the results need to be added to `colData` first.

In [None]:
new_colname <- paste0("cluster", seq_along(top_markers), "_", 
                      top_markers_symbol, "_localmoran")
for (i in seq_along(top_markers)) {
    g <- top_markers[i]
    colData(sfe)[[new_colname[i]]] <- 
        localResult(sfe, "localmoran", g)[,"Ii"]
}

In [None]:
plotColDataFreqpoly(sfe, new_colname, color_by = "cluster") +
    ggtitle("Local Moran's I") +
    theme(legend.position = "top") +
    scale_y_log10() +
    annotation_logticks(sides = "l")

Again, the y axis is log transformed to make the tail more visible. For some clusters, the top marker gene's local Moran's I forms its own peak for cells in the cluster with higher local Moran's I than other cells. However, sometimes cells within the cluster form a long tail shared with some cells from other clusters. Then the local Moran's I could be another method for differential expression. Or since both local Moran's I and Leiden clustering use the k nearest neighbor graph in PCA space, local Moran's I of marker genes or perhaps eigengenes signifying gene programs for each cell type on this k nearest neighbor graph can validate or criticize Leiden clusters. Furthermore, interestingly, for some genes, the tallest peak in the histogram is away from 0.

The scatter plots as shown in the "spatial" analyses for the QC metrics section can be made to see how local Moran's I relates to the expression of the gene itself.

In [None]:
i <- 6 # Change if running this notebook
plotExpression(sfe, top_markers_symbol[i], x = new_colname[i], color_by = "cluster",
               swap_rownames = "Symbol") +
    scale_color_manual(values = ditto_colors) +
    coord_flip() +
    # comment out in case of error after changing i
    geom_density2d(data = as.data.frame(colData(sfe)) |> 
                       mutate(gene = logcounts(sfe)[top_markers[i],]),
                   mapping = aes(x = .data[[new_colname[i]]], y = gene), 
                   color = "blue", linewidth = 0.3) 

For this gene, just like for total UMI counts, there are two wings and a central value where local Moran's I is around 0. Generally, cells with higher expression of this gene have higher local Moran's I for this gene as well. The density contours show that cells concentrate around 0 expression and some weaker positive local Moran. The streak of cells with 0 expression means that many cells don't express this gene, and their neighbors have low and slightly homogeneous expression of this gene. This pattern may be different for different genes. Also, the p-values for each cell for local Moran's I are available and corrected for multiple hypothesis testing, and can be plotted. The p-values are based on the z score of the local Moran statistic, although how the statistic is distributed for gene expression data warrants more investigation. This p-value can also be computed with permutation (see `localmoran_perm()`). 

In [None]:
localResultAttrs(sfe, "localmoran", top_markers[1])

## LOSH

In [None]:
sfe <- runUnivariate(sfe, "LOSH", top_markers)

In [None]:
plotLocalResult(sfe, "LOSH", top_markers, colGeometryName = "centroids", ncol = 3,
                swap_rownames = "Symbol")

In the two genes on the right, it's interesting to see higher LOSH in the middle cluster. The two genes on the left have some outliers throwing off the dynamic range, but it seems that their high LOSH regions are different.

Again, we plot the histograms:

In [None]:
new_colname2 <- paste0("cluster", seq_along(top_markers), "_", 
                      top_markers_symbol, "_losh")
for (i in seq_along(top_markers)) {
    g <- top_markers[i]
    colData(sfe)[[new_colname2[i]]] <- 
        localResult(sfe, "LOSH", g)[,"Hi"]
}

In [None]:
plotColDataFreqpoly(sfe, new_colname2, color_by = "cluster") +
    ggtitle("Local heteroscedasticity") +
    theme(legend.position = "top") +
    scale_y_log10() +
    annotation_logticks(sides = "l")

The relationship between expression and LOSH is more complicated. For some genes, such as the top marker gene for cluster 1 LYAR, cells in the cluster with higher expression also have higher LOSH - much like how in Poisson and negative binomial distributions, higher mean also means higher variance. However, some genes, such as the top marker gene for cluster 2 CTSS, have lower LOSH among cells that have higher expression, which means expression of this gene is more homogeneous within the cluster, consistent with local Moran. 

In [None]:
i <- 6 # Change if running this notebook
plotExpression(sfe, top_markers_symbol[i], x = new_colname2[i], 
               color_by = "cluster", swap_rownames = "Symbol") +
    scale_color_manual(values = ditto_colors) +
    coord_flip() +
    # comment out in case of error after changing i
    geom_density2d(data = as.data.frame(colData(sfe)) |> 
                       mutate(gene = logcounts(sfe)[top_markers[i],]),
                   mapping = aes(x = .data[[new_colname2[i]]], y = gene), 
                   color = "blue", linewidth = 0.3) 

For this gene, the density contour indicates that many cells don't express this gene and have homogeneous neighborhoods also with low expression. That streak around 0 expression means that neighbors of cells that don't express this gene have different levels of heterogeneity in this gene.

## Moran plot

Here we make Moran plots for the top marker genes.

In [None]:
sfe <- runUnivariate(sfe, "moran.plot", features = top_markers, colGraphName = "knn10")

As a reference, we show Moran's I for the top marker genes, which is the slope of the line fitted to the Moran scatter plot.

In [None]:
top_markers_df[top_markers,]

There is no significant marker gene for cluster 7. The Moran plots for the top marker genes are shown in sequence:

In [None]:
plts <- lapply(top_markers, moranPlot, sfe = sfe, color_by = "cluster", 
               swap_rownames = "Symbol")

In [None]:
wrap_plots(plts, widths = 1, heights = 1) +
    plot_layout(ncol = 3, guides = "collect") +
    plot_annotation(tag_levels = "1")

For some genes, the points are so concentrated around the origin that there aren't "enough" points elsewhere to plot the density contours. But for the cells that do express these genes, there are clusters on this plot. Some genes are not expressed in many cells, but those cells have neighbors that do express the gene, hence the vertical streak at x = 0. 

In this tutorial, we applied univariate spatial statistics to the k nearest neighbor graph in the gene expression PCA space rather than the histological space. Just like in histological space, it would be impractical to examine all these statistics gene by gene, so multivariate analyses that incorporate the k nearest neighbor graph may be interesting.

# Session info

In [None]:
sessionInfo()