In [None]:
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In [None]:
# Install Google Colab dependencies
# Note: this can take 30+ minutes (many of the dependencies include C++ code, which needs to be compiled)

# First install `sf`, `ragg` and `textshaping` and their system dependencies:
system("apt-get -y update && apt-get install -y  libudunits2-dev libgdal-dev libgeos-dev libproj-dev libharfbuzz-dev libfribidi-dev")
install.packages("sf")
install.packages("textshaping")
install.packages("ragg")

# Install system dependencies of some other R packages that Voyager either imports or suggests:
system("apt-get install -y libfribidi-dev libcairo2-dev libmagick++-dev")

# Install Voyager from Bioconductor:
install.packages("BiocManager")
BiocManager::install(version = "release", ask = FALSE, update = FALSE, Ncpus = 2)
BiocManager::install("scater")
system.time(
  BiocManager::install("Voyager", dependencies = TRUE, Ncpus = 2, update = FALSE)
)

# Check which version of Voyager was installed
packageVersion("Voyager")

# Introduction
Local Geary's C [@Anselin1995-wg] is defined as:

$$
c_i = \sum_jw_{ij}(x_i - x_j)^2,
$$

where $w_{ij}$ is the spatial weight from location $i$ to location $j$ and $x_i$ is the value of a variable at location $i$. This is generalized to multiple variables in [@Anselin2019-ra]:

$$
c_{k,i} = \sum_{v=1}^k c_{v,i},
$$

where there are $k$ variables and $c_{v,i}$ is the univariate local Geary's C of variable $v$ at location $i$. This is essentially a spatially weighted sum of squared distances between locations in feature space. This vignette demonstrates usage of multivariate local Geary's C.
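
To make the two formulas concrete, here is a minimal base R sketch on a toy dataset. The weights matrix `W`, the data matrix `X`, and the helper names `local_geary_uni()` and `local_geary_multi()` are made up for illustration and are not part of Voyager or `spdep`. The sketch follows the formulas exactly as written above; actual implementations may standardize the variables or rescale the sum, so the numbers here are not expected to match package output.

```r
# Toy example: 4 locations on a line, where adjacent locations are neighbors
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W[cbind(2:4, 1:3)] <- 1
W <- W / rowSums(W)  # row-standardize the weights

# Two variables (columns) measured at the 4 locations (rows)
X <- cbind(a = c(1, 1, 5, 5), b = c(0, 0, 0, 2))

# Univariate local Geary's C: c_i = sum_j w_ij * (x_i - x_j)^2
local_geary_uni <- function(x, W) {
  d2 <- outer(x, x, function(xi, xj) (xi - xj)^2)
  rowSums(W * d2)
}

# Multivariate version: sum the univariate values over all variables
local_geary_multi <- function(X, W) {
  rowSums(apply(X, 2, local_geary_uni, W = W))
}

c_multi <- local_geary_multi(X, W)

# Same quantity written as a weighted sum of squared Euclidean distances
# between locations in feature space
c_dist <- unname(rowSums(W * as.matrix(dist(X))^2))
all.equal(c_multi, c_dist)
```

The last line checks that summing the univariate statistics over variables is the same as weighting squared Euclidean distances in feature space, which is the interpretation used in the rest of this vignette.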

Here we load the packages used:

In [None]:
library(Voyager)
library(SFEData)
library(SpatialFeatureExperiment)
library(scater)
library(scran)
library(ggplot2)
library(spdep)
theme_set(theme_bw())

QC was performed in [another vignette](https://pachterlab.github.io/voyager/articles/vig1_visium_basic.html), so this vignette will not plot QC metrics.

In [None]:
(sfe <- McKellarMuscleData("full"))

The image can be added to the SFE object and plotted behind the geometries. It needs to be flipped to align with the spots because the origin is at the top left for the image but at the bottom left for the geometries.

In [None]:
if (!file.exists("tissue_lowres_5a.jpeg")) {
    download.file("https://raw.githubusercontent.com/pachterlab/voyager/main/vignettes/tissue_lowres_5a.jpeg",
                  destfile = "tissue_lowres_5a.jpeg")
}

In [None]:
sfe <- addImg(sfe, imageSource = "tissue_lowres_5a.jpeg", sample_id = "Vis5A", 
              image_id = "lowres", 
              scale_fct = 1024/22208)

In [None]:
sfe_tissue <- sfe[,colData(sfe)$in_tissue]
sfe_tissue <- sfe_tissue[rowSums(counts(sfe_tissue)) > 0,]

In [None]:
sfe_tissue <- logNormCounts(sfe_tissue)

In [None]:
colGraph(sfe_tissue, "visium") <- findVisiumGraph(sfe_tissue)

# Gene expression
Here we compute multivariate local C for the top highly variable genes (HVGs) in this dataset:

In [None]:
hvgs <- getTopHVGs(sfe_tissue, fdr.threshold = 0.01)

In [None]:
sfe_tissue <- runMultivariate(sfe_tissue, "localC_perm_multi", subset_row = hvgs)

The results are stored in `reducedDim`, although this is not really a dimension reduction. They can also go into `colData` if `dest = "colData"`. The test is two-sided, but the `alternative` argument can be set to `"greater"` to test only for positive spatial autocorrelation or to `"less"` to test only for negative spatial autocorrelation.

In [None]:
names(reducedDim(sfe_tissue, "localC_perm_multi"))

In [None]:
spatialReducedDim(sfe_tissue, "localC_perm_multi", c(1, 12),
                  image_id = "lowres", maxcell = 5e4)

In Geary's C, a value below 1 indicates positive spatial autocorrelation and a value above 1 indicates negative spatial autocorrelation. Local Geary's C is not scaled, but from the squared difference expression, a low value means a more homogeneous neighborhood and a high value means a more heterogeneous neighborhood. Here, considering all 341 top HVGs, the muscle tendon junction and the injury site are more heterogeneous, which is detected as negative clusters.
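
As a rough illustration of the "below 1" and "above 1" rule for the global statistic, the base R sketch below computes global Geary's C with its usual normalization on two toy variables. The helper `geary_c()` and the 6-location line graph are assumptions made for this example only; they are not Voyager functions or data from this vignette.

```r
# Global Geary's C with the usual normalization:
# C = (n - 1) * sum_ij w_ij (x_i - x_j)^2 / (2 * S0 * sum_i (x_i - mean(x))^2)
geary_c <- function(x, W) {
  n <- length(x)
  num <- (n - 1) * sum(W * outer(x, x, "-")^2)
  den <- 2 * sum(W) * sum((x - mean(x))^2)
  num / den
}

# 6 locations on a line, where adjacent locations are neighbors (binary weights)
n <- 6
W <- matrix(0, n, n)
W[cbind(1:(n - 1), 2:n)] <- 1
W[cbind(2:n, 1:(n - 1))] <- 1

# A smooth gradient: neighbors are similar, so C falls below 1
c_smooth <- geary_c(1:6, W)

# An alternating pattern: neighbors are dissimilar, so C rises above 1
c_alt <- geary_c(c(0, 1, 0, 1, 0, 1), W)

c(smooth = c_smooth, alternating = c_alt)
```

The local statistic drops this normalization, which is why its values are read relative to each other (low versus high) rather than against 1.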

Permutation testing was performed, although Anselin noted that the pseudo-p-values should only be taken as indicative of _interesting_ regions and should not be interpreted in a strict sense.
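
To give a feel for where such pseudo-p-values come from, here is a simplified base R sketch of a conditional permutation test. It is a toy version under several assumptions: it is univariate, one-sided (it only asks whether the observed value is unusually low), and uses made-up data and a hypothetical helper `perm_p_low()`. It is not how Voyager or `spdep` implement the test.

```r
set.seed(1)

# Toy data: 30 locations on a line, where adjacent locations are neighbors
n <- 30
W <- matrix(0, n, n)
W[cbind(1:(n - 1), 2:n)] <- 1
W[cbind(2:n, 1:(n - 1))] <- 1
W <- W / rowSums(W)  # row-standardize the weights

# A smooth block (locations 1-15) followed by a noisy block (16-30)
x <- c(rnorm(15, sd = 0.1), rnorm(15, sd = 2))

# Conditional permutation for location i: keep x[i] fixed, shuffle all other
# values among the remaining locations, and recompute local Geary's C each time
perm_p_low <- function(i, x, W, nsim = 499) {
  obs <- sum(W[i, ] * (x[i] - x)^2)
  others <- setdiff(seq_along(x), i)
  sims <- replicate(nsim, {
    x_perm <- x
    x_perm[others] <- sample(x[others])
    sum(W[i, ] * (x[i] - x_perm)^2)
  })
  # Pseudo-p-value: share of permutations at least as low as the observed value
  (sum(sims <= obs) + 1) / (nsim + 1)
}

p_low <- vapply(seq_len(n), function(i) perm_p_low(i, x, W), numeric(1))
round(p_low, 3)
```

Because each pseudo-p-value is a rank among a finite number of permutations, its smallest possible value is `1 / (nsim + 1)`, which is one reason to read these values as indicative rather than exact.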

In [None]:
spatialReducedDim(sfe_tissue, "localC_perm_multi", c(11, 12),
                  image_id = "lowres", maxcell = 5e4, 
                  divergent = TRUE, diverge_center = -log10(0.05))

Warm colors indicate adjusted p < 0.05. This should be interpreted along with the clusters. In this dataset, there are interestingly homogeneous regions in the myofibers and an interestingly heterogeneous region in the injury site. Most of the significant regions are positive clusters, but the center of the injury site is significant and is a negative cluster.
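
The role of `diverge_center = -log10(0.05)` can be checked with a few made-up p-values in base R. The vector `p_raw` is invented for illustration, and the sketch assumes Benjamini-Hochberg adjustment, as in the `p.adjust.method = "BH"` argument used later in this vignette.

```r
# Hypothetical raw pseudo-p-values for 6 locations
p_raw <- c(0.001, 0.004, 0.02, 0.03, 0.2, 0.6)

# Adjust for multiple testing, then transform to the -log10 scale used for plotting
p_adj <- p.adjust(p_raw, method = "BH")
neg_log10_p <- -log10(p_adj)

# The center of the divergent palette sits at adjusted p = 0.05
center <- -log10(0.05)

# Values above the center correspond exactly to adjusted p < 0.05
significant <- neg_log10_p > center
data.frame(p_raw, p_adj, neg_log10_p, significant)
```

Since `-log10()` is decreasing, values above the center are the same locations as those with adjusted p below 0.05, so the color break in the plot lines up with the significance threshold.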

# Top principal components
Because multivariate local Geary's C is a spatially weighted sum of squared distances between locations in feature space, it is affected by the curse of dimensionality when used on a large number of features: uniformly distributed data points become more nearly equidistant from each other as the number of dimensions increases. However, real data is not uniformly distributed and can have a much smaller effective dimension than the number of features, as many genes are co-regulated. Anselin suggested using the main principal components, but the issue of the curse of dimensionality remains to be further investigated. Furthermore, as the cosine and Manhattan distances have been suggested to mitigate the curse of dimensionality, I wonder what would happen if they were used instead of the Euclidean distance in feature space for multivariate local Geary's C.
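
The distance concentration described above can be seen in a small base R simulation. This is only a sketch on simulated uniform data, not on the gene expression data in this vignette. The dimensions 2, 20, and 300 are chosen to loosely mirror a low-dimensional case, the 20 PCs, and the number of HVGs, and the summary used (standard deviation of pairwise distances divided by their mean) is one of several possible ways to quantify it.

```r
set.seed(42)

n_points <- 200
dims <- c(2, 20, 300)

# For each dimension, draw uniform points and measure how much the pairwise
# Euclidean distances vary relative to their mean
rel_spread <- vapply(dims, function(d) {
  X <- matrix(runif(n_points * d), nrow = n_points)
  d_euc <- as.vector(dist(X))
  sd(d_euc) / mean(d_euc)
}, numeric(1))
names(rel_spread) <- dims

rel_spread
```

A smaller relative spread means the points are closer to being equidistant, so the squared distances that enter multivariate local Geary's C carry less contrast between similar and dissimilar neighbors.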

So here we perform multivariate local Geary's C on the top PCs:

In [None]:
sfe_tissue <- runPCA(sfe_tissue, ncomponents = 20, scale = TRUE)

In [None]:
ElbowPlot(sfe_tissue)

What percentage of variance is explained by the top 20 PCs?

In [None]:
sum(attr(reducedDim(sfe_tissue, "PCA"), "percentVar"))
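
For readers unfamiliar with the `percentVar` attribute, the following base R sketch shows one common way such percentages are derived from a PCA, using `prcomp()` on simulated data. The matrix `X` is made up, and how `runPCA()` computes the attribute internally may differ in detail (for example, in which features and how many components enter the total).

```r
set.seed(123)

# Simulated data: 100 observations, 10 features, two of them strongly correlated
X <- matrix(rnorm(100 * 10), nrow = 100)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.1)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Percentage of total variance explained by each PC
percent_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Cumulative percentage explained by the top 3 PCs
sum(percent_var[1:3])
```

Summing the percentages over the retained PCs, as done above for the top 20 PCs, gives the share of variance that the distance calculation in PC space can still see.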

In [None]:
out <- localC_perm(reducedDim(sfe_tissue, "PCA"), 
                   listw = colGraph(sfe_tissue, "visium"))
out <- Voyager:::.localCpermmulti2df(out, 
                                     nb = colGraph(sfe_tissue, "visium")$neighbours,
                                     p.adjust.method = "BH")
reducedDim(sfe_tissue, "localC_PCs", withDimnames = FALSE) <- out

In [None]:
spatialReducedDim(sfe_tissue, "localC_PCs", c(1, 12),
                  image_id = "lowres", maxcell = 5e4)

In [None]:
spatialReducedDim(sfe_tissue, "localC_PCs", c(11, 12),
                  image_id = "lowres", maxcell = 5e4, 
                  divergent = TRUE, diverge_center = -log10(0.05))

The area that seems significant from the permutation test is larger than that from the HVGs, and the area considered negative clusters is smaller. The significant regions are almost all positive clusters. Do the differences in results have anything to do with the curse of dimensionality? Twenty dimensions can still exhibit the curse of dimensionality, but the more than 300 HVGs used here would be worse. Or is it that we lose a lot of information, including negative spatial autocorrelation, by only using 20 PCs?

# Session info

In [None]:
sessionInfo()

# References