In [None]:
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In [None]:
# Install Google Colab dependencies
# Note: this can take 30+ minutes (many of the dependencies include C++ code, which needs to be compiled)

# First install `sf`, `ragg` and `textshaping` and their system dependencies:
system("apt-get -y update && apt-get install -y  libudunits2-dev libgdal-dev libgeos-dev libproj-dev libharfbuzz-dev libfribidi-dev")
install.packages("sf")
install.packages("textshaping")
install.packages("ragg")

# Install system dependencies of some other R packages that Voyager either imports or suggests:
system("apt-get install -y libfribidi-dev libcairo2-dev libmagick++-dev")

# Install Voyager from Bioconductor:
install.packages("BiocManager")
BiocManager::install(version = "release", ask = FALSE, update = FALSE, Ncpus = 2)
BiocManager::install("scater")
system.time(
  BiocManager::install("Voyager", dependencies = TRUE, Ncpus = 2, update = FALSE)
)

# Other packages used in this vignette
packageVersion("Voyager")

# Introduction
Because single cell and spatial transcriptomics quantify a large number of genes, dimension reduction is part of the standard workflow for analyzing such data. It is used to visualize and help interpret the data, to distill relevant information and reduce noise, to facilitate downstream analyses such as clustering and pseudotime inference, and to project different samples into a shared latent space for data integration, among other purposes.

The first dimension reduction methods we learn about, such as good old principal component analysis (PCA), tSNE, and UMAP, don't use spatial information. With the rise of spatial transcriptomics, dimension reduction methods that take spatial dependence into account have been developed. Some, such as `SpatialPCA` [@Shang2022-wq], `NSF` [@Townes2023-bi], and `MEFISTO` [@Velten2022-gv], use factor analysis or probabilistic PCA, which is related to factor analysis, and model the factors as Gaussian processes with a spatial kernel for the covariance matrix. The factors therefore have positive spatial autocorrelation and can be used for downstream clustering, where the clusters can be more spatially coherent. Others, such as `conST` [@Zong2022-tb] and `SpaceFlow` [@Ren2022-xt], use graph convolution networks on a spatial neighborhood graph to find spatially informed embeddings of the cells. `SpaSRL` [@Zhang2023-mm] finds a low dimension projection of spatial neighborhood augmented data.

Spatially informed dimension reduction is not new. It dates back to at least 1985, with Wartenberg's crossover of Moran's I and PCA [@Wartenberg1985-re], which was generalized and further developed as MULTISPATI PCA [@Dray2008-cg], implemented in the [`adespatial`](https://cran.r-project.org/web/packages/adespatial/index.html) package on CRAN. In short, while PCA tries to maximize the variance explained by each PC, MULTISPATI maximizes the product of Moran's I and variance explained. Also, while all the eigenvalues from PCA are non-negative because the covariance matrix is positive semidefinite, MULTISPATI can give negative eigenvalues, which represent negative spatial autocorrelation. Negative spatial autocorrelation can be present and interesting, but it is less common than positive spatial autocorrelation and is often masked by the latter [@Griffith2019-sj].

In single cell -omics conventions, let $X$ denote a gene count matrix whose columns are cells or Visium spots and whose rows are genes, with $n$ columns. Let $W$ denote the row normalized $n\times n$ adjacency matrix of the spatial neighborhood graph of the cells or Visium spots, which does not have to be symmetric. MULTISPATI diagonalizes a symmetric matrix

$$
H = \frac 1 {2n} X(W^t+W)X^t
$$

However, the implementation in `adespatial` is more general and can be used for other multivariate analyses in the [duality diagram](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3265363/) paradigm, such as correspondence analysis; the equation above is simplified just for PCA, without having to introduce the duality diagram here.
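
As a minimal sketch of the equation above, the base R code below builds $H$ for a toy dataset and compares its eigenvalues with those of ordinary PCA. Everything in it is an assumption made for illustration: the toy cells lie on a line with their immediate neighbors as the spatial graph, the structured genes are simulated, and the rows of $X$ are centered before use. It does not go through `adespatial` or `Voyager`.

```r
set.seed(1)
n <- 30  # number of toy cells, arranged along a line
p <- 5   # number of toy genes

# Adjacency matrix: each cell's neighbors are the cells right next to it
A <- matrix(0, n, n)
for (i in seq_len(n)) {
    nb <- setdiff(c(i - 1, i + 1), c(0, n + 1))
    A[i, nb] <- 1
}
# Row normalized; not symmetric because the two end cells have one neighbor
W <- A / rowSums(A)

# Genes in rows, cells in columns
X <- matrix(rnorm(p * n), nrow = p)
X[1, ] <- X[1, ] + seq_len(n) / 5                 # smooth gradient
X[2, ] <- X[2, ] + rep(c(-2, 2), length.out = n)  # alternating pattern
X <- X - rowMeans(X)                              # center each gene

# Matrix diagonalized by MULTISPATI, and the covariance matrix used by PCA
H <- X %*% (t(W) + W) %*% t(X) / (2 * n)
S <- X %*% t(X) / n

eig_multispati <- eigen(H, symmetric = TRUE)$values
eig_pca <- eigen(S, symmetric = TRUE)$values

range(eig_multispati)  # spans negative and positive values
range(eig_pca)         # non-negative
```

In this toy example the alternating gene drives the negative end of the spectrum of $H$ and the gradient gene drives the positive end, while the PCA eigenvalues carry no sign information about spatial structure.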

Voyager 1.2.0 (Bioconductor release) has a much faster implementation of MULTISPATI PCA based on [`RSpectra`](https://cran.r-project.org/web/packages/RSpectra/index.html). See benchmark [here](https://lambdamoses.github.io/thevoyages/posts/2023-03-25-multispati-part-2/).

In this vignette, we perform MULTISPATI PCA on the MERFISH mouse liver dataset. See the first vignette using this dataset [here](https://pachterlab.github.io/voyager/articles/vig6_merfish.html).

Here we load the packages used:

In [None]:
library(Voyager)
library(SFEData)
library(SpatialFeatureExperiment)
library(scater)
library(scran)
library(scuttle)
library(ggplot2)
library(stringr)
library(tidyr)
library(tibble)
library(bluster)
library(BiocSingular)
library(BiocParallel)
library(sf)
library(patchwork)
library(spdep)
set.ZeroPolicyOption(TRUE)
theme_set(theme_bw())

In [None]:
(sfe <- VizgenLiverData())

MULTISPATI PCA is one of the multivariate methods introduced in `Voyager` 1.2.0. All multivariate methods in `Voyager` are listed here:

In [None]:
listSFEMethods(variate = "multi")

When calling `calculate*variate()` or `run*variate()`, the `type` (2nd) argument takes either an `SFEMethod` object or a string that matches an entry in the `name` column in the data frame returned by `listSFEMethods()`.

# Quality control
QC was already performed in the [first vignette](https://pachterlab.github.io/voyager/articles/vig6_merfish.html). We do the same QC here, but see the first vignette for more details.

In [None]:
is_blank <- str_detect(rownames(sfe), "^Blank-")
sfe <- addPerCellQCMetrics(sfe, subset = list(blank = is_blank))

In [None]:
get_neg_ctrl_outliers <- function(col, sfe, nmads = 3, log = FALSE) {
    inds <- colData(sfe)$nCounts > 0 & colData(sfe)[[col]] > 0
    df <- colData(sfe)[inds,]
    outlier_inds <- isOutlier(df[[col]], type = "higher", nmads = nmads, log = log)
    outliers <- rownames(df)[outlier_inds]
    col2 <- str_remove(col, "^subsets_")
    col2 <- str_remove(col2, "_percent$")
    new_colname <- paste("is", col2, "outlier", sep = "_")
    colData(sfe)[[new_colname]] <- colnames(sfe) %in% outliers
    sfe
}

In [None]:
sfe <- get_neg_ctrl_outliers("subsets_blank_percent", sfe, log = TRUE)

Remove the outliers and empty cells:

In [None]:
inds <- !sfe$is_blank_outlier & sfe$nCounts > 0
(sfe <- sfe[, inds])

There still are over 390,000 cells left after removing the outliers. 

Next we compute Moran's I for QC metrics, which requires a spatial neighborhood graph:

In [None]:
system.time(
    colGraph(sfe, "knn5") <- findSpatialNeighbors(sfe, method = "knearneigh", 
                                                  dist_type = "idw", k = 5, 
                                                  style = "W", zero.policy = TRUE)
)

In [None]:
features_use <- c("nCounts", "nGenes", "volume")
sfe <- colDataUnivariate(sfe, "moran.mc", features_use, 
                         colGraphName = "knn5", nsim = 49, 
                         BPPARAM = MulticoreParam(2))

In [None]:
plotMoranMC(sfe, features_use)

Here Moran's I is slightly negative. The permutation test indicates that it is significant, though that may also be due to the large number of cells. The lower bound of Moran's I given the spatial neighborhood graph is usually closer to -0.5 than -1, while the upper bound is usually around 1. The bounds for a specific spatial neighborhood graph can be found with `moranBounds()`, but because it double centers the adjacency matrix, which makes the matrix dense, there isn't enough memory to use it on the entire dataset. We can instead look at the Moran bounds of a small subset of the data, which might generalize to the whole dataset given that this tissue appears quite homogeneous in space.
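
For intuition about why double centering is costly, here is a minimal base R sketch on a toy 5 x 5 grid. It assumes rook neighbors and a row normalized weights matrix, and it takes the bounds as the extreme eigenvalues of the double centered, symmetrized weights matrix scaled by $n$ over the sum of the weights. This illustrates the idea described above and is not meant to reproduce the internals of `moranBounds()`.

```r
# Toy 5 x 5 grid of cells with rook (edge sharing) neighbors
coords <- expand.grid(x = 1:5, y = 1:5)
n <- nrow(coords)
D <- as.matrix(dist(coords))
A <- (D > 0 & D < 1.1) * 1
W <- A / rowSums(A)  # row normalized weights, mostly zeros

# Double centering: M %*% W_sym %*% M, where M = I - 11^t / n
M <- diag(n) - matrix(1 / n, n, n)
W_sym <- (W + t(W)) / 2
C <- M %*% W_sym %*% M

# Extreme eigenvalues, scaled by n / (sum of weights), bound Moran's I
ev <- eigen(C, symmetric = TRUE, only.values = TRUE)$values
bounds <- c(Imin = min(ev), Imax = max(ev)) * n / sum(W)
bounds

# Fraction of (numerically) zero entries before and after double centering
c(before = mean(W == 0), after = mean(abs(C) < 1e-12))
```

The weights matrix is sparse, but the centered matrix has almost no zero entries. For a dataset with hundreds of thousands of cells, a dense $n \times n$ matrix no longer fits in memory, which is why a subset is used here.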

In [None]:
bbox_use <- c(xmin = 6000, xmax = 7000, ymin = 4000, ymax = 5000)
inds2 <- st_intersects(cellSeg(sfe), st_as_sfc(st_bbox(bbox_use)), 
                       sparse = FALSE)[,1]
sfe_sub <- sfe[, inds2]

Note that since SpatialFeatureExperiment v1.8, the spatial graphs are subsetted rather than reconstructed when the SFE object is subsetted, because reconstruction tends to be more time consuming and the `BPPARAM` and `BNPARAM` arguments can't be stored and saved by `alabaster.sfe`. Subsetting removes some of the neighbors of cells near the boundary of this bounding box, but the vast majority of cells still have all 5 nearest neighbors.

In [None]:
table(card(colGraph(sfe_sub, "knn5")$neighbours))

In [None]:
(mb <- moranBounds(colGraph(sfe_sub, "knn5")))

The lower bound is quite different from the one obtained when we reconstruct the k nearest neighbor graph on the subset:

In [None]:
colGraph(sfe_sub, "knn5_reconst") <- findSpatialNeighbors(sfe_sub, method = "knearneigh", 
                                                          k = 5L, dist_type = "idw",
                                                          style = "W")

In [None]:
(mb2 <- moranBounds(colGraph(sfe_sub, "knn5_reconst")))

It would be cool to systematically investigate the effects of perturbations to the spatial neighborhood graph on Moran's I and other spatial statistics.

So considering the bounds, the Moran's I values of the QC metrics are more like

In [None]:
setNames(colFeatureData(sfe)[c("nCounts", "nGenes", "volume"), 
                             "moran.mc_statistic_sample01"] / mb2["Imin"],
         features_use)

whose magnitudes for `nCounts` and `nGenes` would seem more substantial if they were positive spatial autocorrelation. So there may be mild to moderate negative spatial autocorrelation.

In [None]:
# Normalize data
sfe <- logNormCounts(sfe)

# Hepatic zonation

This dataset comes from a relatively large piece of tissue and we need to zoom into a smaller region to better see the local structures. Here we specify a bounding box.

In [None]:
bbox_use <- c(xmin = 6100, xmax = 7100, ymin = 7500, ymax = 8500)

A portal triad is shown near the top right of this bounding box. The two large vessels on the left and bottom right are central veins. The portal triad consists of the hepatic artery, the portal vein, which brings blood from the intestine, and the bile duct, so the region around it is more oxygenated. The regions around the central veins are more deoxygenated. The different oxygen and nutrient contents mean that hepatocytes play different metabolic roles in the zones between the portal triad and the central vein. Here we plot some zonation marker genes from [@Halpern2017-ey].

In [None]:
markers <- c("Axin2", "Cyp1a2", "Gstm3", "Psmd4", # Pericentral
             "Cyp2e1", "Asl", "Alb", "Ass1", # Monotonic but has intermediate
             "Hamp", "Igfbp2", "Cyp8b1", "Mup3", # Non-monotonic
             "Arg1", "Pck1", "C2", "Sdhd") # Periportal

In [None]:
(inds <- which(markers %in% rownames(sfe)))

Only 3 of these marker genes are present in this dataset. The first two are pericentral (near the central vein), and the last one is periportal (near the portal triad).

In [None]:
plotSpatialFeature(sfe, markers[inds], colGeometryName = "cellSeg",
                   ncol = 3, bbox = bbox_use)

Besides hepatocytes, the liver also has many endothelial cells and Kupffer cells (macrophages). Marker genes of these cells from [@Bonnardel2019-sq] are plotted to visualize these cell types in space:

In [None]:
# Kupffer cells
kc_genes <- c("Timd4", "Vsig4", "Clec4f", "Clec1b", "Il18bp", "C6", "Irf7",
              "Slc40a1", "Cdh5", "Nr1h3", "Dmpk", "Paqr9", "Pcolce2", "Kcna2",
              "Gbp8", "Iigp1", "Helz2", "Cd207", "Icos", "Adcy4", "Slc1a2",
              "Rsad2", "Slc16a9", "Cd209f", "Oasl1", "Fam167a")
which(kc_genes %in% rownames(sfe))

Only one of the Kupffer cell markers is available in this dataset.

In [None]:
plotSpatialFeature(sfe, kc_genes[9], colGeometryName = "cellSeg",
                   bbox = bbox_use)

Expression of this gene does not seem very spatially coherent.

In [None]:
# Endothelial cells
lec_genes <- c("Rspo3", "Wnt2", "Wnt9b", "Pcdhgc5", "Ecm1", "Ltbp4", "Efnb2")
(inds_lec <- which(lec_genes %in% rownames(sfe)))

Only 3 of these endothelial cell marker genes are available in this dataset.

In [None]:
plotSpatialFeature(sfe, lec_genes[inds_lec], colGeometryName = "cellSeg",
                   bbox = bbox_use, ncol = 3)

Wnt2 seems to be more pericentral, while Ltbp4 and Efnb2 seem more periportal.

Some of these marker genes will show up in the top PC loadings of non-spatial and spatial PCA.

# Non-spatial PCA
First we run non-spatial PCA, to compare to MULTISPATI.

In [None]:
set.seed(29)
system.time(
    sfe <- runPCA(sfe, ncomponents = 20, subset_row = !is_blank,
                  exprs_values = "logcounts",
                  scale = TRUE, BSPARAM = IrlbaParam())
)
gc()

That's pretty quick for almost 400,000 cells, but there aren't that many genes here. Use the elbow plot to see variance explained by each PC:

In [None]:
ElbowPlot(sfe)

Plot the top gene loadings in each PC:

In [None]:
plotDimLoadings(sfe)

Many of these genes seem to be related to the endothelium. PC1 and PC4 concern the Kupffer cells as well, as the Kupffer cell marker gene Cdh5 has high loading.

Plot the first 4 PCs in space:

In [None]:
spatialReducedDim(sfe, "PCA", 4, colGeometryName = "centroids", scattermore = TRUE,
                  divergent = TRUE, diverge_center = 0)

PC1 and PC4 highlight the major blood vessels, while PC2 and PC3 have less spatial structure. In the CosMX and Xenium datasets on this website, the top PCs have clear spatial structures despite the absence of spatial information in non-spatial PCA, because some cell types occupy clear spatial compartments. That does not seem to be the case in this dataset, except for the blood vessels, even though we have seen above that some genes have strong spatial structures.

While PC2 and PC3 don't seem to have large scale spatial structure, they may have more local spatial structure not obvious from plotting the entire section, so we zoom into a bounding box which shows hepatic zonation.

In [None]:
spatialReducedDim(sfe, "PCA", ncomponents = 4, colGeometryName = "cellSeg",
                  bbox = bbox_use, divergent = TRUE, diverge_center = 0)

There's some spatial structure at a smaller scale, perhaps some negative spatial autocorrelation.

# MULTISPATI PCA

In [None]:
system.time({
    sfe <- runMultivariate(sfe, "multispati", colGraphName = "knn5", nfposi = 20,
                       nfnega = 20)
})

Then plot the most positive and most negative eigenvalues. Note that the eigenvalues here are not variance explained. Instead, they are the product of variance explained and Moran's I. So the most positive eigenvalues correspond to eigenvectors that simultaneously explain more variance and have large positive Moran's I. The most negative eigenvalues correspond to eigenvectors that simultaneously explain more variance and have negative Moran's I.
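
The relationship between the eigenvalues, the variance explained, and Moran's I can be written out in a few lines. This sketch assumes that the rows of $X$ are centered, that every cell has at least one neighbor so the weights in the row normalized $W$ sum to $n$, and that $a$ is a unit length eigenvector of $H$ with eigenvalue $\lambda$. The symbols $a$, $s$, and $\lambda$ are introduced here only for illustration. With cell scores $s = X^t a$, which have mean zero because the rows of $X$ are centered,

$$
\lambda = a^t H a = \frac{1}{2n} s^t (W^t + W) s = \frac{1}{n} s^t W s = \frac{s^t s}{n} \cdot \frac{s^t W s}{s^t s} = \mathrm{Var}(s) \, I(s)
$$

where $\mathrm{Var}(s) = s^t s / n$ is the variance of the scores and $I(s) = s^t W s / (s^t s)$ is Moran's I of the scores under these assumptions. Since the variance is non-negative, the sign of $\lambda$ is the sign of Moran's I.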

In [None]:
ElbowPlot(sfe, nfnega = 20, reduction = "multispati")

Here the positive eigenvalues drop sharply from PC1 to PC4, and there is only one very negative eigenvalue which might be interesting, which is unsurprising given the moderately negative Moran's I for `nCounts` and `nGenes`. However, from the [first MERFISH vignette](https://pachterlab.github.io/voyager/articles/vig6_merfish.html#morans-i), none of the genes have very negative Moran's I. Perhaps the negative eigenvalue comes from negative spatial autocorrelation in a gene program or "eigengene" and is not obvious from individual genes. This is the beauty of multivariate analysis.

What do these components mean? Each component is a linear combination of genes to maximize the product of variance explained and Moran's I. The second component maximizes this product provided that it's orthogonal to the first component, and so on. As the loss in variance explained is usually not huge, these components can be considered axes along which _spatially coherent_ groups of spots are separated from each other as much as possible according to expression of the highly variable genes, so in theory, clustering with positive MULTISPATI components should give more spatially coherent clusters. Because of the spatial coherence, MULTISPATI might be more robust to outliers.

In [None]:
plotDimLoadings(sfe, dims = c(1:3, 40), reduction = "multispati")

From gene loadings, PC40 seems to separate endothelial cells and Kupffer cells from hepatocytes.

Plot these PCs:

In [None]:
spatialReducedDim(sfe, "multispati", components = c(1:3, 40), 
                  colGeometryName = "cellSeg", bbox = bbox_use,
                  divergent = TRUE, diverge_center = 0)

The first two PCs pick up zonation. PC3 seems to have smaller scale spatial structure. The component labeled "PC40" is the one with the most negative eigenvalue (its index should really be 300 something), and it is an example of negative spatial autocorrelation in biology. That Kupffer cells and endothelial cells are scattered among hepatocytes may play a functional role. This does not mean that non-spatial PCA is bad. While MULTISPATI tends not to lose too much variance explained per PC with positive eigenvalues, it identifies co-expressed genes with spatially structured expression patterns. MULTISPATI tells a different story from non-spatial PCA. PCA cell embeddings are often used for downstream analysis. Whether to use MULTISPATI embeddings instead, and which or how many PCs to use, depends on the questions asked in the further downstream analyses.

# Spatial autocorrelation of principal components
## Moran's I
Here we compare Moran's I for cell embeddings in each non-spatial and MULTISPATI PC:

In [None]:
# non-spatial
sfe <- reducedDimMoransI(sfe, dimred = "PCA", components = 1:20,
                         BPPARAM = MulticoreParam(2))
# spatial
sfe <- reducedDimMoransI(sfe, dimred = "multispati", components = 1:40,
                         BPPARAM = MulticoreParam(2))

In [None]:
df_moran <- tibble(PCA = reducedDimFeatureData(sfe, "PCA")$moran_sample01[1:20],
                   MULTISPATI_pos = 
                       reducedDimFeatureData(sfe, "multispati")$moran_sample01[1:20],
                   MULTISPATI_neg = 
                       reducedDimFeatureData(sfe,"multispati")$moran_sample01[21:40] |> 
                       rev(),
                   index = 1:20)

In [None]:
data("ditto_colors")

In [None]:
df_moran |> 
    pivot_longer(cols = -index, values_to = "value", names_to = "name") |> 
    ggplot(aes(index, value, color = name)) +
    geom_line() +
    scale_color_manual(values = ditto_colors) +
    geom_hline(yintercept = 0, color = "gray") +
    geom_hline(yintercept = mb2, linetype = 2, color = "gray") +
    scale_y_continuous(breaks = scales::breaks_pretty()) +
    scale_x_continuous(breaks = scales::breaks_width(5)) +
    labs(y = "Moran's I", color = "Type", x = "Component")

In MULTISPATI, Moran's I is high in PC1 and PC2 but then drops sharply. Moran's I for the PC with the most negative eigenvalue is not very negative, which means the large magnitude of that eigenvalue comes mostly from the variance explained. However, considering that the lower bound of Moran's I is around -0.6 rather than -1, the magnitude of Moran's I for this PC is not trivial.

In [None]:
min(df_moran$MULTISPATI_neg) / mb2[1]

Non-spatial PCs are not sorted by Moran's I; PC5 has a surprisingly large Moran's I.

In [None]:
spatialReducedDim(sfe, "PCA", components = 5, colGeometryName = "cellSeg", 
                  divergent = TRUE, diverge_center = 0, bbox = bbox_use)

PC5 must be about zonation. Also show a larger scale:

In [None]:
spatialReducedDim(sfe, "PCA", components = 5, colGeometryName = "centroids", 
                  divergent = TRUE, diverge_center = 0, scattermore = TRUE)

## Moran scatter plot
Local positive and negative spatial autocorrelation can average out in global Moran's I. From the zoomed in plots and gene loadings above, some PCs are about endothelial cells. The Moran scatter plot can help discover more local heterogeneity.
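
As a minimal base R sketch of what a Moran scatter plot shows, the code below plots a toy feature against its spatial lag, which is the average value among each cell's neighbors. The grid, the rook neighbors, and the simulated feature are all assumptions for illustration. With a row normalized weights matrix, the slope of the fitted line equals Moran's I.

```r
set.seed(3)
# Toy 10 x 10 grid of cells with rook neighbors, row normalized
coords <- expand.grid(x = 1:10, y = 1:10)
n <- nrow(coords)
D <- as.matrix(dist(coords))
W <- (D > 0 & D < 1.1) * 1
W <- W / rowSums(W)

# Toy feature: a left-to-right gradient plus noise
z <- coords$x / 3 + rnorm(n)
lag_z <- as.vector(W %*% z)  # average value among each cell's neighbors

# Moran scatter plot: feature value vs. its spatial lag
plot(z, lag_z, xlab = "Value", ylab = "Spatially lagged value")
fit <- lm(lag_z ~ z)
abline(fit, col = "blue")

# Compare the slope with Moran's I computed directly
zc <- z - mean(z)
moran_I <- sum(zc * (W %*% zc)) / sum(zc^2)
c(slope = unname(coef(fit)[2]), moran_I = moran_I)
```

Points far from the fitted line, or separate clouds of points, indicate cells whose local spatial autocorrelation differs from the global trend.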

In [None]:
sfe <- reducedDimUnivariate(sfe, "moran.plot", dimred = "PCA", components = 1:6)

In [None]:
plts <- lapply(seq_len(6), function(i) {
    moranPlot(sfe, paste0("PC", i), binned = TRUE, hex = TRUE, plot_influential = FALSE)
})

In [None]:
wrap_plots(plts, widths = 1, heights = 1) +
    plot_layout(ncol = 3) +
    plot_annotation(tag_levels = "1", 
                    title = "Moran scatter plot for non-spatial PCs") &
    theme(legend.position = "none")

PCs 1-3 have some fainter clusters outside the main cluster, indicating heterogeneous spatial autocorrelation. Also make Moran scatter plots for MULTISPATI:

In [None]:
sfe <- reducedDimUnivariate(sfe, "moran.plot", dimred = "multispati", 
                            components = c(1:5, 40), 
                            # Not to overwrite non-spatial PCA moran plots
                            name = "moran.plot2") 

In [None]:
plts2 <- lapply(c(1:5, 40), function(i) {
    moranPlot(sfe, paste0("PC", i), binned = TRUE, hex = TRUE, 
              plot_influential = FALSE, name = "moran.plot2")
})

In [None]:
wrap_plots(plts2, widths = 1, heights = 1) +
    plot_layout(ncol = 3) +
    plot_annotation(tag_levels = "1",
                    title = "Moran scatter plot for MULTISPATI PCs") &
    theme(legend.position = "none")

There are some interesting clusters.

# Clustering with MULTISPATI PCA

In the standard scRNA-seq data analysis workflow, a k nearest neighbor graph is found in the PCA space and then used for graph based clustering such as Louvain and Leiden; the resulting clusters are then used for differential expression analysis. Spatial dimension reductions can similarly be used for clustering to identify spatial regions in the tissue, as done in [@Shang2022-wq; @Ren2022-xt; @Zhang2023-mm]. Studies of this type often use a manual segmentation as ground truth to compare different methods that identify spatial regions.

The problem with this is that spatial region methods are meant to help us identify novel spatial regions based on new -omics data, which might reveal what was previously unknown from manual annotations. If the output from a method doesn't match manual annotations, it might simply be pointing out a previously unknown aspect of the tissue rather than being wrong. Depending on the questions being asked, there can simultaneously be multiple spatial partitions. This happens in geographical space. For instance, there are land use and neighborhood boundaries, but equally valid are watershed boundaries and types of rock formation. Which one is relevant depends on the questions asked.

Here we perform Leiden clustering with non-spatial and MULTISPATI PCA and compare the results. For the k nearest neighbor graph, we use the default k = 10.

In [None]:
system.time({
    set.seed(29)
    sfe$clusts_nonspatial <- clusterCells(sfe, use.dimred = "PCA", 
                                          BLUSPARAM = NNGraphParam(
                                              cluster.fun = "leiden",
                                              cluster.args = list(
                                                  objective_function = "modularity",
                                                  resolution_parameter = 1
                                              )
                                          ))
})

See if clustering with the positive MULTISPATI PCs gives more spatially coherent clusters:

In [None]:
system.time({
    set.seed(29)
    sfe$clusts_multispati <- clusterRows(reducedDim(sfe, "multispati")[,1:20], 
                                          BLUSPARAM = NNGraphParam(
                                              cluster.fun = "leiden",
                                              cluster.args = list(
                                                  objective_function = "modularity",
                                                  resolution_parameter = 1
                                              )
                                          ))
})

Plot the clusters in space:

In [None]:
plotSpatialFeature(sfe, c("clusts_nonspatial", "clusts_multispati"), 
                   colGeometryName = "centroids",
                   scattermore = TRUE) &
     guides(colour = guide_legend(override.aes = list(size=2), ncol = 2))

The MULTISPATI clusters do look somewhat more spatially structured than clusters from non-spatial PCA. Also zoom into a small area:

In [None]:
plotSpatialFeature(sfe, c("clusts_nonspatial", "clusts_multispati"), 
                   colGeometryName = "cellSeg",
                   bbox = bbox_use) &
     guides(fill = guide_legend(ncol = 2))
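
The visual impression of spatial coherence could also be summarized with a number. The base R sketch below shows one hypothetical way to do so on toy data: for each cell, compute the fraction of its spatial neighbors that share its cluster label, then average over cells. The helper `neighbor_agreement()`, the grid, and the labels are all made up for illustration and are not part of `Voyager`.

```r
# Hypothetical helper: for each cell, the fraction of its spatial neighbors
# that share its cluster label, averaged over all cells
neighbor_agreement <- function(labels, nb_list) {
    per_cell <- vapply(seq_along(nb_list), function(i) {
        mean(labels[nb_list[[i]]] == labels[i])
    }, numeric(1))
    mean(per_cell)
}

set.seed(4)
# Toy 10 x 10 grid of cells with rook neighbors stored as a list of indices
coords <- expand.grid(x = 1:10, y = 1:10)
D <- as.matrix(dist(coords))
nb_list <- lapply(seq_len(nrow(coords)), function(i) {
    unname(which(D[i, ] > 0 & D[i, ] < 1.1))
})

labels_coherent <- ifelse(coords$x <= 5, "a", "b")  # two contiguous regions
labels_scattered <- sample(labels_coherent)         # same labels, shuffled in space

c(coherent = neighbor_agreement(labels_coherent, nb_list),
  scattered = neighbor_agreement(labels_scattered, nb_list))
```

On real data, the same idea might be applied with the neighbor list from `colGraph(sfe, "knn5")$neighbours` and the two cluster columns as labels, assuming every cell has at least one neighbor. That usage is untested here.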

What do these clusters mean? Clusters are supposed to be groups of cells that are more similar to each other within a group, sharing some characteristics. Non-spatial and MULTISPATI PCA use different characteristics for the clustering. Non-spatial PCA finds genes that are good for telling cell types apart, although those genes may happen to be very spatially structured. Non-spatial clustering aims to find these groups only from gene expression, and cells with similar gene expression can be surrounded by cells of other types in histological space. This is just like mapping Art Deco buildings, which are often near Spanish revival and Beaux-Arts buildings whose styles are quite different and which perform different functions, thus not necessarily forming a coherent spatial region.

In contrast, MULTISPATI's positive components find genes that must characterize spatial regions in addition to distinguishing between different cell types. Which genes are involved in each MULTISPATI component may be as interesting as the clusters. It would be interesting to perform gene set enrichment analysis, or to interpret this as some sort of spatial pattern of spatially variable genes. This is like mapping when the buildings were built, so Art Deco, Spanish revival, and Beaux-Arts buildings, popular in the 1920s and 1930s, will end up in the same cluster and form a more spatially coherent region, which can be found in DTLA Historical Core and Jewelry District, and Old Pasadena. Hence non-spatial clustering of spatial data isn't necessarily bad. Rather, it tells a different story and reveals different aspects of the data from spatial clustering.

# Session Info

In [None]:
sessionInfo()

# References