In [None]:
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", fig.align = "center"
)

In [None]:
# Install Google Colab dependencies
# Note: this can take 30+ minutes (many of the dependencies include C++ code, which needs to be compiled)

# First install `sf`, `ragg` and `textshaping` and their system dependencies:
system("apt-get -y update && apt-get install -y  libudunits2-dev libgdal-dev libgeos-dev libproj-dev libharfbuzz-dev libfribidi-dev")
install.packages("sf")
install.packages("textshaping")
install.packages("ragg")

# Install system dependencies of some other R packages that Voyager either imports or suggests:
system("apt-get install -y libfribidi-dev libcairo2-dev libmagick++-dev")

# Install Voyager from Bioconductor:
install.packages("BiocManager")
BiocManager::install(version = "release", ask = FALSE, update = FALSE, Ncpus = 2)
BiocManager::install("scater")
system.time(
  BiocManager::install("Voyager", dependencies = TRUE, Ncpus = 2, update = FALSE)
)

packageVersion("Voyager")

# Introduction

[Slide-seq V2](https://doi.org/10.1038/s41587-020-0739-1) is a spatial transcriptomic tool that measures genome-wide expression using DNA-barcoded beads patterned on a slide in a non-regular array. The beads used in the current protocol have a diameter of $10 \mu m$ and are thus larger than a single cell, but the number of detected transcripts is an order of magnitude higher compared to the previous iteration of the technology.

In this vignette, we use `Voyager` to analyze a dataset generated using the Slide-seq V2 technology. The data are described in [Dissecting the treatment-naive ecosystem of human melanoma brain metastasis](https://doi.org/10.1016/j.cell.2022.06.007) [@Biermann2022-hl]. The raw counts and cell metadata are publicly available from GEO. We will focus on one of the human melanoma brain metastasis (MBM) samples, which is provided in the `SFEData` package as a `SpatialFeatureExperiment` (SFE) object. The SFE object contains raw counts, QC metrics such as the number of UMIs and genes detected per barcode, and centroid coordinates for each barcode as an `sf` POINT geometry.

In [None]:
library(Voyager)
library(SFEData)
library(SingleCellExperiment)
library(SpatialExperiment)
library(scater)
library(scran)
library(bluster)
library(ggplot2)
library(patchwork)
library(spdep)
library(BiocParallel)

theme_set(theme_bw())

In [None]:
(sfe <- BiermannMelaMetasData(dataset = "MBM05_rep1"))

The SFE object in the `SFEData` package includes information for 27,566 features and 29,536 beads/barcodes.

# Quality control (QC)

We begin by performing some exploratory data analysis on the barcodes in the tissue. There are some pre-computed QC measures that are stored in the object.

In [None]:
names(colData(sfe))

These are the total UMI counts (`nCounts`), the number of genes detected per barcode (`nGenes`), and the proportion of mitochondrially encoded counts (`prop_mito`). Below, we plot the total number of UMI counts per barcode as a violin plot and in space. For the latter task, we use the function `plotSpatialFeature()`, which uses `geom_sf()` to plot geometries where applicable. The first few lines compute the average of each QC metric across barcodes; the average number of UMI counts is drawn as a red line in the violin plot.

In [None]:
avg <- as.data.frame(colData(sfe)) |>
  dplyr::summarise(across(-sample_id, mean))

violin <- plotColData(sfe, "nCounts") +
  geom_hline(aes(yintercept = nCounts), avg, color="red") +
  theme(legend.position = "top") 

spatial <- plotSpatialFeature(sfe, features = "nCounts",
                               colGeometryName = "centroids", size = 0.2) +
    theme_void()
violin + spatial

Each barcode is represented as an `sf` POINT geometry. In the plot above, we note that many beads have quite low UMI counts, but there are small regions throughout the tissue that appear to have high counts. This is perhaps due to a high cellular density of melanoma cells, but we can only speculate without an image of the tissue. Interestingly, there are no barcodes with zero counts. This is in contrast to many scRNA-seq datasets, where many cells have zero counts.

Given the density of points, we may choose to aggregate points into a hexagonal grid to avoid overplotting. Each hexagon will be colored by the total number of UMI counts in that space and each hexagon may represent more than one barcode. 

In [None]:
# Combine bead coordinates with QC metrics, then sum nCounts within hexagonal bins
as.data.frame(cbind(spatialCoords(sfe), colData(sfe))) |> 
    ggplot(aes(xcoord, ycoord, z = nCounts)) +
    stat_summary_hex(fun = sum, bins = 100) + 
    scale_fill_distiller(palette = "Blues", direction = 1) +
    labs(fill = "nCounts") +
    coord_equal() +
    scale_x_continuous(expand = expansion()) +
    scale_y_continuous(expand = expansion()) +
    theme_void()  # complete theme, so a preceding theme_bw() is not needed

It is worthwhile to note that cell segmentation data were not included with this dataset. Even though Slide-Seq V2 does not profile gene expression at single cell resolution, cell segmentation data can be flexibly stored as `annotGeometries` in the `SFE` object. These geometries can be plotted with barcode-level data and can be used with `sf` for operations like finding the number of barcodes localized to a single cell.
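As a rough illustration of the last point, the sketch below counts how many beads fall inside each cell polygon using `sf` alone. The polygons, bead coordinates, and the helper `square()` are hypothetical toy data made up for this example; they are not part of the dataset, and in practice the polygons would come from a cell segmentation stored in `annotGeometries`.

```r
library(sf)

# Hypothetical helper: an axis-aligned square polygon standing in for a cell
square <- function(x0, y0, size = 2) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + size, y0), c(x0 + size, y0 + size),
    c(x0, y0 + size), c(x0, y0)  # close the ring
  )))
}

# Two toy "cells" and four toy bead centroids
cells <- st_sf(
  cell_id = c("cell1", "cell2"),
  geometry = st_sfc(square(0, 0), square(3, 0))
)
beads <- st_sfc(
  st_point(c(0.5, 0.5)), st_point(c(1.5, 1)),  # inside cell1
  st_point(c(4, 1)),                           # inside cell2
  st_point(c(2.5, 1))                          # between the two cells
)

# st_intersects() returns, for each cell, the indices of intersecting beads
cells$n_beads <- lengths(st_intersects(cells, beads))
cells
```

The same predicate works in the other direction (`st_intersects(beads, cells)`) to look up which cell, if any, each bead belongs to.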

In [None]:
colData(sfe)$log_nCounts <- log(colData(sfe)$nCounts)

avg <- as.data.frame(colData(sfe)) |>
  dplyr::summarise(across(-sample_id, mean))

violin <- plotColData(sfe, "log_nCounts") +
  geom_hline(aes(yintercept = log_nCounts), avg, color="red") +
  theme(legend.position = "top") 

spatial <- plotSpatialFeature(sfe, features = "log_nCounts",
                               colGeometryName = "centroids",
                              size = 0.2)
violin + spatial

The plot above visualizes the number of UMI counts per barcode on a log scale. It appears that barcodes with higher counts are co-localized in regions throughout the tissue; however, these regions are rather small and may not suggest spatial autocorrelation.

Next, we examine the number of genes detected per barcode. Again, this QC feature is provided as `nGenes` in the `colData` attribute for barcodes.

In [None]:
violin <- plotColData(sfe, "nGenes") +
  geom_hline(aes(yintercept = nGenes), avg, color="red") +
  theme(legend.position = "top") 

spatial <- plotSpatialFeature(sfe, features = "nGenes",
                              colGeometryName = "centroids",
                              size = 0.2)
violin + spatial

Similar to the number of UMI counts per barcode, there seem to be small regions with a higher number of detected genes throughout the tissue. These may correspond to regions of cellular diversity or high cellular density, as might be expected in the context of melanoma.

We can compute the degree to which the number of UMI counts per barcode depends on the spatial location of each measurement. This relationship, spatial autocorrelation, can be quantified using Moran's index of spatial autocorrelation, or Moran's *I*. The computation of Moran's *I* requires first a definition of what constitutes objects being "near" to each other. Most simply, this is represented as a spatial weights matrix. One possible representation is an adjacency matrix. This matrix can be computed for polygonal data and the resulting matrix can be binary, where entries are 1 if polygons share a border, and 0 elsewhere (including the diagonal). These entries can be weighted in different ways, including by the length of border shared between two polygons.

This schema does not necessarily lend itself well to spatial transcriptomic technologies, where the polygonal boundaries of cell objects may not correspond to the measurements in the count matrix, or individual spots or barcodes may themselves correspond to multiple neighborhoods of cells. Certainly, the interpretation of spatial weights matrix will change depending on the technology.
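To make the definition of Moran's *I* concrete, here is a minimal base R sketch on toy data. It assumes five locations arranged on a line with a binary adjacency matrix (1 for immediate neighbors, 0 elsewhere, including the diagonal). The function `morans_i()` is a hypothetical helper written for this illustration, not a `Voyager` or `spdep` function.

```r
# Binary adjacency matrix for five locations on a line:
# each location neighbors the one before and after it
n <- 5
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}

# Moran's I: (n / sum of weights) * weighted cross-products of deviations
# from the mean, divided by the sum of squared deviations
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

gradient    <- c(1, 2, 3, 4, 5)  # neighbors have similar values
alternating <- c(1, 5, 1, 5, 1)  # neighbors have dissimilar values

morans_i(gradient, W)     # positive (0.5): positive spatial autocorrelation
morans_i(alternating, W)  # negative (-1): negative spatial autocorrelation
```

Values near zero would indicate no spatial autocorrelation under this choice of weights; a different weights matrix can give a different value for the same data.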

In any case, we can generate a putative spatial graph using the k-nearest neighbors algorithm. This is implemented in the `findSpatialNeighbors()` function with the argument `method = "knearneigh"`. We will store the result in the `colGraphs()` slot of the SFE object.

In [None]:
colGraph(sfe, "knn5") <- findSpatialNeighbors(sfe, method = "knearneigh",
                                              dist_type = "idw", k = 5, 
                                              style = "W")
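The base R sketch below illustrates the general idea behind such a graph on toy coordinates. It assumes that `dist_type = "idw"` denotes inverse distance weighting and that `style = "W"` denotes row standardization, following `spdep` conventions. It is an illustration under those assumptions, not the implementation used by `findSpatialNeighbors()`.

```r
set.seed(1)
coords <- cbind(x = runif(20), y = runif(20))  # toy bead centroids
k <- 3

# Pairwise Euclidean distances; a point is never its own neighbor
d <- as.matrix(dist(coords))
diag(d) <- Inf

W <- matrix(0, nrow(d), ncol(d))
for (i in seq_len(nrow(d))) {
  nb <- order(d[i, ])[seq_len(k)]  # indices of the k nearest neighbors
  W[i, nb] <- 1 / d[i, nb]         # inverse distance weights
}

# Row standardization: the weights of each point's neighbors sum to 1
W <- W / rowSums(W)

range(rowSums(W))
isSymmetric(W)  # k-nearest neighbor graphs are generally not symmetric
```

With row-standardized weights, the spatially lagged value of a feature at a bead is a weighted average of that feature over the bead's neighbors.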

Now we compute Moran's *I* for some barcode QC metrics using `colDataMoransI()`.

In [None]:
features_use <- c("nCounts", "nGenes")
sfe <- colDataMoransI(sfe, features_use, colGraphName = "knn5")

In [None]:
colFeatureData(sfe)[features_use,]

The results above do not substantiate the visual check for spatial autocorrelation. We will continue with investigating other QC metrics.

The proportion of UMIs mapping to mitochondrial genes is a useful metric for assessing cell quality in scRNA-seq data. We will examine this QC metric below by plotting it versus total number of UMI counts for each barcode.

In [None]:

violin <- plotColData(sfe, "prop_mito") +
    geom_hline(aes(yintercept = prop_mito), avg, color="red") +
    theme(legend.position = "top") 

mito <- plotColData(sfe, x = "nCounts", y = "prop_mito")

violin + mito

In keeping with expectations, barcodes with fewer counts tend to have higher proportions of mitochondrial reads. We will exclude barcodes with more than 10% mitochondrial reads from subsequent analysis. The second line then removes genes (rows) with zero total counts across the remaining barcodes. No barcodes need to be removed for having zero counts, as this dataset contains none.

In [None]:
# Subsetting columns reconstructs the spatial neighborhood graph.
# Use drop = TRUE to drop the graph instead of reconstructing it, since its
# indices are no longer valid after subsetting.
sfe_filt <- sfe[, colData(sfe)$prop_mito < 0.1]
sfe_filt <- sfe_filt[rowSums(counts(sfe_filt)) > 0,]

# Data Normalization

Normalization of spatial transcriptomics data is non-trivial and requires thoughtful consideration. As in scRNA-seq data analysis, the goal of normalization is to remove the effects of technical variation and derive a quantity that reflects biological variation. However, several questions arise when considering best practices for spatial data normalization. For example, spatial methods on average detect fewer UMIs than their single-cell counterparts, which may preclude the use of normalization techniques such as log transformation, as shown [here](https://doi.org/10.1093/bioinformatics/btab085). What's more, it is not always evident whether spatial autocorrelation between genes (or QC measures) is an artifact of the technology, and thus whether normalization methods should preserve the spatial autocorrelation architecture. These questions are areas of active research and development and are currently unresolved. In the meantime, we log-normalize the data in the cell below and identify highly variable genes for subsequent analysis.

In [None]:
sfe_filt <- logNormCounts(sfe_filt)

dec <- modelGeneVar(sfe_filt)
hvgs <- getTopHVGs(dec, n = 2000)
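For intuition, the following base R sketch reproduces the arithmetic of library-size log-normalization on a toy count matrix. It assumes the default behavior of `logNormCounts()`: size factors proportional to each barcode's total counts and scaled to a mean of 1, a pseudocount of 1, and a log2 transformation. The toy counts are invented for this illustration.

```r
# Toy count matrix: 3 genes (rows) by 3 beads (columns)
counts <- matrix(
  c(0, 2, 4,
    1, 3, 8,
    3, 1, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("bead", 1:3))
)

# Library size factors: total counts per bead, scaled to have mean 1
lib_size <- colSums(counts)
size_factors <- lib_size / mean(lib_size)

# Divide each column by its size factor, add a pseudocount of 1, take log2
logcounts <- log2(sweep(counts, 2, size_factors, "/") + 1)
round(logcounts, 2)
```

Zero counts remain zero after this transformation, while beads with large library sizes are scaled down before the log is taken.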

# Dimension Reduction and Clustering
Much like in scRNA-seq analysis, we perform principal component analysis (PCA) before clustering. We note that the method does not use any spatial information. 

In [None]:
set.seed(29)
sfe_filt <- runPCA(sfe_filt, ncomponents = 30, subset_row = hvgs,
                   scale = TRUE, BSPARAM = BiocSingular::IrlbaParam()) 
# scale as in Seurat
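The effect of scaling and the notion of variance explained can be seen in a small base R sketch. It uses `prcomp()` on simulated toy data rather than the approximate IRLBA solver used above, so it only illustrates the concept; the simulated "genes" `g1` to `g4` are hypothetical.

```r
set.seed(29)
n <- 200
signal <- rnorm(n)

# Two correlated "genes" sharing a common signal, plus two independent ones
x <- cbind(
  g1 = signal + rnorm(n, sd = 0.3),
  g2 = signal + rnorm(n, sd = 0.3),
  g3 = rnorm(n),
  g4 = rnorm(n)
)

# Center and scale each column to unit variance before PCA
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)
```

Because `g1` and `g2` share a common signal, the first PC captures a disproportionate share of the variance; this is the kind of pattern an elbow plot summarizes.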

We can plot the variance explained by each PC. 

In [None]:
ElbowPlot(sfe_filt, ndims = 30) + theme_bw()

We see that the first few components explain most of the variance in the data. The principal components (PCs) can be plotted in space. Here we notice that the PCs may show some spatial structure that correlates with biological niches of cells.

In [None]:
spatialReducedDim(sfe_filt, "PCA", ncomponents = 4, 
                  colGeometryName = "centroids", divergent = TRUE, 
                  diverge_center = 0, scattermore = TRUE, pointsize = 0.5)

Without the cellular overlays, we can only speculate about the potential relevance of the barcodes that seem to be separated by each PC, but each PC does seem to separate distinct neighborhoods of barcodes.

Now we can cluster the barcodes using a graph-based clustering algorithm and plot them in space. 

In [None]:
colData(sfe_filt)$cluster <- clusterRows(reducedDim(sfe_filt, "PCA")[,1:3],
                                           BLUSPARAM = SNNGraphParam(
                                               cluster.fun = "leiden",
                                               cluster.args = list(
                                                   resolution_parameter = 0.5,
                                                   objective_function = "modularity")))

The plot below is colored by cluster id. A naive interpretation of the plot shows distinct niches of barcodes separated by more abundant, intervening types. This may be indicative of the biological processes at hand, namely melanoma metastasis, where 'hotspots' of melanoma proliferation would be separated by unaffected normal tissue. 

In [None]:
plotSpatialFeature(sfe_filt, "cluster", colGeometryName = "centroids") +
  guides(colour = guide_legend(override.aes = list(size=3)))

## Moran's *I*

One avenue for future analysis includes identifying genes that are differentially expressed in each cluster. This can be interrogated with `findMarkers()` in a non-spatial context and with `calculateMoransI()` in a spatial context. In the spatial case, some consideration should be given to whether the differences seen across the tissue represent biological differences or artifacts from the field of view.

Here we run global Moran’s *I* on log normalized gene expression.

In [None]:
sfe_filt <- runMoransI(sfe_filt, features = hvgs, 
                       BPPARAM = MulticoreParam(2))

Now, we might ask: which genes display the most spatial autocorrelation? 

In [None]:
top_moran <- rownames(sfe_filt)[order(rowData(sfe_filt)$moran_sample01, 
                                      decreasing = TRUE)[1:4]]
plotSpatialFeature(sfe_filt, top_moran, colGeometryName = "centroids",
                   scattermore = TRUE, pointsize = 0.5)

Spatial variability can also be investigated using differential expression testing between known anatomical regions, complemented with spatial location. One potential drawback to this approach is the variability that is induced by the melanoma, rather than the native tissue architecture, which may preclude identification of typical structures.

Further analyses that can be done at this stage:

1. What gene expression patterns, if any, differentiate the neighborhoods of melanoma cells?
2. What genes are differentially expressed in each cluster? 

# Session Info

In [None]:
sessionInfo()

# References