This repository has been archived by the owner on Jul 18, 2023. It is now read-only.
generated from Bioconductor/BuildABiocWorkshop
-
Notifications
You must be signed in to change notification settings - Fork 0
/
vig3_Voyager.Rmd
368 lines (283 loc) · 30.9 KB
/
vig3_Voyager.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
---
title: "3. Voyager"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{3. Voyager}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>", fig.align = "center"
)
```
# Introduction
In single cell RNA-seq (scRNA-seq), data and metadata can be represented with `SingleCellExperiment` or `Seurat` objects, and basic exploratory data analyses and visualization performed with `scater`, `scran`, and `scuttle`, or `Seurat`. The `SpatialFeatureExperiment` package and S4 class extending `SpatialExperiment` and `SingleCellExperiment` brings EDA methods for vector spatial data to spatial transcriptomics. `Voyager` to `SpatialFeatureExperiment` is just like `scater`, `scran`, and `scuttle` to `SingleCellExperiment`, implementing basic exploratory spatial data analysis (ESDA) and visualization methods.
Tobler's first law of geography:
> Everything is related to everything else. But near things are more related than distant things.
Non-spatial statistical methods often assume that the samples (cells, spots) are independent, which is not the case in spatial data, where nearby samples tend to be more similar (i.e. positive spatial autocorrelation; negative spatial autocorrelation is when nearby samples tend to be more dissimilar, like a checkered pattern). Much of ESDA is dedicated to spatial autocorrelation, such as finding whether it is present, and if so what's its length scale.
This part of the workshop gives an overview of some ESDA methods, functionalities of the `Voyager` package, and applications of the `SpatialFeatureExperiment` class with a published Visium dataset.
```{r setup}
library(Voyager)
library(SpatialFeatureExperiment)
library(scater)
library(scran)
library(SFEData)
library(sf)
library(ggplot2)
library(scales)
library(patchwork)
library(BiocParallel)
library(bluster)
theme_set(theme_bw(10))
```
# Dataset
The dataset used in this vignette comes from [Large-scale integration of single-cell transcriptomic data captures transitional progenitor states in mouse skeletal muscle regeneration](https://doi.org/10.1038/s42003-021-02810-x). Notexin was injected into the tibialis anterior muscle to induce injury, and the healing muscle was collected 2, 5, and 7 days post injury for Visium. The dataset here is from the 2 day timepoint. The dataset is in a `SpatialFeatureExperiment` (SFE) object.
The gene count matrix was directly downloaded from GEO. All 4992 spots, whether in tissue or not, are included. The H&E image was used for nuclei and myofiber segmentation. A subset of nuclei from randomly selected regions from all 3 timepoints were manually annotated to train a StarDist model to segment the rest of the nuclei, and the myofibers were all manually segmented.
Tissue boundary, nuclei, myofiber, and Visium spot polygons are stored as `sf` data frames in the SFE object. See the vignette of the `SpatialFeatureExperiment` for more details on the structure of the SFE object. The SFE object of this dataset is provided in the `SFEData` package.
```{r}
(sfe <- McKellarMuscleData("full"))
```
The H&E image of this section:
```{r, echo=FALSE, out.width = "100%", fig.cap="Low resolution H&E image of the tissue section", fig.alt="A cross section of mouse muscle is slightly off center to the lower left. In the middle of the tissue is the notexin injury site with leukocyte infiltration and fewer myofibers. The rest of the tissue section is tightly packed with myofibers."}
knitr::include_graphics("tissue_lowres_5a.jpeg")
```
# Exploratory data analysis
## Spots in tissue
While the example dataset has all Visium spots whether on tissue or not, only spots that intersect tissue will be used for further analyses.
```{r}
names(colData(sfe))
```
Total UMI counts (`nCounts`), number of genes detected per spot (`nGenes`), and proportion of mitochondrially encoded counts (`prop_mito`) have been precomputed and are in `colData(sfe)`. The `plotSpatialFeature` function plots any gene, `colData` values, and geometry attributes in `colGeometry` and `annotGeometry` in space. The Visium spots are plotted as polygons reflecting their actual size relative to the tissue, rather than points as in other packages that plot Visium data. Behind the scene, `geom_sf` is used to plot the geometries.
The tissue boundary was found by thresholding the H&E image and removing small polygons that are most likely debris. The `in_tissue` column of `colData(sfe)` indicates which Visium spot polygon intersects the tissue polygon; this can be found with `SpatialFeatureExperiment::annotPred()`.
While `scran` is used for data normalization here for demonstration purposes and to make the data more normally distributed, we do not mean that it is the best practice in normalizing spatial transcriptomics data, as we don't know what the best practice really should be. As seen in the `nCounts` plot in space above, spatial autocorrelation is evident. In Visium, reverse transcription occurs in situ on the spots, but PCR amplification occurs after the cDNA is dissociated from the spots. Then artifacts introduced from the amplification step would not be spatial. Spatial artifacts may arise from diffusion of transcripts to adjacent spots and tissue permeablization. However, given how the total counts seem to correspond to histological regions, the total counts may have a biological component and hence should not be treated as a technical artifact to be normalized away as in scRNA-seq data normalization methods.
```{r}
sfe_tissue <- sfe[,colData(sfe)$in_tissue]
sfe_tissue <- sfe_tissue[rowSums(counts(sfe_tissue)) > 0,]
```
```{r}
clusters <- quickCluster(sfe_tissue)
sfe_tissue <- computeSumFactors(sfe_tissue, clusters=clusters)
sfe_tissue <- logNormCounts(sfe_tissue)
```
Myofiber and nuclei segmentation polygons are available in this dataset, in the field `annotGeometries`. Myofibers were manually segmented, and nuclei were segmented with [`StarDist`](https://github.com/stardist/stardist), trained with a manually segmented subset.
```{r}
annotGeometryNames(sfe_tissue)
```
### From myofibers and nuclei to Visium spots
The `plotSpatialFeature` function can also be used to plot attributes of geometries, i.e. the non-geometry columns in the `sf` data frames in the `rowGeometries`, `colGeometries`, or `annotGeometries` fields in the SFE object. For `rowGeometries` and `colGeometries`, such columns associated with the `sf` data frames rather than `rowData` or `colData` are allowed because one can specify how these columns associate with the geometries (see [`st_agr`](https://r-spatial.github.io/sf/reference/st_agr.html) and [documentation of `st_sf`](https://r-spatial.github.io/sf/reference/sf.html#details-1)). When an attribute of an `annotGeometry` is plotted along side gene expression or `colData` or `colGeometry` attribute, the `annotGeometry` attribute is plotted with a different color palette to distinguish from the column associated values.
Here, from the `annotGeometries`, the myofiber polygons are plotted, colored by cross section area as observed in this tissue section. The `aes_use` argument is set to `color` rather than `fill` (default for polygons) to only plot the Visium spot outlines to make the myofiber polygons more visible. The `fill` argument is set to `NA` to make the Visium spots look hollow, and the `size` argument controls the thickness of the outlines. The `annot_aes` argument specifies which column in the `annotGeometry` to use to specify the values of an aesthstic, just like `aes` in `ggplot2` (`aes_string` to be precise, since `tidyeval` is not used here). The `annot_fixed` argument (not used here) can set the fixed size, alpha, color, and etc. for the `annotGeometry`.
```{r, fig.alt="Plot of Visium spots in tissue and myofiber polygons in physical space. Visium spots are colored by nCounts, and myofibers are colored by area."}
plotSpatialFeature(sfe_tissue, features = "nCounts",
colGeometryName = "spotPoly",
annotGeometryName = "myofiber_simplified",
annot_aes = list(fill = "area"),
aes_use = "color", size = 0.5, fill = NA)
```
The larger myofibers seem to have fewer total counts, possibly because the larger size of these myofibers dilute the transcripts. If this is the case, then data normalization would be relevant to correct for this.
With `SpatialFeatureExperiment`, we can find the number of myofibers and nuclei that intersect each Visium spot. The predicate can be [anything implemented in `sf`](https://r-spatial.github.io/sf/reference/geos_binary_pred.html), so for example, the number of nuclei fully covered by each Visium spot can also be found. The default predicate is `st_intersects`.
```{r}
colData(sfe_tissue)$n_myofibers <-
annotNPred(sfe_tissue, colGeometryName = "spotPoly",
annotGeometryName = "myofiber_simplified")
```
```{r, fig.width=8, fig.height=4, fig.alt="Plot of Visium spots in tissue in physical space, colored by number of myofibers intersecting each spot."}
plotSpatialFeature(sfe_tissue, features = "n_myofibers",
colGeometryName = "spotPoly")
```
There is no one to one mapping between Visium spots and myofibers. However, we may want to relate attributes of myofibers to gene expression detected at the Visium spots. One way to do so is to summarize the attributes of all myofibers that intersect (or choose another better predicate implemented in `sf`) each spot, such as to calculate the mean, median, or sum. This can be done with the `annotSummary` function in `SpatialFeatureExperiment`. The default predicate is `st_intersects`, and the default summary function is `mean`.
```{r}
colData(sfe_tissue)$mean_myofiber_area <-
annotSummary(sfe_tissue, "spotPoly", "myofiber_simplified",
annotColNames = "area")[,1] # it always returns a data frame
```
```{r, fig.alt="Plot of Visium spots in tissue in physical space, colored by the average area of myofibers that intersect each spot. The average area is higher near the mid-top right part of the tissue."}
# The gray spots don't intersect any myofiber
plotSpatialFeature(sfe_tissue, "mean_myofiber_area", "spotPoly")
```
Now we can see how the mean area of myofibers intersecting each Visium spot relates to other aspects of the spots such as total counts and gene expression.
The NAs are for spots not intersecting any myofibers, e.g. those in the inflammatory region.
### Myofiber types
Marker genes: Myh7 (Type I, slow twitch, aerobic), Myh2 (Type IIa, fast twitch, somewhat aerobic), Myh4 (Type IIb, fast twitch, anareobic), Myh1 (Type IIx, fast twitch, anaerobic), from [this protocol](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5526362/)
```{r}
markers <- c(I = "Myh7", IIa = "Myh2", IIb = "Myh4", IIx = "Myh1")
```
First look at Type I myofibers. This is a fast twitch muscle, so we don't expect many slow twitch Type I myofibers. Row names in `sfe_tissue` are Ensembl IDs, to avoid ambiguity as sometimes multiple Ensembl IDs have the same gene symbol and some genes have aliases in symbol. However, gene symbols are shorter and more human readable than Ensembl IDs, and are nice to show on plots. In the `plotSpatialFeature` function and other functions in `Voyager`, even when the actual row names are Ensembl IDs, the `features` argument can take gene symbols if there is a column called "symbols" in `rowData(sfe)`, where the function converts the gene symbols to Ensembl IDs. By default, gene symbols are shown on the plot, but the `show_symbol` argument can be set to `FALSE` to show Ensembl IDs instead. If one gene symbol matches multiple Ensembl IDs in the dataset, then a warning will be given.
The `exprs_values` argument specifies the assay to use, which is by default "logcounts", i.e. the log normalized data. This default may or may not be best practice given that total UMI counts may have biological relevance in spatial data. Therefore we are plotting both the raw counts and the log normalized counts here.
```{r}
# Function specific for this vignette, with some hard coded values
plot_counts_logcounts <- function(sfe, feature) {
p1 <- plotSpatialFeature(sfe, feature, "spotPoly",
annotGeometryName = "myofiber_simplified",
annot_aes = list(fill = "area"), aes_use = "color",
fill = NA, size = 0.5, show_symbol = TRUE,
exprs_values = "counts") +
ggtitle("Raw counts")
p2 <- plotSpatialFeature(sfe, feature, "spotPoly",
annotGeometryName = "myofiber_simplified",
annot_aes = list(fill = "area"), aes_use = "color",
fill = NA, size = 0.5, show_symbol = TRUE,
exprs_values = "logcounts") +
ggtitle("Log normalized counts")
p1 + p2 +
plot_annotation(title = feature)
}
```
```{r, fig.alt="Raw and log normalized counts of Myh7, marker gene of type I myofiber, plotted side by side on Visium spots in space, with myofiber polygons colored by myofiber cross section area plotted in the background. Visium spots expressing Myh7 concentrate in the lower left part of the tissue where the myofibers tend to be smaller."}
plot_counts_logcounts(sfe_tissue, markers["I"])
```
Marker gene for type IIa myofibers is shown here. Those interested may change the code to plot markers for tyle IIb and IIx myofibers.
```{r, fig.alt="Raw and log normalized counts of Myh2, marker gene of type IIa myofiber, plotted side by side on Visium spots in space, with myofiber polygons colored by myofiber cross section area plotted in the background. Visium spots expressing Myh2 concentrate in the lower left and upper left parts of the tissue where the myofibers tend to be smaller. Log normalized counts show a wider region with higher expression."}
plot_counts_logcounts(sfe_tissue, markers["IIa"])
```
Type IIa myofibers also tend to be clustered together on left side of the tissue.
As SFE inherits from SCE, the non-spatial EDA plots from the `scater` package can still be used.
```{r, fig.alt="Scatter plot of mean area of myofibers intersecting each Visium spot in the x axis and proportion of mitochondrially encoded counts per spot in the y axis, with points colored by expression of Myh2."}
gene_id <- rownames(sfe_tissue)[rowData(sfe_tissue)$symbol == markers["IIa"]]
plotColData(sfe_tissue, x = "mean_myofiber_area", y = "prop_mito",
colour_by = gene_id, by_exprs_values = "logcounts")
```
Plotting proportion of mitochondrial counts vs. mean myofiber area, we see two clusters, one with higher proportion of mitochondrial counts and smaller area, and another with lower proportion of mitochondrial counts and on average slightly larger area. Type IIa myofibers tend to have smaller area and larger proportion of mitochondrial counts.
# Spatial neighborhood graphs
A spatial neighborhood graph is required to compute spatial dependency metrics such as Moran's I and Geary's C. The `SpatialFeatureExperiment` package wraps methods in `spdep` to find spatial neighborhood graphs, which are stored within the SFE object (see `spdep` documentation for [`gabrielneigh`](https://r-spatial.github.io/spdep/reference/graphneigh.html), [`knearneigh`](https://r-spatial.github.io/spdep/reference/knearneigh.html), [`poly2nb`](https://r-spatial.github.io/spdep/reference/poly2nb.html), and [`tri2nb`](https://r-spatial.github.io/spdep/reference/tri2nb.html)). The `Voyager` package then uses these graphs for spatial dependency analyses, again based on `spdep` in this first version, but methods from other geospatial packages, some of which also use the spatial neighborhood graphs, may be added later as needed.
For Visium, where the spots are in a hexagonal grid, the spatial neighborhood graph is straightforward. However, for spatial technologies with single cell resolution (e.g. MERFISH) and in this dataset, the myofibers and nuclei, many different methods can be used to find the spatial neighborhood graph. Here for myofibers, the method "poly2nb" identifies myofiber polygons that physically touch each other. `zero.policy = TRUE` will allow singletons, i.e. nodes without neighbors in the graph; in the inflamed region, there are more singletons. We have not yet benchmarked which spatial neighborhood method is the "best" in which situation; the particular method used here is for demonstration purpose and may or may not be best practice.
```{r}
colGraph(sfe_tissue, "visium") <- findVisiumGraph(sfe_tissue)
annotGraph(sfe_tissue, "myofiber_poly2nb") <-
findSpatialNeighbors(sfe_tissue, type = "myofiber_simplified", MARGIN = 3,
method = "poly2nb", zero.policy = TRUE)
```
The `plotColGraph` function plots the graph in space associated with a `colGeometry`, along with the geometry of interest.
```{r, fig.alt="Spatial neighborhood graph of Visium spots that intersect tissue."}
plotColGraph(sfe_tissue, colGraphName = "visium", colGeometryName = "spotPoly")
```
Similarly, the `plotAnnotGraph` function plots the graph associated with an `annotGeometry`, along with the geometry of interest.
```{r, fig.alt="Spatial neighborhood graph of myofibers, where each edge connects two myofibers that touch."}
plotAnnotGraph(sfe_tissue, annotGraphName = "myofiber_poly2nb",
annotGeometryName = "myofiber_simplified")
```
There is no `plotRowGraph` yet since we haven't worked with a dataset where spatial graphs related to genes are relevant, although the SFE object supports row graphs.
# Exploratory _spatial_ data analysis
All spatial autocorrelation metrics in this package can be computed directly on a vector or a matrix rather than an SFE object. The user interface emulates those of dimension reductions in the `scater` package (e.g. `calculateUMAP` that takes in a matrix or SCE object and returns a matrix, and `runUMAP` that takes in an SCE object and adds the results to the `reducedDims` field of the SCE object). So `calculate*` functions take in a matrix or an SFE object and directly return the results (format of the results depends on the structure of the results), while `run*` functions take in an SFE object and add the results to the object. In addition, `colData*` functions compute the metrics for numeric variables in `colData`. `colGeometry*` functions compute the metrics for numeric columns in a `colGeometry`. `annotGeometry*` functions compute the metrics for numeric columns in a `annotGeometry`.
## Univariate
In this first version, `Voyager` only supports univariate global spatial autocorrelation implemented in `spdep` for ESDA: Moran's I and Geary's C, permutation testing for Moran's I and Geary's C, Moran plot, and correlograms. In addition, beyond `spdep`, `Voyager` can cluster Moran plots and correlograms. Plotting functions taking in SFE objects are implemented to plot the results with `ggplot2` and with more customization options than `spdep` plotting functions.
To demonstrate spatial autocorrelation in gene expression, top highly variable genes (HVGs) are used. The HVGs are found with the `scran` method.
```{r}
dec <- modelGeneVar(sfe_tissue)
hvgs <- getTopHVGs(dec, n = 50)
```
### Moran's I
There are several ways to quantify spatial autocorrelation, the most common of which is Moran's I:
$$
I = \frac{n}{\sum_{i=1}^n \sum_{j=1}^n w_{ij}} \frac{\sum_{i=1}^n \sum_{j=1}^n w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2},
$$
where $n$ is the number of spots or locations, $i$ and $j$ are different locations, or spots in the Visium context, $x$ is a variable with values at each location, and $w_{ij}$ is a spatial weight, which can be inversely proportional to distance between spots or an indicator of whether two spots are neighbors, subject to various definitions of neighborhood and whether to normalize the number of neighbors. The `spdep` package uses the neighborhood.
Moran's I takes values between -1 and 1. For positive spatial autocorrelation, i.e. nearby spots tend to be more similar, Moran's I will be positive. For negative spatial autocorrelation, i.e. nearby spots tend to be more dissimilar, Moran's I will be negative. When the variable is distributed in space randomly like salt and pepper, then Moran's I will be around 0. Positive Moran's I indicates global structure, while negative Moran's I indicates local structure.
Upon visual inspection, total UMI counts per spot seem to have spatial autocorrelation. A spatial neighborhood graph is required to compute Moran's I, and is specified with the `listw` argument.
For matrices, the rows are the features, as in the gene count matrix.
```{r}
# Directly use vector or matrix, and multiple features can be specified at once
calculateMoransI(t(colData(sfe_tissue)[,c("nCounts", "nGenes")]),
listw = colGraph(sfe_tissue, "visium"))
```
I is Moran's I, and K is sample kurtosis.
To add the results to the SFE object, specifically for colData:
```{r}
sfe_tissue <- colDataMoransI(sfe_tissue, features = c("nCounts", "nGenes"),
colGraphName = "visium")
head(colFeatureData(sfe_tissue), 10)
```
For `colData`, the results are added to `colFeatureData(sfe)`, and features for which Moran's I is not calculated have NA. The column names of `featureData` distinguishes between different samples (there's only one sample in this dataset), and are parsed by plotting functions.
To add the results to the SFE object, specifically for geometries: Here "area" is the area of the cross section of each myofiber as seen in this tissue section and "eccentricity" is the eccentricity of the ellipse fitted to each myofiber.
```{r}
# Remember zero.policy = TRUE since there're singletons
sfe_tissue <- annotGeometryMoransI(sfe_tissue,
features = c("area", "eccentricity"),
annotGeometryName = "myofiber_simplified",
annotGraphName = "myofiber_poly2nb",
zero.policy = TRUE)
head(attr(annotGeometry(sfe_tissue, "myofiber_simplified"), "featureData"))
```
For a non-geometry column in a `colGeometry`, `colGeometryMoransI` is like `annotGeometryMoransI` here, but none of the `colGeometries` in this dataset has extra columns.
For gene expression, the `logcounts` assay is used by default (use the `exprs_values` argument to change the assay), though this may or may not be best practice. If the metrics are computed for a large number of features, parallel computing is supported, with `BiocParallel`, with the `BPPARAM` argument.
```{r}
sfe_tissue <- runMoransI(sfe_tissue, features = hvgs, colGraphName = "visium",
BPPARAM = MulticoreParam(2))
rowData(sfe_tissue)[head(hvgs),]
```
### Geary's C
Another spatial autocorrelation metric is Geary's C, defined as:
$$
C = \frac{(n-1)}{2\sum_{i=1}^n \sum_{j=1}^n w_{ij}} \frac{\sum_{i=1}^n \sum_{j=1}^n w_{ij}(x_i - x_j)^2}{{\sum_{i=1}^n (x_i - \bar{x})^2}}
$$
Geary's C well below 1 indicates positive spatial autocorrelation, and above 1 indicates negative spatial autocorrelation.
Simply substitute "MoransI" in the names of the functions in the previous section with "GearysC" to compute Geary's C for features of interest and add the results to the SFE object. For example, for `colData`
```{r}
sfe_tissue <- colDataGearysC(sfe_tissue, features = c("nCounts", "nGenes"),
colGraphName = "visium")
head(colFeatureData(sfe_tissue), 10)
```
There's only one column for K since it's the same for Moran's I and Geary's C. Here both Moran's I and Geary's C suggest positive spatial autocorrelation for `nCounts` and `nGenes`.
Other univariate global methods, including permutation testing (`runMoranMC`, `runGearyMC`), correlograms (`runCorrelogram`), and Moran scatter plot (`runMoranPlot`) functions all have the same arguments as `runMoransI`, except when additional arguments are required, such as `nsim` for the number of simulation for `runMoranMC` and `runGearyMC` and `order` for the maximum order of neighborhoods for `runCorrelogram`.
### Permutation testing
Is the spatial autocorrelation statistically significant? The `moran.test` function in `spdep` can give an analytic p-value but the p-value would not be accurate if the data is not normally distributed. As gene expression data is generally not normally distributed and data normalization doesn't necessarily make the data that close to a normal distribution, permutation testing is used in this package to test the significance of Moran's I and Geary's C, wrapping [`moran.mc` in `spdep`](https://r-spatial.github.io/spdep/reference/moran.mc.html). Just like Moran's I, there's `calculateMoranMC` and `calculateGearyMC` functions to directly return the results, and `colDataMoranMC`, `colGeometryMoranMC`, `annotGeometryMoranMC`, and `runMoranMC` to add the results to the SFE object. MC stands for Monte Carlo. The `nsim` argument specifies the number of simulations.
Add the results to the SFE object
```{r}
set.seed(29)
sfe_tissue <- colDataMoranMC(sfe_tissue, features = c("nCounts", "nGenes"),
colGraphName = "visium", nsim = 100)
head(colFeatureData(sfe_tissue), 10)
```
Note that while the test is performed for multiple features, the p-values here are not corrected for multiple hypothesis testing.
The results can be plotted:
```{r, fig.alt="Density plot of Moran's I values from 100 simulations of nCounts and nGenes. The density plots center around 0 and deminish around 0.06 on the right. Vertical lines mark the actual Moran's I. For both nCounts and nGenes, the actual value, at 0.53 and 0.38 respectively, is far higher than the simulated ones, indicating positive spatial autocorrelation."}
plotMoranMC(sfe_tissue, c("nCounts", "nGenes"))
```
By default, the colorblind friendly palette from `dittoSeq` is used for categorical variables. The density is of Moran's I from the simulations where the values are permuted and disconnected from spatial locations, and the vertical line is the actual Moran's I value. The simulation indicates that the spatial autocorrelation is significant.
Each function for Moran MC has a Geary's C equivalent (e.g. `runGearyMC`).
### Correlogram
What's the length scale of the spatial autocorrelation? In a correlogram, spatial autocorrelation of higher orders of neighbors (e.g. second order neighbors are neighbors of neighbors) is calculated to see how it decays over the orders. In Visium, with the regular hexagonal grid, order of neighbors is a proxy for distance. For more irregular patterns such as single cells, different methods to find the spatial neighbors may give different results. Functions to compute correlograms wrap [`sp.correlogram` in `spdep`](https://r-spatial.github.io/spdep/reference/sp.correlogram.html), and have the same pattern of `calculate*` and `run*` as the Moran's I and permutation test functions, except for the `order` argument specifying the maximum order of neighbors.
For `colData`, Moran's I correlogram:
```{r}
sfe_tissue <- runCorrelogram(sfe_tissue, hvgs[1:2], colGraphName = "visium",
order = 10)
```
The results can be plotted with `plotCorrelogram`
```{r, fig.alt="Line plot with order of neighbors (lags) in the x axis and Moran's I value at each lag in the y axis. The x axis ranges from 1 to 10, and the y axis ranges from 0 to 0.8. The lines show trends of decay of spatial autocorrelation with increasing distance of neighbors. Two genes, Car3 and Mb, are shown. Moran's I of both genes decay somewhat linearly from lag 1 to 10. Car3 decays from around 0.75 to around 0.23. Mb decays from around 0.7 to around 0.13. At each lag the error bars are tight (see next paragraph in the main text) and the p-values are less than 0.001 after Benjamini-Hochberg multiple testing correction over the 2 genes and 10 lags."}
plotCorrelogram(sfe_tissue, hvgs[1:2])
```
The error bars are twice the standard deviation of the Moran's I value. The standard deviation and p-values (null hypothesis is that Moran's I is 0) come from `moran.test` (for Geary's C correlogram, `geary.test`); these should be taken with a grain of salt for data that is not normally distributed. The p-values have been corrected for multiple hypothesis testing across all orders and features. As usual, . means p < 0.1, \* means p < 0.05, \*\* means p < 0.01, and \*\*\* means p < 0.001.
Again, this can be done for Geary's C, `colData`, `annotGeometry`, and etc. as in Moran's I and permutation testing.
### Moran scatter plot
In the Moran scatter plot, the x axis is the value itself and the y axis is the average value of the neighbors. The slope of the fitted line is Moran's I. Sometimes clusters appear in this plot, showing different kinds of neighborhoods for this value. Just like Moran's I, permutation testing, and correlogram functions, the functions for Moran scatter plot also follow the `calculate*` and `run*` patterns and have the same user interface. However, the plot can only be made for one feature at a time.
For gene expression, to use one gene (log normalized value) to demonstrate:
```{r}
sfe_tissue <- runMoranPlot(sfe_tissue, "Myh1", colGraphName = "visium")
```
```{r, fig.alt="Moran scatter plot of log normalized values of gene Myh1. This plot is described in the upcoming main text."}
moranPlot(sfe_tissue, "Myh1", graphName = "visium")
```
The dashed lines mark the mean in Myh1 and spatially lagged Myh1. There are no singletons here. Some Visium spots with lower Myh1 expression have neighbors that don't express Myh1 but spots that don't express Myh1 usually have at least some neighbors that do. There are 2 main clusters for spots whose neighbors do express Myh1: those with high (above average) expression whose neighbors also have high expression, and those with low expression whose neighbors also have low expression. Other features may show different kinds of clusters. We can use k-means clustering to identify clusters, though any clustering method supported by the `bluster` package can be used.
```{r}
set.seed(29)
clusts <- clusterMoranPlot(sfe_tissue, "Myh1", BLUSPARAM = KmeansParam(2))
```
```{r, fig.alt="Moran scatter plot of log normalized value of Myh1, colored by 2 k-means clusters, which correspond to the high-high and low-low spots."}
moranPlot(sfe_tissue, "Myh1", graphName = "visium", color_by = clusts$Myh1)
```
Plot the clusters in space
```{r, fig.alt="Visium spots in space colored by the k-means clusters. Cluster 2 (high-high) are mostly in the upper left and lower left parts of the tissue, and the rest of the spots are cluster 1."}
colData(sfe_tissue)$Myh1_moranPlot_clust <- clusts$Myh1
plotSpatialFeature(sfe_tissue, "Myh1_moranPlot_clust", colGeometryName = "spotPoly")
```
This can also be done for `colData`, `annotGeometry`, and etc. as in Moran's I and permutation testing.
# Limitations
1. In the first version of `Voyager`, only univariate global spatial autocorrelation metrics are supported. Anisotropy, univariate local spatial metrics, and multivariate spatial analyses will be added in later versions.
2. The plotting functions don't plot the H&E image in the background.
3. It's more convoluted to trick `geom_sf` into flipping the y axis, since the coordinates are in pixels in full resolution image and the image has the origin at the top left.
4. Only 2D data is supported at present.
# Session info
```{r}
sessionInfo()
```