---
title: "Downsampling"
author: "Timothy Keyes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
description: >
  Read this vignette to learn how to downsample a high-dimensional cytometry
  dataset to a smaller number of cells using {tidytof}.
vignette: >
  %\VignetteIndexEntry{Downsampling}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 4,
  fig.width = 4
)
```
```{r setup, message = FALSE}
library(tidytof)
library(dplyr)
library(ggplot2)
count <- dplyr::count
```
High-dimensional cytometry experiments often collect tens of thousands, hundreds of thousands, or even millions of cells in total, so it can be useful to downsample to a smaller, more computationally tractable number of cells, either for a final analysis or while developing code.
To do this, `{tidytof}` implements the `tof_downsample()` verb, which supports three methods: downsampling to a fixed integer number of cells, downsampling to a fixed proportion of the input cells, or downsampling to a fixed cellular density in phenotypic space.
## Downsampling with `tof_downsample()`
Using `{tidytof}`'s built-in dataset `phenograph_data`, we can see that the original size of the dataset is 1000 cells per cluster, or 3000 cells in total:
```{r}
data(phenograph_data)
phenograph_data |>
  dplyr::count(phenograph_cluster)
```
To randomly sample 200 cells per cluster, we can call `tof_downsample()` with the "constant" `method`:
```{r}
phenograph_data |>
  # downsample
  tof_downsample(
    group_cols = phenograph_cluster,
    method = "constant",
    num_cells = 200
  ) |>
  # count the number of downsampled cells in each cluster
  count(phenograph_cluster)
```
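Conceptually, constant downsampling amounts to drawing the same number of random rows from every group. The chunk below is a minimal sketch of that idea using `dplyr::slice_sample()` on a made-up data frame (`toy_cells`, `cluster`, and `marker` are hypothetical names). It illustrates the technique only and is not a description of how `tof_downsample()` works internally. Because the sampling is random, calling `set.seed()` first makes the result reproducible.

```{r}
library(dplyr)

# make the random draw reproducible
set.seed(2024)

# hypothetical stand-in for a cytometry dataset: 3 clusters of 1000 cells each
toy_cells <- data.frame(
  cluster = rep(c("cluster1", "cluster2", "cluster3"), each = 1000),
  marker = rnorm(3000)
)

# keep the same fixed number of randomly chosen cells from every cluster
toy_constant <- toy_cells |>
  group_by(cluster) |>
  slice_sample(n = 200) |>
  ungroup()

# count the number of sampled cells in each cluster
dplyr::count(toy_constant, cluster)
```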
Alternatively, if we wanted to sample 50% of the cells in each cluster, we could use the "prop" `method`:
```{r}
phenograph_data |>
  # downsample
  tof_downsample(
    group_cols = phenograph_cluster,
    method = "prop",
    prop_cells = 0.5
  ) |>
  # count the number of downsampled cells in each cluster
  count(phenograph_cluster)
```
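The difference between a fixed count and a fixed proportion only becomes visible when groups have unequal sizes, which is not the case in `phenograph_data`. The following sketch uses a hypothetical data frame (`toy_unequal`, with made-up cluster sizes) and `dplyr::slice_sample()` to show that proportional sampling preserves the relative sizes of the groups. It is an illustration of the idea, not `{tidytof}`'s own implementation.

```{r}
library(dplyr)

# make the random draw reproducible
set.seed(2024)

# hypothetical dataset with clusters of very different sizes
toy_unequal <- data.frame(
  cluster = rep(c("large", "medium", "small"), times = c(1000, 500, 100)),
  marker = rnorm(1600)
)

# keep half of the cells in every cluster, whatever its size
toy_prop <- toy_unequal |>
  group_by(cluster) |>
  slice_sample(prop = 0.5) |>
  ungroup()

# cluster sizes before and after downsampling
dplyr::count(toy_unequal, cluster)
dplyr::count(toy_prop, cluster)
```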
Finally, we might want to take a different approach to downsampling, one that reduces the number of cells not to a fixed constant or proportion but to a fixed *density* in phenotypic space. For example, the following hexbin plot shows that some regions of phenotypic space in `phenograph_data` contain more cells than others along the `cd34`/`cd38` axes:
```{r, warning = FALSE, message = FALSE}
# rescale values so that the maximum of `from` maps to the maximum of `to`
rescale_max <-
  function(x, to = c(0, 1), from = range(x, na.rm = TRUE)) {
    x / from[2] * to[2]
  }

phenograph_data |>
  # preprocess all numeric columns in the dataset
  tof_preprocess(undo_noise = FALSE) |>
  # plot
  ggplot(aes(x = cd34, y = cd38)) +
  geom_hex() +
  coord_fixed(ratio = 0.4) +
  scale_x_continuous(limits = c(NA, 1.5)) +
  scale_y_continuous(limits = c(NA, 4)) +
  scale_fill_viridis_c(
    labels = function(x) round(rescale_max(x), 2)
  ) +
  labs(
    fill = "relative density"
  )
```
To remove cells until the local density around each remaining cell is roughly constant, we can use the "density" `method` of `tof_downsample()`:
```{r, warning = FALSE, message = FALSE}
phenograph_data |>
  tof_preprocess(undo_noise = FALSE) |>
  tof_downsample(method = "density", density_cols = c(cd34, cd38)) |>
  # plot
  ggplot(aes(x = cd34, y = cd38)) +
  geom_hex() +
  coord_fixed(ratio = 0.4) +
  scale_x_continuous(limits = c(NA, 1.5)) +
  scale_y_continuous(limits = c(NA, 4)) +
  scale_fill_viridis_c(
    labels = function(x) round(rescale_max(x), 2)
  ) +
  labs(
    fill = "relative density"
  )
```
Thus, we can see that the density after downsampling is more uniform (though not exactly uniform) across the range of `cd34`/`cd38` values in `phenograph_data`.
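To build intuition for why density-based downsampling flattens the distribution, here is a rough base R sketch of the general idea: estimate a local density for each cell, then keep each cell with a probability inversely proportional to that density. Everything in it is an assumption made for illustration (the toy data, the neighborhood radius of 0.2, and the 10th-percentile target density), and it should not be read as the algorithm that `tof_downsample()` actually uses.

```{r}
# make the random draw reproducible
set.seed(2024)

# hypothetical 2D "phenotypic space": one crowded population and one sparse one
toy_cells <- rbind(
  data.frame(x = rnorm(900, mean = 0, sd = 0.3), y = rnorm(900, mean = 0, sd = 0.3)),
  data.frame(x = rnorm(100, mean = 3, sd = 0.3), y = rnorm(100, mean = 3, sd = 0.3))
)
toy_cells$population <- rep(c("dense", "sparse"), times = c(900, 100))

# local density: number of cells (including the cell itself) within a fixed
# radius; the radius is an arbitrary choice for this toy example
distances <- as.matrix(dist(toy_cells[, c("x", "y")]))
local_density <- rowSums(distances < 0.2)

# keep each cell with probability target / local density (capped at 1), so
# cells in crowded regions are discarded more often than cells in sparse ones
target_density <- unname(quantile(local_density, probs = 0.1))
keep_prob <- pmin(1, target_density / local_density)
toy_downsampled <- toy_cells[runif(nrow(toy_cells)) < keep_prob, ]

# population sizes before and after downsampling
table(toy_cells$population)
table(toy_downsampled$population)
```

In this sketch the crowded population loses a much larger share of its cells than the sparse one, which is the same qualitative effect visible in the hexbin plot after downsampling.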
## Additional documentation
For more details, see the documentation for the three underlying members of the `tof_downsample_*` function family (which `tof_downsample()` wraps):
- `tof_downsample_constant`
- `tof_downsample_prop`
- `tof_downsample_density`
# Session info
```{r}
sessionInfo()
```