-
Notifications
You must be signed in to change notification settings - Fork 3
/
supplementary.Rmd
377 lines (265 loc) · 13 KB
/
supplementary.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
---
title: "Supplementary Functions"
output:
rmarkdown::html_vignette:
number_sections: false
tabset: true
vignette: >
%\VignetteIndexEntry{Supplementary Functions}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.align = "center",
out.width = "100%",
fig.width = 9, fig.height = 7
)
```
# Introduction
The `NAIR` package contains a number of functions supplementary to `buildRepSeqNetwork()` that can be used to perform additional downstream tasks.
## Simulate Data for Demonstration
We simulate some toy data for demonstration.
We simulate data consisting of two samples with 100 observations each, for a total of 200 observations (rows).
```{r }
set.seed(42)
library(NAIR)
dir_out <- tempdir()
toy_data <- simulateToyData()
head(toy_data)
```
```{r}
nrow(toy_data)
```
Before we go through the various functions, the code below shows an example of how some of them might be used together.
```{r, eval = FALSE}
library(magrittr) # For pipe operator (%>%)
toy_data %>%
filterInputData("CloneSeq", drop_matches = "\\W") %>%
buildNet("CloneSeq") %>%
addNodeStats("all") %>%
addClusterMembership("greedy", cluster_id_name = "cluster_greedy") %>%
addClusterMembership("leiden", cluster_id_name = "cluster_leiden") %>%
addClusterStats("cluster_leiden", "CloneSeq", "CloneCount") %>%
addPlots(color_nodes_by = c("cluster_leiden", "cluster_greedy"),
color_scheme = "Viridis"
) %>%
labelClusters("cluster_leiden", cluster_id_col = "cluster_leiden") %>%
labelClusters("cluster_greedy", cluster_id_col = "cluster_greedy") %>%
saveNetwork(output_dir = tempdir(), output_name = "my_network")
```
# `addPlots()`
`addPlots()` can be used to generate plots of a network graph.
```{r}
net <- buildRepSeqNetwork(toy_data, "CloneSeq")
net <- addPlots(net, color_nodes_by = "SampleID")
```
See [this article](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html#new_plots) for more details.
# `addNodeStats()`
`addNodeStats()` can be used to compute node-level network properties for a network.
```{r, eval = FALSE}
net <- addNodeStats(net, stats_to_include = "all")
```
See [this article](node_properties.html) for more details.
# `addClusterStats()`
`addClusterStats()` can be used to perform cluster analysis for a network. It performs clustering, records the cluster membership of the nodes, and computes cluster properties.
```{r, eval = FALSE}
net <- addClusterStats(net, cluster_fun = "walktrap",
cluster_id_name = "cluster_walktrap")
```
See [this article](cluster_analysis.html) for more details.
# `addClusterMembership()`
`addClusterMembership()` can be used to perform clustering and record the cluster membership of the nodes without computing cluster properties. It is useful for performing multiple instances of clustering with different algorithms.
```{r}
net <- addClusterMembership(net,
cluster_fun = "leiden",
cluster_id_name = "cluster_leiden"
)
net <- addClusterMembership(net,
cluster_fun = "louvain",
cluster_id_name = "cluster_louvain"
)
```
Refer [here](cluster_analysis.html#cluster-membership-only) for more details.
# `addClusterLabels()`
`addClusterLabels()` can be used to label the clusters in a plot.
```{r}
net <- addPlots(net,
color_nodes_by = "cluster_louvain",
color_scheme = "Viridis"
)
net <- labelClusters(net,
cluster_id_col = "cluster_louvain",
top_n_clusters = 7,
size = 7
)
net$plots$cluster_louvain
```
Refer [here](cluster_analysis.html#labeling-clusters) for more details.
# `labelNodes()`
`labelNodes()` can be used to label the nodes in network plots.
```{r}
set.seed(42)
small_sample <- simulateToyData(1, sample_size = 10, prefix_length = 1)
net <- buildNet(small_sample, "CloneSeq", plot_title = NULL)
net <- labelNodes(net, "CloneSeq", size = 4)
net$plots[[1]]
```
# `saveNetwork()`
`saveNetwork()` can be used to save a list of network objects. Also prints any plots to a PDF containing one plot per page.
```{r, eval = FALSE}
saveNetwork(net, output_dir = dir_out, output_type = "individual")
```
The parameters `output_dir`, `output_type`, `output_name`, `pdf_width` and `pdf_height` have the [same behavior](buildRepSeqNetwork.html#output-settings) as they do in the `buildRepSeqNetwork()` function.
# <a name="save"></a> `saveNetworkPlots()`
`saveNetworkPlots()` can be used to save the plots for a network to a PDF containing one plot per page.
```{r, eval = FALSE}
saveNetworkPlots(net$plots, outfile = file.path(dir_out, "plots.pdf"))
```
# <a name="load"></a>`loadDataFromFileList()`
`loadDataFromFileList()` can be used to load data from multiple files and combine them into a single data frame.
```{r, eval = FALSE}
dat <- loadDataFromFileList(list.files(my_dir), input_type = "rds")
```
The primary parameter accepts a character vector containing file paths or a list containing file paths and connections. Each element corresponds to a single file. Each file is assumed to contain the data for a single sample, with observations indexed by row, and with the same columns across samples.
The supported values of `input_type` are `"rds"`, `"rda"`, `"csv"`, `"csv2"`, `"tsv"` and `"table"`. Each value specifies a different function to load the files. The respective functions are `readRDS()`, `load()`, `read.csv()`, `read.csv2()`, `read.delim()`, and `read.table()`.
## Text formats
For text formats (values of `input_type` other than `"rds"` and `"rda"`), non-default argument values can be specified to the optional parameters of the reading function using the `read.args` argument. It accepts a named list of argument values.
```{r, eval = FALSE}
loadDataFromFileList(list.files(my_dir),
input_type = "table",
read.args = list(
header = TRUE,
sep = " ",
dec = ",",
na.strings = "NA!",
row.names = 1,
col.names = c("RowID",
"CloneSeq", "CloneFrequency",
"CloneCount", "VGene"
)
)
)
```
See `?utils::read.table()` for the parameters and their accepted values. Note that `read.csv()`, `read.csv2()` and `read.delim()` are identical to `read.table()` other than their default argument values. Some examples of useful parameters include:
* `header`: Whether the first row contains column names.
* `sep`: The character separating consecutive values in each row.
* `dec`: The character used as a decimal point.
* `quote`: The character(s) used as quotes within character strings.
* `na.strings`: The character string representing NA values.
* `row.names`: The row names or the column containing the row names.
* `col.names`: The column names, for files without a header row.
* `colClasses`: For manually specifying the class of each column.
* `as.is`: For specifying the character columns to be converted to factors.
* `nrows`: Max number of rows (specification can improve memory usage)
## RData format
For the `"rda"` input type, the `data_symbols` parameter accepts a character vector specifying the name of each sample's data frame within its respective Rdata file (i.e., the name of the data frame in the R environment). A single character string can be used if each sample's data frame has the same name.
```{r, eval = FALSE}
save(df_sample1, file = file_1)
save(df_sample2, file = file_2)
save(df_sample3, file = file_3)
loadDataFromFileList(c(file_1, file_2, file_3),
input_type = "rda",
data_symbols = c("df_sample1",
"df_sample2",
"df_sample3"
)
)
```
# `combineSamples()`
`combineSamples()` has the same default behavior as `loadDataFromFileList()`, but possesses additional parameters that give it extra functionality, such as the ability to filter data and assign sample/subject/group IDs to each data file, which are then included as variables in the combined data frame.
```{r, eval = FALSE}
dat <- combineSamples(list.files(my_dir),
input_type = "rds",
min_seq_length = 7,
drop_matches = "[*|_]",
subset_cols = c("CloneSeq", "CloneCount", "VGene"),
sample_ids = 1:5,
subject_ids = c(1, 2, 2, 3, 3),
group_ids = c(1, 1, 1, 2, 2)
)
```
# `filterInputData()`
`filterInputData()` can be used to filter data prior to performing network analysis.
```{r}
filtered_data <- filterInputData(toy_data,
seq_col = "CloneSeq",
min_seq_length = 13,
drop_matches = "GGGG",
subset_cols = c("CloneFrequency", "SampleID"),
count_col = "CloneCount",
verbose = TRUE
)
```
The function has parameters [`data`, `seq_col`](buildRepSeqNetwork.html#mainargs), [`min_seq_length`, `drop_matches`](buildRepSeqNetwork.html#input_options) and [`subset_cols`](buildRepSeqNetwork.html#subset_cols), all of which behave in the same manner as seen in `buildRepSeqNetwork()`. In addition, the `count_col` parameter can be used to specify a column containing the clone count or UMI count. If specified, observations with `NA` values in this column will be removed from the data.
# `aggregateIdenticalClones()`
`aggregateIdenticalClones()` can be used on bulk AIRR-seq data to aggregate data rows containing the same clone sequence, with the clone counts and clone frequencies being added together. Aggregation can be restricted to being performed only within groups that are defined based on specified grouping variables.
```{r, eval = FALSE}
my_data <- data.frame(
clone_seq = c("ATCG", rep("ACAC", 2), rep("GGGG", 4)),
clone_count = rep(1, 7),
clone_freq = rep(1/7, 7),
time_point = c("t_0", rep(c("t_0", "t_1"), 3)),
subject_id = c(rep(1, 5), rep(2, 2))
)
# group clones by time point and subject ID
data_agg_time_subject <-
aggregateIdenticalClones(my_data,
clone_col = "clone_seq",
count_col = "clone_count",
freq_col = "clone_freq",
grouping_cols = c("subject_id", "time_point")
)
```
# `getNeighborhood()`
`getNeighborhood()` can be used to extract a subset of observations with receptor sequences sufficiently similar to a target sequence.
```{r, eval = FALSE}
nbd <- getNeighborhood(toy_data,
seq_col = "CloneSeq",
target_seq = "GGGGGGGAATTGG"
)
```
# `generateNetworkObjects()`
`generateNetworkObjects()` can be used to construct the minimal output possible from `buildRepSeqNetwork()`. It does not filter the input data, produce plots, compute network properties, perform cluster analysis or save data.
```{r, eval = FALSE}
net <- generateNetworkObjects(toy_data, "CloneSeq")
```
The function has parameters [`data`, `seq_col`](buildRepSeqNetwork.html#mainargs), [`dist_type`, `dist_cutoff` and `drop_isolated_nodes`](buildRepSeqNetwork.html#network_settings), all of which have the same behavior and default values as seen in `buildRepSeqNetwork()`.
# `generateNetworkGraph()`
`generateNetworkGraph()` can be used to generate the network `igraph` from the adjacency matrix.
```{r, eval = FALSE}
net$igraph <- generateNetworkGraph(net$adjacency_matrix)
```
# `generateAdjacencyMatrix()`
`generateAdjacencyMatrix()` can be used to compute the network adjacency matrix from the list of receptor sequences.
```{r, eval = FALSE}
output$adjacency_matrix <- generateAdjacencyMatrix(toy_data$CloneSeq)
# use same settings from original call to buildRepSeqNetwork()
net$adjacency_matrix <- generateAdjacencyMatrix(
net$node_data$CloneSeq,
dist_type = net$details$dist_type,
dist_cutoff = net$details$dist_cutoff,
drop_isolated_nodes = net$details$drop_isolated_nodes
)
```
The parameters `dist_type`, `dist_cutoff` and `drop_isolated_nodes` have the [same behavior and default values](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#network_settings) as in `buildRepSeqNetwork()`.
```{r include=FALSE, eval = FALSE}
# clean up temp directory
file.remove(
file.path(
tempdir(),
c("MyRepSeqNetwork_NodeMetadata.csv",
"MyRepSeqNetwork_ClusterMetadata.csv",
"MyRepSeqNetwork.pdf",
"MyRepSeqNetwork_EdgeList.txt",
"MyRepSeqNetwork_AdjacencyMatrix.mtx",
"MyRepSeqNetwork_Details.rds",
"MyRepSeqNetwork_Plots.rds",
"MyRepSeqNetwork_GraphLayout.txt",
"MyRepSeqNetwork.rds"
)
)
)
```