-
Notifications
You must be signed in to change notification settings - Fork 3
/
public_clusters.Rmd
455 lines (283 loc) · 20.6 KB
/
public_clusters.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
---
title: "Searching for Public TCR/BCR Clusters"
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.align = "center",
out.width = "100%",
fig.width = 9, fig.height = 7
)
```
# Introduction
The `NAIR` package includes a set of functions that facilitate searching for public TCR/BCR clusters across multiple samples of Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data.
In this context, a public cluster consists of similar TCR/BCR clones (e.g., those whose CDR3 amino acid sequences differ by at most one amino acid) that are shared across samples (e.g., across individuals or across time points for a single individual).
#### Overview of Process
1. [**Identify prominent clusters within each sample.**](#step1). For each sample, construct the repertoire network and use cluster analysis to partition the network into clusters. From each sample, select clusters based on node count and clone count.
2. [**Construct global network using the selected clusters.**](#step2) Combine the selected data from step 1 into a single global network. Use cluster analysis to partition the global network into clusters, which are considered as the public clusters.
3. [Perform additional tasks](#step3) such as labeling the global clusters in the visual plot and analyzing individual clusters of interest.
#### Simulate Data for Demonstration
We simulate some toy data for demonstration.
Our toy data includes 30 samples, each containing 30 observations.
Some sequences are simulated with a tendency to appear in relatively few samples, while others are simulated with a tendency to appear in many samples.
```{r }
set.seed(42)
library(NAIR)
data_dir <- tempdir()
dir_input_samples <- file.path(data_dir, "input_samples")
dir.create(dir_input_samples, showWarnings = FALSE)
samples <- 30
sample_size <- 30 # (seqs per sample)
base_seqs <- c(
"CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
"CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
"CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
"CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
"CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
"CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
"CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
"CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
"CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF",
"CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF")
# relative generation probabilities
pgen <- cbind(
stats::toeplitz(0.6^(0:(sample_size - 1))),
matrix(1, nrow = samples, ncol = length(base_seqs) - samples))
simulateToyData(
samples = samples,
sample_size = sample_size,
prefix_length = 1,
prefix_chars = c("", ""),
prefix_probs = cbind(rep(1, samples), rep(0, samples)),
affixes = base_seqs,
affix_probs = pgen,
num_edits = 0,
output_dir = dir_input_samples,
no_return = TRUE
)
```
Each sample's data frame is saved to its own file using the RDS file format. The files are named "`Sample1.rds`", "`Sample2.rds`", etc. A character string containing the directory path is assigned to the R environment variable `dir_input_samples` for later reference.
The first few rows of the data for the first sample appear as follows:
```{r}
# View first few rows of data for sample 1
head(readRDS(file.path(dir_input_samples, "Sample1.rds")))
```
# <a name="step1"></a> Step 1: Identify Prominent Clusters Within Each Sample {.tabset}
First, we use `findPublicClusters()` to search across samples and select clones for inclusion in the global network.
Each sample's repertoire network is constructed individually, and cluster analysis is used to partition each network into clusters. The clusters are then filtered according to node count and clone count based on user-specified criteria. The AIRR-Seq data for the clusters that remain after filtering is saved to files to be used as inputs for [step 2](#step2).
Below, we explain how to use `findPublicClusters()`.
## <a name="input_settings"></a> Input Data for Step 1 {.tabset}
Each sample's AIRR-Seq data must be contained in a separate file, with observations indexed by row, and with the same columns across samples.
#### <a name="files1"></a> File Paths of Sample Data
The `file_list` parameter accepts a character vector containing file paths (or a list containing file paths and connections), where each element corresponds to a file containing a single sample.
```{r}
# create vector of input file paths for step 1 (one per sample)
input_files <- file.path(dir_input_samples,
paste0("Sample", 1:samples, ".rds")
)
head(input_files)
```
#### <a name="input_type"></a> File Format of Sample Data
The file format of the input files is specified by the `input_type` parameter. The supported values are `"rds"`, `"rda"`, `"csv"`, `"csv2"`, `"tsv"` and `"table"`. Depending on the input type, further options are specified with `data_symbols` or `read.args`.
Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/supplementary.html#load) and to `loadDataFromFileList()` for details and examples.
#### <a name="seq_col"></a> Sequence Column in Sample Data
The `seq_col` parameter specifies the column containing the TCR/BCR sequences within each sample. It accepts the column name (as a character string) or the column position index.
#### <a name="count_col"></a> Count Column in Sample Data
The optional `count_col` parameter specifies the column containing the clone count (clonal abundance) within each sample. It accepts the column name (as a character string) or the column position index. If provided, clone counts [will be considered when filtering the clusters](#minimum-clone-count).
#### <a name="sample_ids"></a> Custom Sample IDs (Optional)
Each clone's sample ID is included in the output. By default, these are `"Sample1"`, `"Sample2"`, etc., according to the order in `file_list`.
The optional `sample_ids` parameter assigns custom sample IDs. It accepts a vector of the same length as `file_list`, where each entry is the corresponding sample ID.
#### Filtering the Sample Data
The clones from each sample are filtered to remove any irrelevant data. By default, clones with sequences that are less than three characters in length, as well as sequences containing any of the characters `*`, `_` or `|`, will be excluded. The `min_seq_length` and `drop_matches` parameters control the filter settings. Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#input_options) for details.
#### <a name="network_settings"></a> Construction of Sample Networks {.tabset}
The parameters that control the construction of each sample's network are shown below along with their default values.
* `dist_type = "hamming"`
* `dist_cutoff = 1`
* `drop_isolated_nodes = TRUE`
Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#network_settings) for their meaning and usage.
#### Clustering Algorithm for Sample Networks
By default, clustering within each sample's network is performed using `igraph::cluster_fast_greedy()`. A different clustering algorithm can be specified using the `cluster_fun` parameter, as described [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/cluster_analysis.html#clustering-algorithm).
## Filtering the Sample Clusters {.tabset}
The following parameters control the criteria used to select clusters from each sample.
#### Top $n$ Clusters
Within each sample, the $n = 20$ clusters with the greatest node count are automatically selected. The value of $n$ can be adjusted using the `top_n_clusters` parameter.
#### Minimum Node Count
By default, any cluster containing at least ten nodes will be selected This value can be adjusted using the `min_node_count` parameter.
#### Minimum Clone Count
By default, any cluster with an aggregate clone count (summed over all nodes) of at least 100 will be selected. This value can be adjusted using the `min_clone_count` parameter.
This criterion only applies if [clone counts are provided](#count_col) using the `count_col` parameter.
## Output Settings for Step 1 {.tabset}
`findPublicClusters()` does not return any direct output. Instead, data for the selected clusters is saved to files to be used as inputs in step 2. The following parameters control the output settings.
#### <a name="subset_cols"></a> Variables to Keep From Sample Data
By default, the output includes all variables from the original sample data. These variables can be [used later as metadata in visualizations](#vis) of the global network.
To keep only a subset of the original variables, specify the variables to keep using the `subset_cols` parameter, which accepts a character vector of column names or a vector of column indices. The sequence column is always included.
#### <a name="dir_out1"></a> Output Directory for Step 1
The `output_dir` parameter specifies the output directory. It accepts a character string containing the directory path. The directory will be created if it does not exist.
```{r}
# create output directory path for step 1
dir_filtered_samples <- file.path(data_dir, "filtered_samples")
```
#### Output File Format for Step 1
By default, each file is saved as an RDS file. This can be changed using the `output_type` parameter. Other accepted values are `"rda"` and `"csv"`.
#### <a name="save_unfiltered"></a> Saving Full Networks for Each Sample (Optional)
By default, `findPublicClusters()` saves data only for the selected clusters from each sample. If desired, data for each sample's entire network can also be saved by passing a directory path to the `output_dir_unfiltered` parameter. The full network data for each sample is the [output returned by `buildNet()`](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#base-output). The `output_type_unfiltered` parameter specifies the file format in the same manner described [here]((https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#output-file-format) for the `output_type` parameter of `buildNet()`.
#### Visualization of Sample Networks (Optional)
By default, `findPublicClusters()` does not produce visual plots. The visualization of interest is of the global network in [step 2](#step2).
A plot of each sample's full network can be produced using `plots = TRUE`. Specifying `print_plots = TRUE` prints these to the R plotting window. The plots will be saved if [`output_dir_unfiltered`](#save_unfiltered) is non-null. By default, the nodes in each plot are colored according to cluster membership. A different variable can be specified using the `color_nodes_by` parameter as detailed [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html#node-colors) (or [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html#generating-multiple-plots) for multiple variables).
Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html) to learn about other parameters for customizing the visualization.
## <a name="demo1"></a> Demonstration, Step 1
```{r}
findPublicClusters(input_files,
input_type = "rds",
seq_col = "CloneSeq",
count_col = "CloneCount",
min_seq_length = NULL,
drop_matches = NULL,
top_n_clusters = 3,
min_node_count = 5,
min_clone_count = 15000,
output_dir = dir_filtered_samples
)
```
<a name="filtered_node_data"></a>
<a name="node_meta"></a>
Two new directories are created within the specified output directory:
```{r}
list.files(dir_filtered_samples)
```
These directories contain cluster-level and node-level metadata, respectively, for the selected clusters from each sample. We require only the node metadata for step 2.
```{r}
head(list.files(file.path(dir_filtered_samples, "node_meta_data")))
```
# <a name="step2"></a> Step 2: Global Network of Public Clusters {.tabset}
`buildPublicClusterNetwork()` combines the selected clusters from all samples into a single global network, where a new round of cluster analysis is performed to partition the global network into clusters.
## Input Data for Step 2
The input files for `buildPublicClusterNetwork()` are the [node metadata files](#demo1) from the output of step 1. Each file contains data for one sample.
#### <a name="files2"></a> File Paths of Node Metadata From Step 1
The `file_list` parameter accepts a character vector of file paths for the input files, which are located in the `node_meta_data` [subdirectory](#demo1) of the [output directory from step 1](#dir_out1).
```{r}
# Directory of node metadata from step 1
dir_filtered_samples_node <-
file.path(dir_filtered_samples, "node_meta_data")
# Vector of file paths to node metadata from step 1
files_filtered_samples_node <-
list.files(dir_filtered_samples_node, full.names = TRUE)
```
#### File Format of Node Metadata From Step 1
If `findPublicClusters()` was called with a non-default value of `output_type`, this value must be passed to the `input_type` parameter of `buildPublicClusterNetwork()`.
#### Argument Values From Step 1
The `seq_col` and `count_col` parameters specify the input data columns containing receptor sequences and clone counts, respectively. Users should pass the same argument values to these parameters as they did when calling `findPublicClusters()` during step 1.
## Global Network Analysis {.tabset}
#### Network Construction
The parameters that control construction of the global network are shown below along with their default values.
* `dist_type = "hamming"`
* `dist_cutoff = 1`
* `drop_isolated_nodes = FALSE`
Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#network_settings) for their meaning and usage.
#### Clustering Algorithm for Global Network
A clustering algorithm is used to partition the global network graph into densely-connected subgraphs (clusters). Each cluster can contain clones from different samples.
By default, clustering within is performed using `igraph::cluster_fast_greedy()`. A different clustering algorithm can be specified using the `cluster_fun` parameter, as described [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/cluster_analysis.html#clustering-algorithm).
#### <a name="vis"></a> Visualization of Global Network
By default, `buildPublicClusterNetwork()` produces a visual plot of the global network graph with the nodes colored according to sample ID.
The `color_nodes_by` parameter specifies the variable used to color the nodes. It accepts a character string naming a variable [kept](#subset_cols) from the original sample data or one of the node-level network properties [listed here](#node_variables).
`color_nodes_by` also accepts a vector naming multiple variables. [One plot will be created for each entry, with the nodes colored according to the respective variable](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html#generating-multiple-plots).
Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html) to learn about other parameters for customizing the visualization.
## Output Settings for Step 2 {.tabset}
`buildPublicClusterNetwork()` returns a list containing plots, metadata and other network objects, with the [same structure](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#base-output) as the output of `buildRepSeqNetwork()`.
The output can be saved to a local directory using the parameters `output_dir`, `output_type` and `output_name`, whose usage is described [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#output-directory).
## <a name="demo2"></a> Demonstration, Step 2
```{r}
public_clusters <- buildPublicClusterNetwork(files_filtered_samples_node,
seq_col = "CloneSeq",
count_col = "CloneCount",
size_nodes_by = 1,
print_plots = TRUE
)
```
The returned list contains the following elements:
```{r}
names(public_clusters)
```
The elements are described [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#base-output). We inspect the node metadata and cluster metadata.
#### <a name="node_variables"></a> Node Metadata for Global Network
The list element `node_data` is a data frame containing metadata for the network nodes, where each row represents a distinct clone corresponding to a node in the global network graph.
```{r}
nrow(public_clusters$node_data)
```
```{r}
# variables in the node-level metadata
names(public_clusters$node_data)
```
All variables [kept from the original sample data during step 1](#subset_cols) are present. The variable `ClusterIDPublic` contains the global cluster membership, while `ClusterIDInSample` contains the in-sample cluster membership. [Node-level network properties](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#node-level-network-properties) are also present. Those beginning with `SampleLevel` correspond to the sample networks, while those beginning with `Public` correspond to the global network.
```{r}
# View some of the node metadata for the global network
view_cols <- c("CloneSeq", "SampleID", "ClusterIDInSample", "ClusterIDPublic")
public_clusters$node_data[49:54 , view_cols]
```
The row names indicate the original row ID of each clone within its sample's data.
#### <a name="cluster_variables"></a> Cluster Metadata for Global Network
The list element `cluster_data` is a data frame containing metadata for the public clusters, where each row corresponds to a cluster in the global network.
```{r}
# variables in the cluster-level metadata
names(public_clusters$cluster_data)
```
Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#cluster-level-network-properties) for more information about the cluster-level network properties.
```{r}
# View some of the cluster metadata for the global network
head(public_clusters$cluster_data[, 1:6])
```
# <a name="step3"></a> Step 3: Additional Tasks {.tabset}
After calling `buildPublicClusterNetwork()`, the following tasks can be performed using the returned output.
## Labeling the Global Clusters
In order to more easily cross-reference the clusters in the visual plot with the clusters in the data, we can label the clusters with their ID numbers.
This is accomplished using `labelClusters()` as described [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/cluster_analysis.html#labeling-clusters).
Below, we label the six largest clusters in the plot with their cluster IDs. The node metadata variable `ClusterIDPublic` contains the global cluster membership, so we pass its name to the `cluster_id_col` parameter.
```{r}
public_clusters <-
labelClusters(public_clusters,
top_n_clusters = 6,
cluster_id_col = "ClusterIDPublic",
size = 7
)
public_clusters$plots[[1]]
```
## Focusing on Individual Clusters
To focus on a particular cluster, we can subset the node metadata based on the value of `ClusterIDPublic` and use `buildNet()` to produce plots of the cluster's graph.
```{r}
# focus on cluster 1
buildNet(
public_clusters$node_data[public_clusters$node_data$ClusterIDPublic == 1, ],
"CloneSeq",
color_nodes_by = "CloneSeq",
size_nodes_by = 3,
output_name = "Cluster 1",
print_plots = TRUE
)
```
```{r}
# focus on cluster 6
buildNet(
public_clusters$node_data[public_clusters$node_data$ClusterIDPublic == 6, ],
"CloneSeq",
color_nodes_by = "CloneSeq",
color_scheme = "plasma",
size_nodes_by = 4,
output_name = "Cluster 6",
print_plots = TRUE
)
```
```{r include=FALSE}
# clean up temp directory
file.remove(
file.path(dir_input_samples,
paste0("Sample", 1:samples, ".rds")
)
)
unlink(
c(dir_filtered_samples_node,
file.path(dir_filtered_samples, "cluster_meta_data")
),
recursive = TRUE
)
```