vignettes/public_clusters.Rmd

---
title: "Searching for Public TCR/BCR Clusters"
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center", 
  out.width = "100%",
  fig.width = 9, fig.height = 7
)
```

# Introduction

The `NAIR` package includes a set of functions that facilitate searching for public TCR/BCR clusters across multiple samples of Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data. 

In this context, a public cluster consists of similar TCR/BCR clones (e.g., those whose CDR3 amino acid sequences differ by at most one amino acid) that are shared across samples (e.g., across individuals or across time points for a single individual).


#### Overview of Process

1. [**Identify prominent clusters within each sample.**](#step1). For each sample, construct the repertoire network and use cluster analysis to partition the network into clusters. From each sample, select clusters based on node count and clone count. 
2. [**Construct global network using the selected clusters.**](#step2) Combine the selected data from step 1 into a single global network. Use cluster analysis to partition the global network into clusters, which are considered as the public clusters.
3. [Perform additional tasks](#step3) such as labeling the global clusters in the visual plot and analyzing individual clusters of interest. 


#### Simulate Data for Demonstration

We simulate some toy data for demonstration.

Our toy data includes 30 samples, each containing 30 observations. 

Some sequences are simulated with a tendency to appear in relatively few samples, while others are simulated with a tendency to appear in many samples. 

```{r }
set.seed(42)
library(NAIR)
data_dir <- tempdir()
dir_input_samples <- file.path(data_dir, "input_samples")
dir.create(dir_input_samples, showWarnings = FALSE)

samples <- 30
sample_size <- 30 # (seqs per sample)          
base_seqs <- c(
  "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
  "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
  "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
  "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
  "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
  "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
  "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
  "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
  "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF", 
  "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF") 

# relative generation probabilities
pgen <- cbind(
  stats::toeplitz(0.6^(0:(sample_size - 1))),
  matrix(1, nrow = samples, ncol = length(base_seqs) - samples))

simulateToyData(    
  samples = samples, 
  sample_size = sample_size,
  prefix_length = 1, 
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, samples), rep(0, samples)),
  affixes = base_seqs, 
  affix_probs = pgen, 
  num_edits = 0,
  output_dir = dir_input_samples, 
  no_return = TRUE
)
```

Each sample's data frame is saved to its own file using the RDS file format. The files are named "`Sample1.rds`", "`Sample2.rds`", etc. A character string containing the directory path is assigned to the R environment variable `dir_input_samples` for later reference.

The first few rows of the data for the first sample appear as follows:

```{r}
# View first few rows of data for sample 1
head(readRDS(file.path(dir_input_samples, "Sample1.rds")))
```


# <a name="step1"></a> Step 1: Identify Prominent Clusters Within Each Sample {.tabset}

First, we use `findPublicClusters()` to search across samples and select clones for inclusion in the global network.

Each sample's repertoire network is constructed individually, and cluster analysis is used to partition each network into clusters. The clusters are then filtered according to node count and clone count based on user-specified criteria. The AIRR-Seq data for the clusters that remain after filtering is saved to files to be used as inputs for [step 2](#step2).

Below, we explain how to use `findPublicClusters()`.


## <a name="input_settings"></a> Input Data for Step 1 {.tabset}

Each sample's AIRR-Seq data must be contained in a separate file, with observations indexed by row, and with the same columns across samples.

#### <a name="files1"></a> File Paths of Sample Data

The `file_list` parameter accepts a character vector containing file paths (or a list containing file paths and connections), where each element corresponds to a file containing a single sample.

```{r}
# create vector of input file paths for step 1 (one per sample)
input_files <- file.path(dir_input_samples, 
                         paste0("Sample", 1:samples, ".rds")
)
head(input_files)
```


#### <a name="input_type"></a> File Format of Sample Data

The file format of the input files is specified by the `input_type` parameter. The supported values are `"rds"`, `"rda"`, `"csv"`, `"csv2"`, `"tsv"` and `"table"`. Depending on the input type, further options are specified with `data_symbols` or `read.args`. 

Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/supplementary.html#load) and to `loadDataFromFileList()` for details and examples.


#### <a name="seq_col"></a> Sequence Column in Sample Data 

The `seq_col` parameter specifies the column containing the TCR/BCR sequences within each sample. It accepts the column name (as a character string) or the column position index. 

#### <a name="count_col"></a> Count Column in Sample Data

The optional `count_col` parameter specifies the column containing the clone count (clonal abundance) within each sample. It accepts the column name (as a character string) or the column position index. If provided, clone counts [will be considered when filtering the clusters](#minimum-clone-count).

#### <a name="sample_ids"></a> Custom Sample IDs (Optional)

Each clone's sample ID is included in the output. By default, these are `"Sample1"`, `"Sample2"`, etc., according to the order in `file_list`. 

The optional `sample_ids` parameter assigns custom sample IDs. It accepts a vector of the same length as `file_list`, where each entry is the corresponding sample ID.

#### Filtering the Sample Data 

The clones from each sample are filtered to remove any irrelevant data. By default, clones with sequences that are less than three characters in length, as well as sequences containing any of the characters `*`, `_` or `|`, will be excluded. The `min_seq_length` and `drop_matches` parameters control the filter settings. Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#input_options) for details.


#### <a name="network_settings"></a> Construction of Sample Networks {.tabset}

The parameters that control the construction of each sample's network are shown below along with their default values.

* `dist_type = "hamming"` 
* `dist_cutoff = 1`
* `drop_isolated_nodes = TRUE`

Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#network_settings) for their meaning and usage. 


#### Clustering Algorithm for Sample Networks

By default, clustering within each sample's network is performed using `igraph::cluster_fast_greedy()`. A different clustering algorithm can be specified using the `cluster_fun` parameter, as described [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/cluster_analysis.html#clustering-algorithm). 


## Filtering the Sample Clusters {.tabset}

The following parameters control the criteria used to select clusters from each sample.

#### Top $n$ Clusters

Within each sample, the $n = 20$ clusters with the greatest node count are automatically selected. The value of $n$ can be adjusted using the `top_n_clusters` parameter.

#### Minimum Node Count

By default, any cluster containing at least ten nodes will be selected This value can be adjusted using the `min_node_count` parameter. 

#### Minimum Clone Count

By default, any cluster with an aggregate clone count (summed over all nodes) of at least 100 will be selected. This value can be adjusted using the `min_clone_count` parameter. 

This criterion only applies if [clone counts are provided](#count_col) using the `count_col` parameter.


## Output Settings for Step 1 {.tabset}

`findPublicClusters()` does not return any direct output. Instead, data for the selected clusters is saved to files to be used as inputs in step 2. The following parameters control the output settings.

#### <a name="subset_cols"></a> Variables to Keep From Sample Data 

By default, the output includes all variables from the original sample data. These variables can be [used later as metadata in visualizations](#vis) of the global network.

To keep only a subset of the original variables, specify the variables to keep using the `subset_cols` parameter, which accepts a character vector of column names or a vector of column indices. The sequence column is always included.


#### <a name="dir_out1"></a> Output Directory for Step 1

The `output_dir` parameter specifies the output directory. It accepts a character string containing the directory path. The directory will be created if it does not exist.

```{r}
# create output directory path for step 1
dir_filtered_samples <- file.path(data_dir, "filtered_samples")
```


#### Output File Format for Step 1

By default, each file is saved as an RDS file. This can be changed using the `output_type` parameter. Other accepted values are `"rda"` and `"csv"`.


#### <a name="save_unfiltered"></a> Saving Full Networks for Each Sample (Optional) 

By default, `findPublicClusters()` saves data only for the selected clusters from each sample. If desired, data for each sample's entire network can also be saved by passing a directory path to the `output_dir_unfiltered` parameter. The full network data for each sample is the [output returned by `buildNet()`](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#base-output). The `output_type_unfiltered` parameter specifies the file format in the same manner described [here]((https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#output-file-format) for the `output_type` parameter of `buildNet()`.

#### Visualization of Sample Networks (Optional) 

By default, `findPublicClusters()` does not produce visual plots. The visualization of interest is of the global network in [step 2](#step2).

A plot of each sample's full network can be produced using `plots = TRUE`. Specifying `print_plots = TRUE` prints these to the R plotting window. The plots will be saved if [`output_dir_unfiltered`](#save_unfiltered) is non-null. By default, the nodes in each plot are colored according to cluster membership. A different variable can be specified using the `color_nodes_by` parameter as detailed [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html#node-colors) (or  [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html#generating-multiple-plots) for multiple variables).

Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html) to learn about other parameters for customizing the visualization.


## <a name="demo1"></a> Demonstration, Step 1

```{r}
findPublicClusters(input_files, 
                   input_type = "rds",
                   seq_col = "CloneSeq", 
                   count_col = "CloneCount",
                   min_seq_length = NULL, 
                   drop_matches = NULL,
                   top_n_clusters = 3, 
                   min_node_count = 5, 
                   min_clone_count = 15000,
                   output_dir = dir_filtered_samples
)
```

<a name="filtered_node_data"></a> 
<a name="node_meta"></a>

Two new directories are created within the specified output directory:

```{r}
list.files(dir_filtered_samples)
```

These directories contain cluster-level and node-level metadata, respectively, for the selected clusters from each sample. We require only the node metadata for step 2.

```{r}
head(list.files(file.path(dir_filtered_samples, "node_meta_data")))
```
 

# <a name="step2"></a> Step 2: Global Network of Public Clusters {.tabset}

`buildPublicClusterNetwork()` combines the selected clusters from all samples into a single global network, where a new round of cluster analysis is performed to partition the global network into clusters.

## Input Data for Step 2

The input files for `buildPublicClusterNetwork()` are the [node metadata files](#demo1) from the output of step 1. Each file contains data for one sample.

#### <a name="files2"></a> File Paths of Node Metadata From Step 1

The `file_list` parameter accepts a character vector of file paths for the input files, which are located in the `node_meta_data` [subdirectory](#demo1) of the [output directory from step 1](#dir_out1). 

```{r}
# Directory of node metadata from step 1
dir_filtered_samples_node <- 
  file.path(dir_filtered_samples, "node_meta_data")

# Vector of file paths to node metadata from step 1
files_filtered_samples_node <- 
  list.files(dir_filtered_samples_node, full.names = TRUE)
```


#### File Format of Node Metadata From Step 1

If `findPublicClusters()` was called with a non-default value of `output_type`, this value must be passed to the `input_type` parameter of `buildPublicClusterNetwork()`. 


#### Argument Values From Step 1

The `seq_col` and `count_col` parameters specify the input data columns containing receptor sequences and clone counts, respectively. Users should pass the same argument values to these parameters as they did when calling `findPublicClusters()` during step 1.


## Global Network Analysis {.tabset}

#### Network Construction

The parameters that control construction of the global network are shown below along with their default values.

* `dist_type = "hamming"` 
* `dist_cutoff = 1`
* `drop_isolated_nodes = FALSE`

Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#network_settings) for their meaning and usage. 


#### Clustering Algorithm for Global Network

A clustering algorithm is used to partition the global network graph into densely-connected subgraphs (clusters). Each cluster can contain clones from different samples.

By default, clustering within is performed using `igraph::cluster_fast_greedy()`. A different clustering algorithm can be specified using the `cluster_fun` parameter, as described [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/cluster_analysis.html#clustering-algorithm). 


#### <a name="vis"></a> Visualization of Global Network

By default, `buildPublicClusterNetwork()` produces a visual plot of the global network graph with the nodes colored according to sample ID. 

The `color_nodes_by` parameter specifies the variable used to color the nodes. It accepts a character string naming a variable [kept](#subset_cols) from the original sample data or one of the node-level network properties [listed here](#node_variables).  

`color_nodes_by` also accepts a vector naming multiple variables. [One plot will be created for each entry, with the nodes colored according to the respective  variable](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html#generating-multiple-plots). 

Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/network_visualization.html) to learn about other parameters for customizing the visualization.


## Output Settings for Step 2 {.tabset}

`buildPublicClusterNetwork()` returns a list containing plots, metadata and other network objects, with the [same structure](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#base-output) as the output of `buildRepSeqNetwork()`. 

The output can be saved to a local directory using the parameters `output_dir`, `output_type` and `output_name`, whose usage is described [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#output-directory). 


## <a name="demo2"></a> Demonstration, Step 2

```{r}
public_clusters <- buildPublicClusterNetwork(files_filtered_samples_node,
                                             seq_col = "CloneSeq", 
                                             count_col = "CloneCount",
                                             size_nodes_by = 1,
                                             print_plots = TRUE
)
```
The returned list contains the following elements:

```{r}
names(public_clusters)
```

The elements are described [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#base-output). We inspect the node metadata and cluster metadata.


#### <a name="node_variables"></a> Node Metadata for Global Network

The list element `node_data` is a data frame containing metadata for the network nodes, where each row represents a distinct clone corresponding to a node in the global network graph.

```{r}
nrow(public_clusters$node_data)
```


```{r}
# variables in the node-level metadata
names(public_clusters$node_data)
```

All variables [kept from the original sample data during step 1](#subset_cols) are present. The variable `ClusterIDPublic` contains the global cluster membership, while `ClusterIDInSample` contains the in-sample cluster membership. [Node-level network properties](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#node-level-network-properties) are also present. Those beginning with `SampleLevel` correspond to the sample networks, while those beginning with `Public` correspond to the global network. 

```{r}
# View some of the node metadata for the global network
view_cols <- c("CloneSeq", "SampleID", "ClusterIDInSample", "ClusterIDPublic")
public_clusters$node_data[49:54 , view_cols]
```
The row names indicate the original row ID of each clone within its sample's data.


#### <a name="cluster_variables"></a> Cluster Metadata for Global Network

The list element `cluster_data` is a data frame containing metadata for the public clusters, where each row corresponds to a cluster in the global network.

```{r}
# variables in the cluster-level metadata
names(public_clusters$cluster_data)
```

Refer [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#cluster-level-network-properties) for more information about the cluster-level network properties.

```{r}
# View some of the cluster metadata for the global network
head(public_clusters$cluster_data[, 1:6])
```


# <a name="step3"></a> Step 3: Additional Tasks {.tabset}

After calling `buildPublicClusterNetwork()`, the following tasks can be performed using the returned output.


## Labeling the Global Clusters

In order to more easily cross-reference the clusters in the visual plot with the clusters in the data, we can label the clusters with their ID numbers. 

This is accomplished using `labelClusters()` as described [here](https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/cluster_analysis.html#labeling-clusters). 

Below, we label the six largest clusters in the plot with their cluster IDs. The node metadata variable `ClusterIDPublic` contains the global cluster membership, so we pass its name to the `cluster_id_col` parameter. 

```{r}
public_clusters <- 
  labelClusters(public_clusters,
                top_n_clusters = 6,
                cluster_id_col = "ClusterIDPublic",
                size = 7
  )

public_clusters$plots[[1]]
```

## Focusing on Individual Clusters

To focus on a particular cluster, we can subset the node metadata based on the  value of `ClusterIDPublic` and use `buildNet()` to produce plots of the cluster's graph.

```{r}
# focus on cluster 1
buildNet(
  public_clusters$node_data[public_clusters$node_data$ClusterIDPublic == 1, ],
  "CloneSeq", 
  color_nodes_by = "CloneSeq", 
  size_nodes_by = 3, 
  output_name = "Cluster 1",
  print_plots = TRUE
)
```

```{r}
# focus on cluster 6
buildNet(
  public_clusters$node_data[public_clusters$node_data$ClusterIDPublic == 6, ],
  "CloneSeq", 
  color_nodes_by = "CloneSeq", 
  color_scheme = "plasma", 
  size_nodes_by = 4, 
  output_name = "Cluster 6",
  print_plots = TRUE
)
```


```{r include=FALSE}
# clean up temp directory
file.remove(
  file.path(dir_input_samples,
            paste0("Sample", 1:samples, ".rds")
  )
)
unlink(
  c(dir_filtered_samples_node, 
    file.path(dir_filtered_samples, "cluster_meta_data")
  ),
  recursive = TRUE
)
```