---
title: "karyotapR Basic Workflow"
date: 'Compiled on `r format(Sys.Date(), "%B %d, %Y")`'
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  tidy = TRUE,
  tidy.opts = list(width.cutoff = 80)
)
```

# Setup

```{r setup, message=FALSE, warning=FALSE}
library(karyotapR)
library(forcats)
set.seed(2023) # seed is set to ensure tutorial is reproducible
```

## Data Download

This guide uses the cell mixture experiment from the KaryoTap publication ([Mays, 2023](https://www.biorxiv.org/content/10.1101/2023.09.08.555746v1)). The Tapestri Pipeline `.h5` output file is available on [Zenodo](https://doi.org/10.5281/zenodo.8305841) and can be downloaded with `curl::curl_download()` or directly from the website.

```{r, eval = FALSE}
curl::curl_download(
  url = "https://zenodo.org/record/8305841/files/tapestri-experiment01-panelv1.h5?download=1",
  destfile = "./tap-cellmixture.h5",
  quiet = FALSE
)
```

# Basic Usage

## Data Import

The cell mixture dataset is imported from the `.h5` file generated by the Tapestri Pipeline using `createTapestriExperiment()`, which returns a new `TapestriExperiment` object. This dataset comprises a mixture of 5 cell lines with differing karyotypes and was processed on the Tapestri instrument for single-cell DNA sequencing using Custom Oligo Panel 261 (a.k.a. Panel Version 1). Setting the `panel.id` parameter automatically assigns the correct probes to the `grnaProbe` and `barcodeProbe` slots in the object, which are used for special applications. Several useful processes run automatically on import, as indicated in the status messages. For example, cytobands are added to the probe metadata, chromosome Y probes are detected and moved to a dedicated slot in the object (although no chrY probes exist in this panel), and any special probes that do not target the endogenous human genome are moved to their appropriate slots.

```{r, eval=FALSE}
cellmix <- createTapestriExperiment("./tap-cellmixture.h5", panel.id = "CO261")

# ── Loading Tapestri Experiment ──────────────────

# • Sample Name: Teresa_s_cell_line_mix

# • Pipeline Panel Name: CO261_NYU_Davoli_03102021_hg19

# • Pipeline Version: 2.0.2

# • Date Created: 2021-09-15

# 
# ── Metrics ──
# 

# • Number of Cells: 3555

# • Number of Probes: 330

# • Mean Reads per Cell per Probe: 89.22

# 
# ── Notes ──
# 
# ℹ Adding cytobands from hg19.

# ℹ ChrY probe ID(s) not found in TapestriExperiment object.

# ℹ No non-genomic probe IDs found.

```

```{r, include=FALSE}
cellmix <- readRDS(file = "./data/exp1.new.RDS")
```


Printing the object displays a summary of the data it contains. The `TapestriExperiment` class is built on top of the [`SingleCellExperiment`](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) and [`SummarizedExperiment`](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) classes, so it inherits their basic functionality and interface. `colData()` and `rowData()` return the metadata for the cells and probes/amplicons, respectively.

```{r}
cellmix
```

```{r}
colData(cellmix) # cell metadata
```

```{r}
rowData(cellmix) # probe metadata
```

## Allele Frequency Clustering

We cluster on allele frequency to partition different cell lines represented in the experiment.
First, we run Principal Components Analysis (PCA) and use the knee plot to identify the principal components (PCs) accounting for the most variation in the dataset.

```{r pca}
cellmix <- runPCA(cellmix)
PCAKneePlot(cellmix)
```
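
For intuition about what a knee plot summarizes, here is a minimal base R sketch on simulated data rather than the cell mixture dataset. It uses `stats::prcomp()` as a stand-in and does not reproduce the internals of `runPCA()` or `PCAKneePlot()`; the matrix `toy.af` and its dimensions are hypothetical. The point is only that the proportion of variance explained drops sharply after the informative PCs.

```r
# Toy data (hypothetical): 60 "cells" x 6 "variants".
# Two groups of cells differ only in the first two variants.
set.seed(1)
group <- rep(c(0, 1), each = 30)
toy.af <- matrix(rnorm(60 * 6), nrow = 60, ncol = 6)
toy.af[, 1] <- toy.af[, 1] + 10 * group
toy.af[, 2] <- toy.af[, 2] + 10 * group

# PCA on the centered matrix; sdev^2 is the variance captured by each PC
pca <- prcomp(toy.af, center = TRUE, scale. = FALSE)
var.explained <- pca$sdev^2 / sum(pca$sdev^2)

# A knee plot shows these proportions against the PC index
data.frame(PC = seq_along(var.explained),
           proportion = round(var.explained, 3))
```

In this toy case nearly all of the variance sits in the first PC, so the "knee" falls right after it. With real data, the PCs before the knee are the ones worth carrying forward.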

Next, we run UMAP on the top PCs (here, the first four) to embed the cells into two dimensions and plot the result.

```{r umap1}
cellmix <- runUMAP(cellmix, pca.dims = 1:4)
reducedDimPlot(cellmix, dim.reduction = "umap")
```

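Next, we partition the data into clusters using the DBSCAN method. The `eps` parameter sets the neighborhood radius and so adjusts the granularity of the clustering: smaller values split the data into more clusters, while larger values merge neighboring ones. We can then update the UMAP plot with the clusters.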

```{r clustering}
cellmix <- runClustering(cellmix, eps = 0.9)
reducedDimPlot(cellmix, dim.reduction = "umap", group.label = "cluster")
```
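
To build intuition for how `eps` controls granularity, the following base R sketch links points that lie within `eps` of each other and counts the resulting groups. This is not how `runClustering()` is implemented. It is a simplified stand-in: single-linkage hierarchical clustering cut at height `eps`, which matches DBSCAN only in the special case where every point counts as a core point. The toy coordinates and the helper `count.clusters()` are hypothetical.

```r
# Six toy points on a line; gaps between neighbors: 0.5, 0.5, 2.0, 0.5, 6.5
x <- c(0, 0.5, 1, 3, 3.5, 10)

# Single linkage merges groups whose closest members are nearest first
tree <- hclust(dist(x), method = "single")

# Hypothetical helper: number of groups when points within `eps` are linked
count.clusters <- function(eps) length(unique(cutree(tree, h = eps)))

n.clusters <- sapply(c(0.4, 0.6, 2.5), count.clusters)
n.clusters
```

With these points, `eps` values of 0.4, 0.6, and 2.5 give 6, 3, and 2 groups respectively, which is the same trade-off to keep in mind when tuning `eps` on the UMAP embedding.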

As expected, we have 5 major clusters corresponding to the 5 cell lines in the sequencing run, with the smaller clusters representing doublets (i.e., two cells sequenced together in one droplet). We can remove the doublets by selecting the cells assigned to clusters 1-5 (clusters are ordered and named by descending size) and using them to subset the "columns" of the object into a new object. This is done here using a logical vector, but a character vector of cell barcodes can be used as well.

```{r}
cellmix.subset <- cellmix[, colData(cellmix)$cluster %in% 1:5]
```
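
The equivalence between logical and barcode-based column subsetting can be sketched with a plain matrix standing in for the `TapestriExperiment` object. The barcodes, probe names, and cluster assignments below are hypothetical, and the sketch assumes that the columns are named by cell barcode.

```r
# Toy stand-in: 3 probes (rows) x 4 cells (columns), columns named by barcode
toy <- matrix(1:12, nrow = 3,
              dimnames = list(paste0("probe", 1:3),
                              c("AAAC", "AACG", "ACGT", "CGTA")))

# Hypothetical cluster assignments; cluster 6 plays the role of a doublet cluster
cluster <- factor(c(1, 6, 2, 1))

# Option 1: logical vector, one entry per column
keep.lgl <- cluster %in% 1:5
by.logical <- toy[, keep.lgl]

# Option 2: character vector of the barcodes to keep
keep.bcs <- colnames(toy)[keep.lgl]
by.barcode <- toy[, keep.bcs]

identical(by.logical, by.barcode)
```

Both forms select the same columns; the barcode form is convenient when the list of cells to keep comes from outside the object.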

We'll rename the cluster labels by renaming the factor levels of the "cluster" column in the `colData` slot, print an updated plot, and count the number of cells in each cluster.

```{r}
colData(cellmix.subset)$cluster <- 
    forcats::fct_recode(colData(cellmix.subset)$cluster, 
                        cellline1 = "1", 
                        cellline2 = "2", 
                        cellline3 = "3", 
                        cellline4 = "4", 
                        cellline5 = "5")
reducedDimPlot(cellmix.subset, 
               dim.reduction = "umap", 
               group.label = "cluster")
forcats::fct_count(colData(cellmix.subset)$cluster)
```

## Copy Number Calling 

The KaryoTap method works best with a reference population where the copy number for each chromosome arm is known. Here we use RPE1 cells, which are diploid (2 copies) except for a third copy of the chromosome 10q arm. We know from the KaryoTap preprint that "cellline2" corresponds to the RPE1 cells.

Here we normalize the read counts in the object and calculate a copy number score relative to "cellline2". `generateControlCopyNumberTemplate()` creates a data frame that indicates the copy number of each chromosome arm in the reference population. This data frame is later passed to the `control.copy.number` argument, which gives the cluster label and copy number value to normalize each chromosome arm to. The entry for chr10q has to be changed to 3.


```{r norm}
cellmix.subset <- calcNormCounts(cellmix.subset)
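# Template: 2 copies of every arm in the reference population (RPE1, "cellline2")
control.copy.number <- 
    generateControlCopyNumberTemplate(cellmix.subset, 
                                      sample.feature.label = "cellline2", 
                                      copy.number = 2)
# RPE1 carries a third copy of chr10q
control.copy.number["chr10q", "copy.number"] <- 3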
```

The `calcCopyNumber()` function will throw an error if the median normalized count for any probe in the reference population is zero, since this would otherwise result in a division by zero. These probes need to be removed before moving forward.

```{r copy number calc error}
try(cellmix.subset <- calcCopyNumber(cellmix.subset, 
                                     control.copy.number = control.copy.number, 
                                     sample.feature = "cluster"))
```

```{r copy number calc}
probes.to.remove <- c("AMPL158845", "AMPL147043", "AMPL147154", 
                      "AMPL159975", "AMPL147293", "AMPL113086", 
                      "AMPL147323", "AMPL158390", "AMPL158655")
cellmix.subset <- cellmix.subset[!rowData(cellmix.subset)$probe.id %in% probes.to.remove, ]

cellmix.subset <- calcCopyNumber(cellmix.subset, 
                                 control.copy.number = control.copy.number, 
                                 sample.feature = "cluster")
```
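
The probes removed above are listed by hand. As an illustration of the underlying check, the base R sketch below finds probes whose reference median is zero in a toy matrix and then computes a score for the remaining probes. It assumes the score is the normalized count divided by the reference median and scaled by the reference copy number, which is a simplification and not necessarily the exact calculation in `calcCopyNumber()`. All values and names are hypothetical; with a real object, the matrix would come from the `normcounts` assay.

```r
# Toy normalized counts (hypothetical): 4 probes x 6 cells
norm.counts <- matrix(c(
  1.0, 1.2, 0.9, 1.5, 1.4, 1.6,
  0.0, 0.0, 0.0, 0.8, 0.9, 0.7,
  2.0, 2.1, 1.9, 3.0, 3.1, 2.9,
  0.0, 0.5, 0.0, 1.0, 1.1, 0.9
), nrow = 4, byrow = TRUE,
dimnames = list(paste0("probe", 1:4), paste0("cell", 1:6)))
cluster <- c("ref", "ref", "ref", "other", "other", "other")

# Median normalized count per probe within the reference population
ref.median <- apply(norm.counts[, cluster == "ref"], 1, median)

# Probes that would cause a division by zero
zero.median.probes <- names(ref.median)[ref.median == 0]
zero.median.probes

# Assumed scoring: ratio to the reference median, scaled by the
# reference copy number (2 for every probe in this toy example)
keep <- ref.median > 0
score <- norm.counts[keep, ] / ref.median[keep] * 2
round(score, 2)
```

By construction, the reference cells have a median score of 2 for every retained probe, and the other cells are scored relative to that baseline.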

The `calcNormCounts()` and `calcCopyNumber()` functions take the count matrix in the main `assay` slot of the `TapestriExperiment`, perform their operation, and save the result to new `assay` slots, which can be accessed using `assay()` or listed using `assays()`. Assays in the `SingleCellExperiment` sense are sets of measurements for the same set of samples (columns) and features (rows). For the `counts`, `normcounts`, and `copyNumber` assays, the features (probes) and samples (cell barcodes) are the same.

```{r}
assays(cellmix.subset)
```

`calcSmoothCopyNumber()` produces one smoothed copy number score per cell for each chromosome and for each chromosome arm. Since the features here are chromosomes or chromosome arms rather than probes, the values are saved to `altExp` (alternative experiment) slots, which hold measurements from the same samples (cell barcodes) but with a different feature set than the top-level experiment in the `TapestriExperiment` object (i.e., probes vs. chromosomes).

```{r copy number calc smooth}
cellmix.subset <- calcSmoothCopyNumber(cellmix.subset)
```
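
Conceptually, smoothing collapses the probe-level scores on each chromosome into a single value per cell. The base R sketch below does this with a per-cell median, which is an assumption made for illustration; the summary statistic actually used by `calcSmoothCopyNumber()` may differ or be configurable. The matrix `cn` and the probe-to-chromosome mapping `probe.chr` are hypothetical.

```r
# Toy probe-level copy number scores (hypothetical): 5 probes x 2 cells
cn <- matrix(c(
  2.1, 1.0,
  1.8, 1.2,
  2.3, 0.9,
  3.2, 2.0,
  2.9, 2.2
), nrow = 5, byrow = TRUE,
dimnames = list(paste0("probe", 1:5), c("cellA", "cellB")))

# Chromosome targeted by each probe
probe.chr <- c("chr1", "chr1", "chr1", "chr2", "chr2")

# For each cell (column), take the median score across probes on each chromosome
smoothed <- apply(cn, 2, function(cell) tapply(cell, probe.chr, median))
smoothed
```

The result has one row per chromosome instead of one row per probe, which is why such values belong in a separate feature set from the probe-level assays.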

Visualizing the copy number scores reveals the copy number differences between the cell lines. Here we show the copy number scores by probe, followed by the smoothed copy number scores by whole chromosome and by chromosome arm. See the documentation for `assayHeatmap()` for details on customization.

```{r heatmaps 1, message=FALSE}
assayHeatmap(cellmix.subset, assay = "copyNumber", split.col.by = "arm", 
             split.row.by = "cluster", annotate.row.by = "cluster", 
             color.preset = "copy.number")
assayHeatmap(cellmix.subset, alt.exp = "smoothedCopyNumberByChr", 
             assay = "smoothedCopyNumber", split.row.by = "cluster", 
             annotate.row.by = "cluster", color.preset = "copy.number")
assayHeatmap(cellmix.subset, alt.exp = "smoothedCopyNumberByArm", 
             assay = "smoothedCopyNumber", split.row.by = "cluster", 
             annotate.row.by = "cluster", color.preset = "copy.number")
```

Finally, an integer copy number value for each chromosome in each cell can be called using a Gaussian Mixture Model (GMM) framework. `calcGMMCopyNumber()` takes a vector of cell barcodes for the reference sample and a template data frame from `generateControlCopyNumberTemplate()` indicating the copy number of each chromosome arm in the reference. We can reuse the `control.copy.number` template that was generated earlier. The model specified here has components for copy number = {1,2,3,4}. The results are saved as new `assays` in the "smoothedCopyNumber" `altExp` slots for chromosomes and chromosome arms.

```{r}
reference.bcs <- colData(cellmix.subset)$cell.barcode[colData(cellmix.subset)$cluster == "cellline2"]
cellmix.subset <- calcGMMCopyNumber(cellmix.subset, 
                                    cell.barcodes = reference.bcs, 
                                    control.copy.number = control.copy.number, 
                                    model.components = 1:4)
```

```{r message=FALSE}
assayHeatmap(cellmix.subset, alt.exp = "smoothedCopyNumberByChr", 
             assay = "gmmCopyNumber", split.row.by = "cluster", 
             annotate.row.by = "cluster", color.preset = "copy.number")
assayHeatmap(cellmix.subset, alt.exp = "smoothedCopyNumberByArm", 
             assay = "gmmCopyNumber", split.row.by = "cluster", 
             annotate.row.by = "cluster", color.preset = "copy.number")
```
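
As a rough illustration of how a Gaussian mixture turns a continuous smoothed score into an integer call, the base R sketch below evaluates one Gaussian density per candidate copy number and assigns each cell to the most probable component. The component means and standard deviations are hypothetical and equal prior weights are assumed; how `calcGMMCopyNumber()` derives its component parameters from the reference cells is not reproduced here.

```r
# Candidate copy numbers, matching model.components = 1:4
components <- 1:4

# Hypothetical Gaussian parameters for each component
comp.mean <- c(1, 2, 3, 4)
comp.sd <- c(0.25, 0.3, 0.4, 0.5)

# Hypothetical smoothed copy number scores for one chromosome in four cells
smoothed.scores <- c(cellA = 1.1, cellB = 2.05, cellC = 2.9, cellD = 4.3)

# Likelihood of each score under each component (rows: cells, columns: components)
likelihood <- sapply(components, function(k) {
  dnorm(smoothed.scores, mean = comp.mean[k], sd = comp.sd[k])
})

# Posterior probabilities under equal priors; each row sums to 1
posterior <- likelihood / rowSums(likelihood)

# Integer call: the component with the highest posterior probability
gmm.call <- components[max.col(posterior, ties.method = "first")]
names(gmm.call) <- names(smoothed.scores)
gmm.call
```

Because the components can have different spreads, a score is not simply rounded to the nearest integer; the call depends on which component makes the observed value most likely.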


# Session Info

```{r}
sessioninfo::session_info()
```