---
title: "Work with Big Datasets"
author: "Zuguang Gu (z.gu@dkfz.de)"
date: '`r Sys.Date()`'
output: rmarkdown::html_vignette
---
```{r, echo = FALSE, message = FALSE}
library(markdown)
library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    tidy = FALSE,
    message = FALSE,
    fig.align = "center")
options(width = 100)
options(rmarkdown.html_vignette.check_title = FALSE)
library(cola)
```
*cola* is ideally applied to datasets with an intermediate number of samples
(up to around 500). However, the running time and memory usage may become
unacceptable when the number of samples grows very large, e.g., into the
thousands. We propose the following two strategies to partially solve this
problem:

1. cola can first be applied to a randomly sampled subset of samples that
takes a relatively short running time (e.g., 100-200 samples), from which the
combination of top-value method and partitioning method that gives the best
results can be pre-selected. The user can then apply only these two methods
to the complete dataset, which is much faster than blindly running cola with
many methods in sequence.
2. The random subset of samples can be treated as a training set from which a
classification is generated. The class labels of the remaining samples can
then be predicted, e.g., by testing the distance to the centroid of each
group, without rerunning consensus partitioning on them. cola implements a
`predict_classes()` function for this purpose (see the vignette ["Predict
classes for new samples"](predict.html) for details).
Note that, since both strategies work on a small subset of samples randomly
taken from the cohort, clusters with a small number of samples might not be
detectable.
In the following examples, we use pseudo code to demonstrate the ideas. Assume
the full matrix is stored in an object `mat`. We randomly sample 200 samples from it:
```{r, eval = FALSE}
ind = sample(ncol(mat), 200)
mat1 = mat[, ind]
```
**Strategy 1**. First, apply cola to `mat1`:
```{r, eval = FALSE}
rl = run_all_consensus_partition_methods(mat1, ...)
cola_report(rl, ...)
```
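One way to compare the methods programmatically, sketched here with the *cola* functions `suggest_best_k()` and `get_stats()` (the exact columns of their output may differ between package versions), is to inspect the stability statistics of all method combinations:

```{r, eval = FALSE}
# Suggested best number of groups for every method combination:
suggest_best_k(rl)
# Partitioning statistics (e.g. 1-PAC) of all methods at a given k:
get_stats(rl, k = 2)
```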
Then find the best combination of top-value method and partitioning method.
Assume they are saved in the objects `tm` and `pm`. Next, run
`consensus_partition()` on the full matrix `mat` with only `tm` and `pm`:
```{r, eval = FALSE}
res = consensus_partition(mat, top_value_method = tm, partition_method = pm, ...)
```
**Strategy 2**. Similar to Strategy 1, we extract the `ConsensusPartition`
object generated by the methods `tm` and `pm`, which were applied to `mat1`:
```{r, eval = FALSE}
res = rl[tm, pm]
```
Note that the consensus partitioning in `res` is based on only a subset of the
original samples. With `res`, the classes of the remaining samples can be
predicted:
```{r, eval = FALSE}
mat2 = mat[, setdiff(seq_len(ncol(mat)), ind)]
mat2 = t(scale(t(mat2)))
cl = predict_classes(res, mat2)
```
Please note that, by default, the matrix rows are scaled in the cola analysis; correspondingly, `mat2` should also be row-scaled.
You can also pass the full matrix `mat` directly to the `predict_classes()` function:
```{r, eval = FALSE}
cl = predict_classes(res, t(scale(t(mat))))
```
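To illustrate the idea behind Strategy 2, the following base-R sketch (for illustration only; it is not cola's actual implementation, which additionally computes a p-value for each assignment) assigns each new sample to the class with the nearest centroid. Here `cl1` is assumed to hold the class labels of the training samples in `mat1`:

```{r, eval = FALSE}
# Compute one centroid per class by averaging the training samples:
centroids = sapply(split(seq_along(cl1), cl1),
    function(i) rowMeans(mat1[, i, drop = FALSE]))
# For each column (sample) of `mat2`, pick the class whose centroid
# has the smallest Euclidean distance:
pred = apply(mat2, 2, function(x)
    colnames(centroids)[which.min(colSums((centroids - x)^2))])
```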
## The `DownSamplingConsensusPartition` class
*cola* implements a `DownSamplingConsensusPartition` class for
**Strategy 2** described above. It runs consensus partitioning on only a
subset of samples, and the classes of the remaining samples are predicted
internally by `predict_classes()`. The `DownSamplingConsensusPartition` class
inherits from the `ConsensusPartition` class, so the functions that apply
to `ConsensusPartition` objects can also be applied to
`DownSamplingConsensusPartition` objects, with only minor differences.
In the following example, we demonstrate the usage of the
`DownSamplingConsensusPartition` class with the Golub dataset. The Golub
dataset was previously used to generate the data object `golub_cola`, from
which we can extract the matrix and the annotations with the `get_matrix()`
and `get_anno()` functions.
To perform down-sampling consensus partitioning, use the helper function
`consensus_partition_by_down_sampling()`. This function basically runs
`consensus_partition()` on a subset of samples and then predicts the classes
of all samples with the `predict_classes()` function. Here we set the `subset`
argument to 50, which means 50 samples are randomly sampled from the whole dataset.
```{r, eval = FALSE}
data(golub_cola)
m = get_matrix(golub_cola)
set.seed(123)
golub_cola_ds = consensus_partition_by_down_sampling(m, subset = 50,
anno = get_anno(golub_cola), anno_col = get_anno_col(golub_cola),
top_value_method = "SD", partition_method = "kmeans")
```
The object `golub_cola_ds` is already shipped with the package, so we simply load the data object.
```{r}
data(golub_cola_ds)
golub_cola_ds
```
The summary of `golub_cola_ds` is very similar to that of a `ConsensusPartition` object, except that
it mentions the object was generated from 50 samples randomly sampled from the full set of 72 samples.
All the functions that apply to the `ConsensusPartition` class can be applied to the `DownSamplingConsensusPartition`
class, with only some minor differences.
`get_classes()` returns the predicted classes for _all_ samples:
```{r}
class = get_classes(golub_cola_ds, k = 2)
nrow(class)
class
```
There is an additional column named `p`, which contains the p-values for predicting the class labels. For
details on how the p-values are calculated, please refer to the vignette ["Predict classes for new samples"](predict.html).
If the argument `k` is not specified, or `k` is specified as a vector, the class labels for all values of `k` are returned.
You can set the `p_cutoff` argument so that class labels with p-values larger than this cutoff are set to `NA`.
```{r}
get_classes(golub_cola_ds, p_cutoff = 0.05)
```
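For example, a quick way to count how many samples obtain a reliable classification at a given cutoff (a base-R sketch on the data frame returned by `get_classes()`; the column name `p` is taken from the output shown above):

```{r, eval = FALSE}
cl = get_classes(golub_cola_ds, k = 2)
# Number of samples whose class label can be assigned with p <= 0.05:
sum(cl$p <= 0.05)
```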
Several functions use the `p_cutoff` argument, which controls the number
of "usable samples", i.e., the samples with reliable classifications. These
functions are `get_classes()`, `test_to_known_factors()`,
`dimension_reduction()` and `get_signatures()`.
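As a sketch (assuming these functions accept `p_cutoff` in the same way as `get_classes()` does), a stricter cutoff can be passed directly:

```{r, eval = FALSE}
# Only samples with p-values <= 0.01 are treated as reliably classified:
dimension_reduction(golub_cola_ds, k = 2, p_cutoff = 0.01)
get_signatures(golub_cola_ds, k = 2, p_cutoff = 0.01)
```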
For the `dimension_reduction()` function, samples with p-values higher than
`p_cutoff` are marked by crosses, and samples that were not selected during
the random sampling are drawn as smaller dots.
```{r, fig.width = 8, fig.height = 7, out.width = "500"}
dimension_reduction(golub_cola_ds, k = 2)
```
For the `get_signatures()` function, signatures are searched for only in the samples with p-values less than `p_cutoff`.
```{r, fig.width = 8, fig.height = 7, out.width = "500"}
get_signatures(golub_cola_ds, k = 2)
```
## Session info
```{r}
sessionInfo()
```