vignettes/peco.Rmd

---
title: "Predicting cell cycle phase using peco"
author: "Joyce Hsiao"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{peco}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

`peco` is a supervised approach for predicting cell cycle phase in a
continuum using single-cell RNA sequencing data. The R package
provides functions to build a training data set and predict cell cycle
on a continuum.

Our work shows that peco is able to predict continuous cell
cylce phase using a small set of cyclic genes, _CDK1_, _UBE2C_,
_TOP2A_, _HISTH1E_, and _HISTH1C_ (These were dentified as cell cycle marker
genes in studies of yeast ([Spellman et al, 1998][spellman]) and HeLa
cells ([Whitfield et al, 2002][whitfield]).

Below, we present two use cases.

In the first use case, we show how to use the built-in training
dataset to predict continuous cell cycle.

In the second use case, we show how to create a training data set,
then build a predictor using the training data.

```{r setup, include=FALSE}
knitr::opts_chunk$set(results = "hold",collapse = TRUE,comment = "#>",
                      fig.align = "center")
```

`peco` uses `SingleCellExperiment` class objects:

```{r load-pkgs, message=FALSE, warning=FALSE}
library(SingleCellExperiment)
library(doParallel)
library(foreach)
library(peco)
```

About the training data
-----------------------

`training_human` provides training data for 101 significant
cyclic genes. The data include:

+ `predict.yy`, a gene-by-sample matrix (101 x 888) that storing
  predicted cyclic expression values.

+ `cellcycle_peco_ordered`, cell cycle phase in a unit circle
  (angle), ordered from 0 to 2$\pi$.

+ `cellcycle_function`, functions used for prediction corresponding to
  the top 101 cyclic genes identified.

+ `sigma`, standard error associated with cyclic trends of gene
  expression.

+ `pve`, proportion of variance explained by the cyclic trend.

```{r loading-training-data}
data(training_human)
```

Predict cell cycle phase using gene expression data
---------------------------------------------------

`peco` is integrated with `SingleCellExperiment` objects in
Bioconductor. Here we illustrate using a `SingleCellExperiment` object
to perform cell cycle phase prediction.

`sce_top101genes` is a `SingleCellExpression` object for 101 genes and
888 single-cell samples.

```{r load-genes}
data(sce_top101genes)
assays(sce_top101genes)
```

Transform expression to quantile-normalizesd counts-per-million (CPM)
values; `peco` uses the `cpm_quantNormed` slot as the input data for
predictions.

```{r transform-expression, message=FALSE}
sce_top101genes <- data_transform_quantile(sce_top101genes)
assays(sce_top101genes)
```

Generate predictions using `cycle_npreg_outsample`.

```{r predict-cell-cycle}
pred_top101genes <-
  cycle_npreg_outsample(Y_test = sce_top101genes,
    sigma_est = training_human$sigma[rownames(sce_top101genes),],
    funs_est = training_human$cellcycle_function[rownames(sce_top101genes)],
    method.trend = "trendfilter",get_trend_estimates = FALSE)
```

`pred_top101genes$Y` contains a `SingleCellExperiment` object with
the predicted cell cycle phase in the `colData` slot:

```{r inspect-predictions}
head(colData(pred_top101genes$Y)$cellcycle_peco)
```

View the predictions for the *CDK1* gene (Ensembl id
ENSG00000170312). Because *CDK1* is a known cell cycle gene, this
visualization serves as a "sanity check" for the predictions.

```{r plot-predictions-single-gene, fig.height=4, fig.width=5}
x <- seq(0,2*pi,length.out = 100)
plot(x = colData(pred_top101genes$Y)$theta_shifted,
     y = assay(pred_top101genes$Y,"cpm_quantNormed")["ENSG00000170312",],
     pch = 21,col = "white",bg = "darkblue",
     main = "CDK1",xlab = "FUCCI phase",
	 ylab = "quantile-normalized expression")
lines(x = x,y = training_human$cellcycle_function[["ENSG00000170312"]](x),
      col = "tomato",lwd = 2)
```

Visualize cyclic expression trend based on predicted phase
----------------------------------------------------------

Next, we view the predictions for the top six genes. Here we use
`fit_cyclical_many` to estimate cell cycle.

```{r plot-predictions-top-genes, fig.height=8, fig.width=7, message=FALSE}
theta_predict <- colData(pred_top101genes$Y)$cellcycle_peco
names(theta_predict) <- rownames(colData(pred_top101genes$Y))
yy_input <- assay(pred_top101genes$Y,"cpm_quantNormed")[1:6,]
fit_cyclic <- fit_cyclical_many(Y = yy_input,theta = theta_predict)
gene_symbols <- rowData(pred_top101genes$Y)[rownames(yy_input),"hgnc"]
x <- seq(0,2*pi,length.out = 100)
par(mfrow = c(3,2))
for (i in 1:6) {
  plot(x = fit_cyclic$cellcycle_peco_ordered,y = yy_input[i,],
       pch = 1,col = "darkblue",main = gene_symbols[i],
       xlab = "FUCCI phase",ylab = "normalized expression")
  lines(x = x,y = fit_cyclic$cellcycle_function[[i]](x),
        col = "tomato",lwd = 2)
}
```

## Session information

```{r session-info}
sessionInfo()
```

[spellman]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC25624
[whitfield]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC117619