# Tutorial 4 - Working with monocle3

## Load required packages

In [None]:
library(monocle3)
library(useful)
library(RColorBrewer)
library(plotly)
library(here)
library(genieclust)
library(grid)
library(gridExtra)
library(matrixStats)
library(CytoTRACE)

In [None]:
here()

## monocle3

Monocle introduced the strategy of using RNA-Seq for single-cell trajectory analysis. Rather than purifying cells into discrete states experimentally, Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process. Once it has learned the overall "trajectory" of gene expression changes, Monocle can place each cell at its proper position in the trajectory. See Cao et al. (2019) _Nature_ for more details.

For usage information, please see https://cole-trapnell-lab.github.io/monocle3/

### Data ingest

In [None]:
# monocle3 expects raw counts or UMIs with associated cell and gene metadata.
# We read these in from the CSVs we saved in the PAGA and velocyto notebook.
# monocle3 requires the expression matrix to be a matrix or sparse matrix rather than
# a data frame, so we coerce X to a dense matrix with as.matrix() and then to a sparse
# matrix with as(..., "sparseMatrix").
#
# read.csv() makes column names syntactically valid, which here prepends an "X" to the
# cell names. We strip it with gsub(), and replace ":" with "." in the obs row names,
# so that the two sets of cell names agree.
#
# monocle3 is also very picky about the order of the cells and genes matching, so we sort
# all three tables by name with order() and use the result to index each table.
# Names are fixed before sorting so that every table is sorted on identical labels.
# Remember that in R you index a table as [rows, columns].
X <- as(as.matrix(read.csv(here('MACA_bonemarrow_10x.csv'), row.names = 1)), "sparseMatrix")
colnames(X) <- gsub('X10X_', '10X_', colnames(X))
X <- X[order(rownames(X)), order(colnames(X))]

obs <- read.csv(here('MACA_bonemarrow_10x_obs.csv'), row.names = 1)
rownames(obs) <- gsub(':', '.', rownames(obs))
obs <- obs[order(rownames(obs)), , drop = FALSE]

var <- read.csv(here('MACA_bonemarrow_10x_var.csv'), row.names = 1)
var['gene_short_name'] <- rownames(var)
var['Gene'] <- NULL
var <- var[order(rownames(var)), , drop = FALSE]
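
The name matching described above can be checked on a small example before building the `cell_data_set`. This is a minimal sketch on made-up data: `X_toy` and `obs_toy` are hypothetical stand-ins for the tables read from the CSVs, and only base R and the `Matrix` package are used.

```r
library(Matrix)

# Hypothetical toy data standing in for the CSVs read above
X_toy <- Matrix(c(0, 2, 1, 0, 3, 0), nrow = 2, sparse = TRUE,
                dimnames = list(c("Ubb", "Actb"),
                                c("X10X_c2", "X10X_c1", "X10X_c3")))
obs_toy <- data.frame(free_annotation = c("A", "B", "C"),
                      row.names = c("10X_c3", "10X_c1", "10X_c2"))

# Fix the names first, then sort, so both tables are sorted on identical labels
colnames(X_toy) <- gsub("X10X_", "10X_", colnames(X_toy))
X_toy <- X_toy[order(rownames(X_toy)), order(colnames(X_toy))]
obs_toy <- obs_toy[order(rownames(obs_toy)), , drop = FALSE]

# The cell names of the matrix and the metadata should now agree exactly
names_match <- identical(colnames(X_toy), rownames(obs_toy))
stopifnot(names_match)
```

Running the same `identical()` check on the real `X` and `obs` is a cheap way to catch a mismatch before it surfaces as an error further down.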

In [None]:
# Create a cell_data_set object, the default for monocle3
cds <- new_cell_data_set(X, cell_metadata = obs, gene_metadata = var)
cds

In [None]:
# Create metadata on the number of genes and UMIs detected per cell
colData(cds)['nGene'] <- Matrix::colSums(cds@assays@data$counts > 0)
colData(cds)['nUMIs'] <- Matrix::colSums(cds@assays@data$counts)
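
To make the two per-cell metrics concrete, here is a small sketch on a hypothetical count matrix (`counts_toy`, 3 genes by 4 cells). It uses the same `Matrix::colSums()` calls as the cell above, but on made-up numbers.

```r
library(Matrix)

# Hypothetical counts: rows are genes, columns are cells
counts_toy <- Matrix(c(0, 2, 5, 0,
                       1, 0, 0, 0,
                       3, 0, 4, 1),
                     nrow = 3, byrow = TRUE, sparse = TRUE,
                     dimnames = list(c("g1", "g2", "g3"),
                                     c("c1", "c2", "c3", "c4")))

# Number of genes with at least one count in each cell
n_gene <- Matrix::colSums(counts_toy > 0)
# Total number of counts (UMIs) in each cell
n_umis <- Matrix::colSums(counts_toy)

print(rbind(nGene = n_gene, nUMIs = n_umis))
```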

### Explore cell_data_set object

In [None]:
# Access the cell metadata
colData(cds)

In [None]:
# Access the gene metadata
rowData(cds)

In [None]:
# Access the UMIs, in this case two specific genes from the first 5 cells
cds@assays@data$counts[c('Actb', 'Ubb'),  1:5]

### monocle3 clustering

In [None]:
# Normalize the UMIs, log-transform and scale them, then run principal component analysis (PCA)
# num_dim sets the number of principal components to use going forward
# We use the highly variable genes identified by scanpy
cds <- preprocess_cds(cds, num_dim = 6, use_genes = rowData(cds)$gene_short_name[rowData(cds)$highly_variable == 'True'], norm_method = "log")

In [None]:
# Plot the proportion of variance explained by each principal component
plot_pc_variance_explained(cds)
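
The quantity in this plot can be reproduced by hand on a toy dataset. The sketch below uses base R's `prcomp()` on simulated counts; the normalization (library-size scaling followed by `log(x + 1)`) is an assumption chosen for illustration and is not claimed to match what `preprocess_cds()` does internally.

```r
set.seed(1)

# Hypothetical counts: 50 genes (rows) by 30 cells (columns)
counts_toy <- matrix(rpois(50 * 30, lambda = 5), nrow = 50)

# Scale each cell by its relative library size, then log-transform
size_factors <- colSums(counts_toy) / mean(colSums(counts_toy))
log_norm <- log(t(t(counts_toy) / size_factors) + 1)

# PCA with cells as observations and genes as centered, scaled variables
pca <- prcomp(t(log_norm), center = TRUE, scale. = TRUE)

# Proportion of the total variance carried by each principal component
prop_var_expl <- pca$sdev^2 / sum(pca$sdev^2)
print(round(prop_var_expl[1:6], 3))
```

On real data the curve usually drops steeply and then flattens; the number of components kept (`num_dim`) is typically chosen near that bend.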

In [None]:
# Project the selected principal components into two dimensions with UMAP
# Use 200 nearest neighbors (umap.n_neighbors = 200) when constructing the neighbor graph
cds <- reduce_dimension(cds, reduction_method = "UMAP", preprocess_method = "PCA", umap.n_neighbors = 200)

In [None]:
# Cluster cells with the Leiden algorithm, based on the UMAP coordinates
# This differs from Seurat and scanpy, which cluster on principal components.
# You can cluster on principal components with monocle3, but you cannot then
# proceed to trajectory analysis. This quirk may be fixed soon, as monocle3 is
# in beta (a testing phase)
cds <- cluster_cells(cds,
                     reduction_method = "UMAP",
                     k=200,
                     cluster_method = "leiden",
                     resolution=0.25)

In [None]:
# Construct a trajectory based on the cell clustering, UMAP embedding, and nearest neighbor map
# Conceptually similar to PAGA, but with a different underlying methodology
cds <- learn_graph(cds)

In [None]:
# Choose a root node among the HSCs
cell_ids <- which(colData(cds)[, "free_annotation"] == "Stem_Progenitors")
closest_vertex <- cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
root_pr_nodes <- igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(names(which.max(table(closest_vertex[cell_ids,]))))]

# Order the cells in pseudotime from the chosen root node, based on the trajectory
cds <- order_cells(cds, root_pr_nodes = root_pr_nodes)
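
The root selection above picks the trajectory vertex that the most stem/progenitor cells map to. The same `table()` and `which.max()` logic is shown below on hypothetical data; the vertex names (`Y_1`, `Y_2`, ...) and the cell-to-vertex mapping are made up for illustration.

```r
# Hypothetical index of the closest trajectory vertex for each of six cells
closest_vertex_toy <- c(cell1 = 3, cell2 = 3, cell3 = 7,
                        cell4 = 3, cell5 = 9, cell6 = 3)
annotation_toy <- c("Stem", "Stem", "Stem", "Other", "Other", "Stem")
vertex_names_toy <- paste0("Y_", 1:10)  # hypothetical vertex names

# Keep only the stem cells and count how many map to each vertex
stem_cells <- which(annotation_toy == "Stem")
vertex_counts <- table(closest_vertex_toy[stem_cells])

# The most frequently hit vertex becomes the root
root_index <- as.numeric(names(which.max(vertex_counts)))
root_node_toy <- vertex_names_toy[root_index]
print(root_node_toy)
```

Note that `which.max()` returns the first maximum, so a tie between two vertices is broken silently in favor of the lower-numbered one.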

In [None]:
# Plot the cells in the UMAP coordinates calculated above, colored in turn by annotation,
# partition, Leiden cluster, pseudotime, number of genes, and number of UMIs
#
# A partition is a separate, unconnected trajectory within the same dataset
# In this case, all of the cells should be connected and part of the same partition
plot_cells(cds,
           label_cell_groups = FALSE,
           reduction_method = "UMAP",
           color_cells_by = "free_annotation", 
           cell_size = 1)

plot_cells(cds,
           label_cell_groups = FALSE,
           reduction_method = "UMAP",
           color_cells_by = "partition", 
           cell_size = 1)

plot_cells(cds,
           label_cell_groups = FALSE,
           reduction_method = "UMAP",
           color_cells_by = "cluster", 
           cell_size = 1)

plot_cells(cds,
           label_cell_groups = FALSE,
           reduction_method = "UMAP",
           color_cells_by = "pseudotime", 
           cell_size = 1)
           
plot_cells(cds,
           label_cell_groups = FALSE,
           reduction_method = "UMAP",
           color_cells_by = "nGene", 
           cell_size = 1)
           
plot_cells(cds,
           label_cell_groups = FALSE,
           reduction_method = "UMAP",
           color_cells_by = "nUMIs", 
           cell_size = 1)

### Identifying genes correlated with pseudotime

In [None]:
# graph_test() uses the monocle3-defined trajectory to test whether cells at
# nearby positions on it have correlated expression (spatial autocorrelation, Moran's I)
# We then filter for significant genes with the subset() command
pr_test_res <- graph_test(cds, neighbor_graph="principal_graph", cores=4)
pr_deg_ids <- row.names(subset(pr_test_res, q_value < 0.05))

# Once you have a set of genes that vary in some interesting way across pseudotime
# monocle3 provides a means of grouping them into modules with find_gene_modules(),
# which essentially runs UMAP on the genes (as opposed to the cells) and then groups
# them into modules using Louvain community analysis
gene_module_df <- find_gene_modules(cds[pr_deg_ids,], resolution=1e-2)
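
The statistic behind `graph_test()` is Moran's I, a measure of spatial autocorrelation. The sketch below computes it from scratch on a hypothetical six-cell chain graph; `morans_i()` is a helper written for this example, not a monocle3 function, and it omits the significance testing that `graph_test()` performs.

```r
# Moran's I for values x on a graph with symmetric weight matrix w
morans_i <- function(x, w) {
  n <- length(x)
  z <- x - mean(x)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Hypothetical chain graph of 6 cells: each cell neighbors the next one
w <- matrix(0, 6, 6)
for (i in 1:5) {
  w[i, i + 1] <- 1
  w[i + 1, i] <- 1
}

# A gene that changes smoothly along the chain, and one that alternates
smooth_gene <- c(1, 2, 3, 4, 5, 6)
alternating_gene <- c(1, 6, 1, 6, 1, 6)

i_smooth <- morans_i(smooth_gene, w)     # positive: neighbors are similar
i_alt <- morans_i(alternating_gene, w)   # negative: neighbors are dissimilar
print(c(smooth = i_smooth, alternating = i_alt))
```

Genes whose expression varies smoothly along the trajectory score high, which is why the statistic is a natural fit for finding pseudotime-associated genes.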

In [None]:
# Add the original significance test results (q-values) to the gene_module_df table with cbind()
# as.character() guards against the ids being stored as a factor
gene_module_df <- cbind(gene_module_df, pr_test_res[as.character(gene_module_df$id), 'q_value', drop=FALSE])

In [None]:
# Use aggregate_gene_expression() to summarize the expression of the genes in each module
# across the cells in each free annotation. pheatmap() then clusters both the modules and
# the annotations with hierarchical clustering (ward.D2)
#
# This uses functions from the tidyverse: tibble (another form of table in R)
# and stringr, to prepend "Module" to the module numbers
cell_group_df <- tibble::tibble(cell=row.names(colData(cds)), cell_group=colData(cds)$free_annotation)
agg_mat <- aggregate_gene_expression(cds, gene_module_df, cell_group_df)
row.names(agg_mat) <- stringr::str_c("Module ", row.names(agg_mat))
pheatmap::pheatmap(agg_mat, scale="column", clustering_method="ward.D2")
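
The aggregation step can be pictured as two group-wise reductions: one over genes (by module) and one over cells (by annotation). The sketch below does this in base R on a hypothetical matrix; summing within modules and averaging within cell groups is an assumption made for illustration, and the real `aggregate_gene_expression()` also handles normalization and scaling.

```r
# Hypothetical expression matrix: 3 genes (rows) by 4 cells (columns)
expr_toy <- matrix(c(1, 2, 3, 4,
                     3, 2, 1, 0,
                     0, 0, 5, 5),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("gA", "gB", "gC"), paste0("cell", 1:4)))
gene_module_toy <- c(gA = "1", gB = "1", gC = "2")
cell_group_toy <- c(cell1 = "HSC", cell2 = "HSC", cell3 = "Gran", cell4 = "Gran")

# Step 1: sum the genes belonging to each module (collapses rows)
module_by_cell <- rowsum(expr_toy, group = gene_module_toy[rownames(expr_toy)])

# Step 2: sum the cells in each group (collapses columns), then divide by group size
group_sums <- t(rowsum(t(module_by_cell),
                       group = cell_group_toy[colnames(module_by_cell)]))
group_sizes <- table(cell_group_toy)
module_by_group <- sweep(group_sums, 2,
                         as.numeric(group_sizes[colnames(group_sums)]), "/")
rownames(module_by_group) <- paste("Module", rownames(module_by_group))
print(module_by_group)
```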

In [None]:
# Print the 25 most significant genes in one module, sorted by q-value
# Module 3, currently associated with HSCs, is shown
# head() avoids NA rows if the module has fewer than 25 genes
tmp <- gene_module_df[gene_module_df$module == 3,]
head(tmp[order(tmp$q_value, decreasing = FALSE),], 25)

In [None]:
# Plot the 5 most significant genes from the list above along pseudotime, for the cells
# differentiating from HSCs to granulocytes. The cds object is subset to these annotations,
# which removes cells from the monocyte lineage
genes <- as.character(head(tmp$id[order(tmp$q_value, decreasing = FALSE)], 5))
subset_cds <- cds[rowData(cds)$gene_short_name %in% genes,
                       colData(cds)$free_annotation %in% c("Stem_Progenitors", "Granulocyte_progenitors", "Granulocytes")]

plot_genes_in_pseudotime(subset_cds, 
                         color_cells_by="free_annotation",
                         min_expr=0.5)

In [None]:
# Print the 25 most significant genes in one module, sorted by q-value
# Module 13, currently associated with granulocytes, is shown
# head() avoids NA rows if the module has fewer than 25 genes
tmp <- gene_module_df[gene_module_df$module == 13,]
head(tmp[order(tmp$q_value, decreasing = FALSE),], 25)

In [None]:
# Plot the 5 most significant genes from the list above along pseudotime, for the cells
# differentiating from HSCs to granulocytes. The cds object is subset to these annotations,
# which removes cells from the monocyte lineage
genes <- as.character(head(tmp$id[order(tmp$q_value, decreasing = FALSE)], 5))
subset_cds <- cds[rowData(cds)$gene_short_name %in% genes,
                       colData(cds)$free_annotation %in% c("Stem_Progenitors", "Granulocyte_progenitors", "Granulocytes")]

plot_genes_in_pseudotime(subset_cds, 
                         color_cells_by="free_annotation",
                         min_expr=0.5)

### Explore the monocle3 object further

In [None]:
# Access gene loadings from PCA (from first 5 genes and PCs)
corner(cds@preprocess_aux$gene_loadings)

In [None]:
# Access the proportion of variance explained by each principal component
cds@preprocess_aux$prop_var_expl

In [None]:
# Access igraph trajectory
cds@principal_graph$UMAP

In [None]:
# Access clusters from UMAP projection (first 5 cells)
head(cds@clusters$UMAP$clusters, 5)

In [None]:
# Access pseudotime from first 5 cells
head(cds@principal_graph_aux@listData$UMAP$pseudotime, 5)

## CytoTRACE

CytoTRACE (Cellular (Cyto) Trajectory Reconstruction Analysis using gene Counts and Expression) is a computational method that predicts the differentiation state of cells from single-cell RNA-sequencing data. CytoTRACE leverages a simple, yet robust, determinant of developmental potential: the number of detectably expressed genes per cell, or gene counts. See Gulati et al. (2020) _Science_ for details.

For usage information please visit https://cytotrace.stanford.edu
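
To illustrate the core idea only, the sketch below counts detectably expressed genes per cell in a hypothetical matrix and rescales the counts to the range 0 to 1. This is a crude proxy built for this example and is not the CytoTRACE algorithm, which does considerably more than count genes; `counts_toy` and `crude_score` are made-up names.

```r
# Hypothetical data: 40 genes, 4 cells that express different numbers of genes
n_genes_total <- 40
expressed_per_cell <- c(cellA = 10, cellB = 25, cellC = 40, cellD = 5)

# Each cell expresses its first k genes (1 count each) and nothing else
counts_toy <- sapply(expressed_per_cell,
                     function(k) as.numeric(seq_len(n_genes_total) <= k))

# Gene counts: number of genes detected in each cell
gene_counts <- colSums(counts_toy > 0)

# Rescale so the cell with the most detected genes scores 1 and the fewest scores 0
crude_score <- (gene_counts - min(gene_counts)) /
  (max(gene_counts) - min(gene_counts))
print(round(crude_score, 3))
```

Under the premise stated above, `cellC` would be ranked as the least differentiated cell and `cellD` as the most differentiated.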

In [None]:
# Run CytoTRACE on the dataset, using as.matrix() to convert the sparse matrix
# to the dense matrix required by the algorithm
results <- CytoTRACE(as.matrix(X), enableFast = FALSE, ncores = 4)

In [None]:
# Print the metadata available from CytoTRACE
names(results)

In [None]:
# Add the main metric to the cds we constructed for monocle3
colData(cds)[,'CytoTRACE'] <- results$CytoTRACE

In [None]:
# Plot the free annotations, monocle3 pseudotime, PAGA pseudotime, and CytoTRACE
# on the UMAP coordinates calculated above
plot_cells(cds,
           show_trajectory_graph = FALSE,
           label_cell_groups = FALSE,
           reduction_method = "UMAP",
           color_cells_by = "free_annotation", 
           cell_size = 1)

plot_cells(cds,
           show_trajectory_graph = FALSE,
           label_cell_groups = FALSE,
           reduction_method = "UMAP",
           color_cells_by = "pseudotime", 
           cell_size = 1)

plot_cells(cds,
           show_trajectory_graph = FALSE,
           label_cell_groups = FALSE,
           reduction_method = "UMAP",
           color_cells_by = "dpt_pseudotime", 
           cell_size = 1)

plot_cells(cds,
           show_trajectory_graph = FALSE,
           label_cell_groups = FALSE,
           reduction_method = "UMAP",
           color_cells_by = "CytoTRACE", 
           cell_size = 1)
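
Alongside the plots, a rank correlation gives a quick numeric view of how well the orderings agree. The sketch below uses made-up values for five cells; the column names mirror the ones plotted above, but the numbers are hypothetical. It assumes that a higher CytoTRACE score means a less differentiated cell, so a good match with pseudotime would show up as a strong negative correlation.

```r
# Hypothetical per-cell values for five cells, ordered from early to late
orderings_toy <- data.frame(
  monocle3_pseudotime = c(0.0, 1.2, 2.5, 4.1, 6.0),
  dpt_pseudotime      = c(0.00, 0.10, 0.35, 0.60, 0.95),
  CytoTRACE           = c(0.98, 0.80, 0.55, 0.30, 0.05)
)

# Spearman correlation compares the orderings, not the raw scales
rank_cor <- cor(orderings_toy, method = "spearman")
print(rank_cor)
```

On the real data, the columns would come from `colData(cds)` plus the monocle3 pseudotime, and any non-finite pseudotime values would need to be dropped before calling `cor()`.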

**Question:** How do the pseudotime estimates from monocle3 (second plot), PAGA (third plot), and CytoTRACE (fourth plot) compare? If one is an outlier, is there other metadata (plotted earlier) that it relies on to construct its ordering? Which is biologically correct?

**Answer:**