Polina Pavlovich1,2,3, Nicolas Descostes1*, and Christophe Lancrin1
1European Molecular Biology Laboratory, EMBL, Rome, Italy
2Moscow Institute of Physics and Technology, MIPT, Moscow, Russia
3Current address: Max Planck Institute of Immunobiology and Epigenetics, MPI-IE, Freiburg, Germany
*Current maintainer.
A pretty doc version of this vignette is available here .
CONCLUS is a tool for robust clustering and positive marker features selection of single-cell RNA-seq (sc-RNA-seq) datasets. Of note, CONCLUS does not cover the preprocessing steps of sequencing files obtained following next-generation sequencing. You can find a good resource to start with here.
CONCLUS is organized into the following steps:
- Generation of multiple t-SNE plots with a range of parameters including different selection of genes extracted from PCA.
- Use the Density-based spatial clustering of applications with noise (DBSCAN) algorithm for idenfication of clusters in each generated t-SNE plot.
- All DBSCAN results are combined into a cell similarity matrix.
- The cell similarity matrix is used to define "CONSENSUS" clusters conserved accross the previously defined clustering solutions.
- Identify marker genes for each concensus cluster.
Issues can be submitted directly on github: https://github.com/lancrinlab/CONCLUS/issues. To contact us directly write to christophe.lancrin@embl.it. This package was developed by Polina Pavlovich who is now doing her Ph.D at the Max Planck Institute of Immunobiology and Epigenetics.
# required R >= 3.4.
if(!requireNamespace("devtools"))
install.packages("devtools")
if(!requireNamespace("BiocManager"))
install.packages("BiocManager")
library(devtools)
library(BiocManager)
if(!requireNamespace("conclus"))
install_github("lancrinlab/CONCLUS", dependencies = TRUE)
If you get a message such as: packages ‘BiocParallel’, ‘scran’, ‘scater’, ‘monocle’, ‘SingleCellExperiment’, ‘KEGGREST’, ‘AnnotationDbi’, ‘biomaRt’, ‘org.Mm.eg.db’, ‘S4Vectors’, ‘Biobase’ are not available (for R version 3.6.0), please run the following commands:
install(c("BiocParallel", "scran", "scater", "monocle", "SingleCellExperiment",
"KEGGREST", "AnnotationDbi", "biomaRt", "org.Mm.eg.db", "S4Vectors", "Biobase"))
install_github("lancrinlab/CONCLUS", dependencies = TRUE)
CONCLUS requires to start with a raw-count matrix with reads or unique molecular identifiers (UMIs). The columns of the count matrix must contain cells and the rows -- genes. CONCLUS needs a large number of cells to collect statistics, we recommend using CONCLUS if you have at least 100 cells.
library(conclus)
## Replace by the path to the directory in which results should be written
outputDirectory <- "./YourOutputDirectory"
experimentName <- "Bergiers"
countMatrix <- read.delim(file.path(system.file("extdata", package = "conclus"),
"Bergiers_counts_matrix_filtered.tsv"),
stringsAsFactors = FALSE)
columnsMetaData <- read.delim(file.path(system.file("extdata", package = "conclus"),
"Bergiers_colData_filtered.tsv"))
sceObjectCONCLUS <- runCONCLUS(outputDirectory,
experimentName,
columnsMetaData,
species = "mmu",
plotPDFcellSim = TRUE, # FALSE for > 2500 cells
k = 10, cores = 1,
statePalette = c("bisque", "cadetblue2",
"coral1", "cornflowerblue"),
deleteOutliers = FALSE)
exportClusteringResults(sceObjectCONCLUS, outputDirectory, experimentName,
"clusters_table.tsv")
Five folders will appear in your "outputDirectory":
-
output_tables:
- _cellsSimilarityMatrix.csv: Similarity matrix used to define the consensus clusters using every cells.
- _clustersSimilarityMatrix.csv: Simplified similarity matrix with median averaged cell values per cluster.
- _clusters_table.tsv: Cell-cluster corresponding table for defined consensus clusters.
- _dbscan_results.tsv: Cell-cluster corresponding table for each db-scan parameter.
- _tSNEResults.rds: Saved RDS object containing a matrix of all t-SNE results (for different PCs and perplexities, see below).
-
pictures: This folder contains two similarity matrices clustering heatmaps. One using all cells and the other median averaged by cluster. It also contains a subfolder "tSNE_pictures" containing all tSNEs generated with different PCs and perplexity parameters, and colored by condition, by cluster found, by dbscan, or uncolored (see below for details).
-
marker_genes: This folder contains a list of marker genes, ordered by decreasing score, for each consensus cluster.
-
tsnes: Folder containing, for each PCs/complexity values, the tSNE coordinates of each cell.
Further details about how all results are generated can be found below.
In this vignette, we demonstrate how to use CONCLUS on a sc-RNA-seq dataset from Bergiers et al. eLife 2018. The design for this experiment is described in (Figure 4—figure supplement 2). Bergiers et al. goal was to analyze the effect of the simultaneous expression of eight transcription factors (8TFs): Runx1 - and its partner - Cbfb, Gata2, Tal1, Fli1, Lyl1, Erg and Lmo2 in in vitro differentiated embryonic stem cells (ESCs) in comparison to control. They knocked-in a polycistronic transgenic construct allowing to over-express eigth transcription factors (i8TFs) simultaneously after adding doxycycline (dox). The Empty ESC line did not have any transgene. There were four conditions: E_minus (Empty no dox), E_plus (Empty with dox), i8TFs_minus (i8TFs no dox) and i8TFs_plus (i8TFs with dox).
This sc-RNA-seq experiment was performed using the SMARTer ICELL8 Single-Cell System (Click here for more info). The protocol was based on 3' end RNA sequencing where each mRNA molecule is labeled with a unique molecular identifier (UMI) during reverse transcription in every single cell. The analysis performed by Bergiers et al. was based on the dimensionality reduction algorithm called Principal Component Analysis (PCA), and they found that there was a major gene expression difference between i8TFs_plus and the other three conditions (Figure 4—figure supplement 2). However, it was not clear if other subclusters could be identified consistently in this dataset besides the two major clusters. In the current tutorial, we show how CONCLUS can help us to answer this question.
Labels of the four conditions are in the state column of columnsMetaData. To avoid a bias in the clustering analysis due to the high expression of the eight transcription factors construct, we deleted genes Cbfb, Gata2, Tal1, Fli1, Lyl1, Erg, and Lmo2 from the provided countMatrix. Highly abundant embryonic hemoglobins with names starting with "Hba" or "Hbb" were also excluded because they seemed to be a primary source of contamination.
The TestClustering function runs one clustering round out of the 84 (default) rounds that CONCLUS normally performs. This step can be useful to determine if the default DBSCAN parameters are suitable for your dataset. By default, they are dbscanEpsilon = c(1.3, 1.4, 1.5) and minPts = c(3,4). If the dashed horizontal line in the k-NN distance plot lays on the “knee” of the curve (as shown below), it means that optimal epsilon is equal to the intersection of the line to the y-axis. In our example, optimal epsilon is 1.4 for 5-NN distance where 5 corresponds to MinPts.
In the "test_clustering" folder under outputDirectory, the three plots below will be saved where one corresponds to the “distance_graph.pdf” (Figure1), another one to “test_tSNE.pdf” (p[[1]], Figure2), and the last one will be saved as “test_clustering.pdf” (p[[3]], Figure3).
sceObject <- conclus::normaliseCountMatrix(countMatrix, species = "mmu",
colData = columnsMetaData)
p <- conclus::testClustering(sceObject, outputDirectory, experimentName)
# saved as "outputDirectory/test_clustering/test_tSNE.pdf"
p[[1]]
# saved as "outputDirectory/test_clustering/test_clustering.pdf"
p[[3]]
The wrapper function runCONCLUS is organized into 7 steps:
- Normalization of the counts matrix
- Generation of t-SNE coordinates
- Clustering with DB-SCAN
- Cell and cluster similarity matrix calculation
- Plotting
- Results export
- Marker genes identification
sc-RNA-seq datasets are quite challenging notably because of sparsity (many genes are not detected consistently yielding expression matrices with many zeroes) and also because of technical noise. To facilitate analysis, one needs to perform a step of normalization which allows the correction of unwanted technical and biological noises (click here for a complete review on normalization techniques).
CONCLUS uses Scran and Scater packages for normalization. Beforehand, the function will annotate genes creating rowData and add statistics about cells into columnsMetaData. If you already have columnsMetaData and rowData, you can give it to the function (see manual). It will keep your columns and add new ones at the end. If you do not want to lose any cell after quality metrics check, select alreadyCellFiltered = TRUE, by default it is FALSE. Before scran and scater normalization, the function will call scran::quickCluster (see manual for details). If you want to skip this step, type runQuickCluster = FALSE, by default it is TRUE. We recommend to use runQuickCluster = TRUE for medium-size datasets with 500-10000 cells. However, it can take a significant amount of time for a larger amount of cells and will not be useful for small sets of 200-300 samples.
This function is currently limited to human and mouse. \newline
sceObject <- conclus::normaliseCountMatrix(countMatrix, species = "mmu",
colData = columnsMetaData)
The function normaliseCountMatrix returns a sceObject with slots counts, exprs, colData, and rowData. rowData keeps information about GO terms containing genes on the cell surface or involved in secretion. This information can help to study cross-talk between cell types or find surface protein-coding marker genes suitable for flow cytometry. The columns with the GO terms are go_id and name_1006 (see manual).
The slots can be accessed as indicated below: \newline
# checking what changed after the normalisation
dim(sceObject)
# show first columns and rows of the count matrix
SingleCellExperiment::counts(sceObject)[1:5,1:5]
# show first columns and rows of the normalized count matrix
Biobase::exprs(sceObject)[1:5,1:5]
# visualize first rows of metadata (coldata)
coldataSCE <- as.data.frame(SummarizedExperiment::colData(sceObject))
head(coldataSCE)
# visualize beginning of the rowdata containing gene information
rowdataSCE <- as.data.frame(SummarizedExperiment:::rowData(sceObject))
head(rowdataSCE)
runCONCLUS creates needed output folders (if you did not run testClustering beforehand). Then it generates an object of fourteen (by default) tables with tSNE coordinates. Fourteen because it will vary seven values of principal components PCs=c(4, 6, 8, 10, 20, 40, 50) and two values of perplexity perplexities=c(30, 40) in all possible combinations.
The chosen values of PCs and perplexities can be changed if necessary. We found that this combination works well for sc-RNA-seq datasets with 400-2000 cells. If you have 4000-9000 cells and expect more than 15 clusters, we recommend to use more first PCs and higher perplexity, for example, PCs=c(8, 10, 20, 40, 50, 80, 100) and perplexities=c(200, 240). For details about perplexities parameter see ‘?Rtsne’. \newline
initialisePath(outputDirectory)
# default parameters, can be selected by a user
PCs=c(4, 6, 8, 10, 20, 40, 50)
perplexities=c(30, 40)
randomSeed = 42
tSNEResults <- generateTSNECoordinates(sceObject, outputDirectory,
experimentName, PCs=PCs,
perplexities=perplexities,
randomSeed = randomSeed)
Results can be explored as follows: \newline
ncol(tSNEResults)
# the third matrix of t-SNE coordinates with PC = 8 and perplixities = 30
# it is saved as "tsnes/Bergiers_tsne_coordinates_3_8PCs_30perp.tsv"
head(tSNEResults[1,3][[1]])
Following the calculation of t-SNE coordinates, DBSCAN is run with a range of epsilon and MinPoints values which will yield a total of 84 clustering solutions (PCs x perplexities x MinPoints x epsilon). minPoints is the minimum cluster size which you assume to be meaningful for your experiment and epsilon is the radius around the cell where the algorithm will try to find minPoints dots. Optimal epsilon must lay on the knee of the k-NN function as shown in the "test_clustering/distance_graph.pdf" (See Test clustering section above). \newline
epsilon=c(1.3, 1.4, 1.5)
minPoints=c(3, 4)
cores=1
dbscanResults <- conclus::runDBSCAN(tSNEResults, sceObject, outputDirectory,
experimentName, epsilon=epsilon,
minPoints=minPoints,
cores=cores)
The above calculated results are combined together in a matrix called “cell similarity matrix”. RunDBSCAN function returns a matrix where columns are cells and rows are 84 clustering solutions (which is equal to number of PCs x perplexities x MinPoints x epsilon). Since the range of cluster varies from result to result, there is no exact match between numbers in different rows of the returned matrix. Cells having the same number within a row are guaranteed to be in one cluster. We can calculate how many times out of 84 clustering solutions, every two cells were in one cluster and that is how we come to the similarity matrix of cells. We want to underline that zero in the dbscanResults means that a cell was not assigned to any cluster, so two cells with zeros were not necessarily similar, that is why we count clusters starting from one. clusterCellsInternal is a general function which returns a sceObject with new clusters and calculated cellsSimilarityMatrix. \newline
clusteringMethod="ward.D2"
k=10 # parameter for cutree
deepSplit = 0 # 0 to avoid cutreeDynamic, 1 to 4 to use it
message("Calculating cells similarity matrix.")
clusteringResults <- conclus::clusterCellsInternal(dbscanResults, sceObject, clusterNumber=k,
deepSplit=deepSplit, cores=cores,
clusteringMethod=clusteringMethod)
sceObjectFiltered <- clusteringResults[[1]]
cellsSimilarityMatrix <- clusteringResults[[2]]
print(table(SummarizedExperiment::colData(sceObjectFiltered)$clusters,
dnn=list("Cells distribuion by clusters")))
After looking at the similarity between elements on the single-cell level, which is useful if we want to understand if there is any substructure which we did not highlight with our clustering, a "bulk" level where we pool all cells from a cluster into a representative "pseudo cell" can also be generated. This gives a clusterSimilarityMatrix: \newline
clustersSimilarityMatrix <-
conclus::calculateClustersSimilarity(cellsSimilarityMatrix,
sceObject = sceObjectFiltered,
clusteringMethod = "ward.D2")[[1]]
CONCLUS generated 14 tSNE combining different values of PCs and perplexities. Each tSNE can be visualized either using coloring reflecting the results of DBScan clustering, the conditions or without colors. Here plotClusteredTSNE is used to generate all these possibilities of visualization. \newline
tSNEclusters <- conclus::plotClusteredTSNE(sceObjectFiltered, outputDirectory,
experimentName, PCs=PCs, perplexities=perplexities,
colorPalette = "default", columnName = "clusters",
returnPlot = TRUE)
tSNEnoColor <- conclus::plotClusteredTSNE(sceObjectFiltered, outputDirectory,
experimentName, PCs=PCs, perplexities=perplexities,
colorPalette = "default", columnName = "noColor",
returnPlot = TRUE)
if(any(colnames(SummarizedExperiment::colData(sceObjectFiltered))
%in% "state")){
tSNEstate <- conclus::plotClusteredTSNE(sceObjectFiltered, outputDirectory,
experimentName, PCs=PCs, perplexities=perplexities,
colorPalette = "default", columnName = "state",
returnPlot = TRUE)
}
For visualizing the 5th (on 14) tSNE cluster (figure 4): \newline
tSNEclusters[[5]]
For visualizing the 5th (on 14) tSNE cluster without colors (figure 5): \newline
tSNEnoColor[[5]]
For visualizing the 5th (on 14) tSNE cluster colored by state (figure 6): \newline
tSNEstate[[5]]
The cellsSimilarityMatrix is then used to generate a heatmap (Figure 7) summarizing the results of the clustering and to show how stable the cell clusters are accross the 84 solutions. \newline
colorPalette="default"
statePalette="default"
plotPDFcellSim = TRUE
orderClusters = FALSE
clustersNumber <- length(unique(
SummarizedExperiment::colData(sceObjectFiltered)$clusters))
colorPalette <- conclus::choosePalette(colorPalette, clustersNumber)
# Plotting stability of clusters
conclus::plotCellSimilarity(sceObjectFiltered, cellsSimilarityMatrix, outputDirectory,
experimentName, colorPalette,
orderClusters = orderClusters,
statePalette = statePalette,
clusteringMethod = clusteringMethod,
plotPDF = plotPDFcellSim,
returnPlot = TRUE)
CellsSimilarityMatrix is symmetrical and its size proportional of to the "number of cells x number of cells". Each vertical or horizontal tiny strip is a cell. Intersection shows the proportion of clustering iterations in which a pair of cells were in one cluster (score between 0 and 1, between blue and red). We will call this combination "consensus clusters" and use them everywhere later. We can appreciate that cellsSimilarityMatrix is the first evidence showing that CONCLUS managed not only to distinguish i8TFs_plus cells from the three other groups (as in the original publication) but also find subpopulations within these groups which were impossible using PCA alone.
conclus::plotClustersSimilarity(clustersSimilarityMatrix,
sceObjectFiltered,
dataDirectory = outputDirectory,
experimentName = experimentName,
colorPalette = colorPalette,
statePalette = statePalette,
clusteringMethod = clusteringMethod,
returnPlot = TRUE)
In the clusterSimilarityMatrix, we can still see two major families of clusters: clusters with 8, 9, and 10 on one side and 1, 2, 3, 4, 5, 6 and 7, on the other. Almost all clusters have a high value of similarity across all clustering solutions. Only clusters 5-6 have a quite low similarity value. Red color on the diagonal means that the group is homogenous, and usually, it is what we want to get. The yellow on the diagonal indicates that either that group consists of two or more equal sized subgroups. Bluish color points to a cluster of dbscan "outliers" that usually surrounds dense clouds of cells in t-SNE plots.
runCONCLUS is saving cellsSimilarityMatrix and clustersSimilarityMatrix to the subfolder output_tables. \newline
conclus::exportMatrix(cellsSimilarityMatrix, outputDirectory, experimentName,
"cellsSimilarityMatrix")
conclus::exportMatrix(clustersSimilarityMatrix, outputDirectory, experimentName,
"clustersSimilarityMatrix")
The exportData function enables to save the normalized expression matrix, columns and row data as the full workspace to the output_tables subfolder. \newline
conclus::exportData(sceObjectFiltered, outputDirectory, experimentName)
Finally, as seen in the runCONCLUS section, the conclus::exportClusteringResults() function enables to export all results.
To understand the nature of the consensus clusters identified by CONCLUS, it is essential to identify genes which could be classified as marker genes for each cluster. To this aim, each gene should be "associated" to a particular cluster. This association is performed by looking at upregulated genes in a particular cluster compared to the others (multiple comparisons). The function rankGenes performs multiple comparisons of all genes from the sceObjectFiltered and rank them according to a score reflecting a FDR power. It saves one file per cluster in the folder marker_genes.
In summary, the function conclus::rankGenes() gives a list of marker genes for each cluster, ordered by their significance. See ?conclus::rankGenes for more details.
\newline
conclus::rankGenes(sceObjectFiltered, clustersSimilarityMatrix, outputDirectory,
experimentName)
rankedGenesClus5 <- read.delim(file.path(outputDirectory, "marker_genes",
"Bergiers_cluster_5_genes.tsv"),
stringsAsFactors = FALSE)
head(rankedGenesClus5, n = 10)
To understand the nature of the consensus clusters identified by CONCLUS, it is essential to identify genes, which could be classified as marker genes for each cluster. Following the execution of the runCONCLUS function, a folder called “marker_genes” is automatically generated in your dataDirectory. It contains one Tab Separated Values (tsv) file for each cluster. Each tsv file contains a list of genes ranked according to how high these genes are expressed in a given cluster in comparison to all the other clusters.
CONCLUS offers the option to visualize the marker genes on a heatmap (Figure 9). Below we chose to select 10 marker genes per cluster which should generate a heatmap with 100 genes (10 marker genes x 10 clusters) which is convenient for visualization. In practice, the number of genes in this heatmap will be less than 100 because some genes have been classified as markers for more than one cluster. This can happen when several clusters correspond to similar cellular types. In this particular example, we have a list of 88 genes.
We ask the function plotCellHeatmap to order clusters and genes by similarity (the same order as in the clusterSimilarityMatrix) and show mean-centered normalized data. Mean-centering allows seeing the relative expression of a gene compared to the mean. \newline
genesNumber <- 10
markersClusters <- conclus::getMarkerGenes(outputDirectory, sceObjectFiltered,
experimentName = experimentName,
genesNumber = genesNumber)
orderClusters <- T # F will apply hierarchical clustering to all cells
orderGenes <- T # F will apply hierarchical clustering to all genes
meanCentered <- T # F to show normalized counts
conclus::plotCellHeatmap(markersClusters, sceObjectFiltered, outputDirectory,
experimentName,
paste0("clusters",
length(levels(
SummarizedExperiment::colData(sceObjectFiltered)$clusters)),
"_meanCentered",meanCentered,
"_orderClusters",orderClusters,
"_orderGenes",orderGenes,"_top",
genesNumber, "markersPerCluster"),
meanCentered = meanCentered,
colorPalette = RColorBrewer::brewer.pal(10, "Paired"),
orderClusters = orderClusters,
orderGenes = orderGenes,
fontsize_row = 4,
statePalette = c("bisque", "cadetblue2",
"coral1", "cornflowerblue"),
color = colorRampPalette(c("#023b84","#4b97fc",
"#FEE395",
"#F4794E", "#D73027",
"#a31008","#7a0f09"))(100),
returnPlot = TRUE,
width = 7.5, height = 6.5)
The second heatmap (Figure 10) below also shows the order of genes and clusters by similarity but for normalized expression data. As you can see, genes expressed at a level of seven and nine look very similar. It is hard to highlight all the differences in expression of both lowly and highly detected genes in one heatmap using normalized data. For this reason, mean-centering helps to solve this issue. \newline
orderClusters <- T # F will apply hierarchical clustering to all cells
orderGenes <- T # F will apply hierarchical clustering to all genes
meanCentered <- F # F to show normalized counts
conclus::plotCellHeatmap(markersClusters, sceObjectFiltered, outputDirectory,
experimentName,
paste0("clusters",
length(levels(SummarizedExperiment::colData(sceObjectFiltered)$clusters)),
"_meanCentered",meanCentered,
"_orderClusters",orderClusters,
"_orderGenes",orderGenes,"_top",
genesNumber, "markersPerCluster"),
meanCentered = meanCentered,
colorPalette = RColorBrewer::brewer.pal(10, "Paired"),
orderClusters = orderClusters,
orderGenes = orderGenes,
fontsize_row = 4,
statePalette = c("bisque", "cadetblue2",
"coral1", "cornflowerblue"),
color = colorRampPalette(c("#023b84","#4b97fc",
"#FEE395",
"#F4794E", "#D73027",
"#a31008","#7a0f09"))(100),
returnPlot = TRUE)
Alternative order of clusters is by name or by hierarchical clustering as in the default pheatmap function.
PlotGeneExpression allows visualizing the normalized expression of one gene in a t-SNE plot. It can be useful to inspect the specificity of top markers. Below are examples of marker genes that define a particular cluster (Figures 11-14). \newline
# Plot gene expression in a selected tSNE plot
# Ccl3, marker gene for clusters 2
plotGeneExpression("Ccl3", experimentName, outputDirectory, sceObjectFiltered,
tSNEpicture = 10, returnPlot=T)
# Gp9, marker gene for cluster 6
plotGeneExpression("Gp9", experimentName, outputDirectory, sceObjectFiltered,
tSNEpicture = 10, returnPlot=T)
# Fn1, marker gene for cluster 7
plotGeneExpression("Fn1", experimentName, outputDirectory, sceObjectFiltered,
tSNEpicture = 10, returnPlot=T)
# Alox5ap, marker gene for cluster 9
plotGeneExpression("Alox5ap", experimentName, outputDirectory, sceObjectFiltered,
tSNEpicture = 10, returnPlot=T)
GetGenesInfo is a function for gene information retrieval from open-source databases and websites of NCBI, MGI, and UniProt by web-scraping. It requires markersClusters data frame where the first column “geneName” is mandatory and the second column “clusters” is optional. It can have more than two columns; they will be kept in the output file. GetGenesInfo can recognize Gene Symbols and Ensembl IDs, genes from both annotations can be present simultaneously. The second argument is “databaseDir” which is a path to the “Mmus_gene_database_secretedMol.tsv” database contained in the package. Databases for other organisms than mouse will be provided in the future or on request. \newline
result <- getGenesInfo(markersClusters, groupBy = "clusters",
getUniprot = TRUE)
outputDir <- file.path(outputDirectory, "/marker_genes/getGenesInfo")
dir.create(outputDir, showWarnings=F)
write.table(result, file = file.path(outputDir,
"Bergiers_markersClusters_top10_clusters9_genesInfo.csv"),
quote = FALSE, sep = ";", row.names = FALSE)
result contains the following columns:
- geneName: official gene symbol or ENSEMBL ID.
- clusters: The cluster to which the gene is associated.
- Name: The complete gene name.
- Feature.Type: protein coding gene, lincRNA gene, miRNA gene, unclassified non-coding RNA gene, or pseudogene.
- go_id: Gene Ontology (GO) identification number.
- name_1006: Whether or not a gene belongs to GO terms containing proteins on the cell surface or involved in secretion.
- MGI: Knockout information from MGI.
- NCBI: A summary from NCBI.
- chromosome_name: The chromosome on which the gene is located
- Symbol: Official gene symbol
- Ensembl: Ensembl ID
- MGI.Gene.Marker.ID: MGI ID
- Entrez.Gene.ID: Entrez ID
- Uniprot.ID: Uniprot ID
MarkersClusters usually contains 10-20 top marker genes per clusters. However, it is more reliable to use at least 100 genes for cell state annotation.
The function saveMarkersLists will create a data frames with top 100 marker genes for each cluster (by default). The first column of the output table is the gene name and the second column is the associated cluster. It will create “marker_genes/markers_lists” folder and save one table per cluster. \newline
saveMarkersLists(experimentName, outputDirectory)
The saveGenesInfo function will save results of getGenesInfo into "outputDirectory/marker_genes/saveGenesInfo" for each file in “marker_genes/markers_lists”. It goes file by file sequentially. If you see an error message that the connection to the website takes too much time, restart the saveGenesInfo from the table where it failed using the startFromFile option. As an example, we will specify “startFromFile = 10” to calculate the output only for the last cluster. Hence, we reduce the computational time and the number of output messages. For your analysis, please select “startFromFile = 1”.
saveGenesInfo(outputDirectory, sep = ";", header = TRUE,
startFromFile = 10, getUniprot = TRUE)
Until now, we have been using CONCLUS in an unsupervised fashion. This is a good way to start the analysis of a sc-RNA-seq dataset. However, the knowledge of the biologist remains a crucial asset to get the maximum of the data. This is why we have included in CONCLUS, additional options to do supervised analysis (or “manual” clustering) to allow the researcher to use her/his biological knowledge in the CONCLUS workflow. Going back to the example of the Bergiers et al. dataset above (cluster similarity heatmap), one can see that some clusters clearly belong to the same family of cells after examining the clusters_similarity matrix generated by CONCLUS.
It is mostly obvious for clusters 9 and 10. In order to figure out what marker genes are defining these families of clusters, one can use manual clustering in CONCLUS to fuse clusters of similar nature: i.e. combine clusters 9 and 10 together. \newline
exportClusteringResults(sceObjectFiltered, outputDirectory, experimentName,
"clusters_table.tsv")
clustersTable <- read.delim(file.path(outputDirectory, "output_tables",
paste0(experimentName, "_clusters_table.tsv")), stringsAsFactors = FALSE)
#replace “10” by “9” in the clusters_table to merge 9/10
clustersTable$clusters[clustersTable$clusters == "10"] = "9"
write.table(clustersTable, file.path(outputDirectory, "output_tables",
paste0(experimentName, "_clusters_table_manual.tsv")),
quote = FALSE, sep = "\t")
Now we can use runCONCLUS again and overwrite old marker genes and save new heatmaps. If you want to keep an old folder with marker genes, please rename it, so runCONCLUS will create a new marker_genes folder. \newline
#Correcting clustering manually
sceObjectFiltered <- addClusteringManually(fileName = "clusters_table_manual.tsv",
dataDirectory = outputDirectory,
experimentName = experimentName,
sceObject = sceObjectFiltered,
columnName = "clusters")
# Redo the analysis with manual clustering
sceObjectFiltered <- runCONCLUS(outputDirectory, experimentName,
statePalette= c("bisque", "cadetblue2", "coral1", "cornflowerblue"),
preClustered = TRUE, manualClusteringObject = sceObjectFiltered)
As previously, we can now make heatmaps of marker gene expression (Figure 15). \newline
# Plotting heatmap displaying mean-centered values
meanCentered <- T # F to show normalized counts,
plotCellHeatmap(markersClusters, sceObjectFiltered, outputDirectory,
experimentName,
paste0("clusters",
length(levels(
SummarizedExperiment::colData(sceObjectFiltered)$clusters)),
"_meanCentered",meanCentered,"_orderClusters",orderClusters,
"_orderGenes",orderGenes,"_top",
genesNumber, "markersPerCluster"),
meanCentered = meanCentered,
colorPalette = c("#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C",
"#FB9A99", "#E31A1C", "#FDBF6F", "#FF7F00", "#CAB2D6"),
orderClusters = orderClusters,
orderGenes = orderGenes,
fontsize_row = 5,
statePalette = c("bisque", "cadetblue2",
"coral1", "cornflowerblue"),
color = colorRampPalette(c("#023b84","#4b97fc",
"#FEE395",
"#F4794E", "#D73027",
"#a31008","#7a0f09"))(100),
returnPlot = TRUE)
This heatmap shows that Cd52 and Alox5ap are good markers of cluster 9 (mix of old clusters 9 and 10). We can visualize them in the t-SNE plots below. \newline
# Plot gene expression in a selected tSNE plot
# Cd52, marker gene for clusters 9 (mix of old clusters 9 and 10)
plotGeneExpression("Cd52", experimentName, outputDirectory, sceObjectFiltered,
tSNEpicture = 10, returnPlot = TRUE)
# Alox5ap, marker gene for cluster 9 (mix of old clusters 9 and 10)
plotGeneExpression("Alox5ap", experimentName, outputDirectory, sceObjectFiltered,
tSNEpicture = 10, returnPlot = TRUE)
Here we demonstrated how to use CONCLUS and combine multiple parameters testing for sc-RNA-seq analysis. It allowed us to gain more information on the dataset of Bergiers et al and will help gaining deeper insights in others.
Indeed in the original analysis using PCA, two major clusters were found (one composed of i8TFs_plus cells and another comprising E_minus, E_plus, i8TFs_minus cells). Using CONCLUS, we see that there is still a big difference between the i8TFs_plus experimental group and the other three. Interestingly, CONCLUS was able to unveil heterogeneity within the i8TFs group while the previous analysis performed by Bergiers et al was not able to reveal it. This analysis offers additional information on the function of these eight transcription factors.
This section lists all the packages used for the CONCLUS tool.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /opt/R/3.6.0/lib/R/lib/libRblas.so
## LAPACK: /opt/R/3.6.0/lib/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=it_IT.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=it_IT.UTF-8 LC_NAME=it_IT.UTF-8
## [9] LC_ADDRESS=it_IT.UTF-8 LC_TELEPHONE=it_IT.UTF-8
## [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=it_IT.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] prettydoc_0.2.1 conclus_0.99.3 BiocManager_1.30.4
## [4] usethis_1.5.0 devtools_2.0.2 rmarkdown_1.13
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.4 VGAM_1.1-1
## [3] plyr_1.8.4 igraph_1.2.4.1
## [5] lazyeval_0.2.2 splines_3.6.0
## [7] BiocParallel_1.18.0 densityClust_0.3
## [9] GenomeInfoDb_1.20.0 fastICA_1.2-1
## [11] ggplot2_3.2.0 scater_1.12.2
## [13] digest_0.6.19 foreach_1.4.4
## [15] htmltools_0.3.6 viridis_0.5.1
## [17] magrittr_1.5 memoise_1.1.0
## [19] xlsx_0.6.1 cluster_2.1.0
## [21] doParallel_1.0.14 limma_3.40.2
## [23] remotes_2.1.0 Biostrings_2.52.0
## [25] matrixStats_0.54.0 docopt_0.6.1
## [27] prettyunits_1.0.2 colorspace_1.4-1
## [29] blob_1.1.1 rvest_0.3.4
## [31] ggrepel_0.8.1 xfun_0.8
## [33] dplyr_0.8.1 sparsesvd_0.1-4
## [35] callr_3.2.0 crayon_1.3.4
## [37] RCurl_1.95-4.12 org.Mm.eg.db_3.8.2
## [39] zoo_1.8-6 iterators_1.0.10
## [41] glue_1.3.1 gtable_0.3.0
## [43] zlibbioc_1.30.0 XVector_0.24.0
## [45] DelayedArray_0.10.0 kernlab_0.9-27
## [47] pkgbuild_1.0.3 BiocSingular_1.0.0
## [49] SingleCellExperiment_1.6.0 DEoptimR_1.0-8
## [51] prabclus_2.3-1 BiocGenerics_0.30.0
## [53] scales_1.0.0 pheatmap_1.0.12
## [55] mvtnorm_1.0-11 DBI_1.0.0
## [57] edgeR_3.26.5 Rcpp_1.0.1
## [59] progress_1.2.2 viridisLite_0.3.0
## [61] dqrng_0.2.1 bit_1.1-14
## [63] rsvd_1.0.1 mclust_5.4.3
## [65] stats4_3.6.0 httr_1.4.0
## [67] FNN_1.1.3 RColorBrewer_1.1-2
## [69] fpc_2.2-2 modeltools_0.2-22
## [71] factoextra_1.0.5 rJava_0.9-11
## [73] XML_3.98-1.20 pkgconfig_2.0.2
## [75] flexmix_2.3-15 nnet_7.3-12
## [77] locfit_1.5-9.1 dynamicTreeCut_1.63-1
## [79] labeling_0.3 tidyselect_0.2.5
## [81] rlang_0.4.0 reshape2_1.4.3
## [83] AnnotationDbi_1.46.0 munsell_0.5.0
## [85] tools_3.6.0 cli_1.1.0
## [87] dbscan_1.1-3 RSQLite_2.1.1
## [89] evaluate_0.14 stringr_1.4.0
## [91] yaml_2.2.0 processx_3.3.1
## [93] knitr_1.23 bit64_0.9-7
## [95] fs_1.3.1 robustbase_0.93-5
## [97] DDRTree_0.1.5 purrr_0.3.2
## [99] RANN_2.6.1 KEGGREST_1.24.0
## [101] monocle_2.12.0 slam_0.1-45
## [103] scran_1.12.1 xml2_1.2.0
## [105] biomaRt_2.40.0 compiler_3.6.0
## [107] beeswarm_0.2.3 curl_3.3
## [109] png_0.1-7 ggsignif_0.5.0
## [111] tibble_2.1.3 statmod_1.4.32
## [113] stringi_1.4.3 ps_1.3.0
## [115] desc_1.2.0 lattice_0.20-38
## [117] Matrix_1.2-17 HSMMSingleCell_1.4.0
## [119] pillar_1.4.1 combinat_0.0-8
## [121] BiocNeighbors_1.2.0 data.table_1.12.2
## [123] bitops_1.0-6 irlba_2.3.3
## [125] GenomicRanges_1.36.0 R6_2.4.0
## [127] gridExtra_2.3 vipor_0.4.5
## [129] IRanges_2.18.1 sessioninfo_1.1.1
## [131] codetools_0.2-16 MASS_7.3-51.4
## [133] assertthat_0.2.1 pkgload_1.0.2
## [135] xlsxjars_0.6.1 SummarizedExperiment_1.14.0
## [137] rprojroot_1.3-2 withr_2.1.2
## [139] qlcMatrix_0.9.7 S4Vectors_0.22.0
## [141] GenomeInfoDbData_1.2.1 hms_0.4.2
## [143] diptest_0.75-7 parallel_3.6.0
## [145] DataCombine_0.2.21 grid_3.6.0
## [147] class_7.3-15 DelayedMatrixStats_1.6.0
## [149] Cairo_1.5-10 Rtsne_0.15
## [151] ggpubr_0.2.1 Biobase_2.44.0
## [153] tinytex_0.14 ggbeeswarm_0.6.0