---
title: "Concepts on cell types and signatures"
author:
- name: Kevin Rue-Albrecht
  affiliation:
  - &id1 Kennedy Institute of Rheumatology, University of Oxford, Headington, Oxford OX3 7FY, UK.
  email: kevin.rue-albrecht@kennedy.ox.ac.uk
- name: Second Author
  affiliation: Second Author's Affiliation
  email: corresponding@author.com
date: "`r BiocStyle::doc_date()`"
package: hancock
output:
  BiocStyle::html_document:
    toc_float: true
abstract: |
  A discussion of concepts associated with cell types and transcriptional signatures.
vignette: |
  %\VignetteIndexEntry{Concepts on cell types and signatures}
  %\VignetteEncoding{UTF-8}
  %\VignettePackage{hancock}
  %\VignetteKeywords{GeneExpression, RNASeq, Sequencing}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
bibliography: hancock.bib
---
**Compiled date**: `r Sys.Date()`
**Last edited**: 2018-03-08
**License**: `r packageDescription("hancock")[["License"]]`
```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    error = FALSE,
    warning = FALSE,
    message = FALSE,
    crop = NULL
)
```
# Definition of cell identity
Discuss:
- cell type is a continuum (e.g. differentiation, pseudotime)
- on a similar note, all cells in an organism basically originate from a common progenitor (itself coming from two, etc.)
- cell "states" may be defined similarly to cell "types" (e.g. markers of activated/resting cells)
# Definitions
## Definitions of absolute and relative markers and signatures
**Absolute markers** (also known as "pan markers") may be defined as molecules (e.g. protein, transcripts) known to be present or absent in a given population of cells, _irrespective of their expression in other cells of the same sample_.
For instance, T helper lymphocytes can be defined by the presence of surface proteins Cd3 and Cd4, while T cytotoxic lymphocytes are defined by the presence of surface proteins Cd3 and Cd8.
To assess whether a cell is likely a T helper lymphocyte, one does not need to know the markers of T cytotoxic lymphocytes, nor that the two cell types have the Cd3 marker in common.
**Relative markers** (also known as "key markers") may be defined by _differential_ expression against other cells in the same sample.
For instance, in a biological sample including both T helper and T cytotoxic lymphocytes, differential expression between the two lymphocyte subsets would highlight Cd4 protein as a (relative) marker of T helper lymphocytes and Cd8 protein as a (relative) marker of T cytotoxic lymphocytes.
In contrast, Cd3 protein may be considered as a marker of either cell type only if its expression level or frequency significantly differs between the two cell types.
Note that sets of **absolute markers** may also be trimmed to markers _specific_ to each cell population (i.e., excluding markers present in other signatures), either to increase stringency (due to their specificity) or sensitivity (due to their smaller number).
For instance, since Cd3 protein is a marker of both T helper and T cytotoxic lymphocytes, one may wish to exclude it from both signatures, in a manner similar to, yet more stringent than, **relative markers**.
Such markers may be called **relative sets of absolute markers**, as they are composed of **absolute markers** compared to one another in order to identify specific subsets of markers.
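The trimming described above can be sketched in a few lines of base R. This is only an illustration: the marker sets reuse the Cd3/Cd4/Cd8 example from the text, and the object names (`absolute_markers`, `relative_sets`) are hypothetical rather than part of any package.

```r
# Hypothetical absolute marker sets, taken from the example in the text
absolute_markers <- list(
    "T helper" = c("Cd3", "Cd4"),
    "T cytotoxic" = c("Cd3", "Cd8")
)

# Count in how many signatures each marker appears
marker_counts <- table(unlist(absolute_markers))

# Markers present in more than one signature are not specific to any of them
shared_markers <- names(marker_counts)[marker_counts > 1]

# Relative sets of absolute markers: drop the shared markers from every set
relative_sets <- lapply(absolute_markers, setdiff, y = shared_markers)
relative_sets
```

Here Cd3 is removed from both sets because it appears in both, leaving one specific marker per population.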
## Definitions of qualitative and quantitative signatures
**Qualitative signatures** may be defined as those comprising lists of gene identifiers that are known to be either present or absent in a given population of cells, without _any_ quantitative information (i.e., neither absolute nor relative, see below).
**Quantitative signatures** may be defined as either full transcriptional profiles of cell populations, or gene lists accompanied by either absolute or relative quantitative gene expression information (e.g., counts, transcripts per million).
In addition, **semi-quantitative signatures** may be defined as signatures accompanied by summarized gene expression data (e.g., gene rank by decreasing TPM).
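As a minimal sketch of how the three kinds of signatures relate, the base R code below derives each of them from the same vector of expression values. The TPM values are invented for illustration, and the presence threshold of 0 is an assumption.

```r
# Hypothetical TPM values for one cell population (illustrative numbers only)
tpm <- c(CD3E = 120.5, CD4 = 35.2, CD8A = 0, MS4A1 = 0, IL7R = 60.1)

# Quantitative signature: the expression values themselves
quantitative <- tpm

# Semi-quantitative signature: gene rank by decreasing TPM
# (tied genes share the smallest rank)
semi_quantitative <- rank(-tpm, ties.method = "min")

# Qualitative signature: presence or absence only, assuming a threshold of 0
qualitative <- tpm > 0

semi_quantitative
qualitative
```

Each step discards information: the ranks no longer depend on the measurement scale, and the presence calls no longer depend on the ordering of the detected genes.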
## Applications
The origin of markers and signatures dictates how they should be used in downstream classification tasks.
### Markers
**Absolute markers** present the advantage of allowing the immediate characterization of any cell or cluster, without the need for a reference cell or cluster.
In this case, each cell or cluster may be screened for the presence of absolute markers, and assigned an identity independently of all other cells in the sample.
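A possible screening procedure is sketched below in base R. The detection calls and object names are hypothetical, and requiring _all_ markers of a signature to be detected is an assumption made for this illustration, not a rule stated in this vignette.

```r
# Hypothetical detection calls (TRUE = marker detected) for three clusters
detected <- rbind(
    Cd3 = c(cluster1 = TRUE, cluster2 = TRUE, cluster3 = FALSE),
    Cd4 = c(cluster1 = TRUE, cluster2 = FALSE, cluster3 = FALSE),
    Cd8 = c(cluster1 = FALSE, cluster2 = TRUE, cluster3 = FALSE)
)

signatures <- list(
    "T helper" = c("Cd3", "Cd4"),
    "T cytotoxic" = c("Cd3", "Cd8")
)

# A cluster matches a signature when all markers of that signature are detected
# (rows of the result are clusters, columns are signatures)
matches <- sapply(signatures, function(markers) {
    colSums(detected[markers, , drop = FALSE]) == length(markers)
})

# Assign an identity only when exactly one signature matches, NA otherwise
cluster_identity <- apply(matches, 1, function(x) {
    if (sum(x) == 1) names(x)[x] else NA_character_
})
cluster_identity
```

Each cluster is evaluated on its own column of `detected`, so the result for one cluster does not depend on which other clusters are present in the sample.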
**Relative markers** can be advantageous when the general cell type composition of a given sample is known in advance, and the problem is only to distribute a predefined set of identities expected in a given sample to a similar number of cell clusters.
In this case, differential expression analysis may be performed between clusters of cells in the new sample, and markers of each cluster may be compared to a similar reference sample to assign identities defined in the reference sample to each of the cells or clusters in the new sample.
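The comparison to a reference can be sketched as follows. The marker lists are invented, and the Jaccard index is used here only as one simple, assumed measure of overlap between marker sets; the text above does not prescribe a particular metric.

```r
# Hypothetical relative markers defined in a reference sample
reference_markers <- list(
    "T helper" = c("CD4", "IL7R", "CCR7"),
    "T cytotoxic" = c("CD8A", "CD8B", "GZMK")
)

# Hypothetical markers from differential expression between clusters of a new sample
cluster_markers <- list(
    cluster1 = c("CD8A", "GZMK", "NKG7"),
    cluster2 = c("CD4", "IL7R", "LTB")
)

# Jaccard index: size of the intersection divided by size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Overlap matrix: rows are clusters, columns are reference identities
overlap <- sapply(reference_markers, function(ref) {
    sapply(cluster_markers, jaccard, b = ref)
})

# Assign each cluster the reference identity with the highest overlap
assigned <- colnames(overlap)[apply(overlap, 1, which.max)]
names(assigned) <- rownames(overlap)
assigned
```

Note that `which.max()` always returns an identity, even when the best overlap is poor; a real workflow would likely also require a minimum overlap.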
However, cells within a given sample are generally the result of sorting (e.g. FACS) and enriching a population of interest on a set of (protein) markers.
In those cases, the markers used for sorting the cells generally appear as highly expressed in all cells, making them difficult or impossible to identify as relative markers.
**Relative sets of absolute markers** may be advantageous when dealing with closely related cell populations or novel populations defined by new markers relative to canonical markers and cell types.
### Signatures
**Qualitative signatures** may provide particularly fast and convenient ways to apply FACS-like "gating" strategies to the definition of cell identities.
The main challenge for such signatures is to define _thresholds_ under and above which transcripts may be considered as absent or present (a natural default threshold being 0).
The main advantage of qualitative signatures is that the presence or absence of any transcript should be considerably more stable than its fluctuating expression level.
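To illustrate the effect of the threshold, the sketch below calls a single transcript present or absent in six cells under two thresholds. The counts and the stricter threshold of 2 are arbitrary values chosen for this example.

```r
# Hypothetical counts for one transcript across six cells
counts <- c(cell1 = 0, cell2 = 1, cell3 = 0, cell4 = 7, cell5 = 2, cell6 = 15)

# Presence calls using the natural default threshold of 0
present_default <- counts > 0

# Presence calls using a stricter, arbitrary threshold
present_strict <- counts > 2

# Fraction of cells in which the transcript is called present
mean(present_default)
mean(present_strict)
```

The same data yield different qualitative calls depending on the threshold, which is why the threshold should be reported alongside any qualitative signature.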
**Quantitative signatures** may provide considerably more precise information on the relative expression level of marker (and other) genes.
However, this additional information carries additional constraints and caveats with it, namely:
- the naturally fluctuating level of transcripts in individual samples means that independent data sets will never produce exactly identical quantitative signatures
- relatedly, quantitative signatures ideally require users to process and quantify their new data set _identically_ to the reference data set used to define those signatures; such requirements generally limit reproducibility between researchers and software versions, and greatly hinder the definition, distribution, and interpretation of "gold standard" signatures.
**Semi-quantitative signatures** may provide a compromise between qualitative and quantitative signatures, summarizing fluctuating **quantitative signatures** into more stable semi-quantitative summary metrics (e.g., gene rank by expression level).
For instance, such signatures may be used to identify the most correlated reference sample to any cell or cluster in a new data set, allowing (to some extent) the comparison of even imperfectly correlated quantitative measurements (e.g. TPM and CPM).
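One way to sketch this idea in base R is a rank-based (Spearman) correlation between a new profile and each reference profile. The expression values below are invented, and the choice of Spearman correlation is an assumption made for this illustration.

```r
# Hypothetical reference profiles (TPM) for two cell types
reference_tpm <- cbind(
    "T helper" = c(CD3E = 90, CD4 = 40, CD8A = 1, MS4A1 = 0, LYZ = 2),
    "B cells" = c(CD3E = 1, CD4 = 0, CD8A = 0, MS4A1 = 80, LYZ = 5)
)

# Hypothetical new cluster, quantified on a different scale (CPM)
new_cpm <- c(CD3E = 300, CD4 = 150, CD8A = 10, MS4A1 = 2, LYZ = 20)

# Spearman correlation is computed on ranks, so it does not require
# the two data sets to share the same measurement scale
rho <- cor(
    reference_tpm,
    new_cpm[rownames(reference_tpm)],
    method = "spearman"
)

# Reference profile with the highest rank correlation
best_match <- rownames(rho)[which.max(rho[, 1])]
best_match
```

Genes are matched by name before computing the correlation, so the order of genes in the new data set does not matter.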
# Representation of signatures
## Lists of marker names {#using-list}
At its simplest, a signature could list _positive_ markers that are known to be expressed in each cell type.
For instance, the PBMC signatures used in the Seurat [PBMC 3k tutorial](https://satijalab.org/seurat/pbmc3k_tutorial.html) may be represented as follows:
```{r}
pbmc3k_markers_list <- list(
    "CD4 T cells" = c("IL7R"),
    "CD14+ Monocytes" = c("CD14", "LYZ"),
    "B cells" = c("MS4A1"),
    "CD8 T cells" = c("CD8A"),
    "FCGR3A+ Monocytes" = c("FCGR3A", "MS4A7"),
    "NK cells" = c("GNLY", "NKG7"),
    "Dendritic Cells" = c("FCER1A", "CST3"),
    "Megakaryocytes" = c("PPBP")
)
pbmc3k_markers_list
```
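A named list can also be flattened into a two-column table, which makes reverse lookups (from marker to cell type) straightforward. The sketch below uses base R `stack()` on a shortened copy of the list; the object names `markers_table` and `markers_by_gene` are hypothetical.

```r
# Shortened copy of the marker list, for illustration
markers_list <- list(
    "CD14+ Monocytes" = c("CD14", "LYZ"),
    "NK cells" = c("GNLY", "NKG7"),
    "Dendritic Cells" = c("FCER1A", "CST3")
)

# One row per (marker, cell type) pair
markers_table <- stack(markers_list)
colnames(markers_table) <- c("marker", "cell_type")

# Reverse lookup: cell types associated with each marker
markers_by_gene <- split(
    as.character(markers_table$cell_type),
    markers_table$marker
)
markers_by_gene[["LYZ"]]
```

The long format also accommodates markers shared by several cell types, which simply appear in more than one row.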
## Using GSEABase `GeneSetCollection` {#using-gseabase}
The `r Biocpkg("GSEABase")` package provides infrastructure for enumerating pathways and their contents
and facilitating translations among the different nomenclatures that
are used to describe contents of pathways ([source](https://support.bioconductor.org/p/27770/)).
The above list may be more formally (and elegantly) represented as follows:
```{r, message=FALSE}
library(GSEABase)
pbmc3k_markers_gsc <- GeneSetCollection(list(
    GeneSet(setName="CD4 T cells", c("IL7R")),
    GeneSet(setName="CD14+ Monocytes", c("CD14", "LYZ")),
    GeneSet(setName="B cells", c("MS4A1")),
    GeneSet(setName="CD8 T cells", c("CD8A")),
    GeneSet(setName="FCGR3A+ Monocytes", c("FCGR3A", "MS4A7")),
    GeneSet(setName="NK cells", c("GNLY", "NKG7")),
    GeneSet(setName="Dendritic Cells", c("FCER1A", "CST3")),
    GeneSet(setName="Megakaryocytes", c("PPBP"))
))
pbmc3k_markers_gsc
```
Note that generic R lists can easily be packaged into `r Biocpkg("GSEABase")` `GeneSetCollection` objects, for instance:
```{r}
pbmc3k_markers_gsc <- GeneSetCollection(mapply(function(geneIds, geneSetId) {
    # The markers listed above are gene symbols, not Entrez identifiers
    GeneSet(geneIds, geneIdType=SymbolIdentifier(),
        collectionType=NullCollection(),
        setName=geneSetId)
}, pbmc3k_markers_list, names(pbmc3k_markers_list), SIMPLIFY=FALSE))
```
## Using unisets `Sets` {#using-unisets}
The `r Githubpkg("kevinrue/unisets")` package is being developed using `r Biocpkg("S4Vectors")` `Hits` to associate identifiers in a vector of elements to identifiers in a vector of sets.
The package can be installed as follows:
```{r, eval=FALSE}
devtools::install_github("kevinrue/unisets")
```
Conveniently, this package supports multiple source formats to create gene sets, including the `list` format described [above](#using-list).
```{r}
library(unisets)
inputList <- list(
    "CD4 T cells" = c("IL7R"),
    "CD14+ Monocytes" = c("CD14", "LYZ"),
    "B cells" = c("MS4A1"),
    "CD8 T cells" = c("CD8A"),
    "FCGR3A+ Monocytes" = c("FCGR3A", "MS4A7"),
    "NK cells" = c("GNLY", "NKG7"),
    "Dendritic Cells" = c("FCER1A", "CST3"),
    "Megakaryocytes" = c("PPBP")
)
pbmc3k_markers_sets <- as(inputList, "Sets")
pbmc3k_markers_sets
```
## Combinations of positive and negative markers using the `GeneColorSet` class
"Colored" gene sets, implemented in the `r Biocpkg("GSEABase")` `GeneColorSet` class, can store information about the relation between each gene and a given phenotype.
Here, "colors" are `factor` levels used to represent the "state" of each gene (e.g., expression levels "up", "down", or "unchanged"), and a phenotypic consequence (e.g., the phenotype is "enhanced" or "reduced").
For the purpose of signatures associated with individual cell types, we may consider the identity of a differentiated cell as the phenotype of interest.
In particular, individual genes may be positively or negatively associated with certain populations of cells.
Such genes may be called **positive** or **negative** markers for the associated cell population.
For instance, the differentiation of monocytes into mature F4/80^hi^ CX3CR1^hi^ MHCII^+^ macrophages in the murine colonic mucosa is described as a "waterfall" that visually depicts the concomitant downregulation of Ly6C and upregulation of MHCII
[@monocytewaterfall2012].
Conceptually, the transcriptional signatures of Ly6C^hi^ monocytes and mature F4/80^hi^ CX3CR1^hi^ MHCII^+^ macrophages may be represented as follows:
```{r}
colored_gsc <- GeneSetCollection(list(
    GeneColorSet(
        setName="Monocytes", c("Ly6c1", "MHCII"), phenotype=c("Ly6Chi"),
        geneColor=factor(c("high", "-")),
        phenotypeColor=factor(c(TRUE, TRUE))),
    GeneColorSet(
        setName="Macrophages", c("Ly6c1", "MHCII"), phenotype=c("mature F4/80hi CX3CR1hi MHCII+"),
        geneColor=factor(c("low", "+")),
        phenotypeColor=factor(c(TRUE, TRUE)))
))
colored_gsc
```
The monocytic signature can be extracted for further inspection.
```{r}
colored_gsc[["Monocytes"]]
```
It is also possible to examine the relationship of individual markers with respect to the "Ly6Chi Monocyte" phenotype.
```{r}
colored_gsc[["Monocytes"]][["Ly6c1"]]
colored_gsc[["Monocytes"]][["MHCII"]]
```
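Independently of `r Biocpkg("GSEABase")`, the logic of combining positive and negative markers can be sketched in base R. The example below simplifies the "high"/"low" and "+"/"-" states used above into detected or not detected, which is an assumption made for this illustration; the object names are hypothetical.

```r
# Hypothetical signatures: TRUE marks a positive marker, FALSE a negative marker
signatures <- list(
    Monocytes = c(Ly6c1 = TRUE, MHCII = FALSE),
    Macrophages = c(Ly6c1 = FALSE, MHCII = TRUE)
)

# Hypothetical detection calls for a single cell
cell_detected <- c(Ly6c1 = FALSE, MHCII = TRUE)

# A cell is compatible with a signature when every positive marker is detected
# and every negative marker is not detected
compatible <- sapply(signatures, function(expected) {
    all(cell_detected[names(expected)] == expected)
})
compatible
```

In this sketch, a cell lacking Ly6c1 and expressing MHCII is compatible with the macrophage signature only.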
# Session info {.unnumbered}
```{r sessionInfo, echo=FALSE}
sessionInfo()
```
# References {.unnumbered}