/
processing.Rmd
355 lines (248 loc) · 15.1 KB
/
processing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
---
title: "Setting up for Differential Gene Expression Analysis"
author: "Nhi Hin"
date: "2021-06-28"
output: workflowr::wflow_html
editor_options:
chunk_output_type: console
---
```{r Load-Packages, include=FALSE, warning=FALSE, error=FALSE}
# Working with data:
library(dplyr)
library(magrittr)
library(readr)
library(tibble)
library(reshape2)
# Visualisation:
library(kableExtra)
library(ggplot2)
library(ggbiplot)
library(ggrepel)
library(grid)
# Set ggplot theme
theme_set(theme_bw())
# Other packages:
library(here)
library(export)
# Bioconductor packages:
library(AnnotationHub)
library(edgeR)
library(limma)
library(Glimma)
```
## Summary
- On this page, we will prepare the data for the differential gene expression analysis. This involves:
1. Importing in the gene counts and sample metadata.
2. Obtaining gene annotations for the genes detected in this experiment.
3. Creating a `DGEList` object to store the gene counts, sample metadata, and gene annotations in a single object.
4. Dimension reduction to check how similar samples are, as well as whether there are any mislabeled samples.
5. Export out `DGEList` for use in the [differential gene expression analysis](de.html).
## Data Description
- The dataset we will be using in this analysis comes from a recent study published in [Cell](https://www.cell.com/iscience/pdf/S2589-0042(21)00679-9.pdf).
- The data has been uploaded onto the public repository GEO with the [GSE171110](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE171110) accession number, and I've copied the description of the data the authors gave there below:
<blockquote>
The identification of COVID-19 patients with high-risk of severe disease is a challenge in routine care. We performed blood RNA-seq gene expression analyses in severe hospitalized patients compared to healthy donors. Supervised and unsupervised analyses revealed a high abundance of CD177, a specific neutrophil activation marker, contributing to the clustering of severe patients. Gene abundance correlated with high serum levels of CD177 in severe patients. These results highlight neutrophil activation as a hallmark of severe disease and CD177 assessment as a reliable prognostic marker for routine care.
</blockquote>
- The data was pre-processed by the authors of the [Cell](https://www.cell.com/iscience/pdf/S2589-0042(21)00679-9.pdf) paper, by aligning .fastq files to the human `hg18` transcriptome using [Salmon](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5600148/). We will not be covering the pre-processing in this workshop due to time constraints, although I would highly encourage those interested to bookmark the [Salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html#using-salmon) which goes through the commands needed to run Salmon on .fastq files yourself.
## Import Data
- The first step is to import the gene counts and sample metadata data into R.
### Gene count matrix
- The raw gene counts after RNA-seq data pre-processing were downloaded from GEO [GSE171110](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE171110) under the "Supplementary Files". The `GSE171110_Data_RNAseq_raw_counts_geo.txt.gz` supplementary file was downloaded and uncompressed into the `data` directory, and the resulting tab-delimited text file (containing gene counts) is imported below.
```{r importGeneCounts, message=FALSE, warning=FALSE}
counts <- readr::read_tsv(here("data",
"GSE171110_Data_RNAseq_raw_counts_geo.txt"),
skip = 2,
col_names = TRUE) %>%
as.data.frame %>%
tibble::column_to_rownames("Code_Patient")
```
- By previewing the `counts` object below, we can see that each row is a gene and each column is a sample. The `counts` object is a type of object in R called a `data.frame` which you can basically think of as being like a spreadsheet but in R.
```{r message=FALSE, warning=FALSE}
# Preview first 5 rows and columns
counts[1:5,1:5]
```
- The `nrow` and `ncol` functions can be used to check how many rows and columns are in a `data.frame` object:
```{r message=FALSE, warning=FALSE}
nrow(counts)
ncol(counts)
```
#### Questions
1. How many genes are in this dataset?
2. How many Healthy and Covid-19 samples are in this dataset?
### Sample Metadata Table
- We import the sample metadata from the table stored in the CSV file at `data/sample_table.csv` and store it in an object called `samples`:
```{r}
samples <- readr::read_csv(here("data", "sample_table.csv")) %>%
as.data.frame
```
- As we can see in the preview below, each row in the `samples` `data.frame` is a sample, and the columns represent different characteristics/information about the samples, including `age`, `sex`, and `group` (Healthy or Covid-19).
```{r}
samples %>%
kable %>%
kable_styling()
```
### Readable sample names
- You may have noticed when previewing the gene counts matrix stored in `counts` using `head(counts)` earlier that the sample names are not very descriptive (eg. 507-V, 1189-V etc.). These correspond to the `patient_code` column in the sample metadata table above.
- Below, we will replace these `patient_code` names in the gene counts with readable sample names so that the rest of the analysis is easier to interpret, especially when visualising the data.
- First, we create a new column in the `samples` metadata table, called `sample`, that will contain readable sample names (the `group` they are in followed by a number):
```{r}
samples <- samples %>%
dplyr::mutate(sample = paste0(group, "_", c(1:10, 1:44)))
rownames(samples) <- samples$sample
# Preview the samples table with the added column:
samples %>%
kable %>%
kable_styling()
```
- We will then set the column names in `counts` (corresponding to sample names) to the new readable sample names:
```{r}
# Reorder the columns in `counts` so they are in the same order as the rows in the samples table
counts <- counts[, samples$patient_code]
# Set the colnames to the readable sample names
colnames(counts) <- samples$sample
head(counts)
```
## Gene annotations
- Gene annotations are useful for later interpreting the results of the differential gene expression analysis. They give information about genes, including genomic location (e.g. start and end coordinates), descriptions, type of gene (e.g. whether it's a non-coding or protein-coding gene) etc.
- Below, we retrieve gene annotation metadata from the Ensembl database (this step can take a while to run if running `AnnotationHub` for the first time, so I have saved out the `genes` object as an R object).
```{r eval=FALSE}
# Below is not run in this RMarkdown document:
# Load annotation hub
ah <- AnnotationHub()
# Search for the correct Ensembl gene annotation
ah %>%
subset(grepl("sapiens", species)) %>%
subset(rdataclass == "EnsDb")
ensDb <- ah[["AH78783"]]
# Get gene annotations
genes <- genes(ensDb) %>%
as.data.frame
# genes %>% saveRDS(here("data", "R", "genes.rds"))
```
```{r}
# I load in a saved version below since the above code can take a while to run
# if AnnotationHub objects are not already pre-downloaded.
genes <- readRDS(here("data", "R", "genes.rds"))
```
- The `genes` data.frame is prepared so that the `row.names` (corresponding to Ensembl gene IDs) are in the same order as for the gene count matrix stored in `counts`.
```{r message=FALSE, warning=FALSE}
genes <- data.frame(gene = rownames(counts)) %>%
left_join(genes %>% as.data.frame,
by = c("gene"="symbol")) %>%
dplyr::distinct(gene, .keep_all=TRUE)
rownames(genes) <- genes$gene
```
- From previewing the gene annotations below, we can see that each row is a gene and each column contains information about the gene.
```{r}
genes %>%
head(50) %>%
kable %>%
kable_styling %>%
scroll_box(height = "600px")
```
- The columns of information about the genes include:
- **seqnames**: Chromosome the gene is located on
- **start**: Start coordinate
- **end**: End coordinate
- **width**: Length of gene (bp)
- **strand** Strand the gene is located on
- **gene_id**: Ensembl gene identifier
- **gene_name**: Gene symbol
- **gene_biotype**: Type of gene
- **seq_coord_system**: Whether the gene is located on a chromosome or scaffold
- **description**: Short description of the gene / what it encodes
- **gene_id_version**: Version of the gene model in Ensembl
- **entrezid**: Entrez gene identifier
- This information can help to put the differential gene expression analysis results in context later. For example, in some analyses, people may wish to focus on looking at particular `gene_biotype`s (e.g. protein-coding genes).
- By using the `table` function, we can count the number of genes belonging to each `gene_biotype` in this particular dataset.
```{r}
genes$gene_biotype %>%
table
```
### Questions
3. What percentage of genes in this dataset are `protein_coding`?
## Create DGEList object
- We will store the gene counts, sample information, and gene information in a `DGEList` (Digital Gene Expression) object called `dge`, which simplifies plotting and interacting with this information. It will also be used for the [differential gene expression analysis](de.html) later on.
- The `DGEList` object is prepared as follows. Here, we remove all genes that have 0 counts across all samples, and apply Trimmed Mean of M-values (TMM) normalisation to normalise counts according to library size differences between samples.
```{r createDGEList}
dge <- DGEList(counts = counts,
genes = genes,
samples = samples,
remove.zeros = TRUE) %>%
calcNormFactors("TMM")
# Preview the DGEList object
dge
```
- The `counts`, `genes` or `samples` can be accessed from the `dge` object using the `$` operator. This is exactly the same information as that stored in the original `counts`, `samples` and `genes` objects, but putting them together in the `dge` object makes it easier to do certain plots/analysis later.
```{r}
# Preview first 5 rows and columns of the counts stored in `dge`:
dge$counts[1:5, 1:5]
# Preview first 5 rows and columns of the samples stored in `dge`:
dge$samples[1:5, 1:5]
# Preview first 5 rows and columns of the genes stored in `dge`:
dge$genes[1:5, 1:5]
```
## Quality Checks
- In this section we will do some basic quality checks to see if any samples are outliers or potentially mislabeled.
### Library Sizes of samples
- Overall, the library sizes of samples are shown. Sometimes if there is an outlier sample with an abnormally large or small library size, it is worth keeping an eye on in case it is a failed sample (e.g. library prep went wrong). We can see that certain samples have a larger library size below, but overall they are quite comparable.
```{r libSizes}
dge$samples %>%
ggplot(aes(x = sample, y = lib.size, fill = group)) +
geom_bar(stat = "identity") +
labs(y = "Library size (total number of mapped and quantified reads)",
x = "Sample", fill = "Group") +
coord_flip()
```
### Dimension Reduction
#### Multi Dimensional Scaling (MDS)
- Multi Dimensional Scaling (MDS) is a dimension reduction technique that summarises expression of the top 500 most variable genes expressed in each sample down to 2 dimensions, known as `Dimension 1` and `Dimension 2`. Dimension 1 represents the largest amount of variance that can be explained in the data. Plotting these principal components can help us visualise which samples show overall similar gene expression to each other -- similar samples will be situated closer to each other on the plot.
- Because our information is stored in a `DGEList` type of object, we can make use of the `glMDSPlot` function from the *Glimma* package to make an interactive MDS plot. The MDS plot produced using the function below can be viewed by navigating to the `glimma-plots` directory and clicking on the MDS-Plot.html file, or by clicking [here](assets/glimma-plots/MDS-Plot.html).
```{r}
glMDSPlot(dge,
labels = dge$samples$sample,
groups = dge$samples$group)
```
#### Principal Component Analysis (PCA)
- Principal component analysis (PCA) is similar to MDS, but it uses all genes (`r nrow(dge)` genes for this particular dataset) to summarise down into a few principal components. Plotting these in the same way as for an MDS plot can also give us information about which samples are similar to each other. In most cases, the MDS and PCA plots will look very similar.
- Below, the PCA plot created using the `ggbiplot` function shows that overall, samples belonging to the same `group` tend to have similar gene expression patterns, and there is quite a large difference between `Healthy` and `Covid-19` samples. There do not appear to be any obvious outlier samples, although there is quite a bit of variation in gene expression amongst `Covid-19` samples.
- We will not remove any samples in this particular dataset as there is no indication so far that any of the samples are obvious outliers.
```{r pca}
# Perform PCA analysis:
pca_analysis <- prcomp(t(cpm(dge, log=TRUE)))
summary(pca_analysis)
# Create the plot:
pca_plot <- ggbiplot::ggbiplot(pca_analysis,
groups = dge$samples$group,
ellipse = TRUE,
var.axes = FALSE)
pca_plot
```
- We also have other sample metadata including `age` and `sex` of the samples in the `dge$samples` table. These variables can also be plotted on the PCA plot by changing what is specified in the `groups` argument below, so that we can see whether samples are grouping by age or sex. Doing this is useful as it can tell us whether these variables are [confounding variables](https://en.wikipedia.org/wiki/Confounding) that have the potential to obscure real biological effects.
- Below, we can see that the ages of patients from which samples were derived ranges from 51 to 61. It does appear that some of the older patients' samples are closer to the top of the plot.
```{r}
ggbiplot::ggbiplot(pca_analysis,
groups = dge$samples$age,
ellipse = FALSE,
var.axes = FALSE)
```
- Below, we can see that `sex` of the samples does not seem to really influence where the points lie on the plot, suggesting that sex does not have a noticeable effect on gene expression of the samples here.
```{r}
ggbiplot::ggbiplot(pca_analysis,
groups = dge$samples$sex,
ellipse = FALSE,
var.axes = FALSE)
```
### Questions
4. Should we treat male and female samples differently in the analysis, based on the information in the MDS/PCA plots?
5. Should we do anything about the older patients' samples, based on the information in the MDS/PCA plots?
## Next Step - Differential gene expression analysis
- Overall the data quality looks good and there do not appear to be outlier samples or mislabeled samples.
- We will export the DGEList object created in this document for the next step of the analysis, [differential gene expression analysis](de.html). The `dge` object is exported into the `data/R` directory.
```{r eval=FALSE}
r.dir <- here("data", "R")
ifelse(!dir.exists(r.dir),
dir.create(r.dir),
FALSE)
dge %>% saveRDS(file.path(r.dir, "dge.rds"))
```