/
data-overview.Rmd
185 lines (122 loc) · 5.88 KB
/
data-overview.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
---
title: "Data overview"
author: "Joyce Hsiao"
output:
html_document:
toc: TRUE
toc_float: FALSE
---
<!-- The file analysis/chunks.R contains chunks that define default settings
shared across the workflowr files. -->
```{r read-chunk, include=FALSE, cache=FALSE}
knitr::read_chunk("chunks.R")
```
<!-- Update knitr chunk options -->
```{r knitr-opts-chunk, include=FALSE}
```
<!-- Insert the date the file was last updated -->
```{r last-updated, echo=FALSE, results='asis'}
```
<!-- Insert the code version (Git commit SHA1) if Git repository exists and R
package git2r is installed -->
```{r code-version, echo=FALSE, results='asis'}
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning=FALSE)
```
---
## Packages
```{r}
library(Biobase)
```
---
## Overview
We collected two types of data for each single cell sample: single-cell RNA-seq using C1 plates and FUCCI image intensity data.
* Raw RNA-seq data: `data/eset-raw.rds`
* FUCCI intensity data: `data/intensity.rds`
* Filtered RNA-seq data: `output/gene-filtering.Rmd/eset-gene-filtering.rds`
* Final data combining filtered intensity and RNA-seq, including 12792 genes and 990 samples: `data/eset-final.rds`
---
## FUCCI intensity data
* Combined intensity data are stored in `data/intensity.rds`. These include samples that were identified to have a single nuclei .
* Data generated by [combine-intensity-data.R][data/combine-intensity-data.R]. Combining image analysis output stored in `/project2/gilad/fucci-seq/intensities_stats/` into one `data.frame` and computes summary statistics, including background-corrected RFP and GFP intensity measures.
```{r}
ints <- readRDS(file="../data/intensity.rds")
colnames(ints)
```
---
## Sequencing data
* Raw data from each C1 plate are stored separatley in `data/eset/` by experiment (batch) ID.
* Raw data combining C1 plate are stored in `data/eset-raw.rds`.
* Filtered raw data excluding low-quality sequencing samples and genes that are lowly expressed or overly expressed are stored in `output/gene-filtering.Rmd/eset-gene-filtering.rds` (both eSet and CPM data.table).
```{r}
eset_filtered <- readRDS("../output/gene-filtering.Rmd/eset-gene-filtering.rds")
```
---
## Match intensity with sequencing data
We match labels in the intensity data and in the sequencing data. 990 samples are quantified in both datasets.
```{r}
pdata <- pData(eset_filtered)
pdata$unique <- paste(pdata$image_individual, sprintf("%05d", pdata$image_label), sep="_")
sample_include_bothdata <- intersect(ints$unique, pdata$unique)
length(sample_include_bothdata)
```
Make a combined eSet object include both FUCCI intensity data and RNA-seq data.
```{r, eval = FALSE}
ints_combo <- ints[which(ints$unique %in% sample_include_bothdata), ]
eset_combo <- new("ExpressionSet",
exprs = exprs(eset_filtered)[,which(pdata$unique %in% sample_include_bothdata)],
phenoData = phenoData(eset_filtered)[which(pdata$unique %in% sample_include_bothdata), ],
featureData = featureData(eset_filtered))
pdata_combo <- pData(eset_combo)
pdata_combo$unique <- paste(pdata_combo$image_individual,
sprintf("%05d", pdata_combo$image_label), sep="_")
all.equal(ints_combo$unique, pdata_combo$unique)
```
Import estimated cell time. Computed using cell times adjusted for plate effect `../output/images-normalize-anova.Rmd/pdata.adj.rds`. See `../docs/npreg.html` for methods.
```{r}
theta <- readRDS(file = "../output/npreg.Rmd/theta.rds")
```
Make variable labels.
```{r, eval = FALSE}
pdata_table <- rbind(varMetadata(phenoData(eset_combo)),
"mCherry background-corrected intensity (log10sum)",
"EGFP background-corrected intensity (log10sum)",
"DAPI background-corrected intensity (log10sum)",
"nucleus size",
"nucleus perimeter",
"nucleus eccentricity",
"cell time")
rownames(pdata_table) <- c(rownames(varMetadata(phenoData(eset_combo))),
"rfp.median.log10sum",
"gfp.median.log10sum",
"dapi.median.log10sum",
"size", "perimeter", "eccentricity",
"theta")
phenoData(eset_combo) <- new("AnnotatedDataFrame",
data = data.frame(pData(eset_combo),
rfp.median.log10sum=ints_combo$rfp.median.log10sum,
gfp.median.log10sum=ints_combo$gfp.median.log10sum, dapi.median.log10sum=ints_combo$dapi.median.log10sum,
size=ints_combo$size,
perimeter=ints_combo$perimeter,
eccentricity=ints_combo$eccentricity,
theta=theta),
varMetadata = pdata_table)
saveRDS(eset_combo, file = "../data/eset-final.rds")
```
---
## Access expressionSets
We store feature-level (gene) read count and molecule count in `expressionSet` (`data/eset`) objects, which also contain sample metadata (e.g., assigned indivdual ID, cDNA concentraion) and quality filtering criteria (e.g., number of reads mapped to FUCCI transgenes, ERCC conversion rate). Data from different C1 plates are stored in separate `eset` objects:
To combine `eset` objects from the different C1 plates:
`eset <- Reduce(combine, Map(readRDS, Sys.glob("data/eset/*.rds")))`
To access data stored in `expressionSet`:
* `exprs(eset)`: access count data, 20,421 features by 1,536 single cell samples.
* `pData(eset)`: access sample metadata. Returns data.frame of 1,536 samples by 43 labels. Use `varMetadata(phenoData(eset))` to view label descriptions.
* `fData(eset)`: access feature metadata. Returns data.frame of 20,421 features by 6 labels. Use `varMetadata(featureData(eset))` to view label descriptions.
* `varMetadata(phenoData(eset))`: view the sample metadata labels.
* `varMetadata(featureData(eset))`: view the feature (gene) metadata labels.
---
## Session information
```{r, echo = FALSE}
sessionInfo()
```