---
title: "Home"
output:
  html_document:
    toc: TRUE
    toc_depth: 4
    toc_float: FALSE
---
We used this site to collaborate and share our results. Please feel free to explore. The results that made it into the final paper are in the section [Finalizing](#finalizing) below. Here are some useful links:
* [Characterizing and inferring quantitative cell-cycle phase in single-cell RNA-seq data analysis](biorxiv)
* GEO record [GSE121265](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121265) for the raw FASTQ files
* [GitHub repo](https://github.com/jdblischak/fucci-seq) for code and data
---
# Finalizing
#### [Data description](data-overview.html)
#### [Quality control of scRNA-seq data](#rna-seq-data-preprcessing)
#### [Process images - from images to intensities](images-process.html)
#### [Inspect and correct for the C1 plate effect in fluorescence intensities](images-normalize-anova.html)
#### [Infer an angle for each cell based on FUCCI fluorescence intensities](images-circle-ordering-eval.html)
#### [Estimate the cyclic trends of gene expression levels](npreg-trendfilter-quantile.html)
<!-- * Code for compiling results and plots for the paper: `code/makeresults_paper.R` -->
<!-- ## Supervised method learning: finalizing results on holding-an-individual-out scenario -->
<!-- * [Validation sample](method-eval.html) (to be removed...) -->
<!-- * Predicting individual X using expression from random individuals -->
<!-- * [Selecting the top X cyclical genes in the training sample](method-train-ind.html) -->
<!-- * [Learning gene annotations](method-train-ind-genes.html) -->
<!-- * [Validation sample](method-eval-ind.html) (to be removed) -->
<!-- * Analyze the public datasets, including [Leng et al.](method-eval-leng.html), [Bottcher et al.](method-eval-bottcher.html), and [Buettner et al.](method-eval-buettner.html). -->
# Analysis
## Supervised method learning: finalizing on data holding out validation samples
* Perform model training on cell times computed in different ways
* Based on FUCCI only
* Cells from mixed individuals: [Prediction error](method-train-classifiers-all.html), and [functional annotations of genes selected](method-train-classifiers-genes.html)
* Cells from one individual: [Prediction error](method-train-ind.html), and [functional annotations of genes selected](method-train-ind-genes.html)
* Based on FUCCI and DAPI and derived from a trigonometric transformation
* Cells from mixed individuals: [Prediction error](method-train-triple.html)
* Cells from one individual: [Prediction error](method-train-ind.html), and [functional annotations of genes selected](method-train-ind-genes.html)
* Based on FUCCI and DAPI and derived from an algebraic approach
* [Cells from one individual](method-train-alge.html)
* Compare model training results
* [Compute and compare cell time estimates](method-labels-compare.html) derived under two different assumptions: assuming equal variance in PC1 and PC2 (circle) vs. unequal variances in PC1 and PC2.
* [Prediction error for cells from mixed individuals](method-train-genes-fucci-vs-triple.html): Prediction error is slightly smaller when based on FUCCI time only than when based on FUCCI and DAPI time. The genes selected in the two cases are similar, and both are found to have 37 genes enriched for the Cell Cycle GO term (GO:0007049).
* Compiling results
* [Compile training results for both training scenarios](method-train-summary-output.html) and [Plotting prediction error margin and genes selected](method-train-summary.html)
* [Compile results on validation samples](method-validation.html)
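
As a rough illustration of the circle versus unequal-variance comparison listed above, the R sketch below infers an angle per cell from the first two PCs of two fluorescence channels in two ways. The first uses the PC scores as they are (treating the point cloud as a circle); the second rescales each PC to unit standard deviation before taking the angle. The simulated intensities, the variable names, and the rescaling step are assumptions for illustration and may differ from what the linked analysis does.

```r
# Simulated intensities for 200 cells (illustrative data only).
set.seed(1)
n <- 200
theta_true <- runif(n, 0, 2 * pi)
gfp <- cos(theta_true) + rnorm(n, sd = 0.2)
rfp <- 0.5 * sin(theta_true) + rnorm(n, sd = 0.2)

# Principal components of the two centered and scaled channels.
pcs <- prcomp(cbind(gfp, rfp), center = TRUE, scale. = TRUE)$x

# Angle on [0, 2*pi) from a pair of coordinates.
to_angle <- function(x, y) atan2(y, x) %% (2 * pi)

# (a) Equal-variance assumption: use the PC scores as they are.
theta_circle <- to_angle(pcs[, 1], pcs[, 2])

# (b) Unequal variances (assumed handling): rescale each PC to unit SD first.
theta_scaled <- to_angle(pcs[, 1] / sd(pcs[, 1]), pcs[, 2] / sd(pcs[, 2]))

# Circular difference between the two estimates, in [-pi, pi].
circ_diff <- atan2(sin(theta_circle - theta_scaled),
                   cos(theta_circle - theta_scaled))
summary(abs(circ_diff))
```

The circular difference is used instead of a plain subtraction so that angles near 0 and near 2π are treated as close.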
## Supervised method learning: building the analysis approach
* Early analysis [using trendfilter and evaluating model performance](method-npreg.html)
* Investigate the properties and quality of [the cell time labels derived from GFP and RFP](method-labels.html): PVE of intensities by cell times; comparison of PVEs by cell times from DAPI and FUCCI vs. from FUCCI alone; comparison of prediction before and after removing PC outliers (slightly better after), and in different sets of genes.
* Investigate the cyclical trends for [each individual](trendfilter-individua.Rmd)
* Investigate the property of [Seurat classes and cell time](method-label-seurat.html)
* Noisy label analysis
* PVE in DAPI, GFP, and RFP [before vs after removing noisy labels](method-labels-noisy.html)
* Comparing results excluding/including noisy training labels on top 10 and top 101 cyclical genes, when [excluding noisy labels from each validation fold, hence unequal sample numbers per fold](method-train-labels.html) as opposed to [excluding noisy labels before making validation folds, hence equal number per fold](method-train-labels-equalfold.html)
* Analyze the training datasets
* Comparing trendfilter order 2 and order 3 with npcirc.nw [on the odd-numbered genes in the top 101 cyclical genes](method-npreg-prelim-results.html)
* Comparing results of the various classifiers (supervised, PC, unsupervised) [on top 101 cyclical genes](method-train-classifiers.html), and on [top 10 cyclical genes](method-train-classifiers-top10.html).
* Analyze the withheld samples
* Evaluate the withheld samples [including noisy labels](method-eval-withheld-explore.html) on the top 10 and top 101 cyclical genes and compare peco methods with Seurat predictions.
* Evaluate the withheld samples [removing noisy labels](method-eval-withheld-explore-removenoisy.html) on the top 10 and top 101 cyclical genes and compare peco methods with Seurat predictions.
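
The two ways of handling noisy labels in cross-validation mentioned above lead to different fold sizes. The R sketch below shows why, using a hypothetical logical flag `noisy` and simulated cells; the fold count, the noise rate, and the helper `make_folds` are assumptions for illustration, not taken from the linked analyses.

```r
set.seed(1)
n_cells <- 100
n_folds <- 5
noisy <- runif(n_cells) < 0.15   # hypothetical flag marking noisy labels

# Balanced random fold labels for a given number of cells.
make_folds <- function(n, k) sample(rep_len(seq_len(k), n))

# Scheme 1: assign folds first, then drop noisy cells within each fold.
# Fold sizes depend on how many noisy cells landed in each fold.
folds_all <- make_folds(n_cells, n_folds)
sizes_drop_within <- table(folds_all[!noisy])

# Scheme 2: drop noisy cells first, then assign folds to the remainder.
# Fold sizes differ by at most one cell.
folds_clean <- make_folds(sum(!noisy), n_folds)
sizes_drop_first <- table(folds_clean)

sizes_drop_within
sizes_drop_first
```

Both schemes keep the same cells overall; only the balance across folds changes.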
## Approaches to fitting cyclical trend in gene expression data
* Apply spherically projected multivariate linear model
* [First trying out spml package](model-spml.pilot.html)
* [Predicting cell times using a training/test sample set-up](method-spml.html)
* Learning about correlation for circular response variable: some simulations to check [its property and also relations with Fisher's z transformation](circ-simulation-correlation.html)
* CellcycleR 0.1.6
* [Model convergence assessment](images-cellcycleR-convergence.html)
* [Fitting on intensities across plates and individuals](images-cellcycleR.html)
* [Fitting on Leng data](cellcycler-seqdata-leng.html)
* [Fitting on fucci-seq RNA-seq data](cellcycler-seqdata-fucci.html)
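
For the correlation with a circular variable mentioned in the list above, one common definition is the circular-linear correlation built from the correlations of the linear variable with the cosine and sine of the angle. The R sketch below implements that definition together with Fisher's z transformation (`atanh`). The function names and simulated data are assumptions for illustration; the linked simulation page may use a different estimator.

```r
# Circular-linear correlation between an angle theta and a linear variable x.
circ_lin_cor <- function(theta, x) {
  r_xc <- cor(x, cos(theta))
  r_xs <- cor(x, sin(theta))
  r_cs <- cor(cos(theta), sin(theta))
  r2 <- (r_xc^2 + r_xs^2 - 2 * r_xc * r_xs * r_cs) / (1 - r_cs^2)
  sqrt(r2)
}

# Fisher's z transformation of a correlation coefficient.
fisher_z <- function(r) atanh(r)

# Simulated data (illustrative only).
set.seed(1)
theta <- runif(300, 0, 2 * pi)
x_cyclic <- cos(theta - 1) + rnorm(300, sd = 0.5)  # depends on the angle
x_null <- rnorm(300)                                # unrelated to the angle

r_cyclic <- circ_lin_cor(theta, x_cyclic)
r_null <- circ_lin_cor(theta, x_null)
c(cyclic = r_cyclic, null = r_null)
fisher_z(c(r_cyclic, r_null))
```

Unlike a Pearson correlation, this coefficient lies between 0 and 1, so it measures strength of association but not direction.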
## Cell cycle signal in gene expression data
1. We investigated cell cycle signals in the sequencing data alone.
* Consider [transgene count in sequencing data](images-transgene.html)
* Compute linear correlation between gene expression levels [with DAPI and FUCCI intensities](images-seq-correlation.html)
2. We then assigned categorical cell cycle labels and explored the expression profiles of these categories.
* Cluster samples by [Partitioning around medoids (PAM)](images-pam.html)
* Cluster samples by [Gaussian mixture modeling](images-mclust.html)
* Select a subset of samples that are closest to the cluster centers (cluster representatives) [using silhouette index](images-subset-silhouette.html)
* Examine gene expression scores defined by the Macosko paper in the selected cluster representatives [before confounding correction](images-classify-fucci.html) and [after confounding correction](images-classify-fucci-adjusted.html)
* Examine gene expression scores defined by the Macosko paper [in the sorted cells of the Leng et al. 2015 paper](images-classify-leng.html)
3. We inferred an angle for each cell on a unit circle using FUCCI intensities alone.
* First, I used GFP and RFP intensities to estimate a [least-squares fit of a unit circle](images-circle-ordering.html)
* I computed the PCs of GFP and RFP and used these to infer an angle for each cell on a unit circle (i.e., polar coordinates), and [to evaluate these results, I computed circular-circular and circular-linear correlation with DAPI and gene expression](images-circle-ordering-eval.html)
* Here I put together lists of genes identified as significantly correlated with the PC-based fit [by linear correlation and circular-linear correlation, and also cyclical genes identified by smash](images-circle-ordering-sigcorgenes.html)
4. I used nonparametric methods to identify genes that may be cyclical along cell cycle phases.
* Fit [smash and kernel regression on circular variables](npreg-methods.html) on a subset of genes with detection rate > .8.
* Fit [trendfilter](npreg-trendfilter-prelim.html) on a subset of genes (5) that appear (visually) to have a cyclical pattern. trendfilter is robust to a small proportion of undetected cells, approximately 2 or 3%. In simulations that increased the proportion of undetected cells to 20%, we observed a flat line in gene expression for genes previously identified as tending toward a cyclical pattern.
* Next, we fit trendfilter on all genes after transforming the data to follow a standard normal distribution. Permutation-based p-values for PVE are used to select [101 significant cyclical genes](npreg-trendfilter-quantile.html).
5. We ran an additional analysis to identify the top cyclical genes in each individual. The top 5 are not shared across the six individuals. [Results](trendfilter-individual.html)
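
The gene selection step in item 4 above (transform to standard normal, fit a cyclic trend, compute PVE, and assess it by permutation) can be sketched in base R as follows. This is a minimal sketch under stated assumptions: a first-order cosine regression stands in for trendfilter, the rank-based transform is one possible way to obtain standard normal values, and all function names and simulated data are hypothetical.

```r
# Rank-based transform of expression values to standard normal quantiles
# (assumed form of the normal transformation).
to_normal <- function(y) {
  qnorm((rank(y, ties.method = "average") - 0.5) / length(y))
}

# Proportion of variance explained (PVE) by a cyclic fit along the angle.
# A cosine regression is a stand-in for trendfilter in this sketch.
pve_cyclic <- function(y, theta) {
  fit <- lm(y ~ cos(theta) + sin(theta))
  1 - var(residuals(fit)) / var(y)
}

# Permutation p-value: shuffle expression across cells to break the link
# between expression and angle, then recompute the PVE each time.
perm_pvalue <- function(y, theta, n_perm = 500) {
  observed <- pve_cyclic(y, theta)
  null_pve <- replicate(n_perm, pve_cyclic(sample(y), theta))
  (1 + sum(null_pve >= observed)) / (1 + n_perm)
}

# Simulated genes (illustrative only): one cyclic, one flat.
set.seed(1)
theta <- runif(150, 0, 2 * pi)
y_cyclic <- to_normal(2 + sin(theta) + rnorm(150, sd = 0.7))
y_flat <- to_normal(rnorm(150))

p_cyclic <- perm_pvalue(y_cyclic, theta)
p_flat <- perm_pvalue(y_flat, theta)
c(cyclic = p_cyclic, flat = p_flat)
```

Adding 1 to both the numerator and the denominator keeps the p-value strictly positive, so the smallest attainable value is 1 / (n_perm + 1).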
## RNA-seq data preprocessing {#rna-seq-data-preprcessing}
1. The first step in preprocessing RNA-seq data consists of QC and filtering.
* Sample QC and filtering
* [Sample QC criteria](sampleqc.html)
* [Sequencing depth](totals.html)
* [Reads versus molecules](reads-v-molecules.html)
* Gene QC and filtering
* [gene filtering](gene-filtering.html)
* [PCA with technical factors](pca-tf.html)
2. We then analyzed and corrected for batch effect due to C1 plate in the sequencing data
* [Estimate variance explained in IBD and correct for batch effects](seqdata-batch-correction.html)
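
To make the batch-correction step above more concrete, the R sketch below estimates the share of variance attributable to individual and to plate with a linear model, then subtracts the fitted plate effect. The design (6 plates, each holding 2 of 4 individuals), the simulated values, and the fixed-effects model are assumptions for illustration; the linked analysis may use a different model.

```r
set.seed(1)
# Hypothetical design: 6 plates, each holding cells from 2 of 4 individuals.
design <- data.frame(
  plate = factor(rep(paste0("P", 1:6), each = 20)),
  individual = factor(rep(c("A", "B", "A", "C", "A", "D",
                            "B", "C", "B", "D", "C", "D"), each = 10))
)
plate_shift <- rnorm(6, sd = 0.5)
ind_shift <- c(A = 0, B = 0.3, C = -0.3, D = 0.6)
design$y <- plate_shift[as.integer(design$plate)] +
  unname(ind_shift[as.character(design$individual)]) +
  rnorm(nrow(design), sd = 0.4)

# Fit individual and plate effects jointly.
fit <- lm(y ~ individual + plate, data = design)

# Share of total variation per term (sequential sums of squares).
aov_tab <- anova(fit)
pve <- setNames(aov_tab[["Sum Sq"]] / sum(aov_tab[["Sum Sq"]]),
                rownames(aov_tab))
pve

# Remove the fitted (centered) plate contribution from each observation.
plate_term <- predict(fit, type = "terms", terms = "plate")
design$y_corrected <- design$y - as.vector(plate_term)

# After correction, plate explains essentially no remaining variation.
anova(lm(y_corrected ~ individual + plate, data = design))
```

Because each plate holds only a subset of individuals, plate and individual effects are partly confounded, which is why they are fit together before the plate term is removed.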
## Microscopy image analysis
* [Processing images - from images to intensities](images-process.html)
We evaluated and pre-processed the results of image analysis as follows:
1. We visually inspected images detected to have zero or more than one nucleus. Where the detection was inconsistent with visual inspection, we corrected the number of nuclei detected.
* [Inspect images with multiple nuclei](images-multiple-nuclei.html)
* [Inspect images with no nucleus](images-zero-nuclei.html)
2. We applied background correction to the intensity measurements of GFP, RFP and DAPI based on the following analyses.
* [CONFESS results](confess-prelim.html)
* [QC analysis including no. nuclei detected, DAPI, and intensity variation](images-qc.html)
* [Explore using log10 sum pixel intensity for signal metrics](images-qc-followup.html)
* [Compare correction approaches using median versus mean background](images-metrics.html)
* [Explore associations between nucleus shape metrics vs intensities](images-metrics-cell-shape.html)
3. We analyzed intensity variation across individuals and batches and considered approaches for removing batch effects in the data.
* [Visualize signal variation by plate and individual identity](images-qc-labels.html)
* [Visualize the structure of signal variation by individual identity](images-qc-variation.html)
* [Quantile normalization for GFP, RFP and DAPI](images-normalize-quantile.html)
* [Estimate variance explained in IBD and correct for batch effects in intensities](images-normalize-anova.html)
4. We investigated the cell time estimates based on FUCCI intensities.
* Considering the mean-adjusted FUCCI intensities, what is the relationship between cell time [estimates and DAPI and sample molecule count](images-time-eval.html)?
* Considering the quantile-normalized FUCCI intensities versus the mean-adjusted intensities (for adjusting batch effect), what are the differences between cell time estimates from [the two normalization approaches](images-time-compare-normalize.html)?
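
The background correction in step 2 above compares median and mean background levels and uses a log10 sum of pixel intensities as the signal metric. The R sketch below shows one way such a metric could be computed on a simulated single-channel image. The image, the nucleus mask, and the function `corrected_signal` are assumptions for illustration and are not the exact metric used in the linked analyses.

```r
set.seed(1)
# Simulated 50 x 50 image with a bright square "nucleus" (illustrative only).
img <- matrix(rnorm(2500, mean = 100, sd = 5), nrow = 50)
mask <- matrix(FALSE, nrow = 50, ncol = 50)
mask[20:30, 20:30] <- TRUE
img[mask] <- img[mask] + 80
# A few very bright background pixels shift the mean more than the median.
img[1:3, 1:3] <- 1000

# log10 of the summed foreground intensity after subtracting a background
# level, estimated from pixels outside the mask with the chosen statistic.
corrected_signal <- function(img, mask, bg_stat = median) {
  background <- bg_stat(img[!mask])
  log10(sum(img[mask] - background))
}

c(median = corrected_signal(img, mask, median),
  mean = corrected_signal(img, mask, mean))
```

In this simulated case the median is less affected by the outlying background pixels, so the median-based signal is slightly higher than the mean-based one.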
## One-time investigations
* Why do some gene symbols (genes) correspond to [multiple Ensembl IDs?](ensembl.html)
* I selected a set of cell cycle [genes that belong to GO term Cell Cycle and looked at the overlap with the detected genes in our data.](seqdata-select-cellcyclegenes.html)
* [Replicate Seurat example of scoring cell cycle phases](seurat-cellcycle.Rmd)
* [Reproducing some QC analysis and making plots for a seminar talk](talk-20180416.html)
* [Compile codes for the paper in ]