analysis/index.Rmd

---
title: "Home"
output: 
  html_document:
    toc: TRUE
    toc_depth: 4
    toc_float: FALSE
---

We used this site to collaborate and share our results. Please feel free to explore. The results that made it into the final paper are in the section [Finalizing](#finalizing) below. Here are some useful links:

* [Characterizing and inferring quantitative cell-cycle phase in single-cell RNA-seq data analysis](biorxiv)
* GEO record [GSE121265](http://www.ncbi.nlm.nih.gov/ geo/query/acc.cgi?acc=GSE121265) for the raw FASTQ files
* [GitHub repo](https://github.com/jdblischak/fucci-seq) for code and data


---

# Finalizing

#### [Data description](data-overview.html)
#### [Quality control of scRNA-seq data](#rna-seq-data-preprcessing)
#### [Process images - from images to intensities](images-process.html)
#### [Inspect and correct for the C1 plate effect in fluorescence intensities](images-normalize-anova.html)
#### [Infer an angel for each cell based on FUCCI fluorescence intensities](images-circle-ordering-eval.html)
#### [Estimate the cyclic trends of gene expression levels](npreg-trendfilter-quantile.html)


<!-- * Code for compiling results and plots for the paper: `code/makeresults_paper.R` -->
        
<!-- ## Supervised method learning: finalizing results on holding-an-individual-out scenario -->

<!--     * [Validation sample](method-eval.html) (to be removed...) -->
<!-- * Predicting individual X using expression from random individuals -->
<!--     * [Selecting the top X cyclical genes in the training sample](method-train-ind.html) -->
<!--     * [Learning gene annotations](method-train-ind-genes.html) -->
<!--     * [Validation sample](method-eval-ind.html) (to be removed) -->
<!-- * Analyze the public datasets, including [Leng et al.](method-eval-leng.html), [Bottcher et al.](method-eval-bottcher.html), and [Buettner et al.](method-eval-buettner.html). -->


# Analysis 

## Supervised method learning: finalizing on data holiding out validation samples

* Perform model training on cell times computed in different ways
    * Based on Fucci only
        * Cells from mixed individuals: [Prediction error](method-train-classifiers-all.html), and [functional annotations of genes selected](method-train-classifiers-genes.html)
        * Cells from one individual: [Prediction error](method-train-ind.html), and [functional annotations of genes selected](method-train-ind-genes.html)
    * Based on Fucci and DAPI and derived from trigonometric transformation
        * Cells from mixed individuals: [Prediction error](method-train-triple.html)
        * Cells from one individual: [Prediction error](method-train-ind.html), and [functional annotations of genes selected](method-train-ind-genes.html)
    * Based on FUCCI and DAPI and derived from an algebraic approach
        * [Cells from one individual](method-train-alge.html)


* Compare model training results
    * [Compute and compare cell time estimates](method-labels-compare.html) derived under two different assumptions: assuming equal variance in PC1 and PC2 (circle) vs. unequal variances in PC1 and PC2. 
    * [Prediction error for cells from mixed individuals](method-train-genes-fucci-vs-triple.html): Prediction error smaller when based on fucci time only than based on fucci and dapi time, though only a very small difference. However, the genes selected in these two are similar, both found to have 37 genes enriched for the Cell Cycle GO (0007049).
        
* Compiling results
    * [Compile training results for both training scenarios](method-train-summary-output.html) and [Plotting prediction error margin and genes selected](method-train-summary.html)
    * [Compile results on validation samples](method-validation.html)
    

## Supervised method learning: building the analysis approach

* Early analysis [using trendfilter and evaluating model performance](method-npreg.html)

* Investigate the property and quality of [the cell time labels derived from GFP and RFP](method-labels.html): PVE of intensities by cell times, comparing PVEs by cell times from DAPI and FUCCI vs from FUCCI; comparing prediction before and after removing PC outliers (after is slightly better), and in different sets of genes.

* Investigate the cyclical trends for [each individual](trendfilter-individua.Rmd)

* Investigate the property of [Seurat classes and cell time](method-label-seurat.html) 

* Noisy label analysis
    * PVE in DAPI, GFP, and RFP [before vs after removing noisy labels](method-labels-noisy.html)
    * Comparing results excluding/including noisy training labels on top 10 and top 101 cyclical genes, when [excluding noisy labels from each fold validation, hence unequal sample numbers per fold](method-train-labels.html) as opposed to [excluding noisy labels before making fold valdiation, hence equal nubmer per fold](method-train-labels-equalfold.html)

* Analyze the training datasets
    * Comparing trendfilter order 2 and order 3 with npcirc.nw [on the odd-numbered genes in the top 101 cyclical genes](method-npreg-prelim-results.html)
    * Comparing resuls of the various classifers (supervised, PC, unsupervised) [on top 101 cyclical genes](method-train-classifiers.html), and on [top 10 cyclial genes](method-train-classifiers-top10.html).

* Analyze the withheld samples
    * Evaluate the withheld sample [including noisy labels](method-eval-withheld-explore.html) on the top 10 and top 101 cyclical genes] and comparing peco methods with seurat predictions.
    * Evaluate the withheld samples [removing noisy labels](method-eval-withheld-explore-removenoisy.html) on the top 10 and top 101 cyclical genes and comparing peco methods with seurat predictions.
    

## Approaches to fitting cyclical trend in gene expression data

* Apply spherically projected multivariate linear model
    * [First trying out spml package](model-spml.pilot.html)
    * [To predictd cell times using traing/test samples set-up](method-spml.html)

* Learning about correlation for circular response variable: some simulations to check [its property and also relations with Fisher's z transformation](circ-simulation-correlation.html)
    
* CellcycleR 0.1.6
    * [Model convergence assessment](images-cellcycleR-convergence.html)
    * [Fitting on intensities across plates and individuals](images-cellcycleR.html)
    * [Fitting on Leng data)](cellcycler-seqdata-leng.html)
    * [Fitting on fucci-seq RNA-seq data)](cellcycler-seqdata-fucci.html)
 

## Cell cycle signal in gene expression data

1. We investigated cell cycle signals in the sequencing data alone.
    * Consider [transgene count in sequencing data](images-transgene.html)
    * Compute linear correlation between gene expression levels [with DAPI and FUCCI intensities](images-seq-correlation.html)
  
2. We then assign categorical labels of cell cycle and explored the expresson profiles of these categories. 
    * Cluster samples by [Partition around medoids(PAM)](images-pam.html)
    * Cluster samples by [Guassian mixture modeling](images-mclust.html)
    * Select a subset of samples that are closet to the cluster centers (cluster representives) [using silhouette index](images-subset-silhouette.html)
    * Examine gene pression scores defined by the Macosko paper in the selected cluster representatives [before confounding correction](images-classify-fucci.html) and [after confounding correction](images-classify-fucci-adjusted.html)
    * Examine gene expression scores defined by the Macosk paer [in the sorted cells of the Leng et al. 2015 paper](images-classify-leng.html)

3. We inferred an angle for each cell on a unit circle using FUCCI intensities alone. 
    * First, I used GFP and RFP intensities to estimate a [least-square fit of unit circle](images-circle-ordering.html)
    * I computed the PCs of GFP and RFP and used these to infer an angle for each cell on a unit circle (i.e., polar coordinates), and [to evaluate these results, I computed circular-circular and circular-linear correlation with DAPI and gene expression](images-circle-ordering-eval.html)
    * Here I put together lists of genes identified as significantly correlatd with the PC-based fit [by linear correlation and circular-linear correlation, also cyclical genes by smash](images-circle-ordering-sigcorgenes.html)
  
4. I used nonparametric methods to identify genes that may be cyclical along cell cycle phases.
    * Fit [smash and kernel regression on circular variables](npreg-methods.html) on a subset of genes with detection rate > .8.
    * Fit [trendfilter](npreg-trendfilter-prelim.html) on a subset of genes (5) that are observed (visually) to have cyclical pattern. trendfilter is robust to small proportion of undetected cells, approx 2 or 3%. In cases of simulation when increasing proportion of undetected cells to 20%, we observed a flat line in gene expression for genes previously identified to tend to a cyclical pattern.
    * Next, we fit trendfilter on all genes after transforming the data to follow standard normal distribution, permutation-based p-values for PVE are used to select [101 significant cyclical genes](npreg-trendfilter-quantile.html).
    
5. Additional analysis done to identify top cyclical genes in each individual. The top 5 are not shared across the six individuals. [Results](trendfilter-individual.html)


## RNA-seq data preprcessing

1. The first step in preprocessing RNA-seq data consists of QC and filtering. 
    * Sample QC and filtering
        * [Sample QC criteria](sampleqc.html)  
        * [Sequencing depth](totals.html)  
        * [Reads versus molecules](reads-v-molecules.html)  
    
    * Gene QC and filtering
        * [gene filtering](gene-filtering.html)
        * [PCA with technical fators](pca-tf.html)

2. We then analyzed and corrected for batch effect due to C1 plate in the sequencing data
    * [Estimate variance explained in IBD and correct for batch effects](seqdata-batch-correction.html)


## Microscopy image analysis

* [Processing images - from images to intensities](images-process.html)

We evaluated and pre-processed the results of image analysis as follows:

1. We visually inspect images deteced to have none or more than one nucleus. For cases that are inconsistent with visual inspection, we correct the number of nuclei detected.  
    * [Inspect images with multiple nuclei](images-multiple-nuclei.html)
    * [Inspect images with no nucleus](images-zero-nuclei.html)
        
2. We applied background correction to the intensity measurements of GFP, RFP and DAPI based on the following analyses.  
    * [CONFESS results](confess-prelim.html)
    * [QC analysis including no. nuclei detected, DAPI, and intensity variation](images-qc.html)
    * [Explore using log10 sum pixel intensity for signal metrics](images-qc-followup.html)
    * [Compare correction approaches using median versus mean background](images-metrics.html)
    * [Explore associations between nucleus shape metrics vs intensities](images-metrics-cell-shape.html)

3. We analyzed intensity variation across individuals and batches and considers approaches for removing batch effects in the data.
    * [Visualize signal variation by plate and individual identity](images-qc-labels.html)
    * [Visualize the structure of signal variation by individual identity](images-qc-variation.html)
    * [Quantile normalization for GFP, RFP and DAPI](images-normalize-quantile.html)
    * [Estimate variance explained in IBD and correct for batch effects in intensities](images-normalize-anova.html)

4. We investigated the cell time estimates based on FUCCI intensities.
    * Consider the mean-adjusted FUCCI intensities, what's the relationship between cell time [estimates and DAPI and sample molecule count](images-time-eval.html)? 
    * Consider the quantile-normalized FUCCI intensities versus the mean-adjusted intensities (for adjusting batch effect). What's the differences between cell time estimates from [the two normalization approaches](images-time-compare-normalize.html)?


## One-time investigations

* Why some gene symbols (genes) correspond to [multiple Ensembl IDs?](ensembl.html)

* I selected a set of cell cycle [genes that belong to GO term Cell Cycle and looked at the overlap with the detected genes in our data.](seqdata-select-cellcyclegenes.html)

* [Replicate Seurat example of scoring cell cycle phases](seurat-cellcycle.Rmd)

* [Reproducing some QC analysis and making plots for a seminar talk](talk-20180416.html)

* [Compile codes for the paper in ]