Merge pull request #696 from maxplanck-ie/dev_ksikora2

scRNAseq
maxplanck-ie · Sep 29, 2020 · 1288303 · 1288303
2 parents 5477d67 + 8b49379
commit 1288303
Show file tree

Hide file tree

Showing 16 changed files with 489 additions and 156 deletions.
diff --git a/.ci_stuff/test_dag.sh b/.ci_stuff/test_dag.sh
@@ -225,14 +225,18 @@ if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 503 ]; then exit 1 ; fi
 
 
 # scRNA-seq
-WC=`scRNAseq -i PE_input -o output --mode Gruen --snakemakeOptions " --dryrun --conda-prefix /tmp" .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v "Conda environment" | sed '/^\s*$/d' | wc -l`
-if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 1038 ]; then exit 1 ; fi
-WC=`scRNAseq -i PE_input -o output --mode Gruen --snakemakeOptions " --dryrun --conda-prefix /tmp" --skipRaceID --splitLib .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v "Conda environment" | sed '/^\s*$/d' | wc -l`
-if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 1015 ]; then exit 1 ; fi
+#WC=`scRNAseq -i PE_input -o output --mode Gruen --snakemakeOptions " --dryrun --conda-prefix /tmp" .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v "Conda environment" | sed '/^\s*$/d' | wc -l`
+#if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 1038 ]; then exit 1 ; fi
+#WC=`scRNAseq -i PE_input -o output --mode Gruen --snakemakeOptions " --dryrun --conda-prefix /tmp" --skipRaceID --splitLib .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v "Conda environment" | sed '/^\s*$/d' | wc -l`
+#if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 1015 ]; then exit 1 ; fi
 WC=`scRNAseq -i PE_input -o output --mode STARsolo --snakemakeOptions " --dryrun --conda-prefix /tmp" .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v "Conda environment" | sed '/^\s*$/d' | wc -l`
 if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 965 ]; then exit 1 ; fi
+WC=`scRNAseq -i PE_input -o output --mode STARsolo --skipVelocyto --snakemakeOptions " --dryrun --conda-prefix /tmp" .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v "Conda environment" | sed '/^\s*$/d' | wc -l`
+if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 857 ]; then exit 1 ; fi
 WC=`scRNAseq -i PE_input -o output --mode Alevin --snakemakeOptions " --dryrun --conda-prefix /tmp" .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v "Conda environment" | sed '/^\s*$/d' | wc -l`
-if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 337 ]; then exit 1 ; fi
+if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 412 ]; then exit 1 ; fi
+WC=`scRNAseq -i PE_input -o output --mode Alevin --skipVelocyto --snakemakeOptions " --dryrun --conda-prefix /tmp" .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v "Conda environment" | sed '/^\s*$/d' | wc -l`
+if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 354 ]; then exit 1 ; fi
 
 # WGBS
 WC=`WGBS -i PE_input -o output --sampleSheet .ci_stuff/test_sampleSheet.tsv --snakemakeOptions " --dryrun --conda-prefix /tmp" .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v "Conda environment" | sed '/^\s*$/d' | wc -l`

diff --git a/docs/content/News.rst b/docs/content/News.rst
@@ -1,6 +1,12 @@
 snakePipes News
 ===============
 
+snakePipes 2.x.y
+----------------
+
+* Deprecated mode Gruen in scRNAseq.
+* scRNAseq mode Alevin now outputs spliced/unspliced counts for RNA velocity estimation based on Soneson et al.  2020, bioRxiv https://doi.org/10.1101/2020.03.13.990069 .
+
 
 snakePipes 2.2.3
 ----------------

diff --git a/docs/content/workflows/scRNA-seq.rst b/docs/content/workflows/scRNA-seq.rst
@@ -8,18 +8,11 @@ What it does
 
 The scRNA-seq pipeline is intended to process UMI-based data, expecting the cell barcode and umi in Read1, and the cDNA sequence in Read2. 
 
-There are currently three analysis modes available:
-- "Gruen" to reproduce CellSeq2 data analysis by Gruen et al.
+There are currently two analysis modes available:
 - "STARsolo" which uses STAR solo for mapping and quantitation.
 - "Alevin" based on Salmon for generating the count matrix.
 
-The general procedure for mode "Gruen" involves:
-
-1. moving cell barcodes and UMIs from read 1 into the read headers of read 2,
-2. mapping read 2,
-3. quantification at the single cell level.
-
-Mode "Gruen" is going to be deprecated by the end of 2020.
+.. note:: Mode "Gruen" has been deprecated.
 
 The general procedure for mode "STARsolo" involves:
 
@@ -36,6 +29,7 @@ Mode "Alevin" involves:
 2. Mapping and generation of a readcount matrix.
 3. Estimation of uncertainty of gene counts using bootstrap method implemented in Salmon Alevin.
 4. General QC of the Alevin run using the AlevinQC R package.
+5. Quantification of "spliced" and "unspliced" read counts in each cell with Alevin - unless this has been disabled with --skipVelocyto . This analysis is derived from the code underlying Soneson et al.  2020, bioRxiv https://doi.org/10.1101/2020.03.13.990069. 
 
 .. image:: ../images/scRNAseq_pipeline.png
 
@@ -60,28 +54,7 @@ In the VelocytoCounts folder, loom files with counts of spliced, unspliced and a
 Input requirements
 ------------------
 
-The primary input requirement is a directory of paired-end fastq files. For the Gruen mode, if you do not wish to use the default list of cell-barcodes you must then supply your own, and to modify the cellBarcodePattern accordingly.. For the STAR solo mode, a barcode whitelist is required, as well as specification of UMI and CB positions and length, if different from default or available presets.
-
-Cell barcodes
-~~~~~~~~~~~~~
-
-The format of the cell barcodes file is shown below. Note that the default file is included in the snakePipes source code under ``snakePipes/workflows/scRNAseq``. This file is automatically used if you leave :code:`cellBarcodeFile` empty.
-
-::
-
-    1       AGTGTC
-    2       ACCATG
-    3       GAGTGA
-    4       CACTCA
-    5       CATGTC
-    6       ACAGGA
-    7       GTACCA
-    8       ACAGAC
-    9       ACGTTG
-
-The default cell barcodes in the Gruen mode are 192 hexamers listed in a file with the first column a cell number and the second the barcode sequence.
-
-Predefined cell barcodes are required right now. However it is planned to make this more generic in future workflow versions.
+The primary input requirement is a directory of paired-end fastq files. For the STAR solo mode, a barcode whitelist is required, as well as specification of UMI and CB positions and length, if different from default or available presets.
 
 Barcode whitelist
 ~~~~~~~~~~~~~~~~~
@@ -117,23 +90,25 @@ The default configuration file is listed below and can be found in ``snakePipes/
     trim: False
     trimmer: cutadapt
     trimmerOptions: -a A{'30'}
-    ## N.B., setting --outBAMsortingBinsN too high can result in cryptic errors
-    alignerOptions: "--outBAMsortingBinsN 30 --twopassMode Basic"
+    ## --twopassMode Basic is not compatible with --outStd in all STAR versions
+    alignerOptions: ""
     ## further options
     filterGTF: "-v -P 'decay|pseudogene' "
     cellBarcodeFile:
     cellBarcodePattern: "NNNNNNXXXXXX"
     splitLib: False
     cellNames:
-    ##STARsolo options
+    ##mode STARsolo options
     myKit: CellSeq384
     BCwhiteList:
     STARsoloCoords: ["1","7","8","7"]
+    skipVelocyto: False
     ##mode Alevin options
     alevinLibraryType: "ISR"
     prepProtocol: "celseq2"
     salmonIndexOptions: --type puff -k 31
     expectCells: 
+    readLengthFrx: 0.2
     #generic options
     libraryType: 1
     bwBinSize: 10
@@ -152,24 +127,6 @@ The default configuration file is listed below and can be found in ``snakePipes/
     UMIDedupOpts: --paired
 
 
-While some of these can be changed on the command line, you may find it useful to change ``cellBarcodePattern`` and ``cellBarcodeFile`` if you find that you need to change them frequently.
-
-Barcode pattern
-~~~~~~~~~~~~~~~
-
-The scRNA-seq pipeline requires barcodes at 5' end of read 1. The default cellBarcodePattern takes the first 6 bases as UMI (NNNNNN) and the following 6 bases as cell barcode (XXXXXX).
-If your read/barcode layout requires additional **'Don't care'** positions eg. before stretches of N one can indicate these with ``.``
-
-Barcode file
-~~~~~~~~~~~~~~~
-
-Only specify a file if you use other than the default CEL-seq2 barcodes (mode Gruen).
-
-
-Trimming
-~~~~~~~~
-
-It is recommended to use the :code:`--trim` option as this uses cutadapt to trim remaining adapters *and* poly-A tails from read 2 (see defaults for ``--trimmerOptions``).
 
 Pseudogene filter
 ~~~~~~~~~~~~~~~~~
@@ -181,57 +138,18 @@ Here we assume you provide eg. a gencode or ensemble annotation file (via genes_
 Library Type
 ~~~~~~~~~~~~
 
-The CEL-seq2 protocol produces reads where read 2 maps in sense direction (:code:`libraryType: 1`). After barcodes are transferred to read 2, the workflow continues in single-end mode.
+The CEL-seq2 protocol produces reads where read 2 maps in sense direction (:code:`libraryType: 1`).
 
-Split lib
-~~~~~~~~~
 
-This option you need in case a library contains only 96 instead of 192 cells (mode Gruen).
+Fraction of read length required to overlap the intron
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+In mode Alevin, the fraction of read length required to overlap the intron in order to be counted as "unspliced" is set to 0.2 (i.e. 20%) by default. This corresponds to 10nt in a 50nt-long read, or to 20nt in a 100nt-long read. The user is encouraged to modify this value as deemed appropriate via the ``--readLengthFrx`` commandline argument.
 
 
 Output structure
 ----------------
 
-The following will be produced in the output directory when the workflow is run in mode Gruen::
-
-    |-- cluster_logs
-    |-- Filtered_cells_RaceID
-    |   `-- logs
-    |-- Filtered_cells_monocle
-    |    `-- logs
-    |-- cellQC_test
-    |-- mtab_test
-    |-- QC_report
-    |   `-- data
-    |-- Results
-    |-- Counts
-    |   `-- logs
-    |-- multiQC
-    |   `-- multiqc_data
-    |-- bamCoverage
-    |   `-- logs
-    |-- deepTools_qc
-    |   |-- logs
-    |   |-- bamPEFragmentSize
-    |   |-- plotEnrichment
-    |   `-- estimateReadFiltering
-    |-- Sambamba
-    |-- STAR_genomic
-    |   |-- logs
-    |   `-- GSM2668205
-    |-- FastQC
-    |   `-- logs
-    |-- Annotation
-    |-- FASTQ_barcoded
-    `-- FASTQ
-
- - The **Annotation** directory contains a filtered version of your original GTF file, with pseudogenes removed by default.
- - The **bamCoverage** directory contains a bigwig track for each sample (not per cell!). This can be used eg. in IGV to check where your reads map in general.
- - The **Counts** directory contains 4 sets of counts: UMIs/feature/cell (.umis.txt), reads/feature/cell (.reads.txt), corrected number of UMIs/feature/cell (corrected.txt) and raw counts per cell per UMI per feature (raw_counts.txt). Of these, the values in corrected.txt should be used for further analysis and the others for quality control.
- - The **deeptools_qc** directory contains additional QC reports and plots. The ``FASTQC`` directory can be used to verify eg. the barcode layout of read 1.
- - The **QC_report** directory contains additional QC stats as tables and plots.
-
 The following will be produced in the output directory when the workflow is run in mode STARsolo::
 
     analysis/
@@ -255,12 +173,15 @@ The following will be produced in the output directory when the workflow is run
  - The **VelocytoCounts** directory contains loom files in sample subdirectories.
  - The **VelocytoCounts_merged** directory containes one loom file with all samples merged.
  - The **STARsolo** directory contains bam files and 10X-format cell count matrices produced by STARsolo.
+ - The **Annotation** directory contains a filtered version of your original GTF file, with pseudogenes removed by default.
+ - The **bamCoverage** directory contains a bigwig track for each sample (not per cell!). This can be used eg. in IGV to check where your reads map in general.
+ - The **deeptools_qc** directory contains additional QC reports and plots. The ``FASTQC`` directory can be used to verify eg. the barcode layout of read 1.
 
-The remaining folders are described in the Gruen mode above.
 
 The following output structure will be produced when running in Alevin mode::
 
     ├── Alevin
+    ├── AlevinForVelocity
     ├── Annotation
     ├── cluster_logs
     ├── FastQC
@@ -272,23 +193,14 @@ The following output structure will be produced when running in Alevin mode::
     ├── scRNAseq_organism.yaml
     ├── scRNAseq_pipeline.pdf
     ├── scRNAseq_run-1.log
-    └── scRNAseq_tools.txt
+    ├── scRNAseq_tools.txt
+    └── SingleCellExperiment
 
  - The **Salmon** directory contains the generated genome index.
- - The **Alevin** directory contains the matrix failes (both bootstrapped and raw) per sample in subdirectories.
+ - The **Alevin** directory contains the matrix files (both bootstrapped and raw) per sample in subdirectories.
  - The **multiQC** directory contains an additional alevinQC html file generated per sample.
-
-Understanding the outputs: mode Gruen
---------------------------------------
-
-- **Main result:** the genes per cell count table with poisson-corrected counts can be found under ``Results/all_samples.gencode_genomic.corrected_merged.csv``
-
-- Corresponding annotation files are: ``Annotation/genes.filtered.bed`` and ``Annotation/genes.filtered.gtf``, respectively.
-
-- The folders ``QC_report``, ``FASTQC``, ``deeptools_qc`` and ``multiQC`` contain various QC tables and plots.
-
-- **Sambamba** and **STAR_genomic** directories contain the output file from duplicate marking and genomic alignments, respectively.
-
+ - The **AlevinForVelocity** directory contains the matrix files with "spliced" and "unspliced" reads per cell in subdirectories.
+ - The **SingleCellExperiment** directory contains the RDS files with "SingleCellExperiment" class R objects, storing spliced/unspliced counts per cell in corresponding assays.
 
 Understanding the outputs: mode STARsolo
 ----------------------------------------
@@ -301,48 +213,17 @@ Understanding the outputs: mode STARsolo
 
 - *STARsolo* directory contain the output from genomic alignments.
 
-Filtered_cells_monocle
-~~~~~~~~~~~~~~~~~~~~~~
-
-The poisson-rescaled count matrix is read and converted into a monocle dataset. A range of transcript counts per cell thresholds (from 1000 to 5000 by 500) are applied to filter cells and the resulting R objects are written to minT*.mono.set.RData. For every cell filtering threshold, several metrics are collected and written to metrics.tab.txt: number of retained cells, median number of expressed genes per cell (GPC), size of the total gene universe. Plots of median GPC as well as gene universe size as functions of the cell filtering threshold are written to medGPCvsminT.downscaled.png and gene_universevsminT.downscaled.png, respectively.
-
-The optimal cell filtering threshold for the subsequent analyses is selected as the value that results in maximizing a gene expression metric choosable from "gene_universe" (default) and "medGPC". Using gene universe tends to maximize the overall cell diversity while using median genes per cell (medGPC) maximizes the information content per cell.
-Gene expression dispersions are calculated for the corresponding monocle object and the trend plot is written to mono.set.*.disp.estim.png. A first iteration of cell clustering with default settings resutls in a rho-delta plot written to mono.set.*.rho_delta.png and a tSNE plot with cell cluster colouring written to mono.set.*.tsne.auto.Cluster.png. Rho and delta are now re-evaluated and set to the 80th and the 95th percentiles of the original distributions, respectively. Cells are reclustered and the corresponding tSNE plot is written to mono.set.*.tsne.thd.Cluster.png. The monocle object containing the updated clustering information is written to minT*.mono.set.RData. It is also converted to a seurat object and the clustering information is transferred. The seurat object is saved as minT*.seuset.RData. The tSNE plot with clustering information produced with seurat is written to minT*.seuset.tSNE.png.
-Top10 as well as top2 markers are calculated for each cell cluster and written to minT\*.Top10markers.txt and minT\*.Top2markers.txt, respectively. The corresponding heatmaps are written to minT\*.Top10markers.heatmap.png and minT\*.Top2markers.heatmap.png, respectively. For the top2 marker list, violin as well as feature plots are produced and saved under Top2.clu\*.violin.png and Top2.clu\*.featurePlot.png, respectively. The R session info is written to sessionInfo.txt.
-Statistical procedures and results are summarized in Stats_report.html.
-
-Filtered_cells_RaceID
-~~~~~~~~~~~~~~~~~~~~~
-
-Cell filtering, metrics collection and threshold selection are done as above only using RaceID package functions, where applicable.
-
-Clustering is done with RaceID default settings. The fully processed RaceID object is written to sc.minT\*.RData, the tsne plot with the clustering information to sc.minT\*.tsne.clu.png.
-Top 10 and top 2 markers are calculated, and the resulting plots and tables written out as above. Violin and feature plots are generated for the top2 marker list and saved to files as in the description above. Session info is written to sessionInfo.txt. Statistical procedures and results are summarized in Stats_report.html.
 
 Understanding the outputs: mode Alevin
 --------------------------------------
 
-- **Main result:** output folders containing the raw and boostrapped count matrices are found under the sample subfolders under ``Alevin``. The sample specific Alevin folders contain the matrices, as well as column data (barcodes) and row data (genes).
+- **Main result:** output folders containing the raw and boostrapped count matrices are found under the sample subfolders under ``Alevin``. The sample specific Alevin folders contain the matrices, as well as column data (barcodes) and row data (genes). Alevin spliced/unspliced counts for RNA velocity are stored as alevin matrices in the sample subfolders under ``AlevinForVelocity`` and as "SingleCellExperiment" class R objects under ``SingleCellExperiment``.
 
 - Corresponding annotation files are: ``Annotation/genes.filtered.bed`` and ``Annotation/genes.filtered.gtf``, respectively.
 
 - The QC plots (both from multiQC and AlevinQC) are available in the ``multiQC`` folder.
 
 
-Example images
-~~~~~~~~~~~~~~
-
-There are a number of QC images produced by the pipeline:
-
-.. image:: ../images/scRNAseq_UMI_plot.png
-
-This figure plots the number of UMIs on transcripts per cell vs the number of reads aligning to transcripts. These should form a largely straight line, with the slope indicating the level of PCR duplication.
-
-.. image:: ../images/scRNAseq_plate_abs_transcript.png
-
-This figure shows the distribution of the number of UMIs across the single cells. Each block is a single cell and the color indicates the number of UMIs assigned to it. This is useful for flagging outlier cells.
-Note: the layout corresponds to half of a 384-well plate as this is used usually for CEL-seq2. The plot can also help to see biases corresponding to the well-plate.
-
 Command line options
 --------------------
 

diff --git a/snakePipes/common_functions.py b/snakePipes/common_functions.py
@@ -26,6 +26,7 @@ def set_env_yamls():
             'CONDA_seurat3_ENV': 'envs/sc_rna_seq_seurat3.yaml',
             'CONDA_loompy_ENV': 'envs/sc_rna_seq_loompy.yaml',
             'CONDA_alevinqc_ENV': 'envs/sc_rna_seq_alevinqc.yaml',
+            'CONDA_eisaR_ENV': 'envs/sc_rna_seq_eisaR.yaml',
             'CONDA_DNA_MAPPING_ENV': 'envs/dna_mapping.yaml',
             'CONDA_CHIPSEQ_ENV': 'envs/chip_seq.yaml',
             'CONDA_ATAC_ENV': 'envs/atac_seq.yaml',