Merge pull request #673 from maxplanck-ie/develop

Develop
maxplanck-ie · Aug 24, 2020 · 4145ce4 · 4145ce4
2 parents 3c4177b + 122f9ba
commit 4145ce4
Show file tree

Hide file tree

Showing 105 changed files with 1,900 additions and 310 deletions.
diff --git a/.ci_stuff/genome.fa b/.ci_stuff/genome.fa
@@ -0,0 +1,4 @@
+>1
+AAAAA
+>2_spikein
+TTTTT
diff --git a/.ci_stuff/genome.fa.fai b/.ci_stuff/genome.fa.fai
@@ -0,0 +1,3 @@
+1	10
+2_spikein	10
+
diff --git a/.ci_stuff/spikein_organism.yaml b/.ci_stuff/spikein_organism.yaml
@@ -0,0 +1,18 @@
+genome_size: 2652783500
+genome_fasta: "/tmp/genome.fa"
+genome_index: "/tmp/genome.fa.fai"
+genome_2bit: ".ci_stuff/genome_fasta/genome.2bit"
+bowtie2_index: ".ci_stuff/BowtieIndex/genome"
+hisat2_index: ".ci_stuff/HISAT2Index/genome"
+bwa_index: ".ci_stuff/BWAindex/genome.fa"
+bwameth_index: "/tmp/genome.fa"
+known_splicesites: ".ci_stuff/gencode/m9/HISAT2/splice_sites.txt"
+star_index: ".ci_stuff/STARIndex/"
+genes_bed: "/tmp/genes.bed"
+genes_gtf: "/tmp/genes.gtf"
+spikein_genes_gtf: "/tmp/spikein_genes.gtf"
+extended_coding_regions_gtf: ".ci_stuff/gencode/m9/genes.slop.gtf"
+blacklist_bed: ".ci_stuff/DKFZ/GRCm38_General_readAttractingRegions.UseThisOne.bed"
+spikein_blacklist_bed: ""
+ignoreForNormalization: "MT X Y JH584299.1 GL456233.1 JH584301.1 GL456211.1 GL456350.1 JH584293.1 GL456221.1 JH584297.1 JH584296.1 GL456354.1 JH584294.1 JH584298.1 JH584300.1 GL456219.1 GL456210.1 JH584303.1 JH584302.1 GL456212.1 JH584304.1 GL456379.1 GL456216.1 GL456393.1 GL456366.1 GL456367.1 GL456239.1 GL456213.1 GL456383.1 GL456385.1 GL456360.1 GL456378.1 GL456389.1 GL456372.1 GL456370.1 GL456381.1 GL456387.1 GL456390.1 GL456394.1 GL456392.1 GL456382.1 GL456359.1 GL456396.1 GL456368.1 JH584292.1 JH584295.1"
+rmsk_file: "/tmp/rmsk.txt"
diff --git a/.ci_stuff/test_dag.sh b/.ci_stuff/test_dag.sh
diff --git a/bin/snakePipes b/bin/snakePipes
@@ -317,7 +317,7 @@ def updateConfig(args):
     """Update the global defaults"""
     baseDir = os.path.dirname(snakePipes.__file__)
     # Load, update and rewrite the default dictionary
-    currentDict = cof.load_configfile(os.path.join(baseDir, "shared", "defaults.yaml"), True)
+    currentDict = cof.load_configfile(os.path.join(baseDir, "shared", "defaults.yaml"), False, "Default Config")
 
     if args.configMode=="manual":
         d = {
@@ -338,13 +338,20 @@ def updateConfig(args):
     elif args.configMode=="recycle":
         oldConfig=args.oldConfig
         if os.path.isfile(oldConfig):
-            d = cof.load_configfile(oldConfig, True)
+            d = cof.load_configfile(oldConfig, False, "Old Config")
+            if args.organismsDir:
+                od = {'organismsDir': args.organismsDir}
+                d.update(od)
+            if args.clusterConfig:
+                od = {'clusterConfig': args.clusterConfig}
+                d.update(od)
             if not currentDict.keys() & d.keys():
                 sys.exit("The old and the new config have no matching keys!!!\n")
         else:
             sys.exit("Config file not found\n")
     updatedDict=cof.merge_dicts(currentDict, d)
     cof.write_configfile(os.path.join(baseDir, "shared", "defaults.yaml"), updatedDict)
+    newDict=cof.load_configfile(os.path.join(baseDir, "shared", "defaults.yaml"), True, "Final Updated Config")
 
 
 def version():

diff --git a/conda-recipe/meta.yaml b/conda-recipe/meta.yaml
@@ -1,6 +1,6 @@
 package:
   name: snakepipes
-  version: 2.1.2
+  version: 2.2.0
 
 source:
   path: ../
@@ -14,7 +14,7 @@ requirements:
     - python >=3
   run:
     - python >=3.7
-    - snakemake >=5.13
+    - snakemake ==5.18.0
     - pandas
     - graphviz
     - fuzzywuzzy

diff --git a/docs/content/News.rst b/docs/content/News.rst
@@ -1,6 +1,31 @@
 snakePipes News
 ===============
 
+snakePipes 2.2.0
+----------------
+* Added Alevin mode in scRNA workflow
+* Added a new conda environment using to call AlevinQC.
+* Added filtering of empty drops with Dropletutils to scRNA-seq mode STARsolo
+* Added spikein normalization to ChIPseq workflow
+* Added hybrid genome creation to createIndices
+* Added STARsolo report for all samples to STARsolo output folder
+* FASTQ1 and FASTQ2 are not localrules anymore due to buggy logging
+* Included optional differential splicing analysis using rmats within mRNA-seq workflow
+* Symlinks in the output path are relative 
+* Increased BBmap version
+* Increased STAR version to 2.7.4a in scRNAseq, noncoding-RNA-seq and mRNA-seq workflows
+* Fixed snakemake version at 5.18.0 due to a bug in DAG handling
+* Minor changes to shared FastQC and multiQC rule with regards to scRNA-seq workflow.
+* Fixed issue with missing input for running the DNA-mapping Snakefile
+* Fixed rule TrimGalore for single end reads
+* deepTools heatmaps for differentially bound regions are now ordered by sample sheet condition
+* Genrich is now run on namesorted bams
+* Workflow help message now points to example sampleSheet on GitHub
+* organismsDir can now be updated with snakePipes config mode "recycle"
+
+.. note::
+   Please be aware that this version requires regeneration of STAR indices!
+
 snakePipes 2.1.2
 ----------------
 * small bug fix: SE mode in noncoding-RNA-seq pipeline

diff --git a/docs/content/setting_up.rst b/docs/content/setting_up.rst
@@ -30,16 +30,18 @@ Next, install snakePipes.
 Installing snakePipes
 ---------------------
 
-The easiest way to install snakePipes is via our conda channel. The following command install snakePipes and also creates a conda virtual environment named ``snakePipes``, which you can then activate via ``conda activate snakePipes``.
+The easiest way to install snakePipes is via our conda channel. The following command install snakePipes and also creates a conda virtual environment named ``snakePipes``, which you can then activate via ``conda activate snakePipes``. Specifying snakePipes version avoids issues with conda's environment solver.
 
 .. code:: bash
 
-    conda create -n snakePipes -c mpi-ie -c bioconda -c conda-forge snakePipes
+    conda create -n snakePipes -c mpi-ie -c conda-forge -c bioconda snakePipes==2.1.1
 
 This way, the software used within snakePipes do not conflict with the software pre-installed on your terminal or in your python environment.
 
 .. note:: This might take a few minutes depending on the access to conda channels.
 
+snakePipes is going to move to mamba in the future.
+
 Development installation
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -94,6 +96,9 @@ This would show the locations of:
  * **organisms/<organism>.yaml** : Defines genome indices and annotations for various organisms. See :ref:`organisms`
  * Workflow-specific defaults : Defines default options for our command line wrappers. See :ref:`workflowOpts`
 
+It is a good idea to keep a copy of your defaults.yaml, cluster.yaml and the whole organism folder in a dedicated location e.g. some folder *outside the snakePipes installation folder* named "snakePipes_configs" .
+You can configure snakePipes to use these files after a fresh installation or update with ``snakePipes config --organismsDir my_organisms_dir --clusterConfig my_cluster_config`` . This will also work if you add ``--configMode recycle``.  
+
 
 .. _conda:
 

diff --git a/docs/content/workflows/ChIP-seq.rst b/docs/content/workflows/ChIP-seq.rst
@@ -68,6 +68,19 @@ As you can see above, the same control can be used for multiple samples.
 
 .. note:: Set the flag broad to `True` for broad marks, such as H3K27me and H3K9me3
 
+
+.. _spikein:
+
+Spikein Normalization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If chromatin from an external organism was spikein in, it is possible to obtain spikein-derived scaling factors for the ChIP (and input) samples with the flag ``--useSpikeInForNorm``. This requires providing a hybrid bam file, with reads aligned to a hybrid genome of host and spikein chromosomes. Spikein chromosome extention can be specified with `--spikeinExt`. Scale factors can be obtained either from whole spikein genome in the ChIP samples, from windows centered on TSS in the spikein genome in the ChIP samples, or from whole spikein genome in the input samples . The default scale factors from whole spikein genome in the ChIP samples can be changed to something else with ``--getSizeFactorsFrom``.
+
+DESeq2-style scaling factors produced with deepTools multiBamSummary will then be used to create bam coverage tracks and passed to CSAW as size Factors if sample sheet is provided.
+
+A hybrid genome can be obtained with createIndices workflow and can be passed to the DNA-mapping workflow without any particular arguments.
+
+
 .. _diffBinding:
 
 Differential Binding analysis

diff --git a/docs/content/workflows/createIndices.rst b/docs/content/workflows/createIndices.rst
@@ -45,6 +45,11 @@ There is a configuration file in ``snakePipes/workflows/createIndices/defaults.y
 
 These values are most conveniently set on the command line.
 
+Hybrid genome
+-------------
+
+To create a hybrid fasta, specify the host genome with ``--genomeURL`` and the spikein genome with ``--spikeinGenomeURL``. On top of ``--gtfURL`` and ``--blacklist``, you may optionally provide ``--spikeinGtfURL`` and `--spikeinBlacklist`. Default extention added to spikein chromosomes is '_spikein' and can be changes with ``--spikeinExt``.
+
 Output structure
 ----------------
 

diff --git a/docs/content/workflows/scRNA-seq.rst b/docs/content/workflows/scRNA-seq.rst
@@ -18,6 +18,8 @@ The general procedure for mode "Gruen" involves:
 2. mapping read 2,
 3. quantification at the single cell level.
 
+Mode "Gruen" is going to be deprecated by the end of 2020.
+
 The general procedure for mode "STARsolo" involves:
 
 1. moving cell barcodes and UMIs from read 1 into the CB and UMI tags of read 2 during mapping (STARsolo),
@@ -259,7 +261,7 @@ Understanding the outputs: mode Gruen
 Understanding the outputs: mode STARsolo
 ----------------------------------------
 
-- **Main result:** output folders with 10x-format count matrices can be found in sample subfolders under ``STARsolo``. The ouput consists of three files: barcodes.tsv, features.tsv, matrix.mtx. Their gzipped versions are stored in the same folder.
+- **Main result:** output folders with 10x-format count matrices can be found in sample subfolders under ``STARsolo``. The ouput consists of three files: barcodes.tsv, features.tsv, matrix.mtx. Their gzipped versions are stored in the same folder. Seurat objects from merged samples are available in the ``Seurat`` folder.
 
 - Corresponding annotation files are: ``Annotation/genes.filtered.bed`` and ``Annotation/genes.filtered.gtf``, respectively.
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -34,13 +34,15 @@ Quick start
 
 .. code:: bash
 
-    conda create -n snakePipes -c mpi-ie -c bioconda -c conda-forge snakePipes
+    conda create -n snakePipes -c mpi-ie -c conda-forge -c bioconda snakePipes==2.1.1
 
 * You can update snakePipes to the latest version available on conda with:
 
 .. code:: bash
 
-    conda update -n snakePipes -c mpi-ie -c bioconda -c conda-forge snakePipes
+    conda update -n snakePipes -c mpi-ie -c conda-forge -c bioconda --prune snakePipes
+
+snakePipes is going to move to mamba in the future.
 
 * Download genome fasta and annotations for an your organism, and build indexes, Check in :ref:`createIndices`
 
@@ -50,7 +52,7 @@ Quick start
 
     snakePipes config --help
 
-.. note:: If you have a copy of a `shared/defaults.yaml` with the necessary paths configured (i.e. from a previous installation), you can pass it to snakePipes config with `--oldConfig` and `--configMode recycle` instead of providing all the paths manually again. Config keys have to match for this to work.
+.. note:: If you have a copy of a `shared/defaults.yaml` with the necessary paths configured (i.e. from a previous installation), you can pass it to snakePipes config with `--oldConfig` and `--configMode recycle` instead of providing all the paths manually again. Config keys have to match for this to work. In the same way, you can pass your external organism yaml folder with ``--organismsDir`` or cluster config with ``--clusterConfig``.
 
 * Download example fastq files for the human genome `here <https://zenodo.org/record/3707259>`_
 

diff --git a/requirements.txt b/requirements.txt
@@ -1,4 +1,4 @@
-snakemake >= 5.13.0
+snakemake==5.18.0
 psutil
 pandas
 fuzzywuzzy

diff --git a/snakePipes/__init__.py b/snakePipes/__init__.py
@@ -1 +1 @@
-__version__ = '2.1.2'
+__version__ = '2.2.0'
diff --git a/snakePipes/common_functions.py b/snakePipes/common_functions.py
@@ -10,6 +10,8 @@
 import sys
 import shutil
 from fuzzywuzzy import fuzz
+import smtplib
+from email.message import EmailMessage
 from snakePipes import __version__
 
 
@@ -23,16 +25,17 @@ def set_env_yamls():
             'CONDA_scRNASEQ_ENV': 'envs/sc_rna_seq.yaml',
             'CONDA_seurat3_ENV': 'envs/sc_rna_seq_seurat3.yaml',
             'CONDA_loompy_ENV': 'envs/sc_rna_seq_loompy.yaml',
+            'CONDA_alevinqc_ENV': 'envs/sc_rna_seq_alevinqc.yaml',
             'CONDA_DNA_MAPPING_ENV': 'envs/dna_mapping.yaml',
             'CONDA_CHIPSEQ_ENV': 'envs/chip_seq.yaml',
-            'CONDA_HISTONE_HMM_ENV': 'envs/histone_hmm.yaml',
             'CONDA_ATAC_ENV': 'envs/atac_seq.yaml',
             'CONDA_HIC_ENV': 'envs/hic.yaml',
             'CONDA_WGBS_ENV': 'envs/wgbs.yaml',
             'CONDA_RMD_ENV': 'envs/rmarkdown.yaml',
             'CONDA_PREPROCESSING_ENV': 'envs/preprocessing.yaml',
             'CONDA_NONCODING_RNASEQ_ENV': 'envs/noncoding.yaml',
-            'CONDA_SAMBAMBA_ENV': 'envs/sambamba.yaml'}
+            'CONDA_SAMBAMBA_ENV': 'envs/sambamba.yaml',
+            'CONDA_pysam_ENV': 'envs/pysam.yaml'}
 
 
 def merge_dicts(x, y):
@@ -137,11 +140,19 @@ def get_sample_names(infiles, ext, reads):
         x = os.path.basename(x)[:-lext]
         if x.endswith(reads[0]):
             x = x[:-l0]
+            s.add(x)
         elif x.endswith(reads[1]):
             x = x[:-l1]
+            s.add(x)
         else:
-            continue
-        s.add(x)
+            sys.stderr.write("Warning! {} does not have {} as its name suffix. "
+                             "Either change it or modify the 'reads' in the "
+                             "config.yaml to your deired ones.\n".format(x, reads))
+
+    if sorted(list(s)) == []:
+        sys.exit("Error! No sample has the right read suffix ({}). "
+                 "Please modify them or update the config.yaml with "
+                 "your desired suffix.".format(reads))
     return sorted(list(s))
 
 
@@ -171,9 +182,13 @@ def is_paired(infiles, ext, reads):
                 infiles_dic[bname] = [infile]
             else:
                 infiles_dic[bname].append(infile)
-    if infiles_dic and max([len(x) for x in infiles_dic.values()]) == 2:
+    if not infiles_dic:
+        sys.exit("Error: No fastq file has been found to be checked.")
+    values_length = [len(x) for x in infiles_dic.values()]
+    if min(values_length) == 2:
         pairedEnd = True
-    # TODO: raise exception if single-end and paired-end files are mixed
+    elif min(values_length) == 1 and max(values_length) == 2:
+        sys.exit("Error: The directory contains a mixture of paired-end and single-end data!")
     return pairedEnd
 
 
@@ -365,8 +380,6 @@ def sendEmail(args, returnCode):
     Try to send an email to the user. Errors must be non-fatal.
     """
     try:
-        import smtplib
-        from email.message import EmailMessage
         msg = EmailMessage()
         msg['Subject'] = "Snakepipes completed"
         msg['From'] = args.emailSender

diff --git a/snakePipes/shared/cluster.yaml b/snakePipes/shared/cluster.yaml
@@ -16,7 +16,7 @@ bamCoverage_raw:
 create_snpgenome:
     memory: 30G
 FASTQdownsample:
-    memory: 3G
+    memory: 4G
 filter_reads_umi:
     memory: 10G
 plotCorrelation_pearson:

diff --git a/snakePipes/shared/defaults.yaml b/snakePipes/shared/defaults.yaml
diff --git a/snakePipes/shared/organisms/mm10.yaml b/snakePipes/shared/organisms/mm10.yaml
@@ -7,7 +7,7 @@ hisat2_index: "/data/repository/organisms/GRCm38_ensembl/HISAT2Index/genome"
 bwa_index: "/data/repository/organisms/GRCm38_ensembl/BWAindex/genome.fa"
 bwameth_index: "/data/repository/organisms/GRCm38_ensembl/BWAmethIndex/genome.fa"
 known_splicesites: "/data/repository/organisms/GRCm38_ensembl/gencode/m9/HISAT2/splice_sites.txt"
-star_index: "/data/repository/organisms/GRCm38_ensembl/STARIndex/2.7.1a/"
+star_index: "/data/repository/organisms/GRCm38_ensembl/STARIndex/2.7.4a/"
 genes_bed: "/data/repository/organisms/GRCm38_ensembl/gencode/m9/genes.bed"
 genes_gtf: "/data/repository/organisms/GRCm38_ensembl/gencode/m9/genes.gtf"
 extended_coding_regions_gtf: "/data/repository/organisms/GRCm38_ensembl/gencode/m9/genes.slop.gtf"

diff --git a/snakePipes/shared/rscripts/CSAW.R b/snakePipes/shared/rscripts/CSAW.R
@@ -13,6 +13,11 @@ importfunc <- snakemake@params[["importfunc"]]  #"DB_functions.R"
 allelic_info <- as.logical(snakemake@params[["allele_info"]])
 outdir<-snakemake@params[["outdir"]]
 yaml_path<-snakemake@params[["yaml_path"]]
+useSpikeInForNorm<-snakemake@params[["useSpikeInForNorm"]]
+scale_factors<-snakemake@params[["scale_factors"]]
+
+bam_pfx<-ifelse(useSpikeInForNorm,"_host",".filtered")
+bam_folder<-ifelse(useSpikeInForNorm,"split_bam","filtered_bam")
 
 ##set up a primitive log
 logfile <- file(snakemake@log[["err"]], open="w+")
@@ -94,11 +99,11 @@ if (!is.null(sampleInfo$UseRegions)) {
 
 if(snakemake@params[['peakCaller']] == "MACS2") {
     allpeaks <- lapply(fnames, function(x) {
-        narrow <- paste0("../MACS2/",x,".filtered.BAM_peaks.narrowPeak")
+        narrow <- paste0("../MACS2/",x,bam_pfx,".BAM_peaks.narrowPeak") #bam_pfx
         if(snakemake@params[["pipeline"]] %in% "ATAC-seq"){
             narrow <- paste0("../MACS2/",x,".filtered.short.BAM_peaks.narrowPeak")
         }
-        broad <- paste0("../MACS2/",x,".filtered.BAM_peaks.broadPeak")
+        broad <- paste0("../MACS2/",x,bam_pfx,".BAM_peaks.broadPeak") #bam_pfx
         # first look for narrowpeak then braod peak
         if(file.exists(narrow)) {
             bed <- read.delim(narrow, header = FALSE)
@@ -128,9 +133,14 @@ message(paste0("Filtering windows using MACS2 output : ", length(allpeaks) , " r
 keep <- overlapsAny(SummarizedExperiment::rowRanges(chip_object$windowCounts), allpeaks)
 chip_object$windowCounts <- chip_object$windowCounts[keep,]
 
-## TMM normalize
-message("Normalizing using TMM (using 10kb background counts)")
-chip_object <- tmmNormalize_chip(chip_object, binsize = 10000, plotfile = "TMM_normalizedCounts.pdf")
+## normalize
+if(useSpikeInForNorm){
+    message("Normalizing using spikein-derived scale factors")
+    chip_object <- tmmNormalize_chip(chip_object, binsize = 10000, plotfile = "spikein_normalizedCounts.pdf")
+}else{
+    message("Normalizing using TMM (using 10kb background counts)")
+    chip_object <- tmmNormalize_chip(chip_object, binsize = 10000, plotfile = "TMM_normalizedCounts.pdf")
+}
 
 ## get DB regions
 print("Performing differential binding")

diff --git a/snakePipes/shared/rscripts/CSAW_report.Rmd b/snakePipes/shared/rscripts/CSAW_report.Rmd
@@ -22,13 +22,14 @@ This report summarizes statistical analysis of binding events using group inform
 
 ```{r data_load}
 DBdata<-file.path(snakemake@params[["outdir"]], "DiffBinding_analysis.Rdata")
+useSpikeInForNorm<-snakemake@params[["useSpikeInForNorm"]]
 suppressMessages(require(GenomicRanges))
 suppressMessages(require(csaw))
 load(DBdata)
 ```
 
-Input to the differential binding analysis were `r length(chip_object$windowCounts)` genomic bins of size `r width(chip_object$windowCounts)[1]` bp having any overlap to the union of MACS2 peaks from all samples listed in the sample sheet.   
-Size factors were calculated via TMM normalization of read counts obtained for 10kb genomic bins and amounted to `r chip_object$normFactors`.   
+Input to the differential binding analysis were `r length(chip_object$windowCounts)` genomic bins of size `r width(chip_object$windowCounts)[1]` bp having any overlap to the union of `r unlist(strsplit(snakemake@output[["outfile"]],"_"))[2]` peaks from all samples listed in the sample sheet.   
+Size factors were calculated via `r ifelse(useSpikeInForNorm,"spikein normalization","TMM normalization of read counts obtained for 10kb genomic bins")` and amounted to `r chip_object$normFactors`.   
 
 ### Statistical analysis
 
@@ -71,7 +72,7 @@ Deeptools matrices were computed for the filtered lists of merged regions showin
 Heatmaps obtained using log2 ratio of ChIP signal over input were as follows:
 
 ```{r plot_l2r,fig.show='hold',fig.align='center', out.width ='40%',out.height='30%',fig.cap="UP and DOWN regions using log2 ratio."}
-if(!snakemake@params[["pipeline"]] %in% "chip-seq"){
+if(sum(grepl("log2r.heatmap.png",snakemake@input[["csaw_in"]]))<1){
     message("ChIP signal enrichment over input is only available for ChIP-seq workflow!")
 }else{
     knitr::include_graphics(c(file.path(snakemake@params[["outdir"]], "CSAW.UP.log2r.heatmap.png"),