Skip to content

Commit

Permalink
Merge pull request #673 from maxplanck-ie/develop
Browse files Browse the repository at this point in the history
Develop
  • Loading branch information
katsikora committed Aug 24, 2020
2 parents 3c4177b + 122f9ba commit 4145ce4
Show file tree
Hide file tree
Showing 105 changed files with 1,900 additions and 310 deletions.
4 changes: 4 additions & 0 deletions .ci_stuff/genome.fa
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
>1
AAAAA
>2_spikein
TTTTT
3 changes: 3 additions & 0 deletions .ci_stuff/genome.fa.fai
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
1 10
2_spikein 10

18 changes: 18 additions & 0 deletions .ci_stuff/spikein_organism.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
genome_size: 2652783500
genome_fasta: "/tmp/genome.fa"
genome_index: "/tmp/genome.fa.fai"
genome_2bit: ".ci_stuff/genome_fasta/genome.2bit"
bowtie2_index: ".ci_stuff/BowtieIndex/genome"
hisat2_index: ".ci_stuff/HISAT2Index/genome"
bwa_index: ".ci_stuff/BWAindex/genome.fa"
bwameth_index: "/tmp/genome.fa"
known_splicesites: ".ci_stuff/gencode/m9/HISAT2/splice_sites.txt"
star_index: ".ci_stuff/STARIndex/"
genes_bed: "/tmp/genes.bed"
genes_gtf: "/tmp/genes.gtf"
spikein_genes_gtf: "/tmp/spikein_genes.gtf"
extended_coding_regions_gtf: ".ci_stuff/gencode/m9/genes.slop.gtf"
blacklist_bed: ".ci_stuff/DKFZ/GRCm38_General_readAttractingRegions.UseThisOne.bed"
spikein_blacklist_bed: ""
ignoreForNormalization: "MT X Y JH584299.1 GL456233.1 JH584301.1 GL456211.1 GL456350.1 JH584293.1 GL456221.1 JH584297.1 JH584296.1 GL456354.1 JH584294.1 JH584298.1 JH584300.1 GL456219.1 GL456210.1 JH584303.1 JH584302.1 GL456212.1 JH584304.1 GL456379.1 GL456216.1 GL456393.1 GL456366.1 GL456367.1 GL456239.1 GL456213.1 GL456383.1 GL456385.1 GL456360.1 GL456378.1 GL456389.1 GL456372.1 GL456370.1 GL456381.1 GL456387.1 GL456390.1 GL456394.1 GL456392.1 GL456382.1 GL456359.1 GL456396.1 GL456368.1 JH584292.1 JH584295.1"
rmsk_file: "/tmp/rmsk.txt"
85 changes: 51 additions & 34 deletions .ci_stuff/test_dag.sh

Large diffs are not rendered by default.

11 changes: 9 additions & 2 deletions bin/snakePipes
Original file line number Diff line number Diff line change
Expand Up @@ -317,7 +317,7 @@ def updateConfig(args):
"""Update the global defaults"""
baseDir = os.path.dirname(snakePipes.__file__)
# Load, update and rewrite the default dictionary
currentDict = cof.load_configfile(os.path.join(baseDir, "shared", "defaults.yaml"), True)
currentDict = cof.load_configfile(os.path.join(baseDir, "shared", "defaults.yaml"), False, "Default Config")

if args.configMode=="manual":
d = {
Expand All @@ -338,13 +338,20 @@ def updateConfig(args):
elif args.configMode=="recycle":
oldConfig=args.oldConfig
if os.path.isfile(oldConfig):
d = cof.load_configfile(oldConfig, True)
d = cof.load_configfile(oldConfig, False, "Old Config")
if args.organismsDir:
od = {'organismsDir': args.organismsDir}
d.update(od)
if args.clusterConfig:
od = {'clusterConfig': args.clusterConfig}
d.update(od)
if not currentDict.keys() & d.keys():
sys.exit("The old and the new config have no matching keys!!!\n")
else:
sys.exit("Config file not found\n")
updatedDict=cof.merge_dicts(currentDict, d)
cof.write_configfile(os.path.join(baseDir, "shared", "defaults.yaml"), updatedDict)
newDict=cof.load_configfile(os.path.join(baseDir, "shared", "defaults.yaml"), True, "Final Updated Config")


def version():
Expand Down
4 changes: 2 additions & 2 deletions conda-recipe/meta.yaml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
package:
name: snakepipes
version: 2.1.2
version: 2.2.0

source:
path: ../
Expand All @@ -14,7 +14,7 @@ requirements:
- python >=3
run:
- python >=3.7
- snakemake >=5.13
- snakemake ==5.18.0
- pandas
- graphviz
- fuzzywuzzy
Expand Down
25 changes: 25 additions & 0 deletions docs/content/News.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,31 @@
snakePipes News
===============

snakePipes 2.2.0
----------------
* Added Alevin mode in scRNA workflow
* Added a new conda environment using to call AlevinQC.
* Added filtering of empty drops with Dropletutils to scRNA-seq mode STARsolo
* Added spikein normalization to ChIPseq workflow
* Added hybrid genome creation to createIndices
* Added STARsolo report for all samples to STARsolo output folder
* FASTQ1 and FASTQ2 are not localrules anymore due to buggy logging
* Included optional differential splicing analysis using rmats within mRNA-seq workflow
* Symlinks in the output path are relative
* Increased BBmap version
* Increased STAR version to 2.7.4a in scRNAseq, noncoding-RNA-seq and mRNA-seq workflows
* Fixed snakemake version at 5.18.0 due to a bug in DAG handling
* Minor changes to shared FastQC and multiQC rule with regards to scRNA-seq workflow.
* Fixed issue with missing input for running the DNA-mapping Snakefile
* Fixed rule TrimGalore for single end reads
* deepTools heatmaps for differentially bound regions are now ordered by sample sheet condition
* Genrich is now run on namesorted bams
* Workflow help message now points to example sampleSheet on GitHub
* organismsDir can now be updated with snakePipes config mode "recycle"

.. note::
Please be aware that this version requires regeneration of STAR indices!

snakePipes 2.1.2
----------------
* small bug fix: SE mode in noncoding-RNA-seq pipeline
Expand Down
9 changes: 7 additions & 2 deletions docs/content/setting_up.rst
Original file line number Diff line number Diff line change
Expand Up @@ -30,16 +30,18 @@ Next, install snakePipes.
Installing snakePipes
---------------------

The easiest way to install snakePipes is via our conda channel. The following command install snakePipes and also creates a conda virtual environment named ``snakePipes``, which you can then activate via ``conda activate snakePipes``.
The easiest way to install snakePipes is via our conda channel. The following command install snakePipes and also creates a conda virtual environment named ``snakePipes``, which you can then activate via ``conda activate snakePipes``. Specifying snakePipes version avoids issues with conda's environment solver.

.. code:: bash
conda create -n snakePipes -c mpi-ie -c bioconda -c conda-forge snakePipes
conda create -n snakePipes -c mpi-ie -c conda-forge -c bioconda snakePipes==2.1.1
This way, the software used within snakePipes do not conflict with the software pre-installed on your terminal or in your python environment.

.. note:: This might take a few minutes depending on the access to conda channels.

snakePipes is going to move to mamba in the future.

Development installation
~~~~~~~~~~~~~~~~~~~~~~~~

Expand Down Expand Up @@ -94,6 +96,9 @@ This would show the locations of:
* **organisms/<organism>.yaml** : Defines genome indices and annotations for various organisms. See :ref:`organisms`
* Workflow-specific defaults : Defines default options for our command line wrappers. See :ref:`workflowOpts`

It is a good idea to keep a copy of your defaults.yaml, cluster.yaml and the whole organism folder in a dedicated location e.g. some folder *outside the snakePipes installation folder* named "snakePipes_configs" .
You can configure snakePipes to use these files after a fresh installation or update with ``snakePipes config --organismsDir my_organisms_dir --clusterConfig my_cluster_config`` . This will also work if you add ``--configMode recycle``.


.. _conda:

Expand Down
13 changes: 13 additions & 0 deletions docs/content/workflows/ChIP-seq.rst
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,19 @@ As you can see above, the same control can be used for multiple samples.

.. note:: Set the flag broad to `True` for broad marks, such as H3K27me and H3K9me3


.. _spikein:

Spikein Normalization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If chromatin from an external organism was spikein in, it is possible to obtain spikein-derived scaling factors for the ChIP (and input) samples with the flag ``--useSpikeInForNorm``. This requires providing a hybrid bam file, with reads aligned to a hybrid genome of host and spikein chromosomes. Spikein chromosome extention can be specified with `--spikeinExt`. Scale factors can be obtained either from whole spikein genome in the ChIP samples, from windows centered on TSS in the spikein genome in the ChIP samples, or from whole spikein genome in the input samples . The default scale factors from whole spikein genome in the ChIP samples can be changed to something else with ``--getSizeFactorsFrom``.

DESeq2-style scaling factors produced with deepTools multiBamSummary will then be used to create bam coverage tracks and passed to CSAW as size Factors if sample sheet is provided.

A hybrid genome can be obtained with createIndices workflow and can be passed to the DNA-mapping workflow without any particular arguments.


.. _diffBinding:

Differential Binding analysis
Expand Down
5 changes: 5 additions & 0 deletions docs/content/workflows/createIndices.rst
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,11 @@ There is a configuration file in ``snakePipes/workflows/createIndices/defaults.y

These values are most conveniently set on the command line.

Hybrid genome
-------------

To create a hybrid fasta, specify the host genome with ``--genomeURL`` and the spikein genome with ``--spikeinGenomeURL``. On top of ``--gtfURL`` and ``--blacklist``, you may optionally provide ``--spikeinGtfURL`` and `--spikeinBlacklist`. Default extention added to spikein chromosomes is '_spikein' and can be changes with ``--spikeinExt``.

Output structure
----------------

Expand Down
4 changes: 3 additions & 1 deletion docs/content/workflows/scRNA-seq.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@ The general procedure for mode "Gruen" involves:
2. mapping read 2,
3. quantification at the single cell level.

Mode "Gruen" is going to be deprecated by the end of 2020.

The general procedure for mode "STARsolo" involves:

1. moving cell barcodes and UMIs from read 1 into the CB and UMI tags of read 2 during mapping (STARsolo),
Expand Down Expand Up @@ -259,7 +261,7 @@ Understanding the outputs: mode Gruen
Understanding the outputs: mode STARsolo
----------------------------------------

- **Main result:** output folders with 10x-format count matrices can be found in sample subfolders under ``STARsolo``. The ouput consists of three files: barcodes.tsv, features.tsv, matrix.mtx. Their gzipped versions are stored in the same folder.
- **Main result:** output folders with 10x-format count matrices can be found in sample subfolders under ``STARsolo``. The ouput consists of three files: barcodes.tsv, features.tsv, matrix.mtx. Their gzipped versions are stored in the same folder. Seurat objects from merged samples are available in the ``Seurat`` folder.

- Corresponding annotation files are: ``Annotation/genes.filtered.bed`` and ``Annotation/genes.filtered.gtf``, respectively.

Expand Down
8 changes: 5 additions & 3 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -34,13 +34,15 @@ Quick start

.. code:: bash
conda create -n snakePipes -c mpi-ie -c bioconda -c conda-forge snakePipes
conda create -n snakePipes -c mpi-ie -c conda-forge -c bioconda snakePipes==2.1.1
* You can update snakePipes to the latest version available on conda with:

.. code:: bash
conda update -n snakePipes -c mpi-ie -c bioconda -c conda-forge snakePipes
conda update -n snakePipes -c mpi-ie -c conda-forge -c bioconda --prune snakePipes
snakePipes is going to move to mamba in the future.

* Download genome fasta and annotations for an your organism, and build indexes, Check in :ref:`createIndices`

Expand All @@ -50,7 +52,7 @@ Quick start
snakePipes config --help
.. note:: If you have a copy of a `shared/defaults.yaml` with the necessary paths configured (i.e. from a previous installation), you can pass it to snakePipes config with `--oldConfig` and `--configMode recycle` instead of providing all the paths manually again. Config keys have to match for this to work.
.. note:: If you have a copy of a `shared/defaults.yaml` with the necessary paths configured (i.e. from a previous installation), you can pass it to snakePipes config with `--oldConfig` and `--configMode recycle` instead of providing all the paths manually again. Config keys have to match for this to work. In the same way, you can pass your external organism yaml folder with ``--organismsDir`` or cluster config with ``--clusterConfig``.

* Download example fastq files for the human genome `here <https://zenodo.org/record/3707259>`_

Expand Down
2 changes: 1 addition & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
snakemake >= 5.13.0
snakemake==5.18.0
psutil
pandas
fuzzywuzzy
Expand Down
2 changes: 1 addition & 1 deletion snakePipes/__init__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = '2.1.2'
__version__ = '2.2.0'
29 changes: 21 additions & 8 deletions snakePipes/common_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@
import sys
import shutil
from fuzzywuzzy import fuzz
import smtplib
from email.message import EmailMessage
from snakePipes import __version__


Expand All @@ -23,16 +25,17 @@ def set_env_yamls():
'CONDA_scRNASEQ_ENV': 'envs/sc_rna_seq.yaml',
'CONDA_seurat3_ENV': 'envs/sc_rna_seq_seurat3.yaml',
'CONDA_loompy_ENV': 'envs/sc_rna_seq_loompy.yaml',
'CONDA_alevinqc_ENV': 'envs/sc_rna_seq_alevinqc.yaml',
'CONDA_DNA_MAPPING_ENV': 'envs/dna_mapping.yaml',
'CONDA_CHIPSEQ_ENV': 'envs/chip_seq.yaml',
'CONDA_HISTONE_HMM_ENV': 'envs/histone_hmm.yaml',
'CONDA_ATAC_ENV': 'envs/atac_seq.yaml',
'CONDA_HIC_ENV': 'envs/hic.yaml',
'CONDA_WGBS_ENV': 'envs/wgbs.yaml',
'CONDA_RMD_ENV': 'envs/rmarkdown.yaml',
'CONDA_PREPROCESSING_ENV': 'envs/preprocessing.yaml',
'CONDA_NONCODING_RNASEQ_ENV': 'envs/noncoding.yaml',
'CONDA_SAMBAMBA_ENV': 'envs/sambamba.yaml'}
'CONDA_SAMBAMBA_ENV': 'envs/sambamba.yaml',
'CONDA_pysam_ENV': 'envs/pysam.yaml'}


def merge_dicts(x, y):
Expand Down Expand Up @@ -137,11 +140,19 @@ def get_sample_names(infiles, ext, reads):
x = os.path.basename(x)[:-lext]
if x.endswith(reads[0]):
x = x[:-l0]
s.add(x)
elif x.endswith(reads[1]):
x = x[:-l1]
s.add(x)
else:
continue
s.add(x)
sys.stderr.write("Warning! {} does not have {} as its name suffix. "
"Either change it or modify the 'reads' in the "
"config.yaml to your deired ones.\n".format(x, reads))

if sorted(list(s)) == []:
sys.exit("Error! No sample has the right read suffix ({}). "
"Please modify them or update the config.yaml with "
"your desired suffix.".format(reads))
return sorted(list(s))


Expand Down Expand Up @@ -171,9 +182,13 @@ def is_paired(infiles, ext, reads):
infiles_dic[bname] = [infile]
else:
infiles_dic[bname].append(infile)
if infiles_dic and max([len(x) for x in infiles_dic.values()]) == 2:
if not infiles_dic:
sys.exit("Error: No fastq file has been found to be checked.")
values_length = [len(x) for x in infiles_dic.values()]
if min(values_length) == 2:
pairedEnd = True
# TODO: raise exception if single-end and paired-end files are mixed
elif min(values_length) == 1 and max(values_length) == 2:
sys.exit("Error: The directory contains a mixture of paired-end and single-end data!")
return pairedEnd


Expand Down Expand Up @@ -365,8 +380,6 @@ def sendEmail(args, returnCode):
Try to send an email to the user. Errors must be non-fatal.
"""
try:
import smtplib
from email.message import EmailMessage
msg = EmailMessage()
msg['Subject'] = "Snakepipes completed"
msg['From'] = args.emailSender
Expand Down
2 changes: 1 addition & 1 deletion snakePipes/shared/cluster.yaml
100644 → 100755
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ bamCoverage_raw:
create_snpgenome:
memory: 30G
FASTQdownsample:
memory: 3G
memory: 4G
filter_reads_umi:
memory: 10G
plotCorrelation_pearson:
Expand Down
Empty file modified snakePipes/shared/defaults.yaml
100644 → 100755
Empty file.
2 changes: 1 addition & 1 deletion snakePipes/shared/organisms/mm10.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ hisat2_index: "/data/repository/organisms/GRCm38_ensembl/HISAT2Index/genome"
bwa_index: "/data/repository/organisms/GRCm38_ensembl/BWAindex/genome.fa"
bwameth_index: "/data/repository/organisms/GRCm38_ensembl/BWAmethIndex/genome.fa"
known_splicesites: "/data/repository/organisms/GRCm38_ensembl/gencode/m9/HISAT2/splice_sites.txt"
star_index: "/data/repository/organisms/GRCm38_ensembl/STARIndex/2.7.1a/"
star_index: "/data/repository/organisms/GRCm38_ensembl/STARIndex/2.7.4a/"
genes_bed: "/data/repository/organisms/GRCm38_ensembl/gencode/m9/genes.bed"
genes_gtf: "/data/repository/organisms/GRCm38_ensembl/gencode/m9/genes.gtf"
extended_coding_regions_gtf: "/data/repository/organisms/GRCm38_ensembl/gencode/m9/genes.slop.gtf"
Expand Down
20 changes: 15 additions & 5 deletions snakePipes/shared/rscripts/CSAW.R
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,11 @@ importfunc <- snakemake@params[["importfunc"]] #"DB_functions.R"
allelic_info <- as.logical(snakemake@params[["allele_info"]])
outdir<-snakemake@params[["outdir"]]
yaml_path<-snakemake@params[["yaml_path"]]
useSpikeInForNorm<-snakemake@params[["useSpikeInForNorm"]]
scale_factors<-snakemake@params[["scale_factors"]]

bam_pfx<-ifelse(useSpikeInForNorm,"_host",".filtered")
bam_folder<-ifelse(useSpikeInForNorm,"split_bam","filtered_bam")

##set up a primitive log
logfile <- file(snakemake@log[["err"]], open="w+")
Expand Down Expand Up @@ -94,11 +99,11 @@ if (!is.null(sampleInfo$UseRegions)) {

if(snakemake@params[['peakCaller']] == "MACS2") {
allpeaks <- lapply(fnames, function(x) {
narrow <- paste0("../MACS2/",x,".filtered.BAM_peaks.narrowPeak")
narrow <- paste0("../MACS2/",x,bam_pfx,".BAM_peaks.narrowPeak") #bam_pfx
if(snakemake@params[["pipeline"]] %in% "ATAC-seq"){
narrow <- paste0("../MACS2/",x,".filtered.short.BAM_peaks.narrowPeak")
}
broad <- paste0("../MACS2/",x,".filtered.BAM_peaks.broadPeak")
broad <- paste0("../MACS2/",x,bam_pfx,".BAM_peaks.broadPeak") #bam_pfx
# first look for narrowpeak then braod peak
if(file.exists(narrow)) {
bed <- read.delim(narrow, header = FALSE)
Expand Down Expand Up @@ -128,9 +133,14 @@ message(paste0("Filtering windows using MACS2 output : ", length(allpeaks) , " r
keep <- overlapsAny(SummarizedExperiment::rowRanges(chip_object$windowCounts), allpeaks)
chip_object$windowCounts <- chip_object$windowCounts[keep,]

## TMM normalize
message("Normalizing using TMM (using 10kb background counts)")
chip_object <- tmmNormalize_chip(chip_object, binsize = 10000, plotfile = "TMM_normalizedCounts.pdf")
## normalize
if(useSpikeInForNorm){
message("Normalizing using spikein-derived scale factors")
chip_object <- tmmNormalize_chip(chip_object, binsize = 10000, plotfile = "spikein_normalizedCounts.pdf")
}else{
message("Normalizing using TMM (using 10kb background counts)")
chip_object <- tmmNormalize_chip(chip_object, binsize = 10000, plotfile = "TMM_normalizedCounts.pdf")
}

## get DB regions
print("Performing differential binding")
Expand Down
7 changes: 4 additions & 3 deletions snakePipes/shared/rscripts/CSAW_report.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -22,13 +22,14 @@ This report summarizes statistical analysis of binding events using group inform

```{r data_load}
DBdata<-file.path(snakemake@params[["outdir"]], "DiffBinding_analysis.Rdata")
useSpikeInForNorm<-snakemake@params[["useSpikeInForNorm"]]
suppressMessages(require(GenomicRanges))
suppressMessages(require(csaw))
load(DBdata)
```

Input to the differential binding analysis were `r length(chip_object$windowCounts)` genomic bins of size `r width(chip_object$windowCounts)[1]` bp having any overlap to the union of MACS2 peaks from all samples listed in the sample sheet.
Size factors were calculated via TMM normalization of read counts obtained for 10kb genomic bins and amounted to `r chip_object$normFactors`.
Input to the differential binding analysis were `r length(chip_object$windowCounts)` genomic bins of size `r width(chip_object$windowCounts)[1]` bp having any overlap to the union of `r unlist(strsplit(snakemake@output[["outfile"]],"_"))[2]` peaks from all samples listed in the sample sheet.
Size factors were calculated via `r ifelse(useSpikeInForNorm,"spikein normalization","TMM normalization of read counts obtained for 10kb genomic bins")` and amounted to `r chip_object$normFactors`.

### Statistical analysis

Expand Down Expand Up @@ -71,7 +72,7 @@ Deeptools matrices were computed for the filtered lists of merged regions showin
Heatmaps obtained using log2 ratio of ChIP signal over input were as follows:

```{r plot_l2r,fig.show='hold',fig.align='center', out.width ='40%',out.height='30%',fig.cap="UP and DOWN regions using log2 ratio."}
if(!snakemake@params[["pipeline"]] %in% "chip-seq"){
if(sum(grepl("log2r.heatmap.png",snakemake@input[["csaw_in"]]))<1){
message("ChIP signal enrichment over input is only available for ChIP-seq workflow!")
}else{
knitr::include_graphics(c(file.path(snakemake@params[["outdir"]], "CSAW.UP.log2r.heatmap.png"),
Expand Down

0 comments on commit 4145ce4

Please sign in to comment.