A minimalist yet versatile workflow to process ChIP-seq (with or without input/spikein), RNA-seq, MNase-seq and ATAC-seq data.
tinyMapper
can also operate as a thin wrapper to process HiC data with hicstuff
(Cyril Matthey-Doret et al.) and cooler
.
Currently, this workflow only works for paired-end data.
DISCLAIMER:
- This is by no means the "best" or "only" way to process sequencing data. Do not hesitate to give suggestions / feedbacks to improve this workflow.
- This workflow does NOT include any proper QC / validation of the data. At the very least, do run
fastqc
on the sequencing data. Further QC checks are highly recommended, and will vary depending on which assay is performed.
DESTINATION=~/bin/
mkdir ${DESTINATION}
cd ${DESTINATION}
git clone https://github.com/js2264/tinyMapper.git
conda env create -n tm -f ${DESTINATION}/tinyMapper/tinymapper.yaml
echo 'export PATH=$PATH:"'${DESTINATION}'/tinyMapper/"' >> ~/.bashrc
conda activate tm
tinyMapper.sh --help
In more details:
- Choose a directory to install tinyMapper in (typically, a local
~/bin/
directory is appropriate here):
DESTINATION=~/bin/
mkdir ${DESTINATION}
cd ${DESTINATION}
- Clone tinyMapper's GitHub repository:
git clone https://github.com/js2264/tinyMapper.git
- Install tinyMapper requirements using
conda
(assumes thatconda
is already installed...):
conda env create -n tm -f ${DESTINATION}/tinyMapper/tinymapper.yaml
- Add tinyMapper script to your PATH so you can call it by
tinyMapper.sh
, rather than~/bin/tinyMapper/tinyMapper.sh
:
echo 'export PATH=$PATH:"'${DESTINATION}'/tinyMapper/"' >> ~/.bashrc
- Activate the created
conda
environemnt and start usingtinyMapper.sh
:
conda activate tm
tinyMapper.sh --help
Usage: ./tinyMapper.sh --mode <MODE> --sample <SAMPLE> --genome <GENOME> --output <OUTPUT> [ additional arguments ]
---------------------- BASIC ARGUMENTS -----------------------------------------
-m|--mode <MODE> Mapping mode (ChIP, MNase, ATAC, RNA, HiC) (Default: ChIP)
-s|--sample <SAMPLE> Path prefix to sample \`<SAMPLE>_R{1,2}.fq.gz\` (e.g. for \`~/reads/JS001_R{1,2}.fq.gz\` files, use \`--sample ~/reads/JS001\`)
-g|--genome <GENOME> Path prefix to reference genome (e.g. for \`~/genome/W303/W303.fa\` fasta file, use \`--genome ~/genome/W303/W303\`)
-h|--help Print help ('--help' for examples)
---------------------- ADVANCED ARGUMENTS --------------------------------------
-i|--input <INPUT> (Optional) Path prefix to input \`<INPUT>_R{1,2}.fq.gz\`
-c|--calibration <CALIBRATION> (Optional) Path prefix to genome used for calibration
-bl|--blacklist <BED> Bed file of blacklist regions
-a|--alignment <ALIGN.> Alignment options for \`bowtie2\` (between single quotes)
Default: '' (no specific options)
-f|--filter <FILTER> Filtering options for \`samtools view\` (between single quotes)
Default: '-f 2 -q 10' ('-f 2' to only keep concordant mapped and paired reads, '-q 10' to filter out reads with mapping quality score < 10)
-d|--duplicates Keep duplicate reads
-hic|--hicstuff <OPT> Additional arguments passed to hicstuff (default: \`--iterative --duplicates --filter --plot\`)
-r|--resolutions <#> Resolution of final matrix file (default: '10000,20000,40000,160000,1280000')
-re|--restriction <RE> Restriction enzyme(s) used for HiC (default: Arima \`--restriction DpnII,HinfI\`)
-M|--MNaseSizes <MIN,MAX> Minimum and maximum fragment size for MNase track (default: \`--MNaseSizes 70,250\`)
---------------------- OUTPUT ARGUMENTS ----------------------------------------
-t|--threads <THREADS> (Optional) Number of threads (Default: 8)
-o|--output <OUTPUT> Path to store results (Default: \`./results/\`)
-k|--keepIntermediate (Optional) Keep intermediate mapping files
Note that fastq files shoud ideally be named following this convention:
- Read 1: <SAMPLE>_R1.fq.gz
- Read 2: <SAMPLE>_R2.fq.gz
Alternatively, the script will try to find paired-end files named <SAMPLE>_R[12].fastq.gz, <SAMPLE>_nxq_R[12].fq.gz, <SAMPLE>.end1.fq.gz or <SAMPLE>.end[12].gz.
Make sure tinyMapper script (tinyMapper.sh
) is available by adding its location to your path (echo 'export PATH=$PATH:"~/bin/tinyMapper/"' >> ~/.bashrc
).
conda activate tm
sbatch --mem 32G -c 32 --wrap "tinyMapper.sh --mode <MODE> --sample <SAMPLE> --genome <GENOME> --output <OUTPUT> --threads 32"
# For ChIP processing pipelines
# - Without input
sbatch --mem 32G -c 32 --wrap "tinyMapper.sh -m ChIP -s tests/testChIP -g ~/appascratch/genomes/S288c/S288c --threads 32"
# - With input and without calibration
sbatch --mem 32G -c 32 --wrap "tinyMapper.sh -m ChIP -s tests/testChIP.IP -i tests/testChIP.input -g ~/appascratch/genomes/S288c/S288c --threads 32"
# - With input and with calibration
sbatch --mem 32G -c 32 --wrap "tinyMapper.sh -m ChIP -s tests/testChIP.IP -i tests/testChIP.input -g ~/appascratch/genomes/S288c/S288c -c ~/appascratch/genomes/CBS138/CBS138 --threads 32"
# For RNA processing pipelines
sbatch --mem 32G -c 32 --wrap "tinyMapper.sh -m RNA -s tests/testRNA -g ~/appascratch/genomes/S288c/S288c --threads 32"
# For MNase processing pipelines
sbatch --mem 32G -c 32 --wrap "tinyMapper.sh --mode MNase --sample tests/testMNase --genome ~/appascratch/genomes/S288c/S288c --threads 32"
# For Hi-C processing pipelines
sbatch --mem 32G -c 32 --wrap "tinyMapper.sh --mode HiC --sample tests/testHiC --genome ~/appascratch/genomes/S288c/S288c --threads 32"
-
ChIP-seq mode:
-
Without input:
./tinyMapper.sh \ --mode ChIP \ -s ~/testIP \ -g ~/genomes/R64-1-1/R64-1-1 \ -o ~/results
-
With input:
./tinyMapper.sh --mode ChIP \ --sample ~/testIP \ --input ~/testInput \ --genome ~/genomes/R64-1-1/R64-1-1 \ --output ~/results
-
With input and calibration:
./tinyMapper.sh --mode ChIP \ --sample ~/testIP \ --input ~/testInput \ --genome ~/genomes/R64-1-1/R64-1-1 \ --calibration ~/genomes/Cglabrata/Cglabrata \ --output ~/results
-
-
RNA-seq mode:
./tinyMapper.sh --mode RNA -s ./testRNA -g ~/genomes/W303/W303 -o ~/results
-
MNase-seq mode:
./tinyMapper.sh --mode MNase -s ./testMNase -g ~/genomes/W303/W303 -o ~/results
-
HiC mode:
./tinyMapper.sh --mode HiC -s ./testHiC -g ~/genomes/W303/W303 -o ~/results --resolutions 1000,2000,8000 --restriction 'DpnII,HinfI'
The default steps are:
- Mapping with bowtie2 (against spikein ref. as well if needed)
- Filtering
bam
files:- Fixing mates
- Removing duplicates (can be skipped with
--duplicates
) - Removing reads with mapping quality < Q10 (can be adjusted / skipped with
--filter <SAMTOOLS VIEW OPTIONS>
) - Removing unpaired reads (can be adjusted / skipped with
--filter <SAMTOOLS VIEW OPTIONS>
) - For Mase: extra filtering to keep only fragments between 70-250 bp
- For Hi-C: process
fastq
files withhicstuff
and binnify/balance/zoomify withcooler
- Generating tracks:
- CPM (counts per million) tracks
- Input and spikein-based calibrated tracks
- For Mase: extra track for nucleosome positions
- For RNA-seq: directed tracks (
fwd
andrev
transcription)
- Extracting some very succint stats on mapping results
- Keeping everything tidy, organized, documented and reproducible. Notably, when running
tinyMapper.sh
, three files are generated:*-log.txt
: A detailedlog
file*-commands.txt
: A list of the actual commands that were executed in the pipeline*-script.txt
: A backup copy of thetinyMapper.sh
entire script as it was at the time of the execution
- A. Cournac, A. Bignaud & F. Girard for tests.
- H. Bordelet for sharing her mapping scripts and configuration.
- L. Meneu for suggestions of improvements in documentation and raising bugs.