High-Noise Single-Molecule sequence Assembly pipeline
Set of different tools to perfom an assembly using High-Noise Single-Molecule sequences (from PacBio or Nanopore).
- src/snk_canupipe.py . Run correction of pacbio reads and performs a canu assembly.
- src/snk.quiver2.3.py . Polish canu assembly using Quiver 2.3.
- src/snk.quiver3.0.py . Polish canu assembly using Quiver 3.0.
- src/snk.pilon.py . Polish canu assembly using Pilon 1.20.
First, it uses colormap or proovread to run a correction of pacbio reads (long reads) using illumina reads (short reads). Since it takes long time (colormap runs bwa two times during the process, proovread is faster but takes time anyway), the script splits the pacbio reads (by default in 999 sub-files), it runs the correction in parallel and finally it merges the results.
Second, canu performs the assembly.
Considerations
- Replace the numJobs-threadsCorrection (colormap) and genomeSize (canu) as needed.
- Script optimized to run on a slurm cluster.
- python/3.5.0 required to run the snakemake script.
Usage
(dry run) $ snakemake -j 60 --snakefile snk_canupipe.py --cluster "sbatch --partition=compute --cpus-per-task=12 --time=14-0 --job-name=snkmk --mem=20GB" --config pfasta=pacbio.fasta ifasta=illumina.fastq fix=colormap -np
To use proovread instead:
(dry run) $ snakemake -j 60 --snakefile snk_canupipe.py --cluster "sbatch --partition=compute --cpus-per-task=12 --time=14-0 --job-name=snkmk --mem=20GB" --config pfasta=pacbio.fasta ifasta=illumina.fastq fix=proovread -np
Possible improvements
Once the assembly is obtained (from canu or falcon for example), the result can be polished using the PacBio reads and Quiver. The snk.quiver2.3.py script uses SMRTANALYSIS V2.3. The steps are:
- Create fofn files from the input bax.h5.
- Run pbalign on each fofn file.
- Merge all the cmp.h5 files (pbh5tools).
- Sort the cmp.h5 single file.
- Run quiver.
Installation notes
Download and install SMRT ANAlYSIS V2.3. Before run the script, the SMRTANALYSIS installation path variable SMRTloc has to be edited.
Requirements
- Raw reads in bax.h5 format
- Assembly in fasta format.
Usage
In case we run the script in a single node, the available threads will be limited by the available cpus on the node:
(dry run) $ snakemake --snakefile snk_quiver2.3.py -j 1 --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np
It can be run also in multi-node mode (for example, 80 jobs at once, each one with 24 threads):
(dry run) $ snakemake -j 80 --snakefile snk_quiver2.3.py --cluster-config cluster.json --cluster "sbatch --partition=compute --cpus-per-task=1 --time=14-0 --job-name=snkmk --mem=10GB" --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np
Considerations
Last step (the quiver run itself) has high memory demand. It took ~7 days for a ~450Mbps genome using, at least, 1T of memory.
Snakemake cluster config file is attached.
Script based in the PacificBiosciences tutorial
There are important differences using the SMRTANALYSIS V3.0, since quiver started to work with bam files instead h5 format. In this case, the steps are:
- Convert bax.h5 files to bam
- Run pbalign with each bam file.
- Merge all the pbalign bam files output in a single bam file (bamtools).
- (sort/index bam and index fasta)
- run Quiver.
Run notes
Download and install pitchfork Before run the script, the SMRTANALYSIS installation path variable SMRTloc has to be edited.
Requirements
- Raw reads in bax.h5 format
- Canu/falcon assembly in fasta format.
Usage
In case we run the script in a single node, the available threads will be limited by the available cpus on the node:
(dry run) $ snakemake --snakefile snk_quiver3.0.py -j 1 --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np
It can be run also in multi-node mode (for example, 80 jobs at once, each one with 24 threads):
(dry run) $ snakemake -j 80 --snakefile snk_quiver3.0.py --cluster-config cluster.json --cluster "sbatch --partition=compute --cpus-per-task=1 --time=14-0 --job-name=snkmk --mem=10GB" --config rdir=raw_bax.h5_folder/ assembly=canu.fasta -np
Considerations
To convert the h5 files to BAM, it uses the recommended parammeters propagating additional pulse features (QVs).
Last step (the quiver run itself) has high memory demand. It took ~7 days for a ~450Mbps genome using, at least, 1T of memory.
Snakemake cluster config file is attached.
To perform a polishing using Illumina reads instead PacBio reads we can use PILON.
- BWA alignments using the pacbio assembly and the raw illumina reads.
- Samtools conversion.
- Pilon.
Requeriments
- PacBio assembly (canu, falcon...)
- Illumina (paired) raw reads.
Run notes
Edit PILONloc with the pilon location. Pilon tested using java-jdk/1.8.0_20. Run on a node with 24 threads available.
Usage
(dry run) $ snakemake --snakefile snk.pilon.py -j 1 --config rawRead1=reads1.fq rawRead2=reads2.fq assembly=assembly.fasta -np
colormap pre-correction (using illumina reads).pilon polishing (using illumina reads).- Generate a falcon assembly and merge it with the canu one.
- Run canu without pre-correction and merge it with the corrected one.
- Generate quiver report.