PECAT is a phased error correction and assembly tool for long reads. It includes a haplotype-aware correction method and an efficient diploid assembly method.
- python3 (3.6+)
- minimap2 (2.17+)
- racon (v1.4.21+)
- perl (v5.22.1+)
- samtools (1.7+)
- clair3 (v0.1-r12+) (optional)
- medaka (1.7.2+) (optional)
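As a quick sanity check, the dependency list above can be verified with a small Python sketch. It only tests whether each executable is on PATH and does not compare versions. The executable names run_clair3.sh and medaka for the optional tools are assumptions about how those packages are installed.

```python
import shutil

# Executables from the dependency list above.
REQUIRED = ["python3", "minimap2", "racon", "perl", "samtools"]
# Assumed executable names for the optional tools.
OPTIONAL = ["run_clair3.sh", "medaka"]


def missing_tools(names, which=shutil.which):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    print("missing required tools:", missing_tools(REQUIRED))
    print("missing optional tools:", missing_tools(OPTIONAL))
```

An empty list for the required tools means every executable was found; version requirements still have to be checked by hand.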
$ git clone --recursive https://github.com/lemene/PECAT.git
$ cd PECAT
$ make
or
$ git clone https://github.com/lemene/PECAT.git
$ cd PECAT
$ git submodule init
$ git submodule update
$ make
After building, all the executable files can be found in PECAT/build/bin. We can run PECAT/build/bin/pecat.pl directly, or add that directory to the system PATH and run pecat.pl.
ZLIB=zlib-1.3.1
wget -c http://www.zlib.net/$ZLIB.tar.gz
tar -xzf $ZLIB.tar.gz
cd $ZLIB
./configure && make
cd ..
export C_INCLUDE_PATH=`pwd`/$ZLIB:$C_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=`pwd`/$ZLIB:$CPLUS_INCLUDE_PATH
export LIBRARY_PATH=`pwd`/$ZLIB:$LIBRARY_PATH
make
Use Bioconda.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
Install PECAT.
conda create -n pecat-env
conda activate pecat-env
conda install pecat
Then we can run pecat.pl.
PECAT depends on other tools, and their paths need to be added to the system PATH. We recommend using conda to install the third-party tools.
conda create -n pecat-env
conda activate pecat-env
conda install minimap2=2.24 racon=1.5 perl=5.32 samtools=1.17 python=3.11 # clair3 medaka
When we installed clair3 and medaka using conda, we encountered a conflict between clair3 (v0.1-r12) and medaka (1.7.2), so only one of them could be installed. If you also fail to install the tools, we recommend using singularity or docker to invoke them.
Download the images
singularity pull docker://hkubal/clair3:v0.1-r12
singularity pull docker://ontresearch/medaka:v1.7.2
Add the following parameters to the config file. See cfg_cattle_ont
phase_clair3_command = singularity exec -B `pwd -P`:`pwd -P` -B /tmp:/tmp clair3_v0.1-r12.sif /opt/bin/run_clair3.sh
polish_medaka_command = singularity exec -B `pwd -P`:`pwd -P` -B /tmp:/tmp medaka_v1.7.2.sif medaka
- `pwd -P`:`pwd -P`: maps the current working directory from the host to the container, so clair3 and medaka can access the files generated by PECAT.
- /tmp:/tmp: prevents /tmp in the container from becoming full.
- clair3_v0.1-r12.sif and medaka_v1.7.2.sif should be replaced with the paths of the corresponding images, unless the images are placed in the current directory.
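The two command lines above share one pattern, which the following Python sketch reproduces as a string. It is only an illustration: the helper name singularity_command is hypothetical, and it writes the resolved directory directly instead of the `pwd -P` substitution used in the config file.

```python
import os


def singularity_command(image, program, workdir=None):
    """Build a 'singularity exec' command line like the two shown above."""
    # os.path.realpath resolves symlinks, similar to `pwd -P`.
    workdir = os.path.realpath(workdir or os.getcwd())
    return "singularity exec -B {0}:{0} -B /tmp:/tmp {1} {2}".format(
        workdir, image, program
    )


if __name__ == "__main__":
    print("phase_clair3_command = "
          + singularity_command("clair3_v0.1-r12.sif", "/opt/bin/run_clair3.sh"))
    print("polish_medaka_command = "
          + singularity_command("medaka_v1.7.2.sif", "medaka"))
```

If the images are stored elsewhere, pass their full paths as the image argument.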
Add the following parameters to the config file.
phase_clair3_command = docker run -i -v `pwd -P`:`pwd -P` -v /tmp:/tmp hkubal/clair3:latest /opt/bin/run_clair3.sh
polish_medaka_command = docker run -i -v `pwd -P`:`pwd -P` -v /tmp:/tmp ontresearch/medaka:v1.7.2 medaka
- `pwd -P`:`pwd -P`: see Using singularity.
- /tmp:/tmp: see Using singularity.
PECAT can run and complete a genome assembly without clair3 and medaka. Set the following parameters in the config file.
phase_method = 0
polish_medaka = 0
See cfg_cattle_clr
There is a pre-built docker image. Use the following commands to run pecat.
docker run -i -v $CWD:/mnt -v /var/run/docker.sock:/var/run/docker.sock lemene/pecat:v0.0.3 pecat.pl unzip cfg
- -v $CWD:/mnt: maps the current working directory ($CWD) of the host to the current working directory (/mnt) of the container, so PECAT can access the config file cfg.
- -v /var/run/docker.sock:/var/run/docker.sock: Docker in Docker. With this parameter, pecat in the container can run the docker images (clair3 and medaka) of the host.
- The directory of datasets should also be mapped carefully to ensure that PECAT in the container can access them.
Add the following parameters to the config file, so that pecat can call clair3 and medaka in the container.
phase_clair3_command = docker run -i -v $CWD:/mnt -v /tmp:/tmp hkubal/clair3:latest /opt/bin/run_clair3.sh
polish_medaka_command = docker run -i -v $CWD:/mnt -v /tmp:/tmp ontresearch/medaka:v1.7.2 medaka
- $CWD: should be set to the absolute path of the current working directory on the host.
Download PECAT image.
singularity pull docker://lemene/pecat:v0.0.3
Run PECAT using the following command.
singularity exec -B `pwd -P`:`pwd -P` pecat_v0.0.3.sif pecat.pl unzip cfg
- `pwd -P`:`pwd -P`: maps the current working directory from the host to the container, so PECAT can access the config file cfg.
- The directory of datasets should also be mapped carefully to ensure that PECAT in the container can access them.
We did not succeed in running singularity images from inside the container. It reports
ERROR : Failed to create user namespace: user namespace disabled
So in this mode PECAT cannot run clair3 and medaka. See Without medaka and clair3.
We can run the demo to test whether PECAT has been successfully installed. See demo/README.md.
cd demo
pecat.pl unzip cfgfile
Create a config file using the following command.
$ pecat.pl config cfgfile
Fill in the necessary parameters.
project=S1
reads=./demo/reads.fasta.gz
genome_size=1500000
......
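The fragment above is a plain key=value file. When scripting around PECAT it can help to read such a file back, and the following Python sketch shows one way to do it. It is not PECAT's own parser; treating lines that start with # as comments is an assumption, and the list of required keys is limited to the three parameters shown above.

```python
# The three parameters filled in in the example above.
REQUIRED_KEYS = ("project", "reads", "genome_size")


def parse_config(text):
    """Parse key=value lines into a dict, skipping anything else."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: '#' starts a comment line.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def missing_parameters(params, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not params.get(key)]
```

Splitting on the first = only keeps values such as command lines intact even when they contain further = characters.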
Run PECAT to assemble the reads.
$ pecat.pl unzip cfgfile
- The corrected reads are in the file S1/1-correct/corrected_reads.fasta.
- The primary/alternate-format contigs are in the files S1/6-polish/racon/{primary.fasta,alternate.fasta}.
- The dual-format contigs are in the files S1/6-polish/racon/{haplotype_1.fasta,haplotype_2.fasta}.
- If the parameter polish_medaka=1 is set, PECAT uses Medaka to further polish the above results, and the contigs are placed in S1/6-polish/medaka.
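For downstream scripts, the output locations listed above can be collected in one place. The following Python sketch builds them from the project name. The function name expected_outputs is hypothetical, and for Medaka only the directory is returned because the file names inside it are not stated here.

```python
import posixpath


def expected_outputs(project, polish_medaka=False):
    """Return the result paths described above for a given project name."""
    racon = posixpath.join(project, "6-polish", "racon")
    outputs = {
        "corrected_reads": posixpath.join(project, "1-correct", "corrected_reads.fasta"),
        "primary": posixpath.join(racon, "primary.fasta"),
        "alternate": posixpath.join(racon, "alternate.fasta"),
        "haplotype_1": posixpath.join(racon, "haplotype_1.fasta"),
        "haplotype_2": posixpath.join(racon, "haplotype_2.fasta"),
    }
    if polish_medaka:
        # Only the directory is documented; file names inside are not assumed.
        outputs["medaka_dir"] = posixpath.join(project, "6-polish", "medaka")
    return outputs


if __name__ == "__main__":
    for name, path in expected_outputs("S1", polish_medaka=True).items():
        print(name, path)
```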
In the demo directory, there is a small example (demo/{cfgfile,reads.fasta.gz}) and several config files (demo/configs). When assembling a dataset, you can choose a config file of a similar species as a template and modify its parameters. See config.md.
Note: We strongly recommend setting the parameter cleanup=1 so that PECAT deletes temporary files. Otherwise, they take up a lot of disk space.
Note: For large genomes such as cattle and human, we strongly suggest adding the parameter -f 0.005 or -f 0.002 to corr_rd2rd_options and align_rd2rd_options. See cfg_cattle_clr, cfg_cattle_ont and cfg_hg002_ont. The parameter is passed to minimap2 and filters out the top 0.005 or 0.002 fraction of the most repetitive minimizers. Minimap2 then outputs fewer candidate overlaps, which reduces disk usage and speeds up the error correction and assembly steps.
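When adapting a template config for a large genome, the -f setting has to be added to both option strings. The following Python sketch adds or updates it in an option string. The helper name with_repeat_filter and the sample option string are hypothetical, and the attached form -f0.005 is not handled.

```python
import shlex


def with_repeat_filter(options, fraction):
    """Return the option string with '-f <fraction>' added or updated."""
    tokens = shlex.split(options)
    if "-f" in tokens:
        index = tokens.index("-f")
        if index + 1 < len(tokens):
            tokens[index + 1] = str(fraction)   # replace the existing value
        else:
            tokens.append(str(fraction))        # '-f' was last, value missing
    else:
        tokens.extend(["-f", str(fraction)])
    return " ".join(shlex.quote(token) for token in tokens)


if __name__ == "__main__":
    # Hypothetical minimap2 option string, not taken from a PECAT config.
    print(with_repeat_filter("-x ava-pb", 0.005))
```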
Dataset | Size | Cov. | Config | CPU time | Peak memory usage | Peak disk space usage |
---|---|---|---|---|---|---|
Yeast-CLR | 12Mb | 200 | cfg_yeast_clr | 11h | 18G | 4G |
Arab-CLR | 130Mb | 135 | cfg_arab_clr | 167h | 71G | 80G |
Dro-CLR | 140Mb | 146 | cfg_dro_clr | 142h | 41G | 49G |
Cattle-CLR | 2.7Gb | 135 | cfg_cattle_clr | 4437h | 219G | 1099G |
Arab-ONT | 130Mb | 106 | cfg_arab_ont | 359h | 179G | 142G |
Cattle-ONT | 2.7Gb | 200 | cfg_cattle_ont | 8869h | 381G | 1574G |
HG002-ONT | 3Gb | 59 | cfg_hg002_ont | 7456h | 348G | 1211G |
The assemblies are available at https://doi.org/10.5281/zenodo.8380113
PECAT follows the correct-then-assemble strategy, including an error correction module and a two-round string-graph-based assembly module. Here, we describe some important steps and parameters. See config.md.
PECAT first extracts the prep_output_coverage longest raw reads for correction. It uses minimap2 with corr_rd2rd_options to find the candidate overlaps between the extracted reads. PECAT corrects the raw reads with corr_correct_options and performs corr_iterate_number rounds of error correction. After correction, it extracts the corr_output_coverage longest corrected reads for assembly, which are in the file $PROJECT/1-correct/corrected_reads.fasta. We can use the following command to correct raw reads.
$ pecat.pl correct cfgfile
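To make the coverage-based extraction more concrete, here is a minimal Python sketch of selecting the longest reads until a target coverage is reached. It assumes that a coverage value translates into a base budget of genome_size times coverage; this is an illustration of the idea, not PECAT's implementation, and the function name select_longest is hypothetical.

```python
def select_longest(read_lengths, genome_size, coverage):
    """Pick the longest reads until their total length reaches the target.

    read_lengths maps read name to length in bases.
    """
    target = genome_size * coverage
    chosen, total = [], 0
    # Visit reads from longest to shortest.
    for name, length in sorted(read_lengths.items(),
                               key=lambda item: item[1], reverse=True):
        if total >= target:
            break
        chosen.append(name)
        total += length
    return chosen
```

With genome_size=1500000 and a coverage of 80, for example, the loop stops once the chosen reads add up to at least 120 Mb.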
In the first round of assembly, PECAT first uses minimap2 with align_rd2rd_options to detect the overlaps between corrected reads. Minimap2 uses a seed-based method to find the overlaps, so the overlaps may have long overhangs. To reduce the overhangs, PECAT (align_filter_options) performs local alignment to extend the overlaps to the ends of the reads and filters out the overlaps that still have long overhangs. Then, PECAT (asm1_assemble_options) assembles the overlaps into haplotype-collapsed contigs. The contigs file is $PROJECT/3-assemble/primary.fasta. We can use the following command to run this step.
$ pecat.pl assemble cfgfile
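The overhang filter in this step can be pictured with a short Python sketch. It uses a definition of overhang that is common in overlap-based assemblers (the unaligned bases on each side that should have aligned if the overlap were genuine) and handles same-strand overlaps only. The threshold and the function names are hypothetical; PECAT's actual criteria are controlled by align_filter_options.

```python
def overhang(a_len, a_start, a_end, b_len, b_start, b_end):
    """Overhang of a same-strand overlap between reads A and B.

    Coordinates are 0-based, half-open alignment intervals on each read.
    """
    left = min(a_start, b_start)                   # unaligned prefix
    right = min(a_len - a_end, b_len - b_end)      # unaligned suffix
    return left + right


def keep_overlap(overlap, max_overhang):
    """Keep an overlap only if its overhang is within the threshold."""
    return overhang(*overlap) <= max_overhang
```

An overlap whose aligned region reaches the end of A and the start of B has an overhang of 0, while an alignment that stops short on both sides is penalised on both sides.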
In the second round of assembly, PECAT first uses minimap2 with phase_rd2ctg_options to map the reads (phase_use_reads=0 for raw reads, phase_use_reads=1 for corrected reads) to $PROJECT/3-assemble/primary.fasta. PECAT calls the heterozygous SNP sites based on the base frequency of the alignments and identifies the inconsistent overlaps with phase_phase_options. PECAT removes the inconsistent overlaps with phase_filter_options.
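The idea of calling heterozygous sites from base frequencies can be sketched in a few lines of Python. This is a simplified illustration, not PECAT's caller: the thresholds min_minor_fraction and min_depth are hypothetical, and the real behaviour is governed by phase_phase_options.

```python
from collections import Counter


def is_heterozygous(bases, min_minor_fraction=0.25, min_depth=10):
    """Decide whether one alignment column looks heterozygous.

    bases is a string with the base each read shows at the site.
    The thresholds are illustrative assumptions.
    """
    counts = Counter(base for base in bases if base in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return False                 # too few reads to decide
    top_two = counts.most_common(2)
    if len(top_two) < 2:
        return False                 # only one allele observed
    minor_count = top_two[1][1]
    return minor_count / depth >= min_minor_fraction
```

A site where two bases are each supported by a substantial share of the reads is reported, while a single stray mismatch at high depth is treated as a sequencing error.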
For Nanopore reads, we recommend using clair3 to call heterozygous SNPs from the raw reads. The process is similar, and the parameters are similar to those above but start with phase_clair3_.
After filtering out inconsistent overlaps, PECAT uses asm2_assemble_options to assemble the filtered overlaps. The contigs files are placed in $PROJECT/5-assemble.
After generating the contigs, PECAT uses minimap2 with polish_map_options to map reads (polish_use_reads=0 for raw reads, polish_use_reads=1 for corrected reads) to $PROJECT/5-assemble/{primary.fasta,alternate.fasta} or $PROJECT/5-assemble/{haplotype_1.fasta,haplotype_2.fasta} and uses racon with polish_cns_options to polish the contigs. The polished contigs are placed in $PROJECT/6-polish/racon.
If polish_medaka=1 is set, PECAT uses medaka to further improve the quality of the assembly. The parameters are similar and start with polish_medaka_. The contigs are placed in $PROJECT/6-polish/medaka.
The pipeline script is written with plgd. It supports PBS, SGE, LSF and Slurm systems. The following parameter in the config file needs to be set:
grid= auto:4
In the above example, auto means the pipeline automatically detects the type of cluster system. Alternatively, pbs, sge, lsf and slurm select the corresponding system explicitly. The value 4 means that 4 computation nodes are used, and each computation node runs with the number of CPU threads given by the threads parameter.
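The grid value combines a system name and a node count, so the total number of CPU threads in use is the node count times threads. The following Python sketch splits the value and computes that product. It only covers the SYSTEM:NODES form shown above, and both function names are hypothetical.

```python
def parse_grid(value):
    """Split a grid value such as 'auto:4' into (system, nodes)."""
    system, separator, nodes = value.strip().partition(":")
    if not separator:
        raise ValueError("expected SYSTEM:NODES, got {!r}".format(value))
    return system.strip(), int(nodes)


def total_threads(grid_value, threads):
    """CPU threads used in total: nodes times threads per node."""
    _, nodes = parse_grid(grid_value)
    return nodes * threads


if __name__ == "__main__":
    print(parse_grid(" auto:4"))
    print(total_threads("auto:4", threads=48))
```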
The parameter grid_options passes additional options to the job submission command.
Here is an example. When grid_options is set to
grid_options= -A pi_zy --partition cpuQ -q cpuq
the command for the slurm system is
sbatch -D `pwd` -J al_rd2rd_split.sh --cpus-per-task=1 -o al_rd2rd_split.sh.log -A pi_zy --partition cpuQ -q cpuq al_rd2rd_split.sh
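The way grid_options ends up in the submission command can be mimicked with a short Python sketch. It assumes that the options are inserted verbatim just before the script name, which matches the example above but is not taken from the plgd source; the function name slurm_command is hypothetical.

```python
def slurm_command(script, cpus, grid_options=""):
    """Assemble an sbatch command line in the shape of the example above."""
    parts = [
        "sbatch",
        "-D", "`pwd`",                       # working directory, expanded by the shell
        "-J", script,                        # job name
        "--cpus-per-task={}".format(cpus),
        "-o", script + ".log",               # log file
    ]
    if grid_options.strip():
        parts.append(grid_options.strip())   # user-supplied extra options
    parts.append(script)                     # script to submit comes last
    return " ".join(parts)


if __name__ == "__main__":
    print(slurm_command("al_rd2rd_split.sh", 1,
                        "-A pi_zy --partition cpuQ -q cpuq"))
```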
- Nie Fan, niefan@csu.edu.cn
- QQ 316859622