Skip to content

lemene/PECAT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Introduction

PECAT is a phased error correction and assembly tool for long reads. It includes a haplotype-aware correction method and an efficient diploid assembly method.

Dependency

Installation

Installing PECAT from source codes

$ git clone --recursive https://github.com/lemene/PECAT.git
$ cd PECAT
$ make

or

$ git clone  https://github.com/lemene/PECAT.git
$ cd PECAT
$ git submodule init
$ git submodule update
$ make

After building, all the executable files can be found in PECAT/build/bin. We can run PECAT/build/bin/pecal.pl or add the path to the system PATH and run pecal.pl.

zlib not found

ZLIB=zlib-1.3.1
wget -c http://www.zlib.net/$ZLIB.tar.gz
tar -xzf $ZLIB.tar.gz
cd $ZLIB
./configure && make
cd ..
export C_INCLUDE_PATH=`pwd`/$ZLIB:$C_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=`pwd`/$ZLIB:$CPLUS_INCLUDE_PATH
export LIBRARY_PATH=`pwd`/$ZLIB:$LIBRARY_PATH
make

Installing PECAT using conda

Use Bioconda.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

Install PECAT.

conda create -n pecat-env
conda activate pecat-env
conda install pecat

Then we can run pecal.pl.

Installing third-party tools using conda

PECAT depends on other tools, and their paths need to be added to the system PATH. We recommend using conda to install the third-party tools.

conda create -n pecat-env
conda activate pecat-env
conda install minimap2=2.24 racon=1.5 perl=5.32 samtools=1.17 python=3.11 # clair3 medaka

Installing and configuring clair3 and medaka

When we installed clair3 and medaka using conda, we encountered a conflict between clair3(v0.1-r12) and medaka (1.7.2). Only one of them can be installed. If you also fail to install the tools, we recommend using singularity or docker to invoke them.

Using singularity

Download the images

singularity pull docker://hkubal/clair3:v0.1-r12
singularity pull docker://ontresearch/medaka:v1.7.2

Add the following parameters to the config file. See cfg_cattle_ont

phase_clair3_command = singularity exec -B `pwd -P`:`pwd -P` -B /tmp:/tmp clair3_v0.1-r12.sif /opt/bin/run_clair3.sh
polish_medaka_command = singularity exec -B `pwd -P`:`pwd -P`  -B /tmp:/tmp medaka_v1.7.2.sif medaka
  • `pwd -P`:`pwd -P`: It maps current working directory from the host to the container, so clair3 and medaka can access the files generated by PECAT.
  • /tmp:/tmp: prevents /tmp in the container from becoming full.
  • clair3_v0.1-r12.sif and medaka_v1.7.2.sif should be replaced with the paths of corresponding images. Or the images are placed to the current path.

Using docker

Add the following parameters to the config file.

phase_clair3_command =  docker run -i -v `pwd -P`:`pwd -P` -v /tmp:/tmp hkubal/clair3:latest /opt/bin/run_clair3.sh
polish_medaka_command = docker run -i -v `pwd -P`:`pwd -P` -v /tmp:/tmp ontresearch/medaka:v1.7.2 medaka

Without medaka and clair3

PECAT can run and achieve genome assembly without clair3 and medaka. Set the following parameters to the config file.

phase_method = 0
polish_medaka = 0

See cfg_cattle_clr

Docker pre-built image

There is a pre-built docker image. Use the following commands to run pecat.

docker run -i -v $CWD:/mnt -v /var/run/docker.sock:/var/run/docker.sock lemene/pecat:v0.0.3 pecat.pl unzip cfg
  • -v $CWD:/mnt: Map current working directory ($CWD) of the host to current working directory (/mnt) of the container, so PECAT can access the config file cfg.
  • -v /var/run/docker.sock:/var/run/docker.sock: Docker in Docker. By adding this parameter, pecat in the container can run the docker images (clair3 and medaka) of the host.
  • The directory of datasets should also be mapped carefully to ensure that PECAT in the container can access them.

Add the following parameters to the config file, so that pecat can call clair3 and medaka in the container.

phase_clair3_command =  docker run -i -v $CWD:/mnt -v /tmp:/tmp hkubal/clair3:latest /opt/bin/run_clair3.sh
polish_medaka_command = docker run -i -v $CWD:/mnt -v /tmp:/tmp ontresearch/medaka:v1.7.2 medaka
  • $CWD: should be set to an absolute path of current working directory in the host.

Using singularity

Download PECAT image.

singularity pull docker://lemene/pecat:v0.0.3

Run PECAT using the following command.

singularity exec -B `pwd -P`:`pwd -P` pecat_v0.0.3.sif pecat.pl unzip cfg
  • `pwd -P`:`pwd -P`: It maps current working directory from the host to the container, so PECAT can access the config file cfg.
  • The directory of datasets should also be mapped carefully to ensure that PECAT in the container can access them.

We did not successfully run the singularity image in the container. It reports ERROR : Failed to create user namespace: user namespace disabled. So in this mode PECAT cannot run clair3 and medaka. See Without medaka and clair3

Testing

We can run the demo to test whether PECAT has been succesfully installed. See demo/README.md.

cd demo
pecat.pl unzip cfgfile

Quick Start

Create a config file using the following command,

$ pecat.pl config cfgfile

Fill in the necessary parameters.

project=S1
reads=./demo/reads.fasta.gz
genome_size=1500000
  ......

Run PECAT to assemble the reads.

$ pecat.pl unzip cfgfile
  • The corrected reads are in the file S1/1-correct/corrected_reads.fasta.
  • The primary/alternate-format contigs are in the files S1/6-polish/racon/{primary.fasta,alternate.fasta}.
  • The dual-format contigs are in the files S1/6-polish/racon/{haplotype_1.fasta,haplotype_2.fasta}.
  • If the paramter polish_medaka=1 is set, PECAT uses Medaka to further polish the above results, and the contigs are placed in S1/6-polish/medaka.

In the demo directory, there is a small example (demo/{cfgfile,reads.fasta.gz}) and several config files (demo/configs). When assembling a dataset, you can choose a config file of a similar species as a template and modify its parameters. See config.md.

Notes

Note: We strongly recommend setting the parameter cleanup=1. PECAT deletes temporary files, otherwise it take up a lot of disk space.

Note: For large genomes such as cattle and human, we strongly suggest adding the parameter -f 0.005 or -f 0.002 to corr_rd2rd_options and align_rd2rd_options. See cfg_cattle_clr, cfg_cattle_ont and cfg_hg002_ont. The parameter is passed to minimap2, which means to filter out top 0.005 or 0.002 fraction of repetitive minimizers. It outputs less candidate overlaps, which reduces disk usage and speeds up error correction step and assembling step.

Resource usage

Dataset Size Cov. Config CPU time Peak memory usage Peak disk space usage
Yeast-CLR 12Mb 200 cfg_yeast_clr 11h 18G 4G
Arab-CLR 130Mb 135 cfg_arab_clr 167h 71G 80G
Dro-CLR 140Mb 146 cfg_dro_clr 142h 41G 49G
Cattle-CLR 2.7Gb 135 cfg_cattle_clr 4437h 219G 1099G
Arab-ONT 130Mb 106 cfg_arab_ont 359h 179G 142G
Cattle-ONT 2.7Gb 200 cfg_cattle_ont 8869h 381G 1574G
HG002-ONT 3Gb 59 cfg_hg002_ont 7456h 348G 1211G

The assemblies are available at https://doi.org/10.5281/zenodo.8380113

More details

PECAT follows the correct-then-assemble strategy, including an error correction module and a two-round string-graph-based assembly module. Here, we describe some important steps and parameters. See config.md

Correcting raw reads

PECAT first extracts prep_output_coverage longest raw reads for correction. It uses minimap2 with corr_rd2rd_options to find the candidate overlaps between the extracted reads. PECAT corrects the raw reads with corr_correct_options. It implements corr_iterate_number rounds of error correction. After correcting, it extracts corr_output_coverage longest corrected reads for assembly, which are in the file $PRJECT/1-correct/corrected_reads.fasta. We can use the following scripts to correct raw reads.

$ pecat.pl correct cfgile

The first round of assembly

In the first round of assembly, PECAT first uses minimap2 with align_rd2rd_options to detect the overlaps between corrected reads. Minimap2 uses the seed-based method to find the overlaps, so the overlaps may have long overhangs. To reduce overhangs of overlaps, PECAT (align_filter_options) performs local alignment to extend overlaps to the ends of the reads and filter out the overlaps still with long overhangs. Then, PECAT (asm1_assmeble_options) assembles the overlaps to haplotype-collapsed contigs. The contigs file is $PROJECT/3-assemble/primary.fasta. We can use the following scripts to run this step.

$ pecat.pl assemble cfgfile

The second round of assembly

In the second round of assembly, PECAT first use minimap2 to map the reads (phase_use_reads=0 for raw reads phase_use_reads=1 for corrected reads) to $PROJECT/3-assemble/primary.fasta with phase_rd2ctg_options. PECAT calls the heterozygous SNP sites based on the base frequency of the alignments and identifies the inconsistent overlaps with phase_phase_options. PECAT removes the inconsistent overlaps with phase_filter_options.

For Nanopore reads, we recommend using clair3 to call heterozygous SNPs from the raw reads. This is a similar process. You can use similar parameters above, but the parameters start with phase_clair3_.

After filtering out inconsistent overlaps, PECAT use asm2_assemble_options to assemble the filtered overlaps. The contigs files are placed in $PROJECT/5-assemble.

Polishing

After generating the contigs, PECAT use minimap2 with polish_map_options to map reads (polish_use_reads=0 for raw reads polish_use_reads=1 for corrected reads) to the $PROJECT/5-assemble/{primary.fasta,alternate.fasta} or $PROJECT/5-assemble/haplotype_1.fasta,haplotype_2.fasta} and uses racon with polish_cns_options to polish the contigs. The polished contigs are placed in $PROJECT/6-polish/racon.

If polish_medaka=1 is set, PECAT use medaka to further improve the quality of the assembly. The parameters are similar and start with polish_medaka_. The contigs are placed in $PROJECT/6-polish/medaka.

Running with multiple computation nodes

The pipeline script is written with plgd. It supports PBS, SGE, LSF and Slurm systems. The follow parameter in the config file need to be set:

grid= auto:4

In the above example, auto means the pipeline automatically detects the type of cluster system. pbs, sge, lsf and slurm represent the corresponding systems, respectively. 4 computation nodes are used and each computation node run with threads CPU threads.

Additional options

The parameter grid_options is used to add additional options.

Here is an example. When grid_options is set to

grid_options=  -A pi_zy --partition cpuQ -q cpuq

the command for slurm system is

sbatch -D `pwd` -J al_rd2rd_split.sh --cpus-per-task=1 -o al_rd2rd_split.sh.log -A pi_zy --partition cpuQ -q al_rd2rd_split.sh

Contact