Near-optimal read trimmer based on biolibc
The goal of fastq-trim is a read trimmer that runs in a fraction of the time of popular trimmers and produces comparable results. At present, fastq-trim uses simple alignment algorithms suitable for typical analyses such as RNA-Seq or ATAC-Seq, where a small amount of residual adapter content will not impact the downstream analysis.
The default is a simple function with two parameters, a minimum number of bases matched and a maximum percentage of mismatches. An exact-match function is also currently available, which runs slightly faster and misses slightly more adapters.
These algorithms are not suitable for analyses that are highly sensitive to adapter contamination. However, fastq-trim is designed to allow dropping in a variety of alignment functions. More sophisticated functions can easily be added, so fastq-trim can be easily adapted to perform trimming for just about any purpose.
The current version has been tested on RNA-Seq and ATAC-Seq data. The results so far are encouraging, with significantly better performance than cutadapt or trimmomatic and results nearly identical to cutadapt. The default "smart" algorithm is the same one used by cutadapt, except that it does not look for insertions and deletions (indels).
Diffing fastq-trim and cutadapt results reveals only 84 differences after trimming 1,000,000 reads.
Output from Test/compare-results.sh:
FastQ-Trim vs cutadapt:
Sequences in both files that differ: 68
Sequences only in fastq-trim output: 12
Sequences only in cutadapt output: 4
The differences are due mainly to cutadapt matching sequences with indels (insertions or deletions), while fastq-trim's default algorithm does not. E.g., using an adapter sequence of CTGTCTCTTATA, one of the differences in the trimmed reads is as follows:
fastq-trim: CCCCT ... CGCC CTGTCTCTTTATA CACATCTCC
cutadapt: CCCCT ... CGCC
In this case, cutadapt assumed that CTGTCTCTTTATA was an adapter with an inserted T, while fastq-trim assumed it is a natural sequence. Whether this is more likely an adapter with an insertion or a natural sequence similar to the adapter is up for debate. With only 84 such instances in 1,000,000 reads, it won't make any discernable difference to a downstream RNA-Seq or ATAC-Seq analysis, where trimming isn't even technically necessary (Liao, 2020, doi: 10.1093/nargab/lqaa068).
More functionality such as additional alignment algorithms and command-line options will be added at a later date as time permits and needs dictate. Feel free to open an issue to request a new feature.
Resident memory (actual RAM) use peaks at around 2 MiB, which means fastq-trim runs almost entirely in cache RAM. This is likely part of the reason for its performance advantage over other tools.
Fastq-trim is currently single-threaded, as using additional cores will not improve performance of the basic features. Additional cores, if available, may be used by compression and decompression tools reading input files and writing output files.
Run time with default parameters is only slightly longer than
xzcat infile.xz | gzip > outfile.gz
, and fastq-trim on uncompressed
input and output actually outruns xzcat and gzip -1. Multi-threading will
be considered if and when more computationally expensive features are added.
Results from running "./test.sh big" (in the Test directory) on a 2.9 GHz i5 are shown below.
Stats collected on an i5 2.9GHz 2-core, 4-hyperthread.
Peak CPU (including xzcat, gzip -1, pigz), wall time,
and peak memory (MiB, application only):
Time Memory
CPU Wall CPU*sec Virtual Resident
Fastq-trim 260% 5.42 13.0 13 2
Cutadapt 2-core 310% 13.30 41.23 174 120
Trimmomatic 150% 21.03 31.55 3473 740
Cutadapt 1-core 140% 22.37 31.32 46 30
Detailed output:
Trimming with uncompressed input and output...
*** FASTQ TRIM ***
Minimum match: 3
Minimum quality: 20
Minimum length: 30
Phred base: 33
Adapter matching: Smart
Maximum mismatch: 10%
Filename: temp-infile1.fastq
Mode: Single
Adapter: CTGTCTCTTATA
Read: 1000000 Adapter: 86768 Q < 20: 572761 Len < 30: 13012
4.52 real 4.43 user 0.08 sys
All remaining tests use compressed input and output...
Timing compressed read and write without trimming...
4.78 real 4.68 user 0.07 sys
4.79 real 4.75 user 0.03 sys
*** FASTQ TRIM ***
Minimum match: 3
Minimum quality: 20
Minimum length: 30
Phred base: 33
Adapter matching: Exact
Filename: SRR1972918_1.fastq.xz
Mode: Single
Adapter: CTGTCTCTTATA
Read: 1000000 Adapter: 86180 Q < 20: 572761 Len < 30: 13000
5.35 real 12.35 user 0.49 sys
*** FASTQ TRIM ***
Minimum match: 3
Minimum quality: 20
Minimum length: 30
Phred base: 33
Adapter matching: Smart
Maximum mismatch: 10%
Filename: SRR1972918_1.fastq.xz
Mode: Single
Adapter: CTGTCTCTTATA
Read: 1000000 Adapter: 86768 Q < 20: 572761 Len < 30: 13012
5.42 real 13.88 user 0.37 sys
*** FASTQ TRIM ***
Minimum match: 3
Minimum quality: 20
Minimum length: 30
Phred base: 33
Adapter matching: Smart
Maximum mismatch: 20%
Filename: SRR1972918_1.fastq.xz
Mode: Single
Adapter: CTGTCTCTTATA
Read: 1000000 Adapter: 108722 Q < 20: 572761 Len < 30: 13346
5.63 real 14.42 user 0.31 sys
Cutadapt 1 core...
status in_reads in_bp too_short too_long too_many_n out_reads w/adapters qualtrim_bp out_bp
OK 1000000 101000000 13020 0 0 986980 86845 1005881388836129
22.37 real 30.40 user 0.30 sys
Cutadapt 2 core...
status in_reads in_bp too_short too_long too_many_n out_reads w/adapters qualtrim_bp out_bp
OK 1000000 101000000 13020 0 0 986980 86845 1005881388836129
13.70 real 41.51 user 0.41 sys
TrimmomaticSE: Started with arguments:
/dev/stdin SRR1972918_1-trimmed-trimmomatic.fastq.gz ILLUMINACLIP:nextera.fa:2:30:5 TRAILING:20 MINLEN:30
Automatically using 4 threads
Using Short Clipping Sequence: 'CTGTCTCTTATA'
ILLUMINACLIP: Using 0 prefix pairs, 1 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Reads: 1000000 Surviving: 987900 (98.79%) Dropped: 12100 (1.21%)
TrimmomaticSE: Completed successfully
21.03 real 32.94 user 1.18 sys
*** FASTQ TRIM ***
Minimum match: 3
Minimum quality: 20
Minimum length: 30
Phred base: 33
Adapter matching: Smart
Maximum mismatch: 10%
Filename: SRR1972918_2.fastq.xz
Mode: Single
Adapter: CTGTCTCTTATA
Read: 1000000 Adapter: 61965 Q < 20: 828265 Len < 30: 234054
5.26 real 11.34 user 0.36 sys
Cutadapt 2 core reverse read...
status in_reads in_bp too_short too_long too_many_n out_reads w/adapters qualtrim_bp out_bp
OK 1000000 101000000 234057 0 0 765943 62032 4028031456652196
11.87 real 34.79 user 0.37 sys
*** FASTQ TRIM ***
Minimum match: 3
Minimum quality: 20
Minimum length: 30
Phred base: 33
Adapter matching: Smart
Maximum mismatch: 10%
Filename: SRR1972918_1.fastq.xz
Filename: SRR1972918_2.fastq.xz
Mode: Paired
Adapters: CTGTCTCTTATA AGATCGGAAGAG
Read: 2000000 Adapter: 97978 Q < 20: 1401026 Len < 30: 242967
8.87 real 28.09 user 1.20 sys
Scanning for a portion of the adapter to flag any adapters missed due to
base substitutions.
Also scanning for random sequence TCGAACGGC as a control.
Raw data CTGTCTCTT : 63410
Raw data TCGAACGGC : 1525
Exact match output CTGTCTCTT : 284
Exact match output TCGAACGGC : 1304
Smart match 10 output CTGTCTCTT : 185
Smart match 10 output TCGAACGGC : 1304
Smart match 20 output CTGTCTCTT : 103
Smart match 20 output TCGAACGGC : 1300
Cutadapt output CTGTCTCTT : 169
Cutadapt output TCGAACGGC : 1304
Trimmomatic output CTGTCTCTT : 103
Trimmomatic output TCGAACGGC : 1308
The code is organized following basic object-oriented design principals, but implemented in C to minimize overhead and keep the source code accessible to scientists who don't have time to master the complexities of C++.
Structures are treated as classes, with accessor macros and mutator functions provided, so dependent applications and libraries need not access structure members directly. Since the C language cannot enforce this, it's up to application programmers to exercise self-discipline.
For detailed coding standards, see https://github.com/outpaddling/Coding-Standards/.
Fastq-trim is intended to build cleanly in any POSIX environment on any CPU architecture. Please don't hesitate to open an issue if you encounter problems on any Unix-like system.
Primary development is done on FreeBSD with clang, but the code is frequently tested on Linux, MacOS, NetBSD, and OpenIndiana as well. MS Windows is not supported, unless using a POSIX environment such as Cygwin or Windows Subsystem for Linux.
The Makefile is designed to be friendly to package managers, such as Debian packages, FreeBSD ports, MacPorts, pkgsrc, etc. End users should install using a package manager. Note that pkgsrc can be used by anyone, on virtually any POSIX operating system, with or without administrator privileges..
I maintain a FreeBSD port and a pkgsrc package, which is sufficient to install cleanly on virtually any POSIX platform. If you would like to see a package in another package manager, please consider creating a package yourself. This will be one of the easiest packages in the collection and hence a good vehicle to learn how to create packages.
For an overview of available package managers, see the Repology website.
FreeBSD is a highly underrated platform for scientific computing, with over 2,000 scientific libraries and applications in the FreeBSD ports collection (of more than 30,000 total), modern clang compiler, fully-integrated ZFS filesystem, and renowned security, performance, and reliability. FreeBSD has a somewhat well-earned reputation for being difficult to set up and manage compared to user-friendly systems like Ubuntu. However, if you're a little bit Unix-savvy, you can very quickly set up a workstation, laptop, or VM using desktop-installer. GhostBSD offers an experience very similar to Ubuntu, but is built on FreeBSD rather than Debian Linux. GhostBSD packages lag behind FreeBSD ports slightly, but this is not generally an issue and there are workarounds.
To install the binary package on FreeBSD:
pkg install fastq-trim
You can just as easily build and install from source. This is useful for FreeBSD ports with special build options, for building with non-portable optimizations such as -march=native, and for work-in-progress ports, for which binary packages are not yet maintained.
cd /usr/ports/wip/fastq-trim && env CFLAGS='-march=native -O2' make install
pkgsrc is a cross-platform package manager that works on any Unix-like platform. It is native to NetBSD and well-supported on Illumos, MacOS, RHEL/CentOS, and many other Linux distributions. Using pkgsrc does not require admin privileges. You can install a pkgsrc tree in any directory to which you have write access and easily install any of the nearly 20,000 packages in the collection.
The auto-pkgsrc-setup script will help you install pkgsrc in about 10 minutes. Just download it and run
sh auto-pkgsrc-setup
Then, assuming you selected current packages and the default prefix
source ~/Pkgsrc/pkg/etc/pkgsrc.sh # Or pkgsrc.csh for csh or tcsh
cd ~/Pkgsrc/pkgsrc/biology/fastq-trim
sbmake install clean clean-depends
See the pkgsrc documentation for more information.
Community support for pkgsrc is available through the pkgsrc-users mailing list.
If you would like to add this project to another package manager rather than use FreeBSD ports or pkgsrc, basic manual build instructions for package can be found here. Your contribution is greatly appreciated!