Automatic quality control for FASTQ sequencing files

autoadapt

Overview

As of November 2013, the NCBI Sequence Read Archive contains over three million gigabytes of publicly available DNA and RNA sequencing files.

However, each file may be contaminated by any of a wide variety of sequencing adaptors and primers, and these sequences normally need to be removed before further analysis.

We developed a tool that automatically detects which adaptors and primers are present in a FASTQ file and removes those sequences from the file, as well as detecting the quality score encoding type used.

We currently make heavy use of FastQC and cutadapt, both of which are included in the tools folder.

License

# autoadapt - Automatic quality control for FASTQ sequencing files

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Install

autoadapt needs special versions of FastQC and cutadapt. They are installed locally (inside the autoadapt/tools folder) by typing:

make install

Usage

autoadapt 0.2

Usage: ./autoadapt.pl [ <options> ] { <unpaired-in> <unpaired-out> | <paired-in-1> <paired-out-1> <paired-in-2> <paired-out-2> }

Options:
    --threads=N               number of threads to use (default: 1)
    --quality-cutoff=N        quality cutoff for BWA trimming algorithm (default: 20)
    --minimum-length=N        minimum length of sequences (default: 18)

Technical details

First we run FastQC to determine the quality score encoding type (e.g. phred33, phred64) and to look for any over-represented sequences that match against known adaptors and primers in the FastQC contaminants_list.txt file.
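The two checks described above can be illustrated with a minimal Python sketch. This is not FastQC's code or the logic in autoadapt.pl. The function names are hypothetical, the encoding guess is a simplified rule based only on the lowest quality character, and the contaminant match is a plain substring test, whereas FastQC's own matching is more tolerant. The contaminant list is assumed to hold one tab-separated name and sequence per line, with `#` marking comments.

```python
def guess_quality_encoding(quality_lines):
    """Guess the Phred offset from the lowest quality character seen.

    Simplified assumption: any character below ASCII 64 rules out phred64.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 33:
        raise ValueError("quality character below ASCII 33")
    return "phred33" if lowest < 64 else "phred64"


def parse_contaminants(text):
    """Parse 'name<TAB>sequence' lines, skipping blank lines and '#' comments."""
    contaminants = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last tab so names padded with several tabs still parse.
        name, _, sequence = line.rpartition("\t")
        if name.strip():
            contaminants[name.strip()] = sequence.strip().upper()
    return contaminants


def find_contaminants(overrepresented, contaminants):
    """Return names of contaminants that contain, or are contained in, a sequence."""
    hits = set()
    for read_seq in overrepresented:
        read_seq = read_seq.upper()
        for name, contaminant_seq in contaminants.items():
            if contaminant_seq in read_seq or read_seq in contaminant_seq:
                hits.add(name)
    return sorted(hits)


if __name__ == "__main__":
    # Made-up data for illustration only.
    listing = "# name<TAB>sequence\nExample Adapter\tAGATCGGAAGAGC\n"
    known = parse_contaminants(listing)
    print(guess_quality_encoding(["IIIIHHGG#", "IIIIIIIII"]))
    print(find_contaminants(["TTTTAGATCGGAAGAGCAAAA"], known))
```

A lowest-character rule can misclassify a phred33 file that happens to contain only very high quality scores, so a production check would normally inspect many reads.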

Then, the sequences for any detected contaminants (primers, adaptors, etc.) are removed using cutadapt. cutadapt also trims low-quality bases according to the quality cutoff and discards sequences that are shorter than the minimum length.
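To make the quality cutoff and minimum length options more concrete, here is a rough Python sketch of BWA-style 3' quality trimming followed by a length filter. It illustrates the general technique and is not cutadapt's source code. The function names are hypothetical, and the defaults simply mirror the option defaults listed under Usage (cutoff 20, minimum length 18), with a phred33 offset assumed.

```python
def quality_trim_index(qualities, cutoff=20, offset=33):
    """Return the position at which to cut the read's 3' end.

    Walking backwards from the 3' end, accumulate (cutoff - quality) and
    remember the position where the running sum is largest. Stop as soon
    as the sum turns negative.
    """
    running = 0
    best = 0
    cut = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += cutoff - (ord(qualities[i]) - offset)
        if running < 0:
            break
        if running > best:
            best = running
            cut = i
    return cut


def trim_read(sequence, qualities, cutoff=20, minimum_length=18, offset=33):
    """Trim the low-quality 3' end; return None if the read becomes too short."""
    cut = quality_trim_index(qualities, cutoff, offset)
    if cut < minimum_length:
        return None
    return sequence[:cut], qualities[:cut]


if __name__ == "__main__":
    # Made-up read: 20 high-quality bases followed by 2 low-quality bases.
    print(trim_read("A" * 22, "I" * 20 + "##"))
```

Because the sum is taken over a run of bases, a single good base inside a poor-quality tail does not stop the trimming.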

To speed up the trimming process, cutadapt can also be run in parallel on small chunks of the original FASTQ file. The file splitting and merging is handled by our script. When choosing the number of threads, consider how many CPUs are available and how fast your hard drive can read and write data.
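The split, process and merge pattern can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not the logic in autoadapt.pl: the real script works on files and runs cutadapt on each chunk, whereas here `process_chunk` is a hypothetical placeholder callable and `chunk_size` is an invented parameter. The one property the sketch aims to show is that chunks are merged back in their original order.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def read_records(lines):
    """Group an iterable of FASTQ lines into 4-line records."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def chunked(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def process_in_parallel(lines, process_chunk, threads=1, chunk_size=1000):
    """Apply process_chunk to each chunk of records and merge results in order."""
    chunks = chunked(read_records(lines), chunk_size)
    merged = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in input order, whichever chunk finishes first.
        for result in pool.map(process_chunk, chunks):
            merged.extend(result)
    return merged
```

Threads are used here on the assumption that each worker would mostly wait on an external process and on disk I/O. That is also why the number of useful threads depends on disk speed as well as on the CPU count.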

The exact details of how we run FastQC and cutadapt are printed to the console during execution. For further explanation of what each FastQC or cutadapt program argument means, please see their respective documentation:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

http://code.google.com/p/cutadapt/