Karp

Karp is a program for identifying the relative frequencies of reference sequences contributing to a pooled DNA sample. Karp was developed with 16S rRNA sequencing as its primary application, but will work in any context where reference sequences are available in fasta format and the pooled DNA is in fastq format with base quality scores. Karp employs a pseudoalignment step to quickly match reads with potential references. Reads are locally aligned to the references they pseudoalign with, and Karp calculates a likelihood of each read originating from each reference using base quality scores. Finally, Karp employs an EM algorithm that uses information from all the reads to accurately estimate the relative frequencies of each reference in the sample.

More details are provided in the manuscript: Using pseudoalignment and base quality to accurately quantify microbial community composition

Installation

After cloning the repository, or downloading and decompressing the tarball follow the following steps to install Karp. In the Karp directory

  mkdir build
  cd build
  cmake ..
  make

This should create a working karp executable in the folder /src If you get an error message during compilation see the bottom of this readme for trouble shooting tips.

Requirements

Karp requires CMake v2.8+, ZLIB, HDF5, Eigen and TCLAP. The Eigen and TCLAP libraries are included in this repository and the downloadable .tar package.

Usage

Command	Description
-c / --command	Which of Karp's functions should run, options are 'index' 'quantify' 'tabulate'
-h / --help	Print out Karp options and their descriptions

Indexing

The first stage of analysis with Karp is building the k-mer index for pseudoalignment. Indexing is performed if the option 'index' is specified as the -c value.

Command	Description
-r / --ref	Reference fasta file to build index from, must have matching .fai file created with samtools faidx. Enter multiple files with comma delimiter.
-i / --index	Name of index file to output
-k / --kmer	Length of k-mer to use when building index, must be odd and range between 3 and 31 [default = 31].

Quantifying

The main function of Karp is to quantify the taxonomy in a pooled DNA sample, that is done after the index has been constructed using the option 'quantify' as the -c value.

Command	Description
-r / --ref	Reference fasta files, must have matching .fai file created with samtools faidx. Multiple files entered with comma delimiter.
-i / --index	Name of k-mer index file built with 'index'
-f / --forward	Fastq files to be quantified, can be gzipped. Enter multiple files with comma separating them. If quantifying single-end reads, enter fastq files with this command. If quantifying paired-end reads enter forward files with this command. Note: Karp requires fastq files with base quality scores, it does not work with fasta files
--paired	Enter this flag if you are quantifying paired-end reads
-q / --reverse	Reverse oriented fastq files to be quantified. Should match order of files entered with --forward. Can be gzipped.
-t / --tax	Taxonomy file(s) matching reference fasta file(s). See note below for more details about acceptable formats.
-o / --out	Base name for output files. Karp will output a '.freqs' file containing results and '.log' file [default = 'karp'].
--threads	Number of threads to use [default = 1].
--phred	Version of phred quality scores being used. Default is Phred+33, used by Illumina 1.8+. Other valid option is '64' which corresponds to Phred+64 quality coding [default = 33].
--min_freq	Lower frequency bound during EM update step. Taxa with frequencies below this threshold are removed from the analysis [default = 1/Number of reads].
--like_thresh	The maximum likelihood z-score threshold. Karp uses distribution of quality scores in input fastq file to calculate a "null" distribution for likelihood values when each read is matched to the correct reference and all mismatches are the result of sequencing error. Using this distribution Karp calculates a z-score for the maximum likelihood observed for each read, and if this value falls to far outside the theoretical distribution the read is filtered from analysis [default = -2.0].
--collapse	Run Karp in Collapse mode. In collapse mode classification is done at the level of taxonomic labels rather than individual reference sequences.

Advanced Options for Quantifying

Command	Description
--max_like_out	Output a file containing the maximum likelihood value for each read, use this to diagnose appropriate max likelihood filter threshold.
--strict	Use the strict intersection of equivalence classes to declare psuedoalignment of read to reference (this is the definition of pseudoalignment used by Kallisto). Default Karp behavior is that if no strict intersections exists, the reference(s) with the greatest number of matching k-mers are declared as pseudoaligning to a query read.
--em_tolerance	When the squared sum of allele frequency changes between EM algorithm iterations falls below this threshold the algorithm has converged [default = 1e-12].
--no_harp_filter	Turn off the likelihood z-score filter.
--fail	Output a file listing the reads that fail to pseudoalign and those that are excluded by the maximum likelihood z-score filter.
--max_em_it	Maximum number of iterations of EM algorithm before declaring failure to converge [default = 1000].
--readinfo	Adds additional output file, that contains the posterior probabilities assigned to each individual read after the EM algorithm converges. The format is one read per line, followed by pairs of (percentage,reference ID from fasta).

Tabulate

After multiple samples have been classified using Karp, the program can be used to calculate some basic summary statistics using the 'tabulate' command.

Command	Description
--samples	File with list of Karp output files from each sample that has been classified. File should have two columns, the first with a sample ID and the second with a path to the corresponding Karp output file

Best Practices

If using Karp with multiple threads, the internal local aligner runs more quickly if the reference database fasta file is broken into multiple smaller files. A single kmer index can still be used, as long as the references in the multiple fasta files are all present in the single index.

The default likelihood threshold that karp uses to filter reads for being unlikely to originate from the reference database can be too lenient or strict with some distributions of read quality scores. If it seems that karp is filtering too many (or too few) reads, it can be helpful to use the ** --max_likes_out** option, and plot the maximum likelihood for each read to get a sense of how to adjust the ** --like_thresh** option.

Data from a single 16S hypervariable region often have little power to distinguish between closely related reference sequences. When this is the case Karp will partition the probability weight across the closely related species, and they will appear at low frequency. In such situations the default minimum EM update frequency can be too strict, and setting it to a lower value based on the depth of the samples being classified can improve results. The default value is 1/Number of Reads, with a single hypervariable region 1/(10*Number of Reads) or 1/(20*Number of Reads) can sometimes give more accurate results.

Test Example

In the repo there is a directory labeled example, this contains some toy files for demonstrating how to use Karp.

File	Description
reference.fasta	Reference database in fasta format, contains 500 sequences
reference.fasta.fai	Tabix index of reference fasta file
reference.tax	Taxonomy file with description of sequences in reference.fasta
simulated.data.fastq.gz	Fastq file containing 10,000 reads with mixture of organisms from reference.fasta
simulated.data.actual_counts.txt	True origins of reads in simulated.data.fastq.gz

The first step of running Karp is to create a k-mer index for pseudoalignment:

./karp -c index -r reference.fasta -i reference.index

This will create the index file reference.index. Next we quantify the simulated reads:

./karp -c quantify -r reference.fasta -i reference.index -f simulated.data.fastq.gz -o simulated.results -t reference.tax

In the example folder we now have the files simulated.results.freqs and simulated.results.log. simulated.results.freqs contains Karp's estimates of the number of reads each reference sequence in reference.fasta contributed to the simulated sample. simulated.results.log contains information about the Karp run we just performed.

Taxonomy File Format

Taxonomy files should have two columns and be tab delimited, with the first column giving the ID for the haplotype matching the reference fasta file, and the second giving the taxonomic label. Examples of acceptable formats include:

 815395	 Bacteria;Firmicutes;Bacilli;Bacillales;
 815395	 k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;
 815395	 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__; g__; s__

Trouble Shooting

Error message: unrecognized command line option "-std=c++0x"

or

Error message: error: call of overloaded ‘to_string(size_t&)’ is ambiguous

Likely your default C++ compiler is out of date, check for newer compiler availability. Once you have found/installed an up to date compiler, delete and recreate the build directory and rerun the cmake step with command

cmake -DCMAKE_CXX_COMPILER=/path/to/compiler -DCMAKE_C_COMPILER=/path/to/compiler ..

Error message: Could NOT find HDF5 (missing: HDF5_LIBRARIES HDF5_INCLUDE_DIRS HDF5_HL_LIBRARIES)

Check if you have HDF5 libraries installed on your machine. If you do not they can be downloaded and installed https://support.hdfgroup.org/HDF5/

If you have HDF5 installed, and are still getting this message, make sure your $HDF5_DIR variable is pointing to the correct directory. If it is, try adding -DCMAKE_PREFIX_PATH=/path/to/HDF5 to the cmake command.

There is also a known issue with some versions of CMAKE and the HDF5 library. If you are running CMAKE versions 3.6.0 or 3.6.1 see if you have access to an earlier or later version of CMAKE and try using it instead.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Karp

Installation

Requirements

Usage

Indexing

Quantifying

Advanced Options for Quantifying

Tabulate

Best Practices

Test Example

Taxonomy File Format

Trouble Shooting

Files

README.md

Latest commit

History

README.md

File metadata and controls

Karp

Installation

Requirements

Usage

Indexing

Quantifying

Advanced Options for Quantifying

Tabulate

Best Practices

Test Example

Taxonomy File Format

Trouble Shooting