This repository has been archived by the owner on Jun 13, 2024. It is now read-only.

mooreryan/derep_seqs


Derep Seqs

Dereplicate looooooong sequences!

If you want to get rid of duplicate long sequences (i.e. contigs that are exact substrings of some other contigs), derep_seqs is the tool for you!
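To make "exact substring" concrete, here is a minimal Python sketch of the idea, not the tool's actual implementation. The function name dereplicate is hypothetical, and the sketch only checks plain string containment. It visits the longest sequences first and keeps a sequence only if no already-kept sequence contains it.

```python
def dereplicate(seqs):
    """Return the sequences that are not exact substrings of a kept sequence.

    Illustrative only: quadratic in the number of sequences, and it sorts
    internally (longest first) purely for convenience.
    """
    kept = []
    # A sequence can only be contained in one that is at least as long,
    # so visiting the longest first means every possible container is
    # already in `kept` when a shorter sequence is examined.
    for seq in sorted(seqs, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


contigs = ["ACGTACGT", "GTAC", "TTTT", "ACGTACGT", "CGTA"]
print(dereplicate(contigs))  # ['ACGTACGT', 'TTTT']
```

Note that an exact duplicate counts as a substring here, so only one copy of ACGTACGT survives.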

Install

Download the source code (either with git clone or by downloading a release), cd into the source directory, and then use make to build it.

git clone https://github.com/mooreryan/derep_seqs.git
cd derep_seqs
make

This builds derep_seqs and places it in the bin directory inside the source directory, alongside sort_fasta. You can then move derep_seqs and sort_fasta to somewhere on your path if you'd like.

Usage

derep_seqs <num worker threads> <seqs.fasta> > seqs.derep.fa

Example

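The FASTA file must be sorted by increasing sequence length. The program sort_fasta (included in the bin directory) will do this for you. The command below uses process substitution, <(...), so it needs a shell that supports it, such as bash or zsh.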

$ bin/derep_seqs 10 <(bin/sort_fasta contigs.fasta) > contigs.derep.fa

That's it!
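If you are curious what sorting a FASTA file by length involves, the following Python sketch shows one way to do it. It is an illustration under assumptions, not the code behind sort_fasta: the helper names read_fasta and sort_by_length are hypothetical, and the real program's handling of headers and line wrapping may differ.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records):
    """Return records ordered by increasing sequence length."""
    return sorted(records, key=lambda rec: len(rec[1]))


example = """>long
ACGTAC
GTACGT
>short
ACG
>medium
ACGTACG
"""

for header, seq in sort_by_length(read_fasta(example.splitlines())):
    print(f">{header}\n{seq}")
```

The sort is stable, so records of equal length keep their original relative order.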

Error codes

  • 0: Success
  • 1: Argument error
  • 2: Couldn't open a file
  • 3: Error creating thread
  • 4: Error joining thread
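When calling derep_seqs from a script, the codes above can be turned into readable messages. The sketch below assumes the binary lives at bin/derep_seqs and that the input is already sorted. The names EXIT_CODES, describe_exit, and run_derep are made up for this example.

```python
import subprocess

# Exit codes as documented in this README.
EXIT_CODES = {
    0: "Success",
    1: "Argument error",
    2: "Couldn't open a file",
    3: "Error creating thread",
    4: "Error joining thread",
}


def describe_exit(code):
    """Map a derep_seqs exit code to its documented meaning."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


def run_derep(threads, sorted_fasta, out_path, exe="bin/derep_seqs"):
    """Run derep_seqs, writing its stdout to out_path.

    Assumes `exe` points at the built binary and `sorted_fasta` is
    already sorted by sequence length. Returns (exit code, message).
    """
    with open(out_path, "w") as out:
        result = subprocess.run([exe, str(threads), sorted_fasta], stdout=out)
    return result.returncode, describe_exit(result.returncode)
```

For example, run_derep(10, "contigs.sorted.fa", "contigs.derep.fa") would return (0, "Success") on a clean run; both file names here are placeholders.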

Versions

  • v0.1.0: First release
  • v0.2.0: Sort on decreasing seq length. Use greedy algorithm. Prefilter. Use hash3 instead of SSEF.
  • v0.3.0: Use hashing for prefiltering.
  • v0.4.0: Don't store hash vals...uses way less memory :) but it's slow again :(
  • v0.5.0: Use pthreads for multithreading!
  • v0.6.0: Make prefilter length a tunable option
  • v0.7.0: Use Rabin-Karp search for filtering
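For readers unfamiliar with Rabin-Karp search (mentioned in v0.7.0), here is a small, generic Python sketch of the technique. It is a textbook version with an arbitrarily chosen base and modulus, and it makes no claim about how derep_seqs implements its filtering. The function name rabin_karp_find is hypothetical.

```python
def rabin_karp_find(pattern, text, base=256, mod=1_000_000_007):
    """Return the first index of pattern in text, or -1 if absent."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Weight of the leftmost character in a window of length m.
    high = pow(base, m - 1, mod)

    # Hash the pattern and the first window of the text.
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod

    for start in range(n - m + 1):
        # Equal hashes can be a collision, so confirm with a real comparison.
        if p_hash == t_hash and text[start:start + m] == pattern:
            return start
        if start < n - m:
            # Roll the window: drop text[start], append text[start + m].
            t_hash = ((t_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return -1


print(rabin_karp_find("GTAC", "ACGTACGT"))  # 2
print(rabin_karp_find("TTTT", "ACGTACGT"))  # -1
```

The rolling hash lets each window be checked in constant time on average, which is what makes the approach attractive for scanning very long sequences.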