Dedup is a tool for deduplicating paired-end and mate-pair libraries. Currently there are no good tools for performing this task in a memory- and time-efficient manner. FastUniq is the most recent program of this kind, but it uses a sorting-based method with pre-allocated arrays, which can cause problems for extremely large datasets; no effort has been made to handle datasets larger than available memory. We propose a memory-efficient, iterative approach to this problem. Note: this is an active work in progress.
For each read pair, compute a hash of the concatenated reads. If the hash has been seen before, keep the copy with the highest quality score; otherwise, record the read pair. Different data structures and algorithms are used depending on the size of the input. For instance, a dataset approaching 2 billion reads would require over 1 TB of memory to hold all reads and detect duplicates directly. By partitioning the dataset and identifying problematic reads early (essentially, any read seen as a duplicate is kept in memory and all others are written out), we can efficiently traverse a large dataset. After each partition, the remaining reads are shuffled and the process is run again. A sketch of the core per-partition step is shown below.
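To make the core step concrete, here is a minimal Python sketch of the per-partition hash-and-keep-best pass described above. It is illustrative only: the function names (read_fastq_pairs, dedup_partition), the SHA-1 hash, and the use of mean Phred quality as the tie-breaking score are assumptions, not the actual implementation.

    # Illustrative sketch, not the actual Dedup implementation.
    import hashlib

    def read_fastq_pairs(left_path, right_path):
        """Yield (left_record, right_record) for each read pair;
        a record is the 4 FASTQ lines [header, seq, plus, qual]."""
        with open(left_path) as left, open(right_path) as right:
            while True:
                l_rec = [left.readline().rstrip() for _ in range(4)]
                r_rec = [right.readline().rstrip() for _ in range(4)]
                if not l_rec[0] or not r_rec[0]:
                    break
                yield l_rec, r_rec

    def mean_quality(qual, offset=33):
        """Mean Phred quality of a quality string (assumed Phred+33)."""
        return sum(ord(c) - offset for c in qual) / len(qual)

    def dedup_partition(left_path, right_path):
        """Hash each pair's concatenated sequence; for duplicates,
        keep only the copy with the highest mean quality."""
        best = {}  # hash -> (quality, left_record, right_record)
        total = 0
        for l_rec, r_rec in read_fastq_pairs(left_path, right_path):
            total += 1
            key = hashlib.sha1((l_rec[1] + r_rec[1]).encode()).digest()
            qual = mean_quality(l_rec[3] + r_rec[3])
            seen = best.get(key)
            if seen is None or qual > seen[0]:
                best[key] = (qual, l_rec, r_rec)
        collapsed = total - len(best)
        return best.values(), total, collapsed

In the full iterative scheme described above, a pass like this would be applied to each partition, with suspected duplicates carried forward and the remainder written out before the next shuffle.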
make
./dedup leftreads.1.fastq rightreads.2.fastq
Output files: dedup.leftreads.1.fastq and dedup.rightreads.2.fastq. A set of deduplication statistics is written to standard out: the total number of reads seen and the total number of collapsed reads.