find_orphans

Biopiece: find_orphans

Description

find_orphans detects orphans, i.e. reads whose mate is missing, among paired-end data records in the stream. Detection is based on the sequence names, which can follow either the Illumina 1.5 scheme, where names end in /1 or /2, or the Illumina 1.8 scheme, where names contain a space followed by 1 or 2 and then a colon (:). Each record is given a TYPE key whose value is orphan for orphan reads and paired for paired reads.

NB! The reads in the stream must be interleaved and sorted by SEQ_NAME. This is normally not a problem, since the sequences are already sorted when output from the sequencer.

SEQ_NAME: HWI-ST575:107:C0HE6ACXX:5:1101:1832:2218 1:N:0:TAGCTG
SEQ: GCTTTGACATAGTCGCTCCAGAATTGCCAGCTAGGGTTAGCTTGGCAACTGCAGCGACGTAATGTGCTGTGGCAGATCAATTTATCTGTTTTGAATCA
SEQ_LEN: 98
SCORES: ^P^PJ\Y`eea`e[daYdecggadgdXJIYVbdc`efg_cdedI^aXIO^abeb\eL_daQU^_V]``]UGTZ\^BBBBBBBBBBBBBBBBBBBBBBB
TYPE: paired
---
SEQ_NAME: HWI-ST575:107:C0HE6ACXX:5:1101:1832:2218 2:N:0:TAGCTG
SEQ: GGTTATCGATCTGGAAAAAGCAACTAAACCTAAAGCTAAACCACGTAGCGCCGGGTAAATGATTCAAAACAGATAAATTGATCTGCCACAGCACATTA
SEQ_LEN: 98
SCORES: ^VYPJQ`c^JJ[b[efg^dHJ`aa`adXd_ZXXbIIIY[af_H^aWHWPZ[`gggFFZ^bd_Z]Zb_]ba\^ZGY_`TZ``cc[[bbR]]]^aaXQ[bbb
TYPE: paired
---
SCORES: ffffcfffffded^eddddddbdcdeedcefecfefdffecabccBB`b`
SEQ: CCNAGGAGGAGNCAATAAGAGACCATTCGTATATGATCTCTCAGGAGAGC
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/1
TYPE: orphan
---
SCORES: BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
SEQ: NNNNNNNNGGNNCNANNANNNNGTNNNTNGNANNNNCNNANTTGNNNNNN
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/2
TYPE: orphan
---
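
The name-based detection described above can be illustrated with a minimal Python sketch. This is not the find_orphans implementation: the function names split_name and tag_orphans, the regular expressions, and the use of plain dictionaries as records are all assumptions made for illustration. It also assumes that mates are adjacent in the stream with mate 1 first, which is one reading of the interleaving requirement.

```python
import re

# Assumed patterns for the two naming schemes described in the text.
# Illumina 1.8: "<base> 1:..." or "<base> 2:..."
ILLUMINA_18 = re.compile(r"^(\S+) ([12]):")
# Illumina 1.5: "<base>/1" or "<base>/2"
ILLUMINA_15 = re.compile(r"^(.+)/([12])$")


def split_name(seq_name):
    """Return (base_name, mate_number) for a supported naming scheme."""
    for pattern in (ILLUMINA_18, ILLUMINA_15):
        match = pattern.match(seq_name)
        if match:
            return match.group(1), int(match.group(2))
    raise ValueError(f"unrecognized sequence name: {seq_name!r}")


def tag_orphans(records):
    """Yield copies of the records with TYPE set to 'paired' or 'orphan'.

    Assumes the records are interleaved, so that mates are adjacent
    and mate 1 comes directly before mate 2.
    """
    pending = None  # record still waiting for its mate
    for record in records:
        if pending is None:
            pending = record
            continue
        base1, mate1 = split_name(pending["SEQ_NAME"])
        base2, mate2 = split_name(record["SEQ_NAME"])
        if base1 == base2 and (mate1, mate2) == (1, 2):
            yield {**pending, "TYPE": "paired"}
            yield {**record, "TYPE": "paired"}
            pending = None
        else:
            # The held record has no mate; the current one may still get one.
            yield {**pending, "TYPE": "orphan"}
            pending = record
    if pending is not None:
        yield {**pending, "TYPE": "orphan"}
```

Feeding the four SEQ_NAME values from the records above through tag_orphans, in the same order, would tag the first two as paired and the last two as orphan, matching the TYPE values shown.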

Usage

... | find_orphans [options]

Options

[-?         | --help]               #  Print full usage description.
[-I <file!> | --stream_in=<file!>]  #  Read input from stream file   -  Default=STDIN
[-O <file>  | --stream_out=<file>]  #  Write output to stream file   -  Default=STDOUT
[-v         | --verbose]            #  Verbose output.

Examples

If you filter your sequences and discard one member of a pair, you can run the data through find_orphans to locate the orphans:

read_fastq -i pair1.fq -j pair2.fq |   # Read in interleaved Illumina data from two files
trim_seq |                             # Trim ends according to quality scores
grab -e "SEQ_LEN>30" |                 # Keep entries with sequences longer than 30
find_orphans |                         # Find orphans
write_fastq_files -k TYPE -x           # Sort reads into two files: paired.fastq and orphan.fastq

Author

mail@maasha.dk

September 2013

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

find_orphans is part of the Biopieces framework.

http://www.biopieces.org
