digest_seq

Biopiece: digest_seq

Description

digest_seq split sequences in the stream at a given restriction enzyme's cleavage sites. For each digestion product a new record is created and output to the stream. digest_seq requires two arguments:

The restriction enzyme recognition pattern
Cut position relative to the above match

E.g. the common restriction enzyme BamHI recognizes the pattern GGATCC and the cut position for this enzyme is 1 indicating that the cleavage site is after the first G:

cut_pos      1 23456
pattern      G|GATCC
              ^

Patterns can contain IUPAC ambiguity codes.

Records with digestion products have a REC_TYPE: DIGEST key/value pair added. Furthermore, the sequence coordinates are appended to the sequence name in brackets (0-based) and as S_BEG and S_END keys.

To obtain the reverse-complement products use reverse_seq.

Usage

... | digest_seq [options]

Options

[-?          | --help]               #  Print full usage description.
[-p <string> | --pattern=<string>]   #  Restriction enzyme recognition pattern.
[-c <int>    | --cut_pos=<int>]      #  Cut position relative to pattern match.
[-I <file!>  | --stream_in=<file!>]  #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

Consider the following sequence in the file test.fna:

>test
cgatcgatcGGATCCgagagggtgtgtagtgGAATTCcgctgc

We can digest this sequence with BamHI like this:

read_fasta -i test.fna | digest_seq -p GGATCC -c 1

SEQ_NAME: test[0-9]
SEQ: cgatcgatcG
SEQ_LEN: 10
S_BEG: 0
S_END: 9
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42]
SEQ: GATCCgagagggtgtgtagtgGAATTCcgctgc
SEQ_LEN: 33
S_BEG: 10
S_END: 42
REC_TYPE: DIGEST
---

To obtain the - strand digestion products we re-read the file and reverse-complment the sequence before digestion like this:

read_fasta -i test.fna | reverse_seq | complement_seq | digest_seq -p GGATCC -c 1

SEQ_NAME: test[0-28]
SEQ: gcagcgGAATTCcactacacaccctctcG
SEQ_LEN: 29
S_BEG: 0
S_END: 28
REC_TYPE: DIGEST
---
SEQ_NAME: test[29-42]
SEQ: GATCCgatcgatcg
SEQ_LEN: 14
S_BEG: 29
S_END: 42
REC_TYPE: DIGEST
---

It is also possible to do restriction enzyme digestion with multipe enzymes simply by piping the result through digest_seq multiple times. Here we first digest with BamHI (pattern: GGATCC, cut_pos: 1) and then with EcoRI (pattern: GAATTC, cut_pos: 1):

read_fasta -i test.fna | digest_seq -p GGATCC -c 1 | digest_seq -p GAATTC -c 1

SEQ_NAME: test[0-9][0-9]
SEQ: cgatcgatcG
SEQ_LEN: 10
S_BEG: 0
S_END: 9
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42][0-21]
SEQ: GATCCgagagggtgtgtagtgG
SEQ_LEN: 22
S_BEG: 0
S_END: 21
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42][22-32]
SEQ: AATTCcgctgc
SEQ_LEN: 11
S_BEG: 22
S_END: 32
REC_TYPE: DIGEST
---

Author

mail@maasha.dk

September 2010

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

digest_seq is part of the Biopieces framework.

http://www.biopieces.org

Provide feedback

Saved searches

Use saved searches to filter your results more quickly