Skip to content
Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: calc_N50

Description

[calc_N50] can be used to calculate the N50 of sequences in the stream. N50 is typically used to decribe sequence assemblies.

N50 is defined as the contig length such that using equal or longer contigs produces half the bases of the genome. The N50 size is computed by sorting all contigs from largest to smallest and by determining the minimum set of contigs whose sizes total 50% of the entire genome.

Read more here:

http://en.wikipedia.org/wiki/N50_statistic

Usage

... | calc_N50 [options]

Options

[-?          | --help]               #  Print full usage description.
[-x          | --no_stream]          #  Do not emit records.
[-o <file>   | --data_out=<file>]    #  Write result to file.
[-I <file!>  | --stream_in=<file!>]  #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

Consider the following FASTA entries in the file test.fna:

>test1
ATGCACATTG
>test2
ATGCACATTGATGCACATTG
>test3
ATGCACATTGATGCACATTGATGCACATTG
>test4
ATGCACATTGATGCACATTGATGCACATTGATGCACATTG
>test5
ATGCACATTGATGCACATTGATGCACATTGATGCACATTGATGCACATTG

To find the N50 read in the sequences with read_fasta:

read_fasta -i test.fna | calc_N50

SEQ_NAME: test1
SEQ: ATGCACATTG
SEQ_LEN: 10
---
SEQ_NAME: test2
SEQ: ATGCACATTGATGCACATTG
SEQ_LEN: 20
---
SEQ_NAME: test3
SEQ: ATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 30
---
SEQ_NAME: test4
SEQ: ATGCACATTGATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 40
---
SEQ_NAME: test5
SEQ: ATGCACATTGATGCACATTGATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 50
---
N50: 40
---

To obtain only the N50 use the -x switch:

read_fasta -i test.fna | calc_N50 -x

N50: 40
---

And to output the N50 to a file use the -o switch:

read_fasta -i test.fna | calc_N50 -o N50.txt -x

See also

mean_vals

median_vals

analyze_vals

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

mail@maasha.dk

January 2011

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

[calc_N50] is part of the Biopieces framework.

http://www.biopieces.org

Clone this wiki locally