These files belong to this publication:
Nederbragt, A.J., Rounge, T.B., Kausrud, K. and Jakobsen, K.S. 2010: Identification and quantification of genomic repeats and sample contamination in assemblies of 454 pyrosequencing reads, Sequencing. http://www.hindawi.com/archive/2010/782465/. doi:10.1155/2010/782465
Please contact Kyrre Kausrud (kyrreka@ibv.uio.no) regarding the script and Lex Nederbragt (lex.nederbragt@ibv.uio.no) regarding the publication.
This program estimates the number of copies of each contig from the observed distribution of read depths for a sequenced genome. Running the program requires R, “a free software environment for statistical computing and graphics.” R is available at http://www.r-project.org/
Usage instructions are given as comment lines (starting with the ‘#’ symbol) at the beginning of the script. The input file for the script is the 454AlignmentInfo.tsv file generated by the Newbler assembly program (gsAssembler).
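As a rough illustration of the general idea of inferring copy number from read depth, the Python sketch below divides each contig's mean depth by the median depth over all contigs. This is a naive ratio, not the method implemented in the R script, which works from the observed depth distributions. The function name `naive_depth_ratios` and the depth values are invented for the example, and the sketch assumes that most contigs are single-copy.

```python
from statistics import median


def naive_depth_ratios(mean_depth):
    """Return each contig's mean depth divided by the median over all contigs.

    mean_depth maps contig name -> mean read depth. Assumption: most contigs
    are single-copy, so the median approximates the single-copy depth.
    """
    if not mean_depth:
        return {}
    baseline = median(mean_depth.values())
    if baseline == 0:
        raise ValueError("median depth is zero; cannot compute ratios")
    return {name: depth / baseline for name, depth in mean_depth.items()}


# Invented example values, not taken from the published assemblies.
example_depths = {
    "contig00001": 20.0,
    "contig00002": 21.0,
    "contig00003": 19.0,
    "contig00004": 41.0,  # roughly twice the typical depth
    "contig00005": 20.0,
}
ratios = naive_depth_ratios(example_depths)
```

A ratio near 2 hints at a two-copy repeat, while a ratio well below 1 may point to contamination or a poorly covered region. The ratio alone does not distinguish these cases.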
### Assemblies
For the assemblies of E. coli and P. gingivalis described in the paper, two files each are made available. The descriptions of these files are partly based on the GS FLX Data Analysis Software Manual, December 2007.
454AlignmentInfo.tsv
[This file is the input for the R script that estimates the genomic copy number of each contig, see above]
This file contains position-by-position summary information about the consensus sequence for the contigs generated by the GS De Novo Assembler application, listed one nucleotide per line (in a tab-delimited format). The columns of each line contain the following information:
- Position – the position in the contig
- Consensus – the consensus nucleotide for that position in the contig
- Quality Score – the quality score of the consensus base
- Depth – the number of reads that align at that position in the alignment
- Signal – the average signal of the read flowgrams, for the flows that correspond to that position in the alignment
- StdDeviation – the standard deviation of the read flowgram signals at the corresponding flows
The lines for each contig are preceded by a header line that begins with > and displays the contig name.
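The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the published script. It assumes the column order listed above, with Depth as the fourth tab-separated field, and it assumes the contig name is the first token after the > on a header line. The function name `read_depths` and the sample lines are invented for illustration. If a file has a different column layout, `depth_col` needs to be adjusted.

```python
from collections import defaultdict


def read_depths(lines, depth_col=3):
    """Collect the per-position read depths of each contig.

    depth_col is the zero-based index of the Depth column. The default of 3
    follows the column order listed above and is an assumption.
    """
    depths = defaultdict(list)
    contig = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first token after '>' is taken as the name.
            tokens = line[1:].split()
            contig = tokens[0] if tokens else None
            continue
        fields = line.split("\t")
        if contig is None or len(fields) <= depth_col:
            continue  # line before the first contig, or too few columns
        try:
            depths[contig].append(int(fields[depth_col]))
        except ValueError:
            continue  # non-numeric depth field, e.g. a column-title line
    return dict(depths)


# Invented sample lines in the layout described above.
sample = [
    ">contig00001\t1",
    "1\tA\t64\t12\t1.02\t0.05",
    "2\tC\t64\t14\t0.98\t0.07",
    ">contig00002\t1",
    "1\tG\t60\t30\t1.95\t0.10",
]
depths = read_depths(sample)
mean_depth = {name: sum(d) / len(d) for name, d in depths.items()}
```

For a real file, the lines can be supplied by passing an open file handle, for example `with open("454AlignmentInfo.tsv") as fh: depths = read_depths(fh)`.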
454LargeContigs.fna
FASTA-formatted contigs of at least 500 bp generated by the GS De Novo Assembler application. The description lines are formatted as follows:
>contigXXXXX length=abc numReads=xyz
where contigXXXXX is the identifier of the contig and XXXXX is a sequential numbering of the contigs in the assembly; the length and numReads values are the length of the contig in bases and the number of reads that were used in that contig’s multiple alignment.
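A description line in this format can be split into its parts with a regular expression. The Python sketch below is one possible way to do that, assuming the fields appear in exactly the order shown and are separated by whitespace. The names `HEADER_RE` and `parse_header` and the example header are made up for illustration.

```python
import re

# Matches description lines such as ">contig00001 length=1523 numReads=87".
HEADER_RE = re.compile(r"^>(contig\d+)\s+length=(\d+)\s+numReads=(\d+)")


def parse_header(line):
    """Return the contig name, length and read count, or None if no match."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return {
        "contig": match.group(1),
        "length": int(match.group(2)),
        "num_reads": int(match.group(3)),
    }


# Invented example header, not taken from the published assemblies.
example = parse_header(">contig00001 length=1523 numReads=87")
```

Sequence lines and other non-header lines return None, so the function can be applied to every line of the file and the None results filtered out.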