These files belong to this publication: Nederbragt et al. (2010): Identification and quantification of genomic repeats and sample contamination in assemblies of 454 pyrosequencing reads, *Sequencing*. doi:10.1155/2010/782465
E_coli_assembly_data.zip
P_gingivalis_assembly_data.zip
Readme.md
RepSeq_1.0.R


These files belong to this publication:

Nederbragt, A.J., Rounge, T.B., Kausrud, K. and Jakobsen, K.S. 2010: Identification and quantification of genomic repeats and sample contamination in assemblies of 454 pyrosequencing reads, Sequencing. http://www.hindawi.com/archive/2010/782465/. doi:10.1155/2010/782465

Please contact Kyrre Kausrud (kyrreka@ibv.uio.no) regarding the script and Lex Nederbragt (lex.nederbragt@ibv.uio.no) regarding the publication.

### R script

This program estimates the number of copies of each contig from the observed distributions of read depths for a sequenced genome. Running it requires R, “a free software environment for statistical computing and graphics”, available at http://www.r-project.org/. Usage instructions are given in commented lines (starting with the ‘#’ symbol) at the beginning of the script. The input file for the script is the 454AlignmentInfo.tsv file generated by the newbler assembly program (gsAssembler).
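
To illustrate why read depth carries information about copy number, here is a minimal Python sketch. It is not the method implemented in RepSeq_1.0.R, which works from the observed read depth distributions. It only shows the underlying idea that a contig present in k copies should collect roughly k times the single-copy depth. The function name, the input dictionary and the use of the median as the single-copy baseline are assumptions made for this example.

```python
from statistics import median


def naive_copy_numbers(mean_depths):
    """Rough copy number per contig: mean depth divided by the median depth.

    mean_depths: dict of contig name -> mean read depth (hypothetical input).
    Assumption: most contigs are single-copy, so the median of the contig
    depths approximates the single-copy depth. This is a simplification and
    NOT the model used by RepSeq_1.0.R.
    """
    baseline = median(mean_depths.values())
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    # A result of 0 means the contig depth is well below the baseline.
    return {name: round(depth / baseline) for name, depth in mean_depths.items()}


# Made-up depths, for illustration only.
depths = {
    "contig00001": 20.3,
    "contig00002": 19.1,
    "contig00003": 41.0,
    "contig00004": 21.2,
    "contig00005": 139.5,
}
print(naive_copy_numbers(depths))
```

With these made-up values the baseline is 21.2, so contig00003 comes out at about two copies and contig00005 at about seven.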

### Assemblies

For the assemblies of E. coli and P. gingivalis described in the paper, two files are provided for each assembly. The descriptions of these files are partly based on the GS FLX Data Analysis Software Manual, December 2007.

454AlignmentInfo.tsv
[This file is the input for the R script that estimates the genomic copy number of each contig, see above]
This file contains position-by-position summary information about the consensus sequence for the contigs generated by the GS De Novo Assembler application, listed one nucleotide per line (in a tab-delimited format). The columns of each line contain the following information:

  • Position – the position in the contig
  • Consensus – the consensus nucleotide for that position in the contig
  • Quality Score – the quality score of the consensus base
  • Depth – the number of reads that align at that position in the alignment
  • Signal – the average signal of the read flowgrams, for the flows that correspond to that position in the alignment
  • StdDeviation – the standard deviation of the read flowgram signals at the corresponding flows

The block of lines for each contig is preceded by a header line that begins with a > and gives the contig name.
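
For readers who want to inspect the file outside R, the layout described above can be read with a short Python sketch like the following. It assumes the column order listed above (Depth as the fourth tab-separated field) and takes the first token after > as the contig name. The function name is hypothetical, and any line whose Depth field is not an integer (for example a row of column titles, if the file has one) is skipped.

```python
import io


def mean_depth_per_contig(lines):
    """Return {contig name: mean read depth} from 454AlignmentInfo.tsv-style lines."""
    totals = {}      # contig name -> [sum of depths, number of positions]
    current = None   # name of the contig whose lines are being read
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith(">"):
            # Header line: the first whitespace-delimited token is the contig name.
            parts = line[1:].split()
            current = parts[0] if parts else None
            if current is not None:
                totals.setdefault(current, [0, 0])
            continue
        fields = line.split("\t")
        # Assumed columns: Position, Consensus, Quality Score, Depth, Signal, StdDeviation.
        if current is None or len(fields) < 4 or not fields[3].isdigit():
            continue  # skip title rows or malformed lines
        totals[current][0] += int(fields[3])
        totals[current][1] += 1
    return {name: s / n for name, (s, n) in totals.items() if n > 0}


# Tiny made-up example in the described layout.
sample = (
    ">contig00001\n"
    "1\tA\t64\t12\t1.02\t0.05\n"
    "2\tC\t64\t14\t0.98\t0.04\n"
    ">contig00002\n"
    "1\tG\t60\t30\t1.00\t0.06\n"
)
print(mean_depth_per_contig(io.StringIO(sample)))
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `with open("454AlignmentInfo.tsv") as fh: mean_depth_per_contig(fh)`.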

454LargeContigs.fna
Fasta formatted contigs of at least 500 bp generated by the GS De Novo Assembler application. The description lines are formatted as follows:

>contigXXXXX length=abc numReads=xyz

where contigXXXXX is the identifier of the contig and XXXXX is a sequential numbering of the contigs in the assembly; and where the length and numReads values are the length in bases of the contig and the number of reads that were used in that contig’s multiple alignment.
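
The description line format above can be pulled apart with a regular expression. The Python sketch below assumes the three fields appear in exactly the order shown and are separated by whitespace. The function and key names are made up for this example.

```python
import re

# Matches e.g. ">contig00012 length=1534 numReads=212"
DESCRIPTION_RE = re.compile(r"^>(contig\d+)\s+length=(\d+)\s+numReads=(\d+)")


def parse_description(line):
    """Split a 454LargeContigs.fna description line into its three fields."""
    match = DESCRIPTION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected description line: {line!r}")
    return {
        "contig": match.group(1),
        "length": int(match.group(2)),
        "num_reads": int(match.group(3)),
    }


# Made-up description line, for illustration only.
info = parse_description(">contig00012 length=1534 numReads=212")
print(info["contig"], info["length"], info["num_reads"])
```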
