
These files belong to this publication: Nederbragt et al. (2010): Identification and quantification of genomic repeats and sample contamination in assemblies of 454 pyrosequencing reads, *Sequencing*. doi:10.1155/2010/782465


lexnederbragt/RepSeq


These files belong to this publication:

Nederbragt, A.J., Rounge, T.B., Kausrud, K. and Jakobsen, K.S. 2010: Identification and quantification of genomic repeats and sample contamination in assemblies of 454 pyrosequencing reads, Sequencing. http://www.hindawi.com/archive/2010/782465/. doi:10.1155/2010/782465

Please contact Kyrre Kausrud (kyrreka@ibv.uio.no) regarding the script and Lex Nederbragt (lex.nederbragt@ibv.uio.no) regarding the publication.

### R script

This program estimates the number of genomic copies of each contig from the observed distribution of read depths for a sequenced genome. To run the program you need R, “a free software environment for statistical computing and graphics”, available at http://www.r-project.org/. Usage instructions are given in commented lines (starting with the ‘#’ symbol) at the beginning of the script. The input file for the script is the 454AlignmentInfo.tsv file generated by the newbler assembly program (gsAssembler).
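To give a feel for how read depth relates to copy number, here is a minimal Python sketch. It is not the method implemented in the R script: it swaps in a simple ratio heuristic, assuming that most contigs are single-copy so that the median of the per-contig mean depths approximates the single-copy depth. The function name `naive_copy_numbers` and the example depths are hypothetical.

```python
from statistics import median


def naive_copy_numbers(mean_depths):
    """Rough copy-number estimate per contig (illustrative heuristic only).

    mean_depths: dict mapping contig name -> mean read depth.
    Assumption: the median depth across contigs is the single-copy depth.
    """
    baseline = median(mean_depths.values())
    # Depth relative to the baseline, rounded to the nearest integer.
    return {name: round(depth / baseline) for name, depth in mean_depths.items()}


# Hypothetical mean depths; contig00003 has about three times the typical depth.
estimates = naive_copy_numbers(
    {"contig00001": 20.0, "contig00002": 21.0, "contig00003": 61.0, "contig00004": 19.0}
)
print(estimates)
```

A heuristic like this ignores the spread of the depth distribution, which is what the R script is described as using, so treat it only as a first approximation.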

### Assemblies

For the assemblies of E. coli and P. gingivalis described in the paper, two files each are made available. The descriptions of these files are partly based on the GS FLX Data Analysis Software Manual, December 2007.

454AlignmentInfo.tsv
[This file is the input for the R script that estimates the genomic copy number of each contig, see above]
This file contains position-by-position summary information about the consensus sequence for the contigs generated by the GS De Novo Assembler application, listed one nucleotide per line (in a tab-delimited format). The columns of each line contain the following information:

  • Position – the position in the contig
  • Consensus – the consensus nucleotide for that position in the contig
  • Quality Score – the quality score of the consensus base
  • Depth – the number of reads that align at that position in the alignment
  • Signal – the average signal of the read flowgrams, for the flows that correspond to that position in the alignment
  • StdDeviation – the standard deviation of the read flowgram signals at the corresponding flows

The block of lines for each contig is preceded by a header line that begins with a > and gives the contig name.
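The layout described above can be read with a few lines of Python. This is a sketch under stated assumptions, not part of the repository: it assumes the columns appear in the order listed (so Depth is the fourth tab-separated field) and that the contig name is the first token after the >. The function name `mean_depth_per_contig` and the sample data are hypothetical.

```python
import io
from collections import defaultdict


def mean_depth_per_contig(handle, depth_col=3):
    """Return {contig name: mean depth} from a 454AlignmentInfo.tsv-style stream.

    depth_col is the 0-based index of the Depth column (assumed to be the
    fourth column, following the order in which the columns are listed).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    contig = None
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the contig name is the first token after '>'.
            parts = line[1:].split()
            contig = parts[0] if parts else None
            continue
        fields = line.split("\t")
        # Skip column-title lines, malformed lines, and lines before any header.
        if contig is None or len(fields) <= depth_col or not fields[depth_col].isdigit():
            continue
        totals[contig] += int(fields[depth_col])
        counts[contig] += 1
    return {name: totals[name] / counts[name] for name in counts}


# Hypothetical two-contig example in the described format.
example = io.StringIO(
    ">contig00001\n"
    "1\tA\t64\t10\t1.02\t0.05\n"
    "2\tC\t64\t12\t0.98\t0.04\n"
    ">contig00002\n"
    "1\tG\t60\t30\t1.00\t0.06\n"
)
depths = mean_depth_per_contig(example)
print(depths)
```

For a real file, pass an open file handle instead of the `io.StringIO` object. If your version of the file has a different column layout, adjust `depth_col` accordingly.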

454LargeContigs.fna
Fasta formatted contigs of at least 500 bp generated by the GS De Novo Assembler application. The description lines are formatted as follows:

>contigXXXXX length=abc numReads=xyz

where contigXXXXX is the identifier of the contig and XXXXX is a sequential numbering of the contigs in the assembly; and where the length and numReads values are the length in bases of the contig and the number of reads that were used in that contig’s multiple alignment.
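As an illustration, a description line in this format can be split into its parts with a regular expression. The sketch below assumes the fields are separated by whitespace and appear in the order shown; the function name `parse_contig_header` and the example values are hypothetical.

```python
import re

# Matches lines such as ">contig00001 length=1523 numReads=87".
HEADER_RE = re.compile(
    r"^>(?P<name>\S+)\s+length=(?P<length>\d+)\s+numReads=(?P<num_reads>\d+)"
)


def parse_contig_header(line):
    """Return (contig name, length in bases, number of reads) from a description line."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized description line: {line!r}")
    return match.group("name"), int(match.group("length")), int(match.group("num_reads"))


header = parse_contig_header(">contig00001 length=1523 numReads=87")
print(header)
```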
