GitHub - lexnederbragt/RepSeq: These files belong to this publication: Nederbragt et al. (2010): Identification and quantification of genomic repeats and sample contamination in assemblies of 454 pyrosequencing reads, *Sequencing*. doi:10.1155/2010/782465

These files belong to this publication:

Nederbragt, A.J., Rounge, T.B., Kausrud, K. and Jakobsen, K.S. (2010). Identification and quantification of genomic repeats and sample contamination in assemblies of 454 pyrosequencing reads. *Sequencing*. http://www.hindawi.com/archive/2010/782465/. doi:10.1155/2010/782465

Please contact Kyrre Kausrud (kyrreka@ibv.uio.no) regarding the script and Lex Nederbragt (lex.nederbragt@ibv.uio.no) regarding the publication.

### R script

This program estimates the number of copies of each contig from the observed distribution of read depths for a sequenced genome. Running it requires R, “a free software environment for statistical computing and graphics”, which is available at http://www.r-project.org/. The beginning of the script contains commented lines (starting with the ‘#’ symbol) that explain how to use it. The input file for the script is the 454AlignmentInfo.tsv file generated by the Newbler assembly program (gsAssembler).
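To make the idea of depth-based copy number concrete, here is a minimal Python sketch. It is not the method implemented in RepSeq_1.0.R, which works from the observed read-depth distributions; it only shows the naive intuition that a contig covered about three times as deeply as a typical contig probably occurs about three times in the genome. The function name `naive_copy_numbers`, the choice of the median as the single-copy baseline, and the example depths are all assumptions made for illustration.

```python
from statistics import median


def naive_copy_numbers(mean_depths):
    """Divide each contig's mean depth by the median depth over all contigs.

    Assumes that most contigs are single-copy, so the median approximates
    the depth of one genomic copy. This is an illustrative shortcut, not
    the statistical procedure used by the RepSeq R script.
    """
    if not mean_depths:
        return {}
    baseline = median(mean_depths.values())
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    return {name: depth / baseline for name, depth in mean_depths.items()}


# Made-up mean depths, for illustration only.
depths = {
    "contig00001": 20.0,
    "contig00002": 21.0,
    "contig00003": 61.5,
    "contig00004": 19.0,
}
estimates = naive_copy_numbers(depths)
for name, copies in sorted(estimates.items()):
    print(f"{name}\t{copies:.2f}")
```

A ratio like this is sensitive to coverage noise and to short contigs, which is one reason to prefer a distribution-based estimate such as the one the R script provides.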

### Assemblies

For the assemblies of E. coli and P. gingivalis described in the paper, two files each are made available. The descriptions of these files are partly based on the GS FLX Data Analysis Software Manual, December 2007.

454AlignmentInfo.tsv
[This file is the input for the R script that estimates the genomic copy number of each contig; see above.]
This file contains position-by-position summary information about the consensus sequence for the contigs generated by the GS De Novo Assembler application, listed one nucleotide per line (in a tab-delimited format). The columns of each line contain the following information:

Position – the position in the contig
Consensus – the consensus nucleotide for that position in the contig
Quality Score – the quality score of the consensus base
Depth – the number of reads that align at that position in the alignment
Signal – the average signal of the read flowgrams, for the flows that correspond to that position in the alignment
StdDeviation – the standard deviation of the read flowgram signals at the corresponding flows

The lines for each contig are preceded by a header line that begins with a > and gives the contig name.
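The layout described above (a > header per contig, followed by one tab-delimited line per position) can be read with a few lines of Python. The sketch below is independent of the R script and assumes the columns appear in the order listed here, so that Depth is the fourth column. Files produced by other Newbler versions may order or name their columns differently, so check `depth_column` against your own file. The function name `read_depths` and the sample rows are made up for illustration.

```python
import io
from collections import defaultdict


def read_depths(lines, depth_column=3):
    """Collect the per-position depth values of every contig.

    depth_column is the zero-based index of the Depth column. The default
    of 3 follows the column order given in this README (an assumption).
    """
    depths = defaultdict(list)
    contig = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith(">"):
            # Header line: the contig name is the first token after '>'.
            parts = line[1:].split()
            contig = parts[0] if parts else ""
            continue
        if contig is None:
            # Ignore anything before the first contig header,
            # for example a line of column titles.
            continue
        fields = line.split("\t")
        depths[contig].append(int(fields[depth_column]))
    return dict(depths)


# Made-up example rows: position, consensus, quality, depth, signal, std dev.
sample = (
    ">contig00001\n"
    "1\tA\t64\t12\t1.02\t0.05\n"
    "2\tC\t60\t14\t0.98\t0.07\n"
    ">contig00002\n"
    "1\tG\t55\t30\t2.01\t0.10\n"
)
per_contig = read_depths(io.StringIO(sample))
mean_depth = {name: sum(d) / len(d) for name, d in per_contig.items()}
print(mean_depth)
```

To process a real file, pass an open file handle instead of the `io.StringIO` object, for example `read_depths(open("454AlignmentInfo.tsv"))`.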

454LargeContigs.fna
FASTA-formatted contigs of at least 500 bp generated by the GS De Novo Assembler application. The description lines are formatted as follows:

>contigXXXXX length=abc numReads=xyz

where contigXXXXX is the identifier of the contig, XXXXX being a sequential number of the contig within the assembly; length is the length of the contig in bases; and numReads is the number of reads used in that contig’s multiple alignment.
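As a small illustration of this description-line format, the following Python sketch extracts the three fields with a regular expression. The pattern follows the template shown above; the function name `parse_header` and the example header are hypothetical, and real files may carry extra fields that this pattern simply ignores.

```python
import re

# Pattern for description lines of the form:
#   >contigXXXXX length=abc numReads=xyz
HEADER_RE = re.compile(
    r"^>(?P<name>contig\d+)\s+length=(?P<length>\d+)\s+numReads=(?P<num_reads>\d+)"
)


def parse_header(line):
    """Return (contig name, length in bases, number of reads)."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised description line: {line!r}")
    return (
        match.group("name"),
        int(match.group("length")),
        int(match.group("num_reads")),
    )


# Made-up header, for illustration only.
name, length, num_reads = parse_header(">contig00012 length=1534 numReads=87")
print(name, length, num_reads)
```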

