Simulating Reads with DWGSIM
For brevity, the default usage is shown. More documentation can be made available upon request (don't hesitate to ask).
Usage: dwgsim [options] <in.ref.fa> <out.prefix>

Options:
    -e FLOAT    per base/color/flow error rate of the first read [from 0.020 to 0.020 by 0.000]
    -E FLOAT    per base/color/flow error rate of the second read [from 0.020 to 0.020 by 0.000]
    -i          use the inner distance instead of the outer distance for pairs [False]
    -d INT      outer distance between the two ends for pairs
    -s INT      standard deviation of the distance for pairs [50.000]
    -N INT      number of read pairs (-1 to disable) [-1]
    -C FLOAT    mean coverage across available positions (-1 to disable) [100.00]
    -1 INT      length of the first read
    -2 INT      length of the second read
    -r FLOAT    rate of mutations [0.0010]
    -F FLOAT    frequency of given mutation to simulate low frequency somatic mutations [0.5000]
                NB: frequency F refers to the first strand of mutation, therefore mutations
                on the second strand occur with a frequency of 1-F
    -R FLOAT    fraction of mutations that are indels [0.10]
    -X FLOAT    probability an indel is extended [0.30]
    -I INT      the minimum length indel
    -y FLOAT    probability of a random DNA read [0.05]
    -n INT      maximum number of Ns allowed in a given read
    -c INT      generate reads for:
                    0: Illumina
                    1: SOLiD
                    2: Ion Torrent
    -S INT      generate reads:
                    0: default (opposite strand for Illumina, same strand for SOLiD/Ion Torrent)
                    1: same strand (mate pair)
                    2: opposite strand (paired end)
    -f STRING   the flow order for Ion Torrent data [(null)]
    -B          use a per-base error rate for Ion Torrent data [False]
    -H          haploid mode [False]
    -z INT      random seed (-1 uses the current time) [-1]
    -M          generate a mutations file only [False]
    -m FILE     the mutations txt file to re-create [not using]
    -b FILE     the bed-like file set of candidate mutations [(null)]
    -v FILE     the vcf file set of candidate mutations (use pl tag for strand) [(null)]
    -x FILE     the bed of regions to cover [not using]
    -P STRING   a read prefix to prepend to each read name [not using]
    -q STRING   a fixed base quality to apply (single character) [not using]
    -Q FLOAT    standard deviation of the base quality scores [2.00]
    -o INT      output type for the FASTQ files:
                    0: interleaved (bfast) and per-read-end (bwa)
                    1: per-read-end (bwa) only
                    2: interleaved (bfast) only
    -h          print this message
Note: For SOLiD mate pair reads and BFAST, the first read is F3 and the second is R3. For SOLiD mate pair reads and BWA, the reads in the first file are the R3 reads and are annotated as the first read, and so on.
NB: the -d option was previously incorrectly stated as being the outer distance, but is in fact the inner distance. Thanks to Brent Pedersen!
NB: the longest supported insertion is 255bp.
The -H mode will simulate a haploid genome, whereas the default is to simulate a diploid genome.
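As a rough illustration of the usage above, the following Python sketch assembles a dwgsim command line from a dictionary of options. The helper name `build_dwgsim_command` and the file names in the usage note are hypothetical; only the flags themselves come from the usage text, and the sketch assumes `dwgsim` is on the PATH when the command is actually run.

```python
from typing import Dict, List, Optional


def build_dwgsim_command(
    ref_fasta: str,
    out_prefix: str,
    options: Optional[Dict[str, object]] = None,
) -> List[str]:
    """Build an argument list of the form: dwgsim [options] <in.ref.fa> <out.prefix>.

    `options` maps a flag (for example "-N") to its value. A value of None
    marks a switch that takes no argument (for example "-H").
    """
    cmd = ["dwgsim"]
    for flag, value in (options or {}).items():
        cmd.append(flag)
        if value is not None:
            cmd.append(str(value))
    # Positional arguments come last, as in the usage line.
    cmd.extend([ref_fasta, out_prefix])
    return cmd
```

For example, `build_dwgsim_command("ref.fa", "sim", {"-1": 100, "-2": 100, "-z": 42, "-H": None})` yields a list that could be passed to `subprocess.run(cmd, check=True)` to simulate 100bp haploid read pairs with a fixed seed.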
The "-e" and "-E" options accept either a uniform error rate (e.g. "-e 0.01" for 1%) or a uniformly increasing/decreasing error rate (e.g. "-e 0.01-0.1" for an error rate of 1% at the start of the read, increasing to 10% at the end of the read).
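To make the two forms concrete, here is a minimal Python sketch that parses an error rate specification and expands it into one rate per read position. The function names are hypothetical, and the linear interpolation that reaches the end value exactly at the last base is an assumption; DWGSIM's own step calculation may differ slightly. Values in scientific notation (such as "1e-3") are not handled.

```python
from typing import List, Tuple


def parse_error_rate(spec: str) -> Tuple[float, float]:
    """Parse "0.01" or "0.01-0.1" into (start_rate, end_rate)."""
    start, sep, end = spec.partition("-")
    start_rate = float(start)
    # A single value means the same rate across the whole read.
    end_rate = float(end) if sep else start_rate
    return start_rate, end_rate


def per_position_error_rates(spec: str, read_length: int) -> List[float]:
    """Expand a specification into one error rate per read position.

    Assumption: the rate changes linearly from the first to the last base.
    """
    if read_length < 1:
        raise ValueError("read_length must be at least 1")
    start_rate, end_rate = parse_error_rate(spec)
    if read_length == 1:
        return [start_rate]
    step = (end_rate - start_rate) / (read_length - 1)
    return [start_rate + i * step for i in range(read_length)]
```

With `per_position_error_rates("0.01-0.1", 10)`, the first position gets a 1% error rate and the last gets approximately 10%.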
Read names consist of the following fields, in order:
- contig name (chromosome name)
- start read 1 (one-based)
- start read 2 (one-based)
- strand read 1 (0 - forward, 1 - reverse)
- strand read 2 (0 - forward, 1 - reverse)
- random read 1 (0 - from the mutated reference, 1 - random)
- random read 2 (0 - from the mutated reference, 1 - random)
- number of sequencing errors read 1 (color errors for colorspace)
- number of SNPs read 1
- number of indels read 1
- number of sequencing errors read 2 (color errors for colorspace)
- number of SNPs read 2
- number of indels read 2
- read number (unique within a given contig/chromosome)
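The field list above can be turned into a small parser, which is handy when evaluating an aligner against the simulated truth. The sketch below is in Python and rests on an assumption that should be checked against real output: fields are joined by underscores, except that the three per-read counts (errors, SNPs, indels) are joined by colons, giving names like `contig_start1_start2_strand1_strand2_random1_random2_e1:s1:i1_e2:s2:i2_readnumber`. It also assumes an optional leading "@" and an optional "/1" or "/2" suffix. The names `ReadName` and `parse_read_name` are hypothetical.

```python
from typing import NamedTuple


class ReadName(NamedTuple):
    contig: str
    start1: int       # one-based
    start2: int       # one-based
    strand1: int      # 0 - forward, 1 - reverse
    strand2: int
    random1: int      # 0 - from the mutated reference, 1 - random
    random2: int
    errors1: int
    snps1: int
    indels1: int
    errors2: int
    snps2: int
    indels2: int
    read_number: str  # kept as text; its numeric base is not assumed here


def parse_read_name(name: str) -> ReadName:
    """Parse a read name under the assumed delimiter layout."""
    name = name.strip()
    if name.startswith("@"):
        name = name[1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    # Split from the right so that underscores inside the contig name
    # (or a read prefix added with -P) stay attached to the contig field.
    parts = name.rsplit("_", 9)
    if len(parts) != 10:
        raise ValueError(f"unexpected read name: {name!r}")
    contig, start1, start2, strand1, strand2, random1, random2, c1, c2, number = parts
    errors1, snps1, indels1 = (int(x) for x in c1.split(":"))
    errors2, snps2, indels2 = (int(x) for x in c2.split(":"))
    return ReadName(
        contig, int(start1), int(start2), int(strand1), int(strand2),
        int(random1), int(random2),
        errors1, snps1, indels1, errors2, snps2, indels2, number,
    )
```

Splitting from the right is a deliberate choice: the contig name is the only field that can itself contain underscores, so everything left over after peeling off the last nine fields is treated as the contig.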
This utility can generate mate pair or paired end reads using the "-S" option. By default, Illumina (nucleotide) data are paired end, and SOLiD (color space) data are mate pair. For clarity, let's call the first end sequence E1 and the second end E2.
Paired end reads have the following orientation:
    5' E1 -----> .... 3'
    3' .... <------- E2 5'
Above, the start co-ordinate of E1 is less than E2, with E1 and E2 reported on opposite strands.
Mate pair reads have the following orientation:
    5' E2 -----> .... E1 -------> 3'
    3' .... 5'
Above, the start co-ordinate of E1 is greater than E2, with E1 and E2 reported on the same strand.
So for SOLiD mate pair reads, the R3 tag (E2) is listed before the F3 tag (E1). For SOLiD paired end reads, the F3 tag (E1) is listed before the F5 tag (E2).
The locations of introduced mutations are given in a <prefix>.mutations.txt text file. There are five columns:
- the chromosome/contig name
- the one-based position
- the original reference base
- the new reference base(s)
- the variant strand(s)
contig4 4 T K 1
The above shows a heterozygous mutation at position 4 of contig4 on the first strand, mutating the T base to a heterozygous K (G or T) SNP.
Insertions are represented on one line, where the reference base is missing (indicated by a '-' in the third column).
contig5 13 - TAC 3
The above shows a homozygous insertion of TAC prior to position 13.
Each base of a deletion is represented on one line, where the new reference base is missing and represented by a '-'.
contig6 22 A - 2
The above shows a heterozygous deletion of A at position 22 on the second strand. Multi-base deletions are shown on consecutive lines.
contig6 22 A - 2
contig6 23 C - 2
The above shows a two-base heterozygous deletion of positions 22 and 23 on the second strand.
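The mutation records described above are simple enough to parse line by line. The Python sketch below assumes the five columns are separated by whitespace and that the strand column uses 1 for the first strand, 2 for the second strand, and 3 for both strands (homozygous), as in the examples. The names `Mutation`, `parse_mutation_line`, `is_homozygous`, and `snp_alleles` are hypothetical; the two-base IUPAC ambiguity codes are the standard ones.

```python
from typing import NamedTuple

# Standard two-base IUPAC ambiguity codes, used for heterozygous SNPs.
IUPAC_TWO_BASE = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}


class Mutation(NamedTuple):
    contig: str
    position: int  # one-based
    ref: str       # "-" for an insertion
    alt: str       # "-" for a deleted base
    strand: int    # 1: first strand, 2: second strand, 3: both strands
    kind: str      # "snp", "insertion", or "deletion"


def parse_mutation_line(line: str) -> Mutation:
    """Parse one line of a <prefix>.mutations.txt file."""
    contig, position, ref, alt, strand = line.split()
    if ref == "-":
        kind = "insertion"
    elif alt == "-":
        kind = "deletion"
    else:
        kind = "snp"
    return Mutation(contig, int(position), ref, alt, int(strand), kind)


def is_homozygous(mutation: Mutation) -> bool:
    """A mutation present on both strands is homozygous."""
    return mutation.strand == 3


def snp_alleles(code: str) -> str:
    """Expand an ambiguity code to its bases; plain bases pass through."""
    return IUPAC_TWO_BASE.get(code, code)
```

Because each base of a deletion is written on its own line, a caller that wants whole deletion events would still need to merge records with consecutive positions on the same contig and strand.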
Three FASTQ files are produced, for use with BFAST and BWA.
The FASTQ for BFAST is formatted so that the ends of each multi-end read (paired end or mate pair) occur consecutively in the file, with the read that is 5' of the other listed first. For paired end reads, this means that E1 is always listed before E2. For mate pair reads, this means that E2 is always listed before E1.
The FASTQs for BWA are split into two files, the first file for one end, the second file for the other, with the read that is 5' of the other in the first file. For paired end reads, this means that E1 is in the first file and E2 is in the second file. For mate pair reads, this means that E2 is in the first file and E1 is in the second file.
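The relationship between the two layouts can be sketched in Python: given an interleaved (BFAST-style) FASTQ, alternating records go to the first and second per-end lists, mirroring the BWA-style split. This is only an illustration under the assumption that every record spans exactly four lines and that both ends of each pair are present; it is not how DWGSIM itself writes the files, and the function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str, str]  # name, sequence, separator, qualities


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield four-line FASTQ records from an iterable of lines."""
    buf: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not buf and not line:
            continue  # tolerate blank lines between records
        buf.append(line)
        if len(buf) == 4:
            yield (buf[0], buf[1], buf[2], buf[3])
            buf = []
    if buf:
        raise ValueError("truncated FASTQ record")


def split_interleaved(lines: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Split interleaved records into (first-listed ends, second-listed ends)."""
    first: List[Record] = []
    second: List[Record] = []
    for index, record in enumerate(read_fastq(lines)):
        (first if index % 2 == 0 else second).append(record)
    if len(first) != len(second):
        raise ValueError("odd number of records in interleaved FASTQ")
    return first, second
```

Since the read that is 5' of the other is listed first in the interleaved file, the first list holds E1 for paired end reads and E2 for mate pair reads, matching the description of the first BWA file.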