Updated the README
jts committed Oct 10, 2010
1 parent 98d1dfc commit 29eefdd
Showing 1 changed file with 39 additions and 26 deletions.
65 changes: 39 additions & 26 deletions src/README
@@ -1,7 +1,8 @@
SGA - String Graph Assembler

SGA is a genome assembler based on Gene Myers' string graph assembly framework. It uses
the FM-index/BWT to efficiently find overlaps between sequence reads as described here:
SGA is a de novo assembler for DNA sequence reads. It is based on Gene Myers' string graph
formulation of assembly and uses the FM-index/Burrows-Wheeler transform to efficiently
find overlaps between sequence reads. The core algorithms are described in this paper:

http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/12/i367

@@ -29,45 +30,61 @@ This command would copy sga to /home/jsimpson/bin/sga

*** Running SGA

SGA is made up of a number of subprograms, together which make up the assembly pipeline. Each program and
subprogram will print a brief description and its usage info if the --help flag is used. For this command
will print all the subprogram names and their commands.
SGA is made up of a number of subprograms which together form the assembly pipeline. The subprograms
can also be used to perform other interesting tasks, like read error correction or removing PCR duplicates.
Each program and subprogram will print a brief description and its usage instructions if the --help flag is used.

sga --help
To get a listing of all subprograms, run sga --help.

The major subprograms are:

* sga preprocess READS
* sga preprocess READS > out.fastq

Preprocess prepares a data file for assembly. It can perform optional quality filtering/trimming. By default
Preprocess prepares reads for assembly. It can perform optional quality filtering/trimming. By default
it will discard reads that have uncalled bases ('N' or '.'). If you wish to keep these reads, use the --permuteN
flag which will randomly change any uncalled bases to one of [ACGT]. It is mandatory to run this command
on real data. If you are using simulated data without uncalled bases, you do not need to run this command.
Refer to sga preprocess --help for more options and their use.
Refer to sga preprocess --help for more options and their use. If your reads are paired, the --pe-mode 1 option
should be specified. The paired reads can be input in two files (sga preprocess READS1 READS2) where the corresponding
reads in the files are assumed to go together or they can be interleaved in a single file where the two reads
are expected to appear in consecutive records. By default, the output is written to stdout.

* sga index READS

Build the FM-index for READS. READS can be fasta or fastq. The -d option can be used to limit
the amount of memory consumed, at the cost of higher running time. See --help for more information.
the amount of memory consumed, at the cost of higher running time. Typical values of -d are 2000000 or 4000000.
This program is threaded (-t N).

* sga correct READS

Perform error correction on the sequences in READS. Many options exist for this command, refer to --help.
By default, the corrected reads will be output to READS.ec.fa. This program is threaded.
Perform error correction on the sequence reads in READS. Overlap and kmer-based correction algorithms
are provided. By default, a hybrid algorithm is used which first attempts to correct the reads
using long kmers. This method of correction is very fast and will get rid of most singleton errors.
The reads that cannot be corrected using kmers are corrected by finding inexact overlaps
from which a multiple alignment and consensus sequence is found.

Many options exist for this program, see --help. Substantially improved results can be obtained
by tuning the --min-overlap, --error-rate, --kmer and --count-threshold parameters. This program
is threaded (-t N). By default, the corrected reads will be output to READS.ec.fa.

* sga rmdup READS

Remove duplicated sequences from the READS file. This is useful for removing PCR/optical duplicates.
The --error-rate parameter controls the edit percentage that is allowed to consider two reads to be identical.
This program automatically regenerates the FM-index without the duplicated reads.

* sga overlap -m N READS

Find overlaps between reads that will be used to construct the string graph. The -m parameter specifies
the minimum length of the overlaps to find. This program is threaded. The output file is READS.asqg.gz
by default.
the minimum length of the overlaps to find. By default, only non-transitive edges and edges between
identical sequences are output. If all overlaps between reads are desired, the --exhaustive option can be specified.
This program is threaded. The output file is READS.asqg.gz by default.

* sga assemble READS.asqg.gz

Assemble takes the output of the overlap step and constructs contigs. The output is in contigs.fa by default.
Assemble takes the output of the overlap step and constructs contigs. The output is in contigs.fa by default. Options
exist for cleaning the graph before assembly which will substantially increase assembly continuity.
See the --trim, --bubble, --resolve-small options.
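
The interleaved layout described for paired reads under sga preprocess can be illustrated with a small shell helper. This is only a sketch: interleave_fastq is a hypothetical function, not part of SGA, and it assumes both inputs are plain four-line FASTQ records listed in the same order.

```shell
# Hypothetical helper (not part of SGA): merge two FASTQ files so that the
# two reads of each pair appear in consecutive records.
# Assumes every record is exactly four lines and both files list the pairs
# in the same order.
interleave_fastq() {
    # $1 = file holding the first read of each pair
    # $2 = file holding the second read of each pair
    awk -v mate2="$2" '
        { print }                      # copy every line of the first file
        NR % 4 == 0 {                  # a record from the first file just ended
            for (i = 0; i < 4; i++) {  # emit the matching record from the second file
                if ((getline line < mate2) > 0) print line
            }
        }
    ' "$1"
}
```

Under those assumptions, `interleave_fastq reads_1.fastq reads_2.fastq > interleaved.fastq` (file names are placeholders) produces a single file that could then be passed to sga preprocess with --pe-mode 1, as described above.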

*** Example usage

@@ -101,24 +118,20 @@ sga rmdup -e 0.02 -t 4 reads.pp.ec.fa

This command removes any duplicated/identical reads as they do not contribute to the string graph. The -e
parameter indicates that two reads are considered to be identical if the edit distance is 2% or lower. Again,
4 threads will be used for the computation.

sga index reads.pp.ec.rmdup.fa

The corrected, de-duplicated reads must be indexed as well.
4 threads will be used for the computation. The index files for the rmdup'd reads are automatically generated.

sga overlap -m 50 -e 0.0 -t 4 reads.pp.ec.rmdup.fa

This constructs the reads.pp.ec.rmdup.asqg.gz file which is used for the assembly. The parameters are similar
This constructs the ASQG file (reads.pp.ec.rmdup.asqg.gz) which is used for the assembly. The parameters are similar
to sga correct, -m specifies the length of the minimum overlap and -e specifies the tolerable error rate. In this
case we used -e 0.0 which means we want exact matches only. The -m/-e parameters can have a large effect
on the assembly so it is worth trying different values.
case we used -e 0.0, which means we want exact matches only; this is HIGHLY RECOMMENDED after error correcting the reads.
The -m/-e parameters can have a large effect on the assembly so it is worth trying different values.

sga assemble -r -t 10 -b 2 reads.pp.ec.rmdup.asqg.gz
sga assemble -r 10 -t 10 -b 2 reads.pp.ec.rmdup.asqg.gz

Assemble the reads into contigs. The -r parameter turns on small-repeat resolution which can increase the length
of the contigs by untangling repeats that are less than a read length. The -t parameter specifies that 10 rounds
of dead-end trimming should be performed to clean up the graph. The -b parameter specifies that two rounds of
of dead-branch trimming should be performed to clean up the graph. The -b parameter specifies that two rounds of
bubble removal should be performed. The constructed contigs will be placed in contigs.fa.
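
The last three commands above can be gathered into one script. The following is a rough sketch only: run, sga_tail_pipeline and DRY_RUN are hypothetical names, not part of SGA; the flags are copied from the example commands in this diff; and the derived file names (replacing .fa with .rmdup.fa and .rmdup.asqg.gz) are inferred from the names used in the example. It assumes the input reads have already been preprocessed, indexed and corrected.

```shell
# Hypothetical wrapper (not part of SGA) around the rmdup -> overlap -> assemble
# steps shown in the example. With DRY_RUN=1 (the default here) the commands
# are only printed, so the script can be inspected without sga installed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

sga_tail_pipeline() {
    reads=$1             # corrected reads, e.g. reads.pp.ec.fa
    threads=${2:-4}      # thread count for rmdup and overlap
    base=${reads%.fa}    # assumed naming: outputs keep the input prefix

    # Remove duplicates; per the README the index is regenerated automatically.
    run sga rmdup -e 0.02 -t "$threads" "$reads"
    # Exact overlaps of at least 50 bases.
    run sga overlap -m 50 -e 0.0 -t "$threads" "$base.rmdup.fa"
    # For assemble, -t is the number of trimming rounds, not a thread count.
    run sga assemble -r 10 -t 10 -b 2 "$base.rmdup.asqg.gz"
}
```

Calling `sga_tail_pipeline reads.pp.ec.fa 4` prints the three commands; setting DRY_RUN=0 would execute them instead.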

*** Data quality issues
@@ -133,7 +146,7 @@ running an initial 'rmdup' step before error correction.

*** History

The first SGA-related code check-in was August, 2009. The algorithms for directly constructing the string graph from
The first SGA code check-in was in August 2009. The algorithms for directly constructing the string graph from
the FM-index were developed and implemented in the fall of 2009.

*** Third party code