Updated the README

jts · Oct 10, 2010 · 29eefdd
1 parent 98d1dfc
commit 29eefdd
Showing 1 changed file with 39 additions and 26 deletions.
diff --git a/src/README b/src/README
@@ -1,7 +1,8 @@
 SGA - String Graph Assembler
 
-SGA is a genome assembler based on Gene Myers' string graph assembly framework. It uses
-the FM-index/BWT to efficiently find overlaps between sequence reads as described here:
+SGA is a de novo assembler for DNA sequence reads. It is based on Gene Myers' string graph
+formulation of assembly and uses the FM-index/Burrows-Wheeler transform to efficiently
+find overlaps between sequence reads. The core algorithms are described in this paper:
 
 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/12/i367
 
@@ -29,45 +30,61 @@ This command would copy sga to /home/jsimpson/bin/sga
 
 *** Running SGA
 
-SGA is made up of a number of subprograms, together which make up the assembly pipeline. Each program and
-subprogram will print a brief description and its usage info if the --help flag is used. For this command 
-will print all the subprogram names and their commands.
+SGA is made up of a number of subprograms, which together form the assembly pipeline. The subprograms
+can also be used to perform other interesting tasks, like read error correction or removing PCR duplicates.
+Each program and subprogram will print a brief description and its usage instructions if the --help flag is used.
 
-sga --help
+To get a listing of all subprograms, run sga --help.
 
 The major subprograms are:
 
-* sga preprocess READS
+* sga preprocess READS > out.fastq
 
-Preprocess prepares a data file for assembly. It can perform optional quality filtering/trimming. By default
+Preprocess prepares reads for assembly. It can perform optional quality filtering/trimming. By default
 it will discard reads that have uncalled bases ('N' or '.'). If you wish to keep these reads, use the --permuteN 
 flag which will randomly change any uncalled bases to one of [ACGT]. It is mandatory to run this command 
 on real data. If you are using simulated data without uncalled bases, you do not need to run this command.
-Refer to sga preprocess --help for more options and their use.
+Refer to sga preprocess --help for more options and their use. If your reads are paired, the --pe-mode 1 option
+should be specified. The paired reads can be input in two files (sga preprocess READS1 READS2) where the corresponding
+reads in the files are assumed to go together or they can be interleaved in a single file where the two reads
+are expected to appear in consecutive records. By default, the output is written to stdout.
 
 * sga index READS
 
 Build the FM-index for READS. READS can be fasta or fastq. The -d option can be used to limit
-the amount of memory consumed, at the cost of higher running time. See --help for more information.
+the amount of memory consumed, at the cost of higher running time. Typical values of -d are 2000000 or 4000000.
+This program is threaded (-t N).
 
 * sga correct READS
 
-Perform error correction on the sequences in READS. Many options exist for this command, refer to --help. 
-By default, the corrected reads will be output to READS.ec.fa. This program is threaded.
+Perform error correction on the sequence reads in READS. Overlap and kmer-based correction algorithms
+are provided. By default, a hybrid algorithm is used which first attempts to correct the reads
+using long kmers. This method of correction is very fast and will get rid of most singleton errors. 
+The reads that cannot be corrected using kmers are corrected by finding inexact overlaps 
+from which a multiple alignment and consensus sequence are computed. 
+
+Many options exist for this program; see --help. Substantially improved results can be obtained
+by tuning the --min-overlap, --error-rate, --kmer and --count-threshold parameters. This program
+is threaded (-t N). By default, the corrected reads will be output to READS.ec.fa. 
 
 * sga rmdup READS
 
 Remove duplicated sequences from the READS file. This is useful for removing PCR/optical duplicates. 
+The --error-rate parameter controls the maximum edit percentage at which two reads are still considered identical.
+This program automatically regenerates the FM-index without the duplicated reads.
 
 * sga overlap -m N READS
 
 Find overlaps between reads that will be used to construct the string graph. The -m parameter specifies
-the minimum length of the overlaps to find.  This program is threaded. The output file is READS.asqg.gz 
-by default.
+the minimum length of the overlaps to find. By default only non-transitive edges and edges between
+identical sequences are output. If all overlaps between reads are desired, the --exhaustive option can be specified.
+This program is threaded. The output file is READS.asqg.gz by default.
 
 * sga assemble READS.asqg.gz
 
-Assemble takes the output of the overlap step and constructs contigs. The output is in contigs.fa by default.
+Assemble takes the output of the overlap step and constructs contigs. The output is in contigs.fa by default. Options
+exist for cleaning the graph before assembly which will substantially increase assembly continuity. 
+See the --trim, --bubble, --resolve-small options.
 
 *** Example usage
 
@@ -101,24 +118,20 @@ sga rmdup -e 0.02 -t 4 reads.pp.ec.fa
 
 This command removes any duplicated/identical reads as they do not contribute to the string graph. The -e 
 parameter indicates that two reads are considered to be identical if the edit distance is 2% or lower. Again,
-4 threads will be used for the computation.
-
-sga index reads.pp.ec.rmdup.fa
-
-The corrected, de-duplicated reads must be indexed as well.
+4 threads will be used for the computation. The index files for the rmdup'd reads are automatically generated.
 
 sga overlap -m 50 -e 0.0 -t 4 reads.pp.ec.rmdup.fa
 
-This constructs the reads.pp.ec.rmdup.asqg.gz file which is used for the assembly. The parameters are similar
+This constructs the ASQG file (reads.pp.ec.rmdup.asqg.gz) which is used for the assembly. The parameters are similar
 to sga correct: -m specifies the length of the minimum overlap and -e specifies the tolerable error rate. In this
-case we used -e 0.0 which means we want exact matches only. The -m/-e parameters can have a large effect
-on the assembly so it is worth trying different values.
+case we used -e 0.0, which means we want exact matches only; this is HIGHLY RECOMMENDED after error correcting the reads. 
+The -m/-e parameters can have a large effect on the assembly so it is worth trying different values.
 
-sga assemble -r -t 10 -b 2 reads.pp.ec.rmdup.asqg.gz
+sga assemble -r 10 -t 10 -b 2 reads.pp.ec.rmdup.asqg.gz
 
 Assemble the reads into contigs. The -r parameter turns on small-repeat resolution which can increase the length 
 of the contigs by untangling repeats that are less than a read length. The -t parameter specifies that 10 rounds
-of dead-end trimming should be performed to clean up the graph. The -b parameter specifies that two rounds of 
+of dead-branch trimming should be performed to clean up the graph. The -b parameter specifies that two rounds of 
 bubble removal should be performed. The constructed contigs will be placed in contigs.fa.
 
 *** Data quality issues
@@ -133,7 +146,7 @@ running an initial 'rmdup' step before error correction.
 
 *** History 
 
-The first SGA-related code check-in was August, 2009. The algorithms for directly constructing the string graph from
+The first SGA code check-in was in August 2009. The algorithms for directly constructing the string graph from
 the FM-index were developed and implemented in the fall of 2009.  
 
 *** Third party code