Merge pull request #41 from enormandeau/master
Spellchecking and grammar
apmasell committed Sep 3, 2014
2 parents abe71e4 + 0ca0de7 commit af7f35f
Showing 1 changed file with 12 additions and 12 deletions.
24 changes: 12 additions & 12 deletions README.md
@@ -6,15 +6,15 @@ PANDASEQ is a program to align Illumina reads, optionally with PCR primers embed
INSTALLATION
------------

[![Build Status](https://travis-ci.org/neufeld/pandaseq.png?branch=master)](https://travis-ci.org/neufeld/pandaseq) [![Build Status](https://travis-ci.org/neufeld/pandaseq-sam.png?branch=master)](https://travis-ci.org/neufeld/pandaseq-sam)
[![Build Status](https://travis-ci.org/neufeld/pandaseq.png?branch=master)](https://travis-ci.org/neufeld/pandaseq)

Binary packages are available for recent versions of Windows, MacOS and Linux. Installing from source is not too difficult. See [Installation instructions](https://github.com/neufeld/pandaseq/wiki/Installation) for details.
Binary packages are available for recent versions of Windows, MacOS and Linux. Source code is also available. See [Installation instructions](https://github.com/neufeld/pandaseq/wiki/Installation) for details.

Development packages for zlib and libbz2 are needed, as is a standard compiler environment. On Ubuntu, this can be installed via
Development packages for zlib and libbz2 are needed, as well as a standard compiler environment. On Ubuntu, this can be installed via:

sudo apt-get install build-essential libtool automake zlib1g-dev libbz2-dev

On MacOS, the Apple Developer tools and Fink (or MacPorts or Brew) must be installed, then
On MacOS, the Apple Developer tools and Fink (or MacPorts or Brew) must be installed, then:

sudo fink install bzip2-dev

@@ -29,13 +29,13 @@ If you receive an error that `libpandaseq.so.[number]` is not found on Linux, tr
USAGE
-----

Please consult the manual page by invoking
Please consult the manual page by invoking:

man pandaseq

or visiting [online PANDAseq manual page](http://neufeldserver.uwaterloo.ca/~apmasell/pandaseq_man1.html).

The short version is
The short version is:

pandaseq -f forward.fastq -r reverse.fastq

@@ -72,29 +72,29 @@ or using, in `configure.ac`:

A [Vala binding](http://neufeldserver.uwaterloo.ca/~apmasell/pandaseq-vapi/) is also included.

Other lanugage bindings are welcome.
Other language bindings are welcome.

FAQ
---

### Can I insist that PANDAseq only assembler perfect sequences?
### Can I insist that PANDAseq only assemble perfect sequences?
Yes, but you shouldn't want to do it. The whole point is to fix sequences which are probably good. There is no quality setting that will achieve this effect. You can use the plugin `completely_miss_the_point`, but this really does miss the point. Moreover, assuming that the sequencer is right in the overlap region and in the non-overlapping regions requires an unsound leap in statistics.

### Can I use SAM/BAM files as input without converting them to FASTQ?
Yes. [PANDAseq-sam](https://github.com/neufeld/pandaseq-sam) extends PANDAseq to do this. SAM/BAM files do not guarantee that sequences will be in the right order, so files may be slower and PANDAseq will use more memory.
Yes. [PANDAseq-sam](https://github.com/neufeld/pandaseq-sam) extends PANDAseq to do this. SAM/BAM files do not guarantee that sequences will be in the right order, so using SAM/BAM files may be slower and PANDAseq will use more memory.

### The scores of the output bases seem really low. What's wrong?
Nothing. The quality scores of the output do not have any similarity to the original quality scores and are not uniform across the sequence (i.e., the overlap is scored differently from the unpaired ends).

In the overlap region where there is a mismatch, it is the probability that one base was sequenced correctly and the other was sequenced incorrectly. If both bases have high scores (i.e., are probably correct), the chance of the resulting base is low (i.e., is probably incorrect). For more information, see the paper. Also, remember that the PHRED to probability conversion is not linear, so most scores are relatively high. It's also not uncommon to see the PHRED score `!`, which is zero, but in this context, it means less than `"` (PHRED = 1, P = .20567).
In the overlap region where there is a mismatch, it is probable that one base was sequenced correctly and the other was sequenced incorrectly. If both bases have high scores (i.e., are probably correct), the chance of the resulting base is low (i.e., is probably incorrect). For more information, see the paper. Also, remember that the PHRED to probability conversion is not linear, so most scores are relatively high. It's also not uncommon to see the PHRED score `!`, which is zero, but in this context, it means less than `"` (PHRED = 1, P = .20567).

Again, these scores are not meant to be interpreted as regular scores and should not be processed by downstream applications expecting PHRED scores from Illumina sequences.
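
As a rough illustration of the non-linear PHRED to probability conversion mentioned above, here is a minimal Python sketch. It is not PANDAseq's code: the function names are hypothetical, and it assumes the common offset-33 FASTQ encoding, which is consistent with `!` being zero and `"` being one.

```python
def phred_from_char(ch, offset=33):
    """Decode one FASTQ quality character (offset 33 is an assumption)."""
    return ord(ch) - offset


def phred_to_error(q):
    """Standard PHRED definition: probability that the base call is wrong."""
    return 10 ** (-q / 10)


def phred_to_correct(q):
    """Probability that the base call is right."""
    return 1 - phred_to_error(q)


if __name__ == "__main__":
    for ch in ['!', '"', '+', '5', 'I']:
        q = phred_from_char(ch)
        print(f"{ch!r}: PHRED = {q:2d}, P(correct) = {phred_to_correct(q):.5f}")
```

Because the scale is logarithmic, a PHRED score of 1 already corresponds to P = .20567, and scores above 10 all map to probabilities of 0.9 or higher.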

### The scores of the non-overlapping regions are not the same as the original reads. Why?
The PHRED scores from the input are not copied directly to the output when using FASTQ (`-F`) output. They go through a transformation from PHRED scores into probabilities, which is how PANDAseq uses them. When output as FASTQ, the probability is converted back to a PHRED scores. The rounding error involved can cause a score to jump by one.
The PHRED scores from the input are not copied directly to the output when using FASTQ (`-F`) output. They go through a transformation from PHRED scores into probabilities, which is what PANDAseq uses. When output as FASTQ, the probabilities are converted back to PHRED scores. The rounding error involved can cause a score to jump by one.
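
The effect of the round trip can be sketched in Python. This is only an illustration of how floating-point conversion can shift an integer score; the function names are hypothetical, and the use of truncation is an assumption chosen to make the effect visible, not a description of how PANDAseq converts scores internally.

```python
import math


def phred_to_error(q):
    """PHRED score to error probability."""
    return 10 ** (-q / 10)


def error_to_phred_truncated(p):
    """Back to an integer score by truncation (assumed, for illustration)."""
    return int(-10 * math.log10(p))


def error_to_phred_rounded(p):
    """Back to an integer score by rounding to the nearest integer."""
    return round(-10 * math.log10(p))


if __name__ == "__main__":
    for q in range(42):
        p = phred_to_error(q)
        truncated = error_to_phred_truncated(p)
        rounded = error_to_phred_rounded(p)
        if truncated != q:
            print(f"PHRED {q} came back as {truncated} (rounded: {rounded})")
```

Whether a given score shifts depends on the floating-point arithmetic of the platform, but in this sketch the shift is never more than one.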

### How many sequences should there be in the output?
You should expect that PANDAseq will output fewer sequences than the read pairs given to it. There are several `STAT` lines in the log that will help with the analysis. First, `STAT READS` is the number of read pairs in the input. Sequences first go through a number of basic checks and the user-specified checks. If provided, forward and reverse primers are aligned and clipped. The optimal overlap is selected and the sequence is constructed. The quality score is check and any user-specified checks are done. Any of these steps might fail and cause the sequence to be rejected. Each of the rejection reasons will have a `STAT` line, which is described in the _Output Statistics_ section of the manual.
You should expect that PANDAseq will output fewer sequences than the read pairs given to it. The log contains several `STAT` lines that will help with the analysis. Lines containing `STAT READS` report the number of read pairs in the input. Sequences first go through a number of basic filtering steps and then user-specified filtering steps. If provided, forward and reverse primers are aligned and clipped. The optimal overlap is selected and the sequence is constructed. The quality score is verified and any user-specified filtering is done. Any of these steps might fail and cause the sequence to be rejected. For each of the possible rejection reasons, the log file will contain a `STAT` line reporting the number of sequences filtered, as is described in the _Output Statistics_ section of the manual.

If multiple threads are used, which is the default on most platforms, each thread collects this information separately. The log will contain one group of `STAT` lines per thread.
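
Since each thread reports its own group of `STAT` lines, totals have to be summed across threads. The following Python sketch shows one way to do that. The log layout is an assumption (tab-separated fields with `STAT`, a statistic name, and a single integer as the last three fields), the sample lines and the `LOWQ` name are made up for illustration, and `sum_stats` is a hypothetical helper; consult the _Output Statistics_ section of the manual for the real format.

```python
from collections import defaultdict


def sum_stats(log_lines):
    """Sum integer STAT values per statistic name across all threads.

    Assumes tab-separated lines ending in: STAT <name> <integer>.
    Lines that do not match this assumed shape are skipped.
    """
    totals = defaultdict(int)
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if "STAT" not in fields:
            continue
        rest = fields[fields.index("STAT") + 1:]
        if len(rest) != 2:
            continue  # e.g. multi-valued statistics, not handled here
        name, value = rest
        try:
            count = int(value)
        except ValueError:
            continue
        totals[name] += count
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample lines in the assumed layout, two threads.
    example_log = [
        "thread-0\tSTAT\tREADS\t100",
        "thread-0\tSTAT\tLOWQ\t7",
        "thread-1\tSTAT\tREADS\t120",
        "thread-1\tSTAT\tLOWQ\t3",
        "thread-1\tINFO\tsomething else",
    ]
    print(sum_stats(example_log))
```

For a real run, the lines could be read from the saved log file instead, for example `sum_stats(open("pandaseq.log"))`, where the file name is again only a placeholder.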
