Commit: kmer trimming
ctb committed Oct 12, 2016
1 parent e15b389 commit 87202dc
Showing 4 changed files with 111 additions and 12 deletions.
27 changes: 27 additions & 0 deletions assemble.rst
@@ -29,6 +29,33 @@ The output assembly will be in ``combined/final.contigs.fa``.
While the assembly runs...
--------------------------

.. Graph assembly / What doesn’t get assembled? (Repeats, strain variation)
.. Sherine work on metagenomics
.. Our read length figure / soil
How assembly works - whiteboarding the De Bruijn graph approach.

Interpreting the MEGAHIT working output :)

What does, and doesn't, assemble?

How good is assembly anyway?

Discussion:

Why would we assemble, vs looking at raw reads? What are the
advantages and disadvantages?

What are the technology tradeoffs between Illumina HiSeq, Illumina
MiSeq, and PacBio?

What kind of experimental design considerations should you have if you
plan to assemble?


After the assembly is finished
------------------------------

At this point we can do a bunch of things:

* annotate the assembly;
Binary file added files/kmer-trimming.graffle
Binary file not shown.
31 changes: 19 additions & 12 deletions index.rst
@@ -19,20 +19,26 @@ Rough schedule:
* Wed, 1-4pm: Assembly, annotation, and evaluation.
- Short-read assembly with MEGAHIT
- Advantages and drawbacks to assembly (discussion)
- Evaluating assembly results (mapping rates, mapping viz, etc.)
- Annotating assembly (Prokka)
- Abundance calculations
* Thursday: topics TBD from below, or others:
  - Biology and bioinformatics:
- Abundance comparisons between samples
- Taxonomic analysis of reads and assemblies
- Genome binning
- CIRCOS plots
- ShotMap for annotating shotgun reads
- Moar computing:
- Jupyter Notebook or RMarkdown
- git for version control
- Docker execution environments
- Workflow strategies (scripting, make, doit, etc.)
- More Amazon Web Services: EC2, S3, ...?
- Miscellany:
- Visualizing assembly graphs
- MinHash sketches for comparing genomes and metagenomes
- Tricks with sequences using Python libraries (screed & khmer)

Tutorials:

@@ -42,6 +48,7 @@ Tutorials:
welcome
aws/index
quality
kmer_trimming
assemble
prokka_tutorial
salmon_tutorial
65 changes: 65 additions & 0 deletions kmer_trimming.rst
@@ -0,0 +1,65 @@
=============================
K-mer Spectral Error Trimming
=============================

(Optional)

If you plot a k-mer abundance histogram @@ of the samples, you'll
notice something: there's an awful lot of unique (abundance=1) k-mers.
These are erroneous k-mers caused by sequencing errors.
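
To make this concrete, here's a minimal sketch (not khmer's implementation - the reads and k size are invented toy data) of how a k-mer abundance histogram is computed: count every k-mer, then count how many distinct k-mers occur at each abundance:

```python
from collections import Counter

def kmer_abundance_histogram(reads, k):
    # Count every k-mer across all reads.
    kmer_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_counts[read[i : i + k]] += 1
    # Histogram: abundance -> number of distinct k-mers at that abundance.
    return Counter(kmer_counts.values())

# Ten error-free copies of a read, plus one copy with a single-base error.
reads = ["ACGTACGTAC"] * 10 + ["ACGTACGAAC"]
hist = kmer_abundance_histogram(reads, k=5)
# The single erroneous read contributes a spike of abundance-1 k-mers,
# while true k-mers sit at abundances near the read coverage.
```

Even one wrong base creates up to k novel k-mers, which is why the unique-k-mer peak is so large in real data.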

@@plot

Many of these errors remain even after you do the Trimmomatic run. This is for
two reasons:

* first, Trimmomatic trims based solely on the quality score, which is
  a statistical statement about the correctness of a base - a Q score
  of 30 means that, of 1000 bases with that Q score, 1 of those bases
  will be wrong, on average. So, a base can have a high Q score and
  still be wrong! (and **many** bases will have a low Q score and still
  be correct)

* second, we trimmed **very** lightly - only bases that had a very low
quality were removed. This was intentional because with assembly,
you want to retain as much coverage as possible, and the assembler
will generally figure out what the "correct" base is from the coverage.
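
The Q-score arithmetic in the first point is just the standard Phred formula, p\ :sub:`error` = 10\ :sup:`-Q/10`; a tiny sketch:

```python
def phred_error_prob(q):
    # Phred quality score Q -> probability that the base call is wrong.
    return 10 ** (-q / 10)

p30 = phred_error_prob(30)  # 1 in 1000 bases expected wrong
p10 = phred_error_prob(10)  # 1 in 10 - "low quality", yet right 90% of the time
```

So out of millions of Q30 bases in a sequencing run, thousands of errors survive quality-based trimming.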

An alternative to trimming based on the quality scores is to trim based on
k-mer abundance - this is known as k-mer spectral error trimming.
(@@qingpeng)

The logic is this: if you see low abundance k-mers in a high coverage data
set, those k-mers are almost certainly wrong.

In metagenomic data sets, we have the additional problem that coverage
may be very low for some genomes and very high for others. So we don't
necessarily want to get rid of all low-abundance k-mers, because they
may represent genuinely low-abundance (but useful) data.

As part of the khmer@@ project in my lab, we have developed an
approach that sorts reads into high-abundance and low-abundance reads,
and error-trims only the high-abundance reads.
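
A toy illustration of the sorting idea (not khmer's actual streaming algorithm - the reads and cutoff here are invented, and the real tool uses probabilistic counting): classify each read by its median k-mer abundance, and only treat high-abundance reads as candidates for error trimming:

```python
from collections import Counter
from statistics import median

def split_by_median_abundance(reads, k, cutoff):
    # Global k-mer counts over the whole data set.
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i : i + k]] += 1
    high, low = [], []
    for read in reads:
        abunds = [counts[read[i : i + k]] for i in range(len(read) - k + 1)]
        # High median abundance -> high-coverage read, safe to error-trim;
        # low median abundance -> possibly real but rare, leave untouched.
        (high if median(abunds) >= cutoff else low).append(read)
    return high, low

reads = ["ACGTACGTAC"] * 10 + ["GGGCCCGGGC"]  # one genuinely rare read
high, low = split_by_median_abundance(reads, k=5, cutoff=3)
```

Using the median (rather than the minimum) keeps a high-coverage read in the "high" pile even when an error gives it a few abundance-1 k-mers.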

@@image

This does mean that many errors may get left in the data set, because we
have no way of figuring out if they are errors or simply low coverage,
but that's OK (and you can always trim them off if you really care).

Error profile@@

To run such error trimming, use the command ``trim-low-abund.py``::

trim-low-abund.py -V -M 8e9 $datafile

Why (or why not) do k-mer trimming?
-----------------------------------

If you can assemble your data set without k-mer trimming, there's no
reason to do it. The reason we're error trimming here is to speed up
the assembler (by removing data) and to decrease the memory requirements
of the assembler (by removing a number of k-mers).

To see how many k-mers we removed, @@count k-mers.
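
One simple way to make that comparison (a sketch with invented toy reads; exact counting is fine at this scale but memory-hungry on real data, where an estimator is more practical) is to count distinct k-mers before and after trimming:

```python
def count_distinct_kmers(reads, k):
    # Exact distinct-k-mer count via a set.
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i : i + k])
    return len(kmers)

before = ["ACGTACGTAC"] * 10 + ["ACGTACGAAC"]  # one read carries an error
after = ["ACGTACGTAC"] * 10                    # erroneous read removed
removed = count_distinct_kmers(before, 5) - count_distinct_kmers(after, 5)
```

The drop in distinct k-mers is almost entirely the erroneous ones, since true k-mers are shared across many reads and survive trimming.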
