Commit: kmer trimming
ctb committed Oct 12, 2016
1 parent e15b389 commit 87202dc
Showing 4 changed files with 111 additions and 12 deletions.
27 changes: 27 additions & 0 deletions assemble.rst
@@ -29,6 +29,33 @@ The output assembly will be in ``combined/final.contigs.fa``.
While the assembly runs...
--------------------------

.. Graph assembly / What doesn’t get assembled? (Repeats, strain variation)
.. Sherine work on metagenomics
.. Our read length figure / soil
How assembly works - whiteboarding the De Bruijn graph approach.

Interpreting the MEGAHIT working output :)

What does, and doesn't, assemble?

How good is assembly anyway?

Discussion:

Why would we assemble, vs looking at raw reads? What are the
advantages and disadvantages?

What are the technology tradeoffs between Illumina HiSeq, Illumina
MiSeq, and PacBio?

What kind of experimental design considerations should you have if you
plan to assemble?


After the assembly is finished
------------------------------

At this point we can do a bunch of things:

* annotate the assembly;
Binary file added files/kmer-trimming.graffle
Binary file not shown.
31 changes: 19 additions & 12 deletions index.rst
@@ -19,20 +19,26 @@ Rough schedule:
* Wed, 1-4pm: Assembly, annotation, and evaluation.
- Short-read assembly with MEGAHIT
- Advantages and drawbacks to assembly (discussion)
- Evaluating assembly results (mapping rates, mapping viz, etc.)
- Annotating assembly (Prokka)
- Abundance calculations
* Thursday: topics TBD from below, or others:
  - Biology and bioinformatics:
- Abundance comparisons between samples
- Taxonomic analysis of reads and assemblies
- Genome binning
- CIRCOS plots
- ShotMap for annotating shotgun reads
- Moar computing:
- Jupyter Notebook or RMarkdown
- git for version control
- Docker execution environments
- Workflow strategies (scripting, make, doit, etc.)
- More Amazon Web Services: EC2, S3, ...?
- Miscellany:
- Visualizing assembly graphs
- MinHash sketches for comparing genomes and metagenomes
- Tricks with sequences using Python libraries (screed & khmer)

Tutorials:

@@ -42,6 +48,7 @@ Tutorials:
welcome
aws/index
quality
kmer_trimming
assemble
prokka_tutorial
salmon_tutorial
65 changes: 65 additions & 0 deletions kmer_trimming.rst
@@ -0,0 +1,65 @@
=============================
K-mer Spectral Error Trimming
=============================

(Optional)

If you plot a k-mer abundance histogram @@ of the samples, you'll
notice something: there's an awful lot of unique (abundance=1) k-mers.
These are erroneous k-mers caused by sequencing errors.
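
To make this concrete, here's a minimal sketch (not khmer's implementation - the reads and k size are invented toy data) of how a k-mer abundance histogram is computed: count every k-mer, then count how many distinct k-mers occur at each abundance:

```python
from collections import Counter

def kmer_abundance_histogram(reads, k):
    # Count every k-mer across all reads.
    kmer_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_counts[read[i : i + k]] += 1
    # Histogram: abundance -> number of distinct k-mers at that abundance.
    return Counter(kmer_counts.values())

# Ten error-free copies of a read, plus one copy with a single-base error.
reads = ["ACGTACGTAC"] * 10 + ["ACGTACGAAC"]
hist = kmer_abundance_histogram(reads, k=5)
# The single erroneous read contributes a spike of abundance-1 k-mers,
# while true k-mers sit at abundances near the read coverage.
```

Even one wrong base creates up to k novel k-mers, which is why the unique-k-mer peak is so large in real data.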

@@plot

Many of these errors remain even after you do the Trimmomatic run. This is for
two reasons:

* first, Trimmomatic trims based solely on the quality score, which is
  a statistical statement about the correctness of a base - a Q score
  of 30 means that, of 1000 bases with that Q score, 1 of those bases
  will be wrong, on average. So, a base can have a high Q score and
  still be wrong! (and **many** bases will have a low Q score and still
  be correct)

* second, we trimmed **very** lightly - only bases that had a very low
quality were removed. This was intentional because with assembly,
you want to retain as much coverage as possible, and the assembler
will generally figure out what the "correct" base is from the coverage.
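
The Q-score arithmetic in the first point is just the standard Phred formula, p\ :sub:`error` = 10\ :sup:`-Q/10`; a tiny sketch:

```python
def phred_error_prob(q):
    # Phred quality score Q -> probability that the base call is wrong.
    return 10 ** (-q / 10)

p30 = phred_error_prob(30)  # 1 in 1000 bases expected wrong
p10 = phred_error_prob(10)  # 1 in 10 - "low quality", yet right 90% of the time
```

So out of millions of Q30 bases in a sequencing run, thousands of errors survive quality-based trimming.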

An alternative to trimming based on the quality scores is to trim based on
k-mer abundance - this is known as k-mer spectral error trimming.
(@@qingpeng)

The logic is this: if you see low abundance k-mers in a high coverage data
set, those k-mers are almost certainly wrong.

In metagenomic data sets, we have the additional problem that coverage
may be very low for some genomes and very high for others. So we don't
necessarily want to get rid of all low-abundance k-mers, because they
may represent genuinely low-abundance (but useful) data.

As part of the khmer@@ project in my lab, we have developed an
approach that sorts reads into high-abundance and low-abundance reads,
and error-trims only the high-abundance reads.
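
A toy illustration of the sorting idea (not khmer's actual streaming algorithm - the reads and cutoff here are invented, and the real tool uses probabilistic counting): classify each read by its median k-mer abundance, and only treat high-abundance reads as candidates for error trimming:

```python
from collections import Counter
from statistics import median

def split_by_median_abundance(reads, k, cutoff):
    # Global k-mer counts over the whole data set.
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i : i + k]] += 1
    high, low = [], []
    for read in reads:
        abunds = [counts[read[i : i + k]] for i in range(len(read) - k + 1)]
        # High median abundance -> high-coverage read, safe to error-trim;
        # low median abundance -> possibly real but rare, leave untouched.
        (high if median(abunds) >= cutoff else low).append(read)
    return high, low

reads = ["ACGTACGTAC"] * 10 + ["GGGCCCGGGC"]  # one genuinely rare read
high, low = split_by_median_abundance(reads, k=5, cutoff=3)
```

Using the median (rather than the minimum) keeps a high-coverage read in the "high" pile even when an error gives it a few abundance-1 k-mers.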

@@image

This does mean that many errors may get left in the data set, because we
have no way of figuring out if they are errors or simply low coverage,
but that's OK (and you can always trim them off if you really care).

Error profile@@

To run such error trimming, use the command ``trim-low-abund.py``::

trim-low-abund.py -V -M 8e9 $datafile

Why (or why not) do k-mer trimming?
-----------------------------------

If you can assemble your data set without k-mer trimming, there's no
reason to do it. The reason we're error trimming here is to speed up
the assembler (by removing data) and to decrease the memory requirements
of the assembler (by removing a number of k-mers).

To see how many k-mers we removed, @@count k-mers.
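
One simple way to make that comparison (a sketch with invented toy reads; exact counting is fine at this scale but memory-hungry on real data, where an estimator is more practical) is to count distinct k-mers before and after trimming:

```python
def count_distinct_kmers(reads, k):
    # Exact distinct-k-mer count via a set.
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i : i + k])
    return len(kmers)

before = ["ACGTACGTAC"] * 10 + ["ACGTACGAAC"]  # one read carries an error
after = ["ACGTACGTAC"] * 10                    # erroneous read removed
removed = count_distinct_kmers(before, 5) - count_distinct_kmers(after, 5)
```

The drop in distinct k-mers is almost entirely the erroneous ones, since true k-mers are shared across many reads and survive trimming.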
