
Commit

Merge branch 'ha_work' of github.com:ngs-docs/2016-metagenomics-sio into work
ctb committed Oct 12, 2016
2 parents 611d85d + 6d0b552 commit d33e0b1
Showing 3 changed files with 48 additions and 9 deletions.
4 changes: 2 additions & 2 deletions circos_tutorial.rst
@@ -25,7 +25,7 @@ Circos runs within Perl and as such does not need to be compiled to run. So, we
::
export PATH=~/circos/circos-0.69-3/bin:$PATH

Circos does, however, require quite a few additional Perl modules to operate correctly. To see what modules are missing and need to be downloaded, type the following:
::
circos -modules > modules

@@ -64,7 +64,7 @@ And with that, circos should be up and ready to go. Run the example by navigatin

This will take a little bit to run but should generate a file called ``circos.png``. Open it and you can get an idea of the huge variety of things that are possible with circos and a lot of patience. We will not be attempting anything that complex today, however.

Comparing our assembly
=======================
Create a reference database for blastn:
::
2 changes: 1 addition & 1 deletion prokka_tutorial.rst
@@ -48,7 +48,7 @@ Now it is time to run Prokka! There are tons of different ways to specialize the

This will generate a new folder called ``prokka_annotation`` in which will be a series of files, which are detailed `here <https://github.com/tseemann/prokka/blob/master/README.md#output-files>`__.

In particular, we will be using the ``*.ffn`` file to assess the relative read coverage within our metagenomes across the predicted genomic regions.

References
===========
51 changes: 45 additions & 6 deletions salmon_tutorial.rst
@@ -2,17 +2,19 @@
Gene Abundance Estimation with Salmon
======================================

Salmon is one of a breed of new, very fast RNAseq counting packages. Like Kallisto and Sailfish, Salmon counts fragments without doing up-front read mapping. Salmon can be used with edgeR and others to do differential expression analysis (if you are quantifying RNAseq data).

Today we will use it to get a handle on the relative distribution of genomic reads across the predicted protein regions.

The goals of this tutorial are to:

* Install salmon
* Use salmon to estimate gene coverage in our metagenome dataset

Installing Salmon
==================================================

Download and extract the latest version of Salmon and add it to your PATH:
::
wget https://github.com/COMBINE-lab/salmon/releases/download/v0.7.2/Salmon-0.7.2_linux_x86_64.tar.gz
tar -xvzf Salmon-0.7.2_linux_x86_64.tar.gz
@@ -29,13 +29,50 @@ Make a new directory for the quantification of data with Salmon:

Grab the nucleotide (``*ffn``) predicted protein regions from Prokka and link them here. Also grab the trimmed sequence data (``*fq``)
::
ln -fs annotation/prokka_annotation/*ffn .
ln -fs data/*.abundtrim.subset.pe.fq.gz .

Create the salmon index:
::
salmon index -t metag_10112016.ffn -i transcript_index --type quasi -k 31

Salmon requires that paired reads be separated into two files. We can split the reads using the ``split-reads.py`` script:
::
for file in *.abundtrim.subset.pe.fq.gz
do
    split-reads.py $file
done
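
To illustrate what the splitting step does (this is a rough sketch of the idea only, not the actual ``split-reads.py`` script; the ``demo`` filenames are made up): an interleaved FASTQ holds four-line records alternating read 1 / read 2, and splitting sends odd-numbered records to the ``.1.fq`` file and even-numbered records to the ``.2.fq`` file.

```shell
# Illustration only, not the real split-reads.py.
# Build a tiny interleaved FASTQ: one read-1 record, then its read-2 mate.
printf '@r1/1\nACGT\n+\nIIII\n@r1/2\nTTTT\n+\nIIII\n' > demo.pe.fq

# FASTQ records are 4 lines each; alternate records between the two outputs.
awk 'NR % 4 == 1 { rec++ }
     rec % 2 == 1 { print > "demo.1.fq"; next }
                  { print > "demo.2.fq" }' demo.pe.fq
```
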

Now, we can run our reads against this reference:
::
for file in *1.fq
do
    BASE=${file/.1.fq/}
    salmon quant -i transcript_index --libType IU \
          -1 $BASE.1.fq -2 $BASE.2.fq -o $BASE.quant
done

(Note that --libType must come before the read files!)

This will create a bunch of directories named after the fastq files that we just pushed through. Take a look at what files there are within one of these directories:
::
find SRR1976948.quant -type f

Working with count data
=======================

Now, the ``quant.sf`` files actually contain the relevant abundance information. Take a look:
::
head -10 SRR1976948.quant/quant.sf

The first column contains the transcript names, and the fourth column is what we will want down the road: the normalized counts (TPM). However, they're not in a convenient location or format for use; let's fix that.
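
As a quick sketch of where those columns sit (assuming the salmon 0.7.x ``quant.sf`` layout of ``Name``, ``Length``, ``EffectiveLength``, ``TPM``, ``NumReads``, tab-separated with one header line; the ``quant_demo.sf`` file and its values below are made up for illustration):

```shell
# Demo file mimicking the quant.sf layout (values are made up):
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' >  quant_demo.sf
printf 'gene_1\t900\t750.0\t12.5\t100\n'                >> quant_demo.sf
printf 'gene_2\t300\t150.0\t3.1\t10\n'                  >> quant_demo.sf

# Column 1 is the transcript name, column 4 the TPM value:
cut -f 1,4 quant_demo.sf
```
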

Download the gather-counts.py script:
::
curl -L -O https://github.com/ngs-docs/2016-aug-nonmodel-rnaseq/raw/master/files/gather-counts.py
and run it:
::
python ./gather-counts.py

This will give you a bunch of ``.counts`` files, which are processed from the ``quant.sf`` files and named for the directory from which they come.
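
The real ``gather-counts.py`` handles this for you; as a minimal sketch of the same idea (one ``.counts`` file per ``*.quant`` directory, holding transcript name and TPM — the exact columns and naming of the real script may differ, and the ``DEMO.quant`` fixture here is made up):

```shell
# Sketch only; the real gather-counts.py output format may differ.
# Build a demo quant directory so the loop has something to process:
mkdir -p DEMO.quant
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' >  DEMO.quant/quant.sf
printf 'gene_1\t900\t750.0\t12.5\t100\n'                >> DEMO.quant/quant.sf

# One .counts file per quant directory: transcript name + TPM,
# named for the directory it came from.
for dir in *.quant
do
    awk 'NR > 1 { print $1 "\t" $4 }' "$dir/quant.sf" > "$dir.counts"
done
```
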

References
===========
