# General introduction
Today, you will assemble and annotate a metagenome of a plant: *Azolla filiculoides*. *Azolla* is known for its endophytes: microorganisms that live inside the plant. These microorganisms live specifically inside a leaf cavity in every upper leaf lobe of the plant. One of these endophytes has been known for a long time: *Nostoc azollae*. It is best known for fixing N<sub>2</sub> from the air, in service of the plant. The image included below illustrates the physiology of these leaf cavities.

While the presence of *N. azollae* is well studied, the presence of other microbes in the *Azolla* leaf pocket is not. Today, we will create and analyse the metagenome of *A. filiculoides* plants, in search of these other "neglected" microbes.

![image](http://theazollafoundation.org/wp-content/uploads/2013/05/Azolla-leaf-cavities-with-Anabaena.png)

Hence, the open biological question we are answering today is "which species live inside these leaf cavities and what are the metabolic capacities encoded on their genomes?" Based on these genomes, you may be able to infer their function and think of hypotheses to test.

*Tip*: If you reach a step with a long calculation time, first read ahead to see what steps come next. After that, you can google some information about the plant you are researching. Note that this [paper](http://onlinelibrary.wiley.com/doi/10.1111/nph.14843/full) specifically describes the data that we will be using today. For a more general introduction to *Azolla*, have a look at [this short movie](https://youtu.be/O34gTsxyDq8) or [this short movie](https://youtu.be/OI4VV4M2-f4) (Dutch) recorded for Dutch local television.


# Workflow

## Acquire Samples
A typical metagenomics workflow starts in the environment and/or the wetlab with a DNA extraction from some biological sample. This is also true for this workflow. We have extracted DNA from *Azolla* collected in the environment in two ways: firstly from the whole plant (P), and secondly from the leaf cavities (L). Both were collected in three biological replicates. Hence the samples we will work with are coded L1, L2, L3 and P1, P2, P3. The reasoning behind these two sampling types is that true endophytes are expected to be more abundant in leaf samples than in plant samples. So we may later use these different sampling types to infer whether a microbial genome that we assembled originated from the leaf cavity, or from somewhere else in or on the plant.
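
The leaf-versus-plant reasoning can be written down as a small calculation. The sketch below is only an illustration: the function name `leaf_enrichment`, the pseudocount and all depth values are made up for this example, and real depths only become available after the backmapping step later in the workflow.

```python
import math
from statistics import mean

def leaf_enrichment(depths, pseudocount=1.0):
    """Return log2(mean leaf depth / mean plant depth) for one scaffold.

    `depths` maps sample codes (L1..L3, P1..P3) to a mean read depth.
    The pseudocount (an arbitrary choice here) avoids division by zero.
    """
    leaf = mean(depths[s] for s in ("L1", "L2", "L3"))
    plant = mean(depths[s] for s in ("P1", "P2", "P3"))
    return math.log2((leaf + pseudocount) / (plant + pseudocount))

# Made-up depths for one hypothetical scaffold.
example = {"L1": 40.0, "L2": 35.0, "L3": 45.0, "P1": 4.0, "P2": 5.0, "P3": 3.0}
score = leaf_enrichment(example)
print(f"log2 leaf/plant enrichment: {score:.2f}")
```

A positive value suggests the scaffold is more abundant in the leaf cavity samples, a value near zero suggests no difference, and a negative value points to an origin elsewhere in or on the plant.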

Extracted DNA is processed into a sequencing library: it is prepared in such a way that an [Illumina sequencing](https://en.wikipedia.org/wiki/DNA_sequencing#Illumina_(Solexa)_sequencing) machine can process it. The sequencing machine then provides us with [FastQ](https://en.wikipedia.org/wiki/FASTQ_format) files. You may or may not have handled these before. We will have a look at FastQ files later in this practical.

## First Quality Control and Processing
FastQ files contain the output of the Illumina sequencing machine: tens of millions of short DNA sequences, where short is anything between 50 and 250 bp. These original short DNA sequences are further referred to as DNA reads, or just reads. FastQ files contain the unprocessed output of the machine. Every read consists of four lines:
1. a header preceded by the '@' character.
2. the DNA sequence
3. a comment line preceded by a '+'
4. a quality string.

The quality string has exactly the same length as the DNA read. Every character in the quality string corresponds to some number between 0 and 40. Loosely put, the sequencing machine assigns this quality score based on "how sure it is" that this specific base is correct.
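
To make the link between characters and numbers concrete, here is a minimal sketch. It assumes the common Phred+33 encoding (ASCII code minus 33), which matches the characters in the example reads in this section. The function names are made up for this illustration.

```python
def phred_scores(quality, offset=33):
    """Convert a FastQ quality string to a list of Phred scores.

    Assumes Phred+33 encoding: score = ASCII code of the character - 33.
    """
    return [ord(character) - offset for character in quality]

def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)

# The first eight quality characters of the first example read.
scores = phred_scores("A#AAAEEE")
print(scores)  # [32, 2, 32, 32, 32, 36, 36, 36]
print(error_probability(20))  # a score of 20 means roughly a 1 in 100 chance of error
```

Note how the `#` at the second position translates to a score of 2: that is the `N` base in the read, where the machine could not call a base.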

These are two reads encoded in a FastQ file.

> @NS500813:28:H3M3VAFXX:1:11101:19270:1015 1:N:0:CGATGT
GNGGTGAAGAAATCAGCCATTCTAAACCAATTGCTCTCCAAGGATTATCAGGTGCTTTTTCACCTTGCGTCCAAGAAACCAAAATATTGAATATGAAGGGTAAGGTGGACATTCCTAATAAGAATGCACCAAGACTAGCGAGAACATTCC
+
A#AAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAEE<EEEEEEEEEEEEEEEEEEEEEAEEEAEEEEEEEEEEEEEEEEEEEEAAAEEAEEEEEAAAAA6AEAEAAEEEEEEEEEEEEA6
@NS500813:28:H3M3VAFXX:1:11101:24948:1015 1:N:0:CGATGT
ANATTAATTATACAATCACTAATTTAGCAGAGCATTTAGAAAGTATAAGCCATGATGCAATTAACTATTATTTAAAAACCGAAAAGTTAACATCTCGTTTACTATGGGATAAGGTGAAAGAGGTAGTAGAACCTGATGGTAATGGGTACAT
+
A#AAAAEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEAEEEEEEEAEEEEEEEEEEE<EEE<EEEEE66EEA<EEEEAEEEEEEEEEEEAE<E/AE/EEEEEAEAAEAAAEEEEAEEEE6<EEAEA<//AAEEAEE<AA

FastQ files often contain some reads of poor quality. We definitely want to filter those out before starting our analyses. There are many ways to do this. For today's practical, this was already done for you with a tool called [trimmomatic](https://academic.oup.com/bioinformatics/article-abstract/30/15/2114/2390096).
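
The idea of quality filtering can be sketched in a few lines. This is not how trimmomatic works (it trims and clips reads in more sophisticated ways). It is a simplified stand-in that reads four-line FastQ records and keeps only reads whose mean quality is above a threshold. The threshold of 20, the Phred+33 assumption and the toy reads are choices made for this example.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-lines-per-read FastQ handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of file (or an incomplete last record)
        header, sequence, plus, quality = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FastQ record near: {header!r}")
        yield header[1:], sequence, quality

def mean_quality(quality, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33)."""
    return sum(ord(character) - offset for character in quality) / len(quality)

def keep_good_reads(records, min_mean_quality=20):
    """Keep only the reads whose mean quality reaches the threshold."""
    return [r for r in records if mean_quality(r[2]) >= min_mean_quality]

# Two toy reads: one of high quality ('E' = 36), one of low quality ('#' = 2).
toy = io.StringIO("@read1\nACGT\n+\nEEEE\n@read2\nACGT\n+\n####\n")
good = keep_good_reads(read_fastq(toy))
print([name for name, _, _ in good])  # ['read1']
```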

Before and after trimming, the general quality control (QC) of a FastQ file is assessed with a tool called FastQC. The FastQC reports are rendered as html pages so we can view them in our web browsers. You can find the FastQC reports for this practical here:

### Filtering
Finally, we chose here to filter plant DNA out of the sequencing data, since we are only interested in microbial DNA. Plant DNA was removed by mapping (aligning) the reads to a reference plant genome. Only the reads that did not map to the plant genome were kept for further analysis.
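
As an illustration of "keeping what did not map": aligners commonly report their results in the SAM format, where the second column is a numeric flag and the bit with value 4 means "this read did not align". The sketch below picks out such reads from a few made-up SAM lines. The exact filtering commands used to prepare today's data are not shown here, so treat this purely as a picture of the principle.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped

def unmapped_read_names(sam_lines):
    """Return the names of reads that did not align to the reference."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header lines start with '@'
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED:
            names.append(name)
    return names

# Made-up SAM content: readA maps to a plant scaffold, readB does not map.
toy_sam = [
    "@SQ\tSN:plant_scaffold_1\tLN:1000",
    "readA\t0\tplant_scaffold_1\t101\t60\t4M\t*\t0\t0\tACGT\tEEEE",
    "readB\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tEEEE",
]
print(unmapped_read_names(toy_sam))  # ['readB']
```

In this toy case `readB` would be kept as putative microbial DNA, while `readA` would be discarded as plant DNA.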

### laura add fastQC links


## Assembly
Assembly is the process of combining these many short DNA reads into longer contiguous strings of DNA: contigs. For the assembly process we have used both sample types and all biological replicates as if they were one big sample. This yields the best assembly results. A metagenome assembly differs from a regular genome assembly in that it has DNA of multiple species in one sequencing file. Metagenome assembly was achieved with [SPAdes](http://online.liebertpub.com/doi/abs/10.1089/cmb.2012.0021).

Some more detail on assembly:
> The process of assembly takes the millions of short reads in a fastq file and processes these into thousands of longer nucleotide sequences. Typically the output is a fasta file with contiguous sequences called contigs (contigs.fasta), or a fasta file called scaffolds.fasta with scaffolds: multiple contigs joined based on their relative position and orientation to each other. Ideally, each scaffold would represent the genome of a single species in the metagenome. This is, however, extremely unlikely, at least in my personal experience. In reality the assembly process will produce thousands of longer scaffolds, where a single species' genome is still represented by multiple sequences in the assembly fasta file.

Also, you probably know what a fasta file is. It is a text file with DNA or protein sequences. A fasta file can contain multiple sequences. Fasta files are similar to fastq files, but a bit simpler. There are just two types of lines. The header line starts with a `>` and the lines after it contain the DNA sequence. The DNA sequence can continue until the next `>` or until the end of the document. We may have a quick look at a fasta file:
> \>Sequenceheader <br>
>  ACTCATGAGACTAGACA
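
The two-line-type structure described above makes fasta files easy to read with a short script. The sketch below is a minimal reader written for this illustration. The scaffold names and sequences are made up.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a fasta file handle.

    A sequence may be spread over several lines; it runs until the
    next '>' header or the end of the file.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# A toy fasta file with two sequences, the first spread over two lines.
toy = io.StringIO(">scaffold_1\nACTCATGA\nGACTAGACA\n>scaffold_2\nGGCC\n")
lengths = {name: len(sequence) for name, sequence in read_fasta(toy)}
print(lengths)  # {'scaffold_1': 17, 'scaffold_2': 4}
```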


# Here you will take over the workflow
We have provided you with a subset of the assembly and a subset of the reads. We have made subsets rather than the whole sets to keep this one-day practical feasible. The complete workflow you will follow is shown in the sketch below. Whenever you are lost in the command line, error messages and output tables, have another look at this flowchart. In these bioinformatics workflows it is essential to keep a feeling for where you are, where you are going and where you come from.
![image](./metagenomics/workflowsketch.png)

The programmes mentioned in the flowchart are preinstalled on this virtual machine you are currently working on:
bwa: http://bio-bwa.sourceforge.net/ <br>
samtools: http://samtools.sourceforge.net/ <br>
metabat (binning): https://bitbucket.org/berkeleylab/metabat/overview <br>
checkm: https://github.com/Ecogenomics/CheckM/wiki/Installation#how-to-install-checkm <br>
prokka: http://www.vicbioinformatics.com/software.prokka.shtml <br>


## Jupyter and Bash Basics
Before we get our hands dirty with the 'real data', we will first do some exercises to learn how to work in a notebook like this. Then we will continue the metagenomics workflow. To keep all information central, I will continue here with the explanation of the workflow. The exercises will be available in another notebook after you have read the full introduction.

## Quick assessment of the raw data
After we have learned about bash and jupyter notebook, we will practice our bash skills by having a quick look at the data files and assessing the quality.

## Backmapping
After assembly of the scaffolds, we want to know how abundant each scaffold is in the different metagenomic samples. To do this, we align the Illumina reads (FastQ files) to the scaffolds (fasta) of the assembly in a step called **backmapping**, with a tool called BWA (Burrows-Wheeler Aligner). BWA provides us with a table of which reads mapped to the metagenome assembly and their specific coordinates. This table is stored in a `.bam` file. Finally, the tables in the `.bam` files must be sorted by coordinate. This we achieve with a tool called `samtools`.
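
To give a feeling for what the backmapping commands may look like, the sketch below only builds and prints them; it does not run anything. It assumes a samtools 1.x style command line, and all file names (`scaffolds.fasta`, `L1.fastq`, and so on) are placeholders invented for this example. The exercise notebook has the commands that actually apply to today's data.

```python
import shlex

def index_command(assembly):
    """Command that prepares the assembly for alignment (run once)."""
    return f"bwa index {shlex.quote(assembly)}"

def backmap_command(assembly, read_files, sample, threads=2):
    """Command that aligns one sample's reads and writes a sorted BAM file."""
    inputs = " ".join(shlex.quote(path) for path in [assembly, *read_files])
    bam = shlex.quote(f"{sample}.sorted.bam")
    align = f"bwa mem -t {threads} {inputs}"      # alignments go to standard output
    sort = f"samtools sort -@ {threads} -o {bam} -"  # '-' reads them from standard input
    return f"{align} | {sort}"

print(index_command("scaffolds.fasta"))
for sample in ("L1", "L2", "L3", "P1", "P2", "P3"):
    # Hypothetical read file name, one file per sample.
    print(backmap_command("scaffolds.fasta", [f"{sample}.fastq"], sample))
```

Piping `bwa mem` straight into `samtools sort` avoids writing a large unsorted intermediate file, which helps on machines with limited disk space.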

## Binning
A single metagenome assembly contains scaffolds representing DNA of many species in the original sample, but without any information on which scaffold belongs to which species. To tackle this issue, we will categorise each individual scaffold in a 'bin', which then ideally comprises one species in the metagenome. This procedure is called binning. The binning procedure needs the bam files we will make in the 'backmapping' step. The tool we use for binning is called `metabat`. Metabat uses the sorted bam files to subdivide the scaffolds into bins. This will produce between 5 and 20 bins. Hence we have moved from millions of very short sequences with very little biological information, to thousands of longer scaffolds with more biological information, to around a dozen bins of scaffolds with a lot of biological information.
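
The intuition behind using the bam files for binning is that scaffolds from the same genome should have similar read depths across all samples. The toy sketch below groups scaffolds by comparing their depth profiles. It is emphatically not metabat's algorithm, which is more sophisticated and also uses sequence composition. The distance measure, the threshold and all numbers are made up for this illustration.

```python
import math

def profile_distance(a, b):
    """Euclidean distance between two log-scaled depth profiles."""
    return math.sqrt(sum((math.log1p(x) - math.log1p(y)) ** 2 for x, y in zip(a, b)))

def toy_binning(depths, max_distance=0.5):
    """Greedily group scaffolds whose depth profiles resemble each other.

    Each scaffold joins the first bin whose first member is close enough;
    otherwise it starts a new bin.
    """
    bins = []
    for name, profile in depths.items():
        for members in bins:
            if profile_distance(profile, depths[members[0]]) <= max_distance:
                members.append(name)
                break
        else:
            bins.append([name])
    return bins

# Made-up mean depths in samples L1, L2, L3, P1, P2, P3.
toy_depths = {
    "scaffold_1": [40, 38, 42, 4, 5, 3],
    "scaffold_2": [41, 36, 44, 5, 4, 3],
    "scaffold_3": [2, 3, 2, 20, 22, 19],
}
print(toy_binning(toy_depths))
```

Here the first two scaffolds end up in one bin because they are both abundant in the leaf samples, while the third scaffold has a very different profile and gets its own bin.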

## QC
One bin ideally represents one species. To check whether this is the case, we can use the tool `checkM`. CheckM assesses the completeness and contamination of bins based on single-copy marker genes (collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage).
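
The logic of marker-gene based QC can be sketched as follows: if a gene is expected exactly once per genome, then a missing marker hints at an incomplete bin and a duplicated marker hints at contamination. The calculation below is a simplified version of that idea, not CheckM's actual method (which works with collocated marker sets). The marker names are hypothetical.

```python
from collections import Counter

def completeness_and_contamination(marker_hits, expected_markers):
    """Estimate completeness and contamination (both in percent).

    marker_hits: marker genes found in the bin, one entry per copy found.
    expected_markers: set of markers expected exactly once per genome.
    """
    counts = Counter(hit for hit in marker_hits if hit in expected_markers)
    found = sum(1 for marker in expected_markers if counts[marker] >= 1)
    extra_copies = sum(counts[marker] - 1 for marker in expected_markers if counts[marker] > 1)
    total = len(expected_markers)
    return 100 * found / total, 100 * extra_copies / total

# Hypothetical example: marker_D is missing, marker_B was found twice.
expected = {"marker_A", "marker_B", "marker_C", "marker_D"}
hits = ["marker_A", "marker_B", "marker_B", "marker_C"]
print(completeness_and_contamination(hits, expected))  # (75.0, 25.0)
```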

## Annotation
For the sake of simplicity and time, we will isolate one bin per group and annotate this bin using prokka. Annotation is the process of adding functional information to parts of a DNA sequence, i.e. making a table of genes and the locations of these genes. These are often stored in a GFF (General Feature Format) file. Once you have this GFF file and a fasta file for your specific bin/microbial species, we will visualise it using KEGG. For more details, please google KEGG. Here we will use it to load our GFF file and then explore the metabolic pathways present.
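
A GFF file is a plain text table with nine tab-separated columns: sequence name, source, feature type, start, end, score, strand, phase and attributes. The sketch below reads such lines into dictionaries. The example line is made up for illustration, and the reader stops at a `##FASTA` line in case the sequences are appended to the end of the file.

```python
def read_gff(lines):
    """Parse GFF lines into a list of feature dictionaries."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # anything after this marker is sequence, not annotation
        if not line or line.startswith("#"):
            continue  # skip empty lines and comment lines
        seqid, source, ftype, start, end, score, strand, phase, attrs = line.split("\t")
        # Attributes look like "key1=value1;key2=value2".
        attributes = dict(item.split("=", 1) for item in attrs.split(";") if "=" in item)
        features.append({
            "seqid": seqid, "type": ftype,
            "start": int(start), "end": int(end),
            "strand": strand, "attributes": attributes,
        })
    return features

# A made-up annotation line for one hypothetical gene.
toy_gff = [
    "##gff-version 3",
    "scaffold_1\texample\tCDS\t101\t400\t.\t+\t0\tID=gene_0001;product=hypothetical protein",
]
for feature in read_gff(toy_gff):
    print(feature["seqid"], feature["start"], feature["end"], feature["attributes"]["product"])
```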


# Final remarks
Not all steps are feasible on these humble web-based machines. Be patient and don't overload the machines: if we crash the server, we all lose access to the notebooks.

The steps listed above are all part of the practical today, and each step has its own Jupyter notebook. If you are already experienced with Jupyter notebooks, you may skip the 'Jupyter and bash basics' notebook.

Good luck, and remember: Google is your friend!