## 4 Analysis Step #2; Read Mapping
Now that we know the quality of the NGS patient data and have removed low-quality and too-short reads, we can map our reads against a reference genome. In this step we use the human genome as the reference to which we map the cleaned reads. After mapping, we can look for variations between the reference and our patient.

## 4.1 Mapping with BWA
Before starting with mapping the data to the human reference genome, we can calculate some important statistics to see what we can achieve with our data.

### Assignment 4; Mapping Quality
The *[coverage][1]* is the number of reads that were mapped at a given position and it is considered a good measurement to determine if there was enough data for further analysis. To identify significant variations, a minimal coverage of **20** should be used. Given the Illumina platform and given these facts:
<ul>
    <li>a minimal average coverage (read depth) of 20,</li>
    <li>the read length is 2 * 150 (read pair with each read 150 bases),</li>
    <li>the human genome size is 3,137,161,264 bases and</li>
    <li>the CARDIO target region (captured region) totals 320,000 bases,</li>
</ul>

we can use the Lander/Waterman equation to calculate our coverage: $$C = LN / G $$
Where: 
<ul>
    <li>`C` is the coverage,</li>
    <li>`G` is the haploid genome (or target region) length,</li>
    <li>`L` is the read length and</li>
    <li>`N` is the number of reads</li>
</ul>
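
The equation above can be sketched in a few lines of Python. This is only an illustration: the read count used in the example is a hypothetical placeholder (you have to look up your own number), and the function names are made up for this sketch. For paired-end data, either count individual reads with `L` = 150 or count read pairs with `L` = 300; both give the same result.

```python
# Lander/Waterman coverage: C = L * N / G

HUMAN_GENOME_BASES = 3_137_161_264  # genome size given in the text
CARDIO_TARGET_BASES = 320_000       # captured region given in the text


def coverage(read_length: int, n_reads: int, region_length: int) -> float:
    """Return the expected average coverage C = L * N / G."""
    return read_length * n_reads / region_length


def reads_needed(target_coverage: float, read_length: int, region_length: int) -> float:
    """Rearranged form N = C * G / L: reads required for a given coverage."""
    return target_coverage * region_length / read_length


if __name__ == "__main__":
    example_reads = 100_000  # HYPOTHETICAL count; replace with your own number
    read_length = 150        # length of a single read, as given in the text
    genome_cov = coverage(read_length, example_reads, HUMAN_GENOME_BASES)
    panel_cov = coverage(read_length, example_reads, CARDIO_TARGET_BASES)
    print(f"whole genome: {genome_cov:.4f}x")
    print(f"CARDIO panel: {panel_cov:.1f}x")
```

The rearranged form `reads_needed` is useful when you start from a desired coverage and want to know how many reads a sample requires.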

From the cleaned data, look up how many reads are left for mapping. Calculate the coverage if we used this read data set for the whole genome (`G` = human genome size) and also for the CARDIO panel (`G` = captured region). Is the coverage high enough if the reads were coming from the whole genome? An Illumina MiSeq V3 can produce up to 25 million reads of length 300. How many patients could you analyse per run if the expected coverage is 20 and you were using the CARDIO panel? Show all the calculations in your report.

Select the **Map with [BWA-MEM](https://arxiv.org/abs/1303.3997)** tool from the Tools menu. For the mapping we will be using a built-in genome (we made the human genome available in Galaxy). Select the <strong>Human reference HG19</strong> in the reference genome setting. Select paired end reads. 

From your previous experiments with different FASTQ cleaning settings, select the R1 paired Trimmomatic data as the first set of reads and the R2 paired Trimmomatic data as the second set of reads. Execute the tool; it will take a while to run (10 to 20 minutes).
<img src="pics/BWAMEMSettings.png">

[1]: https://en.wikipedia.org/wiki/Coverage_(genetics)

## 4.2 Marking Duplicate Mapped Reads

In the process of creating the reads, *[duplicates](http://www.cureffi.org/2012/12/11/how-pcr-duplicates-arise-in-next-generation-sequencing/)* may have arisen as PCR artifacts. These duplicate reads are not independent observations of the sample's DNA: they are copies of the same original fragment and therefore a technical artifact. In downstream analysis these duplicate reads may generate false positive variants. Can you think of a reason why this is the case?

Before we look at any differences between the reference and our patient, we first have to *mark* the duplicate mapped reads. To do this, select the [MarkDuplicates](http://broadinstitute.github.io/picard/) tool from the Tools menu. Select the *Map with BWA-MEM output on data ... and ....*, set the <strong>Assume input file is already sorted</strong> option to No and execute the tool. This tool adds a *flag* to each read that it identifies as a duplicate, and other tools will ignore any read that has this flag. It does therefore not *remove* the read from the data.
<img src="pics/removeDuplicates.png">
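
To make the idea of a *flag* concrete: in the SAM/BAM format every read has a numeric FLAG field in which each bit has a meaning, and the bit with value 1024 (`0x400`) means "PCR or optical duplicate". The sketch below shows how that bit can be set and tested. It only illustrates the bit arithmetic with hypothetical helper functions; it is not how MarkDuplicates itself decides which reads are duplicates.

```python
DUPLICATE_FLAG = 0x400  # 1024: "PCR or optical duplicate" bit of the SAM FLAG field


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def mark_duplicate(flag: int) -> int:
    """Return the FLAG value with the duplicate bit switched on."""
    return flag | DUPLICATE_FLAG


if __name__ == "__main__":
    # 99 = paired (1) + proper pair (2) + mate on reverse strand (32) + first in pair (64)
    flag = 99
    marked = mark_duplicate(flag)
    print(flag, is_duplicate(flag))
    print(marked, is_duplicate(marked))
```

Because marking only switches on one bit, all other information about the read is kept, which is why the read stays in the file and can still be inspected later.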

### Assignment 5; Visualizing the Mapping Data

We are going to look at the actual mapping to get a feel for what has happened so far. To do this we will look at the mapping output from the previous step (with the marked duplicates) in a *Genome Browser*.

On our system the *Integrative Genomics Viewer* ([IGV](http://software.broadinstitute.org/software/igv/)) has been installed. First we need to download the mapping data to our computer. To do this, download the <strong>dataset</strong> and <strong>bam_index</strong> files from the MarkDuplicates output in Galaxy as shown below.
<img src="pics/bamDownload.png">  
Select <strong>Save File</strong> in the pop up window.

Open IGV either by going to the (Linux) <strong>Applications Menu -> Run Program...</strong>, typing <strong>igv</strong> and clicking on <strong>launch</strong>, or by opening a terminal and entering the **igv** command. Next, load the mapping data into IGV by clicking on <strong>File -> Load from File...</strong>. Look in your <strong>Downloads</strong> folder for a file name starting with **Galaxy** and ending with <strong>.bam</strong> (you only need to open the BAM file; the index file is loaded automatically).

<!--
The IGV program is standard installed with version 18 of the human reference genome. In our galaxy workflow we have been using the newer 19 version. The first thing we should do is tell IGV where to find the newer genome. 
In IGV select the <strong>genomes menu</strong> and select <strong>Load Genome from File...</strong>. The <strong>Look in:</strong> will show that you are in your home directory. First go to <strong>/</strong>, than select the <strong>commons</strong> folder, next click on the <strong>minor</strong>, <strong>projectgenomics</strong> and <strong>genomes</strong> folders. From the last folder select the <strong>hg19.genome</strong> and click on <strong>Open</strong>.
-->

Because our sequence reads are from captured exons (totalling 320,000 bases, which is only about 0.01% of the total genome), you have to zoom in quite a bit to see any of the mappings. To help you find where to zoom in, we can add an extra layer to the genome browser (called a track). I have uploaded a file containing all the exon regions of the cardiopanel to the Galaxy server. You can get this file by going to <strong>Shared Data -> Data Libraries -> Cardio Panel</strong> in the Galaxy browser. Select <strong>CAR_0394321_+en-20_target_v2.BED</strong> and click on the <strong>to History</strong> button. Please have a look at the file in your History. The file consists of 4 columns, which describe the chromosome, the start location of the exon, the end location of the exon and the gene name. Download this file (*Save File*) to your computer.
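
If you want to explore the 4-column BED file outside of Galaxy, a small Python sketch like the one below can help. The example lines and gene names in it are invented for illustration and do not come from the cardiopanel file; the function names are hypothetical as well. BED coordinates are 0-based with an exclusive end, so the length of a region is simply `end - start`. Overlapping regions, if any, would be counted twice by this simple sum.

```python
from collections import defaultdict

HUMAN_GENOME_BASES = 3_137_161_264  # genome size used earlier in this section

# HYPOTHETICAL BED lines (tab-separated); coordinates and names are invented.
EXAMPLE_BED = (
    "chr6\t1000\t1200\tGENE_A\n"
    "chr6\t1500\t1650\tGENE_A\n"
    "chr1\t500\t900\tGENE_B\n"
)


def parse_bed(text):
    """Parse 4-column BED text into (chrom, start, end, name) tuples."""
    regions = []
    for line in text.splitlines():
        # Skip empty lines and optional header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split("\t")[:4]
        regions.append((chrom, int(start), int(end), name))
    return regions


def bases_per_gene(regions):
    """Sum the region lengths (end - start) for each gene name."""
    totals = defaultdict(int)
    for _chrom, start, end, name in regions:
        totals[name] += end - start
    return dict(totals)


def percent_of_genome(target_bases, genome_bases=HUMAN_GENOME_BASES):
    """Return the size of the target as a percentage of the genome."""
    return 100 * target_bases / genome_bases


if __name__ == "__main__":
    regions = parse_bed(EXAMPLE_BED)
    print(bases_per_gene(regions))
    print(f"{percent_of_genome(320_000):.4f}% of the genome")
```

To use it on the real file, read the downloaded BED file into a string and pass it to `parse_bed`.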

Now from IGV, again select <strong>File -> Load from File...</strong>. Look in your Downloads folder for a file ending in **.bed** and open this file. Your screen should now look like this:
<img src="pics/igvMain.png">

From the 4th column in the BED file, choose a couple of gene names. I will take `SOD2` as an example (`SOD2` lies on chr6 and has 5 exons).

In IGV type in the name of your selected gene in the search box and click on Go. 
<img src="pics/IGVsearch.png">

The screen will load the mapping results of the region that includes the `SOD2` gene. 
A couple of regions are important in this genome browser screen. The top row shows the location you are looking at now. 
<img src="pics/IGVlocation.png">

The bottom rows show the locations of the reference human genes and the locations of our cardiopanel captured exons.
<img src="pics/IGVrefseqBed.png">

The middle section holds the actual mapping data. Its first row shows a coverage plot. You can hover over the plot with the mouse. It will show how many reads were mapped at this position and what the nucleotide distribution is at this position. The number of forward and reverse reads is also shown. In this case, 328 reads were mapped at this position, and 100% of the reads have a G at that position (157 in the forward mapped reads and 171 in the reverse mapped reads).
<img src="pics/IGVcoverage.png">

Below the coverage plot, the mapped reads are shown. We mapped paired-end reads; to make this visible in IGV, right click the mapping track and select <strong>View as pairs</strong>. Reads are colored according to their read orientation and insert size. Look in the [IGV online manual](http://www.broadinstitute.org/software/igv/AlignmentData) for the explanation of the colors.

Zoom in on your gene of interest. Grey regions of a read match the reference. Variants are shown by colored vertical bars (each nucleotide has its own color). Zoom in until the nucleotide sequence is shown for a variant. In our example we are looking at a T variant for this patient at this position. We see that a total of 117 reads were mapped at this position, of which 64 had a T and 53 had an A. The patient is heterozygous for this allele. Can you see whether this variant is in an exon or not? What are the consequences of a variant in an exon? Look for a variant in an exon. The bottom row will show the translation from DNA to protein. Does the variant you found cause a change in the amino acid sequence (*non-synonymous*) or does the amino acid sequence stay the same (*synonymous*)?
<img src="pics/IGVvariant.png">
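
The reasoning behind "heterozygous" can be illustrated with the read counts from the example above. The sketch below computes the allele fractions and applies a deliberately naive rule of thumb; the thresholds (20% and 80%) and the function names are assumptions made for this illustration, not what a real variant caller uses. It also assumes that A is the reference base at this position.

```python
def allele_fractions(counts):
    """Turn per-base read counts into fractions of the total coverage."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads at this position")
    return {base: n / total for base, n in counts.items()}


def naive_zygosity(counts, ref_base, het_low=0.2, het_high=0.8):
    """Classify a position from the fraction of non-reference reads.

    het_low and het_high are ASSUMED illustrative thresholds.
    """
    fractions = allele_fractions(counts)
    alt_fraction = 1.0 - fractions.get(ref_base, 0.0)
    if alt_fraction < het_low:
        return "homozygous reference"
    if alt_fraction > het_high:
        return "homozygous variant"
    return "heterozygous"


if __name__ == "__main__":
    counts = {"T": 64, "A": 53}  # read counts from the example in the text
    for base, fraction in allele_fractions(counts).items():
        print(f"{base}: {fraction:.1%}")
    print(naive_zygosity(counts, ref_base="A"))  # assumes A is the reference
```

With roughly half of the reads supporting each allele, the position falls in the heterozygous range, which matches what you see in IGV.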

Look at the reads at the end of an exon. The cardiopanel captures the exons, plus 20 flanking bases, of 55 genes. Why are some reads outside of these regions? It is possible to show the read mapping statistics for every read (and position) in IGV. Hover over a position in a read to get information about the read and its paired partner, the base location and the quality. During the following steps we will answer the following questions *for each gene*: How many variants are found? How many are in the exons? How many variants actually cause the amino acid sequence to change?
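
Whether a variant changes the amino acid sequence can be checked by translating the codon before and after the substitution. The sketch below uses the standard genetic code and a hypothetical helper function. It assumes that you supply the codon in the correct reading frame and on the coding strand, which IGV's translation row can help you determine. A change to a stop codon is also reported as non-synonymous here.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base change inside one codon.

    codon    -- three DNA letters on the coding strand, in frame
    position -- index (0, 1 or 2) of the changed base within the codon
    new_base -- the variant base
    """
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    return "synonymous" if before == after else "non-synonymous"


if __name__ == "__main__":
    # GCT -> GCC: both codons encode alanine (A)
    print(classify_substitution("GCT", 2, "C"))
    # GAG -> GTG: glutamic acid (E) becomes valine (V)
    print(classify_substitution("GAG", 1, "T"))
```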