# Mapping reads to scaffolds with bwa (Burrows-Wheeler Aligner) and sorting with Samtools
We now have both the reads and the assembly, so we can map (align) the original reads back to the scaffolds created by the assembly. Doing this will allow us to see the read depth along the scaffolds.

The reads are paired-end, which means that for each DNA fragment we have sequence data from both ends. The sequences are therefore stored in two separate files, one for each end. We'll use bwa's default settings to align the reads back to the scaffolds, and then use Samtools to work with the bwa results.

<b>Assignment:</b><br>
Use bwa to backmap the reads against the scaffolds and save the mapping as a `.bam` file. When in doubt, don't forget to read the help page of a command, like so:
> ls --help

or, for some programs, by executing the command without any arguments, like so:
> bwa

> bwa index

For more detailed information, read the manual with the `man` command:
> man bwa

> man samtools

In this notebook we will do the following:
1. To run bwa, we first need to make an index of the scaffolds.
2. Then we can run bwa and immediately pipe its output through `samtools view`, which will write a BAM file.
3. Finally, since we have quite a few samples to align to the assembly, we will make a `for` loop that iterates over the different samples.

Hence, we're combining a lot of the things we have learnt so far. Make sure you tackle this assignment step by step and help each other where necessary.


Some background:
> By default, mapping algorithms like bwa write their output as a `.sam` file. The SAM format is widely used and accepted by many downstream programs. Basically it is just a big table with a standardised format. Although a SAM file is human readable, it is also rather bulky: the size of a single SAM file will quickly exceed the disk space you have on this virtual machine. Therefore, we convert it to a BAM file, which is the binary version of a SAM file. No information is lost; it is just stored much more efficiently. Naturally, a binary file is not human readable. Always store your alignments as BAM files. If you need to look inside a BAM file, you can use `samtools view` or any of the many downstream programs available online.
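
To illustrate that a BAM file stays easy to inspect, here is a minimal sketch of a small helper that prints the first lines of a BAM file as SAM text. The helper name `peek_bam` and the example path are hypothetical, and I assume a samtools 1.x installation.

```shell
# peek_bam is a hypothetical helper name, not part of samtools.
# Usage: peek_bam FILE.bam [NUMBER_OF_LINES]
peek_bam() {
    peek_file=$1
    peek_lines=${2:-20}
    if [ ! -f "$peek_file" ]; then
        echo "peek_bam: no such file: $peek_file" >&2
        return 1
    fi
    # -h also prints the header lines (they start with @)
    samtools view -h "$peek_file" | head -n "$peek_lines"
}

# Example call; the path is an assumption, use one of your own BAM files:
# peek_bam ./metagenomics/mapped/L1.bam 5
```

Piping through `head` keeps the output short, which matters because even a small BAM file expands to a lot of text.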

## Index the assembly with bwa index
Add new cells below and create an index with `bwa index`.
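
If you get stuck, the indexing step could look roughly like the sketch below. The path to the scaffolds file is a placeholder I made up for illustration, so replace it with the location of your own assembly.

```shell
# ASSUMPTION: this path is a placeholder; point it at your own scaffolds FASTA file.
scaffolds=./metagenomics/assembly/scaffolds.fasta

if [ -f "$scaffolds" ]; then
    # bwa index writes its index files next to the FASTA file
    bwa index "$scaffolds"
else
    echo "scaffolds file not found: $scaffolds"
fi
```

The `if` test only guards against a mistyped path; the actual work is done by the single `bwa index` line.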

## Check the output of bwa
There should now be multiple files in the assembly folder. Think about which command you could use to check whether that is the case.
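
A plain `ls` of the folder is enough to answer this. As a slightly stricter check, the sketch below tests for the five files that `bwa index` normally writes next to the FASTA file (extensions `.amb`, `.ann`, `.bwt`, `.pac` and `.sa`). The helper name `check_bwa_index` and the example path are hypothetical.

```shell
# check_bwa_index is a hypothetical helper name.
# It reports which of the index files written by bwa index are missing.
check_bwa_index() {
    idx_ref=$1
    idx_missing=0
    for idx_ext in amb ann bwt pac sa; do
        if [ ! -f "$idx_ref.$idx_ext" ]; then
            echo "missing: $idx_ref.$idx_ext"
            idx_missing=1
        fi
    done
    # exit status 0 means all five files were found
    return "$idx_missing"
}

# ASSUMPTION: placeholder path, replace it with your own scaffolds file.
check_bwa_index ./metagenomics/assembly/scaffolds.fasta || echo "the index looks incomplete"
```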

## Run bwa mem for backmapping
For this part we will use a bash `for` loop to run `bwa mem` and `samtools view` on all the reads in `metagenomics/reads` and create BAM output files in a newly made directory.

Then we use a second bash `for` loop to run `samtools sort` on all the files created by `samtools view`.
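
The sorting loop could be sketched as follows. This is one possible shape, not the only one: the helper name `sort_bams`, the `.sorted.bam` suffix and the `DRY_RUN` switch are my own assumptions, and the `-o` option assumes samtools 1.x.

```shell
# sort_bams is a hypothetical helper name.
# Usage: sort_bams DIRECTORY
# With DRY_RUN=1 (the default here) the commands are only printed, not run.
sort_bams() {
    sort_dir=$1
    for bam in "$sort_dir"/*.bam; do
        # skip when the pattern matched nothing
        [ -f "$bam" ] || continue
        # skip files that were already sorted in an earlier run
        case $bam in *.sorted.bam) continue ;; esac
        sorted=${bam%.bam}.sorted.bam
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo samtools sort -o "$sorted" "$bam"
        else
            samtools sort -o "$sorted" "$bam"
        fi
    done
}

# ASSUMPTION: ./metagenomics/mapped is a placeholder for your own output directory.
sort_bams ./metagenomics/mapped
```

Printing the commands first is a cheap way to check the file names before spending time on the real run. Once they look right, call `DRY_RUN=0 sort_bams ./metagenomics/mapped`.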

Step-by-step instructions:<br>
1. Make a new directory for the `samtools view` results and give it a useful name.
    1. I suggest `./metagenomics/mapped`.
    2. To make a directory, use the command `mkdir`.
2. See how the loop I pre-made for you works. Play with it a bit to check that you understand it correctly.
3. Finish the `for` loop so that it runs both
    1. `bwa mem` 
    2. and `samtools view`
    
Don't forget to read the manual page of both!

Mapping each individual sample will take about 5 to 6 minutes.

1. First, make your new directory here.
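
A minimal sketch of this step, assuming the suggested directory name is read as `./metagenomics/mapped`:

```shell
# ASSUMPTION: the suggested name is read here as ./metagenomics/mapped
outdir=./metagenomics/mapped

# -p creates missing parent directories and does not complain
# if the directory already exists
mkdir -p "$outdir"

# confirm that the directory is there
ls -ld "$outdir"
```

Using `-p` makes the cell safe to run more than once.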

2. Now see how this loop works.

%%bash

# first I define the samples in a variable called 'samples'
samples=$(find ./metagenomics/reads -name '*.fastq.gz' -type f -printf '%P\n' | cut -d '.' -f 1 | sort | uniq)

# next I use this variable in my loop
for i in $samples
    do echo $i
done

L1
L2
L3
P1
P2
P3
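
To see why each sample is printed only once even though it has two read files, the sketch below runs just the `cut | sort | uniq` part of the pipeline on a few of the file names from the reads folder. The function name `sample_names` is hypothetical.

```shell
# sample_names is a hypothetical helper name.
# It reads file names on standard input and prints the unique sample names.
sample_names() {
    # cut keeps the text before the first dot (L1.R1.fastq.gz becomes L1),
    # sort and uniq then collapse the R1 and R2 entries into one line
    cut -d '.' -f 1 | sort | uniq
}

printf '%s\n' L1.R1.fastq.gz L1.R2.fastq.gz P1.R1.fastq.gz P1.R2.fastq.gz | sample_names
```

This prints `L1` and `P1`, one per line. In the real loop, `find` with `-printf '%P\n'` supplies the file names instead of `printf`.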


%%bash

# first I define the samples in a variable called 'samples'
samples=$(find ./metagenomics/reads -name '*.fastq.gz' -type f -printf '%P\n' | cut -d '.' -f 1 | sort | uniq)

# next I use this variable in my loop
for i in $samples
    do ls ./metagenomics/reads/$i*.fastq.gz
done

./metagenomics/reads/L1.R1.fastq.gz
./metagenomics/reads/L1.R2.fastq.gz
./metagenomics/reads/L2.R1.fastq.gz
./metagenomics/reads/L2.R2.fastq.gz
./metagenomics/reads/L3.R1.fastq.gz
./metagenomics/reads/L3.R2.fastq.gz
./metagenomics/reads/P1.R1.fastq.gz
./metagenomics/reads/P1.R2.fastq.gz
./metagenomics/reads/P2.R1.fastq.gz
./metagenomics/reads/P2.R2.fastq.gz
./metagenomics/reads/P3.R1.fastq.gz
./metagenomics/reads/P3.R2.fastq.gz


Now copy the loop I made above and replace the `ls` command with a bwa command.
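
If you want something to compare your own loop against, here is one possible sketch. It is a variant, not the loop from the cells above: it finds the samples with a file pattern instead of the `samples` variable. The helper name `map_samples`, the scaffolds path, the output directory and the `DRY_RUN` switch are all assumptions of mine, and the `samtools view` options assume samtools 1.x.

```shell
# map_samples is a hypothetical helper name.
# Usage: map_samples READS_DIR INDEXED_SCAFFOLDS_FASTA OUTPUT_DIR
# With DRY_RUN=1 (the default here) the commands are only printed, not run.
map_samples() {
    map_reads=$1
    map_ref=$2
    map_out=$3
    for r1 in "$map_reads"/*.R1.fastq.gz; do
        # skip when the pattern matched nothing
        [ -f "$r1" ] || continue
        sample=$(basename "$r1" .R1.fastq.gz)
        r2=$map_reads/$sample.R2.fastq.gz
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "bwa mem $map_ref $r1 $r2 | samtools view -b -o $map_out/$sample.bam -"
        else
            # -b writes BAM, -o names the output file, - reads SAM from the pipe
            bwa mem "$map_ref" "$r1" "$r2" | samtools view -b -o "$map_out/$sample.bam" -
        fi
    done
}

# ASSUMPTION: the scaffolds path and the output directory are placeholders.
map_samples ./metagenomics/reads ./metagenomics/assembly/scaffolds.fasta ./metagenomics/mapped
```

Check the printed commands first, then run the real mapping with `DRY_RUN=0` in front of the call. Remember that each sample takes several minutes.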