# Mapping reads to scaffolds with BWA (Burrows-Wheeler Aligner) and sorting with Samtools
We have now seen the reads and the assembly with our own eyes. 
At every step of a bioinformatics analysis, make it a habit to peek into the data you are actually producing. 
Now that you have seen that the files look OK, we can start mapping (aligning) the original reads in the `.fastq.gz` files back to the scaffolds created by the assembly: the `scaffolds.fasta` file. 
Doing this allows us to calculate the **depth on the scaffolds**, a prerequisite for the binning procedure.

Aligning is done with `bwa`.
Per sample, we map the reads against the scaffolds and save the mapping as a `.bam` file. 
1. To run `bwa`, we first need to make an index of the scaffolds. 
2. Then we can run `bwa` and immediately pipe its output through `samtools view`, which writes a `.bam` file.
3. Finally, since we have several samples to align to the assembly, we will write a `for` loop that iterates over the samples. 

Remember that the reads are paired-end Illumina sequences, which means that for each DNA fragment we have sequence data from both ends. 
Therefore, the sequences are stored in two separate files (one for each end). 
We'll use `bwa` with default settings to align the reads back to the scaffolds, and then use `samtools` to work with the `bwa` results.
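
Because every sample comes as a pair of files, it can help to confirm that both halves are present before mapping starts. The sketch below assumes the naming pattern `<sample>.R1.fastq.gz` / `<sample>.R2.fastq.gz` in `data/reads` that this notebook uses; the function name `check_pairs` is only illustrative and not part of `bwa` or `samtools`.

```shell
# Report, per sample, whether both paired-end read files exist.
# Assumption: files are named <sample>.R1.fastq.gz and <sample>.R2.fastq.gz.
check_pairs() {
    local reads_dir=$1
    shift
    local sample missing=0
    for sample in "$@"; do
        if [ -f "$reads_dir/$sample.R1.fastq.gz" ] && [ -f "$reads_dir/$sample.R2.fastq.gz" ]; then
            echo "$sample: both files found"
        else
            echo "$sample: missing R1 and/or R2"
            missing=1
        fi
    done
    return "$missing"
}

# First argument: the reads directory; remaining arguments: the sample names.
check_pairs data/reads L1 L2 L3 P1 P2 P3 || echo "Some read files are missing; check the path."
```

If a sample is reported as missing, check the directory and the file names with `ls` before going further.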

**[Q:] Why is this called back-mapping and not just mapping?**

**[A:]**

**[Q:] Why do we not use the depth in the scaffold names?**

<details>
    <summary>&#10551; Click here for a hint</summary> 
    Remember that in the back-mapping step, we map, store and process all samples individually.
</details>

**[A:]**

The next assignment is a complicated one.
Make sure to read the instructions carefully and remember what you learned about bash loops earlier.

When in doubt, don't forget to read the 'help page' of a command like so:
> ls --help

or sometimes by executing the command without any arguments like so:
> bwa

> bwa index

For more elaborate information, read the manual with the `man` command.
Unfortunately, this doesn't work well inside Jupyter notebooks.
> man bwa

> man samtools
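
Help texts can be long, and the 'usage' line is usually the part you need first. The sketch below is one possible way to pull out just that line. It assumes the tool prints a line containing the word "usage", and it merges stderr into stdout because many command-line tools (as far as I know, `bwa` among them) write their help to stderr. Both `usage_line` and `fake_tool` are made-up names for this illustration.

```shell
# Print the first line that mentions "usage" in a tool's help text.
usage_line() {
    local help_text
    # Help commands often exit with a non-zero status, so ignore the exit code.
    help_text=$("$@" 2>&1) || true
    printf '%s\n' "$help_text" | grep -i -m 1 'usage'
}

# A stand-in tool for demonstration purposes only (hypothetical).
fake_tool() {
    echo "Program: fake_tool (hypothetical example)" >&2
    echo "Usage:   fake_tool [options] <in.fasta>" >&2
    return 1
}

usage_line fake_tool
# prints: Usage:   fake_tool [options] <in.fasta>
```

With the real tools installed, `usage_line bwa index` or `usage_line samtools view` should work the same way.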

## 1 Index the assembly with bwa index
**[DO:] Create an index of the scaffolds with** `bwa index`
Start by reading the `bwa index` instructions. 
1. Remember to look for the 'usage' line first. 
2. You won't need any options here.
3. Remember where the assembly was? We just gunzipped it in the last notebook. 
4. Use `ls` and auto-completion to find your way to the right file.

In [1]:
bwa index


Usage:   bwa index [options] <in.fasta>

Options: -a STR    BWT construction algorithm: bwtsw, is or rb2 [auto]
         -p STR    prefix of the index [same as fasta name]
         -b INT    block size for the bwtsw algorithm (effective with -a bwtsw) [10000000]
         -6        index files named as <in.fasta>.64.* instead of <in.fasta>.* 

         `-a div' do not work not for long genomes.




In [3]:
ls data/assembly

scaffolds.fasta  scaffolds.fasta.gz


In [4]:
bwa index ./data/assembly/scaffolds.fasta

[bwa_index] Pack FASTA... 0.61 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=210601834, availableWord=26818708
[BWTIncConstructFromPacked] 10 iterations done. 44238506 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 81726650 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 115042122 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 144649002 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 170959642 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 194340570 characters processed.
[bwt_gen] Finished constructing BWT in 68 iterations.
[bwa_index] 61.56 seconds elapse.
[bwa_index] Update BWT... 0.51 sec
[bwa_index] Pack forward-only FASTA... 0.40 sec
[bwa_index] Construct SA from BWT and Occ... 31.49 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index ./data/assembly/scaffolds.fasta
[main] Real time: 96.467 sec; CPU: 94.577 sec


While this runs, think about what is happening. 
Mapping always takes a query sequence and a reference sequence. 
The reference is our assembly, specifically the `scaffolds.fasta` file. 
`bwa index` takes this reference and 'indexes' it. 
In essence, it converts the plain-text fasta file into a binary format that the computer can search efficiently.

## Check the output of bwa
After creating the index, there should be multiple files in the assembly folder. 
Think about what command you could use to see if that is the case. 
Use the empty cell below.

**[DO:] Check if the index files were created:** 

In [5]:
ls data/assembly

scaffolds.fasta      scaffolds.fasta.bwt  scaffolds.fasta.sa
scaffolds.fasta.amb  scaffolds.fasta.gz
scaffolds.fasta.ann  scaffolds.fasta.pac


Besides the original scaffolds file, we now expect multiple other files. 
All of these together comprise the `index` we just made.
A bwa index of a fasta file can be seen as a binary representation of the fasta file that bwa can search efficiently. 
It is not meant for human eyes, but purely for the computer algorithms to search through.
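
Instead of checking the listing by eye, the same check can be scripted. This is a minimal sketch that assumes the five index extensions shown in the `ls` output above (`amb`, `ann`, `bwt`, `pac`, `sa`); the function name `index_complete` is an illustrative choice.

```shell
# Check that every bwa index file exists (and is not empty) next to a reference fasta.
index_complete() {
    local ref=$1 ext
    for ext in amb ann bwt pac sa; do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "$ref.$ext" ]; then
            echo "missing or empty: $ref.$ext"
            return 1
        fi
    done
    echo "index complete for $ref"
}

index_complete data/assembly/scaffolds.fasta || echo "Run bwa index first."
```

A check like this is handy at the top of a longer script, so that mapping does not start against a half-built index.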

## Run bwa mem for backmapping
For this part, we will combine your skills with **bash loops** and **variables** to run `bwa mem` and `samtools view` on all the reads in `data/reads` and create BAM output files in a newly made directory.

Then we use a bash `for` loop to run `samtools sort` on all the files created by `samtools view`.

Step-by-step instructions:
1. Make a new directory for the `samtools view` results and give it a useful name.
    1. I suggest `./data/mapped`.
    2. To make a directory, use the command `mkdir`.
2. See how the loop I pre-made for you works; play with it a bit to check that you understand it correctly.
3. Don't forget to read the manual pages of both `bwa mem` and `samtools view`! You can make extra cells to print these manual pages if you want to.
4. Finish the `for` loop and run both these commands in a pipe:
    1. `bwa mem` 
    2. `samtools view`
5. `bwa mem` does not require any options.
6. For `samtools view`, look for the options to convert to a BAM file and write the output to a file.

It should look somewhat like this: `bwa mem argument argument.fastq | samtools view argument argument`

If you forgot how a loop works, check this morning's notebook, `m1-jupyter_and_bash_basics.ipynb`.

Mapping each separate sample will take about 5 to 6 minutes.
If you think your loop works (read: it doesn't crash immediately), then check in the next notebook if the files are created and if they increase in size. Use `ls -sh`.
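
Besides looking at `ls -sh` twice by hand, you could let the shell compare the file size at two moments. The helper below is a rough sketch: `is_growing` is a made-up name, and it assumes the output file already exists. Keep in mind that output is written in chunks, so a short waiting time may show no change even though mapping is still running.

```shell
# Return success if a file got bigger during a short waiting period.
is_growing() {
    local file=$1 wait_seconds=${2:-5}
    local before after
    # wc -c counts bytes; $(( ... )) strips any padding around the number.
    before=$(( $(wc -c < "$file") ))
    sleep "$wait_seconds"
    after=$(( $(wc -c < "$file") ))
    echo "$file: $before -> $after bytes"
    [ "$after" -gt "$before" ]
}

# Example (uncomment while the mapping loop is running):
# is_growing data/mapped/L1.bam 10 && echo "still being written"
```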

**[DO:] 1 First make your new directory here:**

In [6]:
mkdir ./data/mapped

**[DO:] 2 Now see how this loop works:**

In [7]:
# first I define the samples in a variable called 'samples'
samples=( L1 L2 L3 P1 P2 P3 )
# next I use this variable in my loop
for i in "${samples[@]}"
    do echo "$i"
done

L1
L2
L3
P1
P2
P3


**[DO:] 3 Read the manual pages of bwa and samtools.** 
Remember to find the **usage** lines.

In [8]:
bwa mem


Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]

Algorithm options:

       -t INT        number of threads [1]
       -k INT        minimum seed length [19]
       -w INT        band width for banded alignment [100]
       -d INT        off-diagonal X-dropoff [100]
       -r FLOAT      look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
       -y INT        seed occurrence for the 3rd round seeding [20]
       -c INT        skip seeds with more than INT occurrences [500]
       -D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
       -W INT        discard a chain if seeded bases shorter than INT [0]
       -m INT        perform at most INT rounds of mate rescues for each read [50]
       -S            skip mate rescue
       -P            skip pairing; mate rescue performed unless -S also in use

Scoring options:

       -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
     


In [9]:
samtools view


Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -C       output CRAM (requires -T)
  -1       use fast BAM compression (implies -b)
  -u       uncompressed BAM output (implies -b)
  -h       include header in SAM output
  -H       print SAM header only (no alignments)
  -c       print only the count of matching records
  -o FILE  output file name [stdout]
  -U FILE  output reads not selected by filters to FILE [null]
  -t FILE  FILE listing reference names and lengths (see long help) [null]
  -X       include customized index file
  -L FILE  only include reads overlapping this BED FILE [null]
  -r STR   only include reads in read group STR [null]
  -R FILE  only include reads with read group listed in FILE [null]
  -d STR:STR
           only include reads with tag STR and associated value STR [null]
  -D STR:FILE
           only include reads with tag STR and associated values listed in
           FILE [null]
  -q INT   only in

Now it's up to you! Here you have another variant of the loop I made above. 
Replace the `ls` command with a `bwa` command. 
Also, note that you can use the variable inside a path!
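
To see what "a variable inside a path" looks like before running any real mapping, the sketch below only builds and prints the file names. The directory names follow this notebook (`data/reads`, `data/mapped`); the variable names `r1`, `r2` and `out` are chosen just for this illustration.

```shell
# Build the input and output paths for each sample from one loop variable.
# The curly braces in ${i} mark where the variable name ends. They are optional
# before a dot, but needed when a letter, digit or underscore follows directly.
for i in L1 P1; do
    r1="data/reads/${i}.R1.fastq.gz"
    r2="data/reads/${i}.R2.fastq.gz"
    out="data/mapped/${i}.bam"
    echo "$r1 + $r2 -> $out"
done
# prints:
# data/reads/L1.R1.fastq.gz + data/reads/L1.R2.fastq.gz -> data/mapped/L1.bam
# data/reads/P1.R1.fastq.gz + data/reads/P1.R2.fastq.gz -> data/mapped/P1.bam
```

Printing the paths with `echo` first is a cheap way to catch typos before starting a command that runs for minutes.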

**[DO:] Map the paired-end reads of each sample to the indexed scaffolds and save the output as a** `.bam` **file.**

In [10]:
# first I define the samples in a variable called 'samples'
samples=( L1 L2 L3 P1 P2 P3 )
# next I use this variable in my loop
for i in "${samples[@]}"
    do bwa mem data/assembly/scaffolds.fasta "data/reads/$i.R1.fastq.gz" "data/reads/$i.R2.fastq.gz" | samtools view -b -o "data/mapped/$i.bam"
done

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 66522 sequences (10000240 bp)...
[M::process] read 66530 sequences (10000146 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (5, 24302, 3, 3)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (465, 542, 621)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (153, 933)
[M::mem_pestat] mean and std.dev: (540.79, 119.46)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1089)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 66522 reads in 9.961 CPU sec, 9.837 real sec
[M::process] read 66546 sequences (10000170 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 24410, 2, 3)
[... the log continues in the same pattern for every batch of reads and every sample; remaining output truncated ...]

TIP: If you are working on your own computer and not on a shared server, you can speed up the process substantially by using more threads (CPUs/cores) for the mapping. 
Find the number of cores available on your computer with `nproc` and read the `bwa mem` help to find the option for using more threads.
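
As a dry run of this tip, the sketch below only prints a `bwa mem` command that would use every available core. `-t` is the thread option listed in the `bwa mem` usage text earlier in this notebook; using all cores is an assumption that only makes sense on a machine you do not share.

```shell
# Dry run: print (not execute) a bwa mem command that uses all available cores.
threads=$(nproc)   # number of processing units available
i=L1               # one sample as an example
echo "bwa mem -t $threads data/assembly/scaffolds.fasta data/reads/$i.R1.fastq.gz data/reads/$i.R2.fastq.gz"
```

On a shared server, pick a smaller number than `nproc` reports so that other users keep some capacity.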

**[DO:] If the loop is running, then proceed to the next notebook.**
Check with `ls` if your files are being created and perhaps if they increase in size over time.
Then start preparing the next loop: sorting of the bam files.
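
To help with preparing that loop, here is a dry-run sketch that prints one `samtools sort` command per sample without executing anything. It assumes the unsorted files are in `data/mapped`; the output directory `data/sorted`, the `.sorted.bam` suffix and the function name `print_sort_commands` are hypothetical choices, so adapt them to what the next notebook asks for.

```shell
# Print one samtools sort command per sample (dry run, nothing is executed).
print_sort_commands() {
    local i
    for i in "$@"; do
        # -o names the output file; the input BAM is the last argument.
        echo "samtools sort -o data/sorted/$i.sorted.bam data/mapped/$i.bam"
    done
}

print_sort_commands L1 L2 L3 P1 P2 P3
```

Once the printed commands look right, the same loop can run them for real (after creating the output directory with `mkdir`).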


Some background: 
> By default, mappers like `bwa` output their alignments in SAM format. The SAM format is widely used and accepted by many downstream programmes. Basically, it's just a big table with a standardised layout. Although a SAM file is human-readable, it is also rather bulky: a single SAM file can quickly exceed the disk space you have on this virtual machine. Therefore, we convert it to a BAM file, a Binary sAM file. No information is lost; it is just stored much more efficiently. Naturally, a binary file is not human-readable. Always store your alignments as BAM files. If you need to look inside a BAM file, you can view it using `samtools view` or any of the many downstream programmes available online.
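
To make the "big table" idea concrete, the sketch below builds a tiny, made-up SAM snippet (one header line and one alignment record with the 11 mandatory tab-separated columns) and pulls out three columns with `awk`. The read name, scaffold name and all values are invented for illustration and do not come from our data.

```shell
# A made-up, minimal SAM snippet: header lines start with @, and each alignment
# line has the columns QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL.
sam_example=$(printf '@SQ\tSN:scaffold_1\tLN:5000\nread1\t99\tscaffold_1\t101\t60\t8M\t=\t301\t208\tACGTACGT\tFFFFFFFF\n')

# Skip header lines and print read name, scaffold and position (columns 1, 3 and 4).
printf '%s\n' "$sam_example" | awk -F '\t' '!/^@/ { print $1, $3, $4 }'
# prints: read1 scaffold_1 101
```

On a real BAM file, `samtools view -h` followed by `head` should show the same kind of text: `-h` includes the header, as listed in the `samtools view` usage above.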