# Sorting BAM files
BAM files are Binary SAM files.
A SAM file is essentially a big table that tells us which read from the FASTQ file mapped where on the scaffolds of the metagenome assembly.
The rows of this table are not ordered in any meaningful way.
Right after mapping, their order simply follows the (arbitrary) order of the reads in the original FASTQ file.

For many [computational purposes](https://www.google.com/search?client=ubuntu&hs=tg5&channel=fs&ei=3vYPW9uDOumRgAbF3LHoCw&q=why+sorting+is+necessary&oq=why+sort&gs_l=psy-ab.3.2.35i39k1j0l2j0i20i263k1j0l6.2238.7325.0.11765.8.8.0.0.0.0.76.455.8.8.0....0...1c.1.64.psy-ab..0.8.454...0i67k1j0i203k1.0.EJEFo_3okmA), we want to sort these rows by:
1. the scaffold they mapped on
2. their position on that scaffold (in bp)
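
To make the two-level ordering concrete, here is a minimal sketch on a made-up three-column table (read name, scaffold, position), using the standard `sort` command rather than `samtools`. The read and scaffold names are invented for illustration. Note one difference: `sort` orders scaffolds alphabetically, whereas `samtools sort` follows the order of the scaffolds in the BAM header.

```shell
# Toy table with three tab-separated columns: read name, scaffold, position
# (all values are made up for illustration)
toy=$(printf 'readA\tscaffold_2\t150\nreadB\tscaffold_1\t900\nreadC\tscaffold_1\t20\n')

# Sort by scaffold name (column 2), then numerically by position (column 3)
tab=$(printf '\t')
sorted=$(printf '%s\n' "$toy" | sort -t "$tab" -k2,2 -k3,3n)
printf '%s\n' "$sorted"
```

After sorting, both reads on `scaffold_1` come first, ordered by position, followed by the read on `scaffold_2`.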

We will do this with the `samtools` program.
As the name suggests, `samtools` is a collection of tools for working with SAM (and BAM) files, one of which is `samtools sort`.
We will use this tool to sort our BAM files.

`samtools` also contains the `samtools view` tool we used earlier.
`samtools view` converts SAM to BAM, and BAM back to SAM.
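
As a reminder of how that conversion looks in practice, here is a small sketch. It writes a made-up, minimal SAM file to a temporary directory (the scaffold and read names are invented) and, if `samtools` is installed, converts it to BAM and back. It assumes a reasonably recent `samtools` that detects the input format automatically.

```shell
# Write a minimal, made-up SAM file: one header line and one unmapped read
tmpdir=$(mktemp -d)
printf '@SQ\tSN:toy_scaffold\tLN:100\n' > "$tmpdir/toy.sam"
printf 'read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' >> "$tmpdir/toy.sam"

if command -v samtools >/dev/null 2>&1
then
    # SAM -> BAM: -b asks for BAM output, -o names the output file
    samtools view -b -o "$tmpdir/toy.bam" "$tmpdir/toy.sam"
    # BAM -> SAM: -h keeps the header lines in the text output
    samtools view -h -o "$tmpdir/roundtrip.sam" "$tmpdir/toy.bam"
else
    echo "samtools not found; skipping the conversion"
fi
```

Without `-h`, `samtools view` prints only the alignment lines and leaves out the header.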

Before we proceed, let's quickly check that the BAM files were created as we expected.

**[DO:] check the mapped directory with** `ls` **and peek inside the bam files with** `samtools view`**.**

In [1]:
ls ./data/mapped

L1.bam  L2.bam  L3.bam  P1.bam  P2.bam  P3.bam


In [2]:
samtools view ./data/mapped/L1.bam | head

ERR2114809.1	99	NODE_472_length_18986_cov_293.449_ID_23957348	15238	60	149M	=	15505	418	TNTCTACCTATAACTAAGGCAATCAGCGCAATCAATTCCCAAAATGAACCTGTGGGAAAGAAAAGACGAGAAATTAGTACCGCCGCAATGCCTTTGAAACCTTCAGACAGCACTGCCAAAATACCCACCAATTTGCCGCCGTGATAAAA	A#AAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEE<EEEEEEAEEEEEEE/EE<EEEEAEEEEEEEEEAEEEEEEAA<<AAEAEEAA<666<AAAAAAEEE	NM:i:1	MD:Z:1A147	MC:Z:151M	AS:i:147	XS:i:0
ERR2114809.1	147	NODE_472_length_18986_cov_293.449_ID_23957348	15505	60	151M	=	15238	-418	ACTANANNACCCNATNGATGAAACATTTATTGTTTAATGTGAATTTTAGACCTATCTTCTATCTGAAATTGTCAATTAGAAATATAAATTTTGATAAAAATTTTTTTGATAGTTTGATTGCTTGGTTTATGGACACTAAGCAGACACGNNN	66</#/##EEA/#</#6EEEEEEEEEEEAEEEEEEAEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEAA/EEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAA###	NM:i:10	MD:Z:4A0T0A0A4C1G0A132G0A0A0	MC:Z:149M	AS:i:132	XS:i:19
ERR2114809.2	83	NODE_1779_length_7357_cov_353.786_ID_23982227	3055	60	150M	=	2650	-555	TACTCACTTTTAGAAACATGGCC

You should see a lot of tab-delimited names, numbers and sequences. 
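
The long sequence and quality strings make this output hard to read. One possible way to tidy it up, sketched here on a single made-up alignment line, is to keep only the first nine columns with `cut`:

```shell
# One made-up alignment line with tab-separated columns
toy_line=$(printf 'read1\t99\ttoy_scaffold\t15\t60\t4M\t=\t40\t29\tACGT\tIIII')

# Keep columns 1-9, dropping the sequence (10) and quality string (11)
trimmed=$(printf '%s\n' "$toy_line" | cut -f 1-9)
printf '%s\n' "$trimmed"
```

On the real data, the same idea would look like `samtools view ./data/mapped/L1.bam | cut -f 1-9 | head`.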

**[DO:] Look up what a SAM/BAM file should look like and check that it matches what you get.**

**[Q:] What are the headers of a sam/bam file?**

**[A:]**

## Another loop

We have seen how to make a loop in the backmapping part of this practical.
Now let's do the same and sort the BAM files we created earlier.

Before you start, create a new folder in which to store your *sorted* BAM files.
I suggest something like `data/sorted`.

The cell below contains a copy of the loop from the previous notebook.
Edit it so that the loop sorts your BAM files.
1. Do this step by step, and test every little thing you change in the loop.
2. Don't forget to read the help page of `samtools sort` very carefully.
  - You want to sort the reads by coordinate (which is the default), not by name.
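
One way to test a loop step by step is a dry run: let the loop print what it would do before it runs anything. The sketch below only prints file names; the function name `dry_run_sort` is made up for this example.

```shell
# Print what each iteration would do instead of running it
dry_run_sort() {
    for i in L1 L2 L3 P1 P2 P3
    do
        echo "would sort data/mapped/$i.bam into data/sorted/$i.sorted.bam"
    done
}

dry_run_sort
```

Once the printed input and output paths look right, the `echo` line can be replaced by the real command.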

Make sure you only use one CPU/thread; we all have to share this computer.

**[DO:] make a new directory to store your sorted bam files**

In [3]:
mkdir data/sorted

In [4]:
samtools sort

Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -t TAG     Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
  --no-PG    do not add a PG line
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional th

**[DO:] use** `samtools sort` **to sort the BAM files**.
Make sure to store them in your new directory.

In [6]:
# Sort each BAM file by coordinate (the default) and write it to data/sorted
samples=( L1 L2 L3 P1 P2 P3 )
for i in "${samples[@]}"
do
    samtools sort -o "data/sorted/$i.sorted.bam" "data/mapped/$i.bam"
done

## Check
**[DO:] After sorting, check whether your bam files are sorted correctly.**
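1. Use `ls --size` to see whether the files exist and have a plausible size compared to the unsorted files (sorted BAM files are often somewhat smaller, because sorted data tends to compress better).
2. Run `samtools view` to view your BAM files.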
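
Checking by eye only covers the first few lines. As a rough sketch of an automated check, the hypothetical helper `is_coordinate_sorted` below reads SAM records and verifies that each scaffold appears in one block and that positions never decrease within a scaffold. It is demonstrated on made-up records, and it is an illustration rather than an official validation tool.

```shell
# Succeeds (exit 0) if SAM records on stdin are grouped by scaffold and
# ordered by position within each scaffold
is_coordinate_sorted() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        /^@/ { next }                          # skip header lines
        $3 != prev_ref {                       # a new scaffold starts
            if ($3 in seen) { bad = 1; exit }  # scaffold seen before: not grouped
            seen[$3] = 1; prev_ref = $3; prev_pos = 0
        }
        $4 + 0 < prev_pos { bad = 1; exit }    # position went backwards
        { prev_pos = $4 + 0 }
        END { exit bad }
    '
}

# Made-up records: read name, flag, scaffold, position
if printf 'r1\t0\tA\t5\nr2\t0\tA\t9\nr3\t0\tB\t2\n' | is_coordinate_sorted
then good_result=sorted; else good_result=unsorted; fi

if printf 'r1\t0\tA\t9\nr2\t0\tA\t5\n' | is_coordinate_sorted
then bad_result=sorted; else bad_result=unsorted; fi

echo "$good_result $bad_result"
```

On real data this could be used as `samtools view data/sorted/L1.sorted.bam | is_coordinate_sorted && echo "L1 looks sorted"`. A quicker check is `samtools view -H data/sorted/L1.sorted.bam | head -n 1`: after `samtools sort`, the `@HD` line should contain `SO:coordinate`.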

In [7]:
ls -s1 ./data/mapped

total 1354552
226972 L1.bam
226456 L2.bam
226860 L3.bam
224096 P1.bam
222840 P2.bam
227328 P3.bam


In [8]:
ls -s1 ./data/sorted

total 944700
147536 L1.sorted.bam
146044 L2.sorted.bam
146656 L3.sorted.bam
169244 P1.sorted.bam
165216 P2.sorted.bam
170004 P3.sorted.bam


In [11]:
samtools view data/sorted/L1.sorted.bam | head

ERR2114809.663052	145	NODE_1_length_1935275_cov_24.6805_ID_23901540	18	60	151M	=	1885325	1885158	CAATGCGGCTTGTTACGCTCAAATTTAGCCTTCGCCATGTCCGTACAATCCTAAAAACCAGAATTGAAATCGTATCTTAACTACTTTGCCGTAAACCGGTCAGGGACCGATCCACCGCGAAACTCTGGAGCGGGTGACGGGAATCGAACCC	6/E/AAEEAEEE/<EE//EEAE//<</<EEEEEEE/EEEAEEE/EEAEAEEA/6/E<<EEEEAE/EEEEEEEAEEAE<A/EE/AAEEE/EEEEA/EEEEEEAEE/EAE<EEEEEEEE/EEEAA/AEEEEEEEEE6EEAEEAE//EEAAAAA	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:38
ERR2114809.878075	163	NODE_1_length_1935275_cov_24.6805_ID_23901540	65	60	151M	=	579	665	ATCCTAAAAACCAGAATTGAAATCGTATCTTAACTACTTTGCCGTAAACCGGTCAGGGACCGATCCACCGCGAAACTCTGCACCGGGTGACGGGAATCGAACCCGCGTAGCCAGCTTGGCAGGCTGGCGCTCTACCAATGAGCTACCCCCG	AAAAAAAEE6EEA/E6EEEEE/EEEAE///A/AAA/EEEEAEEEA/EAEE6E/AEEEE/E/E/EEE/EE/A///EE/EEA/</AA<E//<E</EE/<EAAAEEEEE<6</EE<AE/E//6/AAA<<A//A</</AAA/<AA6A////AAA/	NM:i:6	MD:Z:80G1G36A7A9T8A4	MC:Z:151M	AS:i:121	XS:i:20
ERR2114809.878075	83	NODE_1_length_1935275_cov_24.6805_ID_23901540	579	60	151M	=	65	-665	GCCAAAGACCCCAACCTCC

## Clean
Did your BAM files sort correctly?
Then remove the unsorted BAM files.
We don't need these anymore, and removing them saves some disk space.
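
Deleting data cannot be undone, so it can help to put a small guard in front of it. The sketch below works in a temporary directory with made-up placeholder files, so it touches no real data. It removes the unsorted directory only if every sample has a non-empty sorted file. A non-empty file is only a weak check; `samtools quickcheck` is a stricter option for real BAM files.

```shell
# Toy setup in a temporary directory so that no real data is touched
workdir=$(mktemp -d)
mkdir -p "$workdir/mapped" "$workdir/sorted"
for i in L1 L2
do
    echo "placeholder" > "$workdir/mapped/$i.bam"
    echo "placeholder" > "$workdir/sorted/$i.sorted.bam"
done

# Check that every sample has a non-empty sorted file
all_present=yes
for i in L1 L2
do
    [ -s "$workdir/sorted/$i.sorted.bam" ] || all_present=no
done

# Remove the unsorted files only if nothing is missing
if [ "$all_present" = yes ]
then
    rm -r "$workdir/mapped"
else
    echo "some sorted files are missing; keeping $workdir/mapped"
fi
```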

**[DO:] If you are sure, then remove the mapped directory by adding the appropriate option to the command below**

In [13]:
rm -rf ./data/mapped