## Sort and Index BAM Files for Downstream Analyses; Visualizing Alignments with IGV

### Samtools
![image.png](attachment:image.png)

Our STAR alignment script should have generated a set of .sam (Sequence Alignment/Map) files. These are "human-readable" text alignment files, but that readability comes at a large disk-space cost. To save space, a compressed binary (non-human-readable) version of the format was developed: the .bam (there is also .cram, but we will focus primarily on .bam). So we need a way to compress our .sam files into .bam files. How do we go about this?

The answer is samtools! Check out the [samtools](http://www.htslib.org/doc/samtools.html) documentation. Since BAM files are binary, you can't read them directly in a text editor. Samtools lets us view the contents of BAM files and perform various manipulations on them.

Okay, let's begin: check that you have samtools installed. You can do this with either `which samtools` or `samtools --help`. If you don't have it, install it according to the [installations](https://github.com/jvtalwar/2021-MSTP-Bioinformatics-Bootcamp/tree/master/Day_0_Setup/Installations) guide (section 4.3).
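
The check above can also be wrapped in a small shell function, which is handy if you want to test for several tools at once. This is only a sketch: `check_tool` is a hypothetical helper name, not something provided by samtools.

```shell
# check_tool is a hypothetical helper (not part of samtools):
# report whether a program is available on the PATH.
check_tool() {
    local tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool found at: $(command -v "$tool")"
    else
        echo "$tool not found on PATH; see the installation instructions" >&2
        return 1
    fi
}

# Check samtools; '|| true' keeps the shell going if it is missing.
check_tool samtools || true
```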

Assuming that you do have it installed, let's review the basic usage. Commands in samtools follow this format:

`samtools <command> [options]`

Try out some of the samtools commands (using the .sam files you generated previously):

`samtools view interesting_file.sam` (or `interesting_file.bam`), e.g., `samtools view DMSO_1_ATCACGAligned.out.sam`

`samtools flagstat interesting_file.bam`
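
Printing a whole alignment file floods the terminal, so it often helps to look at just the header, the first few records, and a record count. The sketch below does that using the `-H` (header only) and `-c` (count) options of `samtools view`. `peek_alignment` is a hypothetical helper name chosen for illustration.

```shell
# peek_alignment is a hypothetical helper: show the first header lines,
# the first few alignment records, and the record count of a SAM/BAM file.
peek_alignment() {
    local aln="$1"
    if [ ! -f "$aln" ]; then
        echo "No such file: $aln" >&2
        return 1
    fi
    echo "== Header (first 5 lines) =="
    samtools view -H "$aln" | head -n 5
    echo "== First 3 alignment records =="
    samtools view "$aln" | head -n 3
    echo "== Total records =="
    samtools view -c "$aln"
}
```

For example: `peek_alignment DMSO_1_ATCACGAligned.out.sam`.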

If your machine shows an error about missing shared libraries (mentioning `libbz2.so.1.0`), you probably forgot to install bzip2 in step 4.3 of [installations](https://github.com/jvtalwar/2021-MSTP-Bioinformatics-Bootcamp/tree/master/Day_0_Setup/Installations). To fix this, install it with:

`conda install -c conda-forge -c bioconda bzip2`

Having tested samtools, and ensuring that things are working as expected, let's go ahead and use samtools to **sort and index our bam files**. Look at the documentation to figure out how you would sort a bam file and save it to a new file with the extension .sorted.bam


`samtools sort -@ 8 -o interesting_file.sorted.bam interesting_file.bam`


We also need a .bai index of the sorted BAM file. You can think of a .bai file as a table of contents for your .bam file. Like a textbook, if we want to read up on certain topics, we skim the table of contents to find where we need to jump to. A .bai file works the same way, allowing programs to look up and jump to any position in your .bam without having to read all the preceding sequences (or, in the textbook analogy, text). This is also why we sort first: `samtools index` requires a coordinate-sorted BAM file.

So again, let's take a look at the documentation to determine how to generate one of these table-of-contents (.bai) files:

`samtools index interesting_file.sorted.bam`
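
To see the "table of contents" at work, you can ask samtools for only the reads overlapping one region; with an index present, it jumps straight to that region instead of scanning the whole file. The sketch below is illustrative: `count_region_reads` is a hypothetical helper, and the region in the example call is an assumed placeholder.

```shell
# count_region_reads is a hypothetical helper: count alignments overlapping
# a region of a coordinate-sorted, indexed BAM file.
count_region_reads() {
    local bam="$1" region="$2"
    # Region queries need an index next to the BAM file.
    if [ ! -f "$bam.bai" ] && [ ! -f "${bam%.bam}.bai" ]; then
        echo "No .bai index found for $bam; run 'samtools index $bam' first" >&2
        return 1
    fi
    samtools view -c "$bam" "$region"
}

# Example call (the region is an assumed placeholder; chromosome names must
# match the reference used for alignment, e.g. 'chr17' versus '17'):
# count_region_reads interesting_file.sorted.bam chr17:7500000-7600000
```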

Now that you have figured out the commands, let's put everything together in an automagical script. Notice the `-@ 8` flag above for sort? It tells samtools to sort using 8 threads, so we need to submit a job requesting at least that many processors. Keep in mind that you can include multiple commands in the same script: put one below the other and each will run after the previous one finishes (yes, you can run things in parallel, and if we had a ton of time-intensive files we certainly would, but for simplicity we will process the files sequentially for now).


Okay, script time: sort and index your BAM files. You can generate and run this file from your scripts folder. However, don't forget that you need to point to where your star_alignment mapping files are!

`#!/bin/bash`<br>
`#PBS -q hotel`<br>
`#PBS -N TheSortingHat`<br>
`#PBS -l nodes=1:ppn=8`<br> 
`#PBS -l walltime=2:30:00`<br>
`#PBS -o sam_index_sort.out`<br>
`#PBS -e sam_index_sort.err`<br>

`cd ~/scratch/star_alignment`

`for x in DMSO_1_ATCACGAligned.out DMSO_2_CGATGTAligned.out DTP_1_CAGATCAligned.out DTP_2_CCGTCCAligned.out DTP_3_GTGAAAAligned.out`<br>
<br>
`do`
<br><br>
`echo "Beginning $x"`<br><br>
`samtools view -S -b $x.sam > $x.bam`
<br>
<br>
`samtools sort -@ 8 -o $x.sorted.bam $x.bam`
<br>
<br>
`samtools index $x.sorted.bam`
<br>
<br>
`done`
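
As a hedged variation on the loop above, the sketch below chains the three steps with `&&` so that a sample stops at the first failing step, and it can print the commands instead of running them. `run`, `process_sample`, and the `DRY_RUN` variable are hypothetical names used for illustration; `-o` replaces the `>` redirect so that the full command can be printed in dry-run mode.

```shell
# run, process_sample, and DRY_RUN are hypothetical names for illustration.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

process_sample() {
    local x="$1"
    # '&&' runs each step only if the previous one succeeded.
    run samtools view -b -o "$x.bam" "$x.sam" &&
    run samtools sort -@ 8 -o "$x.sorted.bam" "$x.bam" &&
    run samtools index "$x.sorted.bam"
}

# Preview the commands for one sample without running anything.
DRY_RUN=1 process_sample DMSO_1_ATCACGAligned.out
```

Inside the job script, calling `process_sample "$x"` in the `for` loop (without `DRY_RUN`) would run the same three steps for each sample.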




### Visualizing Reads with IGV

Check out the [IGV](http://software.broadinstitute.org/software/igv/) website. Go to the Downloads page and follow the instructions for your operating system.
 - **Note:** Using the web app version of IGV seems to fail, so we recommend downloading the Windows/Mac version and using that.


In order to view alignments in IGV, the BAM files need to be reachable from your computer: either hosted on an external server (not TSCC) or downloaded, together with their .bai index files, to your desktop. Since the files are big, I have uploaded them to an external server for you to view. There are 10 files in total: a .bam and a .bai for each of the five samples (2 parental, 3 persister).

**After IGV finishes installing, open IGV:**

Select your genome with _Genomes - Load Genome From Server_ (the exact menu wording can vary between IGV versions). Choose hg19.

Load the BAM files with _File - Load from URL_.<br>
The URLs are:<br>
<br>
DMSO-1:<br>
https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com/DMSO_1_ATCACGAligned.out.sorted.bam<br>
https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com/DMSO_1_ATCACGAligned.out.sorted.bam.bai<br>
<br>
DMSO-2:<br>
https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com/DMSO_2_CGATGTAligned.out.sorted.bam<br>
https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com/DMSO_2_CGATGTAligned.out.sorted.bam.bai<br>
<br>
DTP-1:<br>
https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com/DTP_1_CAGATCAligned.out.sorted.bam<br>
https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com/DTP_1_CAGATCAligned.out.sorted.bam.bai<br>
<br>
DTP-2:<br>
https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com/DTP_2_CCGTCCAligned.out.sorted.bam<br>
https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com/DTP_2_CCGTCCAligned.out.sorted.bam.bai<br>
<br>
DTP-3:<br>
https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com/DTP_3_GTGAAAAligned.out.sorted.bam<br>
https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com/DTP_3_GTGAAAAligned.out.sorted.bam.bai<br>
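
If you would rather download the files and load them from your desktop, the ten URLs above follow a regular pattern and can be generated in a loop. The sketch below only prints the URLs; the base URL and sample names are copied from the links above, while `list_bam_urls` is a hypothetical helper name.

```shell
# Base URL and sample names are copied from the links above;
# list_bam_urls is a hypothetical helper name.
base_url="https://mstp-bootcamp-2020.s3-us-west-1.amazonaws.com"
samples="DMSO_1_ATCACG DMSO_2_CGATGT DTP_1_CAGATC DTP_2_CCGTCC DTP_3_GTGAAA"

list_bam_urls() {
    local s
    for s in $samples; do
        echo "$base_url/${s}Aligned.out.sorted.bam"
        echo "$base_url/${s}Aligned.out.sorted.bam.bai"
    done
}

# Print the ten URLs. To download them into the current directory,
# pipe the list to curl, e.g.:
# list_bam_urls | xargs -n 1 curl -fLO
list_bam_urls
```

Keep each .bam and its .bai in the same folder so that IGV can find the index.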

After you have loaded all the files, IGV will likely ask you to zoom in before it displays the alignments. To start, type TP53 in the search bar at the top; the message should be replaced with the actual visualization.

Play around by viewing different genes or chromosome locations. Can you see genes that clearly have fewer reads in the parental vs persister datasets? What about differences in called variants at specific positions? We'll come back to gene analyses later on after running differential expression.

When you are ready to quit IGV, you can save the session with _File - Save session_. Next time you open IGV you can open your saved session without having to reload the BAM files.