# Indexing BAM files with samtools
Before we can view our alignments in the IGV browser we need to index our BAM files. We will use samtools index for this purpose. 

In [1]:
echo $RNA_ALIGN_DIR
cd $RNA_ALIGN_DIR #RNA_ALIGN_DIR=$RNA_HOME/alignments/hisat2
# To see your full path to RNA_ALIGN_DIR
pwd
# To see what contents are in RNA_ALIGN_DIR
ls
# To browse these directories the way you browse folders on your own computer/laptop, go to: http://YOUR_PUBLIC_IPv4_ADDRESS

/home/ubuntu/workspace/rnaseq/alignments/hisat2
/home/ubuntu/workspace/rnaseq/alignments/hisat2
HBR.bam       HBR_Rep2.bam  HBR_Rep3.sam  UHR_Rep1.sam  UHR_Rep3.bam
HBR_Rep1.bam  HBR_Rep2.sam  UHR.bam       UHR_Rep2.bam  UHR_Rep3.sam
HBR_Rep1.sam  HBR_Rep3.bam  UHR_Rep1.bam  UHR_Rep2.sam  samples.tsv


In [3]:
samtools index -M *.bam
# The -M flag treats every filename argument as a file to be indexed, so multiple BAM files can be indexed with a single command.
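
As a quick sanity check after indexing, a sketch like the one below can list any BAM file that still lacks an index. The helper name `list_unindexed` is hypothetical (it is not part of samtools), and it assumes the default naming, where the index for `X.bam` is written as `X.bam.bai`.

```bash
# Hypothetical helper: print every BAM in a directory that has no .bam.bai index.
list_unindexed() {
  local dir=${1:-.}
  local bam
  for bam in "$dir"/*.bam; do
    [ -e "$bam" ] || continue            # the glob matched nothing
    [ -e "$bam.bai" ] || basename "$bam" # report BAMs with no index next to them
  done
}

# Check the current directory (prints nothing if every BAM is indexed)
list_unindexed .
```

If your samtools build does not accept `-M`, indexing the files one at a time in a loop (`for bam in *.bam; do samtools index "$bam"; done`) should give the same result.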

# Optional: samtools docker image
The course's Amazon Machine Image (AMI) "cshl-seqtec-2024" has Docker already installed so we don't need to install it. 
Try to create an index file for one of your BAM files using a samtools docker image rather than the locally installed version of samtools. Below is an example docker run command.

- **/tmp** is a temporary directory on your local Linux filesystem (in this case, on your EC2 instance).
- **-v** is the parameter used to mount your workspace so that the docker container can see the files that you’re working with. In the example below, **/tmp from the EC2 instance** has been mounted as **/docker_workspace within the docker container**.
- **:v1.9-4-deb_cv1** refers to the specific tag and release of the docker image.
- Note: if this is your first time running this image, the message "Unable to find image 'biocontainers/samtools:v1.9-4-deb_cv1' locally" just means the image has not yet been downloaded to your EC2 instance. Docker then contacts Docker Hub (or wherever the image is hosted) and downloads the image layers: "v1.9-4-deb_cv1: Pulling from biocontainers/samtools".

In [2]:
cp HBR.bam /tmp/
docker run -v /tmp:/docker_workspace biocontainers/samtools:v1.9-4-deb_cv1 samtools index /docker_workspace/HBR.bam
ls /tmp/HBR.bam*

Unable to find image 'biocontainers/samtools:v1.9-4-deb_cv1' locally
v1.9-4-deb_cv1: Pulling from biocontainers/samtools

d0aa93c0: Pulling fs layer
a239eb0e: Pulling fs layer
7313e9cb: Pulling fs layer
ce2e48be: Pulling fs layer
6ad56c57: Pulling fs layer
fdba1cc8: Pulling fs layer
Digest: sha256:da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a
Status: Downloaded newer image for biocontainers/samtools:v1.9-4-deb_cv1
/tmp/HBR.bam  /tmp/HBR.bam.bai
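
To make the `-v /tmp:/docker_workspace` mapping concrete, here is a minimal sketch of how a path on the EC2 instance translates into the path the container sees. The helper `to_container_path` is hypothetical and only manipulates strings; it does not call Docker.

```bash
# Hypothetical helper: translate a host path under a mounted directory into
# the path seen inside the container for "-v HOST_DIR:CONTAINER_DIR".
to_container_path() {
  local host_dir=${1%/} container_dir=${2%/} host_file=$3
  case "$host_file" in
    "$host_dir"/*)
      printf '%s\n' "$container_dir/${host_file#"$host_dir"/}" ;;
    *)
      echo "error: $host_file is not under $host_dir" >&2
      return 1 ;;
  esac
}

to_container_path /tmp /docker_workspace /tmp/HBR.bam
# prints: /docker_workspace/HBR.bam
```

Files outside the mounted directory are not visible to the container, which is why HBR.bam was copied into /tmp before running `samtools index` through Docker.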


# In the next step we will visualize these alignment BAM files using IGV.
Start IGV on your computer/laptop. Load the UHR.bam and HBR.bam files in IGV. If you’re using AWS, you can load the necessary files in IGV directly from your web-accessible Amazon workspace (see below) using ‘File’ -> ‘Load from URL’.

Make sure you select the appropriate reference genome build in IGV (top left corner of IGV): in this case hg38.

AWS links to BAM files:
- UHR hisat2 alignment: http://YOUR_PUBLIC_IPv4_ADDRESS/rnaseq/alignments/hisat2/UHR.bam
- HBR hisat2 alignment: http://YOUR_PUBLIC_IPv4_ADDRESS/rnaseq/alignments/hisat2/HBR.bam
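
When loading from a URL, IGV also needs the index, and it will typically look for it at the BAM URL with `.bai` appended. The sketch below prints both URLs for each sample so you can confirm they exist. The helper `igv_urls` and the `PUBLIC_IP` variable are hypothetical names, and the default address is a documentation-only placeholder that you would replace with your own public IPv4 address.

```bash
# Assumption: PUBLIC_IP holds your instance's public IPv4 address.
# 203.0.113.10 is a reserved documentation address, not a real server.
PUBLIC_IP=${PUBLIC_IP:-203.0.113.10}

# Hypothetical helper: print the BAM URL and the matching index URL for a sample.
igv_urls() {
  local sample=$1
  local base="http://$PUBLIC_IP/rnaseq/alignments/hisat2"
  printf '%s\n' "$base/$sample.bam" "$base/$sample.bam.bai"
}

for sample in UHR HBR; do
  igv_urls "$sample"
done
```

Each printed URL can then be checked with, for example, `curl -sI URL | head -n 1` before pasting the BAM URL into IGV.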

Go to an example gene locus on chr22:
- e.g. EIF3L, NDUFA6, and RBX1 have nice coverage
- e.g. SULT4A1 and GTSE1 are differentially expressed. Are they up-regulated or down-regulated in the brain (HBR) compared to cancer cell lines (UHR)?
- Mouse over some reads and use the read group (RG) tag to determine which replicate the reads come from. What other details can you learn about each read and its alignment to the reference genome?



# Practical Exercise 7 - Visualize

In [1]:
cd $RNA_HOME/practice/alignments/hisat2
samtools index HCC1395_normal.bam
samtools index HCC1395_tumor.bam

Start IGV on your laptop. Load the HCC1395_normal.bam and HCC1395_tumor.bam files in IGV. You can load the necessary files in IGV directly from your web-accessible Amazon workspace (see below) using ‘File’ -> ‘Load from URL’.
- http://YOUR_PUBLIC_IPv4_ADDRESS/rnaseq/practice/alignments/hisat2/HCC1395_normal.bam
- http://YOUR_PUBLIC_IPv4_ADDRESS/rnaseq/practice/alignments/hisat2/HCC1395_tumor.bam

### Q1: Navigate to this location on chromosome 22: ‘chr22:38,466,394-38,508,115’. 
- What do you see here? 
- How would you describe the direction of transcription for the two genes? 
- Does the reported strand for the reads aligned to each of these genes appear to make sense? 
- How do you modify IGV settings to see the strand clearly? 

This region contains two genes, ‘KDELR3’ and ‘DDX17’. With respect to direction of transcription, these genes are arranged in a tail-to-tail fashion (their transcription end points converge). KDELR3 is transcribed from the ‘+ve’ or ‘top’ strand (left to right) and DDX17 is transcribed from the ‘-ve’ or ‘bottom’ strand (right to left). Yes, the aligned reads appear to correspond to the expected strand of transcription. To view this pattern, right-click (or option-click) within the alignment track and select ‘Color alignments by’ and then ‘first-of-pair strand’ from the viewing options. You can do this for the normal and tumor alignment tracks separately.
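
The strand coloring comes from bits in each read's SAM FLAG field. The sketch below approximates the ‘first-of-pair strand’ logic on the command line: a second-of-pair read is assigned its mate's strand, while a first-of-pair or unpaired read keeps its own. The function name `first_of_pair_strand` is hypothetical, and this is an illustration of the idea rather than IGV's actual implementation.

```bash
# Hypothetical helper: given a SAM FLAG value, print the strand (+ or -)
# of the first read in the pair.
first_of_pair_strand() {
  local flag=$1 reverse
  if (( flag & 0x80 )); then
    reverse=$(( flag & 0x20 ))   # second in pair: use the mate's reverse-strand bit
  else
    reverse=$(( flag & 0x10 ))   # first in pair or unpaired: use the read's own bit
  fi
  if (( reverse )); then printf '%s\n' '-'; else printf '%s\n' '+'; fi
}

# Tally strands in the Q1 region, if samtools and the indexed BAM are available
if command -v samtools >/dev/null 2>&1 && [ -f HCC1395_tumor.bam ]; then
  samtools view HCC1395_tumor.bam chr22:38466394-38508115 | cut -f 2 |
    while read -r flag; do first_of_pair_strand "$flag"; done | sort | uniq -c
fi
```

For instance, FLAG 99 (first in pair, forward) and FLAG 147 (second in pair, mate forward) both map to `+`.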

### Q2: How can we modify IGV to color reads by Read Group? How many read groups are there for each sample (tumor & normal)? What are your read group names for the tumor sample? 
To see the read group of each read clearly, right-click (or option-click) within the alignment track and select ‘Color alignments by’ and then ‘read group’. By viewing the colors of reads and the info for individual reads, we can see there are 3 read groups for normal and 3 for tumor. The names will be what you specified during your alignment command. For example: ‘HCC1395_tumor_rep1’, ‘HCC1395_tumor_rep2’, ‘HCC1395_tumor_rep3’.
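
The same answer can be cross-checked outside IGV, since read groups are recorded as `@RG` lines in the BAM header. A minimal sketch, assuming the practice BAM files are in the current directory; `rg_ids` is a hypothetical helper name.

```bash
# Hypothetical helper: read a SAM header on stdin and print one read group ID per line.
rg_ids() {
  awk -F '\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^ID:/) print substr($i, 4) }'
}

# List the read groups of each sample, if samtools and the BAM files are available
if command -v samtools >/dev/null 2>&1; then
  for bam in HCC1395_normal.bam HCC1395_tumor.bam; do
    [ -f "$bam" ] || continue
    echo "== $bam"
    samtools view -H "$bam" | rg_ids
  done
fi
```

`samtools view -H` prints only the header, so this runs quickly even on large files.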

### Q3: What are the options for visualizing splicing or alternative splicing patterns in IGV? 
- Navigate to this location on chromosome 22: ‘chr22:40,363,200-40,367,500’.
- What splicing event do you see?

There are two main options for viewing splicing patterns in IGV. You can right-click (or option-click) within the alignment track and select ‘Show Splice Junction Track’, or you can select the ‘Sashimi Plot’ option. In this region you should see an alternative splicing pattern for the gene ADSL, where a cassette exon is either included or skipped. The exon-skipping isoform appears to be the minor isoform.
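
Splice junctions in IGV are drawn from reads whose CIGAR string contains an `N` operation (a skipped stretch of reference, i.e. an intron). As a rough command-line companion to the junction track, the sketch below counts such reads in the Q3 region. The helper `count_spliced` is a hypothetical name, and the count is per read, not per junction.

```bash
# Hypothetical helper: read SAM records on stdin and count those whose
# CIGAR string (column 6) contains an N operation.
count_spliced() {
  awk -F '\t' '$6 ~ /N/ { n++ } END { print n + 0 }'
}

# Count spliced reads in the ADSL region, if samtools and the indexed BAM are available
if command -v samtools >/dev/null 2>&1 && [ -f HCC1395_tumor.bam ]; then
  samtools view HCC1395_tumor.bam chr22:40363200-40367500 | count_spliced
fi
```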