# Aligning reads & processing SAM files

Exercise for creating and processing short read alignments.

* **Contact:** mate.balajti@unibas.ch

### General

Be sure to nicely format your answer.
Indicate your name in the file name!
_Ex2_solutions_Name_LastName.py_ would be a good approach to this.
Document, report and detail your work, make sure we can follow and execute it.

## Prerequisites

### Operating system

For solving these exercises, you need to work in Bash on a Unix-type terminal,
so if you are on a Linux distribution, macOS, a BSD system or similar,
you should be good to go. If instead you only have access to a Windows 10/11
computer, you should be able to make use of the support the system offers for
Linux/Ubuntu through Microsoft's cooperation with Canonical.

If you have experience with Docker, one other possible option is to
use a Docker image that contains all required software. For example, starting
from a Linux image (e.g., latest Ubuntu), which provides Linux/GNU/Bash out of
the box, you could install STAR and/or any other required Linux software by
writing an appropriate Dockerfile and then building that image. You can also
search online for available images that already contain these tools (e.g.,
[this](https://hub.docker.com/r/mgibio/star) looks like one you
could use for running STAR; see below). If you decide to go this route, be
aware that Docker on Windows still runs in a VM and support is not always
stable.

> **Note:** If you are planning to do more bioinformatics work in the future,
> it is definitely a good idea to have a stable Linux system at hand at all
> times. In this case, you might want to consider installing a Linux
> distribution side-by-side with your Windows OS (search for “Linux Windows
> dual boot” or similar).

### Software

In the last exercise, you were asked to write a simple, naive short read
aligner from scratch. While this is a good exercise, it is not suitable for
actual analysis, because your code won’t be optimized to handle the amounts of
data that next-generation sequencing typically yields. Since the dawn of
high-throughput sequencing techniques in the mid-2000s, a lot of effort has
been put into designing and implementing very efficient methods for mapping
short reads to reference sequences. For this exercise, we will use a popular
option called [STAR](https://github.com/alexdobin/STAR). Among many other
features, STAR supports spliced alignments and optionally accepts gene
annotations in GTF format next to a genome reference to increase the fidelity
of mapping reads that cover splice junctions (if not provided, STAR, by
default, will try to infer splice junctions from the genome reference and the
reads). Please either install STAR (and any other third-party software you
need) via the [Conda](https://docs.conda.io/en/latest/miniconda.html) package
manager _OR_ use Docker containers as suggested above.

### Installation for Windows

On Windows, _STAR_ is only available as a Docker image.
Here we assume that Docker is not yet installed and that Windows 10 or 11 is used.
Rough installation steps:
* Install the WSL2 (Windows Subsystem for Linux) backend:
  https://learn.microsoft.com/en-us/windows/wsl/install
* Install Docker: https://docs.docker.com/desktop/install/windows-install/
* Ensure the installation was successful by following the _Quick Start Guide_.
* Get the STAR Docker image by executing the following command in
  a PowerShell or Windows Command Prompt window (on Linux, this is called a terminal window):
  ```bash
  docker pull mgibio/star
  ```

> Note: please refer to the most up-to-date documentation
> by checking the online instructions!

### Run a Docker image

To make local files visible inside the container, mount the host directory into the container (in this case, the STAR container).

For example, to mount `C:/paths/to/test_files` (the absolute path to the directory `test_files` on the Windows host) onto `/docker_main/files` (the path and name of the directory inside the Linux-based container):
```bash
docker run -it \
  --mount type=bind,source="C:\paths\to\test_files",target=/docker_main/files \
  mgibio/star:latest
```

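> Note: on Linux, paths are written with a forward slash "/", whereas Windows uses a backslash "\\". The trailing `\` at the end of each line above is Bash line-continuation syntax; in PowerShell, use a backtick (`` ` ``) instead, or write the command on a single line.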

> Note: to get help about the command `docker run`,
> try running `docker run --help`
> or search online for the command.

#### Exercise 2.1: Create index (1 point)

Follow [STAR’s manual](https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf#page=4) to create an index of the provided genome `FASTA` file
and `GTF` gene annotations. Note that to allow the mapping to be done on a
laptop, only chromosome 19 and the corresponding gene annotations are provided.
In a typical setting, indexing and mapping would be done on all chromosomes and
an unfiltered set of annotations. Note also that the files are provided in
compressed form (`GZIP`) for easy sharing and may need to be uncompressed
before use (check the manual to see whether passing `GZIP`ped files directly is
accepted).

Required files:

* Genome: `Mus_musculus.GRCm38.dna_rm.chr19.fa.gz`
* Gene annotations: `Mus_musculus.GRCm38.88.chr19.gtf.gz`
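
If the files do need to be uncompressed first, this can be done with `gunzip` on the command line or, as a minimal sketch, from Python with the standard library. The helper name `gunzip_file` is made up for this illustration; it writes the uncompressed file next to the original and leaves the `.gz` file untouched.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional


def gunzip_file(src: str, dest_dir: Optional[str] = None) -> Path:
    """Decompress a .gz file and return the path of the uncompressed copy.

    Hypothetical helper for illustration; not part of STAR or SAMtools.
    """
    src_path = Path(src)
    if src_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src_path.name}")
    out_dir = Path(dest_dir) if dest_dir else src_path.parent
    out_path = out_dir / src_path.stem  # stem drops the trailing ".gz"
    with gzip.open(src_path, "rb") as fin, open(out_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # copies in chunks, so large files are fine
    return out_path


# Example call (file name taken from the list above):
# gunzip_file("Mus_musculus.GRCm38.dna_rm.chr19.fa.gz")
```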


```bash
# Copy your indexing command here

STAR --runThreadN 4 \
     --runMode genomeGenerate \
     --genomeDir /docker_main/files/MyIndex \
     --genomeFastaFiles /docker_main/files/Mus_musculus.GRCm38.dna_rm.chr19.fa/Mus_musculus.GRCm38.dna_rm.chromosome.19.fa \
     --sjdbGTFfile /docker_main/files/Mus_musculus.GRCm38.88.chr19.gtf/Mus_musculus.GRCm38.88_chr19.gtf \
     --sjdbOverhang 100
```
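
The value passed to `--sjdbOverhang` is usually chosen relative to the read length: STAR's manual gives mate length minus 1 as the ideal value and notes that the default of 100 works well in most cases. The following is a rough sketch for checking this against the actual reads; the function name `suggest_sjdb_overhang` and the `max_reads` cutoff are assumptions made for this example, and the code assumes standard four-line FASTQ records.

```python
import gzip
from itertools import islice


def suggest_sjdb_overhang(fastq_path, max_reads=10000):
    """Return (longest read among the first `max_reads` reads) minus 1.

    Hypothetical helper; assumes each FASTQ record spans exactly 4 lines.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    longest = 0
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(islice(handle, 4 * max_reads)):
            if i % 4 == 1:  # the sequence is the 2nd line of each record
                longest = max(longest, len(line.strip()))
    if longest == 0:
        raise ValueError(f"no reads found in {fastq_path}")
    return longest - 1


# Example call (file name from exercise 2.2):
# print(suggest_sjdb_overhang("control.mate_1.fq.gz"))
```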




#### Exercise 2.2: Align reads (1 point)

Follow STAR’s manual to align reads to the reference, using the index you have
created in exercise 2.1. Note that we are dealing here with (part of) a
__paired-end sequencing library__, so you will need to provide both read library
files for this step.

**Attention:** Mapping can take quite some time on a local machine (up to 10-15 minutes).

Required files:

* Control mate 1: `control.mate_1.fq.gz`
* Control mate 2: `control.mate_2.fq.gz`
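
Because both mate files of a paired-end library must describe the same read pairs in the same order, it can be worth checking that they contain the same number of reads before mapping. A minimal sketch, assuming standard four-line FASTQ records (the function names are invented for this example):

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a (optionally gzipped) FASTQ file, 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def check_mates(mate_1, mate_2):
    """Return the number of read pairs, or raise if the mate files disagree."""
    n1, n2 = count_fastq_reads(mate_1), count_fastq_reads(mate_2)
    if n1 != n2:
        raise ValueError(f"mate files differ in size: {n1} vs {n2} reads")
    return n1


# Example call (file names from the list above):
# print(check_mates("control.mate_1.fq.gz", "control.mate_2.fq.gz"))
```

The read count obtained this way is also the number to compare against in exercise 2.4.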

```bash
# Copy your alignment command here

STAR --runThreadN 4 \
     --genomeDir /docker_main/files/MyIndex \
     --readFilesIn "/docker_main/files/control.mate_1.fq/control_R1.fastq" "/docker_main/files/control.mate_2.fq/control_R2.fq"

# Forgotten in the original run: --outFileNamePrefix /docker_main/files/output
```



### Part 2: Process alignments

#### Exercise 2.3: Process alignments (2 points)

Working with SAM/BAM files using SAMtools

After STAR produces the alignments, the results are written by default in **SAM** format (Sequence Alignment/Map). SAM is a plain-text format, so the files can get very large and slow to work with. For efficient storage and processing, SAM files are usually converted into BAM format (the binary, compressed equivalent).

To process BAM files further (for example, to feed them into downstream analysis tools), they also need to be sorted (by genomic coordinate) and indexed (to allow rapid retrieval of reads overlapping specific regions).

This is where [SAMtools](https://www.htslib.org/) comes in. SAMtools is a widely used toolkit for manipulating SAM and BAM files.

**Installing SAMtools**

If you are working in a conda environment, you can install SAMtools with:

```bash
conda install -c bioconda samtools
```

Look up how to:
1. Convert the SAM files produced by STAR into BAM format.
2. Sort the BAM files by genomic coordinate.
3. Index the sorted BAM files to generate `.bai` index files.

```bash
# Copy your commands here

# Convert (works, creates the BAM file): Aligned.out.sam -> Aligned.out.bam
docker run --rm -v "C:\Users\lucas\Desktop\School\Universität Basel\3. Semester - HS 2025\Programming in Bioinformatics 2025\Exercise 2\output:/data" biocontainers/samtools:v1.9-4-deb_cv1 samtools view -Sb "/data/Aligned.out.sam" -o "/data/BAM/Aligned.out.bam"

# Sort
docker run --rm -v "C:\Users\lucas\Desktop\School\Universität Basel\3. Semester - HS 2025\Programming in Bioinformatics 2025\Exercise 2\output:/data" biocontainers/samtools:v1.9-4-deb_cv1 samtools sort "/data/BAM/Aligned.out.bam" -o "/data/BAM/Aligned.sorted.bam"

# Index
docker run --rm -v "C:\Users\lucas\Desktop\School\Universität Basel\3. Semester - HS 2025\Programming in Bioinformatics 2025\Exercise 2\output:/data" biocontainers/samtools:v1.9-4-deb_cv1 samtools index "/data/BAM/Aligned.sorted.bam"
```



#### Exercise 2.4: Count reads (2 points)

Find out from the STAR output (the `Log.final.out` summary or the `SAM` file):

* How many alignments were reported?
* How many reads were uniquely mapped?  
  **Hint:** check for the `NH` (number of hits) SAM tag
* How many reads were mapped to multiple loci?
* How many reads could not be mapped?

Compare the sum of uniquely mapped, multi-mapped and unmapped reads to the
total number of reads in the `FASTQ` input files. Do the numbers match?

> **Note:** See the [SAM
> specification](https://samtools.github.io/hts-specs/SAMv1.pdf) for more info
> on SAM files.
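
As a cross-check of the numbers in STAR's log, the `NH` tag can be tallied directly from the SAM file. The sketch below is one possible approach, not the reference solution: it counts alignment records (lines) and classifies read names by their `NH` value. Two assumptions to keep in mind: for paired-end data both mates share one read name, so `unique` and `multi` count read pairs, and, as far as the STAR manual describes the defaults, unmapped reads are not written to the SAM file, so they have to be derived from the input read count. The function name `classify_reads` is made up for this example.

```python
def classify_reads(sam_lines):
    """Tally alignment records and unique/multi-mapped read names via NH tags."""
    n_records = 0
    hits = {}  # read name -> NH value (number of reported loci)
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4: segment unmapped
        n_records += 1
        for tag in fields[11:]:  # optional fields start at column 12
            if tag.startswith("NH:i:"):
                hits[fields[0]] = int(tag[5:])
                break
    return {
        "alignment_records": n_records,
        "unique": sum(1 for nh in hits.values() if nh == 1),
        "multi": sum(1 for nh in hits.values() if nh > 1),
    }


# Example call (file name as written by STAR without an output prefix):
# with open("Aligned.out.sam") as handle:
#     print(classify_reads(handle))
```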

| Metric | Value |
| --- | --- |
| Number of input reads | 445798 |
| Uniquely mapped reads number | 393775 |
| Number of reads mapped to multiple loci | 3913 |
| % of reads unmapped: too many mismatches | 0.00% |


#### Exercise 2.5: Run custom functions on STAR results (4 points)

`FASTQ` (for reads), `FASTA` (for reference sequences, and for reads if
sequencing quality scores are disregarded or discarded), `GTF`/`GFF` (for gene
annotations/features), `SAM` (for alignments; with the corresponding
binary/compressed versions `BAM` and, more recently, `CRAM`) and `BED` (a
generic tabular format for representing genomic ranges) are the main file
types used in the analysis of RNA-Seq data. Relevant tools in the field
nowadays almost always require input files and report their own outputs in one
of these formats. However, not every tool that accepts a `FASTA` file as input
will also accept a `FASTQ` file, although converting `FASTQ` to `FASTA` is
trivial and loses only the quality scores. In addition, custom formats are
still occasionally used, by both legacy and new tools, to represent
specialized information. Therefore, writing and applying parsers and
converters, so that the outputs of one tool can be used as inputs to another,
is a somewhat menial but common task that bioinformaticians are often faced with.

To practice this and connect to the work you have done in the previous exercise:

* Write code that converts a `SAM` file to a `FASTA` file and apply it to the
  output of exercise 2.2 (align reads). If you didn’t manage to solve exercise 2.2, write
  code that converts `FASTQ` to `FASTA` instead.
* Apply your functions from the previous session (Exercise_1), `parse_fasta()`, `discard_ambiguous_seqs()` and `map_reads()`, to the resulting `FASTA` file.
  * Note: `map_reads()` will likely not finish in a reasonable time, so it is fine to stop it.
    But report and reason about why this could be the case. What makes the difference?

```bash
# Function to convert SAM to FASTA

docker run --rm -v "C:\Users\lucas\Desktop\School\Universität Basel\3. Semester - HS 2025\Programming in Bioinformatics 2025\Exercise 2\output:/data" biocontainers/samtools:v1.9-4-deb_cv1 samtools fasta "/data/BAM/Aligned.out.bam" -o "/data/Aligned.out.fa"
```

> The system cannot find the path specified.

I don't know what to do; I have been working on this for quite a while.
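
Independently of SAMtools, the conversion the exercise asks for can also be written in a few lines of plain Python, which avoids the Docker path handling altogether. The following is a minimal sketch under stated assumptions, not a reference solution: it keeps one record per mate (skipping secondary and supplementary alignments via the FLAG bits 0x100 and 0x800), appends `/1` or `/2` to the read name for paired reads, and reverse-complements sequences aligned to the reverse strand (FLAG bit 0x10) so that the FASTA holds the reads as sequenced. The function names are invented for this example.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sam_to_fasta(sam_lines):
    """Yield FASTA records (header line + sequence line), one per read mate."""
    seen = set()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & 0x100 or flag & 0x800 or seq == "*":
            continue  # secondary/supplementary alignment or no stored sequence
        if flag & 0x40:
            name += "/1"  # first mate of a pair
        elif flag & 0x80:
            name += "/2"  # second mate of a pair
        if name in seen:
            continue  # guard against emitting the same mate twice
        seen.add(name)
        if flag & 0x10:
            seq = reverse_complement(seq)  # SAM stores reverse-strand reads flipped
        yield f">{name}\n{seq}\n"


def convert(sam_path, fasta_path):
    """Read `sam_path` and write its reads to `fasta_path` in FASTA format."""
    with open(sam_path) as fin, open(fasta_path, "w") as fout:
        fout.writelines(sam_to_fasta(fin))


# Example call (paths are placeholders):
# convert("Aligned.out.sam", "Aligned.out.fa")
```

Note that the resulting FASTA only contains reads present in the SAM file; if STAR was run with its default settings, unmapped reads are likely missing from it.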


> **Note:** As an alternative to alignment- and count-based approaches to
> differential gene expression analysis, probabilistic methods are nowadays
> increasingly being used for similar purposes. These have several advantages,
> such as (typically) requiring fewer steps and fewer compute resources.
> Importantly, they provide abundances for individual transcripts and thus
> enable analyses, such as differential transcript expression and isoform usage
> analysis, which the count-based methods do not readily allow. A downside of
> these methods is that they are not easy to understand in detail, making
> results harder to interpret and potentially more sensitive to biases. The two
> main tools for alignment-free estimation of transcript abundances are
> [Salmon](https://github.com/COMBINE-lab/salmon) and
> [kallisto](https://github.com/pachterlab/kallisto).