# Lab 4: You are what you eat (RNA-seq)

<font color="red">This notebook is not graded. It is provided for informational purposes, but please do read through it. Note that the paths are outdated (they come from a previous year, when this was lab 3). The preprocessed files are now stored in `~/public/lab4/preprocessing/`.</font>

This notebook contains background information about how the raw RNA-seq data was preprocessed to obtain gene-level count information from the raw reads. 


# 0. Accessing the raw data

Raw data files used for preprocessing were obtained using the code below (in the terminal). Please check out the links below, but do not run this! It is meant to show you how we did this, which you may find helpful as a guide when you need to find data for your own project. 

```shell
# Accessions for each file were found at
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87565
# and navigating to the SRA link for each file

fastq-dump -Z SRR4340937 > Chow_Rep1.fq
fastq-dump -Z SRR4340938 > Chow_Rep2.fq
fastq-dump -Z SRR4340939 > Chow_Rep3.fq

fastq-dump -Z SRR4340943 > HFD_Rep1.fq
fastq-dump -Z SRR4340944 > HFD_Rep2.fq
fastq-dump -Z SRR4340945 > HFD_Rep3.fq

# Get reference genome from ENSEMBL
wget -O GRCm38.fa http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/ENSEMBL/mus_musculus/ENSEMBL.mus_musculus.release-75/Mus_musculus.GRCm38.75.dna.primary_assembly.fa

# Get gene annotations from ENSEMBL
wget -O GRCm38.75.gtf http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/ENSEMBL/mus_musculus/ENSEMBL.mus_musculus.release-75/Mus_musculus.GRCm38.75.gtf

```

# 1. Alignment of RNA-seq data

As with most NGS analyses we'll be doing (besides assembly), the first step is to align our reads so we know where they came from. As we discussed in class, we cannot simply align reads to the reference genome using something like BWA-MEM, since this does not take into account the "splice junctions" that will be prevalent in our RNA-seq data.

We used an RNA-seq aligner called STAR for this. STAR takes in FASTQ files and an index built from the reference genome and its gene annotations, and outputs BAM files with aligned reads. **Because STAR can be very memory intensive (20+ GB of RAM), we have already run this for you. We will still walk through the steps to run it here as a reference for using this in the future.**

First, similar to BWA-MEM, we need to create an "index" that will be used during alignment to rapidly look up where each read came from. We created an index using the following command:

```shell
REFFA=/datasets/cs185-sp22-a00-public/genomes/GRCm38.fa
GTF=/datasets/cs185-sp22-a00-public/genomes/GRCm38.75.gtf
STARDIR=/datasets/cs185-sp22-a00-public/lab3/preprocessing/STAR
STAR \
    --runMode genomeGenerate \
    --genomeDir ${STARDIR} \
    --genomeFastaFiles ${REFFA} \
    --sjdbGTFfile ${GTF} \
    --sjdbOverhang 49
```

<blockquote>UNIX TIP: Note, in the above command we are taking advantage of bash variables. We can define variables using the syntax: VAR=value (important, no spaces!) then reference variables using \$VAR or \${VAR}.</blockquote>
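
The braces matter when a variable name is immediately followed by other characters, as in the `${f}Aligned...` file names used later in this notebook. A minimal sketch (the sample name is only an illustrative value):

```shell
# Illustrative value; any sample name would do
f=Chow_Rep1

# With braces, bash knows the variable is called "f"
with_braces="${f}Aligned.sortedByCoord.out.bam"

# Without braces, bash looks for a variable named "fAligned", which is unset,
# so it expands to an empty string
without_braces="$fAligned.sortedByCoord.out.bam"

echo "with braces:    ${with_braces}"
echo "without braces: ${without_braces}"
```

When in doubt, using `${VAR}` everywhere avoids this kind of silent mistake.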

Input parameters to this command were:
* `--runMode genomeGenerate` tells STAR we're creating an index.
* `--genomeDir ${STARDIR}` tells STAR where to put the output files.
* `--genomeFastaFiles ${REFFA}` points STAR to the FASTA file for our reference genome. Here we are using the mouse genome, build GRCm38.
* `--sjdbGTFfile ${GTF}` provides the gene annotations (GTF format). This file is critical for telling STAR where all the exon-exon boundaries are and where genes start and end. You should take a look at this file to get an idea of what these gene annotations look like. See the [GTF format spec](https://uswest.ensembl.org/info/website/upload/gff.html) and lecture slides for more details. This file comes from Ensembl release 75 (http://feb2014.archive.ensembl.org/Mus_musculus/Info/Index).
* `--sjdbOverhang 49` should be set to the read length minus 1 according to the STAR manual. The value 49 used here corresponds to 50 bp reads.
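
If you are not sure of the read length for your own data, you can measure it from the FASTQ file before choosing `--sjdbOverhang`. The sketch below is only an illustration: it builds a tiny synthetic FASTQ file (two made-up 50 bp reads) so that it runs anywhere, and the variable names `tmpfq`, `read_len`, and `sjdb_overhang` are our own.

```shell
# Build a tiny synthetic FASTQ file (two made-up 50 bp reads) for illustration
tmpfq=$(mktemp)
awk 'BEGIN {
    for (r = 1; r <= 2; r++) {
        seq = ""; qual = ""
        for (i = 0; i < 50; i++) {
            seq = seq substr("ACGT", i % 4 + 1, 1)
            qual = qual "I"
        }
        printf "@read%d\n%s\n+\n%s\n", r, seq, qual
    }
}' > "$tmpfq"

# In FASTQ, the sequence is the 2nd line of every 4-line record.
# Report the longest read seen.
read_len=$(awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
                END { print max }' "$tmpfq")

# Overhang is read length minus 1
sjdb_overhang=$((read_len - 1))
echo "read length: ${read_len}, --sjdbOverhang: ${sjdb_overhang}"

rm -f "$tmpfq"
```

On real data you could feed in the first few thousand lines of a FASTQ file instead of the synthetic one, e.g. `head -n 4000 Chow_Rep1.fq`.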

After creating the index, we can align reads to the transcriptome. We used the following bash commands to run the alignment:


```shell
OUTDIR=/datasets/cs185-sp22-a00-public/lab3/preprocessing

# STAR options recommended by ENCODE
STAROPTS="--outSAMattributes NH HI AS NM MD \
	--outFilterType BySJout \
	--outFilterMultimapNmax 20 \
	--outFilterMismatchNmax 999 \
	--outFilterMismatchNoverReadLmax 0.04 \
	--alignIntronMin 20 \
	--alignIntronMax 1000000 \
	--alignMatesGapMax 1000000 \
	--alignSJoverhangMin 8 \
	--alignSJDBoverhangMin 1 \
	--sjdbScore 1 \
	--limitBAMsortRAM 50000000000"

for f in Chow_Rep1 Chow_Rep2 Chow_Rep3 HFD_Rep1 HFD_Rep2 HFD_Rep3
do
    STAR \
	--runThreadN 5 \
	--genomeDir ${STARDIR} \
	--readFilesIn ${OUTDIR}/fastqs/${f}.fq \
	--outFileNamePrefix ${OUTDIR}/${f} \
	--outSAMtype BAM SortedByCoordinate \
	--quantMode TranscriptomeSAM ${STAROPTS}
    # Reorganize the output files
    mv ${OUTDIR}/${f}Aligned.toTranscriptome.out.bam ${OUTDIR}/txBams/
    mv ${OUTDIR}/${f}Aligned.sortedByCoord.out.bam ${OUTDIR}/genomeBams/
    samtools index ${OUTDIR}/genomeBams/${f}Aligned.sortedByCoord.out.bam
done
```

<blockquote>UNIX TIP: The for loop goes through each of our samples and runs a separate command for each, similar to for loops in Python. You may find the for loop syntax useful for running additional commands below. See more about bash for loops: https://www.cyberciti.biz/faq/bash-for-loop/.</blockquote>

This command uses STAR to align each of our fastq files to the reference genome and reference transcriptome. 

For each sample, this outputs two BAM files:
* `${f}Aligned.sortedByCoord.out.bam`: contains our reads aligned to the reference mouse *genome* (GRCm38).
* `${f}Aligned.toTranscriptome.out.bam`: contains our reads aligned to the mouse *transcriptome* (Ensembl version 75).

If you run `samtools view` on these files, you will see that reads in the first BAM files (genome) are aligned to chromosomes like we are used to (e.g. 1, 2, etc.). Reads in the second BAM files (transcriptome) are aligned to *transcripts* (named things like "ENSMUST00000074245"). These transcriptome BAMs are required for the tools in the next section for quantifying gene expression.

A major difference between the two alignments is that alignments of RNA-seq reads to the reference genome will contain large gaps due to splice junctions. You can find reads with such gaps by looking for CIGAR strings containing an "N" operation (a skipped region of the reference), e.g.:
```shell
samtools view /datasets/cs185-sp22-a00-public/lab3/preprocessing/genomeBams/Chow_Rep1Aligned.sortedByCoord.out.bam | awk '($6 ~ /N/)' | head
```
You should see reads with CIGAR strings like `38M723N12M`. This means the read matched 38 bp on one exon, skipped a 723 bp intron, and matched 12 bp on the next exon. You won't typically see this type of read in the transcriptome BAMs, since there the reads were aligned directly to transcripts, which have intron sequences removed.
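
To make the CIGAR interpretation concrete, here is a small sketch that splits a CIGAR string into its operations, adds up the read bases, and lists the lengths of any skipped regions. The function name `summarize_cigar` is our own (it is not part of samtools), and the sketch does not need a BAM file.

```shell
# Split a CIGAR string into its operations and summarize them.
# M, I, S, = and X consume read bases; N marks a skipped region (here, an intron).
summarize_cigar() {
    echo "$1" | awk '{
        cigar = $0; read_bases = 0; introns = ""
        # Peel off one "<length><operation>" pair at a time
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (op ~ /[MIS=X]/) read_bases += len
            if (op == "N") {
                sep = (introns == "" ? "" : ",")
                introns = introns sep len
            }
            cigar = substr(cigar, RLENGTH + 1)
        }
        if (introns == "") introns = "none"
        printf "read_bases=%d introns=%s\n", read_bases, introns
    }'
}

summarize_cigar 38M723N12M   # spliced read: prints read_bases=50 introns=723
summarize_cigar 50M          # unspliced read: prints read_bases=50 introns=none
```

On real data, the same function could be applied to column 6 of the `samtools view` output.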


# 2. Quantifying gene expression

Next, we will use RSEM, a tool for quantifying gene expression from RNA-seq. It takes in reads aligned to the transcriptome, which we obtained using STAR, and outputs estimated expression levels of each gene.

RSEM requires an initial step to preprocess the reference transcriptome, similar to the index step required for sequence alignment. We have run this for you using the command below:

```shell
REFFA=/datasets/cs185-sp22-a00-public/genomes/GRCm38.fa
GTF=/datasets/cs185-sp22-a00-public/genomes/GRCm38.75.gtf
RSEMOUT=/datasets/cs185-sp22-a00-public/lab3/preprocessing/RSEM/RSEM
rsem-prepare-reference \
    --gtf ${GTF} ${REFFA} ${RSEMOUT}
```

This will generate multiple index files with the prefix given by `${RSEMOUT}`, i.e. `/datasets/cs185-sp22-a00-public/lab3/preprocessing/RSEM/RSEM*`.

Now, we are ready to run RSEM for expression quantification. The following command shows how to run RSEM on a single sample:

```shell
rsem-calculate-expression \
	-p 5 \
	--fragment-length-mean -1 \
	--seed-length 25 \
	--bam /datasets/cs185-sp22-a00-public/lab3/preprocessing/txBams/Chow_Rep1Aligned.toTranscriptome.out.bam \
	${RSEMOUT} \
	/datasets/cs185-sp22-a00-public/lab3/Chow_Rep1
```
(The `-p 5` option tells RSEM to use 5 threads to speed this up.)

We ran `rsem-calculate-expression` on each of the six samples (separately) to create the following output files for each sample (shown here for `Chow_Rep1`):
* `Chow_Rep1.genes.results`: gene-level expression results
* `Chow_Rep1.isoforms.results`: isoform (transcript)-level expression results
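
Running all six samples fits naturally into a bash for loop. The sketch below is a dry run: it only prints the command for each sample rather than running RSEM. The helper `rsem_command` and the variables `TXBAMS` and `RESULTS` are our own names, and we are assuming that every sample follows the same input and output path pattern as `Chow_Rep1`.

```shell
# Paths copied from the commands above
RSEMOUT=/datasets/cs185-sp22-a00-public/lab3/preprocessing/RSEM/RSEM
TXBAMS=/datasets/cs185-sp22-a00-public/lab3/preprocessing/txBams
# Assumption: every sample uses the same output prefix pattern as Chow_Rep1
RESULTS=/datasets/cs185-sp22-a00-public/lab3

# Print (rather than run) the RSEM command for one sample
rsem_command() {
    echo "rsem-calculate-expression -p 5 --fragment-length-mean -1 --seed-length 25" \
         "--bam ${TXBAMS}/${1}Aligned.toTranscriptome.out.bam ${RSEMOUT} ${RESULTS}/${1}"
}

for f in Chow_Rep1 Chow_Rep2 Chow_Rep3 HFD_Rep1 HFD_Rep2 HFD_Rep3
do
    rsem_command "${f}"
done
```

Printing the commands first is a cheap way to check the paths before committing to a long-running job.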

We have included the `*.genes.results` files for gene-level analysis, which you will use as a starting point in the lab assignment.
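
As a first look at one of these files, the sketch below lists expressed genes from highest to lowest TPM. It runs on a small synthetic stand-in with made-up genes and values, and it assumes the usual RSEM column names (`gene_id`, `transcript_id(s)`, `length`, `effective_length`, `expected_count`, `TPM`, `FPKM`); check the header of your own file before relying on them.

```shell
# Synthetic stand-in for a *.genes.results file (made-up genes and values)
tmpres=$(mktemp)
printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n' > "$tmpres"
printf 'geneA\ttxA1,txA2\t1500.00\t1451.00\t120.00\t35.20\t28.10\n' >> "$tmpres"
printf 'geneB\ttxB1\t900.00\t851.00\t0.00\t0.00\t0.00\n' >> "$tmpres"
printf 'geneC\ttxC1\t2200.00\t2151.00\t640.00\t126.70\t101.30\n' >> "$tmpres"

# Look up columns by header name instead of hard-coding positions,
# skip genes with zero TPM, and sort the rest from highest to lowest TPM
expressed=$(awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["TPM"]) > 0 {
        print $(col["gene_id"]) "\t" $(col["expected_count"]) "\t" $(col["TPM"])
    }
' "$tmpres" | sort -t "$(printf '\t')" -k3,3nr)

echo "$expressed"
rm -f "$tmpres"
```

To try this on real data, point `awk` at one of the provided `*.genes.results` files instead of the temporary file.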