# **Project №6 Lab journal**
## **Baking Bread**
> Done by Ilia Popov

We will explore how RNA expression levels change as yeast undergo fermentation to make bread rise. There are two replicates of RNA-seq data from yeast before and during fermentation, and our goal is to find out if the yeast express different genes during fermentation than they do under normal growth.

### **1) Downloading the data**

In [None]:
# SRR941816: fermentation 0 minutes replicate 1 
! wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR941/SRR941816/SRR941816.fastq.gz -P data/raw_data
# ----------------------------------------------
# SRR941817: fermentation 0 minutes replicate 2 
! wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR941/SRR941817/SRR941817.fastq.gz -P data/raw_data
# ----------------------------------------------
# SRR941818: fermentation 30 minutes replicate 1 
! wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR941/SRR941818/SRR941818.fastq.gz -P data/raw_data
# ----------------------------------------------
# SRR941819: fermentation 30 minutes replicate 2 
! wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR941/SRR941819/SRR941819.fastq.gz -P data/raw_data
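
The four download commands differ only in the run accession, so the URL can be derived from it. The sketch below is a minimal illustration, assuming the directory layout visible in the URLs above (the first six characters of the accession, then the full accession). `ena_fastq_url` and `SAMPLES` are hypothetical names introduced here, not part of any library.

```python
# Hypothetical helper: rebuild the EBI FTP path used in the wget calls above.
# Assumption: the layout <first 6 chars>/<accession>/<accession>.fastq.gz,
# which matches the four URLs in this journal (9-character accessions).
SAMPLES = {
    "SRR941816": "fermentation 0 minutes, replicate 1",
    "SRR941817": "fermentation 0 minutes, replicate 2",
    "SRR941818": "fermentation 30 minutes, replicate 1",
    "SRR941819": "fermentation 30 minutes, replicate 2",
}


def ena_fastq_url(accession: str) -> str:
    """Return the FTP path of the single-end FASTQ file for one run."""
    prefix = accession[:6]
    return f"ftp.sra.ebi.ac.uk/vol1/fastq/{prefix}/{accession}/{accession}.fastq.gz"


# Print the equivalent wget commands without executing them
for accession, description in SAMPLES.items():
    print(f"# {description}")
    print(f"wget {ena_fastq_url(accession)} -P data/raw_data")
```

Keeping the sample descriptions next to the accessions also gives one place to look up which run belongs to which condition later in the analysis.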

As a reference genome we will use the _Saccharomyces cerevisiae_ strain S288c, assembly R64.

In [None]:
# reference genome file: 
! wget ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz -P data/reference
# ----------------------
# annotation file:
! wget ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz -P data/reference

### **2) Quality checking of sequencing**

#### **2.1) `FastQC`**

First, let's generate `FastQC` reports for all four samples.

In [None]:
! fastqc data/raw_data/SRR941816.fastq.gz data/raw_data/SRR941817.fastq.gz \
data/raw_data/SRR941818.fastq.gz data/raw_data/SRR941819.fastq.gz

#### **2.2) `MultiQC`**

Then, let's generate a `MultiQC` report based on the `FastQC` reports.

In [None]:
! multiqc data/raw_data

### **3) Analysis pipeline**

#### **3.1) Unpack downloaded data**

In [8]:
! gunzip data/reference/GCF_000146045.2_R64_genomic.fna.gz
! gunzip data/reference/GCF_000146045.2_R64_genomic.gff.gz
! gunzip data/raw_data/SRR94181*.fastq.gz

#### **3.2) Aligning with HISAT2**

First, let's build the genome index.

In [None]:
! hisat2-build data/reference/GCF_000146045.2_R64_genomic.fna hisat2_indeces/reference

Then, let's run `hisat2` in single-end mode

In [None]:
! hisat2 -p 8 -x hisat2_indeces/reference -U data/raw_data/SRR941816.fastq | samtools sort > bam_files/SRR941816.bam
! hisat2 -p 8 -x hisat2_indeces/reference -U data/raw_data/SRR941817.fastq | samtools sort > bam_files/SRR941817.bam
! hisat2 -p 8 -x hisat2_indeces/reference -U data/raw_data/SRR941818.fastq | samtools sort > bam_files/SRR941818.bam
! hisat2 -p 8 -x hisat2_indeces/reference -U data/raw_data/SRR941819.fastq | samtools sort > bam_files/SRR941819.bam
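
The four alignment commands above are identical apart from the run name, so they can be generated from a list. The sketch below only builds the command strings (it does not run anything). The paths and flags are copied from the cell above, while `RUNS` and `hisat2_command` are hypothetical names used for illustration.

```python
# Build the four "hisat2 | samtools sort" commands from the list of runs.
# Paths and flags are copied from the cell above; nothing is executed here.
RUNS = ["SRR941816", "SRR941817", "SRR941818", "SRR941819"]


def hisat2_command(run: str, threads: int = 8) -> str:
    """Return the shell pipeline that aligns one run and sorts the output."""
    return (
        f"hisat2 -p {threads} -x hisat2_indeces/reference "
        f"-U data/raw_data/{run}.fastq | samtools sort > bam_files/{run}.bam"
    )


commands = [hisat2_command(run) for run in RUNS]
for command in commands:
    print(command)
```

Each string could then be passed to `subprocess.run(command, shell=True, check=True)`. The pipe and the redirect need a shell, and the `bam_files` directory has to exist beforehand.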

Finally, let's record the alignment summary that `hisat2` prints for each sample.


|Sample name|Total reads|Reads aligned exactly 1 time|Overall alignment rate|
|-----------|-----------|----------------------------|----------------------|
|SRR941816|9043877|7930593|94.25%|
|SRR941817|9929568|8645384|94.91%|
|SRR941818|1721675|1508002|96.22%|
|SRR941819|6172452|5368133|96.28%|
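
The alignment rate reported by `hisat2` also counts reads that align more than once, so the share of reads aligned exactly once is lower than the rate in the last column. A small sketch of that arithmetic, using only the numbers from the table above (`hisat2_summary` is a hypothetical name):

```python
# Numbers copied from the alignment table above:
# run -> (total reads, reads aligned exactly once)
hisat2_summary = {
    "SRR941816": (9043877, 7930593),
    "SRR941817": (9929568, 8645384),
    "SRR941818": (1721675, 1508002),
    "SRR941819": (6172452, 5368133),
}

# Fraction of all reads that have a single alignment
unique_fraction = {
    run: unique / total for run, (total, unique) in hisat2_summary.items()
}

for run, fraction in unique_fraction.items():
    print(f"{run}: {fraction:.1%} of reads aligned exactly once")
```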

#### **3.3) Quantifying with `featureCounts`**

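By default, `featureCounts` expects GTF-style annotation: it counts reads over `exon` features and groups them by the `gene_id` attribute, which GFF3 files normally lack. So let's first try converting the GFF file to GTF format with `gffread`.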

**Convert from GFF to GTF**

In [14]:
! gffread data/reference/GCF_000146045.2_R64_genomic.gff -T -o data/reference/GCF_000146045.2_R64_genomic.gtf

**!!! DISCLAIMER !!!**<br>
In our case `gffread` generates a GTF file with empty fields, which `featureCounts` does not accept. There are 3 possible ways to solve this problem:
1) Download the annotation already in GTF format from: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gtf.gz
2) Run `featureCounts` on the original GFF file with the `-t gene -g ID` parameters
3) Filter the GTF file produced by `gffread` by hand

I decided to go with option 2.

**Run the feature counts program**

In [None]:
# -t gene -g ID: count reads per `gene` feature, grouped by the GFF `ID` attribute
! featureCounts -a data/reference/GCF_000146045.2_R64_genomic.gff -t gene -g ID \
-o featureCounts_output/all_fc.txt bam_files/*.bam

|Sample name|Total alignments|Assigned alignments|
|-----------|----------------|-------------------|
|SRR941816|9773838|7285693 (74.5%)|
|SRR941817|10832704|7986987 (73.7%)|
|SRR941818|1885543|1406729 (74.6%)|
|SRR941819|6800272|4994723 (73.4%)|

We don’t need all columns from featureCounts output file for further analysis, so let’s simplify it.

In [19]:
! cat featureCounts_output/all_fc.txt | cut -f 1,7-10 > simple_counts.txt
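
The same column selection can be sketched in `pandas`, which is used later in this journal anyway. This assumes the usual `featureCounts` layout: a `#` comment line, then a header row with `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`, followed by one count column per BAM file. `simplify_counts` is a hypothetical helper, and the example table is made up purely to show the shape of the data.

```python
import io

import pandas as pd


def simplify_counts(source) -> pd.DataFrame:
    """Keep the gene ID column and the per-sample count columns."""
    # featureCounts writes a '#'-prefixed comment line before the header row
    table = pd.read_csv(source, sep="\t", comment="#")
    # Columns 2-6 are Chr, Start, End, Strand, Length; counts start at column 7
    keep = [0] + list(range(6, table.shape[1]))
    return table.iloc[:, keep]


# Tiny made-up example in the featureCounts layout (values are illustrative only)
example = io.StringIO(
    "# Program:featureCounts (example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\ta.bam\tb.bam\n"
    "gene-A\tchrI\t1\t100\t+\t100\t5\t7\n"
    "gene-B\tchrI\t200\t400\t-\t201\t0\t3\n"
)
simple = simplify_counts(example)
print(simple)
```

Unlike `cut`, this version drops the leading comment line, so the result starts directly with the header row.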

#### **3.4) Find differentially expressed genes with DESeq2**
> Scripts from: Raiko, Mike (2021). Scripts for RNA-seq project. figshare. Software. https://doi.org/10.6084/m9.figshare.14239304.v1<br>
> With custom parameters

**Calculate metrics**

This script generates the following files:
1) `result.txt` will contain calculated metrics for our genes
2) `norm-matrix-deseq2.txt` will contain normalised counts that we will use in visualisation

In [None]:
! cat simple_counts.txt | R -f R_scripts/deseq2.r

**Draw heatmap**

In [None]:
! cat norm-matrix-deseq2.txt | R -f R_scripts/draw-heatmap.r

**Draw volcano plot**

In [None]:
! cat result.txt | R -f R_scripts/draw-volcanoplot.r

### **4) Result interpretation**

#### **4.1) Ordinary Gene ontology (GO)**

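In the `result.txt` file genes are sorted by adjusted p-value. So let’s take the first 50 genes from this file with `head`, keep only the first column (gene IDs) with `cut`, and strip the ID prefix (everything up to the first `-`) with a second `cut`:

In [7]:
# -f 2- keeps everything after the first "-", so gene names that contain a hyphen are not truncated
! head -n 50 result.txt | cut -f 1 | cut -d "-" -f 2- > GO/genes.txt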

Go to http://www.yeastgenome.org/cgi-bin/GO/goSlimMapper.pl<br>
For top 50 differentially expressed genes:
1) Press `Choose file` and upload `genes.txt`
2) Select `Yeast GO-Slim: Process`
3) Make sure `SELECT ALL TERMS` is highlighted. Press `Submit Form`
4) Try to interpret these results

#### **4.2) ShinyGO**

To get a "state-of-the-art" visualisation I want to use `ShinyGO`.

We need 3 files:
1) List of `all` significantly differentially expressed genes (DEGs)
2) List of significantly `downregulated` genes
3) List of significantly `upregulated` genes

In [3]:
import pandas as pd

##### **4.2.1) All significant DEGs**

In [26]:
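df = pd.read_csv('result.txt', sep='\t')

# significant DEGs: |log2FoldChange| > 2 and adjusted p-value < 0.05
all_genes = df[(abs(df['log2FoldChange']) > 2) & (df['padj'] < 0.05)]
all_genes = all_genes.drop("Unnamed: 0", axis=1)

# drop the 5-character prefix from each gene ID
gene_ids_all = all_genes["id"]
gene_ids_all = [i[5:] for i in gene_ids_all]

# header=False keeps pandas from writing a column label as the first line
pd.Series(gene_ids_all).to_csv("ShinyGO/all_genes.txt", index=False, header=False)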

##### **4.2.2) `Downregulated` genes**

In [21]:
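# downregulated genes have a negative fold change, so the threshold is -2
downreg_genes = all_genes[all_genes['log2FoldChange'] < -2]

gene_ids_down = downreg_genes["id"]
gene_ids_down = [gene_id[5:] for gene_id in gene_ids_down]

# header=False keeps pandas from writing a column label as the first line
pd.Series(gene_ids_down).to_csv("ShinyGO/downreg_genes.txt", index=False, header=False)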

##### **4.2.3) `Upregulated` genes**

In [22]:
upreg_genes = all_genes[all_genes['log2FoldChange'] > 2]

gene_ids_up = upreg_genes["id"]
gene_ids_up = [gene_id[5:] for gene_id in gene_ids_up]

# header=False keeps pandas from writing a column label as the first line
pd.Series(gene_ids_up).to_csv("ShinyGO/upreg_genes.txt", index=False, header=False)
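
The three cells above apply the same thresholds and differ only in the sign of the fold change, so the split can be sketched as one function. This is a minimal illustration, assuming the `log2FoldChange` and `padj` columns used above. `split_degs` is a hypothetical helper and the input rows are made up.

```python
import pandas as pd


def split_degs(results: pd.DataFrame, lfc: float = 2.0, alpha: float = 0.05):
    """Return (all, down, up) significant genes for the given thresholds."""
    significant = results[
        (results["log2FoldChange"].abs() > lfc) & (results["padj"] < alpha)
    ]
    # After the |log2FoldChange| filter, the sign alone separates the two groups
    down = significant[significant["log2FoldChange"] < 0]
    up = significant[significant["log2FoldChange"] > 0]
    return significant, down, up


# Made-up rows for illustration; real input would be the table read from result.txt
toy = pd.DataFrame(
    {
        "id": ["gene-A", "gene-B", "gene-C", "gene-D"],
        "log2FoldChange": [3.1, -2.6, 0.4, -4.0],
        "padj": [0.001, 0.01, 0.2, 0.3],
    }
)
all_degs, down, up = split_degs(toy)
print(down["id"].tolist(), up["id"].tolist())
```

Rows with a missing `padj` fail the `< alpha` comparison and are therefore left out of all three lists.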

Go to http://bioinformatics.sdstate.edu/go/<br>
1) Press `Select a species` and search for _Saccharomyces cerevisiae_
2) Paste the list of `downregulated` genes into the main window
3) Paste the list of `all significantly DEGs` into the `Background` window
4) Press `Submit`