# Large omics mock exam

The official exam will take 3 hours; this mock exam takes only 2 hours. Questions will be similar to those in this exam, but the official exam will have a number of additional theoretical questions. 

**Note:** 

* The exam is open book, open internet. However, the use of any communication tool (chat, mail, etc.) is strictly forbidden - if you use one, you will automatically fail the exam.
* You are allowed to use GitHub during the exam - but do not post any comments without consulting me (for example to correct an obvious mistake).
* For all questions - please provide comments describing what you are planning to do. Even if you get stuck - add comments describing the follow-up steps. If I understand your thought process, I can still give you a partial score.

You will be expected to upload the following files to a Toledo Mock Exam Assignment (the mock exam will not be graded - but you can upload anyway to practice). Please upload:

* This ipython notebook with your answers. (download using `Jupyter menu / File / Download as / Notebook (.ipynb)`) 
* An HTML copy of this notebook (download using `Jupyter menu / File / Download as / HTML (.html)`) - Note you must zip this file prior to upload, Toledo does not allow html file uploads.
* Exercise 1:
    * Your new Snakemake file (`Snakefile`)
    * `indels.0.png`
* Exercise 3:
    * Variant impact plots for NOTCH1 and OLFM1

Do not leave uploading files to the last minute - the assignment will automatically close. You are allowed multiple uploads - last one counts.


Best of luck, Mark

## Preparation

### Data required

All information you need for the exam can be found in the `mock_exam` folder. 

###  Terminal/Conda

Do your (CPU intensive) command line work in a VSC interactive session.

For **all** command line work (including snakemake) - make sure you use the correct conda environment by running the following in your shell:

    export PATH=/data/leuven/306/vsc30690/miniconda3/bin:$PATH
    
You can check if you have the correct environment active by running:

    which python
    
Which should yield `/data/leuven/306/vsc30690/miniconda3/bin/python`


### Jupyter

Ensure you use the correct kernel for the jupyter work! You can confirm you have the correct kernel by running (in python):

    import sys
    sys.executable
    
Which should yield `/data/leuven/306/vsc30690/miniconda3/bin/python`

In [None]:
import sys
sys.executable

### Import a few modules
(do not forget to execute the cell below!)

In [None]:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


## Question 1 - Snakemake

In your exam folder you will find a snakemake folder containing the workflow definition (`snakemake/Snakefile`). The Snakefile is exactly the workflow we discussed during this course. The workflow has almost completely executed (except for the snpEff step).

The objective of this question is to add a new analysis step to the Snakemake file.

The Snakemake file will have to run [`bcftools stats`](https://samtools.github.io/bcftools/bcftools.html#stats) and subsequently [`plot-vcfstats`](https://samtools.github.io/bcftools/bcftools.html#plot-vcfstats) to create a number of plots visualizing a number of statistics from the VCF file. 

**Note**:

 * You must add **a new [rule](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html)** to the `Snakefile` - not adapt an existing rule!
 * Ensure the new rule gets automatically executed when running Snakemake without defining a rule.
 * Ensure the stats are executed on the annotated vcf file generated by the snpEff step.
 * Please upload your new `Snakefile` and the generated `indels.0.png` to the Toledo assignment.
 * Make sure all generated output ends up in a dedicated subfolder (e.g. `060.stats`). Copy the output of `ls -lt` on the new snakemake stats output folder in the markdown cell below.

## Question 2 - Extending the SNP database

In the exam folder you will find a notebook called `ParseVCF.ipynb` that was used to create the database `exam.sqlite` - also in the exam folder. Check the `ParseVCF.ipynb` file to see how this database was created. In particular, note the snp identifier format that we use to link tables (in the `snp` column).

In [None]:
dbfile = 'exam.sqlite'
db = sqlite3.connect(dbfile)
pd.read_sql('SELECT * FROM snp LIMIT 5', db)

In the exam folder you will also find a file called `dbsnp.tsv` which contains the dbSNP rs-ids for our vcf file. The first few lines look like this:

    chrom  pos        ref  alt  dbsnp
    chr9   127578816  C    T    rs4240419
    chr9   127578974  A    G    rs4240420
    chr9   127579080  A    G    rs4240421
    chr9   127663498  C    T    rs7036307
    chr9   127674824  G    T    None
    chr9   127679143  G    T    None

   
Can you load the `dbsnp.tsv` file as a pandas DataFrame, create a SNP id in exactly the same format as in the rest of the database, and save this into a **new table** into the database?
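
As a starting point, the load / build-id / save steps could be sketched as follows. This is only an illustration under stated assumptions: the `chrom:pos:ref:alt` id format and the table name `dbsnp_demo` are hypothetical (the real id format is the one defined in `ParseVCF.ipynb`), and the sketch uses a small in-memory stand-in for `dbsnp.tsv` and an in-memory database so that `exam.sqlite` is not touched.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for dbsnp.tsv: two of the lines shown above, tab-separated.
sample_tsv = (
    "chrom\tpos\tref\talt\tdbsnp\n"
    "chr9\t127578816\tC\tT\trs4240419\n"
    "chr9\t127674824\tG\tT\tNone\n"
)

# Treat the literal string "None" as a missing value.
dbsnp = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=["None"])

# ASSUMPTION: hypothetical id format chrom:pos:ref:alt.
# Replace this with the exact format used in ParseVCF.ipynb.
dbsnp["snp"] = (
    dbsnp["chrom"]
    + ":" + dbsnp["pos"].astype(str)
    + ":" + dbsnp["ref"]
    + ":" + dbsnp["alt"]
)

# Write to a throwaway in-memory database under a hypothetical table name.
demo_db = sqlite3.connect(":memory:")
dbsnp[["snp", "dbsnp"]].to_sql("dbsnp_demo", demo_db, index=False, if_exists="replace")

# Read the table back to confirm what was stored.
stored = pd.read_sql("SELECT * FROM dbsnp_demo", demo_db)
```

Reading the table back, as in the last line, is a cheap way to confirm that the ids look the same as those in the `snp` column of the existing tables before you rely on them for a join.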


**Question:** With this new table, can you write a SQL query to find the dbSNP ID of the single HIGH impact variant in the `FAM166A` gene? 
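
The general shape of such a query is a join on the shared `snp` identifier plus a filter on gene and impact. The sketch below shows that pattern on two tiny in-memory tables. Everything in it is made up for illustration: the table names (`snp_effect_demo`, `dbsnp_demo`), the column names `gene` and `impact`, the ids and the gene name are all hypothetical, so check the real schema of `exam.sqlite` before adapting it.

```python
import sqlite3

import pandas as pd

demo_db = sqlite3.connect(":memory:")

# Hypothetical effect table: one row per (snp, gene, impact).
pd.DataFrame({
    "snp": ["chr9:100:A:G", "chr9:200:C:T"],
    "gene": ["GENE_A", "GENE_A"],
    "impact": ["HIGH", "LOW"],
}).to_sql("snp_effect_demo", demo_db, index=False)

# Hypothetical rs-id table keyed on the same snp identifier.
pd.DataFrame({
    "snp": ["chr9:100:A:G", "chr9:200:C:T"],
    "dbsnp": ["rs_demo_1", None],
}).to_sql("dbsnp_demo", demo_db, index=False)

# Join on the shared identifier, then filter with bound parameters.
query = """
    SELECT e.snp, e.gene, e.impact, d.dbsnp
    FROM snp_effect_demo AS e
    JOIN dbsnp_demo AS d ON d.snp = e.snp
    WHERE e.gene = ? AND e.impact = ?
"""
hits = pd.read_sql(query, demo_db, params=("GENE_A", "HIGH"))
```

Passing the gene and impact through `params` rather than formatting them into the SQL string keeps the query reusable for other genes.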


## Question 3 - Visualization

Given our annotated database in `exam.sqlite` with tables `snp`, `snp_call` and `snp_effect` (the database we created during class):

Can you create a plot showing the distribution of effect types for a given gene name?

* Please formulate this as a function with the name of the gene as an argument (expand the skeleton below)
* Save the generated plot as a PNG image.
* Create plots for the genes `NOTCH1` and `OLFM1`, upload the plots to the Toledo Assignment.
* Some SNPs have multiple effects at the same time (for example: `non_coding_transcript_exon_variant`) - you can plot these as they are - but extra credit for those who can separate these.
* Note - read this github issue: https://github.com/I0U19A-Large-Omics/Q-A/issues/41 
* Warning - if you think you damaged the sqlite database in the previous step - you can always get a new copy from: 

In [None]:
dbfile = 'exam.sqlite'
db = sqlite3.connect(dbfile)

In [None]:
def plot_gene(gene_name):
    print(f"Processing gene: {gene_name}")
    
    # write the rest of the code here...
    
    
plot_gene("NOTCH1")

In [None]:
plot_gene("OLFM1");
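
For the extra-credit part of this question, one possible approach to separating combined effects is sketched below. It assumes the effect strings follow the snpEff convention of joining multiple effects with `&`. The effect values here are made-up demo data rather than results from `exam.sqlite`, and the output file name is hypothetical.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd

# Made-up effect annotations standing in for one gene's query result.
effects = pd.Series([
    "missense_variant",
    "intron_variant",
    "splice_region_variant&intron_variant",
    "synonymous_variant",
    "intron_variant",
])

# Split combined annotations on "&" and give each part its own row.
single_effects = effects.str.split("&").explode()
counts = single_effects.value_counts()

# Horizontal bars keep long effect names readable.
fig, ax = plt.subplots(figsize=(6, 3))
counts.plot.barh(ax=ax)
ax.set_xlabel("number of annotations")
ax.set_title("Effect types (demo data)")
fig.tight_layout()

# Save as PNG; a temporary folder is used here to avoid cluttering the exam folder.
out_png = os.path.join(tempfile.mkdtemp(), "demo_effects.png")
fig.savefig(out_png, dpi=150)
plt.close(fig)
```

Inside `plot_gene`, the demo series would be replaced by the effect column returned from a query for `gene_name`, and the file name could be derived from the gene name so that each gene gets its own PNG.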