# Welcome to the Sequencing project for *Theme 05 - Genomics*

In this project you will learn the bioinformatics basics of analysing *Next-Generation Sequencing* (NGS) data from patients diagnosed with [cardiomyopathy](https://en.wikipedia.org/wiki/Cardiomyopathy), in order to discover variations in their genome. The end goal is to identify and report on those genomic variations that are possibly disease causing. All course material can be found in this document: some theory, a *lot* of links to resources, questions and (programming) assignments.

<img src="pics/cardiomyopathy.png" style="float: right;" width="400">
As with many diseases, one of the causes of cardiomyopathy is a genetic mutation or a combination of mutations. According to Wikipedia, the following forms of cardiomyopathy have a genetic basis:

* **Hypertrophic** cardiomyopathy
* Arrhythmogenic right ventricular cardiomyopathy (ARVC)
* LV non-compaction
* Ion Channelopathies
* **Dilated** cardiomyopathy (DCM)
* **Restrictive** cardiomyopathy (RCM)

The human genetics department of the University Medical Center Groningen diagnoses patients suspected to suffer from cardiomyopathy. During the diagnosis, the patient's genome is compared to a set of reference genes known to be involved in the disease (called a *gene panel*). If variations are found, they are checked and compared to known variants to classify or score their *severity*. Combined with regular data sources (physical exam, EKG, etc.), these variations are used to diagnose the type of the disease (dilated, restrictive, etc.) and how severe it is.

## Tools

In this project we will work with many different tools to replicate the genetic diagnosis process. Some are available for download or are already installed on the computer you are now using; others we are going to create ourselves! All of these tools perform important steps in the analysis process, which involve:
* checking the quality of input data
* mapping the data to a reference genome (comparing with 'known' data)
* finding variations with respect to the reference and
* scoring the found variations on the probability that they are disease causing
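
To make the first three steps listed above a bit more concrete, here is a minimal Python sketch on made-up data. Everything in it is hypothetical and only meant as an illustration: the reference sequence, the two reads, the quality threshold and the function names are all invented here, and the naive 'slide the read along the reference' approach is not how the real tools work. The fourth step (scoring) is left out because it depends on external knowledge about known variants.

```python
# Toy illustration of quality checking, mapping and variant finding.
# All data and names are made up; real tools are far more sophisticated.

REFERENCE = "ACGTTGCAAGGCTTACGATCGGATCCTA"  # hypothetical reference sequence


def passes_quality(qualities, min_mean=20):
    """Step 1: keep a read only if its mean base quality is high enough."""
    return sum(qualities) / len(qualities) >= min_mean


def map_read(read, reference):
    """Step 2: slide the read along the reference and return the
    (position, mismatch_count) of the placement with the fewest mismatches."""
    best_pos, best_mismatches = None, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


def find_variants(read, reference, pos):
    """Step 3: list (reference_position, reference_base, read_base) for
    every base where the mapped read differs from the reference."""
    return [
        (pos + i, reference[pos + i], base)
        for i, base in enumerate(read)
        if base != reference[pos + i]
    ]


# Two hypothetical reads: (sequence, per-base quality scores)
reads = [
    ("GCAAGGCTAACG", [30] * 12),  # differs from the reference at one base
    ("ATCGGATC", [8] * 8),        # low quality, filtered out in step 1
]

for sequence, qualities in reads:
    if not passes_quality(qualities):
        continue
    position, _ = map_read(sequence, REFERENCE)
    print(sequence, "maps at", position,
          find_variants(sequence, REFERENCE, position))
```

Note that positions are counted from 0 here, as is usual in Python; many bioinformatics file formats count from 1 instead.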

Normally these steps involve many separate tools which need to be run on the commandline. For this course, however, we will be using a [*workflow manager*](https://en.wikipedia.org/wiki/Scientific_workflow_system) in which most of the tools are available and can be joined together to form logical steps in the analysis process. The workflow manager used in the course is **Galaxy** ([wikipedia][1], [website](http://galaxyproject.org/)), but other workflow managers exist, such as [CLC Bio](https://www.qiagenbioinformatics.com/products/clc-main-workbench/), [Taverna](http://www.taverna.org.uk) or [NextGENe](http://www.softgenetics.com/NextGENe.php). It is very likely that you will encounter one of these workflow managers in your future professional career, as many scientific laboratories do their biomedical research with the help of tools organised in such workflow managers.

Besides the advantage of coupling multiple tools together into a *workflow*, Galaxy makes often hard-to-use *commandline tools* accessible to a large audience by offering simple *graphical interfaces*.

Actually one of the reasons that you are following this course is to become proficient in also using these commandline tools and even create your own Galaxy tool(s) so that non-technical researchers can use them!

[**Chapter 1**](http://nbviewer.jupyter.org/urls/bitbucket.org/mkempenaar/diagnosticgenomeanalysis/raw/master/chapters/01_galaxy_introduction.ipynb) of this course starts by explaining what Galaxy is and how you can use it to analyze your data, but first we will introduce and discuss the data we will be working with.

## Data

As bioinformaticians we often do not *create* the data we analyze ourselves; instead it comes from a lab which - for this project - has a **sequencer**. This sequencer (an Illumina MiSeq ([youtube](https://youtu.be/womKfikWlxM))) generates sequencing data from a biological sample.

The first step in performing a so-called *sequencing run* is the sample preparation. For this project this phase is used to *filter* the isolated DNA so that only genes of interest (consisting of one or more **exons** or EXpressed regiONs) are kept; this method is called [*exome sequencing*](https://en.wikipedia.org/wiki/Exome_sequencing). All DNA not included in exons (somewhere around **98%** of all DNA) is not sequenced. Therefore, of the total of ~3.2 billion basepairs, only about 50 million basepairs are actually sequenced when all ~20,000 genes are included, and **in our case 320,000 basepairs for a set of 50 genes from the *cardiomyopathy gene panel* (that's only 0.01% of the total genome size)**.
<img src="pics/exons.png" width="500">
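
The percentages mentioned above are easy to check yourself. The short Python sketch below simply redoes the arithmetic using the approximate sizes given in the text; the variable and function names are made up for this example.

```python
# Approximate sizes taken from the text above (in basepairs)
GENOME_SIZE = 3_200_000_000  # ~3.2 billion basepairs in the human genome
EXOME_SIZE = 50_000_000      # ~50 million basepairs in all exons together
PANEL_SIZE = 320_000         # basepairs covered by the 50-gene panel


def percentage(part, whole):
    """Return 'part' as a percentage of 'whole'."""
    return 100 * part / whole


print(f"exome: {percentage(EXOME_SIZE, GENOME_SIZE):.2f}% of the genome")
print(f"panel: {percentage(PANEL_SIZE, GENOME_SIZE):.2f}% of the genome")
```

The exome comes out at roughly 1.6% of the genome, which matches the statement that around 98% of all DNA is not sequenced, and the gene panel at 0.01%.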

The actual data that we will be using is stored in relatively simple text files containing the sequenced letters (A, C, G and T) along with some data primarily used to describe the quality of each sequenced base. Unfortunately these files do not contain just the complete sequences of either genes or exons, but *millions* of short sequences (~150 basepairs each, called a sequence **read**) in no particular order. The challenge with using this data to answer our initial question (which specific variations (mutations) are responsible for causing this disease) is to find out *where* each of these sequences originates, so that we can compare the patient's sequence to the sequence of a healthy individual. With this comparison we can find out if and where any variation is present and thus begin answering our question.
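
To give an idea of what such a text file looks like, the sketch below takes apart a single read in Python. It assumes the data is in the FASTQ format with Phred+33 quality encoding, which is the usual output of Illumina sequencers but is an assumption here; the record itself (identifier, bases and quality characters) is completely made up and much shorter than a real ~150 basepair read.

```python
# One made-up FASTQ record, which always consists of four lines:
# identifier, bases, separator and one quality character per base.
record_lines = [
    "@read_1 hypothetical identifier",
    "GATTACAGATTACA",
    "+",
    "IIIIIHHHGGFF#!",
]

header, bases, separator, quality_string = record_lines

# Assumption: Phred+33 encoding, so quality = ASCII code of the character - 33
qualities = [ord(char) - 33 for char in quality_string]

# A Phred quality Q corresponds to an error probability of 10^(-Q/10)
for base, quality in zip(bases, qualities):
    error_probability = 10 ** (-quality / 10)
    print(f"{base}  Q={quality:2d}  P(error)={error_probability:.4f}")

mean_quality = sum(qualities) / len(qualities)
print(f"mean quality of {header[1:]}: {mean_quality:.1f}")
```

In this example the last two bases have a very low quality, which is the kind of thing the quality-checking step of the analysis looks for.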

[**Chapter 2**](http://nbviewer.jupyter.org/urls/bitbucket.org/mkempenaar/diagnosticgenomeanalysis/raw/master/chapters/02_data.ipynb) of this course dives a little deeper into the actual data you will be using with an example and a small exercise.

## Analysis

This course consists of a number of documents like this one that should be worked through in order. The first section of each of these documents shows where in the analysis you are and which steps are next. It also shows which tools you are going to use and whether any programming assignments are included.

You will note that this document for the first week is not very long nor does it contain too many assignments. This is on purpose since the goal of this week is primarily for you to understand:
* what the goal of this course is
    * what is cardiomyopathy?
    * what question(s) do we want to answer?
* the tools we will be using
    * what is this Galaxy website?
        * follow a tutorial to get familiar with it
    * what kind of tools are available in Galaxy?
* the data which we will use throughout this course
    * how is this data generated?
    * which analysis steps are needed to answer our question?

### Important Note
If you are good at scanning documents you can easily spot the actual assignments in the first chapter and complete them in under two hours (note: this is not a challenge!). However, if I were to ask you to answer or explain some of the above questions, you would probably have a hard time. In short, make sure that the general theory of what is shown here is clear by the end of this chapter. Follow the links to external resources, use Google or (and this works pretty well) search for some of the terms or techniques on YouTube. Without this knowledge you will manage to follow the steps during the first few weeks, but you will surely struggle later on when you need to make decisions on your own. Your final grade is not based on how you complete your assignments, but on the level of understanding that you show in your end products.

Intermediate (graded) quizzes might be given to test your knowledge! Also, we can begin each new chapter by first discussing what we did in the previous chapter.

## Literature

There is no explicit book or other text that you will need to read, but there is a lot of online material available to use when you encounter unknown terms or concepts. 
The article titled '[Review of Current Methods, Applications, and Data Management for the Bioinformatics Analysis of Whole Exome Sequencing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4179624/)' discusses the complete process of analyzing raw sequencing data for variant analysis and while it goes beyond the scope of this project it is a good read.

## Reporting

As you work through these documents you are asked to answer questions, perform analysis steps (and report on the outcomes), etc. This documenting process can be regarded as creating a digital *lab-journal* where you keep track of your progress and findings. In a previous email you have been invited to join the Microsoft *OneNote Digital Classroom*; this is where you will keep and edit your digital journal. The teacher(s) have read-access to this material and can keep track of your progress and comment on your work. When the course ends, your complete report will be graded.

There are no imposed rules on how to report your progress, since it is not comparable with a normal written report (no chapters called 'introduction', 'results', etc.). Just try to keep it organized and clearly state which assignment your results belong to. Do note however that when you use figures and tables (please do!), you should provide them with a clear caption to explain what you are showing. You will receive feedback after the first assignments that will also advise on your way of logging.

### Create a Report

Start by opening OneNote and creating a new document that you will use throughout this course. Add a chapter for *Week 1* and use this for all tasks/assignments from chapter 1.

## End Products

<img src="pics/courseOverview.png" style="float: right;" width="550">

The end products of this course are the following two items:
* **The analysis**, consisting of a collection of files containing (graded as one):
    * the digital lab-journal
    * any self-written programs (Python files)
    * database design and SQL implementation
    * one or more shared Galaxy 'histories'
    * one or more shared Galaxy 'workflows'
* **The introduction chapter** and a list of references.
    * Instead of a complete report you are asked to write a single chapter. A feedback round is available before the final deadline.

Currently the deadline for delivering these products is Friday the **1st of November**, which is the last day before the exams. Note that this date might be adjusted (deadline extended) later on; you will be notified if this happens.

[1]: https://en.wikipedia.org/wiki/Galaxy_(computational_biology)