<img src="pics/genepanel.png" style="float: right;" width="400">
## 2 The data
For this project we have real patient data from five patients with <strong>cardiomyopathy</strong>. The patient data consists of next-generation Illumina MiSeq reads from captured exomes for a panel of 50 genes (CARDIO panel, see table and the [NCBI](https://www.ncbi.nlm.nih.gov/gtr/tests/GTR000500470/overview/) product page) which are likely involved in the disease. The total length of the captured exomes for these 50 genes is 320,000 bases.

### 2.1 FASTQ
As mentioned before, a large portion of the data produced in computational biology comes from so-called *Next-Generation Sequencers*. These machines read DNA or RNA material and write the sequences to a file (you will learn more about the machines and techniques in the module *Theory of Bioinformatics*).

A file format that you will often encounter is the **[FASTQ](https://en.wikipedia.org/wiki/FASTQ_format)** format. You can recognize such a file by its extension **.fastq** or **.fq**. A FASTQ file is a plain text file that stores the called bases of each read together with their predicted quality.

Below you will find two (shortened) reads coming from an Illumina next-generation sequencer. Keep in mind that a typical run on an NGS machine produces *millions* of these short sequences!

**Read 1**:

<font color="red">@M01785:20:000000000-A3F6F:1:1101:16810:1655 1:N:0:2</font>
<font color="blue">NTCATGTACGGTCAGGATGGACGCACTCAACATTTTCAAGTTATTACTCCTTCAACTCAAAACTCCAGAAGTACACTAAATCATATATGTTGTTTTCT</font> ...<br />
`+` <font color="green"><br />#>>1A1B3B11BAEFFBECA0B000EEGFCGBFGGHH2DEGGHGFFFGFFHHFGBGEFFFFFGGEGBF1BCFFE2BGFHBGHGHFF2FFFGHHHHHH</font> ...

**Read 2**:

<font color="red">@M01785:20:000000000-A3F6F:1:1101:12839:1664 1:N:0:2</font>
<font color="blue">TATATCTATGTCATTTTTTTCTCAATAATACTAAGAGAAAGAAGGCAACTCAAGGATCCTATTAATCCTTTAGAATTTCTACTTAAATCTCACATCCATTA</font> ...<br />
`+` <font color="green"><br />1>1AFFFD3DDDGGGGGGGGGHF3FDFGFHHFB1110FF10000FGGGHHDC110FEGGBGHFFHFHHHHGBFHHHHHHHHHGHHFFHHHHHHH</font> ...

Each read consists of:
<ul>
<li><font color="red">A header line describing the machine the run was performed on, the chip identifier and the coordinates on the chip where the read was sequenced</font>
<li><font color="blue">The actual read sequence</font>
<li>The plus sign
<li><font color="green">The predicted quality of each base in the read, encoded as one ASCII character per base</font>
</ul>
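
To make the four-line structure concrete, here is a minimal Python sketch that splits FASTQ text into records. It assumes every record occupies exactly four lines (no wrapped sequences) and uses the first 10 bases of Read 1 as input. The names `FastqRecord` and `parse_fastq` are made up for this illustration and do not come from any library.

```python
from itertools import islice
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """One read: header (without '@'), bases and quality characters."""
    header: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ lines, assuming exactly four lines per read."""
    # Strip line endings and skip empty lines (e.g. a trailing newline).
    cleaned = (s for s in (line.rstrip("\r\n") for line in lines) if s)
    while True:
        chunk = list(islice(cleaned, 4))
        if not chunk:
            return
        if len(chunk) != 4:
            raise ValueError("incomplete FASTQ record")
        header, sequence, plus, quality = chunk
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a valid FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# The first 10 bases of Read 1 from the text above.
example = """@M01785:20:000000000-A3F6F:1:1101:16810:1655 1:N:0:2
NTCATGTACG
+
#>>1A1B3B1
"""

records = list(parse_fastq(example.splitlines()))
for record in records:
    print(record.header)
    print(record.sequence, record.quality)
```

Because the function is a generator, it could also be fed an open file handle, so a file with millions of reads would not have to fit in memory at once.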

The main difference from the widely used [FASTA](https://en.wikipedia.org/wiki/FASTA_format) format is the addition of the **Quality** line (green). So given these quality characters, how do we determine whether the above sequences are *good* sequences? Each character in the quality line encodes a numerical quality *[score](https://en.wikipedia.org/wiki/Phred_quality_score)*; the numeric value of the character can be looked up in a so-called ASCII table.

Let's take a closer look at the first sequence from above. The first 10 bases are: `NTCATGTACG` and the quality characters for these bases are: `#>>1A1B3B1`. We can now look up these quality characters in the ASCII table. <img src="pics/ASCII.png">
<br />
If we look, for example, at the first character `#`, we find the value **35** in the ASCII table. For *Illumina* reads we have to subtract 33 from this value, so we end up with 35 - 33 = 2. The score for the first base N is therefore **2**. This score is called the *Phred* score. Let's also look at the Phred score for the second base T, which has the quality character `>`. The `>` character translates to the value **62**. Again we subtract 33 from this value to calculate the Phred score: 62 - 33 = **29**.
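
The same lookup can be done in Python, where the built-in `ord()` returns the ASCII value of a character. The sketch below assumes the offset of 33 described above; the function name `phred_scores` is only chosen for this illustration.

```python
def phred_scores(quality, offset=33):
    """Convert a string of quality characters to a list of Phred scores."""
    return [ord(char) - offset for char in quality]


# Quality characters of the first two bases (N and T) of Read 1.
print(ord("#"), ord(">"))   # 35 62
print(phred_scores("#>"))   # [2, 29]
```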

What does the quality score really mean? The score indicates the *probability that the base call is erroneous*. The quality (Phred) score Q is logarithmically related to the probability P of an incorrect base call:
$ Q = -10 \log_{10} P $ or $ P = 10^{(-Q/10)} $. The probability that our first base was incorrectly called is therefore: $$ Q = 2 \rightarrow P = 10^{(-2/10)} \rightarrow P \approx 0.63 $$

which equates to **63%**. To find the probability that the first base was *correct*, we subtract that number from 1. So the probability that the first base was correct is `1 - 0.63 = 0.37`, or 37%. The probability that the second base was correctly called is $ 1 - (10^{-29/10}) = 0.9987 $, or 99.87% accuracy. As a rule of thumb, a *Phred score* of 30 or higher (an accuracy of at least 99.9%) is considered acceptable; neither of the first two bases reaches that threshold.
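
As a small sketch, the conversion from Phred score to error probability and accuracy can be written in Python as follows. The function names are chosen for this illustration only, and only the two scores worked out above are used.

```python
import math


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def phred_from_probability(p):
    """Inverse relation: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (2, 29):
    p = error_probability(q)
    print(f"Q={q}: P={p:.4f}, accuracy={1 - p:.2%}")
# Q=2: P=0.6310, accuracy=36.90%
# Q=29: P=0.0013, accuracy=99.87%
```

Note that the 37% accuracy quoted for the first base is the rounded value of 36.90%.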

### 2.2 Assignment 1: FASTQ Format

Complete the table below.

 <table style="width:100%">
  <tr>
  <td>Base</td>
    <td>N</td>
    <td>T</td>
    <td>C</td>
    <td>A</td>
    <td>T</td>
    <td>G</td>
    <td>T</td>
    <td>A</td>
    <td>C</td>
    <td>G</td>
  </tr>
    <tr>
  <td>Quality char</td>
    <td>#</td>
    <td>></td>
    <td>></td>
    <td>1</td>
    <td>A</td>
    <td>1</td>
    <td>B</td>
    <td>3</td>
    <td>B</td>
    <td>1</td>
  </tr>
    <tr>
  <td>Numerical score (ASCII value - 33)</td>
    <td>2</td>
    <td>29</td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
      <tr>
  <td>Base call accuracy (1 - P)</td>
    <td>37%</td>
    <td>99.87%</td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table> 