# Finding a disease mutation

In this tutorial, we will identify a disease-causing mutation in the sequencing data of Dr. James Lupski (https://www.nejm.org/doi/full/10.1056/NEJMoa0908094).

    
<center><img src="https://signature.bcm.edu/images/uploaded/full/1449086752427.jpeg" width=320></center>    

<center>https://en.wikipedia.org/wiki/James_R._Lupski</center>


## Overview of variant calling workflow:

![](images/workflow.png)

For this tutorial, we will use a smaller reference genome (chromosome 5) for quicker processing, and a small subset of the input DNA sequences from Dr Lupski.

Let's take a look at the contents of the directory:

In [None]:
ls -lh

Let us look at the two files that we will be using:

- `chr5.fa` - the human reference genome (chromosome 5)
- `input.fq` - the query sequences

Note: We are using a trimmed-down Illumina exome dataset from Dr. James Lupski (`SRR866988.sra`), which has a disease-causing mutation on chromosome 5.

### Taking a peek at the reference fasta file

In [None]:
head chr5.fa

The FASTA format is quite simple. The first line is the identifier, which starts with `>`.

The subsequent lines are DNA sequence. Here we see `N`s, which mean that the bases at those positions are unknown.
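As a minimal sketch of working with this format, the function below counts the header lines, the total bases, and the unknown (`N`) bases in a FASTA file. The name `fasta_summary` and the toy input are made up for illustration; on the tutorial data you would pass `chr5.fa` instead.

```shell
# Hypothetical helper (not part of any tool): summarise a FASTA file.
fasta_summary() {
    local fasta=$1 headers bases unknown
    # Header lines start with '>'.
    headers=$(grep -c '^>' "$fasta")
    # All other lines are sequence; drop newlines before counting characters.
    bases=$(grep -v '^>' "$fasta" | tr -d '\n' | wc -c)
    # Keep only N/n characters and count them.
    unknown=$(grep -v '^>' "$fasta" | tr -cd 'Nn' | wc -c)
    echo "sequences=$((headers)) bases=$((bases)) unknown=$((unknown))"
}

# Toy input so the example runs on its own.
toy_fa=$(mktemp)
printf '>toy\nNNNNACGT\nACGTNN\n' > "$toy_fa"
fasta_summary "$toy_fa"
rm -f "$toy_fa"
```

For a whole chromosome this reads the file twice, which is acceptable for a quick check but not tuned for speed.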

We can also take a look at 10 lines of DNA sequence starting from the 100,000th line of the reference file. Here we use the `tail` command to list all lines from line number 100,000 onwards, then pass that output to the `head` command to show only its first 10 lines.

In [None]:
tail -n+100000 chr5.fa | head -n 10 
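To see how the `tail`/`head` combination selects a window of lines, here is a small sketch on a stand-in input: `seq 1 20` prints the numbers 1 to 20, one per line, in place of a real file. The `sed` form is an equivalent alternative, shown only for comparison.

```shell
# tail -n +5 starts output at line 5 (inclusive); head -n 3 keeps the first 3 of those.
seq 1 20 | tail -n +5 | head -n 3

# The same window with sed: print only lines 5 through 7.
seq 1 20 | sed -n '5,7p'
```

Both commands print the lines 5, 6 and 7.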

### Taking a look at the query sequence file (first 4 lines)

In [None]:
head -n4 input.fq

### One sequence in a FASTQ file consists of 4 lines

- Line 1 - sequence identifier (starts with @)
- Line 2 - DNA sequence
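- Line 3 - separator line (starts with +, optionally followed by the same sequence identifier)
- Line 4 - quality scores, one character per base (Phred score 0-93, stored as the ASCII character with code score + 33)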
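Because every record occupies exactly four lines, the number of reads is the line count divided by four. A minimal sketch, assuming an uncompressed FASTQ file whose sequences are not wrapped across lines; `count_reads` and the toy records are made up for illustration:

```shell
# Hypothetical helper: number of reads = number of lines / 4.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Two toy records so the example runs on its own.
toy_fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nII5!\n' > "$toy_fq"
count_reads "$toy_fq"   # prints 2; use input.fq for the tutorial data
rm -f "$toy_fq"
```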

The following characters encode the quality scores, from lowest (0) to highest (93):

<pre> !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ </pre>

For more information, see https://en.wikipedia.org/wiki/FASTQ_format
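The offset-33 encoding can be reversed by subtracting 33 from each character's ASCII code. The sketch below does this in bash for a short quality string; `phred_scores` is a made-up name, and the input `II5!` is an invented example rather than a read from `input.fq`.

```shell
# Hypothetical helper: convert a Phred+33 quality string into numeric scores.
phred_scores() {
    local qual=$1 i ch ascii
    local out=()
    for ((i = 0; i < ${#qual}; i++)); do
        ch=${qual:i:1}
        # A leading single quote makes printf emit the character's ASCII code.
        ascii=$(printf '%d' "'$ch")
        out+=("$((ascii - 33))")
    done
    echo "${out[*]}"
}

phred_scores 'II5!'   # I=73, 5=53, !=33, so this prints: 40 40 20 0
```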


## Checking the quality of the FASTQ sequences

It is good practice to check the quality of the sequences by plotting the quality (Q) scores by position along the read. In general, a Q score above 30 (an estimated error rate below 1 in 1,000 bases) is good.
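A Phred score Q corresponds to an estimated base-calling error probability P through Q = -10 log10(P), so P = 10^(-Q/10). As a quick sketch, the helper below (the name `phred_to_error` is made up for illustration) evaluates that formula with `awk` for a few scores:

```shell
# Hypothetical helper: error probability for a given Phred score, P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Print a small lookup table: score, then error probability.
for q in 10 20 30 40; do
    printf 'Q%s\t%s\n' "$q" "$(phred_to_error "$q")"
done
```

Q30 comes out as 0.001, which is why it is a common threshold for a good base call.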

To generate a plot, we will use FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), which is run with the `fastqc` command.

In [None]:
fastqc input.fq

In [None]:
ls -lh

We have now generated an HTML report (`input_fastqc.html`) that we can open to see the quality scores.

## Building the index for alignment

![](images/workflow-index.png)

Before we can align the query sequences, we need to build an index of the reference for the aligner. In this case, the reference is the `chr5.fa` file.

One of the most popular programs for alignment is BWA (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2705234/)

BWA uses the Burrows-Wheeler transform to build a compact, searchable index of the reference. This speeds up the alignment process, allowing millions of sequences to be aligned to a reference genome in a reasonable amount of time.
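To give a feel for the transform itself, here is a toy version in bash: append an end marker `$`, list every rotation of the string, sort the rotations, and read off the last character of each. This is only an illustration of the definition on a short string; it is not how BWA computes or stores its index, and `bwt` is a made-up function name.

```shell
# Toy Burrows-Wheeler transform (illustrative only; far too slow for a genome).
bwt() {
    local s="$1\$" i   # append '$' as the end-of-string marker
    # 1. emit every rotation, 2. sort them byte-wise, 3. keep each last character
    for ((i = 0; i < ${#s}; i++)); do
        printf '%s\n' "${s:i}${s:0:i}"
    done |
        LC_ALL=C sort |
        awk '{ printf "%s", substr($0, length($0)) } END { print "" }'
}

bwt banana   # prints annb$aa
```

The output groups identical characters together (`nn`, `aa`), which is the property that makes the transformed text both compressible and quick to search.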


We will run `bwa` without arguments to see what options are available.

In [None]:
bwa

Prior to alignment, the reference genome must be indexed. Indexing the full human genome (~4 GB) may take several hours, so to speed things up we index only chromosome 5, where the disease mutation is located.

In [None]:
bwa index chr5.fa

The indexing process generates several files (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`), each prefixed by the name of the input reference file (in this case, `chr5.fa`).
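Before moving on to alignment, it can be handy to confirm that all five index files are present. A minimal sketch, assuming the index sits next to the reference file; `check_bwa_index` is a made-up helper, not a `bwa` subcommand:

```shell
# Hypothetical helper: succeed only if every expected BWA index file exists.
check_bwa_index() {
    local ref=$1 ext missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -f "$ref.$ext" ]; then
            echo "missing: $ref.$ext" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_bwa_index chr5.fa; then
    echo "index complete"
else
    echo "index incomplete; run: bwa index chr5.fa"
fi
```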

In [None]:
ls -lh