<center>
<h1>Workshop 2: Quality Control (QC)</h1><br><br>
<i><big>Preparing and filtering data for genome-wide association studies.</big><br><br>
Hatzikotoulas Konstantinos (Kostas) (konstantinos.hatzikotoulas@helmholtz-munich.de)</i>
</center>

## Objectives

In this workshop, you will learn the data quality assessment and control steps that are typically carried out in genome-wide association studies (GWAS).

## Why do we need QC?

Study design and errors in genotype calling can introduce systematic biases into GWAS, leading to spurious associations.
Thorough QC helps us identify samples and markers that should be removed before association analysis, in order to minimise the number of false-positive and false-negative associations.

<img src="FN.png" width="40%">

In this tutorial, we assume that the study has been designed appropriately, and that QC is applied to genotypes after they have been called from probe intensity data.

## QC protocol

The QC protocol of a GWAS is usually split into two broad categories, “Sample QC” and “Variant QC”.
Sample QC is done prior to Variant QC because we want to maximise the number of markers remaining in the study.

We will be using <code>plink</code> to run the QC and R to visualise the results.
The <code>plink</code> manual is available [here](https://www.cog-genomics.org/plink/1.9/) and the command reference [here](https://www.cog-genomics.org/plink/1.9/index).


## Sample QC

Sample QC consists of (at least) five steps:

<img src="QCsteps.png" width="40%">


<br>
Initial setup:
<br>

<br>

<div class="alert alert-warning">
<b>Location:</b> The data for this workshop can be found in this directory: <code>/home/volos/data/Workshop2_QC</code>. 

</div>


```bash
# Make your own directory (replace "Kostas" with your name) and move into it.
mkdir QC_Kostas
cd QC_Kostas

# Test that plink runs.
plink --help
# Test that R runs for you, then type q() and answer n to exit R.
R

# Set the path of your directory and the prefix of the initial un-QCed files.
# Replace the example path below with the path to your own directory.
DIR=/home/volos/QC_Kostas
FILE=VSS

# Copy the initial un-QCed data to your working directory.
cd "$DIR"
cp -p /home/volos/data/Workshop2_QC/Step1/VSS.* ./
```


### Step_1: Individuals with outlying missing genotype rate

#### Call rate

##### Calculate genome-wide missingness across the file

```
plink --bfile $DIR/$FILE --missing --out $DIR/$FILE-missing

```
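
The per-sample report written by `--missing` (the `.imiss` file) can also be screened on the command line before plotting. The sketch below assumes the standard plink 1.9 column layout, in which the sixth column (`F_MISS`) holds each sample's missing genotype rate. The demo file and its values are made up for illustration, and the 0.02 cut-off simply mirrors the `--mind 0.02` threshold used in this workshop.

```bash
# Synthetic stand-in for the .imiss report (made-up values, same column layout).
tmp=$(mktemp -d)
cat > "$tmp/demo.imiss" <<'EOF'
 FID  IID MISS_PHENO N_MISS N_GENO F_MISS
 F1   S1  N  100  10000 0.01
 F2   S2  N  350  10000 0.035
 F3   S3  N  200  10000 0.02
EOF

# Print FID and IID of samples whose missing rate (column 6, F_MISS) exceeds 0.02.
# NR > 1 skips the header line.
low_cr=$(awk 'NR > 1 && $6 > 0.02 { print $1, $2 }' "$tmp/demo.imiss")
echo "$low_cr"

rm -r "$tmp"
```

To run this on the workshop data, point `awk` at `$DIR/$FILE-missing.imiss` instead of the demo file.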

##### Produce a log file listing the samples excluded at a call rate (CR) threshold of 0.98, to check against the R result

```
plink --bfile $DIR/$FILE --mind 0.02 --make-bed --out $DIR/$FILE-mind0.02

```

### Step_2: Individuals with discordant sex information

#### Sex check

##### Run sex checking

```
plink --bfile $DIR/$FILE --check-sex --out $DIR/$FILE-sexcheck

```
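
The `.sexcheck` report can be filtered for discordant samples directly. The sketch below assumes the standard plink 1.9 layout (`FID IID PEDSEX SNPSEX STATUS F`), where `STATUS` is `PROBLEM` when the sex recorded in the `.fam` file disagrees with the sex inferred from the X chromosome inbreeding coefficient `F` (by default plink calls females below 0.2 and males above 0.8). The demo rows are made up.

```bash
# Synthetic stand-in for the .sexcheck report (made-up values, same column layout).
tmp=$(mktemp -d)
cat > "$tmp/demo.sexcheck" <<'EOF'
 FID IID PEDSEX SNPSEX STATUS F
 F1 S1 1 1 OK 0.98
 F2 S2 2 1 PROBLEM 0.95
 F3 S3 2 2 OK 0.03
 F4 S4 1 0 PROBLEM 0.45
EOF

# Keep only rows flagged PROBLEM and print IDs, recorded sex, inferred sex and F.
sex_problems=$(awk 'NR > 1 && $5 == "PROBLEM" { print $1, $2, $3, $4, $6 }' "$tmp/demo.sexcheck")
echo "$sex_problems"

rm -r "$tmp"
```

On the workshop data the input would be `$DIR/$FILE-sexcheck.sexcheck`.
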
##### Extract X chromosome SNPs

```
plink --bfile $DIR/$FILE --chr 23 --make-bed --out $DIR/$FILE-xchr

```

##### Run missingness on X chromosome SNPs

```
plink --bfile $DIR/$FILE-xchr --missing --out $DIR/$FILE-xchr-missing

```

### Step_3: Individuals with outlying heterozygosity rate

#### Heterozygosity

##### Extract autosomal SNPs

```
plink --bfile $DIR/$FILE --autosome --make-bed --out $DIR/$FILE-chr1-22

```

##### Extract SNPs with minor allele frequency (MAF) greater than/equal to 1%

```
plink --bfile $DIR/$FILE-chr1-22 --maf 0.01 --make-bed --out $DIR/$FILE-chr1-22-mafgte0.01

```

##### Extract SNPs with MAF less than 1%

```
plink --bfile $DIR/$FILE-chr1-22 --exclude $DIR/$FILE-chr1-22-mafgte0.01.bim --make-bed --out $DIR/$FILE-chr1-22-mafless0.01

```
##### Calculate missingness to plot against heterozygosity in R

```
plink --bfile $DIR/$FILE-chr1-22-mafless0.01 --missing --out $DIR/$FILE-chr1-22-mafless0.01-missing
plink --bfile $DIR/$FILE-chr1-22-mafgte0.01 --missing --out $DIR/$FILE-chr1-22-mafgte0.01-missing

```

##### Convert both to ped/map files for the heterozygosity script

```
plink --bfile $DIR/$FILE-chr1-22-mafless0.01 --recode --out $DIR/$FILE-chr1-22-mafless0.01-recode
plink --bfile $DIR/$FILE-chr1-22-mafgte0.01 --recode --out $DIR/$FILE-chr1-22-mafgte0.01-recode

```

##### Run the heterozygosity script

```
cp /home/volos/data/Workshop2_QC/Step3/calc_het.pl $DIR/
perl calc_het.pl -f VSS-chr1-22-mafgte0.01-recode.ped
perl calc_het.pl -f VSS-chr1-22-mafless0.01-recode.ped

```
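The column headers in the Summary file produced by the script contain spaces. Edit the header line so that each column name is a single word before reading the file into R:

<code>ID total num_hom num_het Percent_hom Percent_het</code>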
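
Heterozygosity outliers are usually judged from the R plots, but a quick numeric screen is also possible. The sketch below assumes the Summary file has the header shown above, with `Percent_het` in the sixth column, and flags samples more than three standard deviations from the mean. The three-SD rule is a common convention rather than a threshold given in this workshop, and the demo file is synthetic.

```bash
tmp=$(mktemp -d)

# Synthetic summary: 19 typical samples and one extreme sample (made-up values).
awk 'BEGIN {
  print "ID total num_hom num_het Percent_hom Percent_het"
  for (i = 1; i <= 19; i++) {
    het = 30 + (i % 3) * 0.5
    printf "S%d 1000 %d %d %.1f %.1f\n", i, 1000 - het * 10, het * 10, 100 - het, het
  }
  print "S20 1000 600 400 60.0 40.0"
}' > "$tmp/demo_summary.txt"

# Two passes over the same file: the first accumulates the sum and sum of squares
# of Percent_het, the second prints samples outside mean +/- nsd standard deviations.
het_outliers=$(awk -v nsd=3 '
  NR == FNR { if (FNR > 1) { n++; sum += $6; sumsq += $6 * $6 }; next }
  FNR == 1  { mean = sum / n; sd = sqrt((sumsq - n * mean * mean) / (n - 1)); next }
  $6 < mean - nsd * sd || $6 > mean + nsd * sd { print $1, $6 }
' "$tmp/demo_summary.txt" "$tmp/demo_summary.txt")
echo "$het_outliers"

rm -r "$tmp"
```

The variable `nsd` is a hypothetical name for the number of standard deviations; lower it for a stricter screen.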


### Step_4: Duplicated or related individuals

#### Relatedness/Duplicates

##### Pair-wise IBD to look for duplicates, using only variants with MAF ≥1%, excluding complex regions, and LD pruning at R-squared 0.2

##### Exclude complex regions

```
plink --bfile $DIR/$FILE-chr1-22-mafgte0.01 --exclude /home/volos/data/Workshop2_QC/Step4/complex_regions.txt --range --make-bed --out $DIR/$FILE-chr1-22-mafgte0.01-noCR

```
##### LD prune

```
plink --bfile $DIR/$FILE-chr1-22-mafgte0.01-noCR --indep 50 5 1.25 --out $DIR/$FILE-chr1-22-mafgte0.01-noCR-pruning
plink --bfile $DIR/$FILE-chr1-22-mafgte0.01-noCR --extract $DIR/$FILE-chr1-22-mafgte0.01-noCR-pruning.prune.in --make-bed --out $DIR/$FILE-chr1-22-mafgte0.01-noCR-LDpruned0.2

```
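
The third argument of `--indep` is a variance inflation factor (VIF) threshold, not an R-squared value. The two are related by VIF = 1 / (1 - R²), so a VIF of 1.25 corresponds to the R-squared of 0.2 quoted in this workshop. A small check of that arithmetic (the variable names are illustrative only):

```bash
# Convert a VIF threshold to the equivalent R-squared: R2 = 1 - 1/VIF.
vif=1.25
r2=$(awk -v vif="$vif" 'BEGIN { printf "%.2f", 1 - 1 / vif }')
echo "VIF $vif corresponds to R-squared $r2"
```

Note that this R-squared is the multiple correlation of a variant with the other variants in the window, which is not the same quantity as the pairwise r² used by `--indep-pairwise`.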
##### Pair-wise IBD

```
plink --bfile $DIR/$FILE-chr1-22-mafgte0.01-noCR-LDpruned0.2 --genome --out $DIR/$FILE-chr1-22-mafgte0.01-noCR-LDpruned0.2-genome

```
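
Pairs of interest can be pulled out of the `.genome` report by their estimated proportion of IBD sharing. The sketch below assumes the standard plink 1.9 layout, in which `PI_HAT` is the tenth column. The cut-offs of 0.9 (duplicates or monozygotic twins) and 0.2 (related pairs) are commonly used values, not thresholds specified in this workshop, and the demo rows are made up.

```bash
# Synthetic stand-in for the .genome report (made-up values, same column layout).
tmp=$(mktemp -d)
cat > "$tmp/demo.genome" <<'EOF'
 FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
 F1 S1 F2 S2 UN NA 1.0000 0.0000 0.0000 0.0000 -1 0.75 0.50 2.0
 F1 S1 F3 S3 UN NA 0.0000 0.0000 1.0000 1.0000 -1 1.00 1.00 NA
 F2 S2 F4 S4 UN NA 0.2500 0.5000 0.2500 0.5000 -1 0.85 1.00 9.0
EOF

# Report pairs with PI_HAT above the relatedness cut-off and label them
# by whether they also exceed the duplicate cut-off.
pairs=$(awk -v dup=0.9 -v rel=0.2 'NR > 1 && $10 > rel {
  label = ($10 > dup) ? "duplicate_or_twin" : "related"
  print $2, $4, $10, label
}' "$tmp/demo.genome")
echo "$pairs"

rm -r "$tmp"
```

On the workshop data the input would be `$DIR/$FILE-chr1-22-mafgte0.01-noCR-LDpruned0.2-genome.genome`.
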
##### Run space_to_tab.pl on the genome result

```

perl /home/volos/data/Workshop2_QC/Step4/space_to_tab.pl VSS-chr1-22-mafgte0.01-noCR-LDpruned0.2-genome.genome

```

### Step_5: Ethnicity outliers

#### Ethnicity MDS distance matrix

For this step, you will need to merge your data with 1000 Genomes genotype data (or HapMap genotype data).
Merging is a time-consuming procedure, so please use the following pre-merged files:

```
cp /home/volos/data/Workshop2_QC/Step5/VSS-1Kg.bed $DIR/

cp /home/volos/data/Workshop2_QC/Step5/VSS-1Kg.bim $DIR/

cp /home/volos/data/Workshop2_QC/Step5/VSS-1Kg.fam $DIR/

```

```

FILE2=VSS-1Kg

```

##### Pair-wise IBD: extract autosomal variants with MAF ≥1%, excluding complex regions

```

plink --bfile $DIR/$FILE2 --autosome --maf 0.01 --exclude /home/volos/data/Workshop2_QC/Step4/complex_regions.txt --range --make-bed --out $DIR/$FILE2-chr1-22-mafgte0.01-noCR

```

##### LD prune using R-squared 0.2.

```
plink --bfile $DIR/$FILE2-chr1-22-mafgte0.01-noCR --indep 50 5 1.25 --out $DIR/$FILE2-chr1-22-mafgte0.01-noCR-pruning
plink --bfile $DIR/$FILE2-chr1-22-mafgte0.01-noCR --extract $DIR/$FILE2-chr1-22-mafgte0.01-noCR-pruning.prune.in --make-bed --out $DIR/$FILE2-chr1-22-mafgte0.01-noCR-LDpruned0.2

```

##### Pair-wise IBD

```

plink --bfile $DIR/$FILE2-chr1-22-mafgte0.01-noCR-LDpruned0.2 --genome --out $DIR/$FILE2-chr1-22-mafgte0.01-noCR-LDpruned0.2-genome
perl /home/volos/data/Workshop2_QC/Step4/space_to_tab.pl VSS-1Kg-chr1-22-mafgte0.01-noCR-LDpruned0.2-genome.genome

```

##### MDS distance matrix, calculating the first 10 components

```

plink --bfile $DIR/$FILE2-chr1-22-mafgte0.01-noCR-LDpruned0.2 --read-genome $DIR/$FILE2-chr1-22-mafgte0.01-noCR-LDpruned0.2-genome.genome --cluster --mds-plot 10 --out $DIR/$FILE2-chr1-22-mafgte0.01-noCR-LDpruned0.2-genome-mds

```

Now we need to check our results against the R plots.