# Imputation accuracy

After imputation, we can test the accuracy by comparing the imputed genotypes to the 'true' genotypes. Here, this means comparing the imputed low-coverage sample to the same sample at high coverage. We will do this for sample KOS028, which has a coverage of 10x and was then downsampled to 0.1x.

First, let's create the conda environment for this session. The yml file is available on the GitHub page (https://github.com/lm-ut/Workshop_25/) as well as in this folder: /gpfs/helios/projects/echo_workshops/project.1.tk/conda_env

In [None]:
cd
cp /gpfs/helios/projects/echo_workshops/project.1.tk/conda_env/SnpSift.yml .
conda env create -f SnpSift.yml

In [None]:
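# Activate by environment name, not by file name.
# The name is set in the 'name:' field of SnpSift.yml (assumed here to be SnpSift).
conda activate SnpSift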
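
Before continuing, it can help to confirm that the environment is active and the tool is on the PATH. The sketch below uses a small helper, `check_tool`, which is a hypothetical name introduced here for illustration and is not part of the workshop scripts.

```shell
# Report whether a command is available on the current PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: NOT found"
  fi
}

check_tool SnpSift   # the command used in the cells below
```

If this prints `SnpSift: NOT found`, the environment is probably not active; `conda env list` shows the environments that exist and their names.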

In [None]:
mkdir -p SnpSift_QUILT_raw
cd SnpSift_QUILT_raw

cp /gpfs/helios/projects/echo_workshops/project.1.tk/data/KOS028/files_for_SnpSift/KOS028_QUILT_high_cov.vcf .
cp /gpfs/helios/projects/echo_workshops/project.1.tk/data/KOS028/files_for_SnpSift/KOS028_QUILT_low_cov_raw.vcf .
cp /gpfs/helios/projects/echo_workshops/project.1.tk/scripts/SnpSift.sh .

SnpSift concordance KOS028_QUILT_high_cov.vcf KOS028_QUILT_low_cov_raw.vcf > comparison.txt
bash SnpSift.sh *.by_sample.txt

cat KOS028.matrix

The result of this script is a matrix that cross-tabulates the imputed genotypes against the 'true' genotypes. Copy this matrix into an Excel file and calculate the accuracy metrics explained in the slides. These statistics can be calculated for heterozygous genotypes, but also for homozygous reference and homozygous alternative genotypes.

What is the heterozygous sensitivity of the raw imputed sample?

As a reminder, below are the formulas for the different accuracy metrics.

- Sensitivity: TP / (TP + FN)
- Specificity: TN / (TN + FP)
- Accuracy: (TP + TN) / Total
- Precision: TP / (TP + FP)
- False Positive Rate: 1 - Specificity
- False Negative Rate: 1 - Sensitivity
- Non-reference concordance: (TP(het) + TP(alt)) / (Total - TP(ref))
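
As an alternative to Excel, the formulas above can be sketched on the command line with `awk`. The counts below are hypothetical, chosen only to illustrate the arithmetic for one genotype class (here heterozygous); they are not results for KOS028. The function names `metrics` and `nrc` are introduced here for illustration only.

```shell
# One-vs-rest counts for the heterozygous class:
#   TP: true het imputed as het        FN: true het imputed as ref or alt
#   FP: true ref/alt imputed as het    TN: true ref/alt imputed as ref or alt
# Usage: metrics TP FN FP TN
metrics() {
  LC_ALL=C awk -v tp="$1" -v fn="$2" -v fp="$3" -v tn="$4" 'BEGIN {
    total = tp + fn + fp + tn
    printf "sensitivity\t%.4f\n",         tp / (tp + fn)
    printf "specificity\t%.4f\n",         tn / (tn + fp)
    printf "accuracy\t%.4f\n",            (tp + tn) / total
    printf "precision\t%.4f\n",           tp / (tp + fp)
    printf "false_positive_rate\t%.4f\n", fp / (tn + fp)   # 1 - specificity
    printf "false_negative_rate\t%.4f\n", fn / (tp + fn)   # 1 - sensitivity
  }'
}

# Non-reference concordance.
# Usage: nrc TP_het TP_alt Total TP_ref
nrc() {
  LC_ALL=C awk -v het="$1" -v alt="$2" -v total="$3" -v ref="$4" \
    'BEGIN { printf "%.4f\n", (het + alt) / (total - ref) }'
}

metrics 80 20 10 890    # hypothetical counts, NOT real data
nrc 80 60 1100 900      # hypothetical counts, NOT real data
```

With these made-up counts, the heterozygous sensitivity is 80 / (80 + 20) = 0.8000 and the non-reference concordance is (80 + 60) / (1100 - 900) = 0.7000. To use the sketch on real data, replace the counts with the values read from `KOS028.matrix`.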

We used the raw imputed data above. Now let's do the same for the GP99-filtered data.

In [None]:
cd
mkdir -p SnpSift_QUILT_GP99
cd SnpSift_QUILT_GP99

cp /gpfs/helios/projects/echo_workshops/project.1.tk/data/KOS028/files_for_SnpSift/KOS028_QUILT_high_cov.vcf .
cp /gpfs/helios/projects/echo_workshops/project.1.tk/data/KOS028/files_for_SnpSift/KOS028_QUILT_low_cov_GP99.vcf .
cp /gpfs/helios/projects/echo_workshops/project.1.tk/scripts/SnpSift.sh .

SnpSift concordance KOS028_QUILT_high_cov.vcf KOS028_QUILT_low_cov_GP99.vcf > comparison.txt
bash SnpSift.sh *.by_sample.txt

cat KOS028.matrix

Copy the matrix to Excel again and calculate the same statistics. Does the heterozygous sensitivity change? Also note that there is now a lot of missingness in the data.
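
To put a number on the missingness, one option is to count missing genotypes directly in the VCF. The sketch below makes several assumptions: the VCF holds a single sample in column 10, GT is the first FORMAT field, and filtered genotypes are written as missing (`./.`, `.|.` or `.`) rather than the records being removed. The function name `missing_rate` is hypothetical, and the demo file is a made-up two-record VCF, not workshop data.

```shell
# Print: missing_count <TAB> total_records <TAB> missing_fraction
missing_rate() {
  LC_ALL=C awk -F'\t' '
    /^#/ { next }                         # skip header lines
    {
      n++
      split($10, f, ":")                  # f[1] is the GT value of the sample
      if (f[1] == "./." || f[1] == ".|." || f[1] == ".") m++
    }
    END {
      if (n > 0) printf "%d\t%d\t%.4f\n", m + 0, n, m / n
      else       print "0\t0\tNA"
    }
  ' "$1"
}

# Tiny made-up VCF: one called genotype, one missing genotype.
demo=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tDEMO\n' > "$demo"
printf '1\t100\t.\tA\tG\t.\tPASS\t.\tGT:GP\t0/1:0,1,0\n' >> "$demo"
printf '1\t200\t.\tC\tT\t.\tPASS\t.\tGT:GP\t./.:0.5,0.4,0.1\n' >> "$demo"
missing_rate "$demo"
rm -f "$demo"
```

On the workshop data, the call would be `missing_rate KOS028_QUILT_low_cov_GP99.vcf`, and running it on the raw file as well gives a baseline for comparison.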

Besides comparing different filters, we can also look at the result of different imputation tools.

In [None]:
cd
mkdir -p SnpSift_Beagle
cd SnpSift_Beagle

cp /gpfs/helios/projects/echo_workshops/project.1.tk/data/KOS028/files_for_SnpSift/KOS028_QUILT_high_cov.vcf .
cp /gpfs/helios/projects/echo_workshops/project.1.tk/data/KOS028/files_for_SnpSift/KOS028_Beagle_GP99.vcf .
cp /gpfs/helios/projects/echo_workshops/project.1.tk/scripts/SnpSift.sh .

SnpSift concordance KOS028_QUILT_high_cov.vcf KOS028_Beagle_GP99.vcf > comparison.txt
bash SnpSift.sh *.by_sample.txt

cat KOS028.matrix

Again, copy the matrix to Excel and calculate the accuracy statistics.
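
To compare the three runs side by side, the matrices can be printed one after another. This is a minimal sketch that assumes the three directories were created in the home directory, as in the cells above; `show_matrices` is a hypothetical helper name.

```shell
# Print KOS028.matrix from each listed directory under a base directory.
# Usage: show_matrices BASE_DIR DIR [DIR ...]
show_matrices() {
  base=$1
  shift
  for d in "$@"; do
    printf '== %s ==\n' "$d"
    if [ -f "$base/$d/KOS028.matrix" ]; then
      cat "$base/$d/KOS028.matrix"
    else
      echo "(no KOS028.matrix found)"
    fi
  done
}

show_matrices "$HOME" SnpSift_QUILT_raw SnpSift_QUILT_GP99 SnpSift_Beagle
```

Seeing the raw, GP99-filtered and Beagle matrices together makes it easier to spot where the counts shift between runs.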