# Task 3: Imputation

After QC, genotypes can be imputed with the Minimac3 algorithm on the Michigan Imputation Server (https://imputationserver.sph.umich.edu/) using the HRC reference panel, with SHAPEIT for haplotype phasing. After imputation, SNPs with an R2 quality estimate below 0.3 are excluded from further analyses, following the software recommendations.

To run this notebook you need the bfile dataset.b37.QCed, generated in the previous task (Population stratification).

https://rstudio-pubs-static.s3.amazonaws.com/452627_d519d1c86bd249e6a2d9638ef1ea836c.html

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preparing-the-dataset-for-imputation" data-toc-modified-id="Preparing-the-dataset-for-imputation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparing the dataset for imputation</a></span></li><li><span><a href="#Upload-vcf.gz-files-to-the-Michigan-server-(phasing:-SHAPE-IT;-reference:HRC;-mode:-QC&amp;imputation)" data-toc-modified-id="Upload-vcf.gz-files-to-the-Michigan-server-(phasing:-SHAPE-IT;-reference:HRC;-mode:-QC&amp;imputation)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Upload vcf.gz files to the Michigan server (phasing: SHAPE-IT; reference:HRC; mode: QC&amp;imputation)</a></span></li><li><span><a href="#QC-of-the-imputed-genotypes" data-toc-modified-id="QC-of-the-imputed-genotypes-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>QC of the imputed genotypes</a></span></li></ul></div>

In [1]:
%load_ext rpy2.ipython

In [2]:
%env path=/mnt/data/GWAS/output/build37/task3_imputation

env: path=/mnt/data/GWAS/output/build37/task3_imputation


##  Preparing the dataset for imputation

We will use Will Rayner's toolbox to prepare the data. (https://www.well.ox.ac.uk/~wrayner/tools/)

In [3]:
%%bash
# Determine allele frequencies in the dataset
plink --bfile /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed  --freq --out $path/dataset.b37.QCed.freq

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/build37/task3_imputation/dataset.b37.QCed.freq.log.
Options in effect:
  --bfile /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed
  --freq
  --out /mnt/data/GWAS/output/build37/task3_imputation/dataset.b37.QCed.freq

257659 MB RAM detected; reserving 128829 MB for main workspace.
7076087 variants loaded from .bim file.
496 people (237 males, 259 females) loaded from .fam.
496 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 496 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999931.
--freq: 

/mnt/data/GWAS/output/build37/task3_imputation/dataset.b37.QCed.freq.hh ); many
commands treat these as missing.


**Run the Perl script from Will Rayner's toolbox to check the PLINK .bim file against HRC/1000G for strand, ID names, positions, alleles and ref/alt assignment.**

The population used for the frequency check can be set with -p (default ALL; options: ALL, EUR, AFR, AMR, SAS, EAS).

Using a terminal (New -> Terminal from the Jupyter panel, or a terminal inside your Docker container), create a bash script containing the perl command, for example HRC.sh. Grant it execute permission (chmod +x HRC.sh) and run it (./HRC.sh). Place the script in the directory that the path environment variable points to.

Next, run from the terminal the Run-plink.sh script generated by the Perl script.

**IMPORTANT: Do not run this script on momic.us.es.** It needs 30 GB of RAM. The output of this command is stored in the directory that the 'path' environment variable points to. Users who need to impute their own data should install this pipeline locally following the User Manual.
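The script-creation step described above can be sketched as follows. This is only an illustration: the script name HRC-1000G-check-bim.pl, the reference file name HRC.r1-1.GRCh37.wgs.mac5.sites.tab and both `/path/to/...` locations are assumptions that depend on the toolbox version and on where the files were downloaded. The block only writes HRC.sh and makes it executable; it does not run it.

```shell
set -eu

# Directory where the script is placed; falls back to a temporary
# directory if the notebook's 'path' variable is not set.
workdir="${path:-$(mktemp -d)}"

# Write HRC.sh. The quoted 'EOF' keeps the variables unexpanded until
# the generated script itself is run.
cat > "$workdir/HRC.sh" <<'EOF'
#!/usr/bin/env bash
# Inputs: the QCed .bim file and the .frq file produced by plink --freq.
bim=/mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed.bim
frq=/mnt/data/GWAS/output/build37/task3_imputation/dataset.b37.QCed.freq.frq
# Hypothetical locations of the HRC sites file and the Perl script.
ref=/path/to/HRC.r1-1.GRCh37.wgs.mac5.sites.tab
tool=/path/to/HRC-1000G-check-bim.pl

# -h selects the HRC panel as the reference to check against.
perl "$tool" -b "$bim" -f "$frq" -r "$ref" -h
EOF

chmod +x "$workdir/HRC.sh"
echo "wrote $workdir/HRC.sh"
```

Once the placeholder paths are edited, the script is run as ./HRC.sh from that directory, followed by the generated Run-plink.sh.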

## Upload vcf.gz files to the Michigan server (phasing: SHAPE-IT; reference:HRC; mode: QC&imputation)

Log into the Michigan Imputation Server site and click on Run -> Michigan Imputation Server from the menu located at the top.

Select the HRC r1.1 2016 reference panel, GRCh37/hg19 as the array build, 0.3 as the rsq filter, Eagle v2.4 as the phasing algorithm, EUR as the population and "Quality Control & Imputation" as the mode; you can also select AES 256 encryption. Upload the vcf.gz files generated in the previous step and wait for a successful upload and initial QC.

When the job is completed you will receive an email with the link and password to access the results. Download them.
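The downloaded results are password-protected archives. A minimal sketch for unpacking them is shown below; it assumes one archive per autosome named chr_N.zip, that 7-Zip (`7z`) is installed, and that the password from the email is exported as the hypothetical variable IMPUTATION_PASSWORD. The variables zip_dir and out_dir are also illustrative names. Archives that are missing are skipped rather than treated as errors.

```shell
set -eu

# Hypothetical locations: where the archives were downloaded and where
# the dose files should be unpacked.
zip_dir="${zip_dir:-$PWD}"
out_dir="${out_dir:-${path:-$PWD}/imputed_files}"
# Password from the notification email (never hard-code it in a notebook).
password="${IMPUTATION_PASSWORD:-}"

extracted=0
skipped=0
for chr in $(seq 1 22); do
  archive="$zip_dir/chr_${chr}.zip"
  if [ -f "$archive" ] && [ -n "$password" ]; then
    # -p supplies the password, -o sets the output directory.
    7z x -p"$password" -o"$out_dir" "$archive"
    extracted=$((extracted + 1))
  else
    skipped=$((skipped + 1))
  fi
done

echo "extracted: $extracted, skipped: $skipped"
```

7-Zip is used here because archives created with AES 256 encryption may not be readable by every unzip implementation.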

## QC of the imputed genotypes

**Extract genotype doses**
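As a minimal illustration of this step, the sketch below applies the R2 >= 0.3 criterion and keeps only the dosage (DS) value of each sample. It runs on a small made-up VCF so that it is self-contained; the function name `filter_r2_keep_ds` and all toy values are hypothetical, and it assumes the imputed VCF carries an `R2=` entry in INFO and a `DS` entry in FORMAT. The header lines are passed through unchanged. On real data a dedicated tool such as bcftools would normally be preferable.

```shell
set -eu

# Toy VCF standing in for a chrN.dose.vcf.gz file (values are made up).
toy_vcf="$(mktemp)"
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n'
  printf '22\t16050075\t22:16050075\tA\tG\t.\tPASS\tAF=0.01;MAF=0.01;R2=0.85\tGT:DS\t0|0:0.02\n'
  printf '22\t16050115\t22:16050115\tG\tA\t.\tPASS\tAF=0.02;MAF=0.02;R2=0.12\tGT:DS\t0|1:0.9\n'
  printf '22\t16050213\t22:16050213\tC\tT\t.\tPASS\tAF=0.3;MAF=0.3;R2=0.30\tGT:DS\t1|1:1.98\n'
} > "$toy_vcf"

# Keep records with INFO R2 >= min and reduce each sample field to DS.
filter_r2_keep_ds() {
  awk -F'\t' -v OFS='\t' -v min="$1" '
    /^#/ { print; next }                 # pass header lines through
    {
      r2 = ""
      n = split($8, kv, ";")
      for (i = 1; i <= n; i++) if (kv[i] ~ /^R2=/) r2 = substr(kv[i], 4)
      if (r2 == "" || r2 + 0 < min + 0) next   # drop low-quality variants

      ds = 0
      m = split($9, fmt, ":")
      for (i = 1; i <= m; i++) if (fmt[i] == "DS") ds = i
      if (ds == 0) next                  # no dosage field in this record

      for (c = 10; c <= NF; c++) { split($c, val, ":"); $c = val[ds] }
      $9 = "DS"
      print
    }'
}

filter_r2_keep_ds 0.3 < "$toy_vcf" > "$toy_vcf.filtered"
kept="$(grep -vc '^#' "$toy_vcf.filtered")"
echo "variants kept: $kept"
```

For a compressed file the same function can read from a pipe, for example `zcat chr22.dose.vcf.gz | filter_r2_keep_ds 0.3`.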

**Generate a .fam file for PLINK from the chr22 vcf. Then update sex and pheno with the QCed fam from task 2.2 (i.e. dataset.b37.outliers)**

In [9]:
%%bash
plink --vcf $path/imputed_files/chr22.dose.vcf.gz --make-bed --out $path/imputed_files/chr22.dose


PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose.log.
Options in effect:
  --make-bed
  --out /mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose
  --vcf /mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose.vcf.gz

257659 MB RAM detected; reserving 128829 MB for main workspace.
--vcf: 218k variants complete.
/mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose-temporary.bed
+
/mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose-temporary.bim
+
/mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose-temporary.fam
written.
218642 variants loaded from .bim file.
496 people (0 males, 0 females, 496 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/mnt/data/GWAS/output/build37/task3_imputation/imputed_

In [10]:
%%bash
head $path/imputed_files/chr22.dose.fam

HGX00096 HGX00096 0 0 0 -9
HGX00097 HGX00097 0 0 0 -9
HGX00099 HGX00099 0 0 0 -9
HGX00100 HGX00100 0 0 0 -9
HGX00101 HGX00101 0 0 0 -9
HGX00102 HGX00102 0 0 0 -9
HGX00103 HGX00103 0 0 0 -9
HGX00105 HGX00105 0 0 0 -9
HGX00106 HGX00106 0 0 0 -9
HGX00107 HGX00107 0 0 0 -9
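
Because PLINK matches samples by family and individual ID when updating sex and phenotype, it can be worth confirming that both .fam files contain the same samples before the update. The sketch below does this on two toy files; the file names and the sex/phenotype values are illustrative only, and in practice the two inputs would be chr22.dose.fam and the QCed .fam from task 2.

```shell
set -eu

tmp="$(mktemp -d)"

# Toy .fam files: same samples, different order and different sex/phenotype.
printf 'HGX00096 HGX00096 0 0 0 -9\nHGX00097 HGX00097 0 0 0 -9\n' > "$tmp/imputed.fam"
printf 'HGX00097 HGX00097 0 0 2 1\nHGX00096 HGX00096 0 0 1 1\n' > "$tmp/qced.fam"

# Extract and sort the FID/IID pairs of each file.
awk '{ print $1, $2 }' "$tmp/imputed.fam" | sort > "$tmp/imputed.ids"
awk '{ print $1, $2 }' "$tmp/qced.fam" | sort > "$tmp/qced.ids"

# comm -3 prints the lines found in only one of the two sorted files,
# so an empty result means both files describe the same set of samples.
mismatches="$(comm -3 "$tmp/imputed.ids" "$tmp/qced.ids" | wc -l | tr -d ' ')"
echo "samples present in only one file: $mismatches"
```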


In [11]:
%%bash
plink --bfile $path/imputed_files/chr22.dose --make-bed --update-sex /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed.fam 3 --pheno /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed.fam  --mpheno 4 --out  $path/imputed_files/chr22.dose.for.assoc


PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose.for.assoc.log.
Options in effect:
  --bfile /mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose
  --make-bed
  --mpheno 4
  --out /mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose.for.assoc
  --pheno /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed.fam
  --update-sex /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed.fam 3

257659 MB RAM detected; reserving 128829 MB for main workspace.
218642 variants loaded from .bim file.
496 people (0 males, 0 females, 496 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose.for.assoc.nosex
.
496 phenotype values present after --pheno.
--update-sex: 496 people updated.
Using 1 thread

In [12]:
%%bash
head $path/imputed_files/chr22.dose.for.assoc.fam
wc $path/imputed_files/chr22.dose.for.assoc.fam

HGX00096 HGX00096 0 0 1 1
HGX00097 HGX00097 0 0 2 1
HGX00099 HGX00099 0 0 2 1
HGX00100 HGX00100 0 0 2 1
HGX00101 HGX00101 0 0 1 1
HGX00102 HGX00102 0 0 2 1
HGX00103 HGX00103 0 0 1 1
HGX00105 HGX00105 0 0 1 1
HGX00106 HGX00106 0 0 2 1
HGX00107 HGX00107 0 0 1 1
  496  2976 12896 /mnt/data/GWAS/output/build37/task3_imputation/imputed_files/chr22.dose.for.assoc.fam


In [13]:
%%bash
rm $path/imputed_files/chr22.dose.fam

**For the next step you need the following files:**
- chr22.dose.for.assoc.fam
- chri.dose.rsq.DS.vcf.gz