# Task 3: Imputation

After QC, genotype imputation can be performed on a public imputation server, for example with the minimac 3 algorithm at the University of Michigan server using the HRC reference panel and the SHAPEIT tool for haplotype phasing. This notebook uses the TopMed server with the TopMed r2 reference panel and Eagle v2.4 for phasing, as described below. After imputation, SNPs with an R2 quality estimate lower than 0.3 are excluded from further analyses, following the software recommendations.

To run this notebook you need the bfile dataset.b38.QCed, generated in the previous task, Population stratification.


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preparing-the-dataset-for-imputation" referrerpolicy="origin" data-toc-modified-id="Preparing-the-dataset-for-imputation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparing the dataset for imputation</a></span></li><li><span><a href="#Upload-vcf.gz-files-to-the-TopMed-server-(rsq-Filter:-0.3-;-phasing:-Eagle-v-2.4;-reference:-TopMed-r2;-QC-Frequency-Check:-vs.-TOPMed-panel;-mode:-QC&amp;imputation)" referrerpolicy="origin" data-toc-modified-id="Upload-vcf.gz-files-to-the-TopMed-server-(rsq-Filter:-0.3-;-phasing:-Eagle-v-2.4;-reference:-TopMed-r2;-QC-Frequency-Check:-vs.-TOPMed-panel;-mode:-QC&amp;imputation)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Upload vcf.gz files to the TopMed server (rsq Filter: 0.3 ; phasing: Eagle v 2.4; reference: TopMed r2; QC Frequency Check: vs. TOPMed panel; mode: QC&amp;imputation)</a></span></li><li><span><a href="#QC-of-the-imputed-genotypes" referrerpolicy="origin" data-toc-modified-id="QC-of-the-imputed-genotypes-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>QC of the imputed genotypes</a></span></li></ul></div>

In [5]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [6]:
import os

# Create the output directory for this task (no error if it already exists)
path = "/mnt/data/GWAS/output/build38/task3_imputation"
os.makedirs(path, exist_ok=True)

In [28]:
%env path=/mnt/data/GWAS/output/build38/task3_imputation

env: path=/mnt/data/GWAS/output/build38/task3_imputation


##  Preparing the dataset for imputation

We will use Will Rayner's toolbox to prepare the data. (https://www.well.ox.ac.uk/~wrayner/tools/)

In [8]:
%%bash
# Determine allele frequencies in the dataset
plink --bfile /mnt/data/GWAS/output/build38/task2.2_stratification/intermediate_datasets/dataset.b38.QCed  --freq --out $path/dataset.b38.QCed.freq

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/build38/task3_imputation/dataset.b38.QCed.freq.log.
Options in effect:
  --bfile /mnt/data/GWAS/output/build38/task2.2_stratification/intermediate_datasets/dataset.b38.QCed
  --freq
  --out /mnt/data/GWAS/output/build38/task3_imputation/dataset.b38.QCed.freq

257659 MB RAM detected; reserving 128829 MB for main workspace.
316140 variants loaded from .bim file.
495 people (237 males, 258 females) loaded from .fam.
495 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 495 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total ge

/mnt/data/GWAS/output/build38/task3_imputation/dataset.b38.QCed.freq.hh ); many
commands treat these as missing.
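As a quick sanity check on the frequency report, the sketch below parses a PLINK 1.9 `.frq` table (columns CHR, SNP, A1, A2, MAF, NCHROBS) and counts variants below a MAF threshold. The two rows in `example` are made-up illustration data, and the 0.01 threshold is only an assumption for the demo, not a cutoff used in this pipeline.

```python
import io

def read_frq(handle):
    """Parse a whitespace-delimited PLINK .frq report into a list of dicts."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        row = dict(zip(header, fields))
        # PLINK writes NA when a variant has no non-missing calls
        row["MAF"] = None if row["MAF"] == "NA" else float(row["MAF"])
        rows.append(row)
    return rows

def count_below(rows, threshold):
    """Count variants whose MAF is defined and lower than the threshold."""
    return sum(1 for r in rows if r["MAF"] is not None and r["MAF"] < threshold)

# Made-up example rows, not taken from the real dataset
example = io.StringIO(
    " CHR   SNP   A1   A2     MAF  NCHROBS\n"
    "   1  rs_a    A    G    0.25      990\n"
    "   1  rs_b    T    C   0.004      988\n"
)
rows = read_frq(example)
n_rare = count_below(rows, 0.01)
```

To use it on the real report, open `dataset.b38.QCed.freq.frq` in the output directory (PLINK appends `.frq` to the `--out` prefix) and pass the file handle to `read_frq`.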


**Run the perl script from Will Rayner's toolbox to check plink .bim files against TopMed for strand, id names, positions, alleles, ref/alt assignment.** 

Using a terminal (New->Terminal from the Jupyter panel, or a terminal inside your docker container), create a bash script named HRC.sh containing the perl command. Place the script in the directory the path environment variable points to, give it execution permission (chmod +x HRC.sh) and run it (./HRC.sh).

Next, from the terminal, run the Run-plink.sh script generated by the perl script.

**IMPORTANT: Do not run this script on momic.us.es.** It needs 30 GB of RAM to run. Its output is stored in the directory the environment variable 'path' points to. To impute your own data, install this pipeline locally following the User Manual.
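The notebook does not show the perl command itself, so the following is only a minimal sketch of how the HRC.sh wrapper could be generated. The script name `HRC-1000G-check-bim.pl` and the flags `-b` (bim file), `-f` (frequency file), `-r` (reference panel) and `-h` are assumptions based on the usage described on the toolbox page and should be checked against the version you download. `REFERENCE_PANEL.tab.gz` is a hypothetical placeholder for the reference site list.

```python
import os
import shlex
import stat
import tempfile

def write_check_script(out_dir, bim, frq, ref_panel,
                       perl_script="HRC-1000G-check-bim.pl"):
    """Write HRC.sh wrapping the perl check command and make it executable."""
    script_path = os.path.join(out_dir, "HRC.sh")
    # Flag names are assumed from the toolbox usage notes; verify before use
    command = shlex.join(
        ["perl", perl_script, "-b", bim, "-f", frq, "-r", ref_panel, "-h"]
    )
    with open(script_path, "w") as fh:
        fh.write("#!/bin/bash\n")
        fh.write(command + "\n")
    # Equivalent of chmod +x for the owner
    mode = os.stat(script_path).st_mode
    os.chmod(script_path, mode | stat.S_IXUSR)
    return script_path

# Demo in a temporary directory; file names are placeholders
demo_dir = tempfile.mkdtemp()
script = write_check_script(
    demo_dir,
    "dataset.b38.QCed.bim",
    "dataset.b38.QCed.freq.frq",
    "REFERENCE_PANEL.tab.gz",
)
with open(script) as fh:
    script_lines = fh.read().splitlines()
```

In practice `out_dir` would be the directory the path environment variable points to, and the script would then be run from the terminal as described above.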

## Upload vcf.gz files to the TopMed server (rsq Filter: 0.3 ; phasing: Eagle v 2.4; reference: TopMed r2; QC Frequency Check: vs. TOPMed panel; mode: QC&imputation)

Log into the TopMed Imputation Server site and click on Run in the menu at the top.

Select the TopMed r2 reference panel, GRCh38/hg38 as the array build, an rsq filter of 0.3, Eagle v2.4 as the phasing algorithm, EUR as the population and "Quality Control & Imputation" as the mode; you can also select AES 256 encryption. Upload the vcf.gz files generated in the previous step and wait for a successful upload and initial QC.

You will receive an email when the job is completed, with the link and password to access the results. Download them.

## QC of the imputed genotypes

In [31]:
%%bash
ls $path/Imputed_files

chr1.dose.rsq.0.3.DS.vcf.gz
chr1.dose.vcf.gz
chr1.info.gz
chr10.dose.rsq.0.3.DS.vcf.gz
chr10.dose.vcf.gz
chr10.info.gz
chr11.dose.rsq.0.3.DS.vcf.gz
chr11.dose.vcf.gz
chr11.info.gz
chr12.dose.rsq.0.3.DS.vcf.gz
chr12.dose.vcf.gz
chr12.info.gz
chr13.dose.rsq.0.3.DS.vcf.gz
chr13.dose.vcf.gz
chr13.info.gz
chr14.dose.rsq.0.3.DS.vcf.gz
chr14.dose.vcf.gz
chr14.info.gz
chr15.dose.rsq.0.3.DS.vcf.gz
chr15.dose.vcf.gz
chr15.info.gz
chr16.dose.rsq.0.3.DS.vcf.gz
chr16.dose.vcf.gz
chr16.info.gz
chr17.dose.rsq.0.3.DS.vcf.gz
chr17.dose.vcf.gz
chr17.info.gz
chr18.dose.rsq.0.3.DS.vcf.gz
chr18.dose.vcf.gz
chr18.info.gz
chr19.dose.rsq.0.3.DS.vcf.gz
chr19.dose.vcf.gz
chr19.info.gz
chr2.dose.rsq.0.3.DS.vcf.gz
chr2.dose.vcf.gz
chr2.info.gz
chr20.dose.rsq.0.3.DS.vcf.gz
chr20.dose.vcf.gz
chr20.info.gz
chr21.dose.rsq.0.3.DS.vcf.gz
chr21.dose.vcf.gz
chr21.info.gz
chr22.dose.bed
chr22.dose.bim
chr22.dose.fam
chr22.dose.for.assoc.log
chr22.dose.for.assoc.nosex
chr22.dose.log
chr22.dose.nosex
chr22.dose.rsq.0.3.DS.v

In [13]:
%%bash
# Unzip the chromosome files. Provide the password (in double quotes) supplied by the imputation server
for i in {1..22}
do
    unzip -P "XshQEGXh_v8Vy9" "$path/Imputed_files/chr_$i.zip" -d "$path/Imputed_files"
done

Archive:  /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr_1.zip
  inflating: /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr1.dose.vcf.gz  
  inflating: /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr1.info.gz  
Archive:  /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr_2.zip
  inflating: /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr2.dose.vcf.gz  
  inflating: /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr2.info.gz  
Archive:  /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr_3.zip
  inflating: /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr3.dose.vcf.gz  
  inflating: /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr3.info.gz  
Archive:  /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr_4.zip
  inflating: /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr4.dose.vcf.gz  
  inflating: /mnt/data/GWAS/output/bu
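Because the unzip log is long, it can help to confirm afterwards that every chromosome produced both expected files. The sketch below is one way to do that, assuming the `chr{i}.dose.vcf.gz` and `chr{i}.info.gz` naming seen in the log above; the demo runs against a temporary folder rather than the real output directory.

```python
import os
import tempfile

def missing_imputed_files(folder, chromosomes=range(1, 23)):
    """Return the expected per-chromosome files that are absent from folder."""
    expected = []
    for i in chromosomes:
        expected.append(f"chr{i}.dose.vcf.gz")
        expected.append(f"chr{i}.info.gz")
    return [name for name in expected
            if not os.path.exists(os.path.join(folder, name))]

# Demo: a temporary folder where only chromosome 1 has been extracted
demo = tempfile.mkdtemp()
for name in ("chr1.dose.vcf.gz", "chr1.info.gz"):
    open(os.path.join(demo, name), "w").close()

missing = missing_imputed_files(demo)
```

On the real data, calling `missing_imputed_files` with the `Imputed_files` directory should return an empty list if all 22 archives were extracted.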

**Extract genotype doses**

In [20]:
%%bash
# Extract genotype doses from vcf files and generate dosage files for PLINK software.
# 1st parameter: path to extracted zip files (from previous step). Output files will be stored in this path
# 2nd parameter: boolean indicating whether the rsquare filter was applied during imputation
bash scripts/extract_dose.sh $path/Imputed_files true
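The contents of `extract_dose.sh` are not shown here. As a rough illustration of what the 0.3 R2 threshold means, the sketch below keeps variants whose `Rsq` is at least 0.3 from an info table. It assumes a tab-delimited table with a header containing `SNP` and `Rsq` columns, which should be checked against the actual `.info.gz` files; the three rows are made-up illustration data.

```python
import csv
import io

def keep_by_rsq(handle, threshold=0.3):
    """Return IDs of variants whose Rsq is at least the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        try:
            rsq = float(row["Rsq"])
        except ValueError:
            # Skip rows where Rsq is not numeric
            continue
        if rsq >= threshold:
            kept.append(row["SNP"])
    return kept

# Made-up example rows; the column layout is an assumption
example = io.StringIO(
    "SNP\tRsq\n"
    "chr22:100:A:G\t0.95\n"
    "chr22:200:C:T\t0.12\n"
    "chr22:300:G:A\t0.30\n"
)
kept = keep_by_rsq(example)
```

For a real file, open it with `gzip.open(filename, "rt")` and pass the handle to `keep_by_rsq`. Variants exactly at 0.3 are kept, since only those lower than 0.3 are excluded.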

**Generate a .fam file for PLINK from the chr22 vcf. Then update sex and pheno**

In [25]:
%%bash
plink --vcf $path/Imputed_files/chr22.dose.vcf.gz --make-bed --out $path/Imputed_files/chr22.dose


PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose.log.
Options in effect:
  --make-bed
  --out /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose
  --vcf /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose.vcf.gz

257659 MB RAM detected; reserving 128829 MB for main workspace.
--vcf: 288k variants complete.
/mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose-temporary.bed
+
/mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose-temporary.bim
+
/mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose-temporary.fam
written.
288606 variants loaded from .bim file.
495 people (0 males, 0 females, 495 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/mnt/data/GWAS/output/build38/task3_imputation/Imputed_

In [32]:
%%bash
head $path/Imputed_files/chr22.dose.fam

HGX00096 HGX00096 0 0 0 -9
HGX00097 HGX00097 0 0 0 -9
HGX00099 HGX00099 0 0 0 -9
HGX00100 HGX00100 0 0 0 -9
HGX00101 HGX00101 0 0 0 -9
HGX00102 HGX00102 0 0 0 -9
HGX00103 HGX00103 0 0 0 -9
HGX00105 HGX00105 0 0 0 -9
HGX00106 HGX00106 0 0 0 -9
HGX00107 HGX00107 0 0 0 -9


In [35]:
%%bash
plink --bfile $path/Imputed_files/chr22.dose --make-bed --update-sex /mnt/data/GWAS/output/build38/task2.2_stratification/covar_mds_sex_pheno.txt 11 --pheno /mnt/data/GWAS/output/build38/task2.2_stratification/covar_mds_sex_pheno.txt --mpheno 12 --out  $path/Imputed_files/chr22.dose.for.assoc


PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose.for.assoc.log.
Options in effect:
  --bfile /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose
  --make-bed
  --mpheno 12
  --out /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose.for.assoc
  --pheno /mnt/data/GWAS/output/build38/task2.2_stratification/covar_mds_sex_pheno.txt
  --update-sex /mnt/data/GWAS/output/build38/task2.2_stratification/covar_mds_sex_pheno.txt 11

257659 MB RAM detected; reserving 128829 MB for main workspace.
288606 variants loaded from .bim file.
495 people (0 males, 0 females, 495 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose.for.assoc.nosex
.
495 phenotype values present after --pheno.
--update-sex

In [36]:
%%bash
head $path/Imputed_files/chr22.dose.for.assoc.fam

HGX00096 HGX00096 0 0 1 1
HGX00097 HGX00097 0 0 2 1
HGX00099 HGX00099 0 0 2 1
HGX00100 HGX00100 0 0 2 1
HGX00101 HGX00101 0 0 1 1
HGX00102 HGX00102 0 0 2 1
HGX00103 HGX00103 0 0 1 1
HGX00105 HGX00105 0 0 1 1
HGX00106 HGX00106 0 0 2 1
HGX00107 HGX00107 0 0 1 1
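The column numbers in the PLINK command above can be confusing: with `--update-sex <file> n` and `--mpheno n`, PLINK 1.9 reads the value from the (n+2)th column, because the first two columns are FID and IID. So `11` points at column 13 and `12` at column 14. The sketch below mimics that lookup in plain Python. It is an illustration of the indexing only, not a replacement for PLINK, and the covariate values are placeholder zeros.

```python
def update_fam(fam_lines, covar_lines, sex_n=11, pheno_n=12):
    """Fill sex and phenotype in .fam lines from a covariate table.

    Values are read from the (n+2)th column of the covariate table,
    mirroring how PLINK 1.9 interprets --update-sex n and --mpheno n.
    """
    lookup = {}
    for line in covar_lines:
        fields = line.split()
        if len(fields) < max(sex_n, pheno_n) + 2:
            continue  # line too short to hold both columns
        # (n+2)th column in 1-based terms is index n+1 in 0-based terms
        lookup[(fields[0], fields[1])] = (fields[sex_n + 1], fields[pheno_n + 1])

    updated = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        key = (fields[0], fields[1])
        if key in lookup:
            fields[4], fields[5] = lookup[key]
        updated.append(" ".join(fields))
    return updated

fam = ["HGX00096 HGX00096 0 0 0 -9", "HGX00097 HGX00097 0 0 0 -9"]
placeholder = " ".join(["0.0"] * 10)  # stands in for columns 3 to 12
covar = [
    f"HGX00096 HGX00096 {placeholder} 1 1",
    f"HGX00097 HGX00097 {placeholder} 2 1",
]
new_fam = update_fam(fam, covar)
```

The two resulting lines have the same shape as the first two rows of chr22.dose.for.assoc.fam shown above.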


In [None]:
%%bash
rm $path/Imputed_files/chr22.dose.fam

**For the next step you need the following files:**
- chr22.dose.for.assoc.fam
- chr{i}.dose.rsq.0.3.DS.vcf.gz (one file per chromosome, i = 1..22)