# VIDUS Imputation Preparation
__Author__: Jesse Marks
**Date**: August 31, 2018

This document logs the steps taken to perform phasing and imputation on the [VIDUS](https://www.bccsu.ca/vidus/) dataset. The starting point for this analysis is the observed genotypes after quality control. The quality-controlled genotypes are oriented on the GRCh37 plus strand.

## Software and tools
The software and tools used for processing these data are:
* [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html) (MIS)
* [Amazon Web Services (AWS) - Cloud Computing Services](https://aws.amazon.com/)
    * Linux AMI
* [PLINK v1.90 beta 4.10](https://www.cog-genomics.org/plink/)
* [bgzip](http://www.htslib.org/doc/tabix.html)
* [BCF Tools](http://www.htslib.org/doc/bcftools.html)
* Windows 10 with [Cygwin](https://cygwin.com/) installed
* GNU bash version 4.2.46

## Data retrieval and organization

The genotype data (QC'd) were retrieved from AWS S3 at `s3://rti-heroin/ngc_vidus_fou/data/genotype/original/ea/`.

## Data processing
### GRCh37 strand and allele discordance check

In [None]:
# EC2 command line #
base_dir=/shared/bioinformatics/jmarks/Heroin/Vidus
mkdir -p $base_dir/genotype/original
aws s3 sync s3://rti-heroin/ngc_vidus_fou/data/genotype/original/ea/ $base_dir/genotype/original

mkdir $base_dir/1000g
ancestry="ea"
geno_dir=$base_dir/genotype/original

# write out the MAF report
for ancestry in ea;do
    for study in vidus; do
        mkdir -p $base_dir/strand_check
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bfile $geno_dir/${ancestry}_chr_all \
            --freq \
            --out $base_dir/strand_check/${ancestry}_chr_all
    done
done

# Get list of variants from all studies
for ancestry in ea; do
    cat $geno_dir/${ancestry}_chr_all.bim | \
            perl -lane 'if (($F[0]+0) <= 23) { print $F[1]; }' | \
            sort -u > $base_dir/${ancestry}_chr_all_sorted_variants.txt
done

# Calculate autosome and chrX MAFs for 1000G EUR
pop="EUR"
ancestry="ea"
for chr in {1..22}; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${pop}_${chr} \
        --script_prefix ${base_dir}/${pop}_chr${chr}.maf \
        --mem 6.8 \
        --nslots 1 \
        --priority 0 \
        --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
            --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.hap.gz \
            --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.legend.gz \
            --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
            --chr ${chr} \
            --out ${base_dir}/1000g/${pop}_chr${chr}.maf \
            --extract ${base_dir}/${ancestry}_chr_all_sorted_variants.txt \
            --keep_groups ${pop}
done

chr=23
/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name ${pop}_${chr} \
    --script_prefix ${base_dir}/${pop}_chr${chr}.maf \
    --mem 6.8 \
    --nslots 1 \
    --priority 0 \
    --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
        --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.hap.gz \
        --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz \
        --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
        --chr ${chr} \
        --out ${base_dir}/1000g/${pop}_chr${chr}.maf \
        --extract ${base_dir}/${ancestry}_chr_all_sorted_variants.txt \
        --keep_groups ${pop}


# Merge per chr MAFs for EUR
pop="EUR"
head -n 1 ${base_dir}/1000g/${pop}_chr1.maf > ${base_dir}/1000g/${pop}_chr_all.maf
tail -q -n +2 ${base_dir}/1000g/${pop}_chr{1..23}.maf \
    >> ${base_dir}/1000g/${pop}_chr_all.maf

study=vidus
# Run discordance checks for EA groups
for ancestry in ea; do
    if [ $ancestry = "ea" ]; then
        pop=EUR
    else
        pop=AFR
    fi

#    /shared/bioinformatics/software/scripts/qsub_job.sh \
#        --job_name ${ancestry}_${study}_crosscheck \
#        --script_prefix ${base_dir}/strand_check/${ancestry}_allele_discordance_check \
#        --mem 6 \
#        --nslots 4 \
#        --priority 0 \
#        --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
#            --study_bim_file ${geno_dir}/${ancestry}_chr_all.bim
#            --study_frq_file ${base_dir}/strand_check/${ancestry}_chr_all.frq
#            --ref_maf_file ${base_dir}/1000g/${pop}_chr_all.maf
#            --out_prefix ${base_dir}/strand_check/${ancestry}_allele_discordance"

    # just checking chr23 here
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${ancestry}_${study}_crosscheck \
        --script_prefix ${base_dir}/strand_check/${ancestry}_allele_discordance_check \
        --mem 6.8 \
        --nslots 2 \
        --priority 0 \
        --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
            --study_bim_file ${geno_dir}/${ancestry}_chr_all.bim
            --study_frq_file ${base_dir}/strand_check/${ancestry}_chr_all.frq
            --ref_maf_file ${base_dir}/1000g/${pop}_chr23.maf
            --out_prefix ${base_dir}/strand_check/${ancestry}_allele_discordance"
done

### Resolving allele discordances
The allele discordances are resolved by
* Flipping the strand of SNPs whose allele discordance is fixed by a strand flip
* Removing SNPs with discordant names
* Removing SNPs with discordant positions
* Removing SNPs whose allele discordance is not fixed by a strand flip
* Removing SNPs whose allele frequencies deviate strongly from the reference population (here, A/T and C/G SNPs with a frequency difference greater than 0.2)
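
The set logic behind the remove and flip lists can be sketched on toy data. This is a minimal illustration, not the pipeline itself: the report names and SNP IDs below are made up, and only the layout (one header line, SNP ID in column 2) mirrors what the commands in the next cell assume.

```shell
# Toy illustration of the remove/flip list logic; report names and SNP IDs are made up.
work=$(mktemp -d)

# Hypothetical discordance reports: one header line, SNP ID in column 2 (tab-delimited).
printf 'chr\tname\n1\trs1\n1\trs2\n1\trs3\n' > "$work/discordant_alleles"
printf 'chr\tname\n1\trs3\n' > "$work/not_fixed_by_flip"
printf 'chr\tname\n2\trs9\n' > "$work/discordant_positions"

# Print the unique SNP IDs (column 2) of a report, skipping its header.
snp_ids() { tail -n +2 "$1" | cut -f2 | sort -u; }

snp_ids "$work/discordant_alleles" > "$work/all.ids"
snp_ids "$work/not_fixed_by_flip" > "$work/unfixed.ids"

# Remove list: union of the categories that a strand flip cannot repair.
{ cat "$work/unfixed.ids"; snp_ids "$work/discordant_positions"; } \
    | sort -u > "$work/snps.remove"

# Flip list: discordant alleles minus those not fixed by flipping
# (comm -23 keeps the lines unique to the first sorted file).
comm -23 "$work/all.ids" "$work/unfixed.ids" > "$work/snps.flip"

remove_list=$(paste -sd, "$work/snps.remove")
flip_list=$(paste -sd, "$work/snps.flip")
echo "remove: $remove_list"
echo "flip:   $flip_list"
rm -r "$work"
```

Because the flip list is built as a set difference, a SNP should never appear in both lists, so the order in which PLINK applies `--exclude` and `--flip` does not matter.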

In [None]:
# EC2 command line #
base_dir=/shared/bioinformatics/jmarks/Heroin/Vidus
geno_dir=${base_dir}/genotype/original

study=vidus

# Apply filters
for ancestry in ea;do
    echo -e "\n===============\nProcessing ${study}\n"
    # Create remove list
    echo "Making remove list"
    cat <(cut -f2,2 ${base_dir}/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2) \
        <(cut -f2,2 ${base_dir}/strand_check/${ancestry}_allele_discordance.at_cg_snps_freq_diff_gt_0.2 | tail -n +2) \
        <(cut -f2,2 ${base_dir}/strand_check/${ancestry}_allele_discordance.discordant_names | tail -n +2) \
        <(cut -f2,2 ${base_dir}/strand_check/${ancestry}_allele_discordance.discordant_positions | tail -n +2) \
        <(cut -f2,2 ${base_dir}/strand_check/${ancestry}_allele_discordance.discordant_alleles_polymorphic_in_study_not_fixed_by_strand_flip | tail -n +2) | \
        sort -u > ${base_dir}/strand_check/${ancestry}_snps.remove

    # Create flip list
    echo "Making flip list"
    comm -23 <(cut -f2,2 ${base_dir}/strand_check/${ancestry}_allele_discordance.discordant_alleles | tail -n +2 | sort -u) \
        <(cut -f2,2 ${base_dir}/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2 | sort -u) \
        > ${base_dir}/strand_check/${ancestry}_snps.flip

    # Apply filters
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile $geno_dir/${ancestry}_chr_all \
        --exclude $base_dir/strand_check/${ancestry}_snps.remove \
        --flip $base_dir/strand_check/${ancestry}_snps.flip \
        --make-bed \
        --out ${geno_dir}/${ancestry}_filtered
done

### Remove monomorphic variants
MIS rejects genotype data that contain monomorphic variants. To remove them, a MAF threshold is set that is arbitrarily small yet nonzero, below the smallest nonzero MAF these data can take, so only monomorphic variants are filtered out.
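
The reasoning can be checked with a little arithmetic: with n diploid samples, the smallest nonzero MAF is one minor allele among 2n, i.e. 1/(2n), and missing genotypes only raise that bound. The sketch below is illustrative; the helper names and the sample count of 1000 are assumptions, not values from this analysis.

```shell
# Smallest nonzero MAF for n diploid samples with complete genotypes: 1 / (2n).
min_nonzero_maf() {
    awk -v n="$1" 'BEGIN { printf "%.6f\n", 1 / (2 * n) }'
}

# Succeeds when the threshold t is strictly below that lower bound, i.e. a MAF
# filter at t can only drop variants whose MAF is exactly zero.
threshold_only_drops_monomorphic() {
    awk -v n="$1" -v t="$2" 'BEGIN { exit !(t < 1 / (2 * n)) }'
}

n_samples=1000   # hypothetical; in practice count the lines of the .fam file
min_nonzero_maf "$n_samples"
if threshold_only_drops_monomorphic "$n_samples" 0.000001; then
    echo "threshold 0.000001 is below the smallest nonzero MAF"
fi
```

A threshold of 0.000001 stays below 1/(2n) for any cohort smaller than 500,000 samples.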

In [None]:
# EC2 command line #
base_dir=/shared/bioinformatics/jmarks/Heroin/Vidus
geno_dir=$base_dir/genotype/original
ancestry="ea"

# Apply filters
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --memory 2048 \
    --bfile $geno_dir/${ancestry}_filtered \
    --maf 0.000001 \
    --make-bed \
    --out $geno_dir/${ancestry}_filtered_mono

## Imputation preparation for Michigan Imputation Server
Visit the [MIS Getting Started Webpage](https://imputationserver.sph.umich.edu/start.html#!pages/help) for more information about preparing the data for upload to MIS.

### VCF File Conversion

In [None]:
# EC2 command line #
base_dir=/shared/bioinformatics/jmarks/Heroin/Vidus
geno_dir=$base_dir/genotype/original
ancestry="ea"

mkdir $base_dir/phase_prep

# Split by chr and remove any individuals with missing data for whole chr
#for chr in {1..23}; do
for chr in 23; do
    # Keep one chromosome and drop individuals missing more than 99% of its genotypes
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 4000 \
        --bfile ${geno_dir}/${ancestry}_filtered_mono \
        --chr ${chr} \
        --mind 0.99 \
        --make-bed \
        --out ${base_dir}/phase_prep/${ancestry}_chr${chr}_for_phasing 
done > chr_splitting.log

__Note__: No subjects were removed.
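
One way to confirm that no subjects were dropped is to compare the `.fam` files before and after the `--mind` filter. The sketch below uses toy `.fam` content and hypothetical helper names. PLINK also lists any removed individuals in an `.irem` file next to its output, which can serve as a cross-check.

```shell
# Count samples (one per line) in a PLINK .fam file.
count_samples() { wc -l < "$1" | tr -d ' '; }

# Print how many samples in the first .fam are absent from the second,
# matching on family ID and individual ID (columns 1 and 2).
count_removed() {
    awk 'NR == FNR { kept[$1 FS $2] = 1; next }
         !(($1 FS $2) in kept) { n++ }
         END { print n + 0 }' "$2" "$1"
}

# Toy data standing in for the .fam files before and after filtering.
work=$(mktemp -d)
printf 'F1 I1 0 0 1 -9\nF2 I2 0 0 2 -9\nF3 I3 0 0 1 -9\n' > "$work/before.fam"
cp "$work/before.fam" "$work/after.fam"

before_n=$(count_samples "$work/before.fam")
removed_n=$(count_removed "$work/before.fam" "$work/after.fam")
echo "samples before: $before_n, removed: $removed_n"
rm -r "$work"
```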

In [None]:
# EC2 command line #
phase_dir=/shared/bioinformatics/jmarks/Heroin/Vidus/phase_prep
ancestry="ea"

mkdir -p ${phase_dir}/${ancestry}

chr=23
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --memory 5000 \
    --bfile ${phase_dir}/${ancestry}_chr${chr}_for_phasing \
    --output-chr M \
    --set-hh-missing \
    --recode vcf bgz \
    --out ${phase_dir}/${ancestry}/${ancestry}_chr${chr}_final

## Upload to Michigan Imputation Server (MIS)
Transfer the per-chromosome \*.vcf.gz files to the local machine and then upload them to MIS.
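
Before uploading, it can help to check that each compressed VCF holds a single chromosome with positions in non-decreasing order. The sketch below is an illustrative check on toy data, not part of the original workflow; the function name and the toy records are assumptions.

```shell
# Succeeds when a gzip- or bgzip-compressed VCF holds a single chromosome
# whose positions are in non-decreasing order.
vcf_ready_for_upload() {
    gzip -dc "$1" | awk -F'\t' '
        /^#/ { next }                                    # skip header lines
        n++ == 0 { chrom = $1 }                          # remember first chromosome
        $1 != chrom || $2 + 0 < prev { bad = 1; exit }   # mixed chromosomes or unsorted
        { prev = $2 + 0 }
        END { exit bad + 0 }'
}

# Toy VCFs: one sorted, one with positions out of order.
work=$(mktemp -d)
header='##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
printf "${header}X\t100\trs1\tA\tG\t.\t.\t.\nX\t250\trs2\tC\tT\t.\t.\t.\n" \
    | gzip -c > "$work/sorted.vcf.gz"
printf "${header}X\t250\trs2\tC\tT\t.\t.\t.\nX\t100\trs1\tA\tG\t.\t.\t.\n" \
    | gzip -c > "$work/unsorted.vcf.gz"

if vcf_ready_for_upload "$work/sorted.vcf.gz"; then sorted_status=ok; else sorted_status=fail; fi
if vcf_ready_for_upload "$work/unsorted.vcf.gz"; then unsorted_status=ok; else unsorted_status=fail; fi
echo "sorted.vcf.gz: $sorted_status, unsorted.vcf.gz: $unsorted_status"
rm -r "$work"
```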

### Uploading parameters EA
These are the parameters that were selected on MIS.

__Name__: vidus_ea_23

__Reference Panel__: 1000G Phase 3 v5

__Input Files__: File Upload <br>

* Select Files - select the VCF files that were downloaded from the cloud to the local machine. <br>

__Phasing__: ShapeIT v2.r790 (unphased) 

__Population__: EUR

__Mode__: Quality Control & Imputation

* I will not attempt to re-identify or contact research participants.
* I will report any inadvertent data release, security breach or other data management incident of which I become aware.

# Download Imputed Data from MIS
First, download the data from the Michigan Imputation Server by following the link in the email that MIS sends when the job has finished. The linked page lists the commands for downloading the data.
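
Once the downloads finish, a quick completeness check can catch archives that are missing or empty. The sketch below assumes the archive naming used in the download script (22 autosomes plus the two chrX archives); the function names and the toy directory are illustrative.

```shell
# Print the archive names expected from MIS for this job: 22 autosomes plus the
# two chrX archives named in the download script.
expected_archives() {
    for chr in $(seq 1 22); do
        echo "chr_${chr}.zip"
    done
    echo "chr_X.no.auto_female.zip"
    echo "chr_X.no.auto_male.zip"
}

# Print every expected archive that is missing or empty in the given directory.
missing_archives() {
    expected_archives | while read -r name; do
        [ -s "$1/$name" ] || echo "$name"
    done
}

# Toy directory with one archive deliberately left out.
work=$(mktemp -d)
for name in $(expected_archives); do
    echo placeholder > "$work/$name"
done
rm "$work/chr_7.zip"

missing=$(missing_archives "$work")
echo "missing: $missing"
rm -r "$work"
```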
## EA

In [None]:
## EC2 ##
mkdir -p /shared/impute/kreek/data/genotype/imputed/{aa,ea}
cd /shared/impute/kreek/data/genotype/imputed/ea

# I put all of the download commands in a bash script so I could submit them as jobs.
cat mywget.sh
"""
wget https://imputationserver.sph.umich.edu/share/results/2e899483c134f1d2c08e3aabc4414253/chr_1.zip 
wget https://imputationserver.sph.umich.edu/share/results/ed48286121726c3bc64ddf024ec72730/chr_10.zip 
wget https://imputationserver.sph.umich.edu/share/results/1472b06d90055f4effca30c4924fbeca/chr_11.zip 
wget https://imputationserver.sph.umich.edu/share/results/b3e1f5e3fd0bfd0a69fe171e438d3c53/chr_12.zip 
wget https://imputationserver.sph.umich.edu/share/results/8bc2b433fed05c8c57e09c240aae9d38/chr_13.zip 
wget https://imputationserver.sph.umich.edu/share/results/ed5623de7f6b337dec56abce51908cf5/chr_14.zip 
wget https://imputationserver.sph.umich.edu/share/results/55b667fafaa4881bba5c285c859b6d6e/chr_15.zip 
wget https://imputationserver.sph.umich.edu/share/results/7744e3469a60729f0cfabef804cb7cf/chr_16.zip 
wget https://imputationserver.sph.umich.edu/share/results/3097bb774de0115ed454e48628ad06dc/chr_17.zip 
wget https://imputationserver.sph.umich.edu/share/results/761df4b95b0214e19d321cd327e5ac4e/chr_18.zip 
wget https://imputationserver.sph.umich.edu/share/results/fd2b2ee5136093e255bdf60f7d575bf8/chr_19.zip 
wget https://imputationserver.sph.umich.edu/share/results/dffef070d2e0da537b7b0d2b7c42d5e1/chr_2.zip 
wget https://imputationserver.sph.umich.edu/share/results/76d3e3a2e8938a080fda54c80cbbe264/chr_20.zip 
wget https://imputationserver.sph.umich.edu/share/results/27e8da7066c8dfe52c890eb1952e6dcd/chr_21.zip 
wget https://imputationserver.sph.umich.edu/share/results/f1f5a8bbdbce2b2293081f72b7c5ba6c/chr_22.zip 
wget https://imputationserver.sph.umich.edu/share/results/cd7e1fe0b57b45590ec83367bf33ffae/chr_3.zip 
wget https://imputationserver.sph.umich.edu/share/results/f405efbacb10d9243aaf3d662c6f9414/chr_4.zip 
wget https://imputationserver.sph.umich.edu/share/results/ccba882630a8cce30bf458dca68a2d09/chr_5.zip 
wget https://imputationserver.sph.umich.edu/share/results/77aeabce5acbca6e225e3ac02965eb0d/chr_6.zip 
wget https://imputationserver.sph.umich.edu/share/results/b7798d33eca337160fb28db9d855e6e0/chr_7.zip 
wget https://imputationserver.sph.umich.edu/share/results/f95778ef2f6cb1364e5f272e0b0fcf75/chr_8.zip 
wget https://imputationserver.sph.umich.edu/share/results/494429e58125aacc323dce6366f348be/chr_9.zip 
wget https://imputationserver.sph.umich.edu/share/results/877e7a71f369a30acf5933880b5edbda/chr_X.no.auto_female.zip 
wget https://imputationserver.sph.umich.edu/share/results/1c9f7d9ac07591172ecb01e76b743fd/chr_X.no.auto_male.zip 
"""

sh /shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name ea.download_kreek \
    --script_prefix kreek_imputation_download \
    --mem 5 \
    --priority 0 \
    --nslots 1 \
    --program bash mywget.sh

### Inflate imputation results
The zip files from the Michigan Imputation Server (MIS) need to be inflated before you can begin working with them. They are protected by a passcode that MIS sends by email.
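
As a hedged alternative to writing the passcode into each job script, the script can read it from an environment variable at run time. In the sketch below, `MIS_ZIP_PASSWORD` and `write_unzip_script` are hypothetical names, and the approach assumes the job scheduler passes the variable through to the job. It also covers the two chrX archives, whose names do not follow the `chr_<number>.zip` pattern.

```shell
# Write a small job script that inflates one archive. The passcode is read at
# run time from MIS_ZIP_PASSWORD (a hypothetical variable name) rather than
# being stored in the script.
write_unzip_script() {
    archive=$1
    script=$2
    {
        printf '%s\n' '#!/bin/bash' ''
        printf 'unzip -P "$MIS_ZIP_PASSWORD" %s\n' "$archive"
    } > "$script"
}

# Toy run: generate scripts for one autosome and both chrX archives.
work=$(mktemp -d)
for name in chr_22.zip chr_X.no.auto_female.zip chr_X.no.auto_male.zip; do
    write_unzip_script "$name" "$work/${name%.zip}.sh"
done

n_scripts=$(ls "$work" | wc -l | tr -d ' ')
last_line=$(tail -n 1 "$work/chr_22.sh")
echo "$n_scripts scripts written; chr_22.sh runs: $last_line"
rm -r "$work"
```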

In [None]:
### EC2 console ###
cd /shared/impute/kreek/data/genotype/imputed/ea

# inflate chr results
for f in {1..23};do
echo '#!/bin/bash' > chr_$f.sh
echo '' >> chr_$f.sh
echo 'unzip -P "\aKcPO5MYw6qr" chr_'$f'.zip' >> chr_$f.sh
done

for chr in {1..22}; do
    sh /shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name ea.inflate_chr${chr} \
    --script_prefix ea.chr${chr}_inflation \
    --mem 5 \
    --priority 0 \
    --nslots 1 \
    --program bash chr_${chr}.sh
done

rm *.sh

# Upload to S3
Uploaded to:

`s3://rti-heroin/kreek/data/genotype/imputed`