# VIDUS & IMP BioBank Phasing + Imputation
__Author__: Jesse Marks

**Date:** August 21, 2018

**GitHub Issue:** [Imputation Correction Strategy #104](https://github.com/RTIInternational/bioinformatics/issues/104)

This document logs the steps taken to perform phasing and imputation on the merged case-control datasets of [VIDUS](http://www.cfenet.ubc.ca/research/vidus) & [IMP Biobank](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000388.v1.p1). The starting point for this analysis is after quality control of observed genotypes. The quality-controlled genotypes are oriented on the GRCh37 plus strand. Based on findings from [Johnson et al.](https://link.springer.com/article/10.1007/s00439-013-1266-7), the intersection set of variants across the two studies will be used for imputation.

## Software and tools
The software and tools used for processing these data are:
* [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html) (MIS)
* [Amazon Web Services (AWS) - Cloud Computing Services](https://aws.amazon.com/)
    * Linux AMI
* [PLINK v1.90 beta 4.10](https://www.cog-genomics.org/plink/)
* [bgzip](http://www.htslib.org/doc/tabix.html)
* [BCF Tools](http://www.htslib.org/doc/bcftools.html)
* Windows 10 with [Cygwin](https://cygwin.com/) installed
* GNU bash version 4.2.46

## Data retrieval and organization
PLINK binary filesets will be obtained from AWS S3 storage.

In [None]:
mkdir -p /home/ec2-user/jmarks/heroin/VIDUS_and_IMP_BIOBANK_imputation/{vidus,imp_biobank}/genotype/original/final
cd /home/ec2-user/jmarks/heroin/VIDUS_and_IMP_BIOBANK_imputation/

aws s3 sync s3://rti-heroin/imp_biobank/data/genotype/original/final imp_biobank/genotype/original/final
aws s3 sync s3://rti-heroin/ngc_vidus_fou/data/genotype/original/ea vidus/genotype/original/final

## Data processing
### GRCh37 strand and allele discordance check

In [None]:
# EC2 command line #
cd /home/ec2-user/jmarks/heroin/VIDUS_and_IMP_BIOBANK_imputation

mkdir 1000g
ancestry="ea"
for study in {imp_biobank,vidus}; do
    mkdir ${study}/genotype/strand_check
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile ${study}/genotype/original/final/${ancestry}_chr_all \
        --freq \
        --out ${study}/genotype/strand_check/${ancestry}_chr_all
done

# Get list of variants from all studies
cat {imp_biobank,vidus}/genotype/original/final/${ancestry}_chr_all.bim | \
        perl -lane 'if (($F[0]+0) <= 23) { print $F[1]; }' | \
        sort -u > ${ancestry}_chr_all_sorted_variants.txt

 wc -l ea_chr_all_sorted_variants.txt
"""1106437 ea_chr_all_sorted_variants.txt"""


# Calculate autosome MAFs for 1000G EUR
pop="EUR"
ancestry="ea"
for chr in {1..22}; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${pop}_${chr} \
        --script_prefix ${pop}_chr${chr}.maf \
        --mem 8 \
        --nslots 2 \
        --priority 0 \
        --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
            --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.hap.gz \
            --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.legend.gz \
            --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
            --chr ${chr} \
            --out 1000g/${pop}_chr${chr}.maf \
            --extract ${ancestry}_chr_all_sorted_variants.txt \
            --keep_groups ${pop}
done

# Calculate chrX MAFs for 1000G EUR
pop="EUR"
ancestry="ea"
chr=23
/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name ${pop}_${chr} \
    --script_prefix ${pop}_chr${chr}.maf \
    --mem 7 \
    --priority 0 \
    --program perl /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
        --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.hap.gz \
        --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz \
        --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
        --chr ${chr} \
        --extract ${ancestry}_chr_all_sorted_variants.txt \
        --out 1000g/${pop}_chr${chr}.maf \
        --keep_groups ${pop}


# Merge per chr MAFs for EUR
pop="EUR"
head -n 1 1000g/${pop}_chr1.maf > 1000g/${pop}_chr_all.maf
tail -q -n +2 1000g/${pop}_chr{1..23}.maf \
    >> 1000g/${pop}_chr_all.maf

# Run discordance checks for EA group
pop="EUR"
ancestry="ea"
for study in {imp_biobank,vidus}; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${ancestry}_${study}_crosscheck \
        --script_prefix ${study}/genotype/strand_check/${ancestry}_allele_discordance_check \
        --mem 6 \
        --priority 0 \
        --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
            --study_bim_file ${study}/genotype/original/final/${ancestry}_chr_all.bim
            --study_frq_file ${study}/genotype/strand_check/${ancestry}_chr_all.frq
            --ref_maf_file 1000g/${pop}_chr_all.maf
            --out_prefix ${study}/genotype/strand_check/${ancestry}_allele_discordance"
done
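
The per-chromosome merge in the cell above relies on taking the header once and then appending only the data rows of each file. A minimal sketch of that pattern on toy files follows; the column names and values are made up for illustration and are not the actual output format of `calculate_maf_from_impute2_hap_file.pl`.

```shell
# Toy per-chromosome MAF tables; file contents are made up for illustration.
workdir=$(mktemp -d)
printf 'SNP\tMAF\nrs1\t0.10\nrs2\t0.25\n' > "${workdir}/EUR_chr1.maf"
printf 'SNP\tMAF\nrs3\t0.40\n' > "${workdir}/EUR_chr2.maf"

# Take the header row from the first file only.
head -n 1 "${workdir}/EUR_chr1.maf" > "${workdir}/EUR_chr_all.maf"

# Append the data rows of every file; -q suppresses the per-file banners
# that tail prints when it is given more than one file.
tail -q -n +2 "${workdir}/EUR_chr1.maf" "${workdir}/EUR_chr2.maf" \
    >> "${workdir}/EUR_chr_all.maf"

# One header row plus three data rows are expected in the merged table.
merged_lines=$(wc -l < "${workdir}/EUR_chr_all.maf" | tr -d ' ')
header_rows=$(grep -c '^SNP' "${workdir}/EUR_chr_all.maf")
echo "lines=${merged_lines} header_rows=${header_rows}"
```

Counting header rows in the merged file is a quick way to spot a per-chromosome job that failed and left an empty or header-only file.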

### Resolving allele discordances
The allele discordances will be resolved by:
* Flipping SNPs whose allele discordance is fixed by a strand flip
* Removing SNPs with discordant names
* Removing SNPs with discordant positions
* Removing SNPs whose allele discordance is not resolved by a strand flip
* Removing A/T and C/G SNPs whose allele frequencies differ from the reference population by more than 0.2

Given that the allele discordance check was done using a union set of SNPs across all studies within an ancestry group, some of the SNPs logged as discordant for a given study may not actually be in the study. Fortunately, if they are not in a given study they will not interfere with the filtering procedures.

In [None]:
# EC2 command line #
cd /home/ec2-user/jmarks/heroin/VIDUS_and_IMP_BIOBANK_imputation

ancestry="ea"

for study in {imp_biobank,vidus}; do
    echo -e "\n===============\nProcessing ${study}\n"
    # Create remove list
    echo "Making remove list"
    cat <(cut -f2,2 ${study}/genotype/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2) \
        <(cut -f2,2 ${study}/genotype/strand_check/${ancestry}_allele_discordance.at_cg_snps_freq_diff_gt_0.2 | tail -n +2) \
        <(cut -f2,2 ${study}/genotype/strand_check/${ancestry}_allele_discordance.discordant_names | tail -n +2) \
        <(cut -f2,2 ${study}/genotype/strand_check/${ancestry}_allele_discordance.discordant_positions | tail -n +2) \
        <(cut -f2,2 ${study}/genotype/strand_check/${ancestry}_allele_discordance.discordant_alleles_polymorphic_in_study_not_fixed_by_strand_flip | tail -n +2) | \
        sort -u > ${study}/genotype/${ancestry}_snps.remove

    # Create flip list
    echo "Making flip list"
    comm -23 <(cut -f2,2 ${study}/genotype/strand_check/${ancestry}_allele_discordance.discordant_alleles | tail -n +2 | sort -u) \
        <(cut -f2,2 ${study}/genotype/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2 | sort -u) \
        > ${study}/genotype/${ancestry}_snps.flip

    # Apply filters
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile ${study}/genotype/original/final/${ancestry}_chr_all \
        --exclude ${study}/genotype/${ancestry}_snps.remove \
        --flip ${study}/genotype/${ancestry}_snps.flip \
        --make-bed \
        --out ${study}/genotype/original/final/${ancestry}_filtered
done

wc -l */genotype/original/final/*bim
"""
wc -l */genotype/original/final/*bim
   600372 imp_biobank/genotype/original/final/ea_chr_all.bim
   600371 imp_biobank/genotype/original/final/ea_filtered.bim
   686855 vidus/genotype/original/final/ea_chr_all.bim
   679342 vidus/genotype/original/final/ea_filtered.bim
"""

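The flip list above is a set difference: every SNP with discordant alleles, minus those that a strand flip does not fix. A minimal sketch of that logic with `comm -23`, using hypothetical SNP IDs and temporary files in place of the discordance check output:

```shell
# Hypothetical SNP IDs standing in for the discordance check output.
workdir=$(mktemp -d)

# All SNPs with discordant alleles (comm needs sorted input).
printf 'rs4\nrs1\nrs3\nrs2\n' | LC_ALL=C sort -u > "${workdir}/discordant.txt"

# Subset that a strand flip does not fix; these belong on the remove list.
printf 'rs2\nrs4\n' | LC_ALL=C sort -u > "${workdir}/not_fixed.txt"

# comm -23 keeps lines found only in the first file, giving the flip list.
LC_ALL=C comm -23 "${workdir}/discordant.txt" "${workdir}/not_fixed.txt" \
    > "${workdir}/snps.flip"

flip_list=$(paste -sd' ' "${workdir}/snps.flip")
echo "flip: ${flip_list}"
```

Both inputs are sorted under the same locale before the comparison, since `comm` silently produces wrong results on unsorted input.
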
### Remove monomorphic variants
MIS rejects genotype data that contain monomorphic variants. To remove them, an arbitrarily small MAF threshold (0.000001) is applied; it is below the smallest nonzero MAF attainable in these data, so only variants with a MAF of zero are dropped.

In [None]:
# EC2 command line #
cd /home/ec2-user/jmarks/heroin/VIDUS_and_IMP_BIOBANK_imputation

ancestry="ea"

for study in {imp_biobank,vidus}; do
    # Apply filters
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile ${study}/genotype/original/final/${ancestry}_filtered \
        --maf 0.000001 \
        --make-bed \
        --out ${study}/genotype/original/final/${ancestry}_filtered_mono
done

"""
wc -l */genotype/original/final/*bim
   600372 imp_biobank/genotype/original/final/ea_chr_all.bim
   600371 imp_biobank/genotype/original/final/ea_filtered.bim
   600371 imp_biobank/genotype/original/final/ea_filtered_mono.bim
   686855 vidus/genotype/original/final/ea_chr_all.bim
   679342 vidus/genotype/original/final/ea_filtered.bim
   653513 vidus/genotype/original/final/ea_filtered_mono.bim
"""

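To see which variants a filter like this would drop, the monomorphic variants can be listed from a PLINK `.frq` file. This is a minimal sketch on a toy table; it assumes the PLINK 1.9 `--freq` layout with the SNP ID in column 2 and the MAF in column 5, and the rows are invented.

```shell
# Toy .frq table in the PLINK 1.9 --freq layout; the values are made up.
workdir=$(mktemp -d)
cat > "${workdir}/toy.frq" <<'EOF'
 CHR  SNP   A1   A2          MAF  NCHROBS
   1  rs1    A    G         0.25      200
   1  rs2    0    C            0      200
   2  rs3    T    C         0.01      200
EOF

# Skip the header row and print the ID of every variant with a MAF of zero.
monomorphic=$(awk 'NR > 1 && $5 == 0 { print $2 }' "${workdir}/toy.frq")
n_monomorphic=$(printf '%s\n' "${monomorphic}" | grep -c .)
echo "monomorphic: ${monomorphic} (n=${n_monomorphic})"
```
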
### SNP intersection

In [None]:
# EC2 command line #
cd /home/ec2-user/jmarks/heroin/VIDUS_and_IMP_BIOBANK_imputation

mkdir intersect
ancestry="ea"
studies=(imp_biobank vidus) # array of study names

# Get intersection set (matched on variant ID, column 2 of the .bim file)
file1=${studies[0]}/genotype/original/final/${ancestry}_filtered_mono.bim
file2=${studies[1]}/genotype/original/final/${ancestry}_filtered_mono.bim
echo -e "\nCalculating intersection between ${file1} and ${file2}...\n"
comm -12 <(cut -f 2 "${file1}" | sort -u) <(cut -f 2 "${file2}" | sort -u) \
    > intersect/${ancestry}_variant_intersection.txt

# Make new PLINK binary file sets
for study in "${studies[@]}"; do
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --bfile ${study}/genotype/original/final/${ancestry}_filtered_mono \
        --extract intersect/${ancestry}_variant_intersection.txt \
        --make-bed \
        --out intersect/${study}_${ancestry}_filtered_snp_intersection
done

"""
 wc -l intersect/*bim
  180758 intersect/imp_biobank_ea_filtered_snp_intersection.bim
"""

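The intersection is computed on variant IDs only (column 2 of each `.bim` file), so it relies on the earlier discordance checks to guarantee that a shared ID means the same position and alleles in both studies. A minimal sketch on two toy `.bim` files, with invented variants and hypothetical file names:

```shell
# Two toy .bim files (chr, ID, cM, bp, A1, A2); all rows are made up.
workdir=$(mktemp -d)
printf '1\trs1\t0\t1000\tA\tG\n1\trs2\t0\t2000\tC\tT\n2\trs5\t0\t500\tA\tC\n' \
    > "${workdir}/study_a.bim"
printf '1\trs2\t0\t2000\tC\tT\n2\trs5\t0\t500\tA\tC\n3\trs9\t0\t700\tG\tT\n' \
    > "${workdir}/study_b.bim"

# Extract and sort the variant IDs of each study.
cut -f 2 "${workdir}/study_a.bim" | LC_ALL=C sort -u > "${workdir}/a.ids"
cut -f 2 "${workdir}/study_b.bim" | LC_ALL=C sort -u > "${workdir}/b.ids"

# comm -12 keeps only the lines present in both sorted lists.
LC_ALL=C comm -12 "${workdir}/a.ids" "${workdir}/b.ids" \
    > "${workdir}/intersection.txt"

shared_ids=$(paste -sd' ' "${workdir}/intersection.txt")
n_shared=$(wc -l < "${workdir}/intersection.txt" | tr -d ' ')
echo "shared (${n_shared}): ${shared_ids}"
```
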
### Merge test
As a final check to confirm that our data sets are all compatible, a PLINK file set merge is conducted. If any issues persist then an error will be raised.

In [None]:
# EC2 command line #
cd /home/ec2-user/jmarks/heroin/VIDUS_and_IMP_BIOBANK_imputation/intersect

# Merge file sets
ancestry="ea"
echo -e "\n\n======== ${ancestry} ========\n\n"
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --memory 4000 \
    --bfile vidus_${ancestry}_filtered_snp_intersection \
    --bmerge imp_biobank_${ancestry}_filtered_snp_intersection \
    --make-bed \
    --out ${ancestry}_studies_merged
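
When a merge fails because some variants end up with more than two alleles, PLINK 1.9 typically writes the offending IDs to `<out>-merge.missnp` and stops without producing the merged file set. The sketch below shows one way to check the outcome after the fact; `check_merge` is a hypothetical helper, not part of PLINK, and it is exercised here on empty placeholder files.

```shell
# Hypothetical helper: report whether a PLINK merge with the given --out
# prefix succeeded, based only on which files it left behind.
check_merge() {
    prefix="$1"
    if [ -s "${prefix}-merge.missnp" ]; then
        n_bad=$(wc -l < "${prefix}-merge.missnp" | tr -d ' ')
        echo "FAIL: ${n_bad} variants could not be merged"
        return 1
    elif [ -f "${prefix}.bed" ] && [ -f "${prefix}.bim" ] && [ -f "${prefix}.fam" ]; then
        echo "OK: merged file set written"
        return 0
    else
        echo "FAIL: merged file set not found"
        return 1
    fi
}

# Placeholder files standing in for a successful and a failed merge.
workdir=$(mktemp -d)
touch "${workdir}/good.bed" "${workdir}/good.bim" "${workdir}/good.fam"
printf 'rs10\nrs20\n' > "${workdir}/bad-merge.missnp"

merge_status=$(check_merge "${workdir}/good")
fail_status=$(check_merge "${workdir}/bad" || true)
echo "${merge_status}"
echo "${fail_status}"
```

In this analysis the call would be `check_merge ea_studies_merged`, run from the `intersect` directory.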

## Imputation preparation for Michigan Imputation Server
Visit the [MIS Getting Started Webpage](https://imputationserver.sph.umich.edu/start.html#!pages/help) for more information about preparing the data for upload to MIS.
### VCF File Conversion

In [None]:
# EC2 command line #
cd /home/ec2-user/jmarks/heroin/VIDUS_and_IMP_BIOBANK_imputation

mkdir phase_prep
ancestry="ea"

# Split by chr and remove any individuals missing (nearly) all genotypes on that chr
for chr in {1..23}; do
    # Extract one chromosome and drop individuals with > 99% missing genotypes
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 4000 \
        --bfile intersect/${ancestry}_studies_merged \
        --chr ${chr} \
        --mind 0.99 \
        --make-bed \
        --out phase_prep/${ancestry}_chr${chr}_for_phasing 
done > chr_splitting.log

__Note__: No subjects were removed.

```
 wc -l *bim
  10260 ea_chr10_for_phasing.bim
   9357 ea_chr11_for_phasing.bim
   9111 ea_chr12_for_phasing.bim
   6995 ea_chr13_for_phasing.bim
   6026 ea_chr14_for_phasing.bim
   5866 ea_chr15_for_phasing.bim
   6220 ea_chr16_for_phasing.bim
   4919 ea_chr17_for_phasing.bim
   5803 ea_chr18_for_phasing.bim
   3045 ea_chr19_for_phasing.bim
  14935 ea_chr1_for_phasing.bim
   4857 ea_chr20_for_phasing.bim
   2698 ea_chr21_for_phasing.bim
   2572 ea_chr22_for_phasing.bim
     39 ea_chr23_for_phasing.bim
  14950 ea_chr2_for_phasing.bim
  12454 ea_chr3_for_phasing.bim
  10313 ea_chr4_for_phasing.bim
  11018 ea_chr5_for_phasing.bim
  11356 ea_chr6_for_phasing.bim
   9692 ea_chr7_for_phasing.bim
   9719 ea_chr8_for_phasing.bim
   8553 ea_chr9_for_phasing.bim
 180758 total

```
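
The note that no subjects were removed can be checked by comparing the number of samples in each per-chromosome `.fam` file with the merged `.fam` file. A minimal sketch on toy files follows; the sample IDs are invented, and the second chromosome deliberately simulates a sample dropped by `--mind`.

```shell
# Toy .fam files; the sample IDs are made up for illustration.
workdir=$(mktemp -d)
printf 'F1 I1 0 0 1 1\nF2 I2 0 0 2 2\nF3 I3 0 0 1 2\n' \
    > "${workdir}/ea_studies_merged.fam"
cp "${workdir}/ea_studies_merged.fam" "${workdir}/ea_chr1_for_phasing.fam"
# Simulate a chromosome on which --mind removed one sample.
head -n 2 "${workdir}/ea_studies_merged.fam" > "${workdir}/ea_chr2_for_phasing.fam"

expected=$(wc -l < "${workdir}/ea_studies_merged.fam" | tr -d ' ')
n_flagged=0
for fam in "${workdir}"/ea_chr*_for_phasing.fam; do
    n=$(wc -l < "${fam}" | tr -d ' ')
    # Flag any chromosome whose sample count differs from the merged set.
    if [ "${n}" -ne "${expected}" ]; then
        echo "$(basename "${fam}"): ${n} of ${expected} samples kept"
        n_flagged=$((n_flagged + 1))
    fi
done
echo "files with dropped samples: ${n_flagged}"
```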

In [None]:
# EC2 command line #
cd /home/ec2-user/jmarks/heroin/VIDUS_and_IMP_BIOBANK_imputation/phase_prep
mkdir ea

ancestry="ea"
for chr in {1..22}; do
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 5000 \
        --bfile ${ancestry}_chr${chr}_for_phasing \
        --recode vcf bgz \
        --out ${ancestry}/${ancestry}_chr${chr}_final
done

chr=23
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --memory 5000 \
    --bfile ${ancestry}_chr${chr}_for_phasing \
    --output-chr M \
    --set-hh-missing \
    --recode vcf bgz \
    --out ${ancestry}/${ancestry}_chr${chr}_final
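
Before uploading, it can be worth confirming that each VCF holds as many variant records as the `.bim` file it was converted from. The sketch below uses toy files compressed with `gzip`; it assumes that bgzip-compressed VCFs can be read with `gzip -dc`, which holds because bgzip output is gzip-compatible.

```shell
# Toy .bim file and matching VCF; all records are made up.
workdir=$(mktemp -d)
printf '1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n' \
    > "${workdir}/toy_chr1.bim"
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t100\trs1\tG\tA\t.\t.\tPR\n'
    printf '1\t200\trs2\tT\tC\t.\t.\tPR\n'
} | gzip -c > "${workdir}/toy_chr1.vcf.gz"

# Variants in the .bim file versus non-header lines in the VCF.
n_bim=$(wc -l < "${workdir}/toy_chr1.bim" | tr -d ' ')
n_vcf=$(gzip -dc "${workdir}/toy_chr1.vcf.gz" | grep -vc '^#')

if [ "${n_bim}" -eq "${n_vcf}" ]; then
    vcf_check="match"
else
    vcf_check="mismatch"
fi
echo "bim=${n_bim} vcf=${n_vcf} ${vcf_check}"
```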

Transfer the per-chromosome `*.vcf.gz` files to the local machine, then upload them to MIS.

In [None]:
cd /cygdrive/c/Users/jmarks/Desktop/data_transfers/michigan_imputation_server/vidus_and_imp_biobank
scp -i ~/.ssh/gwas_rsa ec2-user@35.168.108.18:/home/ec2-user/jmarks/heroin/VIDUS_and_IMP_BIOBANK_imputation/phase_prep/ea/*gz .
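
A checksum comparison is one way to confirm that the files arrived intact before uploading them. The sketch below assumes `md5sum` is available on both the EC2 instance and under Cygwin; the directories are temporary stand-ins, and `cp` takes the place of the `scp` transfer.

```shell
# Temporary directories standing in for the EC2 source and local destination.
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'toy payload chr1\n' | gzip -c > "${src}/ea_chr1_final.vcf.gz"
printf 'toy payload chr2\n' | gzip -c > "${src}/ea_chr2_final.vcf.gz"

# On the source side: record one checksum per file.
( cd "${src}" && md5sum *.gz > checksums.md5 )

# Stand-in for the scp transfer, including the checksum file.
cp "${src}"/*.gz "${src}/checksums.md5" "${dst}/"

# On the destination side: md5sum -c re-computes and compares each checksum.
verify_output=$( cd "${dst}" && md5sum -c checksums.md5 )
n_ok=$(printf '%s\n' "${verify_output}" | grep -c ': OK$')
echo "${verify_output}"
```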

## Upload to Michigan Imputation Server (MIS)

### Uploading parameters
These are the parameters that were selected on MIS.

__Name__: chr1-22

__Reference Panel__: 1000G Phase 3 v5

__Input Files__: File Upload <br>

* Select Files - select the VCF files that were transferred to the local machine from the cloud. <br>

__Phasing__: ShapeIT v2.r790 (unphased) 

__Population__: EUR

__Mode__: Quality Control & Imputation

* I will not attempt to re-identify or contact research participants.
* I will report any inadvertent data release, security breach or other data management incident of which I become aware.