# Kreek Phasing + Imputation
__Author__: Jesse Marks


This document logs the steps taken to phase and impute the [Kreek]() and [Lung Cancer in Never Smokers](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000634.v1.p1) data sets. The starting point for this analysis is after quality control of the observed genotypes and after checking compatibility with the population control dbGaP data sets. The quality-controlled genotypes are oriented on the GRCh37 plus strand. Based on findings from [Johnson et al.](https://link.springer.com/article/10.1007/s00439-013-1266-7), the intersection set of variants will be used for imputation.

## Software and tools
The software and tools used for processing these data are:
* [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html) (MIS)
* [Amazon Web Services (AWS) - Cloud Computing Services](https://aws.amazon.com/)
    * Linux AMI
* [PLINK v1.90 beta 4.10](https://www.cog-genomics.org/plink/)
* [bgzip](http://www.htslib.org/doc/tabix.html)
* [BCF Tools](http://www.htslib.org/doc/bcftools.html)
* Windows 10 with [Cygwin](https://cygwin.com/) installed
* GNU bash version 4.2.46

## Data retrieval and organization
The temporary working directory for this analysis will be: `/shared/sandbox/ngc_vidus-lung_cancer_oaall_case_control` <br>
on `EC2`. PLINK binary filesets will be obtained from AWS S3 storage.



* `/share/nas04/bioinformatics_group/data/studies/kreek`


As a constituent of an opioid GWAS meta-analysis in progress referred to as the "UHS-Nelson" study, we need to show replication in an independent cohort. We will take advantage of a data set released by Mary-Jeanne Kreek's research group, which we refer to as the "Kreek" data set, to assess whether the UHS-Nelson study findings are replicable with the Kreek data. The three types of opioid GWAS we will run are:

* Opioid cases vs. clean controls (no abuse or dependence of any drug)
* Opioid cases vs. all controls with no covariates adjusting for other drugs
* Opioid cases vs. all controls with covariates adjusting for cocaine and alcohol

These three GWAS will be run separately for 2 different ancestry groups:

* European (EA)
* African (AA)

**Note**: It was decided that we would not proceed with the Hispanic ancestry group.

## Data processing
### GRCh37 strand and allele discordance check

In [None]:
# EC2 command line #
cd /shared/impute/kreek/fou

mkdir -p 1000g strand_check
genoLoc=/shared/impute/kreek/data/genotype/original
workingDir=/shared/impute/kreek/fou

# Calculate study allele frequencies for each ancestry group
for ancestry in {ea,aa};do
    for study in kreek; do
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bfile $genoLoc/${ancestry}_chr_all \
            --freq \
            --out $workingDir/strand_check/${ancestry}_chr_all
    done
done
# Get the sorted, unique list of variant IDs (chr 1-23) for each ancestry group
for ancestry in {ea,aa}; do
    cat $genoLoc/${ancestry}_chr_all.bim | \
            perl -lane 'if (($F[0]+0) <= 23) { print $F[1]; }' | \
            sort -u > $workingDir/${ancestry}_chr_all_sorted_variants.txt
done

# Calculate autosome MAFs for 1000G EUR and AFR 
for ancestry in {ea,aa}; do
    if [ $ancestry = "ea" ]; then
        pop=EUR
    else
        pop=AFR
    fi
    for chr in {1..22}; do
        /shared/bioinformatics/software/scripts/qsub_job.sh \
            --job_name ${pop}_${chr} \
            --script_prefix ${pop}_chr${chr}.maf \
            --mem 8 \
            --nslots 1 \
            --priority 0 \
            --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
                --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.hap.gz\
                --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.legend.gz \
                --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
                --chr ${chr} \
                --out 1000g/${pop}_chr${chr}.maf \
                --extract ${ancestry}_chr_all_sorted_variants.txt \
                --keep_groups ${pop}
    done
done

# Merge per chr MAFs for EUR and AFR

for pop in {EUR,AFR}; do
    head -n 1 1000g/${pop}_chr1.maf > 1000g/${pop}_chr_all.maf
    tail -q -n +2 1000g/${pop}_chr{1..22}.maf \
        >> 1000g/${pop}_chr_all.maf
done

# Run discordance checks for the EA and AA groups

for ancestry in {ea,aa}; do
    study=kreek

    if [ $ancestry = "ea" ]; then
        pop=EUR
    else
        pop=AFR
    fi

    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${ancestry}_${study}_crosscheck \
        --script_prefix ${workingDir}/strand_check/test/${ancestry}_allele_discordance_check \
        --mem 6 \
        --priority 0 \
        --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
            --study_bim_file ${genoLoc}/${ancestry}_chr_all.bim
            --study_frq_file ${workingDir}/strand_check/${ancestry}_chr_all.frq
            --ref_maf_file 1000g/${pop}_chr_all.maf
            --out_prefix ${workingDir}/strand_check/${ancestry}_allele_discordance"
done
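The per-chromosome MAF merge above depends on every file sharing the same header line. A minimal sketch of that header-preserving concatenation on toy files follows; the column names and values are hypothetical and are not the real output format of `calculate_maf_from_impute2_hap_file.pl`.

```shell
#!/usr/bin/env bash
set -eu

# Concatenate tables that share one header line: keep the header from the
# first input, then append only the data rows (line 2 onward) of every input.
merge_with_header() {
    out=$1
    shift
    head -n 1 "$1" > "$out"
    tail -q -n +2 "$@" >> "$out"
}

# Toy inputs (hypothetical columns and values, for illustration only)
tmp=$(mktemp -d)
printf 'snp\tmaf\nrs1\t0.10\nrs2\t0.25\n' > "$tmp/EUR_chr1.maf"
printf 'snp\tmaf\nrs3\t0.40\n' > "$tmp/EUR_chr2.maf"

merge_with_header "$tmp/EUR_chr_all.maf" "$tmp/EUR_chr1.maf" "$tmp/EUR_chr2.maf"
cat "$tmp/EUR_chr_all.maf"
```

The merged file should contain exactly one header row followed by the data rows from each chromosome in the order given.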

### Resolving allele discordances
The allele discordances will be resolved by:
* Flipping SNPs whose allele discordance is fixed by a strand flip
* Removing SNPs with discordant names
* Removing SNPs with discordant positions
* Removing SNPs whose allele discordance is not resolved by flipping
* Removing SNPs with large deviations from the reference population allele frequencies
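To make "fixed by flipping" concrete, here is a small sketch that classifies a study allele pair against a reference allele pair. It is only an illustration of the idea; the function names are hypothetical, and it is not the logic of `check_study_data_against_1000G.R`, which is not shown in this log.

```shell
#!/usr/bin/env bash
set -eu

# Complement each base in an allele string (A<->T, C<->G).
complement() {
    printf '%s' "$1" | tr 'ACGT' 'TGCA'
}

# True when the unordered pair ($1,$2) equals the unordered pair ($3,$4).
same_pair() {
    { [ "$1" = "$3" ] && [ "$2" = "$4" ]; } ||
    { [ "$1" = "$4" ] && [ "$2" = "$3" ]; }
}

# Classify study alleles ($1,$2) against reference alleles ($3,$4).
# Prints one of: match, flip, unresolved.
classify_alleles() {
    if same_pair "$1" "$2" "$3" "$4"; then
        echo match
    elif same_pair "$(complement "$1")" "$(complement "$2")" "$3" "$4"; then
        echo flip
    else
        echo unresolved
    fi
}

classify_alleles A G A G   # alleles already agree
classify_alleles A G T C   # agree only after complementing both alleles
classify_alleles A G A C   # no strand flip reconciles these
```

Palindromic pairs (A/T and C/G) look the same on either strand, so a check like this cannot detect a flip for them; comparing allele frequencies against the reference is the usual fallback, which is presumably what the `at_cg_snps_freq_diff_gt_0.2` list captures.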

Given that the allele discordance check was done using a union set of SNPs across all studies within an ancestry group, some of the SNPs logged as discordant for a given study may not actually be in the study. Fortunately, if they are not in a given study they will not interfere with the filtering procedures.

In [None]:
# EC2 command line #
cd /shared/impute/kreek/fou
mkdir final
genoLoc=/shared/impute/kreek/data/genotype/original

study=kreek
for ancestry in {ea,aa};do
    echo -e "\n===============\nProcessing ${study} ${ancestry}\n"
    # Create remove list
    echo "Making remove list"
    cat <(cut -f2,2 strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2) \
        <(cut -f2,2 strand_check/${ancestry}_allele_discordance.at_cg_snps_freq_diff_gt_0.2 | tail -n +2) \
        <(cut -f2,2 strand_check/${ancestry}_allele_discordance.discordant_names | tail -n +2) \
        <(cut -f2,2 strand_check/${ancestry}_allele_discordance.discordant_positions | tail -n +2) \
        <(cut -f2,2 strand_check/${ancestry}_allele_discordance.discordant_alleles_polymorphic_in_study_not_fixed_by_strand_flip | tail -n +2) | \
        sort -u > ${ancestry}_snps.remove

    # Create flip list
    echo "Making flip list"
    comm -23 <(cut -f2,2 strand_check/${ancestry}_allele_discordance.discordant_alleles | tail -n +2 | sort -u) \
        <(cut -f2,2 strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2 | sort -u) \
        > ${ancestry}_snps.flip

    # Apply filters
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile $genoLoc/${ancestry}_chr_all \
        --exclude ${ancestry}_snps.remove \
        --flip ${ancestry}_snps.flip \
        --make-bed \
        --out final/${ancestry}_filtered
done
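The flip list above is a set difference: every SNP with discordant alleles, minus those a strand flip does not fix. A toy sketch of that `comm -23` step, using hypothetical SNP IDs and file names, might look like this.

```shell
#!/usr/bin/env bash
set -eu

tmp=$(mktemp -d)

# Toy ID lists (hypothetical); comm requires both inputs to be sorted.
printf 'rs1\nrs2\nrs3\nrs4\n' | sort -u > "$tmp/discordant_alleles.txt"
printf 'rs2\nrs4\n' | sort -u > "$tmp/not_fixed_by_flip.txt"

# comm -23 keeps only the lines unique to the first file.
comm -23 "$tmp/discordant_alleles.txt" "$tmp/not_fixed_by_flip.txt" > "$tmp/snps.flip"
cat "$tmp/snps.flip"

# Sanity check: no SNP should be on both the flip list and the not-fixed list.
overlap=$(comm -12 "$tmp/snps.flip" "$tmp/not_fixed_by_flip.txt" | wc -l)
echo "overlap: $overlap"
```

Checking that the flip and remove lists do not overlap is a cheap guard before handing both to PLINK in the same command.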

### Remove monomorphic variants
Monomorphic variants prevent MIS from accepting the genotype data. To remove them, an arbitrarily small MAF threshold is applied that is below the smallest nonzero MAF possible for these data, so only monomorphic variants are filtered out.

In [None]:
# EC2 command line #
cd /shared/impute/kreek/fou/final

study=kreek
for ancestry in {ea,aa};do
    # Remove monomorphic variants
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile ${ancestry}_filtered \
        --maf 0.000001 \
        --make-bed \
        --out ${ancestry}_filtered_mono
done

"""
482792 aa_filtered.bim
457378 aa_filtered_mono.bim

490863 ea_filtered.bim
434239 ea_filtered_mono.bim
"""

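The `.bim` line counts above give the number of variants dropped by the monomorphic filter in each ancestry group. As a quick worked check (the variable names are just for illustration):

```shell
#!/usr/bin/env bash
set -eu

# Line counts copied from the .bim files listed above
aa_before=482792; aa_after=457378
ea_before=490863; ea_after=434239

aa_removed=$((aa_before - aa_after))
ea_removed=$((ea_before - ea_after))

echo "AA: ${aa_removed} monomorphic variants removed"   # 25414
echo "EA: ${ea_removed} monomorphic variants removed"   # 56624
```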
## Imputation preparation for Michigan Imputation Server
Visit the [MIS Getting Started Webpage](https://imputationserver.sph.umich.edu/start.html#!pages/help) for more information about preparing the data for upload to MIS.
### VCF File Conversion

In [None]:
# EC2 command line #
cd /shared/impute/kreek/fou/final

mkdir phase_prep

# Split by chr and remove any individuals with missing data for whole chr
for ancestry in {ea,aa};do
    for chr in {1..23}; do
        # Extract one chromosome; --mind 0.99 drops subjects missing more than 99% of its genotypes
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 4000 \
            --bfile ${ancestry}_filtered_mono \
            --chr ${chr} \
            --mind 0.99 \
            --make-bed \
            --out phase_prep/${ancestry}_chr${chr}_for_phasing 
    done > ${ancestry}_chr_splitting.log
done

__Note__: No subjects were removed.

In [None]:
# EC2 command line #
cd /shared/impute/kreek/fou/final/phase_prep
mkdir ea aa

for ancestry in {ea,aa};do
    for chr in {1..22}; do
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 5000 \
            --bfile ${ancestry}_chr${chr}_for_phasing \
            --recode vcf bgz \
            --out ${ancestry}/${ancestry}_chr${chr}_final
    done
done

chr=23
for ancestry in {ea,aa};do
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 5000 \
        --bfile ${ancestry}_chr${chr}_for_phasing \
        --output-chr M \
        --set-hh-missing \
        --recode vcf bgz \
        --out ${ancestry}/${ancestry}_chr${chr}_final
done
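As far as I understand the MIS input requirements, each upload should be a bgzip-compressed VCF holding a single chromosome with positions in sorted order. The sketch below is one way to spot-check those two properties before uploading; `check_vcf_body` is a hypothetical helper, and it is shown on toy records rather than the real files.

```shell
#!/usr/bin/env bash
set -eu

# Read a VCF on stdin and report whether its records cover a single
# chromosome with positions in non-decreasing order.
# Prints "ok" or a short reason.
check_vcf_body() {
    awk -F'\t' '
        /^#/ { next }    # skip meta and header lines
        seen && $1 != chrom    { print "multiple chromosomes"; bad = 1; exit }
        seen && ($2 + 0) < prev { print "unsorted positions"; bad = 1; exit }
        { chrom = $1; prev = $2 + 0; seen = 1 }
        END { if (!bad) print "ok" }
    '
}

# Toy records (hypothetical) standing in for real VCF content
good=$(printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\n1\t100\trs1\n1\t200\trs2\n' | check_vcf_body)
unsorted=$(printf '#CHROM\tPOS\tID\n1\t200\trs2\n1\t100\trs1\n' | check_vcf_body)
mixed=$(printf '#CHROM\tPOS\tID\n1\t100\trs1\n2\t50\trs9\n' | check_vcf_body)

echo "$good"
echo "$unsorted"
echo "$mixed"
```

On the real output, something like `zcat ea/ea_chr1_final.vcf.gz | check_vcf_body` would apply the same check to one chromosome.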

## Upload to Michigan Imputation Server (MIS)
Transfer the per-chromosome \*.vcf.gz files to the local machine and then upload them to MIS.

### Uploading parameters EA
These are the parameters that were selected on MIS.

__Name__: kreek_ea

__Reference Panel__: 1000G Phase 3 v5

__Input Files__: File Upload <br>

* Select Files - select VCF files that were downloaded to local machine from cloud. <br>

__Phasing__: ShapeIT v2.r790 (unphased) 

__Population__: EUR

__Mode__: Quality Control & Imputation

* I will not attempt to re-identify or contact research participants.
* I will report any inadvertent data release, security breach or other data management incident of which I become aware.


### Uploading parameters AA
These are the parameters that were selected on MIS.

__Name__: chr1-23

__Reference Panel__: 1000G Phase 3 v5

__Input Files__: File Upload <br>

* Select Files - select VCF files that were downloaded to local machine from cloud. <br>

__Phasing__: ShapeIT v2.r790 (unphased) 

__Population__: AFR

__Mode__: Quality Control & Imputation

* I will not attempt to re-identify or contact research participants.
* I will report any inadvertent data release, security breach or other data management incident of which I become aware.

# Download Imputed Data from MIS
First, download the data from the Michigan Imputation Server by clicking the link in the email MIS sends when the job has finished. That page lists the commands for downloading the data.
## EA

In [None]:
## EC2 ##
mkdir -p /shared/impute/kreek/data/genotype/imputed/{aa,ea}
cd /shared/impute/kreek/data/genotype/imputed/ea

# I put all of the download commands in a bash script so I could submit them as jobs.
cat mywget.sh
"""
wget https://imputationserver.sph.umich.edu/share/results/2e899483c134f1d2c08e3aabc4414253/chr_1.zip 
wget https://imputationserver.sph.umich.edu/share/results/ed48286121726c3bc64ddf024ec72730/chr_10.zip 
wget https://imputationserver.sph.umich.edu/share/results/1472b06d90055f4effca30c4924fbeca/chr_11.zip 
wget https://imputationserver.sph.umich.edu/share/results/b3e1f5e3fd0bfd0a69fe171e438d3c53/chr_12.zip 
wget https://imputationserver.sph.umich.edu/share/results/8bc2b433fed05c8c57e09c240aae9d38/chr_13.zip 
wget https://imputationserver.sph.umich.edu/share/results/ed5623de7f6b337dec56abce51908cf5/chr_14.zip 
wget https://imputationserver.sph.umich.edu/share/results/55b667fafaa4881bba5c285c859b6d6e/chr_15.zip 
wget https://imputationserver.sph.umich.edu/share/results/7744e3469a60729f0cfabef804cb7cf/chr_16.zip 
wget https://imputationserver.sph.umich.edu/share/results/3097bb774de0115ed454e48628ad06dc/chr_17.zip 
wget https://imputationserver.sph.umich.edu/share/results/761df4b95b0214e19d321cd327e5ac4e/chr_18.zip 
wget https://imputationserver.sph.umich.edu/share/results/fd2b2ee5136093e255bdf60f7d575bf8/chr_19.zip 
wget https://imputationserver.sph.umich.edu/share/results/dffef070d2e0da537b7b0d2b7c42d5e1/chr_2.zip 
wget https://imputationserver.sph.umich.edu/share/results/76d3e3a2e8938a080fda54c80cbbe264/chr_20.zip 
wget https://imputationserver.sph.umich.edu/share/results/27e8da7066c8dfe52c890eb1952e6dcd/chr_21.zip 
wget https://imputationserver.sph.umich.edu/share/results/f1f5a8bbdbce2b2293081f72b7c5ba6c/chr_22.zip 
wget https://imputationserver.sph.umich.edu/share/results/cd7e1fe0b57b45590ec83367bf33ffae/chr_3.zip 
wget https://imputationserver.sph.umich.edu/share/results/f405efbacb10d9243aaf3d662c6f9414/chr_4.zip 
wget https://imputationserver.sph.umich.edu/share/results/ccba882630a8cce30bf458dca68a2d09/chr_5.zip 
wget https://imputationserver.sph.umich.edu/share/results/77aeabce5acbca6e225e3ac02965eb0d/chr_6.zip 
wget https://imputationserver.sph.umich.edu/share/results/b7798d33eca337160fb28db9d855e6e0/chr_7.zip 
wget https://imputationserver.sph.umich.edu/share/results/f95778ef2f6cb1364e5f272e0b0fcf75/chr_8.zip 
wget https://imputationserver.sph.umich.edu/share/results/494429e58125aacc323dce6366f348be/chr_9.zip 
wget https://imputationserver.sph.umich.edu/share/results/877e7a71f369a30acf5933880b5edbda/chr_X.no.auto_female.zip 
wget https://imputationserver.sph.umich.edu/share/results/1c9f7d9ac07591172ecb01e76b743fd/chr_X.no.auto_male.zip 
"""

sh /shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name download_kreek \
    --script_prefix kreek_imputation_download \
    --mem 5 \
    --priority 0 \
    --nslots 1 \
    --program bash mywget.sh
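Before inflating, it may be worth confirming that every archive listed in `mywget.sh` actually arrived. A minimal sketch, assuming the archive names shown in the download script (22 autosomes plus separate female and male chrX files); `missing_archives` is a hypothetical helper, demonstrated here on placeholder files.

```shell
#!/usr/bin/env bash
set -eu

# Print the name of every expected MIS archive that is missing or empty
# in the given directory.
missing_archives() {
    for name in $(seq 1 22) X.no.auto_female X.no.auto_male; do
        [ -s "$1/chr_$name.zip" ] || echo "chr_$name.zip"
    done
}

# Toy directory with placeholder files, one archive deliberately absent
tmp=$(mktemp -d)
for name in $(seq 1 22) X.no.auto_female X.no.auto_male; do
    echo placeholder > "$tmp/chr_$name.zip"
done
rm "$tmp/chr_7.zip"

missing_archives "$tmp"
```

In practice the function would be pointed at the download directory, for example `missing_archives /shared/impute/kreek/data/genotype/imputed/ea`; empty output means all 24 archives are present.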

### Inflate imputation results
The zip files from Michigan Imputation Server (MIS) need to be inflated before you can begin working with them. They are password protected, and MIS sends the password by email.

In [None]:
### EC2 console ###
cd /shared/impute/kreek/data/genotype/imputed/ea

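# inflate chr results
# chrX was downloaded as separate female and male archives, so there is no chr_23.zip
for f in {1..22} X.no.auto_female X.no.auto_male;do
echo '#!/bin/bash' > chr_$f.sh
echo '' >> chr_$f.sh
echo 'unzip -P "YHv(tBF1rCx?3b" chr_'$f'.zip' >> chr_$f.sh
done

for chr in {1..22} X.no.auto_female X.no.auto_male; do
    sh /shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name ea.inflate_chr${chr} \
    --script_prefix ea.chr${chr}_inflation \
    --mem 5 \
    --priority 0 \
    --nslots 1 \
    --program bash chr_${chr}.sh
done

# Run only after all inflation jobs have finished; keeps mywget.sh
rm chr_*.sh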

## AA 


In [None]:
## EC2 ##
cd /shared/impute/kreek/data/genotype/imputed/aa

# I put all of the download commands in a bash script so I could submit them as jobs.
cat mywget.sh
"""
wget https://imputationserver.sph.umich.edu/share/results/8fce96789a782e1eae8564174708f790/chr_1.zip
wget https://imputationserver.sph.umich.edu/share/results/2818409a432b4f031522994106fb5b00/chr_10.zip
wget https://imputationserver.sph.umich.edu/share/results/eb99a005ea72e5a9dcf1f32e7ad06f89/chr_11.zip
wget https://imputationserver.sph.umich.edu/share/results/775f95824d1ad4ec0a9d9d4f677b2837/chr_12.zip
wget https://imputationserver.sph.umich.edu/share/results/6848af95103b8cf57c84d758afa0240c/chr_13.zip
wget https://imputationserver.sph.umich.edu/share/results/90d53e022cf5f290f9b35cd2536f5a4/chr_14.zip
wget https://imputationserver.sph.umich.edu/share/results/c95be565f4e16408dcf0c2be15fd0166/chr_15.zip
wget https://imputationserver.sph.umich.edu/share/results/33fc07c12dc27a27e99e0108ff6f3efb/chr_16.zip
wget https://imputationserver.sph.umich.edu/share/results/efa12bc6a1e571d43fd0a56bc02f49c5/chr_17.zip
wget https://imputationserver.sph.umich.edu/share/results/9aafd32ba54edba436ba648e9ade8a5e/chr_18.zip
wget https://imputationserver.sph.umich.edu/share/results/bf0cc57426f359bd178c17b8261ae09d/chr_19.zip
wget https://imputationserver.sph.umich.edu/share/results/665ba49c9066e3d68fcc5a3d81cc68e6/chr_2.zip
wget https://imputationserver.sph.umich.edu/share/results/7faaf92de24d36f56fad80c74abe8417/chr_20.zip
wget https://imputationserver.sph.umich.edu/share/results/89db9a7ee57ca03779087290d9328530/chr_21.zip
wget https://imputationserver.sph.umich.edu/share/results/f37ce24b6a80ffa12ff07a09458eab9f/chr_22.zip
wget https://imputationserver.sph.umich.edu/share/results/3be4ba7c31591f3cbcc4a1661e039e9f/chr_3.zip
wget https://imputationserver.sph.umich.edu/share/results/84c9a1580141c54eb726a3dff15204c2/chr_4.zip
wget https://imputationserver.sph.umich.edu/share/results/10e69b2563ce96499b3dcee78e402411/chr_5.zip
wget https://imputationserver.sph.umich.edu/share/results/9252a2a74b2a78b6a51d59810821ce68/chr_6.zip
wget https://imputationserver.sph.umich.edu/share/results/d6a22c97c0b306e3e6019eacf512d583/chr_7.zip
wget https://imputationserver.sph.umich.edu/share/results/534afd90e14af150a2437add4834b8b4/chr_8.zip
wget https://imputationserver.sph.umich.edu/share/results/8c905522864028a306ca36669f244c50/chr_9.zip
wget https://imputationserver.sph.umich.edu/share/results/14d26b9017c84fe2b5e970639dd1a0ab/chr_X.no.auto_female.zip
wget https://imputationserver.sph.umich.edu/share/results/57d7fb05f5e2b20df4e5b4972e2618bc/chr_X.no.auto_male.zip

"""

sh /shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name aa_kreek.download \
    --script_prefix aa_kreek_imputation_download \
    --mem 5 \
    --priority 0 \
    --nslots 1 \
    --program bash mywget.sh

### Inflate imputation results
The zip files from Michigan Imputation Server (MIS) need to be inflated before you can begin working with them. They are password protected, and MIS sends the password by email.

In [None]:
### EC2 console ###
cd /shared/impute/kreek/data/genotype/imputed/aa

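# inflate chr results
# chrX was downloaded as separate female and male archives, so there is no chr_23.zip
for f in {1..22} X.no.auto_female X.no.auto_male;do
echo '#!/bin/bash' > chr_$f.sh
echo '' >> chr_$f.sh
echo 'unzip -P "T7RPRoezqRvn*4" chr_'$f'.zip' >> chr_$f.sh
done

for chr in {1..22} X.no.auto_female X.no.auto_male; do
    sh /shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name aa.inflate_chr${chr} \
    --script_prefix aa.chr${chr}_inflation \
    --mem 5 \
    --priority 0 \
    --nslots 1 \
    --program bash chr_${chr}.sh
done

# Run only after all inflation jobs have finished; keeps mywget.sh
rm chr_*.sh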