# eMERGE Cohort for Nicotine Dependence: Phasing and Imputation
__Author:__ Jesse Marks

This document logs the steps taken to perform phasing and imputation on the [eMERGE Genome Wide Association Study of Cataract and Low HDL cholesterol](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000170.v2.p1). The starting point for this analysis is after quality control of observed genotypes and after checking for compatibility of population control dbGaP data sets. The quality-controlled genotypes are oriented on the GRCh37 plus strand. Based on findings from [Johnson et al.](https://link.springer.com/article/10.1007/s00439-013-1266-7), the intersection set of variants will be used for imputation.

__Note:__ The `Human660W-Quad_v1_A` chip was used to genotype these subjects.

## Software and tools
The software and tools used for processing these data are:
* [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html) (MIS)
* [Amazon Web Services (AWS) - Cloud Computing Services](https://aws.amazon.com/)
    * Linux AMI
* [PLINK v1.90 beta 4.10](https://www.cog-genomics.org/plink/)
* [bgzip](http://www.htslib.org/doc/tabix.html)
* [BCF Tools](http://www.htslib.org/doc/bcftools.html)
* Windows 10 with [Cygwin](https://cygwin.com/) installed
* GNU bash version 4.2.46

## Data retrieval and organization
The temporary working directory for this analysis is `/shared/sandbox/emerge/` on `EC2`. PLINK binary filesets are obtained from AWS S3 storage.

In [None]:
# EC2 command line #
mkdir -p /shared/sandbox/emerge/phs000170_cataract
cd /shared/sandbox/emerge/phs000170_cataract

# Copy post-QC emerge observed genotype data
aws s3 cp s3://rti-common/dbGaP/phs000170_cataract/genotype/original/final/ . --recursive

# de-compress
gunzip *
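
Before moving on, it can help to confirm that a complete PLINK binary fileset (`.bed`, `.bim`, `.fam`) is present after decompression. The following is a minimal sketch, not part of the original workflow: `check_plink_fileset` is a hypothetical helper, and the prefix `ea_chr_all` is assumed from the `--bfile` arguments used later in this log. The demonstration runs on placeholder files in a scratch directory.

```bash
# Hypothetical helper: succeed only if prefix.bed, prefix.bim and prefix.fam
# all exist and are non-empty.
check_plink_fileset() {
    local prefix="$1"
    local ext
    for ext in bed bim fam; do
        if [ ! -s "${prefix}.${ext}" ]; then
            echo "missing or empty: ${prefix}.${ext}" >&2
            return 1
        fi
    done
    echo "ok: ${prefix}"
}

# Demonstration on placeholder files (not real genotype data)
demo_dir=$(mktemp -d)
for ext in bed bim fam; do
    printf 'x' > "${demo_dir}/ea_chr_all.${ext}"
done
fileset_status=$(check_plink_fileset "${demo_dir}/ea_chr_all")
echo "${fileset_status}"
```

On the real data the call would be `check_plink_fileset ea_chr_all`, run from the directory that holds the decompressed files.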

## Data processing
### GRCh37 strand and allele discordance check

In [None]:
# EC2 command line #
cd /shared/sandbox/emerge/
mkdir 1000g
mkdir strand_check

ancestry="ea"
study="cataract"

# Calculate allele frequencies
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --memory 2048 \
    --bfile ${ancestry}_chr_all \
    --freq \
    --out strand_check/${ancestry}_chr_all

# Get list of variant IDs from all studies for an ancestry group
cat ${ancestry}_chr_all.bim | \
    perl -lane 'if (($F[0]+0) <= 23) { print $F[1]; }' | \
    sort -u > ${ancestry}_chr_all_sorted_variants.txt

# Calculate autosome and chrX MAFs for 1000G EUR
pop="EUR"
ancestry="ea"
for chr in {1..22}; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${pop}_${chr} \
        --script_prefix ${pop}_chr${chr}.maf \
        --mem 10 \
        --nslots 1 \
        --priority 0 \
        --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
            --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.hap.gz \
            --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.legend.gz \
            --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
            --chr ${chr} \
            --out 1000g/${pop}_chr${chr}.maf \
            --extract ${ancestry}_chr_all_sorted_variants.txt \
            --keep_groups ${pop}
done


chr=23
/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name ${pop}_${chr} \
    --script_prefix ${pop}_chr${chr}.maf \
    --mem 6.8 \
    --nslots 1 \
    --priority 0 \
    --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
        --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.hap.gz \
        --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz \
        --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
        --chr ${chr} \
        --out 1000g/${pop}_chr${chr}.maf \
        --extract ${ancestry}_chr_all_sorted_variants.txt \
        --keep_groups ${pop}

# Merge per chr MAFs for EUR
pop="EUR"
head -n 1 1000g/${pop}_chr1.maf > 1000g/${pop}_chr_all.maf
tail -q -n +2 1000g/${pop}_chr{1..23}.maf \
    >> 1000g/${pop}_chr_all.maf

# Run discordance checks for EA group
pop="EUR"
ancestry="ea"
/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name ${ancestry}_crosscheck \
    --script_prefix strand_check/${ancestry}_allele_discordance_check \
    --mem 10 \
    --priority 0 \
    --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
        --study_bim_file ${ancestry}_chr_all.bim
        --study_frq_file strand_check/${ancestry}_chr_all.frq
        --ref_maf_file 1000g/${pop}_chr_all.maf
        --out_prefix strand_check/${ancestry}_allele_discordance"
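
The per-chromosome MAF files are merged with `head` and `tail`, which would silently produce an incomplete table if any of the 23 files were not yet written. Assuming the `qsub_job.sh` submissions return before the jobs finish, a check like the one below could be run before the merge. This is a sketch only: `missing_maf_files` is a hypothetical helper, and the demonstration uses placeholder files with made-up columns rather than real reference panel output.

```bash
# Hypothetical helper: print every per-chromosome MAF file (chr1..chr23)
# that is missing or empty under the given directory.
missing_maf_files() {
    local dir="$1"
    local pop="$2"
    local chr
    for chr in $(seq 1 23); do
        if [ ! -s "${dir}/${pop}_chr${chr}.maf" ]; then
            echo "${dir}/${pop}_chr${chr}.maf"
        fi
    done
}

# Demonstration: create placeholder files for chr1..chr22 and leave chr23 out
demo_dir=$(mktemp -d)
for chr in $(seq 1 22); do
    printf 'snp\tmaf\nrs%s\t0.10\n' "${chr}" > "${demo_dir}/EUR_chr${chr}.maf"
done
missing=$(missing_maf_files "${demo_dir}" EUR)
echo "Not ready: ${missing}"
```

An empty result from `missing_maf_files 1000g EUR` would indicate that all per-chromosome files are in place.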

### Resolving allele discordances
The allele discordances will be resolved by:
* Flipping SNPs whose allele discordance is fixed by a strand flip
* Removing SNPs with discordant names
* Removing SNPs with discordant positions
* Removing SNPs whose allele discordance is not resolved by flipping
* Removing A/T and C/G SNPs whose allele frequency differs from the reference population by more than 0.2

In [None]:
# EC2 command line #
cd /shared/sandbox/emerge

ancestry="ea"

# Build the remove and flip lists
echo -e "\n===============\nProcessing emerge \n"
# Create remove list
echo "Making remove list"
cat <(cut -f2 strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2) \
    <(cut -f2 strand_check/${ancestry}_allele_discordance.at_cg_snps_freq_diff_gt_0.2 | tail -n +2) \
    <(cut -f2 strand_check/${ancestry}_allele_discordance.discordant_names | tail -n +2) \
    <(cut -f2 strand_check/${ancestry}_allele_discordance.discordant_positions | tail -n +2) \
    <(cut -f2 strand_check/${ancestry}_allele_discordance.discordant_alleles_polymorphic_in_study_not_fixed_by_strand_flip | tail -n +2) | \
    sort -u > ${ancestry}_snps.remove

# Create flip list
echo "Making flip list"
comm -23 <(cut -f2 strand_check/${ancestry}_allele_discordance.discordant_alleles | tail -n +2 | sort -u) \
    <(cut -f2 strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2 | sort -u) \
    > ${ancestry}_snps.flip

# Apply filters
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --memory 2048 \
    --bfile ${ancestry}_chr_all \
    --exclude ${ancestry}_snps.remove \
    --flip ${ancestry}_snps.flip \
    --make-bed \
    --out ${ancestry}_filtered
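
The flip list is a set difference: every variant with discordant alleles, minus those that a strand flip does not fix. The toy example below illustrates that `comm -23` step on made-up data. The file names and the two-column layout (header line, variant ID in column 2) are assumptions chosen to mirror the `cut` and `tail` calls in the cell above, not the actual report format.

```bash
demo_dir=$(mktemp -d)

# Toy reports: a header line, then one variant per line with the ID in column 2
printf 'chr\tsnp\n1\trs1\n1\trs2\n2\trs3\n' > "${demo_dir}/discordant_alleles"
printf 'chr\tsnp\n1\trs2\n' > "${demo_dir}/not_fixed_by_strand_flip"

# Extract the ID column, drop the header, and sort (comm requires sorted input)
cut -f2 "${demo_dir}/discordant_alleles" | tail -n +2 | sort -u > "${demo_dir}/all.sorted"
cut -f2 "${demo_dir}/not_fixed_by_strand_flip" | tail -n +2 | sort -u > "${demo_dir}/unfixable.sorted"

# Keep only the lines unique to the first file: discordant but fixable by flipping
comm -23 "${demo_dir}/all.sorted" "${demo_dir}/unfixable.sorted" > "${demo_dir}/toy_snps.flip"

flip_ids=$(paste -sd, "${demo_dir}/toy_snps.flip")
echo "Flip list: ${flip_ids}"
```

Here `rs2` is discordant but not fixed by flipping, so it is left out of the flip list and would belong on the remove list instead.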

### Remove monomorphic variants
Monomorphic variants prevent MIS from accepting the genotype data. To remove them, a MAF threshold is applied that is arbitrarily small, below the smallest nonzero MAF possible in these data, so that only monomorphic variants are dropped.
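
As a rough check on the threshold, the smallest nonzero MAF for `n` diploid samples with no missing calls is `1 / (2n)`. Missing calls or haploid chrX genotypes only shrink the denominator, so the true minimum cannot be lower. The sketch below compares that floor with a threshold of `0.000001`. The sample count is illustrative, not the actual eMERGE count, and `min_nonzero_maf` is a hypothetical helper.

```bash
# Hypothetical helper: smallest nonzero MAF for n diploid samples, 1 / (2n)
min_nonzero_maf() {
    awk -v n="$1" 'BEGIN { printf "%.6g\n", 1 / (2 * n) }'
}

# Illustrative sample count; on the real data it could be read with
#   wc -l < ea_filtered.fam
n_samples=5000
floor_maf=$(min_nonzero_maf "${n_samples}")
echo "Smallest nonzero MAF for ${n_samples} samples: ${floor_maf}"

# Check that the MAF threshold sits below the floor
threshold=0.000001
below_floor=$(awk -v t="${threshold}" -v f="${floor_maf}" \
    'BEGIN { r = (t < f) ? "yes" : "no"; print r }')
echo "Threshold below floor: ${below_floor}"
```

Under this reasoning the threshold stays below the floor for any cohort of fewer than 500,000 samples.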

In [None]:
# EC2 command line #
cd /shared/sandbox/emerge

ancestry="ea"

# Drop monomorphic variants with a near-zero MAF threshold
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --memory 2048 \
    --bfile ${ancestry}_filtered \
    --maf 0.000001 \
    --make-bed \
    --out ${ancestry}_filtered_mono

## Imputation preparation for Michigan Imputation Server
Visit the [MIS Getting Started Webpage](https://imputationserver.sph.umich.edu/start.html#!pages/help) for more information about preparing the data for upload to MIS.
### VCF File Conversion

In [None]:
# EC2 command line #
cd /shared/sandbox/emerge
mkdir phased

# Split by chr and remove any individuals with missing data for whole chr
ancestry="ea"
for chr in {1..23}; do
    # Extract one chromosome and drop individuals missing more than 99% of its genotypes
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 4000 \
        --bfile ${ancestry}_filtered_mono \
        --chr ${chr} \
        --mind 0.99 \
        --make-bed \
        --out phased/${ancestry}_chr${chr}_for_phasing 
done > chr_splitting.log

__Note__: No subjects were removed.
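
One way to confirm that no subjects were dropped is to look for `.irem` files, which PLINK writes next to the output fileset when `--mind` removes samples. The sketch below totals the IDs listed in any such files. `count_removed_samples` is a hypothetical helper, and the demonstration runs on a scratch directory with a made-up `.irem` file rather than on the real output.

```bash
# Hypothetical helper: total number of sample IDs listed in the .irem files
# for chr1..chr23 under the given directory (0 if none exist).
count_removed_samples() {
    local dir="$1"
    local ancestry="$2"
    local total=0
    local chr irem n
    for chr in $(seq 1 23); do
        irem="${dir}/${ancestry}_chr${chr}_for_phasing.irem"
        if [ -f "${irem}" ]; then
            n=$(wc -l < "${irem}")
            total=$((total + n))
        fi
    done
    echo "${total}"
}

# Demonstration: no .irem files at first, then a made-up one listing two samples
demo_dir=$(mktemp -d)
removed_before=$(count_removed_samples "${demo_dir}" ea)
printf 'FAM1 ID1\nFAM2 ID2\n' > "${demo_dir}/ea_chr23_for_phasing.irem"
removed_after=$(count_removed_samples "${demo_dir}" ea)
echo "Removed before: ${removed_before}, after: ${removed_after}"
```

On the real output the call would be `count_removed_samples phased ea`, where a result of `0` agrees with the note above.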

In [None]:
# EC2 command line #
cd /shared/sandbox/emerge/phased
mkdir ea

ancestry="ea"
for chr in {1..22}; do
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 5000 \
        --bfile ${ancestry}_chr${chr}_for_phasing \
        --recode vcf bgz \
        --out ${ancestry}/${ancestry}_chr${chr}_final
done

# The chr23 file had to be altered to be compatible with MIS, which expects X
# rather than 23. --output-chr M writes X for chromosome 23, and
# --set-hh-missing sets heterozygous haploid genotypes to missing.
chr=23
ancestry="ea"
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --memory 5000 \
    --bfile ${ancestry}_chr${chr}_for_phasing \
    --output-chr M \
    --set-hh-missing \
    --recode vcf bgz \
    --out ${ancestry}/${ancestry}_chr${chr}_final
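
Before uploading, the chromosome labels in a compressed VCF can be inspected to confirm that the chr23 file now reports `X`. The following is a minimal sketch: `vcf_chromosomes` is a hypothetical helper, and the demonstration uses a two-record toy VCF compressed with `gzip`. Files written by `--recode vcf bgz` are BGZF-compressed, which `gzip -dc` can also read.

```bash
# Hypothetical helper: list the distinct values in the CHROM column of a
# gzip- or bgzip-compressed VCF.
vcf_chromosomes() {
    gzip -dc "$1" | grep -v '^#' | cut -f1 | sort -u
}

# Demonstration on a toy VCF (not real genotype data)
demo_dir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'X\t100\trs1\tA\tG\t.\t.\t.\n'
    printf 'X\t200\trs2\tC\tT\t.\t.\t.\n'
} | gzip > "${demo_dir}/toy_chr23_final.vcf.gz"

chrom_codes=$(vcf_chromosomes "${demo_dir}/toy_chr23_final.vcf.gz")
echo "Chromosome codes: ${chrom_codes}"
```

On the real output the call would be `vcf_chromosomes ea/ea_chr23_final.vcf.gz`.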

### Uploading parameters
These are the parameters that were selected on MIS.

__Name__: chr1-22

__Reference Panel__: 1000G Phase 3 v5

__Input Files__: File Upload <br>

* Select Files - select VCF files that were downloaded to local machine from cloud. <br>

__Phasing__: ShapeIT v2.r790 (unphased) 

__Population__: EUR

__Mode__: Quality Control & Imputation

* I will not attempt to re-identify or contact research participants.
* I will report any inadvertent data release, security breach or other data management incident of which I become aware.

__Note__: chr23 was submitted separately.