# VIDUS chr23 Imputation Preparation
__Author__: Jesse Marks <br>
**Date**: September 20, 2018 <br>
**GitHub**: [Issue #98](https://github.com/RTIInternational/bioinformatics/issues/98#) <br>

This document logs the steps taken to perform pre-imputation data processing on the dataset [VIDUS](https://www.bccsu.ca/vidus/). The starting point for this analysis is after quality control of observed genotypes. The quality controlled genotypes are oriented on the GRCh37 plus strand. 

## Software and tools
The software and tools used for processing these data are:
* [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html) (MIS)
* [Amazon Web Services (AWS) - Cloud Computing Services](https://aws.amazon.com/)
    * Linux AMI
* [PLINK v1.90 beta 4.10](https://www.cog-genomics.org/plink/)
* [bgzip](http://www.htslib.org/doc/tabix.html)
* [BCF Tools](http://www.htslib.org/doc/bcftools.html)
* Windows 10 with [Cygwin](https://cygwin.com/) installed
* GNU bash version 4.2.46

## Data retrieval and organization

VIDUS <br>
EA variant counts: 17,565 at start; 16,472 after missingness filtering; 16,471 after HWE filtering

Updated VIDUS chrX QC - new files located here: <br>
https://rti-midas-data.s3.amazonaws.com/studies/vidus/observed/final/vidus.ea.chr23.bed <br>
https://rti-midas-data.s3.amazonaws.com/studies/vidus/observed/final/vidus.ea.chr23.bim <br>
https://rti-midas-data.s3.amazonaws.com/studies/vidus/observed/final/vidus.ea.chr23.fam <br>

### chrX Statistics Breakdown 
This table includes the initial number of variants in each study as well as the final number of variants in the intersection set. `Variants Post-Filtering` refers to the count remaining after two filtering steps: (1) removing variants with discordant alleles and (2) removing monomorphic variants.

#### EA
| Data Set      | Initial Variants (Post-QC)| Variants Post-Filtering  | Intersection     |
|---------------|---------------------------|--------------------------|------------------|
| VIDUS         |   16,471                  |  14,705                  | NA               |

### Create Directory Structure & Download Data
The following section must be edited for each run to reflect where the data are stored.

In [None]:
### EC2 command line (Bash) ###

## Create directory structure & download data ##
base_dir=/home/ec2-user/jmarks/heroin/chr23_impute
base_name="chr23" # chr23 or chr_all
ancestry_list="ea" # space delimited Ex. "ea aa ha"
study_list="VIDUS" # space delimited

# -p creates missing parent directories and does not fail if the directory exists
mkdir -p ${base_dir}/1000g
for study in ${study_list};do
    for ancestry in ${ancestry_list};do
        mkdir -p ${base_dir}/${study}/genotype/observed/${ancestry}
    done
done


# Edit this section! 
for ext in {bed,bim,fam};do
    aws s3 cp s3://rti-midas-data/studies/vidus/observed/final/vidus.ea.chr23.${ext} \
        ${base_dir}/VIDUS/genotype/observed/ea/ --quiet &
done
wait # block until all background downloads have finished
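
Because the `aws s3 cp` commands above run in the background, a quick check that all three PLINK files arrived can catch a failed transfer early. The sketch below is a hypothetical helper (`check_plink_fileset` is not part of the original pipeline) and assumes the three files share a common prefix.

```bash
# Hypothetical helper (not part of the original pipeline): succeed only if
# the .bed, .bim, and .fam files for a PLINK prefix all exist and are non-empty.
check_plink_fileset() {
    local prefix=$1
    local ext
    for ext in bed bim fam; do
        if [ ! -s "${prefix}.${ext}" ]; then
            echo "missing or empty: ${prefix}.${ext}" >&2
            return 1
        fi
    done
    echo "ok: ${prefix}"
}
```

Once the downloads have finished, it could be called as `check_plink_fileset ${base_dir}/VIDUS/genotype/observed/ea/vidus.ea.chr23`.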

# Data processing
## GRCh37 strand and allele discordance check
### MAF for study data

In [None]:
### EC2 command line (Bash) ###

# write out the MAF report
for study in ${study_list}; do
    study_dir=${base_dir}/${study}/strand_check
    mkdir -p ${study_dir}
    for ancestry in ${ancestry_list};do
        data_dir=${base_dir}/${study}/genotype/observed/${ancestry}
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bed ${data_dir}/*bed \
            --bim ${data_dir}/*bim \
            --fam ${data_dir}/*fam \
            --freq \
            --out ${study_dir}/${ancestry}_${base_name}
    done
done

# Get list of variants from all studies
# (concatenate every study's .bim before sorting so that one study
# does not overwrite the variant list of another)
for ancestry in ${ancestry_list}; do
    for study in ${study_list};do
        cat ${base_dir}/${study}/genotype/observed/${ancestry}/*bim
    done | \
        perl -lane 'if (($F[0]+0) <= 23) { print $F[1]; }' | \
        sort -u > ${base_dir}/${ancestry}_${base_name}_sorted_variants.txt
done
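
To illustrate what the variant-list step keeps, here is a minimal sketch on made-up `.bim` rows. It uses an `awk` filter that is assumed to behave like the `perl` one-liner above; the function name `list_variant_ids` and the variant IDs are hypothetical.

```bash
# Print the ID (column 2) of every variant whose chromosome code (column 1)
# is <= 23, sorted and de-duplicated. Reads .bim-formatted lines from stdin.
list_variant_ids() {
    awk '($1 + 0) <= 23 { print $2 }' | sort -u
}

# Toy .bim rows (chr, variant ID, cM, bp, A1, A2); the IDs are made up.
printf '23\trsB\t0\t200\tA\tG\n23\trsA\t0\t100\tC\tT\n25\trsC\t0\t300\tA\tC\n23\trsA\t0\t100\tC\tT\n' \
    | list_variant_ids
```

On this toy input the chromosome 25 variant is dropped and the duplicated `rsA` row is collapsed, leaving `rsA` and `rsB`.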

### MAF for 1000G
This pipeline is currently set up to handle EUR and AFR populations. 
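
The loops below map the ancestry code to a 1000G population with an `if`/`else`, so any code other than `ea` is silently treated as AFR. A minimal sketch of a stricter mapping, assuming `aa` is the code used for African ancestry (the helper name `ancestry_to_pop` is hypothetical and not part of the original pipeline):

```bash
# Hypothetical helper: map a study ancestry code to a 1000G super-population.
# Only the two populations this pipeline handles are defined; any other code
# fails loudly instead of falling back to AFR.
ancestry_to_pop() {
    case "$1" in
        ea) echo "EUR" ;;
        aa) echo "AFR" ;;   # assumption: "aa" denotes African ancestry
        *)  echo "unsupported ancestry: $1" >&2; return 1 ;;
    esac
}
```

Inside a loop this could be used as `pop=$(ancestry_to_pop ${ancestry}) || exit 1`.
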
#### Autosomes
Get 1000G MAF for chromosomes 1&ndash;22 (autosomes)

In [None]:
### EC2 command line (Bash)

# Calculate autosome MAFs for 1000G populations
for ancestry in ${ancestry_list};do

    if [ $ancestry == "ea" ]
    then
        pop="EUR"
    else
        pop="AFR"
    fi
    
    for chr in {1..22}; do
        /shared/bioinformatics/software/scripts/qsub_job.sh \
            --job_name ${pop}_${chr}_MAF \
            --script_prefix ${base_dir}/1000g/${pop}_chr${chr}.maf \
            --mem 6.8 \
            --nslots 1 \
            --priority 0 \
            --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
                --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.hap.gz \
                --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.legend.gz \
                --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
                --chr ${chr} \
                --out ${base_dir}/1000g/${pop}_chr${chr}.maf \
                --extract ${base_dir}/${ancestry}_${base_name}_sorted_variants.txt \
                --keep_groups ${pop}
    done
done

#### chrX 
Get 1000G MAF for chromosome 23 (chrX)

In [None]:
### Bash ###

for ancestry in ${ancestry_list};do
    chr=23

    if [ $ancestry == "ea" ]
    then
        pop="EUR"
    else
        pop="AFR"
    fi

    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${pop}_23_MAF \
        --script_prefix ${base_dir}/1000g/${pop}_chr${chr}.maf \
        --mem 6.8 \
        --nslots 1 \
        --priority 0 \
        --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
            --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.hap.gz \
            --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz \
            --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
            --chr $chr \
            --out ${base_dir}/1000g/${pop}_chr${chr}.maf \
            --extract ${base_dir}/${ancestry}_${base_name}_sorted_variants.txt \
            --keep_groups ${pop}
done

### Merge 1000G chromosomes
This step is needed only if the MAF was calculated for more than one chromosome.

In [None]:
### Bash ###

# Merge per chr MAFs for each 1000G population
for ancestry in ${ancestry_list};do
    if [ $ancestry == "ea" ]
    then
        pop="EUR"
    else
        pop="AFR"
    fi
    
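    # Keep the header from chr1, then append the data rows of every chromosome
    head -n 1 ${base_dir}/1000g/${pop}_chr1.maf > ${base_dir}/1000g/${pop}_chr_all.maf
    tail -q -n +2 ${base_dir}/1000g/${pop}_chr{1..23}.maf \
        >> ${base_dir}/1000g/${pop}_chr_all.maf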
done
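
After merging, it may be worth confirming that no rows were lost. The sketch below is a hypothetical check (`check_merged_rows` is not part of the original pipeline); it assumes every per-chromosome file has exactly one header line, as the merge step above does.

```bash
# Hypothetical check: the merged file should contain as many data rows
# (lines after the header) as all per-chromosome files combined.
check_merged_rows() {
    local merged=$1; shift
    local expected=0 actual n f
    for f in "$@"; do
        n=$(( $(tail -n +2 "$f" | wc -l) ))
        expected=$((expected + n))
    done
    actual=$(( $(tail -n +2 "$merged" | wc -l) ))
    if [ "$actual" -eq "$expected" ]; then
        echo "ok: ${actual} data rows"
    else
        echo "row mismatch: merged=${actual} expected=${expected}" >&2
        return 1
    fi
}
```

For example: `check_merged_rows ${base_dir}/1000g/EUR_chr_all.maf ${base_dir}/1000g/EUR_chr{1..23}.maf`.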

### Autosome Discordant Check

In [None]:
### Bash ###

# Run discordance checks for each ancestry group
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        if [ $ancestry = "ea" ]; then
            pop=EUR
        else
            pop=AFR
        fi

       /shared/bioinformatics/software/scripts/qsub_job.sh \
           --job_name ${ancestry}_${study}_crosscheck \
           --script_prefix ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance_check \
           --mem 6 \
           --nslots 4 \
           --priority 0 \
           --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
               --study_bim_file ${base_dir}/${study}/genotype/observed/${ancestry}/*bim
               --study_frq_file ${base_dir}/${study}/strand_check/${ancestry}_chr_all.frq
               --ref_maf_file ${base_dir}/1000g/${pop}_chr_all.maf
               --out_prefix ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance"
    done
done

### chrX Discordant Check

In [None]:
### Bash ###

for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        if [ $ancestry = "ea" ]; then
            pop=EUR
        else
            pop=AFR
        fi

        # chr23 discordance check
        /shared/bioinformatics/software/scripts/qsub_job.sh \
            --job_name ${ancestry}_${study}_crosscheck \
            --script_prefix ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance_check \
            --mem 6.8 \
            --nslots 2 \
            --priority 0 \
            --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
                --study_bim_file ${base_dir}/${study}/genotype/observed/${ancestry}/*bim
                --study_frq_file ${base_dir}/${study}/strand_check/${ancestry}_chr23.frq
                --ref_maf_file ${base_dir}/1000g/${pop}_chr23.maf
                --out_prefix ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance"
    done
done

### Resolving allele discordances
The allele discordances are resolved by
* Flipping the strand of SNPs whose allele discordance is fixed by a strand flip
* Removing SNPs with discordant names
* Removing SNPs with discordant positions
* Removing SNPs whose allele discordance is not resolved by a strand flip
* Removing A/T and C/G SNPs whose allele frequency differs from the reference population frequency by more than 0.2

In [None]:
### Bash ###

# Build the remove and flip lists, then apply them
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        echo -e "\n===============\nProcessing ${study}_${ancestry}\n"
        echo "Making remove list"
        cat <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2) \
            <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.at_cg_snps_freq_diff_gt_0.2 | tail -n +2) \
            <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_names | tail -n +2) \
            <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_positions | tail -n +2) \
            <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_polymorphic_in_study_not_fixed_by_strand_flip | tail -n +2) | \
              sort -u > ${base_dir}/${study}/strand_check/${ancestry}_snps.remove

        # Create flip list
        echo "Making flip list"
        comm -23 <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles | tail -n +2 | sort -u) \
            <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2 | sort -u) \
            > ${base_dir}/${study}/strand_check/${ancestry}_snps.flip

        # Apply filters
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bed ${base_dir}/${study}/genotype/observed/${ancestry}/*bed \
            --bim ${base_dir}/${study}/genotype/observed/${ancestry}/*bim \
            --fam ${base_dir}/${study}/genotype/observed/${ancestry}/*fam \
            --exclude ${base_dir}/${study}/strand_check/${ancestry}_snps.remove \
            --flip ${base_dir}/${study}/strand_check/${ancestry}_snps.flip \
            --make-bed \
            --out ${base_dir}/${study}/genotype/observed/${ancestry}/${ancestry}_filtered
    done
done
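
The flip list above is a set difference: SNPs with discordant alleles minus those not fixed by a strand flip. A minimal sketch of that operation on arbitrary ID lists follows; the function name `ids_only_in_first` is hypothetical, and it uses a temporary file rather than process substitution so that it also runs under a plain POSIX shell.

```bash
# Print IDs that appear in the first file but not in the second.
# comm requires sorted input, so both lists are sorted first.
ids_only_in_first() {
    local tmp
    tmp=$(mktemp)
    sort -u "$2" > "$tmp"
    # -2 suppresses lines unique to the second list, -3 suppresses shared lines
    sort -u "$1" | comm -23 - "$tmp"
    rm -f "$tmp"
}
```

Given one file listing all discordant SNP IDs and another listing the IDs not fixed by flipping, the output is the set of SNPs to pass to `--flip`.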

### Remove monomorphic variants
MIS rejects genotype data that contain monomorphic variants. To remove them, a MAF threshold is applied that is arbitrarily small yet still below the smallest non-zero MAF possible in these data, so only variants with a MAF of zero are dropped.

In [None]:
### Bash ###

# Remove monomorphic variants
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        geno_dir=${base_dir}/${study}/genotype/observed/${ancestry}

        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bfile ${geno_dir}/${ancestry}_filtered \
            --maf 0.000001 \
            --make-bed \
            --out ${geno_dir}/${ancestry}_filtered_mono
    done
done
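
As a rough check that the `--maf 0.000001` threshold removes only monomorphic variants, the smallest non-zero MAF can be estimated from the sample counts. The sketch below uses the 712 males and 228 females reported in the MIS sex check later in this document, and assumes each male contributes one allele and each female two on non-PAR chrX; the function name is hypothetical.

```bash
# Smallest non-zero MAF on non-PAR chrX: one minor allele copy divided by the
# total allele count, assuming males are counted as haploid (an assumption).
min_nonzero_maf_chrx() {
    local males=$1 females=$2
    awk -v m="$males" -v f="$females" 'BEGIN { printf "%.6f\n", 1 / (m + 2 * f) }'
}

min_nonzero_maf_chrx 712 228
```

This prints `0.000856`, several orders of magnitude above the threshold, so under these assumptions any variant with at least one minor allele copy is retained.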

## SNP Intersection
to-do

## Imputation preparation for Michigan Imputation Server
Visit the [MIS Getting Started Webpage](https://imputationserver.sph.umich.edu/start.html#!pages/help) for more information about preparing the data for upload to MIS.
### VCF File Conversion

**Note**: this section will be different once the intersection section has been created (see above).

In [None]:
### Bash ###

mkdir -p ${base_dir}/phase_prep

# Split by chr and remove any individuals with missing data for whole chr

for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        geno_dir=${base_dir}/${study}/genotype/observed/${ancestry}
        for chr in {1..23}; do
            /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
                --noweb \
                --memory 4000 \
                --bfile ${geno_dir}/${ancestry}_filtered_mono \
                --chr ${chr} \
                --mind 0.99 \
                --make-bed \
                --out ${base_dir}/phase_prep/${ancestry}_chr${chr}_for_phasing 
        done > chr_splitting.log
    done
done

__Note__: No subjects were removed.

In [None]:
# EC2 command line #

chr=23
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        mkdir -p ${base_dir}/phase_prep/${ancestry}

        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 5000 \
            --bfile ${base_dir}/phase_prep/${ancestry}_chr${chr}_for_phasing \
            --output-chr M \
            --set-hh-missing \
            --recode vcf bgz \
            --out ${base_dir}/phase_prep/${ancestry}/${ancestry}_chr${chr}_final
    done
done
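
Before uploading, the sample and variant counts in the compressed VCF can be compared with what MIS later reports during input validation. The following is a minimal sketch; `vcf_counts` is a hypothetical helper, and it assumes a standard VCF layout in which nine fixed columns precede the sample columns.

```bash
# Hypothetical helper: report the number of samples and variants in a
# gzip/bgzip-compressed VCF.
vcf_counts() {
    local vcf=$1
    gzip -cd "$vcf" | awk -F'\t' '
        /^#CHROM/ { samples = NF - 9 }   # 9 fixed columns precede the samples
        !/^#/     { variants++ }         # every non-header line is one variant
        END       { printf "samples=%d variants=%d\n", samples, variants }'
}
```

For example: `vcf_counts ${base_dir}/phase_prep/ea/ea_chr23_final.vcf.gz`. For the EA data, the counts would be expected to agree with the MIS input validation report in the upload section (940 samples, 14705 SNPs).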

## Upload to Michigan Imputation Server (MIS)
Transfer the \*.vcf.gz files (one per chromosome) to a local machine and then upload them to MIS.

### Uploading parameters EA
These are the parameters that were selected on MIS.

__Name__: vidus_ea_23

__Reference Panel__: 1000G Phase 3 v5

__Input Files__: File Upload <br>

* Select Files - select the VCF files that were downloaded from the cloud to the local machine. <br>

__Phasing__: ShapeIT v2.r790 (unphased) 

__Population__: EUR

__Mode__: Quality Control & Imputation

* I will not attempt to re-identify or contact research participants.
* I will report any inadvertent data release, security breach or other data management incident of which I become aware.

**Input Validation**
```
1 valid VCF file(s) found.

Samples: 940
Chromosomes: X
SNPs: 14705
Chunks: 8
Datatype: unphased
Reference Panel: phase3
Phasing: shapeit
```

**Quality Control**
```
ChrX Statistics: 
Submitting 2 jobs: 
chrX Non.Par male ( as Chr X II ) 
chrX Non.Par female ( as Chr X I ) 
NonPar Sex Check: 
Males: 712
Females: 228
No Sex dedected and therefore filtered: 0
```

# Download Imputed Data from MIS
First, download the data from the Michigan Imputation Server by clicking the link in the email MIS sends when the job has finished. That page lists the commands for downloading the data.

In [None]:
cd /home/ec2-user/jmarks/heroin/chr23_impute/VIDUS/genotype/imputed

wget https://imputationserver.sph.umich.edu/share/results/82285d4825c7da133fc96eb8d36954c7/chr_X.no.auto_female.zip
wget https://imputationserver.sph.umich.edu/share/results/97b24fa6b1c8ed67fafc61ea0f4e859c/chr_X.no.auto_male.zip

## Inflate imputation results
The zip files from the Michigan Imputation Server (MIS) need to be inflated before you can begin working with them. They require a passcode, which MIS sends by email.

In [None]:
### EC2 console ###
cd /home/ec2-user/jmarks/heroin/chr23_impute/VIDUS/genotype/imputed

# inflate chr results
for file in *zip;do
    unzip -P "cBrhKJoZ5pX0G" $file 
done

# Upload to S3
Uploaded to: `s3://rti-midas-data/studies/vidus/imputed/20180921/`

In [None]:
aws s3 sync . s3://rti-midas-data/studies/vidus/imputed/20180921/