
# Pre-Imputation Automated Pipeline (chrX)
**Author:** Jesse Marks <br>

This notebook documents the pre-imputation genotype data processing pipeline, which prepares data for submission to the [Michigan Imputation Server (MIS)](https://imputationserver.sph.umich.edu/start.html). To submit genotype data to MIS for imputation, you must first create an account on their website.

The starting point (input data) for this pipeline is the observed genotype data after quality control (QC). The post-QC genotype data should be oriented on the GRCh37 plus strand. When multiple data sets are merged for imputation, only the intersection set of variants is used; this is based on the findings of [Johnson et al.](https://link.springer.com/article/10.1007/s00439-013-1266-7).

## Software and tools
The software and tools used for processing these data are:
* [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html) (MIS)
* [Amazon Web Services (AWS) - Cloud Computing Services](https://aws.amazon.com/)
    * Linux AMI
* [PLINK v1.90 beta 4.10](https://www.cog-genomics.org/plink/)
* [bgzip](http://www.htslib.org/doc/tabix.html)
* [BCF Tools](http://www.htslib.org/doc/bcftools.html)
* Windows 10 with [Cygwin](https://cygwin.com/) installed
* GNU bash version 4.2.46

## Example Data Set
The example cohort we test this pipeline on is [VIDUS](https://www.bccsu.ca/vidus/). There is only one ancestry associated with this cohort, namely European ancestry (EA).

### QC Stats Summary:


### Pre-Imputation Stats Summary
| Data Set      | Initial Variants (Post-QC) | Variants Post-Filtering  | Intersection     |
|---------------|----------------------------|--------------------------|------------------|
|               |                            |                          |                  |

# Data Processing

## Create Directory Structure & Download Data
The following section needs to be modified each time to reflect:
* where the genotype data (post-QC) are stored
* where the base directory for the pre-imputation data processing will be
* the study or studies involved
* the ancestry group(s) involved
* the data to be processed (all_chr or chr23)

In [None]:
# parameters 
base_dir=/shared/jmarks/hiv/wihs3/genotype/imputed/processing # DO NOT end in forward slash
ancestry_list="aa ea" # space delimited Ex. "ea aa ha"
study_list="wihs3" # space delimited 
#base_name="chr_all" # chr_all chr23 

# create directory structure
mkdir -p ${base_dir}/{intersect,1000g,impute_prep}
for study in ${study_list};do
    mkdir -p ${base_dir}/${study}/strand_check

    for ancestry in ${ancestry_list};do
        mkdir -p ${base_dir}/${study}/genotype/${ancestry}
        
    done
done


## copy post-qc genotype data to correct directory
## AND RENAME TO CORRECT NAMING SCHEMA <study_ancestry.$extension> 
## also unzip the Plink files

#/shared/jmarks/hiv/cfar_coga/genotype/imputed/processing/cfar/genotype/aa
#/shared/jmarks/hiv/cfar_coga/genotype/imputed/processing/cfar/genotype/ea
#/shared/jmarks/hiv/cfar_coga/genotype/imputed/processing/cfar/genotype/ha
#/shared/jmarks/hiv/cfar_coga/genotype/imputed/processing/coga/genotype/aa
#/shared/jmarks/hiv/cfar_coga/genotype/imputed/processing/coga/genotype/ea
#/shared/jmarks/hiv/cfar_coga/genotype/imputed/processing/coga/genotype/ha
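
The copy, rename, and unzip step is left to the user in the cell above. The following is a minimal sketch of one way to do it, assuming the post-QC files share a common prefix and may be gzipped. The helper name `stage_fileset` and the source prefix passed to it are hypothetical and not part of the original pipeline.

```bash
# Hypothetical helper: copy one post-QC PLINK file set into the pipeline layout,
# rename it to <study>_<ancestry>.<ext>, and decompress it if it is gzipped.
# Usage: stage_fileset <source_prefix> <base_dir> <study> <ancestry>
stage_fileset() {
    local src_prefix=$1 base=$2 study=$3 ancestry=$4 ext dest
    dest=${base}/${study}/genotype/${ancestry}
    mkdir -p "${dest}"
    for ext in bed bim fam; do
        if [ -f "${src_prefix}.${ext}.gz" ]; then
            # Decompress straight into the renamed target file
            gunzip -c "${src_prefix}.${ext}.gz" > "${dest}/${study}_${ancestry}.${ext}"
        elif [ -f "${src_prefix}.${ext}" ]; then
            cp "${src_prefix}.${ext}" "${dest}/${study}_${ancestry}.${ext}"
        else
            echo "Missing ${src_prefix}.${ext}[.gz]" >&2
            return 1
        fi
    done
}
```

For example, `stage_fileset /path/to/qc/prefix ${base_dir} wihs3 ea` would populate `${base_dir}/wihs3/genotype/ea/wihs3_ea.{bed,bim,fam}`, which is the naming the later cells expect.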

## GRCh37 strand and allele discordance check
### MAF for study data (all chromosomes)

In [None]:
# Write out the MAF report
for study in ${study_list}; do
    for ancestry in ${ancestry_list}; do
        docker run -v "${base_dir}/$study/:/data/" rticode/plink:1.9 plink \
            --bfile /data/genotype/$ancestry/${study}_${ancestry} \
            --freq \
            --out /data/strand_check/${ancestry}_chr_all
    done
done


# Get list of variants from all studies
studies=($study_list)  #studies=(uhs1 uhs2 uhs3_v1-2 uhs3_v1-3 uhs4) # array of study names
num=${#studies[@]}

## Get intersection set
for ancestry in ${ancestry_list};do
    bim_files=()
    for (( i=0; i<${num}; i++ ));do
        bim_files+=(${base_dir}/${studies[$i]}/genotype/$ancestry/*bim)
    done
    
    echo -e "\nCalculating intersection between $ancestry ${study_list}...\n"
    cat ${bim_files[@]}| cut -f2 | sort |  uniq -c | awk -v num=$num '$1 == num {print $2}' \
        > ${base_dir}/intersect/${ancestry}_variant_intersection.txt
    wc -l ${base_dir}/intersect/${ancestry}_variant_intersection.txt
done 
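
To make the intersection logic above concrete, here is a small sketch that runs the same `sort | uniq -c | awk` chain on two toy `.bim` files. The file names and variant IDs are made up for illustration.

```bash
# Toy demonstration of the intersection logic on two hypothetical .bim files.
tmp=$(mktemp -d)
printf '1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tA\tC\n' > "${tmp}/studyA.bim"
printf '1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tA\tC\n1\trs4\t0\t400\tG\tT\n' > "${tmp}/studyB.bim"

num=2  # number of studies being intersected
# Count how often each variant ID (column 2) occurs across the files,
# then keep only the IDs whose count equals the number of studies.
intersection=$(cat "${tmp}"/*.bim | cut -f2 | sort | uniq -c | awk -v num="${num}" '$1 == num {print $2}')
echo "${intersection}"   # prints rs2 and rs3, one per line
rm -rf "${tmp}"
```

Note that this counting approach assumes each variant ID occurs at most once per `.bim` file; an ID duplicated within a single file would be counted twice.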

### MAF for 1000G
The current setup requires the 1000G MAF for autosomes and chrX to be processed separately. This pipeline is also currently set up to handle EUR and AFR populations. 
#### Autosomes
Get 1000G MAF for chromosomes 1 to 22 (autosomes).

In [None]:
# Calculate autosome MAFs for 1000G populations
for ancestry in ${ancestry_list};do

    if [ $ancestry == "ea" ]
    then
        pop="EUR"
    else
        pop="AFR"
    fi
    
    for chr in {1..22}; do
        /shared/bioinformatics/software/scripts/qsub_job.sh \
            --job_name ${pop}_${chr}_MAF \
            --script_prefix ${base_dir}/1000g/${pop}_chr${chr}.maf \
            --mem 6.8 \
            --nslots 3 \
            --priority 0 \
            --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
                --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.hap.gz \
                --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.legend.gz \
                --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
                --chr ${chr} \
                --out ${base_dir}/1000g/${pop}_chr${chr}.maf \
                --extract ${base_dir}/intersect/${ancestry}_variant_intersection.txt \
                --keep_groups ${pop}
    done
done
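
The `if`/`else` used in these cells maps every ancestry other than `ea` to `AFR`, so a label such as `ha` would silently be compared against the African reference. A possible alternative, sketched here with a hypothetical helper named `ancestry_to_pop`, is to make the mapping explicit and fail on unsupported labels.

```bash
# Hypothetical helper: map a study ancestry label to a 1000G super-population code.
# Only the two populations this pipeline is set up for are accepted.
ancestry_to_pop() {
    case "$1" in
        ea) echo "EUR" ;;
        aa) echo "AFR" ;;
        *)  echo "Unsupported ancestry: $1" >&2; return 1 ;;
    esac
}

for ancestry in ea aa; do
    pop=$(ancestry_to_pop "${ancestry}") || exit 1
    echo "${ancestry} -> ${pop}"
done
```

Inside the pipeline loops, `pop=$(ancestry_to_pop "${ancestry}") || continue` could stand in for the `if`/`else` block.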

#### chrX 
Get 1000G MAF for chromosome 23 (chrX).

In [None]:
chr=23
for ancestry in ${ancestry_list};do

    if [ $ancestry == "ea" ]
    then
        pop="EUR"
    else
        pop="AFR"
    fi

    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${pop}_23_MAF \
        --script_prefix ${base_dir}/1000g/${pop}_chr${chr}.maf \
        --mem 6.8 \
        --nslots 1 \
        --priority 0 \
        --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
            --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.hap.gz \
            --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz \
            --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
            --chr $chr \
            --out ${base_dir}/1000g/${pop}_chr${chr}.maf \
            --extract ${base_dir}/intersect/${ancestry}_variant_intersection.txt \
            --keep_groups ${pop}
done

### Merge 1000G chromosomes
This step is only needed if the MAF was calculated for more than one chromosome. The loop below expects a per-chromosome MAF file for each of chromosomes 1 through 23.

In [None]:
# Merge per chr MAFs for each 1000G population
for ancestry in ${ancestry_list};do
    if [ $ancestry == "ea" ]
    then
        pop="EUR"
    else
        pop="AFR"
    fi

    head -n1 ${base_dir}/1000g/${pop}_chr1.maf > ${base_dir}/1000g/${pop}_chr_all.maf
    for chr in {1..23}; do
            tail -q -n +2 ${base_dir}/1000g/${pop}_chr${chr}.maf >> \
                ${base_dir}/1000g/${pop}_chr_all.maf
    done
done
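
The merge above appends whatever per-chromosome files exist, so a missing file (for example, a failed cluster job) produces a partial `chr_all` table without an obvious error. The sketch below wraps the same `head`/`tail` logic in a hypothetical helper, `merge_maf_files`, that checks its inputs first. The column names in the toy data are placeholders and not the real header written by the MAF script.

```bash
# Hypothetical helper: concatenate per-chromosome MAF tables, keeping one header line.
# Refuses to run if any input file is missing or empty.
# Usage: merge_maf_files <output_file> <chr_file> [<chr_file> ...]
merge_maf_files() {
    local out=$1 f
    shift
    for f in "$@"; do
        [ -s "${f}" ] || { echo "Missing or empty: ${f}" >&2; return 1; }
    done
    head -n 1 "$1" > "${out}"          # header from the first file
    for f in "$@"; do
        tail -n +2 "${f}" >> "${out}"  # data rows from every file
    done
}

# Toy input with placeholder column names
tmp=$(mktemp -d)
printf 'id\tmaf\nrs1\t0.10\nrs2\t0.25\n' > "${tmp}/EUR_chr1.maf"
printf 'id\tmaf\nrs3\t0.40\n' > "${tmp}/EUR_chr2.maf"
merge_maf_files "${tmp}/EUR_chr_all.maf" "${tmp}/EUR_chr1.maf" "${tmp}/EUR_chr2.maf"
merged_lines=$(wc -l < "${tmp}/EUR_chr_all.maf" | tr -d ' ')
echo "merged lines: ${merged_lines}"   # one header plus three data rows
rm -rf "${tmp}"
```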

### Autosome Discordant Check

In [None]:
# Run discordance checks for each ancestry group
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        if [ $ancestry = "ea" ]; then
            pop=EUR
        else
            pop=AFR
        fi

       /shared/bioinformatics/software/scripts/qsub_job.sh \
           --job_name ${ancestry}_${study}_crosscheck \
           --script_prefix ${base_dir}/$study/strand_check/${ancestry}_allele_discordance_check \
           --mem 6 \
           --nslots 3 \
           --priority 0 \
           --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
               --study_bim_file ${base_dir}/${study}/genotype/${ancestry}/*bim
               --study_frq_file ${base_dir}/${study}/strand_check/${ancestry}_chr_all.frq
               --ref_maf_file ${base_dir}/1000g/${pop}_chr_all.maf
               --out_prefix ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance"
    done
done

### chrX Discordant Check
Run the cell below (after uncommenting it) only if you are processing chrX alone.

In [None]:
#for study in ${study_list}; do
#    for ancestry in ${ancestry_list};do
#        if [ $ancestry = "ea" ]; then
#            pop=EUR
#        else
#            pop=AFR
#        fi
#
#        # chr23 discordance check
#        /shared/bioinformatics/software/scripts/qsub_job.sh \
#            --job_name ${ancestry}_${study}_crosscheck \
#            --script_prefix ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance_check \
#            --mem 6.8 \
#            --nslots 1 \
#            --priority 0 \
#            --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
#                --study_bim_file ${base_dir}/${study}/genotype/${ancestry}/*bim
#                --study_frq_file ${base_dir}/${study}/strand_check/${ancestry}_chr23.frq
#                --ref_maf_file ${base_dir}/1000g/${pop}_chr23.maf
#                --out_prefix ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance"
#    done
#done

### Resolving allele discordances
The allele discordances will be resolved by
* Flipping allele discordances that are fixed by flipping
* Removing SNPs with discordant names
* Removing SNPs with discordant positions
* Removing allele discordances that are not resolved by flipping
* Removing variants whose allele frequencies deviate strongly from the reference population allele frequencies

**Note**: We could flip the SNPs listed in the `snps.flip` file created here. However, we opt not to in this case, because flipping did not resolve the discordances, most likely because the affected variants were monomorphic.

In [None]:
# Apply filters
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        echo -e "\n===============\nProcessing ${study}_${ancestry}\n"
        echo "Making remove list"
        cat <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2) \
            <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.at_cg_snps_freq_diff_gt_0.2 | tail -n +2) \
            <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_names | tail -n +2) \
            <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_positions | tail -n +2) \
            <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_polymorphic_in_study_not_fixed_by_strand_flip | tail -n +2) | \
              sort -u > ${base_dir}/${study}/strand_check/${ancestry}_snps.remove

        # Create flip list
        echo "Making flip list"
        comm -23 <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles | tail -n +2 | sort -u) \
                 <(cut -f2,2 ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2 | sort -u) \
                 > ${base_dir}/${study}/strand_check/${ancestry}_snps.flip

        # Apply filters
        docker run -v ${base_dir}/$study/:/data/ rticode/plink:1.9 plink \
            --bfile  /data/genotype/${ancestry}/${study}_${ancestry} \
            --exclude /data/strand_check/${ancestry}_snps.remove \
            --make-bed \
            --out     /data/${ancestry}_filtered
    done
done

wc -l $base_dir/*/*filtered.bim
wc -l $base_dir/*/strand_check/*remove
#wc -l $base_dir/*/strand_check/*snps.flip
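
The flip list above is a set difference computed with `comm -23`. The sketch below shows the same operation on two toy ID lists; the file names and variant IDs are made up, and the inputs must be sorted, as they are in the pipeline cell via `sort -u`.

```bash
tmp=$(mktemp -d)
# Toy ID lists standing in for the discordance reports (one variant ID per line, sorted).
printf 'rs1\nrs2\nrs3\nrs4\n' > "${tmp}/discordant_alleles.ids"
printf 'rs2\nrs4\n' > "${tmp}/not_fixed_by_flip.ids"

# comm -23 keeps lines found only in the first file:
# discordant variants that are NOT in the "not fixed by strand flip" list.
flip_ids=$(comm -23 "${tmp}/discordant_alleles.ids" "${tmp}/not_fixed_by_flip.ids")
echo "${flip_ids}"   # prints rs1 and rs3, one per line
rm -rf "${tmp}"
```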

### Remove monomorphic variants
Monomorphic variants prevent MIS from accepting the genotype data. In this case, an arbitrarily small MAF is set that is smaller than the lower bound for these data.

In [None]:
# Apply filters
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        docker run -v "${base_dir}/$study/:/data/" rticode/plink:1.9 plink \
            --bfile /data/${ancestry}_filtered \
            --maf 0.000001 \
            --make-bed \
            --out /data/${ancestry}_filtered_mono
    done
done

wc -l $base_dir/*/*mono.bim
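
Before filtering, it can be useful to see which variants are monomorphic. The following sketch lists them from a PLINK `--freq` report, assuming the usual column order (`CHR SNP A1 A2 MAF NCHROBS`). The toy values are made up.

```bash
tmp=$(mktemp -d)
# Toy PLINK --freq report; the values are invented for illustration.
cat > "${tmp}/toy.frq" <<'EOF'
 CHR  SNP   A1   A2          MAF  NCHROBS
  23  rs1    A    G       0.1200     1880
  23  rs2    0    C            0     1880
  23  rs3    T    C       0.0005     1880
EOF
# Print the IDs of variants whose MAF is exactly zero (rows with MAF "NA" are skipped).
# These fall below the --maf 0.000001 threshold; rs3 (MAF 0.0005) does not.
monomorphic=$(awk 'NR > 1 && $5 != "NA" && $5 + 0 == 0 {print $2}' "${tmp}/toy.frq")
echo "${monomorphic}"   # prints rs2
rm -rf "${tmp}"
```

Because the smallest possible nonzero MAF for N samples is 1/(2N), a threshold of 0.000001 removes only monomorphic variants unless the cohort has on the order of half a million samples or more.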

## SNP Intersection
Only perform if merging datasets.

In [None]:
studies=($study_list)  #studies=(UHS1 UHS2 UHS3_v1-2 UHS3_v1-3) # array of study names

# Make new PLINK binary file sets
for ancestry in ${ancestry_list};do
    for study in ${studies[@]}; do
        docker run -v "${base_dir}/:/data/" rticode/plink:1.9 plink \
            --bfile /data/$study/${ancestry}_filtered_mono \
            --extract /data/intersect/${ancestry}_variant_intersection.txt \
            --make-bed \
            --out /data/intersect/${study}_${ancestry}_filtered_snp_intersection
    done
done
    
wc -l $base_dir/intersect/*section.txt

### Merge test
As a final check to confirm that our data sets are all compatible, a PLINK file set merge is conducted. If any issues persist then an error will be raised. 

Only run this section if merging data.

In [None]:
for ancestry in $ancestry_list; do
    echo "Creating $ancestry merge-list"
    truncate -s 0 ${base_dir}/intersect/${ancestry}_merge_list.txt
    for study in $study_list; do
        echo /data/${study}_${ancestry}_filtered_snp_intersection >>\
        ${base_dir}/intersect/${ancestry}_merge_list.txt
    done
    
# Merge file sets
    echo -e "\n\n======== ${ancestry} ========\n\n"
    docker run -v "${base_dir}/intersect:/data/" rticode/plink:1.9 plink \
        --merge-list /data/${ancestry}_merge_list.txt \
        --make-bed \
        --out /data/${ancestry}_studies_merged
done

wc -l $base_dir/intersect/*merged*bim
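
A possible way to check the merge outcome without reading the PLINK logs by hand is sketched below. It assumes that PLINK 1.9 writes variants it cannot merge (for example, those with three or more alleles) to `<out>-merge.missnp`; the helper name `check_merge` is hypothetical.

```bash
# Hypothetical helper: inspect the outputs of a PLINK merge run.
# Usage: check_merge <out_prefix>
check_merge() {
    local prefix=$1
    # A non-empty .missnp file is assumed to mean the merge hit problem variants
    if [ -s "${prefix}-merge.missnp" ]; then
        echo "FAIL: $(wc -l < "${prefix}-merge.missnp" | tr -d ' ') variants listed in ${prefix}-merge.missnp" >&2
        return 1
    fi
    if [ ! -s "${prefix}.bim" ]; then
        echo "FAIL: ${prefix}.bim was not written" >&2
        return 1
    fi
    echo "OK: $(wc -l < "${prefix}.bim" | tr -d ' ') variants in merged set"
}

# Toy demonstration with a made-up merged .bim file
tmp=$(mktemp -d)
printf '1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tA\tC\n' > "${tmp}/ea_studies_merged.bim"
merge_status=$(check_merge "${tmp}/ea_studies_merged")
echo "${merge_status}"
rm -rf "${tmp}"
```

In the pipeline, this could be called as `check_merge ${base_dir}/intersect/${ancestry}_studies_merged` after each merge.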

## Imputation preparation for Michigan Imputation Server
Visit the [MIS Getting Started Webpage](https://imputationserver.sph.umich.edu/start.html#!pages/help) for more information about preparing the data for upload to MIS.

### Remove individuals missing whole chromosome
Remove any individual who is missing essentially an entire chromosome (genotype missingness above 99%, set with `--mind 0.99`). Then convert the data to VCF format.

#### If NO merging was performed
(i.e. only one study being processed)

In [None]:
# Split by chr and remove any individuals with missing data for whole chr
study=${study_list}  # single-study case: study_list holds exactly one study name
for ancestry in $ancestry_list; do
    for chr in {1..23}; do
        docker run -v "${base_dir}/:/data/" rticode/plink:1.9 plink \
            --bfile /data/$study/${ancestry}_filtered_mono \
            --chr ${chr} \
            --mind 0.99 \
            --make-bed \
            --out /data/impute_prep/${ancestry}_chr${chr}_for_phasing 
    done
done > ${base_dir}/impute_prep/chr_splitting.log 


## look through log files to determine if any subjects were removed
for ancestry in $ancestry_list; do
    grep removed $base_dir/impute_prep/$ancestry*log |
        perl -lne '/(\d+)(\speople)/;
             $mycount += $1; 
             print $mycount if eof'  > $base_dir/impute_prep/$ancestry.removed
    any_removed=$(cat $base_dir/impute_prep/$ancestry.removed)
    if [ "$any_removed" == 0 ]; then
        echo "No $ancestry subjects removed"
    else
        echo "Some $ancestry subjects removed"
    fi
done
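
The Perl one-liner above matches only the word "people". As far as I am aware, PLINK switches to the singular "person" when exactly one sample is removed, so a single removal could go uncounted. The sketch below is an `awk` alternative that accepts both forms; the helper name `count_removed` and the exact log wording in the toy files are assumptions.

```bash
# Hypothetical helper: sum the samples reported as removed by --mind across PLINK log files.
# Accepts both "people" and the singular "person".
# Usage: count_removed <log_file> [<log_file> ...]
count_removed() {
    awk '/(people|person) removed due to missing genotype data/ {total += $1}
         END {print total + 0}' "$@"
}

# Toy log files with the assumed PLINK wording
tmp=$(mktemp -d)
printf '0 people removed due to missing genotype data (--mind).\n' > "${tmp}/ea_chr1_for_phasing.log"
printf '1 person removed due to missing genotype data (--mind).\n' > "${tmp}/ea_chr2_for_phasing.log"
printf '2 people removed due to missing genotype data (--mind).\n' > "${tmp}/ea_chr3_for_phasing.log"
total_removed=$(count_removed "${tmp}"/ea*log)
echo "ea samples removed: ${total_removed}"
rm -rf "${tmp}"
```

The `total + 0` in the `END` block makes the helper print `0` rather than an empty line when no log contains a matching line.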

#### If merging was performed

In [None]:
## Split by chr and remove any individuals with missing data for whole chr
for chr in {1..23}; do 
    for ancestry in $ancestry_list;do
        docker run -v "${base_dir}:/data/" rticode/plink:1.9 plink \
            --bfile /data/intersect/${ancestry}_studies_merged \
            --chr ${chr} \
            --mind 0.99 \
            --make-bed \
            --out /data/impute_prep/${ancestry}_chr${chr}_for_phasing
    done
done > ${base_dir}/impute_prep/chr_splitting.log


## look through log files to determine if any subjects were removed
for ancestry in $ancestry_list; do
    grep removed $base_dir/impute_prep/$ancestry*log |
        perl -lne '/(\d+)(\speople)/;
             $mycount += $1; 
             print $mycount if eof'  > $base_dir/impute_prep/$ancestry.removed
    any_removed=$(cat $base_dir/impute_prep/$ancestry.removed)
    if [ "$any_removed" == 0 ]; then
        echo "No $ancestry subjects removed"
    else
        echo "Some $ancestry subjects removed"
    fi
done

### Convert to VCF

In [None]:
for ancestry in ${ancestry_list}; do
    mkdir -p ${base_dir}/impute_prep/${ancestry}
    for chr in {1..23}; do
        docker run -v "${base_dir}/impute_prep/:/data/" rticode/plink:1.9 plink \
            --bfile /data/${ancestry}_chr${chr}_for_phasing \
            --output-chr M \
            --set-hh-missing \
            --recode vcf bgz \
            --out /data/$ancestry/${ancestry}_chr${chr}_final
    done
done
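
Before transferring files, a quick completeness check can catch a chromosome whose conversion failed. This is a minimal sketch using a hypothetical helper, `missing_vcfs`, that follows the output naming used in the cell above.

```bash
# Hypothetical helper: print the chromosomes whose final VCF is missing or empty.
# Usage: missing_vcfs <impute_prep_dir> <ancestry>
missing_vcfs() {
    local dir=$1 ancestry=$2 chr
    for chr in $(seq 1 23); do
        [ -s "${dir}/${ancestry}/${ancestry}_chr${chr}_final.vcf.gz" ] || echo "${chr}"
    done
}

# Toy demonstration: create placeholder files for chromosomes 1 to 22 only
tmp=$(mktemp -d)
mkdir -p "${tmp}/ea"
for chr in $(seq 1 22); do
    printf '##fileformat=VCFv4.2\n' | gzip -c > "${tmp}/ea/ea_chr${chr}_final.vcf.gz"
done
missing=$(missing_vcfs "${tmp}" ea)
echo "missing chromosomes: ${missing}"   # chromosome 23 is reported
rm -rf "${tmp}"
```

In the pipeline, `missing_vcfs ${base_dir}/impute_prep ${ancestry}` should print nothing when all 23 files are present.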


# Upload to Michigan Imputation Server (MIS)
Transfer the `*.vcf.gz` files to your local machine (one per chromosome) and then upload them to MIS.

## Uploading parameters EA
These are the parameters that were selected on MIS:

__Name__: VIDUS_ea_chr23

__Reference Panel__: 1000G Phase 3 v5

__Input Files__: File Upload <br>

* Select Files - select VCF files that were downloaded to local machine from cloud. <br>

__Phasing__: ShapeIT v2.r790 (unphased) 

__Population__: EUR

__Mode__: Quality Control & Imputation

* I will not attempt to re-identify or contact research participants.
* I will report any inadvertent data release, security breach or other data management incident of which I become aware.

**Input Validation**
```
1 valid VCF file(s) found.

Samples: 940
Chromosomes: X
SNPs: 14705
Chunks: 8
Datatype: unphased
Reference Panel: phase3
Phasing: shapeit
```

**Quality Control**
```
ChrX Statistics: 
Submitting 2 jobs: 
chrX Non.Par male ( as Chr X II ) 
chrX Non.Par female ( as Chr X I ) 
NonPar Sex Check: 
Males: 712
Females: 228
No Sex dedected and therefore filtered: 0
```

# Download Imputed Data from MIS
First, download the data from the Michigan Imputation Server by clicking the link in the email they send to notify you that your job has finished. That page lists the commands for downloading the data.

In [None]:
ancestry=aa
study=wihs3
passW="6pc6BrQVevuMW"
cd /shared/jmarks/hiv/wihs3/genotype/imputed/final/$ancestry


# download.file
####################################################################################################
####################################################################################################

# QC-results
curl -sL https://imputationserver.sph.umich.edu/get/1600201/69680c1f7e70788e97868263a39b117f | bash
# Logs
curl -sL https://imputationserver.sph.umich.edu/get/1600208/4d9447d59c8572d741c6a37b23fb9419 | bash
# SNP Statistics
curl -sL https://imputationserver.sph.umich.edu/get/1600207/af4d6be166bd84f15c33b656fe4c6916 | bash
# Imputation Results
curl -sL https://imputationserver.sph.umich.edu/get/1600204/6e97e687b015a8503a2562ec243e7f03 | bash

####################################################################################################
####################################################################################################

# inflate chr results
for file in *zip; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name unzip.$study.$ancestry.$file \
        --script_prefix unzip.imputed.$study.$ancestry.data \
        --mem 3 \
        --nslots 2 \
        --priority 0 \
        --program unzip -P $passW $file 
done

# Remove the original MIS zip files, but only after the unzip jobs submitted above have finished;
# running this while those jobs are still queued would delete their input.
rm -rf *zip

# upload to s3
aws s3 sync . s3://rti-hiv/wihs3/data/genotype/imputed/$ancestry --quiet &