# UHS1-3 chrX imputation
__Author__: Jesse Marks <br>
**Date:** October 04, 2018

**GitHub Issue:** [Issue #53](https://github.com/RTIInternational/bioinformatics/issues/53)

This document logs the steps taken to perform pre-imputation procedures on the Kreek dataset, both EA and AA. The starting point for this analysis is after quality control of observed genotypes. The quality controlled genotypes are oriented on the GRCh37 plus strand. 

## Software and tools
The software and tools used for porcessing these data are
* [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html) (MIS)
* [Amazon Web Services (AWS) - Cloud Computing Services](https://aws.amazon.com/)
    * Linux AMI
* [PLINK v1.90 beta 4.10](https://www.cog-genomics.org/plink/)
* [bgzip](http://www.htslib.org/doc/tabix.html)
* [BCF Tools](http://www.htslib.org/doc/bcftools.html)
* Windows 10 with [Cygwin](https://cygwin.com/) installed
* GNU bash version 4.2.46

## Data retrieval and organization
PLINK binary filesets will be obtained from AWS S3 storage.

Jesse Marks performed the QC and stored the observed genotype data at:

```
s3/rti-heroin/kreek/data/genotype/original/20181128/
```

###  Statistics Breakdown 
This table includes the initial number of variants in each study as well as the final number of variants in the intersection set. The `Variants Post-Filtering` is in referral to the filtering steps (1) remove discordant alleles & (2) removal of monomorphic variants.

#### EA
| Data Set      | Initial Variants | Variants Post-Filtering  | Intersection     |
|---------------|------------------|--------------------------|------------------|
| Kreek         |                  |                          | NA               |


#### AA
| Data Set      | Initial Variants | Variants Post-Filtering  | Intersection     |
|---------------|------------------|--------------------------|------------------|
| Kreek         |                  |                          | NA               |


## Create Directory Structure & Download Data
The following section needs to be modified each time to reflect where the data is stored!

In [1]:
### EC2 command line (bash) ###

# Modify variables below
######################################################################
base_dir=/shared/data/studies/heroin/kreek/genotype/imputed/20181128
#genD=/shared/data/studies/heroin/kreek/genotype/observed/final

base_name="chr_all" # chr23 or chr_all
chr_list={1..23} # or {1..22} 
ancestry_list="aa ea" # space delimited Ex. "ea aa ha"
study_list="kreek" # space delimited 
######################################################################

mkdir ${base_dir}/processing/{intersect,1000g,impute_prep}
for study in ${study_list};do
    for ancestry in ${ancestry_list};do
        mkdir -p ${base_dir}/processing/${study}
        mkdir -p ${base_dir}/data/${study}/genotype/observed/${ancestry}
    done
done

## download genotype (with AWS CLI tools) to respective directories ##
aws s3 sync s3://rti-heroin/kreek/data/genotype/original/20181128/ \
    ${base_dir}/data/${study}/genotype/observed/
mv ${base_dir}/data/${study}/genotype/observed/ea* ${base_dir}/data/${study}/genotype/observed/ea
mv ${base_dir}/data/${study}/genotype/observed/aa* ${base_dir}/data/${study}/genotype/observed/aa

# Data Processing
## GRCh37 strand and allele discordance check
### MAF for study data

In [None]:
### EC1 command line (Bash) ###

# write out the MAF report
for study in ${study_list}; do
    study_dir=${base_dir}/processing/${study}/strand_check
    mkdir ${study_dir}
    for ancestry in ${ancestry_list};do
        data_dir=${base_dir}/data/${study}/genotype/observed/${ancestry}
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bed ${data_dir}/*bed\
            --bim ${data_dir}/*bim\
            --fam ${data_dir}/*fam\
            --freq \
            --out ${study_dir}/${ancestry}_${base_name}
    done
done

# Get list of variants from all studies
for ancestry in ${ancestry_list}; do
    for study in ${study_list};do
        cat ${base_dir}/data/${study}/genotype/observed/${ancestry}/*bim | \
                perl -lane 'if (($F[0]+0) <= 23) { print $F[1]; }' | \
                sort -u > ${base_dir}/processing/${ancestry}_${base_name}_sorted_variants.txt
    done
done

```
 wc -l ${base_dir}/processing/*
503120 /shared/data/studies/heroin/kreek/genotype/imputed/20181128/processing/aa_chr_all_sorted_variants.txt
  510052 /shared/data/studies/heroin/kreek/genotype/imputed/20181128/processing/ea_chr_all_sorted_variants.txt
  ```

### MAF for 1000G
This pipeline is currently set up to handle EUR and AFR populations.
#### Autosomes
Get 1000G MAF for chromosomes 1–22 (autosomes).

In [None]:
### EC2 command line (Bash)

# Calculate autosome MAFs for 1000G populations
for ancestry in ${ancestry_list};do

    if [ $ancestry == "ea" ]
    then
        pop="EUR"
    else
        pop="AFR"
    fi
    
    for chr in {1..22}; do
        /shared/bioinformatics/software/scripts/qsub_job.sh \
            --job_name ${pop}_${chr}_MAF \
            --script_prefix ${base_dir}/processing/1000g/${pop}_chr${chr}.maf \
            --mem 6.8 \
            --nslots 1 \
            --priority 0 \
            --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
                --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.hap.gz\
                --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.legend.gz \
                --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
                --chr ${chr} \
                --out ${base_dir}/processing/1000g/${pop}_chr${chr}.maf \
                --extract ${base_dir}/processing/${ancestry}_${base_name}_sorted_variants.txt \
                --keep_groups ${pop}
    done
done

#### chrX
Get 1000G MAF for chromosome 23 (chrX).

In [None]:
### Bash ###

chr=23
for ancestry in ${ancestry_list};do

    if [ $ancestry == "ea" ]
    then
        pop="EUR"
    else
        pop="AFR"
    fi

    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${pop}_23_MAF \
        --script_prefix ${base_dir}/processing/1000g/${pop}_chr${chr}.maf \
        --mem 6.8 \
        --nslots 1 \
        --priority 0 \
        --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
            --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.hap.gz\
            --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz \
            --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
            --chr $chr \
            --out ${base_dir}/processing/1000g/${pop}_chr${chr}.maf \
            --extract ${base_dir}/processing/${ancestry}_${base_name}_sorted_variants.txt \
            --keep_groups ${pop}
done

### Merge 1000G chromosomes
Only need to perform this if there were multiple chromosomes for which the MAF was calculated—e.g. more than just chrX.

In [None]:
### Bash ###



# Merge per chr MAFs for each 1000G population
for ancestry in ${ancestry_list};do
    if [ $ancestry == "ea" ]
    then
        pop="EUR"
    else
        pop="AFR"
    fi
    
        head -n 1 ${base_dir}/processing/1000g/${pop}_chr1.maf > ${base_dir}/processing/1000g/${pop}_chr_all.maf
        tail -q -n +2 ${base_dir}/processing/1000g/${pop}_chr??maf  >> ${base_dir}/processing/1000g/${pop}_chr_all.maf
done

```
wc -l processing/1000g/???_chr_all.maf
  275508 processing/1000g/AFR_chr_all.maf
  279371 processing/1000g/EUR_chr_all.maf
```

###  Allele Discordances Check
The allele discordances will be resolved by
* Flipping allele discordances that are fixed by flipping
* Removing SNPs with discordant names
* Removing SNPs with discordant positions
* Removing allele discordances that are not resolved by flipping
* Removing alleles with large deviations from the reference population allele frequencies

Given that the allele discordance check was done using a union set of SNPs across all studies within an ancestry group, some of the SNPs logged as discordant for a given study may not actually be in the study. Fortunately, if they are not in a given study they will not interfere with the filtering procedures. Note that the intersection set is used for the final studies merger.

#### Autosomes

In [None]:
### Bash ###

# Run discordance checks for each ancestry group
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        if [ $ancestry = "ea" ]; then
            pop=EUR
        else
            pop=AFR
        fi

       /shared/bioinformatics/software/scripts/qsub_job.sh \
           --job_name ${ancestry}_${study}_crosscheck \
           --script_prefix ${base_dir}/processing/$study/strand_check/${ancestry}_allele_discordance_check \
           --mem 6 \
           --nslots 4 \
           --priority 0 \
           --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
               --study_bim_file ${base_dir}/data/${study}/genotype/observed/${ancestry}/*bim
               --study_frq_file ${base_dir}/processing/${study}/strand_check/${ancestry}_chr_all.frq
               --ref_maf_file ${base_dir}/processing/1000g/${pop}_chr_all.maf
               --out_prefix ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance"
    done
done

#### chrX 
Not necessary to run unless you are only processing chrX.

In [None]:
### Bash ###

#for study in ${study_list}; do
#    for ancestry in ${ancestry_list};do
#        if [ $ancestry = "ea" ]; then
#            pop=EUR
#        else
#            pop=AFR
#        fi
#
#        # chr23 discordance check
#        /shared/bioinformatics/software/scripts/qsub_job.sh \
#            --job_name ${ancestry}_${study}_crosscheck \
#            --script_prefix ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance_check \
#            --mem 6.8 \
#            --nslots 1 \
#            --priority 0 \
#            --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
#                --study_bim_file ${base_dir}/data/${study}/genotype/observed/${ancestry}/*bim
#                --study_frq_file ${base_dir}/processing/${study}/strand_check/${ancestry}_chr23.frq
#                --ref_maf_file ${base_dir}/processing/1000g/${pop}_chr23.maf
#                --out_prefix ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance"
#    done
#done

### Resolving Allele Discordances

In [None]:
### Bash ###

# Apply filters
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        echo -e "\n===============\nProcessing ${study}_${ancestry}\n"
        echo "Making remove list"
        cat <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2) \
            <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.at_cg_snps_freq_diff_gt_0.2 | tail -n +2) \
            <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_names | tail -n +2) \
            <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_positions | tail -n +2) \
            <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_polymorphic_in_study_not_fixed_by_strand_flip | tail -n +2) | \
              sort -u > ${base_dir}/processing/${study}/strand_check/${ancestry}_snps.remove

        # Create flip list
        echo "Making flip list"
        comm -23 <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles | tail -n +2 | sort -u) \
                 <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2 | sort -u) \
                 > ${base_dir}/processing/${study}/strand_check/${ancestry}_snps.flip

        # Apply filters
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bed     ${base_dir}/data/${study}/genotype/observed/${ancestry}/*bed \
            --bim     ${base_dir}/data/${study}/genotype/observed/${ancestry}/*bim \
            --fam     ${base_dir}/data/${study}/genotype/observed/${ancestry}/*fam \
            --exclude ${base_dir}/processing/${study}/strand_check/${ancestry}_snps.remove \
            --flip    ${base_dir}/processing/${study}/strand_check/${ancestry}_snps.flip \
            --make-bed \
            --out     ${base_dir}/processing/${study}/${ancestry}_filtered
    done
done

 ```
 wc -l processing/*/*bim
     493427 processing/kreek/aa_filtered.bim
     500336 processing/kreek/ea_filtered.bim
  ```

## Remove monomorphic variants
Monomorphic variants prevent MIS from accepting the genotype data. In this case, an arbitrarily small MAF is set that is smaller than the lower bound for these data.

In [None]:
### Bash ###

# Apply filters
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        geno_dir=${base_dir}/processing/${study}

        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bfile ${geno_dir}/${ancestry}_filtered \
            --maf 0.000001 \
            --make-bed \
            --out ${geno_dir}/${ancestry}_filtered_mono
    done
done

```
wc -l processing/*/*mono.bim

  459968 processing/kreek/aa_filtered_mono.bim
  432255 processing/kreek/ea_filtered_mono.bim
  ```

### SNP intersection
Only run if merging multiple data sets.

In [None]:
#studies=($study_list)  #studies=(UHS1 UHS2 UHS3_v1-2 UHS3_v1-3) # array of study names
#num=${#studies[@]}
#
## Get intersection set
#for ancestry in ${ancestry_list};do
#    bim_files=()
#    for (( i=0; i<${num}; i++ ));do
#        bim_files+=(${base_dir}/processing/${studies[$i]}/${ancestry}_filtered_mono.bim)
#    done
#    
#    echo -e "\nCalculating intersection between $ancestry ${study_list}...\n"
#    sort ${bim_files[@]} | uniq -dc | awk -v num=$num '$1 == num {print $3}' \
#        > ${base_dir}/processing/intersect/${ancestry}_variant_intersection.txt
#
#    # Make new PLINK binary file sets
#    for study in ${studies[@]}; do
#        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
#            --noweb \
#            --bfile ${base_dir}/processing/${study}/${ancestry}_filtered_mono \
#            --extract ${base_dir}/processing/intersect/${ancestry}_variant_intersection.txt \
#            --make-bed \
#            --out ${base_dir}/processing/intersect/${study}_${ancestry}_filtered_snp_intersection
#    done
#done

```wc -l *txt

```

### Merge test
If merging multiple datasets together:

As a final check to confirm that our data sets are all compatible, a PLINK file set merge is conducted. If any issues persist then an error will be raised.

In [None]:
#for ancestry in $ancestry_list;do
#
#    echo "Creating $ancestry merge-list"
#    touch ${base_dir}/processing/intersect/${ancestry}_merge_list.txt
#    for study in $study_list;do
#        echo ${base_dir}/processing/intersect/${study}_${ancestry}_filtered_snp_intersection >>\
#             ${base_dir}/processing/intersect/${ancestry}_merge_list.txt
#    done
#
## Merge file sets
#    echo -e "\n\n======== ${ancestry} ========\n\n"
#    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
#        --noweb \
#        --memory 4000 \
#        --merge-list ${base_dir}/processing/intersect/${ancestry}_merge_list.txt \
#        --make-bed \
#        --out ${base_dir}/processing/intersect/${ancestry}_studies_merged
#done

```
 wc -l *merged.bim
```

## Imputation preparation for Michigan Imputation Server
Visit the [MIS Getting Started Webpage](https://imputationserver.sph.umich.edu/start.html#!pages/help) for more information about the preparing the data for upload to MIS.


### VCF File Conversion

In [None]:
### Split by chr and remove any individuals with missing data for whole chr

## if merged data sets together
#for ancestry in $ancestry_list;do
#    # Remove SNPs
#    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
#        --noweb \
#        --memory 4000 \
#        --bfile ${base_dir}/processing/intersect/${ancestry}_studies_merged \
#        --chr ${chr} \
#        --mind 0.99 \
#        --make-bed \
#        --out ${base_dir}/processing/impute_prep/${ancestry}_chr${chr}_for_phasing 
#done > ${base_dir}/processing/impute_prep/chr_splitting.log 

## if NO merging was done
for ancestry in $ancestry_list;do
    for chr in {1..23};do
        # Remove SNPs
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 4000 \
            --bfile ${base_dir}/processing/$study/${ancestry}_filtered_mono \
            --chr ${chr} \
            --mind 0.99 \
            --make-bed \
            --out ${base_dir}/processing/impute_prep/${ancestry}_chr${chr}_for_phasing 
    done
done > ${base_dir}/processing/impute_prep/chr_splitting.log 

__Note__: No subjects were removed.

```
grep remove *log
aa_chr10_for_phasing.log:0 people removed due to missing genotype data (--mind).
aa_chr11_for_phasing.log:0 people removed due to missing genotype data (--mind).
aa_chr12_for_phasing.log:0 people removed due to missing genotype data (--mind).
aa_chr13_for_phasing.log:0 people removed due to missing genotype data (--mind).
aa_chr14_for_phasing.log:0 people removed due to missing genotype data (--mind).
aa_chr15_for_phasing.log:0 people removed due to missing genotype data (--mind).
aa_chr16_for_phasing.log:0 people removed due to missing genotype data (--mind).
aa_chr17_for_phasing.log:0 people removed due to missing genotype data (--mind).

...

ea_chr5_for_phasing.log:0 people removed due to missing genotype data (--mind).
ea_chr6_for_phasing.log:0 people removed due to missing genotype data (--mind).
ea_chr7_for_phasing.log:0 people removed due to missing genotype data (--mind).
ea_chr8_for_phasing.log:0 people removed due to missing genotype data (--mind).
ea_chr9_for_phasing.log:0 people removed due to missing genotype data (--mind).
```

In [None]:
# EC2 command line #

for chr in {1..23};do
    for ancestry in ${ancestry_list};do
        final_dir=${base_dir}/processing/impute_prep/${ancestry}
        mkdir $final_dir
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 5000 \
            --bfile ${final_dir}/../${ancestry}_chr${chr}_for_phasing \
            --output-chr M \
            --set-hh-missing \
            --recode vcf bgz \
            --out ${final_dir}/${ancestry}_chr${chr}_final
    done
done

Transfer the *.vcf.gz files to local machine (per chromosome) and then upload to MIS.

## Upload to Michigan Imputation Server (MIS)

### Uploading parameters
These are the parameters that were selected on MIS.

__Name__: Kreek_EA

__Reference Panel__ 1000G Phase 3 v5

__Input Files__ File Upload <br>

* Select Files - select VCF (.gz) files that were downloaded to local machine from cloud. <br>

__Phasing__: ShapeIT v2.r790 (unphased) 

__Population__: EUR

__Mode__: Quality Control & Imputation

* I will not attempt to re-identify or contact research participants.
* I will report any inadvertent data release, security breach or other data management incident of which I become aware.

# Download Imputed Results from MIS
First, download the data form the Michigan Imputation Server (MIS) by clicking on the link provided in the email they send out to alert you that your data has finished. Here you will find commands for downloading the data.

In [None]:
## EC2 (share drive)
# Note: jmarks is a symlink; you will have to save results
# to the share drive because they are very large)

# create directory structure
mkdir -p /home/ec2-user/jmarks/hiv/uhs123/data/uhs123_merged/genotype/imputed/{ea,aa}/chr23

## EA
The zip files from Michigan Imputation Server (MIS) need to be inflated before you can begin working with them. They require a passcode that is sent by MIS to email.
* EA password: 4AORikqOU0YbSa

Then, inflate imputation results.

In [None]:
cd /shared/data/studies/heroin/kreek/genotype/imputed/20181128/ea

# QC-results
wget https://imputationserver.sph.umich.edu/share/results/9cc8b01ead466c33ea8cf8fd12e16333/qcreport.html

# SNP Statistics
wget https://imputationserver.sph.umich.edu/share/results/8e0b7a96bb3320fc4afa56dd7332d516/statistics.txt

# Logs
wget https://imputationserver.sph.umich.edu/share/results/8c1ba637f2e1521b27f0fe240eeb12d3/chr_1.log
wget https://imputationserver.sph.umich.edu/share/results/67bd10ec79e9dbbeb84d80405fd3728/chr_10.log
wget https://imputationserver.sph.umich.edu/share/results/5542bcdf75eb9122ce07b75ef96ba4a9/chr_11.log
wget https://imputationserver.sph.umich.edu/share/results/4704e2a72c9d3c6edaedd400bbc0ee2/chr_12.log
wget https://imputationserver.sph.umich.edu/share/results/94b04eae0ac254ad185c6f19628e13a2/chr_13.log
wget https://imputationserver.sph.umich.edu/share/results/2d16285fe8f25e1d0d76d9d996d26d18/chr_14.log
wget https://imputationserver.sph.umich.edu/share/results/c102f136ffe1446d0b090ea279c110d3/chr_15.log
wget https://imputationserver.sph.umich.edu/share/results/df4e1056ad63c59f7ef0bc6f6ed602dd/chr_16.log
wget https://imputationserver.sph.umich.edu/share/results/75dee0fa17320ec65d05b63499ae1e3/chr_19.log
wget https://imputationserver.sph.umich.edu/share/results/d13bf856a5420490362c30546bdb80f0/chr_2.log
wget https://imputationserver.sph.umich.edu/share/results/2b113eb19c5c23fc117755743079febc/chr_20.log
wget https://imputationserver.sph.umich.edu/share/results/8c98bc11cd04b290c39e4ac0a20976cb/chr_21.log
wget https://imputationserver.sph.umich.edu/share/results/ea645b55cbfc1cde0236487c2dd88179/chr_22.log
wget https://imputationserver.sph.umich.edu/share/results/259a30ff5ed50f0fe4f48d878aacece9/chr_3.log
wget https://imputationserver.sph.umich.edu/share/results/c9498a231c5f032ea70d4780df6ab086/chr_4.log
wget https://imputationserver.sph.umich.edu/share/results/5101322509b5a377b01cf99339b6a7e1/chr_5.log
wget https://imputationserver.sph.umich.edu/share/results/f959b858a1ccb3dbf4fa71a07de4ef96/chr_6.log
wget https://imputationserver.sph.umich.edu/share/results/d04bc39e29c1ae059926647532338c60/chr_7.log
wget https://imputationserver.sph.umich.edu/share/results/433cee3a0d5a290f6a4a2f587631945d/chr_8.log
wget https://imputationserver.sph.umich.edu/share/results/a301102331a7849556b259bc96c7b2e7/chr_9.log
wget https://imputationserver.sph.umich.edu/share/results/654836dc1a734fc6a793b2a4169ab402/chr_X.no.auto_female.log
wget https://imputationserver.sph.umich.edu/share/results/7ddf6b8be448134055a552acf84ae8ce/chr_X.no.auto_male.log

# Imputation Results
wget https://imputationserver.sph.umich.edu/share/results/642ac4790aae16d13e4e6479ba798ec7/chr_1.zip
wget https://imputationserver.sph.umich.edu/share/results/314be9cb9bb639f8b4ab915635868448/chr_10.zip
wget https://imputationserver.sph.umich.edu/share/results/963d85cb5d2bcf0448d8fca5fe3409a/chr_11.zip
wget https://imputationserver.sph.umich.edu/share/results/3e7eaff881a87b1b557ee6b5572fdb8f/chr_12.zip
wget https://imputationserver.sph.umich.edu/share/results/6ba1a35689dea53978bd3712b5450819/chr_13.zip
wget https://imputationserver.sph.umich.edu/share/results/f505ae864b738cd3ebe5d21013df4a4b/chr_14.zip
wget https://imputationserver.sph.umich.edu/share/results/438ad1e6d0df969885f64e2449f0f07d/chr_15.zip
wget https://imputationserver.sph.umich.edu/share/results/b49fdd1244e74e0a04bb22a0a959d616/chr_16.zip
wget https://imputationserver.sph.umich.edu/share/results/3dfa1aff5e95300b24b1e8dff4ce9ea6/chr_17.zip
wget https://imputationserver.sph.umich.edu/share/results/72eb662e0a9f026f13ae8f4b7be37b84/chr_18.zip
wget https://imputationserver.sph.umich.edu/share/results/7d0129cf2991c85ef71e5529c92c04e/chr_19.zip
wget https://imputationserver.sph.umich.edu/share/results/1df4d5629c523a2e5f440aa95c57d5c3/chr_2.zip
wget https://imputationserver.sph.umich.edu/share/results/60ca7431abb995e2d2d437f51cad4999/chr_20.zip
wget https://imputationserver.sph.umich.edu/share/results/a33c27e28598d7b9a7c0cc3d6238a2ea/chr_21.zip
wget https://imputationserver.sph.umich.edu/share/results/85984570259e3014f0ada2b56fa8a2be/chr_22.zip
wget https://imputationserver.sph.umich.edu/share/results/f1ecab60d735d4e0ad67340d9b90e679/chr_3.zip
wget https://imputationserver.sph.umich.edu/share/results/34dc6e8bd5164c83d051e5fa562d4736/chr_4.zip
wget https://imputationserver.sph.umich.edu/share/results/f8bb244760aa049a30cf1d852f82125b/chr_5.zip
wget https://imputationserver.sph.umich.edu/share/results/adbdfb17d87aab08d6186e4d39bd5470/chr_6.zip
wget https://imputationserver.sph.umich.edu/share/results/add3b937b8588b3086b6a139806a6f50/chr_7.zip
wget https://imputationserver.sph.umich.edu/share/results/d142ceb2f49899406a7a585095658c65/chr_8.zip
wget https://imputationserver.sph.umich.edu/share/results/b9ae0694ad2f509ba3b73e041a992a81/chr_9.zip
wget https://imputationserver.sph.umich.edu/share/results/450fe80feee07f7bb07c15a27d636235/chr_X.no.auto_female.zip
wget https://imputationserver.sph.umich.edu/share/results/67eb4dd04188a9f2a336d3eb320b4ce3/chr_X.no.auto_male.zip

/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name ea_impute_down \
    --script_prefix ea_impute_download \
    --mem 6.8 \
    --nslots 1 \
    --priority 0 \
    --program bash download_data_ea 

# inflate chr results
for file in *zip;do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ea_unzip \
        --script_prefix ea_unzip \
        --mem 6.8 \
        --nslots 1 \
        --priority 0 \
        --program unzip -P "4AORikqOU0YbSa" $file
done


# we can remove the original imputed data from MIS after we inflate the zip files
rm -rf *zip

## AA
The zip files from Michigan Imputation Server (MIS) need to be inflated before you can begin working with them. They require a passcode that is sent by MIS to email.
* AA password: p2bKgqC6S3LKw

Then, inflate imputation results.

In [None]:
cd /shared/data/studies/heroin/kreek/genotype/imputed/20181128/aa

# QC-results
wget https://imputationserver.sph.umich.edu/share/results/76b2af4e51ddb17106dbd7201a64856/qcreport.html

# SNP Statistics
wget https://imputationserver.sph.umich.edu/share/results/b0e5f1ba32612be885f018712a3f35fa/statistics.txt

# Logs
wget https://imputationserver.sph.umich.edu/share/results/520107bc671f90addb78992d088a4bd6/chr_1.log
wget https://imputationserver.sph.umich.edu/share/results/c3a24b8358c709a9810d0cd3a8cd43ce/chr_10.log
wget https://imputationserver.sph.umich.edu/share/results/612a65dae7a1151fda96c112b2dcfb28/chr_11.log
wget https://imputationserver.sph.umich.edu/share/results/390b9b56041aad8c435cecd4c6d90d26/chr_12.log
wget https://imputationserver.sph.umich.edu/share/results/4d3a0f700a8fc69a96810f4da3b274ad/chr_13.log
wget https://imputationserver.sph.umich.edu/share/results/9a30ac79a5e26b8288d870a63c5e4344/chr_14.log
wget https://imputationserver.sph.umich.edu/share/results/78195815df31262452692f9ea3a6e6f8/chr_15.log
wget https://imputationserver.sph.umich.edu/share/results/b99b2a2becabc0d1b267538a792eba6/chr_16.log
wget https://imputationserver.sph.umich.edu/share/results/e35251c2dc4afe966b6b11025690848d/chr_17.log
wget https://imputationserver.sph.umich.edu/share/results/fb37b20d3bfbf9a139881ddaafef2aa2/chr_18.log
wget https://imputationserver.sph.umich.edu/share/results/59d2034f8e3232221b584e1dc5cebd01/chr_19.log
wget https://imputationserver.sph.umich.edu/share/results/8bc99cead6115da117ab05bbbcf32fd/chr_2.log
wget https://imputationserver.sph.umich.edu/share/results/dec04142eb7d5690a5d9e0e257ccbbaf/chr_20.log
wget https://imputationserver.sph.umich.edu/share/results/3b8e900f825166e2c66ea5fc73e5a3ba/chr_21.log
wget https://imputationserver.sph.umich.edu/share/results/e1541ec29f8d94a01a13ddf885b73c83/chr_22.log
wget https://imputationserver.sph.umich.edu/share/results/d307186c6bed8ae73b3563a0ec129cf5/chr_3.log
wget https://imputationserver.sph.umich.edu/share/results/b1552ffa9f80ba3430407d2b0f5d3933/chr_4.log
wget https://imputationserver.sph.umich.edu/share/results/a07372ed1ce98ea08acd0e660cf4fb17/chr_5.log
wget https://imputationserver.sph.umich.edu/share/results/1e555a387bd774cc1277da2c5f0288f5/chr_6.log
wget https://imputationserver.sph.umich.edu/share/results/9e9e662f32d60769652ff3f17ab9005/chr_7.log
wget https://imputationserver.sph.umich.edu/share/results/cfea49ba8184d362be609a1b1cd6cf0c/chr_8.log
wget https://imputationserver.sph.umich.edu/share/results/fc291287c0ac8fcbe1cecacc8f23e5ce/chr_9.log
wget https://imputationserver.sph.umich.edu/share/results/c5e12ba854df7d07e4630e0b257aa5f7/chr_X.no.auto_female.log
wget https://imputationserver.sph.umich.edu/share/results/ba29e7637d07657831a7efebcad2d2ad/chr_X.no.auto_male.log

# Imputation Results
wget https://imputationserver.sph.umich.edu/share/results/6e85686d8498e08c5692b2b31cc42512/chr_1.zip
wget https://imputationserver.sph.umich.edu/share/results/1a4bd74d9af9d36315230958b89594b9/chr_10.zip
wget https://imputationserver.sph.umich.edu/share/results/17a0d344ee027f8dd2ab4241caf9fb84/chr_11.zip
wget https://imputationserver.sph.umich.edu/share/results/df7831076d553445538c43422ddaf926/chr_12.zip
wget https://imputationserver.sph.umich.edu/share/results/fcf53436fa8f5a971e85157afcccf404/chr_13.zip
wget https://imputationserver.sph.umich.edu/share/results/8dfa37b6567eb0ec701310711b0761fa/chr_14.zip
wget https://imputationserver.sph.umich.edu/share/results/31d2f83b9df44eeaede4fc671c473b4c/chr_15.zip
wget https://imputationserver.sph.umich.edu/share/results/a30b20102062fe6ecc6c661ffb604c00/chr_16.zip
wget https://imputationserver.sph.umich.edu/share/results/1b5b42ac3697c55927f641cbe1e2abb5/chr_17.zip
wget https://imputationserver.sph.umich.edu/share/results/9063a97c412b7a680a8a7adce693428b/chr_18.zip
wget https://imputationserver.sph.umich.edu/share/results/40ab72afda94c7a3c73fc5c05d92a99c/chr_19.zip
wget https://imputationserver.sph.umich.edu/share/results/53602f7f9b9128f4523d53f6f3c50b83/chr_2.zip
wget https://imputationserver.sph.umich.edu/share/results/48dd96405843563460916d48c39ae386/chr_20.zip
wget https://imputationserver.sph.umich.edu/share/results/bf323cfad8f6798e0b288c8d0229acf6/chr_21.zip
wget https://imputationserver.sph.umich.edu/share/results/f662aa87dca49b0c01430932e4149941/chr_22.zip
wget https://imputationserver.sph.umich.edu/share/results/7a34c9ffbfa4b51f3dea11a44648932f/chr_3.zip
wget https://imputationserver.sph.umich.edu/share/results/25b6828ad16a1fbe8d07a836067ac6f6/chr_4.zip
wget https://imputationserver.sph.umich.edu/share/results/b760bc250048a4360ff58bef52f44ae5/chr_5.zip
wget https://imputationserver.sph.umich.edu/share/results/4bba6830c9493dad608f42657cdefa44/chr_6.zip
wget https://imputationserver.sph.umich.edu/share/results/1d0b49c2dba5a08334b7b963a7c75d5e/chr_7.zip
wget https://imputationserver.sph.umich.edu/share/results/e68d004b9f8282a71829204587aeed75/chr_8.zip
wget https://imputationserver.sph.umich.edu/share/results/bf7be419ce57652d4fb7541792a55669/chr_9.zip
wget https://imputationserver.sph.umich.edu/share/results/3d296dc57dadbd74996a5b127c92438/chr_X.no.auto_female.zip
wget https://imputationserver.sph.umich.edu/share/results/2dd8373287ba869adacd15b46a8cdd94/chr_X.no.auto_male.zip

/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name aa_impute_down \
    --script_prefix aa_impute_download \
    --mem 6.8 \
    --nslots 1 \
    --priority 0 \
    --program bash download_data_aa 

# inflate chr results
for file in *zip;do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name aa_unzip \
        --script_prefix aa_unzip \
        --mem 6.8 \
        --nslots 1 \
        --priority 0 \
        --program unzip -P "p2bKgqC6S3LKw" $file
done
# inflate chr results
/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name aa_unzip \
    --script_prefix aa_unzip \
    --mem 6.8 \
    --nslots 1 \
    --priority 0 \
    --program unzip -P "p2bKgqC6S3LKw" *zip 
# we can remove the original imputed data from MIS after we inflate the zip files
rm -rf *zip

# Upload to S3

In [None]:
cd /shared/data/studies/heroin/kreek/genotype/imputed/20181128

for ancestry in {aa,ea};do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${ancestry}_upload \
        --script_prefix ${ancestry}_upload \
        --mem 6.8 \
        --nslots 1 \
        --priority 0 \
        --program aws s3 sync ${ancestry} s3://rti-heroin/kreek/data/genotype/imputed/20181128/${ancestry}/
done