# UHS1-4 chrX imputation
__Author__: Jesse Marks

**Date:** October 04, 2018

**GitHub Issue:** [Issue #112](https://github.com/RTIInternational/bioinformatics/issues/112)

This document logs the steps taken to perform pre-imputation procedures on the merged datasets of UHS1, UHS2, UHS3, and UHS4 (both EA and AA). The starting point for this analysis is after quality control of observed genotypes. The quality-controlled genotypes are oriented on the GRCh37 plus strand. Based on findings from [Johnson et al.](https://link.springer.com/article/10.1007/s00439-013-1266-7), the intersection set of variants across studies will be used for imputation.

## Software and tools
The software and tools used for processing these data are
* [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html) (MIS)
* [Amazon Web Services (AWS) - Cloud Computing Services](https://aws.amazon.com/)
    * Linux AMI
* [PLINK v1.90 beta 4.10](https://www.cog-genomics.org/plink/)
* [bgzip](http://www.htslib.org/doc/tabix.html)
* [BCF Tools](http://www.htslib.org/doc/bcftools.html)
* Windows 10 with [Cygwin](https://cygwin.com/) installed
* GNU bash version 4.2.46

## Data retrieval and organization
PLINK binary filesets will be obtained from AWS S3 storage.

Nathan Gaddis performed the QC and stored the observed genotype data at:

```
UHS1
s3://rti-midas-data/studies/hiv/observed/final/uhs1.aa.chr23
s3://rti-midas-data/studies/hiv/observed/final/uhs1.ea.chr23

UHS2
s3://rti-midas-data/studies/hiv/observed/final/uhs2.aa.chr23
s3://rti-midas-data/studies/hiv/observed/final/uhs2.ea.chr23

UHS3
s3://rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-2.chr23
s3://rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-2.chr23
s3://rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-3.chr23
s3://rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-3.chr23
s3://rti-midas-data/studies/hiv/observed/final/uhs3.aa.merged.chr23
s3://rti-midas-data/studies/hiv/observed/final/uhs3.ea.merged.chr23

UHS4
s3://rti-midas-data/studies/uhs4/observed/genotypes/final/uhs4/merged2.ea
s3://rti-midas-data/studies/uhs4/observed/genotypes/final/uhs4/merged2.aa
```


**Note**: For UHS3 I will use the V1-2 and V1-3 datasets and treat them as two different cohorts. Therefore, when I merge UHS1-4 I will be merging five data sets per ancestry:

* uhs1.aa, uhs2.aa, uhs3.aa.V1-2, uhs3.aa.V1-3, & uhs4.merged2.aa
* uhs1.ea, uhs2.ea, uhs3.ea.V1-2, uhs3.ea.V1-3, & uhs4.merged2.ea

**Note2**: The STRUCTURE analysis was performed on the combined UHS1-4 data. This resulted in a new ancestry partitioning. In particular, one UHS1 sample and one UHS4 sample were reassigned to a different ancestry group.

**Note3:** The UHS1 chrX sample counts do not match the autosome sample counts. For example:
```
  2015 uhs1.aa.chr23.fam
  2016 uhs1.aa.fam
```
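
To track down which sample accounts for a count difference like the one above, the IDs in the two `.fam` files can be compared directly. The following is a minimal sketch, not part of the original workflow; the function name `fam_only_in_first` and the toy file contents are hypothetical.

```bash
# Print "FID IID" pairs present in the first .fam file but absent from the second.
# Columns 1 and 2 of a PLINK .fam file are the family ID and the individual ID.
fam_only_in_first() {
    awk 'FILENAME == ARGV[1] { seen[$1 " " $2] = 1; next }
         !(($1 " " $2) in seen) { print $1, $2 }' "$2" "$1"
}

# Toy demonstration with two made-up .fam files
tmp=$(mktemp -d)
printf 'F1 S1 0 0 1 -9\nF2 S2 0 0 2 -9\nF3 S3 0 0 1 -9\n' > "$tmp/autosome.fam"
printf 'F1 S1 0 0 1 -9\nF3 S3 0 0 1 -9\n' > "$tmp/chr23.fam"
fam_only_in_first "$tmp/autosome.fam" "$tmp/chr23.fam"
```

On the real data this would be called as `fam_only_in_first uhs1.aa.fam uhs1.aa.chr23.fam`.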

### chrX Statistics Breakdown 
This table lists the initial number of variants in each study as well as the final number of variants in the intersection set. `Variants Post-Filtering` refers to the counts after two filtering steps: (1) removal of discordant alleles and (2) removal of monomorphic variants.

#### EA
| Data Set      | Initial Variants | Variants Post-Filtering  |
|---------------|------------------|--------------------------|
| UHS1          | 17,408           | 15,702                   |
| UHS2          | 32,857           | 24,486                   |
| UHS3_V1-2     | 24,202           | 19,719                   |
| UHS3_V1-3     | 35,440           | 26,391                   |
| UHS4          | 35,469           | 26,645                   |
| Intersection  | NA               | 3,251                    |


#### AA
| Data Set      | Initial Variants | Variants Post-Filtering  |
|---------------|------------------|--------------------------|
| UHS1          | 17,396           | 16,515                   |
| UHS2          | 23,484           | 21,223                   |
| UHS3_V1-2     | 41,884           | 37,074                   |
| UHS3_V1-3     | 44,706           | 39,773                   |
| UHS4          | 29,488           | 27,966                   |
| Intersection  | NA               | 4,581                    |


## Create Directory Structure & Download Data
The following section needs to be modified each time to reflect where the data is stored!

In [5]:
### EC2 command line (bash) ###

# Modify variables below
######################################################################
base_dir=/shared/jmarks/hiv/uhs1-4/chrX_processing
base_name="chr23" # chr23 or chr_all
#chr_list={1..23} # or {1..22} 
ancestry_list="aa ea" # space delimited Ex. "ea aa ha"
study_list="uhs1 uhs2 uhs3_v1-2 uhs3_v1-3 uhs4" # space delimited 
######################################################################

mkdir -p ${base_dir}/processing/{intersect,1000g,impute_prep}
for study in ${study_list};do
    for ancestry in ${ancestry_list};do
        mkdir -p ${base_dir}/processing/${study}
        mkdir -p ${base_dir}/data/${study}/genotype/observed/${ancestry}
    done
done

## download genotype (with AWS CLI tools) to respective directories ##

#aws s3 cp s3://rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-3.chr23.fam .
#aws s3 cp s3://rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-3.chr23.bed .
#aws s3 cp s3://rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-3.chr23.bim .
#...
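
The per-file `aws s3 cp` lines above can be generated in a loop over the three PLINK extensions. This is a hedged sketch rather than the commands that were actually run: the helper name `print_s3_downloads` is hypothetical, and it only prints the commands so they can be reviewed first.

```bash
# Print (not run) the aws s3 cp commands for one PLINK binary fileset.
# Pipe the output to bash, or drop the "echo", to perform the download.
print_s3_downloads() {
    s3_prefix=$1   # S3 path without the .bed/.bim/.fam extension
    dest_dir=$2    # local destination directory
    for ext in bed bim fam; do
        echo aws s3 cp "${s3_prefix}.${ext}" "${dest_dir}/"
    done
}

print_s3_downloads s3://rti-midas-data/studies/hiv/observed/final/uhs2.ea.chr23 \
    "${base_dir:-.}/data/uhs2/genotype/observed/ea"
```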

# Data Processing
## Data Wrangling
The UHS4 chrX genotype data have not been separated from the autosomes like the other sub-groups have. Therefore, we will do this now.

In [38]:
cd /shared/jmarks/hiv/uhs1-4/chrX_processing/uhs4
## break out chrX data for UHS4
for ancestry in ea aa; do
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile uhs4.merged2.$ancestry \
        --chr 23 \
        --make-bed \
        --out uhs4.merged2.$ancestry.chr23
done

## uhs1: merge ancestries
study=uhs1            
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile ${base_dir}/$study/uhs1.aa.chr23 \
    --bmerge ${base_dir}/$study/uhs1.ea.chr23 \
    --make-bed \
    --out $base_dir/$study/$study.aa+ea.chr23.1KG.structure

## uhs4: merge ancestries
study=uhs4
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile ${base_dir}/$study/uhs4.merged2.aa.chr23 \
    --bmerge ${base_dir}/$study/uhs4.merged2.ea.chr23 \
    --make-bed \
    --out $base_dir/$study/$study.aa+ea.chr23.1KG.structure

PLINK v1.90b4.9 64-bit (13 Oct 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to uhs4.merged2.ea.chr23.log.
Options in effect:
  --bfile uhs4.merged2.ea
  --chr 23
  --make-bed
  --memory 2048
  --noweb
  --out uhs4.merged2.ea.chr23

Note: --noweb has no effect since no web check is implemented yet.
1957 MB RAM detected; reserving 2048 MB for main workspace.
Allocated 1536 MB successfully, after larger attempt(s) failed.
35469 out of 2073618 variants loaded from .bim file.
989 people (659 males, 330 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 989 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.974134.


In [39]:
## partition UHS1 and UHS4 into respective ancestry groups ##
plots=/shared/jmarks/hiv/uhs1-4/autosome_processing/structure/triangle_plots/thresholding
anlist="ea aa"
for study in uhs1 uhs4; do
    for an in $anlist; do
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --bfile $base_dir/$study/$study.aa+ea.chr23.1KG.structure \
            --keep $plots/$an.subject_ids.keep.txt \
            --make-bed \
            --out $base_dir/data/$study/genotype/observed/$an/$study.$an.chr23.1KG.structure
    done
done

PLINK v1.90b4.9 64-bit (13 Oct 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/ea/uhs1.ea.chr23.1KG.structure.log.
Options in effect:
  --bfile /shared/jmarks/hiv/uhs1-4/chrX_processing/uhs1/uhs1.aa+ea.chr23.1KG.structure
  --keep /shared/jmarks/hiv/uhs1-4/autosome_processing/structure/triangle_plots/thresholding/ea.subject_ids.keep.txt
  --make-bed
  --out /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/ea/uhs1.ea.chr23.1KG.structure

1957 MB RAM detected; reserving 978 MB for main workspace.
17413 variants loaded from .bim file.
3155 people (2374 males, 781 females) loaded from .fam.
--keep: 1128 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1128 founders and 0 nonfounders present.
Calculating allele frequencies...

In [40]:
wc -l $plots/ea.subject_ids.keep.txt
wc -l $plots/aa.subject_ids.keep.txt

3023 /shared/jmarks/hiv/uhs1-4/autosome_processing/structure/triangle_plots/thresholding/ea.subject_ids.keep.txt
4026 /shared/jmarks/hiv/uhs1-4/autosome_processing/structure/triangle_plots/thresholding/aa.subject_ids.keep.txt


In [41]:
wc -l /shared/jmarks/hiv/uhs1-4/autosome_processing/data/*/*/*/aa/*fam
echo ""
wc -l /shared/jmarks/hiv/uhs1-4/autosome_processing/data/*/*/*/ea/*fam

  2009 /shared/jmarks/hiv/uhs1-4/autosome_processing/data/uhs1/genotype/observed/aa/uhs1.aa.1KG.structure.fam
   767 /shared/jmarks/hiv/uhs1-4/autosome_processing/data/uhs2/genotype/observed/aa/uhs2.aa.1KG.structure.fam
    84 /shared/jmarks/hiv/uhs1-4/autosome_processing/data/uhs3_v1-2/genotype/observed/aa/uhs3_v1-2.aa.1KG.structure.fam
    94 /shared/jmarks/hiv/uhs1-4/autosome_processing/data/uhs3_v1-3/genotype/observed/aa/uhs3_v1-3.aa.1KG.structure.fam
  1072 /shared/jmarks/hiv/uhs1-4/autosome_processing/data/uhs4/genotype/observed/aa/uhs4.aa.1KG.structure.fam
  4026 total

  1130 /shared/jmarks/hiv/uhs1-4/autosome_processing/data/uhs1/genotype/observed/ea/uhs1.ea.1KG.structure.fam
   828 /shared/jmarks/hiv/uhs1-4/autosome_processing/data/uhs2/genotype/observed/ea/uhs2.ea.1KG.structure.fam
    33 /shared/jmarks/hiv/uhs1-4/autosome_processing/data/uhs3_v1-2/genotype/observed/ea/uhs3_v1-2.ea.1KG.structure.fam
    44 /shared/jmarks/hiv/uhs1-4/autosome_processing/data/uhs3_v1-3/genotype

In [42]:
echo ""
wc -l /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs*/genotype/observed/aa/*fam
echo ""
wc -l /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs*/genotype/observed/ea/*fam


  2008 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/aa/uhs1.aa.chr23.1KG.structure.fam
   767 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs2/genotype/observed/aa/uhs2.aa.chr23.fam
    84 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs3_v1-2/genotype/observed/aa/uhs3.aa.V1-2.chr23.fam
    94 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs3_v1-3/genotype/observed/aa/uhs3.aa.V1-3.chr23.fam
  1072 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs4/genotype/observed/aa/uhs4.aa.chr23.1KG.structure.fam
  4025 total

  1128 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/ea/uhs1.ea.chr23.1KG.structure.fam
   828 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs2/genotype/observed/ea/uhs2.ea.chr23.fam
    33 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs3_v1-2/genotype/observed/ea/uhs3.ea.V1-2.chr23.fam
    44 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs3_v1-3/genotype/observed/ea/uhs3.ea.V1-3.chr23.fam
   988 /shared/jmarks

In [43]:
wc -l /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs*/genotype/observed/ea/*bim
echo ""
wc -l /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs*/genotype/observed/aa/*bim

  17413 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/ea/uhs1.ea.chr23.1KG.structure.bim
  32857 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs2/genotype/observed/ea/uhs2.ea.chr23.bim
  24202 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs3_v1-2/genotype/observed/ea/uhs3.ea.V1-2.chr23.bim
  35440 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs3_v1-3/genotype/observed/ea/uhs3.ea.V1-3.chr23.bim
  36003 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs4/genotype/observed/ea/uhs4.ea.chr23.1KG.structure.bim
 145915 total

  17413 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/aa/uhs1.aa.chr23.1KG.structure.bim
  23484 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs2/genotype/observed/aa/uhs2.aa.chr23.bim
  41884 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs3_v1-2/genotype/observed/aa/uhs3.aa.V1-2.chr23.bim
  44706 /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs3_v1-3/genotype/observed/aa/uhs3.aa.V1-3.chr23.bim
  36003 /sha

## GRCh37 strand and allele discordance check
### MAF for study data

In [45]:
### EC2 command line (Bash) ###

# write out the MAF report
for study in ${study_list}; do
    study_dir=${base_dir}/processing/${study}/strand_check
    mkdir ${study_dir}
    for ancestry in ${ancestry_list};do
        data_dir=${base_dir}/data/${study}/genotype/observed/${ancestry}
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bed ${data_dir}/*bed\
            --bim ${data_dir}/*bim\
            --fam ${data_dir}/*fam\
            --freq \
            --out ${study_dir}/${ancestry}_${base_name}
    done
done

# Get list of variants from all studies
studies=($study_list)  #studies=(uhs1 uhs2 uhs3_v1-2 uhs3_v1-3 uhs4) # array of study names
num=${#studies[@]}

## Get intersection set
for ancestry in ${ancestry_list};do
    bim_files=()
    for (( i=0; i<${num}; i++ ));do
        bim_files+=(${base_dir}/data/${studies[$i]}/genotype/observed/$ancestry/*bim)
    done
    
    echo -e "\nCalculating intersection between $ancestry ${study_list}...\n"
    cat ${bim_files[@]}| cut -f2 | sort |  uniq -c | awk -v num=$num '$1 == num {print $2}' \
        > ${base_dir}/processing/intersect/${ancestry}_variant_intersection.txt
    wc -l ${base_dir}/processing/intersect/${ancestry}_variant_intersection.txt
done 

mkdir: cannot create directory ‘/shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/strand_check’: File exists
PLINK v1.90b4.9 64-bit (13 Oct 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/strand_check/aa_chr23.log.
Options in effect:
  --bed /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/aa/uhs1.aa.chr23.1KG.structure.bed
  --bim /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/aa/uhs1.aa.chr23.1KG.structure.bim
  --fam /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/aa/uhs1.aa.chr23.1KG.structure.fam
  --freq
  --memory 2048
  --noweb
  --out /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/strand_check/aa_chr23

Note: --noweb has no effect since no web check is implemented yet.
1957 MB RAM detected; reserving 2048 MB for main workspace.
Allocated 1536 MB su

### MAF for 1000G
This pipeline is currently set up to handle EUR and AFR populations.
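
The cells in this section map every ancestry other than `ea` to `AFR`, which would silently mislabel any additional group (for example `ha`). As a hedged alternative, the mapping could be made explicit so that unsupported groups fail loudly. The function name `ancestry_to_pop` is hypothetical and is not used elsewhere in this notebook.

```bash
# Map a study ancestry label to the matching 1000G super-population.
# Unknown labels produce an error instead of defaulting to AFR.
ancestry_to_pop() {
    case "$1" in
        ea) echo "EUR" ;;
        aa) echo "AFR" ;;
        *)  echo "unsupported ancestry: $1" >&2; return 1 ;;
    esac
}

for ancestry in aa ea; do
    pop=$(ancestry_to_pop "$ancestry") || exit 1
    echo "$ancestry -> $pop"
done
```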
#### Autosomes
Get 1000G MAF for chromosomes 1–22 (autosomes).

In [None]:
#### EC2 command line (Bash)
#
## Calculate autosome MAFs for 1000G populations
#for ancestry in ${ancestry_list};do
#
#    if [ $ancestry == "ea" ]
#    then
#        pop="EUR"
#    else
#        pop="AFR"
#    fi
#    
#    for chr in {1..22}; do
#        /shared/bioinformatics/software/scripts/qsub_job.sh \
#            --job_name ${pop}_${chr}_MAF \
#            --script_prefix ${base_dir}/processing/1000g/${pop}_chr${chr}.maf \
#            --mem 6.8 \
#            --nslots 1 \
#            --priority 0 \
#            --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
#                --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.hap.gz\
#                --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.legend.gz \
#                --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
#                --chr ${chr} \
#                --out ${base_dir}/1000g/${pop}_chr${chr}.maf \
#                --extract ${base_dir}/processing/intersect/${ancestry}_variant_intersection.txt \
#                --keep_groups ${pop}
#    done
#done

#### chrX
Get 1000G MAF for chromosome 23 (chrX).

In [48]:
### Bash ###

chr=23
for ancestry in ${ancestry_list};do

    if [ $ancestry == "ea" ]
    then
        pop="EUR"
    else
        pop="AFR"
    fi

    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${pop}_23_MAF \
        --script_prefix ${base_dir}/processing/1000g/${pop}_chr${chr}.maf \
        --mem 6.8 \
        --nslots 1 \
        --priority 0 \
        --program /shared/bioinformatics/software/perl/stats/calculate_maf_from_impute2_hap_file.pl \
            --hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.hap.gz\
            --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz \
            --sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
            --chr $chr \
            --out ${base_dir}/processing/1000g/${pop}_chr${chr}.maf \
            --extract ${base_dir}/processing/intersect/${ancestry}_variant_intersection.txt \
            --keep_groups ${pop}
done

Your job 8231 ("AFR_23_MAF") has been submitted
Your job 8232 ("EUR_23_MAF") has been submitted


### Merge 1000G chromosomes
This step is only needed if the MAF was calculated for multiple chromosomes, e.g. more than just chrX.
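
For reference, merging per-chromosome MAF files amounts to keeping the header of the first file and appending the data rows of every file. The sketch below assumes that all files share the same one-line header; the function name `concat_with_single_header` and the toy file contents are hypothetical.

```bash
# Concatenate tab-delimited files that share a one-line header,
# writing the header once followed by the data rows of every input.
concat_with_single_header() {
    out=$1
    shift
    head -n 1 "$1" > "$out"
    for f in "$@"; do
        tail -n +2 "$f" >> "$out"
    done
}

# Toy demonstration with two made-up per-chromosome files
tmp=$(mktemp -d)
printf 'SNP\tMAF\nrs1\t0.10\n' > "$tmp/EUR_chr1.maf"
printf 'SNP\tMAF\nrs2\t0.25\n' > "$tmp/EUR_chr2.maf"
concat_with_single_header "$tmp/EUR_chr_all.maf" "$tmp/EUR_chr1.maf" "$tmp/EUR_chr2.maf"
cat "$tmp/EUR_chr_all.maf"
```

Passing the input files as explicit arguments avoids storing a brace pattern such as `{1..23}` in a variable, which bash does not expand when the variable is later substituted.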

In [None]:
#### Bash ###
##chr_list={1..23}
#
## Merge per chr MAFs for each 1000G population
#for ancestry in ${ancestry_list};do
#    if [ $ancestry == "ea" ]
#    then
#        pop="EUR"
#    else
#        pop="AFR"
#    fi
#    
#        head -n 1 ${base_dir}/1000g/${pop}_chr1.maf > 1000g/${pop}_chr_all.maf
#        tail -q -n +2 1000g/${pop}_${chr_list}.maf \ # chr_list defined in beginning
#            >> 1000g/${pop}_chr_all.maf
#done

### Allele Discordances Check
The allele discordances will be resolved by
* Flipping allele discordances that are fixed by flipping
* Removing SNPs with discordant names
* Removing SNPs with discordant positions
* Removing allele discordances that are not resolved by flipping
* Removing SNPs with large deviations from the reference population allele frequencies

Given that the allele discordance check was done using a union set of SNPs across all studies within an ancestry group, some of the SNPs logged as discordant for a given study may not actually be in the study. Fortunately, if they are not in a given study they will not interfere with the filtering procedures. Note that the intersection set is used for the final studies merger.
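
To illustrate what "fixed by flipping" means, the sketch below classifies a study allele pair against a reference allele pair. It is a simplified illustration and not the logic of `check_study_data_against_1000G.R`; the function names are hypothetical. Note that A/T and C/G SNPs look identical on both strands, so this check alone would report them as a match, which is presumably why a separate frequency-based check exists for those SNPs.

```bash
# Complement a string of nucleotides (A<->T, C<->G).
complement() { printf '%s' "$1" | tr 'ACGT' 'TGCA'; }

# True when the unordered pair ($1,$2) equals the unordered pair ($3,$4).
same_pair() { [ "$1$2" = "$3$4" ] || [ "$1$2" = "$4$3" ]; }

# Classify study alleles ($1,$2) against reference alleles ($3,$4).
# Prints one of: match, flip, discordant
classify_alleles() {
    if same_pair "$1" "$2" "$3" "$4"; then
        echo match
    elif same_pair "$(complement "$1")" "$(complement "$2")" "$3" "$4"; then
        echo flip
    else
        echo discordant
    fi
}

classify_alleles A G A G   # alleles already agree
classify_alleles T C A G   # agree after a strand flip
classify_alleles A C A G   # not resolved by flipping
```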

#### Autosomes

In [None]:
#### Bash ###
#
## Run discordance checks for each ancestry group
#for study in ${study_list}; do
#    for ancestry in ${ancestry_list};do
#        if [ $ancestry = "ea" ]; then
#            pop=EUR
#        else
#            pop=AFR
#        fi
#
#       /shared/bioinformatics/software/scripts/qsub_job.sh \
#           --job_name ${ancestry}_${study}_crosscheck \
#           --script_prefix ${base_dir}/strand_check/${ancestry}_allele_discordance_check \
#           --mem 6 \
#           --nslots 4 \
#           --priority 0 \
#           --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
#               --study_bim_file ${base_dir}/${study}/genotype/observed/${ancestry}/*bim
#               --study_frq_file ${base_dir}/${study}/strand_check/${ancestry}_chr_all.frq
#               --ref_maf_file ${base_dir}/1000g/${pop}_chr_all.maf
#               --out_prefix ${base_dir}/${study}/strand_check/${ancestry}_allele_discordance"
#    done
#done

#### chrX 

In [50]:
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        if [ $ancestry = "ea" ]; then
            pop=EUR
        else
            pop=AFR
        fi

        # chr23 discordance check
        /shared/bioinformatics/software/scripts/qsub_job.sh \
            --job_name ${ancestry}_${study}_crosscheck \
            --script_prefix ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance_check \
            --mem 6.8 \
            --nslots 1 \
            --priority 0 \
            --program "Rscript /shared/bioinformatics/software/R/check_study_data_against_1000G.R
                --study_bim_file ${base_dir}/data/${study}/genotype/observed/${ancestry}/*bim
                --study_frq_file ${base_dir}/processing/${study}/strand_check/${ancestry}_chr23.frq
                --ref_maf_file ${base_dir}/processing/1000g/${pop}_chr23.maf
                --out_prefix ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance"
    done
done

Your job 8233 ("aa_uhs1_crosscheck") has been submitted
Your job 8234 ("ea_uhs1_crosscheck") has been submitted
Your job 8235 ("aa_uhs2_crosscheck") has been submitted
Your job 8236 ("ea_uhs2_crosscheck") has been submitted
Your job 8237 ("aa_uhs3_v1-2_crosscheck") has been submitted
Your job 8238 ("ea_uhs3_v1-2_crosscheck") has been submitted
Your job 8239 ("aa_uhs3_v1-3_crosscheck") has been submitted
Your job 8240 ("ea_uhs3_v1-3_crosscheck") has been submitted
Your job 8241 ("aa_uhs4_crosscheck") has been submitted
Your job 8242 ("ea_uhs4_crosscheck") has been submitted


### Resolving Allele Discordances

In [52]:
### Bash ###

# Apply filters
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        echo -e "\n===============\nProcessing ${study}_${ancestry}\n"
        echo "Making remove list"
        cat <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2) \
            <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.at_cg_snps_freq_diff_gt_0.2 | tail -n +2) \
            <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_names | tail -n +2) \
            <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_positions | tail -n +2) \
            <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_polymorphic_in_study_not_fixed_by_strand_flip | tail -n +2) | \
              sort -u > ${base_dir}/processing/${study}/strand_check/${ancestry}_snps.remove

        # Create flip list
        echo "Making flip list"
        comm -23 <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles | tail -n +2 | sort -u) \
                 <(cut -f2,2 ${base_dir}/processing/${study}/strand_check/${ancestry}_allele_discordance.discordant_alleles_not_fixed_by_strand_flip | tail -n +2 | sort -u) \
                 > ${base_dir}/processing/${study}/strand_check/${ancestry}_snps.flip

        # Apply filters
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bed     ${base_dir}/data/${study}/genotype/observed/${ancestry}/*bed \
            --bim     ${base_dir}/data/${study}/genotype/observed/${ancestry}/*bim \
            --fam     ${base_dir}/data/${study}/genotype/observed/${ancestry}/*fam \
            --exclude ${base_dir}/processing/${study}/strand_check/${ancestry}_snps.remove \
            --flip    ${base_dir}/processing/${study}/strand_check/${ancestry}_snps.flip \
            --make-bed \
            --out     ${base_dir}/processing/${study}/${ancestry}_filtered
    done
done


Processing uhs1_aa

Making remove list
Making flip list
PLINK v1.90b4.9 64-bit (13 Oct 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/aa_filtered.log.
Options in effect:
  --bed /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/aa/uhs1.aa.chr23.1KG.structure.bed
  --bim /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/aa/uhs1.aa.chr23.1KG.structure.bim
  --exclude /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/strand_check/aa_snps.remove
  --fam /shared/jmarks/hiv/uhs1-4/chrX_processing/data/uhs1/genotype/observed/aa/uhs1.aa.chr23.1KG.structure.fam
  --flip /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/strand_check/aa_snps.flip
  --make-bed
  --memory 2048
  --noweb
  --out /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/aa_filtered

Note: --noweb has no effect sinc

In [53]:
wc -l $base_dir/processing/*/*filtered.bim

   17399 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/aa_filtered.bim
   17410 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/ea_filtered.bim
   23484 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs2/aa_filtered.bim
   32857 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs2/ea_filtered.bim
   41847 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs3_v1-2/aa_filtered.bim
   24197 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs3_v1-2/ea_filtered.bim
   44668 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs3_v1-3/aa_filtered.bim
   35436 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs3_v1-3/ea_filtered.bim
   36003 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs4/aa_filtered.bim
   36003 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs4/ea_filtered.bim
  309304 total


## Remove monomorphic variants
Monomorphic variants prevent MIS from accepting the genotype data. To remove them, the MAF threshold is set to an arbitrarily small value that is below the smallest nonzero MAF possible for these sample sizes, so that only monomorphic variants are dropped.
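
As a quick sanity check on the threshold, the smallest nonzero MAF is one minor allele copy out of all observed allele copies. With N samples there are at most 2N copies (fewer on chrX, where males contribute one copy), so 1/(2N) is a lower bound. The sketch below is illustrative only; both function names are hypothetical.

```bash
# Smallest nonzero MAF given the number of observed allele copies.
min_nonzero_maf() {
    awk -v n="$1" 'BEGIN { printf "%.6g\n", 1 / n }'
}

# Succeeds when the threshold is below the smallest nonzero MAF,
# i.e. when a MAF filter at that threshold can only remove monomorphic variants.
threshold_only_removes_monomorphic() {
    awk -v t="$1" -v n="$2" 'BEGIN { exit !(t < 1 / n) }'
}

# 4025 AA samples gives at most 2 * 4025 = 8050 allele copies
min_nonzero_maf 8050
threshold_only_removes_monomorphic 0.000001 8050 && echo "threshold is safe"
```

For the largest group here the bound is roughly 0.00012, well above the 0.000001 threshold.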

In [54]:
### Bash ###

# Apply filters
for study in ${study_list}; do
    for ancestry in ${ancestry_list};do
        geno_dir=${base_dir}/processing/${study}

        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 2048 \
            --bfile ${geno_dir}/${ancestry}_filtered \
            --maf 0.000001 \
            --make-bed \
            --out ${geno_dir}/${ancestry}_filtered_mono
    done
done

PLINK v1.90b4.9 64-bit (13 Oct 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/aa_filtered_mono.log.
Options in effect:
  --bfile /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/aa_filtered
  --maf 0.000001
  --make-bed
  --memory 2048
  --noweb
  --out /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/aa_filtered_mono

Note: --noweb has no effect since no web check is implemented yet.
1957 MB RAM detected; reserving 2048 MB for main workspace.
Allocated 1536 MB successfully, after larger attempt(s) failed.
17399 variants loaded from .bim file.
2008 people (1423 males, 585 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2008 founders and 0 nonfounders present.
Calculating allele frequencies...

In [55]:
wc -l $base_dir/processing/*/*mono.bim

   16515 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/aa_filtered_mono.bim
   15702 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/ea_filtered_mono.bim
   21223 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs2/aa_filtered_mono.bim
   24486 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs2/ea_filtered_mono.bim
   37074 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs3_v1-2/aa_filtered_mono.bim
   19719 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs3_v1-2/ea_filtered_mono.bim
   39773 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs3_v1-3/aa_filtered_mono.bim
   26391 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs3_v1-3/ea_filtered_mono.bim
   27966 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs4/aa_filtered_mono.bim
   26645 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs4/ea_filtered_mono.bim
  255494 total


### SNP intersection

In [56]:
studies=($study_list)  #studies=(uhs1 uhs2 uhs3_v1-2 uhs3_v1-3 uhs4) # array of study names
num=${#studies[@]}

# Get intersection set
for ancestry in ${ancestry_list};do
    bim_files=()
    for (( i=0; i<${num}; i++ ));do
        bim_files+=(${base_dir}/processing/${studies[$i]}/${ancestry}_filtered_mono.bim)
    done
    
    echo -e "\nCalculating intersection between $ancestry ${study_list}...\n"
    cat ${bim_files[@]} | cut -f2 | sort | uniq -c | awk -v num=$num '$1 == num {print $2}' \
        > ${base_dir}/processing/intersect/${ancestry}_variant_intersection.txt

    # Make new PLINK binary file sets
    for study in ${studies[@]}; do
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --bfile ${base_dir}/processing/${study}/${ancestry}_filtered_mono \
            --extract ${base_dir}/processing/intersect/${ancestry}_variant_intersection.txt \
            --make-bed \
            --out ${base_dir}/processing/intersect/${study}_${ancestry}_filtered_snp_intersection
    done
done


Calculating intersection between aa uhs1 uhs2 uhs3_v1-2 uhs3_v1-3 uhs4...

PLINK v1.90b4.9 64-bit (13 Oct 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/intersect/uhs1_aa_filtered_snp_intersection.log.
Options in effect:
  --bfile /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/uhs1/aa_filtered_mono
  --extract /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/intersect/aa_variant_intersection.txt
  --make-bed
  --noweb
  --out /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/intersect/uhs1_aa_filtered_snp_intersection

Note: --noweb has no effect since no web check is implemented yet.
1957 MB RAM detected; reserving 978 MB for main workspace.
16515 variants loaded from .bim file.
2008 people (1423 males, 585 females) loaded from .fam.
--extract: 4581 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Befor

In [59]:
wc -l $base_dir/processing/intersect/*section.txt

  4581 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/intersect/aa_variant_intersection.txt
  3251 /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/intersect/ea_variant_intersection.txt
  7832 total
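
The intersection logic used above (`sort | uniq -c`, then keep IDs whose count equals the number of studies) can be tried on toy data. The sketch below is a hedged variant rather than the exact pipeline: it de-duplicates IDs within each file first, an added safeguard so that a repeated ID in one study cannot be mistaken for presence in several. The function name and toy `.bim` contents are hypothetical.

```bash
# Print variant IDs (column 2 of a .bim file) that occur in every file given.
variant_intersection() {
    n=$#
    for bim in "$@"; do
        cut -f2 "$bim" | sort -u      # one occurrence per ID per file
    done | sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }'
}

# Toy demonstration with three made-up .bim files
tmp=$(mktemp -d)
printf '23\trs1\t0\t100\tA\tG\n23\trs2\t0\t200\tC\tT\n' > "$tmp/a.bim"
printf '23\trs2\t0\t200\tC\tT\n23\trs3\t0\t300\tA\tC\n' > "$tmp/b.bim"
printf '23\trs2\t0\t200\tC\tT\n23\trs1\t0\t100\tA\tG\n' > "$tmp/c.bim"
variant_intersection "$tmp/a.bim" "$tmp/b.bim" "$tmp/c.bim"
```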


### Merge test
As a final check to confirm that our data sets are all compatible, a PLINK file set merge is conducted. If any issues persist then an error will be raised.

In [60]:
chr=23 # comment out if not running chrX exclusively
for ancestry in $ancestry_list;do

    echo "Creating $ancestry merge-list"
    : > ${base_dir}/processing/intersect/${ancestry}_merge_list.txt  # start empty so reruns do not append duplicate entries
    for study in $study_list;do
        echo ${base_dir}/processing/intersect/${study}_${ancestry}_filtered_snp_intersection >>\
             ${base_dir}/processing/intersect/${ancestry}_merge_list.txt
    done
    
    if [ "$chr" == 23 ]; then  # spaces around == are required; "$chr==23" is a single non-empty string and always true
        out_file=${ancestry}_studies_merged_chrx
    else
        out_file=${ancestry}_studies_merged
    fi

# Merge file sets
    echo -e "\n\n======== ${ancestry} ========\n\n"
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 4000 \
        --merge-list ${base_dir}/processing/intersect/${ancestry}_merge_list.txt \
        --make-bed \
        --out ${base_dir}/processing/intersect/$out_file
done

Creating aa merge-list




PLINK v1.90b4.9 64-bit (13 Oct 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/intersect/aa_studies_merged.log.
Options in effect:
  --make-bed
  --memory 4000
  --merge-list /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/intersect/aa_merge_list.txt
  --noweb
  --out /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/intersect/aa_studies_merged

Note: --noweb has no effect since no web check is implemented yet.
1957 MB RAM detected; reserving 4000 MB for main workspace.
Allocated 1265 MB successfully, after larger attempt(s) failed.
Performing single-pass merge (4025 people, 4581 variants).
Merged fileset written to
/shared/jmarks/hiv/uhs1-4/chrX_processing/processing/intersect/aa_studies_merged-merge.bed
+
/shared/jmarks/hiv/uhs1-4/chrX_processing/processing/intersect/aa_studies_merged-merge.bim
+
/shared/jmar

In [None]:
wc -l $base_dir/processing/intersect/*merged*bim
  4581 /shared/jmarks/hiv/uhs1234/chrX_processing//processing/intersect/aa_studies_merged_chrx.bim
  3251 /shared/jmarks/hiv/uhs1234/chrX_processing//processing/intersect/ea_studies_merged_chrx.bim
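
The comparison between the intersection lists and the merged `.bim` files can also be scripted instead of read off the `wc -l` output. The following is a minimal sketch, not part of the original workflow: `count_lines` and `check_merge_counts` are hypothetical helper names, and the demo runs on small temporary files rather than the real data.

```bash
# Hypothetical helpers (illustration only).

# Print the number of lines in a file.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Succeed only if the merged .bim has one line per variant in the intersection list.
check_merge_counts() {
    local variant_list=$1 merged_bim=$2
    local expected observed
    expected=$(count_lines "$variant_list")
    observed=$(count_lines "$merged_bim")
    if [ "$expected" -eq "$observed" ]; then
        echo "OK: $observed variants"
    else
        echo "MISMATCH: expected $expected, found $observed" >&2
        return 1
    fi
}

# Demo with toy files standing in for the intersection list and merged .bim
demo_dir=$(mktemp -d)
printf 'rs1\nrs2\nrs3\n' > "$demo_dir/variant_intersection.txt"
printf '23\trs1\t0\t100\tA\tG\n23\trs2\t0\t200\tC\tT\n23\trs3\t0\t300\tA\tC\n' > "$demo_dir/merged.bim"
check_merge_counts "$demo_dir/variant_intersection.txt" "$demo_dir/merged.bim"
```

For the real data, the two arguments would be, for example, `aa_variant_intersection.txt` and `aa_studies_merged_chrx.bim` in the `intersect` directory.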

## Imputation preparation for Michigan Imputation Server
Visit the [MIS Getting Started Webpage](https://imputationserver.sph.umich.edu/start.html#!pages/help) for more information about preparing the data for upload to MIS.


### VCF File Conversion

In [64]:
### Split by chr and remove any individuals with missing data for whole chr

chr=23
for ancestry in $ancestry_list;do


    if [ "$chr" -eq 23 ]; then
        myfile=${ancestry}_studies_merged_chrx
    else
        myfile=${ancestry}_studies_merged
    fi
    
    # Keep only this chromosome and drop individuals missing > 99% of its genotypes (--mind 0.99)
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 4000 \
        --bfile ${base_dir}/processing/intersect/$myfile \
        --chr ${chr} \
        --mind 0.99 \
        --make-bed \
        --out ${base_dir}/processing/impute_prep/${ancestry}_chr${chr}_for_phasing 
done > ${base_dir}/processing/impute_prep/chr_splitting.log 


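### Alternative for all chromosomes: split by chr and remove any individuals with missing data for whole chr

# Use this loop instead if the merged file sets contain all chromosomes
#for ancestry in $ancestry_list;do
#    for chr in {1..23};do
#        # Keep only this chromosome and drop individuals missing > 99% of its genotypes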
#        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
#            --noweb \
#            --memory 4000 \
#            --bfile ${base_dir}/processing/intersect/${ancestry}_studies_merged \
#            --chr ${chr} \
#            --mind 0.99 \
#            --make-bed \
#            --out ${base_dir}/processing/impute_prep/${ancestry}_chr${chr}_for_phasing 
#    done > ${base_dir}/processing/impute_prep/chr_splitting.log 
#done

# Tally the subjects that PLINK reported as removed, per ancestry
for ancestry in $ancestry_list; do
    grep removed $base_dir/processing/impute_prep/$ancestry*log |
        perl -lne '/(\d+)(\speople)/;
             $mycount += $1; 
             print $mycount if eof'  > $base_dir/processing/impute_prep/$ancestry.removed
    any_removed=$(cat $base_dir/processing/impute_prep/$ancestry.removed)
    if [ "$any_removed" == 0 ]; then
        echo "No $ancestry subjects removed"
    else
        echo "Some $ancestry subjects removed"
    fi
done

No aa subjects removed
No ea subjects removed
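
The tally above matches the word `people` in the PLINK log, so it may miss a line that uses the singular `person`. As a hedged alternative, the sketch below sums the leading count on any `--mind` removal line. The helper name `count_removed` and the exact log wording are assumptions (the wording is based on typical PLINK 1.9 output), and the demo uses toy log files.

```bash
# Hypothetical helper: total subjects removed by --mind across one or more PLINK logs.
# Assumes log lines of the form
#   "<N> people removed due to missing genotype data (--mind)."
count_removed() {
    awk '/removed due to missing genotype data/ { total += $1 }
         END { print total + 0 }' "$@"
}

# Demo with toy logs
log_dir=$(mktemp -d)
echo "0 people removed due to missing genotype data (--mind)." > "$log_dir/aa_toy.log"
echo "1 person removed due to missing genotype data (--mind)." > "$log_dir/ea_toy.log"

count_removed "$log_dir/aa_toy.log"                        # only the first toy log
count_removed "$log_dir/aa_toy.log" "$log_dir/ea_toy.log"  # both toy logs
```

Logs with no matching line contribute nothing, so a result of `0` means no removals were reported in the files that were passed in.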


In [65]:
# EC2 command line #
#for chr in ${chr_list};do
chr=23
    for ancestry in ${ancestry_list};do
        final_dir=${base_dir}/processing/impute_prep/${ancestry}
        mkdir $final_dir
        /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 5000 \
            --bfile ${final_dir}/../${ancestry}_chr${chr}_for_phasing \
            --output-chr M \
            --set-hh-missing \
            --recode vcf bgz \
            --out ${final_dir}/${ancestry}_chr${chr}_final
    done
#done

mkdir: cannot create directory ‘/shared/jmarks/hiv/uhs1-4/chrX_processing/processing/impute_prep/aa’: File exists
PLINK v1.90b4.9 64-bit (13 Oct 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/impute_prep/aa/aa_chr23_final.log.
Options in effect:
  --bfile /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/impute_prep/aa/../aa_chr23_for_phasing
  --memory 5000
  --noweb
  --out /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/impute_prep/aa/aa_chr23_final
  --output-chr M
  --recode vcf bgz
  --set-hh-missing

Note: --noweb has no effect since no web check is implemented yet.
1957 MB RAM detected; reserving 5000 MB for main workspace.
Allocated 1185 MB successfully, after larger attempt(s) failed.
4581 variants loaded from .bim file.
4025 people (2828 males, 1197 females) loaded from .fam.
Using up to 2 threads (change this with --threads)

Transfer the *.vcf.gz files to local machine (per chromosome) and then upload to MIS.
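
Before uploading, it can be worth confirming that the exported VCF codes the chromosome the way `--output-chr M` is expected to write it (chromosome 23 as `X`). The sketch below lists the distinct values in the `CHROM` column. `vcf_chrom_codes` is a hypothetical helper name, and the demo builds a toy gzip-compressed VCF, relying on bgzip output being readable with `gzip -dc`.

```bash
# Hypothetical helper: list the distinct chromosome codes in a compressed VCF.
vcf_chrom_codes() {
    gzip -dc "$1" | grep -v '^#' | cut -f1 | sort -u
}

# Demo with a toy VCF (two variants, one sample)
toy_dir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n'
    printf 'X\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/1\n'
    printf 'X\t200\trs2\tC\tT\t.\t.\t.\tGT\t0/0\n'
} | gzip -c > "$toy_dir/toy.vcf.gz"

vcf_chrom_codes "$toy_dir/toy.vcf.gz"
```

On the real files, any code other than a single `X` would suggest that the export options need another look.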

## Upload to Michigan Imputation Server (MIS)

### Uploading parameters
These are the parameters that were selected on MIS.

__Name__: UHS1-3_merged_chrX_EA_v02

__Reference Panel__: 1000G Phase 3 v5

__Input Files__: File Upload <br>

* Select Files - select the VCF (.gz) files that were downloaded from the cloud to the local machine. <br>

__Phasing__: ShapeIT v2.r790 (unphased) 

__Population__: EUR

__Mode__: Quality Control & Imputation

* I will not attempt to re-identify or contact research participants.
* I will report any inadvertent data release, security breach or other data management incident of which I become aware.

# Troubleshooting

```
Chromosome X check failed! 
java.io.IOException: Found haplotype 0/1 at pos 2788707 for male proband 8002221030_HHG6599_AS93-4798_8002221030_HHG6599_AS93-4798
Found haplotype 0/1 at pos 2825403 for male proband 8002221030_HHG6599_AS93-4798_8002221030_HHG6599_AS93-4798
Found haplotype 0/1 at pos 2862721 for male proband 8002221030_HHG6599_AS93-4798_8002221030_HHG6599_AS93-4798
Found haplotype 0/1 at pos 2985672 for male proband 8002221030_HHG6599_AS93-4798_8002221030_HHG6599_AS93-4798
Found haplotype 0/1 at pos 3002687 for male proband 8002221030_HHG6599_AS93-4798_8002221030_HHG6599_AS93-4798
Found haplotype 0/1 at pos 3028289 for male proband 8002221030_HHG6599_AS93-4798_8002221030_HHG6599_AS93-4798
Found haplotype 0/1 at pos 3296294 for male proband 8002221030_HHG6599_AS93-4798_8002221030_HHG6599_AS93-4798
Found haplotype 0/1 at pos 3542110 for male proband 8002221030_HHG6599_AS93-4798_8002221030_HHG6599_AS93-4798
Found haplotype 0/1 at pos 3627414 for male proband 8002221030_HHG6599
Error during manifest file creation.
```

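**MIS reports heterozygous (0/1) calls at several chrX positions for a male proband, which fails its chromosome X check. We are going to remove the subject named in the MIS error message above from the study.**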
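
When the error message names more than one subject, the affected IDs can be pulled out of a saved copy of the message. This is a minimal sketch under stated assumptions: `flagged_male_probands` is a hypothetical helper, the message is assumed to have been saved to a text file, and the toy IDs below are made up. The real ID in the message above appears to be the family ID and individual ID joined by an underscore.

```bash
# Hypothetical helper: count flagged positions per male proband in a saved MIS error message.
# Prints "<count><TAB><sample ID>", most frequently flagged first.
flagged_male_probands() {
    awk '/for male proband/ { n[$NF]++ }
         END { for (id in n) print n[id] "\t" id }' "$1" |
        sort -k1,1nr -k2,2
}

# Demo with a toy error message
err_file=$(mktemp)
cat > "$err_file" <<'EOF'
Chromosome X check failed!
java.io.IOException: Found haplotype 0/1 at pos 100 for male proband FAM1_IND1
Found haplotype 0/1 at pos 200 for male proband FAM1_IND1
Found haplotype 0/1 at pos 300 for male proband FAM2_IND2
Error during manifest file creation.
EOF

flagged_male_probands "$err_file"
```

Because MIS may truncate the message, the counts are a lower bound, and a line cut off mid-ID would show up as a separate, shortened ID.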

In [None]:
remove_dir=$base_dir/processing/impute_prep/aa/remove_subject
mkdir $remove_dir
ancestry=aa
chr=23
echo 8002221030_HHG6599_AS93-4798 8002221030_HHG6599_AS93-4798 > ${remove_dir}/remove_list.tsv

/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 5000 \
        --bfile ${remove_dir}/../../${ancestry}_chr${chr}_for_phasing \
        --output-chr M \
        --set-hh-missing \
        --remove-fam $remove_dir/remove_list.tsv \
        --recode vcf bgz \
        --out ${remove_dir}/${ancestry}_chr${chr}_final_v02

mkdir: cannot create directory ‘/shared/jmarks/hiv/uhs1-4/chrX_processing/processing/impute_prep/aa/remove_subject’: File exists
PLINK v1.90b4.9 64-bit (13 Oct 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/impute_prep/aa/remove_subject/aa_chr23_final_v02.log.
Options in effect:
  --bfile /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/impute_prep/aa/remove_subject/../../aa_chr23_for_phasing
  --memory 5000
  --noweb
  --out /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/impute_prep/aa/remove_subject/aa_chr23_final_v02
  --output-chr M
  --recode vcf bgz
  --remove-fam /shared/jmarks/hiv/uhs1-4/chrX_processing/processing/impute_prep/aa/remove_subject/remove_list.tsv
  --set-hh-missing

Note: --noweb has no effect since no web check is implemented yet.
1957 MB RAM detected; reserving 5000 MB for main workspace.
Allocated 1185 MB succe
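
To confirm that the subject is gone before re-uploading, the sample columns of the new VCF header can be inspected. The sketch below is illustrative: `vcf_sample_count` and `vcf_has_sample` are hypothetical helpers, and the demo uses a toy VCF with made-up sample IDs.

```bash
# Hypothetical helper: number of sample columns in a compressed VCF
# (columns after the nine fixed VCF fields on the #CHROM line).
vcf_sample_count() {
    gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { print NF - 9; exit }'
}

# Hypothetical helper: succeed if sample ID $1 is present in the header of VCF $2.
vcf_has_sample() {
    gzip -dc "$2" | awk -F'\t' -v id="$1" '
        /^#CHROM/ { for (i = 10; i <= NF; i++) if ($i == id) found = 1; exit }
        END { exit !found }'
}

# Demo with a toy VCF holding two samples
check_dir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tFAM1_IND1\tFAM2_IND2\n'
    printf 'X\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/0\t0/1\n'
} | gzip -c > "$check_dir/toy.vcf.gz"

vcf_sample_count "$check_dir/toy.vcf.gz"
vcf_has_sample FAM3_IND3 "$check_dir/toy.vcf.gz" || echo "FAM3_IND3 not present"
```

On `aa_chr23_final_v02.vcf.gz`, the count should be lower than the 4025 people loaded earlier, and the removed ID should no longer be found.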

# Download Imputed Results from MIS
First, download the data from the Michigan Imputation Server (MIS) by clicking on the link in the email MIS sends when your job has finished. That page lists the commands for downloading the data.

In [None]:
## EC2 (share drive)
# Note: jmarks is a symlink. Save the results to the share drive
# because they are very large.

# create directory structure
mkdir -p /home/ec2-user/jmarks/hiv/uhs123/data/uhs123_merged/genotype/imputed/{ea,aa}/chr23

## ChrX
### EA
The zip files from Michigan Imputation Server (MIS) need to be inflated before you can begin working with them. They are protected by a password that MIS sends by email.
* EA password: z@j1CRuP:Diw3Y

Then, inflate imputation results.

In [None]:
ancestry=ea
study=uhs1-4
passW="'z@j1CRuP:Diw3Y'"
cd /shared/jmarks/hiv/uhs1234/gwas/genotype/imputed/$ancestry


# Contents of download.file (the wget commands between the banner lines; run below with `bash download.file`)
####################################################################################################
####################################################################################################
# QC-results
wget https://imputationserver.sph.umich.edu/share/results/3e003fabde51f68742c3df72b19720dc/qcreport.html

# Imputation Results
wget https://imputationserver.sph.umich.edu/share/results/ce8b8d3a6e056b00d429efb863350952/chr_X.no.auto_female.zip
wget https://imputationserver.sph.umich.edu/share/results/20d74f774326075473cf154d4a619622/chr_X.no.auto_male.zip

# Logs
wget https://imputationserver.sph.umich.edu/share/results/d5d5d852884d1b437eec9665d880641/chr_X.no.auto_female.log
wget https://imputationserver.sph.umich.edu/share/results/1afd40c6ed931bf4530a50e2d41eb764/chr_X.no.auto_male.log


# SNP Statistics
wget https://imputationserver.sph.umich.edu/share/results/d7270cc7271a914540a9a11aed68baf3/statistics.txt

####################################################################################################
####################################################################################################

/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name MIS.download.$study.$ancestry.chrx \
    --script_prefix imputed.data.download.$ancestry.chrx \
    --mem 3 \
    --nslots 1 \
    --priority 0 \
    --program bash download.file

# inflate chr results
for file in *zip; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name unzip.$study.$ancestry.chrx \
        --script_prefix unzip.imputed.$study.$ancestry.chrx.data \
        --mem 3 \
        --nslots 2 \
        --priority 0 \
        --program unzip -P $passW $file 
done

# Once the unzip jobs have finished, the original zip files from MIS can be removed
rm -rf *zip
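
Because the unzip jobs are submitted to the scheduler, the `rm` above should only be run after they have completed. When inflating interactively instead, deletion can be tied to a successful extraction. The sketch below is one way to do that and is not the original workflow: `inflate_then_remove` is a hypothetical helper, and the demo substitutes `true` and `false` for the real `unzip` call.

```bash
# Hypothetical helper: run an extraction command on an archive and delete the
# archive only if the command succeeded. The archive path is appended as the
# last argument of the command.
inflate_then_remove() {
    local archive=$1
    shift
    if "$@" "$archive"; then
        rm -f -- "$archive"
    else
        echo "Extraction failed; keeping $archive" >&2
        return 1
    fi
}

# Demo with stand-in commands ("true" always succeeds, "false" always fails)
work_dir=$(mktemp -d)
touch "$work_dir/ok.zip" "$work_dir/bad.zip"
inflate_then_remove "$work_dir/ok.zip" true
inflate_then_remove "$work_dir/bad.zip" false || echo "bad.zip kept"
ls "$work_dir"
```

With the real data, the call would look like `inflate_then_remove chr_X.no.auto_male.zip unzip -P "$password"`, where `$password` holds the password without the extra layer of quotes used for the job submission.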

### AA
The zip files from Michigan Imputation Server (MIS) need to be inflated before you can begin working with them. They are protected by a password that MIS sends by email.
* AA password: LFkee+tMo31WV:

Then, inflate imputation results.

In [None]:
ancestry=aa
study=uhs1-4
passW="'LFkee+tMo31WV:'"
cd /shared/jmarks/hiv/uhs1234/gwas/genotype/imputed/$ancestry


# Contents of download.file (the wget commands between the banner lines; run below with `bash download.file`)
####################################################################################################
####################################################################################################
# QC-results
wget https://imputationserver.sph.umich.edu/share/results/dae3adc9d1820e3e7b6108ed82c8851c/qcreport.html

# Imputation Results
wget https://imputationserver.sph.umich.edu/share/results/fef581304eaa1ce12cfaaf6d570a5dfb/chr_X.no.auto_female.zip
wget https://imputationserver.sph.umich.edu/share/results/851f851b7dbe42aac869cfc6c4b465c9/chr_X.no.auto_male.zip

# Logs
wget https://imputationserver.sph.umich.edu/share/results/f65c749682821647a9cd40597588807a/chr_X.no.auto_female.log
wget https://imputationserver.sph.umich.edu/share/results/7a8fca2dc09fe487c09fe80220be8ea6/chr_X.no.auto_male.log

# SNP Statistics
wget https://imputationserver.sph.umich.edu/share/results/15699900e18d5b91e416476d33b1f452/statistics.txt

####################################################################################################
####################################################################################################

/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name MIS.download.$study.$ancestry.chrx \
    --script_prefix imputed.data.download.$ancestry.chrx \
    --mem 3 \
    --nslots 1 \
    --priority 0 \
    --program bash download.file

# inflate chr results
for file in *zip; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name unzip.$study.$ancestry.chrx \
        --script_prefix unzip.imputed.$study.$ancestry.chrx.data \
        --mem 3 \
        --nslots 2 \
        --priority 0 \
        --program unzip -P $passW $file 
done

# Once the unzip jobs have finished, the original zip files from MIS can be removed
rm -rf *zip


# Upload to S3

In [None]:
ancestry=aa
# mv inflated results to respective directories and upload to s3
files="chrX.no.auto_male.dose.vcf.gz chrX.no.auto_male.dose.vcf.gz.tbi chrX.no.auto_male.info.gz chr_X.no.auto_male.log "
for file in $files; do
    aws s3 cp $file s3://rti-hiv/hiv_uhs1234/data/genotype/imputed/$ancestry/$file --quiet &
done
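
Since the uploads run in the background with `--quiet`, a missing or empty file is easy to overlook. The sketch below checks a file list before uploading. `missing_files` is a hypothetical helper, and the demo creates toy files in a temporary directory rather than touching the real results.

```bash
# Hypothetical helper: print every path that is missing or empty.
missing_files() {
    local f
    for f in "$@"; do
        [ -s "$f" ] || echo "$f"
    done
}

# Demo: one good file, one empty file, one absent file
up_dir=$(mktemp -d)
echo data > "$up_dir/chrX.no.auto_male.dose.vcf.gz"
: > "$up_dir/chrX.no.auto_male.info.gz"    # empty, so it is reported
missing_files "$up_dir/chrX.no.auto_male.dose.vcf.gz" \
              "$up_dir/chrX.no.auto_male.info.gz" \
              "$up_dir/chr_X.no.auto_male.log"
```

In the upload cell, `missing_files $files` could be run first and the loop skipped if it prints anything. Adding `wait` after the loop makes the shell block until all background `aws s3 cp` processes have exited.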