# UHS1–4 Genotype Array QC
__Author__: Jesse Marks <br>

**GitHub Issue:** [Issue #125](https://github.com/RTIInternational/bioinformatics/issues/125)

This document logs the pre-imputation procedures performed on the UHS datasets: UHS1, UHS2, UHS3 (chip version V1-2), UHS3 (chip version V1-3), and UHS4. The starting point is the set of observed genotypes that have already passed quality control within each batch/wave. We merge these data and run them through our automated [genotype array QC](https://github.com/RTIInternational/biocloud_gwas_workflows/tree/master/genotype_array_qc) workflow, which also removes any samples that overlap between waves. The quality-controlled genotypes are oriented on the GRCh37 plus strand.

## Data retrieval and organization
PLINK binary filesets were obtained from AWS S3 storage. Nathan Gaddis details the location of the UHS1–3 genotype data in [this post from GitHub Issue #117](https://github.com/RTIInternational/bioinformatics/issues/117#issuecomment-469845859); the location of the UHS4 data was provided by email correspondence.

* UHS1: `s3://rti-midas-data/studies/hiv/observed/final/uhs1.{ea,aa}*`
* UHS2: `s3://rti-midas-data/studies/hiv/observed/final/uhs2.{ea,aa}*`
* UHS3_V1-2: `s3://rti-midas-data/studies/hiv/observed/final/uhs3.{ea,aa}.V1-2*`
* UHS3_V1-3: `s3://rti-midas-data/studies/hiv/observed/final/uhs3.{ea,aa}.V1-3*`
* UHS4: `s3://rti-shared/shared_data/post_qc/uhs4/genotype/array/observed/0007/{afr,eur}/uhs4.{bed,bim,fam}.gz`

**Note:** We are not using any HA data for UHS.
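
The download commands later in this notebook follow a regular naming pattern. As a rough sketch, the expected file names can be enumerated up front and compared against what actually arrives. The helper name `expected_files` is hypothetical, and the pattern is inferred only from the `aws s3 cp` commands in this notebook.

```python
from itertools import product

ANCESTRIES = {"aa": "afr", "ea": "eur"}  # file-name code -> local directory name
EXTENSIONS = ["bed", "bim", "fam"]

def expected_files():
    """Return relative paths (ancestry/wave/file) expected after the S3 download."""
    files = []
    # UHS1-3: autosomes and chrX (chr23) are stored as separate filesets.
    prefixes = {
        "uhs1": "uhs1.{code}",
        "uhs2": "uhs2.{code}",
        "uhs3_v1-2": "uhs3.{code}.V1-2",
        "uhs3_v1-3": "uhs3.{code}.V1-3",
    }
    for (wave, pattern), (code, anc) in product(prefixes.items(), ANCESTRIES.items()):
        base = pattern.format(code=code)
        for ext in EXTENSIONS:
            files.append(f"{anc}/{wave}/{base}.{ext}.gz")
            files.append(f"{anc}/{wave}/{base}.chr23.{ext}.gz")
    # UHS4: a single fileset per ancestry, already split by directory on S3.
    for anc in ANCESTRIES.values():
        for ext in EXTENSIONS:
            files.append(f"{anc}/uhs4/uhs4.{ext}.gz")
    return files

files = expected_files()
print(len(files))
```

Under these assumptions the list has 54 entries: 48 for UHS1–3 (4 waves × 2 ancestries × 2 filesets × 3 extensions) plus 6 for UHS4.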

# Pre-QC

## Prepare analysis environment

In [None]:
study="uhs1234"

## create directory structure
baseDir=/shared/rti-shared/shared_data/pre_qc/$study/genotype/array/observed/0001
mkdir -p $baseDir/{eur,afr}/{uhs1,uhs2,uhs3_v1-2,uhs3_v1-3,uhs4}
mkdir -p /shared/rti-common/ref_panels/1000G/2014.10/legend_with_chr/dbsnp_b153_ids/

## download reference files
cd /shared/rti-common/ref_panels/1000G/2014.10/legend_with_chr/dbsnp_b153_ids/
aws s3 sync s3://rti-common/ref_panels/1000G/2014.10/legend_with_chr/dbsnp_b153_ids/ .

## download study genotype data
cd $baseDir
for ext in {bed,bim,fam}; do
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.aa.$ext.gz afr/uhs1/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.aa.chr23.$ext.gz afr/uhs1/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.ea.$ext.gz eur/uhs1/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.ea.chr23.$ext.gz eur/uhs1/

    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs2.aa.$ext.gz afr/uhs2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs2.aa.chr23.$ext.gz afr/uhs2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs2.ea.$ext.gz eur/uhs2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs2.ea.chr23.$ext.gz eur/uhs2/

    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-2.$ext.gz afr/uhs3_v1-2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-2.chr23.$ext.gz afr/uhs3_v1-2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-2.$ext.gz eur/uhs3_v1-2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-2.chr23.$ext.gz eur/uhs3_v1-2/

    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-3.$ext.gz afr/uhs3_v1-3/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-3.chr23.$ext.gz afr/uhs3_v1-3/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-3.$ext.gz eur/uhs3_v1-3/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-3.chr23.$ext.gz eur/uhs3_v1-3/

    aws s3 cp s3://rti-shared/shared_data/post_qc/uhs4/genotype/array/observed/0007/afr/uhs4.$ext.gz afr/uhs4/
    aws s3 cp s3://rti-shared/shared_data/post_qc/uhs4/genotype/array/observed/0007/eur/uhs4.$ext.gz eur/uhs4/
done

gunzip -r * 

wc -l */*/*bim

```
   805863 afr/uhs1/uhs1.aa.bim
    17396 afr/uhs1/uhs1.aa.chr23.bim
  1395852 afr/uhs2/uhs2.aa.bim
    23484 afr/uhs2/uhs2.aa.chr23.bim
  1845925 afr/uhs3_v1-2/uhs3.aa.V1-2.bim
    41884 afr/uhs3_v1-2/uhs3.aa.V1-2.chr23.bim
  1806719 afr/uhs3_v1-3/uhs3.aa.V1-3.bim
    44706 afr/uhs3_v1-3/uhs3.aa.V1-3.chr23.bim
  1715530 afr/uhs4/uhs4.bim
  
   808822 eur/uhs1/uhs1.ea.bim
    17408 eur/uhs1/uhs1.ea.chr23.bim
  1820881 eur/uhs2/uhs2.ea.bim
    32857 eur/uhs2/uhs2.ea.chr23.bim
  1924758 eur/uhs3_v1-2/uhs3.ea.V1-2.bim
    24202 eur/uhs3_v1-2/uhs3.ea.V1-2.chr23.bim
  1868610 eur/uhs3_v1-3/uhs3.ea.V1-3.bim
    35440 eur/uhs3_v1-3/uhs3.ea.V1-3.chr23.bim
  2009074 eur/uhs4/uhs4.bim
```

## Merge autosomes with chrX

In [None]:
# UHS1
cd $baseDir/afr/uhs1/
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs1.aa \
    --bmerge uhs1.aa.chr23 \
    --make-bed \
    --out uhs1_afr
#wc -l *fam
#  2015 uhs1.aa.chr23.fam
#  2016 uhs1.aa.fam
#  2016 uhs1_afr.fam

#wc -l *bim
#  805863 uhs1.aa.bim
#   17396 uhs1.aa.chr23.bim
#  805863 uhs1_afr.bim

cd $baseDir/eur/uhs1/
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs1.ea \
    --bmerge uhs1.ea.chr23 \
    --make-bed \
    --out uhs1_eur
# wc -l *fam
#  1140 uhs1.ea.chr23.fam
#  1142 uhs1.ea.fam
#  1142 uhs1_eur.fam

#wc -l *bim
#  808822 uhs1.ea.bim
#   17408 uhs1.ea.chr23.bim
#  808822 uhs1_eur.bim


# UHS2
cd $baseDir/afr/uhs2/
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs2.aa \
    --bmerge uhs2.aa.chr23 \
    --make-bed \
    --out uhs2_afr
# wc -l *fam
#   767 uhs2.aa.chr23.fam
#   767 uhs2.aa.fam
#   767 uhs2_afr.fam

#wc -l *bim
#  1395852 uhs2.aa.bim
#    23484 uhs2.aa.chr23.bim
#  1416262 uhs2_afr.bim

cd $baseDir/eur/uhs2/
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs2.ea \
    --bmerge uhs2.ea.chr23 \
    --make-bed \
    --out uhs2_eur
#wc -l *fam
#   828 uhs2.ea.chr23.fam
#   828 uhs2.ea.fam
#   828 uhs2_eur.fam

#wc -l *bim
#  1820881 uhs2.ea.bim
#    32857 uhs2.ea.chr23.bim
#  1841954 uhs2_eur.bim


# UHS3_v1-2
cd $baseDir/afr/uhs3_v1-2
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs3.aa.V1-2 \
    --bmerge uhs3.aa.V1-2.chr23 \
    --make-bed \
    --out uhs3_v1-2_afr
#wc -l *bim *fam
#  1845925 uhs3.aa.V1-2.bim
#    41884 uhs3.aa.V1-2.chr23.bim
#  1845927 uhs3_v1-2_afr.bim
#       84 uhs3.aa.V1-2.chr23.fam
#       84 uhs3.aa.V1-2.fam
#       84 uhs3_v1-2_afr.fam

cd $baseDir/eur/uhs3_v1-2
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs3.ea.V1-2 \
    --bmerge uhs3.ea.V1-2.chr23 \
    --make-bed \
    --out uhs3_v1-2_eur
#wc -l *bim *fam
#  1924758 uhs3.ea.V1-2.bim
#    24202 uhs3.ea.V1-2.chr23.bim
#  1924759 uhs3_v1-2_eur.bim
#       33 uhs3.ea.V1-2.chr23.fam
#       33 uhs3.ea.V1-2.fam
#       33 uhs3_v1-2_eur.fam


# UHS3_v1-3
cd $baseDir/afr/uhs3_v1-3
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs3.aa.V1-3 \
    --bmerge uhs3.aa.V1-3.chr23 \
    --make-bed \
    --out uhs3_v1-3_afr
#wc -l *bim *fam
#  1806719 uhs3.aa.V1-3.bim
#    44706 uhs3.aa.V1-3.chr23.bim
#  1806722 uhs3_v1-3_afr.bim
#       94 uhs3.aa.V1-3.chr23.fam
#       94 uhs3.aa.V1-3.fam
#       94 uhs3_v1-3_afr.fam

cd $baseDir/eur/uhs3_v1-3
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs3.ea.V1-3 \
    --bmerge uhs3.ea.V1-3.chr23 \
    --make-bed \
    --out uhs3_v1-3_eur
#wc -l *bim *fam
#  1868610 uhs3.ea.V1-3.bim
#    35440 uhs3.ea.V1-3.chr23.bim
#  1868612 uhs3_v1-3_eur.bim
#       44 uhs3.ea.V1-3.chr23.fam
#       44 uhs3.ea.V1-3.fam
#       44 uhs3_v1-3_eur.fam


# UHS4
cd $baseDir/afr/uhs4/
# copy to ancestry-suffixed names for consistency with the other waves
for ext in {bim,bed,fam}; do
    cp uhs4.$ext uhs4_afr.$ext
done
#wc -l *afr.bim *afr.fam
# 1715530 uhs4_afr.bim
#    1131 uhs4_afr.fam

cd $baseDir/eur/uhs4/
# copy to ancestry-suffixed names for consistency with the other waves
for ext in {bim,bed,fam}; do
    cp uhs4.$ext uhs4_eur.$ext
done
#wc -l *eur.bim *eur.fam
# 2009074 uhs4_eur.bim
#    1225 uhs4_eur.fam
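
The variant counts recorded in the comments above show that each merged fileset is smaller than the sum of its two inputs. The sketch below works through that arithmetic. It assumes PLINK matched variants by ID and that the merged `.bim` is the union of the two inputs; the helper name `shared_and_new` is ours, for illustration only.

```python
# (main fileset variants, chr23 variants, merged variants), copied from the
# wc -l comments in the cell above
counts = {
    "uhs1_afr": (805863, 17396, 805863),
    "uhs1_eur": (808822, 17408, 808822),
    "uhs2_afr": (1395852, 23484, 1416262),
    "uhs2_eur": (1820881, 32857, 1841954),
    "uhs3_v1-2_afr": (1845925, 41884, 1845927),
    "uhs3_v1-2_eur": (1924758, 24202, 1924759),
    "uhs3_v1-3_afr": (1806719, 44706, 1806722),
    "uhs3_v1-3_eur": (1868610, 35440, 1868612),
}

def shared_and_new(main, chr23, merged):
    """Return (variants present in both inputs, variants only in the chr23 fileset)."""
    shared = main + chr23 - merged  # inclusion-exclusion on the two ID sets
    return shared, chr23 - shared

for name, (main, chr23, merged) in counts.items():
    shared, new = shared_and_new(main, chr23, merged)
    print(f"{name}: {shared} chr23 variants already in the main fileset, {new} added")
```

Read this way, the UHS1 main filesets already contained every chr23 variant, while the UHS2 merges added most of the chr23 variants as new rows.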

## Merge ancestry specific batches/waves

### AFR

In [None]:
cd $baseDir

# create merge list
wc -l afr/uhs*/*afr.fam | awk '{print $2}' | head -n -1 | awk -F"." '{print $1}' > uhs_afr_merge_list.txt

/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --merge-list uhs_afr_merge_list.txt \
    --make-bed \
    --out afr/uhs1234_afr
# Error: 6723 variants with 3+ alleles present.

# remove those variants and try the merge again.
while read line; do
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --bfile $line \
        --exclude afr/uhs1234_afr-merge.missnp \
        --make-bed \
        --out ${line}_remove_missnps
done < uhs_afr_merge_list.txt

# create new merge list
wc -l afr/uhs*/*remove_missnps.fam | awk '{print $2}' | head -n -1 | awk -F"." '{print $1}' > uhs_afr_new_merge_list.txt

# try merge again
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --merge-list uhs_afr_new_merge_list.txt \
    --make-bed \
    --out afr/uhs1234_afr

gzip afr/uhs1234_afr*
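
The merge list is just one PLINK fileset prefix per line, obtained by dropping the `.fam` extension from each path. A minimal Python sketch of the same step follows. The function name `build_merge_list` is hypothetical; unlike splitting on the first `.`, it strips only the trailing extension, so a dot elsewhere in a path would not truncate the prefix.

```python
def build_merge_list(fam_paths):
    """Return PLINK fileset prefixes (path without '.fam') in sorted order."""
    suffix = ".fam"
    return [p[: -len(suffix)] for p in sorted(fam_paths) if p.endswith(suffix)]

# Paths as produced by the chrX merge step above
fams = [
    "afr/uhs1/uhs1_afr.fam",
    "afr/uhs2/uhs2_afr.fam",
    "afr/uhs3_v1-2/uhs3_v1-2_afr.fam",
    "afr/uhs3_v1-3/uhs3_v1-3_afr.fam",
    "afr/uhs4/uhs4_afr.fam",
]
print("\n".join(build_merge_list(fams)))
```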

### EUR

In [None]:
cd $baseDir

# create merge list
wc -l eur/uhs*/*eur.fam | awk '{print $2}' | head -n -1 | awk -F"." '{print $1}' > uhs_eur_merge_list.txt

/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --merge-list uhs_eur_merge_list.txt \
    --make-bed \
    --out eur/uhs1234_eur
#Error: 6717 variants with 3+ alleles present.

# remove those variants and try the merge again.
while read line; do
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --bfile $line \
        --exclude eur/uhs1234_eur-merge.missnp \
        --make-bed \
        --out ${line}_remove_missnps
done < uhs_eur_merge_list.txt

# create new merge list
wc -l eur/uhs*/*remove_missnps.fam | awk '{print $2}' | head -n -1 | awk -F"." '{print $1}' > uhs_eur_new_merge_list.txt

# try merge again
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --merge-list uhs_eur_new_merge_list.txt \
    --make-bed \
    --out eur/uhs1234_eur

gzip eur/uhs1234_eur*
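
Both ancestry merges first failed with a "3+ alleles" error, which arises when the same variant ID carries different allele pairs in different waves. The sketch below is a simplified model of that check, not PLINK's actual implementation. The function name and the toy records are made up for illustration.

```python
from collections import defaultdict

def multiallelic_ids(bim_tables):
    """Return variant IDs whose alleles, pooled across filesets, number three or more.

    bim_tables: iterable of lists of (variant_id, allele1, allele2) tuples,
    one list per fileset. The missing-allele code "0" is ignored.
    """
    alleles = defaultdict(set)
    for table in bim_tables:
        for variant_id, a1, a2 in table:
            alleles[variant_id].update(a for a in (a1, a2) if a != "0")
    return sorted(v for v, seen in alleles.items() if len(seen) >= 3)

# Toy data (not from the UHS files): rs2 is A/G in one wave and A/C in another.
wave_a = [("rs1", "A", "G"), ("rs2", "A", "G")]
wave_b = [("rs1", "G", "A"), ("rs2", "A", "C"), ("rs3", "0", "T")]
print(multiallelic_ids([wave_a, wave_b]))
```

Here only `rs2` is flagged; `rs1` has the same two alleles in swapped order, which is not a conflict in this model. In the notebook, the flagged IDs come from the `-merge.missnp` file and are excluded from every wave before the second merge attempt.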

## Upload to S3

In [None]:
cd $baseDir/afr/

# include a README within the version-level directory
for file in uhs1234_afr*; do
    aws s3 mv $file  s3://rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0001/afr/
done

cd $baseDir/eur/
# include a README within the version-level directory
for file in uhs1234_eur*; do
    aws s3 mv $file  s3://rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0001/eur/
done

# QC

## Create Directories

In [None]:
mkdir -p /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/{wf_input,wf_output}
mkdir -p /shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/

## Set up config file for QC pipeline

Copy the JSON config file from a previous run, then edit the copy to include the appropriate cohort information and settings.

In [None]:
cd /shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/
cp /shared/biocloud_gwas_workflows/genotype_array_qc/test/ukb_qc_wf.json uhs1234_afr_genotype_qc_wf.json
cp /shared/biocloud_gwas_workflows/genotype_array_qc/test/ukb_qc_wf.json uhs1234_eur_genotype_qc_wf.json

# Edit wf config files
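
The edits to the copied config were made by hand. If they needed to be scripted, one possible approach is sketched below. The function `write_config` is hypothetical, and the keys to override are not shown because they depend on the workflow's input schema, which this notebook does not list.

```python
import json

def write_config(template_path, output_path, overrides):
    """Copy a JSON workflow config, replacing only the keys given in overrides."""
    with open(template_path) as handle:
        config = json.load(handle)
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        # Guard against typos: every override must already exist in the template.
        raise KeyError("keys not in template: {}".format(unknown))
    config.update(overrides)
    with open(output_path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config
```

Refusing unknown keys is a deliberate choice here: a misspelled input name would otherwise be written silently and only surface when the workflow is submitted.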

## Zip biocloud_gwas_workflows repo

In [None]:
cd /shared/biocloud_gwas_workflows
git pull
git submodule update --init --recursive
git rev-parse HEAD > /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/wf_input/git_hash.txt
cd /shared
zip \
    --exclude=*/var/* \
    --exclude=*.git/* \
    --exclude=*/test/* \
    --exclude=*/.idea/* \
    -r /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/wf_input/biocloud_gwas_workflows.zip \
    biocloud_gwas_workflows/
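
For readers less familiar with `zip --exclude`, the sketch below approximates the same archive step in Python. It is an illustration under assumptions rather than a drop-in replacement: the function name `zip_repo` is ours, wildcard matching is done with `fnmatch` (where `*` also matches `/`), and directory entries are not stored.

```python
import fnmatch
import os
import zipfile

# Same exclusion patterns as the zip command above
EXCLUDE = ["*/var/*", "*.git/*", "*/test/*", "*/.idea/*"]

def zip_repo(parent_dir, repo_name, zip_path, exclude=EXCLUDE):
    """Zip parent_dir/repo_name, storing paths relative to parent_dir."""
    archived = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(os.path.join(parent_dir, repo_name)):
            for name in sorted(files):
                full = os.path.join(root, name)
                rel = os.path.relpath(full, parent_dir).replace(os.sep, "/")
                if any(fnmatch.fnmatchcase(rel, pattern) for pattern in exclude):
                    continue  # skip anything matching an exclusion pattern
                archive.write(full, rel)
                archived.append(rel)
    return sorted(archived)
```

The archive path should sit outside the repository directory, as it does in the notebook, so the zip file never tries to include itself.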

## Submit job

In [None]:
# Open session in terminal 1
ssh -i ~/.ssh/gwas_rsa -L localhost:8000:localhost:8000 ec2-user@54.174.185.7

# Submit job in terminal 2
curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/shared/biocloud_gwas_workflows/genotype_array_qc/genotype_array_qc_wf.wdl" \
    -F "workflowInputs=@/shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/uhs1234_afr_genotype_qc_wf.json" \
    -F "workflowDependencies=@/shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/wf_input/biocloud_gwas_workflows.zip"

curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/shared/biocloud_gwas_workflows/genotype_array_qc/genotype_array_qc_wf.wdl" \
    -F "workflowInputs=@/shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/uhs1234_eur_genotype_qc_wf.json" \
    -F "workflowDependencies=@/shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/wf_input/biocloud_gwas_workflows.zip"

# workflow IDs returned by the two submissions above
afr_job=a06ddc1c-278a-446a-908c-24d8878e1de7
eur_job=4dfd555f-4897-41b4-ba47-209aa250839f

# Monitor job in terminal 1
#tail -f /tmp/cromwell-server.log

# check job status in terminal 2
for job in $afr_job $eur_job; do
    curl -X GET "http://localhost:8000/api/workflows/v1/${job}/status"
    echo
done
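
Rather than re-running the status request by hand, the check can be polled until the workflow finishes. The sketch below assumes the status endpoint returns JSON with a `status` field and that `Succeeded`, `Failed`, and `Aborted` are the terminal states. The function names and the polling interval are illustrative choices.

```python
import json
import time
from urllib.request import urlopen

TERMINAL = {"Succeeded", "Failed", "Aborted"}

def fetch_status(job_id, host="http://localhost:8000"):
    """Return the status string reported by the Cromwell server for one workflow."""
    url = "{}/api/workflows/v1/{}/status".format(host, job_id)
    with urlopen(url) as response:
        return json.load(response)["status"]

def wait_for(job_id, fetch=fetch_status, interval=60, max_checks=1000):
    """Poll until the workflow reaches a terminal state; return the last status seen."""
    status = None
    for _ in range(max_checks):
        status = fetch(job_id)
        if status in TERMINAL:
            break
        time.sleep(interval)
    return status
```

Passing the fetch function as an argument keeps the polling loop testable without a running server; with the SSH tunnel from terminal 1 open, `wait_for(job_id)` would use the real endpoint.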

## Get final outputs

Run the following in a terminal that has credentials for accessing the rti-code S3 bucket.

### AFR

In [None]:
cd /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/wf_output/
job=a06ddc1c-278a-446a-908c-24d8878e1de7 # afr workflow ID from the submission step
echo $job # to verify you have the correct job ID saved in variable "job"

# Download output from Swagger UI.
curl -X GET "http://localhost:8000/api/workflows/v1/${job}/outputs" -H "accept: application/json" \
  > ${job}_final_outputs.json

# Copy final outputs json to S3
aws s3 cp ${job}_final_outputs.json \
  s3://rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/wf_output/

# Copy final qc files from cromwell-output to final location
python - <<EOF
import json
import os
import re

job = "$job"
dir = "rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/afr"

def traverse(o, tree_types=(list, tuple)):
    if isinstance(o, tree_types):
        for value in o:
            for subvalue in traverse(value, tree_types):
                yield subvalue
    else:
        yield o

with open(job + '_final_outputs.json') as f:
    outputs = json.load(f)["outputs"]

# traverse() flattens nested lists and yields a non-list value unchanged,
# so one loop covers both list and scalar outputs
for key in outputs:
    for value in traverse(outputs[key]):
        if str(value).startswith("s3://"):
            message = "aws s3 cp {} s3://{}/wf_output/".format(value, dir)
            os.system(message)

# Copy and rename final genotype plink file set
# (reuses the outputs already loaded from the JSON in the current directory)
for ext in ["bed", "bim", "fam"]:
    output_key = "genotype_array_qc_wf.final_qc_{}".format(ext)
    for original in outputs[output_key]:
        fileName = re.sub(r'.+/', '', original)
        new = re.sub(r'uhs1234.YRI.+\.', 'afr/uhs1234.', fileName)
        new = re.sub(r'uhs1234.CEU.+\.', 'eur/uhs1234.', new)
        new = re.sub(r'uhs1234.CHB.+\.', 'amr/uhs1234.', new)
        destination = "s3://{}/wf_output/{}".format(dir, new)
        os.system("aws s3 cp {} {}".format(original, destination))

EOF
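
The `traverse` helper in the script above exists because workflow outputs can be single values, lists, or lists of lists. The standalone sketch below shows how it flattens such a structure and how S3 URIs are then picked out. The output keys, bucket name, and the `s3_paths` helper are placeholders, not the real workflow outputs.

```python
def traverse(o, tree_types=(list, tuple)):
    """Yield the leaves of arbitrarily nested lists/tuples; yield a bare value as-is."""
    if isinstance(o, tree_types):
        for value in o:
            yield from traverse(value, tree_types)
    else:
        yield o

def s3_paths(outputs):
    """Collect every string value in an outputs dict that looks like an S3 URI."""
    return [
        value
        for key in sorted(outputs)
        for value in traverse(outputs[key])
        if isinstance(value, str) and value.startswith("s3://")
    ]

# Hypothetical outputs structure; key names and bucket are placeholders.
example = {
    "wf.logs": [["s3://bucket/a.log"], ["s3://bucket/b.log", None]],
    "wf.final_bed": "s3://bucket/final.bed",
    "wf.sample_count": 42,
}
print(s3_paths(example))
```

Non-string leaves such as counts or nulls are skipped, so only the values that `aws s3 cp` can act on remain.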

### EUR

In [None]:
cd /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/wf_output/
job=4dfd555f-4897-41b4-ba47-209aa250839f # eur workflow ID from the submission step
echo $job # to verify you have the correct job ID saved in variable "job"

# Download output from Swagger UI.
curl -X GET "http://localhost:8000/api/workflows/v1/${job}/outputs" -H "accept: application/json" \
  > ${job}_final_outputs.json

# Copy final outputs json to S3
aws s3 cp ${job}_final_outputs.json \
  s3://rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/wf_output/

# Copy final qc files from cromwell-output to final location
python - <<EOF
import json
import os
import re

job = "$job"
dir = "rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0002/eur"

def traverse(o, tree_types=(list, tuple)):
    if isinstance(o, tree_types):
        for value in o:
            for subvalue in traverse(value, tree_types):
                yield subvalue
    else:
        yield o

with open(job + '_final_outputs.json') as f:
    outputs = json.load(f)["outputs"]

# traverse() flattens nested lists and yields a non-list value unchanged,
# so one loop covers both list and scalar outputs
for key in outputs:
    for value in traverse(outputs[key]):
        if str(value).startswith("s3://"):
            message = "aws s3 cp {} s3://{}/wf_output/".format(value, dir)
            os.system(message)

# Copy and rename final genotype plink file set
# (reuses the outputs already loaded from the JSON in the current directory)
for ext in ["bed", "bim", "fam"]:
    output_key = "genotype_array_qc_wf.final_qc_{}".format(ext)
    for original in outputs[output_key]:
        fileName = re.sub(r'.+/', '', original)
        new = re.sub(r'uhs1234.YRI.+\.', 'afr/uhs1234.', fileName)
        new = re.sub(r'uhs1234.CEU.+\.', 'eur/uhs1234.', new)
        new = re.sub(r'uhs1234.CHB.+\.', 'amr/uhs1234.', new)
        destination = "s3://{}/wf_output/{}".format(dir, new)
        os.system("aws s3 cp {} {}".format(original, destination))

EOF
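
The renaming step in the scripts above relies on three regular expressions that collapse the workflow's ancestry-tagged file names to `<ancestry>/uhs1234.<ext>`. The sketch below isolates that logic so it can be tried on a sample name. The input paths are hypothetical, since the real Cromwell output names are not shown in this notebook.

```python
import re

# Patterns copied from the scripts above: reference-population tag -> ancestry directory
RENAMES = [
    (r"uhs1234.YRI.+\.", "afr/uhs1234."),
    (r"uhs1234.CEU.+\.", "eur/uhs1234."),
    (r"uhs1234.CHB.+\.", "amr/uhs1234."),
]

def final_name(path):
    """Strip the directory part, then apply the ancestry renaming patterns in order."""
    name = re.sub(r".+/", "", path)
    for pattern, replacement in RENAMES:
        name = re.sub(pattern, replacement, name)
    return name

# Hypothetical Cromwell output path; the real file names may differ.
print(final_name("s3://example-bucket/cromwell-execution/call-x/uhs1234.CEU.final_qc.bed"))
```

Because `.+` is greedy, everything between the population tag and the last dot is dropped, leaving only the extension. A name with nothing between the tag and the extension (for example `uhs1234.CEU.bed`) would not match and would be left unchanged.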