# UHS1-3 HIV Acquisition GWAS: G$\times$Sex (2df)

__Author__: Jesse Marks <br>
**GitHub Issue**: [Issue #97](https://github.com/RTIInternational/bioinformatics/issues/97)

This notebook documents the processing steps for the UHS123 EAs, i.e., the combined UHS1, UHS2, and UHS3 EA data sets.

__Phenotype file__:`s3://rti-hiv/uhs_data/fang_processing/pheno_hiv_ea_probabel` <br>
__Imputed Genotype Data Location__: `s3://rti-hiv/uhs_data/ea/unencrypted`

## Software and tools
The software and tools used for processing these data are

* [Amazon Elastic Compute Cloud(EC2)](https://aws.amazon.com/ec2/)
* [PLINK v1.9 beta 3.45](https://www.cog-genomics.org/plink/)
* [ProbABEL](https://github.com/GenABEL-Project/ProbABEL)

```
wc -l pheno_hiv_ea_probabel
2047 pheno_hiv_ea_probabel
```

In [None]:
# EC2 instance
aws configure

AWS Access Key ID [None]: <ACCESS_KEY_ID>
AWS Secret Access Key [None]: <SECRET_ACCESS_KEY>
Default region name [None]: us-east-1
Default output format [None]: text  # could be json, text, or table
    
    
# --size is the desired volume size in GiB. Note: remove --dry-run to actually apply the change.
#aws ec2 modify-volume --dry-run --volume-id vol-046b896280e520fa7 --size 1500
aws ec2 modify-volume --volume-id vol-046b896280e520fa7 --size 1500
"""VOLUMEMODIFICATION      modifying       3072    1024    gp2     0       2019-04-17T14:23:19.000Z        4500    1500    gp2       vol-046b896280e520fa7
"""

# extend file system to the new volume capacity.
sudo resize2fs /dev/nvme1n1

# Download Data and Create Directory Structure

In [None]:
## bash ##

### create directory structure ###
study=uhs123
ancestry="ea"
genoD=/shared/jmarks/hiv/$study/genotype/observed/final/001 # location of QC'ed genotype data
gwasD=/shared/jmarks/hiv/$study/gwas # base processing dir
phenoD=/shared/jmarks/hiv/$study/phenotype # base phenotype dir
eig=$phenoD/processing/eig # location of PCA processing dir
mkdir -p $genoD $gwasD $phenoD/{final,processing,unprocessed} $eig/results

#### Download data & unzip ###
aws s3 cp s3://rti-hiv/uhs_data/fang_processing/pheno_hiv_ea_probabel $phenoD/final
# the imputed data live under an S3 prefix, so copy recursively
aws s3 cp --recursive s3://rti-hiv/uhs_data/ea/unencrypted $gwasD

# Prepare files for Analysis
## Phenotype processing
The phenotype data were processed by Fang Fang for the baseline HIV acquisition GWAS (1df), so we do not need to recompute the genotype PCs that are included in the model as covariates. The PCs selected for the EA cohort are PC1, PC4, PC8, and PC10.

```
wc -l pheno_hiv_ea_probabel
    2047 pheno_hiv_ea_probabel
```

```
iid	hiv	gender	age	pc10	pc1	pc4	pc8
245@1064714500_245@1064714500	0	1	26	-0.0001	0.0059	-0.0036	0.0024
266@1064714555_266@1064714555	0	1	48	0.0050	-0.0109	-0.0120	-0.0251
441@1064714760_441@1064714760	0	1	27	0.0076	-0.0077	-0.0157	-0.0267
                            .
                            .
                            .
8002697211_HHG0903_AS00-10269_8002697211_HHG0903_AS00-10269	0	0	34	0.0016	0.0094	0.0007	0.0025
8002697668_HHG7123_AS92-4653_8002697668_HHG7123_AS92-4653	0	0	41	0.0018	0.0109	0.0025	0.0030
8002697690_HHG6954_AS95-01827_8002697690_HHG6954_AS95-01827	0	0	37	0.0007	0.0066	0.0002	0.0072
```
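
As a quick sanity check on the layout shown above, the sketch below parses a tab-delimited phenotype table, confirms the expected header, and counts one row per subject. The function name `check_pheno` and the in-memory sample are illustrative assumptions; in practice the lines would come from `pheno_hiv_ea_probabel`.

```python
import csv
import io

# Expected header, taken from the phenotype file preview.
EXPECTED_HEADER = ["iid", "hiv", "gender", "age", "pc10", "pc1", "pc4", "pc8"]

def check_pheno(handle):
    """Return the number of subject rows; raise ValueError on a malformed table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError("unexpected header: {}".format(header))
    n_subjects = 0
    for row in reader:
        if len(row) != len(EXPECTED_HEADER):
            raise ValueError("wrong column count in row {}".format(n_subjects + 1))
        n_subjects += 1
    return n_subjects

# Two-subject sample in the same layout as the real file (illustration only).
sample = io.StringIO(
    "iid\thiv\tgender\tage\tpc10\tpc1\tpc4\tpc8\n"
    "245@1064714500_245@1064714500\t0\t1\t26\t-0.0001\t0.0059\t-0.0036\t0.0024\n"
    "266@1064714555_266@1064714555\t0\t1\t48\t0.0050\t-0.0109\t-0.0120\t-0.0251\n"
)
n_sample = check_pheno(sample)
```

For the real file, `check_pheno(open("pheno_hiv_ea_probabel"))` should return the line count minus the header line.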

## Genotype Data
Prepare Genotype Data for analysis software.
### Inflate imputation results
These data were imputed on the Michigan Imputation Server and therefore need to be inflated.
### merge chrX
No chromosome 23 data.

### Convert imputed format: dose to mach
The ProbABEL software requires the imputed genotype data to be in mach format.

In [None]:
base_dir=/shared/jmarks/hiv/uhs123/genotype/imputed

# autosomes
for chr in {1..22};do
    chr_location=${base_dir}/mach/chr${chr}
    mkdir -p $chr_location

    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name convert_dosage_chr${chr} \
        --script_prefix ${chr_location}/chr${chr}.convert \
        --mem 10 \
        --buffer 100000 \
        --nslots 3 \
        --priority 0 \
        --program /shared/bioinformatics/software/third_party/dosage_converter_v1.0.4/bin/DosageConvertor \
            --vcfDose ${base_dir}/chr${chr}.dose.vcf.gz \
            --info ${base_dir}/chr${chr}.info.gz \
            --prefix $chr_location/chr${chr} \
            --type mach \
            --format 1 # contains the expected alternate allele count (one value per sample per marker).
done

### Prune mach files 

The converted imputed genotype data (mach format) files need to be reordered and pruned to match the subjects in the phenotype file. Pg. 7 of the [ProbABEL manual](http://www.genabel.org/sites/default/files/pdfs/ProbABEL_manual.pdf) reports that the genomic predictor file (i.e., the dosage file) has a sequential ID in the first column, followed by an arrow (`->`) and then the study ID.

We will first get a list of the IDs from the phenotype file. Then we will prune/reorder the imputed data as needed.

**Note:** for these UHS123 data, there is only one subject that needs to be removed from the genotype data because they are not present in the phenotype data.
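
To see which genotype subjects would be dropped, a minimal sketch is shown below. It assumes, as the pruning script in this notebook does, that the text before `->` in the first column of the dose file is the ID to compare against the phenotype IDs. The function name and the sample IDs are hypothetical.

```python
def ids_missing_from_phenotype(dose_first_columns, pheno_ids):
    """Return genotype IDs (the text before '->') that are absent from the phenotype IDs."""
    keep = set(pheno_ids)
    missing = []
    for col in dose_first_columns:
        gen_id = col.split("->")[0]
        if gen_id not in keep:
            missing.append(gen_id)
    return missing

# Hypothetical IDs for illustration only.
dose_cols = ["S1->S1", "S2->S2", "S3->S3"]
pheno = ["S1", "S3"]
missing = ids_missing_from_phenotype(dose_cols, pheno)
```

Run against the first column of a real dose file and the phenotype ID list, the returned list should contain the single subject mentioned in the note.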

In [None]:
## bash ## 
cd /shared/jmarks/hiv/uhs123/phenotype/final
tail -n +2  pheno_hiv_ea_probabel| cut -f1 > phenotype.ids
cp phenotype.ids /shared/jmarks/hiv/uhs123/genotype/imputed/mach

wc -l phenotype.ids
"""
2046 phenotype.ids
"""

In [None]:
### python ###
"""
processing.genotype.files.py

This script will process the mach.dose imputed genotype files.
In particular, it will remove any subjects that are not in
the phenotype file. It will output the new filtered mach.dose
file as well as a file that contains the order of the subject
IDs in the genotype files. We will then use this information to
reorder the phenotype file. This script should be in the directory
that contains the imputed data. For example, if my imputed data are in:
/shared/genotype/imputed/chr{1..22}
then the script should be in: /shared/genotype/imputed/

INPUT
    chrom: the chromosome to process 
    phenids: name of the file (in same directory as script) that has the order of the
             subjects ids that are in the phenotype file

OUTPUT
    name of the file that contains the order of the samples in the genotype data
"""
import gzip
import sys

chrom = sys.argv[1]
#phenids = sys.argv[2]

def process_imputed(chrom, phenids="phenotype.ids"):
    print(chrom)
    myfile = "chr{0}/chr{0}.mach.dose.gz".format(chrom)
    outfile = "chr{0}/chr{0}.mach.dose.filtered".format(chrom)
    out_order = "chr{0}/chr{0}.id.order".format(chrom)

    # open the gzipped dose file in text mode so that each line is a str, not bytes
    with gzip.open(myfile, "rt") as inF, open(phenids) as idF, \
            open(outfile, "w") as outF, open(out_order, "w") as outID:
        # subject IDs to keep (one per line in the phenotype ID file)
        id_set = set(line.strip() for line in idF)

        count = 1  # sequential ID required by ProbABEL
        for line in inF:
            sl = line.split()
            if not sl:
                continue  # skip blank lines
            gen_id = sl[0].split("->")[0]
            if gen_id in id_set:
                sl[0] = "{}->{}".format(count, gen_id) # sequential_id->subject_id
                count += 1
                outF.write(" ".join(sl) + "\n")
                outID.write(gen_id + "\n")

    # the log check below greps for "done"
    print("chr{0} all done!".format(chrom))

process_imputed(chrom)

In [None]:
# submit processes to the job scheduler
cd /shared/jmarks/hiv/uhs123/genotype/imputed/mach

for chr in {1..22};do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
            --job_name chr$chr.gen.data.processing \
            --script_prefix chr$chr/chr$chr.mach.formatting \
            --mem 20 \
            --nslots 7 \
            --program time python processing.genotype.files.py $chr 
done

In [None]:
# make sure files were properly pruned: list any log that lacks the "done" message, then visually inspect
grep -L "done" */*formatting*log

In [None]:
# upload imputed data to S3
/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name uhs123.impute.upload \
    --script_prefix upload.to.s3 \
    --mem 20 \
    --nslots 7 \
    --program time aws s3 sync . s3://rti-hiv/uhs123_2df_hiv_acquisition/data/imputed/ea/

### Reorder phenotype file
Reorder the phenotype file to be of the same order as the genotype files. 

**Note:** if processing autosomes and chrX <br>
The autosomes and chrX will have different orders and will therefore require different phenotype files. This is because chrX is imputed separately for males and females on the Michigan Imputation Server. We have to merge the data after we get them, which results in the chrX genotype data being in a different order from the autosomes.

**Note2:** For the UHS123 EA data, the phenotype file is already in the same order as the genotype file now, so no need to process anything.
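
The claim that the two files share the same order can be checked with a short sketch like the one below. The function name is hypothetical; the two ID lists are assumed to come from `phenotype.ids` and a per-chromosome `chr22.id.order` file, read one ID per line.

```python
def first_order_mismatch(pheno_ids, genotype_ids):
    """Return the index of the first position where the ID lists differ, or None if identical."""
    for i, (p, g) in enumerate(zip(pheno_ids, genotype_ids)):
        if p != g:
            return i
    if len(pheno_ids) != len(genotype_ids):
        # one list is a strict prefix of the other
        return min(len(pheno_ids), len(genotype_ids))
    return None

# Hypothetical IDs for illustration only.
same = first_order_mismatch(["a", "b", "c"], ["a", "b", "c"])
swapped = first_order_mismatch(["a", "b", "c"], ["a", "c", "b"])
```

A return value of `None` means no reordering is needed; any integer points at the first subject that is out of place.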

In [None]:
#cd /shared/jmarks/hiv/uhs123/phenotype/final
#
#ln -s /shared/jmarks/hiv/uhs123/genotype/imputed/mach/chr22/chr22.id.order .
#
## autosomes
#head -1 pheno_hiv_ea_probabel > pheno_hiv_ea_probabel_ordered
#awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' \
#    pheno_hiv_ea_probabel chr22.id.order >> pheno_hiv_ea_probabel_ordered

### create legend
Pg. 10 of the [ProbABEL manual](http://www.genabel.org/sites/default/files/pdfs/ProbABEL_manual.pdf) reports that the legend should be in HapMap format:

rsID, position, allele1, allele2

Note: We do not have the rsIDs right now; we have the chromosome and the position. It shouldn't be too difficult to convert these to the actual rsIDs, and we might be able to use Nathan Gladdis' Nyholt script to do so. This conversion can be done later, though. Also, this is an optional file; the only column that is actually used is the SNP location.
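
As an illustration of the legend layout, the sketch below builds one legend line from an info row. It assumes the first three info fields are the variant ID and the two alleles, and that the ID has the form `chr:position`. The function name is hypothetical.

```python
def legend_line(info_fields):
    """Build one space-delimited legend line: id, position, allele1, allele2."""
    snp_id, allele1, allele2 = info_fields[0], info_fields[1], info_fields[2]
    # position is the second colon-separated piece of a chr:position ID
    position = snp_id.split(":")[1]
    return " ".join([snp_id, position, allele1, allele2])

# Example row using the chr:position style of ID (illustration only).
example = legend_line(["9:37654257", "G", "A"])
```

Taking the second colon-separated piece keeps the position correct even if an ID carries extra suffixes such as `chr:position:ref:alt`.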

In [None]:
## EC2 console ##
machD=/shared/jmarks/hiv/uhs123/genotype/imputed/mach

# HapMap "legend" file format
# note that our data is not in rsID format - we have chr:position for the rsID instead
for chr in {1..22}; do
    echo processing chr$chr
    # print header
    echo "id position 0 1" > $machD/chr$chr/map.chr$chr.legend
    # grab the SNP, position, allele1 and allele2
    tail -n +2 $machD/chr${chr}/*info | awk '{pos = $1; gsub(/^.+:/, "", pos); print $1,pos,$2,$3}' >>\
        $machD/chr$chr/map.chr$chr.legend
done &

### Format info file
The ProbABEL manual specifies that the info file should have exactly 7 columns.
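
A minimal sketch of the trimming step is shown below, assuming the info file is tab-delimited. Unlike a plain column cut, it also raises an error when a row has fewer than seven fields. The function name and the sample row are hypothetical.

```python
def trim_info_line(line, n_cols=7):
    """Keep only the first n_cols tab-separated fields of one info line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < n_cols:
        raise ValueError("expected at least {} columns, found {}".format(n_cols, len(fields)))
    return "\t".join(fields[:n_cols])

# Nine-column row for illustration only.
trimmed = trim_info_line("a\tb\tc\td\te\tf\tg\th\ti\n")
```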

In [None]:
for chr in {1..22}; do
    cut -f 1-7 $machD/chr${chr}/chr${chr}.mach.info > $machD/chr$chr/chr$chr.mach.info.filtered  &
done

# Start ProbABEL Analysis (logistic model)
Perform GWAS. Consult [ProbABEL manual](http://www.genabel.org/sites/default/files/pdfs/ProbABEL_manual.pdf) for details about software parameters.
## EA G$\times$Sex GWAS (2df)
Note about compute node usage: for these data, I had to use the x1e.2xlarge (8 vCPUs, 244 GiB memory). I was running out of memory with the x1e.xlarge (4 vCPUs, 122 GiB memory).

In [None]:
version="002"
probabel="palogist" # palogist or palinear
model="2df"
study="uhs123"
dose_suffix="mach.dose.filtered"
info_suffix="mach.info.filtered"
covars="sex,age,PC1,PC4,PC8,PC10"
MODEL="HIV_ACQ~SNP+SNP*SEX+AGE+SEX+PC1+PC4+PC8+PC10"
ancestry="ea"
phenoF="pheno_hiv_ea_probabel"
#phenoXF="vidus.ea.HIV_ACQ.AGE.SEX.PC1+PC5+PC9.ordered.chrx.txt"
genD=/shared/jmarks/hiv/$study/genotype/imputed/mach
phenD=/shared/jmarks/hiv/$study/phenotype/final
procD=/shared/jmarks/hiv/$study/gwas/$ancestry/$model/$version

mkdir -p $procD/final
mkdir -p $procD/processing/chr{1..22}
################################################################################


# autosomes
for chr in {1..22}; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name chr${chr}.GxSex.${probabel} \
        --script_prefix $procD/processing/chr${chr}/chr${chr}.2df.GxSex.gwas \
        --mem 29 \
        --nslots 7 \
        --program time /shared/bioinformatics/software/third_party/probabel-0.5.0/bin/${probabel} \
            --pheno $phenD/$phenoF \
            --dose $genD/chr${chr}/chr$chr.${dose_suffix} \
            --info $genD/chr${chr}/chr${chr}.${info_suffix} \
            --map $genD/chr${chr}/map.chr${chr}.legend \
            --chrom ${chr} \
            --interaction 1 \
            --out $procD/processing/chr${chr}/chr${chr}.GxSex.${probabel}.results
done


## chrX
## chrx has a different order phenotype file because the dose file is ordered by sex
#chr=23
#/shared/bioinformatics/software/scripts/qsub_job.sh \
#    --job_name chr${chr}.GxSex.$probabel \
#    --script_prefix $procD/processing/chr${chr}/chr${chr}.2df.GxSex.gwas \
#    --mem 30 \
#    --nslots 7 \
#    --program time /shared/bioinformatics/software/third_party/probabel-0.5.0/bin/${probabel} \
#        --pheno $phenD/$phenoXF \
#        --dose $genD/chr${chr}/chr$chr.${dose_suffix} \
#        --info $genD/chr${chr}/chr${chr}.${info_suffix} \
#        --map $genD/chr${chr}/map.chr${chr}.legend \
#        --chrom ${chr} \
#        --interaction 2 \
#        --out $procD/processing/chr${chr}/chr${chr}.GxSex.${probabel}.results


In [None]:
## check for completion
grep -L 100.00% $procD/processing/chr*/*log

### Results Processing
#### Calculate chi, P, and OR

In [None]:
baseDir=$procD/processing

for (( chr=1; chr<23; chr++ )); do
    bash /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ea_$chr \
        --script_prefix $baseDir/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.stats \
        --mem 15 \
        --nslots 7 \
        --priority 0 \
        --program Rscript ~/bin/calculate_2df_stats_from_probabel_results.R \
            --in $baseDir/chr$chr/chr$chr.GxSex.${probabel}.results_add.out.txt \
            --out $baseDir/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.stats \
            --interaction_covar gender
done

In [None]:
## check for completion
grep -L "Done" $baseDir/chr*/${study}*log

#### convert name to 1000G_p3

In [None]:
for chr in {1..22};do
    bash /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${study}.1000g_p3.chr${chr}_name \
        --script_prefix $baseDir/chr$chr/name_conversion \
        --mem 15 \
        --nslots 3 \
        --priority 0 \
        --program time perl /shared/bioinformatics/software/perl/id_conversion/convert_to_1000g_p3_ids.pl \
            --file_in ${baseDir}/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.stats \
            --file_out ${baseDir}/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.stats.converted \
            --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr$chr.legend.gz \
            --file_in_header 1 \
            --file_in_id_col  0 \
            --file_in_chr_col  1 \
            --file_in_pos_col  2 \
            --file_in_a1_col  3 \
            --file_in_a2_col  4 \
            --chr $chr
done


#chr=23
#bash /shared/bioinformatics/software/scripts/qsub_job.sh \
#    --job_name $study.1000g_p3.chr${chr}.name.conversion \
#    --script_prefix $chrxD/name_conversion \
#    --mem 15 \
#    --nslots 1 \
#    --priority 0 \
#    --program perl /shared/bioinformatics/software/perl/id_conversion/convert_to_1000g_p3_ids.pl \
#        --file_in $chrxD/$study.$ancestry.1000G_p3.chr$chr.$MODEL.recalc_maf.stats \
#        --file_out $chrxD/$study.$ancestry.1000G_p3.chr$chr.$MODEL.stats.converted \
#        --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz \
#        --file_in_header 1 \
#        --file_in_id_col  0 \
#        --file_in_chr_col  1 \
#        --file_in_pos_col  2 \
#        --file_in_a1_col  3 \
#        --file_in_a2_col  4 \
#        --chr $chr

In [None]:
## check for completion
grep -L "Done" ${baseDir}/chr*/name*log

### Filter results
MAF filters in study and 1000G, as well as imputation quality (r^2) filter.

**Note**: According to Fang Fang, when the dosage file is converted from VCF to mach format, the male and female info files are not accurate for variants in terms of AF and quality, so I did not use the info file in DosageConvertor. Instead, I calculated the AFs after the GWAS by taking into account both the male and female AFs, and I reported the quality score for both sexes.
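
The three filters can be summarized per variant as in the sketch below. It assumes a result row is available as a dict keyed by the ProbABEL column names `name`, `MAF`, and `Rsq`, and that `ref_ids` is the set of 1000G variant IDs that passed the reference-panel frequency cutoff. The function name and the sample values are hypothetical; the thresholds mirror the ones used in this notebook's commands.

```python
def passes_filters(row, ref_ids, maf_min=0.01, rsq_min=0.3):
    """True if the variant passes the study MAF, 1000G ID list, and Rsq filters."""
    return (
        float(row["MAF"]) >= maf_min      # study MAF filter
        and row["name"] in ref_ids        # variant kept in the 1000G list
        and float(row["Rsq"]) > rsq_min   # imputation quality filter
    )

# Hypothetical rows for illustration only.
ref_ids = {"var1", "var2"}
common = {"name": "var1", "MAF": "0.25", "Rsq": "0.95"}
rare = {"name": "var2", "MAF": "0.005", "Rsq": "0.95"}
poorly_imputed = {"name": "var1", "MAF": "0.25", "Rsq": "0.20"}
not_in_ref = {"name": "var3", "MAF": "0.25", "Rsq": "0.95"}
kept = [passes_filters(r, ref_ids) for r in (common, rare, poorly_imputed, not_in_ref)]
```
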
#### MAF >= 0.01 in Study

In [None]:
## EC2 console ##

# Keep variants with MAF >= 0.01 in the study subjects (i.e., drop MAF < 0.01)
for ((chr=1; chr<23; chr++));do
    echo "Processing chr${chr}_${ancestry}"
    head -n 1 ${baseDir}/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.stats.converted > \
        ${baseDir}/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_subject.stats

    # note column 7 corresponds to the MAF column
    awk ' NR>=2 {if ($7 >= 0.01) {print $0}}' \
        ${baseDir}/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.stats.converted \
        >> ${baseDir}/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_subject.stats
done 

#### MAF >= 0.01 in 1000G

In [None]:
if [ $ancestry == "aa" ]; then
    group=afr
elif [ $ancestry == "ea" ]; then
    group=eur
fi

# create a list of SNP IDs based on the 1000G population
# - keep the variants whose allele frequency in column 9 is >= 1%
for chr in {1..22};do
    echo "Processing chr$chr"
   awk ' { if ($9 >= 0.01) {print $1}}' <(zcat /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr$chr.legend.gz) >\
    /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr$chr.legend.unique_ids.maf_gt_0.01_${group}
done 

## chrX
#chr=23
#awk ' { if ($9 >= 0.01) {print $1}}' <(zcat /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz) >\
#    /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr$chr.legend.unique_ids.maf_gt_0.01_${group}
#
#
for chr in {1..22}; do
    idList=/shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr$chr.legend.unique_ids.maf_gt_0.01_${group} 
    echo "Processing chr${chr}_${ancestry}"      
    /shared/bioinformatics/software/perl/utilities/extract_rows.pl \
        --source ${baseDir}/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_subject.stats \
        --id_list $idList \
        --out ${baseDir}/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_subject+${group}.stats \
        --header 1 \
        --id_column 0 
done 

#### RSQ > 0.30

In [None]:
## Filter by R^2 Autosomes
for chr in {1..22}; do
    echo -e "${ancestry} chr${chr}..."
        awk '{ if ($9 > 0.3){ print $0 } }' \
            $procD/processing/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_subject+eur.stats >\
             $procD/processing/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_subject+eur.rsq_gt_0.30.stats
done

mv $procD/processing/chr*/$study.$ancestry.1000G_p3.chr*.$MODEL.maf_gt_0.01_subject+eur.rsq_gt_0.30.stats \
    $procD/final/


## Filter by R^2 chrX
# chrX note that imputed data are split up by males & females 
# perform the filtering on both male and female data and then merge results
#chr=23
#echo -e "${ancestry} chr${chr}..."
#tail -n +2 $chrxD/chrX.no.auto_male.info| \
#    awk '{ if( $7 > 0.3){ print $1":"$2":"$3 } }' \
#    > $procD/processing/chr$chr/$study.chr${chr}_variants_rsq_gt_0.3.keep.tmp
#
#tail -n +2 $chrxD/chrX.no.auto_female.info| \
#    awk '{ if( $7 > 0.3){ print $1":"$2":"$3 } }' \
#    >> $procD/processing/chr$chr/$study.chr${chr}_variants_rsq_gt_0.3.keep.tmp
#
### keep only SNPs that passed filters for both males and females
#sort $procD/processing/chr$chr/$study.chr${chr}_variants_rsq_gt_0.3.keep.tmp |\
#    uniq -d > $procD/processing/chr$chr/$study.chr${chr}_variants_rsq_gt_0.3.keep &
#
## change X to 23
#awk -F":" '$1=23 {print $0}' $procD/processing/chr23/$study.chr${chr}_variants_rsq_gt_0.3.keep >\
#    $procD/processing/chr23/$study.chr${chr}_variants_rsq_gt_0.3.keep.split
#
## filter results
#awk ' NR==FNR { map[$1":"$2":"$3":"$4] = 1; next}
#    FNR==1 {print $0}
#    FNR>=2 {if  (map[$2":"$3":"$4":"$5] == 1)
#    { print $0} }' *keep.split \
#    vidus.ea.1000G_p3.chr23.$MODEL.maf_gt_0.01_subject+eur.stats >\
#    $procD/processing/chr$chr/$study.$ancestry.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_subject+eur.rsq_gt_0.30.stats

### Plot results

In [None]:
#model="2df"
group=eur
if [ $model == "1df" ]; then
    pcol=10
else
    pcol=13
fi

outfile=$procD/processing/${study}.${ancestry}.$model.1000G_p3.hiv_acq.maf_gt_0.01.rsq_gt_0.3.assoc.table
echo -e "VARIANT_ID\tCHR\tPOSITION\tP\tTYPE" > $outfile
for (( chr=1; chr<23; chr++ )); do
infile=$procD/final/$study.ea.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_subject+${group}.rsq_gt_0.30.stats
echo Processing $infile
tail -n +2 $infile |
  perl -slne '/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:\s+\S+){13}\s+(\S+)/;
              if (($4 eq "A" || $4 eq "C" || $4 eq "G" || $4 eq "T") && ($5 eq "A" || $5 eq "C" || $5 eq "G" || $5 eq "T")) {
                print join("\t",$1,$2,$3,$6,"snp");
              } else {
                print join("\t",$1,$2,$3,$6,"indel");
              }' -- -pcol >> $outfile
done &


# Make Q-Q and manhattan plots
sh /shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name gwas_plots_${ancestry} \
    --script_prefix $procD/final/${ancestry}.$model.1000G.hiv_acq.maf_gt_0.01.rsq_gt_0.3.assoc.plot \
    --mem 20 \
    --nslots 7 \
    --priority 0 \
    --program Rscript /shared/bioinformatics/software/R/generate_gwas_plots.R \
        --in $procD/processing/$study.$ancestry.$model.1000G_p3.hiv_acq.maf_gt_0.01.rsq_gt_0.3.assoc.table \
        --in_chromosomes autosomal_nonPAR \
        --in_header \
        --out $procD/final/$study.${ancestry}.$model.1000G.hiv_acq.maf_gt_0.01.rsq_gt_0.3.assoc.plot.all_chr \
        --col_id VARIANT_ID \
        --col_chromosome CHR \
        --col_position POSITION \
        --col_p P \
        --col_variant_type TYPE \
        --generate_snp_indel_manhattan_plot \
        --manhattan_odd_chr_color red3 \
        --manhattan_even_chr_color dodgerblue3 \
        --manhattan_points_cex 1.5 \
        --generate_snp_indel_qq_plot \
        --qq_lines \
        --qq_points_bg black \
        --qq_lambda


####   SNP lookup (EA)
Eric O. Johnson requested a lookup of rs4878712, a SNP reported in a prior paper whose G allele reduces the risk of HIV. This SNP is at position 37654257 on chromosome 9 in the GRCh37 build.

ProbABEL results:

```
name	chrom	position	A1	A2	Freq1	MAF	Quality	Rsq	n	Mean_predictor_allele	beta_SNP_addA1	sebeta_SNP_addA1	beta_SNP_age	sebeta_SNP_age	cov_SNP_int_SNP_age	chi2_SNP_add	chi	p	or_95_percent_ci
rs4878712:37654257:G:A	9	37654257	G	A	0.43695	0.43695	0.98504	0.94857	2040	0.437196	0.272018	0.376019	-0.00705293	0.0101744	-0.00371112	0.523723	0.523330089935723	0.469424607010255	1.31(0.63-2.74)
```
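
The `chi`, `p`, and `or_95_percent_ci` values in this row are consistent with a 1 df Wald test on the SNP term, as the sketch below shows using `beta_SNP_addA1` and `sebeta_SNP_addA1`. Whether the stats script computes them exactly this way is an assumption; the function name and the 1.96 critical value are illustrative choices.

```python
import math

def wald_1df(beta, se, z_crit=1.96):
    """Return (chi-square, p-value, 'OR(lower-upper)') for a 1 df Wald test."""
    chi = (beta / se) ** 2
    # upper tail of a chi-square distribution with 1 df
    p = math.erfc(math.sqrt(chi / 2.0))
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z_crit * se)
    upper = math.exp(beta + z_crit * se)
    ci = "{:.2f}({:.2f}-{:.2f})".format(odds_ratio, lower, upper)
    return chi, p, ci

# beta_SNP_addA1 and sebeta_SNP_addA1 from the lookup row
chi, p, ci = wald_1df(0.272018, 0.376019)
```

With these inputs, `chi` is about 0.5233, `p` is about 0.469, and `ci` is `1.31(0.63-2.74)`, matching the reported columns.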

### View plots
#### EA
![uhs123.ea.2df.1000G.hiv_acq.maf_gt_0.01.rsq_gt_0.3.assoc.plot.all_chr.snps+indels.manhattan.png](attachment:uhs123.ea.2df.1000G.hiv_acq.maf_gt_0.01.rsq_gt_0.3.assoc.plot.all_chr.snps+indels.manhattan.png)
![uhs123.ea.2df.1000G.hiv_acq.maf_gt_0.01.rsq_gt_0.3.assoc.plot.all_chr.snps+indels.qq.png](attachment:uhs123.ea.2df.1000G.hiv_acq.maf_gt_0.01.rsq_gt_0.3.assoc.plot.all_chr.snps+indels.qq.png)

### P-value Filter

In [None]:
group=eur
if [ $model == "1df" ]; then
    pcol=10
else
    pcol=13
fi
outFile=$baseDir/$study.$ancestry.1000G_p3.$MODEL.maf_gt_0.01_subject+eur.rsq_gt_0.30.p_lte_0.001
head -n 1 $procD/final/$study.$ancestry.1000G_p3.chr6.$MODEL.maf_gt_0.01_subject+eur.rsq_gt_0.30.stats > $outFile
for (( chr=1; chr<23; chr++ )); do
    echo Processing $procD/final/$study.$ancestry.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_subject+eur.rsq_gt_0.30.stats
    tail -n +2 $procD/final/$study.$ancestry.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_subject+eur.rsq_gt_0.30.stats |\
    perl -lane 'if ($F[18] <= 0.001) { print; }' >>  $outFile
done


In [None]:
## Sort in R ##
R
finalD <- "/shared/jmarks/hiv/uhs123/gwas/ea/2df/002/processing/"
cohort="uhs123"
pfile <- "uhs123.ea.1000G_p3.HIV_ACQ~SNP+SNP*SEX+AGE+SEX+PC1+PC4+PC8+PC10.maf_gt_0.01_subject+eur.rsq_gt_0.30.p_lte_0.001"

for (ancestry in c("ea")){
    if (ancestry == "aa") { group = "afr" } else if (ancestry == "ea") { group = "eur" }
        dat=read.table(paste0(finalD, pfile), header = TRUE)
        dat <- dat[order(dat$p),]
        write.csv(dat,
                  file=paste0(finalD, pfile, ".csv"), row.names = FALSE, quote=F)
    }
## END Filter by p-value ###

# Upload results to S3