# UHS1 & McLaren GWAS Meta-Analysis

We will combine the UHS1 and McLaren samples of European descent in a GWAS meta-analysis. 

## Data Descriptions and locations

**UHS1**: 
* Novel Genetic Locus Implicated for HIV-1 Acquisition with Putative Regulatory Links to HIV Replication and Infectivity: A Genome-Wide Association Study ([PMID: 26023777](https://pubmed.ncbi.nlm.nih.gov/26023777/))
* s3://rti-hiv/gwas/uhs1/results/acquisition/0001/eur/split_by_chromosome/ 
* We will need to combine the chromosome-specific results.
* Determine what filters have been applied, if any.

**McLaren**: 
* Association study of common genetic variants and HIV-1 acquisition in 6,300 infected cases and 7,200 controls [(PMID: 23935489)](https://pubmed.ncbi.nlm.nih.gov/23935489/)
* s3://rti-hiv/gwas/mclaren/processed/hiv_acquisition/final_stats/hg19/
* We will need to combine the chromosome-specific results.
* We have the original unfiltered results, as well as filtered results with the following steps applied:
1. MAF > 1%
2. Imputation quality (INFO) > 0.30
3. Keep SNPs only if available in all cohorts
4. Reformat indels to align with how our in-house results are coded
5. Convert to 1000 Genomes phase 3
6. Convert odds ratio (OR) to beta coefficient (renamed beta_SNP_addA1)


## UHS1
Need to determine the position for each variant. Use the variants at `s3://rti-common/variants/b153/GRCh37.p13/variants_chr*.tsv.gz`. I may need to write a custom script.

Also need to filter by MAF and RSQ.

In [None]:
# download chromosome chunks
cd ~/rti-hiv/gwas/uhs1/results/acquisition/0001/eur/split_by_chromosome/

aws s3 sync s3://rti-hiv/gwas/uhs1/results/acquisition/0001/eur/split_by_chromosome/ .

zcat imputed.hiv_cauc.chr1.GWASHIV_10evs_class_syear_rec_aage_gwassex.palogist.csv.gz | head
#"name","A1","A2","Freq1","MAF","Quality","Rsq","n","Mean_predictor_allele","beta_SNP_add","sebeta_SNP_add","loglik","chi","p","or_95_percent_ci"
#"rs58108140","G","A",0.872,0.128,0,0.432,1132,0.871962,-0.0140979,0.209907,-674.692,0.00451081491670313,0.946452258281973,"0.99 (0.65-1.49)"


### Add variant positions
Our summary stats did not have positions, so we will extract the variant positions from GRCh37 reference files. 

`s3://rti-common/variants/b153/GRCh37.p13/variants_chr{1..22}.tsv.gz`

In [None]:
cd /home/ec2-user/rti-hiv/gwas/uhs1/results/acquisition/0001/eur/

# copy the reference variant files
for chr in {1..22}; do
    aws s3 cp s3://rti-common/variants/b153/GRCh37.p13/variants_chr${chr}.tsv.gz .
done

In [None]:
import gzip

for chrom in range(1, 23):
    infile = "variants_chr{}.tsv.gz".format(chrom)
    gwas_file = "split_by_chromosome/chr{}/imputed.hiv_cauc.chr{}.GWASHIV_10evs_class_syear_rec_aage_gwassex.palogist.csv.gz".format(chrom, chrom)
    out_file = "split_by_chromosome/chr{}/imputed.hiv_cauc.chr{}.GWASHIV_10evs_class_syear_rec_aage_gwassex.palogist.positions.txt".format(chrom, chrom)


    # open both gzipped inputs in text mode so the str operations below work in Python 3
    with gzip.open(infile, "rt") as inF, gzip.open(gwas_file, "rt") as gwasF, open(out_file, 'w') as outF:
        # prepare header
        gwas_head = gwasF.readline()
        gwas_head = gwas_head.replace('"', '') # remove quotations
        gwas_head = gwas_head.strip().split(",")
        # add pos and chr to front of header
        gwas_head.insert(0, "pos")
        gwas_head.insert(0, "chr")
        # create a SE of OR column then compare with Linran's formula
        
        #gwas_head[-1] = "odds_ratio" # remove confidence int portion
        
        gwas_head = " ".join(gwas_head) + "\n"
        outF.write(gwas_head)
        print(gwas_head)

        head = inF.readline()
        line = inF.readline()

        snp_dic = {}
        #for _ in range(100000):
        # our sumstats don't have the variant positions 
        # so make a dictionary of the reference GRCh37 variants so we can extract the position
        while line: 
            sl = line.strip().split("\t")
            markername = sl[0]
            rsid = markername.split(":")[0]
            pos = markername.split(":")[1]

            snp_dic[rsid] = [markername, pos]
            line = inF.readline()
        #print(dict(list(snp_dic.items())[0:20]))

        line = gwasF.readline()
        #for _ in range(10000):
        # now run through our sumstats, and if the rsid was in the reference
        # dictionary then extract the variant position
        while line:
            line = line.strip()
            line = line.replace('"', '')
            sl = line.split(",")
            rsid = sl[0]
            if rsid in snp_dic:
                odds_ratio = sl[-1].split(" ")[0] # remove the confidence interval portion
                sl[-1] = odds_ratio

                a1 = sl[1]
                a2 = sl[2]
                position = snp_dic[rsid][1]
                sl.insert(0, position)
                sl.insert(0, str(chrom))
                outline = " ".join(sl) + "\n"
                outF.write(outline) 

                line = gwasF.readline()
            else:
                line = gwasF.readline()


### Apply filters & combine chromosomes
Combine the split chromosome results and apply the filters RSQ >= 0.30 and  MAF >= 0.01. These are the same filters that were applied to the McLaren results.
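The same thresholds can also be applied by column name rather than by column position, which guards against the column order shifting. The following is a minimal sketch, assuming a space-delimited file whose header contains `MAF` and `Rsq` as in the UHS1 positions files. The function name `filter_sumstats` and the example rows (including their positions and the two placeholder variant IDs) are hypothetical and only illustrate the logic.

```python
import csv
import io

def filter_sumstats(lines, min_maf=0.01, min_rsq=0.30):
    """Yield the header, then every row with MAF >= min_maf and Rsq >= min_rsq.

    `lines` is any iterable of space-delimited text lines whose header
    includes columns named MAF and Rsq.
    """
    reader = csv.reader(lines, delimiter=" ")
    header = next(reader)
    maf_idx = header.index("MAF")  # look columns up by name, not position
    rsq_idx = header.index("Rsq")
    yield header
    for row in reader:
        if float(row[maf_idx]) >= min_maf and float(row[rsq_idx]) >= min_rsq:
            yield row

# Illustrative rows only; positions and the "example_*" IDs are made up.
example = io.StringIO(
    "chr pos name A1 A2 Freq1 MAF Quality Rsq\n"
    "1 1000 rs58108140 G A 0.872 0.128 0 0.432\n"
    "1 2000 example_low_maf C T 0.995 0.005 0 0.900\n"
    "1 3000 example_low_rsq C T 0.700 0.300 0 0.100\n"
)
kept = list(filter_sumstats(example))
for row in kept:
    print(" ".join(row))
```

Only the header and the first variant pass both thresholds in this toy input.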

In [None]:
# combine results
head -1 chr1/imputed.hiv_cauc.chr1.GWASHIV_10evs_class_syear_rec_aage_gwassex.palogist.positions.txt > \
    uhs1_ea_hiv_acquisition_all_chr_stats.txt
for chr in {1..22}; do
    tail -n +2 chr$chr/imputed.hiv_cauc.chr$chr.GWASHIV_10evs_class_syear_rec_aage_gwassex.palogist.positions.txt >> \
        uhs1_ea_hiv_acquisition_all_chr_stats.txt
done


# apply rsq and maf filters ($7 = MAF, $9 = Rsq); always keep the header line
awk 'NR == 1 || ($7 >= 0.01 && $9 >= 0.3)' uhs1_ea_hiv_acquisition_all_chr_stats.txt \
    > uhs1_ea_hiv_acquisition_all_chr_stats_rsq0.30_maf0.01.txt

# zip and upload to s3
gzip uhs1_ea_hiv_acquisition_all_chr_stats_rsq0.30_maf0.01.txt
aws s3 cp uhs1_ea_hiv_acquisition_all_chr_stats_rsq0.30_maf0.01.txt.gz \
    s3://rti-hiv/gwas/uhs1/results/acquisition/005/eur/merge_chromosomes/uhs1_ea_hiv_acquisition_all_chr_stats_rsq0.30_maf0.01.txt.gz

## McLaren
Combine the chunked results, perform the genomic coordinate liftover (hg19), and add beta and SE(beta) columns.

In [None]:
cd /shared/jmarks/hiv/meta/data/mclaren
aws s3 cp s3://rti-hiv/gwas/mclaren/original/icgh_aquisition_results.tar .
## extract files
tar -xf icgh_aquisition_results.tar
rm *tar

## copy head
zcat dan_chr22_015_018.assoc.dosage.meta.ngt.metadaner.gz | head -1 > head.txt

# combine files
for chr in {1..22}; do
    cat head.txt > dan_chr${chr}_assoc_dosage_meta_ngt_metadaner.txt
    for file in dan_chr${chr}_*.assoc.dosage.meta.ngt.metadaner.gz; do
        zcat "$file" | tail -n +2 >> dan_chr${chr}_assoc_dosage_meta_ngt_metadaner.txt
    done &
done
wait # let the background jobs finish before deleting their inputs

# remove chunks
rm *gz

# combine files to one file
#dan_chr_all_assoc_dosage_meta_ngt_metadaner_hg19.txt.gz

In [None]:
# perform liftover

In [None]:
### R 
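# beta is the natural log of the odds ratio
mclaren$BETA <- log(mclaren$OR)
# |Z| recovered from the two-sided p-value via the 1-df chi-square quantile
mclaren$Z <- sqrt(qchisq(mclaren$P, 1, lower.tail = FALSE))
# standard error of beta implied by beta and |Z|
mclaren$SE_BETA <- abs(mclaren$BETA / mclaren$Z)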
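
The conversion in the R cell above can be checked on a single variant. This is a minimal Python sketch of the same arithmetic, assuming a two-sided p-value. The function name `or_p_to_beta_se` and the example OR and p-value are hypothetical. It uses the standard normal quantile, which gives the same |Z| as the square root of the 1-df chi-square quantile.

```python
import math
from statistics import NormalDist

def or_p_to_beta_se(odds_ratio, p_value):
    """Return (beta, se) implied by an odds ratio and a two-sided p-value."""
    beta = math.log(odds_ratio)
    # |Z| for a two-sided p-value; same value as sqrt(qchisq(p, 1, lower.tail = FALSE))
    z = -NormalDist().inv_cdf(p_value / 2.0)
    se = abs(beta / z)
    return beta, se

# Hypothetical variant: OR = 1.20, P = 0.05
beta, se = or_p_to_beta_se(1.20, 0.05)
print(beta, se, beta / se)
```

Two edge cases are not handled here: a p-value of exactly 1 gives Z = 0 and an undefined SE, and a p-value of 0 is outside the domain of `inv_cdf`. Rows like these would need to be dropped or handled separately.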

In [None]:
import gzip

for chrom in range(1, 23):  # chromosomes 1-22 (range excludes the upper bound)
    infile = "mclaren.hiv_acq.hg19.chr{}.stats.gz".format(chrom)
    outfile = "mclaren.hiv_acquisition.hg19.chr{}.rsid_only.stats.tsv".format(chrom)

    with gzip.open(infile, 'rt') as inF, open(outfile, 'w') as outF:
        head = inF.readline()
        outF.write(head)
        print(head)
        line = inF.readline()
        while line:
            sl = line.split()
            markername = sl[0].split(":")
            rsid = markername[0]
            if rsid.startswith("rs"):  # keep only variants with an rsID
                sl[0] = rsid
                outline = "\t".join(sl) + "\n"
                outF.write(outline)
            line = inF.readline()

In [None]:
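# combine chromosomes to one file (header taken from chr1)
head -1 mclaren.hiv_acquisition.hg19.chr1.rsid_only.stats.tsv > \
    mclaren_ea_hiv_acquisition_hg39_all_chr_rsq0.30_maf_0.01_rsid_only.tsv
for chr in {1..22}; do 
    tail -n +2 mclaren.hiv_acquisition.hg19.chr$chr.rsid_only.stats.tsv >> \
        mclaren_ea_hiv_acquisition_hg39_all_chr_rsq0.30_maf_0.01_rsid_only.tsv
done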

gzip mclaren_ea_hiv_acquisition_hg39_all_chr_rsq0.30_maf_0.01_rsid_only.tsv

# upload to S3
aws s3 cp mclaren_ea_hiv_acquisition_hg39_all_chr_rsq0.30_maf_0.01_rsid_only.tsv.gz \
    s3://rti-hiv/gwas/mclaren/processed/hiv_acquisition/final_stats/hg19/merge_chromosomes/mclaren_ea_hiv_acquisition_hg39_all_chr_rsq0.30_maf_0.01_rsid_only.tsv.gz

# Meta analysis 
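
For orientation, here is a minimal sketch of a fixed-effect, inverse-variance-weighted meta-analysis of one variant across two studies. METAL supports this scheme (`SCHEME STDERR`) as well as sample-size weighting. Which scheme the workflow actually uses is set in the config file and is not shown in this notebook, so this is an illustration rather than a description of the run. The function name `ivw_meta` and the input values are hypothetical, and the betas are assumed to be already aligned to the same effect allele.

```python
import math
from statistics import NormalDist

def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted estimate across studies.

    Returns (beta, se, z, two_sided_p). Assumes all betas refer to the
    same effect allele.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = 2.0 * NormalDist().cdf(-abs(z))
    return beta, se, z, p

# Hypothetical per-study estimates for one variant (e.g. UHS1 and McLaren)
beta, se, z, p = ivw_meta([0.10, 0.20], [0.05, 0.10])
print(beta, se, z, p)
```

The study with the smaller standard error receives the larger weight, so the pooled beta sits closer to its estimate.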

In [None]:
cd /home/ec2-user/rti-hiv/gwas_meta/hiv_acquisition/0034/

# edit variables
project="rti-hiv"
phen=acquisition
study=uhs1_mclaren
ancestry=eur
version=0034

## Cloning this repo and biocloud_wdl_tools submodule together
cd /shared/
git clone --recurse-submodules https://github.com/RTIInternational/biocloud_gwas_workflows.git

# pull for any updates
cd /shared/biocloud_gwas_workflows
git pull
git submodule update --init --recursive

In [None]:
# Create wf config file for afr and eur
cd /home/ec2-user/rti-hiv/gwas_meta/hiv_acquisition/0034/biocloud_gwas_workflows/meta_analysis/metal/gwas/config_inputs/

# Modify config settings manually with vim

# save commit 
git rev-parse HEAD > git_hash.txt  
cd /home/ec2-user/rti-hiv/gwas_meta/hiv_acquisition/0034/

# Zip biocloud_gwas_workflows repo
zip \
    --exclude=*/var/* \
    --exclude=*.git/* \
    --exclude=*/test/* \
    --exclude=*/.idea/* \
    -r biocloud_gwas_workflows/meta_analysis/metal/gwas/biocloud_gwas_workflows.zip \
    biocloud_gwas_workflows/

cd /shared/${project}/gwas/${study}/results/${phen}/${version}/${ancestry}/

# note: open the listening port to the large Cromwell server in another window
# ssh -i ~/.ssh/gwas_rsa -L localhost:8000:localhost:8000 ec2-user@35.153.41.169

curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/home/ec2-user/rti-hiv/gwas_meta/hiv_acquisition/0034/biocloud_gwas_workflows/meta_analysis/metal/gwas/full_gwas_meta.wdl" \
    -F "workflowInputs=@/home/ec2-user/rti-hiv/gwas_meta/hiv_acquisition/0034/biocloud_gwas_workflows/meta_analysis/metal/gwas/config_templates/inputs.json" \
    -F "workflowDependencies=@/home/ec2-user/rti-hiv/gwas_meta/hiv_acquisition/0034/biocloud_gwas_workflows/meta_analysis/metal/gwas/biocloud_gwas_workflows.zip" \
    -F "workflowOptions=@/home/ec2-user/bin/cromwell/hiv_gnetii_charge_code.json" > job_ids.txt

cat job_ids.txt

job=351f50af-4f3f-40a5-9342-f654ce817f9e
curl -X GET "http://localhost:8000/api/workflows/v1/${job}/status"
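
Copying the job ID by hand can be avoided by reading it from the saved submission response. This is a minimal sketch, assuming the response saved in `job_ids.txt` is a JSON object with an `id` field. The function name `status_url` is hypothetical, and the example response string is illustrative, not real server output.

```python
import json

def status_url(submit_response, host="http://localhost:8000"):
    """Build the workflow status URL from a Cromwell submission response."""
    job_id = json.loads(submit_response)["id"]
    return "{}/api/workflows/v1/{}/status".format(host, job_id)

# Illustrative response text (assumed shape); in practice read job_ids.txt instead
example_response = '{"id": "351f50af-4f3f-40a5-9342-f654ce817f9e", "status": "Submitted"}'
url = status_url(example_response)
print(url)
```

The resulting URL can be passed to `curl -X GET` in place of the hard-coded `${job}` value.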

In [9]:
freq_a <- c(0.999, 0.998, 0.0107)
freq_u <- c(0.999, 0.998, 0.0118)

maf2 <- (freq_a*3*2 + freq_u*3*2)/(2*3+2*3)
maf <- (freq_a*3 + freq_u*3)/(3+3)
maf2
maf
not_minors <- which(maf > 0.5)
not_minors
maf[not_minors] <- 1 - maf[not_minors]
maf
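
The R cell above pools case and control allele frequencies and then folds values above 0.5 to obtain the minor allele frequency. Here is a minimal Python sketch of the same calculation with sample-size weights. The function name `pooled_maf` is hypothetical, and the counts 6334 and 7247 are taken from the `FRQ_A_6334` and `FRQ_U_7247` column names.

```python
def pooled_maf(freq_cases, freq_controls, n_cases, n_controls):
    """Sample-size-weighted allele frequency, folded to the minor allele."""
    pooled = (freq_cases * n_cases + freq_controls * n_controls) / (n_cases + n_controls)
    # fold frequencies above 0.5 so the result is always the minor allele frequency
    return min(pooled, 1.0 - pooled)

# Same toy frequencies as the R cell, weighted by the McLaren case/control counts
for freq_a, freq_u in [(0.999, 0.999), (0.998, 0.998), (0.0107, 0.0118)]:
    print(pooled_maf(freq_a, freq_u, 6334, 7247))
```

With equal group sizes this reduces to the simple average used in the toy example, and the factor of 2 for allele counts cancels between numerator and denominator.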

In [4]:
maf[which(maf > 0.1 & maf < 0.999)]

# sample-size-weighted allele frequency across cases (A) and controls (U)
(data$FRQ_A_6334 * 6334 + data$FRQ_U_7247 * 7247) / (6334 + 7247)

In [None]:
# keep variants with MAF > 1% and INFO > 0.30
mclaren_filtered <- mclaren[which(mclaren$MAF > 0.01 & mclaren$INFO > 0.30), ]
