# HIV Acquisition SNP Lookup with McLaren Results in Meta-Analyses 
**Author:** Jesse Marks
**GitHub:** [issue #97](https://github.com/RTIInternational/bioinformatics/issues/97)

This notebook performs two SNP lookups, referred to as 001 and 002. For SNP lookup 001, we search the McLaren results for SNPs with a p-value <= 5e-6 and create a list of these SNPs. We then search for the SNPs on this list in our HIV acquisition meta-analysis results: cross-ancestry (015) and EA-specific (016).

SNP lookup 002 is the reverse and consists of two lookups: first we filter the 015 results to SNPs with a p-value <= 5e-6 and search for those SNPs in the McLaren results; then we do the same for the 016 results.

The McLaren results are on the share drive at: <br> 
`//RTPNFIL02/eojohnson/HIV/1.HIV GWAS II/technical/McLarenGWASacquisition/dan.assoc.dosage.meta.ngt.metadaner.ALL_CHR.FUMA_no_SE.gz`.

Our HIV acquisition meta-analyses results are on S3 at:
* `s3://rti-hiv/meta_new/015` (cross-ancestry)
* `s3://rti-hiv/meta_new/016` (EA-specific)


## Download Data
Create the directory structure and download the meta-analyses results to EC2. Also, create a list of SNPs for the lookup from the McLaren results. The list should have two columns, a markername column and a chr column (no headers).

```
rs1:23:T:A  10
rs2:24:G:A  10
rs9:83:T:C  10
...
rs56:183:G:T  22
```
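
As a minimal sketch of how such a list can be built in Python, the function below filters records by p-value and formats the two columns. It assumes the McLaren file is whitespace-delimited with columns in the order CHR, BP, SNP, P, A1, A2 (inferred from the `awk` command in the next cell, not confirmed against the file header). The name `build_snp_list` and the toy records are hypothetical.

```python
def build_snp_list(lines, p_max=5e-6):
    """Return 'markername<TAB>chr' rows for records with P <= p_max.

    Assumes whitespace-delimited columns CHR, BP, SNP, P, A1, A2
    (an assumption about the file layout) and a header on the first line.
    """
    rows = []
    for i, line in enumerate(lines):
        fields = line.split()
        if i == 0 or len(fields) < 6:
            continue  # skip the header and malformed lines
        chrom, pos, snp, pval, a1, a2 = fields[:6]
        try:
            p = float(pval)
        except ValueError:
            continue  # non-numeric p-value such as "NA"
        if p <= p_max:
            rows.append("{}:{}:{}:{}\t{}".format(snp, pos, a1, a2, chrom))
    return rows


# Toy records, not real results.
example = [
    "CHR BP SNP P A1 A2",
    "10 23 rs1 1e-7 T A",
    "10 24 rs2 0.5 G A",
    "22 183 rs56 5e-6 G T",
]
for row in build_snp_list(example):
    print(row)
```

Parsing the p-value with `float` makes the comparison explicitly numeric and skips values that are not numbers.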

In [None]:
## EC2 ##

# Create directory 
mkdir -p /shared/jmarks/hiv/meta/lookup/20190702/{ea,cross}/{001,002}
cd /shared/jmarks/hiv/meta/lookup/20190702

zcat dan.assoc.dosage.meta.ngt.metadaner.ALL_CHR.FUMA_no_SE.gz  |\
    awk '$4 <= 0.000005' > dan.assoc.dosage.meta.ngt.metadaner.p_le_5e-6.FUMA 

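# name the filtered file explicitly rather than relying on the `*A` glob
awk '{print $3":"$2":"$5":"$6"\t"$1}' dan.assoc.dosage.meta.ngt.metadaner.p_le_5e-6.FUMA > combined_snp_list_1kg.tsv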

``` 
wc -l combined_snp_list_1kg.tsv
560 combined_snp_list_1kg.tsv

head combined_snp_list_1kg.tsv
rs115440143:34513535:T:C        1
rs60675294:47169845:C:G 1
rs1075982:47108365:A:G  1
rs2304745:47154272:T:C  1
rs2218189:47176818:T:C  1
rs1440489:47126392:T:C  1
rs11804665:47104708:T:C 1
1_47148686:47148686:I2:D        1
rs1048380:47142538:A:G  1
rs3766214:47181300:T:C  1
```

# SNP lookup 001
I need to create a dictionary of the McLaren SNPs (named `gscan_dict` in the script below) and then check whether each SNP in the meta-analysis is in the dictionary. This makes more sense than the reverse, namely creating a dictionary of every SNP in the meta-analysis and then checking whether the McLaren SNPs are in it. The latter strategy would require a large amount of memory for the Python dictionary, so the former is the better choice when considering computational expense.

Also note that some SNPs in the meta-analysis have the format `chr:position:a1:a2` instead of `rsid:position:a1:a2`. I think the reason is that no rsID was available for these SNPs. If a McLaren SNP is not found in the lookup, we write it to a separate file and deal with it later. We may have to convert these SNPs from `rsid:position:a1:a2` format to `chr:position:a1:a2` and then repeat the search with just these SNPs.
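
The strategy described above (hold the short lookup list in memory, stream the large meta-analysis file once, and report what was never matched) can be sketched as follows. This is an illustration with toy data, not the script used for the results; `stream_lookup` is a hypothetical helper, and the SNP ID is assumed to be the first whitespace-delimited field of each meta-analysis line.

```python
def stream_lookup(snp_ids, meta_lines):
    """Stream meta-analysis lines against a small in-memory table of IDs.

    Returns the matching lines and the lookup IDs that never matched.
    Only the lookup IDs are kept in memory; meta_lines can be any iterable,
    such as an open file handle.
    """
    counts = {snp: 0 for snp in snp_ids}
    hits = []
    for line in meta_lines:
        fields = line.split()
        if fields and fields[0] in counts:
            counts[fields[0]] += 1
            hits.append(line)
    not_found = [snp for snp, n in counts.items() if n == 0]
    return hits, not_found


# Toy data: two lookup SNPs, three meta-analysis rows.
lookup_ids = ["rs1:23:T:A", "rs9:83:T:C"]
meta_rows = [
    "rs1:23:T:A 10 0.01",
    "rs2:24:G:A 10 0.20",
    "10:83:T:C 10 0.03",
]
hits, not_found = stream_lookup(lookup_ids, meta_rows)
print(hits)
print(not_found)
```

In the toy data, `rs9:83:T:C` goes unmatched because the meta-analysis row uses the `chr:position:a1:a2` form, which is the situation described above.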

In [None]:
### Python3 ###
"""
*SNP lookup*

    Make sure the IDs are of the same format for the snp-list
    and the IDs in the meta-analysis results. e.g. 1000g_p3 or rsID only
"""
import gzip

################################################################################
date = "20190702" # date the results were generated (in results files name)
ancestry = "cross"

if ancestry=="aa":
    pop = "afr"
elif ancestry=="ea":
    pop = "eur"
else:
    pop = "cross"

## dict to hold gscan snps and the number of times they were found.
## we can tell which SNPs did not show up in any of the meta files
gscan_dict =  {}
base_dir = "/shared/jmarks/hiv/meta/lookup/20190702"
snp_list = "{}/combined_snp_list_1kg.tsv".format(base_dir)

for chrom in range(1,23):
    progress = "Processing {} {}".format(ancestry, chrom)
    #print(progress)
    out_file = "{}/{}/{}_{}_chr{}_hiv_acq_meta_analysis_1df_mclaren_lookup.txt".format(base_dir, ancestry, date, pop, chrom)
    results_files = "hiv_acquisition_1df_meta_analysis_uhs1-4_aa+uhs1-4_ea+vidus_ea+wihs1_aa+wihs1_ea+wihs1_ha+wihs2_aa.chr{}.exclude_singletons.1df.gz".format(chrom)
    meta = "/shared/jmarks/hiv/meta/results/015/final/{}".format(results_files)
    not_found = "{}/{}/{}_{}_hiv_acq_meta_analysis_mclaren_snps_not_found.txt".format(base_dir, ancestry, date, pop)
################################################################################

    # open the gzip file in text mode ("rt") so lines are str, not bytes
    with gzip.open(meta, "rt") as metF, open(snp_list) as gscanF, open(out_file, "w") as outF:
        gscan_line = gscanF.readline()
        met_head = metF.readline()
        met_line = metF.readline()

        outF.write(met_head)

        ## create a dictionary containing the gscan snps (just 1000G_p3 portion)
        if len(gscan_dict) == 0:
            while gscan_line:
                key = gscan_line.split()[0] # SNP id
                value = gscan_line.split()[1] #chromosome
                gscan_dict[key] = [0, value]
                gscan_line = gscanF.readline()

        while met_line:
            met_id = met_line.split()[0] # the 1000g_p3 ID in the meta-analysis
            if met_id in gscan_dict:
                gscan_dict[met_id][0] += 1
                outF.write(met_line)
            met_line = metF.readline()

with open(not_found, "w") as notF:
    notF.write("id\tchr\n")
    for key, value in gscan_dict.items():
        if value[0]==0:
            notF.write(key + "\t" +  value[1] + "\n")

### cross
Merge the cross-ancestry results.

In [None]:
### bash ###

## merge results
cd /shared/jmarks/hiv/meta/lookup/20190702/test/cross
head -1 20190702_cross_chr7_hiv_acq_meta_analysis_1df_mclaren_lookup.txt >\
    20190702_cross_hiv_acq_meta_analysis_1df_mclaren_lookup_merged_results.txt
    
for file in 20190702_cross_chr{1..22}_hiv_acq_meta_analysis_1df_mclaren_lookup.txt; do
    tail -n +2 $file >> 20190702_cross_hiv_acq_meta_analysis_1df_mclaren_lookup_merged_results.txt 
done 

### EA
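Merge the EA-specific results.

Note that we run the same script as above but change the variables at the top of the script to reflect the different results: set `ancestry` to `ea` (which sets `pop`), update `date` if needed, and point `results_files` and `meta` at the 016 (EA-specific) results instead of 015.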

In [None]:
### bash ###

## merge results
cd /shared/jmarks/hiv/meta/lookup/20190702/test/ea
head -1 20190702_eur_chr7_hiv_acq_meta_analysis_1df_mclaren_lookup.txt >\
    20190702_eur_hiv_acq_meta_analysis_1df_mclaren_lookup_merged_results.txt
    
for file in 20190702_eur_chr{1..22}_hiv_acq_meta_analysis_1df_mclaren_lookup.txt; do
    tail -n +2 $file >> 20190702_eur_hiv_acq_meta_analysis_1df_mclaren_lookup_merged_results.txt 
done 
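
The merge pattern used in these cells (write one header, then append the data rows of every per-chromosome file) can also be sketched in Python. This is only an illustrative equivalent of the bash loop; `merge_with_single_header` and the temporary file names are hypothetical, and the header is taken from the first file in the list rather than from the chr7 file.

```python
import os
import tempfile


def merge_with_single_header(paths, out_path):
    """Concatenate files that share a header, writing the header only once."""
    header_written = False
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as fh:
                header = fh.readline()
                if header and not header_written:
                    out.write(header)
                    header_written = True
                for line in fh:  # remaining lines are data rows
                    out.write(line)


# Toy demonstration with two temporary per-chromosome files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for chrom, row in [(1, "rs1:23:T:A\t1\n"), (2, "rs2:24:G:A\t2\n")]:
        path = os.path.join(tmp, "chr{}_lookup.txt".format(chrom))
        with open(path, "w") as fh:
            fh.write("MARKER\tCHR\n" + row)
        paths.append(path)
    merged_path = os.path.join(tmp, "merged.txt")
    merge_with_single_header(paths, merged_path)
    with open(merged_path) as fh:
        merged = fh.read()

print(merged)
```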

## Retrieve Missing SNPs
Many SNPs were not retrieved.
* 424 not found for EA
* 424 not found for cross-ancestry

This may be because the McLaren results use 1000g phase 1 SNP IDs while our HIV acquisition meta-analysis results use 1000g phase 3 IDs, so the full marker name can differ for the same variant. Therefore, we create a new script that matches SNPs on chromosome and position instead of the full `rsid:position:a1:a2` ID.
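
A minimal sketch of the position-based key, assuming marker names have the form `id:position:a1:a2` and that both result sets use the same genome build (not verified here). The helper name `position_key` is hypothetical.

```python
def position_key(marker, chrom):
    """Build a 'chr:position' key from an 'id:position:a1:a2' marker name."""
    parts = marker.split(":")
    if len(parts) < 2:
        raise ValueError("unexpected marker format: {!r}".format(marker))
    return "{}:{}".format(chrom, parts[1])


# Markers taken from the head of combined_snp_list_1kg.tsv shown earlier.
print(position_key("rs60675294:47169845:C:G", "1"))
print(position_key("1_47148686:47148686:I2:D", "1"))
```

Because the key ignores the alleles, two different variants at the same position (for example a SNP and an indel) map to the same key, so hits are worth checking against the allele columns afterwards.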

In [None]:
### Python3 ###
"""
*SNP lookup v2*

Match SNPs on a chr:position key built from the 1000g_p3 ID
(rsID:position:A1:A2) and the chromosome column, in both the
meta-analysis results and the SNP lookup list.
"""
import gzip

################################################################################
date = "20190702" # date the results were generated (in results files name)
ancestry = "cross"

if ancestry=="aa":
    pop = "afr"
elif ancestry=="ea":
    pop = "eur"
else:
    pop = "cross"

## dict to hold lookup snps and the number of times they were found.
## we can tell which SNPs did not show up in any of the meta files
lookup_dict =  {}
base_dir = "/shared/jmarks/hiv/meta/lookup/20190702"
snp_list = "{}/combined_snp_list_1kg.tsv".format(base_dir)

for chrom in range(1,23):
    progress = "Processing {} {}".format(ancestry, chrom)
    #print(progress)
    out_file = "{}/{}/001/{}_{}_chr{}_hiv_acq_meta_analysis_1df_mclaren_lookup.txt".format(base_dir, ancestry, date, pop, chrom)
    results_files = "hiv_acquisition_1df_meta_analysis_uhs1-4_aa+uhs1-4_ea+vidus_ea+wihs1_aa+wihs1_ea+wihs1_ha+wihs2_aa.chr{}.exclude_singletons.1df.gz".format(chrom)
    meta = "/shared/jmarks/hiv/meta/results/015/final/{}".format(results_files)
    not_found = "{}/{}/001/{}_{}_hiv_acq_meta_analysis_mclaren_snps_not_found.txt".format(base_dir, ancestry, date, pop)
################################################################################

    # open the gzip file in text mode ("rt") so lines are str, not bytes
    with gzip.open(meta, "rt") as metF, open(snp_list) as lookupF, open(out_file, "w") as outF:
        lookup_line = lookupF.readline()
        met_head = metF.readline()
        met_line = metF.readline()

        outF.write(met_head)

        ## create a dictionary containing the lookup snps (just 1000G_p3 portion)
        if len(lookup_dict) == 0:
            while lookup_line:
                thou_name = lookup_line.split()[0] # 1000g_p3 SNP id
                lookup_chrm = lookup_line.split()[1] #chromosome
                lookup_pos = thou_name.split(":")[1] # position
                key = "{}:{}".format(lookup_chrm,lookup_pos)
                lookup_dict[key] = [0, lookup_chrm, thou_name]
                lookup_line = lookupF.readline()

        while met_line:
            met_chr = met_line.split()[1]
            met_id = met_line.split()[0] # the 1000g_p3 ID in the meta-analysis
            met_pos = met_id.split(":")[1] # position
            met_id = "{}:{}".format(met_chr,met_pos)
            if met_id in lookup_dict:
                lookup_dict[met_id][0] += 1
                outF.write(met_line)
            met_line = metF.readline()

with open(not_found, "w") as notF:
    notF.write("chr\tid\n")
    for key, value in lookup_dict.items():
        if value[0]==0:
            notF.write("\t".join(str(v) for v in value[1:]) + "\n")

### merge EA

In [None]:
## merge results
cd /shared/jmarks/hiv/meta/lookup/20190702/ea/001
head -1 20190702_eur_chr7_hiv_acq_meta_analysis_1df_mclaren_lookup.txt >\
    20190702_eur_hiv_acq_meta_analysis_1df_mclaren_lookup_merged_results.txt
    
for file in 20190702_eur_chr{1..22}_hiv_acq_meta_analysis_1df_mclaren_lookup.txt; do
    tail -n +2 $file >> 20190702_eur_hiv_acq_meta_analysis_1df_mclaren_lookup_merged_results.txt 
done 

### merge Cross

In [None]:
## merge results
cd /shared/jmarks/hiv/meta/lookup/20190702/cross/001
head -1 20190702_cross_chr7_hiv_acq_meta_analysis_1df_mclaren_lookup.txt >\
    20190702_cross_hiv_acq_meta_analysis_1df_mclaren_lookup_merged_results.txt
    
for file in 20190702_cross_chr{1..22}_hiv_acq_meta_analysis_1df_mclaren_lookup.txt; do
    tail -n +2 $file >> 20190702_cross_hiv_acq_meta_analysis_1df_mclaren_lookup_merged_results.txt 
done 

# SNP Lookup 002
Now we perform the reverse lookups: filter the 015 results to SNPs with a p-value <= 5e-6 and search for those SNPs in the McLaren results, then do the same for the 016 results.

In [None]:
## copy results to corresponding lookup directory
cp hiv_acquisition_1df_meta_analysis_uhs1-4_aa+uhs1-4_ea+vidus_ea+wihs1_aa+wihs1_ea+wihs1_ha+wihs2_aa.exclude_singletons.1df.p_lte_0.001.csv\
    /shared/jmarks/hiv/meta/lookup/20190702/cross/002
cp hiv_acquisition_1df_meta_analysis_uhs1-4_ea+vidus_ea+wihs1_ea.exclude_singletons.1df.p_lte_0.001.csv\
    /shared/jmarks/hiv/meta/lookup/20190702/ea/002/

## filter to SNPs with pvalue <= 5e-6
cd /shared/jmarks/hiv/meta/lookup/20190702/cross/002/
head -1 hiv_acquisition_1df_meta_analysis_uhs1-4_aa+uhs1-4_ea+vidus_ea+wihs1_aa+wihs1_ea+wihs1_ha+wihs2_aa.exclude_singletons.1df.p_lte_0.001.csv >\
    015_snps_pval_le_5e-6.txt
awk -F ","  ' $8 <= 0.000005 ' hiv_acquisition_1df_meta_analysis_uhs1-4_aa+uhs1-4_ea+vidus_ea+wihs1_aa+wihs1_ea+wihs1_ha+wihs2_aa.exclude_singletons.1df.p_lte_0.001.csv \
    >> 015_snps_pval_le_5e-6.txt

cd /shared/jmarks/hiv/meta/lookup/20190702/ea/002/
head -1 hiv_acquisition_1df_meta_analysis_uhs1-4_ea+vidus_ea+wihs1_ea.exclude_singletons.1df.p_lte_0.001.csv >\
    016_snps_pval_le_5e-6.txt
awk -F ","  ' $8 <= 0.000005 ' hiv_acquisition_1df_meta_analysis_uhs1-4_ea+vidus_ea+wihs1_ea.exclude_singletons.1df.p_lte_0.001.csv \
    >> 016_snps_pval_le_5e-6.txt
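
The same p-value filter can be sketched in Python with the `csv` module. This assumes, as in the `awk` commands above, that the p-value is the eighth comma-separated column; the function name `filter_by_pvalue` and the column names in the toy header are hypothetical.

```python
import csv


def filter_by_pvalue(csv_lines, p_col=7, p_max=5e-6):
    """Keep the header plus rows whose p-value (0-based column p_col) is <= p_max."""
    reader = csv.reader(csv_lines)
    header = next(reader)
    kept = [header]
    for row in reader:
        try:
            if float(row[p_col]) <= p_max:
                kept.append(row)
        except (ValueError, IndexError):
            continue  # skip rows with a missing or non-numeric p-value
    return kept


# Toy rows, not real results.
toy = [
    "MARKER,CHR,POS,A1,A2,BETA,SE,P",
    "rs1:23:T:A,10,23,T,A,0.5,0.1,1e-7",
    "rs2:24:G:A,10,24,G,A,0.1,0.1,0.0009",
    "rs3:30:C:T,10,30,C,T,0.2,0.1,NA",
]
for row in filter_by_pvalue(toy):
    print(",".join(row))
```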

## cross

In [None]:
### python ###
"""
cd to working directory
"""
import gzip

mclaren = "/shared/jmarks/hiv/meta/lookup/20190702/dan.assoc.dosage.meta.ngt.metadaner.ALL_CHR.FUMA_no_SE.gz"
meta = "015_snps_pval_le_5e-6.txt"
out = "20190702_cross_hiv_acq_1df_meta_snp_lookup_results_version2.txt"
not_found = "20190702_cross_hiv_acq_1df_meta_snps_not_found_version2.txt"

mdic = {}

# open the gzip file in text mode ("rt") so lines are str, not bytes
with gzip.open(mclaren, "rt") as mcF, open(out, 'w') as outF, open(meta) as metF:
    met_head = metF.readline() # skip the meta-analysis header
    met_line = metF.readline()
    while met_line:
        sl = met_line.split(",")
        key = "{}:{}".format(sl[1], sl[2]) # chr:position

        mdic[key] = [0, met_line]
        met_line = metF.readline()

    mc_head = mcF.readline()
    outF.write(mc_head) # output rows come from McLaren, so use its header
    mc_line = mcF.readline()
    while mc_line:
        mc_chr = mc_line.split()[0]
        mc_pos = mc_line.split()[1]
        mc_key = "{}:{}".format(mc_chr, mc_pos)
        if mc_key in mdic:
            mdic[mc_key][0] += 1
            outF.write(mc_line)
        mc_line = mcF.readline()


with open(not_found, "w") as notF:
    for key, value in mdic.items():
        if value[0]==0:
            notF.write(value[1])

## EA
Simply change the `meta`, `out`, and `not_found` variables at the top of the script to reflect the EA lookup (e.g. `016_snps_pval_le_5e-6.txt` as the `meta` input).