# Extending Exome40 results to nicotine dependence

**Author:** Jesse Marks

We have preliminary results from a very large genome-wide study of cigarette smoking phenotypes that relate to our nicotine dependence GWAS results. The study of interest is "Meta-analysis of up to 622,409 individuals identifies 40 novel smoking behaviour associated genetic loci" by A. Mesut Erzurumluoglu et al. We want to see whether these 40 novel associations extend to nicotine dependence (FTND).

<br>

Our FTND 1df meta-analysis results are located on AWS S3 at: 
`rti-nd/META/1df/20181108/results`

## Download Data
Create the directory structure and download the meta-analysis results to EC2. Also, create a list of SNPs for the lookup from the GSCAN Excel sheet. Copy the SNPs from each of the four Excel sheets (supplemental tables) into separate files and then combine them (no duplicates).
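The combining step could also be scripted rather than done by hand. The following is a minimal sketch, assuming each Excel sheet has already been saved as a plain-text file with one SNP per line; the function name `combine_snp_lists` and the per-sheet file names in the usage comment are hypothetical.

```python
from pathlib import Path


def combine_snp_lists(sheet_files, out_file):
    """Merge per-sheet SNP lists into one file, keeping first occurrences only."""
    seen = set()
    combined = []
    for sheet_file in sheet_files:
        for line in Path(sheet_file).read_text().splitlines():
            snp = line.strip()
            if snp and snp not in seen:  # skip blank lines and repeated SNPs
                seen.add(snp)
                combined.append(snp)
    # Write one SNP per line, in the order they were first seen
    Path(out_file).write_text("".join(snp + "\n" for snp in combined))
    return combined


# Hypothetical usage (the per-sheet file names are placeholders):
# combine_snp_lists(["sheet1.txt", "sheet2.txt", "sheet3.txt", "sheet4.txt"],
#                   "combined-snp_list.tsv")
```

Keeping the first-seen order makes the combined file easy to compare against the original sheets.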

In [None]:
## EC2 ##

# Create directory structure locally
mkdir -p /shared/jmarks/nicotine/exome40/lookup/20181216

# add the exome40 novel SNPs to a snp-lookup file (1000G_p3 format)
touch combined-snp_list.tsv


Convert the SNPs to 1000G_p3 format (`rsid:position:a1:a2`).

In [6]:
import os

# Work from the local project directory that holds the lookup table
os.chdir("C:\\Users\\jmarks\\Desktop\\Projects\\Nicotine\\exome40")
with open("20181216-nicotine-ftnd-exome40-snp-lookup.txt") as inF, open("snp_list.txt", "w") as outF:
    head = inF.readline()  # skip the header line
    for line in inF:
        fields = line.split()  # rsID, chr:position, a1:a2
        rs = fields[0]
        pos = fields[1].split(":")[1]  # drop the chromosome, keep the position
        a1a2 = fields[2]
        newl = "{}:{}:{}".format(rs, pos, a1a2)
        print(newl)
        outF.write(newl + "\n")


rs141611945	1:161771868	G:A

rs141611945:161771868:G:A
rs1190736:136113464:A:C
rs462779:111695887:A:G
rs216195:2203167:G:T
rs11539157:68381264:A:C
rs12616219:104352495:A:C
rs1150691:28168033:G:A
rs2841334:128122320:A:G
rs202664:41813886:C:T
rs11895381:60053727:A:G
rs12780116:104821946:A:G
rs1514175:74991644:G:A
rs7096169:104618695:G:A
rs2292239:56482180:G:T
rs216195:2203167:G:T
rs2960306:2990499:T:G
rs4908760:8526142:A:G
rs6692219:179989584:C:G
rs11971186:126437897:G:A
rs150493199:179721072:A:T
rs3001723:44037685:A:G
rs1937455:66416939:G:A
rs72720396:91191582:G:A
rs6673752:154219177:C:G
rs2947411:614168:G:A
rs528301:45154908:A:G
rs6738833:104150891:T:C
rs13026471:137564022:T:C
rs6724928:156005991:C:T
rs13022438:162800372:G:A
rs1869244:5724531:A:G
rs35438712:85588205:T:C
rs6883351:22193967:T:C
rs6414946:87729711:C:A
rs11747772:166992708:C:T
rs9320995:98726381:G:A
rs10255516:1675621:G:A
rs10807839:3344629:G:A
rs6965740:117514840:T:G
rs11776293:27418429:T:C
rs1562612:59817068:G:A
rs385791

## SNP lookup
The latest FTND meta-analysis results, as of 20181216, are at the location `/shared/jmarks/nicotine/meta/results/{aa,ea,cross}`. They are also on AWS S3 at: `s3://rti-nd/META/1df/20181108/results/{aa,ea,cross}`. If they had not been on my EC2 instance already I would have had to download the data from S3.

I need to create a dictionary of the exome40 SNPs and then check whether each SNP in the meta-analysis is in that dictionary. The reverse strategy, building a dictionary of every SNP in the meta-analysis and then searching it for the exome40 SNPs, would require a large amount of memory, so the first strategy makes more computational sense.
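The idea can be illustrated on toy data. This is only a sketch: the function name `lookup_snps` is hypothetical, and the results lines (including the made-up ID `rs999:12345:A:G` and the second column) are invented for illustration and do not reflect the real meta-analysis file layout.

```python
import io


def lookup_snps(lookup_ids, results_lines):
    """Stream the results once; only the small lookup table is held in memory."""
    counts = {snp_id: 0 for snp_id in lookup_ids}  # one entry per lookup SNP
    hits = []
    for line in results_lines:  # one results line at a time
        fields = line.split()
        if fields and fields[0] in counts:
            counts[fields[0]] += 1
            hits.append(line.rstrip("\n"))
    missing = [snp_id for snp_id, n in counts.items() if n == 0]
    return hits, missing


# Toy results: IDs in the first column, an arbitrary value in the second
results = io.StringIO(
    "rs1190736:136113464:A:C 0.01\n"
    "rs999:12345:A:G 0.50\n"
)
hits, missing = lookup_snps(
    ["rs1190736:136113464:A:C", "rs462779:111695887:A:G"], results
)
print(hits)
print(missing)
```

Memory use scales with the number of lookup SNPs (a few dozen here), not with the number of SNPs in the meta-analysis.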

Also note that some SNPs in the meta-analysis have the format `chr:position:a1:a2` instead of `rsid:position:a1:a2`. I think the reason is that no rsID was available for these SNPs.
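One consequence is that a lookup keyed only on `rsid:position:a1:a2` would not match a SNP that appears in the meta-analysis under the `chr:position:a1:a2` form. A possible workaround, sketched here, is to build both ID forms from each line of the lookup table. The function name `candidate_ids` is hypothetical, and the sketch assumes the chromosome is written as a bare number and that the allele order is the same in both files; neither has been checked against the results.

```python
def candidate_ids(lookup_line):
    """Build both ID forms from one lookup line: 'rsID  chr:pos  a1:a2'."""
    rsid, chr_pos, alleles = lookup_line.split()
    chrom, pos = chr_pos.split(":")
    return (
        "{}:{}:{}".format(rsid, pos, alleles),   # rsid:position:a1:a2
        "{}:{}:{}".format(chrom, pos, alleles),  # chr:position:a1:a2 (assumed bare chromosome number)
    )


print(candidate_ids("rs141611945\t1:161771868\tG:A"))
```

Adding both forms as dictionary keys would let either spelling of the same SNP register as a hit.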

### AA

In [None]:
### Python3 ###
"""
*SNP lookup*

    Make sure the IDs are of the same format for the snp-list
    and the IDs in the meta-analysis results. e.g. 1000g_p3 or rsID only
"""
import gzip

################################################################################
date = "20181106" # date the results were generated (in results files name); 20181106 for AA, matching the file names in the merge step
ancestry = "aa" # one of "aa", "ea", or "cross"

if ancestry=="aa":
    pop = "afr"
elif ancestry=="ea":
    pop = "eur"
else:
    pop = "afr+eur"

## dict to hold exome40 snps and the number of times they were found.
## we can tell which SNPs did not show up in any of the meta files
exome40_dict =  {}
out_dir = "/shared/jmarks/nicotine/exome40/lookup/20181216"
snp_list = "{}/snp_list.txt".format(out_dir)

for chrom in range(1,23):
    out_file = "{}/{}/{}-{}-chr{}-ftnd_meta_analysis-wave3-1df-exome40-lookup.txt".format(out_dir, ancestry, date, pop, chrom)
    results = "{}_ftnd_meta_analysis_wave3.{}.chr{}.exclude_singletons.1df.gz".format(date, pop, chrom)
    meta = "/shared/jmarks/nicotine/meta/results/{}/{}".format(ancestry, results)
    not_found = "{}/{}/{}-{}-ftnd_meta_analysis_wave3-exome40-snps-not-found".format(out_dir, ancestry, date, pop)
################################################################################

    # open the gzipped results in text mode ("rt") so lines are str, not bytes, under Python 3
    with gzip.open(meta, "rt") as metF, open(snp_list) as exome40F, open(out_file, "w") as outF:
        exome40_line = exome40F.readline()
        met_head = metF.readline()
        met_line = metF.readline()

        outF.write(met_head)


        ## create a dictionary containing the exome40 snps
        if len(exome40_dict) == 0:
            while exome40_line:
                key = exome40_line.split()[0]
                exome40_dict[key] = 0
                exome40_line = exome40F.readline()


        while met_line:
            met_id = met_line.split()[0] # the 1000g_p3 ID in the meta-analysis

            if met_id in exome40_dict:
                exome40_dict[met_id] += 1
                outF.write(met_line)
            met_line = metF.readline()

with open(not_found, "w") as notF:
    for key, value in exome40_dict.items():
        if value==0:
            notF.write(key + "\n")

                                                                                                                                

In [None]:
### bash ###

an="aa"
## submit job
/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name exome40_${an} \
    --script_prefix $an/exome40-${an} \
    --mem 15 \
    --nslots 1 \
    --priority 0 \
    --program python extract_snps.py

## merge results
cd /shared/jmarks/nicotine/exome40/lookup/20181216/aa
head -1 20181106-afr-chr1-ftnd_meta_analysis-wave3-1df-exome40-lookup.txt >\
    20181216-afr-ftnd-1df-meta_analysis-wave3-exome40-lookup-merged-results.txt
    
for file in 20181106-afr-chr{1..22}-ftnd_meta_analysis-wave3-1df-exome40-lookup.txt; do
    tail -n +2 $file >> 20181216-afr-ftnd-1df-meta_analysis-wave3-exome40-lookup-merged-results.txt
done
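The same merge can be written in Python, which may be convenient if the lookup and merge are ever combined into one script. This is a minimal sketch; the function name `merge_chromosome_files` is hypothetical, and it assumes every per-chromosome file starts with the same header line.

```python
def merge_chromosome_files(chrom_files, merged_file):
    """Concatenate per-chromosome lookup files, writing the header only once."""
    with open(merged_file, "w") as out:
        for i, chrom_file in enumerate(chrom_files):
            with open(chrom_file) as inp:
                header = inp.readline()
                if i == 0:
                    out.write(header)  # keep the header from the first file only
                out.writelines(inp)    # copy the remaining (results) lines


# Usage mirroring the bash loop above:
# files = ["20181106-afr-chr{}-ftnd_meta_analysis-wave3-1df-exome40-lookup.txt".format(c)
#          for c in range(1, 23)]
# merge_chromosome_files(
#     files, "20181216-afr-ftnd-1df-meta_analysis-wave3-exome40-lookup-merged-results.txt")
```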

### EA
Rerun the lookup script after changing the `ancestry` and `date` variables at the top (`ancestry = "ea"`, `date = "20181108"`); `pop` is set from `ancestry` by the script.
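Editing the script for each ancestry is easy to get wrong, so one alternative is to pass the settings on the command line. The following is a sketch only: the `--ancestry` and `--date` options, the `POP_BY_ANCESTRY` table, and `parse_args` are hypothetical names that are not part of the script above, although the ancestry-to-population mapping matches its `if`/`elif`/`else` block.

```python
import argparse

# Same mapping as the if/elif/else block in the lookup script
POP_BY_ANCESTRY = {"aa": "afr", "ea": "eur", "cross": "afr+eur"}


def parse_args(argv=None):
    """Read the ancestry and date settings and derive the population label."""
    parser = argparse.ArgumentParser(description="exome40 SNP lookup settings")
    parser.add_argument("--ancestry", choices=sorted(POP_BY_ANCESTRY), required=True)
    parser.add_argument("--date", required=True,
                        help="date used in the results file names")
    args = parser.parse_args(argv)
    args.pop = POP_BY_ANCESTRY[args.ancestry]
    return args


# Example with an explicit argument list instead of the real command line
args = parse_args(["--ancestry", "ea", "--date", "20181108"])
print(args.ancestry, args.pop, args.date)
```

With this in place, the job submission could pass the ancestry directly, and the script itself would stay unchanged between runs.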

In [None]:
### bash ### 

## merge results
cd /shared/jmarks/nicotine/exome40/lookup/20181216/ea
head -1 20181108-eur-chr1-ftnd_meta_analysis-wave3-1df-exome40-lookup.txt >\
    20181216-eur-ftnd-1df-meta_analysis-wave3-exome40-lookup-merged-results.txt
    
for file in 20181108-eur-chr{1..22}-ftnd_meta_analysis-wave3-1df-exome40-lookup.txt; do
    tail -n +2 $file >> 20181216-eur-ftnd-1df-meta_analysis-wave3-exome40-lookup-merged-results.txt
done

### Cross-Ancestry
Rerun the lookup script after changing the `ancestry` and `date` variables at the top (`ancestry = "cross"`, `date = "20181108"`); `pop` is set from `ancestry` by the script.

In [None]:
### bash ### 

## merge results
cd /shared/jmarks/nicotine/exome40/lookup/20181216/cross
head -1 20181108-afr+eur-chr1-ftnd_meta_analysis-wave3-1df-exome40-lookup.txt >\
    20181216-afr+eur-ftnd-1df-meta_analysis-wave3-exome40-lookup-merged-results.txt
    
for file in 20181108-afr+eur-chr{1..22}-ftnd_meta_analysis-wave3-1df-exome40-lookup.txt; do
    tail -n +2 $file >> 20181216-afr+eur-ftnd-1df-meta_analysis-wave3-exome40-lookup-merged-results.txt
done