# Extension of GSCAN results to nicotine dependence (issue #59)

**Author:** Jesse Marks

We have preliminary results from a very large genome-wide study of cigarette smoking phenotypes that relate to our nicotine dependence GWAS results. See supplemental Tables S6-S9 here: 

`\rcdcollaboration01.rti.ns\GxG\Analysis\GSCAN\shared MS version 1\`.

The phenotypes of interest to us include: 

1) **Age of smoking initiation (AI)** - supplemental table 6

2) **Cigarettes per day (CPD)** - supplemental table 7

3) **Smoking cessation (SC)** - supplemental table 8

4) **Smoking initiation (SI)** - supplemental table 9

We're interested in seeing whether these associations extend to nicotine dependence. I use the SNP look-up script to extend Tables S6-S9 with our FTND meta-analysis results, which are on S3 at:
`s3://rti-nd/META/1df/20181108/results/{aa,cross,ea}`

<br>

**Note** that because the meta-analysis results files are quite large, the search is run on EC2 rather than locally. The meta-analysis results were already on my EC2 instance, so I did not have to download them from S3:
`/shared/jmarks/nicotine/meta/results/aa/20181106/final`
`/shared/jmarks/nicotine/meta/results/ea/20181106/final`
`/shared/jmarks/nicotine/meta/results/cross/20181106/final`

## Download Data
Create the directory structure and download the meta-analysis results to EC2. Also create a list of SNPs for the lookup from the GSCAN Excel sheet: copy the SNPs from each of the four Excel sheets (supplemental tables) into separate files, then combine them into a single list with duplicates removed.

In [None]:
## EC2 ##

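# Create the directory structure on EC2, including the per-ancestry output directories
mkdir -p /shared/jmarks/nicotine/gscan/lookup/20181212/{aa,ea,cross}/002
cd /shared/jmarks/nicotine/gscan/lookup/20181212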

# populate these files with the SNPs from respective Excel sheets
touch age_of_initiation.tsv  cig_per_day.tsv  smoking_cessation.tsv  smoking_initiation.tsv
wc -l *.tsv # note these counts include the header line
#    11 age_of_initiation
#    56 cig_per_day
#    25 smoking_cessation
#   377 smoking_initiation

# combine SNPs into one file, make sure SNPs are not listed twice
head -1 age_of_initiation.tsv > combined_snp_list.tsv
for file in age_of_initiation.tsv cig_per_day.tsv smoking_cessation.tsv smoking_initiation.tsv;do
    tail -n +2 $file >> combined_snp_list.tsv
done

# filter so that there are no duplicated SNPs (no header either)
tail -n +2 combined_snp_list.tsv | sort -u > combined_snp_list_filtered.tsv

# convert to 1000g_p3 format
awk '{print $3":"$2":"$4":"$5"\t"$1}' combined_snp_list_filtered.tsv > combined-snp_list-filtered-1000g_p3-chr.tsv

``` 
wc -l combined-snp_list-filtered-1000g_p3-chr.tsv
460 combined-snp_list-filtered-1000g_p3-chr.tsv

 head combined-snp_list-filtered-1000g_p3-chr.tsv
rs7920501:10043159:T:A  10
rs7901883:103186838:G:A 10
rs11594623:103960351:T:C        10
rs11191269:104120522:C:G        10
rs28408682:104403310:A:G        10
rs12244388:104640052:G:A        10
```
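Before running the lookup, it can help to confirm that every line of the converted list has the expected shape. The following is a minimal sketch, assuming the format shown in the sample rows above (`rsid:position:a1:a2`, a tab, then an autosome number); the pattern and the helper name `malformed_lines` are illustrative assumptions, not part of the original workflow.

```python
import re

# Expected shape of one line: "<rsid>:<position>:<a1>:<a2><TAB><chromosome>".
# The pattern is inferred from the sample rows above (an assumption, not a spec);
# alleles are assumed to be plain A/C/G/T strings and chromosomes to be 1-22.
LINE_RE = re.compile(r"^rs\d+:\d+:[ACGT]+:[ACGT]+\t(?:[1-9]|1\d|2[0-2])$")


def malformed_lines(lines):
    """Return (line_number, text) pairs for lines that do not match LINE_RE."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not LINE_RE.match(text):
            bad.append((number, text))
    return bad


if __name__ == "__main__":
    sample = [
        "rs7920501:10043159:T:A\t10\n",
        "rs7901883:103186838:G:A\t10\n",
        "10:104120522:C:G\t10\n",  # hypothetical malformed row (no rsID)
    ]
    for number, text in malformed_lines(sample):
        print(number, text)
```

To check the real file, pass an open file handle for `combined-snp_list-filtered-1000g_p3-chr.tsv` instead of `sample`. Any rows it reports (for example, ones with blank columns in the Excel export) are worth inspecting by hand before the lookup.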



## SNP lookup
The latest FTND meta-analysis results, as of 20181212, are on EC2 at `/shared/jmarks/nicotine/meta/results/{aa,ea,cross}` and on AWS S3 at `s3://rti-nd/META/1df/20181108/results/{aa,ea,cross}`. Because they were already on my EC2 instance, I did not have to download them from S3.

I create a dictionary of the GSCAN SNPs and then check whether each SNP in the meta-analysis is in that dictionary. The reverse approach, building a dictionary of every SNP in the meta-analysis and then looking up the GSCAN SNPs in it, would require a large amount of memory for the Python dictionary. The GSCAN list has only 460 SNPs, so the first strategy makes more computational sense.

Also note that some SNPs in the meta-analysis have IDs of the form `chr:position:a1:a2` instead of `rsid:position:a1:a2`, probably because no rsID was available for them. If a GSCAN SNP is not found in the lookup, it is written to a separate output file so that we can deal with it later. We might have to convert these SNPs from `rsid:position:a1:a2` format to `chr:position:a1:a2` and then perform the search again with just these SNPs.
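The ID conversion for a second pass could be sketched as follows. This assumes the two-column SNP list created earlier (ID, then chromosome) is the source of the chromosome; `to_chr_id` and `build_retry_ids` are hypothetical helper names. Allele order is kept as-is, so a SNP whose alleles are listed in the opposite order in the meta-analysis would still be missed.

```python
def to_chr_id(variant_id, chrom):
    """Swap the rsID prefix of 'rsid:position:a1:a2' for the chromosome."""
    fields = variant_id.split(":")
    if len(fields) != 4:
        raise ValueError("expected rsid:position:a1:a2, got {!r}".format(variant_id))
    _rsid, position, a1, a2 = fields
    return "{}:{}:{}:{}".format(chrom, position, a1, a2)


def build_retry_ids(not_found_ids, snp_list_lines):
    """Map each not-found ID to its chr:position:a1:a2 form.

    snp_list_lines: lines of the two-column SNP list (ID <TAB> chromosome).
    IDs that are absent from the SNP list are skipped.
    """
    chrom_by_id = {}
    for line in snp_list_lines:
        parts = line.split()
        if len(parts) >= 2:
            chrom_by_id[parts[0]] = parts[1]
    return {
        vid: to_chr_id(vid, chrom_by_id[vid])
        for vid in not_found_ids
        if vid in chrom_by_id
    }


if __name__ == "__main__":
    snp_list = ["rs7920501:10043159:T:A\t10\n"]
    print(build_retry_ids(["rs7920501:10043159:T:A"], snp_list))
```

The values of the returned dictionary can then be used as the lookup keys in a rerun, while the keys preserve the original GSCAN IDs for reporting.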

### AA

In [None]:
### Python3 ###
"""
*SNP lookup*

    Make sure the IDs are of the same format for the snp-list
    and the IDs in the meta-analysis results. e.g. 1000g_p3 or rsID only
"""
import gzip

################################################################################
date = "20181106" # date the results were generated (in results file names); the EA and cross-ancestry output files below use "20181108"
ancestry = "aa" # one of "aa", "ea", or "cross"

if ancestry=="aa":
    pop = "afr"
elif ancestry=="ea":
    pop = "eur"
else:
    pop = "afr+eur"

## dict to hold gscan snps and the number of times they were found.
## we can tell which SNPs did not show up in any of the meta files
gscan_dict =  {}
out_dir = "/shared/jmarks/nicotine/gscan/lookup/20181212"
snp_list = "{}/combined-snp_list-filtered-1000g_p3-chr.tsv".format(out_dir)

for chrom in range(1,23):
    out_file = "{}/{}/002/{}-{}-chr{}-ftnd_meta_analysis-wave3-1df-gscan-lookup.txt".format(out_dir, ancestry, date, pop, chrom)
    results = "{}_ftnd_meta_analysis_wave3.{}.chr{}.exclude_singletons.1df.gz".format(date, pop, chrom)
    meta = "/shared/jmarks/nicotine/meta/results/{}/{}/final/{}".format(ancestry, date, results)
    not_found = "{}/{}/002/{}-{}-ftnd_meta_analysis_wave3-gscan-snps-not-found".format(out_dir, ancestry, date, pop)
################################################################################

    # open the gzipped results in text mode ("rt") so lines are str, not bytes, under Python 3
    with gzip.open(meta, "rt") as metF, open(snp_list) as gscanF, open(out_file, "w") as outF:
        gscan_line = gscanF.readline()
        met_head = metF.readline()
        met_line = metF.readline()

        outF.write(met_head)


        ## create a dictionary containing the gscan snps
        if len(gscan_dict) == 0:
            while gscan_line:
                key = gscan_line.split()[0]
                gscan_dict[key] = 0
                gscan_line = gscanF.readline()


        while met_line:
            met_id = met_line.split()[0] # the 1000g_p3 ID in the meta-analysis

            if met_id in gscan_dict:
                gscan_dict[met_id] += 1
                outF.write(met_line)
            met_line = metF.readline()

with open(not_found, "w") as notF:
    for key, value in gscan_dict.items():
        if value==0:
            notF.write(key + "\n")

                                                                                                                                

In [None]:
### bash ###

## merge results
cd /shared/jmarks/nicotine/gscan/lookup/20181212/aa/002
head -1 20181106-afr-chr8-ftnd_meta_analysis-wave3-1df-gscan-lookup.txt >\
    20181106-afr-ftnd-1df-meta_analysis-wave3-gscan-lookup-merged-results.txt
    
for file in 20181106-afr-chr{1..22}-ftnd_meta_analysis-wave3-1df-gscan-lookup.txt; do
    tail -n +2 $file >> 20181106-afr-ftnd-1df-meta_analysis-wave3-gscan-lookup-merged-results.txt
done

### EA
Simply change the `ancestry` and `date` variables at the top of the script (`pop` is set from `ancestry`) and rerun it.

In [None]:
### bash ### 

## merge results
cd /shared/jmarks/nicotine/gscan/lookup/20181212/ea/002
head -1 20181108-eur-chr8-ftnd_meta_analysis-wave3-1df-gscan-lookup.txt >\
    20181108-eur-ftnd-1df-meta_analysis-wave3-gscan-lookup-merged-results.txt
for file in 20181108-eur-chr{1..22}-ftnd_meta_analysis-wave3-1df-gscan-lookup.txt;do
    tail -n +2 $file >> 20181108-eur-ftnd-1df-meta_analysis-wave3-gscan-lookup-merged-results.txt
done

### Cross-Ancestry
Simply change the `ancestry` and `date` variables at the top of the script (`pop` is set from `ancestry`) and rerun it.

In [None]:
### bash ### 

## merge results
cd /shared/jmarks/nicotine/gscan/lookup/20181212/cross/002
head -1 20181108-afr+eur-chr8-ftnd_meta_analysis-wave3-1df-gscan-lookup.txt > 20181108-afr+eur-ftnd-1df-meta_analysis-wave3-gscan-lookup-merged-results.txt
for file in 20181108-afr+eur-chr{1..22}-ftnd_meta_analysis-wave3-1df-gscan-lookup.txt;do
    tail -n +2 $file >> 20181108-afr+eur-ftnd-1df-meta_analysis-wave3-gscan-lookup-merged-results.txt
done


## Retrieve Missing SNPs
There were four SNPs missing from the SNP lookup; the same four SNPs were missing from each meta-analysis (i.e., aa, ea, and cross).
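One way to confirm that the same SNPs are missing in every ancestry is to compare the not-found files written by the lookup script. This is a minimal sketch, assuming one ID per line in each file; `read_ids` and `compare_not_found` are hypothetical helper names.

```python
def read_ids(path):
    """Read one ID per line, ignoring blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def compare_not_found(paths_by_ancestry):
    """Compare not-found files across ancestries.

    Returns (shared, extras): the IDs missing in every ancestry, and for each
    ancestry the IDs missing there but not in all of the others.
    """
    sets = {name: read_ids(path) for name, path in paths_by_ancestry.items()}
    if not sets:
        return set(), {}
    shared = set.intersection(*sets.values())
    extras = {name: ids - shared for name, ids in sets.items()}
    return shared, extras
```

Call it with a dictionary mapping `"aa"`, `"ea"`, and `"cross"` to the corresponding `...-gscan-snps-not-found` files under `/shared/jmarks/nicotine/gscan/lookup/20181212/{aa,ea,cross}/002`. If the statement above holds, `shared` contains four IDs and every set in `extras` is empty.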