# Extension of NGC Heroin OAall Results 
**Author:** Jesse Marks

We have preliminary results from a very large-scale GWAS for cigarette smoking and alcohol consumption phenotypes. We will search for those SNPs in our NGC meta-analysis results. Supplemental tables S6-S10 are here:
`\\rcdcollaboration01.rti.ns\GxG\Analysis\GSCAN\shared MS version 1\`

The phenotypes of interest to us include:

1) **Age of smoking initiation (AI)** - supplemental table 6

2) **Cigarettes per day (CPD)** - supplemental table 7

3) **Smoking cessation (SC)** - supplemental table 8

4) **Smoking initiation (SI)** - supplemental table 9

5) **Drinks Per Week (DPW)** - supplemental table 10

We want to see whether these associations extend to opioid addiction. The NGC OAall meta-analysis results are located on the share drive at:
`//rcdcollaboration01.rti.ns/Heroin_Public_Data/NGC GWAS/meta/oaall/061.adaa+cats+coga+decode+kreek+uhs+vidus+yale-penn.aa+ea`

## Download Data + Create SNP list
The data are located on the share drive, so there is no need to download them from S3. Create a list of SNPs for the lookup from the GSCAN Excel workbook: copy the SNPs from each of the five Excel sheets (supplemental tables) into separate files, then combine them while removing duplicates.

In [None]:
## local ##

# Create directory structure locally
mkdir -p ~/Desktop/Projects/heroin/ngc/gscan_lookup
cd ~/Desktop/Projects/heroin/ngc/gscan_lookup

# populate these files with the SNPs from respective Excel sheets
touch age_of_initiation.tsv  cig_per_day.tsv  smoking_cessation.tsv  smoking_initiation.tsv drink_week.tsv

# line counts after pasting in the SNPs (each file includes one header line)
wc -l *tsv
#    11 age_of_initiation.tsv
#    56 cig_per_day.tsv
#   101 drink_week.tsv
#    25 smoking_cessation.tsv
#   377 smoking_initiation.tsv
#   570 total

# combine SNPs into one file with a single header line
# (duplicated SNPs are removed in the next step)
head -1 age_of_initiation.tsv > combined_snp_list.tsv
for file in age* cig* smoking* drink*; do
    tail -n +2 "$file" >> combined_snp_list.tsv
done

# filter so that there are no duplicated SNPs (no header either)
tail -n +2 combined_snp_list.tsv | sort -u > combined_snp_list_filtered.tsv

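# convert to 1000g_p3 format
# input columns: $1 = chr, $2 = position, $3 = rsID, $4 and $5 = the two alleles
# output: rsID:position:allele:allele <tab> chr
awk '{print $3":"$2":"$4":"$5"\t"$1}' combined_snp_list_filtered.tsv > combined-snp_list-filtered-1000g_p3-chr.tsv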

```
wc -l combined-snp_list-filtered-1000g_p3-chr.tsv
557 combined-snp_list-filtered-1000g_p3-chr.tsv

head combined-snp_list-filtered-1000g_p3-chr.tsv
rs12027999:154206358:T:C        1
rs2072659:154548521:C:G 1
rs45444697:155034632:C:G        1
rs10753661:165119792:G:A        1
rs28680958:173848808:G:A        1
rs2901785:174104743:G:A 1
rs34973462:175993820:C:T        1
rs147052174:179783167:G:T       1
rs3820277:18436657:G:T  1
rs35656245:190957480:G:A        1
```
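
Because `sort -u` only drops lines that are identical in every column, a SNP that appears in two supplemental tables could survive twice if the copied rows differ in any column. The sketch below is one way to check the final list for repeated IDs. The helper name `duplicated_ids` is hypothetical, and the third sample line is repeated on purpose for illustration; it is not taken from the real file.

```python
from collections import Counter


def duplicated_ids(lines):
    """Return the IDs (first column) that occur on more than one line."""
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return sorted(snp for snp, n in counts.items() if n > 1)


# Illustrative input: two lines from the listing above plus a deliberate repeat.
sample = [
    "rs12027999:154206358:T:C\t1",
    "rs2072659:154548521:C:G\t1",
    "rs12027999:154206358:T:C\t1",
]
print(duplicated_ids(sample))  # ['rs12027999:154206358:T:C']
```

To run it on the real list, pass an open file handle for `combined-snp_list-filtered-1000g_p3-chr.tsv` instead of `sample`; an empty result means every ID is unique.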

## SNP lookup
I will build a dictionary of the GSCAN SNPs and then check whether each SNP in the meta-analysis is in that dictionary. This makes more sense than the reverse, i.e. building a dictionary of every SNP in the meta-analysis and then checking whether each GSCAN SNP is in it. The reverse strategy would require a large amount of memory to hold the Python dictionary, whereas the GSCAN list has only 557 entries.

Also note that some SNPs in the meta-analysis have IDs of the form chr:position:a1:a2 instead of rsid:position:a1:a2, probably because no rsID was available for them. GSCAN SNPs that are not found in the lookup are written to a separate file so that we can deal with them later. We may have to convert them from rsid:position:a1:a2 to chr:position:a1:a2 format and then repeat the search with just those SNPs.
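
The fallback described above could look roughly like the sketch below. It is not part of the original analysis: the function names `to_chr_id` and `second_pass` are hypothetical, and it assumes the chr-based IDs in the meta-analysis list the two alleles in the same order as the GSCAN list, which may not hold. The meta-analysis line in the usage example is invented for illustration only.

```python
def to_chr_id(snp_list_line):
    """Turn 'rsid:pos:a1:a2<TAB>chr' into 'chr:pos:a1:a2'."""
    marker, chrom = snp_list_line.split()
    _rsid, pos, a1, a2 = marker.split(":")
    return ":".join([chrom, pos, a1, a2])


def second_pass(not_found_ids, snp_list_lines, meta_lines):
    """Retry the lookup for unmatched SNPs using chr:pos:a1:a2 IDs.

    Returns a dict mapping the original rsID-based ID to the matching
    meta-analysis line.
    """
    wanted = set(not_found_ids)
    # Map each chr-based ID back to the rsID-based ID it was built from.
    chr_ids = {}
    for line in snp_list_lines:
        if not line.strip():
            continue
        marker = line.split()[0]
        if marker in wanted:
            chr_ids[to_chr_id(line)] = marker

    hits = {}
    for line in meta_lines:  # stream the large file once
        fields = line.split()
        if fields and fields[0] in chr_ids:
            hits[chr_ids[fields[0]]] = line.rstrip("\n")
    return hits


# Usage with made-up meta-analysis content (values are placeholders).
snp_list = ["rs147052174:179783167:G:T\t1\n", "rs2072659:154548521:C:G\t1\n"]
meta = ["MarkerName\tCHR\tPOS\n", "1:179783167:G:T\t1\t179783167\n"]
print(second_pass(["rs147052174:179783167:G:T"], snp_list, meta))
```

Keying the result by the original rsID-based ID makes it easy to see which of the unmatched SNPs were recovered on the second pass.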

### Cross-Ancestry

In [36]:
### Python3 ###
"""
*SNP lookup*

    Make sure the IDs are of the same format for the snp-list
    and the IDs in the meta-analysis results. e.g. 1000g_p3 or rsID only
"""
import gzip

################################################################################
date = "20190110"  # enter today's date
#ancestry = "ea"

#if ancestry=="aa":
#    pop = "afr"
#elif ancestry=="ea":
#    pop = "eur"
#else:
#    pop = "afr+eur"

## dict to hold gscan snps and the number of times they were found.
## we can tell which SNPs did not show up in any of the meta files
gscan_dict =  {}
base_dir = "C:\\Users\\jmarks\\Desktop\\gscan_lookup"
snp_list = "{}\\combined-snp_list-filtered-1000g_p3-chr.tsv".format(base_dir)

#for chrom in range(1,23):
out_file = "{}\\results\\{}-ngc-meta-analysis-aa+ea.maf_gt_0.03.rsq_gt_0.3-gscan-lookup.txt".format(base_dir, date)
results = "{}\\adaa+cats+coga+decode+kreek+uhs+vidus+yale-penn.aa+ea.maf_gt_0.03.rsq_gt_0.3.fuma.gz".format(base_dir)
not_found = "{}\\results\\{}-ngc-meta-analysis-gscan-snps-not-found".format(base_dir, date)
################################################################################

with gzip.open(results, 'rt') as metF, open(snp_list) as gscanF, open(out_file, "wt") as outF:
    gscan_line = gscanF.readline()
    met_head = metF.readline()
    met_line = metF.readline()
    # show the first line of each input to confirm the ID formats match
    print(gscan_line)
    print(met_head, met_line)

    outF.write(met_head)

    ## create a dictionary containing the gscan snps (the snp list has no header)
    while gscan_line:
        if gscan_line.strip():  # skip blank lines
            key = gscan_line.split()[0]
            gscan_dict[key] = 0
        gscan_line = gscanF.readline()

    ## stream through the meta-analysis and keep lines whose ID is a gscan snp
    while met_line:
        fields = met_line.split()
        # fields[0] is the 1000g_p3 ID in the meta-analysis
        if fields and fields[0] in gscan_dict:
            gscan_dict[fields[0]] += 1
            outF.write(met_line)
        met_line = metF.readline()

# report SNPs not found (count is still 0 after the full pass)
with open(not_found, "wt") as notF:
    for key, value in gscan_dict.items():
        if value == 0:
            notF.write(key + "\n")

rs12027999:154206358:T:C	1

MarkerName	CHR	POS	Allele1	Allele2	Effect	StdErr	P-value
 1:178757976:-:CTTTCT	1	178757976	-	ctttct	0.0639	0.0574	0.2653



```

 ~/Desktop/gscan_lookup/results
$ head *
==> 20190110-ngc-meta-analysis-aa+ea.maf_gt_0.03.rsq_gt_0.3-gscan-lookup.txt <==
MarkerName      CHR     POS     Allele1 Allele2 Effect  StdErr  P-value
rs4912332:58815243:C:T  1       58815243        t       c       0.0095  0.0172  0.5828
rs10914684:33795572:G:A 1       33795572        a       g       0.0007  0.0232  0.9759
rs34973462:175993820:C:T        1       175993820       t       c       -0.0579 0.0187  0.001976
rs951740:44011737:G:A   1       44011737        a       g       -0.0110 0.0183  0.5491
rs11264100:35591626:A:G 1       35591626        a       g       -0.0178 0.0272  0.513
rs3820277:18436657:G:T  1       18436657        t       g       0.0250  0.0180  0.1645
rs12088813:66407700:A:C 1       66407700        a       c       -0.0502 0.0218  0.02126
rs35656245:190957480:G:A        1       190957480       a       g       -0.0154 0.0199  0.4406
rs12563365:236872829:G:A        1       236872829       a       g       -0.0089 0.0193  0.6425

==> 20190110-ngc-meta-analysis-gscan-snps-not-found <==
rs147052174:179783167:G:T
rs184083806:96981736:T:C
rs2145451:29316842:T:C
rs28929474:94844947:C:T
rs4886550:78243579:A:G
rs12442563:83893243:G:T
rs72836318:44121579:T:C
rs143200968:41338847:G:C
rs145580088:41342842:A:G
```