# Extension of GSCAN results to nicotine dependence (issue #59)

**Author:** Jesse Marks <br>
**GitHub Issue**: [#103](https://github.com/RTIInternational/bioinformatics/issues/103#issuecomment-522000707)

We have preliminary results from a very large-scale genome-wide study for cigarette smoking phenotypes ([GSCAN](https://genome.psych.umn.edu/index.php/GSCAN)). We used the [LDSC workflow](https://github.com/RTIInternational/ld-regression-pipeline) to test the phenotypic relationship between the smoking traits in GSCAN and our in-house GWAS results of opioid addiction (OAall). Given the strong correlations we saw--with GSCAN phenotypes (P=6e-7 to 3e-20) and FTND (P=10e-5)--we now want to see how the top GSCAN SNP associations correlate with OAall. 

For top GSCAN results, see the supplemental Tables S6-S9 here:<br>
`//RTPNFIL02/dhancock/Analysis/GSCAN/shared MS version 1/Supplementary_Tables_S6-S12_Loci.xlsx`.

<br>

The GSCAN phenotypes of interest to us include: 

1) **Age of smoking initiation (AI)** - supplemental table 6

2) **Cigarettes per day (CPD)** - supplemental table 7

3) **Smoking cessation (SC)** - supplemental table 8

4) **Smoking initiation (SI)** - supplemental table 9

We're interested in seeing whether these associations extend over to opioid dependence. I use the SNP look-up script to extend Tables S6-S9 with our OAall meta-analysis results on S3 at:
* `s3://rti-midas-data/studies/ngc/meta/082/processing/oaall/`

<br>
<br>
<br>

**Note:** Because the meta-analysis results files are quite large, we will not perform the search locally; it runs on EC2 instead.

## Download Data

Create the directory structure and download the meta-analysis results to EC2. Also, create a list of SNPs for the lookup from the GSCAN Excel sheet. Copy the SNPs from each of the four Excel sheets (supplemental tables) into separate files and then combine them (no duplicates).

In [None]:
## EC2 ##

# Create directory 
dir=/shared/jmarks/proj/heroin/gscan_lookup/20190816
mkdir -p $dir/{cross_meta,ea_meta,lookup_results}
mkdir $dir/lookup_results/{cross,ea}

for i in {1..22}; do
    #aws s3 cp s3://rti-midas-data/studies/ngc/meta/082/processing/oaall/cats+coga+decode+kreek+uhs+vidus+yale-penn.ea.chr$i.maf_gt_0.01.rsq_gt_0.3.gz $dir/ea_meta
    aws s3 cp s3://rti-midas-data/studies/ngc/meta/081/processing/oaall/adaa+alive+cats+coga+decode+kreek+uhs+vidus+yale-penn.aa+ea.chr$i.maf_gt_0.01.rsq_gt_0.3.gz $dir/cross_meta       
done

# populate these files with the SNPs from respective Excel sheets

touch age_of_initiation.tsv  cig_per_day.tsv  smoking_cessation.tsv  smoking_initiation.tsv
wc -l * # note these are with headers
"""
   11 age_of_initiation
   56 cig_per_day
   25 smoking_cessation
  377 smoking_initiation
"""

head age_of_initiation.tsv
"""
Chr     Pos     rsID    Reference Allele        Alternate Allele
2       145638766       rs72853300      C       T
2       225353649       rs12611472      T       C
2       63622309        rs7559982       T       A
3       85699040        rs11915747      C       G
4       140908755       rs13136239      G       A
"""

# combine SNPs into one file, make sure SNPs are not listed twice
head -1 age_of_initiation.tsv > combined_snp_list.tsv
for file in age_of_initiation.tsv cig_per_day.tsv smoking_cessation.tsv smoking_initiation.tsv; do
    tail -n +2 $file >> combined_snp_list.tsv
done

# filter so that there are no duplicated SNPs (no header either)
tail -n +2 combined_snp_list.tsv | sort -u > combined_snp_list_filtered.tsv

# convert to 1000g_p3 format
awk '{print $3":"$2":"$4":"$5"\t"$1}' combined_snp_list_filtered.tsv > combined_snp_list_filtered_1000g_p3_chr.tsv

``` 
wc -l combined_snp_list_filtered_1000g_p3_chr.tsv
460 combined_snp_list_filtered_1000g_p3_chr.tsv

head combined_snp_list_filtered_1000g_p3_chr.tsv
rs7920501:10043159:T:A  10
rs7901883:103186838:G:A 10
rs11594623:103960351:T:C        10
rs11191269:104120522:C:G        10
rs28408682:104403310:A:G        10
rs12244388:104640052:G:A        10
rs111842178:104852121:A:G       10
rs34970111:106078937:C:T        10
rs9787523:106460460:T:C 10
rs11192347:106929313:G:A        10
```
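For reference, the de-duplication and ID conversion steps above can also be sketched in Python. This is only an illustration, not the script that produced the files: the function name `to_1000g_p3` and the toy input are assumptions, and unlike `sort -u` the sketch keeps SNPs in first-seen order rather than sorting them.

```python
import csv
import io


def to_1000g_p3(rows):
    """Yield unique (variant_id, chrom) pairs from rows of
    (chr, pos, rsid, ref, alt), keeping first-seen order."""
    seen = set()
    for chrom, pos, rsid, ref, alt in rows:
        # Same layout as the awk step: rsid:position:ref:alt
        variant_id = ":".join([rsid, pos, ref, alt])
        if variant_id not in seen:
            seen.add(variant_id)
            yield variant_id, chrom


# Toy input: two rows from the age_of_initiation listing, one repeated on purpose.
example = io.StringIO(
    "Chr\tPos\trsID\tReference Allele\tAlternate Allele\n"
    "2\t145638766\trs72853300\tC\tT\n"
    "2\t225353649\trs12611472\tT\tC\n"
    "2\t145638766\trs72853300\tC\tT\n"
)
reader = csv.reader(example, delimiter="\t")
next(reader)  # skip the header row
pairs = list(to_1000g_p3(reader))
for variant_id, chrom in pairs:
    print(variant_id, chrom, sep="\t")
```

Reading the file with a tab delimiter avoids depending on whitespace splitting, which matters if the header row (with its spaces in "Reference Allele") is ever left in.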

## SNP lookup
I need to create a dictionary of the GSCAN SNPs and then check whether each SNP in the meta-analysis is in that dictionary. This makes more sense than the reverse, that is, creating a dictionary of every SNP in the meta-analysis and then checking whether the GSCAN SNPs are in it. The latter strategy would require a large amount of memory to hold the Python dictionary, whereas the GSCAN list has only 460 SNPs, so the former strategy is the cheaper one computationally.

Also note that some SNPs in the meta-analysis have the format `chr:position:a1:a2` instead of `rsid:position:a1:a2`. I think the reason is that these SNPs did not have an associated rsID available. If a GSCAN SNP is not found in the lookup, we need to output it to a separate list and deal with it later. It might be the case that we have to convert these SNPs from `rsid:position:a1:a2` format to `chr:position:a1:a2` and then perform the search again with just them.
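The fallback conversion described above could look something like the following. It is a minimal sketch under assumptions: the helper names `to_chr_pos_id` and `relabel_not_found` are hypothetical, the input is assumed to be `id<TAB>chr` records with a header line, and it assumes the meta-analysis lists the alleles in the same order as GSCAN, which may not hold for every SNP.

```python
def to_chr_pos_id(variant_id, chrom):
    """Convert 'rsid:position:a1:a2' to 'chr:position:a1:a2'."""
    _rsid, position, a1, a2 = variant_id.split(":")
    return ":".join([str(chrom), position, a1, a2])


def relabel_not_found(lines):
    """Map each chr-based ID back to its original rsID-based ID.

    `lines` are 'id<TAB>chr' records; a header line starting with 'id'
    and blank lines are skipped.
    """
    converted = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "id":
            continue
        variant_id, chrom = fields
        converted[to_chr_pos_id(variant_id, chrom)] = variant_id
    return converted


# Illustrative record in the not-found file layout.
example = ["id\tchr\n", "rs2145451:29316842:T:C\t14\n"]
print(relabel_not_found(example))
```

Keeping the mapping back to the original ID makes it possible to report any second-pass hits under the rsID that GSCAN uses.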

### OAall EA

In [None]:
### Python3 ###
"""
*SNP lookup*

    Make sure the IDs are of the same format for the snp-list
    and the IDs in the meta-analysis results. e.g. 1000g_p3 or rsID only
"""
import gzip

################################################################################
date = "20190816" # date the results were generated (in results files name)
ancestry = "cross"

if ancestry=="aa":
    pop = "afr"
elif ancestry=="ea":
    pop = "eur"
else:
    pop = "afr+eur"

## dict to hold gscan snps and the number of times they were found.
## we can tell which SNPs did not show up in any of the meta files
gscan_dict =  {}

# directory containing combined_snp_list_filtered_1000g_p3_chr.tsv
base_dir = "/shared/jmarks/proj/heroin/gscan_lookup/20190816" 
snp_list = "{}/combined_snp_list_filtered_1000g_p3_chr.tsv".format(base_dir)

for chrom in range(1,23):
    progress = "Processing {} {}".format(ancestry, chrom)
    #print(progress)
    out_file = "{}/lookup_results/{}/{}_{}_chr{}_oaall_1df_meta_gscan_snp_lookup.txt".format(base_dir, ancestry, date, pop, chrom)
    results = "adaa+alive+cats+coga+decode+kreek+uhs+vidus+yale-penn.aa+ea.chr{}.maf_gt_0.01.rsq_gt_0.3.gz".format(chrom)
    meta = "{}/cross_meta/{}".format(base_dir, results)
    not_found = "{}/lookup_results/{}/{}_{}_oaall_1df_meta_gscan_snp_lookup_not_found.txt".format(base_dir, ancestry, date, pop)
################################################################################

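    # open the gzipped results in text mode ("rt") so lines are str, not bytes;
    # otherwise the IDs never match the str keys in gscan_dict
    with gzip.open(meta, "rt") as metF, open(snp_list) as gscanF, open(out_file, "w") as outF: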
        gscan_line = gscanF.readline()
        met_head = metF.readline()
        met_line = metF.readline()

        outF.write(met_head)

        ## create a dictionary containing the gscan snps (just 1000G_p3 portion)
        if len(gscan_dict) == 0:
            while gscan_line:
                key = gscan_line.split()[0] # SNP id
                value = gscan_line.split()[1] #chromosome
                gscan_dict[key] = [0, value]
                gscan_line = gscanF.readline()

        while met_line:
            met_id = met_line.split()[0] # the 1000g_p3 ID in the meta-analysis
            if met_id in gscan_dict:
                gscan_dict[met_id][0] += 1
                outF.write(met_line)
            met_line = metF.readline()

with open(not_found, "w") as notF:
    notF.write("id\tchr\n")
    for key, value in gscan_dict.items():
        if value[0]==0:
            notF.write(key + "\t" +  value[1] + "\n")

In [None]:
### bash ###

## merge ea results
cd /shared/jmarks/proj/heroin/gscan_lookup/20190816/lookup_results/ea
head -1 20190816_eur_chr3_oaall_1df_meta_gscan_snp_lookup.txt >\
    20190816_eur_oaall_1df_meta_gscan_snp_lookup_merged_results.txt
    
for file in 20190816_eur_chr{1..22}_oaall_1df_meta_gscan_snp_lookup.txt; do
    tail -n +2 $file >> 20190816_eur_oaall_1df_meta_gscan_snp_lookup_merged_results.txt 
done 

wc -l 20190816_eur_oaall_1df_meta_gscan_snp_lookup_merged_results.txt
"""451"""

In [None]:
### bash ###

## merge results
cd /shared/jmarks/proj/heroin/gscan_lookup/20190816/lookup_results/cross
head -1 20190816_afr+eur_chr3_oaall_1df_meta_gscan_snp_lookup.txt >\
    20190816_afr+eur_oaall_1df_meta_gscan_snp_lookup_merged_results.txt
    
for file in 20190816_afr+eur_chr{1..22}_oaall_1df_meta_gscan_snp_lookup.txt; do
    tail -n +2 $file >> 20190816_afr+eur_oaall_1df_meta_gscan_snp_lookup_merged_results.txt 
done 

wc -l 20190816_afr+eur_oaall_1df_meta_gscan_snp_lookup_merged_results.txt
"""451"""

# Retrieve Missing SNPs
Some SNPs were not retrieved.
* 10 of the 460 GSCAN SNPs were not found in either lookup:
```
id      chr
rs2145451:29316842:T:C  14
rs12611472:225353649:T:C        2
rs12442563:83893243:G:T 15
rs4886550:78243579:A:G  15
rs181508347:91366274:T:G        5
rs72836318:44121579:T:C 17
rs112913817:115077394:A:G       7
rs3076896:146283610:G:A 2
rs28813180:158083918:G:A        3
rs10698713:158882320:G:A        6
```
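One way to arrive at the "not found in either lookup" set is to intersect the two `*_not_found.txt` files written by the lookup script. The sketch below is an assumption about how that comparison could be done, not a record of how the list above was produced; the function names are hypothetical and the toy IDs are placeholders rather than real SNPs.

```python
import io


def read_not_found(handle):
    """Return {variant_id: chrom} from a not-found file with an 'id<TAB>chr' header."""
    next(handle, None)  # skip the header line
    records = {}
    for line in handle:
        fields = line.split()
        if len(fields) == 2:
            records[fields[0]] = fields[1]
    return records


def missing_in_both(ea_handle, cross_handle):
    """Return the SNPs that are absent from both the EA and cross-ancestry lookups."""
    ea = read_not_found(ea_handle)
    cross = read_not_found(cross_handle)
    return {vid: ea[vid] for vid in sorted(ea.keys() & cross.keys())}


# Toy stand-ins for the two not-found files (placeholder IDs, not real data).
ea_text = io.StringIO(
    "id\tchr\nrs_example1:100:A:G\t1\nrs_example2:200:C:T\t2\n"
)
cross_text = io.StringIO(
    "id\tchr\nrs_example1:100:A:G\t1\nrs_example3:300:G:A\t3\n"
)
shared_missing = missing_in_both(ea_text, cross_text)
print(shared_missing)
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles on the not-found files under `lookup_results/ea` and `lookup_results/cross`.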