# GSCAN lookup: updated method.
**Author**: Jesse Marks

**Description**:<br>
This notebook looks up the genome-wide significant (gw-sig) GSCAN SNPs in the LIBD mQTL and eQTL results using the updated lookup procedure. There are 462 unique gw-sig GSCAN SNPs across the four phenotypes:
* Age of Initiation
* Cigarettes per Day
* Smoking Cessation
* Smoking Initiation

We look up these SNPs in the LIBD mQTL and eQTL results. This lookup was performed previously, but it did not capture all of the results it could have. Some variants were missed because the rsIDs in the GSCAN results did not match the SNP names in the LIBD results. The LIBD results used a mixture of SNP name formats and did not include chromosome or position information.

We resolved this issue by first performing the GSCAN lookup in the LIBD genotype data. The genotype data allowed us to match strictly by chromosome:position and thus ignore the discordant SNP names. We then created a list of the GSCAN gw-sig SNPs that were actually present in the LIBD genotype data, recorded under their LIBD SNP names, and used that list to extract the mQTL and eQTL results.
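
The final extraction step can be sketched as a simple filter over a results table. This is a minimal sketch, not the procedure actually used: the function name `filter_results_by_snp`, the column names, and the assumption that the results are tab-delimited with a header and the SNP name in the first column are all hypothetical, and the rows are toy data.

```python
import csv
import io

def filter_results_by_snp(results_lines, snp_names, snp_column=0):
    """Yield the header row, then every result row whose SNP name is in snp_names."""
    wanted = set(snp_names)
    reader = csv.reader(results_lines, delimiter="\t")
    yield next(reader)  # header row (assumed to be present)
    for row in reader:
        if row and row[snp_column] in wanted:
            yield row

# Toy lookup list and results table (column names and values are illustrative only).
lookup = ["rs140435168:256586:T:G"]
results = io.StringIO(
    "SNP\tgene\tbeta\tp-value\n"
    "rs140435168:256586:T:G\tgeneA\t0.12\t1e-6\n"
    "rs999:100:A:C\tgeneB\t-0.03\t0.4\n"
)
kept = list(filter_results_by_snp(results, lookup))
```

Because the lookup list already holds LIBD SNP names, an exact string match on the SNP column is enough here; no chromosome or position columns are needed in the results file.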

## Genotype lookup
Perform the lookup in the genotype data first so we can capture the LIBD SNP names based on chromosome and position.

* 462 unique GSCAN SNPs.
* 373 found in the genotype data (89 not found)
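
The found / not-found split above can be sketched as a single pass that matches on (chromosome, position). This is an illustrative sketch only: the helper `partition_by_position` and the tuple layouts are assumptions, and the second LIBD SNP name in the toy data is invented to show a non-rsID naming format.

```python
def partition_by_position(gscan_rows, bim_rows):
    """Split GSCAN SNPs into those found and not found in the genotype data.

    gscan_rows: iterable of (chrom, pos, rsid)
    bim_rows:   iterable of (chrom, snp_name, pos)
    """
    # Index the genotype data by (chrom, pos) so SNP names are never compared.
    name_by_pos = {(str(c), str(p)): name for c, name, p in bim_rows}
    found, missing = [], []
    for chrom, pos, rsid in gscan_rows:
        key = (str(chrom), str(pos))
        if key in name_by_pos:
            found.append((rsid, name_by_pos[key]))
        else:
            missing.append(rsid)
    return found, missing

# Toy genotype entries; the second SNP name is made up for illustration.
bim = [
    ("1", "rs140435168:256586:T:G", "256586"),
    ("2", "2:145638766:C:T", "145638766"),
]
gscan = [("2", "145638766", "rs72853300"), ("4", "2881256", "rs624833")]
found, missing = partition_by_position(gscan, bim)
```

Returning both lists from one pass makes it easy to check that the two counts add up to the number of input SNPs.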

[Association studies of up to 1.2 million individuals yield new insights into the genetic etiology of tobacco and alcohol use](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6358542/)
* [supplementary table](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6358542/bin/NIHMS1511852-supplement-2.xlsx)


### Create GSCAN table

Create the GSCAN table from the supplementary spreadsheet, which can be downloaded from the publication.
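
Combining the per-phenotype tables while keeping one row per SNP could also be done programmatically. The sketch below is one hypothetical way to do it, assuming rows of (phenotype, chr, pos, rsID, ref, alt) and joining the phenotype labels of repeated rsIDs with an underscore; `merge_phenotype_tables` is an illustrative name, and the input is a toy subset.

```python
def merge_phenotype_tables(rows):
    """Collapse rows that share an rsID, joining their phenotype labels with '_'.

    rows: iterable of (phenotype, chrom, pos, rsid, ref, alt)
    """
    merged = {}  # rsid -> ([phenotypes], chrom, pos, rsid, ref, alt); keeps first-seen order
    for phen, chrom, pos, rsid, ref, alt in rows:
        if rsid in merged:
            merged[rsid][0].append(phen)
        else:
            merged[rsid] = ([phen], chrom, pos, rsid, ref, alt)
    return [("_".join(phens), *rest) for phens, *rest in merged.values()]

# Toy input: rs11780471 appears under two phenotypes.
rows = [
    ("aai", "8", "27344719", "rs11780471", "G", "A"),
    ("aai", "7", "2032865", "rs1403174", "A", "T"),
    ("si", "8", "27344719", "rs11780471", "G", "A"),
]
table = merge_phenotype_tables(rows)
```

With only a handful of duplicated SNPs, editing the file by hand is just as reasonable; a function like this mainly helps if the tables need to be regenerated.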

In [None]:
# For each phenotype, paste just the phenotype, chr, pos, rsID, ref allele, and alt allele columns from the spreadsheet into a text file (include the header)
wc -l age_at_initiation_snps.txt cigs_per_day_snps.txt smoking_cessation_snps.txt smoking_initiation_snps.txt
#      11 age_at_initiation_snps.txt
#      56 cigs_per_day_snps.txt
#      25 smoking_cessation_snps.txt
#     379 smoking_initiation_snps.txt
#     471 total
# 
# so there should be 467 SNP rows once the 4 header lines are excluded

# There are 5 SNPs that show up in two phenotypes, so there are 462 unique SNPs.

# Now we will combine those tables, taking care to note the phenotype(s) of those 462 SNPs 
cut -f1-6 age_at_initiation_snps.txt > gscan_tables_1-4_phen_chr_pos_rsid_ref_alt.txt
tail -n +2 cigs_per_day_snps.txt | cut -f1-6  >> gscan_tables_1-4_phen_chr_pos_rsid_ref_alt.txt
tail -n +2 smoking_cessation_snps.txt | cut -f1-6  >> gscan_tables_1-4_phen_chr_pos_rsid_ref_alt.txt
tail -n +2 smoking_initiation_snps.txt | cut -f1-6  >> gscan_tables_1-4_phen_chr_pos_rsid_ref_alt.txt

# Since only 5 SNPs appear in two phenotypes, we handle them manually:
# delete the duplicate line for each and edit the Phenotype column to list both phenotypes (e.g. aai_si)
#rs11780471 is in si and aai
#rs117824460 is in cpd and sc
#rs1565735 is in si and sc
#rs56113850 is in cpd and sc
#rs6011779 is in si and sc


# notice how rs11780471 has two phenotypes
head -11 gscan_tables_1-4_phen_chr_pos_rsid_ref_alt.txt
#Phenotype    Chr    Pos    rsID    Reference_Allele    Alternate_Allele
#aai    2    145638766    rs72853300    C    T
#aai    2    225353649    rs12611472    T    C
#aai    2    63622309    rs7559982    T    A
#aai    3    85699040    rs11915747    C    G
#aai    4    140908755    rs13136239    G    A
#aai    4    28589079    rs2471711    C    T
#aai    4    2881256    rs624833    T    G
#aai    4    68000888    rs7682598    A    G
#aai    7    2032865    rs1403174    A    T
#aai_si    8    27344719    rs11780471    G    A

wc -l gscan_tables_1-4_phen_chr_pos_rsid_ref_alt.txt # 463 (includes header)

In [None]:
infile = "gscan_tables_1-4_phen_chr_pos_rsid_ref_alt.txt"
libd = "/shared/mcarnes/matrix_eQTL/Preprocessing/MEGA/Genotype/LIBD_NAc_MEGA_baseline_genotype_matrixeQTL_input.bim" # on the mcarnes cluster
#outfile = "gscan_not_found_in_libd_genotype_data.txt"
outfile = "gscan_found_in_libd_genotype_data.txt"

with open(infile) as inF, open(libd) as libdF, open(outfile, 'w') as outF:
    # Map "chr:pos" -> LIBD SNP name using the .bim file
    # example .bim line: 1	rs140435168:256586:T:G	0	256586	G	T
    libd_dic = dict()
    for line in libdF:
        sl = line.split()
        chrom = sl[0]
        pos = sl[3]
        togeth = "{}:{}".format(chrom, pos)
        libd_dic[togeth] = sl[1]

    next(inF, None)  # skip the header line
    #Phenotype    Chr    Pos    rsID    Reference_Allele    Alternate_Allele
    #aai    2    145638766    rs72853300    C    T
    for line in inF:
        sl = line.split()
        # the first column is the phenotype, so Chr and Pos are columns 2 and 3
        chrom = sl[1]
        pos = sl[2]
        togeth = "{}:{}".format(chrom, pos)
        # write the LIBD SNP name of each GSCAN SNP found in the genotype data
        if togeth in libd_dic:
            outF.write(libd_dic[togeth] + "\n")
