# Extension of GSCAN results to mQTL

**Author:** Jesse Marks

**GitHub Issue:**  [LIBD NAc mQTL Analysis #142](https://github.com/RTIInternational/bioinformatics/issues/142)

We have preliminary results from a very large-scale genome-wide study for cigarette smoking phenotypes that relate to our nicotine dependence GWAS results. See the supplemental Tables S6-S9 here: 

`smb://RTPNFIL02/dhancock/dhancock/Nicotine/Analysis/GSCAN/shared MS version 1/Supplementary_Tables_S6-S12_Loci.xlsx`

The phenotypes of interest to us include: 

1) **Age of smoking initiation (AI)** - supplemental table 6

2) **Cigarettes per day (CPD)** - supplemental table 7

3) **Smoking cessation (SC)** - supplemental table 8

4) **Smoking initiation (SI)** - supplemental table 9

## Download Data
Create the directory structure and download the meta-analysis results to EC2. Also, create a list of SNPs for the lookup from the GSCAN Excel sheet. Copy the SNPs from each of the four Excel sheets (supplemental tables) into separate files and then combine them (no duplicates).

In [None]:
## EC2 ##
# Create directories (lookup_results/ holds the output of the lookup script below)
mkdir -p /shared/rti-nd/lookup/gscan/20210915/lookup_results/
cd /shared/rti-nd/lookup/gscan/20210915/

# populate these files with the SNPs from respective Excel sheets
# I just copy/pasted since there were not that many
touch table6_age_of_initiation.txt  table7_cig_per_day.txt  table8_smoking_cessation.txt  table9_smoking_initiation.txt

# make space separated (and use vim to edit header by adding underscores in header name when spaces are present in one field)
awk '$1=$1' table6_age_of_initiation.txt  > tmp && mv tmp table6_age_of_initiation.txt
awk '$1=$1' table7_cig_per_day.txt  > tmp && mv tmp table7_cig_per_day.txt
awk '$1=$1' table8_smoking_cessation.txt  > tmp && mv tmp table8_smoking_cessation.txt
awk '$1=$1' table9_smoking_initiation.txt  > tmp && mv tmp table9_smoking_initiation.txt

wc -l * # note these are with headers
"""
   11 table6_age_of_initiation.txt
   56 table7_cig_per_day.txt
   25 table8_smoking_cessation.txt
  377 table9_smoking_initiation.txt
"""

head table6_age_of_initiation.txt
"""
Chr Pos rsID Reference_Allele Alternate_Allele
2 145638766 rs72853300 C T
2 225353649 rs12611472 T C
2 63622309 rs7559982 T A
3 85699040 rs11915747 C G
4 140908755 rs13136239 G A
"""
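
The "combine them (no duplicates)" step is not shown above. A minimal Python sketch of one way to do it follows. It assumes all four files share the column layout shown for table 6 (rsID in the third column); that has not been checked for tables 7-9. The output file name `combined_gscan_snps.txt` is hypothetical.

```python
from pathlib import Path

# File names follow the cell above. The rsID column index (2) matches the header
# "Chr Pos rsID Reference_Allele Alternate_Allele" shown for table 6; it is an
# assumption that tables 7-9 use the same layout.
TABLES = [
    "table6_age_of_initiation.txt",
    "table7_cig_per_day.txt",
    "table8_smoking_cessation.txt",
    "table9_smoking_initiation.txt",
]


def combine_rsids(paths, rsid_col=2):
    """Return unique rsIDs from whitespace-delimited files, in first-seen order."""
    seen = {}  # dict keys keep insertion order and drop duplicates
    for path in paths:
        with open(path) as handle:
            next(handle, None)  # skip the header line
            for line in handle:
                fields = line.split()
                if len(fields) > rsid_col:
                    seen.setdefault(fields[rsid_col], None)
    return list(seen)


if __name__ == "__main__":
    base_dir = Path("/shared/rti-nd/lookup/gscan/20210915")
    paths = [base_dir / name for name in TABLES]
    if all(p.exists() for p in paths):
        rsids = combine_rsids(paths)
        # "combined_gscan_snps.txt" is a hypothetical output name
        (base_dir / "combined_gscan_snps.txt").write_text("\n".join(rsids) + "\n")
        print(len(rsids), "unique rsIDs")
```

Using a dict rather than a set keeps the SNPs in the order they were first seen, which makes the combined list easier to compare against the Excel sheets.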

In [None]:
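## Download joint 2df results from the mQTL mapping
## (the lookup script below reads these files from <base_dir>/mqtl_mapping/)
mkdir -p /shared/rti-nd/lookup/gscan/20210915/mqtl_mapping
cd /shared/rti-nd/lookup/gscan/20210915/mqtl_mapping
aws s3 sync s3://rti-nd/libd/results/methylation_qtl/strat_2df/maf_filtered_qtls/ .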
    
head filtered_MAF_gte_0.05_sorted_LIBD_NAc_MEGA_strat2df_new_model_chr10_cis_qtl_table.txt
#SNP     gene    case_beta       ctrl_beta       case_t_stat     ctrl_t_stat     case_p  ctrl_p  chisq_stat      chisq_p
#10:10001753:TAAAG:T     cg02458642      -0.824762223341344      0.331612703669845       -2.07465254377  1.6076694029    0.043899527866893       0.109879160224346      6.8887840864    0.03192416465703847
#10:10001753:TAAAG:T     cg02703020      -0.496341358240632      0.0797450361608843      -1.59883376047  0.448872731902  0.117014807406559       0.654130827296118      2.75775612306   0.2518609671122902

## SNP lookup
I need to create a dictionary of the GSCAN SNPs and then check whether each SNP is in the stratified 2df results.

Also note that some SNPs in the mQTL mapping results have the format `chr:position:a1:a2`, while others have the form `rsid:position:a1:a2`. We can first try searching only for the ones with rsIDs. If we can't find them that way, we may end up searching by chr:position instead.
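
To make the ID formats concrete, here is a small sketch of a parser that splits an mQTL SNP identifier into its parts. The names `VariantID` and `parse_variant_id` are hypothetical helpers, not part of the lookup script, and the handled formats are only those seen in the result files so far.

```python
from typing import NamedTuple, Optional


class VariantID(NamedTuple):
    rsid: Optional[str]      # e.g. "rs72853300", or None when the ID has no rsID
    chrom: Optional[str]     # e.g. "2", or None when the ID carries no chromosome
    position: Optional[str]  # kept as a string to match the GSCAN table text


def parse_variant_id(variant_id: str) -> VariantID:
    """Split an mQTL SNP identifier into rsID, chromosome, and position."""
    parts = variant_id.split(":")
    first = parts[0]
    position = parts[1] if len(parts) > 1 else None
    if first.startswith("rs"):
        # "rsid:position:a1:a2" or a bare rsID
        return VariantID(first, None, position)
    if position is not None:
        # "chr:position:a1:a2"
        return VariantID(None, first, position)
    # Other identifiers (e.g. "GA006823") carry neither an rsID nor coordinates
    return VariantID(None, None, None)


print(parse_variant_id("rs72853300:145638766:C:T"))
print(parse_variant_id("10:10001753:TAAAG:T"))
```

Note that the `rsid:position:a1:a2` form carries no chromosome, so a position-based match on those IDs has to take the chromosome from the per-chromosome file name.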

### Table 6: AI Loci

In [None]:
### Python3 ###
"""
*SNP lookup*

    Make sure the IDs are of the same format for the snp-list
    and the IDs in the meta-analysis results. e.g. 1000g_p3 or rsID only
"""
import gzip

################################################################################
################################################################################

table = "table6_age_of_initiation" # no file extension
base_dir = "/shared/rti-nd/lookup/gscan/20210915"  # no forward slash at the end

## dict to hold gscan snps and the number of times they were found.
## we can tell which SNPs did not show up in any of the mqtl mapping files
snp_list = "{}/{}.txt".format(base_dir, table)
progress = "Processing {}".format(table)

out_file = "{}/lookup_results/2df_mqtl_mapping_{}_gscan_lookup.txt".format(base_dir, table)
not_found = "{}/lookup_results/2df_mqtl_mapping_{}_gscan_snps_not_found".format(base_dir, table)
head_line = "false" # write the header to out file then change to true so it won't print it again
with open(snp_list) as gscanF, open(out_file, "w") as outF:
    gscan_line = gscanF.readline()

    ## create a dictionary containing the gscan snps (rsIDs)
    gscan_dict =  {}
    gscan_posdict =  {} # some variants have deprecated rsIDs, so we'll also search by position
    if len(gscan_dict) == 0:
        gscan_line = gscanF.readline() # 1st non-header line
        while gscan_line:
            # Chr Pos rsID Reference_Allele Alternate_Allele
            rsid = gscan_line.split()[2]
            position = gscan_line.split()[1]
            gscan_dict[rsid] = 0 # initiate the number of times it has been found
            gscan_posdict[position] = rsid
            gscan_line = gscanF.readline()
    print(gscan_dict)
    print(gscan_posdict)

    for chrom in range(1, 23):  # chromosomes 1-22; range() excludes the end value
        which_chrom = "chr{}".format(chrom)
        print(which_chrom)
        mqtl_results = "{}/mqtl_mapping/filtered_MAF_gte_0.05_sorted_LIBD_NAc_MEGA_strat2df_new_model_chr{}_cis_qtl_table.txt".format(base_dir, chrom)
        with open(mqtl_results) as mqtlF:
            mqtl_head = mqtlF.readline()
            if head_line == "false":
                outF.write(mqtl_head)
                head_line = "true"
                
            mqtl_line = mqtlF.readline()
            while mqtl_line: # iterate over all of the variants
                # SNP     gene    case_beta       ctrl_beta  ...    
                #2:100060310:C:CT        cg00151788      -0.20686045775319       -0.0148707568224241 ... 
                # rs72853300:145638766:C:T        cg27382459      0.0206777132739697      0.0305212347955355 ...
                #GA006823        cg00029583      0.237828601989225       0.0853544190919354  ...
                # rs1000007       cg00002145      -0.129089093196684      -0.0565450971892582  ...

                variant_id = mqtl_line.split()[0] # SNP
                if variant_id[0:2] == "rs":
                    if ":" in variant_id:
                        rs_id = variant_id.split(":")[0] # just the rsIDs
                        variant_pos = variant_id.split(":")[1] # just the position

                        if rs_id in gscan_dict:
                            gscan_dict[rs_id] += 1 # mark the variant as found
                            outF.write(mqtl_line)
                        elif variant_pos in gscan_posdict:
                            gscan_rs = gscan_posdict[variant_pos]
                            gscan_dict[gscan_rs] += 1 # mark the variant as found
                            outF.write(mqtl_line)
                    elif variant_id in gscan_dict:
                        gscan_dict[variant_id] += 1 # mark the variant as found
                        outF.write(mqtl_line)

                mqtl_line = mqtlF.readline()

with open(not_found, "w") as notF:
    notF.write("variant\n")
    for key, value in gscan_dict.items():
        if value==0:
            notF.write(key + "\n")

### Table 7: CPD Loci
Same as table6, just change file name.

### Table 8: SC Loci
Same as table6, just change file name.

### Table 9: SI Loci
Same as table6, just change file name.
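
Rather than editing the file name by hand for each table, the lookup could be wrapped in a function and run over all four tables. The sketch below is a variation on the script above, not a copy of it. It keys positions by `(chromosome, position)` instead of position alone, and it also matches `chr:position:a1:a2` IDs. Both changes are assumptions about what is wanted, and it assumes tables 7-9 share the table 6 column layout.

```python
import os

TABLES = [
    "table6_age_of_initiation",
    "table7_cig_per_day",
    "table8_smoking_cessation",
    "table9_smoking_initiation",
]


def load_gscan(snp_list):
    """Return (rsID -> hit count, (chrom, pos) -> rsID) from a GSCAN table file."""
    counts, by_pos = {}, {}
    with open(snp_list) as handle:
        next(handle, None)  # header: Chr Pos rsID Reference_Allele Alternate_Allele
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue
            chrom, pos, rsid = fields[:3]
            counts[rsid] = 0
            by_pos[(chrom, pos)] = rsid
    return counts, by_pos


def lookup_table(snp_list, mqtl_files, out_file):
    """Write mQTL rows that match the GSCAN SNPs; return rsIDs never found.

    mqtl_files maps a chromosome label (str) to that chromosome's table path.
    """
    counts, by_pos = load_gscan(snp_list)
    wrote_header = False
    with open(out_file, "w") as out:
        for chrom, path in mqtl_files.items():
            with open(path) as mqtl:
                header = mqtl.readline()
                if not wrote_header:
                    out.write(header)
                    wrote_header = True
                for line in mqtl:
                    fields = line.split()
                    if not fields:
                        continue
                    parts = fields[0].split(":")
                    rsid = None
                    if parts[0] in counts:
                        rsid = parts[0]  # matched on rsID
                    elif len(parts) > 1:
                        # fall back to chromosome + position
                        rsid = by_pos.get((str(chrom), parts[1]))
                    if rsid is not None:
                        counts[rsid] += 1
                        out.write(line)
    return [rsid for rsid, n in counts.items() if n == 0]


if __name__ == "__main__":
    base_dir = "/shared/rti-nd/lookup/gscan/20210915"
    pattern = (base_dir + "/mqtl_mapping/filtered_MAF_gte_0.05_sorted_LIBD_NAc_MEGA"
               "_strat2df_new_model_chr{}_cis_qtl_table.txt")
    mqtl_files = {str(c): pattern.format(c) for c in range(1, 23)}
    if all(os.path.exists(p) for p in mqtl_files.values()):
        os.makedirs(base_dir + "/lookup_results", exist_ok=True)
        for table in TABLES:
            missing = lookup_table(
                "{}/{}.txt".format(base_dir, table),
                mqtl_files,
                "{}/lookup_results/2df_mqtl_mapping_{}_gscan_lookup.txt".format(base_dir, table),
            )
            print(table, "not found:", len(missing))
```

Including the chromosome in the position key avoids a false match when two different chromosomes happen to share a base-pair coordinate.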

## Report results in Excel
See the Excel Spreadsheet
`/Users/jmarks/OneDrive - Research Triangle Institute/Projects/nicotine/lookup/gscan/20210921/20210921_gscan_lookup_in_2df_mqtl_mapping.xlsx`

# Redo lookup with random SNPs: 20210923
```text
We’re getting a lot of significant results in here, which may reflect true biology or indicate some inflation. Can you pull a random set of 371 SNPs across the genome and repeat the lookup of QTL results?

Thanks,
Dana
```

Log into GOBOT sandbox server.

In [None]:
mkdir -p ~/rti-nd/lookup/gscan/20210923/mqtls/

# download data
cd ~/rti-nd/lookup/gscan/20210923/mqtls/
aws s3 sync s3://rti-nd/libd/results/methylation_qtl/strat_2df/maf_filtered_qtls/ . --quiet

cd ../
# get header
head -1 mqtls/filtered_MAF_gte_0.05_sorted_LIBD_NAc_MEGA_strat2df_new_model_chr1_cis_qtl_table.txt > \
    strat2df_cis_qtl_table_371_random_snps_across_genome.txt

# To get exactly 371 SNPs, we need to split the draws across the chromosomes.
# In particular, we will extract 17 random lines from each of the 19 largest
# chromosomes (1-19) and 16 random lines from each of the 3 smallest
# chromosomes (20-22): 17 * 19 + 16 * 3 = 371 total lines.
N=17
for chr in {1..19}; do
    file=mqtls/filtered_MAF_gte_0.05_sorted_LIBD_NAc_MEGA_strat2df_new_model_chr${chr}_cis_qtl_table.txt
    shuf -n $N <(tail -n +2 $file) >> strat2df_cis_qtl_table_371_random_snps_across_genome.txt
done

N=16
for chr in {20..22}; do
    file=mqtls/filtered_MAF_gte_0.05_sorted_LIBD_NAc_MEGA_strat2df_new_model_chr${chr}_cis_qtl_table.txt
    shuf -n $N <(tail -n +2 $file) >> strat2df_cis_qtl_table_371_random_snps_across_genome.txt
done

# This could potentially output non-unique SNPs since some SNPs are present on multiple lines.
# Let's make sure that all 371 lines are unique. (372 with header)
cut -f1 strat2df_cis_qtl_table_371_random_snps_across_genome.txt  | sort -u | wc -l
# 372

################################################################################
################################################################################

### Now perform lookup of those SNPs

# paste those 371 SNP IDs (keeping the header line, which the script skips) into a file named snp_list.txt, then perform the lookup
### Python3 ###
"""
*SNP lookup*

    Make sure the IDs are of the same format for the snp-list
    and the IDs in the meta-analysis results. e.g. 1000g_p3 or rsID only
"""
import gzip

table = "snp_list" # no file extension
base_dir = "/gobot/jmarks/rti-nd/lookup/gscan/20210923"  # no forward slash at the end

## dict to hold gscan snps and the number of times they were found.
## we can tell which SNPs did not show up in any of the mqtl mapping files
snp_list = "{}/{}.txt".format(base_dir, table)
progress = "Processing {}".format(table)

out_file = "{}/results/2df_mqtl_mapping_{}_gscan_lookup.txt".format(base_dir, table)
head_line = "false" # write the header to out file then change to true so it won't print it again
with open(snp_list) as gscanF, open(out_file, "w") as outF:
    gscan_line = gscanF.readline()

    ## create a dictionary containing the gscan snps (rsIDs)
    gscan_dict =  {}
    if len(gscan_dict) == 0:
        gscan_line = gscanF.readline() # 1st non-header line
        while gscan_line:
            rsid = gscan_line.split()[0]
            gscan_dict[rsid] = 0 # initiate the number of times it has been found
            gscan_line = gscanF.readline()
    print(gscan_dict)

    for chrom in range(1,23):
        which_chrom = "chr{}".format(chrom)
        print(which_chrom)
        mqtl_results = "{}/mqtls/filtered_MAF_gte_0.05_sorted_LIBD_NAc_MEGA_strat2df_new_model_chr{}_cis_qtl_table.txt".format(base_dir, chrom)
        with open(mqtl_results) as mqtlF:
            mqtl_head = mqtlF.readline()
            if head_line == "false":
                outF.write(mqtl_head)
                head_line = "true"

            mqtl_line = mqtlF.readline()
            while mqtl_line: # iterate over all of the variants
                variant_id = mqtl_line.split()[0] # SNP
                if variant_id in gscan_dict:
                    gscan_dict[variant_id] += 1 # mark the variant as found
                    outF.write(mqtl_line)

                mqtl_line = mqtlF.readline()

In [14]:
#17 * 19 + (16 * 3)
17 * 19


323
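
The 17/16 split can be checked in a few lines of Python, and the same sketch shows one way to sidestep the duplicate-SNP concern by sampling distinct SNP IDs instead of rows. The helpers `allocate` and `sample_unique_snps` are hypothetical and were not used to produce the results above. Sampling IDs also differs from `shuf` on rows: with `shuf`, SNPs that appear on many lines are more likely to be drawn.

```python
import random


def allocate(total, n_groups):
    """Split `total` draws across `n_groups` as evenly as possible.

    The first `total % n_groups` groups receive one extra draw.
    """
    base, extra = divmod(total, n_groups)
    return [base + 1 if i < extra else base for i in range(n_groups)]


def sample_unique_snps(path, n, rng):
    """Draw `n` distinct SNP IDs (first column) from one mQTL table."""
    with open(path) as handle:
        next(handle, None)  # skip the header line
        # A set removes SNPs repeated across CpG rows; sorting makes the
        # draw reproducible for a given seed.
        snps = sorted({line.split()[0] for line in handle if line.strip()})
    return rng.sample(snps, n)  # raises ValueError if n exceeds the SNP count


counts = allocate(371, 22)
print(counts.count(17), counts.count(16), sum(counts))  # 19 3 371
```

Here `divmod(371, 22)` gives a base of 16 with a remainder of 19, so 19 chromosomes get 17 draws and the remaining 3 get 16, matching the loops above.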