# Extension of GSCAN results to nicotine dependence (issue #59)

**Author:** Jesse Marks

We have preliminary results from a very large-scale genome-wide study of cigarette smoking phenotypes that relate to our nicotine dependence GWAS results. See supplemental Tables S6-S9 here:

`\\rcdcollaboration01.rti.ns\GxG\Analysis\GSCAN\shared MS version 1\`.

The phenotypes of interest to us include: 

1) **Age of smoking initiation (AI)** - supplemental table 6

2) **Cigarettes per day (CPD)** - supplemental table 7

3) **Smoking cessation (SC)** - supplemental table 8

4) **Smoking initiation (SI)** - supplemental table 9

We're interested in seeing whether these associations extend to nicotine dependence. I use the SNP look-up script to extend Tables S6-S9 with our GWAS results (analysis sets 044, 045, and 046, located here: `\\rcdcollaboration01.rti.ns\GxG\Analysis\META\1df`).

## Create directory structure and copy SNP look-up
The data that we will be parsing SNPs from are located on the `gxg share drive`. We will copy the data to our local machine due to bandwidth issues with the share drive.
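Because the copy runs over a slow share, it may be worth confirming that every per-chromosome file arrived before moving on. The sketch below is a hypothetical helper (`check_chr_files` is not part of the original workflow). It assumes the files follow the `<prefix>.chr<N>.<suffix>` naming used here and is demonstrated on dummy files in a temporary directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the chromosomes (1..22) whose file is absent or empty.
# Usage: check_chr_files <directory> <prefix> <suffix>
check_chr_files() {
    local dir=$1 prefix=$2 suffix=$3 chr
    for chr in {1..22}; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$dir/$prefix.chr$chr.$suffix" ]; then
            echo "missing: chr$chr"
        fi
    done
}

# Demo on dummy files (chr7 deliberately left out)
demo_dir=$(mktemp -d)
for chr in {1..22}; do
    if [ "$chr" -ne 7 ]; then
        echo "dummy" > "$demo_dir/demo.eur.chr$chr.exclude_singletons.1df"
    fi
done
report=$(check_chr_files "$demo_dir" demo.eur exclude_singletons.1df)
echo "$report"
```

An empty report means all 22 files are present and non-empty.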

In [None]:
# Create directory structure locally
cd /cygdrive/c/Users/jmarks/Desktop/Projects
mkdir -p Nicotine/GSCAN_extended_results_nicotine/develop/results/\
{044_results,045_results,046_results}/{Table_S6,Table_S7,Table_S8,Table_S9}

# each line inside the braces must end in a backslash, otherwise brace expansion fails
mkdir -p Nicotine/GSCAN_extended_results_nicotine/data/\
{044.eur13cohorts.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper,\
045.eur13cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper,\
046.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper}

# copy data over to local machine and change permission of file
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/

# copy 044 data
cp //rcdcollaboration01.rti.ns/gxg/Analysis/META/1df/044.eur13cohorts.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/*.1df \
/cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/044.eur13cohorts.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/

# copy 045 data
for i in //rcdcollaboration01.rti.ns/gxg/Analysis/META/1df/045.eur13cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+decode+eagle+sage+uw-tturc+gain+nongain+yale-penn+ntr+finnish+dental_caries+cogend2.eur.chr{1..22}.exclude_singletons.1df
do
cp $i 045.eur13cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/
done

# copy 046 data
for i in //rcdcollaboration01.rti.ns/gxg/Analysis/META/1df/046.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+sage+uw-tturc+gain+yale-penn+aand+jhs+cogend2.afr.chr{1..22}.exclude_singletons.1df
do
cp $i 046.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/
done


# change permission on all files 
chmod 755 044.eur13cohorts.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/*
chmod 755 045.eur13cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/*
chmod 755 046.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/*


cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/

# I need to create a directory for each table because each table has a different set of
# SNPs to look up. Each set of SNPs needs to be searched for in our GWAS results (044,045,046)
mkdir -p SNP_finder/{Table_S6,Table_S7,Table_S8,Table_S9}

# copy SNP look-up script to directory structure
for i in {6..9}; do  cp -r /cygdrive/c/Users/jmarks/Desktop/Code/SNP_finder/* SNP_finder/Table_S$i; done

# make a copy of the perl script in this directory for ease of use 
# (this is where it will actually be run from)

cp /cygdrive/c/Users/jmarks/Desktop/Code/SNP_finder/extract_rows.pl .

## Customize accompanying files to the SNP look-up script

The SNP look-up script takes as an argument a file which contains the SNPs that are to be searched for. The SNPs which will be of interest here are the SNPs listed in the supplementary tables at:

`//rcdcollaboration01.rti.ns/gxg/Analysis/GSCAN/shared MS version 1/Supplementary_Tables_S6-S12_Loci.xlsx`

* We are focusing on the SNPs from supplementary tables 6-9. The fourth column, titled `rsID`, contains the SNPs. We copy and paste this column into the accompanying file `SNP_ids.txt`, which is located in the same directory as the SNP look-up script, `extract_rows.pl`.

* The other accompanying file is `perlRun.txt`. This file is customized to give the location of all files needed for the SNP script to run, and it is located in the same directory as `extract_rows.pl` and `SNP_ids.txt`.
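For readers without the script at hand, the look-up amounts to keeping the rows of a results file whose marker matches an ID in `SNP_ids.txt`. The following is a minimal sketch of that idea in awk, not the actual `extract_rows.pl`. It assumes the marker name sits in the first column in the form `rsID:position:A1:A2` and that the first line is a header; the file names and contents are made up.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)

# Toy ID list (one rsID per line)
printf 'rs111\nrs333\n' > "$work/SNP_ids.txt"

# Toy results file: header plus three markers
printf 'MarkerName\tP-value\n'  >  "$work/chr1.1df"
printf 'rs111:1000:A:T\t0.01\n' >> "$work/chr1.1df"
printf 'rs222:2000:C:G\t0.50\n' >> "$work/chr1.1df"
printf 'rs333:3000:G:T\t0.20\n' >> "$work/chr1.1df"

# First file: load the IDs. Second file: print the header, then every row
# whose marker prefix (the text before the first ':') is in the ID list.
awk -F'\t' '
    NR == FNR { ids[$1]; next }
    FNR == 1  { print; next }
    { split($1, parts, ":"); if (parts[1] in ids) print }
' "$work/SNP_ids.txt" "$work/chr1.1df" > "$work/chr1.overlap.txt"

cat "$work/chr1.overlap.txt"
```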

The `perlRun.txt` file and the `extract_rows.pl` Perl script are located locally at 

## Supplementary Table 6
Run the SNP look-up script for the SNPs in supplementary table 6
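The loops in each table section differ only in the table number, the analysis directory, and the file-name pattern, so they could be generated from one function. The sketch below only prints the `perl` commands (a dry run). `print_lookup_cmds` and the shortened paths are hypothetical, while the flags are the ones used in this notebook.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print one extract_rows.pl command per chromosome.
# Usage: print_lookup_cmds <analysis> <table> <data_dir> <prefix> <suffix>
print_lookup_cmds() {
    local analysis=$1 table=$2 data_dir=$3 prefix=$4 suffix=$5 chr
    for chr in {1..22}; do
        printf 'perl extract_rows.pl --source %s --id_list %s --out %s --header 1 --id_column 0\n' \
            "$data_dir/$prefix.chr$chr.$suffix" \
            "SNP_finder/Table_S$table/SNP_ids.txt" \
            "results/${analysis}_results/Table_S$table/chr$chr.overlap.txt"
    done
}

# Example: table 6 against a placeholder 045 directory and file prefix
cmds=$(print_lookup_cmds 045 6 data/045_dir cohorts.eur exclude_singletons.1df)
echo "$cmds" | head -n 2
```

Once the printed commands look right, they can be piped to `bash` to run them. Looping over the chromosome number directly also removes the need for the separate `j` counter.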

In [None]:
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop

# Look up the SNPs from supplementary table 6 in the 044 data (cross-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/044.eur13cohorts.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+decode+eagle+sage+uw-tturc+gain+nongain+yale-penn+ntr+finnish+aand+jhs+dental_caries+cogend2.afr+eur.chr{1..22}.1df; do
let j++
# location of the Perl SNP look-up script 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S6/SNP_ids.txt \
--out results/044_results/Table_S6/chr$j.overlap.txt \
--header 1 \
--id_column 0
done



# Lookup the SNPs from supplementary table 6 in the 045 data (EA-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/045.eur13cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+decode+eagle+sage+uw-tturc+gain+nongain+yale-penn+ntr+finnish+dental_caries+cogend2.eur.chr{1..22}.exclude_singletons.1df; do

let j++ 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S6/SNP_ids.txt \
--out results/045_results/Table_S6/chr$j.overlap.txt \
--header 1 \
--id_column 0
done


# Lookup the SNPs from supplementary table 6 in the 046 data (AA-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/046.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+sage+uw-tturc+gain+yale-penn+aand+jhs+cogend2.afr.chr{1..22}.exclude_singletons.1df; do
let j++
# location of the Perl SNP look-up script 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S6/SNP_ids.txt \
--out results/046_results/Table_S6/chr$j.overlap.txt \
--header 1 \
--id_column 0
done

## Supplementary Table 7
Run the SNP look-up script for the SNPs in supplementary table 7

In [None]:
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop

# Look up the SNPs from supplementary table 7 in the 044 data (cross-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/044.eur13cohorts.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+decode+eagle+sage+uw-tturc+gain+nongain+yale-penn+ntr+finnish+aand+jhs+dental_caries+cogend2.afr+eur.chr{1..22}.1df; do
let j++
# location of the Perl SNP look-up script 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S7/SNP_ids.txt \
--out results/044_results/Table_S7/chr$j.overlap.txt \
--header 1 \
--id_column 0
done


# Lookup the SNPs from supplementary table 7 in the 045 data (EA-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/045.eur13cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+decode+eagle+sage+uw-tturc+gain+nongain+yale-penn+ntr+finnish+dental_caries+cogend2.eur.chr{1..22}.exclude_singletons.1df; do
let j++ 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S7/SNP_ids.txt \
--out results/045_results/Table_S7/chr$j.overlap.txt \
--header 1 \
--id_column 0
done


# Lookup the SNPs from supplementary table 7 in the 046 data (AA-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/046.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+sage+uw-tturc+gain+yale-penn+aand+jhs+cogend2.afr.chr{1..22}.exclude_singletons.1df; do
let j++
# location of the Perl SNP look-up script 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S7/SNP_ids.txt \
--out results/046_results/Table_S7/chr$j.overlap.txt \
--header 1 \
--id_column 0
done

## Supplementary Table 8
Run the SNP look-up script for the SNPs in supplementary table 8

In [None]:
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop

# Look up the SNPs from supplementary table 8 in the 044 data (cross-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/044.eur13cohorts.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+decode+eagle+sage+uw-tturc+gain+nongain+yale-penn+ntr+finnish+aand+jhs+dental_caries+cogend2.afr+eur.chr{1..22}.1df; do
let j++
# location of the Perl SNP look-up script 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S8/SNP_ids.txt \
--out results/044_results/Table_S8/chr$j.overlap.txt \
--header 1 \
--id_column 0
done


# Lookup the SNPs from supplementary table 8 in the 045 data (EA-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/045.eur13cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+decode+eagle+sage+uw-tturc+gain+nongain+yale-penn+ntr+finnish+dental_caries+cogend2.eur.chr{1..22}.exclude_singletons.1df; do
let j++ 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S8/SNP_ids.txt \
--out results/045_results/Table_S8/chr$j.overlap.txt \
--header 1 \
--id_column 0
done


# Lookup the SNPs from supplementary table 8 in the 046 data (AA-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/046.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+sage+uw-tturc+gain+yale-penn+aand+jhs+cogend2.afr.chr{1..22}.exclude_singletons.1df; do
let j++
# location of the Perl SNP look-up script 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S8/SNP_ids.txt \
--out results/046_results/Table_S8/chr$j.overlap.txt \
--header 1 \
--id_column 0
done

## Supplementary Table 9
Run the SNP look-up script for the SNPs in supplementary table 9

In [None]:
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop

# Look up the SNPs from supplementary table 9 in the 044 data (cross-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/044.eur13cohorts.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+decode+eagle+sage+uw-tturc+gain+nongain+yale-penn+ntr+finnish+aand+jhs+dental_caries+cogend2.afr+eur.chr{1..22}.1df; do
let j++
# location of the Perl SNP look-up script 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S9/SNP_ids.txt \
--out results/044_results/Table_S9/chr$j.overlap.txt \
--header 1 \
--id_column 0
done


# Lookup the SNPs from supplementary table 9 in the 045 data (EA-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/045.eur13cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+decode+eagle+sage+uw-tturc+gain+nongain+yale-penn+ntr+finnish+dental_caries+cogend2.eur.chr{1..22}.exclude_singletons.1df; do
let j++ 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S9/SNP_ids.txt \
--out results/045_results/Table_S9/chr$j.overlap.txt \
--header 1 \
--id_column 0
done


# Lookup the SNPs from supplementary table 9 in the 046 data (AA-ancestry)
j=0
for i in /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/data/046.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper/cogend+copdgene+sage+uw-tturc+gain+yale-penn+aand+jhs+cogend2.afr.chr{1..22}.exclude_singletons.1df; do
let j++
# location of the Perl SNP look-up script 
perl extract_rows.pl \
--source $i \
--id_list SNP_finder/Table_S9/SNP_ids.txt \
--out results/046_results/Table_S9/chr$j.overlap.txt \
--header 1 \
--id_column 0
done

## Copy results to GxG drive
Due to issues with copying the files from Windows to the gxg drive, the best approach I have found is to first tarball the data, then copy it over, and finally untar it in the new location.
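Since the transfer itself is the unreliable step, a checksum comparison before unpacking can catch a truncated archive. This is a minimal sketch of that extra check, not part of the original workflow; two temporary directories stand in for the local `develop/` directory and the share-drive destination.

```shell
#!/usr/bin/env bash
src=$(mktemp -d)    # stands in for the local develop/ directory
dest=$(mktemp -d)   # stands in for the share-drive destination

# Toy results tree
mkdir -p "$src/results/044_results/Table_S6"
echo "rs111:1000:A:T 0.01" > "$src/results/044_results/Table_S6/chr1.overlap.txt"

# Archive, copy, and compare checksums before unpacking
tar -czf "$src/results.tar.gz" -C "$src" results
cp "$src/results.tar.gz" "$dest/"
sum_src=$(cksum < "$src/results.tar.gz")
sum_dest=$(cksum < "$dest/results.tar.gz")

if [ "$sum_src" = "$sum_dest" ]; then
    tar -xzf "$dest/results.tar.gz" -C "$dest"
    rm "$dest/results.tar.gz"
    echo "copy verified and unpacked"
else
    echo "checksum mismatch, archive left in place" >&2
fi
```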

In [None]:
# local machine
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/

tar -czvf results.tar.gz results/
cp results.tar.gz //rcdcollaboration01.rti.ns/gxg/Analysis/META/GSCAN_extended_results_nicotine/

cd //rcdcollaboration01.rti.ns/gxg/Analysis/META/GSCAN_extended_results_nicotine/

# untar
tar -xzvf results.tar.gz
rm results.tar.gz

## Some of the SNPs did not show up in the results
We will determine which SNPs did not show up in the results and why this happened.
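The comparison in the cells that follow relies on `grep -v -f`, which treats every line of the pattern file as a substring pattern. The `wc -l` sanity check in each cell guards against surprises, but a stricter variant is sketched here on made-up rsIDs. It assumes only that result marker names have the form `rsID:position:A1:A2`, and it uses `tr` instead of vim to drop DOS line endings.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)

# Toy inputs: the look-up list (with DOS line endings) and result marker names
printf 'rs1234\r\nrs12345\r\nrs9\r\n' > "$work/SNP_ids.txt"
printf 'rs1234:100:A:T\nrs9:500:C:G\n' > "$work/results_markers.txt"

# Normalize: drop carriage returns, keep only the rsID before the first ':'
tr -d '\r' < "$work/SNP_ids.txt" > "$work/ids.clean.txt"
tr -d '\r' < "$work/results_markers.txt" | sed 's/:.*//' > "$work/found.clean.txt"

# Substring matching: the pattern rs1234 also matches rs12345 and hides it
# (grep exits with 1 when it prints nothing, hence the "|| true")
grep -v -f "$work/found.clean.txt" "$work/ids.clean.txt" > "$work/missing_substring.txt" || true

# Whole-line, fixed-string matching reports rs12345 as missing
grep -v -x -F -f "$work/found.clean.txt" "$work/ids.clean.txt" > "$work/missing_exact.txt" || true

cat "$work/missing_exact.txt"
```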

### 044

#### Supplemental table 6 - Age of Initiation (AI) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop
mkdir -p missing_snps/{044,045,046}
cd missing_snps/044

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# it into vim; save as 044_S6.txt
vim 044_S6.txt

# Because the results marker names are of the form rs1403174:2032865:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 044_S6.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S6/SNP_ids.txt
:set ff=unix # in vim

vim 044_S6.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 044_S6.txt ../../SNP_finder/Table_S6/SNP_ids.txt > 044_S6_missing_snps.txt
cat 044_S6_missing_snps.txt | clip
# result
'''
rs12611472 # chr2 
'''

# sanity check
wc -l ../../SNP_finder/Table_S6/SNP_ids.txt
wc -l 044_S6.txt

#result
'''
10
9
'''

cd //rcdcollaboration01.rti.ns/gxg/Analysis/META/1df/044.eur13cohorts.afr.9cohorts.eagle_lung.jhs_aric.1000G_p3_markerName_FinalDatasetDNMT3Bpaper
# could not find this snp in the file.
grep 'rs12611472' cogend+copdgene+decode+eagle+sage+uw-tturc+gain+nongain+yale-penn+ntr+finnish+aand+jhs+dental_caries+cogend2.afr+eur.chr2.1df


#### Supplemental table 7 - Cigarettes per Day (CPD) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/044/

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# into vim marker names- save as 044_S7.txt 
vim 044_S7.txt

# Because the results marker names are of the form rs1403174:2032865:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 044_S7.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S7/SNP_ids.txt
:set ff=unix # in vim

vim 044_S7.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 044_S7.txt ../../SNP_finder/Table_S7/SNP_ids.txt > 044_S7_missing_snps.txt
cat 044_S7_missing_snps.txt | clip
# result
'''
rs28813180 # chr 3   
rs4886550 # chr 15  
'''

# sanity check
wc -l ../../SNP_finder/Table_S7/SNP_ids.txt
wc -l 044_S7.txt

#result
'''
55
53
'''

#### Supplemental table 8 - Smoking Cessation (SC) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/044/

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# into vim marker names- save as 044_S8.txt 
vim 044_S8.txt

# Because the results marker names are of the form rs1403184:2032865:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 044_S8.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S8/SNP_ids.txt
:set ff=unix # in vim

vim 044_S8.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 044_S8.txt ../../SNP_finder/Table_S8/SNP_ids.txt > 044_S8_missing_snps.txt
cat 044_S8_missing_snps.txt | clip
# result
'''
No missing
'''

# sanity check
wc -l ../../SNP_finder/Table_S8/SNP_ids.txt
wc -l 044_S8.txt

#result
'''
24
24
'''

#### Supplemental table 9 - Smoking Initiation (SI) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/044/

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# into vim marker names- save as 044_S9.txt 
vim 044_S9.txt

# Because the results marker names are of the form rs1403194:2032965:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 044_S9.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S9/SNP_ids.txt
:set ff=unix # in vim

vim 044_S9.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 044_S9.txt ../../SNP_finder/Table_S9/SNP_ids.txt > 044_S9_missing_snps.txt
cat 044_S9_missing_snps.txt | clip
# result
'''
rs3076896 # chr 2
rs55900829 # chr 4 
rs181508347 # chr 5
rs79180767 # chr 6
rs10698713 # chr 6
rs112913817 # chr 7
rs78239456 # chr 11
rs2145451 # chr 14
rs12442563 # chr 15
rs72836318 # chr 17 
rs2359180 # chr 18 
'''

# sanity check
wc -l ../../SNP_finder/Table_S9/SNP_ids.txt
wc -l 044_S9.txt

#result
'''
376
365
'''

Combine all of the missing snps into one text file.
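The copy/paste step could also be scripted. The sketch below assumes the annotated lists (rsID followed by a `# chr N` note, as in the result blocks above) were saved to files; the names `044_S*_annotated.txt` are hypothetical. The `sed` expression also removes any run of spaces before the `#`, so no trailing whitespace is left on the rsID.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)

# Toy per-table lists in the "rsID # chr N" form used in the result notes
printf 'rs12611472 # chr2\n' > "$work/044_S6_annotated.txt"
printf 'rs28813180 # chr 3\nrs4886550  # chr 15\n' > "$work/044_S7_annotated.txt"
printf 'rs12611472 # chr2\n' > "$work/044_S9_annotated.txt"

# Concatenate, strip the chromosome note together with any spaces before it,
# drop blank lines, and de-duplicate
cat "$work"/044_S*_annotated.txt \
    | sed 's/[[:space:]]*#.*//' \
    | grep -v '^$' \
    | sort -u > "$work/044_missing_snps_no_chrom.txt"

cat "$work/044_missing_snps_no_chrom.txt"
```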

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/044/
mkdir combined
cd combined

# copy paste all missing snps from 044 into here
vim 044_missing_snps_with_chrom.txt
sed 's/\s#.*//g' 044_missing_snps_with_chrom.txt > 044_missing_snps_no_chrom.txt


### 045

#### Supplemental table 6 - Age of Initiation (AI) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/045

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# it into vim; save as 045_S6.txt
vim 045_S6.txt

# Because the results marker names are of the form rs1403174:2032865:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 045_S6.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S6/SNP_ids.txt
:set ff=unix # in vim

vim 045_S6.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 045_S6.txt ../../SNP_finder/Table_S6/SNP_ids.txt > 045_S6_missing_snps.txt
cat 045_S6_missing_snps.txt | clip
# result
'''
rs12611472 # chr2  
'''


# sanity check
wc -l ../../SNP_finder/Table_S6/SNP_ids.txt
wc -l 045_S6.txt

#result
'''
10
9
'''

#### Supplemental table 7 - Cigarettes per Day (CPD) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/045/

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# into vim marker names- save as 045_S7.txt 
vim 045_S7.txt

# Because the results marker names are of the form rs1403174:2032865:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 045_S7.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S7/SNP_ids.txt
:set ff=unix # in vim

vim 045_S7.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 045_S7.txt ../../SNP_finder/Table_S7/SNP_ids.txt > 045_S7_missing_snps.txt
cat 045_S7_missing_snps.txt | clip

# result
'''
rs28813180 # chr 3   
rs4886550 # chr 15 
'''


# sanity check
wc -l ../../SNP_finder/Table_S7/SNP_ids.txt
wc -l 045_S7.txt

#result
'''
55
53
'''

#### Supplemental table 8 - Smoking Cessation (SC) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/045/

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# into vim marker names- save as 045_S8.txt 
vim 045_S8.txt

# Because the results marker names are of the form rs1403184:2032865:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 045_S8.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S8/SNP_ids.txt
:set ff=unix # in vim

vim 045_S8.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 045_S8.txt ../../SNP_finder/Table_S8/SNP_ids.txt > 045_S8_missing_snps.txt
cat 045_S8_missing_snps.txt | clip
# result
'''
NA
'''

# sanity check
wc -l ../../SNP_finder/Table_S8/SNP_ids.txt
wc -l 045_S8.txt

#result
'''
24
24
'''

#### Supplemental table 9 - Smoking Initiation (SI) 

In [1]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/045/

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# into vim marker names- save as 045_S9.txt 
vim 045_S9.txt

# Because the results marker names are of the form rs1403194:2032965:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 045_S9.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S9/SNP_ids.txt
:set ff=unix # in vim

vim 045_S9.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 045_S9.txt ../../SNP_finder/Table_S9/SNP_ids.txt > 045_S9_missing_snps.txt
cat 045_S9_missing_snps.txt | clip 
# result
'''
rs74664784 # chr 3
rs181508347 # chr 5
rs10259715 # chr 7
rs111842178 # chr 10
rs3076896 # chr 2
rs55900829 # chr 4 
rs79180767 # chr 6
rs10698713 # chr 6
rs112913817 # chr 7
rs78239456 # chr 11
rs2145451 # chr 14
rs12442563 # chr 15
rs72836318 # chr 17 
rs2359180 # chr 18 
'''

# sanity check
wc -l ../../SNP_finder/Table_S9/SNP_ids.txt
wc -l 045_S9.txt

#result
'''
376
362
'''


Combine all of the missing snps into one text file.

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/045/
mkdir combined
cd combined

# copy paste all missing snps from 045 into here
vim 045_missing_snps_with_chrom.txt
sed 's/\s#.*//g' 045_missing_snps_with_chrom.txt > 045_missing_snps_no_chrom.txt

### 046

#### Supplemental table 6 - Age of Initiation (AI) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/046

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# it into vim; save as 046_S6.txt
vim 046_S6.txt

# Because the results marker names are of the form rs1403174:2032865:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 046_S6.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S6/SNP_ids.txt
:set ff=unix # in vim

vim 046_S6.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 046_S6.txt ../../SNP_finder/Table_S6/SNP_ids.txt > 046_S6_missing_snps.txt
cat 046_S6_missing_snps.txt | clip 
# result
'''
rs12611472 # chr 2
rs11780471 # chr 8
'''

# sanity check
wc -l ../../SNP_finder/Table_S6/SNP_ids.txt
wc -l 046_S6.txt

#result
'''
10
8
'''

#### Supplemental table 7 - Cigarettes per Day (CPD) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/046/

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# into vim marker names- save as 046_S7.txt 
vim 046_S7.txt

# Because the results marker names are of the form rs1403174:2032865:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 046_S7.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S7/SNP_ids.txt
:set ff=unix # in vim

vim 046_S7.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 046_S7.txt ../../SNP_finder/Table_S7/SNP_ids.txt > 046_S7_missing_snps.txt
cat 046_S7_missing_snps.txt | clip 

# result
'''
rs11264100 # chr 1
rs2072659 # chr 1
rs3497346 # chr 1
rs28813180 # chr 3 
rs73229090 # chr 8
rs4886550 # chr 15 
rs143200968 # chr 19
rs117824460 # chr19
'''


# sanity check
wc -l ../../SNP_finder/Table_S7/SNP_ids.txt
wc -l 046_S7.txt

#result
'''
55
47
'''

#### Supplemental table 8 - Smoking Cessation (SC) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/046/

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# into vim marker names- save as 046_S8.txt 
vim 046_S8.txt

# Because the results marker names are of the form rs1403184:2032865:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 046_S8.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S8/SNP_ids.txt
:set ff=unix # in vim

vim 046_S8.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 046_S8.txt ../../SNP_finder/Table_S8/SNP_ids.txt > 046_S8_missing_snps.txt
cat 046_S8_missing_snps.txt | clip 

# result
'''
rs3025327 # chr 9
rs145580088 # chr 19
rs117824460 # chr 19
rs4809543 # chr20
rs6089904 # chr20
'''

# sanity check
wc -l ../../SNP_finder/Table_S8/SNP_ids.txt
wc -l 046_S8.txt

#result
'''
24
19
'''

#### Supplemental table 9 - Smoking Initiation (SI) 

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/046/

# Copy column 1 (MarkerName) from the results spreadsheet and paste
# into vim marker names- save as 046_S9.txt 
vim 046_S9.txt

# Because the results marker names are of the form rs1403194:2032965:A:T
# we have to remove the trailing characters and keep just the rsID
sed -i 's/:.*//' 046_S9.txt

# make sure both files are in unix format since we copy/pasted from Windows
# i.e. want to convert the files from DOS to unix
vim ../../SNP_finder/Table_S9/SNP_ids.txt
:set ff=unix # in vim

vim 046_S9.txt
:set ff=unix # in vim

# compare the two files and see which snps are missing in the results
grep -v -f 046_S9.txt ../../SNP_finder/Table_S9/SNP_ids.txt > 046_S9_missing_snps.txt
cat 046_S9_missing_snps.txt | clip 
# result
'''
rs301807 # chr 1
rs3820277 # chr 1
rs1889571 # chr 1
rs10914684 # chr 1
rs2637869 # chr 1
rs12755632 # chr 1
rs951740 # chr 1
rs925524 # chr 1
rs12022778 # chr 1
rs11587399 # chr 1
rs4912332 # chr 1
rs1937443 # chr 1
rs1022528 # chr 1
rs12740789 # chr 1
rs80054503 # chr 1
rs10789369 # chr 1
rs1514176 # chr 1
rs10873871 # chr 1
rs11162019 # chr 1
rs1008078 # chr 1
rs1935571 # chr 1
rs12027999 # chr 1
rs45444697 # chr 1
rs2901785 # chr 1
rs147052174 # chr 1
rs35656245 # chr 1
rs12739243 # chr 1
rs12563365 # chr 1
rs876793 # chr 1
rs62106258 # chr 2
rs72790288 # chr 2
rs3076896 # chr 2
rs74664784 # chr 3
rs55900829 # chr 4
rs62340589 # chr 4
rs35375873 # chr 5
rs181508347 # chr 5
rs79180767 # chr 6
rs10698713 # chr 6 
rs10259715 # chr 7
rs112913817 # chr 7
rs11780471 # chr 8
rs11783093 # chr 8
rs111842178 # chr 10
rs76460663 # chr 11
rs78239456 # chr 11
rs11611651 # chr 12
rs1108130 # chr 13
rs2145451  # chr 14
rs12442563 # chr 15
rs117657830 #chr 16
rs72836318 #chr 17 
rs2359180 # chr 18 
rs76608582 # chr 19
rs6050446 # chr 20
'''

# sanity check
wc -l ../../SNP_finder/Table_S9/SNP_ids.txt
wc -l 046_S9.txt

#result
'''
376
321
#difference 55
'''

In [None]:
# command line
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps/046/
mkdir combined
cd combined

# copy paste all missing snps from 046 into here
vim 046_missing_snps_with_chrom.txt

# make a copy which does not include the chromosome at the end
# (\s* also removes any extra spaces before the '#')
sed 's/\s*#.*//g' 046_missing_snps_with_chrom.txt > 046_missing_snps_no_chrom.txt

# Retrieve Missing SNPs
It was discovered that the missing SNPs were filtered out of the final meta-analysis data sets by the MAF threshold of 0.01. We want to recapture the SNPs that were filtered out. Because only a small number of SNPs were missing (69), we pulled them from the individual cohorts and analyzed them manually.

**Steps to retrieve missing SNPs**
1. Generate a union set SNP lookup list that includes all 3 meta analyses

2. Identify the union set of cohorts across the 3 meta analyses

3. For each cohort, find its corresponding `*.stats.gz` file in its GWAS directory. This can be partially deduced from the meta-analysis methods files on MIDAS (example: `/share/nas03/bioinformatics_group/data/studies/gxg/meta-analyses/118/_methods.gxg.meta-analyses.118.sh`) in conjunction with John's email suggestion (`<COHORT ROOT>/.../association_tests/<seq_number>/<ethnicity>/processing/chr<chr>/*.stats.gz`)

4. Modify the extract.dental.py code for each cohort and check that the column matching specification within the code is correct

5. Run the python script for each cohort

6. Copy the lookup results to the respective meta analysis subdirectories and subset the results by the analysis specific lookup lists
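For step 3, `find` with a `-path` pattern is one way to list candidate files under a cohort root. The sketch below builds a toy directory tree that follows the path template quoted in step 3; the cohort name `demo`, the sequence number `001`, and the ethnicity label `ea` are placeholders, not real directory names.

```shell
#!/usr/bin/env bash
# Toy cohort tree following the path template quoted in step 3
cohort_root=$(mktemp -d)
for chr in 1 2; do
    dir="$cohort_root/demo/association_tests/001/ea/processing/chr$chr"
    mkdir -p "$dir"
    : > "$dir/demo.chr$chr.stats.gz"      # empty placeholder file
done
mkdir -p "$cohort_root/demo/other"
: > "$cohort_root/demo/other/unrelated.stats.gz"

# List candidate files; in -path patterns, * also matches '/'
find "$cohort_root" -type f \
    -path '*/association_tests/*/processing/chr*/*.stats.gz' \
    | sort > "$cohort_root/stats_files.txt"

cat "$cohort_root/stats_files.txt"
```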


## Generate a SNP look-up list

In [None]:
## command line - local ##
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps

cat 044/combined/044_missing_snps_no_chrom.txt\
    045/combined/045_missing_snps_no_chrom.txt\
    046/combined/046_missing_snps_no_chrom.txt > all_missing_snps_no_chrom.txt

# make uniq list
sort all_missing_snps_no_chrom.txt | uniq > all_missing_unique.txt

### Convert SNP list to proper format
We first convert the list of all missing SNPs into the proper format with [NCBI](https://www.ncbi.nlm.nih.gov/projects/SNP/dbSNP.cgi?list=rsfile). Once the SNP document has been created, you will receive an email with the file. The missing SNP list needs to be in the format `<rsID>\t<chr>\t<position>`. Use the CHROMOSOME RPT format from the website, then edit the text file by cutting the necessary fields from it.

![image.png](attachment:image.png)
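The reformatting can also be done in a single awk pass. The sketch below runs on a made-up report and takes its layout from the commands in this section: six header lines, tab-separated fields, and the rs number, chromosome, and position in columns 1, 7, and 12. Those column meanings are an assumption carried over from the `cut -f1,7,12` call, not something checked against the NCBI format documentation.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)
report="$work/report.txt"

# Toy report: 6 header lines, then 12 tab-separated columns per SNP.
# Only columns 1 (rs number), 7 (chromosome) and 12 (position) matter here.
for n in 1 2 3 4 5 6; do echo "header line $n"; done > "$report"
printf '2145451\ta\tb\tc\td\te\t14\tf\tg\th\ti\t28847636\n' >> "$report"
printf '9999999\ta\tb\tc\td\te\t3\tf\tg\th\ti\t\n'          >> "$report"

# Skip the header block, prefix "rs", keep rows that have chromosome and position
awk -F'\t' -v OFS='\t' '
    BEGIN { print "MarkerName", "chr", "pos" }
    NR > 6 && $7 != "" && $12 != "" { print "rs" $1, $7, $12 }
' "$report" > "$work/SNPlist.txt"

cat "$work/SNPlist.txt"
```

The second toy row has no position and is dropped, which mirrors the filter on missing chromosome positions.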

In [None]:
# command line #
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/missing_snps_from_first_set_of_results

# file generated from NCBI website (a plain text file, so no trailing slash)
cp ~/Downloads/180112160217 .

# extracting the rsID, chr, and chr position
cut -f1,7,12 180112160217 > all_missing_unique_formatted.txt

# remove headings
tail -n +7 all_missing_unique_formatted.txt > all_missing_unique_formatted_2.txt

# add the "rs" prefix to the first column
# (assigning to $1 also makes awk re-join the fields with single spaces)
awk '$1="rs"$1' all_missing_unique_formatted_2.txt > all_missing_unique_formatted_3.txt

# add a header 
awk 'BEGIN { print "MarkerName", "chr", "pos" } { print $0, "" }' all_missing_unique_formatted_3.txt > all_missing_unique_formatted_4.txt

# make file tab delimited
sed 's/ /\t/g' all_missing_unique_formatted_4.txt > all_missing_unique_formatted_5.txt

# remove SNPs from the list if they were missing information about chromosome position
awk -F'\t' '$3!=""' all_missing_unique_formatted_5.txt > all_missing_unique_formatted_6.txt

# rename and clean up directory
cp all_missing_unique_formatted_6.txt SNPlist.txt
rm all_missing_unique*

# upload SNPlist.txt to MIDAS
scp SNPlist.txt jmarks@rtplhpc01.rti.ns:~/

**Note**: Two SNPs did not show up in the lookup, so I entered them manually into `SNPlist.txt`. The table below details those two SNPs.

| MarkerName | chr | pos      |
|------------|-----|----------|
| rs2145451  | 14  | 28847636 |
| rs3497346  | 3   | 34026282 |
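
The formatting steps above (cut columns 1, 7, and 12, skip the first six lines, prefix `rs`, drop SNPs without a position, add a header) can also be sketched in Python. This is only an illustration under the same assumptions the shell commands make about the report layout. The function names and the toy report lines are hypothetical; the two manually added SNPs are the ones from the table above.

```python
def format_report(lines, header_lines=6):
    """Turn a tab-delimited NCBI report into (MarkerName, chr, pos) rows."""
    rows = [("MarkerName", "chr", "pos")]
    for line in lines[header_lines:]:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # too short to hold columns 1, 7 and 12
        snp_id, chrom, pos = fields[0], fields[6], fields[11]
        if snp_id and chrom and pos:  # drop SNPs without a chromosome position
            rows.append(("rs" + snp_id, chrom, pos))
    return rows


def make_line(snp_id, chrom, pos):
    """Build a made-up 12-column report line for the demonstration."""
    fields = [""] * 12
    fields[0], fields[6], fields[11] = snp_id, chrom, pos
    return "\t".join(fields)


# Six placeholder heading lines, one complete SNP, one SNP without a position.
report = ["header"] * 6 + [make_line("123", "14", "1000"), make_line("456", "", "")]
rows = format_report(report)

# The two SNPs that were missing from the lookup are appended by hand.
rows += [("rs2145451", "14", "28847636"), ("rs3497346", "3", "34026282")]
snplist_text = "\n".join("\t".join(row) for row in rows) + "\n"
print(snplist_text, end="")
```

Doing the filtering on the parsed fields avoids the intermediate files and the space-to-tab conversion that the shell version needs.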

### Identify union of cohorts
The description of all GxG analyses is located at: `\\rcdcollaboration01\GxG\Analysis\analysis_descriptions.xlsx`
According to the `META` tab of that spreadsheet, the cohorts involved in these meta-analyses (union set) are:


| EA Cohorts    | AA Cohorts   |
|---------------|--------------|
| COGEND        | AAND+COGEND2 |
| COPDGene      | COGEND       |
| deCODE        | COPDGene     |
| EAGLE         | GAIN         |
| GAIN          | SAGE         |
| nonGAIN       | Yale_Penn    |
| SAGE          | JHS          |
| UW-TTURC      | UW-TTURC     |
| Yale_Penn     | COGEND2      |
| NTR           |              |
| Finn_Twin     |              |
| Dental_Caries |              |
| COGEND2       |              |


**Note:** For deCODE we only have final results. Use the 1df files after genomic control:
`/share/nas03/bioinformatics_group/data/studies/decode/imputed/v1/association_tests/001/GC/mapping_marker_1000G_p3/decode.eur.GC.chr$chr.1df`

* I had to use the SNP look-up script from my local machine:
`~/Desktop/Projects/Nicotine/GSCAN_extended_results_nicotine/develop/SNP_finder/`

COGEND2 is part of AAND_COGEND2
`/share/nas04/bioinformatics_group/data/studies/aand_cogend2/imputed/v2/association_tests/014/<ea or aa>/`


## Missing SNP-lookup
The prefiltered data in which we want to look up the missing SNPs is on MIDAS as of 20180126. This analysis requires the following files:

1) `SNPlist.txt`
* `SNPlist.txt` contains the SNPs of interest for the look-up. The file must be in the `<rsID>\t<chr>\t<position>` format.

2) `studies.json`
* The `studies.json` file specifies which cohorts and which data are used for the SNP look-up.

3) `generate_sh_for_multiple_gz.py`
* This script is run to create a number of shell scripts, which perform the actual SNP look-up. You can run them individually, or create a script that executes all of the generated scripts.

4) `merge.R`
* Lastly, once cohort-specific scripts have been executed to perform the SNP look-up, you can use this file to combine all of the results from each cohort in the meta-analysis.
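
Because the look-up depends on `SNPlist.txt` being in the `<rsID>\t<chr>\t<position>` format, a quick format check before launching the cohort scripts can save a failed run. Below is a minimal sketch of such a check. It is not part of the look-up tooling: the function name `check_snplist` is hypothetical, and it assumes the first line is a header and that only the first three tab-separated fields matter.

```python
import re


def check_snplist(text):
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    # Line 1 is assumed to be the header, so data starts on line 2.
    for number, line in enumerate(text.splitlines()[1:], start=2):
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append(f"line {number}: expected 3 tab-separated fields")
            continue
        marker, chrom, pos = fields[:3]
        if not re.fullmatch(r"rs\d+", marker):
            problems.append(f"line {number}: marker {marker!r} is not an rsID")
        if not chrom:
            problems.append(f"line {number}: chromosome is empty")
        if not pos.isdigit():
            problems.append(f"line {number}: position {pos!r} is not an integer")
    return problems


good = "MarkerName\tchr\tpos\nrs2145451\t14\t28847636\nrs3497346\t3\t34026282\n"
bad = "MarkerName\tchr\tpos\n2145451\t14\tabc\nrs1 14 100\n"
print(check_snplist(good))
for problem in check_snplist(bad):
    print(problem)
```

The second toy input shows the two most likely mistakes: a missing `rs` prefix and a space-delimited line.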

### Merge results from missing SNP look-up
With the `merge.R` script, we merge the results from the missing SNP look-up. The script needs to be modified to match the corresponding results files from each meta-analysis. Then start an R session and paste the contents of `merge.R` into the R environment. Lastly, write the results to a table as shown below. The resulting tables can then be combined into one Excel workbook, with three sheets in this case.

In [None]:
# Write this into the R environment after pasting contents of merge.R
write.table(dat, "044_combine_data.tsv", sep="\t", row.names=FALSE)

#write.table(dat, "045_combine_data.tsv", sep="\t", row.names=FALSE)
#write.table(dat, "046_combine_data.tsv", sep="\t", row.names=FALSE)

## Meta-analysis calculation 
This calculation needs to be performed for each recovered SNP to determine its p-value for the meta-analysis.
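
The notebook does not state which meta-analysis scheme was used for this manual calculation. As an illustration only, the sketch below assumes a fixed-effects, inverse-variance-weighted combination of per-cohort effect estimates, which is a common choice for 1df GWAS results. The function name `ivw_meta` and the input values are hypothetical.

```python
import math


def ivw_meta(betas, ses):
    """Fixed-effects inverse-variance-weighted meta-analysis for one SNP.

    betas: per-cohort effect estimates
    ses:   per-cohort standard errors (same order as betas)
    Returns the pooled beta, pooled standard error, z score and two-sided p-value.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Made-up estimates for a single SNP in two cohorts.
beta, se, z, p = ivw_meta([0.1, 0.3], [0.1, 0.1])
print(f"beta={beta:.3f} se={se:.4f} z={z:.3f} p={p:.4g}")
```

With equal standard errors the pooled estimate is simply the average of the cohort estimates; cohorts with smaller standard errors would otherwise receive more weight.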

## Final results
The last step was to combine the results for each chromosome and build a spreadsheet to display them. One spreadsheet was created for each of the three studies. These results can be found on the share drive at:

`//rcdcollaboration01.rti.ns/gxg/Analysis/META/GSCAN_extended_results_nicotine/results`