# COPDGene2 Fagerstrom Test For Nicotine Dependence (FTND) GWAS
__Author:__ Jesse Marks


This document logs the steps taken to process the electronic Medical Records and Genomics (eMERGE) Network data and perform the FTND GWAS. The [eMERGE](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000360.v1.p1) Network is a consortium of five participating sites (Group Health Seattle, Marshfield Clinic, Mayo Clinic, Northwestern University, and Vanderbilt University). We will be processing the Marshfield cohort.

FTND is a standard instrument for assessing the physical addiction to nicotine. For more information, see [this website](https://cde.drugabuse.gov/instrument/d7c0b0f5-b865-e4de-e040-bb89ad43202b).

The genotype data were imputed on the [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html).

* We use the `FTNDboth_cat` variable, which combines the former smokers who have lifetime FTND (N=736) with the current smokers who have current FTND (N=78). This maximizes sample size, which matters especially because the severe category is small.

# What we know
* we have 1000 Genomes phase 3 imputation already available on the full COPDGene dataset
* double-check that there are no duplicate individuals between our existing COPDGene1 GWAS (current smokers only) and this new COPDGene2 dataset (mostly former smokers, but some current smokers embedded). If there is duplication, drop them from COPDGene2.
* we will run COPDGene1 and COPDGene2 separately because FTND was collected several years apart and the questions were asked differently in the two waves. Keeping the two waves separate avoids discrepancies that could arise from imperfectly harmonized phenotypes; we will combine them in a meta-analysis later.
* There will be ~2900 subjects with FTND data collected in wave 2 but NOT in wave 1. These are the ones to be included in the new COPDGene2 GWAS (N=2610 EAs and 290 AAs).

* I will run the GWAS analysis with the following variables:

**COPDGene1** (all data from phase 1 original dataset)

*cat_FTND_phase1*

*age_enroll*

*gender*

*finalGOLD dummy variables*

 

**COPDGene2** (all data from phase 2 LFU dataset, except for gender and finalGOLD, which were collected only in phase 1)

*cat_FTND_phase2*

*age_LFU*

*gender*

*finalGOLD* (dummy variables)


* Once the phenotype data are ready, I need to first run chr15 and send Dana the results so she can compare them with our previously generated results for known SNPs

* What is the finalGOLD dummy variable?

* Phenotypes will be all in one file with both sets of variables

Now that we have -1 to include, I proposed 3 dummy variables with

GOLD 0 (reference)

GOLD -1 (dummy variable 1)

GOLD 1 and 2 (dummy variable 2)

GOLD 3 and 4 (dummy variable 3)
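
The coding proposed above can be sketched in a few lines of Python. This is only an illustration of the scheme, not the code used to build the phenotype file; the function name `gold_dummies` is hypothetical, and the output names `Goldneg1`, `Gold1or2`, and `Gold3or4` are taken from the model statements later in this document.

```python
def gold_dummies(final_gold):
    """Map a finalGold value to the three dummy variables; GOLD 0 is the reference."""
    if final_gold not in (-1, 0, 1, 2, 3, 4):
        # Refuse unexpected or missing codes rather than silently coding them as reference.
        raise ValueError(f"unexpected finalGold value: {final_gold!r}")
    return {
        "Goldneg1": int(final_gold == -1),      # dummy variable 1
        "Gold1or2": int(final_gold in (1, 2)),  # dummy variable 2
        "Gold3or4": int(final_gold in (3, 4)),  # dummy variable 3
    }


# Show the coding for every valid finalGold value.
for value in (-1, 0, 1, 2, 3, 4):
    print(value, gold_dummies(value))
```

A subject with GOLD 0 gets zeros for all three dummies, so each coefficient in the regression is interpreted relative to that reference group.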

* FTND outcome and covariates are now ready at the path:
    
    `\\rcdcollaboration01.rti.ns\GxG\Analysis\COPDGene\phenotypes\Phase2\COPD Both waves with Cat FTND_v2.xls`

* GOLD stands for the Global Initiative for Chronic Obstructive Lung Disease. Essentially, GOLD is a metric for quantifying the severity of a patient's COPD.


* need to separate by race (1=White and 2=Black or African American)



* For COPDGene1, we are increasing the sample size by several hundred. The previous analysis of COPDGene1 included only the subjects with a determinate COPD GOLD status (finalGold=0 for controls, 1/2 for cases, and 3/4 for severe cases). We now know that subjects with an indeterminate status (i.e. finalGold=-1) were classified as being between case and control. With this knowledge, we can include these subjects and rerun COPDGene1 with them plus the subjects from the previous analysis. The model will be:

`CurFTND_cat_p1 = SNP + age_p1 + gender + Goldneg1 +Gold1or2 + Gold3or4 + EVs to be selected`
    * Also note that in phase 1, the subjects were all current smokers. (max N=5289)

* For phase2, COPDGene2, we need to exclude all of the subjects who were included in COPDGene1. This will leave *mostly* former smokers with lifetime FTND reported (max N=2934). Some current smokers were picked up in phase2, and we will include them here.

`WstFTND_cat_p2 = SNP + age_p1 + gender + Goldneg1 +Gold1or2 + Gold3or4 + EVs to be selected`
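
The exclusion step for COPDGene2 amounts to removing every phase-2 subject whose ID already appears in COPDGene1. Below is a minimal sketch, assuming the phenotype rows have been read into dictionaries. The function name `exclude_phase1_subjects`, the ID column name `sid`, and the toy IDs are all hypothetical and stand in for whatever the real phenotype file uses.

```python
def exclude_phase1_subjects(phase2_rows, phase1_ids, id_key="sid"):
    """Split phase-2 rows into (kept, dropped) by membership in the phase-1 ID set."""
    phase1_ids = set(phase1_ids)  # set lookup keeps the check fast for thousands of IDs
    kept, dropped = [], []
    for row in phase2_rows:
        if row[id_key] in phase1_ids:
            dropped.append(row)   # already analyzed in COPDGene1
        else:
            kept.append(row)      # new to COPDGene2
    return kept, dropped


# Toy IDs for illustration only; these are not real COPDGene subjects.
phase1_ids = ["A001", "A002", "A003"]
phase2_rows = [
    {"sid": "A002", "WstFTND_cat_p2": 1},
    {"sid": "B010", "WstFTND_cat_p2": 0},
    {"sid": "B011", "WstFTND_cat_p2": 2},
]
kept, dropped = exclude_phase1_subjects(phase2_rows, phase1_ids)
print(len(kept), "kept;", len(dropped), "dropped")
```

Returning the dropped rows as well makes it easy to log how many duplicates were found before they are removed.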



## The first thing on my plate is to run COPDGene1 for only chromosome 15 
This is so that we can compare the results with previously generated COPDGene1 results before going genome-wide. I will check chr15 for COPDGene2 first as well.
* The first thing I am going to do is run the phase2 COPDGene1. I need to filter the phenotype file down to the subjects that match the criteria specified. The criteria for COPDGene1 are:

```
1) CurFTND_cat_p1 (current smokers)
2) age_p1 (age at current visit)
3) gender (1=male, 2=female)
4) Gold_Cat (reference or controls ~ no symptoms of COPD ~ )
5) Goldneg1 (between case and control - exhibits signs of both. Failed one diagnostic test while passing another.)
6) Gold1or2 (cases)
7) Gold3or4 (severe cases)
```

**Also, I need to split the data up by race** (1=white, 2=black) 

**Note:** We cannot assume that the increments in the GOLD classification are equal, which is why we need to use dummy variables rather than a single ordinal variable in our regression models.
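
The filtering and race split described above could look something like the sketch below. It assumes that "matching the criteria" means having a non-missing value for every model variable, which should be confirmed before use. The function name `split_complete_cases_by_race`, the column name `race`, and the toy rows are hypothetical; the variable names come from the COPDGene1 model statement.

```python
# Variables from the COPDGene1 model (EVs are added later, once selected).
COPDGENE1_VARS = ["CurFTND_cat_p1", "age_p1", "gender",
                  "Goldneg1", "Gold1or2", "Gold3or4"]


def split_complete_cases_by_race(rows, required, race_key="race"):
    """Group rows by race code, keeping only rows with every required value present."""
    groups = {}
    for row in rows:
        needed = list(required) + [race_key]
        # Treat None and empty strings as missing; a value of 0 is valid data.
        if any(row.get(name) in (None, "") for name in needed):
            continue
        groups.setdefault(row[race_key], []).append(row)
    return groups


# Toy rows for illustration only (race: 1=white, 2=black).
rows = [
    {"race": 1, "CurFTND_cat_p1": 0, "age_p1": 61, "gender": 1,
     "Goldneg1": 0, "Gold1or2": 1, "Gold3or4": 0},
    {"race": 2, "CurFTND_cat_p1": 2, "age_p1": 55, "gender": 2,
     "Goldneg1": 0, "Gold1or2": 0, "Gold3or4": 0},
    {"race": 1, "CurFTND_cat_p1": None, "age_p1": 70, "gender": 2,
     "Goldneg1": 1, "Gold1or2": 0, "Gold3or4": 0},  # missing FTND, so excluded
]
by_race = split_complete_cases_by_race(rows, COPDGENE1_VARS)
for race_code, subjects in sorted(by_race.items()):
    print(race_code, len(subjects))
```

For COPDGene2, the same function could be reused with a variable list that swaps in `WstFTND_cat_p2` as the outcome.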

COPDGene2 will follow the setup described above, except that the phenotype of interest is
*WstFTND_cat_p2*. I should verify with Dana that we are using age_p1 for this analysis as well. Also, am I using chromosome 23? Note that I did not use chr23 for the eMERGE analysis.