# COPDGene1 Fagerstrom Test For Nicotine Dependence (FTND) GWAS
__Author:__ Jesse Marks

See [GitHub Issue #78](emerge_ea.1000G.CAT_FTND~SNP+SEX+EVs.maf_gt_0.01_rsq_gt_0.30_ea.snps+indels.manhattan.png.gz)

This document logs the steps taken to process:

* `Genetic Epidemiology of COPD (COPDGene) Funded by the National Heart, Lung, and Blood Institute` data and perform the FTND GWAS. The [COPDGene](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000179.v1.p1) cohort is racially diverse and sufficiently large and appropriately designed for genome-wide association analysis of COPD. 

* There are 10k subjects, including control smokers, definite COPD cases (GOLD Stage 2 to 4), and subjects not included in either group (GOLD 1 or GOLD-Unclassified).

* The focus of this study is genome-wide association analysis to identify the genetic risk factors that determine susceptibility for COPD and COPD-related phenotypes.

* Our phenotype of interest is FTND

FTND is a standard instrument for assessing the physical addiction to nicotine. For more information, see [this website](https://cde.drugabuse.gov/instrument/d7c0b0f5-b865-e4de-e040-bb89ad43202b).

The imputed genotypes are stored on Amazon Web Services S3 at:

`s3://rti-nd/COPDGene`

* We use the `FTNDboth_cat` variable, which lumps together the former smokers who have lifetime FTND (N=736) with the current smokers who have current FTND (N=78). This maximizes sample size, which is especially helpful because the severe category is small.

* **Note**: I will first perform the GWAS with only the autosomes 

* John Guo would like me to try this with the Nextflow pipeline

# What we know
* we have 1000 Genomes phase 3 imputation already available on the full COPDGene dataset
* double-check that there are no duplicate individuals between our existing COPDGene1 GWAS (current smokers only) and this new COPDGene2 dataset (mostly former smokers, but some current smokers embedded). If there is duplication, drop the duplicates from COPDGene2.
* we will run COPDGene1 and COPDGene2 separately because FTND was collected several years apart and there were some key differences in the way that the questions were asked in the different waves. Keeping the two waves separate avoids discrepancies that could arise from phenotypes that are not fully harmonized; we will combine them in a meta-analysis later.
* There will be ~2900 with FTND data collected in wave 2 but NOT in wave 1 - these are the ones to be included in the new COPDGene2 GWAS (N=2610 EAs and 290 AAs).
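
A minimal sketch of the duplicate check described above, assuming each dataset's subject IDs have been written to a one-ID-per-line text file. The file names and toy IDs are hypothetical; the real ID lists would come from the phenotype file.

```shell
set -eu

workdir=$(mktemp -d)

# Toy subject-ID lists, one ID per line (hypothetical IDs and file names).
printf '%s\n' 10008W 10248Q 10521I > "$workdir/copdgene1_ids.txt"
printf '%s\n' 10521I 20001A 20002B > "$workdir/copdgene2_ids.txt"

# comm requires sorted input.
sort -u "$workdir/copdgene1_ids.txt" > "$workdir/c1.sorted"
sort -u "$workdir/copdgene2_ids.txt" > "$workdir/c2.sorted"

# -12 keeps only the lines common to both files: the duplicated subjects.
comm -12 "$workdir/c1.sorted" "$workdir/c2.sorted" > "$workdir/duplicates.txt"

# -13 keeps the lines unique to the second file: COPDGene2 minus the overlap.
comm -13 "$workdir/c1.sorted" "$workdir/c2.sorted" > "$workdir/copdgene2_unique_ids.txt"

echo "Duplicated subjects:"
cat "$workdir/duplicates.txt"
echo "COPDGene2 subjects kept:"
cat "$workdir/copdgene2_unique_ids.txt"
```

An empty `duplicates.txt` would mean the two datasets are already disjoint and nothing needs to be dropped.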

* I will run the GWAS analysis with the following variables:

**COPDGene1** (all data from phase 1 original dataset)

*cat_FTND_phase1*

*age_enroll*

*gender*

*finalGOLD dummy variables*

 

**COPDGene2** (all data from phase 2 LFU dataset, except for gender and finalGOLD, which were collected only in phase 1)

*cat_FTND_phase2*

*age_LFU*

*gender*

*finalGOLD* (dummy variables)


* Once the phenotype data are ready, I need to first run chr15 and send Dana the results so she can compare them with our previously generated results for known SNPs

* What is the finalGOLD dummy variable?

* Phenotypes will be all in one file with both sets of variables

Now that we have -1 to include, I proposed 3 dummy variables with

GOLD 0 (reference)

GOLD -1 (dummy variable 1)

GOLD 1 and 2 (dummy variable 2)

GOLD 3 and 4 (dummy variable 3)
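
The coding proposed above can be sketched with `awk`. This assumes a whitespace-delimited file with `sid` and `finalgold` columns; the toy IDs are hypothetical, and treating any other `finalgold` value (such as -2) as missing is an assumption rather than something confirmed in this log. The output column names match the `Goldneg1`, `Gold1or2`, and `Gold3or4` variables used in the models.

```shell
set -eu

workdir=$(mktemp -d)

# Toy input: subject ID and finalgold value (hypothetical IDs).
cat > "$workdir/gold.txt" <<'EOF'
sid finalgold
S1 0
S2 -1
S3 2
S4 4
S5 -2
EOF

awk '
NR == 1 { print $0, "Goldneg1", "Gold1or2", "Gold3or4"; next }
{
    g = $2
    if (g == 0 || g == -1 || (g >= 1 && g <= 4)) {
        # GOLD 0 is the reference level: all three dummies are 0.
        print $0, (g == -1), (g == 1 || g == 2), (g == 3 || g == 4)
    } else {
        # Assumption: any other code is treated as missing.
        print $0, "NA", "NA", "NA"
    }
}' "$workdir/gold.txt" > "$workdir/gold_dummies.txt"

cat "$workdir/gold_dummies.txt"
```

Each subject has at most one dummy set to 1, so the three coefficients are each estimated relative to GOLD 0.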

* FTND outcome and covariates are now ready at the path:
    
    `\\rcdcollaboration01.rti.ns\GxG\Analysis\COPDGene\phenotypes\Phase2\COPD Both waves with Cat FTND_v2.xls`

* GOLD stands for the Global Initiative for Chronic Obstructive Lung Disease. Essentially, GOLD is a metric for quantifying the severity of a patient's COPD.


* need to separate by race (1=White and 2=Black or African American)



* For COPDGene1, we are increasing the sample size by several hundred. The previous analysis of COPDGene1 included only the subjects with a determinate COPD GOLD status (finalGold=0 for controls, 1/2 for cases, and 3/4 for severe cases). We now know that subjects with an indeterminate status (i.e., finalGold=-1) were classified as falling between case and control. With this knowledge, we can include these subjects and rerun COPDGene1 with them plus the subjects from the previous analysis. The model will be:

`CurFTND_cat_p1 = SNP + age_p1 + gender + Goldneg1 +Gold1or2 + Gold3or4 + EVs to be selected`
    * Also note that in this phase1, the subjects were all current smokers. (max N=5289)

* For phase2, COPDGene2, we need to exclude all of the subjects which were included in COPDGene1. This will leave *mostly* former smokers with lifetime FTND reported (max N=2934). There will be some current smokers picked up in phase2 and we will include them here.

`WstFTND_cat_p2 = SNP + age_p1 + gender + Goldneg1 +Gold1or2 + Gold3or4 + EVs to be selected`



## The first thing on my plate is to run COPDGene1 for only chromosome 15 
This is so that we can compare the results with previously generated COPDGene1 results before going genome-wide. I will check chr15 for COPDGene2 first as well.
* first thing I am going to do is run the phase2 COPDGene1. I need to filter out the subjects from the phenotype file that match the criteria specified. The criteria for COPDGene1 are:

```
1) CurFTND_cat_p1 (current smokers)
2) age_p1 (age at current visit)
3) gender (1=male, 2=female)
4) Goldneg1 (between case and control - exhibits signs of both. Failed one diagnostic test while passing another.)
5) Gold1or2 (cases)
6) Gold3or4 (severe cases)
```

**Also, I need to split the data up by race** (1=white, 2=black) 

**Note**: We cannot assume that the increments in the GOLD classification are equally spaced, which is why we use dummy variables rather than a single numeric GOLD covariate in our regression models.

COPDGene2 will be like the above described setup except that the phenotype of interest is
*WstFTND_cat_p2*. I should verify with Dana that we are using age_p1 for this analysis as well. Also, am I using chromosome 23? Note that I did not use chr23 for the eMERGE analysis.

# Phenotype Data
## Count the max number of subjects in COPDGene1&2

In [32]:
pheno.data <- read.table("C:/Users/jmarks/Desktop/Projects/Nicotine/COPDGene/pheno/Copy of COPD Both waves with Cat FTND_v2.csv",
                        sep=",", header=T)
head(pheno.data)

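| sid | gender | race | ethnic | fagerstrom_index | finalgold | age_p1 | age_p2 | have_lfu | fagerstrom_index_lfu | CurFTND_cat_p1 | WstFTND_cat_p2 | Gold_Cat | Goldneg1 | Gold1or2 | Gold3or4 |
|-----|--------|------|--------|------------------|-----------|--------|--------|----------|----------------------|----------------|----------------|----------|----------|----------|----------|
| 15814W | 2 | 1 | 2 |  | -2 |  |  |  |  |  |  |  |  |  |  |
| 16032X | 1 | 1 | 2 |  | -2 |  |  |  |  |  |  |  |  |  |  |
| 16126G | 2 | 1 | 2 |  | -2 |  |  |  |  |  |  |  |  |  |  |
| 16281S | 2 | 1 | 2 |  | -2 |  |  |  |  |  |  |  |  |  |  |
| 16303C | 1 | 1 | 2 |  | -2 |  |  |  |  |  |  |  |  |  |  |
| 16311B | 2 | 1 | 2 |  | -2 |  |  |  |  |  |  |  |  |  |  |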


In [42]:
# Calculate number of subjects with CurFTND data
cur_vec <-  which(pheno.data$CurFTND_cat_p1 >= 0)
print("Max number of subjects for COPDGene1")
length(cur_vec)

# Count the subjects with WstFTND data who are not already counted in CurFTND
n_both <- length(which(pheno.data$WstFTND_cat_p2[cur_vec] >= 0))  # subjects with both Cur and Wst data
print("Max number of subjects for COPDGene2")
length(which(pheno.data$WstFTND_cat_p2 >= 0)) - n_both

[1] "Max number of subjects for COPDGene1"


[1] "Max number of subjects for COPDGene2"


In [34]:
table(pheno.data$CurFTND_cat_p1)


   0    1    2 
1506 2353 1430 

### CurFTND_cat_p1 variable description
| cat | Freq |
|-----|------|
| 0   | 1506 |   
| 1   | 2353 |   
| 2   | 1430 |   

* Where FTND conversion is 0=0-3, 1=4-6, and 2=7+

* So, 5289 is the maximum number of subjects that will be included in COPDGene1. We will have to filter this down based on whether the subjects have `sex, age, GOLD status,` and `genotype data.`

* Also, we need to separate these by race once all of these filters have been applied
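
As a quick illustration of the FTND conversion above, here is a small sketch that maps a raw FTND score (0 to 10) to the three-level category and checks that the category counts sum to 5289. The helper name `ftnd_category` is hypothetical and is not part of the actual phenotype processing.

```shell
set -eu

# Map a raw FTND score (0-10) to the 3-level category: 0=0-3, 1=4-6, 2=7+.
# Anything that is not an integer in 0-10 is reported as NA.
ftnd_category() {
    awk -v s="$1" 'BEGIN {
        if (s !~ /^[0-9]+$/ || s > 10) print "NA"
        else if (s <= 3) print 0
        else if (s <= 6) print 1
        else print 2
    }'
}

for score in 0 3 4 6 7 10; do
    echo "FTND $score -> category $(ftnd_category "$score")"
done

# Sanity check: the category frequencies from the table sum to the COPDGene1 maximum.
total=$((1506 + 2353 + 1430))
echo "Total subjects with CurFTND_cat_p1: $total"
```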

## Filter subjects missing any FTND, sex, or age data
Then write to file.

In [53]:
## R console ##
setwd("C:/Users/jmarks/Desktop/Projects/Nicotine/COPDGene/pheno")

# Total subjects 
print("Total number of subjects in phenotype pre-filtered data.")
nrow(pheno.data)

# Keep only subjects with CurFTND data
phenoFTND_filtered <- pheno.data[complete.cases(pheno.data[,"CurFTND_cat_p1"]),]
print("Max number of subjects for COPDGene1")
nrow(phenoFTND_filtered)


# Filter out any subjects missing age data
phenoFTND_age_filtered <- phenoFTND_filtered[complete.cases(phenoFTND_filtered[,"age_p1"]),]
print("Number of subjects after FTND and age filtering.")
nrow(phenoFTND_age_filtered)

# Filter out any subjects missing sex data
phenoFTND_age_sex_filtered <- phenoFTND_age_filtered[complete.cases(phenoFTND_age_filtered[,"gender"]),]
print("Number of subjects after FTND, age, and sex filtering.")
nrow(phenoFTND_age_sex_filtered)

# Filter the data set down to the variables of interest for COPDGene1
variables_of_interest <- c("sid", "gender", "race", "age_p1", "CurFTND_cat_p1", "Goldneg1", "Gold1or2", "Gold3or4")
phenotype_final_data <- phenoFTND_age_sex_filtered[, variables_of_interest]
head(phenotype_final_data)

write.table(phenotype_final_data, "one/COPDGene1_ftnd.txt", sep = " ", row.names = F, quote = F)

[1] "Total number of subjects in phenotype pre-filtered data."


[1] "Max number of subjects for COPDGene1"


[1] "Number of subjects after FTND and age filtering."


[1] "Number of subjects after FTND, age, and sex filtering."


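|     | sid | gender | race | age_p1 | CurFTND_cat_p1 | Goldneg1 | Gold1or2 | Gold3or4 |
|-----|-----|--------|------|--------|----------------|----------|----------|----------|
| 101 | 10008W | 1 | 1 | 55.1 | 0 | 1 | 0 | 0 |
| 104 | 10248Q | 1 | 1 | 66.9 | 2 | 1 | 0 | 0 |
| 109 | 10521I | 2 | 1 | 53.3 | 1 | 1 | 0 | 0 |
| 113 | 10604M | 2 | 1 | 64.9 | 1 | 1 | 0 | 0 |
| 116 | 10705S | 2 | 1 | 70.6 | 0 | 1 | 0 | 0 |
| 119 | 10918J | 2 | 1 | 54.0 | 1 | 1 | 0 | 0 |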
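
Since the filtered phenotype file still needs to be separated by race (1=white, 2=black), one way to do the split is sketched below. It assumes the space-delimited layout written by `write.table` above; the toy rows and the `.ea.txt` / `.aa.txt` output names are hypothetical.

```shell
set -eu

workdir=$(mktemp -d)

# Toy phenotype file in the same layout as COPDGene1_ftnd.txt (hypothetical rows).
cat > "$workdir/COPDGene1_ftnd.txt" <<'EOF'
sid gender race age_p1 CurFTND_cat_p1 Goldneg1 Gold1or2 Gold3or4
T0001 1 1 55.1 0 1 0 0
T0002 2 2 61.0 2 0 1 0
T0003 2 1 53.3 1 0 0 1
EOF

awk -v out="$workdir/COPDGene1_ftnd" '
NR == 1 {
    # Locate the race column by name rather than by position.
    for (i = 1; i <= NF; i++) if ($i == "race") col = i
    if (!col) { print "no race column found" > "/dev/stderr"; exit 1 }
    # Copy the header into both output files.
    print > (out ".ea.txt")
    print > (out ".aa.txt")
    next
}
$col == 1 { print > (out ".ea.txt") }   # race 1 = white
$col == 2 { print > (out ".aa.txt") }   # race 2 = black
' "$workdir/COPDGene1_ftnd.txt"

echo "EA rows (including header): $(awk 'END { print NR }' "$workdir/COPDGene1_ftnd.ea.txt")"
echo "AA rows (including header): $(awk 'END { print NR }' "$workdir/COPDGene1_ftnd.aa.txt")"
```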


## PCA (EIGENSTRAT)

To obtain principal component covariates to use in the GWAS statistical model, EIGENSTRAT is run on LD-pruned observed genotypes for each ancestry group. 
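
For orientation, the LD-pruning step could look roughly like the sketch below, which uses PLINK's `--indep-pairwise` followed by `--extract`. The window, step, and r² values and the file prefixes are assumptions for illustration, not settings recorded in this log. The `run` helper is hypothetical: it only echoes the commands unless `DRY_RUN=0` is set, so the sketch is safe to run without PLINK installed.

```shell
set -eu

# Echo the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

ancestry="ea"
bfile="${ancestry}_pheno_filter"   # hypothetical subject-filtered PLINK prefix
window=1500                        # assumed window size (variant count)
step=150                           # assumed step size (variant count)
r2=0.2                             # assumed pairwise r^2 threshold

# Step 1: identify variants in approximate linkage equilibrium (writes *.prune.in).
run plink --bfile "$bfile" \
    --indep-pairwise "$window" "$step" "$r2" \
    --out "${ancestry}_ld_prune"

# Step 2: keep only the pruned-in variants in a new binary file set.
run plink --bfile "$bfile" \
    --extract "${ancestry}_ld_prune.prune.in" \
    --make-bed \
    --out "${ancestry}_ld_pruned"
```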

### Construct subject-filtered PLINK file sets

In [None]:
## MIDAS ##

# create directory structure
mkdir -p /share/nas03/jmarks/studies/copdgene/one/{data,eigenstrat}
cd /share/nas03/jmarks/studies/copdgene/one/data

mkdir {observed,phenotypes,imputed}
mkdir -p observed/{processing,final_filtered}
mkdir phenotypes/{processing,probabel}

# copy genotype data (autosomes only)
cp /share/nas03/bioinformatics_group/data/studies/copdgene/observed/final_filtered/gxg_qc/copdgene.{aa,ea}.{fam,bim,bed} \
    observed/processing/

In [None]:
## local machine ##

# copy phenotype data over to MIDAS
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/COPDGene/pheno/one

scp COPDGene1_ftnd.txt jmarks@rtplhpc01.rti.ns:/share/nas03/jmarks/studies/copdgene/one/data/phenotypes/processing

In [None]:
## MIDAS ##
cd /share/nas03/jmarks/studies/copdgene/one/data/phenotypes/processing

# generate ID list to filter data with
awk ' NR>=2 { print $1 }' COPDGene1_ftnd.txt > id_list.txt

awk 'FNR==NR{a[$0];next} ($2 in a)' id_list.txt ../../observed/processing/copdgene.ea.fam > filtered.fam

awk '{print $1,$2 }' filtered.fam > ea_subject_ids.keep

# Remove subjects by phenotype criteria
ancestry="ea"
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --memory 2048 \
    --bfile ../../observed/processing/copdgene.${ancestry} \
    --keep ${ancestry}_subject_ids.keep \
    --make-bed \
    --out /shared/s3/emerge/eigenstrat_no_sex/${ancestry}_pheno_filter
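
The two-file `awk` idiom used above to build the `--keep` list can be tried on toy data. This is only a sketch: the family and subject IDs are hypothetical, and the `.fam` layout assumed is the standard six-column PLINK format (FID, IID, father, mother, sex, phenotype).

```shell
set -eu

workdir=$(mktemp -d)

# Subject IDs that passed the phenotype filters (hypothetical).
printf '%s\n' T0001 T0003 > "$workdir/id_list.txt"

# Toy .fam file: FID IID father mother sex phenotype.
cat > "$workdir/toy.fam" <<'EOF'
F1 T0001 0 0 1 -9
F2 T0002 0 0 2 -9
F3 T0003 0 0 2 -9
EOF

# While reading the first file (FNR == NR), store each ID as an array key.
# While reading the second file, print FID and IID for rows whose IID was stored.
awk 'FNR == NR { keep[$1]; next } ($2 in keep) { print $1, $2 }' \
    "$workdir/id_list.txt" "$workdir/toy.fam" > "$workdir/toy.keep"

cat "$workdir/toy.keep"
```

One caveat with this idiom: if the ID list is empty, `FNR == NR` stays true for the second file as well, so nothing is printed and the keep list comes out empty.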