# COPDGene1 Fagerstrom Test For Nicotine Dependence (FTND) GWAS
__Author:__ Jesse Marks

See [GitHub Issue #78](emerge_ea.1000G.CAT_FTND~SNP+SEX+EVs.maf_gt_0.01_rsq_gt_0.30_ea.snps+indels.manhattan.png.gz)

This document logs the steps taken to process:

* `Genetic Epidemiology of COPD (COPDGene) Funded by the National Heart, Lung, and Blood Institute` data and perform the FTND GWAS. The [COPDGene](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000179.v1.p1) cohort is racially diverse and sufficiently large and appropriately designed for genome-wide association analysis of COPD. 

* There are 10k subjects including control smokers, definite COPD cases (GOLD Stage 2 to 4), and subjects not included in either group (GOLD 1 or GOLD-Unclassified).

* The focus of this study is genome-wide association analysis to identify the genetic risk factors that determine susceptibility for COPD and COPD-related phenotypes.

* Our phenotype of interest is FTND.

FTND is a standard instrument for assessing the physical addiction to nicotine. For more information, see [this website](https://cde.drugabuse.gov/instrument/d7c0b0f5-b865-e4de-e040-bb89ad43202b).

The imputed genotypes are stored on Amazon Web Services S3 at:

`s3://rti-nd/COPDGene`

* We use the `FTNDboth_cat` variable, which lumps together the former smokers that have lifetime FTND (N=736) with the current smokers that have current FTND (N=78). This maximizes sample size, which matters especially because the severe category is small.

* **Note**: I will first perform the GWAS with only the autosomes 

* John Guo would like me to try this with the Nextflow pipeline.

## Notes about study
**Note**: these notes are not cohesive but serve as a personal reference for information I gathered on the study

* we have 1000 Genomes phase 3 imputation already available on the full COPDGene dataset

* we will run COPDGene1 and COPDGene2 separately because FTND was collected several years apart and there were some key differences in the way that the questions were asked in the two waves. Keeping the waves separate avoids discrepancies that could arise from phenotypes that are not fully harmonized; we will combine the results in a meta-analysis later.


* I will run the GWAS analysis with the following variables:

**COPDGene2** (all data are from the phase 2 LFU dataset, except for gender and finalGOLD, which were collected only in phase 1)
The criteria for COPDGene2 are:

```
1) WstFTND_cat_p2 (mostly former smokers with some current smokers picked up)
2) age_p1 (age at current visit)
3) gender (1=male, 2=female)
4) Goldneg1 (between case and control - exhibits signs of both. Failed one diagnostic test while passing another.)
5) Gold1or2 (cases)
6) Gold3or4 (severe cases)
7) EVs to be selected
```

**Also, I need to split the data up by race** (1=white, 2=black) 
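
As a minimal sketch of the split by race, assuming a whitespace-delimited phenotype file with a header row that contains a `race` column, the rows can be routed into one file per group with `awk`. The file names, the toy subject IDs, and the `ea`/`aa` labels for race 1 and race 2 are assumptions made for illustration; they are not the files used in this analysis.

```shell
# Toy phenotype file (hypothetical subject IDs) in a temporary directory
workdir=$(mktemp -d)
cat > "$workdir/pheno.txt" <<'EOF'
sid gender race age_p1
S1 1 1 55.1
S2 2 2 61.0
S3 2 1 66.9
EOF

# Find the race column by its header name, copy the header to both outputs,
# then send each data row to the file for its race code (1 or 2).
awk -v out="$workdir" '
    NR == 1 {
        for (i = 1; i <= NF; i++) if ($i == "race") col = i
        print > (out "/ea.pheno.txt")
        print > (out "/aa.pheno.txt")
        next
    }
    $col == 1 { print > (out "/ea.pheno.txt") }
    $col == 2 { print > (out "/aa.pheno.txt") }
' "$workdir/pheno.txt"

wc -l "$workdir/ea.pheno.txt" "$workdir/aa.pheno.txt"
```

Rows with any other race code (or a missing value) are written to neither file, so comparing the line counts against the input is a quick check that nobody was dropped unintentionally.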


**Note** We cannot assume that the increments in the GOLD classification are equal which is why we need to embed dummy variables rather than a single categorical variable in our regression models.

* GOLD stands for the Global Initiative for Chronic Obstructive Lung Disease. Essentially, GOLD is a metric for quantifying the severity of a patient's COPD.

* FTND outcome and covariates are now ready at the path:
    
    `\\rcdcollaboration01.rti.ns\GxG\Analysis\COPDGene\phenotypes\Phase2\COPD Both waves with Cat FTND_v2.xls`
    

* For COPDGene1, we are increasing the sample size by several hundred. The previous analysis of COPDGene1 included only the subjects with determinate COPD GOLD status (finalGold=0 for controls, 1/2 for cases, and 3/4 for severe cases). We now know that subjects with an indeterminate status (i.e. finalGold=-1) were classified as falling between case and control. We can therefore include these subjects and rerun COPDGene1 with them plus the subjects from the previous analysis. The model will be:

`CurFTND_cat_p1 = SNP + age_p1 + gender + Goldneg1 +Gold1or2 + Gold3or4 + EVs to be selected`
    * Also note that in phase 1, the subjects were all current smokers. (max N=5289)

* For phase2, COPDGene2, we need to exclude all of the subjects which were included in COPDGene1. This will leave *mostly* former smokers with lifetime FTND reported (max N=2934). There will be some current smokers picked up in phase2 and we will include them here.

`WstFTND_cat_p2 = SNP + age_p1 + gender + Goldneg1 +Gold1or2 + Gold3or4 + EVs to be selected`


* So, 2,934 is the maximum number of subjects that will be included in COPDGene2. We will filter this down to the subjects who have sex, age, GOLD status, and genotype data. Also, 4 COPDGene2 subjects (those with entries for the `WstFTND_cat_p2` variable) are reportedly missing finalGold status. Of those 4, 3 are coded 0 for all of `Goldneg1`, `Gold1or2`, and `Gold3or4`, and the other has NA for those three variables. We will treat all 4 subjects as missing and exclude them from the analysis.
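
The three GOLD dummy variables in the models above can be illustrated with a short sketch. It assumes the coding described in these notes (finalGold=-1 between case and control, 0 controls, 1/2 cases, 3/4 severe cases) with controls as the reference level. The function name `gold_dummies`, the toy IDs, and the choice to emit `NA` for any other value (including a missing finalGold) are assumptions for illustration, not how the phenotype file was actually built.

```shell
# Read "sid finalgold" lines and print "sid Goldneg1 Gold1or2 Gold3or4".
# Controls (finalgold = 0) are the reference level, so they get 0 0 0.
gold_dummies() {
    awk '{
        # Anything that is not an integer (e.g. NA or blank) is treated as missing
        if ($2 !~ /^-?[0-9]+$/) { print $1, "NA", "NA", "NA"; next }
        fg = $2 + 0
        # Values outside -1..4 are also treated as missing (assumption)
        if (fg < -1 || fg > 4) { print $1, "NA", "NA", "NA"; next }
        print $1, (fg == -1), (fg == 1 || fg == 2), (fg == 3 || fg == 4)
    }'
}

# Toy input covering each GOLD group plus a missing value
printf '%s\n' "S1 -1" "S2 0" "S3 2" "S4 4" "S5 NA" | gold_dummies
```

Emitting `NA` rather than `0 0 0` for a missing finalGold keeps such subjects distinguishable from controls, which matters because a row of three zeros would otherwise be read as the reference group.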

### CurFTND_cat_p1 variable description
| cat | Freq |
|-----|------|
| 0   | 1,506 |   
| 1   | 2,353 |   
| 2   | 1,430 |   

* Where FTND conversion is 0=0-3, 1=4-6, and 2=7+
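
The conversion from a raw FTND score to the three-level category can be sketched as follows. The function name `ftnd_category` and the handling of missing scores (returned as `NA`) are assumptions for illustration; the categorical variables used in this analysis were supplied in the phenotype file rather than computed this way.

```shell
# Map a raw FTND score to the category used here:
# 0 = scores 0-3, 1 = scores 4-6, 2 = scores 7 and above.
# Blank or non-numeric scores are returned as NA.
ftnd_category() {
    awk -v s="$1" 'BEGIN {
        if (s !~ /^[0-9]+$/) { print "NA"; exit }
        s += 0
        if (s <= 3)      print 0
        else if (s <= 6) print 1
        else             print 2
    }'
}

# A few example scores, including a missing one
for score in 1 5 9 ""; do
    printf 'score=%s category=%s\n' "${score:-missing}" "$(ftnd_category "$score")"
done
```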

# Prepare files for ProbABEL
## Phenotype Data COPDGene1
**Age p1 filtered table**
___


| Filtering Criterion                   | Subjects Removed  | Total |
|---------------------------------------|-------------------|-------|
| Initial Data                          | 0                 | 10,300|
| Initial subjects (CurFTND)            | 5,011             | 5,289 |
| Missing finalgold status              | 47                | 5,242 |
| Missing sex                           | 0                 | 5,242 |
| Missing age (p1)                      | 0                 | 5,242 |
| Missing Genotype                      | 159               | 5,083 |

* Number of EAs: 2,549 
* Number of AAs: 2,534 


___
**Age p2 filtered table**


| Filtering Criterion                   | Subjects Removed  | Total |
|---------------------------------------|-------------------|-------|
| Initial Data                          | 0                 | 10,300|
| Initial subjects (CurFTND)            | 5,011             | 5,289 |
| Missing finalgold status              | 47                | 5,242 |
| Missing sex                           | 0                 | 5,242 |
| Missing age (p2)                      | 2,353             | 2,889 |
| Missing Genotype                      | 80                | 2,809 |

* Number of EAs: 1,466
* Number of AAs: 1,343

### Apply Filters

In [11]:
pheno.data <- read.table("C:/Users/jmarks/Desktop/Projects/Nicotine/COPDGene/pheno/Copy of COPD Both waves with Cat FTND_v2.csv",
                        sep=",", header=T)

table(pheno.data$CurFTND_cat_p1)
print("Below is the table head of the entire phenotype file.")
head(pheno.data)


   0    1    2 
1506 2353 1430 

[1] "Below is the table head of the entire phenotype file."


sid,gender,race,ethnic,fagerstrom_index,finalgold,age_p1,age_p2,have_lfu,fagerstrom_index_lfu,CurFTND_cat_p1,WstFTND_cat_p2,Gold_Cat,Goldneg1,Gold1or2,Gold3or4
15814W,2,1,2,,-2,,,,,,,,,,
16032X,1,1,2,,-2,,,,,,,,,,
16126G,2,1,2,,-2,,,,,,,,,,
16281S,2,1,2,,-2,,,,,,,,,,
16303C,1,1,2,,-2,,,,,,,,,,
16311B,2,1,2,,-2,,,,,,,,,,


#### Initial numbers

In [8]:
# COPDGene1
# Calculate number of subjects with CurFTND data
cur.ftnd.vec <-  which(pheno.data$CurFTND_cat_p1 >= 0)

cur.ftnd.filtered <- pheno.data[cur.ftnd.vec,]
print("Max number of subjects for COPDGene1 (both ancestries)")
length(cur.ftnd.filtered[,1])

print("Below is the table head of the COPDGene1 data before any filters (save FTND) have been applied.")
head(cur.ftnd.filtered)

[1] "Max number of subjects for COPDGene1 (both ancestries)"


[1] "Below is the table head of the COPDGene1 data before any filters (save FTND) have been applied."


Unnamed: 0,sid,gender,race,ethnic,fagerstrom_index,finalgold,age_p1,age_p2,have_lfu,fagerstrom_index_lfu,CurFTND_cat_p1,WstFTND_cat_p2,Gold_Cat,Goldneg1,Gold1or2,Gold3or4
101,10008W,1,1,2,1,-1,55.1,,,,0,,1,1,0,0
104,10248Q,1,1,2,9,-1,66.9,,,,2,,1,1,0,0
109,10521I,2,1,2,5,-1,53.3,,,,1,,1,1,0,0
113,10604M,2,1,2,5,-1,64.9,70.5,LFU,4.0,1,1.0,1,1,0,0
116,10705S,2,1,2,3,-1,70.6,76.4,LFU,4.0,0,1.0,1,1,0,0
119,10918J,2,1,2,4,-1,54.0,,,,1,,1,1,0,0


#### Missing finalgold filter
Remove subjects missing `finalgold` variable

In [12]:
# Some subjects with reported GOLD status are missing a value for the finalgold variable
gold.data <- cur.ftnd.filtered$finalgold
missing.fg <- is.na(gold.data)
print("These are the sequential indices in the cur.ftnd.filtered at which they occur.")
which(missing.fg)
print("Number of subjects missing finalgold.")
sum(missing.fg)

print("Here are the data of those subjects with missing finalgold.")
cur.ftnd.filtered[missing.fg,]

# Exclude those subjects with missing finalgold. Logical indexing is used because
# -which(...) would drop every row if no values were missing.
cur.ftnd.fg.filtered <- cur.ftnd.filtered[!missing.fg,]

print("Number of subjects remaining after filtering by missing finalgold data.")
nrow(cur.ftnd.fg.filtered)

[1] "These are the sequential indices in the cur.ftnd.filtered at which they occur."


[1] "Number of subjects missing finalgold."


[1] "Here are the data of those subjects with missing finalgold."


Unnamed: 0,sid,gender,race,ethnic,fagerstrom_index,finalgold,age_p1,age_p2,have_lfu,fagerstrom_index_lfu,CurFTND_cat_p1,WstFTND_cat_p2,Gold_Cat,Goldneg1,Gold1or2,Gold3or4
6866,10279B,1,1,2,2,,68.2,,,,0,,,0.0,0.0,0.0
6867,10818F,1,1,2,3,,63.2,,,,0,,,0.0,0.0,0.0
6869,11125F,1,1,2,4,,55.9,61.0,LFU,5.0,1,1.0,,0.0,0.0,0.0
6870,11182R,1,1,2,7,,64.5,72.7,LFU,6.0,2,1.0,,0.0,0.0,0.0
6871,11832E,2,1,2,7,,60.4,,,,2,,,0.0,0.0,0.0
6876,15765J,1,1,2,5,,53.7,59.9,LFU,1.0,1,0.0,,0.0,0.0,0.0
6877,17778C,2,1,2,5,,62.8,68.0,LFU,2.0,1,0.0,,0.0,0.0,0.0
6878,18595Z,2,1,2,4,,47.5,,,,1,,,0.0,0.0,0.0
6879,18916T,2,1,2,8,,64.3,,,,2,,,0.0,0.0,0.0
6880,19004H,1,1,2,6,,66.0,71.0,LFU,4.0,1,1.0,,0.0,0.0,0.0


[1] "Number of subjects remaining after filtering by missing finalgold data."


#### Missing Sex filter
Remove any subjects missing sex data.

In [13]:
# filtered out any subjects missing sex data
cur.ftnd.fg.sex.filtered <- cur.ftnd.fg.filtered[complete.cases(cur.ftnd.fg.filtered[, "gender"]),]

print("Number of COPDGene1 subjects after FTND, finalgold, and sex filtering.")
length(cur.ftnd.fg.sex.filtered[,1])

[1] "Number of COPDGene1 subjects after FTND, finalgold, and sex filtering."


**Note:** No subjects are missing sex data.

#### Missing Age filter
Remove any subjects missing age data. 

##### age_p1
Filter by variable `age_p1` then write to file.

In [18]:
cur.ftnd.fg.sex.age1.filtered <- cur.ftnd.fg.sex.filtered[complete.cases(cur.ftnd.fg.sex.filtered[,"age_p1"]),]

print("Number of COPDGene1 subjects after FTND, finalgold, sex, and age (p1) filtering.")
length(cur.ftnd.fg.sex.age1.filtered[,1])

variables.of.interest1 <- c("sid", "gender", "race", "age_p1", "CurFTND_cat_p1", "Goldneg1", "Gold1or2", "Gold3or4")
pheno1 <- cur.ftnd.fg.sex.age1.filtered[,variables.of.interest1]

print("Head of filtered data (age_p1).")
head(pheno1)

setwd("C:/Users/jmarks/Desktop/Projects/Nicotine/COPDGene/pheno/one")
write.table(pheno1, "phenotype.ftnd.fg.sex.age1.filtered.txt", sep = " ", row.names = FALSE, quote = FALSE)

[1] "Number of COPDGene1 subjects after FTND, finalgold, sex, and age (p1) filtering."


[1] "Head of filtered data (age_p1)."


Unnamed: 0,sid,gender,race,age_p1,CurFTND_cat_p1,Goldneg1,Gold1or2,Gold3or4
101,10008W,1,1,55.1,0,1,0,0
104,10248Q,1,1,66.9,2,1,0,0
109,10521I,2,1,53.3,1,1,0,0
113,10604M,2,1,64.9,1,1,0,0
116,10705S,2,1,70.6,0,1,0,0
119,10918J,2,1,54.0,1,1,0,0


##### age_p2

Filter by variable `age_p2` then write to file.

In [19]:
cur.ftnd.fg.sex.age2.filtered <- cur.ftnd.fg.sex.filtered[complete.cases(cur.ftnd.fg.sex.filtered[,"age_p2"]),]

print("Number of COPDGene1 subjects after FTND, finalgold, sex, and age (p2) filtering.")
length(cur.ftnd.fg.sex.age2.filtered[,1])

variables.of.interest2 <- c("sid", "gender", "race", "age_p2", "CurFTND_cat_p1", "Goldneg1", "Gold1or2", "Gold3or4")
pheno2 <- cur.ftnd.fg.sex.age2.filtered[,variables.of.interest2]

print("Head of filtered data (age_p2).")
head(pheno2)

setwd("C:/Users/jmarks/Desktop/Projects/Nicotine/COPDGene/pheno/one")
write.table(pheno2, "phenotype.ftnd.fg.sex.age2.filtered.txt", sep = " ", row.names = FALSE, quote = FALSE)

[1] "Number of COPDGene1 subjects after FTND, finalgold, sex, and age (p2) filtering."


[1] "Head of filtered data (age_p2)."


Unnamed: 0,sid,gender,race,age_p2,CurFTND_cat_p1,Goldneg1,Gold1or2,Gold3or4
113,10604M,2,1,70.5,1,1,0,0
116,10705S,2,1,76.4,0,1,0,0
120,10935J,1,1,55.1,1,1,0,0
123,11020R,2,1,54.5,1,1,0,0
127,11081L,2,1,59.1,1,1,0,0
132,11159W,1,1,66.2,1,1,0,0


#### Genotype filter
Construct subject-filtered PLINK file sets by filtering out subjects missing genotype data.

In [None]:
## local machine ##
cd /cygdrive/c/Users/jmarks/Desktop/Projects/Nicotine/COPDGene/pheno/one

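# create directory structure on MIDAS
# (mkdir cannot take a host:path argument, so run it over ssh; the brace expansion assumes the remote shell is bash)
ssh jmarks@rtplhpc01.rti.ns 'mkdir -p /share/nas03/jmarks/studies/copdgene1/{eigenstrat,data/{assoc_tests,genotype/{observed,imputed},phenotype}}'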

# copy phenotype data over to MIDAS
scp phenotype.ftnd.fg.sex.age1.filtered.txt jmarks@rtplhpc01.rti.ns:/share/nas03/jmarks/studies/copdgene1/data/phenotype/
scp phenotype.ftnd.fg.sex.age2.filtered.txt jmarks@rtplhpc01.rti.ns:/share/nas03/jmarks/studies/copdgene1/data/phenotype/

In [None]:
## MIDAS console ##

cd /share/nas03/jmarks/studies/copdgene1/data/

cp /share/nas03/bioinformatics_group/data/studies/copdgene/observed/final_filtered/gxg_qc/copdgene.{aa,ea}.{fam,bim,bed} \
    genotype/observed

# create new PLINK fam-format files restricted to subjects with phenotype and genotype data
# (first file: phenotype table keyed on sid in column 1; second file: fam file with IID in column 2)
## age_p1 filter
### EA
awk 'FNR==NR {a[$1]; next} ($2 in a)' phenotype/phenotype.ftnd.fg.sex.age1.filtered.txt \
    genotype/observed/copdgene.ea.fam > phenotype/ea.phenotype1.all.filters.txt

### AA
awk 'FNR==NR {a[$1]; next} ($2 in a)' phenotype/phenotype.ftnd.fg.sex.age1.filtered.txt \
    genotype/observed/copdgene.aa.fam > phenotype/aa.phenotype1.all.filters.txt


## age_p2 filter
### EA
awk 'FNR==NR {a[$1]; next} ($2 in a)' phenotype/phenotype.ftnd.fg.sex.age2.filtered.txt \
    genotype/observed/copdgene.ea.fam > phenotype/ea.phenotype2.all.filters.txt

### AA
awk 'FNR==NR {a[$1]; next} ($2 in a)' phenotype/phenotype.ftnd.fg.sex.age2.filtered.txt \
    genotype/observed/copdgene.aa.fam > phenotype/aa.phenotype2.all.filters.txt


wc -l phenotype/{ea,aa}.phenotype1.all.filters.txt
"""
 2549 phenotype/ea.phenotype1.all.filters.txt
 2534 phenotype/aa.phenotype1.all.filters.txt
 5083 total
"""

wc -l phenotype/{ea,aa}.phenotype2.all.filters.txt
"""
 1466 phenotype/ea.phenotype2.all.filters.txt
 1343 phenotype/aa.phenotype2.all.filters.txt
 2809 total
"""

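The `awk 'FNR==NR {...; next} ($2 in a)'` pattern used for the genotype filter is a two-file lookup: the first file fills an array of subject IDs, and the second file is printed only where its ID column matches. The sketch below shows this on toy data; the file names and subject IDs are hypothetical, and only the column layout (IDs in column 1 of the phenotype file, IID in column 2 of the `.fam` file) is taken from the commands above.

```shell
workdir=$(mktemp -d)

# Toy phenotype file: header line plus subject IDs in column 1
cat > "$workdir/pheno.txt" <<'EOF'
sid gender race age_p1
S1 1 1 55.1
S3 2 1 66.9
EOF

# Toy PLINK .fam file: FID IID PAT MAT SEX PHENO (IID is column 2)
cat > "$workdir/geno.fam" <<'EOF'
F1 S1 0 0 1 -9
F2 S2 0 0 2 -9
F3 S3 0 0 2 -9
EOF

# FNR==NR holds only while awk reads the first file: store each ID as an array key.
# For the second file, print the line only if its IID (column 2) is a stored key.
awk 'FNR==NR { ids[$1]; next } ($2 in ids)' \
    "$workdir/pheno.txt" "$workdir/geno.fam" > "$workdir/kept.fam"

cat "$workdir/kept.fam"
wc -l < "$workdir/kept.fam"
```

The header value `sid` also becomes an array key, which is harmless as long as no subject has that ID. One caveat: if the first file is empty, `FNR==NR` stays true for the second file and nothing is printed, so an empty output is worth checking against the inputs.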

## PCA (EIGENSTRAT)

To obtain principal component covariates to use in the GWAS statistical model, EIGENSTRAT is run on LD-pruned observed genotypes for each ancestry group. 

### Construct subject-filtered PLINK file sets

In [None]:
## MIDAS ##
cd /share/nas03/jmarks/studies/copdgene/one/data/phenotypes/processing

# EA
awk ' NR>=2 { print $1 }' ea.phenotype.age1.txt > ea.id_list1.txt
awk ' NR>=2 { print $1 }' ea.phenotype.age2.txt > ea.id_list2.txt

awk 'FNR==NR{a[$0];next} ($2 in a)' ea.id_list1.txt ../../observed/processing/copdgene.ea.fam > ea.filtered1.fam
awk 'FNR==NR{a[$0];next} ($2 in a)' ea.id_list2.txt ../../observed/processing/copdgene.ea.fam > ea.filtered2.fam

# AA
awk ' NR>=2 { print $1 }' aa.phenotype.age1.txt > aa.id_list1.txt
awk ' NR>=2 { print $1 }' aa.phenotype.age2.txt > aa.id_list2.txt

awk 'FNR==NR{a[$0];next} ($2 in a)' aa.id_list1.txt ../../observed/processing/copdgene.aa.fam > aa.filtered1.fam
awk 'FNR==NR{a[$0];next} ($2 in a)' aa.id_list2.txt ../../observed/processing/copdgene.aa.fam > aa.filtered2.fam


#awk '{print $1,$2 }' filtered.fam > ea_subject_ids.keep
#
## Remove subjects by phenotype criteria
#ancestry="ea"
#/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
#    --noweb \
#    --memory 2048 \
#    --bfile ../../genotype/original/ea_chr_all \
#    --keep ea_subject_ids.keep \
#    --make-bed \
#    --out /shared/s3/emerge/eigenstrat_no_sex/${ancestry}_pheno_filter