# NGC VIDUS and Lung Cancer in Never Smokers OAall GWAS
__Author__: Jesse Marks

This document logs the steps taken to perform an opioid addiction GWAS on [Vancouver Injection Drug Users Study (VIDUS)](http://www.cfenet.ubc.ca/research/vidus) subjects versus all controls of the [Lung Cancer in Never Smokers Study](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000634.v1.p1). The processing performed on these data follows the Heroin NIDA Genetics Consortium (NGC) Protocol. For any general or specific questions regarding this protocol, speak with Eric O. Johnson.

## Software and tools
The software and tools used for processing these data are

* [Amazon Elastic Compute Cloud (EC2)](https://aws.amazon.com/ec2/)
* GNU bash version 4.1.2
* [PLINK v1.9 beta 3.45](https://www.cog-genomics.org/plink/)
* [EIGENSOFT v4.2](https://www.hsph.harvard.edu/alkes-price/software/)
* [R v3.2.3](https://www.r-project.org/)
* R packages: MASS, moments
* [RVtests](https://render.githubusercontent.com/view/ipynb?commit=3bb8e661ad8b75af027ed2748133452ec251aaed&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f525449496e7465726e6174696f6e616c2f6271756163685f6e6f7465626f6f6b732f336262386536363161643862373561663032376564323734383133333435326563323531616165642f6865726f696e5f70726f6a6563742f646576656c6f702f32303138303131305f756873325f756873335f666f755f677761732e6970796e623f746f6b656e3d41664d79344e373237626e764465456f46535a697770346b48776246577964706b7335617570495a7741253344253344&nwo=RTIInternational%2Fbquach_notebooks&path=heroin_project%2Fdevelop%2F20180110_uhs2_uhs3_fou_gwas.ipynb&repository_id=105297875&repository_type=Repository)

## Variable information
__OAall Variables__:

`Opioidcase` <br>
* Opioid addiction cases were identified as those with heroin or prescription use >= 2-3 times/week in the last six months (= 4 | = 5 in heroin or prescription use, injection or non-injection). This generated 181 cases. Possible additional cases from use of one or more of these at a level of once/week (= 3) were considered. However, only 12 non-cases by the above criteria have FOU in the same range as the cases, and all but 1 of these could reach the in-range score through a combination of lower-frequency use across the multiple variables while still not using 2-3 times/week. The 181 were retained as a cleaner case set.

`Female (sex)` <br>
* 0 is male, 1 is female 

`Ageatint` <br>
* Age at time of interview.

__Note__: Eric Johnson supplied the phenotype data for these VIDUS cases.

## Retrieve OAall data
The data have already been filtered to:

* Remove subjects with opioidcase == 0
* Remove duplicate subjects by keeping only the more recent data record
* Remove subjects with previously reported sex discrepancy

These data are located on EC2 at `/shared/sandbox/ngc_vidus-lung_cancer_fou/phenotype/GWAS-Cohort-n938_passed_g_qc_only_opioid_FOU.csv`

* There are 181 subjects in this CASES file.

* The original `.fam` file needs to be filtered based on the IDs in the FOU phenotype data.
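As a hedged illustration of this filtering step, the sketch below runs the same `grep`/`awk` pattern on toy data. The sample IDs, file names, and temporary directory are hypothetical; only the `-%04d` pattern format and the two-column keep-file layout are taken from this notebook.

```shell
# Toy demonstration (hypothetical IDs): keep only FAM rows whose sample ID
# contains one of the zero-padded codes, then write a two-column keep file.
tmp=$(mktemp -d)

# Hypothetical FAM rows: FID IID father mother sex phenotype
cat > "$tmp/toy.fam" <<'EOF'
VID-0003 VID-0003 0 0 1 -9
VID-0004 VID-0004 0 0 2 -9
VID-0011 VID-0011 0 0 2 -9
EOF

# Patterns in the same "-%04d" form that is written to id_list.txt
printf '%s\n' '-0003' '-0011' > "$tmp/id_list.txt"

# -F treats every pattern as a fixed string rather than a regular expression
grep -F -f "$tmp/id_list.txt" "$tmp/toy.fam" > "$tmp/filtered.fam"
awk '{ print $1, $2 }' "$tmp/filtered.fam" > "$tmp/toy_subject_ids.keep"

n_kept=$(wc -l < "$tmp/toy_subject_ids.keep" | tr -d ' ')
echo "kept ${n_kept} subjects"
```

Because `grep -f` matches substrings anywhere in a line, comparing the number of kept subjects against the expected count (181 here) is a cheap sanity check.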


In [1]:
### EC2 console ###
mkdir -p /shared/s3/ngc_vidus_oaall/{vidus,lung_cancer}
cd /shared/s3/ngc_vidus_oaall

cp -r ../../sandbox/ngc_vidus-lung_cancer_oaall_case_control/VIDUS/* vidus/
cp -r ../../sandbox/ngc_vidus-lung_cancer_oaall_case_control/lung_cancer/* lung_cancer

cd /shared/s3/ngc_vidus_oaall/vidus/phenotype

# R console on EC2 #
# Read the case phenotype file; the first column holds the numeric gwas_code
pheno <- read.table("GWAS-Cohort-n938_passed_g_qc_only_opioid_CASES.csv", header=TRUE, sep=',')

# Build zero-padded ID patterns (e.g., 3 becomes "-0003") to match against the .fam file
id_list <- sprintf("-%04d", pheno[,1])

write(id_list, "id_list.txt", sep="\n")
quit()



In [None]:
### EC2 console ###
cd /shared/s3/ngc_vidus_oaall/vidus/phenotype

# filter the fam file based on the subjects that are in the FOU phenotype data
# this is needed to create the PLINK filtered datasets to run eigenstrat
grep -f id_list.txt ../genotype/original/final/ea_chr_all.fam > filtered.fam
awk '{print $1,$2 }' filtered.fam > ea_subject_ids.keep

# get covariates
awk 'BEGIN{FS=","; OFS="\t"} {print $1,$2,$17,$20}' GWAS-Cohort-n938_passed_g_qc_only_opioid_CASES.csv > ea_CASES.data
head ea_CASES.data
"""
gwas_code       female  ageatint        opioidcase
3       0       55      1
5       0       40      1
6       1       35      1
11      1       56      1
13      0       35      1
17      0       34      1
18      0       39      1
19      0       29      1
20      0       37      1
"""

# map sex code and gwas_code
awk 'NR==FNR{ map[NR]=$1;next } FNR==1{print $0} FNR>=2 {$1=map[FNR-1];print $0}' id_list.txt ea_CASES.data > ea_cases_map.data
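# recode column 2 so that 0 (male) becomes 2; 1 (female) is left unchanged
# (mv already removes the temporary file, so no separate rm is needed)
awk '{ if ($2 == 0) { $2 = 2 } print $0 }' ea_cases_map.data > temp && mv temp ea_cases_map.data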

# make sure there are no missing data for sex, FOU, or age
# Note that the phenotype data were already filtered. If they
# had not been, we would check for this. I did a visual inspection for 
# a sanity check though.
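Had the phenotype data not been pre-filtered, a check along these lines could flag incomplete records. This is a sketch on hypothetical rows, and it assumes missing values appear as empty fields or the string `NA`; the actual missing-value coding of the VIDUS file is not stated in this notebook.

```shell
# Toy check (hypothetical rows): count records with an empty or NA field
tmp=$(mktemp -d)
printf 'gwas_code\tfemale\tageatint\topioidcase\n' >  "$tmp/toy.data"
printf '3\t0\t55\t1\n'                             >> "$tmp/toy.data"
printf '5\t\t40\t1\n'                              >> "$tmp/toy.data"   # sex missing
printf '6\t1\tNA\t1\n'                             >> "$tmp/toy.data"   # age coded as NA

# Skip the header, then scan the four tab-separated fields of each record
n_missing=$(awk -F'\t' 'NR > 1 {
    for (i = 1; i <= 4; i++)
        if ($i == "" || $i == "NA") { bad++; break }
} END { print bad + 0 }' "$tmp/toy.data")
echo "records with missing values: ${n_missing}"
```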

In [12]:
### R console ###
library(MASS)
options(repr.plot.width=10, repr.plot.height=17)
# note that I copied over the phenotype data to my local machine to produce the plots
setwd('C:/Users/jmarks/Desktop/VIDUS/oaall/pheno/')

ea.cases.data <- read.table("ea_cases_map.data", header = T, colClasses = c("character", rep("integer",  3)) )
table(ea.cases.data$female)


  1   2 
 56 125 

## Combine cases and controls

### Filter vidus subjects based off case status

In [None]:
### EC2 console ###
mkdir /shared/s3/ngc_vidus_oaall/vidus/genotype/original/final/filtered
cd /shared/s3/ngc_vidus_oaall/vidus/

# Remove vidus subjects by phenotype criteria (case status)
ancestry="ea"
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --memory 2048 \
    --bfile genotype/original/final/ea_chr_all \
    --keep phenotype/ea_subject_ids.keep \
    --make-bed \
    --out genotype/original/final/filtered/${ancestry}_cases

### Merge test

To determine whether any SNPs are flipped between studies, a merge is attempted. If any multi-allelic variants are identified (suggestive of strand flipping), PLINK raises an error. In this case, the merge failed with 25 variants with 3+ alleles; these are removed during the SNP intersection step via the `-merge.missnp` file.
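To make the logic of the merge test concrete, the following sketch reproduces a 3+ allele check on two toy `.bim` files without calling PLINK. All variant IDs and alleles are made up, and the check is a simplification: it only counts distinct allele codes per shared variant ID and does not treat missing-allele codes specially.

```shell
tmp=$(mktemp -d)
# Hypothetical BIM rows: chr, variant ID, cM, bp, allele 1, allele 2
printf '1\trs1\t0\t1000\tA\tG\n'  >  "$tmp/study_a.bim"
printf '1\trs2\t0\t2000\tC\tT\n'  >> "$tmp/study_a.bim"
printf '1\trs3\t0\t3000\tA\tC\n'  >> "$tmp/study_a.bim"
printf '1\trs1\t0\t1000\tG\tA\n'  >  "$tmp/study_b.bim"
printf '1\trs2\t0\t2000\tG\tA\n'  >> "$tmp/study_b.bim"   # opposite strand of study A
printf '1\trs4\t0\t4000\tA\tG\n'  >> "$tmp/study_b.bim"

# For each variant ID present in both files, count distinct allele codes
awk 'NR == FNR { a1[$2] = $5; a2[$2] = $6; next }
     ($2 in a1) {
         split("", seen); n = 0
         split(a1[$2] " " a2[$2] " " $5 " " $6, alleles, " ")
         for (i = 1; i <= 4; i++)
             if (!(alleles[i] in seen)) { seen[alleles[i]] = 1; n++ }
         if (n >= 3) print $2
     }' "$tmp/study_a.bim" "$tmp/study_b.bim" > "$tmp/toy.missnp"

flagged=$(cat "$tmp/toy.missnp")
echo "variants with 3+ alleles: ${flagged}"
```

Here `rs1` has the same two alleles in both files (only the order differs), while `rs2` shows four distinct codes, which is the pattern a strand flip produces.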

In [None]:
# EC2 command line #
mkdir -p /shared/s3/ngc_vidus_oaall/{intersect,merged}/merge_test
cd /shared/s3/ngc_vidus_oaall/merged

# Attempt data set merge
ancestry="ea"
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --bfile ../lung_cancer/genotype/original/final/ea_chr_all \
    --bmerge ../vidus/genotype/original/final/filtered/${ancestry}_cases \
    --make-bed \
    --out merge_test/merged_unflipped
'Error: 25 variants with 3+ alleles present.'

ancestry="ea"
studies=(lung_cancer vidus) # array of study names

# Get first intersection set
file1=/shared/s3/ngc_vidus_oaall/${studies[0]}/genotype/original/final/${ancestry}_chr_all.bim
file2=/shared/s3/ngc_vidus_oaall/${studies[1]}/genotype/original/final/filtered/${ancestry}_cases.bim
echo -e "\nCalculating intersection between ${file1} and ${file2}...\n"
comm -12 <(cut -f 2 $file1 | sort -u) <(cut -f 2 $file2 | sort -u) \
    > ../intersect/${ancestry}_variant_intersection.txt

wc -l ../intersect/ea_variant_intersection.txt
'628804 intersect/ea_variant_intersection.txt'

cd ../
# Make new PLINK binary file set for lung_cancer
study="lung_cancer"
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --bfile /shared/s3/ngc_vidus_oaall/${study}/genotype/original/final/${ancestry}_chr_all \
    --extract intersect/${ancestry}_variant_intersection.txt \
    --exclude merged/merge_test/merged_unflipped-merge.missnp \
    --make-bed \
    --out intersect/${study}_${ancestry}
'--extract: 628804 variants remaining.
--exclude: 628779 variants remaining.'


study="vidus"
# Make new PLINK binary file set for vidus
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --bfile /shared/s3/ngc_vidus_oaall/${study}/genotype/original/final/filtered/${ancestry}_cases \
    --extract intersect/${ancestry}_variant_intersection.txt \
    --exclude merged/merge_test/merged_unflipped-merge.missnp \
    --make-bed \
    --out intersect/${study}_${ancestry}


'--extract: 628804 variants remaining.
--exclude: 628779 variants remaining.'

### Second pass merge
To ensure data set compatibility, a second pass merge is executed.

In [None]:
# EC2 command line #
cd /shared/s3/ngc_vidus_oaall/

# Re-attempt data set merge
ancestry="ea"
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --bfile intersect/lung_cancer_ea \
    --bmerge intersect/vidus_ea \
    --make-bed \
    --out merged/merge_test/merged_intersect

# Clean-up
rm -r merged/merge_test

No errors were raised by this second-pass merge.

## Assign Cases and Controls

In [13]:
# EC2 command line #
cd /shared/s3/ngc_vidus_oaall/

# assign case or control status in fam file (1=control 2=case)

ancestry="ea"
study1=lung_cancer
study2=vidus

# Modify FAM file to include case/control status
awk '{ $6=1; print $0 }' intersect/${study1}_${ancestry}.fam \
    > intersect/${study1}_${ancestry}_control.fam
awk '{ $6=2; print $0 }' intersect/${study2}_${ancestry}.fam \
    > intersect/${study2}_${ancestry}_case.fam
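The effect of this step can be checked on toy data. The sketch below uses hypothetical FAM rows and file names; it sets column 6 to 1 for one study and 2 for the other, then tallies the resulting phenotype codes.

```shell
tmp=$(mktemp -d)
# Hypothetical FAM rows: FID IID father mother sex phenotype
printf 'LC1 LC1 0 0 2 -9\nLC2 LC2 0 0 1 -9\n' > "$tmp/toy_controls.fam"
printf 'VID-0003 VID-0003 0 0 1 -9\n'          > "$tmp/toy_cases.fam"

# Overwrite the phenotype column: 1 = control, 2 = case
awk '{ $6 = 1; print }' "$tmp/toy_controls.fam" >  "$tmp/toy_combined.fam"
awk '{ $6 = 2; print }' "$tmp/toy_cases.fam"    >> "$tmp/toy_combined.fam"

# Tally each status as a sanity check
n_controls=$(awk '$6 == 1' "$tmp/toy_combined.fam" | wc -l | tr -d ' ')
n_cases=$(awk '$6 == 2' "$tmp/toy_combined.fam" | wc -l | tr -d ' ')
echo "controls=${n_controls} cases=${n_cases}"
```

On the real data, the case tally would be expected to match the 181 subjects kept from VIDUS.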



## EIGENSTRAT
To obtain principal component covariates to use in the GWAS statistical model, EIGENSTRAT is run on LD-pruned observed genotypes for each ancestry group. Usually a GRCh37 plus strand check is implemented, as well as a monomorphic SNP filter and discordant allele flip. Since this was already done for data in preparation for haplotype phasing, the haplotype phasing input PLINK files will be used. Note: In addition to these aforementioned data processing steps, ambiguous SNPs identified by reference panel frequency differences in the discordant allele checks were also removed prior to phasing.

### PLINK file set merge and MAF filter
* Combine the PLINK file sets to form a cases+controls PLINK fileset

In [None]:
### EC2 console ###
mkdir /shared/s3/ngc_vidus_oaall/eigenstrat
cd /shared/s3/ngc_vidus_oaall/

ancestry="ea"
study1=lung_cancer
study2=vidus

# Create temporary file sets
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --fam intersect/${study1}_${ancestry}_control.fam \
    --bim intersect/${study1}_${ancestry}.bim \
    --bed intersect/${study1}_${ancestry}.bed \
    --make-bed \
    --out eigenstrat/${study1}_${ancestry}.tmp
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --fam intersect/${study2}_${ancestry}_case.fam \
    --bim intersect/${study2}_${ancestry}.bim \
    --bed intersect/${study2}_${ancestry}.bed \
    --make-bed \
    --out eigenstrat/${study2}_${ancestry}.tmp

# Merge file sets
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --bfile eigenstrat/${study1}_${ancestry}.tmp \
    --bmerge eigenstrat/${study2}_${ancestry}.tmp \
    --allow-no-sex \
    --make-bed \
    --out eigenstrat/${study1}_vs_${study2}_${ancestry}_merged.tmp

# MAF > 0.01
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --bfile eigenstrat/${study1}_vs_${study2}_${ancestry}_merged.tmp \
    --maf 0.01 \
    --make-bed \
    --out eigenstrat/${study1}_vs_${study2}_${ancestry}

# Clean up
rm eigenstrat/*.tmp.*

### Remove high-LD region variants

In [None]:
### EC2 console ###
cd /shared/s3/ngc_vidus_oaall/eigenstrat

ancestry="ea"
study1=lung_cancer
study2=vidus

# Generate list of variants in known high-LD regions
# (with -l, print already appends a newline, so no explicit "\n" is added)
perl -lane 'if (($F[0]==5 && $F[3] >= 43964243 && $F[3] <= 51464243) || ($F[0]==6 && $F[3] >= 24892021 && $F[3] <= 33392022) || ($F[0]==8 && $F[3] >= 7962590 && $F[3] <= 11962591) || ($F[0]==11 && $F[3] >= 45043424 && $F[3] <= 57243424)) { print $F[1]; }' ${study1}_vs_${study2}_${ancestry}.bim \
    > ${study1}_vs_${study2}_${ancestry}.high_ld_regions.remove
            
# Remove SNPs in known high-LD regions
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --bfile ${study1}_vs_${study2}_${ancestry} \
    --exclude ${study1}_vs_${study2}_${ancestry}.high_ld_regions.remove \
    --make-bed \
    --out ${study1}_vs_${study2}_${ancestry}_high_ld_regions_removed
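The same region filter can also be written in `awk`. The sketch below applies it to a hypothetical four-variant `.bim` file, using the region boundaries from the Perl one-liner above; the variant IDs and positions are invented for illustration.

```shell
tmp=$(mktemp -d)
# Hypothetical BIM rows: chr, variant ID, cM, bp, allele 1, allele 2
printf '6\trsA\t0\t30000000\tA\tG\n' >  "$tmp/toy.bim"   # inside the chr6 region
printf '6\trsB\t0\t40000000\tA\tG\n' >> "$tmp/toy.bim"   # chr6, outside the region
printf '8\trsC\t0\t8000000\tC\tT\n'  >> "$tmp/toy.bim"   # inside the chr8 region
printf '1\trsD\t0\t8000000\tC\tT\n'  >> "$tmp/toy.bim"   # same position, other chromosome

# Print the ID of every variant that falls in one of the four regions
awk '($1 == 5  && $4 >= 43964243 && $4 <= 51464243) ||
     ($1 == 6  && $4 >= 24892021 && $4 <= 33392022) ||
     ($1 == 8  && $4 >= 7962590  && $4 <= 11962591) ||
     ($1 == 11 && $4 >= 45043424 && $4 <= 57243424) { print $2 }' \
    "$tmp/toy.bim" > "$tmp/toy.high_ld_regions.remove"

flagged=$(paste -s -d ',' "$tmp/toy.high_ld_regions.remove")
echo "variants to remove: ${flagged}"
```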

### Linkage disequilibrium pruning
Linkage disequilibrium (LD) pruning eliminates a large degree of redundancy in the data and reduces the influence of chromosomal artifacts. The objective of LD pruning is to select a subset of variants based on LD such that the variants in the subset are approximately independent. This filtering will not carry forward to the final processed results, but this step improves the quality of EIGENSTRAT calculations. Consequently, the LD-pruned data will be used as input for those calculations.

LD pruning is implemented using [PLINK --indep-pairwise](https://www.cog-genomics.org/plink/1.9/ld#indep).
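In `--indep-pairwise 1500 150 0.2`, the arguments are the window size (1500 variants), the step size (150 variants), and the r^2 threshold (0.2). As a rough illustration of the pairwise criterion only, the sketch below computes the squared Pearson correlation between genotype dosage columns for six hypothetical samples. PLINK's own windowing and variant selection are not reproduced, and the file name and data are invented.

```shell
tmp=$(mktemp -d)
# One row per sample, one column per variant (0/1/2 copies of an allele)
cat > "$tmp/toy_dosage.txt" <<'EOF'
0 0 1
1 1 1
2 2 1
0 0 0
1 1 2
2 2 1
EOF

# Compare column 1 against every other column using running sums
r2_report=$(awk -v thr=0.2 '
{
    n++
    sx += $1; sxx += $1 * $1
    for (j = 2; j <= NF; j++) {
        sy[j] += $j; syy[j] += $j * $j; sxy[j] += $1 * $j
    }
    nf = NF
}
END {
    varx = sxx - sx * sx / n
    for (j = 2; j <= nf; j++) {
        cov  = sxy[j] - sx * sy[j] / n
        vary = syy[j] - sy[j] * sy[j] / n
        r2   = cov * cov / (varx * vary)
        printf "snp1 vs snp%d: r2=%.3f %s\n", j, r2, (r2 > thr ? "above" : "below")
    }
}' "$tmp/toy_dosage.txt")
echo "$r2_report"
```

A pair reported as `above` exceeds the 0.2 threshold, so one of its two variants would be a candidate for pruning.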

In [None]:
### EC2 console ###
cd /shared/s3/ngc_vidus_oaall/eigenstrat

ancestry="ea"
study1=lung_cancer
study2=vidus

for chr in {1..23}; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name ${study1}_${study2}_${ancestry}_${chr}_ld_prune \
        --script_prefix ${study1}_vs_${study2}_${ancestry}_${chr}_ld_prune \
        --mem 3 \
        --nslots 1 \
        --program /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --memory 3000 \
            --bfile ${study1}_vs_${study2}_${ancestry}_high_ld_regions_removed \
            --indep-pairwise 1500 150 0.2 \
            --chr ${chr} \
            --out ${study1}_vs_${study2}_${ancestry}_chr${chr}_ld_pruned
done

# Merge *prune.in files
ancestry="ea"
study1=lung_cancer
study2=vidus
cat ${study1}_vs_${study2}_${ancestry}_chr*_ld_pruned.prune.in > ${study1}_vs_${study2}_${ancestry}_chr_all_ld_pruned.prune.in

# Create new PLINK filesets with only LD-pruned variants
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --noweb \
    --bfile ${study1}_vs_${study2}_${ancestry} \
    --extract ${study1}_vs_${study2}_${ancestry}_chr_all_ld_pruned.prune.in \
    --make-bed \
    --out ${study1}_vs_${study2}_${ancestry}_ld_pruned

# Clean up
rm *${ancestry}*ld_pruned.{prune.in,prune.out,log}
rm *${ancestry}*ld_prune*qsub*
rm *${ancestry}*high_ld_regions*
rm *${ancestry}*chr23_ld_pruned.hh