# FTND GWAS of Nelson Data

**Author**: Jesse Marks

**GitHub Issue**: [issue 96](https://github.com/RTIInternational/bioinformatics/issues/96#issuecomment-401903152)

This document logs the steps taken to process [*A Genome-Wide Association Study of Heroin Dependence*](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000277.v1.p1&phv=171252&phd=&pha=&pht=2797&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1) (the Nelson data set) and to perform an FTND GWAS that tests for association with nicotine dependence (ND). The Nelson data set comes from a collaboration of Australian and American investigators that aims to identify genes associated with liability for heroin dependence. FTND phenotype data are also included, so we can perform an ND GWAS.

The [Fagerström Test for Nicotine Dependence (FTND)](https://cde.drugabuse.gov/instrument/d7c0b0f5-b865-e4de-e040-bb89ad43202b) is a standard instrument for assessing the intensity of physical addiction to nicotine.

The data for this study are located on S3 at:

`s3://rti-midas-data/studies/heroin_icghd/`

The Nelson data set has been downloaded from S3 to an EC2 instance at:
`/shared/data/studies/Nelson`


`/shared/data/studies/Nelson/UHS_heroin`

The head of the data looks like the following:

`head imputed.aa_pop_pheno_controls_overlap.chr1.heroin_10EVs.palogist.stats`

`name position A1 A2 Freq1 MAF Quality Rsq n Mean_predictor_allele beta_SNP_add sebeta_SNP_add loglik chi p or_95_percent_ci       chr
rs58108140 10583 G A 0.923 0.077 0 0.471 3662 0.923191 0.00656649 0.133317 -2320.59 0.00242602632930075 0.960716312898352 1.01(0.78-1.31) 1
`

Where is the phenotype file for these data? What data do I need in order to perform a GWAS? I should look at my old notebooks to see how I performed previous GWAS and then replicate that here.

**Set autoscaling tags**

This can be done in the AWS console.

In [None]:
aws autoscaling create-or-update-tags --tags "ResourceId=cfncluster-JMspot2-ComputeFleet-IQXOI7AVONI2,ResourceType=auto-scaling-group,Key=project-number,Value=0215457.000.001,PropagateAtLaunch=true"

In [None]:
## EC2 ##
# NOTE: the sync runs in the background; make sure it has finished before copying files below
aws s3 sync s3://rti-midas-data/studies/heroin_icghd/ /shared/data/studies/nelson --quiet &
mkdir -p /shared/jmarks/nicotine/gwas/nelson/phenotype
cp /shared/data/studies/nelson/phenotypes/probabel/heroin_icghd.ea.CAT_FTND.AGE.SEX.ALCOHOLDEP4.EVs.v1.gz \
    /shared/jmarks/nicotine/gwas/nelson/phenotype

cd /shared/jmarks/nicotine/gwas/nelson/phenotype
gunzip heroin_icghd.ea.CAT_FTND.AGE.SEX.ALCOHOLDEP4.EVs.v1.gz

# we only want to run a baseline GWAS, so remove the alcohol covariate column (column 5)
awk '{print $1,$2,$3,$4,$6,$7,$8,$9}' heroin_icghd.ea.CAT_FTND.AGE.SEX.ALCOHOLDEP4.EVs.v1 > nelson.ea.CAT_FTND.AGE.SEX.EVs.v1

mkdir -p ../association_tests/001
cd ../association_tests/001

cp /shared/data/studies/nelson/imputed/v1/association_tests/003/_methods.heroin_icghd.v1.association_tests.003.sh \
    _methods.nelson.v1.association_tests.sh
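
The `awk` call above hard-codes the columns to keep, so it is worth checking on a small file that exactly one field disappears from every row. The following is a minimal sketch of the same column drop written as a reusable function; `drop_column` and the toy column names are hypothetical and are not part of the original pipeline.

```shell
# Drop one whitespace-delimited column (1-based index) from the input.
# Reads the named files, or stdin when no file is given.
drop_column() {
    local col="$1"
    shift
    awk -v drop="$col" '{
        out = ""
        sep = ""
        for (i = 1; i <= NF; i++) {
            if (i == drop) continue      # skip the unwanted column
            out = out sep $i
            sep = " "
        }
        print out
    }' "$@"
}

# Toy phenotype file; the header names are hypothetical stand-ins.
tmpdir="$(mktemp -d)"
cat > "$tmpdir/pheno.txt" <<'EOF'
iid ftnd age sex alcohol EV1 EV2 EV3 EV4
id1 1 34 1 0 0.01 -0.02 0.03 0.00
id2 3 41 2 1 -0.01 0.02 0.01 0.04
EOF

drop_column 5 "$tmpdir/pheno.txt" > "$tmpdir/pheno.noalc.txt"

# Every row should now have 8 fields instead of 9.
awk '{print NF}' "$tmpdir/pheno.noalc.txt" | sort -u
head -n 1 "$tmpdir/pheno.noalc.txt"
```

Unlike the hard-coded `print`, this version keeps any columns beyond the ninth, which may or may not be what is wanted for a given phenotype file.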





In [None]:
#!/data/0212964/nextflow/nextflow-0.25.1-all

/* ######################################################################
#                 Association Pipeline v0.1                             #
###################################################################### */

/*
 * Defines pipeline parameters
 */

params.final_chunks = "The final chunking file split by chromosome"
params.input_pheno = "The phenotype to be used in this analysis, file generated in ESN Windows environment"
params.imputation_dir = "The directory containing the post imputation genotype files (mldose files)"
params.example_mldose = "One of the mldose files in the imputation directory"
params.geno_prefix = "The prefix of the mldose file"
params.working_dirs = "The working directory"
params.out = "The output file name for the stats file"
params.method = "palinear or palogist"

/*
 * probabel_header = "name\tchrom\tposition\tA1\tA2\tFreq1\tMAF\tQuality\tRsq\tn\tMean_predictor_allele\tbeta_SNP_add\tsebeta_SNP_add\tchi2_SNP\tchi\tp\tor_95_percent_ci"
 */

// create chunks channels from chunking file
chunks = Channel
                        .from( file(params.final_chunks) )
                        .splitCsv(header:['chr', 'chunk', 'start', 'end'], skip: 1, sep: '\t')
chunks.into { chunks_prep_geno; chunks_merge; }

/* *********************************************
 * Step 1: Start Prepare ProbABEL Phenotype File
 */

process prepare_pheno{

        input:
        // input defined by parameters

        output:
        file "probabel_pheno" into probabel_phenotype_file

        """
        /share/nas03/bioinformatics_group/software/perl/prepare_probabel_files.pl \
                --in_mldose ${params.example_mldose} \
                --in_pheno ${params.input_pheno} \
                --out_pheno "probabel_pheno"
        """
}

/* ********************************************
 * Step 2: Start Prepare ProbABEL Genotype File
 */

process prepare_geno{

        input:
        file "probabel_pheno" from probabel_phenotype_file
        set chr, chunk, start, end from chunks_prep_geno

        output:
        file "${params.geno_prefix}${chr}.${chunk}.mach_mldose" into probabel_genotype_files

        executor 'sge'
        clusterOptions '-S /bin/bash -l mem_free=15G,h_vmem=15G -p 0'

        """
        /share/nas03/bioinformatics_group/software/perl/prepare_probabel_files.pl \
                --in_mldose "${params.imputation_dir}/${params.geno_prefix}${chr}.${chunk}.mach.mldose.gz" \
                --in_pheno "probabel_pheno" \
                --out_mldose "${params.geno_prefix}${chr}.${chunk}.mach_mldose"
        """
}

/* *******************************
 * Step 3: Start ProbABEL Analysis
 */

process do_probabel_analysis{

        input:
        file probabel_pheno from probabel_phenotype_file
        file geno from probabel_genotype_files

        output:
        file "${geno.baseName}_add.out.txt" into probabel_out_files

        executor 'sge'
        clusterOptions '-S /bin/bash -l mem_free=7G,h_vmem=7G -p 0'

        """
        /share/nas03/bioinformatics_group/software/${params.method} \
        --pheno $probabel_pheno \
        --dose $geno \
        --info "${params.imputation_dir}/${geno.baseName}.mach.mlinfo" \
        --map "${params.imputation_dir}/${geno.baseName}.legend" \
        --out "${geno.baseName}"
        """
}

/* ***************************
 * Step 4: Start calc chi p OR
 */
process calc_chi_p_or{

        input:
        file result from probabel_out_files

        output:
        file "${result.baseName}.stats" into stats_files

        executor 'sge'
        clusterOptions '-S /bin/bash -l mem_free=7G,h_vmem=7G -p 0'

        """
        /share/nas03/bioinformatics_group/software/R/calculate_stats_for_probabel_results_v2.R \
                --remove_missing_p \
                --in_file "${result}" \
                --out_file "${result.baseName}.stats"
        """
}

/* ***************************
 * Step 5: merge stats files
 */

stats_files
        .collectFile(name: file(params.out), skip: 1)
        .println {"Result saved to file: $it"}
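
The R script called in Step 4 is not shown in this notebook, so the following is only a sketch of what it presumably computes: the standard Wald chi-square (1 df) and, for `palogist` output, the odds ratio with a 95% confidence interval derived from `beta_SNP_add` and `sebeta_SNP_add`. The function name `wald_stats` is hypothetical. The p-value is left out because plain `awk` has no chi-square distribution function.

```shell
# Wald chi-square and odds ratio with 95% CI from a log-odds estimate
# (beta) and its standard error (se).
# usage: wald_stats BETA SE
wald_stats() {
    awk -v beta="$1" -v se="$2" 'BEGIN {
        z = beta / se
        chi = z * z                       # Wald chi-square, 1 df
        odds = exp(beta)                  # odds ratio
        ci_lo = exp(beta - 1.96 * se)     # lower 95% bound
        ci_hi = exp(beta + 1.96 * se)     # upper 95% bound
        printf "chi=%.6f or_95_percent_ci=%.2f(%.2f-%.2f)\n", chi, odds, ci_lo, ci_hi
    }'
}

# beta_SNP_add and sebeta_SNP_add for rs58108140, taken from the
# sample stats row near the top of this document.
wald_stats 0.00656649 0.133317
# prints: chi=0.002426 or_95_percent_ci=1.01(0.78-1.31)
```

These values agree with the `chi` and `or_95_percent_ci` columns of that sample row, which suggests the stats files follow this convention.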


## Imputation Conversion

There are some apparent issues with the `mldose` files. Specifically, the minor allele frequencies (MAFs) do not match what they should be. The issue was found after performing a baseline GWAS and finding that all of the results were NULL. Comparing the MAFs reported for the `mldose` files with those reported for the `gen.gz` files shows that they are discordant. As an example, let us compare the MAF of SNP `rs16969968` (Mr. Big) in the two accompanying info files.

```
heroin_icghd.ea.1000G_p3.chr15.21.mach.mlinfo:
SNP     Al1     Al2     Freq1   MAF     Quality Rsq
rs16969968:78882925:G:A   G       A       0.002   0.002   0       0

heroin_icghd.ea.1000G_p3.chr15.21.gen_info:
snp_id rs_id position exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- rs16969968:78882925:G:A 78882925 0.349 0.996 0.998 0 -1 -1 -1
```
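
The same comparison can be run across a whole chunk instead of a single SNP. The sketch below assumes the column layouts shown above (`MAF` in column 5 of the `mlinfo` file, `rs_id` and `exp_freq_a1` in columns 2 and 4 of the `gen_info` file). It folds `exp_freq_a1` to a MAF so that the allele coding does not matter. The function name `compare_maf` and the default tolerance of 0.05 are hypothetical choices.

```shell
# Print SNPs whose MAF differs between a MACH mlinfo file and an
# IMPUTE2 info file by more than a tolerance.
# usage: compare_maf MLINFO GEN_INFO [TOLERANCE]
compare_maf() {
    local mlinfo="$1" gen_info="$2" tol="${3:-0.05}"
    awk -v tol="$tol" '
        # First file (mlinfo): store the MAF column keyed by SNP id.
        NR == FNR { if (FNR > 1) mach_maf[$1] = $5; next }
        # Second file (gen_info): fold exp_freq_a1 to a MAF and compare.
        FNR > 1 && ($2 in mach_maf) {
            maf = ($4 > 0.5) ? 1 - $4 : $4
            diff = mach_maf[$2] - maf
            if (diff < 0) diff = -diff
            if (diff > tol) print $2, mach_maf[$2], maf
        }
    ' "$mlinfo" "$gen_info"
}

# Toy inputs built from the two rows shown above.
workdir="$(mktemp -d)"
cat > "$workdir/toy.mlinfo" <<'EOF'
SNP Al1 Al2 Freq1 MAF Quality Rsq
rs16969968:78882925:G:A G A 0.002 0.002 0 0
EOF
cat > "$workdir/toy.gen_info" <<'EOF'
snp_id rs_id position exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- rs16969968:78882925:G:A 78882925 0.349 0.996 0.998 0 -1 -1 -1
EOF

compare_maf "$workdir/toy.mlinfo" "$workdir/toy.gen_info"
# prints: rs16969968:78882925:G:A 0.002 0.349
```

A long list of flagged SNPs would point to a systematic conversion problem rather than an isolated bad variant.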

In [None]:
#perl /shared/bioinformatics/software/perl/file_conversion/convert_post-v2.3.2_impute2_files.pl\
dataDir=/shared/data/studies/nelson/imputed/v1/imputations/ea/chr15/sandbox

/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name impute_conversion_test \
    --script_prefix $dataDir/test2/mytest \
    --mem 15 \
    --nslots 4 \
    --priority 0 \
    --program perl /shared/bioinformatics/software/perl/file_conversion/convert_pre-v2.3.2_impute2_files.pl \
        --gen $dataDir/heroin_icghd.ea.1000G_p3.chr15.21.gen.gz \
        --sample $dataDir/heroin_icghd.ea.for_imputation.chr15.phased.sample.gz \
        --gen_info $dataDir/heroin_icghd.ea.1000G_p3.chr15.21.gen_info \
        --remove $dataDir/heroin_icghd.ea.1000G_p3.chr15.gen.remove \
        --out $dataDir/test2/heroin_icghd.ea.1000G_p3.chr15.21 \
        --generate_mach_mldose_file \
        --generate_mach_mlinfo_file \
        --generate_legend_file \
        --variant_type all
