# NextFlow pipeline sandbox
__Author__: Jesse Marks

This notebook is dedicated to learning the Nextflow pipeline that we run to perform GWAS on the nicotine projects. The pipeline has been tailored to expect IMPUTE2 output that has been converted to mldose format.

One important distinction between the heroin protocol and the nicotine protocol is that the nicotine cohorts need to be analyzed with ProbABEL, so that we can include interaction terms in some of the analyses and run subsequent joint 2df meta-analyses.

## Unfamiliar concepts
### NextFlow

Nextflow is a framework that eases writing computational pipelines for complex data. It is built on the idea that Linux is the lingua franca of data science: many simple command-line tools are chained together to do complex work. Nextflow extends this approach by letting you define complex interactions between programs and by providing a high-level parallel computing environment.

Essentially, a Nextflow pipeline script is a series of scripting-language commands joined together. Nextflow makes it easier to connect many different tasks into a single computational pipeline.

### ProbABEL

ProbABEL is a tool for GWAS analysis of imputed genetic data. Imputed data are probabilistic in nature: the genotype of a person at an imputed marker is known with much lower confidence than at a directly genotyped marker. Imputation therefore assigns each unknown genotype a probability for every possible genotype. One could simply accept the most probable genotype and treat it as observed, but this approach is biased because it ignores the uncertainty about the true genotype. ProbABEL was designed to perform association analysis by regressing the outcome of interest onto the estimated genotype probabilities. For our studies, we will use linear regression.
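
To make the difference between a best-guess genotype and a probability-weighted dosage concrete, here is a minimal sketch. The three probabilities are made-up values, and the dosage formula (expected count of the B allele) is the standard textbook definition rather than anything taken from ProbABEL itself.

```shell
# Hypothetical genotype probabilities for one person at one imputed SNP,
# in the order P(AA) P(AB) P(BB). These numbers are illustrative only.
probs="0.10 0.50 0.40"

# Expected dosage of allele B = 1*P(AB) + 2*P(BB); this keeps the uncertainty.
dosage=$(echo "$probs" | awk '{ printf "%.2f", $2 + 2 * $3 }')

# Best-guess genotype = number of B alleles in the most probable genotype.
best_guess=$(echo "$probs" | awk '{
    best = 0; max = $1 + 0
    for (i = 2; i <= 3; i++) if ($i + 0 > max) { max = $i + 0; best = i - 1 }
    print best
}')

echo "dosage=$dosage best_guess=$best_guess"
```

The best guess rounds this person to exactly one copy of B, while the dosage records that two copies are almost as likely. Regressing on the dosage (or on the probabilities) is what lets the analysis account for that uncertainty.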


#### Input files
Three files are needed as input:
1. SNP info file (e.g. the MLINFO file of MaCH)
2. genome- or chromosome-wide predictor information (e.g. the MLDOSE file of MaCH)
3. file containing phenotype of interest and covariates.


* The program that runs a linear regression is called `palinear`.
* The `--interaction` option lets you include an interaction between the SNP and any covariate, such as age * SNP.
    * I believe this is why we want to use ProbABEL instead of rvtest for the GWAS.
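
As a rough sketch of what a `palinear` call with an interaction term might look like, the snippet below only assembles and prints the command (a dry run). The file names are hypothetical placeholders. It also assumes that AGE is the first covariate after the outcome column in the phenotype file, so that `--interaction=1` selects it; check the manual before relying on that numbering.

```shell
# Hypothetical file names; substitute the real MLDOSE/MLINFO/phenotype files.
pheno=pheno.txt        # columns: ID, outcome, then covariates (e.g. AGE, SEX, EVs)
dose=chr1.mldose
info=chr1.mlinfo
out=chr1.age_x_snp

# Assumption: AGE is covariate number 1, so --interaction=1 asks for an
# AGE x SNP term in addition to the main effects.
cmd="palinear --pheno $pheno --dose $dose --info $info --chrom 1 --interaction=1 --out $out"

# Dry run: print the command instead of executing it.
echo "$cmd"
```

Printing the command first makes it easy to confirm the file paths and the covariate index before submitting anything to the cluster.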

See [manual](http://www.genabel.org/sites/default/files/pdfs/ProbABEL_manual.pdf) for more information.

### Joint 2df meta-analyses

### Genomics meta-analysis

## SGE
Sun Grid Engine (SGE) is a job scheduler for compute clusters. A scheduling system like this is necessary when limited computational resources are shared by many users.
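
For orientation, here is a minimal sketch of an SGE job script. It only writes the script to a scratch directory; nothing is submitted. The job name and the `h_vmem` memory request are assumptions for illustration, since resource names are configured per site and may differ on MIDAS. Lines starting with `#$` are directives that `qsub` reads: `-S` sets the shell, `-N` the job name, `-cwd` runs the job from the submission directory, and `-l` requests a resource.

```shell
# Write a small SGE job script to a temporary directory (nothing is submitted).
job_dir=$(mktemp -d)
job_script="$job_dir/hello_sge.sh"

cat > "$job_script" <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -N hello_sge
#$ -cwd
#$ -l h_vmem=4G
echo "Running on $(hostname)"
EOF

# On a cluster you would submit it with: qsub "$job_script"
echo "Job script written to $job_script"
```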

## Pipeline analysis

John Guo pointed me towards an example on MIDAS at:

`/share/nas04/bioinformatics_group/data/studies/jhs_aric_aa/imputed/v3/association_tests/001`

I copied the contents over to my local directory on MIDAS at:

`/home/jmarks/nextflow/001`

### Contents of pipeline
There are three files named:

__` _methods.cogend.imputed.v3.association_tests.006.sh  nextflow.config  _pipeline.association.out_stats_files.run.sh`__

and one directory named:

__`aa`__.

#### _methods.cogend.imputed.v3.association_tests.006.sh
The code of this methods file is below. We will go through this code to get an understanding of what is going on. We will annotate as much as we can.

In [None]:
ASSOCIATION_ROOT=/share/nas04/bioinformatics_group/data/studies/jhs_aric_aa/imputed/v3/association_tests/001
COHORT=jhs_aric_aa
MODEL=CAT_FTND~SNP+AGE+SEX+EVs

for ethnicity in aa; do
        for (( chr=1; chr<24; chr++ )); do
        mkdir -p $ASSOCIATION_ROOT/$ethnicity/processing/chr$chr
        done
        mkdir $ASSOCIATION_ROOT/$ethnicity/final
done
### START Filter ###

# add in chr id
for ethnicity in aa; do
    for (( chr=1; chr<24; chr++ )); do
        inFile=$ASSOCIATION_ROOT/$ethnicity/$COHORT.$ethnicity.1000G_p3.chr$chr.$MODEL.stats
        echo Processing $inFile
        outFile=$ASSOCIATION_ROOT/$ethnicity/processing/chr$chr/$COHORT.$ethnicity.1000G_p3.chr$chr.$MODEL.stats
        echo -e "chr\tname\tposition\tA1\tA2\tFreq1\tMAF\tQuality\tRsq\tn\tMean_predictor_allele\tbeta_SNP_add\tsebeta_SNP_add\tchi2_SNP\tchi\tp\tor_95_percent_ci" > $outFile
        tail -n +1 $inFile |
          perl -slne '{ print join("\t", "$chr", "$_"); }' -- -chr="$chr" >> $outFile
    done
done

# MAF > 0.01 in AFR (AA) or EUR (EA)
for ethnicity in aa;do
  if [ $ethnicity == "aa" ]; then
    group=afr
  else
    group=eur
  fi
    for (( chr=1; chr<24; chr++ )); do
      if [ $chr == "23" ]; then
        idList=/share/nas03/bioinformatics_group/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.maf_lte_0.01_$group
      else
        idList=/share/nas03/bioinformatics_group/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr$chr.maf_lte_0.01_$group
      fi
      /share/nas03/bioinformatics_group/software/scripts/qsub_job.sh \
        --job_name ${ethnicity}_${chr} \
        --script_prefix $ASSOCIATION_ROOT/$ethnicity/processing/chr$chr/$COHORT.$ethnicity.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_$group \
        --mem 3.8 \
        --priority 0 \
        --program /share/nas03/bioinformatics_group/software/perl/extract_rows.pl \
        --source $ASSOCIATION_ROOT/$ethnicity/processing/chr$chr/$COHORT.$ethnicity.1000G_p3.chr$chr.$MODEL.stats \
        --id_list $idList \
        --out $ASSOCIATION_ROOT/$ethnicity/processing/chr$chr/$COHORT.$ethnicity.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_$group \
        --header 0 \
        --id_column 1 \
        --remove
    done
done

python /home/jyguo/monitor.py

for ethnicity in aa; do
  mv $ASSOCIATION_ROOT/$ethnicity/processing/chr*/*.maf_gt_0.01_??? \
   $ASSOCIATION_ROOT/$ethnicity/final
done

# Filter out variants with MAF <= 0.01 in the study cohort itself ($COHORT)
for ethnicity in aa; do
    if [ $ethnicity == "aa" ]; then
        group=afr
    else
        group=eur
    fi
    for (( chr=1; chr<24; chr++ )); do
        inFile=$ASSOCIATION_ROOT/$ethnicity/final/$COHORT.$ethnicity.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_$group
        outFile=$ASSOCIATION_ROOT/$ethnicity/final/$COHORT.$ethnicity.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_${group}+$COHORT
        echo Processing $inFile
        head -n 1 $inFile > $outFile
        tail -n +2 $inFile |
            perl -lne '/^(?:\S+\s+){6}(\S+)/;
                        if ( $1 > 0.01) {
                            print;
                        }' >> $outFile
    done
done


### END Filter ###

### START Generate plots ###
for ethnicity in aa; do
  if [ $ethnicity == "aa" ]; then
    group=afr
  else
    group=eur
  fi
  for ext in $group ${group}+$COHORT; do
        outFile=$ASSOCIATION_ROOT/$ethnicity/processing/$COHORT.$ethnicity.1000G_p3.$MODEL.maf_gt_0.01_$ext.table
echo -e "VARIANT_ID\tCHR\tPOSITION\tP\tTYPE" > $outFile
for (( chr=1; chr<24; chr++ )); do
    inFile=$ASSOCIATION_ROOT/$ethnicity/final/$COHORT.$ethnicity.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_$ext
    echo Processing $inFile
    tail -n +2 $inFile |
      perl -lne '/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:\s+\S+){10}\s+(\S+)/;
                 if (($4 eq "A" || $4 eq "C" || $4 eq "G" || $4 eq "T") && ($5 eq "A" || $5 eq "C" || $5 eq "G" || $5 eq "T")) {
                   print join("\t",$2,$1,$3,$6,"snp");
                 } else {
                   print join("\t",$2,$1,$3,$6,"indel");
                 }' >> $outFile
done
done
done

for ethnicity in aa; do
  if [ $ethnicity == "aa" ]; then
    group=afr
  else
    group=eur
  fi
  for ext in $group ${group}+$COHORT; do
/share/nas03/bioinformatics_group/software/scripts/qsub_job.sh \
  --job_name gwas_plots \
  --script_prefix $ASSOCIATION_ROOT/$ethnicity/processing/$COHORT.$ethnicity.1000G_p3.$MODEL.maf_gt_0.01_$ext.plots \
  --mem 15 \
  --priority 0 \
  --program /share/nas03/bioinformatics_group/software/R/dev/generate_gwas_plots.v6.R \
  --in $ASSOCIATION_ROOT/$ethnicity/processing/$COHORT.$ethnicity.1000G_p3.$MODEL.maf_gt_0.01_$ext.table \
  --in_chromosomes autosomal_nonPAR \
  --in_header \
  --out $ASSOCIATION_ROOT/$ethnicity/final/$COHORT.$ethnicity.1000G_p3.$MODEL.maf_gt_0.01_$ext \
  --col_id VARIANT_ID \
  --col_chromosome CHR \
  --col_position POSITION \
  --col_p P \
  --col_variant_type TYPE \
  --generate_snp_indel_manhattan_plot \
  --manhattan_odd_chr_color red \
  --manhattan_even_chr_color blue \
  --manhattan_points_cex 1.5 \
  --generate_snp_indel_qq_plot \
  --qq_lines \
  --qq_points_bg black \
  --qq_lambda
done
done

### END Generate plots ###


### START Filter by p-value ###

# MAF > 0.01 in AFR and EUR
for ethnicity in aa; do
  if [ $ethnicity == "aa" ]; then
    group=afr
  else
    group=eur
  fi
  for ext in $group ${group}+$COHORT; do
    outFile=$ASSOCIATION_ROOT/$ethnicity/processing/$COHORT.$ethnicity.1000G_p3.$MODEL.maf_gt_0.01_$ext.p_lte_0.001
    head -n 1 $ASSOCIATION_ROOT/$ethnicity/final/$COHORT.$ethnicity.1000G_p3.chr1.$MODEL.maf_gt_0.01_$ext > \
      $outFile
    for (( chr=1; chr<24; chr++ )); do
      echo Processing ${ASSOCIATION_ROOT}/$ethnicity/final/$COHORT.$ethnicity.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_$ext
      tail -n +2 $ASSOCIATION_ROOT/$ethnicity/final/$COHORT.$ethnicity.1000G_p3.chr$chr.$MODEL.maf_gt_0.01_$ext |
        perl -lane 'if ($F[15] <= 0.001) { print; }' >> \
        $outFile
    done
done
done

# Sort (note: the loop below is R code, not bash; run it in an R session)
for (ethnicity in c("aa")){
        if (ethnicity == "aa") { group = "afr" } else if (ethnicity == "ea") { group = "eur" }
        dat=read.table(paste0('/share/nas04/bioinformatics_group/data/studies/jhs_aric_aa/imputed/v3/association_tests/001/',ethnicity,'/processing/jhs_aric_aa.',ethnicity,'.1000G_p3.CAT_FTND~SNP+AGE+SEX+EVs.maf_gt_0.01_',group,'.p_lte_0.001'), header = TRUE)
        dat = dat[order(dat$p),]
        write.csv(dat, file = paste0('/share/nas04/bioinformatics_group/data/studies/jhs_aric_aa/imputed/v3/association_tests/001/',ethnicity,'/final/jhs_aric_aa.',ethnicity,'.1000G_p3.CAT_FTND~SNP+AGE+SEX+EVs.maf_gt_0.01_',group,'.p_lte_0.001.csv'), row.names = FALSE)
        dat=read.table(paste0('/share/nas04/bioinformatics_group/data/studies/jhs_aric_aa/imputed/v3/association_tests/001/',ethnicity,'/processing/jhs_aric_aa.',ethnicity,'.1000G_p3.CAT_FTND~SNP+AGE+SEX+EVs.maf_gt_0.01_',group,'+jhs_aric_aa.p_lte_0.001'), header = TRUE)
        dat = dat[order(dat$p),]
        write.csv(dat, file = paste0('/share/nas04/bioinformatics_group/data/studies/jhs_aric_aa/imputed/v3/association_tests/001/',ethnicity,'/final/jhs_aric_aa.',ethnicity,'.1000G_p3.CAT_FTND~SNP+AGE+SEX+EVs.maf_gt_0.01_',group,'+jhs_aric_aa.p_lte_0.001.csv'), row.names = FALSE)
}

### END Filter by p-value ###
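
To check my understanding of the column-based filters in the methods file, here is a small sketch that applies the same logic to a toy table. It uses `awk` in place of the Perl one-liners, and every value in the table is made up. The column positions follow the 17-column header written by the script: MAF is column 7, the alleles are columns 4 and 5, and `p` is column 16.

```shell
# Toy stats table with the same 17 columns as the header used above.
# All values are made up for illustration.
stats=$(mktemp)
printf 'chr\tname\tposition\tA1\tA2\tFreq1\tMAF\tQuality\tRsq\tn\tMean_predictor_allele\tbeta_SNP_add\tsebeta_SNP_add\tchi2_SNP\tchi\tp\tor_95_percent_ci\n' > "$stats"
printf '1\trs1\t100\tA\tG\t0.80\t0.20\t0.9\t0.9\t500\t1.6\t0.10\t0.02\t25\t5\t0.0005\tNA\n' >> "$stats"
printf '1\trs2\t200\tA\tAT\t0.995\t0.005\t0.9\t0.9\t500\t1.99\t0.30\t0.05\t36\t6\t0.0001\tNA\n' >> "$stats"
printf '1\trs3\t300\tC\tT\t0.60\t0.40\t0.9\t0.9\t500\t1.2\t0.01\t0.02\t0.25\t0.5\t0.62\tNA\n' >> "$stats"

# Keep the header plus rows whose cohort MAF (column 7) exceeds 0.01.
maf_pass=$(awk -F'\t' 'NR == 1 || $7 > 0.01' "$stats")

# From those, keep rows with p (column 16) <= 0.001 and print the variant name.
hits=$(printf '%s\n' "$maf_pass" | awk -F'\t' 'NR > 1 && $16 <= 0.001 { print $2 }')

# Label each variant as snp or indel: a SNP has single-base A1 and A2 alleles.
types=$(awk -F'\t' 'NR > 1 {
    t = ($4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/) ? "snp" : "indel"
    print $2, t
}' "$stats")

echo "$hits"
echo "$types"
```

In this toy table `rs2` has the smallest p-value but is dropped by the MAF filter, so only `rs1` survives both filters. `rs2` is also the only variant labelled `indel`, because its second allele is two bases long.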

#### nextflow.config
The contents of nextflow.config are below. The config file is used to define which executor to use, environment variables, pipeline parameters, etc.

In [None]:
process.executor = 'sge' // defines the target execution system
process.clusterOptions = '-S /bin/bash'
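
To illustrate the other kinds of settings a config file can hold, the sketch below writes an example `nextflow.config` into a scratch directory. The first two settings are the ones from the file above. The queue name, the environment variable, and the parameter default are hypothetical additions that are not part of the real pipeline.

```shell
# Write an illustrative nextflow.config into a scratch directory.
conf_dir=$(mktemp -d)
cat > "$conf_dir/nextflow.config" <<'EOF'
// Executor and scheduler options (taken from the real config)
process.executor = 'sge'
process.clusterOptions = '-S /bin/bash'

// Hypothetical additions, for illustration only
process.queue = 'long.q'        // assumed queue name
env.TMPDIR = '/tmp'             // environment variable exported to tasks
params.method = 'palinear'      // default value for a --method parameter
EOF

cat "$conf_dir/nextflow.config"
```

Nextflow config files use Groovy-style syntax, so comments start with `//` rather than `#`.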

#### _pipeline.association.out_stats_files.run.sh
The contents of this bash script are below.

__Notes__:

* Need to organize my phenotype data similar to what is seen in

`phenotypes/probabel/jhs_aric_aa.CAT_FTND.AGE.SEX.EVs.v1` 

In [None]:
#!/bin/sh

for ethnicity in aa; do
        working_dir=/share/nas04/bioinformatics_group/data/studies/jhs_aric_aa/imputed/v3/association_tests/001/$ethnicity
        imputation_root=/share/nas04/bioinformatics_group/data/studies/jhs_aric_aa/imputed/v3/imputations/
        phenotype_root=/share/nas04/bioinformatics_group/data/studies/jhs_aric_aa/phenotypes/probabel/

        method=palinear

        for (( chr=1; chr<24; chr++ )); do
        out_file=jhs_aric_aa.$ethnicity.1000G_p3.chr$chr.CAT_FTND~SNP+AGE+SEX+EVs.stats
            phenotype_file=jhs_aric_aa.CAT_FTND.AGE.SEX.EVs.v1

            # Preview of the phenotype file, from:
            #   head phenotypes/probabel/jhs_aric_aa.CAT_FTND.AGE.SEX.EVs.v1
            #
            # IID     CAT_FTND        AGE     SEX     EV3     EV5     EV8
            # PT-7X6C 1       68      1       0.0139  0.0166  0.0127
            # PT-7ZI1 0       46      2       0.0003  0.0042  -0.0046
            # PT-7ZHW 2       47      1       0.0045  -0.0002 0.0151
            geno_prefix=jhs_aric_aa.$ethnicity.1000G_p3.chr

            /share/nas03/bioinformatics_group/software/nextflow/nextflow-0.25.1-all \
            /share/nas03/bioinformatics_group/software/pipeline/_pipeline.association.out_stats_files.v0.1.nf \
                --final_chunks $imputation_root/chunks/final_chunks.chr$chr \
                --input_pheno $phenotype_root/$phenotype_file \
                --imputation_dir $imputation_root/$ethnicity/chr$chr \
                --example_mldose $imputation_root/$ethnicity/chr$chr/jhs_aric_aa.$ethnicity.1000G_p3.chr$chr.2.mach.mldose.gz \
                --geno_prefix $geno_prefix \
                --working_dirs $working_dir \
                --out $working_dir/$out_file \
                --method $method

                rm -r $working_dir/../work
        done
done
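
Since I need to organize my own phenotype data in the same layout, here is a small sketch of a sanity check on a phenotype file. The toy file and its IDs are made up. The check assumes the layout shown in the preview above (sample ID first, outcome second, covariates after that) and is my own helper, not part of the pipeline.

```shell
# Toy phenotype file in the layout shown above (IDs and values are made up).
pheno=$(mktemp)
printf 'IID\tCAT_FTND\tAGE\tSEX\tEV3\tEV5\tEV8\n' > "$pheno"
printf 'ID1\t1\t68\t1\t0.0139\t0.0166\t0.0127\n' >> "$pheno"
printf 'ID2\t0\t46\t2\t0.0003\t0.0042\t-0.0046\n' >> "$pheno"

# Check 1: the first two header columns are the sample ID and the outcome.
header_ok=$(awk 'NR == 1 {
    ok = ($1 == "IID" && $2 == "CAT_FTND") ? "yes" : "no"
    print ok
}' "$pheno")

# Check 2: count rows whose number of fields differs from the header.
bad_rows=$(awk 'NR == 1 { n = NF; next } NF != n { c++ } END { print c + 0 }' "$pheno")

echo "header_ok=$header_ok bad_rows=$bad_rows"
```

A non-zero `bad_rows` count usually points to missing values or stray whitespace, which are worth fixing before running the association step.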

#### aa

This directory contains a subdirectory, `processing/chr1`, which holds three files:

* jhs_aric_aa.aa.1000G_p3.chr1.CAT_FTND~SNP+AGE+SEX+EVs.maf_gt_0.01_afr.qsub.log
* jhs_aric_aa.aa.1000G_p3.chr1.CAT_FTND~SNP+AGE+SEX+EVs.maf_gt_0.01_afr.qsub.sh
* jhs_aric_aa.aa.1000G_p3.chr1.CAT_FTND~SNP+AGE+SEX+EVs.stats


__Note__: I deleted the other `chr#` directories to save space.

## Converting from VCF to mldose