# FTND LD Score Regression Update
**Author:** Jesse Marks  <br>
**GitHub Issue:** [#103](https://github.com/RTIInternational/bioinformatics/issues/103#issuecomment-602203354)

In this notebook we document the [LD Score Regression](https://github.com/bulik/ldsc) (LDSC) analysis performed for our paper [Expanding the Genetic Architecture of Nicotine Dependence and its Shared Genetics with Multiple Traits: Findings from the Nicotine Dependence GenOmics (iNDiGO) Consortium](https://www.biorxiv.org/content/10.1101/2020.01.15.898858v1.full). The reviewers of this paper noted that some of the results we used for the LD score regression (LDSR) were outdated. The data we originally used came from [LDHub](http://ldsc.broadinstitute.org/). We therefore downloaded the most recent sets of results from the [Psychiatric Genomics Consortium](https://www.med.unc.edu/pgc/) and used them to update our LDSR analysis plot. We also add a vertical line at rg = 1 to the plot, as the reviewer requested.

Note that Michael Bray of WUSTL updated the Lung Cancer results, which we also include here.


## Data Locations
Dana Hancock downloaded the updated results to the following locations:
```
\rtpnfil02\dhancock\Nicotine\Analysis\META\1df\wave3GWASmeta\LDSR\ADHD_Demontis2019
\rtpnfil02\dhancock\Nicotine\Analysis\META\1df\wave3GWASmeta\LDSR\Anorexia_Watson2019
\rtpnfil02\dhancock\Nicotine\Analysis\META\1df\wave3GWASmeta\LDSR\Autism_Groves2019
\rtpnfil02\dhancock\Nicotine\Analysis\META\1df\wave3GWASmeta\LDSR\Bipolar_Stahl2019
\rtpnfil02\dhancock\Nicotine\Analysis\META\1df\wave3GWASmeta\LDSR\MDD_Howard2019
\rtpnfil02\dhancock\Nicotine\Analysis\META\1df\wave3GWASmeta\LDSR\PTSD_Nievergelt2019
\rtpnfil02\dhancock\Nicotine\Analysis\META\1df\wave3GWASmeta\LDSR\YrsSchool_Lee2018\
```

<br><br>

The wave3 FTND meta-analysis results can be found at:
```
s3://rti-nd/META/1df/results/wave3/final_results/ea/final/20190322_ftnd_meta_analysis_wave3.eur.chr[1..22].exclude_singletons.1df.gz
```
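
The `chr[1..22]` in the path above is shorthand for 22 per-chromosome files. As a minimal sketch, the loop below expands it into explicit S3 URIs. The variable names (`base`, `ftnd_uris`) are illustrative and not part of the original analysis, and `seq` is used so the loop does not depend on bash brace expansion.

```shell
# Expand the chr[1..22] shorthand into one S3 URI per autosome.
# Variable names are illustrative, not taken from the original notebook.
base=s3://rti-nd/META/1df/results/wave3/final_results/ea/final
ftnd_uris=$(
    for chr in $(seq 1 22); do
        echo "${base}/20190322_ftnd_meta_analysis_wave3.eur.chr${chr}.exclude_singletons.1df.gz"
    done
)
printf '%s\n' "$ftnd_uris"
```

Each printed URI could then be handed to `aws s3 cp <uri> .` to fetch that chromosome.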

## Workflow Guideline
1. Create the Excel phenotype file locally, then upload it to the EC2 instance
2. Clone https://github.com/RTIInternational/ld-regression-pipeline
3. Edit `full_ld_regression_wf_template.json` to include the reference data of choice
4. Use the dockerized tool to finish filling out the JSON file that serves as the workflow input
5. Run the WDL workflow for LDSC

## Create Input Files
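
Each trait in this section follows the same two-step pattern: split the downloaded summary statistics into one file per chromosome, then append those files in chromosome order while reordering the columns into a common layout. The sketch below reproduces that pattern on a tiny made-up file so the column shuffling is easy to follow. The toy column layout (`CHR SNP BP A1 A2 OR P Nca Nco`), the file names, and the values are all assumptions for illustration, not the real PGC formats.

```shell
# Toy illustration of the split-then-reassemble pattern (made-up data).
# Assumed toy columns: CHR SNP BP A1 A2 OR P Nca Nco
workdir=$(mktemp -d)
printf 'CHR\tSNP\tBP\tA1\tA2\tOR\tP\tNca\tNco\n' >  "$workdir/toy.meta"
printf '2\trs3\t500\tA\tG\t1.10\t0.04\t100\t200\n' >> "$workdir/toy.meta"
printf '1\trs1\t100\tC\tT\t0.95\t0.20\t100\t200\n' >> "$workdir/toy.meta"
printf '1\trs2\t300\tG\tA\t1.02\t0.70\t90\t210\n'  >> "$workdir/toy.meta"

# Step 1: write one file per chromosome, in parallel.
for chr in 1 2; do
    awk -v chrom="$chr" 'NR==1 || $1==chrom' "$workdir/toy.meta" \
        > "$workdir/toy_chr${chr}.txt" &
done
wait  # every background awk job must finish before its output is read

# Step 2: append in chromosome order, reordering columns and
# summing cases + controls into a single N column.
outf="$workdir/toy_ldsc_ready.txt"
printf 'SNP\tCHR\tPOS\tA1\tA2\tOR\tP\tN\n' > "$outf"
for chr in 1 2; do
    tail -n +2 "$workdir/toy_chr${chr}.txt" | \
        awk 'BEGIN{OFS="\t"} {print $2,$1,$3,$4,$5,$6,$7,$8+$9}' >> "$outf"
done
cat "$outf"
```

The `wait` matters: without it the second loop can start reading per-chromosome files that are still being written.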


In [None]:
## ADHD
cd /shared/jmarks/nicotine/ldsc/ftnd_all/20200326/processing/ADHD_Demontis2019

## split up results file so we can order it sequentially by chromosome
for chr in {1..22}; do
    metaF=daner_meta_filtered_NA_iPSYCH23_PGC11_sigPCs_woSEX_2ell6sd_EUR_Neff_70.meta
    outF=adhd_pgc_eur_chr${chr}.txt

    awk -v chrom=$chr 'NR==1 {print $0; next} NR>1 {
        if ($1==chrom)
            {print $0}} ' $metaF > $outF &
done
wait  # let every background awk job finish before the per-chromosome files are read

outf=adhd_demontis2019_ldsc_ready.txt
echo -e "SNP\tCHR\tPOS\tA1\tA2\tOR\tP\tN" > $outf
for chr in {1..22};do
    inf=adhd_pgc_eur_chr$chr.txt
    # MarkerName, CHR, POS, Allele1, Allele2, Effect, P-value, N (columns 17 + 18 summed)
    tail -n +2 $inf  | \
    awk 'BEGIN{OFS="\t"} {print $2,$1,$3,$4,$5,$9,$11,$17+$18}' \
        >> $outf
done  # run in the foreground so gzip below sees the complete file

gzip $outf
## upload to S3
aws s3 cp $outf.gz s3://rti-nd/ldsc_genetic_correlation/data/ADHD_Demontis2019/$outf.gz


####################################################################################################
## Anorexia

# remove extra headers
zcat pgcAN2.2019-07.vcf.tsv.gz  |  perl -lane ' 
    if ((!/##/) ) 
        { print $_}' > anorexia.txt

# split up results file so we can order it sequentially by chromosome
for chr in {1..22}; do
    metaF=anorexia.txt
    outF=anorexia_pgc_eur_chr${chr}.txt

    awk -v chrom=$chr 'NR==1 {print $0; next} NR>1 {
        if ($1==chrom)
            {print $0}} ' $metaF > $outF &
done
wait  # let every background awk job finish before the per-chromosome files are read


outf=anorexia_watson2019_ldsc_ready.txt
echo -e "SNP\tCHR\tPOS\tA1\tA2\tBETA\tP\tN" > $outf
for chr in {1..22};do
    inf=anorexia_pgc_eur_chr$chr.txt
    # MarkerName, CHR, POS, Allele1, Allele2, Effect, P-value, N (columns 12 + 13 summed)
    tail -n +2 $inf  | \
    awk 'BEGIN{OFS="\t"}  {print $3,$1,$2,$5,$4,$6,$8,$12+$13}'  >> $outf
done  # run in the foreground so gzip below sees the complete file

gzip $outf

## clean up directory
rm anorexia*t

## upload to S3
aws s3 sync . s3://rti-nd/ldsc_genetic_correlation/data/Anorexia_Watson2019/

####################################################################################################
## Autism


## split up results file so we can order it sequentially by chromosome
for chr in {1..22}; do
    metaF=iPSYCH-PGC_ASD_Nov2017.gz
    outF=autism_pgc_eur_chr${chr}.txt

    awk -v chrom=$chr 'NR==1 {print $0; next} NR>1 {
        if ($1==chrom)
            {print $0}} ' <(zcat $metaF) > $outF &
done
wait  # let every background awk job finish before the per-chromosome files are read

outf=autism_groves2019_n46351_ldsc_ready.txt
echo -e "SNP\tCHR\tPOS\tA1\tA2\tOR\tP" > $outf
for chr in {1..22};do
    inf=autism_pgc_eur_chr$chr.txt
    # MarkerName, CHR, POS, Allele1, Allele2, Effect, P-value
    tail -n +2 $inf  | \
    awk 'BEGIN{OFS="\t"} {print $2,$1,$3,$4,$5,$7,$9}' \
        >> $outf
done  # run in the foreground so gzip below sees the complete file

gzip $outf

## clean up directory
rm autism_pgc*t

## upload to S3
aws s3 sync . s3://rti-nd/ldsc_genetic_correlation/data/autism_groves2019/


####################################################################################################
## Bipolar


## split up results file so we can order it sequentially by chromosome
for chr in {1..22}; do
    metaF=daner_PGC_BIP32b_mds7a_0416a.gz
    outF=bipolar_pgc_eur_chr${chr}.txt

    awk -v chrom=$chr 'NR==1 {print $0; next} NR>1 {
        if ($1==chrom)
            {print $0}} ' <(zcat $metaF) > $outF &
done
wait  # let every background awk job finish before the per-chromosome files are read

outf=bipolar_stahl2019_ldsc_ready.txt
echo -e "SNP\tCHR\tPOS\tA1\tA2\tOR\tP\tN" > $outf
for chr in {1..22};do
    inf=bipolar_pgc_eur_chr$chr.txt
    # MarkerName, CHR, POS, Allele1, Allele2, Effect, P-value, N (columns 17 + 18 summed)
    tail -n +2 $inf  | \
    awk 'BEGIN{OFS="\t"} {print $2,$1,$3,$4,$5,$9,$11,$18+$17}' \
        >> $outf
done  # run in the foreground so gzip below sees the complete file

gzip $outf

## clean up directory
rm bipolar_pgc*t

## upload to S3
aws s3 sync . s3://rti-nd/ldsc_genetic_correlation/data/bipolar_stahl2019/


####################################################################################################
## MDD


## split up results file so we can order it sequentially by chromosome
## (split step not shown; the loop below expects mdd_pgc_eur_chr${chr}.txt to exist)


outf=mdd_howard2019_n807553_ldsc_ready.txt
echo -e "SNP\tCHR\tPOS\tA1\tA2\tLOG_ODD\tP" > $outf
for chr in {1..22};do
    inf=mdd_pgc_eur_chr$chr.txt
    # MarkerName, CHR, POS, Allele1, Allele2, Effect, P-value
    tail -n +2 $inf  | \
    awk 'BEGIN{OFS="\t"} {print $3,$1,$2,$4,$5,$7,$9}' \
        >> $outf
done  # run in the foreground so gzip below sees the complete file

gzip $outf

## clean up directory
rm mdd_pgc*t

## upload to S3
aws s3 sync . s3://rti-nd/ldsc_genetic_correlation/data/mdd_howard2019/


####################################################################################################
## PTSD


## split up results file so we can order it sequentially by chromosome
for chr in {1..22}; do
    metaF=pts_eur_freeze2_overall.results.gz
    outF=ptsd_pgc_eur_chr${chr}.txt

    awk -v chrom=$chr 'NR==1 {print $0; next} NR>1 {
        if ($1==chrom)
            {print $0}} ' <(zcat $metaF) > $outF &
done
wait  # let every background awk job finish before the per-chromosome files are read

outf=ptsd_nieverge2019_ldsc_ready.txt
echo -e "SNP\tCHR\tPOS\tA1\tA2\tOR\tP\tN" > $outf
for chr in {1..22};do
    inf=ptsd_pgc_eur_chr$chr.txt
    # MarkerName, CHR, POS, Allele1, Allele2, Effect, P-value, N (columns 17 + 18 summed)
    tail -n +2 $inf  | \
    awk 'BEGIN{OFS="\t"} {print $2,$1,$3,$4,$5,$9,$11,$17+$18}'  >> $outf
done  # run in the foreground so gzip below sees the complete file

gzip $outf

## clean up directory
rm ptsd_pgc*t

## upload to S3
aws s3 sync . s3://rti-nd/ldsc_genetic_correlation/data/ptsd_nieverge2019/


####################################################################################################
## YrsSchool_Lee2018


## split up results file so we can order it sequentially by chromosome
## (the chromosome is in column 2 of this file)
for chr in {1..22}; do
    metaF=GWAS_EA_excl23andMe.txt
    outF=yrschool_pgc_eur_chr${chr}.txt

    awk -v chrom=$chr 'NR==1 {print $0; next} NR>1 {
        if ($2==chrom)
            {print $0}} ' <(cat $metaF) > $outF &
done
wait  # let every background awk job finish before the per-chromosome files are read

outf=yrschool_lee2019_n766345_ldsc_ready.txt
echo -e "SNP\tCHR\tPOS\tA1\tA2\tBETA\tP" > $outf
for chr in {1..22};do
    inf=yrschool_pgc_eur_chr$chr.txt
    # MarkerName, CHR, POS, Allele1, Allele2, Effect, P-value
    tail -n +2 $inf  | \
    awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5,$7,$9}'  >> $outf
done  # run in the foreground so gzip below sees the complete file

gzip $outf

## clean up directory
rm yrschool_pgc*t

## upload to S3
aws s3 sync . s3://rti-nd/ldsc_genetic_correlation/data/yrschool_lee2018/
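
Before uploading, it may be worth confirming that each `*_ldsc_ready.txt` file is well-formed. The helper below is a hypothetical sketch (`check_ldsc_ready` is not part of the original notebook or of LDSC): it reads a tab-delimited table on stdin and counts rows whose field count differs from the header or whose `P` value falls outside [0, 1].

```shell
# Hypothetical helper (not from the original notebook): count rows that
# have a different number of tab-separated fields than the header, or a
# P value outside [0, 1]. Exits non-zero if any such row is found.
check_ldsc_ready() {
    awk -F'\t' '
        NR==1 { n = NF; for (i = 1; i <= NF; i++) if ($i == "P") pcol = i; next }
        NF != n || (pcol && ($pcol < 0 || $pcol > 1)) { bad++ }
        END {
            printf "%d problem row(s) in %d data row(s)\n", bad, NR - 1
            exit (bad > 0)
        }
    '
}

# Toy input with the same header used for the ADHD file above.
printf 'SNP\tCHR\tPOS\tA1\tA2\tOR\tP\tN\nrs1\t1\t100\tC\tT\t0.95\t0.20\t300\n' | check_ldsc_ready
```

On a real file this could be run as `zcat adhd_demontis2019_ldsc_ready.txt.gz | check_ldsc_ready`.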

## Run Analysis Workflow
`1b17491d-9e97-439c-b973-c162d196943a`

In [None]:
procD=/shared/jmarks/nicotine/ldsc/ftnd_all/20200326/

# enter compute node and use screen tool

# clone github repo
cd $procD
git clone https://github.com/RTIInternational/ld-regression-pipeline
    
# edit file-input json
cd ld-regression-pipeline
mkdir workflow_inputs
cp json_input/full_ld_regression_wf_template.json workflow_inputs
cd workflow_inputs

## vim edit file (see README.md at https://github.com/RTIInternational/ld-regression-pipeline)


# create final workflow input (a json file)
docker run -v $procD/ld-regression-pipeline/workflow_inputs:/data/ \
    rticode/generate_ld_regression_input_json:1ddbd682cb1e44dab6d11ee571add34bd1d06e21 \
    --json-input /data/full_ld_regression_wf_template.json \
    --pheno-file /data/ftnd_ldsc_phenotypes_local.xlsx >\
        $procD/ld-regression-pipeline/workflow_inputs/final_wf_inputs.json

## zip appropriate files 
# Change to directory immediately above ld-regression-pipeline
cd $procD/ld-regression-pipeline
cd ..
# Make zipped copy of repo somewhere
zip --exclude=*var/* --exclude=*.git/* -r \
    $procD/ld-regression-pipeline/workflow_inputs/ld-regression-pipeline.zip \
    ld-regression-pipeline

## download cromwell and the config file, if necessary
cd /shared/jmarks/bin/cromwell
#aws s3 cp s3://rti-cromwell-output/cromwell-config/cromwell_default_genomics_queue.conf .
#wget https://github.com/broadinstitute/cromwell/releases/download/49/cromwell-49.jar

## run ldsc workflow on AWS EC2 instance
java -Dconfig.file=/shared/jmarks/bin/cromwell/cromwell_default_genomics_queue.conf \
    -jar cromwell-49.jar \
    run $procD/ld-regression-pipeline/workflow/full_ld_regression_wf.wdl \
    -i $procD/ld-regression-pipeline/workflow_inputs/final_wf_inputs.json \
    -p $procD/ld-regression-pipeline/workflow_inputs/ld-regression-pipeline.zip



In [None]:
# sandbox

java -Dconfig.file=/shared/jmarks/bin/cromwell/cromwell_default_genomics_queue.conf \
    -jar cromwell-49.jar \
    run $procD/ld-regression-pipeline/workflow/full_ld_regression_wf.wdl \
    -i $procD/ld-regression-pipeline/workflow_inputs/final_wf_inputs.json \
    -p $procD/ld-regression-pipeline/workflow_inputs/ld-regression-pipeline.zip


curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@$procD/ld-regression-pipeline/workflow/full_ld_regression_wf.wdl" \
    -F "workflowInputs=@$procD/ld-regression-pipeline/workflow_inputs/final_wf_inputs.json" \
    -F "workflowDependencies=@$procD/ld-regression-pipeline/workflow_inputs/ld-regression-pipeline.zip"
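
When a workflow is submitted to a Cromwell server with a POST like the one above, the server responds with a workflow ID that can be used to query progress through the REST API's status endpoint. The sketch below only builds the status URL. The host and the all-zero workflow ID are placeholders, not values from this analysis.

```shell
# Sketch: build the status URL for a workflow submitted to a Cromwell server.
# cromwell_host and workflow_id are placeholders, not values from this analysis.
cromwell_host=http://localhost:8000
workflow_id=00000000-0000-0000-0000-000000000000

status_url="${cromwell_host}/api/workflows/v1/${workflow_id}/status"
echo "$status_url"

# Uncomment when a Cromwell server is listening at $cromwell_host:
# curl -s -X GET "$status_url" -H "accept: application/json"
```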

## Final Plot
We combine these updated results with the previous set of genetic correlation results.

We upload the plot table to the EC2 instance, then run the plotting tool in Docker to create the plot.

In [None]:
## enter interactive mode ##
# note that the image tag corresponds to the latest tag for this image

docker run -it -v "/shared/jmarks/projects/nicotine/ldsc/ftnd-all/006/:/data/" \
    rticode/plot_ld_regression_results:172bbfcc46b857dc95b4d0a080ccd092fd3a4bac /bin/bash


Rscript /opt/plot_ld_regression/plot_ld_regression_results.R  \
    --input_file 20200513_ftnd_ld_regression_results.csv \
    --output_file 20200513_ftnd_ld_regression_results.pdf  \
    --group_order_file ftnd_rg_plot_order.csv \
    --comma_delimited \
    --vertical_rg 1 