# NGC LDSC Regression
**Author**: Jesse Marks <br>
**GitHub Issue:** [#140](https://github.com/RTIInternational/bioinformatics/issues/140) <br>
**Results Location:** 

Compare FOU, OAall, and OAexp to the following:
* **Each other**
* **FTND**
* **Cigarettes per day (GSCAN)**
* **Smoking cessation (current vs. former, GSCAN)**
* **Smoking initiation (ever vs. never, GSCAN)**
* **Age of smoking initiation (GSCAN)**
* **Alcohol dependence**
* **Alcohol drinks per week (GSCAN)**
* *MVP alcohol*
* *AUDIT*
* **Cannabis use disorder**
* **Lifetime cannabis use (ever vs. never)**
* Parkinson's disease
* Amyotrophic lateral sclerosis
* Alzheimer's disease
* Intelligence
* Childhood IQ
* College completion
* Years of schooling
* Neuroticism
* Conscientiousness
* Openness to experience
* **Posttraumatic Stress Disorder**
* Attention deficit hyperactivity disorder
* Depressive symptoms
* Major depressive disorder
* Bipolar disorder
* Psychiatric cross-disorder
* Schizophrenia
* Autism spectrum disorder
* Anorexia Nervosa
* Subjective well being
* Putamen volume
* Accumbens volume
* Pallidum volume
* Caudate volume
* Thalamus volume
* Hippocampus volume
* Intracranial volume

\* Traits in bold are already in-house; traits in italics will be acquired soon.

**Note** that most of the in-house data are located on S3 at: `s3://rti-nd/LDSC`

We use the [LD score regression pipeline](https://github.com/RTIInternational/ld-regression-pipeline) developed by Alex Waldrop to perform LD score regression.

## Data
NGC summary stats results location:
* **FOU**: `s3://rti-midas-data/studies/ngc/meta/087/processing/fou/alive+cats+cogend+start+uhs1-4+vidus+yale-penn.ea.fou.chr[1-22].maf_gt_0.01.rsq_gt_0.3.gz`<br><br>
* **OAall**: `s3://rti-midas-data/studies/ngc/meta/089/processing/oaall/cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chr[1-22].maf_gt_0.01.rsq_gt_0.3.gz`<br><br>
* **OAexp**: `s3://rti-midas-data/studies/ngc/meta/060/processing/oaexp/coga+decode+yale_penn.ea.chr[1-22].maf_gt_0.01.rsq_gt_0.3.gz`

<br>

**Sample sizes**
* OAall: 304507
* OAexp: 5561
* FOU: 5388

### Data wrangling
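Format the summary stats for input into Cromwell: download the per-chromosome files, keep columns 1-6 and 8, concatenate them into one file per phenotype, and upload the gzipped result to S3.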

In [None]:
## FOU
cd /shared/jmarks/heroin/ldsc/ngc_all/fou/001/processing
for chr in {1..22};do 
    aws s3 cp s3://rti-midas-data/studies/ngc/meta/087/processing/fou/alive+cats+cogend+start+uhs1-4+vidus+yale-penn.ea.fou.chr$chr.maf_gt_0.01.rsq_gt_0.3.gz . --quiet &
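done
wait  # block until all background downloads have finished

outf=fou_087.txt
for chr in {1..22};do
    inf=alive+cats+cogend+start+uhs1-4+vidus+yale-penn.ea.fou.chr$chr.maf_gt_0.01.rsq_gt_0.3.gz
    awk '{print $1,$2,$3,$4,$5,$6,$8}' <(zcat $inf) >> $outf
done  # run in the foreground so gzip below sees the complete file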

gzip $outf
## upload to S3
aws s3 cp $outf.gz s3://rti-nd/LDSC/opioid_fou/$outf.gz
    
    

## OAexp
for chr in {1..22}; do
    aws s3 cp s3://rti-midas-data/studies/ngc/meta/060/processing/oaexp/coga+decode+yale_penn.ea.chr$chr.maf_gt_0.01.rsq_gt_0.3.gz . --quiet &
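done
wait  # block until all background downloads have finished

outf=oaexp_060.txt
for chr in {1..22};do
    inf=coga+decode+yale_penn.ea.chr$chr.maf_gt_0.01.rsq_gt_0.3.gz
    awk '{print $1,$2,$3,$4,$5,$6,$8}' <(zcat $inf) >> $outf
done  # run in the foreground so gzip below sees the complete file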

gzip $outf
## upload to S3
aws s3 cp $outf.gz s3://rti-nd/LDSC/opioid_oaexp/$outf.gz


## OAall
for chr in {1..22}; do
    aws s3 cp s3://rti-midas-data/studies/ngc/meta/089/processing/oaall/cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chr$chr.maf_gt_0.01.rsq_gt_0.3.gz . --quiet &
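done
wait  # block until all background downloads have finished

outf=oaall_089.txt
for chr in {1..22};do
    inf=cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chr$chr.maf_gt_0.01.rsq_gt_0.3.gz
    awk '{print $1,$2,$3,$4,$5,$6,$8}' <(zcat $inf) >> $outf
done  # run in the foreground so gzip below sees the complete file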

gzip $outf
## upload to S3
aws s3 cp $outf.gz s3://rti-nd/LDSC/opioid_oaall/$outf.gz


## OAall (no deCODE)
for chr in {1..22}; do
    aws s3 cp  s3://rti-midas-data/studies/ngc/meta/091/processing/oaall/cats+coga+kreek+odb+uhs+vidus+yale-penn.ea.chr$chr.maf_gt_0.01.rsq_gt_0.3.gz . --quiet &
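done
wait  # block until all background downloads have finished

outf=oaall_091.txt
for chr in {1..22};do
    inf=cats+coga+kreek+odb+uhs+vidus+yale-penn.ea.chr$chr.maf_gt_0.01.rsq_gt_0.3.gz
    awk '{print $1,$2,$3,$4,$5,$6,$8}' <(zcat $inf) >> $outf
done  # run in the foreground so gzip below sees the complete file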

gzip $outf
## upload to S3
aws s3 cp $outf.gz s3://rti-nd/LDSC/opioid_oaall/$outf.gz
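
The four blocks above repeat the same concatenation step. As a minimal sketch, and assuming each per-chromosome file begins with the same header line (not confirmed here), a helper could keep that header only once. `concat_sumstats` and its arguments are hypothetical names, not part of the pipeline.

```shell
# Hypothetical helper: concatenate per-chromosome summary stats into one file.
# Assumption: every input file starts with a header line, which is kept only
# from chromosome 1. Columns 1-6 and 8 are retained, as in the loops above.
concat_sumstats() {
    local prefix=$1   # path up to and including "chr"
    local suffix=$2   # everything after the chromosome number
    local out=$3      # output file (overwritten)
    local chr inf
    : > "$out"
    for chr in $(seq 1 22); do
        inf="${prefix}${chr}${suffix}"
        [ -f "$inf" ] || { echo "missing: $inf" >&2; return 1; }
        gzip -dc "$inf" |
            awk -v chr="$chr" 'NR > 1 || chr == 1 {print $1,$2,$3,$4,$5,$6,$8}' >> "$out"
    done
}
```

For example, `concat_sumstats coga+decode+yale_penn.ea.chr .maf_gt_0.01.rsq_gt_0.3.gz oaexp_060.txt` would rebuild the OAexp file. The function stops with an error if any chromosome file is missing, which also guards against reading before the downloads are complete.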


# FOU
`9ec11018-cab4-4da3-8fa4-e2a4b3b335e7`

In [None]:
cd /home/jmarks/Projects/heroin/ldsc/ngc_all/fou/001

## Create Workflow Inputs
Here is an example entry in the Excel Phenotype File:

```
trait	plot_label	sumstats_path	pmid	category	sample_size	id_col	chr_col	pos_col	effect_allele_col	ref_allele_col	effect_col	pvalue_col	sample_size_col	effect_type	w_ld_chr
COPDGWAS Hobbs et al.	COPD	s3://rti-nd/LDSC/COPDGWAS_HobbsEtAl/modGcNoOtherMinMissSorted.withchrpos.txt.gz	28166215	Respiratory	51772	3	1	2	4	5	10	12		beta	s3://clustername--files/eur_w_ld_chr.tar.bz2
```
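
The `*_col` fields in the example appear to be 1-based column positions in the summary statistics file. A minimal sketch for looking them up, assuming a whitespace-delimited, gzipped file with a header line; `show_columns` is a hypothetical helper, not part of the pipeline:

```shell
# Hypothetical helper: print each header name next to its 1-based column index,
# so the *_col values in the phenotype file can be checked against the data.
show_columns() {
    gzip -dc "$1" | awk 'NR == 1 {for (i = 1; i <= NF; i++) print i, $i; exit}'
}
```

Running `show_columns fou_087.txt.gz` would list one `index name` pair per line.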


In [None]:
## 1. upload Excel phenotype file to EC2 instance
## 2. then edit full_ld_regression_wf_template.json to include the reference data of choice
## 3. lastly use dockerized tool to finish filling out the json file that will be input for workflow

## login to a larger compute node
qrsh

phenD=20191209_heroin_ldsc_phenotypes_local.xlsx
procD=/shared/jmarks/heroin/ldsc/ngc_all/fou/001
mkdir -p $procD/{ldhub,plot} # for later processing
git clone https://github.com/RTIInternational/ld-regression-pipeline/ $procD/ld-regression-pipeline
mkdir $procD/ld-regression-pipeline/workflow_inputs
## upload files to */workflow_inputs/

# create the final workflow input (a json file):
# copy the template into workflow_inputs/, then edit the copy to point at the reference data of choice
cp $procD/ld-regression-pipeline/json_input/full_ld_regression_wf_template.json \
    $procD/ld-regression-pipeline/workflow_inputs

docker run -v $procD/ld-regression-pipeline/workflow_inputs/:/data/ \
    rticode/generate_ld_regression_input_json:1ddbd682cb1e44dab6d11ee571add34bd1d06e21 \
    --json-input /data/full_ld_regression_wf_template.json \
    --pheno-file /data/$phenD >\
        $procD/ld-regression-pipeline/workflow_inputs/final_wf_inputs.json

## Run Analysis Workflow

In [None]:
## zip appropriate files 
# Change to directory immediately above ld-regression-pipeline repo
cd $procD/ld-regression-pipeline
cd ..
# Make zipped copy of repo somewhere
zip --exclude=*var/* --exclude=*.git/* -r \
    $procD/ld-regression-pipeline/workflow_inputs/ld-regression-pipeline.zip \
    ld-regression-pipeline

## copy cromwell config file from S3 to EC2 instance
cd /shared/jmarks/bin/cromwell
#aws s3 cp s3://rti-cromwell-output/cromwell-config/cromwell_default_genomics_queue.conf .

## Run workflow (from the cromwell directory)
java -Dconfig.file=/shared/jmarks/bin/cromwell/cromwell_default_genomics_queue.conf \
    -jar cromwell-44.jar \
    run $procD/ld-regression-pipeline/workflow/full_ld_regression_wf.wdl \
    -i $procD/ld-regression-pipeline/workflow_inputs/final_wf_inputs.json \
    -p $procD/ld-regression-pipeline/workflow_inputs/ld-regression-pipeline.zip


Record the workflow log-ID, then retrieve the results from S3 at `s3://rti-cromwell-output/cromwell-execution/full_ld_regression_wf/<log-ID>/`. <br>
The log-ID can be found in, for example, the directory `/shared/jmarks/bin/cromwell/cromwell-workflow-logs/`.
<br>
<br>
<br>
<br>
<br>
<br>
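
Assuming Cromwell names its per-workflow logs `workflow.<log-ID>.log` (an assumption about the local setup, not stated above), the most recent log-ID could be pulled from that directory with a small hypothetical helper:

```shell
# Hypothetical helper: print the log-ID of the newest workflow log in a directory.
# Assumption: log files are named workflow.<36-character UUID>.log
latest_workflow_id() {
    # ls -t sorts newest first; sed keeps only the UUID part of the file name
    ls -t "$1"/workflow.*.log 2>/dev/null | head -n 1 |
        sed -n 's/.*workflow\.\([0-9a-f-]\{36\}\)\.log$/\1/p'
}
```

For example, `latest_workflow_id /shared/jmarks/bin/cromwell/cromwell-workflow-logs` prints nothing if no matching log exists.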

## LD Hub
```
Important notes for your uploaded file:

1. To save the uploading time, LD Hub only accepts zipped files as input (e.g. mydata.zip).

2. Please check that there is ONLY ONE plain TXT file (e.g. mydata.txt) in your zipped file.

3. Please make sure you do NOT zip any folder together with the plain txt file (e.g. /myfolder/mydata.txt), otherwise you will get an error: [Errno 2] No such file or directory

4. Please do NOT zip multiple files (e.g. zip mydata.zip file1.txt file2.txt ..) or zip a file with in a folder (e.g. zip mydata.zip /path/to/my/file/mydata.txt).

5. Please keep the file name of your plain txt file short (less than 50 characters), otherwise you may get an error: [Errno 2] No such file or directory

6. Please zip your plain txt file using following command (ONE file at a time):

For Windows system: 1) Locate the file that you want to compress. 2) Right-click the file, point to Send to, and then click Compressed (zipped) folder.

For Linux and Mac OS system: zip mydata.zip mydata.txt

Reminder: for Mac OS system, please do NOT zip you file by right click mouse and click "Compress" to zip your file, this will automatically create a folder called "__MACOS". You will get an error: [Errno 2] No such file or directory.

Upload the trait of interest
To save your upload time, we highly recommend you to use the SNP list we used in LD Hub to reduce the number of SNPs in your uploaded file. Click here to download our SNP list (w_hm3.noMHC.snplist.zip).

Please upload the zipped file you just created. Click here to download an input example.
```
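
The naming and zipping notes above can be checked before uploading. This is a minimal sketch under the rules as quoted (one plain `.txt` file, name under 50 characters, no folder in the archive); `ldhub_name_ok` and `ldhub_zip` are hypothetical helpers, and `zip -j` is used to drop the directory path.

```shell
# Hypothetical helper: succeed only if the file name fits the notes quoted above.
ldhub_name_ok() {
    name=$(basename "$1")
    case $name in
        *.txt) ;;            # must be a plain .txt file
        *) return 1 ;;
    esac
    [ "${#name}" -lt 50 ]    # file name shorter than 50 characters
}

# Hypothetical helper: zip exactly one file, storing no folder in the archive.
ldhub_zip() {
    ldhub_name_ok "$1" || { echo "unsuitable file name: $1" >&2; return 1; }
    # -j ("junk paths") stores only the file name, not its directory
    zip -j "${1%.txt}.zip" "$1"
}
```

Calling `ldhub_zip /path/to/mydata.txt` would write `/path/to/mydata.zip` containing only `mydata.txt`.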

In [None]:
cd $procD/ldhub
outF=hiv016_ldhub_with_pvalues.txt # name of file to create for ldhub
samp_size=4664

### Download outputs for each ref chr from rfmt_sumstats step ###
aws s3 sync s3://rti-cromwell-output/cromwell-execution/full_ld_regression_wf/71182bae-08fb-4733-825e-85f1fdec8f81/call-munge_ref/MUNGE_REF_WF.munge_sumstats_wf/5b22c4ab-015b-4c87-b43d-4b8866bc6e54/call-munge_chr_wf/ .
        
mv  */MUNGE_CHR.munge_sumstats_chr_wf/*/call-rfmt_sumstats/*.standardized.phase3ID.munge_ready.txt .
rm -rf shard*

## Concat into single file ##
cat *.chr1.*.standardized.phase3ID.munge_ready.txt > $outF
for chr in {2..22}; do
    tail -n +2  *.chr$chr.*.standardized.phase3ID.munge_ready.txt >> $outF
done

## Remove unnecessary columns (need snpID, A1, A2, Beta, Pvalue) in that order ##
head -1 $outF | cut -f1,4,5,6,7 > tmp
tail -n +2 $outF | awk 'BEGIN{OFS="\t"}{print $1, $4, $5, $6, $7}'  >> tmp && mv tmp $outF

## Add sample size column (N = $samp_size, set above) and change header names ##
cat $outF | awk -v var=$samp_size -F "\t"  \
    'BEGIN{OFS="\t";} NR==1{print "snpid", "A1", "A2", "BETA", "N", "P-value"} \
    NR>1{print $1,$2,$3,$4,var, $5}' > tmp && mv tmp $outF
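
Before zipping, the resulting table can be sanity-checked. A minimal sketch, assuming the tab-delimited six-column layout produced by the commands above; `check_ldhub_table` is a hypothetical helper:

```shell
# Hypothetical helper: verify the header and that every line has six
# tab-separated fields. Exits non-zero and reports the problem otherwise.
check_ldhub_table() {
    awk -F '\t' '
        NR == 1 && $0 != "snpid\tA1\tA2\tBETA\tN\tP-value" { print "unexpected header: " $0; bad = 1 }
        NF != 6 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

Running `check_ldhub_table $outF` prints nothing and returns 0 when the file looks as expected.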

In [None]:
cd /home/jmarks/Projects/HIV/ldsc/meta016_copd_pft/v02/processing/input/ldhub
scp -i ~/.ssh/gwas_rsa   ec2-user@54.84.72.140:/shared/jmarks/proj/hiv/ldsc/meta016/v02/ldhub/hiv016_ldhub_with_pvalues.txt .
    
# zip file with 7zip; the resulting archive is:
# hiv016_ldhub_with_pvalues.txt.zip

### Upload input file
Follow the steps above to zip and upload the input file. In short:
* Download the file created in the cell above to your local machine.
* Zip this file (and only this file).
* Log in to [LDHub](http://ldsc.broadinstitute.org/ldhub/) by clicking `Get Started with LDHub` and signing in with your Google account.
* Click `Go Test Center`.
* Click `Continue`.
* Upload the zipped file by clicking `Choose File`, naming your trait, and clicking `Continue`.
* Select the traits of interest from LDHub by checking the box next to each one, then click `Submit your request`.

**Note**: keep the browser open during the LDSC analysis on LDHub.

**title:** `hiv`

## Create Final Plot
The merged CSV file should have the header:
```
trait2	Trait_Label	Trait_Group	rg	se	z	p	h2_obs	h2_obs_se	h2_int	h2_int_se	gcov_int	gcov_int_se
```

**Note**: upload the plot table to EC2 instance to run docker and create the plot.

In [None]:
## enter interactive mode ##
# note that the image tag corresponds to the latest tag for this image
docker run -it -v"/shared/jmarks/proj/hiv/ldsc/meta016/v02/plot:/data/" \
    rticode/plot_ld_regression_results:7bbd11a1d0c664bcb8bede8c398772b13abe15b3  /bin/bash
    
Rscript /opt/plot_ld_regression/plot_ld_regression_results.R  \
    --input_file 20190925_hivacq_ldsc_meta016_copd_pft_rg_results.csv \
    --output_file 20190925_hivacq_ldsc_meta016_copd_pft_rg_results.pdf  \
    --comma_delimited \
    --title "HIV Acquisition Meta016"