# LD Score Regression: Meta022 + COPD + 3 LDHub Lung Function + 2 Pulmonary Lung Function
**Author**: Jesse Marks <br>
**GitHub Issue:** [#126](https://github.com/RTIInternational/bioinformatics/issues/126)


This Jupyter Notebook documents the steps taken to perform LD Score Regression (LDSC), a tool for estimating heritability and genetic correlation, on our European-specific meta-analysis results. In particular, we run LDSC on our in-house meta-analysis (labeled 022) against the COPD GWAS results, two sets of GWAS results for pulmonary lung function, and three GWAS results for lung function on LDHub.

## Data 

**In-house Meta022** <br>
The 022 meta-analysis for HIV acquisition has `n=18,245` and includes:
* McLaren EA (n=13,581)
* UHS1-4 EA (n=3013)
* WIHS1 EA (n=720)
* VIDUS EA (n=931)

**Two Pulmonary Lung Function GWAS**: <br>
* What is the `N`? 
* What is the chromosome?
```
@jaamarks GWAS results for PFT decline (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0100776 ) are available here for comparison to HIV acquisition, once you have the new in-house+McLaren GWAS meta:
/rti-shares/gxg/R21_GxNutrients/PriorGWASresults/MAres_FEV1_Longitudinal_All_291120121.tbl
/rti-shares/gxg/R21_GxNutrients/PriorGWASresults/MAres_FEV1_Longitudinal_sub_291120121.tbl
```

We will use the [LD score regression pipeline](https://github.com/RTIInternational/ld-regression-pipeline) that Alex Waldrop developed to perform LD score regression. Specifically, we will compare the two studies to determine whether VIDUS has a significant effect on heritability. We will run the LDSC pipeline on Docker, which lets us assess the SNP-based heritability of HIV acquisition and check whether VIDUS is systematically deflating the heritability (which we would expect if VIDUS is systematically flipping SNP association directions, given our prior look-ups of specific regions).

<br><br>

### Workflow guideline
 1. Create Excel phenotype file locally then upload to EC2 instance
 2. Clone https://github.com/RTIInternational/ld-regression-pipeline
 3. Then edit full_ld_regression_wf_template.json to include the reference data of choice
 4. Lastly use dockerized tool to finish filling out the json file that will be input for workflow

### workflow ID number:
`b8ab345f-b45d-4df2-a4fe-d087f464f684`

## Create Workflow Inputs
Here is an example entry in the Excel Phenotype File:

```
trait	plot_label	sumstats_path	pmid	category	sample_size	id_col	chr_col	pos_col	effect_allele_col	ref_allele_col	effect_col	pvalue_col	sample_size_col	effect_type	w_ld_chr
COPDGWAS Hobbs et al.	COPD	s3://rti-nd/LDSC/COPDGWAS_HobbsEtAl/modGcNoOtherMinMissSorted.withchrpos.txt.gz	28166215	Respiratory	51772	3	1	2	4	5	10	12		beta	s3://clustername--files/eur_w_ld_chr.tar.bz2
```

In [None]:
cd /shared/jmarks/projects/hiv/ldsc/20190806/003
aws s3 sync s3://rti-hiv/meta_new/019/results/stats/exclude_singletons . --quiet
    
# create merged summary stats for meta 019 (extract only necessary columns)
zcat hiv_acquisition_1df_meta_analysis_uhs1-4_ea+wihs1_ea.chr1.exclude_singletons.1df.gz |\
cut -d " " -f1,2,3,4,5,6,8 > hiv_acq_019.txt 

for chr in {2..22}; do
    zcat hiv_acquisition_1df_meta_analysis_uhs1-4_ea+wihs1_ea.chr$chr.exclude_singletons.1df.gz |\
        tail -n +2 | cut -d " " -f1,2,3,4,5,6,8 >> hiv_acq_019.txt 
done &
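
The merge above keeps the header from chromosome 1 and strips it from the rest, so the combined file should contain exactly one header row. A minimal sanity check, assuming the header text does not also occur as a data row, is sketched below. `count_header_copies` is a hypothetical helper name, and the demo rows are made up for illustration.

```shell
# Count how many lines of a file are identical to its first (header) line.
# A correctly merged file should report 1.
count_header_copies() {
    awk 'NR == 1 { header = $0 } $0 == header { n++ } END { print n + 0 }' "$1"
}

# Toy demonstration on a throwaway file (made-up rows, not real results)
demo=$(mktemp)
printf 'snp chr pos\nrs1 1 100\nsnp chr pos\nrs2 2 200\n' > "$demo"
count_header_copies "$demo"   # prints 2: the header was appended a second time
rm -f "$demo"
```

Because the loop is sent to the background with `&`, run `wait` before checking the real file, for example `wait; count_header_copies hiv_acq_019.txt`.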

In [None]:
### EC2 ###
# enter compute node and use screen tool
qrsh -l h=ip-172-31-29-161
screen 

# clone github repo
cd /shared/jmarks/projects/nicotine/ldsc/001
git clone https://github.com/RTIInternational/ld-regression-pipeline
    
# edit file-input json
cd ld-regression-pipeline
mkdir workflow_inputs
cp json_input/full_ld_regression_wf_template.json workflow_inputs
cd workflow_inputs

## vim edit file (see README.md at https://github.com/RTIInternational/ld-regression-pipeline)

In [None]:
### local ###

## edit phenotype file and upload to EC2 instance
scp -i ~/.ssh/gwas_rsa *xlsx ec2-user@3.221.213.211:/shared/jmarks/projects/nicotine/ldsc/001/ld-regression-pipeline/workflow_inputs

In [None]:
### EC2 ###

# create final workflow input (a json file)
docker run -v /shared/jmarks/projects/nicotine/ldsc/001/ld-regression-pipeline/workflow_inputs:/data/ \
    rticode/generate_ld_regression_input_json:1ddbd682cb1e44dab6d11ee571add34bd1d06e21 \
    --json-input /data/full_ld_regression_wf_template.json \
    --pheno-file /data/hiv_acquisition_ldsc_phenotypes_local.xlsx >\
        /shared/jmarks/projects/nicotine/ldsc/001/ld-regression-pipeline/workflow_inputs/final_wf_inputs.json

## Run Analysis Workflow

In [None]:
## copy cromwell config file from S3 to EC2 instance
cd /shared/jmarks/bin/cromwell
    
## zip appropriate files 
# Change to directory immediately above ld-regression-pipeline
cd /shared/jmarks/projects/hiv/ldsc/20190806/003/ld-regression-pipeline
cd ..
# Make zipped copy of repo somewhere
zip --exclude='*var/*' --exclude='*.git/*' -r \
    /shared/jmarks/projects/hiv/ldsc/20190806/003/ld-regression-pipeline/workflow_inputs/ld-regression-pipeline.zip \
    ld-regression-pipeline

## download cromwell and the config file, if necessary
cd /shared/jmarks/bin/cromwell
#aws s3 cp s3://rti-cromwell-output/cromwell-config/cromwell_default_genomics_queue.conf .
#wget https://github.com/broadinstitute/cromwell/releases/download/44/cromwell-44.jar

## run ldsc workflow on AWS EC2 instance
java -Dconfig.file=/shared/jmarks/bin/cromwell/cromwell_default_genomics_queue.conf \
    -jar cromwell-44.jar \
    run /shared/jmarks/projects/hiv/ldsc/20190806/003/ld-regression-pipeline/workflow/full_ld_regression_wf.wdl \
    -i /shared/jmarks/projects/hiv/ldsc/20190806/003/ld-regression-pipeline/workflow_inputs/final_wf_inputs.json \
    -p /shared/jmarks/projects/hiv/ldsc/20190806/003/ld-regression-pipeline/workflow_inputs/ld-regression-pipeline.zip

### workflow ID
Get this ID from the name of the Cromwell workflow log file, for example: <br>
`/shared/jmarks/bin/cromwell/cromwell-workflow-logs/workflow.cffd947c-9345-41f1-a146-3e3454404fa3.log`

`b8ab345f-b45d-4df2-a4fe-d087f464f684`
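
Since the log file is named `workflow.<ID>.log`, the ID can be pulled out of the path with plain parameter expansion. This is a small sketch rather than part of the pipeline, and `workflow_id_from_log` is a hypothetical helper name.

```shell
# Hypothetical helper: extract the workflow ID from a Cromwell log path
# of the form .../workflow.<ID>.log
workflow_id_from_log() {
    local name
    name=$(basename "$1")          # workflow.<ID>.log
    name=${name#workflow.}         # strip the leading "workflow."
    printf '%s\n' "${name%.log}"   # strip the trailing ".log"
}

workflow_id_from_log /shared/jmarks/bin/cromwell/cromwell-workflow-logs/workflow.cffd947c-9345-41f1-a146-3e3454404fa3.log
# prints cffd947c-9345-41f1-a146-3e3454404fa3
```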

### Copy Workflow Results to Local


In [None]:
scp -i ~/.ssh/gwas_rsa   ec2-user@3.221.213.211:/shared/jmarks/projects/hiv/ldsc/20190806/003/ld-regression-pipeline/workflow_inputs/* .

## LD Hub
```
Important notes for your uploaded file:

1. To save the uploading time, LD Hub only accepts zipped files as input (e.g. mydata.zip).

2. Please check that there is ONLY ONE plain TXT file (e.g. mydata.txt) in your zipped file.

3. Please make sure you do NOT zip any folder together with the plain txt file (e.g. /myfolder/mydata.txt), otherwise you will get an error: [Errno 2] No such file or directory

4. Please do NOT zip multiple files (e.g. zip mydata.zip file1.txt file2.txt ..) or zip a file with in a folder (e.g. zip mydata.zip /path/to/my/file/mydata.txt).

5. Please keep the file name of your plain txt file short (less than 50 characters), otherwise you may get an error: [Errno 2] No such file or directory

6. Please zip your plain txt file using following command (ONE file at a time):

For Windows system: 1) Locate the file that you want to compress. 2) Right-click the file, point to Send to, and then click Compressed (zipped) folder.

For Linux and Mac OS system: zip mydata.zip mydata.txt

Reminder: for Mac OS system, please do NOT zip you file by right click mouse and click "Compress" to zip your file, this will automatically create a folder called "__MACOS". You will get an error: [Errno 2] No such file or directory.

Upload the trait of interest
To save your upload time, we highly recommend you to use the SNP list we used in LD Hub to reduce the number of SNPs in your uploaded file. Click here to download our SNP list (w_hm3.noMHC.snplist.zip).

Please upload the zipped file you just created. Click here to download an input example.
```
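
The upload rules quoted above are easy to trip over, so a small pre-flight check may help. The sketch below assumes the 50-character limit applies to the whole file name including the `.txt` extension (a conservative reading), and both function names are hypothetical. `ldhub_zip` follows the quoted `zip mydata.zip mydata.txt` form, run from inside the file's own directory so that no folder ends up in the archive.

```shell
# Hypothetical pre-flight check for the file-name rules quoted above:
# a plain .txt file whose name is shorter than 50 characters.
ldhub_name_ok() {
    local base
    base=$(basename "$1")
    case "$base" in
        *.txt) ;;
        *) return 1 ;;
    esac
    [ "${#base}" -lt 50 ]
}

# Zip a single file from inside its own directory (requires the zip tool).
ldhub_zip() {
    local base
    base=$(basename "$1")
    ldhub_name_ok "$1" || { echo "rename $base before uploading" >&2; return 1; }
    ( cd "$(dirname "$1")" && zip "${base%.txt}.zip" "$base" )
}

ldhub_name_ok hiv016_ld_hub.txt && echo "name ok"
```

Calling `ldhub_zip /path/to/hiv016_ld_hub.txt` would then write `hiv016_ld_hub.zip` next to the input file.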

In [None]:
## Download outputs for each ref chr from rfmt_sumstats step
#cd /shared/jmarks/hiv/ldsc/ldhub
#aws s3 sync s3://rti-cromwell-output/cromwell-execution/full_ld_regression_wf/ed5747ed-ccbe-4bc9-bb44-1f2d750a27eb/call-munge_ref/MUNGE_REF_WF.munge_sumstats_wf/e6c9491a-ca22-4ca0-8ad6-79d2b13a6dbe/call-munge_chr_wf/ .
#    
#mv  */MUNGE_CHR.munge_sumstats_chr_wf/*/call-rfmt_sumstats/hiv_acquisition_1df_meta_analysis_uhs1-4_ea+vidus_ea+wihs1_ea.chr*.exclude_singletons.1df.standardized.phase3ID.munge_ready.txt .
#
## Concat into single file
#cat hiv_acquisition_1df_meta_analysis_uhs1-4_ea+vidus_ea+wihs1_ea.chr1.exclude_singletons.1df.standardized.phase3ID.munge_ready.txt >\
#    hiv016_ld_hub_with_pvalues.txt
#for chr in {2..22}
#do
#    tail -n +2  hiv_acquisition_1df_meta_analysis_uhs1-4_ea+vidus_ea+wihs1_ea.chr$chr.exclude_singletons.1df.standardized.phase3ID.munge_ready.txt >>\
#        hiv016_ld_hub_with_pvalues.txt
#done
#
#
## Remove unnecessary columns (need snpID, A1, A2, Beta, Pvalue)
#cat hiv016_ld_hub_with_pvalues.txt | cut -f 1,4,5,6,7 > tmp && mv tmp hiv016_ld_hub_with_pvalues.txt
#
## Add sample size column (N = 4664, i.e. UHS1-4 + VIDUS + WIHS1 EA)
#cat hiv016_ld_hub_with_pvalues.txt | awk -v OFS="\t" -F"\t" '{print $1,$2,$3,$4,"4664.000",$5}' > hiv016_ld_hub.txt
#
## Use vi to change column names to be:
#snpid A1 A2 BETA N P-value
#
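
The sample-size and header steps above can also be done in one pass, which avoids the manual `vi` edit. The sketch below assumes tab-delimited input with one header row and the five columns kept by the `cut` step (snpID, A1, A2, Beta, Pvalue). `add_n_column` is a hypothetical helper name, and the demo row is made up.

```shell
# Hypothetical helper: insert a constant N column before the p-value
# and write the LD Hub column names directly.
add_n_column() {
    awk -v n="$1" 'BEGIN { FS = OFS = "\t" }
        NR == 1 { print "snpid", "A1", "A2", "BETA", "N", "P-value"; next }
        { print $1, $2, $3, $4, n, $5 }'
}

# Toy input: one header row and one made-up SNP
printf 'ID\tA1\tA2\tBeta\tP\nrs1\tA\tG\t0.01\t0.5\n' | add_n_column 4664
```

On the real file this would be run as `add_n_column 4664 < hiv016_ld_hub_with_pvalues.txt > hiv016_ld_hub.txt`.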

In [None]:
# enter interactive mode
docker run -it -v"/shared/jmarks/hiv/ldsc/final/:/data/" \
    rticode/plot_ld_regression_results:1ddbd682cb1e44dab6d11ee571add34bd1d06e21 /bin/bash
    
Rscript /opt/plot_ld_regression/plot_ld_regression_results.R  \
    --input_file 20190806_hiv_aqcuisition_ldsc_016_vs_019_results_table.csv \
    --output_file 20190806_hiv_aqcuisition_meta016_ldsc_copd_lung_function_results_plot.pdf  \
    --comma_delimited
    #--group_order_file 20170729_hiv_aqcuisition_meta016_ldsc_copd_lung_function_plot_order.txt

In [None]:
#Rscript /opt/plot_ld_regression/plot_ld_regression_results.R \
#    --input_file ftnd_revised_plot_table_7-29-19.csv \
#    --output_file ftnd_ld_regression_results_7-29-19.pdf \
#    --comma_delimited

```
chr     pos     MarkerName      Allele1 Allele2 Freq1   FreqSE  MinFreq MaxFreq Effect  StdErr  P-value Direction       HetISq    HetChiSq        HetDf   HetPVal
```

## Create local phenotype file

In [None]:
aws s3 sync s3://rti-hiv/meta_new/016/results/stats/exclude_singletons .

# chr1: keep the header row (cut emits fields in file order, whatever order -f lists them)
file=hiv_acquisition_1df_meta_analysis_uhs1-4_ea+vidus_ea+wihs1_ea.chr1.exclude_singletons.1df.gz
zcat "$file" | cut -d " " -f1,2,3,4,5,6,8 > hiv_acq_016.txt

# chr2-22: drop each file's header row before appending
for file in hiv_acquisition_1df_meta_analysis_uhs1-4_ea+vidus_ea+wihs1_ea.chr{2..22}.exclude_singletons.1df.gz; do
    zcat "$file" | tail -n +2 | cut -d " " -f1,2,3,4,5,6,8 >> hiv_acq_016.txt
done &

# Sandbox

In [None]:
Workflow inputs does not exist: /shared/jmarks/studies/hiv/ldsc/016_vs_019/ld-regression-pipeline/workflow_inputs/final_wf_inputs.json
# 

In [None]:
/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name test01 \
    --script_prefix mytest.20190806 \
    --mem 3 \
    --nslots 1 \
    --priority 0 \
    --program sleep 1 

HIV_Acquisition_by_hiv019.ldsc_regression.log
```
Heritability of phenotype 1
---------------------------
Total Observed scale h2: 0.0948 (0.0958)
Lambda GC: 1.0315
Mean Chi^2: 1.0304
Intercept: 1.0217 (0.0063)
Ratio: 0.7154 (0.2082)

Heritability of phenotype 2/2
-----------------------------
Total Observed scale h2: 0.1195 (0.1227)
Lambda GC: 1.0315
Mean Chi^2: 1.0356
Intercept: 1.0269 (0.0069)
Ratio: 0.7567 (0.1936)

```

<br><br><br>
HIV_Acquisition_by_copd.ldsc_regression.log

```
Heritability of phenotype 1
---------------------------
Total Observed scale h2: 0.0801 (0.0935)
Lambda GC: 1.0315
Mean Chi^2: 1.0314
Intercept: 1.024 (0.0065)
Ratio: 0.7625 (0.2055)

Heritability of phenotype 2/2
-----------------------------
Total Observed scale h2: 0.1023 (0.0129)
Lambda GC: 1.0618
Mean Chi^2: 1.0928
Intercept: 0.9869 (0.0075)
Ratio < 0 (usually indicates GC correction).

```
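
The `Ratio` line in these logs follows from two other lines: LDSC reports ratio = (intercept - 1) / (mean chi^2 - 1) when the mean chi^2 is above 1. The sketch below recomputes it from the rounded values printed in the logs, so small differences from the logged ratio are expected. It also divides h2 by its standard error as a rough gauge of whether the estimate is distinguishable from zero. Both function names are hypothetical.

```shell
# ratio = (intercept - 1) / (mean chi^2 - 1); undefined when mean chi^2 <= 1
ldsc_ratio() {
    awk -v i="$1" -v m="$2" 'BEGIN {
        if (m <= 1) { print "NA"; exit }
        printf "%.4f\n", (i - 1) / (m - 1)
    }'
}

# z = h2 / SE(h2)
h2_z() {
    awk -v h="$1" -v se="$2" 'BEGIN { printf "%.2f\n", h / se }'
}

ldsc_ratio 1.0217 1.0304   # 0.7138 (log reports 0.7154 from unrounded values)
h2_z 0.0948 0.0958         # 0.99: HIV acquisition h2 is within one SE of zero
h2_z 0.1023 0.0129         # 7.93: COPD h2
ldsc_ratio 0.9869 1.0928   # -0.1412, which LDSC prints as "Ratio < 0"
```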