# Stratified LDSC using DEGs by HIV status
GitHub issue: [166](https://github.com/RTIInternational/bioinformatics/issues/166)

**Bryan**:
```
FYI the HIV GWAS sumstats I used are from here:
s3://rti-hiv/gwas_meta/hiv_acquisition/results/0033/001/final_stats/hiv_acquisition_gwas_meta_eur_chr${chr}_stats.txt.gz

Munged sumstats are here:
s3://rti-hiv/scratch/bquach/hiv/stratified_ldsc/hiv_acquisition_meta/data/sumstats/post_qc/
```

In addition to the HIV acquisition GWAS, we will run the partitioned h2 analysis, using the DEGs by HIV status, on the following traits:

| Trait                             | Location                                                                                                                                             |
|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| Alzheimer's disease               | s3://rti-shared/ldsc/data/alzheimers_disease_lambert2013_nat_genet/munged/alzheimers_disease_lambert2013_nat_genet.sumstats.gz                       |
| Amyotrophic lateral sclerosis     | s3://rti-shared/ldsc/data/amyotrophic_lateral_sclerosis_rheenen2016_nat_genet/munged/amyotrophic_lateral_sclerosis_rheenen2016_nat_genet.sumstats.gz |
| Asthma                            | s3://rti-shared/ldsc/data/asthma_han2020_nat_commun/munged/asthma_han2020_nat_commun.sumstats.gz                                                     |
| Atopic dermatitis (Eczema)        | s3://rti-shared/ldsc/data/atopic_dermatitis_paternoster2015_nat_genet/munged/eczema_paternoster2015_nat_genet.sumstats.gz                           |
| Crohn's disease                   | s3://rti-shared/ldsc/data/crohns_disease_liu2015_nat_genet/munged/crohns_disease_liu2015_nat_genet.sumstats.gz                                       |
| Inflammatory Bowel Disease (Euro) | s3://rti-shared/ldsc/data/inflammatory_bowel_disease_liu2015_nat_genet/munged/inflammatory_bowel_disease_liu2015_nat_genet.sumstats.gz               |
| Neuroticism                       | s3://rti-shared/ldsc/data/neuroticism_okbay2016_nat_genet/munged/neuroticism_okbay2016_nat_genet.sumstats.gz                                         |
| Parkinson's disease               | s3://rti-shared/ldsc/data/parkinsons_disease_sanchez2009_nat_genet/munged/parkinsons_disease_sanchez2009_nat_genet.sumstats.gz                       |
| Platelet count                    | s3://rti-shared/ldsc/data/platelet_count_vuckovic2020_cell/munged/platelet_count_vuckovic2020_cell.sumstats.gz                                       |
| Primary biliary cirrhosis         | s3://rti-shared/ldsc/data/primary_biliary_cirrhosis_cordell2015_nat_commun/munged/PASS_Primary_biliary_cirrhosis.sumstats.gz                         |
| Primary sclerosing cholangitis    | s3://rti-shared/ldsc/data/primary_sclerosing_cholangitis_ji2017_nat_genet/munged/primary_sclerosing_cholangitis_ji2017_nat_genet.sumstats.gz         |
| Red blood cell count              | s3://rti-shared/ldsc/data/red_blood_cell_count_vuckovic2020_cell/munged/red_blood_cell_count_vuckovic2020_cell.sumstats.gz                           |
| Rheumatoid Arthritis              | s3://rti-shared/ldsc/data/rheumatoid_arthritis_okada2014_nature/munged/rheumatoid_arthritis_okada2014_nature.sumstats.gz                             |
| Systemic lupus erythematosus      | s3://rti-shared/ldsc/data/systemic_lupus_erythematosus_bentham2015_nat_genet/munged/PASS_Lupus.sumstats.gz                                           |
| Type 2 Diabetes                   | s3://rti-shared/ldsc/data/type2_diabetes_xue2018_nat_commun/munged/type2_diabetes_xue2018_nat_commun.sumstats.gz                                     |
| Ulcerative colitis                | s3://rti-shared/ldsc/data/ulcerative_colitis_liu2015_nat_genet/munged/ulcerative_colitis_liu2015_nat_genet.sumstats.gz                               |
| White blood cell count            | s3://rti-shared/ldsc/data/white_blood_cell_count_vuckovic2020_cell/munged/white_blood_cell_count_vuckovic2020_cell.sumstats.gz                       |

## Create directories

In [None]:
baseD=/shared/rti-hiv/ldsc/hiv_acquisition/partitioned_h2/0001
mkdir -p $baseD/{1000g,deg_bedfiles,results,sumstats}/

## Download data

In [None]:
# download ldsc files
cd $baseD/1000g/

wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_baseline_ldscores.tgz
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_plinkfiles.tgz
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_frq.tgz
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/weights_hm3_no_hla.tgz

# decompress, removing each archive only if extraction succeeded
for file in *.tgz; do
  tar xvzf "$file" && rm "$file"
done


# Download sumstats formatted (munged) results
cd $baseD/sumstats/
aws s3 cp s3://rti-hiv/scratch/bquach/hiv/stratified_ldsc/hiv_acquisition_meta/data/sumstats/post_qc/hiv_acquisition_gwas_meta_eur.sumstats.gz . # HIV acquisition
    
aws s3 cp s3://rti-shared/ldsc/data/alzheimers_disease_lambert2013_nat_genet/munged/alzheimers_disease_lambert2013_nat_genet.sumstats.gz . # Alzheimer's disease
aws s3 cp s3://rti-shared/ldsc/data/amyotrophic_lateral_sclerosis_rheenen2016_nat_genet/munged/amyotrophic_lateral_sclerosis_rheenen2016_nat_genet.sumstats.gz . # Amyotrophic lateral sclerosis
aws s3 cp s3://rti-shared/ldsc/data/asthma_han2020_nat_commun/munged/asthma_han2020_nat_commun.sumstats.gz . # Asthma
aws s3 cp s3://rti-shared/ldsc/data/atopic_dermatitis_paternoster2015_nat_genet/munged/eczema_paternoster2015_nat_genet.sumstats.gz . # Atopic dermatitis (Eczema)
aws s3 cp s3://rti-shared/ldsc/data/crohns_disease_liu2015_nat_genet/munged/crohns_disease_liu2015_nat_genet.sumstats.gz . # Crohn's disease
aws s3 cp s3://rti-shared/ldsc/data/inflammatory_bowel_disease_liu2015_nat_genet/munged/inflammatory_bowel_disease_liu2015_nat_genet.sumstats.gz . # Inflammatory Bowel Disease (Euro)
aws s3 cp s3://rti-shared/ldsc/data/neuroticism_okbay2016_nat_genet/munged/neuroticism_okbay2016_nat_genet.sumstats.gz . # Neuroticism
aws s3 cp s3://rti-shared/ldsc/data/parkinsons_disease_sanchez2009_nat_genet/munged/parkinsons_disease_sanchez2009_nat_genet.sumstats.gz . # Parkinson's disease
aws s3 cp s3://rti-shared/ldsc/data/platelet_count_vuckovic2020_cell/munged/platelet_count_vuckovic2020_cell.sumstats.gz . # Platelet count
aws s3 cp s3://rti-shared/ldsc/data/primary_biliary_cirrhosis_cordell2015_nat_commun/munged/PASS_Primary_biliary_cirrhosis.sumstats.gz . # Primary biliary cirrhosis
aws s3 cp s3://rti-shared/ldsc/data/primary_sclerosing_cholangitis_ji2017_nat_genet/munged/primary_sclerosing_cholangitis_ji2017_nat_genet.sumstats.gz . # Primary sclerosing cholangitis
aws s3 cp s3://rti-shared/ldsc/data/red_blood_cell_count_vuckovic2020_cell/munged/red_blood_cell_count_vuckovic2020_cell.sumstats.gz . # Red blood cell count
aws s3 cp s3://rti-shared/ldsc/data/rheumatoid_arthritis_okada2014_nature/munged/rheumatoid_arthritis_okada2014_nature.sumstats.gz . # Rheumatoid Arthritis
aws s3 cp s3://rti-shared/ldsc/data/systemic_lupus_erythematosus_bentham2015_nat_genet/munged/PASS_Lupus.sumstats.gz . # Systemic lupus erythematosus
aws s3 cp s3://rti-shared/ldsc/data/type2_diabetes_xue2018_nat_commun/munged/type2_diabetes_xue2018_nat_commun.sumstats.gz . # Type 2 Diabetes
aws s3 cp s3://rti-shared/ldsc/data/ulcerative_colitis_liu2015_nat_genet/munged/ulcerative_colitis_liu2015_nat_genet.sumstats.gz . # Ulcerative colitis
aws s3 cp s3://rti-shared/ldsc/data/white_blood_cell_count_vuckovic2020_cell/munged/white_blood_cell_count_vuckovic2020_cell.sumstats.gz . # White blood cell count
    
# Download DEGs by HIV status
cd $baseD/deg_bedfiles/
aws s3 sync s3://rti-hiv/scratch/bquach/hiv/stratified_ldsc/deg_bedfiles/ .
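
Before moving on, it can help to confirm that every downloaded sumstats file has the columns the h2 step reads. The sketch below is not part of the original workflow. It assumes the munged files are gzipped, whitespace-delimited tables whose header includes at least `SNP`, `Z`, and `N` (the columns `munge_sumstats.py` normally writes). The function names `missing_sumstats_cols` and `check_sumstats_dir` are hypothetical.

```shell
# Hypothetical helper: print the names of required columns (SNP, Z, N)
# that are absent from the header of a gzipped sumstats file.
# Prints an empty line when nothing is missing.
missing_sumstats_cols() {
    gzip -dc "$1" | head -n 1 | awk '
        NR == 1 { for (i = 1; i <= NF; i++) seen[$i] = 1 }
        END {
            n = split("SNP Z N", required, " ")
            out = ""
            for (k = 1; k <= n; k++)
                if (!(required[k] in seen))
                    out = out (out == "" ? "" : " ") required[k]
            print out
        }'
}

# Hypothetical helper: check every *.sumstats.gz in a directory and
# return non-zero if any file lacks a required column.
check_sumstats_dir() (
    status=0
    for f in "$1"/*.sumstats.gz; do
        [ -e "$f" ] || continue
        missing=$(missing_sumstats_cols "$f")
        if [ -n "$missing" ]; then
            echo "$f: missing $missing"
            status=1
        fi
    done
    exit $status
)
```

A possible call, assuming the layout created above, is `check_sumstats_dir $baseD/sumstats`.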

## Partitioned h2 analysis
See the LDSC wiki page [LD-Score-Estimation-Tutorial](https://github.com/bulik/ldsc/wiki/LD-Score-Estimation-Tutorial#partitioned-ld-scores). The analysis has three steps:

1. Create annotation file based off of a BED formatted file (file containing chr, chr-start, chr-end)
2. Compute the annotation-specific (partitioned) LD scores.
3. Compute the partitioned heritability estimate.
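
Step 1 depends on the BED files being well formed, so a quick sanity check may save a failed run. The sketch below is an illustration, not part of the original notebook. It assumes each record starts with three tab- or space-separated columns (chromosome, start, end) and that chromosome names carry a `chr` prefix. My understanding is that `make_annot.py` prefixes the `.bim` chromosome codes with `chr` before intersecting, but that should be verified against the installed version. The function name `count_bad_bed_records` is hypothetical, and header lines such as `track` would be counted as bad.

```shell
# Hypothetical helper: count records in a gzipped BED file that have fewer
# than three columns, a chromosome name without the "chr" prefix,
# non-integer coordinates, or a start greater than the end.
count_bad_bed_records() {
    gzip -dc "$1" | awk '
        NF < 3 || $1 !~ /^chr/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 { bad++ }
        END { print bad + 0 }'
}
```

For example, `count_bad_bed_records /data/deg_bedfiles/hiv_status_vl_suppressed_degs_cis10k.bed.gz` should print `0` if the file meets these assumptions.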

In [None]:
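# interactive session; start it from $baseD so that the 1000g, sumstats,
# deg_bedfiles, and results directories are visible under /data
cd $baseD
docker run -it -v $PWD:/data/ \
    404545384114.dkr.ecr.us-east-1.amazonaws.com/ldsc:v1.0.1_0bb574e bash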

# loop through all traits
for trait in {"alzheimers_disease","amyotrophic_lateral_sclerosis","asthma","atopic_dermatitis","crohns_disease","inflammatory_bowel_disease","neuroticism","parkinsons_disease","platelet_count","primary_biliary_cirrhosis","primary_sclerosing_cholangitis","red_blood_cell_count","rheumatoid_arthritis","systemic_lupus_erythematosus","type2_diabetes","ulcerative_colitis","white_blood_cell_count"}; do

    # store processing files for each trait in a separate dir
    mkdir -p /data/annotations_ldscores/${trait}/
    
    # use the sumstats file that corresponds to the trait name for the h2 estimate
    case $trait in 
        "alzheimers_disease")  stats=/data/sumstats/alzheimers_disease_lambert2013_nat_genet.sumstats.gz ;;
        "amyotrophic_lateral_sclerosis") stats=/data/sumstats/amyotrophic_lateral_sclerosis_rheenen2016_nat_genet.sumstats.gz ;;
        "asthma") stats=/data/sumstats/asthma_han2020_nat_commun.sumstats.gz ;;
        "atopic_dermatitis") stats=/data/sumstats/eczema_paternoster2015_nat_genet.sumstats.gz ;;
        "crohns_disease") stats=/data/sumstats/crohns_disease_liu2015_nat_genet.sumstats.gz ;;
        "inflammatory_bowel_disease") stats=/data/sumstats/inflammatory_bowel_disease_liu2015_nat_genet.sumstats.gz ;;
        "neuroticism") stats=/data/sumstats/neuroticism_okbay2016_nat_genet.sumstats.gz ;;
        "parkinsons_disease") stats=/data/sumstats/parkinsons_disease_sanchez2009_nat_genet.sumstats.gz ;;
        "platelet_count") stats=/data/sumstats/platelet_count_vuckovic2020_cell.sumstats.gz ;;
        "primary_biliary_cirrhosis") stats=/data/sumstats/PASS_Primary_biliary_cirrhosis.sumstats.gz ;;
        "primary_sclerosing_cholangitis") stats=/data/sumstats/primary_sclerosing_cholangitis_ji2017_nat_genet.sumstats.gz ;;
        "red_blood_cell_count") stats=/data/sumstats/red_blood_cell_count_vuckovic2020_cell.sumstats.gz ;;
        "rheumatoid_arthritis") stats=/data/sumstats/rheumatoid_arthritis_okada2014_nature.sumstats.gz ;;
        "systemic_lupus_erythematosus") stats=/data/sumstats/PASS_Lupus.sumstats.gz ;;
        "type2_diabetes") stats=/data/sumstats/type2_diabetes_xue2018_nat_commun.sumstats.gz ;;
        "ulcerative_colitis") stats=/data/sumstats/ulcerative_colitis_liu2015_nat_genet.sumstats.gz ;;
        "white_blood_cell_count") stats=/data/sumstats/white_blood_cell_count_vuckovic2020_cell.sumstats.gz ;;
    esac
    
    # loop through each BED file
    for window in {cis10k,cis100k,cis400k}; do
        case $window in
            cis10k) deg_file=/data/deg_bedfiles/hiv_status_vl_suppressed_degs_cis10k.bed.gz ;;
            cis100k) deg_file=/data/deg_bedfiles/hiv_status_vl_suppressed_degs_cis100k.bed.gz ;;
            cis400k) deg_file=/data/deg_bedfiles/hiv_status_vl_suppressed_degs_cis400k.bed.gz ;;
        esac 
        
        # loop through each chromosome
        for j in {1..22}; do
        
            # create annotation files
            python /opt/ldsc/make_annot.py \
                --bed-file $deg_file \
                --bimfile "/data/1000g/1000G_EUR_Phase3_plink/1000G.EUR.QC.$j.bim" \
                --annot-file "/data/annotations_ldscores/$trait/${trait}_degs_${window}.$j.annot.gz"

            # compute LD scores
            python /opt/ldsc/ldsc.py \
                --l2 \
                --bfile "/data/1000g/1000G_EUR_Phase3_plink/1000G.EUR.QC.$j" \
                --ld-wind-cm 1 \
                --annot "/data/annotations_ldscores/$trait/${trait}_degs_${window}.$j.annot.gz" \
                --thin-annot \
                --out "/data/annotations_ldscores/$trait/${trait}_degs_${window}.$j" \
                --print-snps "/data/1000g/1000G_EUR_Phase3_baseline/print_snps.txt"
        done # end chr loop
        
        # compute the partitioned heritability estimate
        python /opt/ldsc/ldsc.py \
            --h2 "$stats" \
            --w-ld-chr "/data/1000g/weights_hm3_no_hla/weights." \
            --ref-ld-chr "/data/annotations_ldscores/$trait/${trait}_degs_${window}.,/data/1000g/1000G_EUR_Phase3_baseline/baseline." \
            --overlap-annot \
            --out "/data/results/${trait}_hiv_status_vl_suppressed_degs_${window}_results" \
            --print-coefficients \
            --frqfile-chr "/data/1000g/1000G_Phase3_frq/1000G.EUR.QC."
    
    done # end BED file loop
done # end trait file loop
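
If an h2 run fails, one likely cause is a per-chromosome file that was never written. The sketch below is an addition to the original workflow. It lists the files that `make_annot.py` and `ldsc.py --l2` are expected to produce for chromosomes 1 to 22 (`.annot.gz`, `.l2.ldscore.gz`, `.l2.M`, `.l2.M_5_50`) and reports any that are missing or empty. The function name `list_missing_ldscores` is hypothetical, and its argument is a prefix in the same form as the one passed to `--ref-ld-chr`.

```shell
# Hypothetical helper: given a prefix such as
# /data/annotations_ldscores/asthma/asthma_degs_cis10k.
# print every expected per-chromosome file that is missing or empty.
# The body runs in a subshell so its variables do not leak to the caller.
list_missing_ldscores() (
    prefix=$1
    chr=1
    while [ "$chr" -le 22 ]; do
        for suffix in annot.gz l2.ldscore.gz l2.M l2.M_5_50; do
            [ -s "${prefix}${chr}.${suffix}" ] || echo "${prefix}${chr}.${suffix}"
        done
        chr=$((chr + 1))
    done
)
```

Inside the loops above it could be called as `list_missing_ldscores "/data/annotations_ldscores/$trait/${trait}_degs_${window}."`. Empty output means all expected files are present.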

In [None]:
# same analysis for the HIV acquisition GWAS (single trait)
for trait in "hiv_acquisition"; do

    # store processing files for the trait in a separate dir
    mkdir -p /data/annotations_ldscores/${trait}/
    
    # sumstats file for the h2 estimate
    stats=/data/sumstats/hiv_acquisition_gwas_meta_eur.sumstats.gz
    
    # loop through each BED file
    for window in {cis10k,cis100k,cis400k}; do
        case $window in
            cis10k) deg_file=/data/deg_bedfiles/hiv_status_vl_suppressed_degs_cis10k.bed.gz ;;
            cis100k) deg_file=/data/deg_bedfiles/hiv_status_vl_suppressed_degs_cis100k.bed.gz ;;
            cis400k) deg_file=/data/deg_bedfiles/hiv_status_vl_suppressed_degs_cis400k.bed.gz ;;
        esac 
        
        # loop through each chromosome
        for j in {1..22}; do
        
            # create annotation files
            python /opt/ldsc/make_annot.py \
                --bed-file $deg_file \
                --bimfile "/data/1000g/1000G_EUR_Phase3_plink/1000G.EUR.QC.$j.bim" \
                --annot-file "/data/annotations_ldscores/$trait/${trait}_degs_${window}.$j.annot.gz"

            # compute LD scores
            python /opt/ldsc/ldsc.py \
                --l2 \
                --bfile "/data/1000g/1000G_EUR_Phase3_plink/1000G.EUR.QC.$j" \
                --ld-wind-cm 1 \
                --annot "/data/annotations_ldscores/$trait/${trait}_degs_${window}.$j.annot.gz" \
                --thin-annot \
                --out "/data/annotations_ldscores/$trait/${trait}_degs_${window}.$j" \
                --print-snps "/data/1000g/1000G_EUR_Phase3_baseline/print_snps.txt"
        done # end chr loop
        
        # compute the partitioned heritability estimate
        python /opt/ldsc/ldsc.py \
            --h2 "$stats" \
            --w-ld-chr "/data/1000g/weights_hm3_no_hla/weights." \
            --ref-ld-chr "/data/annotations_ldscores/$trait/${trait}_degs_${window}.,/data/1000g/1000G_EUR_Phase3_baseline/baseline." \
            --overlap-annot \
            --out "/data/results/${trait}_hiv_status_vl_suppressed_degs_${window}_results" \
            --print-coefficients \
            --frqfile-chr "/data/1000g/1000G_Phase3_frq/1000G.EUR.QC."
    
    done # end BED file loop
done # end trait file loop

In [None]:
# quick test of make_annot.py on a single chromosome
j=22
name=test
peaks=jt
mkdir -p /data/LDSC_${name}/
python /opt/ldsc/make_annot.py \
    --bed-file "/data/deg_bedfiles/hiv_status_vl_suppressed_degs_cis10k.bed.gz" \
    --bimfile "/data/1000g/1000G_EUR_Phase3_plink/1000G.EUR.QC.$j.bim" \
    --annot-file "/data/LDSC_${name}/${name}_${peaks}.$j.annot.gz"

In [None]:
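# combine the DEG annotation row from every trait into one table per window
# (run from the results directory)
for window in {"10k","100k","400k"}; do
    outfile=hiv_status_vl_suppressed_degs_cis${window}_combined_traits_results.tsv
    # take the header from any one results file
    head -1 alzheimers_disease_hiv_status_vl_suppressed_degs_cis${window}_results.results > $outfile
        
    for file in *degs_cis${window}_results.results; do
        trait=$(echo $file | sed 's/_hiv_status_vl_suppressed_degs_cis.*//')
        # row 2 is the first data row, assumed to be the DEG annotation because it is
        # listed first in --ref-ld-chr; replace its category label with the trait name
        awk -v trait="$trait" 'BEGIN { OFS = "\t" } NR == 2 { $1 = trait; print }' "$file" >> $outfile
    done
done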
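
Once the combined tables exist, a natural next step is to order the traits by enrichment p-value. The sketch below is illustrative only. It assumes the combined file is tab-separated, that its header contains a column named `Enrichment_p` (as in LDSC `.results` files), and that the installed `sort` supports `-g` for values in scientific notation. The function name `rank_by_enrichment_p` is hypothetical.

```shell
# Hypothetical helper: print the header, then the data rows sorted in
# ascending order of the Enrichment_p column, which is located by name.
rank_by_enrichment_p() (
    file=$1
    col=$(head -n 1 "$file" | tr '\t' '\n' | grep -n -x 'Enrichment_p' | cut -d: -f1)
    if [ -z "$col" ]; then
        echo "Enrichment_p column not found in $file" >&2
        exit 1
    fi
    head -n 1 "$file"
    tail -n +2 "$file" | sort -t "$(printf '\t')" -k "${col},${col}g"
)
```

For example, `rank_by_enrichment_p hiv_status_vl_suppressed_degs_cis10k_combined_traits_results.tsv` lists the most significantly enriched traits first.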


