# G×Vit D in UK Biobank: calculate LDSC intercept
[GitHub Issue 109](https://github.com/RTIInternational/bioinformatics/issues/109#issuecomment-840591260)
Investigate the inflated lambda using the LDSC intercept: is the inflation due to residual bias or to a large amount of true positive signal? The [Bulik-Sullivan 2015 Nat. Genet.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4495769/) paper describes how the LDSC intercept reveals more about your summary statistics than the genomic inflation factor (lambda) alone. If lambda is inflated but the LDSC intercept is ~1, you can reason that the inflation is due to polygenicity rather than cryptic relatedness, population stratification, or sample overlap, because for a polygenic trait lambda increases as sample size and power increase.
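
For reference, lambda can be computed directly from a set of p-values. The block below is a minimal sketch, not the code used to produce the existing QQ plots: it converts each two-sided p-value to a 1-df chi-square statistic and divides the median by the expected null median. The function name `genomic_inflation` and the toy p-values are illustrative assumptions.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(pvalues):
    """Return lambda for an iterable of two-sided p-values in (0, 1]."""
    norm = NormalDist()
    # A two-sided p-value maps to a 1-df chi-square statistic via z**2,
    # where z is the p/2 quantile of the standard normal.
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a null distribution, so lambda is about 1.
null_p = [(i + 0.5) / 1001 for i in range(1001)]
lambda_null = genomic_inflation(null_p)
print(round(lambda_null, 3))
```

A lambda well above 1 together with an LDSC intercept near 1 is the pattern the paragraph above attributes to polygenicity.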

**FVC**

UK Biobank
Filtered: `s3://rti-pulmonary/gwas/ukbiobank/results/fvc/0003/chr{1..22}/chr{1..22}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.8.tsv.gz`
        
**FEV1**

UK Biobank
Filtered: `s3://rti-pulmonary/gwas/ukbiobank/results/fev1/0003/chr{1..22}/chr{1..22}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.8.tsv.gz`


<br>

**Description**: 
```
@jaamarks It would be great if you could compute the LDSR intercept, and it is needed only for the UK Biobank filtered results. Note that we have one set of results for FEV1 and one for FVC; within each phenotype, we want to know the intercept for the 1df SNP main effect term, the 1df SNPxVitD interaction term, and the 2df result.

@ngaddis the filtered UK Biobank results with Rsq>0.8 are provided above. Can you provide the Rsq>0.3 results? Also, while Jesse is working with these results, would it be helpful for him to run the QQ plot for the 1df SNP main effect term? (Jesse, FYI: the QQ and Manhattan plots are already generated for the 1df interaction and 2df results).
    
@danahancock @jaamarks The UK Biobank rsq>0.3 results are in the same S3 location as the rsq>0.8 results. Yeah, it would be great if Jesse could run the QQ plot for the 1df main effect term. The one catch is that the p-value for this term is not provided in the results file. It does provide the beta and SE though, so the p-value can be calculated.    
```

In [None]:
# setup environment
fvcD=/shared/rti-pulmonary/gwas/ukbiobank/results/fvc/0003/
fev1D=/shared/rti-pulmonary/gwas/ukbiobank/results/fev1/0003/
mkdir -p $fvcD
mkdir -p $fev1D

# restore objects
for chr in {1..22}; do
    #FVC
    aws s3api restore-object \
        --bucket rti-pulmonary \
        --key gwas/ukbiobank/results/fvc/0003/chr$chr/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3.tsv.gz \
        --restore-request '{"GlacierJobParameters":{"Tier":"Standard"}}' 
     
    aws s3api restore-object \
        --bucket rti-pulmonary \
        --key gwas/ukbiobank/results/fvc/0003/chr$chr/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.8.tsv.gz \
        --restore-request '{"GlacierJobParameters":{"Tier":"Standard"}}' 
    
    #FEV1
    aws s3api restore-object \
        --bucket rti-pulmonary \
        --key gwas/ukbiobank/results/fev1/0003/chr$chr/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3.tsv.gz \
        --restore-request '{"GlacierJobParameters":{"Tier":"Standard"}}' 
     
    aws s3api restore-object \
        --bucket rti-pulmonary \
        --key gwas/ukbiobank/results/fev1/0003/chr$chr/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.8.tsv.gz \
        --restore-request '{"GlacierJobParameters": {"Tier": "Expedited"}}'
done

# check restore status; after the loop above, $chr is still 22, so this inspects chr22 only
aws s3api head-object \
    --bucket rti-pulmonary \
    --key gwas/ukbiobank/results/fev1/0003/chr$chr/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.8.tsv.gz


cd $fvcD
for chr in {1..22}; do 
    aws s3 cp s3://rti-pulmonary/gwas/ukbiobank/results/fvc/0003/chr$chr/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3.tsv.gz .
    aws s3 cp s3://rti-pulmonary/gwas/ukbiobank/results/fvc/0003/chr$chr/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.8.tsv.gz .
done

cd $fev1D
for chr in {1..22}; do 
    aws s3 cp s3://rti-pulmonary/gwas/ukbiobank/results/fev1/0003/chr$chr/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3.tsv.gz .
    aws s3 cp s3://rti-pulmonary/gwas/ukbiobank/results/fev1/0003/chr$chr/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.8.tsv.gz .
done 
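
Glacier restores are asynchronous, so the `aws s3 cp` commands only succeed once each object has been restored. As a rough sketch (not part of the original workflow), the JSON printed by `aws s3api head-object` could be inspected in Python. The `Restore` field and its `ongoing-request` marker are assumed to follow the usual S3 response layout; the function name `restore_state` and the sample response are hypothetical.

```python
import json


def restore_state(head_object_json):
    """Classify the restore state from `aws s3api head-object` JSON output."""
    response = json.loads(head_object_json)
    restore = response.get("Restore")
    if restore is None:
        # No restore has been requested (or the object is not archived).
        return "not-requested"
    if 'ongoing-request="true"' in restore:
        return "in-progress"
    return "restored"


# Hypothetical response, trimmed to the fields used here.
sample = json.dumps({
    "StorageClass": "GLACIER",
    "Restore": 'ongoing-request="false", expiry-date="Fri, 21 Dec 2012 00:00:00 GMT"',
})
print(restore_state(sample))
```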

In [None]:
## pseudocode
tval <- Est.G / SE.G

# pt() is the cumulative distribution function (CDF) of the Student t distribution;
# with lower.tail=FALSE it returns the upper-tail probability, so doubling it gives the two-sided p-value
pvalue <- 2 * pt( abs(tval) , df=n.obs-1, lower.tail=FALSE ) 

# P-Value
The P-value for the 1df main effect term was not included in the GWAS summary statistics, so we calculate it from the reported beta (`Est.G`) and standard error (`SE.G`).
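
As a quick sanity check on the calculation, here is a small Python sketch. It uses a standard normal reference instead of the t distribution, which is an approximation that should be very close when `n.obs` is in the hundreds of thousands. The function name `two_sided_pvalue` is an assumption made for illustration; the inputs are `Est.G` and `SE.G` from the example row shown further down in this notebook.

```python
import math


def two_sided_pvalue(beta, se):
    """Two-sided p-value for beta/se under a standard normal reference."""
    z = abs(beta / se)
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)), the two-sided tail area.
    return math.erfc(z / math.sqrt(2.0))


# Est.G and SE.G from the example row (rs367896724).
p_main = two_sided_pvalue(0.00524484, 0.00674009)
print(p_main)
```

The result should agree with the `Main.pval` value of that row (about 0.4365) to roughly three decimal places.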

## Calculate P-value for the 1df main effect

### Rscript

In [None]:
#!/share/apps/R/bin/Rscript

args <- commandArgs(TRUE)

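loop = TRUE
fileIn = ""
fileOut = ""
colP = ""
colBeta = ""
colSE = ""
colN = ""
fileInHeader = TRUE  # input is assumed to have a header row
#chiDf=1

# Stop early when called without arguments; args[1] would be NA below
if (length(args) == 0) {
    stop("No arguments supplied")
}

while (loop) {

    if (args[1] == "--in") {
        fileIn = args[2]
    }
    
    # Accepted for compatibility; a header is already assumed by default
    if (args[1] == "--in_header") {
        fileInHeader = TRUE
    }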
    
    if (args[1] == "--out") {
        fileOut = args[2]
    }
    
    if (args[1] == "--col_p") {
        colP = gsub("-",".",args[2])
    }

    if (args[1] == "--col_beta") {
        colBeta = gsub("-",".",args[2])
    }

    if (args[1] == "--col_se") {
        colSE = gsub("-",".",args[2])
    }

    if (args[1] == "--col_n") {
        colN = gsub("-",".",args[2])
    }

    if (length(args) > 1) {
        args = args[2:length(args)]
    } else {
        loop=FALSE
    }
}

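if (fileIn == "") {
    stop("No input file specified")
} else if (fileOut == "") {
    stop("No output file specified")
} else if (colBeta == "" || colSE == "" || colN == "") {
    stop("--col_beta, --col_se and --col_n are all required")
}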

cat("Reading ", fileIn, "...\n", sep = "")
inputData = read.table(fileIn, header = fileInHeader)

cat("Calculating t-values...\n")
tvalues = inputData[, colBeta] / inputData[, colSE]
#print(tvalues)

cat("Calculating p-values...\n")
pvalues <- 2 * pt( abs(tvalues) , df=inputData[, colN]-1, lower.tail=FALSE ) 
#print(pvalues)

inputData$Main.pval <- pvalues

cat(paste("Writing file",fileOut,"...\n"))
write.table(inputData, file = fileOut, row.names = FALSE, quote = FALSE, sep="\t")

### Submit jobs

In [None]:
pheno=fev1 # fvc
workingD=/shared/rti-pulmonary/gwas/ukbiobank/results/$pheno/0003
mkdir -p $workingD/pvalue_added/logs/

for chr in {1..22}; do
    for rsq in {0.3,0.8}; do
        /shared/bioinformatics/software/scripts/qsub_job.sh \
            --job_name ${pheno}_add_pvalue_chr${chr}_rsq${rsq} \
            --script_prefix $workingD/pvalue_added/logs/${pheno}_chr${chr}_rsq${rsq}_add_pvalue \
            --nslots 1 \
            --program  Rscript ~/bin/calculate_1dfpvalue.R \
                --in $workingD/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_${rsq}.tsv.gz \
                --out $workingD/pvalue_added/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_${rsq}.tsv \
                --col_beta Est.G \
                --col_se SE.G \
                --col_n n.obs
    done
done

## Upload to S3

In [None]:
## gzip first (the R script writes uncompressed .tsv files)

pheno=fev1 # fvc
workingD=/shared/rti-pulmonary/gwas/ukbiobank/results/$pheno/0003/pvalue_added

for chr in {1..22}; do
    for rsq in {0.3,0.8}; do
        infile=$workingD/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_${rsq}.tsv
        gzip $infile
        aws s3 cp ${infile}.gz s3://rti-pulmonary/gwas/ukbiobank/results/$pheno/0003/chr$chr/1df_pvalue_added/
    done
done

## QQ Plot
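
The plots themselves are produced by the `generate_gwas_plots` workflow below. For orientation only, this sketch shows how QQ plot coordinates are usually derived from p-values; it is not the workflow's implementation, and the function name `qq_points` and the toy p-values are assumptions.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs, ordered by ascending p."""
    observed = sorted(p for p in pvalues if 0.0 < p <= 1.0)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        # Under the null, the rank-th smallest of n p-values is expected
        # near (rank - 0.5) / n.
        expected = (rank - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points


pts = qq_points([0.5, 0.01, 0.2, 0.9])
```

Points that rise above the diagonal across the whole range, not only in the tail, are what an inflated lambda looks like on the plot.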

### combine chromosome results

In [None]:
pheno=fvc # and other pheno
workingD=/shared/rti-pulmonary/gwas/ukbiobank/results/$pheno/0003/pvalue_added
mkdir -p $workingD/qqplot/wf_input/

cd $workingD

# get header
zcat $workingD/chr1_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3.tsv.gz | head -1 > \
    $workingD/qqplot/chr_all_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3.tsv
zcat chr1_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.8.tsv.gz | head -1 > \
    $workingD/qqplot/chr_all_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.8.tsv

for rsq in {0.3,0.8}; do
    for chr in {1..22}; do
        infile=$workingD/chr${chr}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_${rsq}.tsv.gz
        zcat $infile | tail -n +2 >> $workingD/qqplot/chr_all_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_${rsq}.tsv
    done
done &
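
The same concatenation can be sketched in Python. This is an alternative under stated assumptions, not a replacement for the shell loop above: the function name `concat_gz_tsv` is hypothetical, and unlike the shell version it writes gzipped output, which is the form the later joint-beta step reads.

```python
import gzip


def concat_gz_tsv(in_paths, out_path):
    """Concatenate gzipped TSVs into one gzipped TSV, keeping a single header."""
    with gzip.open(out_path, "wt") as out:
        for i, path in enumerate(in_paths):
            with gzip.open(path, "rt") as src:
                header = src.readline()
                if i == 0:
                    # Keep the header from the first file only.
                    out.write(header)
                for line in src:
                    out.write(line)
```

It would be called with the 22 per-chromosome paths in order, for example a list built with `[f"chr{c}_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3.tsv.gz" for c in range(1, 23)]`.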

### prepare plotting workflow

In [None]:
pheno=fvc # and other pheno
workingD=/shared/rti-pulmonary/gwas/ukbiobank/results/$pheno/0003/pvalue_added/qqplot

cd /shared/biocloud_wdl_tools/
git pull
#git submodule update --init --recursive
cd /shared
cp -r biocloud_wdl_tools/generate_gwas_plots/* $workingD/wf_input/
cd $workingD/wf_input

## edit workflow and configuration file

## Submit jobs

In [None]:
## FVC
java -jar ~/bin/cromwell/cromwell-54.jar \
    run /shared/rti-pulmonary/gwas/ukbiobank/results/fvc/0003/pvalue_added/qqplot/wf_input/test_generate_gwas_plots.wdl \
    --inputs /shared/rti-pulmonary/gwas/ukbiobank/results/fvc/0003/pvalue_added/qqplot/wf_input/fvc_rsq_0.8_input.json 

java -jar ~/bin/cromwell/cromwell-54.jar \
    run /shared/rti-pulmonary/gwas/ukbiobank/results/fvc/0003/pvalue_added/qqplot/wf_input/test_generate_gwas_plots.wdl \
    --inputs /shared/rti-pulmonary/gwas/ukbiobank/results/fvc/0003/pvalue_added/qqplot/wf_input/fvc_rsq_0.3_input.json 

## FEV1
java -jar ~/bin/cromwell/cromwell-54.jar \
    run /shared/rti-pulmonary/gwas/ukbiobank/results/fev1/0003/pvalue_added/qqplot/wf_input/test_generate_gwas_plots.wdl \
    --inputs /shared/rti-pulmonary/gwas/ukbiobank/results/fev1/0003/pvalue_added/qqplot/wf_input/fev1_rsq_0.8_input.json 

java -jar ~/bin/cromwell/cromwell-54.jar \
    run /shared/rti-pulmonary/gwas/ukbiobank/results/fev1/0003/pvalue_added/qqplot/wf_input/test_generate_gwas_plots.wdl \
    --inputs /shared/rti-pulmonary/gwas/ukbiobank/results/fev1/0003/pvalue_added/qqplot/wf_input/fev1_rsq_0.3_input.json 



# LDSC Intercept

## FVC

In [None]:
pheno=fvc
workD=/shared/rti-pulmonary/gwas/ukbiobank/results/$pheno/0003/pvalue_added/ldsc
mkdir -p $workD/{sumstats,1df_main,1df_interaction,2df_joint}
cd $workD

# Download data
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2

tar -jxvf eur_w_ld_chr.tar.bz2
bunzip2 w_hm3.snplist.bz2

### calculate joint 2df BETA
Simply add the main effect beta and the interaction beta. See GitHub issue comment [here](https://github.com/RTIInternational/bioinformatics/issues/109/#issuecomment-846119728).
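
A small worked example of that sum, looking the columns up by name instead of by position. This is a sketch under assumptions: the helper `add_joint_beta` is hypothetical, and the in-memory TSV holds only three columns of the example row from this notebook.

```python
import csv
import io


def add_joint_beta(rows):
    """Yield each row dict with an added Est.joint = Est.G + Est.G.VITD."""
    for row in rows:
        row = dict(row)
        joint = float(row["Est.G"]) + float(row["Est.G.VITD"])
        row["Est.joint"] = str(joint)
        yield row


# Three columns from the example row, as an in-memory TSV.
demo = io.StringIO(
    "variant.id\tEst.G\tEst.G.VITD\n"
    "rs367896724:10177:A:AC\t0.00524484\t-0.000179483\n"
)
joined = list(add_joint_beta(csv.DictReader(demo, delimiter="\t")))
print(joined[0]["Est.joint"])
```

Name-based lookup avoids silently summing the wrong columns if the column order of the results file ever changes.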

In [None]:
cd /shared/rti-pulmonary/gwas/ukbiobank/results/fvc/0003/pvalue_added/qqplot/

#variant.id      chr     pos     ref     alt     n.obs   freq    MAC     Est.G   Est.G.VITD      SE.G    SE.G.VITD       GxE.Stat        Joint.Stat      GxE.pval       Joint.pval      info    Main.pval
#rs367896724:10177:A:AC  1       10177   AC      A       205475  0.397744        163453  0.00524484      -0.000179483    0.00674009      0.000123214     1.45668 2.0751  0.145206        0.116132        0.467935        0.43647807738792

### python3
import gzip

infile = "chr_all_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3.tsv.gz"
outfile = "../ldsc/sumstats/chr_all_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3_joint_2df_beta_rsid_only.tsv"


with gzip.open(infile, 'rt') as inF, open(outfile, 'w') as outF:
    # add a column name for the joint beta to the header
    head = inF.readline().split()
    head.append("Est.joint")
    outF.write("\t".join(head) + "\n")

    for line in inF:
        sl = line.split()
        sl[0] = sl[0].split(":")[0]  # keep only the rsID portion of variant.id
        main = float(sl[8])          # Est.G
        interaction = float(sl[9])   # Est.G.VITD
        sl.append(str(main + interaction))
        outF.write("\t".join(sl) + "\n")

In [None]:
## start interactive session

docker run -i -t  \
    -v $workD:$workD \
    rticode/ldsc:7618f4943d8f31a37cbd207f867ba5742d03373f /bin/bash

pheno=fvc 
workD=/shared/rti-pulmonary/gwas/ukbiobank/results/$pheno/0003/pvalue_added/ldsc


# Munge data, then run LDSC (the intercept is reported in the h2 log)
#for rsq in {0.3,0.8}; do
for rsq in 0.3; do
    for df in {"1df_main","1df_interaction","2df_joint"}; do
        case $df in
            "1df_main") pval="Main.pval"; sign="Est.G" ;;
            "1df_interaction") pval="GxE.pval"; sign="Est.G.VITD" ;;
            "2df_joint") pval="Joint.pval"; sign="Est.joint" ;;
        esac

        /opt/ldsc/munge_sumstats.py \
            --sumstats $workD/sumstats/chr_all_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_${rsq}_joint_2df_beta_rsid_only.tsv \
            --snp variant.id \
            --N-col n.obs \
            --a1 alt \
            --a2 ref \
            --p $pval \
            --signed-sumstats ${sign},0 \
            --out $workD/$df/${pheno}_${df} \
            --merge-alleles $workD/w_hm3.snplist

        # use absolute paths: the container does not start in $workD
        /opt/ldsc/ldsc.py \
            --h2 $workD/$df/${pheno}_${df}.sumstats.gz \
            --ref-ld-chr $workD/eur_w_ld_chr/ \
            --w-ld-chr $workD/eur_w_ld_chr/ \
            --out $workD/$df/${pheno}_${df}_h2
    done
done

## FEV1

In [None]:
pheno=fev1
workD=/shared/rti-pulmonary/gwas/ukbiobank/results/$pheno/0003/pvalue_added/ldsc
mkdir -p $workD/{sumstats,1df_main,1df_interaction,2df_joint}
cd $workD

# Download data
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2

tar -jxvf eur_w_ld_chr.tar.bz2
bunzip2 w_hm3.snplist.bz2

### calculate joint 2df BETA
Simply add the main effect beta and the interaction beta. See GitHub issue comment [here](https://github.com/RTIInternational/bioinformatics/issues/109/#issuecomment-846119728).

In [None]:
cd /shared/rti-pulmonary/gwas/ukbiobank/results/$pheno/0003/pvalue_added/qqplot/

#variant.id      chr     pos     ref     alt     n.obs   freq    MAC     Est.G   Est.G.VITD      SE.G    SE.G.VITD       GxE.Stat        Joint.Stat      GxE.pval       Joint.pval      info    Main.pval
#rs367896724:10177:A:AC  1       10177   AC      A       205475  0.397744        163453  0.00524484      -0.000179483    0.00674009      0.000123214     1.45668 2.0751  0.145206        0.116132        0.467935        0.43647807738792

### python3
import gzip

infile = "chr_all_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3.tsv.gz"
outfile = "../ldsc/sumstats/chr_all_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_0.3_joint_2df_beta_rsid_only.tsv"


with gzip.open(infile, 'rt') as inF, open(outfile, 'w') as outF:
    # add a column name for the joint beta to the header
    head = inF.readline().split()
    head.append("Est.joint")
    outF.write("\t".join(head) + "\n")

    for line in inF:
        sl = line.split()
        sl[0] = sl[0].split(":")[0]  # keep only the rsID portion of variant.id
        main = float(sl[8])          # Est.G
        interaction = float(sl[9])   # Est.G.VITD
        sl.append(str(main + interaction))
        outF.write("\t".join(sl) + "\n")

In [None]:
## start interactive session

docker run -i -t  \
    -v $workD:$workD \
    rticode/ldsc:7618f4943d8f31a37cbd207f867ba5742d03373f /bin/bash

pheno=fev1
workD=/shared/rti-pulmonary/gwas/ukbiobank/results/$pheno/0003/pvalue_added/ldsc


# Munge data, then run LDSC (the intercept is reported in the h2 log)
#for rsq in {0.3,0.8}; do
for rsq in 0.3; do
    for df in {"1df_main","1df_interaction","2df_joint"}; do
        case $df in
            "1df_main") pval="Main.pval"; sign="Est.G" ;;
            "1df_interaction") pval="GxE.pval"; sign="Est.G.VITD" ;;
            "2df_joint") pval="Joint.pval"; sign="Est.joint" ;;
        esac

        /opt/ldsc/munge_sumstats.py \
            --sumstats $workD/sumstats/chr_all_2dfgrch37_dbsnp_b153_maf_gt_0.01_rsq_gt_${rsq}_joint_2df_beta_rsid_only.tsv \
            --snp variant.id \
            --N-col n.obs \
            --a1 alt \
            --a2 ref \
            --p $pval \
            --signed-sumstats ${sign},0 \
            --out $workD/$df/${pheno}_${df} \
            --merge-alleles $workD/w_hm3.snplist

        # use absolute paths: the container does not start in $workD
        /opt/ldsc/ldsc.py \
            --h2 $workD/$df/${pheno}_${df}.sumstats.gz \
            --ref-ld-chr $workD/eur_w_ld_chr/ \
            --w-ld-chr $workD/eur_w_ld_chr/ \
            --out $workD/$df/${pheno}_${df}_h2
    done
done
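
Once the runs finish, the intercept for each term has to be read from the `*_h2.log` files. The sketch below pulls the relevant numbers out of the log text. It assumes the log contains lines of the form `Lambda GC: ...`, `Mean Chi^2: ...` and `Intercept: ... (...)`, which should be checked against the actual output; the function name `parse_ldsc_h2_log` is hypothetical and the sample numbers are made up for illustration, not results from this analysis.

```python
import re


def parse_ldsc_h2_log(text):
    """Extract lambda GC, mean chi-square, and intercept (with SE) from log text."""
    patterns = {
        "lambda_gc": r"Lambda GC:\s*([-\d.eE]+)",
        "mean_chi2": r"Mean Chi\^2:\s*([-\d.eE]+)",
        "intercept": r"Intercept:\s*([-\d.eE]+)\s*\(([-\d.eE]+)\)",
    }
    out = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            # Leave the key out when the line is missing from the log.
            continue
        out[key] = float(match.group(1))
        if key == "intercept":
            out["intercept_se"] = float(match.group(2))
    return out


# Made-up numbers in the assumed log layout, for illustration only.
sample_log = """Lambda GC: 1.20
Mean Chi^2: 1.30
Intercept: 1.03 (0.01)
"""
stats = parse_ldsc_h2_log(sample_log)
print(stats)
```

Reading the text of each `$workD/$df/${pheno}_${df}_h2.log` into this function would give one small table per phenotype with the three intercepts requested in the issue.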