# Workshop on Statistical Genetics and Genetic Epidemiology STAGE-Quebec
## Theme 2 - Molecular Phenotypes in Genetic Epidemiology

By Marc-André Legault (Université de Montréal) and Qihuang Zhang (McGill University)

**July 31 - August 1, 2025**

### Introduction

In this notebook, we will run a second TWAS tool, FUSION, on the same GWAS summary statistics. This will show how to run FUSION and let us compare its results with S-PrediXcan, which is the theme of the third section.

### Data exploration

The file structure for the FUSION data included in this notebook is as follows:
<pre>
/workshop/data/FUSION/
├── gwas/      # GWAS summary statistics
├── ld/        # LD reference panels (1000 Genomes EUR)
└── models/    # Gene expression weight models
    ├── GTExv8_EUR_Pancreas/
    └── YFS_BLOOD_RNAARR/
</pre>

Each ``models/`` subfolder contains:

- ``*.wgt.RDat`` weight model files
- One ``.pos`` file listing all available genes
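To get a feel for these files before running anything, the sketch below summarizes a ``.pos`` file. It assumes the file is whitespace-delimited with a header row that includes a ``CHR`` column; this layout is an assumption and is not shown in this notebook. The helper name ``genes_per_chrom`` is hypothetical, and the demo at the end only runs if the workshop file is present.

```shell
# Hypothetical helper: count the genes listed per chromosome in a .pos file.
# Assumption: whitespace-delimited file whose header row contains a CHR column.
genes_per_chrom() {
    awk '
        NR == 1 {
            # Locate the CHR column by name instead of assuming its position.
            for (i = 1; i <= NF; i++) if ($i == "CHR") col = i
            if (!col) { print "no CHR column in header" > "/dev/stderr"; exit 1 }
            next
        }
        { n[$col]++ }
        END { for (c in n) print c, n[c] }
    ' "$1" | sort -n
}

# Demo on the YFS model folder used in this notebook (skipped if the file is absent).
POS=/workshop/data/FUSION/models/YFS_BLOOD_RNAARR/YFS.BLOOD.RNAARR.pos
if [ -f "$POS" ]; then
    head -n 3 "$POS"                                         # peek at the first rows
    genes_per_chrom "$POS"                                   # genes per chromosome
    find "$(dirname "$POS")" -name '*.wgt.RDat' | wc -l      # number of weight files
fi
```

Each output line of ``genes_per_chrom`` is a chromosome followed by the number of genes listed for it.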

### Running the software

The code below calls FUSION with the first gene expression prediction model, which was derived from whole-blood gene expression in the Young Finns Study. The loop calls FUSION once per chromosome. For computational reasons, you may want to run only the first few chromosomes: as written, the ``break`` statement stops the loop after chromosome 3. To run the whole transcriptome, comment out the ``break`` statement by adding a pound sign at the beginning of the line (``# break``).

We prepared a checkpoint with the full results to save computational resources.

In [4]:
for chrom in $(seq 1 22); do
    Rscript /workshop/software/FUSION/FUSION.assoc_test.R \
        --sumstats /workshop/data/FUSION/gwas/Mahajan.NatGen2022.DIAMANTE-EUR.ldsc_ss.FUSION_ref.tsv \
        --weights /workshop/data/FUSION/models/YFS_BLOOD_RNAARR/YFS.BLOOD.RNAARR.pos \
        --weights_dir /workshop/data/FUSION/models/YFS_BLOOD_RNAARR/ \
        --ref_ld_chr /workshop/data/FUSION/ld/1000G.EUR. \
        --chr $chrom \
        --out results/fusion_YFS_$chrom.tsv
        
    if (( $chrom > 2 )); then
        # Comment out the line below to run all the chromosomes;
        # otherwise the loop stops after chromosome 3.
        break
    fi
done

Analysis completed.
NOTE: 2 / 476 genes were skipped
Analysis completed.
NOTE: 0 / 322 genes were skipped
Analysis completed.
NOTE: 0 / 293 genes were skipped
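
As an alternative to editing the ``break`` statement, the loop can take the last chromosome as a parameter. This is a minimal sketch, not part of FUSION: the function ``run_fusion`` and the ``DRY_RUN`` variable are hypothetical names, and only the FUSION options already used in this notebook appear. With ``DRY_RUN=1`` the commands are printed rather than executed, which is a cheap way to check paths first.

```shell
# Hypothetical wrapper: run FUSION.assoc_test.R for chromosomes 1..max_chrom.
# Arguments: <pos file> <weights dir> <output prefix> [max chromosome, default 22]
run_fusion() {
    local pos_file=$1 weights_dir=$2 out_prefix=$3 max_chrom=${4:-22}
    local chrom
    for chrom in $(seq 1 "$max_chrom"); do
        local cmd=(
            Rscript /workshop/software/FUSION/FUSION.assoc_test.R
            --sumstats /workshop/data/FUSION/gwas/Mahajan.NatGen2022.DIAMANTE-EUR.ldsc_ss.FUSION_ref.tsv
            --weights "$pos_file"
            --weights_dir "$weights_dir"
            --ref_ld_chr /workshop/data/FUSION/ld/1000G.EUR.
            --chr "$chrom"
            --out "${out_prefix}_${chrom}.tsv"
        )
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "${cmd[*]}"                       # print the command only
        else
            mkdir -p "$(dirname "$out_prefix")"    # make sure the output folder exists
            "${cmd[@]}"
        fi
    done
}

# Dry run for the YFS panel, chromosomes 1 and 2 only.
DRY_RUN=1 run_fusion \
    /workshop/data/FUSION/models/YFS_BLOOD_RNAARR/YFS.BLOOD.RNAARR.pos \
    /workshop/data/FUSION/models/YFS_BLOOD_RNAARR/ \
    results/fusion_YFS 2
```

Dropping ``DRY_RUN=1`` and the final ``2`` would run all 22 chromosomes.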


Now, we can also run FUSION with weights estimated from GTEx (pancreas). The same GTEx data was used for the S-PrediXcan model we previously ran. This will be a point of comparison between both approaches.

In [None]:
for chrom in $(seq 1 22); do
    Rscript /workshop/software/FUSION/FUSION.assoc_test.R \
        --sumstats /workshop/data/FUSION/gwas/Mahajan.NatGen2022.DIAMANTE-EUR.ldsc_ss.FUSION_ref.tsv \
        --weights /workshop/data/FUSION/models/GTExv8_EUR_Pancreas/GTExv8.EUR.Pancreas.pos \
        --weights_dir /workshop/data/FUSION/models/GTExv8_EUR_Pancreas \
        --ref_ld_chr /workshop/data/FUSION/ld/1000G.EUR. \
        --chr $chrom \
        --out results/fusion_GTEx_pancreas_$chrom.tsv
        
    if (( $chrom > 2 )); then
        # Comment out the line below to run all the chromosomes;
        # otherwise the loop stops after chromosome 3.
        break
    fi
done

Analysis completed.
NOTE: 3 / 538 genes were skipped
Analysis completed.
NOTE: 0 / 420 genes were skipped


For convenience, you can combine the results into a single file.

In [15]:
# Merge the files (GTEx Pancreas).
# cp results/fusion_GTEx_pancreas_1.tsv results/fusion_GTEx_pancreas.tsv
# for chrom in $(seq 2 22); do
#     cat results/fusion_GTEx_pancreas_${chrom}.tsv | sed 1d >> results/fusion_GTEx_pancreas.tsv
# done
#
# Merge the files (YFS).
# cp results/fusion_YFS_1.tsv results/fusion_YFS.tsv
# for chrom in $(seq 2 22); do
#     cat results/fusion_YFS_${chrom}.tsv | sed 1d >> results/fusion_YFS.tsv
# done
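
If only some chromosomes were run, the merge commands above will fail on the missing per-chromosome files. The sketch below is one way to handle this under the same file naming scheme. The function name ``merge_fusion_results`` is hypothetical. It keeps the header of the first file it finds and skips chromosomes that have no output.

```shell
# Hypothetical helper: concatenate <prefix>_1.tsv ... <prefix>_22.tsv into one
# file with a single header. Missing chromosomes are skipped with a warning.
merge_fusion_results() {
    local prefix=$1 out=$2 chrom files=()
    for chrom in $(seq 1 22); do
        if [ -f "${prefix}_${chrom}.tsv" ]; then
            files+=("${prefix}_${chrom}.tsv")
        else
            echo "warning: ${prefix}_${chrom}.tsv not found, skipping" >&2
        fi
    done
    [ "${#files[@]}" -gt 0 ] || { echo "no input files for $prefix" >&2; return 1; }
    # Keep the very first line (the header), then only non-header lines.
    awk 'NR == 1 || FNR > 1' "${files[@]}" > "$out"
}
```

For example, ``merge_fusion_results results/fusion_YFS results/fusion_YFS.tsv`` would write the combined YFS file. Note that this overwrites any existing file of that name, including a checkpoint copy.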

Now that we have TWAS results, you can open a results file by double clicking on it from the file browser. You should see a TSV file with these columns:

In [8]:
head -n 1 /workshop/local/results/fusion_GTEx_pancreas_1.tsv | tr '\t' '\n' | nl

     1	PANEL
     2	FILE
     3	ID
     4	CHR
     5	P0
     6	P1
     7	HSQ
     8	BEST.GWAS.ID
     9	BEST.GWAS.Z
    10	EQTL.ID
    11	EQTL.R2
    12	EQTL.Z
    13	EQTL.GWAS.Z
    14	NSNP
    15	NWGT
    16	MODEL
    17	MODELCV.R2
    18	MODELCV.PV
    19	TWAS.Z
    20	TWAS.P


### Output Explanation

The key columns in each output file are:

- TWAS.Z, TWAS.P: Z-scores and P-values for imputed gene-trait associations
- MODELCV.R2, MODELCV.PV: Cross-validated predictive performance of the model
- BEST.GWAS.Z: Strongest GWAS SNP signal in the region
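
To make the link between the two TWAS columns concrete, here is the relationship under the assumption that the P-value is the usual two-sided test of the Z-score against a standard normal distribution. This notebook does not state this, but it is consistent with the values in the results files.

$$
\mathrm{TWAS.P} = 2\,\Phi\left(-\left|\mathrm{TWAS.Z}\right|\right)
$$

Here $\Phi$ is the standard normal cumulative distribution function. For example, a gene with TWAS.Z = -3.00972 gives $2\,\Phi(-3.00972) \approx 2.61 \times 10^{-3}$, which matches the TWAS.P reported for that gene in the GTEx pancreas results for chromosome 1.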

Here is a file excerpt:

In [7]:
awk '{print $1, $4, $5, $9, $10, $11, $19, $20}' /workshop/local/results/fusion_GTEx_pancreas_1.tsv | head

PANEL CHR P0 BEST.GWAS.Z EQTL.ID EQTL.R2 TWAS.Z TWAS.P
GTExv8.EUR.Pancreas 1 959308 3.180 rs3748595 0.107999 1.15827 2.47e-01
GTExv8.EUR.Pancreas 1 966496 3.180 rs604618 -0.004009 0.34623 7.29e-01
GTExv8.EUR.Pancreas 1 998050 3.180 rs3128117 0.391529 0.69014 4.90e-01
GTExv8.EUR.Pancreas 1 1063287 3.180 rs9442372 0.156418 -0.12335 9.02e-01
GTExv8.EUR.Pancreas 1 1116360 3.180 rs4275402 -0.001403 -0.91489 3.60e-01
GTExv8.EUR.Pancreas 1 1324690 3.180 rs2765021 0.100828 -3.00972 2.61e-03
GTExv8.EUR.Pancreas 1 1407312 3.180 rs35242196 0.152516 -0.73636 4.62e-01
GTExv8.EUR.Pancreas 1 1421768 3.180 rs12089560 0.120587 1.47945 1.39e-01
GTExv8.EUR.Pancreas 1 1422468 3.180 rs12089560 0.114774 0.98098 3.27e-01
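
Selecting columns by position, as above, is compact but easy to get wrong when counting fields. A possible alternative is to select them by header name. The helper ``select_columns`` below is a hypothetical sketch that assumes a tab-separated file with a header row, as in the FUSION output files here.

```shell
# Hypothetical helper: print the named columns of a tab-separated file.
# Usage: select_columns <file> <column name> [<column name> ...]
select_columns() {
    local file=$1; shift
    awk -F'\t' -v OFS='\t' -v wanted="$*" '
        NR == 1 {
            # Map each header name to its field index, then resolve the request.
            n = split(wanted, names, " ")
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    print "unknown column: " names[j] > "/dev/stderr"
                    exit 1
                }
                idx[j] = pos[names[j]]
            }
        }
        {
            # Applies to the header row too, so the output keeps its column names.
            line = $(idx[1])
            for (j = 2; j <= n; j++) line = line OFS $(idx[j])
            print line
        }
    ' "$file"
}

# Demo on the chromosome 1 results (skipped if the file is absent).
F=/workshop/local/results/fusion_GTEx_pancreas_1.tsv
if [ -f "$F" ]; then
    select_columns "$F" ID CHR P0 MODELCV.R2 TWAS.Z TWAS.P | head
fi
```

Requesting a column that is not in the header stops with an error message instead of silently printing the wrong field.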


In [19]:
wc -l ./results/fusion_GTEx_pancreas.tsv
wc -l ./results/fusion_YFS.tsv

5766 ./results/fusion_GTEx_pancreas.tsv
4629 ./results/fusion_YFS.tsv


<div class="alert alert-info">
Can you identify the genes in both tissues that reach a tissue-wide Bonferroni significance threshold of 0.05 / # of tested genes per tissue? How many are there?

<!--
Solution:
awk 'NR == 1 || $20 <= 0.05/5765 { print $3, $19, $20 }' results/fusion_GTEx_pancreas.tsv
awk 'NR == 1 || $20 <= 0.05/4628 { print $3, $19, $20 }' results/fusion_YFS.tsv

# 119 significant for GTEx pancreas
# 85 significant for YFS
//-->

</div>

<div class="alert alert-info">
How does MODELCV.R2 affect your confidence?    
</div>

<div class="alert alert-info">
Are some gene signals stronger in one tissue than another?    
</div>