# Workshop on Statistical Genetics and Genetic Epidemiology STAGE-Quebec
## Theme 2 - Molecular Phenotypes in Genetic Epidemiology

By Marc-André Legault (Université de Montréal) and Qihuang Zhang (McGill University)

**July 31 - August 1, 2025**

### Introduction

This notebook presents the second TWAS (Transcriptome-Wide Association Study) analysis using **FUSION**, an alternative approach to the S-PrediXcan method demonstrated previously. By applying FUSION to the same GWAS summary statistics, participants will gain experience with different TWAS methodologies and learn to compare their respective results.

**FUSION methodology**: FUSION employs a Bayesian approach to combine multiple prediction models and provides comprehensive quality metrics for each gene-trait association. This complementary analysis will enhance understanding of how different TWAS methodologies can yield varying insights from identical input data.

### Understanding the FUSION Data Structure

The analysis utilizes a structured data organization that facilitates FUSION analyses:

<pre>
/data/FUSION/
├── gwas/      # GWAS summary statistics (diabetes data from previous analysis)
├── ld/        # LD reference panels (1000 Genomes EUR population)
└── models/    # Pre-trained gene expression prediction models
    ├── GTExv8_EUR_Pancreas/    # Pancreas tissue models from GTEx
    └── YFS_BLOOD_RNAARR/       # Blood tissue models from Young Finns Study
</pre>

**Components:**
- **GWAS data**: Diabetes GWAS summary statistics analyzed previously with S-PrediXcan
- **LD reference**: Population-specific linkage disequilibrium patterns required for accurate statistical inference
- **Prediction models**: Pre-trained weights for estimating gene expression from genetic variants

Each `models/` directory contains:
- `*.wgt.RDat` files: Prediction weights for individual genes
- One `.pos` file: Master list of available genes with genomic coordinates

### Running FUSION: Initial Analysis with Blood Tissue Models

The first analysis employs gene expression models from **whole blood** tissue (Young Finns Study). Blood tissue provides broad coverage as it is readily accessible and extensively studied in genetic research.

**Computational note**: The analysis processes all 22 autosomes, with a break statement after chromosome 2 included for time efficiency. Complete genomic results are available for examination, and the full analysis can be executed by removing the break statement.

In [4]:
for chrom in $(seq 1 22); do
    Rscript /workshop/software/FUSION/FUSION.assoc_test.R \
        --sumstats /workshop/data/FUSION/gwas/Mahajan.NatGen2022.DIAMANTE-EUR.ldsc_ss.FUSION_ref.tsv \
        --weights /workshop/data/FUSION/models/YFS_BLOOD_RNAARR/YFS.BLOOD.RNAARR.pos \
        --weights_dir /workshop/data/FUSION/models/YFS_BLOOD_RNAARR/ \
        --ref_ld_chr /workshop/data/FUSION/ld/1000G.EUR. \
        --chr $chrom \
        --out results/fusion_YFS_$chrom.tsv
        
    if (( $chrom > 2 )); then
        # Comment the line below to run all the chromosomes,
        # otherwise only the first two will be used.
        break
    fi
done

Analysis completed.
NOTE: 2 / 476 genes were skipped
Analysis completed.
NOTE: 0 / 322 genes were skipped
Analysis completed.
NOTE: 0 / 293 genes were skipped


### Tissue Comparison: GTEx Pancreas Analysis

The subsequent analysis employs prediction models from **pancreas tissue** using GTEx data. This provides a direct comparison opportunity because:

1. **Identical underlying data**: The analysis uses the same GTEx pancreas expression data that supported the S-PrediXcan analysis
2. **Methodological comparison**: FUSION's approach enables direct comparison of methodological differences
3. **Biological relevance**: Pancreas tissue is highly relevant for diabetes research, potentially revealing tissue-specific associations

**Comparative analysis value**: By analyzing identical tissue data with different TWAS methods, researchers can understand how methodological choices influence results and develop skills for interpreting findings across different analytical frameworks.

In [None]:
for chrom in $(seq 1 22); do
    Rscript /workshop/software/FUSION/FUSION.assoc_test.R \
        --sumstats /workshop/data/FUSION/gwas/Mahajan.NatGen2022.DIAMANTE-EUR.ldsc_ss.FUSION_ref.tsv \
        --weights /workshop/data/FUSION/models/GTExv8_EUR_Pancreas/GTExv8.EUR.Pancreas.pos \
        --weights_dir /workshop/data/FUSION/models/GTExv8_EUR_Pancreas \
        --ref_ld_chr /workshop/data/FUSION/ld/1000G.EUR. \
        --chr $chrom \
        --out results/fusion_GTEx_pancreas_$chrom.tsv
        
    if (( $chrom > 2 )); then
        # Comment the line below to run all the chromosomes,
        # otherwise only the first two will be used.
        break
    fi
done

Analysis completed.
NOTE: 3 / 538 genes were skipped
Analysis completed.
NOTE: 0 / 420 genes were skipped


### Result Organization for Analysis

For comprehensive analysis, chromosome-specific results should be combined into single files. The following code demonstrates how to merge chromosome-specific outputs, though it is commented out for this workshop session which focuses on a subset of chromosomes.

In [15]:
# Merge the files (GTEx Pancreas).
# cp results/fusion_GTEx_pancreas_1.tsv results/fusion_GTEx_pancreas.tsv
# for chrom in $(seq 2 22); do
#     cat results/fusion_GTEx_pancreas_${chrom}.tsv | sed 1d >> results/fusion_GTEx_pancreas.tsv
# done
#
# Merge the files (YFS).
# cp results/fusion_YFS_1.tsv results/fusion_YFS.tsv
# for chrom in $(seq 2 22); do
#     cat results/fusion_YFS_${chrom}.tsv | sed 1d >> results/fusion_YFS.tsv
# done

### FUSION Output Structure: Result Interpretation

The FUSION analysis generates TWAS results in tab-separated values (TSV) format. Understanding the output structure is essential for downstream analysis and interpretation. The following command examines the column headers:

In [8]:
head -n 1 /workshop/local/results/fusion_GTEx_pancreas_1.tsv | tr '\t' '\n' | nl

     1	PANEL
     2	FILE
     3	ID
     4	CHR
     5	P0
     6	P1
     7	HSQ
     8	BEST.GWAS.ID
     9	BEST.GWAS.Z
    10	EQTL.ID
    11	EQTL.R2
    12	EQTL.Z
    13	EQTL.GWAS.Z
    14	NSNP
    15	NWGT
    16	MODEL
    17	MODELCV.R2
    18	MODELCV.PV
    19	TWAS.Z
    20	TWAS.P


### FUSION Output Interpretation: Key Metrics

Understanding FUSION's output columns is essential for proper result interpretation. The most important metrics include:

**Primary TWAS Results:**
- **TWAS.Z, TWAS.P**: Core gene-trait association statistics - Z-scores and P-values
- **ID**: Gene identifier (format varies by dataset - may be gene symbols or Ensembl IDs)

**Quality Assessment Metrics:**
- **MODELCV.R2**: Cross-validated R² indicating how well genetic variants predict gene expression (higher values indicate greater reliability)
- **MODELCV.PV**: P-value for prediction model performance
- **NSNP, NWGT**: Number of SNPs utilized and number with non-zero weights

**Additional Context:**
- **BEST.GWAS.Z**: Strongest individual GWAS signal in the gene region (for comparison with TWAS signal)
- **EQTL.R2**: Predictive capacity of the top eQTL for expression
- **HSQ**: Heritability estimate for gene expression

Example results demonstrating these metrics:

In [7]:
awk '{print $1, $4, $5, $9, $10, $11, $19, $20}' /workshop/local/results/fusion_GTEx_pancreas_1.tsv | head

PANEL CHR P0 BEST.GWAS.Z EQTL.ID EQTL.R2 TWAS.Z TWAS.P
GTExv8.EUR.Pancreas 1 959308 3.180 rs3748595 0.107999 1.15827 2.47e-01
GTExv8.EUR.Pancreas 1 966496 3.180 rs604618 -0.004009 0.34623 7.29e-01
GTExv8.EUR.Pancreas 1 998050 3.180 rs3128117 0.391529 0.69014 4.90e-01
GTExv8.EUR.Pancreas 1 1063287 3.180 rs9442372 0.156418 -0.12335 9.02e-01
GTExv8.EUR.Pancreas 1 1116360 3.180 rs4275402 -0.001403 -0.91489 3.60e-01
GTExv8.EUR.Pancreas 1 1324690 3.180 rs2765021 0.100828 -3.00972 2.61e-03
GTExv8.EUR.Pancreas 1 1407312 3.180 rs35242196 0.152516 -0.73636 4.62e-01
GTExv8.EUR.Pancreas 1 1421768 3.180 rs12089560 0.120587 1.47945 1.39e-01
GTExv8.EUR.Pancreas 1 1422468 3.180 rs12089560 0.114774 0.98098 3.27e-01


In [19]:
wc -l ./results/fusion_GTEx_pancreas.tsv
wc -l ./results/fusion_YFS.tsv

5766 ./results/fusion_GTEx_pancreas.tsv
4629 ./results/fusion_YFS.tsv


<div class="alert alert-info">
<h4>Analysis Exercise 1: Genome-wide Significance Assessment</h4>

<strong>Objective:</strong> Identify genes reaching genome-wide significance using Bonferroni correction in both tissue types.

<strong>Tasks:</strong>
1. Calculate the significance threshold: 0.05 divided by the number of tested genes per tissue
2. Quantify genes meeting this stringent criterion in each tissue
3. Evaluate implications for tissue-specific versus shared genetic effects

<strong>Method:</strong> Use the line counts shown above to determine the total number of tested genes, then apply appropriate filtering.

<!--
Solution:
awk 'NR == 1 || $20 <= 0.05/5765 { print $3, $19, $20 }' results/fusion_GTEx_pancreas.tsv
awk 'NR == 1 || $20 <= 0.05/4628 { print $3, $19, $20 }' results/fusion_YFS.tsv

# 119 significant for GTEx pancreas
# 85 significant for YFS
//-->

</div>

<div class="alert alert-info">
<h4>Analysis Exercise 2: Model Quality Assessment</h4>

<strong>Objective:</strong> Evaluate how MODELCV.R² should influence confidence in TWAS results.

<strong>Considerations:</strong>
- Interpretation of low R² values (e.g., 0.01) regarding TWAS signal reliability
- Assessment of highly significant TWAS.P values when MODELCV.R² is low
- Establishment of reasonable R² thresholds for result filtering
- Biological factors that may contribute to low prediction accuracy for certain genes

<strong>Investigation approach:</strong> Compare MODELCV.R² distributions across significant associations. Examine whether genes with stronger TWAS signals demonstrate superior prediction model performance.
</div>

<div class="alert alert-info">
<h4>Analysis Exercise 3: Tissue-Specific Association Patterns</h4>

<strong>Objective:</strong> Evaluate differential gene association patterns between blood and pancreas tissues and their implications for diabetes biology.

<strong>Investigation areas:**
1. **Shared associations**: Identify genes with significant associations in both tissues and assess effect direction consistency
2. **Tissue-specific associations**: Determine genes significant in only one tissue and evaluate potential explanatory factors
3. **Effect magnitude comparison**: For genes significant in both tissues, compare association strengths
4. **Biological interpretation**: Assess whether tissue-specific patterns align with established diabetes pathophysiology

<strong>Advanced analytical approaches:**
- Generate scatter plots comparing TWAS.Z scores between tissues
- Investigate whether pancreas-specific signals show enrichment for metabolic pathways
- Examine whether blood-specific signals may reflect inflammatory processes
- Assess systematic differences in prediction model quality (MODELCV.R²) between tissues

<strong>Integration note:** These findings will be essential for comparison with S-PrediXcan results in subsequent analyses.
</div>