# Workshop on Statistical Genetics and Genetic Epidemiology STAGE-Quebec
## Theme 2 - Molecular Phenotypes in Genetic Epidemiology

By Marc-André Legault (Université de Montréal) and Qihuang Zhang (McGill University)

**July 31 - August 1, 2025**

### Introduction

This workshop provides a comprehensive introduction to transcriptome-wide association studies (TWAS). We will demonstrate how to identify candidate type 2 diabetes (T2D) genes by testing the association between predicted gene expression across various tissues and T2D risk.

The analysis employs real-world data from a large-scale meta-analysis involving 180,834 T2D cases and over one million controls from [Mahajan _et al._ (2022)](https://www.nature.com/articles/s41588-022-01058-3). These are high-quality summary statistics that represent current standards in genetic epidemiology research.

This workshop will cover two established TWAS approaches:
- **[S-PrediXcan](https://www.nature.com/articles/s41467-018-03621-1)**
- **[TWAS FUSION](https://www.nature.com/articles/ng.3506)**

The GWAS summary statistics have been harmonized to match the genome build (GRCh38) used by the gene expression prediction models. We have curated a focused set of tissues from GTEx to balance comprehensiveness with computational efficiency.

### Workshop Structure

The workshop is organized into three interconnected sections:

1. **S-PrediXcan** (this notebook) - Fundamentals of summary statistics-based TWAS using S-PrediXcan.
2. **TWAS / FUSION** [(link)](2a-FUSION.ipynb) - A second software package for conducting TWAS.
3. **Comparing Methods** [(link)](3a-compare_TWAS_models.ipynb) - Comparative analysis of the results.

**Navigation:** Use the file browser in the left pane to explore all workshop materials and switch between notebooks.

<div class="alert alert-warning">
    <p>
    <strong>Important:</strong> The <strong>local</strong> folder is shared with your computer—everything else will disappear when you close the notebook server. Save your important work and results in the local folder.
    </p>
    <p>
    The "local" folder in this environment can be found on your computer, in your home directory:<br/>
    • Windows: <code>C:\Users\[your username]\STAGE2025_workshop_theme2</code><br/>
    • macOS: <code>/Users/[your username]/STAGE2025_workshop_theme2</code><br/>
    • Linux: <code>/home/[your username]/STAGE2025_workshop_theme2</code>
    </p>
</div>

We will begin by exploring S-PrediXcan and applying it to our prepared GWAS data to investigate the genetic architecture of type 2 diabetes.

### Exploring the GWAS Summary Statistics

Before running S-PrediXcan, we will examine our input data. Understanding the structure and content of GWAS summary statistics is essential for any TWAS analysis.

Our GWAS summary statistics have been harmonized with [PredictDB](https://predictdb.hakyimlab.org/) models—the standard gene expression prediction models that accompany S-PrediXcan. This harmonization ensures that variant names, allele coding, and genomic coordinates are aligned between our GWAS data and the prediction models.

**Note:** This harmonization step is critical because inconsistencies in variant naming or allele coding can lead to incorrect results or missed associations.
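
To make the allele-coding issue concrete, here is a minimal sketch of one possible alignment rule. It is not the procedure that was used to prepare the workshop files. It assumes a toy table (hypothetical positions and effect sizes) in which each GWAS variant has already been paired with the reference and effect alleles of a prediction model, it ignores strand flips, and it builds variant IDs in the same pattern as PredictDB IDs such as `chr15_90547112_T_C_b38`.

```shell
# Toy input: one line per variant (hypothetical values, not workshop data).
# Columns: chromosome, position, GWAS effect allele, GWAS other allele,
#          GWAS beta, model reference allele, model effect allele.
harmonized=$(awk '
    NR == 1 { next }                                   # skip the header
    {
        id = $1 "_" $2 "_" $6 "_" $7 "_b38"            # model-style variant ID
        if ($3 == $7 && $4 == $6)                      # same effect allele
            printf "%s %.4f\n", id, $5
        else if ($3 == $6 && $4 == $7)                 # alleles swapped
            printf "%s %.4f\n", id, -$5                # so flip the sign
        # any other allele combination is dropped
    }
' <<'EOF'
chromosome position effect other beta model_ref model_eff
chr1 1000 G A 0.0200 A G
chr1 2000 T C 0.0100 T C
chr1 3000 A C 0.0300 A G
EOF
)
echo "$harmonized"
```

The first variant is kept as is, the second has its beta negated because the GWAS effect allele is the model's reference allele, and the third is dropped because its alleles do not match the model at all.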

Let's examine the data structure, which follows a consistent format across all tissues:

In [4]:
filename="/workshop/data/PrediXcan/gwas_harmonized/harmonized_en_Whole_Blood.tsv.gz"
echo "Looking at file $filename"
echo "Column numbers and titles:"
# Print the header line with one numbered column name per line.
zcat "$filename" | head -n 1 | tr '\t' '\n' | nl
echo
echo "File excerpt:"
# Show the header and the first nine variants, keeping columns 1-6 and 9.
zcat "$filename" | head -n 10 | cut -f 1-6,9

Looking at file /workshop/data/PrediXcan/gwas_harmonized/harmonized_en_Whole_Blood.tsv.gz
Column numbers and titles:
     1	chromosome
     2	base_pair_location
     3	effect_allele
     4	other_allele
     5	beta
     6	standard_error
     7	effect_allele_frequency
     8	neg_log_10_p_value
     9	rsid_x
    10	variant_id
    11	predictdb_id
    12	rsid_y

File excerpt:
chromosome	base_pair_location	effect_allele	other_allele	beta	standard_error	rsid_x
chr1	785910	C	G	-0.0176	0.0262	rs12565286
chr1	817186	A	G	0.0048	0.0098	rs3094315
chr1	817341	G	A	0.0045	0.0097	rs3131972
chr1	825532	T	C	0.0069	0.0099	rs1048488
chr1	825767	C	T	0.0066	0.0098	rs3115850
chr1	833068	A	G	0.0	0.0117	rs12562034
chr1	841742	T	A	0.0078	0.0101	rs2980319
chr1	843942	G	A	-0.007	0.0101	rs4040617
chr1	849670	A	G	0.0063	0.0098	rs2905062


<div class="alert alert-info">
    <strong>Exercise:</strong> Take a moment to interpret the displayed columns. What does each column represent, and why might each be important for a TWAS analysis?
</div>

Enter your answer here
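
As a complement to the exercise, note that the `beta` and `standard_error` columns combine into a per-variant z-score (`beta / standard_error`), which is the quantity that summary-statistics-based TWAS methods such as S-PrediXcan work from. A minimal sketch using three rows copied from the excerpt above:

```shell
# Rows copied from the excerpt above: rsID, beta, standard error.
zscores=$(awk 'NR > 1 { printf "%s %.3f\n", $1, $2 / $3 }' <<'EOF'
rsid_x beta standard_error
rs12565286 -0.0176 0.0262
rs3094315 0.0048 0.0098
rs1048488 0.0069 0.0099
EOF
)
echo "$zscores"
```

None of these three variants has an absolute z-score above 1, so individually they carry little evidence of association with T2D.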

### Gene Expression Prediction Models: The Foundation of TWAS

Next, we look at how gene expression is predicted from genetic variants. The gene expression prediction models are the cornerstone of any TWAS analysis. We will use the models for the GTEx whole blood tissue (`Whole_Blood`) as our example.

**Definition:** These models are mathematical functions that predict gene expression levels based on genotype data. They represent the relationship: "Given a person's genotype at specific SNPs, what would their gene expression likely be in this tissue?" These models were trained using data from the GTEx consortium, where researchers measured both genotypes and gene expression in the same individuals.

PredictDB distributes these models as SQLite databases. Each database contains two tables that work together: `extra` and `weights`. We start with `extra`:

In [5]:
# Schema for the "extra" table:
sqlite3 /workshop/data/PrediXcan/predictdb/en_Whole_Blood.db ".schema extra"

CREATE TABLE `extra` (
  `gene` TEXT,
  `genename` TEXT,
  `gene_type` TEXT,
  `alpha` REAL,
  `n_snps_in_window` INTEGER,
  `n.snps.in.model` INTEGER,
  `test_R2_avg` REAL,
  `test_R2_sd` REAL,
  `cv_R2_avg` REAL,
  `cv_R2_sd` REAL,
  `in_sample_R2` REAL,
  `nested_cv_fisher_pval` REAL,
  `nested_cv_converged` INTEGER,
  `rho_avg` REAL,
  `rho_se` REAL,
  `rho_zscore` REAL,
  `pred.perf.R2` REAL,
  `pred.perf.pval` REAL,
  `pred.perf.qval` REAL,
  `phi` REAL
);
CREATE INDEX metab_model_summary ON extra (gene);


The `extra` table contains model quality control information. For every gene, it includes:
- **Gene identifiers** (Ensembl ID and gene name)
- **Model composition** (how many SNPs are included)  
- **Performance metrics** (how well the model predicts expression)

This metadata is essential for interpreting TWAS results appropriately.

<div class="alert alert-info">
    <strong>Question:</strong> Why is the <code>pred.perf.pval</code> column important? What does it tell us about a gene's suitability for TWAS analysis? Consider what happens if we cannot reliably predict a gene's expression from genotype data.
</div>

Enter your answer here

The `weights` table contains the actual prediction parameters—the mathematical components of each gene expression model. For every gene (identified by its Ensembl ID), this table lists:
- **Which genetic variants** (eQTLs) influence expression
- **The magnitude of influence** each variant has (the prediction weight)

Let's examine this with an example. We will look at the model for **ENSG00000157823.16**, the Ensembl ID of *AP3S2* (Adaptor Related Protein Complex 3 Subunit Sigma 2). To focus on the most important contributors, we sort variants by absolute weight and show at most the 10 largest (this model contains only eight variants, so eight rows are returned):

In [6]:
sqlite3 /workshop/data/PrediXcan/predictdb/en_Whole_Blood.db <<EOF
.mode columns
.header on
select * from weights
  where gene='ENSG00000157823.16'
  order by abs(weight) desc limit 10;
EOF

gene                rsid        varID                   ref_allele  eff_allele  weight             
------------------  ----------  ----------------------  ----------  ----------  -------------------
ENSG00000157823.16  rs4363847   chr15_90547112_T_C_b38  T           C           0.128329668848438  
ENSG00000157823.16  rs12898828  chr15_89817721_A_T_b38  A           T           -0.109067828981916 
ENSG00000157823.16  rs11073814  chr15_88835653_A_T_b38  A           T           0.0524825659916266 
ENSG00000157823.16  rs3853641   chr15_89900547_G_C_b38  G           C           -0.0467714907373544
ENSG00000157823.16  rs16943673  chr15_89864201_T_C_b38  T           C           -0.0326710086433488
ENSG00000157823.16  rs10520684  chr15_89881770_C_A_b38  C           A           -0.0295110487238457
ENSG00000157823.16  rs1879530   chr15_88902553_G_A_b38  G           A           0.0263159897074107 
ENSG00000157823.16  rs2165069   chr15_89863887_C_T_b38  C           T           -0.0251527944185157


<div class="alert alert-info">
    <strong>Interpreting the weights:</strong> Notice that some weights are positive while others are negative. What does this tell us about the relationship between these variants and gene expression? What statistical methods could be used to estimate these prediction weights?
</div>

Enter your answer here
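
To make the weights table concrete: at the individual level, predicted expression is a weighted sum of effect-allele dosages. The sketch below uses four weights from the table above, rounded to four decimals, together with made-up dosages (0, 1 or 2 copies of the effect allele) for a hypothetical individual. It ignores any intercept or expression normalization, so the result is only meaningful relative to other individuals.

```shell
# Columns: rsID, prediction weight (from the table above), effect-allele
# dosage for one hypothetical individual (assumed values).
predicted=$(awk '
    NR == 1 { next }                 # skip the header
    { total += $2 * $3 }             # weight x dosage
    END { printf "%.4f", total }
' <<'EOF'
rsid weight dosage
rs4363847 0.1283 2
rs12898828 -0.1091 1
rs11073814 0.0525 0
rs3853641 -0.0468 1
EOF
)
echo "Predicted expression (relative units): $predicted"
```

S-PrediXcan never needs individual-level dosages like these; it reaches an equivalent gene-level test from GWAS summary statistics and the same weights.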

### Running Your First TWAS Analysis with S-PrediXcan

We are now ready to conduct our TWAS analysis using S-PrediXcan. This analysis will combine our GWAS summary statistics with the gene expression prediction models to identify genes whose predicted expression levels are associated with type 2 diabetes risk.

**Note:** The analysis below may take several minutes to complete. If it runs too long, you can stop it and use the pre-computed checkpoint files we have provided.

### Understanding the S-PrediXcan Command

Each input parameter serves a specific purpose:

**Core Input Files:**
- **`model_db_path`**: The SQLite database containing gene expression prediction weights
- **`covariance`**: Linkage disequilibrium (LD) estimates between SNPs—essential for accounting for correlation between nearby variants
- **`gwas_file`**: Our harmonized GWAS summary statistics
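
These three inputs meet in the S-PrediXcan test statistic. Following the S-PrediXcan paper linked in the introduction, the gene-level z-score is approximately

$$Z_g \approx \sum_{l \in \text{model}_g} w_{lg}\,\frac{\hat\sigma_l}{\hat\sigma_g}\,\frac{\hat\beta_l}{\text{se}(\hat\beta_l)}, \qquad \hat\sigma_g^2 = \sum_{l}\sum_{k} w_{lg}\,w_{kg}\,\hat\Gamma_{lk},$$

where the weights $w_{lg}$ come from the model database, the SNP covariance matrix $\hat\Gamma$ (with $\hat\sigma_l^2 = \hat\Gamma_{ll}$) comes from the covariance file, and $\hat\beta_l / \text{se}(\hat\beta_l)$ comes from the GWAS file. The sketch below evaluates this formula for a two-SNP model with toy values chosen for illustration only; it leaves out the polygenicity calibration.

```shell
z_gene=$(awk 'BEGIN {
    w1 = 0.2;   w2 = 0.1      # prediction weights (toy values)
    var1 = 0.5; var2 = 0.5    # genotype variances of the two SNPs
    cov12 = 0.25              # covariance between the two SNPs (LD)
    z1 = 4.0;   z2 = 2.0      # GWAS z-scores, beta / standard_error

    # Variance of predicted expression: w1^2*var1 + w2^2*var2 + 2*w1*w2*cov12
    var_gene = w1*w1*var1 + w2*w2*var2 + 2*w1*w2*cov12
    numerator = w1*sqrt(var1)*z1 + w2*sqrt(var2)*z2
    printf "%.3f", numerator / sqrt(var_gene)
}')
echo "Gene-level z-score: $z_gene"
```

Setting `cov12` to 0 in this toy example changes the denominator and therefore the z-score, which is why an LD reference is required even though only summary statistics are used.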

**Column Mapping Parameters:**
The `*_column` parameters tell S-PrediXcan how to interpret our GWAS file format. This flexibility allows the software to work with data from different sources and formats.

**Polygenicity Calibration Parameters:**
- **SNP heritability (`gwas_h2 = 0.15`)**: The proportion of trait variance explained by all SNPs
- **Sample size (`gwas_N = 1000000`)**: The effective sample size of our GWAS

**Rationale:** Polygenic traits like T2D have thousands of causal variants with small effects distributed across the genome. Many of these variants are in linkage disequilibrium with our eQTLs, potentially creating spurious associations. By providing heritability and sample size estimates, we enable S-PrediXcan's recalibration method ([Liang _et al._ 2024](https://www.biorxiv.org/content/10.1101/2023.10.17.562831v2)) to account for this "polygenic background" and provide more accurate results.
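
Because each tissue needs its own model database, covariance file and harmonized GWAS file, it can be worth confirming that all three exist before starting a long run. The helper below is a hypothetical convenience function, not part of S-PrediXcan; it assumes the directory layout used in this notebook.

```shell
# Report any of the three per-tissue inputs that is missing under data_dir.
# Returns 0 when all inputs are present and 1 otherwise.
check_inputs() {
    local data_dir=$1 tissue=$2 missing=0 path
    for path in \
        "$data_dir/predictdb/en_${tissue}.db" \
        "$data_dir/predictdb/en_${tissue}.txt.gz" \
        "$data_dir/gwas_harmonized/harmonized_en_${tissue}.tsv.gz"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    return "$missing"
}

for tissue in Liver Pancreas Whole_Blood; do
    check_inputs /workshop/data/PrediXcan "$tissue" || echo "-> ${tissue} is not ready"
done
```

If nothing is printed, the inputs for the listed tissues are in place.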

The analysis will be run across seven selected tissues:

In [None]:
# GTEx tissues to analyze, named as in the PredictDB file names.
tissues="Adipose_Subcutaneous Artery_Coronary Brain_Cortex Liver Muscle_Skeletal Pancreas Whole_Blood"

# $tissues is deliberately left unquoted so that it splits into one word per tissue.
for tissue in $tissues; do
    echo
    echo
    echo "Running S-PrediXcan for tissue ${tissue}"
    echo
    # Combine the prediction weights, the SNP covariance (LD) file and the
    # harmonized GWAS summary statistics for this tissue.
    SPrediXcan.py \
        --model_db_path "/workshop/data/PrediXcan/predictdb/en_${tissue}.db" \
        --covariance "/workshop/data/PrediXcan/predictdb/en_${tissue}.txt.gz" \
        --gwas_file "/workshop/data/PrediXcan/gwas_harmonized/harmonized_en_${tissue}.tsv.gz" \
        --snp_column rsid_x \
        --effect_allele_column effect_allele \
        --non_effect_allele_column other_allele \
        --output_file "/workshop/local/results/twas_en_${tissue}.csv" \
        --beta_column beta \
        --se_column standard_error \
        --gwas_h2 0.15 \
        --gwas_N 1000000

    # Uncomment this line to only run a single tissue.
    # break
done

### Analysis Complete

You have now run S-PrediXcan across multiple tissues and generated TWAS results for type 2 diabetes. Each output file contains gene-level association statistics for every gene tested in that tissue; these quantify how strongly each gene's predicted expression is associated with T2D risk.

<div class="alert alert-info">
    <strong>Examine your results:</strong> Open one of the results files in the <code>results</code> directory. Look for columns like gene names, association p-values, and effect sizes. Which genes show the strongest associations? Do you recognize any as known diabetes genes?
</div>

Enter your answer here
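
When scanning a results file, keep in mind that thousands of genes are tested, so a raw p-value below 0.05 is not enough. As a rough first pass, the sketch below applies a Bonferroni threshold (0.05 divided by the number of genes tested) to a toy results table. The gene names and values are hypothetical, and the column names `gene_name`, `zscore` and `pvalue` are assumptions to check against the header of your own output files.

```shell
# Toy results (hypothetical genes): gene name, z-score, p-value.
significant=$(awk -F, '
    NR > 1 { n++; name[n] = $1; p[n] = $3 }      # store every tested gene
    END {
        threshold = 0.05 / n                     # Bonferroni threshold
        for (i = 1; i <= n; i++)
            if (p[i] + 0 < threshold) print name[i]
    }
' <<'EOF'
gene_name,zscore,pvalue
GENE_A,5.20,2.0e-07
GENE_B,-1.10,2.7e-01
GENE_C,4.90,9.6e-07
GENE_D,0.30,7.6e-01
EOF
)
echo "$significant"
```

With four tests the threshold is 0.0125, so only `GENE_A` and `GENE_C` pass. Bonferroni is one common and conservative choice among several multiple testing corrections.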

### Next Steps

You have established a foundation in TWAS methodology. The next phase involves **interpreting** these results. In our [next notebook](1b-S-PrediXcan-interpretation.ipynb), we will use the R programming environment to:

- **Apply** multiple testing corrections appropriately
- **Visualize** association patterns across tissues
- **Identify** the most promising candidate genes  
- **Understand** the biological significance of our findings

This interpretation phase transforms statistical genetics results into biological insights.