# Data-Driven Covariance Estimation

This notebook shows how to use **data-driven** covariance matrices in pymash,
which often capture effect-sharing patterns better than canonical matrices.
Read the [Introduction notebook](01_introduction.ipynb) first.

The strategy is:

1. **Find strong signals** using a condition-by-condition analysis
2. **Estimate initial covariances** from those signals using PCA
3. **Refine** using Extreme Deconvolution (ED)
4. **Combine** canonical and data-driven covariances for the final fit

In [1]:
import numpy as np
import pymash as mash

In [2]:
sim = mash.simple_sims(nsamp=500, ncond=5, err_sd=1.0, seed=1)
data = mash.mash_set_data(sim["Bhat"], sim["Shat"])

## Step 1: Identify Strong Signals

Run a quick condition-by-condition analysis with `mash_1by1()` to identify
effects that are significant in at least one condition. These strong signals
will be used to learn data-driven covariance matrices.

In [3]:
m1 = mash.mash_1by1(data)
strong = mash.get_significant_results(m1, thresh=0.05)
print(f"Number of strong signals: {len(strong)}")

Number of strong signals: 1141
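
To make "significant in at least one condition" concrete, the following is a minimal sketch of a row-selection rule built on plain z-scores. It is a simplified stand-in and not a description of what `mash_1by1()` does internally (it may apply per-condition shrinkage rather than raw z-tests). The helper name `strong_rows_by_z` and the toy arrays are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def strong_rows_by_z(Bhat, Shat, thresh=0.05):
    """Hypothetical helper: indices of rows whose two-sided z-test
    p-value is below `thresh` in at least one condition."""
    z = np.asarray(Bhat) / np.asarray(Shat)
    # |z| above this critical value is equivalent to p < thresh
    z_crit = NormalDist().inv_cdf(1.0 - thresh / 2.0)
    return np.flatnonzero((np.abs(z) > z_crit).any(axis=1))

# Toy data (assumed): 3 effects x 2 conditions, unit standard errors
Bhat = np.array([[0.1, -0.3], [2.5, 0.2], [-0.4, -3.1]])
Shat = np.ones_like(Bhat)
idx = strong_rows_by_z(Bhat, Shat)
print(idx)
```

The result is an array of row indices, which is the kind of object the `subset=` argument in the next steps is given.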


## Step 2: PCA-Based Covariance Matrices

`cov_pca()` performs SVD on the strong signals to produce rank-1 covariance
matrices from the top principal components, plus a combined ("tPCA") matrix.

In [4]:
U_pca = mash.cov_pca(data, npc=5, subset=strong)
print(f"PCA covariances: {list(U_pca.keys())}")

PCA covariances: ['PCA_1', 'PCA_2', 'PCA_3', 'PCA_4', 'PCA_5', 'tPCA']
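
The construction described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not pymash's implementation: the helper name `pca_covs` is hypothetical, the input is not centered, and the scaling of each matrix by the squared singular value over the number of rows is one reasonable choice that `cov_pca()` may not share.

```python
import numpy as np

def pca_covs(X, npc):
    """Hypothetical helper: one rank-1 matrix per top right singular
    vector of X, plus their sum as a combined rank-`npc` matrix."""
    n = X.shape[0]
    _, d, Vt = np.linalg.svd(X, full_matrices=False)
    covs = {}
    for k in range(npc):
        v = Vt[k]  # k-th principal direction across conditions
        covs[f"PCA_{k + 1}"] = (d[k] ** 2 / n) * np.outer(v, v)
    # Combined matrix: sum of the rank-1 pieces
    covs["tPCA"] = sum(covs[f"PCA_{k + 1}"] for k in range(npc))
    return covs

# Toy input (assumed): 200 effects x 3 conditions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
covs = pca_covs(X, npc=3)
print(list(covs))
```

With this scaling, when `npc` equals the number of conditions the combined matrix equals `X.T @ X / n`, the uncentered sample second-moment matrix of the rows.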


## Step 3: Refine with Extreme Deconvolution

The PCA-based matrices capture patterns in the *observed* estimates (Bhat),
which include measurement error, but what we need are the covariances of the
*true underlying effects* (B). Extreme Deconvolution (ED) models that error
explicitly: it starts from the PCA matrices and refines them into estimates
of the effect covariances.

In [5]:
U_ed = mash.cov_ed(data, U_pca, subset=strong)
print(f"ED covariances: {list(U_ed.keys())}")

ED covariances: ['ED_PCA_1', 'ED_PCA_2', 'ED_PCA_3', 'ED_PCA_4', 'ED_PCA_5', 'ED_tPCA']
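
As a rough illustration of how ED separates effect covariance from measurement error, the sketch below runs the EM update for a deliberately reduced case: a single zero-mean Gaussian component and one known noise covariance `V` shared by all rows. The model is `bhat_j = b_j + e_j` with `b_j ~ N(0, U)` and `e_j ~ N(0, V)`. `cov_ed()` handles a mixture of several components, so this is only the core step; the helper name `ed_single` and the simulated data are hypothetical.

```python
import numpy as np

def ed_single(Bhat, V, U0, n_iter=200):
    """Hypothetical helper: EM for one zero-mean component with effect
    covariance U and a known noise covariance V shared by all rows."""
    U = U0.copy()
    n = Bhat.shape[0]
    for _ in range(n_iter):
        A = U @ np.linalg.inv(U + V)   # shrinkage operator U (U + V)^-1
        post_mean = Bhat @ A.T         # rows are E[b_j | bhat_j]
        post_cov = U - A @ U           # Cov[b_j | bhat_j], same for all j
        # M-step: average posterior second moment of the true effects
        U = post_mean.T @ post_mean / n + post_cov
    return U

# Simulated data (assumed): correlated true effects plus unit noise
rng = np.random.default_rng(0)
U_true = np.array([[2.0, 1.5], [1.5, 2.0]])
V = np.eye(2)
n = 5000
B = rng.multivariate_normal(np.zeros(2), U_true, size=n)
Bhat = B + rng.multivariate_normal(np.zeros(2), V, size=n)

U_hat = ed_single(Bhat, V, U0=np.eye(2))
print(np.round(U_hat, 2))
```

In this reduced setting the iteration settles at the sample second-moment matrix of `Bhat` minus `V`, that is, the observed covariance with the noise removed. The raw PCA matrices from Step 2 skip this subtraction.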


## Fit the Model with Data-Driven Covariances

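**Remember the Crucial Rule:** fit on *all* tests, not just the strong subset.
The strong subset is only used to learn the covariance matrices; fitting the
mixture weights on it alone would overstate how common non-null effects are.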

In [6]:
result_ed = mash.mash(data, Ulist=U_ed)
print(f"Log-likelihood (ED only): {result_ed.loglik:.2f}")

Log-likelihood (ED only): -15991.51


## Combine Canonical + Data-Driven

In practice, combining both canonical and data-driven covariances usually
gives the best fit. Just merge the two dictionaries.

In [7]:
U_c = mash.cov_canonical(data)
U_combined = {**U_c, **U_ed}
print(f"Total covariance matrices: {len(U_combined)}")

Total covariance matrices: 16
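
One detail of merging with `{**a, **b}` is worth checking: if the two dictionaries share a key, the value from the second silently replaces the first. The sketch below uses placeholder values and made-up key names (not the actual pymash keys) to show the behavior and a quick collision check.

```python
# Placeholder dictionaries (assumed names and values, not pymash output)
U_a = {"identity": 1, "equal_effects": 2}
U_b = {"ED_tPCA": 3, "identity": 99}  # deliberately reuses a key

merged = {**U_a, **U_b}
clashes = set(U_a) & set(U_b)

print(len(merged))         # fewer entries than len(U_a) + len(U_b)
print(merged["identity"])  # value taken from U_b
print(clashes)
```

If `len(U_combined)` equals `len(U_c) + len(U_ed)`, no keys collided and every matrix is kept.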


In [8]:
result_combined = mash.mash(data, Ulist=U_combined)
print(f"Log-likelihood (combined):  {result_combined.loglik:.2f}")
print(f"Log-likelihood (ED only):   {result_ed.loglik:.2f}")

Log-likelihood (combined):  -15986.58
Log-likelihood (ED only):   -15991.51


## Compare Model Fits

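A higher log-likelihood indicates a better fit, although the combined model
includes every matrix from the other two, so it is expected to fit at least
about as well. Since these simulated data use canonical covariance patterns,
the canonical matrices still help here: adding them to the ED set raises the
log-likelihood from -15991.51 to -15986.58, even though ED-only already beats
canonical-only. In real data with complex sharing patterns, data-driven
matrices often dominate.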

In [9]:
# Also fit canonical-only for comparison
result_canonical = mash.mash(data, Ulist=U_c)

print(f"Log-likelihood (canonical):  {result_canonical.loglik:.2f}")
print(f"Log-likelihood (ED only):    {result_ed.loglik:.2f}")
print(f"Log-likelihood (combined):   {result_combined.loglik:.2f}")

Log-likelihood (canonical):  -16007.26
Log-likelihood (ED only):    -15991.51
Log-likelihood (combined):   -15986.58
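
The gaps between the three fits are easier to read as differences. The short sketch below hard-codes the values printed above (so it does not depend on the fitted objects) and reports each model's shortfall relative to the best one.

```python
# Log-likelihoods copied from the printed output above
loglik = {"canonical": -16007.26, "ED only": -15991.51, "combined": -15986.58}

best = max(loglik, key=loglik.get)
gaps = {name: loglik[best] - ll for name, ll in loglik.items()}

# Print models from best to worst with their gap to the best fit
for name in sorted(loglik, key=loglik.get, reverse=True):
    print(f"{name:<10} {loglik[name]:>10.2f}  gap to best: {gaps[name]:6.2f}")
```

These are differences in log-likelihood on the same data. The models differ in how many covariance matrices they include, so the gaps describe fit and are not by themselves a formal test.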
