# Scaling to Real eQTL Data

In real-world applications (e.g., eQTL studies with millions of tests across
dozens of tissues), it is impractical to load and fit all tests at once.
This notebook demonstrates the recommended workflow using two subsets:

- **Strong signals**: top eQTLs per gene, used for learning covariance patterns
- **Random subset**: an unbiased sample of all tests, used for estimating null
  correlations and fitting the model

In [Urbut et al. 2019](https://doi.org/10.1038/s41588-018-0268-8), the strong
set contained ~16k tests and the random set ~20k tests.
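
With real data, one way to build the strong set is to keep, for each gene, the test with the largest absolute z-score in any condition. The sketch below illustrates that selection with NumPy and pandas only. The helper name `top_test_per_gene` and the `gene_ids` input are hypothetical and are not part of pymash.

```python
import numpy as np
import pandas as pd


def top_test_per_gene(bhat, shat, gene_ids):
    """Return sorted row indices of the strongest test for each gene.

    "Strongest" here means the largest max |z| across conditions,
    which is an assumption made for this illustration.
    """
    z = np.asarray(bhat, dtype=float) / np.asarray(shat, dtype=float)
    max_abs_z = pd.Series(np.abs(z).max(axis=1))
    # With the default RangeIndex, idxmax returns the row position
    # of the top-scoring test within each gene group.
    top = max_abs_z.groupby(np.asarray(gene_ids)).idxmax()
    return np.sort(top.to_numpy())


# Toy input: four tests, two conditions, two genes (hypothetical values)
toy_bhat = np.array([[0.1, 0.2], [2.0, -3.0], [0.5, 0.4], [1.0, 0.1]])
toy_shat = np.ones_like(toy_bhat)
toy_genes = ["g1", "g1", "g2", "g2"]

toy_strong_idx = top_test_per_gene(toy_bhat, toy_shat, toy_genes)
print(toy_strong_idx)
```

The returned indices can be used to slice `Bhat` and `Shat` in the same way `strong_idx` is used later in this notebook.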

## Strategy

1. Estimate null correlations from the random subset
2. Learn data-driven covariances from the strong subset
3. Fit the model (mixture proportions) on the random subset
4. Compute posteriors for any subset using the learned model

In [1]:
import numpy as np
import pymash as mash
from pymash.correlation import estimate_null_correlation_simple

## Simulate Data

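We simulate a larger dataset to mimic a real scenario. With `nsamp=10000`, the
call below returns 40,000 tests, consistent with `simple_sims` generating
`nsamp` tests for each of four effect types, as in mashr.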

In [2]:
sim = mash.simple_sims(nsamp=10000, ncond=5, err_sd=1.0, seed=1)
print(f"Total tests: {sim['Bhat'].shape[0]}")

Total tests: 40000


### Identify Strong and Random Subsets

In [3]:
# Find strong signals using a quick 1-by-1 analysis
full_data = mash.mash_set_data(sim["Bhat"], sim["Shat"])
m1 = mash.mash_1by1(full_data)
strong_idx = mash.get_significant_results(m1, thresh=0.05)
print(f"Strong signals: {len(strong_idx)}")

# Select a random subset of 5000 tests
rng = np.random.default_rng(42)
random_idx = rng.choice(sim["Bhat"].shape[0], size=5000, replace=False)
print(f"Random subset: {len(random_idx)}")

Strong signals: 23176
Random subset: 5000


## Step 1: Estimate Null Correlations

Estimate the residual correlation structure among conditions from the random
subset. This captures correlation in the measurements that arises from
confounders such as shared samples, not from true effects.

**Use the random subset** (not the strong subset, which may lack null tests).
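
For intuition, the "simple" estimator as described for mashr keeps tests whose largest absolute z-score falls below a threshold and takes the correlation of their z-scores across conditions. The NumPy sketch below follows that description. The function name `null_correlation_sketch` and the default `z_thresh=2.0` are assumptions for illustration, and `estimate_null_correlation_simple` in pymash may differ in its details.

```python
import numpy as np


def null_correlation_sketch(bhat, shat, z_thresh=2.0):
    """Correlation of z-scores among tests that look null in every condition."""
    z = np.asarray(bhat, dtype=float) / np.asarray(shat, dtype=float)
    # A test is "null-looking" if no condition has |z| at or above the threshold.
    nullish = np.abs(z).max(axis=1) < z_thresh
    if nullish.sum() < z.shape[1]:
        raise ValueError("too few null-looking tests to estimate correlations")
    # rowvar=False: columns (conditions) are the variables.
    return np.corrcoef(z[nullish], rowvar=False)


# Toy check on independent noise: off-diagonal entries should be near zero
toy_rng = np.random.default_rng(0)
toy_z = toy_rng.standard_normal((2000, 3))
toy_V = null_correlation_sketch(toy_z, np.ones_like(toy_z))
print(np.round(toy_V, 2))
```

This also shows why the strong subset is a poor choice here: if few rows pass the threshold, there is little or nothing left to estimate the correlation from.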

In [4]:
data_temp = mash.mash_set_data(
    sim["Bhat"][random_idx], sim["Shat"][random_idx]
)
Vhat = estimate_null_correlation_simple(data_temp)
print("Estimated null correlation matrix:")
print(np.array2string(Vhat, precision=3))

Estimated null correlation matrix:
[[ 1.000e+00  8.399e-02  4.904e-02 -6.857e-04  3.717e-02]
 [ 8.399e-02  1.000e+00  6.944e-02  5.545e-02  5.890e-02]
 [ 4.904e-02  6.944e-02  1.000e+00  1.988e-02  4.224e-02]
 [-6.857e-04  5.545e-02  1.988e-02  1.000e+00  5.643e-02]
 [ 3.717e-02  5.890e-02  4.224e-02  5.643e-02  1.000e+00]]


### Create Data Objects with Estimated Correlations

In [5]:
data_random = mash.mash_set_data(
    sim["Bhat"][random_idx], sim["Shat"][random_idx], V=Vhat
)
data_strong = mash.mash_set_data(
    sim["Bhat"][strong_idx], sim["Shat"][strong_idx], V=Vhat
)

## Step 2: Learn Data-Driven Covariances from Strong Signals

In [6]:
U_pca = mash.cov_pca(data_strong, npc=5)
U_ed = mash.cov_ed(data_strong, U_pca)
print(f"Data-driven covariances: {list(U_ed.keys())}")

Data-driven covariances: ['ED_PCA_1', 'ED_PCA_2', 'ED_PCA_3', 'ED_PCA_4', 'ED_PCA_5', 'ED_tPCA']


## Step 3: Fit the Model on the Random Subset

Fit using both canonical and data-driven covariances on the **random** subset.
Use `outputlevel=1` to skip computing posteriors. This step only needs the
mixture proportions, so skipping the posteriors saves time.

In [7]:
U_c = mash.cov_canonical(data_random)
U_all = {**U_ed, **U_c}

m_fit = mash.mash(data_random, Ulist=U_all, outputlevel=1)
print(f"Log-likelihood: {m_fit.loglik:.2f}")

Log-likelihood: -40027.60


## Step 4: Compute Posteriors on the Strong Subset

Apply the learned model (mixture proportions) to the strong signals.
Pass `g=m_fit.fitted_g` with `fixg=True` to reuse the fitted mixture
proportions instead of re-estimating them.

In [8]:
m_strong = mash.mash(data_strong, g=m_fit.fitted_g, fixg=True)

sig = mash.get_significant_results(m_strong, thresh=0.05)
print(f"Significant effects in strong set: {len(sig)}")
print("\nlfsr (first 5 effects):")
print(mash.get_lfsr(m_strong)[:5].round(3))

Significant effects in strong set: 2241

lfsr (first 5 effects):
[[0.    0.    0.    0.    0.001]
 [0.    0.    0.    0.    0.   ]
 [0.    0.554 0.596 0.561 0.306]
 [0.    0.653 0.607 0.701 0.591]
 [0.562 0.688 0.    0.771 0.67 ]]
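
To make the threshold concrete, the sketch below selects rows of an lfsr matrix where at least one condition falls below `thresh`, ordered from most to least significant. This mirrors how mashr describes `get_significant_results`. Whether pymash orders its output the same way is an assumption, and the helper `significant_rows` and the toy values are hypothetical.

```python
import numpy as np


def significant_rows(lfsr, thresh=0.05):
    """Indices of rows whose smallest lfsr across conditions is below thresh."""
    lfsr = np.asarray(lfsr, dtype=float)
    top = lfsr.min(axis=1)              # best (smallest) lfsr per test
    idx = np.flatnonzero(top < thresh)  # tests significant in any condition
    # Order the selected tests from most to least significant.
    return idx[np.argsort(top[idx], kind="stable")]


# Toy lfsr matrix: three tests, two conditions (hypothetical values)
toy_lfsr = np.array([[0.001, 0.60],
                     [0.400, 0.70],
                     [0.200, 0.03]])

toy_sig = significant_rows(toy_lfsr)
print(toy_sig)

# Number of significant tests in each condition separately
toy_per_condition = (toy_lfsr < 0.05).sum(axis=0)
print(toy_per_condition)
```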


## Loading Real Data from CSV

In practice, you would load your Bhat and Shat matrices from files.
For example, starting from CSV files where rows are tests and columns
are conditions:

```python
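import pandas as pd
import pymash as mash

# First column holds the test identifiers; remaining columns are conditions
bhat_df = pd.read_csv("bhat.csv", index_col=0)
shat_df = pd.read_csv("shat.csv", index_col=0)

# Pass plain NumPy arrays (tests x conditions) to mash
data = mash.mash_set_data(bhat_df.to_numpy(), shat_df.to_numpy())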
```

The key requirement is that `Bhat` and `Shat` are aligned matrices with
the same shape: J tests (rows) by R conditions (columns).
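
Because a mismatch in row or column order would silently pair effects with the wrong standard errors, it can help to check alignment before building the data object. The sketch below is one possible set of checks using pandas and NumPy only. The helper name `check_aligned` and the specific checks are assumptions for illustration, not requirements stated by pymash.

```python
import numpy as np
import pandas as pd


def check_aligned(bhat_df, shat_df):
    """Raise ValueError if the two tables are not aligned J x R matrices."""
    if bhat_df.shape != shat_df.shape:
        raise ValueError("Bhat and Shat have different shapes")
    if not bhat_df.index.equals(shat_df.index):
        raise ValueError("row labels (tests) differ or are in a different order")
    if not bhat_df.columns.equals(shat_df.columns):
        raise ValueError("column labels (conditions) differ or are in a different order")
    bhat = bhat_df.to_numpy(dtype=float)
    shat = shat_df.to_numpy(dtype=float)
    if not (np.isfinite(bhat).all() and np.isfinite(shat).all()):
        raise ValueError("missing or non-finite values found")
    if (shat <= 0).any():
        raise ValueError("standard errors must be positive")


# Toy tables: two tests by two conditions (hypothetical labels and values)
toy_b = pd.DataFrame([[0.1, -0.2], [1.5, 0.3]],
                     index=["snp1", "snp2"], columns=["liver", "lung"])
toy_s = pd.DataFrame([[1.0, 1.1], [0.9, 1.2]],
                     index=["snp1", "snp2"], columns=["liver", "lung"])

check_aligned(toy_b, toy_s)  # passes silently when the tables are aligned
```

If the two files list the same tests in a different order, reindexing one table to the other's index (for example `shat_df.loc[bhat_df.index]`) before the check is a simple fix.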