# gnomAD v4 Ancestry Projection - Summary

## Objective
Project Tapestry samples (n=97,422) onto gnomAD v4 PC space and assign ancestry labels using the official Random Forest classifier.

## Methods

### 1. Data Preparation
- Downloaded gnomAD v4 PCA loadings (Hail table format)
- Extracted 20 PC loadings for projection
- Identified 168,373 overlapping variants between Tapestry and gnomAD
- Removed monomorphic variants (AF=0 or AF=1), whose zero variance would make variance standardization undefined
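
The monomorphic filter matters because variance standardization divides by sqrt(2p(1-p)), which is zero when AF is 0 or 1. A minimal sketch of that step, assuming a hypothetical dosage matrix `G` (samples × variants, values 0/1/2) with no missing calls and allele frequencies estimated from the same samples; the actual pipeline may take frequencies from elsewhere:

```python
import numpy as np

def standardize_genotypes(G):
    """Drop monomorphic variants, then HWE-standardize each remaining column.

    G is a hypothetical (n_samples, n_variants) array of alt-allele dosages.
    Returns the standardized matrix and the boolean mask of variants kept.
    """
    G = np.asarray(G, dtype=float)
    af = G.mean(axis=0) / 2.0                 # allele frequency per variant
    keep = (af > 0.0) & (af < 1.0)            # monomorphic means AF == 0 or AF == 1
    p = af[keep]
    Z = (G[:, keep] - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    return Z, keep

# Toy data: columns 1 and 2 are monomorphic (all 2s, all 0s).
G_demo = np.array([[0, 2, 0, 1],
                   [1, 2, 0, 1],
                   [2, 2, 0, 0],
                   [1, 2, 0, 2]])
Z_demo, keep_demo = standardize_genotypes(G_demo)
print(keep_demo)
print(Z_demo.shape)
```

Without the filter, the two monomorphic columns would produce a division by zero and propagate NaN or inf into every projected PC.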

### 2. PC Projection
```bash
# Schematic only (full arguments not recorded here):
# plink2 --score <loadings file> ... variance-standardize
```
- Used `--score` to compute: projected_PCs = genotypes @ gnomAD_loadings
- Applied variance standardization (matches gnomAD's HWE-normalized PCA)
- Multiplied SCORE_AVG × ALLELE_CT to get proper PC scale
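
The rescaling in the last bullet can be sketched as follows. This assumes a `.sscore`-style table with an `ALLELE_CT` column and one `<name>_AVG` column per PC; the column names and toy values are illustrative, so check them against the header of the actual plink2 output:

```python
import pandas as pd

def rescale_sscore(sscore, n_pcs):
    """Convert per-allele average scores into summed scores (projected PCs)."""
    out = pd.DataFrame(index=sscore.index)
    for i in range(1, n_pcs + 1):
        # Average score times allele count recovers the summed score.
        out[f"PC{i}"] = sscore[f"PC{i}_AVG"] * sscore["ALLELE_CT"]
    return out

# Hypothetical two-sample, two-PC score table.
sscore_demo = pd.DataFrame({
    "ALLELE_CT": [1000, 800],
    "PC1_AVG": [0.002, -0.001],
    "PC2_AVG": [0.0005, 0.003],
})
pcs_demo = rescale_sscore(sscore_demo, n_pcs=2)
print(pcs_demo)
```

Samples with different amounts of missing data have different allele counts, so the rescaling is applied per sample rather than with one global constant.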

### 3. Random Forest Classification
- Loaded gnomAD v4 RF model (ONNX format, avoids pickle version conflicts)
- Applied to first 20 projected PCs
- Obtained ancestry predictions + confidence scores

## Results

### Ancestry Distribution
- **EUR (NFE+FIN)**: 89,639 samples (92.0%)
  - NFE: 67,856 (69.7%)
  - FIN: 21,783 (22.4%)
- **Other ancestries**: 7,783 samples (8.0%)

### Critical Finding: FIN Classification Failure
**Problem**: 22.4% classified as Finnish — **demographically implausible** for Minnesota (expected ~5-10%).

**Evidence of poor separation**:
- **Median FIN confidence: 0.32** (the winning class typically holds under a third of the probability)
- **80% of FIN calls borderline** (17,515/21,783 with prob < 0.5)
- **Only 266 confident FIN** (prob > 0.7) = **0.3%** of the full cohort (more realistic)
- **Most samples**: ~0.3 NFE prob + ~0.3 FIN prob (RF can't decide)
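
One way to quantify "RF can't decide" is the margin between the two highest class probabilities for each sample. A sketch, assuming the per-sample probabilities are dicts mapping label to probability (the form in which this notebook consumes the ONNX output); the helper name and toy values are illustrative only:

```python
def top_two_margin(prob_dict):
    """Return (best_label, runner_up_label, margin) for one sample."""
    ranked = sorted(prob_dict.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (second, p2) = ranked[0], ranked[1]
    return best, second, p1 - p2

probs_demo = [
    {"nfe": 0.30, "fin": 0.32, "asj": 0.20, "mid": 0.18},   # ambiguous call
    {"nfe": 0.05, "fin": 0.85, "asj": 0.05, "mid": 0.05},   # confident call
]
margins_demo = [top_two_margin(p) for p in probs_demo]
for m in margins_demo:
    print(m)
```

A small margin with NFE as the runner-up would support the reading that the FIN calls sit on the FIN/NFE boundary rather than in a distinct cluster.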

**Visual inspection (PC11 vs PC5)**:
- No discrete FIN/NFE clusters
- Continuous gradient rather than separated populations
- Suggests projection shrinkage or cohort-specific structure

## Interpretation

The gnomAD RF was trained on discrete reference populations and appears to fail on this cohort for three likely reasons:
1. **Low variant overlap** → PC projection shrinkage toward 0
2. **Continuous European variation** in Tapestry (no distinct Finnish subpopulation)
3. **gnomAD's FIN/NFE boundary** doesn't generalize to all cohorts

This is a **known limitation** mentioned in gnomAD documentation: *"RF may not perform adequately on all datasets"*

## Decision: Collapse FIN + NFE → EUR

**Rationale**:
- 22.4% FIN is not credible for Minnesota demographics
- Median confidence (0.32) indicates unreliable boundary
- FIN and NFE are genetically very close (both North European)
- Fairness analysis doesn't require sub-European resolution

**Final ancestry scheme**:
- **EUR**: Merged NFE + FIN (high confidence threshold not practical here)
- **AFR, EAS, SAS, AMR**: Keep separate (better separated in PC space)
```python
# Collapse European ancestries
pcs_clean['ancestry_final'] = pcs_clean['ancestry_rf'].replace({
    'nfe': 'eur',
    'fin': 'eur'
})
```

## Files Generated
- `tapestry_projected_pcs.csv` - 97,422 samples × 20 PCs
- `ancestry_labels.txt` - Sample IDs + ancestry + confidence scores
- `gnomad_projection.png` - PC visualization
- `fin_nfe_separation_check.png` - Diagnostic plots

## Next Steps (Week 3 Task 2)
1. Use **continuous PC-based stratification** (PC deciles) for debiasing
2. Implement **ancestry subspace removal** (project out PC1-10)
3. Validate fairness metrics stratified by final ancestry labels
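
Steps 1 and 2 above could look roughly like this. It is a sketch on synthetic data, not the planned implementation: `pc_deciles`, `project_out`, and the feature matrix are hypothetical names, and "subspace removal" is taken here to mean ordinary least-squares residualization on the PCs plus an intercept:

```python
import numpy as np
import pandas as pd

def pc_deciles(pcs, pc="PC1"):
    """Assign each sample to a decile (0-9) of one PC."""
    return pd.qcut(pcs[pc], q=10, labels=False, duplicates="drop")

def project_out(features, pcs):
    """Remove the linear span of the PCs (plus an intercept) from features."""
    A = np.column_stack([np.ones(len(pcs)), pcs])
    coef, *_ = np.linalg.lstsq(A, features, rcond=None)
    return features - A @ coef

rng = np.random.default_rng(0)
pcs_demo = pd.DataFrame(rng.normal(size=(200, 10)),
                        columns=[f"PC{i}" for i in range(1, 11)])
# Hypothetical features that partly depend on the ancestry PCs.
feats_demo = pcs_demo.values @ rng.normal(size=(10, 3)) + rng.normal(size=(200, 3))

deciles_demo = pc_deciles(pcs_demo)
resid_demo = project_out(feats_demo, pcs_demo.values)

print(deciles_demo.value_counts().sort_index().tolist())
# After residualization the features are numerically orthogonal to every PC.
print(np.abs(pcs_demo.values.T @ resid_demo).max())
```

Residualization removes only linear dependence on PC1-10; any nonlinear ancestry signal would remain and would need to be checked separately.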

## Key Takeaway
**Discrete ancestry labels from gnomAD RF are unreliable for this cohort.** For fairness analysis, prefer:
- Continuous PC-based debiasing (original Week 2 plan)
- Collapsed EUR category (avoids spurious FIN/NFE split)
- High-confidence non-EUR ancestries for stratification
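
The last two points can be combined into a single labeling rule. A sketch, assuming the `ancestry_rf` and `ancestry_prob` columns built later in this notebook; the 0.7 cutoff is borrowed from the "confident FIN" check and the `unassigned` label is a placeholder, neither being a settled choice:

```python
import pandas as pd

def final_labels(df, min_prob=0.7):
    """Collapse NFE/FIN into EUR; keep other labels only at or above min_prob."""
    labels = df["ancestry_rf"].replace({"nfe": "eur", "fin": "eur"})
    low_conf = (labels != "eur") & (df["ancestry_prob"] < min_prob)
    return labels.mask(low_conf, "unassigned")

# Toy rows with illustrative values.
demo = pd.DataFrame({
    "ancestry_rf": ["nfe", "fin", "afr", "eas", "sas"],
    "ancestry_prob": [0.35, 0.32, 0.95, 0.55, 0.80],
})
demo["ancestry_final"] = final_labels(demo)
print(demo)
```

EUR is exempt from the threshold because, as noted in the decision above, a high-confidence cutoff is not practical for the merged NFE + FIN group.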


In [None]:
import onnxruntime as ort
import numpy as np

# Load ONNX model
model_path = '/home/ext_meehl_joshua_mayo_edu/pre-phd-genomics/02_genomics_domain/data/gnomAD/gnomad.v4.0.RF_fit.onnx'
session = ort.InferenceSession(model_path)

# Inspect model inputs/outputs
print("Model inputs:")
for inp in session.get_inputs():
    print(f"  Name: {inp.name}, Shape: {inp.shape}, Type: {inp.type}")

print("\nModel outputs:")
for out in session.get_outputs():
    print(f"  Name: {out.name}, Shape: {out.shape}, Type: {out.type}")

In [None]:
import pandas as pd
pcs_clean = pd.read_csv('/home/ext_meehl_joshua_mayo_edu/pre-phd-genomics/02_genomics_domain/data/gnomAD/tapestry_projected_pcs.csv')

In [None]:
# Prepare input: the model takes the first N PCs as float32.
# N is read from the model's input shape (inspected in the first cell);
# this assumes the second dimension is a fixed integer, not a symbolic size.
n_pcs = session.get_inputs()[0].shape[1]  # Expected number of features
print(f"Model expects {n_pcs} PCs")

# Prepare features
X = pcs_clean[[f'PC{i}' for i in range(1, n_pcs + 1)]].values.astype(np.float32)

# Run inference
input_name = session.get_inputs()[0].name
output_names = [out.name for out in session.get_outputs()]

results = session.run(output_names, {input_name: X})

# results typically contains: [labels, probabilities]
predictions = results[0]  # Ancestry predictions
probabilities = results[1] if len(results) > 1 else None  # Confidence scores

print(f"Predictions shape: {predictions.shape}")
if probabilities is not None:
    print(f"Probabilities length: {len(probabilities)}")

In [None]:
# Check exact input specifications
print("Model input details:")
for inp in session.get_inputs():
    print(f"  Name: {inp.name}")
    print(f"  Shape: {inp.shape}")
    print(f"  Type: {inp.type}")

# Check your input shape
print(f"\nYour input shape: {X.shape}")
print(f"Your input dtype: {X.dtype}")

# Verify your input actually has variation
print(f"\nFirst 5 samples, first 5 PCs:")
print(X[:5, :5])

print(f"\nPC1 statistics:")
print(f"  Min: {X[:, 0].min()}")
print(f"  Max: {X[:, 0].max()}")
print(f"  Mean: {X[:, 0].mean()}")
print(f"  Std: {X[:, 0].std()}")

In [None]:
pcs_clean['ancestry_rf'] = predictions
pcs_clean['ancestry_prob'] = [max(prob_dict.values()) for prob_dict in probabilities]

In [None]:
pcs_clean['ancestry_rf'].value_counts()

In [None]:
pcs_clean['ancestry_prob'].hist()

In [None]:
import numpy as np
from itertools import combinations

n_pc_search = 20
# PC columns to search (first n_pc_search PCs)
pc_cols = [f'PC{i}' for i in range(1, n_pc_search + 1)]

# Function to compute between-group / within-group sum of squares
def separation_score(data, labels, pc1, pc2):
    """Between/within sum-of-squares ratio for a PC pair (F-like, no df scaling)."""
    X = data[[pc1, pc2]].values
    
    # Between-group variance
    group_means = []
    for ancestry in labels.unique():
        group_means.append(X[labels == ancestry].mean(axis=0))
    group_means = np.array(group_means)
    overall_mean = X.mean(axis=0)
    between_var = np.sum([len(X[labels == anc]) * np.sum((gm - overall_mean)**2) 
                          for anc, gm in zip(labels.unique(), group_means)])
    
    # Within-group variance
    within_var = np.sum([np.sum((X[labels == anc] - gm)**2) 
                         for anc, gm in zip(labels.unique(), group_means)])
    
    # Sum-of-squares ratio (higher = better separation)
    return between_var / (within_var + 1e-10)

# Test all PC pairs
print("Computing separation scores for all PC pairs...")
scores = {}
for pc1, pc2 in combinations(pc_cols[:n_pc_search], 2):  # All pairs among the first n_pc_search PCs
    score = separation_score(pcs_clean, pcs_clean['ancestry_rf'], pc1, pc2)
    scores[(pc1, pc2)] = score

# Sort by score
sorted_pairs = sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Show top 10
print("\nTop 10 PC pairs for ancestry separation:")
for (pc1, pc2), score in sorted_pairs[:10]:
    print(f"  {pc1} vs {pc2}: {score:.2f}")

# Visualize best pair
best_pc1, best_pc2 = sorted_pairs[0][0]
print(f"\nBest pair: {best_pc1} vs {best_pc2}")

In [None]:
pcs_clean['ancestry_rf'].unique()  # Ancestry labels present in the RF predictions

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Number of PC pairs to plot (e.g., 10 pairs = PC1-PC20)
n_pairs = 10

# Calculate grid size for 3 columns (e.g., 4x3 for 10 pairs, 3x3 for 9 pairs)
n_cols = 3
n_rows = (n_pairs + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 6*n_rows))
axes = axes.flatten()

# Ancestries to plot and their colors. Restricted to NFE and FIN to inspect
# their separation; use pcs_clean['ancestry_rf'].unique() or the full list
# ['nfe', 'afr', 'fin', 'sas', 'ami', 'asj', 'eas', 'amr', 'mid'] to plot every label.
ancestries = ['nfe', 'fin']
colors = dict(zip(ancestries, sns.color_palette('Set2', n_colors=len(ancestries))))

for pair_idx in range(n_pairs):
    ax = axes[pair_idx]
    
    # Get PC pair (PC1+PC2, PC3+PC4, etc.)
    pc1 = f'PC{2*pair_idx + 1}'
    pc2 = f'PC{2*pair_idx + 2}'
    
    # Plot each ancestry
    for ancestry in ancestries:
        subset = pcs_clean[pcs_clean['ancestry_rf'] == ancestry]
        ax.scatter(subset[pc1], subset[pc2], 
                   label=ancestry if pair_idx == 0 else '',  # Legend only on first plot
                   alpha=0.4, s=5, color=colors[ancestry])
    
    ax.set_xlabel(pc1, fontsize=10)
    ax.set_ylabel(pc2, fontsize=10)
    ax.set_title(f'{pc1} vs {pc2}', fontsize=11)
    ax.grid(alpha=0.3)

# Add legend to first subplot
axes[0].legend(markerscale=3, loc='best', fontsize=8)

# Hide unused subplots
for idx in range(n_pairs, len(axes)):
    axes[idx].axis('off')

plt.suptitle('Consecutive PC Pairs - Ancestry Structure', fontsize=16, y=1.00)
plt.tight_layout()
plt.savefig('consecutive_pc_pairs_ancestry.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
x_pc = 'PC11'
y_pc = 'PC5'

# Look at samples with low confidence or near Finnish boundary
pcs_clean['is_european'] = pcs_clean['ancestry_rf'].isin(['nfe', 'fin'])
european_samples = pcs_clean[pcs_clean['is_european']]

# Check FIN assignments with low confidence
fin_samples = european_samples[european_samples['ancestry_rf'] == 'fin']
# print(f"\nFinnish classifications:")
# print(fin_samples['ancestry_prob'].describe())

# Low-confidence FIN might be admixed
ambiguous_fin = fin_samples[fin_samples['ancestry_prob'] < 0.6]
print(f"\nAmbiguous FIN: {len(ambiguous_fin)} samples")

# Plot NFE vs FIN in PC space
import matplotlib.pyplot as plt
nfe = european_samples[european_samples['ancestry_rf'] == 'nfe']
fin = european_samples[european_samples['ancestry_rf'] == 'fin']

plt.figure(figsize=(10, 8))
plt.scatter(nfe[x_pc], nfe[y_pc], label='NFE', alpha=0.5, s=10)
plt.scatter(fin[x_pc], fin[y_pc], label='FIN', alpha=0.5, s=10)
plt.xlabel(x_pc)
plt.ylabel(y_pc)
plt.legend()
plt.title('NFE vs FIN classification')
plt.show()

In [None]:
# Are these low-confidence classifications?
fin_samples = pcs_clean[pcs_clean['ancestry_rf'] == 'fin']

print(f"FIN samples: {len(fin_samples)} ({len(fin_samples)/len(pcs_clean)*100:.1f}%)")
print("\nFIN confidence distribution:")
print(fin_samples['ancestry_prob'].describe())

# How many are confidently FIN (>0.7)?
confident_fin = fin_samples[fin_samples['ancestry_prob'] > 0.7]
print(f"\nConfident FIN (prob > 0.7): {len(confident_fin)} ({len(confident_fin)/len(pcs_clean)*100:.1f}%)")

# How many are borderline?
borderline_fin = fin_samples[fin_samples['ancestry_prob'] < 0.5]
print(f"Borderline FIN (prob < 0.5): {len(borderline_fin)}")