# NB03: Contamination vs Functional Potential Models

Test the association between contamination level and inferred functional potential.

**Inputs**
- `../data/geochemistry_sample_matrix.tsv`
- `../data/community_taxon_counts.tsv`
- `../data/taxon_bridge.tsv`
- `../data/taxon_functional_features.tsv`

**Planned outputs**
- `../data/site_functional_scores.tsv`
- `../data/model_results.tsv`
- figures in `../figures/`


In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

DATA_DIR = Path('../data')
FIG_DIR = Path('../figures')
FIG_DIR.mkdir(parents=True, exist_ok=True)

# Load the four tab-separated input tables listed under Inputs
geo = pd.read_csv(DATA_DIR / 'geochemistry_sample_matrix.tsv', sep='\t')
community = pd.read_csv(DATA_DIR / 'community_taxon_counts.tsv', sep='\t')
bridge = pd.read_csv(DATA_DIR / 'taxon_bridge.tsv', sep='\t')
features = pd.read_csv(DATA_DIR / 'taxon_functional_features.tsv', sep='\t')

print('geo rows:', len(geo))
print('community rows:', len(community))
print('bridge rows:', len(bridge))
print('feature rows:', len(features))

## Modeling Plan

1. Build contamination index (uranium-only and multimetal PCA variants).
2. Build sample-level functional scores by abundance-weighted taxon feature aggregation.
3. Test with:
   - Spearman correlation
   - Robust regression (`RLM`)
   - Label permutation tests
4. Re-run on mapping-confidence subsets (Tier 1 vs Tier 1+2).
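
Step 1 of the plan could be sketched roughly as below. This is only an illustration under stated assumptions: the column names (`uranium`, `nickel`, `cobalt`), the `log10(1 + x)` transform, and the helper names `uranium_index` and `multimetal_pca_index` are hypothetical and are not taken from the actual geochemistry table. The PCA variant uses a plain NumPy SVD on z-scored columns and flips the sign so that a higher metal load gives a higher index.

```python
import numpy as np
import pandas as pd


def uranium_index(geo, uranium_col="uranium"):
    """Uranium-only index: log10(1 + concentration). The transform is an assumption."""
    return np.log10(1.0 + geo[uranium_col].astype(float)).rename("contamination_index_u")


def multimetal_pca_index(geo, metal_cols):
    """First principal component of z-scored log10(1 + x) metal concentrations.

    Assumes every metal column varies across samples (non-zero standard deviation).
    """
    logged = np.log10(1.0 + geo[metal_cols].astype(float))
    z = (logged - logged.mean()) / logged.std(ddof=0)
    # Rows of vt are the principal axes of the centred, scaled matrix.
    _, _, vt = np.linalg.svd(z.to_numpy(), full_matrices=False)
    pc1 = z.to_numpy() @ vt[0]
    # The sign of a principal component is arbitrary; orient it so that
    # samples with higher average metal load receive a higher index.
    if np.corrcoef(pc1, z.mean(axis=1))[0, 1] < 0:
        pc1 = -pc1
    return pd.Series(pc1, index=geo.index, name="contamination_index_pca")


# Toy data with hypothetical column names, for illustration only.
toy_geo = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "uranium": [0.1, 1.0, 10.0, 100.0],
    "nickel": [0.5, 0.8, 5.0, 40.0],
    "cobalt": [0.2, 0.3, 2.0, 9.0],
})
u_idx = uranium_index(toy_geo)
pca_idx = multimetal_pca_index(toy_geo, ["uranium", "nickel", "cobalt"])
print(pd.concat([toy_geo["sample_id"], u_idx, pca_idx], axis=1))
```

Computing both variants on the same samples makes it easy to check whether the multimetal index mostly tracks uranium or carries additional signal.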

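The abundance-weighted aggregation in step 2, together with the tier subsets in step 4, might look something like the sketch below. The schema is assumed, not known: `community` is taken to be long-format with `sample_id`, `taxon`, `count`; `bridge` maps `taxon` to a hypothetical `mapped_id` with an integer `tier` (one row per taxon); and `features` holds one hypothetical `feature_value` per `mapped_id`. Weights are renormalized over mapped taxa only, so the sketch also reports the fraction of counts that was mapped.

```python
import pandas as pd


def functional_scores(community, bridge, features, max_tier=2):
    """Abundance-weighted mean feature value per sample over mapped taxa.

    All column names are illustrative assumptions about the input tables.
    """
    kept = bridge.loc[bridge["tier"] <= max_tier, ["taxon", "mapped_id"]]
    merged = community.merge(kept, on="taxon", how="inner").merge(
        features[["mapped_id", "feature_value"]], on="mapped_id", how="inner"
    )
    merged["weighted"] = merged["count"] * merged["feature_value"]
    sums = merged.groupby("sample_id")[["weighted", "count"]].sum()
    total = community.groupby("sample_id")["count"].sum()
    # Weighted mean over mapped taxa, plus the share of counts that mapped.
    sums["functional_score"] = sums["weighted"] / sums["count"]
    sums["mapped_fraction"] = sums["count"] / total.reindex(sums.index)
    return sums[["functional_score", "mapped_fraction"]].reset_index()


# Toy tables, for illustration only.
toy_community = pd.DataFrame({
    "sample_id": ["S1", "S1", "S1", "S2", "S2"],
    "taxon": ["A", "B", "C", "A", "B"],
    "count": [10, 30, 60, 50, 50],
})
toy_bridge = pd.DataFrame({
    "taxon": ["A", "B"],          # taxon C has no mapping
    "mapped_id": ["g1", "g2"],
    "tier": [1, 2],
})
toy_features = pd.DataFrame({"mapped_id": ["g1", "g2"], "feature_value": [1.0, 0.0]})

tier1 = functional_scores(toy_community, toy_bridge, toy_features, max_tier=1)
tier12 = functional_scores(toy_community, toy_bridge, toy_features, max_tier=2)
print(tier1)
print(tier12)
```

A low `mapped_fraction` flags samples whose score rests on a small part of the community, which matters when comparing the Tier 1 and Tier 1+2 runs.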

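For the Spearman correlation and label permutation test in step 3, one possible sketch is shown here. The helper name `spearman_permutation_test`, the number of permutations, the seed, and the toy vectors are all illustrative assumptions. The permutation p-value is two-sided and uses the common `(1 + hits) / (1 + n_perm)` correction so it can never be exactly zero.

```python
import numpy as np
from scipy import stats


def spearman_permutation_test(x, y, n_perm=2000, seed=0):
    """Spearman rho with a two-sided label-permutation p-value."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rho_obs, _ = stats.spearmanr(x, y)
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the functional-score labels to break any real association.
        rho_perm, _ = stats.spearmanr(x, rng.permutation(y))
        if abs(rho_perm) >= abs(rho_obs) - 1e-12:
            hits += 1
    return {
        "rho": float(rho_obs),
        "p_perm": (1 + hits) / (1 + n_perm),
        "n_samples": int(len(x)),
    }


# Toy vectors, for illustration only.
toy_contamination = [0.1, 0.5, 1.2, 2.0, 3.1, 4.4, 5.0, 6.3]
toy_function = [0.9, 0.8, 0.85, 0.6, 0.5, 0.55, 0.3, 0.2]
perm_result = spearman_permutation_test(toy_contamination, toy_function)
print(perm_result)
```
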
In [None]:
# Placeholder result frames until aggregation/model code is added
site_scores = pd.DataFrame(columns=['sample_id', 'contamination_index', 'functional_score'])
model_results = pd.DataFrame(columns=['model', 'effect', 'p_value', 'n_samples'])

site_scores.to_csv(DATA_DIR / 'site_functional_scores.tsv', sep='\t', index=False)
model_results.to_csv(DATA_DIR / 'model_results.tsv', sep='\t', index=False)

# Axes-only placeholder figure; no data are plotted yet
plt.figure(figsize=(6, 4))
plt.title('Placeholder: contamination vs functional score')
plt.xlabel('contamination_index')
plt.ylabel('functional_score')
plt.tight_layout()
plt.savefig(FIG_DIR / 'contamination_vs_functional_score_placeholder.png', dpi=150)
plt.close()

print('Wrote placeholder model outputs and figure.')