# NB04: Lab-Field Concordance & NMDC Validation

**Requires BERDL JupyterHub** — `get_spark_session()` is only available in JupyterHub kernels.

## Purpose

Test whether the lab conditions under which dark genes show fitness effects predict the
environments where those genes occur in nature. Two independent tests:

1. **Lab-field concordance**: Pre-register a mapping from Fitness Browser (FB) experiment groups
   to expected environmental categories, then test whether carrier genomes are enriched in the
   predicted environments.
2. **NMDC independent validation**: For taxa carrying dark genes, check whether their abundance in
   NMDC metagenomic samples correlates with the abiotic variables that match the lab fitness conditions.

## Inputs

- `data/dark_genes_only.tsv` from NB01
- `data/carrier_noncarrier_tests.tsv` from NB03
- `data/carrier_genome_map.tsv` from NB03
- `data/biogeographic_profiles.tsv` from NB03

## Outputs

- `data/lab_field_concordance.tsv` — per-cluster concordance test results
- `data/nmdc_validation.tsv` — NMDC abiotic correlation results
- `figures/fig11_concordance_matrix.png`
- `figures/fig12_nmdc_correlations.png`

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

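# Path setup: use the parent when run from notebooks/, otherwise walk up
# from the working directory until a directory containing PROJECT.md is found
if os.path.basename(os.getcwd()) == 'notebooks':
    PROJECT_DIR = os.path.dirname(os.getcwd())
else:
    PROJECT_DIR = os.getcwd()
    _d = PROJECT_DIR
    while _d != os.path.dirname(_d):  # stop at the filesystem root
        if os.path.exists(os.path.join(_d, 'PROJECT.md')):
            PROJECT_DIR = _d  # previously the match was found but never used
            break
        _d = os.path.dirname(_d)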

DATA_DIR = os.path.join(PROJECT_DIR, 'data')
FIG_DIR = os.path.join(PROJECT_DIR, 'figures')
os.makedirs(FIG_DIR, exist_ok=True)

spark = get_spark_session()
print(f'Project dir: {PROJECT_DIR}')

Project dir: /home/aparkin/BERIL-research-observatory/projects/functional_dark_matter


## Section 1: Pre-Registered Condition-Environment Mapping

Define the mapping from FB experiment condition classes to expected environmental
categories **before** looking at the biogeographic data. This prevents post-hoc
cherry-picking of associations.
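
One way to make the pre-registration auditable is to fingerprint the mapping before any biogeographic table is loaded, and record the digest alongside the results. The following is a minimal sketch and not part of the original notebook: `mapping_fingerprint` is a hypothetical helper, and the two-entry `example_map` is a stand-in for `CONDITION_ENV_MAP`.

```python
import hashlib
import json

def mapping_fingerprint(mapping):
    """Return a SHA-256 hex digest of a JSON-serializable mapping.

    Keys are sorted so the digest does not depend on dict insertion order.
    """
    canonical = json.dumps(mapping, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

# Stand-in for CONDITION_ENV_MAP (hypothetical subset, for illustration only)
example_map = {
    'pH': {'expected_envs': ['soil_sediment', 'extreme', 'freshwater']},
    'anaerobic': {'expected_envs': ['soil_sediment', 'freshwater', 'animal_associated']},
}

fingerprint = mapping_fingerprint(example_map)
print(f'Mapping fingerprint: {fingerprint}')
```

Any later change to an expected environment or abiotic variable changes the digest, so a digest recorded before Section 2 runs shows that the mapping was not tuned afterwards.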

In [2]:
# Pre-registered mapping: FB condition class -> expected environment categories
# Each condition class maps to one or more env_category values from NB03's classification
CONDITION_ENV_MAP = {
    'stress': {
        'expected_envs': ['contaminated', 'soil_sediment', 'wastewater_engineered'],
        'description': 'Stress conditions (metals, oxidative, osmotic) -> contaminated/variable environments',
        'nmdc_abiotic': [],  # stress is heterogeneous, no single abiotic variable
    },
    'carbon source': {
        'expected_envs': ['soil_sediment', 'freshwater', 'plant_associated'],
        'description': 'Carbon utilization -> carbon-rich environments (soil, freshwater, rhizosphere)',
        'nmdc_abiotic': ['annotations_tot_org_carb_has_numeric_value',
                         'annotations_diss_org_carb_has_numeric_value'],
    },
    'nitrogen source': {
        'expected_envs': ['soil_sediment', 'freshwater', 'wastewater_engineered'],
        'description': 'Nitrogen utilization -> nitrogen-variable environments',
        'nmdc_abiotic': ['annotations_tot_nitro_content_has_numeric_value',
                         'annotations_ammonium_has_numeric_value',
                         'annotations_ammonium_nitrogen_has_numeric_value'],
    },
    'pH': {
        'expected_envs': ['soil_sediment', 'extreme', 'freshwater'],
        'description': 'pH stress -> pH-variable environments',
        'nmdc_abiotic': ['annotations_ph'],
    },
    'motility': {
        'expected_envs': ['soil_sediment', 'freshwater', 'plant_associated'],
        'description': 'Motility -> structured environments requiring chemotaxis',
        'nmdc_abiotic': [],
    },
    'anaerobic': {
        'expected_envs': ['soil_sediment', 'freshwater', 'animal_associated'],
        'description': 'Anaerobic growth -> low-oxygen environments',
        'nmdc_abiotic': ['annotations_diss_oxygen_has_numeric_value'],
    },
}

print('Pre-registered condition-environment mapping:')
for cond, info in CONDITION_ENV_MAP.items():
    print(f'  {cond} -> {info["expected_envs"]}')
    if info['nmdc_abiotic']:
        print(f'    NMDC abiotic: {info["nmdc_abiotic"]}')

Pre-registered condition-environment mapping:
  stress -> ['contaminated', 'soil_sediment', 'wastewater_engineered']
  carbon source -> ['soil_sediment', 'freshwater', 'plant_associated']
    NMDC abiotic: ['annotations_tot_org_carb_has_numeric_value', 'annotations_diss_org_carb_has_numeric_value']
  nitrogen source -> ['soil_sediment', 'freshwater', 'wastewater_engineered']
    NMDC abiotic: ['annotations_tot_nitro_content_has_numeric_value', 'annotations_ammonium_has_numeric_value', 'annotations_ammonium_nitrogen_has_numeric_value']
  pH -> ['soil_sediment', 'extreme', 'freshwater']
    NMDC abiotic: ['annotations_ph']
  motility -> ['soil_sediment', 'freshwater', 'plant_associated']
  anaerobic -> ['soil_sediment', 'freshwater', 'animal_associated']
    NMDC abiotic: ['annotations_diss_oxygen_has_numeric_value']


In [3]:
# Load NB01 and NB03 results
dark = pd.read_csv(os.path.join(DATA_DIR, 'dark_genes_only.tsv'), sep='\t', low_memory=False)
carrier_tests = pd.read_csv(os.path.join(DATA_DIR, 'carrier_noncarrier_tests.tsv'), sep='\t')
carrier_map = pd.read_csv(os.path.join(DATA_DIR, 'carrier_genome_map.tsv'), sep='\t')
profiles = pd.read_csv(os.path.join(DATA_DIR, 'biogeographic_profiles.tsv'), sep='\t')

print(f'Carrier test results: {len(carrier_tests)} clusters')
print(f'Carrier genome map: {len(carrier_map):,} pairs')
print(f'Species profiles: {len(profiles)} species')

# Filter to clusters with condition class in our pre-registered mapping
mapped_conditions = list(CONDITION_ENV_MAP.keys())
concordance_clusters = carrier_tests[
    carrier_tests['top_condition_class'].isin(mapped_conditions)
].copy()
print(f'\nClusters with mapped condition class: {len(concordance_clusters)}')
print(concordance_clusters['top_condition_class'].value_counts().to_string())

Carrier test results: 151 clusters
Carrier genome map: 8,139 pairs
Species profiles: 31 species

Clusters with mapped condition class: 121
top_condition_class
stress             65
carbon source      38
nitrogen source    11
pH                  4
motility            2
anaerobic           1


## Section 2: Lab-Field Concordance Test

For each cluster with a mapped condition class and environment labels for at least three
carrier and three non-carrier genomes, use Fisher's exact test to check whether carriers
are enriched in the predicted environment categories.
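
The 2x2 test can be written out by hand to make the direction of the hypothesis explicit. This is a sketch under stated assumptions rather than the notebook's implementation: `fisher_enrichment_p` is a hypothetical helper that computes the one-sided (enrichment) tail from the hypergeometric distribution using only the standard library, and the counts are invented for illustration.

```python
from math import comb

def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher exact p-value for carrier enrichment.

    2x2 table layout:
        a = carriers in expected envs,     b = carriers elsewhere
        c = non-carriers in expected envs, d = non-carriers elsewhere
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    n_carriers = a + b
    n_expected = a + c
    total = a + b + c + d
    denom = comb(total, n_expected)
    upper = min(n_carriers, n_expected)
    tail = sum(comb(n_carriers, k) * comb(total - n_carriers, n_expected - k)
               for k in range(a, upper + 1))
    return tail / denom

# Hypothetical counts: 5 of 8 carriers vs 0 of 9 non-carriers in expected envs
p_enriched = fisher_enrichment_p(5, 3, 0, 9)
print(f'one-sided p = {p_enriched:.5f}')
```

`scipy.stats.fisher_exact` computes the same tail when called with `alternative='greater'`. Its default is the two-sided alternative, which also counts depletion of carriers in the expected environments as evidence against the null.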

In [4]:
# Re-extract environment data for carrier and non-carrier genomes
# We need the raw env categories per genome, not just the top carrier env from NB03

# Load all genomes in target species (from NB03 approach)
target_species = concordance_clusters['gtdb_species_clade_id'].unique().tolist()
species_sdf = spark.createDataFrame(
    [(str(s),) for s in target_species],
    ['gtdb_species_clade_id']
)
species_sdf.createOrReplaceTempView('concordance_species')

all_genomes = spark.sql("""
    SELECT g.genome_id, g.gtdb_species_clade_id, g.ncbi_biosample_id
    FROM kbase_ke_pangenome.genome g
    JOIN concordance_species cs ON g.gtdb_species_clade_id = cs.gtdb_species_clade_id
""").toPandas()

# Pivot NCBI env for these genomes
genome_ids_sdf = spark.createDataFrame(
    [(str(gid), str(bs)) for gid, bs in zip(
        all_genomes['genome_id'].tolist(),
        all_genomes['ncbi_biosample_id'].fillna('').tolist()
    ) if bs],
    ['genome_id', 'accession']
)
genome_ids_sdf.createOrReplaceTempView('concordance_genomes')

env_data = spark.sql("""
    SELECT cg.genome_id,
           MAX(CASE WHEN ne.harmonized_name = 'isolation_source' THEN ne.content END) as isolation_source,
           MAX(CASE WHEN ne.harmonized_name = 'env_broad_scale' THEN ne.content END) as env_broad_scale,
           MAX(CASE WHEN ne.harmonized_name = 'host' THEN ne.content END) as host
    FROM concordance_genomes cg
    JOIN kbase_ke_pangenome.ncbi_env ne ON cg.accession = ne.accession
    GROUP BY cg.genome_id
""").toPandas()

print(f'Env data for {len(env_data):,} genomes in {len(target_species)} species')

Env data for 1,131 genomes in 10 species


In [5]:
# Classify environments (same function as NB03)
def classify_environment(row):
    source = str(row.get('isolation_source', '')).lower()
    host = str(row.get('host', '')).lower()
    if any(kw in source for kw in ['contaminated', 'polluted', 'mining', 'acid mine',
                                    'industrial', 'heavy metal', 'uranium', 'chromium']):
        return 'contaminated'
    if any(kw in source for kw in ['blood', 'sputum', 'urine', 'wound', 'clinical',
                                    'patient', 'hospital', 'human']):
        return 'human_clinical'
    if 'homo sapiens' in host or 'human' in host:
        return 'human_associated'
    if any(kw in source for kw in ['animal', 'bovine', 'chicken', 'pig', 'cattle',
                                    'poultry', 'feces', 'gut', 'intestin', 'rumen']):
        return 'animal_associated'
    if any(kw in source for kw in ['soil', 'rhizosphere', 'root', 'compost', 'peat',
                                    'agricultural', 'sediment']):
        return 'soil_sediment'
    if any(kw in source for kw in ['marine', 'ocean', 'sea', 'seawater', 'coastal',
                                    'saline', 'brackish', 'salt', 'brine', 'hypersaline']):
        return 'marine_saline'
    if any(kw in source for kw in ['freshwater', 'lake', 'river', 'pond', 'stream',
                                    'groundwater', 'spring', 'aquifer']):
        return 'freshwater'
    if any(kw in source for kw in ['plant', 'leaf', 'stem', 'flower', 'seed',
                                    'phyllosphere', 'endophyte']):
        return 'plant_associated'
    if any(kw in source for kw in ['wastewater', 'sewage', 'activated sludge',
                                    'bioreactor', 'ferment']):
        return 'wastewater_engineered'
    if any(kw in source for kw in ['hot spring', 'hydrothermal', 'volcanic',
                                    'permafrost', 'acidic', 'alkaline']):
        return 'extreme'
    return 'other_unknown'

env_data['env_category'] = env_data.apply(classify_environment, axis=1)
env_lookup = env_data.set_index('genome_id')['env_category'].to_dict()
print(f'Environment lookup: {len(env_lookup):,} genomes')
print(env_data['env_category'].value_counts().to_string())

Environment lookup: 1,131 genomes
env_category
other_unknown            442
human_associated         427
human_clinical           142
soil_sediment             37
animal_associated         25
plant_associated          23
freshwater                14
wastewater_engineered     12
marine_saline              6
contaminated               3
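
Because the classifier matches substrings in a fixed order, the first matching rule wins and short keywords can fire inside longer words. For example, `'sea'` occurs inside "diseased", and "hot spring" is caught by the freshwater keyword `'spring'` before the `extreme` rule is reached. The sketch below shows one possible variant using leading word boundaries. It is an illustration only: `RULES` is an abbreviated, hypothetical rule list and `classify_source` is not the NB03 function, so using it would change the category counts above.

```python
import re

# Abbreviated, hypothetical rule list; order still encodes precedence.
# 'extreme' is placed before 'freshwater' so that 'hot spring' is not
# captured by the 'spring' keyword.
RULES = [
    ('extreme', ['hot spring', 'hydrothermal', 'permafrost']),
    ('marine_saline', ['marine', 'ocean', 'sea', 'saline']),
    ('freshwater', ['freshwater', 'lake', 'river', 'spring']),
    ('plant_associated', ['plant', 'leaf', 'stem']),
]

# \b before each keyword requires it to start a word, so stems such as
# 'intestin' would still match as prefixes.
COMPILED = [
    (category, re.compile(r'\b(?:' + '|'.join(map(re.escape, kws)) + ')'))
    for category, kws in RULES
]

def classify_source(source):
    """Return the first category whose keyword starts a word in the text."""
    text = str(source).lower()
    for category, pattern in COMPILED:
        if pattern.search(text):
            return category
    return 'other_unknown'

for example in ['Hot spring sediment', 'diseased leaf', 'seawater']:
    print(example, '->', classify_source(example))
```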


In [6]:
# Build carrier sets from carrier_map
carrier_sets = carrier_map.groupby('gene_cluster_id')['genome_id'].apply(set).to_dict()
genomes_per_species = all_genomes.groupby('gtdb_species_clade_id')['genome_id'].apply(set).to_dict()

# Run concordance test for each cluster
concordance_results = []

for _, row in concordance_clusters.iterrows():
    cid = row['gene_cluster_id']
    sid = row['gtdb_species_clade_id']
    cond_class = row['top_condition_class']
    expected_envs = CONDITION_ENV_MAP[cond_class]['expected_envs']
    
    all_gids = genomes_per_species.get(sid, set())
    carrier_gids = carrier_sets.get(cid, set()) & all_gids
    noncarrier_gids = all_gids - carrier_gids
    
    # Get env categories for carriers and non-carriers
    carrier_envs = [env_lookup.get(g) for g in carrier_gids if g in env_lookup]
    noncarrier_envs = [env_lookup.get(g) for g in noncarrier_gids if g in env_lookup]
    
    if len(carrier_envs) < 3 or len(noncarrier_envs) < 3:
        continue
    
    # Count carriers in expected vs non-expected environments
    a = sum(1 for e in carrier_envs if e in expected_envs)  # carrier + expected
    b = sum(1 for e in carrier_envs if e not in expected_envs)  # carrier + not expected
    c = sum(1 for e in noncarrier_envs if e in expected_envs)  # non-carrier + expected
    d = sum(1 for e in noncarrier_envs if e not in expected_envs)  # non-carrier + not expected
    
    if a + c == 0:  # no genomes in expected environments at all
        continue
    
    odds_ratio, p_val = stats.fisher_exact([[a, b], [c, d]])
    
    concordance_results.append({
        'gene_cluster_id': cid,
        'gtdb_species_clade_id': sid,
        'orgId': row.get('orgId', ''),
        'locusId': row.get('locusId', ''),
        'desc': row.get('desc', ''),
        'top_condition_class': cond_class,
        'max_abs_fit': row.get('max_abs_fit', np.nan),
        'expected_envs': '|'.join(expected_envs),
        'n_carriers': len(carrier_envs),
        'n_noncarriers': len(noncarrier_envs),
        'carrier_in_expected': a,
        'carrier_pct_expected': a / (a + b) * 100 if (a + b) > 0 else 0,
        'noncarrier_in_expected': c,
        'noncarrier_pct_expected': c / (c + d) * 100 if (c + d) > 0 else 0,
        'odds_ratio': odds_ratio,
        'p_value': p_val,
        'is_concordant': a / (a + b) > c / (c + d) if (a + b) > 0 and (c + d) > 0 else False,
    })

conc_df = pd.DataFrame(concordance_results)
print(f'Concordance tests completed: {len(conc_df)} clusters')
print(f'  By condition class:')
print(conc_df['top_condition_class'].value_counts().to_string())

Concordance tests completed: 47 clusters
  By condition class:
top_condition_class
carbon source      23
nitrogen source     9
stress              8
pH                  4
motility            2
anaerobic           1


In [7]:
# Apply BH-FDR correction
from statsmodels.stats.multitest import multipletests

if len(conc_df) > 0:
    _, fdr, _, _ = multipletests(conc_df['p_value'], method='fdr_bh')
    conc_df['fdr'] = fdr
    
    n_sig = (fdr < 0.05).sum()
    n_concordant = conc_df['is_concordant'].sum()
    n_sig_concordant = ((fdr < 0.05) & conc_df['is_concordant']).sum()
    
    print(f'Concordance summary:')
    print(f'  Total tested: {len(conc_df)}')
    print(f'  Concordant (carrier enriched in expected env): {n_concordant} ({n_concordant/len(conc_df)*100:.1f}%)')
    print(f'  Significant (FDR < 0.05): {n_sig}')
    print(f'  Significant AND concordant: {n_sig_concordant}')
    
    # Per condition class
    print(f'\nPer condition class:')
    for cond in conc_df['top_condition_class'].unique():
        sub = conc_df[conc_df['top_condition_class'] == cond]
        n_c = sub['is_concordant'].sum()
        n_s = (sub['fdr'] < 0.05).sum()
        print(f'  {cond}: {len(sub)} tested, {n_c} concordant ({n_c/len(sub)*100:.0f}%), {n_s} sig')
else:
    print('No clusters could be tested for concordance')
    conc_df['fdr'] = np.nan

Concordance summary:
  Total tested: 47
  Concordant (carrier enriched in expected env): 29 (61.7%)
  Significant (FDR < 0.05): 1
  Significant AND concordant: 0

Per condition class:
  carbon source: 23 tested, 15 concordant (65%), 0 sig
  stress: 8 tested, 2 concordant (25%), 1 sig
  nitrogen source: 9 tested, 7 concordant (78%), 0 sig
  pH: 4 tested, 4 concordant (100%), 0 sig
  anaerobic: 1 tested, 1 concordant (100%), 0 sig
  motility: 2 tested, 0 concordant (0%), 0 sig
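
For readers unfamiliar with the correction applied in this cell, the Benjamini-Hochberg adjustment can be sketched in a few lines. This is an illustrative re-implementation with invented p-values. The notebook itself relies on `multipletests(..., method='fdr_bh')`, and `bh_adjust` is a hypothetical name.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

example_p = [0.01, 0.04, 0.03, 0.20]  # hypothetical p-values
example_fdr = bh_adjust(example_p)
print([round(q, 4) for q in example_fdr])
```

Each p-value is scaled by the number of tests divided by its rank, so with 47 tests a cluster needs a very small raw p-value to pass FDR < 0.05 unless many other clusters have small p-values too.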


In [8]:
# Show top concordant results
if len(conc_df) > 0:
    sig_conc = conc_df[(conc_df['fdr'] < 0.2) & conc_df['is_concordant']].sort_values('odds_ratio', ascending=False)
    if len(sig_conc) > 0:
        print(f'Significant concordant clusters (FDR < 0.2): {len(sig_conc)}')
        cols = ['gene_cluster_id', 'orgId', 'locusId', 'desc', 'top_condition_class',
                'max_abs_fit', 'expected_envs', 'carrier_pct_expected', 'noncarrier_pct_expected',
                'odds_ratio', 'p_value', 'fdr']
        print(sig_conc[cols].head(20).to_string())
    else:
        print('No significant concordant clusters at FDR < 0.2')
        print('\nTop 10 by odds ratio (concordant only):')
        top = conc_df[conc_df['is_concordant']].sort_values('odds_ratio', ascending=False)
        cols = ['gene_cluster_id', 'orgId', 'desc', 'top_condition_class',
                'carrier_pct_expected', 'noncarrier_pct_expected', 'odds_ratio', 'p_value']
        print(top[cols].head(10).to_string())

Significant concordant clusters (FDR < 0.2): 6
            gene_cluster_id           orgId        locusId                  desc top_condition_class  max_abs_fit                                   expected_envs  carrier_pct_expected  noncarrier_pct_expected  odds_ratio   p_value       fdr
24  NZ_JACVAH010000015.1_20  pseudo5_N2C3_1    AO356_12450  hypothetical protein       carbon source     2.076086       soil_sediment|freshwater|plant_associated             62.500000                 0.000000         inf  0.009050  0.092663
23  NZ_JACVAH010000006.1_10  pseudo5_N2C3_1    AO356_25185  hypothetical protein           anaerobic     2.723784      soil_sediment|freshwater|animal_associated             55.555556                 0.000000         inf  0.029412  0.178201
38     NZ_PPRZ01000025.1_37  pseudo5_N2C3_1    AO356_24150  hypothetical protein     nitrogen source     3.012796  soil_sediment|freshwater|wastewater_engineered             55.555556                 0.000000         inf  0.029412

In [9]:
# Figure 11: Concordance matrix — condition class vs observed enrichment direction
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel A: Concordance rate by condition class
ax = axes[0]
if len(conc_df) > 0:
    cond_summary = conc_df.groupby('top_condition_class').agg(
        n_tested=('gene_cluster_id', 'count'),
        n_concordant=('is_concordant', 'sum'),
        n_significant=('fdr', lambda x: (x < 0.05).sum()),
        mean_carrier_pct=('carrier_pct_expected', 'mean'),
        mean_noncarrier_pct=('noncarrier_pct_expected', 'mean'),
    ).reset_index()
    cond_summary['concordance_rate'] = cond_summary['n_concordant'] / cond_summary['n_tested'] * 100
    cond_summary = cond_summary.sort_values('concordance_rate', ascending=True)
    
    y_pos = range(len(cond_summary))
    bars = ax.barh(y_pos, cond_summary['concordance_rate'], color='steelblue', alpha=0.7)
    ax.set_yticks(y_pos)
    ax.set_yticklabels([f"{c} (n={n})" for c, n in zip(cond_summary['top_condition_class'], cond_summary['n_tested'])])
    ax.set_xlabel('Concordance rate (%)')
    ax.set_title('Lab-Field Concordance Rate\nby Condition Class')
    ax.axvline(50, color='red', linestyle='--', alpha=0.5, label='Chance (50%)')
    ax.legend()
    ax.set_xlim(0, 100)

# Panel B: Carrier vs non-carrier % in expected environment
ax = axes[1]
if len(conc_df) > 0:
    for cond in conc_df['top_condition_class'].unique():
        sub = conc_df[conc_df['top_condition_class'] == cond]
        ax.scatter(sub['noncarrier_pct_expected'], sub['carrier_pct_expected'],
                  label=cond, alpha=0.6, s=40)
    
    ax.plot([0, 100], [0, 100], 'k--', alpha=0.3, label='Equal')
    ax.set_xlabel('Non-carrier % in expected environment')
    ax.set_ylabel('Carrier % in expected environment')
    ax.set_title('Carrier vs Non-Carrier\nEnvironmental Enrichment')
    ax.legend(fontsize=7, loc='upper left')
    ax.set_xlim(-5, 105)
    ax.set_ylim(-5, 105)

plt.tight_layout()
plt.savefig(os.path.join(FIG_DIR, 'fig11_concordance_matrix.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved fig11_concordance_matrix.png')

Saved fig11_concordance_matrix.png


## Section 3: NMDC Independent Validation

Use NMDC metagenomic data as an independent check. For taxa that carry dark genes
in the pangenome, estimate their abundance in each NMDC sample from the taxonomy profile,
then correlate that abundance with abiotic measurements.

This is **community-level** validation at **genus** resolution, not gene-level.
Results should be framed as "consistent with" or "independently corroborated by".
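
The per-sample score used in this section is a sum of abundances over all taxa whose genus matches a carrier genus. A minimal sketch of that idea on toy data follows. All sample names, abundances, and the `carrier_score` helper are hypothetical, and the real tables are wide pandas DataFrames rather than nested dicts.

```python
# Hypothetical toy inputs: per-sample abundance keyed by taxid string,
# and a taxid -> genus lookup like the one built from taxonomy_dim.
sample_abundance = {
    'sample_A': {'286': 0.12, '561': 0.30, '1279': 0.05},
    'sample_B': {'286': 0.00, '561': 0.45, '1279': 0.10},
}
taxid_to_genus_toy = {'286': 'Pseudomonas', '561': 'Escherichia', '1279': 'Staphylococcus'}
carrier_genera_toy = {'Pseudomonas', 'Staphylococcus'}

def carrier_score(abundances, taxid_to_genus, carrier_genera):
    """Sum the abundance of every taxon whose genus is a carrier genus."""
    return sum(value for taxid, value in abundances.items()
               if taxid_to_genus.get(taxid) in carrier_genera)

scores = {sid: carrier_score(ab, taxid_to_genus_toy, carrier_genera_toy)
          for sid, ab in sample_abundance.items()}
print(scores)
```

Because the score pools every member of a genus, it cannot distinguish carrier species from non-carrier species within that genus, which is why the results are described as community-level.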

In [10]:
# Load NMDC tables
print('Loading NMDC tables...')
tax_features = spark.sql("SELECT * FROM nmdc_arkin.taxonomy_features").toPandas()
print(f'  taxonomy_features: {tax_features.shape[0]} samples x {tax_features.shape[1]} columns')

abiotic = spark.sql("SELECT * FROM nmdc_arkin.abiotic_features").toPandas()
print(f'  abiotic_features: {abiotic.shape[0]} samples x {abiotic.shape[1]} columns')

tax_dim = spark.sql("SELECT * FROM nmdc_arkin.taxonomy_dim").toPandas()
print(f'  taxonomy_dim: {tax_dim.shape[0]} taxa')

# Identify taxon columns (numeric taxids as column names)
taxon_cols = [c for c in tax_features.columns if c != 'sample_id' and c.isdigit()]
print(f'  Taxon columns: {len(taxon_cols)}')

# Overlap between taxonomy and abiotic samples
shared_samples = set(tax_features['sample_id']) & set(abiotic['sample_id'])
print(f'  Samples with both taxonomy + abiotic: {len(shared_samples)}')

Loading NMDC tables...
  taxonomy_features: 6365 samples x 3493 columns
  abiotic_features: 13847 samples x 22 columns
  taxonomy_dim: 2594787 taxa
  Taxon columns: 3492
  Samples with both taxonomy + abiotic: 6365


In [11]:
# Map dark gene carrier genera to NMDC taxon columns
# Two-tier approach (following prophage_ecology pattern):
# 1. Get GTDB genus for each dark gene carrier species
# 2. Map NMDC taxid columns to genera via taxonomy_dim
# 3. Match by genus name

# Step 1: Extract genus from carrier species
carrier_species = concordance_clusters['gtdb_species_clade_id'].unique()

# Get GTDB genus from pangenome metadata
sp_sdf = spark.createDataFrame(
    [(str(s),) for s in carrier_species],
    ['gtdb_species_clade_id']
)
sp_sdf.createOrReplaceTempView('carrier_species_view')

genus_map = spark.sql("""
    SELECT DISTINCT
        cs.gtdb_species_clade_id,
        REGEXP_EXTRACT(t.genus, 'g__(.*)', 1) AS gtdb_genus,
        CAST(m.ncbi_species_taxid AS INT) AS ncbi_species_taxid
    FROM carrier_species_view cs
    JOIN kbase_ke_pangenome.gtdb_taxonomy_r214v1 t
        ON cs.gtdb_species_clade_id = t.species
    JOIN kbase_ke_pangenome.gtdb_metadata m
        ON t.genome_id = m.accession
    WHERE t.genus IS NOT NULL
""").toPandas()

# Deduplicate — one genus per species
genus_map = genus_map.drop_duplicates(subset='gtdb_species_clade_id')
print(f'Carrier species -> genus mapping: {len(genus_map)}')
print(f'Unique genera: {genus_map["gtdb_genus"].nunique()}')
print(f'\nTop genera:')
print(genus_map['gtdb_genus'].value_counts().head(10).to_string())

Carrier species -> genus mapping: 0
Unique genera: 0

Top genera:
Series([], )
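
The query above returned zero rows, so every downstream NMDC step runs on an empty genus set. One possible cause, not verified here, is that `gtdb_species_clade_id` and the `species` column of the taxonomy table use different string formats, so the join key never matches. If the clade ID embeds the GTDB species name, the genus can be parsed from the ID itself as a fallback. The sketch below assumes a hypothetical ID layout (`s__<Genus>_<species>--<accession>` or the same with a space). `genus_from_clade_id` is an illustrative helper, and the example accession is a placeholder.

```python
import re

# ASSUMPTION: clade IDs look like 's__<Genus>_<species>--<accession>' or
# 's__<Genus> <species>--<accession>'; check the real table before relying on this.
_GENUS_RE = re.compile(r'^s__([A-Z][a-z]+(?:_[A-Z]+)?)[ _]')

def genus_from_clade_id(clade_id, strip_gtdb_suffix=False):
    """Return the genus embedded in a species clade ID, or None if no match."""
    m = _GENUS_RE.match(str(clade_id))
    if m is None:
        return None
    genus = m.group(1)
    if strip_gtdb_suffix:
        # GTDB appends alphabetic suffixes such as '_E' to split genera;
        # NCBI-style genus names lack them.
        genus = re.sub(r'_[A-Z]+$', '', genus)
    return genus

example_id = 's__Pseudomonas_E_fluorescens--RS_GCF_000000000.1'  # placeholder accession
print(genus_from_clade_id(example_id))
print(genus_from_clade_id(example_id, strip_gtdb_suffix=True))
```

Stripping the suffix may help when matching against genus names from an NCBI-based taxonomy, at the cost of merging GTDB genera that share a base name. Placeholder genera that do not fit the pattern return `None` and would need separate handling.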


In [12]:
# Step 2: Map NMDC taxon columns to genera via taxonomy_dim
# taxonomy_dim has taxid -> genus mapping
# Vectorized instead of iterrows(): taxonomy_dim has ~2.6M rows
_td = tax_dim[tax_dim['taxid'].notna() & tax_dim['genus'].notna() & (tax_dim['genus'] != '')]
taxid_to_genus = dict(zip(_td['taxid'].astype(int).astype(str), _td['genus']))

print(f'Taxid -> genus mapping: {len(taxid_to_genus)} taxa')

# Step 3: Find NMDC taxon columns matching carrier genera
carrier_genera = set(genus_map['gtdb_genus'].dropna().unique())
print(f'Carrier genera to match: {len(carrier_genera)}')

matched_cols = {}  # genus -> list of taxon column names
for col in taxon_cols:
    genus = taxid_to_genus.get(col, '')
    if genus in carrier_genera:
        if genus not in matched_cols:
            matched_cols[genus] = []
        matched_cols[genus].append(col)

print(f'Matched genera in NMDC: {len(matched_cols)} / {len(carrier_genera)}')
n_matched_cols = sum(len(v) for v in matched_cols.values())
print(f'Total matched taxon columns: {n_matched_cols}')
if matched_cols:
    print(f'\nMatched genera: {", ".join(sorted(matched_cols.keys())[:15])}')

Taxid -> genus mapping: 2594787 taxa
Carrier genera to match: 0
Matched genera in NMDC: 0 / 0
Total matched taxon columns: 0


In [13]:
# Compute per-sample dark gene carrier score
# For each NMDC sample, sum abundance of taxa whose genera carry dark genes
# Weight by condition class to create condition-specific scores

# Map genus -> condition classes (from our dark gene clusters)
genus_conditions = defaultdict(set)
for _, row in concordance_clusters.iterrows():
    sid = row['gtdb_species_clade_id']
    cond = row['top_condition_class']
    gm = genus_map[genus_map['gtdb_species_clade_id'] == sid]
    if len(gm) > 0:
        genus = gm.iloc[0]['gtdb_genus']
        if genus and pd.notna(genus):
            genus_conditions[genus].add(cond)

print(f'Genera with condition mapping: {len(genus_conditions)}')

# Filter to shared samples
tax_shared = tax_features[tax_features['sample_id'].isin(shared_samples)].copy()
abiotic_shared = abiotic[abiotic['sample_id'].isin(shared_samples)].copy()
print(f'Shared samples for analysis: {len(tax_shared)}')

# Compute overall dark-gene-carrier abundance per sample
all_matched_taxcols = []
for genus, cols in matched_cols.items():
    all_matched_taxcols.extend(cols)

if all_matched_taxcols:
    # Convert taxon columns to numeric
    for col in all_matched_taxcols:
        tax_shared[col] = pd.to_numeric(tax_shared[col], errors='coerce').fillna(0)
    
    tax_shared['dark_carrier_abundance'] = tax_shared[all_matched_taxcols].sum(axis=1)
    print(f'Dark carrier abundance: median={tax_shared["dark_carrier_abundance"].median():.4f}, '
          f'mean={tax_shared["dark_carrier_abundance"].mean():.4f}')
    print(f'  Non-zero samples: {(tax_shared["dark_carrier_abundance"] > 0).sum()} / {len(tax_shared)}')
    
    # Also compute condition-specific scores
    for cond in mapped_conditions:
        cond_genera = [g for g, conds in genus_conditions.items() if cond in conds]
        cond_cols = []
        for g in cond_genera:
            cond_cols.extend(matched_cols.get(g, []))
        if cond_cols:
            score_col = f'score_{cond.replace(" ", "_")}'
            tax_shared[score_col] = tax_shared[cond_cols].sum(axis=1)
            n_nonzero = (tax_shared[score_col] > 0).sum()
            print(f'  {cond}: {len(cond_cols)} taxon cols, {n_nonzero} non-zero samples')
else:
    print('No matched taxon columns — cannot compute carrier abundance')
    tax_shared['dark_carrier_abundance'] = 0

Genera with condition mapping: 0
Shared samples for analysis: 6365
No matched taxon columns — cannot compute carrier abundance


In [14]:
# Correlate dark carrier abundance with abiotic variables
# Merge taxonomy scores with abiotic measurements
merged = tax_shared[['sample_id', 'dark_carrier_abundance'] +
                     [c for c in tax_shared.columns if c.startswith('score_')]].merge(
    abiotic_shared, on='sample_id', how='inner'
)
print(f'Merged samples for correlation: {len(merged)}')

# Get abiotic column names
abiotic_cols = [c for c in abiotic_shared.columns if c.startswith('annotations_')]
score_cols = ['dark_carrier_abundance'] + [c for c in tax_shared.columns if c.startswith('score_')]

# Correlate each score with each abiotic variable
nmdc_results = []

for score_col in score_cols:
    for abiotic_col in abiotic_cols:
        score_vals = pd.to_numeric(merged[score_col], errors='coerce')
        abiotic_vals = pd.to_numeric(merged[abiotic_col], errors='coerce')
        
        valid = score_vals.notna() & abiotic_vals.notna() & (abiotic_vals != 0)
        n_valid = valid.sum()
        
        if n_valid >= 30:
            rho, p = stats.spearmanr(score_vals[valid], abiotic_vals[valid])
            nmdc_results.append({
                'score_type': score_col,
                'abiotic_variable': abiotic_col.replace('annotations_', '').replace('_has_numeric_value', ''),
                'n_samples': n_valid,
                'spearman_rho': rho,
                'p_value': p,
            })

nmdc_df = pd.DataFrame(nmdc_results)
print(f'\nNMDC correlation tests: {len(nmdc_df)}')

if len(nmdc_df) > 0:
    _, nmdc_fdr, _, _ = multipletests(nmdc_df['p_value'], method='fdr_bh')
    nmdc_df['fdr'] = nmdc_fdr
    n_sig = (nmdc_fdr < 0.05).sum()
    print(f'  Significant (FDR < 0.05): {n_sig}')
    
    # Show significant results
    sig = nmdc_df[nmdc_df['fdr'] < 0.05].sort_values('p_value')
    if len(sig) > 0:
        print(f'\nSignificant NMDC correlations:')
        print(sig[['score_type', 'abiotic_variable', 'n_samples', 'spearman_rho', 'p_value', 'fdr']].to_string())
    else:
        print('\nTop 10 correlations by p-value:')
        print(nmdc_df.sort_values('p_value').head(10)[['score_type', 'abiotic_variable', 'n_samples', 'spearman_rho', 'p_value']].to_string())

Merged samples for correlation: 6365

NMDC correlation tests: 15
  Significant (FDR < 0.05): 0

Top 10 correlations by p-value:
               score_type                 abiotic_variable  n_samples  spearman_rho  p_value
0  dark_carrier_abundance                         ammonium         33           NaN      NaN
1  dark_carrier_abundance                ammonium_nitrogen       1230           NaN      NaN
2  dark_carrier_abundance                 carb_nitro_ratio        910           NaN      NaN
3  dark_carrier_abundance                      chlorophyll         34           NaN      NaN
4  dark_carrier_abundance                           conduc         70           NaN      NaN
5  dark_carrier_abundance  depth_has_maximum_numeric_value       4973           NaN      NaN
6  dark_carrier_abundance  depth_has_minimum_numeric_value        349           NaN      NaN
7  dark_carrier_abundance                            depth        517           NaN      NaN
8  dark_carrier_abundance          

  rho, p = stats.spearmanr(score_vals[valid], abiotic_vals[valid])
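
Every rho and p-value in this output is NaN because `dark_carrier_abundance` was set to 0 for all samples, and a rank correlation is undefined when one input is constant. The NaN p-values then pass through the FDR step and are reported as "0 significant", which reads like a negative result although nothing was actually tested. A small guard, sketched here with hypothetical helper names and toy data, would skip such pairs before the correlation is attempted.

```python
def has_variation(values):
    """True if the sequence contains at least two distinct values."""
    return len(set(values)) >= 2

def safe_to_correlate(x, y, min_n=30):
    """Check sample size and variation before calling a rank correlation."""
    return (len(x) == len(y) and len(x) >= min_n
            and has_variation(x) and has_variation(y))

all_zero_score = [0] * 40                      # like dark_carrier_abundance above
abiotic_toy = [float(i) for i in range(40)]    # hypothetical measurements
print(safe_to_correlate(all_zero_score, abiotic_toy))
```

With this check in place the results table would be empty when no genera match, so the saved file would not contain rows that look like completed tests.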


In [15]:
# Check pre-registered concordance: do condition-specific scores correlate
# with their expected abiotic variables?
print('Pre-registered NMDC concordance check:')
print('=' * 60)

for cond, info in CONDITION_ENV_MAP.items():
    score_col = f'score_{cond.replace(" ", "_")}'
    if score_col not in tax_shared.columns:
        print(f'\n{cond}: no score computed (no matched genera)')
        continue
    
    expected_abiotics = info['nmdc_abiotic']
    if not expected_abiotics:
        print(f'\n{cond}: no pre-registered abiotic variables')
        continue
    
    print(f'\n{cond}:')
    for ab in expected_abiotics:
        ab_short = ab.replace('annotations_', '').replace('_has_numeric_value', '')
        match = nmdc_df[
            (nmdc_df['score_type'] == score_col) &
            (nmdc_df['abiotic_variable'] == ab_short)
        ]
        if len(match) > 0:
            r = match.iloc[0]
            sig_str = '*' if r['fdr'] < 0.05 else ''
            print(f'  {ab_short}: rho={r["spearman_rho"]:.3f}, p={r["p_value"]:.2e}, '
                  f'FDR={r["fdr"]:.3f}, n={int(r["n_samples"])} {sig_str}')
        else:
            print(f'  {ab_short}: insufficient data (< 30 valid samples)')

Pre-registered NMDC concordance check:
============================================================

stress: no score computed (no matched genera)

carbon source: no score computed (no matched genera)

nitrogen source: no score computed (no matched genera)

pH: no score computed (no matched genera)

motility: no score computed (no matched genera)

anaerobic: no score computed (no matched genera)


In [16]:
# Figure 12: NMDC correlation results
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel A: Heatmap of correlations (score_type x abiotic_variable)
ax = axes[0]
if len(nmdc_df) > 0:
    pivot = nmdc_df.pivot_table(index='score_type', columns='abiotic_variable',
                                values='spearman_rho', aggfunc='first')
    # Show only variables with at least one |rho| > 0.05
    strong_cols = pivot.columns[pivot.abs().max() > 0.05]
    if len(strong_cols) > 0 and len(pivot) > 0:
        pivot_sub = pivot[strong_cols]
        # Clean labels
        pivot_sub.index = [s.replace('score_', '').replace('dark_carrier_abundance', 'all_dark')
                           for s in pivot_sub.index]
        sns.heatmap(pivot_sub, cmap='RdBu_r', center=0, annot=True, fmt='.2f',
                   ax=ax, cbar_kws={'label': 'Spearman rho'})
        ax.set_title('NMDC: Dark Gene Carrier Abundance\nvs Abiotic Variables')
        ax.set_ylabel('Score type')
    else:
        ax.text(0.5, 0.5, 'No correlations with |rho| > 0.05', ha='center', va='center',
                transform=ax.transAxes)
        ax.set_title('NMDC Correlations')
else:
    ax.text(0.5, 0.5, 'No NMDC data available', ha='center', va='center',
            transform=ax.transAxes)
    ax.set_title('NMDC Correlations')

# Panel B: Volcano plot of NMDC correlations
ax = axes[1]
if len(nmdc_df) > 0:
    colors = ['#E53935' if f < 0.05 else '#2196F3' if f < 0.2 else '#9E9E9E'
              for f in nmdc_df['fdr']]
    ax.scatter(nmdc_df['spearman_rho'],
              -np.log10(nmdc_df['p_value'].clip(lower=1e-20)),
              c=colors, alpha=0.6, s=30)
    ax.axhline(-np.log10(0.05), color='gray', linestyle='--', alpha=0.5)
    ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Spearman rho')
    ax.set_ylabel('-log10(p-value)')
    ax.set_title(f'NMDC Abiotic Correlations\n({len(nmdc_df)} tests, {(nmdc_df["fdr"] < 0.05).sum()} sig)')
    
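    # Label significant points
    sig_points = nmdc_df[nmdc_df['fdr'] < 0.05]
    for _, r in sig_points.iterrows():
        score_label = (r['score_type'].replace('score_', '')
                       .replace('dark_carrier_abundance', 'all'))
        ax.annotate(f"{score_label}\n{r['abiotic_variable']}",
                    (r['spearman_rho'], -np.log10(max(r['p_value'], 1e-20))),
                    fontsize=6, alpha=0.8)
else:
    # Mirror Panel A so an empty result does not leave a blank axis
    ax.text(0.5, 0.5, 'No NMDC data available', ha='center', va='center',
            transform=ax.transAxes)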

plt.tight_layout()
plt.savefig(os.path.join(FIG_DIR, 'fig12_nmdc_correlations.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved fig12_nmdc_correlations.png')

Saved fig12_nmdc_correlations.png
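
The volcano plot colors points by thresholds on the `fdr` column, which is computed earlier in the notebook. This section does not show which correction was used. Assuming it is Benjamini-Hochberg, a minimal NumPy sketch of that adjustment looks like the following; `bh_fdr` and the demo p-values are illustrative, not the notebook's own code or data.

```python
import numpy as np

def bh_fdr(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # p_(i) * n / i for the i-th smallest p-value
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q

# Illustrative p-values only
p_demo = [0.01, 0.04, 0.03, 0.20]
q_demo = bh_fdr(p_demo)
print(np.round(q_demo, 4))
```

With only a handful of tests, a single small raw p-value can stay above the 0.05 FDR cutoff, which is why the horizontal line at raw p = 0.05 in Panel B and the red FDR coloring need not agree.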


## Section 4: Save Outputs and Summary

In [17]:
# Save results
conc_df.to_csv(os.path.join(DATA_DIR, 'lab_field_concordance.tsv'), sep='\t', index=False)
print(f'Saved lab_field_concordance.tsv: {len(conc_df)} clusters')

nmdc_df.to_csv(os.path.join(DATA_DIR, 'nmdc_validation.tsv'), sep='\t', index=False)
print(f'Saved nmdc_validation.tsv: {len(nmdc_df)} correlation tests')

Saved lab_field_concordance.tsv: 47 clusters
Saved nmdc_validation.tsv: 15 correlation tests


In [18]:
# Final summary
print('=' * 70)
print('NB04: LAB-FIELD CONCORDANCE & NMDC VALIDATION — SUMMARY')
print('=' * 70)

print('\n--- Lab-Field Concordance ---')
print(f'Pre-registered condition classes: {len(CONDITION_ENV_MAP)}')
print(f'Clusters tested: {len(conc_df)}')
if len(conc_df) > 0:
    n_conc = conc_df['is_concordant'].sum()
    n_sig = (conc_df['fdr'] < 0.05).sum()
    n_sig_conc = ((conc_df['fdr'] < 0.05) & conc_df['is_concordant']).sum()
    print(f'  Concordant: {n_conc}/{len(conc_df)} ({n_conc/len(conc_df)*100:.1f}%)')
    print(f'  Significant (FDR < 0.05): {n_sig}')
    print(f'  Significant + concordant: {n_sig_conc}')

print('\n--- NMDC Validation ---')
if len(nmdc_df) > 0:
    print(f'Samples with taxonomy + abiotic: {len(shared_samples)}')
    print(f'Matched genera in NMDC: {len(matched_cols)}')
    print(f'Correlation tests: {len(nmdc_df)}')
    n_nmdc_sig = (nmdc_df['fdr'] < 0.05).sum()
    print(f'  Significant (FDR < 0.05): {n_nmdc_sig}')
else:
    print('No NMDC correlations could be computed')

print('\nOutput files:')
for f in ['lab_field_concordance.tsv', 'nmdc_validation.tsv']:
    fp = os.path.join(DATA_DIR, f)
    if os.path.exists(fp):
        size_kb = os.path.getsize(fp) / 1024
        print(f'  {f}: {size_kb:.1f} KB')
print('\nFigures:')
for f in ['fig11_concordance_matrix.png', 'fig12_nmdc_correlations.png']:
    fp = os.path.join(FIG_DIR, f)
    if os.path.exists(fp):
        print(f'  {f}')

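======================================================================
NB04: LAB-FIELD CONCORDANCE & NMDC VALIDATION — SUMMARY
======================================================================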

--- Lab-Field Concordance ---
Pre-registered condition classes: 6
Clusters tested: 47
  Concordant: 29/47 (61.7%)
  Significant (FDR < 0.05): 1
  Significant + concordant: 0

--- NMDC Validation ---
Samples with taxonomy + abiotic: 6365
Matched genera in NMDC: 0
Correlation tests: 15
  Significant (FDR < 0.05): 0

Output files:
  lab_field_concordance.tsv: 12.0 KB
  nmdc_validation.tsv: 0.7 KB

Figures:
  fig11_concordance_matrix.png
  fig12_nmdc_correlations.png
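
The summary reports 29 of 47 clusters as concordant (61.7%). As a rough way to read that rate, the sketch below computes an exact one-sided binomial tail against a 50% chance rate. The 50% null is an assumption made purely for illustration; the notebook does not establish what concordance rate to expect by chance, and `binom_tail` is a hypothetical helper.

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_conc, n_total = 29, 47   # counts taken from the summary above
rate = n_conc / n_total
# ASSUMPTION: chance concordance of 0.5, not derived from the data
p_one_sided = binom_tail(n_conc, n_total, p=0.5)

print(f'Concordant: {n_conc}/{n_total} ({rate * 100:.1f}%)')
print(f'One-sided binomial p vs 50% chance: {p_one_sided:.3f}')
```

A more defensible null would use the background frequency of each expected environment category rather than a flat 50%.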
