# NB02: File→Sample Bridge and Taxonomy→GTDB Mapping

**Project**: Community Metabolic Ecology via NMDC × Pangenome Integration  
**Requires**: BERDL JupyterHub (Spark — `get_spark_session()` injected into kernel)  

## Purpose

NB01 confirmed that classifier files (`nmdc:dobj-11-*`) and metabolomics files (`nmdc:dobj-12-*`)  
share **zero file_id overlap** — they are different workflow output types. The shared identifier  
is the **biosample/sample_id** (e.g., `nmdc:bsm-11-*`). This notebook:

1. **Part 1**: Find the `file_id → sample_id` bridge in `nmdc_arkin`
2. **Part 2**: Build sample inventory — samples with both metagenomics classifier AND metabolomics data
3. **Part 3**: Map NMDC taxon names (centrifuge_gold) → GTDB species (`gtdb_species_clade`)
4. **Part 4**: Compute bridge quality per sample
5. **Part 5**: Save outputs

## Inputs

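- `nmdc_arkin` tables: `centrifuge_gold`, `metabolomics_gold`, `abiotic_features`, `study_table` (Part 1 also scans every other table in the database for `file_id` / `sample_id` columns)
- `kbase_ke_pangenome`: `gtdb_species_clade`
- `nmdc_ncbi_biosamples`: `biosamples_ids`, `biosamples_links` (fallback in Step 1h, used only if no bridge is found in `nmdc_arkin`)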

## Outputs

- `data/sample_file_bridge.csv` — sample_id ↔ file_id mapping for all omics types
- `data/nmdc_sample_inventory.csv` — updated with samples having paired omics (replaces NB01 empty file)
- `data/taxon_bridge.tsv` — NMDC taxon name → GTDB species clade mappings with confidence tiers
- `data/bridge_quality.csv` — per-sample fraction of community abundance mapped to pangenome

In [None]:
# On BERDL JupyterHub — get_spark_session() is injected into the kernel; no import needed
spark = get_spark_session()
spark

In [None]:
import os
import re
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

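# Notebooks have no __file__; dirname of the literal string '__file__' is '',
# so this resolves to the parent of the current working directory.
PROJECT_DIR = os.path.abspath(os.path.join(os.path.dirname('__file__'), '..'))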
DATA_DIR = os.path.join(PROJECT_DIR, 'data')
FIGURES_DIR = os.path.join(PROJECT_DIR, 'figures')
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIGURES_DIR, exist_ok=True)
print(f'DATA_DIR: {DATA_DIR}')
print(f'FIGURES_DIR: {FIGURES_DIR}')

---
## Part 1: Find the file_id → sample_id Bridge

NB01 found that classifier tables and metabolomics tables use non-overlapping `file_id` namespaces.  
The NMDC data model links files to biosamples through workflow activities.  
We explore candidate tables in `nmdc_arkin` to find a `file_id → sample_id` mapping.
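
As a purely illustrative sketch of what such a bridge amounts to once it is found, the two-hop lookup below resolves a file to a biosample through a workflow activity. The dictionaries, the `resolve_sample` helper, and every ID are invented for illustration; they are not tables or records in `nmdc_arkin`, and the real linkage may be stored differently.

```python
# Hypothetical toy mappings; real NMDC IDs and table layouts will differ.
file_to_activity = {
    'dobj-11-aaa': 'wf-x1',   # stands in for a classifier output file
    'dobj-12-bbb': 'wf-y1',   # stands in for a metabolomics output file
    'dobj-11-ccc': 'wf-x2',   # activity with no known biosample
}
activity_to_sample = {
    'wf-x1': 'bsm-11-s1',
    'wf-y1': 'bsm-11-s1',
}


def resolve_sample(file_id):
    """Return the sample_id for a file_id, or None if either hop is missing."""
    activity = file_to_activity.get(file_id)
    return activity_to_sample.get(activity)


# Keep only files that resolve all the way to a sample
bridge_rows = [
    (file_id, resolve_sample(file_id))
    for file_id in file_to_activity
    if resolve_sample(file_id) is not None
]
print(bridge_rows)
```

In this toy case both omics files resolve to the same biosample even though their file IDs share nothing, which is the situation NB01 reported.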

In [None]:
# Step 1a: Inspect candidate bridge tables — look for tables that have both file_id and sample_id
# Candidates from schema doc: sample_tokens_v1, taxonomy_dim, taxstring_lookup,
# embedding_metadata, taxonomy_embeddings, trait_unified

candidate_tables = [
    'sample_tokens_v1',
    'taxonomy_dim',
    'taxstring_lookup',
    'embedding_metadata',
    'taxonomy_embeddings',
    'trait_unified',
    'biochemical_features',
    'biochemical_embeddings',
    'abiotic_embeddings',
]

bridge_candidates = []  # tables that have both file_id and sample_id

for tbl in candidate_tables:
    try:
        schema = spark.sql(f'DESCRIBE nmdc_arkin.{tbl}').toPandas()
        cols = set(schema['col_name'].tolist())
        has_file = 'file_id' in cols
        has_sample = 'sample_id' in cols
        n_cols = len(cols)
        print(f'{tbl}: {n_cols} cols, file_id={has_file}, sample_id={has_sample}')
        if has_file or has_sample:
            print(f'  -> columns: {sorted(cols)[:12]}')
        if has_file and has_sample:
            bridge_candidates.append(tbl)
            print(f'  *** BRIDGE CANDIDATE ***')
    except Exception as e:
        print(f'{tbl}: ERROR — {e}')

print(f'\nBridge candidates (have both file_id and sample_id): {bridge_candidates}')

In [None]:
# Step 1b: List all tables in nmdc_arkin to find any additional candidates
all_tables = spark.sql('SHOW TABLES IN nmdc_arkin').toPandas()
print(f'Total tables in nmdc_arkin: {len(all_tables)}')
print(all_tables['tableName'].sort_values().tolist())

In [None]:
# Step 1c: Scan ALL tables in nmdc_arkin for file_id OR sample_id columns
# This finds any table not covered by the candidate list above

table_names = all_tables['tableName'].tolist()
file_id_tables = []
sample_id_tables = []
both_id_tables = []

for tbl in table_names:
    try:
        schema = spark.sql(f'DESCRIBE nmdc_arkin.{tbl}').toPandas()
        cols = set(schema['col_name'].tolist())
        has_file = 'file_id' in cols
        has_sample = 'sample_id' in cols
        if has_file:
            file_id_tables.append(tbl)
        if has_sample:
            sample_id_tables.append(tbl)
        if has_file and has_sample:
            both_id_tables.append(tbl)
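    except Exception as e:
        # Report rather than silently skip, so unreadable tables stay visible
        print(f'{tbl}: could not read schema ({e})')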

print('Tables with file_id:', file_id_tables)
print()
print('Tables with sample_id:', sample_id_tables)
print()
print('Tables with BOTH file_id and sample_id (bridge candidates):', both_id_tables)

In [None]:
# Step 1d: Inspect the bridge table(s) found above
# If both_id_tables is non-empty, show schema and sample rows for each

if both_id_tables:
    for tbl in both_id_tables:
        print(f'\n=== nmdc_arkin.{tbl} schema ===')
        spark.sql(f'DESCRIBE nmdc_arkin.{tbl}').show(30, truncate=False)

        n = spark.sql(f'SELECT COUNT(*) as n FROM nmdc_arkin.{tbl}').collect()[0]['n']
        print(f'Row count: {n}')

        print(f'Sample rows:')
        spark.sql(f'SELECT * FROM nmdc_arkin.{tbl} LIMIT 5').show(truncate=False)
else:
    print('No table with both file_id and sample_id found in nmdc_arkin.')
    print('Will attempt bridge via file_name parsing (Step 1g).')

In [None]:
# Step 1e: Explore sample_id format in abiotic_features
# to understand what ID format we need to bridge TO
print('=== abiotic_features sample_id examples ===')
abiotic_ids = spark.sql("""
    SELECT sample_id
    FROM nmdc_arkin.abiotic_features
    LIMIT 10
""").toPandas()
print(abiotic_ids['sample_id'].tolist())

# Also check study_table: do study_ids appear in sample_ids?
print('\nStudy_id examples from study_table:')
study_ids = spark.sql("""
    SELECT study_id FROM nmdc_arkin.study_table LIMIT 5
""").toPandas()
print(study_ids['study_id'].tolist())

# Check if sample_id in abiotic_features starts with 'nmdc:bsm-'
bsm_count = spark.sql("""
    SELECT COUNT(*) as n
    FROM nmdc_arkin.abiotic_features
    WHERE sample_id LIKE 'nmdc:bsm-%'
""").collect()[0]['n']
print(f'\nabiotic_features rows where sample_id starts with nmdc:bsm-: {bsm_count}')

In [None]:
# Step 1f: Check if the sample_tokens_v1 table has more columns than documented
# and whether it contains file_id or a file reference field
try:
    print('=== sample_tokens_v1 schema ===')
    spark.sql('DESCRIBE nmdc_arkin.sample_tokens_v1').show(40, truncate=False)
    n = spark.sql('SELECT COUNT(*) as n FROM nmdc_arkin.sample_tokens_v1').collect()[0]['n']
    print(f'Row count: {n}')
    print('Sample rows:')
    spark.sql('SELECT * FROM nmdc_arkin.sample_tokens_v1 LIMIT 3').show(truncate=False)
except Exception as e:
    print(f'sample_tokens_v1 error: {e}')

# Also check embeddings_v1 for sample_id format
try:
    print('\n=== embeddings_v1 schema ===')
    spark.sql('DESCRIBE nmdc_arkin.embeddings_v1').show(20, truncate=False)
    n = spark.sql('SELECT COUNT(*) as n FROM nmdc_arkin.embeddings_v1').collect()[0]['n']
    print(f'Row count: {n}')
    # Show non-vector columns (sample_id etc.)
    emb_schema = spark.sql('DESCRIBE nmdc_arkin.embeddings_v1').toPandas()
    str_cols = emb_schema[emb_schema['data_type'] == 'string']['col_name'].tolist()
    if str_cols:
        cols_sql = ', '.join([f'`{c}`' for c in str_cols[:6]])
        spark.sql(f'SELECT {cols_sql} FROM nmdc_arkin.embeddings_v1 LIMIT 5').show(truncate=False)
except Exception as e:
    print(f'embeddings_v1 error: {e}')

In [None]:
# Step 1g: Try to build the bridge from the best candidate found above.
# If a table with both file_id and sample_id was found, use it directly.
# Otherwise, try to link through file_name: kraken file_names encode the
# workflow activity ID (e.g., 'nmdc_wfrbt-11-krmkys65.1_kraken2_report.tsv').
# Check if any table maps workflow activity IDs to biosample IDs.

file_bridge_df = None  # will be set below if a bridge is found

if both_id_tables:
    # Use the first table with both IDs
    bridge_tbl = both_id_tables[0]
    print(f'Using bridge table: nmdc_arkin.{bridge_tbl}')
    file_bridge_df = spark.sql(f"""
        SELECT DISTINCT file_id, sample_id
        FROM nmdc_arkin.{bridge_tbl}
        WHERE file_id IS NOT NULL AND sample_id IS NOT NULL
    """).toPandas()
    print(f'Bridge rows (distinct file_id, sample_id): {len(file_bridge_df)}')
    print('Sample rows:')
    print(file_bridge_df.head(5).to_string())

    # Check coverage: how many classifier file_ids are in the bridge?
    clf_files = spark.sql("""
        SELECT DISTINCT file_id FROM nmdc_arkin.centrifuge_gold
    """).toPandas()['file_id']
    met_files = spark.sql("""
        SELECT DISTINCT file_id FROM nmdc_arkin.metabolomics_gold
    """).toPandas()['file_id']

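    # Count distinct source file_ids present in the bridge (not bridge rows,
    # which would overcount files that map to more than one sample)
    bridged_ids = set(file_bridge_df['file_id'])
    n_clf_bridged = clf_files.isin(bridged_ids).sum()
    n_met_bridged = met_files.isin(bridged_ids).sum()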
    print(f'\nClassifier file_ids in bridge: {n_clf_bridged} / {len(clf_files)}')
    print(f'Metabolomics file_ids in bridge: {n_met_bridged} / {len(met_files)}')
else:
    print('No direct bridge table found. Attempting file_name-based bridge...')
    # Inspect file_name patterns in classifier and metabolomics tables
    print('Classifier file_name samples:')
    spark.sql("""
        SELECT DISTINCT file_id, file_name
        FROM nmdc_arkin.centrifuge_gold
        LIMIT 5
    """).show(truncate=False)
    print('Metabolomics file_name samples:')
    spark.sql("""
        SELECT DISTINCT file_id, file_name
        FROM nmdc_arkin.metabolomics_gold
        LIMIT 5
    """).show(truncate=False)

In [None]:
# Step 1h: If no direct bridge was found, check nmdc_ncbi_biosamples for a cross-reference
# The nmdc_ncbi_biosamples database may contain a file → biosample link

if file_bridge_df is None or len(file_bridge_df) == 0:
    print('Exploring nmdc_ncbi_biosamples for file → sample links...')
    try:
        ncbi_tables = spark.sql('SHOW TABLES IN nmdc_ncbi_biosamples').toPandas()
        print('Tables in nmdc_ncbi_biosamples:', ncbi_tables['tableName'].tolist())

        # Check biosamples_ids for file_id or dobj references
        try:
            spark.sql('DESCRIBE nmdc_ncbi_biosamples.biosamples_ids').show(20, truncate=False)
            spark.sql('SELECT * FROM nmdc_ncbi_biosamples.biosamples_ids LIMIT 5').show(truncate=False)
        except Exception as e:
            print(f'biosamples_ids: {e}')

        # Check biosamples_links for file_id
        try:
            link_schema = spark.sql('DESCRIBE nmdc_ncbi_biosamples.biosamples_links').toPandas()
            print('\nbiosamples_links columns:', link_schema['col_name'].tolist())
            spark.sql('SELECT * FROM nmdc_ncbi_biosamples.biosamples_links LIMIT 5').show(truncate=False)
        except Exception as e:
            print(f'biosamples_links: {e}')

    except Exception as e:
        print(f'nmdc_ncbi_biosamples exploration error: {e}')
else:
    print(f'Bridge already found ({len(file_bridge_df)} rows). Skipping nmdc_ncbi_biosamples.')

---
## Part 2: Sample Inventory — Samples with Both Omics Data Types

Using the bridge found in Part 1, identify `sample_id` values that have both  
a metagenomics classifier file (centrifuge_gold) AND a metabolomics file (metabolomics_gold).
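
The overlap logic can be sketched on toy data as follows. This assumes the bridge is available as `(file_id, sample_id)` pairs; the `samples_for` helper and all IDs are invented for illustration.

```python
# Toy bridge: (file_id, sample_id) pairs. All IDs are invented.
toy_pairs = [
    ('clf-1', 'S1'), ('clf-2', 'S2'), ('met-1', 'S1'), ('met-2', 'S3'),
]
toy_clf_files = {'clf-1', 'clf-2'}   # stands in for centrifuge_gold file_ids
toy_met_files = {'met-1', 'met-2'}   # stands in for metabolomics_gold file_ids


def samples_for(files, pairs):
    """Collect the sample_ids reachable from a set of file_ids."""
    return {sample for file_id, sample in pairs if file_id in files}


toy_clf_samples = samples_for(toy_clf_files, toy_pairs)
toy_met_samples = samples_for(toy_met_files, toy_pairs)
toy_paired = toy_clf_samples & toy_met_samples
print(sorted(toy_paired))  # ['S1']
```

Only `S1` has both data types here: `S2` has a classifier file only and `S3` has a metabolomics file only.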

In [None]:
# Build sample overlap via the bridge
# Strategy: file_bridge_df maps file_id → sample_id
# Join centrifuge_gold file_ids through bridge to get sample_ids with metagenomics
# Join metabolomics_gold file_ids through bridge to get sample_ids with metabolomics
# Intersect the two sample_id sets

if file_bridge_df is not None and len(file_bridge_df) > 0:
    # Register bridge as a Spark temp view for SQL joins
    bridge_spark = spark.createDataFrame(file_bridge_df[['file_id', 'sample_id']].drop_duplicates())
    bridge_spark.createOrReplaceTempView('file_sample_bridge')

    # Samples with metagenomics (centrifuge) classifier data
    clf_samples = spark.sql("""
        SELECT DISTINCT b.sample_id
        FROM (SELECT DISTINCT file_id FROM nmdc_arkin.centrifuge_gold) c
        JOIN file_sample_bridge b ON c.file_id = b.file_id
    """).toPandas()

    # Samples with metabolomics data
    met_samples = spark.sql("""
        SELECT DISTINCT b.sample_id
        FROM (SELECT DISTINCT file_id FROM nmdc_arkin.metabolomics_gold) m
        JOIN file_sample_bridge b ON m.file_id = b.file_id
    """).toPandas()

    # Samples with BOTH
    clf_set = set(clf_samples['sample_id'])
    met_set = set(met_samples['sample_id'])
    overlap_samples = clf_set & met_set

    print(f'Samples with classifier data (centrifuge): {len(clf_set)}')
    print(f'Samples with metabolomics data:            {len(met_set)}')
    print(f'Samples with BOTH (overlap):               {len(overlap_samples)}')
    overlap_sample_ids = list(overlap_samples)
else:
    print('ERROR: No file→sample bridge available. Cannot compute sample overlap.')
    print('Manual inspection of bridge tables is required before proceeding.')
    # Set fallback empty state
    clf_set = set()
    met_set = set()
    overlap_sample_ids = []

In [None]:
# Load abiotic features for the overlap samples.
# abiotic_features is keyed by sample_id, so it can be filtered directly;
# the link to study_table for ecosystem metadata is still unresolved (see notes below).

if overlap_sample_ids:
    # Load abiotic features for overlap samples
    abiotic_df = spark.sql('SELECT * FROM nmdc_arkin.abiotic_features').toPandas()
    abiotic_overlap = abiotic_df[abiotic_df['sample_id'].isin(overlap_sample_ids)].copy()
    print(f'Abiotic features rows for overlap samples: {len(abiotic_overlap)}')

    # Study table for ecosystem categorization
    # study_id format: nmdc:sty-11-XXXXXXXX
    # sample_id format: nmdc:bsm-11-XXXXXXXX (if biosample prefix)
    # Link via file_bridge_df: get file_id → study linkage from file_name
    # file_name for classifier: 'nmdc_wfrbt-11-krmkys65.1_kraken2_report.tsv'
    # Study linkage must come through study_table

    # For now, check what sample_id values look like
    print(f'\nFirst 10 overlap sample_ids: {overlap_sample_ids[:10]}')
else:
    print('No overlap samples — skipping ecosystem metadata merge.')
    abiotic_overlap = pd.DataFrame()

In [None]:
# Build per-sample file inventory with omics type labels
# For each overlap sample: list its classifier file_id and metabolomics file_id

if overlap_sample_ids and file_bridge_df is not None:
    # Get classifier file_ids per sample
    clf_file_df = spark.sql("""
        SELECT DISTINCT b.sample_id, c.file_id as clf_file_id, c.file_name as clf_file_name
        FROM (SELECT DISTINCT file_id, file_name FROM nmdc_arkin.centrifuge_gold) c
        JOIN file_sample_bridge b ON c.file_id = b.file_id
    """).toPandas()
    clf_file_df = clf_file_df[clf_file_df['sample_id'].isin(overlap_sample_ids)]

    # Get metabolomics file_ids per sample
    met_file_df = spark.sql("""
        SELECT DISTINCT b.sample_id, m.file_id as met_file_id, m.file_name as met_file_name
        FROM (SELECT DISTINCT file_id, file_name FROM nmdc_arkin.metabolomics_gold) m
        JOIN file_sample_bridge b ON m.file_id = b.file_id
    """).toPandas()
    met_file_df = met_file_df[met_file_df['sample_id'].isin(overlap_sample_ids)]

    # Combine into inventory
    inventory = clf_file_df.merge(met_file_df, on='sample_id', how='inner')
    print(f'Inventory rows (sample × clf_file × met_file): {len(inventory)}')
    print(f'Unique samples in inventory: {inventory["sample_id"].nunique()}')
    print(inventory.head(5).to_string())
else:
    print('Cannot build inventory — no bridge or no overlap samples.')
    inventory = pd.DataFrame()

---
## Part 3: NMDC Taxon Names → GTDB Species Clade

Map species-rank taxon names from `centrifuge_gold` to GTDB species clades  
in `kbase_ke_pangenome.gtdb_species_clade` using normalized name matching.  

Approach (following `enigma_contamination_functional_potential` NB02):
- Normalize NCBI and GTDB names the same way (strip the GTDB `s__` prefix and a leading `Candidatus`, lowercase, join tokens with `_`)
- Match at species level first (exact match on the normalized species name)
- Fall back to a genus-level match for unresolved taxa
- Report confidence tier: `species_exact`, `genus_proxy_unique` (one GTDB clade in the genus), `genus_proxy_ambiguous` (several clades in the genus), or `unmapped`
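
The tiering can be illustrated with a small standalone sketch. The `normalize`, `genus_of`, and `tier` helpers below are simplified stand-ins written for this example, and the taxon names are invented placeholders, not real NCBI or GTDB entries.

```python
import re


def normalize(name):
    """Lowercase, drop a GTDB 's__' prefix and 'Candidatus', join tokens with '_'."""
    name = name.strip().lower()
    name = re.sub(r'^s__', '', name)
    name = re.sub(r'^candidatus\s+', '', name)
    return re.sub(r'[^a-z0-9]+', '_', name).strip('_')


def genus_of(name):
    """First token of the normalized name."""
    return normalize(name).split('_')[0]


# Invented GTDB-style reference names, for illustration only
gtdb_names = ['s__Alpha_one', 's__Beta_one', 's__Beta_two']
species_index = {normalize(n) for n in gtdb_names}
genus_index = {}
for n in gtdb_names:
    genus_index.setdefault(genus_of(n), set()).add(normalize(n))


def tier(ncbi_name):
    """Assign a confidence tier: species match first, then genus fallback."""
    if normalize(ncbi_name) in species_index:
        return 'species_exact'
    n_clades = len(genus_index.get(genus_of(ncbi_name), ()))
    if n_clades == 1:
        return 'genus_proxy_unique'
    if n_clades > 1:
        return 'genus_proxy_ambiguous'
    return 'unmapped'


for name in ['Candidatus Alpha one', 'Alpha two', 'Beta three', 'Gamma one']:
    print(name, '->', tier(name))
```

The four inputs land in `species_exact`, `genus_proxy_unique`, `genus_proxy_ambiguous`, and `unmapped` respectively.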

In [None]:
# Get all unique species-rank taxa from centrifuge_gold
# Using all files (not just overlap) to build a comprehensive bridge table
# centrifuge_gold: rank='species', label=taxon name, abundance=relative abundance

centrifuge_species = spark.sql("""
    SELECT label as taxon_name, COUNT(DISTINCT file_id) as n_files,
           AVG(abundance) as mean_abundance
    FROM nmdc_arkin.centrifuge_gold
    WHERE LOWER(rank) = 'species'
      AND label IS NOT NULL
      AND label != ''
    GROUP BY label
    ORDER BY n_files DESC
""").toPandas()

print(f'Unique species-rank taxa in centrifuge_gold: {len(centrifuge_species)}')
print('Most common taxa:')
print(centrifuge_species.head(10).to_string())

In [None]:
# Normalization functions for name matching
def norm_species(x: str) -> str:
    """Normalize NCBI species name for GTDB matching.
    NCBI: 'Candidatus Methylotenera versatilis' -> 'methylotenera_versatilis'
    GTDB: 's__Methylotenera_versatilis' -> 'methylotenera_versatilis'
    """
    if pd.isna(x) or not str(x).strip():
        return ''
    x = str(x).strip().lower()
    # Remove GTDB prefix if present
    x = re.sub(r'^s__', '', x)
    # Remove 'candidatus' prefix
    x = re.sub(r'^candidatus\s+', '', x)
    # Replace spaces and special chars with underscore
    x = re.sub(r'[^a-z0-9]+', '_', x)
    return x.strip('_')


def norm_genus(x: str) -> str:
    """Extract and normalize genus from species name."""
    if pd.isna(x) or not str(x).strip():
        return ''
    x = str(x).strip().lower()
    x = re.sub(r'^s__', '', x)
    x = re.sub(r'^g__', '', x)
    x = re.sub(r'^candidatus\s+', '', x)
    # Take the first alphanumeric token as genus; findall skips leading
    # punctuation such as the bracket in '[Clostridium] scindens'
    tokens = re.findall(r'[a-z0-9]+', x)
    return tokens[0] if tokens else ''


# Prepare centrifuge taxa with normalized names
centrifuge_species['species_norm'] = centrifuge_species['taxon_name'].map(norm_species)
centrifuge_species['genus_norm'] = centrifuge_species['taxon_name'].map(norm_genus)
centrifuge_species = centrifuge_species[centrifuge_species['species_norm'] != '']
print(f'Taxa with non-empty normalized name: {len(centrifuge_species)}')
print(centrifuge_species[['taxon_name', 'species_norm', 'genus_norm']].head(10).to_string())

In [None]:
# Load GTDB species clades from pangenome
# GTDB species names look like: 's__Methylotenera_versatilis'
gtdb_species = spark.sql("""
    SELECT DISTINCT gtdb_species_clade_id, GTDB_species
    FROM kbase_ke_pangenome.gtdb_species_clade
""").toPandas()

print(f'GTDB species clades: {len(gtdb_species)}')
print('Sample GTDB species names:')
print(gtdb_species['GTDB_species'].head(10).tolist())

# Build normalized species and genus columns for GTDB
gtdb_species['species_norm'] = gtdb_species['GTDB_species'].map(norm_species)
gtdb_species['genus_norm'] = gtdb_species['GTDB_species'].map(norm_genus)
gtdb_species = gtdb_species[gtdb_species['species_norm'] != ''].drop_duplicates()

print(f'\nGTDB species with non-empty normalized name: {len(gtdb_species)}')
print(gtdb_species[['GTDB_species', 'species_norm', 'genus_norm']].head(5).to_string())

In [None]:
# Step 3a: Species-level exact match (normalized NCBI species name == normalized GTDB species name)
species_exact = centrifuge_species.merge(
    gtdb_species[['gtdb_species_clade_id', 'GTDB_species', 'species_norm']],
    on='species_norm',
    how='left'
)
species_exact['mapping_tier'] = species_exact['gtdb_species_clade_id'].apply(
    lambda v: 'species_exact' if pd.notna(v) else 'unmapped'
)

mapped_species = species_exact[species_exact['mapping_tier'] == 'species_exact']
still_unmapped = species_exact[species_exact['mapping_tier'] == 'unmapped']

print(f'Species-exact matches: {mapped_species["taxon_name"].nunique()}')
print(f'Unmapped after species-exact: {still_unmapped["taxon_name"].nunique()}')
print(f'Species-exact match rate: {mapped_species["taxon_name"].nunique() / len(centrifuge_species):.1%}')

In [None]:
# Step 3b: Genus-level fallback for unmapped taxa
# Match on normalized genus; if the genus has exactly one GTDB clade, use it (genus_proxy_unique)
# If it has several, keep the alphabetically first as representative and tag genus_proxy_ambiguous

# Count GTDB species per genus
gtdb_genus_counts = (
    gtdb_species.groupby('genus_norm')['gtdb_species_clade_id']
    .nunique()
    .rename('n_gtdb_clades_for_genus')
    .reset_index()
)

# Best representative GTDB clade per genus (use first alphabetically when ambiguous)
gtdb_genus_rep = (
    gtdb_species.sort_values(['genus_norm', 'GTDB_species'])
    .drop_duplicates(subset=['genus_norm'])
    [['genus_norm', 'gtdb_species_clade_id', 'GTDB_species']]
    .rename(columns={'GTDB_species': 'GTDB_species_representative'})
)

genus_fallback = still_unmapped[['taxon_name', 'species_norm', 'genus_norm', 'n_files', 'mean_abundance']].merge(
    gtdb_genus_rep,
    on='genus_norm',
    how='left'
).merge(
    gtdb_genus_counts,
    on='genus_norm',
    how='left'
)
genus_fallback['n_gtdb_clades_for_genus'] = genus_fallback['n_gtdb_clades_for_genus'].fillna(0).astype(int)
genus_fallback['mapping_tier'] = genus_fallback.apply(
    lambda row: (
        'genus_proxy_unique' if row['n_gtdb_clades_for_genus'] == 1
        else 'genus_proxy_ambiguous' if row['n_gtdb_clades_for_genus'] > 1
        else 'unmapped'
    ),
    axis=1
)

print('Genus fallback mapping tier distribution:')
print(genus_fallback['mapping_tier'].value_counts().to_string())

In [None]:
# Step 3c: Combine species_exact and genus_fallback into the full bridge table

# Standardize columns for concatenation
species_exact_out = mapped_species[['taxon_name', 'species_norm', 'genus_norm',
                                     'gtdb_species_clade_id', 'GTDB_species',
                                     'n_files', 'mean_abundance', 'mapping_tier']].copy()
species_exact_out['n_gtdb_clades_for_genus'] = 1
species_exact_out['GTDB_species_representative'] = species_exact_out['GTDB_species']

genus_fallback_out = genus_fallback.copy()
genus_fallback_out['GTDB_species'] = genus_fallback_out.get('GTDB_species_representative',
                                                              pd.Series([''] * len(genus_fallback)))

# Split fallback rows into genus-level hits and taxa with no GTDB match
# (species-exact rows are added separately from species_exact_out)
genus_mapped = genus_fallback_out[genus_fallback_out['mapping_tier'] != 'unmapped'].copy()
truly_unmapped = genus_fallback_out[genus_fallback_out['mapping_tier'] == 'unmapped'].copy()

bridge = pd.concat([
    species_exact_out[['taxon_name', 'species_norm', 'genus_norm',
                        'gtdb_species_clade_id', 'GTDB_species',
                        'n_files', 'mean_abundance', 'mapping_tier', 'n_gtdb_clades_for_genus']],
    genus_mapped[['taxon_name', 'species_norm', 'genus_norm',
                   'gtdb_species_clade_id', 'GTDB_species_representative',
                   'n_files', 'mean_abundance', 'mapping_tier', 'n_gtdb_clades_for_genus']].rename(
                   columns={'GTDB_species_representative': 'GTDB_species'}),
    truly_unmapped[['taxon_name', 'species_norm', 'genus_norm',
                     'n_files', 'mean_abundance', 'mapping_tier', 'n_gtdb_clades_for_genus']],
], ignore_index=True)

print(f'Total bridge rows: {len(bridge)}')
print('Mapping tier distribution:')
print(bridge['mapping_tier'].value_counts().to_string())

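# Use unique taxa in both numerator and denominator (bridge can hold one row per matching clade)
mapped_frac = bridge[bridge['mapping_tier'] != 'unmapped']['taxon_name'].nunique() / bridge['taxon_name'].nunique()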
print(f'\nOverall taxon mapping rate: {mapped_frac:.1%}')

---
## Part 4: Bridge Quality Per Sample

For each overlap sample (samples with both classifier and metabolomics data),  
compute the fraction of species-rank community abundance that maps to a GTDB clade (any tier except `unmapped`).  

Quality metric: `bridge_coverage = Σ(abundance_i for mapped taxa) / Σ(abundance_i for all species-rank taxa)`, computed per classifier file.  
Flag files (and hence samples) below 30% bridge coverage for exclusion.
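
A minimal worked example of the metric on toy data, assuming pandas and a set of mapped taxon names. The file IDs, taxa, and abundances are invented; the column name `mapped_abundance` mirrors the one used in the next cell.

```python
import pandas as pd

# Toy species-rank rows; file IDs, taxa, and abundances are invented
toy = pd.DataFrame({
    'file_id':    ['f1', 'f1', 'f1', 'f2', 'f2'],
    'taxon_name': ['A',  'B',  'C',  'A',  'D'],
    'abundance':  [0.5,  0.25, 0.25, 0.25, 0.75],
})
toy_mapped_taxa = {'A', 'B'}  # taxa whose mapping_tier is not 'unmapped'

# Zero out abundance of unmapped taxa, then sum both columns per file
toy['mapped_abundance'] = toy['abundance'].where(
    toy['taxon_name'].isin(toy_mapped_taxa), 0.0
)
per_file = toy.groupby('file_id')[['abundance', 'mapped_abundance']].sum()
per_file['bridge_coverage'] = per_file['mapped_abundance'] / per_file['abundance']
per_file['passes_bridge_qc'] = per_file['bridge_coverage'] >= 0.30
print(per_file)
```

Here `f1` has coverage 0.75 and passes, while `f2` has coverage 0.25 and is flagged. Masking the abundance column before grouping avoids per-group lambdas and gives the same per-file sums.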

In [None]:
# Compute bridge quality per classifier file
# For each file: sum abundance of mapped taxa / total abundance

if overlap_sample_ids and file_bridge_df is not None:
    # Get species-rank centrifuge data for overlap samples
    # Get file_ids for overlap samples first
    overlap_clf_files = clf_file_df[clf_file_df['sample_id'].isin(overlap_sample_ids)]['clf_file_id'].unique().tolist()

    if overlap_clf_files:
        # Create temp view of overlap file IDs
        overlap_clf_spark = spark.createDataFrame(
            pd.DataFrame({'clf_file_id': overlap_clf_files})
        )
        overlap_clf_spark.createOrReplaceTempView('overlap_clf_files_tmp')

        # Get species-rank rows for overlap files
        clf_data = spark.sql("""
            SELECT c.file_id, c.label as taxon_name, c.abundance
            FROM nmdc_arkin.centrifuge_gold c
            JOIN overlap_clf_files_tmp o ON c.file_id = o.clf_file_id
            WHERE LOWER(c.rank) = 'species'
              AND c.label IS NOT NULL AND c.label != ''
              AND c.abundance > 0
        """).toPandas()

        print(f'Species-rank rows for overlap files: {len(clf_data)}')
        print(f'Files in clf_data: {clf_data["file_id"].nunique()}')

        # Join with bridge to mark mapped vs unmapped taxa
        mapped_taxa = set(bridge[bridge['mapping_tier'] != 'unmapped']['taxon_name'])
        clf_data['is_mapped'] = clf_data['taxon_name'].isin(mapped_taxa)

        # Compute bridge coverage per file
        bridge_quality_file = clf_data.groupby('file_id').agg(
            total_abundance=('abundance', 'sum'),
            mapped_abundance=('abundance', lambda x: x[clf_data.loc[x.index, 'is_mapped']].sum()),
            n_taxa=('taxon_name', 'nunique'),
            n_mapped_taxa=('taxon_name', lambda x: x[clf_data.loc[x.index, 'is_mapped']].nunique())
        ).reset_index()
        bridge_quality_file['bridge_coverage'] = (
            bridge_quality_file['mapped_abundance'] / bridge_quality_file['total_abundance']
        ).fillna(0)

        # Map file_id → sample_id
        bridge_quality_file = bridge_quality_file.merge(
            clf_file_df[['sample_id', 'clf_file_id']].drop_duplicates().rename(columns={'clf_file_id': 'file_id'}),
            on='file_id', how='left'
        )

        print(f'\nBridge quality summary:')
        print(bridge_quality_file[['file_id', 'sample_id', 'bridge_coverage',
                                   'n_taxa', 'n_mapped_taxa']].describe().to_string())
        print(f'\nFiles with >=30% bridge coverage: '
              f'{(bridge_quality_file["bridge_coverage"] >= 0.30).sum()} / {len(bridge_quality_file)}')
    else:
        print('No overlap classifier files found for overlap samples.')
        bridge_quality_file = pd.DataFrame()
else:
    print('No overlap samples or bridge — skipping bridge quality computation.')
    bridge_quality_file = pd.DataFrame()

In [None]:
# Summarize bridge quality by coverage threshold
if len(bridge_quality_file) > 0:
    print('Bridge coverage distribution:')
    for threshold in [0.10, 0.20, 0.30, 0.50, 0.70]:
        n = (bridge_quality_file['bridge_coverage'] >= threshold).sum()
        print(f'  >= {threshold:.0%}: {n} / {len(bridge_quality_file)} files '
              f'({n/len(bridge_quality_file):.1%})')

    # Flag samples
    BRIDGE_THRESHOLD = 0.30
    bridge_quality_file['passes_bridge_qc'] = bridge_quality_file['bridge_coverage'] >= BRIDGE_THRESHOLD
    n_pass = bridge_quality_file['passes_bridge_qc'].sum()
    print(f'\nFiles passing QC (>={BRIDGE_THRESHOLD:.0%}): {n_pass}')
    print(f'Files failing QC (<{BRIDGE_THRESHOLD:.0%}): {len(bridge_quality_file) - n_pass}')

---
## Part 5: Save Outputs and Figures

In [None]:
# Save bridge table
bridge_path = os.path.join(DATA_DIR, 'taxon_bridge.tsv')
bridge.to_csv(bridge_path, sep='\t', index=False)
print(f'Saved: data/taxon_bridge.tsv ({len(bridge)} rows)')

# Save bridge quality
if len(bridge_quality_file) > 0:
    bq_path = os.path.join(DATA_DIR, 'bridge_quality.csv')
    bridge_quality_file.to_csv(bq_path, index=False)
    print(f'Saved: data/bridge_quality.csv ({len(bridge_quality_file)} rows)')

# Save/update sample inventory (replaces empty NB01 version if overlap found)
if len(inventory) > 0:
    inv_path = os.path.join(DATA_DIR, 'nmdc_sample_inventory.csv')
    inventory.to_csv(inv_path, index=False)
    print(f'Saved: data/nmdc_sample_inventory.csv ({len(inventory)} rows, '
          f'{inventory["sample_id"].nunique()} unique samples)')

# Save file→sample bridge
if file_bridge_df is not None and len(file_bridge_df) > 0:
    fb_path = os.path.join(DATA_DIR, 'sample_file_bridge.csv')
    file_bridge_df.to_csv(fb_path, index=False)
    print(f'Saved: data/sample_file_bridge.csv ({len(file_bridge_df)} rows)')

In [None]:
# Figure: Bridge quality distribution and mapping tier breakdown
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle('Taxonomy Bridge Quality', fontsize=14)

# Panel 1: Bridge coverage histogram
ax = axes[0]
if len(bridge_quality_file) > 0:
    ax.hist(bridge_quality_file['bridge_coverage'], bins=30,
            color='#4C9BE8', edgecolor='white')
    ax.axvline(0.30, color='red', linestyle='--', label='30% threshold')
    ax.set_xlabel('Bridge coverage (fraction of abundance mapped)')
    ax.set_ylabel('Number of samples')
    ax.set_title('Per-sample bridge coverage')
    ax.legend()
else:
    ax.text(0.5, 0.5, 'No bridge quality data', ha='center', va='center',
            transform=ax.transAxes)

# Panel 2: Mapping tier breakdown (number of bridge rows in each tier; not weighted by n_files)
ax2 = axes[1]
if len(bridge) > 0:
    tier_counts = bridge['mapping_tier'].value_counts()
    colors = {
        'species_exact': '#2ecc71',
        'genus_proxy_unique': '#f39c12',
        'genus_proxy_ambiguous': '#e67e22',
        'unmapped': '#e74c3c'
    }
    bar_colors = [colors.get(t, '#95a5a6') for t in tier_counts.index]
    ax2.bar(range(len(tier_counts)), tier_counts.values,
            color=bar_colors, edgecolor='white')
    ax2.set_xticks(range(len(tier_counts)))
    ax2.set_xticklabels(tier_counts.index, rotation=30, ha='right', fontsize=9)
    ax2.set_ylabel('Number of unique taxa')
    ax2.set_title('Taxon mapping tier distribution')
    for i, val in enumerate(tier_counts.values):
        ax2.text(i, val + 0.5, str(val), ha='center', fontsize=9)
else:
    ax2.text(0.5, 0.5, 'No bridge data', ha='center', va='center', transform=ax2.transAxes)

plt.tight_layout()
fig_path = os.path.join(FIGURES_DIR, 'bridge_quality_distribution.png')
plt.savefig(fig_path, dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: figures/bridge_quality_distribution.png')

---
## Summary and Decisions for NB03

| Question | Finding |
|---|---|
| file→sample bridge table | ??? (fill after running Part 1) |
| Samples with both classifier + metabolomics | ??? |
| Species-exact GTDB matches | ??? of ??? centrifuge taxa |
| Genus-proxy unique matches | ??? |
| Genus-proxy ambiguous | ??? |
| Overall taxon mapping rate | ???% |
| Samples passing 30% bridge QC | ??? |
| Classifier used for bridge | centrifuge_gold (61.3% species-rank, best from NB01) |

**Decision for NB03**:  
Use samples passing 30% bridge QC for community-weighted pathway completeness computation.  
Use `species_exact` + `genus_proxy_unique` clades for community weighting in NB03.  
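Exclude `genus_proxy_ambiguous` from community weighting, or use an abundance-weighted mean across all clades in the matching genus; document whichever is chosen.  
Note that `taxon_bridge.tsv` keeps only the alphabetically first clade for an ambiguous genus, so the second option requires re-expanding each genus to its full clade list from `gtdb_species_clade`.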
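
One possible reading of the two options for ambiguous genera is sketched below. Everything here is hypothetical: the `community_completeness` function, the clade IDs, and the per-clade completeness values are invented, and NB03 may define the weighting differently. Because per-clade abundances are not known for an ambiguous genus, this sketch takes a plain mean over its candidate clades and then weights that value by the taxon's abundance.

```python
# Hypothetical inputs: one sample's taxa and per-clade completeness for one pathway
toy_taxa = [
    # (relative abundance, mapping_tier, matching clade IDs)
    (0.5,  'species_exact',         ['c1']),
    (0.25, 'genus_proxy_ambiguous', ['c2', 'c3']),
    (0.25, 'unmapped',              []),
]
toy_completeness = {'c1': 1.0, 'c2': 0.5, 'c3': 0.0}


def community_completeness(taxa, completeness, include_ambiguous):
    """Abundance-weighted mean completeness over the taxa that are kept."""
    weighted, total = 0.0, 0.0
    for abundance, tier, clades in taxa:
        if tier == 'unmapped':
            continue
        if tier == 'genus_proxy_ambiguous' and not include_ambiguous:
            continue
        # Ambiguous genera contribute the plain mean over their candidate clades
        value = sum(completeness[c] for c in clades) / len(clades)
        weighted += abundance * value
        total += abundance
    return weighted / total if total else float('nan')


excl = community_completeness(toy_taxa, toy_completeness, include_ambiguous=False)
incl = community_completeness(toy_taxa, toy_completeness, include_ambiguous=True)
print(excl, incl)
```

In this toy case the estimate shifts from 1.0 (ambiguous taxa excluded) to 0.75 (included), which shows why the choice should be recorded alongside the results.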