# NB02: File→Sample Bridge and Taxonomy→GTDB Mapping

**Project**: Community Metabolic Ecology via NMDC × Pangenome Integration  
**Requires**: BERDL JupyterHub (Spark — `get_spark_session()` injected into kernel)  

## Purpose

NB01 confirmed that classifier files (`nmdc:dobj-11-*`) and metabolomics files (`nmdc:dobj-12-*`)  
share **zero file_id overlap** — they are different workflow output types. The shared identifier  
is the **biosample/sample_id** (e.g., `nmdc:bsm-11-*`). This notebook:

1. **Part 1**: Find the `file_id → sample_id` bridge in `nmdc_arkin`
2. **Part 2**: Build sample inventory — samples with both metagenomics classifier AND metabolomics data
3. **Part 3**: Map NMDC taxon names (centrifuge_gold) → GTDB species (`gtdb_species_clade`)
4. **Part 4**: Compute bridge quality per sample
5. **Part 5**: Save outputs

## Inputs

- `nmdc_arkin` tables: `centrifuge_gold`, `metabolomics_gold`, `abiotic_features`, `study_table`
- `kbase_ke_pangenome`: `gtdb_species_clade`

## Outputs

- `data/sample_file_bridge.csv` — sample_id ↔ file_id mapping for all omics types
- `data/nmdc_sample_inventory.csv` — updated with samples having paired omics (replaces NB01 empty file)
- `data/taxon_bridge.tsv` — NMDC taxon name → GTDB species clade mappings with confidence tiers
- `data/bridge_quality.csv` — per-sample fraction of community abundance mapped to pangenome

In [1]:
# On BERDL JupyterHub — get_spark_session() is injected into the kernel; no import needed
spark = get_spark_session()
spark

<pyspark.sql.connect.session.SparkSession at 0x7d052d4fd400>

In [2]:
import os
import re
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# Notebooks have no __file__; resolve the project root as the parent of the working directory
PROJECT_DIR = os.path.abspath(os.path.join(os.getcwd(), '..'))
DATA_DIR = os.path.join(PROJECT_DIR, 'data')
FIGURES_DIR = os.path.join(PROJECT_DIR, 'figures')
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIGURES_DIR, exist_ok=True)
print(f'DATA_DIR: {DATA_DIR}')
print(f'FIGURES_DIR: {FIGURES_DIR}')

DATA_DIR: /home/cjneely/repos/BERIL-research-observatory/projects/nmdc_community_metabolic_ecology/data
FIGURES_DIR: /home/cjneely/repos/BERIL-research-observatory/projects/nmdc_community_metabolic_ecology/figures


---
## Part 1: Find the file_id → sample_id Bridge

NB01 found that classifier tables and metabolomics tables use non-overlapping `file_id` namespaces.  
The NMDC data model links files to biosamples through workflow activities.  
We explore candidate tables in `nmdc_arkin` to find a `file_id → sample_id` mapping.
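
The idea can be sketched on toy data: two file tables with disjoint `file_id` values become comparable once each file is routed through a lookup that carries both `file_id` and `sample_id`. This is a minimal pandas illustration, not the notebook's Spark code; every ID and the `lookup` frame below are made up.

```python
import pandas as pd

# Toy stand-ins for the real tables; all IDs here are invented for illustration.
clf_files = pd.DataFrame({"file_id": ["dobj-11-a", "dobj-11-b"]})
met_files = pd.DataFrame({"file_id": ["dobj-12-x", "dobj-12-y"]})

# Hypothetical lookup holding both columns, playing the role of a bridge table.
lookup = pd.DataFrame({
    "file_id": ["dobj-11-a", "dobj-11-b", "dobj-12-x", "dobj-12-y"],
    "sample_id": ["bsm-1", "bsm-2", "bsm-1", "bsm-3"],
})

# The two file_id sets never intersect, so joining on file_id finds nothing.
direct_overlap = set(clf_files["file_id"]) & set(met_files["file_id"])

# Routing each file through the lookup recovers the shared sample.
clf_samples = set(clf_files.merge(lookup, on="file_id")["sample_id"])
met_samples = set(met_files.merge(lookup, on="file_id")["sample_id"])
paired_samples = clf_samples & met_samples

print("direct file_id overlap:", direct_overlap)
print("samples with both data types:", paired_samples)
```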

In [3]:
# Step 1a: Inspect candidate bridge tables: look for tables that have both file_id and sample_id
# Candidates from schema doc: sample_tokens_v1, taxonomy_dim, taxstring_lookup,
# embedding_metadata, taxonomy_embeddings, trait_unified.
# Also checked: biochemical_features, biochemical_embeddings, abiotic_embeddings.

candidate_tables = [
    'sample_tokens_v1',
    'taxonomy_dim',
    'taxstring_lookup',
    'embedding_metadata',
    'taxonomy_embeddings',
    'trait_unified',
    'biochemical_features',
    'biochemical_embeddings',
    'abiotic_embeddings',
]

bridge_candidates = []  # tables that have both file_id and sample_id

for tbl in candidate_tables:
    try:
        schema = spark.sql(f'DESCRIBE nmdc_arkin.{tbl}').toPandas()
        cols = set(schema['col_name'].tolist())
        has_file = 'file_id' in cols
        has_sample = 'sample_id' in cols
        n_cols = len(cols)
        print(f'{tbl}: {n_cols} cols, file_id={has_file}, sample_id={has_sample}')
        if has_file or has_sample:
            print(f'  -> columns: {sorted(cols)[:12]}')
        if has_file and has_sample:
            bridge_candidates.append(tbl)
            print(f'  *** BRIDGE CANDIDATE ***')
    except Exception as e:
        print(f'{tbl}: ERROR — {e}')

print(f'\nBridge candidates (have both file_id and sample_id): {bridge_candidates}')

sample_tokens_v1: 4 cols, file_id=False, sample_id=True
  -> columns: ['modality_id', 'sample_id', 'token_id', 'value']
taxonomy_dim: 8 cols, file_id=False, sample_id=False
taxstring_lookup: 4 cols, file_id=False, sample_id=False
embedding_metadata: 3 cols, file_id=False, sample_id=False
taxonomy_embeddings: 19 cols, file_id=False, sample_id=True
  -> columns: ['coverage_fraction', 'embedding_dim_0', 'embedding_dim_1', 'embedding_dim_10', 'embedding_dim_11', 'embedding_dim_12', 'embedding_dim_13', 'embedding_dim_14', 'embedding_dim_15', 'embedding_dim_2', 'embedding_dim_3', 'embedding_dim_4']
trait_unified: 7 cols, file_id=False, sample_id=False
biochemical_features: 21485 cols, file_id=False, sample_id=False
biochemical_embeddings: 20 cols, file_id=False, sample_id=False
abiotic_embeddings: 19 cols, file_id=False, sample_id=True
  -> columns: ['coverage_fraction', 'embedding_dim_0', 'embedding_dim_1', 'embedding_dim_10', 'embedding_dim_11', 'embedding_dim_12', 'embedding_dim_13', 'emb

In [4]:
# Step 1b: List all tables in nmdc_arkin to find any additional candidates
all_tables = spark.sql('SHOW TABLES IN nmdc_arkin').toPandas()
print(f'Total tables in nmdc_arkin: {len(all_tables)}')
print(all_tables['tableName'].sort_values().tolist())

Total tables in nmdc_arkin: 63
['abiotic_embeddings', 'abiotic_features', 'annotation_crossrefs', 'annotation_hierarchies_unified', 'annotation_terms_unified', 'biochemical_embeddings', 'biochemical_features', 'biochemical_features_metadata', 'centrifuge_gold', 'cog_categories', 'cog_hierarchy_flat', 'contig_taxonomy', 'contig_taxonomy_backup', 'covstats_taxonomy_rollup', 'ec_hierarchy_flat', 'ec_hierarchy_graph', 'ec_terms', 'embedding_metadata', 'embeddings_v1', 'go_hierarchy_flat', 'go_hierarchy_graph', 'go_terms', 'gottcha_gold', 'kegg_ko_module', 'kegg_ko_pathway', 'kegg_ko_terms', 'kegg_module_terms', 'kegg_pathway_terms', 'kraken_gold', 'lipidomics_gold', 'metabolomics_gold', 'metacyc_hierarchy_flat', 'metacyc_hierarchy_graph', 'metacyc_pathway_reactions', 'metacyc_pathways', 'metacyc_reaction_ec', 'metatranscriptomics_gold', 'nom_feature_metadata', 'nom_gold', 'nom_matrix_optimized', 'omics_files_table', 'proteomics_gold', 'rhea_crossrefs', 'rhea_reactions', 'sample_file_lookup

In [5]:
# Step 1c: Scan ALL tables in nmdc_arkin for file_id OR sample_id columns
# This finds any table not covered by the candidate list above

table_names = all_tables['tableName'].tolist()
file_id_tables = []
sample_id_tables = []
both_id_tables = []

for tbl in table_names:
    try:
        schema = spark.sql(f'DESCRIBE nmdc_arkin.{tbl}').toPandas()
        cols = set(schema['col_name'].tolist())
        has_file = 'file_id' in cols
        has_sample = 'sample_id' in cols
        if has_file:
            file_id_tables.append(tbl)
        if has_sample:
            sample_id_tables.append(tbl)
        if has_file and has_sample:
            both_id_tables.append(tbl)
    except Exception:
        pass

print('Tables with file_id:', file_id_tables)
print()
print('Tables with sample_id:', sample_id_tables)
print()
print('Tables with BOTH file_id and sample_id (bridge candidates):', both_id_tables)

Tables with file_id: ['centrifuge_gold', 'covstats_taxonomy_rollup', 'gottcha_gold', 'kraken_gold', 'lipidomics_gold', 'nom_gold', 'omics_files_table', 'proteomics_gold', 'metabolomics_gold', 'metatranscriptomics_gold']

Tables with sample_id: ['sample_tokens_v1', 'contig_taxonomy', 'contig_taxonomy_backup', 'abiotic_embeddings', 'taxonomy_embeddings', 'taxonomy_features', 'trait_embeddings', 'trait_features', 'nom_matrix_optimized', 'omics_files_table', 'sample_file_lookup', 'sample_file_selections', 'abiotic_features']

Tables with BOTH file_id and sample_id (bridge candidates): ['omics_files_table']


In [6]:
# Step 1d: Inspect the bridge table(s) found above
# If both_id_tables is non-empty, show schema and sample rows for each

if both_id_tables:
    for tbl in both_id_tables:
        print(f'\n=== nmdc_arkin.{tbl} schema ===')
        spark.sql(f'DESCRIBE nmdc_arkin.{tbl}').show(30, truncate=False)

        n = spark.sql(f'SELECT COUNT(*) as n FROM nmdc_arkin.{tbl}').collect()[0]['n']
        print(f'Row count: {n}')

        print(f'Sample rows:')
        spark.sql(f'SELECT * FROM nmdc_arkin.{tbl} LIMIT 5').show(truncate=False)
else:
    print('No table with both file_id and sample_id found in nmdc_arkin.')
    print('Will attempt bridge via file_name parsing (Step 1g).')


=== nmdc_arkin.omics_files_table schema ===
+---------------------+---------+-------+
|col_name             |data_type|comment|
+---------------------+---------+-------+
|file_id              |string   |NULL   |
|file_name            |string   |NULL   |
|file_url             |string   |NULL   |
|file_size_bytes      |double   |NULL   |
|file_type            |string   |NULL   |
|file_type_description|string   |NULL   |
|md5_checksum         |string   |NULL   |
|omics_processing_id  |string   |NULL   |
|omics_processing_name|string   |NULL   |
|workflow_id          |string   |NULL   |
|workflow_type        |string   |NULL   |
|sample_id            |string   |NULL   |
|study_id             |string   |NULL   |
+---------------------+---------+-------+

Row count: 385562
Sample rows:
+---------------------+-----------------------+-------------------------------------------------+---------------+-------------+--------------------------------------------------------+-------------------------

In [7]:
# Step 1e: Explore sample_id format in abiotic_features
# to understand what ID format we need to bridge TO
print('=== abiotic_features sample_id examples ===')
abiotic_ids = spark.sql("""
    SELECT sample_id
    FROM nmdc_arkin.abiotic_features
    LIMIT 10
""").toPandas()
print(abiotic_ids['sample_id'].tolist())

# Also check study_table: do study_ids appear in sample_ids?
print('\nStudy_id examples from study_table:')
study_ids = spark.sql("""
    SELECT study_id FROM nmdc_arkin.study_table LIMIT 5
""").toPandas()
print(study_ids['study_id'].tolist())

# Check if sample_id in abiotic_features starts with 'nmdc:bsm-'
bsm_count = spark.sql("""
    SELECT COUNT(*) as n
    FROM nmdc_arkin.abiotic_features
    WHERE sample_id LIKE 'nmdc:bsm-%'
""").collect()[0]['n']
print(f'\nabiotic_features rows where sample_id starts with nmdc:bsm-: {bsm_count}')

=== abiotic_features sample_id examples ===
['nmdc:bsm-11-042nd237', 'nmdc:bsm-11-622k6044', 'nmdc:bsm-11-65a4xw75', 'nmdc:bsm-11-93mc8g67', 'nmdc:bsm-11-cpekyy11', 'nmdc:bsm-11-cwey4y35', 'nmdc:bsm-11-efm5hh51', 'nmdc:bsm-11-frgt4x11', 'nmdc:bsm-11-gk0czc37', 'nmdc:bsm-11-nq2tfm26']

Study_id examples from study_table:
['nmdc:sty-11-8fb6t785', 'nmdc:sty-11-33fbta56', 'nmdc:sty-11-aygzgv51', 'nmdc:sty-11-34xj1150', 'nmdc:sty-11-076c9980']

abiotic_features rows where sample_id starts with nmdc:bsm-: 13847


In [8]:
# Step 1f: Check if the sample_tokens_v1 table has more columns than documented
# and whether it contains file_id or a file reference field
try:
    print('=== sample_tokens_v1 schema ===')
    spark.sql('DESCRIBE nmdc_arkin.sample_tokens_v1').show(40, truncate=False)
    n = spark.sql('SELECT COUNT(*) as n FROM nmdc_arkin.sample_tokens_v1').collect()[0]['n']
    print(f'Row count: {n}')
    print('Sample rows:')
    spark.sql('SELECT * FROM nmdc_arkin.sample_tokens_v1 LIMIT 3').show(truncate=False)
except Exception as e:
    print(f'sample_tokens_v1 error: {e}')

# Also check embeddings_v1 for sample_id format
try:
    print('\n=== embeddings_v1 schema ===')
    spark.sql('DESCRIBE nmdc_arkin.embeddings_v1').show(20, truncate=False)
    n = spark.sql('SELECT COUNT(*) as n FROM nmdc_arkin.embeddings_v1').collect()[0]['n']
    print(f'Row count: {n}')
    # Show non-vector columns (sample_id etc.)
    emb_schema = spark.sql('DESCRIBE nmdc_arkin.embeddings_v1').toPandas()
    str_cols = emb_schema[emb_schema['data_type'] == 'string']['col_name'].tolist()
    if str_cols:
        cols_sql = ', '.join([f'`{c}`' for c in str_cols[:6]])
        spark.sql(f'SELECT {cols_sql} FROM nmdc_arkin.embeddings_v1 LIMIT 5').show(truncate=False)
except Exception as e:
    print(f'embeddings_v1 error: {e}')

=== sample_tokens_v1 schema ===
+-----------+-------------+-------+
|col_name   |data_type    |comment|
+-----------+-------------+-------+
|sample_id  |string       |NULL   |
|token_id   |array<bigint>|NULL   |
|modality_id|array<bigint>|NULL   |
|value      |array<double>|NULL   |
+-----------+-------------+-------+

Row count: 5316
Sample rows:
+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [9]:
# Step 1g: Confirm the bridge table and check coverage via Spark SQL.
# NOTE: Do NOT convert omics_files_table to pandas and back to Spark —
# spark.createDataFrame(pandas_df) fails when the DataFrame has PyArrow-backed
# columns (ChunkedArray), which is always the case after .toPandas() on
# Spark Connect. Use omics_files_table directly in SQL for all subsequent joins.

if both_id_tables:
    bridge_tbl_name = f'nmdc_arkin.{both_id_tables[0]}'
    print(f'Bridge table: {bridge_tbl_name}')

    # Coverage check: how many centrifuge / metabolomics file_ids are bridged to a sample_id?
    clf_coverage = spark.sql(f"""
        SELECT
            COUNT(DISTINCT c.file_id) as clf_files_total,
            COUNT(DISTINCT CASE WHEN b.sample_id IS NOT NULL THEN c.file_id END) as clf_files_bridged
        FROM (SELECT DISTINCT file_id FROM nmdc_arkin.centrifuge_gold) c
        LEFT JOIN {bridge_tbl_name} b ON c.file_id = b.file_id
    """).toPandas()
    met_coverage = spark.sql(f"""
        SELECT
            COUNT(DISTINCT m.file_id) as met_files_total,
            COUNT(DISTINCT CASE WHEN b.sample_id IS NOT NULL THEN m.file_id END) as met_files_bridged
        FROM (SELECT DISTINCT file_id FROM nmdc_arkin.metabolomics_gold) m
        LEFT JOIN {bridge_tbl_name} b ON m.file_id = b.file_id
    """).toPandas()
    print('Classifier coverage:')
    print(clf_coverage.to_string())
    print('Metabolomics coverage:')
    print(met_coverage.to_string())
    file_bridge_found = True
else:
    print('No direct bridge table found. Attempting file_name-based bridge...')
    spark.sql("""
        SELECT DISTINCT file_id, file_name FROM nmdc_arkin.centrifuge_gold LIMIT 5
    """).show(truncate=False)
    spark.sql("""
        SELECT DISTINCT file_id, file_name FROM nmdc_arkin.metabolomics_gold LIMIT 5
    """).show(truncate=False)
    file_bridge_found = False
    bridge_tbl_name = None

Bridge table: nmdc_arkin.omics_files_table
Classifier coverage:
   clf_files_total  clf_files_bridged
0             3577               3577
Metabolomics coverage:
   met_files_total  met_files_bridged
0             2460               2460


In [10]:
# Step 1h: If no direct bridge was found, check nmdc_ncbi_biosamples for a cross-reference.

if not file_bridge_found:
    print('Exploring nmdc_ncbi_biosamples for file → sample links...')
    try:
        ncbi_tables = spark.sql('SHOW TABLES IN nmdc_ncbi_biosamples').toPandas()
        print('Tables in nmdc_ncbi_biosamples:', ncbi_tables['tableName'].tolist())

        try:
            spark.sql('DESCRIBE nmdc_ncbi_biosamples.biosamples_ids').show(20, truncate=False)
            spark.sql('SELECT * FROM nmdc_ncbi_biosamples.biosamples_ids LIMIT 5').show(truncate=False)
        except Exception as e:
            print(f'biosamples_ids: {e}')

        try:
            link_schema = spark.sql('DESCRIBE nmdc_ncbi_biosamples.biosamples_links').toPandas()
            print('\nbiosamples_links columns:', link_schema['col_name'].tolist())
            spark.sql('SELECT * FROM nmdc_ncbi_biosamples.biosamples_links LIMIT 5').show(truncate=False)
        except Exception as e:
            print(f'biosamples_links: {e}')

    except Exception as e:
        print(f'nmdc_ncbi_biosamples exploration error: {e}')
else:
    print(f'Bridge found: {bridge_tbl_name}. Skipping nmdc_ncbi_biosamples.')

Bridge found: nmdc_arkin.omics_files_table. Skipping nmdc_ncbi_biosamples.


---
## Part 2: Sample Inventory — Samples with Both Omics Data Types

Using the bridge found in Part 1, identify `sample_id` values that have both  
a metagenomics classifier file (centrifuge_gold) AND a metabolomics file (metabolomics_gold).
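
One consequence of pairing at the sample level is worth keeping in mind: a sample can have several files of each type, so an inner join on `sample_id` yields one row per (classifier file, metabolomics file) pair and the row count can exceed the sample count. A minimal pandas sketch with made-up IDs:

```python
import pandas as pd

# Toy per-sample file lists; IDs are invented for illustration.
clf = pd.DataFrame({
    "sample_id": ["bsm-1", "bsm-2"],
    "clf_file_id": ["c1", "c2"],
})
met = pd.DataFrame({
    "sample_id": ["bsm-1", "bsm-1", "bsm-1", "bsm-2"],
    "met_file_id": ["m1", "m2", "m3", "m4"],
})

# The inner merge pairs every classifier file with every metabolomics file
# that belongs to the same sample.
inventory = clf.merge(met, on="sample_id", how="inner")

n_rows = len(inventory)
n_samples = inventory["sample_id"].nunique()
pairs_per_sample = inventory.groupby("sample_id").size()

print(f"rows: {n_rows}, unique samples: {n_samples}")
print(pairs_per_sample.to_string())
```

Counting unique `sample_id` values, rather than rows, gives the number of paired samples.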

In [11]:
# Build sample overlap via the bridge
# Use omics_files_table directly in SQL — no pandas→Spark roundtrip.
# spark.createDataFrame(pandas_df) fails with ChunkedArray error when pandas_df
# was produced by .toPandas() on a Spark Connect DataFrame (PyArrow-backed columns).

if file_bridge_found and bridge_tbl_name:
    # Samples with metagenomics (centrifuge) classifier data
    clf_samples = spark.sql(f"""
        SELECT DISTINCT b.sample_id
        FROM (SELECT DISTINCT file_id FROM nmdc_arkin.centrifuge_gold) c
        JOIN {bridge_tbl_name} b ON c.file_id = b.file_id
        WHERE b.sample_id IS NOT NULL
    """).toPandas()

    # Samples with metabolomics data
    met_samples = spark.sql(f"""
        SELECT DISTINCT b.sample_id
        FROM (SELECT DISTINCT file_id FROM nmdc_arkin.metabolomics_gold) m
        JOIN {bridge_tbl_name} b ON m.file_id = b.file_id
        WHERE b.sample_id IS NOT NULL
    """).toPandas()

    # Samples with BOTH
    clf_set = set(clf_samples['sample_id'].tolist())
    met_set = set(met_samples['sample_id'].tolist())
    overlap_samples = clf_set & met_set

    print(f'Samples with classifier data (centrifuge): {len(clf_set)}')
    print(f'Samples with metabolomics data:            {len(met_set)}')
    print(f'Samples with BOTH (overlap):               {len(overlap_samples)}')
    overlap_sample_ids = list(overlap_samples)
else:
    print('ERROR: No file→sample bridge available. Cannot compute sample overlap.')
    clf_set = set()
    met_set = set()
    overlap_sample_ids = []

Samples with classifier data (centrifuge): 6361
Samples with metabolomics data:            1148
Samples with BOTH (overlap):               221


In [12]:
# Pull abiotic features for the overlap samples.
# abiotic_features is keyed by sample_id, so it can be filtered directly;
# ecosystem metadata from study_table is not merged in this cell.

if overlap_sample_ids:
    # Load abiotic features for overlap samples
    abiotic_df = spark.sql('SELECT * FROM nmdc_arkin.abiotic_features').toPandas()
    abiotic_overlap = abiotic_df[abiotic_df['sample_id'].isin(overlap_sample_ids)].copy()
    print(f'Abiotic features rows for overlap samples: {len(abiotic_overlap)}')

    # Study linkage (for ecosystem categorization) is not resolved in this cell.
    # study_id format: nmdc:sty-11-XXXXXXXX
    # sample_id format: nmdc:bsm-11-XXXXXXXX or nmdc:bsm-13-XXXXXXXX (biosample prefix)
    # omics_files_table carries both sample_id and study_id, so it can supply the link.

    # For now, check what sample_id values look like
    print(f'\nFirst 10 overlap sample_ids: {overlap_sample_ids[:10]}')
else:
    print('No overlap samples — skipping ecosystem metadata merge.')
    abiotic_overlap = pd.DataFrame()

Abiotic features rows for overlap samples: 221

First 10 overlap sample_ids: ['nmdc:bsm-11-rk62cv25', 'nmdc:bsm-11-q72ew015', 'nmdc:bsm-11-62fxv990', 'nmdc:bsm-11-j9v5kc93', 'nmdc:bsm-13-c0xjg970', 'nmdc:bsm-11-frgt4x11', 'nmdc:bsm-11-011z7z70', 'nmdc:bsm-11-ce1sg012', 'nmdc:bsm-11-h49t7j41', 'nmdc:bsm-13-amdcp906']


In [13]:
# Build per-sample file inventory with omics type labels.
# Use omics_files_table directly in SQL subqueries — no temp views needed.

if overlap_sample_ids and file_bridge_found and bridge_tbl_name:
    # Restrict to overlap samples with an INTERSECT subquery instead of a literal
    # IN list of IDs, so the filter stays in Spark however many samples overlap.
    clf_file_df = spark.sql(f"""
        SELECT DISTINCT b.sample_id, c.file_id as clf_file_id, c.file_name as clf_file_name
        FROM (SELECT DISTINCT file_id, file_name FROM nmdc_arkin.centrifuge_gold) c
        JOIN {bridge_tbl_name} b ON c.file_id = b.file_id
        WHERE b.sample_id IN (
            SELECT b2.sample_id FROM {bridge_tbl_name} b2
            JOIN (SELECT DISTINCT file_id FROM nmdc_arkin.centrifuge_gold) c2 ON b2.file_id = c2.file_id
            INTERSECT
            SELECT b3.sample_id FROM {bridge_tbl_name} b3
            JOIN (SELECT DISTINCT file_id FROM nmdc_arkin.metabolomics_gold) m3 ON b3.file_id = m3.file_id
        )
    """).toPandas()

    met_file_df = spark.sql(f"""
        SELECT DISTINCT b.sample_id, m.file_id as met_file_id, m.file_name as met_file_name
        FROM (SELECT DISTINCT file_id, file_name FROM nmdc_arkin.metabolomics_gold) m
        JOIN {bridge_tbl_name} b ON m.file_id = b.file_id
        WHERE b.sample_id IN (
            SELECT b2.sample_id FROM {bridge_tbl_name} b2
            JOIN (SELECT DISTINCT file_id FROM nmdc_arkin.centrifuge_gold) c2 ON b2.file_id = c2.file_id
            INTERSECT
            SELECT b3.sample_id FROM {bridge_tbl_name} b3
            JOIN (SELECT DISTINCT file_id FROM nmdc_arkin.metabolomics_gold) m3 ON b3.file_id = m3.file_id
        )
    """).toPandas()

    inventory = clf_file_df.merge(met_file_df, on='sample_id', how='inner')
    print(f'Inventory rows (sample × clf_file × met_file): {len(inventory)}')
    print(f'Unique samples in inventory: {inventory["sample_id"].nunique()}')
    print(inventory.head(5).to_string())
else:
    print('Cannot build inventory — no bridge or no overlap samples.')
    inventory = pd.DataFrame()
    clf_file_df = pd.DataFrame()
    met_file_df = pd.DataFrame()

Inventory rows (sample × clf_file × met_file): 646
Unique samples in inventory: 221
              sample_id            clf_file_id                                   clf_file_name            met_file_id                                                                                                                           met_file_name
0  nmdc:bsm-11-ahfq0n74  nmdc:dobj-11-ggpy8r50  nmdc_wfrbt-11-mcj39k52.1_centrifuge_report.tsv  nmdc:dobj-12-716d9s20       20220119_JGI_MD_507130_BioS_final1_QE-139_HILICZ_USHXG01825_POS_MSMS_5RM_BESC-13-Corv-RM_2_Rg70to1050-CE102040-root-S1_Run922.csv
1  nmdc:bsm-11-ahfq0n74  nmdc:dobj-11-ggpy8r50  nmdc_wfrbt-11-mcj39k52.1_centrifuge_report.tsv  nmdc:dobj-12-591x8e41             20220211_JGI_MD_507130_BioS_final1_IDX_C18_USDAY59443_POS_MSMS_5RM_BESC-13-Corv-RM_2_Rg80to1200-CE102040-root-S1_Run935.csv
2  nmdc:bsm-11-ahfq0n74  nmdc:dobj-11-ggpy8r50  nmdc_wfrbt-11-mcj39k52.1_centrifuge_report.tsv  nmdc:dobj-12-mk76tz88             20220211_JGI_MD_507130_B

---
## Part 3: NMDC Taxon Names → GTDB Species Clade

Map species-rank taxon names from `centrifuge_gold` to GTDB species clades  
in `kbase_ke_pangenome.gtdb_species_clade` using normalized name matching.  

Approach (following `enigma_contamination_functional_potential` NB02):
- Normalize names on both sides (strip the `s__` prefix and a leading `Candidatus`, lowercase, collapse other characters to `_`)
- Match at species level first (exact match on the normalized GTDB species name)
- Fall back to a genus-level match for taxa still unresolved
- Report a confidence tier: `species_exact`, `genus_proxy_unique`, `genus_proxy_ambiguous`, or `unmapped`
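
The tiering logic above can be sketched in plain Python on a toy reference list. The helper names (`normalize`, `genus_of`, `assign_tier`) are hypothetical and exist only for this illustration, and `s__Pseudomonas_toyensis` is an invented entry added so that one genus has two clades.

```python
import re
from collections import Counter

def normalize(name: str) -> str:
    """Lowercase, drop 's__' and 'Candidatus', collapse other characters to '_'."""
    name = name.strip().lower()
    name = re.sub(r"^s__", "", name)
    name = re.sub(r"^candidatus\s+", "", name)
    return re.sub(r"[^a-z0-9]+", "_", name).strip("_")

def genus_of(name: str) -> str:
    """First token of the normalized name, taken as the genus."""
    return normalize(name).split("_")[0]

# Toy GTDB-style reference; the last entry is invented for this sketch.
gtdb_names = [
    "s__Rhizobium_phaseoli",
    "s__Leuconostoc_gelidum",
    "s__Pseudomonas_nitroreducens",
    "s__Pseudomonas_toyensis",
]
species_set = {normalize(n) for n in gtdb_names}
clades_per_genus = Counter(genus_of(n) for n in gtdb_names)

def assign_tier(taxon_name: str) -> str:
    """Species match first, then genus fallback split by clade count."""
    if normalize(taxon_name) in species_set:
        return "species_exact"
    n_clades = clades_per_genus.get(genus_of(taxon_name), 0)
    if n_clades == 1:
        return "genus_proxy_unique"
    if n_clades > 1:
        return "genus_proxy_ambiguous"
    return "unmapped"

queries = [
    "Rhizobium phaseoli",
    "Leuconostoc sp. X1",
    "Pseudomonas fluorescens",
    "Bradyrhizobium sp. ORS 278",
]
tiers = {q: assign_tier(q) for q in queries}
for name, tier in tiers.items():
    print(f"{name}: {tier}")
```

The tier a taxon receives depends on which clades are present in the reference, so the toy results here say nothing about how the same names map against the full `gtdb_species_clade` table.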

In [14]:
# Get all unique species-rank taxa from centrifuge_gold
# Using all files (not just overlap) to build a comprehensive bridge table
# centrifuge_gold: rank='species', label=taxon name, abundance=relative abundance

centrifuge_species = spark.sql("""
    SELECT label as taxon_name, COUNT(DISTINCT file_id) as n_files,
           AVG(abundance) as mean_abundance
    FROM nmdc_arkin.centrifuge_gold
    WHERE LOWER(rank) = 'species'
      AND label IS NOT NULL
      AND label != ''
    GROUP BY label
    ORDER BY n_files DESC
""").toPandas()

print(f'Unique species-rank taxa in centrifuge_gold: {len(centrifuge_species)}')
print('Most common taxa:')
print(centrifuge_species.head(10).to_string())

Unique species-rank taxa in centrifuge_gold: 4894
Most common taxa:
                   taxon_name  n_files  mean_abundance
0  Bradyrhizobium sp. ORS 278     3576        0.000740
1    Burkholderia cenocepacia     3575        0.000111
2  Rhodopseudomonas palustris     3575        0.012858
3     Pseudomonas fluorescens     3575        0.001186
4          Streptomyces albus     3575        0.000081
5    Bradyrhizobium sp. BTAi1     3575        0.004004
6      Stutzerimonas stutzeri     3575        0.000270
7      Ralstonia solanacearum     3574        0.000176
8    Nocardia cyriacigeorgica     3574        0.000206
9      Rhizorhabdus wittichii     3574        0.000295


In [15]:
# Normalization functions for name matching
def norm_species(x: str) -> str:
    """Normalize NCBI species name for GTDB matching.
    NCBI: 'Candidatus Methylotenera versatilis' -> 'methylotenera_versatilis'
    GTDB: 's__Methylotenera_versatilis' -> 'methylotenera_versatilis'
    """
    if pd.isna(x) or not str(x).strip():
        return ''
    x = str(x).strip().lower()
    # Remove GTDB prefix if present
    x = re.sub(r'^s__', '', x)
    # Remove 'candidatus' prefix
    x = re.sub(r'^candidatus\s+', '', x)
    # Replace spaces and special chars with underscore
    x = re.sub(r'[^a-z0-9]+', '_', x)
    return x.strip('_')


def norm_genus(x: str) -> str:
    """Extract and normalize genus from species name."""
    if pd.isna(x) or not str(x).strip():
        return ''
    x = str(x).strip().lower()
    x = re.sub(r'^s__', '', x)
    x = re.sub(r'^g__', '', x)
    x = re.sub(r'^candidatus\s+', '', x)
    # Take first word as genus
    parts = re.split(r'[^a-z0-9]', x)
    genus = parts[0] if parts else ''
    return re.sub(r'[^a-z0-9]', '_', genus).strip('_')


# Prepare centrifuge taxa with normalized names
centrifuge_species['species_norm'] = centrifuge_species['taxon_name'].map(norm_species)
centrifuge_species['genus_norm'] = centrifuge_species['taxon_name'].map(norm_genus)
centrifuge_species = centrifuge_species[centrifuge_species['species_norm'] != '']
print(f'Taxa with non-empty normalized name: {len(centrifuge_species)}')
print(centrifuge_species[['taxon_name', 'species_norm', 'genus_norm']].head(10).to_string())

Taxa with non-empty normalized name: 4894
                   taxon_name                species_norm        genus_norm
0  Bradyrhizobium sp. ORS 278   bradyrhizobium_sp_ors_278    bradyrhizobium
1    Burkholderia cenocepacia    burkholderia_cenocepacia      burkholderia
2  Rhodopseudomonas palustris  rhodopseudomonas_palustris  rhodopseudomonas
3     Pseudomonas fluorescens     pseudomonas_fluorescens       pseudomonas
4          Streptomyces albus          streptomyces_albus      streptomyces
5    Bradyrhizobium sp. BTAi1     bradyrhizobium_sp_btai1    bradyrhizobium
6      Stutzerimonas stutzeri      stutzerimonas_stutzeri     stutzerimonas
7      Ralstonia solanacearum      ralstonia_solanacearum         ralstonia
8    Nocardia cyriacigeorgica    nocardia_cyriacigeorgica          nocardia
9      Rhizorhabdus wittichii      rhizorhabdus_wittichii      rhizorhabdus


In [16]:
# Load GTDB species clades from pangenome
# GTDB species names look like: 's__Methylotenera_versatilis'
gtdb_species = spark.sql("""
    SELECT DISTINCT gtdb_species_clade_id, GTDB_species
    FROM kbase_ke_pangenome.gtdb_species_clade
""").toPandas()

print(f'GTDB species clades: {len(gtdb_species)}')
print('Sample GTDB species names:')
print(gtdb_species['GTDB_species'].head(10).tolist())

# Build normalized species and genus columns for GTDB
gtdb_species['species_norm'] = gtdb_species['GTDB_species'].map(norm_species)
gtdb_species['genus_norm'] = gtdb_species['GTDB_species'].map(norm_genus)
gtdb_species = gtdb_species[gtdb_species['species_norm'] != ''].drop_duplicates()

print(f'\nGTDB species with non-empty normalized name: {len(gtdb_species)}')
print(gtdb_species[['GTDB_species', 'species_norm', 'genus_norm']].head(5).to_string())

GTDB species clades: 27690
Sample GTDB species names:
['s__Rhizobium_phaseoli', 's__CAIPMZ01_sp903861715', 's__Pseudomonas_nitroreducens', 's__Leuconostoc_gelidum', 's__Prevotella_nigrescens', 's__Tractidigestivibacter_sp000752675', 's__Ellagibacter_isourolithinifaciens', 's__Limimorpha_sp900769945', 's__Succinivibrio_sp003456415', 's__VUNA01_sp002299625']

GTDB species with non-empty normalized name: 27690
                   GTDB_species               species_norm   genus_norm
0         s__Rhizobium_phaseoli         rhizobium_phaseoli    rhizobium
1       s__CAIPMZ01_sp903861715       caipmz01_sp903861715     caipmz01
2  s__Pseudomonas_nitroreducens  pseudomonas_nitroreducens  pseudomonas
3        s__Leuconostoc_gelidum        leuconostoc_gelidum  leuconostoc
4      s__Prevotella_nigrescens      prevotella_nigrescens   prevotella


In [17]:
# Step 3a: Species-level exact match (normalized NCBI species name == normalized GTDB species name)
species_exact = centrifuge_species.merge(
    gtdb_species[['gtdb_species_clade_id', 'GTDB_species', 'species_norm']],
    on='species_norm',
    how='left'
)
species_exact['mapping_tier'] = species_exact['gtdb_species_clade_id'].apply(
    lambda v: 'species_exact' if pd.notna(v) else 'unmapped'
)

mapped_species = species_exact[species_exact['mapping_tier'] == 'species_exact']
still_unmapped = species_exact[species_exact['mapping_tier'] == 'unmapped']

print(f'Species-exact matches: {mapped_species["taxon_name"].nunique()}')
print(f'Unmapped after species-exact: {still_unmapped["taxon_name"].nunique()}')
print(f'Species-exact match rate: {mapped_species["taxon_name"].nunique() / len(centrifuge_species):.1%}')

Species-exact matches: 1375
Unmapped after species-exact: 3519
Species-exact match rate: 28.1%


In [18]:
# Step 3b: Genus-level fallback for unmapped taxa
# Match on normalized genus; if the genus has exactly one GTDB clade, use it (tier: genus_proxy_unique)
# If the genus has multiple clades, keep a representative and tag the row genus_proxy_ambiguous

# Count GTDB species per genus
gtdb_genus_counts = (
    gtdb_species.groupby('genus_norm')['gtdb_species_clade_id']
    .nunique()
    .rename('n_gtdb_clades_for_genus')
    .reset_index()
)

# Best representative GTDB clade per genus (use first alphabetically when ambiguous)
gtdb_genus_rep = (
    gtdb_species.sort_values(['genus_norm', 'GTDB_species'])
    .drop_duplicates(subset=['genus_norm'])
    [['genus_norm', 'gtdb_species_clade_id', 'GTDB_species']]
    .rename(columns={'GTDB_species': 'GTDB_species_representative'})
)

genus_fallback = still_unmapped[['taxon_name', 'species_norm', 'genus_norm', 'n_files', 'mean_abundance']].merge(
    gtdb_genus_rep,
    on='genus_norm',
    how='left'
).merge(
    gtdb_genus_counts,
    on='genus_norm',
    how='left'
)
genus_fallback['n_gtdb_clades_for_genus'] = genus_fallback['n_gtdb_clades_for_genus'].fillna(0).astype(int)
genus_fallback['mapping_tier'] = genus_fallback.apply(
    lambda row: (
        'genus_proxy_unique' if row['n_gtdb_clades_for_genus'] == 1
        else 'genus_proxy_ambiguous' if row['n_gtdb_clades_for_genus'] > 1
        else 'unmapped'
    ),
    axis=1
)

print('Genus fallback mapping tier distribution:')
print(genus_fallback['mapping_tier'].value_counts().to_string())

Genus fallback mapping tier distribution:
mapping_tier
unmapped                 2022
genus_proxy_ambiguous    1352
genus_proxy_unique        145


In [19]:
# Step 3c: Combine species_exact and genus_fallback into the full bridge table

# Standardize columns for concatenation
species_exact_out = mapped_species[['taxon_name', 'species_norm', 'genus_norm',
                                     'gtdb_species_clade_id', 'GTDB_species',
                                     'n_files', 'mean_abundance', 'mapping_tier']].copy()
species_exact_out['n_gtdb_clades_for_genus'] = 1
species_exact_out['GTDB_species_representative'] = species_exact_out['GTDB_species']

genus_fallback_out = genus_fallback.copy()
genus_fallback_out['GTDB_species'] = genus_fallback_out.get('GTDB_species_representative',
                                                              pd.Series([''] * len(genus_fallback)))

# Split the fallback table into genus-level matches and taxa with no GTDB match
# (species-exact rows are already captured in species_exact_out)
genus_mapped = genus_fallback_out[genus_fallback_out['mapping_tier'] != 'unmapped'].copy()
truly_unmapped = genus_fallback_out[genus_fallback_out['mapping_tier'] == 'unmapped'].copy()

bridge = pd.concat([
    species_exact_out[['taxon_name', 'species_norm', 'genus_norm',
                        'gtdb_species_clade_id', 'GTDB_species',
                        'n_files', 'mean_abundance', 'mapping_tier', 'n_gtdb_clades_for_genus']],
    genus_mapped[['taxon_name', 'species_norm', 'genus_norm',
                   'gtdb_species_clade_id', 'GTDB_species_representative',
                   'n_files', 'mean_abundance', 'mapping_tier', 'n_gtdb_clades_for_genus']].rename(
                   columns={'GTDB_species_representative': 'GTDB_species'}),
    truly_unmapped[['taxon_name', 'species_norm', 'genus_norm',
                     'n_files', 'mean_abundance', 'mapping_tier', 'n_gtdb_clades_for_genus']],
], ignore_index=True)

print(f'Total bridge rows: {len(bridge)}')
print('Mapping tier distribution:')
print(bridge['mapping_tier'].value_counts().to_string())

# Fraction of bridge rows (one row per taxon) assigned to any mapped tier
mapped_frac = (bridge['mapping_tier'] != 'unmapped').mean()
print(f'\nOverall taxon mapping rate: {mapped_frac:.1%}')

Total bridge rows: 4894
Mapping tier distribution:
mapping_tier
unmapped                 2022
species_exact            1375
genus_proxy_ambiguous    1352
genus_proxy_unique        145

Overall taxon mapping rate: 58.7%

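As a cross-check, the reported rates can be reproduced from the printed tier counts alone. The snippet below is a small arithmetic sketch; the only inputs are the four counts shown in the output above.

```python
# Tier counts copied from the printed mapping tier distribution.
tier_counts = {
    "unmapped": 2022,
    "species_exact": 1375,
    "genus_proxy_ambiguous": 1352,
    "genus_proxy_unique": 145,
}

total_taxa = sum(tier_counts.values())
mapped_taxa = total_taxa - tier_counts["unmapped"]

species_exact_rate = tier_counts["species_exact"] / total_taxa
overall_rate = mapped_taxa / total_taxa

print(f"total taxa: {total_taxa}")                       # 4894
print(f"mapped taxa: {mapped_taxa}")                     # 2872
print(f"species-exact rate: {species_exact_rate:.1%}")   # 28.1%
print(f"overall mapping rate: {overall_rate:.1%}")       # 58.7%
```
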

---
## Part 4: Bridge Quality Per Sample

For each overlap sample (samples with both classifier and metabolomics data),  
compute the fraction of community abundance that has been mapped to a GTDB clade.  

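Quality metric, computed over species-rank classifier rows with abundance > 0: `bridge_coverage = Σ(abundance of mapped taxa) / Σ(abundance of all taxa)`, where a taxon counts as mapped if its `mapping_tier` is anything other than `unmapped`.  
Flag samples below 30% bridge coverage for exclusion.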

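The metric can be worked through by hand on a single hypothetical profile. This is a sketch under stated assumptions: `bridge_coverage` is an illustrative helper rather than the notebook's function, the taxon names and abundances are invented, and returning 0.0 for an empty profile is a choice made for the sketch.

```python
def bridge_coverage(abundance_by_taxon, mapped_taxa):
    """Fraction of total abundance carried by taxa that appear in mapped_taxa."""
    total = sum(abundance_by_taxon.values())
    if total == 0:
        return 0.0  # assumption for the sketch: an empty profile has zero coverage
    mapped = sum(a for taxon, a in abundance_by_taxon.items() if taxon in mapped_taxa)
    return mapped / total


# Hypothetical species-rank profile for one classifier file.
profile = {"Taxon A": 0.50, "Taxon B": 0.25, "Taxon C": 0.25}
mapped = {"Taxon A", "Taxon C"}  # taxa whose mapping_tier is not 'unmapped'

coverage = bridge_coverage(profile, mapped)
print(f"bridge_coverage = {coverage:.2f}")   # 0.75
print("passes 30% QC:", coverage >= 0.30)    # True
```
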
In [20]:
# Compute bridge quality per classifier file.
# Use omics_files_table in SQL subqueries to filter to overlap samples —
# avoids spark.createDataFrame() which fails on PyArrow ChunkedArray columns.

if overlap_sample_ids and file_bridge_found and bridge_tbl_name and not clf_file_df.empty:
    # Subquery: classifier files that belong to overlap samples
    # (samples with BOTH centrifuge AND metabolomics, determined by SQL INTERSECT)
    clf_data = spark.sql(f"""
        SELECT c.file_id, c.label as taxon_name, c.abundance
        FROM nmdc_arkin.centrifuge_gold c
        JOIN {bridge_tbl_name} b ON c.file_id = b.file_id
        WHERE b.sample_id IN (
            SELECT b2.sample_id FROM {bridge_tbl_name} b2
            JOIN (SELECT DISTINCT file_id FROM nmdc_arkin.centrifuge_gold) c2 ON b2.file_id = c2.file_id
            INTERSECT
            SELECT b3.sample_id FROM {bridge_tbl_name} b3
            JOIN (SELECT DISTINCT file_id FROM nmdc_arkin.metabolomics_gold) m3 ON b3.file_id = m3.file_id
        )
        AND LOWER(c.rank) = 'species'
        AND c.label IS NOT NULL AND c.label != ''
        AND c.abundance > 0
    """).toPandas()

    print(f'Species-rank rows for overlap files: {len(clf_data)}')
    print(f'Files in clf_data: {clf_data["file_id"].nunique()}')

    if len(clf_data) > 0:
        # Join with bridge to mark mapped vs unmapped taxa
        mapped_taxa = set(bridge[bridge['mapping_tier'] != 'unmapped']['taxon_name'].tolist())
        clf_data['is_mapped'] = clf_data['taxon_name'].isin(mapped_taxa)

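        # Compute bridge coverage per file.
        # Mask unmapped rows first so plain aggregations can be used instead of
        # lambdas that index back into clf_data.
        clf_data['mapped_abundance'] = clf_data['abundance'].where(clf_data['is_mapped'], 0.0)
        clf_data['mapped_taxon'] = clf_data['taxon_name'].where(clf_data['is_mapped'])
        bridge_quality_file = clf_data.groupby('file_id').agg(
            total_abundance=('abundance', 'sum'),
            mapped_abundance=('mapped_abundance', 'sum'),
            n_taxa=('taxon_name', 'nunique'),
            n_mapped_taxa=('mapped_taxon', 'nunique')  # nunique ignores the masked NaN rows
        ).reset_index()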
        bridge_quality_file['bridge_coverage'] = (
            bridge_quality_file['mapped_abundance'] / bridge_quality_file['total_abundance']
        ).fillna(0)

        # Map file_id → sample_id via clf_file_df (already in pandas)
        bridge_quality_file = bridge_quality_file.merge(
            clf_file_df[['sample_id', 'clf_file_id']].rename(columns={'clf_file_id': 'file_id'}),
            on='file_id', how='left'
        )

        print(f'\nBridge quality summary:')
        print(bridge_quality_file[['file_id', 'sample_id', 'bridge_coverage',
                                   'n_taxa', 'n_mapped_taxa']].describe().to_string())
        print(f'\nFiles with >=30% bridge coverage: '
              f'{(bridge_quality_file["bridge_coverage"] >= 0.30).sum()} / {len(bridge_quality_file)}')
    else:
        print('No species-rank rows returned for overlap files.')
        bridge_quality_file = pd.DataFrame()
else:
    print('No overlap samples or bridge; skipping bridge quality computation.')
    bridge_quality_file = pd.DataFrame()

Species-rank rows for overlap files: 126905
Files in clf_data: 220

Bridge quality summary:
       bridge_coverage       n_taxa  n_mapped_taxa
count       220.000000   220.000000      220.00000
mean          0.935370   576.840909      512.70000
std           0.076238   350.603729      313.98018
min           0.314354     3.000000        1.00000
25%           0.899178   234.000000      200.75000
50%           0.960668   669.000000      597.00000
75%           0.986040   859.750000      764.00000
max           1.000000  1470.000000     1303.00000

Files with >=30% bridge coverage: 220 / 220


In [21]:
# Summarize bridge quality by coverage threshold
if len(bridge_quality_file) > 0:
    print('Bridge coverage distribution:')
    for threshold in [0.10, 0.20, 0.30, 0.50, 0.70]:
        n = (bridge_quality_file['bridge_coverage'] >= threshold).sum()
        print(f'  >= {threshold:.0%}: {n} / {len(bridge_quality_file)} files '
              f'({n/len(bridge_quality_file):.1%})')

    # Flag samples
    BRIDGE_THRESHOLD = 0.30
    bridge_quality_file['passes_bridge_qc'] = bridge_quality_file['bridge_coverage'] >= BRIDGE_THRESHOLD
    n_pass = bridge_quality_file['passes_bridge_qc'].sum()
    print(f'\nSamples passing QC (>={BRIDGE_THRESHOLD:.0%}): {n_pass}')
    print(f'Samples failing QC (<{BRIDGE_THRESHOLD:.0%}): {len(bridge_quality_file) - n_pass}')

Bridge coverage distribution:
  >= 10%: 220 / 220 files (100.0%)
  >= 20%: 220 / 220 files (100.0%)
  >= 30%: 220 / 220 files (100.0%)
  >= 50%: 219 / 220 files (99.5%)
  >= 70%: 218 / 220 files (99.1%)

Samples passing QC (>=30%): 220
Samples failing QC (<30%): 0

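The QC table above has one row per classifier file, while the exclusion decision is stated per sample. If a sample ever had more than one classifier file, the flags would need to be rolled up. The sketch below shows one conservative option, taking the minimum coverage per sample. This rollup is an assumption for illustration, not something the notebook does, and `sample_level_qc` and all IDs are hypothetical.

```python
from collections import defaultdict

BRIDGE_THRESHOLD = 0.30

# Hypothetical (file_id, sample_id, bridge_coverage) rows; IDs are placeholders.
file_rows = [
    ("file-1", "sample-A", 0.95),
    ("file-2", "sample-A", 0.25),
    ("file-3", "sample-B", 0.90),
]


def sample_level_qc(rows, threshold=BRIDGE_THRESHOLD):
    """Roll file-level coverage up to samples using the minimum coverage per sample."""
    coverage_by_sample = defaultdict(list)
    for _file_id, sample_id, coverage in rows:
        coverage_by_sample[sample_id].append(coverage)
    return {
        sample_id: {
            "n_files": len(values),
            "min_coverage": min(values),
            "passes_bridge_qc": min(values) >= threshold,
        }
        for sample_id, values in coverage_by_sample.items()
    }


sample_qc = sample_level_qc(file_rows)
for sample_id, summary in sample_qc.items():
    print(sample_id, summary)
```

With the minimum as the rollup, `sample-A` fails because one of its two files falls below the threshold. A mean or an abundance-weighted rollup would be more lenient, so the choice should be stated wherever the flags are used.
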

---
## Part 5: Save Outputs and Figures

In [22]:
# Save outputs
bridge_path = os.path.join(DATA_DIR, 'taxon_bridge.tsv')
bridge.to_csv(bridge_path, sep='\t', index=False)
print(f'Saved: data/taxon_bridge.tsv ({len(bridge)} rows)')

if len(bridge_quality_file) > 0:
    bq_path = os.path.join(DATA_DIR, 'bridge_quality.csv')
    bridge_quality_file.to_csv(bq_path, index=False)
    print(f'Saved: data/bridge_quality.csv ({len(bridge_quality_file)} rows)')

if not inventory.empty:
    inv_path = os.path.join(DATA_DIR, 'nmdc_sample_inventory.csv')
    inventory.to_csv(inv_path, index=False)
    print(f'Saved: data/nmdc_sample_inventory.csv ({len(inventory)} rows, '
          f'{inventory["sample_id"].nunique()} unique samples)')
else:
    print('WARNING: No overlap samples found — nmdc_sample_inventory.csv not updated.')

Saved: data/taxon_bridge.tsv (4894 rows)
Saved: data/bridge_quality.csv (220 rows)
Saved: data/nmdc_sample_inventory.csv (646 rows, 221 unique samples)


In [23]:
# Figure: Bridge quality distribution and mapping tier breakdown
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle('Taxonomy Bridge Quality', fontsize=14)

# Panel 1: Bridge coverage histogram
ax = axes[0]
if len(bridge_quality_file) > 0:
    ax.hist(bridge_quality_file['bridge_coverage'], bins=30,
            color='#4C9BE8', edgecolor='white')
    ax.axvline(0.30, color='red', linestyle='--', label='30% threshold')
    ax.set_xlabel('Bridge coverage (fraction of abundance mapped)')
    ax.set_ylabel('Number of samples')
    ax.set_title('Per-sample bridge coverage')
    ax.legend()
else:
    ax.text(0.5, 0.5, 'No bridge quality data', ha='center', va='center',
            transform=ax.transAxes)

# Panel 2: Mapping tier breakdown (number of taxa in each tier; not weighted by n_files)
ax2 = axes[1]
if len(bridge) > 0:
    tier_counts = bridge['mapping_tier'].value_counts()
    colors = {
        'species_exact': '#2ecc71',
        'genus_proxy_unique': '#f39c12',
        'genus_proxy_ambiguous': '#e67e22',
        'unmapped': '#e74c3c'
    }
    bar_colors = [colors.get(t, '#95a5a6') for t in tier_counts.index]
    ax2.bar(range(len(tier_counts)), tier_counts.values,
            color=bar_colors, edgecolor='white')
    ax2.set_xticks(range(len(tier_counts)))
    ax2.set_xticklabels(tier_counts.index, rotation=30, ha='right', fontsize=9)
    ax2.set_ylabel('Number of unique taxa')
    ax2.set_title('Taxon mapping tier distribution')
    for i, val in enumerate(tier_counts.values):
        ax2.text(i, val + 0.5, str(val), ha='center', fontsize=9)
else:
    ax2.text(0.5, 0.5, 'No bridge data', ha='center', va='center', transform=ax2.transAxes)

plt.tight_layout()
fig_path = os.path.join(FIGURES_DIR, 'bridge_quality_distribution.png')
plt.savefig(fig_path, dpi=150, bbox_inches='tight')
plt.show()
print('Saved: figures/bridge_quality_distribution.png')

Saved: figures/bridge_quality_distribution.png


---
## Summary and Decisions for NB03

| Question | Finding |
|---|---|
| file→sample bridge table | `nmdc_arkin.omics_files_table` (has both `file_id` and `sample_id`) |
| Samples with both classifier + metabolomics | 221 (via sample bridge) |
| Species-exact GTDB matches | 1,375 of 4,894 centrifuge taxa (28.1%) |
| Genus-proxy unique matches | 145 |
| Genus-proxy ambiguous | 1,352 |
| Overall taxon mapping rate | 58.7% of centrifuge taxa (88.9% of rows; 93.5% mean per-file abundance coverage) |
| Samples passing 30% bridge QC | 220 / 220 (100%) |
| Classifier used for bridge | centrifuge_gold (61.3% species-rank, best from NB01) |

**Decision for NB03**:  
Use all 220 samples passing 30% bridge QC.  
Include all three mapping tiers (`species_exact`, `genus_proxy_unique`, `genus_proxy_ambiguous`) in community weighting.  
For `genus_proxy_ambiguous` taxa (which map to multiple GTDB clades), the bridge keeps one representative clade per genus, chosen in Step 3b as the first row after sorting alphabetically by `GTDB_species` (`sort_values` + `drop_duplicates` on `genus_norm`).  
This decision is documented in NB03 cell-17 with sensitivity notes in REPORT.md Limitations.

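A downstream consumer of `data/taxon_bridge.tsv` would need to select the three included tiers before weighting. The following is a minimal sketch of that filter using only the standard library. The in-memory TSV stands in for the real file and carries a subset of its columns with made-up values, and `load_taxon_to_clade` is a hypothetical helper, not code from NB03.

```python
import csv
import io

INCLUDED_TIERS = {"species_exact", "genus_proxy_unique", "genus_proxy_ambiguous"}

# Stand-in for data/taxon_bridge.tsv: subset of columns, invented values.
toy_tsv = (
    "taxon_name\tgtdb_species_clade_id\tmapping_tier\n"
    "Taxon A\tclade_1\tspecies_exact\n"
    "Taxon B\tclade_2\tgenus_proxy_ambiguous\n"
    "Taxon C\t\tunmapped\n"
)


def load_taxon_to_clade(handle, tiers=INCLUDED_TIERS):
    """Return {taxon_name: gtdb_species_clade_id} for rows in the requested tiers."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["taxon_name"]: row["gtdb_species_clade_id"]
        for row in reader
        if row["mapping_tier"] in tiers
    }


taxon_to_clade = load_taxon_to_clade(io.StringIO(toy_tsv))
print(taxon_to_clade)  # {'Taxon A': 'clade_1', 'Taxon B': 'clade_2'}
```

Passing a narrower `tiers` set (for example, only `species_exact`) is one simple way to run the kind of sensitivity comparison mentioned for the ambiguous tier.
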