# NB01: NMDC Schema Exploration and Sample Inventory

**Project**: Community Metabolic Ecology via NMDC × Pangenome Integration  
**Requires**: BERDL JupyterHub (Spark — `get_spark_session()` injected into kernel)  

## Purpose

Gate notebook for the project. Before writing NB02–NB05, we must verify:

1. Column names in `nmdc_arkin.study_table`, `taxonomy_features`, and `metabolomics_gold`  
   (not fully documented in schema docs)
2. Whether `sample_id` values are prefixed with `study_id` (Query 1 depends on this)
3. Whether `taxonomy_features` provides relative abundances or raw read counts
4. Whether `metabolomics_gold` carries KEGG/ChEBI compound IDs in its own columns
5. Which taxonomy classifier (kraken_gold, centrifuge_gold, gottcha_gold, taxonomy_features)  
   provides the most species-level resolution
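
Item 3 can be checked with a simple heuristic once the abundance column is known. The sketch below is plain Python and assumes one sample's values have already been collected into a list. The function name `classify_abundance_scale` and the tolerance `tol` are illustrative assumptions, not part of the NMDC schema.

```python
import math

def classify_abundance_scale(values, tol=0.05):
    """Guess whether one sample's taxon values are fractions, percentages, or raw counts."""
    total = sum(values)
    if math.isclose(total, 1.0, abs_tol=tol):
        return 'fraction'      # values sum to roughly 1
    if math.isclose(total, 100.0, abs_tol=100 * tol):
        return 'percent'       # values sum to roughly 100
    if all(float(v).is_integer() for v in values):
        return 'raw_counts'    # integer values with an arbitrary total
    return 'unknown'           # e.g. fractions that exclude unclassified reads
```

The heuristic is deliberately coarse: integer counts that happen to sum to 100 would be labelled `percent`, so it is worth applying to several samples before deciding whether NB03 needs normalization.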

## Outputs

- `data/nmdc_sample_inventory.csv`: samples with paired taxonomy + metabolomics
- `data/nmdc_taxonomy_coverage.csv`: per-sample taxonomy stats per classifier (not written by the cells below until the Part 3 placeholder is completed)
- `data/nmdc_metabolomics_coverage.csv`: per-sample compound counts (annotation rates pending the Part 4 placeholder)
- `figures/nmdc_sample_coverage.png`: sample overlap by data type and metabolomics compound-count distribution

## Key Decisions to Make

- Which taxonomy classifier to use in NB02?
- Is normalization needed before computing community-weighted completeness (NB03)?
- How to map metabolomics compounds to amino acid identity (Plan A: KEGG IDs; Plan B: name matching)?
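
Plan B in the last item is sensitive to substring collisions: `alanine` occurs inside `phenylalanine`, and `leucine` inside `isoleucine`. A minimal sketch of whole-word matching in plain Python follows; the helper name `match_amino_acid` and the shortened name list are illustrative assumptions.

```python
import re

# Shortened, illustrative list; the notebook's AMINO_ACIDS list would be used in practice
AMINO_ACID_NAMES = ['alanine', 'leucine', 'isoleucine', 'phenylalanine', 'glutamic acid']

def match_amino_acid(compound_name, names=AMINO_ACID_NAMES):
    """Return the amino acid name that appears as a whole word in compound_name, else None."""
    text = compound_name.lower()
    # Try longer names first so 'phenylalanine' is tested before 'alanine'
    for name in sorted(names, key=len, reverse=True):
        # Lookarounds reject matches embedded in a longer alphabetic token
        if re.search(rf'(?<![a-z]){re.escape(name)}(?![a-z])', text):
            return name
    return None
```

Whole-word matching still returns `alanine` for names such as `beta-alanine`, so hits from name matching would need manual review before being treated as proteinogenic amino acids.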

In [None]:
# On BERDL JupyterHub — get_spark_session() is injected into the kernel; no import needed
spark = get_spark_session()
spark

In [None]:
import os
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
try:
    import matplotlib_venn  # optional; pip install matplotlib-venn if needed
except ImportError:
    matplotlib_venn = None  # not required by the cells below

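# Paths: the project root is taken to be the parent of the current working directory
# (adjust if running from a different cwd). __file__ is not defined in a notebook
# kernel, so the path is resolved from os.getcwd().
PROJECT_DIR = os.path.abspath(os.path.join(os.getcwd(), '..'))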
DATA_DIR = os.path.join(PROJECT_DIR, 'data')
FIGURES_DIR = os.path.join(PROJECT_DIR, 'figures')
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIGURES_DIR, exist_ok=True)
print(f'DATA_DIR: {DATA_DIR}')
print(f'FIGURES_DIR: {FIGURES_DIR}')

---
## Part 1: Schema Verification

Verify column names before building any dependent queries.
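
The manual check of each `DESCRIBE` output can be backed by a small helper that lists the expected columns that are missing. This is a sketch in plain Python. The helper name `missing_columns` is hypothetical, and the example column list is invented for illustration; in the notebook the list would come from the `col_name` field of the `DESCRIBE` result.

```python
def missing_columns(described_cols, expected):
    """Return the expected column names absent from a DESCRIBE result (case-insensitive)."""
    present = {name.lower() for name in described_cols}
    return [col for col in expected if col.lower() not in present]

# Hypothetical col_name values, as if returned by DESCRIBE nmdc_arkin.study_table
described = ['study_id', 'name', 'ecosystem_category']
print(missing_columns(described, ['study_id', 'ecosystem_category', 'ecosystem_type']))
# → ['ecosystem_type']
```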

In [None]:
# Step 1a: study_table — confirm study_id, ecosystem_category, ecosystem_type
print('=== nmdc_arkin.study_table schema ===')
spark.sql('DESCRIBE nmdc_arkin.study_table').show(50, truncate=False)

In [None]:
# Step 1b: taxonomy_features — confirm sample_id column, taxon/abundance columns
print('=== nmdc_arkin.taxonomy_features schema ===')
spark.sql('DESCRIBE nmdc_arkin.taxonomy_features').show(50, truncate=False)

In [None]:
# Step 1c: metabolomics_gold — look for KEGG/ChEBI compound ID columns
print('=== nmdc_arkin.metabolomics_gold schema ===')
spark.sql('DESCRIBE nmdc_arkin.metabolomics_gold').show(50, truncate=False)

In [None]:
# Step 1d: Inspect raw sample_id and study_id values to check prefix format
# This determines whether LIKE CONCAT(study_id, '%') join works
print('=== Sample study_table values ===')
spark.sql("""
    SELECT *
    FROM nmdc_arkin.study_table
    LIMIT 5
""").show(truncate=False)

print('\n=== Sample taxonomy_features sample_id values ===')
spark.sql("""
    SELECT sample_id
    FROM nmdc_arkin.taxonomy_features
    LIMIT 10
""").show(truncate=False)

print('\n=== Sample metabolomics_gold rows (check sample_id format) ===')
spark.sql("""
    SELECT *
    FROM nmdc_arkin.metabolomics_gold
    LIMIT 3
""").show(truncate=False)

In [None]:
# Step 1e: Check taxonomy classifiers for available rank columns
for tbl in ['kraken_gold', 'centrifuge_gold', 'gottcha_gold']:
    try:
        print(f'\n=== nmdc_arkin.{tbl} schema ===')
        spark.sql(f'DESCRIBE nmdc_arkin.{tbl}').show(30, truncate=False)
    except Exception as e:
        print(f'  ERROR describing {tbl}: {e}')

---
## Part 2: Study and Sample Inventory

How many studies exist? What ecosystems are covered? How many samples have paired taxonomy + metabolomics?
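
The three counts this part reports (taxonomy only, metabolomics only, both) are plain set arithmetic on the distinct `sample_id` values from each table. A minimal sketch, assuming the IDs are already in Python collections; `overlap_summary` is a hypothetical helper name and the IDs in the usage line are invented.

```python
def overlap_summary(tax_ids, met_ids):
    """Count samples unique to each data type and samples present in both."""
    tax, met = set(tax_ids), set(met_ids)
    return {
        'taxonomy_only': len(tax - met),
        'metabolomics_only': len(met - tax),
        'both': len(tax & met),
    }

# Invented IDs for illustration
print(overlap_summary(['s1', 's2', 's3'], ['s2', 's3', 's4', 's5']))
# → {'taxonomy_only': 1, 'metabolomics_only': 2, 'both': 2}
```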

In [None]:
# All studies with row count
study_df = spark.sql("""
    SELECT *
    FROM nmdc_arkin.study_table
""").toPandas()

print(f'Total studies: {len(study_df)}')
print(study_df.columns.tolist())
study_df.head(10)

In [None]:
# Count samples with taxonomy profiles
n_tax_samples = spark.sql("""
    SELECT COUNT(DISTINCT sample_id) as n
    FROM nmdc_arkin.taxonomy_features
""").collect()[0]['n']

# Count samples with metabolomics data
n_met_samples = spark.sql("""
    SELECT COUNT(DISTINCT sample_id) as n
    FROM nmdc_arkin.metabolomics_gold
""").collect()[0]['n']

print(f'Samples with taxonomy profiles:  {n_tax_samples}')
print(f'Samples with metabolomics data:  {n_met_samples}')

In [None]:
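# Samples with BOTH taxonomy AND metabolomics
# LEFT SEMI JOIN keeps each taxonomy sample_id that has a metabolomics match without
# materializing the many-to-many row product that a plain JOIN + GROUP BY would build.
overlap_df = spark.sql("""
    SELECT DISTINCT tf.sample_id
    FROM nmdc_arkin.taxonomy_features tf
    LEFT SEMI JOIN nmdc_arkin.metabolomics_gold mg ON tf.sample_id = mg.sample_id
""").toPandas()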

print(f'Samples with BOTH taxonomy AND metabolomics: {len(overlap_df)}')
overlap_samples = set(overlap_df['sample_id'])

In [None]:
# Try to join samples to study metadata
# NOTE: The join key between samples and studies is unknown until schema is verified above.
# Update STUDY_ID_COL and SAMPLE_STUDY_COL based on Part 1 findings.
# Placeholder: inspect sample_id prefix manually to determine join strategy.

# Show a few overlap sample IDs alongside study IDs to check prefix patterns
print('Overlap sample_id examples:')
print(overlap_df['sample_id'].head(15).tolist())
print('\nStudy ID examples (from study_table):')
# Assumes the first column of study_table holds the study ID;
# replace iloc[:5, 0] with the column confirmed in Part 1 step 1d
print(study_df.iloc[:5, 0].tolist() if len(study_df) > 0 else '(no studies found)')
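
The manual prefix inspection above can be turned into a quick report. The sketch below is plain Python with invented IDs. `prefix_match_report` is a hypothetical helper, and the real ID format is unknown until Part 1 has been run.

```python
def prefix_match_report(sample_ids, study_ids):
    """Summarize how many sample_ids start with a study_id, and which are ambiguous."""
    matches = {s: [st for st in study_ids if s.startswith(st)] for s in sample_ids}
    n = len(matches)
    return {
        'matched_fraction': sum(1 for m in matches.values() if m) / n if n else 0.0,
        'ambiguous': sorted(s for s, m in matches.items() if len(m) > 1),
        'unmatched': sorted(s for s, m in matches.items() if not m),
    }

# Invented IDs: 'studyA' is itself a prefix of 'studyAB', so one sample is ambiguous
report = prefix_match_report(['studyA_001', 'studyAB_002', 'other_003'], ['studyA', 'studyAB'])
print(report['ambiguous'], report['unmatched'])
# → ['studyAB_002'] ['other_003']
```

A non-empty `ambiguous` list would mean that a `LIKE CONCAT(study_id, '%')` join assigns some samples to more than one study, which is a reason to prefer an explicit study column if one exists.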

In [None]:
# Abiotic features for overlap samples
abiotic_df = spark.sql("""
    SELECT *
    FROM nmdc_arkin.abiotic_features
""").toPandas()

print(f'Total abiotic_features rows: {len(abiotic_df)}')
abiotic_overlap = abiotic_df[abiotic_df['sample_id'].isin(overlap_samples)]
print(f'abiotic_features rows for overlap samples: {len(abiotic_overlap)}')
print(abiotic_df.columns.tolist())

---
## Part 3: Taxonomy Classifier Comparison

Which classifier (kraken_gold, centrifuge_gold, gottcha_gold, taxonomy_features) provides the most species-level resolution in overlap samples?
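
One way to compare classifiers is the per-sample fraction of rows assigned at species rank. A minimal sketch in plain Python, assuming each classifier table can be reduced to `(sample_id, rank)` pairs; whether a rank column exists, and what it is called, is exactly what Part 1 has to confirm. `species_fraction_by_sample` is a hypothetical helper.

```python
from collections import defaultdict

def species_fraction_by_sample(rows):
    """rows: iterable of (sample_id, rank). Returns {sample_id: fraction of rows at species rank}."""
    total = defaultdict(int)
    species = defaultdict(int)
    for sample_id, rank in rows:
        total[sample_id] += 1
        if rank is not None and rank.lower() == 'species':
            species[sample_id] += 1
    return {s: species[s] / total[s] for s in total}

# Invented rows for illustration
rows = [('s1', 'species'), ('s1', 'genus'), ('s2', 'Species')]
print(species_fraction_by_sample(rows))
# → {'s1': 0.5, 's2': 1.0}
```

This counts rows rather than abundance, so a classifier with many low-abundance species calls would score well here; an abundance-weighted variant may be a fairer comparison once the abundance column is known.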

In [None]:
# Preview each classifier table as a first step toward assessing species-level resolution
# Columns vary by classifier; update RANK_COL and TAXON_COL after Part 1 schema review

def assess_classifier(table_name, sample_col='sample_id', rank_col=None, taxon_col=None):
    """
    Preview a classifier table: show 5 sample rows and return its column names
    (None if the table cannot be read).

    sample_col, rank_col, and taxon_col are not used yet. They are reserved for the
    per-sample stats (n_unique_taxa, fraction_with_species_rank) to be added once the
    column names are confirmed from the DESCRIBE output in Part 1.
    rank_col: column containing taxonomic rank label (e.g., 'rank' or 'taxon_rank')
    taxon_col: column containing taxon name
    """
    try:
        df = spark.sql(f'SELECT * FROM nmdc_arkin.{table_name} LIMIT 5')
        print(f'\n--- {table_name} sample rows ---')
        df.show(truncate=False)
        cols = df.columns
        print(f'Columns: {cols}')
        return cols
    except Exception as e:
        print(f'  ERROR accessing {table_name}: {e}')
        return None

# Keep the column lists so the next cell can pick RANK_COL / TAXON_COL per classifier
classifier_cols = {}
for tbl in ['kraken_gold', 'centrifuge_gold', 'gottcha_gold', 'taxonomy_features']:
    classifier_cols[tbl] = assess_classifier(tbl)

In [None]:
# After inspecting columns above, fill in the correct rank/taxon column names.
# This cell is a placeholder — complete after Part 1 + Part 3 assessment above.
#
# Example (update column names to match actual schema):
#
# RANK_COL = 'rank'        # e.g., 'species', 'genus', 'family'
# TAXON_COL = 'taxon_name' # taxon name string
# ABUND_COL = 'abundance'  # relative abundance or read count
#
# species_frac = spark.sql(f"""
#     SELECT
#         COUNT(CASE WHEN LOWER({RANK_COL}) = 'species' THEN 1 END) as n_species_rows,
#         COUNT(*) as n_total_rows,
#         COUNT(CASE WHEN LOWER({RANK_COL}) = 'species' THEN 1 END) / COUNT(*) as species_frac
#     FROM nmdc_arkin.kraken_gold
# """).toPandas()
# display(species_frac)

print('TODO: Fill in RANK_COL, TAXON_COL, ABUND_COL based on schema review above.')

---
## Part 4: Metabolomics Coverage

What fraction of measured compounds carry KEGG/ChEBI compound IDs (needed for amino acid matching in NB04)?
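
The annotation rate asked about here is the share of rows whose compound ID is neither null nor blank. A minimal sketch in plain Python follows; `annotation_rate` is a hypothetical helper and the ID values in the usage line are illustrative only.

```python
def annotation_rate(compound_ids):
    """Fraction of entries carrying a non-null, non-blank compound ID."""
    ids = list(compound_ids)
    if not ids:
        return 0.0
    annotated = sum(1 for cid in ids if cid is not None and str(cid).strip() != '')
    return annotated / len(ids)

# Illustrative values: two annotated rows, one null, one empty string
print(annotation_rate(['C00041', None, '', 'C00183']))
# → 0.5
```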

In [None]:
# Per-sample compound counts and annotation rates
# Update column names based on metabolomics_gold DESCRIBE output (Part 1 step 1c)

met_sample_stats = spark.sql("""
    SELECT
        sample_id,
        COUNT(*) as n_compounds_total
    FROM nmdc_arkin.metabolomics_gold
    GROUP BY sample_id
""").toPandas()

print(f'Samples with metabolomics: {len(met_sample_stats)}')
print(f'Compounds per sample — median: {met_sample_stats["n_compounds_total"].median():.0f}, '
      f'max: {met_sample_stats["n_compounds_total"].max()}')
met_sample_stats.describe()

In [None]:
# Check for KEGG/ChEBI annotation columns in metabolomics_gold
# After seeing the schema in Part 1, identify candidate columns for compound annotation.
# This cell searches for columns whose names suggest compound IDs.

met_cols = spark.sql('DESCRIBE nmdc_arkin.metabolomics_gold').toPandas()
candidate_id_cols = met_cols[
    met_cols['col_name'].str.lower().str.contains(
        'kegg|chebi|compound|id|inchi|smiles|name|annotation', na=False
    )
]
print('Candidate compound ID columns:')
print(candidate_id_cols[['col_name', 'data_type']].to_string())

In [None]:
# Annotation rate: fraction of metabolomics rows with a non-null compound ID
# Update COMPOUND_ID_COL to the actual column name found above.
#
# COMPOUND_ID_COL = 'kegg_id'  # UPDATE THIS
#
# annotation_rate = spark.sql(f"""
#     SELECT
#         COUNT(CASE WHEN {COMPOUND_ID_COL} IS NOT NULL AND {COMPOUND_ID_COL} != '' THEN 1 END) as n_annotated,
#         COUNT(*) as n_total,
#         COUNT(CASE WHEN {COMPOUND_ID_COL} IS NOT NULL AND {COMPOUND_ID_COL} != '' THEN 1 END)
#             / COUNT(*) as annotation_rate
#     FROM nmdc_arkin.metabolomics_gold
# """).toPandas()
# print(annotation_rate)

# For now, inspect a sample of rows to see what annotation looks like
spark.sql("""
    SELECT *
    FROM nmdc_arkin.metabolomics_gold
    LIMIT 10
""").show(truncate=False)
print('TODO: Set COMPOUND_ID_COL after reviewing column names above.')

In [None]:
# Check for amino acid compound names via string matching (Plan B fallback)
AMINO_ACIDS = [
    'alanine', 'arginine', 'asparagine', 'aspartate', 'aspartic acid',
    'cysteine', 'glutamate', 'glutamic acid', 'glutamine', 'glycine',
    'histidine', 'isoleucine', 'leucine', 'lysine', 'methionine',
    'phenylalanine', 'proline', 'serine', 'threonine', 'tryptophan',
    'tyrosine', 'valine'
]

# Try to find a compound name column and check for amino acid hits
# Update NAME_COL based on schema review
# NAME_COL = 'compound_name'  # UPDATE THIS
#
# aa_pattern = '|'.join(AMINO_ACIDS)
# aa_hits = spark.sql(f"""
#     SELECT {NAME_COL}, COUNT(DISTINCT sample_id) as n_samples
#     FROM nmdc_arkin.metabolomics_gold
#     WHERE LOWER({NAME_COL}) RLIKE '{aa_pattern}'
#     GROUP BY {NAME_COL}
#     ORDER BY n_samples DESC
# """).toPandas()
# print(f'Amino acid compound hits: {len(aa_hits)}')
# display(aa_hits.head(20))
print(f'Amino acids to search for: {AMINO_ACIDS}')
print('TODO: Set NAME_COL after reviewing metabolomics_gold columns above.')

---
## Part 5: Save Outputs and Figures

In [None]:
# Save sample inventory
# sorted() gives a reproducible row order (set iteration order is arbitrary)
inventory = pd.DataFrame({'sample_id': sorted(overlap_samples)})
inventory['has_taxonomy'] = True
inventory['has_metabolomics'] = True

# Merge abiotic features (ecosystem type) for overlap samples
# NOTE: if abiotic_features has several rows per sample_id, this merge duplicates inventory rows
if len(abiotic_overlap) > 0:
    inventory = inventory.merge(abiotic_overlap, on='sample_id', how='left')

inventory.to_csv(os.path.join(DATA_DIR, 'nmdc_sample_inventory.csv'), index=False)
print(f'Saved: data/nmdc_sample_inventory.csv ({len(inventory)} rows)')

In [None]:
# Save metabolomics coverage stats
met_overlap = met_sample_stats[met_sample_stats['sample_id'].isin(overlap_samples)].copy()
met_overlap.to_csv(os.path.join(DATA_DIR, 'nmdc_metabolomics_coverage.csv'), index=False)
print(f'Saved: data/nmdc_metabolomics_coverage.csv ({len(met_overlap)} rows)')

In [None]:
# Figure: sample counts by data type, and compound-count distribution for overlap samples
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle('NMDC Sample Coverage', fontsize=14)

# Panel 1: Venn-style bar chart (simple counts)
ax = axes[0]
counts = {
    'Taxonomy only': n_tax_samples - len(overlap_samples),
    'Metabolomics only': n_met_samples - len(overlap_samples),
    'Both (overlap)': len(overlap_samples)
}
bars = ax.bar(list(counts.keys()), list(counts.values()), 
              color=['#4C9BE8', '#E88C4C', '#6EC46E'], edgecolor='white')
ax.set_ylabel('Number of samples')
ax.set_title('Sample data availability')
for bar, val in zip(bars, counts.values()):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
            str(val), ha='center', va='bottom', fontsize=10)
ax.tick_params(axis='x', rotation=15)

# Panel 2: Metabolomics compound count distribution for overlap samples
ax2 = axes[1]
if len(met_overlap) > 0:
    ax2.hist(met_overlap['n_compounds_total'], bins=30, color='#6EC46E', edgecolor='white')
    ax2.set_xlabel('Compounds per sample')
    ax2.set_ylabel('Number of samples')
    ax2.set_title('Metabolomics compound counts\n(overlap samples)')
    ax2.axvline(met_overlap['n_compounds_total'].median(), color='black',
                linestyle='--', label=f'Median: {met_overlap["n_compounds_total"].median():.0f}')
    ax2.legend()
else:
    ax2.text(0.5, 0.5, 'No overlap samples found', ha='center', va='center',
             transform=ax2.transAxes)

plt.tight_layout()
fig_path = os.path.join(FIGURES_DIR, 'nmdc_sample_coverage.png')
plt.savefig(fig_path, dpi=150, bbox_inches='tight')
plt.show()
print('Saved: figures/nmdc_sample_coverage.png')

---
## Summary and Decisions for NB02

Fill in after running all cells:

| Question | Finding |
|---|---|
| Samples with taxonomy | ??? |
| Samples with metabolomics | ??? |
| Overlap (both) | ??? |
| Best taxonomy classifier | ??? (kraken / centrifuge / gottcha / taxonomy_features) |
| Taxonomy resolution | ??? (genus-level only / species-level available) |
| Taxonomy values = relative abundance? | ??? (yes / no — normalization needed) |
| Metabolomics compound IDs available? | ??? (KEGG IDs in column ___ / name matching only) |
| Amino acid metabolites found? | ??? compounds matching AA names |
| Study → sample ID join strategy | ??? (prefix match / explicit study_id column / other) |

**Decision for NB02**: Use `___` classifier. Bridge at `___` level. Expected bridge coverage: ???%.