# NB02: Taxonomy Bridge and Functional Features

Map ENIGMA taxa to BERDL pangenome taxonomy and summarize stress-related functional features.

**Inputs**
- `../data/community_taxon_counts.tsv`
- `../data/sample_location_metadata.tsv`

**Planned outputs**
- `../data/taxon_bridge.tsv`
- `../data/taxon_functional_features.tsv`


In [None]:
from pathlib import Path
import re
import pandas as pd

DATA_DIR = Path('../data')
community = pd.read_csv(DATA_DIR / 'community_taxon_counts.tsv', sep='\t')
sample_meta = pd.read_csv(DATA_DIR / 'sample_location_metadata.tsv', sep='\t')

print('community rows:', len(community))
print('sample metadata rows:', len(sample_meta))

In [None]:
# Spark connection for pangenome taxonomy + annotations.
# Prefer the packaged helper; fall back to a top-level module if the package is not importable.
try:
    from berdl_notebook_utils.setup_spark_session import get_spark_session
except ImportError:
    from get_spark_session import get_spark_session

spark = get_spark_session()
print('Spark session ready')

In [None]:
# Discover candidate taxonomy columns before building deterministic join logic
for tbl in [
    'kbase_ke_pangenome.gtdb_taxonomy_r214v1',
    'kbase_ke_pangenome.gtdb_species_clade',
    'kbase_ke_pangenome.genome',
    'kbase_ke_pangenome.eggnog_mapper_annotations',
    'kbase_ke_pangenome.gene_cluster',
]:
    print('\n===', tbl, '===')
    spark.sql(f'DESCRIBE {tbl}').show(200, truncate=False)

## Bridge Strategy

Use confidence tiers for taxon mapping:
- Tier 1: Exact NCBI taxid match
- Tier 2: Normalized exact name match
- Tier 3: Genus-level relaxed match

Run all downstream models twice, once on Tier 1 mappings only and once on Tier 1 + Tier 2 mappings, as a sensitivity analysis.
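
The tiering above could be prototyped in pandas before the Spark SQL is finalized. The sketch below is illustrative only: the column names (`ncbi_taxid`, `name`, `genus`), the helper names `normalize_name` and `assign_tiers`, and the toy rows are all assumptions, not the actual ENIGMA or BERDL schema. The normalization rule (strip a GTDB-style rank prefix such as `s__`, unify underscores and whitespace, lowercase) is likewise one possible choice.

```python
import re
import pandas as pd


def normalize_name(name: str) -> str:
    """Strip a rank prefix like 's__', unify separators, and lowercase."""
    name = re.sub(r'^[a-z]__', '', str(name).strip())
    name = re.sub(r'[_\s]+', ' ', name)
    return name.lower().strip()


def assign_tiers(enigma: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Match taxa at each tier, then keep only the best (lowest) tier per taxon."""
    enigma = enigma.assign(norm_name=enigma['enigma_taxon'].map(normalize_name))
    reference = reference.assign(norm_name=reference['name'].map(normalize_name))

    # Tier number -> join key (hypothetical column names).
    tier_keys = {1: 'ncbi_taxid', 2: 'norm_name', 3: 'genus'}
    matches = []
    for tier, key in tier_keys.items():
        left = enigma.dropna(subset=[key])[['enigma_taxon', key]]
        right = reference.dropna(subset=[key])[[key, 'mapped_id']]
        hit = left.merge(right, on=key, how='inner')[['enigma_taxon', 'mapped_id']]
        matches.append(hit.assign(mapping_tier=tier))

    bridge = pd.concat(matches, ignore_index=True)
    best = bridge.groupby('enigma_taxon')['mapping_tier'].transform('min')
    return bridge[bridge['mapping_tier'] == best].drop_duplicates().reset_index(drop=True)


# Toy data, purely illustrative (not real taxa or taxids).
enigma_demo = pd.DataFrame({
    'enigma_taxon': ['Alpha one', 'beta_two', 'Gamma sp.', 'Delta four'],
    'ncbi_taxid': pd.array([101, None, None, None], dtype='Int64'),
    'genus': ['Alpha', 'Beta', 'Gamma', 'Delta'],
})
reference_demo = pd.DataFrame({
    'mapped_id': ['ref_1', 'ref_2', 'ref_3'],
    'ncbi_taxid': pd.array([101, None, 303], dtype='Int64'),
    'name': ['s__Alpha one', 's__Beta two', 's__Gamma three'],
    'genus': ['Alpha', 'Beta', 'Gamma'],
})

bridge_demo = assign_tiers(enigma_demo, reference_demo)
print(bridge_demo)

# Sensitivity subsets: Tier 1 only vs. Tier 1 + Tier 2.
tier1_only = bridge_demo[bridge_demo['mapping_tier'] == 1]
tier1_plus_2 = bridge_demo[bridge_demo['mapping_tier'] <= 2]
```

In this sketch, unmatched taxa are simply absent from the result, and a Tier 3 (genus-level) match can return several `mapped_id` rows per taxon when the reference holds more than one entry for that genus. How to collapse those rows is left open here.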


In [None]:
# Write header-only placeholder outputs (no data rows) until the mapping SQL is finalized
pd.DataFrame(columns=['enigma_taxon', 'mapped_id', 'mapping_tier']).to_csv(
    DATA_DIR / 'taxon_bridge.tsv', sep='\t', index=False
)
pd.DataFrame(columns=['mapped_id', 'feature_name', 'feature_value']).to_csv(
    DATA_DIR / 'taxon_functional_features.tsv', sep='\t', index=False
)
print('Wrote placeholder bridge/features tables.')