# NB01: ENIGMA Extraction and QC

Build analysis-ready ENIGMA sample, geochemistry, and community tables.

**Requires BERDL JupyterHub** with Spark session support.

**Planned outputs**
- `../data/geochemistry_sample_matrix.tsv`
- `../data/community_taxon_counts.tsv`
- `../data/sample_location_metadata.tsv`


In [None]:
from pathlib import Path
import pandas as pd

DATA_DIR = Path('../data')
DATA_DIR.mkdir(parents=True, exist_ok=True)

print('Data directory:', DATA_DIR.resolve())

In [None]:
# Spark session helper: supports both BERDL utility module and legacy import path
try:
    from berdl_notebook_utils.setup_spark_session import get_spark_session
except ImportError:
    # Fall back to the legacy top-level module when the utility package is unavailable
    from get_spark_session import get_spark_session

spark = get_spark_session()
print('Spark session ready')

In [None]:
# Quick inventory of the tables available in the enigma_coral database
spark.sql('SHOW TABLES IN enigma_coral').show(200, truncate=False)

In [None]:
# Verify schemas before writing extraction SQL
for tbl in [
    'ddt_brick0000010',
    'ddt_brick0000459',
    'ddt_brick0000454',
    'sdt_sample',
    'sdt_community',
    'sdt_location',
]:
    print('\n===', tbl, '===')
    spark.sql(f'DESCRIBE enigma_coral.{tbl}').show(200, truncate=False)
    spark.sql(f'SELECT * FROM enigma_coral.{tbl} LIMIT 5').show(truncate=False)

## Extraction TODOs

1. Finalize geochemistry query from `ddt_brick0000010` with explicit numeric casts.
2. Build overlap sample set with both geochemistry and community coverage.
3. Aggregate ASV/community counts from `ddt_brick0000459` and map to taxonomy via `ddt_brick0000454`.
4. Export three TSVs for NB02/NB03.
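
The first three steps above can be sketched in pandas on toy data. This is only an illustration of the intended shape of the logic, not the finalized extraction: the column names (`sample_id`, `analyte`, `value`, `taxon`, `read_count`) and all values are hypothetical placeholders, and the real names must come from the `DESCRIBE` output for each brick.

```python
import pandas as pd

# Toy long-format geochemistry rows (hypothetical columns and values).
geochem_long = pd.DataFrame({
    'sample_id': ['S1', 'S1', 'S2', 'S3'],
    'analyte': ['analyte_x', 'analyte_y', 'analyte_x', 'analyte_x'],
    'value': ['1.5', '0.02', 'n/a', '3.0'],  # strings, as raw exports often are
})

# Toy long-format community counts (hypothetical columns and values).
community_long = pd.DataFrame({
    'sample_id': ['S1', 'S1', 'S2', 'S4'],
    'taxon': ['taxon_A', 'taxon_B', 'taxon_A', 'taxon_A'],
    'read_count': [10, 5, 7, 2],
})

# Step 1: explicit numeric cast; unparseable entries become NaN instead of raising.
geochem_long['value'] = pd.to_numeric(geochem_long['value'], errors='coerce')

# Step 2: keep only samples that have both geochemistry and community rows.
overlap_ids = sorted(set(geochem_long['sample_id']) & set(community_long['sample_id']))
geochem_overlap = geochem_long[geochem_long['sample_id'].isin(overlap_ids)]
community_overlap = community_long[community_long['sample_id'].isin(overlap_ids)]

# Pivot to a sample x analyte matrix, averaging any replicate measurements.
geochem_matrix = (
    geochem_overlap.groupby(['sample_id', 'analyte'])['value']
    .mean()
    .unstack('analyte')
    .reset_index()
)

# Step 3: aggregate read counts per sample and taxon.
taxon_counts = (
    community_overlap.groupby(['sample_id', 'taxon'], as_index=False)['read_count']
    .sum()
)

print(overlap_ids)
print(geochem_matrix)
print(taxon_counts)
```

In the real notebook the same cast, intersection, and aggregation would more likely be expressed in Spark SQL against `enigma_coral`, with only the final tables converted to pandas for export.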


In [None]:
# Placeholder exports (replace after extraction SQL is finalized)
pd.DataFrame().to_csv(DATA_DIR / 'geochemistry_sample_matrix.tsv', sep='\t', index=False)
pd.DataFrame().to_csv(DATA_DIR / 'community_taxon_counts.tsv', sep='\t', index=False)
pd.DataFrame().to_csv(DATA_DIR / 'sample_location_metadata.tsv', sep='\t', index=False)
print('Wrote placeholder TSV files. Replace with real extracted outputs.')
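
Because the cell above writes empty placeholder files, downstream notebooks could pick them up by mistake. A minimal sketch of a guard, assuming a hypothetical helper named `tsv_is_populated` (not part of this project), treats a TSV as usable only if it parses and has at least one column and one data row:

```python
from pathlib import Path
import tempfile

import pandas as pd
from pandas.errors import EmptyDataError


def tsv_is_populated(path: Path) -> bool:
    """Return True if the TSV exists and has at least one column and one data row."""
    try:
        df = pd.read_csv(path, sep='\t')
    except (FileNotFoundError, EmptyDataError):
        # Missing file, or a file with no header to parse (e.g. an empty placeholder).
        return False
    return df.shape[0] > 0 and df.shape[1] > 0


# Demonstration on temporary files rather than the real ../data outputs.
with tempfile.TemporaryDirectory() as tmp:
    placeholder = Path(tmp) / 'placeholder.tsv'
    pd.DataFrame().to_csv(placeholder, sep='\t', index=False)

    real = Path(tmp) / 'real.tsv'
    pd.DataFrame({'sample_id': ['S1'], 'value': [1.0]}).to_csv(real, sep='\t', index=False)

    placeholder_ok = tsv_is_populated(placeholder)
    real_ok = tsv_is_populated(real)

print('placeholder usable:', placeholder_ok)
print('real table usable:', real_ok)
```

Calling the helper on each of the three planned output paths at the top of NB02/NB03 would make an unreplaced placeholder fail fast instead of surfacing later as a confusing parse error.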