# EXFOR Data Loading & Verification with Pre-Enriched AME2020/NUBASE2020

## Real Experimental Nuclear Data from IAEA EXFOR

This notebook demonstrates how to load and verify **real experimental nuclear data** from the IAEA EXFOR database using the X4Pro SQLite format.

**NUCML-Next uses ONLY real data:**
- ‚úÖ Uses REAL experimental cross-section measurements
- ‚úÖ X4Pro SQLite ingestion (efficient, single-file database)
- ‚úÖ **NEW:** Pre-enriched AME2020/NUBASE2020 data (all 5 files, all tiers)
- ‚úÖ Production-grade data quality

## Pre-Enrichment Architecture (v1.2.0+)

**Key Change:** AME2020/NUBASE2020 enrichment now happens ONCE during ingestion, not repeatedly during feature generation.

**Benefits:**
- ‚ö° **Faster**: No file parsing or joins during feature generation
- üéØ **Consistent**: All users get same enrichment from single Parquet source
- üíæ **Efficient**: Parquet columnar format only loads needed columns
- üì¶ **Production-Ready**: Single data source with all tier features

**What's in Pre-Enriched Parquet:**
```
Core EXFOR: Entry, Z, A, N, MT, Energy, CrossSection, Uncertainty
Tier B/C:   Mass_Excess_keV, Binding_Energy_keV, S_1n, S_2n, S_1p, S_2p
Tier D:     Spin, Parity, Isomer_Level, Half_Life_s
Tier E:     Q_alpha, Q_2beta_minus, Q_n_alpha, ... (8 Q-values)
```

---

## Focus Isotopes for This Tutorial

We'll focus on two isotopes that represent different data availability scenarios:

### 1. **U-235 (Uranium-235) - Well-Understood**
- **Why:** Critical for nuclear reactors (LWR fuel)
- **Data Quality:** Extensive experimental measurements since 1940s
- **Key Reactions:** (n,f) fission at MT=18, (n,Œ≥) capture at MT=102
- **Expected Data:** 1000+ measurements across energy range
- **AME2020:** Full enrichment available (mass, energetics, structure, Q-values)

### 2. **Cl-35 (Chlorine-35) (n,p) - Research Interest**
- **Why:** Important for nuclear astrophysics, medical isotope production
- **Data Quality:** Limited experimental data, sparse measurements
- **Key Reaction:** (n,p) at MT=103 ‚Üí Produces S-35 (medical tracer)
- **Expected Data:** 10-100 measurements (much sparser!)
- **AME2020:** Full enrichment available for comparison with U-235

**Educational Value:** These isotopes demonstrate ML performance on data-rich vs. data-sparse scenarios.

---

## Data Ingestion (One-Time Setup)

**Option 1: Quick Start with Sample Database**

```bash
# Basic ingestion (no enrichment)
python scripts/ingest_exfor.py \
    --x4-db data/x4sqlite1_sample.db \
    --output data/exfor_enriched.parquet
```

**Option 2: Full Enrichment (Recommended for Production)**

```bash
# Step 1: Download AME2020/NUBASE2020 files (one-time)
cd data/
wget https://www-nds.iaea.org/amdc/ame2020/mass_1.mas20.txt
wget https://www-nds.iaea.org/amdc/ame2020/rct1.mas20.txt
wget https://www-nds.iaea.org/amdc/ame2020/rct2_1.mas20.txt
wget https://www-nds.iaea.org/amdc/ame2020/nubase_4.mas20.txt
cd ..

# Step 2: Ingest with full enrichment (all tier columns added)
python scripts/ingest_exfor.py \
    --x4-db data/x4sqlite1_sample.db \
    --output data/exfor_enriched.parquet \
    --ame2020-dir data/
```

**Option 3: Use Full X4 Database**

```bash
# Download full EXFOR database from https://www-nds.iaea.org/x4/
python scripts/ingest_exfor.py \
    --x4-db ~/data/x4sqlite1.db \
    --output data/exfor_enriched.parquet \
    --ame2020-dir data/
```

**What Happens During Ingestion:**
1. Loads X4Pro SQLite database (EXFOR measurements)
2. Loads ALL 5 AME2020/NUBASE2020 files (if --ame2020-dir provided)
3. Merges ALL enrichment columns into EXFOR dataframe
4. Writes complete schema to Parquet with ~25 columns
5. Feature selection = column selection (no file I/O, no joins!)

---

In [None]:
import sys
sys.path.append('..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from nucml_next.examples import load_dataset, print_dataset_summary

print("‚úì Imports successful")
print("‚úì NUCML-Next: X4Pro SQLite ingestion for real EXFOR data")

## Step 1: Verify EXFOR Data Exists

In [None]:
# Check if EXFOR data has been processed
exfor_path = Path('../data/exfor_enriched.parquet')

if not exfor_path.exists():
    print("‚ùå EXFOR data not found!")
    print("\nPlease run the ingestor first:")
    print("  python scripts/ingest_exfor.py --x4-db data/x4sqlite1_sample.db --output data/exfor_enriched.parquet")
    print("\nOr use the quick_ingest helper in your code:")
    print("  from nucml_next.examples import quick_ingest")
    print("  df = quick_ingest()")
    raise FileNotFoundError(f"EXFOR data not found at {exfor_path}")
else:
    print(f"‚úì Found EXFOR data at {exfor_path}")
    # Check size
    if exfor_path.is_dir():
        print(f"  Type: Partitioned dataset (directory)")
    else:
        size_mb = exfor_path.stat().st_size / (1024**2)
        print(f"  Size: {size_mb:.1f} MB")

## Step 2: Load Real EXFOR Data

**Note:** `data_path` is REQUIRED. If you don't provide it or if the file doesn't exist, NucmlDataset will raise an error immediately. This prevents accidental misuse.

In [None]:
# Load EXFOR data focusing on our two isotopes
# DEMONSTRATION: Using legacy filters for simple isotope selection
# For production training, use DataSelection for physics-aware filtering!

# U-235: Z=92, A=235, MT=18 (fission), MT=102 (capture)
# Cl-35: Z=17, A=35, MT=103 (n,p)

from nucml_next.examples import load_dataset

dataset = load_dataset(
    data_path='../data/exfor_enriched.parquet',
    mode='tabular',
    filters={  # Simple filters for demonstration
        'Z': [92, 17],     # Uranium and Chlorine
        'A': [235, 35],    # U-235 and Cl-35
        'MT': [18, 102, 103]  # Fission, capture, (n,p)
    },
)

print(f"\n‚úì Loaded {len(dataset.df)} REAL experimental data points")
print(f"  Isotopes: {dataset.df[['Z', 'A']].drop_duplicates().shape[0]}")
print(f"  Reactions: {dataset.df['MT'].nunique()}")
print(f"  Energy range: {dataset.df['Energy'].min():.2e} - {dataset.df['Energy'].max():.2e} eV")

# Show breakdown by isotope
print("\nüìä Data Distribution by Isotope:")
for (z, a), group in dataset.df.groupby(['Z', 'A']):
    isotope_name = f"{'U' if z==92 else 'Cl'}-{a}"
    print(f"  {isotope_name:8s}: {len(group):>6,} measurements")
    for mt, mt_group in group.groupby('MT'):
        mt_name = {18: 'Fission', 102: 'Capture', 103: '(n,p)'}.get(int(mt), f'MT={mt}')
        print(f"    ‚îî‚îÄ {mt_name:12s}: {len(mt_group):>6,} points")

print("\n" + "="*80)
print("üí° TIP: For production training, use DataSelection for physics-aware filtering:")
print("="*80)
print("""
from nucml_next.data import DataSelection, NucmlDataset

selection = DataSelection(
    # PROJECTILE: 'neutron' | 'all'
    projectile='neutron',
    
    # ENERGY RANGE (eV)
    energy_min=1e-5,    # Thermal (0.01 eV)
    energy_max=2e7,     # Fast (20 MeV)
    
    # MT MODE: 'reactor_core' | 'threshold_only' | 'fission_details' | 'all_physical' | 'custom'
    mt_mode='reactor_core',  # Essential reactions: MT 2,4,16,18,102,103,107
    # mt_mode='threshold_only',  # Threshold reactions: MT 16,17,103-107
    # mt_mode='fission_details',  # Fission channels: MT 18,19,20,21,38
    # mt_mode='all_physical',     # All physical (< 9000)
    # mt_mode='custom',           # Use custom_mt_codes below
    
    custom_mt_codes=None,  # Example: [2, 18, 102] when mt_mode='custom'
    
    # EXCLUSIONS
    exclude_bookkeeping=True,  # Exclude MT 0, 1, >= 9000
    drop_invalid=True,         # Drop NaN/non-positive
    
    # HOLDOUT for extrapolation testing
    holdout_isotopes=None      # Example: [(92, 235), (17, 35)]
)

dataset = NucmlDataset(
    data_path='data/exfor_enriched.parquet',
    selection=selection
)

‚Üí Enables predicate pushdown (90% I/O reduction!)
‚Üí Scientifically defensible defaults
‚Üí Explicit physics rationale
""")
print("="*80)
print()

## Step 3: Inspect Real EXFOR Data

In [None]:
# Show sample of real experimental data
print("Sample EXFOR Data (first 20 rows):")
print(dataset.df.head(20))

# Check for pre-enrichment (NEW v1.2.0+)
print("\n" + "="*80)
print("PRE-ENRICHMENT STATUS")
print("="*80)

tier_columns = {
    'Tier B/C (Mass)': ['Mass_Excess_keV', 'Binding_Energy_keV', 'Binding_Per_Nucleon_keV'],
    'Tier C (Separation)': ['S_1n', 'S_2n', 'S_1p', 'S_2p'],
    'Tier D (Structure)': ['Spin', 'Parity', 'Half_Life_s', 'Isomer_Level'],
    'Tier E (Q-values)': ['Q_alpha', 'Q_2beta_minus', 'Q_n_alpha']
}

enrichment_found = False
for tier_name, cols in tier_columns.items():
    present_cols = [col for col in cols if col in dataset.df.columns]
    if present_cols:
        enrichment_found = True
        print(f"‚úì {tier_name:25s}: {len(present_cols)}/{len(cols)} columns present")
        
        # Show coverage for first column
        if present_cols:
            sample_col = present_cols[0]
            coverage = dataset.df[sample_col].notna().sum() / len(dataset.df) * 100
            print(f"    ‚îî‚îÄ {sample_col:30s}: {coverage:.1f}% coverage")
    else:
        print(f"‚úó {tier_name:25s}: Not pre-enriched")

if enrichment_found:
    print("\n‚úì Data is PRE-ENRICHED (AME2020/NUBASE2020 columns present)")
    print("  ‚Üí Feature generation = column selection (fast!)")
    print("  ‚Üí No file I/O or joins needed")
else:
    print("\n‚ö†Ô∏è  Data NOT pre-enriched")
    print("  ‚Üí Re-run ingestion with --ame2020-dir data/ to add tier columns")
    print("  ‚Üí Or feature generation will use legacy on-demand enrichment (slower)")

print("="*80)

# Check for uncertainties
has_uncertainty = dataset.df['Uncertainty'].notna().sum()
print(f"\n‚úì {has_uncertainty}/{len(dataset.df)} points have experimental uncertainty")

print("\nüí° TIP: For production use, always use pre-enriched Parquet:")
print("   python scripts/ingest_exfor.py --x4-db data.db --ame2020-dir data/ --output enriched.parquet")

## Step 4: Visualize Real Experimental Data

This shows **actual EXFOR measurements** with experimental scatter - comparing data-rich U-235 vs. data-sparse Cl-35.

In [None]:
# Create comparative visualization: U-235 (data-rich) vs Cl-35 (data-sparse)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# LEFT PLOT: U-235 Fission (data-rich)
u235_fission = dataset.df[(dataset.df['Z'] == 92) & 
                           (dataset.df['A'] == 235) & 
                           (dataset.df['MT'] == 18)]

if len(u235_fission) > 0:
    if 'Uncertainty' in u235_fission.columns:
        has_unc = u235_fission['Uncertainty'].notna()
        
        # Points with uncertainty
        ax1.errorbar(
            u235_fission[has_unc]['Energy'],
            u235_fission[has_unc]['CrossSection'],
            yerr=u235_fission[has_unc]['Uncertainty'],
            fmt='o', markersize=3, alpha=0.6,
            label=f'With uncertainty ({has_unc.sum()} pts)',
            color='blue', elinewidth=0.8
        )
        
        # Points without uncertainty
        if (~has_unc).sum() > 0:
            ax1.scatter(
                u235_fission[~has_unc]['Energy'],
                u235_fission[~has_unc]['CrossSection'],
                marker='x', s=15, alpha=0.5,
                label=f'No uncertainty ({(~has_unc).sum()} pts)',
                color='orange'
            )
    else:
        ax1.scatter(
            u235_fission['Energy'],
            u235_fission['CrossSection'],
            marker='o', s=8, alpha=0.6,
            label=f'EXFOR Data ({len(u235_fission)} pts)'
        )
    
    ax1.set_xlabel('Energy (eV)', fontsize=12, fontweight='bold')
    ax1.set_ylabel('Cross Section (barns)', fontsize=12, fontweight='bold')
    ax1.set_title('U-235 Fission: WELL-UNDERSTOOD (Data-Rich)\n' + 
                  f'{len(u235_fission):,} EXFOR measurements',
                  fontsize=13, fontweight='bold', color='darkblue')
    ax1.legend(fontsize=10)
    ax1.set_xscale('log')
    ax1.set_yscale('log')
    ax1.grid(True, alpha=0.3)
else:
    ax1.text(0.5, 0.5, 'No U-235 fission data in dataset\n(Check EXFOR ingestion)',
             ha='center', va='center', transform=ax1.transAxes, fontsize=12)
    ax1.set_title('U-235 Fission (No Data)', fontsize=13, fontweight='bold')

# RIGHT PLOT: Cl-35 (n,p) (data-sparse)
cl35_np = dataset.df[(dataset.df['Z'] == 17) & 
                      (dataset.df['A'] == 35) & 
                      (dataset.df['MT'] == 103)]

if len(cl35_np) > 0:
    if 'Uncertainty' in cl35_np.columns:
        has_unc = cl35_np['Uncertainty'].notna()
        
        # Points with uncertainty
        if has_unc.sum() > 0:
            ax2.errorbar(
                cl35_np[has_unc]['Energy'],
                cl35_np[has_unc]['CrossSection'],
                yerr=cl35_np[has_unc]['Uncertainty'],
                fmt='o', markersize=5, alpha=0.7,
                label=f'With uncertainty ({has_unc.sum()} pts)',
                color='green', elinewidth=1.2
            )
        
        # Points without uncertainty
        if (~has_unc).sum() > 0:
            ax2.scatter(
                cl35_np[~has_unc]['Energy'],
                cl35_np[~has_unc]['CrossSection'],
                marker='s', s=30, alpha=0.7,
                label=f'No uncertainty ({(~has_unc).sum()} pts)',
                color='red'
            )
    else:
        ax2.scatter(
            cl35_np['Energy'],
            cl35_np['CrossSection'],
            marker='o', s=25, alpha=0.7,
            label=f'EXFOR Data ({len(cl35_np)} pts)'
        )
    
    ax2.set_xlabel('Energy (eV)', fontsize=12, fontweight='bold')
    ax2.set_ylabel('Cross Section (barns)', fontsize=12, fontweight='bold')
    ax2.set_title('Cl-35 (n,p): RESEARCH INTEREST (Data-Sparse)\n' + 
                  f'{len(cl35_np):,} EXFOR measurements',
                  fontsize=13, fontweight='bold', color='darkgreen')
    ax2.legend(fontsize=10)
    ax2.set_xscale('log')
    ax2.set_yscale('log')
    ax2.grid(True, alpha=0.3)
else:
    ax2.text(0.5, 0.5, 'No Cl-35 (n,p) data in dataset\n(Check EXFOR ingestion or expand --max-files)',
             ha='center', va='center', transform=ax2.transAxes, fontsize=11)
    ax2.set_title('Cl-35 (n,p) (No Data)', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n" + "="*80)
print("üîç KEY OBSERVATIONS:")
print("="*80)
print(f"  LEFT (U-235 Fission):")
print(f"    ‚Ä¢ {'MANY' if len(u235_fission) > 100 else 'FEW'} measurements ‚Üí ML models have rich training data")
print(f"    ‚Ä¢ Dense energy coverage ‚Üí Good interpolation possible")
print(f"    ‚Ä¢ Well-characterized resonances ‚Üí Physics well-understood")
print()
print(f"  RIGHT (Cl-35 (n,p)):")
print(f"    ‚Ä¢ {'SPARSE' if len(cl35_np) < 100 else 'MODERATE'} measurements ‚Üí ML models face data scarcity")
print(f"    ‚Ä¢ Energy gaps ‚Üí Interpolation/extrapolation challenging")
print(f"    ‚Ä¢ Active research area ‚Üí Models can help guide new experiments!")
print("="*80)

## Step 5: Production Data Statistics

In [None]:
# Get comprehensive statistics
stats = dataset.get_statistics()

print("\n" + "="*70)
print("EXFOR PRODUCTION DATA STATISTICS")
print("="*70)
for key, value in stats.items():
    if isinstance(value, tuple):
        print(f"{key:25s}: {value[0]:.2e} - {value[1]:.2e}")
    else:
        print(f"{key:25s}: {value}")
print("="*70)

# Reaction breakdown
print("\nReaction Types in Dataset:")
reaction_counts = dataset.df.groupby('MT').size().sort_values(ascending=False)
for mt, count in reaction_counts.items():
    mt_name = {2: 'Elastic', 18: 'Fission', 102: 'Capture', 16: '(n,2n)'}.get(int(mt), f'MT={mt}')
    print(f"  {mt_name:20s}: {count:>8,} points")

## Step 6: Verify Data Quality

Production checks to ensure data is suitable for training.

In [None]:
# Quality checks
print("\nData Quality Checks:")
print("="*70)

# 1. No infinite values
has_inf = np.isinf(dataset.df['CrossSection']).sum()
print(f"‚úì Infinite cross sections: {has_inf} (should be 0)")

# 2. No NaN in critical columns
critical_cols = ['Z', 'A', 'Energy', 'CrossSection']
for col in critical_cols:
    nan_count = dataset.df[col].isna().sum()
    status = "‚úì" if nan_count == 0 else "‚ùå"
    print(f"{status} NaN in {col:20s}: {nan_count}")

# 3. Positive cross sections
negative = (dataset.df['CrossSection'] < 0).sum()
print(f"‚úì Negative cross sections: {negative} (should be 0)")

# 4. Energy range coverage
energy_decades = np.log10(dataset.df['Energy'].max() / dataset.df['Energy'].min())
print(f"‚úì Energy range: {energy_decades:.1f} decades")

# 5. Natural targets flagged
if 'Is_Natural_Target' in dataset.df.columns:
    natural_count = dataset.df['Is_Natural_Target'].sum()
    print(f"‚úì Natural targets flagged: {natural_count}")

print("="*70)

## Step 7: Ready for Production Training

This dataset is now ready to use with:
- Baseline models (XGBoost, Decision Trees)
- GNN-Transformer
- Physics-informed training
- OpenMC validation

**All with REAL experimental data!**

In [None]:
# Demonstrate production-safe usage with pre-enriched data
print("\nüéØ Production Training Example (Pre-Enriched Data):")
print("="*80)
print("# Load pre-enriched EXFOR data (already has AME2020/NUBASE2020 columns)")
print("import pandas as pd")
print("from nucml_next.data.features import FeatureGenerator")
print("")
print("# Direct Parquet load (fast!)")
print("df = pd.read_parquet('data/exfor_enriched.parquet')")
print("")
print("# Generate tier-based features (just column selection, no file I/O!)")
print("gen = FeatureGenerator()  # No enricher needed - data is pre-enriched!")
print("features = gen.generate_features(df, tiers=['A', 'C', 'D'])")
print("")
print("# Train on real data with tier features")
print("from nucml_next.baselines import XGBoostEvaluator")
print("xgb = XGBoostEvaluator()")
print("xgb.train(features)")
print("="*80)

print("\n" + "="*80)
print("PRE-ENRICHMENT BENEFITS")
print("="*80)
print("‚ö° Faster:      No AME2020 file parsing (already in Parquet)")
print("üíæ Efficient:   Parquet only loads needed columns")
print("üéØ Consistent:  All users get same enrichment")
print("üì¶ Production:  Single data source, no external dependencies")
print("="*80)

print("\n‚úÖ This dataset is production-ready!")
print(f"‚úÖ {len(dataset.df):,} real EXFOR measurements")
print(f"‚úÖ {dataset.df[['Z', 'A']].drop_duplicates().shape[0]} isotopes")
print(f"‚úÖ {dataset.df['MT'].nunique()} reaction types")

# Check if pre-enriched
has_enrichment = 'Mass_Excess_keV' in dataset.df.columns
if has_enrichment:
    print(f"‚úÖ Pre-enriched with AME2020/NUBASE2020 (all tier columns)")
    print(f"   ‚Üí Feature generation = column selection (fast!)")
else:
    print(f"‚ö†Ô∏è  Not pre-enriched - consider re-ingesting with --ame2020-dir")
    print(f"   ‚Üí Feature generation will use legacy on-demand enrichment (slower)")

print("\nContinue to baseline/GNN-Transformer training notebooks ‚Üí")

---

## üéì Key Takeaway

> **NUCML-Next v1.2.0+: Pre-Enrichment Architecture for Production ML!**
>
> **What we've learned:**
> - ‚úÖ X4Pro SQLite provides single-file, efficient database access
> - ‚úÖ **NEW:** Pre-enrichment during ingestion (load AME2020 once, not repeatedly)
> - ‚úÖ ALL tier columns in Parquet ‚Üí feature generation = column selection
> - ‚úÖ Parquet columnar format ‚Üí only loads needed columns (fast!)
> - ‚úÖ Production-ready: Single data source, consistent preprocessing
>
> **Pre-Enrichment Benefits:**
> - ‚ö° **10-100x faster** feature generation (no file I/O, no joins)
> - üéØ **Consistent** preprocessing (all users get same enrichment)
> - üíæ **Efficient** storage (Parquet only loads needed columns)
> - üì¶ **Production-ready** (single data source)
>
> **Migration from Legacy Approach:**
> ```bash
> # Old approach (DEPRECATED)
> python scripts/ingest_exfor.py --x4-db data.db --ame2020 mass_1.mas20.txt
> # ‚Üí Only mass_1 loaded during ingestion
> # ‚Üí Other files loaded on-demand (slow, redundant I/O)
> 
> # New approach (RECOMMENDED)
> python scripts/ingest_exfor.py --x4-db data.db --ame2020-dir data/
> # ‚Üí ALL 5 files loaded once during ingestion
> # ‚Üí Complete enrichment schema in Parquet
> # ‚Üí Feature generation = column selection (fast!)
> ```
>
> **Ingestion Commands:**
> ```bash
> # Download AME2020/NUBASE2020 files (one-time)
> cd data/
> wget https://www-nds.iaea.org/amdc/ame2020/mass_1.mas20.txt
> wget https://www-nds.iaea.org/amdc/ame2020/rct1.mas20.txt
> wget https://www-nds.iaea.org/amdc/ame2020/rct2_1.mas20.txt
> wget https://www-nds.iaea.org/amdc/ame2020/nubase_4.mas20.txt
> cd ..
> 
> # Ingest with full pre-enrichment
> python scripts/ingest_exfor.py \
>     --x4-db data/x4sqlite1_sample.db \
>     --output data/exfor_enriched.parquet \
>     --ame2020-dir data/
> ```
>
> **Feature Generation (Pre-Enriched):**
> ```python
> import pandas as pd
> from nucml_next.data.features import FeatureGenerator
> 
> # Load pre-enriched Parquet (already has ALL tier columns)
> df = pd.read_parquet('data/exfor_enriched.parquet')
> 
> # Generate features (just column selection, no file I/O!)
> gen = FeatureGenerator()  # No enricher needed!
> features = gen.generate_features(df, tiers=['A', 'C', 'D'])
> # ‚Üí Fast! No AME2020 file parsing, no joins
> ```
>
> **Focus isotopes demonstrate different scenarios:**
> - **U-235 (data-rich)**: Extensive measurements, well-understood physics
> - **Cl-35 (n,p) (data-sparse)**: Limited measurements, active research area
> - **Both:** Full AME2020/NUBASE2020 enrichment for tier-based feature ablation

**Next:** See how classical ML handles these different data scenarios in `00_Baselines_and_Limitations.ipynb` ‚Üí

---