# EXFOR Data Loading & Verification

## Real Experimental Nuclear Data from IAEA EXFOR

This notebook demonstrates how to load and verify **real experimental nuclear data** from the IAEA EXFOR database.

**NUCML-Next uses ONLY real data:**
- ✅ Uses REAL experimental cross-section measurements
- ✅ AME2020-enriched isotope features
- ✅ Production-grade data quality
- ✅ All models require EXFOR data path

## Focus Isotopes for This Tutorial

We'll focus on two isotopes that represent different data availability scenarios:

### 1. **U-235 (Uranium-235) - Well-Understood**
- **Why:** Critical for nuclear reactors (LWR fuel), nuclear weapons
- **Data Quality:** Extensive experimental measurements since 1940s
- **Key Reactions:** (n,f) fission at MT=18, (n,γ) capture at MT=102
- **Status:** High-priority evaluation, well-characterized resonances
- **Expected EXFOR Data:** 1000+ measurements across energy range

### 2. **Cl-35 (Chlorine-35) (n,p) - Research Interest**
- **Why:** Important for nuclear astrophysics (nucleosynthesis), medical isotope production
- **Data Quality:** Limited experimental data, sparse measurements
- **Key Reaction:** (n,p) at MT=103 → Produces S-35 (medical tracer)
- **Status:** Active research interest, data gaps in fast neutron region
- **Expected EXFOR Data:** 10-100 measurements (much sparser!)

**Educational Value:** These two isotopes demonstrate how ML models perform on data-rich vs. data-sparse scenarios!

---

## Prerequisites

Before running this notebook, you must:

1. **Download EXFOR-X5json bulk database:**
   ```bash
   # Visit: https://www-nds.iaea.org/exfor/
   # Download: EXFOR-X5json bulk ZIP (~500 MB)
   # Unzip to: ~/data/EXFOR-X5json/
   ```

2. **Run the EXFOR ingestor:**
   ```bash
   python scripts/ingest_exfor.py \
       --exfor-root ~/data/EXFOR-X5json/ \
       --output data/exfor_processed.parquet \
       --max-files 1000  # Start with subset for testing
   ```

3. **(Optional) Download AME2020:**
   ```bash
   wget https://www-nds.iaea.org/amdc/ame2020/mass_1.mas20.txt -O data/ame2020.txt
   ```
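
If the optional AME2020 download is skipped, the notebook later reports that it falls back to a SEMF (semi-empirical mass formula) approximation. The sketch below shows what such an approximation typically looks like. The function name and the coefficient values are assumptions (one commonly quoted textbook fit, in MeV); this notebook does not show NUCML-Next's own implementation, which may differ.

```python
def semf_binding_energy_mev(Z, A):
    """Approximate nuclear binding energy in MeV from the semi-empirical mass formula.

    Hypothetical helper for illustration; coefficients are an assumed textbook fit.
    """
    a_v, a_s, a_c, a_a, a_p = 15.75, 17.8, 0.711, 23.7, 11.18  # MeV (assumed values)

    # Pairing term: zero for odd A, positive for even-even, negative for odd-odd
    if A % 2 == 1:
        pairing = 0.0
    elif Z % 2 == 0:
        pairing = a_p / A ** 0.5
    else:
        pairing = -a_p / A ** 0.5

    return (
        a_v * A                          # volume term
        - a_s * A ** (2 / 3)             # surface term
        - a_c * Z ** 2 / A ** (1 / 3)    # Coulomb repulsion
        - a_a * (A - 2 * Z) ** 2 / A     # neutron-proton asymmetry
        + pairing
    )


# The two focus isotopes of this tutorial
for name, Z, A in [("U-235", 92, 235), ("Cl-35", 17, 35)]:
    print(f"{name}: B ~ {semf_binding_energy_mev(Z, A):.1f} MeV (SEMF estimate)")
```

A fit like this is only an estimate, which is why the measured AME2020 values are preferable when the file is available.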

---

In [None]:
import sys
sys.path.append('..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from nucml_next.data import NucmlDataset

print("✓ Imports successful")
print("✓ NUCML-Next: Real EXFOR Data Only")

## Step 1: Verify EXFOR Data Exists

In [None]:
# Check if EXFOR data has been processed
exfor_path = Path('../data/exfor_processed.parquet')

if not exfor_path.exists():
    print("❌ EXFOR data not found!")
    print("\nPlease run the ingestor first:")
    print("  python scripts/ingest_exfor.py --exfor-root ~/data/EXFOR-X5json/ --output data/exfor_processed.parquet")
    raise FileNotFoundError(f"EXFOR data not found at {exfor_path}")
else:
    print(f"✓ Found EXFOR data at {exfor_path}")
    # Check size
    if exfor_path.is_dir():
        print("  Type: Partitioned dataset (directory)")
    else:
        size_mb = exfor_path.stat().st_size / (1024**2)
        print(f"  Size: {size_mb:.1f} MB")

## Step 2: Load Real EXFOR Data

**Note:** `data_path` is REQUIRED. If you don't provide it or if the file doesn't exist, NucmlDataset will raise an error immediately. This prevents accidental misuse.
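
The fail-fast behavior described above can be pictured with a small sketch. `require_data_path` is a hypothetical helper written for illustration, not a function from NUCML-Next; the exception types the library actually raises are not shown in this notebook and are assumed here.

```python
from pathlib import Path


def require_data_path(data_path):
    """Return data_path as a Path, or raise immediately if it is unusable.

    Hypothetical helper; it only mirrors the behavior described in the text.
    """
    if data_path is None:
        raise ValueError("data_path is required: point it at the processed EXFOR data")

    path = Path(data_path)
    if not path.exists():
        raise FileNotFoundError(f"EXFOR data not found at {path}")
    return path


# Usage: the current directory always exists, so this call succeeds
print(require_data_path("."))
```

Raising before any loading work starts means a missing or mistyped path surfaces at construction time rather than partway through training.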

In [None]:
# Load EXFOR data focusing on our two isotopes
# U-235: Z=92, A=235, MT=18 (fission), MT=102 (capture)
# Cl-35: Z=17, A=35, MT=103 (n,p)

dataset = NucmlDataset(
    data_path='../data/exfor_processed.parquet',
    mode='tabular',
    filters={
        'Z': [92, 17],     # Uranium and Chlorine
        'A': [235, 35],    # U-235 and Cl-35
        'MT': [18, 102, 103]  # Fission, capture, (n,p)
    },
)

print(f"\n✓ Loaded {len(dataset.df)} REAL experimental data points")
print(f"  Isotopes: {dataset.df[['Z', 'A']].drop_duplicates().shape[0]}")
print(f"  Reactions: {dataset.df['MT'].nunique()}")
print(f"  Energy range: {dataset.df['Energy'].min():.2e} - {dataset.df['Energy'].max():.2e} eV")

# Show breakdown by isotope
print("\n📊 Data Distribution by Isotope:")
for (z, a), group in dataset.df.groupby(['Z', 'A']):
    isotope_name = f"{'U' if z==92 else 'Cl'}-{a}"
    print(f"  {isotope_name:8s}: {len(group):>6,} measurements")
    for mt, mt_group in group.groupby('MT'):
        mt_name = {18: 'Fission', 102: 'Capture', 103: '(n,p)'}.get(int(mt), f'MT={mt}')
        print(f"    └─ {mt_name:12s}: {len(mt_group):>6,} points")

## Step 3: Inspect Real EXFOR Data

In [None]:
# Show sample of real experimental data
print("Sample EXFOR Data:")
print(dataset.df.head(20))

# Check for AME2020 enrichment
if 'Mass_Excess_keV' in dataset.df.columns:
    print("\n✓ Data is AME2020-enriched (has mass excess & binding energy)")
else:
    print("\n⚠️  Data not AME2020-enriched (using SEMF approximation)")

# Check for uncertainties
has_uncertainty = dataset.df['Uncertainty'].notna().sum()
print(f"\n✓ {has_uncertainty}/{len(dataset.df)} points have experimental uncertainty")

## Step 4: Visualize Real Experimental Data

This shows **actual EXFOR measurements** with experimental scatter, comparing data-rich U-235 with data-sparse Cl-35.
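
Alongside the plots, one simple way to put a number on "data-rich" versus "data-sparse" is to count measurements per energy decade. The sketch below is a minimal illustration; `points_per_decade` is a hypothetical helper, and the energies passed to it are arbitrary placeholder values, not EXFOR data.

```python
import math
from collections import Counter


def points_per_decade(energies_ev):
    """Count points in each energy decade, keyed by floor(log10(E / eV)).

    Non-positive energies are skipped because their logarithm is undefined.
    """
    counts = Counter()
    for energy in energies_ev:
        if energy > 0:
            counts[math.floor(math.log10(energy))] += 1
    return dict(sorted(counts.items()))


# Placeholder energies in eV (illustrative only, NOT real measurements)
example_energies = [0.0253, 1.0, 5.0, 2.0e6, 1.4e7]
print(points_per_decade(example_energies))
```

With the real dataset, the same function could be applied to something like `u235_fission['Energy']` and `cl35_np['Energy']` once those selections exist; empty decades then mark the energy gaps.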

In [None]:
# Create comparative visualization: U-235 (data-rich) vs Cl-35 (data-sparse)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# LEFT PLOT: U-235 Fission (data-rich)
u235_fission = dataset.df[(dataset.df['Z'] == 92) & 
                           (dataset.df['A'] == 235) & 
                           (dataset.df['MT'] == 18)]

if len(u235_fission) > 0:
    if 'Uncertainty' in u235_fission.columns:
        has_unc = u235_fission['Uncertainty'].notna()
        
        # Points with uncertainty
        ax1.errorbar(
            u235_fission[has_unc]['Energy'],
            u235_fission[has_unc]['CrossSection'],
            yerr=u235_fission[has_unc]['Uncertainty'],
            fmt='o', markersize=3, alpha=0.6,
            label=f'With uncertainty ({has_unc.sum()} pts)',
            color='blue', elinewidth=0.8
        )
        
        # Points without uncertainty
        if (~has_unc).sum() > 0:
            ax1.scatter(
                u235_fission[~has_unc]['Energy'],
                u235_fission[~has_unc]['CrossSection'],
                marker='x', s=15, alpha=0.5,
                label=f'No uncertainty ({(~has_unc).sum()} pts)',
                color='orange'
            )
    else:
        ax1.scatter(
            u235_fission['Energy'],
            u235_fission['CrossSection'],
            marker='o', s=8, alpha=0.6,
            label=f'EXFOR Data ({len(u235_fission)} pts)'
        )
    
    ax1.set_xlabel('Energy (eV)', fontsize=12, fontweight='bold')
    ax1.set_ylabel('Cross Section (barns)', fontsize=12, fontweight='bold')
    ax1.set_title('U-235 Fission: WELL-UNDERSTOOD (Data-Rich)\n' + 
                  f'{len(u235_fission):,} EXFOR measurements',
                  fontsize=13, fontweight='bold', color='darkblue')
    ax1.legend(fontsize=10)
    ax1.set_xscale('log')
    ax1.set_yscale('log')
    ax1.grid(True, alpha=0.3)
else:
    ax1.text(0.5, 0.5, 'No U-235 fission data in dataset\n(Check EXFOR ingestion)',
             ha='center', va='center', transform=ax1.transAxes, fontsize=12)
    ax1.set_title('U-235 Fission (No Data)', fontsize=13, fontweight='bold')

# RIGHT PLOT: Cl-35 (n,p) (data-sparse)
cl35_np = dataset.df[(dataset.df['Z'] == 17) & 
                      (dataset.df['A'] == 35) & 
                      (dataset.df['MT'] == 103)]

if len(cl35_np) > 0:
    if 'Uncertainty' in cl35_np.columns:
        has_unc = cl35_np['Uncertainty'].notna()
        
        # Points with uncertainty
        if has_unc.sum() > 0:
            ax2.errorbar(
                cl35_np[has_unc]['Energy'],
                cl35_np[has_unc]['CrossSection'],
                yerr=cl35_np[has_unc]['Uncertainty'],
                fmt='o', markersize=5, alpha=0.7,
                label=f'With uncertainty ({has_unc.sum()} pts)',
                color='green', elinewidth=1.2
            )
        
        # Points without uncertainty
        if (~has_unc).sum() > 0:
            ax2.scatter(
                cl35_np[~has_unc]['Energy'],
                cl35_np[~has_unc]['CrossSection'],
                marker='s', s=30, alpha=0.7,
                label=f'No uncertainty ({(~has_unc).sum()} pts)',
                color='red'
            )
    else:
        ax2.scatter(
            cl35_np['Energy'],
            cl35_np['CrossSection'],
            marker='o', s=25, alpha=0.7,
            label=f'EXFOR Data ({len(cl35_np)} pts)'
        )
    
    ax2.set_xlabel('Energy (eV)', fontsize=12, fontweight='bold')
    ax2.set_ylabel('Cross Section (barns)', fontsize=12, fontweight='bold')
    ax2.set_title('Cl-35 (n,p): RESEARCH INTEREST (Data-Sparse)\n' + 
                  f'{len(cl35_np):,} EXFOR measurements',
                  fontsize=13, fontweight='bold', color='darkgreen')
    ax2.legend(fontsize=10)
    ax2.set_xscale('log')
    ax2.set_yscale('log')
    ax2.grid(True, alpha=0.3)
else:
    ax2.text(0.5, 0.5, 'No Cl-35 (n,p) data in dataset\n(Check EXFOR ingestion or expand --max-files)',
             ha='center', va='center', transform=ax2.transAxes, fontsize=11)
    ax2.set_title('Cl-35 (n,p) (No Data)', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n" + "="*80)
print("🔍 KEY OBSERVATIONS:")
print("="*80)
print("  LEFT (U-235 Fission):")
print(f"    • {'MANY' if len(u235_fission) > 100 else 'FEW'} measurements → ML models have rich training data")
print("    • Dense energy coverage → Good interpolation possible")
print("    • Well-characterized resonances → Physics well-understood")
print()
print("  RIGHT (Cl-35 (n,p)):")
print(f"    • {'SPARSE' if len(cl35_np) < 100 else 'MODERATE'} measurements → ML models face data scarcity")
print("    • Energy gaps → Interpolation/extrapolation challenging")
print("    • Active research area → Models can help guide new experiments!")
print("="*80)

## Step 5: Production Data Statistics

In [None]:
# Get comprehensive statistics
stats = dataset.get_statistics()

print("\n" + "="*70)
print("EXFOR PRODUCTION DATA STATISTICS")
print("="*70)
for key, value in stats.items():
    if isinstance(value, tuple):
        print(f"{key:25s}: {value[0]:.2e} - {value[1]:.2e}")
    else:
        print(f"{key:25s}: {value}")
print("="*70)

# Reaction breakdown
print("\nReaction Types in Dataset:")
reaction_counts = dataset.df.groupby('MT').size().sort_values(ascending=False)
for mt, count in reaction_counts.items():
    mt_name = {2: 'Elastic', 16: '(n,2n)', 18: 'Fission', 102: 'Capture', 103: '(n,p)'}.get(int(mt), f'MT={mt}')
    print(f"  {mt_name:20s}: {count:>8,} points")

## Step 6: Verify Data Quality

Production checks to ensure data is suitable for training.
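
As a rough illustration of what such checks amount to, the sketch below gathers violation counts into a single pass/fail report. It works on plain Python records rather than the DataFrame so that it runs on its own; `quality_report` and the sample records are hypothetical, and the record values are arbitrary placeholders, not EXFOR data.

```python
import math

CRITICAL_COLUMNS = ("Z", "A", "Energy", "CrossSection")


def _is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def quality_report(records):
    """Return (counts, passed) where counts maps each check to its violation count."""
    counts = {
        f"missing_{col}": sum(_is_missing(r.get(col)) for r in records)
        for col in CRITICAL_COLUMNS
    }
    # Only inspect cross sections that are actually present
    xs = [r["CrossSection"] for r in records if not _is_missing(r.get("CrossSection"))]
    counts["infinite_cross_sections"] = sum(math.isinf(v) for v in xs)
    counts["negative_cross_sections"] = sum(v < 0 for v in xs)
    passed = all(n == 0 for n in counts.values())
    return counts, passed


# Placeholder records (illustrative only, NOT real measurements)
records = [
    {"Z": 92, "A": 235, "Energy": 1.0, "CrossSection": 1.0},
    {"Z": 17, "A": 35, "Energy": float("nan"), "CrossSection": -1.0},
]
counts, passed = quality_report(records)
print(counts, "PASS" if passed else "FAIL")
```

Returning the counts together with one boolean makes it straightforward to stop a training run when any check fails, rather than relying on someone reading the printed output.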

In [None]:
# Quality checks
print("\nData Quality Checks:")
print("="*70)

# 1. No infinite values
has_inf = np.isinf(dataset.df['CrossSection']).sum()
status = "✓" if has_inf == 0 else "❌"
print(f"{status} Infinite cross sections: {has_inf} (should be 0)")

# 2. No NaN in critical columns
critical_cols = ['Z', 'A', 'Energy', 'CrossSection']
for col in critical_cols:
    nan_count = dataset.df[col].isna().sum()
    status = "✓" if nan_count == 0 else "❌"
    print(f"{status} NaN in {col:20s}: {nan_count}")

# 3. No negative cross sections
negative = (dataset.df['CrossSection'] < 0).sum()
status = "✓" if negative == 0 else "❌"
print(f"{status} Negative cross sections: {negative} (should be 0)")

# 4. Energy range coverage (positive energies only, so the log is defined)
positive_energy = dataset.df.loc[dataset.df['Energy'] > 0, 'Energy']
energy_decades = np.log10(positive_energy.max() / positive_energy.min())
print(f"✓ Energy range: {energy_decades:.1f} decades")

# 5. Natural targets flagged
if 'Is_Natural_Target' in dataset.df.columns:
    natural_count = dataset.df['Is_Natural_Target'].sum()
    print(f"✓ Natural targets flagged: {natural_count}")

print("="*70)

## Step 7: Ready for Production Training

This dataset is now ready to use with:
- Baseline models (XGBoost, Decision Trees)
- GNN-Transformer
- Physics-informed training
- OpenMC validation

**All with REAL experimental data!**

In [None]:
# Demonstrate production-safe usage
print("\n🎯 Production Training Example:")
print("="*70)
print("from nucml_next.data import NucmlDataset")
print("from nucml_next.baselines import XGBoostEvaluator")
print("")
print("# Load REAL EXFOR data (data_path is required)")
print("dataset = NucmlDataset(")
print("    data_path='data/exfor_processed.parquet',")
print("    mode='tabular'")
print(")")
print("")
print("# Train on real data")
print("df = dataset.to_tabular(mode='physics')")
print("xgb = XGBoostEvaluator()")
print("xgb.train(df)")
print("="*70)

print("\n✅ This dataset is production-ready!")
print(f"✅ {len(dataset.df):,} real EXFOR measurements")
print(f"✅ {dataset.df[['Z', 'A']].drop_duplicates().shape[0]} isotopes")
print(f"✅ {dataset.df['MT'].nunique()} reaction types")
print("\nContinue to baseline/GNN-Transformer training notebooks →")

---

## 🎓 Key Takeaway

> **NUCML-Next requires real EXFOR experimental data for all operations!**
>
> We've loaded two contrasting cases:
> - **U-235 (data-rich)**: Extensive measurements, well-understood physics
> - **Cl-35 (n,p) (data-sparse)**: Limited measurements, active research area
>
> This demonstrates:
> - How ML models perform on well-characterized vs. under-studied reactions
> - The importance of uncertainty quantification in sparse data regimes
> - How physics-informed models can guide future experimental campaigns
>
> Always use:
> - `data_path='data/exfor_processed.parquet'` ✓ (REQUIRED parameter)
> - Verify EXFOR data exists before running notebooks ✓
> - Run `scripts/ingest_exfor.py` to prepare data ✓
>
> The framework will immediately raise an error if:
> - `data_path` is not provided
> - The specified path doesn't exist
> - The data file is invalid
>
> This ensures you never accidentally train on incorrect or missing data.

**Next:** See how classical ML handles these different data scenarios in `00_Baselines_and_Limitations.ipynb` →

---