# EXFOR Data Loading & Verification

## Real Experimental Nuclear Data from IAEA EXFOR

This notebook demonstrates how to load and verify **real experimental nuclear data** from the IAEA EXFOR database.

**NUCML-Next uses ONLY real data:**
- ✅ Uses REAL experimental cross-section measurements
- ✅ AME2020-enriched isotope features
- ✅ Production-grade data quality
- ✅ All models require EXFOR data path

---

## Prerequisites

Before running this notebook, you must:

1. **Download EXFOR-X5json bulk database:**
   ```bash
   # Visit: https://www-nds.iaea.org/exfor/
   # Download: EXFOR-X5json bulk ZIP (~500 MB)
   # Unzip to: ~/data/EXFOR-X5json/
   ```

2. **Run the EXFOR ingestor:**
   ```bash
   python scripts/ingest_exfor.py \
       --exfor-root ~/data/EXFOR-X5json/ \
       --output data/exfor_processed.parquet \
       --max-files 1000  # Start with subset for testing
   ```

3. **(Optional) Download AME2020:**
   ```bash
   wget https://www-nds.iaea.org/amdc/ame2020/mass_1.mas20.txt -O data/ame2020.txt
   ```

---

In [None]:
import sys
sys.path.append('..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from nucml_next.data import NucmlDataset

print("✓ Imports successful")
print("✓ NUCML-Next: Real EXFOR Data Only")

## Step 1: Verify EXFOR Data Exists

In [None]:
# Check if EXFOR data has been processed
exfor_path = Path('../data/exfor_processed.parquet')

if not exfor_path.exists():
    print("❌ EXFOR data not found!")
    print("\nPlease run the ingestor first:")
    print("  python scripts/ingest_exfor.py --exfor-root ~/data/EXFOR-X5json/ --output data/exfor_processed.parquet")
    raise FileNotFoundError(f"EXFOR data not found at {exfor_path}")
else:
    print(f"✓ Found EXFOR data at {exfor_path}")
    # Check size
    if exfor_path.is_dir():
        print(f"  Type: Partitioned dataset (directory)")
    else:
        size_mb = exfor_path.stat().st_size / (1024**2)
        print(f"  Size: {size_mb:.1f} MB")

## Step 2: Load Real EXFOR Data

**Note:** `data_path` is REQUIRED. If it is omitted or the path does not exist, `NucmlDataset` raises an error immediately, so nothing downstream runs on missing data.
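
The fail-fast behavior described in this note can be pictured with a minimal sketch. `require_data_path` is a hypothetical helper written for illustration; it is not part of `nucml_next`, and the actual checks and exception types inside `NucmlDataset` may differ.

```python
from pathlib import Path


def require_data_path(data_path):
    """Hypothetical helper: fail fast when the dataset path is missing.

    Illustrative only; not the NucmlDataset implementation.
    """
    if data_path is None:
        raise ValueError("data_path is required")
    path = Path(data_path)
    if not path.exists():
        raise FileNotFoundError(f"EXFOR data not found at {path}")
    return path


# Usage: a path that does not exist raises before any loading is attempted.
try:
    require_data_path("no_such_dir/exfor_processed.parquet")
except FileNotFoundError as err:
    print(f"Caught: {err}")
```

Validating the path up front keeps the failure close to its cause instead of surfacing later as an empty or partially loaded dataset.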

In [None]:
# Load EXFOR data (data_path is required)
dataset = NucmlDataset(
    data_path='../data/exfor_processed.parquet',
    mode='tabular',
    filters={'Z': [92], 'MT': [18, 102]},  # all uranium isotopes (Z=92): fission (MT=18) & capture (MT=102)
)

print(f"\n✓ Loaded {len(dataset.df)} REAL experimental data points")
print(f"  Isotopes: {dataset.df[['Z', 'A']].drop_duplicates().shape[0]}")
print(f"  Reactions: {dataset.df['MT'].nunique()}")
print(f"  Energy range: {dataset.df['Energy'].min():.2e} - {dataset.df['Energy'].max():.2e} eV")

## Step 3: Inspect Real EXFOR Data

In [None]:
# Show sample of real experimental data
print("Sample EXFOR Data:")
print(dataset.df.head(20))

# Check for AME2020 enrichment
if 'Mass_Excess_keV' in dataset.df.columns:
    print("\n✓ Data is AME2020-enriched (has mass excess & binding energy)")
else:
    print("\n⚠️  Data not AME2020-enriched (using SEMF approximation)")

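# Check for uncertainties (guard against a missing column, as Step 4 does)
if 'Uncertainty' in dataset.df.columns:
    has_uncertainty = dataset.df['Uncertainty'].notna().sum()
    print(f"\n✓ {has_uncertainty}/{len(dataset.df)} points have experimental uncertainty")
else:
    print("\n⚠️  No 'Uncertainty' column in this dataset")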

## Step 4: Visualize Real Experimental Data

This shows **actual EXFOR measurements** with experimental scatter.
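
To put a number on the scatter rather than only eyeballing it, one option is to group points by energy decade and compare the spread to the mean. The sketch below assumes the `Energy` and `CrossSection` column names used in this notebook; `scatter_by_energy_decade` is a hypothetical helper, and the small frame holds arbitrary placeholder values (NOT EXFOR data) purely to exercise the function.

```python
import numpy as np
import pandas as pd


def scatter_by_energy_decade(df, energy_col="Energy", xs_col="CrossSection"):
    """Relative spread (std / mean) of the cross section per energy decade."""
    # log10 is only defined for positive energies
    positive = df[df[energy_col] > 0]
    decade = np.floor(np.log10(positive[energy_col])).astype(int)
    summary = positive.groupby(decade)[xs_col].agg(["count", "mean", "std"])
    summary["rel_spread"] = summary["std"] / summary["mean"]
    summary.index.name = "log10_energy_decade"
    return summary


# Placeholder values for illustration only (not measurements).
toy = pd.DataFrame({
    "Energy": [1.0, 2.0, 5.0, 20.0, 50.0],
    "CrossSection": [10.0, 12.0, 14.0, 3.0, 5.0],
})
print(scatter_by_energy_decade(toy))
```

On the real dataset the same call could be applied to the `u235_fission` frame built in the next cell. Note that a decade-wide bin mixes genuine energy dependence with experimental scatter, so narrower bins may be more informative in the resonance region.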

In [None]:
# Plot U-235 fission cross section (real EXFOR data)
u235_fission = dataset.df[(dataset.df['Z'] == 92) & 
                          (dataset.df['A'] == 235) & 
                          (dataset.df['MT'] == 18)]

fig, ax = plt.subplots(figsize=(14, 7))

# Plot with uncertainty if available
if 'Uncertainty' in u235_fission.columns:
    has_unc = u235_fission['Uncertainty'].notna()
    
    # Points with uncertainty
    ax.errorbar(
        u235_fission[has_unc]['Energy'],
        u235_fission[has_unc]['CrossSection'],
        yerr=u235_fission[has_unc]['Uncertainty'],
        fmt='o', markersize=4, alpha=0.6,
        label='EXFOR Data (with uncertainty)',
        color='blue'
    )
    
    # Points without uncertainty
    ax.scatter(
        u235_fission[~has_unc]['Energy'],
        u235_fission[~has_unc]['CrossSection'],
        marker='x', s=20, alpha=0.5,
        label='EXFOR Data (no uncertainty)',
        color='orange'
    )
else:
    ax.scatter(
        u235_fission['Energy'],
        u235_fission['CrossSection'],
        marker='o', s=10, alpha=0.6,
        label='EXFOR Experimental Data'
    )

ax.set_xlabel('Energy (eV)', fontsize=13, fontweight='bold')
ax.set_ylabel('Cross Section (barns)', fontsize=13, fontweight='bold')
ax.set_title('U-235 Fission: REAL EXFOR Experimental Data\n(Showing experimental scatter and uncertainties)',
             fontsize=15, fontweight='bold')
ax.legend(fontsize=11)
ax.set_xscale('log')
ax.set_yscale('log')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n✓ Plotted {len(u235_fission)} real EXFOR measurements")
print("  Notice the experimental scatter - this is REAL data!")

## Step 5: Production Data Statistics

In [None]:
# Get comprehensive statistics
stats = dataset.get_statistics()

print("\n" + "="*70)
print("EXFOR PRODUCTION DATA STATISTICS")
print("="*70)
for key, value in stats.items():
    if isinstance(value, tuple):
        print(f"{key:25s}: {value[0]:.2e} - {value[1]:.2e}")
    else:
        print(f"{key:25s}: {value}")
print("="*70)

# Reaction breakdown
print("\nReaction Types in Dataset:")
mt_names = {2: 'Elastic', 16: '(n,2n)', 18: 'Fission', 102: 'Capture'}
reaction_counts = dataset.df.groupby('MT').size().sort_values(ascending=False)
for mt, count in reaction_counts.items():
    mt_name = mt_names.get(int(mt), f'MT={mt}')
    print(f"  {mt_name:20s}: {count:>8,} points")

## Step 6: Verify Data Quality

Production checks to ensure data is suitable for training.
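
If the same checks need to run in more than one notebook, they could be collected into a function that returns counts instead of printing them. The sketch below is one possible shape under that assumption: `quality_report` is a hypothetical helper (not part of `nucml_next`), it relies on the column names used in this notebook, and the frame it runs on contains deliberately broken placeholder values, not EXFOR data.

```python
import numpy as np
import pandas as pd


def quality_report(df, critical_cols=("Z", "A", "Energy", "CrossSection")):
    """Count problem rows per check; every count should be 0 for clean data."""
    report = {f"nan_{col}": int(df[col].isna().sum()) for col in critical_cols}
    report["infinite_xs"] = int(np.isinf(df["CrossSection"]).sum())
    report["negative_xs"] = int((df["CrossSection"] < 0).sum())
    report["nonpositive_energy"] = int((df["Energy"] <= 0).sum())
    return report


# Placeholder frame with deliberate defects (NOT EXFOR data).
toy = pd.DataFrame({
    "Z": [92, 92, 92],
    "A": [235, 235, 238],
    "Energy": [1.0, 10.0, np.nan],
    "CrossSection": [5.0, -1.0, np.inf],
})
report = quality_report(toy)
failed = {name: count for name, count in report.items() if count > 0}
print(failed)
```

Returning a dictionary makes it straightforward to assert on the result, for example failing a training run when `failed` is non-empty.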

In [None]:
# Quality checks
print("\nData Quality Checks:")
print("="*70)

def mark(ok):
    """Return a pass/fail marker so a failed check is never shown as passed."""
    return "✓" if ok else "❌"

# 1. No infinite values
has_inf = np.isinf(dataset.df['CrossSection']).sum()
print(f"{mark(has_inf == 0)} Infinite cross sections: {has_inf} (should be 0)")

# 2. No NaN in critical columns
critical_cols = ['Z', 'A', 'Energy', 'CrossSection']
for col in critical_cols:
    nan_count = dataset.df[col].isna().sum()
    print(f"{mark(nan_count == 0)} NaN in {col:20s}: {nan_count}")

# 3. Non-negative cross sections
negative = (dataset.df['CrossSection'] < 0).sum()
print(f"{mark(negative == 0)} Negative cross sections: {negative} (should be 0)")

# 4. Energy range coverage (positive energies only, so the log ratio is defined)
positive_energy = dataset.df.loc[dataset.df['Energy'] > 0, 'Energy']
energy_decades = np.log10(positive_energy.max() / positive_energy.min())
print(f"✓ Energy range: {energy_decades:.1f} decades")

# 5. Natural targets flagged
if 'Is_Natural_Target' in dataset.df.columns:
    natural_count = dataset.df['Is_Natural_Target'].sum()
    print(f"✓ Natural targets flagged: {natural_count}")

print("="*70)

## Step 7: Ready for Production Training

This dataset is now ready to use with:
- Baseline models (XGBoost, Decision Trees)
- GNN-Transformer
- Physics-informed training
- OpenMC validation

**All with REAL experimental data!**

In [None]:
# Demonstrate production-safe usage
print("\n🎯 Production Training Example:")
print("="*70)
print("from nucml_next.data import NucmlDataset")
print("from nucml_next.baselines import XGBoostEvaluator")
print("")
print("# Load REAL EXFOR data (data_path is required)")
print("dataset = NucmlDataset(")
print("    data_path='data/exfor_processed.parquet',")
print("    mode='tabular'")
print(")")
print("")
print("# Train on real data")
print("df = dataset.to_tabular(mode='physics')")
print("xgb = XGBoostEvaluator()")
print("xgb.train(df)")
print("="*70)

print("\n✅ This dataset is production-ready!")
print(f"✅ {len(dataset.df):,} real EXFOR measurements")
print(f"✅ {dataset.df[['Z', 'A']].drop_duplicates().shape[0]} isotopes")
print(f"✅ {dataset.df['MT'].nunique()} reaction types")
print("\nContinue to baseline/GNN-Transformer training notebooks →")

---

## 🎓 Key Takeaway

> **NUCML-Next requires real EXFOR experimental data for all operations!**
>
> Always use:
> - `data_path='data/exfor_processed.parquet'` ✓ (REQUIRED parameter)
> - Verify EXFOR data exists before running notebooks ✓
> - Run `scripts/ingest_exfor.py` to prepare data ✓
>
> The framework will immediately raise an error if:
> - `data_path` is not provided
> - The specified path doesn't exist
> - The data file is invalid
>
> This ensures you never accidentally train on incorrect or missing data.

---