# 📊 02 - Data Pipeline Validation

This notebook validates the complete data pipeline for the **ModernBERT-RGAT** project:

1. **Config loading**: verify `configs/config.yaml`
2. **Data loading**: load all 3 processed CSV datasets
3. **Stratified splitting**: 80/10/10 at sentence level
4. **Leakage checks**: ensure no sentence overlap across splits
5. **Distribution analysis**: verify stratification preserves polarity ratios
6. **Class weights**: compute inverse-frequency weights for imbalanced labels
7. **Save verified splits**: cache for downstream training

---

## 1️⃣ Setup & Imports

In [None]:
import sys
import os

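# Ensure project root is on path. The notebook is assumed to live one level
# below the root; the configs/ check keeps this cell safe to re-run after chdir.
cwd = os.getcwd()
if os.path.isfile(os.path.join(cwd, 'configs', 'config.yaml')):
    PROJECT_ROOT = cwd
else:
    PROJECT_ROOT = os.path.abspath(os.path.join(cwd, '..'))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)
os.chdir(PROJECT_ROOT)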

print(f"Project root: {PROJECT_ROOT}")

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.data_pipeline import (
    load_config,
    load_dataset,
    load_all_datasets,
    stratified_sentence_split,
    validate_split,
    print_split_summary,
    compute_class_weights,
    build_splits,
    build_all_splits,
)

# Style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('deep')
pd.set_option('display.max_columns', None)

print("✅ All imports successful")

## 2️⃣ Load Configuration
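Before the split ratios are used, it can help to confirm that they are positive and sum to 1. The sketch below is a hypothetical helper (`check_split_ratios` is not part of `src.data_pipeline`). The example dictionary only mirrors the keys this notebook reads from `config['data']['split']`; its values are placeholders, and the ratios are assumed to be stored as fractions rather than percentages.

```python
import math


def check_split_ratios(split_cfg, tol=1e-9):
    """Return the (train, val, test) ratios, raising ValueError if they are invalid."""
    ratios = tuple(float(split_cfg[key]) for key in ("train", "val", "test"))
    if any(r <= 0 for r in ratios):
        raise ValueError(f"split ratios must be positive, got {ratios}")
    if not math.isclose(sum(ratios), 1.0, abs_tol=tol):
        raise ValueError(f"split ratios must sum to 1, got {sum(ratios):.6f}")
    return ratios


# Placeholder values (assumption): fractions for an 80/10/10 split and an arbitrary seed.
example_split = {"train": 0.8, "val": 0.1, "test": 0.1, "seed": 42}
print(check_split_ratios(example_split))
```

With the real configuration, the call would be `check_split_ratios(config['data']['split'])` once `config` has been loaded.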

In [None]:
config = load_config("configs/config.yaml")

print("📋 Configuration Summary:")
print(f"  Model backbone:    {config['model']['backbone']}")
print(f"  Max sequence len:  {config['model']['max_len']}")
print(f"  Batch size:        {config['training']['batch_size']}")
print(f"  Learning rate:     {config['training']['learning_rate']}")
print(f"  Epochs:            {config['training']['epochs']}")
print(f"  Split ratios:      {config['data']['split']['train']}/{config['data']['split']['val']}/{config['data']['split']['test']}")
print(f"  Random seed:       {config['data']['split']['seed']}")
print(f"  Polarity labels:   {config['labels']['polarity']}")

## 3️⃣ Load Processed Datasets

In [None]:
print("📂 Loading all processed datasets...\n")
datasets = load_all_datasets(config)

for year, df in datasets.items():
    print(f"\n--- {year} ---")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Shape:   {df.shape}")
    print(f"  Polarity values: {df['polarity'].unique().tolist()}")
    implicit = (df['aspect'] == config['data']['implicit_aspect_token']).sum()
    print(f"  Implicit aspects: {implicit}")
    display(df.head(3))

## 4️⃣ Build Stratified Splits (80 / 10 / 10)
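A sentence can contribute several aspect rows, so splitting row by row could place the same sentence in both train and test. Splitting on `sentence_id` avoids that. The sketch below illustrates one way to do it using only the standard library; it is not the project's `stratified_sentence_split`. The function name, the dict-based rows, and the choice of each sentence's most frequent polarity as its stratification label are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def sentence_level_split(rows, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split rows into train/val/test so that no sentence_id spans two splits."""
    # One stratification label per sentence: its most frequent polarity (an assumption).
    per_sentence = defaultdict(Counter)
    for row in rows:
        per_sentence[row["sentence_id"]][row["polarity"]] += 1
    by_label = defaultdict(list)
    for sid, counts in per_sentence.items():
        by_label[counts.most_common(1)[0][0]].append(sid)

    # Shuffle sentence ids within each label group, then slice by the ratios.
    rng = random.Random(seed)
    assignment = {}
    for label in sorted(by_label):
        sids = sorted(by_label[label])
        rng.shuffle(sids)
        n_train = round(len(sids) * ratios[0])
        n_val = round(len(sids) * ratios[1])
        for i, sid in enumerate(sids):
            if i < n_train:
                assignment[sid] = "train"
            elif i < n_train + n_val:
                assignment[sid] = "val"
            else:
                assignment[sid] = "test"

    # Every row follows its sentence into exactly one split.
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[assignment[row["sentence_id"]]].append(row)
    return splits["train"], splits["val"], splits["test"]


# Toy data, not project data: 20 sentences with two aspect rows each.
toy_rows = [
    {"sentence_id": s, "polarity": "positive" if s % 2 == 0 else "negative"}
    for s in range(20)
    for _ in range(2)
]
train, val, test = sentence_level_split(toy_rows)
print(len(train), len(val), len(test))
```

Because whole sentences are assigned, the row-level ratios only approximate 80/10/10 when sentences carry different numbers of aspects.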

In [None]:
print("🔀 Building stratified sentence-level splits...\n")

all_splits = build_all_splits(config, verbose=True)

## 5️⃣ Visual Verification: Polarity Distribution Across Splits
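A bar chart shows whether the splits look alike, but a single number is easier to assert on. The sketch below computes the largest absolute gap in polarity share between two splits. The helper names are hypothetical and the labels are toy data; what counts as an acceptable gap is a judgment call, since small validation and test sets split at sentence level will rarely match the training distribution exactly.

```python
from collections import Counter


def polarity_shares(labels):
    """Map each polarity to its share of the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def max_share_gap(reference, other):
    """Largest absolute difference in polarity share between two splits."""
    ref, oth = polarity_shares(reference), polarity_shares(other)
    return max(abs(ref.get(k, 0.0) - oth.get(k, 0.0)) for k in set(ref) | set(oth))


# Toy labels, not project data.
train_labels = ["positive"] * 60 + ["negative"] * 30 + ["neutral"] * 10
val_labels = ["positive"] * 6 + ["negative"] * 3 + ["neutral"] * 1
print(f"max gap: {max_share_gap(train_labels, val_labels):.3f}")
```

With the splits built in this notebook, the same function can be called as `max_share_gap(train_df['polarity'], val_df['polarity'])`, because `Counter` iterates over the values of a pandas Series.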

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (year, (train_df, val_df, test_df)) in enumerate(all_splits.items()):
    ax = axes[idx]
    
    # Compute distributions
    polarities = sorted(train_df['polarity'].unique())
    x = np.arange(len(polarities))
    width = 0.25
    
    train_pcts = [train_df['polarity'].value_counts(normalize=True).get(p, 0)*100 for p in polarities]
    val_pcts   = [val_df['polarity'].value_counts(normalize=True).get(p, 0)*100   for p in polarities]
    test_pcts  = [test_df['polarity'].value_counts(normalize=True).get(p, 0)*100  for p in polarities]
    
    bars1 = ax.bar(x - width, train_pcts, width, label='Train', color='#2196F3', alpha=0.85)
    bars2 = ax.bar(x,         val_pcts,   width, label='Val',   color='#FF9800', alpha=0.85)
    bars3 = ax.bar(x + width, test_pcts,  width, label='Test',  color='#4CAF50', alpha=0.85)
    
    ax.set_xlabel('Polarity', fontsize=11, fontweight='bold')
    ax.set_ylabel('Percentage (%)', fontsize=11, fontweight='bold')
    ax.set_title(f'SemEval {year}', fontsize=13, fontweight='bold')
    ax.set_xticks(x)
    ax.set_xticklabels(polarities, rotation=15)
    ax.legend(fontsize=9)
    ax.set_ylim(0, 80)
    
    # Add value labels on bars
    for bars in [bars1, bars2, bars3]:
        for bar in bars:
            h = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., h + 0.5,
                    f'{h:.1f}', ha='center', va='bottom', fontsize=7)

fig.suptitle('Polarity Distribution Across Splits (Stratification Verification)',
             fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n✅ If bars within each dataset are approximately equal height per polarity,")
print("   stratification is working correctly!")

## 6️⃣ Data Leakage Verification

In [None]:
print("🔒 Checking for data leakage (sentence overlap)...\n")

all_clear = True
for year, (train_df, val_df, test_df) in all_splits.items():
    train_sids = set(train_df['sentence_id'].unique())
    val_sids   = set(val_df['sentence_id'].unique())
    test_sids  = set(test_df['sentence_id'].unique())
    
    overlap_tv = train_sids & val_sids
    overlap_tt = train_sids & test_sids
    overlap_vt = val_sids & test_sids
    
    # Union of the pairwise overlaps, so a sentence shared by all three splits is counted once
    total_overlap = len(overlap_tv | overlap_tt | overlap_vt)
    
    if total_overlap == 0:
        print(f"  ✅ SemEval {year}: No leakage detected")
    else:
        print(f"  ❌ SemEval {year}: {total_overlap} overlapping sentences FOUND!")
        all_clear = False

print()
if all_clear:
    print("🎉 ALL DATASETS PASS - No data leakage across any split!")
else:
    print("⚠️ LEAKAGE DETECTED - Fix the splitting logic before proceeding!")

## 7️⃣ Class Weights for Imbalanced Sentiments

In [None]:
print("⚖️ Computing class weights (inverse frequency) from TRAINING sets...\n")

label_map = config['labels']['polarity']

for year, (train_df, _, _) in all_splits.items():
    weights = compute_class_weights(train_df, label_map)
    
    print(f"\n  SemEval {year}:")
    print(f"  {'Label':12s} {'Count':>6s} {'Weight':>8s}")
    print(f"  {'─'*30}")
    
    counts = train_df['polarity'].value_counts()
    for label_name, label_idx in sorted(label_map.items(), key=lambda x: x[1]):
        cnt = counts.get(label_name, 0)
        wt = weights[label_idx]
        print(f"  {label_name:12s} {cnt:6d} {wt:8.4f}")

print("\n💡 Higher weight = rarer class = model pays more attention to it")

## 8️⃣ Split Size Summary

In [None]:
summary_data = []

for year, (train_df, val_df, test_df) in all_splits.items():
    summary_data.append({
        'Dataset': f'SemEval {year}',
        'Train Rows': len(train_df),
        'Train Sentences': train_df['sentence_id'].nunique(),
        'Val Rows': len(val_df),
        'Val Sentences': val_df['sentence_id'].nunique(),
        'Test Rows': len(test_df),
        'Test Sentences': test_df['sentence_id'].nunique(),
        'Total Rows': len(train_df) + len(val_df) + len(test_df),
    })

summary_df = pd.DataFrame(summary_data)
display(summary_df)

print("\n✅ Phase 2: Data Pipeline - COMPLETE")
print("\n🚀 Ready for Phase 3: Model Architecture")

---

## ✅ Phase 2 Summary

| Item | Status |
|------|--------|
| Config loaded | ✅ |
| 3 datasets loaded | ✅ |
| Stratified 80/10/10 split | ✅ |
| Sentence-level (no leakage) | ✅ |
| Distribution preserved | ✅ |
| Class weights computed | ✅ |
| Splits cached | ✅ |

**Next step →** Phase 3: Model Architecture