# CUAD Data Exploration

This notebook explores CUAD (the Contract Understanding Atticus Dataset), a benchmark for contract clause extraction.

**Dataset Stats:**
- 510 contracts
- 41 clause categories
- ~20,900 question-answer pairs
- 32% positive (has clause) / 68% negative (no clause)
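
These figures are mutually consistent if every contract is paired with every clause category, which is an assumption here rather than something the loader guarantees. A quick arithmetic sketch; the positive and negative counts are only estimates derived from the rounded 32% rate:

```python
# Sanity-check the headline numbers quoted above.
num_contracts = 510
num_categories = 41

# Assumption: one question per (contract, category) pair.
total_pairs = num_contracts * num_categories

# The 32% rate is rounded, so the derived counts are approximate.
positive_rate = 0.32
approx_positive = round(total_pairs * positive_rate)
approx_negative = total_pairs - approx_positive

print(f"Total pairs:     {total_pairs:,}")
print(f"Approx positive: {approx_positive:,}")
print(f"Approx negative: {approx_negative:,}")
```

The product is 20,910, which rounds to the ~20,900 quoted in the list.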

In [None]:
import sys
sys.path.insert(0, "..")

from src.data import CUADDataLoader, CATEGORY_TIERS, get_category_tier

## 1. Load Dataset

In [None]:
# Load from local JSON file (default)
loader = CUADDataLoader()
loader.load()

print(f"Loaded {len(loader):,} samples")

## 2. Dataset Statistics

In [None]:
stats = loader.stats()

print("Dataset Statistics")
print("=" * 40)
print(f"Total samples:     {stats['total_samples']:,}")
print(f"Positive samples:  {stats['positive_samples']:,} ({stats['positive_rate']:.1%})")
print(f"Negative samples:  {stats['negative_samples']:,}")
print()
print(f"Categories:        {stats['num_categories']}")
print(f"Contracts:         {stats['num_contracts']}")
print()
print("Samples by tier:")
print(f"  Common (easy):   {stats['common_tier_samples']:,}")
print(f"  Moderate:        {stats['moderate_tier_samples']:,}")
print(f"  Rare (hard):     {stats['rare_tier_samples']:,}")

## 3. Category Tiers

Categories are stratified by baseline F1 performance from ContractEval:
- **Common (F1 > 0.7):** Easy categories where models perform well
- **Moderate (F1 0.3-0.7):** Medium difficulty
- **Rare (F1 near 0):** Hard categories where models struggle
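
The thresholds above can be read as a simple mapping from a baseline F1 score to a tier. The sketch below is illustrative only: `tier_from_f1` is a hypothetical helper, not part of `src.data`, and the actual `CATEGORY_TIERS` assignment may be hard-coded differently. It also assumes that anything below 0.3 counts as "rare", since the list leaves the range between "near 0" and 0.3 unspecified.

```python
def tier_from_f1(f1: float) -> str:
    """Map a baseline F1 score to a tier name (hypothetical helper).

    Assumption: scores below 0.3 are treated as "rare", because the
    tier definitions do not say where "near 0" ends.
    """
    if not 0.0 <= f1 <= 1.0:
        raise ValueError(f"F1 must be in [0, 1], got {f1}")
    if f1 > 0.7:
        return "common"
    if f1 >= 0.3:
        return "moderate"
    return "rare"


# Made-up F1 values for illustration, not ContractEval results.
for score in (0.85, 0.50, 0.05):
    print(f"F1={score:.2f} -> {tier_from_f1(score)}")
```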

In [None]:
print("Category Tiers")
print("=" * 60)

for tier, categories in CATEGORY_TIERS.items():
    print(f"\n{tier.upper()} ({len(categories)} categories):")
    for cat in categories:
        samples = loader.get_by_category(cat)
        positive = sum(1 for s in samples if s.ground_truth)
        print(f"  - {cat}: {len(samples)} samples ({positive} positive)")

## 4. Sample Exploration

In [None]:
from itertools import islice

# Take the first five samples without materializing the whole dataset
samples = list(islice(loader, 5))

for i, sample in enumerate(samples):
    print(f"\n{'='*60}")
    print(f"Sample {i+1}: {sample.id}")
    print(f"{'='*60}")
    print(f"Category: {sample.category} (Tier: {sample.tier})")
    print(f"Contract: {sample.contract_title}")
    print(f"Contract length: {len(sample.contract_text):,} chars")
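    # Only append an ellipsis when the question was actually truncated
    question_preview = sample.question[:200] + ("..." if len(sample.question) > 200 else "")
    print(f"\nQuestion: {question_preview}")
    print(f"\nGround Truth: {sample.ground_truth or '(No clause)'}")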

## 5. Contract Length Distribution

In [None]:
import statistics

# Record each unique contract's length, keyed by title
contracts = {}
for sample in loader:
    if sample.contract_title not in contracts:
        contracts[sample.contract_title] = len(sample.contract_text)

lengths = list(contracts.values())

print("Contract Length Statistics")
print("=" * 40)
print(f"Number of contracts: {len(lengths)}")
print(f"Min length: {min(lengths):,} chars")
print(f"Max length: {max(lengths):,} chars")
print(f"Avg length: {sum(lengths)/len(lengths):,.0f} chars")
# statistics.median averages the two middle values when the count is even
print(f"Median length: {statistics.median(lengths):,.0f} chars")
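
A note on the median: with an even number of contracts (510 here), indexing the sorted list at `len(lengths) // 2` picks the upper of the two middle values, whereas the conventional median averages them. A minimal illustration with made-up lengths, using only the standard library:

```python
import statistics

# Made-up contract lengths in characters, not values from CUAD.
lengths = [1_200, 3_400, 5_600, 48_000]

# Index-based shortcut: returns the upper of the two middle values.
upper_middle = sorted(lengths)[len(lengths) // 2]

# Conventional median: mean of the two middle values for even counts.
true_median = statistics.median(lengths)

print(upper_middle)  # 5600
print(true_median)   # 4500.0
```

On a long-tailed distribution such as contract lengths the two values are usually close, but they are not guaranteed to match.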

## 6. Positive Rate by Category

In [None]:
# Calculate positive rate for each category
category_stats = []

for cat in loader.get_categories():
    samples = loader.get_by_category(cat)
    positive = sum(1 for s in samples if s.ground_truth)
    rate = positive / len(samples) if samples else 0
    tier = get_category_tier(cat)
    category_stats.append({
        "category": cat,
        "tier": tier,
        "total": len(samples),
        "positive": positive,
        "rate": rate
    })

# Sort by positive rate
category_stats.sort(key=lambda x: x["rate"], reverse=True)

print("Positive Rate by Category (sorted by rate)")
print("=" * 70)
print(f"{'Category':<40} {'Tier':<10} {'Rate':<10} {'Pos/Total'}")
print("-" * 70)

for stat in category_stats:
    print(f"{stat['category']:<40} {stat['tier']:<10} {stat['rate']:<10.1%} {stat['positive']}/{stat['total']}")

## 7. Example: Rare Category Deep Dive

Let's look at a rare (difficult) category to understand why models struggle.

In [None]:
# Pick a rare category
rare_cat = "Uncapped Liability"
rare_samples = loader.get_by_category(rare_cat)

print(f"Category: {rare_cat}")
print(f"Total samples: {len(rare_samples)}")
print(f"Positive samples: {sum(1 for s in rare_samples if s.ground_truth)}")

# Show positive examples
print("\n" + "="*60)
print("POSITIVE EXAMPLES (has clause):")
print("="*60)

positive_examples = [s for s in rare_samples if s.ground_truth][:3]
for i, sample in enumerate(positive_examples):
    print(f"\nExample {i+1}:")
    print(f"Contract: {sample.contract_title}")
    print(f"Ground Truth: {sample.ground_truth}")

## 8. Ready for Experiments

The data loader is working. Key things to remember:

1. **~20,900 samples** total across 510 contracts
2. **41 categories** stratified into common/moderate/rare tiers
3. **32% positive rate** - most samples have no relevant clause
4. **Rare categories** have very low positive rates → models tend to be "lazy"
5. Use `loader.get_by_tier("rare")` to focus on difficult cases
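
One way to see why a low positive rate invites "lazy" behavior: a predictor that always answers "no clause" scores high accuracy while finding nothing. The sketch below uses synthetic labels (5 positives out of 100, a hypothetical rate rather than a measured CUAD figure), and `accuracy_and_recall` is an illustrative helper, not part of the project code.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for two equal-length lists of booleans."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t and p for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    accuracy = correct / len(y_true)
    # Recall is undefined with no positives; report 0.0 in that case.
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


# Synthetic rare category: 5 samples with a clause, 95 without.
y_true = [True] * 5 + [False] * 95
# "Lazy" predictor: always answers "no clause".
y_pred = [False] * 100

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

This is why recall or F1 on the positive class, not raw accuracy, is the more informative metric for rare categories.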