# CUAD Data Exploration

This notebook explores the CUAD (Contract Understanding Atticus Dataset) for contract clause extraction.

**Dataset Stats:**
- 510 contracts
- 41 clause categories
- 20,910 Q&A pairs (510 contracts × 41 categories)
- ~13,800 answer spans (the actual clause extractions)
- 32% positive (has clause) / 68% negative (no clause)
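
The figures above are tied together by simple arithmetic. The sketch below uses only the numbers quoted in the list; because the span count and positive rate are rounded, the derived values are estimates rather than exact dataset statistics.

```python
# Sanity-check the dataset stats using only the figures quoted above.
num_contracts = 510
num_categories = 41
approx_answer_spans = 13_800   # approximate, as quoted
positive_rate = 0.32           # approximate, as quoted

# Every contract is paired with every category question.
qa_pairs = num_contracts * num_categories

# Rough estimates derived from the rounded figures.
approx_positive_pairs = round(qa_pairs * positive_rate)
approx_spans_per_positive = approx_answer_spans / approx_positive_pairs

print(f"Q&A pairs:              {qa_pairs:,}")
print(f"Positive pairs (est.):  {approx_positive_pairs:,}")
print(f"Spans per positive:     {approx_spans_per_positive:.1f}")
```

The exact values come from `loader.stats()` in section 2; this is only a consistency check on the headline numbers.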

In [1]:
import sys
sys.path.insert(0, "..")

from src.data import CUADDataLoader, CATEGORY_TIERS, get_category_tier

## 1. Load Dataset

In [2]:
# Load from local JSON file (default)
loader = CUADDataLoader()
loader.load()

print(f"Loaded {len(loader):,} samples")

Loaded 20,910 samples


## 2. Dataset Statistics

In [None]:
stats = loader.stats()

print("Dataset Statistics")
print("=" * 50)
print(f"Total Q&A pairs:        {stats['total_samples']:,}")
print(f"Positive (has clause):  {stats['positive_samples']:,} ({stats['positive_rate']:.1%})")
print(f"Negative (no clause):   {stats['negative_samples']:,}")
print()
print("Answer Spans:")
print(f"  Total spans:          {stats['total_answer_spans']:,} (the ~13k CUAD labels)")
print(f"  Avg per positive:     {stats['avg_spans_per_positive']:.1f}")
print()
print(f"Categories:             {stats['num_categories']}")
print(f"Contracts:              {stats['num_contracts']}")
print()
print("Samples by tier:")
print(f"  Common (easy):        {stats['common_tier_samples']:,}")
print(f"  Moderate:             {stats['moderate_tier_samples']:,}")
print(f"  Rare (hard):          {stats['rare_tier_samples']:,}")

## 3. Category Tiers

Categories are stratified by baseline F1 performance from ContractEval:
- **Common (F1 > 0.7):** Easy categories where models perform well
- **Moderate (F1 0.3-0.7):** Medium difficulty
- **Rare (F1 near 0):** Hard categories where models struggle
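
To make the thresholds concrete, here is a minimal sketch of how a baseline F1 score could be mapped to a tier. `tier_from_f1` is a hypothetical helper, not part of `src.data`; in this notebook the tiers come from the precomputed `CATEGORY_TIERS` mapping. The text only says "near 0" for the rare tier, so treating everything below 0.3 as rare is an assumption.

```python
def tier_from_f1(f1: float) -> str:
    """Map a baseline F1 score to a difficulty tier (hypothetical helper)."""
    if not 0.0 <= f1 <= 1.0:
        raise ValueError(f"F1 must be in [0, 1], got {f1}")
    if f1 > 0.7:
        return "common"
    if f1 >= 0.3:
        return "moderate"
    # Assumption: anything below 0.3 is treated as rare;
    # the tier description above only says "near 0".
    return "rare"


# Illustrative scores only, not measured ContractEval values.
for score in (0.85, 0.5, 0.02):
    print(f"F1={score:.2f} -> {tier_from_f1(score)}")
```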

In [4]:
print("Category Tiers")
print("=" * 60)

for tier, categories in CATEGORY_TIERS.items():
    print(f"\n{tier.upper()} ({len(categories)} categories):")
    for cat in categories:
        samples = loader.get_by_category(cat)
        positive = sum(1 for s in samples if s.ground_truth)
        print(f"  - {cat}: {len(samples)} samples ({positive} positive)")

Category Tiers

COMMON (6 categories):
  - Governing Law: 510 samples (437 positive)
  - Parties: 510 samples (509 positive)
  - Agreement Date: 510 samples (470 positive)
  - Effective Date: 510 samples (390 positive)
  - Expiration Date: 510 samples (413 positive)
  - Document Name: 510 samples (510 positive)

MODERATE (18 categories):
  - Renewal Term: 510 samples (176 positive)
  - License Grant: 510 samples (255 positive)
  - Termination For Convenience: 510 samples (183 positive)
  - Anti-Assignment: 510 samples (374 positive)
  - Change Of Control: 510 samples (121 positive)
  - Cap On Liability: 510 samples (275 positive)
  - Insurance: 510 samples (166 positive)
  - Audit Rights: 510 samples (214 positive)
  - Non-Compete: 510 samples (119 positive)
  - Exclusivity: 510 samples (180 positive)
  - Non-Transferable License: 510 samples (138 positive)
  - Irrevocable Or Perpetual License: 510 samples (70 positive)
  - Rofr/Rofo/Rofn: 510 samples (85 positive)
  - No-Solicit Of Em

## 4. Sample Exploration

In [None]:
# Get a few sample items
samples = list(loader)[:5]

for i, sample in enumerate(samples):
    print(f"\n{'='*60}")
    print(f"Sample {i+1}: {sample.id}")
    print(f"{'='*60}")
    print(f"Category: {sample.category} (Tier: {sample.tier})")
    print(f"Contract: {sample.contract_title}")
    print(f"Contract length: {len(sample.contract_text):,} chars")
    print(f"\nQuestion: {sample.question[:200]}...")
    print(f"\nHas clause: {sample.has_clause} ({sample.num_spans} answer spans)")
    if sample.has_clause:
        # Truncate long spans to 200 characters for display
        first_span = sample.ground_truth
        preview = (first_span[:200] + "...") if len(first_span) > 200 else first_span
        print(f"First span: {preview}")

## 5. Contract Length Distribution

In [6]:
# Get unique contracts
contracts = {}
for sample in loader:
    if sample.contract_title not in contracts:
        contracts[sample.contract_title] = len(sample.contract_text)

lengths = list(contracts.values())

print("Contract Length Statistics")
print("=" * 40)
print(f"Number of contracts: {len(lengths)}")
print(f"Min length: {min(lengths):,} chars")
print(f"Max length: {max(lengths):,} chars")
print(f"Avg length: {sum(lengths)/len(lengths):,.0f} chars")
# Upper median: with an even number of contracts this picks the higher of the two middle values
print(f"Median length: {sorted(lengths)[len(lengths)//2]:,} chars")

Contract Length Statistics
Number of contracts: 510
Min length: 645 chars
Max length: 338,211 chars
Avg length: 52,563 chars
Median length: 33,425 chars


## 6. Positive Rate by Category

In [7]:
# Calculate positive rate for each category
category_stats = []

for cat in loader.get_categories():
    samples = loader.get_by_category(cat)
    positive = sum(1 for s in samples if s.ground_truth)
    rate = positive / len(samples) if samples else 0
    tier = get_category_tier(cat)
    category_stats.append({
        "category": cat,
        "tier": tier,
        "total": len(samples),
        "positive": positive,
        "rate": rate
    })

# Sort by positive rate
category_stats.sort(key=lambda x: x["rate"], reverse=True)

print("Positive Rate by Category (sorted by rate)")
print("=" * 70)
print(f"{'Category':<40} {'Tier':<10} {'Rate':<10} {'Pos/Total'}")
print("-" * 70)

for stat in category_stats:
    print(f"{stat['category']:<40} {stat['tier']:<10} {stat['rate']:.1%}     {stat['positive']}/{stat['total']}")

Positive Rate by Category (sorted by rate)
Category                                 Tier       Rate       Pos/Total
----------------------------------------------------------------------
Document Name                            common     100.0%     510/510
Parties                                  common     99.8%     509/510
Agreement Date                           common     92.2%     470/510
Governing Law                            common     85.7%     437/510
Expiration Date                          common     81.0%     413/510
Effective Date                           common     76.5%     390/510
Anti-Assignment                          moderate   73.3%     374/510
Cap On Liability                         moderate   53.9%     275/510
License Grant                            moderate   50.0%     255/510
Audit Rights                             moderate   42.0%     214/510
Termination For Convenience              moderate   35.9%     183/510
Post-Termination Services                m

## 7. Example: Rare Category Deep Dive

Let's look at a rare (difficult) category to understand why models struggle.

In [8]:
# Pick a rare category
rare_cat = "Uncapped Liability"
rare_samples = loader.get_by_category(rare_cat)

print(f"Category: {rare_cat}")
print(f"Total samples: {len(rare_samples)}")
print(f"Positive samples: {sum(1 for s in rare_samples if s.ground_truth)}")

# Show positive examples
print("\n" + "="*60)
print("POSITIVE EXAMPLES (has clause):")
print("="*60)

positive_examples = [s for s in rare_samples if s.ground_truth][:3]
for i, sample in enumerate(positive_examples):
    print(f"\nExample {i+1}:")
    print(f"Contract: {sample.contract_title}")
    print(f"Ground Truth: {sample.ground_truth}")

Category: Uncapped Liability
Total samples: 510
Positive samples: 111

POSITIVE EXAMPLES (has clause):

Example 1:
Contract: WHITESMOKE,INC_11_08_2011-EX-10.26-PROMOTION AND DISTRIBUTION AGREEMENT
Ground Truth: Subject to Clauses 9.1 and 9.2, each party's total liability under or in connection with this Agreement (whether in contract, tort or  otherwise) arising in any Contract Year is limited to the greater of:

  (a) [ * ] Euros ([ * ] Euros); and

  (b) [ * ]% of the total payment due to the Distributor in the relevant Contract Year pursuant to Clause 4 (Payment Terms).

Example 2:
Contract: DovaPharmaceuticalsInc_20181108_10-Q_EX-10.2_11414857_EX-10.2_Promotion Agreement
Ground Truth: THE FOREGOING SENTENCE SHALL NOT LIMIT (1) THE OBLIGATIONS OF EITHER PARTY TO INDEMNIFY THE OTHER PARTY FROM AND AGAINST THIRD PARTY CLAIMS UNDER SECTION 11.1 OR 11.2, AS APPLICABLE, OR (2) DAMAGES AVAILABLE FOR A PARTY'S BREACH OF THE CONFIDENTIALITY AND NON-USE OBLIGATIONS IN ARTICLE 9.

Example 3:


## 8. Ready for Experiments

The data loader is working. Key things to remember:

1. **20,910 Q&A pairs** = 510 contracts × 41 categories
2. **~13,800 answer spans** = the actual "labels" (multiple spans per positive Q&A)
3. **41 categories** stratified into common/moderate/rare tiers
4. **32% positive rate**: roughly two-thirds of samples have no relevant clause
5. **Rare categories** have low positive rates (e.g. 111/510 for Uncapped Liability) → models tend to be "lazy" and default to predicting no clause
6. Use `loader.get_by_tier("rare")` to focus on difficult cases
7. Use `sample.ground_truth_spans` to access all answer spans (not just the first)
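
To see why "lazy" behaviour is tempting on a rare category, consider a predictor that always answers "no clause". The sketch below uses the Uncapped Liability counts from section 7 (510 samples, 111 positive). It is plain arithmetic on those two numbers, not an evaluation of any actual model.

```python
# Figures taken from the "Uncapped Liability" output in section 7.
total = 510
positives = 111
negatives = total - positives

# A "lazy" predictor answers "no clause" for every sample.
true_negatives = negatives   # every negative sample is answered correctly
true_positives = 0           # no positive sample is ever found

accuracy = (true_positives + true_negatives) / total
recall = true_positives / positives

print(f"Accuracy: {accuracy:.1%}")
print(f"Recall:   {recall:.1%}")
```

Accuracy stays high while recall is zero, which is why span-level metrics on the positive samples matter more than raw accuracy for these categories.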