# Approach 3: Hybrid Claim-Phrase NER + LLM Restructuring

Combines neural NER with LLM post-processing for structured output.

## Overview
- **Step 1**: Use RoBERTa NER to extract claim phrases (like Approach 2)
- **Step 2**: Use an LLM (OpenAI `gpt-4o-mini` by default, or GPT-4) to restructure claims into SPOT format
  - **S**ubject: Who/what is making the claim
  - **P**redicate: What action/state is claimed
  - **O**bject: What is being claimed about
  - **T**ime: When/urgency element
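
As a rough illustration, one SPOT record could be held in a small dataclass. The class name `SpotClaim` and its field names are assumptions chosen to mirror the JSON keys requested from the LLM in Step 2; they are not part of any library.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SpotClaim:
    """Hypothetical container for one claim in SPOT format."""
    claim_type: str
    subject: str      # who/what is making the claim
    predicate: str    # action or state being claimed
    object: str       # what the claim is about
    time: Optional[str] = None  # None when there is no temporal/urgency element


# Illustrative values only, not real model output
example = SpotClaim(
    claim_type="DELIVERY_CLAIM",
    subject="Amazon",
    predicate="is delayed",
    object="package",
)
print(asdict(example))
```

A typed record like this makes missing fields visible early, which helps if the structured claims are later passed to a verification step.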

## Advantages
- ✅ Most structured output
- ✅ Best for complex verification logic
- ✅ Combines neural speed + LLM reasoning

## Requirements
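- OpenAI API key (the code below calls `gpt-4o-mini` by default)
- Note: API calls incur costs (a fraction of a cent per message with `gpt-4o-mini`, more with GPT-4; see the Cost Analysis section)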

## Setup Instructions
1. Upload `claim_annotations_2000.json`
2. Provide OpenAI API key when prompted
3. Run all cells

## 1. Environment Setup

In [None]:
# Install packages
!pip install -q transformers datasets accelerate seqeval scikit-learn torch openai

In [None]:
# Imports
import json
import torch
import numpy as np
from pathlib import Path
from dataclasses import dataclass
from typing import List, Dict
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification
)
from seqeval.metrics import classification_report, f1_score
from openai import OpenAI
import warnings
warnings.filterwarnings('ignore')

print("✅ Setup complete")

In [None]:
# Upload data
from google.colab import files

print("📁 Upload 'claim_annotations_2000.json'")
uploaded = files.upload()
data_file = list(uploaded.keys())[0]
print(f"✅ Uploaded: {data_file}")

In [None]:
# Get OpenAI API key
import getpass

print("🔑 Enter your OpenAI API key:")
api_key = getpass.getpass("API Key: ")
client = OpenAI(api_key=api_key)
print("✅ API key configured")

## 2. Step 1: Train Claim-Phrase NER (Same as Approach 2)

This is identical to Approach 2. We'll use the same code.

In [None]:
# Define claim types and labels (same as Approach 2)
CLAIM_TYPES = [
    'IDENTITY_CLAIM', 'DELIVERY_CLAIM', 'FINANCIAL_CLAIM',
    'ACCOUNT_CLAIM', 'URGENCY_CLAIM', 'ACTION_CLAIM',
    'VERIFICATION_CLAIM', 'SECURITY_CLAIM', 'REWARD_CLAIM',
    'LEGAL_CLAIM', 'SOCIAL_CLAIM', 'CREDENTIALS_CLAIM'
]

labels = ['O']
for claim_type in CLAIM_TYPES:
    labels.append(f'B-{claim_type}')
    labels.append(f'I-{claim_type}')

label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(f"✅ {len(labels)} labels defined")

In [None]:
# Data loading (abbreviated - same as Approach 2)
# NOTE: Copy the full data loading code from Approach 2 notebook
# For brevity, showing simplified version

print("📂 Loading data... (using same code as Approach 2)")
print("⚠️  Copy full data loading cells from Approach 2 notebook")

# After loading, you should have:
# - train_examples, val_examples, test_examples
# - train_dataset, val_dataset, test_dataset
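
Since the loading cells are not reproduced here, the following is only a sketch of the core step such code usually performs: turning character-level claim spans into word-level BIO tags. The `(start, end, claim_type)` span layout and the function name `spans_to_bio` are assumptions; the actual schema of `claim_annotations_2000.json` and the Approach 2 code may differ.

```python
import re


def spans_to_bio(text, spans):
    """Convert character-level claim spans to word-level BIO tags.

    `spans` is a list of (start, end, claim_type) tuples with an exclusive
    end offset. This layout is hypothetical, not the real annotation schema.
    """
    words, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, claim_type in spans:
            # A word is tagged if it overlaps the span at all
            if match.start() < end and match.end() > start:
                # The word containing the span's first character gets B-
                prefix = "B" if match.start() <= start < match.end() else "I"
                tag = f"{prefix}-{claim_type}"
                break
        words.append(match.group())
        tags.append(tag)
    return words, tags


text = "Your package is delayed. Click here now."
start = text.index("package")
end = start + len("package is delayed")
words, tags = spans_to_bio(text, [(start, end, "DELIVERY_CLAIM")])
for word, tag in zip(words, tags):
    print(f"{word:10} {tag}")
```

Word-level tags still have to be aligned to RoBERTa subword tokens before training; that alignment is part of the Approach 2 code and is not shown here.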

In [None]:
# Train NER model (same as Approach 2)
MODEL_NAME = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
)

print("✅ Model loaded - ready to train")
print("⚠️  Copy training code from Approach 2 notebook")
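
The training code itself lives in the Approach 2 notebook. One piece that token-classification training typically needs is a helper that drops ignored positions before scoring, because Hugging Face collators pad labels with `-100`. The sketch below uses plain lists and a toy label map; the name `decode_for_seqeval` is hypothetical.

```python
IGNORE_INDEX = -100  # label id used for positions excluded from the loss


def decode_for_seqeval(pred_ids, label_ids, id2label):
    """Drop ignored positions and map ids to tag strings, one list per sentence."""
    true_tags, pred_tags = [], []
    for preds, golds in zip(pred_ids, label_ids):
        # Keep only positions that carry a real gold label
        kept = [(p, g) for p, g in zip(preds, golds) if g != IGNORE_INDEX]
        pred_tags.append([id2label[p] for p, _ in kept])
        true_tags.append([id2label[g] for _, g in kept])
    return true_tags, pred_tags


# Toy label map and ids for illustration only
toy_id2label = {0: "O", 1: "B-URGENCY_CLAIM", 2: "I-URGENCY_CLAIM"}
true_tags, pred_tags = decode_for_seqeval(
    pred_ids=[[0, 1, 2, 0, 0]],
    label_ids=[[-100, 1, 2, 0, -100]],
    id2label=toy_id2label,
)
print(true_tags)
print(pred_tags)
```

The two returned lists have the shape that seqeval's `f1_score(y_true, y_pred)` and `classification_report` expect, so a `compute_metrics` function for `Trainer` could be built around this helper.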

In [None]:
# Assuming training is complete, define claim extraction function
def extract_claims_ner(text, model, tokenizer, id2label):
    """
    Extract claims using trained NER model
    (Same as Approach 2)
    """
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
        return_offsets_mapping=True
    )
    
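    offset_mapping = inputs.pop('offset_mapping')[0]
    # Move inputs to the model's device (the model may be on GPU after training)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)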
    
    predictions = torch.argmax(outputs.logits, dim=2)[0]
    probabilities = torch.softmax(outputs.logits, dim=2)[0]
    
    claims = []
    current_claim = None
    
    for idx, (pred, prob, (start, end)) in enumerate(zip(predictions, probabilities, offset_mapping)):
        if start == 0 and end == 0:
            continue
        
        label = id2label[pred.item()]
        confidence = prob[pred].item()
        
        if label.startswith('B-'):
            if current_claim:
                claims.append(current_claim)
            
            current_claim = {
                'type': label[2:],
                'start': start.item(),
                'end': end.item(),
                'confidence': confidence
            }
        
        elif label.startswith('I-') and current_claim:
            if label[2:] == current_claim['type']:
                current_claim['end'] = end.item()
                # Running pairwise average: later tokens weigh more than in a true mean
                current_claim['confidence'] = (current_claim['confidence'] + confidence) / 2
        
        elif label == 'O' and current_claim:
            claims.append(current_claim)
            current_claim = None
    
    if current_claim:
        claims.append(current_claim)
    
    for claim in claims:
        claim['text'] = text[claim['start']:claim['end']]
    
    return claims

print("✅ Claim extraction function ready")
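
The span-merging logic inside `extract_claims_ner` can be checked without a trained model by feeding it hand-written token predictions. The sketch below is a simplified variant, not the function above: it takes `(label, start, end, confidence)` tuples (an assumed input layout), reports the arithmetic mean confidence per span, and closes a span when an `I-` tag of a different type appears.

```python
def merge_bio_tokens(tokens):
    """Merge (label, start, end, confidence) token tuples into claim spans.

    Confidence is the arithmetic mean over all tokens in the span.
    """
    claims, current = [], None
    for label, start, end, conf in tokens:
        if label.startswith("B-"):
            if current:
                claims.append(current)
            current = {"type": label[2:], "start": start, "end": end, "scores": [conf]}
        elif label.startswith("I-") and current and label[2:] == current["type"]:
            current["end"] = end
            current["scores"].append(conf)
        else:
            # "O", or an I- tag that does not continue the open span, closes it
            if current:
                claims.append(current)
            current = None
    if current:
        claims.append(current)
    for claim in claims:
        scores = claim.pop("scores")
        claim["confidence"] = sum(scores) / len(scores)
    return claims


# Hand-written predictions for illustration
toy_tokens = [
    ("B-URGENCY_CLAIM", 0, 5, 0.9),
    ("I-URGENCY_CLAIM", 6, 10, 0.6),
    ("I-URGENCY_CLAIM", 11, 14, 0.3),
    ("O", 15, 18, 0.99),
]
merged = merge_bio_tokens(toy_tokens)
print(merged)
```

For these three tokens the mean confidence is 0.6, whereas the pairwise running average used in `extract_claims_ner` would yield 0.525, because it weights the last token most heavily.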

## 3. Step 2: LLM Restructuring to SPOT Format

This is the new part: an OpenAI chat model restructures the extracted claims into SPOT format.

In [None]:
def restructure_claims_with_llm(text, claims, client):
    """
    Use GPT-4 to restructure claims into SPOT format:
    - Subject: Who/what is making the claim
    - Predicate: What action/state
    - Object: What is claimed about
    - Time: When/urgency
    """
    if not claims:
        return []
    
    # Format claims for LLM
    claims_text = "\n".join([
        f"- {claim['type']}: '{claim['text']}'"
        for claim in claims
    ])
    
    prompt = f"""You are a claim analysis expert. Given an SMS message and extracted claims, restructure each claim into SPOT format:

SPOT Format:
- Subject: Who/what entity is making the claim
- Predicate: What action or state is being claimed
- Object: What the claim is about (product, account, action, etc.)
- Time: Temporal/urgency element (if any)

SMS Message:
"{text}"

Extracted Claims:
{claims_text}

For EACH claim, provide SPOT structure as JSON:
{{
  "claim_type": "<type>",
  "original_text": "<text>",
  "spot": {{
    "subject": "<subject>",
    "predicate": "<predicate>",
    "object": "<object>",
    "time": "<time or null>"
  }},
  "verification_question": "<question to verify this claim>"
}}

Return ONLY a JSON object of the form {{"claims": [...]}} with one structured entry per claim."""
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # or "gpt-4" for better quality
            messages=[
                {"role": "system", "content": "You are a claim structuring expert. Always respond with valid JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        
        # Handle different response formats
        if isinstance(result, dict) and 'claims' in result:
            return result['claims']
        elif isinstance(result, list):
            return result
        else:
            return [result]
        
    except Exception as e:
        print(f"❌ LLM error: {e}")
        return []

print("✅ LLM restructuring function ready")
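
LLM output is not guaranteed to match the requested schema, so it may be worth validating each entry before using it downstream. The following is a minimal sketch that works on the raw JSON text; the function name `parse_spot_response` and the rule of silently dropping malformed entries are assumptions, not part of the pipeline above.

```python
import json

REQUIRED_SPOT_KEYS = ("subject", "predicate", "object", "time")


def parse_spot_response(raw):
    """Parse the model's JSON text and keep only well-formed SPOT entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []

    # Accept {"claims": [...]}, a bare list, or a single claim object
    if isinstance(data, dict):
        items = data.get("claims", [data])
    elif isinstance(data, list):
        items = data
    else:
        return []
    if not isinstance(items, list):
        items = [items]

    valid = []
    for item in items:
        spot = item.get("spot") if isinstance(item, dict) else None
        if isinstance(spot, dict) and all(key in spot for key in REQUIRED_SPOT_KEYS):
            valid.append(item)
    return valid


# Hand-written response text for illustration
raw = json.dumps({
    "claims": [
        {
            "claim_type": "URGENCY_CLAIM",
            "original_text": "Click here urgently",
            "spot": {"subject": "sender", "predicate": "requests",
                     "object": "click on link", "time": "urgently"},
        },
        {"claim_type": "BROKEN_ENTRY"},  # no "spot" key, so it is dropped
    ]
})
parsed = parse_spot_response(raw)
print(parsed)
```

Dropping malformed entries keeps the demo loop simple; a stricter pipeline might log them or retry the request instead.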

## 4. Complete Hybrid Pipeline Demo

In [None]:
# Test complete hybrid pipeline
test_messages = [
    "Your Amazon package is delayed. Click here urgently to reschedule delivery.",
    "URGENT: Your PayPal account has been suspended. Verify your identity now to avoid legal action.",
    "Congratulations! You've won £5000. Call 0800-123-456 to claim your prize today.",
]

print("🔍 HYBRID PIPELINE: NER → LLM RESTRUCTURING\n")
print("="*70)

for i, msg in enumerate(test_messages, 1):
    print(f"\n{i}. Message: {msg}")
    print("-"*70)
    
    # Step 1: Extract claims with NER
    claims = extract_claims_ner(msg, model, tokenizer, id2label)
    print(f"\n   STEP 1 - NER Extraction:")
    if claims:
        for claim in claims:
            print(f"     - {claim['type']:20} : '{claim['text']}'")
    else:
        print("     (no claims)")
        continue
    
    # Step 2: Restructure with LLM
    structured_claims = restructure_claims_with_llm(msg, claims, client)
    print(f"\n   STEP 2 - LLM SPOT Restructuring:")
    if structured_claims:
        for sc in structured_claims:
            print(f"\n     📋 {sc.get('claim_type', 'UNKNOWN')}:")
            spot = sc.get('spot', {})
            print(f"        Subject:   {spot.get('subject', 'N/A')}")
            print(f"        Predicate: {spot.get('predicate', 'N/A')}")
            print(f"        Object:    {spot.get('object', 'N/A')}")
            print(f"        Time:      {spot.get('time', 'N/A')}")
            
            if 'verification_question' in sc:
                print(f"        ❓ Verify:  {sc['verification_question']}")
    else:
        print("     (LLM restructuring failed)")
    
    print("="*70)

print("\n✅ Pipeline complete!")

## 5. Comparison: Raw Claims vs SPOT Structure

In [None]:
# Show the value of SPOT structuring
example_msg = "Your Amazon package is delayed. Click here urgently."

claims = extract_claims_ner(example_msg, model, tokenizer, id2label)
structured = restructure_claims_with_llm(example_msg, claims, client)

print("📊 COMPARISON: Raw vs Structured\n")
print("="*70)

print("\n1️⃣  Raw Claims (Approach 2 output):")
for claim in claims:
    print(f"   {claim['type']}: '{claim['text']}'")

print("\n2️⃣  Structured Claims (Approach 3 output):")
for sc in structured:
    print(f"\n   {sc.get('claim_type')}:")
    spot = sc.get('spot', {})
    print(f"     S: {spot.get('subject')}")
    print(f"     P: {spot.get('predicate')}")
    print(f"     O: {spot.get('object')}")
    print(f"     T: {spot.get('time')}")
    print(f"     Verify: {sc.get('verification_question')}")

print("\n" + "="*70)
print("\n💡 Benefits of SPOT:")
print("   ✅ Explicit subject identification")
print("   ✅ Clear action/state definition")
print("   ✅ Verification questions generated")
print("   ✅ Ready for automated verification agent")

## 6. Cost Analysis

In [None]:
# Estimate API costs
print("💰 Cost Analysis\n")
print("="*60)
print("Model: gpt-4o-mini")
print("  Input:  $0.150 / 1M tokens")
print("  Output: $0.600 / 1M tokens")
print("\nTypical SMS processing:")
print("  ~300 input tokens")
print("  ~200 output tokens")
print("  Cost per message: ~$0.000165 (~$0.17 per 1000 messages)")
print("\nFor 2000 test messages: ~$0.33")
print("="*60)
print("\n💡 Use gpt-4o-mini for cost-efficiency")
print("   Or use gpt-4 for higher quality (at a much higher per-token price)")
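
The per-message figure follows directly from the token counts and prices listed above. A small helper makes the arithmetic explicit and easy to redo for other volumes; the function name `estimate_cost` is hypothetical, and the default prices are simply the gpt-4o-mini figures quoted in this section, which may change.

```python
def estimate_cost(n_messages, input_tokens=300, output_tokens=200,
                  input_price_per_m=0.150, output_price_per_m=0.600):
    """Estimated USD cost for n_messages, given prices per 1M tokens."""
    per_message = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    return per_message * n_messages


print(f"Per message:   ${estimate_cost(1):.6f}")
print(f"1000 messages: ${estimate_cost(1000):.3f}")
print(f"2000 messages: ${estimate_cost(2000):.2f}")
```

Output tokens dominate here: 200 output tokens cost more than 300 input tokens because the output price is four times the input price.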

## 7. Save Pipeline

In [None]:
# Save NER model
model.save_pretrained("./hybrid-claim-ner")
tokenizer.save_pretrained("./hybrid-claim-ner")

# Save config
config = {
    'approach': 'hybrid',
    'step1': 'RoBERTa NER for claim extraction',
    'step2': 'OpenAI chat model (gpt-4o-mini by default) for SPOT restructuring',
    'label2id': label2id,
    'id2label': {int(k): v for k, v in id2label.items()},
    'claim_types': CLAIM_TYPES,
    'spot_format': {
        'S': 'Subject - who/what makes claim',
        'P': 'Predicate - action/state claimed',
        'O': 'Object - what claim is about',
        'T': 'Time - temporal/urgency element'
    }
}

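# Write the pipeline metadata to its own file: save_pretrained() above already
# wrote the model's config.json, and overwriting it would break from_pretrained()
with open("./hybrid-claim-ner/pipeline_config.json", "w") as f:
    json.dump(config, f, indent=2)

print("✅ Hybrid pipeline saved")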
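
One detail to keep in mind when reading the saved config back: JSON object keys are always strings, so the integer keys of `id2label` come back as `"0"`, `"1"`, and so on. A minimal sketch of the round trip, using a toy mapping rather than the real label set:

```python
import json

# Toy mapping for illustration; the notebook's id2label has 25 entries
toy_id2label = {0: "O", 1: "B-URGENCY_CLAIM", 2: "I-URGENCY_CLAIM"}

saved = json.dumps({"id2label": toy_id2label})  # integer keys are written as strings
loaded = json.loads(saved)["id2label"]

# Convert the keys back to int before indexing with predicted label ids
restored = {int(k): v for k, v in loaded.items()}

print(list(loaded.keys()))
print(restored == toy_id2label)
```

Without this conversion, a lookup such as `id2label[pred.item()]` in `extract_claims_ner` would raise a `KeyError` on a mapping loaded from JSON.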

## 8. Results Summary

In [None]:
print("="*60)
print("APPROACH 3: HYBRID CLAIM-PHRASE NER + LLM")
print("="*60)
print(f"\nPipeline:")
print(f"  1. RoBERTa NER extracts claims (fast, on-device)")
print(f"  2. OpenAI LLM (gpt-4o-mini by default) restructures to SPOT format (slow, API call)")
print(f"\nAdvantages:")
print(f"  ✅ Most structured output")
print(f"  ✅ Explicit verification questions")
print(f"  ✅ Best for complex verification agent")
print(f"  ✅ Combines neural speed + LLM reasoning")
print(f"\nDisadvantages:")
print(f"  ❌ Requires LLM API (costs)")
print(f"  ❌ Slower inference (~1-2 sec per message)")
print(f"  ❌ More complex pipeline")
print(f"\nUse When:")
print(f"  - Need highly structured claims")
print(f"  - Building sophisticated verification system")
print(f"  - Can afford API costs")
print("="*60)