# 🎯 EchoNote Model Evaluation Suite

**Comprehensive evaluation of the fine-tuned EchoNote meeting summarization model.**

## Evaluation Metrics

| Metric | Description |
|--------|-------------|
| **Semantic Similarity** | Cosine similarity between embeddings of the generated and ground-truth summaries |
| **Format Compliance** | Valid JSON structure with all required fields |
| **Content Coverage** | Key information extraction completeness |
| **Action Items** | Quality of extracted action items (task, assignee, deadline, priority) |
| **NER Precision** | Share of named entities in the output (people, orgs) that also appear in the transcript; dates are recorded for inspection |
| **Sentiment Accuracy** | Correct sentiment classification |
| **Length Compliance** | Executive summary within target length range |
| **Topic Relevance** | Alignment of extracted topics with transcript content |
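
As a reference point for the first metric, cosine similarity between two vectors can be sketched in plain Python. The vectors below are made-up toy values, not real sentence embeddings, and the helper name `cosine` is hypothetical; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math
from typing import Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, not embeddings)
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), orthogonal)
```

Vectors pointing the same way score close to 1 regardless of magnitude, while orthogonal vectors score 0, which is why the metric suits comparing summaries of different lengths.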

**Model:** `haris936hk/echonote`

## 1. Installation & Setup

In [None]:
%%capture
# Install dependencies (specifiers are quoted so the shell does not treat ">" as a redirect)
!pip install -q "transformers>=4.40.0" "accelerate>=0.27.0" "bitsandbytes>=0.42.0"
!pip install -q "torch>=2.0.0"
!pip install -q "sentence-transformers>=2.5.0"
!pip install -q "spacy>=3.7.0" "textblob>=0.18.0"
!pip install -q "scikit-learn>=1.4.0" "numpy>=1.24.0" "pandas>=2.0.0"
!pip install -q "matplotlib>=3.8.0" "seaborn>=0.13.0"
!pip install -q "tqdm>=4.66.0" "jsonschema>=4.21.0"
!python -m spacy download en_core_web_lg
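
After installation it can help to confirm which versions actually landed in the environment. The sketch below uses only the standard library; the helper name `installed_versions` and the short package list are illustrative assumptions, not part of the original notebook.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    report: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report

# A subset of the dependencies installed above
wanted = ["transformers", "torch", "sentence-transformers", "spacy", "jsonschema"]
report = installed_versions(wanted)
for name, version in report.items():
    print(f"{name:<24} {version or 'NOT INSTALLED'}")
```

Any `NOT INSTALLED` entry suggests the corresponding `pip install` line did not run as intended.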


In [None]:
import json
import re
import random
import warnings
from typing import Dict, List, Optional, Tuple, Any
from dataclasses import dataclass, field
from collections import Counter
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

import spacy
from textblob import TextBlob
from jsonschema import validate, ValidationError

import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')
print("✅ All imports successful!")
print(f"🔧 PyTorch version: {torch.__version__}")
print(f"🔧 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🔧 GPU: {torch.cuda.get_device_name(0)}")


## 2. Configuration

In [None]:
@dataclass
class EvalConfig:
    """Evaluation configuration"""
    # Model settings (merged model - not LoRA adapters)
    model_id: str = "haris936hk/echonote"
    
    # Dataset settings
    dataset_path: str = "echonote_dataset_combined.jsonl"
    test_size: int = 100  # Number of samples to evaluate
    random_seed: int = 42
    
    # Generation settings
    max_new_tokens: int = 512
    temperature: float = 0.1
    top_p: float = 0.9
    
    # Length constraints (from training)
    min_summary_chars: int = 150
    max_summary_chars: int = 600
    
    # Embedding model for semantic similarity
    embedding_model: str = "all-MiniLM-L6-v2"

config = EvalConfig()
print(f"📊 Evaluation Config:")
print(f"   Model: {config.model_id}")
print(f"   Test samples: {config.test_size}")
print(f"   Random seed: {config.random_seed}")


In [None]:
# Expected output schema
OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["executiveSummary", "keyDecisions", "actionItems", "nextSteps", "keyTopics", "sentiment"],
    "properties": {
        "executiveSummary": {"type": "string", "minLength": 150},
        "keyDecisions": {"type": "array", "items": {"type": "string"}},
        "actionItems": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["task", "assignee", "deadline", "priority"],
                "properties": {
                    "task": {"type": "string"},
                    "assignee": {"type": "string"},
                    "deadline": {"type": "string"},
                    "priority": {"type": "string", "enum": ["high", "medium", "low"]}
                }
            }
        },
        "nextSteps": {"type": "array", "items": {"type": "string"}},
        "keyTopics": {"type": "array", "items": {"type": "string"}},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]}
    }
}

# System prompt (same as training)
SYSTEM_PROMPT = """You are an expert meeting intelligence assistant.

Your task is to read a meeting transcript and produce a concise but comprehensive structured summary.

Follow these rules strictly:
1. Read the transcript carefully and identify key participants, topics, decisions, and action items.
2. The overall sentiment should be inferred from the tone and content of the meeting.
3. Output must be valid JSON matching the schema exactly.
4. Use this exact structure:
{
  "executiveSummary": string,
  "keyDecisions": string[],
  "actionItems": [
    {
      "task": string,
      "assignee": string,
      "deadline": string,
      "priority": "high" | "medium" | "low"
    }
  ],
  "nextSteps": string[],
  "keyTopics": string[],
  "sentiment": "positive" | "neutral" | "negative"
}

5. The "executiveSummary" must be a well-written narrative paragraph of AT LEAST 150 characters that accurately reflects the discussion and context of the meeting.
6. If there are no explicit decisions, action items, or next steps, return an EMPTY ARRAY for those fields.
7. Do NOT invent decisions, tasks, or deadlines that were not discussed or clearly implied.
8. "actionItems" must be concrete and assigned ONLY when responsibility is clear.
9. "keyTopics" should list the main discussion themes using short phrases.
10. "sentiment" must reflect the overall tone of the meeting, using transcript content and the provided NLP sentiment cues.

All required fields must be present.
Empty arrays are allowed when applicable.
Output must be valid JSON and nothing else.
"""

print("✅ Schema and system prompt defined.")
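
To make the target format concrete, here is a hypothetical output that follows the rules in the system prompt, together with a deliberately simplified shape check. The meeting details are invented for illustration, and `quick_check` is a stand-in written for this sketch; the notebook's actual validation uses `jsonschema.validate` against `OUTPUT_SCHEMA`.

```python
import json

REQUIRED = ["executiveSummary", "keyDecisions", "actionItems",
            "nextSteps", "keyTopics", "sentiment"]

# Hypothetical model output; names and tasks are made up.
sample_output = json.loads("""
{
  "executiveSummary": "The team reviewed the status of the mobile release, agreed to move the launch by one week to finish regression testing, and assigned follow-up work on the login bug and the release notes.",
  "keyDecisions": ["Move the launch date by one week"],
  "actionItems": [
    {"task": "Fix the login bug", "assignee": "Priya", "deadline": "Friday", "priority": "high"}
  ],
  "nextSteps": ["Re-run regression tests after the fix"],
  "keyTopics": ["mobile release", "regression testing"],
  "sentiment": "neutral"
}
""")

def quick_check(output: dict) -> list:
    """Return a list of problems; an empty list means the basic shape is fine."""
    problems = [f"missing field: {k}" for k in REQUIRED if k not in output]
    if len(output.get("executiveSummary", "")) < 150:
        problems.append("executiveSummary shorter than 150 characters")
    if output.get("sentiment") not in ("positive", "neutral", "negative"):
        problems.append("sentiment outside allowed values")
    return problems

print(quick_check(sample_output))
```

An output with no decisions or action items would still pass as long as those fields are present as empty arrays.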

## 3. Load Model & Resources

In [None]:
print("🔄 Loading fine-tuned EchoNote model (merged)...")

# Quantization config for efficient inference
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load merged model directly (not base + adapters)
model = AutoModelForCausalLM.from_pretrained(
    config.model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model.eval()

print(f"✅ Model loaded successfully!")
print(f"   Model: {config.model_id}")


In [None]:
# Load embedding model for semantic similarity
print("🔄 Loading embedding model...")
embedding_model = SentenceTransformer(config.embedding_model)
print(f"✅ Embedding model loaded: {config.embedding_model}")

# Load spaCy for NER
print("🔄 Loading spaCy model...")
nlp = spacy.load("en_core_web_lg")
print("✅ spaCy model loaded: en_core_web_lg")

## 4. Load & Prepare Dataset

In [None]:
def load_dataset(path: str) -> List[Dict]:
    """Load JSONL dataset"""
    data = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line))
    return data

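def parse_ground_truth(output_str: str) -> Optional[Dict]:
    """Parse ground truth JSON from output string"""
    if not isinstance(output_str, str):
        return None
    try:
        return json.loads(output_str)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span when extra text surrounds the JSON
        match = re.search(r'\{[\s\S]*\}', output_str)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass
    return None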

# Load full dataset
print(f"🔄 Loading dataset from {config.dataset_path}...")
full_dataset = load_dataset(config.dataset_path)
print(f"✅ Loaded {len(full_dataset)} samples")

# Sample test set
random.seed(config.random_seed)
test_indices = random.sample(range(len(full_dataset)), min(config.test_size, len(full_dataset)))
test_dataset = [full_dataset[i] for i in test_indices]

print(f"📊 Test set: {len(test_dataset)} samples")

# Validate ground truth
valid_samples = []
for sample in test_dataset:
    gt = parse_ground_truth(sample['output'])
    if gt:
        valid_samples.append({'input': sample['input'], 'ground_truth': gt})

print(f"✅ Valid samples with parseable ground truth: {len(valid_samples)}")
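
The cell above seeds the global `random` module before sampling. One possible alternative, sketched here under the assumption that other code in the session may also call `random`, is to draw the indices from a private generator so the test split depends only on the seed. The function name and the dataset size of 1000 are hypothetical.

```python
import random

def sample_indices(n_total: int, n_test: int, seed: int) -> list:
    """Draw test indices without replacement using a private RNG."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(range(n_total), min(n_test, n_total))

first = sample_indices(n_total=1000, n_test=5, seed=42)
second = sample_indices(n_total=1000, n_test=5, seed=42)
print(first == second)
```

Because each call builds its own generator from the seed, repeated calls return the same indices no matter what else has consumed random numbers in between.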

## 5. Generation Function

In [None]:
def generate_summary(input_text: str) -> Tuple[str, Optional[Dict]]:
    """
    Generate meeting summary using the fine-tuned model.
    Returns: (raw_output, parsed_json)
    """
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": input_text}
    ]
    
    # Apply chat template
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=config.max_new_tokens,
            temperature=config.temperature,
            top_p=config.top_p,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode only new tokens
    raw_output = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    ).strip()
    
    # Parse JSON
    parsed = None
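    try:
        # Try direct parse
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Try to extract JSON from response
        match = re.search(r'\{[\s\S]*\}', raw_output)
        if match:
            try:
                parsed = json.loads(match.group())
            except json.JSONDecodeError:
                pass
    # Downstream metrics expect a JSON object, so discard lists, strings, etc.
    if not isinstance(parsed, dict):
        parsed = None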
    
    return raw_output, parsed

# Test generation
print("🧪 Testing generation with first sample...")
test_raw, test_parsed = generate_summary(valid_samples[0]['input'])
print(f"✅ Generation successful!")
print(f"   Raw output length: {len(test_raw)} chars")
print(f"   Parsed JSON: {'Yes' if test_parsed else 'No'}")
if test_parsed:
    print(f"   Keys: {list(test_parsed.keys())}")
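
The regex fallback in `generate_summary` is greedy: it spans from the first `{` to the last `}`, so trailing text that happens to contain braces can make the extracted span unparseable. A hedged alternative, shown here as a sketch rather than a drop-in replacement, uses the standard library's `json.JSONDecoder.raw_decode` to decode the first complete object it finds. The helper name `first_json_object` and the noisy string are invented for illustration.

```python
import json
from typing import Optional

def first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores the rest
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None

noisy = 'Here is the summary: {"sentiment": "neutral", "keyTopics": []} Hope this helps {ok}'
print(first_json_object(noisy))
```

On the `noisy` string, the greedy regex would capture everything through `{ok}` and fail to parse, whereas this helper stops at the end of the first valid object.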

## 6. Evaluation Metrics Implementation

In [None]:
@dataclass
class EvaluationResult:
    """Container for all evaluation metrics"""
    # Core metrics
    semantic_similarity: float = 0.0
    format_compliance: float = 0.0
    content_coverage: float = 0.0
    action_items_score: float = 0.0
    ner_precision: float = 0.0
    sentiment_accuracy: float = 0.0
    length_compliance: float = 0.0
    topic_relevance: float = 0.0
    
    # Detailed breakdowns
    format_details: Dict = field(default_factory=dict)
    action_items_details: Dict = field(default_factory=dict)
    ner_details: Dict = field(default_factory=dict)
    
    def overall_score(self) -> float:
        """Weighted average of all metrics"""
        weights = {
            'semantic_similarity': 0.20,
            'format_compliance': 0.15,
            'content_coverage': 0.15,
            'action_items_score': 0.15,
            'ner_precision': 0.10,
            'sentiment_accuracy': 0.10,
            'length_compliance': 0.05,
            'topic_relevance': 0.10
        }
        total = sum(
            getattr(self, metric) * weight 
            for metric, weight in weights.items()
        )
        return total

print("✅ EvaluationResult class defined.")
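
As a worked example of `overall_score`, the weighted average can be reproduced with plain dictionaries. The weights are copied from the class above; the per-metric scores are hypothetical values chosen only to show the arithmetic.

```python
WEIGHTS = {
    "semantic_similarity": 0.20,
    "format_compliance": 0.15,
    "content_coverage": 0.15,
    "action_items_score": 0.15,
    "ner_precision": 0.10,
    "sentiment_accuracy": 0.10,
    "length_compliance": 0.05,
    "topic_relevance": 0.10,
}

def weighted_score(scores: dict) -> float:
    """Weighted sum of metric scores; missing metrics count as 0."""
    return sum(scores.get(name, 0.0) * weight for name, weight in WEIGHTS.items())

# Hypothetical per-metric scores for a single sample
example = {
    "semantic_similarity": 0.8,
    "format_compliance": 1.0,
    "content_coverage": 0.6,
    "action_items_score": 0.7,
    "ner_precision": 0.9,
    "sentiment_accuracy": 1.0,
    "length_compliance": 1.0,
    "topic_relevance": 0.5,
}

print(f"weights sum to {sum(WEIGHTS.values()):.2f}")
print(f"overall score: {weighted_score(example):.3f}")
```

Since the weights sum to 1, the overall score stays in the same 0 to 1 range as the individual metrics.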

In [None]:
class EchoNoteEvaluator:
    """Comprehensive evaluator for EchoNote model outputs"""
    
    def __init__(self, embedding_model, nlp_model, schema):
        self.embedding_model = embedding_model
        self.nlp = nlp_model
        self.schema = schema
    
    def evaluate_semantic_similarity(self, generated: Dict, ground_truth: Dict) -> float:
        """
        Metric 1: Semantic Similarity
        Compares embeddings of executive summaries and overall content.
        """
        if not generated or not ground_truth:
            return 0.0
        
        # Compare executive summaries
        gen_summary = generated.get('executiveSummary', '')
        gt_summary = ground_truth.get('executiveSummary', '')
        
        if not gen_summary or not gt_summary:
            return 0.0
        
        # Get embeddings
        gen_emb = self.embedding_model.encode([gen_summary])
        gt_emb = self.embedding_model.encode([gt_summary])
        
        # Cosine similarity
        similarity = cosine_similarity(gen_emb, gt_emb)[0][0]
        
        # Also compare key topics
        gen_topics = ' '.join(generated.get('keyTopics', []))
        gt_topics = ' '.join(ground_truth.get('keyTopics', []))
        
        if gen_topics and gt_topics:
            topics_emb_gen = self.embedding_model.encode([gen_topics])
            topics_emb_gt = self.embedding_model.encode([gt_topics])
            topics_sim = cosine_similarity(topics_emb_gen, topics_emb_gt)[0][0]
            similarity = 0.7 * similarity + 0.3 * topics_sim
        
        return float(max(0, similarity))
    
    def evaluate_format_compliance(self, generated: Dict, raw_output: str) -> Tuple[float, Dict]:
        """
        Metric 2: Format Compliance
        Checks if output is valid JSON with all required fields.
        """
        details = {
            'is_valid_json': False,
            'has_all_required_fields': False,
            'schema_valid': False,
            'field_scores': {}
        }
        
        if generated is None:
            return 0.0, details
        
        details['is_valid_json'] = True
        
        # Check required fields
        required = ['executiveSummary', 'keyDecisions', 'actionItems', 'nextSteps', 'keyTopics', 'sentiment']
        present_fields = [f for f in required if f in generated]
        details['has_all_required_fields'] = len(present_fields) == len(required)
        
        # Field-by-field scoring
        for field in required:
            if field in generated:
                value = generated[field]
                if field == 'executiveSummary':
                    details['field_scores'][field] = 1.0 if isinstance(value, str) and len(value) >= 50 else 0.5
                elif field == 'sentiment':
                    details['field_scores'][field] = 1.0 if value in ['positive', 'neutral', 'negative'] else 0.0
                elif field == 'actionItems':
                    if isinstance(value, list):
                        valid_items = sum(1 for item in value if isinstance(item, dict) and all(k in item for k in ['task', 'assignee', 'deadline', 'priority']))
                        details['field_scores'][field] = valid_items / max(len(value), 1) if value else 1.0
                    else:
                        details['field_scores'][field] = 0.0
                else:
                    details['field_scores'][field] = 1.0 if isinstance(value, list) else 0.5
            else:
                details['field_scores'][field] = 0.0
        
        # Schema validation
        try:
            validate(instance=generated, schema=self.schema)
            details['schema_valid'] = True
        except ValidationError:
            details['schema_valid'] = False
        
        # Calculate score
        score = 0.0
        score += 0.3 if details['is_valid_json'] else 0.0
        score += 0.3 if details['has_all_required_fields'] else 0.0
        score += 0.2 if details['schema_valid'] else 0.0
        score += 0.2 * (sum(details['field_scores'].values()) / len(required))
        
        return score, details
    
    def evaluate_content_coverage(self, generated: Dict, ground_truth: Dict, input_text: str) -> float:
        """
        Metric 3: Content Coverage
        Measures how well the generated output covers key information from input.
        """
        if not generated:
            return 0.0
        
        # Extract entities and key phrases from input
        doc = self.nlp(input_text[:5000])  # Limit for performance
        
        input_entities = set(ent.text.lower() for ent in doc.ents if ent.label_ in ['PERSON', 'ORG', 'MONEY', 'PERCENT', 'DATE'])
        # Require at least one digit so stray commas are not counted as numbers
        input_numbers = set(re.findall(r'\$?\d+(?:,\d{3})*(?:\.\d+)?[%MKB]?', input_text))
        
        # Flatten generated content
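        # ensure_ascii=False keeps accented names intact so they can match the transcript
        gen_text = json.dumps(generated, ensure_ascii=False).lower()
        gt_text = json.dumps(ground_truth, ensure_ascii=False).lower() if ground_truth else ''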
        
        # Coverage scores
        entity_coverage = sum(1 for e in input_entities if e in gen_text) / max(len(input_entities), 1)
        number_coverage = sum(1 for n in input_numbers if n.lower() in gen_text) / max(len(input_numbers), 1)
        
        # Compare list lengths
        gen_decisions = len(generated.get('keyDecisions', []))
        gt_decisions = len(ground_truth.get('keyDecisions', [])) if ground_truth else 0
        
        gen_actions = len(generated.get('actionItems', []))
        gt_actions = len(ground_truth.get('actionItems', [])) if ground_truth else 0
        
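        # When the ground truth has no items, an empty list is the correct answer;
        # invented items get only partial credit (consistent with evaluate_action_items)
        decisions_ratio = min(gen_decisions, gt_decisions) / gt_decisions if gt_decisions else (1.0 if gen_decisions == 0 else 0.5)
        actions_ratio = min(gen_actions, gt_actions) / gt_actions if gt_actions else (1.0 if gen_actions == 0 else 0.5)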
        
        # Weighted score
        score = 0.3 * entity_coverage + 0.2 * number_coverage + 0.25 * decisions_ratio + 0.25 * actions_ratio
        
        return min(1.0, score)
    
    def evaluate_action_items(self, generated: Dict, ground_truth: Dict) -> Tuple[float, Dict]:
        """
        Metric 4: Action Items Quality
        Evaluates completeness and quality of action items.
        """
        details = {
            'generated_count': 0,
            'ground_truth_count': 0,
            'complete_items': 0,
            'valid_priorities': 0,
            'has_deadlines': 0,
            'has_assignees': 0
        }
        
        if not generated:
            return 0.0, details
        
        gen_items = generated.get('actionItems', [])
        gt_items = ground_truth.get('actionItems', []) if ground_truth else []
        
        details['generated_count'] = len(gen_items)
        details['ground_truth_count'] = len(gt_items)
        
        if not gen_items:
            # If ground truth also has no items, it's correct
            return 1.0 if not gt_items else 0.3, details
        
        for item in gen_items:
            if not isinstance(item, dict):
                continue
            
            # Coerce to str so None or numeric values do not raise on .strip()
            has_task = bool(str(item.get('task') or '').strip())
            has_assignee = bool(str(item.get('assignee') or '').strip())
            has_deadline = bool(str(item.get('deadline') or '').strip())
            has_priority = item.get('priority', '') in ['high', 'medium', 'low']
            
            if has_task and has_assignee and has_deadline and has_priority:
                details['complete_items'] += 1
            if has_priority:
                details['valid_priorities'] += 1
            if has_deadline:
                details['has_deadlines'] += 1
            if has_assignee:
                details['has_assignees'] += 1
        
        # Calculate score
        n = len(gen_items)
        completeness = details['complete_items'] / n
        priority_score = details['valid_priorities'] / n
        deadline_score = details['has_deadlines'] / n
        assignee_score = details['has_assignees'] / n
        
        # Count alignment
        count_ratio = min(details['generated_count'], details['ground_truth_count']) / max(details['ground_truth_count'], 1)
        
        score = 0.3 * completeness + 0.2 * priority_score + 0.2 * deadline_score + 0.2 * assignee_score + 0.1 * min(count_ratio, 1.0)
        
        return score, details
    
    def evaluate_ner_precision(self, generated: Dict, input_text: str) -> Tuple[float, Dict]:
        """
        Metric 5: NER Precision
        Checks if named entities in output actually appear in input.
        """
        details = {
            'persons_mentioned': [],
            'persons_valid': 0,
            'orgs_mentioned': [],
            'orgs_valid': 0,
            'dates_mentioned': [],
            'dates_valid': 0
        }
        
        if not generated:
            return 0.0, details
        
        input_lower = input_text.lower()
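        # ensure_ascii=False avoids \uXXXX escapes that would hide non-ASCII names from spaCy
        gen_text = json.dumps(generated, ensure_ascii=False)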
        doc = self.nlp(gen_text)
        
        persons = []
        orgs = []
        dates = []
        
        for ent in doc.ents:
            if ent.label_ == 'PERSON':
                persons.append(ent.text)
            elif ent.label_ == 'ORG':
                orgs.append(ent.text)
            elif ent.label_ in ['DATE', 'TIME']:
                dates.append(ent.text)
        
        # Also extract from action items assignees
        for item in generated.get('actionItems', []):
            if isinstance(item, dict) and item.get('assignee'):
                persons.append(item['assignee'])
        
        persons = list(set(persons))
        orgs = list(set(orgs))
        dates = list(set(dates))
        
        details['persons_mentioned'] = persons
        details['orgs_mentioned'] = orgs
        details['dates_mentioned'] = dates
        
        # Check validity
        for person in persons:
            # Check if any part of name appears in input
            name_parts = person.lower().split()
            if any(part in input_lower for part in name_parts if len(part) > 2):
                details['persons_valid'] += 1
        
        for org in orgs:
            if org.lower() in input_lower or any(word.lower() in input_lower for word in org.split() if len(word) > 3):
                details['orgs_valid'] += 1
        
        # Dates are recorded for inspection only; they do not affect the score
        for date in dates:
            if date.lower() in input_lower:
                details['dates_valid'] += 1
        
        # Calculate precision
        person_precision = details['persons_valid'] / max(len(persons), 1)
        org_precision = details['orgs_valid'] / max(len(orgs), 1)
        
        # Weight persons more heavily (assignees are important)
        score = 0.7 * person_precision + 0.3 * org_precision
        
        return score, details
    
    def evaluate_sentiment_accuracy(self, generated: Dict, ground_truth: Dict) -> float:
        """
        Metric 6: Sentiment Accuracy
        Checks if sentiment matches ground truth.
        """
        if not generated or not ground_truth:
            return 0.0
        
        gen_sentiment = generated.get('sentiment', '').lower()
        gt_sentiment = ground_truth.get('sentiment', '').lower()
        
        valid_sentiments = ['positive', 'neutral', 'negative']
        
        if gen_sentiment not in valid_sentiments:
            return 0.0
        
        if gen_sentiment == gt_sentiment:
            return 1.0
        
        # Partial credit for adjacent sentiments
        sentiment_order = {'negative': 0, 'neutral': 1, 'positive': 2}
        if gen_sentiment in sentiment_order and gt_sentiment in sentiment_order:
            diff = abs(sentiment_order[gen_sentiment] - sentiment_order[gt_sentiment])
            if diff == 1:
                return 0.5  # Adjacent sentiment
        
        return 0.0
    
    def evaluate_length_compliance(self, generated: Dict) -> float:
        """
        Metric 7: Length Compliance
        Checks if executive summary is within target length.
        """
        if not generated:
            return 0.0
        
        summary = generated.get('executiveSummary', '')
        if not summary:
            return 0.0
        
        length = len(summary)
        
        # Target: 150-600 characters
        min_len = config.min_summary_chars
        max_len = config.max_summary_chars
        
        if min_len <= length <= max_len:
            return 1.0
        elif length < min_len:
            # Penalty for too short
            return max(0, length / min_len)
        else:
            # Smaller penalty for too long (still contains info)
            return max(0.5, 1.0 - (length - max_len) / max_len)
    
    def evaluate_topic_relevance(self, generated: Dict, input_text: str) -> float:
        """
        Metric 8: Topic Relevance
        Checks if extracted topics are relevant to the transcript.
        """
        if not generated:
            return 0.0
        
        topics = generated.get('keyTopics', [])
        if not topics:
            return 0.3  # Partial credit if no topics extracted
        
        input_lower = input_text.lower()
        
        # Extract important words from input
        doc = self.nlp(input_text[:5000])
        input_nouns = set(token.lemma_.lower() for token in doc if token.pos_ in ['NOUN', 'PROPN'] and len(token.text) > 3)
        
        # Check topic relevance
        relevant_topics = 0
        for topic in topics:
            topic_words = topic.lower().split()
            # Check if topic words appear in input
            matches = sum(1 for w in topic_words if w in input_lower or w in input_nouns)
            if matches >= len(topic_words) * 0.5:  # At least half the words match
                relevant_topics += 1
        
        return relevant_topics / len(topics)
    
    def evaluate_sample(self, input_text: str, raw_output: str, generated: Dict, ground_truth: Dict) -> EvaluationResult:
        """Run all evaluations on a single sample"""
        result = EvaluationResult()
        
        # 1. Semantic Similarity
        result.semantic_similarity = self.evaluate_semantic_similarity(generated, ground_truth)
        
        # 2. Format Compliance
        result.format_compliance, result.format_details = self.evaluate_format_compliance(generated, raw_output)
        
        # 3. Content Coverage
        result.content_coverage = self.evaluate_content_coverage(generated, ground_truth, input_text)
        
        # 4. Action Items
        result.action_items_score, result.action_items_details = self.evaluate_action_items(generated, ground_truth)
        
        # 5. NER Precision
        result.ner_precision, result.ner_details = self.evaluate_ner_precision(generated, input_text)
        
        # 6. Sentiment Accuracy
        result.sentiment_accuracy = self.evaluate_sentiment_accuracy(generated, ground_truth)
        
        # 7. Length Compliance
        result.length_compliance = self.evaluate_length_compliance(generated)
        
        # 8. Topic Relevance
        result.topic_relevance = self.evaluate_topic_relevance(generated, input_text)
        
        return result

# Initialize evaluator
evaluator = EchoNoteEvaluator(embedding_model, nlp, OUTPUT_SCHEMA)
print("✅ EchoNoteEvaluator initialized.")
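
The length-compliance rule in `evaluate_length_compliance` is a piecewise function, which is easier to see in isolation. The standalone sketch below mirrors that logic with the 150 and 600 character bounds from `EvalConfig`; the function name `length_score` and the sample lengths are chosen for illustration.

```python
def length_score(length: int, min_len: int = 150, max_len: int = 600) -> float:
    """Score a summary length: 1.0 inside the range, scaled penalties outside."""
    if length <= 0:
        return 0.0
    if min_len <= length <= max_len:
        return 1.0
    if length < min_len:
        # Too short: score grows linearly with length
        return length / min_len
    # Too long: linear penalty, but never below 0.5
    return max(0.5, 1.0 - (length - max_len) / max_len)

for n in (75, 150, 600, 750, 2000):
    print(n, round(length_score(n), 3))
```

Short summaries are penalised more steeply than long ones: a summary at half the minimum length scores 0.5, while an overlong summary can never drop below 0.5.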

## 7. Run Evaluation

In [None]:
def run_full_evaluation(samples: List[Dict], evaluator: EchoNoteEvaluator) -> Tuple[List[EvaluationResult], pd.DataFrame]:
    """
    Run evaluation on all samples and return results.
    """
    results = []
    detailed_records = []
    
    print(f"🚀 Starting evaluation on {len(samples)} samples...")
    print("="*60)
    
    for i, sample in enumerate(tqdm(samples, desc="Evaluating")):
        input_text = sample['input']
        ground_truth = sample['ground_truth']
        
        # Generate output
        try:
            raw_output, generated = generate_summary(input_text)
        except Exception as e:
            print(f"\n❌ Generation failed for sample {i}: {e}")
            generated = None
            raw_output = ""
        
        # Evaluate
        result = evaluator.evaluate_sample(input_text, raw_output, generated, ground_truth)
        results.append(result)
        
        # Record details
        detailed_records.append({
            'sample_id': i,
            'semantic_similarity': result.semantic_similarity,
            'format_compliance': result.format_compliance,
            'content_coverage': result.content_coverage,
            'action_items_score': result.action_items_score,
            'ner_precision': result.ner_precision,
            'sentiment_accuracy': result.sentiment_accuracy,
            'length_compliance': result.length_compliance,
            'topic_relevance': result.topic_relevance,
            'overall_score': result.overall_score(),
            'json_valid': result.format_details.get('is_valid_json', False),
            'gt_sentiment': ground_truth.get('sentiment', 'unknown'),
            'gen_sentiment': generated.get('sentiment', 'unknown') if generated else 'failed'
        })
        
        # Progress update every 10 samples
        if (i + 1) % 10 == 0:
            avg_score = np.mean([r.overall_score() for r in results])
            print(f"\n📊 Progress: {i+1}/{len(samples)} | Avg Score: {avg_score:.3f}")
    
    df = pd.DataFrame(detailed_records)
    return results, df

# Run evaluation
results, results_df = run_full_evaluation(valid_samples, evaluator)

## 8. Results Analysis & Visualization

In [None]:
def compute_aggregate_metrics(results: List[EvaluationResult]) -> Dict:
    """Compute aggregate statistics for all metrics"""
    metrics = [
        'semantic_similarity', 'format_compliance', 'content_coverage',
        'action_items_score', 'ner_precision', 'sentiment_accuracy',
        'length_compliance', 'topic_relevance'
    ]
    
    aggregates = {}
    for metric in metrics:
        values = [getattr(r, metric) for r in results]
        aggregates[metric] = {
            'mean': np.mean(values),
            'std': np.std(values),
            'min': np.min(values),
            'max': np.max(values),
            'median': np.median(values)
        }
    
    # Overall scores
    overall_scores = [r.overall_score() for r in results]
    aggregates['overall'] = {
        'mean': np.mean(overall_scores),
        'std': np.std(overall_scores),
        'min': np.min(overall_scores),
        'max': np.max(overall_scores),
        'median': np.median(overall_scores)
    }
    
    return aggregates

aggregates = compute_aggregate_metrics(results)

# Print summary
print("\n" + "="*70)
print("📊 ECHONOTE MODEL EVALUATION RESULTS")
print("="*70)
print(f"\nTotal samples evaluated: {len(results)}")
print(f"Model: {config.model_id}")
print("\n" + "-"*70)
print(f"{'Metric':<25} {'Mean':>10} {'Std':>10} {'Min':>10} {'Max':>10}")
print("-"*70)

for metric, stats in aggregates.items():
    print(f"{metric:<25} {stats['mean']:>10.3f} {stats['std']:>10.3f} {stats['min']:>10.3f} {stats['max']:>10.3f}")

print("="*70)
print(f"\n🎯 OVERALL SCORE: {aggregates['overall']['mean']:.3f} (±{aggregates['overall']['std']:.3f})")
print("="*70)
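
One detail worth keeping in mind when reading the `Std` column: `np.std` uses `ddof=0` by default, so it reports the population standard deviation. Because the evaluation runs on a sampled subset, the sample standard deviation (`ddof=1`) may be the more appropriate figure. The sketch below shows the difference using the standard library, with made-up scores.

```python
import statistics

# Hypothetical overall scores for five samples
scores = [0.62, 0.71, 0.80, 0.55, 0.90]

population_std = statistics.pstdev(scores)  # divides by n, like np.std(scores)
sample_std = statistics.stdev(scores)       # divides by n - 1, like np.std(scores, ddof=1)

print(f"population std: {population_std:.4f}")
print(f"sample std:     {sample_std:.4f}")
```

With 100 samples the gap between the two is small, but it becomes noticeable when `test_size` is reduced for quick runs.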

In [None]:
# Set style
plt.style.use('seaborn-v0_8-whitegrid')
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Metric Comparison Bar Chart
ax1 = axes[0, 0]
metrics = list(aggregates.keys())[:-1]  # Exclude 'overall'
means = [aggregates[m]['mean'] for m in metrics]
stds = [aggregates[m]['std'] for m in metrics]

colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(metrics)))
bars = ax1.bar(range(len(metrics)), means, yerr=stds, capsize=5, color=colors, edgecolor='black', alpha=0.8)
ax1.set_xticks(range(len(metrics)))
ax1.set_xticklabels([m.replace('_', '\n') for m in metrics], rotation=45, ha='right', fontsize=9)
ax1.set_ylabel('Score', fontsize=11)
ax1.set_title('Metric Comparison (Mean ± Std)', fontsize=13, fontweight='bold')
ax1.set_ylim(0, 1.1)
ax1.axhline(y=aggregates['overall']['mean'], color='red', linestyle='--', label=f"Overall: {aggregates['overall']['mean']:.3f}")
ax1.legend()

# 2. Score Distribution
ax2 = axes[0, 1]
overall_scores = [r.overall_score() for r in results]
ax2.hist(overall_scores, bins=20, edgecolor='black', alpha=0.7, color='steelblue')
ax2.axvline(x=np.mean(overall_scores), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(overall_scores):.3f}')
ax2.axvline(x=np.median(overall_scores), color='orange', linestyle='--', linewidth=2, label=f'Median: {np.median(overall_scores):.3f}')
ax2.set_xlabel('Overall Score', fontsize=11)
ax2.set_ylabel('Frequency', fontsize=11)
ax2.set_title('Overall Score Distribution', fontsize=13, fontweight='bold')
ax2.legend()

# 3. Radar Chart
ax3 = axes[1, 0]
ax3.remove()
ax3 = fig.add_subplot(2, 2, 3, projection='polar')

angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]  # Complete the loop
values = means + means[:1]

ax3.plot(angles, values, 'o-', linewidth=2, color='steelblue')
ax3.fill(angles, values, alpha=0.25, color='steelblue')
ax3.set_xticks(angles[:-1])
ax3.set_xticklabels([m.replace('_', '\n') for m in metrics], fontsize=8)
ax3.set_ylim(0, 1)
ax3.set_title('Metric Radar Chart', fontsize=13, fontweight='bold', pad=20)

# 4. Sentiment Accuracy Breakdown
ax4 = axes[1, 1]
sentiment_correct = sum(1 for r in results if r.sentiment_accuracy == 1.0)
sentiment_partial = sum(1 for r in results if 0 < r.sentiment_accuracy < 1.0)
sentiment_wrong = sum(1 for r in results if r.sentiment_accuracy == 0.0)

sentiment_data = [sentiment_correct, sentiment_partial, sentiment_wrong]
sentiment_labels = ['Correct', 'Partial', 'Wrong']
colors_pie = ['#2ecc71', '#f39c12', '#e74c3c']
ax4.pie(sentiment_data, labels=sentiment_labels, autopct='%1.1f%%', colors=colors_pie, startangle=90)
ax4.set_title('Sentiment Classification Accuracy', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.savefig('echonote_evaluation_results.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n📈 Visualization saved to 'echonote_evaluation_results.png'")
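
The radar chart relies on one small trick: the first angle and the first value are appended again so the plotted line returns to its starting point. The sketch below isolates that step with the standard library only; `radar_coordinates` is a hypothetical name and is not part of the notebook.

```python
import math
from typing import List, Sequence, Tuple


def radar_coordinates(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (angles, values) with the first point repeated to close the polygon."""
    n = len(values)
    if n == 0:
        return [], []
    # Spread the n metrics evenly around the circle, starting at angle 0
    angles = [2 * math.pi * i / n for i in range(n)]
    # Repeat the first point so the outline is a closed loop
    return angles + angles[:1], list(values) + list(values[:1])


angles, closed = radar_coordinates([0.2, 0.4, 0.6, 0.8])
print(len(angles), len(closed))
```

Without the repeated point, `ax.plot` would leave a gap between the last metric and the first, and `ax.fill` would shade a polygon whose outline does not match the line.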

In [None]:
# Correlation heatmap between metrics
plt.figure(figsize=(10, 8))

metric_cols = [
    'semantic_similarity', 'format_compliance', 'content_coverage',
    'action_items_score', 'ner_precision', 'sentiment_accuracy',
    'length_compliance', 'topic_relevance', 'overall_score'
]

corr_matrix = results_df[metric_cols].corr()

# Hide the redundant upper triangle; k=1 keeps the diagonal visible
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    center=0,
    square=True,
    linewidths=0.5,
    xticklabels=[m.replace('_', '\n') for m in metric_cols],
    yticklabels=[m.replace('_', '\n') for m in metric_cols]
)
plt.title('Metric Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('echonote_correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n📈 Correlation heatmap saved to 'echonote_correlation_heatmap.png'")
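
`DataFrame.corr()` computes Pearson correlation by default. One consequence worth knowing when reading the heatmap: a metric that is constant across all samples (for example, `format_compliance` equal to 1.0 everywhere) has zero variance, so its correlations are undefined and come back as NaN, which typically shows up as blank cells. The following standard-library sketch reproduces that behavior; `pearson` is a hypothetical helper written for illustration, not the pandas implementation.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN when it is undefined (illustrative helper)."""
    if len(x) != len(y) or len(x) < 2:
        return math.nan
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0 or syy == 0:
        # A constant series has no variance, so the ratio is undefined
        return math.nan
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / math.sqrt(sxx * syy)


print(pearson([0.1, 0.2, 0.3], [0.2, 0.4, 0.6]))
print(pearson([1.0, 1.0, 1.0], [0.1, 0.5, 0.9]))
```

If blank rows appear in the heatmap, checking `results_df[metric_cols].std()` for zeros is a quick way to confirm this cause.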

In [None]:
# Detailed breakdown analysis
print("\n" + "="*70)
print("📋 DETAILED METRIC BREAKDOWN")
print("="*70)

# Format Compliance Details
json_valid_count = sum(1 for r in results if r.format_details.get('is_valid_json', False))
schema_valid_count = sum(1 for r in results if r.format_details.get('schema_valid', False))
all_fields_count = sum(1 for r in results if r.format_details.get('has_all_required_fields', False))

print(f"\n📄 FORMAT COMPLIANCE:")
print(f"   Valid JSON outputs: {json_valid_count}/{len(results)} ({100*json_valid_count/len(results):.1f}%)")
print(f"   Schema compliant: {schema_valid_count}/{len(results)} ({100*schema_valid_count/len(results):.1f}%)")
print(f"   All required fields: {all_fields_count}/{len(results)} ({100*all_fields_count/len(results):.1f}%)")

# Action Items Details
total_gen_actions = sum(r.action_items_details.get('generated_count', 0) for r in results)
total_gt_actions = sum(r.action_items_details.get('ground_truth_count', 0) for r in results)
complete_actions = sum(r.action_items_details.get('complete_items', 0) for r in results)

print(f"\n📌 ACTION ITEMS:")
print(f"   Total generated: {total_gen_actions}")
print(f"   Total in ground truth: {total_gt_actions}")
print(f"   Complete items (all fields): {complete_actions}")
print(f"   Avg per sample: {total_gen_actions/len(results):.1f}")

# NER Details
total_persons = sum(len(r.ner_details.get('persons_mentioned', [])) for r in results)
valid_persons = sum(r.ner_details.get('persons_valid', 0) for r in results)

print(f"\n👤 NAMED ENTITY RECOGNITION:")
print(f"   Persons mentioned: {total_persons}")
print(f"   Persons validated: {valid_persons}")
print(f"   Person precision: {100*valid_persons/max(total_persons, 1):.1f}%")

# Sentiment Breakdown
print(f"\n😊 SENTIMENT ANALYSIS:")
print(f"   Correct: {sentiment_correct}/{len(results)} ({100*sentiment_correct/len(results):.1f}%)")
print(f"   Partial match: {sentiment_partial}/{len(results)} ({100*sentiment_partial/len(results):.1f}%)")
print(f"   Incorrect: {sentiment_wrong}/{len(results)} ({100*sentiment_wrong/len(results):.1f}%)")

# Length Compliance
length_compliant = sum(1 for r in results if r.length_compliance == 1.0)
print(f"\n📏 LENGTH COMPLIANCE:")
print(f"   Within target range: {length_compliant}/{len(results)} ({100*length_compliant/len(results):.1f}%)")
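
Most of the percentages above divide by `len(results)`, which raises `ZeroDivisionError` if the evaluation produced no results; only the person-precision line guards against that with `max(total_persons, 1)`. A small formatting helper is one way to make the guard uniform. This is a sketch: `pct` is a hypothetical name, and the `n/a` placeholder is an arbitrary choice.

```python
def pct(count: int, total: int) -> str:
    """Format 'count/total (xx.x%)', falling back to 'n/a' for an empty total."""
    if total <= 0:
        return f"{count}/{total} (n/a)"
    return f"{count}/{total} ({100 * count / total:.1f}%)"


# Made-up counts, used only to show the output format
print(f"   Valid JSON outputs: {pct(3, 4)}")
print(f"   Valid JSON outputs: {pct(0, 0)}")
```

Each report line could then be written as `print(f"   Valid JSON outputs: {pct(json_valid_count, len(results))}")`.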

## 9. Sample Analysis

In [None]:
# Show best and worst performing samples
overall_scores = [r.overall_score() for r in results]
sorted_indices = np.argsort(overall_scores)

print("\n" + "="*70)
print("🏆 TOP 5 SAMPLES (HIGHEST OVERALL SCORE)")
print("="*70)

for idx in sorted_indices[-5:][::-1]:
    r = results[idx]
    print(f"\n📊 Sample {idx}: Overall Score = {r.overall_score():.3f}")
    print(f"   Semantic: {r.semantic_similarity:.3f} | Format: {r.format_compliance:.3f} | Content: {r.content_coverage:.3f}")
    print(f"   Actions: {r.action_items_score:.3f} | NER: {r.ner_precision:.3f} | Sentiment: {r.sentiment_accuracy:.3f}")

print("\n" + "="*70)
print("⚠️ BOTTOM 5 SAMPLES (LOWEST OVERALL SCORE)")
print("="*70)

for idx in sorted_indices[:5]:
    r = results[idx]
    print(f"\n📊 Sample {idx}: Overall Score = {r.overall_score():.3f}")
    print(f"   Semantic: {r.semantic_similarity:.3f} | Format: {r.format_compliance:.3f} | Content: {r.content_coverage:.3f}")
    print(f"   Actions: {r.action_items_score:.3f} | NER: {r.ner_precision:.3f} | Sentiment: {r.sentiment_accuracy:.3f}")
    
    # Identify main issues
    issues = []
    if r.format_compliance < 0.5:
        issues.append("JSON parsing issues")
    if r.semantic_similarity < 0.5:
        issues.append("Low semantic match")
    if r.action_items_score < 0.5:
        issues.append("Poor action items")
    if r.sentiment_accuracy < 0.5:
        issues.append("Wrong sentiment")
    
    if issues:
        print(f"   Issues: {', '.join(issues)}")
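
The issue checks above apply the same 0.5 cutoff to four metrics. If more metrics need flagging later, the rule can be expressed as data. The sketch below takes a plain dictionary of scores rather than the notebook's result objects; `ISSUE_LABELS` and `diagnose` are hypothetical names, and treating a missing metric as 0.0 is an assumption.

```python
from typing import Dict, List

# Metric name -> label reported when the score falls below the threshold
ISSUE_LABELS: Dict[str, str] = {
    "format_compliance": "JSON parsing issues",
    "semantic_similarity": "Low semantic match",
    "action_items_score": "Poor action items",
    "sentiment_accuracy": "Wrong sentiment",
}


def diagnose(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the labels of all metrics scoring below the threshold."""
    return [
        label
        for metric, label in ISSUE_LABELS.items()
        if scores.get(metric, 0.0) < threshold  # missing metric counts as 0.0 (assumption)
    ]


# Made-up scores for a single sample
sample_scores = {
    "format_compliance": 1.0,
    "semantic_similarity": 0.4,
    "action_items_score": 0.9,
    "sentiment_accuracy": 0.0,
}
print(", ".join(diagnose(sample_scores)))
```

With the notebook's result objects, the same function could be fed `{m: getattr(r, m) for m in ISSUE_LABELS}`, assuming those attributes exist as used in the cell above.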

## 10. Export Results

In [None]:
# Export detailed results to CSV
results_df.to_csv('echonote_evaluation_detailed.csv', index=False)
print("\n💾 Detailed results saved to 'echonote_evaluation_detailed.csv'")

# Export summary report
summary_report = {
    'model_id': config.model_id,
    'samples_evaluated': len(results),
    'evaluation_date': pd.Timestamp.now().isoformat(),
    'metrics': aggregates,
    'format_compliance_rate': json_valid_count / len(results),
    'schema_compliance_rate': schema_valid_count / len(results),
    'sentiment_accuracy_rate': sentiment_correct / len(results)
}

with open('echonote_evaluation_summary.json', 'w') as f:
    json.dump(summary_report, f, indent=2, default=float)
print("💾 Summary report saved to 'echonote_evaluation_summary.json'")

# Display final summary
print("\n" + "="*70)
print("✅ EVALUATION COMPLETE")
print("="*70)
print(f"\n🎯 Final Overall Score: {aggregates['overall']['mean']:.3f} (±{aggregates['overall']['std']:.3f})")
print(f"\n📊 Key Metrics:")
print(f"   • Semantic Similarity: {aggregates['semantic_similarity']['mean']:.3f}")
print(f"   • Format Compliance: {aggregates['format_compliance']['mean']:.3f}")
print(f"   • Content Coverage: {aggregates['content_coverage']['mean']:.3f}")
print(f"   • Action Items Score: {aggregates['action_items_score']['mean']:.3f}")
print(f"   • NER Precision: {aggregates['ner_precision']['mean']:.3f}")
print(f"   • Sentiment Accuracy: {aggregates['sentiment_accuracy']['mean']:.3f}")
print(f"   • Length Compliance: {aggregates['length_compliance']['mean']:.3f}")
print(f"   • Topic Relevance: {aggregates['topic_relevance']['mean']:.3f}")
print("\n" + "="*70)
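
The summary export passes `default=float` to `json.dump`. That callback is invoked for any value the encoder cannot serialize natively, which is presumably there for NumPy scalars inside `aggregates`. The sketch below shows the round trip using only the standard library; `Decimal` stands in for such a scalar, and the report contents and file name are made up.

```python
import json
import os
import tempfile
from decimal import Decimal

# Decimal is not JSON-serializable by default, like some NumPy scalar types
report = {
    "samples_evaluated": 2,
    "metrics": {"overall": {"mean": Decimal("0.75"), "std": Decimal("0.25")}},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "summary.json")
    with open(path, "w", encoding="utf-8") as f:
        # default=float converts each unsupported value with float(value)
        json.dump(report, f, indent=2, default=float)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded["metrics"]["overall"]["mean"])
```

Values that `float()` cannot convert, such as a list-like array or a set, still raise `TypeError`, so anything non-scalar in the report needs converting before the dump.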


## 11. Interactive Sample Inspection (Optional)

In [None]:
def inspect_sample(sample_idx: int) -> None:
    """Print scores, input, ground truth, and a freshly generated summary for one sample."""
    if not 0 <= sample_idx < len(valid_samples):
        print(f"❌ Sample index {sample_idx} out of range. Valid range: 0 to {len(valid_samples)-1}")
        return
    
    sample = valid_samples[sample_idx]
    result = results[sample_idx]
    
    print(f"\n{'='*70}")
    print(f"📋 SAMPLE {sample_idx} INSPECTION")
    print(f"{'='*70}")
    
    print(f"\n📊 SCORES:")
    print(f"   Overall: {result.overall_score():.3f}")
    print(f"   Semantic Similarity: {result.semantic_similarity:.3f}")
    print(f"   Format Compliance: {result.format_compliance:.3f}")
    print(f"   Content Coverage: {result.content_coverage:.3f}")
    print(f"   Action Items: {result.action_items_score:.3f}")
    print(f"   NER Precision: {result.ner_precision:.3f}")
    print(f"   Sentiment: {result.sentiment_accuracy:.3f}")
    print(f"   Length: {result.length_compliance:.3f}")
    print(f"   Topic Relevance: {result.topic_relevance:.3f}")
    
    print(f"\n📝 INPUT (first 500 chars):")
    print(f"   {sample['input'][:500]}...")
    
    print(f"\n🎯 GROUND TRUTH SUMMARY:")
    print(f"   {sample['ground_truth'].get('executiveSummary', 'N/A')[:300]}...")
    
    # Re-run generation for display. If decoding is not deterministic,
    # this output can differ from the one that was scored above.
    raw_output, generated = generate_summary(sample['input'])
    
    print(f"\n🤖 GENERATED SUMMARY:")
    if generated:
        print(f"   {generated.get('executiveSummary', 'N/A')[:300]}...")
        print(f"\n   Sentiment: {generated.get('sentiment', 'N/A')} (GT: {sample['ground_truth'].get('sentiment', 'N/A')})")
        print(f"   Action Items: {len(generated.get('actionItems', []))} (GT: {len(sample['ground_truth'].get('actionItems', []))})")
        print(f"   Key Topics: {generated.get('keyTopics', [])}")
    else:
        print(f"   ❌ Failed to parse JSON")
        print(f"   Raw output: {raw_output[:500]}...")

# Example: Inspect sample 0
inspect_sample(0)
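
`inspect_sample` appends `...` after every slice, even when the text is shorter than the limit and nothing was cut. A small helper can add the ellipsis only when truncation actually happens. This is a sketch; `preview` is a hypothetical name and the default limit of 300 simply mirrors the slices used above.

```python
from typing import Optional


def preview(text: Optional[str], limit: int = 300) -> str:
    """Return text cut to `limit` characters, with '...' only if it was shortened."""
    text = text or ""
    if len(text) <= limit:
        return text
    # Trim trailing whitespace so the ellipsis attaches to the last word
    return text[:limit].rstrip() + "..."


print(preview("Short summary."))
print(preview("hello world", limit=6))
```

Inside `inspect_sample`, a line such as `print(f"   {preview(sample['input'], 500)}")` would then replace the manual slice and the trailing `...`.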

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════════╗
║                    ECHONOTE EVALUATION COMPLETE                      ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  📊 Files Generated:                                                 ║
║     • echonote_evaluation_detailed.csv  (per-sample results)         ║
║     • echonote_evaluation_summary.json  (aggregate metrics)          ║
║     • echonote_evaluation_results.png   (visualizations)             ║
║     • echonote_correlation_heatmap.png  (metric correlations)        ║
║                                                                      ║
║  🎯 Use `inspect_sample(idx)` to examine specific samples            ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝
""")