# Legal AI Model Evaluation Notebook

This notebook implements comprehensive evaluation methods for the fine-tuned Gemma 3 legal AI model.

## Evaluation Components:
1. **Automatic Metrics**: BLEU, ROUGE, BERTScore
2. **Legal Accuracy**: Domain-specific evaluation
3. **Human Evaluation**: Expert assessment framework
4. **Comparative Analysis**: Against baseline models
5. **Error Analysis**: Detailed error categorization

In [None]:
# Install evaluation packages
!pip install evaluate rouge-score nltk bert-score transformers datasets pandas matplotlib seaborn -q

In [None]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Any, Tuple
import re
from datetime import datetime

# Evaluation libraries
import evaluate
from rouge_score import rouge_scorer
from bert_score import score as bert_score
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

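# Download required NLTK tokenizer data
# (newer NLTK releases need 'punkt_tab'; older ones need 'punkt')
for resource in ('punkt', 'punkt_tab'):
    nltk.download(resource, quiet=True)
print("✅ Evaluation libraries loaded")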

## 1. Load Test Dataset and Model Responses

In [None]:
class LegalEvaluationFramework:
    def __init__(self):
        self.bleu_scorer = SmoothingFunction()
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        self.results = {}
        
    def load_test_data(self, test_file: str) -> List[Dict]:
        """Load test questions and reference answers"""
        with open(test_file, 'r', encoding='utf-8') as f:
            data = json.load(f)
        print(f"✅ Loaded {len(data)} test cases")
        return data
    
    def load_model_responses(self, responses_file: str) -> List[str]:
        """Load model-generated responses"""
        with open(responses_file, 'r', encoding='utf-8') as f:
            responses = json.load(f)
        print(f"✅ Loaded {len(responses)} model responses")
        return responses

# Initialize evaluator
evaluator = LegalEvaluationFramework()
print("🔧 Evaluation framework initialized")
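
Based on its `List[str]` return type, `load_model_responses` reads a JSON file holding a plain list of strings, one response per test case in the same order as the test cases. The sketch below writes and reads such a file; the file name `model_responses.json` and the two responses are hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile

# Hypothetical responses; position i is scored against test case i
responses = ["Response to test case 1", "Response to test case 2"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_responses.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(responses, f, indent=2, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

# The loader's return type assumes a list of strings
is_valid = isinstance(loaded, list) and all(isinstance(r, str) for r in loaded)
print(is_valid, loaded == responses)
```

Because responses are matched to test cases by position, the file should be regenerated whenever the test cases are reordered.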

## 2. Create Test Dataset

In [None]:
# Create comprehensive test cases for legal evaluation
test_cases = [
    {
        "id": 1,
        "category": "constitutional_law",
        "question": "Explain the legal framework of freedom of speech in Sri Lankan law",
        "reference_answer": "The legal background of freedom of speech in Sri Lanka traces back to the country's constitutional history. The current framework is provided by the Second Republican Constitution of 1978, which dedicates Chapter VI to Fundamental Rights. Article 14(1)(a) guarantees every citizen 'the freedom of speech and expression including publication.'",
        "key_elements": ["Article 14(1)(a)", "1978 Constitution", "Fundamental Rights", "restrictions"],
        "difficulty": "medium"
    },
    {
        "id": 2,
        "category": "penal_code",
        "question": "Define and explain homicide and murder under Sri Lankan Penal Code",
        "reference_answer": "Culpable homicide is murder if the act by which the death is caused is done with the intention of causing death, or with the intention of causing such bodily injury as the offender knows to be likely to cause death.",
        "key_elements": ["culpable homicide", "murder", "intention", "bodily injury"],
        "difficulty": "hard"
    },
    {
        "id": 3,
        "category": "constitutional_law",
        "question": "Compare the 1972 and 1978 constitutions of Sri Lanka",
        "reference_answer": "The 1972 Constitution introduced parliamentary supremacy, while the 1978 Constitution introduced the Executive Presidency system and restored judicial review powers.",
        "key_elements": ["1972 Constitution", "1978 Constitution", "parliamentary supremacy", "Executive Presidency"],
        "difficulty": "hard"
    },
    {
        "id": 4,
        "category": "penal_code",
        "question": "Define public servant according to Sri Lankan Penal Code",
        "reference_answer": "Section 19 of the Penal Code provides a detailed definition of 'public servant,' encompassing twelve categories including every person holding office by commission from the President, judges, and commissioned officers.",
        "key_elements": ["Section 19", "twelve categories", "President", "judges"],
        "difficulty": "medium"
    },
    {
        "id": 5,
        "category": "penal_code",
        "question": "Explain the legal definition and punishment for theft",
        "reference_answer": "Whoever, intending to take dishonestly any movable property out of the possession of any person without that person's consent, moves that property in order to such taking, is said to commit 'theft'.",
        "key_elements": ["dishonestly", "movable property", "without consent", "moves property"],
        "difficulty": "easy"
    }
]

# Save test cases
with open('legal_test_cases.json', 'w', encoding='utf-8') as f:
    json.dump(test_cases, f, indent=2, ensure_ascii=False)

print(f"✅ Created {len(test_cases)} test cases")
print(f"Categories: {sorted({case['category'] for case in test_cases})}")
print(f"Difficulty levels: {sorted({case['difficulty'] for case in test_cases})}")

## 3. Automatic Evaluation Metrics

In [None]:
def evaluate_bleu_score(reference: str, hypothesis: str) -> float:
    """Calculate BLEU score"""
    reference_tokens = nltk.word_tokenize(reference.lower())
    hypothesis_tokens = nltk.word_tokenize(hypothesis.lower())
    
    # Use smoothing for short sentences
    smoothing = SmoothingFunction()
    score = sentence_bleu([reference_tokens], hypothesis_tokens, 
                         smoothing_function=smoothing.method1)
    return score

def evaluate_rouge_scores(reference: str, hypothesis: str) -> Dict[str, float]:
    """Calculate ROUGE scores"""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)
    
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

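def evaluate_bert_score(references: List[str], hypotheses: List[str]) -> Tuple[float, float, float]:
    """Calculate corpus-level BERTScore (mean precision, recall, F1).

    run_comprehensive_evaluation does not call this; run it separately on the full lists.
    """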
    P, R, F1 = bert_score(hypotheses, references, lang='en', verbose=False)
    return P.mean().item(), R.mean().item(), F1.mean().item()

def evaluate_legal_accuracy(reference: str, hypothesis: str, key_elements: List[str]) -> Dict[str, float]:
    """Evaluate legal-specific accuracy with simple surface heuristics.

    `reference` is currently unused; it is kept so the signature matches the other metrics.
    """
    hypothesis_lower = hypothesis.lower()
    
    # Fraction of key legal elements that appear verbatim (case-insensitive) in the response
    elements_found = sum(1 for element in key_elements if element.lower() in hypothesis_lower)
    element_coverage = elements_found / len(key_elements) if key_elements else 0.0
    
    # Count citation-like strings (simplified), e.g. "section 19", "article 14", "1978 constitution", "(1978)"
    citation_pattern = r'(?:section|article)\s+\d+|\d{4}\s+constitution|\(\d{4}\)'
    citations_found = len(re.findall(citation_pattern, hypothesis_lower))
    
    return {
        'element_coverage': element_coverage,
        'citations_found': citations_found,
        'response_length': len(hypothesis.split())
    }

print("✅ Evaluation functions defined")
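
To make the two heuristics in `evaluate_legal_accuracy` concrete, here is a minimal standalone sketch applied to the first demo response and the key elements of test case 1. The helper name `element_coverage` is hypothetical. The citation pattern is the same simplified one, written with a non-capturing group so that `re.findall` returns the full matched strings.

```python
import re
from typing import List

def element_coverage(response: str, key_elements: List[str]) -> float:
    """Fraction of key elements found verbatim (case-insensitive) in the response."""
    if not key_elements:
        return 0.0
    text = response.lower()
    found = sum(1 for element in key_elements if element.lower() in text)
    return found / len(key_elements)

# Simplified citation pattern: "section N", "article N", "YYYY constitution", "(YYYY)"
CITATION_PATTERN = r'(?:section|article)\s+\d+|\d{4}\s+constitution|\(\d{4}\)'

response = (
    "Freedom of speech in Sri Lanka is guaranteed under Article 14(1)(a) "
    "of the 1978 Constitution, which provides fundamental rights to citizens."
)
key_elements = ["Article 14(1)(a)", "1978 Constitution", "Fundamental Rights", "restrictions"]

coverage = element_coverage(response, key_elements)
citations = re.findall(CITATION_PATTERN, response.lower())

print(coverage)   # 0.75 (three of four elements; "restrictions" is missing)
print(citations)  # ['article 14', '1978 constitution']
```

Both measures only check surface strings: a response can score full coverage while stating the law incorrectly, which is why the human evaluation step remains necessary.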

## 4. Run Comprehensive Evaluation

In [None]:
def run_comprehensive_evaluation(test_cases: List[Dict], model_responses: List[str]) -> Dict:
    """Run all evaluation metrics"""
    results = {
        'individual_scores': [],
        'aggregate_scores': {},
        'category_breakdown': {},
        'difficulty_breakdown': {}
    }
    
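    # Fail fast on mismatched inputs: zip() would silently drop the extra items
    if len(test_cases) != len(model_responses):
        raise ValueError(
            f"Expected {len(test_cases)} responses, got {len(model_responses)}"
        )
    
    # Individual evaluation
    for test_case, response in zip(test_cases, model_responses):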
        reference = test_case['reference_answer']
        
        # Calculate metrics
        bleu = evaluate_bleu_score(reference, response)
        rouge = evaluate_rouge_scores(reference, response)
        legal_acc = evaluate_legal_accuracy(reference, response, test_case['key_elements'])
        
        individual_result = {
            'test_id': test_case['id'],
            'category': test_case['category'],
            'difficulty': test_case['difficulty'],
            'bleu': bleu,
            'rouge1': rouge['rouge1'],
            'rouge2': rouge['rouge2'],
            'rougeL': rouge['rougeL'],
            'element_coverage': legal_acc['element_coverage'],
            'citations_found': legal_acc['citations_found'],
            'response_length': legal_acc['response_length']
        }
        
        results['individual_scores'].append(individual_result)
    
    # Calculate aggregate scores
    df = pd.DataFrame(results['individual_scores'])
    
    results['aggregate_scores'] = {
        'avg_bleu': df['bleu'].mean(),
        'avg_rouge1': df['rouge1'].mean(),
        'avg_rouge2': df['rouge2'].mean(),
        'avg_rougeL': df['rougeL'].mean(),
        'avg_element_coverage': df['element_coverage'].mean(),
        'avg_citations': df['citations_found'].mean(),
        'avg_response_length': df['response_length'].mean()
    }
    
    # Category breakdown
    for category in df['category'].unique():
        cat_df = df[df['category'] == category]
        results['category_breakdown'][category] = {
            'count': len(cat_df),
            'avg_bleu': cat_df['bleu'].mean(),
            'avg_rouge1': cat_df['rouge1'].mean(),
            'avg_element_coverage': cat_df['element_coverage'].mean()
        }
    
    # Difficulty breakdown
    for difficulty in df['difficulty'].unique():
        diff_df = df[df['difficulty'] == difficulty]
        results['difficulty_breakdown'][difficulty] = {
            'count': len(diff_df),
            'avg_bleu': diff_df['bleu'].mean(),
            'avg_rouge1': diff_df['rouge1'].mean(),
            'avg_element_coverage': diff_df['element_coverage'].mean()
        }
    
    return results

print("✅ Comprehensive evaluation function ready")
print("\n⚠️  Note: You need to provide model_responses to run the actual evaluation")
print("Example: model_responses = ['response1', 'response2', ...]")

## 5. Human Evaluation Framework

In [None]:
def create_human_evaluation_template() -> Dict:
    """Create template for human evaluation"""
    template = {
        "evaluator_info": {
            "name": "",
            "qualification": "",
            "experience_years": 0,
            "specialization": ""
        },
        "evaluation_criteria": {
            "legal_accuracy": {
                "description": "Correctness of legal facts and interpretations (1-5)",
                "score": 0,
                "comments": ""
            },
            "completeness": {
                "description": "Coverage of all relevant legal aspects (1-5)",
                "score": 0,
                "comments": ""
            },
            "clarity": {
                "description": "Understandability for legal practitioners (1-5)",
                "score": 0,
                "comments": ""
            },
            "relevance": {
                "description": "Appropriateness to the question asked (1-5)",
                "score": 0,
                "comments": ""
            },
            "citation_quality": {
                "description": "Proper referencing of legal authorities (1-5)",
                "score": 0,
                "comments": ""
            }
        },
        "overall_assessment": {
            "overall_score": 0,
            "strengths": [],
            "weaknesses": [],
            "recommendations": []
        }
    }
    return template

def generate_evaluation_forms(test_cases: List[Dict], output_dir: str = "evaluation_forms"):
    """Generate evaluation forms for human assessors"""
    import os
    os.makedirs(output_dir, exist_ok=True)
    
    for test_case in test_cases:
        form = create_human_evaluation_template()
        form["test_case"] = test_case
        form["model_response"] = "[MODEL RESPONSE TO BE INSERTED]"
        
        filename = f"{output_dir}/evaluation_form_{test_case['id']}.json"
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(form, f, indent=2, ensure_ascii=False)
    
    print(f"✅ Generated {len(test_cases)} evaluation forms in '{output_dir}' directory")

# Generate evaluation forms
generate_evaluation_forms(test_cases)
print("📋 Human evaluation forms ready for legal experts")

## 6. Visualization and Reporting

In [None]:
def create_evaluation_visualizations(results: Dict):
    """Create visualizations for evaluation results"""
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Overall metrics comparison
    metrics = ['avg_bleu', 'avg_rouge1', 'avg_rouge2', 'avg_rougeL', 'avg_element_coverage']
    values = [results['aggregate_scores'][metric] for metric in metrics]
    
    axes[0, 0].bar(range(len(metrics)), values, color=['blue', 'green', 'orange', 'red', 'purple'])
    axes[0, 0].set_xticks(range(len(metrics)))
    axes[0, 0].set_xticklabels([m.replace('avg_', '').upper() for m in metrics], rotation=45)
    axes[0, 0].set_title('Overall Evaluation Metrics')
    axes[0, 0].set_ylabel('Score')
    
    # 2. Category breakdown
    categories = list(results['category_breakdown'].keys())
    cat_bleu = [results['category_breakdown'][cat]['avg_bleu'] for cat in categories]
    cat_coverage = [results['category_breakdown'][cat]['avg_element_coverage'] for cat in categories]
    
    x = np.arange(len(categories))
    width = 0.35
    
    axes[0, 1].bar(x - width/2, cat_bleu, width, label='BLEU Score', alpha=0.8)
    axes[0, 1].bar(x + width/2, cat_coverage, width, label='Element Coverage', alpha=0.8)
    axes[0, 1].set_xlabel('Category')
    axes[0, 1].set_ylabel('Score')
    axes[0, 1].set_title('Performance by Category')
    axes[0, 1].set_xticks(x)
    axes[0, 1].set_xticklabels(categories)
    axes[0, 1].legend()
    
    # 3. Difficulty analysis
    difficulties = list(results['difficulty_breakdown'].keys())
    diff_bleu = [results['difficulty_breakdown'][diff]['avg_bleu'] for diff in difficulties]
    diff_coverage = [results['difficulty_breakdown'][diff]['avg_element_coverage'] for diff in difficulties]
    
    x = np.arange(len(difficulties))
    
    axes[1, 0].bar(x - width/2, diff_bleu, width, label='BLEU Score', alpha=0.8)
    axes[1, 0].bar(x + width/2, diff_coverage, width, label='Element Coverage', alpha=0.8)
    axes[1, 0].set_xlabel('Difficulty')
    axes[1, 0].set_ylabel('Score')
    axes[1, 0].set_title('Performance by Difficulty')
    axes[1, 0].set_xticks(x)
    axes[1, 0].set_xticklabels(difficulties)
    axes[1, 0].legend()
    
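    # 4. Individual test performance
    df = pd.DataFrame(results['individual_scores'])
    # Map each category to a stable integer code; hash() of a string changes between runs
    category_codes, _ = pd.factorize(df['category'])
    axes[1, 1].scatter(df['bleu'], df['element_coverage'], 
                      c=category_codes, cmap='tab10', alpha=0.7)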
    axes[1, 1].set_xlabel('BLEU Score')
    axes[1, 1].set_ylabel('Element Coverage')
    axes[1, 1].set_title('BLEU vs Element Coverage')
    
    plt.tight_layout()
    plt.savefig('evaluation_results.png', dpi=300, bbox_inches='tight')
    plt.show()

def generate_evaluation_report(results: Dict, model_name: str = "Fine-tuned Gemma 3") -> str:
    """Generate comprehensive evaluation report"""
    report = f"""
# Legal AI Model Evaluation Report

**Model**: {model_name}
**Evaluation Date**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
**Test Cases**: {len(results['individual_scores'])}

## Executive Summary

The fine-tuned legal AI model was evaluated on {len(results['individual_scores'])} test cases covering constitutional law and penal code domains.

## Overall Performance Metrics

| Metric | Score |
|--------|-------|
| BLEU Score | {results['aggregate_scores']['avg_bleu']:.3f} |
| ROUGE-1 | {results['aggregate_scores']['avg_rouge1']:.3f} |
| ROUGE-2 | {results['aggregate_scores']['avg_rouge2']:.3f} |
| ROUGE-L | {results['aggregate_scores']['avg_rougeL']:.3f} |
| Element Coverage | {results['aggregate_scores']['avg_element_coverage']:.3f} |
| Avg Citations Found | {results['aggregate_scores']['avg_citations']:.1f} |
| Avg Response Length | {results['aggregate_scores']['avg_response_length']:.1f} words |

## Performance by Category

"""
    
    for category, metrics in results['category_breakdown'].items():
        report += f"""
### {category.title()}
- Test Cases: {metrics['count']}
- BLEU Score: {metrics['avg_bleu']:.3f}
- ROUGE-1: {metrics['avg_rouge1']:.3f}
- Element Coverage: {metrics['avg_element_coverage']:.3f}
"""
    
    report += """

## Performance by Difficulty

"""
    
    for difficulty, metrics in results['difficulty_breakdown'].items():
        report += f"""
### {difficulty.title()}
- Test Cases: {metrics['count']}
- BLEU Score: {metrics['avg_bleu']:.3f}
- ROUGE-1: {metrics['avg_rouge1']:.3f}
- Element Coverage: {metrics['avg_element_coverage']:.3f}
"""
    
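    report += """

## Recommendations

1. **Strengths**: Use the element coverage scores above to judge how consistently the model uses the expected legal terminology
2. **Areas for Improvement**: Check the cited sections and articles against the reference answers; the citation count alone does not measure citation accuracy
3. **Next Steps**: Conduct human expert evaluation for qualitative assessment

## Conclusion

The automatic metrics above measure overlap with reference answers for Sri Lankan legal queries; confirm them with human expert review before drawing conclusions about legal accuracy.
"""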
    
    return report

print("✅ Visualization and reporting functions ready")

## 7. Demo Evaluation (Sample Run)

In [None]:
# Demo with sample model responses (replace with actual model outputs)
sample_responses = [
    "Freedom of speech in Sri Lanka is guaranteed under Article 14(1)(a) of the 1978 Constitution, which provides fundamental rights to citizens.",
    "Culpable homicide becomes murder when done with intention to cause death or with knowledge that the act is likely to cause death.",
    "The 1972 Constitution established parliamentary supremacy while the 1978 Constitution introduced executive presidency.",
    "A public servant includes judges, government officers, and persons holding office under presidential authority as defined in the Penal Code.",
    "Theft involves dishonestly taking movable property without consent, which constitutes a criminal offense under the law."
]

# Run demo evaluation
print("🔄 Running demo evaluation...")
demo_results = run_comprehensive_evaluation(test_cases, sample_responses)

# Display results
print("\n📊 Demo Evaluation Results:")
print(f"Average BLEU Score: {demo_results['aggregate_scores']['avg_bleu']:.3f}")
print(f"Average ROUGE-1: {demo_results['aggregate_scores']['avg_rouge1']:.3f}")
print(f"Average Element Coverage: {demo_results['aggregate_scores']['avg_element_coverage']:.3f}")

# Generate visualizations
create_evaluation_visualizations(demo_results)

# Generate report
report = generate_evaluation_report(demo_results)
with open('evaluation_report.md', 'w', encoding='utf-8') as f:
    f.write(report)

print("\n✅ Demo evaluation completed!")
print("📄 Report saved as 'evaluation_report.md'")
print("📊 Visualizations saved as 'evaluation_results.png'")

## 8. Instructions for Full Evaluation

### Steps to evaluate your fine-tuned model:

1. **Generate Model Responses**:
   ```python
   # Load your fine-tuned model (`your_model` is a placeholder, not a real API)
   # Generate one response per test case, in the same order as test_cases
   model_responses = []
   for test_case in test_cases:
       response = your_model.generate(test_case['question'])
       model_responses.append(response)
   ```

2. **Run Full Evaluation**:
   ```python
   results = run_comprehensive_evaluation(test_cases, model_responses)
   ```

3. **Generate Report**:
   ```python
   create_evaluation_visualizations(results)
   report = generate_evaluation_report(results, "Your Model Name")
   ```

4. **Human Evaluation**:
   - Send evaluation forms to legal experts
   - Collect completed assessments
   - Analyze human evaluation scores
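
The last step, analyzing human evaluation scores, can be sketched as a per-criterion average over the completed forms. This assumes the forms keep the structure produced by `create_human_evaluation_template`, and that a score of 0 means the criterion was left blank. The function name `summarize_human_scores` and the two example forms are hypothetical, with scores invented for illustration only.

```python
from statistics import mean
from typing import Dict, List, Optional

CRITERIA = ["legal_accuracy", "completeness", "clarity", "relevance", "citation_quality"]

def summarize_human_scores(forms: List[Dict]) -> Dict[str, Optional[float]]:
    """Average each 1-5 criterion score across forms, skipping blank (0) scores."""
    summary: Dict[str, Optional[float]] = {}
    for criterion in CRITERIA:
        scores = [
            form["evaluation_criteria"][criterion]["score"]
            for form in forms
            if form["evaluation_criteria"][criterion]["score"] > 0
        ]
        summary[criterion] = mean(scores) if scores else None
    return summary

def make_form(scores: List[int]) -> Dict:
    """Build a minimal form holding only the fields the summary needs."""
    return {
        "evaluation_criteria": {
            name: {"score": score, "comments": ""}
            for name, score in zip(CRITERIA, scores)
        }
    }

# Two hypothetical completed forms; the second leaves "relevance" blank
forms = [make_form([4, 3, 5, 4, 2]), make_form([5, 3, 4, 0, 4])]
summary = summarize_human_scores(forms)
print(summary)
```

With several evaluators per test case, the same function can be called on each evaluator's forms separately to compare how closely their scores agree.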

### Expected Outcomes:
- Quantitative performance metrics
- Category-wise performance analysis
- Difficulty-based evaluation
- Comprehensive evaluation report
- Human expert assessments

This framework provides a solid foundation for your final project report!