# About LLM Output Evaluation for Legal Question Answering

This notebook is part of the **NM Law Data Pipeline** and provides tools to systematically benchmark large language models (LLMs) on domain-specific legal questions.

---

## Purpose

The legal domain demands precision, traceability, and factual accuracy. LLMs often struggle with all three, especially when faced with jurisdiction-specific rules or case law. This notebook aims to:

- **Compare model outputs** from multiple LLMs using consistent legal questions,
- **Score outputs using factuality and relevance metrics**,
- **Support ranking and selection** of the best-performing models for downstream legal automation tasks.

By evaluating model quality directly on real legal text and benchmark questions, we close the loop between raw generation and measurable reliability.

---

## Components

1. **Model Access Setup**: Load API keys for OpenAI, Gemini, Claude, or custom models.
2. **Prompt Routing**: Send one or more questions to each model using a standardized interface.
3. **Output Capture**: Store model answers with metadata for evaluation and record-keeping.
4. **Metric Evaluation**: Score each answer using a suite of LLM-targeted metrics (FactCC, ROUGE, SummaC, NLI).
5. **Result Interpretation**: Analyze performance across metrics, prompts, or models.

---


# Query Multiple Models

This section enables querying multiple large language models (LLMs) simultaneously with the same legal prompt or question. The goal is to compare model behavior, output quality, and factual consistency across different providers or versions (e.g., GPT-4o, Gemini, Claude, SMART-SLIC, etc.).

This is critical for benchmarking LLMs in the legal domain, where subtle variations in phrasing or factual grounding can significantly impact legal reasoning or applicability.



### Define API keys


Load or define API keys for each model provider you want to evaluate (e.g., OpenAI, Google, Anthropic). These keys must be set securely and should not be committed to version control.

Each model provider may have unique endpoint structures, token limits, and pricing, so defining the keys and routing logic upfront allows the system to handle those differences in a modular way.


In [None]:
# Placeholders: set each key before running, and keep real values out of version control.
OPENAI_KEY = None
ANTHROPIC_KEY = None
BARD_KEY = None
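
One way to keep keys out of the notebook is to read them from environment variables. The sketch below is illustrative only: the function `load_api_keys` and the variable names `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `GOOGLE_API_KEY` are assumptions, not names required by this pipeline.

```python
import os


def load_api_keys(env=os.environ):
    """Return one key per provider, or None when the variable is unset or empty."""
    # Assumed variable names; adjust them to match your own environment.
    names = {
        "openai": "OPENAI_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
        "google": "GOOGLE_API_KEY",
    }
    return {provider: env.get(var) or None for provider, var in names.items()}


keys = load_api_keys()
missing = [provider for provider, key in keys.items() if key is None]
if missing:
    print("No key set for:", ", ".join(missing))
```

The resulting values could then be assigned to the placeholders, for example `OPENAI_KEY = keys["openai"]`.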

### Define class instance with keys

Initialize a wrapper class or unified interface that allows consistent calling of each LLM, regardless of backend. This class should:
- Handle retries and error logging,
- Format the prompt as needed by each model,
- Normalize outputs for consistent evaluation,
- Track metadata such as token counts or model version.

This design supports batch evaluations and fair comparison between systems under identical question conditions.
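
The responsibilities listed above can be sketched with a small wrapper. This is not the implementation of `MultiModelLegalQA`; the names `UnifiedClient`, `ModelAnswer`, and the stub backend are hypothetical, and each backend is assumed to be a plain callable that takes a question and returns text.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Dict

logger = logging.getLogger("multi_model_sketch")


@dataclass
class ModelAnswer:
    model: str      # backend name, kept as metadata
    text: str       # normalized answer text
    attempts: int   # number of calls needed to get an answer


class UnifiedClient:
    """Hypothetical wrapper: every backend is called through the same path."""

    def __init__(self, backends: Dict[str, Callable[[str], str]],
                 max_retries: int = 3, backoff_s: float = 0.0):
        self.backends = backends
        self.max_retries = max_retries
        self.backoff_s = backoff_s

    def ask(self, model: str, question: str) -> ModelAnswer:
        backend = self.backends[model]
        last_error = None
        for attempt in range(1, self.max_retries + 1):
            try:
                raw = backend(question)
                # Normalize whitespace so answers are compared on equal terms.
                return ModelAnswer(model, " ".join(raw.split()), attempt)
            except Exception as exc:
                last_error = exc
                logger.warning("%s failed on attempt %d: %s", model, attempt, exc)
                time.sleep(self.backoff_s * attempt)
        raise RuntimeError(f"{model} failed after {self.max_retries} attempts") from last_error

    def ask_all(self, question: str) -> Dict[str, ModelAnswer]:
        return {name: self.ask(name, question) for name in self.backends}


# Stub backend that fails once, then returns untidy text.
calls = {"n": 0}


def flaky_backend(question: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated timeout")
    return "  stub   answer\n text "


client = UnifiedClient({"stub-model": flaky_backend})
result = client.ask("stub-model", "placeholder question")
print(result)
```

Passing backends in as callables keeps provider-specific prompt formatting outside the retry loop, so each model goes through identical retry and normalization logic.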


In [None]:
from multi_model_qa import MultiModelLegalQA

qa = MultiModelLegalQA(
    openai_api_key=OPENAI_KEY,
    anthropic_api_key=ANTHROPIC_KEY,
    google_bard_api_key=BARD_KEY
)

### Ask a question for all models to answer

Input a single benchmark legal question (e.g., "What are the conditions under which estoppel applies in New Mexico law?") and route it through each LLM.

The result is a dictionary or list of model outputs, ready for:
- Manual inspection,
- Quantitative evaluation (FactCC, NLI, ROUGE, etc.),
- Dataset generation for fine-tuning or post hoc analysis.

This step forms the core of legal question-answer evaluation.

The legal question sets used for the paper are available in the data directory as [58_sme.txt](https://github.com/lanl/T-ELF/tree/main/data/NM_law_questions_and_dates/58_sme.txt) and [25_system_domain_questions.txt](https://github.com/lanl/T-ELF/tree/main/data/NM_law_questions_and_dates/25_system_domain_questions.txt). Both were created and reviewed by a lawyer subject matter expert.

In [None]:
question = "How many New Mexico Supreme Court cases mention ‘Habeas Corpus’?"

answers = qa.ask_all_models(source='', question=question)
for model, answer in answers.items():
    print(f"\n=== {model} ===\n{answer}")
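
A dictionary of answers like the one returned by `ask_all_models` can be flattened into one CSV row per model. The sketch below is an assumption about layout only: the column names `question`, `model`, and `answer` are hypothetical and may not match what `LegalQuestionAIProcessor` expects, so check that class before relying on them.

```python
import csv
import io


def answers_to_csv(question, answers, handle):
    """Write one row per model to an open text handle."""
    # Column names are assumed for illustration.
    writer = csv.DictWriter(handle, fieldnames=["question", "model", "answer"])
    writer.writeheader()
    for model, answer in answers.items():
        writer.writerow({"question": question, "model": model, "answer": answer})


# Stand-in for the dictionary of model outputs.
sample_answers = {"model_a": "first answer", "model_b": "second, with a comma"}

buffer = io.StringIO()
answers_to_csv("sample question", sample_answers, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, open the target file with `open(path, "w", newline="", encoding="utf-8")` and pass that handle in place of `buffer`.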

**Important:** Collect the model outputs into a CSV file here. The path to that file is set in the following code block.

---

# Define the Inputs and Outputs to Evaluate the Models

This section structures the evaluation pipeline by defining:
- **Inputs**: Model responses, the original question, and optional gold-standard reference answers.
- **Metrics**: Chosen to reflect performance on factuality, legal consistency, summarization quality, and truthfulness. Common metrics include:
  - **FactCC** (Factual Consistency),
  - **ROUGE** (n-gram overlap with a reference answer),
  - **SummaC** (NLI-based consistency between a source text and a generated text),
  - **NLI Entailment** (Natural Language Inference agreement with ground truth).

This structured evaluation ensures the outputs are scored consistently and reproducibly, enabling model comparison across dozens of legal prompts.
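
To make the overlap idea behind ROUGE concrete, here is a simplified ROUGE-1 computed from unigram counts. It is a sketch for intuition, not the scorer used by `qa_eval`: the function name `rouge1` is hypothetical, and the tokenization (lowercased `\w+` matches, no stemming) is an assumption that differs from standard ROUGE packages.

```python
import re
from collections import Counter


def rouge1(candidate, reference):
    """Unigram precision, recall, and F1 between a candidate and a reference."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the court denied the petition", "the court granted the petition")
print(scores)
```

The two sentences share four of five tokens yet state opposite outcomes, which is why overlap scores are paired with factual consistency and entailment metrics in this evaluation.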


In [None]:
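CSV_PATH = None  # path to the CSV of collected model outputs; set before running
AI_OUTPUT = "ai_responses"  # directory for AI responses
EVALUATIONS = "eval.csv"  # CSV file for the evaluation results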

In [None]:
from qa_eval import LegalQuestionAIProcessor
processor = LegalQuestionAIProcessor(
    cleaned_output_csv=CSV_PATH,
    ai_output_dir=AI_OUTPUT,
    evaluated_output_csv=EVALUATIONS,
)
processor.evaluate_responses()
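
Once scores are available, models can be ranked by averaging a metric per model. The sketch below assumes a layout with a `model` column and one column per metric; the column names and the values in `dummy_csv` are made-up placeholders, not results from this pipeline, and the real schema of `eval.csv` may differ.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_scores_by_model(rows, metric):
    """Average one metric column per model across all rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["model"]].append(float(row[metric]))
    return {model: mean(values) for model, values in grouped.items()}


# Placeholder data with an assumed schema, for illustration only.
dummy_csv = (
    "model,question_id,rouge1_f1\n"
    "model_a,1,0.50\n"
    "model_a,2,0.70\n"
    "model_b,1,0.40\n"
    "model_b,2,0.20\n"
)
rows = list(csv.DictReader(io.StringIO(dummy_csv)))
averages = mean_scores_by_model(rows, "rouge1_f1")

# Highest average first.
ranking = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for model, score in ranking:
    print(f"{model}: {score:.2f}")
```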