# Chat Evaluation for Financial Analysis

## Overview

In this workshop, we'll work with a **financial analyst chatbot** that can answer financial analysis questions and process financial data. Our goal is to set up comprehensive evaluations for this agent to measure how well it performs on financial analysis tasks.

## What We'll Build

Throughout the following sections, we'll systematically build evaluation capabilities for our financial analysis agent, allowing you to measure performance quickly and reliably so you can iterate with confidence.

## Workshop Objectives

Using **LangSmith** as our evaluation platform, we will:

1. **Create an initial dataset** to measure performance on financial analysis tasks
2. **Define metrics** to evaluate the quality of financial insights and style of recommendations  
3. **Run evaluations** on different prompts, models, and agent configurations
4. **Compare results manually** to understand strengths and weaknesses
5. **Track results over time** to monitor improvements and regressions

By the end of this workshop, you'll have an evaluation framework that will help you iterate quickly on developing and enhancing this financial agent.

---

*Let's get started by examining our financial analyst chatbot and understanding its current capabilities...*

In [None]:
!pip install -U langsmith langchain-openai openai openevals

In [None]:
import os
import getpass

# Set up environment variables with your input
print("Please enter your API keys to get started:")
print("=" * 50)

# LangSmith tracing setting
langsmith_tracing = input("Enable LangSmith tracing? (true/false) [default: true]: ").strip() or "true"
os.environ["LANGSMITH_TRACING"] = langsmith_tracing

# LangSmith API key (secure input)
if not os.getenv("LANGSMITH_API_KEY"):
    langsmith_api_key = getpass.getpass("Enter your LangSmith API key: ")
    os.environ["LANGSMITH_API_KEY"] = langsmith_api_key
else:
    print("✓ LangSmith API key already set")

# OpenAI API key (secure input)  
if not os.getenv("OPENAI_API_KEY"):
    openai_api_key = getpass.getpass("Enter your OpenAI API key: ")
    os.environ["OPENAI_API_KEY"] = openai_api_key
else:
    print("✓ OpenAI API key already set")

print("\n✓ Environment setup complete!")
print("You can now proceed with the rest of the notebook.")

In [None]:
# import and initialize langsmith client
from langsmith import Client

client = Client()

In [None]:
# Name your dataset and add it to LangSmith
dataset_name = "Financial Analysis Dataset"
dataset = client.create_dataset(dataset_name)

## Creating Your Evaluation Dataset

Before testing your financial analysis agent, you need to define the datapoints for evaluation. Here are the key considerations:

### Schema Design
* **Minimum requirement:** Include inputs to your application (the financial questions)
* **Recommended:** Define expected outputs (ideal answers your agent should provide)
* **Advanced:** Add additional context like expected data sources or reasoning steps
* **Flexibility:** LangSmith datasets support arbitrary schemas as your needs evolve
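
As a rough illustration of the schema tiers above, the sketch below checks a single record before upload. The `metadata` field, its keys, and the `validate_example` helper are assumptions made for this example, not a schema that LangSmith requires.

```python
# Hypothetical record: "inputs" is the minimum, "outputs" is recommended,
# and the "metadata" keys are illustrative extras (not a required schema).
example = {
    "inputs": {"question": "Does NVIDIA or AMD have higher gross margins?"},
    "outputs": {"answer": "NVIDIA typically maintains higher gross margins"},
    "metadata": {"category": "profitability", "expected_sources": ["annual reports"]},
}


def validate_example(record: dict) -> list[str]:
    """Return a list of problems found in one dataset record (empty if none)."""
    problems = []
    question = record.get("inputs", {}).get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("missing inputs.question")
    # Expected outputs are recommended rather than required.
    if "outputs" in record and "answer" not in record["outputs"]:
        problems.append("outputs present but missing 'answer'")
    return problems


print(validate_example(example))
```

Running a check like this over every record before calling `create_examples` catches malformed rows early, when they are cheapest to fix.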

### Dataset Size
* **Start small:** Even 10-50 examples provide significant value
* **Focus on coverage:** Ensure you capture edge cases and scenarios you want to guard against
* **Grow over time:** Datasets are living constructs that expand as you learn from real usage
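
One lightweight way to reason about coverage is to tag each question with a category and look for gaps. The category names and the `required` set below are hypothetical and would depend on what your agent is expected to handle.

```python
from collections import Counter

# Hypothetical category tags for a handful of financial questions.
tagged_questions = [
    ("What is NVIDIA's primary revenue driver?", "revenue"),
    ("What market does AMD compete in?", "competition"),
    ("Who leads the AI chip market?", "competition"),
    ("What is AMD's main competitive advantage?", "competition"),
    ("Does NVIDIA or AMD have higher gross margins?", "profitability"),
]

# Categories we would like the dataset to cover (an assumption for this sketch).
required = {"revenue", "competition", "profitability", "risk"}

counts = Counter(tag for _, tag in tagged_questions)
missing = sorted(required - counts.keys())

print("Examples per category:", dict(counts))
print("Categories with no examples:", missing)
```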

### Data Collection Strategy
* **Begin manually:** Hand-label your first 10-20 examples to establish quality baselines
* **Learn from users:** Add problematic real-world examples as you discover pain points
* **Iterate continuously:** Evaluation is an ongoing process, not a one-time setup
* **Consider synthesis:** Advanced teams can augment with synthetically generated data

### Our Approach
For this tutorial, we'll create 5 financial analysis datapoints focusing on NVIDIA and AMD. Each example includes a question input and expected answer output for our question-answering agent.

In [None]:
# Create your dataset examples
client.create_examples(
    dataset_id=dataset.id,
    examples=[
        {
            "inputs": {"question": "What is NVIDIA's primary revenue driver?"},
            "outputs": {"answer": "Primarily data center chips and AI accelerators"},
        },
        {
            "inputs": {"question": "What market does AMD compete in?"},
            "outputs": {"answer": "CPUs, GPUs, and data center processors"},
        },
        {
            "inputs": {"question": "Who leads the AI chip market?"},
            "outputs": {"answer": "NVIDIA dominates with over 80% market share"},
        },
        {
            "inputs": {"question": "What is AMD's main competitive advantage?"},
            "outputs": {"answer": "Cost-effective alternatives to Intel and NVIDIA"},
        },
        {
            "inputs": {"question": "Does NVIDIA or AMD have higher gross margins?"},
            "outputs": {"answer": "NVIDIA typically maintains higher gross margins"},
        },
    ],
)

In [None]:
# we can use predefined scoring prompts (below) or create our own
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT

In [None]:
# Inspect the predefined scoring prompts
print("CORRECTNESS_PROMPT")
print(CORRECTNESS_PROMPT)

print('\n\n')

print("CONCISENESS_PROMPT")
print(CONCISENESS_PROMPT)

print('\n\n')

In [None]:
# using predefined scoring prompts
# def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
#     evaluator = create_llm_as_judge(
#         prompt=CORRECTNESS_PROMPT,
#         model="openai:o3-mini",
#         feedback_key="correctness",
#     )
#     eval_result = evaluator(
#         inputs=inputs,
#         outputs=outputs,
#         reference_outputs=reference_outputs
#     )
#     return eval_result

## Defining Evaluation Metrics

Now that we have our dataset, we need to establish metrics to measure our financial agent's performance. Here's our evaluation strategy:

### Evaluation Challenges
* **Semantic similarity:** We don't expect exact word matches with reference answers
* **Contextual correctness:** Financial responses need accuracy, not just similar phrasing  
* **Multiple valid answers:** Financial questions often have several correct approaches
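
To make the first challenge concrete, here is a minimal sketch of a naive exact-match evaluator rejecting an answer that a human grader would likely accept. The `exact_match` helper and the predicted answer are hypothetical.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Naive evaluator: passes only if the strings match after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


reference = "NVIDIA typically maintains higher gross margins"
# Hypothetical model output that says the same thing in different words.
prediction = "NVIDIA's gross margins are usually higher than AMD's."

print(exact_match(prediction, reference))
```

The check fails even though the two sentences agree, which is why the correctness metric below relies on an LLM judge rather than string comparison.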

### Our Two-Metric Approach

#### 1. Correctness Evaluation
* **Method:** LLM-as-a-judge using `gpt-4.1-2025-04-14`
* **Why LLM:** Too complex for simple string matching or rule-based functions
* **Rubric criteria:** 
  - Accurate and complete information
  - No factual errors
  - Addresses all question components
  - Logical consistency
  - Precise financial terminology
* **Output:** Binary CORRECT/INCORRECT assessment
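
Because the judge returns free text, the binary verdict has to be parsed from its reply. The sketch below shows one tolerant way to do that; `parse_grade` is a hypothetical helper, and reading the last token as the verdict is an assumption about how the judge formats its answer.

```python
import string


def parse_grade(raw: str) -> bool:
    """Return True only when the judge's final token is CORRECT.

    Assumes the verdict is the last word of the reply, ignoring case and
    surrounding punctuation. Anything else, including an empty reply,
    is treated as a failing grade.
    """
    tokens = raw.upper().split()
    if not tokens:
        return False
    verdict = tokens[-1].strip(string.punctuation)
    return verdict == "CORRECT"


for reply in ["CORRECT", "correct.\n", "Grade: CORRECT", "INCORRECT", ""]:
    print(repr(reply), "->", parse_grade(reply))
```

Treating unparseable replies as failures is a conservative choice; you may prefer to raise an error so that malformed judge output is noticed rather than silently scored.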

#### 2. Conciseness Evaluation  
* **Method:** Simple Python function measuring response length
* **Threshold:** The response must contain fewer than twice as many characters as the reference answer
* **Rationale:** Financial analysis should be clear and concise, not verbose
* **Output:** Boolean pass/fail metric
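
The threshold can be tried out on a pair of responses. This is a sketch under the rule stated above (character count, strictly less than twice the reference); `is_concise` and both sample responses are hypothetical.

```python
def is_concise(outputs: dict, reference_outputs: dict) -> bool:
    """Pass when the response has fewer than 2x the characters of the reference."""
    return len(outputs["response"]) < 2 * len(reference_outputs["answer"])


reference = {"answer": "CPUs, GPUs, and data center processors"}

# Hypothetical responses of different lengths.
short = {"response": "AMD competes in CPUs, GPUs, and data center chips."}
verbose = {
    "response": (
        "AMD competes across several semiconductor markets, including desktop "
        "and server CPUs, discrete GPUs, and data center accelerators."
    )
}

print(is_concise(short, reference))
print(is_concise(verbose, reference))
```

Note that the comparison is strict, so a response exactly twice as long as the reference fails.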

### Implementation Benefits
* **Automated assessment:** Both metrics run without human intervention
* **Scalable evaluation:** Can process large numbers of examples quickly
* **Balanced measurement:** Captures both accuracy and response quality

In [None]:
# Judge model used by the correctness evaluator
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-2025-04-14", temperature=0)

eval_instructions = """
You are an expert data labeler evaluating model outputs for correctness. Your task is to grade the response as CORRECT or INCORRECT based on the following rubric:

<Rubric>
A correct answer:
- Provides accurate and complete information
- Contains no factual errors
- Addresses all parts of the question
- Is logically consistent
- Uses precise and accurate terminology
</Rubric>
"""

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    user_content = f"""You are grading the following question:
    {inputs['question']}
    Here is the real answer:
    {reference_outputs['answer']}
    You are grading the following predicted answer:
    {outputs['response']}
    Respond with CORRECT or INCORRECT:
    Grade:
    """
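    response = llm.invoke(
        [
            {"role": "system", "content": eval_instructions},
            {"role": "user", "content": user_content},
        ]
    ).content
    # Normalize before comparing so replies like "CORRECT\n" or "Correct." still pass
    return response.strip().strip(".").upper() == "CORRECT"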

In [None]:
def conciseness(outputs: dict, reference_outputs: dict) -> bool:
    # Pass when the response has fewer than twice the characters of the reference answer
    return len(outputs["response"]) < 2 * len(reference_outputs["answer"])

## Running Evaluations

With our dataset and metrics in place, we're ready to evaluate our financial analysis agent. Here's how to set up and execute the evaluation process:

### Application Architecture
* **Simple design:** System message with instructions + user question passed to LLM
* **LangChain implementation:** Using ChatOpenAI for clean LLM interactions
* **Fixed configuration:** 
  - Model: `gpt-4.1-nano-2025-04-14` by default
  - Temperature: 0 for consistent responses
  - Customizable instructions for different financial analysis scenarios

### Key Components

#### 1. Core Application Function
* **Purpose:** Processes financial questions and returns concise answers
* **Default behavior:** Short, concise responses for financial clarity
* **Instruction flexibility:** Can modify system prompts for different evaluation tests
* **Note:** The model defaults to `gpt-4.1-nano-2025-04-14` and can be overridden through the `model` parameter

#### 2. LangSmith Integration Wrapper
* **Input mapping:** Converts dataset question format to application input
* **Output formatting:** Transforms application response to expected evaluation format
* **Key requirement:** Maps `inputs["question"]` → `{"response": answer}`
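
The mapping described above can be sketched without calling a model. `fake_app` is a hypothetical stand-in for the real application, and `make_target` is an illustrative helper rather than part of LangSmith.

```python
from typing import Callable


def fake_app(question: str) -> str:
    """Hypothetical stand-in for the real LLM-backed application."""
    return f"Stub answer to: {question}"


def make_target(app: Callable[[str], str]) -> Callable[[dict], dict]:
    """Wrap an app so it accepts a dataset-style inputs dict and returns {"response": ...}."""

    def target(inputs: dict) -> dict:
        return {"response": app(inputs["question"])}

    return target


target = make_target(fake_app)
print(target({"question": "What market does AMD compete in?"}))
```

Building the wrapper from a factory makes it easy to swap in a different app, prompt, or model while keeping the evaluators' expected `"response"` key unchanged.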

### Evaluation Execution
* **Automated process:** `client.evaluate()` runs your app against the entire dataset
* **Multi-metric assessment:** Both conciseness and correctness evaluated simultaneously
* **Experiment tracking:** Results tagged with the `openai-4.1-nano` prefix for easy identification
* **Scalable testing:** Same setup allows comparing different prompts or configurations
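
Conceptually, an evaluation run calls the target on every example, applies each evaluator, and aggregates the scores. The loop below is a simplified local stand-in for that idea, not a description of how `client.evaluate()` is implemented; `run_local_eval`, `stub_target`, and the canned answers are hypothetical.

```python
from statistics import mean

examples = [
    {
        "inputs": {"question": "Who leads the AI chip market?"},
        "outputs": {"answer": "NVIDIA dominates with over 80% market share"},
    },
    {
        "inputs": {"question": "What market does AMD compete in?"},
        "outputs": {"answer": "CPUs, GPUs, and data center processors"},
    },
]

# Canned responses so the sketch runs without an API key.
canned = {
    "Who leads the AI chip market?": "NVIDIA leads.",
    "What market does AMD compete in?": (
        "AMD competes in a very wide range of semiconductor markets, "
        "spanning consumer CPUs, GPUs, and data center processors."
    ),
}


def stub_target(inputs: dict) -> dict:
    return {"response": canned[inputs["question"]]}


def conciseness_check(outputs: dict, reference_outputs: dict) -> bool:
    return len(outputs["response"]) < 2 * len(reference_outputs["answer"])


def run_local_eval(target, examples: list, evaluators: dict) -> dict:
    """Run target on each example and return the mean score per evaluator."""
    scores = {name: [] for name in evaluators}
    for example in examples:
        outputs = target(example["inputs"])
        for name, evaluator in evaluators.items():
            scores[name].append(float(evaluator(outputs, example["outputs"])))
    return {name: mean(values) for name, values in scores.items()}


summary = run_local_eval(stub_target, examples, {"conciseness": conciseness_check})
print(summary)
```

A local loop like this is handy for debugging evaluators quickly, while the hosted run adds experiment tracking and side-by-side comparison.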

In [None]:
default_instructions = "Respond to the user's question in a short, concise manner."

def my_app(question: str, model: str = "gpt-4.1-nano-2025-04-14", instructions: str = default_instructions) -> str:
    # temperature=0 keeps responses as consistent as possible across evaluation runs
    llm = ChatOpenAI(model=model, temperature=0)
    return llm.invoke([
        {"role": "system", "content": instructions},
        {"role": "user", "content": question},
    ]).content

In [None]:
def ls_target(inputs: dict) -> dict:
    return {"response": my_app(inputs["question"])}

In [None]:
experiment_results = client.evaluate(
    ls_target, # Your AI system
    data=dataset_name, # The data to predict and grade over
    evaluators=[conciseness, correctness], # The evaluators to score the results
    experiment_prefix="openai-4.1-nano", # A prefix for your experiment names to easily identify them
)

### What adjustments can we make to improve the eval?