# RAG Evaluation for Financial Analysis

## Overview

Retrieval-Augmented Generation (RAG) grounds a large language model's answers in relevant external documents, which matters in financial analysis, where every figure should be traceable to a source. This tutorial focuses on evaluating RAG systems that analyze internal AMD and NVIDIA research and market data.

## What You'll Learn

This tutorial will teach you to evaluate your financial RAG applications using LangSmith:

1. **Create targeted test datasets** with financial questions and expected answers
2. **Run your RAG application** against comprehensive evaluation scenarios  
3. **Measure performance** using specialized evaluation metrics for RAG systems

## RAG Evaluation Workflow

Our evaluation process follows three essential steps:

### 1. Dataset Creation
* Financial questions about AMD and NVIDIA performance, strategy, and market position
* Expected answers based on internal analysis and market research
* Edge cases covering complex financial scenarios

### 2. RAG Application Execution  
* Query processing against your internal AMD/NVIDIA knowledge base
* Document retrieval from financial reports, analysis, and market data
* Answer generation combining retrieved context with LLM reasoning

### 3. Multi-Dimensional Evaluation
* **Answer relevance:** Does the response address the financial question asked?
* **Answer accuracy:** Are the financial facts, figures, and analysis correct?
* **Retrieval quality:** Did the system find the most relevant internal documents?

## Our Focus
For this tutorial, we'll evaluate a financial analysis bot that answers questions about AMD and NVIDIA using your internal research database, so you can check whether its answers are reliable enough to support investment and strategic decisions.

---

*Let's begin by examining how RAG systems handle complex financial queries...*

In [None]:
!pip install -U langsmith langchain-openai langchain-community openai openevals PyMuPDF



In [8]:
import os
import getpass

# Set up environment variables with your input
print("Please enter your API keys to get started:")
print("=" * 50)

# LangSmith tracing setting
langsmith_tracing = input("Enable LangSmith tracing? (true/false) [default: true]: ").strip() or "true"
os.environ["LANGSMITH_TRACING"] = langsmith_tracing

# LangSmith API key (secure input)
if not os.getenv("LANGSMITH_API_KEY"):
    langsmith_api_key = getpass.getpass("Enter your LangSmith API key: ")
    os.environ["LANGSMITH_API_KEY"] = langsmith_api_key
else:
    print("✓ LangSmith API key already set")

# OpenAI API key (secure input)  
if not os.getenv("OPENAI_API_KEY"):
    openai_api_key = getpass.getpass("Enter your OpenAI API key: ")
    os.environ["OPENAI_API_KEY"] = openai_api_key
else:
    print("✓ OpenAI API key already set")

print("\n✓ Environment setup complete!")
print("You can now proceed with the rest of the notebook.")

Please enter your API keys to get started:

✓ Environment setup complete!
You can now proceed with the rest of the notebook.


In [9]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

## Create Vector Database

We'll build a basic RAG application over your internal financial analysis documents. The implementation has three core steps (indexing, retrieval, and generation), described under Key Components below.

### Internal Financial Data Focus
* **Source material:** Internal research reports, market analysis, and strategic assessments

### Key Components
* **Indexing:** Chunk and embed internal AMD/NVIDIA analysis reports into a vector store
* **Retrieval:** Find relevant document chunks based on financial questions  
* **Generation:** Combine retrieved context with user questions for LLM processing
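
To make the three components concrete, here is a minimal, dependency-free sketch. It is a toy stand-in rather than the implementation used in this notebook: word-count vectors replace OpenAI embeddings, the helper names `embed`, `cosine`, `build_index`, `retrieve`, and `build_prompt` are hypothetical, and the sample chunks are placeholder text invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Retrieval: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Generation (prompt side only): combine retrieved context with the question."""
    joined = "\n\n".join(context)
    return f"Documents:\n{joined}\n\nQuestion: {question}"


# Placeholder chunks, invented for illustration only.
chunks = [
    "AMD data center segment overview",
    "NVIDIA automotive segment overview",
    "AMD client and gaming segment overview",
]
question = "How is the AMD data center segment doing?"
index = build_index(chunks)
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

The real pipeline below has the same shape: the vector store plays the role of `build_index`, the retriever plays the role of `retrieve`, and the prompt is sent to a chat model instead of being returned as a string.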

In [None]:
# List of PDF files to load
pdf_files = [
    "/content/AMD_INTERNAL_ANALYSIS_20250724.pdf",
    "/content/NVIDIA_INTERNAL_ANALYSIS_20250724.pdf"
]

# Load documents from the local PDF files (each loader returns one Document per page)
docs = [PyMuPDFLoader(pdf_file).load() for pdf_file in pdf_files]
docs_list = [item for sublist in docs for item in sublist]

# Initialize a text splitter with specified chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)

# Split the documents into chunks
doc_splits = text_splitter.split_documents(docs_list)

# Add the document chunks to the "vector store" using OpenAIEmbeddings
vectorstore = InMemoryVectorStore.from_documents(
    documents=doc_splits,
    embedding=OpenAIEmbeddings(),
)
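
The splitter above measures `chunk_size` in tokens and uses no overlap between chunks. As a rough illustration of what those two parameters control, the sketch below slides a fixed window over a token list. It is a simplification under stated assumptions: `chunk_tokens` is a hypothetical helper, whitespace splitting stands in for the tiktoken encoder, and the real `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraphs and sentences, so its boundaries will differ.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int, chunk_overlap: int = 0
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = "a b c d e f g h i j".split()
no_overlap = chunk_tokens(tokens, chunk_size=4, chunk_overlap=0)
with_overlap = chunk_tokens(tokens, chunk_size=4, chunk_overlap=2)
```

With zero overlap, a sentence that straddles a boundary is split across two chunks with no shared text; a small overlap keeps some of that context in both chunks at the cost of indexing more tokens.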

In [11]:
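# With LangChain we can easily turn any vector store into a retrieval component.
# search_kwargs is the documented way to set how many chunks are returned per query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})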

In [None]:
from langchain_openai import ChatOpenAI
from langsmith import traceable

llm = ChatOpenAI(model="gpt-4.1-nano-2025-04-14", temperature=1)

# Add decorator so this function is traced in LangSmith
@traceable()
def rag_bot(question: str) -> dict:
    # LangChain retriever will be automatically traced
    docs = retriever.invoke(question)
    # Separate chunks with blank lines so adjacent chunks do not run together
    docs_string = "\n\n".join(doc.page_content for doc in docs)

    instructions = f"""You are a financial analysis assistant who is an expert at analyzing source information and answering questions.
    Use the following source documents to answer the user's questions. Only use information from the documents to answer the question.
    If you don't see the answer in the documents, just say that you don't know.
    Use three sentences maximum and keep the answer concise.

    Documents:
    {docs_string}"""

    # langchain ChatModel will be automatically traced
    ai_msg = llm.invoke([
            {"role": "system", "content": instructions},
            {"role": "user", "content": question},
        ],
    )

    return {"answer": ai_msg.content, "documents": docs}

## Create Eval Dataset

In [15]:
from langsmith import Client

client = Client()

# Define the examples for the dataset
examples = [
    {
        "inputs": {"question": "What's the most recent fiscal year revenue for AMD?"},
        "outputs": {"answer": "In fiscal year 2024, AMD reported revenue of $27.28 billion"},
    },
    {
        "inputs": {"question": "What is NVIDIA's market share in the AI chip market?"},
        "outputs": {"answer": "NVIDIA dominates the AI chip market with 70-95 percent market share"},
    },
    {
        "inputs": {"question": "Compare the data center business between AMD and NVIDIA"},
        "outputs": {"answer": "40B for NVIDIA, 4.5B for AMD"},
    }
]

# Create the dataset and examples in LangSmith
dataset_name = "Finance Analysis RAG"
dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    dataset_id=dataset.id,
    examples=examples
)

{'example_ids': ['e2ec8973-83d3-4a02-90fb-8f35fcfd6b16',
  'c90ab01b-2d70-44ca-b169-3d4f57421263',
  'd1ea2756-9fed-4d7c-baee-5e17adb3698e'],
 'count': 3}
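
Before uploading, a quick local check can catch malformed examples, since a missing reference answer would only surface later as a failing correctness evaluator. The sketch below assumes every example carries `inputs["question"]` and `outputs["answer"]`, as in the list above; `validate_examples` is a hypothetical helper, not part of the LangSmith client, and the second sample entry is deliberately broken to show the output.

```python
def validate_examples(examples: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the examples look fine."""
    problems = []
    seen_questions = set()
    for i, example in enumerate(examples):
        question = example.get("inputs", {}).get("question")
        answer = example.get("outputs", {}).get("answer")

        if not isinstance(question, str) or not question.strip():
            problems.append(f"example {i}: missing or empty inputs['question']")
        elif question in seen_questions:
            problems.append(f"example {i}: duplicate question")
        else:
            seen_questions.add(question)

        if not isinstance(answer, str) or not answer.strip():
            problems.append(f"example {i}: missing or empty outputs['answer']")
    return problems


sample = [
    {
        "inputs": {"question": "What's the most recent fiscal year revenue for AMD?"},
        "outputs": {"answer": "In fiscal year 2024, AMD reported revenue of $27.28 billion"},
    },
    {
        # Deliberately missing the reference answer.
        "inputs": {"question": "What is NVIDIA's market share in the AI chip market?"},
        "outputs": {},
    },
]
problems = validate_examples(sample)
```

Running the check on the full `examples` list before `client.create_examples(...)` is one way to keep the dataset consistent as it grows.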

## RAG Evaluation Framework

We'll evaluate your financial RAG system using four key dimensions. Each evaluator compares different components of the RAG pipeline:

### 1. Correctness: Response vs Reference Answer
* **Goal:** How accurate is the financial analysis compared to expected answers?
* **Requirements:** Ground truth answers in your dataset
* **Method:** LLM-as-judge assesses factual accuracy of financial information
* **Focus:** Validates correct financial metrics, trends, and company insights

### 2. Relevance: Response vs Input Question  
* **Goal:** How well does the answer address the original financial question?
* **Requirements:** No reference answer needed
* **Method:** LLM-as-judge evaluates response helpfulness and directness
* **Focus:** Ensures answers stay on-topic for financial queries

### 3. Groundedness: Response vs Retrieved Documents
* **Goal:** How faithful is the response to the source financial documents?
* **Requirements:** No reference answer needed  
* **Method:** LLM-as-judge detects hallucinations and unsupported claims
* **Focus:** Prevents fabricated financial data or analysis

### 4. Retrieval Relevance: Retrieved Docs vs Input Question
* **Goal:** How relevant are the retrieved documents for answering the question?
* **Requirements:** No reference answer needed
* **Method:** LLM-as-judge assesses document-question alignment  
* **Focus:** Validates that the right financial reports were found

### Evaluation Strategy
Each metric provides unique insights into your RAG system's performance, helping identify whether issues stem from retrieval, generation, or both components.
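
One possible way to read the four boolean scores together is a simple triage order: check retrieval first, then groundedness, then relevance, then correctness. The sketch below is a heuristic under that assumption, not a rule from LangSmith; `diagnose` and its return labels are hypothetical names.

```python
def diagnose(feedback: dict[str, bool]) -> str:
    """Map the four evaluator results for one example to a likely failure stage."""
    if not feedback["retrieval_relevance"]:
        # The retriever returned unrelated chunks, so generation had little to work with.
        return "retrieval"
    if not feedback["groundedness"]:
        # Relevant chunks were found, but the answer goes beyond them.
        return "generation: unsupported claims"
    if not feedback["relevance"]:
        # The answer is supported by the documents but does not address the question.
        return "generation: off-topic"
    if not feedback["correctness"]:
        # Grounded and on-topic, yet it disagrees with the reference answer:
        # the retrieved chunks may be incomplete, or the reference may need review.
        return "correctness gap"
    return "pass"


all_pass = diagnose(
    {"correctness": True, "groundedness": True, "relevance": True, "retrieval_relevance": True}
)
grounded_but_wrong = diagnose(
    {"correctness": False, "groundedness": True, "relevance": True, "retrieval_relevance": True}
)
```

Because the retrieval relevance criterion used in this notebook is lenient, a passing retrieval score does not guarantee that the chunk containing the answer was retrieved, so a "correctness gap" is still worth checking against the retrieved documents.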

### Diagram
![RAG evaluation overview](https://docs.smith.langchain.com/assets/images/rag_eval_overview-0d95d78db4d60c2bccbd333f8ba75e60.png)

In [23]:
from typing_extensions import Annotated, TypedDict

# Grade output schema
class CorrectnessGrade(TypedDict):
    # Note that the order in which the fields are defined is the order in which the model generates them.
    # Putting the explanation before the verdict makes the model reason through its answer
    # before committing to a final grade:
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    correct: Annotated[bool, ..., "True if the answer is correct, False otherwise."]

# Grade prompt
correctness_instructions = """You are an expert data labeler evaluating model outputs for correctness.

You will be given a QUESTION, the GROUND TRUTH (correct) ANSWER, and the MODEL ANSWER. 

Here are the grading criteria to follow:
(1) Grade the model answers based ONLY on their factual accuracy relative to the ground truth answer.
(2) Ensure that the model answer does not contain any conflicting statements.
(3) It is OK if the model answer contains more information than the ground truth answer, as long as it is factually accurate relative to the ground truth answer.

Correctness:
A correctness value of True means that the model's answer meets all of the criteria.
A correctness value of False means that the model's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM
grader_llm = ChatOpenAI(model="gpt-4.1-2025-04-14", temperature=0).with_structured_output(CorrectnessGrade, method="json_schema", strict=True)

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """An evaluator for RAG answer accuracy"""
    answers = f"""\
QUESTION: {inputs['question']}
GROUND TRUTH ANSWER: {reference_outputs['answer']}
MODEL ANSWER: {outputs['answer']}"""

    # Run evaluator
    grade = grader_llm.invoke([
        {"role": "system", "content": correctness_instructions}, 
        {"role": "user", "content": answers}
    ])
    return grade["correct"]

In [18]:
# Grade output schema
class RelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[bool, ..., "True if the answer addresses the question, False otherwise"]

# Grade prompt
relevance_instructions = """You are an expert data labeler evaluating model outputs for relevance.

You will be given a QUESTION and a MODEL ANSWER.

Here are the grading criteria to follow:
(1) Ensure the MODEL ANSWER is concise and relevant to the QUESTION
(2) Ensure the MODEL ANSWER helps to answer the QUESTION

Relevance:
A relevance value of True means that the model's answer meets all of the criteria.
A relevance value of False means that the model's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM
relevance_llm = ChatOpenAI(model="gpt-4.1-2025-04-14", temperature=0).with_structured_output(RelevanceGrade, method="json_schema", strict=True)

# Evaluator
def relevance(inputs: dict, outputs: dict) -> bool:
    """A simple evaluator for RAG answer helpfulness."""
    answer = f"QUESTION: {inputs['question']}\nMODEL ANSWER: {outputs['answer']}"
    grade = relevance_llm.invoke([
        {"role": "system", "content": relevance_instructions}, 
        {"role": "user", "content": answer}
    ])
    return grade["relevant"]

In [19]:
# Grade output schema
class GroundedGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    grounded: Annotated[bool, ..., "True if the answer is grounded in the documents, False if it contains hallucinated content"]

# Grade prompt
grounded_instructions = """You are an expert data labeler evaluating model outputs for groundedness.

You will be given FACTS and a MODEL ANSWER.

Here are the grading criteria to follow:
(1) Ensure the MODEL ANSWER is grounded in the FACTS.
(2) Ensure the MODEL ANSWER does not contain "hallucinated" information outside the scope of the FACTS.

Grounded:
A grounded value of True means that the model's answer meets all of the criteria.
A grounded value of False means that the model's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM 
grounded_llm = ChatOpenAI(model="gpt-4.1-2025-04-14", temperature=0).with_structured_output(GroundedGrade, method="json_schema", strict=True)

# Evaluator
def groundedness(inputs: dict, outputs: dict) -> bool:
    """A simple evaluator for RAG answer groundedness."""
    doc_string = "\n\n".join(doc.page_content for doc in outputs["documents"])
    answer = f"FACTS: {doc_string}\nMODEL ANSWER: {outputs['answer']}"
    grade = grounded_llm.invoke([{"role": "system", "content": grounded_instructions}, {"role": "user", "content": answer}])
    return grade["grounded"]

In [20]:
# Grade output schema
class RetrievalRelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[bool, ..., "True if the retrieved documents are relevant to the question, False otherwise"]

# Grade prompt
retrieval_relevance_instructions = """You are an expert data labeler evaluating information retrieval (RAG) outputs for relevance.

You will be given a QUESTION and a set of FACTS provided by the RAG MODEL.

Here are the grading criteria to follow:
(1) Your goal is to identify FACTS that are completely unrelated to the QUESTION
(2) If the facts contain ANY keywords or semantic meaning related to the question, consider them relevant
(3) It is OK if the facts have SOME information that is unrelated to the question as long as (2) is met

Relevance:
A relevance value of True means that the FACTS contain ANY keywords or semantic meaning related to the QUESTION and are therefore relevant.
A relevance value of False means that the FACTS are completely unrelated to the QUESTION.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM
retrieval_relevance_llm = ChatOpenAI(model="gpt-4.1-2025-04-14", temperature=0).with_structured_output(RetrievalRelevanceGrade, method="json_schema", strict=True)

def retrieval_relevance(inputs: dict, outputs: dict) -> bool:
    """An evaluator for document relevance"""
    doc_string = "\n\n".join(doc.page_content for doc in outputs["documents"])
    answer = f"FACTS: {doc_string}\nQUESTION: {inputs['question']}"

    # Run evaluator
    grade = retrieval_relevance_llm.invoke([
        {"role": "system", "content": retrieval_relevance_instructions}, 
        {"role": "user", "content": answer}
    ])
    return grade["relevant"]

In [21]:
def target(inputs: dict) -> dict:
    return rag_bot(inputs["question"])

experiment_results = client.evaluate(
    target,
    data=dataset_name,
    evaluators=[correctness, groundedness, relevance, retrieval_relevance],
    experiment_prefix="rag-doc-relevance",
    metadata={"version": "gpt-4.1-nano"},
)


View the evaluation results for experiment: 'rag-doc-relevance-9ef232ca' at:
https://smith.langchain.com/o/b8c00c25-9820-5aa4-a030-cfcde0aef6cf/datasets/ea2af870-1e68-4c76-8210-6706b22c386d/compare?selectedSessions=8f7ccf8c-deaa-426b-a5a2-1b3501f753e9





In [22]:
import pandas as pd

# Explore results locally as a dataframe if you have pandas installed
experiment_results.to_pandas()

|   | inputs.question | outputs.answer | outputs.documents | error | reference.answer | feedback.correctness | feedback.groundedness | feedback.relevance | feedback.retrieval_relevance | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What is NVIDIA's market share in the AI chip m... | NVIDIA dominates the AI chip market with a mar... | [page_content='The automotive segment achieved... |  | NVIDIA dominates the AI chip market with 70-95... | True | True | True | True | 6.024722 | c90ab01b-2d70-44ca-b169-3d4f57421263 | b662b032-1f12-421c-9d99-6ba17c595e0a |
| 1 | Compare the data center business between AMD a... | AMD's data center business generated $4.5 bill... | [page_content='Superior memory capacity with M... |  | 40B for NVIDIA, 4.5B for AMD | False | True | True | True | 3.391317 | d1ea2756-9fed-4d7c-baee-5e17adb3698e | 5e29e985-ec3c-4e52-a759-291bb4ca85a1 |
| 2 | What's the most recent fiscal year revenue for... | The most recent fiscal year revenue for AMD is... | [page_content='solutions through Xilinx acquis... |  | In fiscal year 2024, AMD reported revenue of $... | False | False | True | True | 1.235071 | e2ec8973-83d3-4a02-90fb-8f35fcfd6b16 | 5172bbdf-eb37-4eb0-a5be-a3a13baa4df9 |
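
With only three examples the feedback columns can be read by eye, but a per-metric pass rate becomes useful as the dataset grows. The sketch below computes it in plain Python from the feedback values shown in the results above; `pass_rates` is a hypothetical helper, and the rows are copied by hand from that output rather than pulled from the experiment object.

```python
def pass_rates(rows: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of examples that passed each evaluator."""
    if not rows:
        return {}
    metrics = rows[0].keys()
    # True counts as 1 and False as 0 when summed.
    return {metric: sum(row[metric] for row in rows) / len(rows) for metric in metrics}


# Feedback values copied from the experiment results above (rows 0, 1, 2).
rows = [
    {"correctness": True, "groundedness": True, "relevance": True, "retrieval_relevance": True},
    {"correctness": False, "groundedness": True, "relevance": True, "retrieval_relevance": True},
    {"correctness": False, "groundedness": False, "relevance": True, "retrieval_relevance": True},
]
rates = pass_rates(rows)
for metric, rate in rates.items():
    print(f"{metric}: {rate:.0%}")
```

In this run, relevance and retrieval relevance pass on every example while correctness passes on one of three, which points at answer content rather than topicality as the first thing to investigate.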
