Licensed under the MIT License.

Copyright (c) 2025-2035. All rights reserved by Hanhan Wu.

Permission is hereby granted to view this code for evaluation purposes only.
You may not reuse, copy, modify, merge, publish, distribute, sublicense,
or exploit this code without Hanhan Wu's EXPLICIT written permission.


# DSPy on a Larger FIQA Dataset: Optimizing Answer Generation

* Dataset
  * 100 training records
  * 180 validation records

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import dspy
import contextlib
from concurrent.futures import ThreadPoolExecutor, as_completed

from utils import *

import warnings
warnings.filterwarnings('ignore')


model_str = 'gpt-4.1-nano'

### Load Data Input

In [2]:
train_df = pd.read_parquet('final_finance_qa_train.parquet')
val_df = pd.read_parquet('final_finance_qa_val.parquet')

print(train_df.shape, val_df.shape)
train_df.head()

(100, 4) (180, 4)


Unnamed: 0,doc_id,context,question,ground_truth
0,2574.jpeg,[**Document Type:** \nThis is a check issued b...,Who issued the check?,"The Tobacco Institute, located at 1875 I Stree..."
1,4492.jpeg,[### Document Analysis\n\n**Document Type**: C...,What is the type of document being analyzed?,The document is a Check Request Form.
2,7281.jpeg,[**Document Type**: This is a commercial invoi...,What is the name of the organization issuing t...,Philip Morris Limited
3,1242.jpeg,[### Document Type\nThis is a financial docume...,What is the client company for this outdoor ad...,The client company is P.M. Inc.
4,7700.jpeg,[### Document Type\nThis is a Production Estim...,What is the client's name mentioned in the doc...,The client's name is RJR/NOW Family.


In [3]:
# build the DSPy trainset and valset; question and context are the inputs
dspy_trainset = [
    dspy.Example({
        "question": record['question'],
        "context": list(record['context']),
        "ground_truth": record['ground_truth']
        }).with_inputs('question', 'context')
    for record in train_df.to_dict(orient='records')
]

dspy_valset = [
    dspy.Example({
        "question": record['question'],
        "context": list(record['context']),
        "ground_truth": record['ground_truth']
        }).with_inputs('question', 'context')
    for record in val_df.to_dict(orient='records')
]

print(len(dspy_trainset), len(dspy_valset))
print(dspy_trainset[0])

100 180
Example({'question': 'Who issued the check?', 'context': ['**Document Type:** \nThis is a check issued by The Tobacco Institute.\n\n**Key Details:**\n- **Issuer:** The Tobacco Institute, 1875 I Street, Northwest, Washington, DC 20006\n- **Check Number:** 075013\n- **Date:** September 7, 1990\n- **Payee:** Citizens for Jim Mcpike\n- **Amount:** $500.00\n- **Check Status:** Non-Negotiable\n- **Bank Details:** Check drawn from The National Bank\n\n**Insights and Observations:**\n- The check is marked as non-negotiable, indicating it cannot be transferred and can only be cashed or deposited by the person named (Citizens for Jim Mcpike).\n- The substantial amount (considering the date in 1990) suggests it might have been for a significant expense or donation.\n- As the issuer is The Tobacco Institute, the payment could be related to lobbying or campaign activities, given that it is addressed to a political entity or initiative named after an individual, Jim Mcpike.\n- The document’s

### Prompt Optimization with DSPy

* Time cost
  * About 18 minutes for compilation
* The metric is based on DeepEval's G-Eval: https://deepeval.com/docs/metrics-llm-evals
  * The G-Eval metric implemented here scores answer quality by checking the answer against both the ground truth and the context
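A DSPy optimizer calls the metric with an example, a prediction and an optional trace, and expects a numeric (or boolean) score back. The sketch below shows only that call shape. It uses a hypothetical exact-match stand-in (`exact_match_metric`) instead of the G-Eval judge so that it runs without any API calls, and `SimpleNamespace` objects stand in for `dspy.Example` and `dspy.Prediction`.

```python
from types import SimpleNamespace


def exact_match_metric(example, prediction, trace=None):
    """Hypothetical stand-in for the G-Eval metric.

    Returns 1.0 when the normalized answer equals the normalized
    ground truth, otherwise 0.0. `trace` is accepted but unused.
    """
    ground_truth = example.ground_truth.strip().lower()
    ai_answer = prediction.answer.strip().lower()
    return 1.0 if ai_answer == ground_truth else 0.0


# Stand-ins for dspy.Example / dspy.Prediction (attribute access only).
example = SimpleNamespace(question="Who issued the check?",
                          ground_truth="The Tobacco Institute")
prediction = SimpleNamespace(answer="  the tobacco institute ")

score = exact_match_metric(example, prediction)
print(score)
```

Exact match is far stricter than an LLM judge; it is used here purely to keep the example self-contained.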

In [4]:
import dspy
from dspy.teleprompt import MIPROv2


llm = dspy.LM(f"openai/{model_str}")
dspy.settings.configure(lm=llm)

In [5]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers.
        You will receive context(contain relevant facts).
        Think step by step."""  # this is the initial prompt
    question = dspy.InputField(desc="the question to answer")
    context = dspy.InputField(desc="retrieved context, may contain relevant facts")
    answer = dspy.OutputField(desc="AI's answer")


class RAG_AnswerGeneration(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question, context):
        prediction = self.generate_answer(question=question, context=context)
        return dspy.Prediction(context=context,
                               answer=prediction.answer,
                               reasoning=prediction.reasoning)


# example input & output BEFORE prompt optimization
dev_example = dspy_valset[10]
generate_answer = RAG_AnswerGeneration()
pred = generate_answer(question=dev_example.question,
                       context=dev_example.context)
print(f"[Devset] Question: {dev_example.question}")
print(f"[Devset] Ground Truth: {dev_example.ground_truth}")
print(f"[Prediction] Predicted Answer: {pred.answer}")
print(f"[Prediction] Reasoning: {pred.reasoning}")

[Devset] Question: What is the date of the invoice?
[Devset] Ground Truth: The invoice date is August 28, 1978.
[Prediction] Predicted Answer: August 28, 1978
[Prediction] Reasoning: The document clearly states the invoice date as August 28, 1978. This date is explicitly labeled under "Invoice Date," making it the definitive date of the invoice. Other dates mentioned, such as the received date and handwritten notes, do not alter the primary invoice date. Therefore, the answer should be the date provided directly in the invoice details.


In [None]:
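# explicit imports (may already be provided by `from utils import *`)
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def answer_correctness_geval(example, prediction, trace=None):
    # `trace` is accepted because DSPy may pass it to the metric; it is unused here
    ground_truth = example.ground_truth.strip().lower()
    ai_answer = prediction.answer.strip().lower()

    # LLM-as-judge metric: compare the answer with the ground truth and the context
    correctness_metric = GEval(
        name="Answer Correctness",
        model=model_str,
        criteria="Determine whether the ai_answer aligns with the ground_truth and the context.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT,
                           LLMTestCaseParams.EXPECTED_OUTPUT,
                           LLMTestCaseParams.CONTEXT],
    )
    test_case = LLMTestCase(
        input=example.question,
        context=example.context,
        actual_output=ai_answer,
        expected_output=ground_truth)
    correctness_metric.measure(test_case)
    score = correctness_metric.score

    # per-trial log line (redirected to a file during compilation)
    print(f"""[Trial] Q: {example.question} | Score: {score}
                      GT: {ground_truth}
                      Pred: {ai_answer}
          """)
    return score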


optimizer = MIPROv2(
    metric=answer_correctness_geval,
    prompt_model=llm,
    task_model=llm,
    num_candidates=5,  # number of proposed instructions
    init_temperature=0.7,
    seed=10,
    auto=None,
    verbose=True,
    track_stats=True
)


with open('fiqa_dspy_miprov2_biggerdata.txt', 'w') as f:
    with contextlib.redirect_stdout(f):
        compiled_rag = optimizer.compile(
            RAG_AnswerGeneration(),
            trainset=dspy_trainset,
            num_trials=5,
            max_bootstrapped_demos=2,
            max_labeled_demos=3,
            minibatch_size=4,
            requires_permission_to_run=False
        )
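The `with` block above sends everything printed during compilation, including the per-trial lines from the metric, into a text file instead of the notebook output. A minimal sketch of the same mechanism, using an in-memory buffer in place of the file; `noisy_step` is a hypothetical placeholder for the compile call.

```python
import contextlib
import io


def noisy_step(n_trials):
    """Hypothetical stand-in for a long-running call that prints progress."""
    for trial in range(1, n_trials + 1):
        print(f"[Trial] trial {trial} done")
    return n_trials


buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    # Everything written to sys.stdout inside this block lands in `buffer`.
    result = noisy_step(3)

captured = buffer.getvalue()
print(len(captured.splitlines()))
```

Note that `redirect_stdout` only captures writes to `sys.stdout`; messages emitted through `logging` or written to `sys.stderr` are not redirected by it.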

In [7]:
compiled_rag

generate_answer.predict = Predict(StringSignature(question, context -> reasoning, answer
    instructions='Answer questions with short factoid answers.\nYou will receive context(contain relevant facts).\nThink step by step.'
    question = Field(annotation=str required=True json_schema_extra={'desc': 'the question to answer', '__dspy_field_type': 'input', 'prefix': 'Question:'})
    context = Field(annotation=str required=True json_schema_extra={'desc': 'retrieved context, may contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'desc': "AI's answer", '__dspy_field_type': 'output', 'prefix': 'Answer:'})
))

In [None]:
compiled_rag.save("dspy_compiled_rag.json", save_program=False)  # save only the program state (instructions, demos) as JSON
compiled_rag.save("dspy_compiled_rag/", save_program=True)  # save the whole program (architecture + state) to a directory for later reuse

### Apply the Optimized Prompt to the Valset

* Generate answers after prompt optimization
* Compare `answer_after_prompt_opt` with `ground_truth`
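Before running the LLM-based evaluation, a cheap lexical comparison can serve as a rough cross-check. The sketch below computes a token-level F1 score between a predicted answer and the ground truth. It is not the G-Eval metric used in this notebook; `tokenize` and `token_f1` are hypothetical helper names introduced only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, ground_truth):
    """Token-level F1 between two strings (multiset overlap)."""
    pred_tokens = tokenize(prediction)
    gt_tokens = tokenize(ground_truth)
    if not pred_tokens or not gt_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gt_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


f1 = token_f1("August 28, 1978", "The invoice date is August 28, 1978.")
print(round(f1, 2))  # 0.6
```

The short but correct answer scores only 0.6 here because the ground truth is phrased as a full sentence, which illustrates why a lexical score is at most a sanity check next to an LLM judge.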

In [9]:
results = []
for r in dspy_valset:
    pred = compiled_rag(question=r.question, context=r.context)
    results.append({
        "question": r.question,
        "context": r.context,
        "ground_truth": r.ground_truth,
        "answer_after_prompt_opt": pred.answer,
        "ai_answer_reasoning": pred.reasoning,
    })

results_df = pd.DataFrame(results)
print(results_df.shape)
results_df.head()

(180, 5)


Unnamed: 0,question,context,ground_truth,answer_after_prompt_opt,ai_answer_reasoning
0,What is the objective of Vitality Tests?,[### Document Type\nThe image appears to be a ...,The objective of Vitality Tests is to determin...,To determine VRL's responsiveness to increased...,The objective of Vitality Tests is to determin...
1,What is the company name and address listed on...,[### Document Type\nThe image depicts an invoi...,"The company name is Copiadora Gouldsvey, and t...",Company Name: Copiadora Gouldsvey\nAddress: 12...,The question asks for the company name and add...
2,Who is the issuer of the invoice?,[### Document Type\nThis is an invoice from En...,"Entertainment Partners, EPSG Talent Services","Entertainment Partners, EPSG Talent Services",The invoice explicitly states that the issuer ...
3,Who is the issuer of the invoice?,[**Document Type:**\n- This image shows an inv...,Fannon-Luers Associates Inc.,Fannon-Luers Associates Inc.,The question asks for the issuer of the invoic...
4,Who is the issuer of the check?,[### Document Type\nThe document is a check.\n...,The issuer of the check is the Center for Indo...,"Center for Indoor Air Research, Linthicum, MD ...",The issuer of the check is identified in the k...


In [None]:
def process_record(record):
    eval_score, eval_reason = my_deepeval_answer_correctness(
        record['question'], list(record['context']),
        record['ground_truth'],
        record['answer_after_prompt_opt']
    )
    record['eval_score'] = eval_score
    record['eval_reason'] = eval_reason
    return record

final_eval_lst = []
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(process_record, record) for record in results]
    for future in as_completed(futures):
        final_eval_lst.append(future.result())
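`as_completed` yields futures in the order they finish, so `final_eval_lst` generally does not follow the order of `results`. If the original order matters, one option is `executor.map`, which returns results in input order. A minimal sketch, with a hypothetical `fake_process_record` standing in for the real evaluation call:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_process_record(record):
    """Hypothetical stand-in for process_record: sleeps, then adds a dummy score."""
    time.sleep(record["delay"])
    return {**record, "eval_score": 1.0}


records = [
    {"id": 0, "delay": 0.05},  # finishes last
    {"id": 1, "delay": 0.0},   # finishes first
    {"id": 2, "delay": 0.02},
]

with ThreadPoolExecutor(max_workers=3) as executor:
    # executor.map yields results in input order, regardless of completion order.
    ordered = list(executor.map(fake_process_record, records))

print([r["id"] for r in ordered])  # [0, 1, 2]
```

Unlike the cell above, this version returns new dictionaries instead of mutating the input records; either approach works as long as it is applied consistently.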

In [11]:
final_eval_df = pd.DataFrame(final_eval_lst)
print(final_eval_df.shape)
print(final_eval_df['eval_score'].mean())
final_eval_df.head()

(180, 7)
0.7607632596008781


Unnamed: 0,question,context,ground_truth,answer_after_prompt_opt,ai_answer_reasoning,eval_score,eval_reason
0,Who is the issuer of the check?,[### Document Type\nThe document is a check.\n...,The issuer of the check is the Center for Indo...,"Center for Indoor Air Research, Linthicum, MD ...",The issuer of the check is identified in the k...,0.720855,The Actual Output correctly identifies the iss...
1,What is the objective of Vitality Tests?,[### Document Type\nThe image appears to be a ...,The objective of Vitality Tests is to determin...,To determine VRL's responsiveness to increased...,The objective of Vitality Tests is to determin...,0.717346,The Actual Output accurately states the object...
2,What is the company name and address listed on...,[### Document Type\nThe image depicts an invoi...,"The company name is Copiadora Gouldsvey, and t...",Company Name: Copiadora Gouldsvey\nAddress: 12...,The question asks for the company name and add...,0.910567,The Actual Output accurately reproduces the co...
3,Who is the issuer of the invoice?,[### Document Type\nThis is an invoice from En...,"Entertainment Partners, EPSG Talent Services","Entertainment Partners, EPSG Talent Services",The invoice explicitly states that the issuer ...,0.998901,The Actual Output matches the Expected Output ...
4,What is the document type of the financial per...,[**Document Type**: Financial Performance Summ...,The document type is a Financial Performance S...,Financial Performance Summary Report,"The document explicitly states ""Document Type:...",0.701041,The Actual Output states 'Financial Performanc...
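The mean alone hides how the scores are spread. The sketch below summarizes a list of scores with the mean, the minimum and the share of records at or above a cutoff. The sample values are the five `eval_score` entries displayed above, not the full 180 records, and both the `summarize_scores` helper and the 0.75 cutoff are assumptions chosen for illustration.

```python
from statistics import mean


def summarize_scores(scores, cutoff=0.75):
    """Return mean, minimum and the share of scores at or above `cutoff`."""
    if not scores:
        raise ValueError("scores must not be empty")
    at_or_above = sum(1 for s in scores if s >= cutoff)
    return {
        "mean": mean(scores),
        "min": min(scores),
        "share_at_or_above": at_or_above / len(scores),
    }


# The five eval_score values shown in the table above (a sample only).
sample_scores = [0.720855, 0.717346, 0.910567, 0.998901, 0.701041]
summary = summarize_scores(sample_scores)
print(summary)
```

With the full results, the same helper could be applied to `final_eval_df['eval_score'].tolist()`.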
