# **DSPy Extravaganza**

This notebook explores the possibilities of using DSPy to further optimize Gemini output.

DSPy stands for Declarative Self-improving Python and is basically a library that handles the prompt engineering _programmatically_. You can see the documentation here: 
https://dspy.ai

I have structured this notebook in 4 parts. 

1. Loading Google's Gemini model and setting up mlflow
2. Creating a pydantic model to validate the json output response, and the architecture of the DSPy signature (more info on signatures later)
3. Loading in the data, cleaning it, and creating a train/test split (my favorite part of ml)
4. Evaluating and Optimizing DSPy


## **Part 1**: Loading everything in

Let's load all the necessary libraries, dependencies, models, and environments needed for this workflow.

In [21]:
import os
import dspy
import json
from pathlib import Path
import mlflow

In [22]:
#Credentials for GCP
cred_json = Path("/Users/sules/Downloads/doc-extract-454213-390208a235a8.json")

In [23]:
#Loading in the model and configuring it within DSPy

lm = dspy.LM("vertex_ai/gemini-1.5-pro-002", vertex_credentials=str(cred_json))
dspy.configure(lm=lm)

### Setting up mlflow

I've set up mlflow to track and trace all the artifacts that this workflow will create. DSPy unfortunately does all of its work under the hood, and I'm using mlflow to pop it open and look inside.

In [24]:
mlflow.set_tracking_uri(uri="http://127.0.0.1:5000")
mlflow.set_experiment("Doc_extractsv5")
mlflow.dspy.autolog()

Quick test to see if the model and mlflow are working:

In [25]:
lm("Say this is a test?", temperature=0.0)

['This is a test.\n']


## **Part 2**: Pydantic and DSPy setup 

Let's start with Pydantic. We create a BaseModel that sets up which entities are needed and some additional info about them, like data type and an example value. This basically means "I want you to create a response like this, nothing else".

In [26]:
from typing import Optional

from pydantic import BaseModel, Field

class Docex(BaseModel):
    # Required fields use `...`; optional fields default to None.
    first_name: str = Field(..., examples=["John"])
    last_name: str = Field(..., examples=["Doe"])
    ssn: str = Field(..., examples=["123-45-6789"])
    spouse_first_name: Optional[str] = Field(None, examples=["Jane"])
    spouse_last_name: Optional[str] = Field(None, examples=["Doe"])
    # Note: spouse_ssn is required, unlike the other spouse fields.
    spouse_ssn: str = Field(..., examples=["123-45-6789"])
    address: Optional[str] = Field(None, examples=["123 Main St"])
    apt: Optional[str] = Field(None, examples=["123"])
    city: str = Field(..., examples=["Springfield"])
    state: str = Field(..., examples=["Maine"])
    zip_code: str = Field(..., examples=["62704"])

### Creating DSPy Architecture: Signature and Module

Below we create what is called the DSPy signature, which replaces the prompt engineering of traditional LLM pipelines.

The signature is the set of rules that we define. Below we see the following:
1. Input field - We define that the input will be the pdf, along with some high-level prompting in its description.
2. Output field - We define that the output (the response) must be a Docex, which is the pydantic model. The response HAS TO fit this schema or else an error is raised.

DSPy takes this signature and creates a robust prompt, which the model then uses.

In [27]:
class ExtractPersonFromPDF(dspy.Signature):
    pdf_file = dspy.InputField(desc="The following document is a synthetic (fake) tax form created for testing purposes and contains no real SPII or sensitive data.")
    En: Docex = dspy.OutputField()

Next we create a DSPy module to containerize our signature and make it reusable across calls. Although not strictly needed, it is best practice. This architecture mimics PyTorch, where you build neural networks out of layers.

Also, note dspy.ChainOfThought. This DSPy module is special since it adds another layer of robustness. Essentially it is telling the LLM to "Think carefully and give me a reason for your answer".

In [28]:
class PDFEntityExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extractor = dspy.ChainOfThought(ExtractPersonFromPDF)

    def forward(self, pdf_file):
        result = self.extractor(pdf_file=pdf_file)
        return result.En

## **Part 3**: Data

Let's take in 5 F1040 forms, all of them filled out using Rehan's SDG. 
Let's also load in a json file with the "ground truth values" for all the pdf forms. 
The idea here is to:

1. feed Gemini the F1040 forms
2. let it extract the required entities (in json schema thanks to pydantic)
3. finally compare the results with the json file with the actual true results

Doing this will let us know if DSPy created a robust prompt that gave us the output we wanted.

In [29]:
import json
from pathlib import Path

pdf_dir = Path("/Users/sules/Desktop/AI Projects/prompt/forms")
json_path = Path("/Users/sules/Desktop/AI Projects/prompt/formatted_gt.json")

# Load ground truth JSON
with open(json_path, "r") as f:
    ground_truths = json.load(f)

# Load PDFs, sorted by filename, so they align with json file
pdf_files = sorted([f for f in pdf_dir.glob("*.pdf") if not f.name.startswith("._")])
pdf_bytes_list = [p.read_bytes() for p in pdf_files]

In [30]:
# Keep only a fixed byte window of each pdf to cut down the number of input tokens fed to Gemini
def prepare_pdf_input(pdf_bytes, start=40000, end=50000):
    return pdf_bytes[start:end]
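
Because the slice bounds are fixed, a PDF shorter than `start` bytes would silently produce an empty input. A minimal sketch of a guard follows; `slice_with_check` is a hypothetical helper that is not part of the original notebook, and the zero-filled bytes stand in for a real PDF.

```python
def slice_with_check(pdf_bytes: bytes, start: int = 40000, end: int = 50000) -> bytes:
    """Return pdf_bytes[start:end], failing loudly if the slice would be empty."""
    chunk = pdf_bytes[start:end]
    if not chunk:
        raise ValueError(
            f"PDF has {len(pdf_bytes)} bytes; slice [{start}:{end}] is empty"
        )
    return chunk


# Synthetic stand-in for a real PDF: 45,000 bytes, so the slice is cut short.
fake_pdf = bytes(45_000)
chunk = slice_with_check(fake_pdf)
print(len(chunk))  # 5000
```

Whether the 40000 to 50000 window actually contains the filled-in form fields depends on how the PDFs were generated, so it is worth spot-checking on each new batch of forms.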

This cell zips up the pdfs, with the corresponding json values, one to one, and creates a full dataset we will be using. 

In [31]:
from dspy import Example

dspy_dataset = [
    Example(
        pdf_file=prepare_pdf_input(pdf_bytes),
        expected_en=ground_truth
    ).with_inputs("pdf_file")
    # One ground-truth record per PDF, paired in sorted filename order.
    for pdf_bytes, ground_truth in zip(pdf_bytes_list, ground_truths)
]
print(f"Created {len(dspy_dataset)} DSPy examples.")

Created 5 DSPy examples.
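
One caveat with `zip`: it stops at the shorter input, so a missing PDF or a missing ground-truth record would quietly drop examples rather than fail. A small sketch of a stricter pairing, assuming a hypothetical helper `pair_strictly` and made-up file names and records:

```python
def pair_strictly(pdf_names, truths):
    """Pair PDFs with ground-truth records, refusing mismatched lengths."""
    if len(pdf_names) != len(truths):
        raise ValueError(
            f"{len(pdf_names)} PDFs but {len(truths)} ground-truth records"
        )
    return list(zip(pdf_names, truths))


# Hypothetical file names and records, for illustration only.
names = sorted(["form_2.pdf", "form_1.pdf"])
records = [{"first_name": "A"}, {"first_name": "B"}]
pairs = pair_strictly(names, records)
print(pairs[0][0])  # form_1.pdf
```

This only checks the counts. It still relies on the json records being stored in the same order as the sorted file names.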


In [14]:
#Train/Test split 3:2
split_index = int(0.6 * len(dspy_dataset))
trainset = dspy_dataset[:split_index]
evalset = dspy_dataset[split_index:]

In [15]:
print(f"Created {len(trainset)} DSPy train examples.")

Created 3 DSPy train examples.


In [16]:
print(f"Created {len(evalset)} DSPy eval examples.")

Created 2 DSPy eval examples.
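
The split above takes the first 60% of the examples in file order. If file order could correlate with form content, a seeded shuffle before splitting is one option. This is only a sketch: `split_dataset` is a hypothetical helper, and `range(5)` stands in for the five examples.

```python
import random


def split_dataset(items, train_fraction=0.6, seed=0):
    """Shuffle a copy of items with a fixed seed, then split it."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    split_index = int(train_fraction * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]


train, test = split_dataset(range(5))
print(len(train), len(test))  # 3 2
```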



## **Part 4**: Evaluation and Optimization 

We need to define a metric to measure how closely Gemini's predicted extraction matches the ground truth. We could use a true/false exact match, but after a couple of attempts and some manual inspection it was clear this is too strict: the mismatches it flagged were mostly trivial formatting differences, so we needed a more forgiving metric.

This metric function scores entity extraction by comparing normalized predicted outputs against the ground truth. Both sides are converted to lower-case strings with surrounding whitespace stripped to account for formatting differences, and the score is the fraction of ground-truth fields that match.

In [17]:
def extraction_correctness_metric(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> float:
    """Return the fraction of ground-truth fields matched by the prediction."""

    # Normalize an output object into a dict of comparable strings.
    def normalize_output(output):
        if hasattr(output, "model_dump"):
            # Pydantic v2 model (e.g. the Docex returned by the module).
            out_dict = output.model_dump()
        elif hasattr(output, "dict"):
            # Pydantic v1-style model.
            out_dict = output.dict()
        elif isinstance(output, dict):
            out_dict = output
        else:
            raise ValueError("Output is not a dict or a Pydantic model")
        # Convert every field to a lower-case string with whitespace stripped.
        return {k: str(v).strip().lower() for k, v in out_dict.items()}

    # The module's forward() returns the Docex output directly,
    # so `prediction` is normalized as-is.
    pred_norm = normalize_output(prediction)
    label_norm = normalize_output(example.expected_en)

    # Compute the fraction of fields that match exactly.
    correct = sum(1 for k in label_norm if k in pred_norm and pred_norm[k] == label_norm[k])
    return correct / max(len(label_norm), 1)
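
To make the arithmetic concrete, here is a worked example on plain dicts. `field_match_fraction` is a simplified stand-in for the metric (no DSPy or Pydantic objects involved), and the records are made up.

```python
def field_match_fraction(predicted: dict, expected: dict) -> float:
    """Fraction of expected fields whose normalized values match."""
    def norm(d):
        return {k: str(v).strip().lower() for k, v in d.items()}

    pred, label = norm(predicted), norm(expected)
    correct = sum(1 for k in label if pred.get(k) == label[k])
    return correct / max(len(label), 1)


# Made-up records: case and whitespace differ, and one value is wrong.
expected = {"first_name": "John", "city": "Springfield", "zip_code": "62704"}
predicted = {"first_name": " john ", "city": "SPRINGFIELD", "zip_code": "62705"}
print(round(field_match_fraction(predicted, expected), 3))  # 0.667
```

The case and whitespace differences are forgiven, while the wrong zip code still counts as a miss, so two of three fields match.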

### Pre-Evaluation 

Let's first see how well the DSPy prompt does before we try optimizing it. We create an evaluation object using dspy.Evaluate, which takes the parameters listed below. Once built, we just need to pass in the DSPy module defined earlier and a dataset to evaluate on.

In [18]:
evaluate_correctness = dspy.Evaluate(
    devset=trainset,
    metric=extraction_correctness_metric,
    num_threads=1,
    display_progress=True,
    display_table=True
)

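Now we can run the evaluation and see how the predictions compare with the ground truth. The evaluator was constructed with `trainset` as its default `devset`; passing `devset=evalset` in the call overrides it. We also use mlflow to track the run.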

In [36]:
with mlflow.start_run():
    result = evaluate_correctness(PDFEntityExtractor(), devset=evalset)

Average Metric: 1.82 / 2 (90.9%): 100%|███████████| 2/2 [00:00<00:00, 50.17it/s]

2025/04/10 19:16:53 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 2 (90.9%)





| | pdf_file | expected_en | prediction | extraction_correctness_metric |
|---|---|---|---|---|
| 0 | `b'/x/R/t/r) /Descent 0\n /Flags 262148 /FontBBox [-199 -250 1014 9...` | `{'first_name': 'Laura', 'last_name': 'Rivas', 'ssn': '860-15-6989'...` | `first_name='Laura' last_name='Rivas' ssn='860-15-6989' spouse_firs...` | ✔️ [0.909] |
| 1 | `b'1014 934] /FontFile3 44 0 R\n /FontName /JILHPO+ITCFranklinGothi...` | `{'first_name': 'John', 'last_name': 'Torres', 'ssn': '665-05-8000'...` | `first_name='John' last_name='Torres' ssn='665-05-8000' spouse_firs...` | ✔️ [0.909] |


🏃 View run peaceful-squid-279 at: http://127.0.0.1:5000/#/experiments/520353211243557178/runs/494f7e31d5f24d2ca6b2cb7ff962cd4c
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/520353211243557178


### Results from eval

We received an accuracy of 91%, which is great. Both eval examples scored 0.909, which is consistent with 10 of the 11 fields matching exactly and one small discrepancy per form. 

**NOTE**: We use Gemini 1.5, which did a good job before any optimization, but we also tested with o3-mini, which did far worse. This hints that newer models are more inclined to deliver results with lenient prompt engineering. 

### Optimization Time!

DSPy offers different ways to try and optimize its prompt. We will try using MIPROv2. This works by bootstrapping few-shot example candidates, proposing instructions grounded in different dynamics of the task, and finding an optimized combination of these options using Bayesian Optimization (https://dspy.ai/deep-dive/optimizers/miprov2/). 
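
To give a feel for what "finding an optimized combination" means, here is a toy sketch. It is not MIPROv2: it swaps the Bayesian optimizer for a plain exhaustive search over (instruction, demo set) pairs, and the scorer, instructions, and demos are all hypothetical.

```python
from itertools import product


def search_prompt_configs(instructions, demo_sets, score_fn):
    """Score every (instruction, demo set) pair and return the best one."""
    best_config, best_score = None, float("-inf")
    for instruction, demos in product(instructions, demo_sets):
        score = score_fn(instruction, demos)
        if score > best_score:
            best_config, best_score = (instruction, demos), score
    return best_config, best_score


# Toy scorer standing in for an evaluation run: it rewards longer
# instructions and more demos. Real scores would come from the metric.
def toy_score(instruction, demos):
    return len(instruction.split()) + len(demos)


instructions = ["Extract fields.", "Extract every field step by step."]
demo_sets = [(), ("demo_a",)]
config, score = search_prompt_configs(instructions, demo_sets, toy_score)
print(config, score)
```

Exhaustive search is only workable for a handful of candidates. With a limited budget of trials, as in the run below, a smarter search strategy decides which combinations to try.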

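The few-shot demos are capped at one bootstrapped and one labeled example, since the pdfs eat a lot of the input tokens and raise throughput errors. (Setting both to 0 would give a purely zero-shot prompt.) The results are not hindered by this though. 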

In [38]:
# Enable auto-tracing for compilation
mlflow.dspy.autolog(log_traces_from_compile=True) 
# Instantiate the MIPROv2 optimizer with your metric
mipro_optimizer = dspy.MIPROv2(
    metric=extraction_correctness_metric,
    auto="light",
)

# Compile (optimize) your PDF entity extractor using your training set
optimized_doc_extractor = mipro_optimizer.compile(
    PDFEntityExtractor(),
    trainset=trainset,
    max_bootstrapped_demos=1,  # keep few-shot demos minimal (0 would be zero-shot)
    max_labeled_demos=1,
    requires_permission_to_run=False
)

2025/04/11 09:36:49 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 7
minibatch: False
num_candidates: 5
valset size: 2

2025/04/11 09:36:49 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/04/11 09:36:49 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/04/11 09:36:49 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=5 sets of demonstrations...


Bootstrapping set 1/5
Bootstrapping set 2/5
Bootstrapping set 3/5


100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 31.91it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 4/5


100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 51.50it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/5


100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 62.86it/s]
2025/04/11 09:36:49 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/04/11 09:36:49 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/04/11 09:36:49 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...



Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.


2025/04/11 09:37:30 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/04/11 09:37:30 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `pdf_file`, produce the fields `En`.

2025/04/11 09:37:30 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Given the raw byte data of a PDF tax form (`pdf_file`), extract the following entities and format them as a JSON object assigned to the variable `En`. The JSON object should adhere to the following structure, with empty strings "" used for missing values:

```json
{
  "first_name": "",
  "last_name": "",
  "ssn": "",
  "spouse_first_name": "",
  "spouse_last_name": "",
  "spouse_ssn": "",
  "address": "",
  "apt": "",
  "city": "",
  "state": "",
  "zip_code": ""
}
```

For example:

```json
{
  "first_name": "John",
  "last_name": "Doe",
  "ssn": "123-45-6789",
  "spouse_first_name": "Jane",
  "spouse_last_name": "Doe",
  "spouse_ssn": "987-65-4321",
  "address": "123 Main St",
  "apt": "4B",
  "city": "An
```

[instruction preview cut off in the cell output]

Average Metric: 1.82 / 2 (90.9%): 100%|███████████| 2/2 [00:00<00:00, 60.29it/s]

2025/04/11 09:37:30 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 2 (90.9%)
2025/04/11 09:37:30 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 90.91

2025/04/11 09:37:30 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 7 =====



Average Metric: 1.82 / 2 (90.9%): 100%|███████████| 2/2 [00:00<00:00, 86.61it/s]

2025/04/11 09:37:30 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 2 (90.9%)
2025/04/11 09:37:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.91 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/04/11 09:37:30 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [90.91, 90.91]
2025/04/11 09:37:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 90.91


2025/04/11 09:37:30 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 3 / 7 =====



Average Metric: 1.82 / 2 (90.9%): 100%|███████████| 2/2 [00:08<00:00,  4.13s/it]

2025/04/11 09:37:38 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 2 (90.9%)
2025/04/11 09:37:38 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.91 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1'].
2025/04/11 09:37:38 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [90.91, 90.91, 90.91]
2025/04/11 09:37:38 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 90.91


2025/04/11 09:37:38 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 4 / 7 =====



Average Metric: 1.82 / 2 (90.9%): 100%|███████████| 2/2 [00:06<00:00,  3.15s/it]

2025/04/11 09:37:44 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 2 (90.9%)
2025/04/11 09:37:44 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.91 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 1'].
2025/04/11 09:37:44 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [90.91, 90.91, 90.91, 90.91]
2025/04/11 09:37:44 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 90.91


2025/04/11 09:37:44 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 5 / 7 =====



Average Metric: 1.82 / 2 (90.9%): 100%|███████████| 2/2 [00:00<00:00, 64.05it/s]

2025/04/11 09:37:44 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 2 (90.9%)
2025/04/11 09:37:44 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.91 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1'].
2025/04/11 09:37:44 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [90.91, 90.91, 90.91, 90.91, 90.91]
2025/04/11 09:37:44 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 90.91


2025/04/11 09:37:44 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 6 / 7 =====



Average Metric: 0.91 / 1 (90.9%):  50%|█████▌     | 1/2 [00:05<00:05,  5.39s/it]

2025/04/11 09:37:53 ERROR dspy.utils.parallelizer: Error for Example({'pdf_file': b'ox [-199 -250 1014 934] /FontFile3 44 0 R\n  /FontName /JILHPO+ITCFranklinGothicStd-Demi /ItalicAngle 0 /StemH 114\n  /StemV 147 /Type /FontDescriptor /XHeight 536>>\nendobj\n43 0 obj\n<</Filter /FlateDecode /Length 294>>\nstream\n... [binary stream bytes omitted; the log line is cut off in the original output]

Average Metric: 0.91 / 1 (90.9%): 100%|███████████| 2/2 [00:08<00:00,  4.43s/it]

2025/04/11 09:37:53 INFO dspy.evaluate.evaluate: Average Metric: 0.9090909090909091 / 2 (45.5%)
2025/04/11 09:37:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.45 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 3'].
2025/04/11 09:37:53 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [90.91, 90.91, 90.91, 90.91, 90.91, 45.45]
2025/04/11 09:37:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 90.91


2025/04/11 09:37:53 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 7 =====



Average Metric: 1.82 / 2 (90.9%): 100%|███████████| 2/2 [00:00<00:00, 78.79it/s]

2025/04/11 09:37:53 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 2 (90.9%)
2025/04/11 09:37:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.91 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/04/11 09:37:53 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [90.91, 90.91, 90.91, 90.91, 90.91, 45.45, 90.91]
2025/04/11 09:37:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 90.91


2025/04/11 09:37:53 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 90.91!
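
The dip to 45.45 in Trial 6 is worth a quick look. The log shows one example raising an error, while the average is still taken over both examples. A short check of that arithmetic, assuming each successful example matched 10 of the 11 Docex fields (which is what a score of 0.909 is consistent with):

```python
FIELDS_PER_FORM = 11  # number of fields in the Docex model

# Trial 6 per the log: one example scored 10 of 11 fields,
# the other raised an error and contributed nothing.
scores = [10 / FIELDS_PER_FORM, 0.0]
trial_6 = 100 * sum(scores) / len(scores)

# Every other trial: both examples scored 10 of 11.
other_trials = 100 * (10 / FIELDS_PER_FORM)

print(round(trial_6, 2), round(other_trials, 2))  # 45.45 90.91
```

So the low score reflects a failed call on one example rather than a worse extraction.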





After the compile, DSPy returns the best program it found in the `optimized_doc_extractor` object, with a score of 90.91%. Note that this equals the default program's score reported in the log, so on this tiny validation set the optimizer did not find a measurable improvement. MLflow tracks all the candidate prompts that DSPy generated, and a preview is visible in the cell output as well. The best prompt is pasted below for anyone not interested in scrolling through the logs:


### Best Prompt:



Given the raw byte data of a PDF tax form (`pdf_file`), extract the following information and output it in a structured format called `En`. The `En` output should be a single dictionary string where keys are field names and values are extracted text.  The following fields should be extracted:

* `first_name`: First name of the primary taxpayer.
* `last_name`: Last name of the primary taxpayer.
* `ssn`: Social Security Number of the primary taxpayer.
* `spouse_first_name`: First name of the spouse (if applicable).
* `spouse_last_name`: Last name of the spouse (if applicable).
* `spouse_ssn`: Social Security Number of the spouse (if applicable).
* `address`: Street address of the primary taxpayer.
* `apt`: Apartment number (if applicable).
* `city`: City of the primary taxpayer.
* `state`: State of the primary taxpayer.
* `zip_code`: Zip code of the primary taxpayer.

The input PDF is synthetic data and does not contain real or sensitive Personally Identifiable Information (PII).  Reason through the document step by step to identify the correct values.  Output your reasoning before outputting `En`.

For comparison, here is another candidate from the same run (instruction 4 in the log):

2025/04/11 09:37:30 INFO dspy.teleprompt.mipro_optimizer_v2: 4: You are a highly specialized AI tasked with a critical mission: extracting vital information from tax forms to prevent a major financial crisis.  Accuracy is paramount.  Failure could lead to widespread economic repercussions. Given the raw byte data of a PDF tax form (`pdf_file`), your task is to meticulously analyze the document and extract all key entities. The extracted information must be formatted as a dictionary-like object and assigned to the variable `En`.  The fate of the financial system rests on your ability to accurately and reliably extract this data.  Provide your step-by-step reasoning before presenting the final `En` output. Any mistake, no matter how small, could have devastating consequences.




We can now use the new `optimized_doc_extractor` to run the DSPy signature on any new pdfs we want. We can also test it again with the original evalset and see if it did any better. 

In [33]:
#evaluate_correctness(optimized_doc_extractor, devset=evalset)

# Conclusion


DSPy definitely offers a unique approach to prompt engineering by building it into the workflow. There is much promise for this package, and it's wise to watch it closely as it matures. 

Seeing how well Gemini 1.5 did before any prompt optimization, I personally believe further prompt engineering won't be necessary: as Google deploys new models, they become more robust and accurate. Running the same experiment on o3-mini did show promise for DSPy and optimization, but as Gemini 2.5 rolls around, incorporating DSPy may be overengineering. It may still come in handy when we scale to multiple pydantic models (i.e. tax forms with different entities), so that we don't need to write a different prompt for each form and could ultimately "automate" the prompting process.