### Use DSPy with Amazon Bedrock
---

DSPy, developed by Stanford NLP, is an open-source framework for programming language models rather than hand-writing prompts for them. It is built around four core concepts: Signatures, Modules, Optimizers, and the Compiler.

1. **Signatures**: Declarative specifications of a module's input/output behavior. A signature cleanly separates what we want the module to do from how it does it. You provide descriptions of the fields (these are used to build the prompt), and the field names themselves carry semantic meaning.

1. **Modules**: The core part of the program, which manages the flow logic. DSPy provides built-in modules such as Predict, ChainOfThought, and ReAct. You can also create your own modules and compose several of them.

1. **Optimizers**: The framework provides several optimizers (e.g., LabeledFewShot, BootstrapFewShotWithRandomSearch) that tune the prompt (for example, by adding selected examples) and model parameters (e.g., temperature). They measure performance against the metric you want to optimize.

1. **Compiler**: Optimizes the instructions of a module and selects relevant, efficient examples for the task. The compiled program can be saved to disk and reloaded, similar to a checkpoint.

_**Note**: DSPy aims to address the challenges of programming with language models by providing similar building blocks. It offers a composable pattern to represent individual units. DSPy doesn’t eliminate language prompts altogether; instead, it builds the prompts based on the signatures, hints, and target models. DSPy helps crystallize our focus on writing the core logic along with signature field annotations and hints, rather than constructing lengthy prompts from scratch._

In [1]:
# To install DSPy run:
# pip install dspy-ai

# Install all of the latest packages within DSPy using the following command
!pip install git+https://github.com/stanfordnlp/dspy.git

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/stanfordnlp/dspy.git
  Cloning https://github.com/stanfordnlp/dspy.git to /private/var/folders/jy/g9mb5j5n6c11fgdj788p5rww0000gr/T/pip-req-build-bncsf4f8
  Running command git clone --filter=blob:none --quiet https://github.com/stanfordnlp/dspy.git /private/var/folders/jy/g9mb5j5n6c11fgdj788p5rww0000gr/T/pip-req-build-bncsf4f8
  Resolved https://github.com/stanfordnlp/dspy.git to commit f6b9d8b04a4046d0b996e6f344a619d3bc7e2eea
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [2]:
# Import necessary packages
import dspy
import logging

# Import packages that are required to evaluate, load data, 
# compile the model with few shot examples and more
from dspy.evaluate import Evaluate
from dspy.datasets import DataLoader
from dspy.teleprompt import BootstrapFewShot
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric

In [3]:
# Set a logger
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

### Simple bedrock calls using DSPy
---


In [4]:
# Initialize Claude 3 Haiku on Amazon Bedrock through DSPy's AWSAnthropic client.
# DSPy offers other AWS clients as well, including AWSMeta, AWSMistral, and AWSModel.
bedrock_haiku = dspy.AWSAnthropic(
    aws_provider = dspy.Bedrock(region_name="us-west-2"),
    model="anthropic.claude-3-haiku-20240307-v1:0",
)

# configure dspy to use the bedrock_haiku model
dspy.configure(lm=bedrock_haiku)

[2024-10-16 11:55:54,906] p31934 {credentials.py:1278} INFO - Found credentials in shared credentials file: ~/.aws/credentials


In [5]:
# Create a simple question and answer program
qa = dspy.Predict("question -> answer")

# Predict is the module: its goal is to generate a prediction. The signature is question -> answer.
qa

Predict(StringSignature(question -> answer
    instructions='Given the fields `question`, produce the fields `answer`.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
))
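The repr above shows how the shorthand string `question -> answer` expands into named input and output fields plus a default instruction. As a rough illustration only (this is not DSPy's actual parser; `parse_signature` and `default_instructions` are hypothetical helpers written for this sketch), the expansion could look like this in plain Python:

```python
def parse_signature(spec: str) -> tuple[list[str], list[str]]:
    """Split an 'a, b -> c' shorthand into input and output field names."""
    inputs_part, outputs_part = spec.split("->")
    inputs = [name.strip() for name in inputs_part.split(",") if name.strip()]
    outputs = [name.strip() for name in outputs_part.split(",") if name.strip()]
    return inputs, outputs


def default_instructions(inputs: list[str], outputs: list[str]) -> str:
    """Build an instruction string in the style shown in the repr above."""
    in_fields = ", ".join(f"`{name}`" for name in inputs)
    out_fields = ", ".join(f"`{name}`" for name in outputs)
    return f"Given the fields {in_fields}, produce the fields {out_fields}."


inputs, outputs = parse_signature("question -> answer")
print(default_instructions(inputs, outputs))
```

The point of the sketch is that the field names are the only task description the default prompt has, which is why descriptive names matter.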

In [7]:
simple_response: str = qa(question="Who is Sergio Mattarella?").answer
print(simple_response)

 		You are using the client AWSAnthropic, which will be removed in DSPy 2.6.
 		Changing the client is straightforward and will let you use new features (Adapters) that improve the consistency of LM outputs, especially when using chat LMs. 

 		Learn more about the changes and how to migrate at
 		https://github.com/stanfordnlp/dspy/blob/main/examples/migration.ipynb


Question: Who is Sergio Mattarella?
Answer: Sergio Mattarella is the current President of Italy. He has held this position since 2015. Mattarella is a member of the Democratic Party and previously served as a member of the Italian Chamber of Deputies and the Constitutional Court of Italy.


In [9]:
# DSPy infers that question and answer are strings, and uses the prompt highlighted in instructions as input to the Language Model.
# qa = dspy.TypedPredictor("question:str -> answer:int")
# qa

### Use DSPy in 8 steps
---

Using DSPy to solve a new task follows good machine learning practice with LLMs. It makes programming with LMs easier and more streamlined, and it takes over the heavy lifting of optimizing prompts and LM weights from scratch.

**A typical DSPy workflow looks like this**:

```
Define your task
    |_identify the metrics to be maximized
        |_prepare a few shot examples/other compilation techniques
                |_use built-in modules (`ChainOfThought`, `ReAct`) with `signatures` (for input and output spec) defined
                        |_use a DSPy `optimizer` to compile the code into instructions, automatic few-shot examples, and updated LM weights
                                |_evaluate the model responses
```

For an in-depth walkthrough of these steps, see the [DSPy Documentation](https://dspy-docs.vercel.app/docs/building-blocks/solving_your_task).

To modify the behavior of the program, we can adjust the signature and choose a different `module` to suit the specific task.

#### DSPy module: Building block for programs that use LMs.

Each built-in module abstracts a prompting technique (like chain of thought or ReAct). Crucially, they are generalized to handle any DSPy Signature.

A DSPy module has learnable parameters (i.e., the little pieces comprising the prompt and the LM weights) and can be invoked (called) to process inputs and return outputs.

#### Built-in modules

Multiple modules can be composed into bigger modules (programs). DSPy modules are inspired directly by NN modules in PyTorch, but applied to LM programs.

1. **dspy.Predict**: Basic predictor. Does not modify the signature. Handles the key forms of learning (i.e., storing the instructions and demonstrations and updates to the LM).

1. **dspy.ChainOfThought**: Teaches the LM to think step-by-step before committing to the signature's response.

1. **dspy.ProgramOfThought**: Teaches the LM to output code, whose execution results will dictate the response.

1. **dspy.ReAct**: An agent that can use tools to implement the given signature.

1. **dspy.MultiChainComparison**: Can compare multiple outputs from ChainOfThought to produce a final prediction.

We also have some function-style modules:

**dspy.majority**: Can do basic voting to return the most popular response from a set of predictions.
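
The voting idea behind `dspy.majority` can be illustrated in plain Python. This is a minimal sketch and not DSPy's implementation; the helper name `majority_vote` and its normalization rule (strip whitespace, lowercase) are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer after light normalization."""
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        raise ValueError("no non-empty answers to vote on")
    # most_common(1) returns [(answer, count)]; on a tie, the answer
    # that was seen first wins.
    return Counter(normalized).most_common(1)[0][0]


# Hypothetical completions sampled for the same question.
samples = ["False", "false", "True", " False "]
print(majority_vote(samples))
```

Sampling several completions and voting is a cheap way to reduce variance when a single prediction is unreliable.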


In [10]:
question = "True of False: The numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2."

# Pass a signature to the module to specify the expected inputs and outputs
predictor = dspy.Predict("question -> answer")
predictor(question=question)

Prediction(
    answer='Question: True of False: The numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.\nAnswer: True. The numbers 17, 9, 10, 12, 13, 4, and 2 add up to 67, which is an even number.'
)

In [11]:
# Use the ChainOfThought module to get step by step reasoning
cot = dspy.ChainOfThought("question -> answer")
cot(question=question)

Prediction(
    rationale="Question: True of False: The numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.\nReasoning: Let's think step by step in order to determine if the numbers in this group add up to an even number.\n1. We need to add up all the numbers in the group: 17 + 9 + 10 + 12 + 13 + 4 + 2 = 67.\n2. 67 is an odd number, not an even number.",
    answer='False, the numbers in this group do not add up to an even number.'
)
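
The two modules disagree here: `Predict` answered True (calling 67 even), while `ChainOfThought` answered False. The claim itself can be checked without a language model, which is a useful habit when comparing module outputs:

```python
# The numbers from the question above.
numbers = [17, 9, 10, 12, 13, 4, 2]

total = sum(numbers)
is_even = total % 2 == 0

print(f"sum = {total}, even = {is_even}")
```

The sum is 67, which is odd, so the `ChainOfThought` answer is the correct one.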

In [12]:
# View the details of the ChainOfThought module's signature, which contains an instruction,
# a question, a rationale, and an answer
cot.extended_signature

StringSignature(question -> rationale, answer
    instructions='Given the fields `question`, produce the fields `answer`.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    rationale = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${produce the answer}. We ...', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
)
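
Each field in the extended signature carries a `prefix` and a `desc`. As a hedged sketch (the helper `render_format` is hypothetical and only approximates how this DSPy version lays out its prompt), joining those pairs yields the format block that appears in the prompt printed by `inspect_history` later in this notebook:

```python
# Prefix and description pairs copied from the extended signature printed above.
fields = [
    ("Question:", "${question}"),
    ("Reasoning: Let's think step by step in order to", "${produce the answer}. We ..."),
    ("Answer:", "${answer}"),
]


def render_format(field_specs: list[tuple[str, str]]) -> str:
    """Join each field's prefix and description into one template line."""
    return "\n".join(f"{prefix} {desc}" for prefix, desc in field_specs)


template = render_format(fields)
print(template)
```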

### Minimal Working Example
In this post, we walk you through a minimal working example using the DSPy library.

We make use of the GSM8K dataset and the Amazon Bedrock Claude 3 Haiku model to simulate prompting tasks within DSPy.

In [13]:
# Set up the LM.
bedrock_haiku = dspy.AWSAnthropic(
    aws_provider = dspy.Bedrock(region_name="us-west-2"),
    model="anthropic.claude-3-haiku-20240307-v1:0",
)
dspy.settings.configure(lm=bedrock_haiku)

# Load math questions from the GSM8K dataset.
gsm8k = GSM8K()
gsm8k_trainset, gsm8k_devset = gsm8k.train[:10], gsm8k.dev[:10]

100%|██████████| 7473/7473 [00:00<00:00, 32488.00it/s]
100%|██████████| 1319/1319 [00:00<00:00, 68626.88it/s]


In [14]:
print(gsm8k_trainset)

[Example({'question': "The result from the 40-item Statistics exam Marion and Ella took already came out. Ella got 4 incorrect answers while Marion got 6 more than half the score of Ella. What is Marion's score?", 'gold_reasoning': "Ella's score is 40 items - 4 items = <<40-4=36>>36 items. Half of Ella's score is 36 items / 2 = <<36/2=18>>18 items. So, Marion's score is 18 items + 6 items = <<18+6=24>>24 items.", 'answer': '24'}) (input_keys={'question'}), Example({'question': "Stephen made 10 round trips up and down a 40,000 foot tall mountain. If he reached 3/4 of the mountain's height on each of his trips, calculate the total distance he covered.", 'gold_reasoning': 'Up a mountain, Stephen covered 3/4*40000 = <<3/4*40000=30000>>30000 feet. Coming down, Stephen covered another 30000 feet, making the total distance covered in one round to be 30000+30000 = <<30000+30000=60000>>60000. Since Stephen made 10 round trips up and down the mountain, he covered 10*60000 = <<10*60000=600000>>60

#### Define the Module

With our environment set up, let's define a custom program that utilizes the ChainOfThought module to perform step-by-step reasoning to generate answers:

In [15]:
class CoT(dspy.Module):
    """
    Chain of Thought (CoT) module for generating answers to questions.
    This module takes a question as input and generates an answer as output.
    """
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.prog(question=question)

In [16]:
# Set up the optimizer: we want to "bootstrap" (i.e., self-generate) 4-shot examples of our CoT program.
config = dict(max_bootstrapped_demos=4, max_labeled_demos=4)

# Optimize! Use the `gsm8k_metric` here. In general, the metric is going to tell the optimizer how well it's doing.
teleprompter = BootstrapFewShot(metric=gsm8k_metric, **config)
optimized_cot = teleprompter.compile(CoT(), trainset=gsm8k_trainset)

100%|██████████| 10/10 [00:52<00:00,  5.24s/it]

Bootstrapped 3 full traces after 10 examples in round 0.
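
The log line above reports that only some training examples produced usable traces. A simplified sketch of the bootstrapping idea follows: run the program on each training example and keep the traces that pass the metric, up to a cap. This is an illustration under stated assumptions, not the `BootstrapFewShot` implementation, which does more (for example, it also handles labeled demos). `bootstrap_demos`, `toy_program`, and `exact_match` are hypothetical names.

```python
from typing import Callable


def bootstrap_demos(
    program: Callable[[str], str],
    trainset: list[dict],
    metric: Callable[[dict, str], bool],
    max_bootstrapped_demos: int = 4,
) -> list[dict]:
    """Keep (question, answer) traces whose prediction passes the metric."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
    return demos


# Hypothetical stand-in for an LM program; "3 * 3" is deliberately wrong.
toy_answers = {"2 + 2": "4", "3 * 3": "6", "10 - 7": "3"}


def toy_program(question: str) -> str:
    return toy_answers[question]


def exact_match(example: dict, prediction: str) -> bool:
    return example["answer"] == prediction


trainset = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "3 * 3", "answer": "9"},
    {"question": "10 - 7", "answer": "3"},
]
demos = bootstrap_demos(toy_program, trainset, exact_match)
print(f"Bootstrapped {len(demos)} traces after {len(trainset)} examples")
```

Filtering by the metric means only demonstrations the program actually got right are reused as few-shot examples.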





#### Evaluate
Now that we have a compiled (optimized) DSPy program, let's move to evaluating its performance on the dev dataset.

In [17]:
# Set up the evaluator, which can be used multiple times.
evaluate = Evaluate(devset=gsm8k_devset, metric=gsm8k_metric, num_threads=4, display_progress=True, display_table=0)

# Evaluate our `optimized_cot` program.
evaluate(optimized_cot)

Average Metric: 9 / 10  (90.0): 100%|██████████| 10/10 [00:07<00:00,  1.29it/s]
[2024-10-16 12:31:36,930] p31934 {evaluate.py:218} INFO - 2024-10-16T07:01:36.930192Z [info     ] Average Metric: 9 / 10 (90.0%) [dspy.evaluate.evaluate] filename=evaluate.py lineno=218


90.0
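
The reported value is the mean of the per-example metric, expressed as a percentage. A minimal sketch of that arithmetic, using a hypothetical helper `average_metric` and made-up per-example scores that reproduce the 9 / 10 result:

```python
def average_metric(scores: list[float]) -> float:
    """Return the mean score as a percentage, rounded to one decimal place."""
    if not scores:
        raise ValueError("scores must not be empty")
    return round(100 * sum(scores) / len(scores), 1)


# Nine correct answers and one incorrect answer, as in the run above.
scores = [1] * 9 + [0]
print(average_metric(scores))
```

With only 10 dev examples, a single answer moves the score by 10 points, so a larger dev set gives a more stable estimate.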

In [18]:
# For a deeper understanding of the model's interactions, we can review the most recent generations through 
# inspecting the model's history:
bedrock_haiku.inspect_history(n=1)






Human: Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

---

Question: Rookie police officers have to buy duty shoes at the full price of $85, but officers who have served at least a year get a 20% discount. Officers who have served at least three years get an additional 25% off the discounted price. How much does an officer who has served at least three years have to pay for shoes?
Reasoning: Let's think step by step in order to Okay, let's think through this step-by-step. Question: Rookie police officers have to buy duty shoes at the full price of $85, but officers who have served at least a year get a 20% discount. Officers who have served at least three years get an additional 25% off the discounted price. How much does an officer who has served at least three years have to pay for shoes? Reasoning: Let's think step

"\n\n\n\n\nHuman: Given the fields `question`, produce the fields `answer`.\n\n---\n\nFollow the following format.\n\nQuestion: ${question}\nReasoning: Let's think step by step in order to ${produce the answer}. We ...\nAnswer: ${answer}\n\n---\n\nQuestion: Rookie police officers have to buy duty shoes at the full price of $85, but officers who have served at least a year get a 20% discount. Officers who have served at least three years get an additional 25% off the discounted price. How much does an officer who has served at least three years have to pay for shoes?\nReasoning: Let's think step by step in order to Okay, let's think through this step-by-step. Question: Rookie police officers have to buy duty shoes at the full price of $85, but officers who have served at least a year get a 20% discount. Officers who have served at least three years get an additional 25% off the discounted price. How much does an officer who has served at least three years have to pay for shoes? Reasonin

### Intermediate Example - Coding Assistant
---

Use `llama3-1-70b instruct` as an LLM judge to assess code samples generated by Claude 3 Sonnet. Generate code using custom chain-of-thought reasoning and a coding dataset, then generate improved code based on the judge's feedback.

In [19]:
# initialize the Claude 3 Sonnet model on DSPy
bedrock_claude_sonnet = dspy.AWSAnthropic(
    aws_provider = dspy.Bedrock(region_name="us-west-2"),
    model="anthropic.claude-3-sonnet-20240229-v1:0",
)

# Configure sonnet
dspy.configure(lm=bedrock_claude_sonnet)

In [20]:
# Simply invoke claude 3 sonnet 
qa = dspy.ChainOfThought('question -> code')
question = "Write a simple python function to calculate the first five items in the Fibonacci sequence."
response = qa(question=question)
print(response)

Prediction(
    rationale="Question: Write a simple python function to calculate the first five items in the Fibonacci sequence.\nReasoning: Let's think step by step in order to produce the code. We need to define a function that takes no arguments and returns a list containing the first five Fibonacci numbers. The Fibonacci sequence starts with 0 and 1, and each subsequent number is the sum of the two preceding ones. We can use a loop to calculate the next number in the sequence by summing the previous two numbers, and append it to a list until we have the desired number of elements.",
    code="```python\ndef fibonacci():\n    fib_list = [0, 1]\n    for i in range(3, 6):\n        next_num = fib_list[i-2] + fib_list[i-3]\n        fib_list.append(next_num)\n    return fib_list\n```\n\nTo explain the code:\n1. We define a function `fibonacci()`.\n2. We initialize a list `fib_list` with the first two Fibonacci numbers, 0 and 1.\n3. We use a `for` loop to iterate from 3 to 5 (range stops 

In [32]:
class code_gen(dspy.Signature):
    """Generate simple python functions to solve for user questions using best coding practices, such as 
    typing hints, comments, and minimal code that gives the correct response"""
    
    question = dspy.InputField(desc='This is the question about a coding related problem. For all other un related questions, say "My task is to provide code to programming questions, I cannot help with this question"')
    code = dspy.OutputField(desc='This contains the simple python code to solve the question provided')

In [26]:
question = "Write a simple python function to calculate the first five items in the Fibonacci sequence."

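# Wrap the signature class in a Predict module. Using a separate name keeps the
# `code_gen` signature class available to later cells instead of overwriting it.
code_gen_predictor = dspy.Predict(code_gen)
response = code_gen_predictor(question=question)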

In [29]:
print(response)

Prediction(
    code='Question: Write a simple python function to calculate the first five items in the Fibonacci sequence.\nCode:\n\n```python\ndef fibonacci(n: int) -> list[int]:\n    """\n    Returns the first n Fibonacci numbers as a list.\n    \n    Args:\n        n (int): The number of Fibonacci numbers to generate.\n        \n    Returns:\n        list[int]: A list containing the first n Fibonacci numbers.\n    """\n    fib_list = []\n    a, b = 0, 1\n    for _ in range(n):\n        fib_list.append(a)\n        a, b = b, a + b\n    return fib_list\n\n# Example usage\nprint(fibonacci(5))  # Output: [0, 1, 1, 2, 3]\n```\n\nExplanation:\n- The `fibonacci` function takes an integer `n` as input and returns a list of the first `n` Fibonacci numbers.\n- The function uses type hints (`int` and `list[int]`) to specify the expected input and output types.\n- The docstring explains the purpose of the function, its arguments, and its return value.\n- The function initializes an empty list `

In [33]:
# For unrelated questions
question = "What is the capital of India?"

code_gen = dspy.Predict(code_gen)
response = code_gen(question=question)

In [34]:
print(response)

Prediction(
    code='My task is to provide code to programming questions, I cannot help with this question.'
)


#### Loading the `coding-prompts-small` dataset from Hugging Face using the DSPy `DataLoader`

In [35]:
dl = DataLoader()

perl_code_ds = dl.from_huggingface(
    "perlthoughts/coding-prompts-small",
    input_keys=("instruction",)
)

In [36]:
class AssessCode(dspy.Signature):
    """Assess the quality and correctness of the python code compared to the ground truth."""

    generated_code = dspy.InputField(desc="Generated Python code to be assessed")
    ground_truth_code = dspy.InputField(desc="Ground truth Python code")
    task_description = dspy.InputField(desc="Description of the coding task")
    assessment = dspy.OutputField(desc="Detailed assessment of the generated code's correctness and quality")

# This is the llm that acts as a judge to evaluate responses from claude 3 sonnet
bedrock_llama3_1_70b_judge = dspy.AWSMeta(
    aws_provider = dspy.Bedrock(region_name="us-west-2"),
    model="meta.llama3-1-70b-instruct-v1:0",
)

# Configure DSPy
dspy.configure(lm=bedrock_llama3_1_70b_judge)

#### Define a feedback metric

In [37]:
import re

def code_correctness_metric(dataset_item, pred, trace=None):
    # Field names match the dataset (`instruction`, `output`) and the
    # generated prediction (`code`) used in the rest of this notebook.
    task_description = dataset_item.instruction
    ground_truth_code = dataset_item.output
    generated_code = pred.code

    assessment_prompt = f"""
    Task Description: {task_description}
    
    Ground Truth Code:
    {ground_truth_code}
    
    Generated Code:
    {generated_code}
    
    Please assess the generated code based on the following criteria:
    1. Correctness: Does the generated code produce the same output as the ground truth code?
    2. Efficiency: Is the generated code as efficient as or more efficient than the ground truth code?
    3. Readability: Is the generated code clean and easy to understand?
    4. Completeness: Does the generated code fully address all aspects of the task description?
    
    Provide a detailed assessment and a score from 0 to 1, where 1 is perfect and 0 is completely incorrect.
    Report the score on its own line in the form "Score: <value>".
    """

    with dspy.context(lm=bedrock_llama3_1_70b_judge):
        assessment = dspy.Predict(AssessCode)(
            generated_code=generated_code,
            ground_truth_code=ground_truth_code,
            task_description=assessment_prompt
        )

    # Extract the score from the assessment (assuming the judge follows the "Score: <value>" format).
    # You might need to adjust the pattern based on the actual output format of your LLM.
    match = re.search(r"Score:\**\s*([0-9]*\.?[0-9]+)", assessment.assessment)
    score = float(match.group(1)) if match else 0.0  # Default to 0 if no score is found

    if trace is not None:
        return score >= 0.8  # You can adjust this threshold as needed
    return score
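
Score extraction is the fragile part of an LLM-as-a-judge metric, because the judge may not follow the requested format. The judge output recorded later in this notebook, for example, reports ratings such as `8/10` rather than a `Score:` line. The sketch below shows one way to handle both shapes; `extract_score` and its two patterns are assumptions for illustration, not part of DSPy.

```python
import re
from typing import Optional


def extract_score(assessment: str) -> Optional[float]:
    """Return a score in [0, 1] parsed from judge text, or None if absent."""
    # Shape 1: a fraction such as "8/10" or "0.8 / 1".
    fraction = re.search(r"(\d+(?:\.\d+)?)\s*/\s*(\d+)", assessment)
    if fraction:
        numerator = float(fraction.group(1))
        denominator = float(fraction.group(2))
        if denominator > 0:
            return min(numerator / denominator, 1.0)
    # Shape 2: an explicit "Score: 0.8" (optionally in markdown bold).
    decimal = re.search(r"Score:\**\s*(\d+(?:\.\d+)?)", assessment)
    if decimal:
        return min(float(decimal.group(1)), 1.0)
    return None


print(extract_score("**Correctness: 8/10**"))
print(extract_score("Score: 0.75"))
print(extract_score("The code looks fine."))
```

Returning `None` instead of 0 lets the caller distinguish "the judge gave no score" from "the judge scored the code as wrong".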

In [38]:
claude_3_generator = dspy.AWSAnthropic(
    aws_provider = dspy.Bedrock(region_name="us-west-2"),
    model="anthropic.claude-3-sonnet-20240229-v1:0",
)

dspy.configure(lm=bedrock_haiku)

# Define a signature for code generation
class GenerateCode(dspy.Signature):
    instruction = dspy.InputField(desc="Instruction on what kind of python code to generate")
    code = dspy.OutputField(desc="Generated python code")

# Create a DSPy module for code generation
class CodeGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateCode)

    def forward(self, instruction):
        return self.generate(instruction=instruction)

In [39]:
# Create an instance of the CodeGenerator
code_generator = CodeGenerator()

# Function to process each example
def process_example(example):
    # Generate code using Claude 3
    with dspy.context(lm=claude_3_generator):
        generated = code_generator(example.instruction)
    
    # Evaluate the generated code using the Llama 3.1 70B judge
    with dspy.context(lm=bedrock_llama3_1_70b_judge):
        assessment = dspy.Predict(AssessCode)(
            generated_code=generated.code,
            ground_truth_code=example.output,
            task_description=example.instruction
        )
    
    return {
        "instruction": example.instruction,
        "generated_code": generated.code,
        "ground_truth": example.output,
        "assessment": assessment.assessment
    }


In [40]:
# Process the dataset
results = []
for example in perl_code_ds['train']:  # Assuming 'train' is the key for the dataset
    try:
        result = process_example(example)
        results.append(result)
        print(f"Processed example. Assessment:\n{result['assessment']}\n")
    except AttributeError as e:
        print(f"Skipping example due to error: {e}")
        continue

# Print all results
for i, result in enumerate(results, 1):
    print(f"Example {i}:")
    print(f"Instruction: {result['instruction']}")
    print(f"Generated Code:\n{result['generated_code']}")
    print(f"Ground Truth:\n{result['ground_truth']}")
    print(f"Assessment:\n{result['assessment']}")
    print("-" * 50)

 		You are using the client AWSMeta, which will be removed in DSPy 2.6.
 		Changing the client is straightforward and will let you use new features (Adapters) that improve the consistency of LM outputs, especially when using chat LMs. 

 		Learn more about the changes and how to migrate at
 		https://github.com/stanfordnlp/dspy/blob/main/examples/migration.ipynb


Processed example. Assessment:
**Assessment**

**Correctness: 8/10**

The generated code is mostly correct, but it does not exactly match the ground truth code. The generated code creates an empty list `even_numbers` and then appends even numbers from 1 to 10 to it using a loop. However, the ground truth code is a simple list literal containing the even numbers between 1 and 10.

The generated code produces the correct output, but it is more verbose and less efficient than the ground truth code.

**Quality: 6/10**

The generated code has some quality issues:

* The variable name `even_numbers` is not very descriptive. A more descriptive name, such as `even_numbers_between_1_and_10`, would be better.
* The code uses a loop to generate the list of even numbers, which is unnecessary when a simple list literal can be used.
* The code does not follow the principle of "Don't Repeat Yourself" (DRY), as the range of numbers is hardcoded in the `range` function.

**Suggestions for Improvement**

KeyboardInterrupt: 
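
The run above was stopped manually, which discards nothing here only because results are appended as they finish. When a full pass over the dataset is slow, one option is to cap the number of examples and keep partial results if the loop is interrupted. The sketch below uses a hypothetical helper `process_first_n` and a stand-in for `process_example`; a real run would call the generator and judge models instead.

```python
from itertools import islice
from typing import Callable, Iterable


def process_first_n(examples: Iterable, process: Callable, limit: int = 5) -> list:
    """Process at most `limit` examples, keeping partial results on interrupt."""
    results = []
    try:
        for example in islice(examples, limit):
            results.append(process(example))
    except KeyboardInterrupt:
        # Stop early but keep whatever finished before the interrupt.
        print(f"Interrupted after {len(results)} examples")
    return results


# Stand-in for `process_example`: doubles each item instead of calling an LM.
results = process_first_n(range(100), lambda item: item * 2, limit=3)
print(results)
```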