# Unlocking AI's Potential: From Monolithic LLMs to Modular Programs with DSPy

Large Language Models (LLMs) like GPT-4 have changed how AI applications are built. They are remarkably capable, but using them directly brings its own challenges. In this article, we'll trace the path from prompting a monolithic LLM directly to building compound AI systems, and show how DSPy's modular approach helps avoid common pitfalls.

## Table of Contents

1. [Introduction to Monolithic LLMs](#introduction-to-monolithic-llms)
2. [Compound AI Systems: RAG and Multi-Hop](#compound-ai-systems-rag-and-multi-hop)
3. [The Limitations of Hand-Crafted Text Prompts](#the-limitations-of-hand-crafted-text-prompts)
4. [The Five Roles Coupled in Prompts](#the-five-roles-coupled-in-prompts)
5. [Introducing DSPy: A Modular Approach](#introducing-dspy-a-modular-approach)
6. [Conclusion](#conclusion)
7. [Getting Started with DSPy](#getting-started-with-dspy)

---

## Introduction to Monolithic LLMs

Imagine having a powerful tool that can generate human-like text, answer questions, and even write code. That's what Large Language Models (LLMs) like GPT-4 offer. Many developers and researchers initially use these models directly by sending them prompts and receiving outputs.

**Pros:**

- **Ease of Use:** You can get started quickly by sending a prompt and getting a response.
- **Versatility:** LLMs can handle a wide range of tasks.

**Cons:**

- **Lack of Transparency:** It's often unclear how the model arrives at a specific answer.
- **Error Handling:** Correcting mistakes or refining outputs can be challenging.
- **Scalability Issues:** As tasks become more complex, managing prompts becomes unwieldy.

---

## Compound AI Systems: RAG and Multi-Hop

To address some limitations of monolithic LLMs, developers have started creating **compound AI systems**. These systems break down tasks into multiple steps or modules, often incorporating external tools or data sources.

### Retrieval-Augmented Generation (RAG)

RAG combines LLMs with retrieval systems. Instead of relying solely on the model's internal knowledge, the system retrieves relevant information from a database or the internet to provide more accurate and up-to-date answers.

**Benefits:**

- **Improved Accuracy:** By accessing external data, the model can provide more precise answers.
- **Up-to-Date Information:** Retrieval allows the system to incorporate recent data not present in the training set.
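
The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration rather than a production retriever: the `DOCS` corpus, the word-overlap scoring, and the `build_prompt` helper are all assumptions made for the example, and the final call to a model is left out.

```python
import re

# Hypothetical in-memory corpus standing in for a database or search index.
DOCS = [
    "DSPy is a framework for programming language models.",
    "The Nile is a major river in northeastern Africa.",
    "Retrieval-augmented generation adds retrieved passages to the prompt.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "What is retrieval-augmented generation?"
passages = retrieve(question, DOCS)
prompt = build_prompt(question, passages)
print(prompt)
```

A real system would swap the overlap score for a search index or embedding lookup, but the shape stays the same: retrieve first, then hand the passages to the model alongside the question.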

### Multi-Hop Reasoning

Multi-hop systems involve the model making a series of reasoning steps to arrive at an answer. For example, to answer "What is the capital of the country where the Nile River originates?", the system might:

1. Determine where the Nile River originates.
2. Identify the country associated with that location.
3. Find the capital of that country.

**Benefits:**

- **Deeper Understanding:** Breaking down questions into steps allows for more complex reasoning.
- **Modularity:** Each step can be handled by different components or prompts.
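
The three steps above can be sketched as a chain of lookups, where each hop's output becomes the next hop's input. The dictionaries below are hypothetical stand-ins for a retriever or model call, and the river, lake, and country names are placeholders rather than real geography.

```python
from typing import Callable

# Hypothetical knowledge sources; in a real system each hop would call a
# retriever or a language model instead of reading a dictionary.
RIVER_SOURCE = {"River A": "Lake X"}
LOCATION_COUNTRY = {"Lake X": "Country Y"}
COUNTRY_CAPITAL = {"Country Y": "City Z"}

def make_hop(table: dict[str, str]) -> Callable[[str], str]:
    """Wrap a lookup table as a single reasoning step."""
    def hop(query: str) -> str:
        return table[query]
    return hop

def multi_hop(start: str, hops: list[Callable[[str], str]]) -> tuple[str, list[str]]:
    """Run the hops in order and keep the intermediate answers as a trace."""
    trace = [start]
    for hop in hops:
        trace.append(hop(trace[-1]))
    return trace[-1], trace

answer, trace = multi_hop(
    "River A",
    [make_hop(RIVER_SOURCE), make_hop(LOCATION_COUNTRY), make_hop(COUNTRY_CAPITAL)],
)
print(answer)  # City Z
print(trace)   # ['River A', 'Lake X', 'Country Y', 'City Z']
```

Keeping the trace is what makes the chain debuggable: when the final answer is wrong, the intermediate values show which hop went astray.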

---

## The Limitations of Hand-Crafted Text Prompts

While compound AI systems offer significant advantages, many developers still rely on hand-crafted text prompts to guide the LLMs. This approach introduces several challenges:

1. **Sensitivity to Prompts:**
   - Small changes in the prompt can lead to vastly different outputs.
   - Crafting the "perfect" prompt requires significant trial and error.

2. **Model Dependency:**
   - Prompts optimized for one model might not work well with another.
   - Upgrading to a newer or different model can cause the system to break or behave unpredictably.

3. **Scalability Issues:**
   - As the number of tasks or complexity increases, managing and updating prompts becomes impractical.

4. **Lack of Reusability:**
   - Hand-crafted prompts are often specific to a particular task and hard to generalize.

5. **Error Handling:**
   - Debugging and refining prompts to handle exceptions or edge cases is cumbersome.

---

## The Five Roles Coupled in Prompts

In traditional prompt-based systems, a single prompt often tries to fulfill multiple roles simultaneously:

1. **Instruction:** Tells the model what to do.
2. **Demonstration:** Provides examples of desired input-output pairs.
3. **Context:** Supplies background information or data required for the task.
4. **Constraints:** Specifies rules or guidelines the output should follow.
5. **Formatting:** Dictates the structure or style of the output.

Coupling these roles into a single prompt leads to:

- **Complexity:** The prompt becomes long and difficult to manage.
- **Inflexibility:** Adjusting one aspect (e.g., changing the instruction) may inadvertently affect others.
- **Maintenance Challenges:** Updating prompts for new requirements or models is error-prone.
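
One way to see the coupling is to write the five roles as separate fields and join them only at the last moment. The `PromptParts` class and its `render` method below are hypothetical names chosen for illustration; this is a sketch of the separation, not how any particular library stores prompts.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PromptParts:
    """The five roles, kept as separate fields instead of one string."""
    instruction: str
    demonstrations: tuple[tuple[str, str], ...] = ()
    context: str = ""
    constraints: tuple[str, ...] = ()
    output_format: str = ""

    def render(self, question: str) -> str:
        """Join the non-empty parts into a single prompt string."""
        sections = [self.instruction]
        for q, a in self.demonstrations:
            sections.append(f"Example question: {q}\nExample answer: {a}")
        if self.context:
            sections.append(f"Context: {self.context}")
        if self.constraints:
            sections.append("Rules:\n" + "\n".join(f"- {c}" for c in self.constraints))
        if self.output_format:
            sections.append(f"Format: {self.output_format}")
        sections.append(f"Question: {question}")
        return "\n\n".join(sections)

base = PromptParts(
    instruction="Answer the question using only the context.",
    context="The policy lists name and email as collected data.",
    constraints=("Quote the policy where possible.",),
    output_format="One short sentence.",
)
# Changing one role leaves the other four untouched.
stricter = replace(base, constraints=base.constraints + ("Do not speculate.",))
prompt = stricter.render("What data is collected?")
print(prompt)
```

With a single hand-written string, the same change would mean editing the prompt in place and rechecking that the instruction, context, and format text still read correctly around it.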

---

## Introducing DSPy: A Modular Approach

**DSPy** offers a solution by encapsulating these different roles into a modular program. Instead of relying on monolithic prompts, DSPy allows developers to define structured modules that handle specific aspects of the AI system.

### Key Features of DSPy

1. **Modularity:**
   - Breaks down tasks into separate modules.
   - Each module has a clear purpose and interface.

2. **Signatures:**
   - Define input and output behavior for modules.
   - Promote consistency and reusability.

3. **Adapters:**
   - Handle formatting and parsing of inputs and outputs.
   - Ensure compatibility between modules and models.

4. **Optimizers:**
   - Automatically refine prompts and configurations.
   - Adapt to different models without manual intervention.

5. **Assertions:**
   - Enforce constraints and validate outputs.
   - Improve reliability and error handling.
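
The idea behind the assertions feature can be sketched without the library: run a step, check its output against a constraint, and retry with feedback if the check fails. Everything below is a hypothetical stand-in, including the `fake_model` function that replaces a real LLM call; DSPy's own assertion API is not shown here.

```python
from typing import Callable, Optional

# The three labels used by the rubric later in this article.
ALLOWED = {"Does Not Meet", "Meets with Reservations", "Meets Expectations"}

def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: answers loosely unless the prompt carries feedback."""
    return "Meets Expectations" if "Feedback:" in prompt else "Looks fine"

def run_with_assertion(
    model: Callable[[str], str],
    prompt: str,
    is_valid: Callable[[str], bool],
    feedback: str,
    max_retries: int = 2,
) -> Optional[str]:
    """Call the model, validate the output, and retry with feedback on failure."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        output = model(attempt_prompt)
        if is_valid(output):
            return output
        # Append the feedback so the next attempt knows what went wrong.
        attempt_prompt = f"{prompt}\nFeedback: {feedback}"
    return None  # every attempt failed validation

result = run_with_assertion(
    fake_model,
    "Rate the policy.",
    is_valid=lambda out: out in ALLOWED,
    feedback=f"Answer with one of {sorted(ALLOWED)}.",
)
print(result)  # Meets Expectations
```

The constraint lives in code rather than in the prompt text, so it can be tested on its own and reused across modules.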

### Benefits of Using DSPy

- **Flexibility:**
  - Easily swap out models or modules without breaking the system.
  - Adjust specific components without affecting others.

- **Scalability:**
  - Manage complexity by adding or updating modules as needed.
  - Handle more sophisticated tasks through composition.

- **Maintainability:**
  - Simplify updates and debugging.
  - Enhance transparency and understandability of the system.

### Example: Building a Privacy Policy Evaluation Agent

Imagine you want to evaluate online Privacy Policies based on a detailed rubric, such as:

1. **Data Collection:**
   - **DCQ1:** Do the policies list all data collected?

2. **Data Security**
3. **Third-party Data Sharing**
4. **Advertising Data**

**Challenges with Traditional Prompts:**

- Crafting prompts that cover all evaluation criteria is complex.
- Adjusting the prompt for changes in the rubric is tedious.
- The prompt may not generalize well to different models.

**Using DSPy:**

- **Define Modules:** Create separate modules for each section or sub-question.
- **Use Signatures:** Clearly specify what each module expects as input and what it should output.
- **Optimize Automatically:** Let DSPy's optimizers tune the prompts and few-shot examples for different models or requirements.
- **Maintain Easily:** Update modules independently as the rubric evolves.

---

## Conclusion

The evolution from monolithic LLMs to compound AI systems marks a significant advancement in AI development. However, relying on hand-crafted text prompts introduces limitations that hinder scalability, flexibility, and maintainability.

DSPy addresses these challenges by offering a modular framework that encapsulates the various roles traditionally bundled into prompts. By separating concerns and providing tools for optimization and validation, DSPy empowers developers to build robust, adaptable, and efficient AI systems.

---

## Getting Started with DSPy

Interested in exploring DSPy further? Here's how you can begin:

1. **Visit the DSPy GitHub Repository:**
   - Find the code, documentation, and examples to kickstart your journey.
   - [DSPy on GitHub](https://github.com/stanfordnlp/dspy)

2. **Check Out Tutorials and Guides:**
   - Learn how to implement modules, define signatures, and use optimizers.
   - Access tutorials in the repository or on the DSPy website.

3. **Join the Community:**
   - Connect with other developers and contributors.
   - Share your experiences and learn from others.

4. **Experiment and Build:**
   - Start by building simple modules.
   - Gradually compose more complex systems as you become comfortable.

By embracing DSPy, you'll be well-equipped to harness the full potential of AI, creating systems that are not only powerful but also adaptable and maintainable.

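The following code implements the DCQ1 check from the privacy policy example above.

```python
import os
from typing import Dict, List

import dspy
from pydantic import BaseModel, Field

# Configure DSPy to use OpenAI's GPT-4o mini.
# Read the API key from the environment; never hard-code a secret in a notebook.
# Note: dspy.OpenAI is the older client interface; recent DSPy releases use dspy.LM instead.
lm = dspy.OpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)
```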

Load the policy text and define the input and output models:

```python
# Load the privacy policy from file
def load_policy_text(file_path: str) -> str:
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# The file must exist in the working directory, otherwise open() raises FileNotFoundError
policy_text = load_policy_text('stepik-privacy-policy.txt')

# Define input and output models
class PolicyInput(BaseModel):
    policy_text: str = Field(description="The full text of the privacy policy")

class EvaluationCriteria(BaseModel):
    criteria: List[str]

class PolicyOutput(BaseModel):
    evaluation: str = Field(description="One of: 'Does Not Meet', 'Meets with Reservations', 'Meets Expectations'")
    criteria_met: List[str] = Field(description="List of specific criteria met for the given evaluation")
    citations: List[str] = Field(description="At least 3 citations from the policy supporting the evaluation")
```

Define the evaluation criteria and the signature:

```python
# Define the evaluation criteria
EVALUATION_CRITERIA = {
    "Does Not Meet": EvaluationCriteria(criteria=[
        "Policies do not list data collected",
        "Policies list data collected not crucial to app functionality"
    ]),
    "Meets with Reservations": EvaluationCriteria(criteria=[
        "Policies give a broad statement of data collected",
        "Policies are not clear on data collected crucial to app functionality"
    ]),
    "Meets Expectations": EvaluationCriteria(criteria=[
        "Policies list the data collected",
        "Policies state no data collected"
    ])
}

# Define the signature for our agent
class DCQ1Signature(dspy.Signature):
    """Evaluate if the privacy policy lists all data collected, considering specific criteria for each evaluation level."""

    input: PolicyInput = dspy.InputField()
    criteria: Dict[str, EvaluationCriteria] = dspy.InputField(default=EVALUATION_CRITERIA)
    output: PolicyOutput = dspy.OutputField()
```

Create the agent and a helper that runs several passes and returns the majority opinion:

```python
# Create the agent using TypedPredictor.
# Note: TypedPredictor belongs to older DSPy releases; newer ones may need dspy.Predict instead.
dcq1_agent = dspy.TypedPredictor(DCQ1Signature)


def evaluate_dcq1(policy_text: str, num_passes: int = 3) -> PolicyOutput:
    """Run multiple passes of the agent and return the majority opinion."""
    results = []
    for _ in range(num_passes):
        result = dcq1_agent(input=PolicyInput(policy_text=policy_text), criteria=EVALUATION_CRITERIA)
        results.append(result.output)

    # Count evaluations
    evaluation_counts = {
        "Does Not Meet": 0,
        "Meets with Reservations": 0,
        "Meets Expectations": 0
    }
    criteria_counts = {}
    all_citations = []

    for result in results:
        # .get() avoids a KeyError if the model returns an unexpected label
        evaluation_counts[result.evaluation] = evaluation_counts.get(result.evaluation, 0) + 1
        # Count each criterion at most once per pass
        for criterion in set(result.criteria_met):
            criteria_counts[criterion] = criteria_counts.get(criterion, 0) + 1
        # Keep citations in first-seen order, without duplicates
        for citation in result.citations:
            if citation not in all_citations:
                all_citations.append(citation)

    # Determine majority evaluation (ties go to the label listed first above)
    majority_evaluation = max(evaluation_counts, key=evaluation_counts.get)

    # Select criteria that appear in the majority of passes
    majority_criteria = [criterion for criterion, count in criteria_counts.items() if count > num_passes / 2]

    return PolicyOutput(
        evaluation=majority_evaluation,
        criteria_met=majority_criteria,
        citations=all_citations[:5]  # Limit to 5 citations
    )
```
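
The voting logic in `evaluate_dcq1` can be checked without calling a model. The sketch below reimplements only the aggregation step on plain dictionaries; the `aggregate` helper and the sample passes are hypothetical and exist only to show how the majority label and majority criteria are chosen.

```python
from collections import Counter

def aggregate(passes: list[dict]) -> dict:
    """Pick the most common evaluation and the criteria seen in over half the passes."""
    n = len(passes)
    label_counts = Counter(p["evaluation"] for p in passes)
    majority_label = label_counts.most_common(1)[0][0]
    # Count each criterion once per pass, then keep those with a strict majority.
    criteria_counts = Counter(c for p in passes for c in set(p["criteria_met"]))
    majority_criteria = sorted(c for c, k in criteria_counts.items() if k > n / 2)
    return {"evaluation": majority_label, "criteria_met": majority_criteria}

# Three made-up passes: two agree, one dissents.
passes = [
    {"evaluation": "Meets Expectations", "criteria_met": ["Policies list the data collected"]},
    {"evaluation": "Meets Expectations", "criteria_met": ["Policies list the data collected"]},
    {"evaluation": "Meets with Reservations",
     "criteria_met": ["Policies give a broad statement of data collected"]},
]
summary = aggregate(passes)
print(summary)
```

With an odd number of passes and two dominant labels a tie is unlikely, but three passes over three possible labels can still split evenly, so a production version would want an explicit tie-breaking rule.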
Format the output and run the evaluation:

```python
# Function to format the output
def format_output(result: PolicyOutput):
    print(f"Evaluation: {result.evaluation}")
    print("Criteria Met:")
    for criterion in result.criteria_met:
        print(f"- {criterion}")
    print("Citations:")
    for citation in result.citations:
        print(f"- {citation}")

# Run the evaluation
result = evaluate_dcq1(policy_text)
format_output(result)
```
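Define a metric for the optimizers:

```python
from dspy.teleprompt import BootstrapFewShot


def dcq1_metric(example, pred, trace=None):
    """Return True when the label matches and every reported criterion belongs to that label."""
    label = pred.output.evaluation
    if label not in EVALUATION_CRITERIA:
        return False  # unexpected label from the model; avoids a KeyError below
    correct_evaluation = example.output.evaluation == label
    valid_criteria = set(pred.output.criteria_met).issubset(set(EVALUATION_CRITERIA[label].criteria))
    return correct_evaluation and valid_criteria
```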
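
The metric can be exercised offline with stand-in objects before it is handed to an optimizer. In this sketch, `SimpleNamespace` objects mimic the `example` and `pred` shapes used above, and `CRITERIA` is a trimmed, hypothetical copy of the rubric stored as plain lists. The function returns `False` for an unknown label rather than raising, which is an assumption about the desired behavior.

```python
from types import SimpleNamespace

# Trimmed copy of the rubric, stored as plain lists for the sake of the example.
CRITERIA = {
    "Meets Expectations": ["Policies list the data collected", "Policies state no data collected"],
    "Meets with Reservations": ["Policies give a broad statement of data collected"],
}

def metric(example, pred, trace=None) -> bool:
    """True when the label matches and every reported criterion belongs to that label."""
    label = pred.output.evaluation
    if label not in CRITERIA:
        return False
    same_label = example.output.evaluation == label
    valid = set(pred.output.criteria_met) <= set(CRITERIA[label])
    return same_label and valid

def make(label, criteria):
    """Build an object with the .output.evaluation / .output.criteria_met shape."""
    return SimpleNamespace(output=SimpleNamespace(evaluation=label, criteria_met=criteria))

gold = make("Meets Expectations", ["Policies list the data collected"])
good = make("Meets Expectations", ["Policies state no data collected"])
bad = make("Meets Expectations", ["Policies give a broad statement of data collected"])
print(metric(gold, good), metric(gold, bad))  # True False
```

Note that the metric accepts `good` even though it names a different criterion than `gold`: it only checks that the criteria are valid for the label, not that they match the reference.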
Provide training examples, then set up and compile with the BootstrapFewShot optimizer:

```python
# Training examples
trainset = [
    dspy.Example(
        input=PolicyInput(policy_text="We collect your name, email, and browsing history."),
        criteria=EVALUATION_CRITERIA,
        output=PolicyOutput(
            evaluation="Meets Expectations",
            criteria_met=["Policies list the data collected"],
            citations=["We collect your name", "We collect your email", "We collect your browsing history"]
        )
    ).with_inputs('input', 'criteria')
]

# Set up the BootstrapFewShot optimizer
teleprompter = BootstrapFewShot(metric=dcq1_metric, max_bootstrapped_demos=3)

# Compile the optimized agent
optimized_dcq1_agent = teleprompter.compile(dcq1_agent, trainset=trainset)
```
MIPROv2 also needs a validation set:

```python
valset = [
    dspy.Example(
        input=PolicyInput(policy_text="""
        At Stepik, we collect and use the following types of information:
        1. Personal Information: This includes your name, email address, and phone number when you create an account.
        2. Usage Data: We collect information on how you interact with our app, including features used and time spent.
        3. Device Information: We gather data about the device you use to access our service, such as device type and operating system.
        4. Location Data: With your permission, we may collect and process information about your location.
        We use this information to improve our services, personalize your experience, and ensure the security of your account.
        """),
        criteria=EVALUATION_CRITERIA,
        output=PolicyOutput(
            evaluation="Meets Expectations",
            criteria_met=["Policies list the data collected"],
            citations=[
                "we collect and use the following types of information:",
                "Personal Information: This includes your name, email address, and phone number",
                "Usage Data: We collect information on how you interact with our app",
                "Device Information: We gather data about the device you use",
                "Location Data: With your permission, we may collect and process information about your location"
            ]
        )
    ).with_inputs('input', 'criteria'),

    dspy.Example(
        input=PolicyInput(policy_text="""
        Our app collects certain information to provide you with a better experience. This may include some personal details and how you use our service. We also work with partners who may receive data about your activities on our platform.
        """),
        criteria=EVALUATION_CRITERIA,
        output=PolicyOutput(
            evaluation="Meets with Reservations",
            criteria_met=[
                "Policies give a broad statement of data collected",
                "Policies are not clear on data collected crucial to app functionality"
            ],
            citations=[
                "Our app collects certain information",
                "This may include some personal details and how you use our service",
                "We also work with partners who may receive data about your activities"
            ]
        )
    ).with_inputs('input', 'criteria'),

    dspy.Example(
        input=PolicyInput(policy_text="""
        We respect your privacy. Our service operates without collecting any personal information from users. We do not track your activities or store any data about your usage of our platform.
        """),
        criteria=EVALUATION_CRITERIA,
        output=PolicyOutput(
            evaluation="Meets Expectations",
            criteria_met=["Policies state no data collected"],
            citations=[
                "Our service operates without collecting any personal information from users",
                "We do not track your activities or store any data about your usage"
            ]
        )
    ).with_inputs('input', 'criteria')
]
```
Set up the MIPROv2 optimizer and compile again:

```python
from dspy.teleprompt import MIPROv2

# Set up the MIPROv2 optimizer
teleprompter = MIPROv2(
    metric=dcq1_metric,
    auto="light"
)

# Compile the optimized agent (this overwrites the BootstrapFewShot result)
optimized_dcq1_agent = teleprompter.compile(
    dcq1_agent,
    trainset=trainset,
    valset=valset,  # need a validation set for MIPROv2
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    requires_permission_to_run=False,
)
```
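Finally, evaluate with the optimized agent:

```python
# Use the optimized agent. evaluate_dcq1 calls the module-level name dcq1_agent,
# so rebind that name to the optimized program before running the passes;
# otherwise the unoptimized agent would be evaluated again.
dcq1_agent = optimized_dcq1_agent
optimized_result = evaluate_dcq1(policy_text, num_passes=3)
format_output(optimized_result)
```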
