# Homework and bakeoff: Few-shot OpenQA with DSPy

In [1]:
__author__ = "Christopher Potts and Omar Khattab"
__version__ = "CS224u, Stanford, Spring 2024"

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cgpotts/cs224u/blob/master/hw_openqa.ipynb)
[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/cgpotts/cs224u/blob/master/hw_openqa.ipynb)

If Colab is opened with this badge, please **save a copy to drive** (from the File menu) before running the notebook.

## Overview

The goal of this homework is to explore retrieval-augmented in-context learning. This is an exciting area that brings together a number of recent task ideas and modeling innovations. We will use the [DSPy programming library](http://dspy.ai) to build systems in this new mode.

Our core task is __open-domain question answering (OpenQA)__. In this task, all that is given by the dataset is a question text, and the task is to answer that question. By contrast, in many modern QA tasks, the dataset provides a question text and a gold passage, usually with a firm guarantee that the answer will be a substring of the passage.

OpenQA is substantially harder than standard QA. The usual strategy is to use a _retriever_ to find passages in a large collection of texts and train a _reader_ to find answers in those passages. This means we have no guarantee that the retrieved passage will contain the answer we need. If we don't retrieve a passage containing the answer, our reader has no hope of succeeding. Although this is challenging, it is much more realistic and widely applicable than standard QA. After all, with the right retriever, an OpenQA system could be deployed over the entire Web.

The task posed by this homework is harder even than OpenQA. We are calling this task __few-shot OpenQA__. The defining feature of this task is that the reader is simply a frozen, general-purpose language model. It accepts string inputs (prompts) and produces text in response. It is not trained to answer questions per se, and nothing about its structure ensures that it will respond with a substring of the prompt corresponding to anything like an answer.

__Few-shot QA__ (but not OpenQA!) is explored in the famous GPT-3 paper ([Brown et al. 2020](https://arxiv.org/abs/2005.14165)). The authors are able to get traction on the problem using GPT-3, an incredible finding. Our task here – __few-shot OpenQA__ – pushes this even further by retrieving passages to use in the prompt rather than assuming that the gold passage can be used in the prompt. If we can make this work, then it should be a major step towards flexibly and easily deploying QA technologies in new domains.

In summary:

| Task             | Passage given | Task-specific reader training |Task-specific retriever training  |
|-----------------:|:-------------:|:-----------------------------:|:--------------------------------:|
| QA               | yes           | yes                           | n/a                              |
| OpenQA           | no            | yes                           | maybe                            |
| Few-shot QA      | yes           | no                            | n/a                              |
| Few-shot OpenQA  | no            | no                            | maybe                            |

Just to repeat: your mission is to explore the final line in this table. The core notebook and assignment don't address the issue of training the retriever in a task-specific way, but this is something you could pursue for a final project; [the ColBERT codebase](https://github.com/stanford-futuredata/ColBERT) makes this easy.

It is a requirement of the bake-off that a general-purpose language model be used. In particular, trained QA systems cannot be used at all, and no fine-tuning is allowed either. See the original system question at the bottom of this notebook for guidance on which models are allowed.

Note: the models we are working with here are _big_. This poses a challenge that is increasingly common in NLP: you have to pay one way or another. You can pay to use the GPT-3 API, or you can pay to use a local model on a heavy-duty cluster computer, or you can pay with time by using a local model on a more modest computer.

## Set-up

We have sought to make this notebook self-contained and easy to use on a personal computer, on Google Colab, and in Sagemaker Studio. For personal computer use, we assume you have already done everything in [setup.ipynb](setup.ipynb). For cloud usage, the next few code blocks should handle all set-up steps.

In [1]:
try:
    # This library is our indicator that the required installs
    # need to be done.
    import datasets
    root_path = '.'
except ModuleNotFoundError:
    !git clone https://github.com/cgpotts/cs224u/
    !pip install -r cs224u/requirements.txt
    root_path = 'dspy'

Cloning into 'cs224u'...
remote: Enumerating objects: 2324, done.
remote: Counting objects: 100% (232/232), done.
remote: Compressing objects: 100% (128/128), done.
remote: Total 2324 (delta 118), reused 160 (delta 104), pack-reused 2092
Receiving objects: 100% (2324/2324), 41.68 MiB | 22.76 MiB/s, done.
Resolving deltas: 100% (1417/1417), done.
Collecting jupyter>=1.0.0 (from -r cs224u/requirements.txt (line 7))
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting transformers==4.37.1 (from -r cs224u/requirements.txt (line 14))
  Downloading transformers-4.37.1-py3-none-any.whl (8.4 MB)
Collecting datasets==2.14.6 (from -r cs224u/requirements.txt (line 15))
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)


In [None]:
# !pip uninstall dspy-ai -y
# !pip install dspy-ai==2.1.10
# print(dspy.__version__)

In [2]:
from datasets import load_dataset
import openai
import os
import dspy
from dotenv import load_dotenv

Save the API keys in a `.env` file in the local root directory as follows. Then, `load_dotenv()` will make them available to the notebook:

In [4]:
# Keep the API keys in a `.env` file in the local root directory.
load_dotenv()

In [3]:
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(root_path, 'cache')

openai_key = os.getenv('OPENAI_API_KEY')  # read the key from the environment; never commit a literal key

colbert_server = 'http://20.102.90.50:2017/wiki17_abstracts' # 'http://index.contextual.ai:8893/api/search'
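
A key written directly into a notebook is easy to leak when the notebook is shared. Here is a minimal sketch of reading it from the environment and failing early with a clear message. It assumes the variable is named `OPENAI_API_KEY`, as in the cell above; `get_api_key` is a hypothetical helper, not part of DSPy or `openai`.

```python
import os


def get_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`, or raise a clear error.

    `env` defaults to `os.environ`; passing a plain dict makes the
    helper easy to test. (Hypothetical helper, for illustration only.)
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to your `.env` file or export "
            "it in the shell before starting the notebook.")
    return key
```

With this in place, `openai_key = get_api_key()` would replace any literal key in the cell.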

Here we establish the Language Model `lm` and Retriever Model `rm` that we will be using. The defaults for `lm` are just for development. You may want to develop using an inexpensive model and then do your final evaluations with an expensive one. DSPy has support for a wide range of model APIs and local models.

In [4]:
lm = dspy.OpenAI(model='gpt-3.5-turbo', api_key=openai_key)

rm = dspy.ColBERTv2(url=colbert_server)

dspy.settings.configure(lm=lm, rm=rm)

Here's a command you can run to see which OpenAI models are available; OpenAI has entered into an increasingly closed mode where many older models are not available, so there are likely to be some surprises lurking here:

In [178]:
[d['id'] for d in openai.Model.list()['data']]

['gpt-4-vision-preview',
 'dall-e-3',
 'gpt-4-turbo-preview',
 'gpt-3.5-turbo-0613',
 'text-embedding-3-large',
 'dall-e-2',
 'gpt-3.5-turbo-instruct-0914',
 'whisper-1',
 'tts-1-hd-1106',
 'tts-1-hd',
 'babbage-002',
 'gpt-3.5-turbo-instruct',
 'gpt-3.5-turbo-0125',
 'gpt-3.5-turbo',
 'davinci-002',
 'text-embedding-ada-002',
 'gpt-3.5-turbo-0301',
 'text-embedding-3-small',
 'tts-1',
 'tts-1-1106',
 'gpt-3.5-turbo-1106',
 'gpt-3.5-turbo-16k',
 'gpt-4',
 'gpt-4-0613',
 'gpt-3.5-turbo-16k-0613',
 'gpt-4-1106-preview',
 'gpt-4-0125-preview']

## SQuAD

Our core development dataset is [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/). We chose this dataset because it is well-known and widely used, and because it is large enough to support lots of meaningful development work without being so large as to require lots of compute power.

In [None]:
squad = load_dataset("squad")

The following utility just reads a SQuAD split in as a list of `dspy.Example` instances:

In [7]:
def get_squad_split(squad, split="validation"):
    """
    Use `split='train'` for the train split.

    Returns
    -------
    list of dspy.Example with attributes question, answer

    """
    data = zip(*[squad[split][field] for field in squad[split].features])
    exs = [dspy.Example(question=q, answer=a['text'][0]).with_inputs("question")
           for eid, title, context, q, a in data]
    return exs

### SQuAD train

To build few-shot prompts, we will often sample SQuAD train examples, so we load that split here:

In [8]:
squad_train = get_squad_split(squad, split="train")

### SQuAD dev

In [9]:
squad_dev = get_squad_split(squad)

### SQuAD dev sample

Evaluations are expensive in this new era! Here's a small sample to use for dev assessments:

In [10]:
import random

random.seed(1)

dev_exs = random.sample(squad_dev, k=200)

## DSPy basics

### LM usage

Here's the most basic way to use the LM:

In [81]:
lm("Which award did Gary Zukav's first book receive?")

['Gary Zukav\'s first book, "The Dancing Wu Li Masters: An Overview of the New Physics," received the 1979 American Book Award for Science.']

Keyword arguments to the underlying LM are passed through:

In [40]:
lm("Which U.S. states border no U.S. states?", temperature=0.9, n=4)

['There are two U.S. states that do not border any other U.S. states: Alaska and Hawaii. Both states are located separately from the contiguous United States.',
 'There are two U.S. states that do not border any other U.S. states: Alaska and Hawaii. Both of these states are located outside of the contiguous United States.',
 'There are two U.S. states that do not border any other U.S. states: Alaska and Hawaii.',
 'Hawaii and Alaska are the only two U.S. states that do not border any other U.S. states.']

With `lm.inspect_history`, we can see the most recent language model calls:

In [41]:
lm.inspect_history(n=1)





Which U.S. states border no U.S. states? There are two U.S. states that do not border any other U.S. states: Alaska and Hawaii. Both states are located separately from the contiguous United States. 	 (and 3 other completions)





### Signature-based prediction

In DSPy, __signatures__ are declarative statements about what we want the model to do. In the following `"question -> answer"` is the signature (the most basic QA signature one could write), and `dspy.Predict` is used to turn this into a complete QA system:

In [11]:
basic_predictor = dspy.Predict("question -> answer")

Here we use `basic_predictor`:

In [12]:
basic_predictor(question="Which award did Gary Zukav's first book receive?")

Prediction(
    answer="Question: Which award did Gary Zukav's first book receive?\nAnswer: The Dancing Wu Li Masters received the American Book Award for Science."
)

And here is the prompt that was given to the model:

In [13]:
lm.inspect_history(n=1)





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Answer: ${answer}

---

Question: Which award did Gary Zukav's first book receive?
Answer: Question: Which award did Gary Zukav's first book receive?
Answer: The Dancing Wu Li Masters received the American Book Award for Science.
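
To make the link between a signature string and the prompt shown above concrete, here is a rough sketch that parses `"question -> answer"` and renders a prompt in the same layout. This is not DSPy's implementation; `parse_signature` and `render_prompt` are hypothetical helpers that only imitate the format printed by `inspect_history`.

```python
def parse_signature(signature):
    """Split a string like 'a, b -> c' into (['a', 'b'], ['c'])."""
    inputs, outputs = signature.split("->")

    def names(part):
        return [name.strip() for name in part.split(",") if name.strip()]

    return names(inputs), names(outputs)


def render_prompt(signature, **values):
    """Render a zero-shot prompt that imitates the layout shown above."""
    inputs, outputs = parse_signature(signature)
    in_list = ", ".join(f"`{name}`" for name in inputs)
    out_list = ", ".join(f"`{name}`" for name in outputs)
    lines = [
        f"Given the fields {in_list}, produce the fields {out_list}.",
        "", "---", "",
        "Follow the following format.",
        ""]
    # Template section: one "Field: ${field}" line per field.
    lines += [f"{name.capitalize()}: ${{{name}}}" for name in inputs + outputs]
    lines += ["", "---", ""]
    # Filled-in inputs, then the first output field left open for the LM.
    lines += [f"{name.capitalize()}: {values[name]}" for name in inputs]
    lines.append(f"{outputs[0].capitalize()}:")
    return "\n".join(lines)


prompt = render_prompt(
    "question -> answer",
    question="Which award did Gary Zukav's first book receive?")
```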





In many cases, we will want more control over the prompt. Writing a small custom `dspy.Signature` class is the easiest way to accomplish this. In the following, we just tweak the initial instruction and provide some formatting guidance for the answer:

In [82]:
class BasicQASignature(dspy.Signature):
    __doc__ = """Answer questions with short factoid answers."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

In [76]:
sig_predictor = dspy.Predict(BasicQASignature)

In [16]:
sig_predictor(question="Which U.S. states border no U.S. states?")

Prediction(
    answer='Maine, Hawaii'
)

In [17]:
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Answer: often between 1 and 5 words

---

Question: Which U.S. states border no U.S. states?
Answer: Maine, Hawaii





### Modules

One of the hallmarks of DSPy is that it adopts design patterns from PyTorch. The main example of this is DSPy's use of the `Module` as the basic unit for writing simple and complex programs. Here is a very basic module for QA that makes use of `BasicQASignature` as we defined it just above.

In [83]:
class BasicQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.Predict(BasicQASignature)

    def forward(self, question):
        return self.generate_answer(question=question)

As with PyTorch, the `forward` method is called when we want to make predictions:

In [84]:
basic_qa_model = BasicQA()

In [85]:
basic_qa_model(question="Which award did Gary Zukav's first book receive?")

Prediction(
    answer='National Book Award'
)

The modular design of DSPy starts to become apparent now. If you want to change the above to use chain of thought instead of regular predictions, you need only change `dspy.Predict` to `dspy.ChainOfThought`, and similarly for `dspy.ReAct`, `dspy.ProgramOfThought`, or a module you wrote yourself.
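
The pattern can be sketched without DSPy at all. In the toy code below, `Module`, `DirectPredictor`, `StepwisePredictor`, and `ToyQA` are all hypothetical stand-ins (no language model is called); the point is only that calling an instance dispatches to `forward`, and that swapping the predictor class leaves the rest of the program untouched.

```python
class Module:
    """Toy base class: calling an instance runs its `forward` method."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError("Subclasses define `forward`.")


class DirectPredictor(Module):
    """Stand-in for a predictor that answers in one step."""

    def forward(self, question):
        return {"answer": f"direct answer to: {question}"}


class StepwisePredictor(Module):
    """Stand-in for a predictor that also returns a rationale."""

    def forward(self, question):
        return {"rationale": "think step by step",
                "answer": f"reasoned answer to: {question}"}


class ToyQA(Module):
    """The program itself only knows it has a `generate_answer` layer."""

    def __init__(self, predictor_class=DirectPredictor):
        self.generate_answer = predictor_class()

    def forward(self, question):
        return self.generate_answer(question=question)


direct = ToyQA()("Q")
stepwise = ToyQA(StepwisePredictor)("Q")
```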

### Teleprompting

The QA system we've defined so far is a zero-shot system. To change it into a few-shot system, we will rely on a DSPy __teleprompter__. This will allow us to flexibly move between the zero-shot and few-shot formulations. The following code achieves this.

In [21]:
from dspy.teleprompt import LabeledFewShot

dspy/cache/compiler


Here we instantiate a `LabeledFewShot` teleprompter that will add three demonstrations. These will be sampled randomly from the set of train examples we provide:

In [22]:
fewshot_teleprompter = LabeledFewShot(k=3)

And then we call `compile` on `basic_qa_model` as we defined it above. This returns a new module that we use like any other in DSPy:

In [24]:
basic_fewshot_qa_model = fewshot_teleprompter.compile(basic_qa_model, trainset=squad_train)

In [None]:
basic_fewshot_qa_model(question="Which award did Gary Zukav's first book receive?")

With `inspect_history`, we can see that prompts now contain demonstrations:

In [None]:
lm.inspect_history(n=1)
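
As a rough picture of what labeled few-shot prompting amounts to, the sketch below samples `k` demonstrations at random and places them ahead of the target question. It is an illustration only, not the `LabeledFewShot` implementation; `sample_demos` and `fewshot_prompt` are hypothetical helpers, and the examples are plain dicts rather than `dspy.Example` objects.

```python
import random


def sample_demos(trainset, k, seed=0):
    """Randomly pick up to `k` labeled examples to use as demonstrations."""
    rng = random.Random(seed)
    return rng.sample(trainset, min(k, len(trainset)))


def fewshot_prompt(instruction, demos, question):
    """Instruction, then each demonstration, then the open question."""
    blocks = [instruction]
    for demo in demos:
        blocks.append(f"Question: {demo['question']}\nAnswer: {demo['answer']}")
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n---\n\n".join(blocks)


toy_train = [
    {"question": "What color is the sky?", "answer": "blue"},
    {"question": "How many legs does a spider have?", "answer": "eight"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is frozen water called?", "answer": "ice"},
]
demos = sample_demos(toy_train, k=3)
prompt = fewshot_prompt(
    "Answer questions with short factoid answers.",
    demos,
    "Which award did Gary Zukav's first book receive?")
```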

### Evaluation

Our evaluation metric is a standard one for SQuAD and related tasks: exact match of the answer (EM).

In [12]:
from dspy.evaluate import answer_exact_match, answer_passage_match
from dspy.evaluate.evaluate import Evaluate

In [None]:
answer_exact_match(dspy.Example(answer="STAGE 2!"), dspy.Prediction(answer="stage 2"))
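
The example above matches despite differences in case and punctuation because EM is computed over normalized strings. The sketch below shows the usual SQuAD-style normalization (lowercase, strip punctuation, drop English articles, collapse whitespace). We assume `answer_exact_match` behaves along these lines, but `normalize_text` and `exact_match` here are our own stand-ins, not DSPy's code.

```python
import re
import string


def normalize_text(text):
    """SQuAD-style normalization: case, punctuation, articles, whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold, predicted):
    """True if the two strings are identical after normalization."""
    return normalize_text(gold) == normalize_text(predicted)
```

Note how strict this is: a prediction like "Paris, France" does not match the gold answer "Paris", which is one reason the signatures above ask for short answers.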

In DSPy, `Evaluate` objects provide a uniform interface for running evaluations. Here are several for us to use in development. `dev_evaluater` evaluates on all of `dev_exs` and should provide a meaningful picture of how a system is doing, though it could be expensive to use a lot. `dev_evaluater_1` and `dev_evaluater_2` each cover one half of `dev_exs`. `tiny_evaluater` is for debugging and is probably too small to give a reliable estimate.

In [109]:
dev_evaluater = Evaluate(
    devset=dev_exs, # 200 examples
    num_threads=1,
    display_progress=True,
    display_table=5)

In [None]:
dev_evaluater_1 = Evaluate(
    devset=dev_exs[:100], # first 100 examples
    num_threads=1,
    display_progress=True,
    display_table=5)

In [110]:
dev_evaluater_2 = Evaluate(
    devset=dev_exs[100:], # last 100 examples
    num_threads=1,
    display_progress=True,
    display_table=5)

In [87]:
tiny_evaluater = Evaluate(
    devset=dev_exs[:15],
    num_threads=1,
    display_progress=True,
    display_table=5)

Here is a tiny (debugging-oriented) evaluation of our few-shot QA system:

In [None]:
tiny_evaluater(basic_fewshot_qa_model, metric=answer_exact_match)
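
Conceptually, an evaluation like this is just a loop that applies the metric to each prediction and reports a percentage. The sketch below makes that explicit and adds a rough binomial standard error to show why 15 examples give a noisy estimate. `evaluate` and `standard_error` are hypothetical helpers operating on plain dicts; they are not the `Evaluate` class.

```python
import math


def evaluate(program, devset, metric):
    """Percentage of examples in `devset` for which `metric` is true."""
    if not devset:
        raise ValueError("`devset` must contain at least one example.")
    correct = sum(
        bool(metric(ex, program(ex["question"]))) for ex in devset)
    return round(100.0 * correct / len(devset), 2)


def standard_error(accuracy, n):
    """Binomial standard error of an accuracy (a proportion in [0, 1])."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Toy program: look up canned answers instead of calling a language model.
canned = {"Q1": "Paris", "Q2": "blue"}
toy_dev = [{"question": "Q1", "answer": "Paris"},
           {"question": "Q2", "answer": "green"}]
score = evaluate(
    program=lambda q: {"answer": canned[q]},
    devset=toy_dev,
    metric=lambda ex, pred: ex["answer"] == pred["answer"])
```

Under this approximation, a system with 50% accuracy has a standard error of about 13 points on 15 examples and about 3.5 points on 200 examples.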

### Retrieval

The final major component of our systems is retrieval. When we defined `rm`, we connected to a remote ColBERT index and retriever system that we can now use for search.

The basic `dspy.Retrieve` module returns only passages:

In [88]:
retriever = dspy.Retrieve(k=3)

In [89]:
passages = retriever("Which award did Gary Zukav's first book receive?")

In [None]:
passages

If we need passages with scores and other metadata, we can call `rm` directly:

In [None]:
rm("Which award did Gary Zukav's first book receive?", k=2)
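
To illustrate the difference between passages alone and passages with scores, here is a toy retriever over an in-memory list. It ranks by word overlap, which is only a stand-in: ColBERT uses learned neural representations, and none of the names below (`tokens`, `overlap_score`, `toy_retrieve`) come from DSPy or ColBERT.

```python
import re


def tokens(text):
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(query, passage):
    """Number of distinct words shared by the query and the passage."""
    return len(tokens(query) & tokens(passage))


def toy_retrieve(query, passages, k=3):
    """Return the top-`k` passages with their scores, best first."""
    scored = [{"text": p, "score": overlap_score(query, p)} for p in passages]
    scored.sort(key=lambda d: d["score"], reverse=True)
    return scored[:k]


corpus = [
    "Bananas are yellow.",
    "Guarani is spoken in Paraguay.",
    "The Eiffel Tower is in Paris.",
]
results = toy_retrieve("Where is Guarani spoken?", corpus, k=2)
passages_only = [r["text"] for r in results]
```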

## Few-shot OpenQA with context

Let's build on the above core concepts to define a basic retrieval-augmented generation (RAG) program. This program addresses the core few-shot OpenQA task and will serve as the basis for the homework questions.

We begin with a signature that takes context into account but is otherwise just like `BasicQASignature` above:

In [90]:
class ContextQASignature(dspy.Signature):
    __doc__ = """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

And here is a complete program/system for the task:

In [91]:
class RAG(dspy.Module):
    def __init__(self, num_passages=1):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.Predict(ContextQASignature)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

In [43]:
rag_model = RAG(num_passages=3)

In [None]:
rag_model(question="Which award did Gary Zukav's first book receive?")

An optional tiny evaluation:

In [None]:
tiny_evaluater(rag_model, metric=answer_exact_match)

## Question 1: Optimizing RAG [2 points]

We used `RAG` above as a zero-shot system. We could turn it into a few-shot system by using `LabeledFewShot` as we did in [the teleprompting section](#Teleprompting) above, but this may actually be problematic: if we randomly sample demonstrations with retrieved passages, we might be instructing the model with a lot of cases where the context passage isn't helping (and may actually be actively misleading the model).

What we'd like to do is select demonstrations where the model gets the answer correct and the context passage does contain the answer. To do this, we will use the DSPy `BootstrapFewShot` optimizer. There are two steps for this: (1) defining a metric and (2) running the optimizer.

__Note__: The code for this question can be found in the DSPy tutorials, and you should feel free to make use of that code. The goal is to help you understand the design patterns and overall logic of optimizing DSPy programs.
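
The selection idea can be sketched in plain Python: run the program on training examples and keep as demonstrations only those runs that the metric accepts. This is a simplification under our own assumptions, not how `BootstrapFewShot` is implemented; `bootstrap_demos`, `toy_metric`, and the `max_demos` parameter are hypothetical, and the metric uses simple lowercase comparison rather than DSPy's matching functions.

```python
def toy_metric(example, pred):
    """Accept only correct answers that are supported by the context."""
    gold = example["answer"].lower()
    correct = gold == pred["answer"].lower()
    supported = any(gold in passage.lower() for passage in pred["context"])
    return correct and supported


def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Collect up to `max_demos` program runs that pass `metric`."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        pred = program(example["question"])
        if metric(example, pred):
            demos.append({"question": example["question"],
                          "context": pred["context"],
                          "answer": pred["answer"]})
    return demos


# Toy program with canned outputs in place of retrieval plus an LM call.
canned = {
    "Q1": {"context": ["A B C"], "answer": "B"},   # correct and supported
    "Q2": {"context": ["A B C"], "answer": "D"},   # correct, not supported
    "Q3": {"context": ["A B C"], "answer": "A"},   # supported, not correct
}
toy_train = [{"question": "Q1", "answer": "B"},
             {"question": "Q2", "answer": "D"},
             {"question": "Q3", "answer": "C"}]
demos = bootstrap_demos(lambda q: canned[q], toy_train, toy_metric)
```

Only the first example survives, which is the behavior we want: the resulting demonstrations all show the context actually helping.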

__Task 1__: Complete `validate_context_and_answer` according to the specification in the docstring.

In [92]:
def validate_context_and_answer(example, pred, trace=None):
    """Return True if `example.answer` matches `pred.answer` according
    to `dspy.evaluate.answer_exact_match` and `pred.context` contains
    `example.answer` according to `dspy.evaluate.answer_passage_match`.

    Parameters
    ----------
    example: dspy.Example
        with attributes `answer` and `context`
    pred: dspy.Example
        with attributes `answer` and `context`
    trace : None (included for dspy internal compatibility)

    Returns
    -------
    bool

    """
    return answer_exact_match(example, pred) and \
           answer_passage_match(example, pred)


A test you can use to check your implementation:

In [46]:
def test_validate_context_and_answer(func):
    examples = [
        (
            dspy.Example(question="Q1", answer="B"),
            dspy.Prediction(question="Q1", context="A B C", answer="B"),
            True
        ),
        # Context doesn't contain answer, but predicted answer is correct.
        (
            dspy.Example(question="Q1", answer="D"),
            dspy.Prediction(question="Q1", context="A B C", answer="D"),
            False
        ),
        # Context contains answer, but predicted answer is not correct.
        (
            dspy.Example(question="Q1", answer="C"),
            dspy.Prediction(question="Q1", context="A B C", answer="D"),
            False
        )
    ]
    errcount = 0
    for ex, pred, result in examples:
        predicted = func(ex, pred, trace=None)
        if predicted != result:
            errcount += 1
            print(f"Error for `{func.__name__}`: "
                  f"Expected inputs\n\t{ex}\n\t{pred} to return {result}.")
    if errcount == 0:
        print(f"No errors detected for `{func.__name__}`")

In [47]:
test_validate_context_and_answer(validate_context_and_answer)

No errors detected for `validate_context_and_answer`


__Task 2__: Complete `bootstrap_optimize` according to the specification in the docstring.

In [151]:
from dspy.teleprompt import BootstrapFewShot, BootstrapFewShotWithRandomSearch

def bootstrap_optimize(model):
    """Use `BootstrapFewShot` to optimize `model`, with the metric set
    to `validate_context_and_answer` as defined above and default
    values for all other keyword arguments to `BootstrapFewShot`.

    Parameters
    ----------
    model: dspy.Module

    Returns
    -------
    dspy.Module, the optimized version of `model`

    """
    bootstrap_fewshot_teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
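    # Bootstrap from train examples so that the dev examples used for
    # evaluation are not also used as demonstrations.
    return bootstrap_fewshot_teleprompter.compile(model, trainset=squad_train[:30])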




A test you can use to check your implementation:

In [49]:
def test_bootstrap_optimize(func):
    model = RAG()
    compiled = func(model)
    if not hasattr(compiled, "_compiled") or not compiled._compiled:
        print(f"Error for `{func.__name__}`: "
               "The return value is not a compiled program.")
        return None
    state = compiled.dump_state()
    if not state['generate_answer']['demos']:
        print(f"Error for `{func.__name__}`: "
               "The compiled program has no `demos`.")
        return None
    print(f"No errors detected for `{func.__name__}`")

In [50]:
test_bootstrap_optimize(bootstrap_optimize)

100%|██████████| 10/10 [00:06<00:00,  1.52it/s]

Bootstrapped 1 full traces after 10 examples in round 0.
No errors detected for `bootstrap_optimize`





## Question 2: Multi-passage summarization [2 points]

The `dspy.Retrieve` layer in our `RAG` retrieves `k` passages, where `k` is under the control of the user. One hypothesis one might have is that it would be good to summarize these passages before using them as evidence. This seems especially likely to help in scenarios where the question can be answered only by synthesizing information across documents – it might be too much to ask the language model to do both synthesizing and answering in a single step.

The current question maps out a basic strategy for summarization. The heart of it is a new signature called `SummarizeSignature`. This can be used on its own with a simple `dspy.Predict` call, and we'll incorporate it into a RAG program in the next question.

For this question, though, your task is just to complete `SummarizeSignature`. The requirements are as follows:

1. A `__doc__` value that gives an instruction that seems to work well. You can decide what to say here.
2. A `dspy.InputField` named `context`. You can decide whether to use the `desc` parameter.
3. A `dspy.OutputField` named `summary`. You can decide whether to use the `desc` parameter.

In [94]:
class SummarizeSignature(dspy.Signature):
    __doc__ = """Summarize the given passages into a single coherent passage."""

    context = dspy.InputField(desc="The passages to summarize.")
    summary = dspy.OutputField(desc="The summary of the passages.")



Here's a simple test that just checks for the required pieces in a basic way:

In [41]:
def test_SummarizeSignature(sigclass):
    fields = sigclass.fields
    expected_fieldnames = ['context', 'summary']
    fieldnames = sorted([field.input_variable for field in fields])
    errcount = 0
    if expected_fieldnames != fieldnames:
        errcount += 1
        print(f"Error for `{sigclass.__name__}`: "
              f"Expected fieldnames {expected_fieldnames}, got {fieldnames}.")
    if not sigclass.__doc__:
        errcount += 1
        print(f"Error for `{sigclass.__name__}`: No docstring specified.")
    if errcount == 0:
        print(f"No errors detected for `{sigclass.__name__}`")

In [42]:
test_SummarizeSignature(SummarizeSignature)

No errors detected for `SummarizeSignature`


Here is the simplest way to use `SummarizeSignature`:

In [43]:
summarizer = dspy.Predict(SummarizeSignature)

In [44]:
summarizer(context=retriever("Where is Guarani spoken?").passages)

Prediction(
    summary='Guarani is a Native American language spoken in Paraguay and parts of Bolivia, Argentina, and Brazil. The primary variety, known as Paraguayan Guarani, is an indigenous language belonging to the Tupi-Guarani family. It is one of the official languages of Paraguay and spoken by the majority of the population, with half of the rural population being monolingual. Additionally, there is a related language called Chiripá Guarani, also known as Ava Guarani, spoken in Paraguay, Brazil, and Argentina, with a smaller number of speakers compared to Paraguayan Guarani.'
)

## Question 3: Summarizing RAG [2 points]

Your task for this question is to modify `RAG` as defined above so that the retrieved passages are summarized before being passed to `generate_answer`.

Here is the `RAG` system copied from above with the class name changed to the one we will use for this new system. Your task is to add the summarization step. This should be very straightforward given the modular design that DSPy supports and encourages!

In [95]:
class SummarizingRAG(dspy.Module):
    def __init__(self, num_passages=3):
        # Please name your summarization layer `summarize` so that we
        # can check for its presence.
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.summarize = dspy.Predict(SummarizeSignature)
        self.generate_answer = dspy.Predict(ContextQASignature)

    def forward(self, question):
        passages = self.retrieve(question).passages
        summary_prediction = self.summarize(context=passages)
        context = summary_prediction.summary
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

A simple test for this design spec:

In [60]:
def test_SummarizingRAG(classname):
    model = classname(num_passages=3)
    errcount = 0
    if not hasattr(model, "summarize"):
        errcount += 1
        print(f"Error for `{classname.__name__}`: "
              f"Expected a layer called 'summarize'")
    context = model.retrieve("What are some foods?").passages
    pred = model("What are some foods?")
    if context == pred.context:
        errcount += 1
        print(f"Error for `{classname.__name__}`: "
              "The model seems to be using raw retrieved contexts "
              "for predictions rather than summarizing them.")
    if errcount == 0:
        print(f"No errors detected for `{classname.__name__}`")

In [61]:
test_SummarizingRAG(SummarizingRAG)

No errors detected for `SummarizingRAG`


Model usage:

In [62]:
summarizing_rag_model = SummarizingRAG()

In [63]:
summarizing_rag_model(question="Which award did Gary Zukav's first book receive?")

Prediction(
    context='Gary Zukav is an American spiritual teacher and author known for his New York Times Best Sellers and appearances on "The Oprah Winfrey Show." His first book, "The Dancing Wu Li Masters," won a U.S. National Book Award and explores modern physics with metaphors from eastern spiritual movements. Zukiswa Wanner, a South African journalist and novelist, has been recognized for her work with awards such as the K Sello Duiker Memorial Literary Award and inclusion on the Africa39 list of talented Sub-Saharan African writers under 40.',
    answer='U.S. National Book Award'
)

Note: if you decide to use `BootstrapFewShot` on this, be sure not to use the metric we defined above, which requires that the passage embeds the correct answer as a substring. Now that we are summarizing, this is unlikely to hold, even if the answers are good ones.

## Question 4: Your original system [3 points]

This question asks you to design your own few-shot OpenQA system. All of the code above can be used and modified for this, and the requirement is just that you try something new that goes beyond what we've done so far.

Terms for the bake-off:

* You can make free use of SQuAD and other publicly available data.

* The LM must be an autoregressive language model. No trained QA components can be used. This includes general purpose LMs that have been fine-tuned for QA. (We have obviously waded into some vague territory here. The spirit of this is to make use of frozen, general-purpose models. We welcome questions about exactly how this is defined, since it could be instructive to explore this.)

Here are some ideas for the original system:

* We have relied almost entirely on `dspy.Predict`. Drop-in replacements include `dspy.ChainOfThought` and `dspy.ReAct`.

* We have used only one retriever. DSPy supports other retrieval mechanisms, including retrieval using [You.com](https://you.com/).

* DSPy includes additional optimizers. Two that are worth trying are `SignatureOptimizer` for automatic prompt exploration and `BootstrapFewShotWithRandomSearch`, which combines `LabeledFewShot` and `BootstrapFewShot`.

* Our one-step summarization procedure from Question 3 doesn't change the query to the retriever. We might want it to change as we gather evidence. This is a common design principle for multi-hop OpenQA systems.
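
The last idea, letting the retrieval query change as evidence accumulates, can be sketched as a small loop. Everything here is hypothetical: `multihop_answer` and its callable arguments are placeholders, and in a DSPy system `make_query` and `answer` would presumably be predictors over signatures of your own design (for example, one mapping `context, question` to a new query).

```python
def multihop_answer(question, retrieve, make_query, answer, max_hops=2):
    """Retrieve in several hops, rewriting the query from gathered context."""
    context = []
    for _ in range(max_hops):
        # The query depends on what has been retrieved so far.
        query = make_query(question, context)
        for passage in retrieve(query):
            if passage not in context:
                context.append(passage)
    return answer(question, context), context


# Toy components standing in for a retriever and two LM calls.
index = {
    "Who wrote the book that won X?": ["Book B won X."],
    "Book B won X.": ["Author A wrote Book B."],
}


def toy_retrieve(query):
    return index.get(query, [])


def toy_make_query(question, context):
    # First hop: the question itself. Later hops: the latest evidence.
    return context[-1] if context else question


def toy_answer(question, context):
    return "Author A" if any("Author A" in p for p in context) else "unknown"


result, evidence = multihop_answer(
    "Who wrote the book that won X?",
    toy_retrieve, toy_make_query, toy_answer)
```

In this toy setup, a single hop retrieves only the passage about the award, so the second hop is what makes the question answerable.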

__Original system instructions__:

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [185]:
# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:
#   1) Textual description of your system.
#   2) The code for your original system.
# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS

# START COMMENT: Enter your system description in this cell.

# Please note that 'http://20.102.90.50:2017/wiki17_abstracts' was used for
# retrieval instead of 'http://index.contextual.ai:8893/api/search' because the
# latter ColBERT server was unavailable.

# Evaluation was carried out using the 200 examples in `dev_exs`

# The following prompting strategies were evaluated:
# - No context (Basic QA)
# - Context (3 retrieved passages / 5 retrieved passages)
# - Context (summarized passages)
# Surprisingly, the simple no context (basic QA) strategy performed better than
# the others. One possible explanation for this is that gpt-3.5-turbo is a
# powerful LLM that can do well on fact based questions such as the ones in
# SQuAD without additional context. Retrieved context obtained from
# 'http://20.102.90.50:2017/wiki17_abstracts' was likely not relevant/helpful in
# many cases.

# Evaluated LabeledFewShot, BootstrapFewShot, and
# BootstrapFewShotWithRandomSearch. BootstrapFewShotWithRandomSearch performed
# noticeably better than the other optimizers. +2.5% over BootstrapFewShot when
# using the Basic QA model.

# Instantiate basic QA model
model = BasicQA()

# Instantiate BootstrapFewShotWithRandomSearch optimizer
config = dict(max_bootstrapped_demos=3, max_labeled_demos=3, num_candidate_programs=8, num_threads=2)
bootstrap_few_shot_optimizer = BootstrapFewShotWithRandomSearch(metric=answer_exact_match, **config)

# Optimize model
optimized_model = bootstrap_few_shot_optimizer.compile(model, trainset=squad_train[:30])

# Evaluate model
dev_evaluater(optimized_model, metric=answer_exact_match)

# STOP COMMENT: Please do not remove this comment.

## Question 5: Bakeoff entry [1 point]

For the bake-off, you simply need to be able to run your system on the file

```data/openqa/cs224u-openqa-test-unlabeled.txt```

The following code should download it for you if necessary:

In [100]:
import wget

if not os.path.exists(os.path.join("data", "openqa", "cs224u-openqa-test-unlabeled.txt")):
    os.makedirs(os.path.join('data', 'openqa'), exist_ok=True)
    wget.download('https://web.stanford.edu/class/cs224u/data/cs224u-openqa-test-unlabeled.txt', out='data/openqa/')

If the above fails, you can just download https://web.stanford.edu/class/cs224u/data/cs224u-openqa-test-unlabeled.txt and place it in `data/openqa`.

This file contains only questions. The starter code below will help you structure this. It writes a file "cs224u-openqa-bakeoff-entry.json" to the current directory. That file should be uploaded as-is. Please do not change its name.

In [176]:
import json
import tqdm

def create_bakeoff_submission(model):
    """
    The argument `model` is a `dspy.Module`. The return value of its
    `forward` method must have an `answer` attribute.
    """

    filename = os.path.join("data", "openqa", "cs224u-openqa-test-unlabeled.txt")

    # This should become a mapping from questions (str) to response
    # dicts from your system.
    gens = {}

    with open(filename) as f:
        questions = f.read().splitlines()

    # Here we loop over the questions, run the system `model`, and
    # store its `answer` value as the prediction:
    for question in tqdm.tqdm(questions):
        gens[question] = model(question=question).answer

    # Quick tests we advise you to run:
    # 1. Make sure `gens` is a dict with the questions as the keys:
    assert all(q in gens for q in questions)
    # 2. Make sure the values are str:
    assert all(isinstance(d, str) for d in gens.values())

    # And finally the output file:
    with open("cs224u-openqa-bakeoff-entry.json", "wt") as f:
        json.dump(gens, f, indent=4)

Here's what it looks like to evaluate our first program, `basic_qa_model`, on the bakeoff data:

In [None]:
create_bakeoff_submission(basic_qa_model)
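
Before uploading, it may be worth re-reading the written file and checking it against the two requirements stated in the starter code: every question is a key, and every value is a string. The helper below is a hypothetical sketch of such a check; `validate_submission` is not part of the course code.

```python
import json


def validate_submission(path, questions):
    """Return a list of problems found in the submission file at `path`.

    An empty list means the file maps every question to a string answer.
    """
    with open(path) as f:
        gens = json.load(f)
    if not isinstance(gens, dict):
        return ["The top-level JSON value is not an object."]
    problems = []
    missing = [q for q in questions if q not in gens]
    if missing:
        problems.append(f"{len(missing)} question(s) have no answer.")
    non_str = [q for q, a in gens.items() if not isinstance(a, str)]
    if non_str:
        problems.append(f"{len(non_str)} answer(s) are not strings.")
    return problems
```

A typical call would pass `"cs224u-openqa-bakeoff-entry.json"` and the list of questions read from the unlabeled test file.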