## DSPy Prompt Optimisation for Multi-Hop (CoT) QA (Baleen Architecture)

A single search query is often not enough for complex QA tasks. For instance, one HotPotQA example asks about the birth city of the writer of "Right Back At It Again". A single search query often identifies the author correctly as "Jeremy McKinnon", but it cannot compose the intended answer, because determining where he was born requires a second lookup.

The standard approach to this challenge in the retrieval-augmented NLP literature is to build multi-hop search systems, like GoldEn (Qi et al., 2019) and Baleen (Khattab et al., 2021). These systems read the retrieved results and then generate additional queries to gather more information when necessary before arriving at a final answer. Using DSPy, we can simulate such systems in a few lines of code.

Refer to https://arxiv.org/pdf/2101.00436.pdf for an explanation of the Baleen architecture.
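To make the multi-hop idea concrete, here is a minimal, dependency-free sketch. The three-passage corpus, the word-overlap retriever, and the regex-based `next_query` function are all made up for illustration. In the real pipeline a language model writes the queries and a retrieval model fetches the passages.

```python
import re

# Hypothetical toy corpus in "title | text" form (not real HotPotQA data).
CORPUS = [
    "Song X | Song X is a track written by Author A.",
    "Author A | Author A is a songwriter born in City C.",
    "City C | City C is a town known for its harbour.",
]


def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, k=1):
    """Toy retriever: rank passages by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(CORPUS, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:k]


def next_query(question, context):
    """Stand-in for the LM: follow a 'written by <name>' clue if one was retrieved."""
    for passage in context:
        match = re.search(r"written by ([^.]+)\.", passage)
        if match:
            return f"{match.group(1)} born"
    return question


def multi_hop(question, max_hops=2):
    """Accumulate deduplicated passages over several query/retrieve rounds."""
    context = []
    for _ in range(max_hops):
        query = next_query(question, context)
        for passage in retrieve(query):
            if passage not in context:
                context.append(passage)
    return context


question = "Where was the writer of Song X born?"
single_hop = retrieve(question)   # finds the song, but not the birthplace
two_hops = multi_hop(question)    # second hop follows the author's name
print(single_hop)
print(two_hops)
```

With one hop, only the song passage is retrieved. The second hop uses the author's name found in that passage to fetch the passage that actually contains the birthplace.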


In [30]:
%pip install -qU langchain-ibm


In [31]:
import os

import requests

import dspy
from dsp import LM
from dsp.utils import deduplicate
from dspy.datasets import HotPotQA
from dspy.evaluate.evaluate import Evaluate
from dspy.predict.retry import Retry
from dspy.primitives.assertions import assert_transform_module, backtrack_handler
from dspy.teleprompt import BootstrapFewShot, BootstrapFewShotWithRandomSearch

In [32]:
os.environ['WATSONX_URL']="https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2023-05-29"
os.environ['WATSONX_APIKEY']=""
os.environ['WATSONX_PROJECTID']=""

### Create an Implementation of the DSPy LM Class for watsonx.ai Models

In [33]:
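class Watson(LM):
    """Minimal DSPy language-model wrapper around the watsonx.ai text generation endpoint."""

    def __init__(self, model, api_key):
        # OpenAI-style defaults kept for DSPy's bookkeeping. Note that basic_request
        # forwards only per-call kwargs, so these defaults are not sent to watsonx.ai
        # (the recorded history further down shows a completion that stops after
        # 20 generated tokens with stop_reason 'max_tokens').
        self.kwargs = {
            "model": model,
            "temperature": 0.0,
            "max_tokens": 150,
            "top_p": 1,
            "frequency_penalty": 0,
            "presence_penalty": 0,
            "n": 1,
        }
        self.model = model
        self.api_key = api_key  # IAM bearer token, not the raw API key
        self.provider = "default"
        self.history = []

        self.base_url = os.environ['WATSONX_URL']
        self.project_id = os.environ['WATSONX_PROJECTID']

    def basic_request(self, prompt: str, **kwargs):
        """POST the prompt to watsonx.ai and record the exchange in self.history."""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Accept": "application/json",
            "content-type": "application/json"
        }

        data = {
            "parameters": {**kwargs},
            "model_id": self.model,
            "input": prompt,
            "project_id": self.project_id
        }

        response = requests.post(self.base_url, headers=headers, json=data)
        response = response.json()

        self.history.append({
            "prompt": prompt,
            "response": response,
            "kwargs": kwargs,
        })
        return response

    def __call__(self, prompt, only_completed=True, return_sorted=False, **kwargs):
        """Return the generated text of every result in the response."""
        response = self.request(prompt, **kwargs)

        print(response)  # debug output of the raw response

        completions = [result["generated_text"] for result in response["results"]]

        return completions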
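The wrapper above stores OpenAI-style defaults such as `max_tokens` but never sends them. One possible way to forward them is sketched below. The helper name `to_watsonx_parameters` is hypothetical, and the target parameter names (`max_new_tokens`, `temperature`, `top_p`) are our assumption about what the watsonx.ai text generation endpoint accepts, so check them against the API reference before relying on this.

```python
# Assumed mapping from the OpenAI-style names used in `self.kwargs`
# to watsonx.ai parameter names. Keys without a mapping are dropped.
OPENAI_TO_WATSONX = {
    "max_tokens": "max_new_tokens",
    "temperature": "temperature",
    "top_p": "top_p",
}


def to_watsonx_parameters(defaults, overrides):
    """Merge default and per-call kwargs, then rename the keys we know how to map."""
    merged = {**defaults, **overrides}
    return {
        OPENAI_TO_WATSONX[name]: value
        for name, value in merged.items()
        if name in OPENAI_TO_WATSONX
    }


defaults = {"model": "some-model", "temperature": 0.0, "max_tokens": 150, "top_p": 1, "n": 1}
params = to_watsonx_parameters(defaults, {"temperature": 0.7})
print(params)
```

Inside `basic_request`, the `"parameters"` entry could then be built with `to_watsonx_parameters(self.kwargs, kwargs)` instead of `{**kwargs}`.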

### Configure DSPy with the watsonx.ai Language Model

In [34]:
def generate_access_token(api_key):
    """Exchange an IBM Cloud API key for an IAM access token."""
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/json",
    }
    data = {
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key
    }
    response = requests.post('https://iam.cloud.ibm.com/identity/token', data=data, headers=headers)
    response.raise_for_status()  # fail early on a missing or invalid API key
    return response.json()['access_token']

token = generate_access_token(os.environ['WATSONX_APIKEY'])

In [35]:
watsonx = Watson(model="meta-llama/llama-2-70b-chat",api_key=token)
dspy.settings.configure(lm=watsonx, trace=[], temperature=0.7)

### Load the dataset

We use the HotPotQA dataset mentioned above, a collection of complex question-answer pairs typically answered in a multi-hop fashion. Before loading it through DSPy's HotPotQA class, we configure a ColBERTv2 retriever over an index of Wikipedia 2017 abstracts as the retrieval model (RM):

In [36]:
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(rm=colbertv2_wiki17_abstracts)

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(devset)

(20, 50)

### Build the Signatures of the DSPy Modules

We'll start by creating the GenerateAnswer signature that'll take context and question as input and give answer as output.

In [37]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

Unlike usual QA pipelines, Baleen has an intermediate question-generation step, so we need a second Signature for the "hop" behavior: it takes some context and a question as input and generates a search query to find the missing information.

In [38]:
class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()
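A signature is ultimately rendered into a text prompt. The sketch below approximates that rendering for GenerateSearchQuery wrapped in a chain-of-thought predictor, modelled on the prompt recorded in `watsonx.history` later in this notebook. The `render_prompt` helper is hypothetical and is not DSPy's own template code.

```python
def render_prompt(instructions, fields, inputs, next_prefix):
    """Join instructions, a format description, and the filled-in inputs."""
    format_block = "\n\n".join(f"{label}: {desc}" for label, desc in fields)
    filled = "\n\n".join(f"{label}: {value}" for label, value in inputs)
    return "\n\n---\n\n".join([
        instructions,
        "Follow the following format.\n\n" + format_block,
        filled + "\n\n" + next_prefix,
    ])


# Field labels and descriptions as they appear in the recorded prompt.
fields = [
    ("Context", "may contain relevant facts"),
    ("Question", "${question}"),
    ("Reasoning", "Let's think step by step in order to ${produce the query}. We ..."),
    ("Query", "${query}"),
]

prompt = render_prompt(
    instructions="Write a simple search query that will help answer a complex question.",
    fields=fields,
    inputs=[
        ("Context", "N/A"),  # the context is empty on the first hop
        ("Question", "How many storeys are in the castle that David Gregory inherited?"),
    ],
    next_prefix="Reasoning: Let's think step by step in order to",
)
print(prompt)
```

The signature's docstring becomes the instruction line, each field becomes a labelled slot, and the prompt ends on the prefix of the first output field so that the model continues from there.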

### Create the DSPy Module (Baleen pipeline)

In the module defined below, the __init__ method sets up a few key sub-modules:

- generate_query: For each hop, we will have one dspy.ChainOfThought predictor with the GenerateSearchQuery signature.
- retrieve: This module will conduct the search using the generated queries over our defined ColBERT RM search index via the dspy.Retrieve module.
- generate_answer: This dspy.ChainOfThought module will be used with the GenerateAnswer signature to produce the final answer.

The forward method uses these sub-modules in a simple control flow.

1. First, we'll loop up to self.max_hops times.
2. In each iteration, we'll generate a search query using the predictor at self.generate_query[hop].
3. We'll retrieve the top-k passages using that query.
4. We'll add the (deduplicated) passages to our context accumulator.
5. After the loop, we'll use self.generate_answer to produce an answer.
6. We'll return a prediction with the retrieved context and predicted answer.

In [39]:
from dsp.utils import deduplicate

class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
    
    def forward(self, question):
        context = []
        
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)

        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)

### Executing the Pipeline

In [40]:
# Ask any question you like to this simple RAG program.
my_question = "How many storeys are in the castle that David Gregory inherited?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
uncompiled_baleen = SimplifiedBaleen()  # uncompiled (i.e., zero-shot) program
pred = uncompiled_baleen(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

Question: How many storeys are in the castle that David Gregory inherited?
Predicted Answer: 2

---

Context:

Question: What is the name of the hotel
Retrieved Contexts (truncated): ['St. Gregory Hotel | The St. Gregory Hotel is a boutique hotel located in downtown Washington, D.C., in the United States. Established in 2000, the nine-floor hotel has 155 rooms, which includes 54 del...', 'Gregory House (Poughkeepsie, New York) | Gregory House, is a historic home located at 140 South Cherry Street in Poughkeepsie, Dutchess County, New York. It was built about 1869, is a 2 ⁄ -story, Seco...', 'Karl D. Gregory Cooperative House | Karl D. Gregory Cooperative House is a member of the Inter-Cooperative Council at the University of Michigan. The structure that stands at 1617 Washtenaw was origin...', 'Clark House (Poughkeepsie, New York) | Clark House is a historic home located at Poughkeepsie, Dutchess County, New York. It was built about 1919 and is a 2 ⁄ -story, three-bay-wide concrete blo

In [41]:
watsonx.history

[{'prompt': "Write a simple search query that will help answer a complex question.\n\n---\n\nFollow the following format.\n\nContext: may contain relevant facts\n\nQuestion: ${question}\n\nReasoning: Let's think step by step in order to ${produce the query}. We ...\n\nQuery: ${query}\n\n---\n\nContext: N/A\n\nQuestion: How many storeys are in the castle that David Gregory inherited?\n\nReasoning: Let's think step by step in order to",
  'response': {'model_id': 'meta-llama/llama-2-70b-chat',
   'created_at': '2024-04-08T09:57:38.905Z',
   'results': [{'generated_text': ' find the answer. We know that David Gregory inherited a castle from his father, John Gregory. We',
     'generated_token_count': 20,
     'input_token_count': 114,
     'stop_reason': 'max_tokens'}],
      'more_info': 'https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx'}]}},
  'kwargs': {}},
 {'prompt': "Write a simple search query that will help answer a complex question.\n\n--

### Compiling DSPy to Optimise the Pipeline

A zero-shot approach quickly falls short for more specialized tasks, novel domains/settings, and more efficient (or open) models.

To address this, DSPy offers compilation. Let's compile our multi-hop (SimplifiedBaleen) program.

Let's first define our validation logic for compilation:

- The predicted answer matches the gold answer.
- The retrieved context contains the gold answer.
- None of the generated queries is rambling (i.e., none exceeds 100 characters in length).
- None of the generated queries is roughly repeated (i.e., none is within 0.8 or higher F1 score of earlier queries).
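The last two checks can be illustrated without DSPy. The sketch below uses a simple token-level F1 as a stand-in for DSPy's string-matching helper. The tokenisation and the function names `token_f1` and `hops_are_valid` are our own assumptions, not the library's implementation.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def token_f1(a, b):
    """Token-level F1 between two strings (0.0 when they share no tokens)."""
    ta, tb = normalize(a), normalize(b)
    common = sum((Counter(ta) & Counter(tb)).values())
    if common == 0:
        return 0.0
    precision = common / len(ta)
    recall = common / len(tb)
    return 2 * precision * recall / (precision + recall)


def hops_are_valid(hops, max_len=100, max_f1=0.8):
    """`hops` is the question followed by the generated queries, in order."""
    # Reject rambling hops.
    if any(len(h) > max_len for h in hops):
        return False
    # From the second generated query on, compare against every earlier hop.
    for idx in range(2, len(hops)):
        if any(token_f1(hops[idx], earlier) >= max_f1 for earlier in hops[:idx]):
            return False
    return True


question = "Who wrote Song X?"
print(hops_are_valid([question, "Song X writer", "birthplace of Author A"]))  # distinct queries
print(hops_are_valid([question, "Song X writer", "writer Song X"]))           # repeated query
```

The second call is rejected because the two queries contain the same tokens in a different order, which gives an F1 of 1.0.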

In [42]:
def validate_context_and_answer_and_hops(example, pred, trace=None):
    # The answer must match the gold answer and appear in the retrieved context.
    if not dspy.evaluate.answer_exact_match(example, pred):
        return False
    if not dspy.evaluate.answer_passage_match(example, pred):
        return False

    # Question plus every generated query in the trace (trace may be None).
    hops = [example.question] + [outputs.query for *_, outputs in (trace or []) if 'query' in outputs]

    # Reject rambling queries and queries that roughly repeat an earlier hop.
    if max(len(h) for h in hops) > 100:
        return False
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))):
        return False

    return True

We'll use one of the most basic teleprompters in DSPy, namely BootstrapFewShot, to optimize the predictors in the pipeline with few-shot examples.

In [43]:
from dspy.teleprompt import BootstrapFewShot

teleprompter = BootstrapFewShot(metric=validate_context_and_answer_and_hops)
compiled_baleen = teleprompter.compile(SimplifiedBaleen(), teacher=SimplifiedBaleen(passages_per_hop=2), trainset=trainset)

100%|██████████| 20/20 [03:18<00:00,  9.93s/it]

Bootstrapped 2 full traces after 20 examples in round 0.
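The message above says that only 2 of the 20 training examples produced a trace that passed the metric. The idea behind this kind of bootstrapping can be sketched as follows. This is a conceptual toy, not DSPy's implementation: `bootstrap_demos`, `toy_teacher`, and the arithmetic "dataset" are all hypothetical.

```python
TOY_TRAINSET = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "3 * 3", "answer": "9"},
    {"question": "10 / 4", "answer": "2.5"},
]


def toy_teacher(question, trace):
    """Stand-in for a zero-shot program: integer arithmetic only, so '10 / 4' comes out wrong."""
    left, op, right = question.split()
    a, b = int(left), int(right)
    result = {"+": a + b, "*": a * b, "/": a // b}[op]
    trace.append(("compute", question, str(result)))  # record the intermediate step
    return str(result)


def exact_match(example, prediction, trace=None):
    return prediction == example["answer"]


def bootstrap_demos(teacher, trainset, metric, max_demos=4):
    """Run the teacher on each example and keep only the traces the metric accepts."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        trace = []
        prediction = teacher(example["question"], trace)
        if metric(example, prediction, trace):
            demos.append({"question": example["question"], "trace": trace, "answer": prediction})
    return demos


demos = bootstrap_demos(toy_teacher, TOY_TRAINSET, exact_match)
print(f"Bootstrapped {len(demos)} full traces after {len(TOY_TRAINSET)} examples.")
```

The accepted traces are what the compiled program later sees as few-shot demonstrations, so a stricter metric yields fewer but cleaner demonstrations.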





In [44]:
def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(dspy.evaluate.normalize_text, example['gold_titles']))

    print(gold_titles)

    found_titles = set(map(dspy.evaluate.normalize_text, [c.split(' | ')[0] for c in pred.context]))

    print(found_titles)

    return gold_titles.issubset(found_titles)
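To see what this metric computes, here is a small standalone version. `normalize_title` is a simplified stand-in for `dspy.evaluate.normalize_text` (lowercase, drop ASCII punctuation and articles) and may differ from it in edge cases. The passage bodies are elided and the exact title spellings are illustrative.

```python
import re
import string


def normalize_title(text):
    """Simplified normaliser: lowercase, strip ASCII punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def titles_of(passages):
    """Passages have the form 'title | text', so the title is the part before ' | '."""
    return {normalize_title(p.split(" | ")[0]) for p in passages}


def gold_titles_found(gold_titles, passages):
    """True only if every gold title is among the retrieved passage titles."""
    return {normalize_title(t) for t in gold_titles} <= titles_of(passages)


crichton = gold_titles_found(
    {"Crichton Castle", "Crichton Collegiate Church"},
    ["Crichton Collegiate Church | ...", "Crichton Castle | ...", "Cranston, Midlothian | ..."],
)
cangzhou = gold_titles_found(
    {"Qionghai", "Cangzhou"},
    ["Cangzhou | ...", "Cang County | ..."],
)
print(crichton, cangzhou)
```

The metric is all-or-nothing per question: retrieving one of the two gold passages, as in the Cangzhou case, still counts as a miss.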

Next, we define our evaluation function and compare the retrieval performance of the uncompiled and compiled Baleen pipelines, using the gold_passages_retrieved metric defined above. While this devset does not serve as a completely reliable benchmark, it is instructive to use for this tutorial.

In [45]:
from dspy.evaluate.evaluate import Evaluate

# Set up the `evaluate_on_hotpotqa` function. We'll use this many times below.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=5)

uncompiled_baleen_retrieval_score = evaluate_on_hotpotqa(uncompiled_baleen, metric=gold_passages_retrieved, display=False)

compiled_baleen_retrieval_score = evaluate_on_hotpotqa(compiled_baleen, metric=gold_passages_retrieved)

print(f"## Retrieval Score for uncompiled Baleen: {uncompiled_baleen_retrieval_score}")
print(f"## Retrieval Score for compiled Baleen: {compiled_baleen_retrieval_score}")

{'cangzhou', 'qionghai'}
{'cang prefecture', 'cangzhou railway station', 'cangzhou'}
{'2017–18 pittsburgh penguins season', '2017 nhl expansion draft'}
{'marcandré fleury', 'list of vegas golden knights draft picks', '2017–18 pittsburgh penguins season', '2017 nhl expansion draft'}
{'2006–07 detroit red wings season', 'steve yzerman'}
{'kris draper', '2006–07 detroit red wings season', 'steve pittman', 'steve yzerman', 'list of detroit red wings general managers'}
{'crichton castle', 'crichton collegiate church'}
{'river tyne scotland', 'crichton collegiate church', 'crichton castle', 'river tyne', 'cranston midlothian', 'river tyne disambiguation'}
{'ealhswith', 'æthelweard son of alfred'}
{'æthelburg of wessex', 'eadgyth', 'æthelstan of kent', 'æthelred of wessex', 'wulfthryth of wessex', 'æthelbald king of wessex'}
{'newark airport interchange', 'newark liberty international airport'}
{'newark liberty international airport station', 'newark airport interchange', 'newark liberty int

  0%|          | 0/50 [00:00<?, ?it/s]



Average Metric: 0 / 1  (0.0):   2%|▏         | 1/50 [00:09<08:08,  9.98s/it]

{'cangzhou', 'qionghai'}
{'cang county', 'xinhua district cangzhou', 'dongguang county', 'cangzhou', 'haixing county'}


Average Metric: 0 / 2  (0.0):   4%|▍         | 2/50 [00:20<08:02, 10.05s/it]

{'2017–18 pittsburgh penguins season', '2017 nhl expansion draft'}
{'marc methot', 'theoren fleury', '2017–18 pittsburgh penguins season', 'marcandré fleury', 'marc bureau ice hockey', 'marcédouard vlasic'}


Average Metric: 1 / 3  (33.3):   6%|▌         | 3/50 [00:29<07:35,  9.70s/it]

{'2006–07 detroit red wings season', 'steve yzerman'}
{'2006–07 detroit red wings season', 'steve griggs', 'list of tampa bay lightning head coaches', 'steve yzerman'}


Average Metric: 2 / 4  (50.0):   8%|▊         | 4/50 [00:37<07:04,  9.22s/it]

{'crichton castle', 'crichton collegiate church'}
{'crichton', 'crichton collegiate church', 'crichton castle', 'cranston midlothian'}


Average Metric: 3 / 5  (60.0):  10%|█         | 5/50 [00:46<06:49,  9.10s/it]

{'ealhswith', 'æthelweard son of alfred'}
{'alfred great', 'æthelflæd', 'ealhswith', 'æthelred mucel', 'æthelweard son of alfred'}


Average Metric: 3 / 6  (50.0):  12%|█▏        | 6/50 [00:56<06:49,  9.31s/it]

{'newark airport interchange', 'newark liberty international airport'}
{'newark liberty international airport station', 'newark liberty international airport', 'airtrain newark', 'lafayette street terminal newark'}


Average Metric: 3 / 7  (42.9):  14%|█▍        | 7/50 [01:07<07:09,  9.98s/it]

{'2005–06 fc bayern munich season', 'claudio pizarro'}
{'luis suárez', 'luis suárez disambiguation', '2014–15 fc barcelona season', 'joaquín suárez', 'list of international goals scored by luis suárez'}


Average Metric: 4 / 8  (50.0):  16%|█▌        | 8/50 [01:16<06:47,  9.70s/it]

{'william r fairchild international airport', 'chico municipal airport'}
{'website', 'william r fairchild international airport', 'william robert johnston municipal airport', 'chico municipal airport', 'site map', 'google sites'}


Average Metric: 5 / 9  (55.6):  18%|█▊        | 9/50 [01:27<06:44,  9.85s/it]

{'stockton springs maine', 'fort pownall'}
{'stockton springs community church', 'fort point state park', 'fort pownall', 'fort halifax maine', 'stockton springs maine', 'sandy point maine'}


Average Metric: 5 / 10  (50.0):  20%|██        | 10/50 [01:38<06:55, 10.40s/it]

{'gene band', 'afghan whigs'}
{'gene and gents', 'up in it', 'afghan whigs', 'congregation afghan whigs album', 'genes reunited', 'geneabloggers'}


Average Metric: 5 / 11  (45.5):  22%|██▏       | 11/50 [01:49<06:55, 10.65s/it]

{'mount vesuvius', 'curse of faceless man'}
{'monte nuovo', 'eruption of mount vesuvius in 79', 'mount vesuvius', 'plinian eruption', 'pompeii last day'}


Average Metric: 5 / 12  (41.7):  24%|██▍       | 12/50 [02:00<06:46, 10.69s/it]

{'first united states army', '72nd field artillery brigade united states'}
{'72d air base wing', '65th field artillery brigade united states', '218th field artillery regiment united states', '72d fighter wing', '72nd field artillery brigade united states'}


Average Metric: 5 / 13  (38.5):  26%|██▌       | 13/50 [02:13<06:54, 11.20s/it]

{'stanisław kiszka', 'hetmans of polish–lithuanian commonwealth'}
{'stanisław kiszka', 'stanislaw kostka łukomski', 'culturepl', 'american council for polish culture', 'stanisław kiszka bishop', 'plateculture'}


Average Metric: 6 / 14  (42.9):  28%|██▊       | 14/50 [02:23<06:34, 10.96s/it]

{'wang xiaoshuai', 'del lord'}
{'wang xiaoshuai', 'xiao tong', 'wang xijie', 'dark lord rise of darth vader', 'del lord', 'dellords'}


Average Metric: 6 / 15  (40.0):  30%|███       | 15/50 [02:34<06:27, 11.07s/it]

{'jonathan aitken', 'lord north street'}
{'michael lord', 'lord disambiguation', 'lord band', 'lord', 'lord colum crichtonstuart', 'lord north street'}


Average Metric: 7 / 16  (43.8):  32%|███▏      | 16/50 [02:47<06:29, 11.44s/it]

{'pollenza', 'marche'}
{'pollenza', 'pollen novel', 'pollen video game', 'marchigiano dialect', 'marche', 'pollens band'}


Average Metric: 7 / 17  (41.2):  34%|███▍      | 17/50 [03:01<06:48, 12.38s/it]

{'william hughes miller', 'kosciusko mississippi'}
{'treaty of washington with menominee 1831', 'paris under louisphilippe', 'demographics of washington dc', 'crime in washington dc'}


Average Metric: 7 / 18  (38.9):  36%|███▌      | 18/50 [03:14<06:38, 12.44s/it]

{'meleko mokgosi', 'gallatin school of individualized study'}
{'allen feldman', 'meleko mokgosi', 'amy bentley', 'steinhardt museum of natural history', 'w russell neuman'}


Average Metric: 8 / 19  (42.1):  38%|███▊      | 19/50 [03:28<06:39, 12.88s/it]

{'restaurant impossible', 'robert irvine'}
{'list of restaurant impossible episodes', 'restaurant impossible', 'robert irvine'}


Average Metric: 9 / 20  (45.0):  40%|████      | 20/50 [03:40<06:24, 12.83s/it]

{'proposition joe', 'robert f chew'}
{'leo gorcey', 'david gorcey', 'proposition joe', 'robert f chew', 'poot wire'}


Average Metric: 9 / 21  (42.9):  42%|████▏     | 21/50 [03:53<06:13, 12.89s/it]

{'toby sawyer', 'wilmslow'}
{'prestbury cheshire', 'david plowright', 'toby sawyer', 'listed buildings in prestbury cheshire', 'prestbury gloucestershire'}


Average Metric: 9 / 22  (40.9):  44%|████▍     | 22/50 [04:07<06:03, 13.00s/it]

{'tony kaye director', 'deepa mehta'}
{'tony kaye director', 'judy kaye', 'tony kaye musician'}


Average Metric: 10 / 23  (43.5):  46%|████▌     | 23/50 [04:18<05:40, 12.60s/it]

{'bon marché', 'boise towne square'}
{'bon marché', 'westfield plaza bonita', 'karcher mall', 'macys herald square', 'boise towne square'}


Average Metric: 11 / 24  (45.8):  48%|████▊     | 24/50 [04:30<05:22, 12.40s/it]

{'lizzette reynolds', 'christine comer'}
{'paula irvine', 'cannock chase murders', 'emily j reynolds', 'christine comer', 'lizzette reynolds'}


Average Metric: 12 / 25  (48.0):  50%|█████     | 25/50 [04:43<05:13, 12.52s/it]

{'william s hutchings', 'p t barnum'}
{'lightning machine ep', 'p t barnum', 'thunder and lightnings', 'fred glazer', 'william s hutchings'}


Average Metric: 13 / 26  (50.0):  52%|█████▏    | 26/50 [04:58<05:15, 13.15s/it]

{'battle of chongchon river', 'meuseargonne offensive'}
{'battle of chongchon river', 'neuvillyenargonne', 'meuseargonne offensive', 'battle of canal du nord'}


Average Metric: 14 / 27  (51.9):  54%|█████▍    | 27/50 [05:11<05:03, 13.18s/it]

{'australian cricket team in england in 1981', 'ian botham'}
{'liam botham', 'australian cricket team in england in 1961', 'australian cricket team in england in 1981', 'english cricket team in australia in 1979–80', 'les botham', 'ian botham'}


Average Metric: 14 / 28  (50.0):  56%|█████▌    | 28/50 [05:23<04:45, 12.97s/it]

{'monte kiffin', '1982 nc state wolfpack football team'}
{'lane kiffin', 'tom thibodeau', 'tom matukewicz', 'tom amstutz', 'monte kiffin'}


Average Metric: 14 / 29  (48.3):  58%|█████▊    | 29/50 [05:35<04:24, 12.61s/it]

{'ewan mcgregor', 'come what may 2001 song'}
{'moulin rouge', 'shore album', 'moulin rouge 1934 film', 'come what may 2001 song', 'eien no uta'}


Average Metric: 15 / 30  (50.0):  60%|██████    | 30/50 [05:47<04:07, 12.38s/it]

{'ivan bella', 'frank de winne'}
{'ivan bella', 'frank de winne', 'iván bella'}


Average Metric: 16 / 31  (51.6):  62%|██████▏   | 31/50 [05:59<03:51, 12.18s/it]

{'platonov play', 'wild honey play'}
{'anton chekhov', 'michael chekhov', 'platonov play', 'in cart', 'student short story', 'wild honey play'}


Average Metric: 17 / 32  (53.1):  64%|██████▍   | 32/50 [06:10<03:33, 11.84s/it]

{'pago pago international airport', 'roswell international air center'}
{'nichols field colorado', 'boswell bay airport', 'pago pago international airport', 'roswell international air center'}


Average Metric: 17 / 33  (51.5):  66%|██████▌   | 33/50 [06:21<03:16, 11.57s/it]

{'marv albert', 'untold greatest sports stories never told'}
{'greatest story ever told david banner album', 'untold greatest sports stories never told', 'notsogreat moments in sports', 'greatest story never told', 'on shoulders of giants film'}


Average Metric: 17 / 34  (50.0):  68%|██████▊   | 34/50 [06:32<03:03, 11.49s/it]

{'sacro gra', 'walt disney film'}
{'gianfranco rosi director', 'walt amp el grupo', 'doctor sacrobosco', 'sacro gra'}


Average Metric: 17 / 35  (48.6):  70%|███████   | 35/50 [06:43<02:51, 11.41s/it]

{'status of territories occupied by israel in 1967', 'gaza strip'}
{'hamashkif', 'ahmadiyya in palestine', 'hamas of iraq', 'gaza strip', 'hamas'}


Average Metric: 17 / 36  (47.2):  72%|███████▏  | 36/50 [06:53<02:34, 11.04s/it]

{'2015 mtv video music awards', 'wildest dreams taylor swift song'}
{'blank space', '2015 mtv video music awards', 'style taylor swift song', 'i prevail', 'taylor swift videography', '2017 mtv video music awards'}


Average Metric: 17 / 37  (45.9):  74%|███████▍  | 37/50 [07:05<02:23, 11.07s/it]

{'gunnera manicata', 'apera'}
{'gunnera manicata', 'gunnera magellanica', 'gunnerales', 'gunnera'}


Average Metric: 17 / 38  (44.7):  76%|███████▌  | 38/50 [07:15<02:11, 10.96s/it]

{'pussy galore band', 'drums'}
{'feel good about your body', 'historia de la musica rock', 'pussy galore band', 'groovy hate fuck', 'right now pussy galore album'}


Average Metric: 17 / 39  (43.6):  78%|███████▊  | 39/50 [07:26<01:59, 10.89s/it]

{'banded brothers', 'university of exeter'}
{'banded brothers', 'brothers in unity', 'nelsons band of brothers', 'hemigobius hoevenii', 'banded mongoose'}


Average Metric: 18 / 40  (45.0):  80%|████████  | 40/50 [07:35<01:44, 10.43s/it]

{'len wiseman', 'benjamin christensen'}
{'devils circus', 'len wiseman', 'benjamin christensen'}


Average Metric: 19 / 41  (46.3):  82%|████████▏ | 41/50 [07:47<01:37, 10.87s/it]

{'bill melendez', 'steven c melendez'}
{'bill melendez', 'melendez films', 'steven c melendez'}


Average Metric: 20 / 42  (47.6):  84%|████████▍ | 42/50 [08:00<01:30, 11.33s/it]

{'shark creek new south wales', 'clarence river new south wales'}
{'boundary creek glen fernaigh river clarence valley', 'boundary creek nymboida river clarence valley', 'clarence river new zealand', 'clarence river alaska–yukon', 'clarence river new south wales', 'shark creek new south wales'}


Average Metric: 20 / 43  (46.5):  86%|████████▌ | 43/50 [08:12<01:21, 11.66s/it]

{'hayden rolence', 'finding dory'}
{'new zealand dory', 'finding dory', 'king dory', 'piper 2016 film', 'finding nemo franchise'}


Average Metric: 20 / 44  (45.5):  88%|████████▊ | 44/50 [08:22<01:07, 11.25s/it]

{'1995 monaco grand prix', 'benetton formula'}
{'f1 hero md', 'f1x dubai', 'f1 grand prix 2005 video game', 'michael schumacher', 'mercedes mgp w02', 'mercedes mgp w01'}


Average Metric: 20 / 45  (44.4):  90%|█████████ | 45/50 [08:34<00:56, 11.34s/it]

{'frederick law olmsted', 'cadwalader heights trenton new jersey'}
{'berkeley square trenton', 'cadwalader park', 'trenton ohio', 'cadwalader heights trenton new jersey'}


Average Metric: 20 / 46  (43.5):  92%|█████████▏| 46/50 [08:43<00:42, 10.73s/it]

{'franco zeffirelli', 'gordon warnecke'}
{'george warne', 'judith warnick', 'title tk', 'frank warnke', 'didier van der hove', '7641 1986 tt6'}


Average Metric: 20 / 47  (42.6):  94%|█████████▍| 47/50 [08:58<00:35, 11.98s/it]

{'signal magazine', 'andré zucca'}
{'rita zucca', 'anton wilhelm von zuccalmaglio', 'joseph gerhard zuccarini', 'andré zucca'}


Average Metric: 21 / 48  (43.8):  96%|█████████▌| 48/50 [09:08<00:22, 11.45s/it]

{'bill woodfull', 'bill ponsford'}
{'cricket disambiguation', 'bill woodfull', 'cricket musical', 'cricket', 'bill ponsford', 'adelaide leak'}


Average Metric: 21 / 49  (42.9):  98%|█████████▊| 49/50 [09:20<00:11, 11.51s/it]

{'sasha alexander', 'yes man film'}
{'suzana dinić', 'suzanna film', 'sasha alexander', 'hour of star'}


Average Metric: 21 / 50  (42.0): 100%|██████████| 50/50 [09:30<00:00, 11.42s/it]

{'election law journal', 'reed college'}
{'dave leips atlas of us presidential elections', 'electoralvotecom', 'john david booth', 'united kingdom election results', 'history of british political parties'}
Average Metric: 21 / 50  (42.0%)





| | question | example_answer | gold_titles | context | pred_answer | gold_passages_retrieved |
|---|---|---|---|---|---|---|
| 0 | Are both Cangzhou and Qionghai in the Hebei province of China? | no | {'Qionghai', 'Cangzhou'} | ['Cangzhou \| Cangzhou () is a prefecture-level city in eastern Hebei province, People\'s Republic of China. At the 2010 census, Cangzhou\'s built-up ("or metro") area... | Yes --- Context: [1] «The 1972 FA | False |
| 1 | Who conducts the draft in which Marc-Andre Fleury was drafted to the Vegas Golden Knights for the 2017-18 season? | National Hockey League | {'2017–18 Pittsburgh Penguins season', '2017 NHL Expansion Draft'} | ['Marc-Édouard Vlasic \| Marc-Édouard Vlasic (born March 30, 1987) is a Canadian professional ice hockey defenceman for the San Jose Sharks of the National Hockey... | The expansion draft. --- Context: [1] «The 19 | False |
| 2 | The Wings entered a new era, following the retirement of which Canadian retired professional ice hockey player and current general manager of the Tampa Bay... | Steve Yzerman | {'2006–07 Detroit Red Wings season', 'Steve Yzerman'} | ['Steve Yzerman \| Stephen Gregory "Steve" Yzerman ( ; born May 9, 1965) is a Canadian retired professional ice hockey player and current general manager... | Steve Yzerman | ✔️ [True] |
| 3 | What river is near the Crichton Collegiate Church? | the River Tyne | {'Crichton Castle', 'Crichton Collegiate Church'} | ["Crichton Collegiate Church \| Crichton Collegiate Church is situated about 0.6 mi south west of the hamlet of Crichton in Midlothian, Scotland. Crichton itself is... | River Tyne | ✔️ [True] |
| 4 | In the 10th Century A.D. Ealhswith had a son called Æthelweard by which English king? | King Alfred the Great | {'Æthelweard (son of Alfred)', 'Ealhswith'} | ["Æthelflæd \| Æthelflæd, Lady of the Mercians ( 870 - 12 June 918) ruled Mercia in the English Midlands from 911 until her death. She... | Alfred the Great | ✔️ [True] |


## Retrieval Score for uncompiled Baleen: 36.0
## Retrieval Score for compiled Baleen: 42.0
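To put the two scores in perspective, the snippet below converts them back into counts on the 50-question devset. It assumes the score is simply the percentage of examples for which the metric returned True, which matches the compiled run's logged "21 / 50 (42.0%)". The helper `percent_score` is our own.

```python
def percent_score(num_correct, num_total):
    """Percentage of examples for which the metric returned True."""
    return round(100 * num_correct / num_total, 2)


DEVSET_SIZE = 50
compiled_hits = 21                                   # from the "21 / 50" progress line
compiled_score = percent_score(compiled_hits, DEVSET_SIZE)
uncompiled_hits = round(36.0 / 100 * DEVSET_SIZE)    # back out the count from 36.0
gain_in_questions = compiled_hits - uncompiled_hits

print(compiled_score, uncompiled_hits, gain_in_questions)
```

On a devset this small, the 6-point gap corresponds to only a handful of questions, which is why the tutorial treats it as illustrative rather than as a reliable benchmark.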
