#### Language Models

DSPy connects to both local and remote language models.




In [1]:
from dotenv import load_dotenv
import openai
import os

load_dotenv("D:\\gitFolders\\python_de_learners_data\\.env")
openai.api_key = os.environ['OPENAI_API_KEY']
# openai.api_key

In [3]:
from dspy import OpenAI, configure

turbo = OpenAI(model='gpt-3.5-turbo',
                max_tokens=100)

configure(lm=turbo)

In [4]:
turbo("This is a raw prompt...")

['Write a story about a young girl who discovers she has the power to control the weather. At first, she uses her powers for fun and to help her friends, but soon she realizes the responsibility that comes with such a gift. As she struggles to control her abilities and navigate the consequences of her actions, she must decide how to use her powers for the greater good.']

In [5]:
from dspy import ChainOfThought

qa = ChainOfThought('question -> answer')

response = qa(question="how many floors are in castle?")
print(response.answer)

determine the exact number of floors in the castle.


In [8]:
qa_5 = ChainOfThought('question -> answer', n=5)

response = qa_5(question="How many floors are in the castle David Gregory inherited?")

response.completions.answer

['6 floors', '7', '4', '4 floors']

In [9]:
for idx in range(5):
    response = qa(question="How many floors are in the castle David Gregory inherited?",
                    config=dict(temperature=0.7+0.0001*idx))
    
    print(f'{idx+1}.', response.answer)

1. The number of floors in the castle David Gregory inherited can be determined by examining historical records, architectural plans, or physically counting the floors in person.
2. 12 floors
3. 5
4. 4
5. The castle David Gregory inherited likely has multiple floors, including a basement and several levels above ground.


In [None]:
from dspy import OllamaLocal

mistral_ollama = OllamaLocal(model='mistral')

#### Signatures

A signature tells the LM what it needs to do, rather than specifying how we should ask the LM to do it.

Signatures define and control the behaviour of modules.

Semantic roles are expressed in plain English through the field names.

The DSPy compiler will **figure out how to build a highly optimized prompt** for your signature, on your data, within your pipeline.

In [None]:
#### Examples of Inline Signatures

# > Question Answering: "question -> answer"

# > Sentiment Classification: "sentence -> sentiment"

# > Summarization: "document -> summary"

# > Your signatures can also have multiple input/output fields.

# > Retrieval-Augmented Question Answering: "context, question -> answer"

# > Multiple-Choice Question Answering with Reasoning: "question, choices -> reasoning, selection"
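
To make the shorthand concrete, here is a minimal sketch of how an inline signature string splits into input and output field names. `parse_inline_signature` is a hypothetical helper written for these notes; it is not DSPy's own parser, which also attaches prefixes and descriptions to each field.

```python
def parse_inline_signature(spec: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c, d' into (['a', 'b'], ['c', 'd'])."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")
    # Field names are comma-separated on each side of the arrow.
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output field")
    return inputs, outputs


inputs, outputs = parse_inline_signature("question, choices -> reasoning, selection")
print("inputs:", inputs)
print("outputs:", outputs)
```

Everything left of the arrow is what the caller supplies; everything right of it is what the LM is asked to produce.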

In [10]:
from dspy import Predict

classify = Predict("sentence -> sentiment")

sentence = 'The DSPy library seems to be taking a huge leap in prominence'

classify(sentence=sentence)

Prediction(
    sentiment='Positive'
)

In [11]:
# Example from the XSum dataset.
document = """The 21-year-old made seven appearances for the Hammers and netted 
        his only goal for them in a Europa League qualification round match against Andorran 
        side FC Lustrains last season. Lee had two loan spells in League One last term, with 
        Blackpool and then Colchester United. He scored twice for the U's but was unable to save 
        them from relegation. The length of Lee's contract with the promoted Tykes has not been 
        revealed. Find all the latest football transfers on our dedicated page."""

summarize = ChainOfThought('document -> summary')
response = summarize(document=document)

print(response.summary)

A 21-year-old player who previously played for West Ham and had loan spells at Blackpool and Colchester United has signed with


In [12]:
response

Prediction(
    rationale='produce the summary. We have a 21-year-old player who made seven appearances and scored one goal for West Ham in a Europa League match. He had loan spells at Blackpool and Colchester United, scoring twice for the latter but unable to prevent their relegation. He has now signed with Barnsley, although the length of his contract is unknown.',
    summary='A 21-year-old player who previously played for West Ham and had loan spells at Blackpool and Colchester United has signed with'
)

In [13]:
# creating signatures with classes
from dspy import (
    Signature,
    InputField,
    OutputField
)

class Emotion(Signature):
    """Classify the emotion among Fear, Happiness, Anger, Frustration, Surprise"""
    sentence = InputField() 
    senti = OutputField() 

sentence = "It was cold and dark outside, and the bright light appearing suddenly jolted me"

classify = Predict(Emotion)  # pass the Signature class instead of an inline "sentence -> senti" string
classify(sentence=sentence)

Prediction(
    senti='Fear'
)

In [14]:
class CheckCitationFaithfulness(Signature):
    """Verify that text is based on provided context"""
    context = InputField(desc="Facts assumed here are true.")
    text = InputField()
    faithfulness = OutputField(desc="True/False indicating if text is faithful to context")


In [15]:
context = "The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page."

text = "Lee scored 3 goals for Colchester United."

faithfulness = ChainOfThought(CheckCitationFaithfulness)

faithfulness(context=context, text=text)

Prediction(
    rationale='produce the faithfulness. We know from the context that Lee scored twice for Colchester United, not three times.',
    faithfulness='False'
)

You should compose multiple signatures into bigger DSPy modules and compile these modules into optimized prompts and finetunes.

### A DSPy module is a building block for programs that use LMs.

Each built-in module abstracts a prompting technique (like chain of thought or ReAct). Crucially, they are generalized to handle any DSPy Signature.

A DSPy module has learnable parameters (i.e., the little pieces comprising the prompt and the LM weights) and can be invoked (called) to process inputs and return outputs.

Multiple modules can be composed into bigger modules (programs). DSPy modules are inspired directly by NN modules in PyTorch, but applied to LM programs.

### Internally, all other DSPy modules are just built using dspy.Predict.

Even so, simply swapping dspy.ChainOfThought in place of dspy.Predict often improves quality.

In [17]:
question = "What's something great about the ColBERT retrieval model?"

# 1) Declare with a signature, and pass some config.
answer = ChainOfThought('question -> answer', n=5)

# 2) Call with input argument.
response = answer(question=question)

# 3) Access the outputs.
response.completions.answer

['One great aspect of the ColBERT retrieval model is its efficiency and effectiveness in handling large-scale information retrieval tasks.',
 'One great aspect of the ColBERT retrieval model is its ability to efficiently retrieve relevant information in large-scale systems.']

-> Predict: Basic predictor. Does not modify the signature. Handles the key forms of learning (i.e., storing the instructions and demonstrations and updates to the LM).

-> ChainOfThought: Teaches the LM to think step-by-step before committing to the signature's response.

-> ProgramOfThought: Teaches the LM to output code, whose execution results will dictate the response.

-> ReAct: An agent that can use tools to implement the given signature.

-> MultiChainComparison: Can compare multiple outputs from ChainOfThought to produce a final prediction.

We also have some function-style modules:

-> majority: Can do basic voting to return the most popular response from a set of predictions.
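
The voting idea behind `majority` can be sketched in a few lines of plain Python. This is an illustration only, with a hypothetical `majority_vote` helper and a made-up answer list; DSPy's own `majority` operates on prediction objects and may normalise text differently.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer after trimming and lowercasing.

    Ties go to the answer that appeared first.
    """
    counts = Counter(a.strip().lower() for a in answers)
    # Counter keeps insertion order, and max() returns the first maximal key.
    return max(counts, key=counts.get)


# Hypothetical completions for the same question.
answers = ["6 floors", "7", "4", "4 floors", "4"]
winner = majority_vote(answers)
print(winner)
```

Note that "4" and "4 floors" count as different strings here, which is why short, constrained output fields make voting more reliable.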

This follows PyTorch's define-by-run (dynamic computation) design approach.

#### Data Sets

> There are three types of datasets: training, development, and test sets.

> Each data point has up to three parts: inputs, intermediate labels, and final labels.

How to get the data?

> Invest time in creating the data manually

> Use adjacent datasets on Hugging Face (HF)

> Demo / deploy the system and collect the data

#### Example is the core data type in DSPy

> Any number and types of fields are acceptable 

obj = Example(field1='val1', field2='val2', field3='val3', ...)


In [1]:
from dspy import Example

qa_pair = Example(question="What is super nova?",
                  answer='Its a star explosion')

print(qa_pair)


Example({'question': 'What is super nova?', 'answer': 'Its a star explosion'}) (input_keys=None)


In [2]:
train = [Example(report='LONG Report', summary='short sentence')]

In [3]:
print(qa_pair.with_inputs("question"))  # this marks the fields as inputs, rest are labels

Example({'question': 'What is super nova?', 'answer': 'Its a star explosion'}) (input_keys={'question'})


In [7]:
sum_pair = Example(report='long line', summary='short simel').with_inputs('report')
print(sum_pair.inputs())
print(sum_pair.labels())

Example({'report': 'long line'}) (input_keys=None)
Example({'summary': 'short simel'}) (input_keys=None)
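
The input/label split can be pictured with a small stand-in class. `MiniExample` below is hypothetical and only mimics the three calls used above (`with_inputs`, `inputs`, `labels`); it is not how `dspy.Example` is implemented, and raising when no inputs are marked is a choice made for this sketch.

```python
class MiniExample:
    """Hypothetical stand-in for dspy.Example, for illustration only."""

    def __init__(self, input_keys=None, **fields):
        self._fields = dict(fields)
        self._input_keys = set(input_keys) if input_keys is not None else None

    def with_inputs(self, *keys):
        # Return a marked copy so the original example stays unmarked.
        return MiniExample(input_keys=keys, **self._fields)

    def inputs(self):
        if self._input_keys is None:
            raise ValueError("call with_inputs() before inputs()")
        return {k: v for k, v in self._fields.items() if k in self._input_keys}

    def labels(self):
        # Every field that is not marked as an input is treated as a label.
        keys = self._input_keys or set()
        return {k: v for k, v in self._fields.items() if k not in keys}


pair = MiniExample(report="long line", summary="short line").with_inputs("report")
print(pair.inputs())
print(pair.labels())
```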


### Metrics 

A metric is a function that takes an example from the data, receives the output of the system, and returns a score that quantifies how good the output is.

Example + received output ==> sent to metric ==> quantify / rank

Long-form outputs must be checked along multiple dimensions by the metric. This can be done using LLMs.

The metric signature is:

    def metric(example, pred, trace=None) -> float | int | bool: ...


In [8]:
def validate_answer(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

In [10]:
from dspy.evaluate.metrics import (
    answer_exact_match,
    answer_passage_match
)
from dspy.evaluate import Evaluate

def validate_context_and_answer(example, pred, trace=None):
    # check the gold label and the predicted answer are the same
    answer_match = example.answer.lower() == pred.answer.lower()

    # check the predicted answer comes from one of the retrieved contexts
    context_match = any((pred.answer.lower() in c.lower()) for c in pred.context)

    if trace is None: # if we're doing evaluation or optimization
        return (answer_match + context_match) / 2.0
    else: # if we're doing bootstrapping, i.e. self-generating good demonstrations of each step
        return answer_match and context_match
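
To see the two modes side by side, the sketch below calls a metric of the same shape on plain objects. The data is a toy example, `graded_or_strict` is a hypothetical name, and this variant lowercases both the answer and each passage before the substring check.

```python
from types import SimpleNamespace


def graded_or_strict(example, pred, trace=None):
    answer_match = example.answer.lower() == pred.answer.lower()
    context_match = any(pred.answer.lower() in c.lower() for c in pred.context)
    if trace is None:
        # Evaluation / optimization: partial credit as a float.
        return (answer_match + context_match) / 2.0
    # Bootstrapping: only accept demonstrations that pass every check.
    return answer_match and context_match


# Toy data: the answer is right, but no retrieved passage contains it.
example = SimpleNamespace(answer="blue")
pred = SimpleNamespace(answer="Blue", context=["The sky is vast."])

print(graded_or_strict(example, pred))            # graded score
print(graded_or_strict(example, pred, trace=[]))  # strict pass / fail
```

The same prediction earns half credit during evaluation but is rejected as a demonstration during bootstrapping.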

In [None]:
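# Assumes `devset`, `program`, and `metric` are already defined.
scores = []

for x in devset:  # iterate over the evaluation (dev) set
    pred = program(**x.inputs())  # run the program on the example's input fields
    score = metric(x, pred)  # score the prediction against the example
    scores.append(score)  # collect the per-example scores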
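
A self-contained version of this loop, runnable without an LM, might look as follows. The program is a hypothetical lookup table standing in for a DSPy module, and the two dev examples are toy data.

```python
from types import SimpleNamespace


def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()


def toy_program(question):
    # Hypothetical stand-in for a DSPy program: a lookup table instead of an LM call.
    canned = {"capital of france?": "Paris", "2 + 2?": "5"}
    return SimpleNamespace(answer=canned.get(question.lower(), ""))


devset = [
    SimpleNamespace(question="Capital of France?", answer="paris"),
    SimpleNamespace(question="2 + 2?", answer="4"),
]

scores = [exact_match(x, toy_program(x.question)) for x in devset]
average = sum(scores) / len(scores)
print(f"Average metric: {sum(scores)} / {len(scores)} ({average:.1%})")
```

Averaging the per-example scores is the same summary that `Evaluate` reports as "Average Metric".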

#### How to check multiple dimensions of predictions using AI feedback

In [None]:
from dspy import InputField, OutputField, Signature

class Assess(Signature):
    """Assess the quality of a tweet along the specified dimension."""

    assessed_text = InputField()
    assessment_question = InputField()
    assessment_answer = OutputField(desc="Yes or No")

In [None]:
from dspy import (
    OpenAI,
    Predict,
    context
)

gpt4T = OpenAI(model='gpt-4-1106-preview', max_tokens=1000, model_type='chat')

def metric(gold, pred, trace=None):
    question, answer, tweet = gold.question, gold.answer, pred.output

    engaging = "Does the assessed text make for a self-contained, engaging tweet?"
    correct = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"
    
    with context(lm=gpt4T):
        c_ckt =  Predict(Assess)(assessed_text=tweet, assessment_question=correct)
        # assess if the tweet is correct, and the value is yes / no
        e_eng = Predict(Assess)(assessed_text=tweet, assessment_question=engaging)
        # assess if the tweet is engaging and value is yes or no

    correct, engaging = [m.assessment_answer.lower() == 'yes' for m in [c_ckt, e_eng]]
    # convert the yes / no answers into True / False
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0
    # the score counts only if the tweet is correct and fits in 280 characters

    if trace is not None: return score >= 2  # return True / False during compiling
    return score / 2.0  # return a float in [0, 1] during evaluation

In [None]:

import dspy

# During compiling (optimization), DSPy traces your LM calls. The trace contains the inputs/outputs
# of each DSPy predictor, and you can use it to validate intermediate steps.

def validate_hops(example, pred, trace=None):
    hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]

    if max([len(h) for h in hops]) > 100: return False
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))): return False

    return True

#### Optimizers 

An optimizer is an algorithm that tunes the parameters of a DSPy program (the prompts and/or the LM weights) to maximize the metrics you specify.

Optimizer Inputs:

    - A DSPy program, from a single module to a complex multi-module program.

    - Metric

    - Training inputs [small datasets are also good]


What do they Tune?

    - Module parameters, of which there are 3 kinds:

        - LM weights

        - Instructions (Prompts / requests)

        - Input / Output Behaviour (Demonstrations)

How do they Tune?

    - Gradient descent for LM weights

    - Discrete LM-driven optimization for updating instructions / demonstrations

- Optimizers try more things, more systematically, and tune the metric directly

    LabeledFewShot: Simply constructs few-shot examples from provided labeled Q/A pairs.

    BootstrapFewShot: Uses your program to self-generate complete demonstrations for every stage of your program. Will simply use the generated demonstrations (if they pass the metric) without any further optimization. Advanced: Supports using a teacher program (a different DSPy program that has compatible structure) and a teacher LM, for harder tasks.

    BootstrapFewShotWithRandomSearch: Applies BootstrapFewShot several times with random search over generated demonstrations, and selects the best program.

    BootstrapFewShotWithOptuna: Applies BootstrapFewShot through Optuna hyperparameter optimization across demonstration sets, running trials to maximize evaluation metrics.

    SignatureOptimizer: Generates and refines new instructions for each step, and optimizes them with coordinate ascent.

    BayesianSignatureOptimizer: Generates instructions and few-shot examples in each step. The instruction generation is data-aware and demonstration-aware. Uses Bayesian Optimization to effectively search over the space of generation instructions/demonstrations across your modules.

    BootstrapFinetune: Distills a prompt-based DSPy program into weight updates (for smaller LMs). The output is a DSPy program that has the same steps, but where each step is conducted by a finetuned model instead of a prompted LM.


    KNNFewShot: Selects demonstrations through the k-Nearest Neighbors algorithm, integrating BootstrapFewShot for the bootstrapping/selection process.

    Ensemble: Ensembles a set of DSPy programs and either uses the full set or randomly samples a subset into a single program.

The suggestion is to use BootstrapFewShot, BootstrapFewShotWithRandomSearch, BayesianSignatureOptimizer, and BootstrapFinetune.
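
The bootstrapping idea behind BootstrapFewShot can be sketched without any LM: run the program on training inputs and keep only the runs that pass the metric as demonstrations. Everything below (`bootstrap_demos`, the lookup-table program, the arithmetic data) is hypothetical and much simpler than what DSPy does, which records traces for every predictor in the program.

```python
from types import SimpleNamespace


def bootstrap_demos(program, trainset, metric, max_demos=3):
    """Run `program` on training inputs; keep runs that pass `metric` as demos."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        pred = program(example.question)
        # A non-None trace signals bootstrapping mode, so the metric should be strict.
        if metric(example, pred, trace=[]):
            demos.append({"question": example.question, "answer": pred.answer})
    return demos


def toy_program(question):
    # Hypothetical program that gets one training question wrong.
    canned = {"1 + 1?": "2", "2 + 2?": "5", "3 + 3?": "6"}
    return SimpleNamespace(answer=canned[question])


def exact_match(example, pred, trace=None):
    return example.answer == pred.answer


trainset = [
    SimpleNamespace(question=q, answer=a)
    for q, a in [("1 + 1?", "2"), ("2 + 2?", "4"), ("3 + 3?", "6")]
]
demos = bootstrap_demos(toy_program, trainset, exact_match)
print(demos)
```

The wrong run for "2 + 2?" is filtered out, so only verified input/output pairs end up in the prompt.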

In [None]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Set up the optimizer: we want to "bootstrap" (i.e., self-generate) up to 3 demonstrations of your program's steps,
# plus up to 3 labeled demonstrations taken from the trainset.
# The optimizer will repeat this 10 times (plus some initial attempts) before selecting its best attempt on the devset.
config = dict(max_bootstrapped_demos=3, 
              max_labeled_demos=3, 
              num_candidate_programs=10, 
              num_threads=4)

teleprompter = BootstrapFewShotWithRandomSearch(metric=YOUR_METRIC_HERE, **config)
optimized_program = teleprompter.compile(YOUR_PROGRAM_HERE, trainset=YOUR_TRAINSET_HERE)
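
The "random search" part can be pictured as sampling several candidate demonstration sets and keeping the best-scoring one. The sketch below is a toy version under stated assumptions: `random_search`, the demo names, and `toy_score` are all hypothetical, and the real optimizer scores full candidate programs with your metric rather than a hand-written function.

```python
import random


def random_search(demo_pool, score_fn, num_candidates=10, demos_per_candidate=3, seed=0):
    """Sample candidate demo sets at random and return the best one with its score."""
    rng = random.Random(seed)
    k = min(demos_per_candidate, len(demo_pool))
    best_demos, best_score = None, float("-inf")
    for _ in range(num_candidates):
        candidate = rng.sample(demo_pool, k)
        score = score_fn(candidate)
        if score > best_score:
            best_demos, best_score = candidate, score
    return best_demos, best_score


demo_pool = ["demo-a", "demo-b", "demo-c", "demo-d", "demo-e"]


def toy_score(demos):
    # Hypothetical validation score: pretend demos "a" and "c" are the helpful ones.
    return sum(d in {"demo-a", "demo-c"} for d in demos) / 2


best_demos, best_score = random_search(demo_pool, toy_score)
print(best_demos, best_score)
```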

#### Assertions in DSPy

Constraining the output of LLMs is still a challenge. 

Assertions in DSPy guide the LLM toward the required outputs by enforcing user-defined constraints.

The objective is reliability, predictability, and correctness.

Assert : 

    Parameters : 

        - constraint : Python-defined boolean validation check

        - msg : user-defined message providing feedback or correction guidance

        - backtrack : specifies the target module to retry upon constraint failure. The default is the last module before the assertion

Suggest:

    - Parameters : Same as Assert

    - Behaviour : Encourages self-refinement through retries without enforcing hard stops. Logs the failure after the maximum number of backtracking attempts and continues execution


Specifically, when a constraint is not met:

Backtracking Mechanism: An under-the-hood backtracking is initiated, offering the model a chance to self-refine and proceed, which is done through

Dynamic Signature Modification: internally modifying your DSPy program’s Signature by adding the following fields:

    Past Output: your model's past output that did not pass the validation_fn

    Instruction: your user-defined feedback message on what went wrong and what possibly to fix
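
The retry behaviour described above can be sketched as a plain loop. This is a minimal illustration of the Suggest-style flow, assuming a hypothetical `suggest_with_retries` helper and a toy generator; it does not use DSPy's assertion API, and the real mechanism rewrites the module's signature rather than passing keyword arguments.

```python
def suggest_with_retries(generate, validate, msg, max_backtracks=2):
    """Call `generate` until `validate` passes or the retry budget is spent."""
    past_output, instruction = None, None
    for attempt in range(max_backtracks + 1):
        output = generate(past_output=past_output, instruction=instruction)
        if validate(output):
            return output, attempt
        # Feed the failed output and the feedback message into the next attempt.
        past_output, instruction = output, msg
    # Soft failure: log and carry on with the last output instead of raising.
    print(f"constraint still failing after {max_backtracks} retries: {msg}")
    return output, max_backtracks


def toy_generate(past_output=None, instruction=None):
    # Hypothetical generator: too long at first, shortened once it gets feedback.
    if instruction is None:
        return "x" * 30
    return past_output[:20]


output, attempts = suggest_with_retries(
    toy_generate,
    validate=lambda text: len(text) <= 20,
    msg="Keep the output within 20 characters.",
)
print(len(output), attempts)
```

An Assert-style variant would raise an error at the point where this sketch prints, which is the hard stop that Suggest avoids.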

### Starting out the RAG Tutorial

In [9]:
from dotenv import load_dotenv
import openai
import os
from dspy import ColBERTv2, OpenAI
from dspy.datasets import HotPotQA 

load_dotenv("D:\\gitFolders\\python_de_learners_data\\.env")

openai.api_key = os.environ['OPENAI_API_KEY']
# openai.api_key

In [12]:
from dspy import settings

turbo = OpenAI(model='gpt-3.5-turbo')
colbert_wiki17 = ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

settings.configure(lm=turbo,rm=colbert_wiki17)

In [None]:
dataset = HotPotQA(train_seed=1,
                   train_size=20,
                   eval_seed=42,
                   dev_size=50,
                   test_size=0)

In [16]:
trainset = [x.with_inputs('question') for x in dataset.train]
evalset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(evalset)

(20, 50)

In [74]:
print(trainset[0].question)
print(trainset[0].answer)

At My Window was released by which American singer-songwriter?
John Townes Van Zandt


In [77]:
print(evalset[11].question)
print(evalset[11].answer)
print(evalset[11].gold_titles)

Which canal is located in Washington D.C, Washington City Canal or Beardsley Canal?
Washington City Canal
{'Washington City Canal', 'Beardsley Canal'}


In [78]:
print(f"For this dataset, training examples have input keys {trainset[0].inputs().keys()} and label keys {trainset[0].labels().keys()}")
print(f"For this dataset, dev examples have input keys {evalset[0].inputs().keys()} and label keys {evalset[0].labels().keys()}")

For this dataset, training examples have input keys ['question'] and label keys ['answer']
For this dataset, dev examples have input keys ['question'] and label keys ['answer', 'gold_titles']


In [18]:
# Generate Rag Pipeline
from dspy import (
    Signature,
    InputField,
    OutputField,
    Module,
    Retrieve,
    ChainOfThought,
    Prediction
)

class GenerateAnswer(Signature):
    """Answer question with short factoid answers."""
    context = InputField(desc="may contain relevant facts")
    question = InputField()
    answer = OutputField(desc="Often between 1 and 5 words")

In [19]:
class RAG(Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = Retrieve(k=num_passages)
        self.generate_answer = ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, 
                                          question=question)
        return Prediction(context=context,
                          answer=prediction.answer)

In [20]:
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate import (
    answer_exact_match,
    answer_passage_match
)

def validate_context_and_answer(example, pred, trace=None):
    answer_EM = answer_exact_match(example, pred)
    answer_PM = answer_passage_match(example, pred)
    return answer_EM and answer_PM

In [26]:
import inspect
print(inspect.getsource(answer_exact_match))

def answer_exact_match(example, pred, trace=None, frac=1.0):
    assert(type(example.answer) is str or type(example.answer) is list)
    
    if type(example.answer) is str:
        return dsp.answer_match(pred.answer, [example.answer], frac=frac)
    else: # type(example.answer) is list
        return dsp.answer_match(pred.answer, example.answer, frac=frac)



In [29]:
from dsp import answer_match
print(inspect.getsource(answer_match))

def answer_match(prediction, answers, frac=1.0):
    # pred = example.prediction
    # answers = example.answers

    if frac >= 1.0:
        return EM(prediction, answers)

    return F1(prediction, answers) >= frac
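
The `EM` and `F1` helpers are not shown in the printed source, so here is a rough sketch of what exact match and token-level F1 usually mean for short answers. The normalisation (lowercasing, dropping punctuation and English articles) is an assumption in the common SQuAD style, and all function names are hypothetical; the library's own helpers may differ in detail.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, answers):
    return any(normalize(prediction) == normalize(a) for a in answers)


def token_f1(prediction, answer):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(answer).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_match_sketch(prediction, answers, frac=1.0):
    if frac >= 1.0:
        return exact_match(prediction, answers)
    return max(token_f1(prediction, a) for a in answers) >= frac


gold = ["John Townes Van Zandt"]
print(answer_match_sketch("Townes Van Zandt", gold))            # strict exact match
print(answer_match_sketch("Townes Van Zandt", gold, frac=0.8))  # F1 threshold
print(round(token_f1("Townes Van Zandt", gold[0]), 3))
```

Under these assumptions the prediction fails exact match against the gold label but clears an F1 threshold of 0.8, since three of the four gold tokens are recovered with perfect precision.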



In [31]:
rag_1 = RAG()
trainset[0].question
# sending through retriever 
retriev_out = rag_1.retrieve(trainset[0].question)
retriev_out  # gives a prediction object, with the passage

Prediction(
    passages=['At My Window (album) | At My Window is an album released by Folk/country singer-songwriter Townes Van Zandt in 1987. This was Van Zandt\'s first studio album in the nine years that followed 1978\'s "Flyin\' Shoes", and his only studio album recorded in the 1980s. Although the songwriter had become less prolific, this release showed that the quality of his material remained high.', 'Little Window | Little Window is the debut album of American singer-songwriter Baby Dee. The album was released in 2002 on the Durtro label. It was produced, composed, and performed entirely by Dee.', 'Windows and Walls | Windows and Walls is the eighth album by American singer-songwriter Dan Fogelberg, released in 1984 (see 1984 in music). The first single, "The Language of Love", reached 13 on the U.S. "Billboard" Hot 100 chart. Although the follow-up, "Believe in Me", missed the Top 40 of the pop chart, peaking at No. 48, it became the singer\'s fourth No. 1 song on the "Billboa

In [34]:
cot_out = rag_1.generate_answer(context=retriev_out.passages,
                                question=trainset[0].question)
cot_out

Prediction(
    rationale='produce the answer. We know that the album "At My Window" was released by an American singer-songwriter in 1987.',
    answer='Townes Van Zandt'
)

In [35]:
output = Prediction(context=retriev_out.passages,
           answer=cot_out.answer)
output

Prediction(
    context=['At My Window (album) | At My Window is an album released by Folk/country singer-songwriter Townes Van Zandt in 1987. This was Van Zandt\'s first studio album in the nine years that followed 1978\'s "Flyin\' Shoes", and his only studio album recorded in the 1980s. Although the songwriter had become less prolific, this release showed that the quality of his material remained high.', 'Little Window | Little Window is the debut album of American singer-songwriter Baby Dee. The album was released in 2002 on the Durtro label. It was produced, composed, and performed entirely by Dee.', 'Windows and Walls | Windows and Walls is the eighth album by American singer-songwriter Dan Fogelberg, released in 1984 (see 1984 in music). The first single, "The Language of Love", reached 13 on the U.S. "Billboard" Hot 100 chart. Although the follow-up, "Believe in Me", missed the Top 40 of the pop chart, peaking at No. 48, it became the singer\'s fourth No. 1 song on the "Billboar

In [21]:
# Compiling needs three things:
# a training set, a metric for validation, and a specific teleprompter.

teleprompter = BootstrapFewShot(metric=validate_context_and_answer)

compiled_rag = teleprompter.compile(RAG(),
                                    trainset=trainset)

 50%|█████     | 10/20 [00:21<00:21,  2.17s/it]

Bootstrapped 4 full traces after 11 examples in round 0.





In [40]:
rag_1.retrieve.dump_state()

{'k': 3}

In [42]:
rag_1.dump_state()

{'retrieve': {'k': 3},
 'generate_answer': {'lm': None,
  'traces': [],
  'train': [],
  'demos': [],
  'signature_instructions': 'Answer question with short factoid answers.',
  'signature_prefix': 'Answer:',
  'extended_signature_instructions': 'Answer question with short factoid answers.',
  'extended_signature_prefix': 'Answer:'}}

In [49]:
rag_1.generate_answer.activated

True

In [22]:
my_question = 'What castle did David Gregory Inherit?'

pred = compiled_rag(my_question)

print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts: {[c[:200] + '...' for c in pred.context]}")

Question: What castle did David Gregory Inherit?
Predicted Answer: Kinnairdy Castle
Retrieved Contexts: ['David Gregory (physician) | David Gregory (20 December 1625 – 1720) was a Scottish physician and inventor. His surname is sometimes spelt as Gregorie, the original Scottish spelling. He inherited Kinn...', 'Gregory Tarchaneiotes | Gregory Tarchaneiotes (Greek: Γρηγόριος Ταρχανειώτης , Italian: "Gregorio Tracanioto" or "Tracamoto" ) was a "protospatharius" and the long-reigning catepan of Italy from 998 t...', 'David Gregory (mathematician) | David Gregory (originally spelt Gregorie) FRS (? 1659 – 10 October 1708) was a Scottish mathematician and astronomer. He was professor of mathematics at the University ...']


In [50]:
for name, parameter in compiled_rag.named_predictors():
    print(name)
    print(parameter.demos[0])
    print()

generate_answer
Example({'augmented': True, 'context': ['Candace Kita | Kita\'s first role was as a news anchor in the 1991 movie "Stealth Hunters". Kita\'s first recurring television role was in Fox\'s "Masked Rider", from 1995 to 1996. She appeared as a series regular lead in all 40 episodes. Kita also portrayed a frantic stewardess in a music video directed by Mark Pellington for the British group, Catherine Wheel, titled, "Waydown" in 1995. In 1996, Kita also appeared in the film "Barb Wire" (1996) and guest starred on "The Wayans Bros.". She also guest starred in "Miriam Teitelbaum: Homicide" with "Saturday Night Live" alumni Nora Dunn, "Wall To Wall Records" with Jordan Bridges, "Even Stevens", "Felicity" with Keri Russell, "V.I.P." with Pamela Anderson, "Girlfriends", "The Sweet Spot" with Bill Murray, and "Movies at Our House". She also had recurring roles on the FX spoof, "Son of the Beach" from 2001 to 2002, ABC-Family\'s "Dance Fever" and Oxygen Network\'s "Running with Scis

In [23]:
turbo.inspect_history(n=1)





Answer question with short factoid answers.

---

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt

Question: "Everything Has Changed" is a song from an album released under which record label ?
Answer: Big Machine Records

Question: In what year was the club founded that played Manchester City in the 1972 FA Charity Shield
Answer: 1874

Question: Which Pakistani cricket umpire who won 3 consecutive ICC umpire of the year awards in 2009, 2010, and 2011 will be in the ICC World Twenty20?
Answer: Aleem Sarwar Dar

Question: Having the combination of excellent foot speed and bat speed helped Eric Davis, create what kind of outfield for the Los Angeles Dodgers?
Answer: "Outfield of Dreams"

Question: Who is older, Aleksandr Danilovich Aleksandrov or Anatoly Fomenko?
Answer: Aleksandr Danilovich Aleksandrov

Question: The Organisation that allows a community to influence their operation or use and to enjoy the benefits arisingwas fou

In [51]:
compiled_rag.named_parameters()


[('retrieve', <dspy.retrieve.retrieve.Retrieve at 0x2765a02e290>),
 ('generate_answer',
  ChainOfThought(Signature(context, question -> answer
      instructions='Answer question with short factoid answers.'
      context = Field(annotation=str required=True json_schema_extra={'desc': 'may contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
      question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
      answer = Field(annotation=str required=True json_schema_extra={'desc': 'Often between 1 and 5 words', '__dspy_field_type': 'output', 'prefix': 'Answer:'})
  )))]

In [52]:
from dspy.evaluate.evaluate import Evaluate

eval_hpqa = Evaluate(devset=evalset,
                     num_threads=1, display_progress=False,
                     display_table=5)

In [56]:
eval_hpqa(compiled_rag, metric=answer_exact_match)

Average Metric: 25 / 50  (50.0%)



Unnamed: 0,question,example_answer,gold_titles,context,pred_answer,answer_exact_match
0,What is T. J. Lottie's husband's nickname?,Rip,"{'Richard Hamilton (basketball)', 'T. J. Lottie'}","['T. J. Lottie | T. J. Lottie is an American singer. She was a member of the R&B group So Plush, which was on the...",Rip,✔️ [True]
1,Aniki-Bóbó is akin to a children's counting rhyme that has existed in various forms since well before when?,1820,"{'Eeny, meeny, miny, moe', 'Aniki-Bóbó'}",['Aniki-Bóbó | Aniki-Bóbó is a 1942 Portuguese film directed by Manoel de Oliveira. It is his first feature-length film. The actors are mostly children from...,1820,✔️ [True]
2,Estonian Philharmonic Chamber Choir won the grammy Award for Best Choral Performance for two songs by a composer born in what year ?,1935,"{'Estonian Philharmonic Chamber Choir', 'Arvo Pärt'}",['Estonian Philharmonic Chamber Choir | The Estonian Philharmonic Chamber Choir (EPCC) is a professional choir based in Estonia. It was founded in 1981 by Tõnu...,1935,✔️ [True]
3,"What is the nationality of the woman who performed ""A Message to Your Heart"" at the Eurovision Song Contest in 1991?",British,"{'A Message to Your Heart', 'Samantha Womack'}","['A Message to Your Heart | ""A Message to Your Heart"", written and composed by Paul Curtis, was the United Kingdom\'s entry at the Eurovision...",British,✔️ [True]
4,What musical was based on the 2003 film directed by Jon Favreau?,Elf,"{'Elf: The Musical', 'Elf (film)'}",['The Big Empty (2003 film) | The Big Empty is a 2003 drama film directed and written by Steve Anderson. It stars Jon Favreau as...,Elf,✔️ [True]


50.0

In [57]:
from dspy.evaluate import normalize_text

def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(normalize_text, example['gold_titles']))
    found_titles = set(map(normalize_text, [c.split(' | ')[0] for c in pred.context]))

    return gold_titles.issubset(found_titles)
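
The check relies on passages having the form "Title | body text". A standalone sketch of the same subset test, using a simplified normaliser (strip and lowercase) in place of `normalize_text`, with hypothetical helper names and placeholder passage bodies:

```python
def titles_from_passages(passages):
    # Passages in this index look like "Title | body text".
    return {p.split(" | ")[0].strip().lower() for p in passages}


def gold_titles_covered(gold_titles, passages):
    gold = {t.strip().lower() for t in gold_titles}
    return gold.issubset(titles_from_passages(passages))


gold_titles = {"Richard Hamilton (basketball)", "T. J. Lottie"}

# Only one of the two gold pages was retrieved.
partial = ["T. J. Lottie | T. J. Lottie is an American singer."]
# Both gold pages were retrieved (second body is placeholder text).
full = partial + ["Richard Hamilton (basketball) | (passage text)"]

print(gold_titles_covered(gold_titles, partial))
print(gold_titles_covered(gold_titles, full))
```

This is why a question can be answered correctly while still failing the retrieval metric: every gold title has to appear among the retrieved passages.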


In [58]:
compiled_rag_retrieval_score = eval_hpqa(compiled_rag, metric=gold_passages_retrieved)

Average Metric: 13 / 50  (26.0%)



Unnamed: 0,question,example_answer,gold_titles,context,pred_answer,gold_passages_retrieved
0,What is T. J. Lottie's husband's nickname?,Rip,"{'Richard Hamilton (basketball)', 'T. J. Lottie'}","['T. J. Lottie | T. J. Lottie is an American singer. She was a member of the R&B group So Plush, which was on the...",Rip,False
1,Aniki-Bóbó is akin to a children's counting rhyme that has existed in various forms since well before when?,1820,"{'Eeny, meeny, miny, moe', 'Aniki-Bóbó'}",['Aniki-Bóbó | Aniki-Bóbó is a 1942 Portuguese film directed by Manoel de Oliveira. It is his first feature-length film. The actors are mostly children from...,1820,✔️ [True]
2,Estonian Philharmonic Chamber Choir won the grammy Award for Best Choral Performance for two songs by a composer born in what year ?,1935,"{'Estonian Philharmonic Chamber Choir', 'Arvo Pärt'}",['Estonian Philharmonic Chamber Choir | The Estonian Philharmonic Chamber Choir (EPCC) is a professional choir based in Estonia. It was founded in 1981 by Tõnu...,1935,False
3,"What is the nationality of the woman who performed ""A Message to Your Heart"" at the Eurovision Song Contest in 1991?",British,"{'A Message to Your Heart', 'Samantha Womack'}","['A Message to Your Heart | ""A Message to Your Heart"", written and composed by Paul Curtis, was the United Kingdom\'s entry at the Eurovision...",British,False
4,What musical was based on the 2003 film directed by Jon Favreau?,Elf,"{'Elf: The Musical', 'Elf (film)'}",['The Big Empty (2003 film) | The Big Empty is a 2003 drama film directed and written by Steve Anderson. It stars Jon Favreau as...,Elf,False


#### Multi-hop systems

Multi-hop systems such as GoldEn (Qi et al., 2019) and Baleen (Khattab et al., 2021) read the retrieved results and then, when necessary, generate additional queries to gather more information before arriving at a final answer.

In [62]:
class GenerateSearchQuery(Signature):
    """Write a simple search query that will help answer a complex question."""
    context = InputField(desc="may contain relevant facts")  # passages gathered so far
    question = InputField()  # the question we are trying to answer
    query = OutputField()  # the search query we ask the LM to write

In [65]:
from dsp.utils import deduplicate

class SimplifiedBaleen(Module):
    def __init__(self, passages_per_hop=2, max_hops=2):
        super().__init__()

        self.generate_query = [ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = Retrieve(k=passages_per_hop)
        self.generate_answer = ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops

    def forward(self, question):
        context = []

        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            # generate a search query from the context gathered so far
            passages = self.retrieve(query).passages
            # retrieve the top passages for that query
            context = deduplicate(context + passages)
            # merge the new passages into the context, dropping duplicates
        
        pred = self.generate_answer(context=context, question=question)
        # generate the answer from the overall contexts
        return Prediction(context=context, answer=pred.answer)
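
The control flow of `forward` can be traced without an LM or a retriever. The sketch below is a toy version of the same loop: build a query from the current context, fetch passages, merge them without duplicates, and repeat. Everything in it (`toy_multi_hop`, `make_query`, `search`, `CORPUS`, the document strings) is a hypothetical stand-in for illustration, not part of DSPy.

```python
def dedupe(items):
    # Order-preserving de-duplication (dict keys keep insertion order).
    return list(dict.fromkeys(items))


def toy_multi_hop(question, make_query, search, max_hops=2):
    context = []
    queries = []
    for _ in range(max_hops):
        # Each hop sees everything retrieved so far.
        query = make_query(context, question)
        queries.append(query)
        context = dedupe(context + search(query))
    return context, queries


# Hypothetical corpus: query string -> passages in "Title | body" form.
CORPUS = {
    "castle owner": ["Doc A | toy text", "Doc B | toy text"],
    "castle floors": ["Doc A | toy text", "Doc C | toy text"],
}


def make_query(context, question):
    # Stand-in for the LM: first hop asks about the owner, later hops about floors.
    return "castle owner" if not context else "castle floors"


def search(query):
    # Stand-in for the retriever.
    return CORPUS.get(query, [])


context, queries = toy_multi_hop("toy question", make_query, search)
print(queries)
print(context)
```

The second hop returns "Doc A" again, and the merge step keeps only one copy, so the final context holds three passages rather than four.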

In [66]:
my_que = "How many storeys are in the castle that David Gregory Inherited"

uncompiled_baleen = SimplifiedBaleen()
pred = uncompiled_baleen(question=my_que)

print(f"Question: {my_que}")
print(f"Pred answer: {pred.answer}")
print(f"The contexts: {[c[:200] + '...' for c in pred.context]}")

Question: How many storeys are in the castle that David Gregory Inherited
Pred answer: 3
The contexts: ['David Gregory (physician) | David Gregory (20 December 1625 – 1720) was a Scottish physician and inventor. His surname is sometimes spelt as Gregorie, the original Scottish spelling. He inherited Kinn...', 'List of owners of Warwick Castle | Warwick Castle, in Warwickshire, UK, was first constructed in 1068. Over its 950 years of history it has been owned by 36 different individuals, plus four periods as...']


In [68]:
turbo.inspect_history(n=8)





Answer question with short factoid answers.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: Often between 1 and 5 words

---

Question: How many storeys are in the castle that David Gregory Inherited

Reasoning: Let's think step by step in order to[32m determine the number of storeys in the castle.

Answer: Three[0m







Answer question with short factoid answers.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: Often between 1 and 5 words

---

Context:
determine the number of storeys in the castle.

Answer: Three

Question: How many storeys are in the castle that David Gregory Inherited

Reasoning: Let's think step by step in order to[32m produce the answer. We know that the castle David Gregory inherited has t

#### Implementing validation logic for query and answer generation

In [69]:
from dspy.evaluate import answer_exact_match, answer_passage_match, answer_exact_match_str

def validate_context_and_answer_and_hops(example, pred, trace=None):

    # The predicted answer matches the gold answer.
    if not answer_exact_match(example, pred): return False
    # The retrieved context contains the gold answer.
    if not answer_passage_match(example, pred): return False

    hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]

    # None of the generated queries is rambling (i.e., none exceeds 100 characters in length).
    if max([len(h) for h in hops]) > 100: return False
    # None of the generated queries is roughly repeated (i.e., none is within 0.8 or higher F1 score of earlier queries).
    if any(answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))): return False

    return True
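
The two query checks in the metric (no rambling query, no near-repeat of an earlier hop) can be illustrated in isolation. The sketch below uses a plain whitespace-token F1 as the similarity measure; that is an assumption for illustration, and `answer_exact_match_str` may tokenize and normalize differently. `token_f1` and `queries_are_well_formed` are hypothetical names.

```python
from collections import Counter


def token_f1(a, b):
    # Token-overlap F1 between two strings (lowercased whitespace tokens).
    ta, tb = a.lower().split(), b.lower().split()
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ta)
    recall = overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def queries_are_well_formed(hops, max_len=100, frac=0.8):
    # hops[0] is the question; hops[1:] are the generated queries.
    if max(len(h) for h in hops) > max_len:
        return False
    # As in the metric above, checking starts at the second query (index 2).
    for idx in range(2, len(hops)):
        if any(token_f1(hops[idx], earlier) >= frac for earlier in hops[:idx]):
            return False
    return True


question = "how many storeys are in the castle"
good = [question, "who inherited the castle", "number of storeys of that castle"]
repeated = [question, "who inherited the castle", "who inherited the castle"]
rambling = [question, "x" * 101]

print(queries_are_well_formed(good))
print(queries_are_well_formed(repeated))
print(queries_are_well_formed(rambling))
```

Since the loop starts at index 2, the first query is never compared against the question itself; only later queries are checked against everything that came before them.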

In [70]:
opti = BootstrapFewShot(metric=validate_context_and_answer_and_hops)

compiled_baleen = opti.compile(SimplifiedBaleen(), 
                               teacher=SimplifiedBaleen(passages_per_hop=2),
                               trainset=trainset)

100%|██████████| 20/20 [01:41<00:00,  5.10s/it]

Bootstrapped 0 full traces after 20 examples in round 0.





In [71]:
# Set up the `evaluate_on_hotpotqa` function. We'll use this many times below.
evaluate_on_hotpotqa = Evaluate(devset=evalset,
                                num_threads=1,
                                display_progress=True,
                                display_table=5)


In [None]:
uncompiled_baleen_retrieval_score = evaluate_on_hotpotqa(uncompiled_baleen,
                                                         metric=gold_passages_retrieved,
                                                         display=False)

In [None]:
print(f"## Retrieval Score for uncompiled Baleen: {uncompiled_baleen_retrieval_score}")

In [None]:
compiled_baleen_retrieval_score = evaluate_on_hotpotqa(compiled_baleen,
                                                       metric=gold_passages_retrieved)


print(f"## Retrieval Score for compiled Baleen: {compiled_baleen_retrieval_score}")