# Anachronism example

This example takes a <a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/anachronisms">simple task from BigBench</a>, where the goal is to identify whether a given sentence contains an anachronism (i.e. states something that is impossibile due to time periods).

In [1]:
import requests
import json
dataset_url = "https://raw.githubusercontent.com/google/BIG-bench/main/bigbench/benchmark_tasks/anachronisms/task.json"
response = requests.get(dataset_url)
if response.status_code == 200:
    task_data = response.json()
    
    # Extract examples
    examples = task_data.get('examples', [])
    
    # Process the data
    inputs = [ex['input'] for ex in examples]
    labels = [ex['target_scores'] for ex in examples]
    
    print(f"Loaded {len(inputs)} examples")
else:
    print(f"Error during download: {response.status_code}")

Loaded 230 examples


Now, let us load a model into `guidance`:

In [None]:
import guidance
from huggingface_hub import hf_hub_download

lm = guidance.models.LlamaCpp(
    hf_hub_download(
        repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
        filename="Llama-3.2-3B-Instruct-Q6_K_L.gguf",
    ),
    verbose=True,
    n_ctx=4096,
)

We can now define an `anachronism_query()` function.
This is a function, decorated with `@guidance` and which contains guidance instructions.
The first argument to a function decorated like this is always a language model, and the function returns the same model after appending whatever strings and `guidance` instructions are required.

In this case, we're going to take some few-shot examples in addition to the desired query, and build them into a prompt.
We then provide `guidance` commands to step through some chain-of-thought (CoT) reasoning.
Notice how we use the `stop` keyword to limit the generation before the next stage in the CoT (the model may go off the rails and generate more than we want in the first 'entity' generation call otherwise).
In the final step, we use `guidance.select` to force the model to generate a 'Yes' or 'No' answer:

In [3]:
@guidance
def anachronism_query(llm, query, examples):
    prompt_string = """Given a sentence tell me whether it contains an anachronism (i.e. whether it could have happened or
not based on the time periods associated with the entities).

Here are some examples:
"""
    for ex in examples:
        prompt_string += f"Sentence: { ex['input'] }" + "\n"
        prompt_string += "Entities and dates:\n"
        for en in ex['entities']:
            prompt_string += f"{en['entity']} : {en['time']}" + "\n"
        prompt_string += f"Reasoning: {ex['reasoning']}" + "\n"
        prompt_string += f"Anachronism: {ex['answer']}" + "\n"

    llm += f'''{prompt_string}
Now determine whether the following is an anachronism:
    
Sentence: { query }
Entities and dates:
{ guidance.gen(name="entities", max_tokens=100, stop="Reason") }'''
    llm += "Reasoning :"
    llm += guidance.gen(name="reason", max_tokens=100, stop="\n")
    llm += f'''\nAnachronism: { guidance.select(["Yes", "No"], name="answer") }'''
    return llm

We can now invoke our function with a query string and some examples.
Again, note how when we call `anachronism_query()` we _don't_ pass in the language model itself; the `@guidance` decorator takes care of that.

In [6]:
# define the few shot examples
fewshot_examples = [
    {'input': 'I wrote about shakespeare',
    'entities': [{'entity': 'I', 'time': 'present'}, {'entity': 'Shakespeare', 'time': '16th century'}],
    'reasoning': 'I can write about Shakespeare because he lived in the past with respect to me.',
    'answer': 'No'},
    {'input': 'Shakespeare wrote about me',
    'entities': [{'entity': 'Shakespeare', 'time': '16th century'}, {'entity': 'I', 'time': 'present'}],
    'reasoning': 'Shakespeare cannot have written about me, because he died before I was born',
    'answer': 'Yes'}
]

# Invoke the model
generate = lm + anachronism_query("The T-Rex bit my dog", fewshot_examples)

# Show the extracted generations
print("entities:\n{0}".format(generate['entities']))
print(f"reasoning: {generate['reason']}")
print(f"answer: {generate['answer']}")

StitchWidget(initial_height='auto', initial_width='100%', srcdoc='<!doctype html>\n<html lang="en">\n<head>\n …

entities:
T-Rex : 65 million years ago
I : present

reasoning:  The T-Rex is extinct and could not have bitten my dog because it died before I was born.
answer: Yes


For comparison purposes, we can also define a zero-shot function:

In [7]:
@guidance
def anachronism_query_zeroshot(llm, query):
    llm += f'''Given a sentence tell me whether it contains an anachronism (i.e. whether it could have happened or
not based on the time periods associated with the entities).

Sentence: {query}
Anachronism: { guidance.select(["Yes", "No"], name="answer") }
'''
    return llm

generate_zero = lm + anachronism_query_zeroshot("The T-Rex bit my dog")

# Show the extracted generations
print(f"answer: {generate_zero['answer']}")

StitchWidget(initial_height='auto', initial_width='100%', srcdoc='<!doctype html>\n<html lang="en">\n<head>\n …

answer: Yes


### Compute accuracy

We compute accuracy on the validation set, and compare it to using the same two-shot examples above without the output structure, as well as to the best reported result <a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/anachronisms">here</a>. We hope that a simple output structure will improve the accuracy of the results:

In [17]:
import numpy as np

fews = []
zero_shot = []
count = 0
for input, label in zip(inputs, labels):
    # print(f"Working on item {count}")
    f = lm + anachronism_query(input, fewshot_examples)
    f = 'Yes' if 'Yes' in f['answer'] else 'No'
    fews.append(f)
    g = lm + anachronism_query_zeroshot(input)
    g = 'Yes' if 'Yes' in g['answer'] else 'No'
    zero_shot.append(g)
    count += 1
fews = np.array(fews)
zero_shot = np.array(zero_shot)

StitchWidget(initial_height='auto', initial_width='100%', srcdoc='<!doctype html>\n<html lang="en">\n<head>\n …

Now, we can compute the accuracy for each of the approaches:

In [18]:
# One-liner to extract correct answers
correct_answers = [next(k for k, v in label.items() if v == 1) for label in labels]

# Score
few_shot_accuracy = sum(pred == correct for pred, correct in zip(fews, correct_answers)) / len(fews)
print(f"Few-shot Accuracy: {accuracy:.2%}")

zero_shot_accuracy = sum(pred == correct for pred, correct in zip(zero_shot, correct_answers)) / len(zero_shot)
print(f"Zero-shot Accuracy: {accuracy:.2%}")

Few-shot Accuracy: 71.74%
Zero-shot Accuracy: 71.74%


<hr style="height: 1px; opacity: 0.5; border: none; background: #cccccc;">
<div style="text-align: center; opacity: 0.5">Have an idea for more helpful examples? Pull requests that add to this documentation notebook are encouraged!</div>