# Anachronism example

This example takes a <a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/anachronisms">simple task from BigBench</a>, where the goal is to identify whether a given sentence contains an anachronism (i.e., states something that is impossible given the time periods involved).

In [2]:
import datasets

# load the data
data = datasets.load_dataset("bigbench", "anachronisms")
inputs = [x.split("\n")[0] for x in data["validation"]["inputs"]]
labels = [x[0] for x in data["validation"]["targets"]]
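
To make the two list comprehensions above concrete, here is a minimal sketch on made-up records. The exact text of the real BIG-bench fields is an assumption here: the code above only implies that each input has the sentence on its first line and that each target is a sequence whose first element is the label.

```python
# Hypothetical records shaped like the fields the cell above indexes into.
# The real dataset strings may differ; these are invented for illustration.
sample_inputs = [
    "The T-rex bit my dog\nDoes the preceding sentence contain an anachronism?",
    "I wrote about Shakespeare\nDoes the preceding sentence contain an anachronism?",
]
sample_targets = [["Yes"], ["No"]]

# keep only the first line of each input (the sentence itself)
sentences = [x.split("\n")[0] for x in sample_inputs]
# keep only the first target of each example (the label)
answers = [t[0] for t in sample_targets]

print(sentences)
print(answers)
```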

In [3]:
import guidance

# define the model we will use
guidance.llm = guidance.llms.OpenAI("text-davinci-003")

In [4]:
# define the few shot examples
examples = [
    {
        "input": "I wrote about shakespeare",
        "entities": [
            {"entity": "I", "time": "present"},
            {"entity": "Shakespeare", "time": "16th century"},
        ],
        "reasoning": "I can write about Shakespeare because he lived in the past with respect to me.",
        "answer": "No",
    },
    {
        "input": "Shakespeare wrote about me",
        "entities": [
            {"entity": "Shakespeare", "time": "16th century"},
            {"entity": "I", "time": "present"},
        ],
        "reasoning": "Shakespeare cannot have written about me, because he died before I was born",
        "answer": "Yes",
    },
]

# define the guidance program
structure_prompt = guidance(
    """Given a sentence tell me whether it contains an anachronism (i.e. whether it could have happened or not based on the time periods associated with the entities).
----

{{~! display the few-shot examples ~}}
{{~#each examples}}
Sentence: {{this.input}}
Entities and dates:{{#each this.entities}}
{{this.entity}}: {{this.time}}{{/each}}
Reasoning: {{this.reasoning}}
Anachronism: {{this.answer}}
---
{{~/each}}

{{~! place the real question at the end }}
Sentence: {{input}}
Entities and dates:
{{gen "entities"}}
Reasoning:{{gen "reasoning"}}
Anachronism:{{#select "answer"}} Yes{{or}} No{{/select}}"""
)

# execute the program
structure_prompt(examples=examples, input="The T-rex bit my dog")
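
To see roughly what text the few-shot portion of the program expands to, the sketch below renders one example with plain Python string formatting. It does not use the guidance library and only approximates the template: `render_few_shot` is a hypothetical helper, and the exact whitespace produced by the `~` trimming markers may differ.

```python
def render_few_shot(examples):
    """Approximate the text produced by the {{#each examples}} block."""
    blocks = []
    for ex in examples:
        # one "entity: time" line per entity, as in the inner {{#each}}
        entity_lines = "\n".join(
            f"{e['entity']}: {e['time']}" for e in ex["entities"]
        )
        blocks.append(
            f"Sentence: {ex['input']}\n"
            f"Entities and dates:\n{entity_lines}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Anachronism: {ex['answer']}\n"
            "---"
        )
    return "\n".join(blocks)


demo_examples = [
    {
        "input": "Shakespeare wrote about me",
        "entities": [
            {"entity": "Shakespeare", "time": "16th century"},
            {"entity": "I", "time": "present"},
        ],
        "reasoning": "Shakespeare cannot have written about me, because he died before I was born",
        "answer": "Yes",
    }
]

rendered = render_few_shot(demo_examples)
print(rendered)
```

The model then sees the real sentence after these blocks and fills in the same three fields in the same order: entities, reasoning, and the final answer.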

### Compute accuracy

We compute accuracy on the validation set and compare the structured program against the same two-shot examples used without the output structure, as well as against the best reported result <a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/anachronisms">here</a>. The results below agree with the existing literature: even a very simple output structure markedly improves performance, even when compared against much larger models.

In [9]:
import numpy as np

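# NOTE: `fewshot_prompt` and `instruction` are not defined in the cells shown here;
# they are assumed to come from earlier cells of the notebook.
fews = []
structs = []
for sentence in inputs:
    # baseline: the same two examples, without the intermediate output structure
    f = fewshot_prompt(examples=examples, instruction=instruction, input=sentence)
    fews.append("Yes" if "Yes" in f["answer"] else "No")
    # structured program: entities, then reasoning, then the answer
    s = structure_prompt(examples=examples, input=sentence, instruction=instruction)
    structs.append("Yes" if "Yes" in s["answer"] else "No")
fews = np.array(fews)
structs = np.array(structs)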

In [10]:
print("Few-shot", (np.array(labels) == fews).mean())
print("Structured output", (np.array(labels) == structs).mean())

Few-shot 0.6304347826086957
Structured output 0.7608695652173914
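
The accuracy figures above come from an elementwise comparison of two string arrays followed by a mean. A minimal sketch on made-up predictions (the arrays and the `normalize` helper are hypothetical, not taken from the notebook's data):

```python
import numpy as np


def normalize(raw_answer):
    """Map a raw model answer to "Yes" or "No", as the loop above does."""
    return "Yes" if "Yes" in raw_answer else "No"


# hypothetical gold labels and raw model answers
gold = np.array(["Yes", "No", "Yes", "No"])
raw_answers = [" Yes", " No", " No", " No"]
predicted = np.array([normalize(a) for a in raw_answers])

# comparing the arrays gives booleans; their mean is the fraction that match
accuracy = (gold == predicted).mean()
print(accuracy)  # 0.75
```

Note that any answer not containing "Yes" is counted as "No", so an empty or malformed answer is scored as a "No" prediction rather than raising an error.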


<hr style="height: 1px; opacity: 0.5; border: none; background: #cccccc;">
<div style="text-align: center; opacity: 0.5">Have an idea for more helpful examples? Pull requests that add to this documentation notebook are encouraged!</div>