# Anachronism example

This example takes a <a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/anachronisms">simple task from BIG-bench</a>, where the goal is to identify whether a given sentence contains an anachronism (i.e. states something that is impossible because the entities it mentions belong to different time periods).

In [1]:
import datasets

# load the data
data = datasets.load_dataset('bigbench', 'anachronisms')
inputs = [x.split('\n')[0] for x in data['validation']['inputs']]
labels = [x[0] for x in data['validation']['targets']]

print(f"Loaded {len(labels)} data items")

Loaded 46 data items
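
The two list comprehensions above keep only the first line of each input string and the first entry of each target list. A minimal sketch of that step on a single made-up record follows; the record text is an assumption for illustration and is not copied from the dataset:

```python
# A made-up record shaped like the fields used above: 'inputs' holds strings
# whose first line is the sentence, 'targets' holds lists of label strings.
# The text itself is hypothetical and only illustrates the preprocessing.
toy_validation = {
    "inputs": ["The knight checked his smartwatch.\nIs there an anachronism?"],
    "targets": [["Yes"]],
}

# Keep the first line of each input and the first target of each item.
toy_inputs = [x.split("\n")[0] for x in toy_validation["inputs"]]
toy_labels = [x[0] for x in toy_validation["targets"]]

print(toy_inputs)
print(toy_labels)
```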


Now, let us load a model into `guidance`:

In [2]:
import os

import guidance

# define the model we will use
# MODEL_PATH should point at the gguf file which you wish to use
target_model_path = os.getenv("MODEL_PATH")
print(f"Attempting to load {target_model_path}")

lm = guidance.models.LlamaCpp(target_model_path, n_gpu_layers=-1)

Attempting to load /mnt/c/Users/riedgar/Downloads/llama-2-7b.Q5_K_M.gguf


We can now define an `anachronism_query()` function, decorated with `@guidance`, which contains the `guidance` instructions.
The first argument of a function decorated this way is always a language model; the function returns that model after appending whatever strings and `guidance` instructions are required.

In this case, we're going to take some few-shot examples in addition to the desired query, and build them into a prompt.
We then provide `guidance` commands to step through some chain-of-thought (CoT) reasoning.
Notice how we use the `stop` keyword to end each generation before the next stage in the CoT; otherwise the model may go off the rails and generate more than we want in the first 'entities' generation call.
In the final step, we use `guidance.select` to force the model to generate a 'Yes' or 'No' answer:

In [3]:
@guidance
def anachronism_query(llm, query, examples):
    prompt_string = """Given a sentence tell me whether it contains an anachronism (i.e. whether it could have happened or
not based on the time periods associated with the entities).

Here are some examples:
"""
    for ex in examples:
        prompt_string += f"Sentence: { ex['input'] }" + "\n"
        prompt_string += "Entities and dates:\n"
        for en in ex['entities']:
            prompt_string += f"{en['entity']} : {en['time']}" + "\n"
        prompt_string += f"Reasoning: {ex['reasoning']}" + "\n"
        prompt_string += f"Anachronism: {ex['answer']}" + "\n"

    llm += f'''{prompt_string}
Now determine whether the following is an anachronism:
    
Sentence: { query }
Entities and dates:
{ guidance.gen(name="entities", max_tokens=100, stop="Reason") }'''
    llm += "Reasoning :"
    llm += guidance.gen(name="reason", max_tokens=100, stop="\n")
    llm += f'''\nAnachronism: { guidance.select(["Yes", "No"], name="answer") }'''
    return llm
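
To inspect the few-shot portion of the prompt without loading a model, the string-building loop can be pulled out into plain Python. This is a sketch only: `build_fewshot_prompt` is a hypothetical helper, not part of `guidance`, and the single example passed to it is written inline so the block runs on its own.

```python
def build_fewshot_prompt(examples):
    """Return the few-shot part of the prompt as a plain string.

    Hypothetical helper that mirrors the loop inside anachronism_query().
    """
    lines = []
    for ex in examples:
        lines.append(f"Sentence: {ex['input']}")
        lines.append("Entities and dates:")
        for en in ex["entities"]:
            lines.append(f"{en['entity']} : {en['time']}")
        lines.append(f"Reasoning: {ex['reasoning']}")
        lines.append(f"Anachronism: {ex['answer']}")
    # Every line ends with a newline, as in the original loop.
    return "".join(line + "\n" for line in lines)


one_example = [
    {
        "input": "Shakespeare wrote about me",
        "entities": [
            {"entity": "Shakespeare", "time": "16th century"},
            {"entity": "I", "time": "present"},
        ],
        "reasoning": "Shakespeare cannot have written about me, because he died before I was born",
        "answer": "Yes",
    }
]

prompt_text = build_fewshot_prompt(one_example)
print(prompt_text)
```

Printing the string this way makes it easy to check that the labels in the examples match the labels the model is asked to continue.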

We can now invoke our function with a query string and some examples.
Note that when we call `anachronism_query()` we _don't_ pass in the language model itself; the `@guidance` decorator takes care of that when the result is added to `lm`.

In [4]:
# define the few shot examples
fewshot_examples = [
    {'input': 'I wrote about shakespeare',
    'entities': [{'entity': 'I', 'time': 'present'}, {'entity': 'Shakespeare', 'time': '16th century'}],
    'reasoning': 'I can write about Shakespeare because he lived in the past with respect to me.',
    'answer': 'No'},
    {'input': 'Shakespeare wrote about me',
    'entities': [{'entity': 'Shakespeare', 'time': '16th century'}, {'entity': 'I', 'time': 'present'}],
    'reasoning': 'Shakespeare cannot have written about me, because he died before I was born',
    'answer': 'Yes'}
]

# Invoke the model
generate = lm + anachronism_query("The T-Rex bit my dog", fewshot_examples)

# Show the extracted generations
print("entities:\n{0}".format(generate['entities']))
print(f"reasoning: {generate['reason']}")
print(f"answer: {generate['answer']}")

entities:
T-Rex : 65 million years ago
Dog : present

reasoning:  The T-Rex is extinct, so it cannot bite my dog.
answer: Yes
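
The captured `entities` value is a raw string. If structured data is needed downstream, it can be split into pairs. The sketch below assumes the model followed the `name : time` layout of the few-shot examples, which is not guaranteed, so lines that do not match are skipped; `parse_entities` is a hypothetical helper, not part of `guidance`.

```python
def parse_entities(entities_text):
    """Split 'name : time' lines into (name, time) pairs.

    Hypothetical helper; lines without the ' : ' separator are ignored.
    """
    pairs = []
    for line in entities_text.splitlines():
        if " : " not in line:
            continue
        name, time = line.split(" : ", 1)
        pairs.append((name.strip(), time.strip()))
    return pairs


# Text shaped like the 'entities' output shown above.
sample = "T-Rex : 65 million years ago\nDog : present\n"
parsed = parse_entities(sample)
print(parsed)
```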


For comparison purposes, we can also define a zero-shot function:

In [5]:
@guidance
def anachronism_query_zeroshot(llm, query):
    llm += f'''Given a sentence tell me whether it contains an anachronism (i.e. whether it could have happened or
not based on the time periods associated with the entities).

Sentence: {query}
Anachronism: { guidance.select(["Yes", "No"], name="answer") }
'''
    return llm

generate_zero = lm + anachronism_query_zeroshot("The T-Rex bit my dog")

# Show the extracted generations
print(f"answer: {generate_zero['answer']}")

answer: No


### Compute accuracy

We compute accuracy on the validation set for the structured few-shot query and compare it with the zero-shot query defined above. Best reported results for this task are listed <a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/anachronisms">here</a>. We hope that a simple output structure will improve the accuracy of the results:

In [6]:
import numpy as np

fews = []
zero_shot = []
# `sentence` avoids shadowing the built-in `input`
for count, sentence in enumerate(inputs):
    print(f"Working on item {count}")
    # structured few-shot query
    few_result = lm + anachronism_query(sentence, fewshot_examples)
    fews.append('Yes' if 'Yes' in few_result['answer'] else 'No')
    # zero-shot query
    zero_result = lm + anachronism_query_zeroshot(sentence)
    zero_shot.append('Yes' if 'Yes' in zero_result['answer'] else 'No')
fews = np.array(fews)
zero_shot = np.array(zero_shot)

Now, we can compute the accuracy for each of the approaches:

In [7]:
print('Few-shot', (np.array(labels) == fews).mean())
print('Zero-shot', (np.array(labels) == zero_shot).mean())

Few-shot 0.41304347826086957
Zero-shot 0.41304347826086957
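
Both scores correspond to 19 correct answers out of the 46 validation items. A single accuracy number does not show which kinds of errors were made, so it can help to count (label, prediction) pairs as well. The sketch below uses a hypothetical helper, `confusion_counts`, on toy lists; the toy values are made up and are not results from this notebook.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (label, prediction) pair occurs (hypothetical helper)."""
    return Counter(zip(y_true, y_pred))


# Toy data for illustration only, not taken from the run above.
toy_labels = ["Yes", "No", "Yes", "No"]
toy_preds = ["Yes", "Yes", "No", "No"]

counts = confusion_counts(toy_labels, toy_preds)
toy_accuracy = sum(n for (label, pred), n in counts.items() if label == pred) / len(toy_labels)

print(counts)
print(toy_accuracy)

# Converting the reported accuracy back into a count of correct items.
n_correct = round(0.41304347826086957 * 46)
print(n_correct)
```

In the notebook itself, the same helper could be called with `labels` and `fews` (or `zero_shot`) in place of the toy lists.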


<hr style="height: 1px; opacity: 0.5; border: none; background: #cccccc;">
<div style="text-align: center; opacity: 0.5">Have an idea for more helpful examples? Pull requests that add to this documentation notebook are encouraged!</div>