# Evaluating Large Language Models as future event forecasters - Part One

## Setup - Install dependencies and download model

Note that we pass a CMake argument when installing llama-cpp-python so that llama.cpp is compiled with GPU support. This step is essential for tolerable generation speeds, so [read up](https://github.com/ggerganov/llama.cpp#Build) on building with the right acceleration for your hardware if you reuse this code outside of Colab.

In [None]:
# This will take a while
!pip install guidance &> /dev/null
!pip install huggingface-hub &> /dev/null
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python==0.2.27 &> /dev/null

In [None]:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="TheBloke/Mistral-7B-OpenOrca-GGUF", local_dir="models", allow_patterns=["mistral-7b-openorca.Q4_K_M.gguf"])

## Constrained generation

Below is an example of using Guidance to constrain the output from a language model to a particular regular expression.
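
To make the constraint concrete, here is a minimal sketch of which strings the forecast pattern used in this notebook accepts. It relies only on Python's standard `re` module and does not call Guidance, which enforces the pattern during generation rather than checking text afterwards. The helper name `is_valid_forecast` and the candidate strings are illustrative assumptions, not part of the notebook.

```python
import re

# Same pattern as the constrained generation cell: "0." followed by exactly
# two digits, or the literal "1.00".
output_regex = r"(0\.\d\d|1\.00)"


def is_valid_forecast(text: str) -> bool:
    """Return True if the whole string matches the forecast pattern (hypothetical helper)."""
    return re.fullmatch(output_regex, text) is not None


# Illustrative candidates, not real model output.
candidates = ["0.42", "1.00", "0.5", "1.50", "about 40%"]
results = {candidate: is_valid_forecast(candidate) for candidate in candidates}

for candidate, ok in results.items():
    print(f"{candidate!r}: {'accepted' if ok else 'rejected'}")
```

Only two-decimal values between 0.00 and 1.00 pass, which is why every constrained response can later be converted with `float()` without extra parsing.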


In [None]:
# import the modules we want to use
from guidance import models, gen
from IPython.display import clear_output

# load our model into memory
llm = models.LlamaCpp("./models/mistral-7b-openorca.Q4_K_M.gguf", n_gpu_layers=20, n_ctx=1000)

# create our prompt and forecast
prompt = 'Predict the likelihood of the following outcome on a scale from 0 to 1, with 0 meaning the event is impossible and 1 meaning the event is certain to occur: "Donald Trump will win the 2024 US election."\nPREDICTION:'

# use the model to generate with no constraint
output = llm + prompt + gen(name="response", max_tokens=100, temperature=0.7)
unconstrained_response = output['response']

# constrain the model to generate the format we want
output_regex = r"(0\.\d\d|1\.00)"
output = llm + prompt + gen(name="response", regex=output_regex, max_tokens=10, temperature=0.7)
constrained_response = output['response']

# clear the output so we can see results
clear_output(wait=True)

# show our results
print(f"Unconstrained output was:\n{unconstrained_response}\n\nConstrained output was:\n{constrained_response}")


## Proof of concept

Below we run the forecast many times for one contentious prompt and one non-contentious prompt. We then plot histograms to get a feel for the distribution of forecasts that a low-powered model produces with no additional contextual information.

For some forecasts, we will see a roughly bell-shaped distribution, which suggests there is some internally consistent "reasoning" occurring. Later, we'll revisit this idea of "sampling" a model that we deliberately make random using temperature.
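
Alongside a histogram, repeated samples can be reduced to a couple of summary numbers. The following is a minimal sketch, assuming the predictions are stored as two-decimal strings as in this notebook. The helper `summarise` and the example values are hypothetical and are not real model output.

```python
from statistics import mean, stdev


def summarise(predictions):
    """Convert string forecasts to floats and return simple summary statistics."""
    values = [float(p) for p in predictions]
    return {
        "n": len(values),
        "mean": mean(values),    # central tendency of the sampled forecasts
        "stdev": stdev(values),  # sample standard deviation: spread across samples
    }


# Made-up values for illustration only; in the notebook this would be a list
# such as predictions_trump once the sampling loop has run.
example_predictions = ["0.40", "0.50", "0.60", "0.50"]
summary = summarise(example_predictions)
print(summary)
```

A small standard deviation would indicate that the model returns similar forecasts across samples, while a large one would indicate that temperature is dominating the output.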

In [None]:
predictions_trump = []
predictions_horse = []

In [None]:
forecast_trump = "Donald Trump will win the 2024 US election."
forecast_horse = "A horse will win the 2024 US election."

for i in range(100):
    output_regex = r"(0\.\d\d|1\.00)"
    output = llm + f'Predict the likelihood of the following outcome on a scale from 0.00 to 1.00, with 0.00 meaning the event is impossible and 1.00 meaning the event is certain to occur: "{forecast_trump}"\nPREDICTION:' + gen(name="response", regex=output_regex, max_tokens=10, temperature=0.7)
    predictions_trump.append(output['response'])
    output = llm + f'Predict the likelihood of the following outcome on a scale from 0.00 to 1.00, with 0.00 meaning the event is impossible and 1.00 meaning the event is certain to occur: "{forecast_horse}"\nPREDICTION:' + gen(name="response", regex=output_regex, max_tokens=10, temperature=0.7)
    predictions_horse.append(output['response'])


In [None]:
import matplotlib.pyplot as plt
import numpy as np

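# Nudge each value up by a tiny epsilon so that e.g. 0.30 lands in the 0.3-0.4 bin
# despite inexact floating-point bin edges; clamp at 1.0 so "1.00" stays in the last bin.
data1 = [min(float(prediction) + 1e-10, 1.0) for prediction in predictions_trump]
data2 = [min(float(prediction) + 1e-10, 1.0) for prediction in predictions_horse]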

bins = np.arange(0, 1.1, 0.1)
hist1, _ = np.histogram(data1, bins=bins)
hist2, _ = np.histogram(data2, bins=bins)
max_count = max(max(hist1), max(hist2))
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.bar(np.arange(len(hist1)), hist1, align='center', alpha=0.7)
plt.xlabel('Prediction range')
plt.ylabel('Count')
plt.title('Trump 2024')
plt.xticks(np.arange(len(hist1)), ['{:.1f}-{:.1f}'.format(bins[i], bins[i+1]) for i in range(len(hist1))], rotation=45, ha='right')
plt.ylim(0, max_count)
plt.subplot(1, 2, 2)
plt.bar(np.arange(len(hist2)), hist2, align='center', alpha=0.7)
plt.xlabel('Prediction range')
plt.ylabel('Count')
plt.title('Horse 2024')
plt.xticks(np.arange(len(hist2)), ['{:.1f}-{:.1f}'.format(bins[i], bins[i+1]) for i in range(len(hist2))], rotation=45, ha='right')
plt.ylim(0, max_count)
plt.tight_layout()
plt.show()