# Evaluating Large Language Models as future event forecasters - Part Two

## Setup - Install dependencies and download model

Note that we pass a compilation argument when installing llama-cpp-python so that llama.cpp is compiled with GPU support. This step is essential for tolerable generation speeds, so [read up](https://github.com/ggerganov/llama.cpp#Build) on building with the right acceleration for your hardware if you reuse this code outside of Colab.

In [1]:
# This will take a while
!pip install guidance &> /dev/null
!pip install huggingface-hub &> /dev/null
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python==0.2.27 &> /dev/null

In [2]:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="TheBloke/Mistral-7B-OpenOrca-GGUF", local_dir="models", allow_patterns=["mistral-7b-openorca.Q4_K_M.gguf"])

'/content/models'

## Tokenising text

Below is an example of accessing a llama.cpp model's tokeniser when using Guidance.


In [3]:
# import the modules we want to use
from guidance import models, gen
from IPython.display import clear_output

# load our model into memory
llm = models.LlamaCpp("./models/mistral-7b-openorca.Q4_K_M.gguf", n_gpu_layers=20, n_ctx=1000)

# create a text string to tokenise
string = "Mindsets tend to be quick to form but resistant to change."

# generate the tokens by accessing the model tokeniser
tokens_encoded = llm.engine.model_obj.tokenize(str.encode(string))

# decode the tokens
tokens = []
for token in tokens_encoded:
    if token:
        tokens.append(llm.engine.model_obj.detokenize([token]).decode("utf-8", errors="replace"))

# clear the output so we can see results
clear_output(wait=True)

# show results
print(tokens_encoded)
print(tokens)

[1, 14683, 6591, 6273, 298, 347, 2936, 298, 1221, 562, 605, 11143, 298, 2268, 28723]
['', ' Mind', 'sets', ' tend', ' to', ' be', ' quick', ' to', ' form', ' but', ' res', 'istant', ' to', ' change', '.']
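As a quick check on the output above, the following minimal sketch joins the decoded pieces back together without loading the model. It only reuses the token strings printed above; the interpretation in the comments (that the empty first entry is the start-of-sequence token with id 1, and that a leading space marks the start of a new word) is an assumption based on that output rather than on the tokeniser's documentation.

```python
# Token strings copied from the printed output above
tokens = ['', ' Mind', 'sets', ' tend', ' to', ' be', ' quick', ' to',
          ' form', ' but', ' res', 'istant', ' to', ' change', '.']
original = "Mindsets tend to be quick to form but resistant to change."

# Joining the pieces rebuilds the text, plus a leading space on the first word
rebuilt = "".join(tokens)
print(repr(rebuilt))
print(rebuilt.lstrip() == original)

# Pieces without a leading space continue the previous word
continuations = [t for t in tokens if t and not t.startswith(" ")]
print(continuations)
```

This matters later in the notebook: because the tokeniser can add prefix tokens, tokenising a short candidate string may return more than one token id.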


## Aside: How Guidance handles token constraints

This example demonstrates how Guidance's handling of token-level constraints can result in unexpected behaviour. Example taken from: https://github.com/guidance-ai/guidance/issues/564

In [1]:
from guidance import models, select

# load the model
llm = models.LlamaCpp("./models/mistral-7b-openorca.Q4_K_M.gguf", n_gpu_layers=20, n_ctx=4096)

# deliberately trigger an incoherent generation
llm + 'A word very similar to "sky" is "' + select(["cloud","skill"])

In [2]:
# adjust the prompt to introduce a coherent generation
llm + "Which word (cloud or skill) is more similar to sky? The more similar is " + select(["cloud","skill"])

## Accessing logprobs

We can gain a large performance increase for our use case by accessing the probability that each candidate will be generated. We can obtain these values by examining the logits held in the model's context following a generation and converting them into probabilities (or logprobs). The implementation in this section is a simple single-token version of a concept you can extend to chain through a multi-token candidate.
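To make the logit-to-probability step concrete, here is a minimal pure-Python sketch of a softmax taken over a fixed set of candidates. The helper name `softmax_over_candidates` and the logit values are illustrative assumptions, not model output. Because the normalisation covers only the candidates, the resulting numbers are probabilities conditional on the model choosing one of them, not probabilities over the whole vocabulary.

```python
import math

def softmax_over_candidates(candidate_logits):
    """Renormalise raw logits over a fixed candidate set (numerically stable)."""
    peak = max(candidate_logits.values())
    exps = {c: math.exp(v - peak) for c, v in candidate_logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Illustrative logits only: these numbers are made up, not model output
example_logits = {"0": 2.0, "1": 1.0, "2": 0.0}

probs = softmax_over_candidates(example_logits)
logprobs = {c: math.log(p) for c, p in probs.items()}

print(probs)
print(sum(probs.values()))
```

The notebook cell that follows performs the same normalisation with `torch.softmax` on the logits of the ten digit tokens.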

In [3]:
from guidance import models, gen
import llama_cpp
import torch
import math
import json

# load the model
llm = models.LlamaCpp("./models/mistral-7b-openorca.Q4_K_M.gguf", compute_log_probs=True, n_gpu_layers=20, n_ctx=4096)

# define a regular expression to match a single number
output_regex = r"\d"

# define our prompt - note that it ends with a forced "0." so that we only examine the tenths place
prompt = 'Predict the likelihood of the following outcome on a scale from 0.00 to 1.00, with 0.00 meaning the event is impossible and 1.00 meaning the event is certain to occur: "Donald Trump will win the 2024 US election."\nPREDICTION:0.'

# run constrained inference - noting that we have set temperature to zero
output = llm + prompt + gen(name="response", regex=output_regex, max_tokens=1, temperature=0.0)

# define the options we want to check the probs for
options = [f"{n}" for n in range(0,10)]

# retrieve the logits from the model object
logits = llama_cpp.llama_get_logits(llm.engine.model_obj.ctx)

# tokenize our options
option_tokens = [llm.engine.model_obj.tokenize(str.encode(o)) for o in options]

# keep just the digit token (the last one), discarding the <s> and any prefix tokens added by the tokenizer
option_tokens = [o[-1] for o in option_tokens]

# retrieve the logits for the option
option_logits = [logits[o] for o in option_tokens]

# convert the logits into probabilities (normalised over the ten digit options only)
option_probs = torch.softmax(torch.tensor(option_logits), dim=0)

# cast the tensor values to plain Python floats
option_probs = [float(o) for o in option_probs]

# we could alternatively deal with logprobs directly here
# option_probs = [math.log(float(o)) for o in option_probs]

# zip the options and probabilities together
option_probs = dict(zip(options, option_probs))

# get the top token
top_token = max(option_probs, key=option_probs.get)

# print results
print(f"The highest probability option in the tenths place is: {top_token.strip()}")
print("The probability distribution for the tenths place is: ")
print(json.dumps(option_probs, indent=4))
# sanity check: the probabilities should sum to approximately 1
print(sum(option_probs.values()))


The highest probability option in the tenths place is: 2
The probability distribution for the tenths place is: 
{
    "0": 0.15142710506916046,
    "1": 0.15139013528823853,
    "2": 0.18116815388202667,
    "3": 0.16766728460788727,
    "4": 0.12874439358711243,
    "5": 0.12978236377239227,
    "6": 0.0467846542596817,
    "7": 0.02292967587709427,
    "8": 0.010299457237124443,
    "9": 0.009806782938539982
}
1.000000006519258
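
The section introduction mentions chaining through a multi-token candidate. A minimal sketch of that idea follows: score each candidate by summing the logprob of each of its tokens, extending the prefix after every step. Everything model-related here is a hypothetical stand-in. `next_digit_logprobs` and `TOY_TABLE` are invented for illustration with made-up numbers; in practice each step would mean running the model on the extended prefix and reading the logits again, as in the cell above.

```python
import math

# Hypothetical stand-in for a model call: maps the text generated so far
# to logprobs for the next digit. All numbers are made up for illustration.
TOY_TABLE = {
    "0.": {"2": math.log(0.6), "3": math.log(0.4)},
    "0.2": {"0": math.log(0.5), "5": math.log(0.5)},
    "0.3": {"0": math.log(0.9), "5": math.log(0.1)},
}

def next_digit_logprobs(prefix):
    return TOY_TABLE[prefix]

def candidate_logprob(prefix, candidate_tokens):
    """Sum per-token logprobs, extending the prefix after each step."""
    total = 0.0
    for token in candidate_tokens:
        total += next_digit_logprobs(prefix)[token]
        prefix += token
    return total

# Two-digit candidates following the forced "0." prefix
candidates = ["20", "25", "30", "35"]
scores = {c: math.exp(candidate_logprob("0.", list(c))) for c in candidates}
print(scores)
```

Summing logprobs is equivalent to multiplying the conditional probabilities of each token, so in this toy table the candidate "30" scores 0.4 × 0.9 = 0.36.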
