# Using logprobs for classification and retrieval evaluation

This notebook illustrates how to use the `logprobs` parameter in the Chat Completions API.
With `logprobs` enabled, Chat Completions returns the log probability of each output token, along with a limited number of the most likely tokens at each token position and their log probabilities. This can help you assess the model's confidence in its output, or examine alternative responses the model might have given.<br><br> While there is a wide array of use cases for logprobs, this notebook will focus on using `logprobs` for:<br>
1. Classification tasks
2. Retrieval (Q&A) evaluation

LLMs are quite strong at many classification tasks, but accurately measuring the model's confidence in its outputs can be difficult. Using `logprobs` can give an associated probability to each class prediction, which allows users to set their own classification thresholds.

Further, `logprobs` can help with self-evaluation in retrieval applications. In the Q&A example, the model outputs a contrived 'sufficient_context_for_answer' boolean, which can be used as a confidence score of whether the answer is contained in the retrieved content. Evaluations of this type can help significantly with reducing RAG hallucinations and improving accuracy.

## 0. Imports and utils

In [1]:
from openai import OpenAI
from math import exp
import numpy as np
client = OpenAI()


In [2]:
def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4",
    max_tokens=500,
    temperature=1.0,
    stop=None,
    functions=None,
    logprobs=None,
    top_logprobs=None
):
    """Call Chat Completions and return the full response object (including any logprobs)."""
    params = {
        'model': model,
        'messages': messages,
        'max_tokens': max_tokens,
        'temperature': temperature,
        'stop': stop,
        'logprobs': logprobs,
        'top_logprobs':top_logprobs
    }
    if functions:
        params['functions'] = functions

    completion = client.chat.completions.create(**params)
    return completion



## 1. Classification

Let's say we want to create a system to classify news articles into a set of categories. Without `logprobs`, we can use Chat Completions to do this, but it is much more difficult to assess how confident the model is in its classifications. <br><br>
Now, with `logprobs` enabled, we can see just how confident the model is in its predictions, which is crucial for creating an accurate and trustworthy classifier.

We can begin with a prompt that gives the model four categories: **Technology, Politics, Sports, and Art**, and asks the model to classify articles into those categories based on headlines alone.

In [3]:
CLASSIFICATION_PROMPT = """You will be given a headline of a news article. Classify the article into one of the following categories: Technology, Politics, Sports, and Art.
Return only the name of the category, and nothing else. MAKE SURE your output is one of the four categories stated. Article headline: {headline}"""


Let's look at three sample headlines and begin with a standard Chat Completions output, without `logprobs`.

In [4]:
headlines = ["Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.",
             "Local Mayor Launches Initiative to Enhance Urban Public Transport.",
             "Tennis Champions Showcase Hidden Talents in Symphony Orchestra Debut"]


In [5]:
for headline in headlines:
  print(headline)
  API_RESPONSE = get_completion([{'role':'user','content':CLASSIFICATION_PROMPT.format(headline=headline)}],model='gpt-4')
  print(API_RESPONSE.choices[0].message.content,'\n')


Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Technology 

Local Mayor Launches Initiative to Enhance Urban Public Transport.
Politics 

Tennis Champions Showcase Hidden Talents in Symphony Orchestra Debut
Sports 



Here we can see the selected category for each headline. However, we have no visibility into the confidence of the model in its predictions. Let's rerun the same prompt but with `logprobs` enabled, and `top_logprobs` set to 2 (this will show us the 2 most likely output tokens for each token). We can also output the linear probability of each output token, converting the log probability to the more easily interpretable scale of 0-100%.


In [6]:
for headline in headlines:
      print(headline)
      API_RESPONSE = get_completion([{'role':'user','content':CLASSIFICATION_PROMPT.format(headline=headline)}],model='gpt-4',logprobs=True, top_logprobs=2)
      first, second = API_RESPONSE.choices[0].logprobs.content[0].top_logprobs
      print(f"\033[96mOutput token:\033[0m {first.token}, \033[93mlogprobs:\033[0m {first.logprob}, \033[95mlinear probability:\033[0m {np.round(np.exp(first.logprob)*100,2)}%")
      print(f"\033[96mNext most likely token:\033[0m {second.token}, \033[93mlogprobs:\033[0m {second.logprob}, \033[95mlinear probability:\033[0m {np.round(np.exp(second.logprob)*100,2)}%")

      print('\n')


Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Output token: Technology, logprobs: -3.1737043e-06, linear probability: 100.0%
Next most likely token: Techn, logprobs: -13.390628, linear probability: 0.0%


Local Mayor Launches Initiative to Enhance Urban Public Transport.
Output token: Politics, logprobs: -2.9352968e-06, linear probability: 100.0%
Next most likely token: Technology, logprobs: -13.859378, linear probability: 0.0%


Tennis Champions Showcase Hidden Talents in Symphony Orchestra Debut
Output token: Sports, logprobs: -0.30504292, linear probability: 73.71%
Next most likely token: Art, logprobs: -1.336293, linear probability: 26.28%




As expected for the first two headlines, `gpt-4` is nearly 100% confident in its classifications, as the content is clearly technology and politics focused, respectively. However, the third headline combines both sports and art-related themes, so we see the model is significantly less confident in its selection: it picks Sports with ~74% probability, leaving a ~26% probability of selecting Art instead. <br><br>
This shows how important `logprobs` can be: if we are using LLMs for classification tasks, we can set confidence thresholds, or output several candidate tokens if the log probability of the selected output is not sufficiently high. For instance, if we are creating a recommendation engine to tag articles, we can automatically classify headlines crossing a certain threshold, and send the less certain headlines for manual review.
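As a minimal sketch of that routing idea, the snippet below sends each prediction either to automatic tagging or to manual review. The function name `route_prediction` and the 0.90 cutoff are illustrative assumptions rather than anything the API provides; the log probabilities are the ones printed in the output above.

```python
from math import exp


def route_prediction(logprob: float, threshold: float = 0.90) -> str:
    """Return 'auto' when the linear probability clears the threshold, else 'manual_review'.

    `threshold` is a hypothetical cutoff; tune it on your own labelled data.
    """
    probability = exp(logprob)  # convert log probability to a 0-1 probability
    return "auto" if probability >= threshold else "manual_review"


# (predicted label, log probability of the first output token) from the run above
predictions = [
    ("Technology", -3.1737043e-06),
    ("Politics", -2.9352968e-06),
    ("Sports", -0.30504292),
]

decisions = {label: route_prediction(logprob) for label, logprob in predictions}
print(decisions)
```

With this cutoff, only the ambiguous tennis/orchestra headline (about 74% for Sports) would be held back for a human to review.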

## 2. Retrieval confidence scoring

To reduce hallucinations and improve the performance of our Q&A RAG system, we can use `logprobs` to evaluate how confident the model is in its retrieval.

Let's say we have built a retrieval system using RAG for Q&A, but are struggling with hallucinated answers to our questions. *Note:* we will use a hardcoded article for this example, but see other entries in the cookbook for tutorials on using RAG for Q&A.

In [7]:
#Article retrieved
ada_lovelace_article = """Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation.
Ada Byron was the only legitimate child of poet Lord Byron and reformer Lady Byron. All Lovelace's half-siblings, Lord Byron's other children, were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever. He died in Greece when Ada was eight. Her mother was anxious about her upbringing and promoted Ada's interest in mathematics and logic in an effort to prevent her from developing her father's perceived insanity. Despite this, Ada remained interested in him, naming her two sons Byron and Gordon. Upon her death, she was buried next to him at her request. Although often ill in her childhood, Ada pursued her studies assiduously. She married William King in 1835. King was made Earl of Lovelace in 1838, Ada thereby becoming Countess of Lovelace.
Her educational and social exploits brought her into contact with scientists such as Andrew Crosse, Charles Babbage, Sir David Brewster, Charles Wheatstone, Michael Faraday, and the author Charles Dickens, contacts which she used to further her education. Ada described her approach as "poetical science" and herself as an "Analyst (& Metaphysician)".
When she was eighteen, her mathematical talents led her to a long working relationship and friendship with fellow British mathematician Charles Babbage, who is known as "the father of computers". She was in particular interested in Babbage's work on the Analytical Engine. Lovelace first met him in June 1833, through their mutual friend, and her private tutor, Mary Somerville.
Between 1842 and 1843, Ada translated an article by the military engineer Luigi Menabrea (later Prime Minister of Italy) about the Analytical Engine, supplementing it with an elaborate set of seven notes, simply called "Notes".
Lovelace's notes are important in the early history of computers, especially since the seventh one contained what many consider to be the first computer program, that is, an algorithm designed to be carried out by a machine. Other historians reject this perspective and point out that Babbage's personal notes from the years 1836/1837 contain the first programs for the engine. She also developed a vision of the capability of computers to go beyond mere calculating or number-crunching, while many others, including Babbage himself, focused only on those capabilities. Her mindset of "poetical science" led her to ask questions about the Analytical Engine (as shown in her notes) examining how individuals and society relate to technology as a collaborative tool.
"""

#Questions that can be easily answered given the article
easy_questions = ["What nationality was Ada Lovelace?", "What was an important finding from Lovelace's seventh note?"]

#Questions that are not fully covered in the article
medium_questions =["Did Lovelace collaborate with Charles Dickens","What concepts did Lovelace build with Charles Babbage"]


Now, we can ask the model to respond to the question, but also to evaluate its response. Specifically, we will ask the model to output a boolean 'sufficient_context_for_answer'. We can then evaluate the `logprobs` to see just how confident the model is that its answer is contained in the provided context.

In [8]:
PROMPT = """You retrieved this article: {article}. The question is: {question}. Before even answering the question, consider whether you have sufficient information in the article to answer the question fully.
Your output should JUST be the boolean true or false, of if you have sufficient information in the article to answer the question.
Respond with just one word, the boolean true or false. You must output the word 'True', or the word 'False', nothing else.
"""


In [9]:
API_RESPONSE.choices[0].logprobs.content[0].token


'Sports'

In [10]:
import numpy as np

print('\033[1mQuestions clearly answered in article\033[0m\n')  # Bold text

for question in easy_questions:
    API_RESPONSE = get_completion([{'role':'user','content':PROMPT.format(article=ada_lovelace_article,
    question=question)}], model='gpt-4', logprobs=True)
    print('\033[92mQuestion:\033[0m', question)  # Green text
    for logprob in API_RESPONSE.choices[0].logprobs.content:
        print(f"\033[96msufficient_context_for_answer:\033[0m {logprob.token}, \033[93mlogprobs:\033[0m {logprob.logprob}, \033[95mlinear probability:\033[0m {np.round(np.exp(logprob.logprob)*100,2)}%", '\n')

print('\n\n\033[1mQuestions only partially covered in the article\033[0m\n')  # Bold text

for question in medium_questions:
    API_RESPONSE = get_completion([{'role':'user','content':PROMPT.format(article=ada_lovelace_article,
    question=question)}], model='gpt-4', logprobs=True,top_logprobs=3)
    print('\033[92mQuestion:\033[0m', question)  # Green text
    print(API_RESPONSE)
    for logprob in API_RESPONSE.choices[0].logprobs.content:
        print(f"\033[96msufficient_context_for_answer:\033[0m {logprob.token}, \033[93mlogprobs:\033[0m {logprob.logprob}, \033[95mlinear probability:\033[0m {np.round(np.exp(logprob.logprob)*100,2)}%", '\n')


Questions clearly answered in article

Question: What nationality was Ada Lovelace?
sufficient_context_for_answer: True, logprobs: -1.9361265e-07, linear probability: 100.0% 

Question: What was an important finding from Lovelace's seventh note?
sufficient_context_for_answer: True, logprobs: -1.0280384e-06, linear probability: 100.0% 



Questions only partially covered in the article

Question: Did Lovelace collaborate with Charles Dickens
ChatCompletion(id='chatcmpl-8XcDe8MxlcZxWYdKnbeseh0S7xaqr', choices=[Choice(finish_reason='stop', index=0, logprobs=ChoiceLogprobs(content=[ChatCompletionTokenLogprob(token='False', bytes=[70, 97, 108, 115, 101], logprob=-0.3132625, top_logprobs=[TopLogprob(token='False', bytes=[70, 97, 108, 115, 101], logprob=-0.3132625), TopLogprob(token='True', bytes=[84, 114, 117, 101], logprob=-1.3132625), TopLogprob(token='false', bytes=[102, 97, 108, 115, 101], lo

For the first two questions, our evaluator knows with (near) 100% confidence that the article has sufficient context to answer the posed questions.<br><br>
On the other hand, for the trickier questions, which are less clearly answered in the article, the model is significantly less confident that it has sufficient context. This is a great guardrail to help ensure our retrieved content is sufficient.<br><br>
This self-evaluation can help reduce hallucinations, as you can restrict answers or re-prompt the user when your `sufficient_context_for_answer` log probability is below a certain threshold. Methods like this have been shown to significantly reduce hallucinations and errors in RAG for Q&A ([Example](https://jfan001.medium.com/how-we-cut-the-rate-of-gpt-hallucinations-from-20-to-less-than-2-f3bfcc10e4ec)).
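One way to turn this into a guardrail is sketched below. It assumes you have already pulled `(token, logprob)` pairs out of the first position's `top_logprobs`; the helper names and the 0.8 threshold are hypothetical. Summing over case and whitespace variants of "True" is a design choice made here, since the model may split probability between tokens such as `True` and `true`.

```python
from math import exp


def probability_of_true(top_logprobs: list[tuple[str, float]]) -> float:
    """Sum the linear probability of every candidate token that normalizes to 'true'."""
    return sum(
        exp(logprob)
        for token, logprob in top_logprobs
        if token.strip().lower() == "true"
    )


def should_answer(top_logprobs: list[tuple[str, float]], threshold: float = 0.8) -> bool:
    """Only answer when the model is confident the context is sufficient (threshold is an assumption)."""
    return probability_of_true(top_logprobs) >= threshold


# Values taken from the outputs above (only the entries that are fully visible)
nationality_question = [("True", -1.9361265e-07)]
dickens_question = [("False", -0.3132625), ("True", -1.3132625)]

print(should_answer(nationality_question))
print(should_answer(dickens_question))
print(round(probability_of_true(dickens_question), 3))
```

When `should_answer` returns `False`, the application could decline to answer, retrieve more documents, or ask the user to rephrase.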

## 3. Autocomplete

Another use case for `logprobs` is autocomplete systems. Without creating the entire autocomplete engine end-to-end, let's demonstrate how `logprobs` could help us decide when to suggest a sentence completion as a user is typing.

First, let's come up with a sample sentence: "My least favorite TV show is Breaking Bad." Let's say we are building an autocomplete engine, and we want it to dynamically recommend the next word or token as we are typing the sentence, but *only* if the model is quite sure of what the next word will be. To demonstrate this, let's break up the sentence into sequential components.

In [11]:
sentence_list = ["My","My least", "My least favorite","My least favorite TV","My least favorite TV show",
"My least favorite TV show is","My least favorite TV show is Breaking Bad"]


Now, we can ask `gpt-3.5-turbo` to act as an autocomplete engine with whatever context the model is given. We can enable `logprobs` and can see how confident the model is in its prediction.

In [12]:
high_prob_completions = {}
low_prob_completions = {}

for sentence in sentence_list:
  PROMPT = """Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}"""
  API_RESPONSE = get_completion([{'role':'user','content':PROMPT.format(sentence=sentence)}],model='gpt-3.5-turbo',logprobs=True,top_logprobs=3)
  print('Sentence:',sentence)
  first_token = True  # only the top-ranked token decides which bucket the sentence goes in
  for token in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:
    print(f"\033[96mPredicted next token:\033[0m {token.token}, \033[93mlogprobs:\033[0m {token.logprob}, \033[95mlinear probability:\033[0m {np.round(np.exp(token.logprob)*100,2)}%")
    if first_token:
      if np.exp(token.logprob)>0.95:
        high_prob_completions[sentence]=token.token
      if np.exp(token.logprob)<0.60:
        low_prob_completions[sentence]=token.token
    first_token=False
  print('\n')


Sentence: My
Predicted next token: favorite, logprobs: -0.32095337, linear probability: 72.55%
Predicted next token: dog, logprobs: -2.1003017, linear probability: 12.24%
Predicted next token: ap, logprobs: -2.814785, linear probability: 5.99%


Sentence: My least
Predicted next token: favorite, logprobs: -0.013642592, linear probability: 98.65%
Predicted next token: My, logprobs: -4.3126197, linear probability: 1.34%
Predicted next token:  favorite, logprobs: -9.684484, linear probability: 0.01%


Sentence: My least favorite
Predicted next token: food, logprobs: -0.9481721, linear probability: 38.74%
Predicted next token: My, logprobs: -1.3447137, linear probability: 26.06%
Predicted next token: color, logprobs: -1.3887696, linear probability: 24

Let's look at the high confidence autocompletions:

In [13]:
high_prob_completions


{'My least': 'favorite', 'My least favorite TV': 'show'}

These look reasonable! We can feel confident in those suggestions. It's pretty likely you want to write 'show' after writing 'My least favorite TV'! Now let's look at the autocompletion suggestions the model was less confident about:

In [14]:
low_prob_completions


{'My least favorite': 'food', 'My least favorite TV show is': '"My'}

These are logical as well. It's pretty unclear what the user is going to say with just the prefix 'My least favorite', and it's really anyone's guess what the author's least favorite TV show is. <br><br>
So, using `gpt-3.5-turbo`, we can create the root of a dynamic autocompletion engine with `logprobs`!

## 4. Highlighter and bytes parameter

Let's quickly touch on creating a simple token highlighter with `logprobs`, and using the bytes parameter. First, we can create a function that counts and highlights each token. While this doesn't use the log probabilities themselves, it uses the built-in tokenization that comes with enabling `logprobs`.

In [15]:
PROMPT = """What's the longest word in the English language?"""
API_RESPONSE = get_completion([{'role':'user','content':PROMPT}],model='gpt-4',logprobs=True,top_logprobs=5)

#Function to highlight each token
def highlight_text(api_response):
    colors = ['\033[95m', '\033[92m', '\033[93m', '\033[91m', '\033[94m']  # ANSI codes for purple, green, orange, red, blue
    reset_color = '\033[0m'
    tokens = api_response.choices[0].logprobs.content

    color_idx = 0
    for t in tokens:
        token_str = bytes(t.bytes).decode('utf-8')
        print(f"{colors[color_idx]}{token_str}{reset_color}", end="")

        # Move to the next color in the sequence, wrapping around if necessary
        color_idx = (color_idx + 1) % len(colors)
    print()  # for readability
    print(f"Total number of tokens: {len(tokens)}")



In [16]:
highlight_text(API_RESPONSE)


The longest word published in an English dictionary is 'pneumonoultramicroscopicsilicovolcanoconiosis', a lung disease caused by the inhalation of very fine silicate or quartz dust. It has 45 letters.
Total number of tokens: 51


Cool, token highlighters like this can be used in various UIs. Next, let's reconstruct a sentence using the bytes parameter. With `logprobs` enabled, we are given both each token and the UTF-8 byte values (as decimal integers) of the token string. These byte values can be helpful when handling tokens that are, or contain, emojis or special characters, since a single character may be split across several tokens.

In [17]:
PROMPT = """Output the blue heart emoji and its name."""
API_RESPONSE = get_completion([{'role':'user','content':PROMPT}],model='gpt-4',logprobs=True)

aggregated_bytes = []
joint_logprob = 0.0
for token in API_RESPONSE.choices[0].logprobs.content:
    print('Token:',token.token)
    print('Log prob:',token.logprob)
    print('Linear prob:',np.round(exp(token.logprob)*100,2),'%')
    print('Bytes:',token.bytes,'\n')
    aggregated_bytes += token.bytes
    joint_logprob += token.logprob


message_content = API_RESPONSE.choices[0].message.content
aggregated_text = bytes(aggregated_bytes).decode('utf-8')

assert message_content == aggregated_text
print('Bytes array:',aggregated_bytes)
print(f"Decoded bytes: {aggregated_text}")
print('Joint prob:',np.round(exp(joint_logprob)*100,2),'%')



Token: \xf0\x9f\x92
Log prob: -0.00012130453
Linear prob: 99.99 %
Bytes: [240, 159, 146] 

Token: \x99
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [153] 

Token:  -
Log prob: -0.002962131
Linear prob: 99.7 %
Bytes: [32, 45] 

Token:  Blue
Log prob: -0.00017660404
Linear prob: 99.98 %
Bytes: [32, 66, 108, 117, 101] 

Token:  Heart
Log prob: -4.441817e-05
Linear prob: 100.0 %
Bytes: [32, 72, 101, 97, 114, 116] 

Bytes array: [240, 159, 146, 153, 32, 45, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]
Decoded bytes: 💙 - Blue Heart
Joint prob: 99.67 %


Here, we see that while the first token was `\xf0\x9f\x92`, we can take its byte values and append them to a bytes array. Then, we can easily decode this array into a full sentence, and validate with our assert statement that the decoded bytes are the same as our completion message! <br><br>
Additionally, we can get the joint probability of the entire completion, which is the exponential of the sum of each token's log probability (equivalently, the product of the tokens' linear probabilities). This tells us how likely this completion is given the prompt. Since our prompt is quite directive (asking for a certain emoji and its name), the joint probability of this output is high! If we ask for a random output, however, we'll see a much lower joint probability.
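The short check below, using the five log probabilities printed above, illustrates that summing log probabilities and then exponentiating matches multiplying the per-token linear probabilities. The variable names are illustrative only.

```python
from math import exp, isclose, prod

# Per-token log probabilities from the blue heart output above
token_logprobs = [-0.00012130453, 0.0, -0.002962131, -0.00017660404, -4.441817e-05]

# Joint probability via the sum of log probabilities
joint_from_sum = exp(sum(token_logprobs))

# The same quantity via the product of linear probabilities
joint_from_product = prod(exp(lp) for lp in token_logprobs)

print(round(joint_from_sum * 100, 2), "%")
print(isclose(joint_from_sum, joint_from_product))
```

Working in log space is generally preferable for long completions, because multiplying many small probabilities directly can underflow to zero.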

## 5. Conclusions

Nice! We were able to use the `logprobs` parameter to build a more robust classifier, evaluate retrieval for our Q&A system, sketch the root of an autocomplete engine, and encode and decode each 'byte' of our tokens! `logprobs` adds useful information and signal to our completions output, and we are excited to see how you incorporate it to improve your applications!

## 6. Extensions

There are many other use cases for `logprobs` that are not covered in this notebook. We can use `logprobs` for:
  - Evaluations (e.g., calculating the `perplexity` of outputs, a metric of how uncertain or surprised the model is by its own output)
  - Moderation
  - Keyword selection
  - and more!
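As one example of the evaluation use case, perplexity can be computed from per-token log probabilities as the exponential of the negative mean log probability. The sketch below assumes you have already collected the `logprob` values for each output token into a list; the function name is hypothetical.

```python
from math import exp, log


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean(logprobs)); lower values mean the model was less 'surprised'."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return exp(-sum(token_logprobs) / len(token_logprobs))


# A completion the model was certain about has perplexity 1
print(perplexity([0.0, 0.0, 0.0]))

# If every token had probability 0.25, perplexity is (approximately) 4,
# as if the model were choosing uniformly among four options at each step
print(perplexity([log(0.25)] * 3))
```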

    