>### 🚩 *Create a free WhyLabs account to complete this example!*<br> 
>*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylabs-free-sign-up?utm_source=github&utm_medium=referral&utm_campaign=langkit)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=github&utm_medium=referral&utm_campaign=Hallucination) to leverage the power of whylogs and WhyLabs together!*

# Analyzing LLM Response Consistency
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/LanguageToolkit/blob/main/langkit/examples/Response_Consistency.ipynb)

Large language models (LLMs) have recently shown impressive and growing capabilities, including generating highly fluent and convincing responses to user prompts. However, LLMs are also known to generate non-factual or nonsensical statements, commonly called “hallucinations.” This tendency can undermine trust in scenarios where factuality is required, such as summarization, generative question answering, and dialogue generation.

In this example we will show how to use the `response_hallucination` module to gain insight into the consistency of the responses generated by an LLM. The approach rests on the premise that if the LLM has knowledge of the topic, it should generate similar and consistent responses when asked the same question multiple times. Conversely, if the LLM lacks knowledge of the topic, multiple answers to the same prompt should differ from one another.

The `response_hallucination` module is inspired by the research paper [SELFCHECKGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models](https://arxiv.org/pdf/2303.08896.pdf?ref=content.whylogsdocs).

> In order to calculate the consistency metrics, `response_hallucination` requires extra LLM calls, and currently only supports OpenAI LLMs.

## Setup

Let's first install `langkit` and `openai`, then set up our OpenAI API key.



In [None]:
%pip install langkit[all] openai -q


In [1]:
import openai
import os

os.environ["OPENAI_API_KEY"] = "sk-xxx"
openai.api_key = os.getenv("OPENAI_API_KEY")


We also need to define the model we want to use to calculate the metrics. In this example we will use the `text-davinci-003` model.

> Note: The chosen model must match the one used for the original response in the 'response' column. This ensures that our consistency metric evaluates the original model's factuality performance rather than being influenced by multiple models.

In [2]:
from langkit import response_hallucination
from langkit.openai import OpenAIDavinci


response_hallucination.init(llm=OpenAIDavinci(model="text-davinci-003"), num_samples=1)




Now, let's evaluate a response and calculate the consistency metrics. The response below was generated using the same model `text-davinci-003`.


In [6]:
result = response_hallucination.consistency_check(
    prompt="Who was Philip Hayworth?",
    response="Philip Hayworth was an English barrister and politician who served as Member of Parliament for Thetford from 1859 to 1868.",
)

result


{'llm_score': 1.0,
 'semantic_score': 0.2514273524284363,
 'final_score': 0.6257136762142181,
 'total_tokens': 226,
 'samples': ["\nPhilip Hayworth was a British soldier and politician who served as Member of Parliament for Lyme Regis in Dorset between 1654 and 1659. He was also a prominent member of Oliver Cromwell's army and helped to bring about the restoration of the monarchy in 1660."],
 'response': 'Philip Hayworth was an English barrister and politician who served as Member of Parliament for Thetford from 1859 to 1868.'}

Let's break down the results:

The first step is to generate the additional samples that will be used later to calculate both `llm_score` and `semantic_score`. The number of samples generated is defined by the `num_samples` parameter. In this example we generated 1 sample, which is the default value.

The `llm_score` is calculated by comparing the original response with the generated samples. This is done by asking the LLM if the original response is supported by the context (additional samples). The `llm_score` is a value between 0 and 1, which is a result of averaging the scores across the samples. For each evaluated passage of the original response, the LLM is instructed to output `0` for `Accurate`, `0.5` for `Minor Inaccurate` and `1` for `Major Inaccurate`.
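As a rough illustration of the averaging described above, the sketch below maps each label to its numeric value and averages over all evaluated passages. The names `LABEL_SCORES` and `average_llm_score` are hypothetical, and the flat averaging over every passage is an assumption; the module's internal aggregation may differ.

```python
# Label-to-score mapping as described in the text above.
LABEL_SCORES = {"Accurate": 0.0, "Minor Inaccurate": 0.5, "Major Inaccurate": 1.0}


def average_llm_score(labels_per_sample):
    """Average the numeric scores of all labels across all samples.

    `labels_per_sample` holds one list of labels per generated sample,
    with one label per evaluated passage of the original response.
    This helper is hypothetical and not part of langkit.
    """
    scores = [
        LABEL_SCORES[label]
        for labels in labels_per_sample
        for label in labels
    ]
    return sum(scores) / len(scores)


# One sample and one passage judged "Major Inaccurate" gives the maximum score.
single = average_llm_score([["Major Inaccurate"]])

# Two passages with mixed labels average to a value in between.
mixed = average_llm_score([["Accurate", "Minor Inaccurate"]])

print(single, mixed)
```

Under these assumptions, an `llm_score` of `1.0` like the one in the result above would mean that every evaluated passage was judged as a major inaccuracy relative to the sample.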

The `semantic_score` is calculated by encoding the sentences of the response and of the additional samples into embeddings and computing the semantic similarity between them. The `semantic_score` is a value between 0 and 1, obtained by averaging the scores across the samples. Values closer to 0 indicate that the additional samples contain sentences semantically similar to those in the original response. Conversely, values closer to 1 indicate the opposite.
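To make the idea concrete, here is a minimal sketch of one way such a score could be computed. It assumes that each response sentence is scored as one minus its highest cosine similarity to any sample sentence, and that these values are then averaged; this is an assumption for illustration, not a description of the module's internals. The toy 3-dimensional vectors stand in for real sentence embeddings, and `semantic_inconsistency` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_inconsistency(response_vecs, sample_vecs):
    """Average, over response sentences, of 1 - best similarity to any sample sentence.

    Hypothetical helper: the scoring rule is an assumption for illustration.
    """
    distances = []
    for r in response_vecs:
        best = max(cosine_similarity(r, s) for s in sample_vecs)
        distances.append(1.0 - best)
    return sum(distances) / len(distances)


# Toy embeddings: one response sentence compared with one sample sentence.
response_vecs = [[1.0, 0.0, 0.0]]
close_samples = [[0.9, 0.1, 0.0]]    # points in nearly the same direction
distant_samples = [[0.0, 1.0, 0.0]]  # orthogonal to the response

low = semantic_inconsistency(response_vecs, close_samples)
high = semantic_inconsistency(response_vecs, distant_samples)
print(low, high)
```

With these toy vectors, the similar sample yields a score near 0 and the unrelated one a score of 1, matching the interpretation given above.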

The `final_score` is simply the average between `llm_score` and `semantic_score`.
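The arithmetic can be checked directly against the two results shown in this example. The snippet below is only a sanity check using the values printed above; it does not call `langkit`.

```python
import math

# (llm_score, semantic_score, reported final_score) copied from the outputs in this example.
results = [
    (1.0, 0.2514273524284363, 0.6257136762142181),
    (1.0, 0.4006485939025879, 0.700324296951294),
]

recomputed = []
for llm_score, semantic_score, reported in results:
    final_score = (llm_score + semantic_score) / 2  # plain average of the two scores
    recomputed.append(final_score)
    print(final_score, math.isclose(final_score, reported, rel_tol=1e-9))
```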

`total_tokens` is the total number of tokens that were used to calculate the scores. This accounts for the extra calls made to generate the additional samples and to perform the consistency check in the `llm_score`, but doesn't account for the original response generation. The number of calls to the LLM to calculate the consistency metric is equal to `3*num_samples` - in this case, 1 for generating the additional samples and 2 for calculating the `llm_score`.

`samples` will contain the generated samples used to calculate the `llm_score` and `semantic_score`.

`response` is the original response that was evaluated.

> Note: Currently, `response_hallucination` considers single-turn conversations. In the future, we plan to support historical interactions.

### Passing only the prompt

You can also pass a single prompt to `response_hallucination`. In this case, the response will also be generated when calling `consistency_check`, and the `total_tokens` will include the tokens used to generate the response.

In [7]:
result = response_hallucination.consistency_check(
    prompt="Who was Philip Hayworth?",
)

result


{'llm_score': 1.0,
 'semantic_score': 0.4006485939025879,
 'final_score': 0.700324296951294,
 'total_tokens': 239,
 'samples': ['\nPhilip Hayworth was a member of the United States House of Representatives from Illinois, serving from 1933 to 1943. He was a member of the Democratic Party.'],
 'response': '\nPhilip Hayworth was an English politician who served as the Member of Parliament for Ashton-under-Lyne from 1983 until his death in 1992.'}

## whylogs metrics

As with all the other metric modules, we can seamlessly generate a whylogs statistical profile with the consistency metrics. The result will contain aggregate metrics summarizing all the logged records. Let's see how to do that: 

In [8]:
"""
We already imported and initialized `response_hallucination` in a previous cell.
If not, we should include:

from langkit import response_hallucination
from langkit.openai import OpenAIDavinci


response_hallucination.init(llm=OpenAIDavinci(model="text-davinci-003"), num_samples=1)
"""
import whylogs as why
from whylogs.experimental.core.udf_schema import udf_schema

schema = udf_schema()
profile = why.log(
    {
        "prompt": "Where did fortune cookies originate?",
        "response": "Fortune cookies originated in Egypt. However, some say it's from Russia.",
    },
    schema=schema,
).profile()


In [10]:
# distribution metrics will reflect the consistency score ("final_score" in the result)
profile.view().get_column("response.hallucination").get_metric("distribution").to_summary_dict()


{'mean': 0.7380631566047668,
 'stddev': 0.0,
 'n': 1,
 'max': 0.7380631566047668,
 'min': 0.7380631566047668,
 'q_01': 0.7380631566047668,
 'q_05': 0.7380631566047668,
 'q_10': 0.7380631566047668,
 'q_25': 0.7380631566047668,
 'median': 0.7380631566047668,
 'q_75': 0.7380631566047668,
 'q_90': 0.7380631566047668,
 'q_95': 0.7380631566047668,
 'q_99': 0.7380631566047668}

You can also log a pandas DataFrame with multiple prompt/response pairs:

In [11]:
import whylogs as why
import pandas as pd
from whylogs.experimental.core.udf_schema import udf_schema

df = pd.DataFrame(data={"prompt":["Where did fortune cookies originate?","Who is Bill Gates?"],
                        "response":["Fortune cookies originated in Egypt. However, some say it's from Russia.",
                                    "Bill Gates is a technology entrepreneur, investor, and philanthropist"]})

schema = udf_schema()
profile = why.log(
    df,
    schema=schema,
).profile()

print(
    profile.view().get_column("response.hallucination").get_metric("distribution").to_summary_dict()
)


{'mean': 0.3911959808319807, 'stddev': 0.5032094291759425, 'n': 2, 'max': 0.7470187805593014, 'min': 0.035373181104660034, 'q_01': 0.035373181104660034, 'q_05': 0.035373181104660034, 'q_10': 0.035373181104660034, 'q_25': 0.035373181104660034, 'median': 0.7470187805593014, 'q_75': 0.7470187805593014, 'q_90': 0.7470187805593014, 'q_95': 0.7470187805593014, 'q_99': 0.7470187805593014}
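As a quick sanity check, the `mean` and `stddev` in the summary above can be reproduced from the two per-row scores, which appear as `min` and `max` because `n` is 2. This sketch uses only Python's standard library and assumes that `stddev` is the sample standard deviation.

```python
import math
import statistics

# The two per-row consistency scores, read from `min` and `max` in the summary above.
scores = [0.035373181104660034, 0.7470187805593014]

mean = statistics.mean(scores)
stddev = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)

print(math.isclose(mean, 0.3911959808319807, rel_tol=1e-9))
print(math.isclose(stddev, 0.5032094291759425, rel_tol=1e-6))
```

The quantiles in the profile are estimates rather than exact order statistics, which is likely why the reported `median` equals the larger of the two values instead of their midpoint.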
