# HHEM with RAG

This notebook explains what HHEM does and how it may be integrated into generative AI applications, and specifically into RAG (retrieval augmented generation).

## HHEM overview

So first, what is HHEM?

Despite the power of LLMs, they still hallucinate. Where creativity is required, hallucinations can be acceptable or even necessary, but most enterprise use cases need a response that can be trusted.

HHEM (Hughes Hallucination Evaluation Model) is a model built specifically to help LLM practitioners measure hallucinations. It is available on the [Hugging Face Hub](https://huggingface.co/vectara/hallucination_evaluation_model), and a public [leaderboard](https://huggingface.co/spaces/vectara/leaderboard) shows how likely various LLMs (both commercial and open source) are to hallucinate.

At a basic level, HHEM is a neural network classifier that, given two text strings (sentence A and sentence B), returns a score between 0 and 1 reflecting the extent to which sentence B is factually consistent with sentence A.

Let's demonstrate this via an example:

In [1]:
from sentence_transformers import CrossEncoder
import numpy as np
np.set_printoptions(suppress=True, precision=4)

model = CrossEncoder('vectara/hallucination_evaluation_model')
scores = model.predict([
    ["A man walks his dog in the backyard.", "A guy walks his labradoodle in his backyard."],
    ["A person on a horse jumps over a broken down airplane.", "A person is at a diner, ordering an omelette."],
    ["A phone is ringing on the table", "A telephone is buzzing on the table"],
])

example_pairs = [
    # Good summary
    {"article": "The woman is playing mario cart while resting on the couch",
     "summary": "The woman is playing a game resting"},

    # Bad summary
    {"article": "A person on a horse jumps over a broken down airplane.", 
     "summary": "A person is at a diner, ordering an omelette."},
    
    # Discourse link error
    {"article": "Goldfish are being caught weighing up to 2kg and koi carp up to 8kg and one metre in length",
     "summary": "Koi carp can be as heavy as 2kg and as long as one meter"}, 

    # Coreference error
    {"article": "Mr Katter said the Government believes Mr Gordon would quit after he was recently accused of domestic violence.",
     "summary": "Mr Katter said he would quit after he was accused of domestic violence."},

    {"article":"The first vaccine for Ebola was approved by the FDA in 2019 in the US,     five years after the initial outbreak in 2014. To produce the vaccine,  scientists had to sequence the DNA of Ebola, then identify possible vaccines, and finally show successful clinical trials.  Scientists say a vaccine for COVID-19 is unlikely to be ready this year, although clinical trials have already started. ",
     "summary":"You won't get the COVID-19 vaccine this year."},

    # extrinsic error
    {"article": "The plants were found during the search of a warehouse near Ashbourne on Saturday morning. Police said they were in 'an elaborate grow house'. A man in his late 40s was arrested at the scene.",
     "summary":"Police have arrested a man in his late 40s after cannabis plants worth an estimated £100,000 were found in a warehouse near Ashbourne."}
]

scores = model.predict(
    [ [p["article"], p["summary"]] for p in example_pairs ]
)

scores

array([0.9922, 0.0005, 0.0053, 0.2113, 0.9641, 0.1181], dtype=float32)

We can see HHEM in action:

- The first example is a very good summary, and correspondingly the score is very close to 1.0.
- In the second example, the summary is unrelated to the original text, and the score is very low.
- Example 3 shows an error where the summary attributes the 2kg weight to koi carp, where it should be 8kg, hence the low score.
- Examples 4, 5 and 6 are more nuanced: the coreference error (example 4) and the extrinsic error (example 6, where the summary adds details that are not in the article) both score low, while the summary in example 5 is supported by the article and scores high.

Overall, these examples give a quick sense of how summarization can go wrong and how the HHEM score tracks the factual consistency of each summary with its article.
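
To turn these scores into a yes/no signal, one option is to apply a cut-off. The snippet below is a minimal sketch that reuses the six scores printed above; the 0.5 threshold and the `label_scores` helper are assumptions made for illustration, not something this notebook or the model prescribes.

```python
# Scores copied from the output above, in the same order as example_pairs.
scores = [0.9922, 0.0005, 0.0053, 0.2113, 0.9641, 0.1181]

# Assumption: 0.5 is an illustrative cut-off; tune it for your own application.
THRESHOLD = 0.5


def label_scores(scores, threshold=THRESHOLD):
    """Label each HHEM score as 'consistent' or 'hallucinated' (hypothetical helper)."""
    return ["consistent" if s >= threshold else "hallucinated" for s in scores]


labels = label_scores(scores)
for i, (score, label) in enumerate(zip(scores, labels), start=1):
    print(f"Example {i}: score={score:.4f} -> {label}")
```

With this cut-off, only examples 1 and 5 are labeled as consistent. A stricter or looser threshold trades false alarms against missed hallucinations.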

## Using HHEM in RAG

RAG, or retrieval augmented generation, adapts LLMs to answer user questions based on facts retrieved from custom data. It is an extremely practical architecture that reduces hallucinations, increases trust, and provides an economical way to build GenAI applications.

Given a user query, RAG extracts relevant facts based on this query, and uses the LLM to summarize those facts into a cohesive response to the user query, which is then displayed to the user. You can see an example RAG question-answering application in [asknews](https://asknews.demo.vectara.com).
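
The retrieve-then-summarize flow can be sketched in a few lines. This is a toy illustration only: the word-overlap retriever, the `build_prompt` helper, and the three-passage corpus are all assumptions made for the example, and the LLM call itself is left out. A real RAG system such as Vectara uses semantic retrieval rather than word overlap.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, corpus, top_k=1):
    """Toy retriever (assumption): rank passages by words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, facts):
    """Assemble the prompt that would be sent to an LLM (the call is omitted here)."""
    numbered = "\n".join(f"[{i}] {fact}" for i, fact in enumerate(facts, start=1))
    return (
        "Answer the question using only the facts below, citing them by number.\n"
        f"{numbered}\n"
        f"Question: {query}"
    )


# Hypothetical mini-corpus, for illustration only.
corpus = [
    "HHEM scores factual consistency between two texts.",
    "Koi carp can grow up to one metre in length.",
    "RAG retrieves facts and asks an LLM to summarize them.",
]

query = "How does RAG use retrieved facts?"
facts = retrieve(query, corpus)
prompt = build_prompt(query, facts)
print(prompt)
```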

For LLM applications built with Vectara, it can be useful to evaluate the LLM response against the retrieved facts and give the user an indicator of how factually consistent the response is with those facts.

For example, one can compute the HHEM score of the response against each fact and then average those scores to provide an overall score. Let's look at an example:

In [2]:
# We query the vectara.com website content
# Customer-ID, corpus-ID and API key taken from create-ui

import os
os.environ['VECTARA_API_KEY'] = 'zqt_UXrBcnI2UXINZkrv4g1tQPhzj02vfdtqYJIDiA'
os.environ['VECTARA_CORPUS_ID'] = '1'
os.environ['VECTARA_CUSTOMER_ID']='1366999410'


In [3]:
import requests
import json

def vectara_query(
    query: str,
    config: dict,
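) -> tuple[list, str]:
    """Query Vectara and return the matching results and the generated summary.
    Args:
        query: query string
        config: dict with api_key, customer_id, corpus_id, lambda_val and top_k
    Returns:
        A (results, summary) pair, where results is a list of [text, score] entries.
    """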
    corpus_key = [
        {
            "customerId": config["customer_id"],
            "corpusId": config["corpus_id"],
            "lexicalInterpolationConfig": {"lambda": config["lambda_val"]},
        }
    ]
    data = {
        "query": [
            {
                "query": query,
                "start": 0,
                "numResults": config["top_k"],
                "contextConfig": {
                    "sentencesBefore": 2,
                    "sentencesAfter": 2,
                },
                "corpusKey": corpus_key,
                "summary": [
                    {
                        "responseLang": "eng",
                        "maxSummarizedResults": 5,
                    }
                ]
            }
        ]
    }

    headers = {
        "x-api-key": config["api_key"],
        "customer-id": config["customer_id"],
        "Content-Type": "application/json",
    }
    response = requests.post(
        headers=headers,
        url="https://api.vectara.io/v1/query",
        data=json.dumps(data),
    )
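    if response.status_code != 200:
        print(
            f"Query failed (code {response.status_code}, reason {response.reason}, "
            f"details {response.text})"
        )
        # Return the same (results, summary) shape as the success path,
        # so that callers can always unpack two values
        return [], ""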

    result = response.json()
    responses = result["responseSet"][0]["response"]
    documents = result["responseSet"][0]["document"]
    summary = result["responseSet"][0]["summary"][0]["text"]

    res = [[r['text'], r['score']] for r in responses]
    return res, summary

In [4]:
api_key = os.environ.get("VECTARA_API_KEY", "")
customer_id = os.environ.get("VECTARA_CUSTOMER_ID", "")
corpus_id = os.environ.get("VECTARA_CORPUS_ID", "")

config = {
    "api_key": str(api_key),
    "customer_id": str(customer_id),
    "corpus_id": str(corpus_id),
    "lambda_val": 0.025,
    "top_k": 10,
}

query = "What does Vectara do?"
results, summary = vectara_query(query, config)
print(summary)

Vectara is an end-to-end platform that offers powerful generative AI capabilities for product builders [1]. It enhances traditional searches by understanding the context and meaning of data [1]. Vectara enables the construction of semantic search applications powered by LLMs, which are deep neural nets designed to understand human language [2]. It provides a secure environment that respects data sovereignty and protects user data [3]. By grounding search results in uploaded data and reducing hallucinations, Vectara ensures accurate and trustworthy responses [4]. Additionally, Vectara supports cross-language search, eliminating language barriers for users worldwide [4]. However, the ability to retrieve text is limited, and Vectara mainly retains metadata for document retrieval [5].


Now let's use HHEM to compare each of the retrieved facts (`results`) with the summary. We only use the top 5 results, since that is how many we asked the summarizer to use (`maxSummarizedResults`):

In [11]:
import pandas as pd
pd.set_option('display.width', 100)
pd.set_option('display.max_colwidth', None)  # Use None to display full content without truncation

texts = [r[0] for r in results[:5]]
# Score all (fact, summary) pairs in a single batched call
scores = model.predict([[text, summary] for text in texts])
df = pd.DataFrame({'fact': texts, 'HHEM score': scores})

In [12]:
summary

'Vectara is an end-to-end platform that offers powerful generative AI capabilities for product builders [1]. It enhances traditional searches by understanding the context and meaning of data [1]. Vectara enables the construction of semantic search applications powered by LLMs, which are deep neural nets designed to understand human language [2]. It provides a secure environment that respects data sovereignty and protects user data [3]. By grounding search results in uploaded data and reducing hallucinations, Vectara ensures accurate and trustworthy responses [4]. Additionally, Vectara supports cross-language search, eliminating language barriers for users worldwide [4]. However, the ability to retrieve text is limited, and Vectara mainly retains metadata for document retrieval [5].'

In [13]:
df

| | fact | HHEM score |
|---|---|---|
| 0 | What is the Vectara Platform? \| Vectara Docs Welcome to the documentation homepage for Vectara , an end-to-end platform for product builders to embed powerful generative AI capabilities into applications with extraordinary results. Vectara offers significant improvements over traditional searches by understanding the context and meaning of your data. | 0.894155 |
| 1 | Semantic Search Fundamentals Vectara lets you build a semantic, LLM-powered search application. Semantic search is not just about finding data, but about understanding data and helping you answer questions about your data. This topic outlines what Vectara can do for this use case as well as why and how to employ these features for the best overall end-user experience. LLMs are deep neural nets that are built with the task of specifically understanding human language. These models can be a great asset to many different use cases, including search and language generation. | 0.49437 |
| 2 | This versatile Vectara GenAI platform caters to a wide range of use cases to drive better outcomes and unlock new possibilities in search applications. Vectara provides an easy entry point to generative AI capabilities while protecting company IP and customer data. The data is secure. Vectara does not train on user data and respects data sovereignty and provides you with peace of mind. You might be wondering what kind of data to select for ingestion. Our Vectara Quick Start Tutorial provides an example that gets you set up and searching for answers quickly! | 0.577135 |
| 3 | These hallucinations lead to inaccurate and misleading responses Vectara addresses this problem through Grounded Generation, meaning it grounds the search results in the uploaded data. By focusing on facts and reducing hallucinations, Vectara enhances trust in AI-powered decision making. Use Vectara to search across multiple languages, eliminating language barriers and enabling users to find what they need, regardless of the language they use. This cross-language approach provides a seamless search experience for users around the world. The best answer may be written in German but a user asked the question in Spanish. | 0.42513 |
| 4 | At that point, the text is no longer recoverable. It also won't be returned in any Vectara APIs. Note that Vectara does retain any metadata that were supplied alongside the text, including document IDs. This retention allows you to retrieve the document from a separate system of record based on the ID to show the metadata, and it also allows Vectara to perform any metadata-based filtering on the document. Currently, the reranking capability relies on the text being stored. | 0.392353 |
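
Each fact supports only part of the summary, which may be one reason the per-fact scores are moderate. A possible refinement, sketched below, is to split the summary on its citation markers and pair each sentence with the fact it cites, so that HHEM compares like with like. The `cited_sentences` helper is hypothetical, the placeholder fact strings stand in for the retrieved texts, and the mapping of marker `[n]` to the n-th retrieved result is an assumption.

```python
import re


def cited_sentences(summary):
    """Split a summary into (sentence, citation_number) tuples.

    Assumes every sentence ends with exactly one marker such as "[3]".
    """
    pattern = re.compile(r"([^\[\]]+?)\s*\[(\d+)\]\.?\s*")
    return [(text.strip(), int(num)) for text, num in pattern.findall(summary)]


# First three sentences of the summary shown earlier.
summary_excerpt = (
    "Vectara is an end-to-end platform that offers powerful generative AI "
    "capabilities for product builders [1]. It enhances traditional searches by "
    "understanding the context and meaning of data [1]. Vectara enables the "
    "construction of semantic search applications powered by LLMs, which are deep "
    "neural nets designed to understand human language [2]."
)

# Placeholders for the retrieved fact texts (results[0][0], results[1][0], ...).
facts = ["<text of fact 1>", "<text of fact 2>"]

# Assumption: marker [n] refers to the n-th retrieved result (1-based).
pairs = [
    [facts[num - 1], sentence]
    for sentence, num in cited_sentences(summary_excerpt)
]
for fact, sentence in pairs:
    print(f"{fact} <-> {sentence}")
```

The resulting `pairs` list has the same `[premise, hypothesis]` shape as the inputs passed to `model.predict` earlier, so it could be scored in the same way.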


In an application using Vectara and HHEM, we might display the average or the maximum HHEM score next to the summary, giving the user an indicator of how well the summary is supported by the individual facts.
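
As a minimal sketch of that aggregation, the snippet below computes the mean and maximum of the five scores from the table above. The `aggregate` helper is hypothetical, and which statistic to show is a design choice: the mean reflects overall support, while the maximum only says that at least one fact agrees well with the summary.

```python
from statistics import mean

# HHEM scores copied from the table above (facts 0 to 4).
hhem_scores = [0.894155, 0.49437, 0.577135, 0.42513, 0.392353]


def aggregate(scores):
    """Summarize per-fact HHEM scores into simple indicators (hypothetical helper)."""
    return {"mean": mean(scores), "max": max(scores), "min": min(scores)}


agg = aggregate(hhem_scores)
print(f"mean={agg['mean']:.3f}  max={agg['max']:.3f}  min={agg['min']:.3f}")
```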