# Faithfulness Evaluator

This notebook uses the `FaithfulnessEvaluator` module to measure whether the response from a query engine matches any of its source nodes.  
This is useful for detecting hallucinated responses.  
The data is extracted from the [New York City](https://en.wikipedia.org/wiki/New_York_City) Wikipedia page.

In [4]:
# %pip install llama-index-llms-openai pandas[jinja2] spacy

In [5]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

In [6]:
# import os

# os.environ["OPENAI_API_KEY"] = "sk-..."

In [7]:
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Response,
)
# from llama_index.llms.openai import OpenAI
from llama_index.llms.ollama import Ollama
from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.core.node_parser import SentenceSplitter
import pandas as pd

from jet.llm.ollama import initialize_ollama_settings
initialize_ollama_settings()

pd.set_option("display.max_colwidth", 0)

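The evaluation LLM here is `llama3.1` served through Ollama; the commented-out line shows the GPT-4 alternative. The evaluator variable is still named `evaluator_gpt4`.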

In [8]:
# gpt-4
# gpt4 = OpenAI(temperature=0, model="gpt-4")
llm = Ollama(temperature=0, model="llama3.1")

evaluator_gpt4 = FaithfulnessEvaluator(llm=llm)

In [9]:
documents = SimpleDirectoryReader("./test_wiki_data/").load_data()

In [10]:
# create vector index
splitter = SentenceSplitter(chunk_size=512)
vector_index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter]
)

In [11]:
from llama_index.core.evaluation import EvaluationResult


# define jupyter display function
def display_eval_df(response: Response, eval_result: EvaluationResult) -> None:
    if response.source_nodes == []:
        print("no response!")
        return
    eval_df = pd.DataFrame(
        {
            "Response": str(response),
            "Source": response.source_nodes[0].node.text[:1000] + "...",
            "Evaluation Result": "Pass" if eval_result.passing else "Fail",
            "Reasoning": eval_result.feedback,
        },
        index=[0],
    )
    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "600px",
            "overflow-wrap": "break-word",
        },
        subset=["Response", "Source"]
    )
    display(eval_df)

To run an evaluation, call the evaluator's `.evaluate_response()` method with the `Response` object returned from the query. Let's evaluate the output of a query engine built on `vector_index`.

In [12]:
query_engine = vector_index.as_query_engine()
response_vector = query_engine.query("How did New York City get its name?")
eval_result = evaluator_gpt4.evaluate_response(response=response_vector)

In [13]:
display_eval_df(response_vector, eval_result)

| Response | Source | Evaluation Result | Reasoning |
| --- | --- | --- | --- |
| The city was named after King Charles II of England granted the lands to his brother, the Duke of York, who then renamed it New York. | The settlement was named New Amsterdam (Dutch: Nieuw Amsterdam) in 1626 and was chartered as a city in 1653. The city came under British control in 1664 and was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city was regained by the Dutch in July 1673 and was renamed New Orange for one year and three months; the city has been continuously named New York since November 1674. New York City was the capital of the United States from 1785 until 1790, and has been the largest U.S. city since 1790. The Statue of Liberty greeted millions of immigrants as they came to the U.S. by ship in the late 19th and early 20th centuries, and is a symbol of the U.S. and its ideals of liberty and peace. In the 21st century, New York City has emerged as a global node of creativity, entrepreneurship, and as a symbol of freedom and cultural diversity. The New York Times has won the most Pulitzer Prizes for journalism and remains the U.S. media's "newsp... | Pass | YES |
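The `Reasoning` column above holds the judge model's raw verdict (`YES`), and `Evaluation Result` is the pass/fail flag derived from it. As a minimal sketch of how such a verdict could be mapped to a boolean, assuming a simple yes/no answer convention: the helper `verdict_to_passing` is hypothetical and is not the library's implementation.

```python
def verdict_to_passing(raw_verdict: str) -> bool:
    """Hypothetical helper: treat a verdict that starts with 'YES' as a pass.

    Assumption: the judge model answers with a leading YES or NO.
    This is an illustration only, not FaithfulnessEvaluator's own logic.
    """
    return raw_verdict.strip().upper().startswith("YES")


print(verdict_to_passing("YES"))
print(verdict_to_passing("NO, the context does not mention this."))
```

A smaller local model may not always follow a strict yes/no format, so it is worth inspecting `eval_result.feedback` directly when a result looks surprising.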


## Benchmark on Generated Questions

Now let's generate a few more questions so that we have more to evaluate and can run a small benchmark.

In [14]:
from llama_index.core.evaluation import DatasetGenerator

question_generator = DatasetGenerator.from_documents(documents)
eval_questions = question_generator.generate_questions_from_nodes(5)

eval_questions



['Here are 10 questions based on the provided context information:',
 'What is the most populous city in the United States, according to the text?',
 'In what year was New York City founded as a trading post by Dutch colonists?',
 'How many languages are spoken in New York City, making it the most linguistically diverse city in the world?',
 'What is the name of the largest metropolitan area in the U.S. by both population and urban area, which includes New York City?']
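The first entry in the list above is the model's preamble rather than a question. A minimal sketch of one way to drop such entries before benchmarking, assuming that genuine questions end with a question mark; the helper name `keep_questions` is hypothetical and not part of LlamaIndex.

```python
from typing import List


def keep_questions(candidates: List[str]) -> List[str]:
    """Hypothetical filter: keep only entries that end with '?'."""
    return [c.strip() for c in candidates if c.strip().endswith("?")]


# Sample entries copied from the generated output above.
generated = [
    "Here are 10 questions based on the provided context information:",
    "What is the most populous city in the United States, according to the text?",
    "In what year was New York City founded as a trading post by Dutch colonists?",
]

filtered = keep_questions(generated)
for question in filtered:
    print(question)
```

The heuristic is deliberately crude: it would also discard a valid question that the model happened to emit without a trailing question mark.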

In [31]:
import asyncio
import httpx

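# Set a custom timeout (adjust as needed).
# Note: this client is not passed to the query engine or the evaluator below,
# so the timeout only applies to requests made through `client` itself.
TIMEOUT = httpx.Timeout(120.0, connect=10.0)
client = httpx.AsyncClient(timeout=TIMEOUT)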

async def evaluate_query_engine(query_engine, questions):
    async with client:  # Ensure proper cleanup
        total_correct = 0
        total = 0

        for question in questions:
            try:
                # Process each query
                response = await query_engine.aquery(question)
                eval_response = evaluator_gpt4.evaluate_response(response=response)
                eval_result = (
                    1 if eval_response.passing else 0
                )
                total_correct += eval_result
                total += 1

                # Yield progress
                yield {
                    "question": question,
                    "response": eval_response.response,
                    "correct": total_correct,
                    "total": total,
                    "passing": eval_response.passing,
                    "contexts": eval_response.contexts,
                    "feedback": eval_response.feedback,
                    "score": eval_response.score,
                    "query": eval_response.query,
                }

            except Exception as e:
                # Handle errors
                yield {"question": question, "error": str(e), "correct": total_correct, "total": total}


vector_query_engine = vector_index.as_query_engine()
eval_questions_sample = eval_questions[:5]  # Example questions

async for progress in evaluate_query_engine(vector_query_engine, eval_questions_sample):
    if "error" in progress:
        print(f"Question: {progress['question']} - Error: {progress['error']}")
    else:
        print(f"Progress: {progress['correct']}/{progress['total']}")




Question: Here are 10 questions based on the provided context information: - Error: 
Progress: 1/1
Progress: 2/2
Progress: 3/3


CancelledError: 
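
The progress counter leaves failed questions out of the total, so `3/3` does not show that one question errored. A minimal sketch of a summary that counts errors separately, assuming records shaped like the dictionaries yielded by `evaluate_query_engine`; the helper name `summarize_progress` and the sample records are illustrative only.

```python
from typing import Dict, Iterable


def summarize_progress(records: Iterable[dict]) -> Dict[str, float]:
    """Hypothetical helper: count passes, evaluated questions, and errors."""
    passed = 0
    evaluated = 0
    errors = 0
    for record in records:
        if "error" in record:
            # Errored questions are tracked separately from evaluated ones.
            errors += 1
            continue
        evaluated += 1
        if record.get("passing"):
            passed += 1
    # Avoid division by zero when nothing was evaluated.
    pass_rate = passed / evaluated if evaluated else 0.0
    return {
        "passed": passed,
        "evaluated": evaluated,
        "errors": errors,
        "pass_rate": pass_rate,
    }


# Illustrative records mirroring the run above: one error, three passes.
records = [
    {"question": "q0", "error": "", "correct": 0, "total": 0},
    {"question": "q1", "passing": True, "correct": 1, "total": 1},
    {"question": "q2", "passing": True, "correct": 2, "total": 2},
    {"question": "q3", "passing": True, "correct": 3, "total": 3},
]

summary = summarize_progress(records)
print(summary)
```

Reporting errors next to the pass rate makes it clear how many questions were never evaluated, for example because a request failed or the run was cancelled.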