# Hallucination

The hallucination metric evaluates whether an LLM generates factually correct information by comparing its `actual_output` against the provided `context`.

### Required arguments
- input
- actual_output
- context

Note:
To measure hallucination in a RAG pipeline, the faithfulness metric is recommended instead.

In [1]:
from langchain_openai import AzureChatOpenAI
from deepeval.models.base_model import DeepEvalBaseLLM
import sys
sys.path.append('/opt/project/src/evaluate_llm/')
from api_key_config import settings
import os

os.environ["OPENAI_API_VERSION"] = settings.OPENAI_API_VERSION
os.environ["OPENAI_API_KEY"] = settings.OPENAI_API_KEY
os.environ["AZURE_OPENAI_ENDPOINT"] = settings.AZURE_OPENAI_ENDPOINT

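class AzureOpenAI(DeepEvalBaseLLM):
    """Wraps a LangChain chat model so DeepEval can use it as the evaluation LLM."""

    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        # Return the wrapped LangChain chat model.
        return self.model

    def generate(self, prompt: str) -> str:
        # Synchronous call: send the prompt and return the text of the reply.
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        # Asynchronous counterpart of generate().
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        # This name appears as the "evaluation model" in DeepEval reports.
        return "Custom Azure OpenAI Model"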

# Replace these with real values
custom_model = AzureChatOpenAI(
    deployment_name="gpt-35-turbo",
)
azure_openai = AzureOpenAI(model=custom_model)



In [2]:
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual documents that you are passing as input to your LLM.
context=["A man with blond-hair, and a brown shirt drinking out of a public water fountain."]

# Replace this with the actual output from your LLM application
actual_output="A blond drinking water in public."

test_case = LLMTestCase(
    input="What was the blond doing?",
    actual_output=actual_output,
    context=context
)
metric = HallucinationMetric(threshold=0.5, model=azure_openai)

metric.measure(test_case)
print(metric.score)
print(metric.reason)


# or evaluate test cases in bulk
evaluate([test_case], [metric])


1.0
The score is 1.00 because the actual output does not accurately capture the specific details of the man's appearance and activity, leading to a higher likelihood of hallucinations.
Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ❌ Hallucination (score: 1.0, threshold: 0.5, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The hallucination score is 1.00 because the actual output does not provide specific details about the man's appearance or his surroundings, leading to a higher likelihood of hallucinations or false perceptions., error: None)

For test case:

  - input: What was the blond doing?
  - actual output: A blond drinking water in public.
  - expected output: None
  - context: ['A man with blond-hair, and a brown shirt drinking out of a public water fountain.']
  - retrieval context: None


Overall Metric Pass Rates

HallucinationMetric: 0.00% pass rate






[TestResult(success=False, metrics=[<deepeval.metrics.hallucination.hallucination.HallucinationMetric object at 0x7f05902ccee0>], input='What was the blond doing?', actual_output='A blond drinking water in public.', expected_output=None, context=['A man with blond-hair, and a brown shirt drinking out of a public water fountain.'], retrieval_context=None)]

There are five optional parameters when creating a HallucinationMetric:

- [Optional] threshold: a float representing the maximum passing threshold, defaulted to 0.5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
- [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 0 for perfection, 1 otherwise. It also overrides the current threshold and sets it to 0. Defaulted to False.
- [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
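How `threshold` and `strict_mode` decide pass or fail can be sketched in plain Python. This is only an illustration of the rules described in the list above, not DeepEval's own code: `is_passing` is a hypothetical helper, and treating a score exactly equal to the threshold as a pass is an assumption.

```python
def is_passing(score: float, threshold: float = 0.5, strict_mode: bool = False) -> bool:
    """Return True when a hallucination score passes (hypothetical helper).

    Lower scores are better, so the score must not exceed the threshold.
    In strict mode the threshold is overridden to 0, so only a score of 0 passes.
    """
    if strict_mode:
        threshold = 0.0
    # Assumption: a score equal to the threshold counts as passing.
    return score <= threshold


print(is_passing(1.0))                     # False
print(is_passing(0.25))                    # True
print(is_passing(0.25, strict_mode=True))  # False
print(is_passing(0.0, strict_mode=True))   # True
```

Under these rules the run above fails, since its score of 1.0 is greater than the threshold of 0.5.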

### How Is It Calculated?

Hallucination = Number of Contradicted Contexts / Total Number of Contexts

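HallucinationMetric uses an LLM to judge, for each context, whether it contradicts the `actual_output`. Because the score is the fraction of contradicted contexts, lower is better: 0 means no context was contradicted and 1 means every context was.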
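The arithmetic of the formula can be sketched as follows. In DeepEval the per-context verdicts come from the evaluation LLM; here they are supplied by hand. The function name `hallucination_score` and the choice to return 0.0 for an empty list are assumptions made for this illustration.

```python
from typing import List


def hallucination_score(contradicted: List[bool]) -> float:
    """Fraction of contexts judged to contradict the actual output.

    contradicted[i] is True when context i was judged to contradict
    the actual output (hand-written here, LLM-generated in DeepEval).
    """
    if not contradicted:
        # Assumption: with no contexts there is nothing to contradict.
        return 0.0
    return sum(contradicted) / len(contradicted)


# Hypothetical verdicts for three contexts: one contradiction, two agreements.
verdicts = [True, False, False]
print(round(hallucination_score(verdicts), 2))  # 0.33
```

With a single context, as in the test case above, the score can only be 0.0 or 1.0.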

Although this resembles the Faithfulness metric, the calculation is not the same, because it uses `context` that serves as ground truth instead.