# Using DeepEval for AWS Bedrock LLM evaluation

## Using DeepEval pre-built metrics for popular tasks

### Hallucination Detection
Several papers study how to detect hallucinations with an LLM as a judge. For example, the FACTOOL method (https://arxiv.org/pdf/2307.13528) first extracts the checkable claims from the LLM's response, then generates search queries for each claim, and then runs those queries against Google Search or a custom knowledge base to collect evidence for each claim. Finally, it judges whether each claim is consistent with the evidence, which determines whether the model's answer contains hallucinations.
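
The four stages described above can be sketched as a small pipeline. This is only an illustration of the control flow, not FACTOOL's actual code: the function names (`check_response`, `toy_extract`, `toy_queries`, `toy_search`, `toy_verify`) and the tiny `KNOWLEDGE_BASE` are hypothetical stand-ins for the LLM calls and search backend a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ClaimVerdict:
    claim: str
    evidence: List[str]
    supported: bool


def check_response(
    response: str,
    extract_claims: Callable[[str], List[str]],
    make_queries: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], bool],
) -> List[ClaimVerdict]:
    """Run the four stages: claims -> queries -> evidence -> verdicts."""
    verdicts = []
    for claim in extract_claims(response):
        evidence: List[str] = []
        for query in make_queries(claim):
            evidence.extend(search(query))
        verdicts.append(ClaimVerdict(claim, evidence, verify(claim, evidence)))
    return verdicts


# Hypothetical toy stand-ins; a real system would call an LLM and a search API.
KNOWLEDGE_BASE: Dict[str, List[str]] = {
    "paris is the capital of france": ["Paris is the capital of France."],
}


def toy_extract(response: str) -> List[str]:
    # Treat every sentence as one claim.
    return [s.strip() for s in response.split(".") if s.strip()]


def toy_queries(claim: str) -> List[str]:
    # Use the lowercased claim itself as the only query.
    return [claim.lower()]


def toy_search(query: str) -> List[str]:
    # Exact-match lookup in the toy knowledge base.
    return KNOWLEDGE_BASE.get(query, [])


def toy_verify(claim: str, evidence: List[str]) -> bool:
    # A claim counts as supported if any evidence was found.
    return len(evidence) > 0


verdicts = check_response(
    "Paris is the capital of France. Paris has 90 million residents.",
    toy_extract, toy_queries, toy_search, toy_verify,
)
has_hallucination = any(not v.supported for v in verdicts)
print(has_hallucination)
```

Passing the stages in as callables keeps the loop independent of any particular LLM or search provider.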

DeepEval uses a simpler but similar pipeline for hallucination detection. To use the HallucinationMetric, provide the following arguments in your LLMTestCase:

* input: the input prompt sent to the LLM
* actual_output: the output generated by the LLM
* context: the evidence or retrieved knowledge to check the output against, passed as a list of strings

The hallucination metric then uses LLM-as-a-judge to determine whether your LLM generates factually correct information by comparing the actual_output to the provided context. It outputs a score, where lower is better, calculated as follows:

$$ \text{Hallucination} = \frac{\text{Number of Contradicted Contexts}}{\text{Total Number of Contexts}} $$
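
As a rough sketch of the formula above, the score can be computed from one judge verdict per context. The helper name `hallucination_score` is hypothetical and is not part of DeepEval. The convention that a "no" verdict marks a contradicted context follows the verbose logs recorded later in this notebook; returning 0.0 for an empty list is an assumption made here.

```python
from typing import List


def hallucination_score(verdicts: List[str]) -> float:
    """Fraction of contexts that the output contradicts.

    Each verdict is "yes" (output agrees with that context) or
    "no" (output contradicts it).
    """
    if not verdicts:
        # Assumption: with no contexts there is nothing to contradict.
        return 0.0
    contradicted = sum(1 for v in verdicts if v.strip().lower() == "no")
    return contradicted / len(verdicts)


print(hallucination_score(["no"]))         # 1.0
print(hallucination_score(["yes", "no"]))  # 0.5
```

With a single context the score can only be 0.0 or 1.0, so splitting evidence into several context entries gives a finer-grained result.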

You can find more details about how the DeepEval HallucinationMetric works here:
https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/hallucination/template.py

In [None]:
!pip install deepeval nest_asyncio aiobotocore botocore

In [1]:
from deepeval.models.llms.amazon_bedrock_model import AmazonBedrockModel
import nest_asyncio

# Apply nest_asyncio at the start
nest_asyncio.apply()

# Initialize the Bedrock model (e.g., Claude)
model = AmazonBedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    region_name="us-east-1"
)

# Define your input prompt
prompt1 = '''What new reasoning feature does Claude 3.7 Sonnet introduce?'''

# Run the model
output = model.generate(prompt1)

In [2]:
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

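# Note: the two string literals below have no comma between them,
# so Python concatenates them into a single context entry.
retrieval_context = [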
    '''Anthropic Claude 3.7 Sonnet is the first Claude model to introduce optional step-by-step reasoning, called "extended thinking," which users can toggle alongside standard thinking. '''
    '''It supports up to 128K output tokens per request (with 64K–128K currently in beta) and features an enhanced computer use beta with support for new automated actions.'''
]

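# Convert output to a string if it's not already one. As the recorded output
# shows, generate() returned a tuple here, so str() keeps the tuple wrapper.
actual_output = str(output) if not isinstance(output, str) else output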

test_case = LLMTestCase(
    input=prompt1,
    actual_output=actual_output,
    context=retrieval_context
)
metric = HallucinationMetric(model=model)

# To run metric as a standalone
metric.measure(test_case)
print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])


1.0 The score is 1.00 because there are no factual alignments and the output completely contradicts the context by incorrectly naming Claude 3.7 Sonnet's reasoning feature as 'chain-of-thought reasoning with self-critique' instead of 'extended thinking' as stated in the context, and by omitting the important detail that this feature is toggleable by users.


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:06,  6.98s/test case]



Metrics Summary

  - ❌ Hallucination (score: 1.0, threshold: 0.5, strict: False, evaluation model: us.anthropic.claude-3-7-sonnet-20250219-v1:0, reason: The score is 1.00 because the output contains a direct contradiction with the context, incorrectly referring to Claude 3.7 Sonnet's feature as 'chain-of-thought reasoning with self-critique' when the context specifically names it 'extended thinking'. With no factual alignments and a clear contradiction about a key feature name, the output is completely hallucinated., error: None)

For test case:

  - input: What new reasoning feature does Claude 3.7 Sonnet introduce?
  - actual output: ('Claude 3.7 Sonnet introduces a new reasoning feature called "chain-of-thought reasoning with self-critique." This capability allows me to:\n\n1. Break down complex problems into steps\n2. Evaluate my own reasoning as I go\n3. Identify and correct potential errors in my thinking\n4. Consider alternative approaches when appropriate\n\nThis self-critiqu




EvaluationResult(test_results=[TestResult(name='test_case_0', success=False, metrics_data=[MetricData(name='Hallucination', threshold=0.5, success=False, score=1.0, reason="The score is 1.00 because the output contains a direct contradiction with the context, incorrectly referring to Claude 3.7 Sonnet's feature as 'chain-of-thought reasoning with self-critique' when the context specifically names it 'extended thinking'. With no factual alignments and a clear contradiction about a key feature name, the output is completely hallucinated.", strict_mode=False, evaluation_model='us.anthropic.claude-3-7-sonnet-20250219-v1:0', error=None, evaluation_cost=0.0, verbose_logs='Verdicts:\n[\n    {\n        "verdict": "no",\n        "reason": "The actual output contradicts the provided context. The context states that Claude 3.7 Sonnet introduces \'optional step-by-step reasoning, called \\"extended thinking\\"\', but the output incorrectly refers to this feature as \'chain-of-thought reasoning wit

### Prompt Alignment

The prompt alignment metric uses LLM-as-a-judge to measure whether your LLM application generates an actual_output that aligns with the instructions specified in your prompt template. The algorithm is simple yet effective:
* (1) Loop through all instructions found in your prompt template.
* (2) Determine whether each instruction is followed, based on the input and output.

This works because the metric receives only the list of instructions rather than the entire prompt. The judge LLM therefore does not have to take in the whole prompt as context (which can be lengthy and cause hallucinations); it only has to consider one instruction at a time when deciding whether that instruction is followed.

The score can be calculated according to the following equation:

$$ \text{Prompt Alignment} = \frac{\text{Number of Instructions Followed}}{\text{Total Number of Instructions}} $$
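
The loop and the formula above can be sketched as follows. This is an illustration only: `prompt_alignment_score` and `toy_judge` are hypothetical names, and the rule-based judge stands in for the LLM judge that DeepEval uses. Returning 0.0 for an empty instruction list is an assumption made here.

```python
from typing import Callable, List

# A judge takes (instruction, input, actual_output) and says whether
# the instruction was followed.
Judge = Callable[[str, str, str], bool]


def prompt_alignment_score(
    instructions: List[str], llm_input: str, actual_output: str, judge: Judge
) -> float:
    """Fraction of instructions that the judge considers followed."""
    if not instructions:
        # Assumption: no instructions means nothing was followed.
        return 0.0
    followed = sum(
        1 for instruction in instructions
        if judge(instruction, llm_input, actual_output)
    )
    return followed / len(instructions)


def toy_judge(instruction: str, llm_input: str, actual_output: str) -> bool:
    # Hypothetical rule-based stand-in for an LLM judge.
    if instruction == "Reply in all uppercase":
        return actual_output == actual_output.upper()
    if instruction == "Reply with at most two words":
        return len(actual_output.split()) <= 2
    return False


instructions = ["Reply in all uppercase", "Reply with at most two words"]
print(prompt_alignment_score(instructions, "HelLo BedROck", "HELLO BEDROCK", toy_judge))  # 1.0
print(prompt_alignment_score(instructions, "HelLo BedROck", "Hello Bedrock", toy_judge))  # 0.5
```

Judging one instruction per call mirrors the point made above: each verdict depends on a single instruction rather than on the whole prompt.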

You can find more details about how the DeepEval prompt alignment metric works here:
https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/prompt_alignment/template.py

In [9]:
# Define your input prompt
prompt2 = '''Replace the lowwercase to uppercase, just output results: HelLo BedROck'''

# Run the model
output = model.generate(prompt2)

In [10]:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import PromptAlignmentMetric

metric = PromptAlignmentMetric(
    prompt_instructions=["Reply in all uppercase"],
    model=model,
    include_reason=True
)
test_case = LLMTestCase(
    input=prompt2,
    # Replace this with the actual output from your LLM application
    actual_output=str(output)
)


evaluate(test_cases=[test_case], metrics=[metric])

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:07,  7.19s/test case]



Metrics Summary

  - ✅ Prompt Alignment (score: 1.0, threshold: 0.5, strict: False, evaluation model: us.anthropic.claude-3-7-sonnet-20250219-v1:0, reason: The score is 1.00 because the LLM perfectly followed the instruction to replace lowercase letters with uppercase ones in 'HelLo BedROck', correctly outputting 'HELLO BEDROCK'. Great job on achieving perfect alignment with the prompt requirements!, error: None)

For test case:

  - input: Replace the lowwercase to uppercase, just output results: HelLo BedROck
  - actual output: ('HELLO BEDROCK', 0)
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Prompt Alignment: 100.00% pass rate







EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Prompt Alignment', threshold=0.5, success=True, score=1.0, reason="The score is 1.00 because the LLM perfectly followed the instruction to replace lowercase letters with uppercase ones in 'HelLo BedROck', correctly outputting 'HELLO BEDROCK'. Great job on achieving perfect alignment with the prompt requirements!", strict_mode=False, evaluation_model='us.anthropic.claude-3-7-sonnet-20250219-v1:0', error=None, evaluation_cost=0.0, verbose_logs='Prompt Instructions:\n[\n    "Reply in all uppercase"\n] \n \nVerdicts:\n[\n    {\n        "verdict": "yes",\n        "reason": null\n    }\n]')], conversational=False, multimodal=False, input='Replace the lowwercase to uppercase, just output results: HelLo BedROck', actual_output="('HELLO BEDROCK', 0)", expected_output=None, context=None, retrieval_context=None, additional_metadata=None)], confident_link=None)

## G-Eval for customized task-specific LLM Evaluation

G-Eval is an evaluation method introduced in the paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (https://arxiv.org/pdf/2303.16634). It uses LLMs to evaluate LLM outputs (commonly referred to as LLM-Evals) and is one of the leading approaches for building custom, task-specific metrics. The overall G-Eval framework works as follows:


First, it feeds the **Task Introduction** and **Evaluation Criteria** into the LLM and prompts it to generate a Chain-of-Thought of detailed evaluation steps. The prompt, together with the generated Chain-of-Thought, is then used to evaluate LLM outputs in a form-filling paradigm. The final score is a probability-weighted sum of the output scores.
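
The probability-weighted aggregation can be illustrated with a short sketch. If the judge assigns probability $p(s_i)$ to each candidate score $s_i$, the final score is $\sum_i p(s_i) \cdot s_i$. The helper name `weighted_score` and the example probabilities are hypothetical; dividing by the total probability mass is an assumption added here so the sketch still works when the probabilities do not sum exactly to 1.

```python
from typing import Dict


def weighted_score(score_probs: Dict[int, float]) -> float:
    """Expected score: sum of score * probability, renormalized."""
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must have positive total mass")
    return sum(score * prob for score, prob in score_probs.items()) / total


# Hypothetical probabilities a judge might assign to scores 1 through 5.
probs = {1: 0.05, 2: 0.10, 3: 0.15, 4: 0.45, 5: 0.25}
print(weighted_score(probs))  # roughly 3.75
```

Compared with taking only the single most likely score, the weighted sum yields a continuous value that can separate outputs the judge would otherwise give the same integer rating.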

In [None]:
from deepeval.models.llms.amazon_bedrock_model import AmazonBedrockModel
import nest_asyncio

# Apply nest_asyncio at the start
nest_asyncio.apply()

# Initialize the Bedrock model (e.g., Claude)
model = AmazonBedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    region_name="us-east-1"
)

# Define your input prompt
prompt1 = '''Describe how technological innovations have transformed education over the past century. Analyze both the benefits and drawbacks of these changes, and predict how emerging technologies might further change educational practices in the next decade.'''

# Run the model
output = model.generate(prompt1)

In [None]:
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

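# The recorded outputs above suggest generate() returns a (text, cost) tuple
# (an assumption for other versions), so unwrap the text before building the test case.
actual_output = output[0] if isinstance(output, tuple) else str(output)
test_case = LLMTestCase(input=prompt1, actual_output=actual_output)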
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - the collective quality of all sentences in the actual output",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model
)

coherence_metric.measure(test_case)
print(coherence_metric.score)
print(coherence_metric.reason)