Attention:

Run the commands below in a terminal before starting the tests:

```bash
.venv\Scripts\activate
deepeval set-local-model \
         --model-name=llama3.2:latest \
         --base-url="http://localhost:11434/v1/" \
         --api-key="ollama"
```

Reference: https://docs.confident-ai.com/docs/metrics-llm-evals

In [1]:
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

## Correctness evaluation

In [2]:
# Define the GEval metric that assesses the "Correctness" of the LLM output
metric = GEval(
    name="Correctness",  # Metric name; here it assesses the factual accuracy of the response
    criteria="Determine if the current output is factually correct based on the expected output.",  # Describes the metric's goal
    evaluation_steps=[
        # Evaluation steps:
        "Check if any facts in the 'current output' contradict any facts in the 'expected output'.",  # Checks for contradictions between the response and the expected output
        "Heavily penalize the omission of important details.",  # Applies a heavier penalty for important omissions
        "Vague language or contradictory opinions are not acceptable."  # States that vague language or contradictory opinions lower the score
    ],
    # Test-case fields used in the evaluation (input, actual output, and expected output)
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

In [3]:
# Define a test case holding the input, the generated output, and the expected output for comparison
test_case = LLMTestCase(
    input="Who was the first president of Brazil?",  # Input given to the model
    actual_output="The first president of Brazil was Deodoro da Fonseca.",  # Response generated by the model
    expected_output="The first president of Brazil was Deodoro da Fonseca."  # Expected correct response
)

# Apply the metric to the test case and run the evaluation
metric.measure(test_case)

# Print the score computed by the metric, which represents the factual correctness of the output
print(f"Score: {metric.score}")
# Print the justification for the assigned score, explaining any discrepancies
print(f"Reason: {metric.reason}")

Score: 1.0
Reason: The output matches the expected output exactly, with no contradictions or omissions of important details.


In [4]:
test_case = LLMTestCase(
    input="Who was the first president of Brazil?",  # Input given to the model
    actual_output="It was Deodoro da Fonseca.",  # Response generated by the model
    expected_output="The first president of Brazil was Deodoro da Fonseca."  # Expected correct response
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.8
Reason: The actual output omits important details, but the language is clear and the opinion is consistent.


In [5]:
test_case = LLMTestCase(
    input="Who was the first president of Brazil?", 
    actual_output="It was the singer Roberto Carlos da Fonseca!",
    expected_output="The first president of Brazil was Deodoro da Fonseca." 
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")


Score: 0.0
Reason: The actual output contains a contradictory fact about Roberto Carlos da Fonseca being the first president, whereas the expected output clearly states Deodoro da Fonseca as the correct answer.
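
The three Correctness scores above (1.0, 0.8, 0.0) only become pass/fail verdicts once they are compared against a cut-off. The sketch below shows that comparison in plain Python. The 0.5 cut-off, the `THRESHOLD` constant, and the `verdict` helper are assumptions made for illustration and are not taken from the notebook.

```python
# Scores copied from the three Correctness runs above.
observed = {
    "exact match": 1.0,
    "shortened answer": 0.8,
    "wrong person": 0.0,
}

THRESHOLD = 0.5  # assumed cut-off, chosen only for illustration


def verdict(score: float, threshold: float = THRESHOLD) -> str:
    """Return 'pass' when the score reaches the threshold, else 'fail'."""
    return "pass" if score >= threshold else "fail"


for label, score in observed.items():
    print(f"{label}: {score:.1f} -> {verdict(score)}")
```

Under this assumed cut-off the shortened answer still passes, which may or may not be desired given that the evaluation steps ask for omissions to be penalized heavily.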


## Coherence

In [6]:
metric = GEval(
    name="Coherence", 
    criteria="Evaluate if the current output is logically consistent and maintains a clear sequence of ideas.",
    evaluation_steps=[
        "Ensure the response maintains a logical progression of thoughts.",
        "Identify any contradictions or deviations in the flow of information.",
        "Consider if the response is understandable and follows a cohesive structure."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)


In [7]:
test_case = LLMTestCase(
    input="Explain the process of photosynthesis.",
    actual_output=(
        "Photosynthesis is the process by which plants convert sunlight into chemical energy. "
        "First, plants absorb water through their roots and carbon dioxide through their leaves. "
        "Sunlight is captured by chlorophyll in the chloroplasts, initiating a reaction that produces glucose and oxygen. "
        "This process is essential for the production of oxygen and is a crucial part of the Earth’s ecosystem."
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.9
Reason: The response follows a logical progression of thoughts, identifies no contradictions, and is understandable with a cohesive structure.


In [8]:
test_case = LLMTestCase(
    input="Explain the process of photosynthesis.",
    actual_output=(
        "Photosynthesis is the process where plants absorb sunlight and produce sugar. "
        "In this process, animals also play a role in capturing sunlight. "
        "This helps plants release water into the atmosphere. "
        "Overall, photosynthesis is about how plants make use of moonlight to grow faster."
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.3
Reason: The response contains contradictions (animals capturing sunlight) and deviations from the truth (moonlight instead of sunlight), and lacks a clear structure.


## Relevance

In [16]:
metric = GEval(
    name="Relevance",
    criteria="Evaluate if the current output is directly related to the main subject of the question or context provided.",
    evaluation_steps=[
        "Ensure the response directly addresses the main subject of the question, specifically looking for a clear answer to 'What is the capital of France?'.",
        "Identify if any extraneous or unrelated information is included.",
        "Penalize responses that provide background information without answering the question directly."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)


In [17]:
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris."
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 1.0
Reason: The response directly addresses the main subject of the question and provides a clear answer to 'What is the capital of France?'


In [21]:
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Buenos Aires is known for its rich culture and famous landmarks... "
                  "and some say it looks a lot like France"
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.0
Reason: The response provides background information about Buenos Aires and makes a comparison to France, but fails to directly answer the question about the capital of France.
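
The Relevance metric above names the question 'What is the capital of France?' inside its evaluation steps, which ties the metric to that single input. One way to reuse the same wording for other questions is to generate the steps from the question text. A sketch, with `relevance_steps` as a hypothetical helper that is not part of DeepEval:

```python
from typing import List


def relevance_steps(question: str) -> List[str]:
    """Build the Relevance evaluation steps for a given question."""
    return [
        "Ensure the response directly addresses the main subject of the "
        f"question, specifically looking for a clear answer to '{question}'.",
        "Identify if any extraneous or unrelated information is included.",
        "Penalize responses that provide background information without "
        "answering the question directly.",
    ]


for step in relevance_steps("What is the capital of France?"):
    print("-", step)
```

The returned list could then be passed as `evaluation_steps` when constructing the metric for a different question.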
