# Automated LLM evaluation with DeepEval

Note: run the commands below in a terminal before starting the tests:

```bash
source .venv/Scripts/activate  # Windows (Git Bash); on Linux/macOS: source .venv/bin/activate
deepeval set-local-model \
         --model-name=llama3.2:latest \
         --base-url="http://localhost:11434/v1/" \
         --api-key="ollama"
```
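
The same configuration step can be scripted, for example at the top of a setup script. The sketch below only assembles the argument list shown above; the helper name `local_model_command` is hypothetical, and actually running the command (left commented out) assumes the `deepeval` CLI is available in the active environment.

```python
import shlex


def local_model_command(model_name: str, base_url: str, api_key: str) -> list[str]:
    """Build the `deepeval set-local-model` invocation as an argument list."""
    return [
        "deepeval",
        "set-local-model",
        f"--model-name={model_name}",
        f"--base-url={base_url}",
        f"--api-key={api_key}",
    ]


cmd = local_model_command("llama3.2:latest", "http://localhost:11434/v1/", "ollama")
print(shlex.join(cmd))

# To execute it (assumes the deepeval CLI is installed in the active environment):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting differences between Windows and Unix-like terminals.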

Reference: https://docs.confident-ai.com/docs/metrics-llm-evals

In [1]:
from deepeval.test_case import LLMTestCase
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

## 1. Correctness

Checks whether the response is factually correct when compared with the expected output.

In [2]:
# Define the GEval metric that assesses the "Correctness" of the LLM output
metric = GEval(
    name="Correctness",  # Metric name; here it measures the factual accuracy of the response
    criteria="Determine if the current output is factually correct based on the expected output.",  # States what the metric should judge
    evaluation_steps=[
        # Evaluation steps:
        "Check if any facts in the 'current output' contradict any facts in the 'expected output'.",  # Looks for contradictions between the response and the expected answer
        "Heavily penalize the omission of important details.",  # Applies a heavier penalty to important omissions
        "Vague language or contradictory opinions are not acceptable."  # Makes vague language and contradictory opinions count against the response
    ],
    # Test case fields used in the evaluation (input, actual output and expected output)
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

In [3]:
# Define a test case holding the input, the generated output and the expected output to compare against
test_case = LLMTestCase(
    input="Who was the first president of Brazil?",  # Input given to the model
    actual_output="The first president of Brazil was Deodoro da Fonseca.",  # Response generated by the model
    expected_output="The first president of Brazil was Deodoro da Fonseca."  # Expected correct response
)

# Apply the metric to the test case and run the evaluation
metric.measure(test_case)

# Print the score computed by the metric, which reflects the factual correctness of the output
print(f"Score: {metric.score}")
# Print the justification for the score, explaining any discrepancies
print(f"Reason: {metric.reason}")

Score: 1.0
Reason: The output matches the expected output, with no contradictions or omissions of important details.


In [4]:
test_case = LLMTestCase(
    input="Who was the first president of Brazil?",  # Input given to the model
    actual_output="It was Deodoro da Fonseca.",  # Response generated by the model
    expected_output="The first president of Brazil was Deodoro da Fonseca."  # Expected correct response
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.8
Reason: The actual output omits important details, but the language is clear and the opinion is consistent.


In [6]:
test_case = LLMTestCase(
    input="Who was the first president of Brazil?", 
    actual_output="It was the singer Roberto Carlos da Fonseca!",
    expected_output="The first president of Brazil was Deodoro da Fonseca." 
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")


Score: 0.0
Reason: The actual output contains a contradictory fact about Roberto Carlos da Fonseca being the first president, whereas the expected output clearly states Deodoro da Fonseca as the correct answer.
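
The three Correctness cells repeat the same measure-and-print steps. A small loop can do this for any list of cases, as sketched below. It relies only on the `measure`, `score` and `reason` members used above; `run_cases`, `Result` and the `_EchoMetric` stand-in are hypothetical names, and the stand-in exists only so the snippet runs without a judge model.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Score and justification collected for one test case."""
    label: str
    score: float
    reason: str


def run_cases(metric, cases):
    """Measure each (label, test_case) pair and collect score and reason."""
    results = []
    for label, case in cases:
        metric.measure(case)  # same call as in the cells above
        results.append(Result(label, metric.score, metric.reason))
    return results


class _EchoMetric:
    """Hypothetical stand-in for a GEval metric (exact match only, no LLM)."""

    def measure(self, case):
        matches = case["actual_output"] == case["expected_output"]
        self.score = 1.0 if matches else 0.0
        self.reason = "exact match" if matches else "outputs differ"
        return self.score


expected = "The first president of Brazil was Deodoro da Fonseca."
cases = [
    ("exact", {"actual_output": expected, "expected_output": expected}),
    ("wrong", {"actual_output": "It was the singer Roberto Carlos da Fonseca!",
               "expected_output": expected}),
]
results = run_cases(_EchoMetric(), cases)
for r in results:
    print(f"{r.label}: score={r.score} reason={r.reason}")
```

With DeepEval installed, the same loop should accept the `metric` and `LLMTestCase` objects defined above in place of the stand-in and the dictionaries.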


## 2. Coherence

Checks whether the response is logically consistent and keeps a clear sequence of ideas.

In [7]:
metric = GEval(
    name="Coherence", 
    criteria="Evaluate if the current output is logically consistent and maintains a clear sequence of ideas.",
    evaluation_steps=[
        "Ensure the response maintains a logical progression of thoughts.",
        "Identify any contradictions or deviations in the flow of information.",
        "Consider if the response is understandable and follows a cohesive structure."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)


In [8]:
test_case = LLMTestCase(
    input="Explain the process of photosynthesis.",
    actual_output=(
        "Photosynthesis is the process by which plants convert sunlight into chemical energy. "
        "First, plants absorb water through their roots and carbon dioxide through their leaves. "
        "Sunlight is captured by chlorophyll in the chloroplasts, initiating a reaction that produces glucose and oxygen. "
        "This process is essential for the production of oxygen and is a crucial part of the Earth’s ecosystem."
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.9
Reason: The response follows a logical progression of thoughts, identifies no contradictions, and is understandable with a cohesive structure.


In [9]:
test_case = LLMTestCase(
    input="Explain the process of photosynthesis.",
    actual_output=(
        "Photosynthesis is the process where plants absorb sunlight and produce sugar. "
        "In this process, animals also play a role in capturing sunlight. "
        "This helps plants release water into the atmosphere. "
        "Overall, photosynthesis is about how plants make use of moonlight to grow faster."
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.3
Reason: The response contains contradictions (animals capturing sunlight) and deviations from the process of photosynthesis, and lacks a clear explanation of how plants use sunlight to produce sugar.
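
Scores such as 0.9 and 0.3 become easier to act on once they are turned into a pass/fail decision. The sketch below applies a cutoff by hand; the 0.5 threshold and the `verdict` helper are assumptions chosen for illustration, and the labels are informal names for the two responses above.

```python
def verdict(score: float, threshold: float = 0.5) -> str:
    """Return 'pass' when the score reaches the threshold, else 'fail'."""
    return "pass" if score >= threshold else "fail"


# Scores reported by the two Coherence runs above.
coherence_scores = {"ordered explanation": 0.9, "scrambled explanation": 0.3}

verdicts = {label: verdict(score) for label, score in coherence_scores.items()}
for label, outcome in verdicts.items():
    print(f"{label}: {outcome}")
```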


## 3. Relevance

Assesses whether the response relates directly to the question or the context provided.

In [10]:
metric = GEval(
    name="Relevance",
    criteria="Evaluate if the current output is directly related to the main subject of the question or context provided.",
    evaluation_steps=[
        "Ensure the response directly addresses the main subject of the question, specifically looking for a clear answer to 'What is the capital of France?'.",
        "Identify if any extraneous or unrelated information is included.",
        "Penalize responses that provide background information without answering the question directly."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
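
The first evaluation step above quotes one specific question, so this metric instance only suits that input. One possible way to keep the wording while reusing it for other questions is to build the steps from the question text. `relevance_steps` is a hypothetical helper, not part of DeepEval.

```python
def relevance_steps(question: str) -> list[str]:
    """Return the Relevance evaluation steps with the question filled in."""
    return [
        "Ensure the response directly addresses the main subject of the question, "
        f"specifically looking for a clear answer to '{question}'.",
        "Identify if any extraneous or unrelated information is included.",
        "Penalize responses that provide background information without "
        "answering the question directly.",
    ]


steps = relevance_steps("What is the capital of France?")
for step in steps:
    print("-", step)
```

The returned list can be passed as `evaluation_steps` when constructing the metric.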


In [11]:
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris."
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 1.0
Reason: The response directly addresses the main subject of the question and provides a clear answer to 'What is the capital of France?'


In [12]:
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Buenos Aires is known for its rich culture and famous landmarks... "
                  "and some say it looks a lot like France"
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.0
Reason: The response provides background information about Buenos Aires and makes a comparison to France, but fails to directly answer the question about the capital of France.


## 4. Fluency

Examines the grammar, sentence structure and naturalness of the language used in the response.

In [13]:
metric = GEval(
    name="Fluency",
    criteria="Evaluate if the current output is grammatically correct, well-structured, and uses natural language.",  # States what the metric should judge
    evaluation_steps=[
        "Check if the response is free of grammatical errors; if errors exist, identify them.",  # Checks for and identifies grammatical errors
        "Ensure the sentence structure is clear and well-formed; specify any issues with structure.",  # Ensures the sentence structure is clear and flags structural problems
        "Assess if the language sounds natural and fluent; indicate any awkward phrasing or unnatural language."  # Identifies awkward or unnatural phrasing
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)
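
This metric lists only `ACTUAL_OUTPUT` in `evaluation_params`, while Correctness also reads the input and the expected output. A quick pre-flight check, sketched below with plain dictionaries, can catch a test case that lacks a field its metric reads. The `REQUIRED_FIELDS` table mirrors the `evaluation_params` lists defined so far in this notebook; `missing_fields` is a hypothetical helper.

```python
# Field names follow the LLMTestCase keyword arguments used in this notebook.
REQUIRED_FIELDS = {
    "Correctness": ["input", "actual_output", "expected_output"],
    "Coherence": ["input", "actual_output"],
    "Relevance": ["input", "actual_output"],
    "Fluency": ["actual_output"],
}


def missing_fields(metric_name: str, case: dict) -> list[str]:
    """List the fields the metric reads that are absent or empty in the case."""
    return [f for f in REQUIRED_FIELDS[metric_name] if not case.get(f)]


case = {
    "input": "Who was the first president of Brazil?",
    "actual_output": "It was Deodoro da Fonseca.",
}
for name in REQUIRED_FIELDS:
    print(name, "->", missing_fields(name, case) or "ok")
```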



In [14]:
test_case = LLMTestCase(
    input="Explain the purpose of photosynthesis in plants.",
    actual_output="Photosynthesis allows plants to convert sunlight into energy. Through this process, plants produce oxygen and glucose, which they use for growth and energy storage. This process is essential for plant survival and supports life on Earth by producing oxygen."
)
metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.9
Reason: The text is free of grammatical errors, has clear and well-formed sentence structure, and natural language.


In [17]:
test_case = LLMTestCase(
    input="Explain the purpose of photosynthesis in plants.",
    actual_output="Photosynthesis very important process it make plant food from sun energy. Plants use water, light for make food oxygen make by plants."
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.4
Reason: The sentence structure is unclear, with multiple sentences that seem to be run together. The language sounds somewhat natural but contains awkward phrasing.


## 5. Consistency

Determines whether the response is internally consistent and free of contradictions.

In [18]:
metric = GEval(
    name="Consistency",
    criteria="Evaluate if the current output is internally consistent and does not contain contradictions.",
    evaluation_steps=[
        "Ensure there are no contradictions within the response.",
        "Check if the response maintains a consistent stance or information throughout.",
        "Identify and penalize any shifts in perspective or conflicting statements within the answer."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

In [19]:
test_case = LLMTestCase(
    input="Explain the importance of recycling.",
    actual_output=(
        "Recycling is essential for reducing waste and conserving natural resources. "
        "By reusing materials like paper, plastic, and metal, we reduce the need for new resources, "
        "which can help preserve ecosystems and reduce pollution. Recycling also reduces the amount of waste sent to landfills, "
        "which can decrease greenhouse gas emissions and help combat climate change."
    )
)
metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.9
Reason: The response maintains a consistent stance on recycling's importance, and there are no apparent contradictions or shifts in perspective.


In [20]:
test_case = LLMTestCase(
    input="Is exercise beneficial for mental health?",
    actual_output=(
        "Exercise is highly beneficial for mental health. Regular physical activity releases endorphins, "
        "which can improve mood and reduce stress. However, exercise does not significantly impact mental well-being, "
        "and some studies suggest that physical activity might not be effective for mental health improvement."
    )
)
metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.4
Reason: The text acknowledges both benefits and limitations of exercise on mental health, but fails to maintain a consistent stance or perspective throughout.
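
A judge model may not return the same score on every run, so a mid-range result such as the 0.4 above is easier to interpret when averaged over a few repetitions. The sketch below repeats `measure` and summarises the scores. `repeat_measure` and `_ScriptedMetric` are hypothetical, and the scripted scores are made-up placeholders that only make the snippet runnable offline; they are not results from the judge model.

```python
from statistics import mean, pstdev


def repeat_measure(metric, case, runs: int = 3):
    """Call metric.measure(case) `runs` times; return (mean, population std dev)."""
    scores = []
    for _ in range(runs):
        metric.measure(case)
        scores.append(metric.score)
    return mean(scores), pstdev(scores)


class _ScriptedMetric:
    """Hypothetical stand-in that replays placeholder scores (no LLM involved)."""

    def __init__(self, scores):
        self._scores = iter(scores)
        self.score = None

    def measure(self, case):
        self.score = next(self._scores)
        return self.score


# Placeholder values for illustration only; they are not measured results.
avg, spread = repeat_measure(_ScriptedMetric([0.25, 0.5, 0.75]), case=None, runs=3)
print(f"mean={avg:.2f} std={spread:.2f}")
```

A large spread relative to the gap between good and bad responses would suggest tightening the evaluation steps or using a stronger judge model.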


## 6. Level of Detail

Assesses the depth and specificity of the information given in the response.

In [21]:
metric = GEval(
    name="Detailing",
    criteria="Evaluate if the current output provides sufficient depth and specificity in the information given.",
    evaluation_steps=[
        "Check if the response covers key aspects of the topic with sufficient depth.",
        "Identify if any important details are missing or only briefly mentioned.",
        "Assess if the response avoids vague language and provides specific, informative content."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)


In [22]:
test_case = LLMTestCase(
    input="Explain the process and benefits of composting.",
    actual_output=(
        "Composting is the process of breaking down organic waste, such as food scraps and yard waste, "
        "into a nutrient-rich soil amendment. The process involves combining carbon-rich materials (like leaves and straw) "
        "with nitrogen-rich materials (like food scraps and grass clippings) in a balanced ratio. Microorganisms then decompose "
        "these materials in the presence of moisture and oxygen. Benefits of composting include reducing landfill waste, "
        "enriching soil with nutrients, improving soil structure, and enhancing water retention. This process also reduces the need "
        "for chemical fertilizers and decreases greenhouse gas emissions from organic waste in landfills."
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.9
Reason: The response covers key aspects of composting, provides a balanced ratio of carbon-rich and nitrogen-rich materials, and mentions benefits such as reducing landfill waste and enhancing water retention.


In [23]:
test_case = LLMTestCase(
    input="Explain the process and benefits of composting.",
    actual_output=(
        "Composting is a way to recycle."
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.4
Reason: The response lacks depth on the topic, fails to mention benefits, and uses vague language.


## 7. Originality

Checks whether the response offers new ideas or information and avoids repetition and clichés.

In [24]:
metric = GEval(
    name="Originality",
    criteria="Evaluate if the current output provides new ideas or information and avoids clichés or repetitive content.",  # States what the metric should judge
    evaluation_steps=[
        "Check if the response introduces fresh perspectives or unique information on the topic.",  # Checks for new ideas or perspectives
        "Identify if there is any repetitive content or overused phrases, and penalize them.",  # Penalizes clichés and repetition
        "Assess if the response is free from common or predictable statements, aiming for originality and creativity."  # Assesses whether the response avoids predictable statements
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

In [25]:
test_case = LLMTestCase(
    input="What are some unique ways to reduce plastic waste?",
    actual_output=(
        "One innovative approach to reducing plastic waste is to support companies that develop biodegradable packaging made from seaweed. "
        "Another idea is to use refill stations at grocery stores where customers bring their own containers for essentials like grains, oil, and cleaning products. "
        "Additionally, some communities are introducing plastic-free libraries, where people can borrow reusable items like bags and containers, minimizing the need for single-use plastics."
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.9
Reason: The response introduces fresh perspectives, avoids repetitive content, and presents original ideas for reducing plastic waste.


In [30]:
test_case = LLMTestCase(
    input="What are some unique ways to reduce plastic waste?",
    actual_output=(
        "To reduce plastic waste, remember to always recycle." 
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.4
Reason: The response lacks fresh perspectives, contains repetitive content, and relies on predictable statements.


## 8. Audience Appropriateness

Examines whether the response suits the target audience in language, formality and content.

In [31]:
metric = GEval(
    name="Audience Appropriateness",
    criteria="Evaluate if the current output is appropriate for a young audience (up to 5 years old) in terms of language, formality, and content.",  # States what the metric should judge
    evaluation_steps=[
        "Check if the language is simple and easy to understand for young children (up to 5 years old).",
        "Ensure that the content is age-appropriate, avoiding complex or mature topics.",
        "Assess if the tone is friendly and engaging, suitable for a young audience."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)


In [32]:
test_case = LLMTestCase(
    input="What is a rainbow?",
    actual_output=(
        "A rainbow is a big, colorful bow in the sky! It happens when sunshine and rain mix with raindrops. "
        "You might see red, yellow, blue, and lots of pretty colors. Isn’t it magical?"
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 1.0
Reason: The content is age-appropriate, the tone is friendly and engaging, and the language is simple and easy to understand for young children.


In [33]:
test_case = LLMTestCase(
    input="What is a rainbow?",
    actual_output=(
        "A rainbow forms when sunlight enters raindrops and bends, reflecting and refracting within the droplets. "
        "The light is split into multiple wavelengths, resulting in a spectrum. This process involves intricate optical physics, "
        "and the resulting visual effect is dependent on precise angles between the observer, sunlight, and water droplets."
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.0
Reason: The content is too complex for young children and lacks a friendly tone.


## 9. Empathy

Assesses whether the response shows understanding of and consideration for the other person's emotions or perspectives.

In [34]:
metric = GEval(
    name="Empathy",
    criteria="Evaluate if the current output demonstrates understanding and consideration for the emotions or perspectives of a sensitive audience.",
    evaluation_steps=[
        "Check if the response acknowledges and respects the emotions or concerns of the audience.",
        "Ensure that the language used is sensitive and avoids triggering phrases or dismissive tones.",
        "Assess if the response shows support, compassion, or reassurance where appropriate."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)


In [35]:
test_case = LLMTestCase(
    input="I'm feeling really anxious about my job security.",
    actual_output=(
        "I'm so sorry you're feeling this way. Job security concerns can be very overwhelming, "
        "and it’s completely natural to feel anxious about it. Remember that it’s okay to feel worried, "
        "and there are people who can support you during this time. Taking things one step at a time can help, "
        "and don’t hesitate to reach out if you need someone to talk to."
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.9
Reason: The response acknowledges and respects the emotions of the audience, uses sensitive language, and shows support and compassion.


In [36]:
test_case = LLMTestCase(
    input="I'm feeling really anxious about my job security.",
    actual_output=(
        "Just stop complaining and get back to work right now!!!"
    )
)

metric.measure(test_case)

print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")

Score: 0.0
Reason: The response acknowledges no emotions, uses dismissive language and tone, and lacks support or compassion.
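
To compare the nine metrics at a glance, the scores printed above can be collected into one table. The sketch below hard-codes the reported scores for the strongest and weakest response in each section and prints the gap between them. The dictionary is transcribed from the outputs above, and the layout is only one possible choice.

```python
# (best response score, worst response score) as printed in each section.
# Correctness also has an intermediate case (0.8) that is left out here.
REPORTED = {
    "Correctness": (1.0, 0.0),
    "Coherence": (0.9, 0.3),
    "Relevance": (1.0, 0.0),
    "Fluency": (0.9, 0.4),
    "Consistency": (0.9, 0.4),
    "Detailing": (0.9, 0.4),
    "Originality": (0.9, 0.4),
    "Audience Appropriateness": (1.0, 0.0),
    "Empathy": (0.9, 0.0),
}

# Round to avoid floating-point noise such as 0.6000000000000001.
gaps = {name: round(best - worst, 2) for name, (best, worst) in REPORTED.items()}

print(f"{'Metric':<26}{'Best':>6}{'Worst':>7}{'Gap':>6}")
for name, (best, worst) in REPORTED.items():
    print(f"{name:<26}{best:>6.1f}{worst:>7.1f}{gaps[name]:>6.1f}")

# min() returns the first metric with the smallest gap when several tie.
smallest = min(gaps, key=gaps.get)
print(f"Smallest gap: {smallest} ({gaps[smallest]})")
```

A small gap means the judge separated the good and bad responses less sharply for that metric, which may be a hint that its evaluation steps need tightening.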
