# Setup

In [None]:
!pip install -qqU deepeval datasets


This code installs two Python packages silently: deepeval and datasets.

The `!` at the start tells Jupyter notebook or Google Colab to run this as a shell command rather than Python code.

The `pip install` command downloads and installs packages from the Python Package Index (PyPI). The flags `-qqU` modify how pip behaves:
- `-qq` means "very quiet" - it suppresses most output messages
- `-U` means "upgrade" - it updates the packages to their latest versions if they're already installed

`deepeval` is an open-source framework for evaluating the outputs of large language models (LLMs), while `datasets` is the Hugging Face library for loading and processing datasets, including those hosted on the Hugging Face Hub. Installing them together sets up an environment for evaluating LLM conversations drawn from a public dataset.

The quiet installation flags indicate this is likely part of a larger notebook where the installation output isn't meant to distract from the main content.
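Because `-qq` hides pip's output, it can be useful to confirm afterwards which versions were installed. A minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of the notebook:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the two packages the cell above installs.
for name in ("deepeval", "datasets"):
    print(name, installed_version(name))
```

A `None` in the output would indicate that the quiet install failed without a visible message.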

In [None]:
# Standard library imports
import json
import os
from os import listdir
from os.path import isfile, join

# Third-party imports
import numpy as np
from google.colab import drive, userdata
from deepeval import evaluate
from deepeval.metrics import (
    ConversationalGEval,
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationCompletenessMetric,
    ConversationRelevancyMetric
)
from deepeval.test_case import LLMTestCase, ConversationalTestCase, LLMTestCaseParams

from datasets import load_dataset

This cell gathers the imports for an LLM evaluation project focused on analyzing conversations. Here is what each part brings in:

The standard library imports bring in essential Python tools:
- `json` handles reading and writing JSON data formats
- `os` and its specific functions (`listdir`, `join`, `isfile`) help work with files and directories across different operating systems

The third-party imports bring in specialized tools for machine learning evaluation:

From NumPy (`np`), we get powerful numerical computing capabilities, which form the foundation for most machine learning operations.

The Google Colab imports (`drive`, `userdata`) let the code interact with Google Drive storage and user-specific data in the Colab environment.

The `deepeval` imports set up a comprehensive conversation evaluation framework:
- `ConversationalGEval` scores a conversation against custom, user-defined criteria, using an LLM as the judge
- `RoleAdherenceMetric` checks whether the chatbot stays in its assigned role throughout the conversation
- `KnowledgeRetentionMetric` measures whether the chatbot retains information the user supplied earlier in the conversation
- `ConversationCompletenessMetric` assesses whether the chatbot satisfies the user's intentions over the course of the conversation
- `ConversationRelevancyMetric` evaluates whether each response is relevant to what came before it

The test case imports (`LLMTestCase`, `ConversationalTestCase`, `LLMTestCaseParams`) provide structures for organizing and running these evaluations systematically.

Finally, `load_dataset` from the `datasets` library downloads and loads datasets from the Hugging Face Hub; here it supplies the conversations that will be evaluated.


In [None]:
class CFG:
    model = 'gpt-4o-mini'
    temp = 0.3
    dataset = 'flpelerin/ChatAlpaca-10k'

This code defines a configuration class named `CFG` that acts as a central place to store important settings for a machine learning project. Let me explain how this works and why it's useful.

The class functions like a container for project-wide settings, following a common pattern in machine learning where configurations need to be easily accessible and modifiable. Inside the class, three key parameters are defined:

First, `model` is set to 'gpt-4o-mini', which names the language model the project intends to use. `gpt-4o-mini` is OpenAI's smaller, lower-cost variant of GPT-4o, trading some capability for speed and price.

Second, `temp` (short for temperature) is set to 0.3. Temperature is a crucial parameter in language models that controls how random or deterministic the model's outputs will be. A temperature of 0.3 is relatively low, which means the model will generate more focused, consistent responses rather than creative or diverse ones. Think of it like a thermostat - lower settings produce more predictable results.

Third, `dataset` points to 'flpelerin/ChatAlpaca-10k', which identifies the dataset to use. The `owner/name` format is a Hugging Face Hub repository ID, and the name suggests roughly 10,000 conversation examples in the Alpaca style.

By grouping these settings in a class, the code makes it simple to maintain consistent configurations across different parts of the project. If you need to change the model or adjust the temperature, you only need to modify it in one place rather than hunting through multiple files or functions. This organization style also makes it easier for team members to understand and modify the project's core settings.
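A short sketch of how such a configuration class is typically read and varied. The subclass `DeterministicCFG` is a hypothetical illustration, not something the notebook defines:

```python
class CFG:
    model = 'gpt-4o-mini'
    temp = 0.3
    dataset = 'flpelerin/ChatAlpaca-10k'


# Class attributes are read directly from the class; no instance is needed.
print(CFG.model, CFG.temp, CFG.dataset)


# Hypothetical variant for an experiment: override one setting, inherit the rest.
class DeterministicCFG(CFG):
    temp = 0.0


print(DeterministicCFG.temp, DeterministicCFG.model)
```

Subclassing keeps the original settings untouched, which may be preferable to editing `CFG` in place when comparing runs.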

In [None]:
# setup OpenAI connection
api_key = userdata.get('openaivision')
os.environ['OPENAI_API_KEY'] = api_key

This cell makes an OpenAI API key available to the libraries that call OpenAI's models. The notebook runs in Google Colab, so the key is read from Colab's secret storage rather than typed into the code.

The first line `api_key = userdata.get('openaivision')` retrieves an API key that's been securely stored in Google Colab's user data system. Think of this like retrieving a special password from a secure vault - the 'openaivision' parameter tells Colab which specific key to fetch. This approach is much safer than hardcoding the API key directly in the code, where it could be accidentally shared or exposed.

The second line `os.environ['OPENAI_API_KEY'] = api_key` copies that key into an environment variable of the running Python process. Environment variables live only as long as the process, and libraries such as the OpenAI client read `OPENAI_API_KEY` from there, so the key does not have to be passed around explicitly. Note that an environment variable is not encrypted: any code running in the same session can read it.

The name 'openaivision' is simply the label under which the secret was saved in Colab; the name by itself does not restrict the key to any particular OpenAI service. Keeping the credential in the secret store follows the common practice of keeping secrets out of source code.

Because the notebook runs in Google Colab, the environment variable disappears when the session's runtime shuts down, so the key is not left behind in the notebook file.
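Since later cells depend on this variable, one option is to fail early with a clear message when it is missing. A minimal sketch; the helper `require_env` and the placeholder key are hypothetical, and a stand-in dictionary replaces `os.environ` so that no real credential is needed:

```python
import os


def require_env(name, env=os.environ):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; store the key as a Colab secret first")
    return value


# Demonstration with a stand-in mapping instead of the real environment.
fake_env = {"OPENAI_API_KEY": "sk-placeholder"}
print(require_env("OPENAI_API_KEY", env=fake_env)[:3])
```

In the notebook itself, the call would be `require_env("OPENAI_API_KEY")` right after the assignment to `os.environ`.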

# Functions

In [None]:
# largest even number no greater than k
def laevnu(k):

    if k % 2 == 0:
        return k

    return k - 1

Let me explain this function, which finds the largest even number that doesn't exceed a given input number. Let's break it down step by step to understand how it works.

The function name `laevnu` appears to be an acronym for "largest even number no greater than k", where k is the input parameter. When you call this function with any number k, it will return the largest possible even number that's less than or equal to k.

Here's how the function makes its decision:

First, it checks if k is already even by using the modulo operator `%`. When you divide a number by 2, the modulo operation gives you the remainder. If k % 2 equals 0, that means k divides evenly by 2 with no remainder – in other words, k is even. In this case, the function simply returns k since it's already the largest even number no greater than itself.

If k is not even (meaning k % 2 equals 1), then k must be odd. In this case, we need to go down by 1 to reach the next even number. The function does this with `k - 1`. This works because every odd number has an even number exactly one less than it.

For example:
- If k = 8, it's even, so the function returns 8
- If k = 7, it's odd, so the function returns 6
- If k = 4, it's even, so the function returns 4
- If k = 3, it's odd, so the function returns 2

This function is quite efficient because it makes just one simple check and at most one simple calculation, regardless of how large the input number is. It's a good example of how understanding the mathematical relationship between even and odd numbers lets us solve what might seem like a searching problem with just a single comparison.
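The same result can be written without a branch. The sketch below compares the original function with a hypothetical one-line variant, `laevnu_short`, over a small range of integers (including negatives):

```python
def laevnu(k):
    if k % 2 == 0:
        return k
    return k - 1


def laevnu_short(k):
    # k % 2 is 0 for even integers and 1 for odd ones, so subtracting it
    # leaves even numbers unchanged and steps odd numbers down by one.
    return k - (k % 2)


# Both versions agree on every integer checked here.
for k in range(-5, 10):
    assert laevnu(k) == laevnu_short(k)

print([laevnu_short(k) for k in (8, 7, 4, 3)])  # [8, 6, 4, 2]
```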

In [None]:
def create_list_of_test_cases(dd):
    # pair each human message (even index) with the reply that follows it (odd index)
    test_cases = []
    for jj in range(0, laevnu(len(dd)), 2):
        tc = LLMTestCase(input=dd[jj]['value'], actual_output=dd[jj + 1]['value'])
        test_cases.append(tc)

    return test_cases

Let me explain this function from the ground up. It's designed to pair up items from a dataset into test cases that can evaluate how well an AI model performs. I'll break it down step by step and explain the underlying concepts.

First, let's understand what this function is trying to achieve. To evaluate a conversation, we need each input paired with the response the model actually gave. Think of each pair as one exchange in a transcript: the user's message followed by the chatbot's reply. This function automates that pairing process.

The function takes a parameter `dd`, the raw conversation to be converted. This data is a list of dictionaries, each with a 'value' key holding the message text (records in the real dataset also carry a 'from' key naming the speaker). A simplified sequence looks like this:
```python
dd = [
    {'value': 'What is the capital of France?'},
    {'value': 'Paris'},
    {'value': 'What is 2+2?'},
    {'value': '4'}
]
```

The function processes this data in three main steps:

1. It starts by creating an empty list called `test_cases`. This will be our collection of paired questions and answers.

2. The core of the function is a loop that processes items two at a time. The loop uses `range(0, laevnu(len(dd)), 2)`, which means:
   - Start at index 0
   - Count up by 2s (to grab pairs of items)
   - Stop at the largest even number no greater than the length of dd (that's what laevnu does)

   This clever use of `laevnu` ensures we only process complete pairs. If we have an odd number of items, the last unpaired item is safely ignored, preventing errors.

3. For each pair of items, the function creates an `LLMTestCase` object, the structure deepeval uses to represent a single LLM interaction (LLM stands for Large Language Model). It takes two parameters:
   - `input`: The first item of the pair (the human message)
   - `actual_output`: The second item of the pair (the reply the model actually gave, which is what the metrics will judge)

Here's what happens in practice with our example data:
```text
First iteration (jj = 0):
- input = "What is the capital of France?"
- actual_output = "Paris"

Second iteration (jj = 2):
- input = "What is 2+2?"
- actual_output = "4"
```

Each test case is added to our list, and finally the complete collection is returned. The resulting list of turns can then be wrapped in a `ConversationalTestCase` and scored by the conversational metrics.

The strength of this function lies in its simplicity. By processing items in pairs and using `laevnu` to cap the range, it avoids an index error when a trailing message has no partner. It does assume that the messages strictly alternate, human first and gpt second; it never checks the 'from' key.
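The pairing logic can be tried without deepeval. The sketch below is a hypothetical variant, `pair_turns`, that returns plain tuples instead of `LLMTestCase` objects and, unlike the notebook's function, also checks the 'from' key of each message:

```python
def pair_turns(messages):
    """Pair each 'human' message with the 'gpt' reply that follows it."""
    pairs = []
    # Stepping by 2 up to len - 1 skips a trailing unpaired message.
    for i in range(0, len(messages) - 1, 2):
        question, answer = messages[i], messages[i + 1]
        if question.get('from') != 'human' or answer.get('from') != 'gpt':
            raise ValueError(f"messages {i} and {i + 1} do not alternate human/gpt")
        pairs.append((question['value'], answer['value']))
    return pairs


sample = [
    {'from': 'human', 'value': 'Find the product of the numbers: 5 and 8'},
    {'from': 'gpt', 'value': 'The product of 5 and 8 is 40.'},
    {'from': 'human', 'value': 'What is the square of 9?'},  # no reply: ignored
]
print(pair_turns(sample))
```

Here `range(0, len(messages) - 1, 2)` visits the same indices as `range(0, laevnu(len(dd)), 2)`, so the helper function is not needed in this variant.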

# Data

In [None]:
# load the conversational dataset
dataset = load_dataset(CFG.dataset, split = "train")

This line loads the data that the rest of the notebook evaluates. It fetches a specific dataset of conversations and makes it available as a Python object.

The `load_dataset` function comes from the Hugging Face datasets library, which serves as a central hub for machine learning datasets - think of it like a digital library where researchers and developers can access collections of organized data. This function reaches out to that library and retrieves the dataset we want to work with.

The function takes two key parameters that tell it exactly what data we want and how we want it organized:

First, `CFG.dataset` refers to the dataset identifier we specified earlier in our configuration class, in this case 'flpelerin/ChatAlpaca-10k'. The format of this identifier tells us two things: 'flpelerin' is the Hub user or organization that owns the dataset, and 'ChatAlpaca-10k' suggests a collection of roughly 10,000 conversation examples in the style of the Alpaca instruction-following data.

Second, the `split = "train"` parameter selects the split named "train". Datasets are often divided into splits, typically training, validation, and testing. Although the split is called "train", this notebook does not train a model; it draws existing conversations from that split and scores them with the metrics introduced later.

Let's look at what happens when this code runs:
1. The function connects to the Hugging Face dataset repository
2. It locates the specific ChatAlpaca-10k dataset
3. It downloads the training split of that dataset
4. It returns the data in a format that's ready for machine learning tasks

The resulting `dataset` variable holds structured conversation data. Each entry contains a `conversations` list of alternating human and gpt messages, as the next cells show.


In [None]:
dataset[0]['conversations']

[{'from': 'human', 'value': 'Find the product of the numbers: 5 and 8'},
 {'from': 'gpt', 'value': 'The product of 5 and 8 is 40.'},
 {'from': 'human', 'value': 'What is the sum of the numbers 6 and 12?'},
 {'from': 'gpt', 'value': 'The sum of the numbers 6 and 12 is 18.'},
 {'from': 'human', 'value': 'Can you tell me the quotient of 20 and 5?'},
 {'from': 'gpt', 'value': 'Yes, the quotient of 20 and 5 is 4.'},
 {'from': 'human', 'value': 'What is the difference between 25 and 13?'},
 {'from': 'gpt', 'value': 'The difference between 25 and 13 is 12.'},
 {'from': 'human', 'value': 'What is the square of 9?'},
 {'from': 'gpt', 'value': 'The square of 9 is 81.'},
 {'from': 'human', 'value': 'What is the cube of 6?'},
 {'from': 'gpt', 'value': 'The cube of 6 is 216.'}]

In [None]:
dataset[12]['conversations']

[{'from': 'human',
  'value': 'Find the angle in the triangle using trigonometry: AB=5, BC=7, AC=9'},
 {'from': 'gpt', 'value': 'Angle A = 65.7 degrees'},
 {'from': 'human', 'value': 'What is the value of Angle B in the triangle?'},
 {'from': 'gpt',
  'value': 'To find the value of angle B in the triangle using trigonometry, we can use the Law of Cosines:\n\ncos(B) = (a^2 + c^2 - b^2) / 2ac\n\nwhere a, b, and c are the lengths of the sides opposite to angles A, B, and C, respectively.\n\nIn this case, we have:\n\na = 5 (opposite to angle A)\nb = ?\nc = 7 (opposite to angle B)\n\nWe can use the Law of Cosines again to find the value of b:\n\nb^2 = a^2 + c^2 - 2ac cos(B)\nb^2 = 5^2 + 7^2 - 2(5)(7) cos(B)\nb^2 = 25 + 49 - 70 cos(B)\nb^2 = 74 - 70 cos(B)\n\nNow we can substitute this expression for b^2 into the first equation:\n\ncos(B) = (a^2 + c^2 - b^2) / 2ac\ncos(B) = (5^2 + 7^2 - (74 - 70 cos(B))) / (2)(5)(7)\ncos(B) = (74 - 70 cos(B)) / 70\n70 cos(B) = 74 - cos(B)\n71 cos(B) = 74\nco

In [None]:
dd = dataset[0]['conversations']
dd

[{'from': 'human', 'value': 'Find the product of the numbers: 5 and 8'},
 {'from': 'gpt', 'value': 'The product of 5 and 8 is 40.'},
 {'from': 'human', 'value': 'What is the sum of the numbers 6 and 12?'},
 {'from': 'gpt', 'value': 'The sum of the numbers 6 and 12 is 18.'},
 {'from': 'human', 'value': 'Can you tell me the quotient of 20 and 5?'},
 {'from': 'gpt', 'value': 'Yes, the quotient of 20 and 5 is 4.'},
 {'from': 'human', 'value': 'What is the difference between 25 and 13?'},
 {'from': 'gpt', 'value': 'The difference between 25 and 13 is 12.'},
 {'from': 'human', 'value': 'What is the square of 9?'},
 {'from': 'gpt', 'value': 'The square of 9 is 81.'},
 {'from': 'human', 'value': 'What is the cube of 6?'},
 {'from': 'gpt', 'value': 'The cube of 6 is 216.'}]
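Since the pairing function relies on messages alternating between 'human' and 'gpt', a quick structural check on a record can be worthwhile. A minimal sketch on a two-message excerpt of the record above (the full list has twelve messages); the variable names are illustrative:

```python
from collections import Counter

# Two messages copied from the record above.
conversation = [
    {'from': 'human', 'value': 'Find the product of the numbers: 5 and 8'},
    {'from': 'gpt', 'value': 'The product of 5 and 8 is 40.'},
]

# Count messages per speaker.
speaker_counts = Counter(message['from'] for message in conversation)

# True only if even positions are 'human' and odd positions are 'gpt'.
alternates = all(
    message['from'] == ('human' if index % 2 == 0 else 'gpt')
    for index, message in enumerate(conversation)
)
print(speaker_counts, alternates)
```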

# Metrics


In [None]:
dlist = create_list_of_test_cases(dataset[12]['conversations'])

## Role adherence

In [None]:
# wrap the turns in a conversational test case and state the role the chatbot is expected to play
convo_test_case = ConversationalTestCase(
    chatbot_role = "You are a helpful and polite assistant",
    turns = dlist)


metric = RoleAdherenceMetric(threshold=0.5)

metric.measure(convo_test_case)
print(metric.score)
metric.reason

Output()

0.3333333333333333


'The score is 0.3333333333333333 because the LLM chatbot responses in turns #2 and #3 deviated significantly from the role of a "helpful and polite assistant." \n\nIn turn #2, the response given is: \'To find the value of angle B in the triangle using trigonometry, we can use the Law of Cosines... B = 21.37 degrees.\' This response is overly technical and lengthy, making it difficult for a layperson to understand quickly. It lacks the conduciveness of a polite assistant by not simplifying or breaking down the explanation in a more approachable manner, potentially overwhelming the user.\n\nIn turn #3, the response \'To find the value of angle C in the triangle using trigonometry, we can use the Law of Cosines... C = 29.1 degrees.\' not only continues to offer intricate explanations but also contains a calculation error. This deviation results in the bot being perceived as unhelpful and inaccurate, contrasting the expected behavior of a helpful role. These two instances severely affected
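The reported value is consistent with a score computed as the fraction of turns that adhered to the role. The sketch below reproduces the arithmetic under two assumptions that the truncated output does not confirm: that the conversation yields three turns, and that the metric is this simple ratio.

```python
from fractions import Fraction

total_turns = 3          # assumption: the conversation yields three test-case turns
flagged_turns = {2, 3}   # turns named in the reason above

# Turns that were not flagged count as adhering to the role.
adhering = total_turns - len(flagged_turns)
score = Fraction(adhering, total_turns)
print(float(score))
```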

## Knowledge retention

In [None]:

test_case = ConversationalTestCase(turns = dlist )
metric = KnowledgeRetentionMetric(threshold = 0.5)

metric.measure(test_case)
print(metric.score)
metric.reason

Output()

1.0


'The score is 1.00 because there are no attritions, indicating perfect retention of knowledge throughout the conversation.'
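Every metric in this notebook is created with `threshold=0.5`. The sketch below tabulates the four scores recorded in the notebook's outputs (two of them appear in the sections that follow) against that threshold. It assumes a metric passes when its score is at least the threshold, which is how deepeval's `is_successful()` is understood to behave; treat that as an assumption rather than a documented guarantee.

```python
# Scores copied from the notebook outputs; every metric used threshold=0.5.
THRESHOLD = 0.5
scores = {
    "role adherence": 0.3333333333333333,
    "knowledge retention": 1.0,
    "conversation completeness": 0.5,
    "conversation relevancy": 0.6666666666666666,
}

# Assumption: a metric passes when its score is at least the threshold.
passed = {name: score >= THRESHOLD for name, score in scores.items()}
for name, ok in passed.items():
    print(f"{name:27s} {scores[name]:.2f} {'pass' if ok else 'fail'}")
```

Under that assumption only role adherence falls below the bar, while conversation completeness sits exactly on it.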

## Conversation completeness

In [None]:
test_case = ConversationalTestCase(turns = dlist )
metric = ConversationCompletenessMetric(threshold=0.5)

metric.measure(test_case)
print(metric.score)
metric.reason

Output()

0.5


"The score is 0.5 because the LLM response partially meets the user's intention by calculating angles A and C using trigonometry. However, it falls short of the user's request for a detailed and accurate step-by-step approach, specifically for angle B. The response included inconsistencies in calculations for angle B, such as 'mixing up known lengths,' which undermines the reliability of solving all angles as the user desired."

## Conversation relevancy

In [None]:
test_case = ConversationalTestCase(turns = dlist )
metric = ConversationRelevancyMetric(threshold=0.5)

metric.measure(test_case)
print(metric.score)
metric.reason

Output()

0.6666666666666666


"The score is 0.67 because message number 3 contains an incorrect approach to calculating angle C in a triangle, using inconsistent values for sides 'a' and 'b'. This irrelevance stems from using the Pythagorean theorem incorrectly, leading to an erroneous solution for angle C, impacting the overall relevance of the actual outputs."