# Notebook-II: Tutorial on LLM-Based Evaluation with AITutor-AssessmentKit  

Welcome to this tutorial on evaluating large language model (LLM)-based AI tutors using automated evaluation metrics provided by [AITutor-AssessmentKit]() on [MRBench]() data. This tutorial demonstrates how to leverage state-of-the-art evaluation techniques to assess the pedagogical efficacy of AI tutors.  

## Key Features  

- **Evaluation Across 8 Pedagogical Dimensions**:  
  Drawing inspiration from the foundational principles of learning proposed by Maurya et al. (2024), this evaluation framework focuses on the following dimensions:  
  1. *Mistake Identification*  
  2. *Mistake Location*  
  3. *Revealing the Answer*  
  4. *Providing Guidance*  
  5. *Actionability*  
  6. *Coherence*  
  7. *Tutor Tone*  
  8. *Humanlikeness*  

- **Assessment of Student Mistake Remediation in the Mathematical Domain**:  
  For a given partial conversation between a tutor and a student, where the student's last utterance demonstrates a mistake or confusion, the automated evaluation provides detailed insights into the tutor's performance across the specified dimensions.  

- **Evaluation with LLMs as Critics/Evaluators**:  
  The AITutor-AssessmentKit leverages open-source LLMs as evaluators to assess the pedagogical efficacy of tutor responses and generate scores for each dimension. While any LLM can be employed, this tutorial demonstrates the process using the *Prometheus2* LLM as an example.   

## Objectives  

By the end of this tutorial, you will:  
1. Understand how to use LLMs to evaluate AI tutors on each pedagogical dimension.  
2. Display tutor responses alongside their corresponding LLM-based evaluation scores for a selected dimension.  
3. Compare responses and evaluations from two tutors using LLMs as critics for specific dimensions.  
4. Generate and save comprehensive evaluation reports across all dimensions.  

This hands-on tutorial is designed to equip you with the expertise and tools necessary to systematically evaluate and enhance the effectiveness of AI tutors in addressing student challenges.  

---
## LLMEval Overview 
The table below summarizes the methods, features, and modules of the `LLMEvaluator` for the various pedagogical dimensions.

| Method Name                          | Functionality                                                        | How to Call                                    |
|--------------------------------------|----------------------------------------------------------------------|-----------------------------------------------|
| `__init__`                           | Initializes the evaluator with models, evaluation settings, and GPU configuration. | `__init__(llm_model_name, evaluation_type, ...)` |
| `_get_conversation_prompt`           | Generates conversation prompts from input messages.                   | `_get_conversation_prompt(messages)`           |
| `_get_data_with_prompt_template`     | Prepares data using prompt templates for evaluation.                 | `_get_data_with_prompt_template(data, tutor_model)` |
| `_get_eval_rubric`                   | Retrieves the rubric for a specific pedagogical dimension.           | `_get_eval_rubric(dimension)`                 |
| `compute_scores`                     | Computes evaluation scores for a given dimension.                    | `compute_scores(dimension, rubric, ...)`      |
| `compute_mistake_identification`     | Computes scores for mistake identification.                          | `compute_mistake_identification()`            |
| `compute_mistake_location`           | Computes scores for mistake location.                                | `compute_mistake_location()`                  |
| `compute_revealing_of_the_answer`    | Computes scores for revealing the answer.                            | `compute_revealing_of_the_answer()`           |
| `compute_providing_guidance`         | Computes scores for providing guidance to the student.               | `compute_providing_guidance()`                |
| `_calculate_nli_score`               | Computes NLI-based coherence scores for conversation consistency.    | `_calculate_nli_score(convs, tutor_model)`    |
| `_calculate_bert_score`              | Computes BERTScore-based coherence scores for conversation quality.  | `_calculate_bert_score(convs, tutor_model)`   |
| `list_available_metrics`             | Lists all available metrics and their descriptions for evaluation.   | `list_available_metrics()`                   |
| `get_sample_examples_with_scores`    | Retrieves examples with specific scores for a given metric.          | `get_sample_examples_with_scores(...)`       |
| `compare_tutors_scores`              | Compares scores between two tutor models for a specific pedagogical dimension. | `compare_tutors_scores(...)`  |

---

### **Suggested Order for Testing/Usage**
1. Select a dimension
2. **Use `compute_scores`** to calculate scores for different pedagogical dimensions.
3. Call **`_get_eval_rubric`** to retrieve specific rubrics for the dimension being evaluated.
4. Use **`_get_conversation_prompt`** to understand the generated prompts for input data.
5. **Call individual methods** such as `compute_mistake_identification`, `compute_mistake_location`, `compute_revealing_of_the_answer`, and `compute_providing_guidance` to evaluate the finer aspects of tutoring.
6. Retrieve **examples with scores** using **`get_sample_examples_with_scores`** for a deeper analysis of the data.
7. **Compare results across different models** using **`compare_tutors_scores`** to identify variations in evaluation outcomes.

---



In [1]:
import os
import sys

# Set the CUDA device for execution
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Add the parent directory to the system path
sys.path.insert(0, os.path.abspath(".."))

# Import required libraries
from aitutor_assessmentkit.llmevaluator import LLMEvaluator



## Initialization of LLMEvaluator

In this section, we initialize the `LLMEvaluator` object with specific parameters required for evaluation. The evaluator will use a pre-defined language model (LLM) to assess the performance of various tutor models based on the provided input data and evaluation criteria.

### Key Parameters:
- **LLM Model**: We specify the language model to use (e.g., `prometheus-eval/prometheus-7b-v2.0`) and additional parameters such as `max_tokens` and `temperature`.
- **Evaluation Type**: The type of evaluation (`absolute` or `relative`) is set to guide how the scores will be calculated.
- **Prompting Type**: The evaluation can be set to `zero-shot` or `few-shot` depending on how the model is prompted.
- **Input Files**: The location of the JSON files containing the data to be evaluated.
- **Output Directory**: Directory where the evaluation results will be saved.
- **Tutor Models**: A list of tutor models to compare during the evaluation process.
- **GPU and Resources**: The number of GPUs to use and the number of conversation examples to be processed.

The `LLMEvaluator` is set up with these parameters to run the evaluation across different tutor models, producing insights into their relative performance.
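
As a rough sketch of how these options fit together, the snippet below collects them in a plain dictionary and checks the enumerated settings before any model is loaded. The helper `validate_config`, the constant names, and the placeholder paths are hypothetical and are not part of AITutor-AssessmentKit; only the option values (`absolute`/`relative`, `zero-shot`/`few-shot`, `-1` for all examples) come from this tutorial.

```python
# Hypothetical pre-flight check; not an AITutor-AssessmentKit API.
ALLOWED_EVALUATION_TYPES = {"absolute", "relative"}
ALLOWED_PROMPTING_TYPES = {"zero-shot", "few-shot"}


def validate_config(config: dict) -> dict:
    """Raise ValueError if an option falls outside the values described above."""
    if config["evaluation_type"] not in ALLOWED_EVALUATION_TYPES:
        raise ValueError(f"Unknown evaluation_type: {config['evaluation_type']!r}")
    if config["prompting_type"] not in ALLOWED_PROMPTING_TYPES:
        raise ValueError(f"Unknown prompting_type: {config['prompting_type']!r}")
    if config["ngpus"] < 1:
        raise ValueError("ngpus must be at least 1")
    n_examples = config["num_conv_examples"]
    if n_examples != -1 and n_examples < 1:
        raise ValueError("num_conv_examples must be -1 (all) or a positive integer")
    return config


config = validate_config({
    "llm_model_name": "prometheus-eval/prometheus-7b-v2.0",
    "evaluation_type": "absolute",
    "prompting_type": "zero-shot",
    "file_names": ["path/to/MRBench.json"],   # placeholder path
    "output_data_dir": "path/to/outputs",     # placeholder path
    "ngpus": 1,
    "num_conv_examples": -1,                  # -1 means "use all examples"
})
print(config["evaluation_type"], config["prompting_type"])
```

Checking the options up front is cheap compared with loading a 7B model and only then discovering a misspelled setting.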


In [2]:
# Initialize the LLMEvaluator with specified parameters
evaluator = LLMEvaluator(
    llm_model_name="prometheus-eval/prometheus-7b-v2.0",  # Specify the LLM model to use
    llm_model_parama={"max_tokens": 1024, "temperature": 0.0},
    evaluation_type='absolute',  # Set the evaluation type (absolute or relative)
    prompting_type='zero-shot',  # Specify the prompting type (zero-shot or few-shot)
    file_names=["/home/kaushal.maurya/AITutor_AssessmentKit/data/MRBench_V5.json"],
    output_data_dir='/home/kaushal.maurya/AITutor_AssessmentKit/outputs',  # Directory for output data
    with_ref=False,  # Whether to include reference answers
    ngpus=1,  # Number of GPUs to use
    num_conv_examples=-1  # Number of conversation examples to evaluate (-1 for all)
)

INFO 12-15 11:45:20 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='prometheus-eval/prometheus-7b-v2.0', speculative_config=None, tokenizer='prometheus-eval/prometheus-7b-v2.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=prometheus-eval/prometheus-7b-v2.0, use_v2_block_manager=False, num_scheduler_steps=1,

Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]


INFO 12-15 11:45:25 model_runner.py:1025] Loading model weights took 13.4966 GB
INFO 12-15 11:45:28 gpu_executor.py:122] # GPU blocks: 12874, # CPU blocks: 2048
INFO 12-15 11:45:31 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-15 11:45:31 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 12-15 11:45:41 model_runner.py:1456] Graph capturing finished in 11 secs.


Loading data: 100%|██████████| 1/1 [00:00<00:00, 64.68it/s]


Loaded 200 examples from /home/kaushal.maurya/AITutor_AssessmentKit/data/MRBench_V5.json


Cleaning Data: 100%|██████████| 200/200 [00:00<00:00, 42213.20it/s]


## Evaluation Dimension: Mistake Identification

This section evaluates the *Mistake Identification* capabilities of the selected tutor models using the `compute_mistake_identification` function.

In [None]:
# Perform mistake identification evaluation with selected tutor models
scores_tutor, error_percenetage, raw_scores, annoated_data = evaluator.compute_mistake_identification(
    tutor_models=['Novice', 'Expert', 'Llama31405B', 'GPT4', 'Sonnet', 'Phi3', 'Llama318B', 'Mistral', 'Gemini']
)

# Output the evaluation scores
print(scores_tutor)

Sanity Check for Tutor Models: 100%|██████████| 200/200 [00:00<00:00, 582947.05it/s]
Processed prompts: 100%|██████████| 55/55 [00:07<00:00,  7.55it/s, est. speed input: 3708.07 toks/s, output: 1141.58 toks/s]
Finalizing: 100%|██████████| 55/55 [00:00<00:00, 27803.63it/s]
Processed prompts: 100%|██████████| 200/200 [00:26<00:00,  7.59it/s, est. speed input: 5598.33 toks/s, output: 1167.67 toks/s]
Finalizing: 100%|██████████| 200/200 [00:00<00:00, 32981.87it/s]
Processed prompts: 100%|██████████| 200/200 [00:28<00:00,  6.93it/s, est. speed input: 5364.03 toks/s, output: 1173.13 toks/s]
Finalizing: 100%|██████████| 200/200 [00:00<00:00, 33263.05it/s]
Processed prompts: 100%|██████████| 200/200 [00:29<00:00,  6.74it/s, est. speed input: 5188.68 toks/s, output: 1138.60 toks/s]
Finalizing: 100%|██████████| 200/200 [00:00<00:00, 32618.92it/s]
Processed prompts: 100%|██████████| 200/200 [00:28<00:00,  6.97it/s, est. speed input: 5219.00 toks/s, output: 1076.52 toks/s]
Finalizing: 100%|███████

{'Novice': 1.073, 'Expert': 1.285, 'Llama31405B': 2.35, 'GPT4': 2.465, 'Sonnet': 1.766, 'Phi3': 1.294, 'Llama318B': 2.317, 'Mistral': 2.23, 'Gemini': 1.985}





In [4]:
error_percenetage

{'Novice': 0.0,
 'Expert': 0.0,
 'Llama31405B': 0.0,
 'GPT4': 0.0,
 'Sonnet': 1.5,
 'Phi3': 1.5,
 'Llama318B': 0.5,
 'Mistral': 0.0,
 'Gemini': 0.5}
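
One plausible reading of these values, and it is only an assumption here, is that the error percentage is the share of evaluator outputs from which no valid score could be extracted. With 200 responses per tutor, 1.5 would then correspond to 3 such outputs and 0.5 to 1. The sketch below illustrates that reading; the `[RESULT] <score>` output shape, the regular expression, and both helper functions are assumptions for illustration, not the library's implementation.

```python
import re

# Assumed output shape: free-text feedback followed by "[RESULT] <score>",
# with scores restricted to the 1-3 scale used by the rubrics in this tutorial.
RESULT_PATTERN = re.compile(r"\[RESULT\]\s*([1-3])\b")


def parse_score(text):
    """Return the integer score found in an evaluator output, or None."""
    match = RESULT_PATTERN.search(text)
    return int(match.group(1)) if match else None


def error_percentage(outputs):
    """Percentage of outputs from which no valid score could be parsed."""
    failures = sum(1 for text in outputs if parse_score(text) is None)
    return round(100.0 * failures / len(outputs), 2)


# Synthetic outputs: 197 parseable, 3 without a usable score.
outputs = ["The tutor points out the error. [RESULT] 3"] * 197
outputs += ["Feedback that never states a score."] * 3
print(error_percentage(outputs))
```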

In [5]:
raw_scores

{'Novice': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  1],
 'Expert': [1,
  1,
  3,
  1,
  1,
  1,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  3,
  1,
  1,
  1,
  3,
  3,
  1,
  1,
  2,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  3,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  2,
  3,
  1,
  1,
  2,
  1,
  1,
  2,
  1,
  2,
  1,
  2,
  1,
  2,
  3,
  1,
  2,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  2,
  1,
  1,
  1,
  1,
  1,
  2,
  2,
  1,
  1,
  1,
  3,
  1,
  1,
  2,
  1,
  1,
  1,
  1,
  1,
  2,
  3,
  2,
  1,
  1,
  1,
  1,
  1,
  2,
  3,
  1,
  2,
  1,
  1,
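
The relationship between these raw per-response scores and the aggregate printed earlier can be checked by hand. The complete Novice list above holds 55 scores (51 ones and 4 twos). Assuming the aggregate is the arithmetic mean rounded to three decimals, which matches the printed value but is not confirmed from the library source, the check looks like this:

```python
from statistics import mean

# Same multiset of values as the Novice list above; order does not affect the mean.
novice_raw = [1] * 51 + [2] * 4

novice_mean = round(mean(novice_raw), 3)
print(len(novice_raw), novice_mean)  # 55 1.073
```

The result agrees with the `'Novice': 1.073` entry in the printed score dictionary.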

In [6]:
print(annoated_data)

[{'conversation_id': '930-b01cb51d-748d-460c-841a-08e4d5cd5cc7', 'conversation_history': '||| tutor: hi, could you please provide a step-by-step solution for the question below? the question is: elliott is trying to walk 10,000 steps a day. he finished half of his steps on his walks to and from school and did another 1,000 steps going for a short walk with his friend. he also went for a short jog around the block and realized that after he had finished his jog, he only had 2,000 steps left to take. how many steps did elliott take during his jog? ||| student: elliott finished half of his steps on his walks to and from school, so he took 10,000/2 = 5000 steps during these walks.\nadding the 1,000 steps he took with his friend, he has taken 5000+1000 = 6000 steps.\nsubtracting 6000 from his goal of 10,000, he has 10,000-6000 = 4000 steps left to take.\ntherefore, he took 4000 steps during his jog.\n4000 ||| tutor: can you tell me how you got to your answer? ||| student: sure. i started by
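
The `conversation_history` field stores the whole dialogue as one string, with turns separated by `|||` and each turn prefixed by its speaker. A minimal sketch for splitting such a string back into turns follows; `parse_history` is a hypothetical helper written for this tutorial, and the sample string is a shortened excerpt in the same shape as the record above.

```python
def parse_history(history):
    """Split a '||| speaker: utterance' string into (speaker, utterance) pairs."""
    turns = []
    for chunk in history.split("|||"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece before the first separator
        # Split on the first colon only, so colons inside the utterance survive.
        speaker, _, utterance = chunk.partition(":")
        turns.append((speaker.strip(), utterance.strip()))
    return turns


history = (
    "||| tutor: can you tell me how you got to your answer? "
    "||| student: he took 10,000/2 = 5000 steps during these walks."
)
turns = parse_history(history)
for speaker, utterance in turns:
    print(f"{speaker:>8}: {utterance}")
```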

In [7]:
# Perform mistake identification evaluation using the selected tutor models
scores, error_percentage, raw_scores, annotated_data = evaluator.compute_mistake_identification(
    tutor_models=['Expert', 'GPT4'],  # List of tutor models to evaluate
    definition="Mistake Identification is defined as the degree to which the tutor accurately recognizes the presence of an error in the student’s previous response." # user-defined definition of mistake identification
)

# Output the evaluation scores to assess the performance of the models in mistake identification
print(scores)



Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 274137.52it/s]
Processed prompts: 100%|██████████| 10/10 [00:05<00:00,  1.95it/s, est. speed input: 1381.27 toks/s, output: 323.23 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 29454.38it/s]
Processed prompts: 100%|██████████| 10/10 [00:05<00:00,  1.97it/s, est. speed input: 1443.89 toks/s, output: 323.33 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 31847.41it/s]
Computing Mistake_Identification scores for tutor models: 100%|██████████| 2/2 [00:10<00:00,  5.12s/it]

{'Expert': 1.3, 'GPT4': 2.3}





In [7]:
# Perform mistake identification evaluation with selected tutor models
scores, error_percentage, raw_scores, annotated_data = evaluator.compute_mistake_identification(
    tutor_models=['Expert', 'GPT4'],  # Specify the tutor models to evaluate
    definition="Mistake Identification is defined as the degree to which the tutor accurately recognizes the presence of an error in the student’s previous response.", # Define the mistake identification
    eval_instruction_rubric="""
            [Has the tutor identified a mistake in the student’s response?]
            Score 1: The tutor fails to identify the mistake or misidentifies it.
            Score 2: The tutor partially identifies the mistake but lacks precision.
            Score 3: The tutor correctly identifies the mistake with high precision.
            """.strip() # Define the evaluation instruction rubric
)

# Output the evaluation scores for review
print(scores)

Sanity Check for Tutor Models: 100%|██████████| 200/200 [00:00<00:00, 913791.72it/s]
Processed prompts: 100%|██████████| 200/200 [00:29<00:00,  6.85it/s, est. speed input: 5074.77 toks/s, output: 1058.84 toks/s]
Finalizing: 100%|██████████| 200/200 [00:00<00:00, 34498.31it/s]
Processed prompts: 100%|██████████| 200/200 [00:30<00:00,  6.54it/s, est. speed input: 5059.67 toks/s, output: 1105.79 toks/s]
Finalizing: 100%|██████████| 200/200 [00:00<00:00, 31750.98it/s]
Computing Mistake_Identification scores for tutor models: 100%|██████████| 2/2 [01:00<00:00, 30.05s/it]

{'Expert': 1.296, 'GPT4': 2.47}
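
A custom rubric like the one passed above has a regular shape: a bracketed question followed by one line per score level. If you maintain several rubrics, a small helper can assemble that text from structured data. The function `build_rubric` below is a hypothetical convenience written for this tutorial, not part of AITutor-AssessmentKit, and it omits the indentation used in the cell above.

```python
def build_rubric(criterion, score_descriptions):
    """Assemble '[criterion]' plus one 'Score N: ...' line per level, in ascending order."""
    lines = [f"[{criterion}]"]
    for score in sorted(score_descriptions):
        lines.append(f"Score {score}: {score_descriptions[score]}")
    return "\n".join(lines)


rubric = build_rubric(
    "Has the tutor identified a mistake in the student's response?",
    {
        1: "The tutor fails to identify the mistake or misidentifies it.",
        2: "The tutor partially identifies the mistake but lacks precision.",
        3: "The tutor correctly identifies the mistake with high precision.",
    },
)
print(rubric)
```

The resulting string could then be supplied as the `eval_instruction_rubric` argument shown above.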





In [9]:
# Perform mistake identification evaluation with selected tutor models and save the output to a file
scores_tutor, error_percentage, raw_scores, annotated_data = evaluator.compute_mistake_identification(
    tutor_models=['Expert', 'Llama31405B'],
    save=True,
    file_name="test.json"
)

# Output the evaluation scores
print(scores_tutor)

Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 220752.84it/s]
Processed prompts: 100%|██████████| 10/10 [00:05<00:00,  1.92it/s, est. speed input: 1359.73 toks/s, output: 318.19 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 29620.79it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.26it/s, est. speed input: 1673.70 toks/s, output: 354.64 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 30705.01it/s]
Computing Mistake_Identification scores for tutor models: 100%|██████████| 2/2 [00:09<00:00,  4.83s/it]

{'Expert': 1.3, 'Llama31405B': 1.8}
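
The layout of the file written by `save=True` is not shown in this tutorial. If you would rather persist the returned score dictionary yourself, a standard-library sketch is enough. The helper `save_scores` and the file name are hypothetical; the scores are the ones printed above.

```python
import json
import os
import tempfile


def save_scores(path, scores):
    """Write a {tutor: score} dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(scores, handle, indent=2)


scores = {"Expert": 1.3, "Llama31405B": 1.8}

# A temporary directory keeps the example free of side effects.
out_path = os.path.join(tempfile.mkdtemp(), "mistake_identification_scores.json")
save_scores(out_path, scores)

with open(out_path, encoding="utf-8") as handle:
    reloaded = json.load(handle)
print(reloaded)
```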





### **Evaluation Dimension: Providing Guidance**

This section evaluates the *Providing Guidance* capabilities of the selected tutor models using the `compute_providing_guidance` function.


In [10]:
# Perform evaluation of 'Providing Guidance' with selected tutor models
scores, error_percentage, raw_scores, annotated_data = evaluator.compute_providing_guidance(
    tutor_models=['Llama31405B', 'GPT4']  # Specify the tutor models to evaluate
)

# Output the evaluation scores for review
print(scores)



Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 275941.05it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.09it/s, est. speed input: 1604.19 toks/s, output: 358.85 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 28282.56it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.27it/s, est. speed input: 1729.80 toks/s, output: 353.60 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 32388.45it/s]
Computing Providing_Guidance scores for tutor models: 100%|██████████| 2/2 [00:09<00:00,  4.61s/it]

{'Llama31405B': 2.3, 'GPT4': 2.4}





### **Evaluation Dimension: Actionability**

This section evaluates the *Actionability* capabilities of the selected tutor models using the `compute_actionability` function.


In [11]:
# Perform evaluation of 'Actionability' with selected tutor models
scores, error_percentage, raw_scores, annotated_data = evaluator.compute_actionability(
    tutor_models=['Llama31405B', 'GPT4']  # Specify the tutor models to evaluate
)

# Output the evaluation scores for review
print(scores)

Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 262144.00it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.17it/s, est. speed input: 1626.20 toks/s, output: 349.38 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 29127.11it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.43it/s, est. speed input: 1796.94 toks/s, output: 365.11 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 30817.81it/s]
Computing Actionability scores for tutor models: 100%|██████████| 2/2 [00:08<00:00,  4.38s/it]

{'Llama31405B': 2.2, 'GPT4': 2.4}





### **Evaluation Dimension: Coherence**

This section evaluates the *Coherence* capabilities of the selected tutor models using the `compute_coherence` function.

In [12]:
# Perform evaluation of 'Coherence' with selected tutor models
scores, error_percentage, raw_scores, annotated_data = evaluator.compute_coherence(
    tutor_models=['Llama31405B', 'GPT4']  # Specify the tutor models to evaluate
)

# Output the evaluation scores for review
print(scores)

Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 183157.38it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.32it/s, est. speed input: 1755.53 toks/s, output: 366.27 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 29392.46it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.25it/s, est. speed input: 1677.09 toks/s, output: 365.87 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 20360.70it/s]
Computing Coherence scores for tutor models: 100%|██████████| 2/2 [00:08<00:00,  4.39s/it]

{'Llama31405B': 2.9, 'GPT4': 2.8}





### **Evaluation Dimension: Tutor Tone**

This section evaluates the *Tutor Tone* capabilities of the selected tutor models using the `compute_tutor_tone` function.

In [13]:
# Perform evaluation of 'Tutor Tone' with selected tutor models
scores, error_percentage, raw_scores, annotated_data = evaluator.compute_tutor_tone(
    tutor_models=['Llama31405B', 'GPT4']  # Specify the tutor models to evaluate
)

# Output the evaluation scores for review
print(scores)

Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 131482.88it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.21it/s, est. speed input: 1659.26 toks/s, output: 344.25 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 29289.83it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.44it/s, est. speed input: 1812.90 toks/s, output: 342.74 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 33689.19it/s]
Computing Tutor_Tone scores for tutor models: 100%|██████████| 2/2 [00:08<00:00,  4.33s/it]

{'Llama31405B': 2.6, 'GPT4': 2.6}





### **Evaluation Dimension: Humanlikeness**

This section evaluates the *Humanlikeness* capabilities of the selected tutor models using the `compute_humanlikeness` function.

In [14]:
# Perform evaluation of 'Humanlikeness' with selected tutor models
scores, error_percentage, raw_scores, annotated_data = evaluator.compute_humanlikeness(
    tutor_models=['Llama31405B', 'GPT4']  # Specify the tutor models to evaluate
)

# Output the evaluation scores for review
print(scores)

Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 220752.84it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.23it/s, est. speed input: 1684.47 toks/s, output: 347.22 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 31559.85it/s]
Processed prompts: 100%|██████████| 10/10 [00:05<00:00,  1.83it/s, est. speed input: 1365.06 toks/s, output: 301.86 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 27147.60it/s]
Computing Humanlikeness scores for tutor models: 100%|██████████| 2/2 [00:09<00:00,  4.98s/it]

{'Llama31405B': 2.7, 'GPT4': 2.6}
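
With several dimensions scored, it helps to view the results side by side. The sketch below copies the score dictionaries printed in the preceding cells into one nested dictionary, prints a small table, and computes an unweighted average per tutor. The unweighted average is an illustrative choice, not a metric defined by the toolkit, and these runs covered only 10 conversations each, so small differences should be read with caution.

```python
# Scores copied from the outputs of the preceding cells.
results = {
    "Providing_Guidance": {"Llama31405B": 2.3, "GPT4": 2.4},
    "Actionability": {"Llama31405B": 2.2, "GPT4": 2.4},
    "Coherence": {"Llama31405B": 2.9, "GPT4": 2.8},
    "Tutor_Tone": {"Llama31405B": 2.6, "GPT4": 2.6},
    "Humanlikeness": {"Llama31405B": 2.7, "GPT4": 2.6},
}
tutors = ["Llama31405B", "GPT4"]

# One row per dimension, one column per tutor.
print(f"{'Dimension':<20}" + "".join(f"{tutor:>14}" for tutor in tutors))
for dimension, per_tutor in results.items():
    print(f"{dimension:<20}" + "".join(f"{per_tutor[tutor]:>14.1f}" for tutor in tutors))

# Unweighted mean across the five dimensions (illustrative aggregate only).
averages = {
    tutor: round(sum(per_tutor[tutor] for per_tutor in results.values()) / len(results), 2)
    for tutor in tutors
}
print(averages)
```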





## Compute Evaluation Scores

Perform evaluation for selected tutor models and dimensions.

In [15]:
# Perform evaluation for selected tutor models and dimension
scores, error_percentage, raw_scores, annotated_data = evaluator.compute_scores(
    dimension='Mistake_Identification',  # Specify the evaluation dimension
    tutor_models=['Expert', 'Llama31405B', 'GPT4']  # List of tutor models to evaluate
)

# Output the evaluation scores for review
print(scores)


Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 231729.50it/s]
Processed prompts: 100%|██████████| 10/10 [00:05<00:00,  1.91it/s, est. speed input: 1358.52 toks/s, output: 317.91 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 31184.42it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.24it/s, est. speed input: 1659.61 toks/s, output: 351.65 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 30131.49it/s]
Processed prompts: 100%|██████████| 10/10 [00:05<00:00,  1.94it/s, est. speed input: 1421.68 toks/s, output: 318.36 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 15815.63it/s]
Computing Mistake_Identification scores for tutor models: 100%|██████████| 3/3 [00:14<00:00,  4.96s/it]

{'Expert': 1.3, 'Llama31405B': 1.8, 'GPT4': 2.3}
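
Because the returned scores are a plain dictionary keyed by tutor name, ranking the tutors needs only the standard library. A minimal sketch using the values printed above:

```python
# Scores copied from the output above.
scores = {"Expert": 1.3, "Llama31405B": 1.8, "GPT4": 2.3}

# Sort tutors from highest to lowest score.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for rank, (tutor, score) in enumerate(ranking, start=1):
    print(f"{rank}. {tutor}: {score}")
```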





## Tutor Responses and Evaluation Scores  
Display the tutor's responses alongside the evaluation scores computed by the LLM for a specified dimension.

In [10]:
# This example retrieves 10 responses from the 'GPT4' tutor model, evaluated under the 'Mistake_Identification' dimension.
evaluator.get_sample_examples_with_scores(dimension='Mistake_Identification', tutor_model='GPT4', num_examples=10)

Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 146143.00it/s]
Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.00it/s, est. speed input: 1470.52 toks/s, output: 329.30 toks/s]
Finalizing: 100%|██████████| 10/10 [00:00<00:00, 31254.13it/s]
Computing Mistake_Identification scores for tutor models: 100%|██████████| 1/1 [00:04<00:00,  5.00s/it]


Unnamed: 0,Conversation ID,History,GPT4 Response,GPT4 Mistake_Identification Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",i see where your confusion is. you subtracted ...,3
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","actually, to find out how many pencils each bo...",3
2,2895106109,"||| tutor: examples: triangles, rectangles and...","good try, but a five-sided polygon is actually...",3
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...","you've done a good job so far, but there seems...",3
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...","that's correct! now, if 1/3 of the thank you c...",1
5,290101923,||| tutor: a quadrilateral is a shape with fou...,it seems like there was a misunderstanding. wh...,2
6,2542-22f36986-95dc-4ccb-b98d-ff52e85d4851,"||| tutor: hi, could you please provide a step...",that's correct! by adding the next roll to the...,1
7,292754187,||| student: sorry for the j that i tipe ||| t...,"that's okay, camila. actually, if we divide 70...",3
8,2721-5902970b-2112-4b4c-992d-82014d134668,"||| tutor: hi, could you please provide a step...","that's okay, we all make mistakes! now you've ...",1
9,413466564,||| tutor: do you understand that step? ||| tu...,that's not quite right. let's try again. when ...,3
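
The score column of this table can be summarized quickly to see how the ten responses spread across the three levels. The sketch below copies the ten GPT4 scores from the rows above; their mean of 2.3 agrees with the GPT4 value reported by the earlier 10-example runs.

```python
from collections import Counter

# GPT4 Mistake_Identification scores, copied row by row from the table above.
gpt4_scores = [3, 3, 3, 3, 1, 2, 1, 3, 1, 3]

distribution = Counter(gpt4_scores)
mean_score = sum(gpt4_scores) / len(gpt4_scores)

for level in sorted(distribution):
    print(f"Score {level}: {distribution[level]} response(s)")
print(f"Mean: {mean_score:.1f}")
```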


## Compare Two Tutors' Responses and Evaluation Scores  
Display and compare the responses of two tutors along with their evaluation scores computed by the LLM for a specified dimension.

In [11]:
# Compare the evaluation scores of two tutors for the specified dimension
comparison_results = evaluator.compare_tutors_scores(
    dimension='Mistake_Identification', 
    tutor_model1='GPT4', 
    tutor_model2='Llama31405B'
)
comparison_results


Sanity Check for Tutor Models: 100%|██████████| 5/5 [00:00<00:00, 116508.44it/s]
Processed prompts: 100%|██████████| 5/5 [00:04<00:00,  1.19it/s, est. speed input: 961.31 toks/s, output: 205.21 toks/s]
Finalizing: 100%|██████████| 5/5 [00:00<00:00, 25731.93it/s]
Computing Mistake_Identification scores for tutor models: 100%|██████████| 1/1 [00:04<00:00,  4.19s/it]
Sanity Check for Tutor Models: 100%|██████████| 5/5 [00:00<00:00, 197844.53it/s]
Processed prompts: 100%|██████████| 5/5 [00:03<00:00,  1.31it/s, est. speed input: 1063.33 toks/s, output: 197.47 toks/s]
Finalizing: 100%|██████████| 5/5 [00:00<00:00, 19599.55it/s]
Computing Mistake_Identification scores for tutor models: 100%|██████████| 1/1 [00:03<00:00,  3.83s/it]

{'GPT4': [3, 3, 3, 3, 1]}
{'Llama31405B': [2, 1, 3, 1, 1]}





Unnamed: 0,Conversation ID,History,GPT4 Response,GPT4 Mistake_Identification Score,Llama31405B Response,Llama31405B Mistake_Identification Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",i see where your confusion is. you subtracted ...,3,here's a tutor response that acknowledges the ...,2
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","actually, to find out how many pencils each bo...",3,"actually, i think we need to figure out how ma...",1
2,2895106109,"||| tutor: examples: triangles, rectangles and...","good try, but a five-sided polygon is actually...",3,"that's close, but remember we just talked abou...",3
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...","you've done a good job so far, but there seems...",3,"let's re-examine jam's pencils in boxes, consi...",1
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...","that's correct! now, if 1/3 of the thank you c...",1,"now that we know she got 5 gift cards, and we ...",1
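
Since both tutors are scored on the same conversations, the two score lists can be compared pairwise. The sketch below uses the five scores per tutor printed above and counts, per conversation, which tutor scored higher. It is a simple illustration on a very small sample, not a statistical test.

```python
# Per-conversation scores copied from the output above (same conversation order).
gpt4 = [3, 3, 3, 3, 1]
llama = [2, 1, 3, 1, 1]

# Positive difference: GPT4 scored higher on that conversation.
differences = [a - b for a, b in zip(gpt4, llama)]

outcome = {
    "GPT4 higher": sum(1 for d in differences if d > 0),
    "tie": sum(1 for d in differences if d == 0),
    "Llama31405B higher": sum(1 for d in differences if d < 0),
}
print(differences)
print(outcome)
```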


## Generating LLM Evaluation Report
Generate and save the annotation and evaluation report using the `get_llm_evaluation_report` method for multiple tutor models across all evaluation dimensions.

In [18]:
# Generate and print the LLM evaluation report for specified tutor models and dimensions
# The report and evaluation results will be saved accordingly
print(evaluator.get_llm_evaluation_report(
    tutor_models=['Novice', 'Expert'],
    dimensions=['Mistake_Identification', 'Mistake_Location'],
    save_eval=True,
    save_report=True
))


Sanity Check for Tutor Models: 100%|██████████| 5/5 [00:00<00:00, 174762.67it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.35s/it, est. speed input: 209.49 toks/s, output: 56.09 toks/s]
Finalizing: 100%|██████████| 1/1 [00:00<00:00, 17403.75it/s]
Computing Mistake_Identification scores for tutor models: 100%|██████████| 1/1 [00:02<00:00,  2.36s/it]
Sanity Check for Tutor Models: 100%|██████████| 5/5 [00:00<00:00, 265462.28it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.36s/it, est. speed input: 208.78 toks/s, output: 55.90 toks/s]
Finalizing: 100%|██████████| 1/1 [00:00<00:00, 17189.77it/s]
Computing Mistake_Identification scores for tutor models: 100%|██████████| 1/1 [00:02<00:00,  2.37s/it]
Sanity Check for Tutor Models: 100%|██████████| 5/5 [00:00<00:00, 295373.52it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.64s/it, est. speed input: 189.16 toks/s, output: 56.10 toks/s]
Finalizing: 100%|██████████| 1/1 [00:00<00:00, 14820.86it/s]
Comput

        Mistake_Identification  Mistake_Location
Novice                     1.0               1.0
Expert                     1.4               1.4



