# Notebook-I: Tutorial on Automated Evaluation with AITutor-AssessmentKit  

Welcome to this tutorial on evaluating large language model (LLM)-based AI tutors using automated evaluation metrics provided by [AITutor-AssessmentKit]() on [MRBench]() data. This guide demonstrates a systematic approach to assessing the pedagogical effectiveness of AI tutors.  

## Key Features  

- **Evaluation Across 8 Pedagogical Dimensions**:  
  Based on the foundational principles of learning proposed by Maurya et al. (2024), this framework evaluates tutor performance on the following dimensions:  
  1. *Mistake Identification*  
  2. *Mistake Location*  
  3. *Revealing the Answer*  
  4. *Providing Guidance*  
  5. *Actionability*  
  6. *Coherence*  
  7. *Tutor Tone*  
  8. *Humanlikeness*  

- **Assessment of Student Mistake Remediation in the Mathematical Domain**:  
  For a given partial conversation between a tutor and a student, where the student's last utterance contains a mistake or demonstrates confusion, the automated evaluation framework provides an in-depth analysis of the tutor's pedagogical performance across the specified dimensions.  

- **Evaluation with Public NLP Models and Traditional Machine Learning Models**:  
  AITutor-AssessmentKit uses publicly available NLP models as evaluators for several dimensions. For dimensions where no suitable public model exists, traditional machine learning models are employed as ternary classifiers to assign labels to responses.  

## Objectives  

By the end of this tutorial, you will:  
1. Learn how to evaluate AI tutor responses on each pedagogical dimension.  
2. Explore the available metrics for specific dimensions and analyze tutor responses accordingly.  
3. Compare responses from two tutors based on selected metrics and pedagogical dimensions.  
4. Generate and save comprehensive evaluation reports across all dimensions.  

This hands-on tutorial is designed to equip you with the tools and knowledge necessary to evaluate and enhance the performance of AI tutors in addressing student mistakes effectively.  


---
## Overview 
The table below uses the Coherence dimension to illustrate the methods, features, and modules associated with AutoEval. The same structure applies to the other evaluation dimensions.

| Method Name                          | Functionality                                                        | How to Call                                    |
|--------------------------------------|----------------------------------------------------------------------|-----------------------------------------------|
| `__init__`                           | Initializes the evaluator and models.                                | --                          |
| `_calculate_nli_score`               | Computes NLI-based coherence scores.                                 | `_calculate_nli_score(convs, tutor_model)`    |
| `_calculate_bert_score`              | Computes BERTScore-based coherence scores.                           | `_calculate_bert_score(convs, tutor_model)`   |
| `compute`                            | Computes coherence scores for all examples using specified metrics.  | `compute(data, metrics, save, file_name)`     |
| `_get_metric_method`                 | Retrieves the scoring method for a metric.                           | `_get_metric_method(metric)`                 |
| `list_available_metrics`             | Lists all available metrics and their descriptions.                  | `list_available_metrics()`                   |
| `get_sample_examples_with_scores`    | Retrieves examples with coherence scores for a given metric.          | `get_sample_examples_with_scores(...)`       |
| `compare_tutors_scores`              | Compares scores between two tutor models for a specific metric.       | `compare_tutors_scores(...)`                 |

---

### **Suggested Order for Testing/Usage**
1. Test `list_available_metrics` to understand the available metrics.
2. Use `compute` to calculate scores for all metrics.
3. Call `_calculate_nli_score` and `_calculate_bert_score` separately to understand individual scoring methods.
4. Retrieve specific examples using `get_sample_examples_with_scores`.
5. Compare tutor models using `compare_tutors_scores`. 

---


## **Installation**

Let's install the `AITutor-AssessmentKit` with `pip`

In [1]:
!pip install AITutor-AssessmentKit



In [2]:
"""
Set up the environment: configure the system path and import the modules
and classes needed for automated evaluation of tutoring systems.
"""

import os
import sys

# Configure the system path to include the parent directory
sys.path.insert(0, os.path.abspath(".."))

# Import external libraries
import pandas as pd

# Import AutoEvaluation classes from the assessment toolkit
from aitutor_assessmentkit.autoevaluator import (
    autoeval, 
    AutoMistakeIdentificationEvaluator,
    AutoMistakeLocationEvaluator,
    AutoRevealingOfTheAnswerEvaluator,
    AutoProvidingGuidanceEvaluator,
    AutoActionabilityEvaluator,
    AutoCoherenceEvaluator, 
    AutoTutorToneEvaluator,
    AutoHumanlikenessEvaluator, 
    AutoEvaluationReport,
)

# helper imports 
from aitutor_assessmentkit.helpers import utils

Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


## **Data: `MRBench`**
Let's download the sample `MRBench` dataset

In [3]:
%%bash

# Download the required files from GitHub
wget -q "https://raw.githubusercontent.com/kaushal0494/aitutor_assessmentkit/main/data/sample_mrbench.json"

# Create the necessary directories if they don't exist
mkdir -p ../data
mkdir -p ../outputs

# Move the downloaded files to the 'data' directory
mv sample_mrbench.json ../data/

In [4]:
utils.load_data('../data/sample_mrbench.json')

Unnamed: 0,conversation_id,conversation_history,Data,Split,Topic,Ground_Truth_Solution,anno_llm_responses
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"Tutor: Hi, could you please provide a step-by-...",MathDial,test,Not Available,Elliott took half of his steps on his school j...,{'Gemini': {'response': 'It looks like you're ...
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"Tutor: Hi, could you please provide a step-by-...",MathDial,test,Not Available,There are a total of 26 - 2 = 24 pencils in th...,{'Sonnet': {'response': 'I appreciate your eff...
2,2895106109,"Tutor: Examples: triangles, rectangles and pen...",Bridge,train,4.5A.Geometric Lines,Not Available,"{'Llama31405B': {'response': 'That's close, bu..."
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"Tutor: Hi, could you please provide a step-by-...",MathDial,test,Not Available,There are a total of 26 - 2 = 24 pencils in th...,{'Llama31405B': {'response': 'Let's re-examine...
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"Tutor: Hi, could you please provide a step-by-...",MathDial,test,Not Available,She got 5 gift cards because 50 / 10 = 5\nShe ...,{'Expert': {'response': 'Ok. And if she got 5 ...
5,290101923,Tutor: A quadrilateral is a shape with four si...,Bridge,test,4.6D.Classifying 2D Figures,Not Available,{'Expert': {'response': 'Great try - you shoul...
6,2542-22f36986-95dc-4ccb-b98d-ff52e85d4851,"Tutor: Hi, could you please provide a step-by-...",MathDial,test,Not Available,The next roll will be his 11th roll.\nThe tota...,{'Llama31405B': {'response': 'It looks like th...
7,292754187,Student: sorry for the j that I tipe\n Tutor: ...,Bridge,train,4.9A.Converting Units of Measure (Review) - 1,Not Available,"{'Novice': {'response': 'That is a good try.',..."
8,2721-5902970b-2112-4b4c-992d-82014d134668,"Tutor: Hi, could you please provide a step-by-...",MathDial,test,Not Available,The total number of bath towels used in one mo...,"{'Sonnet': {'response': 'That's okay, I'm glad..."
9,413466564,Tutor: Do you understand that step?\n Tutor: N...,Bridge,train,5.3A.Multi-Digit Division with Two-Digit Divis...,Not Available,{'Phi3': {'response': 'That's not quite right;...


In [5]:
utils.load_json_data('../data/sample_mrbench.json')

Loaded 10 examples from ../data/sample_mrbench.json


[{'conversation_id': '930-b01cb51d-748d-460c-841a-08e4d5cd5cc7',
  'conversation_history': 'Tutor: Hi, could you please provide a step-by-step solution for the question below? The question is: Elliott is trying to walk 10,000 steps a day. He finished half of his steps on his walks to and from school and did another 1,000 steps going for a short walk with his friend. He also went for a short jog around the block and realized that after he had finished his jog, he only had 2,000 steps left to take. How many steps did Elliott take during his jog?\xa0\n\xa0Student: Elliott finished half of his steps on his walks to and from school, so he took 10,000/2 = 5000 steps during these walks.\nAdding the 1,000 steps he took with his friend, he has taken 5000+1000 = 6000 steps.\nSubtracting 6000 from his goal of 10,000, he has 10,000-6000 = 4000 steps left to take.\nTherefore, he took 4000 steps during his jog.\xa0\n\xa04000\xa0\n\xa0Tutor: can you tell me how you got to your answer?\xa0\n\xa0Studen
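
Each loaded record is a dictionary. As the DataFrame view shows, its `anno_llm_responses` field maps a tutor name to a dictionary holding that tutor's `response`. The sketch below counts how many conversations each tutor has a non-empty response for, using only the standard library. The two records are made-up stand-ins that mimic this structure, and `count_responses` is a hypothetical helper, not part of the toolkit.

```python
from collections import Counter

# Made-up records mimicking the structure shown above (not real MRBench data).
records = [
    {
        "conversation_id": "demo-1",
        "anno_llm_responses": {
            "Expert": {"response": "How many pencils does Jam have?"},
            "GPT4": {"response": "Let's re-check the subtraction step."},
        },
    },
    {
        "conversation_id": "demo-2",
        "anno_llm_responses": {
            "Expert": {"response": "Great try!"},
            "Novice": {"response": "That is a good try."},
            "GPT4": {"response": ""},  # an empty response is not counted
        },
    },
]


def count_responses(records):
    """Count, per tutor, the records that contain a non-empty response."""
    counts = Counter()
    for record in records:
        for tutor, annotation in record.get("anno_llm_responses", {}).items():
            if annotation.get("response"):
                counts[tutor] += 1
    return counts


counts = count_responses(records)
print(dict(counts))
```

On the real file, the list returned by `utils.load_json_data` should be usable in place of `records`, assuming it keeps the keys shown above. Knowing the per-tutor counts matters because not every tutor necessarily has a response for every conversation.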

## Evaluation Dimension: Coherence

### Evaluating Coherence Across Tutor Models Using AutoCoherenceEvaluator

In this section, we demonstrate how to evaluate the "Coherence" of responses from multiple tutor models using the `AutoCoherenceEvaluator`. The evaluator computes scores for two coherence metrics, `Coherence_BERT` and `Coherence_NLI`, which assess the logical consistency and alignment of the tutor's response with the student's previous responses.

The process follows these steps:
1. **Evaluator Initialization**: The `AutoCoherenceEvaluator` is initialized with the MRBench data file (here the sample file `../data/sample_mrbench.json`), the output directory for saving results, the list of tutor models to be evaluated (e.g., 'Novice', 'Expert', 'GPT4'), and the number of conversation examples to use (`num_conv_examples`).
2. **Score Computation**: The `compute` method calculates the evaluation scores for the specified metrics (`Coherence_BERT` and `Coherence_NLI`), and the results are saved in the output directory.
3. **Result Display**: The cumulative score of the evaluation is printed, summarizing the coherence performance across all the selected tutor models.

The following code executes this evaluation process:


In [6]:
# Initialize the AutoCoherenceEvaluator with the specified parameters
evaluator = AutoCoherenceEvaluator(
    file_names=['../data/sample_mrbench.json'],
    output_data_dir='../outputs',
    tutor_models=['Novice', 'Expert', 'Llama31405B', 'GPT4', 'Sonnet', 'Phi3', 'Llama318B', 'Mistral', 'Gemini'],
    num_conv_examples=10,
)

# Compute the evaluation scores and print the cumulative score
cumulative_score, all_scores, _ = evaluator.compute(save=True, metrics=['Coherence_BERT', 'Coherence_NLI'])
print(pd.DataFrame(cumulative_score))

Loading data: 100%|██████████| 1/1 [00:00<00:00, 3253.92it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 27630.46it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 272357.40it/s]


Computing Coherence Scores using ['Coherence_BERT', 'Coherence_NLI'] method(a) for 10 examples...


Calculating Coherence_BERT Score for Tutors:   0%|          | 0/9 [00:00<?, ?it/s]Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Calculating Coherence_BERT Score for Tutors: 100%|██████████| 9/9 [00:05<00:00,  1.75it/s]
Calculating Coherence_NLI Score for Tutors:  56%|█████▌    | 5/9 [00:01<00:00,  5.22it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1188 > 512). Running this sequence through the model will result in indexing errors
Calculating Coherence_NLI Score for Tutors: 100%|██████████| 9/9 [00:02<00:00,  4.49it/s]

             Coherence_BERT  Coherence_NLI
Novice                0.840          0.749
Expert                0.849          0.438
Llama31405B           0.853          0.676
GPT4                  0.855          0.575
Sonnet                0.851          0.601
Phi3                  0.825          0.613
Llama318B             0.853          0.855
Mistral               0.851          0.571
Gemini                0.845          0.660
Overall               0.847          0.629





### Listing Available Metrics
This code retrieves and displays the list of available metrics for the evaluator.

In [7]:
# List the available metrics for the evaluator
available_metrics = evaluator.list_available_metrics()
print(available_metrics)

           Metric                                        Description
0  Coherence_BERT  Uses BERTScore to evaluate coherence between s...
1   Coherence_NLI  Uses Natural Language Inference (NLI) to evalu...
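
For intuition about the first metric: BERTScore compares two texts by greedily matching each token embedding in one text to its most similar token embedding in the other (by cosine similarity), then averaging the matches into precision, recall, and F1. The sketch below illustrates only that matching step, with hand-made 2-D vectors standing in for contextual embeddings. It is not the toolkit's implementation, and `greedy_match_f1` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(reference_vecs, candidate_vecs):
    """BERTScore-style F1 from greedy cosine matching (no IDF weighting)."""
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy "embeddings": the candidate covers only one of the two reference tokens.
reference = [(1.0, 0.0), (0.0, 1.0)]
candidate = [(1.0, 0.0)]
print(round(greedy_match_f1(reference, candidate), 3))
```

In this toy case precision is 1.0 and recall is 0.5, so the F1 is 2/3. Real contextual embeddings of natural-language text tend to be far more similar to one another than these orthogonal toy vectors.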


### Retrieving Sample Examples with Scores
This code retrieves a set of sample examples along with their corresponding evaluation scores for a specific tutor model and metric. In this case, the tutor model is set to "Expert," and the metric used is "Coherence_BERT." The number of examples to retrieve is limited to 5.


In [8]:
# Retrieve sample examples with scores for the specified tutor model and metric
samples = evaluator.get_sample_examples_with_scores(
    tutor_model="Expert", 
    num_examples=5, 
    metric="Coherence_BERT"
)

# Display the retrieved samples
samples

Unnamed: 0,Conversation ID,History,Expert Response,Expert Coherence_BERT Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.844
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.861
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.822
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.896
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.857
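
The `History` column above shows each conversation flattened into a single lowercased string, with turns separated by `|||` and prefixed by the speaker. A minimal sketch for splitting such a string back into turns and pulling out the last student utterance follows. The example string is made up, and `split_turns` and `last_student_turn` are hypothetical helpers rather than toolkit functions.

```python
def split_turns(history):
    """Split a '|||'-separated history string into (speaker, text) pairs."""
    turns = []
    for chunk in history.split("|||"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece before the leading '|||'
        # Split at the first colon only, so colons inside the text survive.
        speaker, sep, text = chunk.partition(":")
        if sep:
            turns.append((speaker.strip(), text.strip()))
        else:
            turns.append(("unknown", chunk))  # no 'speaker:' prefix found
    return turns


def last_student_turn(history):
    """Return the text of the final student turn, or None if there is none."""
    for speaker, text in reversed(split_turns(history)):
        if speaker == "student":
            return text
    return None


# Made-up history in the same flattened format as the History column.
example = "||| tutor: what is 6 times 6? ||| student: i think it is 32"
print(split_turns(example))
print(last_student_turn(example))
```

The last student turn is the utterance that contains the mistake or confusion the tutor response is expected to address, so it is a useful thing to read next to the scores.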


In [9]:
# Retrieve sample examples with scores for the specified tutor model and metric
samples = evaluator.get_sample_examples_with_scores(
    tutor_model="Expert", 
    num_examples=5, 
    metric="Coherence_NLI"
)

# Display the retrieved samples
samples


Unnamed: 0,Conversation ID,History,Expert Response,Expert Coherence_NLI Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.893
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.041
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.275
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.047
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.026


In [10]:
# Retrieve sample examples with scores for the specified tutor model and metric
samples = evaluator.get_sample_examples_with_scores(
    tutor_model="Gemini", 
    num_examples=10, 
    metric="Coherence_BERT"
)

# Display the first 10 rows of the retrieved samples
samples.head(10)


Unnamed: 0,Conversation ID,History,Gemini Response,Gemini Coherence_BERT Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...","it looks like you're on the right track, but r...",0.849
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...",it looks like you're trying to find the number...,0.851
2,2895106109,"||| tutor: examples: triangles, rectangles and...","""that's a great guess! remember, an octagon h...",0.841
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",let's look at how many pencils are in the boxe...,0.876
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...","that's great! now, think about how those 5 gif...",0.85
5,290101923,||| tutor: a quadrilateral is a shape with fou...,can you tell me what makes a rectangle special...,0.844
6,2542-22f36986-95dc-4ccb-b98d-ff52e85d4851,"||| tutor: hi, could you please provide a step...",you're absolutely right that he needs a total ...,0.847
7,292754187,||| student: sorry for the j that i tipe ||| t...,that's a great observation!let's think about i...,0.809
8,2721-5902970b-2112-4b4c-992d-82014d134668,"||| tutor: hi, could you please provide a step...",that's great you caught that mistake!remember ...,0.832
9,413466564,||| tutor: do you understand that step? ||| tu...,"that's a great try! remember, 6 times 6 is 36,...",0.849


### Comparing Tutor Scores
This code compares the evaluation scores of two tutor models on a selected metric. It retrieves `num_examples` examples for the comparison and displays the results.

In [11]:
# Compare the evaluation scores between two tutor models for the specified metric
comparison = evaluator.compare_tutors_scores(
    tutor_model1="Expert", 
    tutor_model2="GPT4", 
    num_examples=5, 
    metric="Coherence_BERT"
)

# Display the comparison results
comparison


Unnamed: 0,Conversation ID,History,Expert Response,Expert Coherence_BERT Score,GPT4 Response,GPT4 Coherence_BERT Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.844,i see where your confusion is. you subtracted ...,0.859
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.861,"actually, to find out how many pencils each bo...",0.875
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.822,"good try, but a five-sided polygon is actually...",0.842
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.896,"you've done a good job so far, but there seems...",0.861
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.857,"that's correct! now, if 1/3 of the thank you c...",0.858
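
A side-by-side table is easier to read with a small summary. The sketch below copies the five score pairs from the comparison above into plain lists and reports the per-example differences, how often GPT4 scores higher, and the two means. It uses only the standard library and does not call the toolkit.

```python
from statistics import mean

# Coherence_BERT scores copied from the comparison table above.
expert = [0.844, 0.861, 0.822, 0.896, 0.857]
gpt4 = [0.859, 0.875, 0.842, 0.861, 0.858]

# Per-example difference (positive means GPT4 scored higher).
diffs = [round(g - e, 3) for e, g in zip(expert, gpt4)]
gpt4_wins = sum(d > 0 for d in diffs)

print("differences:", diffs)
print("GPT4 higher on", gpt4_wins, "of", len(diffs), "examples")
print("mean Expert:", round(mean(expert), 3), "mean GPT4:", round(mean(gpt4), 3))
```

On these five examples GPT4 scores higher on four, but the means differ by only 0.003, which is too small a gap and too small a sample to draw conclusions from.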


## Evaluation Dimension: Mistake Identification

### Evaluating Mistake Identification Using AutoMistakeIdentificationEvaluator

This section demonstrates the use of the `AutoMistakeIdentificationEvaluator` to assess "Mistake Identification" across multiple tutor models, using the MRBench dataset. The evaluator calculates the scores based on the "Mistake_Identification_Heuristic" metric, which evaluates the tutor’s ability to identify mistakes in student responses.

The following steps are executed:
1. **Evaluator Initialization**: The `AutoMistakeIdentificationEvaluator` is initialized with the dataset file (here the sample file `../data/sample_mrbench.json`), the output directory for storing results, and a list of tutor models to evaluate (e.g., 'Novice', 'Expert', 'GPT4').
2. **Score Computation**: The `compute` method calculates the cumulative and individual scores for the "Mistake_Identification_Heuristic" metric, saving the results to the specified output directory.
3. **Result Display**: The cumulative score is printed, summarizing the evaluation results for all tutor models under the given metric.

The following code runs the evaluation process:


In [13]:
# Initialize the AutoMistakeIdentificationEvaluator with the specified parameters
evaluator = AutoMistakeIdentificationEvaluator(
    file_names=['../data/sample_mrbench.json'],
    output_data_dir='../outputs',
    tutor_models=['Novice', 'Expert', 'Llama31405B', 'GPT4', 'Sonnet', 'Phi3', 'Llama318B', 'Mistral', 'Gemini'],
    num_conv_examples=10,
)

# Compute the evaluation scores and print the cumulative score
cumulative_score, all_scores, _ = evaluator.compute(save=True, metrics=['Mistake_Identification_Heuristic'])

# Print the cumulative score
print(pd.DataFrame(cumulative_score))

Loading data: 100%|██████████| 1/1 [00:00<00:00, 3199.32it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 34435.99it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 299593.14it/s]


Computing Mistake Identification Scores using ['Mistake_Identification_Heuristic'] mtrics(s) for 10 examples...


Calculating Mistake_Identification_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 41527.76it/s]

             Mistake_Identification_Heuristic
Novice                                  0.750
Expert                                  0.300
Llama31405B                             0.900
GPT4                                    0.800
Sonnet                                  0.600
Phi3                                    0.600
Llama318B                               0.500
Mistral                                 0.800
Gemini                                  0.900
Overall                                 0.679
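
The `Overall` row is not the plain average of the nine tutor rows. It is, however, consistent with a mean pooled over all individual responses, assuming Novice has responses for only 4 of the 10 conversations and every other tutor for all 10. Those counts are an inference from the numbers (a Novice score of 0.750, and four Bridge conversations in the sample), not something the toolkit reports here. A short check under that assumption:

```python
# Per-tutor scores copied from the table above.
scores = {
    "Novice": 0.750,
    "Expert": 0.300,
    "Llama31405B": 0.900,
    "GPT4": 0.800,
    "Sonnet": 0.600,
    "Phi3": 0.600,
    "Llama318B": 0.500,
    "Mistral": 0.800,
    "Gemini": 0.900,
}

# ASSUMPTION: number of scored responses per tutor (inferred, not reported).
counts = {tutor: 10 for tutor in scores}
counts["Novice"] = 4

# Plain average of the tutor rows versus a mean weighted by response count.
simple_mean = sum(scores.values()) / len(scores)
pooled_mean = sum(scores[t] * counts[t] for t in scores) / sum(counts.values())

print("simple mean:", round(simple_mean, 3))
print("pooled mean:", round(pooled_mean, 3))
```

Under these assumed counts the pooled mean rounds to the reported 0.679, while the simple mean rounds to 0.683. When tutors have different numbers of responses, the `Overall` value therefore should not be read as an average of the rows above it.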





In [14]:
# List all available evaluation metrics and descriptions
evaluator.list_available_metrics()

Unnamed: 0,Method,Description
0,Mistake_Identification_Heuristic,Compute mistake identification scores using he...


In [15]:
# Retrieve sample examples with scores; no arguments are passed, so the evaluator's defaults apply (the output below shows the Expert tutor)
evaluator.get_sample_examples_with_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Mistake_Identification_Heuristic Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.0
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",1.0
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,1.0
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.0
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.0


In [16]:
# Compare the scores of two tutor models; no arguments are passed, so the evaluator's defaults apply (the output below shows Expert vs. GPT4)
evaluator.compare_tutors_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Mistake_Identification_Heuristic Score,GPT4 Response,GPT4 Mistake_Identification_Heuristic Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.0,i see where your confusion is. you subtracted ...,1.0
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",1.0,"actually, to find out how many pencils each bo...",1.0
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,1.0,"good try, but a five-sided polygon is actually...",1.0
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.0,"you've done a good job so far, but there seems...",1.0
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.0,"that's correct! now, if 1/3 of the thank you c...",0.0


## Evaluation Dimension: Mistake Location

### Evaluating Mistake Location Using AutoMistakeLocationEvaluator

This section demonstrates the use of the `AutoMistakeLocationEvaluator` to assess "Mistake Location" across multiple tutor models, using the MRBench dataset. The evaluator calculates the scores based on the "Mistake_Location_Heuristic" metric, which evaluates the tutor's ability to accurately point out where the mistake occurs in the student's response.

The following steps are executed:
1. **Evaluator Initialization**: The `AutoMistakeLocationEvaluator` is initialized with the dataset file (here the sample file `../data/sample_mrbench.json`), the output directory for storing results, and a list of tutor models to evaluate (e.g., 'Novice', 'Expert', 'GPT4').
2. **Score Computation**: The `compute` method calculates the cumulative and individual scores for the "Mistake_Location_Heuristic" metric, saving the results to the specified output directory.
3. **Result Display**: The cumulative score is printed, summarizing the evaluation results for all tutor models under the given metric.

The following code runs the evaluation process:


In [17]:
# Initialize the AutoMistakeLocationEvaluator with the specified parameters
evaluator = AutoMistakeLocationEvaluator(
    file_names=['../data/sample_mrbench.json'],
    output_data_dir='../outputs',
    tutor_models=['Novice', 'Expert', 'Llama31405B', 'GPT4', 'Sonnet', 'Phi3', 'Llama318B', 'Mistral', 'Gemini'],
    num_conv_examples=10,
)

# Compute the evaluation scores and print the cumulative score
cumulative_score, all_scores, _ = evaluator.compute(save=True, metrics=['Mistake_Location_Heuristic'])

# Print the cumulative score
print(pd.DataFrame(cumulative_score))

Loading data: 100%|██████████| 1/1 [00:00<00:00, 3233.85it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 35635.55it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 317750.30it/s]


Computing Mistake Location Scores using ['Mistake_Location_Heuristic'] mtrics(s) for 10 examples...


Calculating Mistake_Location_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 49998.33it/s]

             Mistake_Location_Heuristic
Novice                            0.000
Expert                            0.200
Llama31405B                       0.800
GPT4                              0.500
Sonnet                            0.500
Phi3                              0.300
Llama318B                         0.500
Mistral                           0.600
Gemini                            0.700
Overall                           0.488





In [18]:
# List all available evaluation metrics and descriptions
evaluator.list_available_metrics()

Unnamed: 0,Method,Description
0,Mistake_Location_Heuristic,Compute mistake location scores using heuristics.


In [19]:
# Retrieve sample examples with scores; no arguments are passed, so the evaluator's defaults apply (the output below shows the Expert tutor)
evaluator.get_sample_examples_with_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Mistake_Location_Heuristic Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.0
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",1.0
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,1.0
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.0
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.0


In [20]:
# Compare the scores of two tutor models; no arguments are passed, so the evaluator's defaults apply (the output below shows Expert vs. GPT4)
evaluator.compare_tutors_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Mistake_Location_Heuristic Score,GPT4 Response,GPT4 Mistake_Location_Heuristic Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.0,i see where your confusion is. you subtracted ...,0.0
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",1.0,"actually, to find out how many pencils each bo...",1.0
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,1.0,"good try, but a five-sided polygon is actually...",1.0
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.0,"you've done a good job so far, but there seems...",1.0
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.0,"that's correct! now, if 1/3 of the thank you c...",0.0


## Evaluation Dimension: Revealing of the Answer

### Evaluating Revealing of the Answer Using AutoRevealingOfTheAnswerEvaluator

This section demonstrates the use of the `AutoRevealingOfTheAnswerEvaluator` to assess the "Revealing of the Answer" dimension across multiple tutor models, using the MRBench dataset. The evaluator calculates the scores based on the "Revealing_of_the_Answer_Heuristic" metric, which measures how effectively the tutor reveals the final answer to the student.

The following steps are executed:
1. **Evaluator Initialization**: The `AutoRevealingOfTheAnswerEvaluator` is initialized with the dataset file (here the sample file `../data/sample_mrbench.json`), the output directory for saving results, and a list of tutor models to evaluate (e.g., 'Novice', 'Expert', 'GPT4').
2. **Score Computation**: The `compute` method calculates the cumulative and individual scores for the "Revealing_of_the_Answer_Heuristic" metric, and the results are saved to the specified output directory.
3. **Result Display**: The cumulative score is printed, providing a summary of the evaluation results for all tutor models under the given metric.

The following code runs the evaluation process:


In [21]:
# Initialize the AutoRevealingOfTheAnswerEvaluator with the specified parameters
evaluator = AutoRevealingOfTheAnswerEvaluator(
    file_names=['../data/sample_mrbench.json'],
    output_data_dir='../outputs',
    tutor_models=['Novice', 'Expert', 'Llama31405B', 'GPT4', 'Sonnet', 'Phi3', 'Llama318B', 'Mistral', 'Gemini'],
    num_conv_examples=10,
)

# Compute the evaluation scores and print the cumulative score
cumulative_score, all_scores, _ = evaluator.compute(save=True, metrics=['Revealing_of_the_Answer_Heuristic'])

# Print the cumulative score
print(pd.DataFrame(cumulative_score))

Loading data: 100%|██████████| 1/1 [00:00<00:00, 3446.43it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 39309.32it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 317750.30it/s]


Computing Revealing of the_Answer Scores using ['Revealing_of_the_Answer_Heuristic'] mtrics(s) for 10 examples...


Calculating Revealing_of_the_Answer_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 56090.25it/s]

             Revealing_of_the_Answer_Heuristic
Novice                                   0.000
Expert                                   0.200
Llama31405B                              0.800
GPT4                                     0.500
Sonnet                                   0.500
Phi3                                     0.300
Llama318B                                0.500
Mistral                                  0.600
Gemini                                   0.700
Overall                                  0.488





In [22]:
# List all available evaluation metrics and descriptions
evaluator.list_available_metrics()

Unnamed: 0,Method,Description
0,Revealing_of_the_Answer_Heuristic,Compute revealing of the answer scores using h...


In [23]:
# Retrieve sample examples with their corresponding scores for a specified tutor model and evaluation metric
evaluator.get_sample_examples_with_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Revealing_of_the_Answer_Heuristic Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.0
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",1.0
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,1.0
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.0
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.0


In [24]:
# Compare the evaluation scores of two tutor models based on a specified metric
evaluator.compare_tutors_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Revealing_of_the_Answer_Heuristic Score,GPT4 Response,GPT4 Revealing_of_the_Answer_Heuristic Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.0,i see where your confusion is. you subtracted ...,0.0
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",1.0,"actually, to find out how many pencils each bo...",1.0
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,1.0,"good try, but a five-sided polygon is actually...",1.0
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.0,"you've done a good job so far, but there seems...",1.0
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.0,"that's correct! now, if 1/3 of the thank you c...",0.0
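
The comparison table is easier to read once it is reduced to a count of agreements and disagreements. The following is a minimal sketch, not part of AITutor-AssessmentKit: the helper `tally_agreement` is a hypothetical name, and the two score lists are copied by hand from the five rows shown above.

```python
# Scores copied from the five rows of the Expert vs. GPT4 comparison table.
expert_scores = [0.0, 1.0, 1.0, 0.0, 0.0]
gpt4_scores = [0.0, 1.0, 1.0, 1.0, 0.0]


def tally_agreement(scores_a, scores_b):
    """Count rows where two tutors receive the same or a different score.

    Hypothetical helper for illustration; it is not provided by the kit.
    """
    agree = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return {"agree": agree, "differ": len(scores_a) - agree}


summary = tally_agreement(expert_scores, gpt4_scores)
print(summary)  # {'agree': 4, 'differ': 1}
```

On these five rows the two tutors receive the same heuristic score in four conversations and differ in one (conversation `232-a53cdc95-d429-4503-95b8-a22ddec0a735`).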


## Evaluation Dimension: Providing Guidance

### Evaluating Providing Guidance Using AutoProvidingGuidanceEvaluator

This section demonstrates the use of the `AutoProvidingGuidanceEvaluator` to assess how effectively the tutor provides guidance across multiple tutor models, using the MRBench dataset. The evaluator calculates the scores based on the "Providing_Guidance_Uptake" metric, which measures the extent to which the tutor provides meaningful and helpful guidance to the student.

The following steps are executed:
1. **Evaluator Initialization**: The `AutoProvidingGuidanceEvaluator` is initialized with the dataset file (`../data/sample_mrbench.json`), the output directory for saving results, and a list of tutor models to evaluate (e.g., 'Novice', 'Expert', 'GPT4', etc.).
2. **Score Computation**: The `compute` method calculates the cumulative and individual scores for the "Providing_Guidance_Uptake" metric, and the results are saved to the specified output directory.
3. **Result Display**: The cumulative score is printed, providing a summary of the evaluation results for all the tutor models under the given metric.

The following code runs the evaluation process:


In [25]:
# Initialize the AutoProvidingGuidanceEvaluator with the specified parameters
evaluator = AutoProvidingGuidanceEvaluator(
    file_names=['../data/sample_mrbench.json'],
    output_data_dir='../outputs',
    tutor_models=['Novice', 'Expert', 'Llama31405B', 'GPT4', 'Sonnet', 'Phi3', 'Llama318B', 'Mistral', 'Gemini'],
    num_conv_examples=10,
)

# Compute the evaluation scores and save the results to the output directory
cumulative_score, all_scores, _ = evaluator.compute(save=True, metrics=['Providing_Guidance_Uptake'])

# Print the cumulative score
print(pd.DataFrame(cumulative_score))


Loading data: 100%|██████████| 1/1 [00:00<00:00, 2576.35it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 36889.22it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 308404.71it/s]


Computing Providing Scores using ['Providing_Guidance_Uptake'] metric(s) for 10 examples...


Calculating Providing_Guidance_Uptake Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 14.22it/s]

             Providing_Guidance_Uptake
Novice                           0.540
Expert                           0.883
Llama31405B                      0.998
GPT4                             0.997
Sonnet                           0.903
Phi3                             0.740
Llama318B                        0.989
Mistral                          0.994
Gemini                           0.888
Overall                          0.906





In [26]:
# List all available evaluation metrics and descriptions
evaluator.list_available_metrics()

Unnamed: 0,Method,Description
0,Providing_Guidance_Uptake,Providing guidance score using uptake metric.


In [27]:
# Retrieve sample examples with their corresponding scores for a specified tutor model and evaluation metric
evaluator.get_sample_examples_with_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Providing_Guidance_Uptake Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.998
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.999
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.957
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.998
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.999


In [28]:
# Compare the evaluation scores of two tutor models based on a specified metric
evaluator.compare_tutors_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Providing_Guidance_Uptake Score,GPT4 Response,GPT4 Providing_Guidance_Uptake Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.998,i see where your confusion is. you subtracted ...,0.999
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.999,"actually, to find out how many pencils each bo...",0.999
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.957,"good try, but a five-sided polygon is actually...",0.998
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.998,"you've done a good job so far, but there seems...",0.999
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.999,"that's correct! now, if 1/3 of the thank you c...",0.999
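
Because the uptake scores of the two tutors are all close to 1, the per-conversation gaps are more informative than the raw values. The sketch below is an illustration only: `paired_gaps` is a hypothetical helper, and the rows are transcribed from the comparison table above rather than read from the kit's saved output.

```python
from statistics import mean

# (conversation ID, Expert score, GPT4 score), copied from the table above.
rows = [
    ("930-b01cb51d-748d-460c-841a-08e4d5cd5cc7", 0.998, 0.999),
    ("3711-05b6ce71-9710-4b83-8ad2-e207d306c73e", 0.999, 0.999),
    ("2895106109", 0.957, 0.998),
    ("232-a53cdc95-d429-4503-95b8-a22ddec0a735", 0.998, 0.999),
    ("4211-015f58b6-1408-417d-aa60-2a069b1a8806", 0.999, 0.999),
]


def paired_gaps(rows):
    """Return (conversation ID, GPT4 score minus Expert score) per row."""
    return [(conv_id, round(gpt4 - expert, 3)) for conv_id, expert, gpt4 in rows]


gaps = paired_gaps(rows)

# Conversation where the two tutors differ the most.
largest_id, largest_gap = max(gaps, key=lambda item: abs(item[1]))

# Mean score per tutor over the five displayed rows only.
expert_mean = round(mean(expert for _, expert, _ in rows), 4)
gpt4_mean = round(mean(gpt4 for _, _, gpt4 in rows), 4)

print(largest_id, largest_gap)
print(expert_mean, gpt4_mean)
```

These means cover only the five sample rows, so they are not expected to match the cumulative scores reported for all 10 examples.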


## Evaluation Dimension: Actionability

### Evaluating Actionability Using AutoActionabilityEvaluator

This section demonstrates the use of the `AutoActionabilityEvaluator` to assess the "Actionability" of the tutor’s responses across multiple tutor models, using the MRBench dataset. The evaluator calculates the scores based on the "Actionability_Heuristic" metric, which measures the clarity and effectiveness of the tutor’s actions or suggestions in terms of their ability to drive the learner towards a meaningful next step.

The following steps are executed:
1. **Evaluator Initialization**: The `AutoActionabilityEvaluator` is initialized with the dataset file (`../data/sample_mrbench.json`), the output directory for saving results, and a list of tutor models to evaluate (e.g., 'Novice', 'Expert', 'GPT4', etc.).
2. **Score Computation**: The `compute` method calculates the cumulative and individual scores for the "Actionability_Heuristic" metric, and the results are saved to the specified output directory.
3. **Result Display**: The cumulative score is printed, providing a summary of the evaluation results for all the tutor models under the given metric.

The following code runs the evaluation process:


In [29]:
# Initialize the AutoActionabilityEvaluator with the specified parameters
evaluator = AutoActionabilityEvaluator(
    file_names=['../data/sample_mrbench.json'],
    output_data_dir='../outputs',
    tutor_models=['Novice', 'Expert', 'Llama31405B', 'GPT4', 'Sonnet', 'Phi3', 'Llama318B', 'Mistral', 'Gemini'],
    num_conv_examples=10,
)

# Compute the evaluation scores and save the results to the output directory
cumulative_score, all_scores, _ = evaluator.compute(save=True, metrics=['Actionability_Heuristic'])

# Print the cumulative score
print(pd.DataFrame(cumulative_score))


Loading data: 100%|██████████| 1/1 [00:00<00:00, 3192.01it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 36663.50it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 106997.55it/s]


Computing Actionability Scores using ['Actionability_Heuristic'] mtrics(s) for 10 examples...


Calculating Actionability_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 74017.13it/s]

             Actionability_Heuristic
Novice                         0.000
Expert                         0.500
Llama31405B                    0.500
GPT4                           0.300
Sonnet                         0.300
Phi3                           0.400
Llama318B                      0.200
Mistral                        0.400
Gemini                         0.200
Overall                        0.333





In [30]:
# List all available evaluation metrics and descriptions
evaluator.list_available_metrics()

Unnamed: 0,Method,Description
0,Actionability_Heuristic,Compute actionability scores using heuristics.


In [31]:
# Retrieve sample examples with their corresponding scores for a specified tutor model and evaluation metric
evaluator.get_sample_examples_with_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Actionability_Heuristic Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.0
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.0
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.0
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,1.0
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,1.0


In [32]:
# Compare the evaluation scores of two tutor models based on a specified metric
evaluator.compare_tutors_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Actionability_Heuristic Score,GPT4 Response,GPT4 Actionability_Heuristic Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.0,i see where your confusion is. you subtracted ...,0.0
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.0,"actually, to find out how many pencils each bo...",0.0
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.0,"good try, but a five-sided polygon is actually...",0.0
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,1.0,"you've done a good job so far, but there seems...",1.0
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,1.0,"that's correct! now, if 1/3 of the thank you c...",1.0


## Evaluation Dimension: Tutor Tone

### Evaluating Tutor Tone Using AutoTutorToneEvaluator

This section demonstrates the use of the `AutoTutorToneEvaluator` to assess the "Tutor Tone" dimension across multiple tutor models, using the MRBench dataset. The evaluator calculates scores with the "Tutor_Tone_FTRoBERTa" metric, which characterizes the tone of the tutor’s response. Tone falls into three categories: encouraging, neutral, or offensive.

The following steps are executed:
1. **Evaluator Initialization**: The `AutoTutorToneEvaluator` is initialized with the dataset file (`../data/sample_mrbench.json`), the output directory for saving the results, and a list of tutor models to evaluate (e.g., 'Novice', 'Expert', 'GPT4', etc.).
2. **Score Computation**: The `compute` method calculates the cumulative and individual scores for the specified metric (`Tutor_Tone_FTRoBERTa`), and the results are saved to the specified output directory.
3. **Result Display**: The cumulative score is printed, providing a summary of the evaluation results for all the tutor models under the given metric.

The following code runs the evaluation process:


In [33]:
# Initialize the AutoTutorToneEvaluator with the specified parameters
evaluator = AutoTutorToneEvaluator(
    file_names=['../data/sample_mrbench.json'],
    output_data_dir='../outputs',
    tutor_models=['Novice', 'Expert', 'Llama31405B', 'GPT4', 'Sonnet', 'Phi3', 'Llama318B', 'Mistral', 'Gemini'],
    num_conv_examples=10,
)

# Compute the evaluation scores for the Tutor Tone using the FTRoBERTa metric, saving the results
cumulative_score, all_scores, _ = evaluator.compute(save=True, metrics=['Tutor_Tone_FTRoBERTa'])

# Print the cumulative score of the evaluation
print(pd.DataFrame(cumulative_score))

Loading data: 100%|██████████| 1/1 [00:00<00:00, 3214.03it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 35275.90it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 303935.07it/s]


Computing Tutor Tone Scores using ['Tutor_Tone_FTRoBERTa'] metric(s) for 10 examples...


Calculating Tutor_Tone_FTRoBERTa Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 40.81it/s]

             Tutor_Tone_FTRoBERTa
Novice                      0.694
Expert                      0.496
Llama31405B                 0.555
GPT4                        0.564
Sonnet                      0.811
Phi3                        0.674
Llama318B                   0.586
Mistral                     0.417
Gemini                      0.680
Overall                     0.603
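
To read the cumulative table at a glance, the tutors can be ordered by score. The following is a small sketch under stated assumptions: the dictionary is typed in from the output above (the `Overall` row is left out), and `rank_tutors` is a hypothetical helper rather than a kit function. The ordering is by raw score only; this section does not state how the score maps onto the three tone categories.

```python
# Cumulative Tutor_Tone_FTRoBERTa scores copied from the output above.
tone_scores = {
    "Novice": 0.694,
    "Expert": 0.496,
    "Llama31405B": 0.555,
    "GPT4": 0.564,
    "Sonnet": 0.811,
    "Phi3": 0.674,
    "Llama318B": 0.586,
    "Mistral": 0.417,
    "Gemini": 0.680,
}


def rank_tutors(scores):
    """Return (tutor, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_tutors(tone_scores)
for position, (tutor, score) in enumerate(ranking, start=1):
    print(f"{position}. {tutor}: {score:.3f}")
```

On this sample, Sonnet has the highest score and Mistral the lowest.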





In [34]:
# List all available evaluation metrics and descriptions
evaluator.list_available_metrics()


Unnamed: 0,Method,Description
0,Tutor_Tone_FTRoBERTa,Tutor Tone score using a fine-tuned RoBERTa mo...


In [35]:
# Retrieve sample examples with their corresponding scores for a specified tutor model and evaluation metric
evaluator.get_sample_examples_with_scores()


Unnamed: 0,Conversation ID,History,Expert Response,Expert Tutor_Tone_FTRoBERTa Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.725
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.42
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.788
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.453
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.237


In [36]:
# Compare the evaluation scores of two tutor models based on a specified metric
evaluator.compare_tutors_scores()


Unnamed: 0,Conversation ID,History,Expert Response,Expert Tutor_Tone_FTRoBERTa Score,GPT4 Response,GPT4 Tutor_Tone_FTRoBERTa Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.725,i see where your confusion is. you subtracted ...,0.059
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.42,"actually, to find out how many pencils each bo...",0.825
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.788,"good try, but a five-sided polygon is actually...",0.982
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.453,"you've done a good job so far, but there seems...",0.177
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.237,"that's correct! now, if 1/3 of the thank you c...",0.516


## Evaluation Dimension: Humanlikeness

### Evaluating Humanlikeness Using AutoHumanlikenessEvaluator

This section demonstrates the use of the `AutoHumanlikenessEvaluator` to assess the "Humanlikeness" of the tutor’s responses across multiple tutor models, using the MRBench dataset. The evaluator calculates scores with the "Humanlikeness_OGPT2" and "Humanlikeness_Heuristic" metrics, which measure how human-like the responses are in terms of naturalness, fluidity, and realism.

The following steps are executed:
1. **Evaluator Initialization**: The `AutoHumanlikenessEvaluator` is initialized with the dataset file (`../data/sample_mrbench.json`), the output directory for saving results, and a list of tutor models to evaluate (e.g., 'Novice', 'Expert', 'GPT4', etc.).
2. **Score Computation**: The `compute` method calculates the cumulative and individual scores for the "Humanlikeness_OGPT2" and "Humanlikeness_Heuristic" metrics, and the results are saved to the specified output directory.
3. **Result Display**: The cumulative score is printed, providing a summary of the evaluation results for all the tutor models under the given metrics.

The following code runs the evaluation process:


In [37]:
# Initialize the AutoHumanlikenessEvaluator with the specified parameters
evaluator = AutoHumanlikenessEvaluator(
    file_names=['../data/sample_mrbench.json'],
    output_data_dir='../outputs',
    tutor_models=['Novice', 'Expert', 'Llama31405B', 'GPT4', 'Sonnet', 'Phi3', 'Llama318B', 'Mistral', 'Gemini'],
    num_conv_examples=10,
)

# Compute the evaluation scores for both metrics and save the results to the output directory
cumulative_score, all_scores, _ = evaluator.compute(save=True, metrics=['Humanlikeness_OGPT2', 'Humanlikeness_Heuristic'])

# Print the cumulative score
print(pd.DataFrame(cumulative_score))


Loading data: 100%|██████████| 1/1 [00:00<00:00, 3052.62it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 34211.29it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 285326.80it/s]
Some weights of the model checkpoint at openai-community/roberta-large-openai-detector were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Computing Humanlikeness Scores using ['Humanlikeness_OGPT2', 'Humanlikeness_Heuristic'] mtrics(s) for 10 examples...


Calculating Humanlikeness_OGPT2 Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 21.65it/s]
Calculating Humanlikeness_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 33614.19it/s]

             Humanlikeness_OGPT2  Humanlikeness_Heuristic
Novice                     0.562                    1.000
Expert                     0.700                    1.000
Llama31405B                0.711                    0.750
GPT4                       0.858                    0.900
Sonnet                     0.679                    1.000
Phi3                       0.756                    0.750
Llama318B                  0.736                    0.900
Mistral                    0.732                    1.000
Gemini                     0.782                    1.000
Overall                    0.736                    0.917
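
With two metrics for the same dimension, it helps to see where they disagree. The sketch below is illustrative only: both dictionaries are copied from the cumulative table above (without the `Overall` row), and `metric_gaps` is a hypothetical helper, not part of the kit.

```python
# Cumulative scores copied from the output above.
ogpt2 = {
    "Novice": 0.562, "Expert": 0.700, "Llama31405B": 0.711,
    "GPT4": 0.858, "Sonnet": 0.679, "Phi3": 0.756,
    "Llama318B": 0.736, "Mistral": 0.732, "Gemini": 0.782,
}
heuristic = {
    "Novice": 1.000, "Expert": 1.000, "Llama31405B": 0.750,
    "GPT4": 0.900, "Sonnet": 1.000, "Phi3": 0.750,
    "Llama318B": 0.900, "Mistral": 1.000, "Gemini": 1.000,
}


def metric_gaps(scores_a, scores_b):
    """Return scores_b minus scores_a for every tutor present in both."""
    return {
        tutor: round(scores_b[tutor] - scores_a[tutor], 3)
        for tutor in scores_a
        if tutor in scores_b
    }


gaps = metric_gaps(ogpt2, heuristic)

# Tutor with the largest absolute disagreement between the two metrics.
largest_disagreement = max(gaps, key=lambda tutor: abs(gaps[tutor]))

# Tutors that the heuristic scores lower than the OGPT2 metric.
heuristic_lower = [tutor for tutor, gap in gaps.items() if gap < 0]

print(largest_disagreement, gaps[largest_disagreement])
print(heuristic_lower)
```

On this sample the heuristic is at or above the OGPT2 score for every tutor except Phi3, and the two metrics differ most for the Novice tutor.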





In [38]:
# List all available evaluation metrics and descriptions
evaluator.list_available_metrics()

Unnamed: 0,Method,Description
0,Humanlikeness_OGPT2,Compute Humanlikeness using a pretrained Rober...
1,Humanlikeness_Heuristic,Compute Humanlikeness using keyword-based heur...


In [39]:
# Retrieve sample examples with their corresponding scores for a specified tutor model and evaluation metric
evaluator.get_sample_examples_with_scores()


Unnamed: 0,Conversation ID,History,Expert Response,Expert Humanlikeness_OGPT2 Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.795
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.799
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.897
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.488
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.952


In [40]:
# Compare the evaluation scores of two tutor models based on a specified metric
evaluator.compare_tutors_scores()

Unnamed: 0,Conversation ID,History,Expert Response,Expert Humanlikeness_OGPT2 Score,GPT4 Response,GPT4 Humanlikeness_OGPT2 Score
0,930-b01cb51d-748d-460c-841a-08e4d5cd5cc7,"||| tutor: hi, could you please provide a step...",but after his jog he still had 2000 steps left,0.795,i see where your confusion is. you subtracted ...,0.759
1,3711-05b6ce71-9710-4b83-8ad2-e207d306c73e,"||| tutor: hi, could you please provide a step...","not quite, remember, jam has three boxes full ...",0.799,"actually, to find out how many pencils each bo...",0.995
2,2895106109,"||| tutor: examples: triangles, rectangles and...",great try! an octagon has 8 sides (i remember ...,0.897,"good try, but a five-sided polygon is actually...",0.978
3,232-a53cdc95-d429-4503-95b8-a22ddec0a735,"||| tutor: hi, could you please provide a step...",how many pencils does jam have?,0.488,"you've done a good job so far, but there seems...",0.998
4,4211-015f58b6-1408-417d-aa60-2a069b1a8806,"||| tutor: hi, could you please provide a step...",ok. and if she got 5 thank you cards that had ...,0.952,"that's correct! now, if 1/3 of the thank you c...",0.879


## Generating Automated Evaluation Report

This section demonstrates the use of the `AutoEvaluationReport` to generate an automated evaluation report for multiple tutor models across various metrics, using the MRBench dataset. The evaluator generates the report based on the entire set of tutor models, summarizing the performance and evaluation metrics.

The following steps are executed:
1. **Evaluator Initialization**: The `AutoEvaluationReport` is initialized with the dataset file (`../data/sample_mrbench.json`), the output directory for saving the results, a list of tutor models to evaluate (e.g., 'Novice', 'Expert', 'GPT4', etc.), and the number of conversation examples to evaluate (`num_conv_examples`). If `-1` is specified, all available examples are used.
2. **Report Generation**: The `get_automated_evaluation_report_with_all_models` method computes and generates the automated evaluation report. The `save_eval` and `save_report` flags control whether the evaluation data and the report are written to the output directory; both are set to `False` in this example, so nothing is saved.
3. **Result Display**: The report is displayed, showing the first 10 rows of the evaluation data.
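
The `-1` convention for `num_conv_examples` can be pictured with a short sketch. This is not the kit's implementation: `select_examples` is a hypothetical helper, and taking the first `N` examples (rather than, say, a random sample) is an assumption made only for illustration.

```python
def select_examples(examples, num_conv_examples=-1):
    """Return the examples to evaluate.

    Hypothetical illustration of the `-1` convention described above:
    -1 keeps every example; any other non-negative value keeps the first N
    (the "first N" choice is an assumption, not confirmed by the kit).
    """
    if num_conv_examples == -1:
        return list(examples)
    if num_conv_examples < 0:
        raise ValueError("num_conv_examples must be -1 or a non-negative integer")
    return list(examples[:num_conv_examples])


conversations = [f"conversation-{i}" for i in range(25)]

subset = select_examples(conversations, num_conv_examples=10)
everything = select_examples(conversations, num_conv_examples=-1)

print(len(subset), len(everything))  # 10 25
```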

The following code runs the evaluation process:


In [41]:
# Initialize the AutoEvaluationReport with the specified parameters
evaluator = AutoEvaluationReport(
    file_names=['../data/sample_mrbench.json'],
    output_data_dir='../outputs',
    tutor_models=['Novice', 'Expert', 'Llama31405B', 'GPT4', 'Sonnet', 'Phi3', 'Llama318B', 'Mistral', 'Gemini'],
    num_conv_examples=10,
)

# Generate the automated evaluation report without saving the evaluation data or the report itself
report, data = evaluator.get_automated_evaluation_report_with_all_models(save_eval=False, save_report=False)

# Display the first 10 rows of the generated report
report.head(10)


Loading data: 100%|██████████| 1/1 [00:00<00:00, 2816.86it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 34952.53it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 242445.32it/s]
Loading data: 100%|██████████| 1/1 [00:00<00:00, 3160.74it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 36095.56it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 245280.94it/s]
Loading data: 100%|██████████| 1/1 [00:00<00:00, 3421.13it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 36251.55it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 257319.26it/s]
Loading data: 100%|██████████| 1/1 [00:00<00:00, 3310.42it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 37282.70it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 306153.58it/s]
Loading data: 100%|██████████| 1/1 [00:00<00:00, 2106.63it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 32288.71it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 233016.89it/s]
Loading data: 100%|██████████| 1/1 [00:00<00:00, 4198.50it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 34323.27it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 281496.91it/s]
Loading data: 100%|██████████| 1/1 [00:00<00:00, 2490.68it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 33288.13it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 257319.26it/s]
Loading data: 100%|██████████| 1/1 [00:00<00:00, 1964.55it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 35040.13it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 306153.58it/s]
Loading data: 100%|██████████| 1/1 [00:00<00:00, 2555.94it/s]


Loaded 10 examples from ../data/sample_mrbench.json


Cleaning Data: 100%|██████████| 10/10 [00:00<00:00, 35275.90it/s]
Sanity Check for Tutor Models: 100%|██████████| 10/10 [00:00<00:00, 303935.07it/s]
Some weights of the model checkpoint at openai-community/roberta-large-openai-detector were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Computing Mistake Identification Scores using ['Mistake_Identification_Heuristic'] mtrics(s) for 10 examples...


Calculating Mistake_Identification_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 40072.97it/s]


Computing Mistake Location Scores using ['Mistake_Location_Heuristic'] mtrics(s) for 10 examples...


Calculating Mistake_Location_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 56007.03it/s]


Computing Providing Scores using ['Providing_Guidance_Uptake'] metric(s) for 10 examples...


Calculating Providing_Guidance_Uptake Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 19.12it/s]


Computing Revealing of the_Answer Scores using ['Revealing_of_the_Answer_Heuristic'] mtrics(s) for 10 examples...


Calculating Revealing_of_the_Answer_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 49998.33it/s]


Computing Actionability Scores using ['Actionability_Heuristic'] mtrics(s) for 10 examples...


Calculating Actionability_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 82062.47it/s]


Computing Coherence Scores using ['Coherence_BERT', 'Coherence_NLI'] method(a) for 10 examples...


Calculating Coherence_BERT Score for Tutors: 100%|██████████| 9/9 [00:03<00:00,  2.37it/s]
Calculating Coherence_NLI Score for Tutors:  56%|█████▌    | 5/9 [00:00<00:00,  6.24it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1188 > 512). Running this sequence through the model will result in indexing errors
Calculating Coherence_NLI Score for Tutors: 100%|██████████| 9/9 [00:01<00:00,  5.34it/s]


Computing Tutor Tone Scores using ['Tutor_Tone_FTRoBERTa'] metric(s) for 10 examples...


Calculating Tutor_Tone_FTRoBERTa Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 40.56it/s]


Computing Humanlikeness Scores using ['Humanlikeness_OGPT2', 'Humanlikeness_Heuristic'] mtrics(s) for 10 examples...


Calculating Humanlikeness_OGPT2 Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 21.61it/s]
Calculating Humanlikeness_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 34984.93it/s]


| Unnamed: 0 | Mistake_Identification_Heuristic | Mistake_Location_Heuristic | Providing_Guidance_Uptake | Revealing_of_the_Answer_Heuristic | Actionability_Heuristic | Coherence_BERT | Coherence_NLI | Tutor_Tone_FTRoBERTa | Humanlikeness_OGPT2 | Humanlikeness_Heuristic |
|---|---|---|---|---|---|---|---|---|---|---|
| Novice | 0.75 | 0.0 | 0.54 | 0.0 | 0.0 | 0.84 | 0.749 | 0.694 | 0.562 | 1.0 |
| Expert | 0.3 | 0.2 | 0.883 | 0.2 | 0.5 | 0.849 | 0.438 | 0.496 | 0.7 | 1.0 |
| Llama31405B | 0.9 | 0.8 | 0.998 | 0.8 | 0.5 | 0.853 | 0.676 | 0.555 | 0.711 | 0.75 |
| GPT4 | 0.8 | 0.5 | 0.997 | 0.5 | 0.3 | 0.855 | 0.575 | 0.564 | 0.858 | 0.9 |
| Sonnet | 0.6 | 0.5 | 0.903 | 0.5 | 0.3 | 0.851 | 0.601 | 0.811 | 0.679 | 1.0 |
| Phi3 | 0.6 | 0.3 | 0.74 | 0.3 | 0.4 | 0.825 | 0.613 | 0.674 | 0.756 | 0.75 |
| Llama318B | 0.5 | 0.5 | 0.989 | 0.5 | 0.2 | 0.853 | 0.855 | 0.586 | 0.736 | 0.9 |
| Mistral | 0.8 | 0.6 | 0.994 | 0.6 | 0.4 | 0.851 | 0.571 | 0.417 | 0.732 | 1.0 |
| Gemini | 0.9 | 0.7 | 0.888 | 0.7 | 0.2 | 0.845 | 0.66 | 0.68 | 0.782 | 1.0 |
| Overall | 0.679 | 0.488 | 0.906 | 0.488 | 0.333 | 0.847 | 0.629 | 0.603 | 0.736 | 0.917 |
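
A note on reading the `Overall` row: its values do not look like a plain average of the nine per-tutor scores. They are consistent with pooling over all individual responses instead. The sketch below checks this for `Mistake_Identification_Heuristic`. The per-tutor response counts are an assumption inferred from the scores themselves (a Novice score of 0.75 suggests 4 Novice responses in this 10-example sample); they are not reported by the kit, so treat this as a plausibility check rather than a description of the library's internals.

```python
# Per-tutor Mistake_Identification_Heuristic scores, copied from the report above.
scores = {
    "Novice": 0.75, "Expert": 0.3, "Llama31405B": 0.9, "GPT4": 0.8,
    "Sonnet": 0.6, "Phi3": 0.6, "Llama318B": 0.5, "Mistral": 0.8, "Gemini": 0.9,
}

# ASSUMPTION (not stated in the output): every tutor has 10 responses in this
# sample except Novice, which is assumed to have only 4.
counts = {tutor: 10 for tutor in scores}
counts["Novice"] = 4

# Unweighted mean of the nine per-tutor scores.
simple_mean = sum(scores.values()) / len(scores)

# Mean pooled over all responses, i.e. weighted by each tutor's response count.
pooled_mean = sum(scores[t] * counts[t] for t in scores) / sum(counts.values())

print(round(simple_mean, 3))  # 0.683
print(round(pooled_mean, 3))  # 0.679, matching the Overall row
```

Under the same assumed counts, the pooled mean also reproduces the `Overall` values for `Mistake_Location_Heuristic` (41/84 ≈ 0.488) and `Actionability_Heuristic` (28/84 ≈ 0.333).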


In [42]:
# Generate the automated evaluation report using only the best-performing model per dimension
# (save_eval=False and save_report=False, so nothing is written to disk)
report, data = evaluator.get_automated_evaluation_report_with_best_models(save_eval=False, save_report=False)

# Display the first 10 rows of the generated report
report.head(10)

Computing Mistake Identification Scores using ['Mistake_Identification_Heuristic'] mtrics(s) for 10 examples...


Calculating Mistake_Identification_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 54471.48it/s]


Computing Mistake Location Scores using ['Mistake_Location_Heuristic'] mtrics(s) for 10 examples...


Calculating Mistake_Location_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 49474.10it/s]


Computing Providing Scores using ['Providing_Guidance_Uptake'] metric(s) for 10 examples...


Calculating Providing_Guidance_Uptake Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 19.13it/s]


Computing Revealing of the_Answer Scores using ['Revealing_of_the_Answer_Heuristic'] mtrics(s) for 10 examples...


Calculating Revealing_of_the_Answer_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 53242.22it/s]


Computing Actionability Scores using ['Actionability_Heuristic'] mtrics(s) for 10 examples...


Calculating Actionability_Heuristic Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 75047.19it/s]


Computing Coherence Scores using ['Coherence_BERT'] method(a) for 10 examples...


Calculating Coherence_BERT Score for Tutors: 100%|██████████| 9/9 [00:03<00:00,  2.36it/s]


Computing Tutor Tone Scores using ['Tutor_Tone_FTRoBERTa'] metric(s) for 10 examples...


Calculating Tutor_Tone_FTRoBERTa Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 40.42it/s]


Computing Humanlikeness Scores using ['Humanlikeness_OGPT2'] mtrics(s) for 10 examples...


Calculating Humanlikeness_OGPT2 Score for Tutors: 100%|██████████| 9/9 [00:00<00:00, 21.47it/s]


| Unnamed: 0 | Mistake_Identification_Heuristic | Mistake_Location_Heuristic | Providing_Guidance_Uptake | Revealing_of_the_Answer_Heuristic | Actionability_Heuristic | Coherence_BERT | Tutor_Tone_FTRoBERTa | Humanlikeness_OGPT2 |
|---|---|---|---|---|---|---|---|---|
| Novice | 0.75 | 0.0 | 0.54 | 0.0 | 0.0 | 0.84 | 0.694 | 0.562 |
| Expert | 0.3 | 0.2 | 0.883 | 0.2 | 0.5 | 0.849 | 0.496 | 0.7 |
| Llama31405B | 0.9 | 0.8 | 0.998 | 0.8 | 0.5 | 0.853 | 0.555 | 0.711 |
| GPT4 | 0.8 | 0.5 | 0.997 | 0.5 | 0.3 | 0.855 | 0.564 | 0.858 |
| Sonnet | 0.6 | 0.5 | 0.903 | 0.5 | 0.3 | 0.851 | 0.811 | 0.679 |
| Phi3 | 0.6 | 0.3 | 0.74 | 0.3 | 0.4 | 0.825 | 0.674 | 0.756 |
| Llama318B | 0.5 | 0.5 | 0.989 | 0.5 | 0.2 | 0.853 | 0.586 | 0.736 |
| Mistral | 0.8 | 0.6 | 0.994 | 0.6 | 0.4 | 0.851 | 0.417 | 0.732 |
| Gemini | 0.9 | 0.7 | 0.888 | 0.7 | 0.2 | 0.845 | 0.68 | 0.782 |
| Overall | 0.679 | 0.488 | 0.906 | 0.488 | 0.333 | 0.847 | 0.603 | 0.736 |
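
To make a report like this easier to scan, one option is to list the top-scoring tutor for each metric. The following is a minimal standard-library sketch that works on the report values shown above, pasted in as CSV text, so it runs without the kit installed. The names `REPORT_CSV` and `top_by_metric` are illustrative and not part of AITutor-AssessmentKit. The sketch only reports the maximum per column; whether a higher score is actually desirable depends on how each metric is defined.

```python
import csv
import io

# Best-model report values copied from the output above.
REPORT_CSV = """\
Unnamed: 0,Mistake_Identification_Heuristic,Mistake_Location_Heuristic,Providing_Guidance_Uptake,Revealing_of_the_Answer_Heuristic,Actionability_Heuristic,Coherence_BERT,Tutor_Tone_FTRoBERTa,Humanlikeness_OGPT2
Novice,0.75,0.0,0.54,0.0,0.0,0.84,0.694,0.562
Expert,0.3,0.2,0.883,0.2,0.5,0.849,0.496,0.7
Llama31405B,0.9,0.8,0.998,0.8,0.5,0.853,0.555,0.711
GPT4,0.8,0.5,0.997,0.5,0.3,0.855,0.564,0.858
Sonnet,0.6,0.5,0.903,0.5,0.3,0.851,0.811,0.679
Phi3,0.6,0.3,0.74,0.3,0.4,0.825,0.674,0.756
Llama318B,0.5,0.5,0.989,0.5,0.2,0.853,0.586,0.736
Mistral,0.8,0.6,0.994,0.6,0.4,0.851,0.417,0.732
Gemini,0.9,0.7,0.888,0.7,0.2,0.845,0.68,0.782
Overall,0.679,0.488,0.906,0.488,0.333,0.847,0.603,0.736
"""

TUTOR_COL = "Unnamed: 0"  # first column holds the tutor name

rows = list(csv.DictReader(io.StringIO(REPORT_CSV)))
# Exclude the aggregate row so only individual tutors are compared.
tutor_rows = [row for row in rows if row[TUTOR_COL] != "Overall"]
metrics = [col for col in rows[0] if col != TUTOR_COL]

# For each metric, keep the tutor with the highest score.
# max() returns the first row on ties, so ties go to the tutor listed first.
top_by_metric = {}
for metric in metrics:
    best_row = max(tutor_rows, key=lambda row: float(row[metric]))
    top_by_metric[metric] = (best_row[TUTOR_COL], float(best_row[metric]))

for metric, (tutor, score) in top_by_metric.items():
    print(f"{metric}: {tutor} ({score})")
```

On this 10-example sample, `Tutor_Tone_FTRoBERTa` is led by Sonnet (0.811), while `Coherence_BERT` and `Humanlikeness_OGPT2` are both led by GPT4. With so few examples, small gaps such as the `Coherence_BERT` spread (0.825 to 0.855) should be read with caution.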
