# Getting Started with LastMile AI AutoEval

This notebook demonstrates how to use the LastMile AI AutoEval library to evaluate a dataset, generate labels, fine-tune a model for improved evaluation, and evaluate using the fine-tuned model. We will follow these steps:

1. Set up the API key
2. Prepare and upload a dataset
3. Evaluate the dataset against built-in metrics
4. Label the dataset
   - 4.1 Define a custom prompt template for labeling
   - 4.2 Label the dataset using the prompt
   - 4.3 Retrieve and view the labeled dataset
5. Fine-tune a model
   - 5.1 Prepare and upload a test dataset
   - 5.2 Fine-tune a model on the labeled dataset
   - 5.3 Evaluate the test dataset using the fine-tuned model

In [None]:
!pip install lastmile --upgrade

## 1. Set Up Your API Key

To interact with the LastMile AI API, set your API key in the `LASTMILE_API_TOKEN` environment variable. If you haven't already obtained an API key, please visit the LastMile AI dashboard.

In [2]:
import os

api_token = os.environ.get("LASTMILE_API_TOKEN")

if not api_token:
    print("Error: Please set your API key in the environment variable LASTMILE_API_TOKEN")
elif api_token == "YOUR_API_KEY_HERE":
    print("Error: Please replace 'YOUR_API_KEY_HERE' with your actual API key")
else:
    print("✓ API key successfully configured!")

# Configure pandas to show DataFrames without truncation
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


✓ API key successfully configured!


## 2. Prepare and Upload Your Dataset

Now that we have our API key configured, let's prepare and upload a dataset for evaluation. LastMile AI AutoEval expects a CSV file with the following columns:

- `input`: The user's query or input text
- `output`: The assistant's response to the user's query
- `ground_truth`: The correct or expected response for comparison (optional)

For this example, we'll use a sample dataset, `customer_support_dataset.csv`, containing airline customer support interactions. The dataset includes various scenarios like:

- Flight status inquiries
- Baggage policy questions  
- Check-in procedures
- Meal options and special requests

Each interaction includes the customer's query, the assistant's response, and the ground truth containing accurate airline policies and procedures. We'll upload this dataset and then evaluate how well the assistant's responses align with the ground truth using LastMile's evaluation metrics.
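
Before uploading, it can help to confirm locally that the file has the columns listed above. The following is a minimal sketch using only pandas; the helper `check_autoeval_columns` is a hypothetical name and not part of the `lastmile` library, and the in-memory frame stands in for the real CSV.

```python
import pandas as pd

REQUIRED_COLUMNS = ["input", "output"]  # ground_truth is optional


def check_autoeval_columns(df: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the frame looks fine."""
    problems = []
    for column in REQUIRED_COLUMNS:
        if column not in df.columns:
            problems.append(f"missing required column: {column}")
        elif df[column].isna().any():
            problems.append(f"empty values in required column: {column}")
    return problems


# Stand-in for pd.read_csv("data/customer_support_dataset.csv")
sample = pd.DataFrame(
    {
        "input": ["Can I bring my pet on the plane?"],
        "output": [
            "Yes, you can bring your pet, but it must fit under the seat "
            "in front of you and there is a pet fee."
        ],
        "ground_truth": ["Pets are not allowed on StrikeAir flights."],
    }
)

print(check_autoeval_columns(sample))  # []
print(check_autoeval_columns(sample.drop(columns=["output"])))
# ['missing required column: output']
```

Running a check like this first gives a clearer error than a failed upload would.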

In [3]:
from lastmile.lib.auto_eval import AutoEval

client = AutoEval(api_token=api_token)

dataset_csv_path = "data/customer_support_dataset.csv"

dataset_id = client.upload_dataset(
    file_path=dataset_csv_path,
    name="Customer Support Dataset",
    description="Dataset containing customer support queries and responses"
)

print(f"Dataset created with ID: {dataset_id}")

Dataset created with ID: cm3nea6kv00gvt50112j8n5wj


## 3. Evaluate the Dataset Against Built-in Metrics

Now that we have uploaded our dataset, we can evaluate it against the built-in metrics provided by LastMile AI AutoEval. These metrics include:

- `BuiltinMetrics.FAITHFULNESS`: Assesses how well the assistant's responses align with the provided ground truth.
- `BuiltinMetrics.RELEVANCE`: Evaluates the relevance and appropriateness of the assistant's responses to the user's queries.
- LastMile also provides other built-in metrics, such as Toxicity and Question-Answering.

We'll run the selected built-in metrics on the first 10 rows of our dataset with `evaluate_data()`, which accepts a pandas DataFrame.

Note: To evaluate an uploaded dataset by its ID instead, use the `evaluate_dataset()` method.

In [13]:
from lastmile.lib.auto_eval import BuiltinMetrics
default_metrics = [
    BuiltinMetrics.FAITHFULNESS,
    BuiltinMetrics.RELEVANCE,
    # LastMile provides a few other metrics out of the box such as Toxicity, Question-Answering, etc.
]

print("Evaluation job kicked off")
evaluation_results = client.evaluate_data(
    data=pd.read_csv(dataset_csv_path).head(10),
    metrics=default_metrics,
)

# Alternative: evaluate_dataset() runs the metrics on an uploaded dataset,
# referenced by its ID, instead of on a DataFrame.
# evaluation_results = client.evaluate_dataset(
#     dataset_id=dataset_id,
#     metrics=default_metrics,
# )
print("Evaluation Results:")
evaluation_results.head(10)



Evaluation job kicked off
Evaluation Results:


Unnamed: 0,input,output,ground_truth,Faithfulness_score,Relevance_score
0,Can I change my flight date for free?,"You can change your flight date for free if it is within 24 hours of booking. Otherwise, a fee may apply.","You can change your flight date for free within 24 hours of making the booking. After that, a change fee will apply.",0.99958,0.981094
1,What is the phone number for customer service?,Please call customer service at 1-800-555-1243.,1-800-555-1234,0.895927,0.841731
2,What time is my flight SA795 departing from JFK?,Your flight SA795 is departing from JFK at 6:30PM.,"SA795: JFK Terminal 4, Gate 42, 6:30PM departure",0.998571,0.924473
3,What are the business class amenities on StrikeAir flights?,"Business class includes fully flat seats, gourmet meals, and a selection of fine wines.","Business class amenities on StrikeAir flights include flatbed seats, gourmet dining, premium beverages, and extra baggage allowance.",0.999438,0.791342
4,What gate is my flight departing from?,"Your flight SA795 is departing from JFK Terminal 5, Gate 42 at 6:30PM.","SA795: JFK Terminal 4, Gate 42, 6:30PM departure",0.000678,0.932763
5,Can you tell me how to check in for my flight online?,"You can check in for your flight online by visiting our website and clicking on the 'Check-In' tab. Enter your booking reference and last name, and follow the instructions.",Visit our website and click on the 'Check-In' tab. Enter your booking reference and last name to check in.,0.998471,0.944256
6,Can you tell me the flight status for SA450?,"Flight SA450 is on schedule and will depart from JFK Terminal 2, Gate 15 at 4:00PM.","SA450: On Schedule, JFK Terminal 2, Gate 15, 4:00PM departure",0.999635,0.942924
7,Can I get a vegetarian meal on my flight?,"Yes, we offer vegetarian meals on our flights. Please make sure to request it during booking or contact customer service.",Vegetarian meals are available upon request. Please ensure you request it at the time of booking or contact customer service.,0.999843,0.95675
8,Can I bring my pet on the plane?,"Yes, you can bring your pet, but it must fit under the seat in front of you and there is a pet fee.",Pets are not allowed on StrikeAir flights.,0.000892,0.84928
9,Do you have any vegetarian meals on your flights?,"Yes, we do offer vegetarian meal options on our flights. You can request it during booking or contact customer service for assistance.","Vegetarian meal options are available on all flights, requestable during booking or via customer service.",0.999865,0.953567
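
Rows 4 and 8 above stand out with faithfulness scores near zero: the wrong terminal and the pet policy contradict the ground truth. A small sketch of how such rows could be surfaced programmatically follows. The scores are copied from the table above, and the 0.5 cut-off is an assumption chosen for illustration, not an AutoEval default.

```python
import pandas as pd

# Scores copied from the ten result rows shown above.
scores = pd.DataFrame(
    {
        "Faithfulness_score": [0.99958, 0.895927, 0.998571, 0.999438, 0.000678,
                               0.998471, 0.999635, 0.999843, 0.000892, 0.999865],
        "Relevance_score": [0.981094, 0.841731, 0.924473, 0.791342, 0.932763,
                            0.944256, 0.942924, 0.95675, 0.84928, 0.953567],
    }
)

THRESHOLD = 0.5  # assumed cut-off for illustration

# Keep only the rows whose faithfulness falls below the cut-off.
low_faithfulness = scores[scores["Faithfulness_score"] < THRESHOLD]
print(low_faithfulness.index.tolist())  # [4, 8]
```

In the notebook itself, the same filter could be applied directly to `evaluation_results`.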


## 4. Labeling the Dataset

While the default metrics provide a good starting point for evaluating our dataset, we can get more tailored results by labeling the data ourselves. This allows us to define custom criteria for assessing each data point.

### 4.1 Define a Custom Prompt Template for Labeling

To label the dataset, we'll create a custom prompt template. This template will instruct the evaluator model to assign labels `1` or `0` based on how well the assistant's output addresses the customer support query, using the ground truth as a reference.

Here's the prompt template we'll use:

In [14]:
prompt_template = """
You are an evaluator model tasked with assessing a generated output for a provided input based on the following criteria. Return a label of 1 or 0 according to these rules:

Label 1:
  - If the input is a customer support query and the output provides accurate and helpful information using details from the ground_truth.
  - If the input is not related to customer support and the output politely informs the user that it cannot assist with that request.

Label 0:
  - If the output contains information not present in the ground_truth.
  - If the output does not appropriately address the user's request as per the above guidelines.
  - If the output is irrelevant, incorrect, or unhelpful in the context of the input and ground_truth.

Ground Truth:
{ground_truth}

Input:
{input}

Output:
{output}

Label:
"""

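To preview what the evaluator sees for a single row, the placeholders can be filled in locally. This sketch assumes the `{ground_truth}`, `{input}`, and `{output}` placeholders are substituted per row in a way comparable to Python's `str.format`; how the service performs the substitution is not shown in this notebook. A shortened stand-in template is used so the example is self-contained.

```python
# Shortened stand-in for prompt_template, with the same three placeholders.
preview_template = (
    "Ground Truth:\n{ground_truth}\n\n"
    "Input:\n{input}\n\n"
    "Output:\n{output}\n\n"
    "Label:"
)

# One row from the customer support dataset.
row = {
    "input": "Can I bring my pet on the plane?",
    "output": (
        "Yes, you can bring your pet, but it must fit under the seat "
        "in front of you and there is a pet fee."
    ),
    "ground_truth": "Pets are not allowed on StrikeAir flights.",
}

# Fill each placeholder with the matching column value.
rendered = preview_template.format(**row)
print(rendered)
```

Reading a few rendered prompts is a cheap way to catch a misspelled placeholder before starting a labeling job.
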
### 4.2 Label the Dataset

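We use the `label_dataset` method of the `AutoEval` client to label our dataset with the prompt template. Passing `wait_for_completion=False` returns a job ID immediately; we then call `wait_for_label_dataset_job` to block until the job finishes.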

In [15]:
job_id = client.label_dataset(
    dataset_id=dataset_id,
    prompt_template=prompt_template,
    wait_for_completion=False
)

print(f"Labeling job started with ID: {job_id}")
print("Waiting for job to complete...")

client.wait_for_label_dataset_job(job_id)
print(f"Labeling Job with ID: {job_id} Completed")

Labeling job started with ID: cm3nezbaf00cznr019ml9olnu
Waiting for job to complete...
Labeling Job with ID: cm3nezbaf00cznr019ml9olnu Completed


### 4.3 Retrieve and View the Labeled Dataset

After the labeling job is completed, we'll download the dataset and display it to verify the labels.

In [16]:
print(f"Retrieving labeled dataset with ID: {dataset_id}")
dataset = client.download_dataset(dataset_id=dataset_id)
dataset.head(10)

Retrieving labeled dataset with ID: cm3nea6kv00gvt50112j8n5wj


Unnamed: 0,input,output,ground_truth,label
0,Can I change my flight date for free?,"You can change your flight date for free if it is within 24 hours of booking. Otherwise, a fee may apply.","You can change your flight date for free within 24 hours of making the booking. After that, a change fee will apply.",1
1,What is the phone number for customer service?,Please call customer service at 1-800-555-1243.,1-800-555-1234,0
2,What time is my flight SA795 departing from JFK?,Your flight SA795 is departing from JFK at 6:30PM.,"SA795: JFK Terminal 4, Gate 42, 6:30PM departure",1
3,What are the business class amenities on StrikeAir flights?,"Business class includes fully flat seats, gourmet meals, and a selection of fine wines.","Business class amenities on StrikeAir flights include flatbed seats, gourmet dining, premium beverages, and extra baggage allowance.",0
4,What gate is my flight departing from?,"Your flight SA795 is departing from JFK Terminal 5, Gate 42 at 6:30PM.","SA795: JFK Terminal 4, Gate 42, 6:30PM departure",0
5,Can you tell me how to check in for my flight online?,"You can check in for your flight online by visiting our website and clicking on the 'Check-In' tab. Enter your booking reference and last name, and follow the instructions.",Visit our website and click on the 'Check-In' tab. Enter your booking reference and last name to check in.,1
6,Can you tell me the flight status for SA450?,"Flight SA450 is on schedule and will depart from JFK Terminal 2, Gate 15 at 4:00PM.","SA450: On Schedule, JFK Terminal 2, Gate 15, 4:00PM departure",1
7,Can I get a vegetarian meal on my flight?,"Yes, we offer vegetarian meals on our flights. Please make sure to request it during booking or contact customer service.",Vegetarian meals are available upon request. Please ensure you request it at the time of booking or contact customer service.,1
8,Can I bring my pet on the plane?,"Yes, you can bring your pet, but it must fit under the seat in front of you and there is a pet fee.",Pets are not allowed on StrikeAir flights.,0
9,Do you have any vegetarian meals on your flights?,"Yes, we do offer vegetarian meal options on our flights. You can request it during booking or contact customer service for assistance.","Vegetarian meal options are available on all flights, requestable during booking or via customer service.",1
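
Comparing these labels with the built-in faithfulness scores from section 3 shows where the custom criteria differ. The sketch below uses values copied from the two tables in this notebook; the 0.5 cut-off for turning a score into a 0/1 decision is an assumption for illustration.

```python
import pandas as pd

# Labels from the table above, faithfulness scores from section 3.
frame = pd.DataFrame(
    {
        "label": [1, 0, 1, 0, 0, 1, 1, 1, 0, 1],
        "Faithfulness_score": [0.99958, 0.895927, 0.998571, 0.999438, 0.000678,
                               0.998471, 0.999635, 0.999843, 0.000892, 0.999865],
    }
)

# Binarize the score with an assumed 0.5 cut-off.
frame["faithful"] = (frame["Faithfulness_score"] >= 0.5).astype(int)

# Rows where the custom label and the built-in metric disagree.
disagreements = frame.index[frame["faithful"] != frame["label"]].tolist()
print(disagreements)  # [1, 3]

# Class balance of the labels, useful to know before fine-tuning.
print(frame["label"].value_counts().sort_index().to_dict())  # {0: 4, 1: 6}
```

Rows 1 and 3 (the mistyped phone number and the incomplete amenities answer) score high on faithfulness but receive label `0` under the custom prompt.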


## 5. Fine-Tuning a Model

Now that we have a labeled dataset, the next step is to fine-tune a model specifically for our customer support use case. Fine-tuning allows us to adapt a pre-trained model to our domain, improving its performance on tasks similar to the examples in our labeled dataset.

The process involves the following steps:

1. Prepare a separate test dataset to evaluate the fine-tuned model's performance. 
2. Submit a fine-tuning job using the labeled dataset for training.
3. Wait for the fine-tuning job to complete.
4. Evaluate the fine-tuned model on the test dataset.

By fine-tuning the model, we can create a specialized evaluation metric tailored to assess the quality of customer support responses more accurately than the default metrics. In the following cells, we will walk through each step of the fine-tuning process.

### 5.1 Prepare a Test Dataset

To evaluate our fine-tuned model, we'll upload a separate test dataset. This dataset should have the same format as our training dataset but contain different examples.

In [19]:
test_dataset_id = client.upload_dataset(
    file_path="data/customer_support_test.csv",  # Your test dataset file
    name="Dining Assistant Test Dataset",
    description="Test dataset for evaluating the fine-tuned model"
)

print(f"Test dataset uploaded with ID: {test_dataset_id}")

Test dataset uploaded with ID: cm3hvp4hc004ylf01zomcvyen
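
Since the test set is meant to contain different examples from the training set, a quick local overlap check can be useful before fine-tuning (a few of the rows shown in this notebook, such as the customer service phone number question, appear in both sets). This is a sketch with pandas on small in-memory frames; in practice the two CSV files would be loaded with `pd.read_csv`.

```python
import pandas as pd

# Two rows from the training data shown earlier.
train = pd.DataFrame(
    {
        "input": [
            "What is the phone number for customer service?",
            "Can I bring my pet on the plane?",
        ],
        "output": [
            "Please call customer service at 1-800-555-1243.",
            "Yes, you can bring your pet, but it must fit under the seat "
            "in front of you and there is a pet fee.",
        ],
    }
)

# Two rows from the test data.
test = pd.DataFrame(
    {
        "input": [
            "What is the phone number for customer service?",
            "Can I carry a musical instrument on board?",
        ],
        "output": [
            "Please call customer service at 1-800-555-1243.",
            "Small musical instruments that fit in the overhead bin are allowed.",
        ],
    }
)

# An inner join on both columns keeps only rows present in both sets.
overlap = test.merge(train.drop_duplicates(), on=["input", "output"], how="inner")
print(len(overlap))  # 1
```

Overlapping rows can make test results look better than they would on unseen data, so it is worth knowing how many there are.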


### 5.2 Fine-Tuning a Model

We can now fine-tune an evaluator model on the labeled training dataset so that it scores new data for our specific use case more accurately.

We'll submit a fine-tuning job using our labeled training dataset and wait for it to complete. The default base model used for fine-tuning is an in-house custom ALBERTA model. In the future, we will introduce more powerful models.

In [21]:
model_name = "Dining Assistant Evaluator v1" + "oops"

fine_tune_job_id = client.fine_tune_model(
    train_dataset_id=dataset_id,  # From our earlier labeling step
    test_dataset_id=test_dataset_id,
    model_name=model_name,
    selected_columns=["input", "output", "ground_truth"],
    wait_for_completion=False
)

print(f"Fine-tuning job initiated with ID: {fine_tune_job_id}. Waiting for completion...")
client.wait_for_fine_tune_job(fine_tune_job_id)
print(f"Fine-tuning job completed with ID: {fine_tune_job_id}")

Fine-tuning job initiated with ID: cm3hvptlu0010my01te2cuh0g. Waiting for completion...
Fine-tuning job completed with ID: cm3hvptlu0010my01te2cuh0g


### 5.3 Evaluate Using Fine-tuned Model

After fine-tuning completes, we need to wait for the model to be available as a metric. The `wait_for_metric_online` call below blocks until our fine-tuned model is online.

Once the model is available, we can use it to evaluate our test dataset and view the results.

In [23]:
from lastmile.lib.auto_eval import Metric

metric = Metric(name=model_name)
print("Waiting for fine-tuned model to be available as metric...")
fine_tuned_metric = client.wait_for_metric_online(metric)
print(f"Fine-tuned model available as metric with ID: {fine_tuned_metric.id}")

# Run evaluation using our fine-tuned model
results = client.evaluate_dataset(test_dataset_id, fine_tuned_metric)

# Display results
print("Evaluation Results:")
results

Waiting for fine-tuned model to be available as metric...
Fine-tuned model available as metric with ID: cm3hvsl1y0052lf01mh0v0gy3
Evaluation Results:


Unnamed: 0,ground_truth,input,output,label,Dining Assistant Evaluator v1oops_score
0,Musical instruments are allowed on board if they fit in the cabin's overhead bin.,Can I carry a musical instrument on board?,Small musical instruments that fit in the overhead bin are allowed.,1,0.52739
1,"2 checked bags, 23kg each",How many free checked bags am I allowed on long-haul flights?,"You are allowed 1 checked bag, weighing up to 23kg, on long-haul flights.",0,0.451582
2,1-800-555-1234,What is the phone number for customer service?,Please call customer service at 1-800-555-1243.,0,0.458713
3,Next flight: 8:00 PM from JFK to LHR,When is the next flight from New York to London?,The next flight from New York to London is at 9:00 PM.,0,0.456505
4,"Business class amenities on StrikeAir flights include flatbed seats, gourmet dining, premium beverages, and extra baggage allowance.",What are the business class amenities on StrikeAir flights?,"Business class includes fully flat seats, gourmet meals, and a selection of fine wines.",0,0.487009
5,Flight status for SA1234 is available on our website or by calling customer service at 1-800-555-1234.,How can I track my flight status?,You can check the status of your flight SA1234 on our website or call customer service at 1-800-555-1234.,1,0.527722
6,Vegetarian meals are available upon request. Please ensure you request it at the time of booking or contact customer service.,Can I get a vegetarian meal on my flight?,"Yes, we offer vegetarian meals on our flights. Please make sure to request it during booking or contact customer service.",1,0.556459
7,1 carry-on bag and 1 personal item are allowed,How many carry-on bags am I allowed to bring?,You are allowed to bring 1 carry-on bag and 1 personal item.,1,0.505438
8,A negative COVID-19 test result taken within 72 hours before boarding is required.,Do I need to provide a negative COVID-19 test result before boarding?,"Yes, you need to provide a negative COVID-19 test result taken within 72 hours before boarding.",1,0.55759
9,JFK Terminal 4,What terminal is my flight arriving at?,Your flight is arriving at JFK Terminal 5.,0,0.402872
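
One way to summarize these results is to turn each score into a 0/1 prediction and compare it with the `label` column. The sketch below copies the labels and scores from the table above; the 0.5 threshold and the short column name `score` are assumptions for illustration.

```python
import pandas as pd

# Labels and fine-tuned metric scores copied from the table above.
results_sample = pd.DataFrame(
    {
        "label": [1, 0, 0, 0, 0, 1, 1, 1, 1, 0],
        "score": [0.52739, 0.451582, 0.458713, 0.456505, 0.487009,
                  0.527722, 0.556459, 0.505438, 0.55759, 0.402872],
    }
)

THRESHOLD = 0.5  # assumed decision threshold

# Binarize the scores and measure agreement with the labels.
results_sample["predicted"] = (results_sample["score"] >= THRESHOLD).astype(int)
accuracy = (results_sample["predicted"] == results_sample["label"]).mean()
print(accuracy)  # 1.0

# Gap between the lowest-scoring positive and highest-scoring negative row.
min_positive = results_sample.loc[results_sample["label"] == 1, "score"].min()
max_negative = results_sample.loc[results_sample["label"] == 0, "score"].max()
print(min_positive, max_negative)  # 0.505438 0.487009
```

All ten rows shown fall on the correct side of 0.5, but the scores sit in a narrow band around the threshold, so the choice of cut-off deserves a check on a larger sample.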
