# Getting Started with LastMile AI AutoEval

This notebook demonstrates how to use the LastMile AI AutoEval library to generate labels for a dataset and fine-tune a model for improved evaluation. We will follow these steps:

1. Set up the API key
2. Prepare and upload a dataset for labeling
3. Define a custom prompt template for labeling
4. Label the dataset using the prompt
5. Retrieve and view the labeled dataset
6. Prepare and upload a test dataset
7. Fine-tune a model on the labeled dataset
8. Evaluate the test dataset using the fine-tuned model

In [None]:
!pip install lastmile --upgrade

## 1. Set Up Your API Key

To interact with the LastMile AI API, set your API key in the `LASTMILE_API_TOKEN` environment variable, which the cell below reads. If you don't have an API key yet, you can obtain one from the LastMile AI dashboard.
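If the variable is not set in the environment that launched the notebook, one option is to prompt for it at runtime instead of hard-coding it in a cell. The sketch below is only an illustration: `ensure_api_token` is a hypothetical helper, not part of the `lastmile` package, and it assumes the variable name `LASTMILE_API_TOKEN` that the next cell reads.

```python
import os
from getpass import getpass


def ensure_api_token(env_var="LASTMILE_API_TOKEN", prompt=getpass):
    """Return the token from the environment, prompting once if it is missing.

    Hypothetical helper for illustration; not part of the lastmile package.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed value so it is not echoed into notebook output.
        token = prompt(f"Enter value for {env_var}: ").strip()
        if not token:
            raise ValueError(f"{env_var} is empty")
        # Visible to this process only; it does not persist across sessions.
        os.environ[env_var] = token
    return token
```

Calling `ensure_api_token()` before the next cell makes the token available to `os.environ.get` for the rest of the session.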

In [7]:
import os

api_token = os.environ.get("LASTMILE_API_TOKEN")

if not api_token:
    print("Error: Please set your API key in the environment variable LASTMILE_API_TOKEN")
elif api_token == "YOUR_API_KEY_HERE":
    print("Error: Please replace 'YOUR_API_KEY_HERE' with your actual API key")
else:
    print("✓ API key successfully configured!")

# Configure pandas to display DataFrames without truncation (for display purposes)
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


✓ API key successfully configured!


## 2. Prepare Your Dataset for Labeling

To generate labels for your dataset using LastMile AI AutoEval, you'll need to upload a CSV file containing the data you want to label. The CSV file should be formatted with the following columns:

- `input`: The user's query or input text.
- `output`: The assistant's response to the user's query.
- `ground_truth`: The correct or expected response for the given input (optional).

In this example, we'll use a sample dataset, `data/customer_support_dataset.csv`, which contains interactions between users and a customer support assistant (for example, questions about baggage allowances and flight status).

Make sure the CSV file is in the correct format and located at the path used in the next cell (relative to this notebook) before proceeding to the next step.
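Before uploading, it can help to confirm that the file has the expected header. The following is a minimal sketch using only the standard library; `check_dataset_columns` is a hypothetical helper, and whether the service rejects or ignores extra columns is not stated here, so extras are only reported, not treated as errors.

```python
import csv

REQUIRED_COLUMNS = {"input", "output"}
OPTIONAL_COLUMNS = {"ground_truth"}


def check_dataset_columns(csv_path):
    """Read the CSV header and return (missing_required, unexpected) column names.

    Hypothetical helper for illustration; not part of the lastmile package.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Only the first row is needed; an empty file yields an empty header.
        header = next(csv.reader(f), [])
    columns = {name.strip() for name in header}
    missing = sorted(REQUIRED_COLUMNS - columns)
    extra = sorted(columns - REQUIRED_COLUMNS - OPTIONAL_COLUMNS)
    return missing, extra
```

For example, `missing, extra = check_dataset_columns("data/customer_support_dataset.csv")` should give an empty `missing` list for a file that is ready to upload.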

In [9]:
from lastmile.lib.auto_eval import AutoEval

client = AutoEval(api_token=api_token)

dataset_csv_path = "data/customer_support_dataset.csv"

dataset_id = client.upload_dataset(
    file_path=dataset_csv_path,
    name="Supercard Dining Dataset",
    description="Dataset containing dining-related queries and responses"
)

print(f"Dataset created with ID: {dataset_id}")

Dataset created with ID: cm3hvlx80000umy016vdg47wb


## 3. Define a Custom Prompt Template

We'll create a prompt template that tells the evaluator model how to assess each data point. The template includes instructions on when to assign labels `1` or `0` based on specific criteria.

In [10]:
prompt_template = """
You are an evaluator model tasked with assessing a generated output for a provided input based on the following criteria. Return a label of 1 or 0 according to these rules:

Label 1:
  - If the input is a customer support query and the output provides accurate and helpful information using details from the ground_truth.
  - If the input is not related to customer support and the output politely informs the user that it cannot assist with that request.

Label 0:
  - If the output contains information not present in the ground_truth.
  - If the output does not appropriately address the user's request as per the above guidelines.
  - If the output is irrelevant, incorrect, or unhelpful in the context of the input and ground_truth.

Ground Truth:
{ground_truth}

Input:
{input}

Output:
{output}

Label:
"""
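To sanity-check the wording before submitting a labeling job, you can fill the placeholders locally for a single row. This is only a local preview using Python's `str.format`; how the service substitutes values internally is not described in this notebook, and the abbreviated `preview_template` and `row` below are illustrative stand-ins.

```python
# Abbreviated stand-in for the full prompt_template defined above.
preview_template = """Ground Truth:
{ground_truth}

Input:
{input}

Output:
{output}

Label:
"""

# Example row, with values taken from the sample data in this notebook.
row = {
    "input": "What terminal is my flight arriving at?",
    "output": "Your flight is arriving at JFK Terminal 5.",
    "ground_truth": "JFK Terminal 4",
}

# str.format raises KeyError if the template names a column the row lacks.
rendered = preview_template.format(**row)
print(rendered)
```

Passing `prompt_template.format(**row)` instead previews the full template in the same way.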

## 4. Label the Dataset

We use the `label_dataset` method of the `AutoEval` client to label our dataset with the prompt template. Here we pass `wait_for_completion=False` so the call returns a job ID immediately, then block on `wait_for_label_dataset_job` until the labeling job has finished.

In [11]:
job_id = client.label_dataset(
    dataset_id=dataset_id,
    prompt_template=prompt_template,
    wait_for_completion=False
)

print(f"Labeling job started with ID: {job_id}")
print("Waiting for job to complete...")

client.wait_for_label_dataset_job(job_id)
print(f"Labeling Job with ID: {job_id} Completed")

Labeling job started with ID: cm3hvmbhh001nuj01rt64pxp1
Waiting for job to complete...
Labeling Job with ID: cm3hvmbhh001nuj01rt64pxp1 Completed


## 5. Retrieve and View the Labeled Dataset

After the labeling job is completed, we'll download the dataset and display it to verify the labels.

In [18]:
print(f"Retrieving labeled dataset with ID: {dataset_id}")
dataset = client.download_dataset(dataset_id=dataset_id)
dataset.head(10)

Retrieving labeled dataset with ID: cm3hvlx80000umy016vdg47wb


| Unnamed: 0 | ground_truth | input | output | label |
| --- | --- | --- | --- | --- |
| 0 | Musical instruments are allowed on board if they fit in the cabin's overhead bin. | Can I carry a musical instrument on board? | Small musical instruments that fit in the overhead bin are allowed. | 1 |
| 1 | 2 checked bags, 23kg each | How many free checked bags am I allowed on long-haul flights? | You are allowed 1 checked bag, weighing up to 23kg, on long-haul flights. | 0 |
| 2 | 1-800-555-1234 | What is the phone number for customer service? | Please call customer service at 1-800-555-1243. | 0 |
| 3 | Next flight: 8:00 PM from JFK to LHR | When is the next flight from New York to London? | The next flight from New York to London is at 9:00 PM. | 0 |
| 4 | Business class amenities on StrikeAir flights include flatbed seats, gourmet dining, premium beverages, and extra baggage allowance. | What are the business class amenities on StrikeAir flights? | Business class includes fully flat seats, gourmet meals, and a selection of fine wines. | 0 |
| 5 | Flight status for SA1234 is available on our website or by calling customer service at 1-800-555-1234. | How can I track my flight status? | You can check the status of your flight SA1234 on our website or call customer service at 1-800-555-1234. | 1 |
| 6 | Vegetarian meals are available upon request. Please ensure you request it at the time of booking or contact customer service. | Can I get a vegetarian meal on my flight? | Yes, we offer vegetarian meals on our flights. Please make sure to request it during booking or contact customer service. | 1 |
| 7 | 1 carry-on bag and 1 personal item are allowed | How many carry-on bags am I allowed to bring? | You are allowed to bring 1 carry-on bag and 1 personal item. | 1 |
| 8 | A negative COVID-19 test result taken within 72 hours before boarding is required. | Do I need to provide a negative COVID-19 test result before boarding? | Yes, you need to provide a negative COVID-19 test result taken within 72 hours before boarding. | 1 |
| 9 | JFK Terminal 4 | What terminal is my flight arriving at? | Your flight is arriving at JFK Terminal 5. | 0 |
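Beyond reading individual rows, a quick look at the label balance helps verify that the prompt did not collapse to a single class. The sketch below uses only the standard library; `label_distribution` is a hypothetical helper, and the list of labels is copied from the ten rows displayed above.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, fraction)} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Labels of the ten rows displayed above, in order.
shown_labels = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
print(label_distribution(shown_labels))  # {0: (5, 0.5), 1: (5, 0.5)}
```

Assuming `download_dataset` returns a pandas DataFrame (the cell above calls `.head(10)` on it), `label_distribution(dataset["label"])` would apply the same check to the full dataset.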


## 6. Prepare a Test Dataset

Before fine-tuning, we'll upload a separate test dataset. The fine-tuning job takes it as `test_dataset_id`, and we later score it with the fine-tuned model. It should have the same format as our training dataset but contain different examples.
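If you start from a single collection of examples, one way to obtain non-overlapping training and test sets is a seeded shuffle followed by a split. This is a generic sketch, not something the `lastmile` package provides: `split_rows` is a hypothetical helper, and the 80/20 ratio is an arbitrary illustrative choice.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows deterministically and split into (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible for a given seed
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Each part can then be written to its own CSV file (for example with `csv.DictWriter`) and uploaded separately.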

In [19]:
test_dataset_id = client.upload_dataset(
    file_path="data/customer_support_test.csv",  # Your test dataset file
    name="Dining Assistant Test Dataset",
    description="Test dataset for evaluating the fine-tuned model"
)

print(f"Test dataset uploaded with ID: {test_dataset_id}")

Test dataset uploaded with ID: cm3hvp4hc004ylf01zomcvyen


## 7. Fine-Tuning a Model

Now that we have a labeled dataset, we can fine-tune an evaluator model to improve its performance on our specific use case. The fine-tuned model can then be used to evaluate new data more accurately.

We'll submit a fine-tuning job using our labeled training dataset and wait for it to complete. The default base model used for fine-tuning is an in-house custom ALBERTA model. In the future, we will introduce more powerful models.

In [21]:
model_name = "Dining Assistant Evaluator v1" + "oops"

fine_tune_job_id = client.fine_tune_model(
    train_dataset_id=dataset_id,  # From our earlier labeling step
    test_dataset_id=test_dataset_id,
    model_name=model_name,
    selected_columns=["input", "output", "ground_truth"],
    wait_for_completion=False
)

print(f"Fine-tuning job initiated with ID: {fine_tune_job_id}. Waiting for completion...")
client.wait_for_fine_tune_job(fine_tune_job_id)
print(f"Fine-tuning job completed with ID: {fine_tune_job_id}")

Fine-tuning job initiated with ID: cm3hvptlu0010my01te2cuh0g. Waiting for completion...
Fine-tuning job completed with ID: cm3hvptlu0010my01te2cuh0g


## 8. Evaluate Using the Fine-Tuned Model

After fine-tuning completes, the model still has to come online as a metric before it can be used. We'll call `wait_for_metric_online`, which waits until a metric with our model's name is available.

Once the model is available, we can use it to evaluate our test dataset and view the results.
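The wait described above follows a common polling pattern: check a condition, sleep, and give up after a timeout. The cell below relies on `client.wait_for_metric_online` for this; the sketch here only illustrates the general pattern and is not the library's implementation. `poll_until` and all of its parameters are hypothetical.

```python
import time


def poll_until(check, interval_s=5.0, timeout_s=600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a truthy value or the timeout elapses.

    Generic illustration of polling; not part of the lastmile package.
    """
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        # Pause between attempts so the remote service is not hammered.
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist so the loop can be exercised in tests without real delays.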

In [23]:
from lastmile.lib.auto_eval import Metric

# Look up the fine-tuned model by the name given when fine-tuning
metric = Metric(name=model_name)
print("Waiting for fine-tuned model to be available as metric...")
fine_tuned_metric = client.wait_for_metric_online(metric)
print(f"Fine-tuned model available as metric with ID: {fine_tuned_metric.id}")

# Run evaluation using our fine-tuned model
results = client.evaluate_dataset(test_dataset_id, fine_tuned_metric)

# Display results
print("Evaluation Results:")
results

Waiting for fine-tuned model to be available as metric...
Fine-tuned model available as metric with ID: cm3hvsl1y0052lf01mh0v0gy3
Evaluation Results:


| Unnamed: 0 | ground_truth | input | output | label | Dining Assistant Evaluator v1oops_score |
| --- | --- | --- | --- | --- | --- |
| 0 | Musical instruments are allowed on board if they fit in the cabin's overhead bin. | Can I carry a musical instrument on board? | Small musical instruments that fit in the overhead bin are allowed. | 1 | 0.52739 |
| 1 | 2 checked bags, 23kg each | How many free checked bags am I allowed on long-haul flights? | You are allowed 1 checked bag, weighing up to 23kg, on long-haul flights. | 0 | 0.451582 |
| 2 | 1-800-555-1234 | What is the phone number for customer service? | Please call customer service at 1-800-555-1243. | 0 | 0.458713 |
| 3 | Next flight: 8:00 PM from JFK to LHR | When is the next flight from New York to London? | The next flight from New York to London is at 9:00 PM. | 0 | 0.456505 |
| 4 | Business class amenities on StrikeAir flights include flatbed seats, gourmet dining, premium beverages, and extra baggage allowance. | What are the business class amenities on StrikeAir flights? | Business class includes fully flat seats, gourmet meals, and a selection of fine wines. | 0 | 0.487009 |
| 5 | Flight status for SA1234 is available on our website or by calling customer service at 1-800-555-1234. | How can I track my flight status? | You can check the status of your flight SA1234 on our website or call customer service at 1-800-555-1234. | 1 | 0.527722 |
| 6 | Vegetarian meals are available upon request. Please ensure you request it at the time of booking or contact customer service. | Can I get a vegetarian meal on my flight? | Yes, we offer vegetarian meals on our flights. Please make sure to request it during booking or contact customer service. | 1 | 0.556459 |
| 7 | 1 carry-on bag and 1 personal item are allowed | How many carry-on bags am I allowed to bring? | You are allowed to bring 1 carry-on bag and 1 personal item. | 1 | 0.505438 |
| 8 | A negative COVID-19 test result taken within 72 hours before boarding is required. | Do I need to provide a negative COVID-19 test result before boarding? | Yes, you need to provide a negative COVID-19 test result taken within 72 hours before boarding. | 1 | 0.55759 |
| 9 | JFK Terminal 4 | What terminal is my flight arriving at? | Your flight is arriving at JFK Terminal 5. | 0 | 0.402872 |
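To compare the scores with the labels, one simple approach is to binarize each score at a threshold and measure agreement. The sketch below makes two assumptions that this notebook does not confirm: that a higher score means the model leans toward label `1`, and that `0.5` is a sensible cut-off. `accuracy_at_threshold` is a hypothetical helper.

```python
def accuracy_at_threshold(labels, scores, threshold=0.5):
    """Fraction of rows where (score >= threshold) agrees with the 0/1 label.

    Assumes higher scores correspond to label 1; the 0.5 default is arbitrary.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if len(labels) == 0:
        raise ValueError("no rows to compare")
    hits = sum(int(score >= threshold) == label
               for label, score in zip(labels, scores))
    return hits / len(labels)


# Values copied from the ten rows displayed above.
labels = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
scores = [0.52739, 0.451582, 0.458713, 0.456505, 0.487009,
          0.527722, 0.556459, 0.505438, 0.55759, 0.402872]
print(accuracy_at_threshold(labels, scores))  # 1.0
```

On these ten rows every thresholded score matches its label, although the scores sit close to the cut-off, so a different threshold could change the result.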
