# Run Eval Experiments with LastMile AI AutoEval

This notebook demonstrates how to use the LastMile AI AutoEval library to create Experiments, which
organize evaluation runs as you make changes to your AI application.

Experiments can be used to systematically test the effects of making changes to your AI application,
such as updating the LLM, the retrieval strategy for a RAG system, the system prompts for an agent, and more.

In this guide, we'll create a Project and an Experiment, upload a dataset, run an Evaluation on it, and log the results to the Experiment.

We will follow these steps:

1. Set up the AutoEval client with your API key
2. Create a Project
3. Prepare and upload a dataset
4. Create an Experiment
5. Evaluate the dataset against built-in metrics and log the run to the Experiment
6. Visualize the results in the Experiments console

In [2]:
%pip install lastmile --upgrade


## 1. Set Up AutoEval Client

To interact with the LastMile AI API, set your API key in the `LASTMILE_API_TOKEN` environment variable. If you don't have an API key yet, get one from the LastMile AI dashboard.

In [3]:
import os

api_token = os.environ.get("LASTMILE_API_TOKEN")

if not api_token:
    print("Error: Please set your API key in the environment variable LASTMILE_API_TOKEN")
elif api_token == "YOUR_API_KEY_HERE":
    print("Error: Please replace 'YOUR_API_KEY_HERE' with your actual API key")
else:
    print("✓ API key successfully configured!")

# Setup Pandas to display without truncation (for display purposes)
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


✓ API key successfully configured!
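
The check above only prints a message, so later cells would still run with a missing key. Below is a minimal sketch of a stricter alternative that stops early instead. The helper name `load_api_token` and its parameters are hypothetical and not part of the `lastmile` package; only the `LASTMILE_API_TOKEN` variable name comes from this notebook.

```python
import os
from typing import Mapping, Optional


def load_api_token(
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "LASTMILE_API_TOKEN",
) -> str:
    """Return the API token from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    source = os.environ if env is None else env
    token = source.get(var_name, "").strip()
    # Treat an unset variable and the placeholder value the same way.
    if not token or token == "YOUR_API_KEY_HERE":
        raise RuntimeError(
            f"Set a valid API key in the environment variable {var_name}"
        )
    return token
```

With this helper, `api_token = load_api_token()` could replace the `os.environ.get(...)` lookup, and a missing key would fail in this cell rather than in a later API call.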


In [4]:
from lastmile.lib.auto_eval import AutoEval

client = AutoEval(api_token=api_token) # You can also set the project_id= for the project you want to scope to (see below)

## 2. Create a Project

A Project is the container that organizes your Experiments, Evaluation runs, and Datasets. It usually corresponds to the initiative or AI application you're building.

Projects are straightforward to create, so make as many as you need to keep your evals separate, especially if you're working on several applications at once.

In [12]:
project = client.create_project(
    name="AutoEval Experiments",
    description="Project to test AutoEval Experiments"
)

# Important - set the project_id in the client so all requests are scoped to this project
client.project_id = project.id

In [5]:
# Let's list the projects in our account. It should include the newly created Project, as well as the "Default" project
projects = client.list_projects()
projects

[Project(id='j13hf8g5kqfwbrh4332w89nd', created_at=datetime.datetime(2025, 2, 10, 21, 50, 55, 656000, tzinfo=datetime.timezone.utc), name='AutoEval Experiments', updated_at=datetime.datetime(2025, 2, 10, 21, 50, 55, 656000, tzinfo=datetime.timezone.utc), creator_id='cldfcu2780008qsueqgiqvenw', deleted_at=None, description='Project to test AutoEval Experiments', organization_id=None, organization_name=None, createdAt='2025-02-10T21:50:55.656Z', updatedAt='2025-02-10T21:50:55.656Z', creatorId='cldfcu2780008qsueqgiqvenw'),
 Project(id='z8kfriq6cga6j0fx38znw4y6', created_at=datetime.datetime(2024, 12, 23, 16, 54, 45, 89000, tzinfo=datetime.timezone.utc), name='Default', updated_at=datetime.datetime(2024, 12, 23, 16, 54, 45, 89000, tzinfo=datetime.timezone.utc), creator_id='cldfcu2780008qsueqgiqvenw', deleted_at=None, description=None, organization_id=None, organization_name=None, createdAt='2024-12-23T16:54:45.089Z', updatedAt='2024-12-23T16:54:45.089Z', creatorId='cldfcu2780008qsueqgiqven

In [6]:
# You can also fetch a single project by its ID
default_project = client.get_project(project_id="z8kfriq6cga6j0fx38znw4y6")
default_project

Project(id='z8kfriq6cga6j0fx38znw4y6', created_at=datetime.datetime(2024, 12, 23, 16, 54, 45, 89000, tzinfo=datetime.timezone.utc), name='Default', updated_at=datetime.datetime(2024, 12, 23, 16, 54, 45, 89000, tzinfo=datetime.timezone.utc), creator_id='cldfcu2780008qsueqgiqvenw', deleted_at=None, description=None, organization_id=None, organization_name=None, createdAt='2024-12-23T16:54:45.089Z', updatedAt='2024-12-23T16:54:45.089Z', creatorId='cldfcu2780008qsueqgiqvenw')
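
Hardcoding a project ID, as in the cell above, is brittle across accounts. As a rough sketch, a project could instead be looked up by name from the list that `list_projects()` returns. The helper `find_project_by_name` and the `ProjectStub` class are hypothetical. The stub only mimics the `id` and `name` fields visible in the printed output, so the example runs without calling the API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ProjectStub:
    """Hypothetical stand-in with the `id` and `name` fields seen in the output above."""
    id: str
    name: str


def find_project_by_name(projects: Iterable, name: str) -> Optional[object]:
    """Return the first project whose `name` matches, or None if there is no match."""
    return next((p for p in projects if p.name == name), None)


# IDs and names copied from the list_projects() output shown earlier.
projects = [
    ProjectStub(id="j13hf8g5kqfwbrh4332w89nd", name="AutoEval Experiments"),
    ProjectStub(id="z8kfriq6cga6j0fx38znw4y6", name="Default"),
]

default_project = find_project_by_name(projects, "Default")
print(default_project.id if default_project else "not found")
```

Assuming the real `Project` objects expose `.name` as the printed representation suggests, `find_project_by_name(client.list_projects(), "Default")` should behave the same way.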

## 3. Prepare and Upload Your Dataset

Now that the client and Project are set up, let's prepare and upload a dataset for evaluation. LastMile AI AutoEval expects a CSV file with the following columns:

- `input`: The user's query or input text
- `output`: The assistant's response to the user's query
- `ground_truth`: The correct or expected response for comparison (optional)

For this example, we'll use a sample dataset, `customer_support_dataset.csv`, containing airline customer support interactions. The dataset includes various scenarios like:

- Flight status inquiries
- Baggage policy questions  
- Check-in procedures
- Meal options and special requests

Each interaction includes the customer's query, the assistant's response, and the ground truth containing accurate airline policies and procedures. We'll upload this dataset and then evaluate how well the assistant's responses align with the ground truth using LastMile's evaluation metrics.
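
Before uploading, it can help to confirm that the CSV header contains the expected columns. The following is a minimal sketch using only the standard library. The function `missing_columns` is a hypothetical helper, not part of the `lastmile` package, and it treats `ground_truth` as optional, as described above.

```python
import csv
import io
from typing import List, Sequence

# `input` and `output` are required; `ground_truth` is optional per the column list above.
REQUIRED_COLUMNS = ("input", "output")


def missing_columns(csv_text: str, required: Sequence[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [col for col in required if col not in present]


sample = (
    "input,output,ground_truth\n"
    '"Can I bring my pet on the plane?",'
    '"Yes, you can bring your pet, but it must fit under the seat in front of you and there is a pet fee.",'
    '"Pets are not allowed on StrikeAir flights."\n'
)
print(missing_columns(sample))
```

To check a file on disk, read it first, for example `missing_columns(open(path, newline="").read())`, and upload only if the returned list is empty.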

In [24]:
dataset_csv_path = "data/customer_support_dataset.csv"

dataset_id = client.upload_dataset(
    file_path=dataset_csv_path,
    name="Customer Support Dataset",
    description="Dataset containing customer support queries and responses"
)

print(f"Dataset created with ID: {dataset_id}")

Dataset created with ID: xf0jx4dlltr92628aio94g18


In [None]:
# You can also copy/clone a dataset
copied_dataset_id = client.copy_dataset(dataset_id=dataset_id)
copied_dataset_id

'zbhg5hnlnfw2kqilx263jarb'

In [27]:
# Deleting the copied dataset (which archives it)
deleted = client.delete_dataset(dataset_id=copied_dataset_id)
deleted

True

## 4. Create an Experiment

Let's create an Experiment to group our evaluation runs.

In [13]:
experiment = client.create_experiment(
    name="Customer Support Experiment A",
    description="Experiment to test customer support queries",
    # You can specify any useful properties in the metadata, which will surface as columns in any evals logged to this experiment
    metadata={
        "model": "gpt-4o", 
        "temperature": 0.8, 
        "misc": {
            "dataset_version": "0.1.1", 
            "app": "customer-support"
        }
    }
)

In [14]:
experiment

Experiment(id='lifjh83z02qqrnpoxsdu1108', created_at=datetime.datetime(2025, 2, 10, 23, 43, 33, 784000, tzinfo=datetime.timezone.utc), name='Customer Support Experiment A', project_id='j13hf8g5kqfwbrh4332w89nd', updated_at=datetime.datetime(2025, 2, 10, 23, 43, 33, 784000, tzinfo=datetime.timezone.utc), creator_id='cldfcu2780008qsueqgiqvenw', description='Experiment to test customer support queries', metadata=ExperimentMetadata(fields=None, misc={'app': 'customer-support', 'dataset_version': '0.1.1'}, model='gpt-4o', temperature=0.8), createdAt='2025-02-10T23:43:33.784Z', updatedAt='2025-02-10T23:43:33.784Z', creatorId='cldfcu2780008qsueqgiqvenw', projectId='j13hf8g5kqfwbrh4332w89nd')

In [16]:
# You can get an experiment by ID
experiment = client.get_experiment(experiment_id=experiment.id)
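
To make "surface as columns" concrete, here is a rough illustration of how the metadata above could map to column values. This is not the library's implementation. It is a sketch that happens to match the evaluation output shown later in this notebook, where `model` and `temperature` appear as plain columns and the nested `misc` dictionary appears as a compact JSON string. The function `metadata_to_columns` is hypothetical.

```python
import json
from typing import Any, Dict


def metadata_to_columns(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Map top-level metadata keys to column values.

    Assumption: nested dicts and lists are serialized to compact JSON strings,
    and scalar values are passed through unchanged.
    """
    columns: Dict[str, Any] = {}
    for key, value in metadata.items():
        if isinstance(value, (dict, list)):
            columns[key] = json.dumps(value, sort_keys=True, separators=(",", ":"))
        else:
            columns[key] = value
    return columns


experiment_metadata = {
    "model": "gpt-4o",
    "temperature": 0.8,
    "misc": {"dataset_version": "0.1.1", "app": "customer-support"},
}

for name, value in metadata_to_columns(experiment_metadata).items():
    print(f"{name}: {value}")
```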

## 5. Evaluate the Dataset Against Built-in Metrics

Now that we have uploaded our dataset, we can evaluate it against the built-in metrics provided by LastMile AI AutoEval. These metrics include:

- `BuiltinMetrics.FAITHFULNESS`: Assesses how well the assistant's responses align with the provided ground truth.
- `BuiltinMetrics.RELEVANCE`: Evaluates the relevance and appropriateness of the assistant's responses to the user's queries.
- LastMile also provides other metrics out of the box, such as Toxicity and Question-Answering.

We'll evaluate the uploaded dataset against the selected built-in metrics, log the run to the Experiment, and display the first 10 rows of the results.

Note: `evaluate_dataset()` runs the evaluation directly on an uploaded dataset, identified by its `dataset_id`.

In [15]:
from lastmile.lib.auto_eval import BuiltinMetrics
default_metrics = [
    BuiltinMetrics.FAITHFULNESS,
    BuiltinMetrics.RELEVANCE,
    # LastMile provides a few other metrics out of the box such as Toxicity, Question-Answering, etc.
]

print("Evaluation job kicked off")
evaluation_results = client.evaluate_dataset(
    dataset_id=dataset_id,
    metrics=default_metrics,
    # Log this evaluation run to the experiment we created above
    experiment_id=experiment.id,

    # You can add additional metadata properties here, which will get merged with any metadata on the Experiment itself
    # The metadata properties are surfaced as columns in the resulting dataset
    metadata={
        "extras": "Base metric tests"
    }
)

print("Evaluation Results:")
evaluation_results.head(10)

Evaluation job kicked off
Evaluation Results:


| | input_log_id | input | output | ground_truth | tags | extras | misc | model | temperature | Faithfulness | Relevance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | arg-18212c2670d356c0-03.958-01.1-0 | Can I change my flight date for free? | You can change your flight date for free if it is within 24 hours of booking. Otherwise, a fee may apply. | You can change your flight date for free within 24 hours of making the booking. After that, a change fee will apply. | | Base metric tests | `{"app":"customer-support","dataset_version":"0.1.1"}` | gpt-4o | 0.8 | 0.99958 | 1.0 |
| 1 | arg-18212c2670d356c0-03.959-01.1-1 | What is the phone number for customer service? | Please call customer service at 1-800-555-1243. | 1-800-555-1234 | | Base metric tests | `{"app":"customer-support","dataset_version":"0.1.1"}` | gpt-4o | 0.8 | 0.895925 | 0.863263 |
| 2 | arg-18212c2670d356c0-03.95a-01.1-2 | What time is my flight SA795 departing from JFK? | Your flight SA795 is departing from JFK at 6:30PM. | SA795: JFK Terminal 4, Gate 42, 6:30PM departure | | Base metric tests | `{"app":"customer-support","dataset_version":"0.1.1"}` | gpt-4o | 0.8 | 0.998571 | 0.984251 |
| 3 | arg-18212c2670d356c0-03.95b-01.1-3 | What are the business class amenities on StrikeAir flights? | Business class includes fully flat seats, gourmet meals, and a selection of fine wines. | Business class amenities on StrikeAir flights include flatbed seats, gourmet dining, premium beverages, and extra baggage allowance. | | Base metric tests | `{"app":"customer-support","dataset_version":"0.1.1"}` | gpt-4o | 0.8 | 0.999438 | 0.785916 |
| 4 | arg-18212c2670d356c0-03.95c-01.1-4 | What gate is my flight departing from? | Your flight SA795 is departing from JFK Terminal 5, Gate 42 at 6:30PM. | SA795: JFK Terminal 4, Gate 42, 6:30PM departure | | Base metric tests | `{"app":"customer-support","dataset_version":"0.1.1"}` | gpt-4o | 0.8 | 0.000678 | 0.990655 |
| 5 | arg-18212c2670d356c0-03.95d-01.1-5 | Can you tell me how to check in for my flight online? | You can check in for your flight online by visiting our website and clicking on the 'Check-In' tab. Enter your booking reference and last name, and follow the instructions. | Visit our website and click on the 'Check-In' tab. Enter your booking reference and last name to check in. | | Base metric tests | `{"app":"customer-support","dataset_version":"0.1.1"}` | gpt-4o | 0.8 | 0.998471 | 0.996501 |
| 6 | arg-18212c2670d356c0-03.95e-01.1-6 | Can you tell me the flight status for SA450? | Flight SA450 is on schedule and will depart from JFK Terminal 2, Gate 15 at 4:00PM. | SA450: On Schedule, JFK Terminal 2, Gate 15, 4:00PM departure | | Base metric tests | `{"app":"customer-support","dataset_version":"0.1.1"}` | gpt-4o | 0.8 | 0.999635 | 0.995998 |
| 7 | arg-18212c2670d356c0-03.95f-01.1-7 | Can I get a vegetarian meal on my flight? | Yes, we offer vegetarian meals on our flights. Please make sure to request it during booking or contact customer service. | Vegetarian meals are available upon request. Please ensure you request it at the time of booking or contact customer service. | | Base metric tests | `{"app":"customer-support","dataset_version":"0.1.1"}` | gpt-4o | 0.8 | 0.999843 | 0.999335 |
| 8 | arg-18212c2670d356c0-03.960-01.1-8 | Can I bring my pet on the plane? | Yes, you can bring your pet, but it must fit under the seat in front of you and there is a pet fee. | Pets are not allowed on StrikeAir flights. | | Base metric tests | `{"app":"customer-support","dataset_version":"0.1.1"}` | gpt-4o | 0.8 | 0.000892 | 0.875631 |
| 9 | arg-18212c2670d356c0-03.961-01.1-9 | Do you have any vegetarian meals on your flights? | Yes, we do offer vegetarian meal options on our flights. You can request it during booking or contact customer service for assistance. | Vegetarian meal options are available on all flights, requestable during booking or via customer service. | | Base metric tests | `{"app":"customer-support","dataset_version":"0.1.1"}` | gpt-4o | 0.8 | 0.999865 | 0.998895 |
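
In the results above, two rows score near zero on Faithfulness: the wrong terminal in row 4 and the incorrect pet policy in row 8. A minimal sketch of how such rows could be flagged automatically follows. The helper `flag_low_scores` and the 0.5 threshold are assumptions made for illustration, not part of AutoEval. The sample rows copy three inputs and their scores from the output above.

```python
from typing import Any, Dict, Iterable, List


def flag_low_scores(
    rows: Iterable[Dict[str, Any]],
    metric: str = "Faithfulness",
    threshold: float = 0.5,  # assumed cutoff; tune it for your application
) -> List[Dict[str, Any]]:
    """Return the rows whose score for `metric` falls below `threshold`."""
    return [row for row in rows if float(row[metric]) < threshold]


rows = [
    {"input": "What gate is my flight departing from?",
     "Faithfulness": 0.000678, "Relevance": 0.990655},
    {"input": "Can I get a vegetarian meal on my flight?",
     "Faithfulness": 0.999843, "Relevance": 0.999335},
    {"input": "Can I bring my pet on the plane?",
     "Faithfulness": 0.000892, "Relevance": 0.875631},
]

flagged = flag_low_scores(rows)
for row in flagged:
    print(f"{row['Faithfulness']:.6f}  {row['input']}")
```

Since `evaluation_results` is a pandas DataFrame, `evaluation_results.to_dict("records")` produces rows in this shape, and the same filter can be written directly as `evaluation_results[evaluation_results["Faithfulness"] < 0.5]`.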


## 6. Visualize this in the Experiments Console

You can log as many evaluation runs under an experiment as you want, and compare them.
The AutoEval UI has visualizations for this purpose:

* Experiments: https://lastmileai.dev/evaluations?view=experiments
* Evaluation Runs: https://lastmileai.dev/evaluations?view=all_runs
* Project Dashboard: https://lastmileai.dev/dashboard
* Dataset Library: https://lastmileai.dev/datasets

Check it out for yourself!

![Experiment Comparison](https://github.com/user-attachments/assets/8ccb3a2b-f1c5-4d69-9490-a00f385a7acc)

![Evaluation Run](https://github.com/user-attachments/assets/26d8f62d-0dcb-4c5f-b23e-2b3fa6e289b2)

![Project Dashboard](https://github.com/user-attachments/assets/7156d91f-9286-4996-989d-b0b76c623454)

## Next Steps

Head over to the AutoEval Getting Started notebook to learn about labeling, fine-tuning evaluator models, and other workflows that you can use to optimize your evals!