# Parameter Experimentation with LastMile AI


In this notebook, we'll demonstrate how to run and evaluate experiments with the parameters of your RAG application (e.g., model, chunk size, k). We will use the RAG Debugger UI to visualize and analyze the results.

## Notebook Outline
* [Introduction](#intro)
* [Step 1: Install and Setup](#step1)
* [Step 2: Generate a summary with an LLM](#step2)
* [Step 3: Run Experiment #1](#step3)
* [Step 4: Run Experiment #2](#step4)
* [Step 5: View Evaluation Results](#step5)

<a name="intro"></a>

## Introduction
**Parameters**, such as model, chunk size, or k, define the behavior of your RAG system. These parameters serve as the adjustable knobs that you can tune and test to optimize your RAG system's performance. In RAG Debugger, the parameter set used for each evaluation run is clearly displayed. This allows for easy comparison of different evaluation runs with varying parameter sets, enabling you to identify the optimal set of parameters for your RAG system.

In this example, we demonstrate how to define a parameter (e.g. model) for a simple non-RAG app and explore the impact of different parameter values through experimentation and evaluation.
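To make the idea of a parameter set concrete, here is a minimal sketch that expands a grid of candidate values into individual parameter sets, one per experiment. The grid values and the `expand_grid` helper are hypothetical and are not part of the LastMile SDK.

```python
from itertools import product

# Hypothetical parameter grid; the values are placeholders for illustration.
param_grid = {
    "model_name": ["gpt-3.5-turbo", "gemini-pro"],
    "chunk_size": [256, 512],
    "k": [3, 5],
}

def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]

param_sets = expand_grid(param_grid)
print(len(param_sets))  # one entry per combination of values
print(param_sets[0])
```

Each dict in `param_sets` corresponds to one experiment run. This notebook varies only `model_name`, so it runs two experiments.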


<a name="step1"></a>

## Step 1: Install and Setup

Before we begin, we need to install the following packages:

In [None]:
!pip install openai
!pip install -q -U google-generativeai
!pip install lastmile-eval --upgrade



Import the necessary libraries.

In [None]:
import os
from google.colab import userdata
import google.generativeai as genai
from openai import OpenAI
import pandas as pd
from lastmile_eval.text import calculate_summarization_score
from lastmile_eval.rag.debugger.api.evaluation import create_input_set
from lastmile_eval.rag.debugger.api.evaluation import run_and_store_evaluations

We also need the following tokens/keys:

* **LastMile AI API Token:** Go to the [LastMile Settings page](https://lastmileai.dev/settings?page=tokens). You will need to first create a LastMile AI account.
* **OpenAI API Key:** Go to [OpenAI API Keys page](https://platform.openai.com/account/api-keys) to create and access your OpenAI API Key.
* **Google Gemini API Key:** Go to [Google Gemini API Keys](https://aistudio.google.com/app/apikey?_gl=1*xtckgy*_ga*MTQzNDQ1Mzk1NS4xNzE2OTE5NjYy*_ga_P1DBVKWT6V*MTcxNjkxOTY2Mi4xLjEuMTcxNjkxOTY4My4zOS4wLjE1MTYxODUzNTI.) to create and access your Google API Key.

We're using Google Colab's Secret Manager to set our tokens in this notebook.

In [None]:
os.environ['LASTMILE_API_TOKEN'] = userdata.get('LASTMILE_API_TOKEN')
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
os.environ['GOOGLE_API_KEY'] = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
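If you run this notebook outside Colab, `userdata` is unavailable and the keys have to come from the environment instead. The following is a small sketch that reports which of the three keys are missing before any API call is made. The `missing_keys` helper is hypothetical and is not part of any of the SDKs used here.

```python
import os

REQUIRED_KEYS = ("LASTMILE_API_TOKEN", "OPENAI_API_KEY", "GOOGLE_API_KEY")

def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty."""
    return [name for name in required if not env.get(name)]

missing = missing_keys()
if missing:
    print("Missing keys:", ", ".join(missing))
```

Checking up front gives a clearer error than a failed request later in the notebook.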

<a name="step2"></a>
## Step 2: Generate a summary with an LLM

First, define a function `generate_summary` that takes in a `model_name` and `original_text` and generates a summary.

We will use the LastMile AI Tracing SDK to set up tracing and register the LLM as the parameter we want to track.

In [None]:
from lastmile_eval.rag.debugger.tracing import get_lastmile_tracer
from lastmile_eval.rag.debugger.api import LastMileTracer

# Instantiate LastMile Tracer
tracer: LastMileTracer = get_lastmile_tracer(
    tracer_name="summary-generator",
    lastmile_api_token=os.environ['LASTMILE_API_TOKEN'],
)

def wrapper_gemini_pro(prompt):
    model = genai.GenerativeModel('gemini-pro')
    response = model.generate_content(prompt)
    return response.text

def wrapper_chatgpt(prompt):
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt,}],
        model="gpt-3.5-turbo",
    )
    return response.choices[0].message.content

# Decorate function with tracer
@tracer.start_as_current_span("generate_summary")
def generate_summary(model_name, original_text):
    prompt = f"Give me a summary of: {original_text}"

    # Register model_name as parameter
    tracer.register_param("model_name", model_name)

    if model_name == "gemini-pro":
        return wrapper_gemini_pro(prompt)
    elif model_name == "gpt-3.5-turbo":
        return wrapper_chatgpt(prompt)
    else:
        raise ValueError(f"Unsupported model: {model_name}")
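As the number of models grows, the `if`/`elif` chain can be swapped for a lookup table. Here is a minimal sketch of that alternative, using stub wrappers in place of the real API calls so it runs without credentials. `fake_gemini`, `fake_chatgpt`, `MODEL_WRAPPERS`, and `dispatch_summary` are hypothetical names, and tracing is omitted.

```python
def fake_gemini(prompt):
    """Stand-in for wrapper_gemini_pro; makes no network call."""
    return f"[gemini] {prompt}"

def fake_chatgpt(prompt):
    """Stand-in for wrapper_chatgpt; makes no network call."""
    return f"[chatgpt] {prompt}"

# Map each supported parameter value to the function that handles it.
MODEL_WRAPPERS = {
    "gemini-pro": fake_gemini,
    "gpt-3.5-turbo": fake_chatgpt,
}

def dispatch_summary(model_name, original_text, wrappers=MODEL_WRAPPERS):
    """Build the prompt and route it to the wrapper registered for model_name."""
    prompt = f"Give me a summary of: {original_text}"
    try:
        wrapper = wrappers[model_name]
    except KeyError:
        raise ValueError(f"Unsupported model: {model_name}") from None
    return wrapper(prompt)

print(dispatch_summary("gemini-pro", "some text"))
```

Adding a new model then means adding one entry to the table, and the function body stays unchanged.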


<a name="step3"></a>

## Step 3: Run Experiment #1

Our first experiment will be to generate a summary using `gpt-3.5-turbo` and get a score on the quality of the summary. We will be using a LastMile Evaluator - Summarization Score.

In [None]:
# Generate summaries for a list of original texts using gpt-3.5-turbo
chatgpt_summaries = []
original_texts = [
    """asdfa""",
    """asdfa"""
]
for text in original_texts:
    # Keep the returned summary, not the original text
    summary = generate_summary('gpt-3.5-turbo', text)
    chatgpt_summaries.append(summary)

# Create Input Set with Generated Summaries and Original Texts
input_set_id = create_input_set([original_texts, chatgpt_summaries], "Test Set - Experiment 1").ids[0]

# Define Summarization Evaluator
def wrap_summarize(df: pd.DataFrame) -> list[float]:
    def helper(row) -> float:
        return calculate_summarization_score(
            [row["query"]],
            [row["groundTruth"]],
            model_name="gpt-3.5-turbo"
        )[0]

    return df.apply(helper, axis=1)

# Evaluate LLM-generated responses with Summarization Evaluator
run_and_store_evaluations(
      input_set_id,
      project_id = "Summary Generator",
      trace_level_evaluators={"summarize": wrap_summarize},
      dataset_level_evaluators = {},
      evaluation_set_name="Eval Set - Experiment 1"
)




CreateEvaluationsResult(success=False, message='{"message":"Unable to find project with id Summary Generator","error":{},"status":500}', df_metrics_trace=                   testSetId                 testCaseId metricName  value
0  clwqu4ven002tqyeindfbpmhk  clwqu4ver002uqyeioi5d18we  summarize    0.0
1  clwqu4ven002tqyeindfbpmhk  clwqu4ves002vqyeihazampvi  summarize    0.0, df_metrics_dataset=                   testSetId       metricName  value
0  clwqu4ven002tqyeindfbpmhk   summarize_mean    0.0
0  clwqu4ven002tqyeindfbpmhk    summarize_std    0.0
0  clwqu4ven002tqyeindfbpmhk  summarize_count    2.0)

This creates an **Evaluation Set** with inputs, outputs, and evaluation metrics for experiment 1 which we can view in the RAG Debugger UI.
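The recorded output above reports `success=False` with a message, so it is worth checking the returned result before moving on. Here is a minimal sketch of such a check. It assumes only the `success` and `message` fields visible in the printed result. `FakeResult` is a stand-in used here so the example runs without the SDK, and `check_result` is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass
class FakeResult:
    """Stand-in with the two fields visible in the printed result above."""
    success: bool
    message: str

def check_result(result):
    """Raise if the evaluation run reported a failure; otherwise return it."""
    if not result.success:
        raise RuntimeError(f"Evaluation was not stored: {result.message}")
    return result

ok = check_result(FakeResult(success=True, message="ok"))
print(ok.message)
```

In the notebook, you would pass the return value of `run_and_store_evaluations` to `check_result` instead of a `FakeResult`.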

<a name="step4"></a>

## Step 4: Run Experiment #2

Our second experiment will be to generate a summary using `gemini-pro`.

We will evaluate the summary with one of the LastMile Evaluators - Summarization Score.

In [None]:
import pandas as pd
from lastmile_eval.text import calculate_summarization_score
from lastmile_eval.rag.debugger.api.evaluation import create_input_set
from lastmile_eval.rag.debugger.api.evaluation import run_and_store_evaluations


# Generate summaries for a list of original texts using gemini-pro
generated_summaries = []
original_texts = [
    """asdfa""",
    """asdfa"""
]
for text in original_texts:
    # Store each returned summary so it can be added to the input set
    summary = generate_summary('gemini-pro', text)
    generated_summaries.append(summary)

# Create Input Set with Generated Summaries and Original Texts
input_set_id = create_input_set([original_texts, generated_summaries], "Test Set - Experiment 2").ids[0]

# Define Summarization Evaluator
def wrap_summarize(df: pd.DataFrame) -> list[float]:
    def helper(row) -> float:
        return calculate_summarization_score(
            [row["query"]],
            [row["groundTruth"]],
            model_name="gemini-pro"
        )[0]

    return df.apply(helper, axis=1)

# Evaluate LLM-generated responses with Summarization Evaluator
run_and_store_evaluations(
      input_set_id,
      project_id = "Summary Generator",
      trace_level_evaluators={"summarize": wrap_summarize},
      dataset_level_evaluators = {},
      evaluation_set_name="Eval Set - Experiment 2"
)
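Steps 3 and 4 differ only in the model name, so the shared flow can be factored into one function. The sketch below shows the shape of such a helper with the summarizer and scorer passed in as callables. `run_experiment`, `stub_summarize`, and `stub_score` are hypothetical, and the stubs stand in for `generate_summary` and the summarization evaluator so the example runs offline.

```python
def run_experiment(model_name, texts, summarize, score):
    """Summarize each text with model_name, then score each (text, summary) pair."""
    summaries = [summarize(model_name, text) for text in texts]
    scores = [score(text, summary) for text, summary in zip(texts, summaries)]
    return {"model_name": model_name, "summaries": summaries, "scores": scores}

def stub_summarize(model_name, text):
    """Stand-in summarizer: keeps the first 10 characters."""
    return text[:10]

def stub_score(text, summary):
    """Stand-in scorer: ratio of summary length to text length."""
    return len(summary) / len(text) if text else 0.0

result = run_experiment("gemini-pro", ["abcdefghijklmnopqrst"], stub_summarize, stub_score)
print(result["scores"])
```

Passing the callables in keeps the experiment loop independent of any one model or evaluator.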

<a name="step5"></a>

## [BROKEN] Step 5: View Evaluation Results

**Note:** The parameter set is not yet logged to the Evaluation Set. Traces need to be hooked up to the Evaluation Set before this step will work.

Now you can view the evaluation results in the RAG Debugger UI.

Run this CLI command to access the UI:

`rag-debug launch`

The 'Evaluation Console' is the landing page of RAG Debugger. Here you can see all your Evaluation Sets (including the one we just made):

<img width="973" alt="Screenshot 2024-05-24 at 7 07 37 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/3d49b64b-6263-4345-ad37-b5ce3c696a18"/>

We can quickly see the results of our two experiments! We've logged our parameter `model_name` here, which makes it easy to differentiate what changed in each experiment. You can have multiple parameters logged here (e.g., chunk size, k, model_name).
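The same comparison can be sketched in plain Python: group scores by parameter set and rank the runs by mean score. The scores below are placeholders for illustration only and are not results from this notebook. `rank_runs` is a hypothetical helper.

```python
from statistics import mean

# Placeholder scores for illustration only; these are not real results.
runs = [
    {"params": {"model_name": "gpt-3.5-turbo"}, "scores": [0.5, 0.7]},
    {"params": {"model_name": "gemini-pro"}, "scores": [0.6, 0.8]},
]

def rank_runs(runs):
    """Return (params, mean score) pairs sorted from highest to lowest mean."""
    ranked = [(run["params"], mean(run["scores"])) for run in runs]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

for params, avg in rank_runs(runs):
    print(params, round(avg, 3))
```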