# Introduction

This notebook shows how to run end-to-end (E2E) evaluations for your trained model, using Oumi inference to generate the responses and [Alpaca Eval 2.0](https://github.com/tatsu-lab/alpaca_eval) to automatically calculate the win rates against GPT-4 Turbo (or another reference model of your choice).

First, specify your model and the maximum number of tokens it supports (used during generation). You can point to a model on HuggingFace, provide a path to a local folder that contains your model, or use any other model format that Oumi inference supports. Also provide a human-friendly name for your model, which is displayed in leaderboards.

In [1]:
MODEL = "bartowski/Llama-3.2-1B-Instruct-GGUF"
MODEL_HUMAN_FRIENDLY_NAME = "MyLlamaModel"
MODEL_MAX_TOKENS = 8192

## Step 1: Retrieve Alpaca dataset

Alpaca Eval 2.0 requires that we run inference and produce model responses for the [tatsu-lab/alpaca_eval](https://huggingface.co/datasets/tatsu-lab/alpaca_eval) dataset.

In [2]:
from oumi.datasets.evaluation import AlpacaEvalDataset

alpaca_dataset = AlpacaEvalDataset(dataset_name="tatsu-lab/alpaca_eval").conversations()

INFO 12-06 17:47:53 importing.py:10] Triton not installed; certain GPU-related functions will not be available.


[2024-12-06 17:47:54,350][oumi][rank0][pid:41039][MainThread][INFO]][base_map_dataset.py:68] Creating map dataset (type: AlpacaEvalDataset)...
[2024-12-06 17:47:55,415][oumi][rank0][pid:41039][MainThread][INFO]][base_map_dataset.py:437] Dataset Info:
	Split: eval
	Version: 1.0.0
	Dataset size: 554496
	Download size: 620778
	Size: 1175274 bytes
	Rows: 805
	Columns: ['instruction', 'output', 'generator', 'dataset']
[2024-12-06 17:47:55,540][oumi][rank0][pid:41039][MainThread][INFO]][base_map_dataset.py:375] Loaded DataFrame with shape: (805, 4). Columns:
instruction    object
output         object
generator      object
dataset        object
dtype: object


Since this is sample code, we will only run inference on the first 3 examples of the dataset.

In [3]:
alpaca_dataset = alpaca_dataset[:3]

for index, conversation in enumerate(alpaca_dataset):
    print(index, conversation.messages)

0 [USER: What are the names of some famous actors that started their careers on Broadway?]
1 [USER: How did US states get their names?]
2 [USER: Hi, my sister and her girlfriends want me to play kickball with them. Can you explain how the game is played, so they don't take advantage of me?]


## Step 2: Run inference

First, define all the relevant parameters and configs required for inference.

In [4]:
from oumi.core.configs import GenerationParams, InferenceConfig, ModelParams

generation_params = GenerationParams(max_new_tokens=MODEL_MAX_TOKENS)
model_params = ModelParams(model_name=MODEL, model_max_length=MODEL_MAX_TOKENS)
inference_config = InferenceConfig(model=model_params, generation=generation_params)

Then, choose an inference engine that your model is compatible with. For more information on this, see Oumi's [inference documentation](https://oumi.ai/docs/latest/user_guides/infer/infer.html). 

In [5]:
from oumi.inference import LlamaCppInferenceEngine

inference_engine = LlamaCppInferenceEngine(model_params)

[2024-12-06 17:48:30,999][oumi][rank0][pid:41039][MainThread][INFO]][llama_cpp_inference_engine.py:117] Loading model from Huggingface Hub: bartowski/Llama-3.2-1B-Instruct-GGUF.


Next, run inference to get responses from your model for the prompts contained in the `alpaca_dataset`.

In [6]:
responses = inference_engine.infer(alpaca_dataset, inference_config)

  0%|          | 0/3 [00:00<?, ?it/s]

Then, convert the responses from Oumi format (a list of `Conversation`s) to Alpaca format (a list of `dict`s with the keys `instruction` and `output`). Create a DataFrame from the data and add a new column, `generator`, which holds the human-readable name of the model that produced the responses.

In [7]:
import pandas as pd

from oumi.datasets.evaluation import utils

responses_json = utils.conversations_to_alpaca_format(responses)
responses_df = pd.DataFrame(responses_json)
responses_df["generator"] = MODEL_HUMAN_FRIENDLY_NAME

Your DataFrame should look as follows.

In [8]:
responses_df

| | instruction | output | generator |
|---|---|---|---|
| 0 | What are the names of some famous actors that ... | There are many famous actors who started their... | MyLlamaModel |
| 1 | How did US states get their names? | The origin of US state names is a fascinating ... | MyLlamaModel |
| 2 | Hi, my sister and her girlfriends want me to p... | Kickball is a fun team sport that's easy to le... | MyLlamaModel |
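Before sending responses to a judge, it can help to confirm that every record has the shape described above. The following is a minimal sketch, not part of Oumi or Alpaca Eval: the helper `find_invalid_records` and the sample records are hypothetical, and the check assumes each record should carry non-empty string values under `instruction`, `output`, and `generator`.

```python
# Keys each Alpaca-format record is assumed to carry (per the text above).
REQUIRED_KEYS = ("instruction", "output", "generator")


def find_invalid_records(records):
    """Return (index, reason) pairs for records missing a required, non-empty string."""
    problems = []
    for index, record in enumerate(records):
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((index, f"missing or empty `{key}`"))
    return problems


# Hypothetical records, for illustration only.
sample_records = [
    {
        "instruction": "How did US states get their names?",
        "output": "It varies by state.",
        "generator": "MyLlamaModel",
    },
    {"instruction": "Explain kickball.", "output": "", "generator": "MyLlamaModel"},
]

print(find_invalid_records(sample_records))
```

To apply the same check to the DataFrame, one option is to pass `responses_df.to_dict(orient="records")` to the helper.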


## Step 3: Run Alpaca Eval 2.0

To compare your responses against Alpaca Eval 2.0's reference responses (generated with GPT-4 Turbo) and calculate the win rates, you need a judge. By default, the judge is GPT-4 Turbo, but other judges can be used as well. Note that the Alpaca leaderboard is built with GPT-4 Turbo as the judge, so results from other judges are less informative if you want to compare against the leaderboard. To access the latest GPT-4 models, an OpenAI API key is required. Details on creating an OpenAI account and generating a key can be found on [OpenAI's quickstart webpage](https://platform.openai.com/docs/quickstart).

In [9]:
import os

os.environ["OPENAI_API_KEY"] = "your_api_key_here"
print(f"Your OpenAI key is: {os.environ.get('OPENAI_API_KEY')}")

Your OpenAI key is: your_api_key_here
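Printing the full key is fine with a placeholder, but with a real key it can leak the secret into saved notebook outputs. A minimal sketch of a safer confirmation follows; `mask_secret` is a hypothetical helper written for this example, not part of any library, and it assumes that showing only the last four characters is enough to recognize the key.

```python
import os


def mask_secret(secret, visible=4):
    """Return the secret with all but the last `visible` characters replaced by `*`."""
    if not secret:
        return "<not set>"
    if len(secret) <= visible:
        # Too short to reveal any part safely: mask everything.
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Confirm that a key is present without echoing it in full.
api_key = os.environ.get("OPENAI_API_KEY")
print(f"Your OpenAI key is: {mask_secret(api_key)}")
```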


In addition, if you have not already installed the Alpaca Eval package, you need to do so at this point. 

In [10]:
! pip3 install --upgrade --quiet alpaca_eval

Once Alpaca Eval is installed, you can run the evaluation as follows. 

The call relies on the following defaults:

- `reference_outputs`: `gpt4_turbo` outputs retrieved from "tatsu-lab/alpaca_eval"
- `annotators_config`: `weighted_alpaca_eval_gpt4_turbo`
- `is_avoid_reannotations`: `False`
- `is_overwrite_leaderboard`: `False`
- `output_path`: `None` / auto

In [None]:
from alpaca_eval import evaluate

df_leaderboard, annotations = evaluate(
    model_outputs=responses_df,
    is_return_instead_of_print=True,
)

2024-12-06:17:51:26,884 INFO     [main.py:136] Evaluating the MyLlamaModel outputs.
2024-12-06:17:51:26,884 INFO     [base.py:104] Creating the annotator from `weighted_alpaca_eval_gpt4_turbo`.
2024-12-06:17:51:26,887 INFO     [base.py:521] Saving annotations to `/opt/miniconda3/envs/oumi/lib/python3.11/site-packages/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotations_seed0_configs.json`.
2024-12-06:17:51:26,887 INFO     [base.py:513] Loading all annotations from /opt/miniconda3/envs/oumi/lib/python3.11/site-packages/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotations_seed0_configs.json.
Annotation chunk:   0%|          | 0/1 [00:00<?, ?it/s]2024-12-06:17:51:26,951 INFO     [base.py:303] Annotating 0 examples with weighted_alpaca_eval_gpt4_turbo
2024-12-06:17:51:26,988 INFO     [base.py:500] Saving all annotations to /opt/miniconda3/envs/oumi/lib/python3.11/site-packages/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotat

## Step 4: Inspect the metrics

Once the evaluation process completes, you can inspect the metrics produced, as shown below.

In [12]:
metrics = df_leaderboard.loc[MODEL_HUMAN_FRIENDLY_NAME]

print(f"Metrics for `{MODEL_HUMAN_FRIENDLY_NAME}`")
for metric, value in metrics.items():
    print(f" - {metric}={value}")

Metrics for `MyLlamaModel`
 - win_rate=0.0331325600000006
 - standard_error=0.02389778485055507
 - n_wins=0
 - n_wins_base=3
 - n_draws=0
 - n_total=3
 - discrete_win_rate=0.0
 - mode=community
 - avg_length=2208
 - length_controlled_winrate=0.16796629807158028
 - lc_standard_error=0.044130529121658255
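
To build intuition for how metrics of this kind relate to each other, here is a simplified sketch. It assumes each judged example yields a preference score between 1 (reference response preferred) and 2 (your model's response preferred), and that the win rate is the mean of those scores rescaled to a percentage. The function `summarize_preferences`, the score convention, and the sample scores are assumptions made for illustration; Alpaca Eval's own implementation may differ, and the length-controlled metrics are not covered.

```python
import math
import statistics


def summarize_preferences(preferences):
    """Summarize judge preferences in [1, 2] (1 = reference wins, 2 = model wins)."""
    n_total = len(preferences)
    # Shift scores from [1, 2] to [0, 1], average them, and express as a percentage.
    win_rate = statistics.mean(p - 1 for p in preferences) * 100
    # Standard error of the mean (sample standard deviation), also as a percentage.
    if n_total > 1:
        standard_error = statistics.stdev(preferences) / math.sqrt(n_total) * 100
    else:
        standard_error = float("nan")
    return {
        "win_rate": win_rate,
        "standard_error": standard_error,
        "n_wins": sum(p > 1.5 for p in preferences),
        "n_wins_base": sum(p < 1.5 for p in preferences),
        "n_draws": sum(p == 1.5 for p in preferences),
        "n_total": n_total,
    }


# Hypothetical scores, for illustration only.
example_scores = [1.0, 1.5, 2.0]
for metric, value in summarize_preferences(example_scores).items():
    print(f" - {metric}={value}")
```

Under these assumptions, a weighted judge that returns fractional scores can produce a win rate slightly above zero even when `n_wins` is 0, which would be consistent with the output above. With only three examples, the standard error is large relative to the win rate, so the numbers are illustrative rather than conclusive.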
