# Evaluating a Finetuned Classification Model in Okareo

<a target="_blank" href="https://colab.research.google.com/github/okareo-ai/okareo-python-sdk/blob/main/examples/scenarios.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## 🎯 Goals

After using this notebook, you will be able to:
- Finetune an open source LLM for classification
- Evaluate the finetuned model in Okareo
- Generate synthetic data from misclassified points to augment the finetuning set

In [1]:
import os
from okareo import Okareo

OKAREO_API_KEY = os.environ.get('OKAREO_API_KEY')
okareo = Okareo(OKAREO_API_KEY)

In [2]:
from finetuning.utils import load_raw_data

# load the train/test data
file_paths = ["./finetuning/webbizz_finetuning_train_data.jsonl",
              "./finetuning/webbizz_finetuning_test_data.jsonl"]
data_dict = load_raw_data(file_paths)
train_data = data_dict[file_paths[0]]
test_data = data_dict[file_paths[1]]




In [3]:
from okareo_api_client.models.seed_data import SeedData
from okareo_api_client.models import ScenarioSetCreate

# create scenarios with full json results
scenario_ids = {}
for name, data in zip(['train', 'test'], [train_data, test_data]):
    scenario_set_create = ScenarioSetCreate(
        name=f"WebBizz Tool Classification - {name}",
        seed_data=[SeedData(input_=d['input'], result=d['result']) for d in data] 
    )

    source_scenario_static = okareo.create_scenario_set(scenario_set_create)
    scenario_ids[name] = source_scenario_static.scenario_id

    print(f'{name}: {source_scenario_static.app_link}')

train: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/scenario/bed059c5-90c9-4908-a0f5-330271a1e9e3
test: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/scenario/a41761a3-5c38-42ea-8ea7-2eccb6024aa1


Finetuning instructions should be formatted with the following three fields:
- Instruction: Description of the task, input/output format.
- Input: Text used to prompt the LLM
- Output: Expected response from the LLM

We provide a template for our WebBizz classification task in the `finetuning.utils` module:

In [4]:
from finetuning.utils import format_instruction_for_scenario

post_template = format_instruction_for_scenario("{input}")
print(post_template)

### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- pricing
- complaints
- returns

Return only one category that is most relevant to the question.

### Input:
{input}
 
### Output:
{result}
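
As a rough illustration of how a template like the one printed above could be assembled, the sketch below builds the same three-section layout in plain Python. `build_instruction_template` and `CATEGORIES` are hypothetical names introduced here; the actual helper is `format_instruction_for_scenario` in `finetuning.utils`, and its implementation may differ.

```python
# Hypothetical stand-in for format_instruction_for_scenario; not the SDK's code.
CATEGORIES = ["pricing", "complaints", "returns"]


def build_instruction_template(input_placeholder: str) -> str:
    """Return a template with Instruction, Input, and Output sections."""
    category_lines = "\n".join(f"- {c}" for c in CATEGORIES)
    return (
        "### Instruction:\n"
        'Given "Input", return a category under "Output" '
        "that is one of the following:\n\n"
        + category_lines
        + "\n\nReturn only one category that is most relevant to the question.\n\n"
        "### Input:\n"
        + input_placeholder
        + "\n\n### Output:\n{result}\n"
    )


# The placeholder stays literal until str.format fills it in.
template = build_instruction_template("{input}")
example = template.format(input="Can I send this product back?", result="returns")
print(example)
```

Keeping the placeholder as a parameter is what lets the same layout be reused later with a different placeholder name.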



In [5]:
finetuning_instruction_data = [post_template.format(input=d['input'], result=d['result']) for d in train_data]

In [6]:
print(finetuning_instruction_data[0])

### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- pricing
- complaints
- returns

Return only one category that is most relevant to the question.

### Input:
Can I send this product back?
 
### Output:
returns



In [7]:
import json

file_path = "./finetuning/webbizz_finetuning_train_instructions.jsonl"

with open(file_path, "w") as file:
    for row in finetuning_instruction_data:
        file.write(json.dumps({'sample': row}) + "\n")
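
The cell above writes one JSON object per line (the JSON Lines layout), each with a single `sample` key. A minimal sketch of that round trip follows, using a temporary file and made-up rows so it does not touch the notebook's data; the row contents and file name are assumptions for illustration only.

```python
import json
import os
import tempfile

# Made-up rows standing in for the formatted instruction strings.
rows = ["### Input:\nfirst question\n", "### Input:\nsecond question\n"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "instructions.jsonl")

    # json.dumps escapes embedded newlines, so each record stays on one line.
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps({"sample": row}) + "\n")

    # Read the file back: one JSON object per line.
    with open(path) as f:
        loaded = [json.loads(line)["sample"] for line in f]

print(loaded == rows)
```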

In [9]:
from datasets import load_dataset

# build an absolute path to the instruction file written above
filepath = os.path.join(os.getcwd(), "finetuning", "webbizz_finetuning_train_instructions.jsonl")
print(filepath)
dataset = load_dataset('json', data_files={'train': filepath})

/home/mason/git/okareo-python-sdk/examples/finetuning/webbizz_finetuning_train_instructions.jsonl


Generating train split: 24 examples [00:00, 4804.70 examples/s]


In [10]:
print(dataset['train'][0])

{'sample': '### Instruction:\nGiven "Input", return a category under "Output" that is one of the following:\n\n- pricing\n- complaints\n- returns\n\nReturn only one category that is most relevant to the question.\n\n### Input:\nCan I send this product back?\n \n### Output:\nreturns\n'}


### Configure Phi-3 for finetuning

Now we set up a Phi-3 finetuning run using the instruction dataset loaded above.

In [11]:
HUGGINGFACE_API_KEY=os.environ.get("HUGGINGFACE_API_KEY")

In [14]:
from finetuning.utils import get_model_tokenizer_trainer
 
# Microsoft's huggingface model id for Phi-3
model_id = "microsoft/Phi-3-mini-4k-instruct"

# directory where finetuned model weights will be written
finetuned_model_name = "Phi-3-mini-4k-int4"

# target modules for LoRA
target_modules = ["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]

# set up peft model/tokenizer/trainer for finetuning
peft_model, tokenizer, trainer = get_model_tokenizer_trainer(model_id, finetuned_model_name, dataset, HUGGINGFACE_API_KEY, target_modules=target_modules, epochs=3)


Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.66s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Initializing with max_seq_length=94



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


In [15]:
# train
trainer.train() # there will not be a progress bar since tqdm is disabled
 
# save model
trainer.save_model()



{'train_runtime': 53.5696, 'train_samples_per_second': 1.12, 'train_steps_per_second': 0.112, 'train_loss': 1.2629125118255615, 'epoch': 2.4}


## Evaluate the finetuned model in Okareo

Now we will register the finetuned model in Okareo to perform classification evaluations on the train/test splits.

In [16]:
from finetuning.utils import get_finetuned_model_tokenizer

# load the peft model with the base model
finetuned_model_name = "Phi-3-mini-4k-int4"
peft_model, tokenizer = get_finetuned_model_tokenizer(finetuned_model_name)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Reloading model + unpatching flash attention


Loading checkpoint shards: 100%|██████████| 2/2 [00:15<00:00,  7.64s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
from finetuning.utils import count_tokens

# get the number of tokens in each expected output json
n_train_tokens = [count_tokens(tokenizer, s) for s in train_data]

# set the generation token limit to the next multiple of 100 above the max
nearest = 100
n_token_lim = nearest * (max(n_train_tokens) // nearest) + nearest
n_token_lim

100
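
The rounding used for the token limit can be checked in isolation. The sketch below wraps the same arithmetic in a hypothetical helper, `round_up_token_limit` (a name introduced here, not part of `finetuning.utils`), and tries a few made-up token counts. Note that a count that is already an exact multiple of `nearest` is still bumped to the next multiple.

```python
def round_up_token_limit(max_tokens: int, nearest: int = 100) -> int:
    """Return the next multiple of `nearest` strictly above `max_tokens`."""
    return nearest * (max_tokens // nearest) + nearest


# Made-up token counts, chosen to cover both sides of a multiple of 100.
for n in (37, 99, 100, 101):
    print(n, round_up_token_limit(n))
```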

In [18]:
from okareo.model_under_test import CustomModel, ModelInvocation
from finetuning.utils import format_instruction

mut_name = "WebBizz Tool Classification - Phi-3-mini-4k finetuned"

class FinetunedPhi3Model(CustomModel):
    def __init__(self, name):
        super().__init__(name)
        self.model = peft_model
        self.categories = ["pricing", "complaints", "returns"]

    def invoke(self, input_value):
        prompt = format_instruction({ "question": input_value })
        input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
        outputs = self.model.generate(
            input_ids=input_ids,
            max_new_tokens=n_token_lim,
            do_sample=False,
        )
        pred = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):].strip()
        res = "unknown"
        for cat in self.categories:
            if cat in pred:
                res = cat
        return ModelInvocation(
            actual=res,
            model_input=input_value,
            model_result=pred,
        )

# Register the model to use in the test run
model_under_test = okareo.register_model(
    name=mut_name,
    model=[FinetunedPhi3Model(name=FinetunedPhi3Model.__name__)],
    update=True
)
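
The label-extraction step inside `invoke` can be looked at on its own. The sketch below copies that loop into a hypothetical standalone function, `extract_category`, and runs it on made-up predictions. Because the loop has no `break`, a prediction that mentions several categories resolves to whichever one appears last in the category list, not the one that appears first in the text.

```python
from typing import Sequence


def extract_category(pred: str, categories: Sequence[str]) -> str:
    """Map free-text model output to a known category, or "unknown"."""
    res = "unknown"
    for cat in categories:
        if cat in pred:
            res = cat  # later categories in the list overwrite earlier matches
    return res


categories = ["pricing", "complaints", "returns"]
print(extract_category("returns", categories))
print(extract_category("I am not sure", categories))
print(extract_category("returns or maybe pricing", categories))
```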

In [19]:
from okareo_api_client.models.test_run_type import TestRunType

for name in ["train", "test"]:
    eval_name = f"Tool Classification ({name}, no synthetic data)"
    evaluation = model_under_test.run_test(
        name=eval_name,
        scenario=scenario_ids[name],
        test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION,
        calculate_metrics=True,
    )
    print(f"{name} split: See results in Okareo: {evaluation.app_link}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
You are not running the flash-attention implementation, expect numerical differences.


train split: See results in Okareo: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/eval/54a9e820-8261-493f-a9ed-3ddf81722769
test split: See results in Okareo: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/eval/18b70061-a54c-4733-87f8-33a9eac62235


### Generating more finetuning data from failure cases

To improve the finetuned model's performance, we will generate synthetic data from the misclassified rows in the "train" split.

In [20]:
from okareo_api_client.api.default import (
    find_test_data_points_v0_find_test_data_points_post,
)

# use the id from the train split evaluation run
train_eval_id = "54a9e820-8261-493f-a9ed-3ddf81722769"

# get evaluation data
eval_response = model_under_test.get_test_run(train_eval_id)

# get scenario data points
scenario_rows = okareo.get_scenario_data_points(eval_response.scenario_set_id)

scenario_inputs = {}
scenario_results = {}
for row in scenario_rows:
    scenario_inputs[row.id] = row.input_
    scenario_results[row.id] = row.result

# get test data points
tdp = find_test_data_points_v0_find_test_data_points_post.sync(
    client=okareo.client,
    json_body=find_test_data_points_v0_find_test_data_points_post.FindTestDataPointPayload(
        test_run_id=train_eval_id,
    ),
    api_key=OKAREO_API_KEY
)

In [21]:
# compile scenario and test data points with relevant scores
scenario_to_test_dp = {}
for dp in tdp:
    scenario_to_test_dp[dp.scenario_data_point_id] = {
        'input': scenario_inputs[dp.scenario_data_point_id],
        'expected': dp.metric_value.additional_properties['expected'],
        'actual': dp.metric_value.additional_properties['actual'],
        'test_data_point_id': dp.id,
    }

In [22]:
# filter the datapoints based on misclassified results
filtered_dp = [dp for dp in scenario_to_test_dp.values() if dp['expected'] != dp['actual']]
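
The join-then-filter logic in the last two cells can be tried without an Okareo connection. The sketch below uses plain dictionaries with made-up IDs and rows in place of the SDK's scenario and test data point objects, so the field access differs from the real response objects.

```python
# Made-up scenario inputs keyed by scenario data point id.
scenario_inputs = {
    "s1": "Can I send this product back?",
    "s2": "How much is shipping?",
}

# Made-up test data points; the real objects expose these fields as attributes.
test_points = [
    {"id": "t1", "scenario_data_point_id": "s1", "expected": "returns", "actual": "returns"},
    {"id": "t2", "scenario_data_point_id": "s2", "expected": "pricing", "actual": "complaints"},
]

# Join each test data point to its scenario input.
joined = {
    tp["scenario_data_point_id"]: {
        "input": scenario_inputs[tp["scenario_data_point_id"]],
        "expected": tp["expected"],
        "actual": tp["actual"],
        "test_data_point_id": tp["id"],
    }
    for tp in test_points
}

# Keep only the rows where the prediction disagrees with the label.
misclassified = [dp for dp in joined.values() if dp["expected"] != dp["actual"]]
print(misclassified)
```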

In [23]:
import json

# write the failed rows
file_path = "./finetuning/webbizz_finetuning_train_failed_rows.jsonl"
with open(file_path, "w") as f:
    for dp in filtered_dp:
        json.dump({"input": dp['input'], "result": dp['expected']}, f)
        f.write("\n")

In [25]:
seed_scenario_name = "WebBizz Tool Classification - train failures"
seed_scenario = okareo.upload_scenario_set(file_path=file_path, scenario_name=seed_scenario_name)

We reuse the same template as before, this time formatted for the scenario generator's `post_template` argument.

In [24]:
# build the template with the generator's {generation.input} placeholder
post_template_rephrase = format_instruction_for_scenario("{generation.input}")
print(post_template_rephrase)

### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- pricing
- complaints
- returns

Return only one category that is most relevant to the question.

### Input:
{generation.input}
 
### Output:
{result}
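
This notebook does not show how Okareo's generator substitutes values into the template. As a loose analogy only, the sketch below shows how a dotted placeholder such as `{generation.input}` behaves under Python's own `str.format`, which resolves the part after the dot as an attribute lookup. The shortened template, the `SimpleNamespace` object, and the rephrased question are all assumptions made for illustration.

```python
from types import SimpleNamespace

# Shortened, made-up template with the same two placeholders.
template = "### Input:\n{generation.input}\n\n### Output:\n{result}\n"

# str.format resolves "generation.input" as attribute access on `generation`.
generation = SimpleNamespace(input="Is it possible to return this item?")
filled = template.format(generation=generation, result="returns")
print(filled)
```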



In [26]:
from okareo_api_client.models.generation_tone import GenerationTone
from okareo_api_client.models import ScenarioSetGenerate, ScenarioType

# generate rephrased versions 
generate_scenario_name = f"{seed_scenario_name} rephrased"

generate_request = ScenarioSetGenerate(
    source_scenario_id=seed_scenario.scenario_id,
    name=generate_scenario_name,
    number_examples=3,
    generation_type=ScenarioType.REPHRASE_INVARIANT,
    generation_tone=GenerationTone.NEUTRAL,
    post_template=post_template_rephrase
)

rephrased_instruction_scenario = okareo.generate_scenario_set(generate_request)

In [27]:
# copy the output of this cell into part 2 to continue finetuning with Okareo synthetic data
print(f"train_scenario_id='{scenario_ids['train']}'")
print(f"test_scenario_id='{scenario_ids['test']}'")
print(f"rephrased_instruction_scenario_id='{rephrased_instruction_scenario.scenario_id}'")

train_scenario_id='bed059c5-90c9-4908-a0f5-330271a1e9e3'
test_scenario_id='a41761a3-5c38-42ea-8ea7-2eccb6024aa1'
rephrased_instruction_scenario_id='16b578f7-2ae3-48c1-a20b-85c2a1de2c97'
