# Evaluation and Finetuning of Intent Detection Model in Okareo

<a target="_blank" href="https://colab.research.google.com/github/okareo-ai/okareo-python-sdk/blob/main/examples/classification_finetuning_eval_part1.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Note: Clearing the CUDA cache requires restarting the Jupyter kernel, so this notebook comes in two parts. Before running [Part #2](https://github.com/okareo-ai/okareo-python-sdk/blob/main/examples/classification_finetuning_eval_part2.ipynb), run all the cells in this notebook, then restart the kernel to clear the CUDA cache.

## 🎯 Goals

After using this notebook, you will be able to:
- Finetune an open source LLM for classification
- Evaluate the finetuned model in Okareo
- Generate synthetic data from misclassified points to augment the finetuning set

## Problem Statement: Intent Detection

Suppose we are developing a RAG system that answers user questions about an online retailer called WebBizz. 

This notebook focuses on finetuning an open source LLM, [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), for the Intent Detection component of the RAG pipeline.

Intent Detection determines which database to query during retrieval. Routing each question to the right source helps the RAG pipeline fetch more relevant context chunks for the generative model, which in turn can improve downstream metrics such as Consistency and reduce Hallucinations.
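
To make the routing idea concrete, here is a minimal sketch of how a predicted intent label could select a retrieval target. The `INTENT_TO_COLLECTION` mapping, the collection names, and `route_query` are hypothetical; they are not part of the WebBizz setup or the Okareo SDK.

```python
# Hypothetical mapping from an intent label to a retrieval collection name.
INTENT_TO_COLLECTION = {
    "Returns": "returns_docs",
    "Newsletter": "newsletter_docs",
    "Sustainability": "sustainability_docs",
    "Safety": "safety_docs",
    "Support": "support_docs",
    "Membership": "membership_docs",
}
DEFAULT_COLLECTION = "general_docs"  # used for "Miscellaneous" or unrecognized labels


def route_query(intent: str) -> str:
    """Return the collection to search for a predicted intent label."""
    return INTENT_TO_COLLECTION.get(intent, DEFAULT_COLLECTION)


print(route_query("Returns"))
print(route_query("Unknown"))
```

A misclassified intent sends the query to the wrong collection, which is why classification quality matters for everything downstream.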

## Generating User Questions from Documents

We start by bootstrapping our finetuning setup with synthetic user questions created with Okareo's scenario generators.

First, we set up our Okareo client. You will need an API token from [https://app.okareo.com/](https://app.okareo.com/). (Note: You will need to register first.)

In [1]:
# get Okareo client

import os
from okareo import Okareo

OKAREO_API_KEY = "<YOUR_OKAREO_API_KEY>"
okareo = Okareo(OKAREO_API_KEY)
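
Hard-coding a key in a notebook makes it easy to leak. One possible alternative, sketched below, reads the key from an environment variable. The variable name `OKAREO_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration.

```python
import os


def load_api_key(var_name: str = "OKAREO_API_KEY") -> str:
    """Read an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Example (not executed here): okareo = Okareo(load_api_key())
```

The returned string can be passed to `Okareo(...)` in place of the literal placeholder.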

Now, we download our WebBizz articles and read them into a DataFrame.

In [2]:
import os
from io import StringIO 

import pandas as pd

# Load 10 short WebBizz articles, each covering a different aspect of the business.

webbizz_articles = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_10_articles.jsonl').read()
json_df = pd.read_json(path_or_buf=StringIO(webbizz_articles), lines=True)

def label_article(val):
    # assign different intent labels based on substring matching
    if "return" in val:
        return "Returns"
    elif "newsletter" in val:
        return "Newsletter"
    elif "sustainability" in val:
        return "Sustainability"
    elif "security" in val:
        return "Safety"
    elif "support" in val or "help" in val:
        return "Support"
    elif "member" in val:
        return "Membership"
    else:
        return "Miscellaneous"

json_df['label'] = json_df['input'].apply(label_article)

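
The labeling function above returns on the first matching substring, so rule order matters, and the comparison is case-sensitive. The sketch below expresses the same idea as an ordered rule table to make that explicit. It lowercases the text before matching, which is a deliberate change from the cell above; the `RULES` name and the sample sentence are illustrative only.

```python
# Ordered (keywords, label) rules; the first rule with a matching keyword wins.
RULES = [
    (("return",), "Returns"),
    (("newsletter",), "Newsletter"),
    (("sustainability",), "Sustainability"),
    (("security",), "Safety"),
    (("support", "help"), "Support"),
    (("member",), "Membership"),
]


def label_text(text: str, default: str = "Miscellaneous") -> str:
    """Assign an intent label by case-insensitive substring matching."""
    lowered = text.lower()
    for keywords, label in RULES:
        if any(keyword in lowered for keyword in keywords):
            return label
    return default


# Contains both "return" and "member"; the earlier rule decides the label.
print(label_text("Our return policy also covers member purchases."))
```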


In [3]:
json_df

| | result | input | label |
|---|---|---|---|
| 0 | 75eaa363-dfcc-499f-b2af-1407b43cb133 | WebBizz is dedicated to providing our customer... | Support |
| 1 | ac0d464c-f673-44b8-8195-60c965e47525 | Safety and security of your data is our top pr... | Safety |
| 2 | aacf7a34-9d3a-4e2a-9a5c-91f2a0e8a12d | WebBizz places immense value on its dedicated ... | Membership |
| 3 | f1a37b5e-58c4-4f5a-bc42-1b70253b8bf3 | At WebBizz, we recognize that your shopping pr... | Miscellaneous |
| 4 | 35a4fd5b-453e-4ca6-9536-f20db7303344 | At WebBizz, we value our customer's feedback a... | Returns |
| 5 | f658c264-4a8a-4c93-a6d7-9a3d75f5a6f3 | Are you facing hurdles with technical glitches... | Miscellaneous |
| 6 | a8a97b0e-8d9a-4a1c-b93e-83d2bc9e5266 | Navigating WebBizz is a breeze with our advanc... | Miscellaneous |
| 7 | 0b85c12f-6ea6-4d4a-85de-6c6e9a9f8c78 | We're proud of our diverse product catalog tha... | Miscellaneous |
| 8 | cda67f1d-19f2-4b45-9f3e-3b8d67f8c6c5 | Subscribing to our newsletter gives you a fron... | Newsletter |
| 9 | 6e4f1c97-3f7a-4fcd-a4a3-69c9817c8fd1 | WebBizz believes in sustainability. We've inte... | Sustainability |


In [4]:
from okareo_api_client.models.generation_tone import GenerationTone
from okareo_api_client.models.scenario_set_create import ScenarioSetCreate
from okareo_api_client.models.scenario_set_generate import ScenarioSetGenerate
from okareo_api_client.models.scenario_type import ScenarioType
from okareo_api_client.models.seed_data import SeedData

# Create a scenario set of the WebBizz documents
seed_data = []
for article, label in zip(json_df['input'].to_list(), json_df['label'].to_list()):
    seed_data.append(SeedData(input_=article, result=label))

document_scenario = okareo.create_scenario_set(
    ScenarioSetCreate(
        name="WebBizz Documents (Seed)", seed_data=seed_data
    )
)



In [5]:
# Use the scenario set of documents to generate a scenario of questions
generated_scenario = okareo.generate_scenario_set(
    ScenarioSetGenerate(
        name="Intent Detection - User Questions",
        source_scenario_id=document_scenario.scenario_id,
        number_examples=5,
        generation_type=ScenarioType.TEXT_REVERSE_QUESTION,
        generation_tone=GenerationTone.INFORMAL
    )
)
print(f"question scenario: {generated_scenario.app_link}")

# Get more questions with the rephrasing generator
rephrased_scenario = okareo.generate_scenario_set(
    ScenarioSetGenerate(
        name="Intent Detection - Rephrased User Questions",
        source_scenario_id=generated_scenario.scenario_id,
        number_examples=1,
        generation_type=ScenarioType.REPHRASE_INVARIANT,
        generation_tone=GenerationTone.INFORMAL
    )
)
print(f"rephrased scenario: {rephrased_scenario.app_link}")

question scenario: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/scenario/9b0ed30a-155f-4bd3-90c3-154a40eecbfd
rephrased scenario: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/scenario/85c1e499-e251-4071-a531-171e629580d0
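
As a sanity check on dataset size, the arithmetic below estimates how many rows the two generators produce. It assumes each source row yields exactly `number_examples` generated rows, which the generator may not guarantee.

```python
num_articles = 10             # rows in the seed scenario
questions_per_article = 5     # number_examples for TEXT_REVERSE_QUESTION
rephrasings_per_question = 1  # number_examples for REPHRASE_INVARIANT

num_questions = num_articles * questions_per_article
num_rephrased = num_questions * rephrasings_per_question
total_rows = num_questions + num_rephrased

print(num_questions, num_rephrased, total_rows)  # 50 50 100
```

Under that assumption, the 100 rows are consistent with the 70/30 train/test split reported later in this notebook.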


## Preparing the Data for Instruction Finetuning

Now we get the generated user questions, build train/test splits, and format the train split for instruction finetuning.

In [6]:
import pandas as pd

# get the generated questions
data = {'input': [], 'label': []}
for s in [generated_scenario, rephrased_scenario]:
    sdp = okareo.get_scenario_data_points(s.scenario_id)
    for sd in sdp:
        data['input'].append(sd.input_)
        data['label'].append(sd.result[0])

df = pd.DataFrame(data)

In [7]:
df.head()

| | input | label |
|---|---|---|
| 0 | What kind of products does WebBizz offer? | Miscellaneous |
| 1 | How does WebBizz ensure their collections are ... | Miscellaneous |
| 2 | Are there any special deals or sales that WebB... | Miscellaneous |
| 3 | Does WebBizz collaborate with designers for th... | Miscellaneous |
| 4 | What should I look out for if I want to catch ... | Miscellaneous |


Use scikit-learn's `StratifiedShuffleSplit` to get stratified train/test splits, so that each label appears in roughly the same proportion in both splits.

In [8]:
from sklearn.model_selection import StratifiedShuffleSplit

# Create a StratifiedShuffleSplit object
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)

# Get the indices for the training and test sets
for train_index, test_index in sss.split(df, df['label']):
    # split() yields positional indices, so select rows with iloc
    train_df = df.iloc[train_index].reset_index(drop=True)
    test_df = df.iloc[test_index].reset_index(drop=True)

print(f"# train rows: {train_df.shape[0]}")
print(f"# test rows: {test_df.shape[0]}")

# train rows: 70
# test rows: 30
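
One way to confirm that a split is stratified is to compare label proportions across the two splits. The sketch below does this with the standard library on made-up label lists. `label_proportions` is a hypothetical helper; in the notebook you would pass `train_df['label']` and `test_df['label']` instead.

```python
from collections import Counter


def label_proportions(labels):
    """Return the fraction of rows carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels that mimic a 70/30 split of a two-class dataset.
train_labels = ["Support"] * 7 + ["Returns"] * 7
test_labels = ["Support"] * 3 + ["Returns"] * 3

print(label_proportions(train_labels))
print(label_proportions(test_labels))
```

If the two dictionaries differ noticeably for some label, the split is not preserving the class distribution.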


In [9]:
train_df.head()

| | input | label |
|---|---|---|
| 0 | How can I easily find a specific product on a ... | Miscellaneous |
| 1 | What brands are supported by WebBizz? | Sustainability |
| 2 | Is there a way to get personalized product rec... | Support |
| 3 | What should I do if I have questions about my ... | Support |
| 4 | How can I sort products on an online store? | Miscellaneous |


In [10]:
# create one Okareo scenario per split, with the question as input and the label as result
scenario_ids = {}
for name, split_df in zip(['train', 'test'], [train_df, test_df]):
    scenario_set_create = ScenarioSetCreate(
        name=f"WebBizz Intent Detection - {name}",
        seed_data=[
            SeedData(
                input_=split_df.loc[i, 'input'],
                result=split_df.loc[i, 'label']
            ) for i in range(split_df.shape[0])
        ] 
    )

    split_scenario = okareo.create_scenario_set(scenario_set_create)
    scenario_ids[name] = split_scenario.scenario_id

    print(f'{name}: {split_scenario.app_link}')

train: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/scenario/8368324d-e57e-46d5-b79c-0e1dfbe3b7f0
test: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/scenario/4be1401d-c9fd-4691-90f0-21a1fa90ab72


Finetuning instructions should be formatted with the following three fields:
- Instruction: Description of the task, input/output format.
- Input: Text used to prompt the LLM
- Output: Expected response from the LLM

We provide a template for our WebBizz intent detection task below. 

In [11]:
PROMPT_PREAMBLE = """### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- Newsletter
- Miscellaneous
- Sustainability
- Membership
- Support
- Safety
- Returns

Return only one category that is most relevant to the question.
"""

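def format_instruction_for_scenario(input_name="{input}"):
    # `input_name` is the placeholder written into the "Input" section;
    # the doubled braces keep a literal `{result}` placeholder for a later str.format call
    return f"""{PROMPT_PREAMBLE}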
### Input:
{input_name}
 
### Output:
{{result}}
"""

In [12]:
post_template = format_instruction_for_scenario("{input}")
print(post_template)

### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- Newsletter
- Miscellaneous
- Sustainability
- Membership
- Support
- Safety
- Returns

Return only one category that is most relevant to the question.

### Input:
{input}
 
### Output:
{result}
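
The template is built in two stages: the f-string fills in the preamble and the input placeholder name, while the doubled braces `{{result}}` survive as a literal `{result}` for a later `str.format` call. Here is a small stand-alone illustration of that mechanism, using a shortened, made-up template:

```python
def make_template(input_name="{input}"):
    # Stage 1: the f-string substitutes input_name and un-escapes {{result}}.
    return f"Input: {input_name}\nOutput: {{result}}"


template = make_template()
print(template)

# Stage 2: str.format fills the remaining placeholders.
print(template.format(input="Where is my order?", result="Support"))
```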



In [13]:
finetuning_instruction_data = [
    post_template.format(
        input=train_df.loc[i, 'input'],
        result=train_df.loc[i, 'label']
    ) for i in range(train_df.shape[0])
]

In [14]:
print(finetuning_instruction_data[0])

### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- Newsletter
- Miscellaneous
- Sustainability
- Membership
- Support
- Safety
- Returns

Return only one category that is most relevant to the question.

### Input:
How can I easily find a specific product on a website?
 
### Output:
Miscellaneous



In [15]:
import json

# write the instructions next to the finetuning helpers in ./finetuning
file_path = os.path.join(os.getcwd(), "finetuning", "webbizz_finetuning_train_instructions.jsonl")

with open(file_path, "w") as file:
    for row in finetuning_instruction_data:
        file.write(json.dumps({'sample': row}) + "\n")
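
Before handing the file to a training run, it can be worth reading it back and checking that every line parses. The following sketch writes and re-reads a small JSON Lines file in a temporary directory. The sample rows are placeholders, and only the `sample` key mirrors the cell above.

```python
import json
import os
import tempfile

# Placeholder rows standing in for the formatted instructions.
rows = ["### Input:\nexample one\n", "### Input:\nexample two\n"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "instructions.jsonl")

    # One JSON object per line, as in the cell above.
    with open(path, "w") as fh:
        for row in rows:
            fh.write(json.dumps({"sample": row}) + "\n")

    # Read the file back; json.loads raises if any line is malformed.
    with open(path) as fh:
        loaded = [json.loads(line)["sample"] for line in fh]

print(len(loaded), loaded == rows)
```

Because `json.dumps` escapes embedded newlines, each multi-line instruction still occupies exactly one line of the file.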

In [16]:
from datasets import load_dataset

dataset = load_dataset('json', data_files={'train': file_path})

Generating train split: 70 examples [00:00, 3246.90 examples/s]


### Configure Phi-3 for finetuning

Now we set up a finetuning run on the instruction dataset we just loaded.

In [18]:
from finetuning.utils import get_model_tokenizer_trainer
 
# Microsoft's huggingface model id for Phi-3
model_id = "microsoft/Phi-3-mini-4k-instruct"

# directory where finetuned model weights will be written
finetuned_model_name = "Phi-3-mini-4k-int4"

# target modules for LoRA
target_modules = ["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]

# set up peft model/tokenizer/trainer for finetuning
peft_model, tokenizer, trainer = get_model_tokenizer_trainer(
    model_id,
    finetuned_model_name,
    dataset,
    target_modules=target_modules,
    epochs=5
)
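
For intuition about what LoRA adds to each target module: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs introduces two small matrices with `r * (d_in + d_out)` trainable parameters in total. The dimensions and rank below are hypothetical and are not taken from `finetuning.utils` or the Phi-3 configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


# Hypothetical square projection layer and rank, chosen only for illustration.
full = 1024 * 1024
adapter = lora_param_count(1024, 1024, rank=8)

print(adapter, full, adapter / full)
```

The ratio shows why adapting several projection layers at once remains cheap compared with updating the full weight matrices.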


The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00,  6.07s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Initializing with max_seq_length=124



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
Generating train split: 58 examples [00:00, 2366.16 examples/s]


In [19]:
# train
trainer.train() # there will not be a progress bar since tqdm is disabled
 
# save model
trainer.save_model()

The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


{'loss': 0.9538, 'grad_norm': 0.21841156482696533, 'learning_rate': 0.0005, 'epoch': 1.3333333333333333}
{'loss': 0.4025, 'grad_norm': 0.1987246423959732, 'learning_rate': 0.0005, 'epoch': 2.6666666666666665}
{'loss': 0.1285, 'grad_norm': 0.22796112298965454, 'learning_rate': 0.0005, 'epoch': 4.0}
{'train_runtime': 129.1121, 'train_samples_per_second': 2.246, 'train_steps_per_second': 0.271, 'train_loss': 0.439920711517334, 'epoch': 4.666666666666667}


## Evaluate the finetuned model in Okareo

Now we will register the finetuned model in Okareo to perform classification evaluations on the train/test splits.

In [20]:
from finetuning.utils import get_finetuned_model_tokenizer

# load the peft model with the base model
finetuned_model_name = "Phi-3-mini-4k-int4"
peft_model, tokenizer = get_finetuned_model_tokenizer(finetuned_model_name)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Reloading model + unpatching flash attention


Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00,  4.67s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [21]:
from okareo.model_under_test import CustomModel, ModelInvocation

def format_instruction(sample):
    # build the inference prompt; the expected category is appended only when provided
    prompt = f"""{PROMPT_PREAMBLE}
### Input:
{sample['question']}
 
### Output:
"""
    if 'category' in sample:
        prompt += f"{sample['category']}\n"
    return prompt

mut_name = "WebBizz Intent Detection - Phi-3-mini-4k finetuned"

class FinetunedPhi3Model(CustomModel):
    def __init__(self, name):
        super().__init__(name)
        self.model = peft_model
        self.categories = [
            "Newsletter",
            "Miscellaneous",
            "Sustainability",
            "Membership",
            "Support",
            "Safety",
            "Returns",
        ]

    def invoke(self, input_value):
        prompt = format_instruction({ "question": input_value })
        input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
        outputs = self.model.generate(
            input_ids=input_ids,
            max_new_tokens=5,
            do_sample=False,
        )
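        # decode only the newly generated tokens; slicing the decoded string by len(prompt)
        # breaks if the tokenizer does not round-trip the prompt text exactly
        new_tokens = outputs[0][input_ids.shape[1]:]
        pred = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()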
        res = "Unknown"
        for cat in self.categories:
            if cat in pred:
                res = cat
        return ModelInvocation(
            model_prediction=res,
            model_input=input_value,
            raw_model_output=pred,
        )

# Register the model to use in the test run
model_under_test = okareo.register_model(
    name=mut_name,
    model=[FinetunedPhi3Model(name=FinetunedPhi3Model.__name__)],
    update=True
)
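
The post-processing step inside `invoke` can be tested in isolation, without loading the model. Below is a minimal stand-alone version of the matching loop. Like the cell above, it keeps the last category found in the text and falls back to `"Unknown"`; the function name `extract_category` is illustrative.

```python
CATEGORIES = [
    "Newsletter",
    "Miscellaneous",
    "Sustainability",
    "Membership",
    "Support",
    "Safety",
    "Returns",
]


def extract_category(raw_output: str, categories=CATEGORIES) -> str:
    """Map raw generated text to one of the known categories."""
    result = "Unknown"
    for category in categories:
        if category in raw_output:
            result = category  # no break: a later match overrides an earlier one
    return result


print(extract_category("Support\n### Input"))
print(extract_category("n/a"))
```

Rows mapped to `"Unknown"` count as misclassifications in the evaluation, so it is worth inspecting their raw outputs when they occur.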

In [22]:
from okareo_api_client.models.test_run_type import TestRunType

for name in ["train", "test"]:
    eval_name = f"Intent Detection ({name}, no synthetic data)"
    evaluation = model_under_test.run_test(
        name=eval_name,
        scenario=scenario_ids[name],
        test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION,
        calculate_metrics=True,
    )
    print(f"{name} split: See results in Okareo: {evaluation.app_link}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
You are not running the flash-attention implementation, expect numerical differences.


train split: See results in Okareo: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/eval/f2eabbfb-6d51-45f4-8862-71e1bda5e392
test split: See results in Okareo: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/eval/7df3dcfb-e19f-4baf-9754-41cb3ac06814


### Generating more finetuning data from failure cases

To improve the finetuned model's performance, we will generate synthetic data from the misclassified rows in the "train" split.

In [23]:
from okareo_api_client.models.find_test_data_point_payload import FindTestDataPointPayload

# use the id from the train split evaluation run (the last path segment of the train link printed above)
train_eval_id = "f2eabbfb-6d51-45f4-8862-71e1bda5e392"

# get evaluation data
eval_response = model_under_test.get_test_run(train_eval_id)

# get scenario data points
scenario_rows = okareo.get_scenario_data_points(eval_response.scenario_set_id)

scenario_inputs = {}
scenario_results = {}
for row in scenario_rows:
    scenario_inputs[row.id] = row.input_
    scenario_results[row.id] = row.result

# get test data points
tdp = okareo.find_test_data_points(
    FindTestDataPointPayload(
        test_run_id=train_eval_id
    )
)

In [24]:
# compile scenario and test data points with relevant scores
scenario_to_test_dp = {}
for dp in tdp:
    scenario_to_test_dp[dp.scenario_data_point_id] = {
        'input': scenario_inputs[dp.scenario_data_point_id],
        'expected': dp.metric_value.additional_properties['expected'],
        'actual': dp.metric_value.additional_properties['actual'],
        'test_data_point_id': dp.id,
    }

In [25]:
# filter the datapoints based on misclassified results
filtered_dp = [dp for dp in scenario_to_test_dp.values() if dp['expected'] != dp['actual']]
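
Beyond collecting the failures, the same `expected`/`actual` pairs are enough to compute simple summary numbers locally. The sketch below runs on made-up records; the real ones would come from `scenario_to_test_dp.values()`.

```python
from collections import Counter

# Made-up records with the same keys used in the cells above.
records = [
    {"expected": "Support", "actual": "Support"},
    {"expected": "Returns", "actual": "Support"},
    {"expected": "Safety", "actual": "Safety"},
    {"expected": "Returns", "actual": "Miscellaneous"},
]

failures = [r for r in records if r["expected"] != r["actual"]]
accuracy = 1 - len(failures) / len(records)

# Count failures per expected label to see which intents need more data.
errors_by_label = Counter(r["expected"] for r in failures)

print(accuracy, dict(errors_by_label))
```

Labels with the most failures are natural candidates for additional synthetic examples.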

In [26]:
import json

# write the failed rows
file_path = "./finetuning/webbizz_finetuning_train_failed_rows.jsonl" 
with open(file_path, "w") as f:
    for dp in filtered_dp:
        json.dump({ "input": dp['input'], "result": dp['expected'] }, f)
        f.write("\n")

In [27]:
seed_scenario_name = "WebBizz Intent Detection - train failures"
seed_scenario = okareo.upload_scenario_set(file_path=file_path, scenario_name=seed_scenario_name)

We reuse the same template from before, but this time we set it up for use with our scenario generator's `post_template` argument.

In [28]:
# format the scenario generation template to use the generator

post_template_rephrase = format_instruction_for_scenario("{generation.input}")
print(post_template_rephrase)

### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- Newsletter
- Miscellaneous
- Sustainability
- Membership
- Support
- Safety
- Returns

Return only one category that is most relevant to the question.

### Input:
{generation.input}
 
### Output:
{result}
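
The `{generation.input}` placeholder uses dotted access. How Okareo resolves it on the server side is not shown in this notebook, but Python's own `str.format` supports the same attribute-style syntax, which makes the intended shape easy to preview locally. The `SimpleNamespace` object and its text are stand-ins.

```python
from types import SimpleNamespace

# Shortened stand-in for the template printed above.
template = "### Input:\n{generation.input}\n\n### Output:\n{result}\n"

# Any object with an `input` attribute can satisfy `{generation.input}`.
generation = SimpleNamespace(input="How do I send an item back?")

print(template.format(generation=generation, result="Returns"))
```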



In [29]:
from okareo_api_client.models.generation_tone import GenerationTone
from okareo_api_client.models import ScenarioSetGenerate, ScenarioType

# generate rephrased versions 
generate_scenario_name = f"{seed_scenario_name} rephrased"

generate_request = ScenarioSetGenerate(
    source_scenario_id=seed_scenario.scenario_id,
    name=generate_scenario_name,
    number_examples=3,
    generation_type=ScenarioType.REPHRASE_INVARIANT,
    generation_tone=GenerationTone.NEUTRAL,
    post_template=post_template_rephrase
)

rephrased_instruction_scenario = okareo.generate_scenario_set(generate_request)

In [30]:
# copy the output of this cell into part 2 to continue finetuning with Okareo synthetic data
print(f"train_scenario_id='{scenario_ids['train']}'")
print(f"test_scenario_id='{scenario_ids['test']}'")
print(f"rephrased_instruction_scenario_id='{rephrased_instruction_scenario.scenario_id}'")

train_scenario_id='8368324d-e57e-46d5-b79c-0e1dfbe3b7f0'
test_scenario_id='4be1401d-c9fd-4691-90f0-21a1fa90ab72'
rephrased_instruction_scenario_id='02148484-77a3-4909-8c52-0132ed2ab68c'
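
Copying ids by hand is error-prone. As an alternative, the ids could be written to a small JSON file that is read back after the restart. The file name and the placeholder ids below are hypothetical; the notebook itself asks you to paste the printed values into Part 2.

```python
import json
import os
import tempfile

# Placeholder values; in the notebook these would be the three scenario ids.
ids = {
    "train_scenario_id": "<train-id>",
    "test_scenario_id": "<test-id>",
    "rephrased_instruction_scenario_id": "<rephrased-id>",
}

# Hypothetical location for the handoff file.
path = os.path.join(tempfile.gettempdir(), "okareo_part1_ids.json")

with open(path, "w") as fh:
    json.dump(ids, fh, indent=2)

# After the restart, the same file can be loaded again.
with open(path) as fh:
    restored = json.load(fh)

print(restored == ids)
```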


Restart the notebook to clear your CUDA cache, then continue on to Part 2!