# Improving Intent Detection Performance with Synthetic Data in Okareo

<a target="_blank" href="https://colab.research.google.com/github/okareo-ai/okareo-python-sdk/blob/main/examples/classification_finetuning_eval_part2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Note: This notebook is the companion to [Part #1](https://github.com/okareo-ai/okareo-python-sdk/blob/main/examples/classification_finetuning_eval_part1.ipynb). Complete that notebook first, then restart its kernel to clear your CUDA cache before starting this one.

## 🎯 Goals

After using this notebook, you will be able to:
- Finetune an LLM with an augmented train split that includes synthetic data
- Evaluate the newly finetuned model in Okareo

## Get the Finetuning Data from Part \#1

In this part, we reuse the train and test splits from Part #1. To get more finetuning data, we add the synthetic rows that were generated from the failed rows of the train split evaluation.

In [1]:
# copy-paste these IDs from the last cell of Part 1
train_scenario_id='8368324d-e57e-46d5-b79c-0e1dfbe3b7f0'
test_scenario_id='4be1401d-c9fd-4691-90f0-21a1fa90ab72'
rephrased_instruction_scenario_id='02148484-77a3-4909-8c52-0132ed2ab68c'

In [2]:
scenario_ids = {
    'train': train_scenario_id,
    'test': test_scenario_id,
}

In [3]:
import os
from okareo import Okareo

OKAREO_API_KEY = os.environ.get('OKAREO_API_KEY')
okareo = Okareo(OKAREO_API_KEY)

In [4]:
import pandas as pd

# get training data from part #1
sdp = okareo.get_scenario_data_points(train_scenario_id)
data = {'input': [], 'label': []}
for sd in sdp:
    data['input'].append(sd.input_)
    data['label'].append(sd.result)

train_df = pd.DataFrame(data)

Use the same prompt template/formatting as in Part #1.

In [5]:
PROMPT_PREAMBLE = """### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- Newsletter
- Miscellaneous
- Sustainability
- Membership
- Support
- Safety
- Returns

Return only one category that is most relevant to the question.
"""

def format_instruction_for_scenario(input_name="{input}"):
    return f"""{PROMPT_PREAMBLE}
### Input:
{input_name}
 
### Output:
{{result}}
"""

In [6]:
post_template = format_instruction_for_scenario("{input}")
print(post_template)

### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- Newsletter
- Miscellaneous
- Sustainability
- Membership
- Support
- Safety
- Returns

Return only one category that is most relevant to the question.

### Input:
{input}
 
### Output:
{result}
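
The doubled braces around `result` are what make the two-stage formatting work: the f-string fills in the preamble and the input placeholder immediately, and leaves a literal `{result}` for the later `.format` call. A minimal sketch of that mechanism, using a shortened stand-in preamble and a hypothetical helper name `make_template` (neither is from Part #1):

```python
# Shortened stand-in for PROMPT_PREAMBLE, for illustration only.
PREAMBLE = "### Instruction:\nClassify the input.\n"

def make_template(input_name="{input}"):
    # `{input_name}` is substituted now; `{{result}}` becomes a literal `{result}`.
    return f"{PREAMBLE}\n### Input:\n{input_name}\n\n### Output:\n{{result}}\n"

# Stage 1: build a template that still contains both placeholders.
template = make_template()

# Stage 2: fill the placeholders for one (hypothetical) training row.
filled = template.format(input="Where is my order?", result="Support")
print(filled)
```

With a single pair of braces, the f-string would instead try to resolve a Python variable named `result` at template-build time.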



In [7]:
finetuning_instruction_data = [
    post_template.format(
        input=train_df.loc[i, 'input'],
        result=train_df.loc[i, 'label']
    ) for i in range(train_df.shape[0])
]

In [8]:
print(finetuning_instruction_data[0])

### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- Newsletter
- Miscellaneous
- Sustainability
- Membership
- Support
- Safety
- Returns

Return only one category that is most relevant to the question.

### Input:
How can I easily find a specific product on a website?
 
### Output:
Miscellaneous



In [9]:
# get rephrased failure rows
sdp = okareo.get_scenario_data_points(rephrased_instruction_scenario_id)
for sd in sdp:
    finetuning_instruction_data.append(sd.input_)
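
Cell 9 appends `sd.input_` unchanged, which presumes the rephrased scenario already stores fully formatted instruction strings. A minimal sketch of a sanity check under that assumption; the helper name `missing_sections` and the sample strings are hypothetical and not part of the Okareo SDK:

```python
# Section headers every finetuning sample is expected to contain.
REQUIRED_SECTIONS = ("### Instruction:", "### Input:", "### Output:")

def missing_sections(sample: str) -> list:
    """Return the template section headers that do not appear in `sample`."""
    return [section for section in REQUIRED_SECTIONS if section not in sample]

# Illustrative strings only; real rows come from `finetuning_instruction_data`.
well_formed = (
    "### Instruction:\nPick a category.\n\n"
    "### Input:\nHow do I return an item?\n\n"
    "### Output:\nReturns\n"
)
raw_question = "How do I return an item?"

print(missing_sections(well_formed))   # nothing missing
print(missing_sections(raw_question))  # all three headers missing
```

Running `[s for s in finetuning_instruction_data if missing_sections(s)]` after cell 9 would list any rows that need reformatting before they are written to the JSONL file.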

In [10]:
import json

file_path = "./finetuning/webbizz_finetuning_train_instructions_augmented.jsonl"

with open(file_path, "w") as file:
    for row in finetuning_instruction_data:
        file.write(json.dumps({'sample': row}) + "\n")

In [11]:
from datasets import load_dataset

# build an absolute path to the JSONL file written in the previous cell
p = [os.getcwd(), "finetuning", "webbizz_finetuning_train_instructions_augmented.jsonl"]
filepath = os.path.join(*p)
print(filepath)
dataset = load_dataset('json', data_files={'train': filepath})



/home/mason/git/okareo-python-sdk/examples/finetuning/webbizz_finetuning_train_instructions_augmented.jsonl


Generating train split: 94 examples [00:00, 12287.36 examples/s]


## Finetune Phi-3 on the Augmented Dataset

Now we set up a finetuning run identical to the Part #1 run using the augmented dataset.

In [13]:
from finetuning.utils import get_model_tokenizer_trainer
 
# Microsoft's Hugging Face model ID for Phi-3
model_id = "microsoft/Phi-3-mini-4k-instruct"

# directory where finetuned model weights will be written
finetuned_model_name = "Phi-3-mini-4k-int4-augmented"

# target modules for LoRA
target_modules = ["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]

# set up peft model/tokenizer/trainer for finetuning
peft_model, tokenizer, trainer = get_model_tokenizer_trainer(
    model_id,
    finetuned_model_name,
    dataset,
    target_modules=target_modules,
    epochs=5
)


The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.74s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Initializing with max_seq_length=124



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
Generating train split: 79 examples [00:00, 2871.02 examples/s]


In [14]:
# train
trainer.train() # there will not be a progress bar since tqdm is disabled
 
# save model
trainer.save_model()

The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


{'loss': 0.9676, 'grad_norm': 0.3681010901927948, 'learning_rate': 0.0005, 'epoch': 1.0}




{'loss': 0.2193, 'grad_norm': 0.17968423664569855, 'learning_rate': 0.0005, 'epoch': 2.0}




{'loss': 0.1557, 'grad_norm': 0.3641696870326996, 'learning_rate': 0.0005, 'epoch': 3.0}




{'loss': 0.1138, 'grad_norm': 0.1862974613904953, 'learning_rate': 0.0005, 'epoch': 4.0}




{'loss': 0.0845, 'grad_norm': 0.42350006103515625, 'learning_rate': 0.0005, 'epoch': 5.0}
{'train_runtime': 154.7935, 'train_samples_per_second': 2.552, 'train_steps_per_second': 0.323, 'train_loss': 0.30817508935928345, 'epoch': 5.0}
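
The summary line can be cross-checked against the rest of the output with simple arithmetic. A sketch, assuming the 79 sequences reported by the trainer setup above are what it iterates over in each epoch; the effective batch size is inferred here, not read from `finetuning/utils.py`:

```python
import math

# Values copied from the training output above.
train_runtime = 154.7935   # seconds
steps_per_second = 0.323
epochs = 5
train_sequences = 79       # from "Generating train split: 79 examples"

# Throughput implied by the sequence count, epoch count, and runtime.
implied_samples_per_second = train_sequences * epochs / train_runtime

# Total optimizer steps implied by the reported step rate.
total_steps = round(steps_per_second * train_runtime)
steps_per_epoch = total_steps // epochs

# Assumption: the effective batch size is the smallest one that covers
# all sequences in `steps_per_epoch` steps.
inferred_batch_size = math.ceil(train_sequences / steps_per_epoch)

print(round(implied_samples_per_second, 3))
print(total_steps, steps_per_epoch, inferred_batch_size)
```

The implied throughput agrees with the reported `train_samples_per_second` of 2.552, which suggests the summary statistics are computed over the 79 sequences rather than the 94 rows in the JSONL file.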


## Evaluate the Augmented Finetuned Model in Okareo

As before, we register the augmented finetuned model in Okareo to perform the same classification evaluations. This will let us compare model performance pre-/post-augmentation.

In [15]:
from finetuning.utils import get_finetuned_model_tokenizer

# load the peft model with the base model
finetuned_model_name = "Phi-3-mini-4k-int4-augmented"
peft_model, tokenizer = get_finetuned_model_tokenizer(finetuned_model_name)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Reloading model + unpatching flash attention


Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.49s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [16]:
from okareo.model_under_test import CustomModel, ModelInvocation

def format_instruction(sample):
    # build the same prompt layout used for finetuning
    prompt = f"""{PROMPT_PREAMBLE}
### Input:
{sample['question']}
 
### Output:
"""
    # append the label only when one is provided (i.e., for training samples)
    if 'category' in sample:
        prompt += f"{sample['category']}\n"
    return prompt

mut_name = "WebBizz Intent Detection - Phi-3-mini-4k finetuned + augmented"

class FinetunedPhi3Model(CustomModel):
    def __init__(self, name):
        super().__init__(name)
        self.model = peft_model
        self.categories = [
            "Newsletter",
            "Miscellaneous",
            "Sustainability",
            "Membership",
            "Support",
            "Safety",
            "Returns",
        ]

    def invoke(self, input_value):
        prompt = format_instruction({ "question": input_value })
        input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
        outputs = self.model.generate(
            input_ids=input_ids,
            max_new_tokens=5,
            do_sample=False,
        )
        pred = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):].strip()
        pred = pred.split("\n")[0]
        res = "Unknown"
        for cat in self.categories:
            if cat in pred:
                res = cat
        return ModelInvocation(
            actual=res,
            model_input=input_value,
            model_result=pred,
        )

# Register the model to use in the test run
model_under_test = okareo.register_model(
    name=mut_name,
    model=[FinetunedPhi3Model(name=FinetunedPhi3Model.__name__)],
    update=True
)
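
The label-matching step inside `invoke` can be exercised without loading the model. A minimal sketch that mirrors that loop in a standalone function; the name `extract_category` and the sample completions are hypothetical:

```python
CATEGORIES = [
    "Newsletter", "Miscellaneous", "Sustainability", "Membership",
    "Support", "Safety", "Returns",
]

def extract_category(completion: str, categories=CATEGORIES) -> str:
    """Map raw generated text to a known category, or "Unknown" if none match."""
    # Only the first generated line is considered, as in `invoke`.
    first_line = completion.strip().split("\n")[0]
    result = "Unknown"
    for cat in categories:
        # Substring match; if several categories appear, the last one in the list wins.
        if cat in first_line:
            result = cat
    return result

print(extract_category("Returns\n\n### Input:"))  # a clean completion
print(extract_category("I am not sure"))          # no category present
print(extract_category("Support or Safety"))      # two categories present
```

Because the match is case-sensitive and substring-based, a completion such as `returns` maps to `Unknown`, and an ambiguous completion resolves to whichever matching category comes last in the list.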

In [17]:
from okareo_api_client.models.test_run_type import TestRunType

for name in ["train", "test"]:
    eval_name = f"Intent Detection ({name}, with synthetic data)"
    evaluation = model_under_test.run_test(
        name=eval_name,
        scenario=scenario_ids[name],
        test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION,
        calculate_metrics=True,
    )
    print(f"{name} split: See results in Okareo: {evaluation.app_link}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
You are not running the flash-attention implementation, expect numerical differences.


train split: See results in Okareo: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/eval/153986eb-1994-4b64-b0f7-60d9d6b1f7b0
test split: See results in Okareo: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/eval/6d40b28b-df5b-4236-a72e-ffed1f46b95d
