# Improving Finetuning with Synthetic Data in Okareo

<a target="_blank" href="https://colab.research.google.com/github/okareo-ai/okareo-python-sdk/blob/main/examples/classification_finetuning_eval_part2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## 🎯 Goals

Note: This notebook is a companion to `classification_finetuning_eval_part1.ipynb`. Please run all cells in that notebook before starting this one.

In [4]:
# copy-paste these IDs from the last cell of Part 1
train_scenario_id='bed059c5-90c9-4908-a0f5-330271a1e9e3'
test_scenario_id='a41761a3-5c38-42ea-8ea7-2eccb6024aa1'
rephrased_instruction_scenario_id='16b578f7-2ae3-48c1-a20b-85c2a1de2c97'

In [5]:
scenario_ids = {
    'train': train_scenario_id,
    'test': test_scenario_id,
}

In [1]:
import os
from okareo import Okareo

OKAREO_API_KEY = os.environ.get('OKAREO_API_KEY')
okareo = Okareo(OKAREO_API_KEY)

In [2]:
from finetuning.utils import load_raw_data

# load the train/test data
file_paths = ["./finetuning/webbizz_finetuning_train_data.jsonl",
              "./finetuning/webbizz_finetuning_test_data.jsonl"]
data_dict = load_raw_data(file_paths)
train_data = data_dict[file_paths[0]]
test_data = data_dict[file_paths[1]]




Finetuning instructions should be formatted with the following three fields:
- Instruction: Description of the task and the expected input/output format.
- Input: Text used to prompt the LLM.
- Output: Expected response from the LLM.
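
As a rough illustration of how the three fields combine into a single training string, the sketch below uses a hypothetical `build_instruction` helper with section headers chosen for illustration. The actual template lives in `finetuning.utils` and may differ in wording and layout.

```python
def build_instruction(instruction: str, input_text: str, output_text: str) -> str:
    """Join the three fields into one prompt string (hypothetical helper)."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        f"### Output:\n{output_text}"
    )

# Example values are illustrative only.
sample = build_instruction(
    instruction="Classify the question as pricing, complaints, or returns.",
    input_text="Can I send this product back?",
    output_text="returns",
)
print(sample)
```

Keeping the expected output in its own trailing section makes it easy to drop that section at inference time and let the model complete it.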

We provide a template for the WebBizz classification task in the `finetuning.utils` module:

In [7]:
from finetuning.utils import format_instruction_for_scenario

post_template = format_instruction_for_scenario("{input}")
print(post_template)

### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- pricing
- complaints
- returns

Return only one category that is most relevant to the question.

### Input:
{input}
 
### Output:
{result}



In [8]:
finetuning_instruction_data = [post_template.format(input=d['input'], result=d['result']) for d in train_data]

In [6]:
print(finetuning_instruction_data[0])

### Instruction:
Given "Input", return a category under "Output" that is one of the following:

- pricing
- complaints
- returns

Return only one category that is most relevant to the question.

### Input:
Can I send this product back?
 
### Output:
returns



In [10]:
# get rephrased failure rows
sdp = okareo.get_scenario_data_points(rephrased_instruction_scenario_id)
for sd in sdp:
    finetuning_instruction_data.append(sd.input_)

In [11]:
import json

file_path = "./finetuning/webbizz_finetuning_train_instructions_augmented.jsonl"

with open(file_path, "w") as file:
    for row in finetuning_instruction_data:
        file.write(json.dumps({'sample': row}) + "\n")
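
Before handing a JSONL file to a training pipeline, it can help to confirm that it round-trips cleanly. The following is a minimal sketch, assuming the same one-object-per-line layout with a `sample` key; it writes to a temporary file with made-up rows so it does not touch the real dataset.

```python
import json
import os
import tempfile

# Made-up rows for illustration; one contains an embedded newline on purpose.
rows = ["first instruction text", "second instruction\nwith a newline"]

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "sample.jsonl")

    # Write one JSON object per line, as in the cell above.
    with open(tmp_path, "w") as f:
        for row in rows:
            f.write(json.dumps({"sample": row}) + "\n")

    # Read it back: every line must parse as JSON and carry a "sample" key.
    with open(tmp_path) as f:
        loaded = [json.loads(line)["sample"] for line in f]

print(loaded == rows)
```

Because `json.dumps` escapes newlines inside strings, a multi-line instruction still occupies exactly one line of the file.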

In [12]:
from datasets import load_dataset

p = [os.getcwd(), "finetuning", "webbizz_finetuning_train_instructions_augmented.jsonl"]
filepath = os.path.join(*p)
print(filepath)
# load the file whose absolute path was just printed
dataset = load_dataset('json', data_files={'train': filepath})

/home/mason/git/okareo-python-sdk/examples/finetuning/webbizz_finetuning_train_instructions_augmented.jsonl


Generating train split: 30 examples [00:00, 6367.23 examples/s]


## Configure Phi-3 for finetuning

Next, we set up a finetuning run on the augmented instruction dataset loaded above.

In [13]:
HUGGINGFACE_API_KEY=os.environ.get("HUGGINGFACE_API_KEY")

In [14]:
from finetuning.utils import get_model_tokenizer_trainer
 
# Hugging Face model id for Microsoft's Phi-3
model_id = "microsoft/Phi-3-mini-4k-instruct"

# directory where finetuned model weights will be written
finetuned_model_name = "Phi-3-mini-4k-int4-augmented"

# target modules for LoRA
target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]

# set up peft model/tokenizer/trainer for finetuning
peft_model, tokenizer, trainer = get_model_tokenizer_trainer(model_id, finetuned_model_name, dataset, HUGGINGFACE_API_KEY, target_modules=target_modules, epochs=3)


The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.01s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Initializing with max_seq_length=95



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
Generating train split: 25 examples [00:00, 1189.13 examples/s]
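
For intuition about why LoRA keeps finetuning cheap: for a frozen weight matrix of shape d_out × d_in, a rank-r adapter adds two small matrices with r × (d_in + d_out) trainable parameters in total. The sketch below computes that count. The layer shape and rank are made-up values for illustration, not the settings used inside `get_model_tokenizer_trainer`.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical layer shape and rank, chosen only for illustration.
d_in, d_out, rank = 1024, 1024, 8

full = d_in * d_out
adapter = lora_param_count(d_in, d_out, rank)
print(f"full matrix: {full}, LoRA adapter: {adapter}, ratio: {adapter / full:.4f}")
```

Each module name listed in `target_modules` gets its own adapter, so the total trainable parameter count grows with the number of targeted modules.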


In [15]:
# train
trainer.train() # there will not be a progress bar since tqdm is disabled
 
# save model
trainer.save_model()

The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


{'train_runtime': 54.4602, 'train_samples_per_second': 1.377, 'train_steps_per_second': 0.165, 'train_loss': 0.9350931379530165, 'epoch': 2.571428571428571}


## Evaluate the finetuned model in Okareo

Now we will register the finetuned model in Okareo to perform classification evaluations on the train/test splits.

In [16]:
from finetuning.utils import get_finetuned_model_tokenizer

# load the peft model with the base model
finetuned_model_name = "Phi-3-mini-4k-int4-augmented"
peft_model, tokenizer = get_finetuned_model_tokenizer(finetuned_model_name)

Reloading model + unpatching flash attention


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|██████████| 2/2 [00:30<00:00, 15.48s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [17]:
from finetuning.utils import count_tokens

# count the tokens for each training example
n_train_tokens = [count_tokens(tokenizer, s) for s in train_data]

# generation limit: round the max token count down to a multiple of 100, then add 100
nearest = 100
n_token_lim = nearest * (max(n_train_tokens) // nearest) + nearest
n_token_lim

100
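
The rounding arithmetic in the cell above can be checked in isolation. This sketch wraps the same expression in a hypothetical `token_limit` helper and applies it to a few made-up token counts.

```python
def token_limit(max_tokens: int, nearest: int = 100) -> int:
    """Round down to a multiple of `nearest`, then add one more step."""
    return nearest * (max_tokens // nearest) + nearest

# Made-up counts for illustration.
for n in (42, 99, 100, 250):
    print(n, "->", token_limit(n))
```

Note that a count that is already an exact multiple (for example 100) is bumped by a full step to 200, so the limit always sits strictly above the longest training example.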

In [18]:
from okareo.model_under_test import CustomModel, ModelInvocation
from finetuning.utils import format_instruction

mut_name = "WebBizz Tool Classification - Phi-3-mini-4k finetuned + augmented"

class FinetunedPhi3Model(CustomModel):
    def __init__(self, name):
        super().__init__(name)
        self.model = peft_model
        self.categories = ["pricing", "complaints", "returns"]

    def invoke(self, input_value):
        prompt = format_instruction({ "question": input_value })
        input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
        outputs = self.model.generate(
            input_ids=input_ids,
            max_new_tokens=n_token_lim,
            do_sample=False,
        )
        pred = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):].strip()
        res = "unknown"
        for cat in self.categories:
            if cat in pred:
                res = cat
        return ModelInvocation(
            actual=res,
            model_input=input_value,
            model_result=pred,
        )

# Register the model to use in the test run
model_under_test = okareo.register_model(
    name=mut_name,
    model=[FinetunedPhi3Model(name=FinetunedPhi3Model.__name__)],
    update=True
)
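
The label extraction inside `invoke` can be exercised without loading the model. The sketch below copies the matching loop into a standalone, hypothetical `extract_category` function to show how it behaves on a few made-up generations.

```python
CATEGORIES = ["pricing", "complaints", "returns"]

def extract_category(pred: str, categories=CATEGORIES) -> str:
    """Mirror the matching loop in `invoke`: the last category found in the text wins."""
    res = "unknown"
    for cat in categories:
        if cat in pred:
            res = cat
    return res

# Made-up generations for illustration.
print(extract_category("returns"))
print(extract_category("I am not sure"))
print(extract_category("pricing or returns"))
print(extract_category("Returns"))
```

Two properties are worth keeping in mind when reading evaluation results: the match is case-sensitive, and when the generation mentions several categories, the one listed last in `categories` is returned.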

In [19]:
from okareo_api_client.models.test_run_type import TestRunType

for name in ["train", "test"]:
    eval_name = f"Tool Classification ({name}, with synthetic data)"
    evaluation = model_under_test.run_test(
        name=eval_name,
        scenario=scenario_ids[name],
        test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION,
        calculate_metrics=True,
    )
    print(f"{name} split: See results in Okareo: {evaluation.app_link}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
You are not running the flash-attention implementation, expect numerical differences.


train split: See results in Okareo: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/eval/63ef7e22-75c0-49e1-981f-25b8d2d76fa5
test split: See results in Okareo: https://app.okareo.com/project/89920a9a-54cc-40c8-af68-9975e64e8d18/eval/fd664f3e-6ace-4a82-8113-1b1d0bcf156c
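
To help interpret the linked evaluations, here is a minimal sketch of the kind of metrics a multi-class classification run typically reports: overall accuracy plus per-class precision and recall. This is an illustrative calculation on made-up labels, not Okareo's implementation and not results from this notebook; `classification_report` is a hypothetical helper.

```python
from collections import Counter

def classification_report(expected, actual):
    """Accuracy plus per-class precision and recall from paired label lists."""
    pairs = list(zip(expected, actual))
    accuracy = sum(e == a for e, a in pairs) / len(pairs)
    true_pos = Counter(e for e, a in pairs if e == a)
    predicted = Counter(actual)
    support = Counter(expected)
    per_class = {}
    for label in sorted(support):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / support[label]
        per_class[label] = {"precision": precision, "recall": recall}
    return accuracy, per_class

# Made-up labels for illustration only.
expected = ["pricing", "returns", "returns", "complaints"]
actual = ["pricing", "returns", "complaints", "complaints"]

accuracy, per_class = classification_report(expected, actual)
print(accuracy)
print(per_class)
```

Comparing these numbers between the train and test splits, and against the Part 1 run without synthetic data, indicates whether the augmented rows improved generalization.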
