<a target="_blank" href="https://colab.research.google.com/github/okareo-ai/okareo-python-sdk/blob/main/examples/generating_classification_scenarios.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Generate synthetic data to use in classification

1. Install Okareo's Python SDK: `pip install okareo`

2. Get your API token from [https://app.okareo.com/](https://app.okareo.com/).  
   (Note: You will need to register first.)

3. Go directly to the API settings by clicking the button under **"1. Create API Token"**. You can skip all other steps.

4. Add your generated API token to the cell below. 👇


In [1]:
OKAREO_API_KEY = "YOUR_API_KEY"

In [None]:
%pip install okareo 

This notebook generates the data used to train the model in the following notebook:

<a target="_blank" href="https://colab.research.google.com/github/okareo-ai/okareo-python-sdk/blob/main/examples/classification_eval_training.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

We start by manually creating a set of seed questions.

Those seed questions are used to create a Scenario Set in Okareo.

In [2]:
from okareo import Okareo
import os
import tempfile
import json

okareo = Okareo(OKAREO_API_KEY)

# Start with sample data for complaints, returns, and pricing
rows = [
	{"input": "What are the current discounts available on electronic gadgets?", "result": "pricing"},
	{"input": "Are there any seasonal sales or upcoming promotions I should be aware of?", "result": "pricing"},
	{"input": "How can I apply a discount code to my purchase?", "result": "pricing"},
	{"input": "Is there a loyalty program that provides additional savings on purchases?", "result": "pricing"},
	{"input": "I'd like to purchase additional filters for my model, how much are they?", "result": "pricing"},
	{"input": "Do you offer any discounts or promotions?", "result": "pricing"},
	{"input": "Why was I charged more than the listed price at checkout?", "result": "pricing"},
	{"input": "I received a damaged item; how do I go about getting a replacement?", "result": "returns"},
	{"input": "Will I be refunded in full, or will there be deductions for restocking fees?", "result": "returns"},
	{"input": "I received a damaged item. What is the process for returning it?", "result": "returns"},
	{"input": "I accidentally ordered two of the same item. Can I return one?", "result": "returns"},
	{"input": "I want to exchange the product I bought for a different size. What's the process?", "result": "returns"},
	{"input": "I have an issue with the quality of Product Y; whom do I contact?", "result": "complaints"},
	{"input": "Where can I send feedback about a particular problematic order?", "result": "complaints"},
	{"input": "My product arrived late, and this has caused inconvenience; what can be done about this?", "result": "complaints"},
	{"input": "What is the escalation process for unresolved issues?", "result": "complaints"},
	{"input": "Can I speak directly to a manager about my ongoing issue?", "result": "complaints"},
	{"input": "I have some quality concerns with your product, who can I talk to?", "result": "complaints"},
	{"input": "The product I bought is not working as advertised. Who can I contact?", "result": "complaints"},
	{"input": "I was overcharged for my purchase. Who can help me with this?", "result": "complaints"},
	{"input": "The product I bought is of poor quality. Who can I report this to?", "result": "complaints"},
]

temp_dir = tempfile.gettempdir()
file_path = os.path.join(temp_dir, "seed_data_sample.jsonl")

# Write to a .jsonl file
with open(file_path, "w+") as file:
    for row in rows:
        file.write(json.dumps(row) + '\n')
    

# Create scenario set with seed data file
source_scenario = okareo.upload_scenario_set(file_path=file_path, scenario_name="Blog Test Set")
print(source_scenario.app_link)

# make sure to clean up tmp file
os.remove(file_path)

https://app.okareo.com/project/57101004-ed10-4301-bc64-5576f2aa3513/scenario/c7e8b8b0-5954-4c9a-96d7-335ca00f7ecd
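Before uploading seed data, it can help to confirm that the JSONL file round-trips cleanly and to see how many rows each label has. The following is a minimal offline sketch, not part of the Okareo SDK: the helper names `write_jsonl` and `read_jsonl` and the file name `seed_data_check.jsonl` are assumptions made for illustration, and `sample_rows` is a small subset of the seed rows above.

```python
import json
import os
import tempfile
from collections import Counter


def write_jsonl(rows, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A small subset of the seed rows used in this notebook
sample_rows = [
    {"input": "Do you offer any discounts or promotions?", "result": "pricing"},
    {"input": "I accidentally ordered two of the same item. Can I return one?", "result": "returns"},
    {"input": "What is the escalation process for unresolved issues?", "result": "complaints"},
    {"input": "Can I speak directly to a manager about my ongoing issue?", "result": "complaints"},
]

path = os.path.join(tempfile.gettempdir(), "seed_data_check.jsonl")
write_jsonl(sample_rows, path)
loaded = read_jsonl(path)
os.remove(path)  # clean up the temporary file

# Count rows per label to spot class imbalance in the seed data
label_counts = Counter(row["result"] for row in loaded)
print(label_counts)
```

Running the same count over the full `rows` list shows whether any class is underrepresented before synthetic data is generated from it.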


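From the seed Scenario Set, we create new Scenario Sets using Okareo's generator, one for each generation type: rephrasing, common misspellings, common contractions, and conditional phrasing.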

In [4]:
from okareo_api_client.models import ScenarioType
# Use scenario set id or scenario set object from previous step as source for generation
rephrased_scenario = okareo.generate_scenarios(
    source_scenario=source_scenario,
    name="Blog - rephrase",
    number_examples=3,
    generation_type=ScenarioType.REPHRASE_INVARIANT
)

print(rephrased_scenario.app_link)

https://app.okareo.com/project/57101004-ed10-4301-bc64-5576f2aa3513/scenario/1230ea78-4836-4c97-9ba4-05c4b47a3060


In [5]:
spelling_scenario = okareo.generate_scenarios(
    source_scenario=source_scenario,
    name="Blog - spelling",
    number_examples=3,
    generation_type=ScenarioType.COMMON_MISSPELLINGS
)

print(spelling_scenario.app_link)

https://app.okareo.com/project/57101004-ed10-4301-bc64-5576f2aa3513/scenario/dc5d0a94-2868-4081-b2c2-f0a0741c8f11


In [6]:
contr_scenario = okareo.generate_scenarios(
    source_scenario=source_scenario,
    name="Blog - contractions",
    number_examples=3,
    generation_type=ScenarioType.COMMON_CONTRACTIONS
)

print(contr_scenario.app_link)

https://app.okareo.com/project/57101004-ed10-4301-bc64-5576f2aa3513/scenario/c756017a-3765-4bfa-8625-0e98a1ac7b0c


In [7]:
cond_scenario = okareo.generate_scenarios(
    source_scenario=source_scenario,
    name="Blog - conditional",
    number_examples=3,
    generation_type=ScenarioType.CONDITIONAL
)

print(cond_scenario.app_link)

https://app.okareo.com/project/57101004-ed10-4301-bc64-5576f2aa3513/scenario/074b2868-e446-40fc-b719-ec6faffce43a
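The four generation cells above differ only in the scenario name and the generation type, so they could be collapsed into one loop. The sketch below shows that pattern under stated assumptions: `generate_variants` is a hypothetical helper, not an SDK function, and `fake_generate` is an offline stand-in so the example runs without an API key.

```python
def generate_variants(generate, source_scenario, generation_types, number_examples=3):
    """Call `generate` once per (name, generation_type) pair and collect the results.

    `generate` is any callable that accepts the same keyword arguments as the
    calls in the cells above (hypothetical helper for illustration).
    """
    return {
        name: generate(
            source_scenario=source_scenario,
            name=name,
            number_examples=number_examples,
            generation_type=generation_type,
        )
        for name, generation_type in generation_types.items()
    }


# Offline stand-in for the real generator; it only echoes its arguments.
def fake_generate(source_scenario, name, number_examples, generation_type):
    return {
        "source": source_scenario,
        "name": name,
        "n": number_examples,
        "type": generation_type,
    }


variants = generate_variants(
    fake_generate,
    "seed-set",
    {
        "Blog - rephrase": "REPHRASE_INVARIANT",
        "Blog - spelling": "COMMON_MISSPELLINGS",
    },
)
print(sorted(variants))
```

In this notebook, the same helper would presumably be called with `okareo.generate_scenarios`, `source_scenario`, and a dictionary mapping each name to a `ScenarioType` member.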


In [8]:
# Load all of the necessary libraries from Okareo
from okareo.model_under_test import CustomModel, ModelInvocation

# Load the torch library
import torch

# Load the tokenizer and model classes from Hugging Face Transformers
from transformers import AutoTokenizer, DistilBertForSequenceClassification

# Load a tokenizer for the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Load Okareo's pretrained model from the Hugging Face Hub
model = DistilBertForSequenceClassification.from_pretrained("blog_model_base")

# Create an instance of the Okareo client
okareo = Okareo(OKAREO_API_KEY)

# Define a model class that will be used for classification
# The model takes in a scenario and returns a predicted class
class ClassificationModel(CustomModel):
    # Constructor for the model
    def __init__(self, name, tokenizer, model):
        self.name = name
        # The pretrained tokenizer
        self.tokenizer = tokenizer
        # The pretrained model
        self.model = model
        # The possible labels for the model
        self.label_lookup = ["pricing", "returns", "complaints"]

    # Callable to be applied to each scenario in the scenario set
    def invoke(self, input: str):
        # Tokenize the input
        encoding = self.tokenizer(input, return_tensors="pt", padding="max_length", truncation=True, max_length=512)
        # Get the logits from the model
        logits = self.model(**encoding).logits
        # Get the index of the highest value (the predicted class)
        idx = torch.argmax(logits, dim=1).item()
        # Get the label for the predicted class
        prediction = self.label_lookup[idx]
        
        # Return the prediction in a ModelInvocation object
        return ModelInvocation(
                model_prediction=prediction,
                model_input=input,
                model_output_metadata={ "prediction": prediction, "confidence": logits.softmax(dim=1).max().item() },
            )

# Register the model with Okareo
# This will return a model if it already exists or create a new one if it doesn't
model_under_test_base = okareo.register_model(name="blog_intent_classifier_model_base", model=ClassificationModel(name="Classification model", tokenizer=tokenizer, model=model), update=True)
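The `invoke` method above turns raw logits into a label (the argmax) and a confidence (the largest softmax probability). As a minimal sketch of that arithmetic in plain Python, without `torch`, the example below uses hypothetical logits; `softmax` and `predict` are illustrative helper names, not part of any library used in this notebook.

```python
import math

# Same label order as `label_lookup` in the model class above
LABELS = ["pricing", "returns", "complaints"]


def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, labels=LABELS):
    """Return (label, confidence) for the highest-probability class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]


# Hypothetical logits for illustration only
label, confidence = predict([0.2, 2.1, -1.0])
print(label, round(confidence, 3))
```

Because softmax is monotonic, the argmax of the probabilities is the same index as the argmax of the logits, which is why the model class can take `torch.argmax` directly on the logits.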



In [9]:
test_run_item = model_under_test_base.run_test(
    scenario=rephrased_scenario.scenario_id, 
    name="Blog - rephrase", 
    calculate_metrics=True)

test_run_item = model_under_test_base.run_test(
    scenario=spelling_scenario.scenario_id, 
    name="Blog - spelling", 
    calculate_metrics=True)

test_run_item = model_under_test_base.run_test(
    scenario=contr_scenario.scenario_id, 
    name="Blog - contractions", 
    calculate_metrics=True)

test_run_item = model_under_test_base.run_test(
    scenario=cond_scenario.scenario_id, 
    name="Blog - conditional", 
    calculate_metrics=True)

test_run_item = model_under_test_base.run_test(
    scenario=source_scenario.scenario_id, 
    name="Blog - base", 
    calculate_metrics=True)

In [10]:
# Load Okareo's pretrained model from the Hugging Face Hub
model = DistilBertForSequenceClassification.from_pretrained("blog_model_synthetic")

# Create an instance of the Okareo client
okareo = Okareo(OKAREO_API_KEY)

# Define a model class that will be used for classification
# The model takes in a scenario and returns a predicted class
class ClassificationModel(CustomModel):
    # Constructor for the model
    def __init__(self, name, tokenizer, model):
        self.name = name
        # The pretrained tokenizer
        self.tokenizer = tokenizer
        # The pretrained model
        self.model = model
        # The possible labels for the model
        self.label_lookup = ["pricing", "returns", "complaints"]

    # Callable to be applied to each scenario in the scenario set
    def invoke(self, input: str):
        # Tokenize the input
        encoding = self.tokenizer(input, return_tensors="pt", padding="max_length", truncation=True, max_length=512)
        # Get the logits from the model
        logits = self.model(**encoding).logits
        # Get the index of the highest value (the predicted class)
        idx = torch.argmax(logits, dim=1).item()
        # Get the label for the predicted class
        prediction = self.label_lookup[idx]
        
        # Return the prediction in a ModelInvocation object
        return ModelInvocation(
                model_prediction=prediction,
                model_input=input,
                model_output_metadata={ "prediction": prediction, "confidence": logits.softmax(dim=1).max().item() },
            )

# Register the model with Okareo
# This will return a model if it already exists or create a new one if it doesn't
model_under_test_syn = okareo.register_model(name="blog_intent_classifier_model_w_synthetic", model=ClassificationModel(name="Classification model", tokenizer=tokenizer, model=model), update=True)

In [11]:
test_run_item = model_under_test_syn.run_test(
    scenario=rephrased_scenario.scenario_id, 
    name="Blog - syn - rephrase", 
    calculate_metrics=True)

test_run_item = model_under_test_syn.run_test(
    scenario=spelling_scenario.scenario_id, 
    name="Blog - syn - spelling", 
    calculate_metrics=True)

test_run_item = model_under_test_syn.run_test(
    scenario=contr_scenario.scenario_id, 
    name="Blog - syn - contractions", 
    calculate_metrics=True)

test_run_item = model_under_test_syn.run_test(
    scenario=cond_scenario.scenario_id, 
    name="Blog - syn - conditional", 
    calculate_metrics=True)

test_run_item = model_under_test_syn.run_test(
    scenario=source_scenario.scenario_id, 
    name="Blog - syn - base", 
    calculate_metrics=True)
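With `calculate_metrics=True`, Okareo computes the metrics for each test run, and the two models can then be compared per scenario set. As a rough offline illustration of one such comparison, the sketch below computes per-label accuracy from (expected, predicted) pairs. The helper `per_label_accuracy` is hypothetical, and the example pairs are made up for illustration; they are not results from the models in this notebook.

```python
from collections import defaultdict


def per_label_accuracy(pairs):
    """Return {label: accuracy} for an iterable of (expected, predicted) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for expected, predicted in pairs:
        totals[expected] += 1
        if expected == predicted:
            correct[expected] += 1
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical (expected, predicted) pairs for illustration only;
# these are NOT results produced by the models in this notebook.
example_pairs = [
    ("pricing", "pricing"),
    ("pricing", "complaints"),
    ("returns", "returns"),
    ("complaints", "complaints"),
]

print(per_label_accuracy(example_pairs))
```

Computing the same breakdown for the base model and the synthetic-data model on each scenario set would show which classes benefit from the additional training data.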