# Dataset Generator

## Information
- This notebook's main objective is to implement a dataset generator using a generative model. 
- This dataset generator can be used for any objective as long as the given prompt is clear enough.
- This is just a tool, and is in no way perfect. You should *always* check the results by hand.
- You should *always* generate a few examples before launching a full-scale generation.
- This dataset generator uses *AzureOpenAI*, so make sure you have the right setup.
- Should you wish to use your own API or model, the changes are minor, and the generator will adapt.
- To exploit the generator's full potential, make sure to give a *clear* and *concise* prompt.
- Generation is slow: in the run recorded further down, 600 examples took about 27 minutes (roughly 2.7 seconds per example).
- The base output of the generator is a *yaml* file, though this can easily be modified.
- The initial use for this generator is generating a dataset for La Roche Posay, so the examples given are relevant to that.

- Version: 2.2

## Setup

In [1]:
import random
import asyncio
from openai import OpenAI, AsyncOpenAI, AsyncAzureOpenAI
from dotenv import load_dotenv
import openai
import os
import yaml
from tqdm import tqdm

In [2]:
# Make sure the output is true
load_dotenv("../.env")

True

In [3]:
# Loading the client, you can load any client you'd like
azure_client = AsyncAzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"),
    api_version="2023-08-01-preview"
)

## Generating functions

### The heart of the mechanism

This next cell contains the most important function of this notebook, **generate_question**. 
The creation of the dataset starts with this seemingly simple function, but its simplicity is also its greatest weakness. As we all know, artificial intelligence is just that, artificial, and like any generative model, its performance depends heavily on the prompt it's given. Not to put any pressure on you, dear reader, but the performance of this notebook is 90% on you.

This function is the only one that will need heavy modification for the generator to work properly (for you). The example given in this notebook is from a time I worked on generating a dataset for a skin care company, but you can do whatever you want with it. Take a look at my example: you can adapt it, modify it or even scrap it if that gets you better performance.

At the very end of this section there is a little test cell. Make sure to **always** run it, and run it as many times as you need until you get a structure you find satisfying. I'm heavily stressing the importance of these tests because once you launch the creation of your datasets, you're in it for a few hours. Make it count.

In [4]:
### Specific traits - relevant for the skin care products ###
skin_types = ["oily", "dry"]
product_types = ["cleanser", "moisturizer", "serum", "eyecream", "cream"]
concerns = ["simple_acne", "severe_acne", "dark_spots", "dryness", "dehydration", "fine_lines", 
            "allergy", "redness", "eczema", "wrinkles", "loss_of_firmness", "lack_of_elasticity", 
            "peeling_skin", "roughness", "tightness", "itchyness", "enlarged_pores", "excess_sebum", 
            "shining_zone", "hyperpigmentation", "dullness", "uneven_skin_tone", "skin_decoloration", 
            "dark_circles"]

In [80]:
# A function to generate one singular question, adapt to your needs
async def generate_question(id, azure_client=azure_client):

    ### Custom part - modify to your liking ###
    #skin_type = random.choice(skin_types)
    #product_type = random.choice(product_types)
    #product_line = random.choice(product_lines)
    num_concerns = random.choice([1, 2])  # Using simple random choice for 1 or 2 concerns
    selected_concerns = random.sample(concerns, num_concerns)

    prompt = f"""
        Generate a customer service question for a skincare product. You are human. You use La Roche Posay, but don't mention it by name all the time.
        
        You will generate a sentence based on {selected_concerns}.
        You may mention the product type '{product_types}', the skin type '{skin_types}',
        or some concerns from {selected_concerns}. 
        The examples you generate must always be related to the {selected_concerns}, otherwise the sentence won't make sense.
        Be creative and provide diverse formulations.
        Don't always use everything.
        Remember that you're human and you sometimes make mistakes when writing. 
        Make sure to include some questions just mentioning the issues, and not products or product lines,
        or pretend you're asking for a generic product that could solve an issue.
    
        Examples: 
    
        I have [dry skin](skintype). What would you suggest I do?
        Do you have any [serum](producttype) that is good for [oily skin](skintype) and acne?
    
        When you find words from these groups:
            - skin type: oily, dry
            - product type: cleanser, moisturizer, serum, eyecream
            - product line: cicaplast, effaclar, hyalu, lipikar, redermic, toleriane, pigmentclar
        
        always format them like in these examples:
        
        \"I'm looking for an [eyecream](producttype) that helps reduce dark circles. What do you suggest from [Pigmentclar](productline)?\"
        
        \"My [dry skin](skintype) feels very tight during winter. Could you recommend a [moisturizer](producttype)?\"
        Always follow these rules:
            Always format skin types, product types and product lines
            Never format concerns!
            Generate at most two full sentences, all in one line!
            Always write full sentences.
            Vary the length, I want some short sentences as well.
            Human language isn't perfect, make sure your examples don't look too much like they were generated by an AI.
            Rarely, if ever, mention all four categories at the same time.
            Always generate finished sentences.
            Prefer making shorter sentences up to 30 tokens.
            Occasionally put some spelling mistakes.
            Never repeat the same example
        Never break these rules!
    """
    ### End of the custom part ###

    ### The generator - should remain the same ###
    response = await azure_client.chat.completions.create(
        model="sandbox-gpt-35",  # Azure deployment name; replace with your own
        messages=[
            {"role": "system", "content": prompt}
        ],
        max_tokens=50
    )
    
    text = response.choices[0].message.content.strip()

    ### End of the generator ###

    return {"id": id, "text": text, "concerns": selected_concerns} # Yaml Format; Adapt to your specific needs.

In [84]:
test = await generate_question(-1, azure_client)
test


{'id': -1,
 'text': 'My skin has been looking dull and uneven lately. Can you recommend a serum to help with skin discoloration from your [Pigmentclar](productline) line?',
 'concerns': ['dark_circles', 'skin_decoloration']}
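
Checking results by hand stays necessary, but a small automated pre-check can catch the most obvious format violations before you read anything. The following is only a minimal sketch: it assumes the `[text](label)` annotation format and the three labels used in the prompt above, and `check_annotations` is a hypothetical helper, not part of the generator.

```python
import re

# Labels the prompt asks the model to use; adjust to your own annotation scheme.
ALLOWED_LABELS = {"skintype", "producttype", "productline"}

# Matches annotations of the form [some text](label).
ANNOTATION_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def check_annotations(text):
    """Return a list of problems found in one generated example (empty if none)."""
    problems = []
    for span, label in ANNOTATION_RE.findall(text):
        if label not in ALLOWED_LABELS:
            problems.append(f"unknown label '{label}' on '{span}'")
    # Brackets left over after removing well-formed annotations hint at a broken one.
    leftover = ANNOTATION_RE.sub("", text)
    if "[" in leftover or "]" in leftover:
        problems.append("unbalanced or malformed annotation brackets")
    if "\n" in text:
        problems.append("example spans more than one line")
    return problems
```

Running it on `test["text"]` after each trial generation gives a quick first signal. It does not check whether the sentence actually relates to the selected concerns, so it is no substitute for reading the examples.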

### The generating body

The next few functions need little to no modification. They are written to ensure that the API is always queried slowly and without blocking. While running initial tests in this notebook's very first version, I had many unpleasant surprises with a little error called *RateLimitError*. It is raised when you send too many requests to the API, so unless you have an unlimited plan, I suggest you keep the retry logic.

I also have to mention that when this error is raised and **call_api_with_retry** comes out of its cooldown period, the generation starts again from the beginning: you could be 99% of the way through and the error will wipe everything out, so you have to start over. One way of combatting this issue is addressed in the next section, *Generating our datasets*.

In [85]:
# This function is vital.
# We have a limit on how many calls we can make.
async def call_api_with_retry(task, max_retries=5, base_delay=0.5):
    for attempt in range(max_retries):
        try:
            return await task()
        except openai.RateLimitError as e:
            if attempt < max_retries - 1:
                sleep_time = base_delay * 2 ** attempt + random.uniform(0, 0.1 * base_delay)
                print(f"Rate limit reached, retrying after {sleep_time:.2f} seconds...", flush=True)
                await asyncio.sleep(sleep_time)
            else:
                raise
        except Exception as e:
            print(f"An unexpected error occurred: {str(e)}", flush=True)
            raise
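
To get a feel for how long the cooldown above lasts, the deterministic part of the waiting times can be computed directly from the formula in `call_api_with_retry`. This is a small worked example with the default parameters; `backoff_delays` is a hypothetical helper and the random jitter (at most `0.1 * base_delay` per sleep) is left out.

```python
def backoff_delays(max_retries=5, base_delay=0.5):
    """Deterministic part of each sleep in call_api_with_retry (jitter excluded)."""
    # The last attempt re-raises instead of sleeping, so there are max_retries - 1 sleeps.
    return [base_delay * 2 ** attempt for attempt in range(max_retries - 1)]

delays = backoff_delays()
print(delays)       # [0.5, 1.0, 2.0, 4.0]
print(sum(delays))  # 7.5
```

With the defaults, the function therefore waits about 7.5 seconds in total before giving up. If your rate limit window is longer than that, raising `base_delay` or `max_retries` is probably worth trying.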

In [86]:
# Intermediate function
async def generate_questions(num_entries, azure_client=azure_client):
    dataset = []
    for i in tqdm(range(num_entries)):
        question = await generate_question(i + 1, azure_client)
        dataset.append(question)
    return dataset

In [87]:
# The generator function
async def generate_dataset(num_entries, azure_client=azure_client):
    return await call_api_with_retry(lambda: generate_questions(num_entries, azure_client))
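
Because `generate_dataset` wraps the whole loop, a retry restarts it from the first example. One possible alternative, shown here as a sketch and not as what this notebook does, is to retry each call individually so that completed examples are kept. All names below (`RateLimited`, `with_retry`, `generate_all`, `flaky_question`) are hypothetical stand-ins so the example runs without the API; in the notebook you would catch `openai.RateLimitError` and pass `generate_question`.

```python
import asyncio
import random

class RateLimited(Exception):
    """Stand-in for openai.RateLimitError so the sketch runs without the API."""

async def with_retry(make_call, max_retries=5, base_delay=0.5):
    # Retry a single call with exponential backoff and a little jitter.
    for attempt in range(max_retries):
        try:
            return await make_call()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1 * base_delay))

async def generate_all(num_entries, generate_one, base_delay=0.5):
    dataset = []
    for i in range(num_entries):
        # Only the failing call is retried; earlier entries stay in `dataset`.
        entry = await with_retry(lambda: generate_one(i + 1), base_delay=base_delay)
        dataset.append(entry)
    return dataset

# Demo with a fake generator that fails once on the third entry.
failed_once = set()

async def flaky_question(entry_id):
    if entry_id == 3 and entry_id not in failed_once:
        failed_once.add(entry_id)
        raise RateLimited()
    return {"id": entry_id, "text": f"question {entry_id}"}

result = asyncio.run(generate_all(5, flaky_question, base_delay=0.01))
print([entry["id"] for entry in result])  # [1, 2, 3, 4, 5]
```

Inside a notebook an event loop is already running, so use `await generate_all(...)` instead of `asyncio.run(...)`.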

## Generating our datasets

In [None]:
# 85-15 split, adapt as needed, I use mini-batches 
batch_size = 100
train_iter = 6000//batch_size # We're generating 6000 examples for the training dataset
test_iter = 1000//batch_size # We're generating 1000 examples for the test dataset

In [88]:
path = "../data/lrp/testtrain/data.yml"
batch = 600
dataset = await generate_dataset(batch)
with open(path, 'w') as file:
    yaml.dump(dataset, file, sort_keys=False)

100%|██████████| 600/600 [26:55<00:00,  2.69s/it]


### Battling the *RateLimitError* (my way)

As mentioned in the previous section, *RateLimitError* is your worst enemy, and I had to learn it the hard way. I've tested many different ways of dealing with it, but it would eventually always arise after ~700 generated examples in one go.

So, this is one of my solutions: set up checkpoints of no more than 100 examples, and loop through them. Though 100 examples per checkpoint might seem too few, what you have to take into account is that the speed at which the API communicates with the model is constant. Let's say generating 100 examples takes 5 minutes; then generating 1000 examples takes 50 minutes in any case. However, letting the model generate one file of 1000 examples will result in a *critical RateLimitError* (defined at the end of this section) 98% of the time (I had to learn that one the hard way as well), whereas generating 10 files with 100 examples each means that at the end of those 50 minutes you will almost always have the result you were seeking.

Keep in mind that a *RateLimitError* might still occur, you can consider it "normal" behaviour as many factors play into the performance of the API calls. If it does, nothing is lost as you have all of the previously generated mini-batch files saved and only the *current* file is being regenerated.
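
Since finished mini-batch files survive a crash, a rerun only needs to produce the files that are still missing. Below is a minimal sketch of that check. It assumes batch files are named `<prefix>_<n>.yml` with `n` starting at 1, mirroring the `lrp_test_{i+1}.yml` pattern used in this notebook; `pending_batches` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def pending_batches(directory, prefix, num_batches):
    """Return the 1-based batch numbers whose checkpoint file does not exist yet."""
    directory = Path(directory)
    return [
        n for n in range(1, num_batches + 1)
        if not (directory / f"{prefix}_{n}.yml").exists()
    ]

# Demo: batches 1 and 3 already exist, so only 2 and 4 remain.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "lrp_train_1.yml").write_text("[]")
    (Path(tmp) / "lrp_train_3.yml").write_text("[]")
    missing = pending_batches(tmp, "lrp_train", 4)

print(missing)  # [2, 4]
```

Looping over `pending_batches("../data/lrp/train", "lrp_train", train_iter)` instead of a plain `range` would then skip the batches that were already written.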

I didn't include the handling of different files as that is something unique to every scenario, but what worked for me is just using the **yaml** library to merge all of the files into one with a simple loop.
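
For reference, such a merge loop could look roughly like the sketch below. It assumes every batch file holds a YAML list of entries shaped like the output of `generate_question`; the helper name `merge_yaml_batches` and the renumbering of `id` are my own assumptions here (each batch numbers its entries from 1, so ids would otherwise repeat).

```python
import glob
import os
import tempfile
import yaml

def merge_yaml_batches(pattern):
    """Concatenate every YAML batch file matching `pattern` into one list."""
    merged = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as file:
            merged.extend(yaml.safe_load(file) or [])
    # Each batch numbers its entries from 1, so renumber to keep ids unique.
    for new_id, entry in enumerate(merged, start=1):
        entry["id"] = new_id
    return merged

# Demo on two tiny batch files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2):
        batch = [{"id": 1, "text": f"example from batch {n}", "concerns": ["dryness"]}]
        with open(os.path.join(tmp, f"lrp_train_{n}.yml"), "w") as file:
            yaml.dump(batch, file, sort_keys=False)
    merged = merge_yaml_batches(os.path.join(tmp, "lrp_train_*.yml"))

print([entry["id"] for entry in merged])  # [1, 2]
```

Note that `sorted` orders file names alphabetically, so `lrp_train_10.yml` comes before `lrp_train_2.yml`; with renumbered ids this only affects the order of entries, not their validity.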


* *Critical RateLimitError*: To better understand this term, we have to go back to the **call_api_with_retry** function. This function allows for a maximum of 5 attempts (`max_retries=5`) before it re-raises the error and shuts down the whole process. A *critical RateLimitError* occurs when we run out of attempts.

In [None]:
# Train dataset

for i in range(train_iter):
    train_dataset_path = f"../data/lrp/train/lrp_train_{i+1}.yml"  # one file per mini-batch
    train_dataset = await generate_dataset(batch_size)

    with open(train_dataset_path, "w") as file:
        yaml.dump(train_dataset, file, sort_keys=False)

print('Train dataset generated.')

In [None]:
# Test dataset
for i in range(test_iter):
    test_dataset_path = f"../data/lrp/test/lrp_test_{i+1}.yml"
    test_dataset = await generate_dataset(batch_size)

    with open(test_dataset_path, "w") as file:
        yaml.dump(test_dataset, file, sort_keys=False)

print('Test dataset generated.')