## Sample ticket extraction

A reference dataset with the five requested sentiment labels is hard to find, so I used Amazon reviews as the source of seed examples for data generation. Their 1 to 5 star ratings (the `overall` column) map directly onto five labels.

In [1]:
import urllib.request
urllib.request.urlretrieve("https://huggingface.co/datasets/hugginglearners/amazon-reviews-sentiment-analysis/resolve/main/amazon_reviews.csv?download=true", "amazon_reviews.csv")

('amazon_reviews.csv', <http.client.HTTPMessage at 0x7f145f0f9f90>)

In [2]:
import pandas as pd

reviews = pd.read_csv("amazon_reviews.csv")

In [3]:
SAMPLES_PER_SCORE = 10
seed_selection_df = reviews.groupby('overall', group_keys=False).apply(lambda x: x.sample(min(len(x), SAMPLES_PER_SCORE)))



In [4]:
seed_selection_df.describe()

Unnamed: 0.1,Unnamed: 0,overall,day_diff,helpful_yes,helpful_no,total_vote,score_pos_neg_diff,score_average_rating,wilson_lower_bound
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,2465.44,3.0,444.92,0.2,0.18,0.38,0.02,0.106667,0.025012
std,1281.044156,1.428571,244.196378,0.494872,0.481918,0.878078,0.428095,0.268868,0.067668
min,109.0,1.0,2.0,0.0,0.0,0.0,-1.0,0.0,0.0
25%,1439.5,2.0,311.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2543.0,3.0,439.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3354.5,4.0,615.75,0.0,0.0,0.0,0.0,0.0,0.0
max,4855.0,5.0,941.0,2.0,2.0,4.0,2.0,1.0,0.34238


In [5]:
seed_selection_df.head()

Unnamed: 0.1,Unnamed: 0,reviewerName,overall,reviewText,reviewTime,day_diff,helpful_yes,helpful_no,total_vote,score_pos_neg_diff,score_average_rating,wilson_lower_bound
1699,1699,froemer,1.0,"I've requested a replacement, so I'll update t...",2014-01-18,324,0,0,0,0,0.0,0.0
3031,3031,matt,1.0,"Bought it, installed it in my Droid RazrM neve...",2013-04-24,593,0,0,0,0,0.0,0.0
3111,3111,Mercury,1.0,"I got this card only a few months ago, used it...",2014-09-07,92,1,0,1,1,1.0,0.206549
1593,1593,Eric Redman,1.0,Could not figure out why this card crashed mul...,2012-12-14,724,1,2,3,-1,0.333333,0.061492
2510,2510,J. Parnell,1.0,"After I put this card in my new S4, my camera ...",2013-05-28,559,0,0,0,0,0.0,0.0


In [6]:
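SENTIMENT_LABELS = ["Strong Negative", "Mild Negative", "Neutral", "Mild Positive", "Strong Positive"]

def translate_score_to_sentiment(score: float) -> str:
    """Map a 1-5 star rating to one of the five sentiment labels."""
    normalized_score = int(score) - 1
    if not 0 <= normalized_score < len(SENTIMENT_LABELS):
        raise ValueError(f"Score must be between 1 and 5, got {score}")
    return SENTIMENT_LABELS[normalized_score]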

## Seed tasks

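For the next step of ticket generation, the sampled reviews are written out as seed tasks to `seed_tasks.jsonl`, the file the generation pipeline reads.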

In [10]:
import json
import random

instances = []
for i, row in seed_selection_df.iterrows():
    instances.append({
        "id": f"seed_task_{i}", 
        "name": "sentiment_analysis", 
        "ticket": row["reviewText"], 
        "instances": [
            {
                "input": "", 
                "output": translate_score_to_sentiment(row["overall"])
            }
        ],
        "is_classification": False
    })

with open("./stanford_alpaca/seed_tasks.jsonl", "w") as outfile:
    for instance in instances:
        json.dump(instance, outfile)
        outfile.write('\n')

In [11]:
import os
import getpass

if not os.environ.get("OPENAI_API_KEY"):  # .get() avoids a KeyError when the variable is unset
    os.environ["OPENAI_API_KEY"] = getpass.getpass()

In [12]:
%cd stanford_alpaca 

/home/jovyan/work/stanford_alpaca


## Data generation

For generation I used the [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) project, which aims to fine-tune LLaMA on instructions generated by OpenAI models. It can resume an interrupted run, uses multiprocessing for de-duplication, and lets you configure how many requests are sent in a batch. In my view it is a great tool for the purpose. I know I might be expected to use LangChain here, but since I had used this tool before and time was limited, I preferred something I could get running quickly.

Related code can be found in the `./stanford_alpaca` folder in this project.

With the seed tasks created above and a few modifications, it can orchestrate ticket generation and finish with a number of items close to the requested amount. The modifications are:
- support for the recent OpenAI API
- conversion of instruction generation into ticket generation by changing the prompt and related keywords

For de-duplication it uses the [ROUGE score](https://en.wikipedia.org/wiki/ROUGE_(metric)), which is a reasonable way to filter out duplicates. In this specific case the variant based on the longest common subsequence (ROUGE-L) is calculated. If a new ticket scores above 0.7 against any stored ticket, it is not saved.
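To make the filtering rule concrete, here is a minimal, dependency-free sketch of an LCS-based similarity check. It is an approximation and not the project's actual code: the function names (`lcs_length`, `rouge_l_f1`, `is_duplicate`) are hypothetical, and the plain lowercase whitespace tokenization is a simplifying assumption that a real ROUGE library would handle differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 between two texts (assumed whitespace tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_duplicate(new_ticket, stored_tickets, threshold=0.7):
    """True if the new ticket is too similar to any stored ticket."""
    return any(rouge_l_f1(new_ticket, t) > threshold for t in stored_tickets)


stored = ["The card stopped working after two weeks"]
print(is_duplicate("The card stopped working after two days", stored))
print(is_duplicate("Great headphones, battery lasts all day", stored))
```

The first call shares six of seven tokens in order with the stored ticket, so it lands above the 0.7 threshold and would be dropped. The second shares none and would be kept.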

Prompts are sent with a temperature of 1.0, which encourages diverse output but also means the results should be inspected. Due to limited time I skipped that step here and took the output as is.
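One inexpensive inspection that could be added later is checking that every generated label is one of the five allowed options. The sketch below assumes each generated record is a dict with the label stored under an `"output"` key. That key name, the sample records, and the helper `split_by_label_validity` are assumptions for illustration, not part of the project.

```python
ALLOWED_SENTIMENTS = {
    "Strong Negative",
    "Mild Negative",
    "Neutral",
    "Mild Positive",
    "Strong Positive",
}


def split_by_label_validity(records, label_key="output"):
    """Separate records with an allowed sentiment label from the rest."""
    valid, invalid = [], []
    for record in records:
        label = str(record.get(label_key, "")).strip()
        if label in ALLOWED_SENTIMENTS:
            valid.append(record)
        else:
            invalid.append(record)
    return valid, invalid


# Hypothetical records, only to show the shape of the check.
sample = [
    {"ticket": "Arrived broken.", "output": "Strong Negative"},
    {"ticket": "It is okay I guess.", "output": "Somewhat Neutral"},
]
valid, invalid = split_by_label_validity(sample)
print(len(valid), len(invalid))
```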

Here is an example prompt:
```
You are asked to come up with a set of 20 diverse customer tickets. These customer tickets will be given to a GPT model and we will evaluate the GPT model for completing the customer sentiment.

Here are the requirements:
1. Try not to repeat the product for each ticket to maximize diversity.
2. The language used for the ticket also should be diverse. For example, you should combine short and long tickets.
3. The ticket should be in English.
4. The output should be an appropriate sentiment of the customer ticket. Make sure the output is in the following list of options: "Strong Negative", "Mild Negative", "Neutral", "Mild Positive", "Strong Positive".

List of 20 tickets:
###
1. Ticket: SanDisk Ultra 32 GB MicroSDHC C10/UHS1 Memory Card with Adapter...SanDisk Ultra 32 GB MicroSDHC C10/UHS1 Memory Card with Adapter...Works fine.
1. Input:
<noinput>
1. Output:
Mild Positive
###
2. Ticket: Pictures were bright and sharp and download from the camera to the card very quickly. If you have a better quality camera this is a great card
2. Input:
<noinput>
2. Output:
Strong Positive
###
3. Ticket: Spent over $50 and the thing came fast and didn't work. Neither my computer nor my phone could read it or format it. I have had plenty of memory cards, so I know to unmount it with software. Odds are yours will be fine, mine sucked,
3. Input:
<noinput>
3. Output:
Strong Negative
###
4. Ticket:
```
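The layout of the prompt above can be reproduced with a small helper. This is a sketch of how seed examples might be encoded into that format, not the project's actual function: the name `encode_prompt` and its `(ticket, sentiment)` input shape are assumptions for illustration.

```python
def encode_prompt(header, examples):
    """Build a few-shot prompt from (ticket, sentiment) pairs.

    The prompt ends with an open "N. Ticket:" line so that the model
    continues the numbered list from there.
    """
    prompt = header.rstrip("\n") + "\n###\n"
    for idx, (ticket, sentiment) in enumerate(examples, start=1):
        prompt += f"{idx}. Ticket: {ticket.strip()}\n"
        prompt += f"{idx}. Input:\n<noinput>\n"
        prompt += f"{idx}. Output:\n{sentiment}\n"
        prompt += "###\n"
    prompt += f"{len(examples) + 1}. Ticket:"
    return prompt


examples = [
    ("Works fine.", "Mild Positive"),
    ("Came fast and didn't work.", "Strong Negative"),
]
print(encode_prompt("List of 20 tickets:", examples))
```

Ending on an unfinished numbered item nudges the model to keep emitting tickets in the same `Ticket` / `Input` / `Output` structure, which keeps the response easy to parse.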

In [32]:
%%bash
python -m generate_instruction generate_instruction_following_data \
  --output_dir ./ \
  --num_instructions_to_generate 10000 \
  --model_name="gpt-4o-mini" \
  --request_batch_size=1

Loaded 50 human-written seed instructions
Loaded 604 machine-generated instructions
Sending OpenAI request


  0%|          | 0/10000 [00:00<?, ?it/s]
prompt_batches:   0%|          | 0/1 [00:00<?, ?it/s]
prompt_batches: 100%|██████████| 1/1 [00:04<00:00,  4.26s/it]


Received response in 4.258951187133789
Request 1 took 4.26s, processing took 0.00s
Generated 0 instructions, kept 0 instructions
Sending OpenAI request



prompt_batches:   0%|          | 0/1 [00:00<?, ?it/s]
prompt_batches: 100%|██████████| 1/1 [00:07<00:00,  7.39s/it]


Received response in 7.393707513809204
Request 2 took 7.39s, processing took 1.46s
Generated 14 instructions, kept 14 instructions
Sending OpenAI request


  6%|▌         | 617/10000 [00:13<03:36, 43.26it/s]
prompt_batches:   0%|          | 0/1 [00:33<?, ?it/s]
Traceback (most recent call last):
  File "/home/jovyan/work/stanford_alpaca/utils.py", line 115, in openai_completion
    completion_batch = client.chat.completions.create(messages=messages, **shared_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/openai/_utils/_utils.py", line 274, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/openai/resources/chat/completions.py", line 668, in create
    return self._post(
           ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/openai/_base_client.py", line 1260, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  

## Splitting generated data

In [35]:
SPLIT_SIZE = 0.8
with open("regen.json", "r") as infile:
    all_tickets = json.load(infile)
bound = int(len(all_tickets) * SPLIT_SIZE)
regen_train = all_tickets[:bound]
print(f"Train size: {len(regen_train)}")
regen_valid = all_tickets[bound:]
print(f"Validation size: {len(regen_valid)}")
with open("regen_train.json", "w") as outfile:
    json.dump(regen_train, outfile)
with open("regen_valid.json", "w") as outfile:
    json.dump(regen_valid, outfile)

Train size: 1855
Validation size: 464


Unfortunately, I wasn't able to produce 10000 samples as intended. The first run gave me 1007 samples, which I reserved for training, and the second run produced 693 before I hit the rate limit; I reserved those for validation. I could rebalance later depending on the training results, but they were satisfactory in this scenario, so I kept the split as is.
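If rebalancing becomes necessary, a first step could be comparing the label shares of the two splits. The sketch below assumes each record is a dict with the sentiment under an `"output"` key. That key, the toy records, and the helper `label_shares` are assumptions for illustration and were not run against the real files.

```python
from collections import Counter


def label_shares(records, label_key="output"):
    """Return the fraction of records carrying each label."""
    counts = Counter(record[label_key] for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy data standing in for the train and validation splits.
train = [{"output": "Neutral"}, {"output": "Neutral"},
         {"output": "Strong Positive"}, {"output": "Mild Negative"}]
valid = [{"output": "Neutral"}, {"output": "Strong Negative"}]

train_shares = label_shares(train)
valid_shares = label_shares(valid)
for label in sorted(set(train_shares) | set(valid_shares)):
    print(f"{label}: train={train_shares.get(label, 0.0):.2f} "
          f"valid={valid_shares.get(label, 0.0):.2f}")
```

Large gaps between the two columns, or labels missing from one split, would suggest shuffling the combined data before splitting.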