## Setup

To follow this guide, install the following packages:
- fireworks-ai
- pandas

You will also need:

- A Fireworks account
- A Fireworks API key

In [2]:
!pip install fireworks-ai pandas

In [3]:
import json
import os

from fireworks.client import Fireworks
import pandas as pd

In [4]:
# Make sure you have the FIREWORKS_API_KEY environment variable set to your account's key!
# os.environ['FIREWORKS_API_KEY'] = 'XXX'

client = Fireworks()
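
Since the client above is created without arguments, the key has to come from the environment. The following is a minimal sketch of a fail-fast check you could run before creating the client. `require_api_key` is a hypothetical helper written for this guide, not part of the `fireworks-ai` package.

```python
import os


def require_api_key(env=os.environ, name="FIREWORKS_API_KEY"):
    """Return the API key found in `env`, or raise a clear error if it is missing or blank.

    Hypothetical helper: it only inspects a mapping and never calls the Fireworks API.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the client.")
    return key
```

Calling `require_api_key()` at the top of the notebook surfaces a missing key immediately instead of on the first request.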

## Problem Definition: Insurance Support Ticket Classifier

*Note: The problem definition, data, and labels used in this example were synthetically generated by Claude 3 Opus*

In the insurance industry, customer support plays a crucial role in ensuring client satisfaction and retention. Insurance companies receive a high volume of support tickets daily, covering a wide range of topics such as billing, policy administration, claims assistance, and more. Manually categorizing these tickets can be time-consuming and inefficient, leading to longer response times and potentially impacting customer experience.

### Task
In this exercise, we will evaluate how accurately various prompts classify the tickets in the `test.tsv` dataset.

#### Labeled Data

The data can be found in the week-1 `data` folder.

We will use the following datasets:
- `./data/train.tsv`
- `./data/test.tsv`

In [7]:
training_examples = pd.read_csv('data/train.tsv', sep='\t')
test_examples = pd.read_csv('data/test.tsv', sep='\t')

# In order to not leak information about the test labels into our prompts, the list of possible categories will be defined 
# based on the training labels. We'll discuss train/test splits more during week 2.
categories = sorted(training_examples['label'].unique().tolist())
categories_str = '\n'.join(categories)

test_tickets = test_examples['text'].tolist()
test_labels = test_examples['label'].tolist()
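
Because the category list is built only from the training labels, any test label that never appears in training cannot be predicted and will always count as an error. A small sketch of how one might check for that is shown below. `find_unseen_labels` is a hypothetical helper, and the toy labels are invented for illustration rather than taken from the dataset.

```python
def find_unseen_labels(train_labels, test_labels):
    """Return the sorted test labels that never occur in the training labels."""
    return sorted(set(test_labels) - set(train_labels))


# Toy labels for illustration only; they are not taken from the dataset.
toy_train = ["Billing", "Claims", "Billing"]
toy_test = ["Claims", "Policy Administration"]
print(find_unseen_labels(toy_train, toy_test))  # ['Policy Administration']
```

With the variables defined above, the equivalent call would be `find_unseen_labels(categories, test_labels)`.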

In [8]:
# Uses an LLM to predict class labels for a list of support tickets
def classify_tickets(tickets, prompt_generator, model="accounts/fireworks/models/llama-v3p1-8b-instruct"):
    responses = list()

    for ticket in tickets:
        user_prompt = prompt_generator(ticket)
    
        response = client.chat.completions.create(
            model=model,
            messages=[
                { "role": "user", "content": user_prompt}
            ],
            # setting temperature to 0 for this use case, so that responses are as deterministic as possible
            temperature=0, 
            stop=["</category>"],
            max_tokens=2048,
        )
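        # The stop sequence strips the closing tag, so the label is whatever follows "<category>".
        # Fall back to the full text if the model omitted the tag, rather than raising IndexError.
        content = response.choices[0].message.content
        parts = content.split("<category>")
        responses.append(parts[1].strip() if len(parts) > 1 else content.strip())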

    return responses


# Calculates the percent of predictions we classified correctly
def evaluate_accuracy(predicted, actual):
    num_correct = sum(p == a for p, a in zip(predicted, actual))
    return round(100 * num_correct / len(actual), 2)
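
The label parsing in `classify_tickets` assumes the model always emits an opening `<category>` tag. As an alternative, here is a sketch of a more defensive parser based on a regular expression. `extract_category` and `CATEGORY_PATTERN` are hypothetical names introduced for this example. Returning `None` when the tag is absent is a design choice of this sketch, so that malformed responses can be counted separately from wrong labels.

```python
import re

# Capture the text after "<category>" up to "</category>" or, when the stop
# sequence has already removed the closing tag, up to the end of the string.
CATEGORY_PATTERN = re.compile(r"<category>(.*?)(?:</category>|$)", re.DOTALL)


def extract_category(text):
    """Return the stripped category label, or None if no <category> tag is present."""
    match = CATEGORY_PATTERN.search(text)
    return match.group(1).strip() if match else None


print(extract_category("<scratchpad>about an invoice</scratchpad>\n<category> Billing"))  # Billing
print(extract_category("I am not sure."))  # None
```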

### Classification with a simple prompt

We will first evaluate the accuracy of the LLM on a simple prompt that does not use any advanced prompt engineering techniques.

In [9]:
def create_simple_prompt(ticket):
    return f"""classify a customer support ticket into one of the following categories:
<categories>
{categories_str}
</categories>

Here is the customer support ticket:    
<ticket>{ticket}</ticket>

Respond using this format:
<category>The category label you chose goes here</category>"""    

In [11]:
simple_responses = classify_tickets(
    tickets=test_tickets, 
    prompt_generator=create_simple_prompt
)

In [12]:
accuracy = evaluate_accuracy(simple_responses, test_labels)
print(f"Initial Prompt Accuracy: {accuracy}%")

Initial Prompt Accuracy: 60.29%
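
A single accuracy number does not show which tickets were misclassified. The sketch below collects the wrong predictions so they can be read side by side with the true labels. `collect_errors` is a hypothetical helper, and the toy data is invented for illustration.

```python
def collect_errors(tickets, predicted, actual):
    """Return (ticket, predicted, actual) triples for every wrong prediction."""
    return [(t, p, a) for t, p, a in zip(tickets, predicted, actual) if p != a]


# Toy data for illustration only; it is not taken from the dataset.
toy_tickets = ["My invoice is wrong", "How do I file a claim?"]
toy_predicted = ["Billing", "Billing"]
toy_actual = ["Billing", "Claims"]

for ticket, predicted_label, actual_label in collect_errors(toy_tickets, toy_predicted, toy_actual):
    print(f"{ticket!r}: predicted {predicted_label}, expected {actual_label}")
```

With the variables from this notebook, the call would be `collect_errors(test_tickets, simple_responses, test_labels)`.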


### Classification with an improved prompt

This prompt builds on the simple prompt and aims to improve classification accuracy by applying two techniques:
- chain-of-thought: makes the LLM reflect on its reasoning before providing a response
- few-shot learning: provides examples within the context of the prompt

In [15]:
# Retrieve one example from each of the k most common labels
def retrieve_few_shot_examples(df, k=5):
    # Count the frequency of each label
    label_counts = df['label'].value_counts()

    # Identify the k most common labels
    top_labels = label_counts.head(k).index

    # Retrieve a single row for each of these labels
    rows = df[df['label'].isin(top_labels)].groupby('label').first().reset_index()

    # Convert each row to the example string format required by the prompt
    example_strs = list()
    for _, row in rows.iterrows():
        example_strs.append(f"<example><ticket>{row['text']}</ticket><label>{row['label']}</label></example>")

    return '\n'.join(example_strs)

few_shot_examples = retrieve_few_shot_examples(training_examples)

def create_improved_prompt(ticket):
    return f"""classify a customer support ticket into one of the following categories:
<categories>
{categories_str}
</categories>

Here is the customer support ticket:    
<ticket>{ticket}</ticket>

Use the following examples to help you classify the query:
<examples>
{few_shot_examples}
</examples>

First you will think step-by-step about the problem in the scratchpad tag.
You should consider all the information provided and create a concrete argument for your classification.

Respond using this format:
<response>
  <scratchpad>Your thoughts and analysis go here</scratchpad>
  <category>The category label you chose goes here</category>
</response>"""    

In [16]:
improved_responses = classify_tickets(
    tickets=test_tickets, 
    prompt_generator=create_improved_prompt
)

In [17]:
accuracy = evaluate_accuracy(improved_responses, test_labels)
print(f"Improved Prompt Accuracy: {accuracy}%")

Improved Prompt Accuracy: 63.24%
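
To see where the two prompts differ, it can help to break accuracy down by true label. The following is a sketch under the same exact-match definition used by `evaluate_accuracy`. `accuracy_by_label` is a hypothetical helper, and the toy labels are invented for illustration.

```python
from collections import Counter


def accuracy_by_label(predicted, actual):
    """Return {label: percent of tickets with that true label that were predicted correctly}."""
    totals = Counter(actual)
    correct = Counter(a for p, a in zip(predicted, actual) if p == a)
    return {label: round(100 * correct[label] / totals[label], 2) for label in sorted(totals)}


# Toy labels for illustration only; they are not taken from the dataset.
toy_predicted = ["Billing", "Claims", "Billing"]
toy_actual = ["Billing", "Billing", "Claims"]
print(accuracy_by_label(toy_predicted, toy_actual))  # {'Billing': 50.0, 'Claims': 0.0}
```

Running it once with `simple_responses` and once with `improved_responses` against `test_labels` would show which categories account for the change in overall accuracy.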
