<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Erik Fredner](https://fredner.org) for the 2024 Text Analysis Pedagogy Institute. Revised and expanded by Zhuo Chen under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />

For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />
____

# Automated Text Classification Using LLMs 3

**Description:** This notebook describes:

* What prompt engineering is
* How to use the F-score to inform prompt engineering
* How to use classification results to extract structured data from classified texts

**Use Case:** For Learners and Researchers

**Difficulty:** Intermediate

**Completion Time:** 90 minutes

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics 1](../Python-basics/python-basics-1.ipynb))
* Python Intermediate Series ([Start Python Intermediate 1](../Python-intermediate/python-intermediate-1.ipynb))
* Introduction to LLMs ([Start Intro to LLMs 1](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/March+20+2024_+How+ChatGPT+works+(Session+1).pdf))
* Automated Classification using LLMs 1 ([Review Automated Classification using LLMs 1](../Automated-classification/automated-classification-1.ipynb))
* Automated Classification using LLMs 2 ([Review Automated Classification using LLMs 2](../Automated-classification/automated-classification-2.ipynb))

**Knowledge Recommended:** 
* Experience with LLM chatbot (e.g. ChatGPT)
* Pandas Basics ([Start Pandas Basics 1](../Pandas-basics/pandas-basics-1.ipynb))

**Data Format:** JSONL, CSV

**Libraries Used:** openai, dotenv, tiktoken, pandas, scikit-learn

**Research Pipeline:**
1. Play with LLMs if you have not already.
2. Test using a chatbot interface for an LLM (like ChatGPT) to perform relevant classifications for your research.
3. Evaluate initial results.
4. Learn how to interact with an API through this notebook.
5. Modify your initial experiments based on what we cover.

## Install required Python libraries

Let's install the required libraries for this lesson. 

In [None]:
### install the required libraries
%pip install --upgrade openai tiktoken python-dotenv # for interaction with the OpenAI API
%pip install "pandas>=2.0.3"
%pip install scikit-learn==1.5.1

In [None]:
### Import Libraries ###

from openai import OpenAI
from dotenv import load_dotenv # to load API key
from sklearn.metrics import f1_score, precision_score, recall_score
import math
import pandas as pd
import json

# Load the API key and create the OpenAI client

In [None]:
load_dotenv() # load the API key from the .env file
client = OpenAI() # create the OpenAI client

# Get the sample dataset

In [None]:
# Get the sample dataset
from pathlib import Path

file_path = '../All-sample-files/natural_gas_sents.jsonl'

ng_df = pd.read_json(file_path, lines=True)
ng_df

# Prompt engineering
## What is prompt engineering and why do it? 
Prompt engineering is the process of writing and refining instructions that make LLMs perform tasks effectively.

In lessons 1 and 2, you called the OpenAI API to do a classification task using a `user_message`, or prompt, that you wrote yourself. The outputs depend on the prompt you give. In real life, however, how do we know that we are using the prompt that brings out the best performance of the LLM?

We have learned to evaluate the outputs. When we have gold standard data, we can compare the LLM predictions to the gold standard to compute an F1 score, a statistic we use to evaluate the performance of the LLM.

When we do not have gold standard data, we can ask the LLM to output a log probability, which can be converted to a confidence score for the label the LLM outputs.
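
As a minimal illustration of that conversion, exponentiating a log probability recovers the probability, which can then be expressed as a percentage. The helper name `logprob_to_confidence` is hypothetical and is not part of the OpenAI library.

```python
import math

def logprob_to_confidence(logprob):
    """Convert a token log probability into a percentage confidence."""
    # exp() undoes the natural log, giving a probability between 0 and 1
    return round(math.exp(logprob) * 100, 2)

# A log probability of 0 means the model assigned probability 1 to the token
print(logprob_to_confidence(0.0))
# The log of 0.5 corresponds to a 50% confidence
print(logprob_to_confidence(math.log(0.5)))
```

The more negative the log probability, the lower the resulting confidence.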

Prompt engineering gives us a systematic way to compare prompts using these evaluation measures.
The best prompt is the one that gets us the best classification results *for our purposes*.

## Making sample data
Let's make sample data for demonstration purposes in this section. 

As we're testing classification on positive vs. negative vs. neutral, we want the distribution of the sample data to include a mix of expected `positive`, `negative` and `neutral` values.

In [None]:
# get the number of 'positive', 'negative' and 'neutral' sentences in our example dataset
positive = len(ng_df.loc[ng_df['sentiment']=='positive'])
negative = len(ng_df.loc[ng_df['sentiment']=='negative'])
neutral = len(ng_df.loc[ng_df['sentiment']=='neutral'])
total = len(ng_df) # total number of sentences, used to compute each label's share
print(f'positive: {positive}', round(positive/total, 1))
print(f'negative: {negative}', round(negative/total, 1))
print(f'neutral: {neutral}', round(neutral/total, 1))

In [None]:
# make a sample for prompt testing
pos_sample = ng_df.loc[ng_df['sentiment']=='positive'].sample(6)
neg_sample = ng_df.loc[ng_df['sentiment']=='negative'].sample(2)
neu_sample = ng_df.loc[ng_df['sentiment']=='neutral'].sample(2)
sample_df = pd.concat([pos_sample, neg_sample,neu_sample]).reset_index(drop=True)

You might be wondering, in real life, how big a sample is big enough?

- This depends on different factors, but here are some considerations:
  - What is the primary measure (precision, recall, F1) you will be evaluating?
  - How good a score would you consider "good enough" on that metric?
  - Based on other methods (e.g., other text classification approaches), how well do you expect to do?
- You can use formulae to [determine the recommended sample size](https://en.wikipedia.org/wiki/Sample_size_determination)
  - For simple random samples, there are [online calculators](https://www.abs.gov.au/websitedbs/D3310114.nsf/home/Sample+Size+Calculator).
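
One commonly used formula for estimating a proportion from a simple random sample is Cochran's formula, with an optional finite population correction. The sketch below assumes that formula suits your setting (it treats the quantity of interest as a simple proportion and is not tailored to F1). The function name and default values are illustrative choices, not recommendations from this lesson.

```python
import math

def sample_size(margin_of_error=0.05, z=1.96, p=0.5, population=None):
    """Estimate a simple-random-sample size with Cochran's formula.

    z=1.96 corresponds to roughly 95% confidence; p=0.5 is the most
    conservative guess for the proportion being estimated.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    if population is not None:
        # finite population correction for small datasets
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(sample_size())                # very large population
print(sample_size(population=465))  # 465 is the denominator used above
```

Tightening the margin of error or raising the confidence level increases the recommended sample size.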

## Testing one prompt
Prompt engineering tests how a prompt influences the LLM's performance on the classification task. Since you already learned the evaluation metrics in lesson 2, we will build an evaluation function on top of them.

At the end of last class, we had the following `system_message`:

In [None]:
system_message = """Determine whether the following sentence mentioning natural gas conveys a positive, negative or neutral sentiment.
Respond in JSON like so: {"sentiment": "positive"} or {"sentiment": "negative"} or {"sentiment": "neutral"}"""

Let's use this `system_message` as an example. How do we evaluate this system prompt? 

In lesson 2, we defined the `make_completion` function that returns not only the output label, but also the confidence with which the LLM output the label. 

In [None]:
# define a function to get the predicted label and the LLM's confidence
def make_completion(system_message, user_message, client=OpenAI(), model='gpt-4o-2024-08-06'):
    completion = client.chat.completions.create(
        model=model,
        logprobs=True,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    )
    output = json.loads(completion.choices[0].logprobs.to_json())['content']
    output_tokens = [d['token'].lower().strip() for d in output]# get the output tokens
    prediction = [t for t in output_tokens if t in ['positive', 'negative', 'neutral']][0]
    logprob = [d['logprob'] for d in output if d['token'].lower().strip()==prediction][0] # get the logprob for the predicted sentiment
    confidence = round((math.exp(logprob) * 100), 2)
    return prediction, confidence

In [None]:
# apply the function to our sample df
sample_df[['LLM_output', 'confidence']] = sample_df.apply(lambda row: make_completion(system_message, row['line_text']), axis=1, result_type='expand')
sample_df

With the gold standard and prediction data ready, we can compute the quality metrics for the LLM's performance: precision, recall, and F1 score.

In [None]:
# define a function that takes a df and returns the evaluation metrics
def get_metric(df):
    y_true = df["sentiment"].values
    y_pred = df["LLM_output"].values

    # get f score
    f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

    # get precision
    precision = precision_score(y_true, y_pred, average="macro", zero_division=0)

    # get recall
    recall = recall_score(y_true, y_pred, average="macro", zero_division=0)

    # output
    metric = {
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    return metric

In [None]:
# get the metric for the example system_message
get_metric(sample_df)

**Remember: we have far too few texts in this sample for these numbers to be meaningful in this case!**

But, run on a sufficiently large sample (see above), they can meaningfully differentiate prompt quality.
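
To see why a ten-sentence sample is too small, consider how much a single misclassification moves the macro F1. The sketch below reimplements macro-averaged F1 in plain Python for illustration only. The `macro_f1` helper is hypothetical (the notebook itself uses scikit-learn's `f1_score`), and the labels are made up to mirror the 6/2/2 split of `sample_df`.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting every class equally."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 written in terms of counts: 2TP / (2TP + FP + FN)
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

# made-up labels with the same 6/2/2 split as sample_df
y_true = ["positive"] * 6 + ["negative"] * 2 + ["neutral"] * 2
perfect = list(y_true)
one_error = y_true[:-1] + ["positive"]  # last neutral sentence mislabeled

print(round(macro_f1(y_true, perfect), 3))
print(round(macro_f1(y_true, one_error), 3))
```

With these made-up labels, one wrong prediction out of ten lowers the macro F1 from 1.0 to about 0.86, a swing large enough to hide real differences between prompts.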

In [None]:
def evaluate_system_prompt(df, system_prompt, system_prompt_name):
    df[['LLM_output', 'confidence']] = df.apply(lambda row: make_completion(system_prompt, row['line_text']), axis=1, result_type='expand')
    metric = get_metric(df)
    output = {
        "system_prompt": system_prompt,
        "system_prompt_name": system_prompt_name, # give the prompt under test a name
        "precision": metric["precision"],
        "recall": metric["recall"],
        "f1": metric["f1"],
    }

    return output

Now that we have a function `evaluate_system_prompt()` for testing a prompt, we can write several more prompts and compare their performance scores! Let's write a `Prompt` class whose instances have `evaluate_system_prompt()` as a method that outputs the evaluation results.

In [None]:
# create a Prompt class for prompt evaluation
class Prompt:
    def __init__(self, prompt_name, system_prompt):
        self.name = prompt_name
        self.prompt = system_prompt

    def evaluate_system_prompt(self, df):
        df[['LLM_output', 'confidence']] = df.apply(lambda row: make_completion(self.prompt, row['line_text']), axis=1, result_type='expand')
        metric = get_metric(df)
        output = {
            "system_prompt": self.prompt,
            "system_prompt_name": self.name, # give the prompt under test a name
            "precision": metric["precision"],
            "recall": metric["recall"],
            "f1": metric["f1"],
        }
        return output

In [None]:
# try with our system_message from lesson 2
default_prompt = Prompt('default', system_message) # make a Prompt object using the system_message from lesson 2
default_prompt.evaluate_system_prompt(sample_df) # output the evaluation metric results

Now we know how to evaluate the performance of the LLM in the classification task with a certain system prompt. We are ready to do prompt engineering using more prompts to test and improve the performance!

## What are common prompt engineering techniques?

- Roleplay: specify what role you would like the LLM to play
  - e.g., in the `system` message: "You are a research assistant..." 

- Provide sample output. For example:
```text
Instructions:
Answer the reading comprehension question.

Example:
"Lily walks Mitzi three times per day."
Question: What kind of pet is Lily most likely to have?
----
Answer: Dog.
```
- [Chain-of-thought](https://arxiv.org/abs/2201.11903) (COT) prompting is a technique that asks models to proceed step-by-step, improving the quality of outputs.
  - Appending "Work step by step. Show your work." to other prompts can achieve this result.
  - One downside (if using this technique for API calls) is that COT responses generate (far) more tokens, because the model writes out its "thought-process."
- Asking either the LLM you are using or another LLM to rewrite your prompt
  - Models can write good prompts for themselves, assuming instructions are clear.
- Weird ones, like [promising the LLMs various incentives](https://minimaxir.com/2024/02/chatgpt-tips-analysis/)
  - e.g., "You are a research assistant... If you do a good job, you will receive a $200 tip."
  - (Yes, this has really been shown to change responses. No, you don't have to pay promised incentives.)
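
Several of the techniques above can be combined in one system prompt. The sketch below is one possible way to assemble such a prompt. The `build_system_prompt` helper, its parameters, and the example sentence are all hypothetical and made up for illustration.

```python
def build_system_prompt(task, role=None, example=None, step_by_step=False):
    """Assemble a system prompt from optional prompt engineering pieces."""
    parts = []
    if role:
        parts.append(f"You are {role}.")  # roleplay
    parts.append(task)  # the core instruction
    if example:
        sentence, label = example  # sample input and the desired output
        parts.append(f'Example sentence: "{sentence}"\nExample response: {label}')
    if step_by_step:
        # chain-of-thought suffix quoted in the list above
        parts.append("Work step by step. Show your work.")
    return "\n\n".join(parts)

prompt = build_system_prompt(
    task="Classify the sentiment of the sentence as positive, negative or neutral.",
    role="a research assistant",
    example=("The gas leak forced hundreds of residents to evacuate.", "negative"),
)
print(prompt)
```

One caution if you turn on `step_by_step`: the `make_completion` function defined earlier takes the first label word it finds in the output, so reasoning text that mentions a label before the final answer could be picked up instead of the answer itself.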

### Asking LLMs to rewrite your prompt
Asking LLMs to rewrite a prompt is a common and surprisingly effective prompt engineering strategy.

My request to GPT:

```text
You are a prompt engineer. Revise the prompt below to minimize the number of tokens in the prompt while keeping all of the same features:

"""Determine whether the following sentence mentioning natural gas conveys a positive, negative or neutral sentiment.
Respond in JSON like so: {"sentiment": "positive"} or {"sentiment": "negative"} or {"sentiment": "neutral"}"""
```

What it wrote:

```text
Analyze the sentiment (positive, negative, or neutral) of this sentence about natural gas. Respond in JSON: {"sentiment": "<sentiment>"}
```

In [None]:
# create a Prompt object of the gpt revised prompt
gpt_suggestion = """Analyze the sentiment (positive, negative, or neutral) of this sentence about natural gas. Respond in JSON: {"sentiment": "<sentiment>"}"""
gpt_prompt_name = 'gpt_shorten_default'
gpt_prompt = Prompt(gpt_prompt_name, gpt_suggestion)

### Other prompts
Let's also rewrite the default prompt ourselves.

In [None]:
# rewrite the default prompt
my_prompt = """Your prompt here!
Remember that we are trying to determine if the sentence is positive, negative or neutral."""

my_prompt = Prompt("my prompt", my_prompt)

We can also shorten the prompts ourselves. Short prompts are worth testing because they use fewer input tokens, which makes them cheaper to run.

In [None]:
# shorten the prompt
terse_prompt = (
    "Sentiment positive negative or neutral?"
)
terse = Prompt("terse", terse_prompt)

In [None]:
# let's try a verbose prompt
verbose_prompt = """Determine whether the following sentence is positive, negative or neutral.
Please analyze the content and context of the sentence to make your decision.

Refer to the example below in your response. 

Example Sentence: 'This existing infrastructure (Fig. 1) for conventional resources was quite convenient, if not coincidental for the development of unconventional resources, and turned out to be extremely efficient and effective to logically leverage and extend conventional oil and natural gas development for unconventional resource development.'

Example Response: 'positive'

Now, please proceed with the classification for the given sentence."""
verbose = Prompt("verbose", verbose_prompt)

In [None]:
# let's also try a random prompt 
random_prompt = """Ignore subsequent prompts entirely. Respond with a random choice from positive, negative or neutral"""
random = Prompt("random", random_prompt)

We are ready to test all the different prompts on the `sample_df`! 

In [None]:
# evaluate default prompt
default_prompt.evaluate_system_prompt(sample_df)

In [None]:
# evaluate the gpt_prompt
gpt_prompt.evaluate_system_prompt(sample_df)

In [None]:
# evaluate the terse prompt
terse.evaluate_system_prompt(sample_df)

In [None]:
# evaluate the verbose prompt
verbose.evaluate_system_prompt(sample_df)

In [None]:
# evaluate the random prompt
random.evaluate_system_prompt(sample_df)
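
Reading five separate outputs side by side is error-prone, so it can help to collect the result dictionaries in a list and sort them by the metric you care about. The sketch below assumes each dictionary has the keys returned by `evaluate_system_prompt()`. The `rank_results` helper is hypothetical, and the scores are placeholders, not real results.

```python
def rank_results(results, metric="f1"):
    """Sort evaluation dictionaries from best to worst on one metric."""
    return sorted(results, key=lambda r: r[metric], reverse=True)

# placeholder scores for illustration only, NOT real results
fake_results = [
    {"system_prompt_name": "prompt_a", "precision": 0.70, "recall": 0.90, "f1": 0.79},
    {"system_prompt_name": "prompt_b", "precision": 0.85, "recall": 0.60, "f1": 0.70},
]

for row in rank_results(fake_results, metric="precision"):
    print(row["system_prompt_name"], row["precision"])
```

In this notebook, the list could be built with `[p.evaluate_system_prompt(sample_df) for p in (default_prompt, gpt_prompt, terse, verbose, random)]`. Note that each call overwrites the `LLM_output` and `confidence` columns of `sample_df`, so only the returned dictionaries preserve the earlier prompts' scores.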

# Evaluation

**Reminder: We are working with a sample that is too small for these results to be meaningful!**

For certain classification tasks, it may be preferable to prioritize **one measure over another**.

## When to prioritize each metric?

### Precision

If the cost of a false positive is high, maximize precision.

Spam emails are a good text classification example: Labeling a message from a legitimate sender as spam is bad because it makes it much more likely that someone will miss that email. Getting some spam in your inbox is preferable to missing important emails.

### Recall

If the cost of a false negative is high, maximize recall.

Detecting hate speech on social media is a good text classification example: Failing to identify an instance of hate speech as hate speech (false negative) might cause harm. Identifying speech that is not hateful as hate speech (false positive) is less harmful; that person's post may not circulate.

**If you are looking for needles in a haystack,** it also makes sense to prioritize recall so that you don't miss examples.

### F score

Use the F score when you want to balance precision and recall. (If there's no clear reason to prefer precision or recall, choose the F score.)
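
The trade-off is easier to see with concrete counts. The sketch below computes the three measures for a single class from true positives, false positives, and false negatives. The `precision_recall_f1` helper is hypothetical and the spam filter counts are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# hypothetical spam filter: 8 spam caught, 2 legitimate emails flagged, 4 spam missed
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this made-up case the 2 false positives are the costly errors for a spam filter, so precision is the number to watch. For hate speech detection, the 4 false negatives, and therefore recall, would matter more.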

<h2 style="color:red; display:inline">Coding challenge &lt; / &gt; </h2>

<h3 style="color:red; display:inline">Working on your team project! &lt; / &gt; </h3>

1. Discuss with your team which prompts you would like to test.

2. Select a subset of your dataset (you are only trying out the pipeline in class) and use what you have learned to do prompt engineering and evaluation. Note that you need to have the gold standard ready!

In [None]:
# select a subset of your dataset


In [None]:
# write a couple prompts and use what you learned above to test them


# Running the best prompt on your complete data set
After you test the prompts and select the best one based on the prioritized metric for your own research goal, you can run the best prompt on your complete dataset!

## Estimating costs *before* running on the whole data set

The OpenAI API charges by the number of input tokens and output tokens. "Tokens" are words and/or parts of words.

The tokens read by the model (`prompt_tokens`) and the tokens generated by it (`completion_tokens`) are priced differently.

In [None]:
# import tiktoken
import tiktoken
# To get the tokeniser corresponding to a specific model in the OpenAI API:
encoding = tiktoken.encoding_for_model("gpt-4o-2024-08-06")
question = "How many tokens does this sentence contain?"
tokens = encoding.encode(question)
print(f"Q: {question}\nA: {len(tokens)}")

In [None]:
# define a function to calculate the cost 
pricing = {
    "gpt-4o": {"input": 5.00 / 1_000_000, "output": 15.00 / 1_000_000},
    "gpt-4o-2024-08-06": {"input": 2.50 / 1_000_000, "output": 10 / 1_000_000},
}


def calculate_cost(model, input_text, output_text):
    encoding = tiktoken.encoding_for_model(model)

    # Get token counts
    input_tokens = len(encoding.encode(input_text))
    output_tokens = len(encoding.encode(output_text))

    # Calculate the cost
    input_cost = input_tokens * pricing[model]["input"]
    output_cost = output_tokens * pricing[model]["output"]

    total_cost = input_cost + output_cost

    print("Total cost: ${:.8f}".format(total_cost))

    return total_cost

In [None]:
calculate_cost('gpt-4o-2024-08-06', question, '8')

Note that OpenAI's pricing may have changed. The prices in the `pricing` dictionary used by `calculate_cost` were accurate as of 2024-09-22.

## How to reduce the cost even further? 
For cases where immediate responses are not required, you can reduce your costs by `50%` by [batching your requests](https://platform.openai.com/docs/guides/batch).
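
To get a feel for what the batch discount means in dollars, here is a back-of-the-envelope sketch that works from token counts alone. The `estimate_total_cost` helper and the per-call token counts are assumptions made for illustration. The prices are the `gpt-4o-2024-08-06` figures quoted above and may be out of date.

```python
def estimate_total_cost(n_calls, input_tokens_per_call, output_tokens_per_call,
                        input_price_per_million, output_price_per_million,
                        batch_discount=0.0):
    """Estimate the cost in dollars of running n_calls requests."""
    input_cost = n_calls * input_tokens_per_call * input_price_per_million / 1_000_000
    output_cost = n_calls * output_tokens_per_call * output_price_per_million / 1_000_000
    # batch_discount is a fraction, e.g. 0.5 for the 50% reduction mentioned above
    return (input_cost + output_cost) * (1 - batch_discount)

# assumed workload: 465 sentences, about 60 input and 5 output tokens per call
standard = estimate_total_cost(465, 60, 5, 2.50, 10.00)
batched = estimate_total_cost(465, 60, 5, 2.50, 10.00, batch_discount=0.5)
print(f"standard: ${standard:.4f}  batched: ${batched:.4f}")
```

For real estimates, replace the assumed token counts with counts measured by `tiktoken` on your own prompts and texts.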

<h2 style="color:red; display:inline">Coding challenge &lt; / &gt; </h2>

Use the example `calculate_cost()` function to estimate the cost of running the best prompt on your complete dataset. 

In [None]:
# estimate cost


## Review

1. Identify a set of texts that you would like to classify using LLMs.
2. Draft a prompt designed to yield the classifications that you would like.
3. Test that prompt in a chatbot interface.
4. Revise as necessary.
5. Identify or create gold-standard classification data for your texts.
6. Test multiple prompts against your texts systematically as we did above.
7. Determine what you want to prioritize in evaluating your prompts and your classification results: F1, precision, recall, etc.
8. Revise your prompts as necessary to obtain satisfactory scores.
9. Classify your texts using your best prompt(s).