<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Erik Fredner](https://fredner.org) for the 2024 Text Analysis Pedagogy Institute. Revised and expanded by Zhuo Chen under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />

For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />
____

# Automated Text Classification Using LLMs 2

**Description:** This notebook describes:

* What the F-score is
* How to create gold standard data for the evaluation
* How to evaluate the performance of the LLM classification outputs using F-score

**Use Case:** For Learners and Researchers

**Difficulty:** Intermediate

**Completion Time:** 90 minutes

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics 1](../Python-basics/python-basics-1.ipynb))
* Python Intermediate Series ([Start Python Intermediate 1](../Python-intermediate/python-intermediate-1.ipynb))
* Introduction to LLMs ([Start Intro to LLMs 1](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/March+20+2024_+How+ChatGPT+works+(Session+1).pdf))
* Automated Classification using LLMs 1 ([Review Automated Classification using LLMs 1](../Automated-classification/automated-classification-1.ipynb))

**Knowledge Recommended:** Experience with an LLM chatbot (e.g. ChatGPT)

**Data Format:** JSON

**Libraries Used:** openai, python-dotenv, tiktoken, pandas, scikit-learn, json

**Research Pipeline:**
1. Play with LLMs if you have not already.
2. Test using a chatbot interface for an LLM (like ChatGPT) to perform relevant classifications for your research.
3. Evaluate initial results.
4. Learn how to interact with an API through this notebook.
5. Modify your initial experiments based on what we cover.

## Install required Python libraries

Let's install the required libraries for this lesson. 

In [None]:
### install the required libraries
%pip install --upgrade openai tiktoken python-dotenv # for interaction with the OpenAI API
%pip install pandas==2.2.3
%pip install numexpr==2.10.1
%pip install bottleneck==1.5.1
%pip install scikit-learn==1.5.1

In [None]:
### Import Libraries ###

from openai import OpenAI 
import pandas as pd
import random
from dotenv import load_dotenv # to load API key
import json 
import numpy as np

## Download the sample data
Let's download the natural gas sentiment data. 

In [None]:
# download the sample dataset
from pathlib import Path
import pandas as pd

file_path = '../All-sample-files/natural_gas_sents.jsonl'

ng_df = pd.read_json(file_path, lines=True)
ng_df

Now that we know how to use the OpenAI API to automate sentiment analysis, how do we evaluate the performance of the LLM?

# How do you evaluate an LLM's classifications?

- How well do the LLM's judgments align with the gold standard?

## Using the gold standard
We will compare the LLM's output classifications with the gold standard. Specifically, we will use a statistic called the **F-score** to evaluate the performance of the LLM.

### Review of lesson 1

Let's first briefly review what you learned in lesson 1 about using the OpenAI API to classify texts.

In [None]:
# set the system message
system_message = """Determine whether the following sentence mentioning natural gas conveys a positive, negative or neutral sentiment.
Respond in JSON like so: {"sentiment": "positive"} or {"sentiment": "negative"} or {"sentiment": "neutral"}"""

In [None]:
load_dotenv() # load the API key
client = OpenAI() # load OpenAI

In [None]:
# write a chat completion function
def make_completion(user_message, client=OpenAI(), model='gpt-4o-2024-08-06', print_message=False):
    completion = client.chat.completions.create(
        model=model,
        
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    )
    if print_message:
        print(f"System message: {system_message}\n{'-' * 80}") # print system message
        print(f"User message: {user_message}\n{'-' * 80}") # print user message
        print(
            f"Assistant response: {completion.choices[0].message.content}\n{'*' * 80}" # get the LLM response
        )

    return completion.choices[0].message.content

In [None]:
# use the first row in the data to test
test = make_completion(ng_df.iloc[0]['line_text'], client=OpenAI(), model='gpt-4o-2024-08-06', print_message=True)

Now you know how to use the OpenAI API to classify the sentence in each row into one of three categories: positive, negative or neutral. Let's create a column to store the LLM output to facilitate the evaluation in the next section. For demonstration purposes, we will select only three rows from the dataframe.

In [None]:
# select 3 rows from the dataframe (one per school)
columbia = ng_df.loc[ng_df['school']=='Columbia University_Center on Global Energy Policy'].sample(1)
mit = ng_df.loc[ng_df['school']=='Massachusetts Institute of Technology_MIT Energy Initiative'].sample(1)
stanford = ng_df.loc[ng_df['school']=='Stanford_Natural Gas Initiative'].sample(1)
sample_df = pd.concat([columbia, mit, stanford]).reset_index(drop=True)

In [None]:
# create a column storing the LLM output
# note that only 3 rows are selected for demonstration
sample_df['LLM_output'] = sample_df['line_text'].apply(make_completion)

In [None]:
# take a look at the resulting sample df
sample_df

In [None]:
# get the sentiment string
import json # import json to read the output string from the OpenAI API

def get_sentiment(output):
    """load the output string from OpenAI API and get the sentiment only"""
    sentiment = json.loads(output)['sentiment']
    return sentiment

sample_df['LLM_output'] = sample_df['LLM_output'].apply(get_sentiment) # update the LLM_output column with the sentiments
sample_df

In [None]:
## Comparing LLM classifications to human classifications
sample_df['LLM_output'] = sample_df['LLM_output'].str.lower()
sample_df['LLM_Gold_agree'] = sample_df.apply(lambda row: row['sentiment'] == row['LLM_output'], axis=1) 
sample_df

There is a widely used statistic in machine learning for evaluating the performance of a language model on a classification task: the F-score. 

## The F-Score
Two important concepts related to the F-score are: **precision** and **recall**.

### F-score in binary classification
In binary classification, we can consider the output as either True or False. The F-score (aka F1 score) is calculated from the counts of True Positives (TP), False Positives (FP), and False Negatives (FN), using the formulas below.

#### Precision
$Precision = \frac{TP}{TP + FP}$

**Precision** measures how many of the items the model identified as `True` were really `True` according to the gold standard data (i.e. true positives). That is, it answers the question: among all the items classified as True, how many are actually True? 

#### Recall
$Recall = \frac{TP}{TP + FN}$

**Recall** measures how many of the True values in the gold standard are correctly labeled as True by the model. That is, it answers the question:  among all the items that are actually True, how many are identified as True by the model? 
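
To make the two definitions concrete, here is a minimal sketch that counts TP, FP and FN by hand and then applies the formulas above. The labels are made up for illustration (they are not drawn from the natural gas data), and the variable names `gold` and `pred` are hypothetical.

```python
# Hypothetical gold-standard labels and model predictions (1 = True, 0 = False)
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 1, 0, 0, 0]

# Count the three outcomes that precision and recall depend on
tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # predicted True, actually True
fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # predicted True, actually False
fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # predicted False, actually True

toy_precision = tp / (tp + fp)  # share of predicted Trues that are correct
toy_recall = tp / (tp + fn)     # share of actual Trues that were found

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"precision={toy_precision:.3f}, recall={toy_recall:.3f}")
```

Note that True Negatives (items correctly labeled False) do not appear in either formula, which is why they are not counted here.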

#### F-score (aka F1)

The F-score is the **harmonic mean** of precision and recall.

$F_{1}= \frac{2}{\frac{1}{Precision}+\frac{1}{Recall}}= 2 \times \frac{Precision \times Recall}{Precision + Recall}$

The intuition behind the harmonic mean is that a model has to perform reasonably well in both precision and recall to get a high F1-score.
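
A small sketch can illustrate why the harmonic mean is used rather than the ordinary (arithmetic) mean. The precision and recall values below are hypothetical, and the helper name `f1_from` is made up for this illustration.

```python
def f1_from(p, r):
    """Harmonic mean of precision p and recall r (taken to be 0 when both are 0)."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)

# Hypothetical model: everything it labels True is correct,
# but it finds only 10% of the items that are actually True
p, r = 1.0, 0.1

arithmetic_mean = (p + r) / 2
harmonic_mean = f1_from(p, r)

print(f"arithmetic mean:    {arithmetic_mean:.3f}")
print(f"harmonic mean (F1): {harmonic_mean:.3f}")
```

The arithmetic mean hides the poor recall, while the harmonic mean stays close to the lower of the two values.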

#### How to interpret F-score

* F1-Score = 1:

This means both precision and recall are perfect (100%). The model correctly identified all positive instances and made no false predictions.

* F1-Score = 0:

This means either precision or recall (or both) is 0. The model is performing poorly, either not identifying any positives or making only incorrect predictions.

* Intermediate Values (between 0 and 1):

The F1-score balances between precision and recall, so an intermediate value indicates a trade-off:

If the score is closer to 1, the model has reasonably good precision and recall.

If the score is closer to 0, either precision, recall, or both are poor.

<h4 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h4>

Given the toy example below, can you calculate the precision, recall and F1-score? 

In [None]:
# a toy example for F-score
y_true = [0, 1, 0, 0, 1, 1] # gold standard
y_pred = [0, 1, 1, 0, 1, 1] # model predictions

In [None]:
# calculate precision


In [None]:
# calculate recall


In [None]:
# calculate F1-score


#### Using Sklearn to calculate F-score

[Sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) provides a function `f1_score` so we don't need to calculate the F1-score by hand!

In [None]:
# import the f1_score function from sklearn
from sklearn.metrics import f1_score # to evaluate the performance of LLM

In [None]:
# a toy example for F-score
y_true = [0, 1, 0, 0, 1, 1] # gold standard
y_pred = [0, 1, 1, 0, 1, 1] # model predictions

In [None]:
# use the f1_score function to calculate the F1-score
f1_score(y_true, y_pred)

### F-score in multi-class classification
In our example project, we are using the OpenAI API to classify sentences containing 'natural gas' into one of three categories: positive, negative or neutral. This is a multi-class classification task. The F-score used to evaluate a language model on a multi-class classification task comes in three variations, each with different use cases.

#### Macro F-score

**Macro F1** calculates the F1-score independently for each class, and then takes the average of the F1-scores across all classes.
It treats all classes equally, regardless of their frequency. 

An example use case: 

Imagine you're building a model to classify news articles into one of several categories: Politics, Sports, Technology, and Health. The dataset has different numbers of articles in each category, but the stakeholders have expressed that all categories are equally important to them, regardless of how frequently they appear in the dataset. They want the model to perform equally well across all categories.

$Macro~F1 = \frac{1}{N}\sum_{i=1}^{N}F1_{i}$

Suppose your dataset contains 500 news articles each in Politics, Sports and Technology but only 100 news articles in Health. If the performance of the language model on every category is equally important to you, then you can use the **Macro F1** to evaluate the model.

Suppose the four per-class F1-scores are 0.924, 0.824, 0.674 and 0.646. Then:

$Macro~F1 = \frac{0.924 + 0.824 + 0.674 + 0.646}{4} = 0.767$
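
The sketch below shows one way to compute Macro F1 from two lists of labels in plain Python, by treating each class in turn as the positive class. The labels are hypothetical and the helper `per_class_f1` is not part of any library; it uses the equivalent form $F1 = \frac{2TP}{2TP + FP + FN}$.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label in turn as the positive class."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Hypothetical, deliberately imbalanced labels
toy_true = ["neutral", "neutral", "neutral", "neutral", "positive", "positive", "negative"]
toy_pred = ["neutral", "neutral", "neutral", "positive", "positive", "negative", "negative"]

class_scores = per_class_f1(toy_true, toy_pred)
macro_f1 = sum(class_scores.values()) / len(class_scores)  # unweighted mean over classes

for label, score in class_scores.items():
    print(f"{label}: {score:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
```

Each class contributes one third of the final score here, even though `neutral` has four gold-standard items and `negative` has only one. On the same lists, `f1_score(toy_true, toy_pred, average='macro')` from sklearn should give the same value.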

#### Micro F-score
**Micro F1** is calculated by summing up the **True Positives (TP)**, **False Positives (FP)**, and **False Negatives (FN)** across all classes and then computing **precision** and **recall** from the aggregated values.

$Micro~Precision = \frac{\sum TP}{\sum TP + \sum FP}$

$Micro~Recall = \frac{\sum TP}{\sum TP + \sum FN}$

$Micro~F1 = 2 * \frac{Micro~Precision~*~Micro~Recall}{Micro~Precision~+~Micro~Recall }$

This means that every individual instance, no matter which class it comes from, is treated equally when computing precision and recall.

An example use case:

Suppose you're building a model to classify medical images into multiple categories such as **Benign Tumor**, **Malignant Tumor**, **Inflammation**, and **No Abnormality**. The dataset is heavily imbalanced: most images are classified as **No Abnormality**, while the other categories (e.g., **Malignant Tumor**) are rare but extremely important.

The stakeholders are more interested in correctly classifying all individual cases (especially reducing false positives and false negatives) rather than evaluating each class independently. Therefore, you want the model to focus on the overall performance across all instances, not the performance per class. 

$Precision = \frac{80+5+25+800}{80+5+25+800+20+15+5+50}=0.91$

$Recall=\frac{80+5+25+800}{80+5+25+800+10+5+10+20}=0.953$ 

$Micro~F1= 2 * \frac{0.91 * 0.953}{0.91 + 0.953} = 0.931$
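
As a check on the arithmetic, the following sketch recomputes the three values from the per-class counts used in the formulas above. The list names are hypothetical, and the counts are kept in the order they appear because the example does not say which count belongs to which class.

```python
# Per-class counts copied from the worked example above
tp_counts = [80, 5, 25, 800]
fp_counts = [20, 15, 5, 50]
fn_counts = [10, 5, 10, 20]

# Aggregate across classes first, then compute precision and recall once
micro_precision = sum(tp_counts) / (sum(tp_counts) + sum(fp_counts))
micro_recall = sum(tp_counts) / (sum(tp_counts) + sum(fn_counts))
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(round(micro_precision, 3), round(micro_recall, 3), round(micro_f1, 3))
```

Because the counts are pooled before dividing, the class with 800 true positives dominates the result.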

#### Weighted F-score
**Weighted F1** is calculated by taking the F1-score of each class and weighting it by the proportion of the true instances that belong to that class. The number of true instances in a class is called the **support** of that class.

$Weighted~F1=\sum_{i=1}^{N}(w_{i} \times F1_{i})$

Here $w_{i}$ is the support of class $i$ divided by the total number of instances.

An example use case: 
Suppose you are building a classification model to detect different types of product defects on an assembly line: cracks, dents, misalignment and discoloration. The dataset is imbalanced because some types of defects are much more common than others. The stakeholders are interested in identifying as many product defects as possible across all classes. Therefore, you would not want poor performance on the rarer classes to penalize the overall score as heavily as it would under Macro F1.

* Cracks

$Precision = \frac{60}{60+5}\approx 0.9231$

$Recall = \frac{60}{60+10} \approx 0.8571$

$F1 = 2 * \frac{0.9231 * 0.8571}{0.9231 + 0.8571} = 0.8889$

$weight = \frac{70}{100}=0.7$

* Dents 

$Precision = \frac{15}{15 + 7} \approx 0.6818$

$Recall = \frac{15}{15 + 5} = 0.75$

$F1 = 2 * \frac{0.6818 * 0.75}{0.6818 + 0.75} \approx 0.7143$

$weight = \frac{20}{100} = 0.2$

* Misalignment

$Precision = \frac{4}{4+4}=0.5$

$Recall = \frac{4}{4+1}=0.8$

$F1 = 2 * \frac{0.5 * 0.8}{0.5 + 0.8} \approx 0.6154$

$weight = \frac{5}{100}=0.05$

* Discoloration

$Precision = \frac{2}{2+2}=0.5$

$Recall = \frac{2}{2+1}\approx 0.6667$

$F1 = 2 * \frac{0.5 * 0.6667}{0.5 + 0.6667} \approx 0.5714$

$weight = \frac{5}{100}=0.05$

$Weighted~F1 = (0.8889\times 0.7) + (0.7143 \times 0.2) + (0.6154 \times 0.05) + (0.5714 \times 0.05) \approx 0.8244$
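
As a check on the arithmetic, the sketch below recomputes the Weighted F1 from the per-class counts in this example. It assumes weights of 0.7, 0.2, 0.05 and 0.05, which are the values that reproduce the total of 0.8244; the dictionary names and the helper `f1_from_counts` are hypothetical.

```python
# (TP, FP, FN) per defect type, copied from the worked example above
defect_counts = {
    "cracks": (60, 5, 10),
    "dents": (15, 7, 5),
    "misalignment": (4, 4, 1),
    "discoloration": (2, 2, 1),
}

# Assumed class weights (support / 100); these reproduce the 0.8244 total
defect_weights = {
    "cracks": 0.70,
    "dents": 0.20,
    "misalignment": 0.05,
    "discoloration": 0.05,
}

def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

weighted_f1 = sum(
    defect_weights[name] * f1_from_counts(*defect_counts[name])
    for name in defect_counts
)
print(round(weighted_f1, 4))
```

The most frequent class, cracks, accounts for most of the final score, which is the behavior that distinguishes Weighted F1 from Macro F1.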

## Back to our example
In our example study, the authors are equally interested in all of the classes. Therefore, we will use the **Macro F1** score.

### Use sklearn to calculate the F1 score
Above, you saw how to use `f1_score` to evaluate the performance of a model on a toy example of binary classification. The `f1_score` function in `sklearn` has a parameter `average` whose value can be set to `'binary'` (the default), `'micro'`, `'macro'` or `'weighted'`, among others. The value you give to this parameter determines which type of F1 score is used to evaluate the performance of your model.

In [None]:
### for demonstration purposes, we will use the sample_df with 3 samples
sample_df

`f1_score` accepts string labels directly, so we can pass the sentiment labels as they are without converting them to numbers.

In [None]:
# calculate the macro F1 score of the model's performance 
y_true = sample_df['sentiment'].tolist()
y_pred = sample_df['LLM_output'].tolist()
f1_score(y_true,y_pred, average='macro')

<h2 style="color:red; display:inline">Coding challenge &lt; / &gt; </h2>

<h3 style="color:red; display:inline">Working on your team project! &lt; / &gt; </h3>

1. Discuss within your team which type of F1 score is most appropriate for your project and why. 

2. Select a subset of your dataset (you are only trying out the pipeline in class) and use what you have learned to do classification. You will also need to do the classification using your own expertise so that you have the gold standard data ready.  

In [None]:
# strip the gold standard column to pretend we don't have them!
sample_df = sample_df.drop(columns=['sentiment', 'LLM_output', 'LLM_Gold_agree']).copy()

In [None]:
# an example as to how to call a function to record the gold standard data
def get_gold_standard(df):
    ### it takes a df of the natural gas data and add a column storing the gold standard data
    gold_ans = []
    for i in range(len(df)):
        text = df.iloc[i].loc['line_text'] 
        user_input = input(f"Classify the text\n{'-' * 80}\n{text}\n{'-' * 80} :")
        gold_ans.append(user_input)
    df['gold standard'] = gold_ans
    return df

In [None]:
# run the function to get the gold standard data
get_gold_standard(sample_df)

In [None]:
# it's your turn! Try the above to apply to your own data! 


<h3 style="color:red; display:inline">Working on the example Jeopardy dataset! &lt; / &gt; </h3>

If you don't have a team project, you can try with the Jeopardy dataset we played with in Lesson 1. 

In [None]:
from pathlib import Path
import pandas as pd

file_path = '../All-sample-files/jeopardy_data.csv'

# Read in the data
jeopardy_df = pd.read_csv(file_path)
jeopardy_df

In [None]:
# select a subset from the df and do the classification (you are trying out the pipeline in class)
# then, tweak the get_gold_standard() function to apply to the df


<h3 style="color:red; display:inline">Calculate the F1-score to evaluate the performance of the LLM on your dataset &lt; / &gt; </h3>


In [None]:
# use sklearn to calculate the F1-score 


# What if I don't have the gold standard data? 

## Use LLM's confidence to evaluate the outputs
The OpenAI API can return a log probability for each output token. The log probability tells us how confident the LLM is in that token. The closer `logprob` is to 0, the more confident the model is in its response.

When you set `logprobs=True` in the request, the log probabilities are available on each choice of the returned `ChatCompletion` object (for example, `completion.choices[0].logprobs`). Therefore, it is quite easy to ask the LLM to output the label together with the `logprob` associated with the output. 

In [None]:
### how to get the logprob of an output
system_message = """Determine whether the following sentence mentioning natural gas conveys a positive, negative or neutral sentiment.
Respond in JSON like so: {"sentiment": "positive"} or {"sentiment": "negative"} or {"sentiment": "neutral"}"""

user_message = "Easy to assemble. Very sturdy."

completion = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    logprobs=True,  # new
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
)

In [None]:
# take a look at the output logprobs
import json
d = json.loads(completion.choices[0].logprobs.to_json())

In [None]:
# take a look at each token and its log probability
d["content"]

In [None]:
# check out the logprob value of the target token
logprob = -0.00091170

We can turn the `logprob` into confidence probabilities. 

In [None]:
# convert log prob to confidence prob
import math
confidence = round((math.exp(logprob) * 100), 2) # exp() turns a log probability back into a probability
print(f"The model was {confidence}% confident in this classification.")

In [None]:
### Apply the calculation to the whole dataset 

# get a small subset
example = ng_df.sample(10).copy()

def make_completion(user_message, client=OpenAI(), model='gpt-4o-2024-08-06', print_message=False):
    completion = client.chat.completions.create(
        model=model,
        logprobs=True,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    )
    output = json.loads(completion.choices[0].logprobs.to_json())['content']
    prediction = [d['token'] for d in output if d['token'] in ['positive', 'negative', 'neutral']][0] # get the predicted sentiment
    logprob = [d['logprob'] for d in output if d['token']==prediction][0] # get the logprob for the predicted sentiment
    confidence = round((math.exp(logprob) * 100), 2)
    return prediction, confidence

In [None]:
# create two new columns to store the response and confidence 
example[['LLM_output', 'confidence']] = example.apply(lambda row: make_completion(row['line_text']), axis=1, result_type='expand')

# take a look at the resulting df
example

<h2 style="color:red; display:inline">Coding challenge &lt; / &gt; </h2>

Take your dataframe, create a column for the confidence scores. 

Determine a confidence threshold. Any LLM outputs with a confidence score lower than the threshold are the ones you will want to review by hand to confirm or reject the model's judgment. 


In [None]:
### create a column for the confidence scores


In [None]:
### sort the df by the confidence scores


In [None]:
### extract all the data with an LLM confidence score lower than your threshold


___
## Lesson Complete
Congratulations! You have completed **Automated Classification with LLMs 2**. There is one more lesson in this series:

* *Automated Classification with LLMs 3*

### Start Next Lesson: [Automated Classification with LLMs 3](./Automated_Classification_3.ipynb)

### Coding Challenge! Solutions

There are often many ways to solve programming problems. Here are a few possible ways to solve the challenges, but there are certainly more!

In [None]:
### given the toy example, calculate the F-score

# a toy example for F-score
y_true = [0, 1, 0, 0, 1, 1] # gold standard
y_pred = [0, 1, 1, 0, 1, 1] # model predictions

precision = 3/4
recall = 3/3
f_score = round(2 * (precision * recall / (precision + recall)), 4)
f_score

In [None]:
### Use the Jeopardy dataset to do classification 
from pathlib import Path
import pandas as pd

file_path = '../All-sample-files/jeopardy_data.csv'

jeopardy_df = pd.read_csv(file_path)

# select a subset
jeopardy_subset = jeopardy_df.sample(3).copy()

### Write the messages
# write the system message
system_message = """Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so: {"Literature": true} or {"Literature": false}"""

# write the user message
jeopardy_subset['user_message'] = jeopardy_subset.apply(lambda row: f"""Category: {row['CATEGORY']}\nClue: {row['CLUE']}\nAnswer: {row['ANSWER']}""", axis=1)

# write a chat completion function
def make_completion(user_message, client=OpenAI(), model='gpt-4o-2024-08-06'):
    completion = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    )
    return completion.choices[0].message.content

#  classify the texts
jeopardy_subset['LLM_output'] = jeopardy_subset['user_message'].apply(make_completion)

#tweak the get_gold_standard() function to work on the jeopardy dataset
def get_gold_standard(df):
    ### it takes a df of the jeopardy data and add a column storing the gold standard data
    gold_ans = []
    for i in range(len(df)):
        text = df.iloc[i].loc['user_message'] 
        user_input = input(f"Classify the text\n{'-' * 80}\n{text}\n{'-' * 80} :")
        gold_ans.append(user_input)
    df['gold standard'] = gold_ans
    return df

# run the function to get the gold standard data
get_gold_standard(jeopardy_subset)