<a href="https://colab.research.google.com/github/laurencleek/text_classification_workshop/blob/main/OpenAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>LLM Text Classification Workshop</h1>
<i>This notebook provides basic code for applying Large Language Models (LLMs) to text classification tasks. It uses only the OpenAI API, which is one of several ways to access an LLM for text classification. It shows how to load data, pre-process it, and set up a basic classification pipeline.</i>

<a href="https://github.com/laurencleek/text_classification_workshop"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>

---

This notebook was developed by [Lauren Leek](https://laurenleek.eu/).

---



### Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>


If you are viewing this notebook on Google Colab, you need to **run** the following code block to install the dependencies.

Note that `%%capture` suppresses the cell output.

---


In [None]:
%%capture
!pip install datasets transformers sentence-transformers openai

# Task
Set up a text classification pipeline using the OpenAI API (`gpt-5-nano`). The pipeline should cover data loading, pre-processing, API usage, and basic validation. Use the data from the "validation_sample_limited.xlsx" file, which contains 100 sentences with 'speech_identifier', 'sentence', and 'label' columns for classifying sentences as 'descriptive' or 'normative'.

## Load data

Load the data from the "validation_sample_limited.xlsx" file into a pandas DataFrame.


**Reasoning**:
Import pandas and load the data from the Excel file into a DataFrame, then display the first rows and the column summary (names, non-null counts, dtypes) to verify that loading succeeded.



In [None]:
import os
print(os.listdir())

['.config', 'sample_data']


In [None]:
import pandas as pd

df = pd.read_excel("sample_data/validation_sample_limited.xlsx")

display(df.head())
display(df.info())  # info() prints its own summary and returns None, hence the trailing "None"

| | sentence_id | speech_identifier | sentence | label |
|---|---|---|---|---|
| 0 | 863941 | 2008-06-17_e.txt | Against this backdrop, it is essential to elim... | normative |
| 1 | 484125 | 2009-12-08_e.txt | Indeed, the domestic financial institutions ha... | descriptive |
| 2 | 536961 | 2007-03-30_e.txt | Lending to households, household debt and hous... | descriptive |
| 3 | 947025 | 2021-06-14_a.txt | Furthermore, monetary policy implementation in... | normative |
| 4 | 325809 | 2007-07-05_f.txt | Turnover in the UK and US foreign-exchange mar... | descriptive |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   sentence_id        1000 non-null   int64 
 1   speech_identifier  99 non-null     object
 2   sentence           99 non-null     object
 3   label              99 non-null     object
dtypes: int64(1), object(3)
memory usage: 31.4+ KB


None
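
Beyond eyeballing the head and the column summary, the same check can be made explicit. The sketch below is only an illustration: `check_schema`, `EXPECTED_COLUMNS`, and `EXPECTED_LABELS` are hypothetical names, the column and label names are taken from the task description, and the toy frame stands in for the real spreadsheet.

```python
import pandas as pd

# Assumption: column names and label values as stated in the task description.
EXPECTED_COLUMNS = {"speech_identifier", "sentence", "label"}
EXPECTED_LABELS = {"descriptive", "normative"}

def check_schema(frame: pd.DataFrame) -> list[str]:
    """Return a list of readable problems; an empty list means nothing was flagged."""
    problems = []
    missing = EXPECTED_COLUMNS - set(frame.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "label" in frame.columns:
        # Ignore empty cells here; missing values are handled during pre-processing.
        unexpected = set(frame["label"].dropna().unique()) - EXPECTED_LABELS
        if unexpected:
            problems.append(f"unexpected labels: {sorted(unexpected)}")
    return problems

# Toy frame standing in for the loaded spreadsheet.
toy = pd.DataFrame({
    "speech_identifier": ["a.txt", "b.txt"],
    "sentence": ["Prices rose.", "Banks should hold more capital."],
    "label": ["descriptive", "normative"],
})
print(check_schema(toy))
```

In the notebook, calling `check_schema(df)` right after `pd.read_excel` would flag a misspelled column or an unexpected label before any API calls are made.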



## Preprocessing Data

### Subtask:

Handle missing values and prepare the text data for classification.

**Reasoning**:

Check for missing values in the DataFrame and drop rows with missing values to ensure data quality for classification.

In [None]:
# Check for missing values
display(df.isnull().sum())

# Drop rows with missing values
df.dropna(inplace=True)

# Verify that missing values have been removed
display(df.isnull().sum())
display(df.info())

| | 0 |
|---|---|
| sentence_id | 0 |
| speech_identifier | 901 |
| sentence | 901 |
| label | 901 |
| predicted_label | 940 |


| | 0 |
|---|---|
| sentence_id | 0 |
| speech_identifier | 0 |
| sentence | 0 |
| label | 0 |
| predicted_label | 0 |


<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 0 to 59
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   sentence_id        60 non-null     int64 
 1   speech_identifier  60 non-null     object
 2   sentence           60 non-null     object
 3   label              60 non-null     object
 4   predicted_label    60 non-null     object
dtypes: int64(1), object(4)
memory usage: 2.8+ KB


None
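
Calling `dropna()` with no arguments removes a row if any column is empty, including columns that are irrelevant to the classification. A more targeted variant is sketched below, assuming only 'sentence' and 'label' are required; `clean_sentences` is a hypothetical helper, and the whitespace and lowercase normalisation are illustrative choices rather than steps from the original pipeline.

```python
import pandas as pd

def clean_sentences(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows lacking a sentence or label and tidy whitespace in the text."""
    # Only the columns needed for classification decide whether a row is kept.
    cleaned = frame.dropna(subset=["sentence", "label"]).copy()
    # Collapse runs of whitespace and trim both ends of each sentence.
    cleaned["sentence"] = (
        cleaned["sentence"].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
    )
    # Normalise label spelling so later comparisons with predictions are exact.
    cleaned["label"] = cleaned["label"].astype(str).str.strip().str.lower()
    # Remove rows whose sentence is empty after trimming.
    cleaned = cleaned[cleaned["sentence"] != ""]
    return cleaned.reset_index(drop=True)

# Toy frame: one usable row, one without a sentence, one without a label.
toy = pd.DataFrame({
    "sentence": ["  Prices   rose. ", None, "Banks should act."],
    "label": ["Descriptive", "normative", None],
    "predicted_label": [None, None, None],
})
print(clean_sentences(toy))
```

Because the helper returns a new frame instead of modifying in place, re-running the cell does not depend on the state left behind by earlier runs.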

## Classify Dataset (subset)



### Subtask:

Apply the classification function to a subset of the data.

**Reasoning**:

Define a Python function `classify_sentence` that takes a sentence as input and uses the OpenAI API to classify it as 'descriptive' or 'normative'.

In [None]:
import openai
import os
from getpass import getpass

# Fetch the OpenAI API key securely.
# If running in a secure environment like Colab, you can use the secrets manager.
# In other environments, be cautious about how you handle API keys.
try:
    # Assuming you have stored your API key in Colab's secrets manager as 'OPENAI_API_KEY'
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
except ImportError:
    # Fallback for environments without Colab secrets manager
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    if OPENAI_API_KEY is None:
        OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")

openai.api_key = OPENAI_API_KEY

def classify_sentence(sentence, model="gpt-5-nano"):
    """
    Classifies a sentence as 'descriptive' or 'normative' using the OpenAI API.

    Args:
        sentence (str): The sentence to classify.
        model (str): The OpenAI model to use for classification.

    Returns:
        str: The predicted label ('descriptive' or 'normative'), or None if classification fails.
    """
    try:
        client = openai.OpenAI(api_key=OPENAI_API_KEY)
        response = client.chat.completions.create(
          model=model,
          messages=[
                {"role": "system", "content": "You are a helpful assistant that classifies sentences as 'descriptive' or 'normative'."},
                {"role": "user", "content": f"Classify the following sentence as either 'descriptive' or 'normative': {sentence}"}
            ],
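          # gpt-5-nano is a reasoning model: the Chat Completions endpoint rejects
          # `max_tokens` and non-default `temperature` for it, so neither is set here.
          n=1,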
        )
        # Extract the classification from the response
        classification = response.choices[0].message.content.strip().lower()
        if 'descriptive' in classification:
            return 'descriptive'
        elif 'normative' in classification:
            return 'normative'
        else:
            # If the model doesn't return a clear 'descriptive' or 'normative'
            return None
    except Exception as e:
        print(f"Error classifying sentence: {e}")
        return None
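
The substring check in `classify_sentence` returns 'descriptive' whenever that word appears anywhere in the reply, even if the reply also mentions 'normative'. A stricter parser is sketched below; `parse_label` and `VALID_LABELS` are hypothetical names and not part of the notebook's original code.

```python
import re

VALID_LABELS = ("descriptive", "normative")

def parse_label(reply):
    """Map a raw model reply to one of VALID_LABELS, or None if it is unclear."""
    if not reply:
        return None
    # Split the reply into lowercase words so punctuation and quotes are ignored.
    words = set(re.findall(r"[a-z]+", reply.lower()))
    found = [label for label in VALID_LABELS if label in words]
    # Accept the reply only when exactly one of the two labels is mentioned.
    return found[0] if len(found) == 1 else None

print(parse_label("Normative."))
print(parse_label("Either descriptive or normative"))
```

Returning `None` for ambiguous replies keeps them visible as missing predictions instead of silently counting them as 'descriptive'.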

**Reasoning**:

Iterate through the first 5 rows of the DataFrame and use the `classify_sentence` function to get predictions for each sentence. Store these predictions in a new column.

In [None]:
# Apply the classification function to the first 5 sentences
df['predicted_label'] = None  # Initialize a new column for predictions

for index, row in df.head(5).iterrows():
    sentence = row['sentence']
    predicted_label = classify_sentence(sentence)
    df.loc[index, 'predicted_label'] = predicted_label

# Display the first 5 rows with the predicted labels
display(df.head(5))

| | sentence_id | speech_identifier | sentence | label | predicted_label |
|---|---|---|---|---|---|
| 0 | 863941 | 2008-06-17_e.txt | Against this backdrop, it is essential to elim... | normative | normative |
| 1 | 484125 | 2009-12-08_e.txt | Indeed, the domestic financial institutions ha... | descriptive | descriptive |
| 2 | 536961 | 2007-03-30_e.txt | Lending to households, household debt and hous... | descriptive | descriptive |
| 3 | 947025 | 2021-06-14_a.txt | Furthermore, monetary policy implementation in... | normative | normative |
| 4 | 325809 | 2007-07-05_f.txt | Turnover in the UK and US foreign-exchange mar... | descriptive | descriptive |


**Reasoning**:

Iterate through all rows of the DataFrame and use the `classify_sentence` function to get predictions for each sentence. Store these predictions in the 'predicted_label' column.
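
One way the full-dataset loop described above could look is sketched here. `classify_frame` and `fake_classify` are hypothetical names: the classifier is passed in as an argument so the loop can be tried without an API key, and the optional pause between calls is an assumption about rate limiting, not a documented requirement.

```python
import time
import pandas as pd

def classify_frame(frame, classify, pause_seconds=0.0):
    """Return a copy of `frame` with a 'predicted_label' column filled by `classify`."""
    result = frame.copy()
    predictions = []
    for sentence in result["sentence"]:
        predictions.append(classify(sentence))
        if pause_seconds:
            time.sleep(pause_seconds)  # crude spacing between consecutive API calls
    result["predicted_label"] = predictions
    return result

# Stand-in classifier so the sketch runs offline; it is not a real model.
def fake_classify(sentence):
    return "normative" if "should" in sentence.lower() else "descriptive"

toy = pd.DataFrame({"sentence": ["Prices rose.", "Banks should hold more capital."]})
labelled = classify_frame(toy, fake_classify)
print(labelled)
```

In the notebook, `df = classify_frame(df, classify_sentence)` would send one request per remaining row, so it is worth checking the expected cost on a small subset first.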

## Basic Validation

### Subtask:

Evaluate the classification results for the subset of data.

**Reasoning**:

Compare the 'label' column with the 'predicted_label' column for the first 5 rows to assess the accuracy of the model on this small sample.

In [None]:
# Compare the actual and predicted labels for the first 5 rows
comparison_subset = df.head(5)[['label', 'predicted_label']]
display(comparison_subset)

# Calculate accuracy for the subset
correct_predictions = (comparison_subset['label'] == comparison_subset['predicted_label']).sum()
total_predictions = len(comparison_subset)
accuracy_subset = correct_predictions / total_predictions if total_predictions > 0 else 0

print(f"\nAccuracy on the first 5 sentences: {accuracy_subset:.2f}")

| | label | predicted_label |
|---|---|---|
| 0 | normative | normative |
| 1 | descriptive | descriptive |
| 2 | descriptive | descriptive |
| 3 | normative | normative |
| 4 | descriptive | descriptive |



Accuracy on the first 5 sentences: 1.00
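
Accuracy on five sentences says little on its own, and a single number hides which class is being confused with which. The sketch below computes accuracy only over rows that received a prediction and adds a confusion table via `pd.crosstab`; `evaluate` is a hypothetical helper and the toy labels are made up for illustration.

```python
import pandas as pd

def evaluate(frame, truth_col="label", pred_col="predicted_label"):
    """Return (accuracy, confusion table) over rows that have a prediction."""
    # Rows the classifier could not label are excluded instead of counted as wrong.
    scored = frame.dropna(subset=[pred_col])
    if scored.empty:
        return 0.0, pd.DataFrame()
    accuracy = float((scored[truth_col] == scored[pred_col]).mean())
    confusion = pd.crosstab(
        scored[truth_col], scored[pred_col],
        rownames=["actual"], colnames=["predicted"],
    )
    return accuracy, confusion

# Made-up labels: two correct predictions, one wrong, one missing.
toy = pd.DataFrame({
    "label": ["normative", "descriptive", "descriptive", "normative"],
    "predicted_label": ["normative", "descriptive", "normative", None],
})
accuracy, confusion = evaluate(toy)
print(f"Accuracy: {accuracy:.2f}")
print(confusion)
```

Excluding unlabelled rows is a design choice; counting them as errors instead would give a more conservative accuracy.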
