<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Erik Fredner](https://fredner.org) for the 2024 Text Analysis Pedagogy Institute. Revised and expanded by Zhuo Chen under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />

For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />
____

# Automated Text Classification Using LLMs 1

**Description:** This notebook describes:

* How to create an OpenAI project API key
* How to interact with the OpenAI API
* How to do automated text classification using OpenAI API

**Use Case:** For Learners and Researchers

**Difficulty:** Intermediate

**Completion Time:** 90 minutes

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics 1](../Python-basics/python-basics-1.ipynb))
* Python Intermediate Series ([Start Python Intermediate 1](../Python-intermediate/python-intermediate-1.ipynb))
* Introduction to LLMs ([Start Intro to LLMs 1](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/March+20+2024_+How+ChatGPT+works+(Session+1).pdf))

**Knowledge Recommended:** Experience with an LLM chatbot (e.g. ChatGPT)

**Data Format:** JSON

**Libraries Used:** openai, python-dotenv, pandas, gzip, json

**Research Pipeline:** 
1. Play with LLMs if you have not already.
2. Test using a chatbot interface for an LLM (like ChatGPT) to perform relevant classifications for your research.
3. Evaluate initial results.
4. Learn how to interact with an API through this notebook.
5. Modify your initial experiments based on what we cover.

## Create and fund a project API key on OpenAI
To interact with the OpenAI API, we need to create a project API key. The step-by-step instructions below show how to create one on OpenAI.

### Create an API key
#### Go to the [OpenAI developer platform](https://platform.openai.com/)
 
If you already have a ChatGPT account, log into your account; if you don't have one yet, sign up for an account.

<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/text_classification_openai_platform.png">

#### Go to the dashboard
  
After you are logged in, go to the dashboard.
<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/text_classification_openai_dashboard.png">

####  Create a new secret key
  
Click on the green button to create a new key, and give the key a name.
<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/text_classification_create_new_key.png">

#### Save the new key to a text file or other accessible place

For security reasons, you will not be able to view a key again in your OpenAI account after you create it. Save the key right away in a place that is both secure and accessible to you, e.g. a text file that you do not share.
<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/text_classification_save_key.png">

### Add credits to your key
To prompt the OpenAI API to do a task, we need to add credits to the key. Usage is billed per token. You can think of tokens as pieces of words. According to the pricing page, 1,000 tokens is about 750 words.

Pricing is available [on this page](https://openai.com/api/pricing/).
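
As a rough sketch of how that ratio translates into cost, the snippet below converts a word count into an approximate token count and then into dollars. The helper names `estimate_tokens` and `estimate_cost` and the price of $2.50 per million tokens are placeholders invented for illustration, not figures taken from the pricing page; substitute the current price of the model you choose.

```python
# Rough estimate only, assuming ~750 words per 1,000 tokens (the ratio quoted above).
WORDS_PER_1K_TOKENS = 750


def estimate_tokens(word_count):
    """Approximate the number of tokens for a given number of English words."""
    return round(word_count * 1000 / WORDS_PER_1K_TOKENS)


def estimate_cost(word_count, usd_per_million_tokens):
    """Approximate the input cost in USD for a given number of words."""
    return estimate_tokens(word_count) * usd_per_million_tokens / 1_000_000


# HYPOTHETICAL price, for illustration only; look up the real one on the pricing page.
example_price = 2.50

tokens = estimate_tokens(15_000)  # e.g. 500 sentences of about 30 words each
cost = estimate_cost(15_000, example_price)
print(f"~{tokens} tokens, ~${cost:.4f}")
```

The estimate covers only the text you send. The model's responses are billed as well, usually at a different rate, so treat the result as a lower bound.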

#### Add a payment method
Go to **Settings** in the top menu, and then go to **Billing** in the sidebar. Under the **Payment methods** tab, you will find the **Add payment method** button.  
<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/text_classification_billing.png">

#### Add credits to your key
After you add a credit card to your account, go to the **Overview** tab to find the **Add to credit balance** button. Click on the button to specify how many credits you would like to add. For this class, you can start with $5, which is the minimum amount OpenAI requires.
<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/text_classification_add_credits.png">

Now that we have a project API key with credits, we can go ahead and prepare for the API interaction. 

## Install Required Libraries
To interact with the OpenAI API, we need to install two packages. 
* [openai](https://pypi.org/project/openai/)

The OpenAI Python library provides convenient access to the OpenAI API from applications written in the Python language. 

* [python-dotenv](https://pypi.org/project/python-dotenv/)

python-dotenv reads key-value pairs from a ```.env``` file and can set them as environment variables. 


In [None]:
### Install Libraries ###
%pip install --upgrade openai python-dotenv

In [None]:
### Import Libraries ###
from openai import OpenAI # import the OpenAI class from the openai library
from dotenv import load_dotenv # to load the API key 

In [None]:
import logging

# Set the httpx logger to only show WARNING and above
logging.getLogger("httpx").setLevel(logging.WARNING)

## Set the API key
Next, we will add the API key we generated before to the ```.env``` file in our working directory. 

In [None]:
# Assign the generated API key to a variable
OPENAI_API_KEY = " "  # copy-paste the class key here

In [None]:
# write the API key into the .env file
with open(".env", "w") as f: # open the .env file in write mode
    f.write(f"OPENAI_API_KEY={OPENAI_API_KEY}") # write the API key into the file

You might be wondering, where is the ```.env``` file? Files with names starting with a dot '.' are hidden by default, so they do not show up in a normal file listing. If you would like to confirm that we do have a ```.env``` file in our working directory, you can use a terminal command to do that.    

In [None]:
# list all files in the working directory, including the invisible files
!ls -a

If you would like to confirm further that the API key has been written into the ```.env``` file, you can open the file and read the content in it. 

In [None]:
# read the .env file
with open(".env", "r") as f: # open the .env file
    content = f.read() # read the content in the .env file
    print(content) # print out the content 

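The next time you restart this notebook kernel (or open a new notebook in the same directory), you do not need to paste the key again. Calling `load_dotenv()` loads the key from your `.env` file into the `OPENAI_API_KEY` environment variable, which the `openai` library reads automatically. There is no need to pass the `api_key=` argument to `OpenAI()`.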
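
Printing the whole `.env` file shows the key in full, which is easy to leak in a shared notebook or screenshot. A more cautious check, sketched below, looks up the environment variable and reports only its last four characters. The helper name `describe_api_key` is made up for this example, and the check only finds the key after `load_dotenv()` has been run in the current session.

```python
import os


def describe_api_key(env_var="OPENAI_API_KEY"):
    """Return a masked summary of the key stored in an environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        return f"{env_var} is not set"
    # Show only the last four characters so the key never appears in full.
    return f"{env_var} is set (ends with ...{key[-4:]})"


print(describe_api_key())
```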

# Making your first API call

The parameters for this call are documented in the [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create). Let's load the API key we stored in the ```.env``` file and use it to try a chat completion task. 

In [None]:
# load the API key and set the client to the OpenAI class
load_dotenv() # load the environment variables in the .env file, in our case, the API key
client = OpenAI() # create an OpenAI instance 

We can get a list of models available in the OpenAI API by running the following code. 

In [None]:
### get a list of all models in OpenAI API
models = client.models.list()
for model in models.data:
    print(model.id)

Let's try a simple chat completion task with gpt-4o-2024-08-06.

In [None]:
# interact with the OpenAI API
# using a chat completion task
completion = client.chat.completions.create(
    model="gpt-4o-2024-08-06", # set the model to gpt-4o-2024-08-06, you can try other models as well
    messages=[
        {
            "role": "system", # system message: sets the role and behavior of the LLM
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user", # user prompt
            "content": "Explain what text classification is in ten or fewer words.",
        },
    ],
)

As you can see, to prompt the API to do a chat completion task, we need to specify the values for some parameters.

* model

First of all, we need to specify which model we would like to use for this chat completion task. Here, we choose the model ```gpt-4o-2024-08-06```, but there are more available for you to try, as you can see from the list of models we've got above.

* messages
  
`messages` is a list of dictionaries containing the messages sent to the `model`. There are two kinds of messages we need to specify. First, a system message that specifies what role you would like the LLM to play. Second, a user message which functions as a prompt to the LLM. 

In [None]:
# see what's inside the completion variable
completion

In [None]:
# get the chat completion given by the LLM
completion.choices[0].message.content

Congratulations! You've just made your first API call! 

## Changing the system message
We can get different responses by passing different values to the parameters of the ```client.chat.completions.create()``` function. Let's experiment some more by changing the system message. 

In [None]:
new_system_message = "You are a French tutor. Respond to all prompts in French followed by English in parentheses."
user_message = "Explain what text classification is in ten or fewer words."

In [None]:
# see how a different system message conditions a different output chat completion
completion = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": new_system_message},  # new system message
        {
            "role": "user",
            "content": user_message,  # same user message
        },
    ],
)

print(f"user: {user_message}")
print("-" * 80)
print(f"gpt-4o-2024-08-06: {completion.choices[0].message.content}")

This makes obvious how changing the `system` message impacts how the model responds to subsequent `user` prompts.

You might be wondering, what other roles are there available in the API? According to the documentation, there are three different roles in the Chat Completions API. 

In the ```messages``` parameter, each message object has a role (either ```system```, ```user```, or ```assistant```) and content.

* The system message is optional and can be used to set the behavior of the assistant
* The user messages provide requests or comments for the assistant to respond to
* Assistant messages store previous assistant responses, but can also be written by you to give examples of desired behavior

We will see more in a moment. 
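
To make the three roles concrete, here is a minimal sketch of a message list that uses all of them. It writes one user/assistant pair by hand as an example of the desired behavior and then appends the real prompt. The helper name `build_messages` and the example sentences are invented for illustration; the only part taken from the API is the `role`/`content` structure of each message.

```python
def build_messages(system_message, examples, user_message):
    """Assemble a chat message list with optional hand-written examples.

    `examples` is a list of (user_text, assistant_text) pairs.
    """
    messages = [{"role": "system", "content": system_message}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The real prompt always goes last.
    messages.append({"role": "user", "content": user_message})
    return messages


# Made-up example pair, for illustration only.
few_shot = [("I loved this book.", "positive")]

messages = build_messages(
    "Label the sentiment of each sentence with one word.",
    few_shot,
    "The ending was a disappointment.",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create()` in the same way as the lists above.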

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

It's your turn! Can you change the messages you give to the API and see what response you get?

In [None]:
system_message = "You are a math teacher."  # change this!
user_message = "Replace this message with something of your interest."  # change this!
model = "gpt-4o-2024-08-06" # you can try other models

In [None]:
# use the new system message and user message to get a response
completion = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
)

In [None]:
print(f"user: {user_message}")
print("-" * 80)
print(f"{model}: {completion.choices[0].message.content}")

# Classifying texts using OpenAI API
In this section, we will use a dataset shared in a research paper on the sentiments of energy reports produced by university-based energy centers as an example to demonstrate how to use the OpenAI API to classify texts.

In [None]:
# read in the data from the JSONL file into a dataframe
import pandas as pd

file_path = '../All-sample-files/natural_gas_sents.jsonl'
ng_df = pd.read_json(file_path, lines=True)
ng_df

The ```sentiment``` column stores the gold standard for the classification. Of course, in this lesson, we are going to pretend that we don't have the gold standard because otherwise we don't need the LLM to do the work! In the next lesson, we will compare the LLM output to the gold standard. 

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

Take the JSONL file you built in the Build a Dataset class and select a subset of it to work on in class. 

Then, read the data relevant for classification into a dataframe. 

As an example, I'll use a Constellate dataset to demonstrate.

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

If you are using a Constellate dataset, follow the example below. Tweak the code to serve your own purpose. 

<h3 style="color:red; display:inline">Note! The following code cell assumes that you have downloaded the JSONL file containing metadata, ngrams and full texts to the current working directory.</h3>

In [None]:
import gzip
import json
# path to the jsonl file in the current directory
dataset_file = '' # copy and paste the path to the JSONL file 

# function that reads a jsonl file into a generator
def dataset_reader(file_path):
    """
    Helper to read in gzip files and yield Python dictionary
    documents.
    """
    with gzip.open(file_path, "rb") as input_file:
        for row in input_file:
            yield json.loads(row)

In [None]:
# read in the data as a generator 
dataset = dataset_reader(dataset_file)

In [None]:
# get a sense of the dataset
for doc in dataset:
    print(doc.keys())
    break

In [None]:
from itertools import islice

# Use islice to get only the first 500 documents from the dataset generator
subset = islice(dataset, 500)

# Collect (year, title) pairs with a list comprehension
data_of_interest = [(doc['publicationYear'], doc['title']) for doc in subset]
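
One detail of the cells above is worth spelling out: a generator can only be read once, so the document printed in the peek cell is not among the 500 taken by `islice`. The sketch below demonstrates this on a tiny made-up file. The names `toy_dataset_reader` and `toy.jsonl.gz` and the five fake documents are assumptions for illustration and have nothing to do with your Constellate download.

```python
import gzip
import json
import os
import tempfile
from itertools import islice


def toy_dataset_reader(file_path):
    """Yield one dictionary per line of a gzip-compressed JSONL file."""
    with gzip.open(file_path, "rt", encoding="utf-8") as input_file:
        for row in input_file:
            yield json.loads(row)


# Write a tiny made-up dataset so the example is self-contained.
tmp_path = os.path.join(tempfile.mkdtemp(), "toy.jsonl.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"title": f"Doc {i}", "publicationYear": 2000 + i}) + "\n")

reader = toy_dataset_reader(tmp_path)
first = next(reader)  # peeking consumes the first document
rest = list(islice(reader, 500))  # islice stops early if fewer documents remain
print(first["title"], len(rest))

# Re-create the generator to start again from the first document.
fresh = list(islice(toy_dataset_reader(tmp_path), 500))
print(len(fresh))
```

If you want the subset to start at the first document, call the reader function again before slicing, as in the last two lines.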

In [None]:
# turn the data of interest into a dataframe
constellate_df = pd.DataFrame(data_of_interest, columns=['Year', 'Title'])
constellate_df

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

If you are using a jsonl file you built with your own dataset, follow the example below. Tweak the code to serve your own purpose. 

In [None]:
# read in the data as a df
file_path = '../All-sample-files/natural_gas_sents.jsonl'
big_file_df = pd.read_json(file_path, lines=True)

In [None]:
# have a peek into the file
big_file_df.head(5)

In [None]:
# get a subset of the df
example_df = big_file_df.iloc[:500].copy()

## Classification
In this section, we are going to use the OpenAI API to do a ternary classification of the texts - whether the sentences mentioning natural gas are positive, negative or neutral. 

## Classification using OpenAI API
### Set the System Message
Let's give a system message that instructs the LLM to do a text classification task --- whether the given text containing 'natural gas' is positive, negative or neutral. 

In [None]:
# set the system message
system_message = """Determine whether the following sentence mentioning natural gas conveys a positive, negative or neutral sentiment.
Respond in JSON like so: {"sentiment": "positive"} or {"sentiment": "negative"} or {"sentiment": "neutral"}"""

In [None]:
# look back at the data df
ng_df

As we can see, each row in the dataframe represents one sentence about natural gas. Therefore, to classify each sentence to the category ```{"sentiment": "positive"}``` or ```{"sentiment": "negative"}``` or ```{"sentiment": "neutral"}```, we will need to write a user message for each sentence in the dataframe. 

### Write User Message 
What user message should we give to the LLM for it to do the ternary classification? Well, the text in the `line_text` column! The sentence itself is all the LLM needs in order to classify its sentiment. 

In [None]:
# define a function that makes a customized prompt for each row in the data df
def make_user_message(row):
    user_message = row['line_text']
    return user_message

In [None]:
# try making a prompt for the first row in the data df
user_message = make_user_message(ng_df.iloc[0])
print(user_message)

### Classify the sentences
Now that we have the system message and user message ready, we can go ahead and start classifying the sentences!

In [None]:
# write a chat completion function
def make_completion(user_message, client=OpenAI(), model='gpt-4o-2024-08-06', print_message=False):
    completion = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    )
    if print_message:
        print(f"System message: {system_message}\n{'-' * 80}") # print system message
        print(f"User message: {user_message}\n{'-' * 80}") # print user message
        print(
            f"Assistant response: {completion.choices[0].message.content}\n{'*' * 80}" # get the LLM response
        )

    return completion.choices[0].message.content

In [None]:
# try with the first row from the data df
test = make_completion(user_message, client=OpenAI(), model='gpt-4o-2024-08-06', print_message=True)

### Generalize to all sentences
Let's generalize the pipeline we wrote above to all the sentences in the dataframe. 

In [None]:
# for demonstration, we will only classify 3 randomly sampled sentences
ng_df = ng_df.sample(3).copy()
ng_df

In [None]:
# classify the texts
ng_df['LLM_output'] = ng_df['line_text'].apply(make_completion)

In [None]:
# take a look at the updated df
ng_df
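
The `LLM_output` column holds JSON strings rather than plain labels. A minimal sketch of turning such a string into a label is shown below. The helper name `parse_sentiment` and the choice to return `None` for anything unexpected are assumptions made for this example, not part of the OpenAI API.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}


def parse_sentiment(llm_output):
    """Extract the sentiment label from a JSON string, or None if it is unusable."""
    try:
        parsed = json.loads(llm_output)
    except (json.JSONDecodeError, TypeError):
        return None  # not valid JSON, or not a string at all
    if not isinstance(parsed, dict):
        return None  # valid JSON, but not an object like {"sentiment": ...}
    label = parsed.get("sentiment")
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return None  # missing key or a label outside the three categories


print(parse_sentiment('{"sentiment": "positive"}'))
print(parse_sentiment("not json"))
```

Applied to the dataframe, this could look like `ng_df['LLM_output'].apply(parse_sentiment)`, with the result stored in a new column of your choosing.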

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

In the following exercise, we will use [Erik Fredner](https://www.google.com/search?client=safari&rls=en&q=erik+fredner&ie=UTF-8&oe=UTF-8)'s example dataset of 500 selected Jeopardy! questions to practice what you have learned about using the OpenAI API to classify texts. Here are the goals of this classification task. 

1. For each question, use its ```CATEGORY```, ```CLUE``` and ```ANSWER``` to inform the OpenAI API to classify the question.
2. Every question is classified into one of two categories --- whether it is a question about literature or not.
3. Write the output into JSON format. For each question, the output will look like this ```{"Literature": true}``` or ```{"Literature": false}```
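
A small detail about the output format in goal 3: JSON spells booleans in lowercase (`true`/`false`), while Python spells them `True`/`False`. The short sketch below, which uses only the standard `json` module, shows how the two map onto each other.

```python
import json

# Python True is written out as JSON true.
print(json.dumps({"Literature": True}))

# JSON false is read back as Python False.
parsed = json.loads('{"Literature": false}')
print(parsed["Literature"], type(parsed["Literature"]).__name__)
```

When you show the LLM an example of the desired output in your system message, write it in the JSON spelling.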

In [None]:
# load the sample dataset
import pandas as pd

file_path = '../All-sample-files/jeopardy_data.csv'

# Read in the data
jeopardy_df = pd.read_csv(file_path)
jeopardy_df

In [None]:
### write a system message


In [None]:
### write a function that generates a user message for each question


In [None]:
### classify the texts, create a new column storing the LLM classification output


___
## Lesson Complete
Congratulations! You have completed **Automated Classification with LLMs 1**. There are two more lessons in this series:

* *Automated Classification with LLMs 2* 
* *Automated Classification with LLMs 3*

### Start Next Lesson: [Automated Classification with LLMs 2](./Automated_Classification_2.ipynb)

### Coding Challenge! Solutions

There are often many ways to solve programming problems. Here are a few possible ways to solve the challenges, but there are certainly more!

In [None]:
file_path = '../All-sample-files/jeopardy_data.csv'

# Read in the data
jeopardy_df = pd.read_csv(file_path)
jeopardy_df

In [None]:
### write the system message
system_message = """Determine whether the following Jeopardy question is about Literature.
Respond in JSON like so: {"Literature": true} or {"Literature": false}"""

### write the user message
jeopardy_df['user_message'] = jeopardy_df.apply(lambda row: f"""Category: {row['CATEGORY']}\nClue: {row['CLUE']}\nAnswer: {row['ANSWER']}""", axis=1)

In [None]:
### Use OpenAI API to classify the questions

# write a chat completion function
def make_completion(user_message, client=OpenAI(), model='gpt-4o-2024-08-06', print_message=False):
    completion = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    )
    if print_message:
        print(f"System message: {system_message}\n{'-' * 80}") # print system message
        print(f"User message: {user_message}\n{'-' * 80}") # print user message
        print(
            f"Assistant response: {completion.choices[0].message.content}\n{'*' * 80}" # get the LLM response
        )

    return completion.choices[0].message.content

#  classify the texts
jeopardy_df['LLM_output'] = jeopardy_df['user_message'].apply(make_completion)