# Classification with OpenAI
This notebook walks through the steps I took to classify the corpus:

1. Data preparation.
2. Classification of the texts with the OpenAI API, following previous work (see https://guillelezama.com/publication/pela/ and https://guillelezama.com/publication/immigration/).

In [1]:
!pip install openai



In [2]:
import openai
import pandas as pd
import time

In [3]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


In [4]:
folder=

In [5]:
# Set the OpenAI API key
openai.api_key = 
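
One way to avoid pasting the key into the notebook is to read it from an environment variable. The following is a minimal sketch, not what this notebook did: the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions chosen for illustration.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Both the helper and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return key
```

With the variable set, the assignment in the cell above could become `openai.api_key = load_api_key()`.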

In [6]:
# Load the test set; the first (unnamed) CSV column holds the original row index
df=pd.read_csv(folder+"test_data_for_openAI.csv", index_col='Unnamed: 0')
df.head()

Unnamed: 0,text,top
963,A defesa das terras e da cultura indígena é u...,0
1114,Concretizar a gestão democrática através da pa...,4
21784,Programar as ações do Programa Saúde da Fam...,5
24017,Diretrizes Orçamentárias o Orçamento Anual...,3
1475, Contratar mais médicos clínicos geral,5


In [7]:
# Define the categories with descriptions of what they encompass
categories = {
    "Titulo": "Sentences that contain or reference the title of a document or section.",
    "Introduccion": "Sentences that introduce a topic, provide an overview, or set the stage for the content.",
    "Servidores Publicos": "Sentences related to public servants, government administration, participation in governance, or public finance.",
    "Educacion y Deportes": "Sentences about education, culture, sports, tourism, or social policies related to youth, gender, or social development.",
    "Salud": "Sentences specifically related to healthcare, public health policies, or medical services.",
    "Transporte": "Sentences referring to transportation, infrastructure, urban development, housing, sanitation, or disaster management.",
    "Ambiente y Agricultura": "Sentences related to environmental issues, agriculture, rural development, sustainable development, or economic production.",
    "Trash": "Sentences that are empty, nonsensical, or don't provide any valuable information.",
    "Seguridad": "Sentences related to security, law enforcement, or public safety.",
    "Other": "Sentences that don’t fit into any other categories, including events, funerals, communication, religion, or miscellaneous topics."
}
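
The prompt sent for each sentence is assembled from this dictionary: a line listing the category names, one line per description, and then the sentence itself. As a rough sketch, that assembly could be wrapped in a helper. The function name `build_prompt` and the two-entry `demo_categories` dictionary are made up for illustration; the prompt wording follows the classification cell in this notebook.

```python
def build_prompt(categories: dict[str, str], sentence: str) -> str:
    """Assemble the classification prompt for a single sentence."""
    names = ", ".join(categories)
    descriptions = "\n".join(
        f"{name}: {description}" for name, description in categories.items()
    )
    return (
        f"Classify the following sentence into one of the categories: {names}.\n\n"
        f"Category descriptions:\n{descriptions}\n"
        f"\nSentence: {sentence}\n\nCategory:"
    )


# Hypothetical two-category dictionary, only to show the resulting layout
demo_categories = {
    "Salud": "Sentences specifically related to healthcare.",
    "Other": "Sentences that fit no other category.",
}
print(build_prompt(demo_categories, "Contratar mais médicos clínicos geral"))
```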

In [8]:
# Get a series with the texts to classify.
X_test=df['text']

In [9]:
# List to store predictions
predictions_open_AI = []
start_time = time.time()

# Build the part of the prompt that is identical for every sentence once
prompt_header = (
    f"Classify the following sentence into one of the categories: {', '.join(categories.keys())}.\n\n"
    f"Category descriptions:\n"
)

# Append the category descriptions to the prompt header
for category, description in categories.items():
    prompt_header += f"{category}: {description}\n"

# Loop to classify each sentence
for idx, sentence in enumerate(X_test):
    # Add the sentence to classify
    category_prompt = prompt_header + f"\nSentence: {sentence}\n\nCategory:"

    try:
        # Generate a completion (prediction) using the OpenAI API
        category_response = openai.chat.completions.create(
            model="gpt-4o-mini-2024-07-18",  # Pinned model snapshot, for reproducibility
            messages=[
                {"role": "system", "content": "You are an assistant that helps categorize sentences or lines of manifestos. These are manifestos for the mayor position in Brazil in 2012. The output should be the category title that better categorizes the proposed sentence."},
                {"role": "user", "content": category_prompt}
            ],
            temperature=0.1,
            max_tokens=5,
        )

        # Extract the predicted category, dropping any surrounding whitespace
        predicted_category = category_response.choices[0].message.content.strip()

        # Append the prediction to the list
        predictions_open_AI.append(predicted_category)

        if (idx + 1) % 100 == 0:
            print(f"{idx + 1} sentences classified so far.")
    except Exception as e:
        print(f"Error during prediction: {e}")
        predictions_open_AI.append("Error")
end_time = time.time()
print(f"Classification process completed in {end_time - start_time:.2f} seconds.")

100 sentences classified so far.
200 sentences classified so far.
300 sentences classified so far.
400 sentences classified so far.
500 sentences classified so far.
600 sentences classified so far.
700 sentences classified so far.
800 sentences classified so far.
900 sentences classified so far.
1000 sentences classified so far.
1100 sentences classified so far.
1200 sentences classified so far.
1300 sentences classified so far.
1400 sentences classified so far.
1500 sentences classified so far.
1600 sentences classified so far.
1700 sentences classified so far.
1800 sentences classified so far.
1900 sentences classified so far.
2000 sentences classified so far.
2100 sentences classified so far.
2200 sentences classified so far.
2300 sentences classified so far.
2400 sentences classified so far.
2500 sentences classified so far.
2600 sentences classified so far.
2700 sentences classified so far.
2800 sentences classified so far.
2900 sentences classified so far.
3000 sentences classified so far.
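
The loop above records "Error" as soon as a request fails. For long runs, a transient failure (for example a rate limit) could instead be retried a few times before giving up. The sketch below shows a generic retry wrapper with exponential backoff; `call_with_retries` and its parameters are hypothetical and were not part of the original run.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay, 2*base_delay, ... and retry.

    The last exception is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for an API request that fails twice before succeeding
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "Salud"


# A tiny base_delay keeps the demonstration fast
print(call_with_retries(flaky_request, base_delay=0.01))  # prints "Salud"
```

In the classification loop, the request could be passed in as a zero-argument function (for instance a `lambda` wrapping the `openai.chat.completions.create` call), keeping the existing `except` branch as the final fallback.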

In [10]:
# Attach the predictions to the texts and original labels (rows are aligned by position)
df=pd.concat([df[['text','top']].reset_index(),pd.DataFrame(predictions_open_AI, columns=['OpenAI'])], axis=1)

In [11]:
# Check which distinct labels the model returned
df['OpenAI'].unique()

array(['Ambiente y Agricultura', 'Servidores Publicos', 'Salud',
       'Educacion y Deportes', 'Seguridad', 'Other', 'Trash', 'Titulo',
       'Transporte', 'Introduccion'], dtype=object)
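
In this run every returned value is one of the ten category names. That is not guaranteed in general: a response may carry extra whitespace or punctuation, be cut short by the `max_tokens` limit, or be the "Error" placeholder. A small sketch of a check that maps each raw response to a known label, or flags it otherwise, follows; the helper `normalize_label` and the "Unrecognized" marker are assumptions added for illustration.

```python
def normalize_label(raw: str, valid_labels) -> str:
    """Map a raw model response to a known label, ignoring case and stray punctuation."""
    cleaned = raw.strip().strip(".\"'").strip()
    lookup = {label.lower(): label for label in valid_labels}
    return lookup.get(cleaned.lower(), "Unrecognized")


valid = ["Salud", "Educacion y Deportes", "Other"]
for raw in [" salud.\n", "Educacion y", "Error"]:
    print(repr(raw), "->", normalize_label(raw, valid))
```

Applied to the notebook's data, this could look like `df['OpenAI'].map(lambda r: normalize_label(r, categories))`, after which the rows marked "Unrecognized" can be inspected or re-sent.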

In [12]:
# Save the texts, original labels and OpenAI predictions
df.to_csv(folder+"test_data_for_openAI_predictions.csv")
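
With both the original `top` codes and the OpenAI labels in one table, a cross-tabulation is a quick way to see how the two label sets line up. The sketch below uses a small made-up frame rather than the real predictions, because the mapping between the numeric `top` codes and the category names is not stated in this notebook; the rows in `demo` are purely illustrative.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the actual predictions
demo = pd.DataFrame({
    "top": [5, 5, 3, 0],
    "OpenAI": ["Salud", "Salud", "Servidores Publicos", "Ambiente y Agricultura"],
})

# Rows: original codes; columns: OpenAI labels; cells: number of sentences
table = pd.crosstab(demo["top"], demo["OpenAI"])
print(table)
```

On the real data the same call would be `pd.crosstab(df['top'], df['OpenAI'])`.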