# Using GPT-4 Turbo for policy area classification

In this notebook, we use data collected by Benoit et al. ([2016](https://doi.org/10.1017/S0003055416000058)) to illustrate how to classify texts with GPT-4 Turbo through the OpenAI chat completions API.

## Setup

In [3]:
import warnings
warnings.filterwarnings("ignore")

# for using OpenAI API
import os
from openai import OpenAI
import tiktoken
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# for data wrangling
import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()

# for evaluation
from sklearn.metrics import classification_report

In [4]:
MODEL = 'gpt-4-0125-preview'
# note: if your OpenAI API account does not have access to GPT-4 models, use gpt-3.5-turbo instead

In [5]:
from typing import Union, List

# define a class to count tokens
class TokenCounter:
    def __init__(self, encoding_name: Union[str, None] = None, model: Union[str, None] = None):
        """
        Initialize the tokenizer with either a model or an encoding name.

        Args:
            encoding_name (Union[str, None]): The name of the encoding to use. Default is None.
            model (Union[str, None]): The model to use for encoding. Default is None.

        Raises:
            ValueError: If neither model nor encoding_name is provided.
            ValueError: If both model and encoding_name are provided.
        """
        # ensure that either model or encoding_name is provided
        if model is None and encoding_name is None:
            raise ValueError("Either `model` or `encoding_name` must be provided.")
        if model is not None and encoding_name is not None:
            raise ValueError("Only one of `model` or `encoding_name` can be provided.")
        if encoding_name:
            self.encoding = tiktoken.get_encoding(encoding_name)
        else:
            self.encoding = tiktoken.encoding_for_model(model)
    
    def count_tokens(self, input: Union[str, List[str]]) -> Union[int, List[int]]:
        """
        Count the number of tokens in the input.

        Args:
            input (Union[str, List[str]]): The input to tokenize. Can be a string or a list of strings.

        Returns:
            Union[int, List[int]]: The number of tokens in the input. If the input is a list, returns a list of token counts.
        """
        if isinstance(input, str):
            return len(self.encoding.encode(input))
        else:
            toks = self.encoding.encode_batch(input)
            return [len(t) for t in toks]

    def __call__(self, input: Union[str, List[str]]) -> Union[int, List[int]]:
        """
        Call the tokenizer on the input. This is equivalent to calling count_tokens.

        Args:
            input (Union[str, List[str]]): The input to tokenize. Can be a string or a list of strings.

        Returns:
            Union[int, List[int]]: The number of tokens in the input. If the input is a list, returns a list of token counts.
        """
        return self.count_tokens(input)

In [6]:
# instantiate the TokenCounter
token_counter = TokenCounter(model=MODEL)

### Load and prepare the data 

In [7]:
base_path = os.path.join('..', '..')
data_path = os.path.join(base_path, 'data', 'benoit_crowdsourced_2016') 

fp = os.path.join(data_path, "benoit_crowdsourced_2016_policy_area.csv")
df = pd.read_csv(fp)

# keep only gold-standard examples
df = df[df.metadata__gold]
len(df)

506

In [8]:
df.head(1)

| | uid | text | label | metadata__gold | metadata__sentence_id | metadata__pre_sentence | metadata__post_sentence |
|---|---|---|---|---|---|---|---|
| 2 | 10000031 | We have risen to fresh challenges at home and ... | 1 | True | 10000031 | We have discovered a new strength and a new pr... | Once again our economy is strong. Our industri... |


In [15]:
df.label.value_counts()

label
2    225
1    181
3    100
Name: count, dtype: int64

In [16]:
id2label = {1: "neither", 2: "economic", 3: "social"}

When distributing sentences to crowd workers for coding, Benoit et al. provided the sentence(s) preceding and following the to-be-coded sentence.
We replicate this approach, so we need to add the context sentences to the to-be-coded text.

In [11]:
# construct input 
def construct_input(row):
    out = ""
    if isinstance(row['metadata__pre_sentence'], str):
        out += row['metadata__pre_sentence'].strip() + " "
    # wrap the to-be-coded sentence in triple quotes (as noted in the instructions)
    out += '"""'
    out += row['text'].strip()
    out += '"""'
    if isinstance(row['metadata__post_sentence'], str):
        out += " " + row['metadata__post_sentence'].strip()
    return out

df['input'] = df.apply(construct_input, axis=1).tolist()

In [12]:
df['input']

2       We have discovered a new strength and a new pr...
12      Given the opportunities provided by Conservati...
13      Together we are building One Nation of free, p...
14      A Conservative dream is at last becoming a rea...
17      A vast change separates the Britain of today f...
                              ...                        
4769    We will bring the government's policy of forci...
4884    Parliament will remain free to enhance these r...
4968    The country takes pride in their professionali...
4984    We believe that part of its expertise can be e...
5015    A new Labour government will use those assets ...
Name: input, Length: 506, dtype: object
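
To make the resulting input format concrete, here is a minimal, self-contained sketch of the same wrapping logic applied to plain strings. The helper `wrap_focal_sentence` and the toy sentences are hypothetical and only mirror `construct_input` above; missing context is represented here by `None` rather than the `NaN` values pandas would produce.

```python
from typing import Optional

def wrap_focal_sentence(text: str, pre: Optional[str] = None, post: Optional[str] = None) -> str:
    """Wrap the focal sentence in triple quotes and attach optional context sentences."""
    parts = []
    # only string values count as context (NaN or None means "no context available")
    if isinstance(pre, str):
        parts.append(pre.strip())
    parts.append('"""' + text.strip() + '"""')
    if isinstance(post, str):
        parts.append(post.strip())
    return " ".join(parts)

# toy sentences (made up for illustration, not taken from the dataset)
print(wrap_focal_sentence("We will cut taxes.", pre="Our plan is simple.", post="Growth will follow."))
print(wrap_focal_sentence("We will cut taxes."))
```

The first call prints the focal sentence surrounded by its context; the second prints only the quoted focal sentence, which is what happens for sentences at the start or end of a manifesto section.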

## Classify texts

### Define the instructions

In [13]:
# adapt instructions from Benoit et al.'s original crowd coding instructions (see data/benoit_crowdsourced_2016/instructions/) 
instructions = """
Your task is to read sentences from political texts and judging whether they deal with economic or social policy.

The sentences you will be asked to interpret come from political party manifestos. Some of these sentences will deal with economic policy; some will deal with social policy; other sentences will deal with neither economic nor social policy. We tell you below about what we mean by "economic" and "social" policy.

First, you will read a short section from a party manifesto. For the focal sentence enclosed in triple quotes, indicate your best judgment about whether it mainly refers to economic policy, to social policy, or to neither.

For each focal sentence, choose one of the following categories: "economic", "social", or "neither". If the sentence refers to economic policy, select "economic"; if it refers to social policy, select "social". If the sentence does not refer to either policy area, select "neither".

## What is "economic" policy?

**"Economic" policies** deal with all aspects of the economy, including:

- Taxation
- Government spending
- Services provided by the government or other public bodies
- Pensions, unemployment and welfare benefits, and other state benefits
- Property, investment and share ownership, public or private
- Interest rates and exchange rates
- Regulation of economic activity, public or private
- Relations between employers, workers and trade unions

## What is "social" policy?

**"Social" policies** deal with aspects of social and moral life, relationships between social groups, and matters of national and social identity, including:

- Policing, crime, punishment and rehabilitation of offenders;
- Immigration, relations between social groups, discrimination and multiculturalism;
- The role of the state in regulating the social and moral behavior of individuals
"""

### Example

In [17]:
i = 4
print('TEXT:', df.input.values[i])
print('LABEL:', id2label[df.label.values[i]])

TEXT: A vast change separates the Britain of today from the Britain of the late 1970s. Is it really only such a short time ago that inflation rose to an annual rate of 27 per cent? """That the leader of the Transport and General Workers' Union was widely seen as the most powerful man in the land?""" That a minority Labour Government, staggering from crisis to crisis on borrowed money, was nonetheless maintained in power by the Liberal Party in return for the paper concession of a Lib-Lab pact? And that Labour's much-vaunted pay pact with the unions collapsed in the industrial anarchy of the winter of discontent, n which the dead went unburied, rubbish piled up in the streets and the country was gripped by a creeping paralysis which Labour was powerless to cure?
LABEL: economic


In [19]:
messages = [ 
    {"role": "system", "content": instructions},
    {"role": "user", "content": df['input'].values[i]}
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    seed=42,
    temperature=0.0,
)

results = response.choices[0].message.content
results

'economic'

### Zero-shot classification

In [22]:
def classify_text(text):
    messages = [ 
        {"role": "system", "content": instructions},
        {"role": "user", "content": text}
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        seed=42,
        temperature=0.0,
    )

    results = response.choices[0].message.content
    return results
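
The function above returns the model's reply verbatim. Chat models occasionally add quotes, capitalization, or trailing punctuation, so it can help to normalize replies before evaluation. The sketch below is an assumption on our part and not part of the workflow in this notebook; `normalize_label`, `VALID_LABELS`, and the `"invalid"` fallback are hypothetical names.

```python
VALID_LABELS = {"economic", "social", "neither"}

def normalize_label(reply: str, fallback: str = "invalid") -> str:
    """Map a raw model reply to one of the valid labels, or to `fallback` if none matches."""
    # drop surrounding whitespace, then surrounding quotes and periods, then lowercase
    cleaned = reply.strip().strip("\"'.").lower()
    return cleaned if cleaned in VALID_LABELS else fallback

print(normalize_label(' "Economic". '))                  # economic
print(normalize_label("I think this is social policy"))  # invalid
```

Flagging unexpected replies explicitly, rather than silently keeping them, makes it easier to spot cases where the model did not follow the instructions.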

In [23]:
# take a sample of 25 texts from each label class
samples = df.groupby('label').sample(25, random_state=42)[["uid", "input", "label"]].reset_index(drop=True)
samples

| | uid | input | label |
|---|---|---|---|
| 0 | 10000621 | By such steadfastness, we have not only rebuil... | 1 |
| 1 | 10008851 | set up safe facilities for disposing of radioa... | 1 |
| 2 | 20001231 | In 1832 Britain took the first step with the G... | 1 |
| 3 | 10010111 | Conclusion: The Way Forward. The proposals out... | 1 |
| 4 | 20000991 | A fair electoral system will have that effect ... | 1 |
| ... | ... | ... | ... |
| 70 | 10007441 | but much has been done. The great majority of ... | 3 |
| 71 | 40007181 | Our national DNA database - the first in the w... | 3 |
| 72 | 40007211 | Since 1985 the average sentence for violence a... | 3 |
| 73 | 40007951 | Anyone convicted of a second serious sexual or... | 3 |


Let's compute how much it will cost to request classifications for this sample: 

In [24]:
# number of tokens in inputs
n_input_tokens = samples.input.apply(token_counter.count_tokens).sum()
# add token count for instructions (for each example)
n_input_tokens += token_counter(instructions) * len(samples)
print('# input tokens:', n_input_tokens)

# input tokens: 35223


In [25]:
# assuming the model replies with just the category label, we count one output token per example
n_output_tokens = len(samples)
print('# output tokens:', n_output_tokens)

# output tokens: 75


Next, we check https://openai.com/pricing for the current GPT-4 Turbo pricing.

As of March 24, 2024, the pricing was

- $10.00 per one million (1M) **input** tokens and
- $30.00 per 1M **output** tokens (hence the factors 10 and 30 in the cost computation below)

In [26]:
# compute cost in US dollars (see https://openai.com/pricing)
# prices are quoted per 1M tokens, so both token counts are divided by 1_000_000
round(n_input_tokens/1_000_000*10 + n_output_tokens/1_000_000*30, 5)

0.35448
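
As a sanity check on the units, the same arithmetic can be wrapped in a small helper. This is a minimal sketch: `estimate_cost_usd` is a hypothetical name, the default prices are the per-1M-token rates quoted above, and the token counts are the ones printed earlier (35223 input tokens, 75 output tokens).

```python
def estimate_cost_usd(n_input: int, n_output: int,
                      usd_per_1m_input: float = 10.0,
                      usd_per_1m_output: float = 30.0) -> float:
    """Estimate API cost in US dollars from token counts and per-1M-token prices."""
    input_cost = n_input / 1_000_000 * usd_per_1m_input
    output_cost = n_output / 1_000_000 * usd_per_1m_output
    return input_cost + output_cost

cost = estimate_cost_usd(35_223, 75)
print(f"${cost:.4f}")  # prints $0.3545
```

Because the prices are quoted per one million tokens, dividing by `1_000_000` yields dollars directly: classifying the 75 sampled sentences costs roughly 35 US cents.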

In [27]:
# classify: apply custom classification function to all inputs
results = samples.input.progress_apply(classify_text)


In [28]:
results.value_counts()

input
social      28
economic    24
neither     23
Name: count, dtype: int64

### Evaluate

Since the dataset records the texts' "true" labels (based on the authors' expert judgments), we can compute standard [multi-class classification metrics](https://www.kaggle.com/code/nkitgupta/evaluation-metrics-for-multi-class-classification) by comparing the true labels to GPT's classifications:

In [29]:
print(classification_report(samples.label.map(id2label), results.values))

              precision    recall  f1-score   support

    economic       0.96      0.92      0.94        25
     neither       1.00      0.92      0.96        25
      social       0.89      1.00      0.94        25

    accuracy                           0.95        75
   macro avg       0.95      0.95      0.95        75
weighted avg       0.95      0.95      0.95        75
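
To clarify what the per-class columns in this report mean, the following sketch recomputes precision, recall, and F1 from scratch on a toy example. It is an illustration under simplifying assumptions, not scikit-learn's implementation; `per_class_metrics` and the toy labels are hypothetical.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for each label occurring in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # label p was predicted although the true label differs
            fn[t] += 1  # true label t was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        n_pred = tp[label] + fp[label]  # how often the label was predicted
        n_true = tp[label] + fn[label]  # how often the label is the true one
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_true if n_true else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# toy labels (made up for illustration)
y_true = ["economic", "economic", "social", "neither"]
y_pred = ["economic", "social", "social", "neither"]
for label, m in per_class_metrics(y_true, y_pred).items():
    print(label, m)
```

In the toy example, one "economic" sentence is misclassified as "social", which lowers recall for "economic" (0.5) and precision for "social" (0.5) while leaving "neither" unaffected.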



### Few-shot classification

Let's get two examples (at random) per category that are not in our sample of to-be-classified examples:

In [30]:
examples = df[~df.uid.isin(samples.uid)].groupby('label').sample(2, random_state=42)[["uid", "input", "label"]].reset_index(drop=True)
# reshuffle
examples = examples.sample(frac=1.0, random_state=42)
# convert numeric to string labels
examples['label'] = examples['label'].map(id2label)
examples

| | uid | input | label |
|---|---|---|---|
| 0 | 20000541 | Government must enable society to take the lon... | neither |
| 1 | 20000031 | We know that it is possible to unite our count... | neither |
| 5 | 30003571 | Our policies for employment, education, housin... | social |
| 2 | 20004691 | The &lt;U+00A3&gt;10 Christmas bonus has becam... | economic |
| 4 | 40007901 | Persistent offenders account for a high propor... | social |
| 3 | 20004751 | FAMILIES IN WORK. We will add £5 per week to t... | economic |


In [31]:
def classify_text(text, examples: pd.DataFrame):
    messages = [{"role": "system", "content": instructions}]
    
    for _, d in examples.iterrows():
        messages +=  [   
            {"role": "user", "content": d.input},
            {"role": "assistant", "content": d.label}
        ]

    messages.append({"role": "user", "content": text})
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        seed=42,
        temperature=0.0,
    )

    results = response.choices[0].message.content
    return results
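
To show the message layout this function sends, here is a self-contained sketch that builds the same list from plain tuples without calling the API. `build_few_shot_messages` and the toy examples are hypothetical; the role sequence mirrors the function above, with each example presented as a user turn followed by the label as an assistant turn.

```python
def build_few_shot_messages(instructions, examples, text):
    """Assemble chat messages: system prompt, example turns, then the text to classify."""
    messages = [{"role": "system", "content": instructions}]
    for example_text, example_label in examples:
        # each example is a user turn answered by the "correct" assistant reply
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": example_label})
    # the to-be-classified text always comes last
    messages.append({"role": "user", "content": text})
    return messages

# toy examples (made up for illustration, not taken from the dataset)
toy_examples = [
    ('"""We will cut income tax."""', "economic"),
    ('"""We will put more police on the streets."""', "social"),
]
messages = build_few_shot_messages(
    "Classify the focal sentence.", toy_examples, '"""Britain is a great country."""'
)
for m in messages:
    print(m["role"], "|", m["content"])
```

Note that the full list, including all examples, is sent with every request, so the examples' tokens are billed once per classified text.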

Let's update our estimate of the number of input tokens and the resulting cost:

In [32]:
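# the few-shot examples are sent with every request, so their tokens are counted once per sample
n_example_tokens = examples.input.apply(token_counter.count_tokens).sum() + len(examples)*3
n_input_tokens += n_example_tokens * len(samples)

# compute cost in US dollars
round(n_input_tokens/1_000_000*10 + n_output_tokens/1_000_000*30, 5)

0.95373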

In [33]:
# classify: apply custom classification function to all inputs
results = samples.input.progress_apply(classify_text, examples=examples)


In [34]:
results.value_counts()

input
neither     28
social      25
economic    22
Name: count, dtype: int64

### Evaluate

In [35]:
print(classification_report(samples.label.map(id2label), results.values))

              precision    recall  f1-score   support

    economic       1.00      0.88      0.94        25
     neither       0.86      0.96      0.91        25
      social       0.96      0.96      0.96        25

    accuracy                           0.93        75
   macro avg       0.94      0.93      0.93        75
weighted avg       0.94      0.93      0.93        75



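In this case, few-shot prompting does not improve on zero-shot classification: accuracy is 0.93 with examples versus 0.95 without, a net difference of one correctly classified sentence (70 vs. 71 of 75).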