# Using GPT-4 turbo for economic policy stance classification

| Authors | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2024-03-25 |

In this notebook, we take data collected by Benoit et al. ([2016](https://doi.org/10.1017/S0003055416000058)) to illustrate how to use GPT-4-turbo through the OpenAI chat completions API to classify texts.

## Setup

In [37]:
import warnings
warnings.filterwarnings("ignore")

# for using openai models
import os  # must be imported before reading the API key below
from openai import OpenAI
import tiktoken
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# for data wrangling
import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()

# for evaluation
from sklearn.metrics import classification_report

In [2]:
MODEL = 'gpt-4-0125-preview'

### Get the tokenizer for cost computations

In [3]:
from typing import Union, List

class TokenCounter:
    def __init__(self, encoding_name: Union[str, None] = None, model: Union[str, None] = None):
        """
        Initialize the tokenizer with either a model or an encoding name.

        Args:
            encoding_name (Union[str, None]): The name of the encoding to use. Default is None.
            model (Union[str, None]): The model to use for encoding. Default is None.

        Raises:
            ValueError: If neither model nor encoding_name is provided.
            ValueError: If both model and encoding_name are provided.
        """
        # ensure that either model or encoding_name is provided
        if model is None and encoding_name is None:
            raise ValueError("Either `model` or `encoding_name` must be provided.")
        if model is not None and encoding_name is not None:
            raise ValueError("Only one of `model` or `encoding_name` can be provided.")
        if encoding_name:
            self.encoding = tiktoken.get_encoding(encoding_name)
        else:
            self.encoding = tiktoken.encoding_for_model(model)
    
    def count_tokens(self, input: Union[str, List[str]]) -> Union[int, List[int]]:
        """
        Count the number of tokens in the input.

        Args:
            input (Union[str, List[str]]): The input to tokenize. Can be a string or a list of strings.

        Returns:
            Union[int, List[int]]: The number of tokens in the input. If the input is a list, returns a list of token counts.
        """
        if isinstance(input, str):
            return len(self.encoding.encode(input))
        else:
            toks = self.encoding.encode_batch(input)
            return [len(t) for t in toks]

    def __call__(self, input: Union[str, List[str]]) -> Union[int, List[int]]:
        """
        Call the tokenizer on the input. This is equivalent to calling count_tokens.

        Args:
            input (Union[str, List[str]]): The input to tokenize. Can be a string or a list of strings.

        Returns:
            Union[int, List[int]]: The number of tokens in the input. If the input is a list, returns a list of token counts.
        """
        return self.count_tokens(input)

In [4]:
token_counter = TokenCounter(model=MODEL)

### Load the data

In [7]:
base_path = os.path.join('..', '..')
data_path = os.path.join(base_path, 'data', 'benoit_crowdsourced_2016') 

In [8]:
fp = os.path.join(data_path, "benoit_crowdsourced_2016_econ_policy_stance.csv")
df = pd.read_csv(fp)

In [9]:
# keep only gold-standard examples
df = df[df.metadata__gold]
len(df)

225

In [10]:
df.label.value_counts()

label
 1    160
-1     65
Name: count, dtype: int64

**_Note:_** while the stance/position scale ranges from -2 (very left) to 2 (very right) and includes 0 (neutral), all of the gold-standard examples are either "somewhat left" (-1) or "somewhat right" (+1).

In [11]:
df.head(1)

Unnamed: 0,uid,text,label,metadata__gold,metadata__sentence_id,metadata__pre_sentence,metadata__post_sentence
5,10000211,That the leader of the Transport and General W...,1,True,10000211,A vast change separates the Britain of today f...,"That a minority Labour Government, staggering ..."


When distributing sentences to crowd workers for coding, Benoit et al. provided the sentence(s) preceding and following the to-be-coded sentence.
We will replicate this approach and thus need to add the context sentences to the to-be-coded text.

In [12]:
# construct input 
def construct_input(row):
    out = ""
    if isinstance(row['metadata__pre_sentence'], str):
        out += row['metadata__pre_sentence'].strip() + " "
    out += '"""'
    out += row['text'].strip()
    out += '"""'
    if isinstance(row['metadata__post_sentence'], str):
        out += " " + row['metadata__post_sentence'].strip()
    return out

df['input'] = df.apply(construct_input, axis=1).tolist()
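
To make the resulting input format concrete, here is a minimal, self-contained sketch that applies the same logic to two hypothetical rows. The rows are plain dictionaries rather than DataFrame rows, and the sentences are invented for illustration. A missing context sentence is represented by `float('nan')`, which is how pandas typically encodes missing strings.

```python
# Same logic as construct_input above, applied to hypothetical rows (plain dicts).
def construct_input(row):
    out = ""
    # context sentences are only added when present (missing values are not str)
    if isinstance(row['metadata__pre_sentence'], str):
        out += row['metadata__pre_sentence'].strip() + " "
    # the focal sentence is wrapped in triple quotes
    out += '"""' + row['text'].strip() + '"""'
    if isinstance(row['metadata__post_sentence'], str):
        out += " " + row['metadata__post_sentence'].strip()
    return out

# hypothetical rows, invented for illustration
full_row = {
    'metadata__pre_sentence': "We value work. ",
    'text': " We will cut taxes.",
    'metadata__post_sentence': "Businesses will benefit.",
}
first_row = {
    'metadata__pre_sentence': float('nan'),  # no preceding sentence
    'text': "We will cut taxes.",
    'metadata__post_sentence': "Businesses will benefit.",
}

full_input = construct_input(full_row)
first_input = construct_input(first_row)
print(full_input)
print(first_input)
```

Only the focal sentence ends up inside the triple quotes, which is what the instructions below rely on to tell the model which sentence to classify.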

## Classify

### Define the instructions

In [17]:
# adapt instructions from Benoit et al.'s original crowd coding instructions (see data/benoit_crowdsourced_2016/instructions/) 
instructions = """
Your task is to read sentences from political texts about economic policy issues and classify their stance on the issue.

The sentences you will be asked to interpret come from political party manifestos.

First, you will read a short section from a party manifesto. For the focal sentence enclosed in triple quotes, you will then indicate your best judgment of whether the sentence expresses a left or right economic policy stance.

For each focal sentence, choose one of the following categories: "left", "right".

We tell you below about what we mean by "left" and "right".

## What is a "left" economic policy stance?

**"Left" economic policies** tend to favor one or more of the following: 

- High levels of services provided by the government and state benefits, even if this implies high levels of taxation;
- Public investment. Public ownership or control of sections of business and industry;
- Public regulation of private business and economic activity;
- Support for workers/trade unions relative to employers

## What is a "right" economic policy stance?

**"Right" economic policies** tend to favor one or more of the following: 

- Low levels of taxation, even if this implies low levels of levels of services provided by the government and state benefits;
- Private investment. Minimal public ownership or control of business and industry;
- Minimal public regulation of private business and economic activity;
- Support for employers relative to trade unions/workers

"""

### Simple example

Let's just inspect one example:

In [19]:
i = 101 # index of the example to inspect
print('TEXT:', df.input.values[i])
print('LABEL:', df.label.values[i]) 

TEXT: to strengthen the rights of women at work including equal pay for work of equal value and equal treatment. We will ensure that all public authorities and private contractors are equal opportunity employers and we will promote changes to enable those with domestic responsibilities to secure access to employment. """We would restore maternity grants and give a tax allowance to help with child-care costs.""" We would remove the tax on the use of workplace nurseries and encourage wider provision of child-care facilities. UNEMPLOYMENT. Unemployment at present levels is not the inevitable result of new technology or world recession - Japan has only 2.5% unemployment and US unemployment has fallen by two million since 1983.
LABEL: -1


In [20]:
messages = [ 
    {"role": "system", "content": instructions},
    {"role": "user", "content": df['input'].values[i]}
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    seed=42,
    temperature=0.0,
)

results = response.choices[0].message.content
results

'left'

The category 'left' corresponds to the numeric code -1, so the classification is correct according to the gold standard!

### Zero-shot classification

In [21]:
def classify_text(text):
    messages = [ 
        {"role": "system", "content": instructions},
        {"role": "user", "content": text}
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        seed=42,
        temperature=0.0,
    )

    results = response.choices[0].message.content
    return results

Let's just sample 25 examples per category. This should be enough to get a first impression of how well GPT performs in this classification task.

In [22]:
samples = df.groupby('label').sample(25, random_state=42)[["uid", "input", "label"]].reset_index(drop=True)
samples

Unnamed: 0,uid,input,label
0,30001331,Encourage the establishment and success of co-...,-1
1,50008451,We will raise the basic rate of income tax by ...,-1
2,20002551,The dole queue is three times what it was in 1...,-1
3,30000891,We will increase child benefit by £3 a week f...,-1
4,20004521,The Conservatives' taxation and benefit polici...,-1
5,60002701,The University for Industry will be a public/p...,-1
6,20004651,For poorer pensioners we will introduce an add...,-1
7,20004611,The second phase of our proposals will be a re...,-1
8,60004261,Every modern industrial country has a minimum ...,-1
9,20004831,We do not support this discrimination based on...,-1


Before iterating over multiple examples to code them, you should _always_ (!) compute first how much this will cost you:

In [27]:
# number of tokens in inputs
n_input_tokens = samples.input.apply(token_counter.count_tokens).sum()
# add token count for instructions (for each example)
n_input_tokens += token_counter(instructions) * len(samples)
print('# input tokens:', n_input_tokens)

# input tokens: 20134


In [33]:
# assuming the model replies with only the category label, we expect about 1 output token per example
n_output_tokens = len(samples)
print('# output tokens:', n_output_tokens)

# output tokens: 50


Now we go to https://openai.com/pricing to look up the current pricing for the GPT-4-turbo model.

On March 24, 2024, the pricing was

- $10.00 per one million (1M) **input** tokens and
- $30.00 per 1M **output** tokens (hence the factors 10 and 30 in the cost computation below)
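
As a minimal sketch of this arithmetic, the helper below converts token counts into an estimated cost in US dollars. The function name `estimate_cost_usd` and its parameters are hypothetical (not part of any library). The default prices are the ones quoted above and will need updating when the pricing changes. The token counts are the ones reported above for the 50 sampled sentences.

```python
def estimate_cost_usd(n_input_tokens, n_output_tokens,
                      input_price_per_1m=10.00, output_price_per_1m=30.00):
    """Estimate API cost in US dollars from token counts and per-1M-token prices."""
    return (n_input_tokens / 1_000_000 * input_price_per_1m
            + n_output_tokens / 1_000_000 * output_price_per_1m)

# token counts reported above for the 50 sampled sentences
cost = estimate_cost_usd(n_input_tokens=20134, n_output_tokens=50)
print(f"estimated cost: ${cost:.4f}")
```

Under these assumptions, the zero-shot run costs roughly 20 US cents, almost all of it for input tokens.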

In [31]:
# compute cost in US dollars (see https://openai.com/pricing); prices are per 1M tokens
round(n_input_tokens/1_000_000*10 + n_output_tokens/1_000_000*30, 4)

0.2028

Let's just run the classification:

In [36]:
# classify: apply custom classification function to all inputs
results = samples.input.progress_apply(classify_text)


In [40]:
results.value_counts()

input
right      25
left       24
"right"     1
Name: count, dtype: int64

**_Note:_** GPT once put its classification in quotes, so we need to post-process the output to get the actual classification.

In [43]:
results = results.str.replace('"', '')
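
Removing quotes handles the one deviation observed here, but other deviations are possible (capitalization, trailing punctuation, or a full-sentence reply). A slightly more defensive sketch is shown below. The helper `normalize_label` is hypothetical and the example replies are invented for illustration. Replies that cannot be mapped to a category are returned as `None` so they can be inspected manually.

```python
VALID_LABELS = {"left", "right"}

def normalize_label(raw):
    """Map a raw model reply to 'left'/'right', or None if it cannot be parsed."""
    # drop surrounding whitespace, quotes, backticks and periods, then lowercase
    cleaned = raw.strip().strip('"\'`.').strip().lower()
    return cleaned if cleaned in VALID_LABELS else None

# hypothetical replies, invented for illustration
replies = ['right', '"right"', ' Left. ', 'The stance is left-leaning']
normalized = [normalize_label(r) for r in replies]
print(normalized)
```

With a pandas Series of replies, the same function could be applied via `results.apply(normalize_label)`.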

#### Evaluate

In [44]:
id2label = {-1: "left", 1: "right"}
print(classification_report(samples.label.map(id2label), results.values))

              precision    recall  f1-score   support

        left       1.00      0.96      0.98        25
       right       0.96      1.00      0.98        25

    accuracy                           0.98        50
   macro avg       0.98      0.98      0.98        50
weighted avg       0.98      0.98      0.98        50



This is really great!

- perfect recall for the "right" category and almost perfect recall for the "left" category
- perfect precision for the "left" category and almost perfect precision for the "right" category

This means that 25 of 25 "right" examples and 24 of 25 "left" examples were correctly classified (recall).
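
The numbers in the report can be reproduced by hand from the confusion counts it implies: one "left" example was classified as "right", and everything else was classified correctly. The following sketch uses only plain Python, and the function names are illustrative.

```python
# confusion counts implied by the report: (gold, predicted) -> count
counts = {
    ("left", "left"): 24,
    ("left", "right"): 1,
    ("right", "left"): 0,
    ("right", "right"): 25,
}

def precision(label):
    # share of examples predicted as `label` that truly are `label`
    predicted = sum(n for (gold, pred), n in counts.items() if pred == label)
    return counts[(label, label)] / predicted

def recall(label):
    # share of true `label` examples that were predicted as `label`
    actual = sum(n for (gold, pred), n in counts.items() if gold == label)
    return counts[(label, label)] / actual

for label in ("left", "right"):
    print(label, round(precision(label), 2), round(recall(label), 2))
```

The single misclassified "left" example lowers both the recall of "left" (24/25) and the precision of "right" (25/26).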

### Few-shot classification 

Let's get three examples (at random) per category that are not in our sample of to-be-classified examples:

In [69]:
examples = df[~df.uid.isin(samples.uid)].groupby('label').sample(3, random_state=42)[["uid", "input", "label"]].reset_index(drop=True)
# reshuffle
examples = examples.sample(frac=1.0, random_state=42)
# convert numeric to string labels
examples['label'] = examples['label'].map(id2label)
examples

Unnamed: 0,uid,input,label
0,30001251,Create a new Ministry of Science and Technolog...,left
1,20004821,YOUNG PEOPLE. The Conservatives' benefit chang...,left
5,40011361,Politicians whose own declared policies would ...,right
2,30000921,Our special Minister for the Disabled will be ...,left
4,40001781,The European social model is not social and no...,right
3,10008431,Inner Cities. The regeneration of the inner ci...,right


In [65]:
def classify_text(text, examples: pd.DataFrame):
    messages = [{"role": "system", "content": instructions}]
    
    for _, d in examples.iterrows():
        messages +=  [   
            {"role": "user", "content": d.input},
            {"role": "assistant", "content": d.label}
        ]

    messages.append({"role": "user", "content": text})
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        seed=42,
        temperature=0.0,
    )

    results = response.choices[0].message.content
    return results
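
To see the structure of the few-shot prompt without calling the API, the sketch below assembles the message list for a handful of hypothetical examples. The helper `build_messages` and the example sentences are invented for illustration, and the examples are passed as simple `(input, label)` tuples rather than a DataFrame.

```python
def build_messages(instructions, examples, text):
    """Assemble a few-shot chat prompt: system message, example turns, then the new text."""
    messages = [{"role": "system", "content": instructions}]
    # each example becomes a user turn (the input) followed by an assistant turn (the label)
    for example_input, example_label in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_label})
    # the text to classify comes last, as the final user turn
    messages.append({"role": "user", "content": text})
    return messages

# hypothetical examples, invented for illustration
demo_examples = [
    ('"""We will raise the minimum wage."""', "left"),
    ('"""We will cut corporation tax."""', "right"),
]
demo_messages = build_messages(
    "Classify the stance.", demo_examples, '"""We will privatise the railways."""'
)
for m in demo_messages:
    print(m["role"], "|", m["content"])
```

Because the example turns are part of every request, their tokens count toward the input of each classified text.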

Before iterating over multiple examples to code them, you should _always_ (!) compute first how much this will cost you:

In [59]:
# number of tokens in inputs
n_input_tokens = samples.input.apply(token_counter.count_tokens).sum()
# add token count for instructions (for each example)
n_input_tokens += token_counter(instructions) * len(samples)
# add token counts for the few-shot examples' texts and labels
# (the examples are sent with every request, so multiply by the number of samples)
n_example_tokens = examples.input.apply(token_counter.count_tokens).sum() + len(examples)*3
n_input_tokens += n_example_tokens * len(samples)

print('# input tokens:', n_input_tokens)

# input tokens: 63284


In [60]:
# assuming the model replies with only the category label, we expect about 1 output token per example
n_output_tokens = len(samples)
print('# output tokens:', n_output_tokens)

# output tokens: 50


In [61]:
# compute cost in US dollars (see https://openai.com/pricing); prices are per 1M tokens
round(n_input_tokens/1_000_000*10 + n_output_tokens/1_000_000*30, 4)

0.6343

Let's just run the classification:

In [66]:
# classify: apply custom classification function to all inputs
results = samples.input.progress_apply(classify_text, examples=examples)


In [67]:
results.value_counts()

input
left     25
right    25
Name: count, dtype: int64

#### Evaluate

In [68]:
print(classification_report(samples.label.map(id2label), results.values))

              precision    recall  f1-score   support

        left       1.00      1.00      1.00        25
       right       1.00      1.00      1.00        25

    accuracy                           1.00        50
   macro avg       1.00      1.00      1.00        50
weighted avg       1.00      1.00      1.00        50



So by including a few examples, we boosted classification performance to 100% for both categories!

This is in line with what we should generally expect: showing the model a few labeled examples tends to help.

However, the sample size (*N*=50) is very small and zero-shot performance was already very good.
So it's fair if you object that the improvement we found might not be (statistically) significant.
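
One way to check this objection is an exact McNemar test, which compares two classifiers on the same items using only the items on which they disagree. The sketch below is an illustration, not part of the original analysis. It assumes the counts implied by the reports above: the zero-shot run misclassified one example that the few-shot run classified correctly (`b=1`), and there is no example for which the reverse holds (`c=0`).

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on the discordant pairs of two paired classifiers."""
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree
    k = min(b, c)
    # binomial tail probability under H0: each discordant pair is a fair coin flip
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# b: wrong zero-shot, correct few-shot; c: correct zero-shot, wrong few-shot
p_value = mcnemar_exact(b=1, c=0)
print(p_value)
```

With a single discordant example, the p-value is 1.0, so this sample gives no statistical evidence that few-shot prompting outperforms zero-shot prompting here.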