# Using GPT-4 Turbo for Section 230 stance classification

| Authors | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2024-03-25 |

In this notebook, we use data analyzed in Gilardi et al. ([2023](https://doi.org/10.1073/pnas.2305016120)) to illustrate how to classify stances in tweets with GPT-4 Turbo through the OpenAI chat completions API.

## Setup

In [32]:
import warnings
warnings.filterwarnings("ignore")

import os
from openai import OpenAI
import tiktoken
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()

from sklearn.metrics import classification_report, f1_score

In [2]:
MODEL = 'gpt-4-0125-preview'
# note: if your OpenAI API account does not have access to GPT-4 models, use gpt-3.5-turbo instead

In [3]:
from typing import Union, List

class TokenCounter:
    def __init__(self, encoding_name: Union[str, None] = None, model: Union[str, None] = None):
        """
        Initialize the tokenizer with either a model or an encoding name.

        Args:
            encoding_name (Union[str, None]): The name of the encoding to use. Default is None.
            model (Union[str, None]): The model to use for encoding. Default is None.

        Raises:
            ValueError: If neither model nor encoding_name is provided.
            ValueError: If both model and encoding_name are provided.
        """
        # ensure that either model or encoding_name is provided
        if model is None and encoding_name is None:
            raise ValueError("Either `model` or `encoding_name` must be provided.")
        if model is not None and encoding_name is not None:
            raise ValueError("Only one of `model` or `encoding_name` can be provided.")
        if encoding_name:
            self.encoding = tiktoken.get_encoding(encoding_name)
        else:
            self.encoding = tiktoken.encoding_for_model(model)
    
    def count_tokens(self, input: Union[str, List[str]]) -> Union[int, List[int]]:
        """
        Count the number of tokens in the input.

        Args:
            input (Union[str, List[str]]): The input to tokenize. Can be a string or a list of strings.

        Returns:
            Union[int, List[int]]: The number of tokens in the input. If the input is a list, returns a list of token counts.
        """
        if isinstance(input, str):
            return len(self.encoding.encode(input))
        else:
            toks = self.encoding.encode_batch(input)
            return [len(t) for t in toks]

    def __call__(self, input: Union[str, List[str]]) -> Union[int, List[int]]:
        """
        Call the tokenizer on the input. This is equivalent to calling count_tokens.

        Args:
            input (Union[str, List[str]]): The input to tokenize. Can be a string or a list of strings.

        Returns:
            Union[int, List[int]]: The number of tokens in the input. If the input is a list, returns a list of token counts.
        """
        return self.count_tokens(input)

In [4]:
token_counter = TokenCounter(model=MODEL)

## Load the data

In [6]:
base_path = os.path.join('..', '..')
data_path = os.path.join(base_path, 'data', 'gilardi_chatgpt_2023') 

fp = os.path.join(data_path, "gilardi_chatgpt_2023_section230_stance.csv")
df = pd.read_csv(fp)
len(df)

780

In [7]:
df.label.value_counts()

label
neutral     420
negative    327
positive     33
Name: count, dtype: int64

In [8]:
i = 0
print('TEXT:', df.text.values[i])
print('LABEL:', df.label.values[i])


TEXT: Isn’t it fascinating that my Twitter followers went from 1K to 23.9K in a matter of days and now are suddenly starting to decline over the last 48 hours? Big Tech censorship is a clear and present danger to America. Section 230 protection must go away one way or another. #USA
LABEL: negative


## Classify

### Define the instructions

In [9]:
instructions = """
Your task is to read tweets about content moderation and classify what stance they take on Section 230 (if any).

In the context of content moderation, Section 230 is a law in the United States that protects websites and other online platforms from being held legally responsible for the content posted by their users. This means that if someone posts something illegal or harmful on a website, the website itself cannot be sued for allowing it to be posted. However, websites can still choose to moderate content and remove anything that violates their own policies. 

For each tweet in the sample, follow these instructions: 

1. Carefully read the text of the tweet, paying close attention to details.
2. Classify the tweet as having a positive stance towards Section 230, a negative stance, or a neutral stance.

For each tweet, choose one of the following categories: "negative", "neutral", "positive"
"""

### Simple example

In [10]:
messages = [ 
    {"role": "system", "content": instructions},
    {"role": "user", "content": df.text.values[i]}
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    seed=42,
    temperature=0.0,
)

results = response.choices[0].message.content
results

'negative'
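
The model returns free text, so nothing guarantees that a reply is exactly one of the three category names. Below is a minimal sketch of normalizing a reply before evaluation. `VALID_LABELS`, `normalize_label`, and the `fallback` value are hypothetical names introduced for illustration and are not part of the original notebook.

```python
VALID_LABELS = ("negative", "neutral", "positive")

def normalize_label(raw: str, fallback: str = "unknown") -> str:
    """Map a raw model reply to one of the valid labels (hypothetical helper)."""
    # strip surrounding whitespace, then quotes and periods, then lowercase
    cleaned = raw.strip().strip("\"'.").lower()
    if cleaned in VALID_LABELS:
        return cleaned
    # accept replies that mention exactly one valid label, e.g. "Stance: negative"
    hits = [label for label in VALID_LABELS if label in cleaned]
    return hits[0] if len(hits) == 1 else fallback

print(normalize_label(' "Negative". '))
```

Replies that mention none or several of the labels map to the fallback value, which makes them easy to count and inspect before computing metrics.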

### Automate

In [11]:
def classify_text(text):
    messages = [ 
        {"role": "system", "content": instructions},
        {"role": "user", "content": text}
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        seed=42,
        temperature=0.0,
    )

    results = response.choices[0].message.content
    return results

In [12]:
samples = df.groupby('label').sample(25, random_state=42).reset_index(drop=True)
samples

| | status_id | text | label |
|:-- |:--------- |:---- |:----- |
| 0 | 1316498227399720960 | BigTech finally went too far censoring conserv... | negative |
| 1 | 1347788874890715138 | A lot of big section 230 talk over the last ye... | negative |
| 2 | 1275813106212655104 | .@Twitter is at it again — censoring @realDona... | negative |
| 3 | 1336723788759752704 | .@Google/@YouTube, under false and non-sensica... | negative |
| 4 | 1340367091904491520 | @zackfox This is why we need to abolish sectio... | negative |
| ... | ... | ... | ... |
| 70 | 1313573294516428804 | @realDonaldTrump If you repeal section 230 of ... | positive |
| 71 | 1348388767032242176 | @generativist Section 230 literally exists to ... | positive |
| 72 | 1346763207344599041 | $2,000 checks, monthly\r\nRestore net neutrali... | positive |
| 73 | 1266180983155494913 | This EO is a reactionary and politicized appro... | positive |


#### Compute costs

See this notebook for details: https://github.com/haukelicht/llm_text_coding/blob/main/code/tokenization_and_costs.ipynb

In [13]:
# tokens in inputs
n_input_tokens = samples.text.apply(token_counter.count_tokens).sum()
# add token count for instructions
n_input_tokens += token_counter(instructions) * len(samples)
# estimate output tokens: assumes one output token (the label) per tweet
n_output_tokens = len(samples)

# compute cost in US dollars at $10 per 1M input tokens and $30 per 1M output tokens (see https://openai.com/pricing)
n_input_tokens/1_000_000*10.00 + n_output_tokens/1_000_000*30.00

0.17465
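
The same arithmetic can be wrapped in a small helper so that the prices are explicit parameters. This is a sketch only: `estimate_cost` and its parameter names are hypothetical, and the default prices mirror the per-million-token rates used in the cell above, which may change over time.

```python
def estimate_cost(n_input_tokens: int, n_output_tokens: int,
                  input_price_per_m: float = 10.00,
                  output_price_per_m: float = 30.00) -> float:
    """Return the estimated cost in US dollars (prices are per 1M tokens)."""
    input_cost = n_input_tokens / 1_000_000 * input_price_per_m
    output_cost = n_output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost

# illustrative numbers, not taken from the data:
# 20,000 input tokens and 75 single-token outputs
cost = estimate_cost(20_000, 75)
print(f"${cost:.5f}")
```

Because the system prompt is resent with every request, input tokens dominate the estimate. Few-shot prompts increase the cost further, since the examples are also resent each time.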

#### Classify all examples

In [18]:
# classify: apply custom classification function to all inputs
results = samples.text.progress_apply(classify_text)

#### Evaluate

In [19]:
# evaluate: compute performance metrics
print(classification_report(samples.label, results.values))

              precision    recall  f1-score   support

    negative       0.55      0.96      0.70        25
     neutral       0.77      0.40      0.53        25
    positive       0.89      0.64      0.74        25

    accuracy                           0.67        75
   macro avg       0.73      0.67      0.66        75
weighted avg       0.73      0.67      0.66        75



With a macro F1 score of only 0.66, the overall performance is meager.
The performance in the "neutral" category is especially low (F1 of 0.53, with a recall of only 0.40).

**_Note_:** Gilardi et al. report an accuracy of ~0.7 (see panel A of their Figure 1).
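
The low recall for "neutral" together with the low precision for "negative" is consistent with many neutral tweets being labeled as negative. One way to check this is to count pairs of true and predicted labels. The sketch below uses only the standard library and toy labels. `confusion_counts` is a hypothetical helper, and `sklearn.metrics.confusion_matrix` provides the same information in matrix form.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

# toy labels for illustration only, not the notebook's results
y_true = ["neutral", "neutral", "neutral", "negative", "positive"]
y_pred = ["negative", "neutral", "negative", "negative", "positive"]

counts = confusion_counts(y_true, y_pred)
for (true_label, pred_label), n in sorted(counts.items()):
    print(f"true={true_label:<8} predicted={pred_label:<8} n={n}")
```

With the notebook's objects, the call would be `confusion_counts(samples.label, results)`.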

## Few-shot classification

In [23]:
examples = df[~df.status_id.isin(samples.status_id)].groupby('label').sample(3, random_state=42)[["status_id", "text", "label"]].reset_index(drop=True)
# reshuffle
examples = examples.sample(frac=1.0, random_state=42)
examples

| | status_id | text | label |
|:-- |:--------- |:---- |:----- |
| 7 | 1344115549937217538 | i don't know what makes me more depressed, tha... | positive |
| 1 | 1354071003010273281 | SO IT'S TRUE, THEY ARE A PUBLISHER AND GETTING... | negative |
| 5 | 1341856271213838336 | @Chris2every @DebbieforFL Changing the names o... | neutral |
| 0 | 1343440457972416513 | @LindseyGrahamSC Section 230 Must Terminated t... | negative |
| 8 | 1266302387280506882 | Ideally, keep Section 230 &amp; require Twitte... | positive |
| 2 | 1382143650889670661 | It’s time we examine the need for Section 230 ... | negative |
| 4 | 1344091130300919808 | #BREAKING: Senate Majority Leader Mitch McConn... | neutral |
| 3 | 1316518172456030208 | Big Tech claims they aren’t biased against Con... | neutral |
| 6 | 1346850718674771968 | STATEMENT: if Democrats control the Senate the... | positive |


In [27]:
def classify_text(text, examples: pd.DataFrame):
    messages = [{"role": "system", "content": instructions}]
    
    for _, d in examples.iterrows():
        messages +=  [   
            {"role": "user", "content": d.text},
            {"role": "assistant", "content": d.label}
        ]

    messages.append({"role": "user", "content": text})
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        seed=42,
        temperature=0.0,
    )

    results = response.choices[0].message.content
    return results

In [28]:
# classify: apply custom classification function to all inputs
results_fs = samples.text.progress_apply(classify_text, examples=examples)

In [29]:
results_fs.value_counts()

text
neutral     29
negative    24
positive    22
Name: count, dtype: int64
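
The few-shot variant of `classify_text` passes each labeled example as a user turn followed by an assistant turn, and then appends the tweet to classify as a final user turn. The sketch below isolates that message-building step as a pure function, so the resulting structure can be inspected without calling the API. `build_fewshot_messages` is a hypothetical name and the example texts are made-up placeholders.

```python
from typing import Dict, List, Tuple

def build_fewshot_messages(instructions: str,
                           examples: List[Tuple[str, str]],
                           text: str) -> List[Dict[str, str]]:
    """Assemble chat messages: system prompt, labeled examples, then the new text."""
    messages = [{"role": "system", "content": instructions}]
    for example_text, example_label in examples:
        # each example is a user turn (tweet) followed by an assistant turn (label)
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": example_label})
    # the tweet to classify comes last, as an unanswered user turn
    messages.append({"role": "user", "content": text})
    return messages

# made-up placeholder texts for illustration only
demo = build_fewshot_messages(
    "Classify the stance.",
    [("Repeal Section 230 now.", "negative"),
     ("Section 230 protects free speech online.", "positive")],
    "Congress holds a hearing on Section 230.",
)
print([m["role"] for m in demo])
```

Keeping the builder separate from the API call also makes it straightforward to count the prompt's tokens before sending any request.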

In [31]:
print(classification_report(samples.label, results_fs.values))

              precision    recall  f1-score   support

    negative       0.79      0.76      0.78        25
     neutral       0.62      0.72      0.67        25
    positive       0.82      0.72      0.77        25

    accuracy                           0.73        75
   macro avg       0.74      0.73      0.74        75
weighted avg       0.74      0.73      0.74        75



In [38]:
zeroshot_f1 = f1_score(samples.label, results.values, average='macro')
fewshot_f1 = f1_score(samples.label, results_fs.values, average='macro')
(round(fewshot_f1/zeroshot_f1, 3)-1)*100 # percentage improvement

12.3
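
For reference, the macro F1 score compared here is the unweighted mean of the per-class F1 scores, each of which is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values in the two classification reports. The helper names `f1` and `macro_f1` are hypothetical. Because the inputs are rounded, a ratio computed from these values will differ slightly from the figure above.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_class):
    """Unweighted mean of per-class F1 scores."""
    scores = [f1(p, r) for p, r in per_class]
    return sum(scores) / len(scores)

# rounded (precision, recall) pairs copied from the two classification reports
# order: negative, neutral, positive
zero_shot = [(0.55, 0.96), (0.77, 0.40), (0.89, 0.64)]
few_shot = [(0.79, 0.76), (0.62, 0.72), (0.82, 0.72)]

print(round(macro_f1(zero_shot), 2), round(macro_f1(few_shot), 2))
```

Both values round to the macro averages shown in the reports (0.66 and 0.74).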