# Using GPT-4 Turbo for pairwise comparison

| Authors | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2024-03-25 |


In this notebook, we take data collected by Hargrave and Blumenau ([2022](https://doi.org/10.1017/S0007123421000648)) to illustrate how to use GPT-4-turbo through the OpenAI chat completions API to classify texts.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
MODEL = 'gpt-4-turbo-preview'

import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()

from sklearn.metrics import classification_report

In [2]:
base_path = os.path.join('..', '..')
data_path = os.path.join(base_path, 'data', 'hargrave_no_2022') 

fp = os.path.join(data_path, "hargrave_no_2022_comparisons_by_style.tsv")
df_all = pd.read_csv(fp, sep="\t")

In [3]:
df_all["style"].value_counts()

style
Negative Emotion    119
Fact                 98
Complexity           97
Positive Emotion     90
Repetition           89
Aggression           86
Affect               82
Human Narrative      73
Name: count, dtype: int64

Let's see what the numeric labels mean:

In [4]:
df_all[['comparison', 'comparison_label']].drop_duplicates().reset_index(drop=True)

| | comparison | comparison_label |
|--:|--:|:--|
| 0 | -1 | `text two more representative of <STYLE>` |
| 1 | 1 | `text one more representative of <STYLE>` |
| 2 | 0 | `both similarly representative of <STYLE>` |


Let's create a "mapping" of numeric label indicators to label names:

In [5]:
id2label = {1: 'first', -1: 'second', 0: 'none'}

## Affect/Emotional language

Let's first focus on the "emotionality" comparisons: 

In [6]:
# copy the subset so that adding columns below does not write to a view of df_all
df = df_all[df_all["style"] == "Affect"].copy()
df.head(1)

| | style | pair_id | text1_id | text2_id | text1 | text2 | intercoder | comparison | intensity1 | intensity1_num | intensity2 | intensity2_num | comparison_label |
|--:|:--|:--|:--|:--|:--|:--|:--|--:|:--|:--|:--|:--|:--|
| 0 | Affect | 05c0929cebf188730eec1d9c927d550a | d2ed8334934ee49dc9fb2d8f04f29f7e | 3240fc883037f763bebe7e46351ab563 | What have the families of British service pers... | I repeat the apology to the families of people... | False | -1 | | | | | `text two more representative of <STYLE>` |


In [7]:
# construct input 
def construct_input(row):
    return f'"""{row["text1"]}"""\n"""{row["text2"]}"""'

df['input'] = df.apply(construct_input, axis=1).tolist()
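
To make the input format concrete, here is a minimal sketch that applies the same function to a single made-up pair. The two sentences are invented for illustration and are not taken from the data.

```python
# Hypothetical row; both sentences are made up for illustration.
row = {
    "text1": "I thank the Minister for her answer.",
    "text2": "This policy is a disgrace and a betrayal.",
}

def construct_input(row):
    # same format as above: one triple-quoted sentence per line
    return f'"""{row["text1"]}"""\n"""{row["text2"]}"""'

example_input = construct_input(row)
print(example_input)
```

The first line of the printed string is "sentence one" and the second line is "sentence two" in the instructions defined next.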

In [8]:
# adapt instructions from Benoit et al.'s original crowd coding instructions (see data/benoit_crowdsourced_2016/instructions/) 
instructions = """
Your task is to select the sentence which you believe uses more emotional language, which might be either positive or negative emotion; such as expressing criticism, praise, disapproval, pride, empathy or fear.

The sentences you will be asked to compare come from parliamentary speeches in the UK *House of Commons*.

The pair of sentences will be in separate lines in the input and each sentence will be wrapped in triple quotes.

Choose one of the following categories:

- first: Sentence one (i.e., the sentence in the first line) uses more emotional language
- second: Sentence two (i.e., the sentence in the second line) uses more emotional language
- none: Both sentences use about equal amounts of emotional language

Choose the "none" category when you are not certain that either of the two sentences uses more emotional language.

Return only the category you choose and no further text in your response.
"""

### Zero-shot

In [9]:
def classify_pair(input):
    messages = [ 
        {"role": "system", "content": instructions},
        {"role": "user", "content": input}
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        seed=42,
        temperature=0.0,
    )

    results = response.choices[0].message.content
    return results
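
Although the instructions ask for the bare category, a chat model may still add quotes, capitalization, or punctuation. The following is a minimal sketch of how responses could be cleaned before evaluation. The helper `normalize_label` and the constant `VALID_LABELS` are hypothetical and not part of the original notebook.

```python
VALID_LABELS = {"first", "second", "none"}  # categories named in the instructions

def normalize_label(raw):
    """Map a raw model response to a valid label, or None if it cannot be parsed."""
    # strip surrounding whitespace, quotes, markdown markers, and punctuation
    cleaned = raw.strip(" \t\n\"'`*.!").lower()
    return cleaned if cleaned in VALID_LABELS else None

print(normalize_label(' "Second." '))            # → second
print(normalize_label("I think the first one"))  # → None
```

Responses that map to `None` can then be inspected by hand rather than silently counted as errors.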

In [10]:
# take a sample of 25 pairs from each label class (excluding ties, i.e., comparison == 0)
samples = df[df.comparison != 0].groupby('comparison').sample(25, random_state=42)[["pair_id", "input", "comparison"]].reset_index(drop=True)
samples

| | pair_id | input | comparison |
|--:|:--|:--|--:|
| 0 | 542e9f2aac7f84ee8c99fbc185482377 | """Even though adjacent councils may be of dif... | -1 |
| 1 | d8a6d65fb2d3bda566a1299c1ccaf5f4 | """Although such trends and behaviour may disa... | -1 |
| 2 | 4d0212737762904e086a9f45e6ec125a | """However, even allowing for the difficulties... | -1 |
| 3 | b546655ac88f1f9db7c07630a6aa1549 | """99 B ) on certain of the days specified un... | -1 |
| 4 | 05c0929cebf188730eec1d9c927d550a | """What have the families of British service p... | -1 |
| 5 | 70d24421f01cc6143224ae107020c7eb | """I am not sure that the hon Gentleman serio... | -1 |
| 6 | 9e9b5f9916d3cb3210505f68841c5fcb | """In the past couple of days, we have heard y... | -1 |
| 7 | c82aca2de8c49bb7f2cce8c9a6a8ea88 | """I have raised the matter several times duri... | -1 |
| 8 | 7059c5705694dbe59ab1e7a70e76c35a | """Conservative Members think that something i... | -1 |
| 9 | 82a031b3870e4922cd55ee56f08ed98e | """It prompts one to ask why there are not oth... | -1 |


In [11]:
# test
i = 0
print('LABEL:', id2label[samples['comparison'][i]])
print('CLASSIFICATION:' , classify_pair(samples['input'][i]))

LABEL: second
CLASSIFICATION: second


In [12]:
results = samples.input.progress_apply(classify_pair)


In [None]:
# evaluate: compute performance metrics
print(classification_report(samples.comparison.map(id2label), results.values, labels=["first", "second"]))

### Few-shot

In [None]:
examples = df[~df.pair_id.isin(samples.pair_id)].groupby('comparison').sample(2, random_state=42)[["pair_id", "input", "comparison"]].reset_index(drop=True)
# reshuffle
examples = examples.sample(frac=1.0, random_state=42)
# convert numeric to string labels
examples['label'] = examples['comparison'].map(id2label)

examples

In [None]:
def classify_pair(input, examples: pd.DataFrame):
    messages = [{"role": "system", "content": instructions}]
    
    for _, d in examples.iterrows():
        messages +=  [   
            {"role": "user", "content": d.input},
            {"role": "assistant", "content": d.label}
        ]

    messages.append({"role": "user", "content": input})
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        seed=42,
        temperature=0.0,
    )

    results = response.choices[0].message.content
    return results
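
The few-shot variant differs from the zero-shot one only in the chat history it sends: each labeled example becomes a user turn followed by an assistant turn holding the correct label. A minimal sketch of that structure, without calling the API, could look as follows. The helper `build_messages` and the placeholder strings are hypothetical and only serve to show the order of the messages.

```python
def build_messages(instructions, examples, new_input):
    """Assemble the chat history: system prompt, labeled examples, then the new pair."""
    messages = [{"role": "system", "content": instructions}]
    for example_input, label in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": label})
    # the pair to classify always comes last
    messages.append({"role": "user", "content": new_input})
    return messages

# made-up placeholders, not drawn from the data
toy_examples = [("<pair A>", "first"), ("<pair B>", "none")]
messages = build_messages("<instructions>", toy_examples, "<new pair>")
print([m["role"] for m in messages])
# → ['system', 'user', 'assistant', 'user', 'assistant', 'user']
```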

In [None]:
results_fs = samples.input.progress_apply(classify_pair, examples=examples)

In [None]:
# evaluate: compute performance metrics
print(classification_report(samples.comparison.map(id2label), results_fs.values, labels=["first", "second"]))

**_Note:_** Adding few-shot examples yields some gain relative to the zero-shot baseline.

## Complexity

In [None]:
# copy the subset so that adding columns below does not write to a view of df_all
df = df_all[df_all["style"] == "Complexity"].copy()

With pairwise data, it's also useful to "augment" the labeled data by reversing the order of each pair and adapting the comparison label accordingly.
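
Because the comparison is coded as 1 (text one), -1 (text two), and 0 (both similar), adapting the label of a reversed pair amounts to negating it. A minimal sketch, where the helper name `reverse_pair` is hypothetical and not part of the notebook:

```python
def reverse_pair(text1, text2, comparison):
    """Swap the two sentences and flip the label (1 <-> -1; 0 stays 0)."""
    return text2, text1, -comparison

print(reverse_pair("sentence A", "sentence B", 1))
# → ('sentence B', 'sentence A', -1)
```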

In [None]:
# construct input 
def construct_input(row, reversed=False) -> str:
    out = f'"""{row["text2"]}"""\n"""{row["text1"]}"""' if reversed else f'"""{row["text1"]}"""\n"""{row["text2"]}"""'
    return out

df.loc[:,'input'] = df.apply(construct_input, axis=1).tolist()
df.loc[:,'input_reversed'] = df.apply(construct_input, reversed=True, axis=1).tolist()

In [None]:
# adapt instructions from Benoit et al.'s original crowd coding instructions (see data/benoit_crowdsourced_2016/instructions/) 
instructions = """
You will be presented with a pair of sentences. The pair of sentences will be in separate lines in the input and each sentence will be wrapped in triple quotes.

Your task is to judge which of the two sentences (if any) uses more complex language, where complexity is defined as the use of elaborate and sophisticated language that is challenging to read and understand.

Choose one of the following categories:

- first: Sentence one (i.e., the sentence in the first line) uses more complex language
- second: Sentence two (i.e., the sentence in the second line) uses more complex language
- none: Both sentences use about equally complex language

Choose the "none" category when you are not certain that either of the two sentences uses more complex language.

Return only the category you choose and no further text in your response.
"""

In [None]:
def classify_pair(input):
    messages = [ 
        {"role": "system", "content": instructions},
        {"role": "user", "content": input}
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        seed=42,
        temperature=0.0,
    )

    results = response.choices[0].message.content
    return results

In [None]:
# take a sample of 10 pairs from each label class
samples = df.groupby('comparison').sample(10, random_state=42)[["pair_id", "input", "input_reversed", "comparison"]].reset_index(drop=True)

In [None]:
# test
i = 0
print('LABEL:', id2label[samples['comparison'][i]])
print('CLASSIFICATION:', classify_pair(samples['input'][i]))
print('CLASSIFICATION (reversed):', classify_pair(samples['input_reversed'][i]))

In [None]:
results = samples.input.progress_apply(classify_pair)

In [None]:
# evaluate: compute performance metrics
print(classification_report(samples.comparison.map(id2label), results.values))

In [None]:
results_reversed = samples.input_reversed.progress_apply(classify_pair)

In [None]:
reversed_recode_map = {"first": "second", "second": "first", "none": "none"}

tmp = pd.DataFrame([results, results_reversed]).T
tmp.columns = ["forward", "backwards"]
tmp["backwards"] = tmp.backwards.map(reversed_recode_map)
# subset to cases where the classification differs if sentences are swapped
tmp[tmp.forward != tmp.backwards]

As the above table shows, there are a few cases where swapping the order of the sentences in a pair changes the classification.
This suggests that the model's judgment for these pairs is not robust, and we can use this information to recode them into the "none" category:

In [None]:
results_edited = results.copy()
# let's update the classifications for those where it depends on the order of the sentences in the pair
idxs = tmp[tmp.forward != tmp.backwards].index
results_edited[idxs] = "none"

# recompute the performance metrics
print(classification_report(samples.comparison.map(id2label), results_edited.values))
# note: creates slight improvements (but with these few examples, it's hard to say if this is a generalizable pattern)

This simple trick increases classification performance a little, although it is pricey because it doubles the number of API calls.
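
To understand why a pair is order-inconsistent, it can help to tally the raw answers from both orderings. The sketch below uses made-up classifications (not results from this notebook). A pair answered "first" in both orders, for example, suggests that the model favored the first position rather than the content of the sentences.

```python
from collections import Counter

# made-up raw classifications: (answer in original order, answer in reversed order)
raw = [("first", "second"), ("first", "first"), ("second", "second"), ("none", "first")]

flip = {"first": "second", "second": "first", "none": "none"}
# a pair is order-consistent if the reversed-order answer, mapped back, matches the forward answer
inconsistent = [(fwd, rev) for fwd, rev in raw if fwd != flip[rev]]
print(Counter(inconsistent))
```

Here only the first pair is consistent; the remaining three would be recoded to "none" under the rule used above.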