# Using GPT-4 turbo for social group mention detection and extraction in party manifesto sentences

In this notebook, we build on Licht & Sczepanski ([2024]()) to illustrate how to use GPT-4-turbo through the OpenAI chat completions API to perform a word-level classification based entity detection and extraction task.

## Setup

In [8]:
import os
from openai import OpenAI
import tiktoken

import json

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [9]:
MODEL = 'gpt-4-turbo-preview'

In [10]:
from typing import Union, List

class TokenCounter:
    def __init__(self, encoding_name: Union[str, None] = None, model: Union[str, None] = None):
        """
        Initialize the tokenizer with either a model or an encoding name.

        Args:
            encoding_name (Union[str, None]): The name of the encoding to use. Default is None.
            model (Union[str, None]): The model to use for encoding. Default is None.

        Raises:
            ValueError: If neither model nor encoding_name is provided.
            ValueError: If both model and encoding_name are provided.
        """
        # ensure that either model or encoding_name is provided
        if model is None and encoding_name is None:
            raise ValueError("Either `model` or `encoding_name` must be provided.")
        if model is not None and encoding_name is not None:
            raise ValueError("Only one of `model` or `encoding_name` can be provided.")
        if encoding_name:
            self.encoding = tiktoken.get_encoding(encoding_name)
        else:
            self.encoding = tiktoken.encoding_for_model(model)
    
    def count_tokens(self, input: Union[str, List[str]]) -> Union[int, List[int]]:
        """
        Count the number of tokens in the input.

        Args:
            input (Union[str, List[str]]): The input to tokenize. Can be a string or a list of strings.

        Returns:
            Union[int, List[int]]: The number of tokens in the input. If the input is a list, returns a list of token counts.
        """
        if isinstance(input, str):
            return len(self.encoding.encode(input))
        else:
            toks = self.encoding.encode_batch(input)
            return [len(t) for t in toks]

    def __call__(self, input: Union[str, List[str]]) -> Union[int, List[int]]:
        """
        Call the tokenizer on the input. This is equivalent to calling count_tokens.

        Args:
            input (Union[str, List[str]]): The input to tokenize. Can be a string or a list of strings.

        Returns:
            Union[int, List[int]]: The number of tokens in the input. If the input is a list, returns a list of token counts.
        """
        return self.count_tokens(input)

In [11]:
token_counter = TokenCounter(model=MODEL)

## Define the instructions

In [12]:
instructions = """
You will receive a text.
Your task is to detect and extract all mentions of social groups or social group characteristics in the text.

## Definition

By social group, we mean a group of people who share one or more common characteristic.
This can be a common demographic characteristic like age, a common profession, a common experience like being poor, common values, etc.

The following collective actors and entities are _not_ social groups:

- political parties
- groups of political representatives (e.g., "Members of Parliament")
- collective politica actors (e.g., "the government")
- collective economic actors (e.g., "business", "unions", etc.)
- inernational, national, or regional state agencies and institutions (e.g., "the police", "the United Nations", "the European Union", etc.)
- countries, regions, cities, etc.

## Step by step instructions

1. Read the text
2. Identify all social groups or common social characteristics mentioned in the text
3a. If there are none, return the JSON format `{"mentions": []}`
3b. Otherwise extract the individual mentions. Quote them verbatim and do not change spelling, grammar, or capitalization in any way
4. Return the mentions in JSON format `{"mentions": ["mention1", "mention2", "mention3"]}`

## Examples

Text: We seek to bring about a fundamental change in the balance of power and wealth in favour of working people and their families.
Output: ["working people", "their families"]

Text: The new system to help the two million hardest hit families should be in operation this year.
Output: ["the two million hardest hit families"]

Text: Those who have dedicated their lives to the service of the community deserve that stability.
Output: ["Those who have dedicated their lives to the service of the community"]

Text: we believe it is right that those with the broadest shoulders bear the greatest burden
Output: ["those with the broadest shoulders"]
"""
# todo: add negative examples

In [13]:
token_counter(instructions)/1000*0.01 # dollar cents per request

0.00407

## Example

In [14]:
messages = [ 
    {"role": "system", "content": instructions},
    # # positive examples
    # {"role": "user", "content": "The elderly are more likely to suffer from loneliness."}
    # {"role": "user", "content": "The governemnt should do more to help the poor."}
    {"role": "user", "content": "Eight years of meanness towards the needy in our country and towards the wretched of the world."}
    # {"role": "user", "content": "We will build more and better schools for our children."}
    # # negative examples
    # {"role": "user", "content": "The quick brown fox jumps over the lazy dog."}
    # {"role": "user", "content": "We will build more and better schools."}
    # {"role": "user", "content": "The existing tax system is disfunctional."}
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    seed=42,
    temperature=0.0,
    response_format={"type": "json_object"},
)

results = json.loads(response.choices[0].message.content)
results['mentions']

['the needy', 'the wretched of the world']

## Automate

In [15]:
def extract_mentions(text: str, **kwargs) -> list:
    messages = [ 
        {"role": "system", "content": instructions}, # <== hard-coded above
        {"role": "user", "content": text.strip()}
    ]
    
    response = client.chat.completions.create(
        model=MODEL, # <== hard-coded above
        messages=messages,
        seed=42,
        temperature=0.0,
        response_format={"type": "json_object"},
        **kwargs

    )

    result = response.choices[0].message.content
    try:
        result = json.loads(result)
    except:
        print("Error parsing result")
        print(result)
    else:
        if not 'mentions' in result:
            print("Result has no mention element")
        else:
            result = result['mentions']
    return result

In [16]:
extract_mentions("The existing tax system is disfunctional.")

[]

In [17]:
samples = [
    {'id': 'edf9aa0cf50328139532b43e3e3bf034',
    'source': 'The present Government came to office with promises of lower taxation, stable prices, reduced unemployment and increased financial strength.',
    },
    {'id': '1764473b42cda089c82bcb0d0d158ff4',
    'source': 'It is not a recipe for an easy or a perfect life.',
    },
    {'id': '24860534da4cd9aaa13b80ee1a1013b9',
    'source': 'We want to preserve the integrity of the Single Market, by insisting on protections for those countries that have kept their own currencies.',
    },
    {'id': '2f924dc8bb71c5154dc3b0466368df2f',
    'source': 'The Labour leaders have failed to tackle the fundamental economic and social problems at home.',
    },
    {'id': '39bab64f53a0b6ae50b7db2c89450d86',
    'source': 'And we will restore trust in politics with greater transparency and accountability in a system battered by the expenses scandal.',
    },
    {'id': '438adcb200c55bd1de00cef04a2f3224',
    'source': 'An important element will be the establishment of a productivity scheme.',
    },
    {'id': 'cf5c40eeb44ed0c12498bc43bfeb4a38',
    'source': 'We will introduce further measures to impose tighter controls on pyramid selling.',
    },
    {'id': 'bcee6789488a6997f5eb428072af5a3f',
    'source': 'But whatever the outcome in Brussels, the decision will be taken here by the British people.',
    },
    {'id': 'd8b1677617a26c535aa149f72474bb30',
    'source': 'Now we want to give a much bigger say to citizens in all their various capacities - as tenants, shoppers, patients, voters.',
    },
    {'id': '7c6298967f0c89104d2e6d1bc7589d71',
    'source': 'Technologies to come together, coordinate their protests against the state, and communicate with the outside world.',
    },
    {'id': '6df4ea9ddbbd152e28f071f0c768cfbb',
    'source': 'I am confident about our future prosperity, even optimistic, if we have the courage to change and use it to build a better Britain.',
    },
    {'id': 'e18308d3bb585837d29a6e176ca911d9',
    'source': 'We will act quickly to save jobs and stop the further destruction of industry.',
    },
    {'id': 'c9576865cc182890c81346669dacf7e1',
    'source': 'The term covers a whole complex of questions which in one way or another affect the quality of our lives.',
    },
    {'id': '34e0bb4921dfb144bbf75d6be2d158d4',
    'source': 'We will raise the school leaving age to sixteen as planned.',
    },
    {'id': 'ac2db42907d4ae5b319beef66915c81a',
    'source': 'If individuals are to achieve their full creative potential, and our society is to advance, we must substantially improve educational provision and opportunity.',
    },
    {'id': 'abef582381866fb0ab68ae33e91ac452',
    'source': 'There would be short-term disadvantages in Britain going into the European Economic Community which must be weighed against the long-term benefits.',
    },
    {'id': 'a10290c2a329b5fb4b39aae0f2c00d61',
    'source': 'We need to balance the needs of parents and carers, with those of employers, especially small businesses.',
    },
    {'id': '8a692e4687ca4e365265b22e3e5370f5',
    'source': 'We will continue to support the liberation movements of Southern Africa.',
    },
    {'id': 'e7e5e73641167589582acdae9d63c13d',
    'source': 'Labour will promote environmentally sustainable development and encourage new approaches to reduce Third World debt.',
    },
    {'id': 'ac12ab818bd7f49a939e8038dddc68f6',
    'source': 'Almost everywhere in Western Europe and North America the standard of living grows faster than in Britain.',
    },
    {'id': '25ab0d9c3ae3c86d8d8ce12cd6da14c0',
    'source': "Labour's plan for expansion will help the industry back to its feet.",
    },
    {'id': '18590687136d5aa36d1e806e0060ede3',
    'source': 'Our programme is rooted in the instincts of millions of people whose beliefs are mocked by Labour.',
    },
    {'id': 'b843c75108a9f482bd5d54630cee80e9',
    'source': 'Under Labour, the NHS will remain a universal health service, not a second-rate safety net.',
    },
    {'id': 'f7505e13bffedd465a4bf79c5f7a4399',
    'source': 'We have supported changes to how Parliament functions in order to strengthen Select Committees and to give a stronger voice to backbenchers.',
    },
    {'id': 'b495fbdf05dd093a91f201e98166a03d',
    'source': 'There are over one million more jobs in the economy than in 1997.',
    },
    {'id': 'c92194617f4652d880c812ab35453972',
    'source': 'Our laws, which will not discriminate on grounds of sex or race, will respect the right to family life.',
    },
    {'id': '4b7a39843c764f3d6c26acd015acaa22',
    'source': 'We intend to go on working for sound schemes of disarmament and arms control.',
    },
    {'id': 'aa06f2694a4f56cf54477583d5fd631c',
    'source': 'With effective leadership and clear vision, Britain could once again be at the centre of international decision-making instead of at its margins.',
    },
    {'id': '900e54335ec1e5676f382da5b9692e84',
    'source': 'We shall control prices and attack speculation and set a climate fair enough to work together with the unions.',
    },
    {'id': 'f8e9dab355d3c201fd90004f094e758b',
    'source': 'The proceeds will contribute to reparations for their victims and to the upkeep of their own families.',
    },
    {'id': '211ef7930a5ba594f9d0b062d77eb90f',
    'source': 'We guarantee that the Minimum Income Guarantee will be uprated each year in line with earnings, throughout the next Parliament.',
    },
    {'id': 'fb49d2043a449beaf431c62fe477db04',
    'source': 'We must not shackle ourselves or burden our children with that future of failure.',
    },
    {'id': '729064df34babbb9a6ec7524775a2d6a',
    'source': 'The general adoption of 200-mile limits has fundamentally altered the situation which existed when the Treaty of Accession was negotiated.',
    },
    {'id': '737f6785d2f21990e12a738f78dd1c84',
    'source': 'To establish an international disarmament agency to supervise a disarmament treaty.',
    },
    {'id': '9025793f775583749553618f1e9d2f6f',
    'source': 'It is costly, vulnerable to fraud and not geared to environmental protection.',
    },
    {'id': '7155db1631aeacd9861e02e08c1084c5',
    'source': 'We will fulfil our responsibility to hand on a richer and more sustainable natural environment to future generations.',
    },
    {'id': '6bce6f61a34b2cffff7f98007960749e',
    'source': 'We will seek to strengthen parliamentary democracy and introduce state aid for political parties, along the lines of the Houghton Report.',
    },
    {'id': '1d167ffd7d5289b185d89cef78ebc924',
    'source': 'We mean to work for more until the danger of war is eliminated.',
    },
    {'id': '221ac4a1ed7256f761c8c9f6b622c6c2',
    'source': 'We will continue to seek value for money in defence procurement, recognising the important contribution that the UK defence industry makes to our prosperity.',
    },
    {'id': '93f0bd1f1437c9d5d8d9d80ddb8cc94f',
    'source': 'This system is obviously unfair, particularly since the lower paid get nothing at all.',
    },
    {'id': '6dfef5984732db96eabc374f5d6bdd76',
    'source': 'We are committed to tackling these problems, not talking them up to run Britain down.',
    },
    {'id': '9d207f1b892ac3bc1deed41471fdacdf',
    'source': 'So we have raised the stamp duty threshold from £60,000 to £120,000 for residential properties, exempting an extra 300,000 homebuyers from stamp duty every year.',
    },
    {'id': 'b2534f4f7e1546d373e8d1c08669459f',
    'source': "This would reduce the farmers' security and push up food prices to new high levels.",
    },
    {'id': 'f7a40f122b09c3297c25a4e5531abd54',
    'source': 'We propose a new approach to law and order: tough on crime and tough on the causes of crime.',
    },
    {'id': 'da588472a92652e4b6fcb869538920cf',
    'source': 'But it is the world outside Europe that now presents the greatest challenge and the greatest danger to mankind.',
    },
    {'id': '9940ca40018808b84cb261e9b0f5bdc3',
    'source': 'It will mean strong and continued emphasis on investment for economic strength.',
    },
    {'id': '8e37a2e859de42da33e7b759591a3244',
    'source': 'What is more, its decision should be aimed at the long term.',
    },
    {'id': '9a7ee834f64427c8154549c4f0e834bb',
    'source': 'Ensure fairness for food producers through EU reform and a Supermarkets Ombudsman; and support post offices, shops and pubs in rural communities.',
    },
    {'id': '113d3e94d84cef6c53d02dc2ab5c1b6e',
    'source': 'We will continue to promote the golden thread of democracy, the rule of law, property rights, a free media and open, accountable institutions.',
    },
    {'id': '26af558db82e3345be1419c7d6c34cda',
    'source': 'The people of Northern Ireland will continue to be offered a framework for participation in local democracy and political progress through the Assembly.',
    },
]

Compute the expected cost:

In [18]:
# tokens in inputs
n_tokens = sum(token_counter(s['source']) for s in samples)
# add token count for instructions
n_tokens += token_counter(instructions) * len(samples)
# add expected token count for outputs (multiplied by cost factor for output vs. input)
n_tokens += n_tokens*.2 * 3

# comopute cost (see https://openai.com/pricing)
n_tokens/1000*0.01 # dollar cents

0.341376

In [19]:
from tqdm.auto import tqdm
n = len(samples)
for i in tqdm(range(n), total=n):
    samples[i]["label"] = extract_mentions(samples[i]["source"])

  0%|          | 0/50 [00:00<?, ?it/s]

In [93]:
samples

[{'id': 'edf9aa0cf50328139532b43e3e3bf034',
  'source': 'The present Government came to office with promises of lower taxation, stable prices, reduced unemployment and increased financial strength.',
  'label': []},
 {'id': '1764473b42cda089c82bcb0d0d158ff4',
  'source': 'It is not a recipe for an easy or a perfect life.',
  'label': []},
 {'id': '24860534da4cd9aaa13b80ee1a1013b9',
  'source': 'We want to preserve the integrity of the Single Market, by insisting on protections for those countries that have kept their own currencies.',
  'label': ['those countries that have kept their own currencies']},
 {'id': '2f924dc8bb71c5154dc3b0466368df2f',
  'source': 'The Labour leaders have failed to tackle the fundamental economic and social problems at home.',
  'label': ['The Labour leaders']},
 {'id': '39bab64f53a0b6ae50b7db2c89450d86',
  'source': 'And we will restore trust in politics with greater transparency and accountability in a system battered by the expenses scandal.',
  'label': [

In [32]:
text = 'The people of Northern Ireland will continue to be offered a framework for participation in local democracy and political progress through the Assembly.'
label = ['The people of Northern Ireland']

In [33]:
import re
import numpy 

spans = [re.match(l, text).span() for l in label]

In [34]:
spans

[(0, 30)]

In [82]:
import regex
toks = regex.split(r"(?= )", text)
print(toks)
import numpy as np
import pandas as pd

lengths = np.array([len(t) for t in toks])
ends = lengths.cumsum()
starts = np.empty_like(lengths)
#starts[1:] = (lengths[:-1]+np.arange(len(lengths))[1:]).cumsum()
starts[1:] = ends[:-1]
starts[0] = 0
toks, starts, ends
df = pd.DataFrame({"token": toks, "start": starts, "end": ends  })
df

['The', ' people', ' of', ' Northern', ' Ireland', ' will', ' continue', ' to', ' be', ' offered', ' a', ' framework', ' for', ' participation', ' in', ' local', ' democracy', ' and', ' political', ' progress', ' through', ' the', ' Assembly.']


Unnamed: 0,token,start,end
0,The,0,3
1,people,3,10
2,of,10,13
3,Northern,13,22
4,Ireland,22,30
5,will,30,35
6,continue,35,44
7,to,44,47
8,be,47,50
9,offered,50,58


In [110]:
df["label"] = [0]*len(df)

for (s, e) in spans:
    idxs = np.logical_and(df.start >= s, df.end <= e)
    df.loc[idxs, "label"] = 2
    idx = idxs.index[idxs].min()
    df.loc[idx, "label"] = 1

In [111]:
id2label = {0: "O", 1: "B", 2: "I"}
df["annotation"] = df.label.map(id2label)
df

Unnamed: 0,token,start,end,label,annotation
0,The,0,3,1,B
1,people,3,10,2,I
2,of,10,13,2,I
3,Northern,13,22,2,I
4,Ireland,22,30,2,I
5,will,30,35,0,O
6,continue,35,44,0,O
7,to,44,47,0,O
8,be,47,50,0,O
9,offered,50,58,0,O
