# Comparison of results between GPT-3.5-turbo, GPT-4o-mini and human annotator
In the following notebook there are compared various possible methods which are either used or could be used by the company. Currently the company is using GPT-3.5-turbo model to annotate the category of the given text. Nowadays, better models are available. What is more, more cleaning of the data can be implemented, to increase the performance of used models.

Here:
* each experiment consists of input data, for which predictions by GPT-3.5-turbo and GPT-4o-mini are made, each 3 times,
* the predictions are made separately for each of the types of the tasks: 1) category classification, 2) sentiment and 3) recognizing companies.
* each prediction is compared with human annotated labels. The final score of each model is an average of scores for each of the tasks separately.
* there are four types of input data: 1) not preprocessed, 2) preprocessed by GPT-4o-mini model, 3) preprocessed using regex definded by human annotator based on randomly chosen samples, 4) preprocessed by human annotator.

## Setup

In [1]:
from openai import OpenAI
import os
import numpy as np
import pandas as pd
import time

from config_experiments import (
    OPEN_AI_KEY,
    PATH_TO_SAVE_RESULTS
)

from utils import (
    load_data_preprocessed
)

In [4]:
client = OpenAI(api_key=OPEN_AI_KEY)

In [62]:
df_gt=load_data_preprocessed('gt')
not_preprocessed_data = load_data_preprocessed('not_preprocessed_data')
processed_4o_data = load_data_preprocessed('processed_4o_data')
processed_regex_data = load_data_preprocessed('processed_regex_data')
processed_human_data = load_data_preprocessed('processed_human_data')

In [6]:
not_preprocessed_data[3]

"The BBC's director-general has tried to calm tensions among staff over the Israel-Gaza war, telling them to think 'carefully about the language that you use' in the workplace. Tim Davie said employees needed to be 'kind' to each other and 'no one should ever face any fear or prejudice' at their place of work. His email to his workforce comes as one insider at the corporation reported there had been arguments between staff with different views on the Middle-East crisis. There have already been reports of how Jewish staff had been upset at the corporation's refusal to describe Hamas fighters as terrorists. But it has also been claimed that some of its own journalists believe the BBC has been too soft on Israel and was 'dehumanising' Palestinian civilians. Yesterday in an apparent recognition of the tensions among staff, Mr Davie called for calm and respect in the workplace over the issue. Mr Davie has called for calm and respect in the workplace over the issue A mother covers her child'

### Functions

In [7]:
def ask_model(data, which_task = "category", which_model="gpt-3.5-turbo"):
  answers=["" for _ in range(len(data))]
  answers_confidence=["" for _ in range(len(data))]
  for i in range(len(data)):
    if which_task=="category":
      prompt=f"""You are a text classification endpoint, classifying given text into categories:
      human_employee_rights
      diversity_equity_inclusion
      environment
      animal_care
      corporate_transparency
      business_involvement
      political_and_religious_views

      If you are not sure, return other.
      Return only name of the category.

      Texts to classify:

      {data[i]}
      """
    elif which_task=="sentiment":
      prompt=f"""You are a sentiment classification endpoint, classifying given text into: positive, neutral, negative.
      Return only the sentiment: 'positive', 'neutral', 'negative'.
      Texts to classify:

      {data[i]}
      """
    elif which_task=="company":
      prompt=f"""You are a NER endpoint, finding names of the companies occuring in the article.
      Return only those names. If there are more than one companies, separate each name with ','.
      If there will be no company in the article, return 'no company'.

      Text:

      {data[i]}
      """
    else:
      return answers, answers_confidence

    completion = client.chat.completions.create(
      model=which_model,
      seed=123,
      messages=[
        {"role": "user", "content": prompt}
      ],
      logprobs=True,
      top_logprobs=5,
      temperature=0
    )

    answers[i] = completion.choices[0].message.content
    power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
    answers_confidence[i] = np.round(np.exp(power)*100, 1)

  return answers, answers_confidence

In [8]:
def conduct_experiments(input_data, type_of_data="notpreprocessed", which_model="gpt-3.5-turbo", exp_number=1):
  category_not_preprocessed=[["" for _ in range(len(input_data))] for _ in range(3)]
  category_not_preprocessed_confidence=[["" for _ in range(len(input_data))] for _ in range(3)]
  sentiment_not_preprocessed=[["" for _ in range(len(input_data))] for _ in range(3)]
  sentiment_not_preprocessed_confidence=[["" for _ in range(len(input_data))] for _ in range(3)]
  company_not_preprocessed=[["" for _ in range(len(input_data))] for _ in range(3)]
  company_not_preprocessed_confidence=[["" for _ in range(len(input_data))] for _ in range(3)]
  print("Lists created.")

  for i in range(3):
    category_not_preprocessed[i], category_not_preprocessed_confidence[i]=ask_model(input_data, which_task = "category", which_model=which_model)
    print(f"{type_of_data} number {i} category done.")
    sentiment_not_preprocessed[i], sentiment_not_preprocessed_confidence[i]=ask_model(input_data, which_task = "sentiment", which_model=which_model)
    print(f"{type_of_data} number {i} sentiment done.")
    company_not_preprocessed[i], company_not_preprocessed_confidence[i]=ask_model(input_data, which_task = "company", which_model=which_model)
    print(f"{type_of_data} number {i} company done.")
  print("Asking model done.")

  #save to files
  df_category_not_preprocessed=pd.DataFrame(category_not_preprocessed)
  df_category_not_preprocessed.to_csv(f"{PATH_TO_SAVE_RESULTS}experiment_{exp_number}/category_{type_of_data}.csv", index=False, sep=';')
  df_category_not_preprocessed_confidence=pd.DataFrame(category_not_preprocessed_confidence)
  df_category_not_preprocessed_confidence.to_csv(f"{PATH_TO_SAVE_RESULTS}experiment_{exp_number}/category_{type_of_data}_confidence.csv", index=False, sep=';')

  df_sentiment_not_preprocessed=pd.DataFrame(sentiment_not_preprocessed)
  df_sentiment_not_preprocessed.to_csv(f"{PATH_TO_SAVE_RESULTS}experiment_{exp_number}/sentiment_{type_of_data}.csv", index=False, sep=';')
  df_sentiment_not_preprocessed_confidence=pd.DataFrame(sentiment_not_preprocessed_confidence)
  df_sentiment_not_preprocessed_confidence.to_csv(f"{PATH_TO_SAVE_RESULTS}experiment_{exp_number}/sentiment_{type_of_data}_confidence.csv", index=False, sep=';')

  df_company_not_preprocessed=pd.DataFrame(company_not_preprocessed)
  df_company_not_preprocessed.to_csv(f"{PATH_TO_SAVE_RESULTS}experiment_{exp_number}/company_{type_of_data}.csv", index=False, sep=';')
  df_company_not_preprocessed_confidence=pd.DataFrame(company_not_preprocessed_confidence)
  df_company_not_preprocessed_confidence.to_csv(f"{PATH_TO_SAVE_RESULTS}experiment_{exp_number}/company_{type_of_data}_confidence.csv", index=False, sep=';')

  #calculate scores
  score_category_not_preprocessed=0
  score_sentiment_not_preprocessed=0
  score_company_not_preprocessed=0

  for i in range(3):
    #lowercase all elements in all lists
    category_not_preprocessed[i]=[x.lower() for x in category_not_preprocessed[i]]
    sentiment_not_preprocessed[i]=[x.lower() for x in sentiment_not_preprocessed[i]]
    score_category_not_preprocessed+=np.sum(category_not_preprocessed[i]==df_gt['Category'])
    score_sentiment_not_preprocessed+=np.sum(sentiment_not_preprocessed[i]==df_gt['Sentiment'])
    #for company the score will be to count the number of elements that are both in prediction and gt, divided by length of list in gt
    for j in range(len(company_not_preprocessed)):
      num_of_same_elements=len(set(company_not_preprocessed[i][j].split(",")) & set(df_gt['Company'][j].split(",")))
      score_company_not_preprocessed += num_of_same_elements/len(df_gt['Company'][j].split(","))

  score_category_not_preprocessed /= 3
  score_sentiment_not_preprocessed /= 3
  score_company_not_preprocessed /= 3

  print(f"Score for {type_of_data} data for category for {which_model} is {score_category_not_preprocessed}.")
  print(f"Score for {type_of_data} for sentiment for {which_model} is {score_sentiment_not_preprocessed}.")
  print(f"Score for {type_of_data} for company for {which_model} is {score_company_not_preprocessed}.")

  f = open(f"{PATH_TO_SAVE_RESULTS}experiment_{exp_number}/scores_{type_of_data}_model_{which_model}.txt", "a")
  f.write(f"Category: {score_category_not_preprocessed}, sentiment: {score_sentiment_not_preprocessed}, company: {score_company_not_preprocessed}.")
  f.close()

## Examples of prompts and results

#### For GPT-4o-mini

In [9]:
not_preprocessed_data=not_preprocessed_data[:10]

In [10]:
answers=["" for _ in range(len(not_preprocessed_data))]
answers_confidence=["" for _ in range(len(not_preprocessed_data))]
sum_time=0
for i in range(len(not_preprocessed_data)):
  start_time=time.time()
  article =f"""You are a text classification endpoint, classifying given text into categories:
  human_employee_rights
  diversity_equity_inclusion
  environment
  animal_care
  corporate_transparency
  business_involvement
  political_and_religious_views

  If you are not sure, return other.
  Return only name of the category.

  Texts to classify:

  {not_preprocessed_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )
  end_time=time.time()
  answers[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  answers_confidence[i] = np.round(np.exp(power)*100, 1)
  sum_time+=end_time-start_time
  print(answers[i], answers_confidence[i], end_time-start_time)

print(f"Average time: {sum_time/len(not_preprocessed_data)}")

human_employee_rights 100.0 0.6447861194610596
human_employee_rights 100.0 0.7493846416473389
human_employee_rights 100.0 0.5567820072174072
diversity_equity_inclusion 100.0 0.7454171180725098
human_employee_rights 100.0 0.5667414665222168
diversity_equity_inclusion 100.0 0.5902402400970459
diversity_equity_inclusion 100.0 0.7475118637084961
diversity_equity_inclusion 100.0 0.6907482147216797
environment 100.0 0.4636249542236328
environment 100.0 0.5377843379974365
Average time: 0.6293020963668823


In [11]:
sentiment=["" for _ in range(len(not_preprocessed_data))]
sentiment_confidence=["" for _ in range(len(not_preprocessed_data))]
company=["" for _ in range(len(not_preprocessed_data))]
company_confidence=["" for _ in range(len(not_preprocessed_data))]
sum_time=0
sum_time_2=0
for i in range(len(not_preprocessed_data)):
  start_time=time.time()
  article =f"""You are a sentiment classification endpoint, classifying given text into: positive, neutral, negative.
  Return only the sentiment.
  Texts to classify:

  {not_preprocessed_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )
  end_time=time.time()
  sum_time+=end_time-start_time
  sentiment[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  sentiment_confidence[i] = np.round(np.exp(power)*100, 1)
  print(sentiment[i], sentiment_confidence[i])
  print(end_time-start_time)

  start_time_2=time.time()
  article =f"""You are a NER endpoint, finding names of the companies occuring in the article.
  Return only those names. If there are more than one companies, separate each name with ','.
  If there will be no company in the article, return 'no company'.

  Text:

  {not_preprocessed_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )
  end_time_2=time.time()
  sum_time_2+=end_time_2-start_time_2
  company[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  company_confidence[i] = np.round(np.exp(power)*100, 1)
  print(company[i], company_confidence[i])
  print(end_time_2-start_time_2)

print(f"Average time sentiment: {sum_time/len(not_preprocessed_data)}")
print(f"Average time NER: {sum_time_2/len(not_preprocessed_data)}")

negative 73.1
0.7310116291046143
Amazon 100.0
0.43302392959594727
neutral 63.4
0.6770398616790771
Starbucks, Workers United, NBC Los Angeles, KUOW, The New York Times, The Philadelphia Inquirer, Placer.ai 100.0
0.9027674198150635
negative 90.2
0.44138503074645996
Amazon, Tripadvisor, Microsoft 99.6
1.2336509227752686
neutral 83.8
0.9065001010894775
BBC, Hamas, The Times, Israel 100.0
0.8046102523803711
negative 85.2
0.8811581134796143
Amazon 100.0
0.6908962726593018
negative 49.8
0.7682647705078125
BBC, GB News, LBC, Opera Holland Park, NCTJ, MailOnline 100.0
0.7556369304656982
negative 92.4
0.7898740768432617
BBC, Reform UK, Free Speech Union 100.0
0.6687710285186768
negative 56.6
0.6208770275115967
Aviva, BP, CBI, AXA, Zurich 100.0
0.8453435897827148
negative 86.7
0.5124692916870117
BP, British Museum, Greenpeace, National Portrait Gallery, Tate, Culture Unstained 98.6
0.6523213386535645
positive 99.3
0.43262600898742676
Carbonfuture, Microsoft, Exomad Green 100.0
0.6782064437866211


In [12]:
df = pd.DataFrame(columns=['Article', 'Category'])
df['Article'] = not_preprocessed_data
df['Category'] = answers
df['Certainity_category'] = answers_confidence
df['Sentiment'] = sentiment
df['Certainity_sentiment'] = sentiment_confidence
df['Company'] = company
df['Certainity_company'] = company_confidence
df.to_csv(PATH_TO_SAVE_RESULTS+"results_4o_mini_logprob.csv", index=False, sep=';')

#### For GPT-3.5-turbo

In [13]:
answers_35=["" for _ in range(len(not_preprocessed_data))]
answers_confidence_35=["" for _ in range(len(not_preprocessed_data))]
sentiment_35=["" for _ in range(len(not_preprocessed_data))]
sentiment_confidence_35=["" for _ in range(len(not_preprocessed_data))]
company_35=["" for _ in range(len(not_preprocessed_data))]
company_confidence_35=["" for _ in range(len(not_preprocessed_data))]
sum_time_class=0
sum_time_sent=0
sum_time_NER=0
for i in range(len(not_preprocessed_data)):
  start_time_class=time.time()
  article =f"""You are a text classification endpoint, classifying given text into categories:
  human_employee_rights
  diversity_equity_inclusion
  environment
  animal_care
  corporate_transparency
  business_involvement
  political_and_religious_views

  If you are not sure, return other.
  Return only name of the category.

  Texts to classify:

  {not_preprocessed_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )
  end_time_class=time.time()

  answers_35[i] = completion.choices[0].message.content
  power_35=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  answers_confidence_35[i] = np.round(np.exp(power_35)*100, 1)
  print(answers_35[i], answers_confidence_35[i])

  start_time_sent=time.time()
  article =f"""You are a sentiment classification endpoint, classifying given text into: positive, neutral, negative.
  Return only the sentiment.
  Texts to classify:

  {not_preprocessed_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
    )

  end_time_sent=time.time()

  sentiment_35[i] = completion.choices[0].message.content
  power_35=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  sentiment_confidence_35[i] = np.round(np.exp(power_35)*100, 1)
  print(sentiment_35[i], sentiment_confidence_35[i])

  start_time_NER=time.time()
  article =f"""You are a NER endpoint, finding names of the companies occuring in the article.
  Return only those names. If there are more than one companies, separate each name with ','.
  If there will be no company in the article, return 'no company'.

  Text:

  {not_preprocessed_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5
  )
  end_time_NER=time.time()
  company_35[i] = completion.choices[0].message.content
  power_35=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  company_confidence_35[i] = np.round(np.exp(power_35)*100, 1)
  print(company_35[i], company_confidence_35[i])
  sum_time_class+=end_time_class-start_time_class
  sum_time_sent+=end_time_sent-start_time_sent
  sum_time_NER+=end_time_NER-start_time_NER
print(f"Average time classification: {sum_time_class/len(not_preprocessed_data)}")
print(f"Average time sentiment: {sum_time_sent/len(not_preprocessed_data)}")
print(f"Average time NER: {sum_time_NER/len(not_preprocessed_data)}")

human_employee_rights 100.0
Negative 93.3
Amazon 100.0
human_employee_rights 81.5
Negative 92.2
Starbucks, Workers United, National Labor Relations Board 99.9
human_employee_rights 50.0
Negative 84.4
Amazon 98.2
human_employee_rights 98.0
Negative 85.3
BBC, Hamas 96.4
human_employee_rights 100.0
Negative 86.6
Amazon 100.0
diversity_equity_inclusion 100.0
Negative 79.6
BBC, X 99.6
diversity_equity_inclusion 100.0
Negative 66.1
BBC 99.5
diversity_equity_inclusion 100.0
Neutral 61.8
Aviva 100.0
environment 97.7
Negative 88.7
British Museum, BP, Greenpeace, Extinction Rebellion, National Portrait Gallery, Tate, Culture Unstained, Greenpeace UK 60.9
environment 100.0
Positive 60.3
Carbonfuture, Microsoft, Exomad Green 99.4
Average time classification: 0.6394724607467651
Average time sentiment: 0.9335868835449219
Average time NER: 0.6314952135086059


In [14]:
df_35 = pd.DataFrame(columns=['Article', 'Category'])
df_35['Article'] = not_preprocessed_data
df_35['Category']= answers_35
df_35['Certainity_category'] = answers_confidence_35
df_35['Sentiment'] = sentiment_35
df_35['Certainity_sentiment'] = sentiment_confidence_35
df_35['Company'] = company_35
df_35['Certainity_company'] = company_confidence_35
df_35.to_csv(PATH_TO_SAVE_RESULTS+"results_35_turbo_logprob.csv", index=False, sep=';')

## Comparison of outputs between GPT-3.5-turbo and GPT-4o-mini

In [65]:
processed_4o_data=processed_4o_data[:10]

In [66]:
answers_cleaned=["" for _ in range(len(processed_4o_data))]
answers_confidence_cleaned=["" for _ in range(len(processed_4o_data))]
for i in range(len(processed_4o_data)):
  article =f"""You are a text classification endpoint, classifying given text into categories:
  human_employee_rights
  diversity_equity_inclusion
  environment
  animal_care
  corporate_transparency
  business_involvement
  political_and_religious_views

  If you are not sure, return other.
  Return only name of the category.

  Texts to classify:

  {processed_4o_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )

  answers_cleaned[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  answers_confidence_cleaned[i] = np.round(np.exp(power)*100, 1)
  print(answers_cleaned[i], answers_confidence_cleaned[i])

human_employee_rights 100.0
human_employee_rights 100.0
human_employee_rights 100.0
diversity_equity_inclusion 100.0
human_employee_rights 100.0
diversity_equity_inclusion 100.0
diversity_equity_inclusion 100.0
diversity_equity_inclusion 100.0
environment 100.0
environment 100.0


In [67]:
sentiment_cleaned=["" for _ in range(len(processed_4o_data))]
sentiment_confidence_cleaned=["" for _ in range(len(processed_4o_data))]
company_cleaned=["" for _ in range(len(processed_4o_data))]
company_confidence_cleaned=["" for _ in range(len(processed_4o_data))]

for i in range(len(processed_4o_data)):
  article =f"""You are a sentiment classification endpoint, classifying given text into: positive, neutral, negative.
  Return only the sentiment.
  Texts to classify:

  {processed_4o_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )

  sentiment_cleaned[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  sentiment_confidence_cleaned[i] = np.round(np.exp(power)*100, 1)
  print(sentiment_cleaned[i], sentiment_confidence_cleaned[i])

  article =f"""You are a NER endpoint, finding names of the companies occuring in the article.
  Return only those names. If there are more than one companies, separate each name with ','.
  If there will be no company in the article, return 'no company'.

  Text:

  {processed_4o_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )

  company_cleaned[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  company_confidence_cleaned[i] = np.round(np.exp(power)*100, 1)
  print(company_cleaned[i], company_confidence_cleaned[i])

negative 88.1
Amazon 100.0
Neutral 59.0
Starbucks, Workers United, NBC Los Angeles, KUOW, The Philadelphia Inquirer, The New York Times, Placer.ai 100.0
negative 73.1
Amazon, Release, Tripadvisor 99.5
neutral 82.1
BBC, Hamas, The Times, Israel 100.0
negative 85.2
Amazon 100.0
negative 67.0
BBC, GB News, Opera Holland Park, LBC, NCTJ 99.9
negative 90.5
BBC, The Sunday Telegraph, The Mail, Michelle Obama, Amal Clooney 100.0
positive 85.7
Aviva, Confederation of British Industry, BP 100.0
negative 87.5
BP, Greenpeace, National Portrait Gallery, Tate, Culture Unstained 98.6
positive 77.7
Carbonfuture, Microsoft, Exomad Green 100.0


In [68]:
df_cleaned = pd.DataFrame(columns=['Article'])
df_cleaned['Article'] = processed_4o_data
df_cleaned['Category']= answers_cleaned
df_cleaned['Certainity_category'] = answers_confidence_cleaned
df_cleaned['Sentiment'] = sentiment_cleaned
df_cleaned['Certainity_sentiment'] = sentiment_confidence_cleaned
df_cleaned['Company'] = company_cleaned
df_cleaned['Certainity_company'] = company_confidence_cleaned
df_cleaned.to_csv(PATH_TO_SAVE_RESULTS+"results_4o_after_cleaning_by_4o.csv", index=False, sep=';')

In [69]:
df_cleaned

Unnamed: 0,Article,Category,Certainity_category,Sentiment,Certainity_sentiment,Company,Certainity_company
0,People work in the Amazon Fulfillment Center i...,human_employee_rights,100.0,negative,88.1,Amazon,100.0
1,A federal agency is seeking to force Starbucks...,human_employee_rights,100.0,Neutral,59.0,"Starbucks, Workers United, NBC Los Angeles, KU...",100.0
2,You might have seen a new energy drink on Amaz...,human_employee_rights,100.0,negative,73.1,"Amazon, Release, Tripadvisor",99.5
3,The BBC's director-general has tried to calm t...,diversity_equity_inclusion,100.0,neutral,82.1,"BBC, Hamas, The Times, Israel",100.0
4,Amazon is running a competition to give its br...,human_employee_rights,100.0,negative,85.2,Amazon,100.0
5,Nihal Arthanayake says he saw 'a lack of diver...,diversity_equity_inclusion,100.0,negative,67.0,"BBC, GB News, Opera Holland Park, LBC, NCTJ",99.9
6,The BBC has been slammed after it emerged it i...,diversity_equity_inclusion,100.0,negative,90.5,"BBC, The Sunday Telegraph, The Mail, Michelle ...",100.0
7,The boss of Aviva has revealed senior white ma...,diversity_equity_inclusion,100.0,positive,85.7,"Aviva, Confederation of British Industry, BP",100.0
8,The British Museum has secured a £50m donation...,environment,100.0,negative,87.5,"BP, Greenpeace, National Portrait Gallery, Tat...",98.6
9,Carbon removal solutions provider Carbonfuture...,environment,100.0,positive,77.7,"Carbonfuture, Microsoft, Exomad Green",100.0


In [70]:
df_gt=df_gt[:10]

In [71]:
print(f"Number of the same categories returned by 4o mini and 3.5 turbo: {np.sum(df['Category']==df_35['Category'])}.")
print(f"Number of the same categories returned by 4o mini and me: {np.sum(df['Category']==df_gt['Category'])}.")
print(f"Number of the same categories returned by 3.5 turbo and me: {np.sum(df_35['Category']==df_gt['Category'])}.")
print(f"Number of the same categories returned by 4o after cleaning by 4o and me: {np.sum(df_cleaned['Category']==df_gt['Category'])}.")

print(f"Number of the same sentiments returned by 4o mini and 3.5 turbo: {np.sum(df['Sentiment']==df_35['Sentiment'])}.")
print(f"Number of the same sentiments returned by 4o mini and me: {np.sum(df['Sentiment']==df_gt['Sentiment'])}.")
print(f"Number of the same sentiments returned by 3.5 turbo and me: {np.sum(df_35['Sentiment']==df_gt['Sentiment'])}.")
print(f"Number of the same sentiments returned by 4o after cleaning by 4o and me: {np.sum(df_cleaned['Sentiment']==df_gt['Sentiment'])}.")

Number of the same categories returned by 4o mini and 3.5 turbo: 9.
Number of the same categories returned by 4o mini and me: 9.
Number of the same categories returned by 3.5 turbo and me: 9.
Number of the same categories returned by 4o after cleaning by 4o and me: 9.
Number of the same sentiments returned by 4o mini and 3.5 turbo: 0.
Number of the same sentiments returned by 4o mini and me: 7.
Number of the same sentiments returned by 3.5 turbo and me: 0.
Number of the same sentiments returned by 4o after cleaning by 4o and me: 8.


## Analysis for articles cleaned with regex

In [30]:
answers_cleaned=["" for _ in range(len(processed_regex_data))]
answers_confidence_cleaned=["" for _ in range(len(processed_regex_data))]
for i in range(len(processed_regex_data)):
  article =f"""You are a text classification endpoint, classifying given text into categories:
  human_employee_rights
  diversity_equity_inclusion
  environment
  animal_care
  corporate_transparency
  business_involvement
  political_and_religious_views

  If you are not sure, return other.
  Return only name of the category.

  Texts to classify:

  {processed_regex_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )

  answers_cleaned[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  answers_confidence_cleaned[i] = np.round(np.exp(power)*100, 1)
  print(answers_cleaned[i], answers_confidence_cleaned[i])

human_employee_rights 100.0
human_employee_rights 100.0
human_employee_rights 100.0
diversity_equity_inclusion 100.0
human_employee_rights 100.0
diversity_equity_inclusion 100.0
diversity_equity_inclusion 100.0
diversity_equity_inclusion 100.0
environment 100.0
environment 100.0


In [31]:
sentiment_cleaned=["" for _ in range(len(processed_regex_data))]
sentiment_confidence_cleaned=["" for _ in range(len(processed_regex_data))]
company_cleaned=["" for _ in range(len(processed_regex_data))]
company_confidence_cleaned=["" for _ in range(len(processed_regex_data))]

for i in range(len(processed_regex_data)):
  article =f"""You are a sentiment classification endpoint, classifying given text into: positive, neutral, negative.
  Return only the sentiment.
  Texts to classify:

  {processed_regex_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )

  sentiment_cleaned[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  sentiment_confidence_cleaned[i] = np.round(np.exp(power)*100, 1)
  print(sentiment_cleaned[i], sentiment_confidence_cleaned[i])

  article =f"""You are a NER endpoint, finding names of the companies occuring in the article.
  Return only those names. If there are more than one companies, separate each name with ','.
  If there will be no company in the article, return 'no company'.

  Text:

  {processed_regex_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )

  company_cleaned[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  company_confidence_cleaned[i] = np.round(np.exp(power)*100, 1)
  print(company_cleaned[i], company_confidence_cleaned[i])

negative 73.1
Amazon 100.0
neutral 58.7
Starbucks, Workers United, NBC Los Angeles, KUOW, The Philadelphia Inquirer, The New York Times, Placer.ai 100.0
negative 95.0
Amazon, Tripadvisor, Microsoft 99.5
neutral 93.5
BBC, Hamas, The Times, Israel 100.0
negative 92.4
Amazon 100.0
Negative 56.1
BBC, GB News, LBC, Opera Holland Park, NCTJ, MailOnline 100.0
negative 90.5
BBC, Reform UK, Free Speech Union 100.0
neutral 39.3
Aviva, BP, CBI, AXA, Zurich 100.0
negative 53.5
BP, Greenpeace, National Portrait Gallery, Tate, Culture Unstained 99.8
positive 97.7
Carbonfuture, Microsoft, Exomad Green 100.0


In [34]:
df_cleaned_regex = pd.DataFrame(columns=['Article'])
df_cleaned_regex['Article'] = processed_regex_data
df_cleaned_regex['Category']= answers_cleaned
df_cleaned_regex['Certainity_category'] = answers_confidence_cleaned
df_cleaned_regex['Sentiment'] = sentiment_cleaned
df_cleaned_regex['Certainity_sentiment'] = sentiment_confidence_cleaned
df_cleaned_regex['Company'] = company_cleaned
df_cleaned_regex['Certainity_company'] = company_confidence_cleaned
df_cleaned_regex.to_csv(PATH_TO_SAVE_RESULTS+"results_4o_after_cleaning_regex.csv", index=False, sep=';')

In [33]:
print(f"Number of the same categories returned by 4o mini after regex cleaning and me: {np.sum(df_cleaned_regex['Category']==df_gt['Category'])}.")

print(f"Number of the same sentiments returned by 4o mini after regex cleaning and me: {np.sum(df_cleaned_regex['Sentiment']==df_gt['Sentiment'])}.")

Number of the same categories returned by 4o mini after regex cleaning and me: 9.
Number of the same sentiments returned by 4o mini after regex cleaning and me: 6.


## Experiments

#### Conduct experiments
For each of the prepared subsets, the experiments and evaluation were conducted for all three tasks: category recognition, sentiment analysis and company recognition. The results were later manually analyzed and described in the work.

In [None]:
# 1. Not processed GPT 3.5

conduct_experiments(not_preprocessed_data, type_of_data="notpreprocessed", which_model="gpt-3.5-turbo", exp_number=3)
print("1. Not processed GPT 3.5 DONE")

# 2. Not processed GPT 4o

conduct_experiments(not_preprocessed_data, type_of_data="notpreprocessed", which_model="gpt-4o-mini", exp_number=4)
print("2. Not processed GPT 4o DONE")

Lists created.
notpreprocessed number 0 category done.
notpreprocessed number 0 sentiment done.
notpreprocessed number 0 company done.
notpreprocessed number 1 category done.
notpreprocessed number 1 sentiment done.
notpreprocessed number 1 company done.
notpreprocessed number 2 category done.
notpreprocessed number 2 sentiment done.
notpreprocessed number 2 company done.
Asking model done.
Score for notpreprocessed data for category for gpt-3.5-turbo is 41.666666666666664.
Score for notpreprocessed for sentiment for gpt-3.5-turbo is 67.0.
Score for notpreprocessed for company for gpt-3.5-turbo is 3.0.
1. Not processed GPT 3.5 DONE
Lists created.
notpreprocessed number 0 category done.
notpreprocessed number 0 sentiment done.
notpreprocessed number 0 company done.
notpreprocessed number 1 category done.
notpreprocessed number 1 sentiment done.
notpreprocessed number 1 company done.
notpreprocessed number 2 category done.
notpreprocessed number 2 sentiment done.
notpreprocessed number 2

In [None]:
# 3. Processed by GPT 4o, now GPT 3.5

conduct_experiments(processed_4o_data, type_of_data="processed_by_4o", which_model="gpt-3.5-turbo", exp_number=5)
print("3. Processed by GPT 4o, now GPT 3.5 DONE")

# 4. Processed by GPT 4o, now GPT 4o

conduct_experiments(processed_4o_data, type_of_data="processed_by_4o", which_model="gpt-4o-mini", exp_number=6)
print("4. Processed by GPT 4o, now GPT 4o DONE")

Lists created.
processed_by_4o number 0 category done.
processed_by_4o number 0 sentiment done.
processed_by_4o number 0 company done.
processed_by_4o number 1 category done.
processed_by_4o number 1 sentiment done.
processed_by_4o number 1 company done.
processed_by_4o number 2 category done.
processed_by_4o number 2 sentiment done.
processed_by_4o number 2 company done.
Asking model done.
Score for processed_by_4o data for category for gpt-3.5-turbo is 42.666666666666664.
Score for processed_by_4o for sentiment for gpt-3.5-turbo is 65.33333333333333.
Score for processed_by_4o for company for gpt-3.5-turbo is 3.0.
3. Processed by GPT 4o, now GPT 3.5 DONE
Lists created.
processed_by_4o number 0 category done.
processed_by_4o number 0 sentiment done.
processed_by_4o number 0 company done.
processed_by_4o number 1 category done.
processed_by_4o number 1 sentiment done.
processed_by_4o number 1 company done.
processed_by_4o number 2 category done.
processed_by_4o number 2 sentiment done.


In [None]:
# 5. Processed with regex, now GPT 3.5

conduct_experiments(processed_regex_data, type_of_data="processed_with_regex", which_model="gpt-3.5-turbo", exp_number=7)
print("5. Processed with regex, now GPT 3.5 DONE")

# 6. Processed with regex, now GPT 4o

conduct_experiments(processed_regex_data, type_of_data="processed_with_regex", which_model="gpt-4o-mini", exp_number=8)
print("6. Processed with regex, now GPT 4o DONE")

Lists created.
processed_with_regex number 0 category done.
processed_with_regex number 0 sentiment done.
processed_with_regex number 0 company done.
processed_with_regex number 1 category done.
processed_with_regex number 1 sentiment done.
processed_with_regex number 1 company done.
processed_with_regex number 2 category done.
processed_with_regex number 2 sentiment done.
processed_with_regex number 2 company done.
Asking model done.
Score for processed_with_regex data for category for gpt-3.5-turbo is 41.666666666666664.
Score for processed_with_regex for sentiment for gpt-3.5-turbo is 66.33333333333333.
Score for processed_with_regex for company for gpt-3.5-turbo is 3.0.
5. Processed with regex, now GPT 3.5 DONE
Lists created.
processed_with_regex number 0 category done.
processed_with_regex number 0 sentiment done.
processed_with_regex number 0 company done.
processed_with_regex number 1 category done.
processed_with_regex number 1 sentiment done.
processed_with_regex number 1 comp

In [None]:
# 7. Processed by human_annotator, now GPT 3.5

conduct_experiments(processed_human_data, type_of_data="processed_by_human", which_model="gpt-3.5-turbo", exp_number=9)
print("7. Processed by human_annotator, now GPT 3.5 DONE")

# 8. Processed by human_annotator, now GPT 4o

conduct_experiments(processed_human_data, type_of_data="processed_by_human", which_model="gpt-4o-mini", exp_number=10)
print("8. Processed by human_annotator, now GPT 4o DONE")

Lists created.
processed_by_human number 0 category done.
processed_by_human number 0 sentiment done.
processed_by_human number 0 company done.
processed_by_human number 1 category done.
processed_by_human number 1 sentiment done.
processed_by_human number 1 company done.
processed_by_human number 2 category done.
processed_by_human number 2 sentiment done.
processed_by_human number 2 company done.
Asking model done.
Score for processed_by_human data for category for gpt-3.5-turbo is 42.333333333333336.
Score for processed_by_human for sentiment for gpt-3.5-turbo is 65.33333333333333.
Score for processed_by_human for company for gpt-3.5-turbo is 3.0.
7. Processed by human_annotator, now GPT 3.5 DONE
Lists created.
processed_by_human number 0 category done.
processed_by_human number 0 sentiment done.
processed_by_human number 0 company done.
processed_by_human number 1 category done.
processed_by_human number 1 sentiment done.
processed_by_human number 1 company done.
processed_by_human

## Experiments for incidents recognition
The pipeline for incident recognition is similar as for other tasks presented above. The only difference is that it is unsupervised task, therefore the evaluation was different.
The results were not described in the work.

In [36]:
df_incidents = pd.DataFrame(columns=['Article', 'Incident', 'Confidence','Time', 'Incident_2', 'Confidence_2', 'Time_2', 'Incident_3', 'Confidence_3', 'Time_3'])
df_incidents['Article'] = not_preprocessed_data
answers=["" for _ in range(len(not_preprocessed_data))]
answers_confidence=["" for _ in range(len(not_preprocessed_data))]
sum_time=0
for i in range(len(not_preprocessed_data)):
  start_time=time.time()
  article =f"""You are an incident recognition endpoint, for each given text return incident. Your answer cannot be longer than 3 words.

  Text:

  {not_preprocessed_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )
  end_time=time.time()
  answers[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  answers_confidence[i] = np.round(np.exp(power)*100, 1)
  sum_time+=end_time-start_time
  print(answers[i], answers_confidence[i], end_time-start_time)

df_incidents['Incident']= answers
df_incidents['Confidence']= answers_confidence
df_incidents['Time']=end_time-start_time
print(f"Average time: {sum_time/len(not_preprocessed_data)}")

Workplace bullying incident 89.3 0.67862868309021
Labor dispute escalation 77.7 0.5191090106964111
Worker exploitation allegations 80.4 0.5036749839782715
Workplace tensions 98.6 0.5653071403503418
Worker protests Amazon 85.2 0.6032676696777344
Workplace racism 63.6 0.5277941226959229
Diversity spending controversy 54.3 0.8049933910369873
Sexism in finance 99.9 0.5860962867736816
Environmental protest 93.9 1.0483601093292236
Carbon removal agreement 100.0 0.6526217460632324
Greenwashing investigation 99.5 0.5971922874450684
Telehealth expansion 28.8 1.1860978603363037
Food recall incident 73.1 0.9719581604003906
Grocery delivery service 96.5 0.5075788497924805
Grocery price concerns 96.2 0.5102365016937256
Food waste incident 100.0 0.7876942157745361
Food donation initiative 99.9 0.48348164558410645
Discrimination complaint filed 54.9 0.47858095169067383
Bribery charges Switzerland 98.0 0.5533983707427979
PPE contract scandal 96.0 0.564786434173584
Misleading advertisements 95.7 0.5305

In [37]:
answers=["" for _ in range(len(not_preprocessed_data))]
answers_confidence=["" for _ in range(len(not_preprocessed_data))]
sum_time=0
for i in range(len(not_preprocessed_data)):
  start_time=time.time()
  article =f"""You are an incident recognition endpoint, for each given text return incident. Your answer cannot be longer than 3 words.

  Text:

  {not_preprocessed_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )
  end_time=time.time()
  answers[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  answers_confidence[i] = np.round(np.exp(power)*100, 1)
  sum_time+=end_time-start_time
  print(answers[i], answers_confidence[i], end_time-start_time)

df_incidents['Incident_2']= answers
df_incidents['Confidence_2']=answers_confidence
df_incidents['Time_2']=end_time-start_time
print(f"Average time: {sum_time/len(not_preprocessed_data)}")

Workplace bullying incident 81.0 0.7966446876525879
Labor dispute escalation 77.7 0.7581255435943604
Worker exploitation allegations 80.2 0.5566978454589844
Workplace tensions 98.6 0.9661712646484375
Worker exploitation 87.9 0.7535190582275391
Workplace racism 63.6 0.4298710823059082
Diversity spending controversy 54.3 0.7015550136566162
Sexism in finance 99.9 0.7589480876922607
Environmental protest 94.2 0.9662995338439941
Carbon removal agreement 100.0 0.4668853282928467
Greenwashing investigation 99.5 0.5280771255493164
Telehealth expansion 28.1 0.9538311958312988
Food recall incident 62.2 1.001758337020874
Grocery delivery service 95.8 0.37368011474609375
Grocery price concerns 94.5 0.5318479537963867
Food waste incident 100.0 0.5200080871582031
Food donation initiative 99.9 0.4218144416809082
Discrimination complaint filed 48.6 0.4254434108734131
Bribery charges Switzerland 98.0 0.5660936832427979
PPE contract scandal 93.8 0.7977972030639648
Misleading advertisements 94.9 0.471123

In [38]:
answers=["" for _ in range(len(not_preprocessed_data))]
answers_confidence=["" for _ in range(len(not_preprocessed_data))]
sum_time=0
for i in range(len(not_preprocessed_data)):
  start_time=time.time()
  article =f"""You are an incident recognition endpoint, for each given text return incident. Your answer cannot be longer than 3 words.

  Text:

  {not_preprocessed_data[i]}
  """
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "user", "content": article}
    ],
    logprobs=True,
    top_logprobs=5,
    seed=123,
    temperature=0
  )
  end_time=time.time()
  answers[i] = completion.choices[0].message.content
  power=completion.choices[0].logprobs.content[0].top_logprobs[0].logprob
  answers_confidence[i] = np.round(np.exp(power)*100, 1)
  sum_time+=end_time-start_time
  print(answers[i], answers_confidence[i], end_time-start_time)

df_incidents['Incident_3']= answers
df_incidents['Confidence_3']= answers_confidence
df_incidents['Time_3']= end_time-start_time
print(f"Average time: {sum_time/len(not_preprocessed_data)}")

Workplace bullying incident 81.0 1.2986400127410889
Labor dispute escalation 73.0 0.4606778621673584
Worker exploitation allegations 80.2 0.5183687210083008
Workplace tensions 98.6 0.7458746433258057
Worker exploitation 87.9 0.5177152156829834
Workplace racism 63.6 0.6620256900787354
Diversity spending controversy 46.0 0.7685351371765137
Sexism in finance 99.9 0.679248571395874
Environmental protest 92.6 0.9188222885131836
Carbon removal agreement 100.0 2.672606945037842
Greenwashing investigation 99.5 0.4577467441558838
Telehealth expansion 28.8 1.2005720138549805
Food recall incident 62.2 1.0598440170288086
Grocery delivery service 95.9 0.362396240234375
Grocery price concerns 94.5 2.1176507472991943
Food waste incident 100.0 0.4573543071746826
Food donation initiative 99.9 0.4730827808380127
Discrimination complaint filed 54.9 1.9974400997161865
Bribery charges Switzerland 98.0 0.5793232917785645
PPE contract scandal 93.8 0.9059092998504639
Misleading advertisements 96.6 0.641569375

In [41]:
# Save results of experiments
df_incidents.to_csv(PATH_TO_SAVE_PREPROCESSED+'/gpt_4o_incidents.csv', index=False, sep=';')

#### For all three experiments incident, model confidence and time was saved

In [42]:
df_incidents 

Unnamed: 0,Article,Incident,Confidence,Time,Incident_2,Confidence_2,Time_2,Incident_3,Confidence_3,Time_3
0,People work in the Amazon Fulfillment Center i...,Workplace bullying incident,89.3,0.451547,Workplace bullying incident,81.0,0.459337,Workplace bullying incident,81.0,0.453882
1,A federal agency is seeking to force Starbucks...,Labor dispute escalation,77.7,0.451547,Labor dispute escalation,77.7,0.459337,Labor dispute escalation,73.0,0.453882
2,You might have seen a new energy drink on Amaz...,Worker exploitation allegations,80.4,0.451547,Worker exploitation allegations,80.2,0.459337,Worker exploitation allegations,80.2,0.453882
3,The BBC's director-general has tried to calm t...,Workplace tensions,98.6,0.451547,Workplace tensions,98.6,0.459337,Workplace tensions,98.6,0.453882
4,Amazon is running a competition to give its br...,Worker protests Amazon,85.2,0.451547,Worker exploitation,87.9,0.459337,Worker exploitation,87.9,0.453882
...,...,...,...,...,...,...,...,...,...,...
95,READ MOER: The Tesla killer? Toyota claims its...,Hybrid transition announcement,87.0,0.451547,Hybrid transition announcement,83.9,0.459337,Hybrid transition announcement,83.9,0.453882
96,"Exterior view of the Siemens Forum, part of th...",Financial performance report,70.4,0.451547,Financial performance report,71.5,0.459337,Financial performance report,71.5,0.453882
97,Companies are accused of greenwashing when the...,Greenwashing accusation,100.0,0.451547,Greenwashing accusation,100.0,0.459337,Greenwashing accusation,100.0,0.453882
98,It has noted the doubling and quadrupling of t...,Scam alert warning,62.8,0.451547,Scam alert,56.1,0.459337,Scam alert,62.8,0.453882


In [43]:
np.sum(df_incidents['Incident']==df_incidents['Incident_2'])

np.int64(93)

In [44]:
np.sum(df_incidents['Incident']==df_incidents['Incident_3'])

np.int64(93)

In [45]:
np.sum(df_incidents['Incident_2']==df_incidents['Incident_3'])

np.int64(92)

In [46]:
#mean confidence
print(np.mean(df_incidents['Confidence']))
print(np.mean(df_incidents['Confidence_2']))
print(np.mean(df_incidents['Confidence_3']))

80.015
79.893
79.766


In [47]:
#mean time
print(np.mean(df_incidents['Time']))
print(np.mean(df_incidents['Time_2']))
print(np.mean(df_incidents['Time_3']))

0.45154738426208496
0.4593372344970703
0.45388221740722656


In [48]:
for i in range(len(not_preprocessed_data)):
  if df_incidents['Incident'][i]!=df_incidents['Incident_2'][i]:
    print(f"First answer: {df_incidents['Incident'][i]}; second answer: {df_incidents['Incident_2'][i]}; confidence 1: {df_incidents['Confidence'][i]}, confidence 2: {df_incidents['Confidence_2'][i]}")

First answer: Worker protests Amazon; second answer: Worker exploitation; confidence 1: 85.2, confidence 2: 87.9
First answer: Advertising boycott; second answer: Advertising boycott X; confidence 1: 62.6, confidence 2: 66.5
First answer: Banking climate exit; second answer: Banking sector exit; confidence 1: 56.2, confidence 2: 56.2
First answer: Parking fee increases; second answer: Parking fee increase; confidence 1: 99.8, confidence 2: 99.9
First answer: Manufacturing shift; second answer: Industry transformation; confidence 1: 26.1, confidence 2: 22.7
First answer: Assistance request; second answer: Assisted conversation request; confidence 1: 76.7, confidence 2: 74.9


In [49]:
for i in range(len(not_preprocessed_data)):
  if df_incidents['Incident_2'][i]!=df_incidents['Incident_3'][i]:
    print(f"First answer: {df_incidents['Incident_2'][i]}; second answer: {df_incidents['Incident_3'][i]}; confidence 1: {df_incidents['Confidence_2'][i]}, confidence 2: {df_incidents['Confidence_3'][i]}")

First answer: Banking sector exit; second answer: Banking climate exit; confidence 1: 56.2, confidence 2: 56.2
First answer: Parking fee increase; second answer: Parking fee increases; confidence 1: 99.9, confidence 2: 99.9
First answer: Industry transformation; second answer: Manufacturing shift; confidence 1: 22.7, confidence 2: 26.1
First answer: Classified document leak; second answer: Classified documents leak; confidence 1: 91.6, confidence 2: 88.4
First answer: Record jet deal; second answer: Aviation deal; confidence 1: 29.6, confidence 2: 36.7
First answer: Assisted conversation request; second answer: Assistance request; confidence 1: 74.9, confidence 2: 76.7
First answer: Binance settlement issues; second answer: Binance settlement scandal; confidence 1: 85.0, confidence 2: 85.3
First answer: Fraud prevention partnership; second answer: Fraud detection partnership; confidence 1: 98.9, confidence 2: 98.9


In [50]:
for i in range(len(not_preprocessed_data)):
  if df_incidents['Incident'][i]!=df_incidents['Incident_3'][i]:
    print(f"First answer: {df_incidents['Incident'][i]}; second answer: {df_incidents['Incident_3'][i]}; confidence 1: {df_incidents['Confidence'][i]}, confidence 2: {df_incidents['Confidence_3'][i]}")

First answer: Worker protests Amazon; second answer: Worker exploitation; confidence 1: 85.2, confidence 2: 87.9
First answer: Advertising boycott; second answer: Advertising boycott X; confidence 1: 62.6, confidence 2: 66.5
First answer: Classified document leak; second answer: Classified documents leak; confidence 1: 91.6, confidence 2: 88.4
First answer: Record jet deal; second answer: Aviation deal; confidence 1: 29.6, confidence 2: 36.7
First answer: Binance settlement issues; second answer: Binance settlement scandal; confidence 1: 85.0, confidence 2: 85.3
First answer: Fraud prevention partnership; second answer: Fraud detection partnership; confidence 1: 98.9, confidence 2: 98.9
