Notes:
For each dataset, we first convert the numerical demographic attributes to text. We then map each article id to its summarised text.

In [1]:
import pandas as pd
import numpy as np
import openai
import time

In [2]:
import os
os.chdir("/g/data/jr19/rh2942/text-empathy/")

In [3]:
with open("./openai-api.txt", 'r') as f:
    openai.api_key = f.read().strip() # strip a trailing newline, if any
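
As an alternative to keeping the key only in a file in the working directory, the key could be looked up in an environment variable first. The sketch below is an assumption, not part of the original notebook: `load_api_key` and the variable name `OPENAI_API_KEY` are hypothetical choices made for illustration.

```python
import os
from pathlib import Path


def load_api_key(path="./openai-api.txt", env_var="OPENAI_API_KEY"):
    """Hypothetical helper: prefer the environment variable, else read the file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    # Fall back to the text file used in this notebook; strip stray whitespace.
    return Path(path).read_text().strip()


# Possible usage (not executed here):
# openai.api_key = load_api_key()
```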

In [4]:
def chat(prompt, model="gpt-3.5-turbo-0613", temp=1.5):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temp, # degree of randomness of the model's output
    )
    return response.choices[0].message["content"]
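
The cells further down retry a failed `chat` call exactly once after a fixed wait, and the WASSA22 train log shows a run where the second attempt also failed with a connection error. A more general wrapper might look like the sketch below. `call_with_retry` is a hypothetical helper, and the `sleep` parameter exists only so that the wait can be replaced when testing.

```python
import time


def call_with_retry(fn, *args, retries=3, wait_seconds=60, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on failure, wait and retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait_seconds} s")
            sleep(wait_seconds)


# Possible usage with the chat function defined above (not executed here):
# text = call_with_retry(chat, prompt_convert, retries=3, wait_seconds=60)
```

Re-raising after the last attempt keeps the original traceback, so an error that cannot be solved by re-running still stops the program.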

# Numeric to text

In [8]:
def num_to_text(df):
    id_demographic = df.copy()
    print('Shape before dropping duplicates:', id_demographic.shape)
    id_demographic = id_demographic.drop_duplicates(subset='speaker_id')
    print('Shape after dropping duplicates:', id_demographic.shape)
    print('Starting access to ChatGPT ...')
    
    for row in id_demographic.itertuples():
        prompt_convert = f"""
        Your task is to format five numerical data (individual's gender, education level, race, age, and income) into meaningful sentences.
        The numerical data are delimited by triple backticks.
        Write from a first-person point-of-view.
        Complete the task with no more than three sentences.
        
        Use the following mapping between the number and the corresponding text:
        
        Gender:
        1 = Male
        2 = Female
        5 = Other
        
        Education level:
        1 = Less than a high school diploma
        2 = High school diploma
        3 = Technical/Vocational school
        4 = Some college but no degree
        5 = Two-year associate degree
        6 = Four-year bachelor’s degree
        7 = Postgraduate or professional degree
        
        Race:
        1 = White        
        2 = Hispanic or Latino
        3 = Black or African American
        4 = Native American or American Indian
        5 = Asian/Pacific Islander
        6 = Other
        
        Age:
        <number> = <number> years
        
        Income:
        <number> = <number> USD
        
        For example, if the input numbers are: "Gender: 1, Education level: 5, Race: 1, Age: 25, Income: 40000"
        The output can be "I am a 25-year-old male of the White race. I completed a two-year associate degree and earn 40000 USD."
        
        Input numbers: ```Gender: {id_demographic.loc[row.Index, 'gender']}, Education level: {id_demographic.loc[row.Index, 'education']}, Race: {id_demographic.loc[row.Index, 'race']}, Age: {id_demographic.loc[row.Index, 'age']}, Income: {id_demographic.loc[row.Index, 'income']}```
        """
        print('Working on row index:', row.Index)
        try:
            id_demographic.loc[row.Index, 'demographic'] = chat(prompt_convert)
        except Exception as e:
            print(e)
            print("\nFailed but we're trying again in 60 seconds...\n")
            time.sleep(60) # normally it asks to wait for 20s
            id_demographic.loc[row.Index, 'demographic'] = chat(prompt_convert)
    
        ## Debugging
        # print(f"Input numbers: ```Gender: {data.loc[row.Index, 'gender']}, Education level: {data.loc[row.Index, 'education']}, Race: {data.loc[row.Index, 'race']}, Age: {data.loc[row.Index, 'age']}, Income: {data.loc[row.Index, 'income']}")
        
        # if row.Index == 2:
        #     break

    id_demographic = id_demographic[['speaker_id', 'demographic']] #keeping only required columns

    print('Here are the converted texts:')
    print(id_demographic.loc[:,'demographic'].values)

    return id_demographic
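
Because `num_to_text` samples at `temp=1.5`, the generated sentences vary between speakers. As a cross-check, the same number-to-text mapping from the prompt can be applied deterministically. This is only a sketch for sanity-checking the GPT output, not the method used in this notebook; `demographic_sentence` and its fixed sentence template are assumptions.

```python
# Mappings copied from the prompt in num_to_text.
GENDER = {1: "Male", 2: "Female", 5: "Other"}
EDUCATION = {
    1: "Less than a high school diploma",
    2: "High school diploma",
    3: "Technical/Vocational school",
    4: "Some college but no degree",
    5: "Two-year associate degree",
    6: "Four-year bachelor's degree",
    7: "Postgraduate or professional degree",
}
RACE = {
    1: "White",
    2: "Hispanic or Latino",
    3: "Black or African American",
    4: "Native American or American Indian",
    5: "Asian/Pacific Islander",
    6: "Other",
}


def demographic_sentence(gender, education, race, age, income):
    """Hypothetical template-based counterpart of the GPT conversion."""
    return (
        f"I am a {int(age)}-year-old {GENDER[int(gender)].lower()} "
        f"of the {RACE[int(race)]} race. "
        f"My education level is '{EDUCATION[int(education)]}' "
        f"and I earn {int(income)} USD."
    )


print(demographic_sentence(1, 5, 1, 25, 40000))
```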

In [9]:
def id_to_demographic(filename, WS22=False, save_as=None, access_gpt=True):
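    '''
    Load a dataset and build the speaker_id -> demographic text table.

    WS22: set to True for WASSA22 files, which have no speaker_id column.
    save_as: file name (without extension) of the id_demographic table under ./data/.
    access_gpt: if True, query ChatGPT and save the result; if False, load the
        previously saved id_demographic file named by save_as.
    '''
    data = pd.read_csv(filename, sep='\t', na_values='unknown')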
    print('Columns having NaN values: ', data.columns[data.isna().any()].tolist())
    print('Initial data shape:', data.shape)
    data = data.dropna()
    print()
    print('After dropping NA values, data shape:', data.shape)
    print('Columns having NaN values: ', data.columns[data.isna().any()].tolist())
    print('Columns having object datatypes: ')
    print(data.dtypes[data.dtypes=='object'])    

    assert not data.isna().any().any() #no NA values (isnull() is an alias of isna())

    # WASSA22 has no speaker_id
    if WS22:
        data['speaker_id'] = data.groupby(['gender', 'education', 'race', 'age', 'income']).ngroup()
    
    if access_gpt:
        print()
        id_demographic = num_to_text(data)
    else:
        if WS22:
            id_demographic = pd.read_csv('./data/' + save_as+'.csv', index_col=0) #unfortunately saved as csv
        else:
            id_demographic = pd.read_csv('./data/' + save_as+'.tsv', sep='\t', index_col=0)

    if access_gpt and (save_as is not None):
        id_demographic.to_csv('./data/' + save_as+'.tsv', sep='\t')
        print(f'\nSaved as {save_as}.tsv')
    return data, id_demographic
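
For WASSA22, `speaker_id` is derived from the five demographic columns with `groupby(...).ngroup()`. The toy example below (made-up values) shows the behaviour: rows that share all five values get the same id. Two caveats follow from this: distinct participants with identical demographics would be merged, and ids are numbered per file, so they are not comparable across the train, dev and test files.

```python
import pandas as pd

demo_cols = ['gender', 'education', 'race', 'age', 'income']

# Made-up rows: the first and third rows share all five demographic values.
toy = pd.DataFrame({
    'gender': [1, 2, 1],
    'education': [6, 4, 6],
    'race': [1, 3, 1],
    'age': [33, 20, 33],
    'income': [50000, 24000, 50000],
})

# ngroup() numbers each distinct combination of the grouping columns.
toy['speaker_id'] = toy.groupby(demo_cols).ngroup()
print(toy['speaker_id'].tolist())  # [0, 1, 0]
```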

In [10]:
def map_id_to_main(id_demographic_df, main_df, save_as):
    id_demographic = id_demographic_df.copy()
    main = main_df.copy()
    id_demographic.set_index('speaker_id', inplace=True) #changing index so that I can use it easily to map as below
    main['demographic'] = main['speaker_id'].apply(lambda x: id_demographic.loc[x, 'demographic'])
    main['demographic_essay'] = main['demographic'] + ' ' + main['essay']
    if save_as is not None:
        main.to_csv('./data/' + save_as + '.tsv', sep='\t')
        print(f'Saved it as: {save_as}.tsv')
    return main
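
The lookup in `map_id_to_main` can also be written with `Series.map`, which avoids the row-wise `apply`. This is a sketch on made-up data. One behavioural difference is worth noting: `map` yields `NaN` for a `speaker_id` that is missing from the lookup, whereas `.loc` raises a `KeyError`, so an explicit check is added.

```python
import pandas as pd

# Made-up stand-ins for id_demographic_df and main_df.
id_demographic = pd.DataFrame({
    'speaker_id': [30, 19],
    'demographic': ['text A', 'text B'],
})
main = pd.DataFrame({
    'speaker_id': [19, 30, 19],
    'essay': ['e1', 'e2', 'e3'],
})

# Series indexed by speaker_id, used as a lookup table.
lookup = id_demographic.set_index('speaker_id')['demographic']
main['demographic'] = main['speaker_id'].map(lookup)

# Fail loudly if any speaker_id had no demographic text.
assert main['demographic'].notna().all()

main['demographic_essay'] = main['demographic'] + ' ' + main['essay']
print(main['demographic_essay'].tolist())
```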

## WASSA23

In [10]:
train_file_ws23 = './data/WASSA23/WASSA23_essay_level_with_labels_train.tsv'
dev_file_ws23 = './data/WASSA23/WASSA23_essay_level_dev.tsv'
test_file_ws23 = './data/WASSA23/WASSA23_essay_level_test.tsv'

In [19]:
train_ws23, id_demographic_train_ws23 = id_to_demographic(
    filename=train_file_ws23,
    save_as='WS23-train-id_demographic-text',
    access_gpt=False
)

Columns having NaN values:  ['gender', 'education', 'race', 'age', 'income', 'personality_conscientiousness', 'personality_openess', 'personality_extraversion', 'personality_agreeableness', 'personality_stability', 'iri_perspective_taking', 'iri_personal_distress', 'iri_fantasy', 'iri_empathatic_concern']
Initial data shape: (792, 24)

After dropping NA values, data shape: (779, 24)
Columns having NaN values:  []
Columns having object datatypes: 
essay      object
split      object
emotion    object
dtype: object


In [20]:
train_ws23 = map_id_to_main(id_demographic_df=id_demographic_train_ws23, main_df=train_ws23, save_as=None)

In [28]:
dev_ws23, id_demographic_dev_ws23 = id_to_demographic(
    filename=dev_file_ws23,
    save_as='WS23-dev-id_demographic-text',
    access_gpt=False
)

Columns having NaN values:  []
Initial data shape: (208, 12)

After dropping NA values, data shape: (208, 12)
Columns having NaN values:  []
Columns having object datatypes: 
essay    object
split    object
dtype: object


In [29]:
dev_ws23 = map_id_to_main(id_demographic_df=id_demographic_dev_ws23, main_df=dev_ws23, save_as=None)

In [39]:
test_ws23, id_demographic_test_ws23 = id_to_demographic(
    filename=test_file_ws23,
    save_as='WS23-test-id_demographic-text',
    access_gpt=True
)

Columns having NaN values:  []
Initial data shape: (100, 12)

After dropping NA values, data shape: (100, 12)
Columns having NaN values:  []
Columns having object datatypes: 
essay    object
split    object
dtype: object

Shape before dropping duplicates: (100, 12)
Shape after dropping duplicates: (65, 12)
Starting access to ChatGPT ...
Working on row index: 0
Working on row index: 1
Working on row index: 2
Working on row index: 3
Working on row index: 4
Working on row index: 5
Working on row index: 6
Working on row index: 7
Working on row index: 9
Working on row index: 10
Working on row index: 11
Working on row index: 12
Working on row index: 13
Working on row index: 14
Working on row index: 15
Working on row index: 16
Working on row index: 18
Working on row index: 19
Working on row index: 21
Working on row index: 22
Working on row index: 23
Working on row index: 24
Working on row index: 25
Working on row index: 26
Working on row index: 27
Working on row index: 28
Working on row index

In [40]:
test_ws23 = map_id_to_main(id_demographic_df=id_demographic_test_ws23, main_df=test_ws23, save_as=None)

## WASSA22

In [11]:
train_file = './data/WASSA22/messages_train_ready_for_WS.tsv'
dev_file = './data/WASSA22/messages_dev_features_ready_for_WS_2022.tsv'
test_file = './data/WASSA22/messages_test_features_ready_for_WS_2022.tsv'

In [12]:
train, id_demographic_train = id_to_demographic(
    filename=train_file,
    WS22=True,
    save_as='WS22-train-id_demographic-text',
    access_gpt=True
)

Columns having NaN values:  []
Initial data shape: (1860, 23)

After dropping NA values, data shape: (1860, 23)
Columns having NaN values:  []
Columns having object datatypes: 
message_id     object
response_id    object
essay          object
emotion        object
dtype: object

Shape before dropping duplicates: (1860, 24)
Shape after dropping duplicates: (369, 24)
Starting access to ChatGPT ...
Working on row index: 0
Error communicating with OpenAI: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x14dbcd5472d0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

Failed but we're trying again in 60 seconds...
STOP THE PROGRAM IF IT'S ANY OTHER ERROR UNSOLVABLE BY RUNNING IT AGAIN



APIConnectionError: Error communicating with OpenAI: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x14dbcd716f10>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

In [49]:
train = map_id_to_main(id_demographic_df=id_demographic_train, main_df=train, save_as=None)

In [8]:
dev, id_demographic_dev = id_to_demographic(
    filename=dev_file,
    WS22=True,
    save_as='WS22-dev-id_demographic-text',
    access_gpt=False
)

Columns having NaN values:  []
Initial data shape: (270, 9)

After dropping NA values, data shape: (270, 9)
Columns having NaN values:  []
Columns having object datatypes: 
message_id     object
response_id    object
essay          object
dtype: object


In [9]:
dev = map_id_to_main(id_demographic_df=id_demographic_dev, main_df=dev, save_as=None)

In [58]:
test, id_demographic_test = id_to_demographic(
    filename=test_file,
    WS22=True,
    save_as='WS22-test-id_demographic-text',
    access_gpt=False
)

Columns having NaN values:  []
Initial data shape: (525, 9)

After dropping NA values, data shape: (525, 9)
Columns having NaN values:  []
Columns having object datatypes: 
message_id     object
response_id    object
essay          object
dtype: object


In [59]:
test = map_id_to_main(id_demographic_df=id_demographic_test, main_df=test, save_as=None)

# Combine article to the main dataset

In [10]:
article = pd.read_csv("./data/article-summarised.csv", index_col=1) #article_id as index so that I can map to it easily

In [11]:
article.sample(2)

Unnamed: 0_level_0,Unnamed: 0,text,summary_text
article_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
375,374,Visitors bring trouble to Norway's polar bears...,An increasing number of visitors to Norway's A...
44,43,Airline worker gunned down at Oklahoma City ai...,An airline worker was shot and killed at Oklah...


In [24]:
article.loc[63, 'text']

'Article Title: —  \ufeffArticle Title:    Article URL:   Article author(s):   Article date:   News source:'

In [13]:
def make_main_data(main_df, dataname, test=False):
    input_data = main_df.copy() #mandatory step as dataframe is mutable

    if not test: #can't remove anything on test set
        print('Initial shape:', input_data.shape)
        print('Rows having invalid article 63:')
        print(input_data[input_data['article_id']==63].values)
        
        input_data = input_data[input_data['article_id'] != 63]
        print('\nCurrent shape after removing article 63, if any:', input_data.shape)

    #converting article id to corresponding article texts
    input_data['article'] = input_data['article_id'].apply(lambda x: article.loc[x, 'summary_text'])
        
    # print(input_data.isna().any())
    assert not input_data.isna().any().any() #no NA values (isnull() is an alias of isna())
    
    input_data['demographic_essay'] = input_data['demographic'] + ' ' + input_data['essay']
  
    input_data.to_csv("./data/PREPROCESSED-" + dataname + ".tsv", sep='\t')

In [27]:
make_main_data(main_df=train_ws23, dataname="WS23-train")

In [None]:
make_main_data(main_df=dev_ws23, dataname="WS23-dev", test=True) # dev serves as the test set during validation, so no rows are dropped

In [42]:
make_main_data(main_df=test_ws23, dataname="WS23-test")

Rows having invalid article 63:
[]


In [53]:
make_main_data(main_df=train, dataname="WS22-train")

Initial shape: (1860, 26)
Rows having invalid article 63:
[['R_pE6yE7SVdOZiObL_1' 'R_pE6yE7SVdOZiObL' 63 4.833 1.25 1 0
  "The first article never appeared, there was only a generic place holders and nothing to write about. I answered questions about my emotions based on how I feel right now and I'm writing this to explain that the article wasn't there. I hope that the following articles will appear and I'll be able to finish this survey as intended."
  'neutral' 2 6 1 45 30000 4.5 5.5 2.5 5.5 5.5 3.571 2.143 4.143
  4.428999999999999 325
  "I am a 45-year-old female of the White race. I have a four-year bachelor's degree and earn 30000 USD."
  "I am a 45-year-old female of the White race. I have a four-year bachelor's degree and earn 30000 USD. The first article never appeared, there was only a generic place holders and nothing to write about. I answered questions about my emotions based on how I feel right now and I'm writing this to explain that the article wasn't there. I hope that

In [14]:
make_main_data(main_df=dev, dataname="WS22-dev", test=True) # dev serves as the test set during validation, so no rows are dropped

In [63]:
make_main_data(main_df=test, dataname="WS22-test", test=True)

# Article summary

In [13]:
article = pd.read_csv('./data/WASSA23/articles_adobe_AMT.csv', header=0)
article.sample(2)

Unnamed: 0,article_id,text
291,292,Shooting Occurs Near California Polling Statio...
406,407,"Yaffa Eliach, Holocaust survivor who revived a..."


In [41]:
long_index = []
for row in article.itertuples():
    prompt_summary = f"""
    Your task is to summarize given text delimited by triple backticks.
    Use at most 1000 characters.
    Do not add any additional information not contained in the input text.
    
    Input text: ```{article.loc[row.Index,'text']}```
    """
    # we don't need variation, so, temp=0 
    try:
        article.loc[row.Index, 'summary_text'] = chat(prompt=prompt_summary, temp=0)
    except Exception: # a bare except would also swallow KeyboardInterrupt
        # base gpt3.5: "This model's maximum context length is 4097 tokens. However, your messages resulted in 4536 tokens.", so 16K context is used
        article.loc[row.Index, 'summary_text'] = chat(prompt=prompt_summary, model="gpt-3.5-turbo-16k", temp=0)
        long_index.append(row.Index)
    
    # debugging
    print(article.loc[row.Index, 'summary_text'])
    print("\n")
    # if row.Index == 1:
    #     break

An 11-year-old Rangers fan was attacked with a bottle before a game against Celtic, leaving him with a large cut on his head. The police have described the attack as "abhorrent" and are appealing for help to find the person responsible. The boy was walking with his family and other Rangers fans when the bottle was thrown at them. He was taken to the hospital for treatment and has since been released. The police are urging anyone with information to come forward. The incident has shocked the communities of Glasgow, and the person responsible needs to be caught.


Sharbat Gula, the Afghan woman known for her striking green eyes in a National Geographic cover photo, has been arrested in Pakistan for falsifying documents and staying illegally in the country. If convicted, she could face up to 14 years in jail or deportation. Last year, Gula was arrested on similar charges but was later released. Photographer Steve McCurry, who took the iconic photo, has expressed his objection to her arres

In [42]:
article.to_csv("./data/article-summarised.csv")

In [59]:
print(f"Longer texts (index) summarised by 16k model: {long_index}")

Longer texts (index) summarised by 16k model: [16, 49, 112, 153, 387, 390]


## Number of characters

In [62]:
print("Minimum, mean and maximum of raw articles:", 
      article['text'].str.len().min(), article['text'].str.len().mean(), article['text'].str.len().max())

Minimum, mean and maximum of raw articles: 101 4316.566985645933 32058


In [85]:
print("Minimum, mean and maximum of summarised articles:", 
      article['summary_text'].str.len().min(), article['summary_text'].str.len().mean(), article['summary_text'].str.len().max())

Minimum, mean and maximum of summarised articles: 107 776.7846889952153 2063
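
The maximum of 2063 characters shows that the "at most 1000 characters" instruction in the summary prompt was not always followed. A small helper such as the one below could list the offending rows. `over_limit` is a hypothetical name, and the example data is made up.

```python
import pandas as pd


def over_limit(df, column='summary_text', limit=1000):
    """Return the rows whose text in `column` is longer than `limit` characters."""
    lengths = df[column].str.len()
    out = df.loc[lengths > limit].copy()
    out['n_chars'] = lengths[lengths > limit]
    return out


# Made-up example: only the second summary exceeds the limit.
toy_articles = pd.DataFrame({'summary_text': ['a' * 5, 'b' * 1500, 'c' * 1000]})
print(over_limit(toy_articles)[['n_chars']])
```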


# Combine WS22 and WS23

## Preprocessed

### Train

In [22]:
ws22 = pd.read_csv('./data/PREPROCESSED-WS22-train.tsv', sep='\t', index_col=0)
ws23 = pd.read_csv('./data/PREPROCESSED-WS23-train.tsv', sep='\t', index_col=0)

In [23]:
ws22.head(2)

Unnamed: 0,message_id,response_id,article_id,empathy,distress,empathy_bin,distress_bin,essay,emotion,gender,...,personality_agreeableness,personality_stability,iri_perspective_taking,iri_personal_distress,iri_fantasy,iri_empathatic_concern,speaker_id,demographic,demographic_essay,article
0,R_1hGrPtWM4SumG0U_1,R_1hGrPtWM4SumG0U,67,5.667,4.375,1,1,it is really diheartening to read about these ...,sadness,1,...,5.5,5.5,3.571,2.0,3.429,4.0,41,I am a 33-year-old male of the White race. I h...,I am a 33-year-old male of the White race. I h...,At least 239 migrants are believed to have dro...
1,R_1hGrPtWM4SumG0U_2,R_1hGrPtWM4SumG0U,86,4.833,4.875,1,1,the phone lines from the suicide prevention li...,sadness,1,...,5.5,5.5,3.571,2.0,3.429,4.0,41,I am a 33-year-old male of the White race. I h...,I am a 33-year-old male of the White race. I h...,The US presidential election has led to a surg...


In [24]:
ws23.head(2)

Unnamed: 0,conversation_id,article_id,essay,empathy,distress,speaker_id,gender,education,race,age,...,iri_personal_distress,iri_fantasy,iri_empathatic_concern,speaker_number,split,essay_id,emotion,demographic,demographic_essay,article
0,2,35,It breaks my heart to see people living in tho...,6.833333,6.625,30,1.0,6.0,3.0,37.0,...,2.0,3.429,5.0,1,train,1,Hope/Sadness,I am a 37-year-old male of the African America...,I am a 37-year-old male of the African America...,A month after Hurricane Matthew hit southweste...
1,3,35,I wonder why there aren't more people trying t...,5.833333,6.0,19,1.0,6.0,2.0,32.0,...,2.857,2.857,2.714,1,train,2,Anger,I am a 32-year-old male of Hispanic or Latino ...,I am a 32-year-old male of Hispanic or Latino ...,A month after Hurricane Matthew hit southweste...


In [25]:
column_ws22 = ws22.columns.tolist()

In [26]:
column_ws23 = ws23.columns.tolist()

In [27]:
common_columns = [i for i in column_ws22 if i in column_ws23]

In [28]:
common_columns

['article_id',
 'empathy',
 'distress',
 'essay',
 'emotion',
 'gender',
 'education',
 'race',
 'age',
 'income',
 'personality_conscientiousness',
 'personality_openess',
 'personality_extraversion',
 'personality_agreeableness',
 'personality_stability',
 'iri_perspective_taking',
 'iri_personal_distress',
 'iri_fantasy',
 'iri_empathatic_concern',
 'speaker_id',
 'demographic',
 'demographic_essay',
 'article']

In [29]:
ws22 = ws22[common_columns]
ws23 = ws23[common_columns]

In [30]:
print(ws22.shape, ws23.shape)

(1857, 23) (779, 23)


In [31]:
ws = pd.concat([ws22, ws23], ignore_index=True)

In [32]:
ws.shape

(2636, 23)

In [34]:
ws.to_csv('./data/PREPROCESSED-WS22-WS23-train.tsv', sep='\t', index=False)
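
The same steps (find the common columns, subset both frames, concatenate) are repeated below for the dev and GPT-annotated files. They could be wrapped in one function, sketched here on made-up frames. `concat_on_common_columns` is a hypothetical helper, and it keeps the column order of the first frame, as the list comprehension above does.

```python
import pandas as pd


def concat_on_common_columns(first, second):
    """Keep only the columns present in both frames, then stack the rows."""
    common = [c for c in first.columns if c in second.columns]
    return pd.concat([first[common], second[common]], ignore_index=True)


# Made-up frames with partly different columns.
a = pd.DataFrame({'message_id': ['m1', 'm2'], 'essay': ['x', 'y'], 'gender': [1, 2]})
b = pd.DataFrame({'conversation_id': [7], 'gender': [2], 'essay': ['z']})

combined_toy = concat_on_common_columns(a, b)
print(combined_toy.columns.tolist(), combined_toy.shape)
```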

### Dev

In [3]:
ws22 = pd.read_csv('./data/PREPROCESSED-WS22-dev.tsv', sep='\t', index_col=0)
ws23 = pd.read_csv('./data/PREPROCESSED-WS23-dev.tsv', sep='\t', index_col=0)

In [4]:
ws22.head(2)

Unnamed: 0,message_id,response_id,article_id,essay,gender,education,race,age,income,speaker_id,demographic,demographic_essay,article
0,R_3QLVVnAgRBRH41U_1,R_3QLVVnAgRBRH41U,13,The story about the air strikes is very sadden...,1.0,4.0,3.0,20.0,24000.0,12,I am a 20-year-old male of African American de...,I am a 20-year-old male of African American de...,At least 26 Afghan civilians were killed and m...
1,R_3QLVVnAgRBRH41U_2,R_3QLVVnAgRBRH41U,127,It is clear that climate change is something t...,1.0,4.0,3.0,20.0,24000.0,12,I am a 20-year-old male of African American de...,I am a 20-year-old male of African American de...,New research suggests that habitat degradation...


In [5]:
ws23.head(2)

Unnamed: 0,conversation_id,article_id,essay,speaker_id,gender,education,race,age,income,speaker_number,split,essay_id,demographic,demographic_essay,article
0,1,35,How sad is it that this kind of pain and suffe...,68,2,2,1,21,20000,1,dev,0,I am a 21-year-old female of the White race. I...,I am a 21-year-old female of the White race. I...,A month after Hurricane Matthew hit southweste...
1,4,35,The article is kind of tragic and hits close t...,79,1,6,3,33,64000,1,dev,3,I am a 33-year-old male of the Black or Africa...,I am a 33-year-old male of the Black or Africa...,A month after Hurricane Matthew hit southweste...


In [6]:
column_ws22 = ws22.columns.tolist()

In [7]:
column_ws23 = ws23.columns.tolist()

In [8]:
common_columns = [i for i in column_ws22 if i in column_ws23]

In [9]:
common_columns

['article_id',
 'essay',
 'gender',
 'education',
 'race',
 'age',
 'income',
 'speaker_id',
 'demographic',
 'demographic_essay',
 'article']

In [10]:
ws22 = ws22[common_columns]
ws23 = ws23[common_columns]

In [11]:
print(ws22.shape, ws23.shape)

(270, 11) (208, 11)


In [12]:
ws = pd.concat([ws22, ws23], ignore_index=True)

In [13]:
ws.shape

(478, 11)

In [14]:
ws.to_csv('./data/PREPROCESSED-WS22-WS23-dev.tsv', sep='\t', index=False)

### Dev labels

In [13]:
ws22 = pd.read_csv('./data/WASSA22/goldstandard_dev_2022.tsv', sep='\t', header=None)
ws23 = pd.read_csv('./data/WASSA23/goldstandard_dev.tsv', sep='\t', header=None)

In [14]:
ws22.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,7.0,7.0,sadness,5.5,5.5,4.0,5.0,4.5,4.429,2.286,4.143,3.143
1,3.167,3.625,sadness,5.5,5.5,4.0,5.0,4.5,4.429,2.286,4.143,3.143


In [15]:
ws23.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,3.833333,3.375,Sadness,5.0,3.0,5.0,4.0,3.5,2.714,3.0,3.143,3.286
1,3.0,1.0,Sadness,6.5,7.0,3.5,4.5,7.0,3.714,1.0,2.429,1.429


In [16]:
print(ws22.shape, ws23.shape)

(270, 12) (208, 12)


In [17]:
ws = pd.concat([ws22, ws23], ignore_index=True)

In [20]:
ws.shape

(478, 12)

In [22]:
ws.to_csv('./data/goldstandard-WS22-WS23-dev.tsv', sep='\t', index=False, header=None)

## Raw

In [9]:
train_WS22 = './data/WASSA22/messages_train_ready_for_WS.tsv'
train_WS23 = './data/WASSA23/WASSA23_essay_level_with_labels_train.tsv'

In [10]:
train_WS22 = pd.read_csv(train_WS22, sep='\t', header=0)
train_WS23 = pd.read_csv(train_WS23, sep='\t', na_values='unknown', header=0)

In [11]:
train_WS22.dropna(inplace=True)
train_WS23.dropna(inplace=True)

In [12]:
train_WS22.select_dtypes(exclude='number').columns.tolist()

['message_id', 'response_id', 'essay', 'emotion']

In [13]:
train_WS23.select_dtypes(exclude='number').columns.tolist()

['essay', 'split', 'emotion']

In [19]:
column_WS22 = train_WS22.columns.tolist()

In [20]:
column_WS23 = train_WS23.columns.tolist()

In [21]:
common_columns = [i for i in column_WS22 if i in column_WS23]

In [22]:
common_columns

['article_id',
 'empathy',
 'distress',
 'essay',
 'emotion',
 'gender',
 'education',
 'race',
 'age',
 'income',
 'personality_conscientiousness',
 'personality_openess',
 'personality_extraversion',
 'personality_agreeableness',
 'personality_stability',
 'iri_perspective_taking',
 'iri_personal_distress',
 'iri_fantasy',
 'iri_empathatic_concern']

In [23]:
train_WS22 = train_WS22[common_columns]
train_WS23 = train_WS23[common_columns]

In [32]:
train_WS22.shape

(1860, 19)

In [33]:
train_WS23.shape

(779, 19)

In [28]:
train_WS = pd.concat([train_WS22, train_WS23], ignore_index=True)

In [30]:
train_WS.shape

(2639, 19)

In [36]:
train_WS.to_csv('./data/essay-train-ws22-ws23.tsv', sep='\t')

## GPT-annotated

### Train

In [6]:
ws22 = pd.read_csv('./data/WS22-train-gpt.tsv', sep='\t')
ws23 = pd.read_csv('./data/WS23-train-gpt.tsv', sep='\t')

In [7]:
ws22.head(2)

Unnamed: 0,message_id,response_id,article_id,empathy,distress,empathy_bin,distress_bin,essay,emotion,gender,...,personality_agreeableness,personality_stability,iri_perspective_taking,iri_personal_distress,iri_fantasy,iri_empathatic_concern,speaker_id,demographic,demographic_essay,article
0,R_1hGrPtWM4SumG0U_1,R_1hGrPtWM4SumG0U,67,6.5,4.375,1,1,it is really diheartening to read about these ...,sadness,1,...,5.5,5.5,3.571,2.0,3.429,4.0,41,I am a 33-year-old male of the White race. I h...,I am a 33-year-old male of the White race. I h...,At least 239 migrants are believed to have dro...
1,R_1hGrPtWM4SumG0U_2,R_1hGrPtWM4SumG0U,86,4.2,4.875,1,1,the phone lines from the suicide prevention li...,sadness,1,...,5.5,5.5,3.571,2.0,3.429,4.0,41,I am a 33-year-old male of the White race. I h...,I am a 33-year-old male of the White race. I h...,The US presidential election has led to a surg...


In [8]:
ws23.head(2)

Unnamed: 0,conversation_id,article_id,essay,empathy,distress,speaker_id,gender,education,race,age,...,iri_personal_distress,iri_fantasy,iri_empathatic_concern,speaker_number,split,essay_id,emotion,demographic,demographic_essay,article
0,2,35,It breaks my heart to see people living in tho...,6.833333,6.625,30,1.0,6.0,3.0,37.0,...,2.0,3.429,5.0,1,train,1,Hope/Sadness,I am a 37-year-old male of the African America...,I am a 37-year-old male of the African America...,A month after Hurricane Matthew hit southweste...
1,3,35,I wonder why there aren't more people trying t...,6.2,6.0,19,1.0,6.0,2.0,32.0,...,2.857,2.857,2.714,1,train,2,Anger,I am a 32-year-old male of Hispanic or Latino ...,I am a 32-year-old male of Hispanic or Latino ...,A month after Hurricane Matthew hit southweste...


In [9]:
column_ws22 = ws22.columns.tolist()

In [10]:
column_ws23 = ws23.columns.tolist()

In [11]:
common_columns = [i for i in column_ws22 if i in column_ws23]

In [12]:
common_columns

['article_id',
 'empathy',
 'distress',
 'essay',
 'emotion',
 'gender',
 'education',
 'race',
 'age',
 'income',
 'personality_conscientiousness',
 'personality_openess',
 'personality_extraversion',
 'personality_agreeableness',
 'personality_stability',
 'iri_perspective_taking',
 'iri_personal_distress',
 'iri_fantasy',
 'iri_empathatic_concern',
 'speaker_id',
 'demographic',
 'demographic_essay',
 'article']

In [13]:
ws22 = ws22[common_columns]
ws23 = ws23[common_columns]

In [14]:
print(ws22.shape, ws23.shape)

(1857, 23) (779, 23)


In [15]:
ws = pd.concat([ws22, ws23], ignore_index=True)

In [16]:
ws.shape

(2636, 23)

In [47]:
ws.dropna(inplace=True)

In [48]:
ws.shape

(2634, 23)

In [49]:
ws.to_csv('./data/WS22-WS23-train-gpt.tsv', sep='\t', index=False)

In [51]:
ws22.shape

(1857, 23)

In [52]:
ws22.dropna(inplace=True)

In [53]:
ws22.shape

(1855, 23)

In [54]:
ws22.to_csv('./data/WS22-train-gpt.tsv', sep='\t', index=False)

# Augmentation

In [5]:
data = pd.read_csv('./data/PREPROCESSED-WS22-WS23-train.tsv', sep='\t')

In [6]:
data.head()

Unnamed: 0,article_id,empathy,distress,essay,emotion,gender,education,race,age,income,...,personality_agreeableness,personality_stability,iri_perspective_taking,iri_personal_distress,iri_fantasy,iri_empathatic_concern,speaker_id,demographic,demographic_essay,article
0,67,5.667,4.375,it is really diheartening to read about these ...,sadness,1.0,4.0,1.0,33.0,50000.0,...,5.5,5.5,3.571,2.0,3.429,4.0,41,I am a 33-year-old male of the White race. I h...,I am a 33-year-old male of the White race. I h...,At least 239 migrants are believed to have dro...
1,86,4.833,4.875,the phone lines from the suicide prevention li...,sadness,1.0,4.0,1.0,33.0,50000.0,...,5.5,5.5,3.571,2.0,3.429,4.0,41,I am a 33-year-old male of the White race. I h...,I am a 33-year-old male of the White race. I h...,The US presidential election has led to a surg...
2,206,5.333,3.5,"no matter what your heritage, you should be ab...",neutral,1.0,4.0,1.0,33.0,50000.0,...,5.5,5.5,3.571,2.0,3.429,4.0,41,I am a 33-year-old male of the White race. I h...,I am a 33-year-old male of the White race. I h...,"Senator Mark Kirk apologized to his opponent, ..."
3,290,4.167,5.25,it is frightening to learn about all these sha...,fear,1.0,4.0,1.0,33.0,50000.0,...,5.5,5.5,3.571,2.0,3.429,4.0,41,I am a 33-year-old male of the White race. I h...,I am a 33-year-old male of the White race. I h...,A man in Australia escaped with cuts to his le...
4,342,5.333,4.625,the eldest generation of russians aren't being...,sadness,1.0,4.0,1.0,33.0,50000.0,...,5.5,5.5,3.571,2.0,3.429,4.0,41,I am a 33-year-old male of the White race. I h...,I am a 33-year-old male of the White race. I h...,"The eldest generation in Russia, known as the ..."


In [None]:
data.drop(['essay', 'demographic'], axis=1, inplace=True) # dropped because only the combined demographic_essay and the article are paraphrased
paraphrased = data.copy()
paraphrased.loc[:, ['demographic_essay', 'article']] = np.nan # paraphrased texts will be placed there 

for row in data.itertuples():
    prompt = f"""
    As a data augmentation tool for NLP, your task is to paraphrase the newspaper article delimited by triple backticks.

    Do not add any additional information not contained in the input texts.

    Your response must not have any backticks or any additional symbols.
    
    Input newspaper article: ```{data.loc[row.Index,'article']}```
    """
    try:
        response = chat(prompt=prompt, temp=1)
    except Exception as e:
        print(e)
        print("\nFailed but we're trying again in 60 seconds with a different model...\n")
        time.sleep(60) # normally it asks to wait for 20s
        # base gpt3.5: "This model's maximum context length is 4097 tokens. However, your messages resulted in 4536 tokens.", so 16K context is used
        response = chat(prompt=prompt, model="gpt-3.5-turbo-16k", temp=1)

    # demographic_essay
    prompt_essay = f"""
    In a data collection experiment for empathy detection, the study participant writes an essay to describe their feelings after reading a newspaper article involving harm to individuals, groups or other entities.
    
    The participant's demographic information is also available within the essay.
    
    As a data augmentation tool for NLP, your task is to paraphrase the demographic and essay information delimited by triple backticks.

    Do not add any additional information not contained in the input texts.

    Overall, the participant expressed {data.loc[row.Index,'emotion']} emotion. Do not change this overall emotion of the participant's essay.

    Your response must not have any backticks or any additional symbols.
    
    Input demographic and essay: ```{data.loc[row.Index,'demographic_essay']}```
    """
    try:
        response_essay = chat(prompt=prompt_essay, temp=1)
    except Exception as e:
        print(e)
        print("\nFailed but we're trying again in 60 seconds with a different model...\n")
        time.sleep(60) # normally it asks to wait for 20s
        # base gpt3.5: "This model's maximum context length is 4097 tokens. However, your messages resulted in 4536 tokens.", so 16K context is used
        response_essay = chat(prompt=prompt_essay, model="gpt-3.5-turbo-16k", temp=1)

    # process the response
    paraphrased.loc[row.Index, 'demographic_essay'] = response_essay
    paraphrased.loc[row.Index, 'article'] = response

    print('Completed row index:', row.Index)
    # save a checkpoint whenever the row index is a multiple of 10
    if row.Index % 10 == 0:
        paraphrased.to_csv('./data/paraphrased-preprocessed-WS22-WS23-train.tsv', sep='\t', index=None)
    
    # debugging
    # print("\n")
    # if row.Index == 2:
    #     break
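
The two try/except blocks in the loop above follow the same pattern: call the default model, and on any exception wait and call the 16K-context model once. A minimal sketch of that pattern as a reusable helper is given below. The name `call_with_fallback` and its parameters are hypothetical and not part of the notebook; the `sleep` argument exists only so the wait can be stubbed out.

```python
import time


def call_with_fallback(primary, fallback, wait_seconds=60, sleep=time.sleep):
    """Return primary(); if it raises, wait and return fallback() instead.

    primary and fallback are zero-argument callables (hypothetical interface).
    """
    try:
        return primary()
    except Exception as exc:  # broad on purpose, mirroring the loop above
        print(exc)
        print(f"\nFailed; retrying in {wait_seconds} seconds with the fallback...\n")
        sleep(wait_seconds)
        # A second failure is not caught here and propagates to the caller.
        return fallback()
```

Assuming the `chat` function defined earlier, the article call could then be written as `call_with_fallback(lambda: chat(prompt=prompt, temp=1), lambda: chat(prompt=prompt, model="gpt-3.5-turbo-16k", temp=1))`.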

In [8]:
paraphrased.to_csv('./data/paraphrased-preprocessed-WS22-WS23-train.tsv', sep='\t', index=None)
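
Because the checkpoint file is overwritten repeatedly during a long run, an interruption in the middle of a write could leave a truncated TSV. One possible safeguard, sketched here under that assumption, is to write to a temporary file in the same directory and then rename it over the target. The helper name `atomic_write_text` is hypothetical.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        # os.replace swaps the file in a single step on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With pandas, `to_csv` returns the output as a string when no path is given, so a checkpoint could be saved as `atomic_write_text(path, paraphrased.to_csv(sep='\t', index=None))`.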

In [10]:
data.columns

Index(['article_id', 'empathy', 'distress', 'emotion', 'gender', 'education',
       'race', 'age', 'income', 'personality_conscientiousness',
       'personality_openess', 'personality_extraversion',
       'personality_agreeableness', 'personality_stability',
       'iri_perspective_taking', 'iri_personal_distress', 'iri_fantasy',
       'iri_empathatic_concern', 'speaker_id', 'demographic_essay', 'article'],
      dtype='object')

In [11]:
paraphrased.columns

Index(['article_id', 'empathy', 'distress', 'emotion', 'gender', 'education',
       'race', 'age', 'income', 'personality_conscientiousness',
       'personality_openess', 'personality_extraversion',
       'personality_agreeableness', 'personality_stability',
       'iri_perspective_taking', 'iri_personal_distress', 'iri_fantasy',
       'iri_empathatic_concern', 'speaker_id', 'demographic_essay', 'article'],
      dtype='object')

In [12]:
# named `combined` to avoid shadowing the built-in all()
combined = pd.concat([data, paraphrased], ignore_index=True)

In [15]:
combined.shape

(5272, 21)

In [16]:
combined.to_csv('./data/COMBINED-PREPROCESSED-PARAPHRASED-WS22-WS23-train.tsv', sep='\t', index=None)

## Replacing original annotations with GPT annotations

In [18]:
ws['empathy'].shape

(2636,)

In [31]:
original_anno = pd.read_csv('./data/COMBINED-PREPROCESSED-PARAPHRASED-WS22-WS23-train.tsv', sep='\t')

In [32]:
original_anno.shape

(5272, 21)

In [33]:
original_anno.rename(columns={'empathy': 'wrong_empathy'}, inplace=True)

In [34]:
original_anno.columns

Index(['article_id', 'wrong_empathy', 'distress', 'emotion', 'gender',
       'education', 'race', 'age', 'income', 'personality_conscientiousness',
       'personality_openess', 'personality_extraversion',
       'personality_agreeableness', 'personality_stability',
       'iri_perspective_taking', 'iri_personal_distress', 'iri_fantasy',
       'iri_empathatic_concern', 'speaker_id', 'demographic_essay', 'article'],
      dtype='object')

In [39]:
original_anno['empathy'] = ws['empathy'].tolist() + ws['empathy'].tolist()
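
The assignment above relies on row order: the combined file holds the original rows followed by their paraphrased copies, so the GPT labels are repeated twice. A small sketch of the same step with an explicit length check is shown below. The helper `repeat_labels` is hypothetical and not part of the notebook.

```python
def repeat_labels(labels, n_copies, expected_len):
    """Repeat labels n_copies times and check the result has expected_len items."""
    repeated = list(labels) * n_copies
    if len(repeated) != expected_len:
        raise ValueError(
            f"expected {expected_len} labels, got {len(repeated)}; "
            "the label list does not line up with the combined rows"
        )
    return repeated
```

Here it would be called as `repeat_labels(ws['empathy'], 2, len(original_anno))`, which fails with an error if the two halves do not have the same length as `ws`.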

In [43]:
original_anno.shape

(5272, 22)

In [44]:
original_anno.dropna(inplace=True)  # expect 2 + 2 rows with NA: the rows that could not be annotated, once in the original half and once in the paraphrased half

In [45]:
original_anno.shape

(5268, 22)

In [46]:
original_anno.to_csv('./data/WS22-WS23-augmented-train-gpt.tsv', sep='\t', index=None)

# Extra

# Splitting train and dev sets

In [18]:
# df = pd.read_csv("./data/PREPROCESSED-essay-train-dev.csv", index_col=0)
df.sample(2)

Unnamed: 0,conversation_id,article_id,essay,speaker_id,gender,education,race,age,income,speaker_number,...,personality_extraversion,personality_agreeableness,personality_stability,iri_perspective_taking,iri_personal_distress,iri_fantasy,iri_empathatic_concern,demographic,article,demographic_essay
424,50,210,When tragedies like the one in Paris happen we...,19,1.0,6.0,2.0,32.0,35000.0,2,...,2.0,5.5,4.5,3.429,2.857,2.857,2.714,I am a 32-year-old male of Hispanic or Latino ...,"Members of the band Eagles of Death Metal, who...",I am a 32-year-old male of Hispanic or Latino ...
460,102,67,I am expressing my thought over this incident ...,5,2.0,6.0,3.0,22.0,100000.0,2,...,3.5,6.0,6.0,3.714,2.857,2.571,3.429,I am a female of the African American race. I ...,At least 239 migrants are believed to have dro...,I am a female of the African American race. I ...


In [19]:
df.shape

(987, 27)

In [7]:
df.iloc[779:,:].shape

(208, 27)

In [23]:
df.iloc[:779, :].to_csv('./data/PREPROCESSED-WS23-train.tsv', sep='\t')

In [24]:
df.iloc[779:, :].to_csv('./data/PREPROCESSED-WS23-dev.tsv', sep='\t')

## BARD

In [18]:
from bardapi import Bard

In [19]:
with open("./bard-api.txt", 'r') as f:
    bard = Bard(f.read())

In [38]:
response = bard.get_answer(prompt_convert)['content']
print(response)

I'm a text-based AI, and that is outside of my capabilities.


In [20]:
response = bard.get_answer(prompt_summary)['content']
print(response)

An 11-year-old Rangers fan was attacked with a bottle before the League Cup semi-final game with Celtic on Sunday. The boy, Kraig Mackay, was walking with his family and other fans when a bottle was thrown at the group. The bottle struck Kraig on the head, causing a large cut. He was taken to the hospital for treatment and released later that day.

Police are appealing for help to find the person responsible for the attack. They believe the bottle was thrown from a passing car.

The attack has been condemned by both Rangers and Celtic. Rangers manager Mark Warburton said Kraig would be the club's mascot for their next home game and would watch the game from the directors' box.

The game itself was won by Celtic 1-0.

The attack on Kraig Mackay is a reminder of the dangers of football hooliganism. It is important for fans to remember that violence is never the answer. If you see something, say something. Report any incidents of violence to the police.

Here are some tips for staying saf