In [2]:
import pandas as pd

We created a 20% overlap among the samples to be labeled so that we could assess inter-rater agreement. Because the pair of annotators varies from one doubly labeled sample to the next, we can't use sklearn's solution directly, as it assumes that the raters are the same in each case. Averaging sklearn's cohen_kappa_score over the pairs is not a viable solution either: with only one data point per rater pair, the Cohen's kappa calculation results in a warning because it expects more variation in the data (or returns 0 if the two values differ).

We first calculated Cohen's kappa for the dataset under the (wrong) assumption that there are only two annotators.
We then improved the calculation by averaging over the pairs, using this [solution](https://towardsdatascience.com/inter-annotator-agreement-2f46c6d37bf3) instead.
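
To make the single-data-point problem concrete, the sketch below computes the observed agreement Po and the chance agreement Pe by hand for one item, with kappa defined as (Po - Pe) / (1 - Pe). The helper name `kappa_terms` and the labels are illustrative assumptions, not part of the notebook.

```python
def kappa_terms(ann1, ann2):
    """Return (Po, Pe) for two equally long lists of labels."""
    n = len(ann1)
    po = sum(a == b for a, b in zip(ann1, ann2)) / n
    labels = set(ann1) | set(ann2)
    pe = sum((ann1.count(lab) / n) * (ann2.count(lab) / n) for lab in labels)
    return po, pe

# One item, both annotators agree: Po = 1 and Pe = 1, so kappa is 0/0 (undefined).
po_same, pe_same = kappa_terms(["yes"], ["yes"])

# One item, annotators disagree: Po = 0 and Pe = 0, so kappa = (0 - 0) / (1 - 0) = 0.
po_diff, pe_diff = kappa_terms(["yes"], ["no"])
kappa_diff = (po_diff - pe_diff) / (1 - pe_diff)

print(po_same, pe_same)  # 1.0 1.0
print(kappa_diff)        # 0.0
```

With a single item per rater pair, the statistic is therefore either undefined or zero, which is why a per-pair average of a standard kappa carries little information.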

In [235]:
# https://towardsdatascience.com/inter-annotator-agreement-2f46c6d37bf3
def cohen_kappa(ann1, ann2):
    """Computes Cohen kappa for pair-wise annotators.
    :param ann1: annotations provided by first annotator
    :type ann1: list
    :param ann2: annotations provided by second annotator
    :type ann2: list
    :rtype: float
    :return: Cohen kappa statistic
    """
    count = 0
    for an1, an2 in zip(ann1, ann2):
        if an1 == an2:
            count += 1
    A = count / len(ann1)  # observed agreement A (Po)

    uniq = set(ann1 + ann2)
    E = 0  # expected agreement E (Pe)
    for item in uniq:
        cnt1 = ann1.count(item)
        cnt2 = ann2.count(item)
        count = ((cnt1 / len(ann1)) * (cnt2 / len(ann2)))
        E += count

    return round((A - E) / (1 - E), 4)
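
As a quick check of the arithmetic, here is a small hand-verifiable example. The function is restated in compact form so that the snippet runs on its own; the label lists are made up for illustration.

```python
def cohen_kappa(ann1, ann2):
    """Compact restatement of the function above (same arithmetic)."""
    A = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)
    E = sum(
        (ann1.count(item) / len(ann1)) * (ann2.count(item) / len(ann2))
        for item in set(ann1 + ann2)
    )
    return round((A - E) / (1 - E), 4)

ann1 = ["yes", "yes", "no", "no"]
ann2 = ["yes", "no", "no", "no"]

# A = 3/4; E = (2/4 * 1/4) + (2/4 * 3/4) = 0.5; kappa = 0.25 / 0.5
kappa = cohen_kappa(ann1, ann2)
print(kappa)  # 0.5
```

One edge case is worth knowing: if both lists consist of one and the same label throughout, E equals 1 and the final division raises `ZeroDivisionError`.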

# Loading and preprocessing the data

In [346]:
df_labels = pd.read_json('/Users/katerynaburovova/PycharmProjects/dehumanization/annotation/labels_ready.json')

In [347]:
import json
def extract_class_pairs(json_lst):
    pairs = []
    for json_str in json_lst:
        question_title = json_str['title']
        answer_title = json_str['answer']['title']
        pairs.append([question_title,answer_title])
    return pairs
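
The function above expects every element of `classifications` to carry a question `title` and a nested `answer` with its own `title`. A minimal sketch with a hypothetical record (the English question and answer strings are placeholders, not values from the real export) shows the shape of the result:

```python
def extract_class_pairs(classifications):
    """Return [question, answer] title pairs, as in the cell above."""
    return [[c["title"], c["answer"]["title"]] for c in classifications]

# Hypothetical record that mimics only the fields the notebook reads.
label = {
    "objects": [],
    "classifications": [
        {"title": "Are Ukrainians mentioned?", "answer": {"title": "yes"}},
        {"title": "Is dehumanization present?", "answer": {"title": "no"}},
    ],
}

pairs = extract_class_pairs(label["classifications"])
print(pairs)
# [['Are Ukrainians mentioned?', 'yes'], ['Is dehumanization present?', 'no']]
```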

In [348]:
df_labels['pairs'] = df_labels['Label'].apply(lambda x: extract_class_pairs(x['classifications']))

In [349]:
df = df_labels[['Created By', 'pairs', 'External ID']].copy()  # explicit copy, so later assignments do not raise SettingWithCopyWarning

In [267]:
# Keep only the samples that were labeled more than once (the overlap set).
# Note: this cell's execution count (267) predates the surrounding cells; the
# full-size tables further down suggest it was not re-run in the final session.
value_counts = df['External ID'].value_counts()
unique_rows = df[df['External ID'].isin(value_counts[value_counts == 1].index)]
df = df.drop(unique_rows.index)

In [350]:
# Split the list of [question, answer] pairs into up to three columns.
for i in range(3):
    col_name = 'pair{}'.format(i+1)
    df[col_name] = df['pairs'].apply(lambda x: x[i] if len(x) > i else None)

In [351]:
df = df[~df['pair1'].isna()]
df.reset_index(inplace=True, drop=True)

In [352]:
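# Row 16 holds a negative answer to the second question; its [question, answer]
# pair serves as the default for rows where pair2 is missing.
# (Question: "Are Ukrainians likened to non-beings, animals, or people stripped
# of human traits (partially or fully)?"; answer: "no".)
text2 = df['pair2'].iloc[16]
text2

['Чи прирівнюються українці до неістот, тварин чи людей, позбавлених людських рис (частково або повністю)?',
 'ні']

In [353]:
# Same idea for the third question.
# (Question: "Is an emotional evaluation of Ukrainians present in the text?";
# answer: "no, no evaluation is present".)
text3 = df['pair3'].iloc[16]
text3

['Чи присутня в тексті емоційна оцінка українців?', 'ні, оцінка не присутня']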

In [354]:
def replace_with(cell, replace_list):
    """Return `cell`, or `replace_list` as a fallback when `cell` is None."""
    if cell is None:
        return replace_list
    return cell

In [355]:
df["Dehumanization"] = df.pair2.apply(lambda x: replace_with(x, text2)[1])

In [356]:
df["Emotion"] = df.pair3.apply(lambda x: replace_with(x, text3)[1])

In [357]:
df["Mention"] = df['pair1'].apply(lambda x: x[1])
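
The three cells above take the answer (element `[1]`) of each question pair and fall back to a default pair where the cell is `None`. A pure-Python sketch with made-up pairs illustrates the fill logic; `default_no` is a hypothetical stand-in for `text2`.

```python
def replace_with(cell, replace_list):
    """Return the cell, or the fallback pair when the cell is None."""
    return replace_list if cell is None else cell

# Hypothetical fallback pair standing in for text2: [question, answer].
default_no = ["Is dehumanization present?", "no"]

pair2_column = [
    ["Is dehumanization present?", "yes"],
    None,  # no answer recorded for this row
]

# Take the answer part of each pair, using the fallback for missing cells.
dehumanization = [replace_with(cell, default_no)[1] for cell in pair2_column]
print(dehumanization)  # ['yes', 'no']
```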

In [358]:
df = df.sort_values(by='External ID')
df.reset_index(inplace=True, drop=True)

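# First attempt: assuming the same two annotators throughout

In [198]:
# Alternate the rater tag row by row. This pairs the two labels of each sample
# only if df holds exactly two rows per External ID, sorted by External ID.
df['Rater'] = ['Rater1' if i % 2 == 0 else 'Rater2' for i in range(len(df))]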

In [None]:
from sklearn.metrics import cohen_kappa_score

# define a function to calculate Cohen's kappa for a given column
def calculate_kappa(column_name):
    rater1_labels = df[column_name].loc[df['Rater'] == 'Rater1'].tolist()
    rater2_labels = df[column_name].loc[df['Rater'] == 'Rater2'].tolist()
    kappa = cohen_kappa_score(rater1_labels, rater2_labels)
    return kappa

In [202]:
# calculate Cohen's kappa for the 'Dehumanization' column
dehumanization_kappa = calculate_kappa('Dehumanization')
print("Cohen's kappa for Dehumanization: {:.2f}".format(dehumanization_kappa))

# repeat the above for the 'Emotion' and 'Mention' columns
emotion_kappa = calculate_kappa('Emotion')
print("Cohen's kappa for Emotion: {:.2f}".format(emotion_kappa))

mention_kappa = calculate_kappa('Mention')
print("Cohen's kappa for Mention: {:.2f}".format(mention_kappa))

Cohen's kappa for Dehumanization: 0.50
Cohen's kappa for Emotion: 0.49
Cohen's kappa for Mention: 0.64


# Correct calculation under the real conditions of different annotators

In [359]:
def calculate_cohens_kappa(df, col_name):
    """Average cohen_kappa over all samples that were labeled more than once."""
    # keep only the columns needed for this question
    df_column = df[['External ID', 'Created By', col_name]]

    # identify the overlapping samples and the raters who labeled them
    overlapping_samples = df_column.groupby('External ID').filter(lambda x: len(x) > 1)
    unique_sample_ids = overlapping_samples['External ID'].unique()
    rater_pairs = []
    for sample_id in unique_sample_ids:
        raters = overlapping_samples[overlapping_samples['External ID'] == sample_id]['Created By'].tolist()
        rater_pairs.append(raters)

    # compute one score per sample, then average the scores
    kappa_scores = []
    for sample_id, rater_pair in zip(unique_sample_ids, rater_pairs):
        sample_data = overlapping_samples[overlapping_samples['External ID'] == sample_id]
        # .tolist()[0] yields a single label string per rater, so cohen_kappa
        # compares the two strings character by character
        rater1_labels = sample_data[sample_data['Created By'] == rater_pair[0]][col_name].tolist()[0]
        rater2_labels = sample_data[sample_data['Created By'] == rater_pair[1]][col_name].tolist()[0]
        kappa = cohen_kappa(rater1_labels, rater2_labels)
        kappa_scores.append(kappa)
    mean_kappa_score = sum(kappa_scores) / len(kappa_scores)

    return mean_kappa_score

In [360]:
calculate_cohens_kappa(df, 'Dehumanization')

0.850657954545455

In [361]:
calculate_cohens_kappa(df, 'Mention')

0.9697505681818179

In [362]:
calculate_cohens_kappa(df, 'Emotion')

0.8473826704545456
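
A note on reading these averages: in `calculate_cohens_kappa`, `.tolist()[0]` hands `cohen_kappa` one label string per rater rather than a list of labels, so, as far as the code shows, the two strings are compared character by character. The sketch below restates the function compactly and feeds it the label values `так` and `ні` (Ukrainian for "yes" and "no", assumed here to be the values of the `Dehumanization` column).

```python
def cohen_kappa(ann1, ann2):
    """Same arithmetic as the notebook's cohen_kappa, restated compactly."""
    A = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)
    E = sum(
        (ann1.count(item) / len(ann1)) * (ann2.count(item) / len(ann2))
        for item in set(ann1 + ann2)
    )
    return round((A - E) / (1 - E), 4)

# Each call receives two label strings, so the comparison runs over characters.
same = cohen_kappa("так", "так")  # identical labels
diff = cohen_kappa("так", "ні")   # different labels with no shared characters

print(same, diff)  # 1.0 0.0

# Averaging such per-sample scores then tracks the share of matching labels.
scores = [same, same, same, diff]
print(sum(scores) / len(scores))  # 0.75
```

Under this reading, the averaged value behaves much like a proportion of matching labels rather than a chance-corrected statistic; longer labels that share some characters in the same positions can produce intermediate per-sample values.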

# Labels export

In [491]:
import os
import pandas as pd

directory_path = "/Users/katerynaburovova/PycharmProjects/dehumanization/annotation/dataset_shuffled"
file_list = []
for filename in os.listdir(directory_path):
    if filename.endswith(".txt"):
        # the sample files are assumed to be UTF-8 encoded (Cyrillic text)
        with open(os.path.join(directory_path, filename), "r", encoding="utf-8") as file:
            file_content = file.read()
        file_list.append({"text": file_content, "file_name": filename})

df_datarows = pd.DataFrame(file_list)

In [492]:
df_datarows.rename(columns={'file_name':'External ID'}, inplace=True)
df_datarows

Unnamed: 0,text,External ID
0,"Как известно, свинья везде грязь найдет, и укр...",row_1281.txt
1,"Тогда как многие украинцы, наоборот, слепо вер...",row_2950.txt
2,"О таком щедром и мирном соседе, как Россия, мо...",row_3496.txt
3,🇺🇦❌Денацификация по-винницки Тивривский сельс...,row_2788.txt
4,Укронацисты объявили войну Пушкину.,row_2944.txt
...,...,...
3638,"«Я вас, бл@дей, на этот корабль три года собир...",row_2791.txt
3639,Награжден орденом Мужества за отвагу и самоотв...,row_474.txt
3640,🇷🇺В районе населенного пункта Часов Яр уничтож...,row_2949.txt
3641,Участники конкурса американской армии на замен...,row_1298.txt


In [514]:
df_datarows[df_datarows['External ID'] == 'row_3210.txt']

Unnamed: 0,text,External ID
2016,"""Бандеровцы на Украине сегодня взяли все худше...",row_3210.txt


In [494]:
df_labels

Unnamed: 0,ID,DataRow ID,Labeled Data,Label,Created By,Project Name,Created At,Updated At,Seconds to Label,Seconds to Review,...,Is Benchmark,Benchmark Agreement,Benchmark ID,Dataset Name,Reviews,View Label,Has Open Issues,Skipped,DataRow Workflow Info,pairs
0,clemlh3ot5jlz07zg93fndqql,cleh5u16r1y1d077nda1a1qc8,https://storage.labelbox.com/cldvi451e3lej07ww...,"{'objects': [], 'classifications': [{'featureI...",kateryna.burovova@ucu.edu.ua,Dehumanization,2023-02-27T09:05:35.000Z,2023-02-27T09:05:36.000Z,36.784,30.751,...,0,-1,,Dehumanization_final_dataset,[],https://editor.labelbox.com?project=cleh63iav1...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac...",[[Чи згадуються в тексті українці або щось укр...
1,clemlh51q9umq07za3cx6h7v9,cleh5u16r1y1h077n7sp91xy0,https://storage.labelbox.com/cldvi451e3lej07ww...,"{'objects': [], 'classifications': [{'featureI...",kateryna.burovova@ucu.edu.ua,Dehumanization,2023-02-27T09:05:41.000Z,2023-02-27T09:05:41.000Z,6.542,1.000,...,0,-1,,Dehumanization_final_dataset,[],https://editor.labelbox.com?project=cleh63iav1...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac...",[[Чи згадуються в тексті українці або щось укр...
2,clemlhd635hv4070q0dndfk3h,cleh5u16r1y1l077n8445az5a,https://storage.labelbox.com/cldvi451e3lej07ww...,"{'objects': [], 'classifications': [{'featureI...",kateryna.burovova@ucu.edu.ua,Dehumanization,2023-02-27T09:05:53.000Z,2023-02-27T09:05:53.000Z,16.489,6.234,...,0,-1,,Dehumanization_final_dataset,[],https://editor.labelbox.com?project=cleh63iav1...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac...",[[Чи згадуються в тексті українці або щось укр...
3,clemlhglg1vlr0711eh7rbxsf,cleh5u16r1y1p077n7yxhdfio,https://storage.labelbox.com/cldvi451e3lej07ww...,"{'objects': [], 'classifications': [{'featureI...",kateryna.burovova@ucu.edu.ua,Dehumanization,2023-02-27T09:06:07.000Z,2023-02-27T09:06:09.000Z,304.801,291.289,...,0,-1,,Dehumanization_final_dataset,[],https://editor.labelbox.com?project=cleh63iav1...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac...",[[Чи згадуються в тексті українці або щось укр...
4,clemrc8wr1kas07xzew1n4sf3,cleh5u16s1y1x077nggra0vhg,https://storage.labelbox.com/cldvi451e3lej07ww...,"{'objects': [], 'classifications': [{'featureI...",snizannabotvin@gmail.com,Dehumanization,2023-02-27T11:54:12.000Z,2023-02-27T19:05:13.000Z,353.906,3.816,...,0,-1,,Dehumanization_final_dataset,[],https://editor.labelbox.com?project=cleh63iav1...,0,False,"{'taskName': 'Done', 'Workflow History': [{'ac...",[[Чи згадуються в тексті українці або щось укр...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4279,clf6u3sp40gds07yt6gjlasf1,cleh5u19e38eu077z4r3e62qh,https://storage.labelbox.com/cldvi451e3lej07ww...,"{'objects': [], 'classifications': [{'featureI...",eugene.1martynyuk@gmail.com,Dehumanization,2023-03-13T14:22:20.000Z,2023-03-13T14:22:20.000Z,14.318,0.000,...,0,-1,,Dehumanization_final_dataset,[],https://editor.labelbox.com?project=cleh63iav1...,0,False,{'taskId': '3817fcae-b9f0-4015-95c1-1afe41afc4...,[[Чи згадуються в тексті українці або щось укр...
4280,clf6wydkb0jff07zy1tig2pgk,cleh5u19eig9w078n33qp7j53,https://storage.labelbox.com/cldvi451e3lej07ww...,"{'objects': [], 'classifications': [{'featureI...",eugene.1martynyuk@gmail.com,Dehumanization,2023-03-13T14:22:24.000Z,2023-03-13T14:22:24.000Z,6.467,0.000,...,0,-1,,Dehumanization_final_dataset,[],https://editor.labelbox.com?project=cleh63iav1...,0,False,{'taskId': '3817fcae-b9f0-4015-95c1-1afe41afc4...,[[Чи згадуються в тексті українці або щось укр...
4281,clf6wyjgi051507y3dy1x46sz,cleh5u19e38di077zboeo8ujg,https://storage.labelbox.com/cldvi451e3lej07ww...,"{'objects': [], 'classifications': [{'featureI...",eugene.1martynyuk@gmail.com,Dehumanization,2023-03-13T14:23:09.000Z,2023-03-13T14:23:09.000Z,42.807,0.000,...,0,-1,,Dehumanization_final_dataset,[],https://editor.labelbox.com?project=cleh63iav1...,0,False,{'taskId': '3817fcae-b9f0-4015-95c1-1afe41afc4...,[[Чи згадуються в тексті українці або щось укр...
4282,clf6wyol90jgi07zy77hr6ibb,cleh5u19eigak078n18yhc7xi,https://storage.labelbox.com/cldvi451e3lej07ww...,"{'objects': [], 'classifications': [{'featureI...",eugene.1martynyuk@gmail.com,Dehumanization,2023-03-13T14:23:28.000Z,2023-03-13T14:23:28.000Z,21.218,0.000,...,0,-1,,Dehumanization_final_dataset,[],https://editor.labelbox.com?project=cleh63iav1...,0,False,{'taskId': '3817fcae-b9f0-4015-95c1-1afe41afc4...,[[Чи згадуються в тексті українці або щось укр...


In [495]:
df_final = df[['External ID', 'Dehumanization', 'Emotion', 'Mention', 'Created By']]

In [496]:
df_final

Unnamed: 0,External ID,Dehumanization,Emotion,Mention,Created By
0,row_0.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com
1,row_1.txt,ні,"ні, оцінка не присутня",ні,snizannabotvin@gmail.com
2,row_10.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com
3,row_100.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com
4,row_1000.txt,ні,"ні, оцінка не присутня",так,tutovadesign@gmail.com
...,...,...,...,...,...
4245,row_996.txt,ні,"так, присутня негативна",так,tutovadesign@gmail.com
4246,row_997.txt,ні,"так, присутня негативна",так,yevhen.marchenko91@gmail.com
4247,row_998.txt,так,"так, присутня негативна",так,tutovadesign@gmail.com
4248,row_998.txt,так,"ні, оцінка не присутня",так,yevhen.marchenko91@gmail.com


In [497]:
df_merged = pd.merge(df_final, df_datarows, on='External ID', how='left')

In [498]:
# sanity check: samples without a mention should never be marked as dehumanizing
df_merged[df_merged['Mention']=='ні']['Dehumanization'].unique()


array(['ні'], dtype=object)

In [499]:
# sanity check: samples without a mention should carry no emotional evaluation
df_merged[df_merged['Mention']=='ні']['Emotion'].unique()

array(['ні, оцінка не присутня'], dtype=object)

In [500]:
df_merged

Unnamed: 0,External ID,Dehumanization,Emotion,Mention,Created By,text
0,row_0.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,"Всвязи с этим немного поправлю коллег ⤵️ ""Они..."
1,row_1.txt,ні,"ні, оцінка не присутня",ні,snizannabotvin@gmail.com,Литературный критик Галина Юзефович о новом ро...
2,row_10.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,Почему на базах неонацистов стоят языческие ис...
3,row_100.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,Группа добровольцев-медиков из Чеченской Респу...
4,row_1000.txt,ні,"ні, оцінка не присутня",так,tutovadesign@gmail.com,"ВСУшники, переходите на сторону добра, у нас т..."
...,...,...,...,...,...,...
4245,row_996.txt,ні,"так, присутня негативна",так,tutovadesign@gmail.com,И понеслась мазепинщино-петлюровщино-бандеровщ...
4246,row_997.txt,ні,"так, присутня негативна",так,yevhen.marchenko91@gmail.com,Наш соратник по русскому движению Алексей Сели...
4247,row_998.txt,так,"так, присутня негативна",так,tutovadesign@gmail.com,Хорошее видео от 4 бригады НМ ЛНР https://t.me...
4248,row_998.txt,так,"ні, оцінка не присутня",так,yevhen.marchenko91@gmail.com,Хорошее видео от 4 бригады НМ ЛНР https://t.me...


At this stage we have to decide whose labels remain in the final dataset for the samples labeled by two annotators.

We manually investigated a random portion of the disagreements, since only manual inspection can reveal potential biases and/or misunderstandings of the guidelines.
We also took into account the annotators' expertise and background, as far as we were informed of them.
The outcome is a per-annotator rating: for each doubly labeled sample we keep the label from the annotator with the lowest rating value.

In [501]:
df_labels['Created By'].unique()

array(['kateryna.burovova@ucu.edu.ua', 'snizannabotvin@gmail.com',
       'nazariy.melnychuk9@gmail.com', 'chennnakal@gmail.com',
       's.sterpul@icloud.com', 'mariana.scorp@gmail.com',
       'tutovadesign@gmail.com', 'yevhen.marchenko91@gmail.com',
       'eugene.1martynyuk@gmail.com'], dtype=object)

In [502]:
authors = df_labels['Created By'].unique().tolist()
rating = [1, 4, 7, 6, 8, 2, 5, 3, 9]

In [503]:
data = {'Created By': authors, 'rating': rating}
df_rating = pd.DataFrame(data)

In [504]:
df_rating

Unnamed: 0,Created By,rating
0,kateryna.burovova@ucu.edu.ua,1
1,snizannabotvin@gmail.com,4
2,nazariy.melnychuk9@gmail.com,7
3,chennnakal@gmail.com,6
4,s.sterpul@icloud.com,8
5,mariana.scorp@gmail.com,2
6,tutovadesign@gmail.com,5
7,yevhen.marchenko91@gmail.com,3
8,eugene.1martynyuk@gmail.com,9


In [505]:
df_combined = df_merged.merge(df_rating, on='Created By')

In [506]:
df_combined = df_combined.sort_values(['External ID', 'rating'], ascending=True)

In [507]:
df_combined

Unnamed: 0,External ID,Dehumanization,Emotion,Mention,Created By,text,rating
0,row_0.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,"Всвязи с этим немного поправлю коллег ⤵️ ""Они...",4
1,row_1.txt,ні,"ні, оцінка не присутня",ні,snizannabotvin@gmail.com,Литературный критик Галина Юзефович о новом ро...,4
2,row_10.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,Почему на базах неонацистов стоят языческие ис...,4
3,row_100.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,Группа добровольцев-медиков из Чеченской Респу...,4
1441,row_1000.txt,ні,"ні, оцінка не присутня",так,tutovadesign@gmail.com,"ВСУшники, переходите на сторону добра, у нас т...",5
...,...,...,...,...,...,...,...
2639,row_996.txt,ні,"так, присутня негативна",так,tutovadesign@gmail.com,И понеслась мазепинщино-петлюровщино-бандеровщ...,5
2760,row_997.txt,ні,"так, присутня негативна",так,yevhen.marchenko91@gmail.com,Наш соратник по русскому движению Алексей Сели...,3
2761,row_998.txt,так,"ні, оцінка не присутня",так,yevhen.marchenko91@gmail.com,Хорошее видео от 4 бригады НМ ЛНР https://t.me...,3
2640,row_998.txt,так,"так, присутня негативна",так,tutovadesign@gmail.com,Хорошее видео от 4 бригады НМ ЛНР https://t.me...,5


In [508]:
df_cleaned = df_combined.drop_duplicates(subset='External ID', keep='first')
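
The sort-then-drop step above keeps, for every `External ID`, the row whose annotator has the lowest rating value. A pure-Python sketch of the same idea follows; the annotator names, ratings, and labels are hypothetical.

```python
# Hypothetical annotations: (external_id, annotator, label)
rows = [
    ("row_998.txt", "annotator_b", "yes"),
    ("row_998.txt", "annotator_a", "no"),
    ("row_997.txt", "annotator_a", "no"),
]
# Hypothetical priorities; a lower number wins, as with the ascending sort above.
rating = {"annotator_a": 3, "annotator_b": 5}

kept = {}
for external_id, annotator, label in sorted(rows, key=lambda r: (r[0], rating[r[1]])):
    # setdefault keeps the first row seen per ID, i.e. the one with the lowest rating
    kept.setdefault(external_id, (annotator, label))

print(kept)
# {'row_997.txt': ('annotator_a', 'no'), 'row_998.txt': ('annotator_a', 'no')}
```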

In [509]:
df_cleaned

Unnamed: 0,External ID,Dehumanization,Emotion,Mention,Created By,text,rating
0,row_0.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,"Всвязи с этим немного поправлю коллег ⤵️ ""Они...",4
1,row_1.txt,ні,"ні, оцінка не присутня",ні,snizannabotvin@gmail.com,Литературный критик Галина Юзефович о новом ро...,4
2,row_10.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,Почему на базах неонацистов стоят языческие ис...,4
3,row_100.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,Группа добровольцев-медиков из Чеченской Респу...,4
1441,row_1000.txt,ні,"ні, оцінка не присутня",так,tutovadesign@gmail.com,"ВСУшники, переходите на сторону добра, у нас т...",5
...,...,...,...,...,...,...,...
2638,row_995.txt,ні,"ні, оцінка не присутня",так,tutovadesign@gmail.com,Утренний брифинг Минобороны России: ▪️ россий...,5
2639,row_996.txt,ні,"так, присутня негативна",так,tutovadesign@gmail.com,И понеслась мазепинщино-петлюровщино-бандеровщ...,5
2760,row_997.txt,ні,"так, присутня негативна",так,yevhen.marchenko91@gmail.com,Наш соратник по русскому движению Алексей Сели...,3
2761,row_998.txt,так,"ні, оцінка не присутня",так,yevhen.marchenko91@gmail.com,Хорошее видео от 4 бригады НМ ЛНР https://t.me...,3


Double-checking the de-duplication, since it is crucial for our task

In [510]:
value_counts = df_combined['External ID'].value_counts()
unique_rows = df_combined[df_combined['External ID'].isin(value_counts[value_counts == 1].index)]
df_bs_check = df_combined.drop(unique_rows.index)
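
The same double-check can be expressed as an automated assertion. The sketch below uses made-up `(External ID, rating)` rows and verifies that, for every sample labeled more than once, the kept row carries the lowest available rating; the names `combined` and `cleaned` are illustrative stand-ins, not the notebook's DataFrames.

```python
from collections import Counter

# Hypothetical (external_id, rating) rows before de-duplication.
combined = [
    ("row_1005.txt", 3), ("row_1005.txt", 5),
    ("row_1006.txt", 3), ("row_1006.txt", 5),
    ("row_1.txt", 4),
]
# Hypothetical result of de-duplication: the rating kept per ID.
cleaned = {"row_1005.txt": 3, "row_1006.txt": 3, "row_1.txt": 4}

counts = Counter(external_id for external_id, _ in combined)
overlap_ids = sorted(eid for eid, n in counts.items() if n > 1)

# For every doubly labeled sample, the kept rating must be the minimum available.
for eid in overlap_ids:
    best = min(r for e, r in combined if e == eid)
    assert cleaned[eid] == best

print(overlap_ids)  # ['row_1005.txt', 'row_1006.txt']
```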

In [511]:
df_bs_check.sort_values(['External ID'])

Unnamed: 0,External ID,Dehumanization,Emotion,Mention,Created By,text,rating
2642,row_1005.txt,так,"так, присутня негативна",так,yevhen.marchenko91@gmail.com,"Оно провалилось, но укровояки хвалились тем, ч...",3
1446,row_1005.txt,ні,"так, присутня негативна",так,tutovadesign@gmail.com,"Оно провалилось, но укровояки хвалились тем, ч...",5
2643,row_1006.txt,так,"так, присутня негативна",так,yevhen.marchenko91@gmail.com,Западный мем о степени правдивости пропаганды ...,3
1447,row_1006.txt,так,"так, присутня негативна",так,tutovadesign@gmail.com,Западный мем о степени правдивости пропаганды ...,5
2644,row_1008.txt,так,"так, присутня негативна",так,yevhen.marchenko91@gmail.com,Война до последнего украинца – это вовсе не вы...,3
...,...,...,...,...,...,...,...
3809,row_98.txt,так,"так, присутня негативна",так,nazariy.melnychuk9@gmail.com,Благодаря такому единству можно уверенно гаран...,7
3997,row_984.txt,ні,"ні, оцінка не присутня",ні,kateryna.burovova@ucu.edu.ua,Этапы Китайской компартии по пути к мировой ге...,1
2627,row_984.txt,ні,"ні, оцінка не присутня",ні,tutovadesign@gmail.com,Этапы Китайской компартии по пути к мировой ге...,5
2761,row_998.txt,так,"ні, оцінка не присутня",так,yevhen.marchenko91@gmail.com,Хорошее видео от 4 бригады НМ ЛНР https://t.me...,3


In [512]:
df_cleaned[df_cleaned['External ID']=='row_1128.txt']

Unnamed: 0,External ID,Dehumanization,Emotion,Mention,Created By,text,rating
98,row_1128.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,По виртуальным планам укрорейха они уже взяли ...,4


In [516]:
df_cleaned.to_csv('/Users/katerynaburovova/PycharmProjects/dehumanization/annotation/final_labels.csv')

In [515]:
df_cleaned[df_cleaned['External ID']=='row_3210.txt']

Unnamed: 0,External ID,Dehumanization,Emotion,Mention,Created By,text,rating
941,row_3210.txt,так,"так, присутня негативна",так,snizannabotvin@gmail.com,"""Бандеровцы на Украине сегодня взяли все худше...",4
