In [10]:
import pandas as pd
import numpy as np
from sklearn.metrics import cohen_kappa_score

## Evaluation
After the manual assessment of the generated counter-speech responses, the dataset contains several columns that record the ratings for different aspects of each response: `empathy_factor`, `fidelity_to_facts`, `tone`, `clarity_of_argumentation`, and `successful_use_of_storytelling`. Each aspect was rated on a scale from 1 to 5 to allow a quantitative analysis of these qualitative characteristics. An excerpt of the rated table is shown below.

In [3]:
# Load the evaluated dataset
evaluated_samples = pd.read_csv('../data/evaluation/counterspeech_90_samples_evaluated.csv', delimiter=';', quotechar='"')
evaluated_samples.head(30)

Unnamed: 0,comment_id,comment_text,model,counter_speech,empathy_factor,fidelity_to_facts,tone,clarity_of_argumentation,successful_use_of_storytelling
0,6,FOURTEEN OCTOBER TWENTY TWENTY FOUR BEGIN THIR...,counter_speech_gpt-4o,I can see where concerns about world events co...,4,4,5,4,3.0
1,6,FOURTEEN OCTOBER TWENTY TWENTY FOUR BEGIN THIR...,counter_speech_llama3,I understand your concern about the date and t...,5,3,5,5,4.0
2,6,FOURTEEN OCTOBER TWENTY TWENTY FOUR BEGIN THIR...,counter_speech_mistral,"""As a history buff like Neil deGrasse Tyson, I...",3,3,4,2,2.0
3,9,"Wake up, sheeple!\nBiden's just a puppet for t...",counter_speech_gpt-4o,It's understandable to feel concerned about th...,4,4,5,5,3.0
4,9,"Wake up, sheeple!\nBiden's just a puppet for t...",counter_speech_llama3,I understand your concerns about President Bid...,4,4,5,5,
5,9,"Wake up, sheeple!\nBiden's just a puppet for t...",counter_speech_mistral,Hello there! It's understandable to feel unset...,5,4,5,4,
6,12,PATRIOTS TAP IN:\nWe’re doing a nation wide gr...,counter_speech_gpt-4o,I appreciate the concern for the well-being of...,5,4,5,4,3.0
7,12,PATRIOTS TAP IN:\nWe’re doing a nation wide gr...,counter_speech_llama3,"""I understand the enthusiasm, but let's focus ...",4,3,4,4,
8,12,PATRIOTS TAP IN:\nWe’re doing a nation wide gr...,counter_speech_mistral,"In the heart of Washington D.C., Dr. Fauci, a ...",4,3,5,3,3.0
9,15,It’s time to take over the digital billboards...,counter_speech_gpt-4o,I understand your passion and concern for the ...,4,5,5,5,4.0
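
Since every criterion is supposed to lie on the 1 to 5 scale, a quick range check can catch typos from manual entry. The following is only a minimal sketch on toy data: the helper `find_out_of_range` and the toy frame are hypothetical and not part of the original notebook. Missing ratings (as seen in `successful_use_of_storytelling`) are deliberately tolerated.

```python
import pandas as pd

RATING_COLUMNS = [
    "empathy_factor",
    "fidelity_to_facts",
    "tone",
    "clarity_of_argumentation",
    "successful_use_of_storytelling",
]


def find_out_of_range(df, columns=RATING_COLUMNS, low=1, high=5):
    """Return the rows in which any present rating lies outside [low, high]."""
    ratings = df[columns]
    # Comparisons with NaN evaluate to False, so missing ratings are not flagged.
    mask = ((ratings < low) | (ratings > high)).any(axis=1)
    return df[mask]


# Toy data (not the real evaluation file); the last row contains an invalid 6.
toy = pd.DataFrame({
    "empathy_factor": [4, 5, 6],
    "fidelity_to_facts": [4, 3, 3],
    "tone": [5, 5, 4],
    "clarity_of_argumentation": [4, 5, 2],
    "successful_use_of_storytelling": [3.0, None, 2.0],
})

bad_rows = find_out_of_range(toy)
print(bad_rows.index.tolist())
```

On the real data, the same call would be `find_out_of_range(evaluated_samples)`; an empty result means all present ratings are on the scale.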


### Statistical consistency check
After the first rating round, a second round is carried out to check the consistency of the ratings. To this end, a random sample of 20 counter-speech responses is drawn from the dataset so that they can be rated again. This sample is saved to a new file, which serves as the basis for the second evaluation round.


In [8]:
sorted_melted_data = pd.read_csv('../data/evaluation/counterspeech_90_samples_to_evaluate.csv')

# Set the random seed to ensure reproducibility
np.random.seed(42)

# Randomly select 20 entries without replacement
sampled_data_2eval = sorted_melted_data.sample(n=20, replace=False)

# Save the selected sample to a new CSV file
sampled_data_2eval.to_csv('../data/evaluation/intra_rater_reliability_20samples_2eval_to_evaluate.csv', index=False)
sampled_data_2eval.head(30)

Unnamed: 0,comment_id,comment_text,model,counter_speech,empathy_factor,fidelity_to_facts,tone,clarity_of_argumentation,successful_use_of_storytelling
40,45,#BreakingNews : Trump’s Sending Deep State Eli...,counter_speech_mistral,"Ah, the buzz about Trump sending deep state el...",,,,,
22,27,Do you get paid to shill for vaccines?\nOr are...,counter_speech_gpt-4o,"I understand your concern, and it’s important ...",,,,,
55,60,GOOD. Now it’s time to dismantle all those cor...,counter_speech_gpt-4o,I understand the frustration of feeling deceiv...,,,,,
70,75,If #DrDavidMorens was Republican this email wo...,counter_speech_mistral,"Acting as a fact-driven researcher, I'd like t...",,,,,
0,6,FOURTEEN OCTOBER TWENTY TWENTY FOUR BEGIN THIR...,counter_speech_gpt-4o,I can see where concerns about world events co...,,,,,
26,30,WE KNOW THE CIA COINED THE TERM “CONSPIRACY TH...,counter_speech_llama3,"""I understand the curiosity behind this questi...",,,,,
39,45,#BreakingNews : Trump’s Sending Deep State Eli...,counter_speech_gpt-4o,I understand how alarming this information mig...,,,,,
65,69,#CovidCoverUp Fauci's and his gang are Serial ...,counter_speech_mistral,"""Imagine Dr. Fauci, a humble public servant, t...",,,,,
10,15,It’s time to take over the digital billboards...,counter_speech_gpt-4o,I understand your passion and concern for the ...,,,,,
44,48,Issue is Christian Zionists want an Israel as ...,counter_speech_mistral,"""Let's journey into the world of real-life lea...",,,,,
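
As a side note, the seed can also be passed directly to `DataFrame.sample` through its `random_state` parameter, which keeps the reproducibility local to the sampling call instead of relying on NumPy's global random state. The sketch below uses a toy frame as a stand-in for the 90 rows; the frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the 90 generated counter-speech rows (not the real data).
toy_samples = pd.DataFrame({
    "comment_id": [6 + 3 * (i // 3) for i in range(90)],
    "model": [
        "counter_speech_gpt-4o",
        "counter_speech_llama3",
        "counter_speech_mistral",
    ] * 30,
})

# Passing random_state makes the draw repeatable without touching global state.
sample_a = toy_samples.sample(n=20, replace=False, random_state=42)
sample_b = toy_samples.sample(n=20, replace=False, random_state=42)

print(sample_a.equals(sample_b))   # both draws select the same rows
print(sample_a.index.is_unique)    # sampling without replacement: no duplicates
```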


#### Merging the first and second evaluation rounds

After the re-rating, the values from the first and second evaluation rounds have to be merged to prepare them for the calculation of the weighted kappa coefficient. We use `comment_id` and `model` as keys to ensure that the ratings of each counter-speech response from both rounds are matched correctly.

The relevant columns (the rating criteria) are extracted from both datasets, labelled with the suffixes `_first` and `_second` for the first and second round, and then merged.

In [9]:
# Load sampled datasets from first and second round of evaluations
data_first_eval = pd.read_csv('../data/evaluation/intra_rater_reliability_20samples_1eval.csv', delimiter=';', quotechar='"')
data_second_eval = pd.read_csv('../data/evaluation/intra_rater_reliability_20samples_2eval.csv', delimiter=';', quotechar='"')

# Rename columns for first and second round of evaluation for clear identification
columns_to_use = ['empathy_factor', 'fidelity_to_facts', 'tone', 'clarity_of_argumentation', 'successful_use_of_storytelling']
data_first_eval_renamed = data_first_eval[['comment_id', 'model'] + columns_to_use].add_suffix('_first')
data_second_eval_renamed = data_second_eval[['comment_id', 'model'] + columns_to_use].add_suffix('_second')

# Merge data on 'comment_id' and 'model'
merged_data_kappa = pd.merge(data_first_eval_renamed, data_second_eval_renamed, left_on=['comment_id_first', 'model_first'], right_on=['comment_id_second', 'model_second'])

# Show merged data
merged_data_kappa.head(30)

Unnamed: 0,comment_id_first,model_first,empathy_factor_first,fidelity_to_facts_first,tone_first,clarity_of_argumentation_first,successful_use_of_storytelling_first,comment_id_second,model_second,empathy_factor_second,fidelity_to_facts_second,tone_second,clarity_of_argumentation_second,successful_use_of_storytelling_second
0,6,counter_speech_gpt-4o,4,4,5,4,3.0,6,counter_speech_gpt-4o,5,4,5,4,3.0
1,9,counter_speech_llama3,4,4,5,5,,9,counter_speech_llama3,4,4,5,5,
2,15,counter_speech_gpt-4o,4,5,5,5,4.0,15,counter_speech_gpt-4o,5,5,5,5,5.0
3,18,counter_speech_llama3,5,5,5,5,,18,counter_speech_llama3,5,5,5,5,
4,24,counter_speech_gpt-4o,5,4,5,5,4.0,24,counter_speech_gpt-4o,5,4,5,4,4.0
5,27,counter_speech_gpt-4o,5,5,5,5,5.0,27,counter_speech_gpt-4o,5,5,5,5,5.0
6,30,counter_speech_llama3,5,4,5,4,,30,counter_speech_llama3,5,4,5,4,
7,33,counter_speech_gpt-4o,4,4,5,5,4.0,33,counter_speech_gpt-4o,4,4,5,5,4.0
8,39,counter_speech_gpt-4o,4,4,5,5,4.0,39,counter_speech_gpt-4o,4,4,5,5,4.0
9,45,counter_speech_mistral,3,3,4,3,4.0,45,counter_speech_mistral,3,3,4,3,4.0
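
An alternative way to express the same key-based merge is to let `pd.merge` add the suffixes itself via `suffixes=` and to keep the key columns unsuffixed. The sketch below uses two small toy frames (not the real evaluation files) and adds `validate="one_to_one"`, which makes pandas raise an error if a `(comment_id, model)` pair occurs more than once in either round.

```python
import pandas as pd

keys = ["comment_id", "model"]
criteria = ["empathy_factor", "tone"]  # shortened list for the toy example

# Toy ratings from two rounds; the second round is stored in a different order.
first = pd.DataFrame({
    "comment_id": [6, 9],
    "model": ["counter_speech_gpt-4o", "counter_speech_llama3"],
    "empathy_factor": [4, 4],
    "tone": [5, 5],
})
second = pd.DataFrame({
    "comment_id": [9, 6],
    "model": ["counter_speech_llama3", "counter_speech_gpt-4o"],
    "empathy_factor": [4, 5],
    "tone": [5, 5],
})

merged = pd.merge(
    first[keys + criteria],
    second[keys + criteria],
    on=keys,                          # shared key columns stay unsuffixed
    suffixes=("_first", "_second"),   # applied to the overlapping rating columns
    validate="one_to_one",            # fail loudly on duplicated keys
)
print(merged.columns.tolist())
```

The resulting rating columns carry the same `_first` and `_second` names as in the cell above, so the kappa loop would work unchanged; only the duplicated key columns disappear.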


### Computing the weighted kappa coefficient
Once the data has been merged correctly, we can compute the weighted kappa coefficient for each rating criterion. This is done with the `cohen_kappa_score` function from `sklearn.metrics`.
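
For reference, the linearly weighted kappa can be written as follows. This is the standard textbook definition rather than anything specific to this notebook; the symbols $O_{ij}$, $E_{ij}$, $r_i$, $c_j$ and $N$ are introduced here for illustration.

$$
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}},
\qquad w_{ij} = |i - j|,
\qquad E_{ij} = \frac{r_i \, c_j}{N}
$$

Here $O_{ij}$ is the number of responses rated category $i$ in the first round and category $j$ in the second, $r_i$ and $c_j$ are the corresponding row and column totals, and $N$ is the number of rated responses. Identical ratings carry weight 0, and a disagreement is penalised in proportion to how many scale steps apart the two ratings are.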

In [13]:
# Calculation of weighted Kappa for each evaluation criterion
kappa_scores = {}
for criterion in columns_to_use:
    # Names of columns for first and second round
    col_first = f'{criterion}_first'
    col_second = f'{criterion}_second'
    
    # Replace missing values with 0 (an absent rating is then treated as its own category)
    ratings_first = merged_data_kappa[col_first].fillna(0)
    ratings_second = merged_data_kappa[col_second].fillna(0)
    
    # Compute weighted Kappa
    kappa = cohen_kappa_score(ratings_first, ratings_second, weights='linear')
    kappa_scores[criterion] = kappa

# Print Kappa scores
kappa_scores

{'empathy_factor': np.float64(0.849624060150376),
 'fidelity_to_facts': np.float64(0.8780487804878049),
 'tone': np.float64(1.0),
 'clarity_of_argumentation': np.float64(0.8529411764705882),
 'successful_use_of_storytelling': np.float64(0.9107142857142857)}
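
Because the cell above replaces missing ratings with 0, pairs in which the rating is missing in both rounds count as agreement on an extra category. One possible alternative, sketched below under stated assumptions, is to use only pairs that were rated in both rounds and to pass the full scale through the `labels` parameter. The latter matters because `cohen_kappa_score` derives its linear weights from the positions of the labels, so unobserved scale points would otherwise shrink the distances. The helper name `weighted_kappa_complete_cases` is hypothetical, and the toy input is simply the ten `successful_use_of_storytelling` pairs displayed in the merged table above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

SCALE = [1, 2, 3, 4, 5]  # the full rating scale used in this evaluation


def weighted_kappa_complete_cases(first, second, labels=SCALE):
    """Linear weighted kappa over pairs rated in both rounds.

    Returns the kappa value and the number of pairs that were used.
    """
    pairs = pd.DataFrame({"first": first, "second": second}).dropna()
    kappa = cohen_kappa_score(
        pairs["first"].astype(int),
        pairs["second"].astype(int),
        labels=labels,        # keep all five scale points, even if unobserved
        weights="linear",
    )
    return kappa, len(pairs)


# The ten storytelling pairs shown in the merged table (NaN = not rated).
first_round = [3.0, np.nan, 4.0, np.nan, 4.0, 5.0, np.nan, 4.0, 4.0, 4.0]
second_round = [3.0, np.nan, 5.0, np.nan, 4.0, 5.0, np.nan, 4.0, 4.0, 4.0]

kappa, n_pairs = weighted_kappa_complete_cases(first_round, second_round)
print(f"kappa={kappa:.3f} based on {n_pairs} complete pairs")
```

Reporting the number of complete pairs next to each kappa value makes it visible when a criterion rests on only a few re-rated responses.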