# Cohen's Kappa

A measure of inter-rater reliability that takes into account the possibility of agreement by chance.

In [1]:
import pandas as pd
from sklearn.metrics import cohen_kappa_score

$\kappa = 1 - \frac{1-P_o}{1-P_e}$

Where:
- $P_o$ is the **observed agreement**: the proportion of items on which the raters agree.
- $P_e$ is the **expected agreement**: the proportion of items on which the raters would agree by chance, given how often each rater uses each category.
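To make the formula concrete, here is a minimal sketch that computes $P_o$, $P_e$ and $\kappa$ from scratch. The function name `cohen_kappa_manual` and the toy yes/no labels are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def cohen_kappa_manual(rater_a, rater_b):
    """Cohen's kappa via kappa = 1 - (1 - P_o) / (1 - P_e) (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2

    # Note: if both raters only ever use one identical label, p_e is 1 and
    # this division is undefined.
    return 1 - (1 - p_o) / (1 - p_e)


# Toy data (assumed for illustration): the raters agree on 4 of 6 items.
toy_a = ["yes", "yes", "no", "no", "yes", "no"]
toy_b = ["yes", "no", "no", "no", "yes", "yes"]
print(round(cohen_kappa_manual(toy_a, toy_b), 4))  # 0.3333
```

Here $P_o = 4/6$ and $P_e = 0.5$, so the raw agreement of about 67% shrinks to $\kappa \approx 0.33$ once chance agreement is removed.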


Consider an example where 30 outputs have been assessed by both a human judge and an LLM judge. Each was asked to give an objective accuracy score from 1 to 5 (taking into account falsehoods and missing content) and a subjective opinion of the output's clarity (either "good", "okay" or "poor").

In [6]:
human_responses = [
    [5, "good"], [4, "good"], [2, "okay"], [5, "good"], [2, "poor"], 
    [4, "good"], [5, "good"], [3, "okay"], [4, "good"], [5, "good"], 
    [2, "poor"], [3, "okay"], [5, "good"], [4, "good"], [3, "good"], 
    [3, "okay"], [4, "good"], [2, "poor"], [4, "good"], [5, "good"], 
    [3, "okay"], [5, "good"], [4, "good"], [2, "poor"], [4, "good"], 
    [5, "good"], [3, "okay"], [1, "good"], [4, "good"], [3, "okay"]
]


In [3]:
llm_responses = [
    [5, "good"], [4, "good"], [3, "okay"], [5, "good"], [2, "poor"], 
    [4, "okay"], [5, "good"], [3, "okay"], [4, "good"], [5, "good"], 
    [2, "poor"], [3, "okay"], [5, "good"], [4, "good"], [5, "good"], 
    [3, "okay"], [4, "good"], [2, "poor"], [4, "good"], [5, "good"], 
    [3, "okay"], [5, "good"], [4, "good"], [2, "poor"], [4, "okay"], 
    [5, "good"], [3, "okay"], [5, "good"], [4, "good"], [3, "okay"]
]


The Cohen's kappa score for the agreement between the two judges can be calculated with scikit-learn. Here we've decided to calculate it separately for each component, starting with the objective one.

In [7]:
# Extract only the objective accuracy values (first element of each response)
human_accuracy = [response[0] for response in human_responses]
llm_accuracy = [response[0] for response in llm_responses]

# Calculate Cohen's Kappa for objective accuracy
kappa_score = cohen_kappa_score(human_accuracy, llm_accuracy)

# Output the result
print(f"Cohen's Kappa Score for Objective Accuracy: {kappa_score:.4f}")

Cohen's Kappa Score for Objective Accuracy: 0.8657
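As a check on this value, the calculation can be worked by hand from the two score lists. The judges agree on 27 of the 30 outputs. Tallying the lists, the human judge gives scores 1 to 5 with counts 1, 5, 7, 9, 8 and the LLM judge with counts 0, 4, 7, 9, 10, so:

$$
P_o = \frac{27}{30} = 0.9, \qquad
P_e = \frac{1 \cdot 0 + 5 \cdot 4 + 7 \cdot 7 + 9 \cdot 9 + 8 \cdot 10}{30^2} = \frac{230}{900} \approx 0.2556
$$

$$
\kappa = 1 - \frac{1 - P_o}{1 - P_e} = 1 - \frac{0.1}{0.7444} \approx 0.8657
$$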


And for the subjective component (clarity):

In [8]:
# Extract only the subjective clarity labels (second element of each response)
human_accuracy_sub = [response[1] for response in human_responses]
llm_accuracy_sub = [response[1] for response in llm_responses]

# Calculate Cohen's Kappa for subjective clarity
kappa_score_sub = cohen_kappa_score(human_accuracy_sub, llm_accuracy_sub)

# Output the result
print(f"Cohen's Kappa Score for Subjective Clarity: {kappa_score_sub:.4f}")

Cohen's Kappa Score for Subjective Clarity: 0.8795
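To see where the two judges disagree, it can help to cross-tabulate their labels. The sketch below uses `pd.crosstab`; the list names `human_clarity` and `llm_clarity` are assumptions introduced here, and the labels are copied from the response lists above so that the cell runs on its own.

```python
import pandas as pd

# Clarity labels copied from human_responses and llm_responses, in order.
human_clarity = [
    "good", "good", "okay", "good", "poor",
    "good", "good", "okay", "good", "good",
    "poor", "okay", "good", "good", "good",
    "okay", "good", "poor", "good", "good",
    "okay", "good", "good", "poor", "good",
    "good", "okay", "good", "good", "okay",
]
llm_clarity = [
    "good", "good", "okay", "good", "poor",
    "okay", "good", "okay", "good", "good",
    "poor", "okay", "good", "good", "good",
    "okay", "good", "poor", "good", "good",
    "okay", "good", "good", "poor", "okay",
    "good", "okay", "good", "good", "okay",
]

# Rows are the human judge's labels, columns are the LLM judge's labels.
table = pd.crosstab(
    pd.Series(human_clarity, name="human"),
    pd.Series(llm_clarity, name="llm"),
)
print(table)
```

The diagonal holds the 28 agreements. The only off-diagonal entries are the two outputs the human rated "good" and the LLM rated "okay".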


A rough interpretation of Cohen's kappa scores:

- Less than or equal to 0: No agreement
- 0.01–0.20: None to slight agreement
- 0.21–0.40: Fair agreement
- 0.41–0.60: Moderate agreement
- 0.61–0.80: Substantial agreement
- 0.81–1.00: Almost perfect agreement

A score of 0 means that the raters agreed as often as if they were randomly guessing. A score of 1 means that the raters agreed completely.
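The bands above can be turned into a small lookup. This is a sketch only: the function name `interpret_kappa` is hypothetical, and values that fall between two listed bands (for example 0.205) are assigned to the higher band, which is an assumption rather than part of the scale.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a kappa score to the rough interpretation bands (hypothetical helper)."""
    if kappa > 1:
        raise ValueError("Cohen's kappa cannot exceed 1.")
    if kappa <= 0:
        return "No agreement"

    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "None to slight agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
        (1.00, "Almost perfect agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    # Unreachable for kappa <= 1, kept so the function always returns a string.
    return "Almost perfect agreement"


print(interpret_kappa(0.8657))  # Almost perfect agreement
print(interpret_kappa(0.8795))  # Almost perfect agreement
```

Both scores from this example fall in the top band.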