# Fleiss' Kappa
Fleiss' kappa measures how much your judges agree with each other, beyond the agreement you would expect by chance. It is meant to be used with more than two judges.

Read https://www.datanovia.com/en/blog/kappa-coefficient-interpretation/ to learn more.
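
To make the statistic concrete, here is a minimal sketch of the textbook Fleiss' kappa calculation in plain Python. It is not the statsmodels implementation used later in this notebook, and the function name `fleiss_kappa_from_counts` and the toy table are made up for illustration. It assumes a table with one row per rated item and one column per rating category, where every row sums to the same number of raters.

```python
def fleiss_kappa_from_counts(table):
    """Fleiss' kappa for a (subjects x categories) table of rating counts.

    Assumes every row sums to the same number of raters (at least 2).
    Undefined (division by zero) when all ratings fall in one category.
    """
    n_subjects = len(table)
    n_raters = sum(table[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per subject")
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every subject needs the same number of ratings")
    total = n_subjects * n_raters

    # Share of all ratings that fell into each category.
    n_categories = len(table[0])
    p_cat = [sum(row[j] for row in table) / total for j in range(n_categories)]

    # Per-subject agreement: fraction of rater pairs that agree.
    p_subj = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]

    p_observed = sum(p_subj) / n_subjects      # mean observed agreement
    p_expected = sum(p * p for p in p_cat)     # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical table: 3 documents, 3 judges, 2 rating categories.
example = [[3, 0], [0, 3], [2, 1]]
print(fleiss_kappa_from_counts(example))
```

A value of 1 means the judges always agree, 0 means agreement is no better than chance, and negative values mean agreement is worse than chance.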

Please copy this example and customize it for your own purposes!

## Imports

In [1]:
import pandas as pd
from js import fetch
import json

from collections import defaultdict
from statsmodels.stats.inter_rater import aggregate_raters
from statsmodels.stats.inter_rater import fleiss_kappa
from IPython.display import display, Markdown

## Step 0: Configuration

In [2]:
QUEPID_BOOK_NUM = 25

## Step 1: Download the Quepid Book

In [3]:
async def get_text(url):
    resp = await fetch(url)
    resp_text = await resp.text()
    return resp_text

In [4]:
data = await get_text(f'/api/books/{QUEPID_BOOK_NUM}.csv')

## Step 2: Extract and Prepare Data

In [5]:
from io import StringIO
df = pd.read_csv(StringIO(data))
df

Unnamed: 0,query,docid,David Tippett,Eric Pugh,Atita Arora,Cody Collier,Benjamin Trent,Jeff Alexander,Chris Marino,charlie@flax.co.uk,Michael Froh,peter@searchintuition.com,Maximilian Werk,David Fisher,Ryan Finley,Erica Schramma,Peter Fries
0,projector screen,325961,3.0,3.0,,,,,,,,,,,,,
1,projector screen,47471,3.0,3.0,,,,,,,,,,,,,
2,projector screen,126679,3.0,3.0,,,,,,,,,,,,,
3,projector screen,254441,,3.0,,,,,,,,,,,,,
4,projector screen,325958,,3.0,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2415,power supply,1667352,,0.0,,,,,,,,,,,,,
2416,power supply,1667804,,0.0,,,,,,,,,,,,,
2417,power supply,1667752,,0.0,,,,,,,,,,,,,
2418,power supply,1667821,,0.0,,,,,,,,,,,,,


## Step 3: Aggregate Raters' Data

In [6]:
# Collect each row's non-missing ratings, then count them per (query, docid)
raters = list(df.columns[2:])
df['judgments'] = df[raters].values.tolist()
df['judgments'] = df['judgments'].apply(lambda x: pd.Series(x).dropna().tolist())
rated = df[['query', 'docid', 'judgments']].explode('judgments')
rated['count'] = rated.groupby(['query', 'docid'])['judgments'].transform('count')

# Use crosstab to create a contingency table
data_crosstab = pd.crosstab(index=rated['docid'], columns=rated['judgments'], values=rated['count'], aggfunc='sum')

# Drop any rows (docids) that have a NaN in any rating column
data_crosstab = data_crosstab.dropna(how='any')

# Extract the underlying NumPy array to pass to aggregate_raters
data_for_aggregation = data_crosstab.values

# Aggregate the raters' data
table, _ = aggregate_raters(data_for_aggregation)
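
When customizing this step, it helps to know the shapes involved. Per the statsmodels documentation, `aggregate_raters` takes raw labels with shape (subjects, raters) and returns counts with shape (subjects, categories), which is the layout `fleiss_kappa` consumes. The sketch below shows that aggregation in plain Python on made-up judgments, so you can compare it against your own intermediate arrays. The helper name `counts_per_category` and the toy data are hypothetical and not part of Quepid or statsmodels.

```python
from collections import Counter


def counts_per_category(ratings_by_subject, categories):
    """Turn per-subject lists of ratings into a (subjects x categories) count table."""
    table = []
    for ratings in ratings_by_subject:
        tally = Counter(ratings)
        # Counter returns 0 for categories that no judge picked.
        table.append([tally[c] for c in categories])
    return table


# Hypothetical judgments: three documents, each rated by three judges.
toy_ratings = [[3.0, 3.0, 2.0], [0.0, 0.0, 0.0], [1.0, 3.0, 3.0]]
toy_table = counts_per_category(toy_ratings, categories=[0.0, 1.0, 2.0, 3.0])

for row in toy_table:
    print(row)
```

Each row of the resulting table sums to the number of judges for that document. Fleiss' kappa assumes this sum is the same for every row, so documents with fewer judgments than the rest may need to be filtered out first.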

## Step 4: Compute Fleiss' Kappa

In [7]:
kappa = fleiss_kappa(table, method='fleiss')
display(Markdown(f"## Fleiss' Kappa: {kappa:.4f}"))

## Fleiss' Kappa: -0.2632
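
To put a score like this into words, one commonly cited convention is the Landis and Koch (1977) scale. The sketch below maps a kappa value to that scale's labels. The function name `describe_kappa` is hypothetical, and the cut-offs are a rule of thumb rather than part of this notebook or of statsmodels, so treat the labels as a rough guide.

```python
def describe_kappa(kappa):
    """Map a kappa value to the Landis and Koch (1977) verbal label."""
    if kappa < 0:
        return "poor (less agreement than expected by chance)"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


print(describe_kappa(-0.2632))
```

A negative value such as the one above falls in the lowest band, meaning the judges agreed less often than chance alone would predict.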

_This notebook was last updated 17-JAN-2025_