| Kappa value | Strength of agreement | Notes                                                    |
| ----------- | --------------------- | -------------------------------------------------------- |
| ≤ 0.00      | Poor                  | Agreement is no better than chance, or worse than chance |
| 0.01 – 0.20 | Slight                | Very low agreement beyond chance                         |
| 0.21 – 0.40 | Fair                  | Some agreement, but not strong                           |
| 0.41 – 0.60 | Moderate              | Reasonable consistency between raters                    |
| 0.61 – 0.80 | Substantial           | Good agreement between raters                            |
| 0.81 – 1.00 | Almost perfect        | Very high agreement between raters                       |
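
The bands in the table can be encoded in a small lookup helper. This is a minimal sketch: `interpret_kappa` is a hypothetical name (it is not part of statsmodels), and it assumes the bands are contiguous, so a value such as 0.205 is labelled "Fair".

```python
def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to the agreement label used in the table above."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa <= 0.0:
        return "Poor"
    # (upper bound of the band, label), in ascending order
    bands = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"


print(interpret_kappa(0.17))  # falls in the 0.01 – 0.20 band
```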


In [6]:
import pandas as pd
from statsmodels.stats import inter_rater

# Ratings from two raters on the same five items (repeat this analysis for each criterion)
rater_1 = [1, 0, 1, 1, 0]
rater_2 = [1, 1, 0, 1, 0]

# Build a contingency table (cross-tabulation of counts)
# Rows    = rating given by rater 1
# Columns = rating given by rater 2
# Cell    = number of items that received that combination of ratings
#
#              rater_2
#               0   1
# rater_1  0    1   1
#          1    1   2

contingency = pd.crosstab(rater_1, rater_2, margins=False)  # margins=False: do not add row/column totals

res = inter_rater.cohens_kappa(contingency)

# Kappa value
print("kappa: ", res.kappa)

# The one-sided p-value tests H0: kappa = 0 against the alternative kappa > 0,
# i.e. whether the observed agreement is significantly better than chance
print("p-value (one_sided): ", res.pvalue_one_sided)


kappa:  0.1666666666666666
p-value (one_sided):  0.35469405750711325
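
To see where the kappa value comes from, the statistic can be reproduced by hand from its definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each rater's marginal frequencies. The sketch below uses only the standard library; the names `p_o` and `p_e` are chosen here for illustration.

```python
from collections import Counter

rater_1 = [1, 0, 1, 1, 0]
rater_2 = [1, 1, 0, 1, 0]
n = len(rater_1)

# Observed agreement: share of items on which both raters gave the same rating
p_o = sum(a == b for a, b in zip(rater_1, rater_2)) / n

# Chance agreement: for each category, multiply the two raters' marginal
# proportions, then sum over categories
counts_1 = Counter(rater_1)
counts_2 = Counter(rater_2)
categories = set(counts_1) | set(counts_2)
p_e = sum((counts_1[c] / n) * (counts_2[c] / n) for c in categories)

kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.2f}, p_e = {p_e:.2f}, kappa = {kappa:.4f}")
```

The raters agree on 3 of 5 items (p_o = 0.60), but 0.52 of that is already expected by chance, which is why kappa is only about 0.17.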


Interpreting the results:
- Null hypothesis (H₀): the agreement between raters is no better than chance, i.e. kappa = 0.

- The p-value is 0.355, which is greater than the common significance level α = 0.05, so we fail to reject H₀.

Thus:
- The observed agreement (kappa ≈ 0.17, "slight" on the scale above) could plausibly be due to chance. There is no strong statistical evidence that the raters agree beyond chance; with only five items, the test has very little power to detect agreement.
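
The decision rule can be written out explicitly. The sketch below also back-calculates the z statistic implied by the reported p-value, under the assumption that the p-value comes from a normal approximation, p = 1 - Φ(z) with z = kappa / SE under H₀. The variable names are illustrative, and the implied values are derived from the printed output rather than read from statsmodels.

```python
from statistics import NormalDist

kappa = 0.1666666666666666     # value printed above
p_value = 0.35469405750711325  # one-sided p-value printed above
alpha = 0.05                   # assumed significance level

# Decision rule: reject H0 only if the p-value is below alpha
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(decision)

# Assumption: one-sided p-value from a standard normal, p = 1 - Phi(z)
z_implied = NormalDist().inv_cdf(1 - p_value)
se0_implied = kappa / z_implied  # implied standard error of kappa under H0
print(f"implied z = {z_implied:.2f}, implied SE under H0 = {se0_implied:.2f}")
```

Under this assumption the implied z is well below 1.645, the one-sided critical value at α = 0.05, which is another way of seeing that H₀ is not rejected.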

| Fleiss’ kappa | Strength of agreement |
| ------------- | --------------------- |
| ≤ 0.00        | Poor                  |
| 0.01 – 0.20   | Slight                |
| 0.21 – 0.40   | Fair                  |
| 0.41 – 0.60   | Moderate              |
| 0.61 – 0.80   | Substantial           |
| 0.81 – 1.00   | Almost perfect        |


In [8]:
# Example ratings from 3 raters on 5 items (rows = items, cols = raters)
# Each row corresponds to a subject/item, each column to a rater’s decision (0 = no, 1 = yes)
ratings = [
    [1, 0, 1],  # Item 1
    [1, 1, 1],  # Item 2
    [0, 0, 0],  # Item 3
    [1, 1, 0],  # Item 4
    [0, 1, 0],  # Item 5
]

df = pd.DataFrame(ratings, columns=["rater_1", "rater_2", "rater_3"])

# Convert ratings into the required format: count of raters per category for each item
# e.g., for each row: how many raters said 0, how many said 1
table = df.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)

# Fleiss' kappa
# Note: statsmodels implements Fleiss' kappa (multi-rater agreement) as a point estimate only,
# not as a full hypothesis test, so only the statistic is returned
fleiss = inter_rater.fleiss_kappa(table.values)

print(f"Fleiss’ kappa = {fleiss:.3f}")

Fleiss’ kappa = 0.196
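
As a cross-check, Fleiss' kappa can be reproduced from its definition: average the per-item agreement among raters, then correct for the agreement expected from the overall category proportions. This is a minimal standard-library sketch; names such as `p_bar` and `fleiss_manual` are illustrative and not part of statsmodels.

```python
ratings = [
    [1, 0, 1],  # Item 1
    [1, 1, 1],  # Item 2
    [0, 0, 0],  # Item 3
    [1, 1, 0],  # Item 4
    [0, 1, 0],  # Item 5
]
categories = [0, 1]
n_items = len(ratings)
n_raters = len(ratings[0])

# counts[i][j] = number of raters who assigned category j to item i
counts = [[row.count(c) for c in categories] for row in ratings]

# Per-item agreement: proportion of rater pairs that agree on the item
p_items = [
    (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
    for row in counts
]
p_bar = sum(p_items) / n_items

# Chance agreement from the overall proportion of each category
p_cats = [
    sum(row[j] for row in counts) / (n_items * n_raters)
    for j in range(len(categories))
]
p_e = sum(p * p for p in p_cats)

fleiss_manual = (p_bar - p_e) / (1 - p_e)
print(f"Fleiss' kappa (manual) = {fleiss_manual:.3f}")
```

The manual value agrees with the statsmodels result of 0.196, which falls in the "Slight" band.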
