Cohen's kappa coefficient

Jeffrey M Girard edited this page Apr 4, 2016 · 41 revisions

Overview

The kappa coefficient is a chance-adjusted index for the reliability of categorical measurements. It is thus meant to equal the ratio of "observed nonchance-agreement" to "possible nonchance-agreement" and to answer the question: how often did the raters agree when they weren't guessing?

The kappa coefficient estimates chance agreement using an individual-distribution-based approach. Specifically, it assumes that raters engage in a chance-based process to determine whether to classify each item randomly or deliberately prior to inspecting it. It assumes that the likelihood of raters randomly assigning an item to the same category is based on the product of raters' individual distributions for each category.
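As a minimal sketch of the second assumption, the Python snippet below estimates chance agreement for two raters as the sum, over categories, of the product of the raters' individual proportions. The function name `chance_agreement` and the example ratings are hypothetical illustrations, not part of mKAPPA.

```python
from collections import Counter

def chance_agreement(ratings_1, ratings_2):
    """Estimate kappa's chance agreement for two raters who rated the same items."""
    if len(ratings_1) != len(ratings_2) or not ratings_1:
        raise ValueError("both raters must rate the same, non-empty set of items")
    n = len(ratings_1)
    counts_1 = Counter(ratings_1)  # rater 1's individual distribution (as counts)
    counts_2 = Counter(ratings_2)  # rater 2's individual distribution (as counts)
    categories = set(counts_1) | set(counts_2)
    # Sum over categories of the product of the two raters' proportions.
    return sum((counts_1[k] / n) * (counts_2[k] / n) for k in categories)

# Hypothetical ratings of five items into categories "a" and "b".
rater_1 = ["a", "a", "a", "b", "b"]
rater_2 = ["a", "a", "b", "b", "b"]
p_c = chance_agreement(rater_1, rater_2)  # (3/5)(2/5) + (2/5)(3/5)
```

Because each rater's own distribution enters the product, two raters with different category preferences get a different chance-agreement estimate than two raters who share the same pooled distribution.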

Zhao et al. (2012) described these assumptions using the following metaphor (in the case of two categories for simplicity). Each rater has an individual "quota" for how many items he or she must assign to each category. He or she then places two sets of marbles into an unshared urn, where each set corresponds to one of the two categories. Each set has a number of marbles corresponding to its category's quota and has its own color. For each item, each rater draws a marble randomly from his or her urn, notes its color, and then puts it back. If both raters drew the same color, then both raters classify that item randomly by assigning it to the category that corresponds to the color that was drawn (without inspecting the item at all). Only if the raters drew different colors would they classify the item deliberately by inspecting the item and comparing its features to the established category membership criteria. Each rater keeps track of the number of items he or she has assigned to each category; whenever a rater reaches his or her quota for a category, he or she stops drawing from the urn and begins assigning all remaining items to the other category in order to meet its quota.

History

Cohen (1960) proposed the kappa coefficient as an alternative to Scott's pi coefficient for estimating the reliability of two raters using nominal categories. Cohen (1968) then extended the kappa coefficient to accommodate a weighting scheme and Conger (1980) extended it to accommodate any number of raters. Gwet (2014) fully generalized it to accommodate multiple raters, multiple categories, any weighting scheme, and missing data. It is worth noting that Rogot & Goldberg's (1966) A_2 coefficient is equivalent to Cohen's kappa coefficient.

MATLAB Functions

  • mKAPPA %Calculates kappa using vectorized formulas

Simplified Formulas

Use these formulas with two observers and two (dichotomous) categories:


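$$p_o = \frac{n_{11} + n_{22}}{n}$$

$$p_c = \left(\frac{n_{1+}}{n}\right)\left(\frac{n_{+1}}{n}\right) + \left(\frac{n_{2+}}{n}\right)\left(\frac{n_{+2}}{n}\right)$$

$$\kappa = \frac{p_o - p_c}{1 - p_c}$$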


n_11 is the number of items both raters assigned to category 1

n_22 is the number of items both raters assigned to category 2

n is the total number of items

n_1+ is the number of items rater 1 assigned to category 1

n_2+ is the number of items rater 1 assigned to category 2

n_+1 is the number of items rater 2 assigned to category 1

n_+2 is the number of items rater 2 assigned to category 2

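Contingency Table

                       Rater 2: Category 1   Rater 2: Category 2   Total
Rater 1: Category 1    n_11                  n_12                  n_1+
Rater 1: Category 2    n_21                  n_22                  n_2+
Total                  n_+1                  n_+2                  n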
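The simplified case can be sketched in Python as follows. This is an illustrative sketch rather than the mKAPPA implementation: the function name `kappa_2x2` is hypothetical, and `n_12` and `n_21` are assumed names for the two disagreement cells (rater 1 chose category 1 and rater 2 chose category 2, and vice versa), from which the marginal totals are derived.

```python
def kappa_2x2(n_11, n_12, n_21, n_22):
    """Cohen's kappa for two raters and two categories, from the four cell counts."""
    n = n_11 + n_12 + n_21 + n_22
    if n == 0:
        raise ValueError("at least one item is required")
    n_1p = n_11 + n_12  # n_1+: items rater 1 assigned to category 1
    n_2p = n_21 + n_22  # n_2+: items rater 1 assigned to category 2
    n_p1 = n_11 + n_21  # n_+1: items rater 2 assigned to category 1
    n_p2 = n_12 + n_22  # n_+2: items rater 2 assigned to category 2
    p_o = (n_11 + n_22) / n                                   # observed agreement
    p_c = (n_1p / n) * (n_p1 / n) + (n_2p / n) * (n_p2 / n)   # chance agreement
    if p_c == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_c) / (1 - p_c)

# Hypothetical counts: p_o = 35/50 = 0.7 and p_c = 0.5, so kappa is about 0.4.
kappa_example = kappa_2x2(n_11=20, n_12=5, n_21=10, n_22=15)
```

The guard for `p_c == 1` covers the degenerate case in which both raters assign every item to the same single category, where the denominator vanishes.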

Generalized Formulas

Use these formulas with multiple raters, multiple categories, and any weighting scheme:


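$$r^\star_{ik} = \sum_{l=1}^{q} w_{kl} \, r_{il}$$

$$p_o = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{k=1}^{q} \frac{r_{ik} \left(r^\star_{ik} - 1\right)}{r_i \left(r_i - 1\right)}$$

$$p_{gk} = \frac{n_{gk}}{n_g}$$

$$\bar{p}_{+k} = \frac{1}{r} \sum_{g=1}^{r} p_{gk}$$

$$s^2_{kl} = \frac{1}{r - 1} \left( \sum_{g=1}^{r} p_{gk} \, p_{gl} - r \, \bar{p}_{+k} \, \bar{p}_{+l} \right)$$

$$p_c = \sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl} \left( \bar{p}_{+k} \, \bar{p}_{+l} - \frac{s^2_{kl}}{r} \right)$$

$$\kappa = \frac{p_o - p_c}{1 - p_c}$$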


q is the total number of categories

w_kl is the weight associated with two raters assigning an item to categories k and l

r_il is the number of raters that assigned item i to category l

n' is the number of items that were coded by two or more raters

r_ik is the number of raters that assigned item i to category k

r_i is the number of raters that assigned item i to any category

n_gk is the number of items that rater g assigned to category k

n_g is the number of items that rater g assigned to any category

r is the total number of raters
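The generalized quantities can be sketched in plain Python as shown below. This is a hedged illustration, not the mKAPPA code: the function name `generalized_kappa`, the input layout (one row per item, one entry per rater, `None` for a missing rating), and the default identity weights are all assumptions, and the formula forms written in the comments follow Gwet's (2014) presentation of Conger's kappa as understood here.

```python
def generalized_kappa(ratings, categories, weights=None):
    """ratings[i][g] is the category rater g gave item i, or None if missing."""
    q = len(categories)
    r = len(ratings[0])
    if r < 2:
        raise ValueError("at least two raters are required")
    index = {c: k for k, c in enumerate(categories)}
    if weights is None:
        # Identity weights: only exact matches count as agreement.
        weights = [[1.0 if k == l else 0.0 for l in range(q)] for k in range(q)]

    # r_ik: number of raters that assigned item i to category k.
    counts = [[0] * q for _ in ratings]
    for i, row in enumerate(ratings):
        for label in row:
            if label is not None:
                counts[i][index[label]] += 1

    # p_o: average over the n' items coded by two or more raters.
    total, n_prime = 0.0, 0
    for item in counts:
        r_i = sum(item)
        if r_i < 2:
            continue
        n_prime += 1
        for k in range(q):
            rstar_ik = sum(weights[k][l] * item[l] for l in range(q))
            total += item[k] * (rstar_ik - 1) / (r_i * (r_i - 1))
    if n_prime == 0:
        raise ValueError("no item was coded by two or more raters")
    p_o = total / n_prime

    # p_gk = n_gk / n_g: each rater's individual category proportions.
    p = []
    for g in range(r):
        labels = [row[g] for row in ratings if row[g] is not None]
        if not labels:
            raise ValueError("every rater must code at least one item")
        p.append([labels.count(c) / len(labels) for c in categories])

    # pbar_+k: mean proportion for category k across raters.
    pbar = [sum(p[g][k] for g in range(r)) / r for k in range(q)]

    # p_c = sum over k, l of w_kl * (pbar_+k * pbar_+l - ssq_kl / r).
    p_c = 0.0
    for k in range(q):
        for l in range(q):
            cross = sum(p[g][k] * p[g][l] for g in range(r))
            ssq_kl = (cross - r * pbar[k] * pbar[l]) / (r - 1)
            p_c += weights[k][l] * (pbar[k] * pbar[l] - ssq_kl / r)

    return (p_o - p_c) / (1 - p_c)

# Hypothetical two-rater data; with identity weights this reduces to the
# two-rater, two-category case with cell counts 20, 5, 10, and 15.
example = ([["c1", "c1"]] * 20 + [["c1", "c2"]] * 5
           + [["c2", "c1"]] * 10 + [["c2", "c2"]] * 15)
kappa_example = generalized_kappa(example, ["c1", "c2"])  # about 0.4
```

With two raters and identity weights, the term `pbar[k] * pbar[k] - ssq_kl / r` simplifies to the product of the two raters' proportions for category k, so the sketch agrees with the simplified formulas on the same data.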

References

  1. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
  2. Rogot, E., & Goldberg, I. D. (1966). A proposed index for measuring agreement in test-retest studies. Journal of Chronic Diseases, 19(9), 991–1006.
  3. Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.
  4. Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88(2), 322–328.
  5. Uebersax, J. S. (1982). A design-independent method for measuring the reliability of psychiatric diagnosis. Journal of Psychiatric Research, 17(4), 335–342.
  6. Zhao, X., Liu, J. S., & Deng, K. (2012). Assumptions behind inter-coder reliability indices. In C. T. Salmon (Ed.), Communication Yearbook (pp. 418–480). Routledge.
  7. Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.