
Agreement or Accuracy


Overview

Agreement is perhaps the most straightforward way to quantify the reliability of categorical measurements. It is calculated as the amount of observed agreement (i.e., objects that pairs of raters assigned to the same or similar categories) divided by the amount of possible agreement (i.e., objects that pairs of raters could have assigned to the same categories).
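As a hypothetical example, if two raters independently code 100 objects and assign 80 of them to the same category, then observed agreement is $80 / 100 = 0.80$. Agreement ranges from 0 (the raters never assign an object to the same category) to 1 (they always do).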

History

Observed agreement, especially in its simplified form, has been in use for a long time (see Benini, 1901) and has been given many different names. Many fields call it "accuracy" while others call it "agreement." It has also been reinvented and given different names over the years, such as Osgood's (1959) coefficient and Holsti's (1969) CR. Observed agreement is often criticized for not adjusting for chance agreement and as such has been called the "index of crude agreement" (Rogot & Goldberg, 1966), the "most primitive" index (Cohen, 1960), and "flawed" (Hayes & Krippendorff, 2007). Despite this criticism, and perhaps due to the challenge of adjusting for chance agreement successfully, agreement has continued to enjoy widespread use. The idea of calculating observed agreement for multiple raters using the "mean pairs" approach was described by Armitage et al. (1966). Gwet (2014) fully generalized the approach to accommodate multiple raters, multiple categories, and any weighting scheme.

MATLAB Functions

  • mAGREE: Calculates agreement using vectorized formulas

Simplified Formulas

Use these formulas with two raters and two (dichotomous) categories:


$$p_o = \frac{n_{11} + n_{22}}{n}$$


$n_{11}$ is the number of items both raters assigned to category 1

$n_{22}$ is the number of items both raters assigned to category 2

$n$ is the total number of items

Contingency Table

|                     | Rater 2: Category 1 | Rater 2: Category 2 |
|---------------------|---------------------|---------------------|
| Rater 1: Category 1 | $n_{11}$            | $n_{12}$            |
| Rater 1: Category 2 | $n_{21}$            | $n_{22}$            |
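The sketch below illustrates the simplified formula on a small hypothetical data set; it is not the toolbox's mAGREE function, and the CODES matrix (items in rows, raters in columns) and the category labels 1 and 2 are assumptions made for the example.

```matlab
% Minimal sketch of the simplified formula (two raters, two categories).
% CODES is an n-by-2 matrix: each row is an item, each column a rater.
CODES = [1 1; 1 1; 2 2; 1 2; 2 2; 2 1; 1 1; 2 2];

n = size(CODES, 1);                             % total number of items
n11 = sum(CODES(:,1) == 1 & CODES(:,2) == 1);   % both raters chose category 1
n22 = sum(CODES(:,1) == 2 & CODES(:,2) == 2);   % both raters chose category 2
po = (n11 + n22) / n;                           % observed agreement

fprintf('Observed agreement = %.3f\n', po);
```

For these example codes, both raters chose category 1 for three items and category 2 for three items, so $p_o = 6 / 8 = 0.75$.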

Generalized Formulas

Use these formulas with multiple raters, multiple categories, and any weighting scheme:


$$r_{ik}^\star = \sum_{l=1}^q w_{kl} r_{il}$$

$$p_o = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{k=1}^q \frac{r_{ik}(r_{ik}^\star - 1)}{r_i (r_i - 1)}$$


$q$ is the total number of categories

$w_{kl}$ is the weight associated with two raters assigning an item to categories $k$ and $l$

$r_{il}$ is the number of raters that assigned item $i$ to category $l$

$n'$ is the number of items that were coded by two or more raters

$r_{ik}$ is the number of raters that assigned item $i$ to category $k$

$r_i$ is the number of raters that assigned item $i$ to any category
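The generalized formulas translate directly into vectorized code. The sketch below is a minimal illustration rather than the toolbox's mAGREE implementation: it assumes a CODES matrix of category indices (items in rows, raters in columns, NaN for missing codes) and uses identity weights, which correspond to unweighted (nominal) agreement.

```matlab
% Minimal sketch of the generalized formula (multiple raters, multiple
% categories, arbitrary weighting scheme).
CODES = [1 1 2; 2 2 2; 3 3 NaN; 1 2 1; 2 2 3];  % example codes with one missing
q = 3;                          % number of categories
W = eye(q);                     % identity weights = unweighted (nominal) agreement

n = size(CODES, 1);
r_ik = zeros(n, q);             % r_ik(i,k) = raters assigning item i to category k
for k = 1:q
    r_ik(:, k) = sum(CODES == k, 2);
end
r_i = sum(r_ik, 2);             % raters assigning item i to any category
rstar_ik = r_ik * W';           % rstar_ik(i,k) = sum over l of w_kl * r_il

coded = r_i >= 2;               % keep items coded by two or more raters
nprime = sum(coded);
item_agree = sum(r_ik(coded,:) .* (rstar_ik(coded,:) - 1), 2) ...
    ./ (r_i(coded) .* (r_i(coded) - 1));
po = sum(item_agree) / nprime;

fprintf('Generalized observed agreement = %.3f\n', po);
```

With identity weights, $r_{ik}^\star = r_{ik}$ and each item's contribution reduces to the proportion of rater pairs that agreed on it; for ordered categories, a linear or quadratic weight matrix could be substituted for W to give partial credit for near-agreement.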

References

  1. Benini, R. (1901). Principii di demografia: Manuali Barbera di scienze giuridiche sociali e politiche. Firenze, Italy: G. Barbera.
  2. Osgood, C. E. (1959). The representational model and relevant research methods. In I. de Sola Pool (Ed.), Trends in Content Analysis (pp. 33–88). Urbana, Illinois: University of Illinois Press.
  3. Holsti, O. R. (1969). Content analysis for the social sciences and humanities. Reading, MA: Addison-Wesley.
  4. Rogot, E., & Goldberg, I. D. (1966). A proposed index for measuring agreement in test-retest studies. Journal of Chronic Diseases, 19(9), 991–1006.
  5. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
  6. Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77–89.
  7. Armitage, P., Blendis, L. M., & Smyllie, H. C. (1966). The measurement of observer disagreement in the recording of signs. Journal of the Royal Statistical Society, 129(1), 98–109.
  8. Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.