Agreement or Accuracy

Overview

Agreement is perhaps the most straightforward way to quantify the reliability of categorical measurements. It is calculated as the amount of observed agreement (i.e., the number of objects that pairs of raters assigned to the same or similar categories) divided by the amount of possible agreement (i.e., the number of objects that pairs of raters could have assigned to the same categories).

History

Observed agreement, especially in its simplified form, has been in use for a long time (see Benini, 1901) and has been given many different names. Many fields call it "accuracy," while others call it "agreement." It has also been reinvented over the years under various labels, such as Osgood's (1959) coefficient and Holsti's (1969) CR. Observed agreement is often criticized for not adjusting for chance agreement and as such has been called the "index of crude agreement" (Rogot & Goldberg, 1966), the "most primitive" index (Cohen, 1960), and "flawed" (Hayes & Krippendorff, 2007). Despite this criticism, and perhaps because of the difficulty of adjusting for chance agreement successfully, agreement continues to enjoy widespread use. The idea of calculating observed agreement for multiple raters using the "mean pairs" approach was described by Armitage et al. (1966), and Gwet (2014) fully generalized the approach to accommodate multiple raters, multiple categories, and any weighting scheme.

MATLAB Functions

  • mAGREE %Calculates agreement using vectorized formulas

Simplified Formulas

Use these formulas with two raters and two (dichotomous) categories:


p_o = \frac{n_{11} + n_{22}}{n}


n_11 is the number of items both raters assigned to category 1

n_22 is the number of items both raters assigned to category 2

n is the total number of items

[Contingency table: cross-classification of the two raters' assignments, with cells n_11, n_12, n_21, and n_22]
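
For two raters and two categories, the calculation reduces to a few lines of code. The following is a minimal MATLAB sketch with illustrative data; the variable names are hypothetical and unrelated to the mAGREE function.

```matlab
% Illustrative codes from two raters on eight items (categories coded 1 and 2)
rater1 = [1 1 2 2 1 2 1 1];
rater2 = [1 2 2 2 1 2 1 2];

n11 = sum(rater1 == 1 & rater2 == 1);   % items both raters assigned to category 1
n22 = sum(rater1 == 2 & rater2 == 2);   % items both raters assigned to category 2
n   = numel(rater1);                    % total number of items

p_o = (n11 + n22) / n                   % observed agreement (0.75 for these data)
```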

Generalized Formulas

Use these formulas with multiple raters, multiple categories, and any weighting scheme:


r^*_{ik} = \sum_{l=1}^{q} w_{kl} r_{il}

p_o = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{k=1}^{q} \frac{r_{ik} (r^*_{ik} - 1)}{r_i (r_i - 1)}


q is the total number of categories

w_kl is the weight associated with two raters assigning an item to categories k and l

r_il is the number of raters that assigned item i to category l

n' is the number of items that were coded by two or more raters

r_ik is the number of raters that assigned item i to category k

r_i is the number of raters that assigned item i to any category
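
The generalized formulas can also be sketched in a few vectorized MATLAB lines, assuming the ratings are summarized as an n-by-q matrix of category counts. This is only an illustration of the formulas above; the variable names (r, w, rstar, etc.) are hypothetical and do not come from the mAGREE function.

```matlab
% r is an n-by-q matrix: r(i,k) = number of raters who assigned item i to category k
% w is a q-by-q weight matrix; the identity matrix gives unweighted (nominal) agreement
r = [2 0; 1 1; 0 2; 2 0];               % illustrative counts for 4 items, 2 categories
w = eye(2);                             % identity (nominal) weights

rstar  = r * w';                        % rstar(i,k) = sum over l of w(k,l) * r(i,l)
ri     = sum(r, 2);                     % number of raters who coded each item
keep   = ri >= 2;                       % keep only items coded by two or more raters
nprime = sum(keep);                     % n'

num = sum(r(keep,:) .* (rstar(keep,:) - 1), 2);   % inner sum over categories k
den = ri(keep) .* (ri(keep) - 1);
p_o = sum(num ./ den) / nprime          % generalized observed agreement
```

With identity weights and two raters per item, this reproduces the simplified formula (0.75 for the illustrative data above); other weight matrices give partial credit for assignments to similar categories.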

References

  1. Benini, R. (1901). Principii di demografia: Manuali barbera di scienze giuridiche sociali e politiche. Firenze, Italy: G. Barbera.
  2. Osgood, C. E. (1959). The representational model and relevant research methods. In I. de Sola Pool (Ed.), Trends in Content Analysis (pp. 33–88). Urbana, Illinois: University of Illinois Press.
  3. Holsti, O. R. (1969). Content analysis for the social sciences and humanities. Reading, MA: Addison-Wesley.
  4. Rogot, E., & Goldberg, I. D. (1966). A proposed index for measuring agreement in test-retest studies. Journal of Chronic Diseases, 19(9), 991–1006.
  5. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
  6. Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77–89.
  7. Armitage, P., Blendis, L. M., & Smyllie, H. C. (1966). The measurement of observer disagreement in the recording of signs. Journal of the Royal Statistical Society, Series A (General), 129(1), 98–109.
  8. Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.