Bennett et al.'s S score

Overview

The $S$ score is a chance-adjusted index of the reliability of categorical measurements. It is meant to equal the ratio of "observed nonchance agreement" to "possible nonchance agreement" and to answer the question: how often did the raters agree when they weren't guessing?

The $S$ score estimates chance agreement using a category-based approach. Specifically, it assumes that, for each item, a chance-based process first determines whether the raters classify the item randomly or deliberately, and that the likelihood that the raters will randomly assign an item to the same category depends solely on the number of possible categories.

Zhao et al. (2012) described these assumptions using the following metaphor. Each rater places $q$ sets of marbles into an urn, where $q$ equals the number of possible categories. Each set has an equal number of marbles and has its own color. For each item to be classified, each rater draws a marble randomly from the urn, notes its color, and then puts it back in the urn. If both raters drew the same color, then both raters classify that item randomly by classifying it into the category that corresponds to the color of marble that was drawn (without inspecting the item at all). Only if the raters drew different colors would they classify the item deliberately by inspecting the item and comparing its features to the established category membership criteria.
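The metaphor implies that the chance of both raters drawing the same color depends only on $q$. As a minimal sketch (the function name `same_color_probability` is hypothetical and not part of any toolbox), the following enumerates every equally likely pair of draws and counts the matching pairs:

```python
from fractions import Fraction
from itertools import product


def same_color_probability(q):
    """Exact probability that two raters draw the same color
    when each draws independently from q equally likely colors."""
    pairs = list(product(range(q), repeat=2))  # all q * q ordered pairs of draws
    matches = sum(1 for a, b in pairs if a == b)
    return Fraction(matches, len(pairs))


for q in (2, 3, 5):
    print(q, same_color_probability(q))
```

For each $q$ the result is $1/q$, since $q$ of the $q^2$ equally likely pairs match.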

History

Bennett, Alpert, & Goldstein (1954) proposed the $S$ score as an index of reliability for categorical measurements from two raters. It was later generalized by Gwet (2014) to accommodate multiple raters, any weighting scheme, and missing data. It has also been proposed in multiple equivalent forms such as Guilford's (1963) G score; Maxwell's (1977) random error coefficient; Janson & Vegelius's (1979) C score; Brennan & Prediger's (1981) free marginal kappa coefficient; Byrt, Bishop, & Carlin's (1993) prevalence-and-bias-adjusted kappa coefficient; and Potter & Levine-Donnerstein's (1999) redefined pi coefficient.

MATLAB Functions

mSSCORE %Calculates S using vectorized formulas

Simplified Formulas

Use these formulas with two raters and two (dichotomous) categories:

$$p_o = \frac{n_{11} + n_{22}}{n}$$

$$p_c = \frac{1}{2}$$

$$S = \frac{p_o - p_c}{1 - p_c}$$

$n_{11}$ is the number of items both raters assigned to category 1

$n_{22}$ is the number of items both raters assigned to category 2

$n$ is the total number of items
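The simplified formulas can be sketched in a few lines of Python. This is an illustration only, not the toolbox's implementation; the function name `s_score_dichotomous` and the example counts are hypothetical.

```python
def s_score_dichotomous(n11, n22, n):
    """S score for two raters and two categories.

    n11, n22: items both raters assigned to category 1 and category 2
    n: total number of items
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_o = (n11 + n22) / n  # observed agreement
    p_c = 1 / 2            # chance agreement with two categories
    return (p_o - p_c) / (1 - p_c)


# Hypothetical counts: 100 items, 40 agreements on category 1, 30 on category 2
print(s_score_dichotomous(40, 30, 100))  # about 0.4, up to floating-point rounding
```

Here $p_o = 0.7$, so $S = (0.7 - 0.5) / (1 - 0.5) = 0.4$. Perfect agreement gives $S = 1$, and agreement on exactly half of the items gives $S = 0$.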

Contingency Table

Generalized Formulas

Use these formulas with multiple raters, multiple categories, and any weighting scheme:

$$r_{ik}^\star = \sum_{l=1}^q w_{kl} r_{il}$$

$$p_o = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{k=1}^q \frac{r_{ik} (r_{ik}^\star - 1)}{r_i (r_i - 1)}$$

$$p_c = \frac{1}{q^2} \sum_{k,l} w_{kl}$$

$$S = \frac{p_o - p_c}{1 - p_c}$$

$q$ is the total number of categories

$w_{kl}$ is the weight associated with two raters assigning an item to categories $k$ and $l$

$r_{il}$ is the number of raters that assigned item $i$ to category $l$

$n'$ is the number of items that were coded by two or more raters

$r_{ik}$ is the number of raters that assigned item $i$ to category $k$

$r_i$ is the number of raters that assigned item $i$ to any category
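The generalized formulas can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the mSSCORE implementation: the function name `s_score`, the item-by-category count matrix used as input, and the default of identity weights are all choices made for this example.

```python
def s_score(counts, weights=None):
    """Generalized S score.

    counts[i][k]: number of raters who assigned item i to category k (r_ik)
    weights[k][l]: agreement weight for categories k and l (w_kl);
                   defaults to identity weights (an assumption of this sketch)
    """
    q = len(counts[0])
    if weights is None:
        weights = [[1.0 if k == l else 0.0 for l in range(q)] for k in range(q)]

    # Keep only items coded by two or more raters (these are the n' items)
    rated = [row for row in counts if sum(row) >= 2]
    n_prime = len(rated)
    if n_prime == 0:
        raise ValueError("no item was coded by two or more raters")

    p_o = 0.0
    for row in rated:
        r_i = sum(row)
        for k in range(q):
            # Weighted count of raters in agreement with category k (r*_ik)
            r_star = sum(weights[k][l] * row[l] for l in range(q))
            p_o += row[k] * (r_star - 1) / (r_i * (r_i - 1))
    p_o /= n_prime

    p_c = sum(sum(w_row) for w_row in weights) / q ** 2
    return (p_o - p_c) / (1 - p_c)


# Hypothetical data: two raters, two categories, four fully coded items,
# plus one item coded by a single rater (ignored as missing data)
ratings = [[2, 0], [2, 0], [0, 2], [1, 1], [1, 0]]
print(s_score(ratings))  # 0.5
```

With identity weights, $\sum_{k,l} w_{kl} = q$, so $p_c$ reduces to $1/q$; for $q = 2$ this matches the $p_c = 1/2$ of the simplified formulas. In the example, $p_o = 3/4$ and $S = (0.75 - 0.5) / (1 - 0.5) = 0.5$.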

References

Bennett, E. M., Alpert, R., & Goldstein, A. C. (1954). Communication through limited response questioning. The Public Opinion Quarterly, 18(3), 303–308.
Guilford, J. P. (1963). Preparation of item scores for the correlations between persons in a Q factor analysis. Educational and Psychological Measurement, 23(1), 13–22.
Maxwell, A. E. (1977). Coefficients of agreement between observers and their interpretation. The British Journal of Psychiatry, 130, 79–83.
Janson, S., & Vegelius, J. (1979). On generalizations of the G index and the phi coefficient to nominal scales. Multivariate Behavioral Research, 14(2), 255–269.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient Kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3), 687–699.
Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46, 423–429.
Potter, W. J., & Levine-Donnerstein, D. (1999). Rethinking validity and reliability in content analysis. Journal of Applied Communication Research, 27(3), 258–284.
Zhao, X., Liu, J. S., & Deng, K. (2012). Assumptions behind inter-coder reliability indices. In C. T. Salmon (Ed.), Communication Yearbook (pp. 418–480). Routledge.
Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.
