Scott's pi coefficient

Overview

The pi coefficient is a chance-adjusted index of the reliability of categorical measurements. It is meant to equal the ratio of "observed nonchance agreement" to "possible nonchance agreement" and so to answer the question: how often did the raters agree when they weren't guessing?

The pi coefficient estimates chance agreement using an average-distribution-based approach. Specifically, it assumes that, before inspecting each item, raters engage in a chance-based process that determines whether they will classify the item randomly or deliberately. It further assumes that the probability of two raters randomly assigning an item to the same category is the product of the raters' average distribution for that category with itself, that is, the square of the average proportion of items assigned to that category, summed over all categories.
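As a worked illustration with hypothetical numbers (not taken from the source), suppose that, averaged across two raters, 70% of items are assigned to category 1 and 30% to category 2. Under the assumptions above, the estimated chance agreement would be:

$$p_c = (0.7 \times 0.7) + (0.3 \times 0.3) = 0.49 + 0.09 = 0.58$$

In this example, more than half of the observed agreement could be attributed to chance, which is why a raw agreement rate alone can overstate reliability.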

Zhao et al. (2012) described these assumptions using the following metaphor (in the case of two categories for simplicity). All raters have a shared "quota" for how many items they must, on average, assign to each category. One rater then places two sets of marbles into a shared urn, where each set corresponds to one of the two categories. Each set has its own color and contains a number of marbles equal to its category's quota. For each item, each rater draws a marble randomly from the urn, notes its color, and then puts it back. If both raters drew the same color, then both raters classify that item randomly by assigning it to the category that corresponds to the color drawn (without inspecting the item at all). Only if the raters drew different colors do they classify the item deliberately by inspecting it and comparing its features to the established category membership criteria. Each rater keeps track of the number of items he or she has assigned to each category; whenever a rater reaches his or her quota for a category, he or she stops drawing from the urn and assigns all remaining items to the other category in order to meet its quota.

History

Scott (1955) proposed the pi coefficient to estimate the reliability of two raters assigning items to nominal categories. Fleiss (1971) extended the pi coefficient to accommodate multiple raters. Gwet (2014) then generalized the pi coefficient to accommodate multiple raters, any weighting scheme, and missing data. The generalized formulas provided here, and instantiated in the provided function, correspond to Gwet's formulation (which he refers to as the generalized Fleiss' kappa coefficient). It is also worth noting that several other reliability indices are equivalent to Scott's pi coefficient, including Siegel & Castellan's (1988) revised kappa coefficient and Byrt, Bishop, and Carlin's (1993) bias-adjusted kappa coefficient.

MATLAB Functions

mSCOTTPI %Calculates pi using vectorized formulas

Simplified Formulas

Use these formulas with two raters and two (dichotomous) categories:

$$p_o = \frac{n_{11}+n_{22}}{n}$$

$$m_1 = \frac{n_{+1} + n_{1+}}{2}$$

$$m_2 = \frac{n_{+2} + n_{2+}}{2}$$

$$p_c = \left( \frac{m_1}{n} \right) \left( \frac{m_1}{n} \right) + \left( \frac{m_2}{n} \right) \left( \frac{m_2}{n} \right)$$

$$\pi = \frac{p_o - p_c}{1 - p_c}$$

$n_{11}$ is the number of items both raters assigned to category 1

$n_{22}$ is the number of items both raters assigned to category 2

$n$ is the total number of items

$n_{1+}$ is the number of items rater 1 assigned to category 1

$n_{2+}$ is the number of items rater 1 assigned to category 2

$n_{+1}$ is the number of items rater 2 assigned to category 1

$n_{+2}$ is the number of items rater 2 assigned to category 2
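The simplified formulas can be sketched in a few lines of Python. This is an illustrative sketch only, not the toolbox's mSCOTTPI function. The function name `scott_pi_2x2` and the off-diagonal counts `n12` and `n21` (items on which the raters disagreed) are assumptions introduced here so that the marginal totals can be computed; the example counts are hypothetical.

```python
def scott_pi_2x2(n11, n12, n21, n22):
    """Scott's pi for two raters and two categories (illustrative sketch).

    n11, n22: items both raters assigned to category 1 / category 2.
    n12: rater 1 chose category 1, rater 2 chose category 2 (assumed name).
    n21: rater 1 chose category 2, rater 2 chose category 1 (assumed name).
    """
    n = n11 + n12 + n21 + n22

    # Observed agreement: proportion of items on the diagonal.
    p_o = (n11 + n22) / n

    # Marginal totals: n_{1+}, n_{2+} for rater 1; n_{+1}, n_{+2} for rater 2.
    n1_plus, n2_plus = n11 + n12, n21 + n22
    n_plus1, n_plus2 = n11 + n21, n12 + n22

    # Average number of items per category across the two raters.
    m1 = (n_plus1 + n1_plus) / 2
    m2 = (n_plus2 + n2_plus) / 2

    # Chance agreement from the squared average proportions.
    p_c = (m1 / n) ** 2 + (m2 / n) ** 2

    # Note: p_c equals 1 when every rating falls in one category,
    # in which case pi is undefined (division by zero).
    return (p_o - p_c) / (1 - p_c)


# Hypothetical counts: 85 of 100 items agreed upon.
print(round(scott_pi_2x2(40, 5, 10, 45), 3))  # → 0.699
```

With these counts, observed agreement is 0.85 and estimated chance agreement is 0.50125, so pi is roughly 0.70.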

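Contingency Table

| | Rater 2: Category 1 | Rater 2: Category 2 | Total |
|---|---|---|---|
| Rater 1: Category 1 | $n_{11}$ | $n_{12}$ | $n_{1+}$ |
| Rater 1: Category 2 | $n_{21}$ | $n_{22}$ | $n_{2+}$ |
| Total | $n_{+1}$ | $n_{+2}$ | $n$ |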

Generalized Formulas

Use these formulas with multiple raters, multiple categories, and any weighting scheme:

$$r_{ik}^\star = \sum_{l=1}^q w_{kl} r_{il}$$

$$p_o = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{k=1}^q \frac{r_{ik} (r_{ik}^\star - 1)}{r_i (r_i - 1)}$$

$$\pi_k = \frac{1}{n} \sum_{i=1}^n \frac{r_{ik}}{r_i}$$

$$p_c = \sum_{k=1}^q \sum_{l=1}^q w_{kl} \pi_k \pi_l$$

$$\pi = \frac{p_o - p_c}{1 - p_c}$$

$q$ is the total number of categories

$w_{kl}$ is the weight associated with two raters assigning an item to categories $k$ and $l$

$r_{il}$ is the number of raters that assigned item $i$ to category $l$

$n'$ is the number of items that were coded by two or more raters

$r_{ik}$ is the number of raters that assigned item $i$ to category $k$

$r_i$ is the number of raters that assigned item $i$ to any category

$n$ is the total number of items
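The generalized formulas can likewise be sketched in plain Python. This is a minimal sketch under stated assumptions, not the toolbox's mSCOTTPI function: the function name `generalized_pi`, the input layout (one list of labels per item, with `None` marking a missing rating), the default identity weights (nominal categories), and the choice to exclude items that no rater coded from $n$ are all assumptions made for illustration.

```python
def generalized_pi(ratings, categories, weights=None):
    """Generalized pi for multiple raters, weights, and missing data (sketch).

    ratings: list of items; each item is a list of category labels,
             one per rater, with None for a missing rating (assumed layout).
    categories: list of the q possible category labels.
    weights: q-by-q nested list w[k][l]; defaults to identity weights.
    """
    q = len(categories)
    index = {c: k for k, c in enumerate(categories)}
    if weights is None:
        weights = [[1.0 if k == l else 0.0 for l in range(q)] for k in range(q)]

    # r[i][k]: number of raters who assigned item i to category k.
    r = []
    for item in ratings:
        counts = [0] * q
        for label in item:
            if label is not None:
                counts[index[label]] += 1
        if sum(counts) > 0:  # assumption: items nobody rated are excluded
            r.append(counts)
    n = len(r)

    # Observed agreement, using only items rated by two or more raters.
    total, n_prime = 0.0, 0
    for counts in r:
        r_i = sum(counts)
        if r_i < 2:
            continue
        n_prime += 1
        for k in range(q):
            # Weighted count r*_ik of raters in agreement with category k.
            r_star = sum(weights[k][l] * counts[l] for l in range(q))
            total += counts[k] * (r_star - 1) / (r_i * (r_i - 1))
    p_o = total / n_prime

    # Average proportion of ratings per category across all n items.
    pi_k = [sum(c[k] / sum(c) for c in r) / n for k in range(q)]

    # Chance agreement from the weighted products of average proportions.
    p_c = sum(weights[k][l] * pi_k[k] * pi_k[l]
              for k in range(q) for l in range(q))

    return (p_o - p_c) / (1 - p_c)


# Hypothetical data: three raters, two categories, some ratings missing.
data = [["a", "a", None], ["a", "b", "b"], ["b", "b", "b"], ["a", None, None]]
print(generalized_pi(data, ["a", "b"]))
```

With two raters, no missing ratings, and identity weights, this sketch gives the same value as the simplified two-category formulas, which is a useful sanity check on any implementation.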

References

Scott, W. A. (1955). Reliability of content analysis: The case of nominal scaling. Public Opinion Quarterly, 19(3), 321–325.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). New York, NY: McGraw-Hill.
Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46(5), 423–429.
Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.
