Intraclass correlation coefficient

Overview

The interclass correlation coefficient (i.e., Pearson's $r$) measures the relationship between variables of different classes, which share neither their metric nor their variance. Intraclass correlation coefficients (ICCs), on the other hand, measure the relationship among variables of a common class, which share both their metric and their variance. Common examples of ICCs include twin correlations, Cronbach's alpha, heritability coefficients, and measures of reliability. This last category is our focus here.

All ICCs attempt to partition the variance in a set of measurements according to its sources. Although techniques from generalizability theory can accommodate any number of sources, we focus on ICCs that accommodate two sources of systematic variance (i.e., two-way models), as these are commonly applicable to reliability analysis. One source of systematic variance is always present: the random sampling of objects of measurement. This variance is typically represented as variance across the rows of a table in which each row corresponds to the data from a single object of measurement (e.g., participant or observation). Another source of variance that is almost always present in reliability contexts is the random (or fixed) sampling of measurement providers. This variance is typically represented as variance across the columns of a table in which each column corresponds to the data from a single measurement provider (e.g., observer or instrument). The values in such a table can be conveniently indexed using matrix notation, with objects of measurement indexed by $i \in \{1, \dots, n\}$ and measurement providers by $j \in \{1, \dots, k\}$. The measurement of object $i$ from provider $j$ is thus denoted $x_{ij}$.
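The layout described above can be written as an $n \times k$ matrix with one row per object of measurement and one column per measurement provider. The name $X$ is introduced here only for illustration.

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}$$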

The basic logic of the variance partitioning in an ICC for reliability analysis is that, if reliability is good, then the variance between rows (i.e., objects) should constitute a relatively large amount of the total variance in measurements. Thus, the other sources of variance should constitute a relatively small amount of the total variance. Reliability can be quantified as the proportion of the total "relevant" variance that is between rows $(\sigma_r^2)$. Different ICCs include different sources of variance in their denominators and can be used based on which sources are deemed "relevant." All two-way ICCs include residual variance in the denominator $(\sigma_e^2)$ and some also include the variance between columns (e.g., observers) $(\sigma_c^2)$ and the row-by-column interaction effect $(\sigma_{rc}^2)$. The different ICCs will be discussed next.

Types of Two-Way ICCs

There are four formulations for estimating two-way ICCs. Although the population parameter definitions vary based on whether the column effects are assumed to be random or fixed, the process for calculating the sample estimator is identical in both cases. Thus, we provide the sample estimator formulas that apply to both assumptions and, for illustration, provide only the population parameter definitions for the random effects assumption. Again, the estimate will be identical in numerical value in both cases and only the interpretation of the estimate will vary (i.e., in how generalizable it is).

The ICC formulations differ based on whether they describe the reliability of single scores (i.e., scores taken from a single measurement provider) or the reliability of average scores (i.e., scores calculated by averaging the scores from all measurement providers). The choice between a single-score ICC and an average-score ICC should be determined by which type of score will ultimately be used in analysis. If all objects of interest are measured by all measurement providers, and average scores will be analyzed, then an average-score ICC should be calculated and interpreted. If, however, only a subset of the objects of interest is measured by all measurement providers, and scores from any single provider will be analyzed, then a single-score ICC should be calculated and interpreted.
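The two kinds of score are linked. For the sample estimators given later on this page, the average-score ICC equals the single-score ICC of the same type (consistency or absolute agreement) stepped up by the Spearman-Brown formula, so a positive single-score ICC always yields a larger average-score ICC. The placeholder $\cdot$ stands for either $A$ or $C$.

$$ICC(\cdot,k) = \frac{k \, ICC(\cdot,1)}{1 + (k - 1) \, ICC(\cdot,1)}$$

As a worked check with an assumed single-score value of $0.50$ and $k = 4$ providers:

$$ICC(\cdot,4) = \frac{4(0.50)}{1 + 3(0.50)} = \frac{2.0}{2.5} = 0.80$$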

The ICC formulations also differ based on whether they describe the consistency or the absolute agreement of the measurements from different providers. Absolute agreement is a stricter requirement than consistency and will typically have a lower ICC value. One way to understand the difference between the two is to think about the model of equivalence they require. Pearson's $r$ is a "linearity" index that quantifies the extent to which one measure $y$ relates to another measure $x$ by a linear transformation (i.e., $y=ax+b$). A consistency ICC is an "additivity" index that quantifies the extent to which one measure $y$ relates to another measure $x$ by the addition of some constant (i.e., $y=x+b$). Finally, an absolute agreement ICC is an "agreement" index that quantifies the extent to which one measure $y$ relates to another measure $x$ without any transformation (i.e., $y=x$). The choice to use a consistency ICC or an absolute agreement ICC should be determined by what type of analysis will be used and whether this analysis is sensitive to the addition of a constant. If measurement providers are meant to be fully interchangeable, an absolute agreement ICC is the most appropriate. In most cases, I recommend that both a consistency ICC and an absolute agreement ICC be calculated and interpreted.
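The three models of equivalence can be illustrated with a small Python sketch. It is not the code behind the MATLAB functions listed on this page; the helper names (`mean_squares`, `pearson_r`, `compare`) and the data are hypothetical, and the ICC expressions are the single-score sample formulas given later on this page.

```python
def mean_squares(table):
    """Two-way ANOVA mean squares (MS_R, MS_C, MS_E) for a complete n-by-k table."""
    n, k = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n * k)
    ss_r = k * sum((sum(row) / k - grand) ** 2 for row in table)
    ss_c = n * sum((sum(col) / n - grand) ** 2 for col in zip(*table))
    ss_t = sum((x - grand) ** 2 for row in table for x in row)
    ss_e = ss_t - ss_r - ss_c
    return ss_r / (n - 1), ss_c / (k - 1), ss_e / ((n - 1) * (k - 1))


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def compare(x, y):
    """Return (Pearson r, ICC(C,1), ICC(A,1)) for two providers x and y."""
    n, k = len(x), 2
    ms_r, ms_c, ms_e = mean_squares([list(pair) for pair in zip(x, y)])
    icc_c1 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)
    icc_a1 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k / n * (ms_c - ms_e))
    return pearson_r(x, y), icc_c1, icc_a1


x = [1, 2, 3, 4, 5]                             # hypothetical scores
identical = compare(x, list(x))                 # y = x
shifted = compare(x, [v + 3 for v in x])        # y = x + 3
rescaled = compare(x, [2 * v + 1 for v in x])   # y = 2x + 1

for label, result in [("y = x", identical), ("y = x + 3", shifted),
                      ("y = 2x + 1", rescaled)]:
    print(label, [round(v, 3) for v in result])
```

With these data, Pearson's $r$ equals 1 in all three cases. The consistency ICC stays at 1 for the pure shift but drops to 0.8 under rescaling, and the absolute agreement ICC equals 1 only when $y=x$ (it is about 0.36 in the other two cases).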

The formulas for estimating ICCs provided in the next section make use of the mean squares from a two-way Analysis of Variance (ANOVA): a mean square for rows $(MS_R)$, a mean square for columns $(MS_C)$, and a residual mean square traditionally referred to as the mean square error $(MS_E)$. In expectation, $MS_E$ reflects the combined row-by-column interaction and error variance, which cannot be separated when each cell of the table contains a single measurement.
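As a minimal sketch of how these mean squares can be obtained from a complete table (one measurement per cell), the following Python function computes the row, column, and residual sums of squares and divides each by its degrees of freedom. The function name `two_way_mean_squares` and the 3-by-3 table are hypothetical and chosen so that the arithmetic is easy to check by hand.

```python
def two_way_mean_squares(table):
    """Return (MS_R, MS_C, MS_E) for a complete n-by-k table given as a list of rows."""
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]

    # Sums of squares for rows, columns, and the total.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    # With one measurement per cell, what remains is interaction plus error.
    ss_error = ss_total - ss_rows - ss_cols

    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_error / ((n - 1) * (k - 1))
    return ms_r, ms_c, ms_e


# Hypothetical table: 3 objects (rows) measured by 3 providers (columns).
table = [[3, 2, 4],
         [3, 6, 6],
         [6, 7, 8]]
ms_r, ms_c, ms_e = two_way_mean_squares(table)
print(ms_r, ms_c, ms_e)  # 12.0 3.0 1.0
```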

As a final note, because all two-way ICCs rely on partitioning the variance and dividing the between-rows variance by the total relevant variance, they tend to be very small when the between-rows variance is itself small. Put another way, it is almost impossible to achieve a high ICC in the absence of between-rows (i.e., between-objects) variance. Thus, it is imperative to structure data collection in such a way as to ensure that adequate between-objects variance occurs. Otherwise, some alternative to the ICC must be used.
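The effect of restricted between-objects variance can be sketched in Python. In this hypothetical example, two providers disagree by exactly the same amounts in both scenarios, and only the spread of the objects' underlying scores changes. The helper names `icc_c1` and `make_table` are illustrative, and the estimator is the single-score consistency formula given later on this page.

```python
def icc_c1(table):
    """Single-score consistency ICC for a complete n-by-k table."""
    n, k = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n * k)
    ss_r = k * sum((sum(row) / k - grand) ** 2 for row in table)
    ss_c = n * sum((sum(col) / n - grand) ** 2 for col in zip(*table))
    ss_t = sum((x - grand) ** 2 for row in table for x in row)
    ms_r = ss_r / (n - 1)
    ms_e = (ss_t - ss_r - ss_c) / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)


# Hypothetical disagreements: provider 1 scores each object d above its
# underlying score and provider 2 scores it d below.
disagreement = [0.5, -0.5, 0.5, -0.5, 0.0]


def make_table(scores):
    """Build an n-by-2 table from underlying scores and the fixed disagreements."""
    return [[s + d, s - d] for s, d in zip(scores, disagreement)]


wide = icc_c1(make_table([1, 3, 5, 7, 9]))        # objects differ a lot
narrow = icc_c1(make_table([4, 4.5, 5, 5.5, 6]))  # objects are similar
print(round(wide, 3), round(narrow, 3))
```

The residual mean square is 0.5 in both scenarios, yet the estimate falls from about 0.95 to about 0.43 when the objects are more alike, even though the providers are no less accurate.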

MATLAB Functions

ICC_A_1 %Calculates the single-score absolute agreement ICC
ICC_A_k %Calculates the average-score absolute agreement ICC
ICC_C_1 %Calculates the single-score consistency ICC
ICC_C_k %Calculates the average-score consistency ICC

Population Formulas

$$ICC(A,1) = \frac{\sigma_r^2}{\sigma_r^2 + \sigma_c^2 + \sigma_{rc}^2 + \sigma_e^2}$$

$$ICC(A,k) = \frac{\sigma_r^2}{\sigma_r^2 + (\sigma_c^2 + \sigma_{rc}^2 + \sigma_e^2) / k}$$

$$ICC(C,1) = \frac{\sigma_r^2}{\sigma_r^2 + \sigma_{rc}^2 + \sigma_e^2}$$

$$ICC(C,k) = \frac{\sigma_r^2}{\sigma_r^2 + (\sigma_{rc}^2 + \sigma_e^2) / k}$$
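To make the role of each variance component concrete, here is a short Python sketch of the random-effects definitions. It assumes, as the Overview states, that the residual variance $\sigma_e^2$ enters every denominator alongside the interaction variance. The function name `population_iccs` and the variance values are hypothetical.

```python
def population_iccs(var_r, var_c, var_rc, var_e, k):
    """Two-way random-effects ICC parameters from variance components."""
    noise = var_rc + var_e  # interaction plus residual variance
    return {
        "ICC(A,1)": var_r / (var_r + var_c + noise),
        "ICC(A,k)": var_r / (var_r + (var_c + noise) / k),
        "ICC(C,1)": var_r / (var_r + noise),
        "ICC(C,k)": var_r / (var_r + noise / k),
    }


# Hypothetical variance components, chosen only for illustration.
params = population_iccs(var_r=4.0, var_c=1.0, var_rc=0.5, var_e=0.5, k=4)
for name, value in params.items():
    print(name, round(value, 3))
```

Because the column variance appears only in the absolute agreement denominators, the agreement values (about 0.667 and 0.889 here) cannot exceed the corresponding consistency values (0.8 and about 0.941).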

Sample Formulas

$$ICC(A,1) = \frac{MS_R - MS_E}{MS_R + (k - 1)MS_E + \frac{k}{n}(MS_C - MS_E)}$$

$$ICC(A,k) = \frac{MS_R - MS_E}{MS_R + (MS_C - MS_E)/n}$$

$$ICC(C,1) = \frac{MS_R - MS_E}{MS_R + (k - 1)MS_E}$$

$$ICC(C,k) = \frac{MS_R - MS_E}{MS_R}$$

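$MS_R$ is the mean square for rows (i.e., objects of measurement, such as items)

$MS_C$ is the mean square for columns (i.e., measurement providers, such as raters)

$MS_E$ is the mean square error (i.e., the residual mean square)

$k$ is the number of columns (i.e., measurement providers)

$n$ is the number of rows (i.e., objects of measurement)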
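The four sample formulas can be sketched as a single Python function that takes the mean squares and table dimensions as inputs. This is an illustration only, not the implementation of the MATLAB functions listed above; the function name `iccs_from_mean_squares` and the input values are hypothetical.

```python
def iccs_from_mean_squares(ms_r, ms_c, ms_e, n, k):
    """Apply the four two-way sample formulas to ANOVA mean squares."""
    num = ms_r - ms_e  # numerator shared by all four estimators
    return {
        "ICC(A,1)": num / (ms_r + (k - 1) * ms_e + k / n * (ms_c - ms_e)),
        "ICC(A,k)": num / (ms_r + (ms_c - ms_e) / n),
        "ICC(C,1)": num / (ms_r + (k - 1) * ms_e),
        "ICC(C,k)": num / ms_r,
    }


# Hypothetical mean squares for a table with n = 3 rows and k = 3 columns.
iccs = iccs_from_mean_squares(ms_r=12.0, ms_c=3.0, ms_e=1.0, n=3, k=3)
for name, value in iccs.items():
    print(name, round(value, 4))
```

For these inputs the estimates are 0.6875, about 0.8684, about 0.7857, and about 0.9167, respectively. Note that the estimates become negative whenever $MS_R < MS_E$, which a caller may want to check for before interpreting the result.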

References

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.
