# Let’s talk about Thurstone & Co.: An information-theoretical model for comparative judgments, and its statistical translation

[Jose Manuel Rivera Espejo](https://www.uantwerpen.be/en/staff/jose-manuel-rivera-espejo_23166/) [](https://orcid.org/0000-0002-3088-2783) (University of Antwerp)  
[Tine van Daal](https://www.uantwerpen.be/en/staff/tine-vandaal/) [](https://orcid.org/0000-0001-9398-9775) (University of Antwerp)  
[Sven De Maeyer](https://www.uantwerpen.be/en/staff/sven-demaeyer/) [](https://orcid.org/0000-0003-2888-1631) (University of Antwerp)  
[Steven Gillis](https://www.uantwerpen.be/nl/personeel/steven-gillis/) (University of Antwerp)  
November 5, 2024

(to do)

# # load packages
# libraries = c('RColorBrewer','stringr','dplyr','tidyverse',
#               'reshape2','knitr','kableExtra',
#               'rstan','StanHeaders','runjags','rethinking')
# sapply(libraries, require, character.only=TRUE)

# # load functions
# main_dir = '/home/josema/Desktop/1. Work/1 research/PhD Antwerp/#thesis/paper2/paper2_manuscript'
# source( file.path( main_dir, 'code', 'user-defined-functions.R') )

# Introduction

In *comparative judgment* (CJ) studies, judges assess the presence of a trait or competence by conducting pairwise comparisons of stimuli \[@Thurstone_1927; @Pollitt_2004; @Pollitt_2012a\]. The comparison produces a dichotomous outcome, indicating which stimulus is perceived to possess a higher trait level. After conducting multiple rounds of pairwise comparisons, researchers use the Bradley-Terry-Luce (BTL) model \[@Bradley_et_al_1952; @Luce_1959\] to process the outcomes and estimate scores that reflect the underlying trait of interest. This method has been successfully employed in assessing the quality of written texts, where quality describes the underlying trait of interest and the texts serve as the stimuli \[@Laming_2004; @Pollitt_2012b; @Whitehouse_2012; @vanDaal_et_al_2016; @Lesterhuis_2018; @Coertjens_et_al_2017; @Goossens_et_al_2018; @Bouwer_et_al_2023\].

Numerous studies have documented the effectiveness of CJ in assessing various traits and competencies over the past decade. These studies have emphasized three aspects of the method’s effectiveness: its reliability, validity, and practical applicability. Research on reliability indicates that CJ requires a relatively small number of pairwise comparisons \[@Verhavert_et_al_2019; @Crompvoets_et_al_2022\] to produce trait scores that are as precise and consistent as those generated by other assessment methods \[@Coertjens_et_al_2017; @Goossens_et_al_2018; @Bouwer_et_al_2023\]. Furthermore, evidence suggests that the reliability and time efficiency of CJ are comparable, if not superior, to those of other assessment methods when employing adaptive comparison algorithms \[@Pollitt_2012b; @Verhavert_et_al_2022; @Mikhailiuk_et_al_2021\]. Additionally, research on validity suggests that scores generated by CJ can accurately represent the traits under measurement \[@Whitehouse_2012; @vanDaal_et_al_2016; @Lesterhuis_2018; @Bartholomew_et_al_2018; @Bouwer_et_al_2023\]. Finally, research on practical applicability highlights the method’s versatility across both educational and non-educational contexts \[@Jones_2015; @Bartholomew_et_al_2018; @Jones_et_al_2019; @Marshall_et_al_2020; @Bartholomew_et_al_2020; @Boonen_et_al_2020\].

Nevertheless, despite the growing number of CJ studies, the unsystematic and fragmented research approaches in the literature have left several critical issues unaddressed. This study focuses on three of them: the apparent disconnect between CJ’s measurement and structural models, the over-reliance on the assumptions of Thurstone’s Case 5 \[-@Thurstone_1927\] in CJ’s measurement model, and the unclear impact of comparison algorithms on the method’s reliability and validity. The following sections discuss each issue in detail and then introduce a theoretical model and its statistical translation, which together aim to address all three concerns simultaneously.

# Three critical issues in CJ literature

## The disconnect between structural and measurement models

In a typical CJ study, the BTL model serves as the measurement model for CJ \[@Andrich_1978; @Bramley_2008\]. A measurement model specifies how manifest variables contribute to the estimation of latent variables \[@Everitt_et_al_2010\]. For example, when evaluating text quality, the BTL model uses the dichotomous outcomes resulting from the pairwise comparisons (the manifest variables) to estimate scores that reflect the underlying quality level of the texts (the latent variable) \[@Laming_2004; @Pollitt_2012b; @Whitehouse_2012; @vanDaal_et_al_2016; @Lesterhuis_2018; @Coertjens_et_al_2017; @Goossens_et_al_2018; @Bouwer_et_al_2023\].
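For reference, the BTL model is commonly written as follows. The notation is an assumption introduced here for illustration: $y_{ij}$ equals 1 when stimulus $i$ is chosen over stimulus $j$, and $\theta_i$ and $\theta_j$ denote the latent trait levels of the two stimuli.

$$
\Pr(y_{ij} = 1) = \frac{\exp(\theta_i - \theta_j)}{1 + \exp(\theta_i - \theta_j)} = \operatorname{logit}^{-1}(\theta_i - \theta_j)
$$

Only differences between trait levels enter the model, so the origin of the scale has to be fixed by a constraint, for example by setting one $\theta$ to zero or by centering the scores.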

Researchers then typically use the estimated BTL scores, or their transformations, to conduct additional analyses or hypothesis testing. The scores have been used to identify ‘misfit’ judges and stimuli \[@Pollitt_2012b; @vanDaal_et_al_2017; @Goossens_et_al_2018\], detect biases in judges’ ratings \[@Pollitt_et_al_2003; @Pollitt_2012b\], calculate correlations with other assessment methods \[@Goossens_et_al_2018; @Bouwer_et_al_2023\], or test hypotheses related to the underlying trait of interest \[@Bramley_et_al_2019; @Boonen_et_al_2020; @Bouwer_et_al_2023; @vanDaal_et_al_2017; @Jones_et_al_2019; @Gijsen_et_al_2021\].

However, the statistical literature cautions against using estimated scores to conduct additional analyses or tests. A key consideration is that BTL scores are parameter estimates that inherently carry uncertainty. Ignoring this uncertainty when conducting separate analyses and tests can overstate their precision and statistical power, increasing the risk of committing a type I error \[@McElreath_2020\]. A type I error occurs when a true null hypothesis is incorrectly rejected \[@Everitt_et_al_2010\].
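The point that BTL scores are estimates with their own uncertainty can be made concrete with a small simulation. The sketch below is illustrative only and is not the code used in this paper: the number of stimuli, the trait values, the number of repetitions, and all object names are assumptions, and the model is fitted with base R's `glm()` rather than a dedicated CJ package.

```r
# Minimal sketch (hypothetical names and values): simulate pairwise
# comparisons among five stimuli, fit a BTL model, and inspect uncertainty.
set.seed(2024)

n_stimuli  <- 5
theta_true <- c(-1, -0.5, 0, 0.5, 1)   # assumed latent trait levels

# every unordered pair of stimuli, each judged 30 times
pairs <- t(combn(n_stimuli, 2))
pairs <- pairs[rep(seq_len(nrow(pairs)), each = 30), ]
i <- pairs[, 1]
j <- pairs[, 2]

# outcome: 1 if stimulus i is preferred over stimulus j
y <- rbinom(length(i), size = 1, prob = plogis(theta_true[i] - theta_true[j]))

# design matrix: +1 in the column of stimulus i, -1 in the column of stimulus j
X <- matrix(0, nrow = length(i), ncol = n_stimuli)
X[cbind(seq_along(i), i)] <- 1
X[cbind(seq_along(j), j)] <- -1
colnames(X) <- paste0("stimulus_", seq_len(n_stimuli))

# drop the first column so that stimulus 1 is the reference (theta fixed at 0)
dat <- data.frame(y = y, X[, -1])
fit <- glm(y ~ 0 + ., data = dat, family = binomial)

# point estimates together with their standard errors
btl <- data.frame(
  estimate  = coef(fit),
  std_error = sqrt(diag(vcov(fit)))
)
print(btl)
```

A follow-up analysis that uses only `btl$estimate` treats the scores as if they were observed without error and discards the information in `btl$std_error`.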

To mitigate these risks, principles from Structural Equation Modeling (SEM) \[@Hoyle_et_al_2023; @Kline_et_al_2023\] and Item Response Theory (IRT) \[@deAyala_2009; @Fox_2010; @vanderLinden_et_al_2017\] recommend conducting these analyses and tests within a structural model. A structural model specifies how different manifest or latent variables influence the latent variable of interest \[@Everitt_et_al_2010\]. This approach allows analyses to account for both the scores and their uncertainties simultaneously, rather than treating them as separate elements. Therefore, an integrated approach that combines CJ’s structural and measurement models can offer significant advantages.
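One generic way to write such an integrated model, offered as a sketch rather than as the specification developed in this paper, is a latent regression. All symbols are assumptions introduced for illustration: $y_{ij}$ is the outcome of comparing stimuli $i$ and $j$, $\theta_i$ is the latent trait of stimulus $i$, $\mathbf{x}_i$ is a vector of predictors, $\boldsymbol{\beta}$ contains their effects, and $\sigma$ is the residual standard deviation.

$$
\begin{aligned}
y_{ij} &\sim \text{Bernoulli}\left(\operatorname{logit}^{-1}(\theta_i - \theta_j)\right) \\
\theta_i &\sim \text{Normal}\left(\mathbf{x}_i^{\top}\boldsymbol{\beta},\ \sigma\right)
\end{aligned}
$$

Because $\theta_i$ and $\boldsymbol{\beta}$ are estimated jointly, the uncertainty in the trait scores propagates to the inference on $\boldsymbol{\beta}$. The location of $\theta$ is not identified by comparisons alone, so in this sketch $\mathbf{x}_i$ would typically exclude a constant term.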

## The assumptions of Case 5 and the measurement model

From early on in the literature, it has been clear that the BTL model represents a statistical articulation of Thurstone’s Case 5 \[-@Thurstone_1927\]. Talk about @Pollitt_et_al_2003 and @Bramley_2008.

What case 5 implies, Assumptions. Not a normal distribution but a logistic distribution
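As a reference point for the contrast between the two distributions, Thurstone's law of comparative judgment and its Case 5 simplification can be written as below. The notation is an assumption introduced for illustration: $S_i$ and $S_j$ are the scale values of stimuli $i$ and $j$, $\sigma_i$ and $\sigma_j$ are their discriminal dispersions, $\rho_{ij}$ is the correlation between their discriminal processes, and $\Phi$ is the standard normal distribution function.

$$
\begin{aligned}
\text{General law:} \quad & \Pr(i \succ j) = \Phi\left(\frac{S_i - S_j}{\sqrt{\sigma_i^2 + \sigma_j^2 - 2\rho_{ij}\sigma_i\sigma_j}}\right) \\
\text{Case 5 } (\sigma_i = \sigma_j = \sigma,\ \rho_{ij} = 0): \quad & \Pr(i \succ j) = \Phi\left(\frac{S_i - S_j}{\sigma\sqrt{2}}\right) \\
\text{BTL:} \quad & \Pr(i \succ j) = \operatorname{logit}^{-1}(\theta_i - \theta_j)
\end{aligned}
$$

The BTL model keeps the Case 5 structure, in which only the difference between two scale values matters, but replaces the normal distribution function with the logistic one.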

Table with assumptions

## The role and impact of comparison algorithms

# Theory

## A theoretical model for CJ

## From theory to statistics

# Discussion

## Findings

## Limitations and further research

# Conclusion



# Declarations

**Funding:** The project was funded through the Research Fund of the University of Antwerp (BOF).

**Financial interests:** The authors have no relevant financial interest to disclose.

**Non-financial interests:** Author XX serves on the advisory board of Company Y but receives no compensation for this role.

**Ethics approval:** The University of Antwerp Research Ethics Committee has confirmed that no ethical approval is required.

**Consent to participate:** Not applicable

**Consent for publication:** All authors have read and agreed to the published version of the manuscript.

**Availability of data and materials:** No data was utilized in this study.

**Code availability:** All the code utilized in this research is available in the digital document located at: <https://jriveraespejo.github.io/paper2_manuscript/>.

**AI-assisted technologies in the writing process:** The authors used ChatGPT, an AI language model, during the preparation of this work. They occasionally employed the tool to refine phrasing and optimize wording, ensuring appropriate language use and enhancing the manuscript’s clarity and coherence. The authors take full responsibility for the final content of the publication.

**CRediT authorship contribution statement:** *Conceptualization:* S.G., S.DM., T.vD., and J.M.R.E.; *Methodology:* S.DM., T.vD., and J.M.R.E.; *Software:* J.M.R.E.; *Validation:* J.M.R.E.; *Formal Analysis:* J.M.R.E.; *Investigation:* J.M.R.E.; *Resources:* S.G., S.DM., and T.vD.; *Data curation:* J.M.R.E.; *Writing - original draft:* J.M.R.E.; *Writing - review & editing:* S.G., S.DM., and T.vD.; *Visualization:* J.M.R.E.; *Supervision:* S.G. and S.DM.; *Project administration:* S.G. and S.DM.; *Funding acquisition:* S.G. and S.DM.



# Appendix

## Appendix A: Ignoring uncertainty

# # simulate units and sub-units
# USd = sim_units_trait( Un=10,
#                        Us=c(1,1),
#                        Ub=c(0,-0.2,0.2),
#                        Useed = 45789,
#                        Sn=6,
#                        Ss=c(0.8,0.8),
#                        Sb=c(0,0,0),
#                        Sseed = 9478 )
# 
# # simulate judges bias
# Jd = sim_judges_bias( Jn=10,
#                       Js=c(0.02,0.02),
#                       Jseed = 79985 )
# 
# 
# # simulate full comparison data
# d = cm_full( USd=USd$USd,
#              Jd=Jd$Jd )
# 
# 
# # set data in list
# dL = list_data( d )
# str(dL)

## Appendix B: The five cases of Thurstone



# References