# New Insights into PCA + Varimax for Psychological Researchers

A short commentary on Rohe & Zeng (2023)

Florian Pargent ([ORCID: 0000-0002-2388-553X](https://orcid.org/0000-0002-2388-553X); Department of Psychology, LMU Munich)  
David Goretzko (Utrecht University)  
Timo von Oertzen (Bundeswehr University Munich and Max Planck Institute for Human Development)  
March 26, 2024

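    # Personality questionnaire items (semicolon-separated CSV); keep the 300 item
    # columns and drop participants with any missing response
    phonedata_items <- read.csv2("datasets/Items.csv")
    phonedata_items <- na.omit(phonedata_items[, 3:302])

    # Smartphone sensing data; keep the first 1821 columns
    phonedata_sensing <- readRDS(file = "datasets/clusterdata.RDS")
    phonedata_sensing <- phonedata_sensing[, 1:1821]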

> **Important**
>
> This document is **an updated copy** of a [published commentary](https://doi.org/10.1093/jrsssb/qkad054), created to showcase [Quarto manuscripts](https://quarto.org/docs/manuscripts/) in our [Quarto workshop](https://florianpargent.github.io/Quarto_LMU_OSC/). The official online repository of our published commentary is available on [OSF](https://osf.io/5symf/).

### Commentary

As psychologists, we appreciate the new insights of Rohe & Zeng (R&Z; [2023](#ref-rohe2023vintage)) into “vintage” principal component analysis with varimax rotation (PCA+VR). Theories of intelligence and personality, perhaps psychology’s contributions best known outside of our field, have been a direct product of PCA. PCA+VR is still widely used for developing and evaluating psychological tests and questionnaires, although the methodological literature has argued against it in favor of more complex factor analytic techniques ([Fokkema & Greiff, 2017](#ref-fokkema2017how)).

In our opinion, abandoning the simpler PCA(+VR) is a mistake, and R&Z refute a common argument against it by proving that PCA+VR *can* perform statistical inference in latent variable models: the factor indeterminacy problem, which has plagued VR since its invention, applies only to the special case of normally distributed factors. For any other distribution, perfect factor indeterminacy does not apply, although identifiability might be weak. However, distributions producing sparse components fulfill a *sufficient* leptokurtic condition, which can be confirmed by simple diagnostics.
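The contrast between normal and leptokurtic factors can be made concrete with a small simulation. The following is a minimal sketch, not taken from the commentary's online materials: the sample size, the sparse score distribution, and the helper names `rotate` and `criterion` are illustrative assumptions, and the mean fourth power is used only as a simplified stand-in for the varimax objective.

```r
set.seed(3)
n <- 2000

# Rotation matrix for angle theta (in radians)
rotate <- function(theta) {
  matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
}

# Mean fourth power of the standardized, rotated scores: a kurtosis-type
# criterion used here as a simplified stand-in for the varimax objective
criterion <- function(Z, theta) mean((scale(Z) %*% rotate(theta))^4)

# Hypothetical person scores on two independent components
Z_normal <- matrix(rnorm(2 * n), nrow = n)                         # normal
Z_sparse <- matrix(rexp(2 * n) * rbinom(2 * n, 1, 0.3), nrow = n)  # sparse

angles <- seq(0, pi / 2, length.out = 19)  # 0 to 90 degrees in 5-degree steps
crit_normal <- sapply(angles, function(a) criterion(Z_normal, a))
crit_sparse <- sapply(angles, function(a) criterion(Z_sparse, a))

# Spread of the criterion across rotation angles
diff(range(crit_normal))
diff(range(crit_sparse))
```

For normal scores the criterion should be nearly flat across angles, so no rotation is preferred. For the sparse scores it should be clearly larger near the original axes than at 45 degrees, which is what allows a rotation criterion to single out one orientation.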

Because the results are complicated, we relate them to psychological applications. The examples in R&Z only deal with sparse binary network data, but in typical psychological applications, the $A$ matrix consists of responses of $n$ persons to $d$ items which are either binary (e.g., intelligence tests), integer-valued (e.g., personality questionnaires) or continuous (e.g., digital sensors). Psychologists are often interested in whether i) items can be structured in a simple way to represent a small number of meaningful components, and ii) those components can be interpreted as psychological constructs that describe interindividual differences. R&Z show that “radial streaks” in the rotated loading matrix $\hat{Y}$ suggest that item loadings are identified and can be estimated with PCA+VR from the data. Similarly, streaks in the component matrix $\hat{Z}$ suggest that person scores can be estimated.
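To show how $\hat{Z}$ and $\hat{Y}$ can be obtained in practice, here is a minimal sketch of PCA+VR on simulated continuous responses. It is not the analysis from our online materials: the simulated data, the noise level, and the helper `excess_kurtosis` are assumptions made for illustration, and the scaling by $\sqrt{n}$ and $\sqrt{d}$ follows our reading of R&Z.

```r
set.seed(1)
n <- 500  # persons
d <- 20   # items
k <- 2    # components

# Hypothetical sparse, leptokurtic person scores (most entries are zero)
Z <- matrix(rexp(n * k) * rbinom(n * k, 1, 0.3), nrow = n)

# Simple structure: every item loads on exactly one component
Y <- matrix(0, nrow = d, ncol = k)
Y[1:10, 1] <- 1
Y[11:20, 2] <- 1

# Simulated continuous responses with measurement noise
A <- Z %*% t(Y) + matrix(rnorm(n * d, sd = 0.5), nrow = n)

# PCA: truncated SVD of the column-centered data
s <- svd(scale(A, center = TRUE, scale = FALSE), nu = k, nv = k)

# Varimax rotation (stats::varimax) of left and right singular vectors
Z_hat <- sqrt(n) * s$u %*% varimax(s$u, normalize = FALSE)$rotmat
Y_hat <- sqrt(d) * s$v %*% varimax(s$v, normalize = FALSE)$rotmat

# Excess kurtosis per rotated component (positive values: leptokurtic)
excess_kurtosis <- function(x) {
  mean((x - mean(x))^4) / mean((x - mean(x))^2)^2 - 3
}
apply(Z_hat, 2, excess_kurtosis)

# Which component dominates each item's loadings?
dominant <- apply(abs(Y_hat), 1, which.max)
table(dominant)
```

Radial streaks can then be inspected visually, for example with `plot(Y_hat)` and `plot(Z_hat)`. In this simulation both matrices are built to be sparse, so streaks along the axes would be expected in both plots.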

However, we question whether streaks are common in psychology with regard to both aspects. Test and questionnaire items are traditionally designed to measure only a single construct, so “simple structure” reflected by streaks in $\hat{Y}$ might be expected. Psychological constructs are often conceptualized as roughly normally distributed, so streaks in $\hat{Z}$ seem more questionable. In our online materials (<https://osf.io/5symf/>), we analyze a dataset ([Stachl et al., 2020](#ref-stachl2020predicting)) containing both personality items ($n = 687$, $d = 300$) and smartphone sensing variables ($n = 624$, $d = 1821$). Streaks were found in $\hat{Y}$ but not in $\hat{Z}$. The dataset is also a cautionary example of how imputation of missing values, in combination with inappropriate data processing, can seemingly produce streaks in $\hat{Z}$ that belong to uninterpretable components. Degree normalization as discussed in R&Z is not suitable for many psychological datasets, and other procedures like z-standardization are often required to detect meaningful factors. Finally, we demonstrate R&Z’s side result that the matrix $\hat{Z}\hat{B}$ from PCA+VR can estimate person scores simulated from oblique leptokurtic components.
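Why z-standardization can matter is easiest to see when variables are measured on very different scales, as is common for sensing data. The following minimal sketch uses simulated data rather than the Stachl et al. dataset. The measurement scales and the helper `pca_varimax` are hypothetical, and degree normalization is not shown.

```r
set.seed(2)
n <- 300

# Two hypothetical, normally distributed components, three variables each
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(f1, f1, f1, f2, f2, f2) + matrix(rnorm(n * 6, sd = 0.5), nrow = n)

# Put the variables on very different measurement scales
X <- sweep(X, 2, c(1, 10, 1000, 1, 10, 1000), `*`)

# PCA + varimax on centered data, with or without z-standardization;
# returns the rotated right singular vectors as a plain matrix
pca_varimax <- function(X, standardize) {
  s <- svd(scale(X, center = TRUE, scale = standardize), nu = 2, nv = 2)
  unclass(varimax(s$v, normalize = FALSE)$loadings)
}
Y_raw <- pca_varimax(X, standardize = FALSE)
Y_std <- pca_varimax(X, standardize = TRUE)

# Compare the two loading patterns
round(Y_raw, 2)
round(Y_std, 2)
```

Without standardization, the two retained components are driven almost entirely by the two variables with the largest scale. After z-standardization, the three variables belonging to each simulated component should load together, even though the components themselves are normally distributed.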

In our opinion, the main usefulness of PCA+VR does not necessarily stem from its ability to estimate latent variable models. PCA excels at providing meaningful descriptions in practical applications, but R&Z’s examples and our own also show that there is rarely a single definitive structure. Components are most useful when they predict other meaningful quantities, regardless of the assumed epistemological nature of psychological constructs ([Yarkoni, 2020](#ref-yarkoni2020implicit)).

### References

Fokkema, M., & Greiff, S. (2017). How Performing PCA and CFA on the Same Data Equals Trouble: Overfitting in the Assessment of Internal Structure and Some Editorial Thoughts on It. *European Journal of Psychological Assessment*, *33*(6), 399–402. <https://doi.org/10.1027/1015-5759/a000460>

Rohe, K., & Zeng, M. (2023). Vintage factor analysis with Varimax performs statistical inference. *Journal of the Royal Statistical Society Series B: Statistical Methodology*, *85*(4), 1037–1060. <https://doi.org/10.1093/jrsssb/qkad029>

Stachl, C., Au, Q., Schoedel, R., Gosling, S. D., Harari, G. M., Buschek, D., Völkel, S. T., Schuwerk, T., Oldemeier, M., Ullmann, T., Hussmann, H., Bischl, B., & Bühner, M. (2020). Predicting personality from patterns of behavior collected with smartphones. *Proceedings of the National Academy of Sciences of the United States of America*, *117*(30), 17680–17687. <https://doi.org/10.1073/pnas.1920484117>

Yarkoni, T. (2020). Implicit Realism Impedes Progress in Psychology: Comment on Fried (2020). *Psychological Inquiry*, *31*(4), 326–333. <https://doi.org/10.1080/1047840X.2020.1853478>