copy from https://stats.idre.ucla.edu/wp-content/uploads/2020/02/cfa.r for personal study

# Confirmatory Factor Analysis (CFA) in R with lavaan

## Purpose

This seminar will show you how to perform a confirmatory factor analysis using lavaan in the R statistical programming language. Its emphasis is on understanding the concepts of CFA and interpreting the output rather than a thorough mathematical treatment or a comprehensive list of syntax options in lavaan. For exploratory factor analysis (EFA), please refer to [A Practical Introduction to Factor Analysis: Exploratory Factor Analysis](https://stats.oarc.ucla.edu/spss/seminars/introduction-to-factor-analysis/a-practical-introduction-to-factor-analysis/). A rudimentary knowledge of linear regression is required to understand some of the material in this seminar.

This seminar is the first in a three-part series on latent variable modeling. The second seminar goes over a broader range of observed and latent variable models. In this first seminar, all variables are presumed to be $y$-side variables and the direction of the arrows is unconventional (pointing to the left). Traditionally, CFA models use $x$-side variables, with parameters $\xi$ for the latent factor and $\delta$ for the observed residuals. Since $y$-side notation is more common in the literature, we use $\eta$ and $\epsilon$ for the respective factor and observed residual parameters. In the second seminar, however, we need to distinguish between $y$-side and $x$-side variables for instructional purposes.

- [Introduction to Structural Equation Modeling (SEM) in R with lavaan](https://stats.idre.ucla.edu/r/seminars/rsem/).

The third seminar goes over intermediate topics in CFA including latent growth modeling and measurement invariance.

- [Latent Growth Models (LGM) and Measurement Invariance with R in lavaan](https://stats.idre.ucla.edu/r/seminars/lgm/)


## Requirements

Before beginning the seminar, please make sure you have [R](https://cran.r-project.org/) and [RStudio](https://www.rstudio.com/) installed.

Please also make sure the following R packages are installed; if they are not, run these commands in R (RStudio).

In [1]:
install.packages("lavaan", dependencies=TRUE)

Installing package into 'C:/Users/rolfz/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)

also installing the dependencies 'mnormt', 'pbivnorm', 'numDeriv', 'quadprog'




package 'mnormt' successfully unpacked and MD5 sums checked
package 'pbivnorm' successfully unpacked and MD5 sums checked
package 'numDeriv' successfully unpacked and MD5 sums checked
package 'quadprog' successfully unpacked and MD5 sums checked
package 'lavaan' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\rolfz\AppData\Local\Temp\RtmpGQDY9h\downloaded_packages


Once you’ve installed the packages, you can load them with the following commands:

In [2]:
library(foreign) 
library(lavaan)

This is lavaan 0.6-15
lavaan is FREE software! Please report any bugs.
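
If you rerun the script often, a small guard can skip the installation step when a package is already present. The sketch below is one way to do this in base R; the helper name `ensure_package` is hypothetical and not part of the original seminar code.

```r
# Hypothetical helper (not from the seminar): install a package only if it
# is missing, then report whether its namespace can be loaded.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE instead of raising an error when pkg is absent
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with every R installation, so this call installs nothing
ensure_package("stats")
```

Calling `ensure_package("lavaan")` before `library(lavaan)` would then trigger an installation only on first use.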



## Download files here

You may download the complete R code here: [cfa.r](https://stats.idre.ucla.edu/wp-content/uploads/2020/02/cfa.r)

After clicking on the link, you can copy and paste the entire code into R or RStudio.

## Introduction

Factor analysis can be divided into two main types, **exploratory** and **confirmatory**. Exploratory factor analysis (EFA),
as the name suggests, is an exploratory tool for understanding the underlying psychometric properties
of an unknown scale. Confirmatory factor analysis (CFA) borrows many of the same concepts from exploratory factor analysis, except that instead of letting
the data tell us the factor structure, we pre-determine the factor structure and verify the psychometric structure
of a previously developed scale. More recent work by Asparouhov and Muthén (2009) blurs the boundaries between
EFA and CFA, but traditionally the two methods have been distinct. EFA has the longer history,
dating back to the era of Spearman (1904), whereas CFA became more popular after a breakthrough in both computing
technology and an estimation method developed by Jöreskog (1969). This distinction shows up in software as well.
For example, EFA is available in SPSS FACTOR, SAS PROC FACTOR and Stata’s factor. CFA, however, requires a separate
program: Amos for SPSS users, or other packages such as Mplus, EQS, SAS PROC CALIS,
Stata’s sem and, more recently, R’s [lavaan](http://lavaan.ugent.be/). Since the focus of this seminar is CFA
in R, we will use lavaan.

In this seminar, we will understand the concepts of CFA through the lens of a statistical analyst tasked to explore the psychometric properties of a newly proposed 8-item SPSS Anxiety Questionnaire. Due to budget constraints, the lab uses the freely available R statistical programming language, and lavaan as the CFA and structural equation modeling (SEM) package of choice. We will cover concepts such as the factor analysis model, basic lavaan syntax, model parameters, identification and model fit statistics. These concepts are crucial to deciding how many items to use per factor, as well as how to successfully fit a one-factor, two-factor and second-order factor analysis. By the end of this training, you should understand enough of these concepts to run your own confirmatory factor analysis in lavaan.


## Motivating example: SPSS Anxiety Questionnaire (SAQ-8)

Suppose you are tasked with evaluating a hypothetical but real-world example of a questionnaire which [Andy Field](https://edge.sagepub.com/field5e/student-resources/datasets) terms the SPSS Anxiety Questionnaire (SAQ). The first eight items consist of the following (note the actual items have been modified slightly from the original data set):

 1. Statistics makes me cry
 2. My friends will think I’m stupid for not being able to cope with SPSS
 3. Standard deviations excite me
 4. I dream that Pearson is attacking me with correlation coefficients
 5. I don’t understand statistics
 6. I have little experience with computers
 7. All computers hate me
 8. I have never been good at mathematics

Throughout the seminar we will use the terms *items* and *indicators* interchangeably, with the latter emphasizing the relationship of these items to a latent variable. Just as in our [exploratory factor analysis](https://stats.idre.ucla.edu/spss/seminars/efa-spss/) seminar, our Principal Investigator would like to evaluate the psychometric properties of the 8-item SPSS Anxiety Questionnaire “SAQ-8”, proposed as a shortened version of the original SAQ in order to reduce the time commitment for participants while maintaining internal consistency and validity. The data collectors have collected responses from 2,571 subjects so far and uploaded the SPSS file to the IDRE server. The SPSS file can be downloaded through the following link: [SAQ.sav](https://stats.idre.ucla.edu/wp-content/uploads/2018/05/SAQ.sav). Even though this is an SPSS file, R can read it directly into an R object with the function `read.spss` from the `foreign` package. The option `to.data.frame = TRUE` ensures the imported data is a data frame and not an R list, and `use.value.labels = FALSE` keeps categorical variables as numeric codes rather than converting them to factors. This matters because we want to compute covariances on the items, which is not possible with factor variables.

In [3]:
dat <- read.spss("https://stats.idre.ucla.edu/wp-content/uploads/2018/05/SAQ.sav",
                 to.data.frame=TRUE, use.value.labels = FALSE)
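
To see why numeric codes are needed, here is a minimal sketch with made-up responses. The data frame `toy` and its values are invented for illustration and are not taken from SAQ.sav. `cov` rejects factor columns but works once they are converted to numbers.

```r
# Made-up Likert-style responses stored as factors (illustration only)
toy <- data.frame(
  a = factor(c(1, 2, 3, 4, 5, 3)),
  b = factor(c(2, 2, 3, 5, 4, 3))
)

# cov() requires numeric input, so factor columns raise an error;
# tryCatch() turns that error into the string "error" for inspection
factor_result <- tryCatch(cov(toy), error = function(e) "error")

# Convert each factor back to numbers via its labels, then compute covariances
toy_num <- data.frame(lapply(toy, function(col) as.numeric(as.character(col))))
numeric_result <- cov(toy_num)

factor_result
numeric_result
```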

Now that we have imported the data set, the first step besides looking at the data itself is to look at the correlation table of all 8 variables. The function `cor` computes the correlation matrix, and `round` with the argument `2` rounds the numbers to two decimal places.

In [4]:
round(cor(dat[,1:8]),2)

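|     |   q01 |   q02 |   q03 |   q04 |   q05 |   q06 |   q07 |   q08 |
|-----|------:|------:|------:|------:|------:|------:|------:|------:|
| q01 |  1.00 | -0.10 | -0.34 |  0.44 |  0.40 |  0.22 |  0.31 |  0.33 |
| q02 | -0.10 |  1.00 |  0.32 | -0.11 | -0.12 | -0.07 | -0.16 | -0.05 |
| q03 | -0.34 |  0.32 |  1.00 | -0.38 | -0.31 | -0.23 | -0.38 | -0.26 |
| q04 |  0.44 | -0.11 | -0.38 |  1.00 |  0.40 |  0.28 |  0.41 |  0.35 |
| q05 |  0.40 | -0.12 | -0.31 |  0.40 |  1.00 |  0.26 |  0.34 |  0.27 |
| q06 |  0.22 | -0.07 | -0.23 |  0.28 |  0.26 |  1.00 |  0.51 |  0.22 |
| q07 |  0.31 | -0.16 | -0.38 |  0.41 |  0.34 |  0.51 |  1.00 |  0.30 |
| q08 |  0.33 | -0.05 | -0.26 |  0.35 |  0.27 |  0.22 |  0.30 |  1.00 |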


In a typical **variance-covariance matrix**, the diagonal elements are the variances of the items and the off-diagonal elements are the covariances. The entries of a correlation table are the standardized covariances between pairs of items, equivalent to running covariances on the Z-scores of each item. In a correlation table, the diagonal elements are always one because an item is always perfectly correlated with itself. Recall that the **magnitude** of a correlation is its absolute value, $|r|$. From this table we can see that most items have magnitudes ranging from $|r| = 0.38$ for Items 3 and 7 to $|r| = 0.51$ for Items 6 and 7. Notice that the correlations in the upper right triangle are the same as those in the lower left triangle, meaning the correlation for Items 6 and 7 is the same as the correlation for Items 7 and 6. This property is known as **symmetry** and will be important later on.
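
The link between correlations and Z-scores can be checked numerically. The sketch below uses simulated data as a stand-in for two items (it is not the SAQ data, and the seed and coefficients are arbitrary) and compares the covariance matrix of the Z-scores with the correlation matrix of the raw items.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for two correlated items (not the SAQ data)
n  <- 500
x1 <- rnorm(n, mean = 3, sd = 1)
x2 <- 0.5 * x1 + rnorm(n, mean = 1, sd = 1)
items <- cbind(x1, x2)

# scale() centers each column and divides it by its standard deviation (Z-scores)
z_items <- scale(items)

cov_of_z <- cov(z_items)  # covariance matrix of the Z-scores
cor_raw  <- cor(items)    # correlation matrix of the raw items

# The two matrices agree up to floating-point rounding
all.equal(cov_of_z, cor_raw, check.attributes = FALSE)

# Symmetry: the (x1, x2) entry equals the (x2, x1) entry
isSymmetric(cor_raw)
```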

In psychology and the social sciences, a [correlation with magnitude above **0.30**](https://www.simplypsychology.org/effect-size.html) is considered a medium effect size. Because many of the items are relatively highly correlated, this data set is a good candidate for factor analysis. The goal of factor analysis is to model the interrelationships between many items with fewer unobserved or latent variables. Before we move on, let’s understand the confirmatory factor analysis model.
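
As a rough check of how many item pairs pass this rule of thumb, the sketch below retypes the rounded correlations from the table and counts the pairs with magnitude strictly above 0.30. Treating the cutoff as strict is a choice made here for illustration, and the object names are arbitrary.

```r
# Correlations retyped from the correlation table (rounded to two decimals)
item_names <- paste0("q0", 1:8)
R <- matrix(c(
   1.00, -0.10, -0.34,  0.44,  0.40,  0.22,  0.31,  0.33,
  -0.10,  1.00,  0.32, -0.11, -0.12, -0.07, -0.16, -0.05,
  -0.34,  0.32,  1.00, -0.38, -0.31, -0.23, -0.38, -0.26,
   0.44, -0.11, -0.38,  1.00,  0.40,  0.28,  0.41,  0.35,
   0.40, -0.12, -0.31,  0.40,  1.00,  0.26,  0.34,  0.27,
   0.22, -0.07, -0.23,  0.28,  0.26,  1.00,  0.51,  0.22,
   0.31, -0.16, -0.38,  0.41,  0.34,  0.51,  1.00,  0.30,
   0.33, -0.05, -0.26,  0.35,  0.27,  0.22,  0.30,  1.00
), nrow = 8, byrow = TRUE, dimnames = list(item_names, item_names))

# Each pair appears twice in a symmetric matrix, so keep the lower triangle only
r_pairs <- R[lower.tri(R)]

n_pairs  <- length(r_pairs)           # number of distinct item pairs
n_medium <- sum(abs(r_pairs) > 0.30)  # pairs with |r| strictly above 0.30
max_r    <- max(abs(r_pairs))         # largest magnitude among the pairs

c(pairs = n_pairs, above_threshold = n_medium, largest = max_r)
```

With the rounded values in the table, 14 of the 28 distinct pairs exceed 0.30, and the largest magnitude is 0.51 (Items 6 and 7).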

## The factor analysis model

The factor analysis or **measurement model** is essentially a linear regression model where the main predictor, the factor, is **latent or unobserved**. For a single subject, the simple linear regression equation is defined as:

$$
    y = b_0 + b_1 x +\epsilon
$$

where $b_0$ is the intercept, $b_1$ is the coefficient (slope), $x$ is an observed predictor and $\epsilon$ is the residual. Similarly, for a single item, the factor analysis model is:


$$
    y_1 = \tau_1 + \lambda_1 \eta + \epsilon_1
$$

where $\tau_1$ is the intercept of the first item, $\lambda_1$ is the loading or regression weight of the first factor on the first item, and $\epsilon_1$ is the residual for the first item. There are three main differences between the factor analysis model and linear regression:

1. Factor analysis outcomes are items, not observations, so the subscript in $y_1$ indicates the first item.
2. Factor analysis is a multivariate model: there are as many outcomes per subject as there are items. In a linear regression, there is only one outcome per subject.
3. The predictor or factor, $\eta$ (“eta”), is unobserved, whereas in a linear regression the predictors are observed.
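
The three differences can be made concrete with a small simulation. The sketch below generates three items from a single factor using the model equation above; the intercepts, loadings, residual standard deviation, sample size and seed are all assumed values chosen for illustration, not estimates from the SAQ data.

```r
set.seed(2020)  # arbitrary seed for reproducibility

n      <- 5000              # number of simulated subjects
tau    <- c(2.0, 3.0, 1.0)  # assumed item intercepts
lambda <- c(0.8, 0.6, 0.7)  # assumed loadings

eta <- rnorm(n, mean = 0, sd = 1)  # latent factor, one value per subject

# One outcome per item for every subject:
# y_j = tau_j + lambda_j * eta + epsilon_j
y <- sapply(1:3, function(j) tau[j] + lambda[j] * eta + rnorm(n, sd = 0.5))
colnames(y) <- c("y1", "y2", "y3")

# Because eta is simulated, we can regress y1 on it directly. With real data
# eta is unobserved, so this regression cannot be run.
fit1 <- lm(y[, "y1"] ~ eta)
coef(fit1)  # estimates should be near tau[1] and lambda[1]

# The items covary only through eta, so with var(eta) = 1 the covariance of
# y1 and y2 should be near lambda[1] * lambda[2]
cov_y1_y2 <- cov(y[, "y1"], y[, "y2"])
cov_y1_y2
```

In practice only the columns of `y` would be available, and the intercepts and loadings would have to be recovered from the covariances among the items.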