[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/leaguilar/QuantUX/blob/main/notebooks/CFA_and_EFA.ipynb?flush_cache=true)

CFA (Confirmatory Factor Analysis) and EFA (Exploratory Factor Analysis) differ in their purpose, assumptions, and approach to modeling latent variables.

| Aspect | EFA (Exploratory Factor Analysis) | CFA (Confirmatory Factor Analysis) |
|--------|----------------------------------|-----------------------------------|
| Purpose | To explore which factor structure best accounts for the correlations among variables, without specifying in advance which variables load on which factors. | To test a hypothesized factor structure based on theory or prior evidence. |
| When Used | Early in research, when you don’t know how variables group together. | Later in research, when you already have a model or theory to verify. |
| Model Specification | The researcher chooses only the number of factors; every variable is free to load on every factor, and the estimation determines the loading pattern. | The researcher specifies how many factors exist and which variables load on each. |
| Cross-loadings | All cross-loadings are estimated, so a variable can load on several factors. | Usually, each variable loads only on its predefined factor (cross-loadings are constrained to zero). |
| Assumptions | Weak structural assumptions (a linear common-factor model with a chosen number of factors); used for discovery. | Strong structural assumptions (a fully specified loading pattern); used to confirm a specific model. |
| Rotation | Rotations (orthogonal, e.g. varimax, or oblique) are often used to improve interpretability. | No rotation: the zero constraints in the specified loading pattern identify the solution, and the remaining loadings are estimated. |
| Fit Indices | Not typically the focus (interpretation centers on the loadings). | Model fit indices (e.g., CFI, RMSEA, χ²) are key for evaluating how well the model fits the data. |
| Software Output Focus | Factor loadings, eigenvalues, explained variance. | Model fit indices, standardized loadings, modification indices. |
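
The eigenvalues listed in the EFA column are often inspected before fitting, to get a rough idea of how many factors to extract. The sketch below uses only NumPy on simulated data. The three-factor structure, the 0.9 loadings, the sample size, and the "eigenvalue greater than 1" rule of thumb are all assumptions chosen for illustration; none of them come from the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_factors, items_per_factor = 300, 3, 3

# Simulate three uncorrelated latent factors (hypothetical data).
factors = rng.standard_normal((n_obs, n_factors))

# Simple-structure loading matrix: each item loads 0.9 on exactly one factor.
loadings = np.kron(np.eye(n_factors), np.full((items_per_factor, 1), 0.9))

# Unique (error) part scaled so that every item has unit population variance.
noise_sd = np.sqrt(1 - 0.9**2)
noise = rng.standard_normal((n_obs, n_factors * items_per_factor)) * noise_sd
items = factors @ loadings.T + noise

# Eigenvalues of the item correlation matrix, largest first.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))[::-1]
explained = eigenvalues / eigenvalues.sum()

# Rule of thumb: retain factors whose eigenvalue exceeds 1.
n_retained = int((eigenvalues > 1).sum())
print("eigenvalues:", eigenvalues.round(2))
print("share of variance:", explained.round(2))
print("factors suggested by the eigenvalue > 1 rule:", n_retained)
```

The rule of thumb is only a starting point; a scree plot or parallel analysis is commonly used alongside it.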

In [5]:
# --- Setup: install missing packages ---
import sys
import subprocess

def install_if_missing(package):
    try:
        __import__(package)
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Required packages
for pkg in ["pandas", "factor_analyzer", "semopy"]:
    install_if_missing(pkg)

In [None]:
import pandas as pd
import os
from factor_analyzer import FactorAnalyzer
from semopy import Model, calc_stats


# EFA (Exploratory Factor Analysis)

In [2]:
# local path
file_path = "../data/synthetic_responses.csv"

# remote (GitHub raw) URL
url = "https://raw.githubusercontent.com/leaguilar/QuantUX/main/data/synthetic_responses.csv"

# load from local if available, otherwise from GitHub
if os.path.exists(file_path):
    print(f"Loading local file: {file_path}")
    data = pd.read_csv(file_path)
else:
    print(f"Local file not found. Loading from URL: {url}")
    data = pd.read_csv(url)


Loading local file: ../data/synthetic_responses.csv


In [3]:
data

,x1,x2,x3,x4,x5,x6,x7,x8,x9
0,0.544841,0.397790,0.758387,0.097280,-0.429733,-1.034791,0.760314,-0.233231,0.816711
1,-0.267947,-0.268547,-0.344912,-0.472409,-0.752947,-0.433435,-0.301340,-0.989670,-1.205119
2,0.529649,0.502301,0.255640,0.314472,0.175240,0.453892,1.500593,0.443282,0.869295
3,1.729805,1.283440,1.369377,-0.117190,0.877433,0.440671,1.493994,1.704022,0.809773
4,-0.110883,-0.144363,-0.278812,-0.737977,-0.490396,0.650569,0.430471,0.512026,0.812420
...,...,...,...,...,...,...,...,...,...
295,-0.378138,0.317400,-0.924406,-0.438319,-0.936870,0.113532,0.549960,0.023972,1.042415
296,0.711825,1.454321,0.682018,0.052292,-0.557656,0.189819,-0.030593,0.553575,-0.539758
297,0.466836,0.698456,-0.041841,-1.211754,-1.086718,-1.405775,-0.379333,-0.275503,0.534896
298,0.739855,0.978628,1.161979,-0.976938,-0.228098,-0.151036,1.098878,0.383216,1.220316


In [10]:
# run EFA
fa = FactorAnalyzer(n_factors=3, rotation='varimax')
fa.fit(data)

# show factor loadings
loadings = pd.DataFrame(fa.loadings_, index=data.columns, columns=['Factor1','Factor2','Factor3'])
print(loadings.round(2))

    Factor1  Factor2  Factor3
x1     0.91    -0.01    -0.01
x2     0.85    -0.05     0.01
x3     0.93     0.01    -0.02
x4    -0.06     0.89     0.00
x5     0.02     0.86     0.03
x6    -0.01     0.92    -0.02
x7    -0.00     0.00     0.91
x8    -0.01     0.03     0.85
x9    -0.00    -0.03     0.90
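
Reading a loading matrix usually means assigning each item to the factor with its largest absolute loading and checking for sizeable secondary loadings. A minimal sketch in plain Python, using the rounded values printed above. The 0.30 cross-loading cutoff is an assumed rule of thumb, not something the notebook specifies, and the communality formula (sum of squared loadings) relies on the rotation being orthogonal, as varimax is.

```python
# Rounded loadings copied from the printed EFA output.
loadings = {
    "x1": (0.91, -0.01, -0.01),
    "x2": (0.85, -0.05, 0.01),
    "x3": (0.93, 0.01, -0.02),
    "x4": (-0.06, 0.89, 0.00),
    "x5": (0.02, 0.86, 0.03),
    "x6": (-0.01, 0.92, -0.02),
    "x7": (-0.00, 0.00, 0.91),
    "x8": (-0.01, 0.03, 0.85),
    "x9": (-0.00, -0.03, 0.90),
}
factor_names = ("Factor1", "Factor2", "Factor3")
CROSS_LOADING_CUTOFF = 0.30  # assumed threshold, adjust to your own convention

assignment = {}
communalities = {}
cross_loaded = []

for item, row in loadings.items():
    abs_row = [abs(value) for value in row]
    primary = abs_row.index(max(abs_row))
    assignment[item] = factor_names[primary]

    # Communality: variance of the item explained by all factors together.
    communalities[item] = sum(value**2 for value in row)

    # Flag items with a secondary loading at or above the cutoff.
    if any(v >= CROSS_LOADING_CUTOFF for j, v in enumerate(abs_row) if j != primary):
        cross_loaded.append(item)

    print(f"{item}: {assignment[item]}, communality={communalities[item]:.2f}")

print("items with cross-loadings:", cross_loaded)
```

With these loadings, x1 to x3 map to Factor1, x4 to x6 to Factor2, and x7 to x9 to Factor3, with no item flagged as cross-loading.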




# CFA (Confirmatory Factor Analysis)

In [11]:
# define CFA model in lavaan-style syntax
model_desc = """
F1 =~ x1 + x2 + x3
F2 =~ x4 + x5 + x6
F3 =~ x7 + x8 + x9
"""

In [12]:
# fit CFA
model = Model(model_desc)
model.fit(data)

SolverResult(fun=np.float64(0.05511677287284833), success=True, n_it=30, x=array([ 0.83802864,  1.13208415,  0.84579161,  1.14358994,  0.82996855,
        1.11836654,  0.67921133, -0.02221937, -0.00960048,  0.59071715,
        0.66600713, -0.00381686,  0.14429141,  0.17915425,  0.14522552,
        0.15301873,  0.15850976,  0.14251535,  0.13816628,  0.18438174,
        0.19238574]), message='Optimization terminated successfully', name_method='SLSQP', name_obj='MLW')
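
The `x` array in the solver result holds 21 values, one per free parameter. The sketch below counts those parameters by hand and derives the model's degrees of freedom. It assumes the marker-variable convention (the first loading of each factor fixed to 1) and freely estimated covariances between all factors; both are assumptions about how the model is identified, chosen because they reproduce the count of 21.

```python
n_observed = 9  # x1 .. x9
n_factors = 3   # F1 .. F3

# Unique variances and covariances among the observed variables.
n_moments = n_observed * (n_observed + 1) // 2

# Free parameters, assuming one loading per factor is fixed to 1.
free_loadings = n_observed - n_factors
factor_variances = n_factors
factor_covariances = n_factors * (n_factors - 1) // 2
residual_variances = n_observed

n_free = free_loadings + factor_variances + factor_covariances + residual_variances
dof = n_moments - n_free

print("sample moments:", n_moments)   # 45
print("free parameters:", n_free)     # 21
print("degrees of freedom:", dof)     # 24
```

A positive number of degrees of freedom means the model is over-identified, which is what makes its fit testable.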

In [15]:
estimates = model.inspect(std_est=True)
print(estimates.columns)

Index(['lval', 'op', 'rval', 'Estimate', 'Est. Std', 'Std. Err', 'z-value',
       'p-value'],
      dtype='object')


In [22]:
print(estimates[['lval', 'op', 'rval', 'Estimate', 'Est. Std']])

   lval  op rval  Estimate  Est. Std
0    x1   ~   F1  1.000000  0.908176
1    x2   ~   F1  0.838029  0.852623
2    x3   ~   F1  1.132084  0.925754
3    x4   ~   F2  1.000000  0.891211
4    x5   ~   F2  0.845792  0.852771
5    x6   ~   F2  1.143590  0.918833
6    x7   ~   F3  1.000000  0.910049
7    x8   ~   F3  0.829969  0.844582
8    x9   ~   F3  1.118367  0.901320
9    F1  ~~   F1  0.679211  1.000000
10   F1  ~~   F2 -0.022219 -0.035078
11   F1  ~~   F3 -0.009600 -0.014274
12   F2  ~~   F2  0.590717  1.000000
13   F3  ~~   F3  0.666007  1.000000
14   F3  ~~   F2 -0.003817 -0.006085
15   x1  ~~   x1  0.144291  0.175217
16   x2  ~~   x2  0.179154  0.273035
17   x3  ~~   x3  0.145226  0.142979
18   x4  ~~   x4  0.153019  0.205743
19   x5  ~~   x5  0.158510  0.272781
20   x6  ~~   x6  0.142515  0.155745
21   x7  ~~   x7  0.138166  0.171812
22   x8  ~~   x8  0.184382  0.286681
23   x9  ~~   x9  0.192386  0.187622
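
The `Est. Std` column can be cross-checked by hand. For an item that loads on a single factor, the squared standardized loading is the share of item variance explained by the factor, so one minus that value should equal the standardized residual variance. A factor correlation is the covariance divided by the product of the two factor standard deviations. The sketch below copies the numbers from the table above; it is a consistency check under those textbook relations, not a description of semopy's internals.

```python
import math

# Standardized loadings and standardized residual variances from the table.
std_loadings = {
    "x1": 0.908176, "x2": 0.852623, "x3": 0.925754,
    "x4": 0.891211, "x5": 0.852771, "x6": 0.918833,
    "x7": 0.910049, "x8": 0.844582, "x9": 0.901320,
}
std_residuals = {
    "x1": 0.175217, "x2": 0.273035, "x3": 0.142979,
    "x4": 0.205743, "x5": 0.272781, "x6": 0.155745,
    "x7": 0.171812, "x8": 0.286681, "x9": 0.187622,
}

max_gap = 0.0
for item, loading in std_loadings.items():
    r_squared = loading**2            # variance explained by the factor
    implied_residual = 1 - r_squared  # what should be left over
    max_gap = max(max_gap, abs(implied_residual - std_residuals[item]))
    print(f"{item}: R^2={r_squared:.3f}, implied residual={implied_residual:.3f}")

# Factor correlation from the unstandardized variances and covariance.
var_f1, var_f2, cov_f1_f2 = 0.679211, 0.590717, -0.022219
corr_f1_f2 = cov_f1_f2 / math.sqrt(var_f1 * var_f2)
print(f"corr(F1, F2) = {corr_f1_f2:.6f}")
print(f"largest gap to reported residuals: {max_gap:.1e}")
```

The near-zero factor correlations here are in line with the orthogonal (varimax) solution found in the EFA.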


In [23]:
# calc_stats returns a one-row DataFrame; transpose it so each index prints on its own line
fit_stats = calc_stats(model)
print("\nModel fit indices:")
print(fit_stats.T)


Model fit indices:
                     Value
DoF              24.000000
DoF Baseline     36.000000
chi2             16.535032
chi2 p-value      0.867804
chi2 Baseline  2011.064369
CFI               1.003780
GFI               0.991778
AGFI              0.987667
NFI               0.991778
TLI               1.005669
RMSEA             0.000000
AIC              41.889766
BIC             119.669198
LogLik            0.055117
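
Several of these indices can be recomputed from the χ² values alone, which helps when interpreting them. The sketch below applies textbook-style formulas to the printed numbers. Two things are assumptions: the sample size of 300 (inferred because 300 times the solver objective of about 0.0551 gives the reported χ²), and that semopy computes the indices this way (the formulas reproduce the printed values, but the library's internals are not shown in the notebook).

```python
import math

# Values copied from the printed fit statistics.
chi2, dof = 16.535032, 24
chi2_base, dof_base = 2011.064369, 36
loglik = 0.055117  # the value labelled LogLik in the output
n_params = 21      # length of the solver's parameter vector
n_obs = 300        # assumption: inferred, not printed by the notebook

# Incremental fit indices compare the model with the baseline model.
nfi = 1 - chi2 / chi2_base
cfi_unclipped = 1 - (chi2 - dof) / (chi2_base - dof_base)
tli = (chi2_base / dof_base - chi2 / dof) / (chi2_base / dof_base - 1)

# RMSEA is zero whenever chi2 does not exceed its degrees of freedom.
rmsea = math.sqrt(max(chi2 - dof, 0) / (dof * (n_obs - 1)))

# Information criteria in the form that matches the printed AIC and BIC.
aic = 2 * n_params - 2 * loglik
bic = n_params * math.log(n_obs) - 2 * loglik

for name, value in [("NFI", nfi), ("CFI", cfi_unclipped), ("TLI", tli),
                    ("RMSEA", rmsea), ("AIC", aic), ("BIC", bic)]:
    print(f"{name}: {value:.6f}")
```

Because χ² is smaller than its degrees of freedom, CFI and TLI come out slightly above 1 and RMSEA is exactly 0. Many programs truncate CFI to the range 0 to 1, so a value above 1 here should be read as "no detectable misfit" rather than as a better-than-perfect fit.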
