https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6004510/table/T2/

https://en.wikipedia.org/wiki/Confirmatory_factor_analysis

https://link.springer.com/article/10.3758/s13428-018-1055-2

In [1]:
import pandas as pd
from questions_columns import sci_af_ca
from factor_analyzer import ConfirmatoryFactorAnalyzer, ModelSpecificationParser
import re
import semopy
from semopy import stats

In [2]:
df = pd.read_csv(r"../create_dataset/data_for_research.csv")

Load the question-to-factor mappings

In [3]:
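# Read the question-to-factor mapping and drop rows where both fields are missing
f = pd.read_excel("Factors.xlsx")
f = f.dropna(how='all', subset=['question', 'F'])
# Extract the numeric ids from the 'question' and 'F' labels
f['question_number'] = f.question.str.extract(r'(\d+)')
f['factor_number'] = f.F.str.extract(r'(\d+)')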

In [4]:
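# One empty indicator list per factor (Factor1 .. Factor5)
model_dict = {f"Factor{i}": [] for i in range(1, 6)}
# Append each question's column name to the factor it belongs to
for _, row in f.iterrows():
    model_dict[f"Factor{row['factor_number']}"].append(f"sci_af_ca_{row['question_number']}")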

In [5]:
model_factors_list = [f"{key} =~ {' + '.join(model_dict[key])}" for key in model_dict.keys()]
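
To make the result of the mapping steps concrete, here is a minimal, self-contained sketch with a made-up mapping. The `Factor{n}` and `sci_af_ca_{n}` naming follows the notebook, but the question numbers and the two-factor layout are illustrative assumptions, not the contents of `Factors.xlsx`.

```python
# Hypothetical mapping: question number -> factor number (illustrative only,
# not the real contents of Factors.xlsx).
toy_mapping = {"1": "1", "2": "1", "3": "2", "4": "2", "5": "2"}

# Group indicator column names by factor, mirroring the loop in the notebook.
toy_model_dict = {}
for question_number, factor_number in toy_mapping.items():
    toy_model_dict.setdefault(f"Factor{factor_number}", []).append(
        f"sci_af_ca_{question_number}"
    )

# One measurement equation per factor: latent =~ indicator + indicator + ...
toy_factor_lines = [
    f"{factor} =~ {' + '.join(indicators)}"
    for factor, indicators in toy_model_dict.items()
]
toy_model_string = "\n".join(toy_factor_lines)
print(toy_model_string)
```

Each printed line is a measurement equation in the lavaan-style syntax that semopy accepts: the latent factor on the left of `=~` and its observed indicators on the right.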

Combine the five factor definitions into a single model description

In [6]:
model_string = f"""
{model_factors_list[0]}
{model_factors_list[1]}
{model_factors_list[2]}
{model_factors_list[3]}
{model_factors_list[4]}
"""

data = df[sci_af_ca]

Since the indicators are binary, the model is fitted with diagonally weighted least squares (DWLS) estimation, as recommended.

In [7]:
model = semopy.Model(model_string)
model_fit = model.fit(data=data, obj='DWLS')

Did it succeed?

In [8]:
model_fit.message

'Optimization terminated successfully'

### Chi Squared

The chi-square goodness-of-fit test is a statistical hypothesis test of whether data are consistent with a specified distribution. In CFA, the chi-square fit index assesses the fit between the hypothesized model and the data from a set of measurement items (the observed variables). The model chi-square is usually the statistic obtained under maximum likelihood estimation; in this notebook the model was fitted with DWLS instead.

The chi-square value is the test statistic for the goodness of fit of a factor model. It compares the observed covariance matrix with the covariance matrix implied by the model.

Expected: a non-significant χ²

In [9]:
stats.calc_chi2(model)

(673.6817673404281, 0.9326807675504475)
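
As a rough cross-check of the p-value, the upper-tail probability of the chi-square distribution can be approximated with the standard library alone. The sketch below uses the Wilson-Hilferty normal approximation, which is my substitution and not what semopy does internally; `scipy.stats.chi2.sf` would give the exact tail probability. The degrees of freedom (730) are taken from the `calc_stats` output later in the notebook.

```python
import math


def chi2_sf_wilson_hilferty(stat, dof):
    """Approximate P(X >= stat) for X ~ chi-square(dof) (Wilson-Hilferty)."""
    # (X / dof) ** (1/3) is approximately normal with this mean and sd.
    mean = 1 - 2 / (9 * dof)
    sd = math.sqrt(2 / (9 * dof))
    z = ((stat / dof) ** (1 / 3) - mean) / sd
    # Upper-tail probability of a standard normal at z.
    return 0.5 * math.erfc(z / math.sqrt(2))


chi2_stat = 673.6817673404281  # statistic returned by calc_chi2
dof = 730                      # model degrees of freedom from calc_stats
p_value = chi2_sf_wilson_hilferty(chi2_stat, dof)
print(p_value)
```

The approximation lands close to the reported 0.93, well above the usual .05 threshold, so the chi-square test does not reject the five-factor model.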

### CFI, comparative fit index

The comparative fit index (CFI) assesses model fit by examining the discrepancy between the data and the hypothesized model, while adjusting for the effect of sample size.

CFI is an incremental (relative) fit index: it measures the relative improvement in the fit of the researcher's model over that of a baseline model.

Expected: CFI ≥ .95

CFI and TLI are incremental fit indices that compare the fit of a hypothesized model with that of a baseline model (i.e., a model with the worst fit).

In [10]:
stats.calc_cfi(model)

1.0008618281129649

### TLI, Tucker–Lewis index

Also known as the non-normed fit index.

TLI is based on the idea of comparing the proposed factor model to a model in which no interrelationships at all are assumed among any of the items.

It analyzes the discrepancy between the chi-squared value of the hypothesized model and the chi-squared value of the null model.

Expected: TLI ≥ .95


In [11]:
stats.calc_tli(model)

1.0009208574357706

### RMSEA, root mean square error of approximation

RMSEA measures the estimated discrepancy between the population covariance matrix and the model-implied covariance matrix, per degree of freedom.

It avoids issues of sample size by analyzing the discrepancy between the hypothesized model, with optimally chosen parameter estimates, and the population covariance matrix.

Expected: RMSEA ≤ .06
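
A minimal sketch of one common RMSEA point estimate, using the chi-square and degrees of freedom that `calc_stats` reports for this model. The sample size is not stated in the notebook, so `n_samples` below is a placeholder assumption; some references also divide by N rather than N - 1.

```python
import math


def rmsea(chi2_stat, dof, n_samples):
    """RMSEA point estimate, clipped at zero when chi2 <= dof."""
    excess = max(chi2_stat - dof, 0.0)
    return math.sqrt(excess / (dof * (n_samples - 1)))


chi2_stat = 673.681767  # from calc_stats
dof = 730               # from calc_stats
n_samples = 500         # ASSUMPTION: placeholder, not the real sample size

print(rmsea(chi2_stat, dof, n_samples))  # 0.0

# A hypothetical worse-fitting model (chi2 > dof) gives a positive value.
print(rmsea(900.0, dof, n_samples))
```

Because the chi-square here is smaller than the degrees of freedom, the clipped estimate is exactly zero whatever the sample size, which is consistent with the RMSEA of 0 that `calc_stats` reports.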

## All stats

In [12]:
stats.calc_stats(model)

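| | DoF | DoF Baseline | chi2 | chi2 p-value | chi2 Baseline | CFI | GFI | AGFI | NFI | TLI | RMSEA | AIC | BIC | LogLik |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 730 | 780 | 673.681767 | 0.932681 | 66127.407229 | 1.000862 | 0.989812 | 0.989115 | 0.989812 | 1.000921 | 0 | 158.274688 | 470.400101 | 10.862656 |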
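
The incremental indices can be recomputed by hand from the chi-square values and degrees of freedom in this output. The sketch below uses the textbook formulas without clipping to the [0, 1] range. That semopy computes them this way is an assumption on my part, although the numbers agree with the reported values.

```python
# Values copied from the calc_stats output.
dof, dof_base = 730, 780
chi2_stat, chi2_base = 673.681767, 66127.407229

# CFI: 1 minus the ratio of model misfit to baseline misfit (chi2 - dof each).
cfi = 1 - (chi2_stat - dof) / (chi2_base - dof_base)

# TLI: the same comparison made on chi2/dof ratios.
ratio_base = chi2_base / dof_base
ratio_model = chi2_stat / dof
tli = (ratio_base - ratio_model) / (ratio_base - 1)

# NFI: proportional reduction in chi2 relative to the baseline model.
nfi = (chi2_base - chi2_stat) / chi2_base

print(round(cfi, 6), round(tli, 6), round(nfi, 6))
```

CFI and TLI come out slightly above 1 because the model chi-square is smaller than its degrees of freedom. Many texts clip CFI at 1 in that case, so a value of 1.0009 should be read as "at the ceiling" rather than as better than perfect fit.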



The CFI (Bentler, 1990) measures the relative improvement in fit going from the baseline model to the postulated model.

CFI≥.95