## HEXAD vs student activity

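This notebook explores how HEXAD traits relate to student activity. Every model also includes the pre-test score (pre_test) as a predictor, and all predictors are standardized before fitting.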


In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append("..")

import pandas as pd
import src.modeling.ols_model as ols_models
import src.modeling.negative_binomial_model as nb_models
import src.modeling.binomial_model as bin_models
import src.modeling.logistic_model as logistic_models

# Load and scale data
df = pd.read_csv("../data/preprocessed/student_time_features_2021_2024.csv")
scale_cols = ['pre_test', 'HEXAD_P', 'HEXAD_S', 'HEXAD_F', 'HEXAD_A', 'HEXAD_D', 'HEXAD_R']
df = ols_models.standardize_columns(df, scale_cols)

## Solved tasks

This model examines how students’ HEXAD traits and prior knowledge relate to the number of programming tasks they completed. It helps identify which traits are associated with higher or lower task-solving activity.

In [2]:
# Select model
model_name = "solved_tasks_main"
formula = nb_models.get_nb_formula_by_name(model_name)

# Fit the selected model
model = nb_models.fit_negative_binomial(df, formula)

Optimization terminated successfully.
         Current function value: 5.463103
         Iterations: 12
         Function evaluations: 13
         Gradient evaluations: 13


In [3]:
formula

'n_solved_tasks ~ pre_test + HEXAD_P + HEXAD_S + HEXAD_F + HEXAD_A + HEXAD_D + HEXAD_R'

In [4]:
model.summary()

Dep. Variable:,n_solved_tasks,No. Observations:,871.0
Model:,NegativeBinomial,Df Residuals:,863.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 23 May 2025",Pseudo R-squ.:,0.003705
Time:,17:10:14,Log-Likelihood:,-4758.4
converged:,True,LL-Null:,-4776.1
Covariance Type:,nonrobust,LLR p-value:,9.449e-06

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,4.5159,0.043,106.135,0.000,4.432,4.599
pre_test,-0.1980,0.044,-4.524,0.000,-0.284,-0.112
HEXAD_P,-0.0416,0.064,-0.649,0.516,-0.167,0.084
HEXAD_S,0.0638,0.056,1.135,0.257,-0.046,0.174
HEXAD_F,-0.0063,0.059,-0.106,0.915,-0.122,0.109
HEXAD_A,0.0851,0.069,1.229,0.219,-0.051,0.221
HEXAD_D,-0.1380,0.048,-2.861,0.004,-0.233,-0.043
HEXAD_R,0.0144,0.063,0.229,0.819,-0.109,0.138
alpha,1.5655,0.073,21.479,0.000,1.423,1.708
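
The coefficients above are on the log scale, so they are easier to read after exponentiation. The sketch below assumes the model uses a log link (as the statsmodels-style summary suggests) and that `standardize_columns` z-scores the predictors, so each value is the multiplicative change in expected solved tasks per one standard deviation. The helper name `incidence_rate_ratio` is illustrative and not part of `src.modeling`; the numbers are copied from the table.

```python
import math

# Coefficient and 95% CI bounds copied from the summary table above.
coefs = {
    "pre_test": (-0.1980, -0.284, -0.112),
    "HEXAD_D": (-0.1380, -0.233, -0.043),
}


def incidence_rate_ratio(coef: float) -> float:
    """Multiplicative change in the expected count per unit of the predictor.

    Assumes a log link, so the ratio is simply exp(coef).
    """
    return math.exp(coef)


# Exponentiate the estimate and both CI bounds for each predictor.
irr = {
    name: tuple(incidence_rate_ratio(v) for v in values)
    for name, values in coefs.items()
}

for name, (est, lo, hi) in irr.items():
    print(f"{name}: IRR = {est:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Under these assumptions, a one-SD increase in pre_test corresponds to roughly 18% fewer solved tasks, and a one-SD increase in HEXAD_D to roughly 13% fewer.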


## Participation

This model examines how students’ HEXAD traits and prior knowledge relate to whether they participated (submitted code at least once). It helps identify which traits are associated with a higher or lower likelihood of participation.

In [5]:
df = logistic_models.add_participation_variable(df)

In [6]:
model_name = "participated_main"
formula = logistic_models.get_logit_formula_by_name(model_name)
model = logistic_models.fit_logit_model(df, formula)

Optimization terminated successfully.
         Current function value: 0.288754
         Iterations 7


In [7]:
formula

'participated ~ pre_test + HEXAD_P + HEXAD_S + HEXAD_F + HEXAD_A + HEXAD_D + HEXAD_R'

In [8]:
model.summary()

Dep. Variable:,participated,No. Observations:,871.0
Model:,Logit,Df Residuals:,863.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 23 May 2025",Pseudo R-squ.:,0.162
Time:,17:10:14,Log-Likelihood:,-251.5
converged:,True,LL-Null:,-300.12
Covariance Type:,nonrobust,LLR p-value:,4.032e-18

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,2.6207,0.159,16.510,0.000,2.310,2.932
pre_test,-0.6563,0.129,-5.081,0.000,-0.909,-0.403
HEXAD_P,-0.2102,0.165,-1.272,0.203,-0.534,0.114
HEXAD_S,0.4540,0.148,3.063,0.002,0.164,0.744
HEXAD_F,0.3286,0.160,2.060,0.039,0.016,0.641
HEXAD_A,0.3194,0.178,1.796,0.072,-0.029,0.668
HEXAD_D,-0.7961,0.152,-5.238,0.000,-1.094,-0.498
HEXAD_R,0.1516,0.160,0.950,0.342,-0.161,0.464
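
Logit coefficients are log-odds, which can be converted to odds ratios and to predicted probabilities. The sketch below is a reading aid only: it copies the significant coefficients from the table and assumes the predictors are z-scored, so the intercept describes a student who is average on every predictor. The helper names `odds_ratio` and `inv_logit` are illustrative.

```python
import math


def odds_ratio(coef: float) -> float:
    """Multiplicative change in the odds per one-unit increase in the predictor."""
    return math.exp(coef)


def inv_logit(x: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Values copied from the participation summary table above.
intercept = 2.6207
coefs = {
    "pre_test": -0.6563,
    "HEXAD_S": 0.4540,
    "HEXAD_F": 0.3286,
    "HEXAD_D": -0.7961,
}

# Predicted participation probability when all (standardized) predictors are 0.
baseline = inv_logit(intercept)
print(f"baseline probability: {baseline:.3f}")

# Probability after raising one predictor by 1 SD, holding the others at 0.
shifted = {name: inv_logit(intercept + coef) for name, coef in coefs.items()}

for name, coef in coefs.items():
    print(f"{name}: OR = {odds_ratio(coef):.3f}, P(+1 SD) = {shifted[name]:.3f}")
```

Read this way, the baseline participation probability is about 0.93, and a one-SD increase in HEXAD_D lowers it to about 0.86 with the other predictors held at their means.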


## Late work

This model examines how students’ HEXAD traits and prior knowledge relate to the likelihood of being a late worker (i.e., submitting most work near the deadline). It helps identify which traits are associated with a higher or lower tendency to submit tasks late.

In [9]:
late_work_threshold = 0.9
df = logistic_models.add_late_worker_variable(df, late_work_threshold)

In [10]:
model_name = "late_work_main"
formula = logistic_models.get_logit_formula_by_name(model_name)
model = logistic_models.fit_logit_model(df, formula)

Optimization terminated successfully.
         Current function value: 0.293045
         Iterations 7


In [11]:
formula

'late_worker ~ pre_test + HEXAD_P + HEXAD_S + HEXAD_F + HEXAD_A + HEXAD_D + HEXAD_R'

In [12]:
model.summary()

Dep. Variable:,late_worker,No. Observations:,871.0
Model:,Logit,Df Residuals:,863.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 23 May 2025",Pseudo R-squ.:,0.02805
Time:,17:10:15,Log-Likelihood:,-255.24
converged:,True,LL-Null:,-262.61
Covariance Type:,nonrobust,LLR p-value:,0.03962

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.4104,0.129,-18.696,0.000,-2.663,-2.158
pre_test,-0.0576,0.123,-0.466,0.641,-0.300,0.184
HEXAD_P,0.1254,0.176,0.711,0.477,-0.220,0.471
HEXAD_S,-0.0174,0.158,-0.110,0.913,-0.327,0.293
HEXAD_F,-0.2097,0.166,-1.260,0.208,-0.536,0.116
HEXAD_A,-0.2557,0.188,-1.363,0.173,-0.624,0.112
HEXAD_D,0.4537,0.144,3.149,0.002,0.171,0.736
HEXAD_R,0.0224,0.169,0.132,0.895,-0.309,0.354
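
The Pseudo R-squ. and LLR p-value entries in the summary both derive from the two log-likelihoods. The sketch below assumes the statsmodels definitions: McFadden's pseudo R-squared, and a likelihood-ratio statistic that is compared against a chi-square distribution with Df Model (here 7) degrees of freedom. The function names are illustrative.

```python
def mcfadden_r2(ll_model: float, ll_null: float) -> float:
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1.0 - ll_model / ll_null


def llr_statistic(ll_model: float, ll_null: float) -> float:
    """Likelihood-ratio statistic: 2 * (LL(model) - LL(null))."""
    return 2.0 * (ll_model - ll_null)


# Log-likelihoods copied from the late-work summary above.
ll_model, ll_null = -255.24, -262.61

r2 = mcfadden_r2(ll_model, ll_null)
llr = llr_statistic(ll_model, ll_null)

print(f"McFadden pseudo R-squared: {r2:.5f}")
print(f"LLR statistic: {llr:.2f} (chi-square, 7 df)")
```

The recomputed pseudo R-squared agrees with the reported 0.02805 up to rounding of the displayed log-likelihoods. A statistic of about 14.7 on 7 degrees of freedom is what lies behind the comparatively weak LLR p-value of 0.03962 for this model.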


## Shallow engagement

This model examines how students’ HEXAD traits and prior knowledge relate to the likelihood of showing shallow engagement, defined as being among the least active 25% of students (based on n_days_active). It helps identify which traits are associated with lower levels of consistent activity.
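
The definition above can be sketched as a quantile cutoff on n_days_active. The real `add_shallow_engagement_variable` is not shown in this notebook, so everything below is an assumption: the helper names, the linear-interpolation quantile (the pandas default), and the choice to treat the cutoff as inclusive. The example data is made up.

```python
def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)  # floor, since pos is non-negative
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def flag_shallow_engagement(days_active, threshold=0.25):
    """Return 1 for students at or below the threshold quantile, else 0."""
    cutoff = linear_quantile(days_active, threshold)
    return [int(d <= cutoff) for d in days_active]


# Hypothetical n_days_active values for eight students.
days = [0, 1, 2, 3, 5, 8, 13, 21]
cutoff = linear_quantile(days, 0.25)
flags = flag_shallow_engagement(days, 0.25)

print(f"cutoff = {cutoff}, flagged = {sum(flags)} of {len(days)}")
```

With tied values at the cutoff, an inclusive comparison can flag more than 25% of students, so the share of flagged rows is worth checking on the real data.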

In [13]:
engagement_threshold = 0.25
df = logistic_models.add_shallow_engagement_variable(df, engagement_threshold)

In [14]:
model_name = "shallow_engagement_main"
formula = logistic_models.get_logit_formula_by_name(model_name)
model = logistic_models.fit_logit_model(df, formula)

Optimization terminated successfully.
         Current function value: 0.535562
         Iterations 6


In [15]:
formula

'shallow_engagement ~ pre_test + HEXAD_P + HEXAD_S + HEXAD_F + HEXAD_A + HEXAD_D + HEXAD_R'

In [16]:
model.summary()

Dep. Variable:,shallow_engagement,No. Observations:,871.0
Model:,Logit,Df Residuals:,863.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 23 May 2025",Pseudo R-squ.:,0.1142
Time:,17:10:15,Log-Likelihood:,-466.47
converged:,True,LL-Null:,-526.62
Covariance Type:,nonrobust,LLR p-value:,6.69e-23

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-1.0339,0.085,-12.187,0.000,-1.200,-0.868
pre_test,0.6894,0.085,8.093,0.000,0.522,0.856
HEXAD_P,0.2244,0.114,1.967,0.049,0.001,0.448
HEXAD_S,-0.3092,0.102,-3.041,0.002,-0.508,-0.110
HEXAD_F,-0.0752,0.112,-0.672,0.501,-0.294,0.144
HEXAD_A,-0.1242,0.123,-1.013,0.311,-0.365,0.116
HEXAD_D,0.3624,0.093,3.885,0.000,0.180,0.545
HEXAD_R,-0.2222,0.111,-2.005,0.045,-0.439,-0.005


## Solve rate

This model examines how students’ HEXAD traits and prior knowledge relate to their solve rate, defined as the proportion of successfully completed tasks out of all attempts. It helps identify which traits are associated with more effective problem solving during task submissions.

In [17]:
# Prepare data
endog, exog = bin_models.prepare_solve_rate_data(df)

# Fit binomial model
model = bin_models.fit_binomial_model(endog, exog)

In [18]:
model.summary()

Dep. Variable:,"['n_solved_tasks', 'failures']",No. Observations:,776.0
Model:,GLM,Df Residuals:,768.0
Model Family:,Binomial,Df Model:,7.0
Link Function:,Logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-2886.2
Date:,"Fri, 23 May 2025",Deviance:,4001.4
Time:,17:10:15,Pearson chi2:,5770.0
No. Iterations:,7,Pseudo R-squ. (CS):,0.336
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,3.3889,0.021,162.730,0.000,3.348,3.430
pre_test,0.3320,0.021,15.916,0.000,0.291,0.373
HEXAD_P,0.0103,0.028,0.370,0.711,-0.044,0.065
HEXAD_S,-0.0646,0.025,-2.615,0.009,-0.113,-0.016
HEXAD_F,0.0631,0.025,2.510,0.012,0.014,0.112
HEXAD_A,0.0479,0.030,1.590,0.112,-0.011,0.107
HEXAD_D,-0.1499,0.021,-7.057,0.000,-0.192,-0.108
HEXAD_R,0.0150,0.027,0.552,0.581,-0.038,0.068
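
The notebook checks the count outcomes for overdispersion, and a comparable rule-of-thumb check exists for a binomial GLM: divide the Pearson chi-square (or the deviance) by the residual degrees of freedom. The sketch below applies it to the values printed in the summary above. The function name is illustrative, and the usual reading that ratios well above 1 suggest overdispersion is a heuristic, not a formal test.

```python
def dispersion_ratio(statistic: float, df_resid: float) -> float:
    """Goodness-of-fit statistic per residual degree of freedom."""
    return statistic / df_resid


# Values copied from the binomial GLM summary above.
pearson_chi2 = 5770.0
deviance = 4001.4
df_resid = 768.0

pearson_ratio = dispersion_ratio(pearson_chi2, df_resid)
deviance_ratio = dispersion_ratio(deviance, df_resid)

print(f"Pearson chi2 / df: {pearson_ratio:.2f}")
print(f"Deviance / df:     {deviance_ratio:.2f}")
```

Both ratios are well above 1 here, so the nonrobust standard errors in this table may be too small and the p-values are best read with some caution.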


## First day

This model examines how students’ HEXAD traits and prior knowledge relate to the timing of their first code submission. It helps identify which traits are associated with earlier or later initial engagement in the course.

In [19]:
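# Keep only rows with first_day > 0 (drops zero-day users).
# Note: df is overwritten, so every later cell uses this filtered subset (752 rows).
df = df[df["first_day"] > 0].copy()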

In [20]:
is_over, mean_val, var_val = nb_models.check_overdispersion(df, count_col="first_day")
print(f"Mean: {mean_val:.2f}, Variance: {var_val:.2f}")
if is_over:
    print("➡ Overdispersion likely — NB model appropriate")

model_name = "first_day_main"
# Get formula and fit model
formula = nb_models.get_nb_formula_by_name(model_name)
model = nb_models.fit_negative_binomial(df, formula)

Mean: 25.86, Variance: 903.92
➡ Overdispersion likely — NB model appropriate
Optimization terminated successfully.
         Current function value: 4.233528
         Iterations: 12
         Function evaluations: 13
         Gradient evaluations: 13


In [21]:
formula

'first_day ~ pre_test + HEXAD_P + HEXAD_S + HEXAD_F + HEXAD_A + HEXAD_D + HEXAD_R'

In [22]:
model.summary()

Dep. Variable:,first_day,No. Observations:,752.0
Model:,NegativeBinomial,Df Residuals:,744.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 23 May 2025",Pseudo R-squ.:,0.003684
Time:,17:10:16,Log-Likelihood:,-3183.6
converged:,True,LL-Null:,-3195.4
Covariance Type:,nonrobust,LLR p-value:,0.00137

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,3.2514,0.042,77.171,0.000,3.169,3.334
pre_test,0.0646,0.044,1.476,0.140,-0.021,0.150
HEXAD_P,0.0264,0.064,0.415,0.678,-0.098,0.151
HEXAD_S,0.0317,0.055,0.571,0.568,-0.077,0.140
HEXAD_F,0.0012,0.060,0.020,0.984,-0.115,0.118
HEXAD_A,-0.0946,0.065,-1.465,0.143,-0.221,0.032
HEXAD_D,0.1227,0.048,2.530,0.011,0.028,0.218
HEXAD_R,-0.1046,0.056,-1.860,0.063,-0.215,0.006
alpha,1.2745,0.060,21.101,0.000,1.156,1.393


## Median day of activity

This model examines how students’ HEXAD traits and prior knowledge relate to the median day of their activity during the course. It helps identify which traits are associated with earlier or later patterns of sustained engagement.

In [23]:
is_over, mean_val, var_val = nb_models.check_overdispersion(df, count_col="median_day_of_activity")
print(f"Mean: {mean_val:.2f}, Variance: {var_val:.2f}")
if is_over:
    print("➡  Potential overdispersion detected (variance > mean)")

Mean: 57.34, Variance: 985.61
➡  Potential overdispersion detected (variance > mean)
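
The check used here compares the variance of the outcome with its mean, since a Poisson model would require the two to be roughly equal. The implementation of `nb_models.check_overdispersion` is not shown, so the sketch below is only a guess at its logic: the function name `variance_exceeds_mean`, the use of the sample variance (n - 1 in the denominator), and the example counts are all assumptions.

```python
import statistics


def variance_exceeds_mean(values):
    """Return (is_over, mean, variance) for a sequence of counts.

    Uses the sample variance, which is an assumption about the real helper.
    """
    mean_val = statistics.mean(values)
    var_val = statistics.variance(values)
    return var_val > mean_val, mean_val, var_val


# Hypothetical count data with a long right tail.
counts = [0, 1, 1, 2, 3, 5, 8, 20]
is_over, mean_val, var_val = variance_exceeds_mean(counts)

print(f"Mean: {mean_val:.2f}, Variance: {var_val:.2f}, overdispersed: {is_over}")
```

For median_day_of_activity the reported variance (985.61) is about 17 times the mean (57.34). This marginal comparison ignores the predictors, so it is a screening step and not a test of the fitted model.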


In [24]:
model_name = "median_day_main"
formula = nb_models.get_nb_formula_by_name(model_name)
model = nb_models.fit_negative_binomial(df, formula)

Optimization terminated successfully.
         Current function value: 4.881046
         Iterations: 13
         Function evaluations: 15
         Gradient evaluations: 15


In [25]:
formula

'median_day_of_activity ~ pre_test + HEXAD_P + HEXAD_S + HEXAD_F + HEXAD_A + HEXAD_D + HEXAD_R'

In [26]:
model.summary()

Dep. Variable:,median_day_of_activity,No. Observations:,752.0
Model:,NegativeBinomial,Df Residuals:,744.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 23 May 2025",Pseudo R-squ.:,0.002484
Time:,17:10:16,Log-Likelihood:,-3670.5
converged:,True,LL-Null:,-3679.7
Covariance Type:,nonrobust,LLR p-value:,0.01078

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,4.0393,0.024,170.104,0.000,3.993,4.086
pre_test,-0.0619,0.024,-2.556,0.011,-0.109,-0.014
HEXAD_P,0.0188,0.036,0.529,0.597,-0.051,0.088
HEXAD_S,0.0822,0.031,2.630,0.009,0.021,0.143
HEXAD_F,-0.0322,0.033,-0.965,0.335,-0.098,0.033
HEXAD_A,-0.0368,0.037,-1.001,0.317,-0.109,0.035
HEXAD_D,0.0127,0.027,0.469,0.639,-0.040,0.066
HEXAD_R,-0.0279,0.033,-0.847,0.397,-0.092,0.037
alpha,0.3996,0.021,19.140,0.000,0.359,0.441


## Days of activity

This model examines how students’ HEXAD traits and prior knowledge relate to the number of days on which they were active on the platform (that is, submitted code) during the course. It helps identify which traits are associated with more consistent engagement over time.

In [27]:
is_over, mean_val, var_val = nb_models.check_overdispersion(df, count_col="n_days_active")
print(f"Mean: {mean_val:.2f}, Variance: {var_val:.2f}")
if is_over:
    print("➡  Potential overdispersion detected (variance > mean)")

Mean: 11.75, Variance: 88.64
➡  Potential overdispersion detected (variance > mean)


In [28]:
model_name = "days_of_activity_main"
formula = nb_models.get_nb_formula_by_name(model_name)
model = nb_models.fit_negative_binomial(df, formula)

Optimization terminated successfully.
         Current function value: 3.364413
         Iterations: 11
         Function evaluations: 13
         Gradient evaluations: 13


In [29]:
formula

'n_days_active ~ pre_test + HEXAD_P + HEXAD_S + HEXAD_F + HEXAD_A + HEXAD_D + HEXAD_R'

In [30]:
model.summary()

Dep. Variable:,n_days_active,No. Observations:,752.0
Model:,NegativeBinomial,Df Residuals:,744.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 23 May 2025",Pseudo R-squ.:,0.02387
Time:,17:10:17,Log-Likelihood:,-2530.0
converged:,True,LL-Null:,-2591.9
Covariance Type:,nonrobust,LLR p-value:,1.269e-23

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,2.3778,0.028,85.051,0.000,2.323,2.433
pre_test,-0.2951,0.029,-10.326,0.000,-0.351,-0.239
HEXAD_P,-0.0028,0.042,-0.067,0.947,-0.086,0.080
HEXAD_S,0.0173,0.037,0.469,0.639,-0.055,0.090
HEXAD_F,-0.0326,0.040,-0.819,0.413,-0.111,0.045
HEXAD_A,0.0695,0.045,1.537,0.124,-0.019,0.158
HEXAD_D,-0.0702,0.031,-2.234,0.025,-0.132,-0.009
HEXAD_R,0.0625,0.040,1.578,0.115,-0.015,0.140
alpha,0.4778,0.029,16.368,0.000,0.421,0.535


## Weeks of activity

This model examines how students’ HEXAD traits and prior knowledge relate to the number of weeks they were active during the course. It helps identify which traits are associated with sustained engagement across multiple weeks.

In [31]:
is_over, mean_val, var_val = nb_models.check_overdispersion(df, count_col="n_weeks_active")
print(f"Mean: {mean_val:.2f}, Variance: {var_val:.2f}")
if is_over:
    print("➡  Potential overdispersion detected (variance > mean)")

Mean: 5.56, Variance: 11.58
➡  Potential overdispersion detected (variance > mean)


In [32]:
model_name = "weeks_of_activity_main"
formula = nb_models.get_nb_formula_by_name(model_name)
model = nb_models.fit_negative_binomial(df, formula)

Optimization terminated successfully.
         Current function value: 2.505746
         Iterations: 15
         Function evaluations: 18
         Gradient evaluations: 18


In [33]:
formula

'n_weeks_active ~ pre_test + HEXAD_P + HEXAD_S + HEXAD_F + HEXAD_A + HEXAD_D + HEXAD_R'

In [34]:
model.summary()

Dep. Variable:,n_weeks_active,No. Observations:,752.0
Model:,NegativeBinomial,Df Residuals:,744.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 23 May 2025",Pseudo R-squ.:,0.02542
Time:,17:10:17,Log-Likelihood:,-1884.3
converged:,True,LL-Null:,-1933.5
Covariance Type:,nonrobust,LLR p-value:,2.441e-18

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,1.6677,0.022,76.198,0.000,1.625,1.711
pre_test,-0.1884,0.022,-8.594,0.000,-0.231,-0.145
HEXAD_P,0.0054,0.032,0.169,0.866,-0.057,0.068
HEXAD_S,0.0097,0.028,0.349,0.727,-0.045,0.064
HEXAD_F,-0.0161,0.030,-0.533,0.594,-0.075,0.043
HEXAD_A,0.0472,0.034,1.371,0.171,-0.020,0.115
HEXAD_D,-0.0667,0.024,-2.751,0.006,-0.114,-0.019
HEXAD_R,0.0499,0.030,1.637,0.102,-0.010,0.110
alpha,0.1570,0.018,8.584,0.000,0.121,0.193
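
The alpha row in each negative binomial table is the dispersion parameter. The sketch below shows what it implies for the variance, assuming the fit uses the NB2 parameterization (the statsmodels default for NegativeBinomial), where the variance is mu + alpha * mu^2. The settings used inside `fit_negative_binomial` are not shown, so treat this as an illustration; the function name is hypothetical.

```python
import math


def nb2_variance(mu: float, alpha: float) -> float:
    """Variance implied by an NB2 model: mu + alpha * mu**2."""
    return mu + alpha * mu ** 2


# Values copied from the weeks-of-activity summary above.
intercept = 1.6677
alpha = 0.1570

# Expected count for a student at the mean of all standardized predictors.
mu = math.exp(intercept)
var = nb2_variance(mu, alpha)

print(f"expected weeks active: {mu:.2f}")
print(f"implied variance:      {var:.2f} (Poisson would imply {mu:.2f})")
```

Under these assumptions the model expects about 5.3 active weeks for an average student, with an implied variance of about 9.7. An alpha of 0 would reduce the model to Poisson, so the clearly positive estimate is consistent with the overdispersion check above.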
