**Building a Temporal Early Warning System for Student Performance**


**Problem Description**

We have been tasked with developing an early warning system using data provided by the Open University.

This is a temporal prediction problem: at week $t$ we need to make a prediction about each student's final performance.

Key Ideas:

- Each student produces **multiple time-indexed snapshots** (week 0, 1, 2, ...).
- Each snapshot contains features accumulated up to that week, i.e. a signal built from some good weeks and some bad weeks.
- We train the model to read this signal of good and bad weeks.
- The model makes its prediction from the historical signal up to that point.
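
As a minimal sketch of how such snapshots could be built (the toy `weekly` frame and its `clicks` column are made up for illustration and are not the actual feature store), a per-student cumulative sum turns week-by-week activity into one cumulative row per student per week:

```python
import pandas as pd

# Hypothetical weekly activity log: one row per student per week.
weekly = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2, 2],
    "week":       [0, 1, 2, 0, 1, 2],
    "clicks":     [5, 0, 7, 2, 3, 0],
})

# Sort so the running total follows week order within each student.
weekly = weekly.sort_values(["student_id", "week"]).reset_index(drop=True)

# Cumulative feature: total clicks up to and including each week.
weekly["cum_clicks"] = weekly.groupby("student_id")["clicks"].cumsum()

print(weekly)
```

A quiet week (zero clicks) leaves the cumulative value flat, so the shape of the running total over time is what carries the "good week / bad week" signal.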


---
Example:

**Imports**

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np

from sklearn.model_selection import GroupShuffleSplit
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    classification_report, confusion_matrix
)

**Data Loader**

In [4]:
ROOT = Path.cwd().parent  # if notebook is in notebooks/
DATA_PATH = ROOT / "data" / "processed" / "ews_feature_store.csv"

df = pd.read_csv(DATA_PATH)
df.shape, df.head()


((627031, 80),
    student_id code_module code_presentation  week  cum_weekly_clicks_dataplus  \
 0        6516         AAA             2014J    -4                           0   
 1        6516         AAA             2014J    -3                           0   
 2        6516         AAA             2014J    -2                           0   
 3        6516         AAA             2014J    -1                           0   
 4        6516         AAA             2014J     0                           0   
 
    cum_weekly_clicks_dualpane  cum_weekly_clicks_externalquiz  \
 0                           0                               0   
 1                           0                               0   
 2                           0                               0   
 3                           0                               0   
 4                           0                               0   
 
    cum_weekly_clicks_folder  cum_weekly_clicks_forumng  \
 0                         0      

**Datashape**

Each row represents **one student at one week** (a snapshot of that student's performance in time).

So a single student appears multiple times:

- week 0 row: state after week 0 
- week 1 row: state after week 1
- week 2 row: state after week 2 

... and so on

These rows contain the **cumulative features** e.g. total clicks so far, total VLE time so far etc.


In [5]:
# Basic checks
print(df["student_id"].nunique(), "unique students")
print(df["week"].min(), "to", df["week"].max(), "weeks")

df[["student_id", "week"]].head(10)

26074 unique students
-4 to 38 weeks


   student_id  week
0        6516    -4
1        6516    -3
2        6516    -2
3        6516    -1
4        6516     0
5        6516     1
6        6516     2
7        6516     3
8        6516     4
9        6516     5


**Making predictions**

First we choose a prediction week $t$.

If $t = 3$, we are asking: "At week 3, can we predict a final pass/fail?"

We then filter our dataset to include only snapshots up to week $t$.


***(At the moment this is a bit buggy: rows from before week 0 are still included and need to be removed.)***
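
One possible way to drop the pre-start rows, sketched on a toy frame (the helper name `snapshots_up_to` is hypothetical and not part of this notebook), is to bound the week filter on both sides:

```python
import pandas as pd

def snapshots_up_to(frame: pd.DataFrame, pred_week: int, first_week: int = 0) -> pd.DataFrame:
    """Keep rows with first_week <= week <= pred_week (both ends inclusive)."""
    mask = frame["week"].between(first_week, pred_week)
    return frame[mask].copy()

# Toy frame: one student with weeks -4 .. 3, mirroring the week range seen in the data.
toy = pd.DataFrame({"student_id": [1] * 8, "week": range(-4, 4)})

filtered = snapshots_up_to(toy, pred_week=3)
print(filtered["week"].tolist())
```

Whether pre-start activity should be discarded or folded into the week 0 snapshot is a modelling choice; this sketch only shows the discard option.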

In [6]:
PRED_WEEK = 3

df_w = df[df["week"] <= PRED_WEEK].copy()

df_w.shape, df_w[["week", "target_pass"]].head()


((140376, 80),
    week  target_pass
 0    -4            1
 1    -3            1
 2    -2            1
 3    -1            1
 4     0            1)

**Splitting the Groups**

Because each student appears multiple times, we **must not** split randomly.

If we split randomly:

- The same student could appear in both train and test sets
- The model would indirectly "recognise" the student
- Evaluation would be inflated (leakage)

So we split by `student_id` using `GroupShuffleSplit`.

In [7]:
# Separate features/target/groups
X = df_w.drop(columns=["target_pass"])
y = df_w["target_pass"].astype(int)
groups = df_w["student_id"]

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

print("Train rows:", X_train.shape, "| Test rows:", X_test.shape)
print("Train students:", X_train["student_id"].nunique(), "| Test students:", X_test["student_id"].nunique())


Train rows: (112726, 79) | Test rows: (27650, 79)
Train students: 20553 | Test students: 5139
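
The split can be sanity-checked by confirming that no `student_id` lands on both sides. A minimal sketch on toy data (the arrays below are synthetic stand-ins, not the feature store):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 6 students with 3 weekly rows each.
groups = np.repeat(np.arange(6), 3)
X = np.zeros((len(groups), 2))
y = np.zeros(len(groups), dtype=int)

gss = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

train_students = set(groups[train_idx])
test_students = set(groups[test_idx])

# No student should appear on both sides of the split.
overlap = train_students & test_students
print("Overlapping students:", len(overlap))
```

The same set intersection could be applied to `X_train["student_id"]` and `X_test["student_id"]` from the cell above.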


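Next we preprocess the features: missing values are imputed (median for numeric columns, most frequent value for categorical columns) and categorical columns are one-hot encoded.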

In [8]:
# Identify categorical/numeric columns
# (numeric identifiers such as student_id and week are still in X, so they land in num_cols)
cat_cols = X_train.select_dtypes(include=["object"]).columns.tolist()
num_cols = [c for c in X_train.columns if c not in cat_cols]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
        ]), num_cols),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ],
    remainder="drop",
)


**Then we can Train**

In [9]:
logreg = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=10000, class_weight="balanced"))
])

logreg.fit(X_train, y_train)

proba = logreg.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)

print("ROC-AUC:", roc_auc_score(y_test, proba))
print("PR-AUC :", average_precision_score(y_test, proba))
print("\nConfusion matrix (row-level):\n", confusion_matrix(y_test, pred))
print("\nReport:\n", classification_report(y_test, pred, digits=3))


ROC-AUC: 0.9977970644408174
PR-AUC : 0.9985081370324104

Confusion matrix (row-level):
 [[11371   205]
 [  343 15731]]

Report:
               precision    recall  f1-score   support

           0      0.971     0.982     0.976     11576
           1      0.987     0.979     0.983     16074

    accuracy                          0.980     27650
   macro avg      0.979     0.980     0.980     27650
weighted avg      0.980     0.980     0.980     27650



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=10000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
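
The warning above points at feature scaling. A hedged sketch of one way to act on it, assuming a `StandardScaler` step is added after imputation (this is not part of the pipeline that produced the results shown here, and the data below are synthetic):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two features on very different scales.
X_toy = np.column_stack([
    rng.integers(0, 5000, size=n).astype(float),  # e.g. cumulative clicks
    rng.random(n),                                 # e.g. a ratio in [0, 1)
])
X_toy[::25, 0] = np.nan                            # a few missing values

# Noisy label so the two classes overlap (keeps the problem non-separable).
signal = np.nan_to_num(X_toy[:, 0], nan=2500.0)
y_toy = (signal + rng.normal(0, 1000, size=n) > 2500).astype(int)

# Numeric branch with a scaler added after imputation.
num_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),                  # zero mean, unit variance per column
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
num_pipeline.fit(X_toy, y_toy)

print("Iterations used:", num_pipeline.named_steps["model"].n_iter_[0])
```

In the notebook's `preprocess`, the equivalent change would be one extra `("scaler", StandardScaler())` step in the `num` pipeline; whether it removes the warning on the real data would need to be checked.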


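**Why is the confusion matrix so large?**

The confusion matrix looks inflated because we are evaluating **rows, NOT students**.

At `t=3`, each student contributes up to 4 rows from week 0 onwards (plus up to 4 pre-start rows, weeks -4 to -1, which the current filter still includes):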

- week 0 snapshot 
- week 1 snapshot 
- week 2 snapshot
- week 3 snapshot

So the confusion matrix is counting **student-week snapshots**.

This tells us:
> "Across all decision points up to week t, how well does the model classify?"

What we want to know is:
> "At week t, how well does the model flag students?"

***To do this, we evaluate the model only at the prediction point (one row per student).***
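
One variant of this idea, sketched on synthetic data, keeps the model trained on all snapshots of the training students and restricts only the test rows to `week == t` (the cell that follows instead re-fits the pipeline on week-`t` rows). The names `toy`, `PRED_WEEK_TOY` and `cum_clicks` are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GroupShuffleSplit

PRED_WEEK_TOY = 3
rng = np.random.default_rng(0)

# Synthetic snapshots: 50 students x weeks 0..3, one cumulative feature.
n_students = 50
toy = pd.DataFrame({
    "student_id": np.repeat(np.arange(n_students), PRED_WEEK_TOY + 1),
    "week": np.tile(np.arange(PRED_WEEK_TOY + 1), n_students),
})
sid = toy["student_id"].to_numpy()
rate = rng.uniform(0, 10, size=n_students)            # per-student activity level
noise = rng.normal(0, 2, size=n_students)
toy["cum_clicks"] = rate[sid] * (toy["week"].to_numpy() + 1)
toy["target_pass"] = (rate + noise > 5).astype(int)[sid]

features = ["week", "cum_clicks"]
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    gss.split(toy[features], toy["target_pass"], groups=toy["student_id"])
)
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

# Fit on every snapshot of the training students...
clf = LogisticRegression(max_iter=1000).fit(train[features], train["target_pass"])

# ...but score only the week-t row of each held-out student.
test_at_t = test[test["week"] == PRED_WEEK_TOY]
pred_at_t = clf.predict(test_at_t[features])
cm = confusion_matrix(test_at_t["target_pass"], pred_at_t, labels=[0, 1])

print(cm)
print("Rows scored:", len(test_at_t), "| students:", test_at_t["student_id"].nunique())
```

Here the confusion matrix total equals the number of held-out students, because each of them contributes exactly one row at week `t`.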


In [10]:
# Evaluate ONLY at the prediction horizon (one row per student)
df_eval = df_w[df_w["week"] == PRED_WEEK].copy()

X_eval = df_eval.drop(columns=["target_pass"])
y_eval = df_eval["target_pass"].astype(int)
groups_eval = df_eval["student_id"]

# Group split again at the student level (still important)
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X_eval, y_eval, groups=groups_eval))

X_train2, X_test2 = X_eval.iloc[train_idx], X_eval.iloc[test_idx]
y_train2, y_test2 = y_eval.iloc[train_idx], y_eval.iloc[test_idx]

# Re-fit the same pipeline on week-t rows only (this overwrites the row-level fit above)
logreg.fit(X_train2, y_train2)

proba2 = logreg.predict_proba(X_test2)[:, 1]
pred2 = (proba2 >= 0.5).astype(int)

print("Student-level evaluation at week t =", PRED_WEEK)
print("Test students:", X_test2["student_id"].nunique())
print("ROC-AUC:", roc_auc_score(y_test2, proba2))
print("PR-AUC :", average_precision_score(y_test2, proba2))
print("\nConfusion matrix (student-level):\n", confusion_matrix(y_test2, pred2))
print("\nReport:\n", classification_report(y_test2, pred2, digits=3))


Student-level evaluation at week t = 3
Test students: 4047
ROC-AUC: 0.9829461322990735
PR-AUC : 0.9852195697473014

Confusion matrix (student-level):
 [[1606   94]
 [ 164 2566]]

Report:
               precision    recall  f1-score   support

           0      0.907     0.945     0.926      1700
           1      0.965     0.940     0.952      2730

    accuracy                          0.942      4430
   macro avg      0.936     0.942     0.939      4430
weighted avg      0.943     0.942     0.942      4430



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=10000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
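
The output above reports 4047 unique test students, while the report's support column totals 4430 rows. This suggests that some students have more than one row at week 3, for example because they are enrolled in more than one module or presentation (an assumption, not verified here). A small sketch of how this could be checked, on a toy frame with made-up values:

```python
import pandas as pd

# Toy week-t slice: student 2 is enrolled in two modules.
df_eval_toy = pd.DataFrame({
    "student_id":        [1, 2, 2, 3],
    "code_module":       ["AAA", "AAA", "BBB", "AAA"],
    "code_presentation": ["2014J", "2014J", "2014J", "2014J"],
    "week":              [3, 3, 3, 3],
})

# Count rows per student at the prediction week.
rows_per_student = df_eval_toy.groupby("student_id").size()
multi = rows_per_student[rows_per_student > 1]
print("Students with more than one row at week t:", len(multi))

# Rows become unique once module and presentation are part of the key.
key = ["student_id", "code_module", "code_presentation"]
n_dupes = int(df_eval_toy.duplicated(subset=key).sum())
print("Duplicate enrolment keys:", n_dupes)
```

If the same pattern holds in `df_eval`, the "student-level" matrix is strictly one row per student enrolment rather than per student.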


## Summary

- We frame early warning as a **temporal** prediction problem.
- We choose a prediction horizon `t` (e.g., week 3).
- We train using **snapshots** of student state up to `t`, with cumulative features.
- We prevent leakage by splitting using **student_id groups**.
- Confusion matrices can be:
  - **Row-level** (student-week snapshots): larger sample size
  - **Student-level** (fixed horizon `week == t`): one prediction per student, matches real-world EWS decisions
