# FAIDM Group Project: Student Performance Prediction & Clustering
## Open University Learning Analytics Dataset (OULAD)

**Module:** WM9QG-15 Fundamentals of AI and Data Mining  
**Methodology:** CRISP-DM  
**Deadline:** W/C 26th January 2026  

---

### Project Tasks:
1. **Predictive Model (Supervised ML):** Predict students' final outcomes based on demographics, VLE engagement, and assessment data
2. **Clustering Model (Unsupervised ML):** Segment students into meaningful groups based on engagement patterns

---

## 1. Business Understanding (CRISP-DM Phase 1)

### 1.1 Problem Statement
The Open University wants to:
- **Identify at-risk students early** for intervention
- **Understand student engagement patterns** to inform teaching strategies
- **Predict final module outcomes** to enable personalised support

### 1.2 Success Criteria
- Build a predictive model with acceptable accuracy (target: >70% accuracy, or an AUC-ROC clearly above the 0.5 chance level)
- Identify actionable student segments that can inform intervention strategies
- Deliverables must follow CRISP-DM methodology and be presentation-ready
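
To judge whether the >70% accuracy target is demanding, it helps to compare it with the accuracy of always predicting the most frequent class. The sketch below is illustrative only: `majority_baseline_accuracy` is a hypothetical helper and the labels are made up, not OULAD results.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Illustrative labels only (not taken from the dataset)
toy_labels = ["Pass", "Pass", "Pass", "Fail", "Withdrawn"]
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

A model only adds value to the extent that it beats this baseline on the real target.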

### 1.3 Business Questions
1. Which students are at risk of failing or withdrawing?
2. What engagement patterns characterise successful vs struggling students?
3. Can we intervene early enough to make a difference?

## 2. Data Understanding (CRISP-DM Phase 2)

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
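# Silence only deprecation-style noise; keep other warnings (e.g. convergence) visible
warnings.filterwarnings('ignore', category=FutureWarning)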

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries loaded successfully!")

In [None]:
# Define data path - UPDATE THIS TO YOUR LOCAL PATH
DATA_PATH = Path('.')  # Change to your data directory

# Load all datasets
print("Loading datasets...\n")

student_info = pd.read_csv(DATA_PATH / 'studentInfo.csv')
student_assessment = pd.read_csv(DATA_PATH / 'studentAssessment.csv')
assessments = pd.read_csv(DATA_PATH / 'assessments.csv')
courses = pd.read_csv(DATA_PATH / 'courses.csv')
student_registration = pd.read_csv(DATA_PATH / 'studentRegistration.csv')
vle = pd.read_csv(DATA_PATH / 'vle.csv')
student_vle = pd.read_csv(DATA_PATH / 'studentVle.csv')

print("All datasets loaded successfully!")
print("\n" + "="*60)
print("DATASET OVERVIEW")
print("="*60)

In [None]:
# Display shape and basic info for each dataset
datasets = {
    'studentInfo': student_info,
    'studentAssessment': student_assessment,
    'assessments': assessments,
    'courses': courses,
    'studentRegistration': student_registration,
    'vle': vle,
    'studentVle': student_vle
}

for name, df in datasets.items():
    print(f"\n{name}:")
    print(f"  Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
    print(f"  Columns: {list(df.columns)}")

In [None]:
# Detailed look at studentInfo
print("="*60)
print("STUDENT INFO - DETAILED ANALYSIS")
print("="*60)
print(f"\nTotal student-module registrations: {len(student_info):,}")
print(f"Unique students: {student_info['id_student'].nunique():,}")
print(f"Unique modules: {student_info['code_module'].nunique()}")
print(f"Unique presentations: {student_info['code_presentation'].nunique()}")

print("\n--- Data Types ---")
print(student_info.dtypes)

In [None]:
# Missing values analysis
print("="*60)
print("MISSING VALUE ANALYSIS")
print("="*60)

for name, df in datasets.items():
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    if missing.any():
        print(f"\n{name}:")
        for col in missing[missing > 0].index:
            print(f"  {col}: {missing[col]:,} ({missing_pct[col]}%)")
    else:
        print(f"\n{name}: No missing values")
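
The same missing-value check can be wrapped in a small function that returns a table, which is easier to reuse in a report. This is a sketch on a toy frame; `missing_summary` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return one row per column that has missing values, with count and percentage."""
    missing = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for one of the OULAD tables
toy = pd.DataFrame({
    "score": [80.0, np.nan, 65.0, np.nan],
    "imd_band": ["0-10%", None, "20-30%", "30-40%"],
    "id_student": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```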

In [None]:
# Target variable analysis
print("="*60)
print("TARGET VARIABLE: final_result")
print("="*60)

result_counts = student_info['final_result'].value_counts()
result_pct = student_info['final_result'].value_counts(normalize=True) * 100

print("\nDistribution:")
for result in result_counts.index:
    print(f"  {result}: {result_counts[result]:,} ({result_pct[result]:.1f}%)")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

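# Map colours by label so each outcome keeps the same colour as in the later
# stacked-bar plots, regardless of the order value_counts() returns
result_colors = {'Pass': '#2ecc71', 'Distinction': '#3498db', 'Fail': '#e74c3c', 'Withdrawn': '#95a5a6'}
colors = [result_colors[r] for r in result_counts.index]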
result_counts.plot(kind='bar', ax=axes[0], color=colors, edgecolor='black')
axes[0].set_title('Student Final Results Distribution', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Final Result')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

axes[1].pie(result_counts, labels=result_counts.index, autopct='%1.1f%%', colors=colors, startangle=90)
axes[1].set_title('Final Results Proportion', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nNote: Class imbalance present - consider stratified sampling")
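
Stratified sampling, as suggested in the note above, can be checked on a toy target. The sketch assumes scikit-learn is available and uses invented data with an 80/20 class ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80% of class 1, 20% of class 0 (illustrative only)
y_toy = pd.Series([1] * 80 + [0] * 20)
X_toy = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both splits keep the class ratio of the full target
print(f"Train positive rate: {y_tr.mean():.2f}")
print(f"Test positive rate:  {y_te.mean():.2f}")
```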

In [None]:
# Categorical variables exploration
print("="*60)
print("CATEGORICAL VARIABLES")
print("="*60)

categorical_cols = ['gender', 'region', 'highest_education', 'imd_band', 'age_band', 'disability']

for col in categorical_cols:
    print(f"\n{col}:")
    print(student_info[col].value_counts())

In [None]:
# Visualize demographics
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, col in enumerate(categorical_cols):
    student_info[col].value_counts().plot(kind='bar', ax=axes[i], color='steelblue', edgecolor='black')
    axes[i].set_title(f'{col} Distribution', fontsize=11, fontweight='bold')
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].set_ylabel('Count')

plt.tight_layout()
plt.show()

In [None]:
# Final result by demographics
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

for i, col in enumerate(categorical_cols):
    ct = pd.crosstab(student_info[col], student_info['final_result'], normalize='index') * 100
    ct[['Pass', 'Distinction', 'Fail', 'Withdrawn']].plot(kind='bar', stacked=True, ax=axes[i],
                                                          color=['#2ecc71', '#3498db', '#e74c3c', '#95a5a6'])
    axes[i].set_title(f'Final Result by {col}', fontsize=11, fontweight='bold')
    axes[i].set_ylabel('Percentage')
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].legend(loc='upper right', fontsize=8)

plt.tight_layout()
plt.show()

In [None]:
# VLE Engagement Analysis
print("="*60)
print("VLE ENGAGEMENT ANALYSIS")
print("="*60)

print(f"\nTotal VLE interaction records: {len(student_vle):,}")
print(f"Total clicks recorded: {student_vle['sum_click'].sum():,}")
print(f"Unique students with VLE activity: {student_vle['id_student'].nunique():,}")
print(f"Date range: {student_vle['date'].min()} to {student_vle['date'].max()}")
print("  (negative = before course start)")

print("\n--- Click Statistics ---")
print(student_vle['sum_click'].describe())

In [None]:
# VLE Activity Types
print("\n--- VLE Activity Types ---")
print(vle['activity_type'].value_counts())

# Clicks by activity type
vle_with_type = student_vle.merge(vle[['id_site', 'activity_type']], on='id_site', how='left')
clicks_by_type = vle_with_type.groupby('activity_type')['sum_click'].sum().sort_values(ascending=False)

print("\n--- Total Clicks by Activity Type ---")
print(clicks_by_type)

In [None]:
# Assessment analysis
print("="*60)
print("ASSESSMENT DATA ANALYSIS")
print("="*60)

print(f"\nTotal assessments in catalog: {len(assessments)}")
print(f"Total submissions: {len(student_assessment):,}")

print("\nAssessment types:")
print(assessments['assessment_type'].value_counts())

print("\n--- Score Statistics ---")
print(student_assessment['score'].describe())

missing_scores = student_assessment['score'].isnull().sum()
print(f"\nMissing scores: {missing_scores:,} ({missing_scores/len(student_assessment)*100:.2f}%)")

## 3. Data Preparation (CRISP-DM Phase 3)

In [None]:
# Build unified dataset
print("="*60)
print("BUILDING UNIFIED DATASET")
print("="*60)

df = student_info.copy()

# Create unique key
df['student_module_key'] = (df['id_student'].astype(str) + '_' + 
                            df['code_module'] + '_' + df['code_presentation'])

print(f"Base dataset: {df.shape}")

In [None]:
# Merge course info
df = df.merge(courses, on=['code_module', 'code_presentation'], how='left')
print(f"After courses: {df.shape}")
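
Printing the shape after each merge catches row duplication only by eye. pandas can enforce the expected key relationship through the `validate` argument of `merge`. The sketch below uses toy tables; the presentation lengths are invented.

```python
import pandas as pd

left = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2014J", "2013J"],
    "id_student": [1, 2, 3],
})
right = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2014J", "2013J"],
    "module_presentation_length": [268, 269, 268],  # invented values
})

# validate="many_to_one" raises pandas.errors.MergeError if the right-hand
# keys are not unique, which would otherwise silently duplicate rows.
merged = left.merge(right, on=["code_module", "code_presentation"],
                    how="left", validate="many_to_one")
print(f"Rows before: {len(left)}, after: {len(merged)}")

# A right table with a duplicated key is rejected
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    left.merge(right_dup, on=["code_module", "code_presentation"],
               how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(f"Duplicate key detected: {duplicate_detected}")
```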

In [None]:
# Merge registration and create features
df = df.merge(student_registration, on=['code_module', 'code_presentation', 'id_student'], how='left')
print(f"After registration: {df.shape}")

df['registered_early'] = (df['date_registration'] < 0).astype(int)
df['days_before_start'] = df['date_registration'].apply(lambda x: abs(x) if x < 0 else 0)
df['withdrew'] = df['date_unregistration'].notna().astype(int)

In [None]:
# Assessment features
print("\n--- Creating Assessment Features ---")

assessments_with_meta = student_assessment.merge(assessments, on='id_assessment', how='left')
assessments_with_meta['student_module_key'] = (assessments_with_meta['id_student'].astype(str) + '_' + 
                                                assessments_with_meta['code_module'] + '_' + 
                                                assessments_with_meta['code_presentation'])

# Aggregate
assessment_features = assessments_with_meta.groupby('student_module_key').agg({
    'score': ['mean', 'std', 'min', 'max', 'count'],
    'date_submitted': ['mean', 'std'],
    'is_banked': 'sum'
}).reset_index()

assessment_features.columns = ['student_module_key', 'avg_score', 'score_std', 'min_score', 'max_score', 
                               'num_assessments_submitted', 'avg_submission_day', 'submission_day_std', 
                               'num_banked']

# Timeliness
assessments_with_meta['days_early'] = assessments_with_meta['date'] - assessments_with_meta['date_submitted']
timeliness = assessments_with_meta.groupby('student_module_key').agg({
    'days_early': ['mean', 'min']
}).reset_index()
timeliness.columns = ['student_module_key', 'avg_days_early', 'worst_days_early']

assessment_features = assessment_features.merge(timeliness, on='student_module_key', how='left')

print(f"Assessment features: {assessment_features.shape}")

In [None]:
# Merge assessment features
df = df.merge(assessment_features, on='student_module_key', how='left')
print(f"After assessments: {df.shape}")

In [None]:
# Assessment type-specific features
print("\n--- Assessment Type Features ---")

for atype in ['TMA', 'CMA', 'Exam']:
    type_data = assessments_with_meta[assessments_with_meta['assessment_type'] == atype]
    type_scores = type_data.groupby('student_module_key').agg({'score': ['mean', 'count']}).reset_index()
    type_scores.columns = ['student_module_key', f'{atype.lower()}_avg_score', f'{atype.lower()}_count']
    df = df.merge(type_scores, on='student_module_key', how='left')
    print(f"  Added {atype} features")

print(f"After assessment types: {df.shape}")

In [None]:
# VLE Engagement Features
print("\n" + "="*60)
print("CREATING VLE ENGAGEMENT FEATURES")
print("="*60)

student_vle['student_module_key'] = (student_vle['id_student'].astype(str) + '_' + 
                                      student_vle['code_module'] + '_' + 
                                      student_vle['code_presentation'])

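# Basic engagement
# Note: each studentVle row is one student/resource/day, so the 'mean' and 'max' of sum_click
# below are per-record values, not totals per calendar day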
vle_features = student_vle.groupby('student_module_key').agg({
    'sum_click': ['sum', 'mean', 'std', 'max'],
    'date': ['min', 'max', 'nunique'],
    'id_site': 'nunique'
}).reset_index()

vle_features.columns = ['student_module_key', 'total_clicks', 'avg_daily_clicks', 
                        'click_std', 'max_daily_clicks', 'first_access_day', 
                        'last_access_day', 'active_days', 'unique_resources']

vle_features['engagement_span'] = vle_features['last_access_day'] - vle_features['first_access_day']
vle_features['clicks_per_active_day'] = vle_features['total_clicks'] / vle_features['active_days'].replace(0, 1)

print(f"VLE features: {vle_features.shape}")
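
If genuine per-day figures are wanted, one option is to total clicks per student per day first and then summarise those totals. This is a sketch on toy rows; the `true_*` column names are hypothetical and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy studentVle-style rows: one row per student, resource and day
toy_vle = pd.DataFrame({
    "student_module_key": ["1_AAA_2013J"] * 4 + ["2_AAA_2013J"] * 2,
    "id_site": [10, 11, 10, 12, 10, 11],
    "date": [0, 0, 1, 1, 0, 3],
    "sum_click": [2, 4, 1, 3, 5, 1],
})

# Step 1: collapse to one total per student per day
daily = (toy_vle.groupby(["student_module_key", "date"])["sum_click"]
         .sum()
         .rename("daily_clicks")
         .reset_index())

# Step 2: summarise the daily totals per student
daily_features = (daily.groupby("student_module_key")["daily_clicks"]
                  .agg(true_avg_daily_clicks="mean", true_max_daily_clicks="max")
                  .reset_index())
print(daily_features)
```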

In [None]:
# Activity type clicks
print("\n--- Activity Type Click Features ---")

vle_with_type = student_vle.merge(vle[['id_site', 'activity_type']], on='id_site', how='left')
activity_clicks = vle_with_type.groupby(['student_module_key', 'activity_type'])['sum_click'].sum().unstack(fill_value=0)
activity_clicks = activity_clicks.add_prefix('clicks_').reset_index()

print(f"Activity types: {activity_clicks.shape[1] - 1}")

vle_features = vle_features.merge(activity_clicks, on='student_module_key', how='left')

In [None]:
# Early engagement (all activity up to day 14, including pre-course activity)
print("\n--- Early Engagement Features ---")

early_vle = student_vle[student_vle['date'] <= 14]
early_engagement = early_vle.groupby('student_module_key').agg({
    'sum_click': 'sum',
    'date': 'nunique',
    'id_site': 'nunique'
}).reset_index()
early_engagement.columns = ['student_module_key', 'early_clicks', 'early_active_days', 'early_resources']

vle_features = vle_features.merge(early_engagement, on='student_module_key', how='left')

# Pre-course engagement
pre_course = student_vle[student_vle['date'] < 0]
pre_engagement = pre_course.groupby('student_module_key')['sum_click'].sum().reset_index()
pre_engagement.columns = ['student_module_key', 'pre_course_clicks']

vle_features = vle_features.merge(pre_engagement, on='student_module_key', how='left')

print(f"Final VLE features: {vle_features.shape}")
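
Because the `date <= 14` filter also admits negative (pre-course) days, a window with an explicit start and end makes the intent clearer and can be reused for other cut-off points. This is a sketch; `engagement_window` and its output column names are hypothetical.

```python
import pandas as pd

def engagement_window(vle_rows, start_day, end_day):
    """Clicks and active days per student within [start_day, end_day] inclusive."""
    window = vle_rows[vle_rows["date"].between(start_day, end_day)]
    return (window.groupby("student_module_key")
            .agg(window_clicks=("sum_click", "sum"),
                 window_active_days=("date", "nunique"))
            .reset_index())

toy_vle = pd.DataFrame({
    "student_module_key": ["1_AAA_2013J"] * 4,
    "date": [-5, 0, 7, 30],
    "sum_click": [3, 2, 4, 10],
})

# Days 0-14 only: the pre-course row (day -5) and the late row (day 30) are excluded
first_two_weeks = engagement_window(toy_vle, start_day=0, end_day=14)
print(first_two_weeks)
```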

In [None]:
# Merge VLE features
df = df.merge(vle_features, on='student_module_key', how='left')
print(f"After VLE features: {df.shape}")

In [None]:
# Data cleaning
print("\n" + "="*60)
print("DATA CLEANING")
print("="*60)

df['imd_band'] = df['imd_band'].fillna('Unknown')

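# Zero-fill numeric gaps: students with no VLE or assessment records get 0 for every
# derived feature, including avg_score
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(0)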

print(f"Missing values after cleaning: {df.isnull().sum().sum()}")

In [None]:
# Encode categorical variables
from sklearn.preprocessing import LabelEncoder

print("\n--- Encoding Categorical Variables ---")

df_encoded = df.copy()
label_encoders = {}

for col in ['gender', 'region', 'disability', 'code_module', 'code_presentation']:
    le = LabelEncoder()
    df_encoded[col + '_encoded'] = le.fit_transform(df_encoded[col])
    label_encoders[col] = le
    print(f"  Encoded {col}: {len(le.classes_)} categories")

# Ordinal encode education and age
education_order = ['No Formal quals', 'Lower Than A Level', 'A Level or Equivalent', 
                   'HE Qualification', 'Post Graduate Qualification']
df_encoded['education_level'] = df_encoded['highest_education'].apply(
    lambda x: education_order.index(x) if x in education_order else -1)

age_order = ['0-35', '35-55', '55<=']
df_encoded['age_level'] = df_encoded['age_band'].apply(
    lambda x: age_order.index(x) if x in age_order else -1)
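
`imd_band` is filled with 'Unknown' during cleaning but is not encoded here, so it never reaches the models. One possible ordinal encoding is sketched below. It assumes band labels of the form '0-10%' and tolerates a missing percent sign; `imd_lower_decile` is a hypothetical helper.

```python
import re
import pandas as pd

def imd_lower_decile(band):
    """Map an IMD band such as '20-30%' to its decile index (2), or -1 if unparseable."""
    match = re.match(r"\s*(\d+)\s*-", str(band))
    return int(match.group(1)) // 10 if match else -1

# Toy values; 'Unknown' mirrors the fill value used in the cleaning step
toy_bands = pd.Series(["0-10%", "10-20", "90-100%", "Unknown"])
imd_level = toy_bands.map(imd_lower_decile)
print(imd_level.tolist())
```

Using -1 for unparseable bands matches the convention already applied to `education_level` and `age_level`.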

In [None]:
# Create target variables
print("\n--- Target Variables ---")

df_encoded['target_binary'] = df_encoded['final_result'].apply(
    lambda x: 1 if x in ['Pass', 'Distinction'] else 0)

result_mapping = {'Pass': 2, 'Distinction': 3, 'Fail': 1, 'Withdrawn': 0}
df_encoded['target_multiclass'] = df_encoded['final_result'].map(result_mapping)

print("\nBinary (1=Pass/Distinction, 0=Fail/Withdrawn):")
print(df_encoded['target_binary'].value_counts())

In [None]:
# Final dataset
print("\n" + "="*60)
print("FINAL PREPARED DATASET")
print("="*60)
print(f"\nShape: {df_encoded.shape}")
print(f"\nColumns: {len(df_encoded.columns)}")

In [None]:
# Save
df_encoded.to_csv('prepared_student_data.csv', index=False)
print("Saved to 'prepared_student_data.csv'")

## 4. Modelling (CRISP-DM Phase 4)

### 4.1 Predictive Model

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score, 
                             precision_score, recall_score, f1_score, roc_auc_score, roc_curve)

print("ML libraries loaded!")

In [None]:
# Select features
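# 'withdrew' is derived from date_unregistration and mirrors the 'Withdrawn' outcome,
# so it is excluded to avoid target leakage
exclude_cols = ['id_student', 'student_module_key', 'final_result', 'target_binary', 
                'target_multiclass', 'code_module', 'code_presentation', 'gender', 
                'region', 'highest_education', 'imd_band', 'age_band', 'disability',
                'date_registration', 'date_unregistration', 'withdrew']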

feature_cols = [col for col in df_encoded.columns if col not in exclude_cols 
               and df_encoded[col].dtype in ['int64', 'float64', 'int32', 'float32']]

print(f"Features: {len(feature_cols)}")
print(feature_cols)
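
Features derived from the outcome itself (for example a flag built from the unregistration date) can inflate test scores. A quick screen is to check how well each binary feature alone reproduces the target. This is a sketch on invented data; `single_feature_agreement` is a hypothetical helper, and what counts as suspiciously high is a judgement call.

```python
import pandas as pd

def single_feature_agreement(frame, feature, target):
    """Share of rows where a binary feature matches the target or its complement."""
    agree = (frame[feature] == frame[target]).mean()
    return max(agree, 1 - agree)

# Invented rows for illustration
toy = pd.DataFrame({
    "withdrew":       [1, 1, 0, 0, 0, 0],
    "gender_encoded": [0, 1, 0, 1, 1, 0],
    "target_binary":  [0, 0, 1, 1, 1, 0],
})

agreement = {col: single_feature_agreement(toy, col, "target_binary")
             for col in ["withdrew", "gender_encoded"]}
for col, value in agreement.items():
    print(f"{col}: {value:.2f}")
```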

In [None]:
# Prepare data
X = df_encoded[feature_cols].copy()
y = df_encoded['target_binary'].copy()

X = X.replace([np.inf, -np.inf], np.nan).fillna(0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Train: {X_train.shape[0]:,}")
print(f"Test: {X_test.shape[0]:,}")

In [None]:
# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
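
Fitting the scaler on the training split only, as above, avoids leaking test statistics. The same guarantee can be kept under cross-validation by putting the scaler and the model into one scikit-learn pipeline. The sketch below runs on synthetic data rather than the prepared OULAD features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# The scaler is re-fitted inside each training fold, so no fold is scaled
# with statistics computed from its own validation rows.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"AUC-ROC: {scores.mean():.3f} +/- {scores.std():.3f}")
```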

In [None]:
# Train models
print("="*60)
print("MODEL TRAINING")
print("="*60)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

results = {}

for name, model in models.items():
    print(f"\n--- {name} ---")
    
    if 'Logistic' in name:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_proba = model.predict_proba(X_test)[:, 1]
    
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'AUC-ROC': roc_auc_score(y_test, y_proba)
    }
    
    for metric, value in results[name].items():
        print(f"  {metric}: {value:.4f}")

In [None]:
# Results comparison
results_df = pd.DataFrame(results).T
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)
print(results_df.round(4))

results_df.plot(kind='bar', figsize=(12, 5))
plt.title('Model Performance Comparison', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()

In [None]:
# Random Forest Analysis
print("="*60)
print("RANDOM FOREST - DETAILED")
print("="*60)

rf_model = models['Random Forest']
y_pred_rf = rf_model.predict(X_test)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['Fail/Withdrawn', 'Pass/Distinction']))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Fail/Withdrawn', 'Pass/Distinction'],
            yticklabels=['Fail/Withdrawn', 'Pass/Distinction'])
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()
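
`precision_score` and `recall_score` default to `pos_label=1`, which here is Pass/Distinction. Since the business goal is finding at-risk students, the class 0 figures are arguably more relevant, and they can be read straight off the confusion matrix. The counts below are invented.

```python
import numpy as np

# Rows = actual, columns = predicted; index 0 = Fail/Withdrawn, 1 = Pass/Distinction.
# Counts are invented for illustration.
cm_demo = np.array([[40, 10],
                    [5, 45]])

true_at_risk = cm_demo[0, 0]     # at-risk students correctly flagged
missed_at_risk = cm_demo[0, 1]   # at-risk students predicted to succeed
false_alarms = cm_demo[1, 0]     # successful students flagged as at-risk

at_risk_recall = true_at_risk / (true_at_risk + missed_at_risk)
at_risk_precision = true_at_risk / (true_at_risk + false_alarms)
print(f"At-risk recall: {at_risk_recall:.2f}, precision: {at_risk_precision:.2f}")
```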

In [None]:
# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 20 Features:")
print(feature_importance.head(20).to_string(index=False))

plt.figure(figsize=(10, 10))
top_20 = feature_importance.head(20)
plt.barh(range(len(top_20)), top_20['importance'], color='steelblue')
plt.yticks(range(len(top_20)), top_20['feature'])
plt.xlabel('Importance')
plt.title('Feature Importance (Top 20)', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
# ROC Curves
plt.figure(figsize=(10, 8))

for name, model in models.items():
    if 'Logistic' in name:
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        y_proba = model.predict_proba(X_test)[:, 1]
    
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})', linewidth=2)

plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves', fontsize=14, fontweight='bold')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Hyperparameter tuning
print("="*60)
print("HYPERPARAMETER TUNING")
print("="*60)

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

rf_tuned = RandomForestClassifier(random_state=42, n_jobs=-1)
grid_search = GridSearchCV(rf_tuned, param_grid, cv=5, scoring='roc_auc', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print(f"\nBest params: {grid_search.best_params_}")
print(f"Best CV AUC: {grid_search.best_score_:.4f}")

best_rf = grid_search.best_estimator_
y_pred_tuned = best_rf.predict(X_test)
y_proba_tuned = best_rf.predict_proba(X_test)[:, 1]

print(f"\nTuned Test Accuracy: {accuracy_score(y_test, y_pred_tuned):.4f}")
print(f"Tuned Test AUC-ROC: {roc_auc_score(y_test, y_proba_tuned):.4f}")

### 4.2 Clustering Model

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.decomposition import PCA

In [None]:
# Clustering features (engagement-focused)
clustering_features = [
    'total_clicks', 'avg_daily_clicks', 'active_days', 'unique_resources',
    'early_clicks', 'early_active_days', 'early_resources', 'pre_course_clicks',
    'avg_score', 'num_assessments_submitted', 'avg_days_early',
    'clicks_per_active_day', 'engagement_span'
]

clustering_features = [f for f in clustering_features if f in df_encoded.columns]
print(f"Clustering features: {len(clustering_features)}")

In [None]:
# Prepare
X_cluster = df_encoded[clustering_features].copy()
X_cluster = X_cluster.replace([np.inf, -np.inf], np.nan).fillna(0)

scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

print(f"Clustering data: {X_cluster_scaled.shape}")

In [None]:
# Find optimal k
print("="*60)
print("FINDING OPTIMAL K")
print("="*60)

K_range = range(2, 11)
inertias = []
silhouette_scores = []
davies_bouldin_scores = []

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_cluster_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_cluster_scaled, kmeans.labels_))
    davies_bouldin_scores.append(davies_bouldin_score(X_cluster_scaled, kmeans.labels_))
    print(f"k={k}: Silhouette={silhouette_scores[-1]:.4f}, DB={davies_bouldin_scores[-1]:.4f}")
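
Once the scores are collected, the k with the highest silhouette can be picked programmatically as a starting point. This is a sketch with invented scores; `pick_k_by_silhouette` is a hypothetical helper, and the final choice should still weigh the elbow plot and how interpretable the clusters are.

```python
def pick_k_by_silhouette(k_values, scores):
    """Return the k whose silhouette score is highest (the first one on ties)."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return list(k_values)[best_index]

# Invented scores for illustration; substitute the lists built in the loop
demo_k = range(2, 7)
demo_silhouette = [0.31, 0.35, 0.42, 0.38, 0.30]
suggested_k = pick_k_by_silhouette(demo_k, demo_silhouette)
print(f"Suggested k: {suggested_k}")
```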

In [None]:
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(K_range, inertias, 'bo-', linewidth=2)
axes[0].set_xlabel('k')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method', fontsize=12, fontweight='bold')

axes[1].plot(K_range, silhouette_scores, 'go-', linewidth=2)
axes[1].set_xlabel('k')
axes[1].set_ylabel('Silhouette')
axes[1].set_title('Silhouette (higher=better)', fontsize=12, fontweight='bold')

axes[2].plot(K_range, davies_bouldin_scores, 'ro-', linewidth=2)
axes[2].set_xlabel('k')
axes[2].set_ylabel('Davies-Bouldin')
axes[2].set_title('Davies-Bouldin (lower=better)', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Final K-Means
OPTIMAL_K = 4  # Adjust based on above

print(f"\nFitting K-Means with k={OPTIMAL_K}")
kmeans_final = KMeans(n_clusters=OPTIMAL_K, random_state=42, n_init=10)
cluster_labels = kmeans_final.fit_predict(X_cluster_scaled)

df_encoded['cluster'] = cluster_labels

print(f"\nCluster distribution:")
print(df_encoded['cluster'].value_counts().sort_index())

In [None]:
# Cluster profiling
print("="*60)
print("CLUSTER PROFILES")
print("="*60)

cluster_profiles = df_encoded.groupby('cluster')[clustering_features].mean()
print(cluster_profiles.round(2).T)
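
Raw cluster means are hard to compare across features with very different scales. Expressing each cluster mean as a z-score against the overall distribution can make the profiles easier to label. The sketch uses toy data and an illustrative feature subset.

```python
import pandas as pd

# Toy data for illustration
toy = pd.DataFrame({
    "cluster":      [0, 0, 1, 1],
    "total_clicks": [100, 200, 900, 1000],
    "avg_score":    [50, 60, 80, 90],
})
features = ["total_clicks", "avg_score"]

cluster_means = toy.groupby("cluster")[features].mean()
# z-score of each cluster mean relative to the overall mean and std of the feature
relative_profile = (cluster_means - toy[features].mean()) / toy[features].std()
print(relative_profile.round(2))
```

Positive values mark features where a cluster sits above the overall average, negative values below it.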

In [None]:
# Cluster vs outcome
print("\n--- Cluster vs Final Result ---")
cluster_outcome = pd.crosstab(df_encoded['cluster'], df_encoded['final_result'], normalize='index') * 100
print(cluster_outcome.round(1))

success_rate = df_encoded.groupby('cluster')['target_binary'].mean() * 100
print("\nSuccess Rate per Cluster:")
for c, rate in success_rate.items():
    print(f"  Cluster {c}: {rate:.1f}%")

In [None]:
# Visualize cluster outcomes
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

cluster_outcome[['Pass', 'Distinction', 'Fail', 'Withdrawn']].plot(
    kind='bar', stacked=True, ax=axes[0],
    color=['#2ecc71', '#3498db', '#e74c3c', '#95a5a6'])
axes[0].set_title('Final Result by Cluster', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Cluster')
axes[0].set_ylabel('Percentage')
axes[0].tick_params(axis='x', rotation=0)

success_rate.plot(kind='bar', ax=axes[1], color='steelblue', edgecolor='black')
axes[1].set_title('Success Rate by Cluster', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Cluster')
axes[1].set_ylabel('Success Rate (%)')
axes[1].tick_params(axis='x', rotation=0)
axes[1].axhline(y=df_encoded['target_binary'].mean()*100, color='red', linestyle='--', label='Overall')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# PCA visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_cluster_scaled)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis', alpha=0.5, s=10)
plt.colorbar(scatter, label='Cluster')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.title('Student Clusters (PCA)', fontsize=14, fontweight='bold')

centers_pca = pca.transform(kmeans_final.cluster_centers_)
plt.scatter(centers_pca[:, 0], centers_pca[:, 1], c='red', marker='X', s=300, edgecolors='black', linewidths=2)

plt.tight_layout()
plt.show()

## 5. Evaluation (CRISP-DM Phase 5)

In [None]:
print("="*60)
print("EVALUATION SUMMARY")
print("="*60)

print("\n--- PREDICTIVE MODEL ---")
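print("Best: Random Forest (Tuned)")
print(f"Test AUC-ROC: {roc_auc_score(y_test, y_proba_tuned):.4f}")
# Report importances from the tuned model so they match the AUC above
tuned_importance = pd.Series(best_rf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 5 Features:")
for feature, importance in tuned_importance.head(5).items():
    print(f"  {feature}: {importance:.4f}")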

print("\n--- CLUSTERING MODEL ---")
print(f"Clusters: {OPTIMAL_K}")
print(f"Silhouette: {silhouette_score(X_cluster_scaled, cluster_labels):.4f}")
print(f"\nCluster Success Rates:")
for c, rate in success_rate.items():
    print(f"  Cluster {c}: {rate:.1f}%")

## 6. Deployment Considerations (CRISP-DM Phase 6)

**Recommendations:**
1. Deploy Random Forest model for at-risk student identification
2. Run predictions weekly during term
3. Flag students with P(success) < 0.5 for intervention
4. Use cluster assignments for personalised support pathways
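
The flagging rule in recommendation 3 can be sketched as a small function with the threshold exposed as a parameter. `flag_at_risk` is a hypothetical helper and the probabilities are invented; 0.5 is the value proposed above, not a tuned cut-off.

```python
import numpy as np

def flag_at_risk(p_success, threshold=0.5):
    """Boolean mask: True where predicted P(success) falls below the threshold."""
    return np.asarray(p_success) < threshold

# Invented probabilities for illustration
demo_p = [0.92, 0.48, 0.50, 0.15]
default_flags = flag_at_risk(demo_p)
wider_flags = flag_at_risk(demo_p, threshold=0.6)
print(default_flags.tolist())
print(wider_flags.tolist())
```

Raising the threshold flags more students, trading extra false alarms for fewer missed at-risk cases.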

**Limitations:**
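- Model trained on historical data only; performance on future presentations is untested
- Needs validation on new presentations before deployment
- Engagement and assessment features are aggregated over the whole presentation, so weekly or early prediction would require features restricted to data available at prediction time (or temporal models)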

---

**Questions for Amir:**
1. Can we use statistical methods (Pearson correlation) beyond module content?
2. Does everyone need to present?