# [CRISP-DM Phase 5: Evaluation] Cluster Evaluation & Interpretation

## Objective
Critically analyse the identified student segments (clusters) using SHAP (SHapley Additive exPlanations) to understand feature importance and behaviour profiles.

## Methodology
1. **Load Data & Model**: Retrieve the pre-processed features and the optimized K-Means model.
2. **Re-Assign Clusters**: Apply the model to the dataset.
3. **SHAP Analysis**: Use KernelExplainer to interpret the model's decision boundaries globally and per cluster.
4. **Critical Analysis**: Characterise student behaviours and risk profiles based on feature distributions.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import shap
import warnings
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# Set plot style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries loaded successfully.")


In [None]:
# 1. Load Features and Model
try:
    # Load Features
    df_features = pd.read_pickle('../2_Outputs/clustering_features.pkl')
    clustering_features = ['Intensity', 'Regularity', 'Procrastination', 'Breadth']
    X = df_features[clustering_features]
    
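    # Scale features (critical: K-Means was trained on scaled data).
    # Re-fitting the scaler here reproduces the training scaling only if the model
    # was fitted on this same feature table.
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=clustering_features, index=X.index)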
    
    print(f"Features loaded and scaled. Shape: {X_scaled.shape}")
    
    # Load Model
    model_path = '../2_Outputs/best_clustering_model.pkl'
    with open(model_path, 'rb') as f:
        kmeans_model = pickle.load(f)
        
    print(f"Model loaded: {type(kmeans_model).__name__} with {kmeans_model.n_clusters} clusters")
    
except FileNotFoundError as e:
    print(f"Error loading files: {e}")
    print("Please ensure you have run '03_Clustering_Models.ipynb' to generate outputs.")
    raise  # Stop here: later cells depend on X_scaled and kmeans_model
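
The scaling step matters because the centroids live in the coordinate system defined by the scaler used at training time. The sketch below illustrates this with plain NumPy and made-up data (the feature values, batch sizes, and variable names are assumptions, not the notebook's dataset): standardising a new batch with stored training statistics preserves its shift relative to the training population, while re-fitting on the batch hides it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the training features (hypothetical values, two columns)
X_train = rng.normal(loc=[800.0, 3.0], scale=[400.0, 1.5], size=(1000, 2))

# Statistics that would be stored alongside the model at training time
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

# A later batch with a different mix of students
X_new = rng.normal(loc=[3000.0, 1.0], scale=[500.0, 0.5], size=(200, 2))

# Reusing the training statistics keeps the batch in the model's coordinate system
X_new_reused = (X_new - train_mean) / train_std

# Re-fitting on the batch recentres it at zero and hides the shift
X_new_refit = (X_new - X_new.mean(axis=0)) / X_new.std(axis=0)

print("Mean after reuse:", X_new_reused.mean(axis=0).round(2))
print("Mean after refit:", X_new_refit.mean(axis=0).round(2))
```

If the fitted scaler was saved together with the model, loading it and calling `transform` would avoid this ambiguity.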


In [None]:
# 2. Assign Clusters
# We predict clusters for the whole dataset to ensure alignment
clusters = kmeans_model.predict(X_scaled)
X_labeled = X.copy()
X_labeled['Cluster'] = clusters

# Check distribution
print("Cluster Distribution:")
print(pd.Series(clusters).value_counts().sort_index())


## SHAP Feature Importance Analysis

We use SHAP to understand which features drive the assignment of a student to a specific cluster. K-Means assigns each point to its nearest centroid and exposes no class probabilities, so we use the model-agnostic `KernelExplainer`, which only needs a prediction function. `LinearExplainer` assumes a linear model and does not apply here.

**Note**: `KernelExplainer` is computationally expensive, so we explain a sample of the data rather than the full dataset.
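
`KernelExplainer` approximates Shapley values by sampling feature coalitions. With only four features, all 16 coalitions can be enumerated exactly, which makes it easier to see what the values mean. The sketch below is an illustration, not the notebook's method: it uses plain NumPy, a hypothetical helper `exact_shapley`, made-up centroids and background data, and scores a point by its negative squared distance to one centroid so that each cluster can get its own explanation.

```python
import itertools
import math

import numpy as np


def exact_shapley(score_fn, x, background):
    """Exact Shapley values of score_fn at x; absent features keep background values."""
    n_features = x.shape[0]
    features = list(range(n_features))
    phi = np.zeros(n_features)

    def coalition_value(subset):
        # Features in `subset` take their value from x, the rest stay as in the background
        mixed = background.copy()
        if subset:
            mixed[:, list(subset)] = x[list(subset)]
        return score_fn(mixed).mean()

    for j in features:
        others = [k for k in features if k != j]
        for size in range(n_features):
            weight = (math.factorial(size) * math.factorial(n_features - size - 1)
                      / math.factorial(n_features))
            for subset in itertools.combinations(others, size):
                gain = coalition_value(subset + (j,)) - coalition_value(subset)
                phi[j] += weight * gain
    return phi


feature_names = ["Intensity", "Regularity", "Procrastination", "Breadth"]
rng = np.random.default_rng(0)
background = rng.normal(size=(50, 4))  # stand-in for scaled features

# Hypothetical centroids in scaled space (not taken from the fitted model)
centroids = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [1.5, -1.0, -1.0, 1.5],
    [-1.0, 2.0, 0.5, -1.0],
])


def membership_score(cluster):
    """Score for one cluster: negative squared distance to its centroid."""
    def score(X):
        return -((X - centroids[cluster]) ** 2).sum(axis=1)
    return score


x = np.array([1.2, -0.8, -1.1, 1.4])  # one hypothetical student
phi = exact_shapley(membership_score(1), x, background)
print(dict(zip(feature_names, phi.round(3))))
```

The values sum to the difference between the student's score and the average background score, which is the same additivity property that `KernelExplainer` output is expected to satisfy approximately.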

In [None]:
# 3. Compute SHAP Values
# KernelExplainer cost grows with both the background size and the number of rows
# explained, so we sample both.

# Rows to explain (N=500 is usually sufficient for trend analysis in clustering)
X_sample = X_scaled.sample(n=min(500, len(X_scaled)), random_state=42)

# Smaller background set, used to simulate "missing" features
X_background = X_scaled.sample(n=min(100, len(X_scaled)), random_state=0)

# 'kmeans_model.predict' returns a discrete label (0, 1, 2), and KernelExplainer
# explains it as a single numeric output. Cluster labels are nominal, so read the
# values as overall feature influence on the label, not as per-cluster effects.
explainer = shap.KernelExplainer(kmeans_model.predict, X_background)

# Calculate SHAP values
print("Calculating SHAP values (this may take a minute)...")
shap_values = explainer.shap_values(X_sample)

print("SHAP values calculated.")


In [None]:
# 4. Global Feature Importance (Summary Plot)
plt.figure()
shap.summary_plot(shap_values, X_sample, plot_type="bar")


In [None]:
# 5. Cluster Profiling (Centroid Analysis)
# Compare scaled centroids (model centers) vs original data means
print("Cluster Centroids (Scaled - Model Internal):")
print(pd.DataFrame(kmeans_model.cluster_centers_, columns=clustering_features))

print("\nCluster Means (Original Scale - Interpretable):")
print(X_labeled.groupby('Cluster')[clustering_features].mean().round(2))
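
The two tables above are linked by the scaler: undoing the standardisation on a scaled centroid gives the cluster mean in original units. The following sketch shows that relationship with NumPy on made-up data (the values, labels, and variable names are assumptions for illustration). With a fitted model the match is exact only if K-Means has converged and the centroids equal the means of their assigned points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw features for two groups (columns: Intensity, Regularity)
X_raw = np.vstack([
    rng.normal(loc=[3800.0, 1.2], scale=[300.0, 0.3], size=(100, 2)),
    rng.normal(loc=[100.0, 18.0], scale=[30.0, 2.0], size=(100, 2)),
])
labels = np.repeat([0, 1], 100)

# Standardise with statistics from the full dataset, as StandardScaler would
mean, std = X_raw.mean(axis=0), X_raw.std(axis=0)
X_std = (X_raw - mean) / std

# Centroid of each cluster in scaled space
centroids_scaled = np.vstack([X_std[labels == k].mean(axis=0) for k in (0, 1)])

# Undo the scaling: x = z * std + mean
centroids_original = centroids_scaled * std + mean

# Per-cluster means computed directly on the raw features
means_original = np.vstack([X_raw[labels == k].mean(axis=0) for k in (0, 1)])

print(np.round(centroids_original, 2))
print(np.round(means_original, 2))
```

In the notebook, `scaler.inverse_transform(kmeans_model.cluster_centers_)` would perform the same back-transformation.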


In [None]:
# 6. Feature Distributions by Cluster
# Visualising the separation
plt.figure(figsize=(15, 10))
for i, col in enumerate(clustering_features):
    plt.subplot(2, 2, i+1)
    sns.violinplot(x='Cluster', y=col, data=X_labeled)
    plt.title(f'Distribution of {col} by Cluster')
plt.tight_layout()
plt.show()


## Critical Analysis of Student Segments

Based on the feature distributions and SHAP values derived from the K-Means (K=3) model, we can distinctly characterise the three student segments. These profiles provide actionable insights for intervention strategies.

### Cluster 1: "The Diligent High-Flyers" (High Engagement)
*   **Behaviour**: This group exhibits **exceptional engagement**. They have the highest `Intensity` (avg ~3875 clicks) and `Breadth` (avg ~148), indicating they cover almost all material deeply.
*   **Habits**: They are highly regular (low `Regularity` score of 1.22; lower values indicate more consistent study patterns) and submit assessments significantly early (avg -40 days `Procrastination`).
*   **Risk Profile**: **Low Risk**. These students are self-regulated and likely high performers.

### Cluster 0: "The Mainstream Learners" (Moderate Engagement)
*   **Behaviour**: The largest or "average" group. They demonstrate moderate `Intensity` (avg ~785) and `Breadth` (avg ~49).
*   **Habits**: They are reasonably regular (`Regularity` ~3.11) and submit assignments close to the deadline (avg -3 days). They do what is required but do not exhibit the "super-user" behaviour of Cluster 1.
*   **Risk Profile**: **Moderate Risk**. While stable, they lack the buffer of high engagement. A drop in activity could signal emerging issues.

### Cluster 2: "The Disengaged / At-Risk" (Low Engagement)
*   **Behaviour**: This is the priority group for intervention. They show **minimal engagement** with very low `Intensity` (avg ~108) and `Breadth` (avg ~15).
*   **Habits**: The defining characteristic is **extreme irregularity** (`Regularity` score of ~18.0, far higher than other clusters). Their study patterns are sporadic and unpredictable.
*   **Risk Profile**: **High Risk**. The combination of low activity and erratic behaviour makes them prime candidates for early dropout or failure.

### Intervention Strategy Recommendations
1.  **Monitor Cluster 2**: Set up automated alerts for students with high `Regularity` scores (i.e. irregular study patterns) and reach out to offer study planning support.
2.  **Nudge Cluster 0**: Encourage broader material coverage (`Breadth`) to move them towards the Cluster 1 profile.
3.  **Recognise Cluster 1**: No intervention needed, but their data can serve as a benchmark for "successful" behaviour patterns.
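
An alert rule for the first recommendation could be as simple as a percentile cut-off on `Regularity`. This is a minimal sketch on synthetic scores; the 90th-percentile threshold and the variable names are assumptions, and a real cut-off would need to be validated against outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Regularity scores (higher = more irregular), one per student
regularity = rng.gamma(shape=2.0, scale=2.0, size=500)

# Assumed rule: flag the most irregular 10% of students
threshold = np.percentile(regularity, 90)
flagged = np.flatnonzero(regularity > threshold)

print(f"Threshold: {threshold:.2f}, students flagged: {flagged.size}")
```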
