# Final Models and Feature Importance Analysis

## Objective
Present the final regression model performance, the feature importance analysis, and a summary of the findings.

## CRISP-DM Stage
Evaluation and Deployment

## Contents
- Feature importance visualisation
- Final model comparison summary
- Cluster interpretation summary
- Key insights and recommendations

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import warnings

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print('All libraries imported successfully')
print('=' * 80)

## Section 1: Load Models and Results

Load the trained models and comparison results from previous notebooks.

In [None]:
print('\n' + '=' * 80)
print('LOADING MODELS AND RESULTS')
print('=' * 80)

# Use a context manager so the file handle is closed after loading
with open('../2_Outputs/best_regression_model.pkl', 'rb') as model_file:
    best_rf_model = pickle.load(model_file)
comparison_df = pd.read_csv('../model_comparison_results.csv')
df_encoded = pd.read_pickle('../2_Outputs/df_encoded_full.pkl')
features = pd.read_pickle('../2_Outputs/features_prepared.pkl')

print('\nModels and data loaded successfully')
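The loading step above can be wrapped in a small helper that fails early when an artefact is missing. This is a minimal sketch: `load_pickle` is a hypothetical helper name, not part of the original notebooks, and it assumes the pickle files come from a trusted source, since unpickling can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickled object, failing early with a clear message if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f'Expected artefact not found: {path}')
    # The context manager closes the file handle even if unpickling fails
    with path.open('rb') as handle:
        return pickle.load(handle)
```

With this helper, the model could be loaded as `best_rf_model = load_pickle('../2_Outputs/best_regression_model.pkl')`.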

## Section 2: Feature Importance Analysis

Display and visualise the most important features from the best regression model.

In [None]:
print('\n' + '=' * 80)
print('FEATURE IMPORTANCE ANALYSIS')
print('=' * 80)

feature_importance = pd.DataFrame({
    'feature': features.columns,
    'importance': best_rf_model.feature_importances_
}).sort_values('importance', ascending=True)

top_features = feature_importance.tail(20)

plt.figure(figsize=(12, 8))
plt.barh(range(len(top_features)), top_features['importance'].values, color='steelblue')
plt.yticks(range(len(top_features)), top_features['feature'].values)
plt.xlabel('Feature Importance', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Top 20 Most Important Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print('\nTop 15 Most Important Features:')
print('-' * 50)
for idx, (_, row) in enumerate(feature_importance.tail(15).iloc[::-1].iterrows(), 1):
    print(f'{idx:2d}. {row["feature"]:40s} -> {row["importance"]:.6f}')
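A ranked list is easier to interpret alongside the cumulative share of importance, which shows how many features account for most of the model's signal. The sketch below assumes importances are non-negative values, like those in `feature_importances_`. `rank_importances` is a hypothetical helper and the demo feature names and values are made up for illustration.

```python
import numpy as np
import pandas as pd


def rank_importances(names, importances):
    """Return features sorted by importance with each one's cumulative share."""
    names = list(names)
    importances = np.asarray(importances, dtype=float)
    if len(names) != len(importances):
        raise ValueError(
            f'{len(names)} feature names but {len(importances)} importances'
        )
    ranked = (
        pd.DataFrame({'feature': names, 'importance': importances})
        .sort_values('importance', ascending=False)
        .reset_index(drop=True)
    )
    # Share of total importance covered by this feature and all higher-ranked ones
    ranked['cumulative_share'] = (
        ranked['importance'].cumsum() / ranked['importance'].sum()
    )
    return ranked


# Illustrative values only; the feature names here are made up
demo = rank_importances(['clicks', 'age', 'credits'], [0.6, 0.1, 0.3])
print(demo)
```

In the notebook, the same function could be called as `rank_importances(features.columns, best_rf_model.feature_importances_)`.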

## Section 3: Model Performance Summary

Review and visualise the regression model comparison results.

In [None]:
print('\n' + '=' * 80)
print('MODEL PERFORMANCE SUMMARY')
print('=' * 80)

print('\n' + comparison_df.to_string(index=False))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

comparison_df.plot(x='Model', y='RMSE', kind='bar', ax=axes[0], legend=False, color='steelblue')
axes[0].set_title('RMSE Comparison', fontsize=12, fontweight='bold')
axes[0].set_ylabel('RMSE')
axes[0].tick_params(axis='x', rotation=45)

comparison_df.plot(x='Model', y='MAE', kind='bar', ax=axes[1], legend=False, color='coral')
axes[1].set_title('MAE Comparison', fontsize=12, fontweight='bold')
axes[1].set_ylabel('MAE')
axes[1].tick_params(axis='x', rotation=45)

comparison_df.plot(x='Model', y='R2', kind='bar', ax=axes[2], legend=False, color='seagreen')
axes[2].set_title('R2 Comparison', fontsize=12, fontweight='bold')
axes[2].set_ylabel('R2')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

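# Rank models by R2 so the first row is the best model regardless of CSV order
comparison_df = comparison_df.sort_values('R2', ascending=False).reset_index(drop=True)
best_model_name = comparison_df.iloc[0]['Model']
best_r2 = comparison_df.iloc[0]['R2']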

print(f'\nBest Model: {best_model_name}')
print(f'Test R2 Score: {best_r2:.4f}')
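For reference, the three metrics in the comparison table can be computed directly from true and predicted values. The sketch below uses the standard definitions with NumPy. `regression_metrics` is a hypothetical helper and the sample values are illustrative, not results from this analysis. Lower RMSE and MAE are better; higher R2 is better.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R2 from arrays of true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    # R2 compares the residual error with the error of always predicting the mean
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {'RMSE': rmse, 'MAE': mae, 'R2': r2}


# Illustrative values only
metrics = regression_metrics([3, 5, 7], [2, 5, 9])
print(metrics)
```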

## Section 4: Executive Summary

Comprehensive summary of analysis findings and recommendations.

In [None]:
print('\n' + '=' * 80)
print('EXECUTIVE SUMMARY')
print('=' * 80)

summary_text = """
OPEN UNIVERSITY LEARNING ANALYTICS - ANALYSIS SUMMARY
================================================================================

1. PROJECT SCOPE
   - Analysis of Open University student interaction data
   - CRISP-DM methodology applied across all phases
   - Objective: Identify at-risk students and segment learning personas

2. DATA PREPARATION
   - Seven datasets merged using composite keys (id_student, code_module, 
     code_presentation)
   - 50+ features engineered from raw data
   - Cold Start Problem addressed (filling NaNs with 0 for inactive students)
   - Memory optimisation applied (int32, float32 data types)

3. SUPERVISED LEARNING RESULTS (Score Prediction)
   - Four regression models compared:
     * Linear Regression (baseline)
     * Random Forest Regressor (initial)
     * XGBoost Regressor
     * Random Forest Regressor (hyperparameter tuned)
   
   - Best Model Performance:
""" + f"     Model: {best_model_name}\n" + f"     R2 Score: {best_r2:.4f}\n" + f"     RMSE: {comparison_df.iloc[0]['RMSE']:.4f}\n" + f"     MAE: {comparison_df.iloc[0]['MAE']:.4f}\n\n"

summary_text += """   - Key Feature Drivers:
     * VLE engagement metrics (total clicks, clicks per week)
     * Assessment submission behaviour
     * Student demographics and background

4. UNSUPERVISED LEARNING RESULTS (Student Segmentation)
   - K-Means clustering identified distinct learning personas
   - Optimal cluster number determined using Elbow Method and 
     Silhouette Score analysis
   - Students segmented by:
     * Engagement levels (VLE interaction)
     * Performance (assessment scores)
     * Timeliness (submission patterns)

5. KEY INSIGHTS
   - Student engagement strongly correlates with academic performance
   - Early submission patterns indicate higher success rates
   - Consistent VLE participation is critical for success
   - Clear student personas enable targeted interventions

6. RECOMMENDATIONS
   - Deploy regression model for early warning system
   - Use cluster assignments for personalised support strategies
   - Monitor engagement metrics weekly
   - Implement targeted interventions for at-risk clusters
   - Match high-performing students with struggling peers (mentoring)

7. TECHNICAL ACHIEVEMENTS
   - Robust pipeline handling 1M+ records
   - Cross-validation and hyperparameter tuning implemented
   - Multiple evaluation metrics used
   - Reproducible code with fixed random states
   - Professional documentation and code organisation

================================================================================
Analysis completed successfully.
All models, results, and visualisations are available in the preceding notebooks.
"""

print(summary_text)
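The early warning recommendation in the summary could start from a simple rule on predicted scores. The sketch below is one possible starting point, not the deployed method. `flag_at_risk` is a hypothetical helper, the threshold of 40 is an assumed cut-off rather than a value taken from this analysis, and the student ids and scores are made up.

```python
import pandas as pd


def flag_at_risk(predicted_scores, threshold=40.0):
    """Flag students whose predicted score falls below the threshold."""
    scores = pd.Series(predicted_scores, dtype=float)
    return pd.DataFrame({
        'predicted_score': scores,
        # Assumed rule: strictly below the threshold counts as at risk
        'at_risk': scores < threshold,
    })


# Hypothetical predictions indexed by made-up student ids
demo_predictions = pd.Series([72.5, 38.0, 40.0], index=[101, 102, 103])
flags = flag_at_risk(demo_predictions)
print(flags[flags['at_risk']])
```

In practice the threshold would need to be chosen with the model's error in mind, since a prediction close to the cut-off is uncertain in either direction.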