# Student Dropout Prediction - Model Training

This notebook trains machine learning models to predict student dropout.

**Models to train:**
- XGBoost (Primary)
- RandomForest (Comparison)

**Evaluation Metrics:**
- Accuracy
- F1-Score
- Confusion Matrix
- ROC-AUC Curve
- Feature Importance


## 1. Setup and Installation

**Note:** This notebook checks whether a GPU is available in Colab and falls back to the CPU otherwise, so no manual configuration is needed.


In [None]:
# Install required packages
%pip install -q pandas numpy scikit-learn xgboost imbalanced-learn matplotlib seaborn joblib

# Check hardware availability (GPU or CPU)
import subprocess

USE_GPU = False
try:
    # nvidia-smi returns exit code 0 when it can reach an NVIDIA driver and GPU
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True, timeout=5)
    USE_GPU = result.returncode == 0
except (OSError, subprocess.SubprocessError):
    # nvidia-smi is not installed (CPU runtime) or the call timed out
    USE_GPU = False

if USE_GPU:
    print("✅ GPU is available!")
    print("XGBoost will use GPU acceleration for faster training")
else:
    print("ℹ️ GPU not available, using CPU (training will work fine, just slower)")


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, f1_score, confusion_matrix, 
    classification_report, roc_auc_score, roc_curve
)
from imblearn.over_sampling import SMOTE
import xgboost as xgb
import joblib
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

# Set tree method for XGBoost (uses the GPU if one was detected).
# Note: XGBoost 2.0+ deprecates 'gpu_hist'; there the equivalent is
# tree_method='hist' together with device='cuda'.
tree_method = 'gpu_hist' if USE_GPU else 'hist'

print("✅ All libraries imported successfully!")
print(f"📊 XGBoost will use: {'GPU' if USE_GPU else 'CPU'} (tree_method='{tree_method}')")
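
XGBoost 2.0 introduced a `device` parameter and deprecated `tree_method='gpu_hist'`. The following is a minimal sketch of choosing the hardware arguments from the installed version. The helper name `xgb_hardware_params` is hypothetical (it is not part of XGBoost), and it only builds a dictionary, so it can be inspected without a GPU.

```python
def xgb_hardware_params(xgb_version: str, use_gpu: bool) -> dict:
    """Return hardware-related XGBClassifier kwargs (hypothetical helper)."""
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        # XGBoost >= 2.0: keep tree_method='hist' and pick hardware via `device`.
        return {"tree_method": "hist", "device": "cuda" if use_gpu else "cpu"}
    # Older releases: the GPU is selected through the tree method itself.
    return {"tree_method": "gpu_hist" if use_gpu else "hist"}


print(xgb_hardware_params("2.1.0", use_gpu=True))
print(xgb_hardware_params("1.7.6", use_gpu=False))
```

In this notebook the result could be unpacked into the constructor, for example `xgb.XGBClassifier(**xgb_hardware_params(xgb.__version__, USE_GPU), n_estimators=100)`.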


## 2. Mount Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("✅ Google Drive mounted!")


## 3. Load Dataset

**Instructions:**
1. Upload your `dataset.csv` file to Google Drive
2. Update the path below to point to your dataset location
3. Example: `/content/drive/MyDrive/AI_Project/data/dataset.csv`


In [None]:
# Update this path to your dataset location
dataset_path = '/content/drive/MyDrive/AI_Project/data/dataset.csv'

# Alternative: Upload directly to Colab
# from google.colab import files
# uploaded = files.upload()
# dataset_path = list(uploaded.keys())[0]

# Load dataset
df = pd.read_csv(dataset_path)

print("✅ Dataset loaded successfully!")
print(f"📊 Shape: {df.shape}")
print(f"\n📋 Columns: {list(df.columns)}")
print("\n📈 First few rows:")
df.head()


In [None]:
# Dataset information
print("Dataset Info:")
print("="*50)
print(f"Total records: {len(df)}")
print(f"\nTarget distribution:")
print(df['dropout'].value_counts())
print(f"\nDropout rate: {df['dropout'].mean() * 100:.2f}%")
print(f"\n\nDescriptive Statistics:")
df.describe()


## 4. Data Preprocessing


In [None]:
# Separate features and target
X = df.drop(columns=['dropout'])
y = df['dropout']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature names: {list(X.columns)}")


In [None]:
# Check for missing values
missing_values = X.isnull().sum()
if missing_values.sum() > 0:
    print("⚠️ Missing values found:")
    print(missing_values[missing_values > 0])
    # Fill with the column median. Assign the result back instead of calling
    # fillna(inplace=True) on a column selection, which newer pandas versions
    # warn about and may not apply to X.
    for col in X.columns:
        if X[col].isnull().any():
            X[col] = X[col].fillna(X[col].median())
    print("\n✅ Missing values filled with median")
else:
    print("✅ No missing values found")
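
The cell above computes each median over the full dataset, before the train/test split, so test rows influence the fill values. A stricter variant learns the medians from the training rows only and reuses them for the test rows. The sketch below shows that idea on a made-up frame; the column names `attendance` and `gpa` are illustrative assumptions, not the dataset's actual columns.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): each split has missing values.
train = pd.DataFrame({"attendance": [90.0, np.nan, 70.0], "gpa": [3.0, 2.0, 4.0]})
test = pd.DataFrame({"attendance": [np.nan, 60.0], "gpa": [np.nan, 1.0]})

# Learn the fill values from the training rows only ...
train_medians = train.median(numeric_only=True)

# ... then apply the same values to both splits.
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print(train_medians.to_dict())
print(test_filled)
```

In the notebook this would mean moving the imputation to after the train-test split cell.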


In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y, shuffle=True
)

print(f"✅ Train set: {X_train.shape[0]} samples")
print(f"✅ Test set: {X_test.shape[0]} samples")
print(f"\nTrain set dropout rate: {y_train.mean() * 100:.2f}%")
print(f"Test set dropout rate: {y_test.mean() * 100:.2f}%")
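
Passing `stratify=y` is what keeps the two dropout rates printed above nearly identical. A small sketch on synthetic labels (20% positives, not the real dataset) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 20% positives (illustrative, not the real dataset).
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the overall 20% positive rate.
print(f"train rate: {y_tr.mean():.2f}, test rate: {y_te.mean():.2f}")
```

Without `stratify`, a small test set can end up with noticeably more or fewer dropout cases than the training set, which makes the reported metrics less comparable between runs.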


In [None]:
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for better handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("✅ Features scaled using StandardScaler")


In [None]:
# Apply SMOTE for class imbalance
print("Before SMOTE:")
print(y_train.value_counts())

smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

print("\n✅ After SMOTE:")
print(f"Shape: {X_train_balanced.shape}")
print(f"\nClass distribution:")
print(pd.Series(y_train_balanced).value_counts())
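
SMOTE balances the classes by creating synthetic minority rows rather than duplicating existing ones: each new row lies on the line segment between a minority sample and one of its nearest minority neighbors. The NumPy sketch below illustrates only that interpolation step. It is a simplified illustration, not imbalanced-learn's implementation, and `smote_like_samples` is a hypothetical helper.

```python
import numpy as np


def smote_like_samples(X_min: np.ndarray, n_new: int, k: int = 2, seed: int = 0) -> np.ndarray:
    """Create synthetic minority points by interpolating toward near neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest points

    base = rng.integers(0, len(X_min), size=n_new)  # random base sample per new row
    slot = rng.integers(0, k, size=n_new)           # which of its k neighbors to use
    chosen = neighbors[base, slot]
    gap = rng.random((n_new, 1))                    # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Four minority points on the corners of the unit square (made-up data).
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_minority, n_new=5)
print(synthetic)
```

Because every synthetic row is a blend of two training rows, SMOTE is applied to the training split only, as the cell above does; the test set stays untouched.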


## 5. Model Training

### 5.1 XGBoost Classifier


In [None]:
# Train XGBoost model
print("🚀 Training XGBoost Classifier...")
print("="*50)
print(f"Hardware: {'GPU 🚀' if USE_GPU else 'CPU'}")
print(f"Tree method: {tree_method}")
print()

xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss',
    tree_method=tree_method  # 'gpu_hist' on GPU, 'hist' on CPU (set in the imports cell)
)

xgb_model.fit(X_train_balanced, y_train_balanced)

print("✅ XGBoost model trained successfully!")


In [None]:
# Evaluate XGBoost
y_pred_xgb = xgb_model.predict(X_test_scaled)
y_pred_proba_xgb = xgb_model.predict_proba(X_test_scaled)[:, 1]

accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
f1_xgb = f1_score(y_test, y_pred_xgb)
roc_auc_xgb = roc_auc_score(y_test, y_pred_proba_xgb)

print("XGBoost Performance:")
print("="*50)
print(f"Accuracy:  {accuracy_xgb:.4f} ({accuracy_xgb*100:.2f}%)")
print(f"F1-Score:  {f1_xgb:.4f}")
print(f"ROC-AUC:   {roc_auc_xgb:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_xgb))


### 5.2 Random Forest Classifier


In [None]:
# Train Random Forest model
print("🚀 Training Random Forest Classifier...")
print("="*50)

rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train_balanced, y_train_balanced)

print("✅ Random Forest model trained successfully!")


In [None]:
# Evaluate Random Forest
y_pred_rf = rf_model.predict(X_test_scaled)
y_pred_proba_rf = rf_model.predict_proba(X_test_scaled)[:, 1]

accuracy_rf = accuracy_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
roc_auc_rf = roc_auc_score(y_test, y_pred_proba_rf)

print("Random Forest Performance:")
print("="*50)
print(f"Accuracy:  {accuracy_rf:.4f} ({accuracy_rf*100:.2f}%)")
print(f"F1-Score:  {f1_rf:.4f}")
print(f"ROC-AUC:   {roc_auc_rf:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))


## 6. Model Comparison


In [None]:
# Compare models
comparison = pd.DataFrame({
    'Model': ['XGBoost', 'Random Forest'],
    'Accuracy': [accuracy_xgb, accuracy_rf],
    'F1-Score': [f1_xgb, f1_rf],
    'ROC-AUC': [roc_auc_xgb, roc_auc_rf]
})

print("Model Comparison:")
print("="*50)
print(comparison.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

metrics = ['Accuracy', 'F1-Score', 'ROC-AUC']
for idx, metric in enumerate(metrics):
    axes[idx].bar(comparison['Model'], comparison[metric], color=['#3498db', '#2ecc71'])
    axes[idx].set_title(f'{metric} Comparison', fontsize=14, fontweight='bold')
    axes[idx].set_ylabel(metric)
    axes[idx].set_ylim([0, 1])
    axes[idx].grid(True, alpha=0.3, axis='y')
    
    # Add value labels
    for i, v in enumerate(comparison[metric]):
        axes[idx].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom', fontweight='bold')

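plt.tight_layout()
# Make sure the output folder exists before the first plot is saved
# (it is otherwise only created in Section 10).
import os
os.makedirs('/content/drive/MyDrive/AI_Project/models', exist_ok=True)
plt.savefig('/content/drive/MyDrive/AI_Project/models/model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✅ Comparison plot saved to Google Drive")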


## 7. Confusion Matrix


In [None]:
# Plot confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

models = [('XGBoost', xgb_model, y_pred_xgb), ('Random Forest', rf_model, y_pred_rf)]

for idx, (name, model, y_pred) in enumerate(models):
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx], 
                xticklabels=['No Dropout', 'Dropout'], 
                yticklabels=['No Dropout', 'Dropout'])
    axes[idx].set_title(f'{name} - Confusion Matrix', fontsize=14, fontweight='bold')
    axes[idx].set_ylabel('True Label')
    axes[idx].set_xlabel('Predicted Label')

plt.tight_layout()
plt.savefig('/content/drive/MyDrive/AI_Project/models/confusion_matrices.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Confusion matrices saved to Google Drive")
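
The accuracy and F1 values reported earlier can be read directly off a confusion matrix. The sketch below works through the arithmetic on made-up counts, using scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Made-up counts: rows = true class, columns = predicted class.
cm_demo = np.array([[80, 10],    # true "No Dropout": 80 correct, 10 false alarms
                    [5, 25]])    # true "Dropout":    5 missed,  25 caught
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)   # of predicted dropouts, how many really dropped out
recall = tp / (tp + fn)      # of real dropouts, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm_demo.sum()

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

For dropout prediction the bottom-left cell (missed dropouts) is usually the costly one, so recall on the dropout class deserves attention alongside overall accuracy.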


## 8. ROC-AUC Curve


In [None]:
# Plot ROC curves
plt.figure(figsize=(10, 8))

# XGBoost ROC
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_pred_proba_xgb)
plt.plot(fpr_xgb, tpr_xgb, label=f'XGBoost (AUC = {roc_auc_xgb:.4f})', 
         linewidth=2, color='#3498db')

# Random Forest ROC
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_proba_rf)
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {roc_auc_rf:.4f})', 
         linewidth=2, color='#2ecc71')

# Diagonal line (random classifier)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier', linewidth=1)

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC-AUC Curve Comparison', fontsize=16, fontweight='bold')
plt.legend(loc='lower right', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/content/drive/MyDrive/AI_Project/models/roc_curve.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ ROC curve saved to Google Drive")
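
The AUC shown in the legend has a direct interpretation: it is the probability that a randomly chosen dropout case receives a higher score than a randomly chosen non-dropout case. A minimal NumPy sketch of that pairwise definition, on four made-up scores:

```python
import numpy as np

# Made-up labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half.
diff = pos[:, None] - neg[None, :]
auc_demo = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
print(auc_demo)
```

Three of the four positive/negative pairs are ranked correctly here, so the value is 0.75. An AUC of 0.5 corresponds to the dashed diagonal in the plot.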


## 9. Feature Importance


In [None]:
# Feature importance for both models
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# XGBoost feature importance
feature_importance_xgb = pd.DataFrame({
    'feature': X.columns,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

axes[0].barh(feature_importance_xgb['feature'], feature_importance_xgb['importance'], 
            color='#3498db')
axes[0].set_title('XGBoost - Feature Importance', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Importance')
axes[0].invert_yaxis()
axes[0].grid(True, alpha=0.3, axis='x')

# Random Forest feature importance
feature_importance_rf = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

axes[1].barh(feature_importance_rf['feature'], feature_importance_rf['importance'], 
            color='#2ecc71')
axes[1].set_title('Random Forest - Feature Importance', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Importance')
axes[1].invert_yaxis()
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig('/content/drive/MyDrive/AI_Project/models/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Feature importance plots saved to Google Drive")
print("\nTop 5 Most Important Features (XGBoost):")
print(feature_importance_xgb.head().to_string(index=False))
print("\nTop 5 Most Important Features (Random Forest):")
print(feature_importance_rf.head().to_string(index=False))
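
The `feature_importances_` values plotted above are derived from the trees built during training. A complementary check is permutation importance, which measures how much a held-out score drops when one column is shuffled. The sketch below uses scikit-learn's `permutation_importance` on synthetic data, with a small random forest standing in for the notebook's models; the data and the choice of three features are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only column 0 carries information about the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out data 5 times and record the accuracy drop.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature_{i}: {mean:.3f} +/- {std:.3f}")
```

In the notebook, the same call could be made with `xgb_model` or `rf_model` and `X_test_scaled`, `y_test` to cross-check the rankings in the bar charts.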


## 10. Save Best Model

**Note:** This cell saves the XGBoost model by default. Compare the metrics in Section 6 first, and set `best_model = rf_model` if Random Forest scored higher on your data.


In [None]:
# Model to save (XGBoost by default; switch to rf_model if it scored higher in Section 6)
best_model = xgb_model
model_name = 'dropout_model.pkl'
model_path = f'/content/drive/MyDrive/AI_Project/models/{model_name}'

# Create directory if it doesn't exist
import os
os.makedirs(os.path.dirname(model_path), exist_ok=True)

# Save model
joblib.dump(best_model, model_path)

# Also save scaler for preprocessing
scaler_path = '/content/drive/MyDrive/AI_Project/models/scaler.pkl'
joblib.dump(scaler, scaler_path)

print(f"✅ Model saved to: {model_path}")
print(f"✅ Scaler saved to: {scaler_path}")
print("\n📥 To download to local machine:")
print("   1. Go to Google Drive")
print("   2. Navigate to AI_Project/models/")
print(f"   3. Download {model_name} and scaler.pkl")
print("   4. Place them in your local project/models/ folder")


## 11. Training Summary

### Model Performance Summary:

The models have been trained and evaluated. Check the outputs above for detailed metrics.

### Next Steps:
1. ✅ Model trained and saved to Google Drive
2. 📥 Download the model file (`dropout_model.pkl`) and the scaler (`scaler.pkl`) to your local machine
3. 📁 Place them in your local `project/models/` folder
4. 🚀 Use the model in your local application
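
For the last step, the saved scaler has to be applied before the model, in the same order as during training. The sketch below shows that load-scale-predict sequence. So that it runs on its own, it first writes stand-in artifacts to a temporary folder: the logistic regression, the random data, and the four feature columns are assumptions for illustration, whereas in practice the two files are the downloaded `dropout_model.pkl` and `scaler.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# --- Stand-in artifacts (illustrative only) ---------------------------------
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(50, 4))
y_fit = (X_fit[:, 0] > 0).astype(int)

tmp_dir = tempfile.mkdtemp()
model_file = os.path.join(tmp_dir, "dropout_model.pkl")
scaler_file = os.path.join(tmp_dir, "scaler.pkl")

demo_scaler = StandardScaler().fit(X_fit)
demo_model = LogisticRegression().fit(demo_scaler.transform(X_fit), y_fit)
joblib.dump(demo_model, model_file)
joblib.dump(demo_scaler, scaler_file)

# --- Inference: load both files, scale first, then predict ------------------
model = joblib.load(model_file)
scaler = joblib.load(scaler_file)

new_students = rng.normal(size=(3, 4))  # columns in the same order as in training
risk = model.predict_proba(scaler.transform(new_students))[:, 1]
print(risk.round(3))
```

Pickled models are generally tied to the library versions that wrote them, so the local environment should have xgboost and scikit-learn versions compatible with those used in Colab.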

### Files Saved to Google Drive:
- `/MyDrive/AI_Project/models/dropout_model.pkl` - Trained model
- `/MyDrive/AI_Project/models/scaler.pkl` - Feature scaler
- `/MyDrive/AI_Project/models/model_comparison.png` - Model comparison plot
- `/MyDrive/AI_Project/models/confusion_matrices.png` - Confusion matrices
- `/MyDrive/AI_Project/models/roc_curve.png` - ROC curve
- `/MyDrive/AI_Project/models/feature_importance.png` - Feature importance

**🎉 Training completed successfully!**


# Student Dropout Prediction - Model Training

This notebook trains machine learning models to predict student dropout based on attendance, academic performance, indiscipline, and online engagement metrics.

## Steps:
1. Mount Google Drive
2. Load and preprocess dataset
3. Train XGBoost and RandomForest models
4. Evaluate models (Accuracy, F1-score, ROC-AUC, Confusion Matrix)
5. Generate feature importance plots
6. Save model to Google Drive
7. Download model to local machine



In [None]:
# Install required packages
!pip install -q pandas numpy scikit-learn xgboost matplotlib seaborn imbalanced-learn joblib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix, classification_report, roc_curve
from imblearn.over_sampling import SMOTE
import xgboost as xgb
import joblib
import warnings
warnings.filterwarnings('ignore')

print("✅ All packages installed and imported successfully!")

