### 1. Introduction
This notebook walks through the development of a model that predicts student dropout risk on online learning platforms. It follows the AI Development Workflow covered in Part 1 of the assignment.


### 2. Load Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score


### 3. Load & Explore the Dataset

We begin by loading our sample CSV dataset and examining its structure to understand the features available for modeling student dropout risk.


In [None]:
import pandas as pd

df = pd.read_csv("data/sample_data.csv")
df.head()


FileNotFoundError: [Errno 2] No such file or directory: 'data/sample_data.csv'
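A `FileNotFoundError` like the one above usually means the notebook's working directory is not the project root. The following is a minimal sketch, not part of the original notebook: `load_dataset` and `summarize` are hypothetical helper names, and the `dropped_out` target column is taken from the split step later in this notebook. It fails early with the resolved path and gives a quick structural summary once the file loads.

```python
from pathlib import Path

import pandas as pd


def load_dataset(csv_path):
    """Read the CSV, failing early with the resolved path if it is missing."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected dataset at {path.resolve()} (cwd: {Path.cwd()})"
        )
    return pd.read_csv(path)


def summarize(df, target="dropped_out"):
    """Return row count, missing values per column, and target class shares."""
    return {
        "rows": len(df),
        "missing": df.isna().sum().to_dict(),
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }
```

Calling `summarize(load_dataset("data/sample_data.csv"))` would then show at a glance whether the classes are imbalanced, which matters for the stratified split used later.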

### 4. Data Preprocessing

To prepare the data, we handle missing values, encode categorical features, and normalize numerical columns.


In [6]:
from sklearn.preprocessing import StandardScaler

# Forward-fill missing values, if any
# (fillna(method='ffill') is deprecated in recent pandas versions)
df = df.ffill()

# One-hot encode categorical features
df = pd.get_dummies(df, columns=['language', 'location'])

# Standardize numeric features
scaler = StandardScaler()
df[['time_spent', 'quiz_score']] = scaler.fit_transform(df[['time_spent', 'quiz_score']])

df.head()


NameError: name 'df' is not defined
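The same three steps (imputation, one-hot encoding, scaling) can also be bundled into a single scikit-learn transformer, so that they are fitted on training data only and replayed unchanged at prediction time. The sketch below is one possible arrangement, not the notebook's method: it swaps forward-fill for `SimpleImputer`, assumes the column names used above, and runs on toy rows invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["language", "location"]
numeric = ["time_spent", "quiz_score"]

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            # Categories unseen during fit are encoded as all zeros
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical),
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric),
    ],
    remainder="passthrough",   # keep other columns (e.g. login_count) as-is
    sparse_threshold=0.0,      # always return a dense array
)

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "language": ["English", "French", "English", np.nan],
    "location": ["Urban", "Rural", "Suburban", "Urban"],
    "time_spent": [5.5, 2.0, np.nan, 8.0],
    "quiz_score": [80.0, 55.0, 70.0, 90.0],
    "login_count": [9, 3, 5, 12],
})

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

In a real run, `fit_transform` would be called on the training split and `transform` on the test split and on new students.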

### 5. Split the Dataset

We divide the dataset into training (70%) and test (30%) sets using stratified sampling to preserve the class distribution.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Reuse the preprocessed `df` from Section 4.
# Re-reading the CSV here would discard the encoding and scaling.

X = df.drop("dropped_out", axis=1)
y = df["dropped_out"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)


NameError: name 'df' is not defined
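To see what `stratify=y` buys, the sketch below splits synthetic labels (invented for illustration, 20% dropouts) with the same `test_size` and `random_state` as above and compares the dropout share in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels, invented for illustration: 20 dropouts out of 100 students
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Share of dropouts overall and in each split
print("overall:", y_demo.mean())
print("train:  ", yd_train.mean())
print("test:   ", yd_test.mean())
```

Without stratification, a small or imbalanced dataset can end up with noticeably different dropout rates in the two splits, which makes the evaluation metrics harder to trust.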

### 6. Train the Model

We train a Random Forest Classifier to learn dropout patterns based on the engineered features.


In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)


NameError: name 'X_train' is not defined

### 7. Evaluate the Model

We assess how well the trained model performs on the held-out test set using a confusion matrix, a classification report, precision, and recall.


In [None]:
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score

# Predict on the held-out test set created in Section 5
y_pred = rf_model.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
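Precision and recall can both be read directly off the confusion matrix. The sketch below uses invented labels (1 = dropped out) to show the arithmetic behind the scikit-learn helpers.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration (1 = dropped out, 0 = continued)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the students flagged as dropouts, how many were
recall = tp / (tp + fn)     # of the actual dropouts, how many were flagged

print("precision:", precision, "recall:", recall)
```

For a dropout early-warning system, recall is often the metric to watch, since a missed at-risk student typically costs more than an unnecessary check-in.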


### 8. Save the Trained Model

We serialize and store the trained model so it can be used later without retraining.


In [None]:
import joblib
import os

os.makedirs("model", exist_ok=True)
joblib.dump(rf_model, "model/random_forest_dropout_predictor.pkl")
print("✅ Model saved successfully!")


✅ Model saved successfully!
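A quick way to gain confidence in a saved model is a round-trip check: dump it, load it back, and confirm that the predictions match. The sketch below does this on synthetic data in a temporary directory; the data, the file name `demo_model.pkl`, and the variable names are invented for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, invented for illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(clf, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print("Round trip preserved predictions:", same_predictions)
```

Note that a pickled model should generally be loaded with the same scikit-learn version that produced it.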


### 9. Feature Importance

Random Forest models allow us to rank input features by how influential they are in predicting student dropout. This helps educators and platform designers focus on the most impactful learning behaviors or characteristics.


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Get feature importances and sort them
importances = rf_model.feature_importances_
feature_names = X.columns
sorted_idx = np.argsort(importances)

# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_names[sorted_idx], importances[sorted_idx], color="skyblue")
plt.xlabel("Feature Importance Score")
plt.title("Top Features Influencing Student Dropout")
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
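The same ranking can be inspected as a table rather than a plot. The sketch below trains a small forest on synthetic data (invented for illustration, with the label driven entirely by `quiz_score`) and pairs each importance with its feature name in a sorted `pandas.Series`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, invented for illustration: only quiz_score drives the label
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "quiz_score": rng.normal(size=200),
    "time_spent": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_demo = (X_demo["quiz_score"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1
ranking = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(ranking)
```

Impurity-based importances can favor high-cardinality features, so it is worth treating the ranking as a hint rather than a causal statement.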


### 🔄 Reset Dataset (optional)

Reload the original dataset to undo any in-place modifications made during preprocessing.


In [None]:
df = pd.read_csv('data/sample_data.csv')


### 10. Run Dropout Prediction for a New Student

We construct a new student profile and run it through the same preprocessing pipeline used in training. Then, we use the saved model to predict dropout risk.


In [None]:
import pandas as pd
import joblib

# Define new student features (adjust values as needed).
# Column names and order must match the features the model was trained on.
new_student = pd.DataFrame([{
    'gender': 'F',
    'age': 24,
    'language_English': 1,
    'language_French': 0,
    'location_Rural': 0,
    'location_Suburban': 1,
    'location_Urban': 0,
    'time_spent': 5.5,       # hours/week
    'quiz_score': 80,        # %
    'login_count': 9         # number of logins
}])

# Scale numeric columns with the scaler already fitted in Section 4
# (pickle it alongside the model to reuse it in another session).
# Refitting a new StandardScaler on a single row would map every value to 0.
new_student[['time_spent', 'quiz_score']] = scaler.transform(new_student[['time_spent', 'quiz_score']])

# Load trained model
model = joblib.load("model/random_forest_dropout_predictor.pkl")

# Predict
prediction = model.predict(new_student)
print("🔮 Dropout Risk Prediction:", "Dropped Out" if prediction[0] == 1 else "Continued")
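Scaling a new row only makes sense with the statistics learned from the training data. The sketch below, on invented numbers, contrasts reusing a fitted `StandardScaler` with refitting one on a single row, which collapses every value to zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented values for illustration: columns are time_spent, quiz_score
train = np.array([[2.0, 55.0], [5.5, 80.0], [8.0, 90.0]])
new_row = np.array([[5.5, 80.0]])

# Reuse the scaler fitted on training data: the row is placed relative
# to the training mean and standard deviation
fitted = StandardScaler().fit(train)
reused = fitted.transform(new_row)

# Refit on the single row: the mean equals the row itself, so all values become 0
refit = StandardScaler().fit_transform(new_row)

print("reused:", reused)
print("refit: ", refit)
```

With the refit scaler the model would see the same all-zero input for every student, so its prediction would no longer depend on `time_spent` or `quiz_score`.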


In [1]:
import glob

for path in glob.glob("**/*.pkl", recursive=True):
    print("🔍 Found:", path)


🔍 Found: model\random_forest_dropout_predictor.pkl



# AI for Software Engineering – Student Dropout Prediction  
**Author:** Leonard & Team  
