In [1]:
import numpy as np
import pandas as pd
from warnings import filterwarnings
filterwarnings('ignore')

# Part 1: Data Exploration, Preprocessing, and Feature Engineering
---

### 1) Load the dataset using pandas
Load a CSV file into a pandas DataFrame.

In [2]:
df = pd.read_csv("student's dropout dataset.csv")

In [3]:
df.head()

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Nacionality,Mother's qualification,Father's qualification,Mother's occupation,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,8,5,2,1,1,1,13,10,6,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,6,1,11,1,1,1,1,3,4,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,5,1,1,1,22,27,10,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,8,2,15,1,1,1,23,27,6,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,12,1,3,0,1,1,22,28,10,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


### 2) Understand the dataset structure
- Use .shape to display the number of rows and columns.
- Use .dtypes to check data types of each feature.
- Use .isnull().sum() to identify missing values in each column.

In [4]:
print('Dataset Shape:',df.shape)

Dataset Shape: (4424, 35)


In [5]:
print('Data Types: ')
df.dtypes

Data Types: 


Marital status                                      int64
Application mode                                    int64
Application order                                   int64
Course                                              int64
Daytime/evening attendance                          int64
Previous qualification                              int64
Nacionality                                         int64
Mother's qualification                              int64
Father's qualification                              int64
Mother's occupation                                 int64
Father's occupation                                 int64
Displaced                                           int64
Educational special needs                           int64
Debtor                                              int64
Tuition fees up to date                             int64
Gender                                              int64
Scholarship holder                                  int64
Age at enrollm

In [6]:
print("Missing Values: ")
df.isna().sum()

Missing Values: 


Marital status                                    0
Application mode                                  0
Application order                                 0
Course                                            0
Daytime/evening attendance                        0
Previous qualification                            0
Nacionality                                       0
Mother's qualification                            0
Father's qualification                            0
Mother's occupation                               0
Father's occupation                               0
Displaced                                         0
Educational special needs                         0
Debtor                                            0
Tuition fees up to date                           0
Gender                                            0
Scholarship holder                                0
Age at enrollment                                 0
International                                     0
Curricular u

### 3) Analyze categorical feature distributions using .value_counts()

- Gender
- Course
- Tuition fees up to date
- Debtor

In [7]:
print("Course Distribution: ")
df['Course'].value_counts()

Course Distribution: 


Course
12    766
9     380
10    355
6     337
15    331
14    268
17    268
11    252
5     226
2     215
3     215
4     210
16    192
7     170
8     141
13     86
1      12
Name: count, dtype: int64

In [8]:
print("Gender Distribution: ",df['Gender'].value_counts())


Gender Distribution:  Gender
0    2868
1    1556
Name: count, dtype: int64


In [9]:
print('Tuition Fees Up to Date Distribution:',df['Tuition fees up to date'].value_counts())

Tuition Fees Up to Date Distribution: Tuition fees up to date
1    3896
0     528
Name: count, dtype: int64


In [10]:
print("Debtor Distribution:",df['Debtor'].value_counts())

Debtor Distribution: Debtor
0    3921
1     503
Name: count, dtype: int64


### 4) Create new engineered features

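- approval_rate = approved / enrolled: the proportion of first-semester curricular units a student passed relative to the units they enrolled in.
- performance_score = approved / evaluations: the proportion of first-semester evaluations that resulted in an approved unit.
- Both ratios are undefined (NaN) when the denominator is zero, for example when a student enrolled in no units.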

In [11]:
df['approval_rate'] = df['Curricular units 1st sem (approved)']/df['Curricular units 1st sem (enrolled)']
df['performance_score'] = df['Curricular units 1st sem (approved)']/df['Curricular units 1st sem (evaluations)']

In [12]:
print("Data with new columns: ")
df[['Course','Gender','approval_rate','performance_score']].head()

Data with new columns: 


Unnamed: 0,Course,Gender,approval_rate,performance_score
0,2,1,,
1,11,1,1.0,1.0
2,5,1,0.0,
3,15,0,1.0,0.75
4,3,0,0.833333,0.555556
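The NaN entries in this output come from rows where the denominator is zero. A minimal sketch of one way to guard the division is shown here, using a toy frame whose values are chosen to mirror the five rows above; the helper name `safe_ratio` and the short column names are hypothetical, and filling with 0.0 is a modeling assumption rather than the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three first-semester columns (illustrative values).
toy = pd.DataFrame({
    "approved":    [0, 6, 0, 6, 5],
    "enrolled":    [0, 6, 6, 6, 6],
    "evaluations": [0, 6, 0, 8, 9],
})

def safe_ratio(numerator, denominator, fill=0.0):
    """Divide two Series, returning `fill` wherever the denominator is zero."""
    # Replacing 0 with NaN makes the division yield NaN instead of inf or 0/0.
    ratio = numerator / denominator.replace(0, np.nan)
    return ratio.fillna(fill)

toy["approval_rate"] = safe_ratio(toy["approved"], toy["enrolled"])
toy["performance_score"] = safe_ratio(toy["approved"], toy["evaluations"])
print(toy)
```

Whether a student with no enrolled units should score 0.0 or be kept as missing depends on how the downstream model treats missing values.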


---
# Part 2: Preprocessing, Model Training, Regularization, and Evaluation
---

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

### 1) Feature Selection

- Choose relevant features such as:
  - Age at enrollment
  - Gender
  - Debtor
  - Tuition fees up to date
  - Curricular units 1st sem (approved)

In [14]:
features = ['Age at enrollment', 'Gender', 'Debtor', 'Tuition fees up to date', 'Curricular units 1st sem (approved)']
X = df[features]
X.head()

Unnamed: 0,Age at enrollment,Gender,Debtor,Tuition fees up to date,Curricular units 1st sem (approved)
0,20,1,0,1,0
1,19,1,0,0,6
2,19,1,0,0,0
3,20,0,0,1,6
4,45,0,0,1,5


### 2) Handle Missing Values
- For categorical columns: fill missing values with 'Unknown'
- For numerical columns: use .fillna() with the median

In [15]:
categorical_cols = ['Gender','Debtor','Tuition fees up to date']
numerical_cols = ['Age at enrollment','Curricular units 1st sem (approved)']

In [16]:
X = X.copy()

In [17]:
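# Assign the result back: fillna(inplace=True) on X[cols] only changes a temporary copy.
X[categorical_cols] = X[categorical_cols].fillna('Unknown')

In [18]:
# Fill numerical gaps with the column median, as described above.
X[numerical_cols] = X[numerical_cols].fillna(X[numerical_cols].median())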
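A common pitfall with this step is calling `fillna(inplace=True)` on a column selection, which edits a temporary object and leaves the original frame untouched. The sketch below illustrates this on a small toy frame (the values are hypothetical and not taken from the dataset), then fills by assignment instead.

```python
import warnings

import numpy as np
import pandas as pd

# Toy frame with one gap per column (illustrative values only).
toy = pd.DataFrame({
    "Gender": [1.0, np.nan, 0.0, 1.0],
    "Age at enrollment": [20.0, 19.0, np.nan, 45.0],
})
categorical_cols = ["Gender"]
numerical_cols = ["Age at enrollment"]

# Chained call: toy[cols] is a new object, so inplace=True edits a temporary.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy[numerical_cols].fillna(0, inplace=True)
still_missing = int(toy[numerical_cols].isna().sum().sum())
print("Missing after chained inplace fill:", still_missing)

# Assigning the result back is what actually updates the frame.
toy[categorical_cols] = toy[categorical_cols].astype("object").fillna("Unknown")
toy[numerical_cols] = toy[numerical_cols].fillna(toy[numerical_cols].median())
print(toy)
```

The cast to `object` is only there so that a string placeholder can sit in an otherwise numeric column.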

### 3) Encode Categorical Features

- Use one-hot encoding (pd.get_dummies()) to convert:
  - Gender
  - Debtor
  - Tuition fees up to date

In [19]:
categorical_cols

['Gender', 'Debtor', 'Tuition fees up to date']

In [20]:
X.head()

Unnamed: 0,Age at enrollment,Gender,Debtor,Tuition fees up to date,Curricular units 1st sem (approved)
0,20,1,0,1,0
1,19,1,0,0,6
2,19,1,0,0,0
3,20,0,0,1,6
4,45,0,0,1,5


In [21]:
X = pd.get_dummies(X,columns=categorical_cols,drop_first=True)

In [22]:
X.head()

Unnamed: 0,Age at enrollment,Curricular units 1st sem (approved),Gender_1,Debtor_1,Tuition fees up to date_1
0,20,0,True,False,True
1,19,6,True,False,False
2,19,0,True,False,False
3,20,6,False,False,True
4,45,5,False,False,True


### 4) Train/Test Split

Split the dataset into 70% training and 30% testing using train_test_split() from sklearn.

In [23]:
y = df['Target']

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### 5) Train a Baseline Model
- Use DecisionTreeClassifier to predict student dropout risk based on selected features

In [25]:
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train,y_train)

### 6) Evaluate the Model

- Use .score() to measure accuracy
- Generate a confusion matrix using confusion_matrix() to assess, for each class:
  - True Positives (TP)
  - False Positives (FP)
  - True Negatives (TN)
  - False Negatives (FN)

In [26]:
y_pred = model.predict(X_test)

In [27]:
print("Decision Tree Model Accuracy:",model.score(X_test,y_test))

Decision Tree Model Accuracy: 0.6641566265060241


In [28]:
print("confusion matrix:\n",confusion_matrix(y_test,y_pred))

confusion matrix:
 [[297  46  98]
 [ 81  41 123]
 [ 54  44 544]]
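Because the matrix is 3x3, TP, FP, FN, and TN are defined per class rather than once for the whole model. As a rough sketch, they can be read off the matrix printed above with NumPy, treating rows as true classes and columns as predicted classes (sklearn's convention); the class indices 0 to 2 follow the row order of the output.

```python
import numpy as np

# Confusion matrix copied from the decision tree output above.
cm = np.array([[297, 46, 98],
               [81, 41, 123],
               [54, 44, 544]])

tp = np.diag(cm)                # correctly predicted samples of each class
fp = cm.sum(axis=0) - tp        # predicted as the class, but actually another
fn = cm.sum(axis=1) - tp        # actually the class, but predicted as another
tn = cm.sum() - (tp + fp + fn)  # everything else

for k in range(cm.shape[0]):
    print(f"class {k}: TP={tp[k]} FP={fp[k]} FN={fn[k]} TN={tn[k]}")

# Accuracy is the diagonal total over all samples (882 / 1328 here).
print("accuracy:", tp.sum() / cm.sum())
```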


### 7) Apply Regularization

- Implement L1 (Lasso) regularization:
  - LogisticRegression(penalty='l1')
- Implement L2 (Ridge) regularization:
  - LogisticRegression(penalty='l2')

In [29]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [30]:
X_train_scaled = pd.DataFrame(X_train_scaled,columns=X_train.columns,index=X_train.index)

In [31]:
X_test_scaled = pd.DataFrame(X_test_scaled,columns=X_test.columns,index=X_test.index)

In [32]:
reg_model_l2 = LogisticRegression(penalty='l2',random_state=42,max_iter=2000)
reg_model_l2.fit(X_train_scaled,y_train)

In [33]:
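# Score on the scaled test set, since the model was fit on scaled features.
# (The figure recorded below came from scoring on the unscaled X_test.)
print("Regularized model accuracy (Ridge): ",reg_model_l2.score(X_test_scaled,y_test))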

Regularized model accuracy (Ridge):  0.6634036144578314


In [34]:
reg_model_l1 = LogisticRegression(penalty='l1',solver='saga',random_state=42,max_iter=2000)
reg_model_l1.fit(X_train_scaled,y_train)

In [35]:
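# Score on the scaled test set, since the model was fit on scaled features.
# (The figure recorded below came from scoring on the unscaled X_test.)
print("Regularized model accuracy (Lasso): ",reg_model_l1.score(X_test_scaled,y_test))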

Regularized model accuracy (Lasso):  0.6634036144578314
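One way to keep scaled and unscaled inputs from being mixed up is to bundle the scaler and the classifier in a single pipeline, so the same transformation is applied at fit and score time. The sketch below is an illustration on synthetic data from `make_classification`, not on the dropout dataset, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Each pipeline scales inside fit() and score(), so raw arrays can be passed.
models = {
    "l2": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l2", max_iter=2000),
    ),
    "l1": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="saga", max_iter=2000, random_state=42),
    ),
}

scores = {}
for name, pipe in models.items():
    pipe.fit(X_tr, y_tr)
    scores[name] = pipe.score(X_te, y_te)
    print(f"{name} accuracy: {scores[name]:.3f}")
```

A pipeline also plugs directly into `cross_val_score`, which then refits the scaler on each training fold.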


---
# Part 3: Ensemble Methods, Model Optimization, and Interpretation
---

Use ensemble learning and model evaluation techniques to build a robust, high-performing model for predicting student dropout risk.

To boost predictive performance and improve model robustness, you will implement ensemble methods. These combine multiple models to reduce errors and generalize better. You’ll also compare the model metrics for final evaluation.

### 1) Ensemble Learning
- Use Random Forest (Bagging) via RandomForestClassifier
- Use Gradient Boosting (Boosting) via GradientBoostingClassifier

In [36]:
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.model_selection import cross_val_score,KFold
from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix

In [37]:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train,y_train)

In [38]:
y_pred = rf_model.predict(X_test)
print("\nRandom Forest Model Accuracy:", accuracy_score(y_test,y_pred))
print("Random Forest Model Precision:", precision_score(y_test,y_pred,average='weighted'))
print("Random Forest Model Recall:", recall_score(y_test,y_pred,average='weighted'))


Random Forest Model Accuracy: 0.6807228915662651
Random Forest Model Precision: 0.6495406357253748
Random Forest Model Recall: 0.6807228915662651


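Precision, recall, and F1-score are each defined with respect to a single positive class, so they are naturally binary metrics. The Target column here has more than two classes, so you must tell sklearn how to combine the per-class scores into one overall number.

That’s what average controls: 'macro' takes the unweighted mean of the per-class scores, 'weighted' weights each class by its number of true samples (which matters when the classes are imbalanced), and 'micro' pools all predictions before computing the metric.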
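To make the effect of `average` concrete, here is a small sketch on hand-made labels (the eight labels are invented for illustration). Class 0 has four true samples and classes 1 and 2 have two each, so the macro and weighted averages differ.

```python
from sklearn.metrics import precision_score

# Invented labels: supports are 4, 2, 2 for classes 0, 1, 2.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

per_class = precision_score(y_true, y_pred, average=None)       # [3/4, 2/3, 1/1]
macro = precision_score(y_true, y_pred, average="macro")        # plain mean of the three
weighted = precision_score(y_true, y_pred, average="weighted")  # mean weighted by support
micro = precision_score(y_true, y_pred, average="micro")        # pooled: 6 correct of 8

print("per class:", per_class)
print("macro:", macro, "weighted:", weighted, "micro:", micro)
```

For single-label multiclass problems, micro-averaged precision equals accuracy, and weighted recall does too, which is why the recall and accuracy figures reported in this notebook coincide.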

In [39]:
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train,y_train)

### 2) Model Optimization with Cross-Validation

- Apply K-fold cross-validation using cross_val_score() to validate your model across different data subsets

In [40]:
kf = KFold(n_splits=5,shuffle=True,random_state=42)
cv_scores = cross_val_score(gb_model,X_train,y_train,cv=kf,scoring='accuracy')
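The array returned by `cross_val_score` holds one score per fold, and it is usually summarized by its mean and standard deviation. The sketch below shows that summary on synthetic data, since no cross-validation output is recorded in this notebook; the data, the smaller `n_estimators`, and the name `demo_scores` are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic 3-class data standing in for X_train / y_train (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_val_score(
    GradientBoostingClassifier(n_estimators=30, random_state=42),
    X_demo, y_demo, cv=kf, scoring="accuracy",
)

# One accuracy per fold; the spread hints at how stable the model is.
print("Fold accuracies:", demo_scores.round(3))
print(f"Mean: {demo_scores.mean():.3f}, Std: {demo_scores.std():.3f}")
```

The same two print statements applied to `cv_scores` would report the cross-validated accuracy of the gradient boosting model on the training split.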

### 3) Evaluate and Compare Models

- Use the following metrics to assess performance:
- Accuracy: Overall prediction correctness
- Precision: Accuracy of positive predictions
- Recall: Ability to find all positive cases
- Confusion Matrix: Breakdown of correct and incorrect classifications

In [41]:
y_pred = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, y_pred)
gb_precision = precision_score(y_test, y_pred, average='weighted')
gb_recall = recall_score(y_test, y_pred, average='weighted')
gb_cm = confusion_matrix(y_test, y_pred)

In [42]:
print("\nGradient Boosting Model Accuracy:", gb_accuracy)
print("Gradient Boosting Model Precision:", gb_precision)
print("Gradient Boosting Model Recall:", gb_recall)
print("Gradient Boosting Confusion Matrix:\n", gb_cm)


Gradient Boosting Model Accuracy: 0.7070783132530121
Gradient Boosting Model Precision: 0.6761677663033085
Gradient Boosting Model Recall: 0.7070783132530121
Gradient Boosting Confusion Matrix:
 [[304  46  91]
 [ 64  44 137]
 [ 37  14 591]]
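The weighted figures above hide large differences between classes. As a sketch, per-class precision and recall can be recomputed from the printed matrix, assuming rows are true classes in sklearn's sorted label order; the class names are not confirmed by the excerpt, so the classes are referred to by index.

```python
import numpy as np

# Confusion matrix copied from the gradient boosting output above.
gb_cm_recorded = np.array([[304, 46, 91],
                           [64, 44, 137],
                           [37, 14, 591]])

support = gb_cm_recorded.sum(axis=1)                        # true samples per class
recall = np.diag(gb_cm_recorded) / support                  # share of each class found
precision = np.diag(gb_cm_recorded) / gb_cm_recorded.sum(axis=0)

# Support-weighted averages, as average='weighted' computes them.
weighted_precision = (precision * support).sum() / support.sum()
weighted_recall = (recall * support).sum() / support.sum()

for k in range(3):
    print(f"class {k}: precision={precision[k]:.3f} recall={recall[k]:.3f}")
print("weighted precision:", weighted_precision)
print("weighted recall:", weighted_recall)
```

The middle class is recovered only 44 times out of 245, so a model that looks acceptable on weighted metrics can still miss most students in one group.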


This final part helps determine the most reliable model for deployment and interprets how well it can identify at-risk students for proactive educational intervention.