# Student Dropout Prediction – Logistic Regression

This project explores factors contributing to student dropout using
Logistic Regression.

The notebook covers preprocessing, feature encoding,
model training, evaluation, and analysis of results.

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


In [2]:
df = pd.read_csv('student_dropout_dataset_v3.csv')

In [3]:
df.head()

Unnamed: 0,Student_ID,Age,Gender,Family_Income,Internet_Access,Study_Hours_per_Day,Attendance_Rate,Assignment_Delay_Days,Travel_Time_Minutes,Part_Time_Job,Scholarship,Stress_Index,GPA,Semester_GPA,CGPA,Semester,Department,Parental_Education,Dropout
0,1,22.1,Male,25000.0,Yes,3.36,86.1,2,20.4,Yes,No,5.5,0.96,0.9,0.9,Year 1,Arts,High School,0
1,2,20.7,Male,25000.0,Yes,4.3,68.0,2,44.0,No,No,6.8,1.28,1.2,1.19,Year 3,Engineering,Bachelor,1
2,3,22.4,Male,40183.0,Yes,4.4,70.9,0,48.9,Yes,No,5.5,1.68,1.32,1.32,Year 1,Arts,Master,0
3,4,24.4,Male,,Yes,,82.2,2,38.6,No,No,,1.78,1.77,1.77,Year 1,CS,High School,1
4,5,20.5,Female,25319.0,Yes,4.19,75.7,1,23.0,No,No,7.0,1.48,0.91,0.87,Year 4,Business,Bachelor,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Student_ID             10000 non-null  int64  
 1   Age                    10000 non-null  float64
 2   Gender                 10000 non-null  object 
 3   Family_Income          9500 non-null   float64
 4   Internet_Access        10000 non-null  object 
 5   Study_Hours_per_Day    9500 non-null   float64
 6   Attendance_Rate        10000 non-null  float64
 7   Assignment_Delay_Days  10000 non-null  int64  
 8   Travel_Time_Minutes    10000 non-null  float64
 9   Part_Time_Job          10000 non-null  object 
 10  Scholarship            10000 non-null  object 
 11  Stress_Index           9500 non-null   float64
 12  GPA                    10000 non-null  float64
 13  Semester_GPA           10000 non-null  float64
 14  CGPA                   10000 non-null  float64
 15  Sem

In [5]:
df.describe()

Unnamed: 0,Student_ID,Age,Family_Income,Study_Hours_per_Day,Attendance_Rate,Assignment_Delay_Days,Travel_Time_Minutes,Stress_Index,GPA,Semester_GPA,CGPA,Dropout
count,10000.0,10000.0,9500.0,9500.0,10000.0,10000.0,10000.0,9500.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,21.02606,38377.247474,4.014592,81.73683,1.7997,30.17926,5.507147,2.30844,2.300057,2.298761,0.2354
std,2886.89568,2.13981,20496.232179,1.29545,8.22093,1.344307,11.91887,1.765951,1.061717,1.074407,1.072555,0.42427
min,1.0,17.0,25000.0,0.5,38.2,0.0,5.0,1.0,0.0,0.0,0.0,0.0
25%,2500.75,19.5,25000.0,3.16,76.4,1.0,21.9,4.3,1.55,1.52,1.52,0.0
50%,5000.5,21.0,29740.5,4.0,81.8,2.0,30.2,5.5,2.35,2.35,2.35,0.0
75%,7500.25,22.5,44520.0,4.87,87.3,3.0,38.4,6.7,3.12,3.15,3.15,0.0
max,10000.0,29.6,316601.0,8.98,100.0,8.0,74.9,10.0,4.0,4.0,4.0,1.0


In [6]:
# Fill missing Family_Income values with the median, since the distribution is right-skewed
x = df.Family_Income.median()
df['Family_Income'] = df.Family_Income.fillna(x)

In [7]:
df.head(6)

Unnamed: 0,Student_ID,Age,Gender,Family_Income,Internet_Access,Study_Hours_per_Day,Attendance_Rate,Assignment_Delay_Days,Travel_Time_Minutes,Part_Time_Job,Scholarship,Stress_Index,GPA,Semester_GPA,CGPA,Semester,Department,Parental_Education,Dropout
0,1,22.1,Male,25000.0,Yes,3.36,86.1,2,20.4,Yes,No,5.5,0.96,0.9,0.9,Year 1,Arts,High School,0
1,2,20.7,Male,25000.0,Yes,4.3,68.0,2,44.0,No,No,6.8,1.28,1.2,1.19,Year 3,Engineering,Bachelor,1
2,3,22.4,Male,40183.0,Yes,4.4,70.9,0,48.9,Yes,No,5.5,1.68,1.32,1.32,Year 1,Arts,Master,0
3,4,24.4,Male,29740.5,Yes,,82.2,2,38.6,No,No,,1.78,1.77,1.77,Year 1,CS,High School,1
4,5,20.5,Female,25319.0,Yes,4.19,75.7,1,23.0,No,No,7.0,1.48,0.91,0.87,Year 4,Business,Bachelor,0
5,6,20.5,Male,25000.0,Yes,4.11,89.1,2,47.1,No,Yes,6.0,2.52,2.72,2.69,Year 3,Business,,0


In [8]:
# Study_Hours_per_Day is roughly symmetrical, so mean and median are close; fill missing values with the mean
x = df.Study_Hours_per_Day.mean()
df['Study_Hours_per_Day'] = df.Study_Hours_per_Day.fillna(x)

In [9]:
# Stress_Index is roughly symmetrical, so mean and median are close; fill missing values with the mean
x = df.Stress_Index.mean()
df['Stress_Index'] = df.Stress_Index.fillna(x)
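
The choice between median and mean in the cells above can be backed by a quick skewness check. The following is a minimal sketch on synthetic data, not the notebook's dataset; the helper `impute_by_skew` and its threshold of 1.0 are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real columns (not the notebook's data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income": rng.lognormal(mean=10.0, sigma=0.8, size=1000),  # right-skewed
    "stress": rng.normal(loc=5.5, scale=1.5, size=1000),       # roughly symmetric
})
# Blank out 5% of each column to mimic the missing entries.
demo.loc[demo.sample(frac=0.05, random_state=0).index, "income"] = np.nan
demo.loc[demo.sample(frac=0.05, random_state=1).index, "stress"] = np.nan


def impute_by_skew(series, threshold=1.0):
    """Fill NaNs with the median if |skew| exceeds threshold, else with the mean.

    The threshold is a hypothetical rule of thumb, not a fixed standard.
    """
    use_median = abs(series.skew()) > threshold
    fill_value = series.median() if use_median else series.mean()
    return series.fillna(fill_value), ("median" if use_median else "mean")


income_filled, income_rule = impute_by_skew(demo["income"])
stress_filled, stress_rule = impute_by_skew(demo["stress"])
print("income ->", income_rule, "| stress ->", stress_rule)
```

In a stricter setup the fill values would be computed on the training split only, so that no information from the test rows leaks into preprocessing.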

In [10]:
# numeric_only=True skips the text columns that are not encoded yet (required in pandas >= 2.0)
df.corr(numeric_only=True)['Dropout'].sort_values()


GPA                     -0.460352
Semester_GPA            -0.445396
CGPA                    -0.444807
Attendance_Rate         -0.163539
Study_Hours_per_Day     -0.087177
Family_Income           -0.010324
Student_ID               0.007434
Age                      0.007585
Travel_Time_Minutes      0.028080
Assignment_Delay_Days    0.082327
Stress_Index             0.249356
Dropout                  1.000000
Name: Dropout, dtype: float64

Dropout rate across the categorical variables

In [11]:
pd.crosstab(df.Gender,df.Dropout,normalize='index')* 100

Dropout,0,1
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,76.152465,23.847535
Male,76.768892,23.231108


In [12]:
pd.crosstab(df.Internet_Access,df.Dropout,normalize='index')* 100

Dropout,0,1
Internet_Access,Unnamed: 1_level_1,Unnamed: 2_level_1
No,71.567831,28.432169
Yes,77.146767,22.853233


In [13]:
pd.crosstab(df.Part_Time_Job,df.Dropout,normalize='index')* 100

Dropout,0,1
Part_Time_Job,Unnamed: 1_level_1,Unnamed: 2_level_1
No,77.735157,22.264843
Yes,74.55045,25.44955


In [14]:
pd.crosstab(df.Scholarship,df.Dropout,normalize='index')* 100

Dropout,0,1
Scholarship,Unnamed: 1_level_1,Unnamed: 2_level_1
No,76.252119,23.747881
Yes,76.844204,23.155796


In [15]:
pd.crosstab(df.Semester,df.Dropout,normalize='index')* 100

Dropout,0,1
Semester,Unnamed: 1_level_1,Unnamed: 2_level_1
Year 1,77.841141,22.158859
Year 2,75.602894,24.397106
Year 3,76.120587,23.879413
Year 4,76.301262,23.698738


In [16]:
pd.crosstab(df.Department,df.Dropout,normalize='index')* 100

Dropout,0,1
Department,Unnamed: 1_level_1,Unnamed: 2_level_1
Arts,76.061204,23.938796
Business,76.673327,23.326673
CS,76.595745,23.404255
Engineering,75.993805,24.006195
Science,76.952935,23.047065


In [17]:
pd.crosstab(df.Parental_Education,df.Dropout,normalize='index')* 100

Dropout,0,1
Parental_Education,Unnamed: 1_level_1,Unnamed: 2_level_1
Bachelor,76.778931,23.221069
High School,75.390625,24.609375
Master,77.883175,22.116825
None,74.951076,25.048924
PhD,76.344086,23.655914


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Student_ID             10000 non-null  int64  
 1   Age                    10000 non-null  float64
 2   Gender                 10000 non-null  object 
 3   Family_Income          10000 non-null  float64
 4   Internet_Access        10000 non-null  object 
 5   Study_Hours_per_Day    10000 non-null  float64
 6   Attendance_Rate        10000 non-null  float64
 7   Assignment_Delay_Days  10000 non-null  int64  
 8   Travel_Time_Minutes    10000 non-null  float64
 9   Part_Time_Job          10000 non-null  object 
 10  Scholarship            10000 non-null  object 
 11  Stress_Index           10000 non-null  float64
 12  GPA                    10000 non-null  float64
 13  Semester_GPA           10000 non-null  float64
 14  CGPA                   10000 non-null  float64
 15  Sem

In [19]:
df['Parental_Education'].unique()

array(['High School', 'Bachelor', 'Master', 'None', 'PhD'], dtype=object)

In [20]:
df['Gender'] = df.Gender.map({'Male':0, 'Female':1})
df['Internet_Access'] = df['Internet_Access'].map({'Yes':1, 'No':0})
df['Part_Time_Job'] = df['Part_Time_Job'].map({'Yes':1,'No':0})
df['Scholarship'] = df['Scholarship'].map({'Yes':1,'No':0})
df['Semester'] = df['Semester'].map({'Year 1':1,'Year 2':2,'Year 3':3,'Year 4':4})
df['Parental_Education'] = df['Parental_Education'].map({'None':0,'High School':1,'Bachelor':2,'Master':3,'PhD':4})

df = pd.get_dummies(df,columns=['Department'],drop_first=True)
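
One caveat with `Series.map` is that any label missing from the dictionary silently becomes `NaN`. The sketch below shows a sanity check that could follow the encoding step; the toy series and the extra label `"Diploma"` are made up for illustration.

```python
import pandas as pd

# Toy column: the notebook's categories plus one unexpected label.
education = pd.Series(["High School", "Bachelor", "None", "PhD", "Diploma"])
order = {"None": 0, "High School": 1, "Bachelor": 2, "Master": 3, "PhD": 4}

encoded = education.map(order)

# Labels that were not in the mapping end up as NaN after map().
unmapped = education[encoded.isna()].unique().tolist()
print("unmapped labels:", unmapped)
```

Note also that the ordinal codes treat the gaps between education levels as equal, which is a modelling assumption; one-hot encoding, as used for `Department`, avoids it at the cost of more columns.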

In [21]:
df['Parental_Education'].unique()

array([1, 2, 3, 0, 4])

In [22]:
df

Unnamed: 0,Student_ID,Age,Gender,Family_Income,Internet_Access,Study_Hours_per_Day,Attendance_Rate,Assignment_Delay_Days,Travel_Time_Minutes,Part_Time_Job,...,GPA,Semester_GPA,CGPA,Semester,Parental_Education,Dropout,Department_Business,Department_CS,Department_Engineering,Department_Science
0,1,22.1,0,25000.0,1,3.360000,86.1,2,20.4,1,...,0.96,0.90,0.90,1,1,0,0,0,0,0
1,2,20.7,0,25000.0,1,4.300000,68.0,2,44.0,0,...,1.28,1.20,1.19,3,2,1,0,0,1,0
2,3,22.4,0,40183.0,1,4.400000,70.9,0,48.9,1,...,1.68,1.32,1.32,1,3,0,0,0,0,0
3,4,24.4,0,29740.5,1,4.014592,82.2,2,38.6,0,...,1.78,1.77,1.77,1,1,1,0,1,0,0
4,5,20.5,1,25319.0,1,4.190000,75.7,1,23.0,0,...,1.48,0.91,0.87,4,2,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,23.9,1,42286.0,0,4.620000,92.0,0,10.0,1,...,1.60,0.99,0.97,2,2,0,0,0,0,0
9996,9997,17.0,1,61103.0,1,2.870000,75.2,3,32.4,0,...,3.09,3.09,3.09,1,3,1,1,0,0,0
9997,9998,19.4,0,25000.0,1,4.730000,74.9,4,25.4,0,...,3.45,3.37,3.43,4,2,0,1,0,0,0
9998,9999,22.1,1,40302.0,1,5.850000,74.2,1,5.0,0,...,3.35,3.34,3.34,1,1,0,0,1,0,0


Next, we fit a logistic regression model.

In [23]:
model = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',
    random_state=42
)

In [24]:
X = df.drop(columns=['Student_ID','Dropout'])
y = df['Dropout']

In [25]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [26]:
model.fit(X_train,y_train)

In [27]:
y_test

6252    0
4684    0
1731    0
4742    1
4521    1
       ..
8014    0
1074    0
3063    0
6487    0
4705    1
Name: Dropout, Length: 3000, dtype: int64

In [28]:
model.predict(X_test)

array([0, 0, 1, ..., 0, 0, 1])

In [29]:
model.score(X_test,y_test)

0.7386666666666667

In [30]:
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
})

In [31]:
coef_df

Unnamed: 0,Feature,Coefficient
0,Age,0.04549873
1,Gender,0.004259197
2,Family_Income,3.096051e-07
3,Internet_Access,-0.008385958
4,Study_Hours_per_Day,-0.06371305
5,Attendance_Rate,-0.01678258
6,Assignment_Delay_Days,0.1104063
7,Travel_Time_Minutes,0.01279704
8,Part_Time_Job,0.02190745
9,Scholarship,0.001569874


In [32]:
coef_df.sort_values(by='Coefficient', ascending=False)

Unnamed: 0,Feature,Coefficient
10,Stress_Index,0.3242925
6,Assignment_Delay_Days,0.1104063
0,Age,0.04549873
14,Semester,0.03019642
8,Part_Time_Job,0.02190745
7,Travel_Time_Minutes,0.01279704
1,Gender,0.004259197
19,Department_Science,0.002249321
9,Scholarship,0.001569874
15,Parental_Education,0.001333737
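
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. A minimal sketch using three coefficients copied from the sorted table above (first model, unscaled features):

```python
import math

# Coefficients copied from the sorted table above.
coefficients = {
    "Stress_Index": 0.3242925,
    "Assignment_Delay_Days": 0.1104063,
    "Age": 0.04549873,
}

# exp(coefficient) is the multiplicative change in the odds of dropout
# for a one-unit increase in the feature, holding the other features fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Because the features are on different scales, a "one-unit increase" means different things for each of them, so these ratios should not be ranked against each other without standardizing first.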


Having removed the features with very weak coefficients, we refit the model on the remaining ones.

In [33]:
X = df[['Stress_Index',
                 'Assignment_Delay_Days',
                 'GPA',
                 'Semester_GPA',
                 'CGPA',
                 'Study_Hours_per_Day']]
# X
y = df['Dropout']

In [34]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [35]:
model.fit(X_train,y_train)

In [36]:
model.predict(X_test)

array([0, 0, 1, ..., 0, 0, 1])

In [37]:
y_test

6252    0
4684    0
1731    0
4742    1
4521    1
       ..
8014    0
1074    0
3063    0
6487    0
4705    1
Name: Dropout, Length: 3000, dtype: int64

In [38]:
model.score(X_test,y_test)

0.735

Next, we check the effect of dropping one of the available GPA variables, or of keeping only one of them.

In [39]:
X = df[['Stress_Index',
                 'Assignment_Delay_Days',
                 # 'GPA',
                 'Semester_GPA',
                 'CGPA',
                 'Study_Hours_per_Day']]
# X
y = df['Dropout']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [40]:
model.fit(X_train,y_train)

In [41]:
model.predict(X_test)

array([0, 0, 1, ..., 0, 0, 1])

In [42]:
model.score(X_test,y_test)

0.734

We retrained the model leaving out one GPA variable at a time and found that the test accuracy with one, two, or all three of them was nearly identical (between 0.733 and 0.737 across the runs in this notebook). Keeping all three adds little information, so a single GPA variable is enough.

In [43]:
X = df[['Stress_Index',
                 'Assignment_Delay_Days',
                 'GPA',
                 # 'Semester_GPA',
                 'CGPA',
                 'Study_Hours_per_Day']]
# X
y = df['Dropout']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [44]:
model.fit(X_train,y_train)

In [45]:
model.predict(X_test)

array([0, 0, 1, ..., 0, 0, 1])

In [46]:
model.score(X_test,y_test)

0.7366666666666667

Our final trained model keeps GPA as the only GPA variable and uses a stratified train/test split.

In [47]:
X = df[['Stress_Index',
                 'Assignment_Delay_Days',
                 'GPA',
                 # 'Semester_GPA',
                 # 'CGPA',
                 'Study_Hours_per_Day']]
# X
y = df['Dropout']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42,stratify=y)
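
This cell adds `stratify=y`, so its test split differs from the one used in the earlier experiments and the scores are not strictly comparable. The sketch below illustrates what stratification does, using synthetic labels with roughly the notebook's class balance; `X_demo` and `y_demo` are placeholders, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with about 24% positives, plus dummy features.
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.2354).astype(int)
X_demo = rng.normal(size=(10_000, 3))

# Same split parameters, without and with stratification.
_, _, _, y_plain = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
_, _, _, y_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

overall_rate = y_demo.mean()
print(f"overall positive rate: {overall_rate:.4f}")
print(f"plain split test rate: {y_plain.mean():.4f}")
print(f"stratified test rate:  {y_strat.mean():.4f}")
```

With stratification the test-set positive rate matches the overall rate up to rounding, while the plain split can drift by chance.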

In [48]:
model.fit(X_train,y_train)

In [49]:
model.predict(X_test)

array([0, 0, 0, ..., 0, 1, 0])

In [50]:
y_pred = model.predict(X_test)

In [51]:
model.score(X_test,y_test)

0.733

In [52]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.74      0.81      2294
           1       0.46      0.72      0.56       706

    accuracy                           0.73      3000
   macro avg       0.68      0.73      0.68      3000
weighted avg       0.79      0.73      0.75      3000
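
The F1 figures in this report can be reproduced from the precision and recall columns. The sketch below works from the rounded values printed above, so results agree with the report only to within rounding error.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the classification report above.
report = {
    0: {"precision": 0.90, "recall": 0.74, "support": 2294},
    1: {"precision": 0.46, "recall": 0.72, "support": 706},
}

f1_by_class = {label: f1(r["precision"], r["recall"]) for label, r in report.items()}
total = sum(r["support"] for r in report.values())

# Macro average: plain mean over classes. Weighted average: weighted by support.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
weighted_f1 = sum(f1_by_class[k] * report[k]["support"] for k in report) / total

print({k: round(v, 3) for k, v in f1_by_class.items()})
print(f"macro F1 = {macro_f1:.3f}, weighted F1 = {weighted_f1:.3f}")
```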



In [53]:
model.score(X_train,y_train)

0.7344285714285714

In [54]:
y_test.value_counts()

0    2294
1     706
Name: Dropout, dtype: int64
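
These counts give a useful reference point: a classifier that always predicts "no dropout" would already be right for every class-0 student. A short worked calculation, using the counts above and the final model's test score of 0.733:

```python
# Test-set class counts copied from y_test.value_counts() above.
counts = {0: 2294, 1: 706}
model_accuracy = 0.733  # final model's test score reported earlier

total = sum(counts.values())

# Baseline that always predicts the majority class (no dropout).
majority_accuracy = max(counts.values()) / total
majority_recall_dropout = 0.0  # such a baseline never flags a dropout

print(f"majority-class accuracy: {majority_accuracy:.4f}")
print(f"model accuracy:          {model_accuracy:.4f}")
```

The model's accuracy is slightly below this baseline, which is expected with `class_weight='balanced'`: it trades some overall accuracy for the ability to catch dropouts at all. This is why recall and F1 for the dropout class are more informative here than accuracy alone.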

### Key Findings from EDA

Academic performance variables (GPA, Semester_GPA, CGPA) showed the strongest relationship with student dropout.

Higher Stress_Index and increased Assignment_Delay_Days were associated with higher dropout probability.

Variables such as Department, Gender, and Scholarship showed little association with dropout: dropout rates differed by at most about one percentage point across their categories.

The dataset was moderately imbalanced: about 23.5% of students dropped out.

#### Model Insights (Logistic Regression)

Logistic Regression was used as a baseline classification model.

Class imbalance was handled using `class_weight='balanced'`.
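
For reference, scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the class counts implied by the overall dropout rate (0.2354 of 10,000 rows); the fitted model used the training split, so its absolute counts differ, but the resulting weights are essentially the same.

```python
# Class counts implied by the overall dropout rate (0.2354 of 10,000 rows).
class_counts = {0: 7646, 1: 2354}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Balanced weight per class: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight = {weight:.3f}")
```

Each dropout example therefore counts roughly three times as much as a non-dropout example in the loss, which pushes the decision boundary toward higher recall for the minority class.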

Model performance on test data (final model):

- Accuracy ≈ 0.73
- Recall for dropout class ≈ 0.72
- F1-score for dropout class ≈ 0.56

The model favors recall for the dropout class: it identifies most at-risk students (recall ≈ 0.72) at the cost of many false positives (precision ≈ 0.46).

#### Feature Importance (From Coefficients)

Positive coefficients (increase dropout probability):

- Stress_Index
- Assignment_Delay_Days

Negative coefficients (reduce dropout probability):

- GPA-related features
- Study_Hours_per_Day
- Attendance_Rate

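Combined with the correlations from the EDA, this suggests that academic performance is the strongest protective factor against dropout. Because the features were not scaled, coefficient magnitudes alone are not directly comparable across features.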

#### Feature Selection Experiments

Multiple feature combinations were tested.

Removing highly similar features (GPA / Semester_GPA / CGPA) changed test accuracy by less than 0.005.

This indicates strong overlap (multicollinearity) among academic performance variables.
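
Overlap of this kind can be checked directly with a correlation matrix and variance inflation factors (VIFs). The sketch below uses synthetic columns that mimic three near-duplicate GPA measures; on the real data the same check would start from `df[['GPA', 'Semester_GPA', 'CGPA']].corr()`.

```python
import numpy as np

# Synthetic stand-ins: three noisy views of one underlying signal.
rng = np.random.default_rng(0)
ability = rng.normal(loc=2.3, scale=1.0, size=5000)
gpa = ability + rng.normal(scale=0.1, size=5000)
semester_gpa = ability + rng.normal(scale=0.1, size=5000)
cgpa = semester_gpa + rng.normal(scale=0.02, size=5000)

features = np.column_stack([gpa, semester_gpa, cgpa])
corr = np.corrcoef(features, rowvar=False)

# The VIF of each feature is the matching diagonal entry of the
# inverse correlation matrix; values above about 10 are a common warning sign.
vif = np.diag(np.linalg.inv(corr))

print("correlation matrix:\n", corr.round(3))
print("VIF:", vif.round(1))
```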

#### Final Conclusion

Logistic Regression provides a simple and interpretable baseline for dropout prediction.

Academic performance and stress-related features are the main predictors.

The model shows no sign of overfitting: train and test accuracy were nearly identical (0.734 vs. 0.733).

Further improvement could involve:

- Feature scaling
- Cross-validation
- Trying more advanced models (e.g., tree-based methods)
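
The first two suggestions can be combined in a single scikit-learn pipeline. This is a minimal sketch on synthetic data generated with `make_classification`, not a result from the dropout dataset; the variable names and the choice of recall as the scoring metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for four selected features.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.76, 0.24], random_state=42,
)

# Scaling sits inside the pipeline, so each fold is scaled using
# statistics from its own training portion only.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
recall_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="recall")

print("recall per fold:", recall_scores.round(3))
print(f"mean recall: {recall_scores.mean():.3f}")
```

Averaging over folds gives a more stable estimate than a single 70/30 split, and standardized features make the fitted coefficients comparable in magnitude.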

#### Learning Outcomes

Understanding binary classification using logistic regression.

Handling categorical data with encoding.

Interpreting model coefficients.

Evaluating models using precision, recall, and F1-score.