# **Background**

Student academic performance is influenced not only by intellectual ability but also by mental health, daily habits, and lifestyle factors. Analyzing data on these aspects can provide valuable insights into how they affect learning outcomes.

The source dataset contains survey responses from 1,000 students, including demographic information, lifestyle habits (screen time, sleep, physical activity), psychological indicators (stress level, exam anxiety), and self-reported changes in academic performance. The file loaded in this notebook (`StudentMentalHealth_dirty.csv`) is a noisier version with 1,250 rows that include duplicates, missing values, and invalid entries. By exploring and modeling this data, this project aims to understand the relationships between mental health, behavior, and academic outcomes.

Source: https://www.kaggle.com/datasets/utkarshsharma11r/student-mental-health-analysis

**Import Library & Data Extraction**

In [None]:
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.float_format', '{:.2f}'.format)

In [None]:
data_student_mentalhealth = pd.read_csv('/content/StudentMentalHealth_dirty.csv')

display(data_student_mentalhealth)

Unnamed: 0,Name,Gender,Age,Education Level,Screen Time (hrs/day),Sleep Duration (hrs),Physical Activity (hrs/week),Stress Level,Anxious Before Exams,Academic Performance Change
0,Aarav,Male,15.0,Class 8,7.1,8.90,9.30,Medium,No,Same
1,Meera,Female,25.0,MSc,3.3,-5.00,0.20,Medium,No,Same
2,Ishaan,Male,20.0,BTech,9.5,150.00,6.20,Medium,No,Same
3,Aditya,Male,20.0,BA,10.8,5.60,5.50,High,Yes,Same
4,Anika,Female,17.0,Class 11,2.8,5.40,3.10,Medium,Yes,Same
...,...,...,...,...,...,...,...,...,...,...
1245,Vivaan,Male,21.0,BTech,6.1,6.20,4.00,Low,Yes,Declined
1246,Arjun,Male,15.0,Class 9,5.5,8.00,6.90,Medium,Yes,Same
1247,Aarav,Male,22.0,,5.5,4.40,-5.00,Medium,No,Declined
1248,Ananya,Female,16.0,Class 10,10.4,8.00,6.60,High,No,Declined


In [None]:
print(data_student_mentalhealth.columns)

Index(['Name', 'Gender', 'Age', 'Education Level', 'Screen Time (hrs/day)',
       'Sleep Duration (hrs)', 'Physical Activity (hrs/week)', 'Stress Level',
       'Anxious Before Exams', 'Academic Performance Change'],
      dtype='object')


In [None]:
print(data_student_mentalhealth.index)

RangeIndex(start=0, stop=1250, step=1)


**Understanding Data**

**Data Dictionary**

| **Column Name**                  | **Description**                                                          |
| -------------------------------- | ------------------------------------------------------------------------ |
| **Name**                         | Student's first name (not needed for analysis; can be anonymized)        |
| **Gender**                       | Gender of the respondent (Male, Female, Other)                           |
| **Age**                          | Student's age in years                                                   |
| **Education Level**              | Current academic level (e.g., Class 8, BTech, MSc)                       |
| **Screen Time (hrs/day)**        | Average screen time per day during online learning, in hours             |
| **Sleep Duration (hrs)**         | Average sleep duration per night, in hours                               |
| **Physical Activity (hrs/week)** | Weekly exercise time, in hours                                           |
| **Stress Level**                 | Self-reported stress level (Low, Medium, High)                           |
| **Anxious Before Exams**         | Whether the student experiences anxiety before exams (Yes/No)            |
| **Academic Performance Change**  | Self-assessed change in academic performance (Improved, Declined, Same)  |


In [None]:
data_student_mentalhealth.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250 entries, 0 to 1249
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Name                          1240 non-null   object 
 1   Gender                        1239 non-null   object 
 2   Age                           1240 non-null   object 
 3   Education Level               1240 non-null   object 
 4   Screen Time (hrs/day)         1236 non-null   object 
 5   Sleep Duration (hrs)          1232 non-null   float64
 6   Physical Activity (hrs/week)  1243 non-null   float64
 7   Stress Level                  1238 non-null   object 
 8   Anxious Before Exams          1233 non-null   object 
 9   Academic Performance Change   1235 non-null   object 
dtypes: float64(2), object(8)
memory usage: 97.8+ KB


**Insight**: The dataset consists of 1,250 rows and 10 columns. Some numerical variables are still stored as object dtype. For example, the 'Age' column may contain the string 'twenty' instead of 20. Every column has missing values.
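One way to locate the entries that keep a numeric column stuck as object dtype is to coerce the column with `pd.to_numeric(errors='coerce')` and look at the values that were present but failed to parse. The sketch below uses a small made-up series, not the real file.

```python
import pandas as pd

# Illustrative values only: an object-typed 'Age' column with one spelled-out number
age = pd.Series(["15.0", "twenty", "25.0", None, "17.0"], name="Age")

# Coerce to numbers; anything that cannot be parsed becomes NaN
parsed = pd.to_numeric(age, errors="coerce")

# Entries that were present in the raw column but failed to parse
bad_entries = age[parsed.isna() & age.notna()]
print(bad_entries.tolist())
```

The same check can be repeated for each object column that is expected to be numeric, such as 'Screen Time (hrs/day)'.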

In [None]:
# Categorical

column_categorical = data_student_mentalhealth.select_dtypes(include='object').columns

for col in column_categorical:
    print(f'> Frequency of \033[93m{col}\033[0m')
    print(f'  Unique values: {data_student_mentalhealth[col].nunique(dropna=False)}\n')
    display(
        data_student_mentalhealth[col]
        .value_counts(dropna=False)
        .reset_index()
    )
    print('\n')

> Frequency of Name
  Unique values: 31



Unnamed: 0,Name,count
0,Shaurya,77
1,Kavya,68
2,Meera,66
3,Aadhya,66
4,Diya,64
5,Arjun,64
6,Anika,61
7,Krishna,61
8,Myra,60
9,Reyansh,60




> Frequency of Gender
  Unique values: 4



Unnamed: 0,Gender,count
0,Female,592
1,Male,578
2,Other,69
3,,11




> Frequency of Age
  Unique values: 17



Unnamed: 0,Age,count
0,17.0,121
1,21.0,120
2,23.0,115
3,15.0,108
4,20.0,104
5,16.0,104
6,25.0,96
7,19.0,96
8,26.0,90
9,22.0,88




> Frequency of Education Level
  Unique values: 12



Unnamed: 0,Education Level,count
0,MSc,172
1,MTech,172
2,MA,164
3,Class 10,116
4,Class 11,115
5,BSc,105
6,BTech,104
7,Class 9,100
8,BA,78
9,Class 8,58




> Frequency of Screen Time (hrs/day)
  Unique values: 106



Unnamed: 0,Screen Time (hrs/day),count
0,6.9,22
1,4.8,21
2,4.5,21
3,6.3,20
4,9.9,19
...,...,...
101,12.0,6
102,6.5,6
103,9.2,4
104,8.5,4




> Frequency of Stress Level
  Unique values: 4



Unnamed: 0,Stress Level,count
0,Medium,625
1,Low,398
2,High,215
3,,12




> Frequency of Anxious Before Exams
  Unique values: 3



Unnamed: 0,Anxious Before Exams,count
0,Yes,634
1,No,599
2,,17




> Frequency of Academic Performance Change
  Unique values: 4



Unnamed: 0,Academic Performance Change,count
0,Same,486
1,Improved,377
2,Declined,372
3,,15






**Data Cleaning & Transformation**

In [None]:
# Age Column
data_student_mentalhealth['Age'] = data_student_mentalhealth['Age'].replace('twenty', 20)
data_student_mentalhealth['Age'] = pd.to_numeric(data_student_mentalhealth['Age'])

In [None]:
# Education Level Column
data_student_mentalhealth['Education Level'] = data_student_mentalhealth['Education Level'].fillna('Unknown')

In [None]:
# Screen Time Column
# Non-numeric entries such as 'unknown' are coerced to NaN during conversion
data_student_mentalhealth['Screen Time (hrs/day)'] = pd.to_numeric(data_student_mentalhealth['Screen Time (hrs/day)'], errors='coerce')

In [None]:
# Gender Column
data_student_mentalhealth['Gender'] = data_student_mentalhealth['Gender'].fillna('Other')

In [None]:
# Stress Level Column
data_student_mentalhealth['Stress Level'] = data_student_mentalhealth['Stress Level'].fillna('Unknown')

In [None]:
# Anxious Before Exams Column
data_student_mentalhealth['Anxious Before Exams'] = data_student_mentalhealth['Anxious Before Exams'].fillna('Unknown')

In [None]:
# Numerical Column Analysis
column_numerical = data_student_mentalhealth.select_dtypes(include='number').columns

display(data_student_mentalhealth[column_numerical].describe())

Unnamed: 0,Age,Screen Time (hrs/day),Sleep Duration (hrs),Physical Activity (hrs/week)
count,1240.0,1235.0,1232.0,1243.0
mean,21.53,8.17,8.2,7.03
std,15.43,13.47,15.18,16.19
min,1.0,-5.0,-5.0,-5.0
25%,17.0,4.4,5.1,2.7
50%,20.0,6.9,6.5,5.1
75%,23.0,9.5,7.8,7.7
max,200.0,150.0,150.0,150.0


**Insight**: Numerical column analysis shows unusually wide ranges, including negative and extreme values that are not realistic in real-life contexts.
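Before deciding how to treat such values, it can help to count how many fall outside a plausible range in each column. The sketch below uses made-up rows. The age range mirrors the 15 to 30 filter applied in this notebook, while the 0 to 24 hour sleep range is an assumption for illustration.

```python
import pandas as pd

# Illustrative rows only, not taken from the real file
df = pd.DataFrame({
    "Age": [15, 200, 20, 1],
    "Sleep Duration (hrs)": [8.9, -5.0, 150.0, 5.6],
})

# Assumed plausible ranges (inclusive) per column
plausible = {"Age": (15, 30), "Sleep Duration (hrs)": (0, 24)}

# Count values outside each range; note that NaN would also count as out of range here
out_of_range = {
    col: int((~df[col].between(lo, hi)).sum())
    for col, (lo, hi) in plausible.items()
}
print(out_of_range)
```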

In [None]:
# Numerical Columns

# Extreme ages are assumed to be typos with an extra trailing zero (150 -> 15, 200 -> 20)
data_student_mentalhealth['Age'] = data_student_mentalhealth['Age'].replace({150: 15, 200: 20})

# Set remaining implausible values to missing: age outside 15-30, screen time outside 0-16 hrs/day
data_student_mentalhealth.loc[(data_student_mentalhealth['Age'] < 15) | (data_student_mentalhealth['Age'] > 30), 'Age'] = None
data_student_mentalhealth.loc[(data_student_mentalhealth['Screen Time (hrs/day)'] < 0) | (data_student_mentalhealth['Screen Time (hrs/day)'] > 16), 'Screen Time (hrs/day)'] = None

# Sleep duration: 150 is assumed to be a typo for 15, and -5 a sign error for 5
data_student_mentalhealth['Sleep Duration (hrs)'] = data_student_mentalhealth['Sleep Duration (hrs)'].replace({150: 15, -5: 5})

# Physical activity: same assumptions as for sleep duration
data_student_mentalhealth['Physical Activity (hrs/week)'] = data_student_mentalhealth['Physical Activity (hrs/week)'].replace({150: 15, -5: 5})

display(data_student_mentalhealth[column_numerical].describe())

Unnamed: 0,Age,Screen Time (hrs/day),Sleep Duration (hrs),Physical Activity (hrs/week)
count,1227.0,1214.0,1232.0,1243.0
mean,20.31,6.86,7.2,5.89
std,3.43,2.88,8.05,8.92
min,15.0,2.0,4.0,0.0
25%,17.0,4.4,5.1,2.7
50%,20.0,6.9,6.5,5.1
75%,23.0,9.4,7.8,7.7
max,26.0,12.0,99.0,99.0


In [None]:
#Missing Values

data_student_mentalhealth.isnull().sum()

Unnamed: 0,0
Name,10
Gender,0
Age,23
Education Level,0
Screen Time (hrs/day),36
Sleep Duration (hrs),18
Physical Activity (hrs/week),7
Stress Level,0
Anxious Before Exams,0
Academic Performance Change,15


In [None]:
data_student_mentalhealth.shape

(1250, 10)

In [None]:
# Drop Missing Values (No Info in Academic Performance Change)

data_student_mentalhealth = data_student_mentalhealth.dropna(subset=['Academic Performance Change'])

In [None]:
# Imputation in Numerical Columns with Median (Age; Screen Time (hrs/day); Sleep Duration (hrs); Physical Activity (hrs/week))

for col in column_numerical:
    median_value = data_student_mentalhealth[col].median()
    data_student_mentalhealth[col] = data_student_mentalhealth[col].fillna(median_value)

In [None]:
data_student_mentalhealth.isnull().sum()

Unnamed: 0,0
Name,10
Gender,0
Age,0
Education Level,0
Screen Time (hrs/day),0
Sleep Duration (hrs),0
Physical Activity (hrs/week),0
Stress Level,0
Anxious Before Exams,0
Academic Performance Change,0


In [None]:
# Duplicated Data

print(f'Dataset size: {data_student_mentalhealth.shape}')
print(f'Duplicated rows: {data_student_mentalhealth.duplicated(keep=False).sum()}')

# Drop
data_student = data_student_mentalhealth.drop_duplicates()

print(f'Dataset size: {data_student.shape}')
print(f'Duplicated rows: {data_student.duplicated(keep=False).sum()}')

Dataset size: (1235, 10)
Duplicated rows: 346
Dataset size: (1053, 10)
Duplicated rows: 0
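The count of 346 duplicated rows is larger than the 182 rows actually removed (1,235 to 1,053) because `duplicated(keep=False)` flags every member of a duplicate group, while `drop_duplicates()` keeps the first occurrence of each group. A small made-up frame shows the difference:

```python
import pandas as pd

# Illustrative frame: "a" appears three times, "b" twice, "c" once
df = pd.DataFrame({"key": ["a", "a", "a", "b", "b", "c"]})

flagged = int(df.duplicated(keep=False).sum())  # every row that belongs to a duplicate group
removed = int(df.duplicated().sum())            # only the repeats (default keep='first')
remaining = len(df.drop_duplicates())           # rows left after dropping the repeats

print(flagged, removed, remaining)
```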


In [None]:
data_student.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1053 entries, 0 to 1249
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Name                          1043 non-null   object 
 1   Gender                        1053 non-null   object 
 2   Age                           1053 non-null   float64
 3   Education Level               1053 non-null   object 
 4   Screen Time (hrs/day)         1053 non-null   float64
 5   Sleep Duration (hrs)          1053 non-null   float64
 6   Physical Activity (hrs/week)  1053 non-null   float64
 7   Stress Level                  1053 non-null   object 
 8   Anxious Before Exams          1053 non-null   object 
 9   Academic Performance Change   1053 non-null   object 
dtypes: float64(4), object(6)
memory usage: 90.5+ KB


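**Insight**: After data cleaning, the dataset contains 1,053 unique records with consistent data types.
All numerical variables (Age, Screen Time, Sleep Duration, and Physical Activity) are now in numeric format. The only remaining missing values are in Name (10 rows), which is not used for modeling. The maximum of 99 seen in the summary statistics for Sleep Duration and Physical Activity has not been corrected at this stage; extreme values are capped later in the EDA section.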

In [None]:
# Save Cleaned Data

data_student.to_excel('StudentMentalHealth_clean.xlsx', index = False)

**EDA**

In [None]:
# Outlier

import plotly.express as px

def box_plot(series, column_name, color):
    fig = px.box(
        series,
        orientation='h',
        color_discrete_sequence=[color]
    )

    fig.update_layout(
        title=f'<b>Box Plot of {column_name}</b>',
        yaxis = dict(
            title = '',
            showgrid = False,
            showline = False,
            showticklabels = False,
            zeroline = False,
        ),
        xaxis = dict(
            title = column_name,
            showgrid = False,
            showline = True,
            showticklabels = True,
            zeroline = False,
        )
    )

    fig.show()

color_map = {
  'Age' : '#4E79A7',
  'Screen Time (hrs/day)' : '#F28E2B',
  'Sleep Duration (hrs)' : '#E15759',
  'Physical Activity (hrs/week)' : '#76B7B2'
}

for column, color in color_map.items():
    box_plot(data_student_mentalhealth[column], column, color)

In [None]:
#Winsorizing

def winsorize(series):
    # Cap values at the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1

    # These features cannot be negative, so the lower fence is floored at 0
    lower_bound = max(q1 - 1.5 * iqr, 0)
    upper_bound = q3 + 1.5 * iqr

    series = series.astype(pd.Float64Dtype())
    return series.clip(lower=lower_bound, upper=upper_bound)

color_map = {
  'Age' : '#4E79A7',
  'Screen Time (hrs/day)' : '#F28E2B',
  'Sleep Duration (hrs)' : '#E15759',
  'Physical Activity (hrs/week)' : '#76B7B2'
}

# Apply to all columns
for column, color in color_map.items():
    data_student_mentalhealth[column] = winsorize(data_student_mentalhealth[column])
    box_plot(data_student_mentalhealth[column], column, color)

**Insight**: Winsorization limits the impact of extreme values while preserving the overall distribution, giving more stable and realistic numerical features for modeling.
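To make the capping concrete, here is a minimal worked example of the same IQR-fence rule on a made-up sleep series containing one extreme value. The helper name `clip_to_iqr_fences` and the data are hypothetical. With Q1 = 5.25 and Q3 = 7.75, the IQR is 2.5, so the fences are 1.5 and 11.5.

```python
import pandas as pd

def clip_to_iqr_fences(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values at Q1 - k*IQR and Q3 + k*IQR, with the lower fence floored at 0."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower = max(q1 - k * iqr, 0)
    upper = q3 + k * iqr
    return series.clip(lower=lower, upper=upper)

# Illustrative values only: five ordinary nights and one sentinel-like 99
sleep = pd.Series([4.0, 5.0, 6.0, 7.0, 8.0, 99.0])
capped = clip_to_iqr_fences(sleep)
print(capped.tolist())
```

Only the value above the upper fence changes; values inside the fences are left as they are.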


In [None]:
# Distribution
import plotly.express as px

for col in column_numerical:
    fig = px.histogram(
        data_student_mentalhealth,
        x = col,
        nbins = 25,
        color_discrete_sequence = [color_map[col]],
        marginal = "box",
        hover_data = data_student_mentalhealth.columns
    )

    fig.update_yaxes(
        showgrid = False,
        showticklabels=False,
        title =''
    )

    fig.update_layout(
        title={
            'text' : f'Distribution of <b>{col}</b>',
            'y':0.92,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        plot_bgcolor = 'rgba(0,0,0,0)',
        bargap = 0.01,
        title_font = dict(size = 25)
    )

    fig.show()

In [None]:
# Correlation

data_corr = data_student_mentalhealth[column_numerical].corr()

display(data_corr)

Unnamed: 0,Age,Screen Time (hrs/day),Sleep Duration (hrs),Physical Activity (hrs/week)
Age,1.0,-0.01,-0.02,-0.04
Screen Time (hrs/day),-0.01,1.0,0.02,0.0
Sleep Duration (hrs),-0.02,0.02,1.0,-0.04
Physical Activity (hrs/week),-0.04,0.0,-0.04,1.0


In [None]:
#Correlation Matrix

import plotly.express as px

fig = px.imshow(
    data_corr,
    text_auto=True,
    color_continuous_scale='Blues',
    title='<b>Correlation Matrix of Numerical Features</b>'
)

fig.update_coloraxes(showscale=False)

fig.update_layout(
    title = dict(
        x=0.5,
        y=0.9,
        xanchor='center',
        yanchor='top'
    ),
    width = 1000,
    height = 800
)

fig.show()

In [None]:
# Statistics for Each Category of Academic Performance Change

data_student.groupby(['Academic Performance Change']).agg(
    total_student = ('Name', 'count'),
    min_age = ('Age', 'min'),
    median_age = ('Age', 'median'),
    max_age = ('Age', 'max'),
    min_screentime = ('Screen Time (hrs/day)', 'min'),
    median_screentime = ('Screen Time (hrs/day)', 'median'),
    max_screentime = ('Screen Time (hrs/day)', 'max'),
    min_sleep = ('Sleep Duration (hrs)', 'min'),
    median_sleep = ('Sleep Duration (hrs)', 'median'),
    max_sleep = ('Sleep Duration (hrs)', 'max')
)

Academic Performance Change,total_student,min_age,median_age,max_age,min_screentime,median_screentime,max_screentime,min_sleep,median_sleep,max_sleep
Declined,309,15.0,20.0,26.0,2.0,6.9,12.0,4.0,6.4,99.0
Improved,315,15.0,21.0,26.0,2.0,6.9,12.0,4.0,6.5,15.0
Same,419,15.0,20.0,26.0,2.0,6.9,12.0,4.0,6.6,99.0


In [None]:
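# Data Preprocessing

# Note: this copies data_student_mentalhealth (1,235 rows, winsorized, duplicates still present),
# not the deduplicated data_student (1,053 rows) created during cleaning.
data_preprocessing = data_student_mentalhealth.copy()
data_preprocessing = data_preprocessing.drop(columns=['Name'])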

data_preprocessing.head()

Unnamed: 0,Gender,Age,Education Level,Screen Time (hrs/day),Sleep Duration (hrs),Physical Activity (hrs/week),Stress Level,Anxious Before Exams,Academic Performance Change
0,Male,15.0,Class 8,7.1,8.9,9.3,Medium,No,Same
1,Female,25.0,MSc,3.3,5.0,0.2,Medium,No,Same
2,Male,20.0,BTech,9.5,11.7,6.2,Medium,No,Same
3,Male,20.0,BA,10.8,5.6,5.5,High,Yes,Same
4,Female,17.0,Class 11,2.8,5.4,3.1,Medium,Yes,Same


In [None]:
# Encoding

from sklearn.preprocessing import OrdinalEncoder

column_categorical = data_preprocessing.select_dtypes(include='object').columns
column_numerical = data_preprocessing.select_dtypes(include='number').columns

encoder = OrdinalEncoder()
data_preprocessing[column_categorical] = encoder.fit_transform(
    data_preprocessing[column_categorical]
)

display(data_preprocessing)

for col, cats in zip(column_categorical, encoder.categories_):
    print(f'Mapping for \033[93m{col}\033[0m:\n')
    for i, cat in enumerate(cats):
        print(f'  {cat} → {i}')
    print()


Unnamed: 0,Gender,Age,Education Level,Screen Time (hrs/day),Sleep Duration (hrs),Physical Activity (hrs/week),Stress Level,Anxious Before Exams,Academic Performance Change
0,1.00,15.00,6.00,7.10,8.90,9.30,2.00,0.00,2.00
1,0.00,25.00,9.00,3.30,5.00,0.20,2.00,0.00,2.00
2,1.00,20.00,2.00,9.50,11.70,6.20,2.00,0.00,2.00
3,1.00,20.00,0.00,10.80,5.60,5.50,0.00,2.00,2.00
4,0.00,17.00,4.00,2.80,5.40,3.10,2.00,2.00,2.00
...,...,...,...,...,...,...,...,...,...
1245,1.00,21.00,2.00,6.10,6.20,4.00,1.00,2.00,0.00
1246,1.00,15.00,7.00,5.50,8.00,6.90,2.00,2.00,2.00
1247,1.00,22.00,11.00,5.50,4.40,5.00,2.00,0.00,0.00
1248,0.00,16.00,3.00,10.40,8.00,6.60,0.00,0.00,0.00


Mapping for Gender:

  Female → 0
  Male → 1
  Other → 2

Mapping for Education Level:

  BA → 0
  BSc → 1
  BTech → 2
  Class 10 → 3
  Class 11 → 4
  Class 12 → 5
  Class 8 → 6
  Class 9 → 7
  MA → 8
  MSc → 9
  MTech → 10
  Unknown → 11

Mapping for Stress Level:

  High → 0
  Low → 1
  Medium → 2
  Unknown → 3

Mapping for Anxious Before Exams:

  No → 0
  Unknown → 1
  Yes → 2

Mapping for Academic Performance Change:

  Declined → 0
  Improved → 1
  Same → 2
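The mapping above is alphabetical because `OrdinalEncoder` sorts categories by default, so Stress Level is coded High → 0, Low → 1, Medium → 2, which does not follow the natural Low < Medium < High order. If a meaningful order is wanted, it can be passed through the `categories` parameter. The sketch below uses a made-up frame, and the position of 'Unknown' at the end is an assumption, not a choice made in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative values only
stress = pd.DataFrame({"Stress Level": ["Medium", "High", "Low", "Unknown"]})

# Assumed order; where 'Unknown' belongs is a judgment call
order = ["Low", "Medium", "High", "Unknown"]

encoder = OrdinalEncoder(categories=[order])
codes = encoder.fit_transform(stress).ravel().tolist()
print(codes)
```

For a tree-based model the split quality may not change much, but an explicit order makes the encoded values easier to interpret.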



In [None]:
# Train test split

from sklearn.model_selection import train_test_split

X = data_preprocessing.drop(columns=['Academic Performance Change'])
y = data_preprocessing['Academic Performance Change']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

display(y.value_counts())

Academic Performance Change,count
2.0,486
1.0,377
0.0,372


Declined → 0

Improved → 1

Same → 2

In [None]:
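# Imbalanced target: oversample the training set with SMOTE to inspect the balanced class counts
# (the modelling pipeline below applies SMOTE on its own, so X_train_res is not reused)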

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)

X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

display(y_train_res.value_counts())

Academic Performance Change,count
2.0,389
1.0,389
0.0,389
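After resampling, each class has 389 training rows, matching the majority class in the training split. SMOTE creates the extra minority rows by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below is a simplified illustration of that interpolation step with made-up points, not the imblearn implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two illustrative minority-class points in a 2-feature space
x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 6.0])

# SMOTE-style synthetic sample: a random point on the segment between x and its neighbor
lam = rng.uniform(0, 1)
synthetic = x + lam * (neighbor - x)
print(synthetic)
```

Because the new points are interpolated, ordinal-encoded categorical columns can receive fractional codes. imblearn also provides `SMOTENC`, which is designed for data that mixes numerical and categorical features.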


In [None]:
# Modelling

from imblearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline(steps=[
    ('scaler', RobustScaler()),
    ('smote', SMOTE()),
    ('model',DecisionTreeClassifier())
])

param_grid = {
    'smote__k_neighbors': [3, 5, 7],
    'model__criterion': ['gini', 'entropy', 'log_loss'],
    'model__max_depth': [None, 5, 10, 20, 30],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 5],
    'model__class_weight': [None, 'balanced']
}

grid = GridSearchCV(
    estimator = pipeline,
    param_grid = param_grid,
    cv = 5,
    scoring = 'f1_macro',
    n_jobs = -1
)

grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
print("Best CV Score:", grid.best_score_)

Best Params: {'model__class_weight': 'balanced', 'model__criterion': 'gini', 'model__max_depth': None, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'smote__k_neighbors': 5}
Best CV Score: 0.5214545162889831
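The size of this search follows directly from the grid: the number of candidates is the product of the option counts, and each candidate is fitted once per fold. The arithmetic can be checked as follows, using the grid from the cell above:

```python
from math import prod

# Same grid as in the modelling cell above
param_grid = {
    'smote__k_neighbors': [3, 5, 7],
    'model__criterion': ['gini', 'entropy', 'log_loss'],
    'model__max_depth': [None, 5, 10, 20, 30],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 5],
    'model__class_weight': [None, 'balanced'],
}

n_folds = 5
n_candidates = prod(len(options) for options in param_grid.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

That is 810 candidates and 4,050 pipeline fits. In scikit-learn's `DecisionTreeClassifier`, 'entropy' and 'log_loss' both select the Shannon information gain, so dropping one of them would likely cut the search by a third without changing what is explored.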


In [None]:
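# Define Final Model
# Note: these hyperparameters differ from the best parameters printed by the grid search above
# (gini, max_depth=None, min_samples_split=2, class_weight='balanced'). The search pipeline is
# unseeded, so the selected parameters can change between runs.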

from sklearn.tree import DecisionTreeClassifier

model_dtc = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=30,
    min_samples_leaf=1,
    min_samples_split=5,
    class_weight=None,
    random_state=42
)

In [None]:
# Final Pipeline

from imblearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE

final_pipe = Pipeline([
    ('scaler', RobustScaler()),
    ('smote', SMOTE(random_state=42)),
    ('model', model_dtc)
])

In [None]:
# Cross Validation

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)

cv_scores = cross_val_score(
    final_pipe,
    X,
    y,
    cv=cv,
    scoring='f1_macro'   # keep consistent with GridSearch
)

print(f'CV Scores : {cv_scores}')
print(f'Mean CV   : {cv_scores.mean()}')
print(f'Std CV    : {cv_scores.std()}')

CV Scores : [0.5563534  0.47212885 0.5348679  0.48995158 0.48016206]
Mean CV   : 0.5066927596228543
Std CV    : 0.032981288109190904


In [None]:
final_pipe.fit(X_train, y_train)

In [None]:
# Model Evaluation on Test Set

y_pred = final_pipe.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.48      0.52      0.50        75
         1.0       0.46      0.49      0.47        75
         2.0       0.55      0.48      0.52        97

    accuracy                           0.50       247
   macro avg       0.50      0.50      0.50       247
weighted avg       0.50      0.50      0.50       247



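**Insight**: The model reaches 50% accuracy and a macro F1 of 0.50 on the 247 test rows. The 'Same' class (2.0) has the highest F1 (0.52) and precision (0.55), while 'Declined' (0.0) has the highest recall (0.52). The differences between classes are small, and overall the model struggles to separate the three outcomes.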
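To put the 50% accuracy in context, it helps to compare it with a baseline that always predicts the most frequent class. The numbers below are copied from the classification report above; the comparison itself is an added illustration.

```python
# Support and F1 per class, copied from the classification report
support = {"Declined": 75, "Improved": 75, "Same": 97}
f1 = {"Declined": 0.50, "Improved": 0.47, "Same": 0.52}

total = sum(support.values())

# Accuracy of always predicting the largest class ('Same')
majority_baseline = max(support.values()) / total

# Macro F1 is the unweighted mean of the per-class F1 scores
macro_f1 = sum(f1.values()) / len(f1)

print(round(majority_baseline, 2), round(macro_f1, 2))
```

The majority-class baseline is about 0.39, so the model improves on it by roughly 11 percentage points in accuracy.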


In [None]:
# Confusion Matrix Visualization

import plotly.express as px
from sklearn.metrics import confusion_matrix

label_map = {
    0: 'Declined',
    1: 'Improved',
    2: 'Same'
}

labels_num = list(label_map.keys())
labels_text = list(label_map.values())

cm = confusion_matrix(
    y_test,
    y_pred,
    labels=labels_num
)

fig = px.imshow(
    cm,
    text_auto=True,
    color_continuous_scale='Blues',
    title="<b>Confusion Matrix – Decision Tree</b>",
    x=labels_text,
    y=labels_text
)

fig.update_layout(
    xaxis_title="<b>Predicted</b>",
    yaxis_title="<b>Actual</b>",
    width=700,
    height=700,
    coloraxis_showscale=False,
    title=dict(x=0.5)
)

fig.show()


**Insight**: Correct predictions lie on the diagonal, but with 50% accuracy only about half of the test rows fall there.

Misclassifications occur across class pairs (Declined ↔ Improved, Improved ↔ Same), which suggests that the available features overlap heavily between outcomes.

The Decision Tree captures some general trends but does not separate the classes reliably.

# **Conclusion**

This project studies how students’ habits and mental health relate to changes in their academic performance using a Decision Tree model.

After cleaning the data, we saw that students’ routines vary in screen time, sleep, and physical activity, while the numerical features show almost no linear correlation with one another (all |r| ≤ 0.04). Stress level and exam anxiety were included as predictors. SMOTE, feature scaling, and hyperparameter tuning were used to handle class imbalance and tune the model.

The final model reaches about 50% accuracy and a macro F1 of 0.50 on the test set, which is better than always predicting the most frequent class (about 39%) but far from reliable. Errors are spread across the performance categories, suggesting overlapping behaviors among students with different outcomes.

Data Source : <i>https://www.kaggle.com/datasets/utkarshsharma11r/student-mental-health-analysis</i>


---

<br>
<a href="https://www.linkedin.com/in/zhana-sabira/"><img src="https://img.shields.io/badge/-© 2025 Zhana Sabira-F54927?style=for-the-badge&logoColor=white"/></a>

<a href="https://github.com/zhanasabi"><img src="https://raw.githubusercontent.com/bachtiyarma/Material/main/Image/Materi-Python/FINAL LOGO GSB 2025 - ADV (Buat IG, TikTok, Podcast) (2).png" align="left" width="15%" /></a>