Data Analysis & Preprocessing:
- Exploratory Data Analysis (EDA)
- Check class distribution (balanced/imbalanced)
- Handle missing values
- Encode categorical variables
- Feature scaling if needed
- Feature selection/importance analysis
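
The checks and transformations listed above can be sketched on a toy frame. This is a minimal illustration and not the notebook's actual pipeline: the values are made up, and the imputation choices (median for numeric columns, an explicit "Unknown" level for categorical ones) are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Gender": ["Male", "Female", "Female", None],
    "Depression": [0, 0, 1, 0],
})

# Class distribution as proportions.
class_share = toy["Depression"].value_counts(normalize=True)

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

# Impute: median for numeric, an explicit "Unknown" level for categorical.
prepared = toy.copy()
prepared["Age"] = prepared["Age"].fillna(prepared["Age"].median())
prepared["Gender"] = prepared["Gender"].fillna("Unknown")

# One-hot encode the categorical column.
prepared = pd.get_dummies(prepared, columns=["Gender"], dtype=int)

# Standardize the numeric column (zero mean, unit variance).
prepared["Age"] = (prepared["Age"] - prepared["Age"].mean()) / prepared["Age"].std(ddof=0)

print(class_share)
print(missing_share)
print(prepared)
```

When a model is evaluated with cross-validation, the imputation and scaling statistics should be computed on the training folds only, otherwise information from the validation data leaks into training.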


Modeling Strategy:
- Start with simple models as baseline (Logistic Regression)
- Move to more complex models: Random Forest, XGBoost/LightGBM, Support Vector Machines
- Use cross-validation for robust evaluation
- If data is imbalanced, consider: SMOTE/ADASYN for oversampling, class weights, ensemble methods
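
A baseline along these lines might look like the sketch below. It assumes scikit-learn is available and uses a synthetic, roughly 80/20 dataset from `make_classification` in place of the real data. Class weights are used here as one of the imbalance options listed above; SMOTE/ADASYN would need the separate `imbalanced-learn` package and is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (roughly 80/20 classes).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each training fold.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(baseline, X, y, cv=cv, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The same `cv` object can be reused for the more complex models so that every candidate is scored on identical folds.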

Evaluation Metrics to focus on:
- Accuracy (if balanced classes)
- Precision, Recall, F1-score
- ROC-AUC
- Confusion Matrix
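
The metrics above can be computed with `sklearn.metrics`, as in this small sketch. The labels and scores are made up for illustration, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.35, 0.8, 0.9, 0.65]

# Hard predictions at an assumed threshold of 0.5.
y_pred = [int(s >= 0.5) for s in y_score]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_score)  # uses scores, not hard predictions
cm = confusion_matrix(y_true, y_pred)     # rows: true class, columns: predicted

print(accuracy, precision, recall, f1, roc_auc)
print(cm)
```

ROC-AUC is computed from the scores and does not depend on the threshold, while precision, recall, F1, and the confusion matrix all change when the threshold moves.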

Interpretability:
- Feature importance
- SHAP values
- Partial dependence plots
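
As one model-agnostic way to get feature importance, the sketch below uses permutation importance from scikit-learn on synthetic data. It is a stand-in for illustration only: SHAP values would need the separate `shap` package and are not shown, and the data, model, and feature indices are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: with shuffle=False the first two columns carry the
# signal and the remaining four are noise.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in score.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}: {mean:.3f} +/- {std:.3f}")
```

Measuring importance on held-out data, not on the training set, avoids rewarding features the model has merely memorized.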

Reference: https://medium.com/data-and-beyond/mastering-exploratory-data-analysis-eda-everything-you-need-to-know-7e3b48d63a95

In [None]:
# import libraries

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)


#### Exploratory Data Analysis

In [37]:
df = pd.read_csv("data/train.csv")
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140700 entries, 0 to 140699
Data columns (total 20 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   id                                     140700 non-null  int64  
 1   Name                                   140700 non-null  object 
 2   Gender                                 140700 non-null  object 
 3   Age                                    140700 non-null  float64
 4   City                                   140700 non-null  object 
 5   Working Professional or Student        140700 non-null  object 
 6   Profession                             104070 non-null  object 
 7   Academic Pressure                      27897 non-null   float64
 8   Work Pressure                          112782 non-null  float64
 9   CGPA                                   27898 non-null   float64
 10  Study Satisfaction                     27897 non-null   

Unnamed: 0,id,Name,Gender,Age,City,Working Professional or Student,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,0,Aaradhya,Female,49.0,Ludhiana,Working Professional,Chef,,5.0,,,2.0,More than 8 hours,Healthy,BHM,No,1.0,2.0,No,0
1,1,Vivan,Male,26.0,Varanasi,Working Professional,Teacher,,4.0,,,3.0,Less than 5 hours,Unhealthy,LLB,Yes,7.0,3.0,No,1
2,2,Yuvraj,Male,33.0,Visakhapatnam,Student,,5.0,,8.97,2.0,,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
3,3,Yuvraj,Male,22.0,Mumbai,Working Professional,Teacher,,5.0,,,1.0,Less than 5 hours,Moderate,BBA,Yes,10.0,1.0,Yes,1
4,4,Rhea,Female,30.0,Kanpur,Working Professional,Business Analyst,,1.0,,,1.0,5-6 hours,Unhealthy,BBA,Yes,9.0,4.0,Yes,0


In [38]:
df.describe()

Unnamed: 0,id,Age,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Work/Study Hours,Financial Stress,Depression
count,140700.0,140700.0,27897.0,112782.0,27898.0,27897.0,112790.0,140700.0,140696.0,140700.0
mean,70349.5,40.388621,3.142273,2.998998,7.658636,2.94494,2.974404,6.252679,2.988983,0.181713
std,40616.735775,12.384099,1.380457,1.405771,1.464466,1.360197,1.416078,3.853615,1.413633,0.385609
min,0.0,18.0,1.0,1.0,5.03,1.0,1.0,0.0,1.0,0.0
25%,35174.75,29.0,2.0,2.0,6.29,2.0,2.0,3.0,2.0,0.0
50%,70349.5,42.0,3.0,3.0,7.77,3.0,3.0,6.0,3.0,0.0
75%,105524.25,51.0,4.0,4.0,8.92,4.0,4.0,10.0,4.0,0.0
max,140699.0,60.0,5.0,5.0,10.0,5.0,5.0,12.0,5.0,1.0
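
Assuming `Depression` is a binary 0/1 target (its minimum and maximum above suggest so), its mean is the share of positive rows. The short calculation below turns the summary statistics into a class ratio and into the weights that scikit-learn's `class_weight="balanced"` option would assign. The numbers are copied from the `describe()` output, so the positive count is an estimate rounded from the mean.

```python
# Summary statistics copied from the df.describe() output above.
n_rows = 140700
positive_share = 0.181713  # mean of the (assumed binary) Depression column

# Estimated class counts, rounded from the reported mean.
n_positive = round(n_rows * positive_share)
n_negative = n_rows - n_positive
imbalance_ratio = n_negative / n_positive

# Weights as computed by class_weight="balanced":
# n_samples / (n_classes * class_count)
weight_negative = n_rows / (2 * n_negative)
weight_positive = n_rows / (2 * n_positive)

print(n_positive, n_negative)
print(f"negative:positive ratio = {imbalance_ratio:.2f}")
print(f"weights: negative={weight_negative:.3f}, positive={weight_positive:.3f}")
```

With about 4.5 negatives per positive, the classes are imbalanced enough that accuracy alone would be misleading: always predicting 0 would already score about 82%.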


In [40]:
# column summary

def column_summary(df):
    data = []

    for column in df.columns:
        data_type = df[column].dtype
        null_count = df[column].isnull().sum()
        non_null_count = df[column].notnull().sum()
        distinct_values = df[column].nunique()

        # value_counts() already sorts by frequency (descending), so taking the
        # top 10 entries covers both the low- and high-cardinality cases.
        distinct_value_count = df[column].value_counts().head(10).to_dict()

        data.append({
            "name": column,
            "column_dtype" : data_type,
            "#_null": null_count,
            "#_non_null": non_null_count,
            "unique_values": distinct_values,
            "unique_value_counts": distinct_value_count,
        })

    data_df = pd.DataFrame(data)
    return data_df

data_summary = column_summary(df)
display(data_summary)

Unnamed: 0,name,column_dtype,#_null,#_non_null,unique_values,unique_value_counts
0,id,int64,0,140700,140700,"{140699: 1, 0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}"
1,Name,object,0,140700,422,"{'Rohan': 3178, 'Aarav': 2336, 'Rupak': 2176, 'Aaradhya': 2045, 'Anvi': 2035, 'Raghavendra': 1877, 'Vani': 1657, 'Tushar': 1596, 'Ritvik': 1589, 'Shiv': 1568}"
2,Gender,object,0,140700,2,"{'Male': 77464, 'Female': 63236}"
3,Age,float64,0,140700,43,"{56.0: 5246, 49.0: 5099, 38.0: 4564, 53.0: 4526, 57.0: 4395, 47.0: 4199, 46.0: 4080, 54.0: 3928, 51.0: 3927, 18.0: 3921}"
4,City,object,0,140700,98,"{'Kalyan': 6591, 'Patna': 5924, 'Vasai-Virar': 5765, 'Kolkata': 5689, 'Ahmedabad': 5613, 'Meerut': 5528, 'Ludhiana': 5226, 'Pune': 5210, 'Rajkot': 5207, 'Visakhapatnam': 5176}"
5,Working Professional or Student,object,0,140700,2,"{'Working Professional': 112799, 'Student': 27901}"
6,Profession,object,36630,104070,64,"{'Teacher': 24906, 'Content Writer': 7814, 'Architect': 4370, 'Consultant': 4229, 'HR Manager': 4022, 'Pharmacist': 3893, 'Doctor': 3255, 'Business Analyst': 3161, 'Entrepreneur': 2968, 'Chemist': 2967}"
7,Academic Pressure,float64,112803,27897,5,"{3.0: 7463, 5.0: 6296, 4.0: 5158, 1.0: 4801, 2.0: 4179}"
8,Work Pressure,float64,27918,112782,5,"{2.0: 24373, 4.0: 22512, 5.0: 22436, 3.0: 21899, 1.0: 21562}"
9,CGPA,float64,112802,27898,331,"{8.04: 822, 9.96: 425, 5.74: 410, 8.95: 371, 9.21: 343, 7.25: 339, 7.09: 320, 7.88: 318, 9.44: 317, 8.91: 276}"
