## Import Packages

In [1]:
import os
import sys
import pandas as pd

project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)

from src.data_preprocessing import convert_data_types
from src.feature_engineering import scale_numeric_features, encode_categorical_features, generate_interaction_features
from utils.data import load_data

## Load Dataset

In [2]:
data = convert_data_types(load_data('../data/processed/train.csv'))
data.head()

Unnamed: 0,Gender,Age,City,Working Professional or Student,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,Female,49.0,Ludhiana,Working Professional,Chef,Not Applicable,5.0,-1.0,Not Applicable,2.0,More than 8 hours,Healthy,BHM,No,1.0,2.0,No,0
1,Male,26.0,Varanasi,Working Professional,Teacher,Not Applicable,4.0,-1.0,Not Applicable,3.0,Less than 5 hours,Unhealthy,LLB,Yes,7.0,3.0,No,1
2,Male,33.0,Visakhapatnam,Student,Student,5.0,Not Applicable,8.97,2.0,Not Applicable,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
3,Male,22.0,Mumbai,Working Professional,Teacher,Not Applicable,5.0,-1.0,Not Applicable,1.0,Less than 5 hours,Moderate,BBA,Yes,10.0,1.0,Yes,1
4,Female,30.0,Kanpur,Working Professional,Business Analyst,Not Applicable,1.0,-1.0,Not Applicable,1.0,5-6 hours,Unhealthy,BBA,Yes,9.0,4.0,Yes,0


## Drop Unwanted Features
<!-- City
Profession
Family History of Mental Illness
Dietary Habits
Sleep Duration
Study Satisfaction
Work/Study Hours

CGPA
Age
Financial Stress
Academic Pressure
Work Pressure

Suicidal Thoughts
Working Professional or Student -->

In [3]:
unwanted_features = ['Gender', 'Degree']
data = data.drop(columns=unwanted_features)

# Normalize/Scale Numerical Features
The numeric features `CGPA` and `Age` will be scaled with scikit-learn's `MinMaxScaler`. For `CGPA`, only student rows are scaled; the placeholder value used for working professionals (`-1`) is left unchanged.

In [4]:
data = scale_numeric_features(data)
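The body of `scale_numeric_features` is not shown in this notebook, so the following is only a minimal sketch of the behavior described above. It assumes the `-1` placeholder marks non-student rows and uses plain pandas arithmetic, `(x - min) / (max - min)`, in place of `MinMaxScaler`; this is the formula `MinMaxScaler` applies with its default settings. The helper names `min_max` and `scale_numeric_sketch` are hypothetical.

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Rescale a numeric series to [0, 1]; a constant series maps to 0."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span


def scale_numeric_sketch(df: pd.DataFrame, placeholder: float = -1.0) -> pd.DataFrame:
    """Scale Age for every row and CGPA for student rows only (assumed behavior)."""
    out = df.copy()

    # Age applies to everyone, so the whole column is rescaled.
    out["Age"] = min_max(out["Age"])

    # Rows holding the placeholder are treated as working professionals
    # and keep their original value.
    is_student = out["CGPA"] != placeholder
    if is_student.any():
        out.loc[is_student, "CGPA"] = min_max(out.loc[is_student, "CGPA"])
    return out


sample = pd.DataFrame({"Age": [49.0, 26.0, 33.0], "CGPA": [-1.0, 8.97, 6.50]})
scaled = scale_numeric_sketch(sample)
print(scaled)
```

Leaving the placeholder outside the scaled range keeps working professionals distinguishable from students whose scaled `CGPA` falls between 0 and 1.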

# Encode Categorical Variables
The remaining categorical features will be encoded as follows:
- One-Hot Encoding will be applied to `Working Professional or Student`, `Have you ever had suicidal thoughts ?`, `Sleep Duration`, `Dietary Habits`, and `Family History of Mental Illness`.
- Target Encoding (replacing each category with the mean of the target for that category) will be used for `City` and `Profession`.
- Label Encoding will be applied to `Academic Pressure`, `Work Pressure`, `Study Satisfaction`, `Job Satisfaction`, `Work/Study Hours`, and `Financial Stress`.

In [5]:
data = encode_categorical_features(data)
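The three encoding strategies can be illustrated on a toy frame. This is a hedged sketch, not the actual `encode_categorical_features` implementation, which is not shown here: the function name `encode_sketch`, its parameters, and the sample rows are hypothetical, and label codes are assigned in sorted string order as an assumption.

```python
import pandas as pd


def encode_sketch(
    df: pd.DataFrame,
    one_hot: list,
    target_encoded: list,
    label_encoded: list,
    target: str = "Depression",
) -> pd.DataFrame:
    """Apply target, label, and one-hot encoding to the named columns."""
    out = df.copy()

    # Target (mean) encoding: each category becomes the mean target value
    # observed for that category.
    for col in target_encoded:
        out[col] = out.groupby(col)[target].transform("mean")

    # Label encoding: each distinct value becomes an integer code,
    # assigned in sorted order of the string representation.
    for col in label_encoded:
        out[col] = out[col].astype(str).astype("category").cat.codes

    # One-hot encoding: one indicator column per category.
    return pd.get_dummies(out, columns=one_hot, dtype=int)


sample = pd.DataFrame(
    {
        "City": ["Mumbai", "Kanpur", "Mumbai", "Kanpur"],
        "Work Pressure": ["5.0", "4.0", "Not Applicable", "5.0"],
        "Dietary Habits": ["Healthy", "Unhealthy", "Healthy", "Moderate"],
        "Depression": [1, 0, 0, 0],
    }
)
encoded = encode_sketch(
    sample,
    one_hot=["Dietary Habits"],
    target_encoded=["City"],
    label_encoded=["Work Pressure"],
)
print(encoded)
```

Because target encoding reads the label column, the category means should be computed on training data only and then reused for validation or test data to avoid leaking the target.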

# Create New Features
## Interaction Features
The following interaction features will be generated to uncover patterns that may contribute to depression:
1. **CGPA × Study Satisfaction:** A high CGPA paired with low study satisfaction could signal academic pressure, a potential trigger for depression and stress in students.
2. **Work Pressure × Financial Stress:** Captures the dual burden of workplace demands and financial difficulties, highlighting individuals at risk of compounded stress.
3. **Job Satisfaction × Sleep Duration:** Explores the link between job dissatisfaction and short sleep, which together can significantly impact mental well-being.
4. **Academic Pressure × Suicidal Thoughts:** Focuses on students under extreme stress, identifying those at higher risk of severe mental health challenges, including suicidal ideation.
5. **Dietary Habits × Financial Stress:** Examines how financial stress might lead to unhealthy dietary choices, potentially worsening physical and mental health.

In [6]:
data = generate_interaction_features(data)
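One common way to build such interactions is to multiply the two columns element-wise. The sketch below assumes that approach and assumes both inputs are already numeric; the real `generate_interaction_features` may differ, especially for columns that were one-hot encoded into several indicator columns. The names `INTERACTION_PAIRS` and `add_interactions_sketch`, the output column naming, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical subset of the pairs listed above; both columns must be numeric.
INTERACTION_PAIRS = [
    ("CGPA", "Study Satisfaction"),
    ("Work Pressure", "Financial Stress"),
]


def add_interactions_sketch(df: pd.DataFrame, pairs=INTERACTION_PAIRS) -> pd.DataFrame:
    """Add one product column per pair, skipping pairs with a missing column."""
    out = df.copy()
    for left, right in pairs:
        if left in out.columns and right in out.columns:
            out[f"{left} x {right}"] = out[left] * out[right]
    return out


sample = pd.DataFrame(
    {
        "CGPA": [0.8, 0.2],
        "Study Satisfaction": [1, 4],
        "Work Pressure": [0, 3],
        "Financial Stress": [2, 5],
    }
)
result = add_interactions_sketch(sample)
print(result)
```

A plain product treats both inputs symmetrically, so a placeholder such as `-1` in `CGPA` would flip the sign of the interaction; whether that is desirable depends on how the placeholder rows are meant to be handled.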