## Import Libraries

In [3]:
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sys.path.append('..')
from utils.plot import *

## Load the dataset

In [6]:
# Load testing data
data = pd.read_csv('https://raw.githubusercontent.com/ktxdev/mind-matters/refs/heads/master/data/raw/test.csv')

## Initial Exploration
### Shape and Structure

In [7]:
print(data.shape)
data.head()

(93800, 19)


| | id | Name | Gender | Age | City | Working Professional or Student | Profession | Academic Pressure | Work Pressure | CGPA | Study Satisfaction | Job Satisfaction | Sleep Duration | Dietary Habits | Degree | Have you ever had suicidal thoughts ? | Work/Study Hours | Financial Stress | Family History of Mental Illness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 140700 | Shivam | Male | 53.0 | Visakhapatnam | Working Professional | Judge | NaN | 2.0 | NaN | NaN | 5.0 | Less than 5 hours | Moderate | LLB | No | 9.0 | 3.0 | Yes |
| 1 | 140701 | Sanya | Female | 58.0 | Kolkata | Working Professional | Educational Consultant | NaN | 2.0 | NaN | NaN | 4.0 | Less than 5 hours | Moderate | B.Ed | No | 6.0 | 4.0 | No |
| 2 | 140702 | Yash | Male | 53.0 | Jaipur | Working Professional | Teacher | NaN | 4.0 | NaN | NaN | 1.0 | 7-8 hours | Moderate | B.Arch | Yes | 12.0 | 4.0 | No |
| 3 | 140703 | Nalini | Female | 23.0 | Rajkot | Student | NaN | 5.0 | NaN | 6.84 | 1.0 | NaN | More than 8 hours | Moderate | BSc | Yes | 10.0 | 4.0 | No |
| 4 | 140704 | Shaurya | Male | 47.0 | Kalyan | Working Professional | Teacher | NaN | 5.0 | NaN | NaN | 5.0 | 7-8 hours | Moderate | BCA | Yes | 3.0 | 4.0 | No |


### Dropping Unwanted Features
- **id:** A unique identifier for each person in the dataset; it carries no information for predicting mental health outcomes.
- **Name:** Not relevant to predicting depression; keeping it could introduce noise rather than meaningful signal for the model.

In [8]:
data.drop(columns=['id', 'Name'], inplace=True)
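
The claim that `id` is purely an identifier can be checked before dropping it. The following is a minimal sketch on a toy frame (the rows are illustrative stand-ins, not the real dataset), assuming that "identifier" means one distinct value per row:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "id": [140700, 140701, 140702],
    "Name": ["Shivam", "Sanya", "Yash"],
    "Age": [53.0, 58.0, 53.0],
})

# An identifier column should have exactly one distinct value per row.
id_is_unique = toy["id"].is_unique

# Drop the identifier-like columns only once that assumption holds.
if id_is_unique:
    toy = toy.drop(columns=["id", "Name"])

print(id_is_unique, list(toy.columns))
```

On the real data, the same check would be `data['id'].is_unique`, run before the drop.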

### Rename Columns

In [10]:
data.rename(columns={'Have you ever had suicidal thoughts ?': 'Had Suicidal Thoughts?'}, inplace=True)

## Data Types

In [11]:
data.dtypes

Gender                               object
Age                                 float64
City                                 object
Working Professional or Student      object
Profession                           object
Academic Pressure                   float64
Work Pressure                       float64
CGPA                                float64
Study Satisfaction                  float64
Job Satisfaction                    float64
Sleep Duration                       object
Dietary Habits                       object
Degree                               object
Had Suicidal Thoughts?               object
Work/Study Hours                    float64
Financial Stress                    float64
Family History of Mental Illness     object
dtype: object

## Data Types Conversion

In [12]:
# Rating-style and hours columns are treated as categorical rather than continuous
categorical_cols = [
    'Study Satisfaction', 'Job Satisfaction', 'Academic Pressure',
    'Work Pressure', 'Financial Stress', 'Work/Study Hours',
]
data = data.astype({col: 'category' for col in categorical_cols})
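
The conversion above produces unordered categories. If the rating columns are on a 1 to 5 scale (an assumption based on the values visible in the sample rows, not something the notebook states), an ordered categorical keeps the ranking available for comparisons and sorting. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy ratings with gaps; the 1-5 scale is an assumption.
ratings = pd.Series([2.0, np.nan, 5.0, 1.0, np.nan, 4.0], name="Work Pressure")

# Declare the levels explicitly so the order 1 < 2 < ... < 5 is preserved.
rating_dtype = pd.CategoricalDtype(categories=[1.0, 2.0, 3.0, 4.0, 5.0], ordered=True)
ordered_ratings = ratings.astype(rating_dtype)

# Missing values stay missing; comparisons now respect the declared order.
high_pressure = ordered_ratings >= 4.0
print(ordered_ratings.cat.ordered, int(high_pressure.sum()))
```

Whether ordering matters depends on the downstream plots and encoders, so this is an optional variation rather than a required change.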

## Checking for Missing Values

In [13]:
# Boolean mask of missing cells, summarised once per column
missing_mask = data.isnull()
missing_counts = missing_mask.sum()
missing_pct = missing_mask.mean() * 100
print('Missing value counts:\n')
print(missing_counts[missing_counts > 0])
print('\nMissing value percentages:\n')
print(missing_pct[missing_pct > 0].round(2))

Missing value counts:

Profession            24632
Academic Pressure     75033
Work Pressure         18778
CGPA                  75034
Study Satisfaction    75033
Job Satisfaction      18774
Dietary Habits            5
Degree                    2
dtype: int64

Missing value percentages:

Profession            26.26
Academic Pressure     79.99
Work Pressure         20.02
CGPA                  79.99
Study Satisfaction    79.99
Job Satisfaction      20.01
Dietary Habits         0.01
Degree                 0.00
dtype: float64
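
Counts and percentages can also be read side by side in a single table. The sketch below uses a hypothetical helper, `missing_summary` (not part of the notebook or its `utils` package), demonstrated on a small toy frame:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column that has missing values."""
    mask = df.isnull()
    summary = pd.DataFrame({
        "missing_count": mask.sum(),
        "missing_pct": (mask.mean() * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first.
    has_missing = summary["missing_count"] > 0
    return summary[has_missing].sort_values("missing_count", ascending=False)

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Age": [53.0, 58.0, 23.0, 47.0],
    "CGPA": [np.nan, np.nan, 6.84, np.nan],
    "Profession": ["Judge", "Teacher", None, "Teacher"],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data)` on the real frame would list the same eight columns reported above, sorted by how many values they lack.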


### Basic Statistics
#### Continuous Variables

In [14]:
data.describe()

| | Age | CGPA |
|---|---|---|
| count | 93800.0 | 18766.0 |
| mean | 40.321685 | 7.674016 |
| std | 12.39348 | 1.465056 |
| min | 18.0 | 5.03 |
| 25% | 29.0 | 6.33 |
| 50% | 42.0 | 7.8 |
| 75% | 51.0 | 8.94 |
| max | 60.0 | 10.0 |


**Insights:**
- The age distribution (range 18 to 60, mean about 40.3, median 42) closely matches that of the training dataset, indicating consistency across the two samples.
- As in the training dataset, CGPA is available for only a subset of rows: 18,766 of 93,800, or about 20%. These are most likely the students, since Academic Pressure and Study Satisfaction are missing for nearly the same number of rows (75,033).
- The CGPA median (7.8) is comparable to that in the training dataset, suggesting similar academic performance across the two samples.
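
The guess that CGPA is recorded only for students can be tested directly by cross-tabulating respondent type against CGPA availability. A minimal sketch on toy rows (illustrative values, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Working Professional or Student": [
        "Working Professional", "Working Professional",
        "Student", "Student", "Working Professional",
    ],
    "CGPA": [np.nan, np.nan, 6.84, 8.10, np.nan],
})

# Label each row by whether its CGPA is present or missing.
cgpa_status = toy["CGPA"].isna().map({True: "missing", False: "present"}).rename("CGPA")

# Rows: respondent type; columns: CGPA availability.
cgpa_by_group = pd.crosstab(toy["Working Professional or Student"], cgpa_status)
print(cgpa_by_group)
```

If the interpretation holds on the real data, nearly all "present" counts should fall in the Student row and nearly all "missing" counts in the Working Professional row.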