# Exploratory Data Analysis

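This notebook explores the training data for a depression-risk prediction task: it inspects the columns, cleans inconsistent categorical values, and handles missing values.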

### Dataset description

This dataset is imported from [Kaggle](https://www.kaggle.com/competitions/playground-series-s4e11) and was collected as part of a comprehensive survey aimed at understanding the factors contributing to depression risk among adults.

The target variable, `Depression`, indicates whether the individual is at risk of depression (stored as 1 for yes and 0 for no), based on their responses to lifestyle and demographic questions. The dataset has been curated to show how everyday factors might correlate with mental health risks, which makes it a useful resource for machine learning models aimed at mental health prediction.

- **id**: Unique identifier for each individual.  
- **Name**: Name of the individual.  
- **Gender**: Gender of the individual (e.g., Male, Female, Other).  
- **Age**: Age of the individual in years.  
- **City**: City of residence.  
- **Working Professional or Student**: Status indicating whether the individual is a working professional or a student.  
- **Profession**: Specific profession of the individual (if applicable).  
- **Academic Pressure**: Level of academic pressure experienced (numeric scale).  
- **Work Pressure**: Level of work-related pressure experienced (numeric scale).  
- **CGPA**: Cumulative Grade Point Average of the individual (if applicable).  
- **Study Satisfaction**: Satisfaction level with studies (numeric scale).  
- **Job Satisfaction**: Satisfaction level with job (numeric scale).  
- **Sleep Duration**: Average daily sleep duration, recorded as a text range (e.g., `5-6 hours`, `More than 8 hours`).  
- **Dietary Habits**: Description of dietary patterns (e.g., Healthy, Moderate, Unhealthy).  
- **Degree**: Highest degree or qualification attained.  
- **Have you ever had suicidal thoughts ?**: Self-reported history of suicidal thoughts (Yes/No).  
- **Work/Study Hours**: Average number of hours spent working or studying per day.  
- **Financial Stress**: Level of financial stress experienced (numeric scale).  
- **Family History of Mental Illness**: Indicates if there is a family history of mental illness (Yes/No).  
- **Depression**: Indicator of depression presence (e.g., 1 for Yes, 0 for No).  

In [1]:
# all required packages for the notebook
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import re

In [2]:
# importing data
df_train = pd.read_csv('../data/train.csv')
df_train.head()

| | id | Name | Gender | Age | City | Working Professional or Student | Profession | Academic Pressure | Work Pressure | CGPA | Study Satisfaction | Job Satisfaction | Sleep Duration | Dietary Habits | Degree | Have you ever had suicidal thoughts ? | Work/Study Hours | Financial Stress | Family History of Mental Illness | Depression |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Aaradhya | Female | 49.0 | Ludhiana | Working Professional | Chef | NaN | 5.0 | NaN | NaN | 2.0 | More than 8 hours | Healthy | BHM | No | 1.0 | 2.0 | No | 0 |
| 1 | 1 | Vivan | Male | 26.0 | Varanasi | Working Professional | Teacher | NaN | 4.0 | NaN | NaN | 3.0 | Less than 5 hours | Unhealthy | LLB | Yes | 7.0 | 3.0 | No | 1 |
| 2 | 2 | Yuvraj | Male | 33.0 | Visakhapatnam | Student | NaN | 5.0 | NaN | 8.97 | 2.0 | NaN | 5-6 hours | Healthy | B.Pharm | Yes | 3.0 | 1.0 | No | 1 |
| 3 | 3 | Yuvraj | Male | 22.0 | Mumbai | Working Professional | Teacher | NaN | 5.0 | NaN | NaN | 1.0 | Less than 5 hours | Moderate | BBA | Yes | 10.0 | 1.0 | Yes | 1 |
| 4 | 4 | Rhea | Female | 30.0 | Kanpur | Working Professional | Business Analyst | NaN | 1.0 | NaN | NaN | 1.0 | 5-6 hours | Unhealthy | BBA | Yes | 9.0 | 4.0 | Yes | 0 |


In [3]:
# looking at info
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140700 entries, 0 to 140699
Data columns (total 20 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   id                                     140700 non-null  int64  
 1   Name                                   140700 non-null  object 
 2   Gender                                 140700 non-null  object 
 3   Age                                    140700 non-null  float64
 4   City                                   140700 non-null  object 
 5   Working Professional or Student        140700 non-null  object 
 6   Profession                             104070 non-null  object 
 7   Academic Pressure                      27897 non-null   float64
 8   Work Pressure                          112782 non-null  float64
 9   CGPA                                   27898 non-null   float64
 10  Study Satisfaction                     27897 non-null   

In [4]:
numeric_cols = df_train.select_dtypes(include='number').columns
print(numeric_cols)
cat_cols = df_train.select_dtypes(include='object').columns
print(cat_cols)

Index(['id', 'Age', 'Academic Pressure', 'Work Pressure', 'CGPA',
       'Study Satisfaction', 'Job Satisfaction', 'Work/Study Hours',
       'Financial Stress', 'Depression'],
      dtype='object')
Index(['Name', 'Gender', 'City', 'Working Professional or Student',
       'Profession', 'Sleep Duration', 'Dietary Habits', 'Degree',
       'Have you ever had suicidal thoughts ?',
       'Family History of Mental Illness'],
      dtype='object')


In [5]:
for col in cat_cols:
    if col not in ('Name', 'City'):
        print(f"Unique values in {col} column: {df_train[col].unique()}")
        print()

Unique values in Gender column: ['Female' 'Male']

Unique values in Working Professional or Student column: ['Working Professional' 'Student']

Unique values in Profession column: ['Chef' 'Teacher' nan 'Business Analyst' 'Finanancial Analyst' 'Chemist'
 'Electrician' 'Software Engineer' 'Data Scientist' 'Plumber'
 'Marketing Manager' 'Accountant' 'Entrepreneur' 'HR Manager'
 'UX/UI Designer' 'Content Writer' 'Educational Consultant'
 'Civil Engineer' 'Manager' 'Pharmacist' 'Financial Analyst' 'Architect'
 'Mechanical Engineer' 'Customer Support' 'Consultant' 'Judge'
 'Researcher' 'Pilot' 'Graphic Designer' 'Travel Consultant'
 'Digital Marketer' 'Lawyer' 'Research Analyst' 'Sales Executive' 'Doctor'
 'Unemployed' 'Investment Banker' 'Family Consultant' 'B.Com' 'BE'
 'Student' 'Yogesh' 'Dev' 'MBA' 'LLM' 'BCA' 'Academic' 'Profession'
 'FamilyVirar' 'City Manager' 'BBA' 'Medical Doctor'
 'Working Professional' 'MBBS' 'Patna' 'Unveil' 'B.Ed' 'Nagpur' 'Moderate'
 'M.Ed' 'Analyst' 'Pranav' '

There are many inconsistencies in the categorical data. The following steps will be needed:

- Handle `NaN` values in the `Profession` column by replacing them with 'Student' if the person is a student, or 'Other' otherwise.
- Correct invalid values in the `Profession` column, such as the literal string 'Profession', personal names, city names, and degree names.
- Normalize the `Sleep Duration` values into a small set of consistent categories.
- Apply the same normalization to the `Dietary Habits` and `Degree` columns.
- Encode the categorical data at the end of the EDA process.
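
One way to locate the stray entries listed above is to look at how often each value occurs: legitimate categories tend to appear many times, while typos and misplaced values usually appear only a handful of times. The sketch below is a minimal illustration on toy data; the helper name `rare_values` and the threshold are assumptions, not part of the original notebook.

```python
import pandas as pd

def rare_values(series: pd.Series, min_count: int = 10) -> pd.Series:
    """Return the values of `series` that occur fewer than `min_count` times."""
    counts = series.value_counts(dropna=True)
    return counts[counts < min_count]

# Toy data standing in for df_train['Profession'] (illustrative only)
toy = pd.Series(["Teacher"] * 5 + ["Chef"] * 4 + ["Yogesh", "Patna", None])
suspects = rare_values(toy, min_count=2)
print(suspects)
```

On the real data, something like `rare_values(df_train['Profession'])` would give a short list of candidates to review by hand before deciding what counts as a valid profession.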

## Correcting invalid values in categorical data

In [6]:
# Fill missing professions: 'Student' for students, 'Other' for everyone else
is_student = df_train['Working Professional or Student'] == 'Student'
df_train['Profession'] = df_train['Profession'].fillna(
    is_student.map({True: 'Student', False: 'Other'})
)

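# Mapping for replacements (near-duplicates and a misspelling present in the raw data)
replace_map = {
    "Medical Doctor": "Doctor",
    "City Manager": "Manager",
    "Family Consultant": "Consultant",
    "Finanancial Analyst": "Financial Analyst"
}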

df_train["Profession"] = df_train["Profession"].replace(replace_map)

# Define valid professions
valid_professions = {
    'Chef', 'Teacher', 'Business Analyst', 'Financial Analyst', 'Chemist',
    'Electrician', 'Software Engineer', 'Data Scientist', 'Plumber',
    'Marketing Manager', 'Accountant', 'Entrepreneur', 'HR Manager',
    'UX/UI Designer', 'Content Writer', 'Educational Consultant',
    'Civil Engineer', 'Manager', 'Pharmacist', 'Architect',
    'Mechanical Engineer', 'Customer Support', 'Consultant', 'Judge',
    'Researcher', 'Pilot', 'Graphic Designer', 'Travel Consultant',
    'Digital Marketer', 'Lawyer', 'Research Analyst', 'Sales Executive',
    'Doctor', 'Unemployed', 'Investment Banker', 'Student', 'Academic',
    'Analyst', 'PhD'
}

# Replace anything not in the valid list with "Unknown"
df_train["Profession"] = df_train["Profession"].where(
    df_train["Profession"].isin(valid_professions), "Unknown"
)

In [7]:
import re
import numpy as np
import pandas as pd

def categorize_sleep(v):
    """Map a free-text sleep duration to one of three buckets, or "Unknown"."""
    if pd.isna(v):
        return "Unknown"
    s = str(v).lower().strip().replace("_", " ")

    # obvious noise / labels / locations
    noise_words = {
        "sleep duration", "work study hours", "moderate", "unhealthy",
        "pune", "indore"
    }
    if s in {"no"} or any(w in s for w in noise_words):
        return "Unknown"

    # "more than/less than" patterns
    m_more = re.search(r"\bmore\s*than\s*(\d+(?:\.\d+)?)\b", s)
    if m_more:
        x = float(m_more.group(1))
        if x < 6:   return "less than 6 hours"
        # inclusive bound: "more than 8" (x == 8) also lands in "6 - 8 hours"
        if x <= 8:  return "6 - 8 hours"
        return "more than 8 hours"

    m_less = re.search(r"\bless\s*than\s*(\d+(?:\.\d+)?)\b", s)
    if m_less or "than 5" in s:
        x = float(m_less.group(1)) if m_less else 5.0
        if x <= 6:  return "less than 6 hours"
        if x <= 8:  return "6 - 8 hours"
        return "more than 8 hours"

    # range like "a-b" (with or without 'hours')
    m_range = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", s)
    if m_range:
        a, b = float(m_range.group(1)), float(m_range.group(2))

        # if reversed but plausible hours, swap (e.g., "9-6 hours" -> 6-9)
        if a > b and a <= 24 and b <= 24:
            a, b = b, a

        lo, hi = min(a, b), max(a, b)
        if hi > 24 or lo == 0:
            return "Unknown"

        # bucket
        if hi < 6:               return "less than 6 hours"
        if lo > 8:               return "more than 8 hours"
        if 6 <= lo and hi <= 8:  return "6 - 8 hours"

        mid = (lo + hi) / 2
        if mid < 6:   return "less than 6 hours"
        if mid > 8:   return "more than 8 hours"
        return "6 - 8 hours"

    # single number like "8 hours" or just "8"
    m_num = re.search(r"(\d+(?:\.\d+)?)", s)
    if m_num:
        h = float(m_num.group(1))
        if h == 0 or h > 24:
            return "Unknown"
        if h < 6:      return "less than 6 hours"
        if h <= 8:     return "6 - 8 hours"
        return "more than 8 hours"

    return "Unknown"

In [8]:
df_train["Sleep Category"] = df_train["Sleep Duration"].apply(categorize_sleep)

allowed = {"less than 6 hours", "6 - 8 hours", "more than 8 hours"}

# Decide what to do with Unknowns: use the mode of valid buckets (or pick one)
mode_bucket = df_train.loc[df_train["Sleep Category"].isin(allowed), "Sleep Category"].mode()[0]

df_train["Sleep Category"] = df_train["Sleep Category"].where(
    df_train["Sleep Category"].isin(allowed),
    mode_bucket  # you can hardcode "6 - 8 hours" instead, if you prefer
)

In [9]:
df_train["Sleep Category"].value_counts(dropna=False)
# should show only the 3 categories now

Sleep Category
less than 6 hours    70982
6 - 8 hours          69712
more than 8 hours        6
Name: count, dtype: int64
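
A cross-tabulation of the raw strings against the assigned buckets is a quick way to audit a mapping like this one. The sketch below uses a small hand-built frame (the rows are illustrative, not taken from the notebook run) and `pd.crosstab` to show where each raw value ended up.

```python
import pandas as pd

# Toy frame: raw strings next to the bucket each one was assigned (illustrative)
audit = pd.DataFrame({
    "Sleep Duration": ["7-8 hours", "7-8 hours", "Less than 5 hours", "5-6 hours"],
    "Sleep Category": ["6 - 8 hours", "6 - 8 hours",
                       "less than 6 hours", "less than 6 hours"],
})

# Rows are raw values, columns are buckets, cells are row counts
table = pd.crosstab(audit["Sleep Duration"], audit["Sleep Category"])
print(table)
```

Applied to the notebook's frame, `pd.crosstab(df_train["Sleep Duration"], df_train["Sleep Category"])` would show which bucket each raw string falls into, for example where "More than 8 hours" lands.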

In [10]:
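def categorize_diet(value):
    """Map a free-text dietary habit to "Healthy", "Moderate", or "Unhealthy".

    Missing or unrecognized values fall back to "Moderate".
    """
    if pd.isna(value):
        return "Moderate"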
    
    s = str(value).lower().strip()
    
    # if no keyword "healthy" or "moderate" → treat as noise
    if not re.search(r"(healthy|moderate)", s):
        return "Moderate"  # or "unhealthy" if you want stricter
    
    # unhealthy patterns
    if re.search(r"\b(?:unhealthy|less(?:\s+\w+)*\s+healthy|no(?:\s+\w+)*\s+healthy)\b", s):
        return "Unhealthy"
    
    # healthy pattern
    if re.search(r"\bhealthy\b", s):
        return "Healthy"
    
    # moderate pattern
    if re.search(r"\bmoderate\b", s):
        return "Moderate"
    
    # default fallback
    return "Moderate"

In [11]:
df_train['Dietary Habits'] = df_train['Dietary Habits'].apply(categorize_diet)

In [12]:
df_train['Dietary Habits'].value_counts(dropna=False)
# should show only the 3 categories now

Dietary Habits
Moderate     49727
Unhealthy    46230
Healthy      44743
Name: count, dtype: int64
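
The order of the checks in `categorize_diet` matters because "unhealthy" and "less healthy" both contain the word "healthy". The short sketch below uses only the standard `re` module and a few made-up sample strings to show how an unanchored search and a word-boundary search differ.

```python
import re

samples = ["healthy", "unhealthy", "less healthy", "moderate"]

# Unanchored search: "unhealthy" contains the substring "healthy"
loose = {s: bool(re.search(r"healthy", s)) for s in samples}

# With word boundaries, "healthy" must stand as a whole word
strict = {s: bool(re.search(r"\bhealthy\b", s)) for s in samples}

print(loose)
print(strict)
```

Word boundaries keep "unhealthy" from matching the healthy pattern, but "less healthy" still matches it, so the unhealthy patterns have to be tested first, which is the order the function uses.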

## Handling empty values

In [13]:
# Finding out amount of empty values
for column in df_train.columns:
    missing_count = df_train[column].isnull().sum()
    if missing_count > 0:
        print(f"The '{column}' column has {missing_count} empty values.")

The 'Academic Pressure' column has 112803 empty values.
The 'Work Pressure' column has 27918 empty values.
The 'CGPA' column has 112802 empty values.
The 'Study Satisfaction' column has 112803 empty values.
The 'Job Satisfaction' column has 27910 empty values.
The 'Degree' column has 2 empty values.
The 'Financial Stress' column has 4 empty values.
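
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on toy data; the helper name `missing_summary` is an assumption and does not appear in the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "CGPA": [8.97, None, None, None],
    "Degree": ["BBA", "LLB", None, "BHM"],
    "Age": [49.0, 26.0, 33.0, 22.0],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df_train)` would list the same columns as the loop above, with a percentage column that separates mostly empty columns from those missing only a few rows.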


The output above shows that the following columns still contain empty values:
- Academic Pressure
- Work Pressure
- CGPA
- Study Satisfaction
- Job Satisfaction
- Degree
- Financial Stress

`Profession` and `Dietary Habits` no longer appear because their missing values were filled during the cleanup above.

We apply different strategies for handling the empty values depending on the type of column that contains them.
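
Before choosing a strategy, it can help to check whether a column is empty only where it does not apply. In the `df_train.info()` output, `Academic Pressure`, `CGPA`, and `Study Satisfaction` each have roughly 27.9k non-null values, which suggests they may be filled in mainly for students; that reading is an assumption to verify, not a result from the notebook. The sketch below computes the share of missing values per group on toy data, using a hypothetical helper `missing_share_by_group`.

```python
import pandas as pd

def missing_share_by_group(df: pd.DataFrame, group_col: str, value_cols: list) -> pd.DataFrame:
    """Fraction of missing values in each of `value_cols`, per level of `group_col`."""
    return df[value_cols].isna().groupby(df[group_col]).mean()

status = "Working Professional or Student"

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    status: ["Student", "Student", "Working Professional", "Working Professional"],
    "Academic Pressure": [5.0, 3.0, None, None],
    "Work Pressure": [None, None, 4.0, None],
})
shares = missing_share_by_group(toy, status, ["Academic Pressure", "Work Pressure"])
print(shares)
```

A share close to 1.0 for one group and close to 0.0 for the other points to values that are missing by design, which usually calls for a different treatment than values missing at random.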

In [14]:
numeric_columns_with_null = [col for col in numeric_cols if df_train[col].isnull().sum() > 0]
numeric_columns_with_null

['Academic Pressure',
 'Work Pressure',
 'CGPA',
 'Study Satisfaction',
 'Job Satisfaction',
 'Financial Stress']