### *** ANALYSING MENTAL HEALTH AND WELL-BEING IN A REMOTE WORK ERA ***

# <center> **Introduction**  

This project focuses on applying health analytics and machine learning techniques to predict employee mental health risk using work arrangement, lifestyle, and psychosocial data. By leveraging data-driven models, the study captures complex relationships between factors such as work location, working hours, sleep quality, stress levels, social isolation, physical activity, and mental health outcomes. These models enable early identification of individuals at higher risk of anxiety, depression, or burnout and support evidence-based recommendations for workplace mental health interventions, prevention strategies, and policy development.

## Part 1: Data Preprocessing & Cleaning

#### Import libraries

In [13]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

#### Data Pre-processing 

In [15]:
df = pd.read_csv(r'C:\Users\bless\OneDrive\Desktop\St Clair\St Clair docs\Courses\semester 4\Health analytics\Group project\Impact_of_Remote_Work_on_Mental_Health.csv')
df.head()

Unnamed: 0,Employee_ID,Age,Gender,Job_Role,Industry,Years_of_Experience,Work_Location,Hours_Worked_Per_Week,Number_of_Virtual_Meetings,Work_Life_Balance_Rating,Stress_Level,Mental_Health_Condition,Access_to_Mental_Health_Resources,Productivity_Change,Social_Isolation_Rating,Satisfaction_with_Remote_Work,Company_Support_for_Remote_Work,Physical_Activity,Sleep_Quality,Region
0,EMP0001,32,Non-binary,HR,Healthcare,13,Hybrid,47,7,2,Medium,Depression,No,Decrease,1,Unsatisfied,1,Weekly,Good,Europe
1,EMP0002,40,Female,Data Scientist,IT,3,Remote,52,4,1,Medium,Anxiety,No,Increase,3,Satisfied,2,Weekly,Good,Asia
2,EMP0003,59,Non-binary,Software Engineer,Education,22,Hybrid,46,11,5,Medium,Anxiety,No,No Change,4,Unsatisfied,5,,Poor,North America
3,EMP0004,27,Male,Software Engineer,Finance,20,Onsite,32,8,4,High,Depression,Yes,Increase,3,Unsatisfied,3,,Poor,Europe
4,EMP0005,49,Male,Sales,Consulting,32,Onsite,35,12,2,High,,Yes,Decrease,3,Unsatisfied,3,Weekly,Average,North America


In [5]:
df.shape

(5000, 20)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 20 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   Employee_ID                        5000 non-null   object
 1   Age                                5000 non-null   int64 
 2   Gender                             5000 non-null   object
 3   Job_Role                           5000 non-null   object
 4   Industry                           5000 non-null   object
 5   Years_of_Experience                5000 non-null   int64 
 6   Work_Location                      5000 non-null   object
 7   Hours_Worked_Per_Week              5000 non-null   int64 
 8   Number_of_Virtual_Meetings         5000 non-null   int64 
 9   Work_Life_Balance_Rating           5000 non-null   int64 
 10  Stress_Level                       5000 non-null   object
 11  Mental_Health_Condition            3804 non-null   object
 12  Access

In [7]:
# Count missing values per column and express them as a percentage of all rows
missing = df.isna().sum().sort_values(ascending=False)
missing_pct = missing / len(df) * 100

# Report only the columns that actually contain missing values
summary = pd.DataFrame({"Missing_Count": missing, "Missing_%": missing_pct})
print(summary[summary["Missing_Count"] > 0])


                         Missing_Count  Missing_%
Physical_Activity                 1629      32.58
Mental_Health_Condition           1196      23.92
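
Before treating these gaps as non-response, it may be worth checking how the raw file encodes them. Recent pandas releases list the literal string `None` among the default missing-value markers of `read_csv`, so a real category spelled `None` would be read as missing. Whether this dataset contains such a category is an assumption and is not confirmed here. The sketch below uses a small in-memory CSV with hypothetical values and reads it so that only empty fields count as missing:

```python
import io
import pandas as pd

# Hypothetical mini-CSV: one literal "None" category and one truly empty field
raw_csv = io.StringIO(
    "Employee_ID,Mental_Health_Condition\n"
    "EMP0001,Depression\n"
    "EMP0002,None\n"
    "EMP0003,\n"
)

# keep_default_na=False switches off pandas' built-in list of NA markers;
# na_values=[""] then treats only empty fields as missing
strict = pd.read_csv(raw_csv, keep_default_na=False, na_values=[""])

n_missing = int(strict["Mental_Health_Condition"].isna().sum())
n_none_label = int((strict["Mental_Health_Condition"] == "None").sum())

print(strict)
print("Missing:", n_missing, "| literal 'None':", n_none_label)
```

If the same read on the project file shows far fewer missing values than the summary above, the gaps are more likely a real category than unanswered questions, and the imputation labels should be chosen accordingly.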


In [10]:
df.columns = df.columns.str.strip().str.replace(" ", "_")
df.columns



Index(['Employee_ID', 'Age', 'Gender', 'Job_Role', 'Industry',
       'Years_of_Experience', 'Work_Location', 'Hours_Worked_Per_Week',
       'Number_of_Virtual_Meetings', 'Work_Life_Balance_Rating',
       'Stress_Level', 'Mental_Health_Condition',
       'Access_to_Mental_Health_Resources', 'Productivity_Change',
       'Social_Isolation_Rating', 'Satisfaction_with_Remote_Work',
       'Company_Support_for_Remote_Work', 'Physical_Activity', 'Sleep_Quality',
       'Region'],
      dtype='object')

In [9]:
before = len(df)
df = df.drop_duplicates()
after = len(df)

print("Duplicates removed:", before - after)


Duplicates removed: 0


In [11]:
text_columns = [
    "Gender", "Job_Role", "Industry", "Work_Location",
    "Stress_Level", "Mental_Health_Condition",
    "Access_to_Mental_Health_Resources", "Productivity_Change",
    "Satisfaction_with_Remote_Work", "Physical_Activity",
    "Sleep_Quality", "Region"
]

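# Cast each text column to pandas' string dtype and trim stray whitespace,
# so that values such as "Remote " and "Remote" are not treated as different categories
for col in text_columns:
    df[col] = df[col].astype("string").str.strip()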
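
To illustrate what the stripping step guards against, here is a minimal sketch on a toy column. The values are hypothetical and are not taken from the project dataset. Stray spaces inflate the number of distinct categories, and trimming them collapses the duplicates while leaving missing entries missing:

```python
import pandas as pd

# Toy column (hypothetical values) with stray whitespace and one missing entry
toy = pd.DataFrame(
    {"Work_Location": ["Remote", "Remote ", " Hybrid", "Hybrid", None]}
)

# Before cleaning, "Remote" and "Remote " count as different categories
raw_categories = toy["Work_Location"].nunique()

# Same transformation as in the cell above
toy["Work_Location"] = toy["Work_Location"].astype("string").str.strip()

clean_categories = toy["Work_Location"].nunique()
still_missing = int(toy["Work_Location"].isna().sum())

print("Categories before:", raw_categories)
print("Categories after:", clean_categories)
print("Missing values preserved:", still_missing)
```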


#### Data imputation ####

#### Missing Data Handling: Mental_Health_Condition (Categorical Imputation)

The Mental_Health_Condition variable represents a sensitive health outcome. 
Because filling missing values with a statistical summary (for a categorical variable, in practice the mode) could introduce incorrect assumptions about an individual's mental health status, a categorical imputation approach was used. 

All missing values were labeled as "Not Reported" to indicate that the respondent did not provide information about their mental health condition. 
This approach preserves the sample size while avoiding misclassification or artificial diagnoses.


In [16]:
df["Mental_Health_Condition"] = df["Mental_Health_Condition"].fillna("Not Reported")


#### Missing Data Handling: Physical_Activity (Categorical Imputation)

Physical_Activity is a behavioral categorical variable. 
Instead of removing records or estimating values, missing entries were replaced with the category "Unknown". 

This categorical imputation method allows the dataset to retain all observations while transparently indicating that the information was not provided.


In [17]:
df["Physical_Activity"] = df["Physical_Activity"].fillna("Unknown")
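
The two imputation steps can also be applied in one call and then verified. The sketch below is a minimal example on a toy frame with hypothetical values, not the project dataset. It passes a column-to-label dictionary to `DataFrame.fillna` and checks that no rows were lost and no missing values remain:

```python
import pandas as pd

# Toy frame (hypothetical values) with gaps in both columns
toy = pd.DataFrame({
    "Mental_Health_Condition": ["Anxiety", None, "Depression", None],
    "Physical_Activity": ["Weekly", None, "Daily", "Weekly"],
})

# One explicit placeholder label per column, matching the labels used above
labels = {
    "Mental_Health_Condition": "Not Reported",
    "Physical_Activity": "Unknown",
}

# fillna accepts a dict mapping column names to fill values
filled = toy.fillna(labels)

remaining = int(filled[list(labels)].isna().sum().sum())
not_reported = int((filled["Mental_Health_Condition"] == "Not Reported").sum())
unknown = int((filled["Physical_Activity"] == "Unknown").sum())

print("Rows before/after:", len(toy), len(filled))
print("Missing values remaining:", remaining)
print("Not Reported:", not_reported, "| Unknown:", unknown)
```

Because the placeholders become ordinary categories, any later model or frequency table will treat "Not Reported" and "Unknown" as groups in their own right. It is worth deciding at that stage whether to keep them or to exclude them.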