# Remote Work Health Analysis

### 1. Project Introduction

This Exploratory Data Analysis project seeks to uncover the relationship between an employee's work arrangement (Remote, Hybrid, Onsite) and their physical and mental health in a post-pandemic era.

To simulate real-world data quality challenges, the dataset was deliberately augmented with additional observations, missing values, formatting issues and inconsistent values. This was done using the `src/create_messy_data.py` script. The original dataset from Kaggle can be found [here](https://www.kaggle.com/datasets/pratyushpuri/remote-work-health-impact-survey-2025/data).

#### Analysis Focus:

What are the trends in employee wellbeing for each work arrangement?
(Purpose: so that companies can reorganise their remote work policies around employee health and wellbeing
and, in turn, maximise efficiency)

- Summary Stats: Percentage of each type of Physical Health / Mental Health issues for each work arrangement
- Summary Stats: Boxplot of their Scoring / Countplot of Burnout level
- Seasonality: Trend over Time for Prevalence of Health Conditions / Average Scores
- Dimensions: How it differs across age groups, gender, industry, region
- Distribution: Which industries have the greatest potential for wellbeing improvement through work arrangement changes

#### Notebook Structure:



In [47]:
"""
This dataset represents survey data, where each record represents an employee's personal particulars, job
information, physical and mental health issues and their self-assigned scores to varying degrees of burn out,
work-life balance and social isolation. 

The most important columns are the job information columns (work arrangement, hours per week) and the scores
columns (burnout level, work life balance score, social isolation score). There are also a few other columns
that would be useful in segmenting the data such as age, gender, region and industry. 

The data spans from Jan 2025 to Jun 2025


# Dimensions:
- Age
- Month
- Gender
- Region
- Industry
- Work Arrangement
- Hours Per Week
- Salary Range

# Measures:

Mental Health Status
Burnout Level
Work Life Balance score
Physical Health
Social Isolation Score
"""

### 2. Importing Packages and Setup

In [48]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sys

sys.path.append('..')
from src import standardize_lower_case, clear_trailing_spaces, remove_characters, has_mental_health_concerns, impute_by_grouped_median

sns.set_context("poster")

### 3. Data Loading and Overview

In [49]:
# Loading the raw augmented dataset
df_raw = pd.read_csv("../data/remote_work_health_dataset_augmented_raw.csv")
# df_raw = pd.read_csv("../data/post_pandemic_remote_work_health_impact_2025_raw.csv")

df_raw.head()

Unnamed: 0,Survey_Date,Age,Gender,Region,Industry,Job_Role,Work_Arrangement,Hours_Per_Week,Mental_Health_Status,Burnout_Level,Work_Life_Balance_Score,Physical_Health_Issues,Social_Isolation_Score,Salary_Range
0,2025-06-01,27.0,Female,Asia,Professional Services,Data Analyst,Onsite,64.0,Stress Disorder,High\n,3.0,Shoulder Pain; Neck Pain,Level 2.0,$40K-60K
1,2025-06-01,37.0,Female,Asia,Professional Services,Data Analyst,Onsite,37.0,,High,,Back Pain,2.0,$80K-100K
2,2025-06-01,90.0,Female,Africa,Education\n,Business Analyst,Onsite,36.0,,High,3.0,Shoulder Pain; Eye Strain,2.0,$80K-100K
3,2025-06-01,40.0,Female,Europe,Education,Data Analyst,Onsite,63.0,ADHD,Medium,1.0,Shoulder Pain; Eye Strain,2.0,$60K-80K
4,2025-06-01,30.0,Male\n,South America,Manufacturing,DevOps Engineer,Hybrid,65.0,,Medium,5.0,,4.0,$60K-80K


In [50]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5150 entries, 0 to 5149
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Survey_Date              5150 non-null   object
 1   Age                      4889 non-null   object
 2   Gender                   5150 non-null   object
 3   Region                   5150 non-null   object
 4   Industry                 5150 non-null   object
 5   Job_Role                 5150 non-null   object
 6   Work_Arrangement         5150 non-null   object
 7   Hours_Per_Week           4845 non-null   object
 8   Mental_Health_Status     3691 non-null   object
 9   Burnout_Level            4894 non-null   object
 10  Work_Life_Balance_Score  4550 non-null   object
 11  Physical_Health_Issues   4394 non-null   object
 12  Social_Isolation_Score   4742 non-null   object
 13  Salary_Range             5150 non-null   object
dtypes: object(14)
memory usage: 563.4+ KB


In [51]:
df_raw.describe() # Not much significance as columns are of the "object" type

Unnamed: 0,Survey_Date,Age,Gender,Region,Industry,Job_Role,Work_Arrangement,Hours_Per_Week,Mental_Health_Status,Burnout_Level,Work_Life_Balance_Score,Physical_Health_Issues,Social_Isolation_Score,Salary_Range
count,5150,4889.0,5150,5150,5150,5150,5150,4845.0,3691,4894,4550.0,4394,4742.0,5150
unique,180,139.0,72,47,136,157,78,121.0,87,53,66.0,158,59.0,67
top,2025-06-16,30.0,Female,North America,Technology,HR Manager,Onsite,35.0,Good,Medium,3.0,Back Pain,3.0,$60K-80K
freq,154,136.0,2071,972,800,280,1437,187.0,467,1262,1079.0,607,1019.0,933


In [52]:
df_raw.shape

(5150, 14)

### 4. Data Quality Assessment

In [53]:
# Assessing Missing Values

prop_missing_vals = pd.DataFrame({
    "Missing_Count" : df_raw.isnull().sum(),
    "Missing_Proportion" : (df_raw.isnull().sum() / len(df_raw)) * 100
}).sort_values("Missing_Proportion", ascending= False)

print(prop_missing_vals)

                         Missing_Count  Missing_Proportion
Mental_Health_Status              1459           28.330097
Physical_Health_Issues             756           14.679612
Work_Life_Balance_Score            600           11.650485
Social_Isolation_Score             408            7.922330
Hours_Per_Week                     305            5.922330
Age                                261            5.067961
Burnout_Level                      256            4.970874
Survey_Date                          0            0.000000
Gender                               0            0.000000
Region                               0            0.000000
Industry                             0            0.000000
Job_Role                             0            0.000000
Work_Arrangement                     0            0.000000
Salary_Range                         0            0.000000


### Missing Data Analysis

**Mental_Health_Status** : 28.3% missing values (1459 records)
- This is likely caused by respondents' privacy concerns around disclosing personal mental health issues. It is also possible that some respondents have no diagnosis and no ongoing mental health concerns to report.

**Physical_Health_Issues** : 14.7% missing values (756 records)
- Similarly, this may reflect respondents' unwillingness to share their physical health history. Alternatively, they may not be experiencing any physical health issues and have nothing to report.

**Work_Life_Balance_Score** : 11.7% missing values (600 records)
- Likely due to the sensitivity of work-personal boundaries as a topic. This would bias the results towards respondents who were willing to share their work-life balance scores.

The other columns each have less than 10% missing values and remain viable for Exploratory Data Analysis.
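
One way to probe whether non-response is tied to a respondent's group, rather than spread evenly, is to compare the missing rate across work arrangements. The snippet below is a minimal sketch on made-up toy data (the values are illustrative, not taken from the survey); only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the survey.
toy = pd.DataFrame({
    "Work_Arrangement": ["Onsite", "Onsite", "Remote", "Remote", "Hybrid", "Hybrid"],
    "Mental_Health_Status": ["ADHD", np.nan, np.nan, np.nan, "Good", np.nan],
})

# Share of missing responses (in %) within each work arrangement.
missing_rate = (
    toy["Mental_Health_Status"].isna()
    .groupby(toy["Work_Arrangement"])
    .mean()
    .mul(100)
)
print(missing_rate)
```

A large gap between groups would suggest that the missing responses are not random and that any imputation or recoding should be interpreted with that in mind.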

In [54]:
# Assessing Invalid Data & Formatting Inconsistencies

print(df_raw["Age"].unique())
print("\n")
print(df_raw["Work_Arrangement"].unique())
print("\n")
print(df_raw["Social_Isolation_Score"].unique())
print("\n")
print(df_raw["Salary_Range"].unique())


['27.0' '37.0' '90.0' '40.0' '30.0' '52.0' '50.0' '63.0' '42.0' '64.0' nan
 '36.0' '57.0' '26.0' '55.0' '59.0' '23.0' '22.0' '25.0' '85.0' '47.0'
 '39.0' '62.0' '29.0' '31.0' '33.0' '50.0.00' '65.0' '58.0' '51.0' '41.0'
 '44.0' '53.0' '48.0' '49.0' '56.0' '28.0' '60.0' '43.0' '38.0' '46.0'
 '45.0' '24.0' '61.0' '32.0' '35.0' 'Age 40.0' '34.0' '43.0.0' '16.0'
 '54.0' '62.0pts' '35.0+' ' 50.0 ' '17.0' 'Age 37.0' '46.0.00' '25.0.0'
 '62.0.0' '59.0.00' ' 42.0 ' '55.0pts' '54.0.0' '(45.0)' '38.0pts'
 '50.0.0' '150.0' '57.0.00' '54.0pts' '53.0pts' '41.0pts' 'Age 41.0'
 '47.0pts' '1995.0' ' 46.0 ' 'Age 29.0' '23.0.00' '0.0' 'Age 39.0'
 '(27.0)' '37.0.00' ' 38.0 ' '85.0.00' ' 56.0 ' '24.0.0' '48.0.0'
 '43.0pts' 'Age 35.0' ' 58.0 ' '48.0+' 'Age 23.0' '52.0pts' ' 40.0 '
 '15.0' '60.0.00' '27.0pts' '34.0+' ' 41.0 ' '(55.0)' '34.0.0' '37.0pts'
 ' 35.0 ' ' 39.0 ' '45.0+' '47.0.0' ' 53.0 ' '27.0.0' '29.0.00' 'Age 16.0'
 '(40.0)' '35.0.0' '49.0pts' ' 34.0 ' ' 63.0 ' '(31.0)' 'Age 43.0'
 ' 60.0 ' 'Age

### Data Formatting and Consistency Analysis

Investigation of the key data fields raises multiple issues with the data collected, which need to be addressed during the data cleaning stage. 

**Trailing Spaces**: Values in several columns carry leading or trailing whitespace, visible as `' '` as well as `\n` and `\t`.

**Capitalisation Inconsistency**: In particular, categorical columns have similar entries with inconsistent capitalisation: lower case, upper case and mixed case.

**Text Format Inconsistency**: Columns have unnecessary characters in their records such as "pts", "+" and "Age" in the `Age` column, which do not add any value to the analysis. Numeric columns also have redundant decimal suffixes in their values, such as "50.0.00".

**Invalid Values**: Columns have invalid values that do not fall into the categories assigned, such as "Mixed" and "Flexible" in the `Work_Arrangement` column, which can only accept categories such as "Onsite", "Remote" or "Hybrid". Another class of invalid values are values that fall out of scale such as records that are greater than 5 in the `Social_Isolation_Score` column, or impossible values like "0.0" and "1995" in the `Age` column.

**Inconsistent Columns**: The `Salary_Range` column contains heavily overlapping and inconsistently defined salary brackets ("$40K-$60K" and "$50K-$70K"). Due to the ambiguity and challenge of standardising these ranges without introducing assumptions, this column was excluded from further analysis.


**Improper Column Type**: All columns have the `object` dtype and need to be cast to their appropriate types for proper analysis.
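
The whitespace and capitalisation issues described above can also be quantified rather than read off `unique()` output. The following is a rough sketch on a toy column (values invented for illustration): it counts entries that change when stripped and compares the number of distinct labels before and after normalisation.

```python
import pandas as pd

# Toy column for illustration only.
s = pd.Series(["Onsite", " onsite ", "ONSITE", "Hybrid\n", "Remote"])

# Entries whose text changes once surrounding whitespace is stripped.
has_padding = s != s.str.strip()

# Distinct labels before and after normalising whitespace and case.
n_raw = s.nunique()
n_normalised = s.str.strip().str.lower().nunique()

print(int(has_padding.sum()), n_raw, n_normalised)
```

A drop from `n_raw` to `n_normalised` indicates how much of a column's apparent variety is formatting noise rather than genuinely different categories.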


In [55]:
# Checking for Duplicate Values
#print(df_raw[df_raw.duplicated()])
#print("\n")

# Counting Number and Proportion of Duplicate Values
print(df_raw.duplicated().sum())
print(df_raw.duplicated().sum() / len(df_raw) * 100)

19
0.36893203883495146


### Duplicate Record Analysis

Analysis of duplicate values found 19 entries in the entire dataset, most likely due to survey administration errors. These would not have a significant impact on the analysis, as they account for only 0.37% of the dataset.

### 5. Data Cleaning

### Addressing Duplicate Values

In [56]:
# Creating a copy of the df
df_clean = df_raw.copy()

df_clean.drop_duplicates(inplace= True)

As noted in the duplicate record analysis, these entries are likely due to survey errors and are genuine duplicates. Since the survey was conducted on a global scale, it is highly unlikely that multiple individuals share exactly the same personal particulars, job information, scores and health issues. These duplicates add no statistical value to the analysis and can be dropped, as they constitute only 0.37% of the dataset.
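
Because `duplicated()` compares raw values, rows that differ only in spacing or capitalisation are not counted as duplicates. As a hedged side check (toy data, invented for illustration), the count can be compared before and after normalising the text:

```python
import pandas as pd

# Toy data for illustration only; row 2 differs from rows 0 and 1 only in formatting.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "female ", "Male"],
    "Region": ["Asia", "Asia", "Asia", "Europe"],
})

# Exact duplicates, as counted on the raw values.
exact = int(toy.duplicated().sum())

# Duplicates after stripping whitespace and lower-casing every column.
normalised = toy.apply(lambda col: col.str.strip().str.lower())
after = int(normalised.duplicated().sum())

print(exact, after)
```

If the second count is noticeably higher on the real data, re-running the duplicate check after the formatting steps may be worthwhile.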

### Addressing Spacing and Capitalisation Inconsistency

In [57]:
df_clean = standardize_lower_case(df_clean)
df_clean = clear_trailing_spaces(df_clean)
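
The `standardize_lower_case` and `clear_trailing_spaces` helpers are imported from `src` and their bodies are not shown in this notebook. Purely as an illustration of the idea, a helper of this kind might look like the sketch below; `normalise_text_sketch` is a hypothetical name and is not the project's implementation.

```python
import pandas as pd

def normalise_text_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: lower-case and strip every string value, leave the rest untouched."""
    def _clean(value):
        # Only strings are modified; numbers and missing values pass through unchanged.
        return value.strip().lower() if isinstance(value, str) else value

    return df.apply(lambda col: col.map(_clean))

# Toy data for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male\n", " FEMALE ", "female"],
    "Score": [1.0, 2.0, 3.0],
})
cleaned = normalise_text_sketch(toy)
print(cleaned)
```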

### Addressing Text Format Inconsistencies and Invalid Values

#### Date Columns

In [58]:
# Formatting Survey_Date Column
#df_clean["Survey_Date"].unique()

df_clean = remove_characters(df_clean, "Survey_Date", " 00:00:00")
df_clean["Survey_Date"] = pd.to_datetime(df_clean["Survey_Date"])

Timestamp values " 00:00:00" were removed from each entry in the `Survey_Date` column, as only the date component is relevant, not the time of day.
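
The effect of this step can be shown on a few toy strings (invented for illustration). Once the midnight suffix is removed, every entry shares a single `YYYY-MM-DD` layout, so `pd.to_datetime` does not have to deal with mixed formats. This sketch uses `Series.str.replace` directly rather than the project's `remove_characters` helper, whose body is not shown here.

```python
import pandas as pd

# Toy values for illustration only: the same date layout, with and without a midnight timestamp.
raw_dates = pd.Series(["2025-06-01", "2025-06-01 00:00:00", "2025-01-15 00:00:00"])

# Remove the literal suffix, then parse the now-uniform strings.
dates = pd.to_datetime(raw_dates.str.replace(" 00:00:00", "", regex=False))

print(dates.dt.month.tolist())
```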

#### Numeric Columns

In [59]:
# Cleaning Age Column
#df_clean["Age"].unique()

df_clean = remove_characters(df_clean, "Age", "pts", "age ", "+", "(", ")")
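# Strip duplicated decimal suffixes such as ".0.00" and ".0.0" (dots escaped so they match literally)
df_clean["Age"] = df_clean["Age"].str.replace(r"\.0\.00$", "", regex=True)
df_clean["Age"] = df_clean["Age"].str.replace(r"\.0\.0$", "", regex=True)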

df_clean["Age"] = pd.to_numeric(df_clean["Age"], errors="coerce")
df_clean.loc[(df_clean["Age"] < 18) | (df_clean["Age"] > 100), "Age"] = np.nan

In [60]:
# Cleaning Hours_Per_Week Column
#df_clean["Hours_Per_Week"].unique()

df_clean = remove_characters(df_clean, "Hours_Per_Week", "hrs", "+", "(", ")")
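# Strip duplicated decimal suffixes such as ".0.00" and ".0.0" (dots escaped so they match literally)
df_clean["Hours_Per_Week"] = df_clean["Hours_Per_Week"].str.replace(r"\.0\.00$", "", regex=True)
df_clean["Hours_Per_Week"] = df_clean["Hours_Per_Week"].str.replace(r"\.0\.0$", "", regex=True)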

df_clean["Hours_Per_Week"] = pd.to_numeric(df_clean["Hours_Per_Week"], errors = "coerce")
df_clean.loc[(df_clean["Hours_Per_Week"] < 10) | (df_clean["Hours_Per_Week"] > 168), "Hours_Per_Week"]  = np.nan

In [61]:
# Cleaning Work_Life_Balance_Score Column
#df_clean["Work_Life_Balance_Score"].unique()

df_clean = remove_characters(df_clean, "Work_Life_Balance_Score", "pts", "-", "(", ")", "+", "level ")
# Strip duplicated decimal suffixes such as ".0.0" and ".0.00" (dots escaped so they match literally)
df_clean["Work_Life_Balance_Score"] = df_clean["Work_Life_Balance_Score"].str.replace(r"\.0\.0$", "", regex=True)
df_clean["Work_Life_Balance_Score"] = df_clean["Work_Life_Balance_Score"].str.replace(r"\.0\.00$", "", regex=True)

df_clean["Work_Life_Balance_Score"] = pd.to_numeric(df_clean["Work_Life_Balance_Score"], errors = "coerce")
df_clean.loc[(df_clean["Work_Life_Balance_Score"] < 1) | (df_clean["Work_Life_Balance_Score"] > 5), 
             "Work_Life_Balance_Score"]  = np.nan

In [62]:
# Cleaning Social_Isolation_Score Column
#df_clean["Social_Isolation_Score"].unique()

df_clean = remove_characters(df_clean, "Social_Isolation_Score", "pts", "-", "(", ")", "+", "level ")
# Strip duplicated decimal suffixes such as ".0.0" and ".0.00" (dots escaped so they match literally)
df_clean["Social_Isolation_Score"] = df_clean["Social_Isolation_Score"].str.replace(r"\.0\.0$", "", regex=True)
df_clean["Social_Isolation_Score"] = df_clean["Social_Isolation_Score"].str.replace(r"\.0\.00$", "", regex=True)

df_clean["Social_Isolation_Score"] = pd.to_numeric(df_clean["Social_Isolation_Score"], errors = "coerce")
df_clean.loc[(df_clean["Social_Isolation_Score"] < 1) | (df_clean["Social_Isolation_Score"] > 5), 
             "Social_Isolation_Score"]  = np.nan

The numeric columns contained inconsistent formatting and extraneous characters that had to be cleaned before they could be cast and analysed. Cleaning involved removing irrelevant descriptors such as "pts", "age", "hrs" and "level", as well as special characters like mathematical operators and parentheses. Redundant decimal suffixes like ".0.0" and ".0.00" were removed so that the values could be converted accurately into numeric types. Thereafter, invalid values in each column were converted to NaN based on the following criteria:

**Age**: Ages below 18 and above 100 are unrealistic ages which likely represent data entry errors.

**Hours_Per_Week**: Fewer than 10 hours per week is not representative of full-time employment, and values above 168 are impossible because a week has only 168 hours.

**Work_Life_Balance_Score**: Score values can only take values between 1 and 5 (inclusive).

**Social_Isolation_Score**: Score values can only take values between 1 and 5 (inclusive).
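
The four numeric cells above follow the same pattern: strip literal tokens, drop the duplicated decimal suffix, convert to numbers and blank out values outside the valid range. The sketch below folds that pattern into one function as an illustration. `clean_numeric_sketch` is a hypothetical name, the toy values are invented, and a single regex (`\.0\.0+$`) stands in for the notebook's two separate `str.replace` calls.

```python
import numpy as np
import pandas as pd

def clean_numeric_sketch(series, tokens, lower, upper):
    """Hypothetical helper: strip tokens, parse as numbers, set out-of-range values to NaN."""
    text = series.str.lower()
    for token in tokens:
        # Remove each token as a literal string, not as a regex.
        text = text.str.replace(token, "", regex=False)
    # Drop a duplicated decimal suffix such as ".0.0" or ".0.00".
    text = text.str.strip().str.replace(r"\.0\.0+$", "", regex=True)
    numbers = pd.to_numeric(text, errors="coerce").astype("float64")
    # Keep values inside [lower, upper]; everything else becomes NaN.
    return numbers.where(numbers.between(lower, upper))

# Toy values modelled on the formats seen in the Age column.
raw_age = pd.Series(["27.0", "Age 40.0", "50.0.00", " 42.0 ", "62.0pts",
                     "(45.0)", "1995.0", "0.0", np.nan])
clean_age = clean_numeric_sketch(raw_age, ["pts", "age", "+", "(", ")"], 18, 100)
print(clean_age.tolist())
```

Keeping the valid range as explicit arguments makes the per-column criteria listed above easy to audit in one place.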

#### Categorical Columns

In [63]:
# Cleaning Gender Column
#df_clean["Gender"].unique()

gender_mapping = {
    
    "female": "female",
    "woman": "female",
    "f": "female",

    "male": "male",
    "man": "male",
    "m": "male",

    "other": "other",
    "prefer not to say": "other",

    "non-binary": "non-binary"
}

df_clean["Gender"] = df_clean["Gender"].replace(gender_mapping)
df_clean["Gender"] = df_clean["Gender"].astype("category")

In [64]:
# Cleaning Work_Arrangement Column
# df_clean["Work_Arrangement"].unique()

arrangement_mapping ={

    "onsite": "onsite",
    "office": "onsite",
    "on-site": "onsite",
    "office work": "onsite",
    "in-person": "onsite",

    "hybrid": "hybrid",
    "mixed": "hybrid",
    "hybrid work": "hybrid",

    "remote": "remote",
    "remote work": "remote",
    "wfh": "remote",
    "work from home": "remote",

    "flexible": "flexible"

}

df_clean["Work_Arrangement"] = df_clean["Work_Arrangement"].replace(arrangement_mapping)
df_clean["Work_Arrangement"] = df_clean["Work_Arrangement"].astype("category")

In [65]:
# Cleaning Burnout_Level
# df_clean["Burnout_Level"].unique()

burnout_mapping = {

    "low": "low",
    "lo": "low",
    "lw": "low",
    "non": "low",
    "none": "low",
    "nil": "low",

    "medium": "medium",
    "moderate": "medium",
    "moderat": "medium",
    "moderete": "medium",

    "high": "high",
    "hgh": "high",
    "hi": "high",
    "very high": "high",
    "veryhigh": "high"
}

df_clean["Burnout_Level"] = df_clean["Burnout_Level"].replace(burnout_mapping)
df_clean["Burnout_Level"] = df_clean["Burnout_Level"].astype("category")

To ensure consistency across categorical variables, inconsistent entries were cleaned by mapping similar values into standardized categories. Entries with spelling inconsistencies and shorthand notations were mapped to their correct forms. This process helps simplify analysis and prevent fragmentation of categories during aggregation and visualisation.

**Gender**: Entries such as "woman" and "f" were mapped to "female", while "man" and "m" were mapped to "male". The "prefer not to say" category was grouped under "other" to reduce sparsity. The "non-binary" category was preserved as a distinct group to maintain inclusivity.

**Work_Arrangement**: Terms related to office-based and in-person work were standardized to "onsite", while remote-related terms were mapped to "remote". Hybrid setups were unified under "hybrid". The "flexible" category was retained as a separate value, as it may provide unique insights despite not fitting neatly into the standard arrangements.

**Burnout_Level**: Responses which indicated the absence of burnout such as "nil" and "none" were grouped under the "low" category under the assumption that they indicate minimal to no signs of burnout. References to elevated levels of burnout such as "veryhigh" were grouped under "high" to simplify the distribution.  
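
Since `Series.replace` leaves any value that is not a key in the mapping untouched, it can be useful to list the labels that survive the mapping without landing in one of the target categories. The following is a small sketch with an abbreviated mapping and invented toy values; "extreme" is a made-up label used only to show an unmapped entry.

```python
import pandas as pd

# Abbreviated version of the burnout mapping, for illustration.
burnout_map = {
    "low": "low", "none": "low",
    "medium": "medium", "moderate": "medium",
    "high": "high", "veryhigh": "high",
}

# Toy values; "extreme" is deliberately absent from the mapping.
raw = pd.Series(["low", "none", "moderate", "veryhigh", "extreme"])
mapped = raw.replace(burnout_map)

# Labels that are still outside the target categories after mapping.
unmapped = sorted(set(mapped) - set(burnout_map.values()))
print(unmapped)
```

An empty list would confirm that every raw label was covered before the column is cast to `category`.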

In [66]:
# Recategorising Mental_Health_Status and Physical_Health_Issues

df_clean["Mental_Health_Issues_Binary"] = df_clean["Mental_Health_Status"].apply(has_mental_health_concerns)
df_clean["Physical_Health_Issues_Binary"] = df_clean["Physical_Health_Issues"].apply(
    lambda x: 0 if pd.isna(x) or x == "none" else 1
)

df_clean["Mental_Health_Issues_Binary"] = df_clean["Mental_Health_Issues_Binary"].astype("category")
df_clean["Physical_Health_Issues_Binary"] = df_clean["Physical_Health_Issues_Binary"].astype("category")

No further cleaning was performed on the `Mental_Health_Status` and `Physical_Health_Issues` columns due to the wide variability in free-text responses. Instead, two new columns with binary values were created: `Mental_Health_Issues_Binary` and `Physical_Health_Issues_Binary`. 

**Mental_Health_Issues_Binary**: Entries mentioning poor mental health or a mental health diagnosis were assigned the value 1, which indicates the presence of mental health concerns. Responses that stated fairly good to good mental health were assigned the value 0 to show the absence of mental health issues. 0 was also assigned to null responses, on the assumption that these respondents have no medical diagnosis or no mental health issues to report.

**Physical_Health_Issues_Binary**: Entries that mention at least one physical health issue were assigned the value of 1 whereas entries labelled "none" or left blank were assigned the value of 0 to indicate no reported physical health issues.
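
The body of `has_mental_health_concerns` lives in `src` and is not shown in this notebook. As an illustration of the rule described above, a function of this kind could be sketched as follows. The function name and the set of "no concern" labels are assumptions made for the example, not the project's actual list.

```python
import numpy as np
import pandas as pd

# Assumed labels meaning "no concern"; the real helper's list is not shown in the notebook.
NO_CONCERN_LABELS = {"good", "fairly good", "none"}

def has_concern_sketch(value) -> int:
    """Hypothetical recoding: 0 for missing or 'no concern' labels, 1 otherwise."""
    if pd.isna(value):
        return 0
    return 0 if str(value).strip().lower() in NO_CONCERN_LABELS else 1

# Toy values for illustration only.
status = pd.Series(["adhd", "good", np.nan, "stress disorder"])
flags = status.apply(has_concern_sketch)
print(flags.tolist())
```

Note that treating missing responses as 0 encodes the assumption stated above; if non-response is driven by privacy concerns instead, the binary column will understate prevalence.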

In [67]:
#df_clean["Region"].unique()
#df_clean["Industry"].unique()
#df_clean["Job_Role"].unique()

df_clean["Region"] = df_clean["Region"].astype("category")
df_clean["Industry"] = df_clean["Industry"].astype("string")
df_clean["Job_Role"] = df_clean["Job_Role"].astype("string")

No additional cleaning was required for these columns (`Region`, `Industry`, `Job_Role`) after implementing consistent formatting and capitalisation, which segregated the values in each of these columns into distinct categories with no duplicates or inconsistencies.

### Addressing Missing Values

In [68]:
missing_values = pd.DataFrame({
    "Missing Values": df_clean.isnull().sum(),
    "Missing Proportion": df_clean.isnull().sum() / len(df_clean) * 100
}).sort_values("Missing Proportion", ascending= False)

print(missing_values)

                               Missing Values  Missing Proportion
Work_Life_Balance_Score                  1826           35.587605
Mental_Health_Status                     1453           28.318067
Social_Isolation_Score                   1173           22.861041
Physical_Health_Issues                    753           14.675502
Hours_Per_Week                            337            6.567920
Age                                       299            5.827324
Burnout_Level                             254            4.950302
Survey_Date                                 0            0.000000
Gender                                      0            0.000000
Region                                      0            0.000000
Industry                                    0            0.000000
Job_Role                                    0            0.000000
Work_Arrangement                            0            0.000000
Salary_Range                                0            0.000000
Mental_Hea

In [None]:
# Dropping Missing Values
df_clean = df_clean.dropna(subset= ["Burnout_Level"])

# Imputing Missing Values
impute_by_grouped_median(df_clean, "Age", "Job_Role", "Industry")
impute_by_grouped_median(df_clean, "Hours_Per_Week", "Job_Role", "Work_Arrangement")
impute_by_grouped_median(df_clean, "Social_Isolation_Score", "Work_Arrangement", "Mental_Health_Issues_Binary")
impute_by_grouped_median(df_clean, "Work_Life_Balance_Score", "Work_Arrangement", "Industry")

No valid data found in groupby combination. Applied overall median as fallback for Work_Life_Balance_Score (12 records, 0.25 %)




The missing values in these columns were addressed using different strategies based on the nature of each variable and the proportion of missing data:

**Burnout_Level**: Missing values were dropped as they accounted for less than 5% of the values in the dataset (4.95%). Given the low proportion of missing data, deleting these records was appropriate to maintain data integrity without a significant reduction in sample size.

**Age**: Missing values were imputed using median values grouped by `Job_Role` and `Industry`. This approach leverages the fact that different roles within specific industries tend to attract workers of similar age demographics.

**Hours_Per_Week**: Missing values were imputed using median values grouped by `Job_Role` and `Work_Arrangement`, as working hours differ significantly between different job positions and whether they work remotely or in-office.

**Social_Isolation_Score**: Missing values were imputed using median values grouped by `Work_Arrangement` and `Mental_Health_Issues_Binary`, given that both the work arrangement, which shapes day-to-day human interaction, and the state of one's mental health influence feelings of social isolation to varying degrees.

**Work_Life_Balance_Score**: Missing values were imputed using median values grouped by `Work_Arrangement` and `Industry`, as work-life balance expectations and practices vary considerably across industries and work arrangements. 

**Mental_Health_Status** & **Physical_Health_Issues**: Missing values were neither dropped nor imputed due to the wide variability of the free-text responses. Instead, each response (or lack of one) was assigned a binary value in `Mental_Health_Issues_Binary` and `Physical_Health_Issues_Binary` respectively.

For grouped imputation, when a specific group combination has no valid data and a median cannot be computed, missing values for that combination are filled with the column's overall median as a fallback. For instance, in the `Work_Life_Balance_Score` column, 12 records had no valid values for certain `Work_Arrangement` and `Industry` combinations. These were filled with the overall median of `Work_Life_Balance_Score`, which is unlikely to introduce meaningful bias as they account for only 0.25% of the dataset.
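
The grouped-median strategy with an overall-median fallback can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the `impute_by_grouped_median` helper from `src`, whose body is not shown. The function name, the return value and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def impute_grouped_median_sketch(df, target, *group_cols):
    """Hypothetical helper: fill NaNs with the group median, then the overall median."""
    # Median of the target within each group, aligned back to the original rows.
    group_median = df.groupby(list(group_cols))[target].transform("median")
    filled = df[target].fillna(group_median)
    # Rows still missing belong to groups with no valid data at all.
    n_fallback = int(filled.isna().sum())
    return filled.fillna(df[target].median()), n_fallback

# Toy data for illustration only: industry "b" has no valid score.
toy = pd.DataFrame({
    "Industry": ["a", "a", "a", "b", "c", "c"],
    "Score": [1.0, 3.0, np.nan, np.nan, 10.0, np.nan],
})
filled, n_fallback = impute_grouped_median_sketch(toy, "Score", "Industry")
print(filled.tolist(), n_fallback)
```

Returning the fallback count alongside the filled column makes it straightforward to report figures like the 12 records mentioned above.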