# Remote Work & Health Impact Analysis
## Notebook 02: Cleaning / Transformed Data

**Author:** Mengie Jean-Baptiste
**Date:** September 2025
**Purpose:** This notebook cleans and prepares the raw data for analysis by:
 - Handling missing values
 - Organizing variable formats
 - Creating regional subsets
 - Saving a cleaned dataset for future analysis

## 1. Imports and Setup

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", None)
np.random.seed(42)

## 2. Load Raw Data

In [5]:
df_raw=pd.read_csv("/Users/mengiejean-baptiste/remote_work_health_analysis/data/raw/post_pandemic_remote_work_health_impact_2025.csv")
df_raw.head()

Unnamed: 0,Survey_Date,Age,Gender,Region,Industry,Job_Role,Work_Arrangement,Hours_Per_Week,Mental_Health_Status,Burnout_Level,Work_Life_Balance_Score,Physical_Health_Issues,Social_Isolation_Score,Salary_Range
0,2025-06-01,27,Female,Asia,Professional Services,Data Analyst,Onsite,64,Stress Disorder,High,3,Shoulder Pain; Neck Pain,2,$40K-60K
1,2025-06-01,37,Female,Asia,Professional Services,Data Analyst,Onsite,37,Stress Disorder,High,4,Back Pain,2,$80K-100K
2,2025-06-01,32,Female,Africa,Education,Business Analyst,Onsite,36,ADHD,High,3,Shoulder Pain; Eye Strain,2,$80K-100K
3,2025-06-01,40,Female,Europe,Education,Data Analyst,Onsite,63,ADHD,Medium,1,Shoulder Pain; Eye Strain,2,$60K-80K
4,2025-06-01,30,Male,South America,Manufacturing,DevOps Engineer,Hybrid,65,,Medium,5,,4,$60K-80K


## 3. Initial Data Inspection

This step identifies:
- Attribute types (Qualitative & Quantitative)
- Missing values
- Potential data quality issues

In [7]:
df_raw.info()
df_raw.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3157 entries, 0 to 3156
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Survey_Date              3157 non-null   object
 1   Age                      3157 non-null   int64 
 2   Gender                   3157 non-null   object
 3   Region                   3157 non-null   object
 4   Industry                 3157 non-null   object
 5   Job_Role                 3157 non-null   object
 6   Work_Arrangement         3157 non-null   object
 7   Hours_Per_Week           3157 non-null   int64 
 8   Mental_Health_Status     2358 non-null   object
 9   Burnout_Level            3157 non-null   object
 10  Work_Life_Balance_Score  3157 non-null   int64 
 11  Physical_Health_Issues   2877 non-null   object
 12  Social_Isolation_Score   3157 non-null   int64 
 13  Salary_Range             3157 non-null   object
dtypes: int64(4), object(10)
memory usage: 34

Survey_Date                  0
Age                          0
Gender                       0
Region                       0
Industry                     0
Job_Role                     0
Work_Arrangement             0
Hours_Per_Week               0
Mental_Health_Status       799
Burnout_Level                0
Work_Life_Balance_Score      0
Physical_Health_Issues     280
Social_Isolation_Score       0
Salary_Range                 0
dtype: int64

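* Keep in mind that "Mental_Health_Status" is missing in 799 of 3,157 rows (25.3%) and "Physical_Health_Issues" in 280 rows (8.9%). Summed, that is about 34.2%, which is an upper bound on the share of affected rows because some rows may be missing both. This is a substantial amount, so removing these rows could introduce bias if the values are not missing at random.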
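The per-column counts can only be added together if no row is missing both fields. A minimal sketch of how the row-level share could be checked is below; the `toy` frame is made-up illustration data, not the survey file, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the survey file.
toy = pd.DataFrame({
    "Mental_Health_Status": ["ADHD", np.nan, np.nan, "Anxiety", "PTSD"],
    "Physical_Health_Issues": ["Back Pain", np.nan, "Eye Strain", np.nan, "Neck Pain"],
})

health_cols = ["Mental_Health_Status", "Physical_Health_Issues"]
is_missing = toy[health_cols].isna()

# Share of rows missing at least one of the two columns.
pct_any_missing = is_missing.any(axis=1).mean() * 100
# Share of rows missing both columns (the overlap).
pct_both_missing = is_missing.all(axis=1).mean() * 100
# Sum of the per-column percentages, which double-counts the overlap.
pct_sum_of_columns = is_missing.mean().sum() * 100

print(pct_any_missing, pct_both_missing, pct_sum_of_columns)
```

On the real data the same three numbers could be computed from `df_raw`; the gap between the summed figure and the "any missing" figure is the share of rows missing both fields.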

## 4. Create Copy of Dataset and Standardize Column Names

Column names are converted to:
- lowercase
- snake_case

In [10]:
df1=df_raw.copy()
df1.columns=(
    df1.columns
    .str.lower()
    .str.strip()
    .str.replace(" ", "_")
)
df1.columns

Index(['survey_date', 'age', 'gender', 'region', 'industry', 'job_role',
       'work_arrangement', 'hours_per_week', 'mental_health_status',
       'burnout_level', 'work_life_balance_score', 'physical_health_issues',
       'social_isolation_score', 'salary_range'],
      dtype='object')

## 5. Handling Duplicates

In [None]:
duplicates= df1.duplicated().sum()
duplicates

np.int64(0)

* No duplicates

## 6. Handling Missing Values

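Missing values in the two health columns are handled by replacing them with an explicit "not_reported" category instead of dropping the rows.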

In [13]:
missing=df1.isna().sum()
missing

missing_percent=(
    df1[["mental_health_status", "physical_health_issues"]]
    .isna()
    .mean() *100
)
missing_percent

mental_health_status      25.308838
physical_health_issues     8.869180
dtype: float64
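One informal way to see whether non-reporting is tied to a grouping variable is to compare the missing rate across groups. The sketch below does this by region on made-up toy data; similar rates across groups would be only weak, informal evidence that missingness is unrelated to region, not a formal test.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; values are made up.
toy = pd.DataFrame({
    "region": ["Asia", "Asia", "Europe", "Europe", "Europe", "Africa"],
    "mental_health_status": ["adhd", np.nan, np.nan, np.nan, "ptsd", "anxiety"],
})

# Percentage of missing mental_health_status values within each region.
missing_rate_by_region = (
    toy["mental_health_status"].isna()
    .groupby(toy["region"])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(missing_rate_by_region)
```

The same pattern could be applied to `df1` before the missing values are filled, and repeated for other grouping columns such as `work_arrangement`.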

In [None]:
# Replace N/A values with "not_reported"
df1["physical_health_issues"]=df1["physical_health_issues"].fillna("not_reported")
df1["mental_health_status"]=df1["mental_health_status"].fillna("not_reported")
df1
pd.crosstab(
    df1["region"],
    df1["mental_health_status"],
    margins=True
)

pd.crosstab(
    df1["region"],
    df1["physical_health_issues"],
    margins=True
)



physical_health_issues,Back Pain,Back Pain; Eye Strain,Back Pain; Eye Strain; Neck Pain,Back Pain; Eye Strain; Neck Pain; Wrist Pain,Back Pain; Eye Strain; Wrist Pain,Back Pain; Neck Pain,Back Pain; Neck Pain; Wrist Pain,Back Pain; Shoulder Pain,Back Pain; Shoulder Pain; Eye Strain,Back Pain; Shoulder Pain; Eye Strain; Neck Pain,Back Pain; Shoulder Pain; Eye Strain; Neck Pain; Wrist Pain,Back Pain; Shoulder Pain; Eye Strain; Wrist Pain,Back Pain; Shoulder Pain; Neck Pain,Back Pain; Shoulder Pain; Neck Pain; Wrist Pain,Back Pain; Shoulder Pain; Wrist Pain,Back Pain; Wrist Pain,Eye Strain,Eye Strain; Neck Pain,Eye Strain; Neck Pain; Wrist Pain,Eye Strain; Wrist Pain,Neck Pain,Neck Pain; Wrist Pain,Shoulder Pain,Shoulder Pain; Eye Strain,Shoulder Pain; Eye Strain; Neck Pain,Shoulder Pain; Eye Strain; Neck Pain; Wrist Pain,Shoulder Pain; Eye Strain; Wrist Pain,Shoulder Pain; Neck Pain,Shoulder Pain; Neck Pain; Wrist Pain,Shoulder Pain; Wrist Pain,Wrist Pain,not_reported,All
Africa,42,36,21,2,10,8,1,43,43,12,3,8,11,4,5,11,45,20,3,3,21,6,37,38,18,2,5,12,2,5,7,48,532
Asia,46,42,10,3,1,12,2,30,38,15,5,8,6,4,10,11,38,20,2,11,11,3,47,47,10,2,2,16,1,8,11,45,517
Europe,38,50,15,2,5,12,3,38,44,14,4,9,7,4,6,6,43,18,2,10,16,3,41,38,16,0,4,8,3,5,8,41,513
North America,30,50,12,1,7,20,2,36,39,14,1,4,11,4,7,5,35,13,1,6,14,4,43,39,13,1,12,14,1,8,9,41,497
Oceania,44,43,19,0,7,14,4,35,48,18,3,9,10,1,8,5,42,8,2,3,17,6,28,53,11,3,5,14,2,4,6,51,523
South America,53,37,16,9,6,22,2,36,43,15,2,13,23,2,5,8,53,11,2,10,13,1,39,47,8,3,10,14,2,8,8,54,575
All,253,258,93,17,36,88,14,218,255,88,18,51,68,19,41,46,256,90,12,43,92,23,235,262,76,11,38,78,11,38,49,280,3157
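Because each respondent can list several issues separated by "; ", the crosstab above has one column per combination. A possible alternative, sketched here on made-up toy data, is to split the column into one indicator per individual issue with `Series.str.get_dummies` and then sum by region.

```python
import pandas as pd

# Toy data for illustration only; the "; " separator matches the raw column format.
toy = pd.DataFrame({
    "region": ["Asia", "Asia", "Europe", "Europe"],
    "physical_health_issues": [
        "Back Pain; Eye Strain",
        "Eye Strain",
        "not_reported",
        "Back Pain; Neck Pain; Wrist Pain",
    ],
})

# One 0/1 column per individual issue (plus one for "not_reported").
issue_flags = toy["physical_health_issues"].str.get_dummies(sep="; ")

# Number of respondents reporting each issue, per region.
issues_by_region = issue_flags.groupby(toy["region"]).sum()
print(issues_by_region)
```

This keeps the table to a handful of columns, at the cost of no longer showing which issues co-occur.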


## 7. Create a North America Subset

In [None]:
df1_na=df1[df1["region"]=="North America"].copy()

mental_health_status
not_reported       134
depression          73
adhd                61
stress disorder     61
ptsd                60
burnout             56
anxiety             52
Name: count, dtype: int64

## 8. Handle not reported values

In [None]:
df1_na["physical_health_issues"]=(
    df1_na["physical_health_issues"].astype(str)
    .str.strip()
    .str.lower()
)
df1_na["mental_health_status"]=(
    df1_na["mental_health_status"]
    .astype(str)
    .str.strip()
    .str.lower()
)
df1_na["physical_health_issues"].value_counts(dropna=False)
df1_na["mental_health_status"].value_counts(dropna=False)

df1_na["mental_nonreported"]=(
    df1_na["mental_health_status"]=="not_reported"
).astype(str)

df1_na["physical_nonreported"]=(
    df1_na["physical_health_issues"]=="not_reported"
).astype(str)

print(df1_na["mental_nonreported"].value_counts())
print(df1_na["physical_nonreported"].value_counts())
df1_na=df1_na.reset_index(drop=True)
df1_na.index=df1_na.index +1


mental_nonreported
False    363
True     134
Name: count, dtype: int64
physical_nonreported
False    456
True      41
Name: count, dtype: int64


Unnamed: 0,survey_date,age,gender,region,industry,job_role,work_arrangement,hours_per_week,mental_health_status,burnout_level,work_life_balance_score,physical_health_issues,social_isolation_score,salary_range,mental_nonreported,physical_nonreported
1,2025-06-01,64,Male,North America,Technology,Business Analyst,Remote,35,adhd,Medium,3,eye strain; wrist pain,4,$40K-60K,False,False
2,2025-06-01,50,Male,North America,Education,Digital Marketing Specialist,Onsite,51,stress disorder,Low,4,back pain; shoulder pain,5,$100K-120K,False,False
3,2025-06-01,23,Male,North America,Professional Services,Product Manager,Onsite,63,stress disorder,Medium,3,shoulder pain; neck pain,2,$60K-80K,False,False
4,2025-06-01,22,Female,North America,Manufacturing,DevOps Engineer,Hybrid,47,anxiety,High,4,eye strain,1,$60K-80K,False,False
5,2025-06-01,39,Male,North America,Manufacturing,Quality Assurance,Onsite,59,anxiety,Low,5,eye strain; neck pain,2,$60K-80K,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,2025-06-26,41,Female,North America,Technology,Product Manager,Remote,59,stress disorder,High,2,shoulder pain,1,$100K-120K,False,False
494,2025-06-26,42,Male,North America,Technology,Account Manager,Onsite,45,adhd,Low,3,not_reported,3,$80K-100K,False,True
495,2025-06-26,45,Female,North America,Professional Services,HR Manager,Onsite,59,ptsd,Medium,1,shoulder pain,3,$40K-60K,False,False
496,2025-06-26,38,Male,North America,Education,Operations Manager,Onsite,52,depression,Medium,3,shoulder pain; eye strain; neck pain,5,$80K-100K,False,False


* Index was reset after subsetting

## 9. Create a Subset for Each Region

In [154]:
df1['region'].value_counts()

df1_eu=df1[df1["region"]=="Europe"].copy()
df1_oc=df1[df1["region"]=="Oceania"].copy()
df1_af=df1[df1["region"]=="Africa"].copy()
df1_sa=df1[df1["region"]=="South America"].copy()
df1_as=df1[df1["region"]=="Asia"].copy()
assert df1_as["region"].nunique() == 1
assert df1_as["region"].iloc[0] == "Asia"
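The cleaning steps applied to each regional subset are identical, so they could be wrapped in one function. The sketch below is one way to do that; `clean_region_subset` and `TEXT_TO_FLAG` are hypothetical names, the `toy` frame is made-up data, and the flags are stored as strings only to mirror the per-region cells in this notebook.

```python
import pandas as pd

# Maps each health text column to the name of its non-report flag column.
TEXT_TO_FLAG = {
    "mental_health_status": "mental_nonreported",
    "physical_health_issues": "physical_nonreported",
}

def clean_region_subset(df, region):
    """Hypothetical helper: return one region's rows with normalized
    health text, non-report flags, and a 1-based index."""
    subset = df[df["region"] == region].copy()
    for text_col, flag_col in TEXT_TO_FLAG.items():
        # Normalize text: cast to str, trim whitespace, lowercase.
        subset[text_col] = subset[text_col].astype(str).str.strip().str.lower()
        # Flag rows filled with "not_reported" (kept as str, as in the notebook cells).
        subset[flag_col] = (subset[text_col] == "not_reported").astype(str)
    subset = subset.reset_index(drop=True)
    subset.index = subset.index + 1
    return subset

# Toy data for illustration only.
toy = pd.DataFrame({
    "region": ["Asia", "Europe", "Asia"],
    "mental_health_status": [" ADHD ", "not_reported", "not_reported"],
    "physical_health_issues": ["Back Pain", "not_reported", "Eye Strain"],
})

subsets = {region: clean_region_subset(toy, region) for region in toy["region"].unique()}
print(subsets["Asia"])
```

A dictionary keyed by region also avoids maintaining six separately named DataFrames.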




## Europe dataset cleaning

In [None]:
df1_eu["physical_health_issues"]=(
    df1_eu["physical_health_issues"].astype(str)
    .str.strip()
    .str.lower()
)

df1_eu["physical_health_issues"]

df1_eu["mental_health_status"]=(
    df1_eu["mental_health_status"]
    .astype(str)
    .str.strip()
    .str.lower()
)

df1_eu["mental_health_status"]

df1_eu["physical_health_issues"].value_counts(dropna=False)
df1_eu["mental_health_status"].value_counts(dropna=False)

df1_eu["mental_nonreported"]=(
    df1_eu["mental_health_status"]=="not_reported"
).astype(str)

df1_eu["physical_nonreported"]=(
    df1_eu["physical_health_issues"]=="not_reported"
).astype(str)

print(df1_eu["mental_nonreported"].value_counts())
print(df1_eu["physical_nonreported"].value_counts())

df1_eu=df1_eu.reset_index(drop=True)
df1_eu.index=df1_eu.index +1

df1_eu



mental_nonreported
False    370
True     143
Name: count, dtype: int64
physical_nonreported
False    472
True      41
Name: count, dtype: int64


Unnamed: 0,survey_date,age,gender,region,industry,job_role,work_arrangement,hours_per_week,mental_health_status,burnout_level,work_life_balance_score,physical_health_issues,social_isolation_score,salary_range,mental_nonreported,physical_nonreported
1,2025-06-01,40,Female,Europe,Education,Data Analyst,Onsite,63,adhd,Medium,1,shoulder pain; eye strain,2,$60K-80K,False,False
2,2025-06-01,63,Non-binary,Europe,Professional Services,Technical Writer,Onsite,55,anxiety,High,3,not_reported,2,$100K-120K,False,True
3,2025-06-01,37,Male,Europe,Finance,UX Designer,Remote,59,anxiety,High,5,back pain; shoulder pain; wrist pain,5,$60K-80K,False,False
4,2025-06-01,50,Male,Europe,Professional Services,Social Media Manager,Remote,64,burnout,Medium,1,not_reported,4,$60K-80K,False,True
5,2025-06-01,27,Male,Europe,Education,IT Support,Onsite,44,depression,Medium,3,back pain; shoulder pain; eye strain,3,$80K-100K,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
509,2025-06-26,45,Female,Europe,Manufacturing,HR Manager,Hybrid,40,not_reported,High,2,not_reported,1,$60K-80K,True,True
510,2025-06-26,48,Female,Europe,Technology,Quality Assurance,Remote,43,ptsd,Medium,4,eye strain,3,$60K-80K,False,False
511,2025-06-26,31,Female,Europe,Finance,Business Analyst,Hybrid,36,anxiety,High,3,wrist pain,2,$100K-120K,False,False
512,2025-06-26,35,Male,Europe,Technology,Project Manager,Hybrid,53,burnout,Medium,2,back pain; shoulder pain; eye strain,2,$40K-60K,False,False


## Asia Subset data cleaning

In [158]:
df1_as["physical_health_issues"]=(
    df1_as["physical_health_issues"].astype(str)
    .str.strip()
    .str.lower()
)

df1_as["physical_health_issues"]

df1_as["mental_health_status"]=(
    df1_as["mental_health_status"]
    .astype(str)
    .str.strip()
    .str.lower()
)

df1_as["mental_health_status"]

df1_as["physical_health_issues"].value_counts(dropna=False)
df1_as["mental_health_status"].value_counts(dropna=False)

df1_as["mental_nonreported"]=(
    df1_as["mental_health_status"]=="not_reported"
).astype(str)

df1_as["physical_nonreported"]=(
    df1_as["physical_health_issues"]=="not_reported"
).astype(str)

print(df1_as["mental_nonreported"].value_counts())
print(df1_as["physical_nonreported"].value_counts())

df1_as=df1_as.reset_index(drop=True)
df1_as.index= df1_as.index + 1

df1_as







mental_nonreported
False    403
True     114
Name: count, dtype: int64
physical_nonreported
False    472
True      45
Name: count, dtype: int64


Unnamed: 0,survey_date,age,gender,region,industry,job_role,work_arrangement,hours_per_week,mental_health_status,burnout_level,work_life_balance_score,physical_health_issues,social_isolation_score,salary_range,mental_nonreported,physical_nonreported
1,2025-06-01,27,Female,Asia,Professional Services,Data Analyst,Onsite,64,stress disorder,High,3,shoulder pain; neck pain,2,$40K-60K,False,False
2,2025-06-01,37,Female,Asia,Professional Services,Data Analyst,Onsite,37,stress disorder,High,4,back pain,2,$80K-100K,False,False
3,2025-06-01,50,Female,Asia,Manufacturing,IT Support,Onsite,62,not_reported,Medium,4,back pain; shoulder pain; wrist pain,2,$80K-100K,True,False
4,2025-06-01,37,Female,Asia,Finance,HR Manager,Onsite,55,burnout,Medium,5,back pain,1,$60K-80K,False,False
5,2025-06-01,42,Female,Asia,Professional Services,Project Manager,Onsite,38,not_reported,High,3,shoulder pain,2,$100K-120K,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
513,2025-06-26,43,Female,Asia,Technology,Consultant,Remote,40,burnout,High,2,back pain; eye strain; neck pain,3,$60K-80K,False,False
514,2025-06-26,47,Female,Asia,Healthcare,Data Analyst,Onsite,38,ptsd,Low,5,not_reported,3,$120K+,False,True
515,2025-06-26,59,Female,Asia,Professional Services,Data Analyst,Onsite,64,stress disorder,Low,4,eye strain; neck pain,2,$60K-80K,False,False
516,2025-06-26,42,Male,Asia,Healthcare,Data Analyst,Hybrid,53,anxiety,High,2,shoulder pain,4,$40K-60K,False,False


## Oceania subset data cleaning

In [162]:
df1_oc["physical_health_issues"]=(
    df1_oc["physical_health_issues"].astype(str)
    .str.strip()
    .str.lower()
)

df1_oc["physical_health_issues"]

df1_oc["mental_health_status"]=(
    df1_oc["mental_health_status"]
    .astype(str)
    .str.strip()
    .str.lower()
)

df1_oc["mental_health_status"]

df1_oc["physical_health_issues"].value_counts(dropna=False)
df1_oc["mental_health_status"].value_counts(dropna=False)

df1_oc["mental_nonreported"]=(
    df1_oc["mental_health_status"]=="not_reported"
).astype(str)

df1_oc["physical_nonreported"]=(
    df1_oc["physical_health_issues"]=="not_reported"
).astype(str)

print(df1_oc["mental_nonreported"].value_counts())
print(df1_oc["physical_nonreported"].value_counts())

df1_oc=df1_oc.reset_index(drop=True)
df1_oc.index=df1_oc.index + 1

df1_oc

mental_nonreported
False    377
True     146
Name: count, dtype: int64
physical_nonreported
False    472
True      51
Name: count, dtype: int64


Unnamed: 0,survey_date,age,gender,region,industry,job_role,work_arrangement,hours_per_week,mental_health_status,burnout_level,work_life_balance_score,physical_health_issues,social_isolation_score,salary_range,mental_nonreported,physical_nonreported
1,2025-06-01,52,Male,Oceania,Customer Service,Business Analyst,Onsite,61,burnout,Medium,4,back pain; shoulder pain,3,$60K-80K,False,False
2,2025-06-01,25,Female,Oceania,Technology,Data Scientist,Hybrid,57,burnout,High,2,back pain; eye strain,1,$80K-100K,False,False
3,2025-06-01,30,Male,Oceania,Professional Services,Data Analyst,Hybrid,36,anxiety,High,3,neck pain,2,$60K-80K,False,False
4,2025-06-01,59,Male,Oceania,Finance,Customer Service Manager,Remote,45,anxiety,High,2,shoulder pain; eye strain; neck pain; wrist pain,3,$80K-100K,False,False
5,2025-06-01,57,Female,Oceania,Customer Service,Research Scientist,Remote,41,depression,Medium,3,back pain; shoulder pain; eye strain,3,$120K+,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
519,2025-06-26,24,Female,Oceania,Manufacturing,Quality Assurance,Remote,36,adhd,Medium,1,back pain; shoulder pain,5,$40K-60K,False,False
520,2025-06-26,60,Female,Oceania,Customer Service,HR Manager,Onsite,61,ptsd,Low,3,back pain,2,$40K-60K,False,False
521,2025-06-26,24,Female,Oceania,Healthcare,Customer Service Manager,Onsite,47,not_reported,Medium,2,not_reported,1,$40K-60K,True,True
522,2025-06-26,42,Male,Oceania,Technology,Data Analyst,Hybrid,55,adhd,High,2,back pain,1,$80K-100K,False,False


## Africa subset cleaning

In [166]:
df1_af["physical_health_issues"]=(
    df1_af["physical_health_issues"].astype(str)
    .str.strip()
    .str.lower()
)

df1_af["physical_health_issues"]

df1_af["mental_health_status"]=(
    df1_af["mental_health_status"]
    .astype(str)
    .str.strip()
    .str.lower()
)

df1_af["mental_health_status"]

df1_af["physical_health_issues"].value_counts(dropna=False)
df1_af["mental_health_status"].value_counts(dropna=False)

df1_af["mental_nonreported"]=(
    df1_af["mental_health_status"]=="not_reported"
).astype(str)

df1_af["physical_nonreported"]=(
    df1_af["physical_health_issues"]=="not_reported"
).astype(str)

print(df1_af["mental_nonreported"].value_counts())
print(df1_af["physical_nonreported"].value_counts())

df1_af=df1_af.reset_index(drop=True)
df1_af.index=df1_af.index + 1
df1_af

mental_nonreported
False    415
True     117
Name: count, dtype: int64
physical_nonreported
False    484
True      48
Name: count, dtype: int64


Unnamed: 0,survey_date,age,gender,region,industry,job_role,work_arrangement,hours_per_week,mental_health_status,burnout_level,work_life_balance_score,physical_health_issues,social_isolation_score,salary_range,mental_nonreported,physical_nonreported
1,2025-06-01,32,Female,Africa,Education,Business Analyst,Onsite,36,adhd,High,3,shoulder pain; eye strain,2,$80K-100K,False,False
2,2025-06-01,36,Female,Africa,Customer Service,HR Manager,Onsite,63,not_reported,Medium,3,shoulder pain,2,$60K-80K,True,False
3,2025-06-01,27,Female,Africa,Healthcare,Account Manager,Remote,43,not_reported,Medium,1,not_reported,4,$80K-100K,True,True
4,2025-06-01,36,Female,Africa,Manufacturing,Research Scientist,Remote,41,ptsd,High,5,back pain; shoulder pain; eye strain,3,$100K-120K,False,False
5,2025-06-01,57,Male,Africa,Retail,Sales Representative,Onsite,57,burnout,High,3,shoulder pain; eye strain,3,$60K-80K,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
528,2025-06-26,56,Male,Africa,Healthcare,IT Support,Onsite,65,adhd,Low,1,back pain; shoulder pain; eye strain; wrist pain,5,$60K-80K,False,False
529,2025-06-26,47,Male,Africa,Finance,Research Scientist,Hybrid,35,ptsd,Medium,4,back pain,1,$60K-80K,False,False
530,2025-06-26,34,Male,Africa,Education,Account Manager,Onsite,36,depression,Low,4,back pain; shoulder pain; neck pain,2,$80K-100K,False,False
531,2025-06-26,22,Male,Africa,Finance,Marketing Specialist,Onsite,60,burnout,Low,2,back pain,1,$80K-100K,False,False


## South America Subset cleaning

In [168]:
df1_sa["physical_health_issues"]=(
    df1_sa["physical_health_issues"].astype(str)
    .str.strip()
    .str.lower()
)

df1_sa["physical_health_issues"]

df1_sa["mental_health_status"]=(
    df1_sa["mental_health_status"]
    .astype(str)
    .str.strip()
    .str.lower()
)

df1_sa["mental_health_status"]

df1_sa["physical_health_issues"].value_counts(dropna=False)
df1_sa["mental_health_status"].value_counts(dropna=False)

df1_sa["mental_nonreported"]=(
    df1_sa["mental_health_status"]=="not_reported"
).astype(str)

df1_sa["physical_nonreported"]=(
    df1_sa["physical_health_issues"]=="not_reported"
).astype(str)

print(df1_sa["mental_nonreported"].value_counts())
print(df1_sa["physical_nonreported"].value_counts())

df1_sa=df1_sa.reset_index(drop=True)
df1_sa.index=df1_sa.index + 1
df1_sa

mental_nonreported
False    430
True     145
Name: count, dtype: int64
physical_nonreported
False    521
True      54
Name: count, dtype: int64


Unnamed: 0,survey_date,age,gender,region,industry,job_role,work_arrangement,hours_per_week,mental_health_status,burnout_level,work_life_balance_score,physical_health_issues,social_isolation_score,salary_range,mental_nonreported,physical_nonreported
1,2025-06-01,30,Male,South America,Manufacturing,DevOps Engineer,Hybrid,65,not_reported,Medium,5,not_reported,4,$60K-80K,True,True
2,2025-06-01,30,Female,South America,Technology,Software Engineer,Remote,47,anxiety,Medium,2,neck pain,4,$60K-80K,False,False
3,2025-06-01,42,Female,South America,Retail,Data Scientist,Onsite,54,anxiety,High,5,back pain; shoulder pain; eye strain,2,$120K+,False,False
4,2025-06-01,37,Male,South America,Manufacturing,Technical Writer,Remote,58,adhd,High,1,shoulder pain,5,$60K-80K,False,False
5,2025-06-01,26,Female,South America,Customer Service,Data Analyst,Hybrid,59,stress disorder,Medium,2,back pain; wrist pain,5,$40K-60K,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
571,2025-06-26,26,Male,South America,Professional Services,DevOps Engineer,Hybrid,60,ptsd,Low,5,back pain; shoulder pain; neck pain,3,$40K-60K,False,False
572,2025-06-26,63,Male,South America,Customer Service,Executive Assistant,Remote,46,adhd,Medium,3,back pain,5,$80K-100K,False,False
573,2025-06-26,45,Female,South America,Professional Services,Technical Writer,Onsite,56,burnout,Medium,4,back pain; shoulder pain,2,$80K-100K,False,False
574,2025-06-26,62,Female,South America,Professional Services,Data Analyst,Hybrid,38,ptsd,Medium,4,shoulder pain; neck pain,3,$80K-100K,False,False


## Save subsets

In [180]:
df1_na.to_csv("/Users/mengiejean-baptiste/remote_work_health_analysis/data/processed/processed_na.csv", index=False)
df1_eu.to_csv("/Users/mengiejean-baptiste/remote_work_health_analysis/data/processed/processed_eu.csv", index=False)
df1_as.to_csv("/Users/mengiejean-baptiste/remote_work_health_analysis/data/processed/processed_as.csv", index=False)
df1_af.to_csv("/Users/mengiejean-baptiste/remote_work_health_analysis/data/processed/processed_af.csv", index=False)
df1_oc.to_csv("/Users/mengiejean-baptiste/remote_work_health_analysis/data/processed/processed_oc.csv", index=False)
df1_sa.to_csv("/Users/mengiejean-baptiste/remote_work_health_analysis/data/processed/processed_sa.csv", index=False)
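The six save calls differ only in the file suffix, so they could be written as a loop over a dictionary of subsets. This is a minimal sketch under stated assumptions: the `subsets` dictionary holds toy stand-ins for the regional DataFrames, and a temporary directory stands in for the project's processed-data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the regional DataFrames; illustration only.
subsets = {
    "na": pd.DataFrame({"region": ["North America"], "age": [30]}),
    "eu": pd.DataFrame({"region": ["Europe"], "age": [41]}),
}

# Assumption: a temporary directory is used here in place of data/processed.
out_dir = Path(tempfile.mkdtemp()) / "data" / "processed"
out_dir.mkdir(parents=True, exist_ok=True)

# Write one CSV per subset, following the processed_<suffix>.csv naming pattern.
for suffix, frame in subsets.items():
    frame.to_csv(out_dir / f"processed_{suffix}.csv", index=False)

written = sorted(p.name for p in out_dir.glob("processed_*.csv"))
print(written)
```

Building the path from a single base directory also makes the notebook easier to run on another machine, since only one location needs to change.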
