# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before you start coding, provide the link to your dataset below.

My dataset: https://www.kaggle.com/datasets/waqi786/remote-work-and-mental-health

Import the necessary libraries and create your dataframe(s).

In [13]:
# import libraries
import pandas as pd
import numpy as np

# load dataset
df = pd.read_csv(r'C:\Users\jorda\OneDrive\Desktop\Data Analysis\Final Project\Impact_of_Remote_Work_on_Mental_Health.csv')

# display first rows
df.head()

Unnamed: 0,Employee_ID,Age,Gender,Job_Role,Industry,Years_of_Experience,Work_Location,Hours_Worked_Per_Week,Number_of_Virtual_Meetings,Work_Life_Balance_Rating,Stress_Level,Mental_Health_Condition,Access_to_Mental_Health_Resources,Productivity_Change,Social_Isolation_Rating,Satisfaction_with_Remote_Work,Company_Support_for_Remote_Work,Physical_Activity,Sleep_Quality,Region
0,EMP0001,32,Non-binary,HR,Healthcare,13,Hybrid,47,7,2,Medium,Depression,No,Decrease,1,Unsatisfied,1,Weekly,Good,Europe
1,EMP0002,40,Female,Data Scientist,IT,3,Remote,52,4,1,Medium,Anxiety,No,Increase,3,Satisfied,2,Weekly,Good,Asia
2,EMP0003,59,Non-binary,Software Engineer,Education,22,Hybrid,46,11,5,Medium,Anxiety,No,No Change,4,Unsatisfied,5,,Poor,North America
3,EMP0004,27,Male,Software Engineer,Finance,20,Onsite,32,8,4,High,Depression,Yes,Increase,3,Unsatisfied,3,,Poor,Europe
4,EMP0005,49,Male,Sales,Consulting,32,Onsite,35,12,2,High,,Yes,Decrease,3,Unsatisfied,3,Weekly,Average,North America


## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [11]:
# check for missing values (commented out after the initial inspection)
# df.isna().sum()

# fill missing data with "N/A" to show the question was not answered
df = df.fillna("N/A")

# confirm that no missing values remain
print("Missing values after filling:", df.isnull().sum())


Missing values after filling: Employee_ID                          0
Age                                  0
Gender                               0
Job_Role                             0
Industry                             0
Years_of_Experience                  0
Work_Location                        0
Hours_Worked_Per_Week                0
Number_of_Virtual_Meetings           0
Work_Life_Balance_Rating             0
Stress_Level                         0
Mental_Health_Condition              0
Access_to_Mental_Health_Resources    0
Productivity_Change                  0
Social_Isolation_Rating              0
Satisfaction_with_Remote_Work        0
Company_Support_for_Remote_Work      0
Physical_Activity                    0
Sleep_Quality                        0
Region                               0
dtype: int64
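
Because the `isna()` check above is commented out, the per-column counts before filling are not recorded in this notebook. The following is a minimal sketch of recording them first, using a small made-up sample rather than the real dataset; the variable names `sample`, `text_cols`, and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows that mimic three columns of the dataset
sample = pd.DataFrame({
    "Mental_Health_Condition": ["Depression", "Anxiety", np.nan, "Burnout"],
    "Physical_Activity": ["Weekly", np.nan, np.nan, "Weekly"],
    "Age": [32, 40, 59, 27],
})

# Count missing values per column before changing anything
missing_before = sample.isna().sum()

# Fill only the named text columns so numeric columns keep their dtype
text_cols = ["Mental_Health_Condition", "Physical_Activity"]
filled = sample.copy()
filled[text_cols] = filled[text_cols].fillna("N/A")

# Count again to confirm the fill worked
missing_after = filled.isna().sum()
print(missing_before)
print(missing_after)
```

Keeping the "before" counts makes it possible to report how many answers in each column were blank, which a zero-only "after" table cannot show.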


## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [14]:
df.describe()

Unnamed: 0,Age,Years_of_Experience,Hours_Worked_Per_Week,Number_of_Virtual_Meetings,Work_Life_Balance_Rating,Social_Isolation_Rating,Company_Support_for_Remote_Work
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,40.995,17.8102,39.6146,7.559,2.9842,2.9938,3.0078
std,11.296021,10.020412,11.860194,4.636121,1.410513,1.394615,1.399046
min,22.0,1.0,20.0,0.0,1.0,1.0,1.0
25%,31.0,9.0,29.0,4.0,2.0,2.0,2.0
50%,41.0,18.0,40.0,8.0,3.0,3.0,3.0
75%,51.0,26.0,50.0,12.0,4.0,4.0,4.0
max,60.0,35.0,60.0,15.0,5.0,5.0,5.0
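
The summary statistics above show every numeric column inside a bounded range (for example, Age from 22 to 60 and the ratings from 1 to 5). One common way to make that check explicit is the 1.5 × IQR rule. The sketch below uses made-up values, with one deliberately extreme entry, and is not necessarily the method used for this checkpoint.

```python
import pandas as pd

# Hypothetical values; the last one is deliberately extreme
hours = pd.Series([20, 29, 35, 40, 45, 50, 60, 120], name="Hours_Worked_Per_Week")

# Interquartile range and the usual 1.5 * IQR fences
q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier
outliers = hours[(hours < lower) | (hours > upper)]
print(outliers.tolist())
```

Applied to the quartiles printed above for `Hours_Worked_Per_Week` (29 and 50), the fences would be -2.5 and 81.5, and the reported minimum and maximum (20 and 60) both fall inside them.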


## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address them as needed. Make sure to use code comments to illustrate your thought process.

In [18]:
# print column names (used for the initial inspection)
# print(df.columns)

# I decided to drop the "Region" column because it does not pertain to my business question:
# do work location, hours worked, and job role affect an individual's mental health, and does
# an individual's access to mental health care affect their mental health condition?
# errors='ignore' lets this cell be re-run after the column is already gone
df = df.drop(columns=['Region'], errors='ignore')

# check
print(df.columns)


Index(['Employee_ID', 'Age', 'Gender', 'Job_Role', 'Industry',
       'Years_of_Experience', 'Work_Location', 'Hours_Worked_Per_Week',
       'Number_of_Virtual_Meetings', 'Work_Life_Balance_Rating',
       'Stress_Level', 'Mental_Health_Condition',
       'Access_to_Mental_Health_Resources', 'Productivity_Change',
       'Social_Isolation_Rating', 'Satisfaction_with_Remote_Work',
       'Company_Support_for_Remote_Work', 'Physical_Activity',
       'Sleep_Quality'],
      dtype='object')
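
Unnecessary data can also mean duplicate rows, which this section does not check. A minimal sketch of that check follows, using hypothetical sample rows rather than the real dataset; the names `sample`, `n_duplicate_rows`, `n_duplicate_ids`, and `deduped` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in which the last row repeats the second one
sample = pd.DataFrame({
    "Employee_ID": ["EMP0001", "EMP0002", "EMP0002"],
    "Work_Location": ["Hybrid", "Remote", "Remote"],
    "Stress_Level": ["Medium", "High", "High"],
})

# Rows that are exact copies of an earlier row
n_duplicate_rows = sample.duplicated().sum()

# IDs that appear more than once, even if other fields differ
n_duplicate_ids = sample["Employee_ID"].duplicated().sum()

# Keep the first occurrence of each duplicated row
deduped = sample.drop_duplicates()
print(n_duplicate_rows, n_duplicate_ids, len(deduped))
```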


## Inconsistent Data

Check for inconsistent data and address any that arise. As always, use code comments to illustrate your thought process.

In [23]:
# display unique values to find inconsistencies 
categorical_columns = df.select_dtypes(include=['object']).columns
for column in categorical_columns:
    print(f"\nUnique values in '{column}' before cleaning:\n", df[column].unique())



Unique values in 'Employee_ID' before cleaning:
 ['EMP0001' 'EMP0002' 'EMP0003' ... 'EMP4998' 'EMP4999' 'EMP5000']

Unique values in 'Gender' before cleaning:
 ['Non-binary' 'Female' 'Male' 'Prefer not to say']

Unique values in 'Job_Role' before cleaning:
 ['HR' 'Data Scientist' 'Software Engineer' 'Sales' 'Marketing' 'Designer'
 'Project Manager']

Unique values in 'Industry' before cleaning:
 ['Healthcare' 'IT' 'Education' 'Finance' 'Consulting' 'Manufacturing'
 'Retail']

Unique values in 'Work_Location' before cleaning:
 ['Hybrid' 'Remote' 'Onsite']

Unique values in 'Stress_Level' before cleaning:
 ['Medium' 'High' 'Low']

Unique values in 'Mental_Health_Condition' before cleaning:
 ['Depression' 'Anxiety' nan 'Burnout']

Unique values in 'Access_to_Mental_Health_Resources' before cleaning:
 ['No' 'Yes']

Unique values in 'Productivity_Change' before cleaning:
 ['Decrease' 'Increase' 'No Change']

Unique values in 'Satisfaction_with_Remote_Work' before cleaning:
 ['Unsatisfied' 
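
The `nan` among the `Mental_Health_Condition` values above suggests the "N/A" fill was not in effect when this cell ran; the cell numbers (`In [11]` for the fill, `In [13]` for the load) are consistent with the data having been reloaded afterwards. A minimal sketch, on hypothetical sample values, of a check that surfaces both leftover missing values and spelling or whitespace variants:

```python
import numpy as np
import pandas as pd

# Hypothetical values, including a whitespace/case variant and a blank
stress = pd.Series(["Medium", "High", " high", "Low", np.nan], name="Stress_Level")

# dropna=False keeps missing values in the tally instead of hiding them
raw_counts = stress.value_counts(dropna=False)

# Normalize whitespace and capitalization, then tally again
cleaned = stress.str.strip().str.capitalize()
clean_counts = cleaned.value_counts(dropna=False)

print(raw_counts)
print(clean_counts)
```

`str.capitalize` suits single-word labels such as those in `Stress_Level`; multi-word labels like 'No Change' would need a different rule.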

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?
No. The dataset was fairly clean from the start: I found missing data and unnecessary data, but I didn't find inconsistent or irregular data.
2. Did the process of cleaning your data give you new insights into your dataset?
I wish the "Mental_Health_Condition" and "Physical_Activity" columns had offered answers such as "Wish not to disclose" or "No condition". That would better separate respondents who have nothing to report from those who didn't answer the question; both currently end up with "N/A" as the filler.
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?
Keep in mind what I mentioned in #2: some of the "N/A" respondents could have a condition we don't know about, while others may not have answered simply because they do not have a condition.
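
One way to keep the ambiguity described in answers 2 and 3 visible during analysis is to record which rows were originally blank before filling them. This is a minimal sketch with hypothetical rows; the column name `Condition_Was_Missing` is made up for illustration and is not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two blank answers
sample = pd.DataFrame({
    "Mental_Health_Condition": ["Depression", np.nan, "Anxiety", np.nan],
})

# Flag blanks before filling so they can be filtered or plotted separately
sample["Condition_Was_Missing"] = sample["Mental_Health_Condition"].isna()
sample["Mental_Health_Condition"] = sample["Mental_Health_Condition"].fillna("N/A")

# Tally the filled column; "N/A" now appears as its own category
counts = sample["Mental_Health_Condition"].value_counts()
print(counts)
```

With the flag in place, a visualization can either exclude the "N/A" rows or show them as their own group, rather than treating them as a fourth condition.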