# Intro
As computer science students with a deep interest in data science, we, Hila Giladi (add id) and Kfir Shuster (add id), have chosen to focus on sleep duration.
Recognizing that quality sleep is crucial for everyone's daily functioning and overall health, we are committed to leveraging data science to help people worldwide achieve better sleep patterns and, consequently, improve their quality of life.

# The problem
In our modern world, where the boundaries between day and night blur and the demands of digital life keep growing, we are witnessing an intensifying sleep crisis. The data collected in our research presents a particularly troubling picture: an average of just 4.5 hours of sleep, well below the accepted medical recommendation of 7-9 hours per day.

This chronic lack of sleep reflects the complex reality of contemporary society, where technology, work pressures, and the desire to accomplish everything push sleep to the bottom of our priority list. The contributing factors are diverse and intertwined: screens emitting blue light late into the night, demanding jobs that require 24/7 availability, and constant social pressure to stay connected and active. The phenomenon gains additional significance in a global era, where time zones blur and an immediate response has become the expected norm.

The data at our disposal serves as a wake-up call, literally and figuratively. It compels us to confront the implications of the "always awake" culture we have developed and to seek creative, sustainable solutions that will let us rebalance our natural sleep-wake cycle.


# The importance of the solution
Sleep is a fundamental aspect of human health and well-being, with particular significance for mental health.


# How we're going to do it:
This project aims to develop a predictive model for sleep duration using a dataset that captures physiological, behavioral, and environmental factors. The dataset contains hourly records of variables such as heart rate, blood pressure, skin temperature, and activity levels, along with lifestyle and mental health indicators.
By leveraging machine learning techniques to predict sleep duration, we aim to:
1. Identify key factors that influence sleep patterns
2. Understand the relationship between daily activities, stress levels, and sleep duration
3. Develop a model that can potentially help in early intervention for sleep-related issues
The dataset's unique combination of physiological measurements (heart rate, blood pressure, skin temperature), behavioral metrics (activity levels, social interaction), and mental health indicators (stress level, mental health status, resilience factors) provides a rich foundation for exploring the complex interplay between various life factors and sleep patterns.
Through this analysis, we hope to contribute to the broader understanding of sleep health and potentially develop tools that could help individuals and healthcare providers in monitoring and improving sleep patterns, particularly in the context of mental health management.

# Data Collection and Selection Process
Before beginning our research, we needed to find the right dataset for analyzing sleep patterns. We searched through many datasets from health organizations and research groups, looking at how well each one could help us understand sleep patterns. Our goal was to find data that would tell the most complete story about sleep and what affects it.

After reviewing many options, we chose this dataset because it gives us an excellent mix of different types of information. It tracks physical measurements like heart rate and blood pressure, along with important details about people's daily lives: their work hours, stress levels, and how they are feeling. What made this dataset stand out is that it collects information every hour and includes both physical health measurements and mental well-being indicators. This means we can look at how sleep connects to many different parts of people's lives.

With over 52,000 entries, the dataset is large enough to help us find reliable patterns and connections. We believe this rich combination of information will help us better understand what affects people's sleep and how we might be able to improve it.

# Data Analysis
Let's transform our collected data into a structured pandas data frame to examine the information we've gathered. This will give us a clear view of our dataset's contents and help us understand what we're working with.

In [1]:
import pandas as pd

In [2]:
# Load the dataset from the CSV file into a DataFrame
df = pd.read_csv('mental_health_monitoring_dataset.csv')
display(df)

Unnamed: 0,Timestamp,Heart_Rate,Blood_Pressure_Systolic,Blood_Pressure_Diastolic,Skin_Temperature,Galvanic_Skin_Response,Respiration_Rate,Sleep_Duration,Activity_Levels,Mood,...,Fuel_Consumption,Average_Speed,Work_Hours,Job_Stressors,Location_Latitude,Location_Longitude,Stress_Level,Mental_Health_History,Resilience_Factors,Mental_Health_Status
0,2019-01-01 00:00:00,98,177,99,36.854833,0.603660,20,4,4945,Happy,...,17.568925,78,12,5,51.002436,44.895265,Low,Yes,5,Normal
1,2019-01-01 01:00:00,111,133,86,37.239401,0.934820,24,5,6221,Happy,...,10.295241,25,8,9,42.839887,75.276677,Low,No,2,Moderate Stress
2,2019-01-01 02:00:00,88,154,113,36.209588,0.035483,28,1,6426,Anxious,...,8.515171,37,15,9,68.836583,20.795174,Low,No,4,Depression
3,2019-01-01 03:00:00,74,162,60,36.087569,0.653249,25,4,14435,Happy,...,13.774633,39,15,5,51.752928,89.835756,Low,No,1,Normal
4,2019-01-01 04:00:00,102,160,81,36.593949,0.756138,15,6,4334,Happy,...,18.747632,87,11,1,98.613250,86.654677,Low,No,2,Moderate Stress
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52580,2024-12-30 20:00:00,86,120,92,36.371463,0.817159,18,2,16712,Sad,...,8.554723,44,12,2,45.538176,91.829336,Low,Yes,7,Moderate Stress
52581,2024-12-30 21:00:00,81,168,118,34.847808,0.087909,21,0,8975,Anxious,...,16.481642,80,9,5,33.411870,52.294238,Medium,No,5,Moderate Stress
52582,2024-12-30 22:00:00,61,149,119,36.782976,0.539182,20,6,8770,Neutral,...,5.111862,23,14,1,67.402028,21.529958,Low,No,1,Normal
52583,2024-12-30 23:00:00,66,133,72,37.084081,0.103317,16,4,14148,Happy,...,7.017609,83,8,1,16.589861,50.978252,Low,No,7,Normal


# Cleaning the data
Let’s remove all rows that contain NaN values:

In [3]:
original_len = len(df)
df = df.dropna()
print(f"number of removed rows: {original_len - len(df)}")

number of removed rows: 0


As we can see, no rows contain NaN values.
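
Before dropping rows, it can help to see where missing values sit. The following is a minimal sketch on a made-up toy frame (the values are illustrative and not taken from the dataset); it counts missing values per column and the number of rows that `dropna()` would remove.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Heart_Rate": [98, np.nan, 88, 74],
    "Sleep_Duration": [4, 5, np.nan, 4],
    "Mood": ["Happy", "Happy", None, "Sad"],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that dropna() would remove: any row with at least one missing value.
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(f"rows with at least one missing value: {rows_with_missing}")
```

On the real data, the same two calls applied to `df` would show per column whether anything is missing.
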
Let's check for duplicates:

In [4]:
df[df.duplicated()].shape[0]

0

As we can see, there are no duplicate rows in the dataset.
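
One caveat: if the `Unnamed: 0` column is a running row counter (an assumption here, suggested by its values in the table above), every row is unique by construction and `duplicated()` over all columns cannot report anything. The sketch below, on made-up toy data, compares the check with and without that column.

```python
import pandas as pd

# Toy frame; "Unnamed: 0" is assumed to be a running row counter.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Heart_Rate": [98, 111, 98, 74],
    "Sleep_Duration": [4, 5, 4, 4],
})

# With the counter included, no two rows can be identical.
dupes_all_columns = int(toy.duplicated().sum())

# Ignoring the counter exposes rows whose measurements repeat.
measurement_cols = [c for c in toy.columns if c != "Unnamed: 0"]
dupes_measurements = int(toy.duplicated(subset=measurement_cols).sum())

print(dupes_all_columns, dupes_measurements)
```

In the real dataset the `Timestamp` column is also distinct per row, so whether repeated measurements at different hours should count as duplicates is a judgment call.
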

Let’s see which distinct values appear in our target column, Sleep_Duration:

In [7]:
print([int(num) for num in sorted(df['Sleep_Duration'].unique())])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


The output is a list of all the distinct values in the Sleep_Duration column.
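
Beyond the distinct values, the frequency of each value and the mean are useful for judging the target's distribution. This is a small sketch on a made-up series (not the real column); on the real data the same calls would be applied to `df['Sleep_Duration']`.

```python
import pandas as pd

# Made-up sample of sleep durations in hours.
sleep = pd.Series([4, 5, 1, 4, 6, 0, 2, 6, 4], name="Sleep_Duration")

# How often each duration occurs, ordered by duration.
counts = sleep.value_counts().sort_index()
print(counts)

# Average duration across the sample.
mean_sleep = sleep.mean()
print(f"mean sleep duration: {mean_sleep:.2f} hours")
```
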

Delete rows with outlier values in the target column, i.e. rows where Sleep_Duration < 1:

In [8]:
original_len = len(df)
df = df[df['Sleep_Duration'] >= 1]
print(f"number of removed rows: {original_len - len(df)}")
display(df)

number of removed rows: 5240


Unnamed: 0,Timestamp,Heart_Rate,Blood_Pressure_Systolic,Blood_Pressure_Diastolic,Skin_Temperature,Galvanic_Skin_Response,Respiration_Rate,Sleep_Duration,Activity_Levels,Mood,...,Fuel_Consumption,Average_Speed,Work_Hours,Job_Stressors,Location_Latitude,Location_Longitude,Stress_Level,Mental_Health_History,Resilience_Factors,Mental_Health_Status
0,2019-01-01 00:00:00,98,177,99,36.854833,0.603660,20,4,4945,Happy,...,17.568925,78,12,5,51.002436,44.895265,Low,Yes,5,Normal
1,2019-01-01 01:00:00,111,133,86,37.239401,0.934820,24,5,6221,Happy,...,10.295241,25,8,9,42.839887,75.276677,Low,No,2,Moderate Stress
2,2019-01-01 02:00:00,88,154,113,36.209588,0.035483,28,1,6426,Anxious,...,8.515171,37,15,9,68.836583,20.795174,Low,No,4,Depression
3,2019-01-01 03:00:00,74,162,60,36.087569,0.653249,25,4,14435,Happy,...,13.774633,39,15,5,51.752928,89.835756,Low,No,1,Normal
4,2019-01-01 04:00:00,102,160,81,36.593949,0.756138,15,6,4334,Happy,...,18.747632,87,11,1,98.613250,86.654677,Low,No,2,Moderate Stress
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52578,2024-12-30 18:00:00,114,137,84,37.106297,0.377905,14,6,6767,Sad,...,16.159221,35,9,7,76.503918,24.779305,Medium,No,8,Moderate Stress
52580,2024-12-30 20:00:00,86,120,92,36.371463,0.817159,18,2,16712,Sad,...,8.554723,44,12,2,45.538176,91.829336,Low,Yes,7,Moderate Stress
52582,2024-12-30 22:00:00,61,149,119,36.782976,0.539182,20,6,8770,Neutral,...,5.111862,23,14,1,67.402028,21.529958,Low,No,1,Normal
52583,2024-12-30 23:00:00,66,133,72,37.084081,0.103317,16,4,14148,Happy,...,7.017609,83,8,1,16.589861,50.978252,Low,No,7,Normal


Delete irrelevant columns:
'Location_Longitude', 'Location_Latitude', 'Fuel_Consumption', 'Timestamp', 'Galvanic_Skin_Response'
We decided to delete these columns because we do not expect them to affect sleep duration.

In [9]:
df = df.drop(['Location_Longitude', 'Location_Latitude', 'Fuel_Consumption', 'Timestamp', 'Galvanic_Skin_Response'], axis=1)
display(df)

Unnamed: 0,Heart_Rate,Blood_Pressure_Systolic,Blood_Pressure_Diastolic,Skin_Temperature,Respiration_Rate,Sleep_Duration,Activity_Levels,Mood,Cognitive_Load,Social_Interaction,Driving_Conditions,Route_Duration,Average_Speed,Work_Hours,Job_Stressors,Stress_Level,Mental_Health_History,Resilience_Factors,Mental_Health_Status
0,98,177,99,36.854833,20,4,4945,Happy,5,7,Moderate,6,78,12,5,Low,Yes,5,Normal
1,111,133,86,37.239401,24,5,6221,Happy,4,1,Moderate,5,25,8,9,Low,No,2,Moderate Stress
2,88,154,113,36.209588,28,1,6426,Anxious,3,8,Good,20,37,15,9,Low,No,4,Depression
3,74,162,60,36.087569,25,4,14435,Happy,4,4,Poor,12,39,15,5,Low,No,1,Normal
4,102,160,81,36.593949,15,6,4334,Happy,3,6,Moderate,22,87,11,1,Low,No,2,Moderate Stress
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52578,114,137,84,37.106297,14,6,6767,Sad,3,8,Moderate,9,35,9,7,Medium,No,8,Moderate Stress
52580,86,120,92,36.371463,18,2,16712,Sad,2,6,Good,13,44,12,2,Low,Yes,7,Moderate Stress
52582,61,149,119,36.782976,20,6,8770,Neutral,3,2,Poor,15,23,14,1,Low,No,1,Normal
52583,66,133,72,37.084081,16,4,14148,Happy,5,1,Moderate,16,83,8,1,Low,No,7,Normal
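
One way to back up a decision to drop a numeric column is to look at its correlation with the target first. The sketch below uses a made-up toy frame and Pearson correlation via `corrwith`; it is only an illustration of the check, not the analysis performed in this notebook, and a low linear correlation does not rule out a non-linear relationship.

```python
import pandas as pd

# Toy frame with the target and two candidate columns (values are made up).
toy = pd.DataFrame({
    "Sleep_Duration": [4, 5, 1, 4, 6, 2],
    "Work_Hours": [12, 8, 15, 15, 11, 12],
    "Fuel_Consumption": [17.6, 10.3, 8.5, 13.8, 18.7, 8.6],
})

candidates = ["Work_Hours", "Fuel_Consumption"]

# Pearson correlation of each candidate column with the target.
corr_with_target = toy[candidates].corrwith(toy["Sleep_Duration"])
print(corr_with_target)
```
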


Delete rows with outlier values: skin temperatures below 35 are unusual, so let's remove those rows.

In [10]:
original_len = len(df)
df = df[df['Skin_Temperature'] >= 35]
print(f"number of removed rows: {original_len - len(df)}")
display(df)

number of removed rows: 53


Unnamed: 0,Heart_Rate,Blood_Pressure_Systolic,Blood_Pressure_Diastolic,Skin_Temperature,Respiration_Rate,Sleep_Duration,Activity_Levels,Mood,Cognitive_Load,Social_Interaction,Driving_Conditions,Route_Duration,Average_Speed,Work_Hours,Job_Stressors,Stress_Level,Mental_Health_History,Resilience_Factors,Mental_Health_Status
0,98,177,99,36.854833,20,4,4945,Happy,5,7,Moderate,6,78,12,5,Low,Yes,5,Normal
1,111,133,86,37.239401,24,5,6221,Happy,4,1,Moderate,5,25,8,9,Low,No,2,Moderate Stress
2,88,154,113,36.209588,28,1,6426,Anxious,3,8,Good,20,37,15,9,Low,No,4,Depression
3,74,162,60,36.087569,25,4,14435,Happy,4,4,Poor,12,39,15,5,Low,No,1,Normal
4,102,160,81,36.593949,15,6,4334,Happy,3,6,Moderate,22,87,11,1,Low,No,2,Moderate Stress
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52578,114,137,84,37.106297,14,6,6767,Sad,3,8,Moderate,9,35,9,7,Medium,No,8,Moderate Stress
52580,86,120,92,36.371463,18,2,16712,Sad,2,6,Good,13,44,12,2,Low,Yes,7,Moderate Stress
52582,61,149,119,36.782976,20,6,8770,Neutral,3,2,Poor,15,23,14,1,Low,No,1,Normal
52583,66,133,72,37.084081,16,4,14148,Happy,5,1,Moderate,16,83,8,1,Low,No,7,Normal
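
The cutoff of 35 is a fixed threshold. As a cross-check, a data-driven rule such as the 1.5 × IQR lower fence can be compared against it. The following sketch uses made-up temperatures; the IQR rule is a swapped-in alternative for illustration, not the method used above.

```python
import pandas as pd

# Made-up skin temperatures.
temps = pd.Series([36.85, 37.24, 36.21, 34.85, 36.59, 36.37],
                  name="Skin_Temperature")

# Fixed threshold, as used in the cell above.
threshold = 35.0
n_below_threshold = int((temps < threshold).sum())

# Data-driven alternative: lower fence at Q1 - 1.5 * IQR.
q1, q3 = temps.quantile(0.25), temps.quantile(0.75)
lower_fence = q1 - 1.5 * (q3 - q1)
n_below_fence = int((temps < lower_fence).sum())

print(n_below_threshold, n_below_fence, round(lower_fence, 3))
```
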


Remove records where the values in the columns Mood and Stress_Level are logically incompatible:
   1. Mood is 'High' and Stress_Level is either 'Anxious', 'Sad', or 'Irritable'
   2. Mood is 'Low' and Stress_Level is 'Happy'

In [11]:
original_len = len(df)
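# Note: as written, these conditions match no rows. Mood holds labels such as
# 'Happy' or 'Anxious', while Stress_Level holds 'High', 'Medium', or 'Low'.
df = df[
    ~((df['Mood'] == 'High') & df['Stress_Level'].isin(['Anxious', 'Sad', 'Irritable'])) &
    ~((df['Mood'] == 'Low') & (df['Stress_Level'] == 'Happy'))
]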
print(f"number of removed rows: {original_len - len(df)}")

number of removed rows: 0
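
A filter that compares a column against labels it never contains also removes zero rows, so a count of 0 alone does not prove the data is consistent. A quick check is to confirm that each label used in a condition actually occurs in the column it is compared with. The helper `unknown_labels` below is hypothetical, written for this sketch, and the toy frame is made up.

```python
import pandas as pd

# Toy frame using the kinds of labels the two columns hold.
toy = pd.DataFrame({
    "Mood": ["Happy", "Anxious", "Sad", "Neutral"],
    "Stress_Level": ["Low", "High", "Medium", "Low"],
})

def unknown_labels(frame, column, labels):
    """Return the labels that never occur in frame[column], sorted."""
    present = set(frame[column].unique())
    return sorted(set(labels) - present)

# Labels that a condition on Mood would never match.
mood_unknown = unknown_labels(toy, "Mood", ["High", "Low"])

# The same labels checked against Stress_Level.
stress_unknown = unknown_labels(toy, "Stress_Level", ["High", "Low"])

print(mood_unknown, stress_unknown)
```

An empty list means every label in the condition can match at least one row; a non-empty list points at a condition that cannot match anything.
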


Convert text values to numeric codes in the columns Driving_Conditions, Stress_Level, Mental_Health_History, Mental_Health_Status, and Mood:

In [12]:
from sklearn.preprocessing import LabelEncoder
columns_to_encode = ['Driving_Conditions', 'Stress_Level', 'Mental_Health_History', 'Mental_Health_Status', 'Mood']
encoders = {}

for col in columns_to_encode:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
    # Create and print mapping for this column
    mapping = dict(zip(encoders[col].classes_, encoders[col].transform(encoders[col].classes_)))
    print(f"\n{col} mappings:")
    for original, encoded in mapping.items():
        print(f"{original} -> {encoded}")


Driving_Conditions mappings:
Good -> 0
Moderate -> 1
Poor -> 2

Stress_Level mappings:
High -> 0
Low -> 1
Medium -> 2

Mental_Health_History mappings:
No -> 0
Yes -> 1

Mental_Health_Status mappings:
Anxiety -> 0
Depression -> 1
Mild Stress -> 2
Moderate Stress -> 3
Normal -> 4
Severe Stress -> 5

Mood mappings:
Anxious -> 0
Happy -> 1
Irritable -> 2
Neutral -> 3
Sad -> 4
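
`LabelEncoder` assigns codes in sorted label order, which is why Stress_Level comes out as High -> 0, Low -> 1, Medium -> 2 above. For a column with a natural order, an explicit mapping keeps that order in the numeric codes. The sketch below is one possible alternative, assuming the order Low < Medium < High; the dictionary `stress_order` is introduced here for illustration.

```python
import pandas as pd

# Made-up sample of stress labels.
stress = pd.Series(["Low", "High", "Medium", "Low"], name="Stress_Level")

# Explicit ordinal mapping (assumed order: Low < Medium < High).
stress_order = {"Low": 0, "Medium": 1, "High": 2}
encoded = stress.map(stress_order)

# Any label missing from the mapping would become NaN, so check for that.
n_unmapped = int(encoded.isna().sum())

print(encoded.tolist(), n_unmapped)
```

Whether ordinal codes help depends on the model: tree-based models are largely indifferent to the code order, while linear models treat the codes as magnitudes.
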


Bin the numeric values in the column Activity_Levels into three categories (1: up to 3000, 2: above 3000 and up to 10000, 3: above 10000):

In [13]:
import numpy as np

df['Activity_Levels'] = np.where(df['Activity_Levels'] <= 3000, 1, np.where(df['Activity_Levels'] <= 10000, 2, 3))
display(df)

Unnamed: 0,Heart_Rate,Blood_Pressure_Systolic,Blood_Pressure_Diastolic,Skin_Temperature,Respiration_Rate,Sleep_Duration,Activity_Levels,Mood,Cognitive_Load,Social_Interaction,Driving_Conditions,Route_Duration,Average_Speed,Work_Hours,Job_Stressors,Stress_Level,Mental_Health_History,Resilience_Factors,Mental_Health_Status
0,98,177,99,36.854833,20,4,2,1,5,7,1,6,78,12,5,1,1,5,4
1,111,133,86,37.239401,24,5,2,1,4,1,1,5,25,8,9,1,0,2,3
2,88,154,113,36.209588,28,1,2,0,3,8,0,20,37,15,9,1,0,4,1
3,74,162,60,36.087569,25,4,3,1,4,4,2,12,39,15,5,1,0,1,4
4,102,160,81,36.593949,15,6,2,1,3,6,1,22,87,11,1,1,0,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52578,114,137,84,37.106297,14,6,2,4,3,8,1,9,35,9,7,2,0,8,3
52580,86,120,92,36.371463,18,2,3,4,2,6,0,13,44,12,2,1,1,7,3
52582,61,149,119,36.782976,20,6,2,3,3,2,2,15,23,14,1,1,0,1,4
52583,66,133,72,37.084081,16,4,3,1,5,1,1,16,83,8,1,1,0,7,4
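
The nested `np.where` call can also be written with `pd.cut`, which states the bin edges in one place. The sketch below uses made-up activity values and checks that both forms agree, including at the boundary values 3000 and 10000.

```python
import numpy as np
import pandas as pd

# Made-up activity values, including both bin boundaries.
activity = pd.Series([4945, 6221, 14435, 3000, 10000, 2500],
                     name="Activity_Levels")

# Nested np.where, as in the cell above.
with_where = np.where(activity <= 3000, 1,
                      np.where(activity <= 10000, 2, 3))

# Equivalent binning with pd.cut; intervals are right-inclusive by default.
with_cut = pd.cut(activity,
                  bins=[-np.inf, 3000, 10000, np.inf],
                  labels=[1, 2, 3]).astype(int)

print(with_where.tolist(), with_cut.tolist())
```
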


Normalize the numeric values in the columns Heart_Rate, Blood_Pressure_Systolic, and Blood_Pressure_Diastolic:

In [None]:
from sklearn.preprocessing import MinMaxScaler

columns_to_normalize = ['Heart_Rate', 'Blood_Pressure_Systolic', 'Blood_Pressure_Diastolic']
scaler = MinMaxScaler()
df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])

display(df)
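
With its default `feature_range` of (0, 1), `MinMaxScaler` rescales each column as (x - min) / (max - min). The sketch below reproduces that formula by hand on a made-up heart-rate sample, to show what the transformed values mean.

```python
import pandas as pd

# Made-up heart-rate sample.
heart_rate = pd.Series([98, 111, 88, 74, 102], name="Heart_Rate")

# Min-max scaling by hand: the smallest value maps to 0, the largest to 1.
value_range = heart_rate.max() - heart_rate.min()
scaled = (heart_rate - heart_rate.min()) / value_range

print(scaled.round(3).tolist())
```

Because the minimum and maximum come from the data, a single extreme value compresses all other values into a narrow band, which is one reason outliers are removed before this step.
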

In [16]:
# EDA: distribution of the target column

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
# Histogram of sleep duration
sns.histplot(data=df, x='Sleep_Duration', bins=20, color='skyblue')
plt.xlabel('Sleep duration (hours)')
plt.show()