## Calculating DEEP, REM and LIGHT sleep.

The main purpose of collecting the data was to get information about the duration of each sleep stage per night. However, the Apple Watch does not provide this directly; the individual records have to be combined and summed manually to get the durations per day.

**1) Calculating the sleep metrics per day**

I first calculate the duration of each sleep cycle row by subtracting the start date from the end date and storing the result in a duration column. Then I add a date column to identify which night the sleep cycle belongs to. There is a catch here: a sleep cycle might start before midnight, in which case it would be assigned to the wrong night if we simply used the startDate column. So I derive the correct night from the creationDate column instead.

After calculating the duration of each sleep cycle and identifying the night it belongs to, I group the rows into a new table by summing the durations of each sleep type (Deep, REM, Core and Awake) per night. This gives the total minutes of each sleep type per night, which is what we are looking for.

In [2]:
import pandas as pd

# Load the sleep data
sleep_data = pd.read_csv("converted_data/converted_and_filtered_sleep_data.csv")

# Convert startDate and endDate to datetime
sleep_data['startDate'] = pd.to_datetime(sleep_data['startDate'])
sleep_data['endDate'] = pd.to_datetime(sleep_data['endDate'])

# Calculate duration in minutes
sleep_data['duration'] = (sleep_data['endDate'] - sleep_data['startDate']).dt.total_seconds() / 60

# Derive the night's date from creationDate (startDate may fall before midnight)
sleep_data['date'] = pd.to_datetime(sleep_data['creationDate']).dt.date

# Group by date and sleep stage, summing up the durations
grouped_data = sleep_data.groupby(['date', 'value'])['duration'].sum().reset_index()

# Display the grouped data
grouped_data

Unnamed: 0,date,value,duration
0,2025-03-03,HKCategoryValueSleepAnalysisAsleepCore,134.0
1,2025-03-03,HKCategoryValueSleepAnalysisAsleepDeep,63.5
2,2025-03-03,HKCategoryValueSleepAnalysisAsleepREM,72.0
3,2025-03-03,HKCategoryValueSleepAnalysisAwake,6.0
4,2025-03-04,HKCategoryValueSleepAnalysisAsleepCore,188.5
...,...,...,...
137,2025-04-21,HKCategoryValueSleepAnalysisAsleepREM,89.5
138,2025-04-22,HKCategoryValueSleepAnalysisAsleepCore,203.5
139,2025-04-22,HKCategoryValueSleepAnalysisAsleepDeep,62.0
140,2025-04-22,HKCategoryValueSleepAnalysisAsleepREM,91.5
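
The midnight issue described above can be checked on a small example. This is only a sketch on made-up timestamps, and it assumes that all segments of one night share the same creationDate (written the next morning), which is how the export appears to behave in my data:

```python
import pandas as pd

# Toy segments for a single night; all timestamps are invented for illustration.
toy = pd.DataFrame({
    "startDate": pd.to_datetime(["2025-03-02 23:30", "2025-03-03 00:30"]),
    "endDate": pd.to_datetime(["2025-03-03 00:30", "2025-03-03 01:15"]),
    # Assumption: every segment of a night carries the same morning creation time.
    "creationDate": pd.to_datetime(["2025-03-03 07:05", "2025-03-03 07:05"]),
    "value": ["Core", "Core"],
})
toy["duration"] = (toy["endDate"] - toy["startDate"]).dt.total_seconds() / 60

# Keying on startDate splits the night across two calendar dates.
by_start = toy.groupby(toy["startDate"].dt.date)["duration"].sum()

# Keying on creationDate keeps the whole night together.
by_creation = toy.groupby(toy["creationDate"].dt.date)["duration"].sum()

print(by_start)
print(by_creation)
```

With startDate as the key, the 60-minute segment before midnight lands on a different date than the 45-minute segment after it; with creationDate both end up under the same night.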


**2) Re-organizing the data**

Now I am going to reshape the data into a nicer form in which the sleep metrics are the columns.
I also need to fill in the empty values with 0. For example, on a night in which I had no "Awake" time, there is currently no row for that sleep type at all, since none of it was recorded.

In [3]:
# Pivot the data
pivoted_data = grouped_data.pivot(index='date', columns='value', values='duration').reset_index()

# Rename columns for clarity
pivoted_data.columns.name = None
pivoted_data = pivoted_data.rename(columns={
    'HKCategoryValueSleepAnalysisAwake': 'Awake',
    'HKCategoryValueSleepAnalysisAsleepREM': 'REM',
    'HKCategoryValueSleepAnalysisAsleepCore': 'Core',
    'HKCategoryValueSleepAnalysisAsleepDeep': 'Deep'
})

# Fill NaN values with 0 (if any sleep stage is missing for a night)
pivoted_data = pivoted_data.fillna(0)

# Display the pivoted data
pivoted_data

Unnamed: 0,date,Core,Deep,REM,HKCategoryValueSleepAnalysisAsleepUnspecified,Awake
0,2025-03-03,134.0,63.5,72.0,0.0,6.0
1,2025-03-04,188.5,68.5,82.0,0.0,1.5
2,2025-03-05,197.0,71.5,59.0,0.0,1.5
3,2025-03-06,180.5,67.0,94.5,112.0,38.0
4,2025-03-07,185.0,81.0,85.0,0.0,2.0
5,2025-03-08,248.0,73.0,95.5,0.0,2.0
6,2025-03-09,175.5,54.5,58.0,0.0,9.0
7,2025-03-10,237.5,54.0,78.5,0.0,1.0
8,2025-03-11,193.0,58.0,91.0,0.0,1.5
9,2025-03-12,242.0,60.0,58.0,0.0,0.0
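
As an alternative to calling `pivot` and then `fillna`, the reshaping and the zero-filling can be done in one step with `pivot_table` and its `fill_value` argument. The sketch below is not the code used in this notebook; it runs on a three-row toy table (`toy_grouped` and `toy_wide` are illustrative names) built from two of the nights shown above:

```python
import pandas as pd

# Toy long-format data; the second night has no "Awake" row at all.
toy_grouped = pd.DataFrame({
    "date": ["2025-03-11", "2025-03-11", "2025-03-12"],
    "value": ["Core", "Awake", "Core"],
    "duration": [193.0, 1.5, 242.0],
})

# pivot_table fills the missing (date, stage) combinations while reshaping.
toy_wide = toy_grouped.pivot_table(
    index="date", columns="value", values="duration",
    aggfunc="sum", fill_value=0,
).reset_index()
toy_wide.columns.name = None

print(toy_wide)
```

The missing "Awake" entry for 2025-03-12 comes out as 0 instead of NaN, which matches what the `fillna(0)` call achieves.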


**3) Cleaning up the incorrect values**

On some days I woke up, took off the Apple Watch and did not use my phone, so the watch thought that I was still asleep and incorrectly added extra sleep time. Fortunately, this time was added to the "Unspecified" sleep analysis category, so we can easily remove it.

The second issue is that on those nights the watch sometimes thought that I was trying to sleep and added minutes to the "Awake" metric. For example, in row 21 (2025-04-06) the watch incorrectly recorded 151 minutes awake. There are 3 errors like this. I can fix them manually by overwriting the awake time with the average awake time of the other nights. This is easy to do because I know exactly on which nights the error occurred: they are the nights on which the Unspecified sleep category was recorded.

Additionally, there is one night that the watch recorded even though I did not wear it, so all of its values are zero. I will remove that row as well.

So I will first fix the incorrect awake durations, then remove the "Unspecified" category, and finally drop the empty row.

In [4]:
# Exclude rows 3, 10, and 21 (adjusting for zero-based indexing)
excluded_rows = [3, 10, 21]
awake_average = pivoted_data.drop(index=excluded_rows)['Awake'].mean()

print(f"Average Awake time (excluding rows 3, 10, 21): {awake_average}")

Average Awake time (excluding rows 3, 10, 21): 2.6515151515151514


In [5]:
# Override the Awake column for rows 3, 10, and 21 in a single assignment
pivoted_data.loc[excluded_rows, 'Awake'] = awake_average

# Display to verify the changes
print(pivoted_data.loc[excluded_rows])

          date   Core  Deep   REM  \
3   2025-03-06  180.5  67.0  94.5   
10  2025-03-13  191.5  53.5  98.0   
21  2025-04-06  144.0  63.0  91.5   

    HKCategoryValueSleepAnalysisAsleepUnspecified     Awake  
3                                           112.0  2.651515  
10                                           72.0  2.651515  
21                                          130.5  2.651515  
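
Since the affected nights are exactly the ones with Unspecified time, the row indices could also be derived from that column instead of being typed in by hand. The following is a rough sketch of that idea on a three-row toy table taken from the nights shown earlier; the short column name `Unspecified` is a stand-in for the full HealthKit name and is only used here for readability:

```python
import pandas as pd

# Toy table in the pivoted layout, using three nights from the output above.
toy = pd.DataFrame({
    "date": ["2025-03-05", "2025-03-06", "2025-03-07"],
    "Awake": [1.5, 38.0, 2.0],
    "Unspecified": [0.0, 112.0, 0.0],  # hypothetical short column name
})

# Nights with any Unspecified time are treated as the affected ones.
affected = toy["Unspecified"] > 0

# Average Awake over the unaffected nights only, then overwrite the affected rows.
awake_average = toy.loc[~affected, "Awake"].mean()
toy.loc[affected, "Awake"] = awake_average

print(toy)
```

This avoids hard-coding row numbers, at the cost of trusting that every night with Unspecified time really has a wrong Awake value.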


In [6]:
# finally remove the unspecified sleep category column as well as the empty row:
pivoted_data = pivoted_data.drop(columns=['HKCategoryValueSleepAnalysisAsleepUnspecified'], errors='ignore')
pivoted_data = pivoted_data.drop(index=17)


# save into processed_data folder, this is what we are going to be working with moving forward:
pivoted_data.to_csv("processed_data/sleep_cycles.csv", index=False)

#final form:
pivoted_data

Unnamed: 0,date,Core,Deep,REM,Awake
0,2025-03-03,134.0,63.5,72.0,6.0
1,2025-03-04,188.5,68.5,82.0,1.5
2,2025-03-05,197.0,71.5,59.0,1.5
3,2025-03-06,180.5,67.0,94.5,2.651515
4,2025-03-07,185.0,81.0,85.0,2.0
5,2025-03-08,248.0,73.0,95.5,2.0
6,2025-03-09,175.5,54.5,58.0,9.0
7,2025-03-10,237.5,54.0,78.5,1.0
8,2025-03-11,193.0,58.0,91.0,1.5
9,2025-03-12,242.0,60.0,58.0,0.0


## Manipulating the habits spreadsheet data
This dataset is already well refined as it was collected manually, so the cleaning is minimal.

**1) Missing values**

Some data points are marked with a -1 instead of a 1 or 0 because I forgot to track that particular habit on that particular day (missing at random). So for each column I find the most frequent value (the mode) of that habit and replace the missing values with it. I use the mode rather than the mean so that the column stays binary:

In [8]:
# Load the habits data
habits_data = pd.read_csv("converted_data/converted_habits_data.csv")

# Replace -1 with the mode (most frequent value) for each column, since we need binary data
for column in habits_data.columns[1:]:
    mode_value = habits_data[habits_data[column] != -1][column].mode()[0]
    habits_data[column] = habits_data[column].replace(-1, mode_value)
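
To illustrate why the mode is used here rather than the mean, a small sketch on an invented four-day habit column (`toy_habits` is a made-up table, not part of the real data):

```python
import pandas as pd

# Toy habit column; -1 marks a day that was not tracked.
toy_habits = pd.DataFrame({
    "Date": ["2025-03-03", "2025-03-04", "2025-03-05", "2025-03-06"],
    "Morning sunlight": [1, -1, 1, 0],
})

# Only the days that were actually tracked.
observed = toy_habits.loc[toy_habits["Morning sunlight"] != -1, "Morning sunlight"]

mean_value = observed.mean()     # a fraction, not a valid 0/1 value
mode_value = observed.mode()[0]  # most frequent tracked value, stays binary

toy_habits["Morning sunlight"] = toy_habits["Morning sunlight"].replace(-1, mode_value)
print(toy_habits)
```

The mean of the tracked days is a fraction between 0 and 1, whereas the mode is one of the two values the habit can actually take.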


**2) Adjusting the dates**

There is a catch here: the habits recorded on a given date do not affect the sleep registered for that date, but the sleep registered for the next day. For example, seeing sunlight on the morning of 3 March affects the sleep of that night, which is registered as the sleep of 4 March, not 3 March.

So we need to shift all the habit dates forward by 1 day so that they correspond to the dates of the sleep data.

In [None]:
# Adjust the dates in the habits data to correspond to the sleep data of the next day
habits_data['Date'] = pd.to_datetime(habits_data['Date']) + pd.Timedelta(days=1)

# Save the adjusted data
habits_data.to_csv("processed_data/processed_habits_data.csv", index=False)

# Displaying the final dataset:
habits_data

Unnamed: 0,Date,Sugar consumption,Morning sunlight,Afternoon nap,Food before bed,Night time caffeine consumption,Social media usage before bed,Exercise
0,2025-03-13,1,1,1,0,0,1,0
1,2025-03-14,0,1,1,0,1,0,0
2,2025-03-15,1,0,0,1,0,0,0
3,2025-03-16,1,1,1,0,0,0,1
4,2025-03-17,1,1,1,0,0,0,0
5,2025-03-18,1,1,1,0,0,0,0
6,2025-03-19,1,1,1,0,0,0,0
7,2025-03-20,1,0,0,1,0,0,0
8,2025-04-01,1,0,1,1,0,0,0
9,2025-04-02,1,0,1,1,0,1,0
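
A quick way to check that the shifted dates line up with the sleep table is to join the two on their date columns. The sketch below uses two invented habit rows and two Deep sleep values from the table above; the merge itself is a hypothetical check and not a step this notebook performs:

```python
import pandas as pd

# Toy habits logged on the day they happened (values invented).
toy_habits = pd.DataFrame({
    "Date": ["2025-03-03", "2025-03-04"],
    "Morning sunlight": [1, 0],
})
# Toy sleep rows keyed by the date the night was registered under.
toy_sleep = pd.DataFrame({
    "date": ["2025-03-04", "2025-03-05"],
    "Deep": [68.5, 71.5],
})

# Shift habits forward one day so they line up with the following night.
toy_habits["Date"] = pd.to_datetime(toy_habits["Date"]) + pd.Timedelta(days=1)
toy_sleep["date"] = pd.to_datetime(toy_sleep["date"])

# Hypothetical join, shown only to confirm that the keys now match.
aligned = toy_habits.merge(toy_sleep, left_on="Date", right_on="date", how="inner")
print(aligned[["Date", "Morning sunlight", "Deep"]])
```

Without the one-day shift, the habit logged on 3 March would find no matching sleep row in this toy data, and the habit from 4 March would be paired with the wrong night.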
