
# CMI Sleep State Prediction
<br/>

## Problem Statement
The goal of this competition is to detect sleep onset and wakeup events from wrist-worn accelerometer data.

Your work could make it possible for researchers to conduct more reliable, larger-scale sleep studies across a range of populations and contexts. The results of such studies could provide even more information about sleep.

A successful outcome of this competition can also have significant implications for children and youth, especially those with mood and behavior difficulties. Sleep is crucial in regulating mood, emotions, and behavior in individuals of all ages, particularly children. By accurately detecting periods of sleep and wakefulness from wrist-worn accelerometer data, researchers can gain a deeper understanding of sleep patterns and of sleep disturbances in children.

<br/>

## ML Objective
You will develop a model trained on wrist-worn accelerometer data in order to determine a person's sleep state.

<br/>

## ML Task Type
Binary Classification

## EDA

In [1]:
import pandas as pd
import dask.dataframe as dd

In [2]:
train_df = dd.read_csv("data/train_events.csv") #.compute()
train_df.head()

Unnamed: 0,series_id,night,event,step,timestamp
0,038441c925bb,1,onset,4992.0,2018-08-14T22:26:00-0400
1,038441c925bb,1,wakeup,10932.0,2018-08-15T06:41:00-0400
2,038441c925bb,2,onset,20244.0,2018-08-15T19:37:00-0400
3,038441c925bb,2,wakeup,27492.0,2018-08-16T05:41:00-0400
4,038441c925bb,3,onset,39996.0,2018-08-16T23:03:00-0400


In [3]:
train_df_parquet = dd.read_parquet("data/train_series.parquet") #.compute()
train_df_parquet.head()

Unnamed: 0,series_id,step,timestamp,anglez,enmo
0,038441c925bb,0,2018-08-14T15:30:00-0400,2.6367,0.0217
1,038441c925bb,1,2018-08-14T15:30:05-0400,2.6368,0.0215
2,038441c925bb,2,2018-08-14T15:30:10-0400,2.637,0.0216
3,038441c925bb,3,2018-08-14T15:30:15-0400,2.6368,0.0213
4,038441c925bb,4,2018-08-14T15:30:20-0400,2.6368,0.0215


In [4]:
train_df_parquet.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 5 entries, series_id to enmo
dtypes: float32(2), uint32(1), string(2)

In [6]:
# inner join on series_id, timestamp, and step
train_df_parquet_join_events = dd.merge(train_df_parquet, train_df, on=["series_id", "timestamp", "step"])
train_df_parquet_join_events.head()

Unnamed: 0,series_id,step,timestamp,anglez,enmo,night,event
0,038441c925bb,4992,2018-08-14T22:26:00-0400,-78.690598,0.0099,1,onset
1,038441c925bb,10932,2018-08-15T06:41:00-0400,-61.578201,0.0263,1,wakeup
2,038441c925bb,20244,2018-08-15T19:37:00-0400,-6.3874,0.0182,2,onset
3,038441c925bb,27492,2018-08-16T05:41:00-0400,-45.355099,0.0165,2,wakeup
4,038441c925bb,39996,2018-08-16T23:03:00-0400,-1.7867,0.0,3,onset
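
The merge above uses the default inner join, so only the steps that coincide with an event survive; every other sensor reading is dropped. A minimal pandas sketch of that behaviour on a toy frame follows. The neighbouring sensor rows and the event with a missing `step` are invented for illustration, and the idea that the events table can contain missing steps is only an assumption suggested by its float `step` column.

```python
import pandas as pd

# Toy sensor readings: the middle row is taken from the output above,
# the two neighbours are invented for illustration.
series = pd.DataFrame({
    "series_id": ["038441c925bb"] * 3,
    "step": pd.Series([4991, 4992, 4993], dtype="uint32"),
    "timestamp": [
        "2018-08-14T22:25:55-0400",
        "2018-08-14T22:26:00-0400",
        "2018-08-14T22:26:05-0400",
    ],
    "anglez": [-78.0, -78.690598, -79.0],
    "enmo": [0.0100, 0.0099, 0.0098],
})

# Toy events: one real onset plus a hypothetical event with no recorded step.
events = pd.DataFrame({
    "series_id": ["038441c925bb", "038441c925bb"],
    "night": [1, 2],
    "event": ["onset", "onset"],
    "step": [4992.0, float("nan")],
    "timestamp": ["2018-08-14T22:26:00-0400", None],
})

# Drop events without a step, then align the dtype with the sensor table.
events_clean = events.dropna(subset=["step"]).astype({"step": "uint32"})
keys = ["series_id", "timestamp", "step"]

# Inner join: keeps only the steps where an event occurred.
event_rows = series.merge(events_clean, on=keys, how="inner")

# Left join: keeps every sensor reading; `event` is missing where nothing happened.
all_rows = series.merge(events_clean, on=keys, how="left")

print(len(event_rows), len(all_rows))
```

The left join is the variant to reach for if every step needs a label later on; the inner join is enough for inspecting the readings at the event steps themselves.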


- The merged dataset now shows onset vs. wakeup information alongside the sensor readings (`anglez`, `enmo`).
- Let's split `timestamp` into hours and minutes, since the dates themselves may not be that useful.
- We'll drop `night`: it only enumerates potential onset/wakeup event pairs (at most one pair can occur per night), which is not that important to know.
- `series_id` is just a unique identifier for an individual. We'll drop that column for training but keep the test one for the submission. The same will be done for `step`.
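
The preprocessing steps listed above could be sketched as follows. This is a pandas sketch on two rows copied from the events table, not the notebook's final pipeline; the helper name `add_time_of_day` is hypothetical. It slices the hour and minute straight out of the ISO-style string, which keeps the local wall-clock time and sidesteps timezone-offset parsing.

```python
import pandas as pd

# Two rows copied from the merged output above.
sample = pd.DataFrame({
    "series_id": ["038441c925bb", "038441c925bb"],
    "step": [4992, 10932],
    "timestamp": ["2018-08-14T22:26:00-0400", "2018-08-15T06:41:00-0400"],
    "night": [1, 1],
    "event": ["onset", "wakeup"],
})


def add_time_of_day(df):
    """Hypothetical helper: add `hour` and `minute`, drop `timestamp` and `night`."""
    out = df.copy()
    # Positions 11-12 hold the hour and 14-15 the minute in "YYYY-MM-DDTHH:MM:SS-ZZZZ".
    out["hour"] = out["timestamp"].str.slice(11, 13).astype("int8")
    out["minute"] = out["timestamp"].str.slice(14, 16).astype("int8")
    return out.drop(columns=["timestamp", "night"])


features = add_time_of_day(sample)

# Keep the identifiers aside for the submission, then drop them for training.
ids = features[["series_id", "step"]]
train_features = features.drop(columns=["series_id", "step"])
print(train_features)
```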

In [10]:
filtered_df = train_df_parquet.loc[train_df_parquet["series_id"] == "038441c925bb"]

In [12]:
filtered_df.compute()

Unnamed: 0,series_id,step,timestamp,anglez,enmo
0,038441c925bb,0,2018-08-14T15:30:00-0400,2.636700,0.0217
1,038441c925bb,1,2018-08-14T15:30:05-0400,2.636800,0.0215
2,038441c925bb,2,2018-08-14T15:30:10-0400,2.637000,0.0216
3,038441c925bb,3,2018-08-14T15:30:15-0400,2.636800,0.0213
4,038441c925bb,4,2018-08-14T15:30:20-0400,2.636800,0.0215
...,...,...,...,...,...
389875,038441c925bb,389875,2018-09-06T04:59:35-0400,-27.373899,0.0110
389876,038441c925bb,389876,2018-09-06T04:59:40-0400,-27.493799,0.0110
389877,038441c925bb,389877,2018-09-06T04:59:45-0400,-27.533701,0.0111
389878,038441c925bb,389878,2018-09-06T04:59:50-0400,-28.003599,0.0111


In [34]:
train_df_parquet["series_id"].unique().compute()

0      038441c925bb
1      03d92c9f6f8a
2      0402a003dae9
3      04f547b8017d
4      05e1944c3818
           ...     
272    fa149c3c4bde
273    fb223ed2278c
274    fbf33b1a2c10
275    fcca183903b7
276    fe90110788d2
Name: series_id, Length: 277, dtype: string

## Training

- We'll start simple, without a validation dataset.

In [None]:
X, y = None, None  # TODO: features and labels are not defined yet
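
One possible way to fill in `X` and `y`, given the binary-classification framing, is to label every step between an onset and the following wakeup as asleep. The sketch below works on a single toy series with invented values; the helper `label_sleep`, the choice of `anglez` and `enmo` as the only features, and the handling of events without a recorded step are all assumptions rather than the notebook's actual approach.

```python
import numpy as np
import pandas as pd


def label_sleep(series_df, events_df):
    """Hypothetical helper: 1 for steps in [onset, wakeup) of any night, else 0.

    Expects the readings and events of a single series.
    """
    asleep = np.zeros(len(series_df), dtype="int8")
    steps = series_df["step"].to_numpy()
    # Ignore events with no recorded step (assumed to be possible).
    recorded = events_df.dropna(subset=["step"])
    for _, night_events in recorded.groupby("night"):
        by_event = night_events.set_index("event")["step"]
        # Only label nights that have both an onset and a wakeup.
        if {"onset", "wakeup"} <= set(by_event.index):
            mask = (steps >= by_event["onset"]) & (steps < by_event["wakeup"])
            asleep[mask] = 1
    return asleep


# Toy series with invented sensor values.
toy_series = pd.DataFrame({
    "step": np.arange(12, dtype="uint32"),
    "anglez": np.linspace(-30.0, 30.0, 12),
    "enmo": np.full(12, 0.02),
})
# Night 1 has a complete event pair; night 2 has no recorded steps.
toy_events = pd.DataFrame({
    "night": [1, 1, 2, 2],
    "event": ["onset", "wakeup", "onset", "wakeup"],
    "step": [3.0, 8.0, np.nan, np.nan],
})

X = toy_series[["anglez", "enmo"]]
y = label_sleep(toy_series, toy_events)
print(y)
```

On the full data this would be applied per `series_id`, since step numbers restart for every series.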

In [7]:
# a look at the test data
test_df_parquet = dd.read_parquet("data/test_series.parquet") #.compute()
test_df_parquet.head()

Unnamed: 0,series_id,step,timestamp,anglez,enmo
0,038441c925bb,0,2018-08-14T15:30:00-0400,2.6367,0.0217
1,038441c925bb,1,2018-08-14T15:30:05-0400,2.6368,0.0215
2,038441c925bb,2,2018-08-14T15:30:10-0400,2.637,0.0216
3,038441c925bb,3,2018-08-14T15:30:15-0400,2.6368,0.0213
4,038441c925bb,4,2018-08-14T15:30:20-0400,2.6368,0.0215


- Goal: for each series, predict which step represents an onset or a wakeup, and with what confidence.
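
If the model outputs a per-step probability of being asleep, the goal above amounts to locating the steps where the predicted state flips. The following is a rough sketch under that assumption. The function `probabilities_to_events`, the 0.5 threshold, the output column names, and the use of the probability jump as the confidence are all illustrative choices, not a prescribed scoring rule.

```python
import numpy as np
import pandas as pd


def probabilities_to_events(series_id, steps, p_asleep, threshold=0.5):
    """Hypothetical helper: turn per-step sleep probabilities into event rows."""
    p = np.asarray(p_asleep, dtype=float)
    asleep = (p >= threshold).astype(int)
    # +1 marks an awake -> asleep flip, -1 marks asleep -> awake.
    change = np.diff(asleep)
    rows = []
    for i in np.flatnonzero(change):
        rows.append({
            "series_id": series_id,
            "step": int(steps[i + 1]),  # first step in the new state
            "event": "onset" if change[i] == 1 else "wakeup",
            # Illustrative confidence: size of the probability jump at the flip.
            "confidence": float(abs(p[i + 1] - p[i])),
        })
    return pd.DataFrame(rows, columns=["series_id", "step", "event", "confidence"])


# Toy probabilities, invented for illustration.
toy_steps = np.arange(8)
toy_probs = [0.1, 0.2, 0.9, 0.8, 0.95, 0.3, 0.1, 0.2]

predicted_events = probabilities_to_events("038441c925bb", toy_steps, toy_probs)
print(predicted_events)
```

Real predictions tend to flicker around the threshold, so some smoothing of the probabilities before thresholding would likely be needed in practice.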

## Evaluation and Submission