## **Child Mind Institute - Detect Sleep States**

## ***Project participants:*** Alexandra Serechenko and Maria Zueva

### Link to the competition: https://www.kaggle.com/competitions/child-mind-institute-detect-sleep-states/data?select=test_series.parquet

### Sleep plays a crucial role in maintaining health and well-being. It is a complex physiological process that supports cognition, emotional regulation, the immune system, and metabolic balance. Good-quality sleep is essential for memory, learning, and problem-solving. It also supports physical recovery by helping the body repair tissues and muscles.


### **The main goal**: to detect sleep onset and wakeup events.

### Once obtained, the results can be used to improve researchers' ability to analyze accelerometer data for sleep monitoring and to enable large-scale studies of sleep. The competition itself also aims to improve awareness and guidance surrounding the importance of sleep.

### **Description of sleep data:**

- Approximately 500 multi-day recordings of wrist-worn accelerometer data, annotated with two event types: onset (the beginning of sleep) and wakeup (the end of sleep). In this work, we use 3 files that are publicly available to the competition participants:
1. train_series.parquet - Series to be used as training data. Each series is a continuous recording of accelerometer data for a single subject spanning many days.
2. test_series.parquet - Series to be used as test data, containing the same fields as above.
3. train_events.csv - Sleep logs for the series in the training set, recording onset and wakeup events.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip uninstall polars
!pip install polars

Found existing installation: polars 0.17.3
Uninstalling polars-0.17.3:
  Would remove:
    /usr/local/lib/python3.10/dist-packages/polars-0.17.3.dist-info/*
    /usr/local/lib/python3.10/dist-packages/polars/*
Proceed (Y/n)? y
  Successfully uninstalled polars-0.17.3
Collecting polars
  Downloading polars-0.19.19-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28.5 MB)
Installing collected packages: polars
Successfully installed polars-0.19.19


In [None]:
import numpy as np
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go


import matplotlib.pyplot as plt
import polars as pl

from event_detection_ap import score

import datetime
from tqdm import tqdm

In [None]:

# Variables for the score function

column_names = {
    'series_id_column_name': 'series_id',
    'time_column_name': 'step',
    'event_column_name': 'event',
    'score_column_name': 'score',
}

tolerances = {
    'onset': [12, 36, 60, 90, 120, 150, 180, 240, 300, 360],
    'wakeup': [12, 36, 60, 90, 120, 150, 180, 240, 300, 360]
}
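
The tolerances above are expressed in steps. Assuming one sample every 5 seconds (the same 12-steps-per-minute factor that the rolling windows in this notebook use), a quick conversion shows what they mean in minutes. The name `STEPS_PER_MINUTE` is introduced here for illustration only.

```python
# Assumption: one step every 5 seconds, i.e. 12 steps per minute.
STEPS_PER_MINUTE = 12

tolerance_steps = [12, 36, 60, 90, 120, 150, 180, 240, 300, 360]

# Convert each tolerance from steps to minutes
tolerance_minutes = [steps / STEPS_PER_MINUTE for steps in tolerance_steps]

print(tolerance_minutes)
```

Under this assumption, the tolerances range from 1 minute to 30 minutes around the true event.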

In [None]:
review_data = pd.read_csv('/content/drive/My Drive/train_events.csv')

## Data transformations

In [None]:
# import data, transform the columns

dt_transforms = [
    pl.col('timestamp').str.to_datetime(),
    (pl.col('timestamp').str.to_datetime().dt.year()-2000).cast(pl.UInt8).alias('year'),
    pl.col('timestamp').str.to_datetime().dt.month().cast(pl.UInt8).alias('month'),
    pl.col('timestamp').str.to_datetime().dt.day().cast(pl.UInt8).alias('day'),
    pl.col('timestamp').str.to_datetime().dt.hour().cast(pl.UInt8).alias('hour')
]

data_transforms = [
    # Convert anglez to 16 bit integer
    pl.col('anglez').cast(pl.Int16),
    # Convert enmo to 16 bit uint
    (pl.col('enmo')*1000).cast(pl.UInt16),
]

In [None]:
train_series = pl.scan_parquet('/content/drive/My Drive/train_series.parquet').with_columns(
    dt_transforms + data_transforms
    )

train_events = pl.read_csv('/content/drive/My Drive/train_events.csv').with_columns(
    dt_transforms
    ).drop_nulls()

test_series = pl.scan_parquet('/content/drive/My Drive/test_series.parquet').with_columns(
    dt_transforms + data_transforms
    )


In [None]:
# Remove null events and nights with mismatched onset/wakeup counts from train_events

mismatches = train_events.drop_nulls().group_by(['series_id', 'night']).agg([
    ((pl.col('event') == 'onset').sum() == (pl.col('event') == 'wakeup').sum()).alias('balanced')
    ]).sort(by=['series_id', 'night']).filter(~pl.col('balanced'))

for mm in mismatches.to_numpy():
    train_events = train_events.filter(~((pl.col('series_id') == mm[0]) & (pl.col('night') == mm[1])))

# Series ids --> list
series_ids = train_events['series_id'].unique(maintain_order=True).to_list()

# Keep only these series ids in train_series
train_series = train_series.filter(pl.col('series_id').is_in(series_ids))
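
The filtering in the cell above can be illustrated without polars. The sketch below uses a hypothetical toy event log (not competition data) and a hypothetical helper, `unbalanced_nights`, to find nights whose onset and wakeup counts differ and then drop them.

```python
from collections import Counter

# Hypothetical toy event log: (series_id, night, event)
toy_events = [
    ("s1", 1, "onset"), ("s1", 1, "wakeup"),
    ("s1", 2, "onset"),                       # wakeup missing: unbalanced night
    ("s2", 1, "onset"), ("s2", 1, "wakeup"),
]

def unbalanced_nights(events):
    """Return (series_id, night) pairs whose onset and wakeup counts differ."""
    balance = Counter()
    for series_id, night, event in events:
        # Only two event types exist here, so anything not 'onset' counts as 'wakeup'
        balance[(series_id, night)] += 1 if event == "onset" else -1
    return sorted(key for key, diff in balance.items() if diff != 0)

bad = set(unbalanced_nights(toy_events))

# Keep only events that belong to balanced nights
clean = [e for e in toy_events if (e[0], e[1]) not in bad]
```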

**Useful parameters for the model**

It can be shown that the anglez parameter ("a metric derived from individual accelerometer components that is commonly used in sleep detection, and refers to the angle of the arm relative to the vertical axis of the body", https://www.kaggle.com/competitions/child-mind-institute-detect-sleep-states/data) varies constantly during waking hours, while during sleep it barely changes. It is therefore reasonable to use the total variation of this parameter (the sum of the absolute differences between consecutive points) as a feature that distinguishes wakefulness from sleep. For a person who is awake it keeps growing without bound, while for a sleeping person it stays bounded.
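
As a minimal sketch of this idea, the total variation can be computed for two hypothetical toy signals (synthetic values, not real accelerometer data): one that keeps changing, as during wakefulness, and one that is almost constant, as during sleep.

```python
import math

def total_variation(values):
    """Sum of absolute differences between consecutive samples."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

# Hypothetical toy signals, chosen only to illustrate the contrast
awake = [30 * math.sin(0.9 * i) for i in range(120)]    # angle keeps changing
asleep = [-45.0 + 0.01 * (i % 2) for i in range(120)]   # angle is almost constant

tv_awake = total_variation(awake)
tv_asleep = total_variation(asleep)

print(f"awake: {tv_awake:.1f}, asleep: {tv_asleep:.2f}")
```

The longer the "awake" signal runs, the larger its total variation becomes, while the "asleep" signal accumulates almost nothing.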

**Notes for the code cells below:**

- *Initialization of Features and Feature Columns:*

features: initialized with the 'hour' column from the original DataFrame. It serves as the base for further feature engineering.
feature_cols: initialized with the matching column names, starting with 'hour'.

- *Nested Loops for Feature Engineering:*

For each combination of duration and variable, generate rolling features (mean, max, and standard deviation) of the variable over a centered window, taking the absolute value of each result.

Generate the first-order variations of the variables (absolute differences between consecutive points), calculate their rolling mean, max, and standard deviation, and scale the results by a factor of 10.

- *Update features and feature_cols:*

After each iteration of the inner loops, append the generated features and corresponding column names to the features and feature_cols lists.

- *Update train_series and test_series:*

For train_series and test_series, add the calculated features to the DataFrame using the with_columns method.

Restrict the DataFrame to the specified identifier columns (id_cols) and the newly created feature columns (feature_cols).

In summary, the cells below perform rolling-window feature engineering on the 'enmo' and 'anglez' variables for different durations (5 minutes, 30 minutes, 2 hours, and 8 hours):

- Calculate the rolling mean, max, and standard deviation for the variables and for their first-order variations.

- Add the resulting features to the original time series (train_series and test_series).
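
The rolling statistics can be sketched in plain Python on a tiny signal. This is a simplified illustration, not the polars implementation: `centered_rolling` is a hypothetical helper, and its handling of window edges and even window sizes may differ from what polars does with `center=True, min_periods=1`.

```python
def centered_rolling(values, window, func):
    """Apply func over a window centred on each sample, truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(func(values[lo:hi]))
    return out

def mean(xs):
    return sum(xs) / len(xs)

# Toy signal with a single burst of movement
signal = [0, 0, 0, 9, 0, 0, 0]

# First-order variation: absolute difference between consecutive samples
diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]

# A window of 3 samples is used here; in the notebook a 5-minute window
# corresponds to 12 * 5 = 60 samples
rolling_mean = centered_rolling(signal, 3, mean)
rolling_max = centered_rolling(signal, 3, max)
```

The burst spreads over the neighbouring samples in both the mean and the max, which is what lets the classifier see recent movement around each time step.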

In [None]:
features, feature_cols = [pl.col('hour')], ['hour']

In [None]:
for mins in [5, 30, 60*2, 60*8] :

    for var in ['enmo', 'anglez'] :

        features += [
            pl.col(var).rolling_mean(12 * mins, center=True, min_periods=1).abs().cast(pl.UInt16).alias(f'{var}_{mins}m_mean'),
            pl.col(var).rolling_max(12 * mins, center=True, min_periods=1).abs().cast(pl.UInt16).alias(f'{var}_{mins}m_max'),
            pl.col(var).rolling_std(12 * mins, center=True, min_periods=1).abs().cast(pl.UInt16).alias(f'{var}_{mins}m_std')
        ]

        feature_cols += [
            f'{var}_{mins}m_mean', f'{var}_{mins}m_max', f'{var}_{mins}m_std'
        ]

        # Getting first variations
        features += [
            (pl.col(var).diff().abs().rolling_mean(12 * mins, center=True, min_periods=1)*10).abs().cast(pl.UInt32).alias(f'{var}_1v_{mins}m_mean'),
            (pl.col(var).diff().abs().rolling_max(12 * mins, center=True, min_periods=1)*10).abs().cast(pl.UInt32).alias(f'{var}_1v_{mins}m_max'),
            (pl.col(var).diff().abs().rolling_std(12 * mins, center=True, min_periods=1)*10).abs().cast(pl.UInt32).alias(f'{var}_1v_{mins}m_std')
        ]

        feature_cols += [
            f'{var}_1v_{mins}m_mean', f'{var}_1v_{mins}m_max', f'{var}_1v_{mins}m_std'
        ]

id_cols = ['series_id', 'step', 'timestamp']

train_series = train_series.with_columns(
    features
).select(id_cols + feature_cols)

test_series = test_series.with_columns(
    features
).select(id_cols + feature_cols)

- Iterate over series IDs, normalize features

- Construct the feature matrix and labels for the training dataset

- The labels are determined based on the occurrence of steps within specified intervals corresponding to events in the 'train_events' DataFrame
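
The labelling rule can be sketched in plain Python as follows. `label_asleep` is a hypothetical helper that mirrors the interval test used in the cell below, with the difference that `any` also handles a series without any recorded events.

```python
def label_asleep(steps, onsets, wakeups):
    """Mark a step as asleep if it lies inside any [onset, wakeup] interval."""
    intervals = list(zip(onsets, wakeups))
    return [any(on <= s <= wk for on, wk in intervals) for s in steps]

# Toy example: two sleep periods, steps sampled every 10 units
steps = list(range(0, 100, 10))
labels = label_asleep(steps, onsets=[20, 70], wakeups=[40, 80])

print(list(zip(steps, labels)))
```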

In [None]:
def make_train_dataset(train_data, train_events, drop_nulls=False) :

    series_ids = train_data['series_id'].unique(maintain_order=True).to_list()
    X, y = pl.DataFrame(), pl.DataFrame()
    for idx in tqdm(series_ids) :

        # Normalizing sample features
        sample = train_data.filter(pl.col('series_id')==idx).with_columns(
            [(pl.col(col) / pl.col(col).std()).cast(pl.Float32) for col in feature_cols if col != 'hour']
        )

        events = train_events.filter(pl.col('series_id')==idx)

        if drop_nulls :
            # Removing datapoints on dates where no data was recorded
            sample = sample.filter(
                pl.col('timestamp').dt.date().is_in(events['timestamp'].dt.date())
            )

        X = X.vstack(sample[id_cols + feature_cols])

        # Use is_not_null() rather than comparing with None to test for missing steps
        onsets = events.filter((pl.col('event') == 'onset') & pl.col('step').is_not_null())['step'].to_list()
        wakeups = events.filter((pl.col('event') == 'wakeup') & pl.col('step').is_not_null())['step'].to_list()

        # NOTE: This will break if there are event series without any recorded onsets or wakeups
        y = y.vstack(sample.with_columns(
            sum([(onset <= pl.col('step')) & (pl.col('step') <= wakeup) for onset, wakeup in zip(onsets, wakeups)]).cast(pl.Boolean).alias('asleep')
            ).select('asleep')
            )

    y = y.to_numpy().ravel()

    return X, y

- Process a time series using a classifier to predict sleep events (onset and wakeup)

- Calculate scores for each predicted sleep period based on the mean probability over the period
- Store the results in a formatted DataFrame
- Reset the row index to create a 'row_id' column. The final DataFrame is returned as the output.
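
The event extraction logic can be sketched in plain Python on a toy prediction sequence. `extract_sleep_periods` is a hypothetical helper written for illustration. It follows the same steps as the cell below (transitions in the 0/1 predictions, trimming of unmatched events, a minimum period length, and the mean probability as score) but is not the notebook's implementation.

```python
def extract_sleep_periods(steps, preds, probs, min_length):
    """Turn a 0/1 prediction sequence into (onset, wakeup, score) tuples."""
    onsets, wakeups = [], []
    for i in range(1, len(preds)):
        change = preds[i] - preds[i - 1]
        if change > 0:        # awake -> asleep
            onsets.append(steps[i])
        elif change < 0:      # asleep -> awake
            wakeups.append(steps[i])

    # Drop a leading wakeup without an onset and a trailing onset without a wakeup
    if onsets and wakeups and wakeups[0] < onsets[0]:
        wakeups = wakeups[1:]
    if onsets and wakeups and onsets[-1] > wakeups[-1]:
        onsets = onsets[:-1]

    periods = []
    for onset, wakeup in zip(onsets, wakeups):
        if wakeup - onset < min_length:
            continue  # too short to count as a sleep period
        inside = [p for s, p in zip(steps, probs) if onset <= s <= wakeup]
        periods.append((onset, wakeup, sum(inside) / len(inside)))
    return periods

steps = list(range(10))
preds = [1, 0, 0, 1, 1, 1, 1, 0, 1, 0]
probs = [0.9, 0.1, 0.2, 0.8, 0.8, 0.8, 0.8, 0.4, 0.7, 0.3]

# In the notebook the minimum length is 12 * 30 steps (30 minutes); 3 is used for the toy data
periods = extract_sleep_periods(steps, preds, probs, min_length=3)
```

In this toy run, the wakeup at step 1 has no preceding onset and is dropped, and the period from step 8 to step 9 is too short, so a single period from step 3 to step 7 remains.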

In [None]:
def get_events(series, classifier) :
    '''
    Takes a time series and a classifier and returns a formatted submission dataframe.
    '''

    series_ids = series['series_id'].unique(maintain_order=True).to_list()
    events = pl.DataFrame(schema={'series_id':str, 'step':int, 'event':str, 'score':float})

    for idx in tqdm(series_ids) :

        # Collect sample and normalize features
        scale_cols = [col for col in feature_cols if (col != 'hour') & (series[col].std() !=0)]
        X = series.filter(pl.col('series_id') == idx).select(id_cols + feature_cols).with_columns(
            [(pl.col(col) / series[col].std()).cast(pl.Float32) for col in scale_cols]
        )

        # Apply classifier to get predictions and scores
        preds, probs = classifier.predict(X[feature_cols]), classifier.predict_proba(X[feature_cols])[:, 1]


        X = X.with_columns(
            pl.lit(preds).cast(pl.Int8).alias('prediction'),
            pl.lit(probs).alias('probability')
                        )

        # Get predicted onset and wakeup time steps
        pred_onsets = X.filter(X['prediction'].diff() > 0)['step'].to_list()
        pred_wakeups = X.filter(X['prediction'].diff() < 0)['step'].to_list()

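        # Both lists must be non-empty, otherwise min()/max() below would raise
        if len(pred_onsets) > 0 and len(pred_wakeups) > 0 :

            # Ensure all predicted sleep periods begin and end
            if min(pred_wakeups) < min(pred_onsets) :
                pred_wakeups = pred_wakeups[1:]

            if len(pred_wakeups) > 0 and max(pred_onsets) > max(pred_wakeups) :
                pred_onsets = pred_onsets[:-1]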

            # Keep sleep periods longer than 30 minutes
            sleep_periods = [(onset, wakeup) for onset, wakeup in zip(pred_onsets, pred_wakeups) if wakeup - onset >= 12 * 30]

            for onset, wakeup in sleep_periods :
                # Score using mean probability over period
                score = X.filter((pl.col('step') >= onset) & (pl.col('step') <= wakeup))['probability'].mean()

                # Add sleep event to dataframe
                events = events.vstack(pl.DataFrame().with_columns(
                    pl.Series([idx, idx]).alias('series_id'),
                    pl.Series([onset, wakeup]).alias('step'),
                    pl.Series(['onset', 'wakeup']).alias('event'),
                    pl.Series([score, score]).alias('score')
                ))

    # Add row id column
    events = events.to_pandas().reset_index().rename(columns={'index':'row_id'})

    return events

### To conclude:

Further work involves splitting the data into training and test sets.
The next step is to select a model for predicting the wakefulness/sleep state.
We plan to begin this phase with a random forest model.

### Training

In [None]:
# Collect datapoints every 5 minutes
train_data = train_series.filter(pl.col('series_id').is_in(series_ids)).take_every(3 * 20).collect()
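
As a quick check of the stride used above, and assuming one sample every 5 seconds (12 steps per minute), keeping every 60th row corresponds to one datapoint every 5 minutes. The names below are illustrative only.

```python
STEPS_PER_MINUTE = 12  # assumption: one sample every 5 seconds
stride = 3 * 20        # same stride as in the cell above

# Toy step index covering 50 minutes of recording
steps = list(range(0, 600))

# Keep every stride-th step, analogous to downsampling the series
sampled = steps[::stride]
minutes_between_samples = stride / STEPS_PER_MINUTE
```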

