# Data Pipeline
In this notebook, we label the data at every timestep with one of 3 labels:
- Awake
- Asleep
- Unknown (Device not worn)

First, we need a model that identifies periods of inactivity (when the accelerometer is not being worn) so that those timesteps can be labeled Unknown.

## Load in data

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# format display
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', '{:.3f}'.format)

In [5]:
# read in the data
root = '../data/'
train = pd.read_parquet(root + 'train_series.parquet')
test = pd.read_parquet(root + 'test_series.parquet')
events = pd.read_csv(root + 'train_events.csv')
submission = pd.read_csv(root + 'sample_submission.csv')

In [11]:
# get a single user
example_user = '038441c925bb'
train_example = train.loc[train['series_id'] == example_user, :].copy()

# convert to datetime
train_example['timestamp'] = pd.to_datetime(train_example['timestamp'], utc=False)
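
The outputs later in this notebook show a `date` column that the cells shown here never create. One plausible way to derive it from the parsed timestamp is sketched below on a toy frame. The timestamp string format and the `dt.date` step are assumptions, not the notebook's confirmed method.

```python
import pandas as pd

# toy timestamps in an ISO-8601-with-offset form (assumed format, made-up values)
toy = pd.DataFrame({
    'timestamp': [
        '2018-08-14T23:59:55-0400',
        '2018-08-15T00:00:00-0400',
        '2018-08-15T00:00:05-0400',
    ]
})

# parse to timezone-aware datetimes (all rows share one UTC offset here)
toy['timestamp'] = pd.to_datetime(toy['timestamp'])

# hypothetical helper column: the local calendar date of each reading
toy['date'] = toy['timestamp'].dt.date

print(toy)
```

Taking the date from the local (offset-aware) timestamp keeps a reading at 23:59:55 on the same day the wearer experienced it, rather than shifting it to the UTC date.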

## Create Features

## Label Data
- Awake
- Asleep
- Unknown (Device not worn)

In [85]:
# merge example with events table
train_final = train_example.merge(events, how='left', on=['series_id', 'step'])

# check
train_final.shape

(389880, 6)
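
A left merge should keep exactly one row per timestep. As a minimal sketch on made-up toy tables (the names `series` and `toy_events` are illustrative only), the `validate` and `indicator` arguments of `DataFrame.merge` can confirm that no rows were duplicated and show how many steps matched an event:

```python
import pandas as pd

# toy stand-ins for the series and events tables (values are made up)
series = pd.DataFrame({
    'series_id': ['a'] * 5,
    'step': [0, 1, 2, 3, 4],
    'enmo': [0.01, 0.02, 0.00, 0.00, 0.03],
})
toy_events = pd.DataFrame({
    'series_id': ['a', 'a'],
    'step': [1, 3],
    'event': ['onset', 'wakeup'],
})

merged = series.merge(
    toy_events,
    how='left',
    on=['series_id', 'step'],
    validate='one_to_one',  # raise if a key is duplicated on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# a left merge on unique keys must not change the number of rows
assert len(merged) == len(series)
print(merged['_merge'].value_counts())
```

If the real events table contains repeated keys, `validate` would raise, which is itself a useful signal before labeling.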

In [87]:
# # get valid dates (days with events recorded)
# valid_dates = train_final[train_final['event'].notna()].date.unique()

# # drop rows that aren't in valid_dates
# train_final = train_final[train_final['date'].isin(valid_dates)]

# # check
# train_final.shape

(328320, 6)
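
The commented-out cell above keeps only the days on which at least one event was recorded. The same idea is sketched here on a toy frame with made-up dates, assuming a `date` column already exists:

```python
import pandas as pd

# toy frame: 'd2' has no recorded events and should be dropped
toy = pd.DataFrame({
    'date':  ['d1', 'd1', 'd2', 'd2', 'd3', 'd3'],
    'event': [None, 'onset', None, None, 'wakeup', None],
})

# dates on which at least one event was recorded
valid_dates = toy.loc[toy['event'].notna(), 'date'].unique()

# keep only the rows that fall on those dates
filtered = toy[toy['date'].isin(valid_dates)]

print(filtered)
```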

In [61]:
# look at the rows where events occurred
train_final[train_final['event'].notna()].head()

,series_id,step,anglez,enmo,date,event
4992,038441c925bb,4992,-78.691,0.01,2018-08-15,onset
10932,038441c925bb,10932,-61.578,0.026,2018-08-15,wakeup
20244,038441c925bb,20244,-6.387,0.018,2018-08-15,onset
27492,038441c925bb,27492,-45.355,0.016,2018-08-16,wakeup
39996,038441c925bb,39996,-1.787,0.0,2018-08-17,onset


In [62]:
train_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 328320 entries, 3240 to 383399
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   series_id  328320 non-null  object 
 1   step       328320 non-null  uint32 
 2   anglez     328320 non-null  float32
 3   enmo       328320 non-null  float32
 4   date       328320 non-null  object 
 5   event      38 non-null      object 
dtypes: float32(2), object(3), uint32(1)
memory usage: 13.8+ MB


Most rows have a null 'event' (only 38 of 328,320 are populated), because an event is recorded only at the step where the wearer falls asleep or wakes up. To label every step, we forward fill the event column:

In [63]:
# forward fill the event column: steps after an 'onset' are asleep, steps after a 'wakeup' are awake
train_final['event'] = train_final['event'].ffill()

# fill the null rows before the first event with 'wakeup' (assumes the recording starts while awake)
train_final['event'] = train_final['event'].fillna('wakeup')

# check that all nulls are filled
train_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 328320 entries, 3240 to 383399
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   series_id  328320 non-null  object 
 1   step       328320 non-null  uint32 
 2   anglez     328320 non-null  float32
 3   enmo       328320 non-null  float32
 4   date       328320 non-null  object 
 5   event      328320 non-null  object 
dtypes: float32(2), object(3), uint32(1)
memory usage: 13.8+ MB
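
To make the fill logic concrete, here is a minimal sketch on a toy column with made-up values. The leading fill with 'wakeup' encodes the assumption that the recording starts while the wearer is awake.

```python
import pandas as pd

# toy event column: events are known only at the steps where the state changes
toy = pd.DataFrame({'event': [None, None, 'onset', None, None, 'wakeup', None]})

# carry each event forward until the next one
toy['event'] = toy['event'].ffill()

# rows before the first event have nothing to carry forward; assume awake
toy['event'] = toy['event'].fillna('wakeup')

print(toy['event'].tolist())
# ['wakeup', 'wakeup', 'onset', 'onset', 'onset', 'wakeup', 'wakeup']
```

Assigning the result back to the column, rather than calling `ffill(inplace=True)` on a column selection, avoids the chained-assignment pitfalls in recent pandas versions.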


Now, we'll check a few of the step indices where events occurred to ensure the forward fill worked correctly:

In [66]:
# check 1
train_final.set_index('step').loc[4990:4994]

step,series_id,anglez,enmo,date,event
4990,038441c925bb,-78.71,0.01,2018-08-15,wakeup
4991,038441c925bb,-78.73,0.01,2018-08-15,wakeup
4992,038441c925bb,-78.691,0.01,2018-08-15,onset
4993,038441c925bb,-78.665,0.01,2018-08-15,onset
4994,038441c925bb,-78.466,0.01,2018-08-15,onset


In [67]:
# check 2
train_final.set_index('step').loc[10930:10934]

step,series_id,anglez,enmo,date,event
10930,038441c925bb,-58.177,0.036,2018-08-15,onset
10931,038441c925bb,-61.438,0.027,2018-08-15,onset
10932,038441c925bb,-61.578,0.026,2018-08-15,wakeup
10933,038441c925bb,-61.744,0.026,2018-08-15,wakeup
10934,038441c925bb,-61.786,0.027,2018-08-15,wakeup
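
Spot checks like the two above can also be automated. As a sketch on toy data (not the notebook's own check), the steps where the filled label changes can be recovered by comparing each row with the previous one. On the real frame, these steps could then be compared against the steps listed in the events table.

```python
import pandas as pd

# toy frame after forward filling (made-up values)
toy = pd.DataFrame({
    'step': [0, 1, 2, 3, 4, 5, 6],
    'event': ['wakeup', 'wakeup', 'onset', 'onset', 'onset', 'wakeup', 'wakeup'],
})

# True wherever the label differs from the previous row
changed = toy['event'].ne(toy['event'].shift())

# the first row has no predecessor, so it is not a real transition
changed.iloc[0] = False

transition_steps = toy.loc[changed, 'step'].tolist()
print(transition_steps)
# [2, 5]
```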


Now, we map these values to a binary 'asleep' column (1 after an onset, 0 after a wakeup) to use as our final label:

In [68]:
# create sleep col
train_final['asleep'] = (train_final['event'] == 'onset').astype(int)

Awake steps should outnumber asleep steps by roughly two to one (about 16 hours awake and 8 hours asleep per day):

In [69]:
train_final['asleep'].value_counts()

asleep
0    208236
1    120084
Name: count, dtype: int64
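
As a quick arithmetic check on the counts printed above, the ratio and the implied hours of sleep per day can be computed directly. This sketch hard-codes the two counts from the output and assumes every step covers the same amount of time.

```python
import pandas as pd

# counts copied from the value_counts() output above
counts = pd.Series({0: 208236, 1: 120084}, name='count')

awake_to_asleep = counts.loc[0] / counts.loc[1]
asleep_fraction = counts.loc[1] / counts.sum()
hours_asleep_per_day = 24 * asleep_fraction

print(f'awake:asleep ratio = {awake_to_asleep:.2f}')            # 1.73
print(f'fraction asleep    = {asleep_fraction:.3f}')            # 0.366
print(f'implied sleep/day  = {hours_asleep_per_day:.1f} hours')  # 8.8
```

A ratio of about 1.73 is somewhat below the expected 2, which corresponds to a little under 9 hours labeled asleep per day.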

In [70]:
# drop event col
train_final.drop('event', axis=1, inplace=True)

# final training dataframe (for the first user)
train_final.head()

,series_id,step,anglez,enmo,date,asleep
3240,038441c925bb,3240,67.175,0.015,2018-08-15,0
3241,038441c925bb,3241,68.881,0.021,2018-08-15,0
3242,038441c925bb,3242,73.114,0.034,2018-08-15,0
3243,038441c925bb,3243,73.692,0.03,2018-08-15,0
3244,038441c925bb,3244,72.685,0.02,2018-08-15,0
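
The frame above carries only the Awake/Asleep distinction. Once a non-wear model is available, its output could be combined with 'asleep' to produce the three labels listed at the top of this section. The sketch below assumes a boolean `not_worn` column, which is hypothetical and not produced anywhere in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'asleep':   [0, 1, 1, 0, 1],
    'not_worn': [False, False, True, True, False],  # hypothetical non-wear mask
})

# conditions are checked in order, so non-wear takes priority over asleep
toy['label'] = np.select(
    [toy['not_worn'], toy['asleep'] == 1],
    ['Unknown', 'Asleep'],
    default='Awake',
)

print(toy['label'].tolist())
# ['Awake', 'Asleep', 'Unknown', 'Unknown', 'Asleep']
```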


In [71]:
# export
train_final.to_csv(root + 'single_user_1.csv', index=False)