# Aggregate Watch Data Classification Project

The goal of this project is to investigate and use the data collected from a personal smartwatch to provide daily workout recommendations. Using the data collected from a Withings watch, we want to predict whether or not a person will have a successful workout on a given day. Providing this insight to users in the morning could help them structure their day or provide the necessary motivation to turn a workout routine into a workout habit. We will investigate what a "successful workout" means, as well as which data provides insight into workout performance on the following day.

As an initial analysis, we use the data that has already been aggregated by day to determine whether it is a good predictor of a workout the following day. The sleep data will also be organized and cleaned to provide additional insights.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Importing the Data

In this case, we are only looking at the data that has already been aggregated by day, not at every file included in the watch_data folder.

In [None]:
distance = pd.read_csv("../watch_data/aggregates_distance.csv", header=0)

passive_calories = pd.read_csv("../watch_data/aggregates_calories_passive.csv", header=0)
active_calories = pd.read_csv("../watch_data/aggregates_calories_earned.csv", header=0)

steps = pd.read_csv("../watch_data/aggregates_steps.csv", header=0)

sleep_data = pd.read_csv("../watch_data/sleep.csv", header=0)

workouts = pd.read_csv("../watch_data/activities.csv", header=0)

# Aggregate if I want to do same operation on all DataFrames
datasets = [distance, passive_calories, active_calories, steps, sleep_data, workouts]

## Data Preparation/Cleaning

Now that we have imported all of the relevant data, we need to clean it and prepare it for model fitting.

In [None]:
def convert_to_datetime(df: pd.DataFrame):
    """Convert the date columns of df to datetime types in place."""
    # infer_datetime_format is deprecated since pandas 2.0 and is no longer needed
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'])
    elif 'from' in df.columns:
        # Parse both endpoints as timezone-aware UTC timestamps
        df['from'] = pd.to_datetime(df['from'], utc=True)
        df['to'] = pd.to_datetime(df['to'], utc=True)
    else:
        # Print the column names rather than the whole DataFrame
        print('No columns defined as date/from/to in {}'.format(list(df.columns)))


# Start with the simple files, check for nans and turn the date columns into datetime types
for df in datasets:
    convert_to_datetime(df)
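
The datasets can also be checked for missing values at this stage. A minimal sketch of such a check, using a small made-up frame in place of the real CSV files (the `report_missing` helper and the toy values are hypothetical, not part of the original notebook):

```python
import numpy as np
import pandas as pd

def report_missing(df: pd.DataFrame) -> pd.Series:
    """Return the NaN count per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]

# Toy stand-in for one of the aggregate files
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
    "value": [1200.0, np.nan, 900.0],
})

missing = report_missing(toy)
print(missing)
```

Running the same helper over each DataFrame in `datasets` would show which columns need attention before fitting.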

In [None]:
distance.rename(columns={'value':'distance'}, inplace=True)
passive_calories.rename(columns={'value':'passive calories'}, inplace=True)
active_calories.rename(columns={'value':'active calories'}, inplace=True)
steps.rename(columns={'value':'steps'}, inplace=True)

### Now we can focus on the "workouts" dataset, which requires more work to get into a usable format. We first drop the columns that contain only NaN values, as they provide no additional information.

In [None]:
workouts.drop(['from (manual)', 'to (manual)','GPS', 'Modified'], axis=1, inplace=True)

In [None]:
workouts.head()

### It's clear that a lot of information is located in the "Data" column of the workouts DataFrame and needs to be unpacked. Many of the entries in the Data column are empty braces (`{}`), which we also need to investigate.

In [None]:
# Count how many times the Data column holds the empty braces '{}'
(workouts['Data'] == '{}').sum()

In [None]:
# Maybe the "Other" workout types result in this output, however there are only 114 "Other" workouts registered
workouts['Activity type'].value_counts()

In [None]:
# Show which activity types have the empty braces in the Data column
workouts[workouts['Data'] == '{}']['Activity type'].value_counts()

### The empty-brace rows are spread somewhat randomly across activity types. Let's unpack the Data column to see what it contains for the different workouts

In [None]:
import ast
ast.literal_eval(workouts['Data'].iloc[0])

In [None]:
# Here we see that 'effduration' is simply the duration of the workout in seconds
(workouts['to'].iloc[0] - workouts['from'].iloc[0]).total_seconds()
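
To look at more than one row at a time, the parsed dictionaries can be expanded into columns. This is only a sketch on invented rows: `effduration` is the one key taken from the notebook, while `calories` and `steps` are assumed keys used for illustration.

```python
import ast
import pandas as pd

# Toy rows shaped like the 'Data' column; only 'effduration' comes from the
# notebook, the other keys are invented for illustration.
toy = pd.DataFrame({
    "Activity type": ["Walking", "Running", "Other"],
    "Data": [
        "{'effduration': 1800, 'calories': 150}",
        "{'effduration': 2400, 'calories': 320, 'steps': 4100}",
        "{}",
    ],
})

# Parse each string into a dict, then expand the dicts into columns.
parsed = toy["Data"].apply(ast.literal_eval)
unpacked = pd.DataFrame(parsed.tolist(), index=toy.index)
print(unpacked)
```

Keys that are missing for a given workout, including every key of an empty `{}` row, simply show up as NaN.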

### The 'Data' column changes for different workout types, so before diving into it too heavily we can simply compute the workout duration and save it as a new column

In [None]:
# Create a column holding the workout duration in seconds (a float, not a Timedelta)
workouts['Duration'] = (workouts['to'] - workouts['from']) / np.timedelta64(1, 's')
workouts['Duration']

### Here it becomes clearer that the workouts with the empty brackets are simply duplicates of the previous workouts without the data, so we can remove these rows from the dataset

In [None]:
workouts = workouts[workouts['Data'] != '{}']

### Now let's make a few changes to simplify the workouts data.
1. Replace the "from" and "to" columns with one date that indicates the date of the workout in addition to the "Duration" column
2. Remove the "Data" column (further investigation at a later point)

Care is needed because there may be multiple workouts in one day. If multiple workouts occur on the same day, their durations will be summed into a single value for that day. We will also drop the activity type and timezone and simply find the total duration for each day on which a workout was completed.

In [None]:
basicworkouts = workouts.copy()
basicworkouts['date'] = pd.to_datetime(workouts['from'].dt.date)
basicworkouts.drop(['from', 'to', 'Data', 'Timezone', 'Activity type'], axis=1, inplace=True)
basicworkouts.set_index('date', inplace=True)

In [None]:
# Sum the workout duration on days with more than one workout, then consider the day's workout effective if it lasts longer than the 30 minutes (1800 s) recommended by the CDC
basicworkouts = basicworkouts.groupby(['date']).sum()
basicworkouts['Effective Workout'] = basicworkouts['Duration'] > 1800
basicworkouts['Effective Workout'].value_counts()
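
As a small illustration of the aggregation and labelling step, here is the same logic on three made-up workouts (the dates and durations are invented):

```python
import pandas as pd

# Two workouts on the first day, one on the second; durations in seconds
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-03-01", "2022-03-01", "2022-03-02"]),
    "Duration": [1200.0, 900.0, 1500.0],
})

# Sum durations per day, then label days with more than 30 minutes in total
daily = toy.groupby("date").sum()
daily["Effective Workout"] = daily["Duration"] > 1800
print(daily)
```

Note that two short workouts on the same day count as one effective workout under this definition, because the threshold is applied to the daily total.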

## Next, we need to clean the sleep_data file

In [None]:
sleep_data.head()

In [None]:
sleep_data.describe()

### It's clear from the sleep data that REM sleep is not recorded, as well as snoring, snoring episodes, and night events. These columns can be dropped. Additionally, "from" and "to" columns are not necessary as we simply want the date that the sleep occurred that corresponds with a workout later that day

In [None]:
sleep_data.drop(['rem (s)', 'Snoring (s)', 'Snoring episodes', 'Night events'], axis=1, inplace=True)

In [None]:
sleep_data['date'] = pd.to_datetime(sleep_data["from"].dt.date)

In [None]:
# If we look at the values for the dates, we see that we have one day in which there are 2 recorded sleeps! Let's investigate this date
sleep_data['date'].value_counts()

In [None]:
sleep_data[sleep_data['date'] == '20220127']

### After investigating, it's clear that the duplicate was due to a long nap I took on vacation :) I will remove this sleep record from the dataset in this case. In other applications, it may become necessary to add some outlier detection to flag sleep events that occur outside of the usual time.

In [None]:
# Drop the nap record found above (row label 499)
sleep_data.drop(499, inplace=True)
sleep_data.drop(['from', 'to'], axis=1, inplace=True)
sleep_data.set_index('date', inplace=True)
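
Dropping the nap by its row label works for this dataset but would not carry over to new data. One simple alternative, which is not the outlier detection mentioned above but a cheaper stand-in, is to keep only the longest sleep record per date. The sketch below uses invented timestamps and a hypothetical `length_s` column:

```python
import pandas as pd

# Invented records: a normal night, a daytime nap, and another night
toy = pd.DataFrame({
    "from": pd.to_datetime(
        ["2022-01-26 23:10", "2022-01-27 14:00", "2022-01-27 23:30"], utc=True),
    "to": pd.to_datetime(
        ["2022-01-27 07:00", "2022-01-27 16:30", "2022-01-28 06:45"], utc=True),
})

# Same date convention as the notebook: the date on which the sleep started
toy["date"] = pd.to_datetime(toy["from"].dt.date)
toy["length_s"] = (toy["to"] - toy["from"]).dt.total_seconds()

# For each date, keep the row with the longest sleep
longest = toy.loc[toy.groupby("date")["length_s"].idxmax()].set_index("date")
print(longest)
```

Here the nap on 2022-01-27 is discarded because the night that started on the same date is longer.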

In [None]:
# Cleaned sleep data
sleep_data

## Now that the more complicated datasets have been cleaned, let's quickly clean the distance, passive_calories, active_calories, and steps sets

In [None]:
distance.set_index('date', inplace=True)
passive_calories.set_index('date', inplace=True)
active_calories.set_index('date', inplace=True)
steps.set_index('date', inplace=True)

In [None]:
distance.info()

In [None]:
basicworkouts.info()

## Now, let's join the datasets together on the date

In [None]:
X_data = pd.merge(distance, passive_calories, how='inner', on='date')
X_data = pd.merge(X_data, active_calories, how='inner', on='date')
X_data = pd.merge(X_data, steps, how='inner', on='date')
X_data = pd.merge(X_data, sleep_data, how='inner', on='date')
X_data = pd.merge(X_data, basicworkouts, how='inner', on='date')
X_data
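
Because every merge is an inner join, a date survives only if it appears in all of the tables. The toy example below shows the effect with a similar chain built from `DataFrame.join` and `functools.reduce`; the frames and values are invented, and joining on the index is an assumption that matches the date-indexed tables above.

```python
from functools import reduce
import pandas as pd

idx_a = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]).rename("date")
idx_b = pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04"]).rename("date")
idx_c = pd.to_datetime(["2022-01-03", "2022-01-04"]).rename("date")

frames = [
    pd.DataFrame({"steps": [100, 200, 300]}, index=idx_a),
    pd.DataFrame({"distance": [1.0, 2.0, 3.0]}, index=idx_b),
    pd.DataFrame({"Duration": [1900.0, 1200.0]}, index=idx_c),
]

# Chain inner joins on the shared date index
merged = reduce(lambda left, right: left.join(right, how="inner"), frames)
print(merged)
```

Only 2022-01-03 is present in all three toy frames, so it is the only row left. Comparing `len(X_data)` with the lengths of the input tables shows how many days the real join discards.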


# 2. Exploratory Data Analysis

### Now that the data has been cleaned to a usable format, we can briefly explore the data before applying different ML techniques. It should be noted that this data will only include the days in which data exists for each of the previous datasets, i.e. days in which I had a workout, recorded my sleep, and other data exists (recorded automatically every day)

In [None]:
fig, axes = plt.subplots(figsize=(12, 6))
axes.scatter(X_data.index, X_data['steps'])
axes.set_xlabel('Date')
axes.set_ylabel('Steps')

axes.set_title("My Steps on Workout Days Over Time")

In [None]:
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
steps_by_day = X_data.groupby(X_data.index.day_name()).mean().reindex(cats) 

In [None]:
fig, axes = plt.subplots(figsize=(12, 6))
axes.bar(steps_by_day.index, steps_by_day['steps'])
axes.set_xlabel('Day of the Week')
axes.set_ylabel('Steps')

axes.set_title("Average Steps on Workout Days by Day of the Week")

In [None]:
X_data.info()

In [None]:
X_data.hist(bins=50, figsize=(20,15))

# 3. Feature Exploration/Engineering
## Many of the techniques have been taken from the text "Hands-On Machine Learning"

In [None]:
corr_mat = X_data.corr()
corr_mat['Effective Workout'].sort_values(ascending=False)

## After looking at the correlations, it's clear that some of the features should not be used to predict whether or not someone will have an effective workout. It makes more sense to base the prediction on the sleep data or on data from the previous day, since both are available on the morning of the workout. The following steps are therefore taken to modify the data:

1. The Duration, distance, active calories, steps, and passive calories columns for the current day will be removed, because they are recorded on the same day as the workout and would leak information about the outcome into the features
2. The distance, active/passive calories, and steps from the previous day will be added as possible influences on whether or not the workout is effective
3. The total time asleep is added as a feature

In [None]:
# Remove the same-day variables that overlap with the workout itself
x_test = X_data.drop(['Duration', 'distance', 'active calories', 'steps', 'passive calories'], axis=1)

In [None]:
# Add total time asleep as a column
x_test['total sleep'] = X_data['light (s)'] + X_data['deep (s)']

In [None]:
from datetime import timedelta

# Dates of the day before each row in x_test
prev_day_index = x_test.index - timedelta(days=1)
x_test['prevday steps'] = steps['steps'].loc[prev_day_index].values
x_test['prevday active cals'] = active_calories['active calories'].loc[prev_day_index].values
x_test['prevday passive cals'] = passive_calories['passive calories'].loc[prev_day_index].values
x_test['prevday distance'] = distance['distance'].loc[prev_day_index].values
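
Looking up the previous day with `.loc` raises a `KeyError` if any of those dates is missing from the aggregate table. A hedged alternative is `reindex`, which returns NaN for missing dates so they can be inspected or dropped. The series and dates below are invented for illustration:

```python
import pandas as pd

# Toy daily steps with a gap on 2022-05-03
steps_toy = pd.Series(
    [5000, 7000, 6500],
    index=pd.to_datetime(["2022-05-01", "2022-05-02", "2022-05-04"]),
    name="steps",
)
workout_days = pd.to_datetime(["2022-05-02", "2022-05-03", "2022-05-04"])

prev_day = workout_days - pd.Timedelta(days=1)
# reindex yields NaN for dates that are not in the series instead of raising
prev_steps = steps_toy.reindex(prev_day).to_numpy()

features = pd.DataFrame({"prevday steps": prev_steps}, index=workout_days)
print(features)
```

Rows with a NaN in a previous-day column would then need to be dropped or imputed before scaling and fitting.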


In [None]:
corr_mat = x_test.corr()
corr_mat['Effective Workout'].sort_values(ascending=False)

### Now we can plot the correlation matrix of all the data we will use to predict an effective workout.

From this data, the main correlations with "Effective Workout" come from the sleep data, with the highest correlation related to light sleep.

In [None]:
corr_mat.style.background_gradient(cmap='coolwarm')

In [None]:
x_test.info()

## Prepping Data for Fitting Classifiers

In [None]:
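# Getting a test set (note: x_test is still the full table here and is reassigned to the test features in the next cell)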
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(x_test, test_size=0.2, random_state=42)

In [None]:
y_train = train_set['Effective Workout'].values.astype(int)
x_train = train_set.drop(['Effective Workout'], axis=1)

y_test = test_set['Effective Workout'].values.astype(int)
x_test = test_set.drop(['Effective Workout'], axis=1)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])
# Fit the scaler on the training data only, then reuse the fitted scaler on the test data
x_train_prepared = num_pipeline.fit_transform(x_train)
x_test_prepared  = num_pipeline.transform(x_test)

## Stochastic Gradient Descent classifier

In [None]:
# Training the classifier in the standard way
from sklearn.linear_model import SGDClassifier 

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(x_train_prepared, y_train)

In [None]:
# Getting the cross-validation score
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, x_train_prepared, y_train, cv=3, scoring="accuracy")
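
Because the scaler was fitted on the full training set before cross-validation, each validation fold has already influenced the scaling. One way to avoid this, sketched here on synthetic data rather than the watch data, is to put the scaler and the classifier in a single pipeline so that the scaler is refitted inside every fold:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and labels
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

# The scaler is fitted on the training part of each fold only
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))
scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
print(scores)
```

With standardization the difference is usually small, but the pipeline form keeps the evaluation free of this kind of leakage by construction.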

In [None]:
# Getting the confusion matrix
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

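# Cross-validated class predictions, used for the confusion matrix and the precision/recall/F1 scores
y_train_pred = cross_val_predict(sgd_clf, x_train_prepared, y_train, cv=3)

# Cross-validated decision scores, used for the precision-recall and ROC curves
y_scores = cross_val_predict(sgd_clf, x_train_prepared, y_train, cv=3, method="decision_function")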

In [None]:
# The confusion matrix needs class predictions, not the continuous decision scores
confusion_matrix(y_train, y_train_pred)

### Looking at the results we see:

1. Our classifier makes a similar number of Type I and Type II errors (false positives and false negatives), so the resulting precision and recall are very similar. This may mean that our SGD classifier can do no better than what is shown here without trading off precision against recall
2. The average cross-validation accuracy of around 0.63 is somewhat better than random chance, but it doesn't tell the full story because the precision and recall are noticeably higher than the accuracy
3. Precision: 0.763
4. Recall   : 0.756
5. F1 score : 0.7596
6. The classifier is not good at predicting true negatives, i.e. cases when a poor workout is expected. This could mean that we need to train a different model, or that the input features are simply not good predictors of whether or not a workout will be good
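
As a quick consistency check on the numbers above, the F1 score is the harmonic mean of precision and recall. Recomputing it from the rounded precision and recall in the list:

```python
# Rounded values taken from the list above
precision = 0.763
recall = 0.756

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

The result agrees with the reported 0.7596 up to the rounding of the inputs.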

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

precision_score(y_train, y_train_pred)

In [None]:
recall_score(y_train, y_train_pred)

In [None]:
f1_score(y_train, y_train_pred)

In [None]:
# What would the accuracy be if we simply guessed a good workout every time?
# (this equals the fraction of positive labels in y_train)
np.mean(y_train == 1)
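
The same always-positive baseline can also be expressed with scikit-learn's `DummyClassifier`, which makes it easy to pass through the same cross-validation code as the real models. A sketch on made-up labels (the label vector is invented; the features are ignored by this strategy):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels: 6 effective workouts out of 8
y_toy = np.array([1, 1, 1, 0, 1, 0, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # placeholder features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any classifier worth keeping should beat this baseline accuracy on the real labels.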

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
plt.show()



### We can look at the ROC curve to see the performance of our SGD classifier

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train, y_scores)



def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])                                    
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16) 
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)    
    plt.grid(True)                                            

plt.figure(figsize=(8, 6))                                    
plot_roc_curve(fpr, tpr)
plt.show()


In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train, y_scores)

## From this initial investigation, the current classifier performs poorly. This could be due to a number of factors:
1. Small dataset (more data may separate bad workouts from good workouts more clearly)
2. Class imbalance (75% are effective workouts and 25% are not)
3. Incorrect label definition (perhaps a 30-minute workout is not the sure-fire measure of success that was expected)
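
For the class imbalance in particular, one option to try is reweighting the classes during training. The sketch below uses synthetic data with a roughly 75/25 split, not the watch data, and `class_weight="balanced"` is just one possible setting rather than a tuned choice:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 75% positive labels
rng = np.random.default_rng(0)
n = 200
y = (rng.random(n) < 0.75).astype(int)
X = rng.normal(size=(n, 3)) + 0.8 * y[:, None]

# 'balanced' weights each class inversely to its frequency
weighted_clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(class_weight="balanced", random_state=42),
)
y_pred = cross_val_predict(weighted_clf, X, y, cv=3)
cm = confusion_matrix(y, y_pred, labels=[0, 1])
print(cm)
```

Comparing this confusion matrix with the unweighted one would show whether the minority class (poor workouts) is recovered more often, typically at some cost in precision on the majority class.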

## Next, other binary classifiers will be tested and then further analysis will be done