# Internet of Wands (WIP)

This notebook is part of the [*Practical Data Science for IoT*](https://github.com/pablodecm/datalab_ml_iot) tutorial by Pablo de Castro.

## Overview of Use Case

The aim of this notebook is to walk through an end-to-end
example of machine learning for a (consumer) IoT application
and to highlight the main challenges involved.

**The chosen use case is an imaginary application where smartphones
act as magic wands and we want to build a spell recognition
system, which will be referred to as the Internet of Wands (IoW).**

We will focus on how we can collect and process data
for training and evaluating a model for such an application. We will also
discuss how we could deploy it to production.


### Important Remark

Most of the discussion and technology choices that follow are
not unique to this application and could be readily applied
to use cases such as:
- Human Activity Recognition with Wearables (e.g. running, lying down, driving or sitting)
- Elderly fall/accident/emergency alert system
- Possible Consumer Applications, for example:
    - Gym Repetition Counter: identify exercise and count the reps based on wearables
    - Parkinson Disease Early Detection system
- More broadly, distributed training applications such as self-driving cars (e.g. Tesla object recognition model [[1]](#References))

## Important Aspects

We will be discussing many aspects of the data cycle in supervised ML workflows for IoT, such as:

- *Training Data Collection*:
    - What device/hardware/configuration are we going to use for a given application?
    - Which sensors and additional data are relevant?
    - Who is going to label the data, and how? Do they need training to standardise the process?
    - How much data do we need for the application?
    - Can we oversee and control the data collection process?
    - How can we make data labelling as easy as possible?
    - Can we replicate the training conditions in production?
    - **How expensive is it going to be?**

- *Training Data Transport and Processing*:
    - Where and how are we going to store the data?
    - How are we going to transfer data from the devices to our data processing center?
    - How much preprocessing are we going to do on the device (i.e. edge computing)?
    - How can we ensure security and privacy (e.g. transport encryption)?
    - Have we tested the data collection framework properly before data collection starts?
    - **What volume of data is expected to flow to the servers per unit of time? Will the infrastructure scale and be robust enough?**

- *Data analysis and model building*:
    - What do we want to do?
    - Which tools/platforms/servers are we going to use to explore the data?
    - What type of data are we studying (e.g. time-series sensor, audio, images, text, etc.)?
    - What is the dimensionality and structure of the data?
    - What are the possible factors that affect the variance of
      the data (e.g. data collection issues or changes in the
      environment)?
    - How easy will it be?
    - Which techniques are more appropriate for a given type and volume of data?
    - Can we complement the data with existing datasets or start from a pre-trained model?
    - How are we going to evaluate the performance to obtain unbiased measures?
    - **What is the right trade-off between model complexity and
    performance for the given application?**



- *Production Environment*:
    - How are we going to use the resulting model in production?
    - Can we deploy it first as a beta or internally to verify that it works as expected?
    - Where are we going to carry out the model evaluation (e.g. our own remote servers, cloud or device)?
    - Can we set up a loop for monitoring and redeploying the model in production?
    - **How much is expected to be gained by training with more data or improving the model?**
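
The data volume question raised in the list above can be made concrete with a back-of-envelope estimate. The sketch below is purely illustrative: the 50 Hz rate matches the 20 ms resampling grid used later in this notebook, while the bytes per value, the spell duration and the number of demonstrations per day are made-up assumptions.

```python
# Back-of-envelope estimate of raw sensor data volume.
# All constants are illustrative assumptions, not measured values.
SAMPLE_RATE_HZ = 50        # one reading every 20 ms
N_CHANNELS = 6             # x, y, z for accelerometer and gyroscope
BYTES_PER_VALUE = 8        # a 64-bit float per value (JSON text is larger)
SPELL_DURATION_S = 3       # assumed typical length of one demonstration
SPELLS_PER_DAY = 10_000    # assumed number of demonstrations per day


def bytes_per_spell(duration_s: float = SPELL_DURATION_S) -> int:
    """Raw payload size of one demonstration, ignoring metadata."""
    n_samples = int(duration_s * SAMPLE_RATE_HZ)
    return n_samples * N_CHANNELS * BYTES_PER_VALUE


daily_megabytes = bytes_per_spell() * SPELLS_PER_DAY / 1e6
print(f"{bytes_per_spell()} bytes per spell, {daily_megabytes:.0f} MB per day")
```

Changing the constants to the values of a real deployment gives a first idea of whether a single broker and server are likely to be enough.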

## Data Collection Infrastructure

Here is a diagram of the IoW data collection infrastructure,
which uses common IoT technologies (e.g. an MQTT broker and Node-RED):

<div align="center">
  <img src="https://raw.githubusercontent.com/pablodecm/datalab_ml_iot/master/04_internet_of_wands/images/iow_infrastructure.png" height="50%" style="max-width: 80%">
</div>
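
As a rough illustration of what travels through this pipeline, the sketch below builds the kind of JSON message a device might publish. The field names `accel_data`, `gyro_data`, `spell_select`, `device_select` and `wizard_name` are the ones read later in this notebook; the nested layout, the values and the `build_payload` helper are assumptions made for illustration, not the actual message format of the application.

```python
import json


def build_payload(accel, gyro, spell, device, wizard):
    """Serialize one spell demonstration as a JSON string (hypothetical helper)."""
    return json.dumps({
        "accel_data": accel,   # dict with timestamp, x, y, z lists
        "gyro_data": gyro,     # same layout as accel_data
        "spell_select": spell,
        "device_select": device,
        "wizard_name": wizard,
    })


# made-up readings: timestamps in milliseconds, three samples per sensor
accel = {"timestamp": [0.0, 19.8, 40.1], "x": [0.1, 0.2, 0.1],
         "y": [9.7, 9.8, 9.6], "z": [0.3, 0.2, 0.4]}
gyro = {"timestamp": [0.4, 20.3, 40.6], "x": [0.0, 0.1, 0.0],
        "y": [0.2, 0.1, 0.3], "z": [0.0, 0.0, 0.1]}

payload = build_payload(accel, gyro, "lumos", "phone", "example_wizard")
decoded = json.loads(payload)
print(sorted(decoded))
```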

## Data Collection Campaign

Go to https://iow.pablodecm.com/ with your smartphone and generate magic spell demonstrations.



## Downloading Latest Dataset

You can download the latest dataset from the GitHub repository.

In [None]:
!wget -O iow_data.zip "https://github.com/pablodecm/datalab_ml_iot/raw/refs/heads/master/04_internet_of_wands/iow_data.zip?download=" && unzip -qq -o iow_data.zip

### Loading the Data

We have to decide how we want to represent the data and also
write a custom reader for our set of JSON-based files.

In [None]:
cat iow_data/alohomora/Cedric_2579103d.01bba.json

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")

example_file = "iow_data/wingardium-leviosa/Peppapig_9b2bd7a9.0696f8.json"

example_df = pd.read_json(example_file)
print(example_df.dtypes)
example_df

In [None]:
# we can expand each list into a Series and then transpose to get a much more convenient format
accel_df = example_df.accel_data.apply(pd.Series).T
accel_df["timestamp"] = pd.to_timedelta(accel_df.timestamp, unit="ms")
accel_df

In [None]:
# timestamps of gyro and accel sensors are high precision with a common origin
gyro_df = example_df.gyro_data.apply(pd.Series).T
gyro_df["timestamp"] = pd.to_timedelta(gyro_df.timestamp, unit="ms")
gyro_df

In [None]:
# we can use the minimum of timestamp as origin to have the same reference
min_timestamp = min([accel_df.timestamp.min(),gyro_df.timestamp.min()])

accel_df["timestamp"] -= min_timestamp
gyro_df["timestamp"] -= min_timestamp
# set first element to 0 to have the same start
accel_df.loc[0,"timestamp"] = pd.Timedelta(0)
gyro_df.loc[0,"timestamp"] = pd.Timedelta(0)
accel_df = accel_df.set_index("timestamp")
gyro_df = gyro_df.set_index("timestamp")

accel_df.head()

In [None]:
# still the sampling is slightly different
gyro_df.head()

In [None]:
# we can resample to fix that
resampled_accel_df = accel_df.resample("20ms").mean().interpolate('time')
resampled_gyro_df = gyro_df.resample("20ms").mean().interpolate('time')

resampled_accel_df
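
To see what the resampling step does in isolation, here is a minimal sketch on made-up, irregularly spaced readings of a single axis. Readings that fall in the same 20 ms bin are averaged, and bins without any reading are filled by time-weighted interpolation.

```python
import pandas as pd

# made-up irregular timestamps (ms) for a single sensor axis
raw = pd.DataFrame({"timestamp": [0.0, 18.0, 41.0, 59.0, 83.0],
                    "x": [0.0, 1.0, 2.0, 3.0, 4.0]})
raw["timestamp"] = pd.to_timedelta(raw["timestamp"], unit="ms")
raw = raw.set_index("timestamp")

# average the readings that fall in each 20 ms bin, then fill the empty bins
regular = raw.resample("20ms").mean().interpolate("time")
print(regular)
```

After this step both sensors share the same regular time grid, which is what makes the index-based merge possible.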

In [None]:
# we can then finally merge
merged_df = pd.merge(resampled_accel_df, resampled_gyro_df, 
                     left_index=True, right_index=True,
                     suffixes=["_accel","_gyro"], how="inner")

merged_df

In [None]:
# we can now plot the resampled sensor data
fig, axs = plt.subplots(2, figsize=(12,12))

merged_df.filter(regex="._accel").plot(ax=axs[0])
merged_df.filter(regex="._gyro").plot(ax=axs[1])

In [None]:
# we can define a function to do all the pre-processing for a given json file

from pathlib import Path

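def load_df_from_iow_json(json_path: Path) -> tuple[pd.DataFrame, pd.Series]:
    """Return the resampled, merged sensor readings and the metadata of one spell."""
    # load json file to dataframe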
    raw_df = pd.read_json(json_path)
    
    # convert to series to avoid having object elements
    accel_df = raw_df.accel_data.apply(pd.Series).T
    accel_df["timestamp"] = pd.to_timedelta(accel_df.timestamp, unit="ms")
    gyro_df = raw_df.gyro_data.apply(pd.Series).T
    gyro_df["timestamp"] = pd.to_timedelta(gyro_df.timestamp, unit="ms")
    
    # we can use the minimum of timestamp as origin to have the same reference
    min_timestamp = min([accel_df.timestamp.min(),gyro_df.timestamp.min()])

    # set the same timestamp origin
    accel_df["timestamp"] -= min_timestamp
    gyro_df["timestamp"] -= min_timestamp
    # set first element to 0 to have the same start
    accel_df.loc[0,"timestamp"] = pd.Timedelta(0)
    gyro_df.loc[0,"timestamp"] = pd.Timedelta(0)
    accel_df = accel_df.set_index("timestamp")
    gyro_df = gyro_df.set_index("timestamp")
    
    # resample and interpolate to make homogeneous
    
    resampled_accel_df = accel_df.resample("20ms").mean().interpolate('time')
    resampled_gyro_df = gyro_df.resample("20ms").mean().interpolate('time')
    
    # merge in a single df
    merged_df = pd.merge(resampled_accel_df, resampled_gyro_df, 
                         left_index=True, right_index=True,
                         suffixes=["_accel","_gyro"], how="inner")
    
    del raw_df["accel_data"]
    del raw_df["gyro_data"]
    
    return merged_df, raw_df.loc["timestamp"].copy()


In [None]:
# check that it works on the example file
load_df_from_iow_json(example_file)[0]

In [None]:
# and apply it to all the data
md_fields = ["spell_select","device_select","wizard_name"]
data_path = Path("./iow_data")

merged_df_dict = {}
for path in data_path.glob("*/*"):
    rel_path = path.relative_to(data_path)

    try:
        merged_df, metadata = load_df_from_iow_json(path)
    except Exception:
        print(f"Error loading file {rel_path}")
        continue

    # ignore data without a selected spell
    if path.parent.name == "choose":
        continue
        
    spell_id = (rel_path.parent/rel_path.stem).as_posix()
    key = tuple(metadata.loc[md_fields].to_list() + [spell_id])
    merged_df_dict[key] = merged_df


In [None]:
# this is a dataframe with a multi-index (spell_select, device_select, wizard_name, spell_id, timestamp)
all_df = pd.concat(merged_df_dict, names = (md_fields+ ["spell_id"]))
all_df
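
The way `pd.concat` turns dictionary keys into index levels can be seen on a toy example. The sketch below uses two made-up recordings and a reduced set of key fields; all names and values are invented for illustration.

```python
import pandas as pd

# two tiny made-up "spells", keyed by (spell, wizard, spell_id)
frames = {
    ("lumos", "alice", "lumos/alice_01"): pd.DataFrame({"x_accel": [0.1, 0.2]}),
    ("reparo", "bob", "reparo/bob_01"): pd.DataFrame({"x_accel": [0.3, 0.4, 0.5]}),
}

# the dict keys become the outer index levels; the frame index becomes the last one
toy_df = pd.concat(frames, names=["spell_select", "wizard_name", "spell_id"])

# select every row of one spell, whatever the wizard or spell_id
lumos_rows = toy_df.loc[("lumos", slice(None), slice(None)), :]
print(lumos_rows)
```

The same `slice(None)` pattern is what the selections in the rest of this notebook rely on.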

## Exploratory Data Analysis

In [None]:
fig, ax = plt.subplots(1, figsize=(12,6))

var_name = "y_accel"
wizard_name = "pablodecm"
spell_name = "alohomora"

subset_df = all_df.loc[(spell_name,slice(None),wizard_name),var_name]


for group_id, group_series in subset_df.groupby("spell_id"):
    
    group_df = group_series.reset_index()
    ax.plot(group_df["timestamp"], group_df[var_name] )

ax.set_title(f"{var_name}-{spell_name}-{wizard_name}");

In [None]:
# get number of unique spell demonstrations in dataset
unique_spells = all_df.index.get_level_values("spell_id").unique()
print(f"A total of {len(unique_spells)} spell demonstrations in dataset")

In [None]:
# number of examples per spell
all_df.groupby(["spell_select","spell_id"]).count().groupby("spell_select").count()

In [None]:
fig, axs = plt.subplots(3,2, figsize=(12,12))

summary_vars = all_df.groupby(["spell_select","spell_id"]).count()
bins = np.linspace(0, 250,41)
for plot_i, col_name in enumerate(summary_vars.columns):
    ax = axs.flatten()[plot_i]
    group_by_spell = summary_vars.loc[:,col_name].groupby("spell_select")
    subplots = group_by_spell.plot.hist(ax=ax,alpha=0.8,title=col_name, bins=bins, density=True, histtype='step')
    ax.legend()

In [None]:
fig, axs = plt.subplots(3,2, figsize=(12,12))

summary_vars = all_df.groupby(["spell_select","spell_id"]).mean()

for plot_i, col_name in enumerate(summary_vars.columns):
    ax = axs.flatten()[plot_i]
    group_by_spell = summary_vars.loc[:,col_name].groupby("spell_select")
    subplots = group_by_spell.plot.density(ax=ax,alpha=0.5,title=col_name)
    ax.legend()

In [None]:
# feel free to carry out further exploratory data analysis

## Create Training and Validation Dataset

In [None]:
# clean the data a little bit
# keep only spells with more than 20 samples (i.e. lasting more than 400 ms at 20 ms sampling)
long_spells = all_df.groupby("spell_id").count() > 20
long_spells = long_spells[long_spells].dropna().index

In [None]:
# create train and validation datasets
train_subset, valid_subset = train_test_split(long_spells, shuffle=True, random_state=7, test_size=0.3)
train_df = all_df.loc[(slice(None),slice(None),slice(None),list(train_subset)),:]
valid_df = all_df.loc[(slice(None),slice(None),slice(None),list(valid_subset)),:]
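
The split above assigns individual demonstrations at random, so the same wizard can contribute to both sets. If the goal is to estimate performance on users never seen during training, one alternative (not used in this notebook) is to hold out whole wizards. A minimal sketch with scikit-learn's `GroupShuffleSplit`, on made-up identifiers:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# made-up spell ids and the wizard who cast each of them
spell_ids = np.array([f"spell_{i}" for i in range(12)])
wizards = np.array(["alice", "bob", "carol", "dave"] * 3)

# hold out 25% of the wizards (not 25% of the spells)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=7)
train_idx, valid_idx = next(splitter.split(spell_ids, groups=wizards))

train_wizards = set(wizards[train_idx])
valid_wizards = set(wizards[valid_idx])
print("train wizards:", sorted(train_wizards))
print("valid wizards:", sorted(valid_wizards))
```

Which split is more appropriate depends on whether the deployed model is expected to serve mostly known users or new ones.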

In [None]:
train_df

In [None]:
valid_df

## Create a Spell Classifier

For each sensor sequence we need a model that determines which spell was cast.
There are several ways to approach this multi-class classification problem:
- Come up with a clever rule to classify
- Manually construct tabular features for each sequence and train a traditional classifier
- Use a variable-length sequence classification model (RNN, transformer)
- Other alternative approaches (e.g. a CNN on an image representation of the data)
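
The second option, manually constructed tabular features, amounts to mapping each variable-length sequence to a vector of fixed size. A minimal NumPy sketch of that idea follows; `summarize_sequence` is a hypothetical helper and the recordings are random numbers.

```python
import numpy as np


def summarize_sequence(seq: np.ndarray) -> np.ndarray:
    """Map a (n_timesteps, n_channels) array to a fixed-length feature vector."""
    return np.concatenate([seq.mean(axis=0),         # per-channel mean
                           seq.std(axis=0, ddof=1),  # per-channel sample std
                           [seq.shape[0]]])          # number of timesteps


rng = np.random.default_rng(0)
short_spell = rng.normal(size=(30, 6))   # made-up 30-step recording
long_spell = rng.normal(size=(120, 6))   # made-up 120-step recording

features_matrix = np.stack([summarize_sequence(s) for s in (short_spell, long_spell)])
print(features_matrix.shape)
```

Both recordings end up with the same number of columns regardless of their length, so they can be fed to any standard tabular classifier.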

In [None]:
# a single example (spell) is characterized by 6 sensor recordings and a variable number of timesteps
valid_df.loc[(slice(None),slice(None),slice(None),valid_df.index[0][3])].reset_index(drop=True).plot()

In [None]:
# the raw features are the following; the problem is that the number of timesteps is variable
features = [ f"{i}_accel" for i in ["x", "y","z"]] + [ f"{i}_gyro" for i in ["x", "y","z"]]
features

In [None]:
# as a first approach we can use the mean of each signal over the spell as a feature
train_df_mean = train_df.groupby("spell_id")[features].mean().add_suffix("_mean")
valid_df_mean = valid_df.groupby("spell_id")[features].mean().add_suffix("_mean")

In [None]:
# we can optionally complement with the standard deviation as features
train_df_std = train_df.groupby("spell_id")[features].std().add_suffix("_std")
valid_df_std = valid_df.groupby("spell_id")[features].std().add_suffix("_std")


In [None]:
# the number of timesteps of each spell is also used as a feature
# (a named Series, so that the concatenated column is called "length")
train_df_length = train_df.groupby("spell_id").size().rename("length")
valid_df_length = valid_df.groupby("spell_id").size().rename("length")


In [None]:
# example of features
train_df_std.head()

In [None]:
# we can create a simplified training dataset by concatenating all of them
train_df_extra = pd.concat([train_df_mean,train_df_std, train_df_length], axis=1)
valid_df_extra = pd.concat([valid_df_mean,valid_df_std, valid_df_length], axis=1)

In [None]:
# the training data will look like this
train_df_extra.head()

In [None]:
# we can also obtain the category label from the spell_id index
label_assign = { "alohomora" : 0, "lumos" : 1, "wingardium-leviosa" : 2, "reparo" : 3}
train_y =  train_df_extra.reset_index().spell_id.str.split("/").apply(lambda x: label_assign[x[0]])
valid_y =  valid_df_extra.reset_index().spell_id.str.split("/").apply(lambda x: label_assign[x[0]])

In [None]:
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier()


In [None]:
# we can train the classifier
gb_clf.fit(train_df_extra, train_y)

In [None]:
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.metrics import classification_report

In [None]:
# we can get probability predictions
gb_clf.predict_proba(train_df_extra)

In [None]:
# and compute some metrics on the training dataset (not to be trusted)
# note: [:, 1] keeps only the probability of the second class in gb_clf.classes_
y_train_clf_proba = gb_clf.predict_proba(train_df_extra)[:, 1]
y_train_clf_pred = gb_clf.predict(train_df_extra)

print("Confusion Matrix:")
print(confusion_matrix(train_y,y_train_clf_pred))
print("Gradient Boosting Classifier Accuracy: "+"{:.1%}".format(accuracy_score(train_y,y_train_clf_pred)));
print("Classification Report:")
print(classification_report(train_y,y_train_clf_pred))

In [None]:
y_valid_clf_proba = gb_clf.predict_proba(valid_df_extra)[:, 1]
y_valid_clf_pred = gb_clf.predict(valid_df_extra)

print("Confusion Matrix:")
print(confusion_matrix(valid_y,y_valid_clf_pred))
print("Gradient Boosting Classifier Accuracy: "+"{:.1%}".format(accuracy_score(valid_y,y_valid_clf_pred)));
print("Classification Report:")
print(classification_report(valid_y,y_valid_clf_pred))
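
Since `roc_auc_score` is imported above, a multi-class ROC AUC could complement the accuracy. The sketch below is an assumption about how one might compute it, shown on synthetic data from `make_classification` rather than on the spell dataset; it passes the full probability matrix and uses the one-vs-rest scheme.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic 4-class problem standing in for the spell features
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          random_state=7, stratify=y)

clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)

# one column of probabilities per class, as required for the multi-class AUC
proba = clf.predict_proba(X_va)
auc_ovr = roc_auc_score(y_va, proba, multi_class="ovr")
print(f"one-vs-rest ROC AUC: {auc_ovr:.3f}")
```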

In [None]:
# we can try a grid search to look for a better hyper-parameter combination
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier()

# 3-fold cross-validation on the training set
cv = KFold(3)

param_grid = { "n_estimators" : [100, 130, 150, 180, 200],
               "learning_rate" :  [0.05, .1, 0.07]
             }



optimized_gb_clf = GridSearchCV(estimator=gb_clf,
                            cv = cv,
                            param_grid=param_grid,
                            verbose = 1,
                            n_jobs = -1)

# after the search, the best model is refit on the full training set
optimized_gb_clf.fit(train_df_extra, train_y)

In [None]:
# compute again the metrics on the validation set
y_valid_clf_proba = optimized_gb_clf.predict_proba(valid_df_extra)[:, 1]
y_valid_clf_pred = optimized_gb_clf.predict(valid_df_extra)

print("Confusion Matrix:")
print(confusion_matrix(valid_y,y_valid_clf_pred))
print("Gradient Boosting Classifier Accuracy: "+"{:.1%}".format(accuracy_score(valid_y,y_valid_clf_pred)));
print("Classification Report:")
print(classification_report(valid_y,y_valid_clf_pred))

In [None]:
import joblib

model = optimized_gb_clf.best_estimator_
joblib.dump(model,"optimized_gb_clf.joblib")
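
The saved file can later be loaded by whichever process serves predictions. A minimal round-trip sketch on a toy model follows; the data and the file name are made up, and in practice the loading side should run a scikit-learn version compatible with the one used for training.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# tiny made-up dataset standing in for the tabular spell features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)

clf = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_model.joblib"
    joblib.dump(clf, model_path)
    restored = joblib.load(model_path)   # what a serving process would do

same_predictions = bool((restored.predict(X) == clf.predict(X)).all())
print("restored model gives identical predictions:", same_predictions)
```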

In [None]:
valid_y != y_valid_clf_pred

In [None]:
# it is possible that some of the mistakes are due to
# incorrect data, we could potentially explore some of the misclassified examples
wrongly_classified_ids = (valid_df_extra.loc[(valid_y != y_valid_clf_pred).values]).index
wrongly_classified_ids

In [None]:
wrongly_classified_ids[0]

In [None]:
# we could check a few of the incorrectly classified examples to see what they look like
spell_id = wrongly_classified_ids[10]
valid_df.loc[(slice(None),slice(None),slice(None), spell_id)].reset_index(drop=True).plot()

In [None]:
# TODO: train a random forest model or other models with sklearn
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()

In [None]:
# feel free to carry out additional work and/or train additional models

## References

About high-resolution time in web applications:

https://www.w3.org/TR/hr-time-2/#dfn-time-origin


For an overview of a state-of-the-art distributed training
infrastructure including redeployment and the importance
of edge in real-time applications you can check the Tesla Autonomy day presentation:

- [1] [*Tesla Autonomy Day*](https://www.youtube.com/watch?v=Ucp0TTmvqOE) YouTube video (2+ hours), or the newer Tesla AI Day

There are several publications using combinations of RNNs and CNNs
for dealing with IoT sensor data, for tasks such as
Human Activity Recognition:
- [2] Yao, Shuochao, et al. [*Deepsense: A unified deep learning framework for time-series mobile sensing data processing*](https://arxiv.org/abs/1611.01942) Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

Many more recent works use transformers and/or semi-supervised approaches.