# Internet of Wands (WIP)

This notebook is part of the [*Practical Data Science for IoT*](https://github.com/pablodecm/datalab_ml_iot) tutorial by Pablo de Castro.

## Overview of Use Case

The aim of this notebook is to walk through an end-to-end
example of machine learning for a (consumer) IoT application
and to point out the main challenges involved.

**The chosen use case is an imaginary application where smartphones
act as magic wands and we want to build a spell recognition
system, which will be referred to as the Internet of Wands (IoW).**

We will focus on how we can collect and process data
for training and evaluating a model for such an application. We will also
discuss how we could deploy the model to production.


### Important Remark

Most of the discussion and technology choices that follow are
not unique to this application and could be readily applied
to use cases such as:
- Human Activity Recognition with Wearables (e.g. running, lying down, driving or sitting)
- Elderly fall/accident/emergency alert system
- Possible Consumer Applications, for example:
    - Gym Repetition Counter: identify exercise and count the reps based on wearables
    - Parkinson's disease early detection system
- Also broadly related to distributed training applications such as self-driving cars (e.g. Tesla's object recognition model [[1]](#References))


## Important Aspects

We will be discussing many aspects of the data cycle in supervised ML workflows for IoT, such as:

- *Training Data Collection*:
    - What device/hardware/configuration are we going to use for a given application?
    - Which sensors and additional data are relevant?
    - Who is going to label the data, and how? Do they need training to standardise the process?
    - How much data do we need for the application?
    - Can we oversee and control the data collection process?
    - How can we make data labelling as easy as possible?
    - Can we replicate the training conditions in production?
    - **How expensive is it going to be?**

- *Training Data Transport and Processing*:
    - Where and how are we going to store the data?
    - How are we going to transfer data from the devices to our data processing center?
    - How much preprocessing are we going to do on the device (i.e. edge computing)?
    - How can we ensure security and privacy (e.g. transport encryption)?
    - Have we tested the data collection framework properly before data collection starts?
    - **What volume of data is expected to flow to the servers per unit of time? Will the infrastructure scale and be robust enough?**

- *Data analysis and model building*:
    - What do we want to do?
    - Which tools/platforms/servers are we going to use to explore the data?
    - What type of data are we studying (e.g. time-series sensor, audio, images, text, etc)?
    - What is the dimensionality and structure of the data?
    - What are the possible factors that affect the variance of
      the data (e.g. data collection issues or changes in the
      environment)?
    - How easy will it be?
    - Which techniques are most appropriate for a given type and volume of data?
    - Can we complement our data with existing datasets or start from a pre-trained model?
    - How are we going to evaluate performance so that the measures are unbiased?
    - **What is the right trade-off between model complexity and
    performance for the given application?**



- *Production Environment*:
    - How are we going to use the resulting model in production?
    - Can we deploy it first as a beta or internally to verify that it works as expected?
    - Where are we going to carry out the model evaluation (e.g. our own remote servers, the cloud, or the device)?
    - Can we set up a loop for monitoring and redeploying the model in production?
    - **How much is expected to be gained by training with more data or improving the model?**
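
The data volume question in the list above can be approached with a back-of-the-envelope estimate before any infrastructure is built. The following is a minimal sketch: the function name `estimate_upload_rate` and every figure passed to it (number of devices, sampling rate, number of sensor channels, bytes per value, fraction of devices active at once) are hypothetical and are not taken from the IoW setup.

```python
def estimate_upload_rate(n_devices, sample_rate_hz, n_channels,
                         bytes_per_value, active_fraction=1.0):
    """Return the expected average upload rate in bytes per second.

    Counts raw sensor values only; a text format such as JSON adds
    overhead on top of this, so treat the result as a lower bound.
    """
    per_device = sample_rate_hz * n_channels * bytes_per_value
    return n_devices * active_fraction * per_device


# Hypothetical figures, chosen only for illustration
rate = estimate_upload_rate(
    n_devices=1000,       # devices enrolled in the data collection
    sample_rate_hz=50,    # sensor readings per second
    n_channels=6,         # e.g. 3 accelerometer + 3 gyroscope axes
    bytes_per_value=8,    # one 64-bit float per reading
    active_fraction=0.1,  # share of devices recording at any moment
)
print(f"{rate / 1e6:.2f} MB/s, {rate * 86400 / 1e9:.1f} GB/day")
```

Swapping in measured values for a real deployment gives a first indication of whether the broker and storage need to be sized for kilobytes or gigabytes per day.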

## Data Collection Infrastructure

Here is a diagram of the chosen data collection infrastructure,
which uses common IoT technologies (e.g. an MQTT broker and Node-RED):

<div align="center">
  <img src="https://raw.githubusercontent.com/pablodecm/datalab_ml_iot/master/04_internet_of_wands/images/iow_infrastructure.png" height="50%" style="max-width: 80%">
</div>

## Downloading Latest Dataset

You can find a list of the latest raw datasets, compressed in zip
format, at:
- https://iow.pablodecm.com/iow_data_zips/


In [None]:
!LATEST="iow_data_09-07-05-09-2019.zip"; wget "https://iow.pablodecm.com/iow_data_zips/"$LATEST; unzip -o $LATEST

### Loading the Data

We have to decide how we want to represent the data and also
work on a custom reader for our set of JSON-based files.

In [None]:
import pandas as pd
import numpy as np

example_file = "iow_data/wingardium-leviosa/Peppapig_9b2bd7a9.0696f8.json"

example_df = pd.read_json(example_file)
print(example_df.dtypes)
example_df

In [None]:
# IMPORTANT for dealing with times later:
# JavaScript timestamps are in milliseconds, so divide by 1000.0
# to get the timestamp in seconds
(example_df["start_timestamp"]/1000.0).apply(pd.Timestamp.fromtimestamp)
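
As an alternative to dividing by 1000 and applying `pd.Timestamp.fromtimestamp` row by row, pandas can convert a whole column of millisecond timestamps in one call. The sketch below uses two made-up values in place of the real `start_timestamp` column. Note that it yields timezone-aware UTC times, whereas `fromtimestamp` without a timezone argument returns naive local times, so the two approaches are not interchangeable without care.

```python
import pandas as pd

# Hypothetical millisecond timestamps standing in for "start_timestamp"
ms = pd.Series([1567670400000, 1567670400250], name="start_timestamp")

# unit="ms" interprets the integers as milliseconds since the Unix epoch;
# utc=True makes the result timezone-aware instead of naive
utc_times = pd.to_datetime(ms, unit="ms", utc=True)

# Seconds elapsed since the first sample, often handier than absolute times
elapsed = (utc_times - utc_times.iloc[0]).dt.total_seconds()

print(utc_times)
print(elapsed)
```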

In [None]:
import json

def read_json_file(file_name):
    with open(file_name) as f:
        json_dict = json.load(f)
    return json_dict

read_json_file(example_file)
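
A custom reader also needs to know which files exist and which spell each one belongs to. The sketch below assumes that the dataset follows the `<data_dir>/<spell>/<file>.json` layout suggested by the example path above, and that the parent directory name can serve as the label. The function name `index_recordings` and the column names are hypothetical.

```python
from pathlib import Path

import pandas as pd


def index_recordings(data_dir):
    """Build a table with one row per JSON file found under data_dir.

    Assumes a <data_dir>/<spell>/<file>.json layout, so the name of
    the parent directory is used as the label of each file.
    """
    rows = []
    for path in sorted(Path(data_dir).glob("*/*.json")):
        rows.append({"spell": path.parent.name, "path": str(path)})
    # Passing columns explicitly keeps the schema stable even when
    # no files are found
    return pd.DataFrame(rows, columns=["spell", "path"])
```

Calling `index_recordings("iow_data")` after unzipping the dataset and then `value_counts()` on the `spell` column would show how many recordings are available per label, and each `path` can be passed to `read_json_file`.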

About high-resolution time in web applications:

https://www.w3.org/TR/hr-time-2/#dfn-time-origin

## References

For an overview of a state-of-the-art distributed training
infrastructure, including redeployment and the importance
of edge computing in real-time applications, see the Tesla Autonomy Day presentation:

- [1] [*Tesla Autonomy Day*](https://www.youtube.com/watch?v=Ucp0TTmvqOE). YouTube video (over 2 hours).

There are several publications that use combinations of RNNs and CNNs
to deal with IoT sensor data, for tasks such as
Human Activity Recognition:
- [2] Yao, Shuochao, et al. [*Deepsense: A unified deep learning framework for time-series mobile sensing data processing*](https://arxiv.org/abs/1611.01942) Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.