# Demystifying Machine Learning Demo Session

Peter Flach and Niall Twomey

Tuesday, 5th of December 2017

## Typical Machine Learning Pipeline

The data must first be arranged into a matrix of observations `X` and a vector of labels `y`.
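
As a minimal sketch of this target format (the numbers and label names below are made up purely for illustration), `X` holds one row per observation and one column per feature, while `y` holds one label per row:

```python
import numpy as np

# Hypothetical toy data: 4 observations with 3 features each.
X = np.array([
    [1.0, 0.0, 3.5],
    [0.0, 2.0, 1.0],
    [4.0, 1.0, 0.0],
    [2.0, 2.0, 2.0],
])

# One (made-up) activity label per observation.
y = np.array(['cook', 'sleep', 'cook', 'sleep'])

# The number of rows in X must match the number of labels in y.
print(X.shape, y.shape)
```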

![Machine Learning Pipeline](ml-pipeline.jpg)

## CASAS Dataset

This notebook considers the CASAS dataset. This is a dataset collected in a smart environment. As participants interact with the house, sensors record their interactions. There are a number of different sensor types including motion, door contact, light, temperature, water flow, etc.

This notebook goes through a number of common issues that arise in data science and machine learning pipelines when working with real data, namely issues relating to dates, sensor values, and so on. These are dealt with consistently using the functionality provided by the pandas library.

The objective is to fix all errors (if we can), and then to convert the timeseries data to a form that would be recognisable by a machine learning algorithm. I have attempted to comment my code where possible to explain my thought processes. At several points in this script I could have taken shortcuts, but I chose clarity over brevity.

Resources: 
- CASAS homepage: http://casas.wsu.edu
- Pandas library: https://pandas.pydata.org/
- SKLearn library: http://scikit-learn.org/

![CASAS Testbed](sensorlayout.jpg)

In [None]:
# Set up the libraries that we need to use 

from os.path import join 

import matplotlib.pyplot as pl 
import seaborn as sns 

from pprint import pprint

import pandas as pd
import numpy as np

from datetime import datetime, timedelta

from subprocess import call

%matplotlib inline

sns.set_style('darkgrid') 
sns.set_context('poster')

In [None]:
# Download the data 

url = 'http://casas.wsu.edu/datasets/twor.2009.zip'

zipfile = url.split('/')[-1]
dirname = '.'.join(zipfile.split('.')[:2])
filename = join(dirname, 'data')

print('     url: {}'.format(url))
print(' zipfile: {}'.format(zipfile))
print(' dirname: {}'.format(dirname))
print('filename: {}'.format(filename))

call(('wget', url))
call(('unzip', zipfile))
call(('rm', zipfile))

In [None]:
# Read the data file

column_headings = ('date', 'time', 'sensor', 'value', 'annotation', 'state')

df = pd.read_csv(
    filename, 
    sep=r'\s+',  # split on runs of whitespace (delim_whitespace is deprecated in recent pandas)
    names=column_headings
)

df.head()

In [None]:
# Strip stray whitespace from every column (missing entries stay NaN)
for col in column_headings:
    df.loc[:, col] = df[col].str.strip()

### Small diversion: pandas dataframes

In [None]:
df.date.head()

In [None]:
df.sensor.unique()

In [None]:
df.annotation.unique()

In [None]:
df.state.unique()

### Not everything is what it seems

In [None]:
df.date.dtype

In [None]:
df.time.dtype

The date and time columns are generic Python **objects** (strings). We want them to be datetime objects so that we can work with them naturally. Before doing so, we should verify that all of the entries are valid dates.

In [None]:
df.date.unique()

The final date is clearly incorrect. We can assume that '22009' is intended to be '2009'.
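
Scanning `unique()` by eye works for a small dataset. As a hedged alternative, malformed dates can also be located programmatically by parsing with `errors='coerce'` and looking for the resulting `NaT` entries. The sample strings below are hypothetical stand-ins for the `date` column:

```python
import pandas as pd

# Hypothetical sample of date strings, including one malformed year.
dates = pd.Series(['2009-02-02', '2009-02-03', '22009-02-03'])

# errors='coerce' turns anything that does not match the format into NaT.
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

# Keep only the raw strings that failed to parse.
bad_dates = dates[parsed.isna()]
print(bad_dates.tolist())
```

This only flags the problem rows; deciding what the intended value was still requires judgement.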

In [None]:
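# Drop the spurious leading '2' so that '22009-...' becomes '2009-...'
bad_dates = df.date.str.startswith('22009')
df.loc[bad_dates, 'date'] = df.loc[bad_dates, 'date'].str[1:]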

In [None]:
df.date.unique()

Create the date time objects and set them as the index of the dataframe. 

In [None]:
# Join the date and time strings, then parse them in one vectorised call
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])

df = df[['datetime', 'sensor', 'value', 'annotation', 'state']]
df = df.set_index('datetime')

df.head()

In [None]:
df.index.second

### Querying the sensors

![CASAS Testbed](sensorlayout.jpg)

In [None]:
df.sensor.unique()

- M-sensors are binary motion sensors (ON/OFF)
- L-sensors are ambient light sensors (ON/OFF)
- D-sensors are binary door sensors (OPEN/CLOSED)
- I-sensors are binary item presence sensors (PRESENT/ABSENT)
- A-sensors are ADC (measuring temperature on hob/oven)

M-, L-, I- and D-sensors are binary, whereas A-sensors have continuous values. So let's split them up into analogue and digital dataframes. 

### Split the analogue and digital components from each other

In [None]:
# Take explicit copies so that later assignments do not act on views of df
cdf = df[~df.sensor.str.startswith("A")][['sensor', 'value']].copy()
adf = df[df.sensor.str.startswith("A")][['sensor', 'value']].copy()

#### Categorical data

We would like to create a matrix with one column per categorical sensor (e.g. M13) whose entry is `1` when the sensor value is `ON`, `-1` when the sensor value is `OFF`, and otherwise remains `0`. First we need to validate the values of the categorical dataframe.
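
A minimal sketch of this signed encoding is given here on hypothetical toy events (the sensor name `D01` and the timestamps are made up). It assumes the values have already been cleaned to `ON`/`OFF`, and it is one possible way to build the matrix rather than the approach taken in the cells that follow:

```python
import pandas as pd

# Hypothetical toy events; the real frame is `cdf` in this notebook.
toy = pd.DataFrame(
    {'sensor': ['M13', 'M13', 'M14', 'D01'],
     'value': ['ON', 'OFF', 'ON', 'OPEN']},
    index=pd.to_datetime([
        '2009-02-02 10:00:00', '2009-02-02 10:00:05',
        '2009-02-02 10:00:07', '2009-02-02 10:00:09',
    ]),
)

# Map each reading to a sign; anything that is not ON/OFF becomes 0.
sign = toy['value'].map({'ON': 1, 'OFF': -1}).fillna(0).astype(int)

# One column per sensor, scaled row-wise by the sign of the reading.
signed = pd.get_dummies(toy['sensor'], dtype=int).mul(sign, axis=0)
print(signed)
```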

In [None]:
cdf.head()

In [None]:
cdf.value.unique()

Some strange values: 

- ONF
- OF
- O
- OFFF

It is often unclear how we should deal with errors such as these, so let's just convert the sensor value of all of these to `ON` in this demo. 
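
Before choosing a fix, it can help to see how often each unexpected value occurs. The following is a sketch under stated assumptions: the readings are hypothetical, and the `expected` set is an assumed list of valid states that would need adjusting to the sensors actually present.

```python
import pandas as pd

# Hypothetical readings; the real column is `cdf.value` in this notebook.
values = pd.Series(['ON', 'OFF', 'ONF', 'OPEN', 'OF', 'ON', 'O', 'OFFF', 'OF'])

# Assumed set of valid states; adjust to match the sensors in use.
expected = {'ON', 'OFF', 'OPEN', 'CLOSE', 'PRESENT', 'ABSENT'}

# Count how often each unexpected value occurs before deciding how to fix it.
unexpected = values[~values.isin(expected)].value_counts()
print(unexpected)
```

If the malformed values are rare, the choice of repair is unlikely to matter much; if they are frequent, it deserves more thought than a blanket replacement.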

In [None]:
for value in ('ONF', 'OF', 'O', 'OFFF'): 
    cdf.loc[cdf.value == value, 'value'] = 'ON'
cdf.value.unique()

In [None]:
# dtype=int gives 0/1 integers (recent pandas versions default to booleans)
cdf_cols = pd.get_dummies(cdf.sensor, dtype=int)
cdf_cols.head()

In [None]:
cdf_cols['M35'].plot(figsize=(10, 5))

In [None]:
kitchen_columns = ['M{}'.format(ii) for ii in (15, 16, 17, 18, 19, 51)]

start = datetime(2009, 2, 2, 10)
end   = datetime(2009, 2, 2, 11)
cdf_cols[(cdf_cols.index > start) & (cdf_cols.index < end)][kitchen_columns].plot(subplots=True, figsize=(10, 10));

In [None]:
start = datetime(2009, 2, 2, 15)
end   = datetime(2009, 2, 2, 17)
cdf_cols[(cdf_cols.index > start) & (cdf_cols.index < end)][kitchen_columns].plot(subplots=True, figsize=(10, 10));

#### Analogue data

The `value` column of the `adf` dataframe is still a set of strings, so let's convert these to floating point numbers.

In [None]:
adf.head()

In [None]:
# This raises a ValueError: some readings contain a stray 'F'
adf.value.astype(float)

In [None]:
%debug

In [None]:
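# Remove a trailing 'F' from the readings that have one
f_inds = adf.value.str.endswith('F') 
adf.loc[f_inds, 'value'] = adf.loc[f_inds, 'value'].str[:-1]

# Remove a leading 'F' from the readings that have one
f_inds = adf.value.str.startswith('F') 
adf.loc[f_inds, 'value'] = adf.loc[f_inds, 'value'].str[1:]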
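
Offending strings like these can also be found without a debugger. As a minimal sketch on hypothetical readings (not the actual CASAS values), `pd.to_numeric` with `errors='coerce'` marks every unparseable entry as `NaN`, which makes the bad rows easy to list:

```python
import pandas as pd

# Hypothetical readings with stray characters, mimicking the kind of noise
# handled in the `adf.value` column.
raw = pd.Series(['0.5', '0.25F', 'F0.75', '1.0'])

# errors='coerce' yields NaN wherever the string is not a valid number.
numeric = pd.to_numeric(raw, errors='coerce')

# Inspect the offending strings rather than silently dropping them.
bad_readings = raw[numeric.isna()].tolist()
print(bad_readings)
```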

In [None]:
# Assign the column directly so that its dtype actually becomes float
adf['value'] = adf.value.astype(float)

In [None]:
adf.value.groupby(adf.sensor).plot(kind='kde', legend=True, figsize=(10, 5))

In [None]:
adf.head()

In [None]:
adf_keys = adf.sensor.unique()
adf_keys

In [None]:
# One column per analogue sensor, holding the reading on rows where that sensor fired
adf_cols = pd.get_dummies(adf.sensor, dtype=float)
for key in adf_keys:
    adf_cols[key] *= adf.value

adf_cols = adf_cols[adf_keys]
adf_cols.head()

## Regrouping 

At this stage the data are prepared as we need them. The categorical data are arranged into a matrix of 0s and 1s, and the analogue data have been translated similarly. What remains is to produce our labels. Since we have already introduced most of the methods in the previous sections, this should be quite straightforward.

In [None]:
annotation_inds = pd.notnull(df.annotation)

anns = df.loc[annotation_inds][['annotation', 'state']]

# Removing duplicated indices
anns = anns.groupby(level=0).first()
anns.head()

Interestingly, there are also bugs in the labels!

In [None]:
for annotation, group in anns.groupby('annotation'): 
    counts = group.state.value_counts()

    # A state may be missing entirely for an annotation, so default to zero
    n_begin = counts.get('begin', 0)
    n_end = counts.get('end', 0)

    if n_begin == n_end: 
        print('             {}: equal counts ({} begins, {} ends)'.format(
            annotation, n_begin, n_end
        ))
    else:
        print(' *** WARNING {}: inconsistent annotation counts with {} begins and {} ends'.format(
            annotation, n_begin, n_end
        ))


In [None]:
def filter_annotations(anns):
    """Keep only 'begin' rows that are immediately followed by an 'end' row."""
    filtered_annotations = []

    # Walk over consecutive pairs of timestamps
    for ll, rr in zip(anns.index[:-1], anns.index[1:]): 
        left = anns.loc[ll]
        right = anns.loc[rr]

        if left.state == 'begin' and right.state == 'end': 
            filtered_annotations.append(dict(label=left.annotation, start=ll, end=rr))

    return filtered_annotations
        

annotations = []
for annotation, group in anns.groupby('annotation'): 
    gi = filter_annotations(group)
    if len(gi) > 10:
        print('{:>30} - {}'.format(annotation, len(group)))
        annotations.extend(gi)

In [None]:
annotations[:10]

In [None]:
X_a = []
X_d = []
y   = []

for ann in annotations: 
    try: 
        ai = adf_cols[ann['start']: ann['end']]
        ci = cdf_cols[ann['start']: ann['end']]
        yi = ann['label']

        X_a.append(ai)
        X_d.append(ci)
        y.append(yi)
        
    except KeyError: 
        pass

print(len(y), len(X_d), len(X_a))

In [None]:
ii = 10
print(y[ii])
print(X_d[ii].sum().to_dict())
print(X_a[ii].sum().to_dict())

In [None]:
X = []

for ii in range(len(y)):
    xi = dict()
    
    # Number of sensor activations within this annotation window
    xi['nd'] = len(X_d[ii])
    xi['na'] = len(X_a[ii])
    
    # Duration of sensor windows
    if len(X_d[ii]): 
        xi['dd'] = (X_d[ii].index[-1] - X_d[ii].index[0]).total_seconds()
    if len(X_a[ii]):
        xi['da'] = (X_a[ii].index[-1] - X_a[ii].index[0]).total_seconds()
    
    for xx in (X_a[ii], X_d[ii]): 
        # Value counts of sensors 
        for kk, vv in xx.sum().to_dict().items(): 
            if np.isfinite(vv) and vv > 0:
                xi[kk] = vv
                                
        # Number of events in each hour of the day
        for kk, vv in xx.index.hour.value_counts().to_dict().items(): 
            kk = 'H_{}'.format(kk)
            if kk not in xi: 
                xi[kk] = 0
            xi[kk] += vv
    X.append(xi)

In [None]:
for ii in range(10): 
    print(y[ii], X[ii], end='\n\n')
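
Each element of `X` is a dictionary, and different windows contain different keys. As a small sketch of how such dictionaries become a fixed-width matrix, scikit-learn's `DictVectorizer` assigns one column per key and fills missing keys with zero. The feature dictionaries below are hypothetical, not taken from the CASAS data:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical feature dictionaries with different keys.
rows = [
    {'nd': 3, 'M17': 2},
    {'nd': 5, 'M18': 1, 'H_10': 4},
]

vec = DictVectorizer(sparse=False)
matrix = vec.fit_transform(rows)

# Columns follow the sorted feature names; missing keys are filled with 0.
print(vec.feature_names_)
print(matrix)
```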

## Doing machine learning on this (FINALLY!)

In [None]:
# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Preprocessing
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

# Cross validation
from sklearn.model_selection import StratifiedKFold

In [None]:
results = []

model_classes = [
    LogisticRegression, 
    RandomForestClassifier, 
    SVC, 
    GaussianNB, 
    KNeighborsClassifier, 
    DecisionTreeClassifier, 
]

print('Learning models...', end='')
for model_class in model_classes:
    folds = StratifiedKFold(5, shuffle=True, random_state=12345)
    for fold_i, (train_inds, test_inds) in enumerate(folds.split(X, y)): 
        print('.', end='')
        X_train, y_train = [X[i] for i in train_inds], [y[i] for i in train_inds]
        X_test, y_test = [X[i] for i in test_inds], [y[i] for i in test_inds]

        # Pipeline expects its steps as a list of (name, estimator) pairs
        model = Pipeline([
            ('dict_to_vec', DictVectorizer(sparse=False)), 
            ('scaling', StandardScaler()), 
            ('classifier', model_class()),
        ])

        model.fit(X_train, y_train)
        
        results.append(dict(
            model=model_class.__name__, 
            fold=fold_i, 
            train_acc=model.score(X_train, y_train),
            test_acc=model.score(X_test, y_test)
        ))
print('...done!\n')
                
res = pd.DataFrame(results)

In [None]:
res

In [None]:
res.groupby('model')[['train_acc', 'test_acc']].mean()
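
The explicit loop over folds shows each step of the evaluation. As a hedged sketch of a more compact alternative, scikit-learn's `cross_val_score` runs the same fit-and-score cycle in one call. The data here are synthetic stand-ins generated with `make_classification`, not the CASAS features, and only one classifier is shown:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the CASAS features.
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
folds = StratifiedKFold(5, shuffle=True, random_state=12345)

# One test accuracy per fold.
scores = cross_val_score(model, X_toy, y_toy, cv=folds)
print(scores.mean())
```

Note that `cross_val_score` reports test scores only; `cross_validate` with `return_train_score=True` would be needed to recover the training accuracies as well.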

## Reflecting on the Pipeline

![Machine Learning Pipeline](ml-pipeline.jpg)

## Take-away messages

- Be cynical about data! 
- Ensuring that data is in an appropriate form is very important. 
- Discovering the variety of errors in the data is not easy and depends on the application. Errors can be present even in fully automated systems.
- Using machine learning models on data doesn't have to take too much time, although developing bespoke model classes for specific applications will. 
- Part of modelling the problem is working with noisy data. 