# AM 207 Final Project
## Now I See You
### --- Sensor Based Single User Activity Recognition
<img src="plot/background.png" width=300>
#### Team members:
Xiaowen Chang: xiaowenchang@g.harvard.edu

Baijie Lu: blu@g.harvard.edu

Fangzheng Qian: fqian@g.harvard.edu

Yuhan Tang: tang01@g.harvard.edu

#### Assigned Teaching fellow: 

Xide Xia: xidexia@g.harvard.edu
###### Submitted as the final project of AM 207: Advanced Scientific Computing at Harvard University

# Table of Contents

* [0. Introduction](#0.-Introduction)
* [1. Data Engineering](#1.-Data-Engineering)
    * [1.1 Preprocessing](#1.1-Preprocessing)
    * [1.2 Activity Array](#1.2-Activity-Array)
    * [1.3 Feature Representation](#1.3-Feature-Representation)
        * [1.3.1 Raw](#1.3.1-Raw)
        * [1.3.2 ChangePoint](#1.3.2-ChangePoint)
        * [1.3.3 LastSensor](#1.3.3-LastSensor)
    * [1.4 Save Files](#1.4-Save-Files)
* [2. House and Feature Setting](#2.-House-and-Feature-Setting)
* [3. Naive Bayes Model](#3.-Naive-Bayes-Model)
    * [3.1 Parameter Estimation](#3.1-Parameter-Estimation)
    * [3.2 Prior](#3.2-Prior)
    * [3.3 Maximize Posterior](#3.3-Maximize-Posterior)
    * [3.4 Model Visualization](#3.4-Model-Visualization)
* [4. Hidden Markov Model (HMM)](#4.-Hidden-Markov-Model)
    * [4.1 First Order HMM](#4.1-First-Order-HMM)
    * [4.2 Second Order HMM](#4.2-Second-Order-HMM)
* [5. Model Comparison](#5.-Model-Comparison)

# 0. Introduction

Activity recognition, which identifies the activity (e.g. cooking, sleeping, reading) that a user performs from a series of observations, is an active research area because it has many real-life applications, such as healthcare and intelligent environments. To simplify the problem, this project uses sensor-based, single-user data rather than data on complex activities involving multiple users. Recognizing activities from sensor data poses the following challenges. First, interpretation is ambiguous: similar sensor observations may correspond to different activities. For example, 'cooking' and 'cleaning fridge' both involve opening the fridge. Second, the same activity may be performed in different ways. Third, it is hard to tell from the observed sensor data when one activity ends and another one starts. In this project, we implemented a Naive Bayes model, a first-order HMM (Hidden Markov Model) and a second-order HMM to tackle these issues and tested their performance on several real-world datasets. Besides the models, we tried different feature representations to further improve performance.

**This project is split into two parts: DataEngineering.ipynb and Models.ipynb. You are looking at the first part, DataEngineering.ipynb. You only need to run this part once; all the data needed for building the models will be generated and saved.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
from datetime import datetime,timedelta
%matplotlib inline

# 1. Data Engineering

## 1.1 Preprocessing 
### --- Organizing and Cleaning data

In this project, supervised machine learning models are used to recognize a user's activity, so labeled training data from sensors is required. We used three online datasets, each recording the activity of a single user in a different house. Sensors were installed at different places inside each house to gather the data needed for recognition. Floor plans of the houses are shown below; the red boxes represent sensor nodes. Each sensor gives 1 when it is firing and 0 otherwise.
<img src="plot/data1.png" width=300>
<img src="plot/data2.png" width=300>

In [2]:
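#convert a MATLAB datenum (days since year 0) to seconds since 1970-01-01
#the 366-day shift below accounts for Python ordinals starting at year 1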
def to_sec(matlab_datenum):
    dt = datetime.fromordinal(int(matlab_datenum)) + timedelta(days=matlab_datenum%1) - timedelta(days = 366)
    seconds = (dt-datetime(1970,1,1)).total_seconds()
    return int(seconds)
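
As a sanity check of the conversion above, the sketch below re-implements the same arithmetic under a hypothetical name, `datenum_to_unix` (not part of the notebook), and applies it to 719529, the MATLAB datenum of 1970-01-01. It only illustrates the offset logic; time zones are ignored.

```python
from datetime import datetime, timedelta

def datenum_to_unix(matlab_datenum):
    # hypothetical helper mirroring the arithmetic of to_sec
    whole_days = int(matlab_datenum)      # integer part: number of days
    day_fraction = matlab_datenum % 1     # fractional part: time of day
    dt = (datetime.fromordinal(whole_days)
          + timedelta(days=day_fraction)
          - timedelta(days=366))          # MATLAB counts from year 0, Python ordinals from year 1
    return int((dt - datetime(1970, 1, 1)).total_seconds())

epoch = datenum_to_unix(719529.0)   # 719529 is MATLAB's datenum for 1970-01-01
noon = datenum_to_unix(719529.5)    # half a day later
print(epoch, noon)                  # 0 43200
```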

In [3]:
#run this section once for each house: 'A', 'B' and 'C'
house = 'C'
act_mat = scipy.io.loadmat("../data/datasets/house{}/actStructHouse{}.mat".format(house, house))
sensor_mat = scipy.io.loadmat("../data/datasets/house{}/sensorStructHouse{}.mat".format(house, house))
label_mat = scipy.io.loadmat("../data/datasets/house{}/senseandactLabelsHouse{}.mat".format(house,house))

In [4]:
act_df = pd.DataFrame(act_mat['activityStructure'][0][0][0])

In [5]:
act_df.columns = ["start_sec", "end_sec", "label"]
act_df['start_sec'] = [to_sec(i) for i in act_df.start_sec]
act_df['end_sec'] = [to_sec(i) for i in act_df.end_sec]
act_df['diff_sec'] = act_df.end_sec-act_df.start_sec
act_df['start_time'] = [str(datetime.fromtimestamp(i)) for i in act_df.start_sec]
act_df['end_time'] = [str(datetime.fromtimestamp(i)) for i in act_df.end_sec]
act_df = act_df[["start_time", "end_time", "start_sec", "end_sec", "diff_sec", "label"]]
act_df.head()

Unnamed: 0,start_time,end_time,start_sec,end_sec,diff_sec,label
0,2008-11-19 17:49:00,2008-11-19 17:49:59,1227134940,1227134999,59,1
1,2008-11-19 17:50:40,2008-11-19 17:51:45,1227135040,1227135105,65,4
2,2008-11-19 17:59:25,2008-11-19 18:00:00,1227135565,1227135600,35,17
3,2008-11-19 18:00:50,2008-11-19 20:24:59,1227135650,1227144299,8649,28
4,2008-11-19 19:14:50,2008-11-19 19:15:19,1227140090,1227140119,29,17


In [366]:
len(act_df)

344

In [367]:
sensor_df = pd.DataFrame(sensor_mat['sensorStructure'][0][0][0])

In [368]:
sensor_df.columns = ['start_sec', 'end_sec', 'label', 'on']
sensor_df['start_sec'] = [to_sec(i) for i in sensor_df.start_sec]
sensor_df['end_sec'] = [to_sec(i) for i in sensor_df.end_sec]
sensor_df['diff_sec'] = sensor_df.end_sec-sensor_df.start_sec
sensor_df['start_time'] = [str(datetime.fromtimestamp(i)) for i in sensor_df.start_sec]
sensor_df['end_time'] = [str(datetime.fromtimestamp(i)) for i in sensor_df.end_sec]
sensor_df = sensor_df[["start_time", "end_time", "start_sec", "end_sec", "diff_sec", "on", "label"]]
sensor_df.head()

Unnamed: 0,start_time,end_time,start_sec,end_sec,diff_sec,on,label
0,2008-11-19 17:47:46,2008-11-19 17:49:17,1227134866,1227134957,91,1,28
1,2008-11-19 17:49:20,2008-11-19 17:49:22,1227134960,1227134962,2,1,28
2,2008-11-19 17:49:24,2008-11-19 17:50:14,1227134964,1227135014,50,1,28
3,2008-11-19 17:50:18,2008-11-20 06:14:11,1227135018,1227179651,44633,1,28
4,2008-11-19 17:51:02,2008-11-19 17:51:04,1227135062,1227135064,2,1,25


In [369]:
len(sensor_df)

22700

In [370]:
temp = label_mat['sensor_labels']
sensor_labels = {}
for item in temp:
    sensor_labels[(item[0][0][0])]= item[1][0]
sensor_labels 

{5: u'mat bed rechts, drukmat',
 6: u'mat bank, huiskamer',
 7: u'vriezer, reed',
 8: u'toilet flush boven, flush ',
 10: u'toilet flush beneden. flush ',
 11: u'badkamer klapdeur links',
 13: u'zelfde la als 18, kwik',
 15: u'La sleutels',
 16: u'badkamer klapdeur links',
 18: u'bestek la, kwik sensor ',
 20: u'kastje pannen, reed ',
 21: u'magnetron, reed ',
 22: u'kastje restjesbakjes, reed ',
 23: u'kastje borden/kruiden,reed ',
 25: u'deur toilet beneden',
 27: u'kastje cups/bowl/tuna, reed ',
 28: u'voordeur, reed ',
 29: u'deur slaapkamer',
 30: u'koelkast, reed ',
 35: u'badkuip, pir ',
 36: u'dresser, pir ',
 38: u'wasbak boven, flush ',
 39: u'mat bed links, drukmat '}

In [371]:
temp = label_mat['activity_labels']
act_labels = {}
for i,item in enumerate(temp):
    if len(item[0]) != 0:
        act_labels[i+1] = item[0][0]
act_labels

{1: u'leave house',
 3: u'Eating',
 4: u'use toilet downstairs',
 5: u'take shower',
 6: u'brush teeth',
 7: u'use toilet upstairs',
 8: u'take bath',
 9: u'shave',
 10: u'go to bed',
 11: u'get dressed',
 12: u'take medication',
 13: u'prepare Breakfast',
 14: u'prepare Lunch',
 15: u'prepare Dinner',
 16: u'get snack',
 17: u'get drink',
 18: u'put items in dishwasher',
 19: u'unload dishwasher',
 20: u'store groceries',
 21: u'Grooming (Collection of 6,9,12,22)',
 22: u'put clothes in washingmachine',
 23: u'unload washingmachine',
 25: u'receive guest',
 26: u'watch tv',
 27: u'read paper',
 28: u'relax',
 30: u'Unknown'}

In [372]:
if house == 'B':
    sensor_df = sensor_df[sensor_df.label != 23]
    act_df = act_df[act_df.label != 16]

In [373]:
temp = []
for l in sensor_df.label:
    temp.append(sensor_labels[int(l)])
sensor_df['meaning'] = temp

In [374]:
sensor_df.head()

Unnamed: 0,start_time,end_time,start_sec,end_sec,diff_sec,on,label,meaning
0,2008-11-19 17:47:46,2008-11-19 17:49:17,1227134866,1227134957,91,1,28,"voordeur, reed"
1,2008-11-19 17:49:20,2008-11-19 17:49:22,1227134960,1227134962,2,1,28,"voordeur, reed"
2,2008-11-19 17:49:24,2008-11-19 17:50:14,1227134964,1227135014,50,1,28,"voordeur, reed"
3,2008-11-19 17:50:18,2008-11-20 06:14:11,1227135018,1227179651,44633,1,28,"voordeur, reed"
4,2008-11-19 17:51:02,2008-11-19 17:51:04,1227135062,1227135064,2,1,25,deur toilet beneden


In [375]:
temp = []
for l in act_df.label:
    temp.append(act_labels[int(l)])
act_df['meaning'] = temp

In [376]:
act_df.head()

Unnamed: 0,start_time,end_time,start_sec,end_sec,diff_sec,label,meaning
0,2008-11-19 17:49:00,2008-11-19 17:49:59,1227134940,1227134999,59,1,leave house
1,2008-11-19 17:50:40,2008-11-19 17:51:45,1227135040,1227135105,65,4,use toilet downstairs
2,2008-11-19 17:59:25,2008-11-19 18:00:00,1227135565,1227135600,35,17,get drink
3,2008-11-19 18:00:50,2008-11-19 20:24:59,1227135650,1227144299,8649,28,relax
4,2008-11-19 19:14:50,2008-11-19 19:15:19,1227140090,1227140119,29,17,get drink


In [377]:
sensor_df.to_csv("../data/house{}_sensor.csv".format(house), index=False)
act_df.to_csv("../data/house{}_act.csv".format(house), index=False)

## 1.2. Data Discretization
### - Use different house data by setting house to A or B or C

After some experiments, we discretized the data into $T$ time slices of length $\Delta t = 60$ seconds, which is long enough to be discriminative while giving a relatively small discretization error. When two or more activities occur within a single time slice, we use the activity that occupies most of the time slice.
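
The "most of the time slice" rule can be illustrated on toy data. The sketch below is an assumption-laden illustration rather than the notebook's implementation: `dominant_label` is a hypothetical helper, the intervals are made up, and 0 is used as the default for slices without any activity.

```python
import numpy as np

def dominant_label(slice_start, slice_end, starts, ends, labels, default=0):
    """Return the label whose interval overlaps [slice_start, slice_end) the longest."""
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    # overlap length of each activity interval with the slice (0 if disjoint)
    overlap = np.minimum(ends, slice_end) - np.maximum(starts, slice_start)
    overlap = np.clip(overlap, 0, None)
    if len(overlap) == 0 or overlap.max() == 0:
        return default                      # no activity in this slice
    return labels[int(np.argmax(overlap))]

# toy annotations: activity 4 lasts 20 s and activity 17 lasts 35 s, both inside one 60 s slice
starts, ends, labels = [0, 25], [20, 60], [4, 17]
print(dominant_label(0, 60, starts, ends, labels))     # 17
print(dominant_label(60, 120, starts, ends, labels))   # 0
```

Computing the overlap as `min(end, slice_end) - max(start, slice_start)` handles every case (full cover, partial cover on either side, interval strictly inside the slice) with a single expression.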

In [8]:
timeslice = 60

In [9]:
#####
#select a house, 'A', 'B', or 'C'
#####
house = 'A'
act_df = pd.read_csv("../data/house{}_act.csv".format(house))
sensor_df = pd.read_csv("../data/house{}_sensor.csv".format(house))

In [10]:
start = min(min(act_df.start_sec), min(sensor_df.start_sec))
end = max(max(act_df.end_sec), max(sensor_df.end_sec))
if (end-start)%timeslice != 0:
    #pad the end so that the duration is a whole number of time slices
    end = (1+(end-start)//timeslice)*timeslice + start
duration = end-start

In [11]:
num_sensor = len(set(sensor_df.label))
num_act = len(set(act_df.label))
num_t = int(duration//timeslice) #integer division: num_t is used as an array size
print "# sensors: ", num_sensor
print "# states/acts: ", num_act
print "# timeframes: ", num_t

# sensors:  14
# states/acts:  16
# timeframes:  40006


## 1.2 Activity Array
### ---Hidden States

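The hidden state (activity) at time slice $t$ is denoted by $y_t \in \{1, 2, ..., D\}$ for $D$ possible hidden states. In the array `Y` built below, each entry stores the activity label of one time slice, with 0 marking slices that have no annotated activity.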

In [12]:
#check counts
temp = list(set(zip(act_df.label, act_df.meaning)))
for y in temp:
    print "label:{}, meaning:{}, count:{}".format(y[0], y[1], sum(act_df.label==y[0]))

label:16.0, meaning:get snack, count:12
label:22.0, meaning:put clothes in washingmachine, count:3
label:6.0, meaning:brush teeth, count:16
label:17.0, meaning:get drink, count:20
label:23.0, meaning:unload washingmachine, count:4
label:4.0, meaning:use toilet, count:114
label:5.0, meaning:take shower, count:23
label:15.0, meaning:prepare Dinner, count:9
label:10.0, meaning:go to bed, count:24
label:19.0, meaning:unload dishwasher, count:4
label:20.0, meaning:store groceries, count:1
label:1.0, meaning:leave house, count:33
label:25.0, meaning:receive guest, count:3
label:18.0, meaning:put items in dishwasher, count:5
label:13.0, meaning:prepare Breakfast, count:20
label:3.0, meaning:Eating, count:1


In [13]:
Y = np.zeros(num_t)

In [14]:
for j in range(num_t):
    c = j*timeslice + start  #start of the time slice
    c_ = c + timeslice       #end of the time slice
    #activities that overlap the slice, including those lying strictly inside it
    mask = (act_df.start_sec < c_) & (act_df.end_sec > c)
    temp_df = act_df[mask]
    max_cover = 0
    max_label = 0 #default, unknown act
    for s, t, l in zip(temp_df.start_sec, temp_df.end_sec, temp_df.label):
        #number of seconds of this slice covered by the activity
        cover = min(t, c_) - max(s, c)
        if cover > max_cover:
            max_cover = cover
            max_label = l
    Y[j] = max_label

## 1.3 Feature Representation

For a house with $N$ sensors installed, we define the binary observation vector at time $t$ as $x_{t} = (x_{t1}, x_{t2}, ... , x_{tN})^T$. We tried three different ways to represent the observations.
<img src="plot/feature.png">

As shown in the graph above, 

(a) is the raw feature representation, which gives 1 when the sensor is firing and 0 otherwise.  

(b) is the ChangePoint feature representation, which marks the moments when a sensor changes its value. 

(c) is the LastSensor feature representation, which indicates which sensor fired last. The sensor that changed state last continues to give 1 and only changes to 0 when another sensor changes its value.
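
To make the three representations concrete, here is a minimal sketch on a made-up matrix with 6 time slices and 2 sensors. The toy data and the tie-breaking rule (lowest column index when several sensors change in the same slice) are assumptions made for illustration only.

```python
import numpy as np

# (a) Raw: toy matrix, 6 time slices (rows) x 2 sensors (columns)
X_raw = np.array([[0, 0],
                  [1, 0],
                  [1, 0],
                  [0, 0],
                  [0, 1],
                  [0, 1]], dtype=float)

# (b) ChangePoint: 1 where a sensor's value differs from the previous slice
X_change = np.zeros_like(X_raw)
X_change[0] = X_raw[0]
X_change[1:] = (X_raw[1:] != X_raw[:-1]).astype(float)

# (c) LastSensor: the sensor that changed most recently stays at 1
X_last = np.zeros_like(X_raw)
last = None                      # no sensor has changed yet
for t in range(len(X_raw)):
    changed = np.flatnonzero(X_change[t])
    if len(changed) > 0:
        last = changed[0]        # assumption: ties are broken by lowest column index
    if last is not None:
        X_last[t, last] = 1.0

print(X_change.tolist())  # [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(X_last.tolist())    # [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

Note how sensor 0 keeps its 1 in `X_last` during slice 3, after it has stopped firing, until sensor 1 changes state in slice 4.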

### 1.3.1 Raw

The raw sensor representation uses the sensor data directly as it was received from the sensors. It gives a 1 when the sensor is firing and a 0 otherwise.

**X_raw** is a num_t-by-num_sensor matrix, where rows are times and columns are features (or sensors). The mapping from sensor label to column index is given by the dictionary `s2i`, and the inverse mapping by `i2s`.
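
The sketch below shows one way a single sensor event, given as a start and end time in seconds, can be turned into rows of such a matrix. `mark_interval` is a hypothetical helper that marks every slice the interval overlaps; it is a simplified illustration and may differ from the notebook's loop at slice boundaries.

```python
import numpy as np

def mark_interval(X, col, start_sec, end_sec, origin, timeslice=60):
    """Set X[row, col] = 1 for every slice that the interval [start_sec, end_sec) overlaps."""
    first = (start_sec - origin) // timeslice
    # the last overlapped slice is the one containing the final second of the interval
    last = (max(end_sec - 1, start_sec) - origin) // timeslice
    X[first:last + 1, col] = 1
    return X

X = np.zeros((5, 2))
# the sensor in column 1 fires from second 70 to second 200 (origin at 0)
mark_interval(X, col=1, start_sec=70, end_sec=200, origin=0)
print(X[:, 1])   # [0. 1. 1. 1. 0.]
```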

In [17]:
#map from sensor to idx and idx to sensor (sorted so the column order is deterministic)
sensor_ids = sorted(set(sensor_df.label))
i2s = dict(zip(range(num_sensor), sensor_ids))
s2i = dict(zip(sensor_ids, range(num_sensor)))
s2i

{1.0: 0,
 5.0: 1,
 6.0: 2,
 7.0: 3,
 8.0: 4,
 9.0: 5,
 12.0: 6,
 13.0: 7,
 14.0: 8,
 17.0: 9,
 18.0: 10,
 20.0: 11,
 23.0: 12,
 24.0: 13}

In [18]:
#each row = (x1, x2, .. xn), n=num_sensor
X_raw = np.zeros([num_t, num_sensor])
for i in range(len(sensor_df)):
    elapsed = sensor_df.start_sec[i] - start
    row = elapsed//timeslice #time slice in which the sensor event starts
    label = sensor_df.label[i]
    diff = sensor_df.diff_sec[i]
    #mark one slice per (started) timeslice of the event's duration
    while diff > 0:
        X_raw[row, s2i[label]] = 1
        row = row + 1
        diff = diff - timeslice

### 1.3.2 ChangePoint

The change point representation indicates when a sensor event takes place. That is, it indicates when a sensor changes value. More formally, it gives a 1 when a sensor changes state (i.e. goes from zero to one or vice versa) and a 0 otherwise.

**X_change** is a num_t-by-num_sensor matrix, where rows are times and columns are features (or sensors). The mapping from sensor label to column index is given by the dictionary `s2i`.

In [19]:
X_change = np.zeros([num_t, num_sensor])
X_change[0] = X_raw[0]
#1 wherever a sensor's value differs from the previous time slice
X_change[1:] = (X_raw[1:] != X_raw[:-1]).astype(float)

In [20]:
print "ones in X_raw: ", sum(sum(X_raw))
print "ones in X_change: ", sum(sum(X_change))

ones in X_raw:  52053.0
ones in X_change:  1531.0


### 1.3.3 LastSensor
The last-fired sensor representation indicates which sensor fired last. The sensor that changed state last continues to give 1 and changes to 0 when another sensor changes state.

**X_last** is a num_t-by-num_sensor matrix, where rows are times and columns are features (or sensors). The mapping from sensor label to column index is given by the dictionary `s2i`.

In [21]:
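X_last = np.zeros([num_t, num_sensor])
X_last[0] = X_change[0]
#index of the sensor that changed state most recently (None until a sensor fires)
changed = np.flatnonzero(X_change[0])
s_ind = changed[0] if len(changed) > 0 else None
for i in range(1, num_t):
    changed = np.flatnonzero(X_change[i])
    #if several sensors change in the same slice, the lowest index is kept
    if len(changed) > 0: s_ind = changed[0]
    if s_ind is not None: X_last[i][s_ind] = 1.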

## 1.4 Save Files 
### (only needs to be done once)

In [22]:
np.save("../data/X_raw_house{}.npy".format(house), X_raw)
np.save("../data/X_change_house{}.npy".format(house), X_change)
np.save("../data/X_last_house{}.npy".format(house), X_last)
np.save("../data/Y_house{}.npy".format(house), Y)
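
The saved arrays can be read back with `np.load`. The round trip below is a minimal sketch using stand-in arrays and a temporary directory instead of `../data`; how Models.ipynb actually loads the files is not shown in this notebook, so treat the loading side as an assumption.

```python
import os
import tempfile
import numpy as np

house = 'A'
X_demo = np.eye(3)                 # stand-in for X_raw
Y_demo = np.array([0., 4., 17.])   # stand-in for Y

out_dir = tempfile.mkdtemp()       # temporary directory instead of ../data
x_path = os.path.join(out_dir, "X_raw_house{}.npy".format(house))
y_path = os.path.join(out_dir, "Y_house{}.npy".format(house))
np.save(x_path, X_demo)
np.save(y_path, Y_demo)

# load the arrays back, as a later notebook could do
X_loaded = np.load(x_path)
Y_loaded = np.load(y_path)
print(X_loaded.shape, Y_loaded.shape)   # (3, 3) (3,)
```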