<h1> MotionSense Dataset: Smartphone Sensor Data </h1>
<h3> Problem definition - predict the user's activity based on phone sensor data </h3>

<h3> Part 3:
<ul>
    <li> Extracting real world data </li>
    <li> Evaluation on real world data </li>
    <li> Neural Models - training and evaluation on real world data </li>
    <li> Final results, conclusions and application </li>
    </ul>
</h3>

<h3> Extracting real world data </h3>

<ul>
    <li> We used the <a href="https://developer.apple.com/documentation/coremotion/cmdevicemotion">Core Motion Framework for iOS devices</a> to extract sensor data from our phones </li>
    <li> More details on the app we built can be found in our final document </li>
    <li> We recorded sensor data while performing different activities and extracted labeled data samples </li>
    <li> In the next section we will load our data and use it as a test set to evaluate the performance of our Random Forest model </li>
</ul>
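
The loader in the next cell takes the activity label from the first three characters of each recording's file name. A minimal sketch of that convention follows; the file names and the `action_from_filename` helper are hypothetical, and the validation against the six known activity codes is an addition for illustration, not part of our loader.

```python
import os

# The six activity codes used throughout this notebook.
KNOWN_ACTIONS = {"wlk", "sit", "std", "ups", "jog", "dws"}

def action_from_filename(path_to_file):
    """Return the 3-letter activity code encoded at the start of a file name."""
    name, _ext = os.path.splitext(os.path.basename(path_to_file))
    action = name[:3]
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unexpected activity code {action!r} in {path_to_file!r}")
    return action

# Hypothetical file names, for illustration only.
examples = ["real-data/wlk_01.csv", "real-data/ups_02.csv"]
parsed = [action_from_filename(p) for p in examples]
print(parsed)
```

Failing loudly on an unknown prefix can catch a stray file in the data folder before it is silently loaded with a wrong label.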

In [1]:
import numpy as np
import pandas as pd
import os

class NewDataLoader():
    
    def __init__(self, folder_path):
        self.data_path = folder_path
    
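    def load_all_expirements(self):
        # Collect one DataFrame per recording and concatenate once at the end
        # (DataFrame.append was removed in pandas 2.0)
        dfs = []
        # sorted() makes the file -> experiment index mapping deterministic
        for exp_index, filename in enumerate(sorted(os.listdir(self.data_path)), start=1):
            file_path = os.path.join(self.data_path, filename)
            dfs.append(self.load_single_test_expirement(file_path, exp_index))
        return pd.concat(dfs) if dfs else None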

    def load_single_test_expirement(self, path_to_file, exp_index, partc_id=1):
        # Columns dropped before feature extraction
        cols_to_drop = ["timestamp", "timeIntervalSince1970", 'magneticField.x', 
                        'magneticField.y', 'magneticField.z', 'magneticField.accuracy']
        # The first three characters of the file name encode the activity label;
        # basename/splitext also handle file names that contain extra dots
        name, _file_type = os.path.splitext(os.path.basename(path_to_file))
        action = name[:3]
        exp_df = pd.read_csv(path_to_file)
        exp_df = exp_df.drop(cols_to_drop, axis=1)
        exp_df["partc"] = partc_id
        exp_df["action"] = action
        exp_df["action_file_index"] = exp_index
        return exp_df

In [2]:
PROJECT_MAIN_DIR = os.path.join(os.getcwd(), "../")
path = os.path.join(PROJECT_MAIN_DIR, 'real-data')
data_loader = NewDataLoader(path)
real_test_df = data_loader.load_all_expirements()

We will also load our original dataset and use it as training data

In [3]:
train_df = pd.read_csv(os.path.join(PROJECT_MAIN_DIR,'full_data.gz'), compression='gzip') # we will load our data saved as a compressed csv file
train_df = train_df.drop(['Unnamed: 0'], axis=1).set_index('time')

<h3> Evaluation on real world data </h3>

Now we will encode both datasets with our Sliding Window encoding, train our Random Forest model on the entire original data and evaluate its performance on the real world data
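
As a small illustration of what the sliding window encoding produces, the sketch below applies the same rolling and slicing steps to a toy single-column recording. The column name `acc_x`, its values and the window size of 3 are made up for readability; the notebook itself uses a window of 10 and six aggregation functions.

```python
import pandas as pd

WINDOW_SIZE = 3  # small window for readability; the notebook uses 10

# Toy single-sensor recording (values are made up for illustration).
toy = pd.DataFrame({"acc_x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

features = []
for func in ["mean", "max"]:
    # Same pattern as SlidingWindow: look up the aggregation by name and apply it
    rolled = getattr(toy.rolling(window=WINDOW_SIZE), func)()
    rolled = rolled[WINDOW_SIZE:]  # same slicing as SlidingWindow
    rolled.columns = [f"{col}_sld_{func}" for col in rolled.columns]
    features.append(rolled)

encoded = pd.concat(features, axis=1)
print(encoded)
```

Each sensor column becomes one feature per aggregation function. Note that the slice drops `WINDOW_SIZE` rows, one more than the `WINDOW_SIZE - 1` rows that have no complete window, so the first complete window is discarded as well.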

In [4]:
class SlidingWindow:
    
    def __init__(self, orig_df, window_size, num_experiments, num_participants, exclude, fnlist):
        exps = [i for i in range(1,num_experiments + 1) if i != exclude]
        parts = [i for i in range(1,num_participants + 1)]
        smp_df = self.create_sliding_df(orig_df, window_size, fnlist, exps, parts)
        self.window_size = window_size
        self.df = smp_df

    def create_sld_df_single_exp(self, orig_df, window_size, analytic_functions_list):
        # Build rolling-window features for a single recording (one participant, one experiment)
        dfs_to_concate = []
        base_df = orig_df.drop('action', axis=1)
        for func in analytic_functions_list:
            # e.g. func == 'mean' -> base_df.rolling(window=window_size).mean()
            method_to_call = getattr(base_df.rolling(window=window_size), func)
            analytic_df = method_to_call()
            analytic_df = analytic_df[window_size:]  # drop the first window_size rows (the leading ones have no full window)
            analytic_df.columns = [col + "_sld_" + func for col in analytic_df.columns]
            dfs_to_concate.append(analytic_df)

        action_df = orig_df[['action']][window_size:] # [[]] syntax to return DataFrame and not Series
        dfs_to_concate.append(action_df)
        return pd.concat(dfs_to_concate,axis=1)

    def create_sliding_df(self, orig_df, window_size, analytic_functions_list, expirements, participants):
        dfs_to_concate = []
        cols_to_drop = ['partc', 'action_file_index']
        for e in expirements:
            for p in participants:
                exp_df = orig_df[(orig_df['partc'] == p) & (orig_df['action_file_index'] == e)]
                exp_df = exp_df.drop(cols_to_drop, axis=1)
                exp_roll_df = self.create_sld_df_single_exp(exp_df, window_size, analytic_functions_list)

                dfs_to_concate.append(exp_roll_df)
        return pd.concat(dfs_to_concate, axis=0, ignore_index=True)

In [5]:
# defining variables for the sliding window data frame creation
num_experiments = 16
num_participants = 24
exclude = 10
analytic_functions_list = ['mean', 'sum', 'median', 'min', 'max', 'std']
WINDOW_SIZE = 10

# create the sliding window data frame
train_win_df = SlidingWindow(train_df, WINDOW_SIZE, num_experiments, num_participants, exclude, analytic_functions_list)
train_win_df = train_win_df.df

In [9]:
num_experiments = 18
num_participants = 1
exclude = 0
analytic_functions_list = ['mean', 'sum', 'median', 'min', 'max', 'std']
WINDOW_SIZE = 10

real_test_df["partc"] = 1
test_win_df = SlidingWindow(real_test_df, WINDOW_SIZE, num_experiments, num_participants, exclude, analytic_functions_list)
test_win_df = test_win_df.df

In [30]:
from sklearn.metrics import classification_report, confusion_matrix

class DataProcessingEval():
    
    def __init__(self, origin_df, labels_dict):
        self.labels_dict = labels_dict
        self.classes_names = self.create_classes(labels_dict)
        self.df = origin_df
    
    def create_samples(self, division_ratio=(0.7, 0.1, 0.2)):  # tuple avoids a mutable default argument
        # Define X, y
        df = self.df.sample(frac=1).reset_index(drop=True)
        X, y = df.drop(["action"], axis=1), df["action"]
        y = y.replace(self.labels_dict)

        # Divide to training, validation and test set
        train_ratio, dev_ratio = division_ratio[0], division_ratio[1]
        num_training = int(df.shape[0] * train_ratio)
        num_validation = int(df.shape[0] * dev_ratio)
        
        X_train, y_train = X[:num_training], y[:num_training]
        X_vald, y_vald = X[num_training:num_training + num_validation], y[num_training:num_training + num_validation]
        X_test, y_test = X[num_training + num_validation:], y[num_training + num_validation:]

        return X_train, y_train, X_vald, y_vald, X_test, y_test

    def create_classes(self, labels_dict):
        classes_indexs = labels_dict.items()
        classes_indexs = sorted(classes_indexs, key=lambda x: x[1])
        classes_names = [label for label, index in classes_indexs]
        return classes_names

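    def evaluate_results(self, y_true, y_pred):
        # explicit labels keep target_names aligned even if a class is absent from the data
        class_ids = sorted(self.labels_dict.values())
        print("---- Printing classification report ----")
        print(classification_report(y_true, y_pred, labels=class_ids, target_names=self.classes_names))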

In [51]:
labels = {'wlk': 0, 'sit': 1, "std": 2, "ups": 3, "jog": 4, "dws": 5}

win_train_processor = DataProcessingEval(train_win_df, labels_dict=labels)
X_train, y_train, _, _, _, _  = win_train_processor.create_samples([1.0, 0, 0])

win_test_processor = DataProcessingEval(test_win_df, labels_dict=labels)
X_test_real, y_test_real, _, _, _, _ = win_test_processor.create_samples([1.0, 0, 0])

Training over the entire original data and evaluating on new test data
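
Because the training features come from the original dataset and the test features from our own app, it may be worth checking that both frames expose the same feature columns in the same order before predicting, since scikit-learn consumes features by column position. The notebook does not perform such a check; the sketch below is an assumption-laden illustration, with a hypothetical `align_features` helper and made-up feature names.

```python
import pandas as pd

def align_features(X_train, X_test):
    """Reorder X_test columns to match X_train; fail loudly on any mismatch."""
    missing = [c for c in X_train.columns if c not in X_test.columns]
    extra = [c for c in X_test.columns if c not in X_train.columns]
    if missing or extra:
        raise ValueError(f"feature mismatch: missing={missing}, extra={extra}")
    return X_test[X_train.columns]

# Toy frames with hypothetical feature names, for illustration only.
toy_train = pd.DataFrame({"a_sld_mean": [0.1], "b_sld_mean": [0.2]})
toy_test = pd.DataFrame({"b_sld_mean": [0.4], "a_sld_mean": [0.3]})
aligned = align_features(toy_train, toy_test)
print(list(aligned.columns))
```

With the real frames this would be called as `align_features(X_train, X_test_real)` right before `rf.predict`.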

In [52]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, n_jobs=-1, verbose=1)
rf.fit(X_train, y_train)
rf_test_predictions = rf.predict(X_test_real)
win_test_processor.evaluate_results(y_test_real, rf_test_predictions)

[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:   47.8s remaining:   31.9s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  1.3min finished


---- Printing classification report ----
             precision    recall  f1-score   support

        wlk       0.79      0.47      0.59     52652
        sit       0.98      0.70      0.82     35225
        std       0.76      0.97      0.85     36561
        ups       0.43      0.75      0.55     21800
        jog       0.00      0.00      0.00         0
        dws       0.38      0.46      0.42     19547

avg / total       0.73      0.66      0.67    165785



[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished


Conclusions so far:
<ul>
    <li> We excluded the "jogging" activity because we did not perform it while recording the data with our app </li>
    <li> As predicted, the results on real world data (average f1-score of 0.67) are much worse than the results on our original test set </li>
    <li> We still predict the "sit" and "stand" activities quite well (f1-scores of 0.82 and 0.85), but our current model has a hard time identifying "upstairs" (0.55) and "downstairs" (0.42) </li>
    <li> Next, we will try stronger neural models, hoping they will help us increase our performance on the real test data </li>
</ul>
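
To see which activities get mistaken for which, a confusion matrix is a natural complement to the classification report. The sketch below uses made-up toy labels rather than our real predictions, and passes the class indices explicitly so that the absent "jog" class still gets its own (empty) row.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = {'wlk': 0, 'sit': 1, 'std': 2, 'ups': 3, 'jog': 4, 'dws': 5}
class_names = sorted(labels, key=labels.get)  # names ordered by class index

# Toy labels, made up for illustration; class 4 ("jog") never occurs.
y_true = np.array([0, 0, 3, 3, 5, 5, 1, 2])
y_pred = np.array([0, 3, 3, 5, 3, 5, 1, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=list(range(len(class_names))))
for name, row in zip(class_names, cm):
    print(f"{name}: {row.tolist()}")
```

On the real data the same call would take `y_test_real` and `rf_test_predictions`; off-diagonal counts between the "ups", "dws" and "wlk" rows would show how much of the error comes from confusing these three activities.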

<h3> Neural Models - Training and Evaluation </h3>

Encoding for Neural Models
<ul>
    <li> The first model we will try is a simple feed forward network with one hidden layer </li>
    <li> Feed forward nets, like classic ML models, cannot take a sequence as input, so we will have to use one of our previous encodings </li>
    <li> We will choose our sliding window encoding first, since it outperformed our raw history encoding </li>
    <li> We hope that the model can create a better representation of the data in its hidden layer and thus generalize better </li>
</ul>
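
To make the idea of a hidden-layer representation concrete, here is a minimal NumPy sketch of the forward pass of such a network. It assumes 72 input features (the 73 columns of the sliding window frame minus the `action` label), a hidden layer of 32 units and 6 output classes; the weights are random, so the outputs are meaningless and only the shapes and the computation are illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 72, 32, 6  # assumed sizes, see the text above

# Randomly initialised weights; a real model learns these during training.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def forward(X):
    """One hidden ReLU layer followed by a softmax over the activities."""
    hidden = np.maximum(0.0, X @ W1 + b1)        # hidden representation
    logits = hidden @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

X_toy = rng.normal(size=(4, n_features))  # 4 made-up feature vectors
probs = forward(X_toy)
print(probs.shape, probs.sum(axis=1))
```

Each row of `probs` is a probability distribution over the six activities; the predicted activity is the index of the largest entry.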

In [56]:
train_win_df.shape

(1409265, 73)

In [76]:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

dropout_rate = 0.5
# as_keras_metric is a helper assumed to be defined in an earlier cell (not shown here);
# note that precision and recall are created but not passed to compile() below
precision = as_keras_metric(tf.metrics.precision)
recall = as_keras_metric(tf.metrics.recall)

ff_model = Sequential()
ff_model.add(Dense(32, input_shape=(X_train.shape[1],), activation='relu'))  # hidden layer size is 32
ff_model.add(Dropout(dropout_rate))  # adding dropout layer
ff_model.add(Dense(6, activation='softmax'))  # softmax output over the 6 activities
ff_model.compile(loss='categorical_crossentropy',optimizer='adam')  # cross entropy loss

In [77]:
from keras.utils import to_categorical

num_activities = 6
# transform y to one-hot encoded vectors of length 6 (we have 6 activities);
# a single vectorized call avoids a slow Python loop over every sample
y_train_one_hot = to_categorical(y_train, num_activities)

ff_model.fit(X_train, y_train_one_hot, batch_size=32, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5

KeyboardInterrupt: 
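
The one-hot targets passed to `fit` above can be sanity-checked on a small array. The sketch below is a NumPy-only version of the same encoding, using made-up labels; it mirrors what `to_categorical` produces without requiring Keras.

```python
import numpy as np

num_activities = 6
y_toy = np.array([0, 3, 5, 1])  # made-up integer labels

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_activities, dtype="float32")[y_toy]
print(one_hot)

# argmax inverts the encoding and recovers the integer labels.
recovered = one_hot.argmax(axis=1)
```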