<h1> MotionSense Dataset: Smartphone Sensor Data </h1>
<h3> Problem definition - predicting a user's activity based on phone sensor data </h3>

<h3> Part 2:
<ul>
    <li> Problem definition and possible applications </li>
    <li> Feature extraction/engineering </li>
    <li> Classic ML models - training and statistical evaluation </li>
    <li> Problems and the need for "real data" </li>
    </ul>
</h3>

<h3> Problem definition and applications </h3>

<ul>
    <li> Our problem is predicting a user's activity from phone sensor data </li>
    <li> This definition might be too broad, so we limit ourselves to predicting 1 of 5 possible activities </li>
    <li> Thus, we can define our problem as multiclass classification, where we label each data point as <br>
        sitting, standing, walking, going downstairs or going upstairs </li>
    <li> There are many applications for this kind of classification in various fields such as <br>
        healthcare, intelligence etc. </li>
    <li> We will further discuss some of these applications in a later part of our project </li>
</ul>

<h3> Feature extraction/engineering </h3>

Our data is a time series - a sequence of measurements over time
<ul>
    <li> Thus, the meaning of a single data point depends on its context </li>
    <li> But classic ML algorithms/classifiers predict the output for a single input data point, independently of adjacent input data points </li>
    <li> So, in order to use our data to train a classic ML model, we have to encode our features so that they represent context data </li>
    <li> We will present two different feature encoding methods - Sliding-Window and Raw-History </li>
</ul>

<h4> Sliding Window Features </h4>

<ul>
<li> In this method, we encode each data sample as a concatenation of analytical functions calculated over a window of a predefined number of previous samples </li>
<li> For example, here we use a context size of 10 (calculated over 10 previous data points) </li>
<li> Notice that a window must not mix samples from different experiments, which represent different activity labels </li>
</ul>
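The idea can be illustrated on toy data. The following is a minimal sketch, not the notebook's implementation: the column names `rec` and `x`, the values, and the window of 3 are made up for illustration. Grouping by recording before rolling is one way to keep a window from spanning two recordings. Unlike the class below, which drops the first `window_size` rows of each experiment, this sketch drops only the rows with an incomplete window.

```python
import pandas as pd

# Toy data: two hypothetical recordings ("rec" and "x" are illustrative names).
toy = pd.DataFrame({
    "rec": ["a"] * 4 + ["b"] * 4,
    "x": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

window = 3

# Roll within each recording, so no window crosses the boundary between "a" and "b".
rolled = toy.groupby("rec")["x"].rolling(window=window).mean()

# Rows with an incomplete window are NaN; drop them and restore "rec" as a column.
rolled = rolled.dropna().reset_index(level=0)
print(rolled)
```

Each recording contributes two rows here, because its first two samples do not have a full window of three.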

In [5]:
import numpy as np
import pandas as pd
import os

PROJECT_MAIN_DIR = os.path.join(os.getcwd(), "../")

In [2]:
class SlidingWindow:
    
    def __init__(self, orig_df, window_size, num_experiments, num_participants, exclude, fnlist):
        exps = [i for i in range(1,num_experiments + 1) if i != exclude]
        parts = [i for i in range(1,num_participants + 1)]
        smp_df = self.create_sliding_df(orig_df, window_size, fnlist, exps, parts)
        self.window_size = window_size
        self.df = smp_df

    def create_sld_df_single_exp(self, orig_df, window_size, analytic_functions_list):
        dfs_to_concate = []
        base_df = orig_df.drop('action', axis=1)
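        for func in analytic_functions_list:
            # look up the rolling aggregation by name, e.g. rolling(...).mean
            method_to_call = getattr(base_df.rolling(window=window_size), func)
            analytic_df = method_to_call()
            # drop the first window_size rows; the first window_size - 1 of them are NaN (incomplete window)
            analytic_df = analytic_df[window_size:]
            # tag each column with the function that produced it, e.g. "gravity.x_sld_mean"
            analytic_df.columns = [col + "_sld_" + func for col in analytic_df.columns]
            dfs_to_concate.append(analytic_df)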

        action_df = orig_df[['action']][window_size:] # [[]] syntax to return DataFrame and not Series
        dfs_to_concate.append(action_df)
        return pd.concat(dfs_to_concate,axis=1)

    def create_sliding_df(self, orig_df, window_size, analytic_functions_list, expirements, participants):
        dfs_to_concate = []
        cols_to_drop = ['partc', 'action_file_index']
        for e in expirements:
            for p in participants:
                exp_df = orig_df[(orig_df['partc'] == p) & (orig_df['action_file_index'] == e)]
                exp_df = exp_df.drop(cols_to_drop, axis=1)
                exp_roll_df = self.create_sld_df_single_exp(exp_df, window_size, analytic_functions_list)

                dfs_to_concate.append(exp_roll_df)
        return pd.concat(dfs_to_concate, axis=0, ignore_index=True)

In [6]:
df = pd.read_csv(os.path.join(PROJECT_MAIN_DIR,'full_data.gz'), compression='gzip') # we will load our data saved as a compressed csv file
df = df.drop(['Unnamed: 0'], axis=1).set_index('time')

In [7]:
# defining variables for the sliding window data frame creation
num_experiments = 16
num_participants = 24
exclude = 10
analytic_functions_list = ['mean', 'sum', 'median', 'min', 'max', 'std']
WINDOW_SIZE = 10

# create the sliding window data frame
win_df = SlidingWindow(df, WINDOW_SIZE, num_experiments, num_participants, exclude, analytic_functions_list)

Viewing our data and performing a sanity check

In [9]:
win_df.df.head(5)

Unnamed: 0,attitude.roll_sld_mean,attitude.pitch_sld_mean,attitude.yaw_sld_mean,gravity.x_sld_mean,gravity.y_sld_mean,gravity.z_sld_mean,rotationRate.x_sld_mean,rotationRate.y_sld_mean,rotationRate.z_sld_mean,userAcceleration.x_sld_mean,...,gravity.x_sld_std,gravity.y_sld_std,gravity.z_sld_std,rotationRate.x_sld_std,rotationRate.y_sld_std,rotationRate.z_sld_std,userAcceleration.x_sld_std,userAcceleration.y_sld_std,userAcceleration.z_sld_std,action
0,1.476032,-0.699698,0.659227,0.761074,0.643965,-0.072516,0.327435,-0.23759,0.125294,0.089179,...,0.003243,0.006475,0.029224,0.346436,0.590791,0.249107,0.083854,0.128267,0.114783,dws
1,1.464487,-0.697192,0.650675,0.761804,0.642056,-0.081426,0.344311,-0.346253,0.059212,0.058162,...,0.001706,0.004752,0.029167,0.377046,0.554298,0.172086,0.087612,0.140744,0.099833,dws
2,1.448353,-0.695176,0.63986,0.761559,0.64051,-0.093848,0.481461,-0.525592,0.033799,0.054865,...,0.002168,0.004552,0.032387,0.428049,0.712118,0.14147,0.090179,0.146393,0.097535,dws
3,1.4265,-0.692378,0.625654,0.760575,0.638354,-0.110722,0.602284,-0.699763,0.062317,0.055195,...,0.004032,0.005687,0.043814,0.439572,1.00645,0.168158,0.089927,0.148507,0.10431,dws
4,1.399383,-0.688014,0.609652,0.758815,0.634966,-0.131806,0.70538,-0.951931,0.111215,0.041147,...,0.007015,0.008968,0.062607,0.433246,1.32976,0.224856,0.074548,0.091135,0.131904,dws


<b> Sanity check: </b> <br>
<ul>
    <li> There are 15 experiments and 24 participants for each experiment </li>
    <li> For a sliding window of 10 samples we lose 10 data samples from each experiment </li>
    <li> This sums up to 15 \* 24 \* 10 = 3600 </li>
    <li> Indeed, the new data set has exactly 3600 rows fewer than the original data set </li>
    <li> Furthermore, the new data set has exactly 12 \* {num_analytical_function} + label column = 12 \* 6 + 1 = 73 columns <br>
        (12 is the number of features in the original data set) </li>
</ul>

In [11]:
print(win_df.df.shape)
print(df.shape)

(1409265, 73)
(1412865, 15)
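
The arithmetic of this sanity check can also be written out as a short script. This is only a sketch that hard-codes the numbers quoted in the text and in the printed shapes; the variable names are illustrative and it does not touch the actual data frames.

```python
# Values taken from the sanity-check text and the printed shapes.
n_experiments = 15
n_participants = 24
window_size = 10
n_features = 12
n_functions = 6
orig_rows = 1412865

# Each (experiment, participant) recording loses window_size rows.
expected_rows = orig_rows - n_experiments * n_participants * window_size
# One column per (feature, function) pair, plus the label column.
expected_cols = n_features * n_functions + 1

print(expected_rows, expected_cols)  # 1409265 73
```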


<h4> Raw History Features </h4>

<ul>
    <li> In this method, we simply encode each data sample as a concatenation of the raw features of its previous x data points </li>
    <li> For example, here we use a context size of 10, i.e. it is aligned with our previous sliding window method, <br>
        but instead of calculating aggregated analytical functions over the context features, we simply encode them as one long vector </li>
    <li> Again, a sample's history must not mix data from different experiments, which represent different activity labels </li>
</ul>
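A toy example may make the encoding concrete. This is a minimal sketch under assumed names: a single feature column `x`, a history length of 2, and made-up values. It follows the same `prev_{i}_` naming scheme as the class below but is not that class.

```python
import pandas as pd

# Toy recording with one feature and a constant label (values are made up).
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0], "action": ["dws"] * 5})
history_length = 2

# One shifted copy of the feature column per history step.
lagged = {f"prev_{i}_x": toy["x"].shift(i) for i in range(1, history_length + 1)}
hist = pd.concat([toy, pd.DataFrame(lagged)], axis=1)

# Drop the first rows, whose history is incomplete (NaN after shifting).
hist = hist.iloc[history_length:].reset_index(drop=True)
print(hist)
```

Each remaining row now carries its own value of `x` together with the two values that preceded it.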

In [15]:
class RawHistory:
    
    def __init__(self, origin_df, history_length, num_experiments, num_participants, exclude):
        exps = [i for i in range(1,num_experiments + 1) if i != exclude]
        parts = [i for i in range(1,num_participants + 1)]
        # use the data frame passed to the constructor rather than the global df
        smp_df = self.create_history_encoded_df(origin_df, history_length, expirements=exps, participants=parts)
        self.history_length = history_length
        self.df = smp_df

    def create_history_encoded_single_exp(self, orig_df, history_length):
        hist_df = orig_df.copy(deep=True) # later operations are "in place" so we need to avoid changing original dataframe
        columns_to_shift = hist_df.columns[:-1] # omit the action column, we don't want to duplicate it
        for i in range(1,history_length + 1):
            shift_df = orig_df.shift(i)
            for col_name in columns_to_shift:
                new_col_name = "prev_{0}_".format(i) + col_name
                hist_df[new_col_name] = shift_df[col_name] # add shifted column, aka history, as a column to the copied dataframe

        hist_df = hist_df[history_length:] # we don't return the first "history_length" sample - they have missing history data
        return hist_df

    def create_history_encoded_df(self, orig_df, history_length, expirements, participants):
        dfs_to_concate = []
        cols_to_drop = ['partc', 'action_file_index']
        for e in expirements:
            for p in participants:
                exp_df = orig_df[(orig_df['partc'] == p) & (orig_df['action_file_index'] == e)]
                exp_df = exp_df.drop(cols_to_drop, axis=1)
                exp_histoy_df = self.create_history_encoded_single_exp(exp_df, history_length)
                dfs_to_concate.append(exp_histoy_df)
        return pd.concat(dfs_to_concate, axis=0, ignore_index=True) 

In [16]:
# defining variables for the raw history data frame creation
num_experiments = 16
num_participants = 24
exclude = 10
HISTORY_LEN = 10

# create the raw history data frame
hist_df = RawHistory(df, HISTORY_LEN, num_experiments, num_participants, exclude)

<b> Sanity check: </b> <br>
<ul>
    <li> There are 15 experiments and 24 participants for each experiment </li>
    <li> For history-encoded data with a history length of 10 samples we lose 10 data samples from each experiment </li>
    <li> This sums up to 15 \* 24 \* 10 = 3600 </li>
    <li> Indeed, the new data set has exactly 3600 rows fewer than the original data set </li>
    <li> Furthermore, the new data set has exactly 12 \* {history_length + 1} + label column = 12 \* (10 + 1) + 1 = 133 columns <br>
        (the addition of one accounts for the current sample's own features) </li>
</ul>

In [17]:
print(hist_df.df.shape)
print(df.shape)

(1409265, 133)
(1412865, 15)
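
The expected shape for the history encoding can be checked with the same kind of arithmetic. This is a sketch that hard-codes the numbers quoted in the text; the variable names are illustrative.

```python
# Values taken from the sanity-check text and the printed shapes.
n_experiments, n_participants = 15, 24
history_length = 10
n_features = 12
orig_rows = 1412865

# Each recording loses its first history_length rows.
expected_rows = orig_rows - n_experiments * n_participants * history_length
# Current features plus history_length lagged copies, plus the label column.
expected_cols = n_features * (history_length + 1) + 1

print(expected_rows, expected_cols)  # 1409265 133
```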
