# MobiAct_ADL_load_dataset.ipynb
Loads the 6 Activities of Daily Living (ADL) from the first release of the MobiAct dataset and converts them to NumPy arrays while adhering to the general format of the [Keras MNIST load_data function](https://keras.io/api/datasets/mnist/#load_data-function).

* STD Standing (1x5min)
* WAL Walking (1x5min)
* JOG Jogging (3x30sec)
* JUM Jumping (3x30sec)
* STU Stairs Up (6x10sec)
* STN Stairs Down (6x10sec)

Only the timestamp ('nanoseconds') and accelerometer data (accel_x/y/z) are imported. Sitting, fall, gyro, and orientation data are not used.

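Data is segmented into non-overlapping windows of 500 samples (~5s) each, using as much of each recording as possible; incomplete segments are discarded. The first and last 100 samples of each file are dropped before segmentation.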
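
The segmentation step can be sketched with NumPy as follows. This is a minimal illustration on a synthetic 1800-row, single-column signal; the array `signal` and the other variable names are assumptions made for the example and are not part of the loader.

```python
import numpy as np

num_samples = 500  # samples per segment, as described above
signal = np.arange(1800, dtype=float).reshape(-1, 1)  # synthetic (1800, 1) signal

# Keep only as many rows as fill whole segments, then reshape.
num_segments = signal.shape[0] // num_samples   # 1800 // 500 = 3
usable = signal[:num_segments * num_samples]    # drops the last 300 rows
segments = usable.reshape(-1, num_samples, 1)

print(segments.shape)
```

Because the windows do not overlap, the second segment starts at row 500 of the original signal.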

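Returns: NumPy arrays, in this order:  
x_train, y_train, x_validation, y_validation \[optional\], x_test, y_test  

* x_train\/validation\/test: float64 arrays with shape (num_segments, 500, 1); the options that would give 3 or 4 channels are not implemented yet (see TODOs)
* y_train\/validation\/test: float64 one-hot arrays with shape (num_segments, 6), one column per activity; when the validation group is not requested, its data is merged into the training arrays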
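
The loader code further down builds one-hot label rows with six columns, using the class order JOG, JUM, STD, STN, STU, WAL. A hedged sketch of converting such labels back to integer class indices and names is shown here; `y_onehot` is a small hand-made array (hypothetical data, not loader output).

```python
import numpy as np

# Class order taken from the comment in the loader code.
class_names = ['JOG', 'JUM', 'STD', 'STN', 'STU', 'WAL']

# Hypothetical one-hot labels for three segments: WAL, JOG, STN.
y_onehot = np.array([[0, 0, 0, 0, 0, 1],
                     [1, 0, 0, 0, 0, 0],
                     [0, 0, 0, 1, 0, 0]], dtype=float)

y_int = np.argmax(y_onehot, axis=1).astype(np.int8)  # integer class index per segment
y_names = [class_names[k] for k in y_int]

print(y_int, y_names)
```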

The default train/validation/test split is by subject, with a best effort to balance gender and height across the three groups.
The split is 60%/20%/20%.
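
As a quick sanity check, the default subject lists (copied from the `split_subj` argument of the loader below) can be counted to confirm the 60%/20%/20% proportions and that no subject lands in two groups. The helper variable names are illustrative only.

```python
# Default subject lists copied from the split_subj argument of the loader.
split_subj = {
    'train_subj': [2, 4, 5, 9, 10, 16, 18, 20, 23, 24, 26, 27, 28, 32, 34, 35,
                   36, 38, 42, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 57],
    'validation_subj': [3, 6, 8, 11, 12, 22, 37, 40, 43, 56],
    'test_subj': [7, 19, 21, 25, 29, 33, 39, 41, 44, 55],
}

counts = {name: len(subjects) for name, subjects in split_subj.items()}
total = sum(counts.values())
fractions = {name: n / total for name, n in counts.items()}

# No subject should appear in more than one group.
all_subjects = [s for subjects in split_subj.values() for s in subjects]
is_disjoint = len(all_subjects) == len(set(all_subjects))

print(counts, fractions, is_disjoint)
```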

Example usage:  
x_train, y_train, x_test, y_test = mobiact_adl_load_dataset()

Additional References:   
For more information on the dataset, please refer to the publication: Vavoulas, G., Chatzaki, C., Malliotakis, T., Pediaditis, M. and Tsiknakis, M. (2016). The MobiAct Dataset: Recognition of Activities of Daily Living using Smartphones. In Proceedings of the International Conference on Information and Communication Technologies for Ageing Well and e-Health – Volume 1: ICT4AWE (ICT4AGEINGWELL 2016), ISBN 978-989-758-180-9, pages 143-151, Rome, Italy. DOI: 10.5220/0005792401430151

The [tensorflow git repo with more datasets](https://github.com/tensorflow/datasets) also links to another colab, the [Keras MNIST Example](https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/keras_example.ipynb), with more options.

Developed and tested using colab.research.google.com  
To save as .py version use File > Download .py

Author:  Lee B. Hinkle, IMICS Lab, Texas State University, 2021

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

TODOs:
* Figure out how to download the source directly from https://drive.google.com/file/d/0B5VcW5yHhWhibWxGRTZDd0dGY2s/edit. This example seems newer than the version used for UniMiB: https://keras.io/examples/timeseries/timeseries_weather_forecasting/
* Implement one-hot encoding using libraries and also non-one-hot encode case
* Need to implement incl_xyz_accel and incl_rms_accel


In [3]:
import os
import shutil #https://docs.python.org/3/library/shutil.html
from shutil import unpack_archive # to unzip
#from shutil import make_archive # to create zip for storage
import requests #for downloading zip file
import glob # to generate lists of files in directory - unix style pathnames
#from scipy import io #for loadmat, matlab conversion
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt # for plotting - pandas uses matplotlib
from tabulate import tabulate # for verbose tables
from tensorflow.keras.utils import to_categorical # for one-hot encoding

In [4]:
def get_mobiact_fname_mdata(path_in):
    """returns dataframe with filename and metadata from mobiact directory
    args: path_in is location of files e.g. JOG, JUM, etc directories
    returns: pandas dataframe with one row for each file.
    columns are fname (full filename), ACT (activity), SUB (subject).
    The GRP column is added later by assign_group()."""
    df = pd.DataFrame() # new empty dataframe
    ACT = ['JOG','JUM','STD','STN','STU','WAL']
    for i in ACT:
        sub_path = path_in + i + '/'
        fname_in = i + '_acc_*.txt'
        print("Generating filenames ", fname_in," from ", sub_path," directory")
        # Build list of matching files in each directory, make into dataframe
        # extract and add metadata
        in_files = glob.glob(os.path.join(sub_path, fname_in)) # generates a list of all matching files
        # naming the column up front also works when no files match
        temp = pd.DataFrame({'fname': in_files}, dtype="string")
        temp['ACT'] = i # add column with activity
        # subject ID is the number right after "_acc_"; anchoring on it avoids
        # matching digits that appear elsewhere in the path
        temp['SUB'] = temp['fname'].str.extract(r'_acc_(\d+)', expand=False)
        df = pd.concat([df,temp], ignore_index=True ) # concat needs two df, add regardless of index
    df["SUB"] = pd.to_numeric(df["SUB"]) # convert subject IDs from string to integer
    return df
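
Subject IDs are parsed out of the filename with a regular expression. The sketch below compares a bare `(\d+)` pattern with one anchored on `_acc_`, using a hypothetical path whose parent directory also contains digits; the path itself is an assumption chosen to show the difference.

```python
import re

# Hypothetical path whose parent directory also contains digits.
fname = '/content/MobiAct_Dataset_v1.0/JOG/JOG_acc_12_3.txt'

first_digits = re.search(r'(\d+)', fname).group(1)   # first digit run anywhere in the path
subject = re.search(r'_acc_(\d+)', fname).group(1)   # digits right after "_acc_"

print(first_digits, subject)
```

With the default `/content/MobiAct_Dataset/` location both patterns agree, since the path has no digits before the filename.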

In [5]:
def assign_group(df,split_subj):
    """returns the file list dataframe with a group column added
    args: dataframe with mobiact fname, ACT, SUB columns
          split_subj is a dict with train_subj, validation_subj, test_subj lists
    returns: dataframe with GRP = {train, validation, test, unassigned} column added
    GRP assignment is based on the subject lists in split_subj
    """
    df['GRP'] = 'unassigned' # add column
    df.loc[df['SUB'].isin(split_subj['train_subj']),'GRP'] = 'train'
    df.loc[df['SUB'].isin(split_subj['validation_subj']),'GRP'] = 'validation'
    df.loc[df['SUB'].isin(split_subj['test_subj']),'GRP'] = 'test'
    return df
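
A small self-contained illustration of the same `isin`-based assignment on a toy file list follows. The filenames and the one-subject lists are made up for the example; note that a subject missing from every list (subject 1 here, which is also absent from the default lists) keeps the 'unassigned' label and is therefore skipped by the loader.

```python
import pandas as pd

# Toy file list; subject 1 is not in any of the lists below.
df = pd.DataFrame({'fname': ['a.txt', 'b.txt', 'c.txt', 'd.txt'],
                   'SUB': [2, 3, 7, 1]})
split_subj = {'train_subj': [2], 'validation_subj': [3], 'test_subj': [7]}

df['GRP'] = 'unassigned'  # default label
for grp in ['train', 'validation', 'test']:
    df.loc[df['SUB'].isin(split_subj[grp + '_subj']), 'GRP'] = grp

print(df)
```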

In [6]:
def read_mobiact_file(full_filename):
    """returns dataframe from Mobiact txt file accel_xyz data, skips metadata, labels columns"""
    df = pd.read_csv(full_filename,skiprows=16, header=None) #skip 16 lines of metadata
    # json better? https://docs.python.org/3/library/json.html#module-json initial tries weren't successful
    df.columns = ["nanoseconds", "accel_x", "accel_y", "accel_z"]
    return df
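
To show what `skiprows=16` with `header=None` does, here is a sketch that reads a stand-in file built in memory. The 16 header lines are placeholders (the real MobiAct header content differs and is not reproduced here), and the three data rows are invented values.

```python
import io
import pandas as pd

# Stand-in file: 16 placeholder metadata lines, then three data rows.
# Only the number of header lines matters for this illustration.
header = ''.join(f'# metadata line {k}\n' for k in range(16))
rows = '1000,0.1,9.8,0.2\n2000,0.0,9.7,0.3\n3000,-0.1,9.9,0.1\n'
fake_file = io.StringIO(header + rows)

df = pd.read_csv(fake_file, skiprows=16, header=None)  # first data row is kept as data
df.columns = ['nanoseconds', 'accel_x', 'accel_y', 'accel_z']

print(df)
```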

In [7]:
def add_total_accel(df_in, delete_xyz=False):
    """computes magnitude (root sum of squares, called rms elsewhere in this notebook)
    of accel_x/y/z data, removes 1g component, adds accel_total column"""
    # compute total accel magnitude and remove 1g due to gravity, removes rotational dependency
    dfx = df_in[['accel_x','accel_y','accel_z']].pow(2) # square only the accel columns
    df_sum = dfx.sum(axis=1) # sum of squares, one value per row
    df_in.loc[:,'accel_total'] = df_sum.pow(0.5)-9.8  # sqrt + remove 1g gravity component
    del dfx, df_sum
    if delete_xyz:
        return df_in.drop(columns = ['accel_x','accel_y','accel_z'])
    else: 
        return df_in
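
A worked example of the total-acceleration formula, `sqrt(x^2 + y^2 + z^2) - 9.8`, on two hypothetical samples. This sketch uses plain NumPy rather than the dataframe version above, and the sample values are invented.

```python
import numpy as np

G = 9.8  # gravity constant used by add_total_accel, in m/s^2

# Two hypothetical samples: a phone at rest (gravity on y only) and a (3, 4, 0) reading.
accel_xyz = np.array([[0.0, 9.8, 0.0],
                      [3.0, 4.0, 0.0]])

magnitude = np.sqrt((accel_xyz ** 2).sum(axis=1))  # Euclidean norm per sample
accel_total = magnitude - G

print(accel_total)
```

The resting sample comes out at approximately 0 regardless of which axis carries gravity, which is the rotational independence the comment in the function refers to.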

In [8]:
def to_abs_time(df):
    """converts timestamps to elapsed time by subtracting the first sample's nanoseconds value from all samples"""
    start_time = df.loc[0,'nanoseconds']
    df.loc[:,'nanoseconds'] = df.loc[:,'nanoseconds'] - start_time
    return df

In [9]:
def get_df_from_file(fname, start_discard=100, end_discard=100):
    """processes mobiact file fname and returns dataframe of accel values
    inputs: fname = full filename
            start_discard = number of initial samples to delete (default = 100)
            end_discard = number of final samples to delete (default = 100)
    output: pandas dataframe with one row per sample and a single accel_total column"""
    df = read_mobiact_file(fname)
    df = add_total_accel(df,delete_xyz=True)
    df = df.drop(columns=['nanoseconds']) # note this loses some time info - datetime better?
    df.drop(df.head(start_discard).index,inplace=True) # drop first rows
    df.drop(df.tail(end_discard).index,inplace=True) # drop last rows
    return df

In [10]:
def split_df_npX(df, num_samples = 500):
    """converts dataframe (samples, total_accel) into 3D numpy array of shape
    (num_segments, num_samples, total_accel). e.g. an 1800 row dataframe will
    result in a (3,500,1) numpy array.  The samples that don't populate an
    entire segment are discarded and there is no overlap (no sliding window)
    input:  dataframe with one file's worth of data, index=samples, column=total_accel
            num_samples = number of samples (rows) in each segment, default = 500
    output: numpy array in trainX format shape = (segments, samples/segment, 1)"""
    temp = df.to_numpy() # easier to reshape and final form needed anyway
    if (temp.shape[0]//num_samples)==0:
        print ("WARNING:  File contains fewer than ",num_samples,"samples and is discarded")
    temp2 = temp[0:num_samples*(temp.shape[0]//num_samples)] # truncate to multiple of num_samples
    tempX = temp2.reshape(-1,num_samples,1) # won't work without truncation
    return tempX

In [11]:
def mobiact_adl_load_dataset(
    verbose = True,
    #Pass location of the original MobiAct zip file here.
    #Easiest way to find this in colab is:
    #  -mount your google drive with MobiAct zip file (link is above)
    #  -navigate to the file using File menu to left
    #  -right click on file and select 'copy path', paste in next line
    orig_zipfile = '/content/drive/My Drive/Datasets/MobiAct_Dataset_v1.0.zip',
    incl_xyz_accel = False, #include component accel_x/y/z in ____X data (not yet implemented, see TODOs)
    incl_rms_accel = True, #add rms value of accel_x/y/z in ____X data (currently always on, see TODOs)
    incl_val_group = False, #True => returns x/y_test, x/y_validation, x/y_train
                           #False => combine test & validation groups
    split_subj = {'train_subj':[2,4,5,9,10,16,18,20,23,24,26,27,28,32,34,35,
                                36,38,42,45,46,47,48,49,50,51,52,53,54,57],
                   'validation_subj':[3,6,8,11,12,22,37,40,43,56],
                   'test_subj':[7,19,21,25,29,33,39,41,44,55]},
    one_hot_encode = True #labels are currently always one-hot encoded (see TODOs)
    ):
    #unzip original dataset from google drive map into colab session
    if (not os.path.isdir('/content/MobiAct_Dataset')):
        print("Unzipping MobiAct Dataset")
        shutil.unpack_archive(orig_zipfile,'/content','zip')
    else:
        print("Using existing archive in colab")
    df_flist = get_mobiact_fname_mdata('/content/MobiAct_Dataset/')
    df_flist = assign_group(df_flist,split_subj)
    #Note:  STU, STN files don't contain 900 samples so a 200 sample discard at
    # start/finish + 500 time steps doesn't work; get_df_from_file defaults to 100
    #Class indices are  {'JOG': 0, 'JUM': 1, 'STD': 2, 'STN': 3, 'STU': 4, 'WAL': 5}
    #to match the ones generated by Keras image import
    act_index = {'JOG': 0, 'JUM': 1, 'STD': 2, 'STN': 3, 'STU': 4, 'WAL': 5}
    # Start each group with an empty placeholder so nothing accumulates when run
    # more than once and an empty group still yields a correctly shaped array
    X_parts = {grp: [np.zeros(shape=(0,500,1))] for grp in ('train','validation','test')}
    y_parts = {grp: [np.zeros(shape=(0,6))] for grp in ('train','validation','test')}
    for i in df_flist.index:
        grp = df_flist['GRP'][i]
        if grp not in X_parts:
            continue # file belongs to a subject not assigned to any group
        #print ("processing", grp, "file", df_flist['fname'][i])
        df = get_df_from_file(df_flist['fname'][i])
        tempX = split_df_npX(df)
        tempy = np.zeros(shape=(tempX.shape[0],6)) # 6 is number of ACT
        tempy[:,act_index[df_flist['ACT'][i]]] = 1 # one hot encoding of ACT
        X_parts[grp].append(tempX)
        y_parts[grp].append(tempy)
    # stack the per-file pieces once per group
    trainX = np.concatenate(X_parts['train'], axis=0)
    trainy = np.concatenate(y_parts['train'], axis=0)
    validationX = np.concatenate(X_parts['validation'], axis=0)
    validationy = np.concatenate(y_parts['validation'], axis=0)
    testX = np.concatenate(X_parts['test'], axis=0)
    testy = np.concatenate(y_parts['test'], axis=0)
    if (incl_val_group):
        return trainX, trainy, validationX, validationy, testX, testy
    else:
        return np.concatenate((trainX, validationX), axis=0),\
            np.concatenate((trainy, validationy), axis=0),\
            testX, testy


In [13]:
if __name__ == "__main__":
    print("Downloading and processing MobiAct dataset, ADL Portion")
    x_train, y_train, x_test, y_test = mobiact_adl_load_dataset()
    print("\nMobiAct ADL returned arrays:")
    print("x_train shape ",x_train.shape," y_train shape ", y_train.shape)
    print("x_test shape  ",x_test.shape," y_test shape  ",y_test.shape)

Downloading and processing MobiAct dataset, ADL Portion
Unzipping MobiAct Dataset
Generating filenames  JOG_acc_*.txt  from  /content/MobiAct_Dataset/JOG/  directory
Generating filenames  JUM_acc_*.txt  from  /content/MobiAct_Dataset/JUM/  directory
Generating filenames  STD_acc_*.txt  from  /content/MobiAct_Dataset/STD/  directory
Generating filenames  STN_acc_*.txt  from  /content/MobiAct_Dataset/STN/  directory
Generating filenames  STU_acc_*.txt  from  /content/MobiAct_Dataset/STU/  directory
Generating filenames  WAL_acc_*.txt  from  /content/MobiAct_Dataset/WAL/  directory

MobiAct ADL returned arrays:
x_train shape  (5587, 500, 1)  y_train shape  (5587, 6)
x_test shape   (1395, 500, 1)  y_test shape   (1395, 6)


In [None]:
#change to True to run the example/test that includes the validation group
if (False):
    x_train, y_train, x_validation, y_validation, x_test, y_test = mobiact_adl_load_dataset(incl_val_group=True)
    print("x/y_train shape ",x_train.shape,y_train.shape)
    print("x/y_validation shape ",x_validation.shape,y_validation.shape)
    print("x/y_test shape  ",x_test.shape,y_test.shape)

Using existing archive in colab
Generating filenames  JOG_acc_*.txt  from  /content/MobiAct_Dataset/JOG/  directory
Generating filenames  JUM_acc_*.txt  from  /content/MobiAct_Dataset/JUM/  directory
Generating filenames  STD_acc_*.txt  from  /content/MobiAct_Dataset/STD/  directory
Generating filenames  STN_acc_*.txt  from  /content/MobiAct_Dataset/STN/  directory
Generating filenames  STU_acc_*.txt  from  /content/MobiAct_Dataset/STU/  directory
Generating filenames  WAL_acc_*.txt  from  /content/MobiAct_Dataset/WAL/  directory
x/y_train shape  (4190, 500, 1) (4190, 6)
x/y_validation shape  (1397, 500, 1) (1397, 6)
x/y_test shape   (1395, 500, 1) (1395, 6)
