# Gesture-Phase-Segmentation_load_dataset.ipynb
Loads the dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/gesture+phase+segmentation) and returns train and test X/y numpy arrays.  This loader is derived from the TWRistAR version of Feb '22.   Please follow the original dataset owners' citation requests if you use this data in your work.

The basic flow is:
* Download and unzip the dataset if not already present
* Convert each recording *session* into Intermediate Representation 1 (IR1) format - a datetime indexed pandas dataframe with columns for each channel plus the label and subject number.
* Transform the IR1 into IR2 - a set of three numpy arrays containing sliding window samples
   * X = (samples, time steps per sample, channels)  
   * y =  (samples, label) # activity classification  
   * s =  (samples, subject) # subject number
* Clean and further transform the IR2 arrays as needed. Note that the transforms that can be applied here depend on whether the arrays are for train or test.   For example, the IR2 arrays in the training set may be rebalanced, but those in the test set should not.
* Concatenate the processed IR2 arrays into the final returned train/validate/test arrays.
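
The IR1 to IR2 step in the flow above can be illustrated with a minimal sketch. This is not the code in load_data_transforms.py; `ir1_to_ir2` is a hypothetical helper, and it assumes the labels are already integers and that each window is labeled with its most frequent label. The defaults match the time_steps = 30 and stride = 5 values set later in this notebook.

```python
import numpy as np
import pandas as pd

def ir1_to_ir2(df, time_steps=30, stride=5):
    """Hypothetical helper: slice one IR1 dataframe into sliding windows.
    Assumes 'label' holds non-negative integer codes and 'sub' the subject number."""
    channels = df.drop(columns=["label", "sub"]).to_numpy(dtype="float32")
    labels = df["label"].to_numpy()
    subs = df["sub"].to_numpy()
    # start row of every full-length window
    starts = list(range(0, len(df) - time_steps + 1, stride))
    X = np.stack([channels[i:i + time_steps] for i in starts])
    # label each window with its most frequent label (ties go to the lowest code)
    y = np.array([[np.bincount(labels[i:i + time_steps]).argmax()] for i in starts])
    # one session has one subject, so the first row of the window is enough
    s = np.array([[subs[i]] for i in starts])
    return X, y, s

# toy IR1 dataframe: 100 rows, 3 channels, two labels, one subject
rng = np.random.default_rng(0)
idx = pd.date_range("2022-01-01", periods=100, freq="33ms")
toy = pd.DataFrame(rng.random((100, 3)), index=idx, columns=["lhx", "lhy", "lhz"])
toy["label"] = np.repeat([0, 1], 50)
toy["sub"] = 1

X, y, s = ir1_to_ir2(toy)
print(X.shape, y.shape, s.shape)  # (15, 30, 3) (15, 1) (15, 1)
```

With 100 rows, a window of 30 and a stride of 5, the window starts are 0, 5, ..., 70, which gives 15 samples.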

Set interactive to True to run the Jupyter Notebook version.  Note that most of the calls are set up to test the functions, not to process the entire dataset.  To process the entire dataset, set interactive to False and run all cells so that main executes.   This notebook can also be saved and run as a Python file.


<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

[Lee B. Hinkle](https://userweb.cs.txstate.edu/~lbh31/), Texas State University, [IMICS Lab](https://imics.wp.txstate.edu/)  
TODO:
* Some files (a1_raw) have a time discontinuity.  This needs to be addressed when forming sliding windows.
* The one-hot encoding should be moved to a common function.


# Import Libraries and Common Load Dataset Code (from IMICS public repo)

In [1]:
import os
import shutil #https://docs.python.org/3/library/shutil.html
from shutil import unpack_archive # to unzip
import time
import pandas as pd
import numpy as np
from numpy import savetxt
from tabulate import tabulate # for verbose tables, showing data
from tensorflow.keras.utils import to_categorical # for one-hot encoding
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from time import gmtime, strftime, localtime #for displaying Linux UTC timestamps in hh:mm:ss
from datetime import datetime, date
import urllib.request # to get files from web w/o !wget
import matplotlib.pyplot as plt

In [2]:
def get_py_file(fname, url):
    """checks for local file, if none downloads from URL.    
    :return: nothing"""
    #fname = 'load_data_utils.py'
    #ffname = os.path.join(my_dir,fname)
    if (os.path.exists(fname)):
        print ("Local",fname, "found, skipping download")
    else:
        print("Downloading",fname, "from IMICS git repo")
        urllib.request.urlretrieve(url, filename=fname)

get_py_file(fname = 'load_data_utils.py', url = 'https://raw.githubusercontent.com/imics-lab/load_data_time_series/main/load_data_utils.py')
get_py_file(fname = 'load_data_transforms.py', url = 'https://raw.githubusercontent.com/imics-lab/load_data_time_series/main/load_data_transforms.py')

Downloading load_data_utils.py from IMICS git repo
Downloading load_data_transforms.py from IMICS git repo


In [3]:
import load_data_transforms as xform
import load_data_utils as utils

# Global Parameters

In [4]:
# environment and execution parameters
my_dir = '.' # replace with absolute path if desired
dataset_dir = os.path.join(my_dir,'gesture_phase_dataset') # Where dataset will be unzipped

interactive = True # for exploring data and functions interactively
verbose = True

In [5]:
# Gesture-Phase-Segmentation Dataset unique params
xform.time_steps = 30
xform.stride = 5
# the label map should contain all possible labels, it is used to convert from
# IR1 dataframe string labels of type categorical (saves a ton of memory) to 
# integers for the IR2 and beyond numpy ndarrays.
label_map_gps = {"label":     {"Rest": 0, "Preparation": 1, "Stroke": 2,
                                "Hold": 3, "Retraction": 4}}

In [6]:
interactive = False # skip the interactive test calls; main runs the full load, as in the .py version
verbose = False # to limit the called functions output
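
As a rough illustration of how a label map like label_map_gps can turn categorical string labels into small integers, here is a minimal sketch on a toy dataframe. It does not reproduce xform.assign_ints_ir1_labels; the toy column and the int8 choice are assumptions.

```python
import pandas as pd

# same structure as label_map_gps in this notebook
label_map = {"label": {"Rest": 0, "Preparation": 1, "Stroke": 2,
                       "Hold": 3, "Retraction": 4}}

# toy IR1-style label column stored as a categorical
df = pd.DataFrame({"label": pd.Categorical(["Rest", "Stroke", "Hold", "Rest"])})

# map strings to ints; a label missing from the map becomes NaN and the
# int8 cast then fails, which surfaces incomplete label maps early
df["label"] = df["label"].astype("object").map(label_map["label"]).astype("int8")
print(df["label"].tolist())  # [0, 2, 3, 0]
```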

In [7]:
def get_gesture_phase_dataset():
    """checks for local zipfile, if none downloads from UCI repository
    after download will unzip the dataset into local directory.
    Assumes a global my_dir has been defined (default is my_dir = ".")
    :return: nothing"""
    zip_fname = 'gesture_phase_dataset.zip'
    zip_ffname = os.path.join(my_dir,zip_fname)
    gps_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00302/gesture_phase_dataset.zip"
    if (os.path.exists(zip_ffname)):
        if verbose:
            print ("Local",zip_ffname, "found, skipping download")
    else:
        print("Downloading Gesture-Phase-Segmentation dataset from UCI ML Repository")
        # save to zip_ffname so the download lands where the checks above and below look
        urllib.request.urlretrieve(gps_url, zip_ffname)
    if (os.path.isdir(dataset_dir)):
        if verbose:
            print("Found existing directory:", dataset_dir, "skipping unzip")
    else:
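        # only create the directory once the zip is known to exist, otherwise
        # an empty directory would cause later runs to skip the unzip
        if (os.path.exists(zip_ffname)):
            os.makedirs(dataset_dir)
            print("Unzipping Gesture Phase Segmentation file in", dataset_dir, "directory")
            shutil.unpack_archive(zip_ffname,dataset_dir,'zip')
        else:
            print("Error: ", zip_ffname, " not found, skipping unzip")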
    return
if interactive:
    get_gesture_phase_dataset()
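
The download-if-missing, unzip-if-missing pattern used above can be sketched in a generic form. `fetch_and_unpack` is a hypothetical helper, not part of this notebook or the IMICS repo, and the demo URL is a placeholder that is never contacted because the zip file is created locally first.

```python
import os
import shutil
import tempfile
import urllib.request
import zipfile

def fetch_and_unpack(url, zip_path, out_dir):
    """Hypothetical helper: download url to zip_path if it is missing,
    then unpack it into out_dir if that directory is missing."""
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    if not os.path.isdir(out_dir):
        # create the directory only after the zip is in place
        os.makedirs(out_dir)
        shutil.unpack_archive(zip_path, out_dir, "zip")
    return out_dir

# offline demo: build a tiny zip so the download branch is skipped
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("a1_raw.csv", "timestamp,lhx\n0,1.0\n")
    out_dir = fetch_and_unpack("https://example.invalid/demo.zip",  # placeholder URL
                               zip_path, os.path.join(tmp, "data"))
    unpacked = sorted(os.listdir(out_dir))
print(unpacked)  # ['a1_raw.csv']
```

Using one path variable for both the existence check and the download target keeps the two in sync when my_dir is not the current directory.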

In [8]:
def get_gps_ir1_dict(include_story = False):
    """reads the Gesture-Phase-Segmentation raw .csv files in the global
    'dataset_dir'. This version uses dict to preserve original filenames.
    Returns: a dict containing IR1 dataframes."""
    fn_list = ['a1_raw.csv', 'a2_raw.csv', 'a3_raw.csv',
           'b1_raw.csv', 'b3_raw.csv', 'c1_raw.csv','c3_raw.csv']
    get_gesture_phase_dataset()
    ir1_df_dict = dict() # an empty dictionary
    for item in fn_list:
        subject = item[0] # the first letter of the filename a,b,c
        story = item[1] # the number after subject letter, 1,2,3
        ffname = os.path.join(dataset_dir,item)
        # print(subject, story, ffname)
        df = pd.read_csv(ffname)
        # change to 32-bit, credit/ref https://stackoverflow.com/questions/69188132/how-to-convert-all-float64-columns-to-float32-in-pandas
        # Select columns with 'float64' dtype
        float64_cols = list(df.select_dtypes(include='float64'))
        # Cast the selected columns to float32
        df[float64_cols] = df[float64_cols].astype('float32')
        # b1_raw has an instance of alternate spelling.  Change to match others.
        df['phase'] = df['phase'].replace({'Preparação':'Preparation'})
        # Seems better to explicitly type the other columns vs object.
        df['phase']=df['phase'].astype('category')

        df['sub'] = subject
        df['sub'] = [ ord(x) - 96 for x in df['sub']] # ord('a') is 97, so a,b,c -> 1,2,3
        df['sub']=df['sub'].astype('int8')
        if include_story:
            df['story'] = story
            df['story']=df['story'].astype('int8')
        df.rename(columns={"phase": "label"}, inplace = True, errors="raise") # phase was GPS dataset specific

        # kinect sample rate is 15 or 30 fps, timestamp appears to
        # be a running counter not an actual UTC time.
        df['datetime'] = pd.to_datetime(df['timestamp'], unit='ms') 
        df.set_index('datetime',inplace=True)
        df = df.drop('timestamp', axis=1)
        ir1_df_dict[item.split('.')[0]]=df # key is root name of csv
    return ir1_df_dict
if interactive:
    ir1_dict = get_gps_ir1_dict()
    print(ir1_dict.keys())
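
The per-file conversion can be tried without downloading anything. The following is a minimal sketch on an in-memory csv with made-up values; `toy_ir1` is a hypothetical stand-in that mirrors the main steps of get_gps_ir1_dict for a single file, and the two channel columns are only a small subset chosen for illustration.

```python
import io
import pandas as pd

# made-up rows in the style of the raw files: channels, counter timestamp, phase
csv_text = ("lhx,lhy,timestamp,phase\n"
            "1.0,2.0,0,Rest\n"
            "1.5,2.5,33,Preparação\n"
            "2.0,3.0,66,Stroke\n")

def toy_ir1(csv_file, fname):
    """Hypothetical single-file version of the IR1 conversion."""
    df = pd.read_csv(csv_file)
    float_cols = list(df.select_dtypes(include="float64"))
    df[float_cols] = df[float_cols].astype("float32")
    # unify the alternate spelling, then store labels as a categorical
    df["phase"] = df["phase"].replace({"Preparação": "Preparation"}).astype("category")
    df = df.rename(columns={"phase": "label"})
    # first letter of the filename is the subject: a,b,c -> 1,2,3
    df["sub"] = ord(fname[0]) - ord("a") + 1
    df["sub"] = df["sub"].astype("int8")
    # the timestamp is treated as a millisecond counter, not wall-clock time
    df["datetime"] = pd.to_datetime(df["timestamp"], unit="ms")
    return df.set_index("datetime").drop(columns="timestamp")

ir1 = toy_ir1(io.StringIO(csv_text), "b1_raw.csv")
print(ir1.dtypes)
```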

In [9]:
def split_ir1_dict_by_sub(ir1_dict, split_subj_dict):
    """This splits the ir1 dictionary of all files into three separate ones
    based on the subject and split_subj_dict.  Primarily used to change the
    processing between the train/valid sets (mixed windows discarded, classes
    balanced) and the test set (mode labeling, little other processing)"""
    # TODO: check for mixed subs and double allocations
    # empty dictionaries
    ir1_df_dict_train = dict() 
    ir1_df_dict_valid = dict() 
    ir1_df_dict_test = dict() 
    for key, item in ir1_dict.items():
        extracted_sub = item['sub'].iloc[0] # first row sub number
        #print(key,item['sub'].iloc[0]) # first row sub number
        if extracted_sub in split_subj_dict['train_subj']:
            ir1_df_dict_train[key] = item
        if extracted_sub in split_subj_dict['valid_subj']:
            ir1_df_dict_valid[key] = item
        if extracted_sub in split_subj_dict['test_subj']:
            ir1_df_dict_test[key] = item
    return ir1_df_dict_train,ir1_df_dict_valid,ir1_df_dict_test

if interactive:
    split_subj = dict(train_subj = [1], valid_subj = [2], test_subj = [3])
    ir1_df_dict_train,ir1_df_dict_valid,ir1_df_dict_test = split_ir1_dict_by_sub(ir1_dict, split_subj_dict = split_subj)
    print("Train:", ir1_df_dict_train.keys())
    print("Valid:",ir1_df_dict_valid.keys())
    print("Test :",ir1_df_dict_test.keys())
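
The TODO in split_ir1_dict_by_sub mentions checking for double allocations. One possible check is sketched below; `check_split_subj_dict` is a hypothetical helper and is not called anywhere in this notebook.

```python
def check_split_subj_dict(split_subj_dict):
    """Hypothetical check: raise if a subject number is assigned to more
    than one group; returns a subject -> group mapping otherwise."""
    seen = {}
    for group, subjects in split_subj_dict.items():
        for sub in subjects:
            if sub in seen:
                raise ValueError(f"subject {sub} is in both {seen[sub]} and {group}")
            seen[sub] = group
    return seen

# the default split used later in this notebook passes the check
allocation = check_split_subj_dict(dict(train_subj=[1, 2], valid_subj=[], test_subj=[3]))
print(allocation)

# a split that reuses subject 1 is rejected
try:
    check_split_subj_dict(dict(train_subj=[1], valid_subj=[1], test_subj=[3]))
    overlap_detected = False
except ValueError:
    overlap_detected = True
```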

# Exploratory code for dealing with the data gaps in the dataset

In [10]:
if interactive: 
    my_df = ir1_dict['a1_raw']
    print(type(my_df))
    print(my_df.dtypes)
    display(my_df.head())
    my_df['lhx'].plot(figsize=(20, 10))

In [11]:
if interactive:
    # labels will plot if they have been converted to ints 
    my_df = xform.assign_ints_ir1_labels(ir1_dict['a1_raw'], label_mapping_dict = label_map_gps)
    my_df.plot(subplots=True, figsize=(20, 10)) # yay Pandas

In [12]:
# find timegaps and split so that the resulting IR1 dataframes
# represent a continuous recording session
# use resample/mean/interpolate to fill small gaps and make
# larger ones NaN.   Then get a dataframe of just the NaN
# rows and use that index to split the dataframe
# NOTE:  This is dataset specific in that it only handles one big gap
# Interesting that the plots merge and don't show separately
if interactive:
    my_df = ir1_dict['a1_raw']
    max_gaps_to_fill = 30
    # good resample explanation https://datastud.dev/posts/time-series-resample
    rs_df = my_df.resample('33ms').mean().interpolate(limit = max_gaps_to_fill)
    display(rs_df['lhx'].plot(figsize=(20, 10))) # NaN will not be plotted by default
    nan_rows = rs_df[rs_df['lhx'].isnull()] # notnull also available
    if (len(nan_rows.index)==0):
        print("No gaps exceeding",max_gaps_to_fill,"found")
    else:
        print(len(nan_rows.index), "rows with gaps larger than",max_gaps_to_fill,"found")
        gap_start, gap_end = nan_rows.index[0], nan_rows.index[-1]
        print("Start of gap", gap_start)
        print(" End of gap ", gap_end)
        df1 = rs_df.loc[rs_df.index[0]:gap_start]
        df1 = df1.iloc[:-max_gaps_to_fill] # drop the filled gap
        df2 = rs_df.loc[gap_end:rs_df.index[-1]]
        df2 = df2.iloc[1:] # drop the one NaN row due to slice
        display(df1['lhx'].plot(figsize=(20, 10)))
        display(df2['lhx'].plot(figsize=(20, 10)))
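
The exploratory cell above handles a single large gap. A possible generalization to any number of gaps is sketched here on toy data; `split_at_gaps` is a hypothetical helper, the one second threshold is an assumption, and unlike the cell above it does not resample or interpolate.

```python
import numpy as np
import pandas as pd

def split_at_gaps(df, max_gap="1s"):
    """Hypothetical helper: split a datetime-indexed dataframe wherever
    consecutive rows are more than max_gap apart."""
    deltas = df.index.to_series().diff()  # first entry is NaT
    # every over-threshold gap starts a new segment number
    segment_id = (deltas > pd.Timedelta(max_gap)).cumsum().to_numpy()
    return [segment for _, segment in df.groupby(segment_id)]

# toy counter timestamps in ms with two large gaps
idx = pd.to_datetime([0, 33, 66, 5000, 5033, 20000], unit="ms")
toy = pd.DataFrame({"lhx": np.arange(6, dtype="float32")}, index=idx)

segments = split_at_gaps(toy)
print([len(seg) for seg in segments])  # [3, 2, 1]
```

Each returned dataframe then represents one continuous recording session and could be windowed on its own.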

# The dataset specific code to generate the dictionary of IR1 dataframes is complete.  Now use Shared Transforms to generate the final output arrays.

In [13]:
def gesture_phase_segmentation_load_dataset(
    incl_val_group = False, # split train into train and validate
    split_subj = dict
                (train_subj = [1,2],
                valid_subj = [],
                test_subj = [3]),
    one_hot_encode = True, # make y into multi-column one-hot, one for each activity
    return_info_dict = False, # return dict of meta info along with ndarrays
    suppress_warn = False # special case for stratified warning
    ):
    """Downloads the Gesture-Phase-Segmentation dataset from the UCI repository.
    Each csv file is converted into an IR1 dataframe and placed into a dictionary,
    which is then split by subject according to split_subj.  By default subjects
    1 and 2 form the train set and subject 3 forms the test set.  Since this
    dataset is so small, the optional validation set (incl_val_group = True) is
    a stratified split of the train set and is not subject independent.
    """
    log_info = "Generated by Gesture-Phase-Segmentation_load_dataset.ipynb\n"
    today = date.today()
    log_info += today.strftime("%B %d, %Y") + "\n"
    log_info += "sub dict = " + str(split_subj) + "\n"

    ir1_dict = get_gps_ir1_dict()

    # split the IR1 dict by subject so each can be processed separately.
    ir1_dict_train,ir1_dict_valid,ir1_dict_test = split_ir1_dict_by_sub(ir1_dict, split_subj_dict = split_subj)
    if True:  # change back to verbose when done debugging!
        print("Train:", ir1_dict_train.keys())
        print("Valid:",ir1_dict_valid.keys())
        print("Test :",ir1_dict_test.keys())
    x_train, y_train, sub_train, ss_times_train, xys_info = xform.get_ir3_from_dict(ir1_dict_train, label_map = label_map_gps, label_method = 'drop')
    #x_valid, y_valid, sub_valid, ss_times_valid, xys_info = xform.get_ir3_from_dict(ir1_dict_valid, label_map = label_map_gps)
    x_test, y_test, sub_test, ss_times_test, xys_info = xform.get_ir3_from_dict(ir1_dict_test, label_map = label_map_gps, label_method = 'mode')

    # headers = ("Initial Array","shape", "object type", "data type")
    # mydata = [("X", X.shape, type(X), X.dtype),
    #           ("y", y.shape, type(y), y.dtype),
    #           ("sub", sub.shape, type(sub), sub.dtype)]
    # if (verbose):
    #     print(tabulate(mydata, headers=headers))
    # log_info += tabulate(mydata, headers=headers) + "\n"

    if (one_hot_encode):
        # using newer code, ints only, from Fusion of Learned Reps work
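        enc = OneHotEncoder(categories='auto', sparse_output=False)
        y_train = enc.fit_transform(y_train)
        #y_valid = enc.transform(y_valid)
        # reuse the encoder fitted on train so the test columns line up with
        # the train columns; this raises if test has a label unseen in train
        y_test = enc.transform(y_test)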
        # integer encode
        # y_vector_train = np.ravel(y_train) #encoder won't take column vector
        # y_vector_valid = np.ravel(y_valid) #encoder won't take column vector
        # y_vector_test = np.ravel(y_test) #encoder won't take column vector
        # le = LabelEncoder()
        # integer_encoded = le.fit_transform(y_vector_train) #convert from string to int
        # name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
        # if (verbose):
        #     print("One-hot-encoding: category names -> int -> one-hot \n")
        #     print(name_mapping) # seems risky as interim step before one-hot
        # log_info += "One Hot:" + str(name_mapping) +"\n\n"
        # onehot_encoder = OneHotEncoder(sparse_output=False)
        # integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
        # onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
        # y=onehot_encoded.astype('uint8')
        #return X,y
    # split by subject number pass in dictionary
    # sub_num = np.ravel(sub[ : , 0] ) # convert shape to (1047,)
    # this code is different from typical due to limited subjects,
    # all not test subjects data is placed into train which is then 
    # split using stratification - validation group is not sub independent
    # train_index = np.nonzero(np.isin(sub_num, split_subj['train_subj'] + 
    #                                     split_subj['valid_subj']))
    # x_train = X[train_index]
    # y_train = y[train_index]
    if (incl_val_group):
        if not suppress_warn:
            print("Warning: Due to limited subjects the validation group is a stratified")
            print("90/10 split of the training group.  It is not subject independent.")
        # split training into training + validate using stratify - note that the
        # validation set is not subject independent (hard to achieve with limited
        # subjects).   The test set however is subject independent and as a result
        # will have much lower accuracy.  Another option is to tag a few of the
        # activities for inclusion in validation.  See
        # https://github.com/imics-lab/Semi-Supervised-HAR-e4-Wristband
        # https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
        x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size=0.10, random_state=42, stratify=y_train)

    # test_index = np.nonzero(np.isin(sub_num, split_subj['test_subj']))
    # x_test = X[test_index]
    # y_test = y[test_index]

    headers = ("Returned Array","shape", "object type", "data type")
    mydata = [("x_train:", x_train.shape, type(x_train), x_train.dtype),
                    ("y_train:", y_train.shape ,type(y_train), y_train.dtype)]
    if (incl_val_group):
        mydata += [("x_valid:", x_valid.shape, type(x_valid), x_valid.dtype),
                        ("y_valid:", y_valid.shape ,type(y_valid), y_valid.dtype)]
    mydata += [("x_test:", x_test.shape, type(x_test), x_test.dtype),
                    ("y_test:", y_test.shape ,type(y_test), y_test.dtype)]
    if (verbose):
        print(tabulate(mydata, headers=headers))
    log_info += tabulate(mydata, headers=headers)
    if (incl_val_group):
        if (return_info_dict):
            return x_train, y_train, x_valid, y_valid, x_test, y_test, log_info
        else:
            return x_train, y_train, x_valid, y_valid, x_test, y_test
    else:
        if (return_info_dict):
            return x_train, y_train, x_test, y_test, log_info
        else:
            return x_train, y_train, x_test, y_test
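
The TODO at the top of this notebook suggests moving the one-hot encoding to a common function. One possible shape for it is sketched below; `one_hot_fixed` is a hypothetical helper that assumes integer labels in the range 0 to num_classes - 1, as produced by the label map. Fixing the number of columns keeps train and test arrays aligned even when a class is absent from one of them.

```python
import numpy as np

def one_hot_fixed(y, num_classes):
    """Hypothetical shared helper: one-hot encode integer labels into
    exactly num_classes columns."""
    y = np.asarray(y).reshape(-1).astype(int)
    if y.size and (y.min() < 0 or y.max() >= num_classes):
        raise ValueError("label outside 0..num_classes-1")
    # row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="uint8")[y]

y_train_demo = np.array([[0], [1], [4]])
y_test_demo = np.array([[2], [2], [3]])  # classes 0, 1 and 4 are absent
oh_train = one_hot_fixed(y_train_demo, 5)
oh_test = one_hot_fixed(y_test_demo, 5)
print(oh_train.shape, oh_test.shape)  # (3, 5) (3, 5)
```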


# Main is set up as a demo and a bit of a unit test.

In [14]:
if __name__ == "__main__":
    verbose = False
    print("Get Gesture Phase Segmentation using defaults - simple and easy!")
    x_train, y_train, x_test, y_test \
                             = gesture_phase_segmentation_load_dataset()
    headers = ("Array","shape", "data type")
    mydata = [("x_train:", x_train.shape, x_train.dtype),
            ("y_train:", y_train.shape, y_train.dtype),
            ("x_test:", x_test.shape, x_test.dtype),
            ("y_test:", y_test.shape, y_test.dtype)]
    print("\n",tabulate(mydata, headers=headers))
    print ('\n','-'*72)

    print("Get Gesture Phase Segmentation with validation and info file\n")
    x_train, y_train, x_valid, y_valid, x_test, y_test, log_info \
                             = gesture_phase_segmentation_load_dataset(
                                 incl_val_group = True,
                                 return_info_dict = True)

    headers = ("Array","shape", "data type")
    mydata = [("x_train:", x_train.shape, x_train.dtype),
            ("y_train:", y_train.shape, y_train.dtype),
            ("x_valid:", x_valid.shape, x_valid.dtype),
            ("y_valid:", y_valid.shape, y_valid.dtype),
            ("x_test:", x_test.shape, x_test.dtype),
            ("y_test:", y_test.shape, y_test.dtype)]
    print("\n",tabulate(mydata, headers=headers))
    print("\n----------- Contents of returned log_info ---------------")
    print(log_info)
    print("\n------------- End of returned log_info -----------------")

Get Gesture Phase Segmentation using defaults - simple and easy!
Downloading Gesture-Phase-Segmentation dataset from UCI ML Repository
Unzipping Gesture Phase Segmentation file in ./gesture_phase_dataset directory
Train: dict_keys(['a1_raw', 'a2_raw', 'a3_raw', 'b1_raw', 'b3_raw'])
Valid: dict_keys([])
Test : dict_keys(['c1_raw', 'c3_raw'])

 Array     shape          data type
--------  -------------  -----------
x_train:  (469, 30, 18)  float32
y_train:  (469, 5)       float64
x_test:   (501, 30, 18)  float32
y_test:   (501, 5)       float64

 ------------------------------------------------------------------------
Get Gesture Phase Segmentation with validation and info file

Train: dict_keys(['a1_raw', 'a2_raw', 'a3_raw', 'b1_raw', 'b3_raw'])
Valid: dict_keys([])
Test : dict_keys(['c1_raw', 'c3_raw'])
Warning: Due to limited subjects the validation group is a stratified
90/10 split of the training group.  It is not subject independent.

 Array     shape          data type
--------  -------------  -----------
x_train:  (422, 30, 18)  float32
y_train:  (