# Data

We collected data from 22 participants (8 female and 14 male), aged between 24 and 40. Each participant repeated all 31 glove gestures either five or ten times. During the experiment we recorded data from both the glove and the Myo armband; in this thesis we concentrate on the glove. As described in the Hardware section of the thesis, the glove has 64 channels. There are 13 flex sensors (two for each finger, an additional one for the thumb, and two for wrist flexion). Each fingertip has one pressure sensor. There is also an IMU on each finger, one in the middle of the palm, and one at the wrist, giving 7 IMUs and 5 pressure sensors in total. Each IMU consists of a 3-axis accelerometer and a 3-axis gyroscope. Additionally, there is a 3-axis magnetometer on the palm.


The glove loops through the sensors every $12\,\mathrm{ms}$, reads the current values, and sends the resulting array to a computer, which gives a sampling rate of $83.\overline{3}\,\mathrm{Hz}$. This array is then written to a .csv file. During the experiment the data was saved automatically as follows: for each participant, a folder named after the participant code was created. In that folder, files named *CODE_TYPE_TIMESTAMP.csv* were created, with *CODE* being the participant code, *TYPE* being one of glove, myo or labels, and *TIMESTAMP* being in the format *Year_Month_Day_Hour_Minute_Second*. An example is *PS42_glove_2015_08_13_15_06_08.csv*.
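
As a small illustration of the naming scheme and the sampling rate, the following sketch splits a file name into its three parts. The helper name `parse_recording_name` is hypothetical and not part of the gestureanalysis package, and the sketch assumes that the participant code itself contains no underscore.

```python
from datetime import datetime

SAMPLE_PERIOD_S = 0.012  # 12 ms between two readings, as stated in the text
SAMPLING_RATE_HZ = 1 / SAMPLE_PERIOD_S  # 83.33... Hz


def parse_recording_name(filename):
    """Split CODE_TYPE_TIMESTAMP.csv into code, type and start time.

    Hypothetical helper; assumes the participant code has no underscore.
    """
    stem = filename[:-len(".csv")]
    code, rec_type, stamp = stem.split("_", 2)
    started = datetime.strptime(stamp, "%Y_%m_%d_%H_%M_%S")
    return code, rec_type, started


code, rec_type, started = parse_recording_name("PS42_glove_2015_08_13_15_06_08.csv")
print(code, rec_type, started.isoformat())
print(round(SAMPLING_RATE_HZ, 2), "Hz")
```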


Files of type glove contain all channels of the glove, files of type myo contain all channels of the Myo armband, and files of type labels contain one row per recorded label with the start and end index in both the glove and the myo data. Combined into one big Python pickle, the raw data amounts to around **3.4 GB**.


If an experiment had to be interrupted, because a short break was needed or there was a problem with a sensor, it could be resumed later. In this case several of the above files exist in the user's folder, each with a different starting timestamp. A first step is to create one dataset suitable for exploration out of these recording sessions and individual files. For this reconstruction, let us recap the most important structural properties of the data.


The data is time series data with a set of channels from the glove. These channels can have occasional errors and contain noise, but in general there should not be many missing values or the like. The time series is sampled from the glove at a fixed frequency. While there are many channels, only a small set of sensor types exists (accelerometer, gyroscope, magnetometer, flex sensors, and pressure sensors), and all sensors of one type share, for example, their characteristic value range. Most importantly, there is a set of different label types that form a hierarchy. Every gesture is labeled. By the protocol of the experiment, a participant is only allowed to perform a gesture when a program prompts them to do so, and then has 3 seconds to perform it. These 3-second windows are annotated automatically as the automatic label. Within those 3 seconds an experimenter observed the gesture and pressed 1 during the dynamic part of the gesture and 2 during the static part. These timings were combined to form the manual label type. As a hierarchy this means that the automatic label defines the maximal start and end time, the manual label in theory marks the time of the actual gesture, and the dynamic and static labels must in turn be shorter than and lie within the manual label. We will use these properties of the label types for integrity checks of the data.
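
The nesting of the label types can be written down as a simple containment check. This is only a sketch under the assumption that every label is available as a (start, end) pair of timestamps; the function names `contains` and `hierarchy_ok` are made up for illustration and do not exist in the gestureanalysis package.

```python
from datetime import datetime, timedelta


def contains(outer, inner):
    """True if the (start, end) interval inner lies inside outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def hierarchy_ok(automatic, manual, dynamic, static=None):
    """Check that manual lies in automatic, and dynamic/static lie in manual."""
    if not contains(automatic, manual):
        return False
    if not contains(manual, dynamic):
        return False
    return static is None or contains(manual, static)


def at(seconds):
    """Timestamp a number of seconds after an arbitrary example start time."""
    return datetime(2015, 8, 13, 15, 6, 8) + timedelta(seconds=seconds)


automatic = (at(0.0), at(3.0))
manual = (at(0.5), at(2.5))
dynamic = (at(0.5), at(1.5))
static = (at(1.5), at(2.5))

good = hierarchy_ok(automatic, manual, dynamic, static)
# A manual label that runs past the automatic window violates the hierarchy.
bad = hierarchy_ok(automatic, (at(0.5), at(3.5)), dynamic, static)
print(good, bad)
```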

In [1]:
# basic setup for the notebook: autoreload and loading modules
%load_ext autoreload
%autoreload 2

In [2]:
import os
import pickle
import tqdm
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import gestureanalysis.read_raw as rr
import gestureanalysis.raw_preprocessing_and_cleaning as rp
import gestureanalysis.utils as utils
import gestureanalysis.specific_utils as sutils
from gestureanalysis.constants import Constants

In [3]:
base_path = "/home/jsimon/Documents/thesis/gesture-analysis/data/"
base_path_raw = base_path+"raw/all"
base_path_pickl = base_path+"transformed/raw-pickeled/all/raw-all.pkl"
time_path_pickl = base_path+"transformed/time_added/all/time-all.pkl"
time_groups_path_corrected_pickl = base_path+"transformed/time_added/all/time-and-groups-corrected-all.pkl"

In [4]:
# check working directory and adapt if needed
os.getcwd()

'/home/jsimon/Documents/thesis/gesture-analysis/scripts'

## Data Loading

We first have to load the data. The function rr.read_raw(base_path_raw) iterates through a directory of user data, takes each folder name as the username, and reads all .csv files of that user. In this first collection each file is read into an individual pandas DataFrame and stored, together with its filename, in the user's dictionary under the key 'glove', 'myo', or 'label'.

In [5]:
def read_users():
    return rr.read_raw(base_path_raw)
users = utils.try_pickl_or_recreate(base_path_pickl, read_users)

We have now loaded 21 users, each with the following keys:

In [6]:
# users:
print(list(users.keys()))
# keys for one example:
print(list(users['AB73'].keys()))

['AB73', 'AE30', 'AF82', 'AL29', 'AW18', 'CB23', 'CB24', 'CF58', 'DG12', 'DH42', 'DL24', 'JL61', 'JQ28', 'JS52', 'MF20', 'MS55', 'PC29', 'PM32', 'PS42', 'RR45', 'RW32', 'SF1', 'YW13']
['filecount', 'files', 'glove', 'label', 'myo']


## Preprocessing

The initial preprocessing consists of adding the labels, converting the index to the time domain, and then combining the individual data files (there is more than one only if interruptions occurred) into one file with a time-domain index.

- First, the labeling information is added back into the glove and the myo data. This has to be done first because the labels are stored with a start and end index relative to the row number within that file. There are 4 types of labels:
 - Automatic: after a countdown, the experiment program asks the user to perform the gesture within three seconds. The start and end index of this window are tagged with the label_automatic type.
 - Manual: the manual label is the combination of the dynamic and the static label and is recorded as the label_manual type.
 - Dynamic: while the user performed a gesture, an experimenter manually pressed 1 for as long as the dynamic part of the gesture was taking place. The dynamic part is defined as the part where arm and/or fingers move to form a gesture. The start and end index of this manual labeling are recorded in the label_dynamic type.
 - Static: when the dynamic part ends, a static part often follows. The static part is where the arm rests in a pose forming the symbolic shape the gesture represents (like thumbs up). If that part is present, the experimenter labels it manually by pressing 2. The start and end index are recorded in the label_static type.

- After that, the time dimension must be recovered correctly: each recording file carries a timestamp in its name, which is used as the base. After the initial reading, each pandas DataFrame is indexed from *0* to _len(records) - 1_. The time domain can be recovered by adding the index number, multiplied by the time delta between two samples, to the base timestamp. We used a fixed frequency of 83.3 Hz, i.e. a time delta of 0.012 s.
- Up to this point, if a recording was split into several sessions, each file was processed individually, because the index is relative to a file and is needed to recover the frames a label belongs to. Now that the labels are assigned and the time domain is added, we can concatenate the individual recordings into one big recording. This is saved under the "glove_merged" key in each user's data dictionary.
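
The last two steps can be illustrated on toy data. The sketch below is not the code used in this notebook; it assumes a 12 ms sample period as stated above, and the column name `flex_0` as well as the helper `to_time_index` are invented for the example.

```python
from datetime import datetime

import pandas as pd

DT_MS = 12  # sample period in milliseconds, taken from the text above


def to_time_index(frame, start):
    """Return a copy of frame whose 0..n-1 row index is replaced by timestamps."""
    out = frame.copy()
    offsets = pd.to_timedelta(out.index.to_numpy() * DT_MS, unit="ms")
    out.index = pd.Timestamp(start) + offsets
    return out


# Two sessions of the same (made-up) participant, each indexed from 0.
first = pd.DataFrame({"flex_0": [1, 2, 3]})
second = pd.DataFrame({"flex_0": [4, 5]})

# Convert each file on its own, then concatenate in chronological order.
merged = pd.concat([
    to_time_index(first, datetime(2015, 8, 13, 15, 6, 8)),
    to_time_index(second, datetime(2015, 8, 13, 15, 20, 0)),
])
print(merged)
```

Converting every file before concatenating matters: the row numbers restart at 0 in each file, so they are only meaningful together with that file's base timestamp.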

In [7]:
def add_labels(gdata, ldata):
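    # add one (initially empty) column per label type:

    gdata['label_automatic'] = np.nan
    gdata['label_manual'] = np.nan
    gdata['label_dynamic'] = np.nan
    gdata['label_static'] = np.nan

    # split the label rows by type: 'G' rows are automatic, 'L' rows are manual;
    # manual rows are split further into dynamic (1) and static (2) parts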
    automatic = ldata[ldata['manual_L_vs_automatic_G'] == 'G']
    manual = ldata[ldata['manual_L_vs_automatic_G'] == 'L']
    dynamic = manual[manual['aut0_dyn1_static2'] == 1]
    static = manual[manual['aut0_dyn1_static2'] == 2]

    for _, row in automatic.iterrows():
        rp.add_label(gdata, row, 'label_automatic')
    for _, row in manual.iterrows():
        rp.add_label(gdata, row, 'label_manual')
    for _, row in dynamic.iterrows():
        rp.add_label(gdata, row, 'label_dynamic')
    for _, row in static.iterrows():
        rp.add_label(gdata, row, 'label_static')


def transform_index_to_time(fname, gdata, data):
    fdate = fname[-23:-4]
    startdate = datetime.strptime(fdate, "%Y_%m_%d_%H_%M_%S")

    offsets = gdata.index.values * Constants.dt_t
    times = startdate + offsets
    tmp = pd.to_datetime(times)
    gdata.index = tmp

    if 'glove_merged' in data:
        old_data = data['glove_merged']
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        data['glove_merged'] = pd.concat([old_data, gdata])
    else:
        data['glove_merged'] = gdata

In [8]:
# preprocess_raw iterates over the users dictionary and applies add_labels and
# transform_index_to_time to each data frame
rp.preprocess_raw(users, add_labels, transform_index_to_time)

ignore user  AE30
ignore user  AE30



This adds an additional field 'glove_merged' to each of the 21 users. This field contains the concatenated data of the individual data files, with the index converted to the time domain and the labels added.

In [9]:
# users:
print(list(users.keys()))
# keys for one example:
print(list(users['AB73'].keys()))

['AB73', 'AE30', 'AF82', 'AL29', 'AW18', 'CB23', 'CB24', 'CF58', 'DG12', 'DH42', 'DL24', 'JL61', 'JQ28', 'JS52', 'MF20', 'MS55', 'PC29', 'PM32', 'PS42', 'RR45', 'RW32', 'SF1', 'YW13']
['filecount', 'files', 'glove', 'label', 'myo', 'glove_merged']


- _filecount_ is the number of files recorded for that user. In an experiment without interruption that number is 3; with interruptions it is correspondingly higher
- _files_ is the list of file names
- _glove_ is a list of dictionaries. Each dictionary has a key file naming the specific csv file with glove sensor data, and a key data holding the pandas DataFrame with that file's data
- _label_ is like glove, but file and data refer to the label data
- _myo_ is like glove, but file and data refer to the myo data
- *glove_merged* holds all glove data in one pandas DataFrame after adding the labels, transforming the index, and concatenating the files

In [10]:
# in case the script above worked, execute that line:
with open( time_path_pickl, "wb" ) as users_pickle_file:
    pickle.dump(users, users_pickle_file)

In [11]:
# in case you need to reload, and know it exists:
with open( time_path_pickl, "rb" ) as users_pickle_file:
    users = pickle.load(users_pickle_file)

### Recovering Label Groups in Data

We have now added the labels from the labels file to each row of the gesture data. After that we recovered the time domain, so the dataset now has concrete timestamps as the index of each row. The drawback is that we have lost the start and end index of each label, which are useful for inspecting the raw data and for analysis before a model is trained.

An easy way is to go through the merged data and recover the groups by scanning through the index and treating every consecutive streak of labeled rows as one group. Concretely, once an occurrence of a label is found, each following labeled row counts towards the same group as long as it follows within a few time deltas of 0.012 s (the code below passes Constants.dt_t*2 as the threshold). As soon as this requirement is violated the group is closed, and the next labeled row starts a new group.
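
The scanning idea can be sketched in a few lines. This is a simplified stand-in and not the implementation of utils.find_consecutive_groups, whose internals are not shown here; the function name `consecutive_groups` and the choice of a maximum gap of two time deltas are assumptions for the example.

```python
from datetime import datetime, timedelta


def consecutive_groups(timestamps, max_gap):
    """Group sorted timestamps into (start, end) runs.

    A new run starts whenever the gap to the previous timestamp exceeds max_gap.
    """
    groups = []
    start = prev = None
    for t in timestamps:
        if start is None:
            start = prev = t
        elif t - prev <= max_gap:
            prev = t
        else:
            groups.append((start, prev))
            start = prev = t
    if start is not None:
        groups.append((start, prev))
    return groups


dt = timedelta(milliseconds=12)
t0 = datetime(2015, 8, 13, 15, 6, 8)
# Labeled rows at sample 0, 1, 3 (one sample missing) and then again at 10, 11.
stamps = [t0 + i * dt for i in (0, 1, 3, 10, 11)]
groups = consecutive_groups(stamps, max_gap=2 * dt)
print(groups)
```

With a tolerance of two time deltas, a single missing sample does not split a group, while the longer pause between sample 3 and sample 10 does.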

In [14]:
duration_allowed_error = Constants.dt_t*3
dae = duration_allowed_error
z = timedelta(milliseconds=0)

def label_groups_for_one_label_type(label_type, glove_merged):
    g_lbls = glove_merged[glove_merged[label_type].notnull()]
    index = g_lbls.index.tolist()
    groups = utils.find_consecutive_groups(index, Constants.dt_t*2, use_tqdm=False)
    return groups

def recover_label_groups_from_data(user_data):
    glove_merged = user_data['glove_merged']
    
    groups = label_groups_for_one_label_type("label_automatic", glove_merged)
    user_data['start_end_automatic_groups'] = groups
    
    groups = label_groups_for_one_label_type("label_manual", glove_merged)
    user_data['start_end_manual_groups'] = groups
    
    groups = label_groups_for_one_label_type("label_dynamic", glove_merged)
    user_data['start_end_dynamic_groups'] = groups
    
    groups = label_groups_for_one_label_type("label_static", glove_merged)
    user_data['start_end_static_groups'] = groups

for key, data in tqdm.tqdm_notebook(users.items()):
    if not 'glove_merged' in data:
        continue
    recover_label_groups_from_data(data)


After recovery we have to check for a set of possible errors. First, we enforce that the labels keep their general structure: a gesture can only happen within an automatic label. Within the automatic label there is a manual label consisting of a dynamic and an optional static part. To create a manual label the experimenter has to keep the corresponding button pressed, so small errors such as losing contact for a few hundred milliseconds can happen. Therefore, all dynamic fragments within one automatic label, from the first start to the last end, are merged into one dynamic group. The same is done with the manual label and, if present, the static label. I then look up the gesture type and create a LabelGroup object out of these four label types (automatic, manual, dynamic, static). Any incomplete group is removed from the dataset as incorrect. This removes about 1.9616% of the labels present in the dataset.
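
The merging of fragments inside one automatic window can be sketched as follows. This is an illustration only and not the implementation of sutils.combine_ranges_contained; the function name `merge_fragments` is made up, and times are plain seconds to keep the example short.

```python
def merge_fragments(window, fragments):
    """Merge all (start, end) fragments that lie inside window into one span.

    Returns None if no fragment lies inside the window.
    """
    inside = [f for f in fragments if window[0] <= f[0] and f[1] <= window[1]]
    if not inside:
        return None
    return (min(f[0] for f in inside), max(f[1] for f in inside))


automatic = (0.0, 3.0)
# The button lost contact once, so the dynamic part arrived as two fragments;
# the third fragment belongs to a later automatic window.
dynamic_fragments = [(0.5, 1.0), (1.2, 1.6), (4.0, 4.5)]
static_fragments = [(4.6, 5.5)]

dynamic = merge_fragments(automatic, dynamic_fragments)
static = merge_fragments(automatic, static_fragments)
print(dynamic, static)
```

In this example the static part is simply absent for the first window, which is allowed; a window without a dynamic or manual span would be dropped as incomplete.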

In [15]:
def merge_subautomatic_labels(user_data, show_details):
    groups_automatic = user_data['start_end_automatic_groups']
    groups_automatic.sort()
    groups_manual = user_data['start_end_manual_groups']
    groups_dynamic = user_data['start_end_dynamic_groups']
    groups_static = user_data['start_end_static_groups']
    dyn_idx, sts_idx, man_idx = 0, 0, 0
    lbl_groups = []
    for a in groups_automatic:
        dyn_idx, dgs = sutils.combine_ranges_contained(a, dyn_idx, groups_dynamic)
        sts_idx, sgs = sutils.combine_ranges_contained(a, sts_idx, groups_static)
        man_idx, mgs = sutils.combine_ranges_contained(a, man_idx, groups_manual)
        if (len(dgs) == 1) and (len(mgs) == 1):
            d = sutils.t_to_t(dgs[0])
            m = sutils.t_to_t(mgs[0])
            s = sutils.t_to_t(sgs[0]) if len(sgs) == 1 else None
            # use .iloc for positional access; plain [1] on a time-indexed Series is deprecated
            label = user_data['glove_merged'].loc[d.start:d.end, 'label_automatic'].iloc[1]
            lg = sutils.LabelGroup(label, sutils.t_to_t(a), m, s, d)
            lbl_groups.append(lg)
    if (len(lbl_groups) != len(groups_automatic)) and show_details:
        print("changed size of total groups: ", len(groups_automatic), len(lbl_groups), ' that is ', (1-len(lbl_groups)/len(groups_automatic))*100, "%")
    
    user_data['lbl_groups'] = lbl_groups
    return lbl_groups

In [18]:
total_len = 0
total_new_len = 0
SHOW_DETAILS=True
for key, data in tqdm.tqdm_notebook(users.items()):
    if not 'glove_merged' in data:
        continue
    if SHOW_DETAILS: print('check ', key)
    total_len += len(data['start_end_automatic_groups'])
    merge_subautomatic_labels(data, SHOW_DETAILS)
    total_new_len += len(data['lbl_groups'])
print('in total we removed ', (1-total_new_len/total_len)*100, '% of automatic labels')

check  AB73
changed size of total groups:  310 307  that is  0.9677419354838679 %
check  AF82
changed size of total groups:  157 155  that is  1.273885350318471 %
check  AL29
changed size of total groups:  310 308  that is  0.6451612903225823 %
check  AW18
changed size of total groups:  155 153  that is  1.2903225806451646 %
check  CB23
changed size of total groups:  155 152  that is  1.9354838709677469 %
check  CB24
changed size of total groups:  157 154  that is  1.9108280254777066 %
check  CF58
changed size of total groups:  171 167  that is  2.3391812865497075 %
check  DG12
changed size of total groups:  155 154  that is  0.6451612903225823 %
check  DH42
changed size of total groups:  155 153  that is  1.2903225806451646 %
check  DL24
changed size of total groups:  310 301  that is  2.9032258064516148 %
check  JL61
changed size of total groups:  165 162  that is  1.8181818181818188 %
check  JQ28
changed size of total groups:  310 298  that is  3.8709677419354827 %
check  JS52
chang

### Recovering Label Groups from the Labels File

The alternative way to recover label groups is to go over the original label and glove files. The label file contains a start and end index per label, hence it already has the form of a group. The goal is to convert that into the time domain. With this approach I multiply the index by the time delta and add the timestamp of the file to get the group boundaries directly. This should give the same timings as the approach above.

We will see that the timings are almost the same for the automatic labels: the start is always the same, but the end time is usually about 1-2 time deltas off. I cannot yet explain this behaviour; for now we just take the longer sequence.

In [19]:
def grups_over_label_files(username, data, get_labels):
    labelsbar = tqdm.tqdm_notebook(data['label'], leave=False)
    groups = []
    ignored = 0
    for lbl in labelsbar:
        ldata, file = lbl['data'], lbl['file']
        datestr = sutils.datestr_from_filename(file)
        startdate = datetime.strptime(datestr, "%Y_%m_%d_%H_%M_%S")
        
        labels = get_labels(ldata)
        
        for _, row in labels.iterrows():
            group = sutils.recover_group(startdate, row)
            (s,e) = group
            if s == e:
                ignored += 1
                continue
            groups.append(group)
    return groups, ignored, file

def recover_label_groups_from_labels_file(username, user_data):
    glove_merged = user_data['glove_merged']
    
    groups, _, _ = grups_over_label_files(username, user_data, sutils.get_automatic_labels)
    user_data['start_end_automatic_groups_fl'] = groups
    
    groups, _, _ = grups_over_label_files(username, user_data, sutils.get_manual_labels)
    user_data['start_end_manual_groups_fl'] = groups
    
    groups, _, _ = grups_over_label_files(username, user_data, sutils.get_dynamic_labels)
    user_data['start_end_dynamic_groups_fl'] = groups
    
    groups, _, _ = grups_over_label_files(username, user_data, sutils.get_static_labels)
    user_data['start_end_static_groups_fl'] = groups

for key, data in tqdm.tqdm_notebook(users.items()):
    if not 'glove_merged' in data:
        continue
    recover_label_groups_from_labels_file(key, data)

Now we build the LabelGroups analogously to the labels recovered from the data. This method seems to be a bit more complete under this constraint: I only need to remove 1.1902% of the labels this way.

In [20]:
def merge_subautomatic_labels(user_data, show_details):
    groups_automatic = user_data['start_end_automatic_groups_fl']
    groups_automatic.sort()
    groups_manual = user_data['start_end_manual_groups_fl']
    groups_dynamic = user_data['start_end_dynamic_groups_fl']
    groups_static = user_data['start_end_static_groups_fl']
    dyn_idx, sts_idx, man_idx = 0, 0, 0
    lbl_groups = []
    for a in groups_automatic:
        dyn_idx, dgs = sutils.combine_ranges_contained(a, dyn_idx, groups_dynamic)
        sts_idx, sgs = sutils.combine_ranges_contained(a, sts_idx, groups_static)
        man_idx, mgs = sutils.combine_ranges_contained(a, man_idx, groups_manual)
        if (len(dgs) == 1) and (len(mgs) == 1):
            d = sutils.t_to_t(dgs[0])
            m = sutils.t_to_t(mgs[0])
            s = sutils.t_to_t(sgs[0]) if len(sgs) == 1 else None
            # use .iloc for positional access; plain [1] on a time-indexed Series is deprecated
            label = user_data['glove_merged'].loc[d.start:d.end, 'label_automatic'].iloc[1]
            lg = sutils.LabelGroup(label, sutils.t_to_t(a), m, s, d)
            lbl_groups.append(lg)
    if (len(lbl_groups) != len(groups_automatic)) and show_details:
        print("changed size of total groups: ", len(groups_automatic), len(lbl_groups), ' that is ', (1-len(lbl_groups)/len(groups_automatic))*100, "%")
    
    user_data['lbl_groups_fl'] = lbl_groups
    return lbl_groups

In [21]:
total_len = 0
total_new_len = 0
SHOW_DETAILS=True
for key, data in tqdm.tqdm_notebook(users.items()):
    if not 'glove_merged' in data:
        continue
    if SHOW_DETAILS: print('check ', key)
    total_len += len(data['start_end_automatic_groups_fl'])
    merge_subautomatic_labels(data, SHOW_DETAILS)
    total_new_len += len(data['lbl_groups_fl'])
print('in total we removed ', (1-total_new_len/total_len)*100, '% of automatic labels')

check  AB73
changed size of total groups:  310 307  that is  0.9677419354838679 %
check  AF82
changed size of total groups:  157 156  that is  0.6369426751592355 %
check  AL29
changed size of total groups:  310 309  that is  0.3225806451612856 %
check  AW18
changed size of total groups:  155 154  that is  0.6451612903225823 %
check  CB23
changed size of total groups:  155 152  that is  1.9354838709677469 %
check  CB24
changed size of total groups:  157 154  that is  1.9108280254777066 %
check  CF58
changed size of total groups:  171 167  that is  2.3391812865497075 %
check  DG12
changed size of total groups:  155 154  that is  0.6451612903225823 %
check  DH42
changed size of total groups:  155 154  that is  0.6451612903225823 %
check  DL24
changed size of total groups:  310 304  that is  1.9354838709677469 %
check  JL61
changed size of total groups:  165 162  that is  1.8181818181818188 %
check  JQ28
changed size of total groups:  310 307  that is  0.9677419354838679 %
check  JS52
chan

### Check for errors / bad label quality

When I first wrote the program, several problems with the labels existed. By implementing both approaches I could correct many errors. But the fraction of labels removed after checking the first set of basic constraints does not tell us which set of label groups is better, or whether either is better at all. First I need to check the constraints of the groups more rigorously. These are: the automatic label has to be about 3 s long (I chose 2.9 to 3.1 s as the interval). The start time of the manual group has to equal the start time of the dynamic group. The end time of the manual group has to equal either the end time of the static group, if present, or the end time of the dynamic group otherwise. The gap between the end time of the dynamic label and the start time of the static label, if present, must not be too large.
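
These constraints translate fairly directly into code. The sketch below is a simplified stand-in for LabelGroup.verify, whose actual implementation is not shown in this notebook: the `Span` tuple, the function `verify_group`, the exact equality of start and end times, and the wording of the reason strings are all assumptions for illustration.

```python
from collections import namedtuple
from datetime import datetime, timedelta

Span = namedtuple("Span", ["start", "end"])


def verify_group(automatic, manual, dynamic, static, max_gap):
    """Return the list of violated constraints (empty if the group is fine)."""
    reasons = []
    length = automatic.end - automatic.start
    if length < timedelta(seconds=2.9):
        reasons.append("automatic label too short")
    elif length > timedelta(seconds=3.1):
        reasons.append("automatic label too long")
    if manual.start != dynamic.start:
        reasons.append("manual and dynamic w. different start")
    # The manual label ends with the static part if present, else with the dynamic part.
    last = static if static is not None else dynamic
    if manual.end != last.end:
        reasons.append("manual w. different end")
    if static is not None and static.start - dynamic.end > max_gap:
        reasons.append("dynamic and static gap too large")
    return reasons


def at(seconds):
    """Timestamp a number of seconds after an arbitrary example start time."""
    return datetime(2015, 8, 13, 15, 6, 8) + timedelta(seconds=seconds)


max_gap = 21 * timedelta(milliseconds=12)  # same multiple of the time delta as in the cell below
manual = Span(at(0.5), at(2.5))
dynamic = Span(at(0.5), at(1.5))
static = Span(at(1.6), at(2.5))

ok = verify_group(Span(at(0), at(3.0)), manual, dynamic, static, max_gap)
short = verify_group(Span(at(0), at(2.0)), manual, dynamic, static, max_gap)
print(ok, short)
```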

On inspection, all constraints are already kept more or less within acceptable bounds. Only automatic labels that are too long or too short are still present. I again remove all of these labels.

In [22]:
def check_label_group_constrains(user_data, groups_key):
    #for aut, man, dyn, stat in zip(groups_automatic, groups_manual, groups_dynamic, groups_static):
    errors_found = 0
    lgl_groups = []
    for g in user_data[groups_key]:
        reasons = []
        if not g.verify(reasons, (Constants.dt_t * 21)):
            errors_found += 1
            print(reasons)
            if not (('automatic label too short' in reasons) or ('automatic label too long' in reasons)):
                lgl_groups.append(g) 
        else:
            lgl_groups.append(g)
    user_data[groups_key] = lgl_groups
    return errors_found

There are 45 constraint errors in total, but only 33 of them are bad enough that the group has to be removed from the labels recovered from the data.

In [23]:
errors_found = 0
for key, data in tqdm.tqdm_notebook(users.items()):
    if not 'glove_merged' in data:
        continue
    print('check ', key)
    errors_found += check_label_group_constrains(data, 'lbl_groups')
print('total errors found: ', errors_found)

check  AB73
['dynamic and static gap too large', 'dynamic much longer than static', 'static and manual w. different end']
check  AF82
['automatic label too short']
['automatic label too short']
check  AL29
check  AW18
['dynamic and static gap too large']
check  CB23
['dynamic and static gap too large']
check  CB24
check  CF58
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
check  DG12
check  DH42
check  DL24
['static and manual w. different end']
check  JL61
['automatic label too short']
['dynamic and static gap too large', 'dynamic much longer than static', 'static and manual w. different end']
['dynamic and static gap too large', 'dynamic much longer than static', 'static and manual w. different end']
['dynamic and static gap too large', 'dynamic much longer than st

For the labels from the label file there are 46 constraint violations, of which 32 need to be removed.

In [24]:
errors_found = 0
for key, data in tqdm.tqdm_notebook(users.items()):
    if not 'glove_merged' in data:
        continue
    print('check ', key)
    errors_found += check_label_group_constrains(data, 'lbl_groups_fl')
print('total errors found: ', errors_found)

check  AB73
['dynamic and static gap too large', 'dynamic much longer than static', 'static and manual w. different end']
['manual and dynamic w. different start']
check  AF82
['automatic label too short']
['automatic label too short']
check  AL29
check  AW18
['manual and dynamic w. different start', 'dynamic and static gap too large']
check  CB23
['dynamic and static gap too large']
check  CB24
check  CF58
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
['automatic label too short']
check  DG12
check  DH42
check  DL24
['manual and dynamic w. different start', 'static and manual w. different end']
check  JL61
['automatic label too short']
['dynamic and static gap too large', 'dynamic much longer than static', 'static and manual w. different end']
['dynamic and static gap too large', 'dynamic much 

After these transformations we can check whether corresponding label groups from the two sources are approximately the same. For each group, the gesture label and the start and end times are compared. The labels turn out to be mostly identical: the difference is below 1% for most users, with one user having about 4% of labels different. Again, I simply exclude all labels that differ significantly.
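
To make the comparison concrete, the following is a minimal sketch of what an approximate match between two label groups could look like. The notebook's own `approx` method is not shown here, so the `Span` and `Group` classes, their fields, and the restriction to the automatic span are assumptions made only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Span:
    """Hypothetical stand-in for one labelled time span."""
    start: datetime
    end: datetime


@dataclass
class Group:
    """Hypothetical stand-in for a label group; only the fields needed here."""
    label_name: str
    automatic: Span

    def approx(self, other: "Group", tolerance: timedelta) -> bool:
        # Same gesture, and start and end each within the tolerance.
        return (
            self.label_name == other.label_name
            and abs(self.automatic.start - other.automatic.start) <= tolerance
            and abs(self.automatic.end - other.automatic.end) <= tolerance
        )


t0 = datetime(2015, 8, 20, 19, 13, 17)
from_data = Group("(2) Two", Span(t0, t0 + timedelta(seconds=2)))
from_file = Group("(2) Two", Span(t0 + timedelta(milliseconds=120), t0 + timedelta(seconds=2)))
shifted = Group("(2) Two", Span(t0 + timedelta(seconds=1), t0 + timedelta(seconds=3)))

tolerance = timedelta(milliseconds=500)
print(from_data.approx(from_file, tolerance))  # start differs by 120 ms
print(from_data.approx(shifted, tolerance))    # start and end differ by 1 s
```

With a tolerance of 500 ms, the first pair counts as the same group and the second does not.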

In [25]:
def sort(lst):
    lst.sort(key=lambda lg: lg.automatic.start)

def compare_lbl_groups(user_data):
    lg1 = user_data['lbl_groups']
    lg2 = user_data['lbl_groups_fl']
    new_lg1 = []
    new_lg2 = []
    sort(lg1)
    sort(lg2)
    print(len(lg1))
    print(len(lg2))
    differences = 0
    # Walk both sorted lists in parallel and keep only the groups that match.
    # This algorithm still has a problem with lists of different size if the first entries differ.
    # TODO: fix that!
    g1_idx = 0
    g2_idx = 0
    # Looping on the index bounds also avoids an IndexError for an empty list.
    while g1_idx < len(lg1) and g2_idx < len(lg2):
        g1 = lg1[g1_idx]
        g2 = lg2[g2_idx]
        if g1.approx(g2, timedelta(milliseconds=500)):
            # both sources agree: keep the pair
            new_lg1.append(g1)
            new_lg2.append(g2)
            g1_idx += 1
            g2_idx += 1
        elif g1.automatic.end < g2.automatic.start:
            # g1 ends before g2 starts: advance g1 only
            g1_idx += 1
            print('found additional timepoint in labels from data:')
            #print(g1)
            differences += 1
        elif g2.automatic.end < g1.automatic.start:
            # g2 ends before g1 starts: advance g2 only
            g2_idx += 1
            print('found additional timepoint in labels from labels file:')
            #print(g2)
            differences += 1
        else:
            # the groups overlap but are not approximately equal: drop both
            g1_idx += 1
            g2_idx += 1
            print('difference found:')
            #print(g1.diff(g2, timedelta(milliseconds=500)))
            differences += 1
    rest = utils.rest(lg1, lg2)
    if rest is not None:
        print(f'additionally found {len(rest)} labels')
        #print(rest)
    if differences > 0:
        print("corrected groups differ between methods for ", (differences/min(len(lg1), len(lg2))) * 100, "%")
    user_data['lbl_groups'] = new_lg1
    user_data['lbl_groups_fl'] = new_lg2

In [26]:
for key, data in tqdm.tqdm_notebook(users.items()):
    if 'glove_merged' not in data:
        continue
    compare_lbl_groups(data)

307
307
153
154
found additional timepoint in labels from labels file:
additionally found 1 labels
corrected groups differ between methods for  0.6535947712418301 %
308
309
found additional timepoint in labels from labels file:
additionally found 1 labels
corrected groups differ between methods for  0.3246753246753247 %
153
154
difference found:
found additional timepoint in labels from labels file:
additionally found 1 labels
corrected groups differ between methods for  1.3071895424836601 %
152
152
154
154
158
158
154
154
153
154
found additional timepoint in labels from labels file:
additionally found 1 labels
corrected groups differ between methods for  0.6535947712418301 %
301
304
found additional timepoint in labels from labels file:
found additional timepoint in labels from labels file:
found additional timepoint in labels from labels file:
additionally found 3 labels
corrected groups differ between methods for  0.9966777408637874 %
157
158
found additional timepoint in labels fr

### Correct the labels

After these transformations we have high-quality label groups. The remaining step is to clear the existing per-row labels and recreate them from these label groups.

One might ask whether conservatively excluding labels in this way contaminates the zero class with signals that are actually gestures. This is true, but the zero class contains far more data than the gesture classes anyway. Only a few percent of the labeled data is excluded, and some of that may in fact be wrongly labeled. The rest should not influence the zero class too much. Another option is to keep track of the excluded labels and exclude the corresponding data as well. This is not done for now, as it does not seem to have much of an influence.
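
The alternative mentioned above, dropping the data behind excluded labels instead of leaving it in the zero class, could be sketched roughly as follows. This is not part of the notebook: the toy frame, the `channel_0` column, and the `excluded_spans` list are hypothetical, and the sketch assumes the rejected groups have been collected as `(start, end)` timestamp pairs.

```python
import pandas as pd

# Toy recording: one row every 12 ms, mirroring the glove's sampling interval.
index = pd.date_range("2015-08-20 19:13:17", periods=10, freq="12ms")
frame = pd.DataFrame({"channel_0": range(10)}, index=index)

# Hypothetical list of (start, end) spans whose label groups were rejected.
excluded_spans = [(index[3], index[5])]

# Start by keeping every row, then switch off the rows inside rejected spans.
keep = pd.Series(True, index=frame.index)
for start, end in excluded_spans:
    # Label-based slicing on a sorted DatetimeIndex includes both endpoints.
    keep.loc[start:end] = False

cleaned = frame[keep]
print(len(frame), len(cleaned))
```

In this toy case three of the ten rows fall inside the rejected span, so seven rows remain.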

In [27]:
def reset_labels_of_user(username, user_data, label_group_key):
    glove_merged = user_data['glove_merged']
    # Clear all existing per-row labels (np.nan; the np.NaN alias was removed in NumPy 2.0).
    glove_merged['label_automatic'] = np.nan
    glove_merged['label_manual'] = np.nan
    glove_merged['label_dynamic'] = np.nan
    glove_merged['label_static'] = np.nan
    
    for lg in user_data[label_group_key]:
        glove_merged.loc[lg.automatic.start:lg.automatic.end, 'label_automatic'] = lg.label_name
        glove_merged.loc[lg.manual.start:lg.manual.end, 'label_manual'] = lg.label_name
        glove_merged.loc[lg.dynamic.start:lg.dynamic.end, 'label_dynamic'] = lg.label_name
        if lg.static is not None:
            glove_merged.loc[lg.static.start:lg.static.end, 'label_static'] = lg.label_name
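
The clear-and-rewrite pattern used in `reset_labels_of_user` can be tried on a toy frame. The sketch below is only an illustration: the six-row index, the `"stale"` placeholder, and the chosen start and end rows are made up, and the label column is kept as `object` dtype so that writing strings into it does not depend on dtype upcasting.

```python
import numpy as np
import pandas as pd

# Toy frame with a 12 ms DatetimeIndex and an outdated label in every row.
index = pd.date_range("2015-08-20 19:13:17.300", periods=6, freq="12ms")
toy = pd.DataFrame({"label_automatic": pd.Series("stale", index=index, dtype=object)})

# Step 1: drop whatever labels were there before.
toy["label_automatic"] = pd.Series(np.nan, index=index, dtype=object)

# Step 2: write the label of one hypothetical group; both endpoints are included.
start, end = index[1], index[3]
toy.loc[start:end, "label_automatic"] = "(2) Two"

labelled = toy["label_automatic"].notnull()
print(labelled.tolist())
```

Rows 1 to 3 end up labelled and the remaining rows stay empty, which is the same kind of result as the `automatic[automatic.notnull()]` check further down.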

In [28]:
for key, data in tqdm.tqdm_notebook(users.items()):
    if 'glove_merged' not in data:
        continue
    reset_labels_of_user(key, data, 'lbl_groups_fl') #_fl

In [31]:
# if the cells above ran successfully, save the corrected data:
with open(time_groups_path_corrected_pickl, "wb") as users_pickle_file:
    pickle.dump(users, users_pickle_file)

In [5]:
# to reload the corrected data later (the pickle file must already exist):
with open(time_groups_path_corrected_pickl, "rb") as users_pickle_file:
    users = pickle.load(users_pickle_file)

In [6]:
print(time_groups_path_corrected_pickl)

/home/jsimon/Documents/thesis/gesture-analysis/data/transformed/time_added/all/time-and-groups-corrected-all.pkl


In [12]:
automatic = users["AB73"]["glove_merged"]["label_automatic"]

In [14]:
automatic[automatic.notnull()]

2015-08-20 19:13:17.300    (2) Two
2015-08-20 19:13:17.312    (2) Two
2015-08-20 19:13:17.324    (2) Two
2015-08-20 19:13:17.336    (2) Two
2015-08-20 19:13:17.348    (2) Two
2015-08-20 19:13:17.360    (2) Two
2015-08-20 19:13:17.372    (2) Two
2015-08-20 19:13:17.384    (2) Two
2015-08-20 19:13:17.396    (2) Two
2015-08-20 19:13:17.408    (2) Two
2015-08-20 19:13:17.420    (2) Two
2015-08-20 19:13:17.432    (2) Two
2015-08-20 19:13:17.444    (2) Two
2015-08-20 19:13:17.456    (2) Two
2015-08-20 19:13:17.468    (2) Two
2015-08-20 19:13:17.480    (2) Two
2015-08-20 19:13:17.492    (2) Two
2015-08-20 19:13:17.504    (2) Two
2015-08-20 19:13:17.516    (2) Two
2015-08-20 19:13:17.528    (2) Two
2015-08-20 19:13:17.540    (2) Two
2015-08-20 19:13:17.552    (2) Two
2015-08-20 19:13:17.564    (2) Two
2015-08-20 19:13:17.576    (2) Two
2015-08-20 19:13:17.588    (2) Two
2015-08-20 19:13:17.600    (2) Two
2015-08-20 19:13:17.612    (2) Two
2015-08-20 19:13:17.624    (2) Two
2015-08-20 19:13:17.