# Train and Test Split

In machine learning a common praxis is to split the data into random subsets for training, testing and validation. However, this praxis is often critisised as it is only for certain subsets of data (for those data where each instance really is i.i.d.) the best choice. If your data depends on each other, this splits are not valid. By using sliding windows I remove a part of the temporal dependencies. However, since the windows overlapp each other just using a random split gives a high probability the data was already seen. Therefor we need an other strategy for the train / validation / test set split. Especially cross validation is not easily done here.

The strategy I choose therefor is the following: I save one random selected user for the test set. For the validation set I pick complete sections of windows over a label and some longer sections for the zero class. Since each user performs each of the 31 gestures at least 5 times (not completely true, because of data cleaning sometimes a label was removed) I select one of these repitions randomly from each user and remove all sequences into the validation set. I discard the overlapping zero class windows, as the zero class is dominant anyway, and that does not hurt the performance.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import gc
import pickle
import pathlib
import tqdm
import numpy as np
import numpy.random
import pandas as pd
import gestureanalysis.utils as utils

In [3]:
base_path = "/home/jsimon/Documents/thesis/gesture-analysis/data/"
time_groups_path_corrected_pickl = base_path+"transformed/time_added/all/time-and-groups-corrected-all.pkl"
stats_added_base_path = base_path+"transformed/stats_added/all/"
stats_added_path_pickl = stats_added_base_path+"raw_stats-added-all.pkl"
gyro_calibration_path = base_path+'../scripts/gestureanalysis/gyro_offset.txt'

In [4]:
# check working directory and adopt if needed
import os
os.getcwd()

'/home/jsimon/Documents/thesis/gesture-analysis/scripts'

In [5]:
# in case you need to reload, and know it exists:
with open( time_groups_path_corrected_pickl, "rb" ) as users_pickle_file:
    users = pickle.load(users_pickle_file)

In [6]:
# get data of CF58 as a test set
path = f"{stats_added_base_path}AB73-window-data.pkl"
with open( path, "rb" ) as windows_file:
    testdata = pickle.load(windows_file)
# honestly I do not use the code here

I just misuse user CF58 a bit here to explain the labels. We have a matrix of all gestures. For each window the number of how often that gesture is performed at that window is written in the label matrix. The gesture is way shorter than the window, but long enough to be captured by several windows. How many windows capture that gesture is different. However it is always a consecutive group of window indexes who see the gesture. We cound the number of groups for a gesture and select a random number. This gesture is taken away from the set into the validation set. Then the windows are deleted plus the windows till the next and previous gesture. If it was the first or the last gesture 20 windows are used as a default.

In [7]:
testdata['winlabels']['(1) One'][testdata['winlabels']['(1) One'] > 0][5:15]

2066    61
2067    60
2068    30
2077    25
2078    55
2079    85
2080    85
2081    85
2082    85
2083    80
Name: (1) One, dtype: int64

In [8]:
def find_groups(occurences):
    indexes = list(occurences.index)
    groups = []
    start = indexes[0]
    for i in range(1,len(indexes)):
        last = indexes[i-1]
        current = indexes[i]
        if current - last == 1:
            continue
        else:
            groups.append((start, last))
            start = current
    groups.append((start, current))
    return groups

In [9]:
def gaps(groups):
    for g1,g2 in zip(groups[1:], groups[:-1]):
        print(f'gap: {g2[0] - g1[1]}')

In [10]:
def append(list_of_dfs):
    df = list_of_dfs[0]
    for i in range(1, len(list_of_dfs)):
        df = df.append(list_of_dfs[i], ignore_index=True)
        df.reset_index()
    return df

In [11]:
def nocat(list_of_dfs):
    for df in list_of_dfs:
        df.columns=df.columns.astype('str')

In [12]:
def remove_group(groups, selected):
    if selected == 0:
        start = groups[0][0] - 20
    else:
        start = groups[selected-1][1] # end of last group
    
    if selected == len(groups)-1:
        end = groups[selected][1] + 20
    else:
        end = groups[selected+1][0] # start of next group
    return start, end

In [13]:
def find_zero_class(df):
    zero_idx = df[df.columns[0]] == 0
    for c in df.columns[1:]:
        zidx = df[c] == 0
        zero_idx = zero_idx & zidx
    return zero_idx

In [14]:
def filter_groups_larger_than(groups, criterion):
    deltas = map(lambda x: x[1] - x[0], groups)
    filtered = filter(lambda x: x[0] > criterion, zip(deltas, groups))
    filtered_groups = map(lambda x: x[1], filtered)
    return list(filtered_groups)

In [15]:
# train validation split:

# what is missing: a random zero class part
lbl_dfs = []
data_dfs = []

lbl_train_dfs = []
data_train_dfs = []
for user, udata in tqdm.tqdm_notebook(users.items()):
    if 'glove_merged' not in udata:
        print(f"skipping user {user}, no data")
        continue
    if user == "CF58":
        print("user CF58 is kept away as a test")
    path = f"{stats_added_base_path}{user}-window-data.pkl"
    with open( path, "rb" ) as windows_file:
        testdata = pickle.load(windows_file)
    data = testdata['windata']
    label = testdata['winlabels']
    print(f'user {user} has {len(label.columns)}')
    cnt = 0
    smpls = 0
    groups_to_remove = []
    for gesture in label.columns:
        occurences = label[gesture][label[gesture] > 0]
        if len(occurences) == 0:
            print(groups)
            print('---------------------')
            continue 
        groups = find_groups(occurences)
        if len(groups) == 1:
            print(groups)
            print('---------------------')
            continue 
        selected = np.random.randint(0, len(groups))
        group = groups[selected]
        lbl_windows = label[group[0]:group[1]].copy()
        smpls += len(lbl_windows)
        #print(f'add {len(lbl_windows)} of {gesture} for {user}')
        data_windows = data[group[0]:group[1]].copy()
        lbl_dfs.append(lbl_windows)
        data_dfs.append(data_windows)
        groups_to_remove.append(remove_group(groups, selected))
        cnt += 1
    print(f'added {cnt} gestures for {user}')
    print(f'data len: {len(data)}; label len: {len(label)}')
    for g in groups_to_remove:
        s,e = g
        idx = data[s:e].index
        data = data.drop(idx, axis=0)
        label = label.drop(idx, axis=0)
    print(f'after removel: data len: {len(data)}; label len: {len(label)}')
    
    # transfer a sample for the zero class, delete the whole section
    smpls = int(smpls / 31)
    zero_remove = smpls + 20
    zero_groups = find_groups(label[find_zero_class(label)])
    candidates = filter_groups_larger_than(zero_groups, zero_remove)
    selected = np.random.randint(0, len(candidates))
    group = candidates[selected]
    middle = ((group[1] - group[0])//2) + group[0]
    sampleg = (middle-smpls//2, middle+smpls//2)
    lbl_windows = label[sampleg[0]:sampleg[1]].copy()
    data_windows = data[sampleg[0]:sampleg[1]].copy()
    lbl_dfs.append(lbl_windows)
    data_dfs.append(data_windows)
    idx = data[group[0]:group[1]].index
    data = data.drop(idx, axis=0)
    label = label.drop(idx, axis=0)
    
    print(f'after zero removel: data len: {len(data)}; label len: {len(label)}')
    
    lbl_train_dfs.append(label)
    data_train_dfs.append(data)
    gc.collect()
print(len(lbl_dfs))
nocat(lbl_dfs)
valid_label_df = append(lbl_dfs)
valid_data_df = append(data_dfs)
lbl_dfs = None
data_dfs = None
gc.collect()

nocat(lbl_train_dfs)
train_label_df = append(lbl_train_dfs)
lbl_train_dfs = None
gc.collect()
train_data_df = append(data_train_dfs)
data_train_dfs = None
gc.collect()

HBox(children=(IntProgress(value=0, max=23), HTML(value='')))

user AB73 has 31
added 31 gestures for AB73
data len: 6093; label len: 6093
after removel: data len: 5306; label len: 5306
after zero removel: data len: 5276; label len: 5276
skipping user AE30, no data
user AF82 has 31
added 31 gestures for AF82
data len: 3521; label len: 3521
after removel: data len: 2662; label len: 2662
after zero removel: data len: 2557; label len: 2557
user AL29 has 31
added 31 gestures for AL29
data len: 6135; label len: 6135
after removel: data len: 5348; label len: 5348
after zero removel: data len: 5348; label len: 5348
user AW18 has 31
added 31 gestures for AW18
data len: 3431; label len: 3431
after removel: data len: 2528; label len: 2528
after zero removel: data len: 2528; label len: 2528
user CB23 has 31
added 31 gestures for CB23
data len: 3504; label len: 3504
after removel: data len: 2633; label len: 2633
after zero removel: data len: 2546; label len: 2546
user CB24 has 31
added 31 gestures for CB24
data len: 3529; label len: 3529
after removel: data l

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


252

In [17]:
print(f'validation data: {len(valid_data_df)}, train data: {len(train_data_df)}')
n = len(train_data_df) + len(valid_data_df)
print(f'validation data: {len(valid_data_df)/n:g}, train data: {len(train_data_df)/n:g}')

validation data: 6186, train data: 77610
validation data: 0.0738221, train data: 0.926178


In [20]:
with open( stats_added_base_path+'validation.pkl', "wb" ) as users_pickle_file:
    ds = { 'valid' : {'data' : valid_data_df, 'labels': valid_label_df} }
    pickle.dump(ds, users_pickle_file)
    
with open( stats_added_base_path+'train.pkl', "wb" ) as users_pickle_file:
    ds = { 'train' : {'data' : train_data_df, 'labels': train_label_df} }
    pickle.dump(ds, users_pickle_file, protocol=4)

In [21]:
len(valid_data_df.columns)

7201