# 1. Data preprocessing
This script walks through the preprocessing steps for the HAART and CoST datasets.
*N.B. Execution takes approximately 25 minutes.*

## 1.1. Preliminary

Import modules.

In [0]:
from google.colab import drive
import os
import itertools as it
from pandas import Series
import warnings
import pandas as pd
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.preprocessing import MinMaxScaler
from keras.utils import to_categorical

Mount drive and change current working directory.

In [0]:
drive.mount('/content/drive', force_remount = True)
os.chdir("/content/drive/My Drive/thesis/")

## 1.2. Human-Animal Affective Robot Touch (HAART)

The HAART dataset is available in a form where it has already been split into a training and test set. Import the CSV data.

In [0]:
HAART_import_train = pd.read_csv("data/HAART DataSet/training.csv")
HAART_import_test = pd.read_csv("data/HAART DataSet/testWITHLABELS.csv")

#### 1.2.1. Missing 'Sequence' column in training set

However, the 'Sequence' column is missing from the training set. Each individual gesture capture spans 432 rows (54 Hz x 8 seconds), so a column that counts from 1 to 432 within every capture must be added.

In [0]:
seq = it.cycle(range(1,433)) #Create an infinite iterator with ascending values from 1 to 432.
HAART_import_train["Sequence"] = [next(seq) for count in range(HAART_import_train.shape[0])] #Fill the column with these values.
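
The same column can also be built without an iterator. The following is a minimal sketch, assuming the rows are stored capture by capture and the row count is a multiple of 432; `add_sequence_column` and the toy frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

FRAMES_PER_CAPTURE = 432  # 54 Hz x 8 s, as stated above


def add_sequence_column(df, frames=FRAMES_PER_CAPTURE):
    """Return a copy of df with a 1-based 'Sequence' column that restarts every `frames` rows."""
    if len(df) % frames != 0:
        raise ValueError("row count is not a multiple of the capture length")
    out = df.copy()
    # Repeat the block 1..frames once per capture.
    out["Sequence"] = np.tile(np.arange(1, frames + 1), len(df) // frames)
    return out


# Toy frame with two captures of 432 rows each (values are placeholders).
toy = pd.DataFrame({"ch1": np.zeros(2 * FRAMES_PER_CAPTURE)})
toy = add_sequence_column(toy)
```

The explicit length check makes the assumption visible: if a capture were incomplete, the counter would silently drift out of step with the real capture boundaries.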

#### 1.2.2. Consistent column headers

The headers of the training set contain double quotation marks, whereas those of the test set do not. They must be removed so that the two sets can be joined correctly.

In [0]:
HAART_import_train.columns = HAART_import_train.columns.str.replace('"', '') #Remove quotation marks in train set headers.

The test set (ParticipantID) and the training set (ParticipantNo) use different header names for the participant number. This must also be adjusted.

In [0]:
HAART_import_train = HAART_import_train.rename({"ParticipantNo": "ParticipantID"}, axis = "columns")

The column headers of the training set all contain an unnecessary space that does not occur in the test set. These spaces are removed so that the sets can later be merged on their headers.

In [0]:
HAART_import_train = HAART_import_train.rename(columns = lambda x: x.strip())

#### 1.2.3. Reconnect the old train and test set

The following code reports the dimensions of both datasets. In total, the reconnected set contains 358128 rows (249696 + 108432).

In [0]:
trn_rows, trn_col = HAART_import_train.shape
tst_rows, tst_col = HAART_import_test.shape
print("The training set has",trn_rows,"rows and",trn_col,"columns.") #Give information about training set.
print("The test set has",tst_rows,"rows and",tst_col,"columns.") #Give information about test set.

The training set has 249696 rows and 69 columns.
The test set has 108432 rows and 69 columns.


Join the train and test set on their shared column headers. Use 'how = "outer"' to use the union of keys from both frames.

In [0]:
HAART_merge = pd.merge(HAART_import_test, HAART_import_train, how = "outer")
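
Because both frames now share the same headers, stacking them row-wise is an alternative to a key-based merge. The sketch below uses toy stand-ins rather than the real HAART frames; it is not the method used above, only an illustration of how `pd.concat` aligns on column names and keeps the rows in their original order.

```python
import pandas as pd

# Toy stand-ins for the test and training frames (same columns, different order).
test_df = pd.DataFrame({"ParticipantID": ["P1", "P1"],
                        "Gesture": ["pat", "pat"],
                        "Sequence": [1, 2]})
train_df = pd.DataFrame({"Sequence": [1, 2],
                         "Gesture": ["rub", "rub"],
                         "ParticipantID": ["P2", "P2"]})

# concat aligns on column names and appends the rows in the order given;
# ignore_index=True produces a fresh 0..n-1 index.
stacked = pd.concat([test_df, train_df], ignore_index=True, sort=False)
```

Preserved row order matters later on, because the reshape to (captures, 432, 64) assumes that the 432 rows of each capture are contiguous.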

#### 1.2.4. Remove irrelevant gestures / Instance selection

Only the classes that are available in both datasets will be used. This means that the 'notouch' gesture will be deleted in the HAART dataset. The remaining gestures are 'stroke', 'scratch', 'constant' (= press), 'tickle', 'rub', and 'pat'.

In [0]:
HAART_notouch = HAART_merge[~HAART_merge.Gesture.str.contains("notouch")]

#### 1.2.5. Data harmonization

In order to create one cohesive dataset, several changes have to be made so that the HAART set is compatible with the CoST set.

A problem we encounter is that the HAART gestures are written out in the column, while in the CoST dataset these are represented in numbers, each representing a gesture. For a good harmonization, the written gestures are converted to numbers.

Gestures that will be used, with their numbers:
- 4 Pat
- 7 Press ('constant' in HAART)
- 8 Rub
- 9 Scratch
- 12 Stroke
- 14 Tickle

In [0]:
replace_numbers = {"Gesture": {"stroke": 12, 
                               "scratch": 9, 
                               "constant": 7, 
                               "tickle": 14, 
                               "rub": 8, 
                               "pat": 4}
                  } #Create a dictionary which contains each category with the number.

HAART_replace = HAART_notouch.copy() #Make a copy to prevent a "SettingWithCopyWarning".

HAART_replace.replace(replace_numbers, inplace = True) #Use this dictionary to replace text with numbers.

Irrelevant columns are removed and the remaining columns are renamed so that the names match the CoST set.

In [0]:
HAART_replace.drop(["Substrate", "Cover"], axis = 1, inplace = True) #Remove irrelevant columns.

new_cols = ["ch" + str(number) for number in range(1, 65)] #Create new header names (similar to CoST) for all 64 channels.

HAART_replace.rename(columns = dict(zip(HAART_replace.columns[3:], new_cols)), inplace = True) #Replace names.

HAART_complete = HAART_replace.rename(columns = {"ParticipantID": "subject", 
                                                 "Gesture": "gesture", 
                                                 "Sequence": "frame"}) #Harmonize other columns with CoST.

The "HAART_complete" DataFrame is structurally identical to the final CoST set. However, the content still needs to be normalized.

#### 1.2.6. Normalization

In [0]:
HAART = HAART_complete.drop(["subject", "gesture", "frame"], axis = 1) #Continue working with the data that will be normalized.
HAART.head()

Unnamed: 0,ch1,ch2,ch3,ch4,ch5,ch6,ch7,ch8,ch9,ch10,ch11,ch12,ch13,ch14,ch15,ch16,ch17,ch18,ch19,ch20,ch21,ch22,ch23,ch24,ch25,ch26,ch27,ch28,ch29,ch30,ch31,ch32,ch33,ch34,ch35,ch36,ch37,ch38,ch39,ch40,ch41,ch42,ch43,ch44,ch45,ch46,ch47,ch48,ch49,ch50,ch51,ch52,ch53,ch54,ch55,ch56,ch57,ch58,ch59,ch60,ch61,ch62,ch63,ch64
0,5,1,1,2,9,15,9,3,3,1,1,0,8,10,8,1,1,0,0,1,2,0,1,1,0,0,0,1,2,0,1,0,0,0,0,0,0,2,1,0,0,0,1,2,2,1,0,0,2,0,1,2,2,1,1,0,1,2,0,3,3,3,1,0
1,4,9,4,6,10,12,6,2,2,7,1,7,7,6,3,1,1,3,1,3,1,1,1,0,2,2,2,2,1,0,0,0,1,2,2,1,2,0,0,0,2,5,2,1,1,1,1,0,9,20,12,14,11,8,4,3,12,19,19,37,17,13,5,4
2,13,17,15,19,31,16,7,4,11,14,14,12,11,10,4,2,2,2,1,2,1,2,2,1,2,3,3,1,1,4,3,0,1,1,0,1,3,3,1,0,2,1,2,1,1,4,1,0,46,63,26,22,19,11,7,3,18,18,22,30,15,9,7,6
3,23,22,20,22,38,11,7,6,19,19,20,18,15,9,6,3,3,1,1,0,1,1,0,0,3,2,2,2,1,1,0,0,2,1,1,1,1,4,0,0,4,2,2,3,1,0,0,1,55,71,27,23,18,6,6,4,9,8,8,11,6,2,2,2
4,21,21,19,20,30,6,4,4,11,10,11,11,11,1,3,3,2,1,1,1,0,0,0,0,2,2,1,1,1,2,0,1,2,3,2,3,0,0,0,1,1,2,3,3,2,0,1,1,34,41,29,31,20,7,6,4,7,7,8,6,7,4,2,2


In [0]:
HAART_np = HAART.values #Transform HAART DataFrame to a 2-dimensional array.
HAART_norm = [] #Create empty list in which normalized arrays will be added.

#This line has only been added to suppress a warning that indicates that the input of the MinMaxScaler has been changed.
warnings.filterwarnings(action = 'ignore') 

for row in HAART_np:
    series = Series(np.ndarray.tolist(row)) #Define series.
    #Prepare data for normalization.
    values = series.values 
    values = values.reshape((len(values), 1))
    #Fit the scaler on this row. Values are scaled to the range 0 to 1.
    scaler = MinMaxScaler(feature_range=(0, 1)) 
    scaler = scaler.fit(values)
    #Append transformed data to list.
    HAART_norm.append(np.asarray(scaler.transform(values)))
    
HAART_norm = np.array(HAART_norm) #Transform list into an array of shape (rows, 64, 1).
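
The loop scales every row, that is, every 64-channel frame, independently of all other rows. The same idea can be expressed in plain NumPy without fitting one scaler per row. This is a minimal sketch, not the code used above; `minmax_rows` is a hypothetical helper, and the handling of constant rows (mapped to zeros) is an assumption chosen to mirror what MinMaxScaler is understood to do for a feature with zero range.

```python
import numpy as np


def minmax_rows(frames):
    """Scale every row (one 64-channel frame) to [0, 1] independently."""
    frames = np.asarray(frames, dtype=np.float64)
    row_min = frames.min(axis=1, keepdims=True)
    row_range = frames.max(axis=1, keepdims=True) - row_min
    # A constant row has range 0; dividing by 1 instead maps it to all zeros.
    row_range[row_range == 0] = 1.0
    return (frames - row_min) / row_range


demo = np.array([[0, 5, 10],
                 [3, 3, 3]])
scaled = minmax_rows(demo)
```

The result keeps the shape (rows, channels), so it can be reshaped to (captures, 432, 64) directly. Note that row-wise scaling discards the absolute pressure level of a frame and keeps only the relative pattern across the sensor grid.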

The normalized array must be reshaped to a three-dimensional array so that it meets the expected LSTM input. Besides the 3D requirement, the input must consist of samples, time steps, and features, in that order.

In [0]:
x_HAART = np.reshape(HAART_norm, (710,432,64)) #Reshape to a 3D-array, which is the intended input for the LSTM model.

print("The old shape was", HAART_norm.shape, ", the new shape is", x_HAART.shape)

The old shape was (306720, 64, 1) , the new shape is (710, 432, 64)


The target variable (y) must also be stored in the correct array shape (710,).

In [0]:
y = np.asarray(HAART_complete["gesture"]) #Get information out of dataset and transform to array.
y = np.reshape(y, (710,432)) #Transform to 2D array. 

The target of each frame is now stored in a 2D array, so every gesture capture repeats its target value 432 times. A loop reduces each row to a single value, which is saved in a new array named "y_HAART". 

In [0]:
y_HAART = []
for target in y:
    y_HAART.append(int(np.mean(target)))

y_HAART = np.array(y_HAART) #Transform list to array.

In order to use the categorical crossentropy loss function, y_HAART has to be in categorical (one-hot) format, which requires class labels ranging from 0 to 5.

In [0]:
old_label = [4, 7, 8, 9, 12, 14]
new_label = [0, 1, 2, 3, 4, 5]

for N, O in zip(new_label, old_label):
    y_HAART[y_HAART == O] = N

y_HAART = to_categorical(y_HAART)
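
The in-place replacement works here because no new label (0 to 5) coincides with an old label that is still waiting to be replaced. A lookup through a dictionary does not depend on that property. The sketch below is an illustration under that reasoning; `encode_labels` and `OLD_TO_NEW` are hypothetical names, and the one-hot step uses `np.eye` to produce a binary class matrix of the same shape as the one returned by `to_categorical`.

```python
import numpy as np

OLD_TO_NEW = {4: 0, 7: 1, 8: 2, 9: 3, 12: 4, 14: 5}  # gesture number -> class index


def encode_labels(labels, mapping=OLD_TO_NEW):
    """Map gesture numbers to class indices and return (indices, one-hot matrix)."""
    labels = np.asarray(labels)
    # A KeyError here signals a gesture number that should have been filtered out.
    indices = np.array([mapping[int(value)] for value in labels])
    one_hot = np.eye(len(mapping), dtype=np.float32)[indices]
    return indices, one_hot


idx, onehot = encode_labels([4, 12, 14, 7])
```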

From this point the HAART set is sufficiently prepared for an aggregation with the CoST set and in the right format for the LSTM.

## 1.3. Corpus of Social Touch (CoST)

The CoST dataset is made available by 4TU.Centre for Research Data. Import the CSV data via URL.

In [0]:
url = "https://data.4tu.nl/repository/uuid:5ef62345-3b3e-479c-8e1d-c922748c9b29/DATA"
CoST_import = pd.read_csv(url)

#### 1.3.1. Consistent column headers

Most of the column headers contain an unnecessary space. These spaces are removed so that the sets can later be merged.

In [0]:
CoST_import = CoST_import.rename(columns = lambda x: x.strip())

#### 1.3.2. Remove irrelevant gestures / Instance selection

First, the gestures that will not be used can be removed.

In [0]:
irr_col = [1,2,3,5,6,10,11,13] #Irrelevant gestures.
for number in irr_col: #Loop over gestures.
    CoST_import = CoST_import[CoST_import.gesture != number] #Remove gesture from dataset.
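
The same selection can be written as a single boolean mask over the gestures that should be kept. This is a small sketch on a toy frame; the column name `gesture` follows the CoST headers used above, everything else is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"gesture": [1, 4, 7, 13, 14],
                    "ch1": [10, 11, 12, 13, 14]})
keep = [4, 7, 8, 9, 12, 14]  # gestures present in both datasets

# One mask instead of one filtering pass per removed gesture.
filtered = toy[toy["gesture"].isin(keep)].reset_index(drop=True)
```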

The numbers of gestures that occur in both sets are: 4, 7, 8, 9, 12, 14.

In [0]:
print("The following gestures are left:", CoST_import.gesture.unique())

The following gestures are left: [ 4  7  8  9 12 14]


#### 1.3.3. Data harmonization

The irrelevant 'variant' column is removed.

In [0]:
CoST_import.drop(["variant"], axis = 1, inplace = True)

Within the HAART set, subjects are indicated by a letter-number combination, so the column is stored as an 'object' data type. The CoST dataset uses only a number, so its column is stored as 'int64'. To ensure that the aggregation runs smoothly, the 'subject' column of the CoST set is (temporarily) converted to strings; the reverse conversion is not possible. In addition, the index must be reset because later steps index the rows by position.

In [0]:
CoST_import["subject"] = CoST_import["subject"].apply(str)
CoST_complete = CoST_import
CoST_complete = CoST_complete.reset_index(drop = True) #Reset index.

The "CoST_complete" DataFrame is structurally identical to the final HAART DataFrame. However, the content still needs to be normalized.

#### 1.3.4. Normalization

In [0]:
CoST = CoST_complete.drop(["subject", "gesture", "frame"], axis = 1) #Continue working with the data that will be normalized.
CoST.head()

Unnamed: 0,ch1,ch2,ch3,ch4,ch5,ch6,ch7,ch8,ch9,ch10,ch11,ch12,ch13,ch14,ch15,ch16,ch17,ch18,ch19,ch20,ch21,ch22,ch23,ch24,ch25,ch26,ch27,ch28,ch29,ch30,ch31,ch32,ch33,ch34,ch35,ch36,ch37,ch38,ch39,ch40,ch41,ch42,ch43,ch44,ch45,ch46,ch47,ch48,ch49,ch50,ch51,ch52,ch53,ch54,ch55,ch56,ch57,ch58,ch59,ch60,ch61,ch62,ch63,ch64
0,33,67,70,58,61,47,39,25,30,70,71,67,65,49,42,25,35,95,93,84,85,57,53,27,38,131,124,102,101,67,55,28,37,109,102,94,91,65,57,28,29,61,56,58,55,42,38,22,27,43,44,42,41,34,30,21,29,51,54,52,47,32,33,22
1,33,70,72,64,61,46,41,24,32,71,70,67,67,47,40,25,35,94,95,86,84,60,52,25,35,129,122,103,103,68,61,28,40,101,104,92,92,62,56,27,31,59,62,60,55,42,38,23,28,46,42,44,42,34,31,20,29,50,53,51,50,39,34,23
2,32,70,69,63,62,47,38,25,33,72,73,66,69,49,43,24,34,95,94,86,86,58,52,28,37,129,124,102,102,69,61,27,37,109,102,93,92,64,56,28,30,59,58,59,58,42,41,22,25,43,46,45,43,34,32,21,26,52,54,53,47,42,37,24
3,31,70,70,66,63,49,40,22,32,71,71,67,67,49,43,24,38,96,93,85,84,60,52,25,39,127,123,102,103,68,60,27,38,108,102,92,93,63,58,26,29,59,59,57,56,41,37,24,26,46,45,42,42,36,31,19,31,52,53,51,50,35,39,22
4,34,70,69,62,63,49,41,26,31,72,71,68,63,49,42,26,34,96,93,82,85,59,53,26,38,126,123,102,102,67,59,27,37,107,103,93,92,65,57,28,30,61,60,57,56,43,35,23,26,43,47,44,44,34,28,20,28,51,51,51,48,37,36,21


In [0]:
CoST_np = CoST.values #Transform CoST DataFrame to a 2-dimensional array.
CoST_norm = [] #Create empty list in which normalized arrays will be added.

#This line has only been added to suppress a warning that indicates that the input of the MinMaxScaler has been changed.
warnings.filterwarnings(action = 'ignore') 

for row in CoST_np:
    series = Series(np.ndarray.tolist(row)) #Define series.
    #Prepare data for normalization.
    values = series.values 
    values = values.reshape((len(values), 1))
    #Fit the scaler on this row. Values are scaled to the range 0 to 1.
    scaler = MinMaxScaler(feature_range = (0, 1)) 
    scaler = scaler.fit(values)
    #Append transformed data to list.
    CoST_norm.append(np.asarray(scaler.transform(values)))

First, the normalized data is reshaped to a two-dimensional array and then converted to a DataFrame. A for-loop then replaces the sensor columns of CoST_complete with the normalized columns of CoST_norm.

In [0]:
CoST_norm = np.array(CoST_norm)
CoST_norm = np.reshape(CoST_norm, (788224, 64))
CoST_norm = pd.DataFrame(CoST_norm)

count = 0 #Column index in CoST_norm (0 to 63), so the columns can be replaced by the normalized ones.

for column in CoST_complete.iloc[:,3:]:
  CoST_complete[column] = CoST_norm[count]
  count += 1

In [0]:
print("The longest gesture contains",max(CoST_complete["frame"]), "frames.")

The longest gesture contains 1489 frames.


Because the data is too sparse when using 1489 frames in the second dimension, and because the HAART set has 432 frames, the CoST set will also be limited to this number of frames. The desired 'x_COST' 3D-array output shape is therefore (*number of gestures:* 3342, *frames:* 432, *sensor grid:* 64). First, the sensor values of all frames that belong to the same gesture are collected. The result is a list containing 3342 lists, each holding one array of 64 values per frame. 

In [0]:
frame = 1 #Counter variable used to count the number of frames.
gestures  = [] #Values of the gesture_values will be stored together here.
gesture_values = [] #Values of one gesture from multiple frames will be stored here.
for index, columns in CoST_complete.iterrows(): #Loop over dataframe.
  if CoST_complete.iloc[index]["frame"] < frame: #If a new gesture begins, then...
    gestures.append(gesture_values)
    gesture_values = []
    gesture_values.append(CoST_complete.iloc[index,3:].values)
    frame = 1
  else: #If row still belongs to the same gesture, then...
    gesture_values.append(CoST_complete.iloc[index,3:].values)
    frame += 1
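
A loop that stores a gesture only when the next one begins needs one more append after the loop, otherwise the final gesture is not stored. The targets must also come from the same gestures as the features. The sketch below shows one way to obtain both in a single pass by locating the rows where the frame counter stops increasing. It is an illustration on toy data, not the procedure used in this script; `split_gestures` is a hypothetical helper, and it assumes that every gesture restarts its frame counter at a value not larger than the last frame of the previous gesture.

```python
import numpy as np


def split_gestures(frame, labels, values):
    """Split row-wise data into gestures wherever the frame counter does not increase."""
    frame = np.asarray(frame)
    # Row i starts a new gesture if its frame number is not larger than the previous one.
    starts = np.flatnonzero(np.diff(frame) <= 0) + 1
    starts = np.concatenate(([0], starts))
    segments = np.split(np.asarray(values), starts[1:])  # includes the final gesture
    targets = np.asarray(labels)[starts]                 # label of each gesture's first row
    return segments, targets


# Toy data: three gestures of 3, 2 and 4 frames with 2 channels each.
frame = [1, 2, 3, 1, 2, 1, 2, 3, 4]
labels = [4, 4, 4, 7, 7, 9, 9, 9, 9]
values = np.arange(18).reshape(9, 2)
segments, targets = split_gestures(frame, labels, values)
```

If such a helper were adopted, the number of segments it finds should be compared with the hard-coded gesture count used in the later reshape steps.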

#### 1.3.5. Reshaping and frame removal

Every gesture now consists of multiple separate sets of 64 values in the list. With a for-loop, all values of a gesture are combined into one array.

In [0]:
gestures_2D = [] #New list with gestures where all sensor values are combined in one array.

for gesture in gestures:
  gesture_2D = np.concatenate(gesture, axis = 0) #Connect values in rows.
  gestures_2D.append(gesture_2D) #Append to new list.

In [0]:
gestures_2D = np.array(gestures_2D, dtype = object) #Transform list to an array of per-gesture arrays (unequal lengths, so dtype = object is required by recent NumPy versions).

Create function to fill empty positions with zeros.

In [0]:
def zeros_np(dataset):
    "Append zeros until max length of arrays last dimension is reached."
    length = np.array([len(i) for i in dataset]) #Get the length of each row of data.
    mask = np.arange(length.max()) < length[:,None] #Mask of valid places in each row.
    output = np.zeros(mask.shape, dtype = dataset.dtype) #Create an array of zeros with the padded shape.
    output[mask] = np.concatenate(dataset) #Put elements from data into masked positions in array.
    return output

In [0]:
gestures_2D = zeros_np(gestures_2D) #Apply function.
gestures_2D.shape #Check dimensions of array.

(3342, 95296)

The gestures_2D array is now two-dimensional with 1489*64 = 95,296 sensor values per gesture. Most of these values are 0 because only a few gestures have 1000+ frames. Because the LSTM model cannot learn optimally from sparse data, the array is shortened to 432 frames. This also creates consistency between the CoST and HAART datasets, since HAART already has 432 frames per gesture.

In [0]:
gestures_2D = np.delete(gestures_2D, np.s_[27648:], 1) #Shorten the frame length of the gestures to 432 (equal to HAART)(432*64 = 27648).

In [0]:
x_COST = np.reshape(gestures_2D, (3342, 432, 64)) #Reshape to a 3D-array, which is the intended input for the LSTM model.
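
Padding, truncation and reshaping can also be combined into one step that writes each gesture directly into a pre-allocated 3D array. The following is a minimal sketch under the assumption that every gesture is available as a (frames, channels) array; `pad_or_truncate` is a hypothetical helper and the toy sizes are chosen only to keep the example small.

```python
import numpy as np


def pad_or_truncate(segments, n_frames=432, n_channels=64):
    """Stack variable-length gestures into a zero-padded (n, n_frames, n_channels) float array."""
    out = np.zeros((len(segments), n_frames, n_channels), dtype=np.float32)
    for i, seg in enumerate(segments):
        seg = np.asarray(seg, dtype=np.float32)[:n_frames]  # drop frames beyond n_frames
        out[i, :len(seg)] = seg                             # shorter gestures stay zero-padded
    return out


# Toy example: gestures of 2 and 5 frames, padded or cut to 3 frames of 4 channels.
toy = [np.ones((2, 4)), np.full((5, 4), 2.0)]
x_toy = pad_or_truncate(toy, n_frames=3, n_channels=4)
```

This avoids building the full 1489-frame padded array first, and the result has a numeric dtype rather than dtype object.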

To illustrate the limited effect of deleting these frames, the following code is added.

In [0]:
print(CoST_complete.groupby('frame').count()[432:]["ch1"].values[0], "gestures are affected by this decision.")
print(round(100*(3342-346)/3342, 1), "% of the gestures are not affected by this decision.")

346 gestures are affected by this decision.
89.6 % of the gestures are not affected by this decision.


The target variable (y) must also be stored in the correct array shape (3342,).

In [0]:
frame = 1 #Frame counter.
y_COST = [] #New list with 3342 targets.
for index, columns in CoST_complete.iterrows(): #Loop over complete dataset.
  if CoST_complete.iloc[index]["frame"] < frame: #If new gesture starts...
    y_COST.append(CoST_complete.iloc[index]["gesture"]) #...Then append target to list.
    frame = 1
  else: #Else add 1 to counter.
    frame += 1

In [0]:
y_COST = np.array(y_COST) #Transform to NumPy array.

The LSTM algorithm cannot work with the categorical data directly. First, the labels must be transformed to numbers ranging from 0 to 5. After that, for use with the categorical_crossentropy loss, the to_categorical() function converts the vector of integers to a binary class matrix.

In [0]:
old_label = [4, 7, 8, 9, 12, 14]
new_label = [0, 1, 2, 3, 4, 5]

for N, O in zip(new_label, old_label): #Replace old labels by new ones.
    y_COST[y_COST == O] = N

y_COST = to_categorical(y_COST) #Converts a class vector (integers) to binary class matrix.

## 1.4. CoST-HAART aggregation

Aggregate/join both the input and target variables of the CoST and HAART set.

In [0]:
x_COSTHAART = np.concatenate((x_COST, x_HAART))
y_COSTHAART = np.concatenate((y_COST, y_HAART))

## 1.5. Output

The results of these preprocessing steps will serve as input for the model. For this reason, the files will be exported to the 'input' folder.

In [0]:
files = [("x_HAART", x_HAART),
         ("y_HAART", y_HAART),
         ("x_COST", x_COST),
         ("y_COST", y_COST),
         ("x_COSTHAART", x_COSTHAART),
         ("y_COSTHAART", y_COSTHAART)]

for name, array in files: #Loop over the NumPy array files.
  filename = 'input/' + name #Create path to which the data will be saved.
  np.save(filename, array) #Save them in the 'input' folder.
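
`np.save` appends the `.npy` extension when the file name does not already have it, so the files end up as `input/x_HAART.npy` and so on. The round trip below illustrates this with a temporary directory standing in for the 'input' folder; the array names and contents are placeholders.

```python
import os
import tempfile

import numpy as np

arrays = {"x_demo": np.zeros((2, 432, 64), dtype=np.float32),
          "y_demo": np.eye(6, dtype=np.float32)[[0, 5]]}

with tempfile.TemporaryDirectory() as tmp:  # stand-in for the 'input' folder
    for name, array in arrays.items():
        np.save(os.path.join(tmp, name), array)  # written as <name>.npy
    saved_files = sorted(os.listdir(tmp))
    reloaded = np.load(os.path.join(tmp, "x_demo.npy"))
```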