# Machine Learning Engineer Nanodegree
# Capstone Project

Hala Jadallah

May 14, 2019

# I. Definition
### Project Overview

There is great interest in delegating certain types of jobs to robots that need no human assistance. Unlike humans, robots are well suited to repetitive jobs, which they perform with small variability and without fatigue. They can also work night shifts, when humans are less effective and naturally need restorative sleep.

In this project we consider an application in which an autonomous mobile robot uses data from inertial measurement unit (IMU) sensors to identify the type of surface it is moving on, from a list of nine surface types: carpet, concrete, fine concrete, hard tiles, hard tiles large space, soft PVC, soft tiles, tiled, and wood.
The sensors give readings of orientation in quaternion form, angular velocity along the x, y, and z directions, and linear acceleration along the x, y, and z directions.

The data comes from one of [Kaggle’s competitions](https://www.kaggle.com/c/career-con-2019) and is provided by researchers at Tampere University, which has an active program in robotics. The researchers want the robots to identify the surface on their own, from sensor input alone, so that they can adapt their navigation to the surface and, for example, avoid falling while moving.

The researchers had the mobile robots move on nine different surfaces and recorded the data by series number and group number. Each series has 10 features: the four quaternion variables (orientation_x, orientation_y, orientation_z, orientation_w), angular velocity (angular_velocity_x, angular_velocity_y, angular_velocity_z), and linear acceleration (linear_acceleration_x, linear_acceleration_y, linear_acceleration_z).

Each series contains 128 time steps per feature. The group ids identify recordings that were made together.
In this project I train a model on the series that are labeled with a surface, so that it can predict the surface for series that are not labeled.

### Problem Statement

As described in the Project Overview section, the researchers want the autonomous mobile robot to be able to use sensor data to identify which of the nine surface types it is moving on.

Since the sensor data are series of orientation, velocity, and acceleration, each with 128 time steps, this is a multivariate time series classification problem. Several methods are used for such a task; some are listed on the UCR time series archive website. Recently there have been efforts to use deep learning for classifying time series.

I am interested in using deep learning for this classification task. Deep learning is based on multi-layered neural networks built from fully connected (FC) layers, convolutional neural network (CNN) layers, long short-term memory (LSTM) recurrent layers, or a mix of these.

CNNs are best known for classifying images, which are two-dimensional objects. For series, which are one-dimensional objects, they can learn the shapes of the series of the 10 variables in order to classify them.

LSTMs, on the other hand, are designed for sequential data and exploit the correlation along the sequence.

A deep learning classifier of this kind ends with at least one fully connected (FC) layer and a `softmax` activation function that gives the probability of a series belonging to each class; we assign (classify) the series to the class with the highest probability.

### Metrics
For classification one can use accuracy (as dictated by the Kaggle competition). As we will see below, the classes in this project are not balanced, so multiclass accuracy alone may not be reliable. I therefore also inspect the confusion matrix visually and compute the F1-score, which is the harmonic mean of precision and recall. I use ‘macro’ averaging in computing the F1-score: the F1-score is computed for each class and the nine F1-scores are then averaged.
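To make the macro averaging concrete, here is a minimal sketch that computes the per-class F1-score by hand and then averages it. The function name `macro_f1` and the toy labels are illustrative assumptions, not the code or data used in this project.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Average the per-class F1 scores, giving every class equal weight."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); set to 0 when the class is never seen or predicted
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Hypothetical toy labels: class 0 is frequent, class 2 is rare
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0, 0]
accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
score = macro_f1(y_true, y_pred, n_classes=3)
```

In this toy case the rare class is never predicted, so its F1-score is zero and it pulls the macro average well below the accuracy. This is the behavior that makes the macro F1-score informative for unbalanced classes.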

# II. Analysis

### Data Exploration
The data set is comprised of the `y_train` data, which has 3810 rows and three columns: the `series_id`, the `group_id`, and the `surface` type corresponding to the series. Each `group_id` corresponds to only one surface. However, each surface corresponds to multiple groups.

Each `series_id` in the `y_train` dataset corresponds to a time series of 128 steps in the `X_train` dataset. Thus, `X_train` has 487680 (=128*3810) rows and 13 columns: `row_id`, `series_id`, `measurement_number`, `orientation_X`, `orientation_Y`, `orientation_Z`, `orientation_W`, `angular_velocity_X`, `angular_velocity_Y`, `angular_velocity_Z`, `linear_acceleration_X`, `linear_acceleration_Y`, `linear_acceleration_Z`.

There are no missing values in this dataset. Most variables are floats or integers. The `surface` and `row_id` variables are strings.

The main data variables should be considered as time series, each of length 128. Together they form a multivariate time series whose variables fall into three main categories: orientation, angular velocity, and linear acceleration. Each time series corresponds to one class (surface).

In [1]:
import pandas as pd
import numpy as np
# read in train and test sensor data
data_train = pd.read_csv('X_train.csv')
data_test = pd.read_csv('X_test.csv')

In [2]:
y_train = pd.read_csv('y_train.csv')
Xy_train = data_train.join(y_train[['surface','group_id']], on='series_id')

**Table 1 a.** A sample of the data features shows the series_id of the first series, measurement number from 0 to 127, and then the main features.  

In [12]:
data_train.head()

| |row_id|series_id|measurement_number|orientation_X|orientation_Y|orientation_Z|orientation_W|angular_velocity_X|angular_velocity_Y|angular_velocity_Z|linear_acceleration_X|linear_acceleration_Y|linear_acceleration_Z|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|0_0|0|0|-0.75853|-0.63435|-0.10488|-0.10597|0.10765|0.017561|0.000767|-0.74857|2.103|-9.7532|
|1|0_1|0|1|-0.75853|-0.63434|-0.1049|-0.106|0.067851|0.029939|0.003385|0.33995|1.5064|-9.4128|
|2|0_2|0|2|-0.75853|-0.63435|-0.10492|-0.10597|0.007275|0.028934|-0.005978|-0.26429|1.5922|-8.7267|
|3|0_3|0|3|-0.75852|-0.63436|-0.10495|-0.10597|-0.013053|0.019448|-0.008974|0.42684|1.0993|-10.096|
|4|0_4|0|4|-0.75852|-0.63435|-0.10495|-0.10596|0.005135|0.007652|0.005245|-0.50969|1.4689|-10.441|


**Table 1 b.** A sample of the target variables

In [13]:
y_train.head()

| |series_id|group_id|surface|
|:---|:---|:---|:---|
|0|0|13|fine_concrete|
|1|1|31|concrete|
|2|2|20|concrete|
|3|3|31|concrete|
|4|4|22|soft_tiles|


**Table 2.** The main statistics of the features over all series. The orientation variables lie between -1 and 1. The angular velocity variables have a slightly wider range, while the linear acceleration variables take more extreme values.

In [7]:
data_train.iloc[:,3:].describe()

| |orientation_X|orientation_Y|orientation_Z|orientation_W|angular_velocity_X|angular_velocity_Y|angular_velocity_Z|linear_acceleration_X|linear_acceleration_Y|linear_acceleration_Z|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|count|487680.0|487680.0|487680.0|487680.0|487680.0|487680.0|487680.0|487680.0|487680.0|487680.0|
|mean|-0.01805|0.075062|0.012458|-0.003804|0.000178|0.008338|-0.019184|0.129281|2.886468|-9.364886|
|std|0.685696|0.708226|0.105972|0.104299|0.117764|0.088677|0.229153|1.8706|2.140067|2.845341|
|min|-0.9891|-0.98965|-0.16283|-0.15662|-2.371|-0.92786|-1.2688|-36.067|-121.49|-75.386|
|25%|-0.70512|-0.68898|-0.089466|-0.10606|-0.040752|-0.033191|-0.090743|-0.530833|1.9579|-10.193|
|50%|-0.10596|0.237855|0.031949|-0.018704|8.4e-05|0.005412|-0.005335|0.12498|2.8796|-9.3653|
|75%|0.651803|0.80955|0.12287|0.097215|0.040527|0.048068|0.064604|0.792263|3.7988|-8.5227|
|max|0.9891|0.98898|0.15571|0.15477|2.2822|1.0791|1.3873|36.797|73.008|65.839|


__Table 3.__ The means of the variables by surface show slight variability between surfaces.

In [8]:
train = Xy_train.drop(['row_id','measurement_number','group_id', 'series_id'], axis=1)
data_by_surface = train.groupby('surface')
data_by_surface.mean()

|surface|orientation_X|orientation_Y|orientation_Z|orientation_W|angular_velocity_X|angular_velocity_Y|angular_velocity_Z|linear_acceleration_X|linear_acceleration_Y|linear_acceleration_Z|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|carpet|-0.385794|0.232877|0.032506|-0.060315|0.001029|0.024627|-0.072238|0.091456|2.872431|-9.368058|
|concrete|-0.207607|0.078932|0.010351|-0.032107|0.000925|0.021816|-0.062718|0.111593|2.907733|-9.356408|
|fine_concrete|-0.14329|0.188151|0.028501|-0.021683|-9.9e-05|0.003167|-0.002443|0.136297|2.922102|-9.35607|
|hard_tiles|0.650556|-0.735114|-0.106206|0.103824|-0.000176|-0.006708|0.026448|0.145759|2.990207|-9.332681|
|hard_tiles_large_space|0.393165|-0.113644|-0.014866|0.060699|0.000434|0.016397|-0.044681|0.11025|2.885197|-9.365567|
|soft_pvc|0.30957|-0.011457|0.005433|0.043331|0.000692|0.014869|-0.040387|0.128839|2.811946|-9.386315|
|soft_tiles|0.276689|0.017786|0.004524|0.042025|0.00064|0.017502|-0.048913|0.084119|2.914124|-9.359474|
|tiled|-0.275754|0.075003|0.010869|-0.04209|0.000102|0.006403|-0.012668|0.13556|2.913803|-9.357045|
|wood|-0.138246|0.209518|0.030993|-0.022491|-0.001783|-0.025233|0.089153|0.185963|2.892481|-9.364265|


__Table 4.__ The standard deviation of the series corresponding to the same surface. The values are reasonable. We only find slight variability between surfaces in the acceleration features.

In [9]:
data_by_surface.std()

|surface|orientation_X|orientation_Y|orientation_Z|orientation_W|angular_velocity_X|angular_velocity_Y|angular_velocity_Z|linear_acceleration_X|linear_acceleration_Y|linear_acceleration_Z|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|carpet|0.55809|0.680656|0.103843|0.082027|0.103999|0.087537|0.223227|2.03344|2.562971|2.288149|
|concrete|0.724844|0.634656|0.095181|0.110878|0.154918|0.097308|0.22209|2.581342|3.096531|3.726617|
|fine_concrete|0.628582|0.725514|0.111411|0.094375|0.085643|0.080723|0.221343|1.379338|1.60125|2.047193|
|hard_tiles|0.090707|0.075925|0.011803|0.014121|0.032907|0.072593|0.226314|0.521849|0.945227|0.599659|
|hard_tiles_large_space|0.522936|0.732409|0.11301|0.077233|0.195604|0.097617|0.188297|2.832893|2.578976|4.86655|
|soft_pvc|0.61227|0.712656|0.102576|0.094142|0.058173|0.057497|0.157295|0.944897|1.35642|1.235721|
|soft_tiles|0.669783|0.672457|0.100333|0.102425|0.030014|0.041384|0.124861|0.537524|0.872669|0.485342|
|tiled|0.589763|0.740189|0.110952|0.091911|0.141375|0.082908|0.165605|2.114298|2.251182|3.489906|
|wood|0.630122|0.719346|0.109419|0.095142|0.087131|0.1169|0.352725|1.377191|1.601598|2.134775|


__Table 5.__ The maximum value of series corresponding to the same surface. 

In [10]:
data_by_surface.max()

|surface|orientation_X|orientation_Y|orientation_Z|orientation_W|angular_velocity_X|angular_velocity_Y|angular_velocity_Z|linear_acceleration_X|linear_acceleration_Y|linear_acceleration_Z|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|carpet|0.38137|0.98207|0.15566|0.052329|1.2116|0.93619|1.1257|23.293|32.623|22.336|
|concrete|0.98889|0.98898|0.14838|0.15154|2.2822|0.83621|1.3873|36.797|73.008|65.839|
|fine_concrete|0.98642|0.98866|0.15147|0.15477|0.71355|0.46764|0.94021|12.394|16.452|7.1959|
|hard_tiles|0.74929|-0.64475|-0.092761|0.11902|0.16933|0.14701|0.44347|1.896|7.1469|-7.1307|
|hard_tiles_large_space|0.98872|0.9884|0.15065|0.15187|1.195|1.0791|0.97221|24.945|20.005|20.45|
|soft_pvc|0.9891|0.98865|0.15264|0.14957|0.86514|0.42471|0.48411|10.716|18.501|15.983|
|soft_tiles|0.98874|0.98876|0.15009|0.14985|0.41735|0.28317|0.66076|8.8783|11.877|-3.0539|
|tiled|0.95521|0.98747|0.15176|0.15395|0.95938|0.54884|0.83011|18.806|17.999|20.431|
|wood|0.98865|0.98893|0.15571|0.1515|0.76485|0.3595|0.97238|15.053|14.511|12.735|


__Table 6.__ The minimum value of series corresponding to the same surface. 

In [11]:
data_by_surface.min()

|surface|orientation_X|orientation_Y|orientation_Z|orientation_W|angular_velocity_X|angular_velocity_Y|angular_velocity_Z|linear_acceleration_X|linear_acceleration_Y|linear_acceleration_Z|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|carpet|-0.98859|-0.98905|-0.15054|-0.15345|-1.0547|-0.92786|-1.2445|-29.895|-54.591|-49.403|
|concrete|-0.98862|-0.96579|-0.14873|-0.15186|-2.371|-0.90875|-1.2593|-36.067|-121.49|-75.386|
|fine_concrete|-0.89089|-0.98866|-0.15396|-0.13772|-0.61259|-0.36092|-0.82227|-10.93|-9.6574|-24.686|
|hard_tiles|0.46074|-0.87507|-0.12841|0.074212|-0.12939|-0.15695|-0.33992|-2.2821|-1.2135|-11.452|
|hard_tiles_large_space|-0.71222|-0.98965|-0.15189|-0.12019|-1.2455|-0.82208|-0.98636|-21.221|-23.714|-41.542|
|soft_pvc|-0.97853|-0.91465|-0.16283|-0.14777|-0.64247|-0.46152|-1.2688|-11.505|-26.83|-27.907|
|soft_tiles|-0.96457|-0.96176|-0.14454|-0.14744|-0.38121|-0.37008|-0.41884|-9.8046|-6.9817|-18.902|
|tiled|-0.9891|-0.98908|-0.15101|-0.15662|-1.0244|-0.49977|-0.77189|-16.029|-20.999|-37.273|
|wood|-0.98872|-0.98883|-0.15023|-0.1534|-0.9148|-0.52811|-0.91822|-12.506|-10.404|-32.114|


### Visualizing
#### Figure 1: Class distribution
The surfaces are not uniformly distributed in this data. This means we need to keep an eye on the learning algorithm so that it does not undersample the low-frequency classes.
<img src="class_distribution.PNG" width="1000">

#### Figure 2: Density Curves by class
The density curves show the most variability in the orientation variables. Their curves seem to be distinct for each surface, either in the location or in the number of modes (most appear to be multi-modal). The angular velocity shows some distinct behavior among a few classes, but is mostly unimodal and centered around zero. The linear acceleration is the most concentrated about zero, with much less spread, although some series take extreme values.
However, as Figure 3 shows, the series contain a lot of noise that the density curves do not reveal.

<img src="density_by_class_c.png" alt="density_by_class_after_croping_top_and_bottom_space" width="1000">

#### Figure 3: Sample plots of a random series
Notice the high noise on top of an oscillatory curve around either a constant mean or an increasing or decreasing trend, for a sample series corresponding to the “soft tiles” surface. The first row shows the orientation variables, which have an increasing or decreasing trend, followed by the angular velocity variables in the second row and the linear acceleration variables in the third row, all plotted against the measurement number from 0 to 127.
<img src="sample_plots_1.png" width="1500">
<img src="sample_plots.png" width="1500">

#### Figure 4: Feature plots by class
The plots show random series sampled from each class. To reduce the noise, the curves have been smoothed using a 7-point moving average.
Although the curves oscillate about a common mean, particularly along the x and y directions, the frequency and (to a lesser degree) the amplitude vary from one class to another. We also observe that the mean of the acceleration along the z direction is much lower than along the other two directions.

<img src="orin_by_clas.png" width="1500">
<img src="vel_by_clas.png" width="1500" >
<img src="acc_by_clas.png" width="1500" >
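The 7-point moving average used for Figure 4 can be sketched as follows. This is an illustration on a synthetic series; the helper name `moving_average` and the edge handling (`mode="valid"`, which drops the first and last three points) are assumptions, since the report does not say how the edges were treated.

```python
import numpy as np

def moving_average(x, window=7):
    """Smooth a 1-D series by averaging each run of `window` consecutive points."""
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits,
    # so the output has len(x) - window + 1 points
    return np.convolve(x, kernel, mode="valid")

rng = np.random.default_rng(0)
t = np.arange(128)
# Synthetic stand-in for one sensor channel: a slow oscillation plus noise
noisy = np.sin(2 * np.pi * t / 32) + rng.normal(scale=0.3, size=t.size)
smooth = moving_average(noisy, window=7)
```

A wider window removes more noise but also flattens the oscillations whose frequency distinguishes the classes, so the window length is a trade-off.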

### Algorithms and Techniques
I chose to solve the multivariate time series classification problem using deep learning. Specifically, I use convolutional neural networks, since they apply filters that capture correlations between neighboring points along the signal.
A convolutional neural network is built from convolutional layers, each described by its number of filters, the kernel size of each filter, the stride with which the filter moves along the input, and the padding type (no padding or zero padding). There are also activation layers, batch normalization layers, and pooling layers.
I test several architectures and use cross-validation to choose the best.

For model validation I split the data into k folds for cross-validation. I compared two strategies for splitting. The first respects the class distribution, so the folds are sampled by stratification.

The second strategy uses the `group_id` as a guide, so that the validation fold does not share groups with the other folds. This matters because series within the same group are suspected to be correlated, since they were recorded consecutively; separating the series by group should therefore give a more meaningful validation score.
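The difference between the two splitting strategies can be illustrated with scikit-learn's `StratifiedKFold` and `GroupKFold`. This is a sketch on synthetic labels, not the project's splitting code: the group and class assignments below are hypothetical and only the fold indices matter.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

rng = np.random.default_rng(0)
n_series = 200
groups = rng.integers(0, 20, size=n_series)   # hypothetical group_id per series
surface = groups % 4                          # each group maps to exactly one class
X = np.zeros((n_series, 1))                   # placeholder features; only indices matter

# Strategy 1: preserve the class proportions in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
shared_groups_stratified = [
    len(set(groups[tr]) & set(groups[va])) for tr, va in skf.split(X, surface)
]

# Strategy 2: keep every group entirely inside one fold
gkf = GroupKFold(n_splits=5)
shared_groups_grouped = [
    len(set(groups[tr]) & set(groups[va]))
    for tr, va in gkf.split(X, surface, groups=groups)
]
```

With stratification, the validation fold shares groups with the training folds, so correlated series can appear on both sides. With group splitting no group is shared, at the cost of class proportions that may differ from fold to fold.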

### Benchmark
One common method that I found in a number of kernels for this competition is to compute aggregate statistics for each time series, so that the data becomes tabular and the task becomes a typical classification problem. I used a random forest, a decision tree ensemble algorithm that combines the predictions of many randomized decision trees.

Results (multiclass accuracy):

|Model| Validation accuracy| Kaggle public score| Kaggle private score|
|:-----|:--------|:------|:-----|
|Splitting by stratification| 0.7963| 0.5747| 0.4571|
|Splitting by groups| 0.3973| 0.5765| 0.4225|
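The benchmark approach can be sketched as below: each 128-step series is collapsed into one row of summary statistics and a random forest is fitted on the resulting table. The data here is synthetic, and the choice of statistics (`mean`, `std`, `min`, `max`), the two example columns, and the forest settings are assumptions for illustration, not the ones behind the scores above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_series, n_steps = 60, 128

# Synthetic stand-in for X_train: one row per (series_id, measurement)
long_df = pd.DataFrame({
    "series_id": np.repeat(np.arange(n_series), n_steps),
    "angular_velocity_X": rng.normal(size=n_series * n_steps),
    "linear_acceleration_Z": rng.normal(loc=-9.4, size=n_series * n_steps),
})
labels = pd.Series(rng.integers(0, 3, size=n_series), name="surface")  # hypothetical classes

# Collapse each 128-step series into one row of summary statistics
features = long_df.groupby("series_id").agg(["mean", "std", "min", "max"])
features.columns = ["_".join(col) for col in features.columns]

# The forest combines the predictions of many randomized decision trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(features.values, labels.values)
pred = forest.predict(features.values)
```

Aggregation discards the ordering of the measurements within a series, which is the information the convolutional models are meant to exploit.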


# III. Methodology

### Data Preprocessing

I standardize the data by z-normalization: for each feature, I subtract the mean of that feature over all series and divide by its standard deviation. This reduces any influence of the scale of the training data. Moreover, test data becomes more comparable, scale-wise, with the data used to fit the model.
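A minimal sketch of this preprocessing step, on synthetic data, is shown below. It also includes the reshape into a three-dimensional array that 1-D convolution layers expect. The array names are illustrative, and the row ordering (by series, then by measurement) is an assumption that matches the layout of `X_train`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_series, n_steps, n_features = 6, 128, 10

# Synthetic stand-in for the long table: rows ordered by series, then by measurement
flat = rng.normal(loc=3.0, scale=5.0, size=(n_series * n_steps, n_features))

# Statistics are taken per feature over all series and all time steps
mean = flat.mean(axis=0)
std = flat.std(axis=0)
standardized = (flat - mean) / std

# Shape expected by 1-D convolution layers: (series, time steps, channels)
X = standardized.reshape(n_series, n_steps, n_features)
```

A common practice, assumed here rather than taken from the report, is to reuse the `mean` and `std` computed on the training data when transforming the test data, so that both sets are on the same scale.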


### Implementation
I tried three models with different architectures, all based on convolutional neural networks.

**Model 1:** The input is split into three branches: one for the orientation variables, one for the angular velocity variables, and one for the linear acceleration variables. Each branch is fed into a block that consists of two consecutive convolutional layers with 16 and 32 filters respectively, stride one, kernel sizes of 5 and 3 respectively, and no padding. I use the rectified linear unit (ReLU) as the activation function. This is followed by average pooling with a pool size of 2, and the block ends with a `BatchNormalization` layer. The output of this block is the input of a second block with exactly the same architecture. Next I merge the outputs of the second blocks of the three branches and feed the result into two more blocks of the same type, except that the second replaces average pooling with a Global Average Pooling layer. This is followed by one fully connected layer with a `softmax` activation function.

**Model 2:** This model is a modification of the fully convolutional model described in a [review paper](https://arxiv.org/abs/1809.04356). The model takes the whole data as input to a sequence of blocks, each consisting of one convolutional layer followed by a `BatchNormalization` layer and an activation layer using the rectified linear unit function. There are four blocks of this type with 32, 32, 64, and 64 filters respectively, kernel sizes of 8, 5, 3, and 3 respectively, and no padding.
The output of the last block feeds into a Global Average Pooling layer followed by a fully connected layer with `softmax` activation and L2 kernel regularization, to reduce the overfitting that was observed in Model 1.
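Because Model 2 uses no padding, each convolution shortens the series. The short calculation below traces the length of the 128-step input through the four blocks. It assumes a stride of one, which the description of Model 2 does not state explicitly; the helper name is illustrative.

```python
def valid_conv_length(length, kernel_size, stride=1):
    """Output length of a 1-D convolution with no padding."""
    return (length - kernel_size) // stride + 1

length = 128                      # time steps per series
lengths = [length]
for kernel_size in (8, 5, 3, 3):  # kernel sizes of the four Model 2 blocks
    length = valid_conv_length(length, kernel_size)
    lengths.append(length)
```

Under these assumptions the series length goes 128, 121, 117, 115, 113. The Global Average Pooling layer then averages over the remaining 113 steps, leaving one value per filter (64 values) as input to the fully connected layer.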

**Model 3:** This model is similar to Model 1. It has the same architecture before the merge of the three branches. After the merge there are no blocks, just one convolutional layer, a `BatchNormalization` layer, Global Average Pooling, and finally a fully connected layer with `softmax` activation, L2 regularized as in Model 2.

I defined the above models using Keras with a TensorFlow backend.
I compiled each model with the Adam optimizer, the categorical cross-entropy loss function, and accuracy as the metric.

Then I iterated over the 5 stratified folds, holding out each one in turn as the validation fold, with 55 or 60 epochs per run.
I monitored the loss and the accuracy for both the training folds and the validation fold, which tells me how well the learning is proceeding and whether the model is overfitting the training data. I also looked at the confusion matrix after each run.

When training and validation are done, I compute the cross-validation F1-score and the multiclass accuracy.

I ran the model on my personal laptop, with 1.70 GHz processor and 5.9 GB RAM.  


### Refinement
The model is refined manually, which is not ideal. However, since deep learning takes a long time to train, I had to rely on common sense in my choices. The very first refinement was the finding that Adam optimization gives better results with a learning rate of 0.0005 than with the default 0.001. I also tried a learning rate of 0.00001 but found that 0.0005 is better.

I also compared no padding with zero padding. I then adjusted the number of filters and the kernel sizes and found that the values chosen above gave better results. For example, for Model 2 the paper suggests using 128 and 256 filters, which I tried and which performed badly (worse than the benchmark), so I reduced the number of filters to 32 and 64 to get the better performance documented below.

I also compared batch sizes and found that 15 is better for Model 1 and Model 3, while Model 2 can handle 32.


# IV. Results

### Model Evaluation and Validation
The method I used for model evaluation and validation is k-fold cross-validation. I tried two strategies. The first splits the data into 5 folds while maintaining the class proportions in each fold. Since the smallest class makes up about 0.005 of the data, a fold needs about 700-800 series to contain 3-4 series from the minority class on average. At the same time, five folds give a better estimate, and each fold then has 762 series.
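The fold-size reasoning above can be checked with a few lines of arithmetic. The numbers are the ones quoted in this report; the minority share of 0.005 is the approximate value stated above, not a recomputed one.

```python
n_series = 3810          # labelled series in y_train
n_folds = 5
minority_share = 0.005   # approximate share of the smallest class, as stated above

fold_size = n_series // n_folds
minority_total = n_series * minority_share
minority_per_fold = minority_total / n_folds
```

This gives 762 series per fold and, on average, just under four series of the smallest class in each validation fold, which is within the 3-4 range targeted above.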

The second strategy addresses the possibility that a group id represents runs that were recorded together or sequentially, so that they are not independent. Since we have no knowledge of the test data, I split the folds so that series from the same group never appear in both the training folds and the validation fold. This issue is raised in a discussion between [competitors regarding groups being split between train and test](https://www.kaggle.com/c/career-con-2019/discussion/87239#latest-512162).

However, when I split by groups, I saw overfitting to the training data, as I observed a wide gap between the train and test performance. I abandoned this method of splitting into folds, preferring stratification.

Here are the results of the three models described above after refinement:

|Model| CV F1-score| CV accuracy| Kaggle public score| Kaggle private score|
|:----|:-----------|:-----------|:-------------------|:--------------------|
|Model 1| 0.7517| 0.7856| 0.6263| 0.5613|
|Model 2| 0.7823| 0.8056| 0.5753 | 0.5888 |
|Model 3| 0.8643| 0.8738| 0.6850| 0.6693|

Clearly, Model 3 performs best and outperforms the benchmark model.

### Justification

In this problem, the convolutional neural network seems to do a reasonable job of predicting the surface the robot is moving on. However, I am not sure that it generalizes to the same robot moving on surfaces in different places, not just on a university's premises. The issue is the nature of the train and test data and how independent of each other they are.
I am not fully satisfied with the results here. However, deep learning takes a long time to train on my laptop.

The results are better than the benchmark, but I cannot support this on a statistical basis, since that would require more test data than I have.




# V. Conclusion

### Free Form Visualization

#### Figure 5.  Performance plots of different folds for the winning model.
The training and validation scores are close initially and then diverge, but only slightly compared with Model 1 (see Figure 6).
<img src="TCNN_mdf_perfm.png" width="1000">

#### Figure 6. Performance plots of Model 1
The first 20 epochs are fine, but after that we observe overfitting.
<img src="Tcnn_valid_bs32.png" width="1000" >


#### Figure 7. Performance plots of Model 2
Here, except for the first run (holding out Fold 1 for validation), the training and validation scores match well. This indicates that if I let the training run for more epochs, say 100, I would probably get a better final score.
<img src="FCN_performace.png" width="1000" >

### Reflection

In this section I describe the end-to-end solution. I started by exploring the data, which is comprised of time series of 10 different variables. Each 10-dimensional series represents a surface. I explored the surface (class) distribution and found it non-uniform. The 10 feature series show variability that is not easily distinguishable visually. Next I standardized each feature and reshaped the series into three-dimensional arrays, so that they are suitable for convolutional neural networks.
Model training and validation were carried out simultaneously. I split the data into 5 folds, each stratified by class distribution, and fit the model while holding out one fold for validation. Model learning runs over 50-60 epochs for each of the five turns, and I keep an eye on the loss and the accuracy at each epoch, keeping in mind the possibility of overfitting. I see overfitting when the loss on the training folds is very small while the validation loss stays large.

At the end I compute the accuracy as well as the F1 score for the overall model. 

There are several issues that I struggled with in this project. The first has to do with the role of groups (`group_id`) and whether I should take them into account in cross-validation, as I mentioned above.

The second has to do with my intuition regarding the orientation variables.
Intuitively, I thought that velocity and acceleration should be sufficient to detect the surface characteristics, and I found it hard to grasp that orientation is important for surface recognition. However, as I write this I can see that orientation measured by a quaternion is not the same as position. Orientation may reflect angular displacement with respect to a particular direction. For example, if the surface has some tiny bumps, then the robot may make some rotations with respect to some direction, not necessarily the direction of movement. In short, I confused orientation with the location or position of the robot. I spent a long time trying and adjusting models without the orientation variables in the input data.

I also considered the Fourier transform of the series, the phase, the autocorrelation function, and the periodogram as engineered features, hoping that they would give better results. But none were better than using the raw data and letting the convolutional filters learn their weights.
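For reference, the kinds of engineered features mentioned above can be computed with NumPy as in the sketch below. It runs on a synthetic sine wave; the function name, the mean removal, and the normalizations are illustrative assumptions and may differ from what was actually tried in this project.

```python
import numpy as np

def spectral_features(x):
    """Return the periodogram, phase and autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                          # remove the constant offset
    spectrum = np.fft.rfft(x)
    periodogram = np.abs(spectrum) ** 2 / x.size
    phase = np.angle(spectrum)
    # Autocorrelation at lags 0..n-1, normalised so that lag 0 equals 1
    acf = np.correlate(x, x, mode="full")[x.size - 1:]
    acf = acf / acf[0]
    return periodogram, phase, acf

t = np.arange(128)
series = np.sin(2 * np.pi * 8 * t / 128)      # synthetic series with 8 cycles
periodogram, phase, acf = spectral_features(series)
dominant_bin = int(np.argmax(periodogram))
```

For a real-valued series of 128 steps, `np.fft.rfft` returns 65 frequency bins, and the periodogram of this synthetic series peaks at the bin matching its 8 cycles.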


### Improvement

To improve on my results, I would try the following:

1. Let Model 2 train for more epochs.

2. Adjust the architecture of Model 1 by removing the pooling layers.

3. Try a recurrent neural network for time series classification.

4. Try other methods mentioned in the review paper.



# References:

H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, P. Muller: "Deep learning for time series classification: a review",  [arXiv:1809.04356](https://arxiv.org/abs/1809.04356)

Kaggle discussion https://www.kaggle.com/c/career-con-2019/discussion/87239#latest-512162

Kaggle kernel https://www.kaggle.com/artgor/where-do-the-robots-drive

Kaggle kernel https://www.kaggle.com/gpreda/robots-need-help