# CareerCon 2019 - Help Navigate Robots

**Description**: A kernel for understanding the problem and getting some insights about the data.

## Table of contents

1. Libraries and data loading
2. Understanding the problem
    - 2.1. IMU sensors and physics
    - 2.2. Sneak Peek
3. Univariate Distribution
4. Bivariate analysis
    - 4.1. Average velocity vs target
    - 4.2. Average acceleration vs target
5. Next steps


## 1. Libraries and data loading

Useful libraries

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#for displaying images
from IPython.display import Image

Visualization configuration

In [None]:
# Larger figures and wider table output
plt.rcParams['figure.figsize'] = (15, 7)
pd.set_option("display.max_columns", 400)
pd.options.display.max_colwidth = 250
pd.set_option("display.max_rows", 100)
# High-definition plots
%config InlineBackend.figure_format = 'retina'
sns.set()

Training data and labels

In [None]:
df_train = pd.read_csv('../input/career-con-2019/X_train.csv')
y_train = pd.read_csv('../input/career-con-2019/y_train.csv')

print('X_train.csv shape is {}'.format(df_train.shape))
print('y_train.csv shape is {}'.format(y_train.shape))

## 2. Understanding the problem

We have been given sensor data from robots driving on different surfaces. I imagine these robots look something like this:

In [None]:
Image("../input/careercon2019/robot.JPG",width=400)

### 2.1. IMU sensors and physics

Robots have a sensor called an <a href="http://www.starlino.com/imu_guide.html" target="_blank">IMU</a> (Inertial Measurement Unit), which has two parts: an **accelerometer** and a **gyroscope**:

- **Accelerometers** detect inertial forces and return the acceleration of the robot.
- **Gyroscopes** measure the rotation angles of the robot and the rate at which these angles change.

An IMU sensor looks like this:

In [None]:
Image("../input/careercon2019/IMU.png",width=400)

Let's review some elementary **physics**.

In physics, many quantities are described not by a single number but by a vector. A vector is a directed segment in space, defined by its coordinates, something like this:

In [None]:
Image("../input/careercon2019/vector.jpg",width=400)

Vectors will be useful to understand our data.

### 2.2. Sneak Peek

Let's see what the training data looks like:

In [None]:
df_train.head()

For every row we have 13 features. The first three are identifiers:
- **row_id**: An identifier for every measurement.
- **series_id**: An identifier for every recording.
- **measurement_number**: The position of the measurement in the recording.

Only the last 10 carry sensor information:
- **Orientation**: A four-component <a href="https://en.wikipedia.org/wiki/Conversion_between_quaternions_and_Euler_angles" target="_blank">quaternion</a> describing the orientation of the robot. Like a vector, a quaternion represents a quantity with several components.
- **Angular velocity**: A 3-element vector with the coordinates of the angular velocity vector (Vx,Vy,Vz).
- **Linear acceleration**: A 3-element vector with the coordinates of the linear acceleration vector (ax,ay,az).
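
Quaternions are hard to read directly, so it can help to convert them to Euler angles (roll, pitch, yaw), as described in the linked conversion article. The function below is a minimal sketch of that conversion. It assumes unit quaternions and the common Z-Y-X (yaw, pitch, roll) convention, which the dataset description does not confirm; `quaternion_to_euler` is a hypothetical helper name.

```python
import numpy as np

def quaternion_to_euler(x, y, z, w):
    """Convert a unit quaternion (x, y, z, w) to (roll, pitch, yaw) in radians.

    Assumes the Z-Y-X (yaw, pitch, roll) convention; this is an assumption,
    not something stated by the dataset.
    """
    # Roll: rotation about the X axis
    roll = np.arctan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Pitch: rotation about the Y axis; clip guards against rounding outside [-1, 1]
    pitch = np.arcsin(np.clip(2.0 * (w * y - z * x), -1.0, 1.0))
    # Yaw: rotation about the Z axis
    yaw = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return roll, pitch, yaw

# Quarter turn about the Z axis: q = (0, 0, sin(45 deg), cos(45 deg))
s = np.sqrt(0.5)
roll, pitch, yaw = quaternion_to_euler(0.0, 0.0, s, s)
print(np.degrees([roll, pitch, yaw]))
```

Because the function only uses NumPy operations, it should also accept whole columns at once, for example the four orientation columns of `df_train` (assumed to be named `orientation_X`, `orientation_Y`, `orientation_Z`, `orientation_W`).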

Each recording of the robot consists of 128 rows.

In [None]:
df_train.shape[0]/y_train.shape[0]

How many recordings do we have?

In [None]:
y_train.shape[0]

We have 3810 recordings
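
The ratio of rows to recordings only gives the average length of a recording. A per-recording count is a stricter check that every series has the same number of measurements. The sketch below uses a small toy frame in place of `df_train` so that it runs on its own; `measurements_per_series` is a hypothetical helper name.

```python
import pandas as pd

def measurements_per_series(df):
    """Return the number of rows recorded for each series_id."""
    return df.groupby('series_id').size()

# Toy frame standing in for df_train: 2 recordings of 3 measurements each
toy = pd.DataFrame({
    'series_id': [0, 0, 0, 1, 1, 1],
    'measurement_number': [0, 1, 2, 0, 1, 2],
})

counts = measurements_per_series(toy)
# A single distinct value means every recording has the same length
print(counts.unique())
```

Calling `measurements_per_series(df_train).unique()` in the notebook should return a single value if every recording really has 128 rows.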

In [None]:
y_train.head(5)

In [None]:
print('Number of classes: {}'.format(y_train.surface.nunique()))
print('Number of group_id: {}'.format(y_train.group_id.nunique()))

- Every series was recorded in a particular recording session, identified by the group_id feature. There are 73 different recording sessions.
- We have nine different surfaces.

## 3. Univariate Distribution

Let's start with the target variable, **surface**:

In [None]:
sns.catplot(x='surface',data=y_train,kind='count')
plt.xticks(rotation=90)
plt.show()
print(y_train.surface.value_counts(normalize=True))

Let's check the **group_id** feature.

In [None]:
sns.countplot(x='group_id',data=y_train)
plt.xticks(rotation=90)
plt.show()

Have different surfaces been measured in the same recording session?

In [None]:
y_train.groupby('group_id').surface.nunique().max()

No: the maximum is 1, so every recording session covers a single surface type.

## 4. Bivariate analysis

We want to check whether the average angular velocity or linear acceleration of the robot depends on the surface or on the recording session.<br><br>
First, we have to calculate the norm of the angular velocity and linear acceleration vectors. The norm is the length of the vector.
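
As a quick numeric check of that definition, here is a small sketch with a made-up vector (3, 4, 12), chosen only because its length is a round number. It compares the square root of the sum of squared components with NumPy's built-in `np.linalg.norm`.

```python
import numpy as np

# Made-up example vector, not taken from the dataset
v = np.array([3.0, 4.0, 12.0])

# Length from the definition: square root of the sum of squared components
norm_manual = np.sqrt((v ** 2).sum())

# NumPy's built-in Euclidean norm should agree
norm_numpy = np.linalg.norm(v)

print(norm_manual, norm_numpy)
```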

In [None]:
#Function to calculate the norm of a three element vector
def vector_norm(x,y,z,df):
    return np.sqrt(df[x]**2 + df[y]**2 + df[z]**2)

In [None]:
df_train['angular_velocity_norm'] =vector_norm('angular_velocity_X',
                                                'angular_velocity_Y',
                                                'angular_velocity_Z',df_train)

df_train['linear_acceleration_norm'] =vector_norm('linear_acceleration_X',
                                                'linear_acceleration_Y',
                                                'linear_acceleration_Z',df_train)

We need to create a new dataframe with average velocity and acceleration per recording.

In [None]:
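# Average the vector norms over each recording (series_id).
# Selecting several columns after groupby requires a list, not a tuple.
new_df = (df_train.groupby('series_id')[['angular_velocity_norm', 'linear_acceleration_norm']]
          .mean()
          .reset_index())
new_df.columns = ['series_id', 'avg_velocity', 'avg_acceleration']
# Attach the labels by key instead of relying on row order
new_df = new_df.merge(y_train[['series_id', 'surface', 'group_id']], on='series_id')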

In [None]:
new_df.head(3)

### 4.1. Average Velocity

Let's analyze whether average velocity differs depending on the surface and on the recording session (group_id).

In [None]:
sns.boxplot(x='surface',y='avg_velocity',data=new_df)
plt.title('avg_velocity vs surface')

It looks like there is a slight difference. This is plausible, as a robot would be expected to reach a higher speed on wood than on carpet.

In [None]:
surfaces = new_df.surface.unique()

for surface in surfaces:
    sns.swarmplot(x=new_df[new_df.surface == surface]['group_id'],
                  y=new_df[new_df.surface == surface]['avg_velocity'])
    plt.title('Surface = {}'.format(surface))
    plt.show()

It looks like some surfaces, such as wood and hard_tiles_large_space, depend a little on the recording session.

### 4.2. Average Acceleration

Let's move on to average acceleration.

In [None]:
sns.boxplot(x='surface',y='avg_acceleration',data=new_df)
plt.title('Avg_acceleration vs Surface')

We can see the same pattern for average acceleration.

In [None]:
for surface in surfaces:
    sns.swarmplot(x=new_df[new_df.surface == surface]['group_id'],
                  y=new_df[new_df.surface == surface]['avg_acceleration'])
    plt.title('Surface = {}'.format(surface))
    plt.show()

Three surfaces show differences between recording sessions: wood, carpet and hard_tiles_large_space.

## 5. Next steps

- Extract features for each recording.
- Check if train and test come from same distribution.
- Define a validation strategy (group_id may work here).
- Train a GBDT model.
- Try an LSTM model with the raw records.
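
For the validation step, the idea behind using group_id is that recordings from the same session should never appear in both the training and validation folds. The sketch below is a hand-rolled illustration of such a split using only NumPy; `group_kfold_indices` is a hypothetical helper, and scikit-learn's `GroupKFold` offers a ready-made splitter for the same purpose.

```python
import numpy as np

def group_kfold_indices(groups, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs in which no group is on both sides."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    # Shuffle the group labels so folds are not tied to their numeric order
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_groups)
    # Deal the shuffled groups into n_splits folds
    for fold_groups in np.array_split(unique_groups, n_splits):
        valid_mask = np.isin(groups, fold_groups)
        yield np.where(~valid_mask)[0], np.where(valid_mask)[0]

# Toy labels: 12 recordings from 6 made-up sessions
toy_groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

folds = list(group_kfold_indices(toy_groups, n_splits=3))
for train_idx, valid_idx in folds:
    # Sessions held out for validation in this fold
    print(sorted(set(toy_groups[valid_idx].tolist())))
```

In the notebook, `y_train.group_id` would play the role of `toy_groups`. Note that this simple version balances the number of sessions per fold, not the number of recordings or the surface classes.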

Hope it was useful. Thanks for reading ;)