# The Floor Is Lava! ... Or something else?
## Multi-class Classification of Floor Types from Robot Sensor Data

### Please find the corresponding Kaggle kernel [here](https://www.kaggle.com/mediasittich/the-floor-is-lava-multi-class-cassification)

![xkcd Comic 735 - Floor](https://imgs.xkcd.com/comics/floor.png)

## The Competition

Robots are smart… by design. To fully understand and properly navigate a task, however, they need input about their environment.

In this competition, you’ll help robots recognize the floor surface they’re standing on using data collected from Inertial Measurement Units (IMU sensors).

We’ve collected IMU sensor data while driving a small mobile robot over different floor surfaces on the university premises. The task is to predict which one of the nine floor types (e.g. carpet, tiles, concrete) the robot is on using sensor data such as acceleration and velocity. Succeed and you'll help improve the navigation of robots without assistance across many different surfaces, so they won’t fall down on the job.

### Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
# Apply seaborn's default plot style
# (the 'seaborn' matplotlib style name is deprecated since Matplotlib 3.6)
sns.set()

### Load & check out the data

In [None]:
# Load sensor data
X_train = pd.read_csv('../input/X_train.csv')

In [None]:
X_train.head(135)

In [None]:
X_train.info()

**X_[train/test].csv** - the input data, covering 10 sensor channels and 128 measurements per time series plus three ID columns:

* row_id: The ID for this row.
* series_id: ID number for the measurement series. Foreign key to y_train/sample_submission.
* measurement_number: Measurement number within the series.

The **10 sensor channels** are:

The **orientation** channels encode the robot's current orientation as a quaternion:
* orientation_X
* orientation_Y
* orientation_Z
* orientation_W

**Angular velocity** describes how fast the robot is rotating about each axis:
* angular_velocity_X
* angular_velocity_Y
* angular_velocity_Z

**Linear acceleration** components describe how the velocity along each axis changes over time:
* linear_acceleration_X
* linear_acceleration_Y
* linear_acceleration_Z
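
Because the four orientation channels together form a single quaternion, they can be easier to read once converted to roll, pitch and yaw angles. The sketch below applies the standard quaternion-to-Euler formulas (ZYX convention). The function name `quaternion_to_euler` is hypothetical, and the code assumes the quaternion has unit length, which is an assumption about the data and not something stated in the competition description.

```python
import math

def quaternion_to_euler(x, y, z, w):
    """Convert a unit quaternion (x, y, z, w) to (roll, pitch, yaw) in radians."""
    # Roll: rotation about the X axis
    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Pitch: rotation about the Y axis; clamp to guard against rounding error
    sin_pitch = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))
    pitch = math.asin(sin_pitch)
    # Yaw: rotation about the Z axis
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return roll, pitch, yaw

# Identity quaternion: no rotation at all
print(quaternion_to_euler(0.0, 0.0, 0.0, 1.0))

# Quarter turn about the Z axis: x = y = 0 and z = w = sqrt(0.5)
s = math.sqrt(0.5)
roll, pitch, yaw = quaternion_to_euler(0.0, 0.0, s, s)
print(math.degrees(yaw))  # approximately 90
```

On the real data, the four arguments would be the `orientation_X`, `orientation_Y`, `orientation_Z` and `orientation_W` values of one row.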

**_Note_**  
In the context of relational databases, a **foreign key** is a field (or collection of fields) in one table that uniquely identifies a row of another table or the same table.  
Source [Wikipedia - Foreign key](https://en.wikipedia.org/wiki/Foreign_key)
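
In pandas terms, the foreign-key relationship can be checked with two tests: the key is unique in the label table, and every key used in the sensor table exists there. The snippet below is a minimal sketch on toy frames with made-up values; `X_toy` and `y_toy` are hypothetical stand-ins for `X_train` and `y_train`.

```python
import pandas as pd

# Toy stand-ins for X_train and y_train (made-up values, not competition data)
X_toy = pd.DataFrame({
    'series_id': [0, 0, 1, 1, 2, 2],
    'measurement_number': [0, 1, 0, 1, 0, 1],
})
y_toy = pd.DataFrame({
    'series_id': [0, 1, 2],
    'surface': ['carpet', 'tiles', 'concrete'],
})

# series_id must be unique in the label table to act as its key
key_is_unique = y_toy['series_id'].is_unique

# Every series_id used in the sensor table must exist in the label table
all_keys_resolve = X_toy['series_id'].isin(y_toy['series_id']).all()

print(key_is_unique, all_keys_resolve)
```

If either value were `False`, joining the two tables on `series_id` would duplicate rows or leave some of them without a label.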

In [None]:
# Load label data
y_train = pd.read_csv('../input/y_train.csv')

In [None]:
y_train.head()

In [None]:
y_train.info()

**y_train.csv** - the surfaces for training set.

* series_id: ID number for the measurement series.
* group_id: ID number for all of the measurements taken in a recording session. Provided for the training set only, to enable more cross validation strategies.
* surface: the target for this competition.
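
One way to use `group_id` for validation is to hold out whole recording sessions, so that no session contributes data to both the training and the validation part. The following is a minimal sketch on a toy label table with made-up values; the names `y_toy` and `holdout_groups` are hypothetical, and which groups to hold out is an arbitrary choice here.

```python
import pandas as pd

# Toy label table (made-up values): six series recorded in three sessions
y_toy = pd.DataFrame({
    'series_id': [0, 1, 2, 3, 4, 5],
    'group_id': [10, 10, 11, 11, 12, 12],
    'surface': ['carpet', 'carpet', 'tiles', 'tiles', 'concrete', 'concrete'],
})

# Hold out every series that belongs to the chosen recording session(s)
holdout_groups = [12]
is_holdout = y_toy['group_id'].isin(holdout_groups)

train_part = y_toy[~is_holdout]
valid_part = y_toy[is_holdout]

# No recording session appears on both sides of the split
shared = set(train_part['group_id']) & set(valid_part['group_id'])
print(len(train_part), len(valid_part), shared)
```

scikit-learn's `GroupKFold` applies the same idea across several folds.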

In [None]:
# Number and value of different surfaces
print(y_train['surface'].nunique())
print(y_train['surface'].unique())

In [None]:
# Check for missing values in training sets
X_train.isnull().any()

In [None]:
y_train.isnull().any()

### Understanding the Data

#### Simplified Problem Description

A scientist walks a robot every day for a certain period of time on different floor types.  
During each "walk", different kinds of sensor data are collected (the data comes from 10 sensor channels).  
Each "walk" is $n$ steps (time intervals $\Delta t$) long, and after each interval every sensor channel records a datum.  
The scientist walks the robot for $d$ days and after each day annotates (labels) the collected sensor data set with the corresponding floor type.

So our scientist should have $n \times d$ data points in total.  

Each "walk" can be described as a Time Series with the **measurement_number** as time indicator.  
We know from the description of the dataset that

* $n = 128$ from **X_train**
* $d = 3810$ from **y_train**

So the number of total data points should be

$$
n \times d = 128 \times 3810 = 487{,}680
$$

which is the number of rows in **X_train**.
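
The count above can also be checked programmatically. The sketch below builds a toy frame with made-up sizes and verifies that every series contributes exactly $n$ rows; `X_toy` is a hypothetical stand-in for `X_train`.

```python
import pandas as pd

# Toy stand-in for X_train (made-up sizes): d = 3 series with n = 4 measurements each
n, d = 4, 3
X_toy = pd.DataFrame({
    'series_id': [s for s in range(d) for _ in range(n)],
    'measurement_number': [m for _ in range(d) for m in range(n)],
})

# Number of rows per series; every series should contribute exactly n rows
rows_per_series = X_toy.groupby('series_id').size()

assert (rows_per_series == n).all()
assert len(X_toy) == n * d

# The same arithmetic with the numbers quoted above
print(128 * 3810)  # 487680
```

For the real data, the same two assertions would use `X_train`, `n = 128` and `d = len(y_train)`.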

**Note on ID variables**

* The **row_id** is a unique ID for each row. It is a composition of the **series_id** and the **measurement_number**.
* The **measurement_number** is the number of a measurement within a series, hence indicates the time progression.  
* The **series_id** is the ID of the day on which the data was collected.
* If we group all the data that our scientist gathered in e.g. a month and give each month an ID, then that ID is represented by the **group_id** variable.
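
To illustrate how **row_id** is composed of the other two IDs, the sketch below rebuilds it from **series_id** and **measurement_number** on toy rows. The exact format, `<series_id>_<measurement_number>` with an underscore as separator, is an assumption and should be confirmed against `X_train.head()`.

```python
import pandas as pd

# Toy rows (made-up); the '<series_id>_<measurement_number>' format is an assumption
X_toy = pd.DataFrame({
    'row_id': ['0_0', '0_1', '1_0', '1_1'],
    'series_id': [0, 0, 1, 1],
    'measurement_number': [0, 1, 0, 1],
})

# Rebuild row_id from its two parts and compare with the stored column
rebuilt = X_toy['series_id'].astype(str) + '_' + X_toy['measurement_number'].astype(str)
ids_match = (rebuilt == X_toy['row_id']).all()
print(ids_match)
```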

In [None]:
# Check sample submission 
sample_submission = pd.read_csv('../input/sample_submission.csv')

In [None]:
sample_submission.head()

In [None]:
sample_submission.tail()

### EDA

### Questions to answer

* How can we relate the label to the sensor data?
* What are the features in the data set?

In [None]:
# Join frames to have a label for each data point
full = pd.merge(X_train, y_train, on='series_id')
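
A join like this can be made self-checking. The sketch below uses toy frames with made-up values and adds two safeguards that are not part of the cell above: `validate='many_to_one'`, which makes pandas raise an error if `series_id` repeats in the label table, and a left join followed by a check that no sensor row is left without a label.

```python
import pandas as pd

# Toy stand-ins for X_train and y_train (made-up values)
X_toy = pd.DataFrame({
    'series_id': [0, 0, 1, 1],
    'measurement_number': [0, 1, 0, 1],
    'linear_acceleration_X': [0.1, 0.2, -0.3, 0.4],
})
y_toy = pd.DataFrame({
    'series_id': [0, 1],
    'surface': ['carpet', 'tiles'],
})

# validate='many_to_one' raises MergeError if series_id repeats in the label table
full_toy = pd.merge(X_toy, y_toy, on='series_id', how='left', validate='many_to_one')

# A left join keeps every sensor row; no row should be left without a label
same_length = len(full_toy) == len(X_toy)
missing_labels = int(full_toy['surface'].isnull().sum())
print(same_length, missing_labels)
```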

In [None]:
full.head()

In [None]:
full.tail()

In [None]:
full.columns

### Looking at Things... specifically example time series

In [None]:
# Select two example time series: series_id 0 and series_id 3809
example1 = full[full['series_id'] == 0]
example2 = full[full['series_id'] == 3809]

In [None]:
example = pd.concat([example1, example2])

In [None]:
# Plot example data
f, axes = plt.subplots(3, 2, figsize=(15, 10), sharex=True)

# Linear Acceleration
sns.lineplot(data=example, x='measurement_number', y='linear_acceleration_X', hue='surface', ax=axes[0, 0])
sns.lineplot(data=example, x='measurement_number', y='linear_acceleration_Y', hue='surface', ax=axes[1, 0])
sns.lineplot(data=example, x='measurement_number', y='linear_acceleration_Z', hue='surface', ax=axes[2, 0])
# Angular Velocity
sns.lineplot(data=example, x='measurement_number', y='angular_velocity_X', hue='surface', ax=axes[0, 1])
sns.lineplot(data=example, x='measurement_number', y='angular_velocity_Y', hue='surface', ax=axes[1, 1])
sns.lineplot(data=example, x='measurement_number', y='angular_velocity_Z', hue='surface', ax=axes[2, 1])

# Show plot
plt.show()

In [None]:
# Plot example data
f, axes = plt.subplots(2, 2, figsize=(15, 10), sharex=True)

# Orientation
sns.lineplot(data=example, x='measurement_number', y='orientation_X', hue='surface', ax=axes[0, 0])
sns.lineplot(data=example, x='measurement_number', y='orientation_Y', hue='surface', ax=axes[1, 0])
sns.lineplot(data=example, x='measurement_number', y='orientation_Z', hue='surface', ax=axes[0, 1])
sns.lineplot(data=example, x='measurement_number', y='orientation_W', hue='surface', ax=axes[1, 1])


# Show plot
plt.show()

So, basically blue and green... and a lot of stuff going on. Although not so much in terms of rotation.  

What about the whole data set?

In [None]:
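# Group data by 'surface' -> result: multiple time series for each surface type
grouped = full.groupby('surface')

# Descriptive stats on each set of time series
sensor_cols = [c for c in full.columns if c.startswith(('orientation', 'angular_velocity', 'linear_acceleration'))]
grouped[sensor_cols].describe().T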

### Solution Approach

* What kind of problem?
* Which solutions could be used?
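
As the title says, this is a multi-class classification problem with one label per series. One possible starting point, shown here only as a sketch and not as the approach chosen in this notebook, is to collapse each time series into a few summary statistics so that every `series_id` becomes a single row with a single label. The toy frame `full_toy` and its values are made up.

```python
import pandas as pd

# Toy stand-in for the merged frame `full` (made-up values)
full_toy = pd.DataFrame({
    'series_id': [0, 0, 1, 1],
    'angular_velocity_X': [0.0, 2.0, 1.0, 1.0],
    'surface': ['carpet', 'carpet', 'tiles', 'tiles'],
})

# One row per series: summary statistics of a sensor channel
features = full_toy.groupby('series_id')['angular_velocity_X'].agg(['mean', 'std'])

# One label per series, aligned on the same series_id index
labels = full_toy.groupby('series_id')['surface'].first()

print(features)
print(labels)
```

The resulting `features` and `labels` share the same `series_id` index, which is the shape most off-the-shelf classifiers expect.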