<h1><center><font size="6">Robots need help!</font></center></h1>

<img src="https://upload.wikimedia.org/wikipedia/commons/d/df/RobotsMODO.jpg" width="400"></img>

<br>

# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Prepare for data analysis</a>  
- <a href='#3'>Data exploration</a>   
 - <a href='#31'>Check the data</a>   
 - <a href='#32'>Distribution of target feature `surface`</a>   
 - <a href='#33'>Density plots of features</a>   
- <a href='#4'>Feature engineering</a>
- <a href='#5'>Model</a>
- <a href='#6'>Submission</a>  
- <a href='#7'>References</a>

# <a id='1'>Introduction</a>  

## Competition
In this competition, we will help robots recognize the floor surface they're standing on. The floor could be one of several types, such as carpet, tiles, or concrete.

## Data
The data provided by the organizers is IMU sensor data collected while driving a small mobile robot over different floor surfaces on the university premises.  

## Kernel
In this Kernel we perform EDA on the data, experiment with feature engineering, and build a predictive model.

# <a id='2'>Prepare for data analysis</a>  


## Load packages


In [None]:
import gc
import os
import logging
import datetime
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
from lightgbm import LGBMClassifier
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
warnings.filterwarnings('ignore')

## Load data   

Let's check what data files are available.

In [None]:
IS_LOCAL = False
if(IS_LOCAL):
    PATH="../input/careercon/"
else:
    PATH="../input/"
os.listdir(PATH)

Let's load the data.

In [None]:
%%time
X_train = pd.read_csv(os.path.join(PATH, 'X_train.csv'))
X_test = pd.read_csv(os.path.join(PATH, 'X_test.csv'))
y_train = pd.read_csv(os.path.join(PATH, 'y_train.csv'))

In [None]:
print("Train X: {}\nTrain y: {}\nTest X: {}".format(X_train.shape, y_train.shape, X_test.shape))

We can observe that the train data and the labels have a different number of rows: X_train has one row per measurement, while y_train has one row per series.

# <a id='3'>Data exploration</a>  

## <a id='31'>Check the data</a>  

Let's check the train and test set.

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
X_test.head()

X_train and X_test datasets have the following entries:  

* series and measurement identifiers: **row_id**, **series_id**, **measurement_number**: these uniquely identify a series and a measurement; in the train set **series_id** runs from 0 to 3809 (3810 series) and **measurement_number** runs from 0 to 127 (128 measurements per series);  
* measurement orientations: **orientation_X**, **orientation_Y**, **orientation_Z**, **orientation_W**;   
* angular velocities: **angular_velocity_X**, **angular_velocity_Y**, **angular_velocity_Z**;
* linear accelerations: **linear_acceleration_X**, **linear_acceleration_Y**, **linear_acceleration_Z**.

y_train has the following columns:  

* **series_id** - this corresponds to the series in train data;  
* **group_id**;  
* **surface** - this is the surface type that needs to be predicted.



In [None]:
def missing_data(data):
    total = data.isnull().sum()
    percent = (data.isnull().sum()/data.isnull().count()*100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return(np.transpose(tt))

In [None]:
missing_data(X_train)

In [None]:
missing_data(X_test)

There are no missing values in train and test data.

In [None]:
missing_data(y_train)

The train labels have no missing data either.

In [None]:
X_train.describe()

In [None]:
X_test.describe()

In [None]:
y_train.describe()

X_train and y_train contain the same number of series, numbered from 0 to 3809 (3810 in total). Each series has 128 measurements.   
Each series in the train dataset belongs to a group (numbered from 0 to 72).  
The numbers of rows in X_train and X_test differ by 6 x 128, 128 being the number of measurements in each series.  
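
As a hedged illustration of the counts above, the following sketch runs the same kind of check on a tiny synthetic frame. The `toy` data and its sizes are made up for the example; only the column names `series_id` and `measurement_number` come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train: 3 series with 4 measurements each (made-up sizes).
n_series, n_meas = 3, 4
toy = pd.DataFrame({
    'series_id': np.repeat(np.arange(n_series), n_meas),
    'measurement_number': np.tile(np.arange(n_meas), n_series),
})

# Number of distinct series and number of rows in each series.
series_count = toy['series_id'].nunique()
rows_per_series = toy.groupby('series_id').size()

print(series_count)
print(rows_per_series.unique())  # a single value when all series have the same length
```

Applied to the real X_train, the same two lines should confirm the number of series and the number of measurements per series reported by `describe()`.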

## <a id='32'>Distribution of target feature - surface</a>  


In [None]:
f, ax = plt.subplots(1,1, figsize=(16,4))
total = float(len(y_train))
g = sns.countplot(x=y_train['surface'], order=y_train['surface'].value_counts().index, ax=ax)
g.set_title("Number and percentage of labels for each class")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(100*height/total),
            ha="center") 
plt.show()    
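
The percentages annotated on the bars can also be obtained as a table. This is a minimal sketch on made-up labels (`toy_labels` is hypothetical); on the real data the same calls would be applied to `y_train['surface']`.

```python
import pandas as pd

# Made-up labels standing in for y_train['surface'].
toy_labels = pd.Series(
    ['concrete', 'carpet', 'concrete', 'tiled', 'concrete', 'carpet'],
    name='surface')

# Absolute counts and percentages per class, most frequent first.
counts = toy_labels.value_counts()
percent = toy_labels.value_counts(normalize=True) * 100
summary = pd.DataFrame({'count': counts, 'percent': percent.round(2)})
print(summary)
```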

## <a id='33'>Density plots of features</a>  

Let's now show the density plots of the variables in the train and test datasets. 

We then plot the distribution of each variable separately for each value of **surface**, using one color per class.

In [None]:
def plot_feature_distribution(df1, df2, label1, label2, features):
    i = 0
    sns.set_style('whitegrid')
    plt.figure()
    fig, ax = plt.subplots(2,5,figsize=(16,8))

    for feature in features:
        i += 1
        plt.subplot(2,5,i)
        sns.kdeplot(df1[feature], bw=0.5,label=label1)
        sns.kdeplot(df2[feature], bw=0.5,label=label2)
        plt.xlabel(feature, fontsize=9)
        locs, labels = plt.xticks()
        plt.tick_params(axis='x', which='major', labelsize=8)
        plt.tick_params(axis='y', which='major', labelsize=8)
    plt.show();

In [None]:
features = X_train.columns.values[3:]
plot_feature_distribution(X_train, X_test, 'train', 'test', features)

In [None]:
def plot_feature_class_distribution(classes,tt, features):
    i = 0
    sns.set_style('whitegrid')
    plt.figure()
    fig, ax = plt.subplots(5,2,figsize=(16,24))

    for feature in features:
        i += 1
        plt.subplot(5,2,i)
        for clas in classes:
            ttc = tt[tt['surface']==clas]
            sns.kdeplot(ttc[feature], bw=0.5,label=clas)
        plt.xlabel(feature, fontsize=9)
        locs, labels = plt.xticks()
        plt.tick_params(axis='x', which='major', labelsize=8)
        plt.tick_params(axis='y', which='major', labelsize=8)
    plt.show();

In [None]:
classes = (y_train['surface'].value_counts()).index
tt = X_train.merge(y_train, on='series_id', how='inner')
plot_feature_class_distribution(classes, tt, features)

# <a id='4'>Feature engineering</a>  


This section borrows heavily from this Kernel: https://www.kaggle.com/vanshjatana/help-humanity-by-helping-robots 
The quaternion_to_euler transformation procedure is also credited in the original Kernel, and I kept this reference as well.
I also corrected a few issues and added some more engineered features. Thanks to @timmmmmms for pointing them out.

In [None]:
# https://stackoverflow.com/questions/53033620/how-to-convert-euler-angles-to-quaternions-and-get-the-same-euler-angles-back-fr?rq=1
def quaternion_to_euler(x, y, z, w):
    import math
    t0 = +2.0 * (w * x + y * z)
    t1 = +1.0 - 2.0 * (x * x + y * y)
    X = math.atan2(t0, t1)

    t2 = +2.0 * (w * y - z * x)
    t2 = +1.0 if t2 > +1.0 else t2
    t2 = -1.0 if t2 < -1.0 else t2
    Y = math.asin(t2)

    t3 = +2.0 * (w * z + x * y)
    t4 = +1.0 - 2.0 * (y * y + z * z)
    Z = math.atan2(t3, t4)

    return X, Y, Z
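
The function above converts one quaternion at a time. As a sketch of an alternative, the same formulas can be applied to whole columns with NumPy. The name `quaternion_to_euler_np` is hypothetical and is not part of the original Kernel; the formulas are the ones used above.

```python
import numpy as np

def quaternion_to_euler_np(x, y, z, w):
    """Vectorized variant of quaternion_to_euler (same formulas, array inputs)."""
    x, y, z, w = (np.asarray(a, dtype=float) for a in (x, y, z, w))
    # Rotation about X.
    X = np.arctan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Rotation about Y; clip guards arcsin against rounding outside [-1, 1].
    Y = np.arcsin(np.clip(2.0 * (w * y - z * x), -1.0, 1.0))
    # Rotation about Z.
    Z = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return X, Y, Z

# Two test quaternions: identity, and a 90 degree rotation about Z.
s = np.sin(np.pi / 4)
c = np.cos(np.pi / 4)
ex, ey, ez = quaternion_to_euler_np([0.0, 0.0], [0.0, 0.0], [0.0, s], [1.0, c])
print(ex, ey, ez)
```

With this variant, the Python loop over rows in the feature engineering step could be replaced by a single call on the four orientation columns.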

In [None]:
def perform_feature_engineering(df):
    df_out = pd.DataFrame()
    df['total_angular_velocity'] = np.sqrt(np.square(df['angular_velocity_X']) + np.square(df['angular_velocity_Y']) + np.square(df['angular_velocity_Z']))
    df['total_linear_acceleration'] = np.sqrt(np.square(df['linear_acceleration_X']) + np.square(df['linear_acceleration_Y']) + np.square(df['linear_acceleration_Z']))
    df['total_xyz'] = np.sqrt(np.square(df['orientation_X']) + np.square(df['orientation_Y']) +
                              np.square(df['orientation_Z']))
    df['acc_vs_vel'] = df['total_linear_acceleration'] / df['total_angular_velocity']
    
    x, y, z, w = df['orientation_X'].tolist(), df['orientation_Y'].tolist(), df['orientation_Z'].tolist(), df['orientation_W'].tolist()
    nx, ny, nz = [], [], []
    for i in range(len(x)):
        xx, yy, zz = quaternion_to_euler(x[i], y[i], z[i], w[i])
        nx.append(xx)
        ny.append(yy)
        nz.append(zz)
    
    df['euler_x'] = nx
    df['euler_y'] = ny
    df['euler_z'] = nz
    
    df['total_angle'] = np.sqrt(np.square(df['euler_x']) + np.square(df['euler_y']) + np.square(df['euler_z']))
    df['angle_vs_acc'] = df['total_angle'] / df['total_linear_acceleration']
    df['angle_vs_vel'] = df['total_angle'] / df['total_angular_velocity']
    
    def mean_change_of_abs_change(x):
        return np.mean(np.diff(np.abs(np.diff(x))))

    def mean_abs_change(x):
        return np.mean(np.abs(np.diff(x)))
    
    for col in df.columns:
        if col in ['row_id', 'series_id', 'measurement_number']:
            continue
        df_out[col + '_mean'] = df.groupby(['series_id'])[col].mean()
        df_out[col + '_min'] = df.groupby(['series_id'])[col].min()
        df_out[col + '_max'] = df.groupby(['series_id'])[col].max()
        df_out[col + '_std'] = df.groupby(['series_id'])[col].std()
        df_out[col + '_mad'] = df.groupby(['series_id'])[col].mad()
        df_out[col + '_med'] = df.groupby(['series_id'])[col].median()
        df_out[col + '_skew'] = df.groupby(['series_id'])[col].skew()
        df_out[col + '_range'] = df_out[col + '_max'] - df_out[col + '_min']
        df_out[col + '_max_to_min'] = df_out[col + '_max'] / df_out[col + '_min']
        df_out[col + '_mean_abs_change'] = df.groupby('series_id')[col].apply(mean_abs_change)
        df_out[col + '_mean_change_of_abs_change'] = df.groupby('series_id')[col].apply(mean_change_of_abs_change)
        df_out[col + '_abs_max'] = df.groupby('series_id')[col].apply(lambda x: np.max(np.abs(x)))
        df_out[col + '_abs_min'] = df.groupby('series_id')[col].apply(lambda x: np.min(np.abs(x)))
        df_out[col + '_abs_mean'] = df.groupby('series_id')[col].apply(lambda x: np.mean(np.abs(x)))
        df_out[col + '_abs_std'] = df.groupby('series_id')[col].apply(lambda x: np.std(np.abs(x)))
        df_out[col + '_abs_avg'] = (df_out[col + '_abs_min'] + df_out[col + '_abs_max'])/2
        df_out[col + '_abs_range'] = df_out[col + '_abs_max'] - df_out[col + '_abs_min']

    return df_out
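
To make the per-series aggregates concrete, here is a small sketch on made-up data (`toy` and the column name `signal` are hypothetical). It also shows one possible replacement for `.mad()`, which was deprecated in pandas 1.5 and removed in pandas 2.0, so the function above may need a similar change on recent versions. The helper name `mean_abs_dev` is an assumption for this example.

```python
import numpy as np
import pandas as pd

def mean_abs_dev(s):
    # Mean absolute deviation around the mean (what Series.mad() computed).
    return (s - s.mean()).abs().mean()

def mean_abs_change(s):
    # Average absolute difference between consecutive measurements.
    return np.mean(np.abs(np.diff(s)))

def mean_change_of_abs_change(s):
    # Average change of the absolute consecutive differences.
    return np.mean(np.diff(np.abs(np.diff(s))))

# Two made-up series of four measurements each.
toy = pd.DataFrame({
    'series_id': [0, 0, 0, 0, 1, 1, 1, 1],
    'signal':    [1.0, 3.0, 6.0, 10.0, 2.0, 2.0, 2.0, 2.0],
})

# One row per series, one column per aggregate.
agg = toy.groupby('series_id')['signal'].agg(
    ['mean', 'min', 'max', mean_abs_dev, mean_abs_change,
     mean_change_of_abs_change])
print(agg)
```

For series 0 the consecutive differences are 2, 3, 4, so `mean_abs_change` is 3 and `mean_change_of_abs_change` is 1; the constant series 1 gives 0 for both.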

In [None]:
%%time
X_train = perform_feature_engineering(X_train)

In [None]:
%%time
X_test = perform_feature_engineering(X_test)

In [None]:
X_train.head()

In [None]:
X_test.head()

# <a id='5'>Model</a>  

We use LabelEncoder for the target feature.

In [None]:
le = LabelEncoder()
y_train['surface'] = le.fit_transform(y_train['surface'])

We replace NaN and infinite ($\pm\infty$) values with 0.

In [None]:
X_train.fillna(0, inplace = True)
X_train.replace(-np.inf, 0, inplace = True)
X_train.replace(np.inf, 0, inplace = True)
X_test.fillna(0, inplace = True)
X_test.replace(-np.inf, 0, inplace = True)
X_test.replace(np.inf, 0, inplace = True)
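
Infinite and missing values can appear in ratio features such as `_max_to_min` when the denominator is 0. The sketch below reproduces this on made-up numbers (`toy` and its column names are hypothetical) and shows a compact, non-inplace way to do the same cleaning.

```python
import numpy as np
import pandas as pd

# Made-up max/min values; the second and third rows have a zero denominator.
toy = pd.DataFrame({'f_max': [2.0, 3.0, 0.0], 'f_min': [1.0, 0.0, 0.0]})
toy['f_max_to_min'] = toy['f_max'] / toy['f_min']  # 2.0, inf, NaN

# Map +inf and -inf to NaN first, then fill every NaN with 0.
clean = toy.replace([np.inf, -np.inf], np.nan).fillna(0)
print(clean)
```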

## Prepare for cross-validation.

In [None]:
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=59)
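
To illustrate what stratification does, this sketch splits made-up, imbalanced labels (`y_toy` is hypothetical) with the same settings and counts the classes in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 50, 30 and 20 samples for classes 0, 1 and 2.
y_toy = np.array([0] * 50 + [1] * 30 + [2] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=59)

# Class counts in the validation part of each fold.
fold_counts = [np.bincount(y_toy[val_idx], minlength=3)
               for _, val_idx in skf.split(X_toy, y_toy)]
for i, counts in enumerate(fold_counts):
    print(i, counts)
```

Each validation fold keeps the 50/30/20 class proportions of the full label set, which matters here because the surface classes are far from balanced.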

## Random Forest classifier

We first use a Random Forest Classifier model.

In [None]:
sub_preds_rf = np.zeros((X_test.shape[0], 9))
oof_preds_rf = np.zeros((X_train.shape[0]))
score = 0
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train['surface'])):
    clf =  RandomForestClassifier(n_estimators = 2000, n_jobs = -1)
    clf.fit(X_train.iloc[trn_idx], y_train['surface'][trn_idx])
    oof_preds_rf[val_idx] = clf.predict(X_train.iloc[val_idx])
    sub_preds_rf += clf.predict_proba(X_test) / folds.n_splits
    score += clf.score(X_train.iloc[val_idx], y_train['surface'][val_idx])
    print('Fold: {} score: {}'.format(fold_,clf.score(X_train.iloc[val_idx], y_train['surface'][val_idx])))
print('Avg Accuracy', score / folds.n_splits)
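
The out-of-fold predictions collected in `oof_preds_rf` can be scored as a whole, in addition to the per-fold accuracy printed above. This is a minimal sketch on made-up arrays (`y_true` and `oof` are hypothetical stand-ins for `y_train['surface']` and `oof_preds_rf`).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up encoded labels and out-of-fold predictions for 3 classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
oof = np.array([0, 1, 1, 1, 2, 0])

# Overall accuracy and per-class confusion (rows: true, columns: predicted).
acc = accuracy_score(y_true, oof)
cm = confusion_matrix(y_true, oof)
print(acc)
print(cm)
```

The confusion matrix shows which surfaces are mistaken for which, something a single accuracy number hides.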


## LightGBM Classifier

We also use a LightGBM Classifier model.

In [None]:
USE_LGB = False
if(USE_LGB):
    sub_preds_lgb = np.zeros((X_test.shape[0], 9))
    oof_preds_lgb = np.zeros((X_train.shape[0]))
    score = 0
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train['surface'])):
        train_x, train_y = X_train.iloc[trn_idx], y_train['surface'][trn_idx]
        valid_x, valid_y = X_train.iloc[val_idx], y_train['surface'][val_idx]
        clf =  LGBMClassifier(
                      nthread=-1,
                      n_estimators=2000,
                      learning_rate=0.01,
                      boosting_type='gbdt',
                      is_unbalance=True,
                      objective='multiclass',
                      num_class=9,
                      silent=-1,
                      verbose=-1,
                      feval=None)
        clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)], 
                     verbose= 1000, early_stopping_rounds= 200)

        oof_preds_lgb[val_idx] = clf.predict(valid_x)
        sub_preds_lgb += clf.predict_proba(X_test) / folds.n_splits
        score += clf.score(valid_x, valid_y)
        print('Fold: {} score: {}'.format(fold_,clf.score(valid_x, valid_y)))
    print('Avg Accuracy', score / folds.n_splits)

# <a id='6'>Submission</a>  

We write a submission file for the RF model and, when `USE_LGB` is enabled, for the LGB model as well.

In [None]:
submission = pd.read_csv(os.path.join(PATH,'sample_submission.csv'))
submission['surface'] = le.inverse_transform(sub_preds_rf.argmax(axis=1))
submission.to_csv('submission_rf.csv', index=False)
submission.head(10)
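
The step from averaged fold probabilities to surface names can be sketched on made-up numbers. Here `le_toy` and `fold_probs` are hypothetical; the sketch assumes, as the code above does, that the probability columns follow the encoded class order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts classes alphabetically: carpet=0, concrete=1, tiled=2.
le_toy = LabelEncoder().fit(['tiled', 'carpet', 'concrete'])

# Made-up class probabilities from two folds, for two test series.
fold_probs = [
    np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]),
    np.array([[0.1, 0.7, 0.2], [0.4, 0.5, 0.1]]),
]

# Average the probabilities over folds, as done for sub_preds_rf.
avg = np.zeros((2, 3))
for p in fold_probs:
    avg += p / len(fold_probs)

# Pick the most probable class and map it back to its name.
labels = le_toy.inverse_transform(avg.argmax(axis=1))
print(labels)
```

Averaging probabilities before taking the argmax lets a confident fold outweigh an uncertain one, which a majority vote over per-fold labels would not do.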

In [None]:
USE_LGB = False
if(USE_LGB):
    submission['surface'] = le.inverse_transform(sub_preds_lgb.argmax(axis=1))
    submission.to_csv('submission_lgb.csv', index=False)
    submission.head(10)

# <a id='7'>References</a>    

[1] https://www.kaggle.com/vanshjatana/help-humanity-by-helping-robots-4e306b  
[2] https://www.kaggle.com/artgor/where-do-the-robots-drive  
