<h2>Introduction</h2>

In this competition, participants must help robots recognize the floor surface they’re standing on using data collected from sensors.

In [None]:
import os
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold
import lightgbm as lgb
sns.set()

print("Files in the input folder:")
print(os.listdir("../input"))
train = pd.read_csv('../input/X_train.csv')
test = pd.read_csv('../input/X_test.csv')
y = pd.read_csv('../input/y_train.csv')
sub = pd.read_csv('../input/sample_submission.csv')
print("\nX_train shape: {}, X_test shape: {}".format(train.shape, test.shape))
print("y_train shape: {}, submission shape: {}".format(y.shape, sub.shape))

In [None]:
train.head()

<h3>Data structure</h3>

Each series has 128 measurements, which is why X_train has almost half a million rows while y_train has only 3810 labels, one per series. Each measurement has ten features: orientation, angular velocity and linear acceleration, each in three dimensions. The orientation channel has a fourth component because it is stored as [quaternions](https://en.wikipedia.org/wiki/Conversion_between_quaternions_and_Euler_angles).
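
As a rough illustration of what the four orientation values encode, the sketch below converts a unit quaternion to Euler angles using the standard formulas from the linked page. This conversion is not part of the pipeline in this notebook; the function name `quaternion_to_euler` and the (x, y, z, w) argument order are assumptions made for the example.

```python
import numpy as np

def quaternion_to_euler(x, y, z, w):
    """Convert unit quaternion(s) to (roll, pitch, yaw) in radians."""
    roll = np.arctan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Clip guards against values slightly outside [-1, 1] caused by rounding.
    pitch = np.arcsin(np.clip(2.0 * (w * y - z * x), -1.0, 1.0))
    yaw = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return roll, pitch, yaw

# A 90-degree rotation about the Z axis: q = (0, 0, sin(45 deg), cos(45 deg)).
half_angle = np.pi / 4
roll, pitch, yaw = quaternion_to_euler(0.0, 0.0, np.sin(half_angle), np.cos(half_angle))
```

Because the function only uses NumPy arithmetic, it should also work element-wise on whole columns, assuming the four orientation components are available as separate arrays.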

This is a classification problem with nine possible classes (floor surfaces):

In [None]:
plt.figure(figsize=(10,6))
plt.title("Training labels")
ax = sns.countplot(y='surface', data=y)
y.head(3)

Each group_id is a unique recording session and has only one surface type:

In [None]:
y.groupby('group_id').surface.nunique().max()

<h2>Feature Engineering</h2>

In [None]:
def feature_extraction(raw_frame):
    # Work on a copy so the caller's DataFrame is not modified
    raw_frame = raw_frame.copy()
    frame = pd.DataFrame()
    # Combined channels: sums of the X, Y and Z components
    raw_frame['angular_velocity'] = raw_frame['angular_velocity_X'] + raw_frame['angular_velocity_Y'] + raw_frame['angular_velocity_Z']
    raw_frame['linear_acceleration'] = raw_frame['linear_acceleration_X'] + raw_frame['linear_acceleration_Y'] + raw_frame['linear_acceleration_Z']
    raw_frame['velocity_to_acceleration'] = raw_frame['angular_velocity'] / raw_frame['linear_acceleration']

    # Skip the first three columns and aggregate each remaining column per series
    for col in raw_frame.columns[3:]:
        grouped = raw_frame.groupby('series_id')[col]
        frame[col + '_mean'] = grouped.mean()
        frame[col + '_std'] = grouped.std()
        frame[col + '_max'] = grouped.max()
        frame[col + '_min'] = grouped.min()
        frame[col + '_max_to_min'] = frame[col + '_max'] / frame[col + '_min']

        # Average absolute step between consecutive measurements
        frame[col + '_mean_abs_change'] = grouped.apply(lambda x: np.mean(np.abs(np.diff(x))))
        # Largest magnitude regardless of sign
        frame[col + '_abs_max'] = grouped.apply(lambda x: np.max(np.abs(x)))
    return frame

In [None]:
train_df = feature_extraction(train)
test_df = feature_extraction(test)
train_df.head()
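
To make the per-series statistics concrete, here is a minimal sketch on a toy frame with two series of three measurements each. The column name `signal` and the helper names are made up for the example, and it uses a single `agg` call rather than the column-by-column loop in `feature_extraction`.

```python
import numpy as np
import pandas as pd

# Toy data: two series with three measurements each (hypothetical values).
toy = pd.DataFrame({
    'series_id': [0, 0, 0, 1, 1, 1],
    'signal': [1.0, 3.0, 2.0, -4.0, 0.0, 2.0],
})

def mean_abs_change(x):
    # Average absolute difference between consecutive measurements.
    return np.mean(np.abs(np.diff(x)))

def abs_max(x):
    # Largest magnitude regardless of sign.
    return np.max(np.abs(x))

# One row per series, one column per statistic.
stats = toy.groupby('series_id')['signal'].agg(
    ['mean', 'std', 'max', 'min', mean_abs_change, abs_max])
stats.columns = ['signal_' + name for name in stats.columns]
print(stats)
```

For series 0 the consecutive differences are 2 and -1, so its mean absolute change is 1.5; for series 1 the largest magnitude is 4 even though its maximum is 2.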

<h2>Gradient Boosting</h2>

LightGBM's default metric for the multiclass objective is *multi_logloss*, so I added a custom evaluation metric to track multiclass accuracy as well. Another possible metric is *multi_error*, but I could not find a description of it in the documentation.

In [None]:
le = LabelEncoder()
target = le.fit_transform(y['surface'])

In [None]:
params = {
    'num_leaves': 54,
    'min_data_in_leaf': 10,
    'objective': 'multiclass',
    'max_depth': 7,
    'learning_rate': 0.01,
    "boosting": "gbdt",
    "bagging_freq": 5,
    "bagging_fraction": 0.8126672064208567,
    "bagging_seed": 11,
    "verbosity": -1,
    'reg_alpha': 0.1302650970728192,
    'reg_lambda': 0.3603427518866501,
    "num_class": 9,
    'nthread': -1
}

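def multiclass_accuracy(preds, train_data):
    # Custom eval metric: returns (name, value, is_higher_better).
    # Assumes preds arrives as a flat array grouped by class (all scores for
    # class 0 first, then class 1, ...), hence the reshape and transpose.
    labels = train_data.get_label()
    pred_class = np.argmax(preds.reshape(9, -1).T, axis=1)
    return 'multi_accuracy', np.mean(labels == pred_class), True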

t0 = time.time()
train_set = lgb.Dataset(train_df, label=target)
eval_hist = lgb.cv(params, train_set, nfold=10, num_boost_round=9999,
                   early_stopping_rounds=100, seed=19, feval=multiclass_accuracy)
num_rounds = len(eval_hist['multi_logloss-mean'])
# retrain the model and make predictions for test set
clf = lgb.train(params, train_set, num_boost_round=num_rounds)
predictions = clf.predict(test_df, num_iteration=None)
print("Timer: {:.1f}s".format(time.time() - t0))
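
The `reshape(9, -1).T` step in `multiclass_accuracy` is easy to get wrong, so here is a small NumPy-only sketch of the same idea with three classes and four samples. It assumes the scores arrive as a flat, class-major array; that depends on the LightGBM version, and some releases may pass a 2-D array of shape (n_samples, n_classes) instead, which the hypothetical helper `to_class_labels` also accepts.

```python
import numpy as np

def to_class_labels(preds, n_classes):
    """Return the predicted class index for each sample."""
    preds = np.asarray(preds)
    if preds.ndim == 1:
        # Flat, class-major layout: all class-0 scores, then class-1, ...
        preds = preds.reshape(n_classes, -1).T
    return np.argmax(preds, axis=1)

# probs[i, k] is the score of sample i for class k (made-up numbers).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.3, 0.2]])
labels = np.array([0, 1, 2, 1])

flat = probs.T.ravel()  # simulate the flat class-major layout
pred_class = to_class_labels(flat, n_classes=3)
accuracy = np.mean(labels == pred_class)
```

Reshaping the flat array directly with `reshape(-1, 3)` would mix scores from different samples in one row, which is why the transpose is needed.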

The following plots show the mean logloss and mean accuracy across folds at each iteration (blue lines). The red lines show the standard deviation across folds, plotted against the right-hand axis.

In [None]:
v1, v2 = eval_hist['multi_logloss-mean'][-1], eval_hist['multi_accuracy-mean'][-1]
print("Validation logloss: {:.4f}, accuracy: {:.4f}".format(v1, v2))
plt.figure(figsize=(10, 4))
plt.title("CV multiclass logloss")
num_rounds = len(eval_hist['multi_logloss-mean'])
ax = sns.lineplot(x=range(num_rounds), y=eval_hist['multi_logloss-mean'])
ax2 = ax.twinx()
p = sns.lineplot(x=range(num_rounds), y=eval_hist['multi_logloss-stdv'], ax=ax2, color='r')

plt.figure(figsize=(10, 4))
plt.title("CV multiclass accuracy")
num_rounds = len(eval_hist['multi_accuracy-mean'])
ax = sns.lineplot(x=range(num_rounds), y=eval_hist['multi_accuracy-mean'])
ax2 = ax.twinx()
p = sns.lineplot(x=range(num_rounds), y=eval_hist['multi_accuracy-stdv'], ax=ax2, color='r')

<h3>Feature importance</h3>

In [None]:
importance = pd.DataFrame({'gain': clf.feature_importance(importance_type='gain'),
                           'feature': clf.feature_name()})
importance.sort_values(by='gain', ascending=False, inplace=True)
plt.figure(figsize=(10, 20))
ax = sns.barplot(x='gain', y='feature', data=importance)

<h3>Submission</h3>

In [None]:
sub['surface'] = le.inverse_transform(predictions.argmax(axis=1))
sub.to_csv('lgb_submission2.csv', index=False)
sub.head(3)

<h3>Work in progress...</h3>