## Introduction:
Robots are smart… by design. To fully understand and properly navigate a task, however, they need input about their environment.
In this competition, you’ll help robots recognize the floor surface they’re standing on using data collected from Inertial Measurement Units (IMU sensors).

## About Data: 
CareerCon has collected IMU sensor data while driving a small mobile robot over different floor surfaces on the university premises. 

## Objective:
The task is to predict which one of the nine floor types (such as carpet, tiles, or concrete) the robot is on, using sensor data such as acceleration and velocity. Succeed and you'll help improve the navigation of robots without assistance across many different surfaces, so they won’t fall down on the job.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

import plotly.offline as py 
from plotly.offline import init_notebook_mode, iplot
py.init_notebook_mode(connected=True) # this call lets us use the offline version of plotly
import plotly.graph_objs as go # plays the same role as "plt" in matplotlib

# Any results you write to the current directory are saved as output.

In [None]:
X_train = pd.read_csv('../input/X_train.csv')
X_train.head(3)

In [None]:
y_train = pd.read_csv('../input/y_train.csv')
y_train.head(3)

In [None]:
X_test = pd.read_csv('../input/X_test.csv')
X_test.head(3)

# Descriptive Statistics

In [None]:
print('Size of Train Data')
print('Number of samples are: {0}\nNumber of features are: {1}'.format(X_train.shape[0], X_train.shape[1]))

print('\nSize of Test Data')
print('Number of samples are: {0}\nNumber of features are: {1}'.format(X_test.shape[0], X_test.shape[1]))

print('\nSize of Target Data')
print('Number of samples are: {0}\nNumber of features are: {1}'.format(y_train.shape[0], y_train.shape[1]))

## Train Data Description

In [None]:
X_train.describe()

## Target surface type and their sample count

In [None]:
target_data = y_train['surface'].value_counts().reset_index().rename(columns = {'index' : 'target'})
target_data

In [None]:
#sns.countplot(y='surface',data = y_train)
trace0 = go.Bar(
    x = y_train['surface'].value_counts().index,
    y = y_train['surface'].value_counts().values
    )

trace1 = go.Pie(
    labels = y_train['surface'].value_counts().index,
    values = y_train['surface'].value_counts().values,
    domain = {'x':[0.55,1]})

data = [trace0, trace1]
layout = go.Layout(
    title = 'Frequency Distribution for surface/target data',
    xaxis = dict(domain = [0,.50]))

fig = go.Figure(data = data, layout = layout)
py.iplot(fig)


## Preprocessing data

### Is there any missing data?

In [None]:
X_train.isnull().sum()

#### Observation: No missing data

### Is there any duplicate data?

In [None]:
X_train['is_duplicate'] = X_train.duplicated()
X_train['is_duplicate'].value_counts()

#### Observation: There is no duplicate data

In [None]:
X_train = X_train.drop(['is_duplicate'], axis = 1)

### Sorting based on series_id and measurement_number

In [None]:
X_train_sort = X_train.sort_values(by = ['series_id', 'measurement_number'], ascending = True)
X_train_sort.head()

### Min_Max value of each feature

In [None]:
def min_max_values(col):
    top = X_train[col].idxmax()
    top_obs = pd.DataFrame(X_train.loc[top])
    
    bottom = X_train[col].idxmin()
    bot_obs = pd.DataFrame(X_train.loc[bottom])
    
    min_max_obs = pd.concat([top_obs, bot_obs], axis = 1)
    
    return min_max_obs

In [None]:
min_max_values('series_id')

### Correlation Matrix

In [None]:
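# Correlate numeric columns only; recent pandas versions raise on string columns
corr = X_train.select_dtypes(include='number').corr()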
corr

In [None]:
fig, ax = plt.subplots(1,1, figsize = (15,6))

hm = sns.heatmap(corr,
                ax = ax,
                cmap = 'coolwarm',
                annot = True,
                fmt = '.2f',
                linewidths = 0.05)
fig.subplots_adjust(top=0.93)
fig.suptitle('Orientation, Angular_velocity and Linear_acceleration Correlation Heatmap', 
              fontsize=14, 
              fontweight='bold')

**Observation:**
*     orientation_X and orientation_W are strongly correlated
*     orientation_Y and orientation_Z are strongly correlated
*     linear_acceleration_Y and linear_acceleration_Z are also positively correlated
*     angular_velocity_Y and angular_velocity_Z are negatively correlated
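
Reading pairs off a heatmap by eye is easy to get wrong, so the same observation can be extracted programmatically. The following is a minimal sketch on synthetic data; the helper name `strong_pairs`, the 0.8 threshold, and the `demo` frame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = frame.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest correlations first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic stand-in data (hypothetical, not the competition data)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),   # strongly positive with a
    "c": -a + rng.normal(scale=0.1, size=500),  # strongly negative with a
    "d": rng.normal(size=500),                  # independent noise
})
pairs = strong_pairs(demo, threshold=0.8)
for col_a, col_b, r in pairs:
    print(col_a, col_b, round(r, 3))
```

On the real data, the call would presumably be `strong_pairs(X_train)`, with the threshold tuned to taste.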

### Box plot of angular_velocity, orientation and linear_acceleration data

In [None]:
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(311)
ax.set_title('Distribution of Orientation_X,Y,Z,W',
             fontsize=14, 
             fontweight='bold')
X_train.iloc[:,3:7].boxplot()
ax = fig.add_subplot(312)
ax.set_title('Distribution of Angular_Velocity_X,Y,Z',fontsize=14, 
             fontweight='bold')
X_train.iloc[:,7:10].boxplot()
ax = fig.add_subplot(313)
ax.set_title('Distribution of linear_acceleration_X,Y,Z',fontsize=14, 
             fontweight='bold')
X_train.iloc[:,10:13].boxplot()

**Observation**: There are many outliers in the angular_velocity and linear_acceleration data
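
To put a number on "many outliers", one option is to count the points that fall outside the box plot whiskers. The sketch below uses the common 1.5 × IQR rule; the helper name `iqr_outlier_counts` and the toy `demo` frame are assumptions for illustration only.

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column thresholds with the frame's columns
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()

# Toy data (hypothetical): one well-behaved column, one with a single spike
demo = pd.DataFrame({
    "steady": [1, 2, 3, 4, 5, 6, 7, 8],
    "spiky": [1, 2, 3, 4, 5, 6, 7, 100],
})
counts = iqr_outlier_counts(demo)
print(counts)
```

Applied to the training data, this would give one outlier count per sensor channel, which makes the channels easier to compare than the plots alone.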

### Histogram plot for all features

In [None]:
plt.figure(figsize=(26, 16))
for i, col in enumerate(X_train.columns[3:]):
    ax = plt.subplot(3, 4, i + 1)
    sns.distplot(X_train[col], bins=100, label='train')
    sns.distplot(X_test[col], bins=100, label='test')
    ax.legend()   

### Observation:
*    Angular velocity values look approximately normally distributed; in fact, their distributions are symmetrical.
*    Linear acceleration values also look approximately normal and symmetrical, but the average value of linear_acceleration_Z is slightly negative.
*    The X, Y, Z, W orientation data are neither symmetrical nor bell-shaped:
     *    X and Y orientation values are spread unevenly between -1 and 1.
     *    Z and W orientation values are spread unevenly between -1.5 and 1.5.

### Feature distribution for each target value (surface)

In [None]:
df = X_train.merge(y_train, on = 'series_id', how = 'inner')
targets = (y_train['surface'].value_counts()).index

In [None]:
plt.figure(figsize=(26, 16))
for i,col in enumerate(df.columns[3:13]):
    ax = plt.subplot(3,4,i+1)
    plt.title(col)  # title the current subplot without overwriting ax
    for surface in targets:
        surface_feature = df[df['surface'] == surface]
        sns.kdeplot(surface_feature[col], label = surface)

**Observation:**

*     Even though the 'hard tile' sample count is low, the orientation_X, Y, Z and W densities for the hard tile surface have the highest peaks.
*     For orientation_X, these values range from approx 0.5 to 1.0.
*     For orientation_Y, these values range from approx -1.0 to -0.5.
*     For orientation_Z, these values range from approx -0.12 to -0.8.
*     For orientation_W, these values range from approx 0.07 to 0.12.
*     For the angular velocity and linear acceleration data, the distributions are symmetrical around the mean.
    

## Model

Our goal is to identify which surface the robot is on, based on the input features. More precisely, this is a classification problem.
Logistic regression is a natural first choice here.

### Types of Logistic Regression:
*     **Binary Logistic Regression:** The target variable has only two possible outcomes, such as Spam or Not Spam, Cancer or No Cancer.
*     **Multinomial Logistic Regression:** The target variable has three or more nominal categories, such as the surface type in this problem.
*     **Ordinal Logistic Regression:** The target variable has three or more ordinal categories, such as a restaurant or product rating from 1 to 5.
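
In the multinomial case, the model produces one score per class and turns those scores into probabilities with the softmax function. The sketch below illustrates that step only; the score values are made up and the function is not taken from the original notebook.

```python
import numpy as np

def softmax(scores):
    """Convert per-class scores (one row per sample) into probabilities."""
    # Subtracting the row maximum avoids overflow without changing the result
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical scores for two samples and three classes
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
probs = softmax(scores)
predicted = probs.argmax(axis=1)  # index of the most probable class per sample
print(probs)
print(predicted)
```

Each row of `probs` sums to 1, and the predicted class is simply the one with the highest probability.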

### Selecting Feature

We will select appropriate features for the model. To do that, we will drop the less important columns.
For this model, our data has to be numeric, so we will transform the target data to numerical values using label encoding.
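
As a rough illustration of label encoding, scikit-learn's `LabelEncoder` maps each distinct string label to an integer and can map it back again. The surface names below are a small illustrative sample, not the full set of classes in the competition data.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the real target has nine surface classes
surfaces = ["carpet", "tiled", "concrete", "carpet", "concrete"]

encoder = LabelEncoder()
codes = encoder.fit_transform(surfaces)      # classes are numbered in sorted order
decoded = encoder.inverse_transform(codes)   # recover the original strings

print(list(encoder.classes_))
print(list(codes))
print(list(decoded))
```

Keeping the fitted `encoder` around matters, because predictions made on encoded targets need `inverse_transform` to become surface names again.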

### Splitting data

To understand model performance, dividing the dataset into a training set and a test set is a good strategy.

Let's split the dataset using the function train_test_split(). You need to pass three arguments: the features, the target, and the test set size. Additionally, you can set random_state so that the random split is reproducible.
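
The effect of `random_state` is easiest to see on a toy array. This is a minimal sketch with made-up data; the names `X_demo` and `y_demo` and the 25% test size are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples with 2 features each
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Two calls with the same random_state produce the same split
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))
```

Without a fixed `random_state`, each run would shuffle differently and the reported accuracy would vary from run to run.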

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
holdout = X_test # from now on we will refer to this
               # dataframe as the holdout data
    
Y = df[['surface']]
features = [c for c in df.columns if c not in ['surface','group_id']]
X = df[features]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.025, random_state = 0)
X_train1, y_train1 = X_train, y_train
X_test1, y_test1 = X_test, y_test

In [None]:
X_train.shape,X_test.shape, y_train.shape, y_test.shape

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial')
model = logreg.fit(X_train,y_train)

In [None]:
from sklearn import metrics
y_pred=logreg.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

**Conclusion**: 
The accuracy score is very low. Logistic regression does not give a good result here, 
so let's try another model for our classification problem. Let's use a random forest classifier.

**Random Forest Classifier**
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
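
The "drawn with replacement" part can be illustrated directly. The sketch below draws one bootstrap sample of row indices with NumPy; it mimics the idea only and is not how scikit-learn is implemented internally. The sample size of 1000 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
indices = np.arange(n_samples)

# One bootstrap sample: same size as the data, drawn with replacement
bootstrap = rng.choice(indices, size=n_samples, replace=True)

# Some rows repeat and others are left out, so each tree sees different data
unique_fraction = np.unique(bootstrap).size / n_samples
print(len(bootstrap), round(unique_fraction, 3))
```

On average, roughly 63% of the original rows appear in any one bootstrap sample, which is what gives each tree in the forest a slightly different view of the data.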

In [None]:
model1 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train1, y_train1)
y_pred = model1.predict(X_test1)
y_pred

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

**Conclusion:**
The accuracy score is about 99%, which is really good.

**Feature Importance**

In [None]:
# Use the fitted random forest (model1); the name lr was never defined
feature_importances = pd.DataFrame(model1.feature_importances_, index = X_train.columns, columns = ['importance'])
feature_importances = feature_importances.sort_values('importance' , ascending = False)
feature_importances

In [None]:
colors = ['grey'] * 6 + ['green'] * 5
trace1 = go.Bar(x = feature_importances.importance[:11][::-1],
               y = [x.title()+"  " for x in feature_importances.index[:11][::-1]],
               name = 'feature importance (relative)',
               marker = dict(color = colors, opacity=0.4), orientation = 'h')

data = [trace1]

layout = go.Layout(
    margin=dict(l=400), width = 1000,
    xaxis=dict(range=(0.0,0.15)),
    title='Relative Feature Importance (Which Features are more important to make predictions ?)',
    barmode='group',
    bargap=0.25
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

## Submission

Thanks for stopping by. Please upvote if you like my kernel. 
Stay tuned for further analysis and predictive models.