**The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). We are selling millions of products worldwide every day, with several thousand products being added to our product line.**

**A consistent analysis of the performance of our products is crucial. However, due to our diverse global infrastructure, many identical products get classified differently. Therefore, the quality of our product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights we can generate about our product range.**

![Otto product](https://storage.googleapis.com/kaggle-competitions/kaggle/4280/media/Grafik.jpg)

### **For this competition, we have provided a dataset with 93 features for more than 200,000 products. The objective is to build a predictive model which is able to distinguish between our main product categories. The winning models will be open sourced.**

In [0]:
# import sklearn libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# import sklearn library
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, bagging, AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score, confusion_matrix, precision_recall_curve, precision_score, recall_score
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.feature_selection import VarianceThreshold

# imblearn library
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import RandomOverSampler, SMOTE



In [0]:
# upload dataset from local machine
     # from google.colab import files
     # uploaded = files.upload()

# upload dataset from github
url = 'https://raw.githubusercontent.com/chandannaidu/datasets/master/Otto%20dataset/train.csv'

In [0]:
data_set = pd.read_csv(url)
data_set.head()

Unnamed: 0,id,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,feat_10,feat_11,feat_12,feat_13,feat_14,feat_15,feat_16,feat_17,feat_18,feat_19,feat_20,feat_21,feat_22,feat_23,feat_24,feat_25,feat_26,feat_27,feat_28,feat_29,feat_30,feat_31,feat_32,feat_33,feat_34,feat_35,feat_36,feat_37,feat_38,feat_39,...,feat_55,feat_56,feat_57,feat_58,feat_59,feat_60,feat_61,feat_62,feat_63,feat_64,feat_65,feat_66,feat_67,feat_68,feat_69,feat_70,feat_71,feat_72,feat_73,feat_74,feat_75,feat_76,feat_77,feat_78,feat_79,feat_80,feat_81,feat_82,feat_83,feat_84,feat_85,feat_86,feat_87,feat_88,feat_89,feat_90,feat_91,feat_92,feat_93,target
0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,1,0,4,1,1,0,0,2,0,0,0,0,0,1,0,0,0,0,...,0,0,2,0,0,11,0,1,1,0,1,0,7,0,0,0,1,0,0,0,0,0,0,0,2,1,0,0,0,0,1,0,0,0,0,0,0,0,0,Class_1
1,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Class_1
2,3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0,0,0,6,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,Class_1
3,4,1,0,0,1,6,1,5,0,0,1,1,0,1,0,0,1,1,0,0,0,0,0,0,7,2,2,0,0,0,58,0,10,0,0,0,0,0,3,0,...,1,0,0,0,0,0,0,0,0,0,2,1,5,0,0,4,0,0,2,1,0,1,0,0,1,1,2,2,0,22,0,1,2,0,0,0,0,0,0,Class_1
4,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,3,0,0,0,0,0,0,0,4,0,1,0,0,0,1,0,0,0,0,1,0,0,0,Class_1


In [0]:
data_set['target'].value_counts()

Class_2    16122
Class_6    14135
Class_8     8464
Class_3     8004
Class_9     4955
Class_7     2839
Class_5     2739
Class_4     2691
Class_1     1929
Name: target, dtype: int64

In [0]:
data_set.isna().count()

id         61878
feat_1     61878
feat_2     61878
feat_3     61878
feat_4     61878
           ...  
feat_90    61878
feat_91    61878
feat_92    61878
feat_93    61878
target     61878
Length: 95, dtype: int64

In [0]:
data_set.isna().sum()

id         0
feat_1     0
feat_2     0
feat_3     0
feat_4     0
          ..
feat_90    0
feat_91    0
feat_92    0
feat_93    0
target     0
Length: 95, dtype: int64

In [0]:
X = data_set.iloc[:,1:-1].values
X

array([[ 1,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       ...,
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 1,  0,  0, ...,  3, 10,  0],
       [ 0,  0,  0, ...,  0,  2,  0]])

In [0]:
y = data_set.iloc[:,-1].values
y

array(['Class_1', 'Class_1', 'Class_1', ..., 'Class_9', 'Class_9',
       'Class_9'], dtype=object)

In [0]:
X.shape

(61878, 93)

In [0]:
shuffel_index = np.random.permutation(61878)
X = X[shuffel_index]
y = y[shuffel_index]

## Feature selection

In [0]:
#remove constant features
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(X)

VarianceThreshold(threshold=0)

In [0]:
constant_filter.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

In [0]:
constant_filter.get_support().sum()

93

In [0]:
#remove quasi constant
quasi_constant_filter = VarianceThreshold(threshold=0.01)
quasi_constant_filter .fit(X)

VarianceThreshold(threshold=0.01)

In [0]:
quasi_constant_filter.get_support().sum()
X_quasi = quasi_constant_filter.transform(X)

In [0]:
#remove duplicate features
X_T = X_quasi.T

In [0]:
type(X_T)

numpy.ndarray

In [0]:
X_train_T = pd.DataFrame(X_T)

In [0]:
X_train_T.shape

(93, 61878)

In [0]:
X_train_T.duplicated().sum()

0

In [0]:
X = X_train_T.T

In [0]:
smote = SMOTE(random_state = 42)
X_smote , y_smote = smote.fit_resample(X,y)



In [0]:
# ROS = RandomOverSampler(random_state = 42)
# X_ROS, X_ROS = ROS.fit_resample(X,y)

## **Visualization**

In [0]:
sns.pairplot(X)

In [0]:
rows = 10
cols = 10

fig, ax = plt.subplots(nrows= rows, ncols= cols, figsize = (16,4))

col = data_set.columns
index = 0

for i in range(rows):
    for j in range(cols):
        sns.distplot(data_set[col[index]], ax = ax[i][j])
        index = index + 1

plt.tight_layout()

In [0]:
shuffel_index = np.random.permutation(61878)
X_smote = X_smote[shuffel_index]
y_smote = y_smote[shuffel_index]

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.25, random_state=42)

In [0]:
scalar = StandardScaler()
X_train = scalar.fit_transform(X_train)
X_test = scalar.fit_transform(X_test)
X_test

array([[ 0.98239355, -0.20195452,  3.34112927, ..., -0.13032521,
        -0.37726307, -0.09283604],
       [-0.24614436, -0.20195452, -0.29525742, ..., -0.13032521,
        -0.37726307, -0.09283604],
       [-0.24614436, -0.20195452, -0.29525742, ..., -0.13032521,
        -0.37726307, -0.09283604],
       ...,
       [-0.24614436, -0.20195452,  2.67996806, ..., -0.13032521,
         0.61648113, -0.09283604],
       [-0.24614436, -0.20195452, -0.29525742, ..., -0.13032521,
        -0.37726307, -0.09283604],
       [-0.24614436, -0.20195452,  0.03532319, ..., -0.13032521,
        -0.37726307, -0.09283604]])

In [0]:
# SGDClassifier
sgd_clf = SGDClassifier(random_state=42, max_iter=800)
score = cross_val_score(sgd_clf, X_train, y_train, scoring='accuracy', cv=4).mean()
print('cross_val_score of SGDClassifier is ',score)

cross_val_score of SGDClassifier is  0.7404326840199965


In [0]:
# KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=5)
score = cross_val_score(knn_clf, X_train, y_train, scoring='accuracy', cv=10).mean()
print('cross_val_score of KNeighborsClassifier is ',score)

cross_val_score of KNeighborsClassifier is  0.766548803208286


---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-24-f4471cd327c3> in <module>()
      1 knn_clf = KNeighborsClassifier(n_neighbors=3)
----> 2 score = cross_val_score(knn_clf, X_train, y_train, scoring='accuracy', cv=4).mean()
      3 print('cross_val_score of KNeighborsClassifier is ',score)

22 frames
/usr/local/lib/python3.6/dist-packages/sklearn/neighbors/base.py in _tree_query_parallel_helper(tree, data, n_neighbors, return_distance)
    289     under PyPy.
    290     """
--> 291     return tree.query(data, n_neighbors, return_distance)
    292 
    293 

KeyboardInterrupt: 

In [0]:
# RandomForestClassifier
random_clf = RandomForestClassifier(random_state=42)
score = cross_val_score(random_clf, X_train, y_train, scoring='accuracy', cv=4).mean()
print('cross_val_score of SGDClassifier is ',score)

cross_val_score of SGDClassifier is  0.8004438889846578


In [0]:
# DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(max_depth=70,random_state=42,max_leaf_nodes= 200)
score = cross_val_score(tree_clf, X_train, y_train, scoring='accuracy', cv=4).mean()
print('cross_val_score of SGDClassifier is ',score)

cross_val_score of SGDClassifier is  0.7099422513359765


In [0]:
# LogisticRegression
log_clf = LogisticRegression(random_state=42)
score = cross_val_score(log_clf, X_train, y_train, scoring='accuracy', cv=4).mean()
print('cross_val_score of SGDClassifier is ',score)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

cross_val_score of SGDClassifier is  0.7619806929839683


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [0]:
pipe = Pipeline([('classifier',SGDClassifier())])

grid_param = [
              {
                  'classifier':[SGDClassifier()],
                  'classifier__loss': ['hinge'],
                  'classifier__penalty': ['l1'],
                  'classifier__alpha': [0.0001,0.001],
                  'classifier__max_iter': [1000]
               
              },
              {"classifier": [RandomForestClassifier()],
                 "classifier__n_estimators": [10, 100, 1000],
                 "classifier__max_depth":[5,8,15,25,30,None],
                 "classifier__min_samples_leaf":[1,2,5,10,15,100],
                 "classifier__max_leaf_nodes": [2, 5,10]
               }
]

In [0]:
grid_search = GridSearchCV(pipe, grid_param, cv=5)
best_model = grid_search.fit(X_train,y_train)

In [0]:
print(best_model)

In [0]:
print(best_model.best_estimator_)
print('the accuracy of the model is:', best_model.score(X_test,y_test))