The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). We sell millions of products worldwide every day, and several thousand new products are added to our product line.

A consistent analysis of the performance of our products is crucial. However, due to our diverse global infrastructure, many identical products get classified differently. Therefore, the quality of our product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights we can generate about our product range.

**Each row corresponds to a single product**. There are a total of 93 numerical features, which represent counts of different events. All features have been obfuscated and will not be defined any further.

There are nine categories for all products. Each target category represents one of our most important product categories (like fashion, electronics, etc.). The products for the training and testing sets are selected randomly.

In [168]:
# imports, then list the working directory
import csv
import numpy as np
import pandas as pd
import os
os.listdir('.')

['.ipynb_checkpoints',
 'otto_classification.ipynb',
 'train.csv.zip',
 'test.csv.zip',
 'train.csv',
 'test.csv']

In [169]:
train = pd.read_csv('train.csv', header=0)
train.head(5)

Unnamed: 0,id,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,...,feat_85,feat_86,feat_87,feat_88,feat_89,feat_90,feat_91,feat_92,feat_93,target
0,1,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,Class_1
1,2,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,Class_1
2,3,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,Class_1
3,4,1,0,0,1,6,1,5,0,0,...,0,1,2,0,0,0,0,0,0,Class_1
4,5,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,Class_1


In [170]:
test = pd.read_csv('test.csv', header=0)
test.head(5)

Unnamed: 0,id,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,...,feat_84,feat_85,feat_86,feat_87,feat_88,feat_89,feat_90,feat_91,feat_92,feat_93
0,1,0,0,0,0,0,0,0,0,0,...,0,0,11,1,20,0,0,0,0,0
1,2,2,2,14,16,0,0,0,0,0,...,0,0,0,0,0,4,0,0,2,0
2,3,0,1,12,1,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,0,...,0,3,1,0,0,0,0,0,0,0
4,5,1,0,0,1,0,0,1,2,0,...,0,0,0,0,0,0,0,9,0,0


In [171]:
train.dtypes

id         int64
feat_1     int64
feat_2     int64
feat_3     int64
feat_4     int64
feat_5     int64
feat_6     int64
feat_7     int64
feat_8     int64
feat_9     int64
feat_10    int64
feat_11    int64
feat_12    int64
feat_13    int64
feat_14    int64
...
feat_80     int64
feat_81     int64
feat_82     int64
feat_83     int64
feat_84     int64
feat_85     int64
feat_86     int64
feat_87     int64
feat_88     int64
feat_89     int64
feat_90     int64
feat_91     int64
feat_92     int64
feat_93     int64
target     object
Length: 95, dtype: object

In [172]:
train.shape

(61878, 95)

In [173]:
train.describe()

Unnamed: 0,id,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,...,feat_84,feat_85,feat_86,feat_87,feat_88,feat_89,feat_90,feat_91,feat_92,feat_93
count,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,...,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0
mean,30939.5,0.38668,0.263066,0.901467,0.779081,0.071043,0.025696,0.193704,0.662433,1.011296,...,0.070752,0.532306,1.128576,0.393549,0.874915,0.457772,0.812421,0.264941,0.380119,0.126135
std,17862.784315,1.52533,1.252073,2.934818,2.788005,0.438902,0.215333,1.030102,2.25577,3.474822,...,1.15146,1.900438,2.681554,1.575455,2.115466,1.527385,4.597804,2.045646,0.982385,1.20172
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,15470.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,30939.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,46408.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
max,61878.0,61.0,51.0,64.0,70.0,19.0,10.0,38.0,76.0,43.0,...,76.0,55.0,65.0,67.0,30.0,61.0,130.0,52.0,19.0,87.0


## Data Munging

In [175]:
# make a new column called target2 that maps the labels 'Class_1'..'Class_9' to the integers 1..9
train['target2'] = train['target'].map( {'Class_1': 1, 'Class_2': 2, 'Class_3' : 3, 'Class_4' : 4, 'Class_5' : 5, 'Class_6' : 6, 'Class_7' : 7, 'Class_8' : 8, 'Class_9' :9 } ).astype(int)
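The same integer labels can also be derived from the label strings instead of being typed out one by one. A minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration and are not taken from `train.csv`):

```python
import pandas as pd

# Toy stand-in for the training frame; values are illustrative only.
toy = pd.DataFrame({"target": ["Class_1", "Class_9", "Class_3", "Class_1"]})

# Split "Class_3" on the underscore, keep the numeric part, convert to int.
toy["target2"] = toy["target"].str.split("_").str[1].astype(int)

print(toy["target2"].tolist())
```

This only works because every label follows the `Class_<number>` pattern; the explicit dictionary is the safer choice if labels could deviate from it.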

## Separate features from target

In [176]:
y_train = train.target2

In [177]:
# drop() with inplace=True modifies train and returns None, so its result is not assigned
train.drop('target', axis=1, inplace=True)

In [178]:
X_test = test

In [179]:
X_train = train
X_train.drop('target2', axis=1, inplace=True)
Z_train = X_train

- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)
- **Classification** is supervised learning in which the response is categorical
- **Regression** is supervised learning in which the response is ordered and continuous


The response here is one of nine product categories, so this is a **classification** problem.



## Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features and response should be **numeric**
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**
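The four requirements above can be checked on a small example. The sketch below uses a toy frame (column names mirror the Otto data, but the values are invented) and shows the shapes usually meant by "specific shapes": a 2-D feature array and a 1-D response array.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Otto features; values are illustrative only.
df = pd.DataFrame({
    "feat_1": [0, 2, 1],
    "feat_2": [3, 0, 0],
    "target": [1, 2, 1],
})

X = df[["feat_1", "feat_2"]].values  # features: 2-D array (n_samples, n_features)
y = df["target"].values              # response: 1-D array (n_samples,)

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```

The first dimension of both arrays must match, since each row of `X` needs exactly one entry in `y`.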

In [180]:
# check the types of the features and response
print(type(Z_train))
print(type(y_train))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


scikit-learn expects NumPy arrays, so we convert the pandas DataFrame and Series with `.values`.

In [181]:
train_data = Z_train.values
train_target = y_train.values

In [182]:
# check the types again after the conversion
print(type(train_data))
print(type(train_target))

<type 'numpy.ndarray'>
<type 'numpy.ndarray'>


In [183]:
# print the shapes of the training features, the training response, and the test set
print(train_data.shape)
print(train_target.shape)
print(test.shape)

(61878, 94)
(61878,)
(144368, 94)
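The 94 columns above are the 93 count features plus `id`. Because `id` is a row identifier and not one of the count features, one option is to leave it out of the feature matrix. The notebook itself keeps it; the sketch below only illustrates the alternative on a toy frame with invented values.

```python
import pandas as pd

# Toy frame with the same column naming as the Otto data; values are made up.
toy = pd.DataFrame({"id": [1, 2, 3], "feat_1": [0, 1, 0], "feat_2": [2, 0, 5]})

# Keep only the columns whose name starts with "feat_".
feature_cols = [c for c in toy.columns if c.startswith("feat_")]
features = toy[feature_cols].values

print(feature_cols)
print(features.shape)
```

Selecting by prefix would give the same columns for both the training and the test frame, so the two feature matrices stay aligned.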


**Step 1:** Import the class you plan to use

In [184]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** "Instantiate" the "estimator"
- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"

In other words, start an empty instance of your model

In [185]:
knn = KNeighborsClassifier(n_neighbors=1)

**Step 3:** Fit the model with data (aka "model training")

In [186]:
knn.fit(train_data, train_target)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           n_neighbors=1, p=2, weights='uniform')

**Step 4:** Predict the response for a new observation
   
- New observations are called "out-of-sample" data
- The model uses what it learned during training to make the prediction

In [187]:
X_new = test.values
preds = knn.predict(X_new)

In [188]:
preds

array([1, 1, 1, ..., 9, 9, 9])
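The test set has no `target` column, so the predictions above cannot be scored locally. One common way to estimate accuracy is to hold out part of the labeled training data. A minimal sketch on synthetic data, assuming a scikit-learn version that provides `sklearn.model_selection`; the 25% split and the random seeds are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic count features and a two-class response (labels 1 and 2).
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(200, 5))
y = (X[:, 0] > X[:, 1]).astype(int) + 1

# Hold out 25% of the labeled rows for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_tr, y_tr)

# score() returns the fraction of held-out rows classified correctly.
acc = knn.score(X_val, y_val)
print(acc)
```

The same pattern would apply to `train_data` and `train_target`, and it makes it possible to compare values of `n_neighbors` before predicting on the test set.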

In [192]:
dfpreds = pd.DataFrame(preds)
dfpreds.to_csv("predictions_knn.csv")
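As called above, `to_csv` also writes the pandas row index as an unnamed first column. The sketch below pairs each prediction with an `id` and omits the index. The column name `predicted_class`, the file name, and the sample values are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical ids and predictions standing in for test['id'] and preds.
ids = np.array([1, 2, 3])
preds = np.array([1, 6, 9])

out = pd.DataFrame({"id": ids, "predicted_class": preds},
                   columns=["id", "predicted_class"])

# index=False keeps the row index out of the file.
path = os.path.join(tempfile.gettempdir(), "predictions_example.csv")
out.to_csv(path, index=False)

print(pd.read_csv(path).columns.tolist())
```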

## Random Forest w/ 100 Estimators


In [194]:
# Import the random forest package
from sklearn.ensemble import RandomForestClassifier 

# Create the random forest object which will include all the parameters
# for the fit
forest = RandomForestClassifier(n_estimators = 100)

# Fit the forest to the training features and class labels; this builds the decision trees
forest = forest.fit(train_data, train_target)


In [196]:
# Run the fitted forest on the test data
X_new = test.values
pred_rf = forest.predict(X_new)

In [197]:
pred_rf

array([1, 1, 6, ..., 4, 4, 7])
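Besides hard class labels, a fitted forest can return per-class probabilities through `predict_proba`, with one column per entry in `classes_`. A minimal sketch on synthetic data (the data, the three classes, and `random_state=0` are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic count features with random labels 1..3.
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(150, 4))
y = rng.randint(1, 4, size=150)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One row per sample, one column per class, in the order given by classes_.
proba = forest.predict_proba(X[:5])
print(forest.classes_)
print(proba.shape)
```

Each row of `proba` sums to 1, so the values can be read as the forest's estimated probability for each class.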