# Why do my products not sell?

# Data Import

Let's import the packages needed for this tutorial:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
pd.set_option('display.max_columns', 500)

In [None]:
%%time
# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) drops malformed rows
X_train = pd.read_csv('X_train.csv', index_col=0, on_bad_lines='skip')
X_test = pd.read_csv('X_test.csv', index_col=0, on_bad_lines='skip')
y_train = pd.read_csv('y_train.csv', index_col=0)

In [None]:
print('Dimension X_train:', X_train.shape)
print('Dimension y_train:', y_train.shape)
print('Dimension X_test:', X_test.shape)

In [None]:
X_train.head(3)

In [None]:
y_train.head(3)

# Descriptive analysis

## Structure of the datasets

The train dataset contains the characteristics and time of sale of **8880 items** sold on the Emmaus website. This is the dataset we will use to build a model. Each item is one observation of the X variables, which are described in the ```description.pdf``` file on the USB key.

The test dataset contains the characteristics of **2960 items**, whose time of sale must be predicted. Unlike the train dataset, the time of sale is of course not filled in, and an ```id``` column has been added to identify the predictions at the submission stage.

In [None]:
X_train.describe(include='all').T

In [None]:
X_train.hist(bins=50, figsize=(20, 15))
plt.show()

In [None]:
y_train.duration.value_counts()

The dataset is well balanced: each of the 3 classes has a frequency close to 1/3.

# Model Creation

Now is the time to create a model. In this tutorial we will build a Random Forest.

To do this we use the variables ```["weight", "price", "images_count", "image_width", "image_height", "category"]```.

To detect overfitting and estimate the true performance of our model, we will use **k-fold** cross-validation.
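To make the idea concrete, here is a minimal pure-Python sketch of how k-fold cross-validation partitions the rows. The helper names `k_fold_indices` and `k_fold_splits` are hypothetical and written for this illustration. The sketch uses plain contiguous folds, whereas scikit-learn's splitters can also stratify by class, which this simplified version ignores.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    # The first (n_samples % k) folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    folds = k_fold_indices(n_samples, k)
    for i, validation in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_splits(n_samples=10, k=5))
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

Each row lands in the validation set exactly once, so every prediction used for scoring comes from a model that never saw that row during training.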

In [None]:
from sklearn.pipeline import Pipeline
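from sklearn.preprocessing import LabelEncoder
# Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer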
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

### Imputing missing categories with the value "missing"

In [None]:
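# Assign the result back: an in-place fillna on a selected column may not modify the DataFrame
X_train['category'] = X_train['category'].fillna('missing')
X_test['category'] = X_test['category'].fillna('missing')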

### Encoding categorical features

Machine learning algorithms expect **numbers** as input, not strings. That is why we turn **categorical features** into numbers, using ```LabelEncoder()```.

In [None]:
X_train.category.unique()

In [None]:
le = LabelEncoder()
X_train['category'] = le.fit_transform(X_train.category)
X_test['category'] = le.transform(X_test.category)
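As a rough illustration of what the encoding step does, the sketch below maps each distinct label to an integer in sorted order, which is how `LabelEncoder` orders its classes. The helpers `fit_label_mapping` and `encode` and the category values are hypothetical, chosen only for this example. It also shows one caveat worth keeping in mind: a label that appears only in the test set cannot be encoded with a mapping fitted on the train set.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Encode labels with a fitted mapping; unseen labels raise an error."""
    unseen = sorted(set(values) - set(mapping))
    if unseen:
        raise ValueError(f"labels not seen during fit: {unseen}")
    return [mapping[v] for v in values]


# Hypothetical category values, for illustration only.
train_categories = ["books", "toys", "missing", "books"]

mapping = fit_label_mapping(train_categories)
encoded_train = encode(train_categories, mapping)

print(mapping)
print(encoded_train)
```

Note that the integer codes carry an arbitrary order. Tree-based models such as a Random Forest usually tolerate this, but the order has no meaning in itself.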

In [None]:
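from sklearn.model_selection import GridSearchCV

features = ['weight', 'price', 'images_count',
            'image_width', 'image_height', 'category']

# Pipeline: fill missing numeric values with the column median, then fit the forest
ppl = Pipeline([('imputer', Imputer(strategy='median')),
                ('clf', RandomForestClassifier(n_estimators=1000, max_leaf_nodes=580, n_jobs=-1))])

# Fit on the full train set; this fitted pipeline is the one used on the test set
ppl.fit(X_train.loc[:, features], np.ravel(y_train))

# In-sample probabilities: optimistic, since the model has already seen these rows
pred_train = ppl.predict_proba(X_train.loc[:, features])

# Out-of-fold probabilities: each row is predicted by a model that did not train on it
pred_cv = cross_val_predict(ppl,
                            X_train.loc[:, features],
                            np.ravel(y_train),
                            method='predict_proba',
                            cv=5,
                            n_jobs=-1)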
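The `imputer` step of the pipeline above fills missing values with the median of each column. The pure-Python sketch below shows the principle on a single column. The helpers `fit_median` and `impute` and the weight values are hypothetical, written only for this illustration.

```python
import math
from statistics import median


def fit_median(column):
    """Return the median of the non-missing values in a column."""
    observed = [x for x in column if not math.isnan(x)]
    return median(observed)


def impute(column, fill_value):
    """Replace every NaN in a column with fill_value."""
    return [fill_value if math.isnan(x) else x for x in column]


# Hypothetical weights, for illustration only; NaN marks a missing value.
train_weights = [0.5, float("nan"), 2.0, 1.0]
test_weights = [float("nan"), 3.0]

# The median is learned on the training data only...
train_median = fit_median(train_weights)
imputed_train = impute(train_weights, train_median)
# ...and reused as-is on the test data.
imputed_test = impute(test_weights, train_median)

print(train_median, imputed_train, imputed_test)
```

Putting the imputer inside the pipeline matters for cross-validation: the median is recomputed on the training part of each fold, so the validation rows do not influence the value used to fill them.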

# Computing the error: log loss

In [None]:
from sklearn.metrics import log_loss 

In [None]:
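# np.ravel flattens the single-column y_train DataFrame into a 1-D array of labels
print('LogLoss on train sample:', log_loss(y_pred=pred_train, y_true=np.ravel(y_train)))
print('LogLoss on train sample (CV):', log_loss(y_pred=pred_cv, y_true=np.ravel(y_train)))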
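To help read these numbers, here is a minimal sketch of the multiclass log loss computed by hand: the mean negative log of the probability given to the true class. The function `multiclass_log_loss` and the labels are hypothetical and for illustration only. With three nearly balanced classes, a model that always predicts 1/3 for each class scores about ln(3) ≈ 1.0986, so a useful model should come in below that value on the cross-validated predictions.

```python
import math


def multiclass_log_loss(y_true, y_proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for true_class, probabilities in zip(y_true, y_proba):
        # Clip to avoid log(0) when a model is confidently wrong.
        p = min(max(probabilities[true_class], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Hypothetical labels (class indices 0, 1, 2), for illustration only.
y_true = [0, 1, 2, 1]

# Baseline: predict 1/3 for every class, whatever the item.
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in y_true]
baseline = multiclass_log_loss(y_true, uniform)

# A model that puts 0.8 on the correct class every time.
confident = [[0.8 if c == t else 0.1 for c in range(3)] for t in y_true]
better = multiclass_log_loss(y_true, confident)

print("uniform baseline:", baseline)
print("confident model:", better)
```

A large gap between the in-sample score and the cross-validated score is the usual sign of overfitting; the cross-validated score is the one to compare against the baseline.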

# Computing the predictions

In [None]:
pred_test = ppl.predict_proba(X_test.loc[:, features])

In [None]:
df_submission = pd.DataFrame(pred_test, index=X_test.index)

In [None]:
df_submission.head(3)