# Python Tutorial - DSO 2019 Training

This tutorial is a guide for participants who want to use Python for the challenge.

It has 5 steps:

1. Importing data
2. Descriptive analysis
3. Data Preparation
4. Creating a model
5. Computing predictions and submitting them

# Data Import

Let's import the packages needed for this tutorial:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
pd.set_option('display.max_columns', 500)

In [2]:
%%time
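# error_bad_lines=False skips malformed rows (see the "Skipping line" messages below).
# On pandas >= 1.3 use on_bad_lines="warn" instead; error_bad_lines was removed in pandas 2.0.
X_train = pd.read_csv("data/X_train.csv", index_col=0, error_bad_lines=False)
X_test = pd.read_csv("data/X_test.csv", index_col=0, error_bad_lines=False)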
y_train = pd.read_csv("data/y_train.csv", index_col=0)

b'Skipping line 2168: expected 31 fields, saw 33\nSkipping line 4822: expected 31 fields, saw 37\nSkipping line 4859: expected 31 fields, saw 37\nSkipping line 7342: expected 31 fields, saw 37\n'


Wall time: 425 ms
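
The `Skipping line ...` messages above come from rows that contain more fields than the header declares. As a minimal sketch, assuming pandas 1.3 or later (where `on_bad_lines` supersedes `error_bad_lines`), the same behaviour can be reproduced on a small inline CSV. The data below is made up for illustration and is not taken from the challenge files.

```python
import io

import pandas as pd

# Illustrative CSV: the row with id 2 has one field too many.
raw = "id,price,weight\n0,4.5,200\n1,15.0,1000\n2,16.0,360,oops\n3,8.0,120\n"

# on_bad_lines="skip" drops malformed rows silently;
# on_bad_lines="warn" drops them and reports each one.
df = pd.read_csv(io.StringIO(raw), index_col=0, on_bad_lines="skip")
print(df.shape)
print(df.index.tolist())
```

Skipped rows are lost for good, so it is worth checking how many there are before deciding that skipping is acceptable.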


In [3]:
print("Dimension X_train:", X_train.shape)
print("Dimension X_test:", X_test.shape)

Dimension X_train: (8880, 30)
Dimension X_test: (2960, 30)


In [4]:
X_train.head(3)

id,images_count,image_width,image_height,image_url,product_description,product_size,material,age,warranty,year,color,product_width,wifi,condition,product_length,shoe_size,vintage,brand,author,editor,product_height,weight,price,category,sub_category_1,sub_category_2,sub_category_3,sub_category_4,product_name,store_name
0,3,3458.0,2552.0,https://d1kvfoyrif6wzg.cloudfront.net/assets/i...,Superbe petit top bustier avec explosion de co...,44.0,100 % polyester,,,,Multicolore,,,bon état,,,False,,,,,200.0,4.5,mode,"tops, t-shirts, débardeurs femme",,,,Top bustier multicolore,Emmaüs 88 Neufchateau
1,2,2486.0,2254.0,https://d1kvfoyrif6wzg.cloudfront.net/assets/i...,"Radio ITT Océnic Flirt, année 70\nPour déco",,Plastique,,,,Jaune,,,en l'état,,,True,ITT Océanic,,,,1000.0,15.0,mobilier - deco,bibelots et objets déco,,,,Radio ITT Océanic,Communauté Emmaüs Thouars (magasin Parthenay)
2,3,1536.0,1536.0,https://d1kvfoyrif6wzg.cloudfront.net/assets/i...,Veste boléro à manches courtes NÛMPH. Gris chi...,40.0,"Polyester, coton, laine",,,,Gris,,,neuf,,,False,Nûmph,,,,360.0,16.0,label selection,mode,mode femme,,,,Label Emmaüs Chambéry


# Descriptive analysis

## Structure of the datasets

The train dataset contains the characteristics and time to sale of **8880** items sold on the Emmaus website. We will use this dataset to build a model. Each item is one observation described by 30 variables, which are documented in the ```description.pdf``` file on the USB key.

The test dataset contains the characteristics of **2960 items**, whose time to sale must be predicted. Unlike the train dataset, the time to sale is of course not provided, and an ```id``` column has been added to identify the predictions at submission time.

In [5]:
X_train.describe(include='all').T

variable,count,unique,top,freq,mean,std,min,25%,50%,75%,max
images_count,8880,,,,3.63345,2.04857,0.0,2.0,3.0,5.0,29.0
image_width,8823,,,,1807.82,1025.25,58.0,1000.0,1536.0,2448.0,5472.0
image_height,8823,,,,1801.77,1101.21,64.0,970.5,1536.0,2448.0,5472.0
image_url,8823,8775.0,https://d1kvfoyrif6wzg.cloudfront.net/assets/i...,4.0,,,,,,,
product_description,8880,8836.0,"Relié, 48 pages, couverture usagée",6.0,,,,,,,
product_size,2414,33.0,38,402.0,,,,,,,
material,3947,1722.0,Coton,144.0,,,,,,,
age,120,18.0,4a,14.0,,,,,,,
warranty,101,2.0,6 mois,100.0,,,,,,,
year,1497,,,,14810.1,496237.0,0.0,1979.0,1998.0,2007.0,19201900.0
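
The `count` column above shows that several variables are mostly empty. As a rough sketch, the share of missing values can be derived from those counts; the figures below are copied from the `describe()` output, and on the real data `X_train.isna().mean()` would give the same information directly.

```python
import pandas as pd

n_rows = 8880  # number of rows in X_train

# Non-null counts copied from the describe() output above.
counts = pd.Series({
    "images_count": 8880,
    "image_width": 8823,
    "product_size": 2414,
    "material": 3947,
    "age": 120,
    "warranty": 101,
    "year": 1497,
})

# Fraction of missing values per variable, most incomplete first.
missing_rate = (1 - counts / n_rows).sort_values(ascending=False)
print(missing_rate.round(3))
```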


In [6]:
y_train.duration.value_counts()

0    3027
2    2953
1    2900
Name: duration, dtype: int64

The dataset is well balanced: each of the 3 classes has a frequency close to 1/3.
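
To make the balance explicit, here is a small sketch that converts the counts printed above into proportions. On the real data, `y_train.duration.value_counts(normalize=True)` should give the same numbers.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 3027, 2: 2953, 1: 2900}, name="duration")

proportions = counts / counts.sum()
print(proportions.round(4))

# Largest deviation of any class from a perfectly balanced 1/3 share.
max_gap = (proportions - 1 / 3).abs().max()
print(round(max_gap, 4))
```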

# Model Creation

Now is the time to create a model. In this tutorial we will build a Random Forest.

To do this we use the variables ```["weight","price","images_count","image_width","image_height","category"]```.

To detect overfitting and estimate the true performance of our model, we will use **k-fold** cross-validation.
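
As a minimal illustration of the idea, the sketch below splits ten toy samples into five folds with scikit-learn's `KFold`. The data is made up. Each sample lands in the validation part of exactly one fold, so every prediction used for scoring comes from a model that never saw that sample.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
seen = []
for fold, (train_idx, valid_idx) in enumerate(kf.split(X)):
    fold_sizes.append((len(train_idx), len(valid_idx)))
    seen.extend(valid_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

Note that when an integer `cv` is passed to scikit-learn's cross-validation helpers together with a classifier, a stratified variant is used, which keeps the class proportions similar across folds.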

In [7]:
from sklearn.pipeline import Pipeline
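# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer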
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import LabelEncoder

### Replacing missing categories with the value "missing"

In [8]:
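# Assign the result back: an in-place fillna on an attribute may not update the DataFrame
X_train['category'] = X_train['category'].fillna('missing')
X_test['category'] = X_test['category'].fillna('missing')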

### Encoding categorical features

Machine learning algorithms expect **numbers** as input, not strings. That is why we turn **categorical features** into numbers, using ```LabelEncoder()```.

In [9]:
X_train.category.unique()

array(['mode', 'mobilier - deco', 'label selection', 'multimédia',
       'loisirs', 'enfance', 'librairie', 'culture - loisirs',
       'les coups de coeur des vendeurs', 'mobilier - deco - maison',
       'créations', 'missing'], dtype=object)

In [10]:
le = LabelEncoder()
X_train['category'] = le.fit_transform(X_train.category)
X_test['category'] = le.transform(X_test.category)
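
To show what the encoder does, here is a short sketch on a handful of category names. `"jardin"` is a made-up category used only to show what happens with a label that was not present when the encoder was fitted.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["mode", "loisirs", "mode", "missing"])

# Classes are stored in sorted order; each code is an index into classes_.
print(list(le.classes_))
print(codes.tolist())

# A label never seen during fit raises ValueError at transform time.
try:
    le.transform(["jardin"])  # "jardin" is a hypothetical unseen category
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```

The integer codes carry an arbitrary order. Tree-based models such as a random forest usually cope with this, but a linear model would generally be better served by one-hot encoding.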

In [11]:
# Numeric features plus the label-encoded category
features = ["weight", "price", "images_count",
            "image_width", "image_height", "category"]

# Median imputation of missing values, followed by a small random forest
ppl = Pipeline([("imputer", Imputer(strategy='median')),
                ("clf", RandomForestClassifier(n_estimators=10))])

ppl.fit(X_train.loc[:, features], np.ravel(y_train))

# In-sample probabilities (optimistic) vs. out-of-fold probabilities from 5-fold CV
pred_train = ppl.predict_proba(X_train.loc[:, features])
pred_cv = cross_val_predict(ppl, X_train.loc[:, features], np.ravel(y_train),
                            method='predict_proba', cv=5, n_jobs=-1)



# Computing the error: log loss

In [12]:
from sklearn.metrics import log_loss 

In [13]:
print("LogLoss on train sample:", log_loss(y_pred=pred_train, y_true=y_train))
print("LogLoss on train sample (CV):", log_loss(y_pred=pred_cv, y_true=y_train))

LogLoss on train sample: 0.29625210734455876
LogLoss on train sample (CV): 3.1504084002015773
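
The gap between the two scores is a sign of overfitting: the in-sample score is far better than the cross-validated one. The sketch below is a plain-Python illustration of multiclass log loss on toy data, not scikit-learn's implementation. It shows two reference points: predicting 1/3 for every class gives ln(3), about 1.10, and a single prediction that puts probability 0 on the true class is penalised very heavily. A forest of only 10 trees can easily output probabilities of exactly 0, which is one likely contributor to a cross-validated score above the uniform baseline.

```python
import math


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, proba):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


y_true = [0, 1, 2]

# Uninformative baseline: the same probability for each of the 3 classes.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
baseline = multiclass_log_loss(y_true, uniform)
print(round(baseline, 4))

# Two confident correct predictions and one confident mistake.
overconfident = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
penalised = multiclass_log_loss(y_true, overconfident)
print(round(penalised, 2))
```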


# Computing the predictions

In [14]:
pred_test = ppl.predict_proba(X_test.loc[:, features])

In [15]:
df_submission = pd.DataFrame(pred_test, index=X_test.index)
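
Before submitting, it can help to check that the table looks like a set of class probabilities. The helper below is a hypothetical sketch: `check_submission` is not part of the platform or of any library, and the exact format expected by the platform is an assumption based on the three-column output of `predict_proba`.

```python
import numpy as np
import pandas as pd


def check_submission(df, n_classes=3):
    """Raise ValueError if df does not look like a table of class probabilities."""
    if df.shape[1] != n_classes:
        raise ValueError(f"expected {n_classes} columns, got {df.shape[1]}")
    if df.isna().any().any():
        raise ValueError("submission contains missing values")
    if not np.allclose(df.sum(axis=1), 1.0):
        raise ValueError("each row of probabilities should sum to 1")
    return True


# Toy stand-in for df_submission: two items, three class probabilities each.
toy = pd.DataFrame([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]],
                   index=pd.Index([10, 11], name="id"))
print(check_submission(toy))
```

With the objects from this tutorial, the call would be `check_submission(df_submission)`.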

# Submission

## Possibility #1: via the QScore API

1. Go to the [QScore](https://qscore.datascience-olympics.com) platform, then to "Submissions" > "Submit from your Python Notebook"
2. Get your TOKEN
3. Paste it into the function below in place of the existing value, then run the cell

In [16]:
import io, math, requests

# Only works in Python3, see comment below for Python2
def submit_prediction(df, sep=',', **kwargs):
    # TOKEN to retrieve from the platform: "Submissions" > "Submit from your Python Notebook"
    TOKEN='ad30dbf801679bec096e2e1cb2c22196ac4f8ee4aabd83b0e820aa033ec347ac1db1f474bf754dfddf627c7d7ee78a62389bf48e19584f28bfea2a92aeb104c2'  
    URL='https://qscore.datascience-olympics.com/api/submissions'
    #buffer = io.BytesIO() # Python 2
    buffer = io.StringIO() # Python 3
    df.to_csv(buffer, sep=sep, **kwargs)
    buffer.seek(0)
    r = requests.post(URL, headers={'Authorization': 'Bearer {}'.format(TOKEN)},
                      files={'datafile': buffer})
    if r.status_code == 429:
        raise Exception('Submissions are too close. Next submission is only allowed in {} seconds.'.format(int(math.ceil(int(r.headers['x-rate-limit-remaining']) / 1000.0))))
    if r.status_code != 200:
        raise Exception(r.text)

In [17]:
submit_prediction(df_submission, sep=',', index=True)
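
A token pasted into a notebook is easily shared by accident. One possible alternative, sketched below, is to read it from an environment variable. The variable name `QSCORE_TOKEN` and the helper `build_auth_headers` are assumptions made for this example and are not defined by the platform.

```python
import os


def build_auth_headers(env_var="QSCORE_TOKEN", environ=None):
    """Build the Authorization header from an environment variable.

    QSCORE_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {"Authorization": "Bearer {}".format(token)}


# Demonstration with an explicit mapping instead of the real environment.
headers = build_auth_headers(environ={"QSCORE_TOKEN": "dummy-token"})
print(headers)
```

The resulting dictionary could then be passed as the `headers` argument of `requests.post` in place of the hard-coded value.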

## Possibility #2: Submit a CSV file

1. Go to the [QScore](https://qscore.datascience-olympics.com) platform, then to "Submissions" > "Submit with a file"
2. Upload the CSV file

In [None]:
df_submission.to_csv("my_prediction.csv", index_label="id", header=['0', '1', '2'])