In this competition, you will forecast the demand for a product in a given week at a particular store. The dataset consists of 9 weeks of sales transactions in Mexico. Every week, delivery trucks deliver products to the vendors. Each transaction consists of sales and returns. Returns are products that went unsold and expired. The demand for a product in a given week is defined as that week's sales minus the following week's returns.

The train and test datasets are split by time, as are the public and private leaderboard datasets.
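
Because the official split is by time, a local validation set can mimic it by holding out the latest weeks instead of sampling rows at random. The following is a minimal sketch, assuming a DataFrame with a `Semana` column; the toy data and the `split_by_week` helper are hypothetical and not part of the competition materials.

```python
import pandas as pd

def split_by_week(df, holdout_weeks):
    """Return (train, valid), where valid contains only the given weeks."""
    mask = df["Semana"].isin(holdout_weeks)
    return df[~mask], df[mask]

# Toy rows standing in for the real training file.
toy = pd.DataFrame({
    "Semana": [3, 4, 5, 6, 7, 8, 9],
    "Demanda_uni_equil": [2, 0, 5, 1, 3, 4, 2],
})

train_part, valid_part = split_by_week(toy, holdout_weeks=[8, 9])
print(train_part["Semana"].tolist(), valid_part["Semana"].tolist())
```

Holding out whole weeks keeps future information out of the training rows, which a random row split does not guarantee.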

Things to note:

- There may be products in the test set that don't exist in the train set. This is the expected behavior of inventory data, since there are new products being sold all the time. Your model should be able to accommodate this.
- There are duplicate Cliente_IDs in cliente_tabla, which means one Cliente_ID may map to several very similar NombreCliente values. This is because NombreCliente is noisy and not standardized in the raw data, so it is up to you to decide how to clean up and use this information.
- The adjusted demand (Demanda_uni_equil) is always >= 0, since demand should be either 0 or a positive value. Venta_uni_hoy - Dev_uni_proxima is sometimes negative because return records sometimes carry over for a few weeks.
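
The relationship in the last note can be sketched as follows. This assumes the adjusted demand equals the unit difference floored at zero, which is consistent with the description above but not confirmed by it; the rows are made up and `demand_sketch` is a hypothetical column name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Venta_uni_hoy": [10, 3, 0],
    "Dev_uni_proxima": [2, 5, 0],
})

# Raw difference can go negative when returns carry over from earlier weeks.
raw_diff = toy["Venta_uni_hoy"] - toy["Dev_uni_proxima"]

# Assumption: negative differences are floored at zero.
toy["demand_sketch"] = raw_diff.clip(lower=0)
print(toy)
```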

Data fields:

- ```Semana``` — Week number (From Thursday to Wednesday)
- ```Agencia_ID``` — Sales Depot ID
- ```Canal_ID``` — Sales Channel ID
- ```Ruta_SAK``` — Route ID (Several routes = Sales Depot)
- ```Cliente_ID``` — Client ID
- ```NombreCliente``` — Client name
- ```Producto_ID``` — Product ID
- ```NombreProducto``` — Product Name
- ```Venta_uni_hoy``` — Sales unit this week (integer)
- ```Venta_hoy``` — Sales this week (unit: pesos)
- ```Dev_uni_proxima``` — Returns unit next week (integer)
- ```Dev_proxima``` — Returns next week (unit: pesos)
- ```Demanda_uni_equil``` — Adjusted Demand (integer) (This is the target you will predict)
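
The note about products that appear only in the test set suggests keeping a fallback for unseen IDs. One simple option, shown here only as a sketch with made-up rows, is a per-product median with a global median for products the training data never saw. The variable names are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({
    "Producto_ID": [1, 1, 2],
    "Demanda_uni_equil": [4, 6, 3],
})
test = pd.DataFrame({"Producto_ID": [1, 2, 99]})  # 99 never appears in train

# Median demand per known product, plus a global fallback.
product_median = train.groupby("Producto_ID")["Demanda_uni_equil"].median()
global_median = train["Demanda_uni_equil"].median()

# Unseen products map to NaN and are filled with the global median.
test["pred"] = test["Producto_ID"].map(product_median).fillna(global_median)
print(test)
```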

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import math

from collections import Counter

from sklearn.preprocessing import StandardScaler

In [2]:
def rmsle(y, y0):
    assert len(y) == len(y0)
    return np.sqrt(np.mean(np.power(np.log1p(y)-np.log1p(y0), 2)))

In [3]:
# Loop-based RMSLE, equivalent to rmsle() above
def rmsle_loop(y, y_pred):
    assert len(y) == len(y_pred)
    # zip pairs values by position; y[i] would look up by label on a pandas Series with a shuffled index
    terms_to_sum = [(math.log(p + 1) - math.log(t + 1)) ** 2.0 for t, p in zip(y, y_pred)]
    return (sum(terms_to_sum) * (1.0/len(y))) ** 0.5
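
Since the notebook defines the metric twice, a quick agreement check on toy values helps confirm both forms compute the same quantity. This is a self-contained sketch; `rmsle_vec` and `rmsle_py` are hypothetical stand-ins for the two functions above, and the numbers are made up.

```python
import math
import numpy as np

def rmsle_vec(y_true, y_pred):
    """Vectorized RMSLE using NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def rmsle_py(y_true, y_pred):
    """Pure-Python RMSLE, pairing values by position."""
    terms = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(terms) / len(terms))

y_true = [0, 3, 10]
y_pred = [1, 3, 7]
print(rmsle_vec(y_true, y_pred), rmsle_py(y_true, y_pred))
```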

In [3]:
# Keeping only the first 3 weeks

#df = pd.read_csv('train.csv')
#df[df.Semana.isin([3, 4, 5])].to_csv('train_345.csv')

In [4]:
# Splitting again into separate datasets

#df = pd.read_csv('train_345.csv')
#df[df.Semana.isin([3])].to_csv('train_3.csv')
#df[df.Semana.isin([4])].to_csv('train_4.csv')
#df[df.Semana.isin([5])].to_csv('train_5.csv')

# Weeks 8 and 9 are not in train_345.csv, so this filter has to run on the full train.csv
#df[df.Semana.isin([8,9])].to_csv('train_8_9.csv')

In [4]:
dtype = {#'Semana':'uint8',
         #'Canal_ID':'uint8',
         #'Agencia_ID':'uint16',
         #'Ruta_SAK':'uint16',
         'Dev_uni_proxima':'uint16',
         'Cliente_ID':'uint32',
         'Producto_ID':'uint32',
         'Venta_uni_hoy':'uint16',  # uint8 only holds 0-255; weekly unit sales can exceed that
         'Demanda_uni_equil':'uint32',
         'Venta_hoy':'float32',
         'Dev_proxima':'float32'}

usecols=[#'Semana', 
         #'Canal_ID',
         #'Agencia_ID', 
         #'Ruta_SAK', 
         'Cliente_ID', 
         'Producto_ID', 
         'Venta_uni_hoy', 
         'Venta_hoy',
         'Dev_uni_proxima', 
         'Dev_proxima', 
         'Demanda_uni_equil']

df = pd.read_csv('train_345.csv', dtype=dtype, usecols=usecols)
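
The narrow integer widths in the `dtype` mapping save memory only if every value in a column fits. A minimal sketch of a check that could be run on each column's maximum before choosing a width; the helper `smallest_uint_dtype` is hypothetical, assumes non-negative values, and the example maxima are made up rather than taken from the dataset.

```python
import numpy as np

def smallest_uint_dtype(max_value):
    """Return the name of the narrowest unsigned dtype that can hold max_value."""
    for candidate in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(candidate).max:
            return np.dtype(candidate).name
    raise ValueError("value does not fit in an unsigned 64-bit integer")

for example_max in (255, 300, 70000):
    print(example_max, smallest_uint_dtype(example_max))
```

In practice the maximum would come from something like `df[col].max()` on a sample read with the default dtypes.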

In [None]:
df.describe()

In [16]:
X = StandardScaler().fit_transform(df.iloc[:, :-1])
y = df.iloc[:,-1]

del df

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [18]:
del X, y

In [19]:
import xgboost as xgb

xgb_clf = xgb.XGBRegressor(random_state=42, n_jobs=-1)
xgb_clf.fit(X_train, y_train)

[18:32:56] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=-1, nthread=None, objective='reg:linear', random_state=42,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [20]:
result = xgb_clf.predict(X_test)

In [21]:
rmsle(y_test, result)

0.11037768329008704

Validating with weeks 8 and 9

In [None]:
df_val = pd.read_csv('train_8_9.csv', dtype=dtype, usecols=usecols)

In [None]:
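# Scale only the features; the target must stay in its original units for RMSLE.
# Ideally the scaler fitted on the training weeks would be reused here instead of a new one.
X_ = StandardScaler().fit_transform(df_val.iloc[:, :-1])
y_ = df_val.iloc[:, -1]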

del df_val
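
A scaler fitted on the validation weeks learns different means and variances from the one used during training, so the model receives shifted inputs. The usual pattern is to fit on the training features and only transform the validation features. A minimal sketch with toy arrays (the variable names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0, 10.0], [3.0, 30.0]])
X_valid_toy = np.array([[2.0, 20.0], [5.0, 50.0]])

# Fit once on training data, then reuse the same statistics for validation.
scaler = StandardScaler().fit(X_train_toy)
X_train_scaled = scaler.transform(X_train_toy)
X_valid_scaled = scaler.transform(X_valid_toy)
print(X_valid_scaled)
```

Applied to this notebook, that would mean keeping a reference to the scaler fitted on the weeks 3 to 5 features and calling `transform` on the weeks 8 and 9 features.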

In [None]:
result_val = xgb_clf.predict(X_)

In [None]:
#result_val[result_val < 0] = 0
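
A regressor trained with a squared-error objective can output negative values, and `np.log1p` returns `-inf` at -1 and `nan` below it, so clipping predictions at zero before scoring keeps the metric finite. A minimal sketch with made-up predictions; `rmsle_clipped` is a hypothetical variant of the `rmsle` function defined earlier.

```python
import numpy as np

def rmsle_clipped(y_true, y_pred):
    """RMSLE with predictions floored at zero before taking logs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

preds = np.array([-1.5, 0.0, 4.0])   # first value would yield nan without clipping
truth = np.array([0.0, 0.0, 4.0])
score = rmsle_clipped(truth, preds)
print(score)
```

Converting to NumPy arrays first also avoids pandas silently skipping `nan` terms when averaging a Series.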

In [None]:
rmsle(y_, result_val)

In [None]:
#pd.DataFrame([y_, result_val]).T