# Introduction

I will explore the problem in the following stages:

1.  **Hypothesis Generation** – understanding the problem better by brainstorming possible factors that can impact the outcome
2.  **Data Exploration** – looking at categorical and continuous feature summaries and making inferences about the data.
3.  **Data Cleaning** – imputing missing values in the data and checking for outliers
4.  **Feature Engineering** – modifying existing variables and creating new ones for analysis
5.  **Model Building** – making predictive models on the data


## 1. Hypothesis Generation

This involves understanding the problem and making some hypotheses about what could potentially have a strong impact on the outcome. This is done BEFORE looking at the data, and we end up with a laundry list of the different analyses we can potentially perform if the data is available.

### The Problem Statement

Understanding the problem statement is the first and foremost step:

> In this competition, you will forecast the demand of a product for a given week, at a particular store. The dataset you are given consists of 9 weeks of sales transactions in Mexico. Every week, there are delivery trucks that deliver products to the vendors. Each transaction consists of sales and returns. Returns are the products that are unsold and expired. The demand for a product in a certain week is defined as the sales this week subtracted by the return next week.

So the idea is to find the demand for a product (sales minus returns) per client and store, and to understand what drives it. Let’s think about some of the analyses that can be done and come up with some hypotheses.

### The Hypotheses

I came up with the following hypotheses while thinking about the problem. Since we’re talking about stores and products, let’s make a separate set for each.

**Store/Client Level Hypotheses:**

1.  **Town type:** Stores located in urban or Tier 1 towns should have higher sales because of the higher income levels of people there.
2.  **Population Density:** Stores located in densely populated areas should have higher sales because of more demand.
3.  **Store Capacity:** Stores which are very big in size should have higher sales as they act like one-stop-shops and people would prefer getting everything from one place
4.  **Competitors:** Stores having similar establishments nearby should have less sales because of more competition.
5.  **Marketing:** Stores which have a good marketing division should have higher sales as it will be able to attract customers through the right offers and advertising.
6.  **Location:** Stores located within popular marketplaces should have higher sales because of better access to customers.
7.  **Customer Behavior:** Stores keeping the right set of products to meet the local needs of customers will have higher sales.
8.  **Ambiance:** Stores which are well-maintained and managed by polite and humble people are expected to have higher footfall and thus higher sales.
9.  **Season:** Stores should sell more right after customers' payday, i.e. after the 15th or 30th of the month.

**Product Level Hypotheses:**

1.  **Brand:** Branded products should have higher sales because customers trust them more.
2.  **Packaging:** Products with good packaging can attract customers and sell more.
3.  **Utility:** Daily use products should have a higher tendency to sell as compared to the specific use products.
4.  **Display Area:** Products which are given bigger shelves in the store are likely to catch attention first and sell more.
5.  **Visibility in Store:** The location of a product in a store will impact sales. Products placed right at the entrance will catch the customer's eye before the ones at the back.
6.  **Advertising:** Better advertising of products in the store should lead to higher sales in most cases.
7.  **Promotional Offers:** Products accompanied with attractive offers and discounts will sell more.


Let's move on to data exploration, where we will look at the data in detail.

## 2. Data Exploration

I’ll perform some basic data exploration here and draw some inferences about the data.

The first step is to look at the data and compare the information we hypothesized with what is actually available. A comparison between the data dictionary on the competition page and our hypotheses is shown below:

![Image of Variables vs Hypothesis](files/../input-data/Variables_vs_Hyphotesis.png)

We can summarize the findings as:

**9 features hypothesized but not found in the actual data.**

**5 features hypothesized as well as present in the data.**

**3 features present in the data but not hypothesized.**


Some features we hypothesized are missing from the data, and vice versa. We should look for open data sources to fill the gaps where possible. Let’s start by loading the required libraries and data.

In [1]:
import pandas as pd
import numpy as np
import time
import csv
import IPython.display  # needed by df_display below

_start_time = time.time()

# define an easy timing function to use going forward
def tic():
    global _start_time 
    _start_time = time.time()

def tac():
    t_sec = round(time.time() - _start_time)
    (t_min, t_sec) = divmod(t_sec,60)
    (t_hour,t_min) = divmod(t_min,60) 
    print('Time passed: {}hour:{}min:{}sec'.format(t_hour,t_min,t_sec))
    
# utility function- display large dataframes in an html iframe
def df_display(df, lines=500):
    txt = ("<iframe " +
           "srcdoc='" + df.head(lines).to_html() + "' " +
           "width=1000 height=500>" +
           "</iframe>")

    return IPython.display.HTML(txt)


In [2]:
#Read files:
tic()
train = pd.read_csv('input-data/train_sampled5pct.csv',
                           dtype  = {'Semana': 'int8',
                                     'Producto_ID':'int32',
                                     'Cliente_ID':'int32',
                                     'Agencia_ID':'uint16',
                                     'Canal_ID':'int8',
                                     'Ruta_SAK':'int32',
                                     'Venta_hoy':'float32',
                                     'Venta_uni_hoy': 'int8',
                                     'Dev_uni_proxima':'int8',
                                     'Dev_proxima':'float32',
                                     'Demanda_uni_equil':'int32'})
test = pd.read_csv('input-data/test.csv',
                           dtype  = {'Semana': 'int8',
                                     'Producto_ID':'int32',
                                     'Cliente_ID':'int32',
                                     'Agencia_ID':'uint16',
                                     'Canal_ID':'int8',
                                     'Ruta_SAK':'int32'})
tac()

Time passed: 0hour:0min:10sec
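
Passing explicit narrow dtypes to `read_csv` keeps memory low, but every integer dtype only covers a fixed range (`int8` holds -128 to 127, for example). The following is a minimal sketch, not part of the original notebook, of how one might check that a column's values fit a dtype before casting. The helper name `fits_dtype` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fits_dtype(values, dtype):
    """Return True if every value lies within the range of the integer dtype."""
    info = np.iinfo(dtype)
    values = pd.Series(values)
    return bool(values.min() >= info.min and values.max() <= info.max)

# toy values, not read from the competition files
week = pd.Series([3, 4, 11])
units = pd.Series([1, 3, 4975])

print(fits_dtype(week, 'int8'))    # week numbers 3..11 are inside the int8 range
print(fits_dtype(units, 'int8'))   # 4975 is outside the int8 range
print(fits_dtype(units, 'int32'))
```

Running such a check on a sample read with default dtypes is one way to pick the narrowest safe type per column.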


In [3]:
train.head()

Unnamed: 0,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
0,3,1110,7,3301,24695,1187,1,148.5,0,0,1
1,3,1110,7,3301,50379,1146,3,64.169998,0,0,3
2,3,1110,7,3301,73589,4085,2,12.3,0,0,2
3,3,1110,7,3301,73589,31506,1,6.25,0,0,1
4,3,1110,7,3301,73844,1242,1,7.64,0,0,1


In [4]:
test.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID
0,0,11,4037,1,2209,4639078,35305
1,1,11,2237,1,1226,4705135,1238
2,2,10,2045,1,2831,4549769,32940
3,3,11,1227,1,4448,4717855,43066
4,4,11,1219,1,1130,966351,1277


In [5]:
# remove unnecessary fields in training data
train.drop(['Venta_uni_hoy', 'Venta_hoy','Dev_uni_proxima', 'Dev_proxima'], axis=1, inplace=True)

In [6]:
#Since test dataframe is not the same as train dataframe, we make them equal by removing and adding columns
train.insert(0, 'id', np.nan)
test.insert(7, 'Demanda_uni_equil', np.nan)

In [7]:
train.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil
0,,3,1110,7,3301,24695,1187,1
1,,3,1110,7,3301,50379,1146,3
2,,3,1110,7,3301,73589,4085,2
3,,3,1110,7,3301,73589,31506,1
4,,3,1110,7,3301,73844,1242,1


In [8]:
test.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil
0,0,11,4037,1,2209,4639078,35305,
1,1,11,2237,1,1226,4705135,1238,
2,2,10,2045,1,2831,4549769,32940,
3,3,11,1227,1,4448,4717855,43066,
4,4,11,1219,1,1130,966351,1277,


It is a good idea to combine both train and test data sets into one, perform feature engineering and then divide them again later. This saves the trouble of performing the same steps twice on test and train. Let's combine them into a dataframe ‘data’ with a ‘source’ column specifying where each observation belongs.

In [9]:
tic()
train['source']='train'
test['source']='test'
data = pd.concat([train, test],ignore_index=True)
tac()
print (train.shape, test.shape, data.shape)

Time passed: 0hour:0min:2sec
(3709700, 9) (6999251, 9) (10708951, 9)
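
As a small illustration of the combine-then-split idea, here is a sketch on toy frames. The frame names (`toy_train`, `toy_test`), the made-up values and the `is_product_2` feature are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# toy frames standing in for train and test (values are made up)
toy_train = pd.DataFrame({'Producto_ID': [1, 2], 'Demanda_uni_equil': [3.0, 5.0]})
toy_test = pd.DataFrame({'Producto_ID': [2, 3], 'Demanda_uni_equil': [np.nan, np.nan]})

toy_train['source'] = 'train'
toy_test['source'] = 'test'
toy_data = pd.concat([toy_train, toy_test], ignore_index=True)

# feature engineering is done once, on the combined frame
toy_data['is_product_2'] = (toy_data['Producto_ID'] == 2).astype(int)

# split back using the 'source' column, then drop the helper column
train_again = toy_data[toy_data['source'] == 'train'].drop(columns='source')
test_again = toy_data[toy_data['source'] == 'test'].drop(columns='source')

print(train_again.shape, test_again.shape)
```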


Thus we can see that data has the same number of columns, and as many rows as test and train combined. Let's start by checking which columns contain missing values. (Takes approx. 30 minutes to run!)

In [10]:
data.apply(lambda x: sum(x.isnull()))

id                   3709700
Semana                     0
Agencia_ID                 0
Canal_ID                   0
Ruta_SAK                   0
Cliente_ID                 0
Producto_ID                0
Demanda_uni_equil    6999251
source                     0
dtype: int64

There don't seem to be any missing values (other than the NaNs we set ourselves on the test and train sets).
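
The missing-value count above calls Python's built-in `sum` on every column, which walks the rows one by one. A vectorized count usually gives the same numbers much faster. This is a sketch on a toy frame with made-up values, and I have not timed it on the full dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id': [np.nan, np.nan, 0.0],
    'Semana': [3, 4, 10],
    'Demanda_uni_equil': [1.0, 2.0, np.nan],
})

slow = toy.apply(lambda x: sum(x.isnull()))  # Python-level loop over every element
fast = toy.isnull().sum()                    # vectorized count per column

print(fast)
```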

Let's look at some basic statistics for numerical variables.

In [11]:
data.describe()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil
count,6999251.0,10708951.0,10708951.0,10708951.0,10708951.0,10708950.0,10708951.0,3709700.0
mean,3499625.0,8.920064,2516.178078,1.394992,2129.807716,1813825.0,21705.737578,7.203616
std,2020509.86883,2.498625,4033.947395,1.495498,1495.910168,2951558.0,18696.499509,21.734525
min,0.0,3.0,1110.0,1.0,1.0,26.0,41.0,0.0
25%,1749812.5,8.0,1311.0,1.0,1160.0,356081.0,1242.0,2.0
50%,3499625.0,10.0,1613.0,1.0,1292.0,1198099.0,31298.0,3.0
75%,5249437.5,11.0,2034.0,1.0,2804.0,2383228.0,40886.0,6.0
max,6999250.0,11.0,25759.0,11.0,9970.0,2015152000.0,49997.0,4975.0


Some observations:

   Looking at Demanda_uni_equil (our target), the adjusted demand in units per week, we find some interesting things:
   
   **1)** The mean is 7.20, so on average about 7 units are sold per record (one product at one client in one week).
   
   **2)** The max of 4975 is very far from the mean (almost 3 orders of magnitude), so we must check for an outlier here, or a store that is very different from the rest.
   
   **3)** We find the same behaviour in Dev_uni_proxima, Venta_hoy and Venta_uni_hoy.
   
Looking at the nice data analysis made in R by fabienvs, here: https://www.kaggle.com/fabienvs/grupo-bimbo-inventory-demand/notebook-8a62eda039a3b0b944cf/notebook we corroborate the outlier(s):
There is a massive client: Puebla Remision
   
![Image of size of Customers]( https://www.kaggle.io/svf/267812/783a24d1dd546819a44914f996b249e8/__results___files/figure-html/unnamed-chunk-16-1.png)
   

Moving on to the nominal (categorical) variables, let's have a look at the number of unique values in each of them.

In [11]:
data.apply(lambda x: len(x.unique()))

id                   6999252
Semana                     9
Agencia_ID               552
Canal_ID                   9
Ruta_SAK                3620
Cliente_ID            890267
Producto_ID             1833
Venta_uni_hoy            257
Venta_hoy              73516
Dev_uni_proxima          256
Dev_proxima            14241
Demanda_uni_equil       2092
source                     2
dtype: int64

So, in the train and test sets combined, we have 552 agencies (depots), 890k clients (some may be duplicates caused by typos during data entry), 1833 products (again, some may be duplicates caused by typos) and 3620 routes.

In [10]:
# Let's see how many records we have per week
for i in range(3,12):
    print("Semana"+repr(i)+" =\t" + repr(data[data["Semana"]==i].Semana.count()))

Semana3 =	557998
Semana4 =	550734
Semana5 =	530826
Semana6 =	510759
Semana7 =	519346
Semana8 =	520212
Semana9 =	519825
Semana10 =	3538385
Semana11 =	3460866


As stated in the Kaggle competition, weeks 10 and 11 are sampled down by approx. 70%. According to Kaggle, this was done so that scoring the candidates' submissions didn't take extremely long.
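
The per-week record counts can also be obtained in a single pass rather than one filter per week. A minimal sketch on a toy column (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({'Semana': [3, 3, 4, 10, 10, 10]})

# count the rows per week, ordered by week number
per_week = toy['Semana'].value_counts().sort_index()
print(per_week)
```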

## 3. Data Cleaning

This step involves imputing missing values and treating outliers. As we saw before, there are no missing values. Regarding outliers, there seems to be an obvious one, but we will see later on whether it's necessary to treat it differently.

My initial reaction would be to check whether any client with the word REMISION in its name is in the test set, and if not, delete those rows. See this discussion: https://www.kaggle.com/c/grupo-bimbo-inventory-demand/forums/t/22037/puebla-remission/126053

In [13]:
#Let's find the clients whose name contains the word REMISION
client_name = pd.read_csv('files/../input-data/cliente_tabla.csv')
cliente_id_name_train = pd.merge(train,client_name, on='Cliente_ID')
cliente_id_name_test = pd.merge(test,client_name, on='Cliente_ID')

In [14]:
cliente_id_name_train.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil,source,NombreCliente
0,,3,1110,7,3301,15766,1212,3,25.139999,0,0,3,train,PUESTO DE PERIODICOS LAZARO
1,,3,1110,7,3301,15766,1216,4,33.52,0,0,4,train,PUESTO DE PERIODICOS LAZARO
2,,3,1110,7,3301,15766,1238,4,39.32,0,0,4,train,PUESTO DE PERIODICOS LAZARO
3,,3,1110,7,3301,15766,1240,4,33.52,0,0,4,train,PUESTO DE PERIODICOS LAZARO
4,,3,1110,7,3301,15766,1242,3,22.92,0,0,3,train,PUESTO DE PERIODICOS LAZARO


In [15]:
cliente_id_name_train[cliente_id_name_train.NombreCliente.str.contains('REMISION')].count()

id                        0
Semana               139487
Agencia_ID           139487
Canal_ID             139487
Ruta_SAK             139487
Cliente_ID           139487
Producto_ID          139487
Venta_uni_hoy        139487
Venta_hoy            139487
Dev_uni_proxima      139487
Dev_proxima          139487
Demanda_uni_equil    139487
source               139487
NombreCliente        139487
dtype: int64

As we can see above, the word "REMISION" shows up in about 140k rows of the train set. Let's see the test set:

In [16]:
cliente_id_name_test[cliente_id_name_test.NombreCliente.str.contains('REMISION')].count()

id                   12842
Semana               12842
Agencia_ID           12842
Canal_ID             12842
Ruta_SAK             12842
Cliente_ID           12842
Producto_ID          12842
Venta_uni_hoy            0
Venta_hoy                0
Dev_uni_proxima          0
Dev_proxima              0
Demanda_uni_equil        0
source               12842
NombreCliente        12842
dtype: int64

The word REMISION shows up in about 12k rows of the test set. This implies that these clients have to be predicted as well, so we cannot eliminate them.

## 4. Feature Engineering

We explored some nuances in the data in the data exploration section. Let's move on to resolving them and making our data ready for analysis. We will also create some new variables from the existing ones in this section.

In [11]:
#First we transform our target (Demanda_uni_equil) to log(1 + demand). This makes sense because we are trying to minimize RMSLE:
#minimizing RMSE on the log-transformed target is the same as minimizing RMSLE on the raw demand. At the end of the modeling
#(for submission) we need to reverse the transform by applying expm1(x)

data['log_target'] = np.log1p(data["Demanda_uni_equil"])
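
To make the reasoning behind the transform concrete: RMSLE is the square root of the mean of (log(1 + prediction) - log(1 + actual))^2, which is exactly RMSE computed on log1p-transformed values. The sketch below checks this numerically on made-up numbers. The helper names `rmse` and `rmsle` are mine, not from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# toy demands and predictions (made-up numbers)
demand = np.array([1.0, 3.0, 2.0, 162.0])
pred = np.array([2.0, 3.0, 1.0, 120.0])

# RMSE measured in log1p space equals RMSLE measured on the raw values
same_metric = np.isclose(rmse(np.log1p(demand), np.log1p(pred)), rmsle(demand, pred))

# expm1 undoes log1p, which is what the submission step relies on
round_trip = np.allclose(np.expm1(np.log1p(demand)), demand)

print(same_metric, round_trip)
```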

In [12]:
data.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target
0,,3,1110,7,3301,24695,1187,1,train,0.693147
1,,3,1110,7,3301,50379,1146,3,train,1.386294
2,,3,1110,7,3301,73589,4085,2,train,1.098612
3,,3,1110,7,3301,73589,31506,1,train,0.693147
4,,3,1110,7,3301,73844,1242,1,train,0.693147


In [13]:
#Let's also create all the grouping dataframes we are going to need 
tic()

global_mean = data['log_target'].mean()
prod_mean = data.groupby('Producto_ID').agg({'log_target': 'mean' })
client_mean = data.groupby('Cliente_ID').agg({'log_target': 'mean' })
prod_client_mean = data.groupby(['Producto_ID', 'Cliente_ID']).agg({'log_target': 'mean' })
semana_client_prod_mean = data.groupby(['Semana','Cliente_ID','Producto_ID']).agg({'log_target': 'mean'})
ruta_cliente_prod_mean = data.groupby(['Ruta_SAK','Cliente_ID','Producto_ID']).agg({'log_target': 'mean'})

tac()

Time passed: 0hour:0min:35sec


### Feature 1: Lags - Demand per client-product pair for prior weeks
Based on this blog: http://blog.nycdatascience.com/student-works/predicting-demand-historical-sales-data-grupo-bimbo-kaggle-competition/

As this script points out: https://www.kaggle.com/bpavlyshenko/grupo-bimbo-inventory-demand/bimbo-xgboost-r-script-lb-0-457/code
It is important to know the previous weeks' sales. If too many products were supplied in the previous week and they were not sold, the amount of this product supplied to the same store the following week will be decreased. So it is very important to include lagged values of the target variable as features for predicting the next sales.

In [14]:
df = semana_client_prod_mean.reset_index() # we convert the index to columns for later use

In [15]:
df.head()

Unnamed: 0,Semana,Cliente_ID,Producto_ID,log_target
0,3,60,32275,4.26268
1,3,60,34787,5.796058
2,3,60,34867,6.142037
3,3,65,31200,4.290459
4,3,65,34786,5.846439


In [16]:
# Let's see how many records we have per week on the semana_cliente_Producto groups vs the raw dataset
for i in range(3,12):
    print("Semana"+repr(i)+" =\t" + repr(data[data["Semana"]==i].Semana.count())+ " (raw)\t" +
            repr(df[df["Semana"]==i].Semana.count()) + " (group)\t" +  
            repr(data[data["Semana"]==i].Semana.count() - df[df["Semana"]==i].Semana.count()) + " (diff)"
         )

Semana3 =	557998 (raw)	557211 (group)	787 (diff)
Semana4 =	550734 (raw)	550018 (group)	716 (diff)
Semana5 =	530826 (raw)	530002 (group)	824 (diff)
Semana6 =	510759 (raw)	509982 (group)	777 (diff)
Semana7 =	519346 (raw)	518516 (group)	830 (diff)
Semana8 =	520212 (raw)	519374 (group)	838 (diff)
Semana9 =	519825 (raw)	519079 (group)	746 (diff)
Semana10 =	3538385 (raw)	3531921 (group)	6464 (diff)
Semana11 =	3460866 (raw)	3454394 (group)	6472 (diff)


As we can see above, the grouped counts are smaller than the raw counts, so some client-product combinations appear more than once within a week.

In [17]:
#Before we start adding lags and removing rows, let's see the size of our dataset
size_data = data.memory_usage().sum()
print(size_data)

514029648


In [18]:
#here we add the number of lags we want
tic()
lag=4

for i in range(1,lag+1):
    lag_col = 'Log_Target_mean_lag%d' %(i)
    df['Semana'] += 1 # shift the weekly means one more week forward
    df.rename(columns={df.columns[3]: lag_col}, inplace=True)
    data = pd.merge(data,df, how = 'left', on = ['Semana','Cliente_ID','Producto_ID']) #here we add the lag to the dataset
    data[lag_col] = data[lag_col].fillna(0) # we replace the client-product log mean NaN/Not found on the week before with ZERO
    data = data[data.Semana != i+2] # here we delete the week rows we dont have lags for
   
tac()

Time passed: 0hour:1min:0sec
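
The shift-and-merge trick used for the lags can be easier to follow on a tiny example. The sketch below uses one client-product pair with made-up weekly means and builds two lags. The names `weekly`, `rows` and `lag1`/`lag2` are assumptions for illustration only.

```python
import pandas as pd

# toy weekly means for a single client-product pair (made-up values)
weekly = pd.DataFrame({
    'Semana':      [3, 4, 5],
    'Cliente_ID':  [60, 60, 60],
    'Producto_ID': [100, 100, 100],
    'log_target':  [1.0, 2.0, 3.0],
})
keys = ['Semana', 'Cliente_ID', 'Producto_ID']
rows = weekly[keys].copy()

for k in (1, 2):
    shifted = weekly.copy()
    shifted['Semana'] += k  # the value observed in week w is attached to week w + k
    shifted = shifted.rename(columns={'log_target': 'lag%d' % k})
    rows = rows.merge(shifted, how='left', on=keys)
    rows['lag%d' % k] = rows['lag%d' % k].fillna(0)  # no history: use 0, as in the notebook

print(rows)
```

Week 5 ends up with the week 4 value as `lag1` and the week 3 value as `lag2`, while week 3 has no history and gets zeros.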


In [54]:
#Let's see how many NaN or Nulls we have
data.apply(lambda x: sum(x.isnull()))

id                      1559383
Semana                        0
Agencia_ID                    0
Canal_ID                      0
Ruta_SAK                      0
Cliente_ID                    0
Producto_ID                   0
Demanda_uni_equil       6999251
source                        0
log_target              6999251
Log_Target_mean_lag1          0
Log_Target_mean_lag2          0
Log_Target_mean_lag3          0
Log_Target_mean_lag4          0
dtype: int64

The above looks correct! The only NaNs shown are in the variables that are not available on the test set (plus the id we set to NaN on the train rows); everything else looks good.


In [19]:
# Let's see how many records we have per week and make sure we didn't delete any data from our important weeks
for i in range(3,12):
    print("Semana"+repr(i)+" =\t" + repr(data[data["Semana"]==i].Semana.count()))

Semana3 =	0
Semana4 =	0
Semana5 =	0
Semana6 =	0
Semana7 =	519346
Semana8 =	520212
Semana9 =	519825
Semana10 =	3538385
Semana11 =	3460866


The above looks correct! We deleted the rows for the weeks we don't have lags for, and we kept the original number of rows for the remaining weeks.

In [20]:
#Now let's see how much was the data set size reduced/increased
new_size_data = data.memory_usage().sum()
print("Dataset size changed in " + repr((new_size_data - size_data )*100/new_size_data) + "%")

Dataset size changed in 24.925278963909427%


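Note that the change is positive: memory usage actually grew (the difference is about 25% of the new size), because the four added lag columns outweigh the dropped rows. The number of rows, however, fell from 10,708,951 to 8,558,634, about 20% fewer, and fewer rows means faster modeling.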

In [22]:
data.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,Log_Target_mean_lag1,Log_Target_mean_lag2,Log_Target_mean_lag3,Log_Target_mean_lag4
510759,,7,1110,7,3301,15766,3894,4,train,1.609438,0,0.0,0,0
510760,,7,1110,7,3301,15766,35452,4,train,1.609438,0,0.0,0,0
510761,,7,1110,7,3301,73838,5310,6,train,1.94591,0,0.0,0,0
510762,,7,1110,7,3301,73838,5345,3,train,1.386294,0,0.0,0,0
510763,,7,1110,7,3301,108104,1182,162,train,5.09375,0,5.365976,0,0


### Feature 2: Rate of change of lags
It is also important to know the rate of change in demand from week to week, relative to the product-client mean.

In [23]:
df = prod_client_mean.reset_index() # we convert the index to columns for later use
df.rename(columns={'log_target': 'PC_log_target_mean'}, inplace=True)

In [24]:
data = pd.merge(data,df, how = 'left', on = ['Producto_ID','Cliente_ID'])

In [26]:
data.tail()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,Log_Target_mean_lag1,Log_Target_mean_lag2,Log_Target_mean_lag3,Log_Target_mean_lag4,PC_log_target_mean
8558629,6999246,11,2057,1,1153,4379638,1232,,test,,0.0,0,0,0,
8558630,6999247,10,1334,1,2008,970421,43069,,test,,1.098612,0,0,0,1.098612
8558631,6999248,11,1622,1,2869,192749,30532,,test,,0.0,0,0,0,1.609438
8558632,6999249,11,1636,1,4401,286071,35107,,test,,0.0,0,0,0,
8558633,6999250,11,1625,1,1259,978760,1232,,test,,0.0,0,0,0,


In [27]:
tic()
for i in range(2,lag+1):
    rate_col = 'rate_change%d' %(i)
    data[rate_col] = (data['Log_Target_mean_lag%d' %(i-1)] - data['Log_Target_mean_lag%d' %(i)])/data['PC_log_target_mean']
    data[rate_col] = data[rate_col].fillna(0) # we replace NaN/Not found with ZERO
    
tac()

Time passed: 0hour:0min:0sec
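
As a worked example of the rate-of-change formula, (lag(i-1) - lag(i)) / product-client mean, the sketch below uses one set of lag values that appears in the notebook's output and one toy row where the pair mean is missing. The frame name `toy` is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Log_Target_mean_lag1': [0.0, 0.0],
    'Log_Target_mean_lag2': [5.365976, 0.0],
    'PC_log_target_mean':   [5.229863, np.nan],  # second row: no pair mean available
})

toy['rate_change2'] = (
    (toy['Log_Target_mean_lag1'] - toy['Log_Target_mean_lag2']) / toy['PC_log_target_mean']
).fillna(0)  # a missing pair mean gives NaN, which is replaced by 0

print(toy['rate_change2'].round(6).tolist())
```

A negative value means demand dropped between the two lagged weeks, scaled by what is typical for that client-product pair.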


In [28]:
#We don't need PC_log_target_mean anymore
data.drop(['PC_log_target_mean'],axis=1,inplace=True)

In [30]:
data.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,Log_Target_mean_lag1,Log_Target_mean_lag2,Log_Target_mean_lag3,Log_Target_mean_lag4,rate_change2,rate_change3,rate_change4
0,,7,1110,7,3301,15766,3894,4,train,1.609438,0,0.0,0,0,0.0,0.0,0
1,,7,1110,7,3301,15766,35452,4,train,1.609438,0,0.0,0,0,0.0,0.0,0
2,,7,1110,7,3301,73838,5310,6,train,1.94591,0,0.0,0,0,0.0,0.0,0
3,,7,1110,7,3301,73838,5345,3,train,1.386294,0,0.0,0,0,0.0,0.0,0
4,,7,1110,7,3301,108104,1182,162,train,5.09375,0,5.365976,0,0,-1.026026,1.026026,0


### Feature 3: Sum of prior weeks' log mean demands

In [31]:
#We sum all the lags (lag1 through lag4) into a single feature
data['Lags_sum'] = 0
for i in range(1,lag+1):
    data['Lags_sum'] += data['Log_Target_mean_lag%d' %(i)]

In [33]:
data.tail()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,Log_Target_mean_lag1,Log_Target_mean_lag2,Log_Target_mean_lag3,Log_Target_mean_lag4,rate_change2,rate_change3,rate_change4,Lags_sum
8558629,6999246,11,2057,1,1153,4379638,1232,,test,,0.0,0,0,0,0,0,0,0.0
8558630,6999247,10,1334,1,2008,970421,43069,,test,,1.098612,0,0,0,1,0,0,1.098612
8558631,6999248,11,1622,1,2869,192749,30532,,test,,0.0,0,0,0,0,0,0,0.0
8558632,6999249,11,1636,1,4401,286071,35107,,test,,0.0,0,0,0,0,0,0,0.0
8558633,6999250,11,1625,1,1259,978760,1232,,test,,0.0,0,0,0,0,0,0,0.0


### Feature 4: Mean Demand per client-product pair - Product/Client demand, Product demand, Global demand

In [34]:
tic()
prod_mean_dict = prod_mean.to_dict()
prod_client_mean_dict = prod_client_mean.to_dict()
tac()

Time passed: 0hour:0min:8sec


In [35]:
def gen_pairs_mean_feature(key):
    key = tuple(key)
    product = key[0]
    client = key[1]
    
    val = prod_client_mean_dict['log_target'][(product,client)]
    if np.isnan(val):
        val = prod_mean_dict['log_target'][(product)]
        if np.isnan(val):
            val = global_mean
            
    return val

In [36]:
print (global_mean)

1.6026541306341908


In [37]:
#Let's see how many products are new (appear on week 10 and 11 but not on past weeks)
prod_mean.apply(lambda x: sum(x.isnull()))

log_target    130
dtype: int64

In [38]:
#Let's see how many combinations of products-clients are new (appear on week 10 and 11 but not on past weeks)
prod_client_mean.apply(lambda x: sum(x.isnull()))

log_target    5324541
dtype: int64

In [39]:
tic()
data['pairs_mean'] = data[['Producto_ID', 'Cliente_ID']].apply(lambda x:gen_pairs_mean_feature(x), axis=1)
tac()

Time passed: 0hour:2min:36sec
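
The row-wise `apply` above takes a few minutes. The same fallback chain (pair mean, then product mean, then global mean) can be sketched with vectorized operations. This is an alternative under my own assumptions, shown on toy data with made-up values, and has not been checked against the notebook's full dataset.

```python
import numpy as np
import pandas as pd

# toy data: the pair (1, 11) and product 3 have no observed target
toy = pd.DataFrame({
    'Producto_ID': [1, 1, 2, 3],
    'Cliente_ID':  [10, 11, 10, 10],
    'log_target':  [1.0, np.nan, 2.0, np.nan],
})

global_mean = toy['log_target'].mean()
# transform() broadcasts each group's mean back onto its rows; NaN targets are skipped
pair_mean = toy.groupby(['Producto_ID', 'Cliente_ID'])['log_target'].transform('mean')
prod_mean = toy.groupby('Producto_ID')['log_target'].transform('mean')

# pair mean, else product mean, else global mean
toy['pairs_mean'] = pair_mean.fillna(prod_mean).fillna(global_mean)
print(toy)
```

Each `fillna` only touches rows that are still NaN, which mirrors the nested `if np.isnan(val)` checks in `gen_pairs_mean_feature`.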


In [41]:
data.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,Log_Target_mean_lag1,Log_Target_mean_lag2,Log_Target_mean_lag3,Log_Target_mean_lag4,rate_change2,rate_change3,rate_change4,Lags_sum,pairs_mean
0,,7,1110,7,3301,15766,3894,4,train,1.609438,0,0.0,0,0,0.0,0.0,0,0.0,1.609438
1,,7,1110,7,3301,15766,35452,4,train,1.609438,0,0.0,0,0,0.0,0.0,0,0.0,1.609438
2,,7,1110,7,3301,73838,5310,6,train,1.94591,0,0.0,0,0,0.0,0.0,0,0.0,1.94591
3,,7,1110,7,3301,73838,5345,3,train,1.386294,0,0.0,0,0,0.0,0.0,0,0.0,1.386294
4,,7,1110,7,3301,108104,1182,162,train,5.09375,0,5.365976,0,0,-1.026026,1.026026,0,5.365976,5.229863


### Feature 5: Create a broad category of Brand of item (brand hypothesis)
Let's preprocess products a little bit. I borrowed some of the preprocessing from here: https://www.kaggle.com/vykhand/grupo-bimbo-inventory-demand/exploring-products

In [42]:
products  =  pd.read_csv("input-data/producto_tabla.csv")
# raw strings keep the regex escapes (\D, \d, \s) intact;
# expand=False returns a Series for the single-group patterns
products['short_name'] = products.NombreProducto.str.extract(r'^(\D*)', expand=False)
products['brand'] = products.NombreProducto.str.extract(r'^.+\s(\D+) \d+$', expand=False)
w = products.NombreProducto.str.extract(r'(\d+)(Kg|g)', expand=True)
products['weight'] = w[0].astype('float')*w[1].map({'Kg':1000, 'g':1})
products['pieces'] =  products.NombreProducto.str.extract(r'(\d+)p ', expand=False).astype('float')
products.head()

Unnamed: 0,Producto_ID,NombreProducto,short_name,brand,weight,pieces
0,0,NO IDENTIFICADO 0,NO IDENTIFICADO,IDENTIFICADO,,
1,9,Capuccino Moka 750g NES 9,Capuccino Moka,NES,750.0,
2,41,Bimbollos Ext sAjonjoli 6p 480g BIM 41,Bimbollos Ext sAjonjoli,BIM,480.0,6.0
3,53,Burritos Sincro 170g CU LON 53,Burritos Sincro,LON,170.0,
4,72,Div Tira Mini Doradita 4p 45g TR 72,Div Tira Mini Doradita,TR,45.0,4.0
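
To see what each regular expression picks out, here is a small sketch that runs the same patterns on three product names taken from the table above. The `.str.strip()` on the short name is my addition (the `^(\D*)` pattern keeps the trailing space before the first digit); everything else follows the notebook's patterns.

```python
import pandas as pd

names = pd.Series([
    'Capuccino Moka 750g NES 9',
    'Bimbollos Ext sAjonjoli 6p 480g BIM 41',
    'Burritos Sincro 170g CU LON 53',
])

# everything before the first digit
short_name = names.str.extract(r'^(\D*)', expand=False).str.strip()
# the last non-digit token before the trailing product id
brand = names.str.extract(r'^.+\s(\D+) \d+$', expand=False)
# a number directly followed by a weight unit
w = names.str.extract(r'(\d+)(Kg|g)', expand=True)
weight = w[0].astype('float') * w[1].map({'Kg': 1000, 'g': 1})
# a number directly followed by "p " (pieces); NaN when absent
pieces = names.str.extract(r'(\d+)p ', expand=False).astype('float')

print(pd.DataFrame({'short_name': short_name, 'brand': brand,
                    'weight': weight, 'pieces': pieces}))
```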


In [43]:
products.tail()

Unnamed: 0,Producto_ID,NombreProducto,short_name,brand,weight,pieces
2587,49992,Tostado Integral 180g MTA WON 49992,Tostado Integral,WON,180,
2588,49993,Tostado Integral 180g TAB WON 49993,Tostado Integral,WON,180,
2589,49994,Tostado Int 0pct Grasa Azuc 200g WON 49994,Tostado Int,WON,200,
2590,49996,Tostado Int 0pct Grasa Azuc 200g MTA WON 49996,Tostado Int,WON,200,
2591,49997,Tostado Int 0pct Grasa Azuc 200g TAB WON 49997,Tostado Int,WON,200,


In [44]:
products.brand.value_counts()

BIM             679
MLA             657
TR              257
LAR             182
GBI             130
WON             117
DH               95
LON              83
SAN              66
MR               64
ORO              44
CC               33
SL               32
BAR              31
RIC              20
SUA              20
MP               10
SUN               9
JMX               8
SKD               7
MCM               5
COR               5
NAI               4
THO               4
NES               3
TRI               3
BRL               2
MSK               2
CHK               2
KOD               2
PUL               2
EMB               1
AM                1
VER               1
IDENTIFICADO      1
BRE               1
AV                1
GV                1
DIF               1
NEC               1
CAR               1
LC                1
VR                1
MTB               1
Name: brand, dtype: int64

In [45]:
products.brand.nunique()

44

In [46]:
products_id_brand  = products[['Producto_ID', 'brand']].copy()

In [47]:
data = pd.merge(data, products_id_brand, on='Producto_ID')

In [49]:
data.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,Log_Target_mean_lag1,Log_Target_mean_lag2,Log_Target_mean_lag3,Log_Target_mean_lag4,rate_change2,rate_change3,rate_change4,Lags_sum,pairs_mean,brand
0,,7,1110,7,3301,15766,3894,4,train,1.609438,0,0,0,0,0,0,0,0,1.609438,MLA
1,,7,1110,7,3306,1449933,3894,13,train,2.639057,0,0,0,0,0,0,0,0,2.639057,MLA
2,,7,1110,7,3309,1240974,3894,6,train,1.94591,0,0,0,0,0,0,0,0,1.94591,MLA
3,,7,1110,7,3313,2084703,3894,15,train,2.772589,0,0,0,0,0,0,0,0,2.772589,MLA
4,,7,1110,7,3314,1015221,3894,4,train,1.609438,0,0,0,0,0,0,0,0,1.609438,MLA


### Feature 6: Create clusters of Products (utility hypothesis) - randomly pick 30 clusters

In [50]:
#Read files:
product_clusters = pd.read_csv('input-data/producto_clusters.csv')

In [51]:
product_clusters.tail()

Unnamed: 0,Producto_ID,NombreProducto,product_shortname,cluster
2586,49992,Tostado Integral 180g MTA WON 49992,Tostado Integral 180g,4
2587,49993,Tostado Integral 180g TAB WON 49993,Tostado Integral 180g,4
2588,49994,Tostado Int 0pct Grasa Azuc 200g WON 49994,Tostado Int 0pct Grasa Azuc 200g,4
2589,49996,Tostado Int 0pct Grasa Azuc 200g MTA WON 49996,Tostado Int 0pct Grasa Azuc 200g,4
2590,49997,Tostado Int 0pct Grasa Azuc 200g TAB WON 49997,Tostado Int 0pct Grasa Azuc 200g,4


In [52]:
print (product_clusters["cluster"].value_counts())

1     204
14    137
10    136
11    124
4     118
13    109
23    103
24    101
19     99
17     93
16     89
8      88
25     85
30     81
22     78
20     75
15     73
9      73
27     71
5      70
2      68
6      65
3      63
26     62
7      61
28     60
12     60
18     59
29     53
21     33
Name: cluster, dtype: int64


In [53]:
products_id_clusters = product_clusters[['Producto_ID', 'cluster']].copy()

In [54]:
products_id_clusters.tail()

Unnamed: 0,Producto_ID,cluster
2586,49992,4
2587,49993,4
2588,49994,4
2589,49996,4
2590,49997,4


In [55]:
data = pd.merge(data, products_id_clusters, on='Producto_ID')

In [56]:
data.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,...,Log_Target_mean_lag2,Log_Target_mean_lag3,Log_Target_mean_lag4,rate_change2,rate_change3,rate_change4,Lags_sum,pairs_mean,brand,cluster
0,,7,1110,7,3301,15766,3894,4,train,1.609438,...,0,0,0,0,0,0,0,1.609438,MLA,18
1,,7,1110,7,3306,1449933,3894,13,train,2.639057,...,0,0,0,0,0,0,0,2.639057,MLA,18
2,,7,1110,7,3309,1240974,3894,6,train,1.94591,...,0,0,0,0,0,0,0,1.94591,MLA,18
3,,7,1110,7,3313,2084703,3894,15,train,2.772589,...,0,0,0,0,0,0,0,2.772589,MLA,18
4,,7,1110,7,3314,1015221,3894,4,train,1.609438,...,0,0,0,0,0,0,0,1.609438,MLA,18


### Feature 7: Create a category of Size of store based on Number of Agencies and Routes and Sales Channels that serve the store

In [57]:
#Determine pivot table
Rutas_per_store = data.pivot_table(values=["Ruta_SAK"], index=["Cliente_ID"], aggfunc=pd.Series.nunique)

In [58]:
Rutas_per_store.describe()

Unnamed: 0,Ruta_SAK
count,774993.0
mean,2.236169
std,1.321886
min,1.0
25%,1.0
50%,2.0
75%,3.0
max,47.0


In [55]:
Agencies_per_store = data.pivot_table(values=["Agencia_ID"], index=["Cliente_ID"], aggfunc=pd.Series.nunique)

In [56]:
Agencies_per_store.describe()

Unnamed: 0,Agencia_ID
count,858290.0
mean,1.049571
std,0.244343
min,1.0
25%,1.0
50%,1.0
75%,1.0
max,63.0


In [57]:
Canals_per_store = data.pivot_table(values=["Canal_ID"], index=["Cliente_ID"], aggfunc=pd.Series.nunique)

In [58]:
Canals_per_store.describe()

Unnamed: 0,Canal_ID
count,858290.0
mean,1.006472
std,0.080898
min,1.0
25%,1.0
50%,1.0
75%,1.0
max,4.0


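Binning on Canal_ID or Agencia_ID does not look useful: almost every store is served by a single agency and a single channel (the 75th percentile is 1 for both), so the counts are far from evenly distributed. Ruta_SAK looks like a better option, since its percentiles are spread out (1, 2 and 3 routes at the 25th, 50th and 75th percentiles).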
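As a cross-check on the pivot tables above, the same per-store counts can be obtained with `groupby(...).nunique()`. The sketch below runs on a small made-up frame (the values are illustrative, not rows from the dataset) and also prints the share of stores with exactly one route or agency, which is the kind of skew that makes a column a poor candidate for binning.

```python
import pandas as pd

# Toy transactions: illustrative values only, not rows from the real dataset
toy = pd.DataFrame({
    "Cliente_ID": [1, 1, 1, 2, 2, 3],
    "Ruta_SAK":   [10, 11, 11, 10, 10, 12],
    "Agencia_ID": [100, 100, 100, 100, 100, 200],
})

# Number of distinct routes and agencies serving each store
per_store = toy.groupby("Cliente_ID")[["Ruta_SAK", "Agencia_ID"]].nunique()
print(per_store)

# Share of stores served by exactly one route / exactly one agency
share_single = (per_store == 1).mean()
print(share_single)
```

A column where this share is close to 1 carries almost no information once it is binned.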

In [59]:
Rutas_per_store.rename(columns={'Ruta_SAK': 'Qty_Ruta_SAK'}, inplace=True)

In [60]:
#Merging the number of Ruta_SAK per client into the data df
data = pd.merge(data,Rutas_per_store,right_index=True, left_on='Cliente_ID')

In [62]:
data.tail()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,...,Log_Target_mean_lag3,Log_Target_mean_lag4,rate_change2,rate_change3,rate_change4,Lags_sum,pairs_mean,brand,cluster,Qty_Ruta_SAK
8558531,2895532,10,1348,1,2805,2453136,36310,,test,,...,0,0,0,0,0,0,1.602654,TR,30,1
8558548,3211677,10,2059,11,3922,1919007,30370,,test,,...,0,0,0,0,0,0,0.965496,TR,27,1
8558600,5292949,11,1448,11,3935,743124,35830,,test,,...,0,0,0,0,0,0,1.602654,BIM,9,1
8558622,6332303,11,1535,8,5702,2371213,36801,,test,,...,0,0,0,0,0,0,1.602654,BIM,20,1
8558631,6867383,10,2278,8,3402,1611123,4139,,test,,...,0,0,0,0,0,0,1.602654,BIM,1,1


In [63]:
#Binning:
def binning(col, cut_points, labels=None):
    #Define min and max values:
    minval = col.min()
    maxval = col.max()

    #Create the list of bin edges by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]

    #If no labels are provided, use default labels 0 ... (n-1), where n is the number of bins
    if not labels:
        labels = range(len(cut_points)+1)

    #Binning using the cut function of pandas; include_lowest keeps minval inside the first bin
    colBin = pd.cut(col,bins=break_points,labels=labels,include_lowest=True)
    return colBin

#Binning Qty_Ruta_SAK: 1-2 routes = low, 3-4 = medium, 5-10 = high, more than 10 = very high
cut_points = [2,4,10]
labels = ["low","medium","high","very high"]
data["Qty_Ruta_SAK_Bin"] = binning(data["Qty_Ruta_SAK"], cut_points, labels)
print (data["Qty_Ruta_SAK_Bin"].value_counts(sort=False))

low          3280223
medium       3277155
high         1985580
very high      15676
dtype: int64
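To make the bin boundaries concrete, here is a minimal sketch of how `pd.cut` assigns labels at the edges. The route counts are invented to hit every boundary; the edges assume a minimum of 1 and a maximum of 47, as reported by `Rutas_per_store.describe()`.

```python
import pandas as pd

# Illustrative route counts covering each bin boundary (not real data)
qty = pd.Series([1, 2, 3, 4, 5, 10, 11, 47])

# Same edges as [minval] + [2, 4, 10] + [maxval] with minval=1 and maxval=47
edges = [1, 2, 4, 10, 47]
labels = ["low", "medium", "high", "very high"]

# Bins are right-closed: (1, 2], (2, 4], (4, 10], (10, 47];
# include_lowest=True closes the first bin on the left so that 1 is not dropped
bins = pd.cut(qty, bins=edges, labels=labels, include_lowest=True)
print(pd.DataFrame({"qty": qty, "bin": bins}))
```

Without `include_lowest=True`, the stores with exactly one route would fall outside every bin and come back as NaN.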


In [64]:
#We don't need Qty_Ruta_Sak anymore
data.drop(['Qty_Ruta_SAK'],axis=1,inplace=True)

In [66]:
data.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,...,Log_Target_mean_lag3,Log_Target_mean_lag4,rate_change2,rate_change3,rate_change4,Lags_sum,pairs_mean,brand,cluster,Qty_Ruta_SAK_Bin
0,,7,1110,7,3301,15766,3894,4.0,train,1.609438,...,0.0,0,0,0,0,0.0,1.609438,MLA,18,low
5199,970784.0,10,1110,7,3301,15766,3894,,test,,...,1.609438,0,0,-1,1,1.609438,1.609438,MLA,18,low
16569,,7,1110,7,3301,15766,35452,4.0,train,1.609438,...,0.0,0,0,0,0,0.0,1.609438,MLA,10,low
23721,1027965.0,10,1110,7,3301,15766,35452,,test,,...,1.609438,0,0,-1,1,1.609438,1.609438,MLA,10,low
263587,4521987.0,11,1110,7,3301,15766,1240,,test,,...,0.0,0,0,0,0,0.0,1.608777,BIM,14,low


### Feature 8: Create a category of location based on zip code (embedded in the town table)

In [67]:
import re
import os
import time
towns = pd.read_csv("input-data/town_state.csv")

#Take the first run of digits in the town name as the zip code
def first_number(town):
    return int(re.findall(r'\d+', town)[0])

towns['ZipCode'] = towns.Town.apply(first_number)
towns['ZipCode'] = towns['ZipCode'].astype(np.uint16)

In [68]:
zipcodes_df = towns[['Agencia_ID', 'ZipCode']].copy()

In [69]:
zipcodes_df.head()

Unnamed: 0,Agencia_ID,ZipCode
0,1110,2008
1,1111,2002
2,1112,2004
3,1113,2008
4,1114,2029
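The extraction step can also be written with the vectorized string methods of pandas. This is only a sketch: the two town strings are an assumption about the "number followed by name" format of the `Town` column, chosen to match the first two codes shown above.

```python
import pandas as pd

# Hypothetical rows in the assumed "<number> <name>" format of the Town column
towns_toy = pd.DataFrame({
    "Agencia_ID": [1110, 1111],
    "Town": ["2008 AG. LAGO FILT", "2002 AG. AZCAPOTZALCO"],
})

# Capture the first run of digits in each town string
codes = towns_toy["Town"].str.extract(r"(\d+)", expand=False)

# A town without any digits would yield NaN here and make the cast below fail;
# uint16 only holds values up to 65535, which is enough for 4-digit codes
towns_toy["ZipCode"] = pd.to_numeric(codes).astype("uint16")
print(towns_toy[["Agencia_ID", "ZipCode"]])
```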


In [70]:
data = pd.merge(data, zipcodes_df, on='Agencia_ID')

In [72]:
data.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,...,Log_Target_mean_lag4,rate_change2,rate_change3,rate_change4,Lags_sum,pairs_mean,brand,cluster,Qty_Ruta_SAK_Bin,ZipCode
0,,7,1110,7,3301,15766,3894,4.0,train,1.609438,...,0,0,0,0,0.0,1.609438,MLA,18,low,2008
1,970784.0,10,1110,7,3301,15766,3894,,test,,...,0,0,-1,1,1.609438,1.609438,MLA,18,low,2008
2,,7,1110,7,3301,15766,35452,4.0,train,1.609438,...,0,0,0,0,0.0,1.609438,MLA,10,low,2008
3,1027965.0,10,1110,7,3301,15766,35452,,test,,...,0,0,-1,1,1.609438,1.609438,MLA,10,low,2008
4,4521987.0,11,1110,7,3301,15766,1240,,test,,...,0,0,0,0,0.0,1.608777,BIM,14,low,2008


In [70]:
data.apply(lambda x: len(x.unique()))

id                      6999252
Semana                        5
Agencia_ID                  552
Canal_ID                      9
Ruta_SAK                   3240
Cliente_ID               858290
Producto_ID                1727
Venta_uni_hoy               257
Venta_hoy                 52354
Dev_uni_proxima             253
Dev_proxima               10286
Demanda_uni_equil          1678
source                        2
log_target                 1678
Log_Target_mean_lag1       4878
Log_Target_mean_lag2       5407
Log_Target_mean_lag3       5369
Log_Target_mean_lag4       5342
Lags_sum                 407990
pairs_mean               686903
brand                        31
cluster                      30
Qty_Ruta_SAK_Bin              4
ZipCode                     254
dtype: int64

### Feature 9: Week of the month counter

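The idea is to have an indicator of which week of the month each row belongs to, so that the algorithm can pick up a monthly pattern if there is one. Since Semana is only a running week number, taking it modulo 4 is an approximation that treats every month as exactly four weeks.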

In [73]:
data['week_ct'] = data['Semana'].apply(lambda x: x%4)

In [74]:
data.tail()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,...,rate_change2,rate_change3,rate_change4,Lags_sum,pairs_mean,brand,cluster,Qty_Ruta_SAK_Bin,ZipCode,week_ct
8558629,2690813.0,11,1482,2,1563,850159,33267,,test,,...,0,0,0,0,2.210268,BIM,27,low,2480,3
8558630,2456689.0,10,1482,2,1563,850159,43364,,test,,...,0,0,0,0,3.044522,ORO,6,low,2480,2
8558631,,9,1482,2,2533,850159,35244,48.0,train,3.89182,...,0,0,0,0,3.89182,MLA,19,low,2480,1
8558632,,8,1482,2,1563,850159,34851,215.0,train,5.375278,...,0,0,0,0,5.488699,BIM,20,low,2480,0
8558633,4982678.0,10,1482,2,1563,850159,34793,,test,,...,0,0,0,0,4.546748,BIM,9,low,2480,2
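For reference, this is the mapping that the modulo produces for the five weeks present in `data` (7 to 11). It is a small standalone sketch, not part of the pipeline.

```python
import pandas as pd

# The five week numbers that appear in the combined train/test frame
weeks = pd.Series(range(7, 12), name="Semana")

# Same operation as the lambda above, applied to the whole column at once
week_ct = weeks % 4
print(pd.DataFrame({"Semana": weeks, "week_ct": week_ct}))
```

Weeks 7 and 11 share the same counter value, which is what lets a model treat them as the same position within a four-week cycle.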


### Feature 10: Clusters

#### 10.1 Client type clusters
Thanks to AbderRahman Sobh for the great code: https://www.kaggle.com/abbysobh/grupo-bimbo-inventory-demand/classifying-client-type-using-client-names/comments

In [75]:
client_types = pd.read_csv('./input-data/client_types.csv',header=0)

In [76]:
client_types.head()

Unnamed: 0,Cliente_ID,NombreCliente
0,0,Individual
1,1,Oxxo Store
2,2,Individual
3,3,Small Franchise
4,4,Small Franchise


In [77]:
data = data.merge(client_types.drop_duplicates(subset=['Cliente_ID']), on='Cliente_ID', how="left")

In [79]:
data.tail()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,...,rate_change3,rate_change4,Lags_sum,pairs_mean,brand,cluster,Qty_Ruta_SAK_Bin,ZipCode,week_ct,NombreCliente
8558629,2690813.0,11,1482,2,1563,850159,33267,,test,,...,0,0,0,2.210268,BIM,27,low,2480,3,Walmart
8558630,2456689.0,10,1482,2,1563,850159,43364,,test,,...,0,0,0,3.044522,ORO,6,low,2480,2,Walmart
8558631,,9,1482,2,2533,850159,35244,48.0,train,3.89182,...,0,0,0,3.89182,MLA,19,low,2480,1,Walmart
8558632,,8,1482,2,1563,850159,34851,215.0,train,5.375278,...,0,0,0,5.488699,BIM,20,low,2480,0,Walmart
8558633,4982678.0,10,1482,2,1563,850159,34793,,test,,...,0,0,0,4.546748,BIM,9,low,2480,2,Walmart
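The `drop_duplicates` call in the merge matters: if a client ID appears more than once in the lookup table, a left merge repeats every matching transaction row. The sketch below shows the effect on invented data (the IDs and values are illustrative only).

```python
import pandas as pd

# Toy transactions: client 2 has two rows, client 3 has no entry in the lookup
sales = pd.DataFrame({
    "Cliente_ID": [1, 2, 2, 3],
    "log_target": [1.6, 2.6, 1.9, 2.7],
})

# Toy lookup where client 2 is listed twice with different types
types = pd.DataFrame({
    "Cliente_ID": [1, 2, 2],
    "NombreCliente": ["Individual", "Oxxo Store", "Small Franchise"],
})

# Without de-duplication each sale of client 2 is matched twice
naive = sales.merge(types, on="Cliente_ID", how="left")

# Keeping the first entry per client preserves the number of transaction rows
deduped = sales.merge(types.drop_duplicates(subset=["Cliente_ID"]),
                      on="Cliente_ID", how="left")

print(len(sales), len(naive), len(deduped))
print(deduped)
```

With `how="left"`, clients missing from the lookup stay in the result with a NaN type instead of being dropped.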


#### 10.2 Client type clusters
Since this is going to take a lot of RAM, we are going to do this at the end after we save the dataset.

## Numerical and One-Hot Coding of Categorical variables
Since scikit-learn accepts only numerical variables, I have to convert all categories of the nominal variables into numeric types.

Let's start by coding all low-cardinality object/nominal categorical variables (brand, Qty_Ruta_SAK_Bin, NombreCliente) as numeric using ‘LabelEncoder’ from sklearn’s preprocessing module.

In [80]:
print (data.dtypes)

id                      float64
Semana                     int8
Agencia_ID               uint16
Canal_ID                   int8
Ruta_SAK                  int32
Cliente_ID                int32
Producto_ID               int32
Demanda_uni_equil       float64
source                   object
log_target              float64
Log_Target_mean_lag1    float64
Log_Target_mean_lag2    float64
Log_Target_mean_lag3    float64
Log_Target_mean_lag4    float64
rate_change2            float64
rate_change3            float64
rate_change4            float64
Lags_sum                float64
pairs_mean              float64
brand                    object
cluster                   int64
Qty_Ruta_SAK_Bin         object
ZipCode                  uint16
week_ct                   int64
NombreCliente            object
dtype: object


In [81]:
#Import library:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

var_mod = ['brand', 'Qty_Ruta_SAK_Bin', 'NombreCliente']
for i in var_mod:
    data[i] = le.fit_transform(data[i])

In [82]:
data.head()

Unnamed: 0,id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Demanda_uni_equil,source,log_target,...,rate_change3,rate_change4,Lags_sum,pairs_mean,brand,cluster,Qty_Ruta_SAK_Bin,ZipCode,week_ct,NombreCliente
0,,7,1110,7,3301,15766,3894,4.0,train,1.609438,...,0,0,0.0,1.609438,13,18,1,2008,3,11
1,970784.0,10,1110,7,3301,15766,3894,,test,,...,-1,1,1.609438,1.609438,13,18,1,2008,2,11
2,,7,1110,7,3301,15766,35452,4.0,train,1.609438,...,0,0,0.0,1.609438,13,10,1,2008,3,11
3,1027965.0,10,1110,7,3301,15766,35452,,test,,...,-1,1,1.609438,1.609438,13,10,1,2008,2,11
4,4521987.0,11,1110,7,3301,15766,1240,,test,,...,0,0,0.0,1.608777,1,14,1,2008,3,11
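Two details of `LabelEncoder` are easy to miss: it assigns codes in sorted order of the values (which is why "low" becomes 1 rather than 0 above), and a single encoder refitted in a loop only remembers the last column. A possible variation, sketched on invented rows, keeps one encoder per column so that the codes can be mapped back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two of the encoded columns (illustrative values only)
toy = pd.DataFrame({
    "brand": ["MLA", "BIM", "TR", "BIM"],
    "Qty_Ruta_SAK_Bin": ["low", "medium", "high", "very high"],
})

# One encoder per column, stored by column name
encoders = {}
for col in ["brand", "Qty_Ruta_SAK_Bin"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Codes follow the sorted order of the labels, not the order of the bins
print(list(encoders["Qty_Ruta_SAK_Bin"].classes_))

# The stored encoder can translate codes back to the original labels
print(list(encoders["brand"].inverse_transform([0, 1, 2])))
```

Because the codes are alphabetical, the encoded `Qty_Ruta_SAK_Bin` does not preserve the low < medium < high < very high ordering.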


One-hot coding refers to creating dummy variables, one for each category of a categorical variable. For example, the 'cluster' variable has 30 categories. One-hot coding will remove this variable and generate 30 new variables, each holding a binary value: 0 if the category is not present and 1 if it is.

Categorical variables are often encoded as numbers, either intentionally (for example, to mask the original values) or implicitly, so that they can be used as features in a model.

e.g. [house, car, tooth, car] becomes [0,1,2,1].

This imparts an ordinal property to the variable, i.e. house < car < tooth.

As this ordinal characteristic is usually not desired, one-hot encoding is needed to represent the distinct elements of the variable properly.

This can be done using the ‘get_dummies’ function of Pandas.
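A minimal sketch of `get_dummies` on the house/car/tooth example follows. The column name `thing` is made up for the illustration, and `dtype="uint8"` is chosen here only to keep the dummy columns small.

```python
import pandas as pd

# The example list from the text, in a hypothetical column called "thing"
things = pd.DataFrame({"thing": ["house", "car", "tooth", "car"]})

# The original column is removed and replaced by one 0/1 column per category
dummies = pd.get_dummies(things, columns=["thing"], dtype="uint8")
print(dummies)
```

Each row has exactly one 1, so no ordering between the categories is implied.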


In [74]:
#One Hot Coding: you need python 3 and 128GB ram for this
#tic()
#data = pd.get_dummies(data, columns=['Canal_ID','brand','cluster','Qty_Ruta_SAK_Bin'])
#tac()

I decided NOT to one-hot-encode, since the data would grow to an unmanageable size.

In [83]:
data.dtypes

id                      float64
Semana                     int8
Agencia_ID               uint16
Canal_ID                   int8
Ruta_SAK                  int32
Cliente_ID                int32
Producto_ID               int32
Demanda_uni_equil       float64
source                   object
log_target              float64
Log_Target_mean_lag1    float64
Log_Target_mean_lag2    float64
Log_Target_mean_lag3    float64
Log_Target_mean_lag4    float64
rate_change2            float64
rate_change3            float64
rate_change4            float64
Lags_sum                float64
pairs_mean              float64
brand                     int64
cluster                   int64
Qty_Ruta_SAK_Bin          int64
ZipCode                  uint16
week_ct                   int64
NombreCliente             int64
dtype: object

## 5. Exporting Data

In [84]:
#Divide into test and train:
import csv

#Drop unnecessary columns while splitting: note - we are dropping Demanda_uni_equil since we replaced it by log_target
#drop() without inplace returns a new frame, which avoids the SettingWithCopyWarning raised by in-place drops on a slice
train = data.loc[data['source']=="train"].drop(['source','id','Demanda_uni_equil'],axis=1)
test = data.loc[data['source']=="test"].drop(['source','Demanda_uni_equil'],axis=1)

#Export files as modified versions:
#tic()
train.to_csv("./input-data/train_modified_noclusters.csv", index=False, quoting=csv.QUOTE_NONE)
test.to_csv("./input-data/test_modified_noclusters.csv", index=False, quoting=csv.QUOTE_NONE)


#For easy H2O Flow modeling: hold out week 9 as the validation set
val = train[train["Semana"] > 8]
train = train[train["Semana"] <= 8]

train.to_csv("./input-data/train_modified_noclusters_noW9.csv", index=False, quoting=csv.QUOTE_NONE)
val.to_csv("./input-data/val_modified_noclusters_W9.csv", index=False, quoting=csv.QUOTE_NONE)
tac()


Time passed: 0hour:16min:38sec
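The split used for the H2O files is a time-based holdout: the last labelled week becomes the validation set, so the model is always evaluated on a week later than anything it was trained on. A small sketch on invented rows shows the idea.

```python
import pandas as pd

# Toy combined frame: weeks 7-9 are labelled, weeks 10-11 are the test period
toy = pd.DataFrame({
    "Semana": [7, 8, 9, 10, 11],
    "source": ["train", "train", "train", "test", "test"],
    "log_target": [1.6, 2.6, 1.9, None, None],
})

# Keep only the labelled rows and drop the helper column
labelled = toy.loc[toy["source"] == "train"].drop(columns=["source"])

# Hold out the most recent labelled week for validation
val = labelled[labelled["Semana"] > 8]
train = labelled[labelled["Semana"] <= 8]

print(train["Semana"].tolist(), val["Semana"].tolist())
```

A random split would mix weeks between the two sets and could give an overly optimistic validation score for a forecasting problem.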
