# Favorita: Favorita Grocery Sales Forecasting

## Data exploration  
For each data file, I want to see:  
(1) some samples with display(df.head(5)) or display(df.tail(5))  
(2) A summary of this DataFrame with df.describe()  
(3) Check if there is missing data with df.isnull().values.any()  
(4) The unique values of each variable with display(df['column_name'].unique())
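
The four checks above can be bundled into one helper so that every file gets the same treatment. This is only a minimal sketch: `explore` and the toy DataFrame are hypothetical names introduced here for illustration, and it uses `df.nunique()` (a count per column) rather than printing every unique value.

```python
import pandas as pd

def explore(df, name="df", n=5):
    """Run the four checks listed above and return the results as a dict."""
    report = {
        "head": df.head(n),                             # (1) sample rows
        "describe": df.describe(),                      # (2) summary statistics
        "has_missing": bool(df.isnull().values.any()),  # (3) any missing value?
        "n_unique": df.nunique(),                       # (4) unique-value count per column
    }
    print("{} has {} data points with {} variables each.".format(name, *df.shape))
    return report

# Toy frame standing in for one of the CSV files (values are made up)
toy = pd.DataFrame({"store_nbr": [1, 1, 2], "unit_sales": [7.0, None, 2.0]})
report = explore(toy, name="toy")
print(report["has_missing"])
print(report["n_unique"])
```

In a notebook, each entry of the returned dict can be passed to `display()` to get the usual rich table output.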

In [1]:
# Import libraries necessary for this project
import os.path
import pickle
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from IPython.display import display
import matplotlib.pyplot as plt

# Pretty display for notebooks
%matplotlib inline
# loaddata() loads the two large CSV files, train.csv and test.csv, with compact dtypes
def loaddata(filename, nrows=None):
    # Narrow dtypes keep memory use down; dates stay as plain strings
    types = {'id': 'int32', 'date': 'str', 'item_nbr': 'int32', 'store_nbr': 'int16',
             'unit_sales': 'float32', 'onpromotion': 'float64'}
    # Pass nrows through so a smaller sample can be loaded when needed
    data = pd.read_csv(filename, dtype=types, nrows=nrows)
    return data
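
The explicit dtypes in `loaddata` matter because train.csv is very large. As a rough illustration, the sketch below compares the memory taken by the numeric columns with pandas' default dtypes against the narrower ones. The in-memory CSV text and its values are made up for the example; only the column names and dtypes come from the function above.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (values are made up for illustration)
csv_text = (
    "id,date,store_nbr,item_nbr,unit_sales\n"
    "0,2013-01-01,25,103665,7.0\n"
    "1,2013-01-01,25,105574,1.0\n"
)

# Same numeric dtypes as in loaddata()
narrow = {'id': 'int32', 'store_nbr': 'int16', 'item_nbr': 'int32', 'unit_sales': 'float32'}

default_df = pd.read_csv(io.StringIO(csv_text))               # pandas picks int64/float64
narrow_df = pd.read_csv(io.StringIO(csv_text), dtype=narrow)  # explicit, smaller dtypes

numeric_cols = list(narrow)
default_bytes = default_df[numeric_cols].memory_usage(index=False).sum()
narrow_bytes = narrow_df[numeric_cols].memory_usage(index=False).sum()
print("default:", default_bytes, "bytes; narrow:", narrow_bytes, "bytes")
```

The saving per row is small, but it scales with the roughly 125 million rows of the real file.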

### 1. Train.csv  
There are 125497040 rows in the training data with 6 variables each.  
The 6 variables are:  
1. id: row identifier; it carries no information for model training and will be dropped
2. date: from 2013-01-01 to 2017-08-15
3. store_nbr: consecutive integers from 1 to 54
4. item_nbr: item id, non-consecutive integers
5. unit_sales: continuous float values with min=-15372 and max=89440
6. onpromotion: boolean stored as 0/1, with missing entries  

Missing entries exist only in the column 'onpromotion' (about 21.6 million rows, going by the non-null count in the summary below).
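
A per-column count makes it easier to see where the missing entries sit than a single `isnull().values.any()` flag, and the same pass can count negative `unit_sales` rows. The sketch below runs on a toy DataFrame with made-up values; `toy_train` is a hypothetical stand-in for `train_data`.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like train.csv (values are illustrative, not real data)
toy_train = pd.DataFrame({
    "store_nbr": [25, 25, 1, 1],
    "unit_sales": [7.0, -2.0, 1.0, 3.5],
    "onpromotion": [np.nan, np.nan, 1.0, 0.0],
})

missing_per_col = toy_train.isnull().sum()              # NaN count for each column
cols_with_missing = missing_per_col[missing_per_col > 0].index.tolist()
n_negative = int((toy_train["unit_sales"] < 0).sum())   # rows with negative sales

print(missing_per_col)
print(cols_with_missing, n_negative)
```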

In [2]:
# Load the training dataset
train_data = loaddata('input/train.csv')
print("Training dataset has {} data points with {} variables each.".format(*train_data.shape))

Training dataset has 125497040 data points with 6 variables each.


In [3]:
display(train_data.describe())

Unnamed: 0,id,store_nbr,item_nbr,unit_sales,onpromotion
count,125497000.0,125497000.0,125497000.0,125497000.0,103870200.0
mean,62748520.0,27.46458,972769.2,5.319669,0.07549226
std,36227880.0,16.33051,520533.6,23.06714,0.264184
min,0.0,1.0,96995.0,-15372.0,0.0
25%,31374260.0,12.0,522383.0,2.0,0.0
50%,62748520.0,28.0,959500.0,4.0,0.0
75%,94122780.0,43.0,1354380.0,9.0,0.0
max,125497000.0,54.0,2127114.0,89440.0,1.0


In [4]:
for col in train_data.columns:
    print(col, train_data[col].unique(), sep='\n')

id
[        0         1         2 ..., 125497037 125497038 125497039]
date
['2013-01-01' '2013-01-02' '2013-01-03' ..., '2017-08-13' '2017-08-14'
 '2017-08-15']
store_nbr
[25  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 23 24 26 27 28
 30 31 32 33 34 35 37 38 39 40 41 43 44 45 46 47 48 49 50 51 54 36 53 20 29
 21 42 22 52]
item_nbr
[ 103665  105574  105575 ..., 2126944 2123839 2011451]
unit_sales
[   7.            1.            2.         ...,  247.42999268  225.19599915
  114.91699982]
onpromotion
[ nan   1.   0.]


In [5]:
train_data.drop('onpromotion', axis=1).isnull().values.any()

False

### 2. holidays_events.csv  
There are 350 rows in holidays_events with 6 variables each.  
Variables are listed below:
1. date: 312 unique dates from 2012-03-02 to 2017-12-26
2. type: ['Holiday', 'Transfer', 'Additional', 'Bridge', 'Work Day', 'Event']
3. locale: ['Local', 'Regional', 'National']
4. locale_name: ['Manta', 'Cotopaxi', 'Cuenca', 'Libertad', 'Riobamba', 'Puyo','Guaranda', 'Imbabura', 'Latacunga', 'Machala', 'Santo Domingo','El Carmen', 'Cayambe', 'Esmeraldas', 'Ecuador', 'Ambato', 'Ibarra','Quevedo', 'Santo Domingo de los Tsachilas', 'Santa Elena', 'Quito','Loja', 'Salinas', 'Guayaquil']
5. description: 103 unique values (free-text names such as 'Carnaval'; their meaning is not yet clear to me)
6. transferred: [False,  True]  

There are no missing entries in this dataset.
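
The date range and the number of unique dates can be read off directly once the `date` column is parsed. The following is a small sketch on a toy DataFrame; the rows, including the `type` assigned to each date, are made up for illustration and `toy_holidays` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for holidays_events.csv (rows are made up for illustration)
toy_holidays = pd.DataFrame({
    "date": ["2012-03-02", "2014-06-25", "2014-06-25", "2017-12-26"],
    "type": ["Holiday", "Holiday", "Additional", "Additional"],
})

# Parse the ISO-formatted strings into timestamps
dates = pd.to_datetime(toy_holidays["date"], format="%Y-%m-%d")
first, last = dates.min(), dates.max()
n_unique_dates = dates.nunique()                       # several events can share one date
type_counts = toy_holidays["type"].value_counts()      # rows per holiday type

print(first.date(), last.date(), n_unique_dates)
print(type_counts)
```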

In [6]:
holidays_events = pd.read_csv("input/holidays_events.csv")
print("Holidays_events has {} data points with {} variables each.".format(*holidays_events.shape))

display(holidays_events.describe())

for col in holidays_events.columns:
    print(col, holidays_events[col].unique(), sep='\n')

holidays_events.isnull().values.any()

Holidays_events has 350 data points with 6 variables each.


Unnamed: 0,date,type,locale,locale_name,description,transferred
count,350,350,350,350,350,350
unique,312,6,3,24,103,2
top,2014-06-25,Holiday,National,Ecuador,Carnaval,False
freq,4,221,174,174,10,338


date
['2012-03-02' '2012-04-01' '2012-04-12' '2012-04-14' '2012-04-21'
 '2012-05-12' '2012-06-23' '2012-06-25' '2012-07-03' '2012-07-23'
 '2012-08-05' '2012-08-10' '2012-08-15' '2012-08-24' '2012-09-28'
 '2012-10-07' '2012-10-09' '2012-10-12' '2012-11-02' '2012-11-03'
 '2012-11-06' '2012-11-07' '2012-11-10' '2012-11-11' '2012-11-12'
 '2012-12-05' '2012-12-06' '2012-12-08' '2012-12-21' '2012-12-22'
 '2012-12-23' '2012-12-24' '2012-12-25' '2012-12-26' '2012-12-31'
 '2013-01-01' '2013-01-05' '2013-01-12' '2013-02-11' '2013-02-12'
 '2013-03-02' '2013-04-01' '2013-04-12' '2013-04-14' '2013-04-21'
 '2013-04-29' '2013-05-01' '2013-05-11' '2013-05-12' '2013-05-24'
 '2013-06-23' '2013-06-25' '2013-07-03' '2013-07-23' '2013-07-24'
 '2013-07-25' '2013-08-05' '2013-08-10' '2013-08-15' '2013-08-24'
 '2013-09-28' '2013-10-07' '2013-10-09' '2013-10-11' '2013-11-02'
 '2013-11-03' '2013-11-06' '2013-11-07' '2013-11-10' '2013-11-11'
 '2013-11-12' '2013-12-05' '2013-12-06' '2013-12-08' '2013-12-21'
 '201

False

### 3. stores.csv  
There are 54 rows in stores with 5 variables each.  
Variables are listed below:
1. store_nbr: [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
2. city: ['Quito', 'Santo Domingo', 'Cayambe', 'Latacunga', 'Riobamba', 'Ibarra', 'Guaranda', 'Puyo', 'Ambato', 'Guayaquil', 'Salinas', 'Daule', 'Babahoyo', 'Quevedo', 'Playas', 'Libertad', 'Cuenca', 'Loja', 'Machala', 'Esmeraldas', 'Manta', 'El Carmen']
3. state: ['Pichincha', 'Santo Domingo de los Tsachilas', 'Cotopaxi', 'Chimborazo', 'Imbabura', 'Bolivar', 'Pastaza', 'Tungurahua', 'Guayas', 'Santa Elena', 'Los Rios', 'Azuay', 'Loja', 'El Oro', 'Esmeraldas', 'Manabi']
4. type: ['D', 'B', 'C', 'E', 'A']
5. cluster: [13,  8,  9,  4,  6, 15,  7,  3, 12, 16,  1, 10,  2,  5, 11, 14, 17]

There are no missing entries in this dataset.

In [7]:
stores = pd.read_csv("input/stores.csv")
print("Stores has {} data points with {} variables each.".format(*stores.shape))

display(stores.describe())

for col in stores.columns:
    print(col, stores[col].unique(), sep='\n')

stores.isnull().values.any()

Stores has 54 data points with 5 variables each.


Unnamed: 0,store_nbr,cluster
count,54.0,54.0
mean,27.5,8.481481
std,15.732133,4.693395
min,1.0,1.0
25%,14.25,4.0
50%,27.5,8.5
75%,40.75,13.0
max,54.0,17.0


store_nbr
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 51 52 53 54]
city
['Quito' 'Santo Domingo' 'Cayambe' 'Latacunga' 'Riobamba' 'Ibarra'
 'Guaranda' 'Puyo' 'Ambato' 'Guayaquil' 'Salinas' 'Daule' 'Babahoyo'
 'Quevedo' 'Playas' 'Libertad' 'Cuenca' 'Loja' 'Machala' 'Esmeraldas'
 'Manta' 'El Carmen']
state
['Pichincha' 'Santo Domingo de los Tsachilas' 'Cotopaxi' 'Chimborazo'
 'Imbabura' 'Bolivar' 'Pastaza' 'Tungurahua' 'Guayas' 'Santa Elena'
 'Los Rios' 'Azuay' 'Loja' 'El Oro' 'Esmeraldas' 'Manabi']
type
['D' 'B' 'C' 'E' 'A']
cluster
[13  8  9  4  6 15  7  3 12 16  1 10  2  5 11 14 17]


False

### 4. oil.csv  
There are 1218 rows in oil with 2 variables each.  
Variables are listed below:
1. date: from 2013-01-01 to 2017-08-31
2. dcoilwtico: continuous values from 26.19 to 110.62

dcoilwtico has 43 missing entries (1218 rows, 1175 non-null values).
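
Missing oil prices have to be handled before the series can be used as a feature. One possible option, not necessarily the one used later in this project, is to carry the last known price forward. The sketch below uses a toy series with illustrative values; `toy_oil` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for oil.csv (values are illustrative)
toy_oil = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"],
    "dcoilwtico": [np.nan, 93.14, np.nan, 93.12],
})

n_missing = int(toy_oil["dcoilwtico"].isnull().sum())

# Forward fill carries the last known price forward; the back fill only
# covers a leading gap such as the NaN on the first row.
filled = toy_oil["dcoilwtico"].ffill().bfill()
print(n_missing, filled.tolist())
```

Interpolation would be an alternative; forward fill is shown here only because it never uses a future price to fill a past gap, apart from the leading row.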

In [8]:
oil = pd.read_csv("input/oil.csv")
print("Oil has {} data points with {} variables each.".format(*oil.shape))

display(oil.describe())

for col in oil.columns:
    print(col, oil[col].unique(), sep='\n')

oil.isnull().values.any()

Oil has 1218 data points with 2 variables each.


Unnamed: 0,dcoilwtico
count,1175.0
mean,67.714366
std,25.630476
min,26.19
25%,46.405
50%,53.19
75%,95.66
max,110.62


date
['2013-01-01' '2013-01-02' '2013-01-03' ..., '2017-08-29' '2017-08-30'
 '2017-08-31']
dcoilwtico
[    nan   93.14   92.97   93.12   93.2    93.21   93.08   93.81   93.6
   94.27   93.26   94.28   95.49   95.61   96.09   95.06   95.35   95.15
   95.95   97.62   97.98   97.65   97.46   96.21   96.68   96.44   95.84
   95.71   97.01   97.48   97.03   97.3    96.69   94.92   92.79   92.74
   92.63   92.84   92.03   90.71   90.13   90.88   90.47   91.53   92.01
   92.07   92.44   92.47   93.03   93.49   93.71   92.46   93.41   94.55
   95.99   96.53   97.24   97.1    97.23   95.02   92.76   93.36   94.18
   94.59   93.44   91.23   88.75   88.73   86.65   87.83   88.04   88.81
   89.21   91.07   93.27   94.09   93.22   90.74   93.7    95.25   95.8
   95.28   96.24   95.81   94.76   93.96   93.95   94.85   95.72   96.29
   95.55   93.98   94.12   93.84   94.65   93.13   93.57   91.93   93.66
   94.71   96.11   95.82   95.5    95.98   96.66   97.83   97.86   98.46
   98.24   94.89   95.07

True

### 5. transactions.csv

There are 83488 rows in transactions with 3 variables each.  
Variables are listed below:
1. date: from 2013-01-01 to 2017-08-15
2. store_nbr: 54 store numbers
3. transactions: integers between 5 and 8359

There are no missing entries in this dataset.

In [9]:
transactions = pd.read_csv("input/transactions.csv")
print("transactions has {} data points with {} variables each.".format(*transactions.shape))

display(transactions.describe())

for col in transactions.columns:
    print(col, transactions[col].unique(), sep='\n')

transactions.isnull().values.any()

transactions has 83488 data points with 3 variables each.


Unnamed: 0,store_nbr,transactions
count,83488.0,83488.0
mean,26.939237,1694.602158
std,15.608204,963.286644
min,1.0,5.0
25%,13.0,1046.0
50%,27.0,1393.0
75%,40.0,2079.0
max,54.0,8359.0


date
['2013-01-01' '2013-01-02' '2013-01-03' ..., '2017-08-13' '2017-08-14'
 '2017-08-15']
store_nbr
[25  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 23 24 26 27 28
 30 31 32 33 34 35 37 38 39 40 41 43 44 45 46 47 48 49 50 51 54 36 53 20 29
 21 42 22 52]
transactions
[ 770 2111 2358 ..., 4553 4400 4392]


False

### 6. items.csv  
There are 4100 rows in items with 4 variables each.  
Variables are listed below:
1. item_nbr: 4100 discrete values
2. family: ['GROCERY I' 'CLEANING' 'BREAD/BAKERY' 'DELI' 'POULTRY' 'EGGS'
 'PERSONAL CARE' 'LINGERIE' 'BEVERAGES' 'AUTOMOTIVE' 'DAIRY' 'GROCERY II'
 'MEATS' 'FROZEN FOODS' 'HOME APPLIANCES' 'SEAFOOD' 'PREPARED FOODS'
 'LIQUOR,WINE,BEER' 'BEAUTY' 'HARDWARE' 'LAWN AND GARDEN' 'PRODUCE'
 'HOME AND KITCHEN II' 'HOME AND KITCHEN I' 'MAGAZINES' 'HOME CARE'
 'PET SUPPLIES' 'BABY CARE' 'SCHOOL AND OFFICE SUPPLIES'
 'PLAYERS AND ELECTRONICS' 'CELEBRATION' 'LADIESWEAR' 'BOOKS']
3. class: 337 discrete values
4. perishable: 0 and 1

There are no missing entries in this dataset.
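
Counts such as "4100 discrete values" or "337 discrete values" can be obtained in one call with `nunique()`, and the mean of a 0/1 column such as `perishable` is the share of rows equal to 1. The sketch below uses a toy DataFrame whose rows are made up for illustration; `toy_items` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for items.csv (row contents are made up for illustration)
toy_items = pd.DataFrame({
    "item_nbr": [96995, 99197, 103501, 103520],
    "family": ["GROCERY I", "GROCERY I", "CLEANING", "GROCERY I"],
    "class": [1093, 1067, 3008, 1028],
    "perishable": [0, 0, 0, 1],
})

unique_counts = toy_items.nunique()                   # unique values per column
perishable_share = toy_items["perishable"].mean()     # fraction of perishable items
family_sizes = toy_items["family"].value_counts()     # number of items per family

print(unique_counts)
print(perishable_share)
print(family_sizes)
```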

In [10]:
items = pd.read_csv("input/items.csv")
print("items has {} data points with {} variables each.".format(*items.shape))

display(items.describe())

for col in items.columns:
    print(col, items[col].unique(), sep='\n')

items.isnull().values.any()

items has 4100 data points with 4 variables each.


Unnamed: 0,item_nbr,class,perishable
count,4100.0,4100.0,4100.0
mean,1251436.0,2169.65,0.240488
std,587687.2,1484.9109,0.427432
min,96995.0,1002.0,0.0
25%,818110.8,1068.0,0.0
50%,1306198.0,2004.0,0.0
75%,1904918.0,2990.5,0.0
max,2134244.0,7780.0,1.0


item_nbr
[  96995   99197  103501 ..., 2132957 2134058 2134244]
family
['GROCERY I' 'CLEANING' 'BREAD/BAKERY' 'DELI' 'POULTRY' 'EGGS'
 'PERSONAL CARE' 'LINGERIE' 'BEVERAGES' 'AUTOMOTIVE' 'DAIRY' 'GROCERY II'
 'MEATS' 'FROZEN FOODS' 'HOME APPLIANCES' 'SEAFOOD' 'PREPARED FOODS'
 'LIQUOR,WINE,BEER' 'BEAUTY' 'HARDWARE' 'LAWN AND GARDEN' 'PRODUCE'
 'HOME AND KITCHEN II' 'HOME AND KITCHEN I' 'MAGAZINES' 'HOME CARE'
 'PET SUPPLIES' 'BABY CARE' 'SCHOOL AND OFFICE SUPPLIES'
 'PLAYERS AND ELECTRONICS' 'CELEBRATION' 'LADIESWEAR' 'BOOKS']
class
[1093 1067 3008 1028 2712 1045 1034 1044 1092 1032 1030 1075 2636 2644 3044
 1004 2416 2502 1062 3024 1072 1016 4126 3034 1014 1040 1084 7034 1056 3090
 3026 1042 1122 6810 2124 3020 2114 1026 2112 1096 2704 2708 1013 3038 1048
 2116 3032 1124 1066 2718 1236 1080 3004 1058 6824 1136 3016 1006 2302 1010
 2632 2226 2412 1078 1074 1036 3046 3022 3018 1035 2104 1086 1039 6155 2806
 1120 1002 2218 2220 1060 2986 2720 3014 6806 4114 1087 3015 2702 3006 2752
 2652

False

### 7. test.csv

In [11]:
test_data = pd.read_csv("input/test.csv")
print("Favorita grocery sales forecasting testing data has {} samples with {} features each.".format(*test_data.shape))
display(test_data.head(5))

Favorita grocery sales forecasting testing data has 3370464 samples with 5 features each.


Unnamed: 0,id,date,store_nbr,item_nbr,onpromotion
0,125497040,2017-08-16,1,96995,False
1,125497041,2017-08-16,1,99197,False
2,125497042,2017-08-16,1,103501,False
3,125497043,2017-08-16,1,103520,False
4,125497044,2017-08-16,1,103665,False


### 8. sample_submission.csv

In [12]:
sample_submission = pd.read_csv("input/sample_submission.csv")
display(sample_submission.head(n=5))
sample_submission.dtypes

Unnamed: 0,id,unit_sales
0,125497040,0
1,125497041,0
2,125497042,0
3,125497043,0
4,125497044,0


id            int64
unit_sales    int64
dtype: object