# Favorita: Favorita Grocery Sales Forecasting

## Data exploration  
For each data file, I want to see:  
(1) Some samples, with display(df.head(5)) or display(df.tail(5))  
(2) A summary of the DataFrame, with df.describe()  
(3) The data types, with print(df.dtypes)  
(4) Whether any data is missing, with df.isnull().values.any()  
(5) The unique values of each variable, with display(df['column_name'].unique())
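
The five checks above can be bundled into one helper so that every file is inspected the same way. This is only a minimal sketch: `explore` and the toy DataFrame are hypothetical names introduced for illustration, and `print` is used instead of `display` so the snippet also runs outside a notebook.

```python
import pandas as pd

def explore(df, n=5):
    """Run the five exploration steps on one DataFrame (hypothetical helper)."""
    print(df.head(n))                              # (1) sample rows
    print(df.describe())                           # (2) summary statistics
    print(df.dtypes)                               # (3) data type of each column
    has_missing = bool(df.isnull().values.any())   # (4) is any entry missing?
    print("missing data:", has_missing)
    for col in df.columns:                         # (5) unique values per column
        print(col, df[col].unique(), sep='\n')
    return has_missing

# Toy frame standing in for one of the csv files (values are made up)
toy = pd.DataFrame({'store_nbr': [1, 1, 2], 'onpromotion': [0.0, None, 1.0]})
toy_has_missing = explore(toy)
```

Returning the result of step (4) makes it easy to collect the missing-data flag for all files in one place.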

In [1]:
# Import libraries necessary for this project
import os.path
import pickle
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from IPython.display import display
import matplotlib.pyplot as plt

# Pretty display for notebooks
%matplotlib inline
# loaddata() loads the two large csv files, train.csv and test.csv.
# Explicit dtypes keep memory use down; nrows (optional) limits how many rows are read.
def loaddata(filename, nrows=None):
    types = {'id': 'int32', 'date': 'str', 'item_nbr': 'int32', 'store_nbr': 'int16',
             'unit_sales': 'float32', 'onpromotion': 'float64'}
    # onpromotion stays float64 because the column contains missing entries (NaN)
    data = pd.read_csv(filename, dtype=types, nrows=nrows)
    return data
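
To see what the explicit dtypes change, the sketch below reads a tiny in-memory stand-in for train.csv twice, once with pandas' defaults and once with a dtype mapping and a row limit. The three rows are made up for illustration; only the column names are taken from the real file.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (made-up rows, real column names)
csv_text = """id,date,store_nbr,item_nbr,unit_sales,onpromotion
0,2013-01-01,25,103665,7.0,
1,2013-01-01,25,105574,1.0,
2,2013-01-02,1,105575,2.0,1
"""

types = {'id': 'int32', 'item_nbr': 'int32', 'store_nbr': 'int16',
         'unit_sales': 'float32', 'onpromotion': 'float64'}

# Default parsing: integer columns become int64, float columns float64
default = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes shrink the numeric columns; nrows=2 reads only the first two rows
compact = pd.read_csv(io.StringIO(csv_text), dtype=types, nrows=2)

print(default.dtypes)
print(compact.dtypes)
print(len(compact))
```

Reading only a few rows this way is a cheap method to check the column types before loading the full file.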

### 1. train.csv  
There are 125497040 rows in the training data, with 6 variables each.  
The 6 variables are:  
1. id: meaningless for model training, so it will be dropped
2. date: from 2013-01-01 to 2017-08-15
3. store_nbr: consecutive integers from 1 to 54
4. item_nbr: item id, non-consecutive integers
5. unit_sales: continuous float with min=-15372 and max=89440
6. onpromotion: boolean stored as 0/1, with missing entries  

Missing entries only exist in the column 'onpromotion'.
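
A per-column count shows where the missing entries sit and how large a share of the column they make up, and the same pattern counts rows with negative unit_sales. The frame below is a made-up toy example, not the real training data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data (values are made up)
toy_train = pd.DataFrame({
    'store_nbr': [25, 25, 1, 2],
    'unit_sales': [7.0, -3.0, 2.0, 1.0],
    'onpromotion': [np.nan, np.nan, 0.0, 1.0],
})

# Number of missing entries in each column
missing_per_column = toy_train.isnull().sum()
print(missing_per_column)

# Fraction of rows where onpromotion is missing
missing_share = toy_train['onpromotion'].isnull().mean()
print(missing_share)

# Number of rows with negative unit_sales
negative_rows = int((toy_train['unit_sales'] < 0).sum())
print(negative_rows)
```

On the real data, `train_data.isnull().sum()` would give the same per-column breakdown in a single pass.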

In [2]:
# Load the training dataset
train_data = loaddata('input/train.csv')
print("Training dataset has {} data points with {} variables each.".format(*train_data.shape))

Training dataset has 125497040 data points with 6 variables each.


In [3]:
display(train_data.describe())

| | id | store_nbr | item_nbr | unit_sales | onpromotion |
|---|---|---|---|---|---|
| count | 125497000.0 | 125497000.0 | 125497000.0 | 125497000.0 | 103870200.0 |
| mean | 62748520.0 | 27.46458 | 972769.2 | 5.319669 | 0.07549226 |
| std | 36227880.0 | 16.33051 | 520533.6 | 23.06714 | 0.264184 |
| min | 0.0 | 1.0 | 96995.0 | -15372.0 | 0.0 |
| 25% | 31374260.0 | 12.0 | 522383.0 | 2.0 | 0.0 |
| 50% | 62748520.0 | 28.0 | 959500.0 | 4.0 | 0.0 |
| 75% | 94122780.0 | 43.0 | 1354380.0 | 9.0 | 0.0 |
| max | 125497000.0 | 54.0 | 2127114.0 | 89440.0 | 1.0 |


In [4]:
print(train_data.dtypes)

id               int32
date            object
store_nbr        int16
item_nbr         int32
unit_sales     float32
onpromotion    float64
dtype: object


In [None]:
for col in train_data.columns:
    print(col, train_data[col].unique(), sep='\n')

id
[        0         1         2 ..., 125497037 125497038 125497039]
date
['2013-01-01' '2013-01-02' '2013-01-03' ..., '2017-08-13' '2017-08-14'
 '2017-08-15']
store_nbr
[25  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 23 24 26 27 28
 30 31 32 33 34 35 37 38 39 40 41 43 44 45 46 47 48 49 50 51 54 36 53 20 29
 21 42 22 52]
item_nbr
[ 103665  105574  105575 ..., 2126944 2123839 2011451]
unit_sales
[   7.            1.            2.         ...,  247.42999268  225.19599915
  114.91699982]
onpromotion
[ nan   1.   0.]
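
unique() lists the distinct values themselves. When only the number of distinct values is needed, nunique() returns it directly. A small sketch on made-up data (note that missing entries are not counted unless dropna=False is passed):

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up)
toy = pd.DataFrame({
    'store_nbr': [25, 1, 1, 2],
    'onpromotion': [np.nan, 1.0, 0.0, 0.0],
})

# Distinct values per column, ignoring NaN
counts = toy.nunique()
print(counts)

# Same count, but treating NaN as a value of its own
counts_with_nan = toy.nunique(dropna=False)
print(counts_with_nan)

# Count for a single column
n_stores = toy['store_nbr'].nunique()
```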


In [None]:
train_data.drop('onpromotion', axis=1).isnull().values.any()

### 2. holidays_events.csv  
There are 350 rows in holidays_events, with 6 variables each.  
Variables are listed below:
1. date: 312 unique dates from 2012-03-02 to 2017-12-26
2. type: ['Holiday', 'Transfer', 'Additional', 'Bridge', 'Work Day', 'Event']
3. locale: ['Local', 'Regional', 'National']
4. locale_name: ['Manta', 'Cotopaxi', 'Cuenca', 'Libertad', 'Riobamba', 'Puyo','Guaranda', 'Imbabura', 'Latacunga', 'Machala', 'Santo Domingo','El Carmen', 'Cayambe', 'Esmeraldas', 'Ecuador', 'Ambato', 'Ibarra','Quevedo', 'Santo Domingo de los Tsachilas', 'Santa Elena', 'Quito','Loja', 'Salinas', 'Guayaquil']
5. description: 103 unique entries (I don't understand these yet)
6. transferred: [False,  True]  

There are no missing entries in this dataset.
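
Since the file has 350 rows but only 312 unique dates, some dates must appear more than once. For categorical columns such as type and locale, value_counts() shows how often each value occurs, and duplicated() counts the repeated dates. The rows below are made up for illustration and only reuse the column names and category labels listed above.

```python
import pandas as pd

# Toy stand-in for holidays_events (rows are made up)
toy_holidays = pd.DataFrame({
    'date': ['2012-03-02', '2012-04-01', '2012-04-12', '2012-04-12'],
    'type': ['Holiday', 'Holiday', 'Transfer', 'Event'],
    'locale': ['Local', 'Regional', 'National', 'National'],
    'transferred': [False, True, False, False],
})

# Frequency of each category in the 'type' column
type_counts = toy_holidays['type'].value_counts()
print(type_counts)

# Number of rows whose date already appeared in an earlier row
repeated_dates = int(toy_holidays['date'].duplicated().sum())
print(repeated_dates)
```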

In [None]:
holidays_events = pd.read_csv("input/holidays_events.csv")
print("Holidays_events has {} data points with {} variables each.".format(*holidays_events.shape))

display(holidays_events.describe())
print(holidays_events.dtypes)

for col in holidays_events.columns:
    print(col, holidays_events[col].unique(), sep='\n')

holidays_events.isnull().values.any()

### 3. stores.csv  
There are 54 rows in stores, with 5 variables each.  
Variables are listed below:
1. store_nbr: [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
2. city: ['Quito', 'Santo Domingo', 'Cayambe', 'Latacunga', 'Riobamba', 'Ibarra', 'Guaranda', 'Puyo', 'Ambato', 'Guayaquil', 'Salinas', 'Daule', 'Babahoyo', 'Quevedo', 'Playas', 'Libertad', 'Cuenca', 'Loja', 'Machala', 'Esmeraldas', 'Manta', 'El Carmen']
3. state: ['Pichincha', 'Santo Domingo de los Tsachilas', 'Cotopaxi', 'Chimborazo', 'Imbabura', 'Bolivar', 'Pastaza', 'Tungurahua', 'Guayas', 'Santa Elena', 'Los Rios', 'Azuay', 'Loja', 'El Oro', 'Esmeraldas', 'Manabi']
4. type: ['D', 'B', 'C', 'E', 'A']
5. cluster: [13,  8,  9,  4,  6, 15,  7,  3, 12, 16,  1, 10,  2,  5, 11, 14, 17]

There are no missing entries in this dataset.

In [None]:
stores = pd.read_csv("input/stores.csv")
print("Stores has {} data points with {} variables each.".format(*stores.shape))

display(stores.describe())
print(stores.dtypes)

for col in stores.columns:
    print(col, stores[col].unique(), sep='\n')

stores.isnull().values.any()

### 4. oil.csv  
There are 1218 rows in oil, with 2 variables each.  
Variables are listed below:
1. date: from 2013-01-01 to 2017-08-31
2. dcoilwtico: continuous values from 26.19 to 110.62

There is 1 missing entry in this dataset.
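
isnull().values.any() only says whether something is missing, not how much. The sketch below counts the missing prices and also lists calendar days that have no row at all, which matters here because 1218 rows are fewer than the number of calendar days between 2013-01-01 and 2017-08-31. The toy prices and dates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for oil.csv (values are made up)
toy_oil = pd.DataFrame({
    'date': ['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-07'],
    'dcoilwtico': [np.nan, 93.14, 92.97, 93.12, 93.20],
})
toy_oil['date'] = pd.to_datetime(toy_oil['date'])

# Number of rows with a missing price
n_missing = int(toy_oil['dcoilwtico'].isnull().sum())
print(n_missing)

# Calendar days between the first and last date that have no row
full_range = pd.date_range(toy_oil['date'].min(), toy_oil['date'].max(), freq='D')
absent_dates = full_range.difference(pd.DatetimeIndex(toy_oil['date']))
print(absent_dates)
```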

In [None]:
oil = pd.read_csv("input/oil.csv")
print("Oil has {} data points with {} variables each.".format(*oil.shape))

display(oil.describe())
print(oil.dtypes)

for col in oil.columns:
    print(col, oil[col].unique(), sep='\n')

oil.isnull().values.any()

### 5. transactions.csv

There are 83488 rows in transactions, with 3 variables each.  
Variables are listed below:
1. date: from 2013-01-01 to 2017-08-15
2. store_nbr: 54 store numbers
3. transactions: integers between 5 and 8358

There is 1 missing entry in this dataset.
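
A null check only covers rows that exist. With 54 stores over the date range above, 83488 rows are fewer than one row per store per day, so some store/date pairs have no row at all. One way to surface them, sketched on made-up data, is to pivot the table into a date-by-store grid and count the empty cells.

```python
import pandas as pd

# Toy stand-in for transactions.csv (values are made up)
toy_tx = pd.DataFrame({
    'date': ['2013-01-01', '2013-01-02', '2013-01-02', '2013-01-03', '2013-01-03'],
    'store_nbr': [25, 1, 25, 1, 25],
    'transactions': [770, 2111, 1038, 1833, 887],
})

# Number of rows recorded for each store
rows_per_store = toy_tx.groupby('store_nbr').size()
print(rows_per_store)

# Date x store grid; NaN marks a store/date pair with no row in the file
grid = toy_tx.pivot(index='date', columns='store_nbr', values='transactions')
absent_pairs = int(grid.isnull().sum().sum())
print(absent_pairs)
```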

In [None]:
transactions = pd.read_csv("input/transactions.csv")
print("transactions has {} data points with {} variables each.".format(*transactions.shape))

display(transactions.describe())
print(transactions.dtypes)

for col in transactions.columns:
    print(col, transactions[col].unique(), sep='\n')

transactions.isnull().values.any()

### 6. items.csv  
There are 4100 rows in items, with 4 variables each.  
Variables are listed below:
1. item_nbr: 4100 discrete values
2. family: ['GROCERY I' 'CLEANING' 'BREAD/BAKERY' 'DELI' 'POULTRY' 'EGGS'
 'PERSONAL CARE' 'LINGERIE' 'BEVERAGES' 'AUTOMOTIVE' 'DAIRY' 'GROCERY II'
 'MEATS' 'FROZEN FOODS' 'HOME APPLIANCES' 'SEAFOOD' 'PREPARED FOODS'
 'LIQUOR,WINE,BEER' 'BEAUTY' 'HARDWARE' 'LAWN AND GARDEN' 'PRODUCE'
 'HOME AND KITCHEN II' 'HOME AND KITCHEN I' 'MAGAZINES' 'HOME CARE'
 'PET SUPPLIES' 'BABY CARE' 'SCHOOL AND OFFICE SUPPLIES'
 'PLAYERS AND ELECTRONICS' 'CELEBRATION' 'LADIESWEAR' 'BOOKS']
3. class: 337 discrete values
4. perishable: 0 and 1

There are no missing entries in this dataset.

In [None]:
items = pd.read_csv("input/items.csv")
print("items has {} data points with {} variables each.".format(*items.shape))

display(items.describe())
print(items.dtypes)

for col in items.columns:
    print(col, items[col].unique(), sep='\n')

items.isnull().values.any()

### 7. test.csv

In [None]:
test_data = pd.read_csv("input/test.csv")
print("Favorita grocery sales forecasting testing data has {} samples with {} features each.".format(*test_data.shape))
display(test_data.head(5))

### 8. sample_submission.csv

In [None]:
sample_submission = pd.read_csv("input/sample_submission.csv")
display(sample_submission.head(n=5))
sample_submission.dtypes