# Favorita Grocery Sales Forecasting

https://www.kaggle.com/c/favorita-grocery-sales-forecasting

Brick-and-mortar grocery stores are always in a delicate dance with purchasing and sales forecasting. Predict a little over, and grocers are stuck with overstocked, perishable goods. Guess a little under, and popular items quickly sell out, leaving money on the table and customers fuming.

The problem becomes more complex as retailers add new locations with unique needs, new products, ever-changing seasonal tastes, and unpredictable product marketing. Corporación Favorita, a large Ecuador-based grocery retailer, knows this all too well. They operate hundreds of supermarkets, with over 200,000 different products on their shelves.

Corporación Favorita has challenged the Kaggle community to build a model that more accurately forecasts product sales. They currently rely on subjective forecasting methods with very little data to back them up and very little automation to execute plans. They’re excited to see how machine learning could better ensure they please customers by having just enough of the right products at the right time.

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
# for Jupyter
from IPython.display import display

# for Fastai and PyTorch
from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)

# path to data
PATH='data/'



In [3]:
# Quick environment check: pandas version, CUDA availability, cuDNN status

print(pd.__version__)
print(torch.cuda.is_available())
print(torch.backends.cudnn.enabled)

0.23.4
False
True


In [4]:
!ls -lh {PATH}

total 21732208
-rw-r--r--  1 ilirium  staff    22K Oct 19  2017 holidays_events.csv
-rw-r--r--  1 ilirium  staff    26K Sep  3 13:31 holidays_events.feather
-rw-r--r--  1 ilirium  staff    99K Oct 19  2017 items.csv
-rw-r--r--  1 ilirium  staff   149K Sep  3 13:31 items.feather
-rw-r--r--  1 ilirium  staff    20K Oct 19  2017 oil.csv
-rw-r--r--  1 ilirium  staff    27K Sep  3 13:31 oil.feather
-rw-r--r--  1 ilirium  staff    39M Oct 19  2017 sample_submission.csv
-rw-r--r--  1 ilirium  staff   1.4K Oct 19  2017 stores.csv
-rw-r--r--  1 ilirium  staff   2.9K Sep  3 13:31 stores.feather
-rw-r--r--  1 ilirium  staff   120M Oct 19  2017 test.csv
-rw-r--r--  1 ilirium  staff   123M Sep  3 13:31 test.feather
-rw-r--r--  1 ilirium  staff   4.7G Oct 19  2017 train.csv
-rw-r--r--  1 ilirium  staff   5.4G Sep  3 13:31 train.feather
-rw-r--r--  1 ilirium  staff   1.5M Oct 19  2017 transactions.csv
-rw-r--r--  1 ilirium  staff   2.4M Sep  3 13:31 transactions.feather


# Read into Vars from Feather

In [5]:
# 0 - train
# 1 - holidays_events
# 2 - items
# 3 - oil
# 4 - stores
# 5 - transactions
# 6 - test


table_names = ['train', 'holidays_events', 'items', 'oil', 'stores', 'transactions', 'test']

tables = [pd.read_feather(f'{PATH}{fname}.feather') for fname in table_names]

train, holidays_events, items, oil, stores, transactions, test = tables

print((len(train), len(test)))



(125497040, 3370464)


In [6]:
display(len(tables))
display(type(tables))
display(type(tables[0]))

7

list

pandas.core.frame.DataFrame

# Data cleaning / Feature Engineering

## Convert Date Columns to Datetime Type

In [7]:
# 0 - train
train['date'] = pd.to_datetime(train.date)

# 1 - holidays_events
holidays_events['date'] = pd.to_datetime(holidays_events.date)

# 2 - items
# It doesn't have the date field.

# 3 - oil
oil['date'] = pd.to_datetime(oil.date)

# 4 - stores
# It doesn't have the date field.

# 5 - transactions
transactions['date'] = pd.to_datetime(transactions.date)

# 6 - test
test['date'] = pd.to_datetime(test.date)
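
The per-table conversions above all follow one pattern, so they could also be written as a loop over the tables. The following is a minimal sketch, assuming the date strings are ISO-formatted (`YYYY-MM-DD`); the helper name `convert_date_columns` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

def convert_date_columns(frames, column="date", fmt="%Y-%m-%d"):
    """Convert `column` to datetime64 in every frame that has it (in place)."""
    for frame in frames:
        if column in frame.columns:
            # An explicit format avoids per-value format inference.
            frame[column] = pd.to_datetime(frame[column], format=fmt)
    return frames

# Toy stand-ins for the real tables (hypothetical data).
toy_oil = pd.DataFrame({"date": ["2013-01-01", "2013-01-02"], "price": [93.1, 93.0]})
toy_items = pd.DataFrame({"item_nbr": [96995, 103665]})  # no date column

convert_date_columns([toy_oil, toy_items])
print(toy_oil.dtypes)
```

Tables without a `date` column are skipped, which mirrors the manual treatment of `items` and `stores` above.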

## Join Train and Stores

In [8]:
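def join_df(left, right, left_on, right_on=None, suffix='_y'):
    """Left-join `right` onto `left`, keeping every row of `left`.

    If `right_on` is omitted, the same key name(s) are used on both sides.
    Overlapping non-key columns from `right` get `suffix` appended.
    """
    if right_on is None:
        right_on = left_on

    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))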

In [9]:
# Small example showing how the join_df function works

a = {'City Name': ['Tomsk', 'Omsk', 'Moscow'],
     'Value1': [1, 2, 3]}

b = {'Cities': ['Moscow', 'Tomsk', 'Omsk'],
     'Value2': ['q', 'w', 'e']}

a_df = pd.DataFrame(a)
b_df = pd.DataFrame(b)
c_df = join_df(a_df, b_df, "City Name", "Cities")

for t in [a_df, b_df, c_df]:
    display(t)

Unnamed: 0,City Name,Value1
0,Tomsk,1
1,Omsk,2
2,Moscow,3


Unnamed: 0,Cities,Value2
0,Moscow,q
1,Tomsk,w
2,Omsk,e


Unnamed: 0,City Name,Value1,Cities,Value2
0,Tomsk,1,Tomsk,w
1,Omsk,2,Omsk,e
2,Moscow,3,Moscow,q
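
In the example above every city finds a match. With a left join, rows of the left frame that have no partner are still kept, and the columns coming from the right frame are filled with `NaN`. The sketch below shows one way to spot such rows using the `indicator=True` option of pandas `merge`; the city `Kazan` and the variable names are hypothetical, chosen only for illustration.

```python
import pandas as pd

left = pd.DataFrame({"City Name": ["Tomsk", "Omsk", "Kazan"], "Value1": [1, 2, 3]})
right = pd.DataFrame({"Cities": ["Moscow", "Tomsk", "Omsk"], "Value2": ["q", "w", "e"]})

# indicator=True adds a `_merge` column; for a left join its values are
# 'both' (match found) or 'left_only' (no match in the right frame).
merged = left.merge(right, how="left", left_on="City Name", right_on="Cities",
                    suffixes=("", "_y"), indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

The same check can be useful after joining large tables, where a few unmatched keys are easy to miss by looking at `head()` alone.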


In [10]:
display(train.columns)
display(stores.columns)

Index(['id', 'date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion'], dtype='object')

Index(['store_nbr', 'city', 'state', 'type', 'cluster'], dtype='object')

In [11]:
df = join_df(train, stores, 'store_nbr', 'store_nbr')

In [12]:
df.head()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,type,cluster
0,0,2013-01-01,25,103665,7.0,,Salinas,Santa Elena,D,1
1,1,2013-01-01,25,105574,1.0,,Salinas,Santa Elena,D,1
2,2,2013-01-01,25,105575,2.0,,Salinas,Santa Elena,D,1
3,3,2013-01-01,25,108079,1.0,,Salinas,Santa Elena,D,1
4,4,2013-01-01,25,108701,1.0,,Salinas,Santa Elena,D,1


## Join DF and Holiday_Events

In [13]:
holidays_events.columns

Index(['date', 'type', 'locale', 'locale_name', 'description', 'transferred'], dtype='object')

In [14]:
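# Join on (date, state) == (date, locale_name). Only holiday rows whose
# locale_name equals the store's state can match here.
df = join_df(df, holidays_events, ['date', 'state'], ['date', 'locale_name'])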

In [15]:
df.head()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,type,cluster,type_y,locale,locale_name,description,transferred
0,0,2013-01-01,25,103665,7.0,,Salinas,Santa Elena,D,1,,,,,
1,1,2013-01-01,25,105574,1.0,,Salinas,Santa Elena,D,1,,,,,
2,2,2013-01-01,25,105575,2.0,,Salinas,Santa Elena,D,1,,,,,
3,3,2013-01-01,25,108079,1.0,,Salinas,Santa Elena,D,1,,,,,
4,4,2013-01-01,25,108701,1.0,,Salinas,Santa Elena,D,1,,,,,
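
The holiday columns are empty in the first rows because no holiday row shares both the date and a `locale_name` equal to the store's state. The toy example below sketches how the composite-key join behaves; all rows are hypothetical, and `join_df` is repeated only so the block runs on its own.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix="_y"):
    if right_on is None:
        right_on = left_on
    return left.merge(right, how="left", left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

# Hypothetical toy rows, shaped like the joined sales/stores frame.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02"]),
    "state": ["Santa Elena", "Pichincha", "Santa Elena"],
    "type": ["D", "A", "D"],
})
# Hypothetical holiday rows: one keyed by a state name, one by a country name.
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-01"]),
    "type": ["Holiday", "Holiday"],
    "locale": ["Regional", "National"],
    "locale_name": ["Santa Elena", "Ecuador"],
})

joined = join_df(sales, holidays, ["date", "state"], ["date", "locale_name"])
matched = int(joined["locale"].notna().sum())
print(joined)
print(matched)
```

Only the row whose date and state both match a holiday row is filled in; the holiday keyed by a country name never matches a state. The clashing `type` column from the holiday table appears as `type_y`, as in the real output above.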


In [16]:
DataFrameSummary(df).summary()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,type,cluster,type_y,locale,locale_name,description,transferred
count,1.25497e+08,,1.25497e+08,1.25497e+08,1.25497e+08,,,,,1.25497e+08,,,,,
mean,6.27485e+07,,27.4646,972769,8.55487,,,,,8.72711,,,,,
std,3.62279e+07,,16.3305,520534,23.6052,,,,,4.62675,,,,,
min,0,,1,96995,-15372,,,,,1,,,,,
25%,3.13743e+07,,12,522383,2,,,,,4,,,,,
50%,6.27485e+07,,28,959500,4,,,,,9,,,,,
75%,9.41228e+07,,43,1.35438e+06,9,,,,,13,,,,,
max,1.25497e+08,,54,2.12711e+06,89440,,,,,17,,,,,
counts,125497040,125497040,125497040,125497040,125497040,103839389,125497040,125497040,125497040,125497040,49893,49893,49893,49893,49893
uniques,125497040,1684,54,4036,258474,2,22,16,5,17,1,2,6,6,1
