# Favorita Grocery Sales Forecasting

https://www.kaggle.com/c/favorita-grocery-sales-forecasting

Brick-and-mortar grocery stores are always in a delicate dance with purchasing and sales forecasting. Predict a little over, and grocers are stuck with overstocked, perishable goods. Guess a little under, and popular items quickly sell out, leaving money on the table and customers fuming.

The problem becomes more complex as retailers add new locations with unique needs, new products, ever-changing seasonal tastes, and unpredictable product marketing. Corporación Favorita, a large Ecuador-based grocery retailer, knows this all too well. They operate hundreds of supermarkets, with over 200,000 different products on their shelves.

Corporación Favorita has challenged the Kaggle community to build a model that more accurately forecasts product sales. They currently rely on subjective forecasting methods with very little data to back them up and very little automation to execute plans. They’re excited to see how machine learning could better ensure they please customers by having just enough of the right products at the right time.

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
# for Jupyter
from IPython.display import display

# for Fastai and PyTorch
from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)

# path to data
PATH='data/'


In [3]:
# Quick sanity checks: pandas version and GPU / cuDNN availability

print(pd.__version__)
print(torch.cuda.is_available())
print(torch.backends.cudnn.enabled)

0.22.0
True
True


In [4]:
!ls {PATH}

holidays_events.csv	 oil.feather		test.feather
holidays_events.feather  sample_submission.csv	train.csv
items.csv		 stores.csv		train.feather
items.feather		 stores.feather		transactions.csv
oil.csv			 test.csv		transactions.feather


# Functions

In [26]:
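def mem_usage(df):
    """Return per-column memory usage in MB together with each column's dtype."""
    # memory_usage() reports the index first; drop it so sizes line up with df.columns
    sizes = list(df.memory_usage(deep=True) / 1024 ** 2)[1:]
    types = [t.name for t in df.dtypes]
    cols = list(df.columns)

    mem = pd.DataFrame({'columns': cols, 'size': sizes, 'type': types})

    return mem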

# Read into Vars from Feather

In [5]:
# 0 - train
# 1 - holidays_events
# 2 - items
# 3 - oil
# 4 - stores
# 5 - transactions
# 6 - test


table_names = ['train', 'holidays_events', 'items', 'oil', 'stores', 'transactions', 'test']

tables = [pd.read_feather(f'{PATH}{fname}.feather') for fname in table_names]



In [6]:
train, holidays_events, items, oil, stores, transactions, test = tables

print((len(train), len(test)))

(125497040, 3370464)


In [7]:
display(len(tables))
display(type(tables))
display(type(tables[0]))

7

list

pandas.core.frame.DataFrame

## Memory Consumption

In [8]:
train.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125497040 entries, 0 to 125497039
Data columns (total 6 columns):
id             uint32
date           category
store_nbr      category
item_nbr       category
unit_sales     category
onpromotion    category
dtypes: category(5), uint32(1)
memory usage: 1.7 GB


In [20]:
holidays_events.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 6 columns):
date           350 non-null category
type           350 non-null category
locale         350 non-null category
locale_name    350 non-null category
description    350 non-null category
transferred    350 non-null category
dtypes: category(6)
memory usage: 49.0 KB


In [21]:
items.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4100 entries, 0 to 4099
Data columns (total 4 columns):
item_nbr      4100 non-null uint32
family        4100 non-null category
class         4100 non-null category
perishable    4100 non-null category
dtypes: category(3), uint32(1)
memory usage: 65.8 KB


In [22]:
oil.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
date          1218 non-null category
dcoilwtico    1175 non-null float32
dtypes: category(1), float32(1)
memory usage: 126.9 KB


In [23]:
stores.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
store_nbr    54 non-null category
city         54 non-null category
state        54 non-null category
type         54 non-null category
cluster      54 non-null category
dtypes: category(5)
memory usage: 11.8 KB


In [24]:
transactions.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83488 entries, 0 to 83487
Data columns (total 3 columns):
date            83488 non-null category
store_nbr       83488 non-null category
transactions    83488 non-null category
dtypes: category(3)
memory usage: 1.0 MB


# Join Multiple Tables into One DF

## Convert Date to Date type

**This conversion is not needed. It is more efficient to keep the dates as categorical columns, as loaded from the feather files.**

In [41]:
# # 0 - train
# train['date'] = pd.to_datetime(train.date)

# # 1 - holidays_events
# holidays_events['date'] = pd.to_datetime(holidays_events.date)

# # 2 - items
# # It doesn't have the date field.

# # 3 - oil
# oil['date'] = pd.to_datetime(oil.date)

# # 4 - stores
# # It doesn't have the date field.

# # 5 - transactions
# transactions['date'] = pd.to_datetime(transactions.date)

# # 6 - test
# test['date'] = pd.to_datetime(test.date)
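
To see why keeping `date` categorical can be cheaper than converting it, the sketch below compares the footprint of one toy date column stored as plain strings, as `category`, and as `datetime64`. The column contents and sizes are invented for illustration. With a small number of distinct dates repeated many times, the categorical codes tend to be smaller than 8-byte timestamps.

```python
import pandas as pd

# Toy stand-in for a `date` column: 365 distinct days, each repeated 1000 times.
days = pd.date_range("2013-01-01", periods=365).strftime("%Y-%m-%d").tolist()
dates = pd.Series(days * 1000)

# Bytes used by the values only (index excluded) under three representations.
as_strings = dates.memory_usage(deep=True, index=False)
as_category = dates.astype("category").memory_usage(deep=True, index=False)
as_datetime = pd.to_datetime(dates).memory_usage(deep=True, index=False)

for name, size in [("strings", as_strings),
                   ("category", as_category),
                   ("datetime64", as_datetime)]:
    print(f"{name:>10}: {size / 1024 ** 2:.2f} MB")
```

The trade-off is that a categorical column does not support date arithmetic until it is converted.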

## Join Train and Stores

In [10]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None:
        right_on = left_on

    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))

In [11]:
# Small example showing how the join_df function works

a = {'City Name': ['Tomsk', 'Omsk', 'Moscow'],
     'Value1': [1, 2, 3]}

b = {'Cities': ['Moscow', 'Tomsk', 'Omsk'],
     'Value2': ['q', 'w', 'e']}

a_df = pd.DataFrame(a)
b_df = pd.DataFrame(b)
c_df = join_df(a_df, b_df, "City Name", "Cities")

for t in [a_df, b_df, c_df]:
    display(t)

Unnamed: 0,City Name,Value1
0,Tomsk,1
1,Omsk,2
2,Moscow,3


Unnamed: 0,Cities,Value2
0,Moscow,q
1,Tomsk,w
2,Omsk,e


Unnamed: 0,City Name,Value1,Cities,Value2
0,Tomsk,1,Tomsk,w
1,Omsk,2,Omsk,e
2,Moscow,3,Moscow,q


In [12]:
display(train.columns)
display(stores.columns)

Index(['id', 'date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion'], dtype='object')

Index(['store_nbr', 'city', 'state', 'type', 'cluster'], dtype='object')

In [13]:
df = join_df(train, stores, 'store_nbr', 'store_nbr')

In [14]:
df.head()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,type,cluster
0,0,2013-01-01,25,103665,7.0,,Puyo,Pastaza,C,7
1,1,2013-01-01,25,105574,1.0,,Puyo,Pastaza,C,7
2,2,2013-01-01,25,105575,2.0,,Puyo,Pastaza,C,7
3,3,2013-01-01,25,108079,1.0,,Puyo,Pastaza,C,7
4,4,2013-01-01,25,108701,1.0,,Puyo,Pastaza,C,7


## Join DF and Holiday_Events

In [15]:
holidays_events.columns

Index(['date', 'type', 'locale', 'locale_name', 'description', 'transferred'], dtype='object')

In [16]:
df = join_df(df, holidays_events, ['date', 'state'], ['date', 'locale_name'])

In [17]:
df.head()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,type,cluster,type_y,locale,locale_name,description,transferred
0,0,2013-01-01,25,103665,7.0,,Puyo,Pastaza,C,7,,,,,
1,1,2013-01-01,25,105574,1.0,,Puyo,Pastaza,C,7,,,,,
2,2,2013-01-01,25,105575,2.0,,Puyo,Pastaza,C,7,,,,,
3,3,2013-01-01,25,108079,1.0,,Puyo,Pastaza,C,7,,,,,
4,4,2013-01-01,25,108701,1.0,,Puyo,Pastaza,C,7,,,,,
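
Since the first rows show only empty holiday fields, it may be worth checking how many rows found a match at all. A minimal sketch on toy frames (the values are invented), using the `indicator` option of `DataFrame.merge` to label each row as matched or unmatched:

```python
import pandas as pd

# Toy stand-ins for the joined frame and holidays_events.
left = pd.DataFrame({"date": ["2013-01-01", "2013-01-01", "2013-01-02"],
                     "state": ["Pastaza", "Pichincha", "Pastaza"]})
right = pd.DataFrame({"date": ["2013-01-01"],
                      "locale_name": ["Pichincha"],
                      "description": ["toy holiday"]})

# indicator=True adds a `_merge` column: 'both' for matched rows,
# 'left_only' for rows with no partner on the right.
merged = left.merge(right, how="left",
                    left_on=["date", "state"],
                    right_on=["date", "locale_name"],
                    indicator=True)

matched = int((merged["_merge"] == "both").sum())
unmatched = int((merged["_merge"] == "left_only").sum())
print(matched, unmatched)
```

On the real data, a very low matched count would suggest that the `state` / `locale_name` pairing covers only part of the holidays.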


In [18]:
DataFrameSummary(df).summary()

TypeError: concat() got an unexpected keyword argument 'sort'
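
The `sort` keyword of `pd.concat` was added in pandas 0.23.0, while this notebook runs 0.22.0, so the error most likely comes from the summary helper expecting a newer pandas. Until the versions are aligned, a rough per-column overview can be built from basic calls. `quick_summary` below is a hypothetical helper, not part of fastai or `pandas_summary`, shown on a toy frame:

```python
import pandas as pd

def quick_summary(df):
    """Per-column dtype, missing count, and distinct count (hypothetical helper)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "unique": df.nunique(),   # distinct non-null values
    })

# Toy frame with invented values.
toy = pd.DataFrame({"store_nbr": [25, 25, 1],
                    "onpromotion": [None, True, None]})
summary = quick_summary(toy)
print(summary)
```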

In [19]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 125497040 entries, 0 to 125497039
Data columns (total 15 columns):
id             uint32
date           object
store_nbr      category
item_nbr       category
unit_sales     category
onpromotion    category
city           category
state          object
type           category
cluster        category
type_y         category
locale         category
locale_name    category
description    category
transferred    category
dtypes: category(12), object(2), uint32(1)
memory usage: 18.8 GB


In [27]:
mem_usage(df)

Unnamed: 0,columns,size,type
0,id,478.733215,uint32
1,date,8018.781357,object
2,store_nbr,119.688844,category
3,item_nbr,239.767878,category
4,unit_sales,504.360354,category
5,onpromotion,119.683497,category
6,city,119.68526,category
7,state,7899.702167,object
8,type,119.683771,category
9,cluster,119.684931,category


In [29]:
df['date'] = df.date.astype('category')
df['state'] = df.state.astype('category')
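
The recast above undoes a side effect of the joins: `date` and `state` came back as `object`. A likely explanation, not confirmed in this notebook, is that pandas falls back to the categories' underlying dtype when two categorical merge keys do not share identical categories. A small sketch with invented values:

```python
import pandas as pd

# Two categorical key columns whose category sets differ.
left = pd.DataFrame({"state": pd.Categorical(["Pastaza", "Pichincha"])})
right = pd.DataFrame({"state": pd.Categorical(["Pichincha", "Guayas"]),
                      "holiday": [1, 2]})

merged = left.merge(right, how="left", on="state")
dtype_after_merge = str(merged["state"].dtype)
print(dtype_after_merge)  # no longer 'category'

# Recasting restores the compact categorical representation.
merged["state"] = merged["state"].astype("category")
dtype_after_recast = str(merged["state"].dtype)
print(dtype_after_recast)
```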

In [30]:
mem_usage(df)

Unnamed: 0,columns,size,type
0,id,478.733215,uint32
1,date,239.552334,category
2,store_nbr,119.688844,category
3,item_nbr,239.767878,category
4,unit_sales,504.360354,category
5,onpromotion,119.683497,category
6,city,119.68526,category
7,state,119.684922,category
8,type,119.683771,category
9,cluster,119.684931,category


In [31]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 125497040 entries, 0 to 125497039
Data columns (total 15 columns):
id             uint32
date           category
store_nbr      category
item_nbr       category
unit_sales     category
onpromotion    category
city           category
state          category
type           category
cluster        category
type_y         category
locale         category
locale_name    category
description    category
transferred    category
dtypes: category(14), uint32(1)
memory usage: 3.6 GB


## Join DF and Items

In [33]:
display(items.columns)
display(df.columns)

Index(['item_nbr', 'family', 'class', 'perishable'], dtype='object')

Index(['id', 'date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion',
       'city', 'state', 'type', 'cluster', 'type_y', 'locale', 'locale_name',
       'description', 'transferred'],
      dtype='object')

In [35]:
df = join_df(df, items, 'item_nbr', 'item_nbr')

In [36]:
df.head()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,type,cluster,type_y,locale,locale_name,description,transferred,family,class,perishable
0,0,2013-01-01,25,103665,7.0,,Puyo,Pastaza,C,7,,,,,,,,
1,1,2013-01-01,25,105574,1.0,,Puyo,Pastaza,C,7,,,,,,,,
2,2,2013-01-01,25,105575,2.0,,Puyo,Pastaza,C,7,,,,,,,,
3,3,2013-01-01,25,108079,1.0,,Puyo,Pastaza,C,7,,,,,,,,
4,4,2013-01-01,25,108701,1.0,,Puyo,Pastaza,C,7,,,,,,,,


In [37]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 125497040 entries, 0 to 125497039
Data columns (total 18 columns):
id             uint32
date           category
store_nbr      category
item_nbr       object
unit_sales     category
onpromotion    category
city           category
state          category
type           category
cluster        category
type_y         category
locale         category
locale_name    category
description    category
transferred    category
family         category
class          category
perishable     category
dtypes: category(16), object(1), uint32(1)
memory usage: 11.3 GB
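
The empty `family`, `class`, and `perishable` fields in the rows above, together with `item_nbr` turning into `object`, suggest that the keys did not match. One possible cause is that `item_nbr` is `category` in `train` but `uint32` in `items`. The sketch below assumes that this is the cause and that the categories are string labels; it casts both keys to a common dtype before merging. All values are toy data.

```python
import pandas as pd

# Toy stand-ins: categorical string key on the left, integer key on the right.
sales = pd.DataFrame({"item_nbr": pd.Categorical(["103665", "105574"])})
catalog = pd.DataFrame({"item_nbr": pd.Series([103665, 105574], dtype="uint32"),
                        "family": ["toy_family_a", "toy_family_b"]})

# Bring the left key to the same integer dtype as the right key.
sales["item_nbr"] = pd.to_numeric(sales["item_nbr"].astype(str)).astype("uint32")

fixed = sales.merge(catalog, how="left", on="item_nbr")
print(fixed)
```

After such a join, `item_nbr` can be cast back to `category` if the smaller footprint is needed.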


In [39]:
df['item_nbr'] = df.item_nbr.astype('category')

In [40]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 125497040 entries, 0 to 125497039
Data columns (total 18 columns):
id             uint32
date           category
store_nbr      category
item_nbr       category
unit_sales     category
onpromotion    category
city           category
state          category
type           category
cluster        category
type_y         category
locale         category
locale_name    category
description    category
transferred    category
family         category
class          category
perishable     category
dtypes: category(17), uint32(1)
memory usage: 4.1 GB


## Join DF and Oil
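
This section is still empty in the notebook. One way it could look, assuming the same left-join pattern as `join_df` with `date` as the key, is sketched below on toy frames with invented prices. The `oil` table has 1218 rows but only 1175 non-null `dcoilwtico` values, so some missing prices would carry over into the joined frame.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix="_y"):
    # Same left-join helper as defined earlier in the notebook.
    if right_on is None:
        right_on = left_on
    return left.merge(right, how="left", left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

# Toy stand-ins; prices are invented and one is deliberately missing.
sales = pd.DataFrame({"date": ["2013-01-01", "2013-01-02", "2013-01-03"],
                      "unit_sales": [7.0, 1.0, 2.0]})
oil_toy = pd.DataFrame({"date": ["2013-01-01", "2013-01-02", "2013-01-03"],
                        "dcoilwtico": [None, 93.0, 94.0]})

joined = join_df(sales, oil_toy, "date")
missing_prices = int(joined["dcoilwtico"].isnull().sum())
print(joined)
```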

## Join DF and Transactions
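
This section is also empty. A possible sketch, assuming `transactions` holds one row per `date` and `store_nbr` so that both columns form the join key. The `validate` option of `DataFrame.merge` is used here to guard against duplicated keys on the right side, which would otherwise multiply rows. Frames and counts are toy data.

```python
import pandas as pd

# Toy stand-ins for the joined frame and the transactions table.
sales = pd.DataFrame({"date": ["2013-01-02", "2013-01-02", "2013-01-03"],
                      "store_nbr": [25, 1, 25],
                      "unit_sales": [7.0, 1.0, 2.0]})
tx_toy = pd.DataFrame({"date": ["2013-01-02", "2013-01-02"],
                       "store_nbr": [25, 1],
                       "transactions": [100, 200]})

# many_to_one raises if a (date, store_nbr) pair appears twice on the right.
joined = sales.merge(tx_toy, how="left", on=["date", "store_nbr"],
                     validate="many_to_one")
missing_tx = int(joined["transactions"].isnull().sum())
print(joined)
```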

## Save

# Data cleaning / Feature Engineering