# Favorita Grocery Sales Forecasting

https://www.kaggle.com/c/favorita-grocery-sales-forecasting

Brick-and-mortar grocery stores are always in a delicate dance with purchasing and sales forecasting. Predict a little over, and grocers are stuck with overstocked, perishable goods. Guess a little under, and popular items quickly sell out, leaving money on the table and customers fuming.

The problem becomes more complex as retailers add new locations with unique needs, new products, ever-transitioning seasonal tastes, and unpredictable product marketing. Corporación Favorita, a large Ecuador-based grocery retailer, knows this all too well. They operate hundreds of supermarkets, with over 200,000 different products on their shelves.

Corporación Favorita has challenged the Kaggle community to build a model that more accurately forecasts product sales. They currently rely on subjective forecasting methods with very little data to back them up and very little automation to execute plans. They’re excited to see how machine learning could better ensure they please customers by having just enough of the right products at the right time.

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
# my imports

from IPython.display import display # smart print for notebooks
from tqdm import tqdm # progress bar

In [3]:
from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)

PATH='data/'



In [4]:
# Small:
# Checking some things

print(pd.__version__)
print(torch.cuda.is_available())
print(torch.backends.cudnn.enabled)

0.22.0
True
True


# Functions

In [5]:
def count_by_value(df, col, value):
    return len(df.loc[df[col] == value])

In [6]:
def count_all_categories(df, col, num=5):
    cats = list(df[col].cat.categories)
    if len(cats) <= num:
        for cat in cats:
            print(f'"{cat}" has {count_by_value(df, col, cat)} records')
    else:
        print(f'The number of categories is {len(cats)}.')
        print(f'First {num} categories are: {cats[:num]}.')
        print(f'The category "Unknown" has {count_by_value(df, col, "Unknown")} records.')
        
    print(f'"NA" has {len(df[df[col].isna()])} records.')
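The per-category counts that `count_all_categories` prints can also be obtained in one call with `Series.value_counts`. The sketch below uses a made-up toy column (the frame `df` and its values are assumptions for illustration, not data from the competition):

```python
import pandas as pd

# Toy categorical column; the values only imitate the holidays `type` column.
df = pd.DataFrame({'type': pd.Categorical(['Holiday', 'Event', 'Holiday', None])})

# Records per category (missing values are excluded by default).
counts = df['type'].value_counts()
# Missing values are counted separately.
n_missing = df['type'].isna().sum()

print(counts)
print(f'"NA" has {n_missing} records.')
```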

# Examples

## Append new row to DataFrame

In [42]:
geo = pd.DataFrame({'locale_name': ['Tomsk'], 'city': [0], 'state': [0]})
display(geo.shape)
display(geo)

(1, 3)

Unnamed: 0,city,locale_name,state
0,0,Tomsk,0


In [43]:
locale_name = 'Moscow'
cities = '0'
states = '0'
geo = geo.append(pd.DataFrame([[locale_name, cities, states]], columns=['locale_name', 'city', 'state']),
                 ignore_index=True)
display(geo)
del geo

Unnamed: 0,city,locale_name,state
0,0,Tomsk,0
1,0,Moscow,0
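`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older versions such as the 0.22.0 used here. A minimal sketch of the same operation with `pd.concat`, reusing the toy values from the example:

```python
import pandas as pd

# Same toy frame as in the example above (values are placeholders).
geo = pd.DataFrame({'locale_name': ['Tomsk'], 'city': [0], 'state': [0]})

# Build the new row as a one-row frame, then concatenate it.
new_row = pd.DataFrame([['Moscow', 0, 0]],
                       columns=['locale_name', 'city', 'state'])
geo = pd.concat([geo, new_row], ignore_index=True)

print(geo)
```

When many rows are added in a loop, it is usually cheaper to collect them in a list and build the frame once at the end.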


# Look at Files

In [5]:
!ls -lh {PATH}

total 15G
-rw-r--r-- 1 paperspace paperspace 4.0G Sep 26 18:30 df_all_csv.feather
-rw-r--r-- 1 paperspace paperspace 3.9G Sep 26 18:40 df_missing_values.feather
-rw-rw-r-- 1 paperspace paperspace  22K Oct 19  2017 holidays_events.csv
-rw-r--r-- 1 paperspace paperspace  11K Sep 26 14:23 holidays_events.feather
-rw-rw-r-- 1 paperspace paperspace 100K Oct 19  2017 items.csv
-rw-r--r-- 1 paperspace paperspace  71K Sep 26 14:23 items.feather
-rw-rw-r-- 1 paperspace paperspace  21K Oct 19  2017 oil.csv
-rw-r--r-- 1 paperspace paperspace  25K Sep 25 13:44 oil.feather
-rw-rw-r-- 1 paperspace paperspace  39M Oct 19  2017 sample_submission.csv
-rw-rw-r-- 1 paperspace paperspace 1.4K Oct 19  2017 stores.csv
-rw-r--r-- 1 paperspace paperspace 2.0K Sep 26 14:23 stores.feather
-rw-rw-r-- 1 paperspace paperspace 121M Oct 19  2017 test.csv
-rw-r--r-- 1 paperspace paperspace  29M Sep 26 14:23 test.feather
-rw-rw-r-- 1 paperspace paperspace 4.7G Oct 19  2017 train.csv
-rw-r--r-- 1 papersp

In [6]:
!ls -lh ~/.kaggle/competitions/favorita-grocery-sales-forecasting/

total 458M
-rw-rw-r-- 1 paperspace paperspace 1.9K Sep 13 06:28 holidays_events.csv.7z
-rw-rw-r-- 1 paperspace paperspace  14K Sep 13 06:28 items.csv.7z
-rw-rw-r-- 1 paperspace paperspace 3.7K Sep 13 06:28 oil.csv.7z
-rw-rw-r-- 1 paperspace paperspace 651K Sep 13 06:29 sample_submission.csv.7z
-rw-rw-r-- 1 paperspace paperspace  648 Sep 13 06:28 stores.csv.7z
-rw-rw-r-- 1 paperspace paperspace 4.7M Sep 13 06:28 test.csv.7z
-rw-rw-r-- 1 paperspace paperspace 453M Sep 13 06:29 train.csv.7z
-rw-rw-r-- 1 paperspace paperspace 215K Sep 13 06:28 transactions.csv.7z


In [7]:
table_names = ['holidays_events', 'items', 'oil', 'stores', 'transactions',
               'test', 'train']

## `train.csv`

In [8]:
train = pd.read_feather(f'{PATH}{table_names[6]}.feather')

In [9]:
# ??pd.read_csv

In [10]:
train['date'] = pd.to_datetime(train.date)

In [11]:
train.head()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion
0,0,2013-01-01,25,103665,7.0,
1,1,2013-01-01,25,105574,1.0,
2,2,2013-01-01,25,105575,2.0,
3,3,2013-01-01,25,108079,1.0,
4,4,2013-01-01,25,108701,1.0,


In [12]:
DataFrameSummary(train).summary()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion
count,1.25497e+08,,1.25497e+08,1.25497e+08,1.25497e+08,
mean,6.27485e+07,,27.4646,972769,8.55487,
std,3.62279e+07,,16.3305,520534,23.6052,
min,0,,1,96995,-15372,
25%,3.13743e+07,,12,522383,2,
50%,6.27485e+07,,28,959500,4,
75%,9.41228e+07,,43,1.35438e+06,9,
max,1.25497e+08,,54,2.12711e+06,89440,
counts,125497040,125497040,125497040,125497040,125497040,103839389
uniques,125497040,1684,54,4036,258474,2


In [14]:
??DataFrameSummary

In [13]:
train.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125497040 entries, 0 to 125497039
Data columns (total 6 columns):
id             int64
date           datetime64[ns]
store_nbr      int64
item_nbr       int64
unit_sales     float64
onpromotion    object
dtypes: datetime64[ns](1), float64(1), int64(3), object(1)
memory usage: 8.4 GB


### Repeated Dates

**I expect that a given item (`item_nbr`) is recorded at most once per day in each store. This assumption needs to be checked.**

In [None]:
non_unique_rows = []  # one record per (date, store) with a repeated item_nbr
store_nbr_column = train['store_nbr'].unique()

for store in tqdm(store_nbr_column):
    store_subtable = train.query('(store_nbr == @store)')
    date_subtable = store_subtable['date'].unique()
    for date in date_subtable:
        item_nbr = store_subtable.query('date == @date')['item_nbr']
        if len(item_nbr) != len(item_nbr.unique()):
            display(date)
            # record the offending date together with its store
            non_unique_rows.append({'date': date, 'store_nbr': store})

non_unique_dates = pd.DataFrame(non_unique_rows, columns=['date', 'store_nbr'])

In [None]:
display(non_unique_dates)

### Repeated Dates / Pool

**This is a faster implementation that uses a pool of processes.**

In [33]:
from multiprocessing import Pool
from itertools import product

num_cores = 8 # from $lscpu or $cat /proc/cpuinfo

# non_unique_dates = pd.DataFrame({'date': [], 'store_nbr': []})
store_nbr_column = train['store_nbr'].unique()

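def f(date, store):
    # store_subtable is a module-level global set by calc_repeated_dates();
    # workers see it only when the pool uses the "fork" start method
    item_nbr = store_subtable.query('date == @date')['item_nbr']
    if len(item_nbr) != len(item_nbr.unique()):
        return pd.DataFrame({'date': [date], 'store_nbr': [store]})

    return None

def calc_repeated_dates():
    global store_subtable  # must be global so that f() can read it in the workers
    found = []

    for store in tqdm(store_nbr_column):
        store_subtable = train.query('(store_nbr == @store)')
        date_subtable = store_subtable['date'].unique()

        with Pool(processes=num_cores) as pool:
            # product() needs an iterable, so the single store is wrapped in a list
            out = pool.starmap(f, product(date_subtable, [store]))
        found.extend(df for df in out if df is not None)

    # an empty DataFrame means no (date, store) pair has a repeated item_nbr
    return pd.concat(found, ignore_index=True) if found else pd.DataFrame()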
    
non_unique_dates = calc_repeated_dates()

100%|██████████| 54/54 [32:29<00:00, 18.67s/it]


In [35]:
non_unique_dates.shape

(0, 0)

In [36]:
non_unique_dates

**`non_unique_dates` is empty, so no store has the same `item_nbr` recorded twice on one day. The expectation holds.**
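The same uniqueness check can be expressed without explicit loops by using `DataFrame.duplicated` on the key columns. This is only a sketch on a toy frame (`toy` and its rows are invented for illustration). It has not been run on the full `train` table, where memory use would need to be watched:

```python
import pandas as pd

# Toy stand-in for `train`; the last row deliberately repeats the first key.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-01', '2013-01-01',
                            '2013-01-02', '2013-01-01']),
    'store_nbr': [25, 25, 25, 25],
    'item_nbr': [103665, 105574, 103665, 103665],
})

key = ['store_nbr', 'date', 'item_nbr']
# keep=False marks every member of a duplicated group, not only the later ones.
dupes = toy[toy.duplicated(subset=key, keep=False)]

print(dupes)
```

An empty `dupes` frame would mean that each (`store_nbr`, `date`, `item_nbr`) combination occurs only once.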

## `test.csv`

In [16]:
test = pd.read_csv(f'{PATH}{table_names[5]}.csv', low_memory=True)

In [17]:
test.head()

Unnamed: 0,id,date,store_nbr,item_nbr,onpromotion
0,125497040,2017-08-16,1,96995,False
1,125497041,2017-08-16,1,99197,False
2,125497042,2017-08-16,1,103501,False
3,125497043,2017-08-16,1,103520,False
4,125497044,2017-08-16,1,103665,False


In [18]:
DataFrameSummary(test).summary()

Unnamed: 0,id,date,store_nbr,item_nbr,onpromotion
count,3.37046e+06,,3.37046e+06,3.37046e+06,
mean,1.27182e+08,,27.5,1.2448e+06,
std,972969,,15.5858,589836,
min,1.25497e+08,,1,96995,
25%,1.2634e+08,,14,805321,
50%,1.27182e+08,,27.5,1.29466e+06,
75%,1.28025e+08,,41,1.73002e+06,
max,1.28868e+08,,54,2.13424e+06,
counts,3370464,3370464,3370464,3370464,3370464
uniques,3370464,16,54,3901,2


## `holidays_events.csv`

In [9]:
holiday_events = pd.read_feather(f'{PATH}{table_names[0]}.feather')

In [10]:
holiday_events.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 6 columns):
date           350 non-null category
type           350 non-null category
locale         350 non-null category
locale_name    350 non-null category
description    350 non-null category
transferred    350 non-null category
dtypes: category(6)
memory usage: 49.0 KB


In [11]:
holiday_events.head()

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [12]:
DataFrameSummary(holiday_events).summary()

TypeError: concat() got an unexpected keyword argument 'sort'

### `type` column

In [13]:
holiday_events['type'].dtype

CategoricalDtype(categories=['Additional', 'Bridge', 'Event', 'Holiday', 'Transfer',
                  'Work Day'],
                 ordered=False)

In [14]:
count_all_categories(holiday_events, 'type', 6)

"Additional" has 51 records
"Bridge" has 5 records
"Event" has 56 records
"Holiday" has 221 records
"Transfer" has 12 records
"Work Day" has 5 records
"NA" has 0 records.


### `locale_name` column

In [15]:
holiday_events['locale_name'].dtype

CategoricalDtype(categories=['Ambato', 'Cayambe', 'Cotopaxi', 'Cuenca', 'Ecuador',
                  'El Carmen', 'Esmeraldas', 'Guaranda', 'Guayaquil', 'Ibarra',
                  'Imbabura', 'Latacunga', 'Libertad', 'Loja', 'Machala',
                  'Manta', 'Puyo', 'Quevedo', 'Quito', 'Riobamba', 'Salinas',
                  'Santa Elena', 'Santo Domingo',
                  'Santo Domingo de los Tsachilas'],
                 ordered=False)

In [16]:
count_all_categories(holiday_events, 'locale_name', 24)

"Ambato" has 12 records
"Cayambe" has 6 records
"Cotopaxi" has 6 records
"Cuenca" has 7 records
"Ecuador" has 174 records
"El Carmen" has 6 records
"Esmeraldas" has 6 records
"Guaranda" has 12 records
"Guayaquil" has 11 records
"Ibarra" has 7 records
"Imbabura" has 6 records
"Latacunga" has 12 records
"Libertad" has 6 records
"Loja" has 6 records
"Machala" has 6 records
"Manta" has 6 records
"Puyo" has 6 records
"Quevedo" has 6 records
"Quito" has 13 records
"Riobamba" has 12 records
"Salinas" has 6 records
"Santa Elena" has 6 records
"Santo Domingo" has 6 records
"Santo Domingo de los Tsachilas" has 6 records
"NA" has 0 records.


### Ecuador Location

In [17]:
holiday_events.loc[holiday_events['locale_name'] == 'Ecuador']

Unnamed: 0,date,type,locale,locale_name,description,transferred
14,2012-08-10,Holiday,National,Ecuador,Primer Grito de Independencia,False
19,2012-10-09,Holiday,National,Ecuador,Independencia de Guayaquil,True
20,2012-10-12,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
21,2012-11-02,Holiday,National,Ecuador,Dia de Difuntos,False
22,2012-11-03,Holiday,National,Ecuador,Independencia de Cuenca,False
31,2012-12-21,Additional,National,Ecuador,Navidad-4,False
33,2012-12-22,Additional,National,Ecuador,Navidad-3,False
34,2012-12-23,Additional,National,Ecuador,Navidad-2,False
35,2012-12-24,Bridge,National,Ecuador,Puente Navidad,False
36,2012-12-24,Additional,National,Ecuador,Navidad-1,False


**Can national holidays (locale name `Ecuador`) fall on the same dates as holidays in other locales?**

In [24]:
repeated_dates = pd.DataFrame()
for date in holiday_events.loc[holiday_events['locale_name'] == 'Ecuador']['date']:
#     holiday_events.loc[holiday_events['date'] == date]
#     display(holiday_events.query('(date == @date) & (locale_name != "Ecuador")'))
    repeated_dates = repeated_dates.append(holiday_events.query('(date == @date) & (locale_name != "Ecuador")'))
    
display(repeated_dates)
del repeated_dates
del date

Unnamed: 0,date,type,locale,locale_name,description,transferred
32,2012-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
54,2013-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
86,2013-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
110,2014-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
111,2014-06-25,Holiday,Local,Machala,Fundacion de Machala,False
112,2014-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
151,2014-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
205,2015-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
224,2016-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
249,2016-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False


**Yes, they can. So nationwide events need to be added as additional columns. This will be implemented in another notebook, `favorita-02.ipynb`.**
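One possible way to put national events into their own column is to split the table by `locale` and join the national part back on `date`. The sketch below is an assumption about how this could look, not the code from `favorita-02.ipynb`. The column name `national_event` is hypothetical, and the toy rows are copied from the outputs above:

```python
import pandas as pd

# Toy extract of the holidays table.
holidays = pd.DataFrame({
    'date': pd.to_datetime(['2012-12-22', '2012-12-22', '2012-12-24']),
    'locale': ['National', 'Local', 'National'],
    'locale_name': ['Ecuador', 'Salinas', 'Ecuador'],
    'description': ['Navidad-3', 'Cantonizacion de Salinas', 'Puente Navidad'],
})

is_national = holidays['locale'] == 'National'

# One row per date for national events (first description wins on ties).
national = (holidays[is_national]
            .groupby('date', as_index=False)['description'].first()
            .rename(columns={'description': 'national_event'}))
local = holidays[~is_national]

# The left join keeps every local row and attaches the national event, if any.
merged = local.merge(national, on='date', how='left')

print(merged)
```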

## `items.csv`

In [22]:
items = pd.read_csv(f'{PATH}{table_names[1]}.csv', low_memory=False)

In [23]:
items.head()

Unnamed: 0,item_nbr,family,class,perishable
0,96995,GROCERY I,1093,0
1,99197,GROCERY I,1067,0
2,103501,CLEANING,3008,0
3,103520,GROCERY I,1028,0
4,103665,BREAD/BAKERY,2712,1


In [24]:
DataFrameSummary(items).summary()

Unnamed: 0,item_nbr,family,class,perishable
count,4100,,4100,4100
mean,1.25144e+06,,2169.65,0.240488
std,587687,,1484.91,0.427432
min,96995,,1002,0
25%,818111,,1068,0
50%,1.3062e+06,,2004,0
75%,1.90492e+06,,2990.5,0
max,2.13424e+06,,7780,1
counts,4100,4100,4100,4100
uniques,4100,33,337,2


## `oil.csv`

In [25]:
oil = pd.read_csv(f'{PATH}{table_names[2]}.csv', low_memory=False)

In [26]:
oil.head()

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2


In [27]:
DataFrameSummary(oil).summary()

Unnamed: 0,date,dcoilwtico
count,,1175
mean,,67.7144
std,,25.6305
min,,26.19
25%,,46.405
50%,,53.19
75%,,95.66
max,,110.62
counts,1218,1175
uniques,1218,998


## `stores.csv`

In [7]:
stores = pd.read_feather(f'{PATH}{table_names[3]}.feather')

In [20]:
stores.head()

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [30]:
DataFrameSummary(stores).summary()

Unnamed: 0,store_nbr,city,state,type,cluster
count,54,,,,54
mean,27.5,,,,8.48148
std,15.7321,,,,4.69339
min,1,,,,1
25%,14.25,,,,4
50%,27.5,,,,8.5
75%,40.75,,,,13
max,54,,,,17
counts,54,54,54,54,54
uniques,54,22,16,5,17


## `transactions.csv`

In [31]:
transactions = pd.read_csv(f'{PATH}{table_names[4]}.csv', low_memory=False)

In [32]:
transactions.head()

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


In [33]:
DataFrameSummary(transactions).summary()

Unnamed: 0,date,store_nbr,transactions
count,,83488,83488
mean,,26.9392,1694.6
std,,15.6082,963.287
min,,1,5
25%,,13,1046
50%,,27,1393
75%,,40,2079
max,,54,8359
counts,83488,83488,83488
uniques,1682,54,4993


## Cities, States, Locale Names

`city` and `state` from **`stores.csv`**

`locale_name` from **`holidays_events.csv`**

In [12]:
count_all_categories(stores, 'city', 22)

"Ambato" has 2 records
"Babahoyo" has 1 records
"Cayambe" has 1 records
"Cuenca" has 3 records
"Daule" has 1 records
"El Carmen" has 1 records
"Esmeraldas" has 1 records
"Guaranda" has 1 records
"Guayaquil" has 8 records
"Ibarra" has 1 records
"Latacunga" has 2 records
"Libertad" has 1 records
"Loja" has 1 records
"Machala" has 2 records
"Manta" has 2 records
"Playas" has 1 records
"Puyo" has 1 records
"Quevedo" has 1 records
"Quito" has 18 records
"Riobamba" has 1 records
"Salinas" has 1 records
"Santo Domingo" has 3 records
"NA" has 0 records.


In [13]:
count_all_categories(stores, 'state', 16)

"Azuay" has 3 records
"Bolivar" has 1 records
"Chimborazo" has 1 records
"Cotopaxi" has 2 records
"El Oro" has 2 records
"Esmeraldas" has 1 records
"Guayas" has 11 records
"Imbabura" has 1 records
"Loja" has 1 records
"Los Rios" has 2 records
"Manabi" has 3 records
"Pastaza" has 1 records
"Pichincha" has 19 records
"Santa Elena" has 1 records
"Santo Domingo de los Tsachilas" has 3 records
"Tungurahua" has 2 records
"NA" has 0 records.


In [14]:
count_all_categories(holiday_events, 'locale_name', 24)

"Ambato" has 12 records
"Cayambe" has 6 records
"Cotopaxi" has 6 records
"Cuenca" has 7 records
"Ecuador" has 174 records
"El Carmen" has 6 records
"Esmeraldas" has 6 records
"Guaranda" has 12 records
"Guayaquil" has 11 records
"Ibarra" has 7 records
"Imbabura" has 6 records
"Latacunga" has 12 records
"Libertad" has 6 records
"Loja" has 6 records
"Machala" has 6 records
"Manta" has 6 records
"Puyo" has 6 records
"Quevedo" has 6 records
"Quito" has 13 records
"Riobamba" has 12 records
"Salinas" has 6 records
"Santa Elena" has 6 records
"Santo Domingo" has 6 records
"Santo Domingo de los Tsachilas" has 6 records
"NA" has 0 records.


In [59]:
geo = pd.DataFrame({'locale_name': [], 'holiday_events/days': [], 'stores/city': [], 'stores/state': []})
display(geo.shape)
display(geo)

(0, 4)

Unnamed: 0,holiday_events/days,locale_name,stores/city,stores/state


In [60]:
# for locale_name in holiday_events['locale_name']:
for locale_name in list(holiday_events['locale_name'].cat.categories):
    days = count_by_value(holiday_events, 'locale_name', locale_name)
    cities = count_by_value(stores, 'city', locale_name)
    states = count_by_value(stores, 'state', locale_name)
    geo = geo.append(pd.DataFrame([[locale_name, days, cities, states]],
                                  columns=['locale_name', 'holiday_events/days', 'stores/city', 'stores/state']),
                    ignore_index=True)
#     geo.append(pd.DataFrame({'locale_name': locale_name, 'city': cities, 'state': states}))
    
display(geo[['locale_name', 'holiday_events/days', 'stores/city', 'stores/state']])

Unnamed: 0,locale_name,holiday_events/days,stores/city,stores/state
0,Ambato,12.0,2.0,0.0
1,Cayambe,6.0,1.0,0.0
2,Cotopaxi,6.0,0.0,2.0
3,Cuenca,7.0,3.0,0.0
4,Ecuador,174.0,0.0,0.0
5,El Carmen,6.0,1.0,0.0
6,Esmeraldas,6.0,1.0,1.0
7,Guaranda,12.0,1.0,0.0
8,Guayaquil,11.0,8.0,0.0
9,Ibarra,7.0,1.0,0.0


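Zero (the locale name matches neither a city nor a state in `stores`):
* Ecuador

Double (the locale name matches both a city and a state in `stores`):
* Esmeraldas
* Loja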
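The zero and double cases can also be found with `Series.isin` instead of counting row by row. A minimal sketch, using a small hand-picked subset of the names from the outputs above (the sets `cities` and `states` are therefore incomplete on purpose):

```python
import pandas as pd

# Small subset of values taken from the outputs above.
locale_names = pd.Series(['Quito', 'Cotopaxi', 'Ecuador', 'Esmeraldas', 'Loja'])
cities = {'Quito', 'Esmeraldas', 'Loja'}
states = {'Pichincha', 'Cotopaxi', 'Esmeraldas', 'Loja'}

geo = pd.DataFrame({
    'locale_name': locale_names,
    'is_city': locale_names.isin(cities),
    'is_state': locale_names.isin(states),
})

# Names that match nothing, and names that match both a city and a state.
zero = geo.loc[~geo['is_city'] & ~geo['is_state'], 'locale_name'].tolist()
double = geo.loc[geo['is_city'] & geo['is_state'], 'locale_name'].tolist()

print(zero, double)
```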

## Conclusion

### Statistics

In [34]:
# train.csv (unique counts taken from the summary above)

store_num = 54
days_num = 1684
item_num = 4036
id_num = 125497040

full_items = store_num * days_num * item_num

print(full_items)
print(id_num - full_items)

367017696
-241520656
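The negative difference means that the training set contains far fewer rows than the full store × day × item grid. The same numbers can be turned into a fill ratio, which is easier to read. The sketch only reuses the counts from the cell above, and the name `fill_ratio` is introduced here for illustration:

```python
# Unique counts and row count from the train.csv summary.
store_num, days_num, item_num = 54, 1684, 4036
id_num = 125497040

# Size of the complete grid if every item were recorded in every store daily.
full_items = store_num * days_num * item_num
# Share of that grid that is actually present in the training data.
fill_ratio = id_num / full_items

print(f'{fill_ratio:.1%} of the full grid is present')
```

Roughly a third of the possible combinations have a record, so any model or feature that assumes a dense daily series per store and item has to handle the missing rows explicitly.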


### Training Data Range

**1) get training data range**

In [195]:
train.columns

Index(['id', 'date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion'], dtype='object')

In [196]:
# get dates from the training dataset

unique_train_dates = pd.DataFrame(pd.unique(train.date))

In [197]:
display(type(unique_train_dates))
display(DataFrameSummary(unique_train_dates).summary())
display(unique_train_dates.columns)

pandas.core.frame.DataFrame

Unnamed: 0,0
count,1684
unique,1684
top,2016-11-14 00:00:00
freq,1
first,2013-01-01 00:00:00
last,2017-08-15 00:00:00
counts,1684
uniques,1684
missing,0
missing_perc,0%


RangeIndex(start=0, stop=1, step=1)

In [198]:
print(len(unique_train_dates))
print(unique_train_dates[0][0])
# unique_train_dates[0][-1] raises KeyError: [] on this Series is label-based, so use .iloc[-1]
print(unique_train_dates[0].iloc[-1])

1684
2013-01-01 00:00:00
2017-08-15 00:00:00


**2) find missing dates**

In [199]:
def get_missing_dates(columns_with_date_from_dataset):
    
    date_from_range = pd.DataFrame(pd.date_range(
        columns_with_date_from_dataset[0],
        columns_with_date_from_dataset[len(columns_with_date_from_dataset.index)-1]))
    
    correction = 0
    date_missing = []
    for idx in columns_with_date_from_dataset.index:
        date1 = columns_with_date_from_dataset[idx - correction]
        date2 = date_from_range[0][idx]
        if date1 != date2:
            date_missing.append(date2)
            correction += 1
            
    return date_missing
        
train_date_missing = get_missing_dates(unique_train_dates[0])
display(train_date_missing)

[Timestamp('2013-12-25 00:00:00'),
 Timestamp('2014-12-25 00:00:00'),
 Timestamp('2015-12-25 00:00:00'),
 Timestamp('2016-12-25 00:00:00')]
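The positional comparison in `get_missing_dates` iterates only over the length of the input column, so gaps near the end of the range may not be compared. An alternative sketch is to take the set difference between the full daily range and the observed dates. The toy `dates` series below is invented for illustration:

```python
import pandas as pd

# Toy date column with one gap (2013-12-25), imitating the training dates.
dates = pd.Series(pd.to_datetime(['2013-12-23', '2013-12-24',
                                  '2013-12-26', '2013-12-27']))

# Every calendar day between the first and the last observed date.
full_range = pd.date_range(dates.min(), dates.max(), freq='D')
# Days of the full range that never occur in the data.
missing = full_range.difference(pd.DatetimeIndex(dates))

print(list(missing))
```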

In [213]:
# check that these dates are not in the train dataset

# display(train[train.date == pd.datetime(2013, 12, 25)])

for date in train_date_missing:
    display(train[train.date == date])

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion


Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion


Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion


Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion


### Testing Date Range

**1) get testing date range**

In [201]:
test.columns

Index(['id', 'date', 'store_nbr', 'item_nbr', 'onpromotion'], dtype='object')

In [202]:
test['date'] = pd.to_datetime(test.date)

In [206]:
unique_test_dates = pd.DataFrame(pd.unique(test.date))

In [208]:
display(unique_test_dates.shape)
display(unique_test_dates[0][0])
# unique_test_dates[0][-1] raises KeyError: [] on this Series is label-based, so use .iloc[-1]
display(unique_test_dates[0].iloc[-1])

(16, 1)

Timestamp('2017-08-16 00:00:00')

Timestamp('2017-08-31 00:00:00')

In [209]:
unique_test_dates.head()

Unnamed: 0,0
0,2017-08-16
1,2017-08-17
2,2017-08-18
3,2017-08-19
4,2017-08-20


In [210]:
DataFrameSummary(unique_test_dates).summary()

Unnamed: 0,0
count,16
unique,16
top,2017-08-16 00:00:00
freq,1
first,2017-08-16 00:00:00
last,2017-08-31 00:00:00
counts,16
uniques,16
missing,0
missing_perc,0%


**2) find missing dates**

In [211]:
test_date_missing = get_missing_dates(unique_test_dates[0])
display(test_date_missing)

[]

In [212]:
# check that these dates are not in the test dataset

# display(train[train.date == pd.datetime(2013, 12, 25)])

for date in test_date_missing:
    display(test[test.date == date])

In [None]:
# end this section