# Capstone 3 - Data Wrangling

### Table of contents
* [1.0 Introduction](#1.0)
    * [1.1 Purpose](#1.1)
    * [1.2 Approach](#1.2)
* [2.0 Explore the data](#2.0)
    * [2.1 Training data](#2.1)
    * [2.2 Exogenous Variables](#2.2)
        * [2.2.1 Holidays & Events](#2.2.1)
        * [2.2.2 Oil](#2.2.2)
        * [2.2.3 Store Information](#2.2.3)
        * [2.2.4 Transaction data](#2.2.4)
    * [2.3 Identify the resolution of the time series](#2.3)
    * [2.4 Forecasting horizon](#2.4)
* [3.0 Combining pertinent features into the data](#3.0)
    * [3.1 Merging training data with geographical store data](#3.1)
    * [3.2 Merging current data with holiday and event data](#3.2)
    * [3.3 Merging current data with transaction data](#3.3)
    * [3.4 Merging current data with oil data](#3.4)
* [4.0 Summary](#4.0)

### 1.0 Introduction<a id='1.0'></a>

In order to stay in business, commercial grocery stores must offer prices that are competitive, offer deals to entice customers, and accurately predict which products, and what quantity of each, to keep in stock. These considerations are complicated by both seasonal and regional trends.

Especially for grocers, the consequences of poor inventory management are dire. Perishable items like fruits and vegetables can rot before selling if they are overstocked. Conversely, many locations do not have the real estate or capability to store overstocked, low-demand items that are not selling. According to Retail Wire, overstocking costs the average retailer 3.2% in lost revenue, while understocking items can cost 4.1%. A review of the data has shown that overstocks are costing retailers \\$123.4 billion every year, and understocks remove another \\$129.5 billion from net inflows. [1]

#### 1.1 Purpose<a id='1.1'></a>

Using Kaggle data from Favorita grocery stores located in Ecuador [2], we will assess and predict sales of available items using time series analysis. The data is spread across multiple datasets that must be merged: we have information on transactions, stores, regions, holidays, and even oil pricing.

[1] https://www.retailwire.com/discussion/retailers-suffer-the-high-cost-of-overstocks-and-out-of-stocks/

[2] Alexis Cook, DanB, inversion, Ryan Holbrook. (2021). Store Sales - Time Series Forecasting. Kaggle. https://kaggle.com/competitions/store-sales-time-series-forecasting


#### 1.2 Approach<a id='1.2'></a>

We will use machine learning time series analysis to forecast sales of different types of items across dozens of stores. This will allow Favorita to become more efficient with its distribution of resources, and more likely to attract customers to purchase certain products at certain times. This analysis can also inform the company of the best times to offer discounts, whether to stock up on certain items, and knowledge of general market trends.


### 2.0 Explore the data<a id='2.0'></a>

In [154]:
# import needed modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

#### 2.1 Training data<a id='2.1'></a>

In [2]:
train_csv = pd.read_csv('./train.csv')

In [3]:
train_csv.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0


In [4]:
missing = pd.concat([train_csv.isnull().sum(), 100 * train_csv.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
id,0,0.0
date,0,0.0
store_nbr,0,0.0
family,0,0.0
sales,0,0.0
onpromotion,0,0.0
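
The same missing-value summary is computed for every dataset in this notebook, so it could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `missing_summary` and the toy dataframe are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing-value counts and percentages, largest first."""
    summary = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
    summary.columns = ["count", "%"]
    return summary.sort_values(by="count", ascending=False)


# Toy frame standing in for any of the CSV files loaded in this notebook.
toy = pd.DataFrame({"date": ["2013-01-01", "2013-01-02"], "price": [None, 93.14]})
toy_missing = missing_summary(toy)
print(toy_missing)
```

With such a helper, each later check would reduce to a single call such as `missing_summary(holidays)`.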


In [5]:
train_csv.dtypes

id               int64
date            object
store_nbr        int64
family          object
sales          float64
onpromotion      int64
dtype: object

In [6]:
train_csv["date"] = pd.to_datetime(train_csv["date"])
train_csv.dtypes

id                      int64
date           datetime64[ns]
store_nbr               int64
family                 object
sales                 float64
onpromotion             int64
dtype: object

In [7]:
train_df = train_csv.set_index('date')
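
As an alternative to converting the column and setting the index in separate cells, `pd.read_csv` can parse the dates and set the index while loading. This is a minimal sketch that reads from an in-memory string (an assumption made so the example is self-contained) rather than from `./train.csv`.

```python
import io

import pandas as pd

# Two rows copied from the head of the training data, held in memory.
csv_text = (
    "id,date,store_nbr,family,sales,onpromotion\n"
    "0,2013-01-01,1,AUTOMOTIVE,0.0,0\n"
    "1,2013-01-01,1,BABY CARE,0.0,0\n"
)

# parse_dates converts the column to datetime64; index_col makes it the index.
toy_train = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")
print(toy_train.index.dtype)
```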

In [8]:
train_df

Unnamed: 0_level_0,id,store_nbr,family,sales,onpromotion
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2013-01-01,0,1,AUTOMOTIVE,0.000,0
2013-01-01,1,1,BABY CARE,0.000,0
2013-01-01,2,1,BEAUTY,0.000,0
2013-01-01,3,1,BEVERAGES,0.000,0
2013-01-01,4,1,BOOKS,0.000,0
...,...,...,...,...,...
2017-08-15,3000883,9,POULTRY,438.133,0
2017-08-15,3000884,9,PREPARED FOODS,154.553,1
2017-08-15,3000885,9,PRODUCE,2419.729,148
2017-08-15,3000886,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [9]:
print("{} product families\n\n{}".format(len(train_df.family.unique()),train_df.family.value_counts()))

33 product families

AUTOMOTIVE                    90936
HOME APPLIANCES               90936
SCHOOL AND OFFICE SUPPLIES    90936
PRODUCE                       90936
PREPARED FOODS                90936
POULTRY                       90936
PLAYERS AND ELECTRONICS       90936
PET SUPPLIES                  90936
PERSONAL CARE                 90936
MEATS                         90936
MAGAZINES                     90936
LIQUOR,WINE,BEER              90936
LINGERIE                      90936
LAWN AND GARDEN               90936
LADIESWEAR                    90936
HOME CARE                     90936
HOME AND KITCHEN II           90936
BABY CARE                     90936
HOME AND KITCHEN I            90936
HARDWARE                      90936
GROCERY II                    90936
GROCERY I                     90936
FROZEN FOODS                  90936
EGGS                          90936
DELI                          90936
DAIRY                         90936
CLEANING                      90936
CELEBRA

In [10]:
print("{} distinct onpromotion values\n\n{}".format(len(train_df.onpromotion.unique()),train_df.onpromotion.value_counts()))

362 distinct onpromotion values

0      2389559
1       174551
2        79386
3        45862
4        31659
        ...   
313          1
452          1
642          1
305          1
425          1
Name: onpromotion, Length: 362, dtype: int64


In [11]:
train_df.sales.describe()

count    3.000888e+06
mean     3.577757e+02
std      1.101998e+03
min      0.000000e+00
25%      0.000000e+00
50%      1.100000e+01
75%      1.958473e+02
max      1.247170e+05
Name: sales, dtype: float64

Now we can determine the dimensions of the data and how many total time series exist in our dataset.

In [12]:
train_df

Unnamed: 0_level_0,id,store_nbr,family,sales,onpromotion
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2013-01-01,0,1,AUTOMOTIVE,0.000,0
2013-01-01,1,1,BABY CARE,0.000,0
2013-01-01,2,1,BEAUTY,0.000,0
2013-01-01,3,1,BEVERAGES,0.000,0
2013-01-01,4,1,BOOKS,0.000,0
...,...,...,...,...,...
2017-08-15,3000883,9,POULTRY,438.133,0
2017-08-15,3000884,9,PREPARED FOODS,154.553,1
2017-08-15,3000885,9,PRODUCE,2419.729,148
2017-08-15,3000886,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


Make a list of unique "family" names

In [13]:
family_values = train_df.family.unique()

Make a list of all unique store numbers

In [14]:
store_id = train_df.store_nbr.unique()

In [15]:
print("There are {} product categories over {} stores, meaning that there are {} total time series.".format(len(family_values),len(store_id),(len(family_values)*len(store_id))))

There are 33 product categories over 54 stores, meaning that there are 1782 total time series.


Each of the 54 stores has 33 time series associated with it (one per product category).
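
Multiplying the two counts assumes that every store reports every product family. A quick way to check that assumption is to count the store/family pairs that actually occur. The sketch below does this on a small toy frame (hypothetical data) with the same column names as the training set.

```python
import pandas as pd

# Toy data: 2 stores x 2 families x 2 days.
toy_train = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01"] * 4 + ["2013-01-02"] * 4),
    "store_nbr": [1, 1, 2, 2] * 2,
    "family": ["AUTOMOTIVE", "BOOKS"] * 4,
    "sales": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

# Number of (store, family) pairs that actually appear in the data.
n_series = toy_train.groupby(["store_nbr", "family"]).ngroups

# Number of pairs expected if every store carries every family.
n_expected = toy_train["store_nbr"].nunique() * toy_train["family"].nunique()

print(n_series, n_expected)
```

If the two numbers agree on the real data, the product of the counts is a valid tally of the time series.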

In [16]:
train_df.dtypes

id               int64
store_nbr        int64
family          object
sales          float64
onpromotion      int64
dtype: object

We are interested in 'sales' as a target, and its type is non-categorical and numeric (float64).

#### 2.2 Exogenous Variables<a id='2.2'></a>

Features other than time and the target are called "exogenous variables".

The datasets of holidays, oil, store information, and transactions could contain information pertinent to our analysis.

According to documentation:

Daily oil price includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

Holidays and events (e.g. Christmas, or an earthquake) would also affect how people choose to shop for groceries, as well as whether stores are open for business.

##### 2.2.1 Holidays & Events<a id='2.2.1'></a>

In [17]:
holidays = pd.read_csv('./holidays_events.csv')

In [18]:
holidays.head()

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [19]:
holidays.rename(columns={'type':'holiday_type'},inplace=True)

In [20]:
holidays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          350 non-null    object
 1   holiday_type  350 non-null    object
 2   locale        350 non-null    object
 3   locale_name   350 non-null    object
 4   description   350 non-null    object
 5   transferred   350 non-null    bool  
dtypes: bool(1), object(5)
memory usage: 14.1+ KB


The features of holiday type, locale, locale name, and transfer status are categorical. The description is not needed.

In [21]:
holidays["date"] = pd.to_datetime(holidays["date"])

In [22]:
# count missing values
missing = pd.concat([holidays.isnull().sum(), 100 * holidays.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
date,0,0.0
holiday_type,0,0.0
locale,0,0.0
locale_name,0,0.0
description,0,0.0
transferred,0,0.0


In [23]:
print("{} holiday types\n\n{}".format(len(holidays.holiday_type.unique()),holidays.holiday_type.value_counts()))

6 holiday types

Holiday       221
Event          56
Additional     51
Transfer       12
Bridge          5
Work Day        5
Name: holiday_type, dtype: int64


In [24]:
print("{} locales\n\n{}".format(len(holidays.locale.unique()),holidays.locale.value_counts()))

3 locales

National    174
Local       152
Regional     24
Name: locale, dtype: int64


In [25]:
print("{} locale names\n\n{}".format(len(holidays.locale_name.unique()),holidays.locale_name.value_counts()))

24 locale names

Ecuador                           174
Quito                              13
Riobamba                           12
Guaranda                           12
Latacunga                          12
Ambato                             12
Guayaquil                          11
Cuenca                              7
Ibarra                              7
Salinas                             6
Loja                                6
Santa Elena                         6
Santo Domingo de los Tsachilas      6
Quevedo                             6
Manta                               6
Esmeraldas                          6
Cotopaxi                            6
El Carmen                           6
Santo Domingo                       6
Machala                             6
Imbabura                            6
Puyo                                6
Libertad                            6
Cayambe                             6
Name: locale_name, dtype: int64


There appears to be a mix of City, State, and Country locale names here.

In [26]:
list(holidays[holidays.locale == "Local"].locale_name.unique())

['Manta',
 'Cuenca',
 'Libertad',
 'Riobamba',
 'Puyo',
 'Guaranda',
 'Latacunga',
 'Machala',
 'Santo Domingo',
 'El Carmen',
 'Cayambe',
 'Esmeraldas',
 'Ambato',
 'Ibarra',
 'Quevedo',
 'Quito',
 'Loja',
 'Salinas',
 'Guayaquil']

"Local" locale refers to a city holiday.

In [27]:
list(holidays[holidays.locale == "Regional"].locale_name.unique())

['Cotopaxi', 'Imbabura', 'Santo Domingo de los Tsachilas', 'Santa Elena']

"Regional" locale refers to a state (province) holiday.

In [28]:
list(holidays[holidays.locale == "National"].locale_name.unique())

['Ecuador']

Logically, all "National" holidays are of the country locale of Ecuador.

Therefore, we can make a new dataframe for each type of locale, for use later when merging.

In [29]:
local_holiday = holidays[holidays.locale == "Local"]
regional_holiday = holidays[holidays.locale == "Regional"]
national_holiday = holidays[holidays.locale == "National"]

In [30]:
print("{} transfer outcomes\n\n{}".format(len(holidays.transferred.unique()),holidays.transferred.value_counts()))

2 transfer outcomes

False    338
True      12
Name: transferred, dtype: int64


##### 2.2.2 Oil<a id='2.2.2'></a>

In [31]:
oil = pd.read_csv('./oil.csv')

In [32]:
oil.head()

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2


In [33]:
oil.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1175 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB


The feature of oil price is numeric (float64), and will have its own time series available since the price across time is a continuous variable.

In [34]:
oil.describe()

Unnamed: 0,dcoilwtico
count,1175.0
mean,67.714366
std,25.630476
min,26.19
25%,46.405
50%,53.19
75%,95.66
max,110.62


In [35]:
missing = pd.concat([oil.isnull().sum(), 100 * oil.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
dcoilwtico,43,3.530378
date,0,0.0
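
The oil series has missing prices, and the head of the table also skips calendar days (2013-01-04 is followed by 2013-01-07), whereas the sales data is daily. One possible way to align the two, shown here only as a sketch and not as the approach taken later in this project, is to reindex onto a full daily calendar and interpolate. The toy values are copied from the head of `oil.csv`; the back-fill of the leading gap is an assumption.

```python
import numpy as np
import pandas as pd

toy_oil = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"],
    "dcoilwtico": [np.nan, 93.14, 92.97, 93.12, 93.20],
})
toy_oil["date"] = pd.to_datetime(toy_oil["date"])

# Build a gap-free daily calendar covering the observed range.
full_range = pd.date_range(toy_oil["date"].min(), toy_oil["date"].max(), freq="D")
oil_daily = toy_oil.set_index("date").reindex(full_range)
oil_daily.index.name = "date"

# Linear interpolation fills interior gaps; bfill covers the leading NaN.
oil_daily["dcoilwtico"] = oil_daily["dcoilwtico"].interpolate(method="linear").bfill()
print(oil_daily)
```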


##### 2.2.3 Store Information<a id='2.2.3'></a>

In [36]:
stores = pd.read_csv('./stores.csv')

In [37]:
stores.head()

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [38]:
stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB


The store features of cluster, city, state, and type are all categorical.

In [39]:
# count missing values
missing = pd.concat([stores.isnull().sum(), 100 * stores.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
store_nbr,0,0.0
city,0,0.0
state,0,0.0
type,0,0.0
cluster,0,0.0


In [40]:
print("{} clusters\n\n{}".format(len(stores.cluster.unique()),stores.cluster.value_counts()))


17 clusters

3     7
6     6
10    6
15    5
13    4
14    4
11    3
4     3
8     3
1     3
9     2
7     2
2     2
12    1
5     1
16    1
17    1
Name: cluster, dtype: int64


In [41]:
print("{} cities\n\n{}".format(len(stores.city.unique()),stores.city.value_counts()))


22 cities

Quito            18
Guayaquil         8
Cuenca            3
Santo Domingo     3
Manta             2
Latacunga         2
Machala           2
Ambato            2
Quevedo           1
Esmeraldas        1
Loja              1
Libertad          1
Playas            1
Daule             1
Babahoyo          1
Salinas           1
Puyo              1
Guaranda          1
Ibarra            1
Riobamba          1
Cayambe           1
El Carmen         1
Name: city, dtype: int64


In [42]:
print("{} types\n\n{}".format(len(stores.type.unique()),stores.type.value_counts()))


5 types

D    18
C    15
A     9
B     8
E     4
Name: type, dtype: int64


In [43]:
print("{} states\n\n{}".format(len(stores.state.unique()),stores.state.value_counts()))

16 states

Pichincha                         19
Guayas                            11
Santo Domingo de los Tsachilas     3
Azuay                              3
Manabi                             3
Cotopaxi                           2
Tungurahua                         2
Los Rios                           2
El Oro                             2
Chimborazo                         1
Imbabura                           1
Bolivar                            1
Pastaza                            1
Santa Elena                        1
Loja                               1
Esmeraldas                         1
Name: state, dtype: int64


##### 2.2.4 Transaction data<a id='2.2.4'></a>

In [44]:
transactions = pd.read_csv('./transactions.csv')

In [45]:
transactions

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922
...,...,...,...
83483,2017-08-15,50,2804
83484,2017-08-15,51,1573
83485,2017-08-15,52,2255
83486,2017-08-15,53,932


In [46]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83488 entries, 0 to 83487
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          83488 non-null  object
 1   store_nbr     83488 non-null  int64 
 2   transactions  83488 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.9+ MB


The feature of transactions is numeric (int64) and continuous, so it could be its own time series.

In [47]:
transactions["date"] = pd.to_datetime(transactions["date"])

In [48]:
transactions.transactions.describe()

count    83488.000000
mean      1694.602158
std        963.286644
min          5.000000
25%       1046.000000
50%       1393.000000
75%       2079.000000
max       8359.000000
Name: transactions, dtype: float64

In [49]:
missing = pd.concat([transactions.isnull().sum(), 100 * transactions.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
date,0,0.0
store_nbr,0,0.0
transactions,0,0.0


In [50]:
print("{} stores\n\n{}".format(len(transactions.store_nbr.unique()),transactions.store_nbr.value_counts()))

54 stores

39    1678
38    1678
26    1678
31    1678
33    1678
34    1678
37    1678
27    1677
28    1677
32    1677
23    1677
40    1677
41    1677
44    1677
45    1677
46    1677
47    1677
48    1677
50    1677
51    1677
49    1677
2     1677
16    1677
5     1677
54    1676
3     1676
4     1676
6     1676
8     1676
9     1676
19    1676
35    1676
13    1676
1     1676
15    1676
11    1676
10    1675
7     1675
17    1674
43    1672
30    1655
14    1638
12    1616
25    1615
24    1577
18    1566
36    1551
53    1167
20     909
29     874
21     748
42     720
22     671
52     118
Name: store_nbr, dtype: int64
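
The stores have different numbers of transaction records, so a later left merge onto the daily sales data will leave gaps for some store/date pairs. The sketch below shows one way to count those gaps per store by comparing against a full store-by-date grid. The toy rows are illustrative (the third row is made up), and the variable names are hypothetical.

```python
import pandas as pd

toy_tx = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-02"]),
    "store_nbr": [25, 1, 25],
    "transactions": [770, 2111, 1038],  # the last value is invented for illustration
})

# Every store paired with every calendar day in the observed range.
all_dates = pd.date_range(toy_tx["date"].min(), toy_tx["date"].max(), freq="D")
grid = pd.MultiIndex.from_product(
    [toy_tx["store_nbr"].unique(), all_dates], names=["store_nbr", "date"]
).to_frame(index=False)

# A left merge keeps every grid row; absent records show up as NaN.
coverage = grid.merge(toy_tx, on=["store_nbr", "date"], how="left")
missing_days = coverage["transactions"].isna().groupby(coverage["store_nbr"]).sum()
print(missing_days)
```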


#### 2.3 Identify the resolution of the time series<a id='2.3'></a>

In [51]:
train_df.index.unique()

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10',
               ...
               '2017-08-06', '2017-08-07', '2017-08-08', '2017-08-09',
               '2017-08-10', '2017-08-11', '2017-08-12', '2017-08-13',
               '2017-08-14', '2017-08-15'],
              dtype='datetime64[ns]', name='date', length=1684, freq=None)

In [52]:
latest_date = train_df.index.unique().max()
latest_date

Timestamp('2017-08-15 00:00:00')

In [53]:
earliest_date = train_df.index.unique().min()
earliest_date

Timestamp('2013-01-01 00:00:00')

In [54]:
latest_date-earliest_date

Timedelta('1687 days 00:00:00')

The training data spans 1687 days between its first and last dates and contains 1684 unique dates, so it has approximately one timepoint per day.

<i>We have determined that the resolution is daily.</i>

When accounting for leap years, the number of days per year is ~365.25.
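
The date span and the number of unique dates do not match exactly, which suggests that a few calendar days are absent. A short sketch of how the absent dates could be listed is shown below, using a toy index (hypothetical dates) in place of `train_df.index.unique()`.

```python
import pandas as pd

# Toy index with one calendar day left out on purpose.
toy_dates = pd.DatetimeIndex(["2013-01-01", "2013-01-02", "2013-01-04", "2013-01-05"])

# Full daily calendar between the first and last observed dates.
expected = pd.date_range(toy_dates.min(), toy_dates.max(), freq="D")

# Dates in the calendar that never appear in the data.
missing_dates = expected.difference(toy_dates)
print(missing_dates)
```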



#### 2.4 Forecasting horizon<a id='2.4'></a>

In [55]:
test_csv = pd.read_csv('./test.csv')

In [56]:
test_csv

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0
...,...,...,...,...,...
28507,3029395,2017-08-31,9,POULTRY,1
28508,3029396,2017-08-31,9,PREPARED FOODS,0
28509,3029397,2017-08-31,9,PRODUCE,1
28510,3029398,2017-08-31,9,SCHOOL AND OFFICE SUPPLIES,9


In [57]:
len(test_csv.date.unique())

16

The forecasting horizon for this project is 16 days, so we will predict sales for the 16 days following the end of the data in the training dataframe.
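
As a sanity check, the horizon can also be derived from the date range itself. The sketch below rebuilds the test dates shown above (2017-08-16 through 2017-08-31) instead of reading `./test.csv`, and confirms that they are contiguous and start the day after the training data ends.

```python
import pandas as pd

train_end = pd.Timestamp("2017-08-15")  # latest date in the training data
test_dates = pd.Series(pd.date_range("2017-08-16", "2017-08-31", freq="D"))

horizon_days = test_dates.nunique()
gap_days = (test_dates.min() - train_end).days

# Contiguous if the span in days equals the number of unique dates.
is_contiguous = (test_dates.max() - test_dates.min()).days + 1 == horizon_days
print(horizon_days, gap_days, is_contiguous)
```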

### 3.0 Combining pertinent features into the data<a id='3.0'></a>

The exogenous variables can now be integrated into the data.

#### 3.1 Merging training data with geographical store data<a id='3.1'></a>

First, we combine the train_csv sales data with the geographical store information.

In [58]:
store_info_1 = train_csv.merge(stores,on=['store_nbr'],how='left')
store_info_1

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13
...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6


#### 3.2 Merging current data with holiday and event data<a id='3.2'></a>

Next, we combine the prior dataframe with holiday information; local first.

In [59]:
lh = local_holiday[local_holiday.transferred==False]

In [60]:
store_info_2 = store_info_1.merge(lh,how='left',right_on=['locale_name','date'],
                                  left_on=['city','date']).drop(['locale_name','description'], axis=1)

store_info_2

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type,locale,transferred
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,,,
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,,,
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,,,
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3001147,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,,,
3001148,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,,,
3001149,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,,,
3001150,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,,,


Unfortunately, there are more rows in the new dataframe than the prior one. We need to investigate and correct this issue.

Since the prior and new dataframes both start and end on the same 'id', there must be dates where the holiday dataframe assigns multiple values.

We make a list of 'id's that appear more than once in the new dataframe.

In [61]:
test = store_info_2.id.value_counts() > 1
test


2312781     True
2312104     True
2312096     True
2312097     True
2312098     True
           ...  
1000298    False
1000299    False
1000300    False
1000301    False
3000887    False
Name: id, Length: 3000888, dtype: bool

Contracting the dataframe into only unique values and their number of occurrences has reduced the length of the dataframe to the original size (3,000,888 rows).

In [62]:
test2 = pd.DataFrame(store_info_2[(store_info_2.date > '2012-01-01') & (store_info_2.id.value_counts() > 1)])
test2.date.value_counts()

2016-07-24    264
Name: date, dtype: int64

It appears that all 264 duplicated rows fall on a single date, 2016-07-24, where more than one holiday record was matched to the same sales row.
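
The filter above combines two boolean Series with different indexes (row positions versus `id` values), so it relies on index alignment. A more direct way to find the inflated rows, sketched here on toy data with hypothetical names, is `Series.duplicated(keep=False)` on the merged frame, or the same check on the merge keys of the holiday table.

```python
import pandas as pd

toy_sales = pd.DataFrame({
    "id": [0, 1, 2],
    "date": pd.to_datetime(["2016-07-23", "2016-07-24", "2016-07-24"]),
    "city": ["Guayaquil", "Guayaquil", "Quito"],
})
toy_holidays = pd.DataFrame({
    "date": pd.to_datetime(["2016-07-24", "2016-07-24"]),
    "locale_name": ["Guayaquil", "Guayaquil"],
    "holiday_type": ["Additional", "Transfer"],
})

merged = toy_sales.merge(
    toy_holidays, how="left",
    left_on=["city", "date"], right_on=["locale_name", "date"],
)

# Rows of the merged frame whose id appears more than once.
dup_rows = merged[merged["id"].duplicated(keep=False)]

# The root cause: holiday records sharing the same (locale_name, date) key.
dup_keys = toy_holidays[toy_holidays.duplicated(subset=["locale_name", "date"], keep=False)]
print(dup_rows)
print(dup_keys)
```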

In [63]:
test2[(test2.date == '2016-07-24') & (test2.id.value_counts() > 1)]



Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type,locale,transferred
2311782,2311782,2016-07-24,24,AUTOMOTIVE,2.000,0,Guayaquil,Guayas,D,1,Additional,Local,False
2311783,2311782,2016-07-24,24,AUTOMOTIVE,2.000,0,Guayaquil,Guayas,D,1,Transfer,Local,False
2311784,2311783,2016-07-24,24,BABY CARE,1.000,0,Guayaquil,Guayas,D,1,Additional,Local,False
2311785,2311783,2016-07-24,24,BABY CARE,1.000,0,Guayaquil,Guayas,D,1,Transfer,Local,False
2311786,2311784,2016-07-24,24,BEAUTY,6.000,1,Guayaquil,Guayas,D,1,Additional,Local,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2312024,2311936,2016-07-24,28,"LIQUOR,WINE,BEER",640.000,0,Guayaquil,Guayas,E,10,Additional,Local,False
2312025,2311936,2016-07-24,28,"LIQUOR,WINE,BEER",640.000,0,Guayaquil,Guayas,E,10,Transfer,Local,False
2312026,2311937,2016-07-24,28,MAGAZINES,2.000,0,Guayaquil,Guayas,E,10,Additional,Local,False
2312027,2311937,2016-07-24,28,MAGAZINES,2.000,0,Guayaquil,Guayas,E,10,Transfer,Local,False


According to documentation,

"A holiday that is transferred officially falls on that calendar day, but was moved to another date by the government. A transferred day is more like a normal day than a holiday...

Additional holidays are days added a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday)."

The issue is that Simón Bolívar Day falls on 07/24 every year, so it cannot be a transfer to that date.

Therefore, we will eliminate the local holiday rows that are categorized as "Transfer", which removes the conflicting entry on 2016-07-24.

In [64]:
local_holiday2 = local_holiday[local_holiday['holiday_type'].str.contains("Transfer")==False]

In [65]:
lh2 = local_holiday2[local_holiday2.transferred==False]

In [66]:
store_info_2b = store_info_1.merge(lh2,how='left',right_on=['locale_name','date'],
                                  left_on=['city','date']).drop(['locale_name','description'], axis=1)

store_info_2b

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type,locale,transferred
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,,,
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,,,
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,,,
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,,,
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,,,
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,,,
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,,,


Now we have the correct dataframe size, so we can continue by merging the regional holidays.
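
Row inflation of this kind can be caught at merge time rather than by inspecting the result afterwards. The sketch below uses the `validate="many_to_one"` option of `DataFrame.merge`, which raises a `MergeError` when the right-hand keys are not unique. The toy frames and the helper name `safe_left_merge` are hypothetical.

```python
import pandas as pd
from pandas.errors import MergeError

toy_sales = pd.DataFrame({
    "id": [0, 1],
    "date": pd.to_datetime(["2016-07-24", "2016-07-24"]),
    "city": ["Guayaquil", "Quito"],
})
toy_holidays = pd.DataFrame({
    "date": pd.to_datetime(["2016-07-24", "2016-07-24"]),
    "locale_name": ["Guayaquil", "Guayaquil"],
    "holiday_type": ["Additional", "Transfer"],
})


def safe_left_merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Left-merge on city/date; raise MergeError if the right keys repeat."""
    return left.merge(
        right, how="left",
        left_on=["city", "date"], right_on=["locale_name", "date"],
        validate="many_to_one",
    )


# With the duplicate key present, the merge is rejected.
try:
    safe_left_merge(toy_sales, toy_holidays)
    raised = False
except MergeError:
    raised = True

# After dropping the "Transfer" row, the merge preserves the row count.
deduped = toy_holidays[toy_holidays["holiday_type"] != "Transfer"]
merged = safe_left_merge(toy_sales, deduped)
print(raised, len(merged) == len(toy_sales))
```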

In [67]:
store_info_3 = store_info_2b.merge(regional_holiday,how='left',right_on=['locale_name','date'],
                                  left_on=['state','date']).drop(['locale_name','description'], axis=1)

store_info_3

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type_x,locale_x,transferred_x,holiday_type_y,locale_y,transferred_y
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,,,,,,
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,,,,
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,,,,,,
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,,,,,,
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,,,,,,
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,,,,,,
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,,,,,,
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,,,,,,


In [68]:
national_holiday

Unnamed: 0,date,holiday_type,locale,locale_name,description,transferred
14,2012-08-10,Holiday,National,Ecuador,Primer Grito de Independencia,False
19,2012-10-09,Holiday,National,Ecuador,Independencia de Guayaquil,True
20,2012-10-12,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
21,2012-11-02,Holiday,National,Ecuador,Dia de Difuntos,False
22,2012-11-03,Holiday,National,Ecuador,Independencia de Cuenca,False
...,...,...,...,...,...,...
345,2017-12-22,Additional,National,Ecuador,Navidad-3,False
346,2017-12-23,Additional,National,Ecuador,Navidad-2,False
347,2017-12-24,Additional,National,Ecuador,Navidad-1,False
348,2017-12-25,Holiday,National,Ecuador,Navidad,False


In [69]:
nh_count = national_holiday.date.value_counts()

In [70]:
nh_count[nh_count>1].index

DatetimeIndex(['2012-12-31', '2014-12-26', '2016-05-08', '2016-05-07',
               '2012-12-24', '2016-05-01'],
              dtype='datetime64[ns]', freq=None)

In [71]:
national_holiday[national_holiday.date.isin(nh_count[nh_count>1].index)]

Unnamed: 0,date,holiday_type,locale,locale_name,description,transferred
35,2012-12-24,Bridge,National,Ecuador,Puente Navidad,False
36,2012-12-24,Additional,National,Ecuador,Navidad-1,False
39,2012-12-31,Bridge,National,Ecuador,Puente Primer dia del ano,False
40,2012-12-31,Additional,National,Ecuador,Primer dia del ano-1,False
156,2014-12-26,Bridge,National,Ecuador,Puente Navidad,False
157,2014-12-26,Additional,National,Ecuador,Navidad+1,False
235,2016-05-01,Holiday,National,Ecuador,Dia del Trabajo,False
236,2016-05-01,Event,National,Ecuador,Terremoto Manabi+15,False
242,2016-05-07,Additional,National,Ecuador,Dia de la Madre-1,False
243,2016-05-07,Event,National,Ecuador,Terremoto Manabi+21,False


In [72]:
nh_not_trans = national_holiday[national_holiday.transferred!=True]

In [73]:
nh_dates = nh_not_trans.date.unique()

In [74]:
store_info_3["national_holiday"] = store_info_3.date.isin(nh_dates)

In [75]:
store_info_3

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type_x,locale_x,transferred_x,holiday_type_y,locale_y,transferred_y,national_holiday
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,,,,,,,True
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,,,,,True
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,,,,,,,True
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,,,,,,,True
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,,,,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,,,,,,,False
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,,,,,,,False
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,,,,,,,False
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,,,,,,,False


In [76]:
store_df = store_info_3.rename(columns={'holiday_type_x':'local_holiday','holiday_type_y':'regional_holiday'}).drop(['locale_x','transferred_x','locale_y','transferred_y'], axis=1)

In [77]:
store_df = store_df.fillna({'local_holiday':'None', 'regional_holiday':'None'})
store_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,local_holiday,regional_holiday,national_holiday
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,,,True
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,True
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,,,True
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,,,True
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,,,False
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,,,False
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,,,False
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,,,False


In [78]:
store_df.local_holiday.value_counts()

None          2989107
Holiday          8085
Additional       3696
Name: local_holiday, dtype: int64

In [79]:
store_df.regional_holiday.value_counts()

None       2999865
Holiday       1023
Name: regional_holiday, dtype: int64

In [80]:
store_df.local_holiday = store_df.local_holiday.replace({'None':0,'Holiday':1,'Additional':2})
store_df.regional_holiday = store_df.regional_holiday.replace({'None':0,'Holiday':1})
store_df.national_holiday = store_df.national_holiday.replace({False:0,True:1})
store_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,local_holiday,regional_holiday,national_holiday
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,0,0,1
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,0,0,1
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,0,0,1
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,0,0,1
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,0,0,0
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,0,0,0
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,0,0,0
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,0,0,0


#### 3.3 Merging current data with transaction data<a id='3.3'></a>

In [81]:
transactions

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922
...,...,...,...
83483,2017-08-15,50,2804
83484,2017-08-15,51,1573
83485,2017-08-15,52,2255
83486,2017-08-15,53,932


In [82]:
store_df2 = store_df.merge(transactions,how='left',on=['store_nbr','date'])

store_df2

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,local_holiday,regional_holiday,national_holiday,transactions
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,0,0,1,
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,0,0,1,
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,0,0,1,
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,0,0,1,
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,0,0,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,0,0,0,2155.0
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,0,0,0,2155.0
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,0,0,0,2155.0
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,0,0,0,2155.0


In [83]:
# count missing values
missing = pd.concat([store_df2.isnull().sum(), 100 * store_df2.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
transactions,245784,8.190376
id,0,0.0
date,0,0.0
store_nbr,0,0.0
family,0,0.0
sales,0,0.0
onpromotion,0,0.0
city,0,0.0
state,0,0.0
type,0,0.0


In [84]:
# checksum of missing data
round(245784/(8.190376/100))

3000888

We will need to account for the missing transaction data.
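As a hedged illustration of where these gaps come from, the sketch below repeats the left merge on a tiny made-up table. The frames `sales` and `tx` and their values are assumptions for illustration only; `validate` and `indicator` are standard `DataFrame.merge` options that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for store_df and transactions (values are made up).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02"]),
    "store_nbr": [1, 1, 1, 1],
    "family": ["AUTOMOTIVE", "BEVERAGES", "AUTOMOTIVE", "BEVERAGES"],
    "sales": [0.0, 0.0, 2.0, 1091.0],
})
tx = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02"]),
    "store_nbr": [1],
    "transactions": [2111],
})

# validate="many_to_one" raises if tx has duplicate (store_nbr, date) keys;
# indicator=True adds a _merge column marking rows with no match in tx.
merged = sales.merge(tx, how="left", on=["store_nbr", "date"],
                     validate="many_to_one", indicator=True)

# Rows that found no transaction record end up with NaN in 'transactions'.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["date", "store_nbr", "family", "transactions"]])
```

Every row whose store/date pair is absent from the transaction table comes back as `left_only` with a NaN transaction count, which is the pattern counted in the missing-value table above.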

With 33 product families available, a store can record transactions on a given day while some product families have no sales at all.

Conversely, a store cannot record transactions on a given day if it has no sales in any product family.

To find the problem rows, we can condense every product family into a single row per store and date: we group by date and store number, then sum the sales and the transactions separately. From there, we can pinpoint occurrences of sales without transactions, or transactions without sales.
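A minimal sketch of this check on made-up data may help. The frame `toy` and its values are assumptions for illustration. Note that `sum()` turns a group whose transactions are all NaN into 0.0, which is why the filters compare against zero rather than testing for NaN.

```python
import numpy as np
import pandas as pd

# Toy data: one store, three days, two product families per day (values are made up).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-06-19"] * 2 + ["2013-06-20"] * 2 + ["2013-12-25"] * 2),
    "store_nbr": [10] * 6,
    "sales": [3.0, 515.0, 1.0, 400.0, 0.0, 0.0],
    "transactions": [np.nan, np.nan, 1200.0, 1200.0, np.nan, np.nan],
})

# One row per store/date. The daily transaction count is repeated on every family row,
# so the summed value is inflated, but it is still zero exactly when no count was recorded.
agg = toy.groupby(["date", "store_nbr"])[["sales", "transactions"]].sum().reset_index()

sales_without_tx = agg[(agg["transactions"] == 0) & (agg["sales"] != 0)]
tx_without_sales = agg[(agg["transactions"] != 0) & (agg["sales"] == 0)]
print(sales_without_tx)
```

Here only 2013-06-19 is flagged: it has sales but no transaction count. The day with neither sales nor transactions (2013-12-25) is left alone.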

In [85]:
agg_store_df2 = store_df2.groupby(['date','store_nbr','cluster']).sum(numeric_only=True).reset_index()
agg_store_df2

Unnamed: 0,date,store_nbr,cluster,id,sales,onpromotion,local_holiday,regional_holiday,national_holiday,transactions
0,2013-01-01,1,13,528,0.000000,0,0,0,33,0.0
1,2013-01-01,2,13,12507,0.000000,0,0,0,33,0.0
2,2013-01-01,3,8,24486,0.000000,0,0,0,33,0.0
3,2013-01-01,4,9,36465,0.000000,0,0,0,33,0.0
4,2013-01-01,5,4,48444,0.000000,0,0,0,33,0.0
...,...,...,...,...,...,...,...,...,...,...
90931,2017-08-15,50,14,99020031,16879.121004,150,0,0,0,92532.0
90932,2017-08-15,51,17,99021120,20154.559000,127,0,0,0,51909.0
90933,2017-08-15,52,11,99022209,18600.046000,142,0,0,0,74415.0
90934,2017-08-15,53,13,99023298,8208.189000,114,0,0,0,30756.0


After aggregating each store and date combination, summing both sales and transactions, we see that there are 90936 total datapoints to consider.

Now we can filter results for store/date combinations that have sales without transactions.

In [86]:
missing_transactions = agg_store_df2[(agg_store_df2['transactions']==0)&(agg_store_df2['sales']!=0)]

missing_transactions

Unnamed: 0,date,store_nbr,cluster,id,sales,onpromotion,local_holiday,regional_holiday,national_holiday,transactions
9135,2013-06-19,10,15,9939831,3802.291998,0,0,0,0,0.0
9160,2013-06-19,35,3,9969234,1699.048000,0,0,0,0,0.0
9168,2013-06-19,43,10,9979035,4642.495001,0,0,0,0,0.0
9179,2013-06-19,54,3,9992103,2977.852000,0,0,0,0,0.0
19741,2014-01-02,32,3,21491943,3146.146702,0,0,0,0,0.0
...,...,...,...,...,...,...,...,...,...,...
59180,2016-01-04,51,17,64443192,28280.580970,33,0,0,0,0.0
59182,2016-01-04,53,13,64445370,8702.973100,45,0,0,0,0.0
59183,2016-01-04,54,3,64446459,8711.512998,21,0,0,0,0.0
73554,2016-09-27,7,8,80149839,19783.335000,164,0,0,0,0.0


There appear to be 118 store/date combinations that have recorded sales but no transaction data.

Now we can filter results for store/date combinations that have transactions without sales.

In [87]:
missing_sales = agg_store_df2[(agg_store_df2['transactions']!=0)&(agg_store_df2['sales']==0)]

missing_sales

Unnamed: 0,date,store_nbr,cluster,id,sales,onpromotion,local_holiday,regional_holiday,national_holiday,transactions


Thankfully, there are no store/date combinations that have transactions without sales.

Sales without transactions should be impossible, and leaving these records as they are could make our forecasts less accurate. Therefore, we will need to fill in these particular zero values. My preferred strategy is to use the store's average transaction count for that month.

First we will make a list of tuples consisting of the store and date of each store/date combination in 'missing_transactions' so that we can fill in the main dataframe with the replacement information.

In [88]:
fill_trans = list(zip(missing_transactions['date'],missing_transactions['store_nbr']))

In [89]:
fill_trans[0:5]

[(Timestamp('2013-06-19 00:00:00'), 10),
 (Timestamp('2013-06-19 00:00:00'), 35),
 (Timestamp('2013-06-19 00:00:00'), 43),
 (Timestamp('2013-06-19 00:00:00'), 54),
 (Timestamp('2014-01-02 00:00:00'), 32)]

Next, we change all of the dates to month/year on a copy of the present main dataframe, in order to prepare the data for aggregation.

In [90]:
sdf2 = store_df2.copy()
sdf2['date'] = sdf2['date'].dt.strftime('%m/%Y')

In [91]:
sdf2

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,local_holiday,regional_holiday,national_holiday,transactions
0,0,01/2013,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,0,0,1,
1,1,01/2013,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,0,0,1,
2,2,01/2013,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,0,0,1,
3,3,01/2013,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,0,0,1,
4,4,01/2013,1,BOOKS,0.000,0,Quito,Pichincha,D,13,0,0,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,08/2017,9,POULTRY,438.133,0,Quito,Pichincha,B,6,0,0,0,2155.0
3000884,3000884,08/2017,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,0,0,0,2155.0
3000885,3000885,08/2017,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,0,0,0,2155.0
3000886,3000886,08/2017,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,0,0,0,2155.0


Then, we can aggregate the data by store and the new month/year date, taking the monthly mean of the transactions. We can ignore 'family' since the transaction data is not broken down to that level.

In [92]:
agg_pf = pd.DataFrame()

for date,store in fill_trans:
    # rows for this store in the month that contains the missing date
    month_rows = sdf2[(sdf2['date']==date.strftime('%m/%Y'))&(sdf2['store_nbr']==store)]
    # numeric_only=True skips text columns such as 'family' and 'city'
    agg_df = month_rows.groupby(['date','store_nbr','cluster']).mean(numeric_only=True).reset_index()
    # prepend each result, so the latest store/months come first
    agg_pf = pd.concat([agg_df,agg_pf])

agg_pf

Unnamed: 0,date,store_nbr,cluster,id,sales,onpromotion,local_holiday,regional_holiday,national_holiday,transactions
0,09/2016,23,9,2407102.0,233.862334,4.350505,0.0,0.0,0.000000,994.241379
0,09/2016,7,8,2408290.0,627.627939,5.387879,0.0,0.0,0.000000,1806.931034
0,01/2016,54,3,1974307.0,219.041280,1.260020,0.0,0.0,0.032258,835.642857
0,01/2016,53,13,1974274.0,242.757217,2.375367,0.0,0.0,0.032258,866.107143
0,01/2016,51,17,1974208.0,702.108862,2.761486,0.0,0.0,0.032258,1695.250000
...,...,...,...,...,...,...,...,...,...,...
0,01/2014,32,3,676219.0,92.070826,0.000000,0.0,0.0,0.032258,594.620690
0,06/2013,54,3,296554.0,131.083805,0.000000,0.0,0.0,0.000000,890.482759
0,06/2013,43,10,296158.0,172.018247,0.000000,0.0,0.0,0.000000,1210.448276
0,06/2013,35,3,295861.0,72.500466,0.000000,0.0,0.0,0.000000,543.620690


Excellent, now we have the means of each store/month's transaction totals. The next step is to isolate the non-aggregated rows which match our criteria.

In [121]:
missing_tx_list = pd.DataFrame()
missing_tx_list2 = []    
for a,b in fill_trans:
    missing_tx_list2 = pd.DataFrame(store_df2[(store_df2['date']==a)&(store_df2['store_nbr']==b)])
    missing_tx_list = pd.concat([missing_tx_list,missing_tx_list2])
    
missing_tx_list

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,local_holiday,regional_holiday,national_holiday,transactions
301191,301191,2013-06-19,10,AUTOMOTIVE,3.000,0,Quito,Pichincha,C,15,0,0,0,
301192,301192,2013-06-19,10,BABY CARE,0.000,0,Quito,Pichincha,C,15,0,0,0,
301193,301193,2013-06-19,10,BEAUTY,0.000,0,Quito,Pichincha,C,15,0,0,0,
301194,301194,2013-06-19,10,BEVERAGES,515.000,0,Quito,Pichincha,C,15,0,0,0,
301195,301195,2013-06-19,10,BOOKS,0.000,0,Quito,Pichincha,C,15,0,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2427607,2427607,2016-09-27,23,POULTRY,339.849,0,Ambato,Tungurahua,D,9,0,0,0,
2427608,2427608,2016-09-27,23,PREPARED FOODS,40.498,0,Ambato,Tungurahua,D,9,0,0,0,
2427609,2427609,2016-09-27,23,PRODUCE,1144.772,1,Ambato,Tungurahua,D,9,0,0,0,
2427610,2427610,2016-09-27,23,SCHOOL AND OFFICE SUPPLIES,0.000,0,Ambato,Tungurahua,D,9,0,0,0,


There are 3894 rows that have missing transaction values, yet whose store/date involves sales.

Next, we zip together another group of information consisting of the date, store number, and new transaction amount.

In [155]:
fill_trans2 = list(zip(agg_pf['date'],agg_pf['store_nbr'],agg_pf['transactions'].apply(np.int64)))
fill_trans2[0:5]

[('09/2016', 23, 994),
 ('09/2016', 7, 1806),
 ('01/2016', 54, 835),
 ('01/2016', 53, 866),
 ('01/2016', 51, 1695)]

Finally, we use this list in a for loop to fill any inappropriately missing values with the appropriate value.

In [136]:
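to_fill = store_df2.index.isin(missing_tx_list.index)  # only the rows isolated above; values recorded on other days of the month are kept
for date,store,transaction in fill_trans2:
    in_month = (store_df2['date'].dt.strftime('%m/%Y')==date)&(store_df2['store_nbr']==store)
    store_df2.loc[in_month&to_fill,'transactions']=transaction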
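The same fill can also be sketched without explicit loops. This is an alternative under stated assumptions, not the method used in this notebook: the frame `toy` and its values are made up, the monthly key is built with `dt.to_period('M')` instead of a formatted string, and rows are selected with a boolean mask instead of a list of tuples.

```python
import numpy as np
import pandas as pd

# Toy data: one store, four days in one month, two product families per day.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-06-01"] * 2 + ["2013-06-18"] * 2
                           + ["2013-06-19"] * 2 + ["2013-06-20"] * 2),
    "store_nbr": [10] * 8,
    "sales": [0.0, 0.0, 5.0, 900.0, 0.0, 515.0, 4.0, 1100.0],
    "transactions": [np.nan, np.nan, 1000.0, 1000.0, np.nan, np.nan, 1200.0, 1200.0],
})

# Mean transactions per store and calendar month, broadcast back to every row.
month = toy["date"].dt.to_period("M").rename("month")
monthly_mean = toy.groupby(["store_nbr", month])["transactions"].transform("mean")

# Total sales per store and day, so a family with zero sales is still filled
# when the store as a whole sold something that day.
day_sales = toy.groupby(["store_nbr", "date"])["sales"].transform("sum")

# Fill only days that have sales but no transaction count; days without sales get 0.
needs_fill = toy["transactions"].isna() & (day_sales != 0)
toy.loc[needs_fill, "transactions"] = monthly_mean[needs_fill]
toy["transactions"] = toy["transactions"].fillna(0).astype("int64")
print(toy)
```

In this toy month the recorded counts average 1100, so both rows of 2013-06-19 receive 1100, while 2013-06-01 (no sales at all) is set to 0 and the recorded days are unchanged.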


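The remaining NaN transactions belong to store/date combinations with no sales at all (presumably days when the store was closed), so to put the finishing touches on the data we can fill them in with 0.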

In [141]:
store_df2['transactions'] = store_df2['transactions'].fillna(0)

In [144]:
store_df2.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,local_holiday,regional_holiday,national_holiday,transactions
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,0,0,1,0.0
1,1,2013-01-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,0,0,1,0.0
2,2,2013-01-01,1,BEAUTY,0.0,0,Quito,Pichincha,D,13,0,0,1,0.0
3,3,2013-01-01,1,BEVERAGES,0.0,0,Quito,Pichincha,D,13,0,0,1,0.0
4,4,2013-01-01,1,BOOKS,0.0,0,Quito,Pichincha,D,13,0,0,1,0.0


In [145]:
# count missing values
missing = pd.concat([store_df2.isnull().sum(), 100 * store_df2.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
id,0,0.0
date,0,0.0
store_nbr,0,0.0
family,0,0.0
sales,0,0.0
onpromotion,0,0.0
city,0,0.0
state,0,0.0
type,0,0.0
cluster,0,0.0


#### 3.4 Merging current data with oil data<a id='3.4'></a>

In [189]:
oil

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.20
...,...,...
1213,2017-08-25,47.65
1214,2017-08-28,46.40
1215,2017-08-29,46.46
1216,2017-08-30,45.96


In [146]:
oil["date"] = pd.to_datetime(oil["date"])
store_df3 = store_df2.merge(oil,how='left',on=['date'])

store_df3

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,local_holiday,regional_holiday,national_holiday,transactions,dcoilwtico
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,0,0,1,0.0,
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,0,0,1,0.0,
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,0,0,1,0.0,
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,0,0,1,0.0,
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,0,0,1,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,0,0,0,2155.0,47.57
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,0,0,0,2155.0,47.57
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,0,0,0,2155.0,47.57
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,0,0,0,2155.0,47.57


In [191]:
# count missing values
missing = pd.concat([store_df3.isnull().sum(), 100 * store_df3.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
dcoilwtico,928422,30.938242
transactions,245784,8.190376
id,0,0.0
date,0,0.0
store_nbr,0,0.0
family,0,0.0
sales,0,0.0
onpromotion,0,0.0
city,0,0.0
state,0,0.0


In [192]:
# checksum
round(928422/(30.938242/100))

3000888

Unfortunately, we are missing oil price information for nearly a third of the dates.

In [193]:
2/(7/100)

28.57142857142857

In [90]:
28.57142857142857/(30.938242/100)

92.34987744755688

However, according to the Federal Reserve Economic Data (FRED) [3], these oil prices are not published on weekends or on some holidays.
Weekends make up 2 of every 7 days (28.57%), so if every weekend is missing, weekends account for roughly 92% of the missing oil values (28.57 / 30.94), with holidays and other gaps making up the rest.


[3] https://fred.stlouisfed.org/series/DCOILWTICO
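The weekend share can also be measured directly instead of estimated from the 2-in-7 ratio. The sketch below does this on a made-up two-week calendar. The frames `days` and `oil_toy` and the prices in them are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy calendar: two full weeks of daily sales dates (2013-01-07 is a Monday).
days = pd.DataFrame({"date": pd.date_range("2013-01-07", periods=14, freq="D")})

# Toy oil table: business days only, with one weekday price missing.
oil_toy = pd.DataFrame({"date": pd.bdate_range("2013-01-07", periods=10)})
oil_toy["dcoilwtico"] = np.linspace(93.0, 94.8, num=10)  # made-up prices
oil_toy.loc[3, "dcoilwtico"] = np.nan  # a weekday gap, such as a market holiday

merged = days.merge(oil_toy, how="left", on="date")
missing = merged[merged["dcoilwtico"].isna()]

# dayofweek: Monday=0 ... Saturday=5, Sunday=6
weekend_share = (missing["date"].dt.dayofweek >= 5).mean()
print(f"{weekend_share:.1%} of missing oil prices fall on weekends")
```

In this toy calendar, four of the five missing days are weekend days. Applying the same `dayofweek` test to the real merged data would replace the estimate with an exact figure.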

We can estimate the missing values with a filling strategy. Since we live in the present, we have *definitive* access to the past but not the future (though we can make *estimates* about the future using machine learning; that is what forecasting is all about). Therefore, we will use forward filling, which carries the most recent known price forward to days without one. The only exception is the very first date, 2013-01-01, which has no earlier price to carry forward, so it is backward filled from the next available value.
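A small sketch of the two fills on a one-week series shows why both are needed. The prices mirror the first rows of the oil table above; the daily index and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# One week of daily dates; 2013-01-01 has no price, and 01-05/01-06 are a weekend.
prices = pd.Series(
    [np.nan, 93.14, 92.97, 93.12, np.nan, np.nan, 93.20],
    index=pd.date_range("2013-01-01", periods=7, freq="D"),
    name="dcoilwtico",
)

# Forward fill alone leaves the leading NaN, because nothing precedes it.
forward_only = prices.ffill()

# Adding a backward fill afterwards only affects that leading gap.
filled = prices.ffill().bfill()
print(filled)
```

The weekend takes Friday's price, and only the first day borrows from the following day. That single backward fill is a small look-ahead, which seems acceptable here because it touches one date at the very start of the series.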

In [147]:
store_df3['dcoilwtico'] = store_df3['dcoilwtico'].ffill().bfill()

In [148]:
store_df3

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,local_holiday,regional_holiday,national_holiday,transactions,dcoilwtico
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,0,0,1,0.0,93.14
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,0,0,1,0.0,93.14
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,0,0,1,0.0,93.14
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,0,0,1,0.0,93.14
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,0,0,1,0.0,93.14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,0,0,0,2155.0,47.57
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,0,0,0,2155.0,47.57
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,0,0,0,2155.0,47.57
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,0,0,0,2155.0,47.57


In [149]:
missing = pd.concat([store_df3.isnull().sum(), 100 * store_df3.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
id,0,0.0
date,0,0.0
store_nbr,0,0.0
family,0,0.0
sales,0,0.0
onpromotion,0,0.0
city,0,0.0
state,0,0.0
type,0,0.0
cluster,0,0.0


All missing data has now been accounted for.

The last step before exporting the data is to ensure all columns are of the proper data type.

In [157]:
store_df3.dtypes

id                           int64
date                datetime64[ns]
store_nbr                    int64
family                      object
sales                      float64
onpromotion                  int64
city                        object
state                       object
type                        object
cluster                      int64
local_holiday                int64
regional_holiday             int64
national_holiday             int64
transactions               float64
dcoilwtico                 float64
dtype: object

In [162]:
store_df3['transactions'] = store_df3['transactions'].astype('int64')

In [163]:
store_df3.dtypes

id                           int64
date                datetime64[ns]
store_nbr                    int64
family                      object
sales                      float64
onpromotion                  int64
city                        object
state                       object
type                        object
cluster                      int64
local_holiday                int64
regional_holiday             int64
national_holiday             int64
transactions                 int64
dcoilwtico                 float64
dtype: object
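Before casting a float column to an integer type, it can be worth confirming that nothing will be lost. The sketch below is one hedged way to do that. The series `tx` is a made-up stand-in for the `transactions` column.

```python
import pandas as pd

# Toy stand-in for the transactions column after filling (values are made up).
tx = pd.Series([0.0, 2155.0, 994.0], name="transactions")

# The cast is lossless only if there are no NaNs and no fractional parts.
has_no_nan = bool(tx.notna().all())
is_whole = bool((tx % 1 == 0).all())

if has_no_nan and is_whole:
    tx_int = tx.astype("int64")
else:
    raise ValueError("transactions contains NaN or fractional values")

print(tx_int.dtype)
```

`astype('int64')` raises on NaN but silently truncates fractional values, so the explicit check makes the second case visible as well.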

We export the finalized data as a .csv file for future use.

In [150]:
f = './merged_data.csv'
store_df3.to_csv(f)
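When the file is read back in a later notebook, two details are easy to miss: the saved index and the date type. The sketch below shows a round trip on a tiny made-up frame. The file name `merged_data_demo.csv` and the frame `df` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for store_df3 (values are made up).
df = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-15", "2017-08-15"]),
    "store_nbr": [9, 9],
    "family": ["POULTRY", "PRODUCE"],
    "sales": [438.133, 2419.729],
})

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "merged_data_demo.csv")
df.to_csv(path)  # the index is written as an unnamed first column

# index_col=0 restores the saved index instead of creating an "Unnamed: 0" column;
# parse_dates turns the text in "date" back into datetime values.
reloaded = pd.read_csv(path, index_col=0, parse_dates=["date"])
print(reloaded.dtypes)
```

Without `parse_dates`, the `date` column would come back as plain text and would need `pd.to_datetime` again before any time series work.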

### 4.0 Summary<a id='4.0'></a>

The data was loaded, inspected, cleaned, merged, saved, and is now ready for exploratory data analysis. 