## Project Objective


To build a model that accurately predicts the unit sales for the items sold by Corporation Favorita

## Hypothesis & Questions

### Hypotheses

### Questions

1. Is the train dataset complete (has all the required dates)?
2. Which dates have the lowest and highest sales for each year?
3. Did the earthquake impact sales?
4. Are certain groups of stores selling more products? (Cluster, city, state, type)
5. Are sales affected by promotions, oil prices and holidays?
6. What analysis can we get from the date and its extractable features?
7. What is the difference between RMSLE, RMSE, MSE (or why is the MAE greater than all of them?)

# Installing Scikit-learn module

In [1]:
pip install -U scikit-learn


Note: you may need to restart the kernel to use updated packages.


# Importing Libraries

In [2]:

# importing the neccessary python libraries 
import numpy as np 
import pandas as pd 

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from  sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

import xgboost as xgb
from xgboost import XGBRegressor

import matplotlib.pyplot as plt
%matplotlib inline 
import plotly.express as px
import seaborn as sns

from itertools import *

import warnings

# Hiding the warnings
warnings.filterwarnings('ignore')

print("Loading complete.", "Warnings hidden.")




### Loading Train Data

In [3]:
train_data = pd.read_csv("train.csv")
train_data

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0
1,1,2013-01-01,1,BABY CARE,0.000,0
2,2,2013-01-01,1,BEAUTY,0.000,0
3,3,2013-01-01,1,BEVERAGES,0.000,0
4,4,2013-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [4]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 137.4+ MB


In [5]:
#pd.options.display.float_format = '{:,0.2f}'.format

In [6]:
unique_days = train_data["date"].unique()
unique_days

array(['2013-01-01', '2013-01-02', '2013-01-03', ..., '2017-08-13',
       '2017-08-14', '2017-08-15'], dtype=object)

In [7]:
train_data["sales_date"] = pd.to_datetime(train_data["date"]).dt.date
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  int64  
 6   sales_date   object 
dtypes: float64(1), int64(3), object(3)
memory usage: 160.3+ MB


In [8]:
range_0f_date = train_data.sales_date.min(),train_data.sales_date.max()
range_0f_date

(datetime.date(2013, 1, 1), datetime.date(2017, 8, 15))

In [9]:
number_of_expected_days = pd.date_range(start = train_data["sales_date"].min(),end = train_data["sales_date"].max())
number_of_expected_days

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10',
               ...
               '2017-08-06', '2017-08-07', '2017-08-08', '2017-08-09',
               '2017-08-10', '2017-08-11', '2017-08-12', '2017-08-13',
               '2017-08-14', '2017-08-15'],
              dtype='datetime64[ns]', length=1688, freq='D')

We note a difference of 4 days between the actual dates (1,684) and expected dates (1,688) within the range. As such we have to find the missing dates and add them to ensure completeness of the dates.

This gives the answer to question 1 (Is the train dataset complete (has all the required dates)?) as a no.

In [10]:
missing_dates = set(number_of_expected_days.date) - set()

### Hypothesis

this is the hypothesis

In [11]:
# Getting the list of unique sets 
unique_stores = train_data["store_nbr"].unique()
unique_stores

array([ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29,  3, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,  4,
       40, 41, 42, 43, 44, 45, 46, 47, 48, 49,  5, 50, 51, 52, 53, 54,  6,
        7,  8,  9], dtype=int64)

In [12]:
# Getting unique Families 
unique_families = train_data["family"].unique()
unique_families

array(['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS',
       'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS',
       'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE',
       'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES',
       'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE',
       'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE',
       'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY',
       'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES',
       'SEAFOOD'], dtype=object)

Since we're predicting the sales for each store, it means we have to fill in the missing dates for each store. We will do this with the _product_ module from _itertools_

In [13]:
missing_data = list(product(missing_dates,unique_stores,unique_families))
train_addon = pd.DataFrame(missing_data, columns = ["sales_date","store_nbr","family"])
train_addon


Unnamed: 0,sales_date,store_nbr,family
0,2017-04-21,1,AUTOMOTIVE
1,2017-04-21,1,BABY CARE
2,2017-04-21,1,BEAUTY
3,2017-04-21,1,BEVERAGES
4,2017-04-21,1,BOOKS
...,...,...,...
3008011,2013-06-02,9,POULTRY
3008012,2013-06-02,9,PREPARED FOODS
3008013,2013-06-02,9,PRODUCE
3008014,2013-06-02,9,SCHOOL AND OFFICE SUPPLIES


In [14]:
train_data = pd.concat([train_data,train_addon],ignore_index = True)
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6008904 entries, 0 to 6008903
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           float64
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  float64
 6   sales_date   object 
dtypes: float64(3), int64(1), object(3)
memory usage: 320.9+ MB


In [15]:
train_data

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,sales_date
0,0.0,2013-01-01,1,AUTOMOTIVE,0.0,0.0,2013-01-01
1,1.0,2013-01-01,1,BABY CARE,0.0,0.0,2013-01-01
2,2.0,2013-01-01,1,BEAUTY,0.0,0.0,2013-01-01
3,3.0,2013-01-01,1,BEVERAGES,0.0,0.0,2013-01-01
4,4.0,2013-01-01,1,BOOKS,0.0,0.0,2013-01-01
...,...,...,...,...,...,...,...
6008899,,,9,POULTRY,,,2013-06-02
6008900,,,9,PREPARED FOODS,,,2013-06-02
6008901,,,9,PRODUCE,,,2013-06-02
6008902,,,9,SCHOOL AND OFFICE SUPPLIES,,,2013-06-02


- With December 25 omitted from each of the years, I assume that it was deliberate - most likely because all shops are closed on December 25 each year. In effect, no items would have been on promotion and no sales would have been made; that is to say that it is safe to fill the null "sales" and "onpromotion" column data with 0.

- By this, I am also dropping the "id" column as it will not be relevant to subsequent analyses and modelling.

- I will be filling the missing dates in the original dates column with the sales data, for aesthetic purposes only.

In [16]:
# Dropping "id" and "date" columns
train_data.drop(columns = ["id", "date"], axis = 1, inplace = True)

# Filling missing rows in the sales column and casting it to numeric
train_data["sales"].fillna(0, inplace = True)
train_data["sales"] = pd.to_numeric(train_data["sales"])

# Filling missing rows in the onpromotion column
train_data["onpromotion"].fillna(0, inplace = True)

train_data

Unnamed: 0,store_nbr,family,sales,onpromotion,sales_date
0,1,AUTOMOTIVE,0.0,0.0,2013-01-01
1,1,BABY CARE,0.0,0.0,2013-01-01
2,1,BEAUTY,0.0,0.0,2013-01-01
3,1,BEVERAGES,0.0,0.0,2013-01-01
4,1,BOOKS,0.0,0.0,2013-01-01
...,...,...,...,...,...
6008899,9,POULTRY,0.0,0.0,2013-06-02
6008900,9,PREPARED FOODS,0.0,0.0,2013-06-02
6008901,9,PRODUCE,0.0,0.0,2013-06-02
6008902,9,SCHOOL AND OFFICE SUPPLIES,0.0,0.0,2013-06-02


**Transactions data**

In [17]:
transactions = pd.read_csv("transactions.csv")
transactions

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922
...,...,...,...
83483,2017-08-15,50,2804
83484,2017-08-15,51,1573
83485,2017-08-15,52,2255
83486,2017-08-15,53,932


In [18]:
# Viewing basic information about the transactions data
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83488 entries, 0 to 83487
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          83488 non-null  object
 1   store_nbr     83488 non-null  int64 
 2   transactions  83488 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.9+ MB


In [19]:
transactions.nunique()

date            1682
store_nbr         54
transactions    4993
dtype: int64

- Since the train data has the same number of unique stores as the transactions data, we can use the unique stores variable defined earlier to fill in the missing dates.
- Also, given that the transactions and train data cover the same period, it is concerning that the transactions data has even less unique dates than the train data has. As such, we have to find and impute the missing dates as done for the train data.

In [20]:
transactions["sales_date"] = pd.to_datetime(transactions["date"]).dt.date

In [21]:
# Getting missing dates
missing_txn_dates = set(number_of_expected_days.date) - set(transactions["sales_date"].unique())
missing_txn_dates

{datetime.date(2013, 12, 25),
 datetime.date(2014, 12, 25),
 datetime.date(2015, 12, 25),
 datetime.date(2016, 1, 1),
 datetime.date(2016, 1, 3),
 datetime.date(2016, 12, 25)}

In [22]:
missing_txn_data = list(product(missing_txn_dates, unique_stores))
txn_data_addon = pd.DataFrame(missing_txn_data, columns = ["sales_date", "store_nbr"])
txn_data_addon

Unnamed: 0,sales_date,store_nbr
0,2015-12-25,1
1,2015-12-25,10
2,2015-12-25,11
3,2015-12-25,12
4,2015-12-25,13
...,...,...
319,2013-12-25,54
320,2013-12-25,6
321,2013-12-25,7
322,2013-12-25,8


In [23]:
transactions

Unnamed: 0,date,store_nbr,transactions,sales_date
0,2013-01-01,25,770,2013-01-01
1,2013-01-02,1,2111,2013-01-02
2,2013-01-02,2,2358,2013-01-02
3,2013-01-02,3,3487,2013-01-02
4,2013-01-02,4,1922,2013-01-02
...,...,...,...,...
83483,2017-08-15,50,2804,2017-08-15
83484,2017-08-15,51,1573,2017-08-15
83485,2017-08-15,52,2255,2017-08-15
83486,2017-08-15,53,932,2017-08-15


In [24]:
# Adding the data for the missing transaction dates to the main transaction data and filling nulls with 0
transactions = pd.concat([transactions, txn_data_addon], ignore_index=True)
transactions.drop("date", axis = 1, inplace = True)
transactions["transactions"].fillna(0, inplace = True)

In [25]:
# Recasting the sales date column data type to date
transactions["sales_date"] = pd.to_datetime(transactions["sales_date"]).dt.date
transactions

Unnamed: 0,store_nbr,transactions,sales_date
0,25,770.0,2013-01-01
1,1,2111.0,2013-01-02
2,2,2358.0,2013-01-02
3,3,3487.0,2013-01-02
4,4,1922.0,2013-01-02
...,...,...,...
83807,54,0.0,2013-12-25
83808,6,0.0,2013-12-25
83809,7,0.0,2013-12-25
83810,8,0.0,2013-12-25


In [26]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83812 entries, 0 to 83811
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   store_nbr     83812 non-null  int64  
 1   transactions  83812 non-null  float64
 2   sales_date    83812 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 1.9+ MB


**Holidays and events data**

In [28]:
holidays_events = pd.read_csv("holidays_events.csv")
holidays_events

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
...,...,...,...,...,...,...
345,2017-12-22,Additional,National,Ecuador,Navidad-3,False
346,2017-12-23,Additional,National,Ecuador,Navidad-2,False
347,2017-12-24,Additional,National,Ecuador,Navidad-1,False
348,2017-12-25,Holiday,National,Ecuador,Navidad,False


In [29]:
holidays_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         350 non-null    object
 1   type         350 non-null    object
 2   locale       350 non-null    object
 3   locale_name  350 non-null    object
 4   description  350 non-null    object
 5   transferred  350 non-null    bool  
dtypes: bool(1), object(5)
memory usage: 14.1+ KB


The holidays and events dataframe looks complete, hence there will be no need for any cleaning now.

In [30]:
holidays_events["date"] = pd.to_datetime(holidays_events["date"]).dt.date
holidays_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         350 non-null    object
 1   type         350 non-null    object
 2   locale       350 non-null    object
 3   locale_name  350 non-null    object
 4   description  350 non-null    object
 5   transferred  350 non-null    bool  
dtypes: bool(1), object(5)
memory usage: 14.1+ KB


In [31]:
holidays_events.nunique()

date           312
type             6
locale           3
locale_name     24
description    103
transferred      2
dtype: int64

In [32]:
holidays_events["type"].unique()

array(['Holiday', 'Transfer', 'Additional', 'Bridge', 'Work Day', 'Event'],
      dtype=object)

In [33]:
holidays_events[holidays_events["type"] == "Work Day"]

Unnamed: 0,date,type,locale,locale_name,description,transferred
42,2013-01-05,Work Day,National,Ecuador,Recupero puente Navidad,False
43,2013-01-12,Work Day,National,Ecuador,Recupero puente primer dia del ano,False
149,2014-12-20,Work Day,National,Ecuador,Recupero Puente Navidad,False
161,2015-01-10,Work Day,National,Ecuador,Recupero Puente Primer dia del ano,False
283,2016-11-12,Work Day,National,Ecuador,Recupero Puente Dia de Difuntos,False


In [34]:
# Getting missing dates
missing_holiday_dates = set(number_of_expected_days.date) - set(holidays_events["date"].unique())
missing_holiday_dates

{datetime.date(2016, 6, 15),
 datetime.date(2013, 11, 20),
 datetime.date(2014, 10, 25),
 datetime.date(2013, 10, 14),
 datetime.date(2016, 8, 13),
 datetime.date(2014, 10, 5),
 datetime.date(2017, 6, 28),
 datetime.date(2013, 8, 20),
 datetime.date(2014, 6, 1),
 datetime.date(2015, 2, 9),
 datetime.date(2014, 5, 19),
 datetime.date(2016, 8, 23),
 datetime.date(2017, 1, 11),
 datetime.date(2016, 2, 5),
 datetime.date(2015, 9, 22),
 datetime.date(2016, 6, 21),
 datetime.date(2017, 2, 19),
 datetime.date(2017, 5, 31),
 datetime.date(2014, 8, 20),
 datetime.date(2017, 3, 22),
 datetime.date(2014, 4, 29),
 datetime.date(2013, 11, 27),
 datetime.date(2016, 2, 18),
 datetime.date(2015, 8, 28),
 datetime.date(2017, 2, 25),
 datetime.date(2014, 3, 7),
 datetime.date(2015, 3, 4),
 datetime.date(2017, 3, 16),
 datetime.date(2017, 3, 19),
 datetime.date(2015, 1, 6),
 datetime.date(2013, 6, 5),
 datetime.date(2015, 9, 4),
 datetime.date(2013, 6, 4),
 datetime.date(2016, 7, 19),
 datetime.date(2015

In [35]:
# Creating a dataframe for the missing dates in the holiday data
holidays_add = pd.DataFrame(missing_holiday_dates, columns = ["date"])
holidays_add

Unnamed: 0,date
0,2016-06-15
1,2013-11-20
2,2014-10-25
3,2013-10-14
4,2016-08-13
...,...
1427,2017-08-01
1428,2016-11-16
1429,2015-09-05
1430,2013-08-01


In [36]:
# Adding the  missing holiday dates to the main dataframe
holidays_events = pd.concat([holidays_events, holidays_add], ignore_index=True)
holidays_events["date"] = pd.to_datetime(holidays_events["date"]).dt.date
holidays_events = holidays_events.sort_values(by = ["date"], ignore_index = True)
holidays_events

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
...,...,...,...,...,...,...
1777,2017-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
1778,2017-12-23,Additional,National,Ecuador,Navidad-2,False
1779,2017-12-24,Additional,National,Ecuador,Navidad-1,False
1780,2017-12-25,Holiday,National,Ecuador,Navidad,False


**Oil data**

In [37]:
oil_data = pd.read_csv("oil.csv")
oil_data.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2
5,2013-01-08,93.21
6,2013-01-09,93.08
7,2013-01-10,93.81
8,2013-01-11,93.6
9,2013-01-14,94.27


In [38]:
oil_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1175 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB


We note about 43 missing values for oil prices in the oil data. Checks online revealed that said data were unavailable in real time, as such a forward fill method will be applied to fill the nulls and a backfill applied to fill any rows missing after that.

In [39]:
# Filling nulls with forward fill and backfill
oil_data = oil_data.ffill().bfill()
oil_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1218 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB


In [40]:
oil_data.head(10)


Unnamed: 0,date,dcoilwtico
0,2013-01-01,93.14
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2
5,2013-01-08,93.21
6,2013-01-09,93.08
7,2013-01-10,93.81
8,2013-01-11,93.6
9,2013-01-14,94.27


In [41]:
# Converting the dates in the oil data to dates
oil_data["date"] = pd.to_datetime(oil_data["date"]).dt.date

The oil data now has no nulls, and is supposed to be complete, but we note that there are still some missing dates. e.g. it moves from January 4, 2013 to January 7, 2013. A quick check reveals that those dates are weekends, implying that the data is for business days and does not include weekends. With this in mind, I assume that oil prices, for the period, are frozen at close of business days of Friday and so remain constant over the weekends. As such, the "missing dates" (weekends) can be brought in another forward fills applied to them.

In [42]:
# Getting missing dates
missing_oil_dates = set(number_of_expected_days.date) - set(oil_data["date"].unique())
missing_oil_dates

{datetime.date(2013, 1, 5),
 datetime.date(2013, 1, 6),
 datetime.date(2013, 1, 12),
 datetime.date(2013, 1, 13),
 datetime.date(2013, 1, 19),
 datetime.date(2013, 1, 20),
 datetime.date(2013, 1, 26),
 datetime.date(2013, 1, 27),
 datetime.date(2013, 2, 2),
 datetime.date(2013, 2, 3),
 datetime.date(2013, 2, 9),
 datetime.date(2013, 2, 10),
 datetime.date(2013, 2, 16),
 datetime.date(2013, 2, 17),
 datetime.date(2013, 2, 23),
 datetime.date(2013, 2, 24),
 datetime.date(2013, 3, 2),
 datetime.date(2013, 3, 3),
 datetime.date(2013, 3, 9),
 datetime.date(2013, 3, 10),
 datetime.date(2013, 3, 16),
 datetime.date(2013, 3, 17),
 datetime.date(2013, 3, 23),
 datetime.date(2013, 3, 24),
 datetime.date(2013, 3, 30),
 datetime.date(2013, 3, 31),
 datetime.date(2013, 4, 6),
 datetime.date(2013, 4, 7),
 datetime.date(2013, 4, 13),
 datetime.date(2013, 4, 14),
 datetime.date(2013, 4, 20),
 datetime.date(2013, 4, 21),
 datetime.date(2013, 4, 27),
 datetime.date(2013, 4, 28),
 datetime.date(2013, 5, 

In [43]:
oil_dates_add = pd.DataFrame(missing_oil_dates, columns = ["date"])
oil_dates_add

Unnamed: 0,date
0,2013-12-29
1,2014-10-25
2,2013-06-15
3,2015-12-12
4,2014-12-20
...,...
477,2017-02-05
478,2015-09-05
479,2015-11-14
480,2016-05-15


In [44]:
# Adding the  missing oil dates to the main dataframe
oil_data = pd.concat([oil_data, oil_dates_add], ignore_index=True)
oil_data["date"] = pd.to_datetime(oil_data["date"])
oil_data = oil_data.sort_values(by = ["date"], ignore_index = True)
oil_data.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,93.14
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-05,
5,2013-01-06,
6,2013-01-07,93.2
7,2013-01-08,93.21
8,2013-01-09,93.08
9,2013-01-10,93.81


In [45]:
# Filling nulls with forward fill and backfill
oil_data = oil_data.ffill().bfill()
oil_data.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,93.14
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-05,93.12
5,2013-01-06,93.12
6,2013-01-07,93.2
7,2013-01-08,93.21
8,2013-01-09,93.08
9,2013-01-10,93.81


In [46]:
# Recasting the oil data dates to datetime dates
oil_data["date"] = pd.to_datetime(oil_data["date"]).dt.date
oil_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1700 entries, 0 to 1699
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1700 non-null   object 
 1   dcoilwtico  1700 non-null   float64
dtypes: float64(1), object(1)
memory usage: 26.7+ KB


**Stores data**

In [47]:
stores_data = pd.read_csv("stores.csv")
stores_data.head()

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [48]:
stores_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB


**Test Data**

In [49]:
test_data = pd.read_csv("test.csv")
test_data

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0
...,...,...,...,...,...
28507,3029395,2017-08-31,9,POULTRY,1
28508,3029396,2017-08-31,9,PREPARED FOODS,0
28509,3029397,2017-08-31,9,PRODUCE,1
28510,3029398,2017-08-31,9,SCHOOL AND OFFICE SUPPLIES,9


In [50]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28512 entries, 0 to 28511
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           28512 non-null  int64 
 1   date         28512 non-null  object
 2   store_nbr    28512 non-null  int64 
 3   family       28512 non-null  object
 4   onpromotion  28512 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 1.1+ MB


The test data looks complete, with no nulls. Casting the date column to date will be the only cleaning activity here.

In [51]:
# Casting the date column to date data type
test_data["date"] = pd.to_datetime(test_data["date"]).dt.date

**Sample Submission**

In [52]:
sample_submission = pd.read_csv("sample_submission.csv")
sample_submission

Unnamed: 0,id,sales
0,3000888,0.0
1,3000889,0.0
2,3000890,0.0
3,3000891,0.0
4,3000892,0.0
...,...,...
28507,3029395,0.0
28508,3029396,0.0
28509,3029397,0.0
28510,3029398,0.0


No changes will be made to the sample submission as it is only a guide.

## Answering the other questions

**Which dates have the lowest and highest sales for each year?**

The imputation of the originally missing dates means that automatically, minimum sales for each of the four years will be on those dates (December 25 each year), but that is not what we want. What we want to know is which days had the least sales when stores were opened, as such I will only include sales values greater than 0.

In [53]:
# Aggregating sales by dates
train_by_date = train_data[train_data["sales"] != 0.00
                          ].groupby(by = "sales_date").sales.agg(["sum"]
                                                                ).sort_values(by = "sales_date")
train_by_date

Unnamed: 0_level_0,sum
sales_date,Unnamed: 1_level_1
2013-01-01,2511.618999
2013-01-02,496092.417944
2013-01-03,361461.231124
2013-01-04,354459.677093
2013-01-05,477350.121229
...,...
2017-08-11,826373.722022
2017-08-12,792630.535079
2017-08-13,865639.677471
2017-08-14,760922.406081


In [54]:
# Creating a column for the years for grouping
train_by_date["year"] = pd.to_datetime(train_by_date.index).year
train_by_date.rename(columns = {"sum":"total_sales"}, inplace = True)
train_by_date = train_by_date.reset_index()
train_by_date

Unnamed: 0,sales_date,total_sales,year
0,2013-01-01,2511.618999,2013
1,2013-01-02,496092.417944,2013
2,2013-01-03,361461.231124,2013
3,2013-01-04,354459.677093,2013
4,2013-01-05,477350.121229,2013
...,...,...,...
1679,2017-08-11,826373.722022,2017
1680,2017-08-12,792630.535079,2017
1681,2017-08-13,865639.677471,2017
1682,2017-08-14,760922.406081,2017


In [55]:
fig = px.line(train_by_date, x = "sales_date", y = "total_sales", 
              title= "Sales trend for Corporation Favorita from 2013 - 2017", 
             labels = {"sales_date":"Sales Date", "total_sales":"Total Sales"})
fig.show()

In [56]:
data_2013 = train_by_date[train_by_date["year"] == 2013]
data_2013 = data_2013.reset_index()
data_2013

Unnamed: 0,index,sales_date,total_sales,year
0,0,2013-01-01,2511.618999,2013
1,1,2013-01-02,496092.417944,2013
2,2,2013-01-03,361461.231124,2013
3,3,2013-01-04,354459.677093,2013
4,4,2013-01-05,477350.121229,2013
...,...,...,...,...
359,359,2013-12-27,479314.968043,2013
360,360,2013-12-28,556952.305979,2013
361,361,2013-12-29,499719.504924,2013
362,362,2013-12-30,635134.735851,2013


In [57]:
min_sales_13 = data_2013["total_sales"].min()
max_sales_13 = data_2013["total_sales"].max()
low_hi_sales_13 = data_2013[(data_2013["total_sales"] == min_sales_13) | (data_2013["total_sales"] == max_sales_13)]
low_hi_sales_13

Unnamed: 0,index,sales_date,total_sales,year
0,0,2013-01-01,2511.618999,2013
356,356,2013-12-23,792865.284427,2013


In [58]:
data_2014 = train_by_date[train_by_date["year"] == 2014]
data_2014 = data_2014.reset_index()
data_2014

Unnamed: 0,index,sales_date,total_sales,year
0,364,2014-01-01,8602.065404,2014
1,365,2014-01-02,801011.226041,2014
2,366,2014-01-03,680672.845603,2014
3,367,2014-01-04,936628.886604,2014
4,368,2014-01-05,949618.788940,2014
...,...,...,...,...
359,723,2014-12-27,740596.158932,2014
360,724,2014-12-28,716329.635071,2014
361,725,2014-12-29,773998.401175,2014
362,726,2014-12-30,912970.533204,2014


In [59]:
fig = px.line(data_2014, x = "sales_date", y = "total_sales", title="Sales trend for Corporation Favorita in 2014", 
             labels = {"sales_date":"Sales Date", "total_sales":"Total Sales"})
fig.show()

In [60]:
min_sales_14 = data_2014["total_sales"].min()
max_sales_14 = data_2014["total_sales"].max()
low_hi_sales_14 = data_2014[(data_2014["total_sales"] == min_sales_14) | (data_2014["total_sales"] == max_sales_14)]
low_hi_sales_14

Unnamed: 0,index,sales_date,total_sales,year
0,364,2014-01-01,8602.065,2014
356,720,2014-12-23,1064978.0,2014


In [61]:
data_2015 = train_by_date[train_by_date["year"] == 2015]
data_2015 = data_2015.reset_index()
data_2015

Unnamed: 0,index,sales_date,total_sales,year
0,728,2015-01-01,1.277362e+04,2015
1,729,2015-01-02,6.577634e+05,2015
2,730,2015-01-03,6.488807e+05,2015
3,731,2015-01-04,7.309238e+05,2015
4,732,2015-01-05,5.692673e+05,2015
...,...,...,...,...
359,1087,2015-12-27,8.377141e+05,2015
360,1088,2015-12-28,7.896849e+05,2015
361,1089,2015-12-29,8.707620e+05,2015
362,1090,2015-12-30,1.030044e+06,2015
