## <center>LP2-Regression-Project</center>

### <center>Store Sales Prediction: Time Series Forecasting</center>

## Description
This project develops a model to forecast unit sales for thousands of items sold at various Favorita stores. The dataset is provided by Corporation Favorita, a prominent grocery retailer based in Ecuador. Using this data, we aim to produce accurate forecasts and uncover insights that can support growth and better planning across the organization.

## File Descriptions and Data Field Information

### train.csv

The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales.

store_nbr identifies the store at which the products are sold.

family identifies the type of product sold.

sales gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).

onpromotion gives the total number of items in a product family that were being promoted at a store at a given date.

### test.csv

The test data, having the same features as the training data. You will predict the target sales for the dates in this file.

The dates in the test data are for the 15 days after the last date in the training data.

### transactions.csv

Contains the date, store_nbr, and the number of transactions made at that store on that date.

### sample_submission.csv

A sample submission file in the correct format.

### stores.csv

Store metadata, including city, state, type, and cluster.

cluster is a grouping of similar stores.

### oil.csv

Daily oil price, including values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

### holidays_events.csv

Holidays and Events, with metadata

## Questions and Hypotheses

1. Is the train dataset complete (has all the required dates)?

2. Which dates have the lowest and highest sales for each year?

3. Did the earthquake impact sales?

4. Are certain groups of stores selling more products? (Cluster, city, state, type)

5. Are sales affected by promotions, oil prices and holidays?

6. Which items have a higher sales correlation with other items, and why?

7. What analysis can we get from the date and its extractable features?

8. What is the overall customer buying habit?

9. What is the difference between RMSLE, RMSE, and MSE (and why is the MAE greater than all of them)?
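To make the question about error metrics concrete, here is a minimal sketch that computes MAE, MSE, RMSE, and RMSLE on a few made-up values. The helper name `regression_metrics` and the numbers are illustrative assumptions, not part of the project code.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute common regression errors (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))            # mean absolute error
    mse = np.mean(err ** 2)               # mean squared error
    rmse = np.sqrt(mse)                   # root mean squared error
    # RMSLE compares log(1 + value), so it measures relative rather than absolute error
    rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RMSLE": rmsle}

# Made-up values, only to show how the metrics relate to each other
metrics = regression_metrics([0.0, 11.0, 200.0], [1.0, 9.0, 150.0])
print(metrics)
```

On the same predictions, the MAE can never exceed the RMSE. It can exceed the MSE only when most errors are smaller than 1, and it can exceed the RMSLE because that metric is computed on a log scale. If the MAE looks larger than all the others, it is worth checking whether the metrics were computed on the same scale.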


## Data Understanding, Evaluation and Preparation

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



Import all the datasets.
Since most of the files contain a date column, we parse the dates and use them as the index.

In [2]:
Train_data = pd.read_csv("./Dataset/train.csv", parse_dates=["date"], index_col="date")

Test = pd.read_csv("./Dataset/test.csv", parse_dates=["date"], index_col="date")

Transaction = pd.read_csv("./Dataset/transactions.csv", parse_dates=["date"], index_col="date")

Holidays = pd.read_csv("./Dataset/holidays_events.csv", parse_dates=["date"], index_col="date")

oil = pd.read_csv("./Dataset/oil.csv", parse_dates=["date"], index_col="date")

stores = pd.read_csv("./Dataset/stores.csv")

sample_submission = pd.read_csv("./Dataset/sample_submission.csv")


A look at the first five rows of all the datasets

In [3]:
Train_data.head()

date,id,store_nbr,family,sales,onpromotion
2013-01-01,0,1,AUTOMOTIVE,0.0,0
2013-01-01,1,1,BABY CARE,0.0,0
2013-01-01,2,1,BEAUTY,0.0,0
2013-01-01,3,1,BEVERAGES,0.0,0
2013-01-01,4,1,BOOKS,0.0,0


In [4]:
Test.head()

date,id,store_nbr,family,onpromotion
2017-08-16,3000888,1,AUTOMOTIVE,0
2017-08-16,3000889,1,BABY CARE,0
2017-08-16,3000890,1,BEAUTY,2
2017-08-16,3000891,1,BEVERAGES,20
2017-08-16,3000892,1,BOOKS,0


In [5]:
Transaction.head()

date,store_nbr,transactions
2013-01-01,25,770
2013-01-02,1,2111
2013-01-02,2,2358
2013-01-02,3,3487
2013-01-02,4,1922


The transactions column contains the total number of transactions that occurred at each store on each date.

In [6]:
Holidays.head()

date,type,locale,locale_name,description,transferred
2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


The holiday dataset will help us understand how sales are affected by holidays
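One way to start on that comparison is sketched below. It aggregates sales per day, flags the dates listed in the holiday table, and compares average daily sales for the two groups. The toy frames stand in for `Train_data` and `Holidays`. Skipping rows where `transferred` is True is an assumption about how that column should be read, and the sketch ignores the `locale` of each holiday.

```python
import pandas as pd

# Toy stand-ins for the real frames; both are indexed by date, like Train_data and Holidays
sales = pd.DataFrame(
    {"sales": [10.0, 5.0, 20.0, 8.0]},
    index=pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-03"]),
)
holidays = pd.DataFrame(
    {"type": ["Holiday"], "transferred": [False]},
    index=pd.to_datetime(["2013-01-01"]),
)

# Total sales per calendar day across all stores and families
daily_sales = sales.groupby(level=0)["sales"].sum().to_frame("total_sales")

# Assumption: a holiday row with transferred=True is not counted as a holiday
holiday_dates = holidays.loc[~holidays["transferred"]].index.unique()
daily_sales["is_holiday"] = daily_sales.index.isin(holiday_dates)

# Average daily sales on holidays versus regular days
holiday_mean = daily_sales.loc[daily_sales["is_holiday"], "total_sales"].mean()
regular_mean = daily_sales.loc[~daily_sales["is_holiday"], "total_sales"].mean()
print(holiday_mean, regular_mean)
```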

In [7]:
oil.head()

date,dcoilwtico
2013-01-01,
2013-01-02,93.14
2013-01-03,92.97
2013-01-04,93.12
2013-01-07,93.2


In [8]:
stores.head()

,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [9]:
sample_submission.head()

,id,sales
0,3000888,0.0
1,3000889,0.0
2,3000890,0.0
3,3000891,0.0
4,3000892,0.0


Since we are most interested in the train dataset, we will start by understanding it first and then work on the other datasets as they become relevant to our analysis.

In [10]:
#A look at the first five rows
Train_data.head()

date,id,store_nbr,family,sales,onpromotion
2013-01-01,0,1,AUTOMOTIVE,0.0,0
2013-01-01,1,1,BABY CARE,0.0,0
2013-01-01,2,1,BEAUTY,0.0,0
2013-01-01,3,1,BEVERAGES,0.0,0
2013-01-01,4,1,BOOKS,0.0,0


In [11]:
#Quick information about the dataset
Train_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3000888 entries, 2013-01-01 to 2017-08-15
Data columns (total 5 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   store_nbr    int64  
 2   family       object 
 3   sales        float64
 4   onpromotion  int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 137.4+ MB


The dataset has a total of 3,000,888 rows and 5 columns.

It ranges from January 1, 2013 to August 15, 2017.

In [12]:
#some quick stats
Train_data.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
id,3000888.0,1500444.0,866281.891642,0.0,750221.75,1500443.5,2250665.0,3000887.0
store_nbr,3000888.0,27.5,15.585787,1.0,14.0,27.5,41.0,54.0
sales,3000888.0,357.7757,1101.997721,0.0,0.0,11.0,195.8473,124717.0
onpromotion,3000888.0,2.60277,12.218882,0.0,0.0,0.0,0.0,741.0


### Check for null values

In [13]:
Train_data.isnull().sum()

id             0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64

### Check for missing dates
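A minimal sketch of one way to run this check: build the full daily calendar between the first and last observed dates and subtract the dates that actually occur. The helper name `find_missing_dates` and the toy index are hypothetical; the notebook itself does not show how this step was done.

```python
import pandas as pd

def find_missing_dates(index):
    """Return the calendar days between the first and last date that never appear in `index`."""
    observed = pd.DatetimeIndex(index).normalize().unique()
    expected = pd.date_range(observed.min(), observed.max(), freq="D")
    return expected.difference(observed)

# Toy index with a gap on 2013-01-03 (repeated dates are fine, as in the train data)
toy_index = pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-04"])
missing = find_missing_dates(toy_index)
print(missing)
```

On the real data the call would be `find_missing_dates(Train_data.index)`, since the date column was set as the index when the file was read.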

## Exploratory Data Analysis

### Categorical columns

In [14]:
#family

Train_data.family.nunique()

33

Favorita sells 33 different product families.

In [15]:
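#Count the rows per family; avoid shadowing the built-in min and max
family_counts = Train_data.family.value_counts()
min_count = family_counts.min()
max_count = family_counts.max()

print(f"The min number of rows for a family is {min_count}, \nthe maximum number of rows for a family is {max_count}")

The min number of rows for a family is 90936, 
the maximum number of rows for a family is 90936


Each family has the same number of rows (90,936), so the training data are balanced across product families.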

### Numeric columns

In [16]:
#store
#How many stores does Favorita have?

Train_data.store_nbr.max()

54

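The highest store number is 54. Store numbers run from 1 to 54 (see the summary statistics above), so the data cover 54 stores.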
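Note that `max()` returns the largest store number, which equals the number of stores only when the numbering has no gaps. A short sketch on made-up data shows the difference from `nunique()`, which counts distinct stores directly. The toy frame below is an assumption for illustration only.

```python
import pandas as pd

# Toy data with gaps in the numbering: stores 1, 2 and 5 only
toy = pd.DataFrame({"store_nbr": [1, 2, 2, 5, 5, 5]})

largest_id = toy["store_nbr"].max()      # largest store number
n_stores = toy["store_nbr"].nunique()    # number of distinct stores
print(largest_id, n_stores)
```

On the train data, `Train_data.store_nbr.nunique()` would give the store count without relying on consecutive numbering.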

In [18]:
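#How are the stores distributed around the country?

#Count the stores in each state, largest first
stores_dist = stores.groupby(by='state')['store_nbr'].count().sort_values(ascending=False)
stores_dist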

## Bivariate and Multivariate Analysis