# Time Series Forecasting on Corporation Favorita Grocery Stores

**Business Objective:** The business objective is to optimize inventory management by accurately **forecasting product demand** across various locations, ensuring adequate stock levels to meet customer demand while minimizing stockouts and overstock situations.


**Understanding the Current Situation:** This phase analyzes Corporation Favorita's current inventory management and demand forecasting processes. It covers reviewing historical sales data, inventory levels, and customer demand patterns to identify trends and fluctuations, and evaluating the effectiveness of existing processes, technology infrastructure, and decision-making frameworks. Consultation with stakeholders from various departments provides insight into business requirements, challenges, and opportunities. External factors such as market dynamics, economic conditions, and regulatory requirements are also considered. Together, these form the basis for developing strategies and data-driven solutions that optimize inventory management and meet the business objectives.

**Data Mining Goals:** The aim of data mining is to build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores.

**Project Plan:** The project plan involves following the CRISP-DM framework to develop machine learning models for forecasting product demand in various locations for Corporation Favorita. This includes conducting a thorough data understanding phase to gather and explore relevant datasets, followed by data preparation to preprocess and clean the data for analysis. The modeling phase will involve building and evaluating machine learning models using techniques such as time series forecasting, regression analysis, and machine learning algorithms. Evaluation metrics such as Mean Absolute Error (MAE) and Mean Squared Error (MSE) will be used to assess model performance. Finally, the selected model will be deployed into production to generate forecasts for product demand, with ongoing monitoring and refinement as needed to ensure alignment with business objectives and requirements.
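
As a rough illustration of the evaluation metrics named in the plan, the sketch below computes MAE and MSE with NumPy. RMSE and RMSLE are included because they come up in the analytical questions of this notebook. The function names and the toy `y_true` / `y_pred` values are assumptions made for this example only, not results from the Favorita data.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors, in sales units."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def mse(y_true, y_pred):
    """Mean Squared Error: squares each error, so large misses dominate."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def rmse(y_true, y_pred):
    """Root Mean Squared Error: MSE brought back to sales units."""
    return float(np.sqrt(mse(y_true, y_pred)))


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: compares log1p-transformed values,
    so it measures relative error. Inputs must be non-negative."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Toy values for illustration only (not taken from the dataset)
y_true = [0.0, 2.0, 10.0, 100.0]
y_pred = [1.0, 2.0, 8.0, 110.0]

print("MAE  :", mae(y_true, y_pred))
print("MSE  :", mse(y_true, y_pred))
print("RMSE :", rmse(y_true, y_pred))
print("RMSLE:", rmsle(y_true, y_pred))
```

MAE and RMSE are expressed in the same units as sales, MSE in squared units, and RMSLE on a log scale, so the four numbers are not directly comparable with one another.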

**Data for the Project:**

The data for this project is divided into two parts. The first is used to train and evaluate the machine learning model, while the second is used to test it.

The training data comes from two sources: a database that must be accessed remotely and a zip file hosted in a GitHub repository.

The test dataset for this project can be found in OneDrive.
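
One possible way to read a CSV straight out of a zip archive, without extracting it to disk first, is sketched below. The helper name `read_csv_from_zip`, the in-memory demo archive, and its single placeholder row are assumptions for illustration; the real archive name and contents are not specified here.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(zip_source, member):
    """Read one CSV member of a zip archive into a DataFrame.

    zip_source may be a file path or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as handle:
            return pd.read_csv(handle)


# Demo with an in-memory archive holding one placeholder row (not real data)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "train.csv",
        "date,store_nbr,family,sales,onpromotion\n2013-01-01,1,EXAMPLE,0.0,0\n",
    )
buffer.seek(0)

demo = read_csv_from_zip(buffer, "train.csv")
print(demo)
```

With a downloaded archive, the same helper would be called with the path to the zip file in place of `buffer`.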

**File Descriptions and Data Field Information**

**train.csv**

-   The training data, comprising time series of features store_nbr, family, 
    and onpromotion as well as the target sales.

-   **store_nbr** identifies the store at which the products are sold.

-   **family** identifies the type of product sold.

-   **sales** gives the total sales for a product family at a particular store
    at a given date. Fractional values are possible since products can be sold in 
    fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).

-   **onpromotion** gives the total number of items in a product family that
    were being promoted at a store at a given date.

**test.csv**

-   The test data, having the same features as the training data. You will predict the target sales for the dates in this file.

-   The dates in the test data are for the 15 days after the last date in the training data.

**transaction.csv**

-   Contains date, store_nbr, and the number of transactions made at that store on that date.

**sample_submission.csv**

-   A sample submission file in the correct format.

**stores.csv**

-   Store metadata, including city, state, type, and cluster.

-   cluster is a grouping of similar stores.

**oil.csv**

-   **Daily oil price** which includes values during both the train and
     test data timeframes. (Ecuador is an oil-dependent country and its
     economical health is highly vulnerable to shocks in oil prices.)

**holidays_events.csv**

-   Holidays and events, with metadata.

> **NOTE**: Pay particular attention to the **transferred** column. A holiday marked as **transferred** officially falls on that calendar day but was moved to another date by the government, so the original day behaves more like a normal day than a holiday. To find the day it was actually celebrated, look for the corresponding row where the type is **Transfer**. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12.
>
> Days of type **Bridge** are extra days added to a holiday, for example to extend the break across a long weekend. They are often compensated for by a **Work Day**, a day not normally scheduled for work (e.g., Saturday) that makes up for the Bridge. **Additional** holidays are days added to a regular calendar holiday, as typically happens around Christmas, when Christmas Eve becomes a holiday.
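
The rules in the note can be turned into a single "was this actually a day off?" flag. The sketch below is one way to do that, assuming the column names `type` and `transferred` from holidays_events.csv. The first two rows mirror the Independencia de Guayaquil example from the note; the last two rows, and the column name `is_day_off`, are illustrative placeholders.

```python
import pandas as pd

holidays = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2012-10-09", "2012-10-12", "2012-12-24", "2013-01-05"]
        ),
        "type": ["Holiday", "Transfer", "Additional", "Work Day"],
        "description": [
            "Independencia de Guayaquil",
            "Traslado Independencia de Guayaquil",
            "Christmas Eve (illustrative row)",
            "Make-up work day (illustrative row)",
        ],
        "transferred": [True, False, False, False],
    }
)

# Types that count as time off according to the note.
# "Work Day" is deliberately excluded: it is a working day that pays back a Bridge.
day_off_types = {"Holiday", "Transfer", "Additional", "Bridge"}

# A transferred holiday is treated as a normal day; its Transfer row carries the day off.
holidays["is_day_off"] = (
    holidays["type"].isin(day_off_types) & ~holidays["transferred"]
)

print(holidays[["date", "type", "transferred", "is_day_off"]])
```

Here the original 2012-10-09 row is flagged as a normal day, while the 2012-10-12 Transfer row is flagged as the day off.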

**Additional Notes**

-   Wages in the public sector are paid every two weeks on the 15th and
    on the last day of the month. Supermarket sales could be affected
    by this.

-   A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People
    rallied in relief efforts, donating water and other basic necessities,
    which greatly affected supermarket sales for several weeks after the
    earthquake.
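
Both notes can be encoded as calendar features. The sketch below shows one possible approach; the column names `is_payday`, `days_since_quake`, and `post_quake`, and the four-week post-earthquake window, are assumptions made for illustration ("several weeks" is not defined more precisely above).

```python
import pandas as pd

# A short date range around the earthquake, used purely for demonstration
calendar = pd.DataFrame(
    {"date": pd.date_range("2016-04-10", "2016-05-02", freq="D")}
)

# Public-sector wages are paid on the 15th and on the last day of the month
calendar["is_payday"] = (
    (calendar["date"].dt.day == 15) | calendar["date"].dt.is_month_end
)

# Days elapsed since the earthquake of April 16, 2016 (negative before it)
earthquake_date = pd.Timestamp("2016-04-16")
calendar["days_since_quake"] = (calendar["date"] - earthquake_date).dt.days

# Assumed window: the 28 days starting on the day of the earthquake
calendar["post_quake"] = calendar["days_since_quake"].between(0, 27)

print(calendar[calendar["is_payday"]])
```

The window length is a modelling choice and is best tuned against the observed sales rather than fixed in advance.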


### `Analytical Questions`:

> 1. Is the train dataset complete (has all the required dates)?
> 2. Which dates have the lowest and highest sales for each year (excluding days the store was closed)?
> 3. Compare the sales for each month across the years and determine which month of which year had the highest sales.
> 4. Did the earthquake impact sales?
> 5. Are certain stores or groups of stores selling more products? (Cluster, city, state, type)
> 6. Are sales affected by promotions, oil prices and holidays?
> 7. What analysis can we get from the date and its extractable features?
> 8. Which product families and stores did the promotions affect?
> 9. What is the difference between RMSLE, RMSE, and MSE (or why is the MAE greater than all of them)?
> 10. Does the payment of wages in the public sector on the 15th and last days of the month influence store sales?
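
For question 1, one way to check completeness is to compare the dates present in the data against a full daily calendar between the first and last date. The sketch below assumes a `date` column; the helper name `missing_dates` and the toy dates are hypothetical and say nothing about which dates are actually missing from the train dataset.

```python
import pandas as pd


def missing_dates(dates):
    """Return the calendar days between the first and last date that never appear."""
    dates = pd.to_datetime(pd.Series(dates)).dt.normalize()
    full_range = pd.date_range(dates.min(), dates.max(), freq="D")
    return full_range.difference(pd.DatetimeIndex(dates.unique()))


# Toy example: 2013-01-03 is absent and 2013-01-04 appears twice
toy_dates = ["2013-01-01", "2013-01-02", "2013-01-04", "2013-01-04"]
missing = missing_dates(toy_dates)
print(missing)
```

On the real data this would be called as `missing_dates(train["date"])`, where `train` is an assumed name for the loaded training DataFrame.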


### `Hypotheses`

`Null Hypothesis`: The payment of wages in the public sector on the 15th and last days of the month does not influence store sales.

`Alternate Hypothesis`: The payment of wages in the public sector on the 15th and last days of the month significantly influences store sales.
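
A minimal sketch of how these hypotheses could be tested is shown below, using Welch's two-sample t-test from SciPy on daily sales split into payday and non-payday groups. The data here is synthetic, with a payday uplift built in by construction, so the outcome says nothing about the real Favorita sales. The choice of test and the 0.05 significance level are assumptions; daily sales are autocorrelated, so this should be read as a rough first check only.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic daily totals for one year (NOT real data)
dates = pd.date_range("2016-01-01", "2016-12-31", freq="D")
is_payday = (dates.day == 15) | dates.is_month_end

# Baseline of 100 with noise, plus an artificial uplift of 15 on paydays
sales = 100 + rng.normal(0, 10, size=len(dates)) + 15 * is_payday

payday_sales = sales[is_payday]
other_sales = sales[~is_payday]

# Welch's t-test does not assume equal variances in the two groups
result = stats.ttest_ind(payday_sales, other_sales, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = result.pvalue < alpha

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4g}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

With the real data, `sales` would be the total sales per day and `is_payday` would be derived from the dates in the train dataset.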

# `DATA UNDERSTANDING`

In [1]:
# Packages for data analysis and manipulation

# Data handling
import pyodbc
from dotenv import dotenv_values
import pandas as pd
import numpy as np

# Statistics
import scipy.stats as stats

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Feature processing and modelling
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score

# Other packages
import zipfile
import os
import warnings
warnings.filterwarnings('ignore')

# Display all columns
pd.set_option('display.max_columns', None)

### Loading Datasets

In [3]:
# Load from SQL Database source

# Load environment variables from the .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the credentials set in the '.env' file.
# dotenv_values() returns a dict and does not populate os.environ,
# so the values are read from the dict rather than with os.getenv().
server = environment_variables.get("DB_SERVER")
database = environment_variables.get("DB_NAME")
username = environment_variables.get("DB_USER")
password = environment_variables.get("DB_PASSWORD")

# Create a connection string
connection_string = f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}'

# Connect to the database
connection = pyodbc.connect(connection_string)


In [4]:
# SQL query to extract the data
query = "SELECT * FROM dbo.oil"

# Execute the SQL query and load the result into a pandas DataFrame
data_1 = pd.read_sql(query, connection)

data_1

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.139999
2,2013-01-03,92.970001
3,2013-01-04,93.120003
4,2013-01-07,93.199997
...,...,...
1213,2017-08-25,47.650002
1214,2017-08-28,46.400002
1215,2017-08-29,46.459999
1216,2017-08-30,45.959999
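
The output above shows a missing price on 2013-01-01 and no rows for the weekend of 2013-01-05 and 2013-01-06. One possible way to obtain a gap-free daily series is sketched below, using only the first five rows shown above. Linear interpolation with a backward fill for the leading gap is an assumption for illustration, not necessarily the treatment used later in this project.

```python
import numpy as np
import pandas as pd

# First five rows of the oil table as displayed above
oil = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"]
        ),
        "dcoilwtico": [np.nan, 93.139999, 92.970001, 93.120003, 93.199997],
    }
)

# Reindex onto a full daily calendar so the missing weekend dates get rows
full_calendar = pd.date_range(oil["date"].min(), oil["date"].max(), freq="D")
daily_oil = oil.set_index("date").reindex(full_calendar).rename_axis("date")

# Interpolate interior gaps linearly, then back-fill the leading missing value
daily_oil["dcoilwtico"] = daily_oil["dcoilwtico"].interpolate(method="linear").bfill()

print(daily_oil)
```

A forward fill (carrying the last quoted price over the weekend) would be an equally reasonable alternative to interpolation.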


In [5]:
# SQL query to extract the data
query = "SELECT * FROM dbo.holidays_events"

# Execute the SQL query and load the result into a pandas DataFrame
data_2 = pd.read_sql(query, connection)

# data_2.to_csv('holidays_events.csv')

data_2.head(5)

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [6]:
# SQL query to extract the data
query = "SELECT * FROM dbo.stores"

# Execute the SQL query and load the result into a pandas DataFrame
data_3 = pd.read_sql(query, connection)

# data_3.to_csv('stores.csv')

data_3.head(5)

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4
