<a href="https://colab.research.google.com/github/htharshht/ML_Retail_Sales_Prediction/blob/main/ML_Retail_Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -
# Retail Sales Prediction - LinearML


##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Project By -** Harshit Tiwari

# **Project Summary -**

Rossmann operates more than 3,000 drug stores in seven European countries. Rossmann store managers are currently required to forecast their daily sales up to six weeks in advance. Many factors, such as promotions, competition, school and state holidays, seasonality, and location, affect store sales. Because thousands of individual managers predict sales based on their own circumstances, the accuracy of the results can vary widely.

# **GitHub Link -**

https://github.com/htharshht/AlmaBetter-Projects

# **Problem Statement**


Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

**Data Description**

*   Rossmann Stores Data.csv - historical data including Sales
*   store.csv - supplemental information about the stores

**Data fields**

Most of the fields are self-explanatory.

*   Id - an Id that represents a (Store, Date) duple within the set
*   Store - a unique Id for each store
*   Sales - the turnover for any given day (Dependent Variable)
*   Customers - the number of customers on a given day

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing necessary libraries needed in EDA
import numpy as np
import pandas as pd

# Libraries for visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px     # will be used for plotting

# Libraries for model building
import math

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

import xgboost as xgb

from sklearn import metrics
from scipy import stats
from scipy.stats import norm
from scipy.stats import ttest_ind
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Loading datasets
store_ds = pd.read_csv('/content/drive/MyDrive/EDA Data Set/store (1).csv')           # Store data
sales_ds = pd.read_csv('/content/drive/MyDrive/EDA Data Set/Rossmann Stores Data.csv')       # Rossman sales data

### Dataset First View

In [None]:
# Dataset First Look
# top 5 rows of store data
store_ds.head()

In [None]:
# bottom 5 rows of store data
store_ds.tail()

In [None]:
# top 5 rows of sales data
sales_ds.head()

In [None]:
# bottom 5 rows of sales data
sales_ds.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count in store data
print(f'Number of rows in store dataset : {len(store_ds.axes[0])}')
print(f'Number of columns in store dataset : {len(store_ds.axes[1])}')

In [None]:
# Dataset Rows & Columns count sales data
print(f'Number of rows in sales dataset : {len(sales_ds.axes[0])}')
print(f'Number of columns in sales dataset : {len(sales_ds.axes[1])}')

### Dataset Information

In [None]:
# Store dataset Info
store_ds.info()

In [None]:
# Sales dataset Info
sales_ds.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count for sales and store datasets
print('Count of duplicate values in sales data :', sales_ds.duplicated().sum())
print('Count of duplicate values in store data :', store_ds.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count in store data
store_ds.isnull().sum()

In [None]:
# Missing Values/Null Values Count in sales data
sales_ds.isnull().sum()

In [None]:
# Visualizing the missing values using Seaborn heatmap

plt.figure(figsize=(20,6))
sns.heatmap(store_ds.isna().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})

plt.title('Missing Values', fontsize=18)
plt.show()

### What did you know about your dataset?

We have two datasets: the store dataset, which holds information about the individual stores, and the sales dataset, which holds the daily sales records of the Rossmann stores.
After analyzing them, we found that the sales dataset has no null values, while the store dataset has null/missing values in some of its columns.

1) CompetitionOpenSinceMonth - This column gives the approximate month in which the nearest competitor started operating. It contains numerical data and 354 null values, which could mean that there was no competitor.

2) CompetitionOpenSinceYear - This column gives the approximate year in which the nearest competitor started operating. It contains numerical data and 354 null values, which could mean that there was no competitor.

3) Promo2SinceWeek, Promo2SinceYear and PromoInterval are NaN wherever Promo2 is 0 or False, as can be seen in the first look at the dataset. They can be replaced with 0.

4) CompetitionDistance - the distance in meters to the nearest competitor store. The distribution plot will give us an idea of the distances at which competitors are typically located, and we will impute the missing values accordingly.
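
The observations above can be checked with a small summary of missing values per column. The sketch below is only an illustration: `missing_summary` is a hypothetical helper and the `toy` frame uses made-up values standing in for `store_ds`.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for store_ds (values are made up for illustration)
toy = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Promo2": [0, 1, 0, 1],
    "Promo2SinceWeek": [None, 13.0, None, 40.0],
    "CompetitionDistance": [1270.0, 570.0, None, 620.0],
})
summary = missing_summary(toy)
print(summary)

# Check that Promo2SinceWeek is missing exactly where Promo2 is 0
promo_nulls_match = toy["Promo2SinceWeek"].isnull().equals(toy["Promo2"] == 0)
print("Promo2SinceWeek missing only when Promo2 == 0:", promo_nulls_match)
```

On the real data, `missing_summary(store_ds)` would list only the columns that need attention.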



## ***2. Understanding Your Variables***

In [None]:
# Dataset Describe for sales data
sales_ds.describe()

In [None]:
# Dataset Describe for store data
store_ds.describe()

### Variables Description

**Rossmann Stores Data.csv** - historical data including Sales

**store.csv** - supplemental information about the stores

# ***Data fields***

Most of the fields are self-explanatory.

**Id** - an Id that represents a (Store, Date) duple within the set

**Store** - a unique Id for each store

**Sales** - the turnover for any given day (Dependent Variable)

**Customers** - the number of customers on a given day

**Open** - an indicator for whether the store was open: 0 = closed, 1 = open

**StateHoliday** - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

**SchoolHoliday** - indicates if the (Store, Date) was affected by the closure of public schools

**StoreType** - differentiates between 4 different store models: a, b, c, d

**Assortment** - describes an assortment level: a = basic, b = extra, c = extended. An assortment strategy in retailing involves the number and type of products that stores display for purchase by consumers.

**CompetitionDistance** - distance in meters to the nearest competitor store **CompetitionOpenSince**[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened

**Promo** - indicates whether a store is running a promo on that day

**Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

**Promo2Since[Year/Week]** - describes the year and calendar week when the store started participating in Promo2

**PromoInterval** - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable in store dataset.
pd.Series({col:store_ds[col].unique() for col in store_ds})           # creating a series consisting every column name of the dataset and it's value.
                                                                # used for loop to iterate over every column in the dataset

In [None]:
# Check Unique Values for each variable sales dataset.
pd.Series({col:sales_ds[col].unique() for col in sales_ds})

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# creating copies of the original datasets before making any changes to them
sales_ds1 = sales_ds.copy()
store_ds1 = store_ds.copy()

In [None]:
# checking columns of duplicated store dataset
store_ds1.columns

In [None]:
#distribution plot of competition distance
sns.distplot(x=store_ds1['CompetitionDistance'], hist = True)
plt.xlabel('Competition Distance Distribution Plot')

The CompetitionDistance distribution is right-skewed: most values lie on the left, with a long tail to the right. For such a distribution the median is more resistant to the effect of outliers than the mean.
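
To illustrate why the median is the more robust summary for a right-skewed column, here is a minimal sketch with made-up distances. It also shows median imputation as one possible alternative to a constant fill; the notebook itself fills missing values with 0, so this is an assumption for illustration, not the method used here.

```python
from statistics import mean, median

# Hypothetical competitor distances in metres (made up for illustration);
# None marks a missing value
distances = [250, 400, None, 620, 900, 1300, None, 2100, 75000]

observed = [d for d in distances if d is not None]
print("mean  :", mean(observed))    # pulled upward by the single very large value
print("median:", median(observed))  # stays close to the bulk of the data

# Median imputation: replace each missing value with the median of the observed ones
fill_value = median(observed)
imputed = [fill_value if d is None else d for d in distances]
print(imputed)
```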

In [None]:
# filling all the null values with 0.
# assigning the result back avoids chained in-place fills, which newer pandas versions may not apply
null_cols = ['CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear',
             'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval']
store_ds1[null_cols] = store_ds1[null_cols].fillna(0)

In [None]:
# Counting missing values after replacing them with 0
store_ds1.isnull().sum()

Successfully dealt with missing values.

# **Merging both datasets.**

In [None]:
merged = pd.merge(sales_ds1, store_ds1, on = 'Store', how = 'left')

In [None]:
merged.head()

In [None]:
merged.info()

In [None]:
# Checking duplicate value in merged data
merged.duplicated().sum()

In [None]:
# Changing object data type into int in StateHoliday column
merged.loc[merged['StateHoliday'] == '0', 'StateHoliday'] = 0
merged.loc[merged['StateHoliday'] == 'a', 'StateHoliday'] = 1
merged.loc[merged['StateHoliday'] == 'b', 'StateHoliday'] = 1
merged.loc[merged['StateHoliday'] == 'c', 'StateHoliday'] = 1
#store the value with same column name i.e StateHoliday with function astype
merged['StateHoliday'] = merged['StateHoliday'].astype(int, copy=False)

The letters 'a', 'b' and 'c' each indicate a type of state holiday, so I replaced them with 1, meaning a holiday, while 0 means no holiday.

In [None]:
# Changing the data type into int in Assortment column
merged.loc[merged['Assortment'] == 'a', 'Assortment'] = 0
merged.loc[merged['Assortment'] == 'b', 'Assortment'] = 1
merged.loc[merged['Assortment'] == 'c', 'Assortment'] = 2
#store the value with same column name i.e Assortment with function astype
merged['Assortment'] = merged['Assortment'].astype(int, copy=False)

In the Assortment column 'a' means basic assortment, so it was replaced with 0; 'b' means extra and 'c' means extended, so they were replaced with 1 and 2 respectively.

In [None]:
# Changing the object data type in Store type column
merged.loc[merged['StoreType'] == 'a', 'StoreType'] = 0
merged.loc[merged['StoreType'] == 'b', 'StoreType'] = 1
merged.loc[merged['StoreType'] == 'c', 'StoreType'] = 2
merged.loc[merged['StoreType'] == 'd', 'StoreType'] = 3
#store the value with same column name i.e StoreType with function astype
merged['StoreType'] = merged['StoreType'].astype(int, copy=False)

In the StoreType column a, b, c and d indicate different types of stores, so they were encoded as 0, 1, 2 and 3 respectively.
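
Integer codes are compact, but for a nominal column such as StoreType they suggest an ordering (a < b < c < d) that linear models may take literally. The sketch below contrasts the integer mapping with one-hot encoding on a made-up `toy` frame; it is an optional alternative under that assumption, not a step taken in this notebook.

```python
import pandas as pd

# Toy column standing in for merged['StoreType'] (values made up for illustration)
toy = pd.DataFrame({"StoreType": ["a", "b", "c", "d", "a"]})

# Integer codes: compact, but they imply an order between store types
codes = toy["StoreType"].map({"a": 0, "b": 1, "c": 2, "d": 3})
print(codes.tolist())

# One-hot encoding: one indicator column per category, no implied order
dummies = pd.get_dummies(toy["StoreType"], prefix="StoreType", dtype=int)
print(dummies)
```

Assortment (basic, extra, extended) has a more natural order, so integer codes are easier to justify there.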

In [None]:
# Changing float data into int type
merged['Promo2SinceWeek']= merged['Promo2SinceWeek'].astype(int)
merged['Promo2SinceYear']= merged['Promo2SinceYear'].astype(int)
merged['CompetitionDistance']= merged['CompetitionDistance'].astype(int)
merged['CompetitionOpenSinceMonth']= merged['CompetitionOpenSinceMonth'].astype(int)
merged['CompetitionOpenSinceYear']= merged['CompetitionOpenSinceYear'].astype(int)

In [None]:
# changing format of date from object to datetime
merged['Date'] = pd.to_datetime(merged['Date'], format= '%Y-%m-%d')

In [None]:
# Creating features from the datetime column

merged['Year'] = merged['Date'].dt.year
merged['Month'] = merged['Date'].dt.month
# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week returns the ISO week number
merged['WeekOfYear'] = merged['Date'].dt.isocalendar().week.astype(int)
merged['DayOfYear'] = merged['Date'].dt.dayofyear
years = merged['Year'].unique()
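
One detail worth keeping in mind for the features above: week-of-year values follow the ISO 8601 calendar, so around New Year the ISO week can belong to a different year than the calendar year. A minimal sketch with the standard library, using dates chosen only for illustration:

```python
from datetime import date

examples = (date(2014, 12, 28), date(2014, 12, 29), date(2015, 7, 31))
for d in examples:
    iso_year, iso_week, iso_weekday = d.isocalendar()
    print(d, "| calendar year:", d.year, "| ISO year:", iso_year, "| ISO week:", iso_week)

# 2014-12-29 is a Monday that already belongs to ISO week 1 of 2015
boundary = tuple(date(2014, 12, 29).isocalendar())[:2]
print(boundary)
```

For such rows Year is 2014 while WeekOfYear is 1, which matters whenever the two columns are combined into a duration.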


In [None]:
merged.info()

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 Categorical variables day of week against sales
plt.figure(figsize=(12,4))
sns.barplot(x=merged["DayOfWeek"], y=merged['Sales'])
plt.title("Sales in a Day")

plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is used to compare different categories. In this chart I compared the days of the week in terms of sales.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that sales are highest from Monday to Saturday and almost zero on Sunday, which is treated as a holiday.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is no insight that leads to negative growth. As most sales come on weekdays (Mon-Sat), stores can manage their stock accordingly for a positive customer experience.

#### Chart - 2

In [None]:
# Chart - 2 Visualizing relationship between Sales and CompetitionOpenSinceMonth
plt.figure(figsize=(12,4))
sns.barplot(x=merged["CompetitionOpenSinceMonth"], y=merged["Sales"])
plt.title("Sales by Competition Open Since Month")

plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is helpful for comparing two variables.

##### 2. What is/are the insight(s) found from the chart?

When there is competition in the market, sales may vary. The highest sales are seen for the values 5 and 6 of CompetitionOpenSinceMonth.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From the above chart it is clear that competition has little impact on the business, so stores should focus on customer satisfaction to boost sales.

#### Chart - 3

In [None]:
# Chart - 3 Visualization of impact of CompetitionOpenSinceYear on Sales
plt.figure(figsize=(12,4))
sns.pointplot(x= 'CompetitionOpenSinceYear', y= 'Sales', data=merged)
sns.set_style("dark")
plt.title('Plot between Sales and Competition Open Since year')
plt.show()

##### 1. Why did you pick the specific chart?

Line chart is good when there are more data to compare.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that stores whose nearest competitor has been open since 1900 have much higher sales than those for other years, because competition was very low then.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

As competition increased, sales declined, which is a negative insight. However, sales do not always follow a decreasing pattern; in fact they increased in 2000 and 2014. If the business focuses on quality and customer satisfaction, competition will have less impact on its growth.

#### Chart - 4

In [None]:
# Chart - 4 Store open/close status
plt.figure(figsize=(12,4))
sns.barplot(x=merged["Open"], y=merged["Sales"])
plt.title("Store Open/Close status")

plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is helpful for comparing two variables.

##### 2. What is/are the insight(s) found from the chart?

Sales are recorded only when a store is open; closed stores, for example on holidays, show no sales. Here 1 indicates that the store was open and 0 that it was closed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No negative insight was found in this chart. The business should operate whenever there is no holiday to earn the maximum possible profit.

#### Chart - 5

In [None]:
# Chart - 5 Effect of promo on sales
plt.figure(figsize=(12,4))
sns.barplot(x=merged["Promo"], y=merged["Sales"])
plt.title("Effect of promo on sales")

plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is useful for comparing variables, especially when there are only a few categories to compare.

##### 2. What is/are the insight(s) found from the chart?

From the above chart it is clear that sales were higher when a promo was offered to customers and lower when no promo was offered.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Based on the analysis of the above chart we can say that business should offer more promos to boost sales and profit.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12,4))
sns.barplot(x=merged["StateHoliday"], y=merged["Sales"])
plt.title("Effect of State holiday on sales")

plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars with heights proportional to the values they represent.

##### 2. What is/are the insight(s) found from the chart?

Normally stores are closed on state holidays, with a few exceptions, which is why sales are very low on state holidays.
0 represents no state holiday and 1 represents a state holiday, such as Christmas or a public holiday.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is no negative insight found from the above chart.

#### Chart - 7

In [None]:
# Chart - 7 visualization code to check effect of school holiday on sales
plt.figure(figsize=(12,4))
sns.barplot(x=merged["SchoolHoliday"], y=merged["Sales"])
plt.title("Effect of School holiday on sales")

plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars with heights proportional to the values they represent.

##### 2. What is/are the insight(s) found from the chart?

A few stores are closed on school holidays, hence sales are slightly lower than on normal days.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is no negative insight found.

#### Chart - 8

In [None]:
# Chart - 8 visualization code to check sales for different types of stores
plt.figure(figsize=(16, 5))
sns.barplot(x=merged["StoreType"], y=merged["Sales"])
plt.title("Sales based on Store Type")

plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars with heights proportional to the values they represent.

##### 2. What is/are the insight(s) found from the chart?

Store type 1, which is store type b, has the highest sales compared with the other store types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

As store type 'b' has highest sales then business can open more b type stores to increase overall sales and profit.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(15,5))
merged.groupby('Assortment')['Sales'].sum().plot.pie(title = 'sales based on different assortment level', autopct='%1.2f%%', legend = True)
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart helps organize and show data as a percentage of a whole

##### 2. What is/are the insight(s) found from the chart?

Basic assortment accounts for a larger share of sales than extra and extended.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is no insight leading to negative growth.

#### Chart - 10

In [None]:
# Chart - 10 visualization code to check sales on consecutive promo
plt.figure(figsize=(12,5))
sns.barplot(x=merged["Promo2"], y=merged["Sales"])
plt.title("Sales based on Promo2 (Consecutive promo)")

plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars with heights proportional to the values they represent.

##### 2. What is/are the insight(s) found from the chart?

Most stores did not run the consecutive promo, and sales for stores with Promo2 are lower than for stores without it.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is no negative insight but business should run consecutive promo to see positive growth in the business.

#### Chart - 11

In [None]:
# Chart - 11 visualization code to check sales on promo intervals
plt.figure(figsize=(12,5))
sns.barplot(x=merged["PromoInterval"], y=merged["Sales"])
plt.title("Sales on Promo Intervals")

plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars with heights proportional to the values they represent.

##### 2. What is/are the insight(s) found from the chart?

After analyzing the above chart it is clear that sales in the first promo interval were higher than in the other intervals, so the schedule chosen for the first interval worked well.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No negative insight was found. The business should follow the strategy of the first interval in the other intervals as well.

#### Chart - 12

In [None]:
# Chart - 12 visualization code to check customer share in different types of store
#customers and store type
merged.groupby("StoreType")["Customers"].sum().plot.pie(title='Customer Share', legend=True, autopct='%1.1f%%', shadow=True)
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart helps organize and show data as a percentage of a whole

##### 2. What is/are the insight(s) found from the chart?

Most customers visited store type 0, followed by store types 3, 2 and 1.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Only 4.9% of all customers visited store type 1 and 14.3% visited store type 2, so the business needs to change its strategy for these two store types to increase customer visits and sales.

#### Chart - 13

In [None]:
# printing unique years which is stored in 'years' variable
years

In [None]:
# creating datasets of sales over different years
sales_df_2013 = merged[merged['Year']== 2013]
sales_df_2014 = merged[merged['Year']==2014]
sales_df_2015 = merged[merged['Year']== 2015]

In [None]:
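# Creating datasets of sales over different months
# summing only the Sales column avoids errors on non-numeric columns such as Date and PromoInterval
sales_2013 = sales_df_2013.groupby("Month")["Sales"].sum().reset_index()
sales_2014 = sales_df_2014.groupby("Month")["Sales"].sum().reset_index()
sales_2015 = sales_df_2015.groupby("Month")["Sales"].sum().reset_index()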


In [None]:
# Line graph comparison of monthly sales in different years
plt.figure(figsize=(18,7))
plt.plot(sales_2013['Month'], sales_2013['Sales'], label = "2013", color = "red")
plt.plot(sales_2014['Month'], sales_2014['Sales'], label = "2014", color = "green")
plt.plot(sales_2015['Month'], sales_2015['Sales'], label = "2015", color = "blue")
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.title('Monthly Sales Over Years', fontsize = 15)
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

Line graphs are used to track changes over short and long periods of time.

##### 2. What is/are the insight(s) found from the chart?

The line chart shows that sales rise at the end of each year because of festivals and the holiday season. Sales for 2014 dipped for a couple of months, July to September, indicating that stores were closed for refurbishment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No negative insight was found.
Stores should manage their stocks according to the holidays.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Extracting only meaningful columns for heatmap

columns_to_drop = ['Store', 'Year', 'WeekOfYear', 'DayOfYear']
corr_df = merged.drop(columns = columns_to_drop)
# StateHoliday is already 0/1 after data wrangling; the replace is kept as a safeguard
corr_df['StateHoliday'] = corr_df['StateHoliday'].replace({'a':1, 'b':1,'c':1})

In [None]:
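# creating correlation heatmap
# restrict to numeric columns; Date and PromoInterval cannot be correlated directly
x = corr_df.select_dtypes(include='number').corr()
plt.figure(figsize=(15,12))
sns.heatmap(x, cmap="YlOrBr", annot=True)
plt.title('Correlation heatmap', fontsize = 14)
plt.show()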

##### 1. Why did you pick the specific chart?

A correlation heatmap shows the correlation between different variables.

##### 2. What is/are the insight(s) found from the chart?

1) Day of the week has a negative correlation with sales, indicating lower sales towards the weekend, while Promo, Customers and Open have a positive correlation.

2) State Holiday has a negative correlation suggesting that stores are mostly closed on state holidays indicating low sales.

3) CompetitionDistance shows a negative correlation, suggesting that sales reduce as the distance increases, which was also observed through the scatterplot earlier.

4) There's multicollinearity involved in the dataset as well. The features telling the same story like Promo2, Promo2 since week and year are showing multicollinearity.

5) The correlation matrix is agreeing with all the observations done earlier while exploring through barplots and scatterplots.
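
Multicollinearity of the kind noted in point 4 can be quantified with the variance inflation factor (VIF). For standardized features, the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix. The sketch below uses synthetic columns whose names only mimic the dataset; the data and the rule of thumb of 10 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins (not the Rossmann data)
promo2 = rng.integers(0, 2, size=n).astype(float)
# "since year" is 0 when Promo2 is 0 and a year otherwise, so it nearly duplicates Promo2
promo2_since_year = promo2 * rng.integers(2009, 2015, size=n)
competition_distance = rng.exponential(5000.0, size=n)

X = np.column_stack([promo2, promo2_since_year, competition_distance])
corr = np.corrcoef(X, rowvar=False)

# VIF_j = 1 / (1 - R_j^2) = j-th diagonal entry of the inverse correlation matrix
vif = np.diag(np.linalg.inv(corr))
for name, value in zip(["Promo2", "Promo2SinceYear", "CompetitionDistance"], vif):
    print(f"{name:20s} VIF = {value:,.1f}")
```

A VIF far above a common rule of thumb such as 10 suggests that the feature carries little information beyond the others, which supports combining the Promo2 columns into fewer features.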

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(merged)
plt.show()

##### 1. Why did you pick the specific chart?

A pairs plot allows us to see both distribution of single variables and relationships between two variables .

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Average sales of the stores is at most 4500.

Ho - μ ≤ 4500

Ha - μ > 4500

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# extract sales column from dataset and try to find shape of the sales
sales = pd.DataFrame(merged, columns = ['Sales'])

sales.shape


In [None]:
# mean of the sales
sales.mean()

In [None]:
# standard deviation of sales
sales.std()

In [None]:
# convert sales data into list format

sales_list = merged["Sales"].tolist()

In [None]:
# choose random sample from sales dataset (sample size is 200)

import random
random_sample = random.sample(sales_list, 200)
random_samples = pd.DataFrame(random_sample)


In [None]:
# Average/ mean of random samples
random_samples.mean()

In [None]:
# creating a function to calculate z score
def calculate_z_score(value, random_samples):
    mean = random_samples.mean()
    std_dev = random_samples.std()
    # sample size is taken from the argument rather than from a global variable
    square_root = math.sqrt(len(random_samples))
    z_score = (mean - value) / (std_dev / square_root)
    return z_score


In [None]:
# calculating z score on average sales and random samples
average_sales = 4500

z_score = calculate_z_score(average_sales, random_samples)
print("Z-Score:", z_score)

In [None]:
# Calculating P-value

prob_z = norm.cdf(z_score)
print(prob_z)

In [None]:
P_value = 1 - prob_z
print(P_value)

##### Which statistical test have you done to obtain P-Value?

I used a one-sample, right-tailed Z-test: the Z-score of the sample mean was converted to a P-value using the standard normal CDF.
Based on the above observations, the null hypothesis is rejected.
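
The test above can be summarized as z = (x̄ - μ0) / (s / √n), with the right-tailed P-value given by 1 - Φ(z). The sketch below restates it with the standard library only. `right_tailed_z_test` is a hypothetical helper, and the sample is synthetic (the mean of 5800 and standard deviation of 3800 are assumptions, not figures from the dataset).

```python
import math
import random
from statistics import NormalDist, mean, stdev

def right_tailed_z_test(sample, mu0):
    """Return (z, p) for Ho: mean <= mu0 against Ha: mean > mu0."""
    n = len(sample)
    z = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value

random.seed(42)
# Synthetic daily sales, NOT the Rossmann data
sample = [random.gauss(5800, 3800) for _ in range(200)]

z, p = right_tailed_z_test(sample, mu0=4500)
print(f"z = {z:.2f}, p = {p:.4f}")
print("reject Ho" if p < 0.05 else "fail to reject Ho")
```

The decision rule compares the P-value with a significance level chosen in advance (0.05 in this sketch).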

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

It is mentioned in the problem statement that some stores were temporarily closed for refurbishment and hence did not generate any sales. Rows for those days record 0 sales, so they are removed.

In [None]:
# Handling Missing Values & Missing Value Imputation
# keep only the days on which the store was open, then drop the now-constant 'Open' column
m_df = merged[merged.Open != 0].drop('Open', axis=1)

m_df.head()

#### What all missing value imputation techniques have you used and why did you use those techniques?

The store-level columns with missing values were filled with 0 during data wrangling. In this step, the rows for days on which a store was closed (Open = 0) were removed by filtering, and the Open column was then dropped with .drop().
Removing these records helps maintain the integrity of the analysis, since closed days carry no sales information.

### 2. Data transformation

In [None]:
# transformation
# keep only rows with positive sales so that the log is defined (log(0) = -inf)
m_df = m_df[m_df['Sales'] > 0].copy()

m_df['Sales'] = np.log(m_df['Sales'])

m_df.dropna(inplace=True)

sns.distplot(x=m_df['Sales'])
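
Because the target is modelled on a log scale after this step, any prediction has to be mapped back with the exponential before it is read as sales. The sketch below shows the round trip on made-up values; it assumes zero-sales rows are excluded first, as in the cell above.

```python
import numpy as np

# Made-up daily sales, including one zero-sales day
sales = np.array([0.0, 1500.0, 5263.0, 12000.0])

# log(0) is -inf, so only positive sales are transformed
positive = sales[sales > 0]
log_sales = np.log(positive)
print(log_sales)

# np.exp is the inverse of np.log and recovers the original scale
restored = np.exp(log_sales)
print(restored)
```

If zero-sales days had to be kept, `np.log1p` with `np.expm1` as its inverse would be one alternative.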

In [None]:
#changing into boolean
m_df['StateHoliday'].replace({'a':1, 'b':1,'c':1}, inplace=True)

In [None]:
#combining competition open since month and year into total months
m_df['CompetitionOpen'] = (m_df['Year'] - m_df['CompetitionOpenSinceYear'])*12 + (m_df['Month'] - m_df['CompetitionOpenSinceMonth'])
#a negative value means the competitor opened after the sale date, so clip it to 0
m_df['CompetitionOpen'] = m_df['CompetitionOpen'].clip(lower=0)
#dropping both source columns
m_df.drop(['CompetitionOpenSinceMonth','CompetitionOpenSinceYear'], axis=1,inplace=True)
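
As a quick check of the arithmetic, the same month difference can be written for a single row. The function name and the example dates below are hypothetical and only illustrate the formula used in the cell above.

```python
def months_since_opening(sale_year, sale_month, open_year, open_month):
    """Whole months between a competitor's opening and the sale date, floored at 0."""
    months = (sale_year - open_year) * 12 + (sale_month - open_month)
    # a negative result means the competitor had not opened yet
    return max(months, 0)

# competitor opened in September 2013, sale recorded in July 2015
print(months_since_opening(2015, 7, 2013, 9))
# competitor opens in January 2014, sale recorded in March 2013
print(months_since_opening(2013, 3, 2014, 1))
```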

In [None]:
#changing promo2 features into meaningful inputs
#combining promo2 start year and week into total months; 0.230137 = 12*7/365 converts weeks to months
m_df['Promo2Open'] = (m_df['Year'] - m_df['Promo2SinceYear'])*12 + (m_df['WeekOfYear'] - m_df['Promo2SinceWeek'])*0.230137

#setting negative values to 0 and zeroing out stores that do not participate in Promo2
m_df['Promo2Open'] = m_df['Promo2Open'].apply(lambda x:0 if x < 0 else x)*m_df['Promo2']

#creating a feature for promo interval and checking if promo2 was running in the sale month
def promo2running(df):
  month_dict = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sept', 10:'Oct', 11:'Nov', 12:'Dec'}
  try:
    months = df['PromoInterval'].split(',')
    if df['Month'] and month_dict[df['Month']] in months:
      return 1
    else:
      return 0
  except Exception:
    return 0

#Applying
m_df['Promo2running'] = m_df.apply(promo2running,axis=1)*m_df['Promo2']

#Dropping unnecessary columns
m_df.drop(['Promo2SinceYear','Promo2SinceWeek','PromoInterval'],axis=1,inplace=True)
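
The two Promo2 features can be sketched for a single row as follows. This is an illustrative, dependency-free version: the function names are hypothetical, the weeks-to-months factor is written out as 12*7/365 (about 0.230137), and the month abbreviations follow the notebook's `month_dict`, including 'Sept'.

```python
WEEKS_TO_MONTHS = 12 * 7 / 365  # about 0.230137 months per week

# month abbreviations as used in the notebook's month_dict (note 'Sept')
MONTH_ABBREVIATIONS = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
                       7: 'Jul', 8: 'Aug', 9: 'Sept', 10: 'Oct', 11: 'Nov', 12: 'Dec'}

def promo2_open_months(year, week_of_year, since_year, since_week, promo2):
    """Months since Promo2 started, floored at 0 and zeroed for non-participants."""
    months = (year - since_year) * 12 + (week_of_year - since_week) * WEEKS_TO_MONTHS
    return max(months, 0) * promo2

def promo2_running(month, promo_interval, promo2):
    """1 if the store participates in Promo2 and the sale month is in its interval."""
    if not promo2 or not isinstance(promo_interval, str):
        return 0  # missing interval (e.g. NaN) or non-participating store
    return int(MONTH_ABBREVIATIONS[month] in promo_interval.split(','))

print(promo2_open_months(2015, 30, 2014, 10, 1))
print(promo2_running(9, 'Mar,Jun,Sept,Dec', 1))
print(promo2_running(7, 'Mar,Jun,Sept,Dec', 1))
```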

In [None]:
#setting date and store as index
m_df.set_index(['Date','Store'],inplace=True)
#sorting index following the time series
m_df.sort_index(inplace=True)


In [None]:
m_df.head(1)

# **Handling Outliers**

In [None]:

# Handling Outliers & Outlier treatments

# mean of (log-transformed) sales
sales_mean = np.mean(m_df["Sales"])

# standard deviation of sales
sales_std = np.std(m_df["Sales"])

# a value more than 3 standard deviations above the mean is considered an outlier
threshold = 3

# vectorised z-score; only the upper tail is flagged
z_scores = (m_df["Sales"] - sales_mean)/sales_std
outliers = m_df.loc[z_scores > threshold, "Sales"]

#total no of outliers
print(f'Total number of Outliers present in the Sales column are {len(outliers)}.')
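
The z-score rule used above can be illustrated on a tiny list. This is a sketch with made-up log-sales values; `pstdev` is used because `np.std` defaults to the population standard deviation (`ddof=0`).

```python
from statistics import mean, pstdev

def upper_tail_outliers(values, threshold=3.0):
    """Values lying more than `threshold` standard deviations above the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population std, matching np.std's default ddof=0
    return [v for v in values if (v - mu) / sigma > threshold]

# made-up log-sales values with one unusually high observation
illustrative_log_sales = [8.6, 8.7, 8.8, 8.9, 9.0] * 4 + [11.0]
print(upper_tail_outliers(illustrative_log_sales))
```

Only the upper tail is checked here, mirroring the notebook; a two-sided rule would compare the absolute z-score against the threshold instead.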


In [None]:
#plotting the outlier distribution (histplot replaces the deprecated distplot)
sns.histplot(x=outliers, kde=True).set(title='Outliers Distribution')


Very few data points have a log-transformed sales value above 10.2 (roughly 26,900 in original sales units), so they can be considered outliers.

In [None]:
#percentage of observations with log-sales greater than 10.2
sales_outliers = m_df.loc[m_df['Sales']> 10.2]
percentage_of_outliers = (len(sales_outliers)/len(m_df))*100
#print
print(f'The percentage of observations with log-sales greater than 10.2 is {percentage_of_outliers:.4f}')
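
Because `Sales` is on a log scale at this point, the 10.2 cutoff is easier to interpret after mapping it back to sales units. A short sketch of that conversion, plus a hypothetical helper for the percentage calculation:

```python
import math

log_threshold = 10.2
# exp() undoes the natural-log transform applied to Sales
sales_threshold = math.exp(log_threshold)
print(f"log-sales {log_threshold} corresponds to sales of about {sales_threshold:,.0f}")

def share_above(values, cutoff):
    """Percentage of values strictly greater than cutoff."""
    return 100 * sum(v > cutoff for v in values) / len(values)

# made-up log-sales values, for illustration only
print(share_above([8.5, 9.1, 10.3, 8.9], log_threshold))
```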

In [None]:
#exploring the reasons behind this behaviour
sales_outliers

# Observation:
Some interesting insights can be drawn from the outliers dataframe:

1) The first thing to notice is the DayOfWeek for Store 262: it is a Sunday, yet the store, which is of type B, records high sales.

2) All other data points had a promotion running and a high number of customers as well, so nothing about them looks anomalous.

3) The outliers occur for stores with promotion = 1 and store type B. It would not be wise to treat them, because the reasons behind this behaviour seem legitimate.

In [None]:
#lets see which stores were open on Sunday in the outliers dataframe
#store 262
sales_outliers.loc[sales_outliers['DayOfWeek']==7]

In [None]:
#let's explore store type and Day Of week
sns.barplot(x=m_df['DayOfWeek'],y=m_df["Sales"],hue=m_df['StoreType'])

# Outlier Treatment:

1) The outliers occur for stores with promotion = 1 and store type B. It would not be wise to treat them, because the reasons behind this behaviour seem legitimate and are important from a business point of view.

2) The primary drivers of the behaviour are promotions and store type B.

3) When outliers are valid occurrences, it is better not to delete or alter them, especially once the ups and downs of the target variable have been related to the other features. The data shows seasonality, so a linear model is unlikely to fit well. Tree-based machine learning algorithms are a common choice for such datasets because they are comparatively robust to outliers.

# **Data Splitting**

In [None]:
#Sales should be the last col
columns=list(m_df.columns)
columns.remove('Sales')
columns.append('Sales')
m_df=m_df[columns]

In [None]:
#check
m_df.head(1)

In [None]:
# we won't need customers for sales forecasting
m_df.drop('Customers',axis=1,inplace=True)

In [None]:
#slicing the most recent six weeks and creating train and test set
#train
start_train = pd.to_datetime("2013-01-01")
end_train = pd.to_datetime("2015-06-14")
df_train = m_df.loc[start_train:end_train]
#test
start_test = pd.to_datetime("2015-06-15")
end_test = pd.to_datetime("2015-07-31")
df_test = m_df.loc[start_test:end_test]
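
The split above is chronological rather than random, so the model is always evaluated on dates later than anything it was trained on. The idea can be sketched with plain Python; the records below are hypothetical, and only the boundary dates come from the notebook.

```python
from datetime import date, timedelta

TRAIN_END = date(2015, 6, 14)
TEST_START = date(2015, 6, 15)
TEST_END = date(2015, 7, 31)

def chronological_split(records):
    """Split (date, value) records so every test date is later than every train date."""
    train = [r for r in records if r[0] <= TRAIN_END]
    test = [r for r in records if TEST_START <= r[0] <= TEST_END]
    return train, test

# hypothetical daily records from 1 June to 31 July 2015
records = [(date(2015, 6, 1) + timedelta(days=i), 5000 + i) for i in range(61)]
train, test = chronological_split(records)

test_window_days = (TEST_END - TEST_START).days + 1
print(len(train), len(test), test_window_days)
```

The held-out window spans 47 days, which is slightly more than six weeks.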


In [None]:
#X and y split for train and test
X_train = df_train.drop('Sales',axis=1)
y_train = df_train[['Sales']].copy()   # copy so the later scaling does not write into a slice of df_train
X_test = df_test.drop('Sales',axis=1)
y_test = df_test[['Sales']].copy()
print(f'The shape of X_train is: {X_train.shape}')
print(f'The shape of y_train is: {y_train.shape}')
print(f'The shape of X_test is: {X_test.shape}')
print(f'The shape of y_test is: {y_test.shape}')

### 3. Categorical Encoding

In [None]:
#importing
from sklearn.preprocessing import OneHotEncoder

#define categorical features
categorical_cols = ['DayOfWeek', 'StoreType', 'Assortment']

# create the encoder with dense output (on scikit-learn < 1.2 the argument is named `sparse`)
encoder = OneHotEncoder(sparse_output=False)

# fit the encoder on the training set only
encoder.fit(X_train[categorical_cols])
encoded_features = list(encoder.get_feature_names_out(categorical_cols))
X_train[encoded_features] = encoder.transform(X_train[categorical_cols])

# apply the fitted encoder to the test set
X_test[encoded_features] = encoder.transform(X_test[categorical_cols])

# drop original features
X_train.drop(categorical_cols,axis=1,inplace=True)
X_test.drop(categorical_cols,axis=1,inplace=True)
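
For readers unfamiliar with one-hot encoding, the following dependency-free sketch shows the idea of learning the categories from the training data and reusing them for the test data. It is an illustration with hypothetical helper names, not scikit-learn's implementation. Like `OneHotEncoder` with its default `handle_unknown='error'`, it rejects categories that were not seen during fitting.

```python
def fit_categories(train_values):
    """Learn the sorted list of distinct categories from the training data."""
    return sorted(set(train_values))

def one_hot(value, categories):
    """Encode one value as a 0/1 vector over the learned categories."""
    if value not in categories:
        raise ValueError(f"unseen category: {value!r}")
    return [1 if value == c else 0 for c in categories]

# made-up StoreType values for illustration
train_store_types = ['a', 'c', 'a', 'd', 'b']
categories = fit_categories(train_store_types)

print(categories)
print(one_hot('c', categories))
```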

# **Scaling the Data**

In [None]:
# scaling: fit on the training set only, then apply the same parameters to the test set
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train[list(X_train.columns)] = stdsc.fit_transform(X_train[list(X_train.columns)])
X_test[list(X_test.columns)] = stdsc.transform(X_test[list(X_test.columns)])

# separate scaler for the target so that predictions can be inverse-transformed later
scaler = StandardScaler()
y_train[list(y_train.columns)] = scaler.fit_transform(y_train[list(y_train.columns)])
y_test[list(y_test.columns)] = scaler.transform(y_test[list(y_test.columns)])
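
Standard scaling subtracts the training mean and divides by the training standard deviation. The sketch below is a simplified stand-in for `StandardScaler` (the class name and sample values are assumptions) and shows why the parameters are learned from the training set and then reused unchanged on the test set.

```python
from statistics import mean, pstdev

class SimpleStandardScaler:
    """Minimal single-column stand-in for StandardScaler (population std, ddof=0)."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.scale_ = pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]

    def inverse_transform(self, values):
        return [v * self.scale_ + self.mean_ for v in values]

# made-up log-sales values
train_values = [8.0, 8.5, 9.0, 9.5]
test_values = [8.75, 10.0]

sketch_scaler = SimpleStandardScaler().fit(train_values)   # parameters come from train only
scaled_test = sketch_scaler.transform(test_values)
print(scaled_test)
print(sketch_scaler.inverse_transform(scaled_test))
```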

## ***7. ML Model Implementation***

### ML Model - 1
**Decision Tree Regressor**

In [None]:
# ML Model - 1 Implementation

# fitting decision tree
dt_basic = DecisionTreeRegressor(random_state=42)
dt_basic.fit(X_train,y_train)

# Define the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [1, 2, 3, 4]
}

# Columns needed to compare metrics
comparison_columns = ['Model_Name', 'Train_MAE', 'Train_MSE', 'Train_RMSE', 'Train_R2', 'Train_Adj_R2' ,'Test_MAE', 'Test_MSE', 'Test_RMSE', 'Test_R2', 'Test_Adj_R2']
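
The notebook defines `param_grid` but does not show it being used. One way such a grid could be applied is an exhaustive search with `GridSearchCV`. The sketch below is an assumption rather than the author's method: it runs on synthetic data (`X_demo`, `y_demo`) instead of the Rossmann features, and it uses `TimeSeriesSplit` so that each validation fold comes after its training fold in row order.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for the training data (assumption, not the Rossmann data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# same grid as defined in the notebook
param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [1, 2, 3, 4]
}

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=3),       # folds respect the row order
    scoring='neg_mean_squared_error',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
tuned_tree = search.best_estimator_       # refit on all of X_demo with the best parameters
```

Limiting `max_depth` and raising `min_samples_leaf` are the usual levers for reducing the overfitting that an unconstrained tree shows on the training set.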


In [None]:
# function to evaluate the model
def model_evaluation(model_name,model_variable,X_train,y_train,X_test,y_test):
  # This function predicts and evaluates various models for regression algorithms, visualizes results and returns a one-element list holding the dict of metrics used to build the comparison dataframe.

  # making predictions
  y_pred_train = model_variable.predict(X_train)
  y_pred_test = model_variable.predict(X_test)

  # Plot the test results
  a = y_test.copy()
  a['Pred Sales'] = y_pred_test.tolist()
  df_plot = a.reset_index(level=['Date'])
  plot = df_plot.groupby('Date')[['Sales','Pred Sales']].sum()
  sns.lineplot(data = plot)
  plt.ylabel("Total Sales and Predicted Sales")
  plt.xticks(rotation = 25)


  # calculate metrics and print the results for test set
  # Mean Absolute Error or MAE
  MAE_train = round(mean_absolute_error(y_train,y_pred_train),6)
  MAE_test = round(mean_absolute_error(y_test,y_pred_test),6)

  # Mean Squared Error or MSE
  MSE_train = round(mean_squared_error(y_train,y_pred_train),6)
  MSE_test = round(mean_squared_error(y_test,y_pred_test),6)

  # Root Mean Squared Error or RMSE (square root of MSE; avoids the `squared` argument that newer scikit-learn deprecates)
  RMSE_train = round(np.sqrt(mean_squared_error(y_train,y_pred_train)),6)
  RMSE_test = round(np.sqrt(mean_squared_error(y_test,y_pred_test)),6)

  # R2
  R2_train = round(r2_score(y_train, y_pred_train),6)
  R2_test = round(r2_score(y_test, y_pred_test),6)

  # Adjusted R2
  Adj_r2_train = round(1 - (1-r2_score(y_train, y_pred_train)) * (len(y_train)-1)/(len(y_train)-X_train.shape[1]-1),6)
  Adj_r2_test = round(1 - (1-r2_score(y_test, y_pred_test)) * (len(y_test)-1)/(len(y_test)-X_test.shape[1]-1),6)

  # printing test results
  print(f'The Mean Absolute Error for the validation set is {MAE_test}')
  print(f'The Mean Squared Error for the validation set is {MSE_test}')
  print(f'The Root Mean Squared Error for the validation set is {RMSE_test}')
  print(f'The R^2 for the validation set is {R2_test}')
  print(f'The Adjusted R^2 for the validation set is {Adj_r2_test}')

  # Saving our results
  global comparison_columns
  metric_scores = [model_name,MAE_train,MSE_train,RMSE_train,R2_train,Adj_r2_train,MAE_test,MSE_test,RMSE_test,R2_test,Adj_r2_test]
  final_dict = dict(zip(comparison_columns,metric_scores))
  return [final_dict]
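
The metrics reported by `model_evaluation` can be reproduced by hand, which makes the adjusted R² formula explicit: with n observations and p features, Adj R² = 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below uses made-up numbers and a hypothetical helper name.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """MAE, MSE, RMSE, R2 and adjusted R2 computed from first principles."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    y_mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'Adj_R2': adj_r2}

# made-up targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 11.5, 13.0]
metrics = regression_metrics(y_true, y_pred, n_features=2)
print(metrics)
```

Adjusted R² is never larger than R² and penalises additional features, which matters here because one-hot encoding increases the feature count.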

In [None]:
#function to create the comparison table
final_list = []
def add_list_to_final_df(dict_list):
  global final_list
  for elem in dict_list:
    final_list.append(elem)
  global comparison_df
  comparison_df = pd.DataFrame(final_list, columns= comparison_columns)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
#decision tree evaluation
decision_tree = model_evaluation('Decision Tree Regressor',dt_basic,X_train,y_train,X_test,y_test)

In [None]:
#add results to comparison df
add_list_to_final_df(decision_tree)

In [None]:
#comparison df
comparison_df

**Observation:**

1) The baseline model, a decision tree, was chosen because most of our features are categorical and only a few are continuous.

2) The above results show that a simple decision tree performs reasonably well on the validation set, but it has completely overfitted the train set. A more generalized model is preferable for future data points.

3) Businesses prefer interpretable models in order to understand the patterns and strategize accordingly, unlike a scientific setting where the results often matter more than interpretability.

4) If interpretability is important, it is beneficial to stick with tree-based algorithms when most of the features are categorical, and to use tuned hyperparameters to grow the tree deep enough without overfitting.

### ML Model - 2
**Random Forest**

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# fitting (ravel turns the single-column target into the 1-D array that scikit-learn expects)
random_forest = RandomForestRegressor(n_estimators=100,random_state=42)
random_forest.fit(X_train,y_train.values.ravel())

In [None]:
#model evaluation
random_f = model_evaluation('Random Forest Regressor',random_forest,X_train,y_train,X_test,y_test)

In [None]:
# updating comparison dataset
add_list_to_final_df(random_f)

In [None]:
#comparison df
comparison_df

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Create model (fixed random_state so the search is reproducible)
rf_tuned = RandomForestRegressor(random_state=42)


In [None]:
# import
from sklearn.model_selection import RandomizedSearchCV

# parameter distributions to sample from
random_grid = {'bootstrap': [True, False],
 'max_depth': [ 90, 100, None],
 'max_features': [1.0, 'sqrt'],   # 1.0 (all features) replaces 'auto', which newer scikit-learn no longer accepts
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 8],
 'n_estimators': [20, 40, 60]}

rf_random = RandomizedSearchCV(estimator = rf_tuned, param_distributions = random_grid, n_iter = 10, cv = 3, verbose=2, random_state=42)

# fitting
rf_random.fit(X_train,y_train.values.ravel())

In [None]:
#best parameters
rf_random.best_params_

In [None]:
#save the best estimator (already refit on the full training set with the best parameters)
random_t = rf_random.best_estimator_

In [None]:
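#Columns needed to compare metrics (same list as defined before the evaluation function)

comparison_columns = ['Model_Name', 'Train_MAE', 'Train_MSE', 'Train_RMSE', 'Train_R2', 'Train_Adj_R2' ,'Test_MAE', 'Test_MSE', 'Test_RMSE', 'Test_R2', 'Test_Adj_R2']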

In [None]:
#evaluate tuned model
random_tuned = model_evaluation('Random Forest Tuned',random_t,X_train,y_train,X_test,y_test)

In [None]:
#add to comparison_df (DataFrame.append was removed in pandas 2.0, so the helper is reused)
add_list_to_final_df(random_tuned)

In [None]:
#comparison_df
comparison_df

### ML Model - 3
**XGBoost**

In [None]:
# ML Model - 3 Implementation
# Initialize the XGBoost Regressor
model = xgb.XGBRegressor(
    n_estimators=50,    # Number of boosting rounds (trees)
    learning_rate=0.1,    # Step size shrinkage to prevent overfitting
    max_depth=3,          # Maximum depth of the trees
    subsample=0.8,        # Fraction of samples to be used for fitting the individual trees
    colsample_bytree=0.8, # Fraction of features to be used for fitting the individual trees
    random_state=42
)

# Fit the model to the training data
model.fit(X_train, y_train)

In [None]:
#model evaluation
XGBoost = model_evaluation('XGBoost Regressor',model,X_train,y_train,X_test,y_test)

In [None]:
# updating comparison df
add_list_to_final_df(XGBoost)


In [None]:
#comparison df
comparison_df

In [None]:
#visualising feature importance of the tuned random forest (random_t)
feature_imp = pd.DataFrame({"Variable": X_test.columns,"Importance": random_t.feature_importances_})
feature_imp.sort_values(by="Importance", ascending=False, inplace = True)
sns.barplot(x=feature_imp['Importance'], y= feature_imp['Variable'])

In [None]:
#Taking a look of our final comparison dataframe
comparison_df


In [None]:
#test R2 scores copied from the comparison dataframe above
baseline_r2 = 0.656444
random_r2 = 0.824541
r_tuned_r2 = 0.819486
XGBoost_r2 = 0.334879

In [None]:
#improvement %
improvement_r = ((random_r2 - baseline_r2)/baseline_r2)*100
print('Model Performance')
print(f'Improvement of {round(improvement_r,3)} % was seen in Random Forest against Decision Tree.')

In [None]:
#improvement % of tuned vs baseline

improvement_r = ((r_tuned_r2 - baseline_r2)/baseline_r2)*100
print('Model Performance')
print(f'Improvement of {round(improvement_r,3)} % was seen in Random Forest Tuned against Decision Tree.')


In [None]:
#improvement % of tuned vs simple random forest
improvement_r = ((r_tuned_r2 - random_r2)/random_r2)*100
print('Model Performance')
print(f'Improvement of {round(improvement_r,3)} % was seen in Random Forest Tuned against Simple Random Forest.')


In [None]:
#improvement % of tuned random forest vs XGBoost
improvement_r = ((r_tuned_r2 - XGBoost_r2)/XGBoost_r2)*100
print('Model Performance')
print(f'Improvement of {round(improvement_r,3)} % was seen in Random Forest Tuned against XGBoost Regressor.')
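
The four comparisons above repeat the same formula, so they can be condensed into one helper. This is an optional sketch (the function and dictionary names are hypothetical) using the test R² values listed in the notebook. A negative result means the first model scored lower than the reference.

```python
def relative_improvement(new_score, reference_score):
    """Percentage change of new_score relative to reference_score."""
    return (new_score - reference_score) / reference_score * 100

# test R2 values as listed in the notebook
test_r2 = {
    'Decision Tree': 0.656444,
    'Random Forest': 0.824541,
    'Random Forest Tuned': 0.819486,
    'XGBoost': 0.334879,
}

rf_vs_tree = relative_improvement(test_r2['Random Forest'], test_r2['Decision Tree'])
tuned_vs_rf = relative_improvement(test_r2['Random Forest Tuned'], test_r2['Random Forest'])

print(f'Random Forest vs Decision Tree: {rf_vs_tree:.3f} %')
print(f'Random Forest Tuned vs Random Forest: {tuned_vs_rf:.3f} %')
```

With these values the tuned random forest scores marginally lower on the test set than the untuned one, so the "improvement" in that comparison is negative.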

# **Store wise Sales Predictions**

In [None]:
#predictions
y_pred_test = random_t.predict(X_test)
six_weeks_sales_df = y_test.copy()
six_weeks_sales_df['Pred_Sales'] = y_pred_test.tolist()


In [None]:
#head
six_weeks_sales_df.head()

In [None]:
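#inverse the standard scaling, then undo the log transform to get back to original sales units
six_weeks_sales_df[['Sales']] = np.exp(scaler.inverse_transform(six_weeks_sales_df[['Sales']]))
six_weeks_sales_df[['Pred_Sales']] = np.exp(scaler.inverse_transform(six_weeks_sales_df[['Pred_Sales']]))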
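
Since the target was log-transformed and then standardised, a model output has to pass through both inverse steps, in reverse order, before it can be read as a sales figure. A minimal sketch of that chain, with hypothetical scaler parameters:

```python
import math

def to_sales_units(scaled_value, log_mean, log_std):
    """Undo standard scaling, then undo the natural-log transform."""
    log_value = scaled_value * log_std + log_mean   # inverse of (x - mean) / std
    return math.exp(log_value)                      # inverse of log

# hypothetical mean and standard deviation of the log-sales training target
log_mean, log_std = 8.75, 0.4

for scaled in [-1.0, 0.0, 1.0]:
    print(scaled, round(to_sales_units(scaled, log_mean, log_std)))
```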

In [None]:
#sales vs predictions
six_weeks_sales_df.head()

In [None]:

#locating
six_weeks_sales_df.loc[('2015-06-15',5)]

In [None]:
#locating
six_weeks_sales_df.loc[('2015-07-28',56)]


In [None]:
#locating
six_weeks_sales_df.loc[('2015-07-21',10)]

# **Conclusion**

There is a positive correlation between customers and sales, which is self-explanatory.

Sales were highest on Mondays, probably because most shops remain closed on Sundays, which had the lowest sales of the week. This validates the hypothesis about this feature.

The positive effect of promotions on customers and sales is observable.

It is clear that most of the stores remain closed during state and school holidays.

Based on the above findings, there appear to be considerable opportunities in store types 'b' and 'd', as they had more customers per store and more sales per customer, respectively. Store types 'a' and 'c' are quite similar in terms of per-customer and per-store sales, and they had the best overall revenue simply because the majority of the stores are of these types. Stores of type 'b', on the other hand, were very few in number and still had better average sales than the others.

Comparing the sales of the three years shows that sales increase towards the end of the year, indicating that people shop more before the holidays. All the stores showed Christmas seasonality. This validates the previous hypothesis.

The second thing to notice is that sales dropped for a few months in 2014, which can be attributed to the stores closed for refurbishment.

Most stores have a competitor within a distance of 0 to 10 km, and these stores had more sales than stores whose competitors are farther away.

# **Recommendations**

More stores should be encouraged to run promotions. The number of type B stores should be increased. Sales are seasonal, so stores should be encouraged to run promotions around the holidays to take advantage of the seasonal demand.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***