<a href="https://colab.research.google.com/github/pavithra64/Retail_sales_prediction/blob/main/Retail_sales_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Title : RETAIL SALES PREDICTION : Predicting sales of a major store chain Rossmann

**Problem Description**

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.
You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

**Data Description**

Rossmann Stores Data.csv - historical data including Sales
store.csv - supplemental information about the stores

**Data fields**
Most of the fields are self-explanatory. The following are descriptions for those that aren't.

Id - an Id that represents a (Store, Date) duple within the test set
Store - a unique Id for each store
Sales - the turnover for any given day (this is what you are predicting)
Customers - the number of customers on a given day
Open - an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened

Promo - indicates whether a store is running a promo on that day
Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

# Importing Necessary Libraries

In [5]:
# Libraries for EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

**LOADING THE DATA**

In [6]:
from google.colab import files
uploaded=files.upload()

Saving Rossmann Stores Data.csv to Rossmann Stores Data (1).csv


In [7]:
from google.colab import files
uploaded=files.upload()

Saving store.csv to store (2).csv


In [None]:
store='store.csv'
store=pd.read_csv(store)

In [None]:
rsd='Rossmann Stores Data.csv'
ross_data=pd.read_csv(rsd)

# Exploratory Data Analysis
We have two datasets - 'ross_data' and 'store'. Let's try and understand the basics of these two datasets one by one.

**Data Collection and Preprocessing**

In [None]:
ross_data.head()

In [None]:
# No. of rows and columns of ross_data
print('Shape of ross_data is', ross_data.shape)
print('No. of rows in ross_data are', ross_data.shape[0])
print('No. of columns in ross_data are', ross_data.shape[1])

In [None]:
# Concise summary of ross_data
ross_data.info()

In [None]:
# Descriptive Stats of ross_data dataset
ross_data.describe()

In [None]:
# Top five rows of the dataset
store.head()

In [None]:
# No. of rows and columns
print('No of rows in the dataset (store) are', store.shape[0])
print('No of columns in the dataset (store) are', store.shape[1])

In [None]:
# Descriptive Stats of store dataset
store.describe()

In [None]:
# Concise summary of store
store.info()

# Handling Missing Values

In [None]:
# Sum of null values
store.isnull().sum()

In [None]:
# Null value (percentage) of total dataset(store).
(store.isnull().sum()/store.shape[0])*100

In [None]:
# Distribution of CompetitionDistance
sns.distplot(store.CompetitionDistance)
plt.title('Distribution of Store Competition Distance (m)')
plt.show()

# Distribution of CompetitionOpenSinceYear

plt.title('CompetitionOpenSinceYear')
sns.distplot(store.CompetitionOpenSinceYear)
plt.show()

# Distribution of CompetitionOpenSinceMonth

plt.title('CompetitionOpenSinceMonth')
sns.distplot(store.CompetitionOpenSinceMonth)
plt.show()

The distribution of CompetitionDistance is right-skewed, so we'll replace the missing values with the median, which is less sensitive to the long tail than the mean.

In [None]:
# Replacing missing values with median value
store['CompetitionDistance'] = store['CompetitionDistance'].fillna(store['CompetitionDistance'].median())
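
To see why the median is the safer fill value for a right-skewed column, here is a minimal sketch on made-up distances (the numbers are illustrative and not taken from store.csv). A single very distant competitor pulls the mean far above the typical value, while the median stays put.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed distances in metres; NaN marks a missing value.
distances = pd.Series([200.0, 350.0, 500.0, 800.0, 1200.0, 45000.0, np.nan])

mean_value = distances.mean()      # pulled upward by the single large value
median_value = distances.median()  # robust to that outlier

# Assign the result back instead of relying on inplace=True on a column selection.
filled = distances.fillna(median_value)

print("mean:", mean_value)
print("median:", median_value)
print("missing after fill:", filled.isna().sum())
```

Assigning the filled column back, rather than calling `fillna(..., inplace=True)` on a selected column, avoids chained-assignment warnings in recent pandas versions.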

Since 'CompetitionOpenSinceMonth' and 'CompetitionOpenSinceYear' are both categorical columns (months and years), we will replace their missing values with the mode of each column. The code cell below shows that both columns have only one mode.

In [None]:
# Checking for modes
print(store['CompetitionOpenSinceMonth'].mode())
print(store['CompetitionOpenSinceYear'].mode())

In [None]:
# Replacing null values with mode
store['CompetitionOpenSinceMonth'] = store['CompetitionOpenSinceMonth'].fillna(store['CompetitionOpenSinceMonth'].mode()[0])
store['CompetitionOpenSinceYear'] = store['CompetitionOpenSinceYear'].fillna(store['CompetitionOpenSinceYear'].mode()[0])

In [None]:
# Head
store.head(10).T

We can observe that 'Promo2SinceWeek', 'Promo2SinceYear' and 'PromoInterval' are NaN wherever Promo2 is zero. Since nearly 50% of their values are missing, we will drop these columns.

In [None]:
#dropping columns from store dataset
store.drop('Promo2SinceWeek',axis=1,inplace=True)
store.drop('Promo2SinceYear',axis=1,inplace=True)
store.drop('PromoInterval',axis=1,inplace=True)

In [None]:
store.columns

In [None]:
# Null values sum (store)
store.isna().sum()

In [None]:
# Null values sum (ross_data)
ross_data.isna().sum()

**Value counts in following columns**

In [None]:
# Values Counts
print('DayOfWeek:\n', ross_data['DayOfWeek'].value_counts(), '\n\n' )
print('Open:\n', ross_data['Open'].value_counts(), '\n\n' )

In [None]:
# Value count cont.
print('Promo:\n', ross_data['Promo'].value_counts(), '\n\n' )
print('StateHoliday:\n', ross_data['StateHoliday'].value_counts(), '\n\n')
print('SchoolHoliday:\n', ross_data['SchoolHoliday'].value_counts())

Checking the unique values in StateHoliday because 0 appears twice in the value counts.

In [None]:
#Checking unique value
ross_data['StateHoliday'].unique()

In 'StateHoliday' the value 0 is repeated (most likely once as the integer 0 and once as the string '0'), so it needs to be normalized to a single value.
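
A minimal sketch of the problem and one possible way to resolve it, on a made-up column. The sample values and the choice of casting to string are assumptions for illustration; the notebook itself may normalize the column differently.

```python
import pandas as pd

# Hypothetical column mixing the integer 0 with the string '0',
# as can happen when a CSV column is parsed with mixed types.
state_holiday = pd.Series([0, '0', 'a', 'b', 0, 'c', '0'], dtype=object)

# The integer 0 and the string '0' are reported as separate values.
print(state_holiday.unique())

# Casting everything to string merges the two zeros into one category.
normalized = state_holiday.astype(str)
print(normalized.value_counts())
```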

**Barplot of StateHoliday vs Sales and StateHoliday vs Customers to check significance of different values.**

In [None]:
fig, (state1, state2) = plt.subplots(1,2,figsize= (16,4))

# Barplot of StateHoliday vs Sales
state1.title.set_text('StateHoliday vs Sales')
sns.barplot(x = 'StateHoliday', y = 'Sales', data = ross_data, ax = state1)

# Barplot of StateHoliday vs Customers
state2.title.set_text('StateHoliday vs Customers')
sns.barplot(x = 'StateHoliday', y = 'Customers', data = ross_data, ax = state2)

There is no significant difference in sales between state holiday types a, b and c as compared to '0', so we can treat the different types of state holidays the same way and replace a, b and c with 1.

In [None]:
# Merging the string '0' into 0 and replacing 'a', 'b' and 'c' with 1
ross_data['StateHoliday'] = ross_data['StateHoliday'].replace({'0': 0,
                                                               'a': 1,
                                                               'b': 1,
                                                               'c': 1
                                                               }).astype(int)

In [None]:
# Verifying
ross_data['StateHoliday'].value_counts()

In [None]:
# Extracting of data from 'Date' column
ross_data['Year'] = pd.to_datetime(ross_data['Date']).dt.year
ross_data['Month'] = pd.to_datetime(ross_data['Date']).dt.month
ross_data['Day'] = pd.to_datetime(ross_data['Date']).dt.day
ross_data['WeekofYear'] = pd.to_datetime(ross_data['Date']).dt.isocalendar().week
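
The same extraction can be sketched with the date column parsed only once. The two-row frame below is hypothetical; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical two-row frame; the notebook itself works on ross_data['Date'].
df = pd.DataFrame({'Date': ['2015-07-31', '2013-01-01']})

dates = pd.to_datetime(df['Date'])  # parse once, reuse for every feature
df['Year'] = dates.dt.year
df['Month'] = dates.dt.month
df['Day'] = dates.dt.day
# isocalendar() returns UInt32 columns; cast to a plain int for modelling.
df['WeekofYear'] = dates.dt.isocalendar().week.astype(int)

print(df)
```

Parsing once avoids repeating the string-to-datetime conversion for each new column, which matters on a table with about a million rows.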

**Distribution of 'Sales'**

In [None]:
# distribution plot
sns.distplot(ross_data['Sales'])

**Sales per store type**

In [None]:
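# Barplot (join StoreType onto the sales rows so x and y refer to the same records)
sales_by_type = ross_data.merge(store[['Store', 'StoreType']], on='Store', how='left')
sns.barplot(x='StoreType', y='Sales', data=sales_by_type)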
plt.title('Sales per store type')

**Sales vs Assortment**

In [None]:
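# Barplot (join Assortment onto the sales rows so x and y refer to the same records)
sales_by_assortment = ross_data.merge(store[['Store', 'Assortment']], on='Store', how='left')
sns.barplot(x='Assortment', y='Sales', data=sales_by_assortment)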
plt.title('Sales vs Assortment')

**Effect of promotion in sales and number of customers.**

In [None]:
# Barplot
fig, (fig1, fig2) = plt.subplots(1,2,figsize= (16,4))

# Barplot of Promo vs Sales
fig1.title.set_text('Promo vs Sales')
sns.barplot(x = 'Promo', y = 'Sales', data = ross_data, ax = fig1)

# Barplot of Promo vs Customers
fig2.title.set_text('Promo vs Customers')
sns.barplot(x = 'Promo', y = 'Customers', data = ross_data, ax = fig2)

Here we can observe that sales and the number of customers increase significantly during promo periods. This shows that promotions have a positive effect on stores.

**Sales vs holidays**

In [None]:
# Barplot
fig, (fig3, fig4) = plt.subplots(1,2,figsize= (16,4))

# StateHoliday vs Sales
fig3.title.set_text('StateHoliday vs Sales')
sns.barplot(x = 'StateHoliday', y = 'Sales', data = ross_data, ax = fig3)

# StateHoliday vs Customers
fig4.title.set_text('StateHolidays vs Customers')
sns.barplot(x = 'StateHoliday', y = 'Customers', data = ross_data, ax = fig4)

Only a few stores are open on state holidays.

**Sales and number of customers on School Holidays**

In [None]:
# Barplot
fig, (fig_1, fig_2) = plt.subplots(1,2,figsize= (16,4))

# SchoolHoliday vs Sales
fig_1.title.set_text('SchoolHoliday vs Sales')
sns.barplot(x = 'SchoolHoliday', y = 'Sales', data = ross_data, ax = fig_1)

# Schoolholiday vs number of customers
fig_2.title.set_text('SchoolHoliday vs Customers')
sns.barplot(x = 'SchoolHoliday', y = 'Customers', data = ross_data, ax = fig_2)


We can observe a slight increase in sales and in the number of customers on school holidays.

**Open stores per day of week**

In [None]:
# Opened and closed stores in a week
fig, (fig6) = plt.subplots(1,1, figsize = (16,6))
sns.countplot(x = 'Open', hue = 'DayOfWeek', data = ross_data, palette= 'deep', ax = fig6)

This countplot clearly shows that the majority of stores are closed on Sunday. Some stores were also closed on other days of the week, possibly due to public holidays, since stores are usually closed on public holidays but open during school vacations.

**Sales and number of customers vs days of week**

In [None]:
# Barplot
fig,(figure1, figure2) = plt.subplots(1,2, figsize = (16, 5))

# Sales per day
figure1.title.set_text('Sales per day')
sns.barplot(x = 'DayOfWeek', y = 'Sales', data = ross_data, order = [1,2,3,4,5,6,7], ax = figure1)

# Customers per day
figure2.title.set_text('Number of customers per day')
sns.barplot(x = 'DayOfWeek', y = 'Customers', data = ross_data, order = [1,2,3,4,5,6,7], ax = figure2)

This clearly shows that most sales occur on the first days of the week, and very few on the last day because most shops are closed on Sunday.

**Trend of Average Sales per day of week**

In [None]:
# Average salesplot
fig_a = ross_data.groupby('DayOfWeek')[['Sales']].mean().plot(figsize = (11,5), marker = 'o', color = 'b')
fig_a.set_title('Average sales by day of the week')

**Trend of Average number of customers per day of week**

In [None]:
# Avg customers plot
fig_b = ross_data.groupby('DayOfWeek')[['Customers']].mean().plot(figsize = (11,5), marker = 'o', color = 'r')
fig_b.set_title('Average number of customers per day of the week')

**Sales per year**

In [None]:
# Box plot
ross_data.boxplot('Sales', 'Year', figsize= (12,8), fontsize=13 )

**Sales per month**

In [None]:
# Boxplot
ross_data.boxplot('Sales', 'Month', figsize= (15,8), fontsize=13 )


**Trend of Sales per month**

In [None]:
# Avg sales per month
fig_c = ross_data.groupby('Month')[['Sales']].mean().plot(figsize = (12,7), marker = 'o', color = 'm')
fig_c.set_title('Average Sales per Month')

**Trend of average customers per month**

In [None]:
# Avg customers per month
fig_d = ross_data.groupby('Month')[['Customers']].mean().plot(figsize = (12,7), marker = 's', color = 'g')
fig_d.set_title('Average Customers per Month')

We can observe a significant increase in sales and number of customers in the month of December. This may be because of the Christmas holidays.

**Trend of Average sales per day of Month**

In [None]:
# Avg sales per day (Monthly)
fig_e = ross_data.groupby('Day')[['Sales']].mean().plot(figsize = (12,7), marker = 'o', color = 'c')
fig_e.set_title('Average Sales per Day')


**EDA findings:**

Store type A is the best-selling store type and the one most frequently visited by customers.

For all stores, promotion leads to increased sales and customers.

Sales are strongly correlated to the number of customers.

Stores open during school holidays have more sales than on normal days.

Each time a store participates in a promotion, we see Sales and number of customers increase significantly.

More stores are open during school holidays than on public holidays.

Sales increase during Christmas week, which may be due to people buying gifts for the Christmas holidays.

# Feature Engineering

In [None]:
# Dropping '0' in 'Open' as it indicates that store was closed
openstore_df = ross_data[ross_data['Open'] != 0]

In [None]:
# we can now drop the column 'Open' as we only included data with 'Open' = 1
openstore_df = openstore_df.drop('Open', axis = 1)
# Making a Copy
ross_df = openstore_df.copy()

In [None]:
# head
ross_df.head()

In [None]:
# Distribution of sales after we drop the closed store.
sns.distplot(ross_df['Sales'])

We can see that the spike that was present there is now gone.

In [None]:
# Checking for infinite values
np.isinf(ross_df['Sales']).sum()

In [None]:
# Checking for null
ross_df.isna().sum()

In [None]:
# info
ross_df.info()

In [None]:
# Creating a list of all relevant numerical features for linear regression.
num_features = list(ross_df.describe().columns)

# removing 'Store' (ID) and 'Sales' (target variable)
num_features.remove('Store')
num_features.remove('Sales')

num_features

**Relationship between numerical features and target variable.**

In [None]:
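# Plotting the relationship between each numerical feature and the target (Sales) variable
for i in num_features:
  fig = plt.figure(figsize = (5,5))
  feature = ross_df[i]
  label = ross_df['Sales']
  plt.scatter(x = feature, y = label)
  plt.xlabel(i)
  plt.ylabel('Sales')
  plt.title('Sales vs ' + i)
  plt.show()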

**Merging both datasets**

In [None]:
# Merging using a left join on the shared 'Store' column
joined_data = pd.merge(ross_df, store, on = 'Store', how = 'left')
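
As a sketch of what the left join does, here are two miniature tables with made-up values. The `validate='many_to_one'` argument is an optional safeguard added for illustration: it makes pandas raise an error if a store appears more than once in the right-hand table, which would otherwise silently duplicate sales rows.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
sales = pd.DataFrame({'Store': [1, 1, 2, 3], 'Sales': [5263, 5020, 6064, 8314]})
stores = pd.DataFrame({'Store': [1, 2, 3], 'StoreType': ['c', 'a', 'a']})

# on='Store' makes the join key explicit; every sales row keeps its position
# and picks up the attributes of its store.
merged = pd.merge(sales, stores, on='Store', how='left', validate='many_to_one')

print(merged)
```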


**Label encoding**

Assigning each of the following categorical columns an integer value based on alphabetical order.

In [None]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

joined_data['StoreType'] = label_encoder.fit_transform(joined_data['StoreType'])
joined_data['Assortment'] = label_encoder.fit_transform(joined_data['Assortment'])
# joined_data['StateHoliday'] = label_encoder.fit_transform(joined_data['StateHoliday'])

# Head
joined_data.head().T
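
To illustrate what "alphabetical order" means here, the same kind of integer codes can be reproduced with NumPy alone. This is a sketch of the idea on made-up values, not the LabelEncoder implementation itself.

```python
import numpy as np

# Hypothetical StoreType values.
store_type = np.array(['c', 'a', 'd', 'a', 'b'])

# np.unique returns the sorted distinct labels plus, for each element,
# the index of its label in that sorted list.
classes, codes = np.unique(store_type, return_inverse=True)

mapping = {str(label): int(code) for code, label in enumerate(classes)}
print(mapping)
print(codes)
```

Note that these integers imply an ordering (a < b < c < d). Tree-based models are usually indifferent to that, but a linear model will treat the codes as a numeric scale, so one-hot encoding may be worth considering for such models.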

**Checking for multicollinearity**

In [None]:
joined_data['Date'].head()

In [None]:
# correlation heat map
plt.figure(figsize = (18,10))
joined_data['Date'] = pd.to_datetime(joined_data['Date'])
correlation=joined_data.corr(numeric_only=True)
sns.heatmap(abs(correlation), annot = True, cmap = 'YlGnBu')



In [None]:
# Dropping store and date columns because they are irrelevant
joined_data.drop(['Store', 'Date'], axis = 1, inplace = True)

**Variance Inflation Factor**

In [None]:
# importing vif
from statsmodels.stats.outliers_influence import variance_inflation_factor

# defining a function for vif
def calculate_vif(X):
    """
    this function calculates the variance inflation factor
    """
    # VIF calculation
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)
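
For reference, the variance inflation factor of feature j is 1 / (1 - R_j^2), where R_j^2 is obtained by regressing feature j on all the other features. The sketch below computes it directly from that definition with NumPy on synthetic data; the function name `vif_by_definition` and the data are made up for illustration. It includes an intercept in each auxiliary regression, whereas the statsmodels function, as far as its documentation indicates, regresses on exactly the columns passed in, so the two can give different numbers unless a constant column is added.

```python
import numpy as np

def vif_by_definition(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data: x2 is almost a copy of x1, x3 is independent of both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)
x3 = rng.normal(size=500)
vifs = vif_by_definition(np.column_stack([x1, x2, x3]))
print(vifs)
```

The two nearly identical columns get very large values, while the independent column stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.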

In [None]:
# defining a DataFrame containing only low-VIF variables (as we observed above)
joined_data_vif = joined_data[[i for i in joined_data.describe().columns if i not in ['Sales','Year','CompetitionOpenSinceYear','Month','WeekofYear']]]
joined_data_vif.head()

# ML Model Building

In [None]:
# importing ML models
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [None]:
# Evaluation Metrics
import math
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae

We will make a copy of the merged dataframe to use for linear regression and elastic net regression. We will use the log of the 'Sales' and 'Customers' columns because it reduces the heteroscedasticity in the linear relationship between them (which we observed above in the plots of the target column vs. the numerical features).

In [None]:
# Copy of merged DF
joined_df_lr = joined_data.copy()

In [None]:
# log10 transformation of 'Sales'
joined_df_lr['Sales'] = np.log10(joined_df_lr['Sales'])

In [None]:
# Checking for inf values
np.isinf(joined_df_lr['Sales']).sum()

In [None]:
# dropping infinite values after transformation
joined_df_lr.drop(joined_df_lr[joined_df_lr['Sales'] == float("-inf")].index,inplace=True)

In [None]:
# log10 transformation of 'Customers'
joined_df_lr['Customers'] = np.log10(joined_df_lr['Customers'])

In [None]:
# Checking for inf values again
np.isinf(joined_df_lr['Customers']).sum()
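
The reason infinite values can appear is that log10(0) evaluates to -inf. A minimal sketch on made-up numbers shows one way to guard against that by filtering before the transform, and how values on the log scale map back to the original units.

```python
import numpy as np

# Hypothetical daily sales, including one zero-sales day.
sales = np.array([0.0, 10.0, 100.0, 5000.0])

# Keep only strictly positive values before taking the logarithm,
# because log10(0) evaluates to -inf.
positive = sales[sales > 0]
log_sales = np.log10(positive)

# Values on the log scale are mapped back with 10 ** value.
recovered = 10 ** log_sales

print(log_sales)
print(recovered)
```

The same back-transformation applies to predictions from a model trained on log10(Sales) when they need to be reported in sales units.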

In [None]:
# Declaring independent and dependent variables for linear regression and elastic net
dependent_var = 'Sales'
independent_var = joined_data_vif.columns
# Creating the array of independent variables
X = joined_df_lr[independent_var].values

# Creating the array of the dependent variable
y = joined_df_lr[dependent_var].values
# Splitting the dataset into train and test sets
X = pd.DataFrame(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)

# Shape of train data
print(X_train.shape)
# Shape of test data
print(X_test.shape)

In [None]:
# Using StandardScaler to normalize the independent variables.
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
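
The scaler is fitted on the training set only and then reused on the test set, so no information from the test rows leaks into the preprocessing. The arithmetic behind that can be sketched with NumPy on a made-up feature:

```python
import numpy as np

# Hypothetical single feature split into train and test parts.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0], [0.0]])

# Statistics come from the training rows only (population std, ddof=0,
# which is also what StandardScaler uses).
mean = train.mean(axis=0)
std = train.std(axis=0)

scaled_train = (train - mean) / std
scaled_test = (test - mean) / std  # reuse the training mean and std

print(scaled_train.ravel())
print(scaled_test.ravel())
```

The scaled test values do not have to average to zero; they are simply expressed in units of the training distribution.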

**Linear Regression**

In [None]:
# Fitting Multiple Linear Regression to the Training set
# Formation of equation
regressor = LinearRegression()
regressor.fit(scaled_X_train, y_train)

In [None]:
# Intercept of equation
regressor.intercept_

In [None]:
# Model coefficients
regressor.coef_

In [None]:
# predicted sales from training dataset
y_pred_train = regressor.predict(scaled_X_train)

# predicted sales from testing dataset
y_pred_test = regressor.predict(scaled_X_test)

In [None]:
# Defining RMSE function
from sklearn.metrics import mean_squared_error
from math import sqrt

def rmse(x, y):
    return sqrt(mean_squared_error(x, y))

# Defining MAPE function
def mape(x, y):
    return np.mean(np.abs((x - y) / x)) * 100
# Evaluation Metrics for Linear Regression

print("Regression Model Training Score" , ":" , regressor.score(scaled_X_train, y_train),
      "Model Test Score" ,":" , regressor.score(scaled_X_test, y_test))

print("Training RMSE", ":", rmse(y_train, y_pred_train),
      "Testing RMSE", ":", rmse(y_test, y_pred_test))

print("Training MAPE", ":", mape(y_train, y_pred_train),
      "Testing MAPE", ":", mape(y_test, y_pred_test))

r2 = r2_score(y_test, y_pred_test)
print("R2 :" ,r2)
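
Because this model is trained on log10(Sales), the RMSE and MAPE above are measured in log10 units rather than in sales units. The sketch below, on made-up values, shows how the same metrics change once predictions are mapped back with 10 ** value. The helper functions follow the definitions used above but are rewritten with NumPy so the block runs on its own.

```python
import numpy as np

def rmse(actual, predicted):
    # Root mean squared error.
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    # Mean absolute percentage error; undefined when actual contains zeros.
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

# Hypothetical actual and predicted values on the log10 scale.
y_true_log = np.array([3.0, 3.5, 4.0])
y_pred_log = np.array([3.1, 3.4, 4.0])

# Error in log10 units.
print("RMSE (log10):", rmse(y_true_log, y_pred_log))

# Undo the transform to express the error in sales units.
y_true = 10 ** y_true_log
y_pred = 10 ** y_pred_log
print("RMSE (sales):", rmse(y_true, y_pred))
print("MAPE (sales, %):", mape(y_true, y_pred))
```

This also matters when comparing models: metrics computed on log-transformed sales are not directly comparable with metrics computed on raw sales.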

In [None]:
# Performance of the model
r2s_lr = r2_score(y_train,y_pred_train)
r2s2_lr = r2_score(y_test,y_pred_test)

mae_lr = mae(y_train,y_pred_train)
mae2_lr = mae(y_test,y_pred_test)

rmse_lr = math.sqrt(mse(y_train,y_pred_train))
rmse2_lr = math.sqrt(mse(y_test,y_pred_test))

mse_lr = mse(y_train,y_pred_train)
mse2_lr = mse(y_test,y_pred_test)

print('Performance of Linear Regression Model:')
print('-'*40)

print('r2_score train:',r2s_lr)
print('r2_score test:',r2s2_lr)

print('\nMean absolute error train: %.2f' % mae_lr)
print('Mean absolute error test: %.2f' % mae2_lr)

print('\nRoot mean squared error train: ', rmse_lr)
print('Root mean squared error test: ', rmse2_lr)

print('\nMean Sq error train: %.2f' % mse_lr)
print('Mean Sq error test: %.2f' % mse2_lr)

In [None]:
# Predicted vs actual values, with the y = x reference line
plt.figure(figsize=(10,10))
plt.scatter(y_test,y_pred_test)

p1 = max(max(y_pred_test),max(y_test))
p2 = min(min(y_pred_test),min(y_test))
plt.plot([p1,p2],[p1,p2],c='r')
plt.xlabel('Actual values')
plt.ylabel('Predicted values')

**Elastic Net Regression**

In [None]:
# ElasticNet
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)
# Model fitting
elasticnet.fit(scaled_X_train,y_train)

In [None]:
# Elasticnet score
elasticnet.score(scaled_X_train, y_train)

In [None]:
# Predicting test set
y_pred_en = elasticnet.predict(scaled_X_test)
MSE  = mse(y_test, y_pred_en)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(y_test,y_pred_en)
print("R2 :" ,r2)
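
The R2 value printed above is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. As a small worked example on made-up numbers (the function name `r_squared` is introduced here only for illustration), a model that always predicts the mean scores 0, and a model with small residuals scores close to 1.

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
good_fit = [3.5, 4.5, 7.5, 8.5]
mean_only = [6.0, 6.0, 6.0, 6.0]

print(r_squared(actual, good_fit))
print(r_squared(actual, mean_only))
```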

**Xgboost Model**

In [None]:
# declaring independent and dependent variables
target_col = 'Sales'
input_cols = joined_data.columns.drop(target_col)
input_cols

We will use these independent and dependent variables for XGBoost, Decision Tree and Random Forest, because the previous variables contain the log10-transformed 'Sales' and 'Customers' columns. Also, the following three models can handle multicollinearity.

In [None]:
# train test split
X_train, X_test, y_train, y_test  = train_test_split(joined_data[input_cols], joined_data[target_col], test_size = 0.2, random_state = 1)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# standard scaler to normalise the data
scaler = StandardScaler()
scale_X_train = scaler.fit_transform(X_train)
scale_X_test = scaler.transform(X_test)
scale_X_train[0:10]

In [None]:
# Building XGBoost Regressor Model:
xgb = XGBRegressor(random_state=42, n_jobs=-1, n_estimators=20, max_depth=4)
xgb.fit(scale_X_train,y_train)

y_predict_xgb = xgb.predict(scale_X_test)

In [None]:
#Performance of the model
r2s_xgb = r2_score(y_test,y_predict_xgb)
mae_xgb = mae(y_test,y_predict_xgb)
rmse_xgb = math.sqrt(mse(y_test,y_predict_xgb))
print('Performance of XGBoost Regressor Model:')
print('-'*40)
print('r2_score:',r2s_xgb)
print('Mean absolute error: %.2f' % mae_xgb)
print('Root mean squared error: ', rmse_xgb)

**DecisionTree Model**

In [None]:
# Building Decision Tree Regressor Model:

model = DecisionTreeRegressor()
model.fit(scale_X_train,y_train)

y_predict_dt = model.predict(scale_X_test)
# Performance of the model
r2s_3 = r2_score(y_test,y_predict_dt)
mae3 = mae(y_test,y_predict_dt)
rmse3 = math.sqrt(mse(y_test,y_predict_dt))
print('Performance of Decision Tree Model:')
print('-'*40)
print('r2_score:',r2s_3)
print('Mean absolute error: %.2f' % mae3)
print('Root mean squared error: ', rmse3)

**Random Forest Regression Model**

In [None]:
# Building Random Forest Regressor Model:

random_forest_model = RandomForestRegressor(n_estimators=100)
random_forest_model.fit(scale_X_train,y_train)
y_predict_rf = random_forest_model.predict(scale_X_test)

# Performance of the model
r2s_4 = r2_score(y_test,y_predict_rf)
mae4 = mae(y_test,y_predict_rf)
rmse4 = math.sqrt(mse(y_test,y_predict_rf))
print('Performance of Random Forest Regression Model:')
print('-'*40)
print('r2_score:', r2s_4)
print('Mean absolute error: %.2f' % mae4)
print('Root mean squared error: ', rmse4)

**Hyperparameter tuning for RandomForest**

In [None]:
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

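# 'auto' is no longer accepted by recent scikit-learn releases; 1.0 (all features) is its equivalent for regressors
param_grid = {  'bootstrap': [True], 'max_depth': [5, 10, None], 'max_features': [1.0, 'log2'], 'n_estimators': [5, 6, 7, 8, 9, 10, 11, 12, 13, 15]}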
rfr = RandomForestRegressor(random_state = 1)
g_search = GridSearchCV(estimator = rfr, param_grid = param_grid, cv = 3, n_jobs = 1, verbose = 0, return_train_score=True)
g_search.fit(scale_X_train, y_train);

print(g_search.best_params_)
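
To get a feel for the cost of this search, the number of configurations can be counted by hand. The sketch below uses a grid with the same shape as the one searched above (one bootstrap setting, three depths, two max_features options, ten tree counts) and the same number of folds.

```python
from itertools import product

# A grid with the same shape as the one searched above.
param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 10, None],
    'max_features': [1.0, 'log2'],
    'n_estimators': [5, 6, 7, 8, 9, 10, 11, 12, 13, 15],
}
cv_folds = 3

# Every combination of candidate values is one configuration to evaluate.
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds

print("configurations:", len(combinations))
print("model fits during the search:", total_fits)
```

With the default `refit=True`, GridSearchCV trains one more model on the full training set using the best configuration. Since `n_jobs = 1` runs all of these fits sequentially, setting `n_jobs = -1` is one way to shorten the search on a multi-core machine.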

In [None]:
# Model prediction train set
y_pred_RandomForest_tuned_train = g_search.predict(scale_X_train)
# Model prediction test set
y_pred_RandomForest_tuned_test = g_search.predict(scale_X_test)
print("Regression Model Training Score" , ":" , g_search.score(scale_X_train, y_train),
      "Model Test Score" ,":" , g_search.score(scale_X_test, y_test))

print("Training RMSE", ":", rmse(y_train, y_pred_RandomForest_tuned_train),
      "Testing RMSE", ":", rmse(y_test, y_pred_RandomForest_tuned_test))

print("Training MAPE", ":", mape(y_train, y_pred_RandomForest_tuned_train),
      "Testing MAPE", ":", mape(y_test, y_pred_RandomForest_tuned_test))

r2 = r2_score(y_test, y_pred_RandomForest_tuned_test)
print("R2 :" ,r2)

**Feature importance**

In [None]:
# Let's find the importance of each feature
feature_importance = random_forest_model.feature_importances_
# Let's make a dataframe consisting of features and their importance values
columns_1 = list(X_train.columns)
feature_importance_df = pd.DataFrame({'Features':columns_1, 'Importance':feature_importance})
feature_importance_df.set_index('Features', inplace=True)
feature_importance_df.sort_values(by= 'Importance', ascending = False, inplace = True)
feature_importance_df

In [None]:
# Feature Importance
Features_imp = feature_importance_df.index

plt.figure(figsize=(15,12))
sns.barplot(y= Features_imp, x=feature_importance_df['Importance'], data = feature_importance_df ).set(title='Feature Importance')
plt.xticks(rotation=90)
plt.show()

**Observation**

As per our model, Customers, StoreType, CompetitionDistance and Promo are the most important features, i.e. those with the most impact on the target variable, Sales.

# Conclusion from ML models
Looking at the evaluation metrics obtained from the different regression models, we decided to go with the tuned Random Forest model. It achieved the highest R^2, 0.97185, which means our best model explains approximately 97% of the variance in sales.

Based on our model, Customers, StoreType, Promo and CompetitionDistance are the most impactful features, driving sales more than the other features present in the dataset.

**Suggestions from our Analysis**

More stores should be encouraged to run promotions.

The number of type 'b' stores should be increased.

There is seasonality involved. Hence, stores should be encouraged to run promotions and take advantage of the holidays.

