<a href="https://colab.research.google.com/github/prbrtbiswal/Regression/blob/main/ML_Regression_ProjectTemplate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -Retail Sales Prediction






##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.
You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

# **GitHub Link -**

https://github.com/prbrtbiswal

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pylab
from scipy import stats



### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount("/content/drive")


### Dataset First View

In [None]:
# Dataset First Look
store_data=pd.read_csv('/content/drive/MyDrive/store.csv')
rosemaan_data=pd.read_csv('/content/drive/MyDrive/Rossmann Stores Data.csv')
store_data.head()

In [None]:
rosemaan_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
store_data.shape



The store data contains 1115 rows and 10 columns.

In [None]:
rosemaan_data.shape

The Rossmann sales data contains 1017209 rows and 9 columns.

### Dataset Information

In [None]:
# Dataset Info
store_data.info()

In [None]:
rosemaan_data.info()

#### Duplicate Values


In [None]:
# Dataset Duplicate Value Count
store_data.duplicated().sum()

There are no duplicate rows in this dataset.


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
store_data.isnull().sum()

Many values are missing in these columns: CompetitionOpenSinceMonth, CompetitionOpenSinceYear, Promo2SinceWeek, Promo2SinceYear and PromoInterval.

CompetitionDistance also has 3 null values.

We have to handle these before modelling.



# 1.   **CompetitionDistance**





In [None]:
store_data[pd.isnull(store_data['CompetitionDistance'])]

There are several ways to fill these null values (0, mean, median, mode). We replace them with the median, which is less sensitive to extreme distances than the mean.

In [None]:
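# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original frame
store_data['CompetitionDistance'] = store_data['CompetitionDistance'].fillna(store_data['CompetitionDistance'].median())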

In [None]:
store_data['CompetitionDistance'].isnull().sum()
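
The choice of the median over the mean can be illustrated with a small sketch. The distances below are made up for illustration and are not taken from `store.csv`; the point is only that a single very distant competitor pulls the mean far more than the median.

```python
import numpy as np
import pandas as pd

# Hypothetical competition distances in metres, with one extreme value and two gaps
distance = pd.Series([250.0, 400.0, 600.0, np.nan, 900.0, 75000.0, np.nan])

mean_value = distance.mean()      # pulled upwards by the 75,000 m outlier
median_value = distance.median()  # stays close to the bulk of the values

# Fill the gaps with the median and assign the result to a new Series
filled = distance.fillna(median_value)

print("mean  :", mean_value)
print("median:", median_value)
print("nulls left:", filled.isnull().sum())
```

With skewed data like this, filling with the mean would place the missing stores much further from competitors than a typical store.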



# 2. 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'PromoInterval', 'Promo2SinceWeek' and 'Promo2SinceYear'




Not much information is provided about these columns. We observe that wherever Promo2 equals 0, the columns PromoInterval, Promo2SinceWeek and Promo2SinceYear are NaN. In other words, stores that do not take part in the continuing promotion have no values there, so the most natural fill for these features is 0. We apply the same fill to the two CompetitionOpenSince columns.
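
The observation that the promo columns are missing exactly where Promo2 is 0 can be checked rather than assumed. The sketch below uses a tiny made-up frame (`toy_store` is a hypothetical stand-in, not the real `store.csv`); the same comparison could be run on `store_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for store.csv; the values are made up for illustration
toy_store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Promo2": [0, 1, 0, 1],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, 40.0],
    "Promo2SinceYear": [np.nan, 2010.0, np.nan, 2014.0],
    "PromoInterval": [np.nan, "Jan,Apr,Jul,Oct", np.nan, "Feb,May,Aug,Nov"],
})

promo_cols = ["Promo2SinceWeek", "Promo2SinceYear", "PromoInterval"]
no_promo2 = toy_store["Promo2"] == 0

# True only if each promo column is NaN on exactly the Promo2 == 0 rows
pattern_holds = all(
    bool((toy_store[col].isna() == no_promo2).all()) for col in promo_cols
)
print(pattern_holds)
```

If the check came back False on the real data, filling with 0 would hide genuinely missing promotion details and a different treatment might be needed.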

In [None]:
ds=store_data.copy()

In [None]:
## code for replacing Nan values with 0.

## Replacing Nan values with 0 in CompetitionOpenSinceMonth
ds['CompetitionOpenSinceMonth'] = ds['CompetitionOpenSinceMonth'].fillna(0)

## Replacing Nan values with 0 in CompetitionOpenSinceYear
ds['CompetitionOpenSinceYear'] = ds['CompetitionOpenSinceYear'].fillna(0)

## Replacing Nan values with 0 in Promo2SinceWeek
ds['Promo2SinceWeek'] = ds['Promo2SinceWeek'].fillna(0)

## Replacing Nan values with 0 in Promo2SinceYear
ds['Promo2SinceYear'] = ds['Promo2SinceYear'].fillna(0)

## Replacing Nan values with 0 in PromoInterval
ds['PromoInterval'] = ds['PromoInterval'].fillna(0)

## Now checking Nan values
ds.isna().sum()

### What did you know about your dataset?

Two datasets are given, from which we have to predict daily sales for up to six weeks in advance.

1.   Rossmann Stores Data.csv - historical data including Sales
     It has 1017209 entries and 9 columns, and there are no null values in any entry.
2.   store.csv - supplemental information about the stores
     It has 1115 entries and 10 variables. Some values are null; we fill CompetitionDistance with the median and the following columns with 0:
     'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'PromoInterval', 'Promo2SinceWeek' and 'Promo2SinceYear'



# Merge the Rossmann sales data and the store data on the 'Store' column, which is common to both files.

In [None]:
ds_final=pd.merge(rosemaan_data,ds,on='Store',how='left')
ds_final.tail()

In [None]:
ds_final.shape
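
A left join on `Store` should keep one row per sales record, provided each store appears only once in the store table. The sketch below shows one way to have pandas enforce that, using the `validate` argument of `pd.merge`. The two small frames are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the sales records and the store table (made-up values)
sales = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5000, 5200, 6100, 0]})
stores = pd.DataFrame({"Store": [1, 2, 3], "StoreType": ["a", "b", "a"]})

# validate="many_to_one" raises MergeError if a store appears twice in `stores`
merged = pd.merge(sales, stores, on="Store", how="left", validate="many_to_one")

# A left join on a unique key keeps exactly one row per sales record
assert len(merged) == len(sales)
# Columns: all of sales plus all of stores, with the shared key counted once
assert merged.shape[1] == sales.shape[1] + stores.shape[1] - 1
print(merged.shape)
```

The same column arithmetic explains the shape of the merged dataset here: 9 sales columns plus 10 store columns, minus the shared `Store` key, gives 18.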

# **3. Data Wrangling**

# **Changing different dtypes to int type.**

In [None]:
# code for changing StateHoliday dtype from object to int.
ds_final.loc[ds_final['StateHoliday'] == '0', 'StateHoliday'] = 0
ds_final.loc[ds_final['StateHoliday'] == 'a', 'StateHoliday'] = 1
ds_final.loc[ds_final['StateHoliday'] == 'b', 'StateHoliday'] = 2
ds_final.loc[ds_final['StateHoliday'] == 'c', 'StateHoliday'] = 3
ds_final['StateHoliday'] = ds_final['StateHoliday'].astype(int, copy=False)

print('levels :', ds_final['StateHoliday'].unique(), '; data type :', ds_final['StateHoliday'].dtype)

In [None]:
# code for changing Assortment dtype from object to int.
ds_final.loc[ds_final['Assortment'] == 'a', 'Assortment'] = 0
ds_final.loc[ds_final['Assortment'] == 'b', 'Assortment'] = 1
ds_final.loc[ds_final['Assortment'] == 'c', 'Assortment'] = 2
ds_final['Assortment'] = ds_final['Assortment'].astype(int, copy=False)

print('levels :', ds_final['Assortment'].unique(), '; data type :', ds_final['Assortment'].dtype)

In [None]:
# code for changing StoreType dtype from object to int.
ds_final.loc[ds_final['StoreType'] == 'a', 'StoreType'] = 0
ds_final.loc[ds_final['StoreType'] == 'b', 'StoreType'] = 1
ds_final.loc[ds_final['StoreType'] == 'c', 'StoreType'] = 2
ds_final.loc[ds_final['StoreType'] == 'd', 'StoreType'] = 3
ds_final['StoreType'] = ds_final['StoreType'].astype(int, copy=False)

print('levels :', ds_final['StoreType'].unique(), '; data type :', ds_final['StoreType'].dtype)
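
The three encoding cells above follow the same pattern, which could also be written with a dictionary and `Series.map`. This is only an alternative sketch on a made-up column, assuming the raw `StateHoliday` values may mix the string `'0'` and the integer `0`; the codes match the ones assigned above.

```python
import pandas as pd

# Toy column mixing the string '0' and the integer 0 (an assumption about the raw data)
state_holiday = pd.Series(["0", 0, "a", "b", "c", "a"])

# Same codes as the cell above: none=0, public=1, Easter=2, Christmas=3
holiday_codes = {"0": 0, "a": 1, "b": 2, "c": 3}

# Casting to str first makes the integer 0 and the string '0' map identically
encoded = state_holiday.astype(str).map(holiday_codes)

# map() leaves NaN for labels missing from the dictionary, so check before casting
assert encoded.notna().all(), "unexpected StateHoliday label"
encoded = encoded.astype(int)
print(encoded.tolist())
```

One advantage of the dictionary form is that an unexpected label shows up as NaN and fails the check, instead of silently staying as a string.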

In [None]:
# code for changing format of date from object to datetime
ds_final['Date'] = pd.to_datetime(ds_final['Date'], format= '%Y-%m-%d')

In [None]:
ds_final['CompetitionOpenSinceYear']= ds_final['CompetitionOpenSinceYear'].astype(int)
ds_final['Promo2SinceYear']= ds_final['Promo2SinceYear'].astype(int)

In [None]:
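# Note: this overwrites CompetitionOpenSinceMonth with the month of the sales Date,
# so from here on the column holds the calendar month of each record (used in Chart 3)
ds_final['CompetitionOpenSinceMonth'] = pd.DatetimeIndex(ds_final['Date']).month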

In [None]:
ds_final['CompetitionDistance']= ds_final['CompetitionDistance'].astype(int)
ds_final['Promo2SinceWeek']= ds_final['Promo2SinceWeek'].astype(int)

## ***2. Understanding Your Variables***

In [None]:
ds_final.info()

In [None]:
# Dataset Columns
ds_final.columns

In [None]:
# Dataset Describe
ds_final.describe()

### Variables Description


*   Id - an Id that represents a (Store, Date) duple within the test set
*  Store - a unique Id for each store
*   Sales - the turnover for any given day (this is what you are predicting)
*   Customers - the number of customers on a given day

*   Open - an indicator for whether the store was open: 0 = closed, 1 = open
*  StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

*   SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
*   StoreType - differentiates between 4 different store models: a, b, c, d

*   Assortment - describes an assortment level: a = basic, b = extra, c = extended
*   CompetitionDistance - distance in meters to the nearest competitor store

*   CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened

*   Promo - indicates whether a store is running a promo on that day

*   Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

*   Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
*   PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

























### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in ds_final.columns:
   print(i,ds_final[i].unique())

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Sales**

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(15,6))
sns.pointplot(x= 'CompetitionOpenSinceYear', y= 'Sales', data=ds_final)
plt.title('Plot between Sales and Competition Open Since year')

From the plot we can tell that sales are high for the year 1900. Very few Rossmann stores were operating then, so there was less competition and sales were high. As the years pass, the number of stores increased, which means competition also increased, and this leads to a decline in sales. The value 0 on the x-axis is the placeholder we assigned to stores whose CompetitionOpenSinceYear was missing.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(15,6))
sns.pointplot(x= 'Promo2SinceYear', y= 'Sales', data=ds_final)
plt.title('Plot between Sales and Promo2 Since year')

The plot between Sales and Promo2SinceYear shows the effect on sales for stores that continue their promotion. This data is available from 2009 to 2015. Promo2 has a good effect on sales overall, but sales are at their minimum for 2013 and are also very low for 2012 and 2015.

#### Chart - 3

In [None]:

# Chart - 3 visualization code
plt.figure(figsize=(15,6))
sns.pointplot(x= 'CompetitionOpenSinceMonth', y= 'Sales', data=ds_final)
plt.title('Plot between Sales and Competition Open Since Month')

This plot shows average sales for each month of the year, because the CompetitionOpenSinceMonth column was overwritten with the month of Date during data wrangling. Sales increase sharply after November. In December, with Christmas and New Year celebrations, everyone is buying, so sales at Rossmann stores are very high.
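
The monthly pattern can also be computed from a dedicated month column, which avoids reusing a competition column for calendar months. This is a sketch on made-up records; the `Month` column name is an assumption and does not exist in the notebook.

```python
import pandas as pd

# Made-up daily records for two months
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-11-03", "2014-11-17", "2014-12-01", "2014-12-22"]),
    "Sales": [5000, 5400, 7000, 9000],
})

# Derive the calendar month from Date in its own column
toy["Month"] = toy["Date"].dt.month

# Average sales per month; this is the quantity a pointplot of Sales by month displays
monthly_mean = toy.groupby("Month")["Sales"].mean()
print(monthly_mean)
```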

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(15,6))
sns.pointplot(x= 'DayOfWeek', y= 'Sales', data=ds_final)
plt.title('Plot between Sales and Day of Week')

The plot between Sales and day of week shows that sales are highest on Monday and gradually decrease up to the 6th day of the week, Saturday. It also shows that sales on Sunday are close to zero, because most stores are closed on Sundays.

# **BoxPlot of sales between Assortment and store type**

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(12, 8))
plot_storetype_sales = sns.boxplot(x="StoreType", y="Sales", data=ds_final)
plt.title('Boxplot For Sales Values')

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12, 8))
plot_storetype_sales = sns.boxplot(x="Assortment", y="Sales", data=ds_final)
plt.title('Boxplot For Sales Values on the basis of Assortment Level')

# **Plots of Day of Week against Open and Promo**

#### Chart - 7

In [None]:
# Chart - 7 visualization code
sns.countplot(x= 'DayOfWeek', hue='Open', data= ds_final, palette='cool')
plt.title('Store Daily Open Countplot')

#### Chart - 8

In [None]:
# Chart - 8 visualization code
sns.countplot(x= 'DayOfWeek', hue='Promo', data= ds_final, palette='cool')
plt.title('Store Daily Promo Countplot')

#### Chart - 9

In [None]:
# Chart - 9 visualization code
promo_sales = sns.barplot(x="Promo", y="Sales", data=ds_final, palette='RdPu')

The barplot between Promo and Sales shows the effect of promotion on sales. Here 0 represents stores that did not run a promotion and 1 represents stores that did. Stores that ran promotions have higher sales than stores that did not.

# **StateHoliday and SchoolHoliday**

Sales during State Holiday

0 = None, 1 = public holiday, 2 = Easter holiday, 3 = Christmas

#### Chart - 10

In [None]:
# Chart - 10 visualization code
stateholiday_sales = sns.barplot(x="StateHoliday", y="Sales", data=ds_final)

#### Chart - 11

In [None]:
# Chart - 11 visualization code
schoolholiday_sales = sns.barplot(x="SchoolHoliday", y="Sales", data=ds_final)

We can observe that most of the stores remain closed during State Holidays. It is interesting to note that more stores were open during School Holidays than during State Holidays. Another important point is that stores that were open during School Holidays had higher sales than normal.

#### Chart - 12

# **Store Type**

In [None]:
# Chart - 12 visualization code
merged_df = pd.merge(rosemaan_data,ds, on='Store', how='left')

In [None]:
# Aggregate each metric once per store type; selecting the column first keeps sum() numeric
by_type = merged_df.groupby("StoreType")
stores_per_type = by_type["Store"].nunique()        # distinct stores, not daily records
total_sales = by_type["Sales"].sum() / 1e9          # in billions
total_customers = by_type["Customers"].sum() / 1e6  # in millions
average_sales = by_type["Sales"].mean()
colors = sns.color_palette(n_colors=4)              # one colour per store type

fig, axes = plt.subplots(2, 2, figsize=(17, 10))
plt.subplots_adjust(hspace=0.28)
axes[0, 0].bar(stores_per_type.index, stores_per_type, color=colors)
axes[0, 0].set_title("Number of Stores per Store Type \n Fig 1.1")
axes[0, 1].bar(total_sales.index, total_sales, color=colors)
axes[0, 1].set_title("Total Sales per Store Type (in Billions) \n Fig 1.2")
axes[1, 0].bar(total_customers.index, total_customers, color=colors)
axes[1, 0].set_title("Total Number of Customers per Store Type (in Millions) \n Fig 1.3")
axes[1, 1].bar(average_sales.index, average_sales, color=colors)
axes[1, 1].set_title("Average Sales per Store Type \n Fig 1.4")
plt.show()

From this training set we can see that StoreType A has the highest number of branches, sales and customers of the 4 store types. But this doesn't mean it's the best performing StoreType.

When looking at the averages, we see that it is actually StoreType B that has the highest average sales and the highest average number of customers.
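
The difference between totals and averages is easy to see with a named aggregation. The numbers below are made up: type `a` has more records and therefore a larger total, while type `b` has the higher average per record.

```python
import pandas as pd

# Made-up records: four for store type 'a', two for store type 'b'
toy = pd.DataFrame({
    "StoreType": ["a", "a", "a", "a", "b", "b"],
    "Sales": [5000, 6000, 5500, 5500, 10000, 9000],
})

# Total and average sales per store type in one pass
summary = toy.groupby("StoreType")["Sales"].agg(total="sum", average="mean")
print(summary)
```

A store type can lead on totals simply by having more branches, which is why the averages are the fairer comparison of performance.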

# **Conclusions from EDA**

1- There are two datasets: 1) Rossmann Stores Data.csv and 2) store.csv

2- Shape of the Rossmann dataset = (1017209, 9); shape of the store dataset = (1115, 10)

3- The 'Store' column is common to both datasets, so we do a left join on the 'Store' column.

4- Looking at the datasets, we find many NaN values in the store dataset.

5- We replace NaN values with suitable values.
The CompetitionDistance column has only 3 NaN values, so we replaced them with the median.

6- The remaining columns (CompetitionOpenSinceMonth, CompetitionOpenSinceYear, Promo2SinceWeek, Promo2SinceYear, PromoInterval) have many NaN values, and the best way to treat them is to replace them with 0.

7- After merging, shape of the final dataset = (1017209, 18)

8- Some columns, such as 'StateHoliday', 'StoreType' & 'Assortment', contain object values, so we convert them to int by assigning suitable codes.

We also analysed some graphs, and the conclusions we reached are:

   1- The plot of sales by month shows sales increasing from November and peaking in December. This may be due to Christmas and New Year.

   2- The plot of sales by day of week shows sales are highest on Monday and decline from Tuesday to Saturday, and on Sunday sales are close to zero. This is because most stores are closed on Sunday.

   3- The plot between promotion and sales shows that promotion helps increase sales. A similar trend also shows with customers.

   4- The plot between StateHoliday and sales shows that during public holidays sales are actually high, but for other holidays such as Easter and Christmas sales are very low. This is because stores are also closed during Easter and Christmas, so sales go down.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(18,8))
# numeric_only=True skips non-numeric columns such as PromoInterval
correlation = ds_final.corr(numeric_only=True)
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')


## ***6. Feature Engineering & Data Pre-processing***

# **Multicollinearity**

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)


In [None]:
calc_vif(ds_final[[i for i in ds_final.describe().columns if i not in ['Sales']]])

The VIF of columns like 'Promo2SinceYear' is quite high, which indicates multicollinearity, so we decided to drop it.

In [None]:
calc_vif(ds_final[[i for i in ds_final.describe().columns if i not in ['Sales','Promo2SinceYear']]])
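
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it with NumPy only, on synthetic data. It adds an intercept to each regression, which is an assumption of this sketch; as far as I know, `variance_inflation_factor` in statsmodels uses the design matrix as given, so its numbers can differ.

```python
import numpy as np

def vif_manual(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic features: c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_manual(np.column_stack([a, b, c]))
print(vifs.round(1))
```

The two near-duplicate columns get large VIFs while the independent one stays close to 1, which is the pattern that motivates dropping one of a collinear pair.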

# Analysis on Sales - Dependent variable

In [None]:
pd.Series(ds_final['Sales']).hist()
plt.show()

Now we count the rows where the store was closed and sales were 0.

In [None]:
# Summing the boolean mask counts the matching rows directly
((ds_final.Open == 0) & (ds_final.Sales == 0)).sum()

We see that there are **172817** records where the store was closed and sales were 0, for example on Sundays, holidays, or during temporary closures for refurbishment. The best solution here is to remove the closed-store records so that the models do not train on them and get false guidance.

In [None]:
new_df = ds_final.drop(ds_final[(ds_final.Open == 0) & (ds_final.Sales == 0)].index)

In [None]:
new_df.head()

In [None]:
new_df.shape
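
The filtering step can be sketched with a boolean mask on a made-up frame. Note that only rows that are both closed and have zero sales are removed; an open store with zero sales stays in.

```python
import pandas as pd

# Made-up records: the third row is an open store that happened to sell nothing
toy = pd.DataFrame({
    "Open":  [1, 0, 1, 1, 0],
    "Sales": [5000, 0, 0, 6200, 0],
})

# Rows where the store was closed and, as expected, had no sales
closed_no_sales = (toy["Open"] == 0) & (toy["Sales"] == 0)
n_closed = int(closed_no_sales.sum())

# Keep everything else
open_only = toy.loc[~closed_no_sales].reset_index(drop=True)
print(n_closed, open_only.shape)
```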


PromoInterval is a categorical feature, so we convert it into dummy variables.

In [None]:
new_df = pd.get_dummies(new_df, columns=['PromoInterval'])
new_df.head()
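
As a small illustration of what `pd.get_dummies` produces here, the sketch below encodes a made-up `PromoInterval` column in which 0 marks stores without Promo2. Casting to `str` first is a choice made for this sketch, so that the integer 0 and the month strings share one type.

```python
import pandas as pd

# Made-up column: 0 stands for "no Promo2", as after the fillna(0) step
toy = pd.DataFrame({"PromoInterval": [0, "Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov", 0]})

# Use a single type for all labels before encoding
toy["PromoInterval"] = toy["PromoInterval"].astype(str)

# One indicator column per distinct label; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy, columns=["PromoInterval"], dtype=int)
print(dummies.columns.tolist())
```

Each row has exactly one indicator set to 1, so the indicator columns are perfectly collinear with an intercept. Passing `drop_first=True` is a common way to avoid that for linear models.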

## ***7. ML Model Implementation***

In [None]:
import math

import lightgbm as lgb
from scipy.stats import zscore
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

### ML Model - 1

# **Excluding rows where sales = 0**

We were unsure whether to include the rows where the sales value is 0. So we first built a model excluding those rows and then another one including them.

In [None]:
# defining dependent variable
dependent_variables = 'Sales'

# defining independent variable
independent_variables = list(new_df.columns.drop(['Promo2SinceYear','Date','Sales']))

In [None]:
independent_variables

In [None]:
# Create the data of independent variables
X = new_df[independent_variables].values

# Create the data of dependent variable
y = new_df[dependent_variables].values

In [None]:
# splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)
print(X_train.shape)
print(X_test.shape)

# Linear Regression

In [None]:
reg = LinearRegression().fit(X_train, y_train)

In [None]:
reg.score(X_train, y_train)

In [None]:
reg.coef_

In [None]:
reg.intercept_

In [None]:
y_pred = reg.predict(X_test)
y_pred

In [None]:
y_pred_train = reg.predict(X_train)
y_pred_train

In [None]:
y_test

In [None]:
y_train

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_squared_error

MSE  = mean_squared_error(y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)
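
For readers who want to see what these metrics measure, here is a NumPy-only sketch of MSE, RMSE and R² on four made-up values. The helper name `regression_scores` is hypothetical and not part of the notebook.

```python
import numpy as np

def regression_scores(y_true, y_pred):
    """Return MSE, RMSE and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))          # average squared error
    rmse = float(np.sqrt(mse))                 # same units as the target
    ss_res = float(np.sum(errors ** 2))        # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up actual and predicted sales
scores = regression_scores([100, 200, 300, 400], [110, 190, 310, 390])
print(scores)
```

RMSE reads directly as a typical error in sales units, while R² is the share of the variation in sales that the model explains.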

**LASSO**


In [None]:
L1 = Lasso(alpha = 0.2, max_iter=10000)

In [None]:
L1.fit(X_train, y_train)

In [None]:
y_pred_lasso = L1.predict(X_test)

In [None]:
L1.score(X_test, y_test)

In [None]:
pd.DataFrame(zip(y_test, y_pred_lasso), columns = ['actual', 'pred'])

**Ridge**

In [None]:
L2 = Ridge(alpha = 0.5)
L2.fit(X_train, y_train)

In [None]:
L2.predict(X_test)

In [None]:
L2.score(X_test, y_test)

**DECISION TREE**

In [None]:
decision_tree=DecisionTreeRegressor(max_depth=5)
decision_tree.fit(X_train, y_train)
y_pred_dt = decision_tree.predict(X_test)
y_train_dt = decision_tree.predict(X_train)
# Evaluate the tree on the held-out test set
MSE  = mean_squared_error(y_test, y_pred_dt)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)



r2 = r2_score(y_test, y_pred_dt)
print("R2 :" ,r2)

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.


**MODEL 2 (By taking whole Dataset)**

We use dummy variables for the column 'PromoInterval'

In [None]:
ds_final = pd.get_dummies(ds_final, columns=['PromoInterval'])
ds_final.head()

In [None]:
ds_final.shape


We define dependent and independent variables and convert them into arrays

In [None]:
# defining dependent variable
dep_var = 'Sales'

# defining independent variable
indep_var = ds_final.columns.drop(['Store', 'Promo2SinceYear','Date','Sales'])

In [None]:
# Create the data of independent variables
U = ds_final[indep_var].values

# Create the dependent variable data
v = ds_final[dep_var].values

In [None]:
ds_final[indep_var]

**We do a train test split keeping the test size as 0.25**

In [None]:
# splitting the dataset
U_train, U_test, v_train, v_test = train_test_split(U, v, test_size=0.25, random_state = 0)
print(U_train.shape)
print(U_test.shape)

**LINEAR REGRESSION**

In [None]:
# scaling the x values
scaler=StandardScaler()

U_train = scaler.fit_transform(U_train)
U_test = scaler.transform(U_test)
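
The cell above fits the scaler on the training split and only transforms the test split, which keeps test information out of the scaling. The same behaviour can be bundled with scikit-learn's `Pipeline`, which this notebook imports but does not use. The sketch below runs on synthetic data, and the step names `scale` and `model` are arbitrary labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, with a known linear relationship
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[100.0, 5.0], scale=[50.0, 1.0], size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + 20.0 * X_toy[:, 1] + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# fit() learns the scaling from the training data only;
# score() and predict() reuse that scaling on new data
pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)
print(round(test_r2, 4))
```

Keeping both steps in one object also means cross-validation refits the scaler inside each fold.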

In [None]:
# fitting the data into Linear Regression Model
linear_regression = LinearRegression()
linear_regression.fit(U_train, v_train)

In [None]:
v_pred=linear_regression.predict(U_test)
v_pred

In [None]:
linear_regression.score(U_train, v_train)

In [None]:
regression_Dataframe = pd.DataFrame(zip(v_test, v_pred), columns = ['actual', 'pred'])
regression_Dataframe

In [None]:
sales_mean=ds_final[dep_var].mean()

In [None]:
from sklearn.metrics import mean_squared_error

MSE  = mean_squared_error(v_test, v_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

# Note: this is RMSE divided by mean sales (a normalised RMSE), used as a relative error measure
RMPSE=RMSE/sales_mean
print("RMPSE :",RMPSE)

r2 = r2_score(v_test, v_pred)
print("R2 :" ,r2)
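
The relative error printed above is RMSE divided by mean sales. Root mean square percentage error (RMSPE) is a different quantity: it averages the squared percentage error of each record. The sketch below contrasts the two on made-up values. Skipping records with zero actual sales is an assumption of this sketch, needed to avoid division by zero.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square percentage error over records with non-zero actuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # percentage error is undefined when the actual value is 0
    pct_error = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_error ** 2)))

def rmse_over_mean(y_true, y_pred):
    """RMSE divided by the mean of the actual values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / y_true.mean())

# Made-up actual and predicted sales, including one closed day
actual = [100, 200, 0, 400]
predicted = [110, 180, 50, 400]

rmspe_value = rmspe(actual, predicted)
normalised_rmse = rmse_over_mean(actual, predicted)
print(rmspe_value, normalised_rmse)
```

The two numbers differ because RMSPE weights every record by its own size, while the normalised RMSE is dominated by the largest absolute errors.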

**DECISION TREE**

In [None]:
decision_tree=DecisionTreeRegressor(max_depth=5)
decision_tree.fit(U_train, v_train)
v_pred_dt = decision_tree.predict(U_test)
v_train_dt = decision_tree.predict(U_train)
# Evaluate the tree on the held-out test set
MSE  = mean_squared_error(v_test, v_pred_dt)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

RMPSE=RMSE/sales_mean
print("RMPSE :",RMPSE)

r2 = r2_score(v_test, v_pred_dt)
print("R2 :" ,r2)

In [None]:
decisiontree_Dataframe = pd.DataFrame(zip(v_test, v_pred_dt), columns = ['actual', 'pred'])
decisiontree_Dataframe

# **Light GBM Model :**

In [None]:
#model=lgb.LGBMRegressor(n_estimators=700)
#model.fit(U_train,v_train)
#v_pred_lgb=model.predict(U_test)

In [None]:
"""MSE  = mean_squared_error(v_test, v_pred_lgb)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

RMPSE=RMSE/sales_mean
print("RMPSE :",RMPSE)

r2 = r2_score(v_test, v_pred_lgb)
print("R2 :" ,r2)"""

We decided not to include the LightGBM model because we strongly suspected it was overfitting, as it gave an abnormally high score. It can be revisited in future development of this project.
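
One way to turn a suspicion of overfitting into a measurement is to compare the score on the training split with the score on the test split. The sketch below does this on synthetic data and uses a decision tree in place of LightGBM, so it illustrates the check itself and is not a rerun of the model above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a smooth signal plus noise that should not be learnable
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(400, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Compare a shallow tree with an unrestricted one
gaps = {}
for depth in (3, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_r2 = tree.score(X_tr, y_tr)
    test_r2 = tree.score(X_te, y_te)
    gaps[depth] = train_r2 - test_r2
    print(f"max_depth={depth}: train R2={train_r2:.2f}, test R2={test_r2:.2f}")
```

A large gap between the two scores points to overfitting. A high score on both splits does not, although it can still come from a feature that leaks the target.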

# **Conclusion**

**Conclusion from Model Training**

We saw that the Sales column contains 172817 rows with 0 sales. So we created a new dataframe without those rows and trained our models on it. We used various algorithms and got an R² score of around 74%.

We were also curious about the full dataset (including the Sales = 0 rows). So we trained another set of models on it and got an R² score of about 92%, which is far better than the previous model.

So we concluded that removing the Sales = 0 rows removes a lot of information from the dataset, since 172817 rows is a large share, and therefore we decided not to remove them. We got our best RMPSE score from the Random Forest model, where we tried to choose parameters so that the model does not overfit.

**CONCLUSION FROM EDA**

1) The plot of sales by month shows sales increasing from November and peaking in December.

2) The plot of sales by day of week shows sales are highest on Monday, decline from Tuesday to Saturday, and are close to zero on Sunday.

3) The plot between promotion and sales shows that promotion helps increase sales.

4) The type of store plays an important role in the opening pattern of stores.

5) Type ‘b’ stores never closed except for refurbishment or other reasons.

6) Type ‘b’ stores have comparatively higher sales, which are mostly constant with peaks on weekends.

7) Assortment level ‘b’ is only offered at store type ‘b’.

8) We can observe that most of the stores remain closed during State Holidays. It is interesting to note that more stores were open during School Holidays than during State Holidays.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***