## Predict the Productivity of an Outlet

#### The data contains all the information we have about each outlet, such as its last 3 productivity values, its last 3 sales values,
#### the catchment area, and so on. Our client brand is M; C and G are competitor brands. We also have their share and reach indices.

### We will start by loading all the required packages

In [2]:
import pandas as pd
import numpy as np

from ngboost import NGBRegressor
from ngboost.distns import Normal

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

pd.set_option("display.max_columns", 500)
pd.set_option("display.max_rows", 500)

In [3]:
# read the data
df = pd.read_excel('outletproddata.xlsx', sheet_name='Sheet1')

In [4]:
# verify the column names
df.columns

Index(['Outlet Code', 'dependent', 'lag1', 'lag2', 'lag3', 'sfa1', 'sfa2',
       'sfa3', 'Cluster', 'EatShop Score', 'BarpubScore', 'Business Score',
       'Combined Score', 'M_Absence', 'M_Presence', 'KingSize Stores',
       'Premium_Eatshop', 'Premium_BarsPubs', 'Premium_Eatingout',
       'Premium_Shopping', 'Premium_Business', 'Total Barspubs',
       'Total Eatshop', 'Total Eating Out', 'Total Shopping', 'Total Business',
       'area_sqkm', 'Ambient population', 'Average meal cost per person(Rs)',
       '2k_catchment_Educational High-low',
       '2k_catchment_Transporthub high-low',
       '2k_catchment_Commercial high-low', '2k_catchment_Residential high-low',
       'Category', 'Area_Type', 'Distributor Point', 'Market Type',
       'Premium Type', 'Action Index', 'Categorical M Reach Index',
       'Categorical G Reach Index', 'Categorical C Reach Index',
       'Categorical M Share Index', 'Categorical G Index',
       'Categorical C Share Index', 'Categorical King Size 

In [5]:
# verify the shape of dataframe
print(df.shape)

(275, 58)


### Data sanity checks

In [6]:
# check for null values if any
df.isnull().sum()

Outlet Code                                 0
dependent                                   0
lag1                                        0
lag2                                        0
lag3                                        0
sfa1                                        0
sfa2                                        0
sfa3                                        0
Cluster                                     0
EatShop Score                               0
BarpubScore                                 0
Business Score                              0
Combined Score                              0
M_Absence                                   0
M_Presence                                  0
KingSize Stores                             0
Premium_Eatshop                             0
Premium_BarsPubs                            0
Premium_Eatingout                           0
Premium_Shopping                            0
Premium_Business                            0
Total Barspubs                    

### Basic stats about the data

In [8]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| dependent | 275.0 | 26.350909 | 7.705076 | 11.0 | 20.0 | 28.0 | 33.0 | 45.0 |
| lag1 | 275.0 | 26.817455 | 6.623386 | 11.0 | 23.0 | 26.0 | 30.25 | 50.0 |
| lag2 | 275.0 | 30.12297 | 6.34634 | 11.0 | 27.0 | 29.5 | 33.0 | 56.0 |
| lag3 | 275.0 | 30.918364 | 7.306811 | 13.0 | 25.0 | 33.0 | 35.35 | 54.0 |
| sfa1 | 275.0 | 0.4004 | 0.703454 | 0.0 | 0.085 | 0.14 | 0.37 | 5.23 |
| sfa2 | 275.0 | 1.129164 | 1.673292 | 0.0 | 0.445 | 0.71 | 1.275 | 20.85 |
| sfa3 | 275.0 | 1.078509 | 2.173281 | 0.0 | 0.29 | 0.66 | 1.18 | 30.45 |
| Cluster | 275.0 | 4.985455 | 2.575815 | 1.0 | 3.0 | 5.0 | 7.0 | 10.0 |
| EatShop Score | 275.0 | 0.418182 | 0.658768 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| BarpubScore | 275.0 | 0.134545 | 0.48338 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |


### Feature elimination based on redundancy

In [11]:
# Drop 'EatShop Score', 'BarpubScore', 'Business Score' and 'Combined Score' because the
# Total score columns already carry that information. The categorical Reach and Share
# indices are dropped because numerical columns hold the same information, and 'Category'
# is dropped because the 'Area_Type' column contains similar information.
# 'Outlet Code' is a key variable and is not required for modeling.
# 'Average meal cost per person(Rs)' is dropped as well.

drop_list = ['Outlet Code', 'EatShop Score', 'BarpubScore', 'Business Score', 'Combined Score', 'Category',
             'Average meal cost per person(Rs)',
             'Categorical M Reach Index', 'Categorical G Reach Index', 'Categorical C Reach Index',
             'Categorical M Share Index', 'Categorical G Index', 'Categorical C Share Index']

In [12]:
df = df.drop(drop_list,axis =1)

## Now we will do some data preprocessing

In [13]:
# create two data frames, one for the numeric columns and one for the categorical columns
df_num_data = df.select_dtypes(include=['float64','int64'])
df_cat_data = df.select_dtypes(include=['object'])

In [14]:
# scale the numeric data
scaler = MinMaxScaler(feature_range=(0, 1))

# drop the dependent column before scaling
df_num_data = df_num_data.drop('dependent',axis =1)
df_num_data[df_num_data.columns] = scaler.fit_transform(df_num_data[df_num_data.columns])
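
Note that the scaler above is fitted on every row, before the train/test split further down, so the minimum and maximum of the test rows influence the scaled training features. A minimal sketch of the alternative, fitting on the training rows only, is shown below. The `toy` frame, its values, and the `toy_*` names are made up for illustration and are not part of the outlet data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# hypothetical numeric data standing in for df_num_data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "lag1": rng.integers(11, 50, size=20).astype(float),
    "sfa1": rng.random(20) * 5,
})

# split first, so the test rows stay unseen by the scaler
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=0)

toy_scaler = MinMaxScaler(feature_range=(0, 1))
# learn min/max from the training rows only
train_scaled = pd.DataFrame(toy_scaler.fit_transform(toy_train),
                            columns=toy.columns, index=toy_train.index)
# reuse the training min/max on the test rows (values may fall outside [0, 1])
test_scaled = pd.DataFrame(toy_scaler.transform(toy_test),
                           columns=toy.columns, index=toy_test.index)

print(train_scaled.describe().loc[["min", "max"]])
print(test_scaled.describe().loc[["min", "max"]])
```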

In [15]:
# dummy-encode the categorical variables
df_cat_data = pd.get_dummies(df_cat_data, prefix = df_cat_data.columns, drop_first = True)
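
With `drop_first=True`, a column with k levels becomes k-1 indicator columns, and the dropped level is the one for which all indicators are 0. A small sketch on a made-up column (the `toy_cat` frame and its values are hypothetical):

```python
import pandas as pd

# hypothetical categorical column with three levels
toy_cat = pd.DataFrame({"catchment": ["High", "Low", "Medium", "Low"]})

# same call pattern as above: one prefix per column, first level dropped
toy_dummies = pd.get_dummies(toy_cat, prefix=toy_cat.columns, drop_first=True)

# "High" sorts first, so it becomes the implicit baseline level
print(list(toy_dummies.columns))
print(toy_dummies.astype(int))
```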

In [16]:
# merge two data to create a single data frame
df_main = pd.merge(df_num_data, df_cat_data, left_index = True, right_index = True)

In [17]:
# Drop the 'Cluster' column from the merged data: it was scaled along with the other numeric
# columns, but we need the unscaled values because they represent cluster labels.
# So we first drop the scaled version and then add the column back from the original data.

df_main2 = df_main.drop('Cluster',axis=1)

In [18]:
# add the 'cluster' and 'dependent' columns from the original data to the main data
df_main2['dependent'] = df['dependent']
df_main2['cluster'] = df['Cluster']

In [19]:
# verify the shape of the final data
print(df_main2.shape)

(275, 84)


In [20]:
# print some sample rows
df_main2.head(5)

Unnamed: 0,lag1,lag2,lag3,sfa1,sfa2,sfa3,M_Absence,M_Presence,KingSize Stores,Premium_Eatshop,Premium_BarsPubs,Premium_Eatingout,Premium_Shopping,Premium_Business,Total Barspubs,Total Eatshop,Total Eating Out,Total Shopping,Total Business,area_sqkm,Ambient population,M Reach Index,G Reach Index,C Reach Index,M Share Index,G Share Index,C Share Index,2k_catchment_Educational High-low_Low,2k_catchment_Educational High-low_Medium,2k_catchment_Transporthub high-low_Low,2k_catchment_Transporthub high-low_Medium,2k_catchment_Commercial high-low_Low,2k_catchment_Commercial high-low_Medium,2k_catchment_Residential high-low_Low,2k_catchment_Residential high-low_Medium,Area_Type_bakery,Area_Type_bank,Area_Type_bar,Area_Type_building,Area_Type_cafe,Area_Type_clinic,Area_Type_common,Area_Type_construction,Area_Type_convenience,Area_Type_fast_food,Area_Type_golf_course,Area_Type_hairdresser,Area_Type_hospital,Area_Type_hotel,Area_Type_mall,Area_Type_neighbourhood,Area_Type_restaurant,Area_Type_retail,Area_Type_road,Area_Type_station,Area_Type_village,Distributor Point_GGN 2,Market Type_both,Market Type_busi,Premium Type_P,Action Index_Increase your Reach,Action Index_Optimise Routes,Action Index_Primitive Opportunity,Categorical King Size Vol Index (Market)_Low,Categorical King Size Vol Index (Market)_Mid,Categorical KingSize_PerStore_Index_Low,Categorical KingSize_PerStore_Index_Mid,Premium_Index_Eatshop_Low,Premium_Index_Eatshop_Mid,Premium_Index_Eatshop_NP,Premium_Index_Barpub_Low,Premium_Index_Barpub_Mid,Premium_Index_Barpub_NP,Premium_Index_eatingout_Low,Premium_Index_eatingout_Mid,Premium_Index_eatingout_NP,Premium_Index_Shopping_Low,Premium_Index_Shopping_Mid,Premium_Index_Shopping_NP,Premium_Index_Business_Low,Premium_Index_Business_Mid,Premium_Index_Business_NP,dependent,cluster
0,0.307692,0.444444,0.292683,0.162524,0.031175,0.018719,0.2,0.0,0.053333,0.0,0.0,0.0,0.0,0.0,0.0,0.011278,0.003802,0.025,0.021622,0.086783,0.27548,0.0,1.0,0.25,0.0,0.666667,0.0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,25.0,4
1,0.051282,0.444444,0.292683,0.01912,0.023501,0.007225,0.25,0.041667,0.106667,0.15,0.125,0.068966,0.384615,0.0,0.056604,0.240602,0.144487,0.325,0.802703,0.316706,0.079976,0.375,0.5,1.0,0.3762,0.544,0.688,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,1,32.0,4
2,0.307692,0.444444,0.292683,0.01912,0.023501,0.007225,0.0,0.0,0.0,0.016667,0.0,0.0,0.076923,0.0,0.0,0.007519,0.0,0.025,0.067568,0.019189,0.034285,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,1,1,0,0,1,0,1,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,29.0,4
3,0.307692,0.444444,0.292683,0.01912,0.007674,0.016092,0.3,0.569444,0.626667,0.033333,0.0,0.0,0.153846,0.0625,0.056604,0.530075,0.231939,1.0,0.354054,0.557164,0.321371,0.87234,0.978723,0.978723,0.309375,0.388889,0.583333,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,29.0,4
4,0.307692,0.444444,0.292683,0.01912,0.023501,0.007225,0.05,0.208333,0.213333,0.0,0.0,0.0,0.0,0.0625,0.0,0.007519,0.003802,0.0125,0.118919,0.271596,0.346918,0.9375,1.0,1.0,0.660954,0.306744,0.545665,1,0,0,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,1,0,0,17.0,4


### NGBoost Model training for Productivity Prediction

In [21]:
# first split the data into train and test sets
# stratified sampling on the cluster variable is used for the split

X_train, X_test, y_train, y_test = train_test_split(df_main2.drop('dependent', axis =1),
                                                    df_main2[['dependent']], test_size=0.3,
                                                    stratify = df_main2[['cluster']],random_state = 1336)
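
Stratifying on the cluster label keeps each cluster's share roughly the same in the train and test sets. The sketch below checks this on made-up labels; `toy_X`, `toy_cluster`, and the 50/30/20 class sizes are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical data: 100 rows in three clusters of size 50, 30 and 20
toy_cluster = pd.Series([1] * 50 + [2] * 30 + [3] * 20, name="cluster")
toy_X = pd.DataFrame({"feature": np.arange(100)})

toy_train, toy_test, cl_train, cl_test = train_test_split(
    toy_X, toy_cluster, test_size=0.3, stratify=toy_cluster, random_state=1336
)

# share of each cluster in the two splits
train_share = cl_train.value_counts(normalize=True).sort_index()
test_share = cl_test.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```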

In [22]:
# verify that all values are finite
np.all(np.isfinite(X_train))

True

In [23]:
# verify that there are no NaN values
np.any(np.isnan(X_train))

False

## NGB with different parameter settings

In [25]:
from sklearn.tree import DecisionTreeRegressor

# shallow regression tree used as the base learner
learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=3)
ngb = NGBRegressor(Base=learner, n_estimators=500, learning_rate=0.01,
                   minibatch_frac=0.7, col_sample=0.7, random_state=1336)
# pass the target as a 1-D array to avoid the column-vector warning
ngb = ngb.fit(X_train, y_train.values.ravel())


[iter 0] loss=3.4477 val_loss=0.0000 scale=1.0000 norm=6.5747
[iter 100] loss=3.1906 val_loss=0.0000 scale=1.0000 norm=5.2109
[iter 200] loss=2.9774 val_loss=0.0000 scale=1.0000 norm=4.2794
[iter 300] loss=2.8261 val_loss=0.0000 scale=1.0000 norm=3.7849
[iter 400] loss=2.7258 val_loss=0.0000 scale=1.0000 norm=3.5691


In [27]:
# get the point predictions on the test data
y_preds = ngb.predict(X_test)

# get the predicted distribution for each data point in the test set
y_dists = ngb.pred_dist(X_test)
# test mean squared error (actual values first, then predictions)
test_MSE = mean_squared_error(y_test, y_preds)
print("Test MSE", test_MSE)

Test MSE 54.84530756243354


### Different stats to compare train and test performance

In [29]:
y_train_rf = y_train.values.flatten()
y_predicted_train = ngb.predict(X_train)
model_score = r2_score(y_train_rf, y_predicted_train)
errors = abs(y_predicted_train - y_train_rf)
# mean absolute percentage error (MAPE); "accuracy" here is simply 100 - MAPE
mape = 100 * (errors / y_train_rf)
accuracy = 100 - np.mean(mape)
# R^2 gives an idea of the fit: 1 is a perfect prediction
print('coefficient of determination R^2 of the prediction.: ',model_score)
print("Mean squared error train : %.2f"% mean_squared_error(y_train_rf, y_predicted_train))
print('Mean Absolute Error:', round(np.mean(errors), 2))
print('Accuracy:', round(accuracy, 2), '%.')

print ('\n')
y_test_rf = y_test.values.flatten()
y_predicted = y_preds
errors = abs(y_predicted - y_test_rf)
mape = 100 * (errors / y_test_rf)
accuracy = 100 - np.mean(mape)
# the mean squared error on the test set
print("Mean squared error: %.2f"% mean_squared_error(y_test_rf, y_predicted))
# R^2 on the test set (labelled "Variance score" below): 1 is a perfect prediction
print('Test Variance score: %.2f' % r2_score(y_test_rf, y_predicted))
print('Mean Absolute Error:', round(np.mean(errors), 2))
print('Accuracy:', round(accuracy, 2), '%.')

coefficient of determination R^2 of the prediction.:  0.7889276645115908
Mean squared error train : 12.47
Mean Absolute Error: 3.0
Accuracy: 86.78 %.


Mean squared error: 54.85
Test Variance score: 0.08
Mean Absolute Error: 5.74
Accuracy: 74.28 %.
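
The train and test blocks above repeat the same calculations. As a sketch, they could be wrapped in one helper; `regression_report` is a hypothetical name, and the formulas are the standard definitions of MSE, MAE, MAPE and R^2 written in plain NumPy rather than the scikit-learn calls used above. An R^2 of 0 corresponds to always predicting the mean of the actual values, which gives a reference point for the test score reported above.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MSE, MAE, MAPE (%) and R^2 for two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    errors = y_pred - y_true
    mse = np.mean(errors ** 2)
    mae = np.mean(np.abs(errors))
    # MAPE is undefined if any actual value is 0
    mape = 100 * np.mean(np.abs(errors) / np.abs(y_true))
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "mae": mae, "mape": mape, "r2": r2}

# made-up numbers, only to show the call
report = regression_report([10, 20, 30], [12, 18, 33])
print({name: round(float(value), 4) for name, value in report.items()})
```

In this notebook the helper would be called as `regression_report(y_train_rf, y_predicted_train)` and `regression_report(y_test_rf, y_predicted)`.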


### Write the results to a CSV output file

In [30]:
# collect actuals, point predictions and distribution parameters, keyed by the test index
result = pd.DataFrame({'Actual': y_test['dependent'],
                       'Predicted': y_preds,
                       'loc': y_dists.params['loc'],
                       'scale': y_dists.params['scale']},
                      index=y_test.index)
result.to_csv('result.csv')

In [31]:
# print the first few rows of the output
result.head(10)

| | Actual | Predicted | loc | scale |
|---|---|---|---|---|
| 187 | 33.0 | 27.893529 | 27.893529 | 2.913497 |
| 110 | 26.0 | 29.570578 | 29.570578 | 3.112088 |
| 75 | 23.0 | 23.236085 | 23.236085 | 4.018628 |
| 42 | 33.0 | 27.39716 | 27.39716 | 2.797532 |
| 254 | 25.5 | 25.1237 | 25.1237 | 5.632758 |
| 135 | 29.5 | 32.150025 | 32.150025 | 3.79213 |
| 226 | 14.0 | 24.578928 | 24.578928 | 2.707594 |
| 146 | 27.0 | 20.8744 | 20.8744 | 4.63169 |
| 134 | 28.0 | 31.955732 | 31.955732 | 3.071624 |
| 119 | 24.5 | 25.302741 | 25.302741 | 3.409723 |


### Explanation of result

#### As noted earlier, NGBoost outputs a whole distribution for each prediction rather than a single value. A Normal
#### distribution is described by its mean and standard deviation, so in the output above 'loc' is the mean of the
#### predicted distribution and 'scale' is its standard deviation. 'Predicted' equals 'loc' because the point prediction is that mean.
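
Because each prediction is a Normal distribution, `loc` and `scale` can be turned into a prediction interval. The sketch below uses Python's standard-library `statistics.NormalDist` and two rows copied from the output table above; `normal_interval` is a hypothetical helper name, and the 95% level is an arbitrary choice.

```python
from statistics import NormalDist

def normal_interval(loc, scale, level=0.95):
    """Central interval of a Normal(loc, scale) covering `level` probability."""
    dist = NormalDist(mu=loc, sigma=scale)
    tail = (1 - level) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

# (row index, actual, loc, scale) copied from the result table above
rows = [(187, 33.0, 27.893529, 2.913497),
        (226, 14.0, 24.578928, 2.707594)]

for idx, actual, loc, scale in rows:
    low, high = normal_interval(loc, scale)
    inside = low <= actual <= high
    print(f"{idx}: [{low:.2f}, {high:.2f}] actual={actual} inside={inside}")
```

Applied to the whole `result` frame, the same 95% bounds are approximately `result['loc'] - 1.96 * result['scale']` and `result['loc'] + 1.96 * result['scale']`.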