# Model Identification to Predict the Total Discount

We train several machine learning models to predict the total discount that should be given to a POC.

The predicted total discount is then split into 2 parts: On Invoice Discount and Off Invoice Discount.

In this notebook, we look for the best model for total discount prediction. We also introduce our train-test split strategy and how we divide the data based on GTO.

In [2]:
#### IMPORTS

import pandas as pd
import numpy as np
from types import FunctionType
from sklearn.model_selection import train_test_split
import math
import matplotlib.pyplot as plt
# Importing sklearn methods
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn import svm
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
from sklearn import tree

In [3]:
# data2.xlsx is the data obtained after running the feature engineering code
# data3.csv: data2.xlsx converted to CSV, because CSV files are easier to use in web apps
path = r"C:\Users\NISARG\Desktop\mech\Finance\Maverick\CODE"   # change this to your local path
df = pd.read_csv(path + r"\data3.csv")   # raw string avoids the invalid "\d" escape sequence
df.head()


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Ship-to ID,Volume_2019,Volume_2018,sdfc_Tier,poc_image,segment,sub_segment,Product Set,...,Expected_GTO,Expected_product_volume,loyalty_index,min_order_size_for_discount,inventory_lingering_factor,profit_Product,profitability_indicator,discount_std,upper_limit,Disc_Percent
0,0,0,29000310,0.48,0.557,Tier 0,Mainstream,Entertainment Led,Events,RETURNABLE_BOTTLE_JUPILER_JUPILER PILS,...,116.362876,0.395568,0,5.9e-05,0.030049,370.415207,0.474992,840.354946,30.119944,0.0
1,1,1,29000419,0.45,0.54,Tier 1,Mainstream,Not applicable,Not applicable,RETURNABLE_BOTTLE_PIEDBOEUF_PIEDBOEUF TRIPLE,...,88.943478,0.352174,0,0.002273,0.057372,207.180476,0.265672,421.81302,21.621963,0.0
2,2,2,29000430,270.97,225.72,Tier 1,Mainstream,Drink Led,Party Place,OW_BULK_JUPILER_JUPILER PILS,...,71342.13643,276.519909,1,0.000768,60.857752,77983.46519,100.0,188693.2633,39604.15389,0.235763
3,3,3,29000430,270.97,225.72,Tier 1,Mainstream,Drink Led,Party Place,RETURNABLE_BOTTLE_JUPILER_JUPILER PILS,...,6955.593627,23.645077,1,5.9e-05,0.507591,370.415207,0.474992,840.354946,1800.544853,0.267487
4,4,4,29000430,270.97,225.72,Tier 1,Mainstream,Drink Led,Party Place,RETURNABLE_KEG_JUPILER_JUPILER PILS,...,3536.747237,13.908869,0,6.4e-05,5.144873,3160.708348,4.053049,11271.53787,948.335453,0.235409


# TRAIN TEST DATA CREATION STRATEGY

The aim of the split is to identify data that is being **discounted correctly** and use it as **train data**. A model trained on this data has seen only **correctly discounted data**; when it is applied to the test data, it therefore predicts the correct discounts for the rows that are not being discounted correctly.

# Strategy

We use the **upper_limit** column to do the train-test split.

As explained in the Exploratory Data Analysis notebook, the upper limit has 2 terms: one accounts for the **profitability of a particular product set** for ABInBev, and the other accounts for the **growth potential of the POC**.

The upper limit is then **scaled** by a factor of **1.2**.

If the discount given to a particular POC is **at most the upper limit** defined above, we treat that discount as **correct** and **include the row in our train dataset**.
However, if the total discount is **greater than the upper limit**, the row is **wrongly discounted** and we include it in the test dataset (the dataset for which we want to correct the discount).


In [133]:

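# Rows whose total discount exceeds the upper limit go to the test set,
# all other rows go to the train set
over_limit = (df['Discount_Total'] > df['upper_limit']).values
index_test = np.flatnonzero(over_limit).tolist()
index_train = np.flatnonzero(~over_limit).tolist()
count = len(index_test)
# Share of rows that end up in the test set
print(count/len(df['upper_limit']))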

0.3232645073885446


In [134]:

'''
Here we get the train and test datasets based on the logic described above
'''

df_test = df.iloc[index_test]
df_test = df_test.reset_index()
df_train = df.iloc[index_train]
df_train = df_train.reset_index()     

In [135]:
# Encode the categorical variables as integers

from sklearn.preprocessing import LabelEncoder

# Categorical columns that are label-encoded
CATEGORICAL_COLS = ['sdfc_Tier', 'poc_image', 'segment', 'sub_segment', 'Product Set',
                    'Brand', 'Sub-Brand', 'Pack_Type', 'Returnalility', 'province']

def encode(highGTOData):
    # Work on a copy so that the caller's DataFrame is not modified in place
    highGTOData = highGTOData.copy()
    # Some rows store poc_image as 0; treat them as "Mainstream".
    # Using .loc avoids chained assignment and the SettingWithCopyWarning.
    highGTOData.loc[highGTOData['poc_image'] == 0, 'poc_image'] = "Mainstream"
    for col in CATEGORICAL_COLS:
        highGTOData[col] = LabelEncoder().fit_transform(highGTOData[col])
    highGTOData['GTO_growth'] = highGTOData['Expected_GTO'] - highGTOData['GTO_2019']
    return highGTOData



In [136]:
df_test = encode(df_test)
df_train = encode(df_train)
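
One thing to watch in the cell above: `encode` calls `fit_transform` on each frame separately, so the integer assigned to a category depends on which categories happen to be present in that frame. The sketch below uses a small pure-Python stand-in for `LabelEncoder` (sorted unique values mapped to 0, 1, 2, ...) to show the mismatch and one possible remedy, which is to learn the mapping once and reuse it. The tier values and the helper names `fit_codes` and `apply_codes` are hypothetical.

```python
def fit_codes(values):
    """Map each distinct value to an integer in sorted order (LabelEncoder-like)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def apply_codes(values, codes):
    """Encode values with an already learned mapping."""
    return [codes[value] for value in values]


# Made-up tiers: the test frame happens to contain no "Tier 0" rows.
train_tiers = ["Tier 0", "Tier 1", "Tier 2", "Tier 1"]
test_tiers = ["Tier 1", "Tier 2"]

# Fitting separately: "Tier 1" becomes 1 in train but 0 in test.
separate_train = apply_codes(train_tiers, fit_codes(train_tiers))
separate_test = apply_codes(test_tiers, fit_codes(test_tiers))

# Fitting once on all observed values keeps the codes aligned.
shared_codes = fit_codes(train_tiers + test_tiers)
shared_train = apply_codes(train_tiers, shared_codes)
shared_test = apply_codes(test_tiers, shared_codes)

print(separate_train, separate_test)  # [0, 1, 2, 1] [0, 1]
print(shared_train, shared_test)      # [0, 1, 2, 1] [1, 2]
```

With scikit-learn the same idea would be to call `fit` on one `LabelEncoder` per column and then `transform` on both frames, instead of `fit_transform` on each frame.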



# Data Division 

We divide the dataset into 3 parts based on **GTO_2019**. Since the statistically dominant features differ across these 3 parts, creating a separate machine learning model for each part gives better results.

The dataset is divided into the following 3 parts:

    1) LOW GTO : GTO < 10000
    
    2) MID GTO : GTO > 10000 & GTO < 50000
    
    3) HIGH GTO : GTO > 50000

In [137]:



lowGTOData_train = df_train[df_train.GTO_2019<10000]
lowGTOData_test = df_test[df_test.GTO_2019<10000]

midGTOData_train = df_train[(df_train.GTO_2019>10000)&(df_train.GTO_2019<50000)]
midGTOData_test = df_test[(df_test.GTO_2019>10000)&(df_test.GTO_2019<50000)]

highGTOData_train = df_train[df_train.GTO_2019>50000]
highGTOData_test = df_test[df_test.GTO_2019>50000]
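
The three bands can also be written as a single labelling function. The sketch below is a plain-Python illustration with a hypothetical helper, `gto_band`. It assumes half-open intervals, so a value of exactly 10000 or 50000 goes to the upper band, whereas the strict comparisons in the cell above leave rows with exactly those two values out of every subset.

```python
LOW_MAX = 10_000   # upper edge of the low band (exclusive)
MID_MAX = 50_000   # upper edge of the mid band (exclusive)


def gto_band(gto):
    """Return 'low', 'mid' or 'high' for a GTO_2019 value."""
    if gto < LOW_MAX:
        return "low"
    if gto < MID_MAX:
        return "mid"
    return "high"


# Made-up GTO values, including the two boundary cases.
sample_gto = [116.4, 10_000, 35_000.0, 50_000, 71_342.1]
bands = [gto_band(value) for value in sample_gto]
print(bands)  # ['low', 'mid', 'mid', 'high', 'high']
```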

In [138]:
# Class holding the machine learning models that we are going to try here

# Random seed shared by the K-fold splitters in every method
seed = 34234


class Models(object):
    
    # Initialization 
    def __init__(self, x_train, x_validation, y_train, y_validation):
        # Convert the input DataFrames / Series to plain Python lists
        self.x_train = x_train.values.tolist()
        self.x_validation = x_validation.values.tolist()
        self.y_train = y_train.tolist()
        self.y_validation = y_validation.tolist()
    
    
    @staticmethod
    def print_info(cross_val_scores, mse):
        print("Cross Validation Scores: ", cross_val_scores)
        print("Mean Squared Error: ", mse)
        
        
    # Linear Regression 
    def linear_regression(self, x_train, x_validation,  y_train, y_validation):
        reg = linear_model.LinearRegression()
        # X = np.array(X).reshape([-1, 1])
        reg.fit(self.x_train, self.y_train)
        y_pred_list = reg.predict(self.x_validation)
        mse = math.sqrt(mean_squared_error(self.y_validation, y_pred_list))
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cross_val_scores = cross_val_score(reg, self.x_train, self.y_train, cv=kfold )
        mse_train = math.sqrt(mean_squared_error(self.y_train,reg.predict(self.x_train)))
        print("TRAINING ERROR = " , mse_train)
        print("\nLinear Regression Model")
        self.print_info(cross_val_scores, mse)
        return cross_val_scores, mse
        
    # Random Forest Regression model 
    def random_forest(self, x_train, x_validation,  y_train, y_validation):
        rfr = RandomForestRegressor(n_estimators=8, max_depth=8, random_state=12, verbose=0)
        # X = np.array(X).reshape([-1, 1])
        rfr.fit(self.x_train, self.y_train)
        y_pred_list = rfr.predict(self.x_validation)
        mse = math.sqrt(mean_squared_error(self.y_validation, y_pred_list))
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cross_val_scores = cross_val_score(rfr, self.x_train, self.y_train, cv=kfold )
        mse_train = math.sqrt(mean_squared_error(self.y_train,rfr.predict(self.x_train)))
        print("\nRandom Forest Regressor")
        print("TRAINING ERROR = " , mse_train)
        self.print_info(cross_val_scores, mse)
        return cross_val_scores, mse
            
    # Lasso method 
    def lasso(self, x_train, x_validation,  y_train, y_validation):
        reg = linear_model.Lasso(alpha = 0.1)
        # X = np.array(X).reshape([-1, 1])
        reg.fit(self.x_train, self.y_train)
        y_pred_list = reg.predict(self.x_validation)
        mse = math.sqrt(mean_squared_error(self.y_validation, y_pred_list))
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cross_val_scores = cross_val_score(reg, self.x_train, self.y_train, cv=kfold )
        mse_train = math.sqrt(mean_squared_error(self.y_train,reg.predict(self.x_train)))
        print("\nLasso Regression Model")
        print("TRAINING ERROR = " , mse_train)
        self.print_info(cross_val_scores, mse)
        return cross_val_scores, mse
    
    # Ridge method 
    def ridge(self, x_train, x_validation,  y_train, y_validation):
        reg = linear_model.Ridge(alpha = 0.1)
        # X = np.array(X).reshape([-1, 1])
        reg.fit(self.x_train, self.y_train)
        y_pred_list = reg.predict(self.x_validation)
        mse = math.sqrt(mean_squared_error(self.y_validation, y_pred_list))
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cross_val_scores = cross_val_score(reg, self.x_train, self.y_train, cv=kfold )
        mse_train = math.sqrt(mean_squared_error(self.y_train,reg.predict(self.x_train)))
        print("\nRidge Regression Model")
        print("TRAINING ERROR = " , mse_train)
        self.print_info(cross_val_scores, mse)
        return cross_val_scores, mse
    
    
    # CART method 
    def CART(self, x_train, x_validation,  y_train, y_validation):
        reg =  tree.DecisionTreeRegressor()
        # X = np.array(X).reshape([-1, 1])
        reg.fit(self.x_train, self.y_train)
        y_pred_list = reg.predict(self.x_validation)
        mse = math.sqrt(mean_squared_error(self.y_validation, y_pred_list))
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cross_val_scores = cross_val_score(reg, self.x_train, self.y_train, cv=kfold )
        mse_train = math.sqrt(mean_squared_error(self.y_train,reg.predict(self.x_train)))
        print("\nCART Regression Model")
        print("TRAINING ERROR = " , mse_train)
        self.print_info(cross_val_scores, mse)
        return cross_val_scores, mse
    
    # Gradient Boosting Regressor
    def GBR(self, x_train, x_validation,  y_train, y_validation):
        # The default loss is least squares; the old loss='ls' spelling was removed in newer scikit-learn releases
        gbr = GradientBoostingRegressor(n_estimators=175, learning_rate=0.08, max_depth=3, random_state=1232)
        gbr.fit(self.x_train, self.y_train)
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cross_val_scores = cross_val_score(gbr, self.x_train, self.y_train, cv=kfold )
        mse_train = math.sqrt(mean_squared_error(self.y_train,gbr.predict(self.x_train)))
        mse = math.sqrt(mean_squared_error(self.y_validation, gbr.predict(self.x_validation)))
        print("mse = ",mse)
        print('\nGradient Boosting Regressor')
        print("TRAINING ERROR = " , mse_train)
        
        self.print_info(cross_val_scores, mse)
        return cross_val_scores, mse
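
A note on reading the printed numbers: every method wraps `mean_squared_error` in `math.sqrt`, so the values labelled "TRAINING ERROR" and "Mean Squared Error" are root mean squared errors, expressed in the same units as `Discount_Total`. A minimal pure-Python sketch of the two quantities, with made-up values and hypothetical helper names (`mse_of`, `rmse_of`):

```python
import math


def mse_of(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse_of(y_true, y_pred):
    """Root mean squared error: square root of the MSE, in the units of the target."""
    return math.sqrt(mse_of(y_true, y_pred))


# Made-up discounts and predictions.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

mse_value = mse_of(y_true, y_pred)
rmse_value = rmse_of(y_true, y_pred)
print(mse_value, rmse_value)
```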

In [139]:
'''
Keeping the essential columns as seen from the Exploratory Data Analysis
'''



lowGTOData_train = lowGTOData_train.reset_index()
lowGTOData_test = lowGTOData_test.reset_index()
target_train = lowGTOData_train['Discount_Total']
target_test = lowGTOData_test['Discount_Total']
colsToKeep = ['Volume_2019' , 'Volume_2018'  , 'Expected_GTO'  , 'Expected_product_volume', 'profitability_indicator' , 'upper_limit'  ,'sdfc_Tier'  , 'loyalty_index' , 'Returnalility', 'market_cap' ]
features_train = lowGTOData_train[colsToKeep]
features_test = lowGTOData_test[colsToKeep]

In [140]:
methods = [x for x, y in Models.__dict__.items() if type(y) == FunctionType]
methods.remove('__init__')
for model in methods:
    reg = Models(features_train, features_test, target_train, target_test)
    cross_val_scores, mse = getattr(reg, model)(features_train, features_test, target_train, target_test)
    print("MEAN CROSS VAL SCORES=" , np.mean(cross_val_scores))
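
The loop above builds its list of model methods by filtering `Models.__dict__` on `FunctionType`. This works because plain methods are stored in the class dictionary as functions, while `print_info` is stored as a `staticmethod` object and is therefore skipped. A minimal sketch of the same discovery pattern on a hypothetical stand-in class, `TinyModels`:

```python
from types import FunctionType


class TinyModels(object):
    """Hypothetical stand-in with the same layout as the Models class."""

    def __init__(self):
        pass

    @staticmethod
    def print_info():
        pass

    def linear_regression(self):
        return "linear"

    def GBR(self):
        return "gbr"


# Plain methods appear in the class __dict__ as functions; a staticmethod
# appears as a staticmethod object, so the type check skips print_info.
methods = [name for name, attr in TinyModels.__dict__.items()
           if type(attr) == FunctionType]
methods.remove('__init__')

# Call every discovered method by name on a fresh instance.
results = [getattr(TinyModels(), name)() for name in methods]
print(methods)   # ['linear_regression', 'GBR']
print(results)   # ['linear', 'gbr']
```

A consequence of this pattern is that any extra helper method added to `Models` as a plain method would also be picked up and called as if it were a model.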

TRAINING ERROR =  347.87168854519183

Linear Regression Model
Cross Validation Scores:  [ 4.00787233e-01  1.65688762e-01  6.04759808e-02  3.29433468e-01
  3.62163103e-01 -6.76259540e+10  2.75470063e-01 -1.91311127e+01
  3.09435497e-01  3.11794915e-01]
Mean Squared Error:  663.6101020307376
MEAN CROSS VAL SCORES= -6762595399.491838

Random Forest Regressor
TRAINING ERROR =  140.9393411095392
Cross Validation Scores:  [0.69403482 0.65730069 0.88129082 0.62996554 0.77444463 0.87657898
 0.91937588 0.73961251 0.71284645 0.70896735]
Mean Squared Error:  1107.173425684565
MEAN CROSS VAL SCORES= 0.7594417679010417

Lasso Regression Model
TRAINING ERROR =  347.8945502208254
Cross Validation Scores:  [ 4.00027013e-01  1.65209415e-01  6.10981997e-02  3.30323315e-01
  3.61302651e-01 -6.76247602e+10  2.76700198e-01 -1.91234010e+01
  3.09118247e-01  3.11419600e-01]
Mean Squared Error:  663.7064727308415
MEAN CROSS VAL SCORES= -6762476018.822924


Ridge Regression Model
TRAINING ERROR =  347.8716895234163
Cross Validation Scores:  [ 4.00782354e-01  1.65684153e-01  6.04825907e-02  3.29442825e-01
  3.62157623e-01 -6.76259100e+10  2.75490312e-01 -1.91310482e+01
  3.09433493e-01  3.11793406e-01]
Mean Squared Error:  663.6104011115797
MEAN CROSS VAL SCORES= -6762591003.736961

CART Regression Model
TRAINING ERROR =  2.1725896842663301e-16
Cross Validation Scores:  [0.46958692 0.49853673 0.79217374 0.33782931 0.50591797 0.76828446
 0.85416908 0.5326938  0.57554501 0.60132816]
Mean Squared Error:  1041.1524816616027
MEAN CROSS VAL SCORES= 0.5936065169212928
mse =  1158.7989157671286

Gradient Boosting Regressor
TRAINING ERROR =  139.34335533706295
Cross Validation Scores:  [0.69580969 0.65833065 0.88725394 0.67072566 0.7858876  0.89254566
 0.93262279 0.7447072  0.73465726 0.69742913]
Mean Squared Error:  1158.7989157671286
MEAN CROSS VAL SCORES= 0.7699969581837152


# Comments for lowGTOData :
1) Since our training data is correct data (data where discounts are correctly given), we would like our machine learning model to give the minimum training error.

2) For selecting the best model, we use two criteria: low training error and high cross-validation R^2 scores.

3) The CART model gives the lowest training error (about 2e-16), but its mean cross-validation score (0.59) is clearly below that of the ensemble models, which means that the model is overfitting.

4) Among the models that do not overfit, the Gradient Boosting Regressor has the lowest training error (139.3) and the highest mean cross-validation score (0.77), so it performs best on both criteria.
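
For a regressor, `cross_val_score` with its default scoring returns the coefficient of determination R^2 for each fold, which is why some of the fold scores printed above are hugely negative. R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares, so predictions that are far worse than simply predicting the fold mean push it below zero without bound. A minimal pure-Python sketch with made-up numbers (`r_squared` is a hypothetical helper):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [10.0, 20.0, 30.0]

perfect = r_squared(y_true, [10.0, 20.0, 30.0])         # exact predictions
mean_only = r_squared(y_true, [20.0, 20.0, 20.0])       # always predict the mean
one_outlier = r_squared(y_true, [10.0, 20.0, 3000.0])   # one wildly wrong prediction
print(perfect, mean_only, one_outlier)  # 1.0 0.0 -44103.5
```

A single badly predicted row in a fold is enough to dominate the mean of the ten fold scores, so the median of the fold scores can be a useful companion to the mean.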


In [141]:
'''
Keeping the features as seen from EXPLORATORY DATA ANALYSIS
'''

midGTOData_train = midGTOData_train.reset_index()
midGTOData_test = midGTOData_test.reset_index()
target_train = midGTOData_train['Discount_Total']
target_test = midGTOData_test['Discount_Total']
colsToKeep = ['Volume_2019' , 'Volume_2018' ,'Volume_2019 Product' ,'Expected_GTO','Expected_product_volume' , 'profitability_indicator' , 'upper_limit'  ,'sdfc_Tier'  , 'loyalty_index' , 'Returnalility',  'inventory_lingering_factor', 'market_cap',
       'order_size']
features_train = midGTOData_train[colsToKeep]
features_test = midGTOData_test[colsToKeep]

In [142]:
methods = [x for x, y in Models.__dict__.items() if type(y) == FunctionType]
methods.remove('__init__')
for model in methods:
    reg = Models(features_train, features_test, target_train, target_test)
    cross_val_scores, mse = getattr(reg, model)(features_train, features_test, target_train, target_test)
    print("MEAN CROSS VAL SCORES=" , np.mean(cross_val_scores))

TRAINING ERROR =  2157.655855188812

Linear Regression Model
Cross Validation Scores:  [0.63800041 0.23246567 0.38247191 0.31567495 0.41035225 0.52893829
 0.47113808 0.39799634 0.50792226 0.47252448]
Mean Squared Error:  6583.829538001789
MEAN CROSS VAL SCORES= 0.43574846519336735

Random Forest Regressor
TRAINING ERROR =  1424.0698486921524
Cross Validation Scores:  [0.57852706 0.31551433 0.47641717 0.35457732 0.33568022 0.65900386
 0.62967686 0.38368241 0.52126377 0.49137169]
Mean Squared Error:  6414.869674902147
MEAN CROSS VAL SCORES= 0.4745714696634106

Lasso Regression Model
TRAINING ERROR =  2157.662605321827
Cross Validation Scores:  [0.63803764 0.23291445 0.38248092 0.31609428 0.41054297 0.52883927
 0.47129861 0.398313   0.50578807 0.47294024]
Mean Squared Error:  6583.274571440101
MEAN CROSS VAL SCORES= 0.4357249459524747

Ridge Regression Model
TRAINING ERROR =  2158.0184294360824
Cross Validation Scores:  [0.63808368 0.23631229 0.38197558 0.31789365 0.4114996  0.52757936
 0

# Comments for mid GTO Data
1) The CART regression model overfits: it has zero training error, but its cross-validation scores are very low.

2) The Gradient Boosting Regressor has a low training error and the highest k-fold cross-validation score.

3) The GBR model fits best for mid GTO data.

In [143]:
'''
Keeping the features as seen from EXPLORATORY DATA ANALYSIS
'''


highGTOData_train = highGTOData_train.reset_index()
highGTOData_test = highGTOData_test.reset_index()
target_train = highGTOData_train['Discount_Total']
target_test = highGTOData_test['Discount_Total']
colsToKeep = ['Volume_2019' , 'Volume_2018' ,'Volume_2019 Product' ,'Expected_GTO','Expected_product_volume' , 'profitability_indicator' , 'upper_limit'  ,  'inventory_lingering_factor',
       'order_size']
features_train = highGTOData_train[colsToKeep]
features_test = highGTOData_test[colsToKeep]

In [144]:
methods = [x for x, y in Models.__dict__.items() if type(y) == FunctionType]
methods.remove('__init__')
for model in methods:
    reg = Models(features_train, features_test, target_train, target_test)
    cross_val_scores, mse = getattr(reg, model)(features_train, features_test, target_train, target_test)
    print("MEAN CROSS VAL SCORES=" , np.mean(cross_val_scores))

TRAINING ERROR =  11592.764707271046

Linear Regression Model
Cross Validation Scores:  [-1.06087019 -0.36632146  0.97420802  0.72035285  0.62682464 -0.06819141
  0.49322678 -2.2610839   0.87362034  0.91377639]
Mean Squared Error:  124995.12651733206
MEAN CROSS VAL SCORES= 0.0845542055601822

Random Forest Regressor
TRAINING ERROR =  5966.083708326815
Cross Validation Scores:  [-1.74860924  0.5434003   0.91425773  0.85503962  0.62815798 -0.09866692
  0.46652807 -4.63388039  0.8523974   0.92125103]
Mean Squared Error:  171986.4990619022
MEAN CROSS VAL SCORES= -0.1300124437627878


Lasso Regression Model
TRAINING ERROR =  11592.8733428284
Cross Validation Scores:  [-1.07080799 -0.36169851  0.97953099  0.72041248  0.62548444 -0.07059253
  0.48980501 -2.07415161  0.87716266  0.91381811]
Mean Squared Error:  124878.77966440414
MEAN CROSS VAL SCORES= 0.1028963054053718

Ridge Regression Model
TRAINING ERROR =  11592.81823974513
Cross Validation Scores:  [-1.05852978 -0.36697204  0.97403141  0.72024496  0.62874838 -0.06852993
  0.49177446 -2.12594403  0.87364746  0.91384709]
Mean Squared Error:  125016.88051406432
MEAN CROSS VAL SCORES= 0.09823179709549928

CART Regression Model
TRAINING ERROR =  0.0
Cross Validation Scores:  [ -1.5315511    0.56870987   0.95914696   0.77538861   0.46493835
  -0.71221601   0.17658503 -28.61189892   0.86198964   0.93158534]
Mean Squared Error:  169286.79460617268
MEAN CROSS VAL SCORES= -2.611732223013262
mse =  174374.6010934097

Gradient Boosting Regressor
TRAINING ERROR =  1573.8869325240246
Cross Validation Scores:  [-1.86965283  0

# Comments for High GTO Data
1) Linear models have better cross-validation scores than GBR, even though they have a higher training error.

2) Combining the results of GBR and Lasso could be an ideal choice for high GTO data.
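
One simple way to combine two regressors is a weighted average of their predictions. The sketch below is only an illustration under that assumption: the helper `blend_predictions`, the weight of 0.5 and the prediction values are all hypothetical, and this notebook does not specify how the GBR and Lasso outputs would actually be combined.

```python
def blend_predictions(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction lists of equal length."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(pred_a, pred_b)]


# Made-up discount predictions from two already fitted models.
gbr_pred = [1000.0, 5200.0, 800.0]
lasso_pred = [1200.0, 4800.0, 1000.0]

blended = blend_predictions(gbr_pred, lasso_pred)
print(blended)  # [1100.0, 5000.0, 900.0]
```

The weight would normally be chosen on held-out data, for example by comparing cross-validation scores for a few candidate values.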

# Final Comments

# Low GTO Data : GBR

# Mid GTO Data : GBR

# High GTO Data : GBR
