# BIP's error function

The error function expects as input a `pandas.DataFrame` with the following structure:

| StoreID | Month | Region | NumberOfSales | _NumberOfSales |
| ------- | ----- | ------ | ------------- | -------------- |
| 1000    | 3     | 4      | 16            | 16             |
| 1000    | 4     | 4      | 30            | 23             |
| 1001    | 3     | 6      | 410           | 411            |
| 1001    | 4     | 27     | 3130          | 3120           |
| 1002    | 3     | 58     | 10            | 8              |

Where:
 
 - *NumberOfSales* holds the **actual values** of the test set
 - *_NumberOfSales* holds the **predicted values**
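
As a quick illustration, the sample table above can be built directly in pandas. This is only a minimal sketch: the helper name `check_bip_input` is hypothetical, and the month column is called `D_Month` here because that is the name used by the notebook code (the table labels it *Month*).

```python
import pandas as pd

# Columns the error computation relies on.
# Assumption: the month column is named D_Month, as in the notebook code.
REQUIRED_COLUMNS = ['StoreID', 'D_Month', 'Region', 'NumberOfSales', '_NumberOfSales']


def check_bip_input(df):
    """Hypothetical helper: raise if a required column is missing."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError("Missing columns: {}".format(missing))
    return df


# The five rows of the sample table.
sample = pd.DataFrame({
    'StoreID': [1000, 1000, 1001, 1001, 1002],
    'D_Month': [3, 4, 3, 4, 3],
    'Region': [4, 4, 6, 27, 58],
    'NumberOfSales': [16, 30, 410, 3130, 10],
    '_NumberOfSales': [16, 23, 411, 3120, 8],
})
check_bip_input(sample)
```

Extra columns are harmless for such a check, which matches the later cells where the full dataframe is passed to `get_BIP_error`.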
 
 

Start from the preprocessed dataset and simulate a dataset that contains predictions.

In [1]:
from import_man import *

df = pd.read_csv('./dataset/preprocessed_train.csv')

print("Shape before: " + str(df.shape))

# Work on a reduced random sample of the dataset
df = df.sample(n=5000)

print("Shape after: " + str(df.shape))

Shape before: (523021, 51)
Shape after: (5000, 51)


In [2]:
error_evaluation_columns = ['StoreID', 'D_Month', 'Region', 'NumberOfSales', '_NumberOfSales']

# Create fake predicted sales by drawing, with replacement, from the actual sales
df['_NumberOfSales'] = df['NumberOfSales'].sample(n=len(df), replace=True).values


# Save a copy of the full dataframe to pass to the function under test
df_BIP_err_fun_eval = df.copy()


# Keep only the columns needed to evaluate the error
df = df[error_evaluation_columns]

df.head(20)

Unnamed: 0,StoreID,D_Month,Region,NumberOfSales,_NumberOfSales
248997,1355,7,3,20916.0,6588.029148
280156,1400,7,3,2154.0,4985.0
296069,1424,4,3,3520.0,4455.0
291041,1416,6,3,6273.0,7175.0
196625,1280,10,3,3478.0,4902.0
344284,1492,2,8,3151.0,4787.0
435024,1623,8,9,4073.0,3627.0
33536,1047,9,9,4324.0,5696.0
467802,1670,2,2,3462.0,4034.846411
415523,1595,2,2,6134.0,2279.0


In [3]:
# Keep only March and April (.copy() avoids a SettingWithCopyWarning when adding columns)
df = df.loc[df['D_Month'].isin([3, 4])].copy()

# Absolute difference between actual and predicted NumberOfSales
df['abs_diff'] = (df['NumberOfSales'] - df['_NumberOfSales']).abs()

df.head(20)

Unnamed: 0,StoreID,D_Month,Region,NumberOfSales,_NumberOfSales,abs_diff
296069,1424,4,3,3520.0,4455.0,935.0
406842,1582,4,0,8818.0,3393.0,5425.0
115356,1166,4,2,3087.919283,4891.0,1803.080717
1840,1002,3,3,4968.047776,4665.0,303.047776
362339,1519,3,9,3314.0,5367.474295,2053.474295
409731,1587,3,6,3329.0,4084.0,755.0
286417,1409,4,9,4694.852405,6189.714992,1494.862587
95481,1138,4,9,6420.0,7531.0,1111.0
178389,1254,4,2,5608.0,3199.0,2409.0
429596,1616,4,9,5964.0,3384.0,2580.0


In [4]:
# Sum every column within each region
# (the StoreID and D_Month sums are not meaningful and are not used afterwards)
df_sums_by_region = df.groupby(['Region']).sum()
df_sums_by_region.head(20)

Region,StoreID,D_Month,NumberOfSales,_NumberOfSales,abs_diff
0,89414,223,308413.154726,306758.975041,135487.271415
1,23244,60,77723.744303,78832.908187,42613.135135
2,233174,580,770129.393624,783912.86063,303882.331667
3,144665,361,578738.383433,492234.3216,223158.056186
4,17604,49,94524.537068,67703.233575,41211.656456
5,71289,192,250263.798749,253255.651464,103673.38744
6,50032,139,152871.293708,157765.274337,56325.453745
7,95078,244,303835.774423,331639.184538,132037.53909
8,49404,126,141054.904918,178154.85012,72363.569077
9,254514,625,882159.035007,872352.242932,393287.298063


In [5]:
# Per-region error: summed absolute difference divided by the summed actual NumberOfSales
df_sums_by_region['E_r'] = df_sums_by_region['abs_diff'] / df_sums_by_region['NumberOfSales']

df_sums_by_region.head(20)

Region,StoreID,D_Month,NumberOfSales,_NumberOfSales,abs_diff,E_r
0,89414,223,308413.154726,306758.975041,135487.271415,0.439304
1,23244,60,77723.744303,78832.908187,42613.135135,0.548264
2,233174,580,770129.393624,783912.86063,303882.331667,0.394586
3,144665,361,578738.383433,492234.3216,223158.056186,0.385594
4,17604,49,94524.537068,67703.233575,41211.656456,0.435989
5,71289,192,250263.798749,253255.651464,103673.38744,0.414256
6,50032,139,152871.293708,157765.274337,56325.453745,0.36845
7,95078,244,303835.774423,331639.184538,132037.53909,0.434569
8,49404,126,141054.904918,178154.85012,72363.569077,0.513017
9,254514,625,882159.035007,872352.242932,393287.298063,0.445824


In [6]:
# Get the number of distinct regions
N_regions = df['Region'].nunique()

print("Number of regions: {}".format(N_regions))

Number of regions: 11


In [7]:
# Average the per-region errors over all regions
Error = df_sums_by_region['E_r'].sum() / N_regions

print("BIP error: {}".format(Error))

BIP error: 0.4362029915647343
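
Putting the steps above together, the quantity computed in this notebook can be written as follows. The symbols are introduced here only for illustration: $R$ is the set of regions, $a_i$ and $p_i$ are the actual and predicted `NumberOfSales` of row $i$, and the sums over $i \in r$ run over the March and April rows of region $r$.

$$
E_r = \frac{\sum_{i \in r} \lvert a_i - p_i \rvert}{\sum_{i \in r} a_i},
\qquad
\text{Error} = \frac{1}{\lvert R \rvert} \sum_{r \in R} E_r
$$

For example, for region 0 in the table above, $E_0 = 135487.271415 / 308413.154726 \approx 0.439304$, which matches the `E_r` column.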


### Testing the implemented BIP error function

Check that the `get_BIP_error` function returns the same value as the step-by-step computation above.

In [8]:
from BIP_error import get_BIP_error

df_BIP_err_fun_eval.head()

Unnamed: 0,StoreID,Date,IsHoliday,IsOpen,HasPromotions,StoreType,AssortmentType,NearestCompetitor,Region,NumberOfCustomers,...,StoreType_ShoppingCenter,AssortmentType_General,AssortmentType_WithNFDept,AssortmentType_WithFishDept,Events_Fog,Events_Hail,Events_Rain,Events_Snow,Events_Thunderstorm,_NumberOfSales
248997,1355,18/07/2016,0,1,1,Hyper Market,General,209,3,1600.0,...,0,1,0,0,0,0,0,0,0,6588.029148
280156,1400,28/07/2016,0,1,0,Hyper Market,General,2620,3,174.0,...,0,1,0,0,0,0,1,0,0,4985.0
296069,1424,08/04/2016,0,1,0,Hyper Market,General,823,3,276.0,...,0,1,0,0,0,0,1,1,0,4455.0
291041,1416,15/06/2016,0,1,0,Hyper Market,With Non-Food Department,370,3,529.0,...,0,0,1,0,0,0,0,0,0,7175.0
196625,1280,21/10/2016,0,1,0,Shopping Center,General,589,3,371.0,...,1,1,0,0,0,0,1,0,0,4902.0


In [9]:
# Run the function on the full dataframe saved earlier
error = get_BIP_error(df_BIP_err_fun_eval)

Number of regions identified: 11
BIP total error: 0.4362029915647343
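
For comparison, the steps of this notebook can be collected into a single function. This is only a sketch under the assumptions made above: `bip_error_sketch` is a hypothetical name, not the actual `get_BIP_error` implementation (whose source is not shown here), and the default months `(3, 4)` mirror the filter used in cell 3.

```python
import pandas as pd


def bip_error_sketch(df, months=(3, 4)):
    """Hypothetical re-implementation of the notebook steps (not get_BIP_error itself)."""
    # Keep only the requested months and the columns that are needed.
    data = df.loc[df['D_Month'].isin(list(months)),
                  ['Region', 'NumberOfSales', '_NumberOfSales']].copy()
    # Absolute error of each row.
    data['abs_diff'] = (data['NumberOfSales'] - data['_NumberOfSales']).abs()
    # Sum actual sales and absolute errors within each region.
    sums = data.groupby('Region')[['NumberOfSales', 'abs_diff']].sum()
    # Per-region relative error, then the plain average over the regions.
    e_r = sums['abs_diff'] / sums['NumberOfSales']
    return e_r.mean()


# Toy data: the July row (D_Month == 7) is dropped by the month filter.
toy = pd.DataFrame({
    'D_Month': [3, 4, 3, 7],
    'Region': [0, 0, 1, 1],
    'NumberOfSales': [10.0, 30.0, 100.0, 50.0],
    '_NumberOfSales': [12.0, 26.0, 90.0, 0.0],
})
# Region 0: (2 + 4) / 40 = 0.15, region 1: 10 / 100 = 0.10
toy_error = bip_error_sketch(toy)
```

Selecting only `NumberOfSales` and `abs_diff` before summing avoids adding up identifiers such as `StoreID`. A region whose actual sales sum to zero would lead to a division by zero; how `get_BIP_error` treats that case is not shown in this notebook.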
