# BIP's error function

The error function expects as input a pandas.DataFrame with the following structure:

| StoreID | Month | Region | NumberOfSales | _NumberOfSales |
| ------- | ----- | ------ | ------------- | -------------- |
| 1000    | 3     | 4      | 16            | 16             |
| 1000    | 4     | 4      | 30            | 23             |
| 1001    | 3     | 6      | 410           | 411            |
| 1001    | 4     | 27     | 3130          | 3120           |
| 1002    | 3     | 58     | 10            | 8              |

Where:
 
 - *NumberOfSales* holds the **actual values** of the test set
 - *_NumberOfSales* holds the **predicted values**
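
As a minimal sketch, the sample table above can be built and checked as follows. The `REQUIRED_COLUMNS` list and the `example` variable are illustrative names, not part of the project; note that the notebook cells below use `D_Month` instead of `Month` and additionally select `IsOpen`.

```python
import pandas as pd

# Column names taken from the table above; this list is an illustration,
# not the authoritative contract of the error function.
REQUIRED_COLUMNS = ["StoreID", "Month", "Region", "NumberOfSales", "_NumberOfSales"]

example = pd.DataFrame(
    {
        "StoreID": [1000, 1000, 1001, 1001, 1002],
        "Month": [3, 4, 3, 4, 3],
        "Region": [4, 4, 6, 27, 58],
        "NumberOfSales": [16, 30, 410, 3130, 10],    # actual values
        "_NumberOfSales": [16, 23, 411, 3120, 8],    # predicted values
    }
)

# Fail early if a column needed for the error computation is absent.
missing = [c for c in REQUIRED_COLUMNS if c not in example.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

print(example)
```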
 
 

Start from a sample of the preprocessed dataset, used here as a test set, to simulate a predicted dataset.

In [1]:
from import_man import *
import numpy as np  # explicit imports for the names used below
import pandas as pd

df = pd.read_csv('./dataset/preprocessed_train.csv')

print("Shape before: " + str(df.shape))

# Work on a random sample of 5000 rows to keep the example small
df = df.sample(n=5000)

print("Shape after: " + str(df.shape))

Shape before: (523021, 51)
Shape after: (5000, 51)


In [2]:
error_evaluation_columns = ['StoreID', 'D_Month', 'Region', 'NumberOfSales', '_NumberOfSales', 'IsOpen']

# Create fake predictions: for each row, draw a random value from the actual sales
df['_NumberOfSales'] = df['NumberOfSales'].sample(n=len(df), replace=True).to_numpy()


# Keep a real copy of the full dataframe to pass to the function under evaluation
df_BIP_err_fun_eval = df.copy()


# Remove useless columns and select all those required.
# Implicit check that all the required attributes to compute the error are present.
df = df[error_evaluation_columns]

df.head(20)

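|        | StoreID | D_Month | Region | NumberOfSales | _NumberOfSales | IsOpen |
| ------ | ------- | ------- | ------ | ------------- | -------------- | ------ |
| 200888 | 1286    | 7       | 3      | 12005.0       | 6442.991763    | 1      |
| 90501  | 1130    | 1       | 9      | 3409.0        | 3732.0         | 1      |
| 197336 | 1281    | 10      | 9      | 4771.0        | 3426.0         | 1      |
| 32764  | 1046    | 8       | 8      | 7140.0        | 3324.522241    | 1      |
| 59942  | 1085    | 9       | 4      | 3952.0        | 3035.041322    | 1      |
| 17104  | 1023    | 2       | 10     | 6088.0        | 5553.0         | 1      |
| 186439 | 1266    | 5       | 0      | 6220.0        | 7276.0         | 1      |
| 393856 | 1563    | 8       | 3      | 4584.0        | 3594.0         | 1      |
| 458458 | 1657    | 12      | 3      | 3671.0        | 3647.453048    | 1      |
| 273167 | 1390    | 5       | 3      | 2049.449753   | 5016.0         | 0      |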


In [3]:
# let's consider only rows for which the store is open
df = df[df.IsOpen == 1]
df.head(20)

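|        | StoreID | D_Month | Region | NumberOfSales | _NumberOfSales | IsOpen |
| ------ | ------- | ------- | ------ | ------------- | -------------- | ------ |
| 200888 | 1286    | 7       | 3      | 12005.0       | 6442.991763    | 1      |
| 90501  | 1130    | 1       | 9      | 3409.0        | 3732.0         | 1      |
| 197336 | 1281    | 10      | 9      | 4771.0        | 3426.0         | 1      |
| 32764  | 1046    | 8       | 8      | 7140.0        | 3324.522241    | 1      |
| 59942  | 1085    | 9       | 4      | 3952.0        | 3035.041322    | 1      |
| 17104  | 1023    | 2       | 10     | 6088.0        | 5553.0         | 1      |
| 186439 | 1266    | 5       | 0      | 6220.0        | 7276.0         | 1      |
| 393856 | 1563    | 8       | 3      | 4584.0        | 3594.0         | 1      |
| 458458 | 1657    | 12      | 3      | 3671.0        | 3647.453048    | 1      |
| 391001 | 1559    | 10      | 9      | 3986.0        | 10131.0        | 1      |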


In [4]:
# Evaluate every month provided instead of restricting to March and April:
# df = df.loc[df['D_Month'].isin([3, 4])]

# Sum everything, keeping month, store and region distinct
df_sum_by_month = df.groupby(['D_Month', 'StoreID', 'Region']).sum()
df_sum_by_month.head()

| D_Month | StoreID | Region | NumberOfSales | _NumberOfSales | IsOpen |
| ------- | ------- | ------ | ------------- | -------------- | ------ |
| 1       | 1000    | 7      | 6234.0        | 2126.0         | 1      |
| 1       | 1001    | 0      | 4500.0        | 3767.0         | 1      |
| 1       | 1006    | 10     | 9056.0        | 1186.0         | 1      |
| 1       | 1011    | 9      | 2039.0        | 4592.0         | 1      |
| 1       | 1013    | 3      | 6158.0        | 10662.0        | 1      |


In [5]:
# Compute the absolute difference between actual and predicted NumberOfSales
df_sum_by_month['abs_diff'] = (df_sum_by_month['NumberOfSales'] - df_sum_by_month['_NumberOfSales']).abs()
df_sum_by_month.head()

| D_Month | StoreID | Region | NumberOfSales | _NumberOfSales | IsOpen | abs_diff |
| ------- | ------- | ------ | ------------- | -------------- | ------ | -------- |
| 1       | 1000    | 7      | 6234.0        | 2126.0         | 1      | 4108.0   |
| 1       | 1001    | 0      | 4500.0        | 3767.0         | 1      | 733.0    |
| 1       | 1006    | 10     | 9056.0        | 1186.0         | 1      | 7870.0   |
| 1       | 1011    | 9      | 2039.0        | 4592.0         | 1      | 2553.0   |
| 1       | 1013    | 3      | 6158.0        | 10662.0        | 1      | 4504.0   |


In [6]:
# Let's sum over the region
df_sums_by_region = df_sum_by_month.groupby(['Region']).sum()
df_sums_by_region.head(20)

| Region | NumberOfSales | _NumberOfSales | IsOpen | abs_diff  |
| ------ | ------------- | -------------- | ------ | --------- |
| 0      | 1563475.0     | 1551865.0      | 328    | 706157.1  |
| 1      | 584019.0      | 596711.6       | 131    | 252793.0  |
| 2      | 2371468.0     | 2445933.0      | 500    | 1070160.0 |
| 3      | 3244657.0     | 2735454.0      | 559    | 1399507.0 |
| 4      | 568231.0      | 523566.2       | 106    | 258392.8  |
| 5      | 1223809.0     | 1390985.0      | 281    | 553335.3  |
| 6      | 879348.0      | 1001526.0      | 205    | 376204.1  |
| 7      | 1736254.0     | 1827482.0      | 380    | 766357.9  |
| 8      | 694956.0      | 817106.5       | 173    | 295620.7  |
| 9      | 4807463.0     | 4676420.0      | 967    | 1994020.0 |


In [7]:
# Per region, divide the summed absolute difference by the sum of the actual NumberOfSales
df_sums_by_region['E_r'] = df_sums_by_region['abs_diff'] / df_sums_by_region['NumberOfSales']

df_sums_by_region.head(20)

| Region | NumberOfSales | _NumberOfSales | IsOpen | abs_diff  | E_r      |
| ------ | ------------- | -------------- | ------ | --------- | -------- |
| 0      | 1563475.0     | 1551865.0      | 328    | 706157.1  | 0.451659 |
| 1      | 584019.0      | 596711.6       | 131    | 252793.0  | 0.432851 |
| 2      | 2371468.0     | 2445933.0      | 500    | 1070160.0 | 0.451265 |
| 3      | 3244657.0     | 2735454.0      | 559    | 1399507.0 | 0.431327 |
| 4      | 568231.0      | 523566.2       | 106    | 258392.8  | 0.454732 |
| 5      | 1223809.0     | 1390985.0      | 281    | 553335.3  | 0.452142 |
| 6      | 879348.0      | 1001526.0      | 205    | 376204.1  | 0.427822 |
| 7      | 1736254.0     | 1827482.0      | 380    | 766357.9  | 0.441386 |
| 8      | 694956.0      | 817106.5       | 173    | 295620.7  | 0.42538  |
| 9      | 4807463.0     | 4676420.0      | 967    | 1994020.0 | 0.414776 |
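
Written as a formula, the ratio computed in the cell above is, for each region $r$, the following. The symbols are introduced here only for illustration: $A_{m,s}$ is the NumberOfSales of store $s$ summed over its open rows in month $m$, $P_{m,s}$ is the same sum for _NumberOfSales, and the sums run over the month-store pairs that belong to region $r$.

$$
E_r = \frac{\sum_{(m,s) \in r} \left| A_{m,s} - P_{m,s} \right|}{\sum_{(m,s) \in r} A_{m,s}}
$$

Because the absolute value is taken after the monthly aggregation, over- and under-predictions on individual rows of the same store and month can cancel each other out.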


In [8]:
# Get the number of regions
N_regions = len(df.Region.unique())

print("Number of regions: {}".format(N_regions))

Number of regions: 11


In [9]:
# The BIP error is the unweighted mean of the per-region errors
step_by_step_error = df_sums_by_region['E_r'].sum() / N_regions

print("BIP error: {}".format(step_by_step_error))

BIP error: 0.4369147618965984
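
The steps above can be collected into a single function. The following is a minimal sketch under the assumption that the input has the columns used in this notebook (`StoreID`, `D_Month`, `Region`, `NumberOfSales`, `_NumberOfSales`, `IsOpen`); the name `bip_error_sketch` and the toy data are hypothetical, and this is not the actual implementation of `get_BIP_error`.

```python
import pandas as pd


def bip_error_sketch(df: pd.DataFrame) -> float:
    """Mean over regions of sum(|actual - predicted|) / sum(actual)."""
    # Keep only the rows for which the store is open.
    open_rows = df[df["IsOpen"] == 1]

    # Aggregate actual and predicted sales per (month, store, region).
    grouped = open_rows.groupby(["D_Month", "StoreID", "Region"])[
        ["NumberOfSales", "_NumberOfSales"]
    ].sum()
    grouped["abs_diff"] = (
        grouped["NumberOfSales"] - grouped["_NumberOfSales"]
    ).abs()

    # Sum per region, then normalise by the region's actual sales.
    per_region = grouped.groupby("Region")[["NumberOfSales", "abs_diff"]].sum()
    region_error = per_region["abs_diff"] / per_region["NumberOfSales"]

    # Unweighted mean over the regions that appear in the data.
    return float(region_error.sum() / len(region_error))


# Toy data (made up for illustration): two regions, one closed row.
toy = pd.DataFrame(
    {
        "StoreID": [1, 1, 2, 3, 3],
        "D_Month": [3, 3, 3, 3, 4],
        "Region": [0, 0, 0, 1, 1],
        "NumberOfSales": [10.0, 20.0, 30.0, 50.0, 0.0],
        "_NumberOfSales": [12.0, 16.0, 33.0, 40.0, 99.0],
        "IsOpen": [1, 1, 1, 1, 0],
    }
)

# Region 0: (|30 - 28| + |30 - 33|) / 60 = 5/60; region 1: |50 - 40| / 50 = 0.2
print(round(bip_error_sketch(toy), 4))  # 0.1417
```

A region whose actual sales sum to zero would cause a division by zero; the sketch does not handle that case.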


### Test of the implemented BIP error function

Check that the get_BIP_error function returns the same value as the step-by-step computation above.

In [14]:
from BIP_error import get_BIP_error

df_BIP_err_fun_eval[error_evaluation_columns].head()

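|        | StoreID | D_Month | Region | NumberOfSales | _NumberOfSales | IsOpen |
| ------ | ------- | ------- | ------ | ------------- | -------------- | ------ |
| 200888 | 1286    | 7       | 3      | 12005.0       | 6442.991763    | 1      |
| 90501  | 1130    | 1       | 9      | 3409.0        | 3732.0         | 1      |
| 197336 | 1281    | 10      | 9      | 4771.0        | 3426.0         | 1      |
| 32764  | 1046    | 8       | 8      | 7140.0        | 3324.522241    | 1      |
| 59942  | 1085    | 9       | 4      | 3952.0        | 3035.041322    | 1      |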


In [15]:
# use the function
error_from_BIP = get_BIP_error(df_BIP_err_fun_eval)

# let's assert that the two errors are equal
np.testing.assert_almost_equal(step_by_step_error, error_from_BIP, decimal=10)

Number of regions identified: 11
BIP total error: 0.4369147618965984


### Test of the BIP error function: case of predictions = 0

Let's test how the function behaves if the predictions are all equal to 0.

In [16]:
# Work on a copy so that the original dataframe is not modified
test_zero = df_BIP_err_fun_eval.copy()
test_zero['_NumberOfSales'] = 0
test_zero[error_evaluation_columns].head()

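|        | StoreID | D_Month | Region | NumberOfSales | _NumberOfSales | IsOpen |
| ------ | ------- | ------- | ------ | ------------- | -------------- | ------ |
| 200888 | 1286    | 7       | 3      | 12005.0       | 0              | 1      |
| 90501  | 1130    | 1       | 9      | 3409.0        | 0              | 1      |
| 197336 | 1281    | 10      | 9      | 4771.0        | 0              | 1      |
| 32764  | 1046    | 8       | 8      | 7140.0        | 0              | 1      |
| 59942  | 1085    | 9       | 4      | 3952.0        | 0              | 1      |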


In [17]:
# What is the error in this case? Expected: 1.0
error_zero = get_BIP_error(test_zero)

Number of regions identified: 11
BIP total error: 1.0
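
A short derivation shows why 1.0 is expected, assuming the function follows the step-by-step computation in this notebook. The notation is illustrative: $A_{m,s} \ge 0$ is the actual NumberOfSales of store $s$ in month $m$, and the sums run over the month-store pairs of region $r$. With every prediction equal to 0:

$$
E_r = \frac{\sum_{(m,s) \in r} \left| A_{m,s} - 0 \right|}{\sum_{(m,s) \in r} A_{m,s}} = \frac{\sum_{(m,s) \in r} A_{m,s}}{\sum_{(m,s) \in r} A_{m,s}} = 1
$$

Every region contributes exactly 1 as long as its actual sales are non-negative and do not sum to zero, so the mean over the regions is also 1.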


### Test of the BIP error function: case of predictions = actual values + n

Let's test what happens if every prediction is the actual number of sales incremented by a constant *n*.

In [18]:
n = 100
# Work on a copy so that the original dataframe is not modified
test_n = df_BIP_err_fun_eval.copy()
test_n['_NumberOfSales'] = test_n['NumberOfSales'] + n
test_n[error_evaluation_columns].head()

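|        | StoreID | D_Month | Region | NumberOfSales | _NumberOfSales | IsOpen |
| ------ | ------- | ------- | ------ | ------------- | -------------- | ------ |
| 200888 | 1286    | 7       | 3      | 12005.0       | 12105.0        | 1      |
| 90501  | 1130    | 1       | 9      | 3409.0        | 3509.0         | 1      |
| 197336 | 1281    | 10      | 9      | 4771.0        | 4871.0         | 1      |
| 32764  | 1046    | 8       | 8      | 7140.0        | 7240.0         | 1      |
| 59942  | 1085    | 9       | 4      | 3952.0        | 4052.0         | 1      |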


In [19]:
# What is the error in this case?
error_n = get_BIP_error(test_n)

Number of regions identified: 11
BIP total error: 0.021345628278667313
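
With a constant offset, each row contributes exactly *n* to the absolute difference, so a region's error should reduce to *n* times its number of rows divided by its actual sales. The sketch below checks this on made-up toy data in which every store is assumed to be open (so no IsOpen filter is applied); all variable names are illustrative.

```python
import pandas as pd

n = 100

# Toy data: three rows in region 0, one row in region 1.
toy = pd.DataFrame(
    {
        "StoreID": [1, 1, 2, 3],
        "D_Month": [3, 3, 4, 3],
        "Region": [0, 0, 0, 1],
        "NumberOfSales": [4000.0, 6000.0, 5000.0, 8000.0],
    }
)
toy["_NumberOfSales"] = toy["NumberOfSales"] + n

# Step-by-step computation, as in the cells above.
grouped = toy.groupby(["D_Month", "StoreID", "Region"])[
    ["NumberOfSales", "_NumberOfSales"]
].sum()
grouped["abs_diff"] = (grouped["NumberOfSales"] - grouped["_NumberOfSales"]).abs()
per_region = grouped.groupby("Region")[["NumberOfSales", "abs_diff"]].sum()
step_by_step = per_region["abs_diff"] / per_region["NumberOfSales"]

# Closed form: n * (rows in the region) / (actual sales in the region).
rows = toy.groupby("Region").size()
actual = toy.groupby("Region")["NumberOfSales"].sum()
closed_form = n * rows / actual

print(step_by_step)
print(closed_form)
```

Applied to the per-region table of the step-by-step section, region 0 has 328 open rows and 1563475 actual sales, which gives 100 · 328 / 1563475 ≈ 0.021, in line with the total error reported above.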
