# BIP's error function

The error function expects as input a pandas.DataFrame with the following structure:

| StoreID | Month | Region | NumberOfSales | _NumberOfSales |
| ------- | ----- | ------ | ------------- | -------------- |
| 1000    | 3     | 4      | 16            | 16             |
| 1000    | 4     | 4      | 30            | 23             |
| 1001    | 3     | 6      | 410           | 411            |
| 1001    | 4     | 27     | 3130          | 3120           |
| 1002    | 3     | 58     | 10            | 8              |

Where:
 
 - *NumberOfSales* are the **test actual values**
 - *_NumberOfSales* are **the predicted values**
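
As a rough illustration of how such a dataframe is consumed, the sketch below rebuilds the example table and computes a region-averaged relative error on it. The function name `bip_error_sketch` and its `month_col` parameter are hypothetical, introduced only for this example; it mirrors the step-by-step computation of this notebook and is not the project's `get_BIP_error`.

```python
import pandas as pd

def bip_error_sketch(df, month_col="Month"):
    """Hypothetical sketch of the region-averaged error; not the project's get_BIP_error."""
    keys = [month_col, "StoreID", "Region"]
    # Total actual and predicted sales per (month, store, region).
    monthly = df.groupby(keys)[["NumberOfSales", "_NumberOfSales"]].sum()
    # Absolute error of each monthly total.
    monthly["abs_diff"] = (monthly["NumberOfSales"] - monthly["_NumberOfSales"]).abs()
    # Aggregate per region, then normalise by the region's actual sales.
    by_region = monthly.groupby(level="Region")[["NumberOfSales", "abs_diff"]].sum()
    region_error = by_region["abs_diff"] / by_region["NumberOfSales"]
    # Average the per-region errors.
    return region_error.mean()

# The example table from above.
example = pd.DataFrame({
    "StoreID": [1000, 1000, 1001, 1001, 1002],
    "Month": [3, 4, 3, 4, 3],
    "Region": [4, 4, 6, 27, 58],
    "NumberOfSales": [16, 30, 410, 3130, 10],
    "_NumberOfSales": [16, 23, 411, 3120, 8],
})

example_error = bip_error_sketch(example)
print(f"Error on the example table: {example_error:.4f}")
```

The sketch assumes every region has a non-zero total of actual sales; otherwise the division would produce infinite or missing values.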
 
 

Start from a random sample of the preprocessed dataset and attach fake predictions to it, in order to simulate a predicted dataset.

In [1]:
# import_man is a project module; it is assumed to expose pandas as pd and numpy as np
from import_man import *

df = pd.read_csv('./dataset/preprocessed_train.csv')

print("Shape before: " + str(df.shape))

# Work on a random sample of 5000 rows to keep the example small
df = df.sample(n=5000)

print("Shape after: " + str(df.shape))

Shape before: (523021, 51)
Shape after: (5000, 51)


In [2]:
error_evaluation_columns = ['StoreID', 'D_Month', 'Region', 'NumberOfSales', '_NumberOfSales']

# Create fake predicted sales 
df['_NumberOfSales'] = df.NumberOfSales.apply(lambda x: df['NumberOfSales'].sample().values[0])


# Keep an independent copy of the full dataframe to pass to the function under evaluation
df_BIP_err_fun_eval = df.copy()


# Keep only the columns required to compute the error.
# This implicitly checks that they are all present: a missing column raises a KeyError.
df = df[error_evaluation_columns]

df.head(20)

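|        | StoreID | D_Month | Region | NumberOfSales | _NumberOfSales |
| ------ | ------- | ------- | ------ | ------------- | -------------- |
| 211030 | 1301    | 5       | 2      | 5045.0        | 1426.0         |
| 371473 | 1532    | 3       | 9      | 3245.0        | 4865.0         |
| 201622 | 1287    | 1       | 2      | 6701.0        | 4032.224215    |
| 124833 | 1179    | 4       | 2      | 4919.0        | 5906.613599    |
| 307549 | 1440    | 4       | 6      | 4389.0        | 6456.0         |
| 6151   | 1008    | 7       | 9      | 5107.0        | 4183.0         |
| 505075 | 1723    | 12      | 9      | 5080.0        | 3469.0         |
| 36236  | 1051    | 2       | 2      | 4616.0        | 5189.0         |
| 208360 | 1297    | 1       | 9      | 6031.0        | 5095.0         |
| 33447  | 1047    | 6       | 9      | 5695.0        | 4639.0         |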
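
The fake predictions in cell [2] are drawn one row at a time, which calls `sample()` once per row. A possible vectorized alternative is sketched below on a tiny stand-in dataframe (its values are borrowed from the first rows shown above). The names `rng` and `toy` and the seed are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Seed chosen only to make the sketch reproducible.
rng = np.random.default_rng(seed=0)

# Tiny stand-in for the sampled dataset.
toy = pd.DataFrame({"NumberOfSales": [5045.0, 3245.0, 6701.0, 4919.0, 4389.0]})

# Draw one actual value, with replacement, for every row in a single call.
toy["_NumberOfSales"] = rng.choice(
    toy["NumberOfSales"].to_numpy(), size=len(toy), replace=True
)
```

Like the original `apply`, this assigns to each row a value picked at random from the actual sales, so the fake predictions share the distribution of the actual values.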


In [3]:
# To keep only March and April: df = df.loc[df['D_Month'].isin([3, 4])]
# Here all the months provided are evaluated instead.

# Sum the sales of each (month, store, region) group
df_sum_by_month = df.groupby(['D_Month', 'StoreID', 'Region']).sum()
df_sum_by_month.head()

| D_Month | StoreID | Region | NumberOfSales | _NumberOfSales |
| ------- | ------- | ------ | ------------- | -------------- |
| 1       | 1000    | 7      | 7675.446488   | 5468.0         |
| 1       | 1002    | 3      | 5892.0        | 6886.0         |
| 1       | 1003    | 7      | 6444.0        | 3583.955075    |
| 1       | 1006    | 10     | 13113.342669  | 11211.054366   |
| 1       | 1012    | 4      | 20747.0       | 15604.0        |


In [4]:
# Absolute difference between the actual and predicted monthly NumberOfSales
df_sum_by_month['abs_diff'] = (df_sum_by_month['NumberOfSales'] - df_sum_by_month['_NumberOfSales']).abs()
df_sum_by_month.head()

| D_Month | StoreID | Region | NumberOfSales | _NumberOfSales | abs_diff    |
| ------- | ------- | ------ | ------------- | -------------- | ----------- |
| 1       | 1000    | 7      | 7675.446488   | 5468.0         | 2207.446488 |
| 1       | 1002    | 3      | 5892.0        | 6886.0         | 994.0       |
| 1       | 1003    | 7      | 6444.0        | 3583.955075    | 2860.044925 |
| 1       | 1006    | 10     | 13113.342669  | 11211.054366   | 1902.288303 |
| 1       | 1012    | 4      | 20747.0       | 15604.0        | 5143.0      |


In [5]:
# Let's sum over the region
df_sums_by_region = df_sum_by_month.groupby(['Region']).sum()
df_sums_by_region.head(20)

| Region | NumberOfSales | _NumberOfSales | abs_diff  |
| ------ | ------------- | -------------- | --------- |
| 0      | 1839191.0     | 1803487.0      | 731029.4  |
| 1      | 791324.9      | 808367.6       | 399638.0  |
| 2      | 3089159.0     | 3198266.0      | 1247008.0 |
| 3      | 3725963.0     | 3250695.0      | 1632221.0 |
| 4      | 657082.8      | 592386.6       | 280135.5  |
| 5      | 1595216.0     | 1774465.0      | 782289.4  |
| 6      | 1074973.0     | 1229042.0      | 471728.8  |
| 7      | 2100843.0     | 2248096.0      | 905940.7  |
| 8      | 803078.0      | 919116.4       | 325726.8  |
| 9      | 5997551.0     | 5918549.0      | 2410174.0 |


In [6]:
# Divide each region's total absolute difference by its total actual NumberOfSales
df_sums_by_region['E_r'] = df_sums_by_region['abs_diff'] / df_sums_by_region['NumberOfSales']

df_sums_by_region.head(20)

| Region | NumberOfSales | _NumberOfSales | abs_diff  | E_r      |
| ------ | ------------- | -------------- | --------- | -------- |
| 0      | 1839191.0     | 1803487.0      | 731029.4  | 0.397473 |
| 1      | 791324.9      | 808367.6       | 399638.0  | 0.505024 |
| 2      | 3089159.0     | 3198266.0      | 1247008.0 | 0.403672 |
| 3      | 3725963.0     | 3250695.0      | 1632221.0 | 0.438067 |
| 4      | 657082.8      | 592386.6       | 280135.5  | 0.426332 |
| 5      | 1595216.0     | 1774465.0      | 782289.4  | 0.490397 |
| 6      | 1074973.0     | 1229042.0      | 471728.8  | 0.438828 |
| 7      | 2100843.0     | 2248096.0      | 905940.7  | 0.431227 |
| 8      | 803078.0      | 919116.4       | 325726.8  | 0.405598 |
| 9      | 5997551.0     | 5918549.0      | 2410174.0 | 0.40186  |


In [7]:
# Get the number of regions
N_regions = len(df.Region.unique())

print("Number of regions: {}".format(N_regions))

Number of regions: 11


In [8]:
step_by_step_error = df_sums_by_region['E_r'].sum() / N_regions

print("BIP error: {}".format(step_by_step_error))

BIP error: 0.43009487751872366
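
Cells [3] to [8] can be summarised in a single formula. The notation is introduced here for illustration and does not come from the original notebook: $A_{s,m}$ and $P_{s,m}$ are the actual and predicted sales of store $s$ summed over month $m$, $S_r$ is the set of stores in region $r$, and $R$ is the number of regions.

```latex
E_r = \frac{\sum_{s \in S_r} \sum_{m} \left| A_{s,m} - P_{s,m} \right|}
           {\sum_{s \in S_r} \sum_{m} A_{s,m}},
\qquad
E = \frac{1}{R} \sum_{r=1}^{R} E_r
```

The absolute value is taken on monthly totals, so over- and under-predictions within the same store and month can cancel out before they contribute to the error.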


### Test of the implemented BIP error function

Check that the get_BIP_error function returns the same value as the step-by-step computation above.

In [9]:
from BIP_error import get_BIP_error

df_BIP_err_fun_eval[error_evaluation_columns].head()

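|        | StoreID | D_Month | Region | NumberOfSales | _NumberOfSales |
| ------ | ------- | ------- | ------ | ------------- | -------------- |
| 211030 | 1301    | 5       | 2      | 5045.0        | 1426.0         |
| 371473 | 1532    | 3       | 9      | 3245.0        | 4865.0         |
| 201622 | 1287    | 1       | 2      | 6701.0        | 4032.224215    |
| 124833 | 1179    | 4       | 2      | 4919.0        | 5906.613599    |
| 307549 | 1440    | 4       | 6      | 4389.0        | 6456.0         |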


In [10]:
# use the function
error_from_BIP = get_BIP_error(df_BIP_err_fun_eval)

# let's assert that the two errors are equal
np.testing.assert_almost_equal(step_by_step_error, error_from_BIP, decimal=10)

Number of regions identified: 11
BIP total error: 0.43009487751872366


### Test of the BIP error function: case of predictions = 0

Let's test how the function behaves if the predictions are all equal to 0.

In [11]:
# Work on a copy so that the original fake predictions are not overwritten
test_zero = df_BIP_err_fun_eval.copy()
test_zero['_NumberOfSales'] = 0
test_zero[error_evaluation_columns].head()

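|        | StoreID | D_Month | Region | NumberOfSales | _NumberOfSales |
| ------ | ------- | ------- | ------ | ------------- | -------------- |
| 211030 | 1301    | 5       | 2      | 5045.0        | 0              |
| 371473 | 1532    | 3       | 9      | 3245.0        | 0              |
| 201622 | 1287    | 1       | 2      | 6701.0        | 0              |
| 124833 | 1179    | 4       | 2      | 4919.0        | 0              |
| 307549 | 1440    | 4       | 6      | 4389.0        | 0              |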


In [12]:
# What is the error in this case? Expected: 1.0
error_zero = get_BIP_error(test_zero)

Number of regions identified: 11
BIP total error: 1.0
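
The value 1.0 follows from the per-region ratio used in the step-by-step computation. Writing $A_{s,m}$ and $P_{s,m}$ for the actual and predicted monthly sales of store $s$ in region $r$, and $R$ for the number of regions (notation introduced here for illustration), setting every prediction to zero gives the following, assuming actual sales are non-negative and not all zero in any region:

```latex
E_r = \frac{\sum_{s,m} \left| A_{s,m} - 0 \right|}{\sum_{s,m} A_{s,m}} = 1
\quad\Longrightarrow\quad
E = \frac{1}{R} \sum_{r=1}^{R} 1 = 1
```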


### Test of the BIP error function: case of predictions = actual values + n

Let's test what happens if every prediction equals the actual number of sales incremented by a constant *n*.

In [13]:
n = 100
# Work on a copy so that the shared dataframe is not modified
test_n = df_BIP_err_fun_eval.copy()
test_n['_NumberOfSales'] = test_n['NumberOfSales'] + n
test_n[error_evaluation_columns].head()

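|        | StoreID | D_Month | Region | NumberOfSales | _NumberOfSales |
| ------ | ------- | ------- | ------ | ------------- | -------------- |
| 211030 | 1301    | 5       | 2      | 5045.0        | 5145.0         |
| 371473 | 1532    | 3       | 9      | 3245.0        | 3345.0         |
| 201622 | 1287    | 1       | 2      | 6701.0        | 6801.0         |
| 124833 | 1179    | 4       | 2      | 4919.0        | 5019.0         |
| 307549 | 1440    | 4       | 6      | 4389.0        | 4489.0         |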


In [14]:
# What is the error in this case?
error_n = get_BIP_error(test_n)

Number of regions identified: 11
BIP total error: 0.02083719471050353
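
With a constant positive offset, every row adds exactly *n* to the absolute difference of its monthly group, so each region's error reduces to *n* times its number of rows divided by its actual sales. The sketch below checks this on a tiny stand-in dataframe. The helper `region_averaged_error` is hypothetical: it mirrors the step-by-step computation of this notebook and is not the project's `get_BIP_error`.

```python
import pandas as pd

def region_averaged_error(df):
    """Hypothetical stand-in mirroring the step-by-step computation."""
    keys = ["D_Month", "StoreID", "Region"]
    monthly = df.groupby(keys)[["NumberOfSales", "_NumberOfSales"]].sum()
    monthly["abs_diff"] = (monthly["NumberOfSales"] - monthly["_NumberOfSales"]).abs()
    by_region = monthly.groupby(level="Region").sum()
    return (by_region["abs_diff"] / by_region["NumberOfSales"]).mean()

n = 100
# Tiny stand-in dataset; predictions are the actual values shifted by n.
toy = pd.DataFrame({
    "StoreID": [1301, 1532, 1287],
    "D_Month": [5, 3, 1],
    "Region": [2, 9, 2],
    "NumberOfSales": [5045.0, 3245.0, 6701.0],
})
toy["_NumberOfSales"] = toy["NumberOfSales"] + n

offset_error = region_averaged_error(toy)

# Closed form: n times the number of rows of each region, over the region's actual sales.
rows_per_region = toy.groupby("Region")["NumberOfSales"].count()
sales_per_region = toy.groupby("Region")["NumberOfSales"].sum()
closed_form = (n * rows_per_region / sales_per_region).mean()

print(offset_error, closed_form)
```

Under this reading, the error for a fixed offset shrinks as the typical number of sales per row grows, which is why an offset of 100 on sales in the thousands yields an error of only a few percent.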
