# BIP's error function

The error function expects a `pandas.DataFrame` with the following structure as input:

| StoreID | Month | Region | NumberOfSales | _NumberOfSales |
| ------- | ----- | ------ | ------------- | -------------- |
| 1000    | 3     | 4      | 16            | 16             |
| 1000    | 4     | 4      | 30            | 23             |
| 1001    | 3     | 6      | 410           | 411            |
| 1001    | 4     | 27     | 3130          | 3120           |
| 1002    | 3     | 58     | 10            | 8              |

Where:
 
 - *NumberOfSales* holds the **actual values** from the test set
 - *_NumberOfSales* holds the **predicted values**
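
As a minimal sketch, the table above can be built directly as a DataFrame. The variable names `example`, `required` and `missing` are illustrative, and the column is called `Month` here to match the table, whereas the notebook cells below use `D_Month`.

```python
import pandas as pd

# Toy input mirroring the table above; the values are illustrative only.
example = pd.DataFrame({
    "StoreID": [1000, 1000, 1001, 1001, 1002],
    "Month": [3, 4, 3, 4, 3],
    "Region": [4, 4, 6, 27, 58],
    "NumberOfSales": [16, 30, 410, 3130, 10],   # actual values
    "_NumberOfSales": [16, 23, 411, 3120, 8],   # predicted values
})

# Quick structural check: every expected column must be present.
required = {"StoreID", "Month", "Region", "NumberOfSales", "_NumberOfSales"}
missing = required - set(example.columns)
print("Missing columns:", sorted(missing))
```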
 
 

Start from a sample of the preprocessed data, used here as a stand-in test set, to simulate a predicted dataset.

In [1]:
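# import_man is a project-local helper module; pandas is imported explicitly as well
from import_man import *
import pandas as pd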

df = pd.read_csv('./dataset/preprocessed_train.csv')

print("Shape before: " + str(df.shape))

# Work on a reduced random sample of the data
df = df.sample(n=5000)

print("Shape after: " + str(df.shape))

Shape before: (523021, 51)
Shape after: (5000, 51)


In [2]:
error_evaluation_columns = ['StoreID', 'D_Month', 'Region', 'NumberOfSales', '_NumberOfSales']

# Create fake predicted sales: one random draw (with replacement) from the actual sales per row
df['_NumberOfSales'] = df['NumberOfSales'].sample(n=len(df), replace=True).values

# Keep only the columns needed for the error evaluation
df = df[error_evaluation_columns]

df.head(20)

Unnamed: 0,StoreID,D_Month,Region,NumberOfSales,_NumberOfSales
92559,1133,3,2,3861.0,3628.961857
383258,1548,1,3,2363.0,3980.0
384378,1550,2,0,5582.0,2690.0
252679,1360,2,2,2478.0,2037.0
434992,1623,7,9,4499.0,6491.0
462957,1663,10,9,3167.094527,6256.0
159955,1229,3,9,6209.0,3087.919283
18497,1025,12,6,7231.0,2883.0
60192,1086,5,10,5726.75453,5280.0
119466,1171,1,6,2731.0,2675.0
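
The fake predictions above change on every run because the sampling is unseeded. The following is a minimal sketch of a reproducible variant, assuming a fixed `random_state` is acceptable for this experiment. The seed value `0`, the helper name `fake_predictions` and the five sample values are arbitrary choices made for illustration.

```python
import pandas as pd

# A handful of actual sales values, used only to illustrate the call.
actual = pd.Series([3861.0, 2363.0, 5582.0, 2478.0, 4499.0], name="NumberOfSales")


def fake_predictions(values, seed=0):
    """Draw one value per row, with replacement, using a fixed seed."""
    return values.sample(n=len(values), replace=True, random_state=seed).to_numpy()


fake = pd.DataFrame({"NumberOfSales": actual})
fake["_NumberOfSales"] = fake_predictions(actual)
print(fake)
```

Calling `fake_predictions(actual)` twice returns the same array, so the error computed downstream stays comparable between runs.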


In [3]:
# Keep only March and April; copy so that adding a column does not touch a view of the original frame
df = df.loc[df['D_Month'].isin([3, 4])].copy()

# Compute the absolute difference between actual and predicted NumberOfSales
df['abs_diff'] = (df['NumberOfSales'] - df['_NumberOfSales']).abs()

df.head(20)

Unnamed: 0,StoreID,D_Month,Region,NumberOfSales,_NumberOfSales,abs_diff
92559,1133,3,2,3861.0,3628.961857,232.038143
159955,1229,3,9,6209.0,3087.919283,3121.080717
127181,1183,4,9,2735.268657,3353.0,617.731343
23559,1033,4,3,7667.0,5094.0,2573.0
175088,1250,3,5,5734.0,4194.755656,1539.244344
378919,1543,3,3,3083.0,5247.0,2164.0
428843,1615,3,10,3160.0,4707.0,1547.0
217013,1310,4,9,5906.613599,4702.810945,1203.802653
180201,1257,3,7,2571.0,7496.0,4925.0
222497,1317,4,9,4019.0,3922.0,97.0


In [4]:
# Sum within each region (the StoreID and D_Month totals are meaningless and are not used later)
df_sums_by_region = df.groupby(['Region']).sum()
df_sums_by_region.head(20)

Region,StoreID,D_Month,NumberOfSales,_NumberOfSales,abs_diff
0,73781,186,243764.298507,240156.286789,106791.530323
1,43100,121,149539.635418,149361.080904,67649.163674
2,215931,559,717273.187383,768973.349148,353118.716426
3,151943,388,627183.075325,528331.723264,279656.470149
4,21398,55,82193.523888,75674.53403,30791.205357
5,78509,212,282933.313433,250446.780378,140012.425613
6,59172,154,165939.329451,210777.09242,88040.736409
7,107638,272,388855.153027,416757.219772,191063.889825
8,45532,115,126255.649528,179901.613024,76280.427332
9,287545,699,965366.780866,955046.755658,423567.267261
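
The grouped sum above also adds up `StoreID` and `D_Month`, which carry no meaning as totals. Below is a sketch of a variant that sums only the columns the metric uses. The frame `toy` holds hypothetical rows and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "StoreID": [1000, 1000, 1001, 1001, 1002],
    "Region": [4, 4, 6, 27, 58],
    "NumberOfSales": [16, 30, 410, 3130, 10],
    "_NumberOfSales": [16, 23, 411, 3120, 8],
})
toy["abs_diff"] = (toy["NumberOfSales"] - toy["_NumberOfSales"]).abs()

# Select the metric columns before summing, so StoreID is left out of the result.
sums = toy.groupby("Region")[["NumberOfSales", "abs_diff"]].sum()
print(sums)
```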


In [5]:
# Divide each region's summed absolute difference by its summed actual NumberOfSales
df_sums_by_region['E_r'] = df_sums_by_region['abs_diff'] / df_sums_by_region['NumberOfSales']

df_sums_by_region.head(20)

Region,StoreID,D_Month,NumberOfSales,_NumberOfSales,abs_diff,E_r
0,73781,186,243764.298507,240156.286789,106791.530323,0.438093
1,43100,121,149539.635418,149361.080904,67649.163674,0.452383
2,215931,559,717273.187383,768973.349148,353118.716426,0.492307
3,151943,388,627183.075325,528331.723264,279656.470149,0.445893
4,21398,55,82193.523888,75674.53403,30791.205357,0.374618
5,78509,212,282933.313433,250446.780378,140012.425613,0.49486
6,59172,154,165939.329451,210777.09242,88040.736409,0.53056
7,107638,272,388855.153027,416757.219772,191063.889825,0.49135
8,45532,115,126255.649528,179901.613024,76280.427332,0.604174
9,287545,699,965366.780866,955046.755658,423567.267261,0.438763
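
Written out, the per-region error computed in this cell is the following, where $a_i$ and $p_i$ are notation introduced here for the actual and predicted sales of row $i$ belonging to region $r$:

$$
E_r = \frac{\sum_{i \in r} \lvert a_i - p_i \rvert}{\sum_{i \in r} a_i}
$$

For region 0 in the output above, this gives $E_0 = 106791.530323 / 243764.298507 \approx 0.4381$.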


In [6]:
# Get the number of regions
N_regions = len(df.Region.unique())

print("Number of regions: {}".format(N_regions))

Number of regions: 11


In [8]:
Error = df_sums_by_region['E_r'].sum() / N_regions

print("BIP error: {}".format(Error))

BIP error: 0.46978604713897104
