# BIP's error function

The error function expects as input a pandas.DataFrame with the following structure:

| StoreID | Month | Region | NumberOfSales | _NumberOfSales |
| ------- | ----- | ------ | ------------- | -------------- |
| 1000    | 3     | 4      | 16            | 16             |
| 1000    | 4     | 4      | 30            | 23             |
| 1001    | 3     | 6      | 410           | 411            |
| 1001    | 4     | 27     | 3130          | 3120           |
| 1002    | 3     | 58     | 10            | 8              |

Where:
 
 - *NumberOfSales* are the **test actual values**
 - *_NumberOfSales* are **the predicted values**
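
As a minimal sketch, the example table above can be built directly as a DataFrame. The column names are copied from the table (note that the notebook cells below use `D_Month` rather than `Month`), and the `required` set is an assumption based on the columns the manual computation below actually reads, not a documented contract of the function.

```python
import pandas as pd

# Toy rows copied from the table above
example_df = pd.DataFrame({
    "StoreID": [1000, 1000, 1001, 1001, 1002],
    "Month": [3, 4, 3, 4, 3],
    "Region": [4, 4, 6, 27, 58],
    "NumberOfSales": [16, 30, 410, 3130, 10],    # actual values
    "_NumberOfSales": [16, 23, 411, 3120, 8],    # predicted values
})

# Assumption: these are the columns the error computation needs
required = {"Region", "NumberOfSales", "_NumberOfSales"}
missing = required - set(example_df.columns)
if missing:
    raise ValueError("Missing columns: {}".format(sorted(missing)))

print(example_df)
```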
 
 

Start from the preprocessed data (used here as a test set) to simulate a predicted dataset.

In [1]:
from import_man import *

df = pd.read_csv('./dataset/preprocessed_train.csv')

print("Shape before: " + str(df.shape))

# Let's work on a reduced instance of the test set
df = df.sample(n=5000)

print("Shape after: " + str(df.shape))

Shape before: (523021, 51)
Shape after: (5000, 51)


In [2]:
error_evaluation_columns = ['D_Month', 'Region', 'NumberOfSales', '_NumberOfSales']

# Create fake predicted sales: draw one random actual value per row (with replacement)
df['_NumberOfSales'] = df['NumberOfSales'].sample(n=len(df), replace=True).to_numpy()


# Save a copy of the full dataframe to be passed to the function in order to evaluate it
df_BIP_err_fun_eval = df.copy()


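# Keep only the columns needed for the error evaluation
# (.copy() avoids a SettingWithCopyWarning when adding columns later)
df = df[error_evaluation_columns].copy()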

df.head(20)

Unnamed: 0,D_Month,Region,NumberOfSales,_NumberOfSales
58794,7,3,2973.0,5669.779242
79109,10,10,3356.0,6860.0
120732,7,9,4243.0,5582.0
426165,11,0,16128.0,5185.0
509998,6,9,6229.0,9481.0
505708,8,1,2043.0,4305.0
355044,2,0,3640.0,6189.714992
346851,2,9,5219.0,6224.0
77940,2,9,5590.0,6142.0
229733,3,3,7083.0,5609.647986


In [3]:
# Optionally keep only March and April; here all the provided months are evaluated
# df = df.loc[df['D_Month'].isin([3, 4])]

# Absolute difference between actual and predicted NumberOfSales
df['abs_diff'] = (df['NumberOfSales'] - df['_NumberOfSales']).abs()

df.head(20)

Unnamed: 0,D_Month,Region,NumberOfSales,_NumberOfSales,abs_diff
58794,7,3,2973.0,5669.779242,2696.779242
79109,10,10,3356.0,6860.0,3504.0
120732,7,9,4243.0,5582.0,1339.0
426165,11,0,16128.0,5185.0,10943.0
509998,6,9,6229.0,9481.0,3252.0
505708,8,1,2043.0,4305.0,2262.0
355044,2,0,3640.0,6189.714992,2549.714992
346851,2,9,5219.0,6224.0,1005.0
77940,2,9,5590.0,6142.0,552.0
229733,3,3,7083.0,5609.647986,1473.352014


In [4]:
# Let's sum over the region
df_sums_by_region = df.groupby(['Region']).sum()
df_sums_by_region.head(20)

Region,D_Month,NumberOfSales,_NumberOfSales,abs_diff
0,2522,1882642.0,1925984.0,918554.5
1,1005,697218.0,750946.8,354485.2
2,3722,3108599.0,3330030.0,1405515.0
3,4336,3870304.0,3068434.0,1759728.0
4,801,773924.4,568668.6,377765.8
5,2188,1422355.0,1656083.0,750652.1
6,1496,1011217.0,1154099.0,475636.0
7,2896,2110053.0,2168826.0,893646.9
8,1041,714780.6,809601.2,293261.2
9,7841,5860334.0,5773050.0,2616867.0


In [5]:
# Per-region error: sum of absolute differences divided by the sum of actual NumberOfSales
df_sums_by_region['E_r'] = df_sums_by_region['abs_diff'] / df_sums_by_region['NumberOfSales']

df_sums_by_region.head(20)

Region,D_Month,NumberOfSales,_NumberOfSales,abs_diff,E_r
0,2522,1882642.0,1925984.0,918554.5,0.487907
1,1005,697218.0,750946.8,354485.2,0.508428
2,3722,3108599.0,3330030.0,1405515.0,0.452138
3,4336,3870304.0,3068434.0,1759728.0,0.454674
4,801,773924.4,568668.6,377765.8,0.488117
5,2188,1422355.0,1656083.0,750652.1,0.527753
6,1496,1011217.0,1154099.0,475636.0,0.47036
7,2896,2110053.0,2168826.0,893646.9,0.423519
8,1041,714780.6,809601.2,293261.2,0.410281
9,7841,5860334.0,5773050.0,2616867.0,0.446539


In [6]:
# Get the number of distinct regions
N_regions = df['Region'].nunique()

print("Number of regions: {}".format(N_regions))

Number of regions: 11


In [7]:
Error = df_sums_by_region['E_r'].sum() / N_regions

print("BIP error: {}".format(Error))

BIP error: 0.4627666568632817
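
Read together, cells 3 to 7 appear to compute the quantity below. The notation is introduced here for illustration and does not come from the notebook: $y_i$ is the actual `NumberOfSales` of row $i$, $\hat{y}_i$ is the predicted `_NumberOfSales`, and $R$ is the number of distinct regions.

```latex
E_r = \frac{\sum_{i \in r} \lvert y_i - \hat{y}_i \rvert}{\sum_{i \in r} y_i},
\qquad
E = \frac{1}{R} \sum_{r=1}^{R} E_r
```

Each region contributes equally to $E$ regardless of its sales volume, because the normalization by actual sales happens inside each region before averaging.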


### Test of the implemented BIP error function

Check that the get_BIP_error function returns the same value as the manual computation above.

In [8]:
from BIP_error import get_BIP_error

df_BIP_err_fun_eval.head()

Unnamed: 0,StoreID,Date,IsHoliday,IsOpen,HasPromotions,StoreType,AssortmentType,NearestCompetitor,Region,NumberOfCustomers,...,StoreType_ShoppingCenter,AssortmentType_General,AssortmentType_WithNFDept,AssortmentType_WithFishDept,Events_Fog,Events_Hail,Events_Rain,Events_Snow,Events_Thunderstorm,_NumberOfSales
58794,1084,14/07/2016,0,1,0,Hyper Market,General,1627,3,172.0,...,0,1,0,0,0,0,1,0,0,5669.779242
79109,1113,19/10/2017,0,1,0,Hyper Market,With Non-Food Department,43854,10,236.0,...,0,0,1,0,0,0,1,0,0,6860.0
120732,1173,18/07/2017,0,1,1,Hyper Market,With Non-Food Department,150,9,339.0,...,0,0,1,0,1,0,1,0,0,5582.0
426165,1611,07/11/2016,0,1,1,Hyper Market,General,370,0,1054.0,...,0,1,0,0,0,0,1,0,0,5185.0
509998,1730,05/06/2017,0,1,1,Hyper Market,General,4227,9,356.0,...,0,1,0,0,0,0,0,0,0,9481.0


In [9]:
# use the function
error = get_BIP_error(df_BIP_err_fun_eval)

Number of regions identified: 11
BIP total error: 0.4627666568632817
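
The same check can be repeated on a tiny input where the result is easy to verify by hand. The sketch below is a hypothetical re-implementation of the manual steps in cells 3 to 7; the name `bip_error_sketch` is made up for illustration, and it is not the code inside `BIP_error`, whose implementation is not shown here.

```python
import pandas as pd


def bip_error_sketch(df: pd.DataFrame) -> float:
    """Hypothetical helper mirroring the manual steps; not the code in BIP_error."""
    work = df[["Region", "NumberOfSales", "_NumberOfSales"]].copy()
    # Absolute difference between actual and predicted sales, row by row
    work["abs_diff"] = (work["NumberOfSales"] - work["_NumberOfSales"]).abs()
    # Sum actual sales and absolute differences within each region
    per_region = work.groupby("Region")[["NumberOfSales", "abs_diff"]].sum()
    # Per-region relative error, then the average over the regions
    e_r = per_region["abs_diff"] / per_region["NumberOfSales"]
    return float(e_r.sum() / work["Region"].nunique())


# Toy rows taken from the table at the top of this document
toy = pd.DataFrame({
    "Region": [4, 4, 6, 27, 58],
    "NumberOfSales": [16, 30, 410, 3130, 10],
    "_NumberOfSales": [16, 23, 411, 3120, 8],
})

print("Sketch error on toy data: {}".format(bip_error_sketch(toy)))
```

For these rows the per-region errors are 7/46, 1/410, 10/3130 and 2/10, so the result should equal their average. A region whose actual sales sum to zero would produce a division by zero; the sketch does not guard against that case.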
