# BIP's error function

The error function expects as input a pandas.DataFrame with the following structure:

| StoreID | Month | Region | NumberOfSales | _NumberOfSales |
| ------- | ----- | ------ | ------------- | -------------- |
| 1000    | 3     | 4      | 16            | 16             |
| 1000    | 4     | 4      | 30            | 23             |
| 1001    | 3     | 6      | 410           | 411            |
| 1001    | 4     | 27     | 3130          | 3120           |
| 1002    | 3     | 58     | 10            | 8              |

Where:
 
 - *NumberOfSales* are the **test actual values**
 - *_NumberOfSales* are **the predicted values**
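
For reference, the quantity computed step by step in the rest of this notebook can be written as a formula. The symbols are introduced here for illustration only and do not appear in the original code: $a_i$ is the actual *NumberOfSales* of row $i$, $p_i$ is the predicted *_NumberOfSales*, $r$ is a region and $R$ is the number of distinct regions.

$$
E_r = \frac{\sum_{i \in r} \lvert a_i - p_i \rvert}{\sum_{i \in r} a_i},
\qquad
E = \frac{1}{R} \sum_{r=1}^{R} E_r
$$

With the example rows above, region 4 has two rows, so $E_4 = (\lvert 16 - 16 \rvert + \lvert 30 - 23 \rvert) / (16 + 30) = 7/46 \approx 0.152$.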
 
 

Start from the preprocessed dataset, used here as a stand-in test set, and simulate a set of predictions.

In [1]:
from import_man import *

df = pd.read_csv('./dataset/preprocessed_train.csv')

print("Shape before: " + str(df.shape))

# Let's work on a reduced random sample (5000 rows) of the dataset
df = df.sample(n=5000)

print("Shape after: " + str(df.shape))

Shape before: (523021, 51)
Shape after: (5000, 51)


In [2]:
error_evaluation_columns = ['StoreID', 'D_Month', 'Region', 'NumberOfSales', '_NumberOfSales']

# Create fake predicted sales 
df['_NumberOfSales'] = df.NumberOfSales.apply(lambda x: df['NumberOfSales'].sample().values[0])


# Keep an independent copy of the full dataframe to pass to get_BIP_error later
df_BIP_err_fun_eval = df.copy()


# Keep only the columns needed for the error computation
df = df[error_evaluation_columns].copy()

df.head(20)

Unnamed: 0,StoreID,D_Month,Region,NumberOfSales,_NumberOfSales
210860,1301,11,2,5104.811659,2959.0
227093,1323,12,10,9602.0,3421.0
376935,1539,9,5,2329.669983,4513.0
463407,1664,1,0,5007.0,4332.998322
205369,1293,10,7,4277.0,2966.0
470317,1674,7,0,5739.0,4912.837209
184238,1263,4,3,5637.0,3525.0
362601,1519,11,9,2085.0,6723.0
470865,1675,1,2,3454.0,3182.291874
463235,1664,8,0,2993.0,6201.0
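
The fake predictions above are drawn one row at a time, which gets slow on larger frames. As a possible alternative, the sketch below draws all values in one call with NumPy. The `toy` frame and the seed are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sampled dataframe used in the notebook.
toy = pd.DataFrame({"NumberOfSales": [10.0, 20.0, 30.0, 40.0]})

# Seeded generator so the fake predictions are reproducible.
rng = np.random.default_rng(0)

# Draw one value per row, with replacement, from the actual sales.
toy["_NumberOfSales"] = rng.choice(
    toy["NumberOfSales"].to_numpy(), size=len(toy), replace=True
)
```

Each row still receives a uniformly random value from the `NumberOfSales` column, as in the row-by-row version.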


In [3]:
# No month filter (e.g. keeping only March and April): evaluate all the months provided

# Absolute difference between actual and predicted NumberOfSales
df['abs_diff'] = (df['NumberOfSales'] - df['_NumberOfSales']).abs()

df.head(20)

Unnamed: 0,StoreID,D_Month,Region,NumberOfSales,_NumberOfSales,abs_diff
210860,1301,11,2,5104.811659,2959.0,2145.811659
227093,1323,12,10,9602.0,3421.0,6181.0
376935,1539,9,5,2329.669983,4513.0,2183.330017
463407,1664,1,0,5007.0,4332.998322,674.001678
205369,1293,10,7,4277.0,2966.0,1311.0
470317,1674,7,0,5739.0,4912.837209,826.162791
184238,1263,4,3,5637.0,3525.0,2112.0
362601,1519,11,9,2085.0,6723.0,4638.0
470865,1675,1,2,3454.0,3182.291874,271.708126
463235,1664,8,0,2993.0,6201.0,3208.0


In [4]:
# Sum each column within every region (only the NumberOfSales and abs_diff sums are used below)
df_sums_by_region = df.groupby(['Region']).sum()
df_sums_by_region.head(20)

Region,StoreID,D_Month,NumberOfSales,_NumberOfSales,abs_diff
0,519576,2457,1742487.0,1748741.0,833545.4
1,215313,1111,758448.7,813464.7,361995.1
2,904499,3740,3074894.0,3167847.0,1285880.0
3,893224,4331,3770968.0,3103745.0,1670652.0
4,175467,865,763137.1,620570.6,390572.0
5,412490,2120,1448195.0,1542033.0,721820.6
6,313347,1535,971554.3,1161763.0,464718.6
7,606479,2865,2092793.0,2081793.0,958210.1
8,254404,1190,779482.3,997570.3,395230.1
9,1730466,8309,6019342.0,5872994.0,2665537.0


In [5]:
# Per-region error: summed absolute difference divided by the summed actual NumberOfSales
df_sums_by_region['E_r'] = df_sums_by_region['abs_diff'] / df_sums_by_region['NumberOfSales']

df_sums_by_region.head(20)

Region,StoreID,D_Month,NumberOfSales,_NumberOfSales,abs_diff,E_r
0,519576,2457,1742487.0,1748741.0,833545.4,0.478365
1,215313,1111,758448.7,813464.7,361995.1,0.477284
2,904499,3740,3074894.0,3167847.0,1285880.0,0.418187
3,893224,4331,3770968.0,3103745.0,1670652.0,0.44303
4,175467,865,763137.1,620570.6,390572.0,0.511798
5,412490,2120,1448195.0,1542033.0,721820.6,0.498428
6,313347,1535,971554.3,1161763.0,464718.6,0.478325
7,606479,2865,2092793.0,2081793.0,958210.1,0.457862
8,254404,1190,779482.3,997570.3,395230.1,0.507042
9,1730466,8309,6019342.0,5872994.0,2665537.0,0.442829


In [6]:
# Get the number of regions
N_regions = df['Region'].nunique()

print("Number of regions: {}".format(N_regions))

Number of regions: 11


In [7]:
# BIP error: mean of the per-region errors
Error = df_sums_by_region['E_r'].sum() / N_regions

print("BIP error: {}".format(Error))

BIP error: 0.4688234560618476


### Testing the implemented BIP error function

Check that the get_BIP_error function reproduces the result of the step-by-step computation above.

In [8]:
from BIP_error import get_BIP_error

df_BIP_err_fun_eval.head()

Unnamed: 0,StoreID,Date,IsHoliday,IsOpen,HasPromotions,StoreType,AssortmentType,NearestCompetitor,Region,NumberOfCustomers,...,StoreType_ShoppingCenter,AssortmentType_General,AssortmentType_WithNFDept,AssortmentType_WithFishDept,Events_Fog,Events_Hail,Events_Rain,Events_Snow,Events_Thunderstorm,_NumberOfSales
210860,1301,20/11/2016,0,0,0,Standard Market,With Non-Food Department,8389,2,226.706278,...,0,0,1,0,0,0,0,0,0,2959.0
227093,1323,04/12/2017,0,1,1,Hyper Market,With Non-Food Department,4154,10,640.0,...,0,0,1,0,0,0,0,0,0,3421.0
376935,1539,17/09/2017,0,0,0,Super Market,With Non-Food Department,19821,5,125.595357,...,0,0,1,0,0,0,1,0,0,4513.0
463407,1664,23/01/2017,0,1,1,Hyper Market,General,1057,0,307.0,...,0,1,0,0,1,0,1,0,0,4332.998322
205369,1293,27/10/2017,0,1,1,Standard Market,With Non-Food Department,6330,7,220.0,...,0,0,1,0,1,0,0,0,0,2966.0


In [9]:
# Compute the error with the function; it should match the manual result above
error = get_BIP_error(df_BIP_err_fun_eval)

Number of regions identified: 11
BIP total error: 0.4688234560618476
