# Models ensemble by averaging

In general we saw that XGB overall is better (when sampling randomly train and test or in January and February tests), however, Random Forest error does not have a high variance and is better when predicting on March and April, thus we can do an average of the predicted number of sales of every Store in every day, leading to an overall result which is slightly better than the average of the 2 errors.

The other possiblity that we will evaluate is to do an average of the two predictions at a higher level, when we have already grouped by StoreID and Month, this could be even better if one of the model for instance tends to overestimate the predicted value and the over one underestimates it.

**NOTEBOOK GOAL**: Ensemble of models created by averaging the preditions

Averaged predictions:

- **1_RFR** - Notebook 5.3 Random forrest
- **2_XGB** - Notebook 6.4 XGBoost
- **3_AVG** - Notebook 7.0 AVG Monthly average


In [1]:
from import_man import *
import collections

from BIP import get_BIP_error, apply_BIP_submission_format

### Load predicted tests

**NOTE** If you cannot load the followig datasets, please go to the corresponding notebook and run it to generate the related dataset file. 

In [2]:
dfs_dict = collections.OrderedDict()
# the following dataset will be evaluated

In [3]:
dfs_dict['RFR'] = pd.read_csv('./dataset/test_m12_53_RFR_on_prep.csv')

In [4]:
dfs_dict['XGB'] = pd.read_csv('./dataset/test_m12_64_Model_XGBoost_final.csv')

In [5]:
dfs_dict['AVG'] = pd.read_csv('./dataset/test_m12_70_Model_monthly_average.csv')

In [6]:
# let's apply the apply_BIP_submission_format to all the dataframes
for mdl_lbl, df in dfs_dict.items():
    dfs_dict[mdl_lbl] = apply_BIP_submission_format(df)

We decided to work with the average of the already summed up value because it reduces more the error

In [7]:
# make a copy of the first dataframe in order to use it as data structure
df_ens = list(dfs_dict.values())[0].copy()
df_ens['NumberOfSales'] = 0

In [8]:
stores_to_spot = [1000, 1245, 1300, 1301] 

# Let's spot some random rows in order to prove that the average works correctly
for mdl_lbl, df in dfs_dict.items():
    print('................................ ' + mdl_lbl + '................................')
    print(df.loc[df.StoreID.isin(stores_to_spot)][['StoreID', 'Target', 'NumberOfSales']].head(40))


................................ RFR................................
     StoreID  Target  NumberOfSales
0       1000  182917  201451.830901
1       1000  166161  184721.490504
490     1245  153770  161312.650848
491     1245  140180  148843.978536
600     1300  124236  124438.879364
601     1300  123832  117763.683670
602     1301  121875  118973.208250
603     1301  131267  124117.584620
................................ XGB................................
     StoreID  Target  NumberOfSales
0       1000  182917    195912.7838
1       1000  166161    172333.8114
490     1245  153770    150167.7479
491     1245  140180    138342.4841
600     1300  124236    116731.8826
601     1300  123832    112076.3303
602     1301  121875    114183.0000
603     1301  131267    120041.3171
................................ AVG................................
     StoreID  Target  NumberOfSales
0       1000  182917  192640.240000
1       1000  166161  175253.000000
490     1245  153770  143432.640000
4

In [9]:
df_ens[['StoreID', 'Target', 'NumberOfSales']].head()

Unnamed: 0,StoreID,Target,NumberOfSales
0,1000,182917,0
1,1000,166161,0
2,1001,95745,0
3,1001,88423,0
4,1002,121995,0


In [10]:
# sum up all
for mdl_lbl, df in dfs_dict.items():
    df_ens['NumberOfSales'] += df['NumberOfSales']

# divide by their number
df_ens['NumberOfSales'] /= len(dfs_dict)

df.loc[df.StoreID.isin(stores_to_spot)][['StoreID', 'Target', 'NumberOfSales']].head(40)

Unnamed: 0,StoreID,Target,NumberOfSales
0,1000,182917,192640.24
1,1000,166161,175253.0
490,1245,153770,143432.64
491,1245,140180,135155.0
600,1300,124236,115930.208333
601,1300,123832,114591.0
602,1301,121875,106975.0
603,1301,131267,117830.0


## Write to file

In [11]:
df_ens.to_csv('./dataset/test_m12_82_Ensemble_average.csv', index=False)