# File structure of the model

The model is composed of the following files:
<ol>
<li> a main file that loads the data, fits the model, and makes predictions. Data loading is described in the 'Data input' section.</li>

<li> a data cleaning file that defines a data cleaning class. The effect of this class is described in the 'Data cleaning' section.</li>

<li> a model file containing the actual model and its parameters. The content of the model is described in the 'Model Definition' section.</li>
</ol>   

In [None]:
import pandas as pd

from models import define_pipelines
from models import single_run

# Data input
<ol>
<li> The data are loaded.</li>

<li> We split the sales from the data. We obtain three DataFrames:
    <ol>
    <li> Training data for the store: 'Date', 'Store', 'DayOfWeek', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday'</li>  
    <li> Store data: 'Store', 'StoreType', 'Assortment', 'CompetitionDistance',
       'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval'</li>
    <li> The target data: a DataFrame with a single column, 'Sales'</li>
    </ol>
</li>
</ol>

In [None]:
TRAINING_DATA = 'data/train.csv'
HOLDOUT_DATA = 'data/holdout.csv'
STORE_DATA = 'data/store.csv'
TEST_DATA = '' # input the path of the test data file here

RANDOM_SEED = 42
CORES = -1

try:
    df_test = pd.read_csv(TEST_DATA)
except FileNotFoundError:
    print('Test data file not found, using holdout as validation set')
    df_test = pd.read_csv(HOLDOUT_DATA)
    df_train = pd.read_csv(TRAINING_DATA)
else:
    print('Test data loaded, using full training data for model training')
    df_train = pd.concat([
        pd.read_csv(TRAINING_DATA),
        pd.read_csv(HOLDOUT_DATA)
    ])
finally:
    df_store = pd.read_csv(STORE_DATA)

X_train = df_train.drop(columns='Sales')
y_train = df_train.loc[:, 'Sales']
X_val = df_test.drop(columns='Sales')
y_val = df_test.loc[:, 'Sales']

# Data cleaning

<ol>
<li> We perform an inner merge on the 'Store' column between the training data and the store data. </li>


<li> Rows with empty sales or store id are dropped. </li>

<li> We drop the rows for which the sales are zero. </li>

<li> On inspection of the data, we observe that one 'StateHoliday' class is represented by two distinct values: the integer 0 and the string '0'. This is corrected. </li>

<li> For the date, we add four new columns containing the weekday, day, month and year. The 'DayOfWeek' column is therefore redundant and is dropped. The 'Date' column is ultimately dropped as well. </li>

<li> We one-hot encode the following columns:
     'StateHoliday', 'Assortment', 'SchoolHoliday'.
   For columns containing NaN values, an additional NaN indicator column is created. </li>


<li> We introduce the following features:
    <ol>
    <li> for each row, the median sales of the row's store. </li>
    <li> for each row, the standard deviation of sales of the row's store. </li>
    <li> for each row, the mean sales of the row's store type. </li>
    <li> for each row, the standard deviation of sales of the row's store type. </li>
    <li> Ultimately, the 'Store' and 'StoreType' columns are dropped. </li>
    </ol>
</li>

    

<li> The 'CompetitionDistance' column has missing values, which are filled with the column median. </li>

<li> The 'Promo' column has missing values, which are filled with the most frequent of the remaining values (the `filled_in_mode` setting in the cell below).</li>

<li> Ultimately, the following columns are dropped:
'Store', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear','Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'Date', 'Open',
'StoreType'. </li>

<li> The output of the cleaning is two DataFrames with identical indices. The feature DataFrame has the following columns:

'Promo', 'Sales', 'CompetitionDistance', 'Promo2', 'day', 'month',
       'year', 'weekday', 'median_sales', 'std_sales', 'type_median',
       'type_std', 'mean_sales', 'StateHoliday_b', 'StateHoliday_c',
       'StateHoliday_no', 'StateHoliday_nan', 'Assortment_b', 'Assortment_c',
       'Assortment_nan', 'SchoolHoliday_1.0', 'SchoolHoliday_nan'</li>
   
    
</ol>
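Several of the steps above can be illustrated on a toy example. The following is a minimal sketch, not the actual `DataCleaning` implementation: the two small frames are made up for illustration, and the exact order and naming of the steps inside the class may differ.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for the training and store data.
sales = pd.DataFrame({
    'Date': ['2015-07-01', '2015-07-02', '2015-07-03',
             '2015-07-04', '2015-07-05'],
    'Store': [1, 1, 2, 2, 3],
    'Sales': [5000.0, 0.0, 7000.0, 6500.0, 4000.0],
    'StateHoliday': [0, '0', 'a', np.nan, '0'],
})
store = pd.DataFrame({'Store': [1, 2], 'Assortment': ['a', 'c']})

# Inner merge on 'Store': store 3 has no store record and drops out.
df = sales.merge(store, on='Store', how='inner')

# Drop rows without sales or store id, then rows with zero sales.
df = df.dropna(subset=['Sales', 'Store'])
df = df[df['Sales'] != 0].copy()

# Unify the integer 0 and the string '0'; missing values stay missing.
df['StateHoliday'] = df['StateHoliday'].map(
    lambda v: v if pd.isna(v) else str(v)
)

# Derive calendar features from 'Date', then drop the original column.
dates = pd.to_datetime(df['Date'])
df['day'] = dates.dt.day
df['month'] = dates.dt.month
df['year'] = dates.dt.year
df['weekday'] = dates.dt.weekday  # Monday is 0
df = df.drop(columns='Date')

# One-hot encode; dummy_na=True adds a '<column>_nan' indicator column.
df = pd.get_dummies(df, columns=['StateHoliday', 'Assortment'],
                    dummy_na=True)
print(df.columns.tolist())
```

Note that `dummy_na=True` creates the indicator column even when a column has no missing values, which is one way to keep the training and validation frames aligned.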

In [None]:
from data_cleaning import DataCleaning

cleaning_settings = dict(
    hot_encoded_columns=[
        'StateHoliday',
        'Assortment',
        'SchoolHoliday',
    ],
    dropped_columns=[
        'Store',
        'CompetitionOpenSinceMonth',
        'CompetitionOpenSinceYear',
        'Promo2SinceWeek',
        'Promo2SinceYear',
        'PromoInterval',
        'Date',
        'Open',
        'StoreType',
    ],
    filled_in_median=[
        'CompetitionDistance',
    ],
    filled_in_mode=[
        'Promo',
    ],
    target=[
        'Sales',
    ],
)

cleaning = DataCleaning(
    store=df_store,
    hot_encoded_columns=cleaning_settings['hot_encoded_columns'],
    dropped_columns=cleaning_settings['dropped_columns'],
    filled_in_median=cleaning_settings['filled_in_median'],
    filled_in_mode=cleaning_settings['filled_in_mode'],
    target=cleaning_settings['target'],
)

X_train_clean, y_train_clean =\
    cleaning.cleaning(X_train, y_train, training=True)
X_val_clean, y_val_clean =\
    cleaning.cleaning(X_val, y_val, training=False)
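The `training` flag suggests that statistics such as the per-store median are computed on the training data and then reused for the validation data, so that no validation sales leak into the features. The sketch below illustrates that fit-then-apply pattern under this assumption; the toy frames and the fallback for unseen stores are hypothetical and not taken from `DataCleaning`.

```python
import pandas as pd

# Made-up training and validation frames.
train = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2],
    'Sales': [100.0, 200.0, 300.0, 50.0, 150.0],
})
val = pd.DataFrame({'Store': [1, 2, 3]})

# "Fit": per-store statistics computed from the training rows only.
store_stats = (
    train.groupby('Store')['Sales']
    .agg(median_sales='median', std_sales='std')
    .reset_index()
)

# "Apply": attach the stored statistics to any frame by store id.
train_features = train.merge(store_stats, on='Store', how='left')
val_features = val.merge(store_stats, on='Store', how='left')

# Stores unseen in training (store 3 here) have no statistics. Falling
# back to the global training median is one option, used here only for
# illustration.
val_features['median_sales'] = val_features['median_sales'].fillna(
    train['Sales'].median()
)
print(val_features)
```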

# Model Definition

<li> We opt for a boosted trees model (XGBRegressor), as it proved less
prone to over-fitting than our alternative (a random forest). </li>
<li> After iterating over feature selection and transformation, we settled on the following features: </li>


In [None]:
X_train_clean.columns

<li> A basic grid search over some key hyper-parameters showed that 500 shallow trees with a maximum depth of 3 and a learning rate of 0.2 performed best. </li>

In [None]:
xg_settings = dict(
    n_estimators=500,
    max_depth=3,
    learning_rate=0.2,
    random_state=RANDOM_SEED,
    n_jobs=CORES,
)
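The grid search itself is not part of this notebook. A search of this kind could look roughly like the sketch below, which is an illustration under stated assumptions: it uses scikit-learn's `GradientBoostingRegressor` as a stand-in so that it runs without xgboost, synthetic data, and a much smaller grid than a real search would use. Since `XGBRegressor` follows the scikit-learn estimator interface, it could be passed to `GridSearchCV` in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Small hypothetical grid over the same three hyper-parameters.
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 3],
    'learning_rate': [0.1, 0.2],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),  # stand-in for XGBRegressor
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
```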

We define our pipeline (which consists of just the model itself), as scaling
showed no positive effect in our iterations:

In [None]:
pipes = define_pipelines(xg_settings)

We then run the model, which returns a self-evaluation:

In [None]:
__i = single_run(pipes, X_train_clean, y_train_clean,
           X_val_clean, y_val_clean, X_train, X_val)