## Raw ideas

so here are a few quick ideas for a validation schema:

- data shouldn't be shuffled before training, because it's essential to keep the time series values in consecutive order
- currently I came up with three ideas for a schema:
    1. `1,2train+3valid` -> `1,2,3train+4valid` -> `1,2,3,4train+5valid` etc.

        **Advantages**:
        - the train set grows, probably helping to test the model's robustness
        - we always validate on a different part of the dataset

        **Disadvantages**:
        - computationally heavy. at some point our train will contain almost the entire dataset
    2. `1,2train+3valid` -> `2,3train+4valid` -> `3,4train+5valid`

        **Advantages**:
        - fixed train size, no computational power issues
        - we use the dataset fairly efficiently, since each fold is used several times across different training sessions
        
        **Disadvantages**:
        - the fact that different sets overlap might be a problem (shouldn't be, but who knows)
    3.  `1,2train+3valid` -> `4,5train+6valid` -> `7,8train+9valid`

        **Advantages**:
        - the easiest one to implement, goes through each fold once only
        - it's guaranteed that there's no overlapping or data leaks
        
        **Disadvantages**:
        - data hungry in some sense: each fold is used only once and each training session takes $f_{train} + 1$ folds, where $f_{train}$ is the number of folds in the training set. this might be a problem given that we have only 33 folds at most (e.g. with $f_{train} = 2$ that's just 11 training sessions)
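
a minimal sketch of the three schemes above, assuming folds are simply numbered from 1 and each split is a `(train_folds, valid_fold)` pair. the function names are made up for illustration and are not part of `FoldLoader`:

```python
def expanding_splits(n_folds, n_train):
    """Scheme 1: the train window grows, validation is the next fold."""
    return [(list(range(1, v)), v) for v in range(n_train + 1, n_folds + 1)]


def sliding_splits(n_folds, n_train):
    """Scheme 2: fixed-size train window that moves one fold at a time."""
    return [(list(range(v - n_train, v)), v)
            for v in range(n_train + 1, n_folds + 1)]


def disjoint_splits(n_folds, n_train):
    """Scheme 3: non-overlapping blocks of n_train + 1 folds."""
    step = n_train + 1
    return [(list(range(s, s + n_train)), s + n_train)
            for s in range(1, n_folds - step + 2, step)]


# number of training sessions each scheme yields with 33 folds, 2 train folds
for fn in (expanding_splits, sliding_splits, disjoint_splits):
    print(fn.__name__, len(fn(33, 2)))
# prints 31, 31 and 11 sessions respectively
```

the counts make the trade-off concrete: schemes 1 and 2 give the same number of sessions, but scheme 1 trains on up to 32 folds at once, while scheme 3 gives roughly a third as many sessions.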

model ideas:

- **decision tree-based algorithms (boosting, random forest)**:

    seems to be a good choice for the task since we have a lot of categorical features and the target is a discrete count, which tree-based regressors handle well. at the same time, single trees overfit easily, which hurts robustness and gives inaccurate predictions on unseen data. ensembles help with this, but the overfitting problem won't go away completely. also, decision trees can't extrapolate: once something completely unseen shows up in the input data (a new category, or values outside the training range), predictions degrade, which might be a problem during future usage
- **auto-regressive models**:

    a time-series-specific solution that's supposed to do a good job at predicting future sales based on historical data. the main problems are the following:

    1. I don't have any experience working with them (but we'll still try I guess)
    2. the data is still quite noisy, so using AR models will require some additional preprocessing
- **RNNs**:

    one of the hardest models to implement, but at the same time a really powerful (in theory) solution that by design fits the task quite well. RNNs can be really good when it comes to analyzing sequential data (like text or time series)
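
to make the auto-regressive idea from the list above a bit more concrete, here's a rough sketch of an AR(p) fit with plain least squares in NumPy. this is an illustration only: `fit_ar` and `predict_next` are hypothetical helpers, the series is a toy one, and a real attempt would need the preprocessing mentioned above and probably a dedicated library

```python
import numpy as np


def fit_ar(series, p):
    """Least-squares fit of y_t = c + a_1*y_{t-1} + ... + a_p*y_{t-p}."""
    y = np.asarray(series, dtype=float)
    # column k holds the value k steps back for every target y_t, t >= p
    lags = np.column_stack([y[p - k: len(y) - k] for k in range(1, p + 1)])
    X = np.column_stack([np.ones(len(lags)), lags])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, a_1, ..., a_p]


def predict_next(series, coef):
    """One-step-ahead forecast from the last p observations."""
    p = len(coef) - 1
    recent = np.asarray(series[-p:], dtype=float)[::-1]  # most recent first
    return float(coef[0] + recent @ coef[1:])


# toy series (made up): each value is the previous one plus 3
toy = [1, 4, 7, 10, 13]
coef = fit_ar(toy, p=1)
print(coef, predict_next(toy, coef))
```

the same lagged values double as input features for the tree models, so this is also a cheap baseline to compare against.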


useful additional features:

- day of the week
- month
- year
- lagged values (look them up in `EDA`)
- ...
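
a small sketch of how the features listed above could be built with pandas, assuming a datetime `date` column and the `shop_id` / `item_id` / `item_cnt_day` columns used later on. the helper names, the lag choice and the toy rows are assumptions for illustration, not the final feature set

```python
import pandas as pd


def add_time_features(df, date_col="date"):
    """Add day of week (Monday=0), month and year from a datetime column."""
    out = df.copy()
    out["day_of_week"] = out[date_col].dt.dayofweek
    out["month"] = out[date_col].dt.month
    out["year"] = out[date_col].dt.year
    return out


def add_lags(df, target, group_cols, lags, date_col="date"):
    """Add lagged target values, shifted within each group in date order."""
    out = df.sort_values(date_col).copy()
    for lag in lags:
        out[f"{target}_lag_{lag}"] = out.groupby(group_cols)[target].shift(lag)
    return out


# toy rows (made up) for a single shop/item pair
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"]),
    "shop_id": [1, 1, 1],
    "item_id": [1, 1, 1],
    "item_cnt_day": [1, 2, 3],
})
features = add_lags(add_time_features(toy), "item_cnt_day",
                    ["shop_id", "item_id"], lags=[1])
print(features)
```

the first row of each group has no history, so its lag is `NaN`; those rows need to be dropped or filled before training.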


## Loading the data

the point is to load the data in a format that's actually going to be fed to the model

In [1]:
from warnings import filterwarnings
filterwarnings('ignore', category=FutureWarning)

import pandas as pd
import numpy as np
import catboost as cb
import seaborn as sns
from sys import path
from matplotlib import pyplot as plt
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from tqdm.notebook import tqdm
from sklearn.ensemble import RandomForestRegressor

path.append('../')
from src.utils import FoldLoader



In [22]:
data = pd.read_csv('../data/processed_files/processed_data.csv')
data['date'] = pd.to_datetime(data['date'])

In [37]:
loader = FoldLoader(data, 3, folding_mode='seq')

In [6]:
loader.data.columns

Index(['date_block_num', 'shop_id', 'item_id', 'item_cnt_day', 'income'], dtype='object')

In [24]:
in_features = ['date_block_num', 'shop_id',
               'item_id', 'income']
target = 'item_cnt_day'
cat_cols = ['date_block_num', 'shop_id', 'item_id']

In [68]:
global_history = dict()

## CatBoost

trying CatBoost with no additional parameters. we can then see the scores, compare them to other models and decide which one to use in the end

In [55]:
template = 'fold: [{:2} out of {:2}]\tR2-score: [{:3.3f}]\tRMSE: [{:3.3f}]'
index = 1
history = {'loss': [], 'score': []}
loader.reset_folds()
for train, val in tqdm(loader):
    # build CatBoost pools; both need the same categorical columns
    train_data = cb.Pool(train[in_features], train[target], cat_features=cat_cols)
    val_data = cb.Pool(val[in_features], val[target], cat_features=cat_cols)
    # re-create the model so nothing carries over between folds
    model = cb.CatBoostRegressor(cat_features=cat_cols, thread_count=8)
    model.fit(train_data, verbose=False)

    # validate on the held-out fold
    preds = model.predict(val_data)
    rmse = (np.sqrt(mean_squared_error(val[target], preds)))
    r2 = r2_score(val[target], preds)
    history['score'].append(r2)
    history['loss'].append(rmse)
    print(template.format(index, len(loader), r2, rmse))
    index += 1
global_history['catboost'] = history

fold: [ 1 out of  8]	R2-score: [0.593]	RMSE: [1.884]
fold: [ 2 out of  8]	R2-score: [0.782]	RMSE: [1.776]
fold: [ 3 out of  8]	R2-score: [0.824]	RMSE: [2.008]
fold: [ 4 out of  8]	R2-score: [0.823]	RMSE: [1.610]
fold: [ 5 out of  8]	R2-score: [0.730]	RMSE: [2.022]
fold: [ 6 out of  8]	R2-score: [0.836]	RMSE: [1.916]
fold: [ 7 out of  8]	R2-score: [0.598]	RMSE: [2.367]
fold: [ 8 out of  8]	R2-score: [0.626]	RMSE: [1.990]
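
to compare models later, the per-fold histories can be reduced to a mean and spread per metric. a minimal sketch, assuming the same `{'loss': [...], 'score': [...]}` layout as `history` above; `summarize` is a hypothetical helper and the numbers are just the CatBoost fold scores printed above

```python
from statistics import mean, stdev


def summarize(global_history):
    """Mean and sample std of R2 ('score') and RMSE ('loss') per model."""
    summary = {}
    for name, hist in global_history.items():
        summary[name] = {
            "r2_mean": mean(hist["score"]),
            "r2_std": stdev(hist["score"]),
            "rmse_mean": mean(hist["loss"]),
            "rmse_std": stdev(hist["loss"]),
        }
    return summary


# fold results copied from the CatBoost output above
example_history = {
    "catboost": {
        "score": [0.593, 0.782, 0.824, 0.823, 0.730, 0.836, 0.598, 0.626],
        "loss": [1.884, 1.776, 2.008, 1.610, 2.022, 1.916, 2.367, 1.990],
    }
}
for model_name, stats in summarize(example_history).items():
    print(model_name, {k: round(v, 3) for k, v in stats.items()})
```

the spread matters as much as the mean here, since R2 swings quite a bit between folds.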


## RandomForest

this one requires manually encoding some features, since it can't process categorical features out-of-the-box. again, planning to see the scores I get and compare to others
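
a rough sketch of what the manual encoding could look like, assuming plain integer codes fitted on the train fold only, with unseen categories in validation mapped to `-1`. `fit_encoders` and `encode` are hypothetical helpers and the toy frames are made up; other encodings (one-hot, target encoding) are equally possible

```python
import pandas as pd


def fit_encoders(train, cat_cols):
    """Map every category seen in the train fold to an integer code."""
    return {col: {value: code
                  for code, value in enumerate(sorted(train[col].unique()))}
            for col in cat_cols}


def encode(df, encoders, unknown=-1):
    """Apply the fitted codes; categories not seen in train become `unknown`."""
    out = df.copy()
    for col, mapping in encoders.items():
        out[col] = out[col].map(mapping).fillna(unknown).astype(int)
    return out


# toy folds (made up): shop 9 never appears in train
toy_train = pd.DataFrame({"shop_id": [5, 7, 5], "item_id": [10, 20, 30]})
toy_val = pd.DataFrame({"shop_id": [7, 9], "item_id": [20, 10]})

encoders = fit_encoders(toy_train, ["shop_id", "item_id"])
print(encode(toy_val, encoders))
```

fitting the codes on the train fold only keeps validation honest, and the encoded frames can then be passed to `RandomForestRegressor` as usual.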

## RNN

requires encoding categorical features too.