## Raw ideas

so here are a few quick ideas for a validation scheme:

- data shouldn't be shuffled before training, because it's essential to keep the time series values in consecutive order
- so far I've come up with three candidate schemes:
    1. `1,2train+3valid` -> `1,2,3train+4valid` -> `1,2,3,4train+5valid` etc.

        **Advantages**:
        - the train set grows, which probably helps to test the model's robustness
        - each step validates on a different fold

        **Disadvantages**:
        - computationally heavy: at some point the train set will contain almost the entire dataset
    2. `1,2train+3valid` -> `2,3train+4valid` -> `3,4train+5valid`

        **Advantages**:
        - fixed train size, no computational power issues
        - we use the dataset fairly efficiently, since each fold is reused across several training sessions
        
        **Disadvantages**:
        - the fact that different sets overlap might be a problem (shouldn't be, but who knows)
    3.  `1,2train+3valid` -> `4,5train+6valid` -> `7,8train+9valid`

        **Advantages**:
        - the easiest one to implement, goes through each fold once only
        - it's guaranteed that there's no overlapping or data leaks
        
        **Disadvantages**:
        - data hungry in some sense: each fold is used only once and each training session takes $f_{train} + 1$ folds, where $f_{train}$ is the number of folds in the training set. this might be a problem assuming we have only 33 folds at maximum
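
the three schemes above can be sketched as plain generators over fold numbers. this is only an illustration, not the `FoldLoader` from `src.utils`: the function names, the 1-based fold numbering and the `min_train` / `train_size` parameters are all assumptions made for the example.

```python
from typing import Iterator

Split = tuple[list[int], list[int]]  # (train fold ids, validation fold ids)


def expanding_window(n_folds: int, min_train: int = 2) -> Iterator[Split]:
    """Scheme 1: the train set grows by one fold at every step."""
    for val in range(min_train + 1, n_folds + 1):
        yield list(range(1, val)), [val]


def sliding_window(n_folds: int, train_size: int = 2) -> Iterator[Split]:
    """Scheme 2: a fixed-size train window that moves one fold at a time."""
    for val in range(train_size + 1, n_folds + 1):
        yield list(range(val - train_size, val)), [val]


def disjoint_blocks(n_folds: int, train_size: int = 2) -> Iterator[Split]:
    """Scheme 3: non-overlapping blocks of train_size + 1 folds each."""
    step = train_size + 1
    for start in range(1, n_folds - step + 2, step):
        yield list(range(start, start + train_size)), [start + train_size]


if __name__ == "__main__":
    for name, scheme in [("expanding", expanding_window),
                         ("sliding", sliding_window),
                         ("disjoint", disjoint_blocks)]:
        print(name, list(scheme(9)))
```

with 33 folds and two train folds per step, the first two schemes give 31 train/validation pairs while the third gives only 11, which is the "data hungry" trade-off in numbers. scikit-learn's `TimeSeriesSplit` covers the expanding-window case on row indices, if a ready-made splitter is preferred.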

model ideas:

- **decision tree-based algorithms (boosting, random forest)**:

    seems to be a good choice for the task since we have a lot of categorical features and the target is a discrete count. at the same time, trees overfit easily, which can hurt robustness and give very inaccurate predictions on unseen data. ensembles help with this, but the overfitting problem won't go away completely. also, tree-based models can't extrapolate: they don't work well once something completely unseen (a new shop or item, or a target outside the training range) shows up in the input data, which might be a problem during future usage
- **auto-regressive models**:

    a time-series-specific solution that's supposed to do a good job at predicting future sales from historical data. the main problems are the following:
        1. I don't have any experience working with them (but we'll still try I guess)
        2. the data is still quite noisy, so using AR models will require some additional preprocessing 
- **RNNs**:

    one of the hardest models to implement, but at the same time a really powerful (in theory) solution that by definition fits the task quite well. RNNs can be really good at analyzing sequential data (like text or time series)
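
to make the auto-regressive idea above more concrete, here is the smallest possible case, AR(1): the next value is modelled as $x_t = c + \varphi x_{t-1}$ and the two coefficients are fitted by ordinary least squares. this is a hand-rolled sketch in pure Python on a made-up series, not a method used in this notebook; the names `fit_ar1` and `forecast_ar1` are invented for the example, and a library such as statsmodels would be the practical choice for real data.

```python
def fit_ar1(series: list[float]) -> tuple[float, float]:
    """Fit x_t = c + phi * x_{t-1} by ordinary least squares."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    # slope = cov(prev, curr) / var(prev), intercept from the means
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    c = mean_curr - phi * mean_prev
    return c, phi


def forecast_ar1(last_value: float, c: float, phi: float, steps: int) -> list[float]:
    """Roll the fitted recurrence forward `steps` times."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return forecasts


if __name__ == "__main__":
    # toy series generated by x_t = 2 + 0.5 * x_{t-1}; NOT real sales data
    toy = [10.0, 7.0, 5.5, 4.75, 4.375]
    c, phi = fit_ar1(toy)
    print(c, phi, forecast_ar1(toy[-1], c, phi, steps=2))
```

real sales series are noisy, so the fit won't be this clean; that is where the extra preprocessing mentioned above comes in.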


useful additional features:

- day of the week
- month
- year
- lagged values (look them up in `EDA`)
- ...
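
a rough sketch of how the calendar and lag features listed above could be built with pandas. assumptions for the example: the `date` column holds `dd.mm.yyyy` strings, the monthly frame has one row per (`date_block_num`, `shop_id`, `item_id`), and `item_cnt_month` is a hypothetical name for the monthly target. the helper names are made up too.

```python
import pandas as pd


def add_calendar_features(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Add day-of-week, month and year columns parsed from dd.mm.yyyy strings."""
    out = df.copy()
    parsed = pd.to_datetime(out[date_col], format="%d.%m.%Y")
    out["day_of_week"] = parsed.dt.dayofweek  # Monday = 0
    out["month"] = parsed.dt.month
    out["year"] = parsed.dt.year
    return out


def add_lag_features(monthly: pd.DataFrame, target: str, lags: list[int],
                     keys: tuple[str, ...] = ("shop_id", "item_id"),
                     time_col: str = "date_block_num") -> pd.DataFrame:
    """Attach the target value from `lag` months earlier for the same keys.

    Shifting the month index and merging (instead of groupby().shift())
    keeps the lag correct even when a shop/item pair has missing months.
    """
    out = monthly.copy()
    for lag in lags:
        shifted = monthly[[time_col, *keys, target]].copy()
        shifted[time_col] += lag
        shifted = shifted.rename(columns={target: f"{target}_lag_{lag}"})
        out = out.merge(shifted, on=[time_col, *keys], how="left")
    return out
```

months with no history get `NaN` in the lag columns, so they need either a fill value or a model that tolerates missing values.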


## Loading the data

the point is to load the data in a format that's actually going to be fed to the model

In [1]:
from warnings import filterwarnings
filterwarnings('ignore', category=FutureWarning)

import pandas as pd
import numpy as np
import catboost as cb
import seaborn as sns
import os
from sys import path
from matplotlib import pyplot as plt
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from tqdm.notebook import tqdm
from sklearn.ensemble import RandomForestRegressor

path.append('../')
from src.utils import FoldLoader
from src.transform import ETL

In [2]:
DATA_DIR = '../data/competitive-data-science-predict-future-sales/'

In [3]:
def etl_process():
    pipeline = ETL(DATA_DIR, ['sales_train.csv', 'items', 'shops', 'item_categories'])
    pipeline.extract()
    pipeline.tarnsform()
    pipeline.load(os.path.join(DATA_DIR, 'processed_files/'))

In [4]:
# uncomment if data reloading is needed (made this specifically after switching 
# the working device since the data dir is not on github)
# etl_process()

0.00% of data had empty values and was removed in sales
0.00% of data had empty values and was removed in items
0.00% of data had empty values and was removed in shops
0.00% of data had empty values and was removed in item_categories
resolving noisy values in shops...
resolving noisy values in items...
saving...


In [5]:
data = pd.read_csv(os.path.join(DATA_DIR, 'processed_files/processed_data.csv'))
test = pd.read_csv(os.path.join(DATA_DIR, 'test.csv'))
test['date_block_num'] = 34

In [6]:
columns_to_drop = ['date', 'item_name', 'item_category_name', 'shop_name']
data = data.drop(columns=columns_to_drop)

In [18]:
def train_model(train: pd.DataFrame, test: pd.DataFrame, cat_cols: list[str],
                in_features: list[str], target: str, max_depth: int = 5) -> np.ndarray:
    """Fit CatBoost on date_block_num < 33, validate on block 33, predict for `test`."""
    # hold out the last available month (date_block_num == 33) for validation
    fit_part = train[train['date_block_num'] < 33]
    val_part = train[train['date_block_num'] == 33]
    train_data = cb.Pool(fit_part[in_features], fit_part[target],
                         cat_features=cat_cols)
    val_data = cb.Pool(val_part[in_features], val_part[target],
                       cat_features=cat_cols)

    model = cb.CatBoostRegressor(
        cat_features=cat_cols, random_seed=42, max_depth=max_depth)
    model.fit(train_data, eval_set=val_data, use_best_model=True,
              verbose=True, early_stopping_rounds=50)
    # TODO: return the model as well
    return model.predict(test[in_features])

## Ungrouped data

In [22]:
in_features = ['date_block_num', 'shop_id', 'item_id']
target = 'item_cnt_day'
cat_cols = ['date_block_num', 'shop_id', 'item_id']

In [23]:
train_model(data, test, cat_cols, in_features, target, max_depth=2)

Learning rate set to 0.177859
0:	learn: 2.2502118	test: 9.7624728	best: 9.7624728 (0)	total: 412ms	remaining: 6m 51s
1:	learn: 2.2251096	test: 9.7575128	best: 9.7575128 (1)	total: 692ms	remaining: 5m 45s
2:	learn: 2.2096756	test: 9.7560999	best: 9.7560999 (2)	total: 824ms	remaining: 4m 33s
3:	learn: 2.1921792	test: 9.7558045	best: 9.7558045 (3)	total: 1.05s	remaining: 4m 20s
4:	learn: 2.1801099	test: 9.7557152	best: 9.7557152 (4)	total: 1.18s	remaining: 3m 53s
5:	learn: 2.1704762	test: 9.7556436	best: 9.7556436 (5)	total: 1.3s	remaining: 3m 35s
6:	learn: 2.1638171	test: 9.7555835	best: 9.7555835 (6)	total: 1.56s	remaining: 3m 40s
7:	learn: 2.1584108	test: 9.7556381	best: 9.7555835 (6)	total: 1.69s	remaining: 3m 29s
8:	learn: 2.1547616	test: 9.7557580	best: 9.7555835 (6)	total: 1.82s	remaining: 3m 20s
9:	learn: 2.1479554	test: 9.7553273	best: 9.7553273 (9)	total: 2.03s	remaining: 3m 21s
10:	learn: 2.1428333	test: 9.7512666	best: 9.7512666 (10)	total: 2.18s	remaining: 3m 15s
11:	learn: 2

array([1.1460224 , 1.09905976, 1.15019542, ..., 1.11267666, 1.11035982,
       1.19318129])

In [29]:
grouped = data.groupby(in_features, as_index=False).sum()
grouped_test = test.groupby(in_features, as_index=False).sum()

In [30]:
train_model(grouped, test, cat_cols, in_features, target)

Learning rate set to 0.161801
0:	learn: 7.9661015	test: 14.1955498	best: 14.1955498 (0)	total: 281ms	remaining: 4m 40s
1:	learn: 7.6425317	test: 14.0137319	best: 14.0137319 (1)	total: 505ms	remaining: 4m 11s
2:	learn: 7.3086872	test: 13.9089981	best: 13.9089981 (2)	total: 634ms	remaining: 3m 30s
3:	learn: 7.0570473	test: 13.8279289	best: 13.8279289 (3)	total: 750ms	remaining: 3m 6s
4:	learn: 6.8645835	test: 13.7844794	best: 13.7844794 (4)	total: 985ms	remaining: 3m 16s
5:	learn: 6.6448318	test: 13.7837624	best: 13.7837624 (5)	total: 1.3s	remaining: 3m 35s
6:	learn: 6.5054858	test: 13.7928778	best: 13.7837624 (5)	total: 1.58s	remaining: 3m 44s
7:	learn: 6.3953691	test: 13.7964513	best: 13.7837624 (5)	total: 1.89s	remaining: 3m 54s
8:	learn: 6.2913942	test: 13.8499907	best: 13.7837624 (5)	total: 2.04s	remaining: 3m 45s
9:	learn: 6.2263689	test: 13.8464533	best: 13.7837624 (5)	total: 2.24s	remaining: 3m 41s
10:	learn: 6.1709869	test: 13.8521432	best: 13.7837624 (5)	total: 2.42s	remaining:

array([1.31628164, 0.79431429, 1.46518921, ..., 0.93469649, 0.91364259,
       1.12327532])

In [68]:
global_history = dict()

## CatBoost

trying CatBoost with default parameters first; the per-fold scores can then be compared with the other models to decide which one to use in the end

In [55]:
template = 'fold: [{:2} out of {:2}]\tR2-score: [{:3.3f}]\tRMSE: [{:3.3f}]'
history = {'loss': [], 'score': []}
loader.reset_folds()
for index, (train, val) in enumerate(tqdm(loader), start=1):
    # build the CatBoost pools for this fold
    train_data = cb.Pool(train[in_features], train[target], cat_features=cat_cols)
    val_data = cb.Pool(val[in_features], val[target], cat_features=cat_cols)
    # fit a fresh model on every fold
    model = cb.CatBoostRegressor(cat_features=cat_cols, thread_count=8)
    model.fit(train_data, verbose=False)

    # validate
    preds = model.predict(val_data)
    rmse = np.sqrt(mean_squared_error(val[target], preds))
    r2 = r2_score(val[target], preds)
    history['score'].append(r2)
    history['loss'].append(rmse)
    print(template.format(index, len(loader), r2, rmse))
global_history['catboost'] = history

  0%|          | 0/8 [00:00<?, ?it/s]

fold: [ 1 out of  8]	R2-score: [0.593]	RMSE: [1.884]
fold: [ 2 out of  8]	R2-score: [0.782]	RMSE: [1.776]
fold: [ 3 out of  8]	R2-score: [0.824]	RMSE: [2.008]
fold: [ 4 out of  8]	R2-score: [0.823]	RMSE: [1.610]
fold: [ 5 out of  8]	R2-score: [0.730]	RMSE: [2.022]
fold: [ 6 out of  8]	R2-score: [0.836]	RMSE: [1.916]
fold: [ 7 out of  8]	R2-score: [0.598]	RMSE: [2.367]
fold: [ 8 out of  8]	R2-score: [0.626]	RMSE: [1.990]
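
to compare models later, the per-fold numbers can be collapsed into a mean and a spread per model. a small sketch, assuming the `{'loss': [...], 'score': [...]}` layout used for `global_history`; the `summarize` helper and the `example_history` name are made up, and the values are copied from the eight folds printed above.

```python
from statistics import mean, stdev

# per-fold values copied from the CatBoost output above
example_history = {
    "catboost": {
        "loss": [1.884, 1.776, 2.008, 1.610, 2.022, 1.916, 2.367, 1.990],
        "score": [0.593, 0.782, 0.824, 0.823, 0.730, 0.836, 0.598, 0.626],
    },
}


def summarize(history_by_model: dict) -> dict:
    """Return mean and sample standard deviation of RMSE and R2 per model."""
    summary = {}
    for name, history in history_by_model.items():
        summary[name] = {
            "rmse_mean": mean(history["loss"]),
            "rmse_std": stdev(history["loss"]),
            "r2_mean": mean(history["score"]),
            "r2_std": stdev(history["score"]),
        }
    return summary


if __name__ == "__main__":
    for name, stats in summarize(example_history).items():
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

for this run the mean RMSE is about 1.95 and the mean R2 about 0.73; the spread matters as much as the mean here, since R2 ranges from 0.593 to 0.836 across folds.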


## RandomForest

this one requires manually encoding some features, since it can't process categorical features out-of-the-box. again, planning to see the scores I get and compare to others
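
one possible way to do the manual encoding, sketched with pandas only: learn the category list on the train fold and map anything unseen in validation or test to `-1`. the helper names and the `category` column are hypothetical, and this is just one option (integer codes impose an arbitrary order, which trees usually tolerate but which is still an assumption worth checking).

```python
import pandas as pd


def fit_categories(train: pd.DataFrame, cat_cols: list[str]) -> dict[str, pd.Index]:
    """Remember the sorted unique values of each categorical column."""
    return {col: pd.Index(train[col].unique()).sort_values() for col in cat_cols}


def encode(df: pd.DataFrame, categories: dict[str, pd.Index]) -> pd.DataFrame:
    """Replace each categorical column with integer codes (-1 = unseen value)."""
    out = df.copy()
    for col, cats in categories.items():
        out[col] = pd.Categorical(out[col], categories=cats).codes
    return out


if __name__ == "__main__":
    # toy frames with a hypothetical string column
    train = pd.DataFrame({"category": ["games", "books", "games"]})
    val = pd.DataFrame({"category": ["books", "music"]})
    cats = fit_categories(train, ["category"])
    print(encode(train, cats)["category"].tolist())
    print(encode(val, cats)["category"].tolist())
```

fitting the categories on the train fold only keeps the validation fold honest: a shop or item that first appears later is encoded as unseen instead of leaking into the mapping.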

## RNN

requires encoding categorical features too.