## Raw ideas


so here are a few quick ideas for a validation schema:

- data shouldn't be shuffled before training, because it's essential to keep the time series values in consecutive order
- currently I came up with three ideas for a schema:

  1. `1,2train+3valid` -> `1,2,3train+4valid` -> `1,2,3,4train+5valid` etc.

     **Advantages**:

     - the train set grows, which probably helps to test the model's robustness
     - we always validate on a different part of the dataset

     **Disadvantages**:

     - computationally heavy: at some point the train set will contain almost the entire dataset

  2. `1,2train+3valid` -> `2,3train+4valid` -> `3,4train+5valid`

     **Advantages**:

     - fixed train size, no computational power issues
     - we can use our dataset somewhat efficiently, since each fold is used several times during different training sessions

     **Disadvantages**:

     - the fact that different sets overlap might be a problem (shouldn't be, but who knows)

  3. `1,2train+3valid` -> `4,5train+6valid` -> `7,8train+9valid`

     **Advantages**:

     - the easiest one to implement, goes through each fold once only
     - it's guaranteed that there's no overlapping or data leaks

     **Disadvantages**:

     - data hungry in some sense: each fold is used only once and each training session takes $f_{train} + 1$ folds, where $f_{train}$ is the number of folds in the training set. this might be a problem assuming we have only 33 folds at maximum
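
a minimal sketch of the three schemas above, just to make the fold arithmetic concrete. it only produces fold numbers (1-based, like in the notes) and is not the `FoldLoader` from `src.utils`; the function names and the `train_size` / `min_train` parameters are made up for illustration.

```python
from typing import Iterator

# (train fold numbers, validation fold numbers)
Split = tuple[list[int], list[int]]


def expanding_window(n_folds: int, min_train: int = 2) -> Iterator[Split]:
    """Schema 1: the train set grows by one fold at every step."""
    for val in range(min_train + 1, n_folds + 1):
        yield list(range(1, val)), [val]


def sliding_window(n_folds: int, train_size: int = 2) -> Iterator[Split]:
    """Schema 2: a fixed-size train window that moves one fold at a time."""
    for val in range(train_size + 1, n_folds + 1):
        yield list(range(val - train_size, val)), [val]


def disjoint_blocks(n_folds: int, train_size: int = 2) -> Iterator[Split]:
    """Schema 3: non-overlapping blocks of train_size + 1 folds each."""
    step = train_size + 1
    for start in range(1, n_folds - step + 2, step):
        yield list(range(start, start + train_size)), [start + train_size]


for name, scheme in [('expanding', expanding_window),
                     ('sliding', sliding_window),
                     ('disjoint', disjoint_blocks)]:
    print(name, len(list(scheme(33))), 'training sessions')
```

with 33 folds and 2 train folds per session, schemas 1 and 2 give 31 training sessions each, while schema 3 gives only 11, which is the "data hungry" part.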


model ideas:

- **decision tree-based algorithms (boosting, random forest)**:

  seems to be a good choice for the task since we have a lot of categorical features and the target is a discrete variable (which is typical for decision tree regression). at the same time, trees overfit easily, which can cause a huge loss in robustness and very inaccurate predictions on unseen data. ensembles help with this, but the overfitting problem won't go away completely. also, decision trees don't work well once something completely unseen shows up in the input data, which might be a problem during future usage

- **auto-regressive models**:

  a time-series-specific solution that's supposed to do a good job at predicting future sales based on historical data. the main problems are the following: (1) I don't have any experience working with them (but we'll still try, I guess); (2) the data is still quite noisy, so using AR models will require some additional preprocessing

- **RNNs**:

  one of the hardest models to implement, but at the same time a really powerful (in theory) solution that by definition fits the task quite well. RNNs can be really good when it comes to analyzing consecutive data (like text processing or time series)
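
to get a feeling for the auto-regressive idea above, here's a rough sketch of an AR(p) model fitted with plain least squares in numpy. it's an illustration under assumptions, not a planned implementation: `fit_ar` and `predict_next` are made-up helper names, and the series is a noiseless toy example rather than the sales data.

```python
import numpy as np


def fit_ar(series: np.ndarray, p: int) -> np.ndarray:
    """Least-squares fit of y_t = c + a_1*y_{t-1} + ... + a_p*y_{t-p}.

    Returns the coefficients as [c, a_1, ..., a_p].
    """
    y = series[p:]
    # column k holds the series lagged by k + 1 steps
    lags = np.column_stack(
        [series[p - k - 1:len(series) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(y)), lags])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict_next(series: np.ndarray, coef: np.ndarray) -> float:
    """One-step-ahead forecast from the last p observations."""
    p = len(coef) - 1
    recent = series[-1:-p - 1:-1]  # y_{t-1}, ..., y_{t-p}
    return float(coef[0] + coef[1:] @ recent)


# toy series generated by y_t = 2 + 0.5 * y_{t-1} (made-up numbers)
series = np.empty(20)
series[0] = 10.0
for t in range(1, 20):
    series[t] = 2.0 + 0.5 * series[t - 1]

coef = fit_ar(series, p=1)
forecast = predict_next(series, coef)
print(coef, forecast)
```

on real, noisy data the fit won't be exact like it is here, which is where the extra preprocessing mentioned above would come in.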


useful additional features:

- day of the week
- month
- year
- lagged values (look them up in `EDA`)
- ...
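
a possible way to build the lagged values from the list above, assuming the data is already aggregated to one row per (`date_block_num`, `shop_id`, `item_id`). `add_lags` is a hypothetical helper and the demo numbers are made up. it shifts by month number (via a merge) instead of by row, so a missing month gives a NaN rather than silently picking an older value.

```python
import pandas as pd


def add_lags(df: pd.DataFrame, target: str, lags: list[int]) -> pd.DataFrame:
    """Add the target value from `lag` months earlier for the same shop/item pair."""
    keys = ['date_block_num', 'shop_id', 'item_id']
    base = df[keys + [target]]
    out = df.copy()
    for lag in lags:
        shifted = base.copy()
        # a value observed in month m becomes the lag feature of month m + lag
        shifted['date_block_num'] += lag
        shifted = shifted.rename(columns={target: f'{target}_lag_{lag}'})
        out = out.merge(shifted, on=keys, how='left')
    return out


# made-up demo data: shop 3 has no record for month 1
demo = pd.DataFrame({
    'date_block_num': [0, 1, 2, 0, 2],
    'shop_id': [2, 2, 2, 3, 3],
    'item_id': [27, 27, 27, 27, 27],
    'item_cnt_day': [1, 4, 2, 5, 7],
})
demo = add_lags(demo, 'item_cnt_day', lags=[1, 2])
print(demo)
```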


## Loading and preparing the data

the point is to load the data in a format that's actually going to be fed to the model


In [1]:
from sys import path
path.append('../')


In [2]:
from src.utils import FoldLoader
from src.transform import ETL
from sklearn.ensemble import RandomForestRegressor
from tqdm.notebook import tqdm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from matplotlib import pyplot as plt
import os
import seaborn as sns
import catboost as cb
import numpy as np
import pandas as pd
import re
from math import sin, cos, pi
from warnings import filterwarnings
filterwarnings('ignore', category=FutureWarning)



In [3]:
DATA_DIR = '../data/competitive-data-science-predict-future-sales/'

In [4]:
def etl_process():
    pipeline = ETL(DATA_DIR, ['sales_train.csv',
                   'items', 'shops', 'item_categories'])
    pipeline.extract()
    pipeline.tarnsform()
    pipeline.load(os.path.join(DATA_DIR, 'processed_files/'))

In [5]:
# uncomment if data reloading is needed (made this specifically after switching
# the working device since the data dir is not on github)
etl_process()

0.00% of data had empty values and was removed in sales
0.00% of data had empty values and was removed in items
0.00% of data had empty values and was removed in shops
0.00% of data had empty values and was removed in item_categories
resolving noisy values in shops...
resolving noisy values in items...
saving...


In [6]:
data = pd.read_csv(os.path.join(
    DATA_DIR, 'processed_files/processed_data.csv'))
categories = pd.read_csv(os.path.join(DATA_DIR, 'item_categories.csv'))
items = pd.read_csv(os.path.join(DATA_DIR, 'items.csv'))

item_cats = items.merge(categories, how='inner', on='item_category_id')
test = pd.read_csv(os.path.join(DATA_DIR, 'test.csv'))
test['date_block_num'] = 34

In [7]:
columns_to_drop = ['date', 'item_name', 'shop_name']
data = data.drop(columns=columns_to_drop)

In [8]:
agg_cols = ['date_block_num', 'shop_id', 'item_id', 'item_category_name']

In [9]:
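# aggregate daily rows into monthly rows per shop/item
# note: sum() is applied to every remaining numeric column, so item_price and
# item_category_id get summed together with item_cnt_day
grouped = data.groupby(agg_cols, as_index=False).sum()
grouped_test = test.groupby(agg_cols[:-1], as_index=False).sum()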

## Feature extraction


In [10]:
def extract_features(data: pd.DataFrame, from_cols: list[str], to_cols: list[str], using: list[callable]):
    # weird idea but should work: apply one function per source column
    # doesn't work for features that depend on several columns, but we won't use those I suppose
    _data = data.copy()
    if not (len(from_cols) == len(to_cols) == len(using)):
        raise ValueError('list sizes have to match')
    for f, t, u in zip(from_cols, to_cols, using):
        _data[t] = _data[f].apply(u)
    return _data

In [11]:
new_features = [
    'month_sin',
    'month_cos',
    'item_cat',
    'general_cat'
]
sources = [
    'date_block_num',
    'date_block_num',
    'item_id',
    'item_id'
]

lookup = {}


def month_wrapper(func: callable):
    def ex_month(x):
        month = x % 12 + 1
        return func(2*pi*month/12)
    return ex_month


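def fetch_cat(item_id):
    # cache the category name per item so the dataframe is scanned once per id
    if item_id not in lookup:
        lookup[item_id] = item_cats[item_cats['item_id'] == item_id]['item_category_name'].unique()[0]
    return lookup[item_id]


def gen_cat(item_id):
    # general category = the part before ' - ' (e.g. 'Кино - Blu-Ray' -> 'Кино');
    # everything that starts with 'Игры' goes into a single 'Игры' group
    cat = fetch_cat(item_id)
    return 'Игры' if cat.startswith('Игры') \
        else re.split(r'\s*-\s*', cat)[0]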


extractors = [
    month_wrapper(sin),
    month_wrapper(cos),
    fetch_cat,
    gen_cat
]
grouped = extract_features(grouped, sources, new_features, extractors)
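
a quick check of why the month goes in as a sin/cos pair instead of a plain number: the pair puts the months on a circle, so December and January end up as neighbours. this is a standalone sketch using the same formula as `month_wrapper`; `encode_month` and `distance` are made-up names, and it assumes `date_block_num` 0 is a January.

```python
from math import sin, cos, pi, hypot


def encode_month(date_block_num: int) -> tuple[float, float]:
    """Map a month index to a point on the unit circle."""
    month = date_block_num % 12 + 1
    angle = 2 * pi * month / 12
    return sin(angle), cos(angle)


def distance(a: int, b: int) -> float:
    """Euclidean distance between two encoded months."""
    (s1, c1), (s2, c2) = encode_month(a), encode_month(b)
    return hypot(s1 - s2, c1 - c2)


# December (block 11) -> January (block 12) vs January (block 0) -> February (block 1)
print(distance(11, 12), distance(0, 1))
# June (block 5) and December (block 11) sit on opposite sides of the circle
print(distance(5, 11))
```

with a plain month number, December (12) and January (1) would look like the two most distant months, which is exactly what the encoding avoids.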

In [12]:
grouped.head(10)

|   | date_block_num | shop_id | item_id | item_category_name | item_price | item_cnt_day | item_category_id | month_sin | month_cos | item_cat | general_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 27 | Игры - PS3 | 2499.0 | 1 | 19 | 0.5 | 0.866025 | Игры - PS3 | Игры |
| 1 | 0 | 2 | 33 | Кино - Blu-Ray | 499.0 | 1 | 37 | 0.5 | 0.866025 | Кино - Blu-Ray | Кино |
| 2 | 0 | 2 | 317 | Книги - Аудиокниги 1С | 299.0 | 1 | 45 | 0.5 | 0.866025 | Книги - Аудиокниги 1С | Книги |
| 3 | 0 | 2 | 438 | Книги - Аудиокниги 1С | 299.0 | 1 | 45 | 0.5 | 0.866025 | Книги - Аудиокниги 1С | Книги |
| 4 | 0 | 2 | 471 | Книги - Методические материалы 1С | 798.0 | 2 | 98 | 0.5 | 0.866025 | Книги - Методические материалы 1С | Книги |
| 5 | 0 | 2 | 481 | Книги - Методические материалы 1С | 330.0 | 1 | 49 | 0.5 | 0.866025 | Книги - Методические материалы 1С | Книги |
| 6 | 0 | 2 | 482 | Программы - 1С:Предприятие 8 | 3300.0 | 1 | 73 | 0.5 | 0.866025 | Программы - 1С:Предприятие 8 | Программы |
| 7 | 0 | 2 | 484 | Программы - 1С:Предприятие 8 | 600.0 | 2 | 146 | 0.5 | 0.866025 | Программы - 1С:Предприятие 8 | Программы |
| 8 | 0 | 2 | 491 | Программы - 1С:Предприятие 8 | 600.0 | 1 | 73 | 0.5 | 0.866025 | Программы - 1С:Предприятие 8 | Программы |
| 9 | 0 | 2 | 534 | Программы - Обучающие | 798.0 | 2 | 154 | 0.5 | 0.866025 | Программы - Обучающие | Программы |


## Catboost


In [13]:
in_features = ['shop_id', 'item_id', 'item_cat',
               'month_sin', 'month_cos', 'general_cat']
target = 'item_cnt_day'
cat_cols = ['shop_id', 'item_id', 'item_cat', 'general_cat']

In [14]:
def train_model(train: pd.DataFrame, cat_cols: list[str], in_features: list[str], target: str, lower: int = 0, upper: int = 33) -> cb.CatBoostRegressor:
    # trains on months [lower, upper) and validates on month `upper`
    mask_1 = train['date_block_num'] < upper
    mask_2 = train['date_block_num'] >= lower
    train_data = cb.Pool(train[mask_1 & mask_2][in_features],
                         train[mask_1 & mask_2][target],
                         cat_features=cat_cols)
    val_data = cb.Pool(train[train['date_block_num'] == upper][in_features],
                       train[train['date_block_num'] == upper][target],
                       cat_features=cat_cols)

    model = cb.CatBoostRegressor(cat_features=cat_cols, random_seed=67)
    model.fit(train_data, eval_set=val_data, use_best_model=True,
              verbose=True, early_stopping_rounds=100)
    return model

In [15]:
grouped.head()

|   | date_block_num | shop_id | item_id | item_category_name | item_price | item_cnt_day | item_category_id | month_sin | month_cos | item_cat | general_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 27 | Игры - PS3 | 2499.0 | 1 | 19 | 0.5 | 0.866025 | Игры - PS3 | Игры |
| 1 | 0 | 2 | 33 | Кино - Blu-Ray | 499.0 | 1 | 37 | 0.5 | 0.866025 | Кино - Blu-Ray | Кино |
| 2 | 0 | 2 | 317 | Книги - Аудиокниги 1С | 299.0 | 1 | 45 | 0.5 | 0.866025 | Книги - Аудиокниги 1С | Книги |
| 3 | 0 | 2 | 438 | Книги - Аудиокниги 1С | 299.0 | 1 | 45 | 0.5 | 0.866025 | Книги - Аудиокниги 1С | Книги |
| 4 | 0 | 2 | 471 | Книги - Методические материалы 1С | 798.0 | 2 | 98 | 0.5 | 0.866025 | Книги - Методические материалы 1С | Книги |


In [16]:
print(grouped['item_cnt_day'].min(), grouped['item_cnt_day'].max())

-13 167


In [17]:
catboost_model = train_model(
    grouped, cat_cols, in_features, target, lower=0, upper=33)

Learning rate set to 0.161567
0:	learn: 3.6365442	test: 3.1328556	best: 3.1328556 (0)	total: 568ms	remaining: 9m 27s
1:	learn: 3.4675710	test: 2.9459217	best: 2.9459217 (1)	total: 953ms	remaining: 7m 55s
2:	learn: 3.3425742	test: 2.8216764	best: 2.8216764 (2)	total: 1.29s	remaining: 7m 10s
3:	learn: 3.2492664	test: 2.7477886	best: 2.7477886 (3)	total: 1.44s	remaining: 5m 58s
4:	learn: 3.1741964	test: 2.7009078	best: 2.7009078 (4)	total: 1.56s	remaining: 5m 10s
5:	learn: 3.1219240	test: 2.6676380	best: 2.6676380 (5)	total: 1.82s	remaining: 5m 1s
6:	learn: 3.0798157	test: 2.6583008	best: 2.6583008 (6)	total: 1.96s	remaining: 4m 38s
7:	learn: 3.0456407	test: 2.6566851	best: 2.6566851 (7)	total: 2.13s	remaining: 4m 24s
8:	learn: 3.0159017	test: 2.6687634	best: 2.6566851 (7)	total: 2.42s	remaining: 4m 26s
9:	learn: 2.9963849	test: 2.6675696	best: 2.6566851 (7)	total: 2.55s	remaining: 4m 12s
10:	learn: 2.9796710	test: 2.6666699	best: 2.6566851 (7)	total: 2.67s	remaining: 4m
11:	learn: 2.9648

In [18]:
test = extract_features(test, sources, new_features, extractors)

In [19]:
predicts = catboost_model.predict(test[in_features])
submission = pd.DataFrame({'ID': range(len(predicts)), 'item_cnt_month': predicts})
filename = os.path.join(DATA_DIR, 'submission.csv')
submission.to_csv(filename, index=False)

In [20]:
global_history = dict()

In [21]:
loader = FoldLoader(grouped, 12, folding_mode='stack')

In [23]:
template = 'fold: [{:2} out of {:2}]\tR2-score: [{:3.3f}]\tRMSE: [{:3.3f}]'
history = {'loss': [], 'score': []}
loader.reset_folds()
model = cb.CatBoostRegressor(cat_features=cat_cols, random_seed=67)
for index, (train, val) in enumerate(tqdm(loader), start=1):
    # build the training pool for the current fold
    train_data = cb.Pool(train[in_features],
                         train[target], cat_features=cat_cols)
    # calling fit() again retrains from scratch, so this also resets the model
    model.fit(train_data, verbose=False)

    # validate on the held-out fold
    preds = model.predict(val[in_features])
    rmse = np.sqrt(mean_squared_error(val[target], preds))
    r2 = r2_score(val[target], preds)
    history['score'].append(r2)
    history['loss'].append(rmse)
    print(template.format(index, len(loader), r2, rmse))
global_history['catboost'] = history

In [None]:
loader = FoldLoader(data, [in_features])

fold: [ 1 out of  8]	R2-score: [0.593]	RMSE: [1.884]
fold: [ 2 out of  8]	R2-score: [0.782]	RMSE: [1.776]
fold: [ 3 out of  8]	R2-score: [0.824]	RMSE: [2.008]
fold: [ 4 out of  8]	R2-score: [0.823]	RMSE: [1.610]
fold: [ 5 out of  8]	R2-score: [0.730]	RMSE: [2.022]
fold: [ 6 out of  8]	R2-score: [0.836]	RMSE: [1.916]
fold: [ 7 out of  8]	R2-score: [0.598]	RMSE: [2.367]
fold: [ 8 out of  8]	R2-score: [0.626]	RMSE: [1.990]
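
to compare models later it's handy to squash the per-fold numbers into a mean and a spread. a small sketch, with the scores copied by hand from the log above; `summarize` is a made-up helper, and in the notebook the `history` dict from the loop could be passed in directly.

```python
import numpy as np

# per-fold scores copied from the printed log
history = {
    'score': [0.593, 0.782, 0.824, 0.823, 0.730, 0.836, 0.598, 0.626],
    'loss': [1.884, 1.776, 2.008, 1.610, 2.022, 1.916, 2.367, 1.990],
}


def summarize(history: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Return (mean, std) across folds for every tracked metric."""
    return {name: (float(np.mean(vals)), float(np.std(vals)))
            for name, vals in history.items()}


summary = summarize(history)
for name, (mean, std) in summary.items():
    print(f'{name}: {mean:.3f} +/- {std:.3f}')
```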


## RandomForest

this one requires manually encoding some features, since it can't process categorical features out-of-the-box. again, the plan is to see what scores I get and compare them to the other models
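
one possible way to do the encoding, sketched with scikit-learn's `OrdinalEncoder` inside a pipeline. this is an assumption about how it could look, not a decision: the toy rows are taken from the `grouped.head()` output, the hyperparameters are arbitrary, and `shop_id` / `item_id` are passed through as plain integers since they're already numeric.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

str_cols = ['item_cat', 'general_cat']
num_cols = ['shop_id', 'item_id', 'month_sin', 'month_cos']

# string categories -> integer codes; unseen categories at predict time get -1
encoder = ColumnTransformer(
    [('cat', OrdinalEncoder(handle_unknown='use_encoded_value',
                            unknown_value=-1), str_cols)],
    remainder='passthrough',
)
rf_model = make_pipeline(
    encoder, RandomForestRegressor(n_estimators=50, random_state=67))

# a few rows copied from the grouped dataframe shown earlier
toy = pd.DataFrame({
    'shop_id': [2, 2, 2, 2],
    'item_id': [27, 33, 317, 471],
    'item_cat': ['Игры - PS3', 'Кино - Blu-Ray', 'Книги - Аудиокниги 1С',
                 'Книги - Методические материалы 1С'],
    'general_cat': ['Игры', 'Кино', 'Книги', 'Книги'],
    'month_sin': [0.5] * 4,
    'month_cos': [0.866025] * 4,
    'item_cnt_day': [1, 1, 1, 2],
})
rf_model.fit(toy[str_cols + num_cols], toy['item_cnt_day'])

# hypothetical row with a category the encoder has never seen
unseen = pd.DataFrame({
    'shop_id': [2], 'item_id': [9999],
    'item_cat': ['unseen category'], 'general_cat': ['unseen'],
    'month_sin': [0.5], 'month_cos': [0.866025],
})
preds = rf_model.predict(unseen[str_cols + num_cols])
print(preds)
```

ordinal codes impose an arbitrary order on the categories, which trees can usually live with; one-hot or target encoding would be alternatives to try if the scores look bad.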


## RNN

requires encoding categorical features too.
