## Raw ideas

so here are a few quick ideas for a validation scheme:

- data shouldn't be shuffled before training, because the time series values have to stay in consecutive order
- so far I've come up with three candidate schemes:
    1. `1,2train+3valid` -> `1,2,3train+4valid` -> `1,2,3,4train+5valid` etc.

        **Advantages**:
        - the train set grows, which probably helps to test the model's robustness
        - we always validate on a different part of the dataset

        **Disadvantages**:
        - computationally heavy: at some point the train set will contain almost the entire dataset
    2. `1,2train+3valid` -> `2,3train+4valid` -> `3,4train+5valid`

        **Advantages**:
        - fixed train size, no computational power issues
        - we use the dataset fairly efficiently, since each fold is reused in several training sessions
        
        **Disadvantages**:
        - the fact that different sets overlap might be a problem (shouldn't be, but who knows)
    3.  `1,2train+3valid` -> `4,5train+6valid` -> `7,8train+9valid`

        **Advantages**:
        - the easiest one to implement, goes through each fold only once
        - it's guaranteed that the sets don't overlap and there are no data leaks
        
        **Disadvantages**:
        - data hungry in some sense: each fold is used only once, and each training session takes $f_{train} + 1$ folds, where $f_{train}$ is the number of folds in the training set. this might be a problem given that we have only 33 folds at most
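
to compare the three schemes side by side, here's a minimal sketch that only generates fold numbers (no data involved). `fold_windows` and its parameters are names made up for this illustration, and folds are numbered from 1 as in the examples above:

```python
from typing import Iterator


def fold_windows(n_folds: int, n_train: int, n_valid: int = 1,
                 mode: str = "sequence") -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_folds, valid_folds) lists of fold numbers, starting from 1."""
    start, size = 1, n_train
    # stop as soon as the validation part would run past the last fold
    while start + size + n_valid - 1 <= n_folds:
        train = list(range(start, start + size))
        valid = list(range(start + size, start + size + n_valid))
        yield train, valid
        if mode == "stack":        # scheme 1: same start, growing train set
            size += 1
        elif mode == "overlap":    # scheme 2: slide the window by one fold
            start += 1
        elif mode == "sequence":   # scheme 3: jump past the whole window
            start += size + n_valid
        else:
            raise ValueError(f"unknown mode: {mode!r}")


for mode in ("stack", "overlap", "sequence"):
    windows = list(fold_windows(9, n_train=2, mode=mode))
    print(mode, len(windows), windows[:3])
```

counting the windows each mode yields for a given number of folds makes the "data hungry" trade-off of scheme 3 easy to see.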

model ideas:

- **decision tree-based algorithms (boosting, random forest)**:

    seems to be a good choice for the task, since we have a lot of categorical features and the target is a discrete variable (a common setting for decision tree regression). at the same time, trees overfit easily, which can cost a lot of robustness and give very inaccurate predictions on unseen data. ensembles help with this, but the overfitting problem won't go away completely. also, tree-based models can't extrapolate beyond what they saw in training, so they don't work well once something completely unseen shows up in the input data, which might be a problem in future usage
- **auto-regressive models**:

    a time-series-specific solution that's supposed to do a good job at predicting future sales from historical data. the main problems are:

    1. I don't have any experience working with them (but we'll still try, I guess)
    2. the data is still quite noisy, so using AR models will require some additional preprocessing
- **RNNs**:

    one of the hardest models to implement, but at the same time a really powerful (in theory) solution that fits the task quite well by design. RNNs can be really good at analyzing sequential data (like text or time series)
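
since auto-regressive models are new territory here, a tiny sketch of the simplest case may help: an AR(1) model $y_t = c + \varphi y_{t-1}$ fitted by ordinary least squares. this is only an illustration of the idea, not the model used in this notebook; `fit_ar1`, `forecast` and the toy series are made up:

```python
def fit_ar1(series: list[float]) -> tuple[float, float]:
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]  # pairs (previous value, current value)
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    return mean_y - phi * mean_x, phi


def forecast(series: list[float], steps: int) -> list[float]:
    """Roll the fitted AR(1) equation forward `steps` times."""
    c, phi = fit_ar1(series)
    last, out = series[-1], []
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out


# toy series generated exactly by y[t] = 2 + 0.5 * y[t-1] (made-up numbers)
toy = [10.0, 7.0, 5.5, 4.75, 4.375]
print(fit_ar1(toy))
print(forecast(toy, 2))
```

real sales data is noisy, so a plain fit like this would probably need the extra preprocessing mentioned above.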


useful additional features:

- day of the week
- month
- year
- lagged values (look them up in `EDA`)
- ...
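
a rough sketch of how the calendar and lag features could be built with pandas. the toy rows are made up, the lag of one month is an assumption (the useful lags should come from `EDA`), and `cnt_lag_1` is a hypothetical column name:

```python
import pandas as pd

# toy sales rows in the notebook's column layout (values are made up)
sales = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-05", "2013-02-10", "2013-03-15",
                            "2013-01-20", "2013-02-25"]),
    "date_block_num": [0, 1, 2, 0, 1],
    "shop_id": [2, 2, 2, 3, 3],
    "item_id": [27, 27, 27, 27, 27],
    "item_cnt_day": [1, 3, 2, 5, 4],
})

# calendar features straight from the timestamp
sales["day_of_week"] = sales["date"].dt.dayofweek  # Monday = 0
sales["month"] = sales["date"].dt.month
sales["year"] = sales["date"].dt.year

# lagged target: the previous row of the same shop/item pair.
# this equals "last month's count" only if there is exactly one row
# per (month, shop, item) and no month is missing for the pair
sales = sales.sort_values(["shop_id", "item_id", "date_block_num"])
sales["cnt_lag_1"] = sales.groupby(["shop_id", "item_id"])["item_cnt_day"].shift(1)
print(sales)
```

the first month of every shop/item pair has no history, so its lag is `NaN` and has to be filled or dropped before training.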


In [107]:
import pandas as pd
from warnings import filterwarnings
filterwarnings('ignore', category=FutureWarning)
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor

## Loading the data

the point is to load the data in a format that's actually going to be fed to the model

In [3]:
data = pd.read_csv('../data/processed_files/processed_data.csv')
data['date'] = pd.to_datetime(data['date'])

In [7]:
data.columns

Index(['date', 'date_block_num', 'shop_id', 'item_id', 'item_price',
       'item_cnt_day', 'item_name', 'item_category_id', 'shop_name',
       'item_category_name'],
      dtype='object')

In [15]:
def trim_group(data: pd.DataFrame):
    # maybe should be moved to the loader class
    trimmed = data[['date_block_num', 'shop_id',
                    'item_id', 'item_price', 'item_cnt_day']]
    trimmed = trimmed.groupby(
        ['date_block_num', 'shop_id', 'item_id'], as_index=False).sum()
    return trimmed
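
note that `trim_group` calls `.sum()` on every non-key column, so `item_price` ends up as the sum of the daily prices within a month rather than a price level. if an average monthly price is what the model should see, a variant along these lines might work. `trim_group_mean_price` and the toy rows are made up for illustration, and the cells below still use the original `trim_group`:

```python
import pandas as pd


def trim_group_mean_price(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant of `trim_group`: total count, mean price per month."""
    return data.groupby(
        ["date_block_num", "shop_id", "item_id"], as_index=False
    ).agg(
        item_price=("item_price", "mean"),     # average price over the month
        item_cnt_day=("item_cnt_day", "sum"),  # total units sold in the month
    )


# made-up rows: the same item sold on two days of month 0, once in month 1
toy = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id": [2, 2, 2],
    "item_id": [471, 471, 471],
    "item_price": [399.0, 399.0, 420.0],
    "item_cnt_day": [1, 1, 3],
})
result = trim_group_mean_price(toy)
print(result)
```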

In [78]:
trimmed = trim_group(data)

In [100]:
trimmed.head()

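|   | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|----------------|---------|---------|------------|--------------|
| 0 | 0              | 2       | 27      | 2499.0     | 1            |
| 1 | 0              | 2       | 33      | 499.0      | 1            |
| 2 | 0              | 2       | 317     | 299.0      | 1            |
| 3 | 0              | 2       | 438     | 299.0      | 1            |
| 4 | 0              | 2       | 471     | 798.0      | 2            |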


In [105]:
# TODO this should be moved to a separate script once I get everything working
class FoldLoader:
    def __init__(self, data: pd.DataFrame, n_train_folds: int, n_valid_folds: int = 1, mode: str = 'sequence') -> None:
        """
        splits monthly data into consecutive training and validation sets
        without shuffling, one fold per month (`date_block_num`)

        Keyword arguments:

        data -- a pandas dataframe containing the training data.
        must have a `date_block_num` column

        n_train_folds -- the number of months used to train the model.
        must be less than or equal to the number of months in `data` minus n_valid_folds.
        if `mode` is 'stack', only sets the train size for the first iteration

        n_valid_folds -- the number of months used to validate the model.
        must be less than or equal to the number of months in `data` minus n_train_folds
        Default: 1

        mode -- one of "sequence", "overlap", "stack" (matched by prefix, so "seq" and "over" work too)
        a string denoting the data splitting mode:
            - `sequence` for `1,2train+3valid` -> `4,5train+6valid` -> `7,8train+9valid`
            - `overlap` for `1,2train+3valid` -> `2,3train+4valid` -> `3,4train+5valid`
            - `stack` for `1,2train+3valid` -> `1,2,3train+4valid` -> `1,2,3,4train+5valid`
        Default: `sequence`
        """
        months = data['date_block_num']
        self.min_fold = months.min()
        self.max_fold = months.max()
        self.current_fold = self.min_fold
        self.train_size = n_train_folds
        self.val_size = n_valid_folds
        # the fold count assumes month numbers are contiguous between min and max
        if (self.max_fold - self.min_fold + 1) < self.train_size + self.val_size:
            raise ValueError(
                'too many folds in train and valid for the given data')
        self.data = data
        self.mode = mode

    def _extract_features(self, fold) -> pd.DataFrame:
        # questionable, because we might want to use the entire dataset for feature extraction
        # TODO actual feature extraction
        return fold

    def _get_sets_internal(self, start_idx) -> tuple[pd.DataFrame, pd.DataFrame]:
        """
        selects the training and validation sets for the current iteration

        Keyword arguments:

        start_idx -- index of the fold (month) to start from
        """
        train_end = start_idx + self.train_size
        val_end = train_end + self.val_size
        months = self.data['date_block_num']
        # one boolean mask per set instead of month-by-month concatenation
        train = self.data[(months >= start_idx) & (months < train_end)].copy()
        val = self.data[(months >= train_end) & (months < val_end)].copy()
        return (train, val)

    def get_sets(self) -> tuple[pd.DataFrame, pd.DataFrame]:
        """
        calls `_get_sets_internal` to get training and validation sets.
        adjusts internal counters corresponding to the selected folding mode 
        """
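        start_idx = self.current_fold
        # refuse to build a window that runs past the last available month
        if start_idx + self.train_size + self.val_size - 1 > self.max_fold:
            raise IndexError('no folds left in the dataset')

        if self.mode.startswith('seq'):
            result = self._get_sets_internal(start_idx)
            # jump past the whole window, so consecutive windows never overlap
            self.current_fold += self.train_size + self.val_size
        elif self.mode.startswith('over'):
            result = self._get_sets_internal(start_idx)
            # slide the window forward by one month
            self.current_fold += 1
        elif self.mode.startswith('stack'):
            result = self._get_sets_internal(start_idx)
            # keep the start fixed and grow the train set by one month
            self.train_size += 1
        else:
            raise ValueError(f'unknown mode: {self.mode!r}')
        return result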

    def reset_folds(self) -> None:
        """
        sets the current fold counter to point to the first fold
        of the dataset, so the iteration starts over.
        note: the train size grown by previous calls in `stack` mode is not restored
        """
        self.current_fold = self.min_fold

    def validate(self, model):
        # conceptually:
        # should probably be deleted due to specific things in each model's training loop
        raise NotImplementedError

In [115]:
validator = FoldLoader(trimmed, 3, mode='stack')
train, val = validator.get_sets()
train, val = validator.get_sets()
train, val = validator.get_sets()
print(train['date_block_num'].value_counts())
val['date_block_num'].value_counts()

date_block_num
2    63532
0    62729
1    59503
3    54270
4    52960
Name: count, dtype: int64


date_block_num
5    55770
Name: count, dtype: int64

In [116]:
validator = FoldLoader(trimmed, 3, mode='seq')
train, val = validator.get_sets()
train, val = validator.get_sets()
train, val = validator.get_sets()
print(train['date_block_num'].value_counts())
val['date_block_num'].value_counts()

date_block_num
8     51087
10    51036
9     50560
Name: count, dtype: int64


date_block_num
11    65541
Name: count, dtype: int64

In [117]:
validator = FoldLoader(trimmed, 3, mode='over')
train, val = validator.get_sets()
train, val = validator.get_sets()
train, val = validator.get_sets()
print(train['date_block_num'].value_counts())
val['date_block_num'].value_counts()

date_block_num
2    63532
3    54270
4    52960
Name: count, dtype: int64


date_block_num
5    55770
Name: count, dtype: int64