source: https://www.kaggle.com/kyakovlev/m5-custom-features from https://www.kaggle.com/ejunichi/m5-three-shades-of-dark-darker-magic

In [1]:
import sys
import os
import pathlib
import gc
import pandas as pd
pd.set_option('display.max_columns', 500)
# pd.set_option('display.max_rows', 500)
import numpy as np
import math
import random
import pickle
import time
import psutil
import warnings

# custom import
from sklearn.preprocessing import LabelEncoder
from multiprocessing import Pool        # Multiprocess Runs
import lightgbm as lgb
from tqdm import tqdm
from scipy.sparse import csr_matrix
# warnings.filterwarnings('ignore')


# function fixing random seeds

In [2]:
def seed_everything(seed=0):
    """Sets seed to make all processes deterministic     # type: int
    
    """
    random.seed(seed)
    np.random.seed(seed)

SEED = 42
seed_everything(SEED)    

# constant variables for helper functions

In [3]:
N_CORES = psutil.cpu_count()     # Available CPU cores
print(f"N_CORES: {N_CORES}")

N_CORES: 36


#  constant variables for data import

In [4]:
# change this var according to the dataset you refer to 
# path to the source's pickle files
# _DATA_DIR = os.path.sep.join(["data", "M5_Three_shades_of_Dark_Darker_magic", "sample"])
_DATA_DIR = os.path.sep.join(["data", "M5_Three_shades_of_Dark_Darker_magic"])
_OUTPUT_DIR = os.path.sep.join(["data", "M5_Three_shades_of_Dark_Darker_magic"])
print(f"_DATA_DIR: {_DATA_DIR}")
_CALENDAR_CSV_FILE = "calendar.csv"
_SAMPLE_SUBMISSION_CSV_FILE = "sample_submission.csv"
_SALES_TRAIN_VALIDATION_CSV_FILE = "sales_train_validation.csv"
_SELL_PRICES_CSV_FILE = "sell_prices.csv"

#PATHS for Features
BASE = "clearned_base_grid_for_darker_magic.pkl"
PRICE = "base_grid_with_sales_price_features_for_darker_magic.pkl"
CALENDAR = "base_grid_with_calendar_features_for_darker_magic.pkl"

LAGS = "base_grid_with_lag_features_for_28_days.pkl"
MEAN_ENC = "base_grid_with_mean_encoded_ids_means_stds_for_darker_magic.pkl"


_DATA_DIR: data/M5_Three_shades_of_Dark_Darker_magic


# model hyperparameters and constant variables for training and test

In [5]:
# 'n_estimators': 1300 may be better
lgb_params = {
                    'boosting_type': 'gbdt',
                    'objective': 'tweedie',
                    'tweedie_variance_power': 1.1,
                    'metric': 'rmse',
                    'subsample': 0.5,
                    'subsample_freq': 1,
                    'learning_rate': 0.03,
                    'num_leaves': 2**11-1,
                    'min_data_in_leaf': 2**12-1,
                    'feature_fraction': 0.5,
                    'max_bin': 100,
                    'n_estimators': 1500, #特徴量を少し増やしたのでiterationの数も少し増やした。
                    'boost_from_average': False,
                    'verbose': -1,
                } 
# Let's look closer on params

## 'boosting_type': 'gbdt'
# we have 'goss' option for faster training
# but it normally leads to underfit.
# Also there is good 'dart' mode
# but it takes forever to train
# and model performance depends 
# a lot on random factor 
# https://www.kaggle.com/c/home-credit-default-risk/discussion/60921

## 'objective': 'tweedie'
# Tweedie Gradient Boosting for Extremely
# Unbalanced Zero-inflated Data
# https://arxiv.org/pdf/1811.10192.pdf
# and many more articles about tweediie
#
# Strange (for me) but Tweedie is close in results
# to my own ugly loss.
# My advice here - make OWN LOSS function
# https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/140564
# https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/143070
# I think many of you already using it (after poisson kernel appeared) 
# (kagglers are very good with "params" testing and tuning).
# Try to figure out why Tweedie works.
# probably it will show you new features options
# or data transformation (Target transformation?).

## 'tweedie_variance_power': 1.1
# default = 1.5
# set this closer to 2 to shift towards a Gamma distribution
# set this closer to 1 to shift towards a Poisson distribution
# my CV shows 1.1 is optimal 
# but you can make your own choice

## 'metric': 'rmse'
# Doesn't mean anything to us
# as competition metric is different
# and we don't use early stoppings here.
# So rmse serves just for general 
# model performance overview.
# Also we use "fake" validation set
# (as it makes part of the training set)
# so even general rmse score doesn't mean anything))
# https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/133834

## 'subsample': 0.5
# Serves to fight with overfit
# this will randomly select part of data without resampling
# Chosen by CV (my CV can be wrong!)
# Next kernel will be about CV

##'subsample_freq': 1
# frequency for bagging
# default value - seems ok

## 'learning_rate': 0.03
# Chosen by CV
# Smaller - longer training
# but there is an option to stop 
# in "local minimum"
# Bigger - faster training
# but there is a chance to
# not find "global minimum" minimum

## 'num_leaves': 2**11-1
## 'min_data_in_leaf': 2**12-1
# Force model to use more features
# We need it to reduce "recursive"
# error impact.
# Also it leads to overfit
# that's why we use small 
# 'max_bin': 100

## l1, l2 regularizations
# https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
# Good tiny explanation
# l2 can work with bigger num_leaves
# but my CV doesn't show boost
                    
## 'n_estimators': 1400
# CV shows that there should be
# different values for each state/store.
# Current value was chosen 
# for general purpose.
# As we don't use any early stopings
# careful to not overfit Public LB.

##'feature_fraction': 0.5
# LightGBM will randomly select 
# part of features on each iteration (tree).
# We have maaaany features
# and many of them are "duplicates"
# and many just "noise"
# good values here - 0.5-0.7 (by CV)

## 'boost_from_average': False
# There is some "problem"
# to code boost_from_average for 
# custom loss
# 'True' makes training faster
# BUT carefull use it
# https://github.com/microsoft/LightGBM/issues/1514
# not our case but good to know cons
#######################################

VER = 3                          # Our model version
SEED = 42                        # We want all things to be as deterministic as possible
seed_everything(SEED)            
lgb_params['seed'] = SEED        

#LIMITS and const
TARGET      = 'sales'            # Our target column name
START_DAY_TRAIN = 0                  # We can skip some rows (Nans/faster training)

PREDICTION_HORIZON_DAYS = 28                 # Prediction horizon
END_DAY_TRAIN   = 1913 - PREDICTION_HORIZON_DAYS # End day of our train set
_NUM_UNIQUE_ITEM_ID = 30490
# Use or not use pretrained models: make this true after completing model training.
# USE_AUX = True
USE_AUX = False

# FEATURES to remove.
# These features lead to overfit or values not present in test set
REMOVE_FEATURES = ['id','state_id','store_id', 'date','wm_yr_wk','d',TARGET]
MEAN_STD_FEATURES   = ['enc_cat_id_mean','enc_cat_id_std',
                   'enc_dept_id_mean','enc_dept_id_std',
                   'enc_item_id_mean','enc_item_id_std'] 

# AUX(pretrained) Models paths
PRETRAINED_MODEL_DIR = 'trained_model'

#SPLITS for lags creation
SHIFT_DAYS  = 28
ROLLING_SPLIT = []
# original 
for i in [1,7,14]:
    for j in [7,14,30,60]:
        ROLLING_SPLIT.append([i,j])
# 以下のようにFeatureを減らして実行するためには、read_data_by_store関数にて、除いたFeaatureの列を削除する処理を加える必要がある。削除対象はLAGS。
# for i in [1,7]:
#     for j in [7 ,30]:
#         ROLLING_SPLIT.append([i,j])

START_DAY_VALIDATION = END_DAY_TRAIN + 1
print(f"START_DAY_VALIDATION: {START_DAY_VALIDATION}")
START_DAY_EVALUATION = START_DAY_VALIDATION + PREDICTION_HORIZON_DAYS
print(f"START_DAY_EVALUATION: {START_DAY_EVALUATION}")

START_DAY_VALIDATION: 1886
START_DAY_EVALUATION: 1914


# function nicely diplaying a head of Pandas DataFrame

In [6]:
import IPython

def display(*dfs, head=True):
    for df in dfs:
        IPython.display.display(df.head() if head else df)

# function processing df in multiprocess

In [7]:
def run_df_in_multiprocess(func, t_split):
    """Process ds in Multiprocess
    
    """
    num_cores = np.min([N_CORES,len(t_split)])
    print(f"num_cores: {num_cores}")
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, t_split), axis=1)
    pool.close()
    pool.join()
    return df

# other helper functions

In [8]:
def get_memory_usage():
    """メモリ使用量を確認するためのシンプルな「メモリプロファイラ」
    
    """
    return np.round(psutil.Process(os.getpid()).memory_info()[0]/2.**30, 2) 
        
def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)


def merge_by_concat(df1, df2, merge_on):
    """
    dtypesを失わないための連結による結合
    
    """
    
    merged_gf = df1[merge_on]
    merged_gf = merged_gf.merge(df2, on=merge_on, how='left')
    new_columns = [col for col in list(merged_gf) if col not in merge_on]
    df1 = pd.concat([df1, merged_gf[new_columns]], axis=1)
    return df1


def get_base_test():
    """Recombines Test set after training
    
    """
    base_test = pd.DataFrame()

    for store_id in STORE_IDS:
        test_pkl_path = os.path.sep.join([PRETRAINED_MODEL_DIR, 'test_dataset_'+store_id+'.pkl'])
        temp_df = pd.read_pickle(test_pkl_path)
        temp_df['store_id'] = store_id
        base_test = pd.concat([base_test, temp_df]).reset_index(drop=True)
    
    return base_test



##### Helper to make dynamic rolling lags #####
def make_lag(lag_day):
    """
    
    """
    lag_df = base_test[['id','d',TARGET]]
    col_name = 'sales_lag_'+str(lag_day)
    lag_df[col_name] = lag_df.groupby(['id'])[TARGET].transform(lambda x: x.shift(lag_day)).astype(np.float16)
    return lag_df[[col_name]]


def make_lag_roll(lag_day):
    """
    
    """
    shift_day = lag_day[0]
    roll_wind = lag_day[1]
    lag_df = base_test[['id','d',TARGET]]
    col_name = 'rolling_mean_tmp_'+str(shift_day)+'_'+str(roll_wind)
    lag_df[col_name] = lag_df.groupby(['id'])[TARGET].transform(lambda x: x.shift(shift_day).rolling(roll_wind).mean())
    return lag_df[[col_name]]
##### Helper to make dynamic rolling lags #####

# function importing data

In [9]:
def reduce_mem_usage(df, verbose=True):
    """
    reduce the memory usage of the given dataframe.
    https://qiita.com/hiroyuki_kageyama/items/02865616811022f79754
    
    Args:
        df: Dataframe
        verbose: 
        
    Returns:
        df, whose memory usage is reduced.

    Raises:
        None
    """
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns: #columns毎に処理
        col_type = df[col].dtypes
        if col_type in numerics: #numericsのデータ型の範囲内のときに処理を実行. データの最大最小値を元にデータ型を効率的なものに変更
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

def read_csv_data(directory, file_name):
    print('Reading files...')
    df = pd.read_csv(os.path.sep.join([str(directory), _DATA_DIR, file_name]))
    df = reduce_mem_usage(df)
    print('{} has {} rows and {} columns'.format(file_name, df.shape[0], df.shape[1]))
    
    return df


def read_data_by_store(store):
#     # Read and contact basic feature
#     df = pd.concat([pd.read_pickle(BASE),
#                     pd.read_pickle(PRICE).iloc[:,2:],
#                     pd.read_pickle(CALENDAR).iloc[:,2:]],
#                     axis=1)

    # Read and contact basic feature
    parent_dir = pathlib.Path(os.path.abspath(os.curdir)).parent.parent
    df = pd.concat([pd.read_pickle(os.path.sep.join([str(parent_dir), _DATA_DIR, BASE])),
                    pd.read_pickle(os.path.sep.join([str(parent_dir), _DATA_DIR, PRICE])).iloc[:,2:],
                    pd.read_pickle(os.path.sep.join([str(parent_dir), _DATA_DIR, CALENDAR])).iloc[:,2:]],
                    axis=1)
#     print(f"df at read_data_by_store: {df}")
    
    # Leave only relevant store
    df = df[df['store_id']==store]

    # With memory limits we have to read lags and mean encoding features separately and drop items that we don't need.
    # As our Features Grids are aligned 
    # we can use index to keep only necessary rows
    # Alignment is good for us as concat uses less memory than merge.
    df2 = pd.read_pickle(os.path.sep.join([str(parent_dir), _DATA_DIR, MEAN_ENC]))[MEAN_STD_FEATURES]
    df2 = df2[df2.index.isin(df.index)]
    print(f"MEAN_ENC: {MEAN_ENC}")
    
    df3 = pd.read_pickle(os.path.sep.join([str(parent_dir), _DATA_DIR, LAGS])).iloc[:,3:]
    df3 = df3[df3.index.isin(df.index)]
    print(f"LAGS: {LAGS}")
    
    df = pd.concat([df, df2], axis=1)
    del df2 # to not reach memory limit 
    
    df = pd.concat([df, df3], axis=1)
    del df3 # to not reach memory limit 
    
    # Create features list
    features = [col for col in list(df) if col not in REMOVE_FEATURES]
    df = df[['id','d',TARGET]+features]
    
    # Skipping first n rows
    df = df[df['d']>=START_DAY_TRAIN].reset_index(drop=True)
    
    return df, features

# read csv data

In [10]:
# parent_dir = pathlib.Path(os.path.abspath(os.curdir)).parent.parent
# print(f"parent_dir: {parent_dir}")
# df_sales_train_validation = read_csv_data(parent_dir, _SALES_TRAIN_VALIDATION_CSV_FILE)
# #STORES ids
# STORE_IDS = df_sales_train_validation['store_id']
# STORE_IDS = list(STORE_IDS.unique())

STORE_IDS = ['CA_1', 'CA_2', 'CA_3', 'CA_4', 'TX_1', 'TX_2', 'TX_3', 'WI_1', 'WI_2', 'WI_3']
print(f"STORE_IDS: {STORE_IDS}")

STORE_IDS: ['CA_1', 'CA_2', 'CA_3', 'CA_4', 'TX_1', 'TX_2', 'TX_3', 'WI_1', 'WI_2', 'WI_3']


In [11]:
# print(df_sales_train_validation)

# read pretrained model if you want

In [12]:
########################### Aux Models
# If you don't want to wait hours and hours to have result,
# you can train each store in separate kernel and then just join result.
    
# Here is some 'logs' that can compare
#Train CA_1
#[100]	valid_0's rmse: 2.02289
#[200]	valid_0's rmse: 2.0017
#[300]	valid_0's rmse: 1.99239
#[400]	valid_0's rmse: 1.98471
#[500]	valid_0's rmse: 1.97923
#[600]	valid_0's rmse: 1.97284
#[700]	valid_0's rmse: 1.96763
#[800]	valid_0's rmse: 1.9624
#[900]	valid_0's rmse: 1.95673
#[1000]	valid_0's rmse: 1.95201
#[1100]	valid_0's rmse: 1.9476
#[1200]	valid_0's rmse: 1.9434
#[1300]	valid_0's rmse: 1.9392
#[1400]	valid_0's rmse: 1.93446

#Train CA_2
#[100]	valid_0's rmse: 1.88949
#[200]	valid_0's rmse: 1.84767
#[300]	valid_0's rmse: 1.83653
#[400]	valid_0's rmse: 1.82909
#[500]	valid_0's rmse: 1.82265
#[600]	valid_0's rmse: 1.81725
#[700]	valid_0's rmse: 1.81252
#[800]	valid_0's rmse: 1.80736
#[900]	valid_0's rmse: 1.80242
#[1000]	valid_0's rmse: 1.79821
#[1100]	valid_0's rmse: 1.794
#[1200]	valid_0's rmse: 1.78973
#[1300]	valid_0's rmse: 1.78552
#[1400]	valid_0's rmse: 1.78158

#############################

# calculate wrmsse (incomplete. skip as of 20200620)

In [13]:
# Read and contact basic feature
parent_dir = pathlib.Path(os.path.abspath(os.curdir)).parent.parent
base_df = pd.concat([pd.read_pickle(os.path.sep.join([str(parent_dir), _DATA_DIR, BASE])),
                pd.read_pickle(os.path.sep.join([str(parent_dir), _DATA_DIR, PRICE])).iloc[:,2:],
                pd.read_pickle(os.path.sep.join([str(parent_dir), _DATA_DIR, CALENDAR])).iloc[:,2:]],
                axis=1)

print (f"base_df.columns.values: {base_df.columns.values}")
print(f"base_df: {base_df}")

base_df.columns.values: ['id' 'item_id' 'dept_id' 'cat_id' 'store_id' 'state_id' 'd' 'sales'
 'release' 'sell_price' 'price_max' 'price_min' 'price_std' 'price_mean'
 'price_norm' 'price_nunique' 'item_nunique' 'price_momentum'
 'price_momentum_m' 'price_momentum_y' 'event_name_1' 'event_type_1'
 'event_name_2' 'event_type_2' 'snap_CA' 'snap_TX' 'snap_WI' 'tm_d' 'tm_w'
 'tm_m' 'tm_y' 'tm_wm' 'tm_dw' 'tm_w_end']
base_df:                                      id        item_id    dept_id   cat_id  \
0         HOBBIES_1_008_CA_1_validation  HOBBIES_1_008  HOBBIES_1  HOBBIES   
1         HOBBIES_1_009_CA_1_validation  HOBBIES_1_009  HOBBIES_1  HOBBIES   
2         HOBBIES_1_010_CA_1_validation  HOBBIES_1_010  HOBBIES_1  HOBBIES   
3         HOBBIES_1_012_CA_1_validation  HOBBIES_1_012  HOBBIES_1  HOBBIES   
4         HOBBIES_1_015_CA_1_validation  HOBBIES_1_015  HOBBIES_1  HOBBIES   
...                                 ...            ...        ...      ...   
46881672    FOODS_3_823_WI_3_v

In [14]:
# 予測期間とitem数の定義 / number of items, and number of prediction period
NUM_ITEMS = 30490
DAYS_PRED = 28

class WRMSSE(object):
    # WRMSSE calculation (source: https://www.kaggle.com/girmdshinsei/for-japanese-beginner-with-wrmsse-in-lgbm)
    # LightGBMのMetricとして, WRMSSEの効率的な計算を行う。あくまで, 28day-lagで1つのモデルの予測するときにLGBMで効率的なWRMSSEの計算を行う場合である。

    # weight_matという0 or 1の疎行列で、効率的にaggregation levelを行列積で計算出来るようにしている
    # LightGBMのMetricを効率的に計算するためにGroupby fucntionを使うことを避けているが、そのため、non-rezo demandのデータを除くと効率的な計算ができない。そのためすべてのitemでnon-zero demand dataとなっている最後の28日分のみで検証するコードとなっている.
    # Sparce matrixは順序がProductのItem通りになっていないといけないので注意。

    def __init__(self, sales_train_val, base_df):
                
        self.sales_train_val = sales_train_val
        
        self.product = sales_train_val[['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']].drop_duplicates()

        weight_mat = np.c_[np.ones([NUM_ITEMS,1]).astype(np.int8), # level 1
           pd.get_dummies(self.product.state_id.astype(str),drop_first=False).astype('int8').values,
           pd.get_dummies(self.product.store_id.astype(str),drop_first=False).astype('int8').values,
           pd.get_dummies(self.product.cat_id.astype(str),drop_first=False).astype('int8').values,
           pd.get_dummies(self.product.dept_id.astype(str),drop_first=False).astype('int8').values,
           pd.get_dummies(self.product.state_id.astype(str) + self.product.cat_id.astype(str),drop_first=False).astype('int8').values,
           pd.get_dummies(self.product.state_id.astype(str) + self.product.dept_id.astype(str),drop_first=False).astype('int8').values,
           pd.get_dummies(self.product.store_id.astype(str) + self.product.cat_id.astype(str),drop_first=False).astype('int8').values,
           pd.get_dummies(self.product.store_id.astype(str) + self.product.dept_id.astype(str),drop_first=False).astype('int8').values,
           pd.get_dummies(self.product.item_id.astype(str),drop_first=False).astype('int8').values,
           pd.get_dummies(self.product.state_id.astype(str) + self.product.item_id.astype(str),drop_first=False).astype('int8').values,
           np.identity(NUM_ITEMS).astype(np.int8) #item :level 12
           ].T
        self.weight_mat_csr = csr_matrix(weight_mat)
        print(f"self.weight_mat_csr: {self.weight_mat_csr}")
        del weight_mat; gc.collect()
        
        self.rmsse_demoninator = None
        self.weight = None
                        
    def weight_calc(self):
        """calculate the denominator of RMSSE, and calculate the weight base on sales amount
        
        """
        
        d_name = ['d_' + str(i+1) for i in range(1913)]

        sales_train_val = self.weight_mat_csr * self.sales_train_val[d_name].values
        print(f"sales_train_val: {sales_train_val}")
        print(f"sales_train_val.shape: {sales_train_val.shape}")

        # calculate the start position(first non-zero demand observed date) for each item / 商品の最初の売上日
        # 1-1914のdayの数列のうち, 売上が存在しない日を一旦0にし、0を9999に置換。そのうえでminimum numberを計算
        df_tmp = ((sales_train_val>0) * np.tile(np.arange(1,1914),(self.weight_mat_csr.shape[0],1)))

        start_no = np.min(np.where(df_tmp==0,9999,df_tmp),axis=1)-1

        flag = np.dot(np.diag(1/(start_no+1)) , np.tile(np.arange(1,1914),(self.weight_mat_csr.shape[0],1)))<1

        sales_train_val = np.where(flag,np.nan,sales_train_val)

        # denominator of RMSSE / RMSSEの分母
        rmsse_demoninator = np.nansum(np.diff(sales_train_val,axis=1)**2,axis=1)/(1913-start_no)

        # rmsse_demoninator == weight1
        self.rmsse_demoninator = rmsse_demoninator

#         2016/3/27より前を学習用、2016/3/27~2016/4/24（28day）を検証用として分割
#         （LightGBMのEarly stoppingの対象） 交差検証の方法はいろいろと検討余地あり。
#         calculate the sales amount for each item/level
#         (base_df == data)
#     df_tmp = data[(data['date'] > '2016-03-27') & (data['date'] <= '2016-04-24')]
#     df_tmp['amount'] = df_tmp['demand'] * df_tmp['sell_price']
#     df_tmp =df_tmp.groupby(['id'])['amount'].apply(np.sum)
#     df_tmp = df_tmp[product.id].values
#         print(f"df_tmp: {df_tmp}")
        df_tmp = base_df[(base_df['d'] > END_DAY_TRAIN - PREDICTION_HORIZON_DAYS) & (base_df['d'] <= END_DAY_TRAIN)]
        df_tmp['amount'] = df_tmp['sales'] * df_tmp['sell_price']
        df_tmp =df_tmp.groupby(['id'])['amount'].apply(np.sum)
        df_tmp = df_tmp[self.product.id].values
        print(f"df_tmp: {df_tmp}")
        print(f"df_tmp.shape: {df_tmp.shape}")

        weight = self.weight_mat_csr * df_tmp 

        weight = weight/np.sum(weight)

        # weight == weight2
        self.weight = weight
        print(f"self.weight: {self.weight}")
        print(f"self.weight.shape: {self.weight.shape}")
        
        del sales_train_val
        gc.collect()


    def wrmsse(self, preds, data):
        """calculates for last 28 days to consider the non-zero demand period
        
        """
        
        print(f"type(preds): {type(preds)}")
        print(preds.shape)
        print(f"type(data): {type(data)}")
        print(data)
        
        # actual obserbed values / 正解ラベル
        y_true = data.get_label()
        print(f"y_true: {y_true}")
        print(f"type(y_true): {type(y_true)}")
        print(f"y_true.shape: {y_true.shape}")
        
        y_true = y_true[-(NUM_ITEMS * DAYS_PRED):]
        preds = preds[-(NUM_ITEMS * DAYS_PRED):]
        # number of columns
        num_col = DAYS_PRED
        
        # reshape data to original array((NUM_ITEMS*num_col,1)->(NUM_ITEMS, num_col) ) / 推論の結果が 1 次元の配列になっているので直す
        reshaped_preds = preds.reshape(num_col, NUM_ITEMS).T
        reshaped_true = y_true.reshape(num_col, NUM_ITEMS).T
            
        train = self.weight_mat_csr*np.c_[reshaped_preds, reshaped_true]
        
        #todo: darker magic に組み込むには、下記スコアの計算を店舗毎にしないといけない。
        score = np.sum(
                    np.sqrt(
                        np.mean(
                            np.square(
                                train[:,:num_col] - train[:,num_col:])
                            ,axis=1) / self.rmsse_demoninator) * self.weight)
        
        return 'wrmsse', score, False


In [15]:
# wrmsse = WRMSSE(df_sales_train_validation, base_df)
# wrmsse.weight_calc()


# train models

In [None]:
if not USE_AUX:
    for store_id in STORE_IDS:
        print('Train', store_id)

        # Get grid for current store
        grid_df, features_columns = read_data_by_store(store_id)
        print(f"features_columns: {features_columns}")
        print(f"grid_df: {grid_df}")

#         Mask for each stage：
#                 1~1885 = training
#                 1886~1913= validation, 
#                 1913~1941 = prediction
        train_mask = grid_df['d']<=END_DAY_TRAIN
        print(f"train_mask.shape: {train_mask.shape}")
        valid_mask = grid_df['d']>=(START_DAY_VALIDATION) & (grid_df['d']<START_DAY_EVALUATION)
        print(f"valid_mask.shape: {valid_mask.shape}")        
        preds_mask = grid_df['d']>=START_DAY_EVALUATION
        print(f"preds_mask.shape: {preds_mask.shape}")
        
        # Apply masks and save lgb dataset as bin to reduce memory spikes during dtype conversions
        # https://github.com/Microsoft/LightGBM/issues/1032
        # "To avoid any conversions, you should always use np.float32" or save to bin before start training
        # https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/53773

        print(f"grid_df[train_mask][features_columns].shape: {grid_df[train_mask][features_columns].shape}")        
#         print(f"grid_df[train_mask][features_columns]: {grid_df[train_mask][features_columns]}")
        train_data = lgb.Dataset(grid_df[train_mask][features_columns], 
                           label=grid_df[train_mask][TARGET])
        train_data.save_binary('training_lgb_dataset.bin')
        train_data = lgb.Dataset('training_lgb_dataset.bin')

        print(f"grid_df[valid_mask][features_columns].shape: {grid_df[valid_mask][features_columns].shape}")
#         print(f"grid_df[valid_mask][features_columns]: {grid_df[valid_mask][features_columns]}")
        valid_data = lgb.Dataset(grid_df[valid_mask][features_columns], 
                           label=grid_df[valid_mask][TARGET])
        print(f"valid_data: {valid_data}")

        # Saving part of the dataset for later predictions
        # Removing features that we need to calculate recursively 
        grid_df = grid_df[preds_mask].reset_index(drop=True)
        print(f"grid_df after grid_df[preds_mask].reset_index(drop=True): {grid_df}")
        keep_cols = [col for col in list(grid_df) if '_tmp_' not in col]
        print(f"keep_cols: {keep_cols}")
        grid_df = grid_df[keep_cols]
        grid_df.to_pickle(os.path.sep.join([PRETRAINED_MODEL_DIR, 'test_dataset_'+store_id+'.pkl']))    
        del grid_df
        gc.collect()

        # Launch seeder again to make lgb training 100% deterministic with each "code line" np.random "evolves" 
        # so we need (may want) to "reset" it
        seed_everything(SEED)
    #### original hyperparameters #####
        estimator = lgb.train(lgb_params,
                              train_data,
                              valid_sets = [valid_data],
                              verbose_eval = 100,
                              )
    ###########################
#     ##### my hyperparameters #####
#         estimator = lgb.train(params=lgb_params,
#                               train_set=train_data,
#                               valid_sets = [train_data, valid_data],
#                               verbose_eval = 100,
#                               early_stopping_rounds = 200,
#                               feval= wrmsse.wrmsse
#                               )
#     ############################

        # Save model - it's not real '.bin' but a pickle file
        # estimator = lgb.Booster(model_file='model.txt') can only predict with the best iteration (or the saving iteration)
        # pickle.dump gives us more flexibility like estimator.predict(TEST, num_iteration=100)
        # num_iteration - number of iteration you want to predict with, 
        # NULL or <= 0 means use best iteration
        model_name = 'lgb_model_'+store_id+'_v'+str(VER)+'.bin'
#         pickle.dump(estimator, open(model_name, 'wb'))
        pickle.dump(estimator, open(os.path.sep.join([PRETRAINED_MODEL_DIR, model_name]), 'wb'))

        # Remove temporary files and objects to free some disk space and ram memory
        !rm training_lgb_dataset.bin
        del train_data, valid_data, estimator
        gc.collect()
else:
    # If we want to use pretrained models we can skip training 
    store_id = STORE_IDS[0]
    print(f"store_id: {store_id}")

    # we just want the column name list
    _, features_columns = read_data_by_store(store_id)
    print(f"features_columns: {features_columns}")
    print(f"len(features_columns): {len(features_columns)}")
    
# "Keep" models features for predictions
MODEL_FEATURES = features_columns

Train CA_1
MEAN_ENC: base_grid_with_mean_encoded_ids_means_stds_for_darker_magic.pkl
LAGS: base_grid_with_lag_features_for_28_days.pkl
features_columns: ['item_id', 'dept_id', 'cat_id', 'release', 'sell_price', 'price_max', 'price_min', 'price_std', 'price_mean', 'price_norm', 'price_nunique', 'item_nunique', 'price_momentum', 'price_momentum_m', 'price_momentum_y', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'snap_CA', 'snap_TX', 'snap_WI', 'tm_d', 'tm_w', 'tm_m', 'tm_y', 'tm_wm', 'tm_dw', 'tm_w_end', 'enc_cat_id_mean', 'enc_cat_id_std', 'enc_dept_id_mean', 'enc_dept_id_std', 'enc_item_id_mean', 'enc_item_id_std', 'sales_lag_28', 'sales_lag_29', 'sales_lag_30', 'sales_lag_31', 'sales_lag_32', 'sales_lag_33', 'sales_lag_34', 'sales_lag_35', 'sales_lag_36', 'sales_lag_37', 'sales_lag_38', 'sales_lag_39', 'sales_lag_40', 'sales_lag_41', 'sales_lag_42', 'sales_lag_43', 'sales_lag_44', 'sales_lag_45', 'sales_lag_46', 'sales_lag_47', 'sales_lag_48', 'sales_lag_49', 'sale

grid_df[train_mask][features_columns].shape: (4617523, 92)
grid_df[valid_mask][features_columns].shape: (4788267, 92)
valid_data: <lightgbm.basic.Dataset object at 0x7fa03af0a6d8>
grid_df after grid_df[preds_mask].reset_index(drop=True):                                   id     d  sales        item_id    dept_id  \
0      HOBBIES_1_001_CA_1_validation  1914    NaN  HOBBIES_1_001  HOBBIES_1   
1      HOBBIES_1_002_CA_1_validation  1914    NaN  HOBBIES_1_002  HOBBIES_1   
2      HOBBIES_1_003_CA_1_validation  1914    NaN  HOBBIES_1_003  HOBBIES_1   
3      HOBBIES_1_004_CA_1_validation  1914    NaN  HOBBIES_1_004  HOBBIES_1   
4      HOBBIES_1_005_CA_1_validation  1914    NaN  HOBBIES_1_005  HOBBIES_1   
...                              ...   ...    ...            ...        ...   
85367    FOODS_3_823_CA_1_validation  1941    NaN    FOODS_3_823    FOODS_3   
85368    FOODS_3_824_CA_1_validation  1941    NaN    FOODS_3_824    FOODS_3   
85369    FOODS_3_825_CA_1_validation  1941    NaN  



[100]	valid_0's rmse: 2.85003
[200]	valid_0's rmse: 2.72403
[300]	valid_0's rmse: 2.66348
[400]	valid_0's rmse: 2.62322
[500]	valid_0's rmse: 2.59245
[600]	valid_0's rmse: 2.56643
[700]	valid_0's rmse: 2.54609
[800]	valid_0's rmse: 2.52589
[900]	valid_0's rmse: 2.50768
[1000]	valid_0's rmse: 2.49249
[1100]	valid_0's rmse: 2.47855
[1200]	valid_0's rmse: 2.46435
[1300]	valid_0's rmse: 2.45209
[1400]	valid_0's rmse: 2.44014
[1500]	valid_0's rmse: 2.42789
Train CA_2
MEAN_ENC: base_grid_with_mean_encoded_ids_means_stds_for_darker_magic.pkl
LAGS: base_grid_with_lag_features_for_28_days.pkl
features_columns: ['item_id', 'dept_id', 'cat_id', 'release', 'sell_price', 'price_max', 'price_min', 'price_std', 'price_mean', 'price_norm', 'price_nunique', 'item_nunique', 'price_momentum', 'price_momentum_m', 'price_momentum_y', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'snap_CA', 'snap_TX', 'snap_WI', 'tm_d', 'tm_w', 'tm_m', 'tm_y', 'tm_wm', 'tm_dw', 'tm_w_end', 'enc_cat_id_mean

grid_df[train_mask][features_columns].shape: (4190404, 92)
grid_df[valid_mask][features_columns].shape: (4361148, 92)
valid_data: <lightgbm.basic.Dataset object at 0x7fa03af0a240>
grid_df after grid_df[preds_mask].reset_index(drop=True):                                   id     d  sales        item_id    dept_id  \
0      HOBBIES_1_001_CA_2_validation  1914    NaN  HOBBIES_1_001  HOBBIES_1   
1      HOBBIES_1_002_CA_2_validation  1914    NaN  HOBBIES_1_002  HOBBIES_1   
2      HOBBIES_1_003_CA_2_validation  1914    NaN  HOBBIES_1_003  HOBBIES_1   
3      HOBBIES_1_004_CA_2_validation  1914    NaN  HOBBIES_1_004  HOBBIES_1   
4      HOBBIES_1_005_CA_2_validation  1914    NaN  HOBBIES_1_005  HOBBIES_1   
...                              ...   ...    ...            ...        ...   
85367    FOODS_3_823_CA_2_validation  1941    NaN    FOODS_3_823    FOODS_3   
85368    FOODS_3_824_CA_2_validation  1941    NaN    FOODS_3_824    FOODS_3   
85369    FOODS_3_825_CA_2_validation  1941    NaN  



[100]	valid_0's rmse: 2.22258
[200]	valid_0's rmse: 2.16113
[300]	valid_0's rmse: 2.12757
[400]	valid_0's rmse: 2.10428
[500]	valid_0's rmse: 2.08704
[600]	valid_0's rmse: 2.07316
[700]	valid_0's rmse: 2.0609
[800]	valid_0's rmse: 2.04891
[900]	valid_0's rmse: 2.03868
[1000]	valid_0's rmse: 2.02866
[1100]	valid_0's rmse: 2.0195
[1200]	valid_0's rmse: 2.01139
[1300]	valid_0's rmse: 2.00319
[1400]	valid_0's rmse: 1.99539
[1500]	valid_0's rmse: 1.98786
Train CA_3
MEAN_ENC: base_grid_with_mean_encoded_ids_means_stds_for_darker_magic.pkl
LAGS: base_grid_with_lag_features_for_28_days.pkl
features_columns: ['item_id', 'dept_id', 'cat_id', 'release', 'sell_price', 'price_max', 'price_min', 'price_std', 'price_mean', 'price_norm', 'price_nunique', 'item_nunique', 'price_momentum', 'price_momentum_m', 'price_momentum_y', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'snap_CA', 'snap_TX', 'snap_WI', 'tm_d', 'tm_w', 'tm_m', 'tm_y', 'tm_wm', 'tm_dw', 'tm_w_end', 'enc_cat_id_mean',

grid_df[train_mask][features_columns].shape: (4586569, 92)
grid_df[valid_mask][features_columns].shape: (4757313, 92)
valid_data: <lightgbm.basic.Dataset object at 0x7fa03acde828>
grid_df after grid_df[preds_mask].reset_index(drop=True):                                   id     d  sales        item_id    dept_id  \
0      HOBBIES_1_001_CA_3_validation  1914    NaN  HOBBIES_1_001  HOBBIES_1   
1      HOBBIES_1_002_CA_3_validation  1914    NaN  HOBBIES_1_002  HOBBIES_1   
2      HOBBIES_1_003_CA_3_validation  1914    NaN  HOBBIES_1_003  HOBBIES_1   
3      HOBBIES_1_004_CA_3_validation  1914    NaN  HOBBIES_1_004  HOBBIES_1   
4      HOBBIES_1_005_CA_3_validation  1914    NaN  HOBBIES_1_005  HOBBIES_1   
...                              ...   ...    ...            ...        ...   
85367    FOODS_3_823_CA_3_validation  1941    NaN    FOODS_3_823    FOODS_3   
85368    FOODS_3_824_CA_3_validation  1941    NaN    FOODS_3_824    FOODS_3   
85369    FOODS_3_825_CA_3_validation  1941    NaN  



[100]	valid_0's rmse: 4.09565
[200]	valid_0's rmse: 3.81729
[300]	valid_0's rmse: 3.70271
[400]	valid_0's rmse: 3.6305
[500]	valid_0's rmse: 3.57682
[600]	valid_0's rmse: 3.53338
[700]	valid_0's rmse: 3.4968
[800]	valid_0's rmse: 3.46395
[900]	valid_0's rmse: 3.43596
[1000]	valid_0's rmse: 3.40876
[1100]	valid_0's rmse: 3.38288
[1200]	valid_0's rmse: 3.3608
[1300]	valid_0's rmse: 3.34051
[1400]	valid_0's rmse: 3.32151
[1500]	valid_0's rmse: 3.30351
Train CA_4
MEAN_ENC: base_grid_with_mean_encoded_ids_means_stds_for_darker_magic.pkl
LAGS: base_grid_with_lag_features_for_28_days.pkl
features_columns: ['item_id', 'dept_id', 'cat_id', 'release', 'sell_price', 'price_max', 'price_min', 'price_std', 'price_mean', 'price_norm', 'price_nunique', 'item_nunique', 'price_momentum', 'price_momentum_m', 'price_momentum_y', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'snap_CA', 'snap_TX', 'snap_WI', 'tm_d', 'tm_w', 'tm_m', 'tm_y', 'tm_wm', 'tm_dw', 'tm_w_end', 'enc_cat_id_mean', 

grid_df[train_mask][features_columns].shape: (4481814, 92)
grid_df[valid_mask][features_columns].shape: (4652558, 92)
valid_data: <lightgbm.basic.Dataset object at 0x7fa03afdab00>
grid_df after grid_df[preds_mask].reset_index(drop=True):                                   id     d  sales        item_id    dept_id  \
0      HOBBIES_1_001_CA_4_validation  1914    NaN  HOBBIES_1_001  HOBBIES_1   
1      HOBBIES_1_002_CA_4_validation  1914    NaN  HOBBIES_1_002  HOBBIES_1   
2      HOBBIES_1_003_CA_4_validation  1914    NaN  HOBBIES_1_003  HOBBIES_1   
3      HOBBIES_1_004_CA_4_validation  1914    NaN  HOBBIES_1_004  HOBBIES_1   
4      HOBBIES_1_005_CA_4_validation  1914    NaN  HOBBIES_1_005  HOBBIES_1   
...                              ...   ...    ...            ...        ...   
85367    FOODS_3_823_CA_4_validation  1941    NaN    FOODS_3_823    FOODS_3   
85368    FOODS_3_824_CA_4_validation  1941    NaN    FOODS_3_824    FOODS_3   
85369    FOODS_3_825_CA_4_validation  1941    NaN  



[100]	valid_0's rmse: 1.6161
[200]	valid_0's rmse: 1.57488
[300]	valid_0's rmse: 1.55429
[400]	valid_0's rmse: 1.53937
[500]	valid_0's rmse: 1.528
[600]	valid_0's rmse: 1.51815
[700]	valid_0's rmse: 1.51004
[800]	valid_0's rmse: 1.50257
[900]	valid_0's rmse: 1.49572
[1000]	valid_0's rmse: 1.48953
[1100]	valid_0's rmse: 1.48357
[1200]	valid_0's rmse: 1.47755
[1300]	valid_0's rmse: 1.47212
[1400]	valid_0's rmse: 1.46722
[1500]	valid_0's rmse: 1.46201
Train TX_1
MEAN_ENC: base_grid_with_mean_encoded_ids_means_stds_for_darker_magic.pkl
LAGS: base_grid_with_lag_features_for_28_days.pkl
features_columns: ['item_id', 'dept_id', 'cat_id', 'release', 'sell_price', 'price_max', 'price_min', 'price_std', 'price_mean', 'price_norm', 'price_nunique', 'item_nunique', 'price_momentum', 'price_momentum_m', 'price_momentum_y', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'snap_CA', 'snap_TX', 'snap_WI', 'tm_d', 'tm_w', 'tm_m', 'tm_y', 'tm_wm', 'tm_dw', 'tm_w_end', 'enc_cat_id_mean', 

grid_df[train_mask][features_columns].shape: (4627211, 92)
grid_df[valid_mask][features_columns].shape: (4797955, 92)
valid_data: <lightgbm.basic.Dataset object at 0x7fa03af0af60>
grid_df after grid_df[preds_mask].reset_index(drop=True):                                   id     d  sales        item_id    dept_id  \
0      HOBBIES_1_001_TX_1_validation  1914    NaN  HOBBIES_1_001  HOBBIES_1   
1      HOBBIES_1_002_TX_1_validation  1914    NaN  HOBBIES_1_002  HOBBIES_1   
2      HOBBIES_1_003_TX_1_validation  1914    NaN  HOBBIES_1_003  HOBBIES_1   
3      HOBBIES_1_004_TX_1_validation  1914    NaN  HOBBIES_1_004  HOBBIES_1   
4      HOBBIES_1_005_TX_1_validation  1914    NaN  HOBBIES_1_005  HOBBIES_1   
...                              ...   ...    ...            ...        ...   
85367    FOODS_3_823_TX_1_validation  1941    NaN    FOODS_3_823    FOODS_3   
85368    FOODS_3_824_TX_1_validation  1941    NaN    FOODS_3_824    FOODS_3   
85369    FOODS_3_825_TX_1_validation  1941    NaN  



[100]	valid_0's rmse: 2.33904
[200]	valid_0's rmse: 2.25415
[300]	valid_0's rmse: 2.20815
[400]	valid_0's rmse: 2.17817
[500]	valid_0's rmse: 2.15367
[600]	valid_0's rmse: 2.13295
[700]	valid_0's rmse: 2.11575
[800]	valid_0's rmse: 2.09968
[900]	valid_0's rmse: 2.08554
[1000]	valid_0's rmse: 2.07188
[1100]	valid_0's rmse: 2.05955
[1200]	valid_0's rmse: 2.04752
[1300]	valid_0's rmse: 2.03632
[1400]	valid_0's rmse: 2.02581
[1500]	valid_0's rmse: 2.01423
Train TX_2
MEAN_ENC: base_grid_with_mean_encoded_ids_means_stds_for_darker_magic.pkl
LAGS: base_grid_with_lag_features_for_28_days.pkl
features_columns: ['item_id', 'dept_id', 'cat_id', 'release', 'sell_price', 'price_max', 'price_min', 'price_std', 'price_mean', 'price_norm', 'price_nunique', 'item_nunique', 'price_momentum', 'price_momentum_m', 'price_momentum_y', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'snap_CA', 'snap_TX', 'snap_WI', 'tm_d', 'tm_w', 'tm_m', 'tm_y', 'tm_wm', 'tm_dw', 'tm_w_end', 'enc_cat_id_mean

grid_df[train_mask][features_columns].shape: (4637137, 92)
grid_df[valid_mask][features_columns].shape: (4807881, 92)
valid_data: <lightgbm.basic.Dataset object at 0x7fa03aee7748>
grid_df after grid_df[preds_mask].reset_index(drop=True):                                   id     d  sales        item_id    dept_id  \
0      HOBBIES_1_001_TX_2_validation  1914    NaN  HOBBIES_1_001  HOBBIES_1   
1      HOBBIES_1_002_TX_2_validation  1914    NaN  HOBBIES_1_002  HOBBIES_1   
2      HOBBIES_1_003_TX_2_validation  1914    NaN  HOBBIES_1_003  HOBBIES_1   
3      HOBBIES_1_004_TX_2_validation  1914    NaN  HOBBIES_1_004  HOBBIES_1   
4      HOBBIES_1_005_TX_2_validation  1914    NaN  HOBBIES_1_005  HOBBIES_1   
...                              ...   ...    ...            ...        ...   
85367    FOODS_3_823_TX_2_validation  1941    NaN    FOODS_3_823    FOODS_3   
85368    FOODS_3_824_TX_2_validation  1941    NaN    FOODS_3_824    FOODS_3   
85369    FOODS_3_825_TX_2_validation  1941    NaN  



[100]	valid_0's rmse: 2.82407
[200]	valid_0's rmse: 2.69433
[300]	valid_0's rmse: 2.63268
[400]	valid_0's rmse: 2.59439
[500]	valid_0's rmse: 2.56504
[600]	valid_0's rmse: 2.53958
[700]	valid_0's rmse: 2.51831
[800]	valid_0's rmse: 2.49727
[900]	valid_0's rmse: 2.47844
[1000]	valid_0's rmse: 2.46151
[1100]	valid_0's rmse: 2.44457
[1200]	valid_0's rmse: 2.42971
[1300]	valid_0's rmse: 2.41589
[1400]	valid_0's rmse: 2.40333
[1500]	valid_0's rmse: 2.39013
Train TX_3
MEAN_ENC: base_grid_with_mean_encoded_ids_means_stds_for_darker_magic.pkl
LAGS: base_grid_with_lag_features_for_28_days.pkl
features_columns: ['item_id', 'dept_id', 'cat_id', 'release', 'sell_price', 'price_max', 'price_min', 'price_std', 'price_mean', 'price_norm', 'price_nunique', 'item_nunique', 'price_momentum', 'price_momentum_m', 'price_momentum_y', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'snap_CA', 'snap_TX', 'snap_WI', 'tm_d', 'tm_w', 'tm_m', 'tm_y', 'tm_wm', 'tm_dw', 'tm_w_end', 'enc_cat_id_mean

grid_df[train_mask][features_columns].shape: (4566423, 92)
grid_df[valid_mask][features_columns].shape: (4737167, 92)
valid_data: <lightgbm.basic.Dataset object at 0x7fa03afe6470>
grid_df after grid_df[preds_mask].reset_index(drop=True):                                   id     d  sales        item_id    dept_id  \
0      HOBBIES_1_001_TX_3_validation  1914    NaN  HOBBIES_1_001  HOBBIES_1   
1      HOBBIES_1_002_TX_3_validation  1914    NaN  HOBBIES_1_002  HOBBIES_1   
2      HOBBIES_1_003_TX_3_validation  1914    NaN  HOBBIES_1_003  HOBBIES_1   
3      HOBBIES_1_004_TX_3_validation  1914    NaN  HOBBIES_1_004  HOBBIES_1   
4      HOBBIES_1_005_TX_3_validation  1914    NaN  HOBBIES_1_005  HOBBIES_1   
...                              ...   ...    ...            ...        ...   
85367    FOODS_3_823_TX_3_validation  1941    NaN    FOODS_3_823    FOODS_3   
85368    FOODS_3_824_TX_3_validation  1941    NaN    FOODS_3_824    FOODS_3   
85369    FOODS_3_825_TX_3_validation  1941    NaN  



[100]	valid_0's rmse: 2.47616
[200]	valid_0's rmse: 2.36137
[300]	valid_0's rmse: 2.29873
[400]	valid_0's rmse: 2.25517
[500]	valid_0's rmse: 2.22621
[600]	valid_0's rmse: 2.20336
[700]	valid_0's rmse: 2.18351
[800]	valid_0's rmse: 2.1657
[900]	valid_0's rmse: 2.15034
[1000]	valid_0's rmse: 2.13591
[1100]	valid_0's rmse: 2.1237


# predict the target column 

In [None]:
# Create Dummy DataFrame to store predictions
all_preds = pd.DataFrame()

# Join back the Test dataset with a small part of the training data to make recursive features
base_test = get_base_test()

# Timer to measure predictions time 
main_time = time.time()

# Loop over each prediction day
# As rolling lags are the most timeconsuming, we will calculate it for whole day
for PREDICT_DAY in range(1, SHIFT_DAYS + 1):    
    print('Predict | Day:', PREDICT_DAY)
    start_time = time.time()

    # Make temporary grid to calculate rolling lags
    grid_df = base_test.copy()
    
    print(f"grid_df: {grid_df}")
    print(f"grid_df.shape: {grid_df.shape}")
    print(f"ROLLING_SPLIT: {ROLLING_SPLIT}")
    grid_df = pd.concat([grid_df, run_df_in_multiprocess(make_lag_roll, ROLLING_SPLIT)], axis=1)
        
    for store_id in STORE_IDS:
        
        # Read all our models and make predictions for each day/store pairs
        model_path = 'lgb_model_'+store_id+'_v'+str(VER)+'.bin' 
#             model_path = PRETRAINED_MODEL_DIR + model_path
        model_path = os.path.sep.join([PRETRAINED_MODEL_DIR, model_path])
                   
        estimator = pickle.load(open(model_path, 'rb'))
        
        day_mask = base_test['d']==(START_DAY_EVALUATION -1 + PREDICT_DAY)
    
#         print(f"day_mask: {day_mask}")
        store_mask = base_test['store_id']==store_id
#         print(f"store_mask: {store_mask}")
        
        mask = (day_mask)&(store_mask)
#         print(f"mask: {mask}")

        base_test[TARGET][mask] = estimator.predict(grid_df[mask][MODEL_FEATURES])
    
    # Make good column naming and add to all_preds DataFrame
    temp_df = base_test[day_mask][['id',TARGET]]
    temp_df.columns = ['id','F'+str(PREDICT_DAY)]
    print(f"temp_df: {temp_df}")
    if 'id' in list(all_preds):
        all_preds = all_preds.merge(temp_df, on=['id'], how='left')
    else:
        all_preds = temp_df.copy()
        
    print('#'*10, ' %0.2f min round |' % ((time.time() - start_time) / 60),
                  ' %0.2f min total |' % ((time.time() - main_time) / 60),
                  ' %0.2f day sales |' % (temp_df['F'+str(PREDICT_DAY)].sum()))
    del temp_df
    gc.collect()
    
all_preds = all_preds.reset_index(drop=True)
print(f"all_preds: {all_preds}")

# export train/test result

In [None]:
parent_dir = pathlib.Path(os.path.abspath(os.curdir)).parent.parent
# Reading competition sample submission and merging our predictions
submission_df = read_csv_data(parent_dir, _SAMPLE_SUBMISSION_CSV_FILE)
submission_ids_df = submission_df[["id"]]
display(submission_ids_df)

# submission_df = pd.read_csv(ORIGINAL+_SAMPLE_SUBMISSION_CSV_FILE)[['id']]
my_submission_df = submission_ids_df.merge(all_preds, on=['id'], how='left').fillna(0)

_EXPORT_FILE_NAME = 'submission_v'+str(VER)+'_validation.csv'
print("csv data export start")
my_submission_df.to_csv(os.path.sep.join([str(parent_dir), _OUTPUT_DIR, _EXPORT_FILE_NAME]), index=False)
print('csv data export finished. Size:', my_submission_df.shape)

# Summary

Of course here is no magic at all.
No "Novel" features and no brilliant ideas.
We just carefully joined all
our previous fe work and created a model.

Also!
In my opinion this strategy is a "dead end".
Overfits a lot LB and with 1 final submission 
you have no option to risk.


Improvement should come from:
Loss function
Data representation
Stable CV
Good features reduction strategy
Predictions stabilization with NN
Trend prediction
Real zero sales detection/classification


Good kernels references 
(the order is random and the list is not complete):
https://www.kaggle.com/ragnar123/simple-lgbm-groupkfold-cv
https://www.kaggle.com/jpmiller/grouping-items-by-stockout-pattern
https://www.kaggle.com/headsortails/back-to-predict-the-future-interactive-m5-eda
https://www.kaggle.com/sibmike/m5-out-of-stock-feature
https://www.kaggle.com/mayer79/m5-forecast-attack-of-the-data-table
https://www.kaggle.com/yassinealouini/seq2seq
https://www.kaggle.com/kailex/m5-forecaster-v2
https://www.kaggle.com/aerdem4/m5-lofo-importance-on-gpu-via-rapids-xgboost


Features were created in these kernels:
# 
Mean encodings and PCA options
https://www.kaggle.com/kyakovlev/m5-custom-features
#
Lags and rolling lags
https://www.kaggle.com/kyakovlev/m5-lags-features
#
Base Grid and base features (calendar/price/etc)
https://www.kaggle.com/kyakovlev/m5-simple-fe


Personal request
Please don't upvote any ensemble and copypaste kernels
The worst case is ensemble without any analyse.
The best choice - just ignore it.
I would like to see more kernels with interesting and original approaches.
Don't feed copypasters with upvotes.

It doesn't mean that you should not fork and improve others kernels
but I would like to see params and code tuning based on some CV and analyse
and not only on LB probing.
Small changes could be shared in comments and authors can improve their kernel.

Feel free to criticize this kernel as my knowlege is very limited
and I can be wrong in code and descriptions. 
Thank you.