# Sales prediction: simple ranking  
## ***Experiment*** **52**:  
### - aggregated sales across different stores  
### - single general learner  
### - without markdowns?  

## Contributions:
### - wide range of methods
### - department-based error (p_err and n_err) calculation
### - weighted error ranking (check EXPLAIN method in exact folder)
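
The department-based error calculation can be sketched roughly as below. This is only an illustration: it assumes that p_err and n_err are the per-department sums of positive and negative percentage errors, as the `_pos_pe` and `_neg_pe` columns computed in `create_models` suggest. The helper name `dept_signed_errors`, the column name `pred`, and the toy numbers are hypothetical.

```python
import pandas as pd

def dept_signed_errors(result_log, model_col):
    """Summarise signed percentage errors per department for one prediction column."""
    # Percentage error, same sign convention as the notebook: (actual - predicted) / actual
    pe = (result_log['actual'] - result_log[model_col]) / result_log['actual']
    parts = pd.DataFrame({
        'Dept': result_log['Dept'],
        'p_err': pe.clip(lower=0),  # under-prediction part (actual above prediction)
        'n_err': pe.clip(upper=0),  # over-prediction part (actual below prediction)
        'ape': pe.abs(),
    })
    return parts.groupby('Dept').agg(
        p_err=('p_err', 'sum'),
        n_err=('n_err', 'sum'),
        mape=('ape', 'mean'),
    )

# Hypothetical toy data, not taken from the experiment
toy = pd.DataFrame({
    'Dept': [1, 1, 2, 2],
    'actual': [100.0, 200.0, 50.0, 80.0],
    'pred': [90.0, 220.0, 50.0, 100.0],
})
summary = dept_signed_errors(toy, 'pred')
print(summary)
```

Keeping the positive and negative parts separate shows whether a model tends to under- or over-predict a department, which a single MAPE value hides.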

In [1]:
# 0-importing necessary packages

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from datetime import datetime
import statsmodels.api as sm
from pycaret.regression import *
import xgboost as xgb
import catboost as ctb
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.graphics.tsaplots import plot_acf
print('Importing libraries: Done')

Importing libraries: Done


In [2]:
# 1-Inputs operation


# 1-1 Checking inputs
print("Folder's files : ",os.listdir('inputs'), '\n')

# 1-2 Reading input CSV files and assigning a name to each one of them 
dataset = pd.read_csv("inputs/train.csv", names=['Store','Dept','Date','weeklySales','isHoliday'],sep=',', header=0)
features = pd.read_csv("inputs/features.csv",sep=',', header=0,names=['Store','Date','Temperature','Fuel_Price','MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5','CPI','Unemployment','IsHoliday']).drop(columns=['IsHoliday'])
stores = pd.read_csv("inputs/stores.csv", names=['Store','Type','Size'],sep=',', header=0)

# 1-3 Creating needed directories
os.makedirs('temp_test', exist_ok=True)
os.makedirs('input_analysis', exist_ok=True)
os.makedirs('pred_output', exist_ok=True)
os.makedirs('pred_output/exp51', exist_ok=True)
os.makedirs('output_analysis', exist_ok=True)
os.makedirs('output_analysis/exp52', exist_ok=True)

# 1-4 Flattening data (merging the separate tables into one)
dataset = dataset.merge(stores, on='Store', how='left').merge(features, on=['Store', 'Date'], how='left')

# 1-5 Downcasting the key columns to reduce memory usage (weeklySales stays float64)
dataset['Store'] = dataset['Store'].astype('int16')
dataset['Dept'] = dataset['Dept'].astype('int16')
dataset['weeklySales'] = dataset['weeklySales'].astype('float64')

# 1-6 Printing flattened dataset
print('─' * 100,'\n Original dataset sample: \n', dataset)

Folder's files :  ['features.csv', 'inputs.rar', 'inputs.zip', 'stores.csv', 'test.csv', 'train.csv'] 

──────────────────────────────────────────────────────────────────────────────────────────────────── 
 Original dataset sample: 
         Store  Dept        Date  weeklySales  isHoliday Type    Size  \
0           1     1  2010-02-05     24924.50      False    A  151315   
1           1     1  2010-02-12     46039.49       True    A  151315   
2           1     1  2010-02-19     41595.55      False    A  151315   
3           1     1  2010-02-26     19403.54      False    A  151315   
4           1     1  2010-03-05     21827.90      False    A  151315   
...       ...   ...         ...          ...        ...  ...     ...   
421565     45    98  2012-09-28       508.37      False    B  118221   
421566     45    98  2012-10-05       628.10      False    B  118221   
421567     45    98  2012-10-12      1061.02      False    B  118221   
421568     45    98  2012-10-19       760.01  
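
The flattening step in cell 2 relies on left merges over the shared key columns. A minimal sketch of the same idea with explicit keys and pandas' `validate` check follows; the three small frames are hypothetical stand-ins for `train.csv`, `stores.csv` and `features.csv`, not the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the three input tables
sales = pd.DataFrame({
    'Store': [1, 1, 2],
    'Dept': [1, 2, 1],
    'Date': ['2010-02-05'] * 3,
    'weeklySales': [10.0, 20.0, 30.0],
})
stores = pd.DataFrame({'Store': [1, 2], 'Type': ['A', 'B'], 'Size': [100, 200]})
features = pd.DataFrame({'Store': [1, 2], 'Date': ['2010-02-05'] * 2, 'CPI': [211.1, 210.5]})

# validate='many_to_one' raises a MergeError if the right-hand keys are not unique,
# which would otherwise silently duplicate sales rows
flat = (
    sales
    .merge(stores, on='Store', how='left', validate='many_to_one')
    .merge(features, on=['Store', 'Date'], how='left', validate='many_to_one')
)
print(flat)
```

Naming the keys makes the join independent of which columns happen to share a name, and the validation keeps the row count of the sales table unchanged.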

In [3]:
# 2-Data extraction

# 2-1 Deriving a sub-dataset from main dataset 
dataset_sub1 = dataset[['Date','Dept','Store', 'Type','MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5','isHoliday','weeklySales']]
dataset_sub1 = dataset_sub1.sort_index(axis=0)

# 2-2 getting list of unique departments' values
dept_list = dataset_sub1['Dept'].unique()
dept_list.sort()

# 2-3 getting list of unique stores' values
store_list = dataset_sub1['Store'].unique()
store_list.sort()

# 2-4 getting list of unique dates
date_list = dataset_sub1['Date'].unique()
date_list.sort()

# 2-5 Check printing 
print('Dataset_sub1: \n',dataset_sub1)
print('─' * 100,'\n List of Departments: \n',dept_list,'\n')
print('─' * 100,'\n List of Stores: \n',store_list,'\n')
print('─' * 100,'\n List of Dates: \n',date_list)

# Deriving a sub-dataset from the main dataset that keeps the 9 most important features
#dataset_sub3 = dataset[['Date','Store','Dept','weeklySales','MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']]
#dataset_sub3 = dataset_sub3.sort_index(axis=0)
#print('\n\n', dataset_sub3.tail(5))

Dataset_sub1: 
               Date  Dept  Store Type  MarkDown1  MarkDown2  MarkDown3  \
0       2010-02-05     1      1    A        NaN        NaN        NaN   
1       2010-02-12     1      1    A        NaN        NaN        NaN   
2       2010-02-19     1      1    A        NaN        NaN        NaN   
3       2010-02-26     1      1    A        NaN        NaN        NaN   
4       2010-03-05     1      1    A        NaN        NaN        NaN   
...            ...   ...    ...  ...        ...        ...        ...   
421565  2012-09-28    98     45    B    4556.61      20.64       1.50   
421566  2012-10-05    98     45    B    5046.74        NaN      18.82   
421567  2012-10-12    98     45    B    1956.28        NaN       7.89   
421568  2012-10-19    98     45    B    2004.02        NaN       3.18   
421569  2012-10-26    98     45    B    4018.91      58.08     100.00   

        MarkDown4  MarkDown5  isHoliday  weeklySales  
0             NaN        NaN      False     24924.50

In [6]:
# 3-Data cleaning

# 3-1 This function flags store/department pairs that have incomplete data (not exactly 143 weekly entries)
#     or any sales value at or below border_value, and returns the flags as a mapping DataFrame (Test:OK)
def outlier_identifier(df, border_value, store_list, dept_list):
    data_map = pd.DataFrame(columns=['store', 'dept', 'number_of_entries', 'target_false_count', 'outlier_flag'])
    for i in store_list:
        for j in dept_list:
            # Rows of the current store/department pair
            pair = df[(df.Store == i) & (df.Dept == j)]
            number_of_entries = int(pair.Date.count())
            target_false_count = int((pair.weeklySales <= border_value).sum())
            # A pair is kept only if it has all 143 weeks and no sales at or below border_value
            if number_of_entries == 143 and target_false_count == 0:
                outlier_flag = 0
            else:
                outlier_flag = 1
            new_row = {'store': i, 'dept': j, 'number_of_entries': number_of_entries, 'target_false_count': target_false_count, 'outlier_flag': outlier_flag}
            data_map.loc[len(data_map)] = new_row
    return data_map

# 3-2 This function removes, in place, every store/department pair listed in removal_map (Test:OK)
def outlier_remover(df, removal_map):
    for store, dept in zip(removal_map.store, removal_map.dept):
        print('Store:', store, 'Department:', dept,' Removed as outlier!','\n')
        # Drop all rows of the pair in one call instead of one row at a time
        index = df[(df.Store == store) & (df.Dept == dept)].index
        df.drop(index, inplace=True)
    return df

# 3-3 Executing the outlier identifier and saving the result as a mapping DataFrame that tells which store/department pairs should be dropped (Test:OK)
data_map = outlier_identifier(dataset_sub1, 0, store_list, dept_list)
removal_map = data_map[['store','dept','outlier_flag']]
removal_map = removal_map[removal_map.outlier_flag == 1]
removal_map.reset_index(drop=True, inplace=True)

# 3-4 Printing the share of outlier store/department pairs among all pairs (a fraction between 0 and 1)
print('─' * 100, '\n')
print('outlier percentage:', data_map[data_map.outlier_flag == 1].store.count() / len(data_map) ,'─' * 100, '\n')

# 3-5 Executing outlier remover
dataset_sub2 = outlier_remover(dataset_sub1, removal_map)

# 3-6 Filling empty numeric values with 0 and resetting the index
dataset_sub2 = dataset_sub2.fillna(0)
dataset_sub2 = dataset_sub2.reset_index(drop=True)

# 3-7 The outlier removal process is time-consuming, so we save the cleaned data and reload it later.
# index=False keeps the saved columns aligned with the column names used when the file is read back.
dataset_sub2.to_csv('temp_test/dataset_sub2_exp51.csv', index=False)

# 3-8 Printing the result of data cleaning process
print('─' * 100, '\n Cleaned Dataset: \n', dataset_sub2)    

──────────────────────────────────────────────────────────────────────────────────────────────────── 

outlier percentage: 0.2792866941015089 

Store: 1 Department: 6  Removed as outlier! 

Store: 1 Department: 18  Removed as outlier! 

Store: 1 Department: 39  Removed as outlier! 

Store: 1 Department: 43  Removed as outlier! 

Store: 1 Department: 45  Removed as outlier! 

Store: 1 Department: 47  Removed as outlier! 

Store: 1 Department: 48  Removed as outlier! 

Store: 1 Department: 50  Removed as outlier! 

Store: 1 Department: 51  Removed as outlier! 

Store: 1 Department: 54  Removed as outlier! 

Store: 1 Department: 65  Removed as outlier! 

Store: 1 Department: 77  Removed as outlier! 

Store: 1 Department: 78  Removed as outlier! 

Store: 1 Department: 96  Removed as outlier! 

Store: 1 Department: 99  Removed as outlier! 

Store: 2 Department: 18  Removed as outlier! 

Store: 2 Department: 39  Removed as outlier! 

Store: 2 Department: 43  Removed as outlier! 

Store: 2 De
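
The nested loop over every store and department is what makes this cell slow. A grouped alternative is sketched below under stated assumptions: the names `flag_pairs` and `keep_clean_pairs` are hypothetical, and `expected_weeks` stands for the 143 weekly entries the notebook requires. Unlike the loop, `groupby` only reports pairs that actually occur in the data, so the outlier share computed from it would differ from the one printed above, while the set of removed rows should be the same.

```python
import pandas as pd

def flag_pairs(df, border_value=0, expected_weeks=143):
    """Count entries and non-positive sales per (Store, Dept) and flag incomplete or invalid pairs."""
    stats = (
        df.assign(bad=df['weeklySales'] <= border_value)
          .groupby(['Store', 'Dept'])
          .agg(number_of_entries=('Date', 'count'), target_false_count=('bad', 'sum'))
          .reset_index()
    )
    incomplete = stats['number_of_entries'] != expected_weeks
    has_bad_sales = stats['target_false_count'] > 0
    stats['outlier_flag'] = (incomplete | has_bad_sales).astype(int)
    return stats

def keep_clean_pairs(df, stats):
    """Keep only the rows whose (Store, Dept) pair is not flagged."""
    good = stats.loc[stats['outlier_flag'] == 0, ['Store', 'Dept']]
    return df.merge(good, on=['Store', 'Dept'], how='inner')

# Hypothetical toy data with 3 expected weeks instead of 143
toy = pd.DataFrame({
    'Store': [1, 1, 1, 1, 1, 1, 2, 2],
    'Dept': [1, 1, 1, 2, 2, 2, 1, 1],
    'Date': ['w1', 'w2', 'w3', 'w1', 'w2', 'w3', 'w1', 'w2'],
    'weeklySales': [10.0, 12.0, 11.0, 5.0, -1.0, 4.0, 7.0, 8.0],
})
stats = flag_pairs(toy, border_value=0, expected_weeks=3)
clean = keep_clean_pairs(toy, stats)
print(stats)
print(clean)
```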

In [4]:
# 4-Reading cleaned dataset and updating some lists after cleaning

# 4-1 Reading the saved clean data from disk
dataset_sub2 = pd.read_csv("temp_test/dataset_sub2_exp51.csv", names=['Date','Dept','Store', 'Type','MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5','isHoliday','weeklySales'],sep=',', header=0)
dataset_sub3 = dataset_sub2[['Date','Dept','weeklySales']]

# 4-2 Updating list of unique departments' values
dept_list = dataset_sub2['Dept'].unique()
dept_list.sort()

# 4-3 Updating list of unique stores' values
store_list = dataset_sub2['Store'].unique()
store_list.sort()

# 4-4 Updating list of unique dates
date_list = dataset_sub2['Date'].unique()
date_list.sort()

print('List of Departments:',dept_list,'\n')
print('─' * 100, '\n List of Stores: \n',store_list,'\n')
print('─' * 100, '\n List of Dates: \n',date_list)


List of Departments: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 17 19 20 21 22 23 24 25 26
 27 28 29 30 31 32 33 34 35 36 37 38 40 41 42 44 46 48 49 50 52 55 56 58
 59 60 65 67 71 72 74 79 80 81 82 83 85 87 90 91 92 93 94 95 96 97 98] 

──────────────────────────────────────────────────────────────────────────────────────────────────── 
 List of Stores: 
 [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45] 

──────────────────────────────────────────────────────────────────────────────────────────────────── 
 List of Dates: 
 ['2010-02-05' '2010-02-12' '2010-02-19' '2010-02-26' '2010-03-05'
 '2010-03-12' '2010-03-19' '2010-03-26' '2010-04-02' '2010-04-09'
 '2010-04-16' '2010-04-23' '2010-04-30' '2010-05-07' '2010-05-14'
 '2010-05-21' '2010-05-28' '2010-06-04' '2010-06-11' '2010-06-18'
 '2010-06-25' '2010-07-02' '2010-07-09' '2010-07-16' '2010-07-23'
 '2010-07-30' '2010-08-06' '2010-08-13' '2010-08-

In [8]:
# 5-Defining experiment process, models, and methods

# 5-1 This function aggregates stores weekly sales(Test:OK)
def aggregator(df):
    aggr = df.groupby(['Date','Dept'], as_index=False).sum()
    return aggr

# 5-2 This function takes a DataFrame and returns a DataFrame with calendar features derived from Date. (Test:OK)
# It also downcasts the columns to smaller dtypes to reduce RAM usage
def create_features(df):
    features = df.copy()  # work on a copy so the caller's DataFrame is not modified
    features['Date'] = pd.to_datetime(features['Date'])
    features['dayofweek'] = features['Date'].dt.dayofweek
    features['quarter'] = features['Date'].dt.quarter
    features['month'] = features['Date'].dt.month
    features['year'] = features['Date'].dt.year
    features['dayofyear'] = features['Date'].dt.dayofyear
    features['dayofmonth'] = features['Date'].dt.day
    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the ISO week number
    features['weekofyear'] = features['Date'].dt.isocalendar().week

    cols_int16 = ['dayofweek','quarter','month','year','dayofyear','dayofmonth','weekofyear', 'Dept']
    for col in cols_int16:
        features[col] = features[col].astype('int16')

    cols_float32 = ['weeklySales']
    for col in cols_float32:
        features[col] = features[col].astype('float32')

    X = features[['Date','dayofweek','quarter','month','year','dayofyear','dayofmonth','weekofyear', 'Dept', 'weeklySales']]
    print('\n >>features: \n',X)
    return X

# 5-3 This function splits a time-series dataset into train and test sets at the given split date (Test:OK)
def split_data(df, split_date):
    return df[df.Date < split_date].copy(), \
            df[df.Date >= split_date].copy()
    
# 5-4 This function plots test and train values of target in time (Test:NA)
def plt_test_train(df_train, df_test):
    plt.figure(figsize = (20,10))
    plt.xlabel('date')
    plt.ylabel('weekly sales')
    plt.plot(df_train.index, df_train['weeklySales'],label = 'train')
    plt.plot(df_test.index, df_test['weeklySales'], label ='test')
    plt.legend()
    plt.show()
    msg = 'PLT Done ! \n'
    return msg

# 5-5 This function creates, tunes, plots, finalizes, predicts, and evaluates all models in mdls list for a set of data (Test:NA)
def create_models(mdls, test, result_log):
    for mdl in mdls:
        mdll = create_model('{}'.format(mdl))
        print('\n \n >>mdll = create_model(mdl) for Model:{} IS  DONE! \n \n'.format(mdl))
        tuned_mdl = tune_model(mdll, n_iter = 10)
        print('\n \n >>tuned_mdl = tune_model(mdll) for Model:{} IS  DONE! \n \n'.format(mdl))
        
        #plot_model(mdll)
        #print('\n \n >>plot_model(mdll) for Model:{} IS  DONE! \n \n'.format(mdl))
        #plot_model(mdll, plot = 'error')
        #print('\n \n >>plot_model(mdll, plot = error) for Model:{} IS  DONE! \n \n'.format(mdl))
        #plot_model(tuned_mdl, plot = 'feature')
        #print('\n \n >>plot_model(tuned_mdl, plot = feature) for Model:{} IS  DONE! \n \n'.format(mdl))
        
        predict_model(tuned_mdl)
        print('\n \n >>predict_model(tuned_mdl) for Model:{} IS  DONE! \n \n'.format(mdl))
        
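        # NOTE: finalize_model refits the pipeline on the training data plus the hold-out (test) data,
        # so every metric computed on `test` after this point is in-sample, not an out-of-sample score.
        final_mdl = finalize_model(tuned_mdl)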
        print('\n \n >>final_mdl = finalize_model(tuned_mdl) for Model:{} IS  DONE! \n \n'.format(mdl))
        
        print(final_mdl)
        print('\n \n >>print(final_mdl) for Model:{} IS  DONE! \n \n'.format(mdl))
        #evaluate_model(final_mdl)
        #print('\n \n >>evaluate_model(final_mdl) for Model:{} IS  DONE! \n \n'.format(mdl))
        
        predict_model(final_mdl)
        print('\n \n >>predict_model(final_mdl) for Model:{} IS  DONE! \n \n'.format(mdl))
        
        pred_mdl = predict_model(final_mdl, data=test)
        pred_mdl.to_csv('pred_output/exp51/{}_pred.csv'.format(mdl))
        print('\n \n >>pred_mdl = predict_model(final_mdl, data=test) for Model:{} IS  DONE! \n \n'.format(mdl))
       
        result_log[mdl] = pred_mdl.prediction_label
        result_log['{}_ape'.format(mdl)] = ((result_log.actual - result_log[mdl]) / result_log.actual).abs() 
        result_log['{}_pe'.format(mdl)] = ((result_log.actual - result_log[mdl]) / result_log.actual) 
        result_log['{}_pos_pe'.format(mdl)] = result_log[(result_log['{}_pe'.format(mdl)] >= 0)]['{}_pe'.format(mdl)]
        result_log['{}_neg_pe'.format(mdl)] = result_log[(result_log['{}_pe'.format(mdl)] < 0)]['{}_pe'.format(mdl)]
        pos_pe_sum = result_log['{}_pos_pe'.format(mdl)].sum()
        max_pos_pe = result_log['{}_pos_pe'.format(mdl)].max()
        neg_pe_sum = result_log['{}_neg_pe'.format(mdl)].sum()
        max_neg_pe = result_log['{}_neg_pe'.format(mdl)].min()
        mape = result_log['{}_ape'.format(mdl)].mean()
        #result_log_aggr.at[i, '{}_pos_pe_sum'.format(mdl)] = pos_pe_sum
        #result_log_aggr.at[i, '{}_max_pos_pe'.format(mdl)] = max_pos_pe
        #result_log_aggr.at[i, '{}_neg_pe_sum'.format(mdl)] = neg_pe_sum
        #result_log_aggr.at[i, '{}_max_neg_pe'.format(mdl)] = max_neg_pe
        #result_log_aggr.at[i, '{}_mape'.format(mdl)] = mape
        
        #dept_mape_list.append(result_log['{}_ape'.format(mdl)].mean())
        #result_log_aggr = dept_mape_list.add(mape)
        print('\n \n >> Prediction of Model:{}  IS  DONE! and MAPE Value is: '.format(mdl), mape)
    return result_log   
    #pass


# 5-6 This function sets up machine-learning process configurations(Test:NA)
def mlsetup(train, test):
    reg = setup(data = train,
            test_data = test,
            target = 'weeklySales',
            #categorical_features = ['Type', 'isHoliday'],
            #numeric_features = ['Date', 'Dept','Store','dayofweek','quarter','month','year','dayofyear','dayofmonth','weekofyear','MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5'],
            #preprocess = False,
            imputation_type = None, # no imputation needed: missing values were already filled during cleaning
            #numeric_imputation = 'mean',
            polynomial_features = False, # do not add powers of existing features to capture non-linear relationships with the target
            transformation = False,
            normalize = False,
            #normalize_method = 'zscore',
            transform_target = False,
            remove_multicollinearity = False,
            #multicollinearity_threshold = 0.95,
            remove_outliers = False,
            #outliers_method = 'ee', #options are 'ee', 'lof', 'iforest'
            #outliers_threshold = 0.05,
            feature_selection = False,
            #feature_selection_method = 'sequential',
            #feature_selection_estimator = 'lightgbm',
            #n_features_to_select = 0.2,
            #use_gpu = True,
            #profile = True,
            fold_strategy = 'kfold', #other options are 'groupkfold', 'timeseries'
            fold = 10,  
            #fold_groups = 'dept',
            data_split_shuffle = False,
            #fold_shuffle = True,
           )
    print('\n \n >>ML setup  IS  DONE! \n \n')
    #best = compare_models(sort = 'MAPE', n_select = 1)
    #best2 = compare_models(sort = 'MAPE', n_select = 2)
    #best3 = compare_models(sort = 'MAPE', n_select = 3)
    #best4 = compare_models(sort = 'MAPE', n_select = 4)
    #best5 = compare_models(sort = 'MAPE', n_select = 5)
    #print('\n \n >>best = compare_models IS  DONE! \n \n')
    #evaluate_model(best)
    #print('\n \n >>evaluate_model(best) IS  DONE! \n \n')
    #return best
    pass


# 5-7 This function executes each step of the whole experiment process one by one. (Test:NA)
#Experiment52:
def experiment52(df, split_date, mdls):
    y = aggregator(df)
    z = create_features(y)
    train, test = split_data(z, split_date)
    train2 = train.drop(columns=['Date'])
    test2 = test.drop(columns=['Date'])
    print('>create features and split_data func is Done! \n')
    
    mlsetup(train2, test2)
    print('\n >mlsetup func is Done! \n')

    result_log = result_log_init(test)
    result_log = create_models(mdls, test, result_log)
    result_log.to_csv('output_analysis/exp52/result_log.csv')

#    output_anal = output_anal(result_log)
#    output_anal.to_csv('output_analysis/exp52/output_anal.csv')
    
    print('\n >create_models func is Done! \n')
        
    #result_log_aggr_exp1.to_csv('output_analysis/exp1/result_log_aggr_exp1.csv')
    process_end_msg = '>>>>>>>>>>> Experience 52 is DONE! <<<<<<<<<<<<'
    print(process_end_msg)
    pass


# 5-8 This function initializes the result log with the date, department, and actual sales of the test set
def result_log_init(test_df):
    result_log = pd.DataFrame()
    result_log.index = test_df.index
    result_log['Date'] = test_df['Date']
    result_log['Dept'] = test_df['Dept']
    result_log['actual'] = test_df['weeklySales']
    return result_log


# 5-9 This function is an unfinished draft of the per-model output analysis
#def output_anal(result_log):
#    output_anal = pd.DataFrame()
#    output_anal.index = mdls
    
    
    #output_anal_dept_best = pd.DataFrame()
    #output_anal.index = dept_list
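
`mlsetup` uses plain `kfold` on data that is ordered in time, and its comment lists `timeseries` as an alternative fold strategy. The idea behind time-ordered folds can be sketched as below. This is an illustration only, not PyCaret's implementation; the function name `expanding_window_folds` and the sample dates are hypothetical.

```python
def expanding_window_folds(dates, n_folds):
    """Yield (train_dates, valid_dates) pairs in which validation always follows training in time."""
    unique_dates = sorted(set(dates))  # ISO-formatted date strings sort chronologically
    block = len(unique_dates) // (n_folds + 1)
    if block == 0:
        raise ValueError('not enough distinct dates for the requested number of folds')
    for k in range(1, n_folds + 1):
        train = unique_dates[:k * block]                  # everything up to the cut
        valid = unique_dates[k * block:(k + 1) * block]   # the block right after the cut
        yield train, valid

# Hypothetical weekly dates; several departments share each date in the real data
sample_dates = ['2010-02-05', '2010-02-12', '2010-02-19', '2010-02-26',
                '2010-03-05', '2010-03-12', '2010-03-19', '2010-03-26']
folds = list(expanding_window_folds(sample_dates, n_folds=3))
for train, valid in folds:
    print(len(train), 'training weeks ->', valid)
```

With folds built this way, no validation week precedes a training week, which plain k-fold on the same rows does not guarantee.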




In [9]:
# 6-Setting parameters and executing the experiment
#mdls = ['xgboost', 'knn','catboost', 'lightgbm', 'gbr', 'huber', 'ada', 'par', 'omp', 'en', 'lasso', 'llar', 'br', 'ridge', 'lar', 'lr','dt', 'rf', 'et']
mdls = ['xgboost', 'knn','catboost', 'lightgbm']
# excluded: 'dummy'

experiment52(dataset_sub3, '2011-10-19', mdls)


 >>features: 
             Date  dayofweek  quarter  month  year  dayofyear  dayofmonth  \
0     2010-02-05          4        1      2  2010         36           5   
1     2010-02-05          4        1      2  2010         36           5   
2     2010-02-05          4        1      2  2010         36           5   
3     2010-02-05          4        1      2  2010         36           5   
4     2010-02-05          4        1      2  2010         36           5   
...          ...        ...      ...    ...   ...        ...         ...   
10148 2012-10-26          4        4     10  2012        300          26   
10149 2012-10-26          4        4     10  2012        300          26   
10150 2012-10-26          4        4     10  2012        300          26   
10151 2012-10-26          4        4     10  2012        300          26   
10152 2012-10-26          4        4     10  2012        300          26   

       weekofyear  Dept   weeklySales  
0               5     1  8.8183

,Description,Value
0,Session id,2875
1,Target,weeklySales
2,Target type,Regression
3,Original data shape,"(10153, 9)"
4,Transformed data shape,"(10153, 9)"
5,Transformed train set shape,"(6319, 9)"
6,Transformed test set shape,"(3834, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,



 
 >>ML setup  IS  DONE! 
 


 >mlsetup func is Done! 



Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,44960.4688,8524821504.0,92329.9609,0.9843,0.3192,0.2457
1,46076.1797,10984639488.0,104807.6328,0.9786,0.2786,0.1836
2,31800.1934,2617919232.0,51165.6055,0.9951,0.2935,0.2273
3,36656.8477,3320674304.0,57625.293,0.9936,0.2872,0.213
4,89010.7969,161420853248.0,401772.1562,0.7709,0.407,0.2122
5,143194.7656,94497292288.0,307404.125,0.8692,0.5669,0.7732
6,41111.8555,8630706176.0,92901.5938,0.9842,0.2702,0.1957
7,42340.8359,10444719104.0,102199.4062,0.98,0.3625,0.1914
8,31549.9883,2446366976.0,49460.7617,0.9955,0.3335,0.1981
9,38451.0117,4328953344.0,65794.7812,0.9921,0.3591,0.1807



 
 >>mdll = create_model(mdl) for Model:xgboost IS  DONE! 
 



Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,83766.4766,20599029760.0,143523.625,0.9622,0.6018,0.8271
1,80021.4453,17265676288.0,131398.9219,0.9664,0.6799,1.7463
2,70600.7578,10603460608.0,102973.1094,0.9801,0.6742,1.2138
3,64000.125,8850051072.0,94074.7109,0.9829,0.5991,0.8451
4,119310.1875,169619800064.0,411849.25,0.7592,0.612,0.7849
5,169791.0781,125063503872.0,353643.1875,0.8269,0.8168,1.5775
6,80406.2656,12399921152.0,111354.9297,0.9773,0.7057,1.5248
7,80205.1641,18993750016.0,137817.8125,0.9637,0.7116,1.4189
8,74712.7109,10753192960.0,103697.6016,0.9803,0.7973,2.6354
9,66674.0156,9283755008.0,96352.2422,0.9831,0.6718,1.0748


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).

 
 >>tuned_mdl = tune_model(mdll) for Model:xgboost IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Extreme Gradient Boosting,57413.9531,36313059328.0,190559.8594,0.942,0.3312,0.2148



 
 >>predict_model(tuned_mdl) for Model:xgboost IS  DONE! 
 


 
 >>final_mdl = finalize_model(tuned_mdl) for Model:xgboost IS  DONE! 
 

Pipeline(memory=Memory(location=None),
         steps=[('placeholder', None),
                ('actual_estimator',
                 XGBRegressor(base_score=None, booster='gbtree', callbacks=None,
                              colsample_bylevel=None, colsample_bynode=None,
                              colsample_bytree=None, device='cpu',
                              early_stopping_rounds=None,
                              enable_categorical=False, eval_metric=None,
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Extreme Gradient Boosting,25056.1953,1753826816.0,41878.7148,0.9972,0.2997,0.1695



 
 >>predict_model(final_mdl) for Model:xgboost IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Extreme Gradient Boosting,25056.1953,1753826816.0,41878.7148,0.9972,0.2997,0.1695



 
 >>pred_mdl = predict_model(final_mdl, data=test) for Model:xgboost IS  DONE! 
 


 
 >> Prediction of Model:xgboost  IS  DONE! and MAPE Value is:  0.16945149


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,393735.6562,329721118720.0,574213.5,0.3943,1.1914,3.6436
1,385118.7812,301702873088.0,549274.875,0.4135,1.2396,6.3871
2,370103.9062,301492436992.0,549083.25,0.4349,1.1818,4.7484
3,367079.0,296771649536.0,544767.5,0.428,1.153,3.6292
4,438247.8125,499146129408.0,706502.75,0.2914,1.1888,3.3159
5,464093.5625,479720275968.0,692618.4375,0.3361,1.3046,5.7778
6,391716.2188,322508521472.0,567898.3125,0.4084,1.2234,4.5382
7,392957.5,306737086464.0,553838.5,0.4138,1.2846,5.9809
8,375999.125,301053968384.0,548683.875,0.4479,1.2634,7.4919
9,368497.2188,302135574528.0,549668.625,0.4493,1.1873,3.7828



 
 >>mdll = create_model(mdl) for Model:knn IS  DONE! 
 



Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,46296.4453,15384372224.0,124033.7578,0.9717,0.1848,0.1112
1,49277.1172,17174842368.0,131052.8203,0.9666,0.1929,0.1138
2,30161.0449,2556115968.0,50558.0469,0.9952,0.1616,0.0828
3,36332.9453,5116688384.0,71531.0312,0.9901,0.1762,0.0921
4,235118.0781,391295303680.0,625536.0,0.4445,0.7604,1.3251
5,291808.4375,570596327424.0,755378.25,0.2103,0.8979,7.326
6,45448.9922,11270000640.0,106160.2578,0.9793,0.1938,0.1233
7,43111.5234,11359772672.0,106582.2344,0.9783,0.1861,0.1136
8,31193.2207,3767794176.0,61382.3594,0.9931,0.1648,0.1059
9,32379.1816,3799320832.0,61638.6328,0.9931,0.1415,0.0914


Fitting 10 folds for each of 10 candidates, totalling 100 fits

 
 >>tuned_mdl = tune_model(mdll) for Model:knn IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,K Neighbors Regressor,42067.2734,9131756544.0,95560.2266,0.9854,0.1664,0.0938



 
 >>predict_model(tuned_mdl) for Model:knn IS  DONE! 
 


 
 >>final_mdl = finalize_model(tuned_mdl) for Model:knn IS  DONE! 
 

Pipeline(memory=Memory(location=None),
         steps=[('placeholder', None),
                ('actual_estimator',
                 KNeighborsRegressor(metric='euclidean', n_jobs=-1,
                                     n_neighbors=1))])

 
 >>print(final_mdl) for Model:knn IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,K Neighbors Regressor,0.0,0.0,0.0,1.0,0.0,0.0



 
 >>predict_model(final_mdl) for Model:knn IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,K Neighbors Regressor,0.0,0.0,0.0,1.0,0.0,0.0



 
 >>pred_mdl = predict_model(final_mdl, data=test) for Model:knn IS  DONE! 
 


 
 >> Prediction of Model:knn  IS  DONE! and MAPE Value is:  0.0


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,68465.8576,16459625303.6102,128295.0712,0.9698,0.5988,0.81
1,60395.2104,12542472461.5133,111993.1804,0.9756,0.5301,0.7306
2,45855.4347,4545918233.5391,67423.425,0.9915,0.511,0.6286
3,44262.9795,4480114180.6703,66933.6551,0.9914,0.4283,0.3997
4,94909.0843,145817050337.6274,381859.9879,0.793,0.4312,0.3881
5,163227.5383,190104978985.5444,436010.2969,0.7369,0.6762,1.0071
6,48395.339,8266595622.5699,90920.8206,0.9848,0.412,0.4375
7,59096.1809,13364290599.1184,115604.025,0.9745,0.4931,0.529
8,45935.8178,4276269563.9643,65393.192,0.9922,0.5108,0.7876
9,45219.3064,5164568468.2807,71864.9321,0.9906,0.374,0.2868



 
 >>mdll = create_model(mdl) for Model:catboost IS  DONE! 
 



Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,65810.6321,15517433416.6184,124568.9906,0.9715,0.5424,0.6722
1,60463.1984,11339348198.8351,106486.3756,0.978,0.5785,1.0014
2,46665.0503,4597185235.406,67802.5459,0.9914,0.5567,0.753
3,45340.3213,5458603455.1012,73882.3623,0.9895,0.4299,0.4063
4,105083.7461,140332905328.3892,374610.338,0.8008,0.4855,0.4876
5,160022.2246,128150851977.327,357981.6364,0.8226,0.7481,1.3186
6,52431.6884,8463694983.7519,91998.3423,0.9845,0.4938,0.6215
7,58277.8678,12495209858.3249,111781.9747,0.9761,0.5484,0.7899
8,45131.1935,4188760345.1208,64720.6331,0.9923,0.5351,0.8485
9,46197.6383,5385427218.283,73385.4701,0.9902,0.3749,0.3066


Fitting 10 folds for each of 10 candidates, totalling 100 fits

 
 >>tuned_mdl = tune_model(mdll) for Model:catboost IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,CatBoost Regressor,65788.0555,15993715659.8565,126466.2629,0.9745,0.484,0.4957



 
 >>predict_model(tuned_mdl) for Model:catboost IS  DONE! 
 


 
 >>final_mdl = finalize_model(tuned_mdl) for Model:catboost IS  DONE! 
 

Pipeline(memory=Memory(location=None),
         steps=[('placeholder', None),
                ('actual_estimator',
                 <catboost.core.CatBoostRegressor object at 0x000001D07202EA90>)])

 
 >>print(final_mdl) for Model:catboost IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,CatBoost Regressor,45538.2307,5902041691.9694,76824.7466,0.9906,0.4507,0.4419



 
 >>predict_model(final_mdl) for Model:catboost IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,CatBoost Regressor,45538.2307,5902041691.9694,76824.7466,0.9906,0.4507,0.4419



 
 >>pred_mdl = predict_model(final_mdl, data=test) for Model:catboost IS  DONE! 
 


 
 >> Prediction of Model:catboost  IS  DONE! and MAPE Value is:  0.44193582640308227


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,71193.8833,19519861336.3045,139713.4973,0.9641,0.516,0.5903
1,63814.0447,12383587171.5853,111281.5671,0.9759,0.5743,1.0249
2,49100.5951,5506515899.4201,74205.9021,0.9897,0.5194,0.6679
3,61682.59,10624504299.079,103075.2361,0.9795,0.5121,0.4868
4,119077.7203,164801449700.7374,405957.4481,0.7661,0.5685,0.5468
5,170278.8177,127309940667.2957,356805.1859,0.8238,0.7336,1.124
6,56517.8078,8651622674.4299,93014.0993,0.9841,0.5011,0.6333
7,57590.1851,12591054029.2583,112209.866,0.9759,0.5945,0.9813
8,48521.7063,5354305802.7194,73173.1221,0.9902,0.5929,1.1261
9,60323.5566,10035815339.535,100178.9166,0.9817,0.5039,0.5283



 
 >>mdll = create_model(mdl) for Model:lightgbm IS  DONE! 
 



Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,55050.9793,13057872246.1829,114271.0473,0.976,0.3591,0.2947
1,49750.7944,10774769300.215,103801.5862,0.9791,0.4687,0.6169
2,37406.5013,3456155958.6881,58789.0803,0.9935,0.4577,0.5141
3,40013.7944,3866661263.4811,62182.4836,0.9925,0.3707,0.3074
4,96632.9878,158664816721.3784,398327.5244,0.7748,0.4418,0.3625
5,149061.3716,106538447840.2915,326402.2792,0.8526,0.637,0.9049
6,43112.4533,6869769444.2196,82884.0723,0.9874,0.3087,0.2227
7,48287.1931,11623015580.1634,107810.0903,0.9778,0.4303,0.4172
8,36944.2173,3206079267.4787,56622.2506,0.9941,0.4583,0.6309
9,41600.8901,4597984146.073,67808.4371,0.9916,0.347,0.275


Fitting 10 folds for each of 10 candidates, totalling 100 fits

 
 >>tuned_mdl = tune_model(mdll) for Model:lightgbm IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Light Gradient Boosting Machine,53307.2772,9707070556.8699,98524.4668,0.9845,0.4112,0.3394



 
 >>predict_model(tuned_mdl) for Model:lightgbm IS  DONE! 
 


 
 >>final_mdl = finalize_model(tuned_mdl) for Model:lightgbm IS  DONE! 
 

Pipeline(memory=Memory(location=None),
         steps=[('placeholder', None),
                ('actual_estimator',
                 LGBMRegressor(bagging_fraction=0.9, bagging_freq=5,
                               feature_fraction=0.8, learning_rate=0.2,
                               min_child_samples=1, min_split_gain=0.7,
                               n_estimators=150, n_jobs=-1, num_leaves=40,
                               random_state=2875, reg_alpha=0.5,
                               reg_lambda=10))])

 
 >>print(final_mdl) for Model:lightgbm IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Light Gradient Boosting Machine,35502.0777,3222810529.5645,56769.803,0.9949,0.4002,0.329



 
 >>predict_model(final_mdl) for Model:lightgbm IS  DONE! 
 



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Light Gradient Boosting Machine,35502.0777,3222810529.5645,56769.803,0.9949,0.4002,0.329



 
 >>pred_mdl = predict_model(final_mdl, data=test) for Model:lightgbm IS  DONE! 
 


 
 >> Prediction of Model:lightgbm  IS  DONE! and MAPE Value is:  0.32896850404271477

 >create_models func is Done! 

>>>>>>>>>>> Experience 52 is DONE! <<<<<<<<<<<<
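
The MAPE of 0.0 reported for knn above is most likely an evaluation artifact rather than a perfect model. The tuned pipeline is a `KNeighborsRegressor` with `n_neighbors=1`, and `finalize_model` refits it on data that includes the test rows, so each test row finds itself as its nearest neighbor. The sketch below reproduces that effect with a hand-written 1-nearest-neighbor predictor on synthetic data; the helper names and the data are hypothetical and are not part of the experiment.

```python
import numpy as np

def one_nn_predict(X_fit, y_fit, X_query):
    """Predict each query row with the target of its nearest fitted row (1-NN, Euclidean)."""
    # Squared Euclidean distance between every query row and every fitted row
    dist = ((X_query[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    return y_fit[dist.argmin(axis=1)]

def mape(actual, pred):
    """Mean absolute percentage error."""
    return float(np.mean(np.abs((actual - pred) / actual)))

# Synthetic data: a noisy linear target on two random features
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(40, 2))
y_train = 100 + 50 * X_train[:, 0] + rng.normal(scale=5, size=40)
X_test = rng.uniform(size=(10, 2))
y_test = 100 + 50 * X_test[:, 0] + rng.normal(scale=5, size=10)

# Fitted on train only: an honest out-of-sample error
out_of_sample = mape(y_test, one_nn_predict(X_train, y_train, X_test))

# Fitted on train + test (what refitting on all data amounts to): the test rows are memorised
X_all = np.vstack([X_train, X_test])
y_all = np.concatenate([y_train, y_test])
in_sample = mape(y_test, one_nn_predict(X_all, y_all, X_test))

print('out-of-sample MAPE:', out_of_sample)
print('in-sample MAPE:', in_sample)
```

For a fair ranking of the four models, the test-set predictions would need to come from a model that has not seen the test rows, for example the tuned model before finalization.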
