# **Intro and Setup**

## **Final Project for Coursera's 'How to Win a Data Science Competition'**
April, 2020;  Andreas Theodoulou and Michael Gaidis;  (Competition Info last updated:  3 years ago)

### **About this Competition**

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

Evaluation: root mean squared error (RMSE). True target values are clipped into [0,20] range.
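
As a minimal sketch of this metric (not the official scoring code), the clipped RMSE could be computed as below. The function name `clipped_rmse` and the sample values are hypothetical. Clipping the predictions as well as the targets is an assumption: because the targets never leave [0, 20], clipping a prediction into that range cannot increase its error.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, lo=0, hi=20):
    """RMSE after clipping targets (and, by assumption, predictions) into [lo, hi]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), lo, hi)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), lo, hi)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical monthly counts for four (shop, item) pairs
score = clipped_rmse([0, 3, 25, 1], [0.5, 2.0, 30.0, 1.0])
print(f"{score:.4f}")
```

Here the third pair contributes no error, since both the target (25) and the prediction (30) are clipped to 20.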

### **File descriptions**

***sales_train.csv*** - the training set. Daily historical data from January 2013 to October 2015.

***test.csv*** - the test set. You need to forecast the sales for these shops and products for November 2015.

***sample_submission.csv*** - a sample submission file in the correct format.

***items.csv*** - supplemental information about the items/products.

***item_categories.csv*** - supplemental information about the item categories.

***shops.csv*** - supplemental information about the shops.

### **Data fields**

***ID*** - an Id that represents a (Shop, Item) tuple within the test set

***shop_id*** - unique identifier of a shop

***item_id*** - unique identifier of a product

***item_category_id*** - unique identifier of item category

***item_cnt_day*** - number of products sold. You are predicting a monthly amount of this measure

***item_price*** - current price of an item

***date*** - date in format dd/mm/yyyy

***month*** - a consecutive month number. January 2013 is 0, February 2013 is 1,..., October 2015 is 33

***item_name*** - name of item

***shop_name*** - name of shop

***item_category_name*** - name of item category
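
To illustrate how the daily ***item_cnt_day*** values relate to the monthly target and to the consecutive ***month*** number, here is a small sketch. The sample rows and the output column name `item_cnt_month` are assumptions for illustration, and the date parsing follows the dd/mm/yyyy format stated above.

```python
import pandas as pd

# Hypothetical daily records in the layout described above
daily = pd.DataFrame({
    "date": ["02/01/2013", "15/01/2013", "03/02/2013", "28/10/2015"],
    "shop_id": [5, 5, 5, 5],
    "item_id": [100, 100, 100, 100],
    "item_cnt_day": [1, 2, 4, 1],
})

# Consecutive month number: January 2013 -> 0, ..., October 2015 -> 33
dates = pd.to_datetime(daily["date"], format="%d/%m/%Y")
daily["month"] = (dates.dt.year - 2013) * 12 + (dates.dt.month - 1)

# Sum the daily counts into one row per (month, shop, item)
monthly = (
    daily.groupby(["month", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
print(monthly)
```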

## **Colab Prep Tips** for those using Google Colab




### **Save Previous Work**
* Click **File -> Save a copy in Drive**, then click **Open in new tab** in the pop-up window to save your progress in Google Drive. (This places the copy at the top level of the Colab directory.)
* Alternatively, before opening this notebook, right-click on this ipynb in Google Drive and select ***make a copy***. With the copy in the same directory, right-click and select ***rename*** to update the version number. Finally, right-click on the new version and select ***open in colab***.

### **Select Runtime Type** *before* running notebook:
* Click **Runtime -> Change runtime type** and select **GPU** or **TPU** in Hardware accelerator box to enable faster training.

### **Keep Colab Active**
* To keep Colab connected, a script can click in the Colab window once every minute. Open Chrome DevTools, go to the Console tab, and run the following code (as of April 2020). This should prevent disconnection after 1.5 hours of inactivity, but each runtime is still terminated after 12 hours without Colab Pro (24 hours with Pro). The interval below is in milliseconds.
```
function ClickConnect(){
    console.log("Clicked on connect button"); 
    document.querySelector("#ok").click()
}
setInterval(ClickConnect,60000)
```
Note that the script throws an error as long as the disconnection notification is not shown; this is expected. Once the notification appears, the script clicks it to reconnect.

* If that doesn't work, try this in the console:
```
function ClickConnect(){
    console.log("Clicked on connect button"); 
    document.querySelector("colab-connect-button").click()
}
setInterval(ClickConnect,60000)
```
* Lastly, you can try this older variant:
```
function KeepClicking(){
   console.log("Clicking");
   document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(KeepClicking,600000)  // 600000 ms = 10 minutes
```

### **Time Zone Reference**
* The notebook timestamps its printouts in local time. For reference, `pytz` can attach a time zone (including daylight saving shifts) to a `datetime`:
```
from datetime import datetime
from pytz import timezone
amsterdam = timezone('Europe/Amsterdam')
ams_time = amsterdam.localize(datetime(2002, 10, 27, 6, 0, 0))
print(ams_time)
# 2002-10-27 06:00:00+01:00
# It will also know when it's Summer Time
# in Amsterdam (similar to Daylight Savings Time):
ams_time = amsterdam.localize(datetime(2002, 6, 27, 6, 0, 0))
print(ams_time)
# 2002-06-27 06:00:00+02:00
```

## **Set Up Environment**

### **Import Python Packages; Set Environment Options; Identify Input Data Files**



In [None]:
''' ################################################################################################################################################
Import Python Packages, and Document Version Numbers
''' ################################################################################################################################################

global RUN_n, MEMORY_STATS, ALL_exploded_shape, OUTPUTS_df, GDRIVE_REPO_PATH, OUT_OF_REPO_PATH

from google.colab import drive  

########## General python libraries/packages used throughout the notebook ######################################
from   itertools import product
from   collections import OrderedDict
import pprint
import re
import os
import sys
from   pathlib import Path
import platform                         # determine the active version of python
import pkg_resources                    # determine the active versions of imported packages
import psutil                           # from psutil import virtual_memory   # find how much RAM you have left in Colab VM
import gc                               # garbage collection... ok with np arrays, not useful with python objects
import multiprocessing                  # help to delete unnecessary pandas dataframes when you are done with them
from contextlib import ContextDecorator # , contextmanager ## create helper function that can be used as a decorator (for timing, e.g.)

######### timing ################################################################################################
import time
from   time import strftime, tzset, perf_counter
from   timeit import default_timer
from   datetime import datetime
os.environ['TZ'] = 'EST+05EDT,M3.2.0,M11.1.0'   # US Eastern time with the post-2007 DST rule (2nd Sun in Mar to 1st Sun in Nov); for local date/time stamps on cell runs
tzset()                                         # set the time zone

########## Helpful packages for EDA, cleaning, data manipulation #################################################
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

########### ML packages ##########################################################################################
import lightgbm as lgb  # LGBMRegressor
import sklearn
from   sklearn.preprocessing import MinMaxScaler, RobustScaler
from   sklearn.experimental import enable_hist_gradient_boosting  #noqa #explicitly require expt feature before import HistGradientBoostingRegressor
from   sklearn.ensemble import HistGradientBoostingRegressor
from   sklearn.metrics import mean_squared_error
from   sklearn.metrics import r2_score
# %tensorflow_version 2.x
# import tensorflow as tf

''' ################################################################################################################################################
Package Version Control: For finding the versions of packages used in this notebook, list the relevant packages here
''' ################################################################################################################################################
packages = ['pandas','matplotlib','numpy','scikit-learn','lightgbm']  # ,'keras','catboost','seaborn','nltk','graphx','tensorflow'
print(f'Package Imports Complete: {strftime("%a %X %x")}')

''' ################################################################################################################################################
Adjust Pandas Setup Options for enhancements and for desired formatting in this ipynb
''' ################################################################################################################################################
pd.set_option('compute.use_bottleneck', False)  # False disables bottleneck acceleration of NaN-aware operations; set True to enable it
pd.set_option('compute.use_numexpr', False)     # False disables numexpr acceleration (bool ops, large dfs, DataFrame.query(), pandas.eval()); set True to enable it
pd.set_option("display.max_rows",60)            # Max rows shown before truncation; raise to 84 or more to see, e.g., the full 84-row item_category df
pd.set_option("display.max_columns",60)         # Similar to row code above, we can show more columns than default
pd.set_option("display.width", 220)             # Tune this to monitor window size to avoid horiz scroll bars in output windows (but may get text wrap)
pd.set_option("display.max_colwidth", None)     # This is done, for example, so we can see full item name and not '...' in the middle
pd.options.display.float_format = lambda x : '{:.0f}'.format(x) if round(x,0) == x else '{:,.3f}'.format(x)  # no decimals if integer; use 3 dec float
print(f'Pandas Setup Complete: {strftime("%a %X %x")}')


''' ################################################################################################################################################
Identify Input Data and File Path Info (path relative to local Google Drive Repo Head = GDRIVE_REPO_PATH)
''' ################################################################################################################################################
GDRIVE_HOME = Path("/content" + "/drive")                               # initial directory when mounting Google Drive in Colab
COLAB_DIR = GDRIVE_HOME / "My Drive/Colab Notebooks"                    # default Colab directory
GDRIVE_REPO_PATH = COLAB_DIR / "NRUHSE_2_Kaggle_Coursera/final/Kag"     # content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
OUT_OF_REPO_PATH = COLAB_DIR / "NRUHSE_2_Kaggle_Coursera/final"         # place >100MB files here, because they won't sync with GitHub (eg, ftr files)
data_files = [  "data_output/items_enc.csv",                # pandas df name will be "items_enc" (see list comprehension below)
                "data_output/shops_enc.csv",
                "data_output/date_scaling.csv",
                "data_output/stt.csv.gz",                   # stt is short for sales-train-test (contains all data for months 0 to 34)
                "readonly/final_project_data/test.csv.gz"]

data_df_names = [path_name.rsplit('/')[-1].split('.')[0] for path_name in data_files] # root name of csv files will name the respective pandas dfs
data_sources = list(zip(data_df_names, data_files))
print(f'Data File Sources Identified: {strftime("%a %X %x")}\n{data_sources}')

Package Imports Complete: Wed 06:09:37 09/30/20
Pandas Setup Complete: Wed 06:09:37 09/30/20
Data File Sources Identified: Wed 06:09:37 09/30/20
[('items_enc', 'data_output/items_enc.csv'), ('shops_enc', 'data_output/shops_enc.csv'), ('date_scaling', 'data_output/date_scaling.csv'), ('stt', 'data_output/stt.csv.gz'), ('test', 'readonly/final_project_data/test.csv.gz')]
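
The name/path pairs printed above suggest one DataFrame per source file. The following is a sketch of how such pairs could be turned into a dictionary of DataFrames; this section does not show how the notebook actually loads the files, so the helpers `df_name` and `load_sources` and the throwaway demo files are hypothetical. It relies on pandas inferring gzip compression from the `.gz` suffix.

```python
import tempfile
from pathlib import Path

import pandas as pd

def df_name(path_name):
    """Return the file name up to its first dot, e.g. 'data_output/stt.csv.gz' -> 'stt'."""
    return Path(path_name).name.split(".")[0]

def load_sources(root, rel_paths):
    """Read each file under root into a DataFrame, keyed by its derived name."""
    return {df_name(p): pd.read_csv(Path(root) / p) for p in rel_paths}

# Demo with throwaway files standing in for the real data directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data_output").mkdir()
    pd.DataFrame({"shop_id": [0, 1]}).to_csv(Path(tmp) / "data_output/shops_enc.csv", index=False)
    pd.DataFrame({"month": [0, 34]}).to_csv(Path(tmp) / "data_output/stt.csv.gz", index=False)
    frames = load_sources(tmp, ["data_output/shops_enc.csv", "data_output/stt.csv.gz"])

print({name: df.shape for name, df in frames.items()})
```

Keeping the frames in a dictionary avoids creating variables by name at run time, at the cost of writing `frames["stt"]` instead of `stt`.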


In [None]:
class MemoryStats:
    """Gather relevant memory consumption info into one object.
        1) get RAM usage as specified by psutil.Process(pid) and psutil.virtual_memory()
        2) get the number of active children per the multiprocessing configuration being used 
            ( = empty list [] when process is properly completed / closed)."""

    def __init__(self, program_location='Start'):
        self.program_location = program_location
        self.pid = os.getpid()
        self.py = psutil.Process(self.pid)
        self.pid_mem_use_GB = self.py.memory_info()[0] / 2. ** 30   # resident set size of this process (divided by 2**30)
        self.vm_used_GB = psutil.virtual_memory().used/ 1e9
        self.vm_total_GB = psutil.virtual_memory().total/ 1e9
        self.vm_available_GB = self.vm_total_GB - self.vm_used_GB
        self.active_proc = multiprocessing.active_children()
        self.date_time = f'{strftime("%a %X %x")}'


class ProgramMgr:
    """Hold information about computer system and files being used in this notebook."""

    def __init__(self, packages):
        self.packages = packages  # list of imports, for version control
        self.package_versions = None
        self.GDRIVE_HOME = Path("/content" + "/drive")                               # initial directory when mounting Google Drive in Colab
        self.COLAB_DIR = self.GDRIVE_HOME / "My Drive/Colab Notebooks"               # default Colab directory
        self.GDRIVE_REPO_PATH = self.COLAB_DIR / "NRUHSE_2_Kaggle_Coursera/final/Kag"     # content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
        self.OUT_OF_REPO_PATH = self.COLAB_DIR / "NRUHSE_2_Kaggle_Coursera/final"    # place >100MB files here, because they won't sync with GitHub (eg, ftr files)
        self.data_files = [
                           "data_output/items_enc.csv",                # pandas df name will be "items_enc" (see list comprehension below)
                           "data_output/shops_enc.csv",
                           "data_output/date_scaling.csv",
                           "data_output/stt.csv.gz",                   # stt is short for sales-train-test (contains all data for months 0 to 34)
                           "readonly/final_project_data/test.csv.gz"
                           ]
        # Track memory use throughout the program; usage like: o.get_memory_stats("After Merge", print_status=None)
        self.memory_stats = []
        self.runtime_type = 'Unknown runtime type'
        self.init_datetime = f'{strftime("%a %X %x")}'
        self.n_cpu = 0
        self.time = OrderedDict([('start', default_timer())])
        self.get_memory_stats("At Notebook start", print_status='location')
        self.get_runtime_type(printout=True)

    def get_df_names(self):
        """Return a list of names intended for pandas dataframes, based on the prefix of the data source files."""
        return [path_name.rsplit('/')[-1].split('.')[0] for path_name in self.data_files] # root name of csv files will name the respective pandas dfs

    def get_data_sources(self):
        """Return a zipped list of datafile names and disk locations, to be used for multiprocessing map functions."""
        return list(zip(self.get_df_names(), self.data_files))

    def get_memory_stats(self, program_location="0", print_status='location'):
        """Provide memory consumption information.  print_status = one of [None, 'location', 'all']."""
        ms = MemoryStats(program_location)
        self.memory_stats.append(ms)
        # print either all the stats gathered throughout the program, or just the present program location, or nothing
        if print_status == 'all':
            max_meas_str_len = 0
            for ms in self.memory_stats:  # find longest string to accommodate during printout
                if len(ms.program_location) > max_meas_str_len:
                    max_meas_str_len = len(ms.program_location) 
            print(f"{' ':<21}   {' ':<{max_meas_str_len}} |  pid   |               vm               {' ':<14}|")
            print(f"{'  Time and Date':<21} | {'  Measurement Point':<{max_meas_str_len}} | pid-GB | used-GB | avail-GB | total-GB | {'Active Procs':<12} |")
            for ms in self.memory_stats:
                print(f"{ms.date_time:<21} | {ms.program_location:<{max_meas_str_len}} | {ms.pid_mem_use_GB:>6.2f} | "
                      f"{ms.vm_used_GB:>7.2f} | {ms.vm_available_GB:>8.2f} | {ms.vm_total_GB:>8.2f} | {ms.active_proc}")
        elif print_status == 'location':
            print(f'{program_location}:               Multiprocessing Active Children = {ms.active_proc}')
            print(f'Memory Use: | {ms.pid_mem_use_GB:.2f} | {ms.vm_used_GB:.2f} | {ms.vm_available_GB:.2f} | {ms.vm_total_GB:.2f} | GB:'
                  f'  pid, vm used / available / total.  {ms.date_time}.')

    def get_runtime_type(self, printout=True):  
        """Check if connected to a CPU vs. GPU-enabled runtime, or possibly a TPU: (relevant for people using Google Colab)."""
        try: 
            gpu_device_name = tf.test.gpu_device_name()  # requires tensorflow imported as tf (commented out in the import cell above)
            if gpu_device_name != '/device:GPU:0':  # '' means no GPU found, whereas '/device:GPU:0' means GPU
                try: 
                    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
                    self.runtime_type = f'Colab using TPU at: {tpu.cluster_spec().as_dict()["worker"]}'
                except ValueError: self.runtime_type = 'Colab using CPU'
            else: self.runtime_type = f'Colab using GPU at: {gpu_device_name}'
        except Exception: pass  # e.g., NameError when tf is not imported; self.runtime_type stays 'Unknown runtime type'
        if printout:
            print(f'Runtime Type: {self.runtime_type}')

    def get_n_cpu(self, printout=True):   
        """Find the number of CPUs available for use."""
        try: 
            self.n_cpu = multiprocessing.cpu_count()
        except NotImplementedError: pass  # self.n_cpu stays 0
        if printout:
            print(f'Number of available CPUs: {self.n_cpu}')

    def get_package_versions(self, printout=True):  
        """Find the versions of python packages used in this notebook, to assist with re-creation of notebook at some future date.
        Could also use the following methodology: print(pandas.__version__)."""
        self.package_versions = OrderedDict([("Python", platform.python_version())])
        for pkg in sorted(self.packages): 
            try: 
                self.package_versions[pkg] = pkg_resources.get_distribution(pkg).version
            except: pass
        if printout:
            for k,v in self.package_versions.items():
                print(f'{k} version: {v}')

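    def get_elapsed_time(self, program_location='start', reference='start'):
        """Record the current time under program_location; return the time elapsed since reference, formatted as H:M:S.
        Note that the formatted value wraps around after 24 hours."""
        time_now = default_timer()
        if program_location == reference:
            program_location = f'{len(self.time)}'
        self.time[program_location] = time_now
        elapsed = datetime.utcfromtimestamp(self.time[program_location] - self.time[reference])
        return f'{elapsed.strftime("%H:%M:%S")} H:M:S'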

o = ProgramMgr(packages)  # pass the package list defined in the import cell

At Notebook start:               Multiprocessing Active Children = []
Memory Use: | 0.15 | 0.61 | 26.78 | 27.39 | GB:  pid, vm used / available / total.  Mon 19:10:38 09/28/20.
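
`get_elapsed_time` above formats a duration by passing it through `datetime.utcfromtimestamp`, which wraps around once a run exceeds 24 hours. As a hedged alternative, the sketch below formats seconds with `divmod` instead; the function name `format_elapsed` is hypothetical and not part of the notebook.

```python
from timeit import default_timer

def format_elapsed(seconds):
    """Format a duration in seconds as H:MM:SS without wrapping at 24 hours."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:d}:{minutes:02d}:{secs:02d}"

start = default_timer()
# ... the work being timed would go here ...
print(format_elapsed(default_timer() - start))
print(format_elapsed(90061))  # 25 hours, 1 minute, 1 second
```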


### **Analysis and Descriptive (Helper) Functions**



In [None]:
# ############################################################################################
# def get_memory_stats(txt="0",printout=True):
#     """
#     1) get RAM usage as specified by psutil.Process(pid) and psutil.virtual_memory()
#     2) get the number of active children per the multiprocessing configuration being used (= empty list [] when process is properly completed / closed)
#     3) allow option of printing the single data point immediately, and/or append data to a variable for formatted printout summary
#     """
#     pid = os.getpid()
#     py = psutil.Process(pid)
#     pid_memory_use = py.memory_info()[0] / 2. ** 30
#     vm_used = psutil.virtual_memory().used/ 1e9
#     vm_total = psutil.virtual_memory().total/ 1e9
#     vm_available = vm_total - vm_used
#     active_proc = multiprocessing.active_children()
#     if printout:
#         print(f'{txt}:               Multiprocessing Active Children = {active_proc}')
#         print(f'Memory Use: | {pid_memory_use:.2f} | {vm_used:.2f} | {vm_available:.2f} | {vm_total:.2f} | GB:',end='')
#         print(f'  pid, vm used / available / total.  {strftime("%a %X %x")}.')
#     return {'datetime':f'{strftime("%a %X %x")}','measure_point':txt,'pid_mem_use_GB':pid_memory_use,
#             'vm_used_GB':vm_used,'vm_available_GB':vm_available,'vm_total_GB':vm_total,'active_proc':active_proc}

# # Initialize MEMORY_STATS tracking list; throughout program; usage like: MEMORY_STATS.append(get_memory_stats("After Merge", printout=False))
# MEMORY_STATS = [get_memory_stats("At Memory Stats Func Def", printout=True)]

# ############################################################################################
# def display_all_memory_stats(list_of_dicts):   
#     """
#     formatted printout of MEMORY_STATS collection of RAM use as tracked throughout the computations
#     """
#     max_meas_str_len = 0
#     for dicts in list_of_dicts:
#         if len(dicts['measure_point']) > max_meas_str_len:
#             max_meas_str_len = len(dicts['measure_point']) 
#     print(f"{' ':<21}   {' ':<{max_meas_str_len}} |  pid   |               vm               {' ':<14}|")
#     print(f"{'  Time and Date':<21} | {'  Measurement Point':<{max_meas_str_len}} | pid-GB | used-GB | avail-GB | total-GB | {'Active Procs':<12} |")
#     for dicts in list_of_dicts:
#         print(f"{dicts['datetime']:<21} | {dicts['measure_point']:<{max_meas_str_len}} | {dicts['pid_mem_use_GB']:>6.2f} | ",end='')
#         print(f"{dicts['vm_used_GB']:>7.2f} | {dicts['vm_available_GB']:>8.2f} | {dicts['vm_total_GB']:>8.2f} | {dicts['active_proc']}")
#     return

# ############################################################################################
# def get_runtime_type(do_printout=True):  
#     """
#     Check if connected to a CPU vs. GPU-enabled runtime, or possibly a TPU: (relevant for people using Google Colab)
#     """
#     runtime_type_dict = {}
#     try: 
#         gpu_device_name = tf.test.gpu_device_name()
#         if gpu_device_name != '/device:GPU:0':  #' ' means CPU whereas '/device:G:0' means GPU
#             try: 
#                 tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
#                 run_type = f'Colab using TPU at: {tpu.cluster_spec().as_dict()["worker"]}'
#             except ValueError: run_type = 'Colab using CPU'
#         else: run_type = f'Colab using GPU at: {gpu_device_name}'
#     except: run_type = 'Unknown runtime type'
#     if do_printout:
#         print(run_type)
#     return run_type

# ############################################################################################
# def get_n_cpu(do_printout=True):   
#     """
#     Find the number of CPUs available for use
#     """
#     try: 
#         n_cpu = multiprocessing.cpu_count()
#     except: n_cpu = 0
#     if do_printout:
#         print(f'Number of available CPUs: {n_cpu}')
#     return n_cpu

# ############################################################################################
# def get_package_versions(package_list=packages, do_printout=True):  
#     """
#     Find the versions of python packages used in this notebook, to assist with re-creation of notebook at some future date
#     could also use: print(pandas.__version__)
#     """
#     package_version_dict = OrderedDict([("Python",platform.python_version())])
#     for mod in sorted(package_list): 
#         try: 
#             package_version_dict[mod] = pkg_resources.get_distribution(mod).version
#         except: pass
#     if do_printout:
#         for k,v in package_version_dict.items():
#             print(f'{k} version: {v}')
#     return package_version_dict

############################################################################################
def print_col_info(df,nrows=5,SPACE_BETWEEN_COLS = 6,COLUMN_PADDING = 3):   # formatted print of column datatypes and memory usage
    """
    instead of the usual single column (plus index) series printout of dtypes and memory use in a dataframe, 
    this function combines dtypes and memory use so they use the same index, and then prints out multiple columns of length "nrows", 
    where each column is like: "column_dtype \t column_memory_use(MB) \t column_name"; finishes with a printout of total df mem usage
        df = dataframe of interest
        nrows = int, tells how many rows of (type/mem/name) to print before moving to a new printout column for the next triplet (type/mem/name)
    """
    col_mem = df.memory_usage(deep=True) /1e6  #change to MB
    print(f'DataFrame shape: {df.shape}\nDataFrame total memory usage: {col_mem.sum():.0f} MB\nDataFrame Column Names: {df.columns.to_list()}')
    # df.memory_usage includes Index, but df.dtypes does not include Index, so we have to add it
    col_dtypes = pd.concat([pd.Series([df.index.dtype], index = ['Index']),df.dtypes], axis=0)  
    col_info_df = pd.concat([col_dtypes, col_mem], axis=1).reset_index().rename(columns={'index':'Column Name', 0:'DType', 1:'MBytes'})
    if nrows == 0:  # if nrows == 0, print all triplets in just one column, with no "wrapping"
        print(col_info_df)
    else:
        col_info_df.MBytes = col_info_df.MBytes.apply(lambda x: str(f'{x:.1f}'))
        n_print_cols, stragglers = divmod(col_info_df.shape[0], nrows)  # adjust n rows such that we don't have nasty column with just a few rows
        if (stragglers > 0):  # add an extra column if we have lots of stragglers; else make n rows a bit higher so we don't have stragglers
            n_print_cols += (stragglers > nrows/2)
            nrows = col_info_df.shape[0] // n_print_cols + (col_info_df.shape[0] % n_print_cols > 0)
        # to make dataframe of where each column is shifted by number of elements to print per column (will truncate below)
        df_print = pd.concat([col_info_df.shift(periods= -nrows * pc) for pc in range(n_print_cols)], axis = 1)  
        # truncate "over-shifted" rows so only one copy of each element in the df; make first row = column names (easier for printing)
        df_print = df_print.iloc[:nrows][:].fillna(" ").T.reset_index().T.reset_index(drop=True)   
        columnStrLengths = np.vectorize(len) 
        # find max string length in each column, add COLUMN_PADDING (roughly = 3) to create nice/clean column look
        col_widths = np.add(columnStrLengths(df_print.values.astype(str)).max(axis=0), COLUMN_PADDING)  
        for r in range(nrows):  # create strings for each row to print sequentially, containing all columns in each row string
            print_row = ''
            for c in range(len(df_print.columns)):
                print_row = print_row + f'{str(df_print.iloc[r][c]):>{col_widths[c]}} '  
                # format string for right alignment in column; padding to fit column width of 'col_widths'
                print_row = print_row + " " * SPACE_BETWEEN_COLS * ((c+1)%col_info_df.shape[1] == 0)  
                # when finished formatting one column of data; add extra spaces and move to the next column
            print(print_row)
    # MEMORY_STATS.append(get_memory_stats("print_col_info Completed",printout=True))
    return

# ############################################################################################
# class elapsed_timer(ContextDecorator): # base class enables a contextmanager to also be used as a decorator
#     def __init__(self):   
#         self.function_init_time   = f'{strftime("%a %X %x")}'
#         self.start_time           = default_timer()
#         self.function_total_time  = 0
#     def __enter__(self):  
#         self.start_time = default_timer()
#         return datetime.utcfromtimestamp(default_timer() - self.start_time).strftime('%H:%M:%S')
#     def __exit__(self, exc_type, exc_value, exc_traceback):
#         self.function_total_time = datetime.utcfromtimestamp(default_timer() - self.start_time).strftime('%H:%M:%S')
#         return True # ignores errors in while loop
#     def restart_time(self):
#         self.start_time = default_timer()
#     def get_elapsed_time(self):
#         return datetime.utcfromtimestamp(default_timer() - self.start_time).strftime('%H:%M:%S')

print(f'Helper Functions Defined: {strftime("%a %X %x")}')

''' ################################################################################################################################################
Gather and print system information, including the version numbers of python packages being used (to assist others in replicating this work)
''' ################################################################################################################################################
# The standalone helper functions are commented out above, so use the ProgramMgr instance "o" instead
o.get_runtime_type(printout=True)                                               #####    Runtime Type     #####
o.get_n_cpu(printout=True)                                                      #####   Number of CPUs    #####
available_vm_ram_gb = psutil.virtual_memory().total/ 1e9                        ##### Colab Available RAM #####
print(f'Available VM RAM (GB): {available_vm_ram_gb:.1f}\n')
o.get_package_versions(printout=True)                                           ##### Py Package Versions #####
print(f'\nSystem Info Gathered: {strftime("%a %X %x")}')
o.get_memory_stats("System Info Gathered", print_status='all')

At Memory Stats Func Def:               Multiprocessing Active Children = []
Memory Use: | 0.14 | 0.62 | 26.77 | 27.39 | GB:  pid, vm used / available / total.  Wed 03:36:14 09/09/20.
Helper Functions Defined: Wed 03:36:14 09/09/20
Unknown runtime type
Number of available CPUs: 4
Available VM RAM (GB): 27.4

Python version: 3.6.9
lightgbm version: 2.2.3
matplotlib version: 3.2.2
numpy version: 1.18.5
pandas version: 1.0.5
scikit-learn version: 0.22.2.post1

System Info Gathered: Wed 03:36:14 09/09/20
                                                 |  pid   |               vm                             |
  Time and Date       |   Measurement Point      | pid-GB | used-GB | avail-GB | total-GB | Active Procs |
Wed 03:36:14 09/09/20 | At Memory Stats Func Def |   0.14 |    0.62 |    26.77 |    27.39 | []
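
`print_col_info` above combines each column's dtype with its memory use and then wraps the result into several printed columns. The sketch below shows only the first step, aligning dtypes and memory on a shared index, as a simpler starting point; the function name `column_summary` and the sample DataFrame are hypothetical.

```python
import pandas as pd

def column_summary(df):
    """Return one row per column (plus the index) with its dtype and deep memory use in MB."""
    mem_mb = df.memory_usage(deep=True) / 1e6  # includes an 'Index' entry
    # df.dtypes omits the index, so prepend its dtype under the same 'Index' label
    dtypes = pd.concat([pd.Series([df.index.dtype], index=["Index"]), df.dtypes])
    return pd.DataFrame({"DType": dtypes.astype(str), "MBytes": mem_mb})

sample = pd.DataFrame({
    "a": pd.Series([1, 2], dtype="int64"),
    "b": ["x", "y"],
})
summary = column_summary(sample)
print(summary)
```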


## **Identify (Manually) Splits to Be Performed and Features to Use**

### **Define Feature Columns, Statistics, and Lags**

In [None]:
### for reference, column names of the loaded dataframes:
# items_enc_cols  = [ 'item_id', 'item_tested', 'item_cluster', 'item_category_id', 'item_cat_tested', 'item_group', 
#                     'item_category1', 'item_category2', 'item_category3', 'item_category4']
# shops_enc_cols  = ['shop_id','shop_tested','shop_group','shop_type','s_type_broad','shop_federal_district','fd_popdens','fd_gdp','shop_city']
# date_adj_cols   = ['month', 'year', 'season', 'MoY', 'days_in_M', 'weekday_weight', 'retail_sales', 'week_retail_weight']
# stt_cols        = ['day', 'week', 'qtr', 'season', 'month', 'price', 'sales', 'shop_id', 'item_id']
# test_cols       = ['ID', 'shop_id', 'item_id']

########
#  Allow looping over variations in choices of categories to keep, statistics to use, and lag months (and stats to lag for each month)
#    However, instead of "cartesian product" type looping of all possible combinations of [categories, stats, lags], 
#    set these columns together as one unit for each iteration, like  [[cats1,stats1,lags1],[cats2,stats2,lags2],...] 
#    instead of                                                       [[cats1,stats1,lags1],[cats1,stats1,lags2],[cats1,stats2,lags1],...]
########
n_categories_stats_lags_splits = 1
feature_params = []                                 # each dict within the "feature_params" list is an iteration split (another run) 
for csls in range(n_categories_stats_lags_splits):
    cats_stats_lags = OrderedDict()
    # columns to keep for this round of modeling (dropping some of the less important features to save memory):
    items_enc_keep =    ['item_id', 'item_category_id', 'item_group', 'item_cluster'] 
    shops_enc_keep =    ['shop_id', 'shop_group']
    date_scaling_keep = ['month', 'week_retail_weight'] 
    stt_keep =          ['month', 'sales', 'price', 'shop_id', 'item_id'] 
    test_keep =         ['ID', 'shop_id', 'item_id']
    # if you wish to run more than one combination of these included columns (make sure they align with lags/stats as mentioned above)
    cats_stats_lags["keep_columns"] = OrderedDict()
    for ds in data_sources:
        cats_stats_lags["keep_columns"][ds[0]] = eval(ds[0]+'_keep')
        
    cats_stats_lags['aggregate_stats'] = OrderedDict([ ("sales", ['sum', 'median', 'count']), ("revenue", ['sum']) ])  # , ("price", ['median']) 

    # lag specs for the csls'th split (lag months = list(lag_split.keys())    or     "for step_number,months_lagged in enumerate(lag_split):" )
    lag_split       = OrderedDict()  # single iteration choices for lag elements, to be matched with a single row of categories/stats from above
    lag_split[1]    = OrderedDict()  # key = lag month number; value = ODict with details on stats/grouping to include for this month number's lag
    lag_split[1]['shop_id_x_item_id']           = {'group':['shop_id', 'item_id'],          'stats':{"sales":['sum','median','count'],"revenue":['sum']}}
    lag_split[1]['shop_id_x_item_category_id']  = {'group':['shop_id', 'item_category_id'], 'stats':{"sales":['sum','median','count']}}
    lag_split[1]['shop_id_x_item_cluster']      = {'group':['shop_id', 'item_cluster'],     'stats':{"sales":['sum','median']}}
    lag_split[1]['shop_id']                     = {'group':['shop_id'],                     'stats':{"sales":['sum','count']}}
    lag_split[1]['item_id']                     = {'group':['item_id'],                     'stats':{"sales":['sum','median','count'],"revenue":['sum']}}
    lag_split[1]['shop_group']                  = {'group':['shop_group'],                  'stats':{"revenue":['sum']}}
    lag_split[1]['item_category_id']            = {'group':['item_category_id'],            'stats':{"sales":['sum', 'count'],"revenue":['sum']}}
    lag_split[1]['item_group']                  = {'group':['item_group'],                  'stats':{"sales":['sum'],"revenue":['sum']}}
    lag_split[1]['item_cluster']                = {'group':['item_cluster'],                'stats':{"sales":['sum', 'count'],"revenue":['sum']}}
    lag_split[2]    = OrderedDict()  # key = lag month number; value = ODict with details on stats/grouping to include for this month number's lag
    lag_split[2]['shop_id_x_item_id']           = {'group':['shop_id', 'item_id'],          'stats':{"sales":['sum','count'],"revenue":['sum']}}
    lag_split[2]['shop_id_x_item_category_id']  = {'group':['shop_id', 'item_category_id'], 'stats':{"sales":['count'],"revenue":['sum']}}
    lag_split[2]['shop_id']                     = {'group':['shop_id'],                     'stats':{"sales":['sum']}}
    lag_split[2]['item_id']                     = {'group':['item_id'],                     'stats':{"sales":['sum','count'],"revenue":['sum']}}
    lag_split[2]['item_category_id']            = {'group':['item_category_id'],            'stats':{"sales":['sum', 'count']}}
    lag_split[2]['item_cluster']                = {'group':['item_cluster'],                'stats':{"sales":['sum', 'count'],"revenue":['sum']}}
    lag_split[3]    = OrderedDict()  # key = lag month number; value = ODict with details on stats/grouping to include for this month number's lag
    lag_split[3]['shop_id_x_item_id']           = {'group':['shop_id', 'item_id'],          'stats':{"sales":['sum']}}
    lag_split[3]['shop_id']                     = {'group':['shop_id'],                     'stats':{"sales":['sum']}}
    lag_split[3]['item_id']                     = {'group':['item_id'],                     'stats':{"sales":['sum','count'],"revenue":['sum']}}
    lag_split[3]['item_category_id']            = {'group':['item_category_id'],            'stats':{"sales":['sum', 'count']}}
    lag_split[3]['item_cluster']                = {'group':['item_cluster'],                'stats':{"sales":['count']}}
    lag_split[4]    = OrderedDict()  # key = lag month number; value = ODict with details on stats/grouping to include for this month number's lag
    lag_split[4]['item_id']                     = {'group':['item_id'],                     'stats':{"sales":['sum']}}
    lag_split[5]    = OrderedDict()  # key = lag month number; value = ODict with details on stats/grouping to include for this month number's lag
    lag_split[5]['shop_id_x_item_id']           = {'group':['shop_id', 'item_id'],          'stats':{"sales":['sum']}}
    lag_split[6]    = OrderedDict()  # key = lag month number; value = ODict with details on stats/grouping to include for this month number's lag
    lag_split[6]['shop_id_x_item_id']           = {'group':['shop_id', 'item_id'],          'stats':{"sales":['sum']}}
    lag_split[7]    = OrderedDict()  # key = lag month number; value = ODict with details on stats/grouping to include for this month number's lag
    lag_split[7]['item_id']                     = {'group':['item_id'],                     'stats':{"sales":['sum']}}
    lag_split[8]    = OrderedDict()  # key = lag month number; value = ODict with details on stats/grouping to include for this month number's lag
    lag_split[8]['shop_id_x_item_id']           = {'group':['shop_id', 'item_id'],          'stats':{"sales":['sum']}}

    # Now get the minimal list of stats that need to be calculated, and get the feature names with lag months appended as text
    #   - stats_set is SET of all agg statistics columns for all lags (allows us to shed the other stats, keeping memory requirements lower)
    #   - first monthly group merge is on month/shop_id/item_id; keep other categorical features in this merge by using agg_stat = 'first'
    #        (none of these features get lagged, so don't include them in lag_split dict or in stats_set_feature_names, but do include in stats_set)
    first_cols = cats_stats_lags["keep_columns"]["shops_enc"][1:] + cats_stats_lags["keep_columns"]["items_enc"][1:]
    first_stats = dict(zip(first_cols,[['first']]*len(first_cols)))
    stats_set = OrderedDict( [ ('shop_id_x_item_id', {'group':['shop_id', 'item_id'], 'stats':first_stats} ) ] )                  
    stats_set_feature_names = []   #stats_set_feature_names = first_cols.copy()

    all_lag_feature_names = []
    for lag_mo, lag_stats in lag_split.items():
        feature_root_list = []
        lagged_feature_name_dict = OrderedDict()
        for lag_group_name, agg_details in lag_stats.items():
            for stat_target, stats in agg_details['stats'].items():
                if lag_group_name not in list(stats_set.keys()):
                    stats_set[lag_group_name] = agg_details
                else:
                    if stat_target not in list(stats_set[lag_group_name]['stats'].keys()):
                        stats_set[lag_group_name]['stats'][stat_target] = stats
                    else:
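                        # order-preserving union; a plain set() can reorder the stats from run to run
                        merged_stats = stats_set[lag_group_name]['stats'][stat_target] + stats
                        stats_set[lag_group_name]['stats'][stat_target] = list(OrderedDict.fromkeys(merged_stats))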
                for stat in stats:
                    root_name = f'{lag_group_name}_{stat_target}_{stat}'
                    lag_feature_name = f'{lag_group_name}_{stat_target}_{stat}_L{lag_mo}'
                    stats_set_feature_names.append(root_name)
                    feature_root_list.append(root_name)
                    lagged_feature_name_dict[root_name] = lag_feature_name
                    all_lag_feature_names.append(lag_feature_name)
        lag_split[lag_mo]['feature_root'] = feature_root_list
        lag_split[lag_mo]['lagged_feature_name'] = lagged_feature_name_dict
    stats_set_feature_names = list(OrderedDict.fromkeys(stats_set_feature_names))  #keep only first occurrence, remove any following duplicates
    
    for agg_item, agg_params in stats_set.items():
        #agg_names = first_cols if agg_item == 'shop_id_x_item_id' else []
        agg_names = []
        for stat_target, stat_type in agg_params['stats'].items():
            if stat_type == ['first'] or stat_type == 'first':
                agg_names.append(stat_target)
            else:
                for st in stat_type:
                    agg_names.append(f'{agg_item}_{stat_target}_{st}')
        stats_set[agg_item]['agg_names'] = agg_names

    # stt_final = columns in train/val/test set, before monthly grouping and calculation of statistics, (columns in order desired)
    cats_stats_lags['stt_final']    = ['month'] + list(cats_stats_lags['aggregate_stats'].keys()) + ['shop_id', 'item_id'] + first_cols
    cats_stats_lags['categorical']  = cats_stats_lags["keep_columns"]["shops_enc"] + cats_stats_lags["keep_columns"]["items_enc"]
    cats_stats_lags['integer']      = ['month'] + cats_stats_lags["categorical"] # int dtype uses less memory, and can speed model fitting     
    cats_stats_lags['lag_splits']   = { 'months_list':list(lag_split.keys()), 'params':lag_split, 'stats_set':stats_set, 
                                        'stats_set_feature_names':stats_set_feature_names }
    cats_stats_lags['all_feature_names'] = cats_stats_lags['integer'] + all_lag_feature_names
    cats_stats_lags['n_features']   = len(cats_stats_lags['all_feature_names'])

    feature_params.append(cats_stats_lags)

print(f'Feature Columns, Statistic, and Lags Identified: {strftime("%a %X %x")}')

Feature Columns, Statistic, and Lags Identified: Wed 03:36:14 09/09/20
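
The naming scheme built in the cell above can be checked on a toy spec. The sketch below is a standalone illustration only: the helper `lagged_feature_names` and the two-lag `toy_lag_split` are made up for this example and are not part of the notebook. It assumes the same convention as above, where each lagged feature is named `{group}_{target}_{stat}_L{lag}` and the un-lagged "root" names are de-duplicated while keeping their first-seen order.

```python
from collections import OrderedDict

def lagged_feature_names(lag_split):
    """Return (root_names, lagged_names) for a {lag_month: {group_name: {'stats': {...}}}} spec."""
    roots, lagged = [], []
    for lag_mo, groups in lag_split.items():
        for group_name, details in groups.items():
            for target, stats in details['stats'].items():
                for stat in stats:
                    root = f'{group_name}_{target}_{stat}'
                    roots.append(root)
                    lagged.append(f'{root}_L{lag_mo}')
    # the same root can appear under several lags; keep only its first occurrence
    roots = list(OrderedDict.fromkeys(roots))
    return roots, lagged

# hypothetical two-lag spec, much smaller than the real lag_split
toy_lag_split = OrderedDict([
    (1, OrderedDict([('shop_id', {'group': ['shop_id'], 'stats': {'sales': ['sum', 'count']}})])),
    (2, OrderedDict([('shop_id', {'group': ['shop_id'], 'stats': {'sales': ['sum']}})])),
])

roots, lagged = lagged_feature_names(toy_lag_split)
print(roots)   # ['shop_id_sales_sum', 'shop_id_sales_count']
print(lagged)  # ['shop_id_sales_sum_L1', 'shop_id_sales_count_L1', 'shop_id_sales_sum_L2']
```

The root list is what has to be aggregated once per month, while the lagged list is what ends up as model input columns.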


###**Define Dictionaries / Dataframes to Enable Looping Grid Search for Optimal Parameters**
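
As a minimal sketch of the idea (toy parameter names and values chosen for illustration, not the notebook's real settings): the DataFrame starts with a single row in which each cell is either a scalar or a list of choices, and calling `DataFrame.explode` on each column in turn yields one row per combination of choices.

```python
import pandas as pd

# one row; list-valued cells hold the choices to try, scalar cells are fixed
toy_params = pd.DataFrame({
    'learning_rate': [[0.05, 0.1]],
    'num_leaves':    [[31]],
    'n_estimators':  [[100, 200]],
    'boosting_type': 'gbdt',
})

grid = toy_params
for col in grid.columns:
    grid = grid.explode(col)        # one row per list element; other columns are repeated
grid = grid.reset_index(drop=True)

print(grid.shape)                   # (4, 4): 2 learning rates x 1 num_leaves x 2 n_estimators
print(grid)
```

Each row of `grid` then corresponds to one model run, so the number of rows gives the size of the search before any training starts.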

In [None]:
# Define various constants that drive the attributes of the various features
# Define hyperparameters for modeling and feature generation, including those that we might want to loop over multiple choices
# Put in a DataFrame so we can "explode" and generate rows for every possible combination of variable inputs

filename_root = 'v1p0_test1'
ALL_PARAMS = pd.DataFrame({
        ###### below are model/file name specifications ########  ... set model_filename_base to False if you want user to input it during run
        'model_filename_base':      f"{strftime('%Y%m%d-%H%M')}_{filename_root}", # fname like 'model_type'+'20200812-0740+description'+RUN_n+suffix.xxx
        'model_type':               [['LGBM']],         # 'HGBR' for SKLearn version of GBDT
        'feature_params':           [feature_params],   # details on cats to use, agg stats and lag features, (list of dicts, 1 dict per iter)
        'data_sources':             [[data_sources]],   # (pd_df_name_string, filepath_string) tuple list load csv files from google drive
        ###### below are eda parameters ########               
        'eda_delete_shops':         [[[9,20]]],    # [0,1,8,9,11,13,17,20,23,27,29,30,32,33,40,43,51,54] [8, 9, 13, 20, 23, 32, 33, 40] False
        'eda_delete_item_cats':     [[[8,10,32,59,80,81,82]]], # [1,4,8,10,13,14,17,18,32,39,46,48,50,51,52,53,59,66,68,80,81,82] [8, 80, 81, 82] False
        'eda_scale_month':          [['week_retail_weight']],  # False # scale sales by days in month, n of each wkday, Russian recessn retail sales idx
        'feather_stt':              [[True]],      # if True, save STT dataframe to disk as .ftr file to conserve RAM
        ###### below are data conditioning parameters (monthly merging, scaling, cartesian product filling, lagged statistics) ########
        'cartprod_fillna0':         [[True]],      # fill n/a cartesian additions with zeros (not good for price-based stats, however)
        'cartprod_first_month':     [[13]],        # month no + max lag to start adding Cart Prod rows (eg, maxlag=6mo and cart_first_mon=10 fills 4-33)
        'cartprod_test_pairs':      [[False]],     # include all of test set 'shop-item pairs' with each month's normal cartesian product fill
        'clip_train_H':             [[20]],        # this clips sales after doing monthly groupings (monthly_stt dataframe)
        'clip_train_L':             [[0]],         # this clips sales after doing monthly groupings (monthly_stt dataframe)
        'feather_monthly_stt':      [[True]],      # if True, save final monthly_stt dataframe to disk as .ftr file to conserve RAM
        'feature_data_type':        [[np.int16]],  # np.float32 np.uint16 # if df contains np.NaNs, int type cannot represent, and we must use float32
        'minmax_scaler_range':      [[(0,16000)]], # int16=(0,32700); uint16=(0,65500) positive range for best LGBM results; smaller nums=faster fit
        'robust_scaler_quantiles':  [[(20,80)]],   # quantile range (in percent) passed to the sklearn robust scaler
        'use_cartprod_fill':        [[True]],      # use cartesian fill, or not
        'use_categorical':          [[True]],      # relevant dataframe columns are changed to categorical dtype just before model fitting/creation
        'use_minmax_scaler':        [[True]],      # scale features linearly to use large range of np.int16 (NOTE: only use positive integer output)
        'use_robust_scaler':        [[True]],      # scale features using quantiles to reduce influence of outliers
        ###### below are train/val/test splitting parameters ######
        'feather_tvt_split':        [[False]],     # if True, save Train/Val/Test dataframes to disk as .ftr files to conserve RAM
        'test_month':               [[34]],
        'train_start_month':        [[13]],        # == 24 ==> less than a year of data, but avoids December 'outlier' of 2014
        'train_final_month':        [[29]],        # [29,32] #,30,32]
        'validate_months':          [[999]],       # 1 # 2 # 999 val= all months after training, incl 33; else val= n months after train_final_month
        ###### below are regressor setup parameters ########
        'boosting_type':            'gbdt',
        'metric':                   [['rmse']],
        'learning_rate':            [[0.05]],     # default = 0.1
        'n_estimators':             [[200]],
        'colsample_bytree':         [[0.4]],      # = feature_fraction; default 1 for LGBM, 0 for HGBR (models use inverse forms of regularization)
        'random_state':             [[42]],
        'subsample_for_bin':        [[200000,800000]],
        'num_leaves':               [[31]],
        'max_depth':                [[-1]],
        'min_split_gain':           [[0.0]],
        'min_child_weight':         [[0.001]],
        'min_child_samples':        [[20]],
        'silent':                   [[False]],
        'importance_type':          [['split']],  # 'split' = num times the feature is used in model; 'gain' = total gains of splits using the feature
        'reg_alpha':                [[0.0]],
        'reg_lambda':               [[0.0]],
        'n_jobs':                   -1,
        'subsample':                1.0,
        'subsample_freq':           0,
        'objective':                'regression', 
        ## **kwargs is not supported in sklearn, it may cause unexpected issues.
        ###### below are regressor fitting parameters ########
        'early_stopping_rounds':    [[20]],
        'eval_metric':              [['rmse']], # if multiple metrics, use [['rmse',['rmse','l2']]] to get 1 run w/ 'rmse' and 1 w/ ['rmse','l2']
        'init_score':               None,
        'eval_init_score':          None,
        'verbose':                  [[True]],   # (int=4 prints every 4th iteration); True is every iteration; False= no print except best and last
        'feature_name':             [['auto']], # (list of strings or 'auto') if 'auto' and data is from pandas df, data column names are used
        'categorical_feature':      [['auto']], # If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used
        'callbacks':                None,
        ###### output processing parameters below ######  (also need inverse scaling transformers)
        'clip_predict_H':           [[20]],        # this clips the final result before submission to Coursera/Kaggle
        'clip_predict_L':           [[0]],         # this clips the final result before submission to Coursera/Kaggle
        ###### below are output results slots ########
        'model_filename':           filename_root,   # will be overwritten with actual filename inside control loop
        'best_iteration_':          0,                                      
        'best_score_':              0.0,
        'feature_name_':            [[[""]]],
        'feature_importances_':     [[[0.0]]],
        'model_params':             [[{}]],
        'time_cumulative':          "",
        'time_data_manip':          "",
        'time_dataset_splits':      "",
        'time_eda':                 "",
        'time_full_iteration':      "",
        'time_model_fit':           "",
        'time_model_predict':       "",
        'tr_rmse':                  0.0,
        'tr_R2':                    0.0,
        'val_rmse':                 0.0,
        'val_R2':                   0.0,
        'runtime_type':             runtime_type_cpu_gpu_tpu,
        'n_cpus':                   n_available_cpus,
        'ram_gb':                   available_vm_ram_gb,
        'package_versions':         [package_versions]
        })

model_cols          = ['model_filename_base','model_type','feature_params','data_sources']
eda_cols            = ['eda_delete_shops','eda_delete_item_cats','eda_scale_month','feather_stt']
data_cols           = ['cartprod_fillna0','cartprod_first_month','cartprod_test_pairs','clip_train_H','clip_train_L','feather_monthly_stt',
                        'feature_data_type','minmax_scaler_range','robust_scaler_quantiles','use_cartprod_fill','use_categorical',
                        'use_minmax_scaler','use_robust_scaler']
tvt_split_cols      = ['feather_tvt_split','test_month','train_start_month','train_final_month','validate_months']
lgbm_setup_cols     = ['boosting_type','metric','learning_rate','n_estimators','colsample_bytree','random_state','subsample_for_bin','num_leaves',
                        'max_depth','min_split_gain','min_child_weight','min_child_samples','silent','importance_type','reg_alpha','reg_lambda',
                        'n_jobs','subsample','subsample_freq','objective']
lgbm_fit_cols       = ['eval_metric','early_stopping_rounds','init_score','eval_init_score','verbose','feature_name','categorical_feature','callbacks']
processing_cols     = ['clip_predict_H','clip_predict_L']
OUTPUT_cols         = ['model_filename','best_iteration_','best_score_','feature_importances_','feature_name_','model_params','time_cumulative',
                        'time_data_manip','time_dataset_splits','time_eda','time_full_iteration','time_model_fit','time_model_predict',
                        'tr_R2','tr_rmse','val_R2','val_rmse','runtime_type','n_cpus','ram_gb','package_versions']

# One large df to explode all parameter variants and give us a sense of run size; other dfs are smaller portions of the input parameters set
# Parameters governing chunks of similar computations are collected into each of these different small dataframes,
#     so that in the main control section, there is one 'for loop' per small dataframe / objective (we won't need to use ['ALL'] later in the code)
SPLIT_PARAMS_module_dfs = OrderedDict()
SPLIT_PARAMS_module_dfs['ALL']          =   ALL_PARAMS.drop(OUTPUT_cols, axis=1).copy(deep=True) # don't explode OUTPUT_cols for input splits
SPLIT_PARAMS_module_dfs['MODEL']        =   ALL_PARAMS[model_cols].copy(deep=True)
SPLIT_PARAMS_module_dfs['EDA']          =   ALL_PARAMS[eda_cols].copy(deep=True)
SPLIT_PARAMS_module_dfs['DATA']         =   ALL_PARAMS[data_cols].copy(deep=True)
SPLIT_PARAMS_module_dfs['TVT_SPLIT']    =   ALL_PARAMS[tvt_split_cols].copy(deep=True)
SPLIT_PARAMS_module_dfs['LGBM_SETUP']   =   ALL_PARAMS[lgbm_setup_cols].copy(deep=True)
SPLIT_PARAMS_module_dfs['LGBM_FIT']     =   ALL_PARAMS[lgbm_fit_cols].copy(deep=True)
SPLIT_PARAMS_module_dfs['PROCESSING']   =   ALL_PARAMS[processing_cols].copy(deep=True)
OUTPUTS_df                              =   ALL_PARAMS[OUTPUT_cols].copy(deep=True)

# find the parameters that have splits in them, and print out to highlight; ignore feature_params because too unwieldy for summary here
parameter_splits = {}
for col_name, param in SPLIT_PARAMS_module_dfs['ALL'].to_dict('index', into=OrderedDict)[0].items(): # before exploding df
    if type(param) is list:  # this level of list gets removed during "explode" operation
        if len(param) > 1:
            if col_name == 'feature_params':  # this parameter has so many items in it, it is impractical for quick split summary
                parameter_splits[col_name] = f"[{len(param)} splits exist; printout below]"
            else:
                parameter_splits[col_name] = param

# pretty print dictionary-style dataframe info before exploding
for df_name, param_df in SPLIT_PARAMS_module_dfs.items():
    if 'feature_params' in param_df.columns:
        param_df = param_df.drop('feature_params', axis=1)  # will print this out below; it is a huge dict of a dict of a dict
    print(f'\n{df_name} Parameters DataFrame:\nColumns = {list(param_df.columns)}\n{param_df.to_dict(orient="list")}\n')      
print(f'\nOutput Results Dataframe:\nColumns = {list(OUTPUTS_df.columns)}\n{OUTPUTS_df.to_dict(orient="list")}\n')

# Explode the dataframes so each row is one iteration of modeling parameters
for df_name, param_df in SPLIT_PARAMS_module_dfs.items():
    for col in param_df.columns:
        param_df = param_df.explode(col)
    SPLIT_PARAMS_module_dfs[df_name] = param_df.reset_index(drop=True)

ALL_exploded_shape = SPLIT_PARAMS_module_dfs['ALL'].shape
print(f'Shape of Exploded ["ALL"] Parameters/Splits DataFrame (n_runs, n_parameters): {ALL_exploded_shape}')
print(f'N train models: {ALL_exploded_shape[0]}')  
print(f'Splits in this run (excluding features/lags): {parameter_splits}')

# for df_name, param_df in SPLIT_PARAMS_module_dfs.items():
#     print(f'\nExploded {df_name} Parameters DataFrame:\n{param_df}\n')  # tabular format

print(f'\nExploded SPLIT_PARAMS_module_dfs["ALL"] df, as a dict:\n{SPLIT_PARAMS_module_dfs["ALL"].to_dict(orient="list")}') # simple format

print("\nFeature Column Info:") 
for i, split_feature_dict in enumerate(feature_params):
    print(f'Iteration {i}:')
    for k, v in split_feature_dict.items():
        if k != 'lag_splits':
            print(f'  {k}: {v}')
        else:
            print('  Lag Split Info:')
            for k1, v1 in v.items():
                print(f'    {k1}: {v1}')
print('\n')
# Make one placeholder row in OUTPUTS_df per run (one per row of the exploded 'ALL' dataframe)
#   Concatenate 'ALL' onto 'OUTPUTS_df' so we only have to write one log file for the entire run (= OUTPUTS_df)
OUTPUTS_df = pd.DataFrame(np.repeat(OUTPUTS_df.values, ALL_exploded_shape[0], axis=0), columns=OUTPUTS_df.columns).reset_index(drop=True)
OUTPUTS_df = pd.concat([OUTPUTS_df, SPLIT_PARAMS_module_dfs['ALL']], axis=1)

time_cumulative         = elapsed_timer()
time_data_manip         = elapsed_timer()
time_dataset_splits     = elapsed_timer()
time_eda                = elapsed_timer()
time_full_iteration     = elapsed_timer()
time_model_fit          = elapsed_timer()
time_model_predict      = elapsed_timer()
block_time              = elapsed_timer() # general timer for various code blocks below; start it up right now with other initializations
MEMORY_STATS.append(get_memory_stats("Iteration Parameters Defined",printout=True))

print(f'\nRun Splits/Parameters Identified: {strftime("%a %X %x")}')


ALL Parameters DataFrame:
Columns = ['model_filename_base', 'model_type', 'data_sources', 'eda_delete_shops', 'eda_delete_item_cats', 'eda_scale_month', 'feather_stt', 'cartprod_fillna0', 'cartprod_first_month', 'cartprod_test_pairs', 'clip_train_H', 'clip_train_L', 'feather_monthly_stt', 'feature_data_type', 'minmax_scaler_range', 'robust_scaler_quantiles', 'use_cartprod_fill', 'use_categorical', 'use_minmax_scaler', 'use_robust_scaler', 'feather_tvt_split', 'test_month', 'train_start_month', 'train_final_month', 'validate_months', 'boosting_type', 'metric', 'learning_rate', 'n_estimators', 'colsample_bytree', 'random_state', 'subsample_for_bin', 'num_leaves', 'max_depth', 'min_split_gain', 'min_child_weight', 'min_child_samples', 'silent', 'importance_type', 'reg_alpha', 'reg_lambda', 'n_jobs', 'subsample', 'subsample_freq', 'objective', 'early_stopping_rounds', 'eval_metric', 'init_score', 'eval_init_score', 'verbose', 'feature_name', 'categorical_feature', 'callbacks', 'clip_predi

##**Mount Google Drive for access to Google Drive local repo; Load Data**

In [None]:
# click on the URL link presented to you by this command, get your authorization code from Google, 
#     then paste it into the input box and hit 'enter' to complete mounting of the drive

drive.mount('/content/drive')
MEMORY_STATS.append(get_memory_stats("Google Drive mounted", printout=True))

Mounted at /content/drive


NameError: ignored

In [None]:
# %cd '{GDRIVE_REPO_PATH}'
# os.chdir(OUT_OF_REPO_PATH)
print(GDRIVE_REPO_PATH.absolute())
print(GDRIVE_REPO_PATH.resolve())
p=Path('/content'+'/drive')
print(p)
platform.platform(terse=0)


/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
/content/drive


'Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic'

In [None]:
a = globals().copy()
import types
from pkg_resources import get_distribution

print(f'\n#################\n#################\n')
imports = OrderedDict()
vers = OrderedDict()
for k, vv in a.items():
    v1 = k.split('.')[0]
    v = vv  # the object is already available from the copied globals; no need to eval() its name
    if isinstance(v, types.ModuleType) or callable(v):
        imports[k] = OrderedDict()
        imports[k]['k'] = ('callable', type(v), v) if callable(v) else ('isinstance', type(v), v)
        try:
            v1 = v.__module__
            imports[k]['m'] = (type(v1), v1, '  len: ', len(v1))
            if len(v1) > 1:
                v = eval(v1.split('.')[0])
            else:
                imports[k]['m'] = 'Null __module__'
        except:
            imports[k]['m'] = '  no __module__'
        try:
            v1 = v.__package__
            if len(v1) > 1:
                imports[k]['p'] = (type(v1), v1)
                v = eval(v1.split('.')[0])
            else:
                imports[k]['p'] = 'Null __package__'
        except:
            imports[k]['p'] = '  no __package__'
        # try:
        #     v = eval(v)
        #     imports[k]['e'] = (type(v), v)
        # except:
        #     imports[k]['e'] = 0
        vers[v1] = OrderedDict()
        try:
            imports[k]['v'] = vers[v1]['v'] = v.__version__
        except:
            imports[k]['v'] = vers[v1]['v'] = ''
        try:
            imports[k]['g'] = vers[v1]['g'] = get_distribution(v)
        except:
            imports[k]['g'] = vers[v1]['g'] = ''

        # try:
        #     ver = v.__package__.__version__  # get the root module
        # except:
        #     print('  2. bad pkg ver v for: ', k)
        #     try:
        #         ver = eval(v.__package__).__version__
        #     except:
        #         print('  3. bad eval pkg ver for:', k)
        #         try:
        #             ver = v.__version__
        #         except:
        #             ver = (f'4. no version info: {k} : {v}')
        # vers.append(ver)
        # try:
        #     vers.append(v.__version__)
        #     #print('vvvv  ',imports[-1],':',vers[-1])
        # except:
        #     print(k,'    version issue')
        #     try: 
        #         vers.append((get_distribution(v),'viss'))
        #     except:
        #         vers.append('No distribution info')
    # if callable(v):
    #     print('iscallable',k)
    #     try:
    #         v = v.__package__  # get the root module
    #     except:
    #         print('  bad pkg eval v for: ', k)
    #     try:
    #         imports.append(v.__module__)
    #         # imports.append(inspect.getmodule(k).__name__)
    #     except:
    #         print('  cannot find name for callable: ',k,v)
    #     try:
    #         vers.append(v.__version__)
    #         #print('vvvv  ',imports[-1],':',vers[-1])
    #     except:
    #         print(k,'    callable version issue')
    #         try: 
    #             vers.append((get_distribution(v),'c viss'))
    #         except:
    #             vers.append('No distribution info')
print('\n\n####################################################\n################################\n')
# for i in range(len(vers)):
#     print(imports[i],' : ',vers[i])

for k,v in imports.items():
    print(k,':')
    try:
        for i,j in v.items():
            print(f'  {i:<8}: {j}')
    except:
        print(f'     {v}')


print(f'\n#################\n#################\n')

for k,v in vers.items():
    print(k,':')
    try:
        for i,j in v.items():
            print(f'  {i:<8}: {j}')
    except:
        print(f'     {v}')

# # b = re.findall('<module [\'"](.+?)[\'"]', str(a))
# # c = re.findall('<class [\'"]([^_].+?)\..+?[\'"]', str(a))
# # d = re.findall('<function (.+?)>', str(a))
# # print(a)
# # print(f'Modules:\n{b}')
# # print(f'\nClasses:\n{c}')
# # print(f'\nFunctions:\n{d}')
# # print(get_distribution('pandas'))
# print('\n\n')
# s = {'a':'b'}
# print(type(s))
# print(type(a['__builtin__']))
# import types
# for k, v in a.items():
#     if callable(v): #isinstance(v, types.ModuleType) or isinstance(v, types.FunctionType) or isinstance(v, types.ClassType):
#         print(k,':',v)
# # for k,v in a.items():
# #     if type(v) == False:  # type(a['__builtin__']):  # type(s):
# #         print(k,':.....')
# #         for i,j in v.items():
# #             print('  ',i,':',j)
# #     else:
# #         print(k,':',v)
# print('\n\n####################################################\n################################\n')
# # z = re.findall('.{20}(final_mg_v1p0\.ipynb).{20}', str(a))
# # print(z)
# # print('\n\n')
# # modulenames = set(sys.modules) & set(globals())
# # allmodules = [sys.modules[name] for name in modulenames]
# # print('allmodules:\n',allmodules)

# e = Path(sys.argv[2])
# print(sys.argv)
# # print(sys.argv[0], ':', e)
# f = e.read_text()
# print(f)
# # import_re = re.compile('(from (.*?) import|import (.*?)[\#\s])')
# # g = re.findall(import_re, f)
# # for i in g[:-2]:
# #     h = i[1] + i[2]
# #     try:
# #         print(h, get_distribution(h))
# #     except:
# #         print('   ',h)
# from pkg_resources import get_distribution
# import inspect
# print(inspect.getmodule(globals()['get_distribution']))
print('\n\n####################################################\n################################\n')
print('scikit-learn',get_distribution('scikit-learn'))


#################
#################



####################################################
################################

__builtin__ :
  k       : ('isinstance', <class 'module'>, <module 'builtins' (built-in)>)
  m       :   no __module__
  p       : Null __package__
  v       : 
  g       : 
__builtins__ :
  k       : ('isinstance', <class 'module'>, <module 'builtins' (built-in)>)
  m       :   no __module__
  p       : Null __package__
  v       : 
  g       : 
_sh :
  k       : ('isinstance', <class 'module'>, <module 'IPython.core.shadowns' from '/usr/local/lib/python3.6/dist-packages/IPython/core/shadowns.py'>)
  m       :   no __module__
  p       :   no __package__
  v       : 
  g       : 
get_ipython :
  k       : ('callable', <class 'method'>, <bound method InteractiveShell.get_ipython of <google.colab._shell.Shell object at 0x7f6085a09160>>)
  m       :   no __module__
  p       :   no __package__
  v       : 
  g       : 
exit :
  k       : ('callable', <class 'IPyt
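
A much shorter way to get similar information is sketched below. This is an assumption-laden alternative, not the approach used in the cell above: the helper `imported_module_versions` is a made-up name, and it relies only on the `__version__` attribute, which many (but not all) packages define. It maps each top-level package found in a namespace to its version string, or to `''` when none is exposed.

```python
import sys
import types

def imported_module_versions(namespace):
    """Map each top-level package referenced in `namespace` to its __version__ ('' if it has none)."""
    versions = {}
    for name, obj in namespace.items():
        if isinstance(obj, types.ModuleType):
            root = obj.__name__.split('.')[0]       # e.g. 'matplotlib' for matplotlib.pyplot
            root_mod = sys.modules.get(root, obj)   # prefer the root package, which usually holds __version__
            version = getattr(root_mod, '__version__', '')
            if not versions.get(root):              # do not overwrite a version that was already found
                versions[root] = str(version)
    return versions

# toy namespace standing in for globals(); 'fakepkg' is a made-up module for illustration
fake = types.ModuleType('fakepkg.sub')
fake.__version__ = '1.2.3'
toy_versions = imported_module_versions({'fp': fake, 'types': types, 'x': 42})
print(toy_versions)   # {'fakepkg': '1.2.3', 'types': ''}
```

In a notebook, calling `imported_module_versions(globals())` would report the packages imported under any alias. Packages that do not set `__version__` would still need a lookup by distribution name.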

In [None]:

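import pathlib
import pkg_resources  # needed for the pkg_resources.__dict__ loop below (only get_distribution was imported earlier)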
print(get_distribution(plt.__package__))
print(f'1{pathlib.__package__}1')
for k, v in pathlib.__dict__.items():
    print(k,":",v)
print('\n###########\n')
i = 0
for k, v in plt.__dict__.items():
    if i<100:
        print(k,":",v)
    i+=1
print('\n###########\n')
for k, v in pkg_resources.__dict__.items():
    print(k,":",v)
print('\n###########\n')

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



matplotlib 3.2.2
11
__name__ : pathlib
__doc__ : None
__package__ : 
__loader__ : <_frozen_importlib_external.SourceFileLoader object at 0x7f609e19f588>
__spec__ : ModuleSpec(name='pathlib', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7f609e19f588>, origin='/usr/lib/python3.6/pathlib.py')
__file__ : /usr/lib/python3.6/pathlib.py
__cached__ : /usr/lib/python3.6/__pycache__/pathlib.cpython-36.pyc
All Rights Reserved.

Copyright (c) 2000 BeOpen.com.
All Rights Reserved.

Copyright (c) 1995-2001 Corporation for National Research Initiatives.
All Rights Reserved.

Copyright (c) 1991-1995 Stichting Mathematisch Centrum, Amsterdam.
All Rights Reserved., 'credits':     Thanks to CWI, CNRI, BeOpen.com, Zope Corporation and a cast of thousands
    for supporting Python development.  See www.python.org for more information., 'license': Type license() to see the full license text, 'help': Type help() for interactive help, or help(object) for help about object., '__IPYTHON__': T

OSError: ignored

In [None]:
for i in inspect.getmembers(plt):
    # skip private and protected names (those starting with an underscore)
    if not i[0].startswith('_'):
        # skip bound methods; print every other public member
        if not inspect.ismethod(i[1]):
            print(i)

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



('Annotation', <class 'matplotlib.text.Annotation'>)
('Arrow', <class 'matplotlib.patches.Arrow'>)
('Artist', <class 'matplotlib.artist.Artist'>)
('AutoLocator', <class 'matplotlib.ticker.AutoLocator'>)
('Axes', <class 'matplotlib.axes._axes.Axes'>)
('Button', <class 'matplotlib.widgets.Button'>)
('Circle', <class 'matplotlib.patches.Circle'>)
('Figure', <class 'matplotlib.figure.Figure'>)
('FigureCanvasBase', <class 'matplotlib.backend_bases.FigureCanvasBase'>)
('FixedFormatter', <class 'matplotlib.ticker.FixedFormatter'>)
('FixedLocator', <class 'matplotlib.ticker.FixedLocator'>)
('FormatStrFormatter', <class 'matplotlib.ticker.FormatStrFormatter'>)
('Formatter', <class 'matplotlib.ticker.Formatter'>)
('FuncFormatter', <class 'matplotlib.ticker.FuncFormatter'>)
('GridSpec', <class 'matplotlib.gridspec.GridSpec'>)
('IndexLocator', <class 'matplotlib.ticker.IndexLocator'>)
('Line2D', <class 'matplotlib.lines.Line2D'>)
('LinearLocator', <class 'matplotlib.ticker.LinearLocator'>)
('Locat

OSError: ignored

In [None]:
import importlib.util  # import the submodule explicitly so importlib.util.find_spec (used below) is available
for k, v in sklearn.__dict__.items():
    print(k,":",v)
print('\n###########\n')
for k, v in plt.__dict__.items():
    print(k,":",v)
print('\n###########\n')
for k, v in lgb.__dict__.items():
    print(k,":",v)
print('\n###########\n')
print(sklearn.__version__)
print('\n###########\n')
print(importlib.find_loader('sklearn'))
print(importlib.util.find_spec('sklearn').loader)
print('\n###########\n')
for k, v in sklearn.preprocessing._data.__dict__.items():
    print(k,":",v)
print('\n###########\n')
for k, v in eval(sklearn.preprocessing._data.__name__.split('.')[0]).__dict__.items():
    print(k,":",v)
print('\n###########\n')
print(sklearn.show_versions())


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



__name__ : pathlib
__doc__ : None
__package__ : 
__loader__ : <_frozen_importlib_external.SourceFileLoader object at 0x7f609e19f588>
__spec__ : ModuleSpec(name='pathlib', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7f609e19f588>, origin='/usr/lib/python3.6/pathlib.py')
__file__ : /usr/lib/python3.6/pathlib.py
__cached__ : /usr/lib/python3.6/__pycache__/pathlib.cpython-36.pyc
All Rights Reserved.

Copyright (c) 2000 BeOpen.com.
All Rights Reserved.

Copyright (c) 1995-2001 Corporation for National Research Initiatives.
All Rights Reserved.

Copyright (c) 1991-1995 Stichting Mathematisch Centrum, Amsterdam.
All Rights Reserved., 'credits':     Thanks to CWI, CNRI, BeOpen.com, Zope Corporation and a cast of thousands
    for supporting Python development.  See www.python.org for more information., 'license': Type license() to see the full license text, 'help': Type help() for interactive help, or help(object) for help about object., '__IPYTHON__': True, 'display': <fun

OSError: ignored

In [None]:
# import psutil
# print(psutil.cpu_count(logical=False))
# print(psutil.cpu_freq(percpu=False))
# for k,v in os.environ.items():
#     print(k, v)
# import sys
# print(sys.platform)

# Output:
# 2
# None
# ENV /root/.bashrc
# GCS_READ_CACHE_BLOCK_SIZE_MB 16
# CLOUDSDK_CONFIG /content/.config
# CUDA_VERSION 10.1.243
# PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin
# HOME /root
# LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
# LANG en_US.UTF-8
# SHELL /bin/bash
# LIBRARY_PATH /usr/local/cuda/lib64/stubs
# CUDA_PKG_VERSION 10-1=10.1.243-1
# SHLVL 1
# GCE_METADATA_TIMEOUT 0
# NCCL_VERSION 2.7.8
# NVIDIA_VISIBLE_DEVICES all
# TF_FORCE_GPU_ALLOW_GROWTH true
# DEBIAN_FRONTEND noninteractive
# CUDNN_VERSION 7.6.5.32
# LAST_FORCED_REBUILD 20200910
# JPY_PARENT_PID 24
# PYTHONPATH /env/python
# DATALAB_SETTINGS_OVERRIDES {"kernelManagerProxyPort":6000,"kernelManagerProxyHost":"172.28.0.3","jupyterArgs":["--ip=\"172.28.0.2\""]}
# NO_GCE_CHECK True
# GLIBCXX_FORCE_NEW 1
# NVIDIA_DRIVER_CAPABILITIES compute,utility
# _ /tools/node/bin/node
# LD_PRELOAD /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
# NVIDIA_REQUIRE_CUDA cuda>=10.1 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419
# OLDPWD /
# HOSTNAME 931aab700078
# COLAB_GPU 0
# PWD /
# CLOUDSDK_PYTHON python3
# GLIBCPP_FORCE_NEW 1
# PYTHONWARNINGS ignore:::pip._internal.cli.base_command
# TBE_CREDS_ADDR 172.28.0.1:8008
# TERM xterm-color
# CLICOLOR 1
# PAGER cat
# GIT_PAGER cat
# MPLBACKEND module://ipykernel.pylab.backend_inline
# TZ EST+05EDT,M4.1.0,M10.5.0
# KMP_DUPLICATE_LIB_OK True
# KMP_INIT_AT_FORK FALSE
# linux

In [None]:
ilp = Path("/usr/lib/python3.6/importlib/__init__.py")
print(ilp.read_text())

"""A pure Python implementation of import."""
__all__ = ['__import__', 'import_module', 'invalidate_caches', 'reload']

# Bootstrap help #####################################################

# Until bootstrapping is complete, DO NOT import any modules that attempt
# to import importlib._bootstrap (directly or indirectly). Since this
# partially initialised package would be present in sys.modules, those
# modules would get an uninitialised copy of the source version, instead
# of a fully initialised version (either the frozen one or the one
# initialised below if the frozen one is not available).
import _imp  # Just the builtin component, NOT the full Python module
import sys

try:
    import _frozen_importlib as _bootstrap
except ImportError:
    from . import _bootstrap
    _bootstrap._setup(sys, _imp)
else:
    # importlib._bootstrap is the built-in import, ensure we don't create
    # a second copy of the module.
    _bootstrap.__name__ = 'importlib._bootstrap'
    _bootstrap.__pac

##**Load Data Files into Similarly-Named pd DataFrames**
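
As a point of reference, here is a minimal sketch of loading several files into DataFrames that are looked up by name from a dictionary, so that no `exec`/`eval` is needed. It is an illustration under stated assumptions, not the loader used in this notebook: the helper `load_named_dfs`, the `sources` layout, and the in-memory CSV text are all hypothetical stand-ins for the real files.

```python
import io

import pandas as pd

# In-memory stand-ins for CSV files (hypothetical contents, for illustration only).
shops_csv = "shop_id,shop_name,shop_group\n0,Shop A,1\n1,Shop B,2\n"
items_csv = "item_id,item_name,item_category_id\n10,Item X,40\n11,Item Y,55\n"

# (DataFrame name, CSV source, columns to keep): a hypothetical layout.
sources = [
    ("shops_enc", shops_csv, ["shop_id", "shop_group"]),
    ("items_enc", items_csv, ["item_id", "item_category_id"]),
]


def load_named_dfs(sources):
    """Return {name: DataFrame}, keeping only the requested columns."""
    dfs = {}
    for name, csv_text, keep_cols in sources:
        # usecols skips parsing of the columns we do not need
        dfs[name] = pd.read_csv(io.StringIO(csv_text), usecols=keep_cols)
    return dfs


dfs = load_named_dfs(sources)
print({name: df.columns.tolist() for name, df in dfs.items()})
```

Looking a DataFrame up as `dfs['shops_enc']` keeps the "similarly named" convenience while leaving the names as plain data.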

In [None]:
# Issue: pandas DataFrames that are no longer required by the program do not reliably give their memory back
#   (del + gc.collect() does not reliably do this, nor does re-defining the df as an empty pd.DataFrame()).
#   To keep from overloading Colab memory limits, we do the work in multiprocessing worker processes;
#   when a worker process exits, the OS reclaims all of its memory, including any unwanted DataFrames.

###############################################################################
def files_to_dfs(datasources_keepcols_tuple):
    """
    get raw data from disk using filenames defined above; [ datasources_keepcols_tuple = ((pd_df_name_str, csv_filepath_str), keep_cols_list)]
    Returns a pandas DataFrame for one data file...
        the multiprocessing map function that calls this function collects these DataFrames into a list, in input order
    """
    csv_filepath = datasources_keepcols_tuple[0][1]
    keep_cols = datasources_keepcols_tuple[1]
    # read the file and keep only the necessary columns (in keep_cols order), to help with speed and memory savings;
    #   returning the DataFrame directly avoids the exec/eval name juggling
    return pd.read_csv(csv_filepath)[keep_cols]  #.rename(columns = SHOPS_COLUMN_RENAME)

###############################################################################
###############################################################################
###############################################################################
@time_eda    # decorator [x.get_elapsed_time() or x.function_total_time -HH:MM:SS] [x.function_init_time-y,m,d,h,min,sec]
def eda_cleanup(eda_delete_shops, eda_delete_item_cats, eda_scale_month, feather_stt, model_filename_base, model_type, 
                feature_params, data_sources):
#########################################
    """
    1) load datafiles (stt sales_train_test,items,shops,date,test) created, feature-augmented (some manually done), and/or cleaned with:
        ['calculate_days_per_month.ipynb', 'EDA_sales_by_day_of_week_mg.ipynb'] --> date_scaling.csv
        [['time_item_category_shop_correlations_v10_mg.ipynb'] --> shops_enc.csv, 'nlp_clustering_item_names_v1_june2020_mg.ipynb'] --> items_enc.csv
        ['data_cleaning_and_eda_feature_merging_v2_mg.ipynb'] --> stt.csv.gz         [no modifications] ==>> test.csv.gz
    2) modify dataframes using splits defined above (drop some features, merge into stt, scale according to date, set datatypes)
    x) use 'multiprocessing' python configuration to help ensure pandas dataframes release RAM back to the VM when they are no longer needed
        (simple 'del' command is unreliable; only workaround I have found to work is to multiprocess to call a new process, which releases all
        memory *at the OS level* when process is complete)... Also provide option to save quick-loading "feather" files and eliminate need to
        keep all dataframes in RAM at all times.  This is particularly important when you are RAM-limited, as sometimes happens in Google Colab
    """
    time_eda.restart_time()
    print(f'Start EDA Module: {strftime("%a %X %x")}')
    MEMORY_STATS.append(get_memory_stats('At Top of EDA Function', printout=False))
    %cd "{GDRIVE_REPO_PATH}"
    print("Loading Files from Google Drive repo into Colab...\n")

    ''' ###################  start multiproc  ################### '''
    with multiprocessing.Pool(None) as load_dfs_pool:  #None = use all processors, could use Pool(1, maxtasksperchild = 1)
        load_dfs = load_dfs_pool.map(files_to_dfs, list(zip(data_sources,list(feature_params['keep_columns'].values()))))
        load_dfs_pool.close()  # close and join inside the 'with' block; leaving the block calls pool.terminate()
        load_dfs_pool.join()
    ''' ###################   end  multiproc  ################### '''

    MEMORY_STATS.append(get_memory_stats('load_dfs_pool Closed, Joined', printout=False))

    dfidx = {}
    for i in range(len(load_dfs)):
        name = data_sources[i][0]
        dfidx[name] = i
        print(f'---------- {name} ----------') 
        print(f'DataFrame shape: {load_dfs[i].shape}     DataFrame total memory usage: {(load_dfs[i].memory_usage(deep=True) /1e6).sum():.0f} MB')
        print(f'DataFrame Column Names: {load_dfs[i].columns.to_list()}')
        # print_col_info(load_dfs[i])
        print(f'^^^^^\n{load_dfs[i].head(2)}\n\n')

    #########################################  Merge shops and items into stt  #########################################
    load_dfs[dfidx['stt']] = load_dfs[dfidx['stt']].merge(load_dfs[dfidx['shops_enc']], on='shop_id', how='left')
    load_dfs[dfidx['stt']] = load_dfs[dfidx['stt']].merge(load_dfs[dfidx['items_enc']], on='item_id', how='left')

    MEMORY_STATS.append(get_memory_stats('stt Merged with shops and items', printout=False))

    print(f'---------- merged stt ----------') 
    print_col_info(load_dfs[dfidx['stt']])
    print(f"^^^^^\n{load_dfs[dfidx['stt']].head(2)}\n\n")

    #########################################  Del unwanted shops, item cats; Scale sales for days in month, etc. ######
    print('----------\n')
    if eda_delete_shops:  # drop undesirable shops
        load_dfs[dfidx['stt']] = load_dfs[dfidx['stt']].query('shop_id != @eda_delete_shops')
        print(f"Shape of stt after deleting shops {eda_delete_shops}: {load_dfs[dfidx['stt']].shape}")
    if eda_delete_item_cats:   # drop undesirable item categories
        load_dfs[dfidx['stt']] = load_dfs[dfidx['stt']].query('item_category_id != @eda_delete_item_cats')
        print(f"Shape of stt after deleting item categories {eda_delete_item_cats}: {load_dfs[dfidx['stt']].shape}\n")
    if eda_scale_month:    # scale by date_scaling as desired
        load_dfs[dfidx['stt']] = load_dfs[dfidx['stt']].merge(load_dfs[dfidx['date_scaling']][['month',eda_scale_month]], on='month', how='left')
        load_dfs[dfidx['stt']].sales = load_dfs[dfidx['stt']].sales * load_dfs[dfidx['stt']][eda_scale_month]
        load_dfs[dfidx['stt']].drop(eda_scale_month, axis=1, inplace=True) 

    #########################################  Insert revenue feature; adjust data types; drop unnecessary cols; set desired col order
    load_dfs[dfidx['stt']]['revenue'] = load_dfs[dfidx['stt']].sales * load_dfs[dfidx['stt']].price / 1000
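    # astype returns a new object, so assign the result back; float so date_adj weight is accurate; can use price
    load_dfs[dfidx['stt']][['sales','price','revenue']] = load_dfs[dfidx['stt']][['sales','price','revenue']].astype(np.float32)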
    load_dfs[dfidx['stt']][feature_params['integer']] = load_dfs[dfidx['stt']][feature_params['integer']].astype('int16')
    load_dfs[dfidx['stt']] = load_dfs[dfidx['stt']][feature_params['stt_final']]  # drop, re-order
    load_dfs[dfidx['stt']] = load_dfs[dfidx['stt']].reset_index(drop=True)  #reset index saves 25MB 
    print(f'---------- final stt ----------') 
    print_col_info(load_dfs[dfidx['stt']])
    print(f"^^^^^\n{load_dfs[dfidx['stt']].head(2)}\n\n")

    MEMORY_STATS.append(get_memory_stats('stt Transforms Completed', printout=False))

    #########################################  Save ftr files or return dataframes #########################################
    if feather_stt:
        # optional save file as feather type (big file; don't store inside repo) ... does not support lists, tuples in dataframe cells
        # this allows reclaiming the RAM used by stt file, if this function was called by multiprocess pool
        block_time.restart_time()
        %cd "{OUT_OF_REPO_PATH}"
        ftr_names = []
        print('load_dfs feather files stored on google drive in "final" directory, outside repo:')
        for i in range(len(load_dfs)):
            name = data_sources[i][0]
            ftr_names.append(f'{name}_temp.ftr')
            load_dfs[i].to_feather(ftr_names[i])
            print(f'{ftr_names[i]}, ',end='') 
        load_dfs = ftr_names
        print(f'Elapsed time writing EDA feather files: {block_time.get_elapsed_time()}\n')

        MEMORY_STATS.append(get_memory_stats('After EDA Func .ftr Write', printout=False))

    OUTPUTS_df.at[RUN_n,"time_eda"] = time_eda.get_elapsed_time()
    print(f'EDA Module elapsed time: {OUTPUTS_df.at[RUN_n,"time_eda"]}')
    display_all_memory_stats(MEMORY_STATS)
    print(f'Done EDA Module: {strftime("%a %X %x")}')
    return load_dfs, dfidx

##**Data Preparation: Feature Merging and Feature Generation**
###**1) Compute and Merge Statistics-Based Features on Grouped-by-Month training data**
* Note that features based on price are nonsensical if we add cartesian product fill.  However, item sales and item revenues are OK to use.
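
The grouping step described above can be sketched on toy data as follows. This is a minimal illustration, assuming made-up rows and only a few of the statistics; the column names mirror the ones used in this notebook, but the values and the name-flattening expression are assumptions for the example. Sales and revenue are aggregated because a zero-filled row is still meaningful for them, whereas a zero price would not be.

```python
import pandas as pd

# Toy transaction-level data (values are made up for illustration).
stt = pd.DataFrame({
    "month":   [0, 0, 0, 1, 1],
    "shop_id": [5, 5, 6, 5, 5],
    "item_id": [10, 10, 10, 10, 11],
    "sales":   [1.0, 2.0, 1.0, 3.0, 1.0],
    "price":   [100.0, 100.0, 120.0, 90.0, 50.0],
})
stt["revenue"] = stt["sales"] * stt["price"] / 1000

group = ["month", "shop_id", "item_id"]
stats = {"sales": ["count", "sum"], "revenue": ["sum"]}

monthly = stt.groupby(group).agg(stats)
# Flatten the (column, statistic) MultiIndex into single descriptive names.
monthly.columns = [f"shop_id_x_item_id_{col}_{stat}" for col, stat in monthly.columns]
monthly = monthly.reset_index()
print(monthly)
```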

###**2) Add Cartesian Product rows to the training data:**
* The idea is to help the model by explicitly telling it that we have no information about certain relevant shop-item pairs in certain months.
* Each month in the train data gets additional rows so that the Cartesian Product of all shops and items ALREADY PRESENT IN THAT MONTH is included.
* When we merge lagged features below, we will only forward-shift the shop-item pairs that are present in the later month. *(Might revisit later: if memory requirements are not too big, we can forward-shift all shop-item pairs.)*
* **If the Cartesian Product is not added, or if fillna(0) is used, features can be rounded to integers to save memory (NumPy integer columns cannot store np.NaN; columns that contain NaN need float32).**
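
A minimal sketch of the per-month Cartesian Product fill, on toy data. The DataFrame contents and the `sales_sum` column name are assumptions for illustration; the real pipeline works on the full monthly table and may differ in detail. It also shows why the added rows force a float column until `fillna(0)` is applied.

```python
from itertools import product

import numpy as np
import pandas as pd

# Toy monthly data (values are made up for illustration).
monthly = pd.DataFrame({
    "month":     [0, 0, 1],
    "shop_id":   [5, 6, 5],
    "item_id":   [10, 11, 10],
    "sales_sum": [3.0, 1.0, 2.0],
})
keys = ["month", "shop_id", "item_id"]

blocks = []
for month, rows in monthly.groupby("month"):
    # All shop-item combinations built from shops and items seen in this month only.
    pairs = product([month], rows["shop_id"].unique(), rows["item_id"].unique())
    blocks.append(np.array(list(pairs), dtype=np.int16))

grid = pd.DataFrame(np.vstack(blocks), columns=keys)
# Left-merge the known statistics onto the full grid; added rows get NaN.
filled = grid.merge(monthly.astype({k: np.int16 for k in keys}), on=keys, how="left")
print(filled["sales_sum"].dtype)   # a float dtype, because NaN marks the added rows

# After fillna(0) the column can be stored as a small integer type.
as_int = filled.fillna(0).astype(np.int16)
print(as_int)
```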

###**3) Add Lagged Statistics columns to the training data:**
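
A minimal sketch of one way to build a lagged column, assuming toy data: shift the month of a copy forward by the lag, then left-merge it so that only shop-item pairs present in the destination month keep a value. The helper `add_lag` and the `_L1` suffix are hypothetical names chosen for this example, not the notebook's own.

```python
import pandas as pd

# Toy monthly data (values are made up for illustration).
monthly = pd.DataFrame({
    "month":     [0, 0, 1, 1],
    "shop_id":   [5, 6, 5, 7],
    "item_id":   [10, 10, 10, 10],
    "sales_sum": [3.0, 1.0, 2.0, 4.0],
})


def add_lag(df, feature, lag, keys=("shop_id", "item_id")):
    """Left-merge `feature` from `lag` months earlier onto each row of df."""
    keys = list(keys)
    lagged = df[["month"] + keys + [feature]].copy()
    lagged["month"] += lag  # values from month m-lag now line up with month m
    lagged = lagged.rename(columns={feature: f"{feature}_L{lag}"})
    # how='left' drops shifted rows with no matching pair in the destination month
    return df.merge(lagged, on=["month"] + keys, how="left")


with_lag = add_lag(monthly, "sales_sum", lag=1)
print(with_lag)
```

In this toy case, shop 6 sold item 10 in month 0 but not in month 1, so its shifted row finds no match and is discarded.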

In [None]:
from concurrent.futures import ThreadPoolExecutor  # for nested threaded multiprocessing
# https://stackoverflow.com/questions/49947935/nested-parallelism-in-python-multiprocessing

# Create MONTHLY_STT DataFrame = stt grouped by month, with statistics
# Compute values in "real time," then in a later code cell we will compute shifted (lagged) versions

###############################################################################
def compute_stats(stats_arg):
    """
    function for computing statistics-based features; flexible if we wish to add in extra statistics or extra group-by categories
    """
    # stt = stats_arg[0]
    # stats_set_dict = stats_arg[1]
    group = stats_arg[1]['group']
    group = ['month'] + group if 'month' not in group else group
    grouped_df = stats_arg[0].groupby(group).agg(stats_arg[1]['stats'])
    grouped_df.columns = stats_arg[1]['agg_names']
    grouped_df.reset_index(inplace=True)
    #gpr = [grouped_df, group]
    return grouped_df #gpr

###############################################################################
def create_and_merge_groups(stats_arg):
    # stt = stats_arg[0]
    # stats_set_dict = stats_arg[1]
    # monthly_stt = stats_arg[2]

    print('in create and merge groups')

    # ''' ###################  start multiproc  ################### '''
    # with multiprocessing.Pool(None) as stats_pool:  
    #     grouped_df_list = stats_pool.apply(compute_stats, stats_set_dict)
    # # with multiprocessing.Pool(None, maxtasksperchild = 1) as stats_pool:  
    # #     grouped_df_list = stats_pool.map(compute_stats, [stats_set_dict])
    # stats_pool.close()  # pool.terminate()
    # stats_pool.join()
    # ''' ###################   end  multiproc  ################### '''
    # MEMORY_STATS.append(get_memory_stats('stats_pool Closed and Joined',printout=False))
    grouped_df = compute_stats(stats_arg)


    # fix so monthly stt is properly initialized even though multiproc map fn doesn't guarantee process order at this point?

    # monthly_stt = grouped_df_list.pop(0)  # initialize grouping by month with stats based on monthly transactions of shop-item pairs

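    # merge on the same keys compute_stats grouped by (it always includes 'month')
    group = stats_arg[1]['group']
    group = ['month'] + group if 'month' not in group else group
    monthly_stt = stats_arg[2].merge(grouped_df, on = group, how = 'left')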

    return monthly_stt




def calc_stats(stats_arg):
    """
    function for computing statistics-based features; flexible if we wish to add in extra statistics or extra group-by categories
    """
    # stt = stats_arg[0]
    # stats_set_dict = stats_arg[1]
    group = stats_arg[1]['group']
    group = ['month'] + group if 'month' not in group else group
    grouped_df = stats_arg[0].groupby(group).agg(stats_arg[1]['stats'])
    grouped_df.columns = stats_arg[1]['agg_names']
    grouped_df.reset_index(inplace=True)

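    # NOTE: an earlier draft submitted the merge to a thread pool here (monthly_stt_merge_pool, merge_monthly_stt),
    #   but neither that pool nor monthly_stt_ is defined in this scope; return the grouped statistics instead
    return grouped_df



def merge_monthly_stt(stt, monthly_stt, stats_dict):
    """
    presumed intent (the body was left unfinished): compute one set of grouped statistics from stt
    and left-merge it into monthly_stt on the group-by keys
    """
    grouped_df = compute_stats([stt, stats_dict])
    group = stats_dict['group']
    group = ['month'] + group if 'month' not in group else group
    return monthly_stt.merge(grouped_df, on = group, how = 'left')

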
###############################################################################
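def numpy_cartesian_product(query_string):
    # NOTE: relies on monthly_stt being visible at module level inside the worker process
    cartprod_rows = monthly_stt[['month','shop_id','item_id']].query(query_string)
    month = cartprod_rows.month.min()  # the training month being filled (test month 34, if included, is always larger)
    return np.array(list(product([month], cartprod_rows.shop_id.unique(), cartprod_rows.item_id.unique())), dtype=np.int16)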

###############################################################################
###############################################################################
###############################################################################
@time_data_manip
def data_conditioning(  stt, shops_enc, items_enc,
                        cartprod_fillna0, cartprod_first_month, cartprod_test_pairs, clip_train_H, clip_train_L, feather_monthly_stt, 
                        feature_data_type, minmax_scaler_range, robust_scaler_quantiles, use_cartprod_fill, use_categorical, use_minmax_scaler, 
                        use_robust_scaler, eda_delete_shops, eda_delete_item_cats, eda_scale_month, feather_stt, 
                        model_filename_base, model_type, feature_params, data_sources):
    
    """
    1) group training/val data by months, compute statistics while aggregating
    2) scale the feature columns for better use of full range of available datatype values (np.int16, np.int8, np.uint16,...) unless float
    3) add cartesian product fill to each month, as cartesian product of shop_id and item_id from that single month
    4) merge time-lagged statistics, discarding those that don't have a match with an existing shop_id,item_id pair in destination month
    5) adjust feature column datatypes as desired, ideally reducing VM RAM usage
    inputs:     *stt (sales train test) dataframe, *shops_enc, *items_enc, and the various *parameters to guide the above actions
    outputs:    *monthly_stt dataframe (grouped by month, cartesian product rows added, lagged statistics added, datatypes set)
                *robust_scalers and *minmax_scalers for each column (to inverse transform our model prediction before submission)
    """
    print('In data cond fn')
    # stt = 'stt_temp.ftr'
    # shops_enc = 'shops_enc_temp.ftr'
    # items_enc = 'items_enc_temp.ftr'
    time_data_manip.restart_time()
    print(f'Start Data Module: {strftime("%a %X %x")}')
    MEMORY_STATS.append(get_memory_stats('At Start of Data Function',printout=False))


    print('At start of feather_stt load')

    if feather_stt:  # load dataframes from disk
        block_time.restart_time()
        print(f'Feather File Source Directory: ', end='')
        %cd "{OUT_OF_REPO_PATH}"
        stt = pd.read_feather(stt, columns=None, use_threads=True)
        shops_enc = pd.read_feather(shops_enc, columns=None, use_threads=True)
        items_enc = pd.read_feather(items_enc, columns=None, use_threads=True)
        print("Loaded DataFrames 'stt', 'shops_enc', 'items_enc' from Google Drive feather files into Colab...")
        print(f'Elapsed time loading feather file DataFrames: {block_time.get_elapsed_time()}\n')
        MEMORY_STATS.append(get_memory_stats('After Data Func .ftr Load',printout=False))

    ################### Aggregate Monthly Stats ############################################################################
    monthly_stt = pd.DataFrame()
    # create iterable list of stats calculations, so we can use multiprocess map function
    ''' ###################  start multiproc  ################### '''

    # print(f"listfeatparams: {list(feature_params['lag_splits']['stats_set'].values())}")
    # # listfeatparams: [{'group': ['shop_id', 'item_id'], 
    # #                 'stats': {'shop_group': ['first'], 'item_category_id': ['first'], 'item_group': ['first'], 
    # #                             'item_cluster': ['first'], 'sales': ['count', 'sum', 'median'], 'revenue': ['sum']}, 
    # #                 'agg_names': ['shop_group', 'item_category_id', 'item_group', 'item_cluster', 'shop_id_x_item_id_sales_count', 
    # #                             'shop_id_x_item_id_sales_sum', 'shop_id_x_item_id_sales_median', 'shop_id_x_item_id_revenue_sum']}, 
    # #
    # #                 {'group': ['shop_id', 'item_category_id'], 
    # #                 'stats': {'sales': ['count', 'sum', 'median'], 'revenue': ['sum']}, 
    # #                 'agg_names': ['shop_id_x_item_category_id_sales_count', 'shop_id_x_item_category_id_sales_sum', 
    # #                             'shop_id_x_item_category_id_sales_median', 'shop_id_x_item_category_id_revenue_sum']}, 
    # #
    # #                 {'group': ['shop_id', 'item_cluster'], 
    # #                 'stats': {'sales': ['sum', 'median']}, 
    # #                 'agg_names': ['shop_id_x_item_cluster_sales_sum', 'shop_id_x_item_cluster_sales_median']}, 
    # # 
    # #                 {'group': ['shop_id'], 'stats': {'sales': ['count', 'sum']}, 'agg_names': ['shop_id_sales_count', 'shop_id_sales_sum']}, {'group': ['item_id'], 'stats': {'sales': ['count', 'sum', 'median'], 'revenue': ['sum']}, 'agg_names': ['item_id_sales_count', 'item_id_sales_sum', 'item_id_sales_median', 'item_id_revenue_sum']}, {'group': ['shop_group'], 'stats': {'revenue': ['sum']}, 'agg_names': ['shop_group_revenue_sum']}, {'group': ['item_category_id'], 'stats': {'sales': ['count', 'sum'], 'revenue': ['sum']}, 'agg_names': ['item_category_id_sales_count', 'item_category_id_sales_sum', 'item_category_id_revenue_sum']}, {'group': ['item_group'], 'stats': {'sales': ['sum'], 'revenue': ['sum']}, 'agg_names': ['item_group_sales_sum', 'item_group_revenue_sum']}, {'group': ['item_cluster'], 'stats': {'sales': ['count', 'sum'], 'revenue': ['sum']}, 'agg_names': ['item_cluster_sales_count', 'item_cluster_sales_sum', 'item_cluster_revenue_sum']}]
    # for i in list(feature_params['lag_splits']['stats_set'].values()):
    #     for k,v in i.items():
    #         print(k,v)
    #     print('\n')




    stats_arg = [[stt, fp] for fp in feature_params['lag_splits']['stats_set'].values()]
    MEMORY_STATS.append(get_memory_stats('stats_arg List Created',printout=False))
    # for i in stats_arg:
    #     print(i)
    monthly_stt = compute_stats(stats_arg.pop(0))   # initialize monthly_stt to capture standard features (stats='first')

    MEMORY_STATS.append(get_memory_stats('monthly_stt Initialized with 1st Stats',printout=False))
    with multiprocessing.Pool(None) as monthly_stt_pool:
        grouped_df_list = monthly_stt_pool.map(compute_stats, stats_arg)  # map returns results in the same order as stats_arg
        monthly_stt_pool.close()  # pool.terminate()
        monthly_stt_pool.join()
    ''' ###################   end  multiproc  ################### '''
    MEMORY_STATS.append(get_memory_stats('monthly_stt_pool Closed, Joined', printout=False))
    # print('grouped df list',grouped_df_list)

    # left-merge each remaining set of grouped statistics into monthly_stt, on the same keys used for its groupby
    for grouped_df, (_, stats_set_dict) in zip(grouped_df_list, stats_arg):
        group = stats_set_dict['group']
        group = ['month'] + group if 'month' not in group else group
        monthly_stt = monthly_stt.merge(grouped_df, on = group, how = 'left')






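    # Unfinished alternative, kept for reference and disabled: merge each stats set using a nested worker pool.
    #   multiprocessing.Pool has no submit() method (concurrent.futures executors do), and the call below
    #   was never given its arguments, so running it would fail.
    # monthly_stt_merge_pool = ThreadPoolExecutor(max_workers=4)
    # stats_calcs_pool = ThreadPoolExecutor(max_workers=4)
    # monthly_stt_pool = multiprocessing.Pool(4)
    # for stats_dict in list(feature_params['lag_splits']['stats_set'].values()):
    #     monthly_stt = monthly_stt_pool.submit(merge_monthly_stt,)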

# header: referrer= google or imfp or something
# python param library to standardize above input dict
# qgrid = python dataframe viewer for jupyter nb
#google-auth-oauthlib
# tempora stopwatch
# use webappify 0.3 for Regex app on desktop? or is there regex in windows store?
#tqdm or progressbar to show progress
# use arrow instead of datetime, etc.
#  SEE ALSO package pandas-profiling   (HTML report that gives mega info on a df )

# pyperformance can also track memory
# pympler does memory usage well   #utf-8 with ftfy  #memory_profiler   #multidict  #pkginfo #progressbar2  #imagesize from jpeg header
#py-cpuinfo  # arrow  #curl  #httpx   #asyncio #imageio  #infinity
#pygithub   #unidecode  #cachecontrol
#tzlocal
# perhaps an alternative to the "eval" statements when doing file imports, etc??:
# istr
# CIMultiDict accepts str as key argument for dict lookups but uses case-folded (lower-cased) strings for the comparison internally.
#          also see orderedmultidict
# For more effective processing it should know if the key is already case-folded to skip the lower() call.

# Performance-sensitive code may create case-folded string keys explicitly by hand, e.g.:

# >>> key = istr('Key')
# >>> key
# 'Key'
# >>> mdict = CIMultiDict(key='value')
# >>> key in mdict
# True
# >>> mdict[key]
# 'value'
# For performance istr strings should be created once and stored somewhere for the later usage, see aiohttp.hdrs for example.

# see https://www.toptal.com/python/python-class-attributes-an-overly-thorough-guide
#   item 3 - tracking instances


#   - lightgbm -> python[version='>=3.5,<3.6.0a0']
#   - memory_profiler -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.5,<3.6.0a0']
#   - py-cpuinfo -> python[version='>=2.7,<2.8.0a0|>=3.5,<3.6.0a0|>=3.7,<3.8.0a0|>=3.6,<3.7.0a0']
#   - pympler -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.5,<3.6.0a0|>=3.7,<3.8.0a0']
#   - scandir -> python[version='>=2.7,<2.8.0a0']
#   - tqdm -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.5,<3.6.0a0']
#   - unidecode -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.5,<3.6.0a0']
#   - whoosh -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.5,<3.6.0a0']

# Your python: python==3.8.0


#py3.8.5 not with whoosh, pympler, 
#nbsmoke, numexpr, orderedmultidict, pandas-profiling, param, pep8naming, pkginfo, pylint, pycpuinfo, pyperformance, pyquery, qgrid, scandir, tempora, tzlocal, unidecode

    MEMORY_STATS.append(get_memory_stats('monthly_stt Created',printout=False))




    ################### Clean, Scale, Datatype Monthly DataFrame ###########################################################
    #  monthly_stt = monthly_stt.rename(columns={'shop_id_x_item_id_sales_sum':'y_sales'})  # rename for convenience
    monthly_stt = monthly_stt[feature_params['integer'] + feature_params['lag_splits']['stats_set_feature_names']] # order, keep desired features
    monthly_stt.sort_values(['month','shop_id','item_id'], inplace=True)
    # clip to adhere to kaggle contest instructions
    monthly_stt.shop_id_x_item_id_sales_sum = monthly_stt.shop_id_x_item_id_sales_sum.clip(clip_train_L, clip_train_H)
    # always do minmax scaling after robust scaling; and do inverse scaling with minmax first, then robust
    robust_scalers = {} 
    if use_robust_scaler:  # squeeze outliers into central distribution
        for aggcol in feature_params['lag_splits']['stats_set_feature_names']:
            robust_scalers[aggcol] = RobustScaler(with_centering=False, quantile_range=robust_scaler_quantiles)
            monthly_stt[aggcol] = robust_scalers[aggcol].fit_transform(monthly_stt[aggcol].to_numpy().reshape(-1, 1))
    minmax_scalers = {} 
    if use_minmax_scaler:  # apply min-max scaler to make best use of np.int16 and memory usage
        for aggcol in feature_params['lag_splits']['stats_set_feature_names']:
            minmax_scalers[aggcol] = MinMaxScaler(feature_range=minmax_scaler_range)
            monthly_stt[aggcol] = minmax_scalers[aggcol].fit_transform(monthly_stt[aggcol].to_numpy().reshape(-1, 1))
    if feature_data_type in [np.int16, np.uint16]:
        monthly_stt = monthly_stt.fillna(0).round()
    monthly_stt = monthly_stt.astype(feature_data_type).reset_index(drop=True)
    print(f'\nmonthly_stt fully grouped and merged and scaled: shape = {monthly_stt.shape}')
    print_col_info(monthly_stt,8)
    MEMORY_STATS.append(get_memory_stats('monthly_stt Scaled and Downcast', printout=False))

    ################### Cartesian Product Insertion ########################################################################
    if use_cartprod_fill:
        # Create cartesian product so model has info to look at for every relevant shop-item-month combination in the months desired
        # add enough months of cart prod that after time-LAGS, we end up with cart products in months cartprod_first_month through 33
        first_month = max (cartprod_first_month - max(feature_params['lag_splits']['months_list']), 0)
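        query_string_list = []
        for i in range(first_month,34):
            # embed the month number in the string; an '@i' reference could not be resolved inside the worker function
            query_string_list.append(f'(month == {i})|(month == 34)' if cartprod_test_pairs else f'(month == {i})')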
            
        ''' ###################  start multiproc  ################### '''
        with multiprocessing.Pool(None, maxtasksperchild = 1) as cart_prod_pool:  
            numpy_cartprod_array_list = cart_prod_pool.map(numpy_cartesian_product, query_string_list)
            cart_prod_pool.close()  # no more tasks; let workers exit cleanly before the 'with' block terminates the pool
            cart_prod_pool.join()
        ''' ###################   end  multiproc  ################### '''
        MEMORY_STATS.append(get_memory_stats('cart_prod_pool Closed, Joined',printout=False))

        monthly_stt = monthly_stt.merge( pd.DataFrame(np.vstack(numpy_cartprod_array_list), columns=['month','shop_id','item_id']),
                                         how = 'outer', on = ['month','shop_id','item_id'])
        monthly_stt = monthly_stt.merge( shops_enc, how = 'left', on = 'shop_id')
        monthly_stt = monthly_stt.merge( items_enc, how = 'left', on = 'item_id')
        monthly_stt = monthly_stt.sort_values(['month','shop_id','item_id']).reset_index(drop=True)
        if cartprod_fillna0:
            monthly_stt = monthly_stt.fillna(0)  # store as integers to save memory, but 'price=0' values = nonsense; use revenue instead

        print(f'Column Data Types: \n{monthly_stt.dtypes}\nmonthly_stt memory usage: {monthly_stt.memory_usage(deep=True).sum()/1e6:.0f} MBytes')
        print(f'Number of months: {monthly_stt.month.nunique():,d}\nNumber of shops: {monthly_stt.shop_id.nunique():,d}')
        print(f'Number of items: {monthly_stt.item_id.nunique():,d}\nDataFrame length: {len(monthly_stt):,d}\n')

    monthly_stt = monthly_stt.astype(feature_data_type).reset_index(drop=True) #np.int16 #.apply(pd.to_numeric, downcast= np.float32)
    print_col_info(monthly_stt,8)
    print(f'\nmonthly_stt.head:\n{monthly_stt.head(2)}')
    print(f'\nmonthly_stt.tail:\n{monthly_stt.tail(2)}\n')  # display(monthly_stt.describe())
    MEMORY_STATS.append(get_memory_stats('Cartesian Product Rows Added', printout=False))

    ################### Merge Time-Lag Features ############################################################################
    #   drop any rows in month (m minus lag) that don't have matching shop-item pair at month m
    print(f'Unlagged DataFrame length: {len(monthly_stt):,d}\n')
    monthly_stt['y_target'] = monthly_stt.shop_id_x_item_id_sales_sum.copy(deep=True)  # unlagged shop_item sales/month for our predict target
    if cartprod_fillna0 or (feature_data_type in [np.int16, np.uint16]):
        monthly_stt.y_target = monthly_stt.y_target.fillna(0)     #.clip(INTEGER_MULTIPLIER*clip_train_L, INTEGER_MULTIPLIER*clip_train_H)
    lag_merge_on_cols = ['month','shop_id','item_id']
    for lag in feature_params['lag_splits']['months_list']:
        cols_to_shift = lag_merge_on_cols + feature_params['lag_splits']['params'][lag]['feature_root']
        lag_df = monthly_stt[cols_to_shift].copy(deep=True).rename(columns = feature_params['lag_splits']['params'][lag]['lagged_feature_name'])
        lag_df['month'] = lag_df['month'] + lag  # shift so month m's stats line up with month (m + lag); eval(inplace=True) returns None
        lag_df = lag_df.astype(feature_data_type)
        monthly_stt = monthly_stt.merge(lag_df, on = lag_merge_on_cols, how = 'left')
        if cartprod_fillna0 or (feature_data_type in [np.int16, np.uint16]):
            monthly_stt = monthly_stt.fillna(0)  
        monthly_stt = monthly_stt.astype(feature_data_type).reset_index(drop=True)
    if use_categorical:
        for cat_col in feature_params['categorical']:
            monthly_stt[cat_col] = monthly_stt[cat_col].astype('category')
    print('lagged features monthly_stt:')
    print(f'monthly_stt lagged dataframe shape: {monthly_stt.shape}\n')
    print_col_info(monthly_stt,8)
    print(f'\nlagged monthly_stt.head():\n{monthly_stt.head()}\n')
    MEMORY_STATS.append(get_memory_stats('Time Lag Features Added',printout=False))

    #########################################  Save ftr files or return dataframes #########################################
    if feather_monthly_stt:
        # optional save file as feather type (big file; don't store inside repo) ... does not support lists, tuples in dataframe cells
        # this allows reclaiming the RAM used by monthly_stt file, if this function was called by multiprocess pool
        block_time.restart_time()
        %cd "{OUT_OF_REPO_PATH}"
        monthly_stt.to_feather('monthly_stt_temp.ftr')
        monthly_stt = 'monthly_stt_temp.ftr'
        print("'monthly_stt_temp.ftr' feather file stored on google drive in 'final' directory, outside repo.")
        print(f'Elapsed time writing monthly_stt feather file: {block_time.get_elapsed_time()}\n')
        MEMORY_STATS.append(get_memory_stats('After Data Func .ftr Write',printout=False))
    
    OUTPUTS_df.at[RUN_n,"time_data_manip"] = time_data_manip.get_elapsed_time()
    print(f'Data Module elapsed time : {OUTPUTS_df.at[RUN_n,"time_data_manip"]}')
    display_all_memory_stats(MEMORY_STATS)
    print(f'End Data Module: {strftime("%a %X %x")}')
    return monthly_stt, robust_scalers, minmax_scalers
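
The Cartesian-product step above relies on a helper, `numpy_cartesian_product`, that is defined elsewhere in the notebook. As a rough illustration of the idea only, the sketch below builds every (month, shop, item) combination for the shops and items that appear in a given month. The function name `month_cartesian_product` and the toy `sales` frame are hypothetical and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

def month_cartesian_product(sales: pd.DataFrame, month: int) -> np.ndarray:
    """Return every (month, shop_id, item_id) combination for shops and items seen in `month`."""
    in_month = sales[sales["month"] == month]
    shops = np.sort(in_month["shop_id"].unique())
    items = np.sort(in_month["item_id"].unique())
    # indexing="ij" pairs each shop with each item, shop-major order
    shop_grid, item_grid = np.meshgrid(shops, items, indexing="ij")
    month_col = np.full(shop_grid.size, month)
    return np.column_stack([month_col, shop_grid.ravel(), item_grid.ravel()])

# Toy data: month 0 has shops {1, 2} and items {10, 11}; month 1 has shop {1} and item {12}
sales = pd.DataFrame({
    "month":   [0, 0, 1],
    "shop_id": [1, 2, 1],
    "item_id": [10, 11, 12],
})
blocks = [month_cartesian_product(sales, m) for m in (0, 1)]
cartprod = pd.DataFrame(np.vstack(blocks), columns=["month", "shop_id", "item_id"])
print(cartprod)
```

Rows that have no sales in the original data appear only in the product, which is why the outer merge followed by `fillna(0)` gives the model explicit zero-sales examples.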

##**Train/Test split**

In [None]:

###############################################################################
###############################################################################
###############################################################################
@time_dataset_splits
def tvt_split_function(DataSets=None, feather_tvt_split=False, test_month=34, train_start_month=13, train_final_month=29, validate_months=999, 
                       feather_monthly_stt=True, feature_data_type=np.int16, 
                       use_categorical=True, categorical_features=["shop_id","item_id"], model_type='LGBM'):
    """
    DataSets is a dictionary containing all split data (train_X, train_y, val_X, val_y, test_X) with the split data names as keys, and
    the values = pandas DataFrames corresponding to those keys.  For flexibility and possible memory savings, we can read the
    original (pre-split) data from a feather file on disk, and can save the 5 DataSets values as feather files as well.
    """
    if DataSets is None:  # avoid sharing one mutable default dict between calls
        DataSets = {}
    time_dataset_splits.restart_time()
    print(f'Start Train-Val-Test Split Module: {strftime("%a %X %x")}')
    MEMORY_STATS.append(get_memory_stats('At Top of TVT_SPLIT Function',printout=False))
    if feather_monthly_stt:  # load monthly_stt dataframe from disk into "train_X" dataframe in VM
        print(f'{DataSets["train_X"]} source directory: ', end='')
        block_time.restart_time()
        %cd "{OUT_OF_REPO_PATH}"
        DataSets["train_X"] = pd.read_feather(DataSets["train_X"], columns=None, use_threads=True)
        print("Loaded monthly_stt from Google Drive feather file into Colab as train_X ...")
        print(f'Elapsed time loading monthly_stt feather file: {block_time.get_elapsed_time()}\n')
        MEMORY_STATS.append(get_memory_stats('After TVT_SPLIT Fn ftr Load',printout=False))

    if model_type == 'LGBM':
        # This train/val/test split is for time-ordered data; may not want same sort of algorithm with other source data or model types
        DataSets['train_X'] = DataSets['train_X'].query('month >= @train_start_month')  # remove early months that don't participate in model training
        DataSets['train_X'] = DataSets['train_X'].astype(feature_data_type).reset_index(drop=True)

        DataSets['test_X'] = DataSets['train_X'].query('month == @test_month').drop('y_target',axis=1).reset_index(drop=True)
        
        if validate_months == 999:  # include all months from end of training up to the start of test month; at a minimum include month 33
            DataSets['val_X'] = DataSets['train_X'].query('((month > (@train_final_month)) & (month < @test_month)) | (month == (@test_month-1))')
        else:  # include only n (validate_months = 1,2,3,...) months after training; but no later than start of test
            DataSets['val_X'] = DataSets['train_X'].query('(month > (@train_final_month)) & (month <= (@train_final_month + @validate_months)) & (month < @test_month)')
        DataSets['val_y'] = DataSets['val_X'].pop('y_target')

        # keep only the training months (query returns a new frame, so the result must be assigned)
        DataSets['train_X'] = DataSets['train_X'].query('month <= @train_final_month')
        DataSets['train_y'] = DataSets['train_X'].pop('y_target')

        feature_names = DataSets['train_X'].columns

        print('train_X:')
        print(DataSets['train_X'].head(2))
        print_col_info(DataSets['train_X'],8)
        print('\ntrain_y:')
        print(DataSets['train_y'].head(2))
        print_col_info(DataSets['train_y'],8)

        # Make sure all data sets are properly categorized and typed
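        for ds in DataSets.keys():
            is_features = ds.endswith('X')  # train_X, val_X, test_X are feature frames; train_y, val_y are target Series
            datatype = feature_data_type if is_features else np.float32  # 'target y' value can be high accuracy; 'features' can be int or float
            DataSets[ds] = DataSets[ds].astype(datatype).reset_index(drop=True)
            if use_categorical and is_features:  # target Series have no feature columns to categorize
                DataSets[ds][categorical_features] = DataSets[ds][categorical_features].astype('category')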
        MEMORY_STATS.append(get_memory_stats('At End of TVT_SPLIT Function',printout=False))

    if feather_tvt_split:
        # optional save file as feather type (big file; don't store inside repo) ... does not support lists, tuples in dataframe cells
        # this allows reclaiming the RAM used by train/val/test files, if this function was called by multiprocess pool
        block_time.restart_time()
        %cd "{OUT_OF_REPO_PATH}"
        ftr_names = {}
        print('Train-Val-Test feather files stored on google drive in "final" directory, outside repo:')
        for dfname in DataSets.keys():
            ftr_names[dfname] = f'{dfname}_temp.ftr'
            DataSets[dfname].to_feather(ftr_names[dfname])
            print(ftr_names[dfname],end='  ') 
        print(f'\nElapsed time writing Train-Val-Test feather files: {block_time.get_elapsed_time()}\n')
        DataSets = ftr_names
        MEMORY_STATS.append(get_memory_stats('After TVT_SPLIT Fn ftr Write',printout=False))

    OUTPUTS_df.at[RUN_n, "time_dataset_splits"] = time_dataset_splits.get_elapsed_time()
    print(f'Train-Val-Test Splits Module elapsed time : {OUTPUTS_df.at[RUN_n, "time_dataset_splits"]}')
    display_all_memory_stats(MEMORY_STATS)
    print(f'End Train-Val-Test Splits Module: {strftime("%a %X %x")}')
    return DataSets, feature_names
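
The split above is time-ordered: early months train the model, the months just before the test month validate it, and the test month has no known target. A minimal sketch of that logic on toy data follows. The helper name `time_ordered_split` and the toy frame are hypothetical, and boolean masks are used here in place of the notebook's `query` strings.

```python
import pandas as pd

def time_ordered_split(df, train_start_month, train_final_month, test_month, target="y_target"):
    """Split a monthly frame into train/val/test by month number, never shuffling across time."""
    month = df["month"]
    masks = {
        "train": (month >= train_start_month) & (month <= train_final_month),
        "val":   (month > train_final_month) & (month < test_month),
        "test":  month == test_month,
    }
    splits = {}
    for name, mask in masks.items():
        part = df[mask].reset_index(drop=True)
        splits[f"{name}_X"] = part.drop(columns=target)
        if name != "test":  # the test month has no known target
            splits[f"{name}_y"] = part[target]
    return splits

toy = pd.DataFrame({"month": range(6), "feature": range(10, 16), "y_target": [1, 0, 2, 3, 1, 0]})
splits = time_ordered_split(toy, train_start_month=1, train_final_month=3, test_month=5)
print({name: len(part) for name, part in splits.items()})
```

Keeping validation months strictly after the training months mimics the real forecasting situation, where the model never sees the future.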

##**LightGBM - Lightweight Gradient-Boosted Decision Tree**

In [None]:

###############################################################################
def unscale(scaler,target):
    return scaler.inverse_transform(target.reshape(-1, 1)).squeeze()

###############################################################################
###############################################################################
###############################################################################
@time_model_fit
@time_model_predict
def gbdt_model(filename_submission, DataSets, gbdt_setup_params, gbdt_fit_params,
               clip_predict_H, clip_predict_L, feather_tvt_split, test, model_type, robust_scalers, minmax_scalers):
    """
    DataSets dictionary includes train_X, train_y, val_X, val_y, test_X (keys = string version of values), or filenames (if stored on disk)
    RUN_n is the nth model being trained/fit/predicted/ensembled
    lgbm model = LightGBM is a particular case of a gradient-boosted decision tree model, so it is a subroutine in this GBDT function
    """
    print(f'Start Gradient-Boosted Decision Tree Modeling Module: {strftime("%a %X %x")}')
    MEMORY_STATS.append(get_memory_stats('At Top of Modeling Function',printout=False))
    if feather_tvt_split:  # load train_X, train_y, val_X, val_y, test_X dataframes from disk
        print(f'DataSet Feather File Source Directory: ', end='')
        block_time.restart_time()
        %cd "{OUT_OF_REPO_PATH}"
        for dfname, df_filename in DataSets.items():
            DataSets[dfname] = pd.read_feather(df_filename, columns=None, use_threads=True)  # df_filename is the stored feather filename
        print("Loaded DataFrames train_X, train_y, val_X, val_y, test_X from Google Drive feather files into Colab...")
        print(f'Elapsed time loading feather files: {block_time.get_elapsed_time()}\n')
        MEMORY_STATS.append(get_memory_stats('After Modeling Fn ftr Load',printout=False))

    ########################################################################################################
    ''' ###################  multiproc?  ################### '''
    #########################################
    if model_type == 'LGBM':
        print('Starting training...')
        time_model_fit.restart_time()
        model_lgbm = lgb.LGBMRegressor(**gbdt_setup_params)
        model_lgbm.fit(
            DataSets['train_X'],                                # Input feature matrix or df 'train_X' (array-like or sparse of shape = [n_samples, n_features])
            DataSets['train_y'],                                # The target values 'train_y' (real numbers in regression) (array-like of shape = [n_samples])
            eval_set=[(DataSets['val_X'], DataSets['val_y'])],  # list [(val_X, val_y)] can hold multiple tuples of validation data; pass by keyword,
                                                                #   because the 3rd and 4th positional args of fit() are sample_weight and init_score
            eval_names=None,                                    # Names of eval_set (list of strings or None, optional (default=None))
            **gbdt_fit_params)
        OUTPUTS_df.at[RUN_n,"time_model_fit"]           = time_model_fit.get_elapsed_time()  # HH:MM:SS
        OUTPUTS_df.at[RUN_n,"best_iteration_"]          = model_lgbm.best_iteration_
        OUTPUTS_df.at[RUN_n,"best_score_"]              = model_lgbm.best_score_['valid_0']['rmse']
        OUTPUTS_df.at[RUN_n,"feature_name_"]            = model_lgbm.feature_name_    # The names of features, in an array of shape [n_features]
        OUTPUTS_df.at[RUN_n,"feature_importances_"]     = model_lgbm.feature_importances_
        OUTPUTS_df.at[RUN_n,"model_params"]             = model_lgbm.get_params()
        print(f'Done fitting; Model LGBM fit time: {OUTPUTS_df.at[RUN_n, "time_model_fit"]}')

    ########################################################################################################
    ''' ###################  multiproc?  ################### '''
    #########################################
    print("Starting predictions...")
    time_model_predict.restart_time()
    ''' ###################  multiproc?  ################### '''
    y_pred_train =  model_lgbm.predict( DataSets['train_X'], num_iteration=model_lgbm.best_iteration_ )
    y_pred_val =    model_lgbm.predict( DataSets['val_X'],   num_iteration=model_lgbm.best_iteration_ )
    y_pred_test =   model_lgbm.predict( DataSets['test_X'],  num_iteration=model_lgbm.best_iteration_ )
    y_train =       DataSets['train_y'].to_numpy()
    y_val =         DataSets['val_y'].to_numpy()
    # always do minmax scaling after robust scaling; and do inverse scaling with minmax first, then robust (like here)
    if any(minmax_scalers):
        ''' ###################  multiproc?  ################### '''
        y_pred_train =  unscale(minmax_scalers['shop_id_x_item_id_sales_sum'],  y_pred_train)
        y_pred_val =    unscale(minmax_scalers['shop_id_x_item_id_sales_sum'],  y_pred_val)
        y_pred_test =   unscale(minmax_scalers['shop_id_x_item_id_sales_sum'],  y_pred_test)
        y_train =       unscale(minmax_scalers['shop_id_x_item_id_sales_sum'],  y_train)
        y_val =         unscale(minmax_scalers['shop_id_x_item_id_sales_sum'],  y_val)
    if any(robust_scalers):
        ''' ###################  multiproc?  ################### '''
        y_pred_train =  unscale(robust_scalers['shop_id_x_item_id_sales_sum'],  y_pred_train)
        y_pred_val =    unscale(robust_scalers['shop_id_x_item_id_sales_sum'],  y_pred_val)
        y_pred_test =   unscale(robust_scalers['shop_id_x_item_id_sales_sum'],  y_pred_test)
        y_train =       unscale(robust_scalers['shop_id_x_item_id_sales_sum'],  y_train)
        y_val =         unscale(robust_scalers['shop_id_x_item_id_sales_sum'],  y_val)
    ''' ###################  multiproc?  ################### '''
    y_pred_train =  y_pred_train.clip(clip_predict_L,clip_predict_H)
    y_pred_val =    y_pred_val.clip(clip_predict_L,clip_predict_H)
    y_pred_test =   y_pred_test.clip(clip_predict_L,clip_predict_H)
    
    ''' ###################  multiproc?  ################### '''
    OUTPUTS_df.at[RUN_n, "time_model_predict"]  = time_model_predict.get_elapsed_time()  # HH:MM:SS
    OUTPUTS_df.at[RUN_n, 'tr_R2']               = sklearn.metrics.r2_score(y_train, y_pred_train)   
    OUTPUTS_df.at[RUN_n, 'val_R2']              = sklearn.metrics.r2_score(y_val, y_pred_val)
    OUTPUTS_df.at[RUN_n, 'tr_rmse']             = np.sqrt(sklearn.metrics.mean_squared_error(y_train, y_pred_train)) 
    OUTPUTS_df.at[RUN_n, 'val_rmse']            = np.sqrt(sklearn.metrics.mean_squared_error(y_val, y_pred_val))
    
    print(f'Model LGBM fit time: {OUTPUTS_df.at[RUN_n, "time_model_fit"]}')
    print(f'Transform and Predict train/val/test time: {OUTPUTS_df.at[RUN_n, "time_model_predict"]}')
    print(f'R^2 train  = {OUTPUTS_df.at[RUN_n,"tr_R2"]:.4f}    R^2 val  = {OUTPUTS_df.at[RUN_n,"val_R2"]:.4f}')
    print(f'RMSE train = {OUTPUTS_df.at[RUN_n,"tr_rmse"]:.4f}    RMSE val = {OUTPUTS_df.at[RUN_n,"val_rmse"]:.4f}\n')

    a = sklearn.metrics.r2_score(y_train, y_pred_train)   
    b = sklearn.metrics.r2_score(y_val, y_pred_val)
    c = np.sqrt(sklearn.metrics.mean_squared_error(y_train, y_pred_train)) 
    d = np.sqrt(sklearn.metrics.mean_squared_error(y_val, y_pred_val))
    print(f'a R^2 train  = {a:.4f}    b R^2 val  = {b:.4f}')
    print(f'c RMSE train = {c:.4f}    d RMSE val = {d:.4f}\n')

    # re-format feature importances? dict with key=featurename val=importance?
    # save model?

    # Merge the test predictions with IDs from the original test dataset, and keep only columns "ID" and "item_cnt_month"
    y_submission = pd.DataFrame.from_dict({'item_cnt_month':y_pred_test,'shop_id':DataSets['test_X'].shop_id,'item_id':DataSets['test_X'].item_id})
    y_submission = test.merge(y_submission, on=['shop_id','item_id'], how= 'left').reset_index(drop=True).drop(['shop_id','item_id'],axis=1)
    # save prediction for every one of the run iterations; can ensemble them later if desired
    %cd "{GDRIVE_REPO_PATH}"
    y_submission.to_csv("./models_and_predictions/" + filename_submission, index=False)
    MEMORY_STATS.append(get_memory_stats('End of Modeling and Predictions',printout=True))
    # print(f'Modeling and Predictions Done: {strftime("%a %X %x")}')

    return
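
The prediction step above undoes the scaling in the reverse of the order in which it was applied: min-max first, then robust. The sketch below illustrates that ordering with hand-rolled NumPy formulas standing in for scikit-learn's `RobustScaler(with_centering=False)` and `MinMaxScaler`. All names and numbers are illustrative, and clipping the true values to [0, 20] is an assumption borrowed from the competition's evaluation rule rather than from the function above.

```python
import numpy as np

y = np.array([0.0, 1.0, 2.0, 3.0, 50.0])  # toy monthly sales with one outlier

# Forward step 1: robust-style scaling (divide by the interquartile range, no centering)
q25, q75 = np.percentile(y, [25, 75])
iqr = q75 - q25
y_robust = y / iqr

# Forward step 2: min-max scaling into a target range
lo, hi = 0.0, 1000.0
y_min, y_max = y_robust.min(), y_robust.max()
y_scaled = (y_robust - y_min) / (y_max - y_min) * (hi - lo) + lo

def unscale_minmax(v):
    """Invert the min-max step."""
    return (v - lo) / (hi - lo) * (y_max - y_min) + y_min

def unscale_robust(v):
    """Invert the robust step."""
    return v * iqr

# Inverse: undo min-max first, then robust (reverse of the forward order)
y_roundtrip = unscale_robust(unscale_minmax(y_scaled))

# Stand-in "predictions" in scaled units, then unscale, clip, and score
y_pred = unscale_robust(unscale_minmax(y_scaled + 5.0)).clip(0, 20)
y_true = y.clip(0, 20)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(y_roundtrip, rmse)
```

Applying the inverse transforms in the wrong order would not recover the original units, so metrics such as RMSE would no longer be comparable with the leaderboard.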

##**Main Control Loop**

In [None]:
if __name__ == '__main__':  # possibly needed for multiprocessing to work smoothly??
    # Main Control Loop
    # model_cols          = ['model_filename_base','model_type','feature_params','data_sources']
    # eda_cols            = ['eda_delete_shops','eda_delete_item_cats','eda_scale_month','feather_stt']
    # data_cols           = ['cartprod_fillna0','cartprod_first_month','cartprod_test_pairs','clip_train_H','clip_train_L','feather_monthly_stt',
    #                         'feature_data_type','minmax_scaler_range','robust_scaler_quantiles','use_cartprod_fill','use_categorical',
    #                         'use_minmax_scaler','use_robust_scaler']
    # tvt_split_cols            = ['feather_tvt_split','test_month','train_start_month','train_final_month','validate_months']
    # lgbm_setup_cols     = ['boosting_type','metric','learning_rate','n_estimators','colsample_bytree','random_state','subsample_for_bin','num_leaves',
    #                         'max_depth','min_split_gain','min_child_weight','min_child_samples','silent','importance_type','reg_alpha','reg_lambda',
    #                         'n_jobs','subsample','subsample_freq','objective']
    # lgbm_fit_cols       = ['eval_metric','early_stopping_rounds','init_score','eval_init_score','verbose','feature_name','categorical_feature','callbacks']
    # processing_cols     = ['clip_predict_H','clip_predict_L']
    # OUTPUT_cols         = ['model_filename','best_iteration_','best_score_','feature_importances_','feature_name_','model_params','time_cumulative',
    #                         'time_data_manip','time_dataset_splits','time_eda','time_full_iteration','time_model_fit','time_model_predict',
    #                         'tr_R2','tr_rmse','val_R2','val_rmse','runtime_type','n_cpus','ram_gb','package_versions']

    # "exploded DataFrames" --> OrderedDicts for straightforward looping/enumeration of splits
    model_params_dict = SPLIT_PARAMS_module_dfs['MODEL'].to_dict('index', into=OrderedDict)
    eda_params_dict = SPLIT_PARAMS_module_dfs['EDA'].to_dict('index', into=OrderedDict)
    data_params_dict = SPLIT_PARAMS_module_dfs['DATA'].to_dict('index', into=OrderedDict)
    tvt_split_params_dict = SPLIT_PARAMS_module_dfs['TVT_SPLIT'].to_dict('index', into=OrderedDict)
    lgbm_setup_params_dict = SPLIT_PARAMS_module_dfs['LGBM_SETUP'].to_dict('index', into=OrderedDict)
    lgbm_fit_params_dict = SPLIT_PARAMS_module_dfs['LGBM_FIT'].to_dict('index', into=OrderedDict)
    processing_params_dict = SPLIT_PARAMS_module_dfs['PROCESSING'].to_dict('index', into=OrderedDict)

    RUN_n = 0
    time_full_iteration.restart_time()
    time_cumulative.restart_time()
    for model_iter_num, model_iter_params in model_params_dict.items(): #########################################
        if not model_iter_params['model_filename_base']:
            model_iter_params['model_filename_base'] = input("Enter the Base Model Filename Substring for Output (like: 'v4mg_01' )")
        model_done = (model_iter_num+1 == len(model_params_dict)) # True if we have finished all splits in model_params_dict

        for eda_iter_num, eda_iter_params in eda_params_dict.items():   #########################################
            # load datafiles, adjust stt for monthly stats grouping, output = stt, items_enc, shops_enc, test 
            eda_iter_params.update(model_iter_params)  # add to eda dict because feature_params, data files, names are needed for eda module
            ''' ###################  multiproc??  ################### '''
            print(eda_iter_params.keys())
            #########################################
            load_dfs, dfidx = eda_cleanup(**eda_iter_params)    # load_dfs dict holds stt, shops_enc, items_enc, test dfs (or filenames if feathered)
            #########################################
            test = load_dfs[dfidx['test']]
            if isinstance(test, str):  # a string means 'test' was feathered to disk; load it back
                %cd "{OUT_OF_REPO_PATH}"
                test = pd.read_feather(test, columns=None, use_threads=True)    # save for later to configure Coursera submission
            eda_done = (model_done and (eda_iter_num+1 == len(eda_params_dict)))    # True if finished all splits in model_params_dict and eda_params_dict

            # print(f'test:\n{test.head()}')
            # print(f'dfidx: {dfidx}\n')
            # print(f'load_dfs: {load_dfs}\n')
            # for i in ['stt','shops_enc','items_enc']:
            #     print(f'{i}:\n{load_dfs[dfidx[i]]}\n')


            for data_iter_num, data_iter_params in data_params_dict.items():  #########################################
                # compute grouped-by-month agg stats features, add cartesian product rows, add lagged feature columns
                data_iter_params.update(eda_iter_params)  # add to data dict because feature_params, feather_stt needed for data conditioning module
                ''' ###################  multiproc??  ################### '''


                # print(f'dataiterparams= {data_iter_params}')
                # print(f'dataiterparamskeys = {list(data_iter_params.keys())}')
                # for k,v in data_iter_params.items():
                #     print(k,v)


                #########################################
                monthly_stt, robust_scalers, minmax_scalers = data_conditioning(
                                    load_dfs[dfidx['stt']],load_dfs[dfidx['shops_enc']],load_dfs[dfidx['items_enc']], **data_iter_params)
                #########################################
                data_cond_done = (eda_done and (data_iter_num+1 == len(data_params_dict)))
                if data_cond_done: 
                    try:  # (no future splits need these dfs)
                        del [load_dfs[dfidx['stt']],load_dfs[dfidx['shops_enc']],load_dfs[dfidx['items_enc']],load_dfs[dfidx['date_scaling']]]
                    except:
                        print("Couldn't delete load_dfs: stt, shops, items")
                        
                for tvt_iter_num, tvt_split_iter_params in tvt_split_params_dict.items():  #########################################
                    # split data into train/val/test and assign desired datatypes where needed
                    tvt_split_iter_params['feather_monthly_stt'] = data_iter_params['feather_monthly_stt']
                    tvt_split_iter_params['feature_data_type'] = data_iter_params['feature_data_type']
                    tvt_split_iter_params['use_categorical'] = data_iter_params['use_categorical']
                    tvt_split_iter_params['categorical_features'] = model_iter_params['feature_params']['categorical']
                    tvt_split_iter_params['model_type'] = model_iter_params['model_type']
                    tvt_xy_datasets = {'train_X':monthly_stt}
                    ''' ###################  multiproc??  ################### '''
                    #########################################
                    DataSets, feature_names = tvt_split_function(tvt_xy_datasets, **tvt_split_iter_params)
                    #########################################

                    #OUTPUTS_df.at[RUN_n, 'feature_name_'] = feature_names #??? get this from lgbm model fit routine?

                    for lgbm_setup_iter_num, lgbm_setup_iter_params in lgbm_setup_params_dict.items():  #########################################
                        for lgbm_fit_iter_num, lgbm_fit_iter_params in lgbm_fit_params_dict.items():    #########################################

                            for processing_iter_num, processing_iter_params in processing_params_dict.items():  ############################
                                processing_iter_params['feather_tvt_split'] = tvt_split_iter_params['feather_tvt_split']
                                processing_iter_params['test'] = test
                                processing_iter_params['model_type'] = model_iter_params['model_type']
                                #processing_iter_params['feature_name_'] = feature_names
                                processing_iter_params['robust_scalers'] = robust_scalers # if any(robust_scalers): then do inverse scaler xform
                                processing_iter_params['minmax_scalers'] = minmax_scalers # if any(minmax_scalers): then do inverse scaler xform
                                base_filename = f'{model_iter_params["model_type"]}_{model_iter_params["model_filename_base"]}'
                                OUTPUTS_df.at[RUN_n,"model_filename"] = f'{base_filename}_{RUN_n:02d}'
                                filename_results =    f'{base_filename}_results.csv'
                                filename_submission = f'{OUTPUTS_df.at[RUN_n,"model_filename"]}_submission.csv'
                            
                                ''' ###################  multiproc??  ################### '''
                                #########################################
                                gbdt_model(filename_submission, DataSets, lgbm_setup_iter_params, lgbm_fit_iter_params, **processing_iter_params)
                                #########################################

                                OUTPUTS_df.at[RUN_n,'time_full_iteration'] = time_full_iteration.get_elapsed_time()
                                OUTPUTS_df.at[RUN_n,'time_cumulative'] = time_cumulative.get_elapsed_time()
                                %cd "{GDRIVE_REPO_PATH}"
                                OUTPUTS_df.to_csv("./models_and_predictions/" + filename_results, index=False) # save intermediate/ final params+outputs 

                                RUN_n += 1
                                if RUN_n < ALL_exploded_shape[0]:
                                    time_full_iteration.restart_time()



dict_keys(['eda_delete_shops', 'eda_delete_item_cats', 'eda_scale_month', 'feather_stt', 'model_filename_base', 'model_type', 'feature_params', 'data_sources'])
Start EDA Module: Wed 04:46:51 09/09/20
/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
Loading Files from Google Drive repo into Colab...

---------- items_enc ----------
DataFrame shape: (22170, 4)     DataFrame total memory usage: 1 MB
DataFrame Column Names: ['item_id', 'item_category_id', 'item_group', 'item_cluster']
^^^^^
   item_id  item_category_id  item_group  item_cluster
0        0                40           6           100
1        1                76           6           105


---------- shops_enc ----------
DataFrame shape: (60, 2)     DataFrame total memory usage: 0 MB
DataFrame Column Names: ['shop_id', 'shop_group']
^^^^^
   shop_id  shop_group
0        0           7
1        1           7


---------- date_scaling ----------
DataFrame shape: (35, 2)     DataFrame total memory usag

TypeError: ignored
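
The main control loop above visits every combination of parameter splits through nested `for` loops. For the innermost levels, where nothing expensive is computed between loops, the same enumeration could be sketched with `itertools.product`. The parameter dictionaries below are hypothetical stand-ins for entries of `SPLIT_PARAMS_module_dfs`, not values from the notebook.

```python
from itertools import product

# Hypothetical parameter splits; each module maps an iteration number to one parameter dict
tvt_split_params = {0: {"train_final_month": 29}, 1: {"train_final_month": 31}}
lgbm_setup_params = {0: {"learning_rate": 0.05}, 1: {"learning_rate": 0.1}, 2: {"learning_rate": 0.2}}

runs = []
combos = product(tvt_split_params.values(), lgbm_setup_params.values())
for run_n, (tvt, setup) in enumerate(combos):
    # one flat record per run, analogous to one row of OUTPUTS_df
    runs.append({"run": run_n, **tvt, **setup})

for run in runs:
    print(run)
```

The nested form remains preferable for the outer levels, because it lets an expensive result (loaded files, the conditioned monthly DataFrame) be reused across all of the inner iterations.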

In [None]:
# domino tiles:  https://www.fileformat.info/info/unicode/block/domino_tiles/utf8test.htm
print('\u2227'*5,'\u2228'*5,'\u2303'*5,'\u2304'*5,'^^^','\u02c5','\u02c4','\u02c6'*5,'\u02ec'*5,'\u22c0'*5,'\u22c1'*5,'\u2306'*5,'\u2305'*5,'\u23f7'*5)
print('\u25b2'*5,'\u25bc'*5,'\u25c6'*5,'\u25d2'*5,'\u25d3'*5,'\u25b4'*5,'\u25be'*5,'\u2635'*5,'\u269c'*5,'\u26d6'*5,'\u2b81'*5,'\u2b7f'*5,'\u26dd'*5)
print('\u26ba'*5,'\u26bb'*5,'\u2622'*5,'\u262f'*5,'\u2934'*5,'\u2935'*5,'\u2b71'*5,'\u2b73'*5,'\u2b9d'*5,'\u2b9f'*5,'\u2bc5'*5,'\u2bc6'*5)
print('\u2ba4'*5,'\u2ba5'*5,'\u2182'*5,'\u2180'*5,'\u2b12'*5,'\u2b13'*5,'\u2b18'*5,'\u2b19'*5,'\u273d'*5,'\u2720'*5,'\u1cf2'*5,'\u1cf6'*5,'\u205a'*5,'\u2021'*5)
print('\u224b'*5,'\u224d'*5,'\u2259'*5,'\u225a'*5,'\u2263'*5,'\u2251'*5,'\u2253'*5,'\u22ce'*5,'\u22cf'*5,'\u2339'*5,'\u2797'*5,'\u27d7'*5,'\u267b'*5)
print('\u29d6'*5,'\u29d7'*5,'\u2bc1'*5,'\u2b27'*5,'\u2a77'*5,'\u2a8b'*5,'\u2ad8'*5,'\u2e44'*5,'\u2e0e'*5,'\u2e1e'*5,'\u2e1f'*5,'\u3013'*5)
print('\U0001f503'*5,'\U0001f501'*5,'\U0001f3ac','\U0001f5aa','\U0001f5ab','\U0001f536'*5,'\U0001f06d'*5,'\U0001f6d1'*5)

##**Stop Execution of code below by invoking error**

In [None]:
# Dummy cell to stop the execution so we don't run any of the random code below (if we select "Run All", e.g.)
stop_running_code_at_this_cell = yes


In [None]:
a1 = 'a'
l1 = ['m','n','o']
[[a1,x] for x in l1]
MEMORY_STATS = MEMORY_STATS[:5]

##**To Do List:**

###**Loop over things to compute statistics, column scaling, adding cartesian product rows, and adding lagged features**
* multiprocess --> pool(merge,[months list]) do months in parallel? (maybe split monthly_stt by months, then do the merges in parallel, then concatenate all the months back together)
* multiprocess --> pool(merge,[lags list]) do lags in parallel?; check for proper column order / reset if necessary  (can I just add the shifted columns, and delete any N/A things where I don't have a shop-item match?, or make a big df with all the lags and then just one single merge (how="left") )
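
The "one single merge" idea in the last item could look roughly like the sketch below: shift each lag's copy of the features, align all lags on the (month, shop, item) index, and merge once with `how="left"`. This is only an illustration on toy data; the column name `sales` and the `sales_L{lag}` naming are hypothetical, not the notebook's feature names.

```python
import pandas as pd

keys = ["month", "shop_id", "item_id"]
monthly = pd.DataFrame({
    "month":   [0, 1, 2, 0, 1, 2],
    "shop_id": [1, 1, 1, 2, 2, 2],
    "item_id": [10, 10, 10, 10, 10, 10],
    "sales":   [3, 4, 5, 7, 8, 9],
})

lag_frames = []
for lag in (1, 2):
    lagged = monthly[keys + ["sales"]].copy()
    lagged["month"] = lagged["month"] + lag  # month m's value becomes a feature for month m + lag
    lagged = lagged.rename(columns={"sales": f"sales_L{lag}"}).set_index(keys)
    lag_frames.append(lagged)

# Outer-align all lags on the (month, shop_id, item_id) index, then do a single left merge
all_lags = pd.concat(lag_frames, axis=1).rename_axis(keys)
lagged_monthly = monthly.merge(all_lags.reset_index(), on=keys, how="left").fillna(0)
print(lagged_monthly)
```

The left merge drops lagged rows that have no matching shop-item pair in the target month, which is the behavior the time-lag section asks for.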

###**Loop over model fitting parameter splits**
* multiprocess.Pool (if not too overwhelming, can do several (or all) model fitting iterations in parallel)
* (?replicate "test" with simple code?) so we don't need to load and carry "test" dataframe in memory throughout.  Or, load from disk when loading ftr files in prediction module.
* del all dataframes containing data, after all loops over model params are done

###**Additional routines**
* Plot feature importance ... each iteration, and ensemble averages (e.g., if rmse is < xxx).  Make df of feature importances (names=columns) vs. run iteration number (rows) and compute mean, stddev, range, quantiles
* Compute ensemble averages: straight average, weighted by rmse, etc.     
```
# Simple ensemble averaging
ensemble_y_pred_test = []                    # initialize once, before looping over runs
ensemble_y_pred_test.append(y_pred_test)     # append inside each run iteration
y_test_pred_avg = np.mean(ensemble_y_pred_test, axis=0)
# also compute feature importances averaged over the ensemble
```
* Look at other locations for multiprocessing
* Look at other locations for timing blocks (and maybe save in MEMORY_STATS, as in have a MEMORY_STATS append at the end of every timed block)
* Look at other locations for MEMORY_STATS
<br>
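
The straight and RMSE-weighted averages mentioned in the list above could look like the sketch below. Inverse-RMSE weighting is only one possible choice and is an assumption here, as are the helper name `ensemble_average` and the toy numbers. The clip to [0, 20] mirrors the competition's clipping of the target values.

```python
import numpy as np

def ensemble_average(preds, val_rmse=None, clip=(0, 20)):
    """Average predictions across models; weight by 1/RMSE when validation RMSEs are given."""
    preds = np.asarray(preds, dtype=float)                    # shape (n_models, n_rows)
    weights = None if val_rmse is None else 1.0 / np.asarray(val_rmse, dtype=float)
    avg = np.average(preds, axis=0, weights=weights)          # np.average normalizes the weights
    return np.clip(avg, *clip)

# Hypothetical predictions from two model runs and their validation RMSEs
preds = [[0.0, 2.0, 30.0],
         [1.0, 4.0, 20.0]]
straight = ensemble_average(preds)
weighted = ensemble_average(preds, val_rmse=[1.0, 0.5])       # second run gets twice the weight
print(straight, weighted)
```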

* Categorical features with LGBM: double-check it is working?
```
categorical_feature, default = "", type = multi-int or string, aliases: cat_feature, categorical_column, cat_column
        used to specify categorical features
        use number for index, e.g. categorical_feature=0,1,2 means columns 0, 1 and 2 are categorical features
        add a prefix name: for column name, e.g. categorical_feature=name:c1,c2,c3 means c1, c2 and c3
        Index starts from 0 and it doesn’t count the label column when passing type is int
        All values should be less than Int32.MaxValue (2147483647)
        Using large values could be memory consuming. Tree decision rule works best when categorical features are presented by consecutive integers starting from zero
        All negative values will be treated as missing values
```
Scikit-learn API: if categorical_feature is 'auto' and the data is a pandas DataFrame, pandas unordered categorical columns are used. *Note (for which API?): only the categorical int type is supported (**not applicable** for data represented as a **pandas DataFrame** in the Python package).* Double-check that we are using a working API.
<br>
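
One way to check the pandas side of this is to cast the ID-like columns to the `category` dtype and confirm that they show up as unordered categoricals, which is what the 'auto' setting quoted above is said to pick up. The sketch below covers only the pandas step and makes no LightGBM call; the toy frame and the choice of columns are assumptions.

```python
import pandas as pd

# Hypothetical feature frame with two ID-like columns and one numeric column
X = pd.DataFrame({
    "shop_id":          [2, 5, 2, 59],
    "item_category_id": [40, 76, 40, 3],
    "price":            [99.0, 2599.0, 149.0, 10.0],
})
cat_cols = ["shop_id", "item_category_id"]
X = X.astype({c: "category" for c in cat_cols})

# Columns that are now pandas categoricals, and whether any of them is ordered
categorical_columns = X.select_dtypes(include="category").columns.tolist()
any_ordered = any(X[c].cat.ordered for c in categorical_columns)

# The integer codes are consecutive and start at zero
shop_codes = X["shop_id"].cat.codes.tolist()
print(categorical_columns, any_ordered, shop_codes)
```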

* Is special code needed for **GPU**-enabled LGBM modeling?
<br>

* Possible setup for continued training, especially if we find that runtimes are cut off by Colab (will probably need to save the LightGBM model to disk at each step). Relevant parameter: init_model (string, Booster, LGBMModel or None, optional (default=None)):
filename of LightGBM model, Booster instance or LGBMModel instance used for **continued training**




#**Potentially Useful Code Snippets**

###**Ensembling and Trend/Feature Importance**

In [None]:
# ENSEMBLING and FEATURE IMPORTANCE / TRENDS ############################################

# ensembling_fn = True # ensembling_fn(output_file_names):
#     # can pull the submission files off the disk using OUTPUTS_df[model filename], after optionally setting a threshold for inclusion in the
#     #    ensemble, such as OUTPUTS_df[val_rmse] must be in the lowest quantile or something similar
#     # average, weighted-average, other method, to combine anything already saved to disk (default = straight avg of all runs in above loop)

# compute_trends_fn = True # compute_trends_fn(output_results):
#     # create a df of features & feature importances for each run (or "explode" sideways the OUTPUTS_df features & importances)... use df.describe() to
#     #    compute quantiles, mean, stddev for each of the features, and determine if anything looks interesting
#     # look at feature importances all together, and see if anything obvious good or bad
#     '''
#     Might want to look at the predict_contrib parameter for LGBM:  https://lightgbm.readthedocs.io/en/latest/Parameters.html
#     '''
#     # look at splits and see if any parameters obviously good or bad (correlation matrix of parameters with output results?)
#     #feat_imp = pd.DataFrame.from_dict(OUTPUTS_df["feature_importances_"])
#     #OUTPUTS_df.at[RUN_n,"feature_name_"]
#     # make empty df with columns from list at = outputs.at[0,featname]
#     # append rows with elements = list elements in feature_importances_ for each feature name
#     # df now has as many rows as N_TRAIN_iterations
#     # compute df.describe() stats or quantiles/mean/std and store somewhere; make some plots; make some automated recommendations?
nocode=True
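
The per-run feature importance table described in the notes above (features as columns, runs as rows, then summary statistics) might be built as in this sketch. The feature names and importance values are made up for illustration; in the notebook they would come from the stored `feature_name_` and `feature_importances_` entries of each run.

```python
import pandas as pd

# Hypothetical importances from three training runs that share the same feature list
feature_names = ["sales_L1", "sales_L2", "price_L1"]
importances_per_run = [
    [120, 40, 15],
    [100, 55, 5],
    [140, 35, 10],
]

fi_runs = pd.DataFrame(importances_per_run, columns=feature_names)
fi_runs.index.name = "run"

# describe() gives count, mean, std, min, quartiles and max per feature
fi_stats = fi_runs.describe().T
fi_stats["range"] = fi_stats["max"] - fi_stats["min"]
fi_stats = fi_stats.sort_values("mean", ascending=False)
print(fi_stats[["mean", "std", "min", "50%", "max", "range"]])
```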

###**Averaging Several Stored Predictions/Submissions from Disk**

In [None]:
# average several submission files to get ensemble average
%cd "{GDRIVE_REPO_PATH}"
# source_dir = Path('models_and_predictions/bagging_LGBM')
# prediction_files = source_dir.iterdir()
source_dir = 'models_and_predictions/bagging_LGBM'
prediction_files = os.listdir(source_dir)
print("Loading Files from Google Drive repo into Colab...\n")

# filename to save ensemble average predictions for submission
ensemble_name = 'LGBMv6v7_bag06'

print(f'filename {ensemble_name}')
# Loop to load the data files into a dict of pandas DataFrames keyed by short name, and save the predictions in a list for easy calc of ensemble average
pred_dfs = {}
preds = []
for f_name in prediction_files:
    filename = os.path.basename(f_name)
    data_frame_name = filename.split(".")[0][:-11]   # drop the extension and the last 11 characters of the stem (presumably "_submission")
    path_name = os.path.join(source_dir, filename)
    pred_dfs[data_frame_name] = pd.read_csv(path_name)
    print(f'Data Frame: {data_frame_name}; n_rows = {len(pred_dfs[data_frame_name])}, n_cols = {len(pred_dfs[data_frame_name].columns)}')
    preds.append(pred_dfs[data_frame_name].item_cnt_month.to_numpy())

# Simple ensemble averaging
pred_ens_avg = np.mean(preds, axis=0)
ensemble_submission = pred_dfs['LGBMv6mg_17_'].copy(deep=True)   # one of the loaded submissions, used as the template for the ID column
ensemble_submission.item_cnt_month = pred_ens_avg

ensemble_submission.to_csv("./models_and_predictions/" + ensemble_name + '_submission.csv', index=False)

display(ensemble_submission.head(8))
print(f'filename {ensemble_name} saved: {strftime("%a %X %x")}')
print('Coursera:  ')

###**Feature Importances**

In [None]:
# Plot feature importance - Results Visualization
itercount=0
if ITERS.at[itercount,'_model_type'] == 'LGBM':
    print_threshold = 25
    feature_importances_ = ITERS.at[itercount,'feature_importances_']
    feature_name_        = ITERS.at[itercount,'feature_name_']
    fi = pd.DataFrame(zip(feature_name_,feature_importances_),columns=['feature','value'])
    fi = fi.sort_values('value',ascending=False,ignore_index=True)
    fi['norm_value'] = round(100*fi.value / fi.value.max(),2)
    fi['lag'] = fi.feature.apply(lambda x: x.split('_L')[-1] if '_L' in x else '0')  # split on '_L' (as for feature_base); keep lag as str so sorting never mixes str and int
    fi['feature_base'] = fi.feature.apply(lambda x: x.split('_L')[0])
    print(fi.iloc[list(range(0,8))+list(range(-7,0)),:]) #[[1,3,5,7,-7,-5]][:])
    # model_filename_fi = ITERS.at[itercount,'_model_type']+ITERS.at[itercount,'_model_filename'] + "_feature_importance.csv"
    # fi.to_csv("./models_and_predictions/" + model_filename_fi, index=False)
    # printout to assist with removing low-importance features for following runs
    if fi.norm_value.min() < print_threshold:
        fi_low = fi[fi.norm_value < print_threshold]
        fi_low = fi_low.sort_values(['lag','norm_value'])
        fi_low.norm_value = fi_low.norm_value.apply(lambda x: f'{round(x):d}')
        fi_low['lag_feature_importance'] = fi_low.apply(lambda x: f"{f'L{x.lag} fi{x.norm_value}':{len(x.feature_base)}s}",axis=1)
        print(fi_low.lag_feature_importance.to_list())
        print(fi_low.feature_base.to_list())
    # make importances relative to max importance
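    feature_importances_ = np.asarray(feature_importances_, dtype=float)  # accept a list or an array
    feature_name_ = np.asarray(feature_name_)  # a plain list cannot be indexed with an index array
    feature_importances_ = 100.0 * (feature_importances_ / feature_importances_.max())
    sorted_idx = np.arange(feature_importances_.shape[0])  # original feature order (not sorted by importance)
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.figure(figsize=(24,12))
    plt.bar(pos, feature_importances_[sorted_idx], align='center')
    plt.xticks(pos, feature_name_[sorted_idx])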
    plt.ylabel('Relative Importance')
    plt.title('Variable Importance')
    plt.tick_params(axis='x', which='major', labelsize = 13, labelrotation=90)
    plt.grid(True,which='major',axis='y')
    plt.tick_params(axis='y', which='major', grid_color='black',grid_alpha=0.7)
    # plt.savefig('LGBM_feature_importance_v1.4_mg.png')
    plt.show()

###**Using GPUs with LGBM**

In [None]:
# GPU use with LGBM modeling:
'''
May want to see if we can better inform LGBM routine when we are using a GPU
https://lightgbm.readthedocs.io/en/latest/GPU-Targets.html#query-opencl-devices-in-your-system
Your system might have multiple GPUs from different vendors (“platforms”) installed. Setting up LightGBM GPU device requires two parameters: 
OpenCL Platform ID (gpu_platform_id) and OpenCL Device ID (gpu_device_id). Generally speaking, each vendor provides an OpenCL platform, 
and devices from the same vendor have different device IDs under that platform. For example, if your system has an Intel integrated GPU and 
two discrete GPUs from AMD, you will have two OpenCL platforms (with gpu_platform_id=0 and gpu_platform_id=1). If the platform 0 is Intel, 
it has one device (gpu_device_id=0) representing the Intel GPU; if the platform 1 is AMD, it has two devices (gpu_device_id=0, gpu_device_id=1) 
representing the two AMD GPUs. If you have a discrete GPU by AMD/NVIDIA and an integrated GPU by Intel, make sure to select the correct gpu_platform_id 
to use the discrete GPU as it usually provides better performance.

On Windows, OpenCL devices can be queried using GPUCapsViewer, under the OpenCL tab. http://www.ozone3d.net/gpu_caps_viewer/ 
Note that the platform and device IDs reported by this utility start from 1, so you should subtract 1 from the reported IDs.

On Linux, OpenCL devices can be listed using the clinfo command. On Ubuntu, you can install clinfo by executing sudo apt-get install clinfo.

Make sure you list the OpenCL devices in your system and set gpu_platform_id and gpu_device_id correctly. 
In the following examples, our system has 1 GPU platform (gpu_platform_id = 0) from AMD APP SDK. 
The first device gpu_device_id = 0 is a GPU device (AMD Oland), and the second device gpu_device_id = 1 is the x86 CPU backend.

R Example of using GPU (gpu_platform_id = 0 and gpu_device_id = 0 in our system):
> params <- list(objective = "regression",
+                metric = "rmse",
+                device = "gpu",
+                gpu_platform_id = 0,
+                gpu_device_id = 0,
+                nthread = 1,
+                boost_from_average = FALSE,
+                num_tree_per_iteration = 10,
+                max_bin = 32)
> model <- lgb.train(params,
+                    dtrain,
+                    2,
+                    valids,
+                    min_data = 1,
+                    learning_rate = 1,
+                    early_stopping_rounds = 10)
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 232
[LightGBM] [Info] Number of data: 6513, number of used features: 116
[LightGBM] [Info] Using GPU Device: Oland, Vendor: Advanced Micro Devices, Inc.
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 40 dense feature groups (0.12 MB) transferred to GPU in 0.004211 secs. 76 sparse feature groups.
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=16 and max_depth=8
[1]:    test's rmse:1.10643e-17
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=7 and max_depth=5
[2]:    test's rmse:0

Running on OpenCL CPU backend devices is generally slow, and we observe crashes on some Windows and macOS systems. 
Make sure you check the Using GPU Device line in the log and that it is not using a CPU. The above log shows that we are using the Oland GPU from AMD and not the CPU.

Example of using CPU (gpu_platform_id = 0, gpu_device_id = 1). The GPU device reported is Intel(R) Core(TM) i7-4600U CPU, 
so it is using the CPU backend rather than a real GPU.

> params <- list(objective = "regression",
+                metric = "rmse",
+                device = "gpu",
+                gpu_platform_id = 0,
+                gpu_device_id = 1,
+                nthread = 1,
+                boost_from_average = FALSE,
+                num_tree_per_iteration = 10,
+                max_bin = 32)
> model <- lgb.train(params,
+                    dtrain,
+                    2,
+                    valids,
+                    min_data = 1,
+                    learning_rate = 1,
+                    early_stopping_rounds = 10)
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 232
[LightGBM] [Info] Number of data: 6513, number of used features: 116
[LightGBM] [Info] Using requested OpenCL platform 0 device 1
[LightGBM] [Info] Using GPU Device: Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 40 dense feature groups (0.12 MB) transferred to GPU in 0.004540 secs. 76 sparse feature groups.
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=16 and max_depth=8
[1]:    test's rmse:1.10643e-17
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=7 and max_depth=5
[2]:    test's rmse:0

Known issues:
Using a bad combination of gpu_platform_id and gpu_device_id can potentially lead to a crash due to OpenCL driver issues on some machines 
(you will lose your entire session content). Beware of it.
On some systems that have both an integrated graphics card (Intel HD Graphics) and a dedicated graphics card (AMD, NVIDIA), the dedicated graphics card may 
automatically override the integrated graphics card. The workaround is to disable your dedicated graphics card to use your integrated graphics card.
'''
nocode=True
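
For the Python package, the same settings as in the R example above would go into the parameter dictionary. This is a minimal sketch, assuming a LightGBM build compiled with GPU support and assuming that platform 0 / device 0 is the real GPU; both need to be verified on the actual machine (for example with clinfo). The values are illustrative only.

```python
# Parameter names follow the R example above; the values are assumptions for illustration
gpu_params = {
    "objective": "regression",
    "metric": "rmse",
    "device": "gpu",          # switch the tree learner to the GPU implementation
    "gpu_platform_id": 0,     # OpenCL platform ID (assumed)
    "gpu_device_id": 0,       # OpenCL device ID under that platform (assumed)
    "max_bin": 32,
}

# CPU fallback: same settings without the GPU-specific keys
cpu_params = {k: v for k, v in gpu_params.items() if not k.startswith("gpu_")}
cpu_params["device"] = "cpu"

print(gpu_params)
print(cpu_params)
```

The dictionary would be passed as the `params` argument when training, after which the "Using GPU Device" line in the log should be checked as described above.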

###**Multiprocessing to Reduce pandas Memory Usage**

In [None]:

# 16.6.2.9. Process Pools   https://docs.python.org/2/library/multiprocessing.html
# One can create a pool of processes which will carry out tasks submitted to it with the Pool class.
# class multiprocessing.Pool([processes[, initializer[, initargs[, maxtasksperchild]]]])
# processes is the number of worker processes to use. If processes is None then the number returned by cpu_count() is used. If initializer is not None then each worker process will call initializer(*initargs) when it starts.
# Note that the methods of the pool object should only be called by the process which created the pool.
# New in version 2.7: maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. 
# The default maxtasksperchild is None, which means worker processes will live as long as the pool.

# https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe/39101287#39101287
# There is one thing that always works, however, because it is done at the OS, not language, level.
# Suppose you have a function that creates an intermediate huge DataFrame, and returns a smaller result (which might also be a DataFrame):
# >     def huge_intermediate_calc(something):
# >         ...
# >         huge_df = pd.DataFrame(...)
# >         ...
# >         return some_aggregate
# Then if you do something like:
# >     import multiprocessing
# >     result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]
# Then the function is executed at a different process. When that process completes, the OS retakes all the resources it used. 
# It may also help to create the Pool with maxtasksperchild = 1.
# In an iPython environment (like jupyter notebook), you need to .close() and .join() or .terminate() the pool to get rid of the spawned process. 
# The easiest way of doing that since Python 3.3 is to use the context management protocol: 
# >     with multiprocessing.Pool(1) as pool: 
# >         result = pool.map(huge_intermediate_calc, [something])
# Another option is for the subprocess to write the dataframe to disk using something like parquet. 
#    It may be faster than moving a big pickled dataframe back to the parent. It will be in the disk cache so it is fast. 
# If you stick to numeric numpy arrays, those are freed, but boxed object types are not.
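# When modifying your dataframe, prefer inplace=True, so you don't create copies.
#    (Caveat: many pandas methods still build a copy internally even with inplace=True, so this is not guaranteed to save memory.)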

#  Use PIPE to concatenate functions in a multiprocess pool, keeping workers / dataframes all within the process so it kills them when the process terminates
# The Pipe() function returns a pair of connection objects connected by a pipe which by default is duplex (two-way). For example:
# from multiprocessing import Process, Pipe
# def f(conn):
#     conn.send([42, None, 'hello'])
#     conn.close()
# if __name__ == '__main__':
#     parent_conn, child_conn = Pipe()
#     p = Process(target=f, args=(child_conn,))
#     p.start()
#     print(parent_conn.recv())   # prints "[42, None, 'hello']"
#     p.join()
# parent,child = multiprocessing.Pipe()

# with multiprocessing.Pool(None,maxtasksperchild = 1) as pool:  # None = use all available processors, could use Pool(1, maxtasksperchild = 1)
#     result = pool.map(load_dfs, [args])[0]
# pool.close()  # pool.terminate()   (leaving the "with" block above already calls pool.terminate())
# pool.join()   

# gc.collect()
#print(parent.recv())
#child.close()
#print(f'Result = {result[:5]}') #.head(2))
#print(result.head(2))
#result = 1.0
#print(f'Result = {result}') #.head(2))
# del result
# gc.collect()
# result = pd.DataFrame()  # doesn't seem to help immediately, nor does "del df" or put df in list and "del list" or "del list[0]" or "del list[0] --> del list", all followed by gc.collect()
# pool.close(); pool.join();  mu,vu,vt,va,t = get_memory_stats("4) After checking active children: ")  # pool.terminate()
# gc.collect(); mu,vu,vt,va,t = get_memory_stats("5) After gc.collect(): ")
# %reset Out   # flush the output cache (no obvious improvement with IPynb)
# gc.collect(); mu,vu,vt,va,t = get_memory_stats("6) After Reset Cache and gc.collect(): ")

                # "readonly/final_project_data/test.csv.gz",
                # "data_output/stt.csv.gz"]

# dummy test program to verify OK to run with multiprocessing (yes; it releases memory when parent process is done)
# def print_tail(df=pd.DataFrame()):
#     print(df.tail(2))
#     return df
#             exec(data_frame_name + ' = print_tail(' + data_frame_name + ')')
#         df = print_tail(eval(data_frame_name)) ;  arg_dict['q'].put(df)  #(df.tail(1))
#         arg_dict['q'].put(eval(data_frame_name)) # very slow to try passing full dfs by queue
# q = multiprocessing.Queue(len(data_files))  # optional argument: length of queue
# proc = multiprocessing.Process(target=load_dfs, args=(args_dict,))
# proc.start()
# proc.join()  # waits for queue to be filled/flushed
# mu,vu,vt,va,t = get_memory_stats("C) After Close and Join Pool: ")
# print(q.get().head(1))
# print(items_enc.head(2))
# print(q.get().head(1))
# print(shops_enc.head(2))
# print(q.get().head(1))
# print(date_scaling.head(2))
# print(q.get().head(1))
# print(test.head(2))
# print(q.get().head(1))
# print(stt.head(2))


# args_dict = {'repo_path':GDRIVE_REPO_PATH,'data_file_list':data_files}
# print("Loading Files from Google Drive repo into Colab...\n")
# #%cd "{paths['repo_path']}"
# %cd "{GDRIVE_REPO_PATH}"
# with multiprocessing.Pool(None,maxtasksperchild = 1) as pool:  # None = use all available processors, could use Pool(1, maxtasksperchild = 1)
#     # dfs = pool.map(load_dfs, data_files)#[0] #[args_dict])[0]
#     #items_enc,shops_enc,date_scaling,stt,test = pool.map(load_data_files_to_dfs, data_files)
#     dfs = pool.map(load_data_files_to_dfs, data_files)  # dfs is a list of ordered elements, each being the function "load_data_files_to_dfs" operating on an element in "data_files" list/iterable
    



# Multiprocessing with 0 returned from function
# Ran this 2x, and memory use stabilizes at low level as expected, with dataframes inside function being discarded when process is close-joined
        # # List of the *data* files (path relative to GitHub branch), to be loaded into pandas DataFrames
        # data_files = [  "data_output/items_enc.csv",
        #                 "data_output/shops_enc.csv",
        #                 "data_output/date_scaling.csv",
        #                 "data_output/stt.csv.gz",
        #                 "readonly/final_project_data/test.csv.gz"]

        # data_df_names = []
        # for path_name in data_files:
        #     filename = path_name.rsplit("/")[-1]
        #     data_df_names.append(filename.split(".")[0])

        # def load_data_files_to_dfs(datafile):
        #     df = pd.read_csv(datafile)   # loaded inside the worker process only (no exec/eval needed)
        #     return 0                     # return a dummy value instead of the DataFrame

        # # Issue: freeing up memory used by pandas dataframes that are no longer required by the program (del + gc.collect() does not reliably do this, nor does redefining the df as pd.DataFrame())
        # MEMORY_STATS.append(get_memory_stats('Data File Paths Defined',printout=False))

        # %cd "{GDRIVE_REPO_PATH}"
        # print("Loading Files from Google Drive repo into Colab...\n")
        # with multiprocessing.Pool(None,maxtasksperchild = 1) as pool:  # None = use all available processors, could use Pool(1, maxtasksperchild = 1)
        #     dfs = pool.map(load_data_files_to_dfs, data_files)  # dfs is a list of ordered elements, each being the function "load_data_files_to_dfs" operating on an element in "data_files" list/iterable
            
        # MEMORY_STATS.append(get_memory_stats('Pool Mapped Onto Load Data fn: ',printout=False))
        # pool.close()  # pool.terminate()
        # pool.join()
        # MEMORY_STATS.append(get_memory_stats('Pool Closed and Joined',printout=False))

        # # data_dfs = OrderedDict(zip(data_df_names, dfs))
        # # for name,df in data_dfs.items():
        # #     print(f'Data Frame: {name}; n_rows = {len(df)}, n_cols = ',end="")
        # #     print(f'{len(df.columns)}') #\nData Types: {df.dtypes}\n')
        # #     print(f'Column Names: {df.columns.to_list()}')
        # #     print(df.head(2),'\n')
        # print("dfs = ", dfs)
        # dfs = []; print("dfs = ", dfs)
        # data_dfs = {}; print("data_dfs = ", data_dfs,'\n')
        # print(f'Multiprocessing Active Children: {multiprocessing.active_children()}')  # should be zero (empty list) if all processes are terminated and memory is returned to system
        # MEMORY_STATS.append(get_memory_stats('dfs List Set to Empty',printout=False))
        # display_all_memory_stats(MEMORY_STATS)
        # ------------------------------------------------------------------------------------------ Output:
        # /content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
        # Loading Files from Google Drive repo into Colab...

        # dfs =  [0, 0, 0, 0, 0]
        # dfs =  []
        # data_dfs =  {} 

        # Multiprocessing Active Children: []

        #                                                         |  pid   |              vm               |
        #   Time and Date       |   Memory Usage Measure Point    | pid-GB | used-GB | avail-GB | total-GB |
        # Wed 09:08:08 07/29/20 | Environment Setup Done          |   0.38 |    0.81 |    26.59 |    27.39 |
        # Wed 09:08:09 07/29/20 | Defined Iteration Parameters    |   0.39 |    0.81 |    26.59 |    27.39 |
        # Wed 09:08:09 07/29/20 | Mounted Google Drive            |   0.39 |    0.81 |    26.59 |    27.39 |
        # Wed 09:08:09 07/29/20 | Data File Paths Defined         |   0.39 |    0.81 |    26.59 |    27.39 |
        # Wed 09:08:11 07/29/20 | Pool Mapped Onto Load Data fn:  |   0.39 |    0.81 |    26.58 |    27.39 |
        # Wed 09:08:11 07/29/20 | Pool Closed and Joined          |   0.39 |    0.81 |    26.58 |    27.39 |
        # Wed 09:08:11 07/29/20 | dfs List Set to Empty           |   0.39 |    0.81 |    26.59 |    27.39 |
        # Wed 09:08:20 07/29/20 | Data File Paths Defined         |   0.39 |    0.81 |    26.58 |    27.39 |
        # Wed 09:08:22 07/29/20 | Pool Mapped Onto Load Data fn:  |   0.39 |    0.81 |    26.58 |    27.39 |
        # Wed 09:08:22 07/29/20 | Pool Closed and Joined          |   0.39 |    0.81 |    26.58 |    27.39 |
        # Wed 09:08:22 07/29/20 | dfs List Set to Empty           |   0.39 |    0.81 |    26.58 |    27.39 |


# Multiprocessing with dataframe returned from function
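# Ran this 4x, and memory use stabilizes (pid memory rises from 0.39 GB to 0.82 GB on the first run, then stays at 0.82 GB on later runs, even after dfs is emptied)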
    # # List of the *data* files (path relative to GitHub branch), to be loaded into pandas DataFrames
    # data_files = [  "data_output/items_enc.csv",
    #                 "data_output/shops_enc.csv",
    #                 "data_output/date_scaling.csv",
    #                 "data_output/stt.csv.gz",
    #                 "readonly/final_project_data/test.csv.gz"]

    # data_df_names = []
    # for path_name in data_files:
    #     filename = path_name.rsplit("/")[-1]
    #     data_df_names.append(filename.split(".")[0])

    # def load_data_files_to_dfs(datafile):
    #     # no exec/eval needed: the DataFrame names are rebuilt from data_df_names in the parent process
    #     return pd.read_csv(datafile)

    # # Issue: freeing up memory used by pandas dataframes that are no longer required by the program (del + gc.collect() does not reliably do this, nor does redefining the df as pd.DataFrame())
    # MEMORY_STATS.append(get_memory_stats('Data File Paths Defined',printout=False))

    # %cd "{GDRIVE_REPO_PATH}"
    # print("Loading Files from Google Drive repo into Colab...\n")
    # with multiprocessing.Pool(None,maxtasksperchild = 1) as pool:  # None = use all available processors, could use Pool(1, maxtasksperchild = 1)
    #     dfs = pool.map(load_data_files_to_dfs, data_files)  # dfs is a list of ordered elements, each being the function "load_data_files_to_dfs" operating on an element in "data_files" list/iterable
        
    # MEMORY_STATS.append(get_memory_stats('Pool Mapped Onto Load Data fn: ',printout=False))
    # pool.close()  # pool.terminate()
    # pool.join()
    # MEMORY_STATS.append(get_memory_stats('Pool Closed and Joined',printout=False))

    # data_dfs = OrderedDict(zip(data_df_names, dfs))
    # for name,df in data_dfs.items():
    #     print(f'Data Frame: {name}; n_rows = {len(df)}, n_cols = ',end="")
    #     print(f'{len(df.columns)}') #\nData Types: {df.dtypes}\n')
    #     print(f'Column Names: {df.columns.to_list()}')
    #     print(df.head(2),'\n')
    # dfs = []; print("dfs = ", dfs)
    # data_dfs = {}; print("data_dfs = ", data_dfs,'\n')
    # print(f'Multiprocessing Active Children: {multiprocessing.active_children()}')  # should be zero (empty list) if all processes are terminated and memory is returned to system
    # MEMORY_STATS.append(get_memory_stats('dfs List Set to Empty',printout=False))
    # display_all_memory_stats(MEMORY_STATS)

    #         # ------------------------------------------------------------------------------------------ Output:

    # /content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
    # Loading Files from Google Drive repo into Colab...

    # Data Frame: items_enc; n_rows = 22170, n_cols = 10
    # Column Names: ['item_id', 'item_tested', 'item_cluster', 'item_category_id', 'item_cat_tested', 'item_group', 'item_category1', 'item_category2', 'item_category3', 'item_category4']
    #    item_id  item_tested  item_cluster  item_category_id  item_cat_tested  item_group  item_category1  item_category2  item_category3  item_category4
    # 0        0            0           100                40                1           6               8               3               7               3
    # 1        1            0           105                76                1           6              11               6              10               5 

    # Data Frame: shops_enc; n_rows = 60, n_cols = 9
    # Column Names: ['shop_id', 'shop_tested', 'shop_group', 'shop_type', 's_type_broad', 'shop_federal_district', 'fd_popdens', 'fd_gdp', 'shop_city']
    #    shop_id  shop_tested  shop_group  shop_type  s_type_broad  shop_federal_district  fd_popdens  fd_gdp  shop_city
    # 0        0            0           7          5             2                      1           3       1         26
    # 1        1            0           7          1             0                      1           3       1         26 

    # Data Frame: date_scaling; n_rows = 35, n_cols = 8
    # Column Names: ['month', 'year', 'season', 'MoY', 'days_in_M', 'weekday_weight', 'retail_sales', 'week_retail_weight']
    #    month  year  season  MoY  days_in_M  weekday_weight  retail_sales  week_retail_weight
    # 0      0  2013       2    1         31           0.979         1.052               1.030
    # 1      1  2013       3    2         28           1.069         1.072               1.146 

    # Data Frame: stt; n_rows = 3150043, n_cols = 9
    # Column Names: ['day', 'week', 'qtr', 'season', 'month', 'price', 'sales', 'shop_id', 'item_id']
    #    day  week  qtr  season  month  price  sales  shop_id  item_id
    # 0    0     0    0       2      0     99      1        2      991
    # 1    0     0    0       2      0   2599      1        2     1472 

    # Data Frame: test; n_rows = 214200, n_cols = 3
    # Column Names: ['ID', 'shop_id', 'item_id']
    #    ID  shop_id  item_id
    # 0   0        5     5037
    # 1   1        5     5320 

    # dfs =  []
    # data_dfs =  {} 

    # Multiprocessing Active Children: []

    #                                                         |  pid   |              vm               |
    #   Time and Date       |   Memory Usage Measure Point    | pid-GB | used-GB | avail-GB | total-GB |
    # Wed 09:11:01 07/29/20 | Environment Setup Done          |   0.38 |    0.81 |    26.59 |    27.39 |
    # Wed 09:11:01 07/29/20 | Defined Iteration Parameters    |   0.39 |    0.81 |    26.59 |    27.39 |
    # Wed 09:11:01 07/29/20 | Mounted Google Drive            |   0.39 |    0.81 |    26.59 |    27.39 |
    # Wed 09:11:01 07/29/20 | Data File Paths Defined         |   0.39 |    0.81 |    26.59 |    27.39 |
    # Wed 09:11:05 07/29/20 | Pool Mapped Onto Load Data fn:  |   0.82 |    1.27 |    26.12 |    27.39 |
    # Wed 09:11:05 07/29/20 | Pool Closed and Joined          |   0.82 |    1.27 |    26.12 |    27.39 |
    # Wed 09:11:05 07/29/20 | dfs List Set to Empty           |   0.82 |    1.27 |    26.12 |    27.39 |
    # Wed 09:11:14 07/29/20 | Data File Paths Defined         |   0.82 |    1.27 |    26.12 |    27.39 |
    # Wed 09:11:18 07/29/20 | Pool Mapped Onto Load Data fn:  |   0.82 |    1.28 |    26.11 |    27.39 |
    # Wed 09:11:18 07/29/20 | Pool Closed and Joined          |   0.82 |    1.28 |    26.11 |    27.39 |
    # Wed 09:11:18 07/29/20 | dfs List Set to Empty           |   0.82 |    1.28 |    26.12 |    27.39 |
    # Wed 09:11:24 07/29/20 | Data File Paths Defined         |   0.82 |    1.28 |    26.12 |    27.39 |
    # Wed 09:11:28 07/29/20 | Pool Mapped Onto Load Data fn:  |   0.82 |    1.28 |    26.11 |    27.39 |
    # Wed 09:11:28 07/29/20 | Pool Closed and Joined          |   0.82 |    1.28 |    26.12 |    27.39 |
    # Wed 09:11:28 07/29/20 | dfs List Set to Empty           |   0.82 |    1.28 |    26.12 |    27.39 |
    # Wed 09:11:33 07/29/20 | Data File Paths Defined         |   0.82 |    1.28 |    26.12 |    27.39 |
    # Wed 09:11:38 07/29/20 | Pool Mapped Onto Load Data fn:  |   0.82 |    1.28 |    26.12 |    27.39 |
    # Wed 09:11:38 07/29/20 | Pool Closed and Joined          |   0.82 |    1.28 |    26.12 |    27.39 |
    # Wed 09:11:38 07/29/20 | dfs List Set to Empty           |   0.82 |    1.28 |    26.12 |    27.39 |

nocode=True
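
The pattern tested in the cell above can be reduced to a small self-contained sketch: do the memory-heavy work in a worker process and send back only a small result, so the operating system reclaims the worker's memory when it exits. The `fork` start method is an assumption (it is available on Linux runtimes such as Colab), and the function body is a stand-in for building a large intermediate DataFrame.

```python
import multiprocessing as mp
import os

def huge_intermediate_calc(n):
    """Build a large temporary object inside the worker and return only a small summary."""
    big = list(range(n))                       # stand-in for a huge intermediate DataFrame
    return {"pid": os.getpid(), "n_rows": len(big), "total": sum(big)}

# Assumption: a Linux runtime where the "fork" start method is available
ctx = mp.get_context("fork")
with ctx.Pool(processes=1, maxtasksperchild=1) as pool:
    result = pool.map(huge_intermediate_calc, [1_000_000])[0]
    pool.close()                               # no more tasks will be submitted
    pool.join()                                # wait for the worker process to exit

print(result, "parent pid:", os.getpid())
```

Returning a full DataFrame instead of a summary still works, but the result is pickled back to the parent and then lives in the parent's memory, which matches the higher pid-GB figures recorded above for the DataFrame-returning variant.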

###**Timer created as Context Manager, allowing function decoration**
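
A minimal runnable sketch of the idea in this section, using only the standard library. It is not the notebook's `elapsed_timer` class: the names `ElapsedTimer`, `elapsed` and `hms` are hypothetical, and unlike the commented-out class in the cell below, `__exit__` returns False here so that exceptions raised inside a timed block still propagate.

```python
from contextlib import ContextDecorator
from time import perf_counter, sleep

class ElapsedTimer(ContextDecorator):
    """Timer usable as `with timer:` or as `@timer` on a function; timers can be nested."""
    def __init__(self):
        self.start_time = None
        self.total_seconds = None

    def __enter__(self):
        self.start_time = perf_counter()
        self.total_seconds = None
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.total_seconds = perf_counter() - self.start_time
        return False                            # do not swallow exceptions

    def elapsed(self):
        """Seconds since the block (or decorated function) was entered."""
        return perf_counter() - self.start_time

    @staticmethod
    def hms(seconds):
        """Format a duration in seconds as HH:MM:SS."""
        minutes, secs = divmod(int(seconds), 60)
        hours, minutes = divmod(minutes, 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"

outer = ElapsedTimer()
inner = ElapsedTimer()

@outer                                          # times the whole function call
def work():
    sleep(0.02)
    with inner:                                 # times only this inner block
        sleep(0.01)
    return outer.elapsed()

elapsed_inside = work()
print(f"outer = {outer.total_seconds:.3f} s, inner = {inner.total_seconds:.3f} s")
```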

In [None]:
# some functions for calculating elapsed time, allowing timer blocks inside other timer blocks
#    (as opposed to just using time.perf_counter() )

# class elapsed_timer(ContextDecorator): # base class enables a contextmanager to also be used as a decorator
#     def __init__(self):   
#         self.function_init_time   = f'{strftime("%a %X %x")}'
#         self.start_time           = default_timer()
#         self.function_total_time  = datetime.utcfromtimestamp(default_timer() - self.start_time).strftime('%H:%M:%S')
#     def __enter__(self):  
#         self.start_time = default_timer()
#         return datetime.utcfromtimestamp(default_timer() - self.start_time).strftime('%H:%M:%S')
#     def __exit__(self, exc_type, exc_value, exc_traceback):
#         self.function_total_time = datetime.utcfromtimestamp(default_timer() - self.start_time).strftime('%H:%M:%S')
#         return True # suppresses any exception raised inside the timed block
#     def get_elapsed_time(self):
#         return datetime.utcfromtimestamp(default_timer() - self.start_time).strftime('%H:%M:%S')

# example usage of context manager with context decorator:
# xt = elapsed_timer()
# yt = elapsed_timer()
# @xt
# def printme():
#     for i in range(1500000): a=np.arctan(np.sqrt(i/3.367))**2; c = np.sqrt(a**a)
#     print(f"after first calc in function:     x={xt.get_elapsed_time()}")
#     with yt:
#         for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b*b**b)
#         print(f"inside inner block at end:     x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
# print(f'xt function_total_time = {xt.function_total_time};   initial time: {xt.function_init_time}')
# yt.restart_time()   # NOTE: restart_time() is not defined in the elapsed_timer class above

# xt = elapsed_timer()
# yt = elapsed_timer()
# @xt
# def printsometimes():
#     print("Hello")
#     print(f"at start of outer block:             x={xt.get_elapsed_time()}")
#     for i in range(1500000): a=np.arctan(np.sqrt(i/3.367))**2; c = np.sqrt(a**a)
#     print(f"after first calc in outer block:     x={xt.get_elapsed_time()}")
#     for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b*b**b)
#     print(f"after second calc in outer block:    x={xt.get_elapsed_time()}")
#     with yt:
#         print("Goodbye")
#         print(f"at start of inner block:             x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
#         for i in range(1500000): a=np.arctan(np.sqrt(i/3.367))**2; c = np.sqrt(a**a)
#         print(f"at middle of inner block:            x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
#         for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b*b**b)
#         print(f"inside inner block at end:           x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
#     print(f'yt function_elapsed_time1 = {yt.function_total_time};   initial time: {yt.function_init_time}')
#     print(f"just outside of inner block:         x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
#     for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b**b)    
#     print(f"after calc outside inner block:      x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
#     for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b*b**b)
#     print(f"inside outer block at end            x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
#     print(f'yt function_elapsed_time2 = {yt.function_total_time};   initial time: {yt.function_init_time}')

# print(f"before calculation before fn call:   x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
# for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b**b)
# print(f"after calculation before fn call:    x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
# print(f'yt function_elapsed_time3 = {yt.function_total_time};   initial time: {yt.function_init_time}')

# printsometimes()

# print(f"just outside of function call:       x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
# for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b**b)
# print(f"after calc after function call:      x={xt.get_elapsed_time()}    y={yt.get_elapsed_time()}")
# print(f'yt function_elapsed_time4 = {yt.function_total_time};   initial time: {yt.function_init_time}')

# before calculation before fn call:   x=00:00:00    y=00:00:00
# after calculation before fn call:    x=00:00:04    y=00:00:04
# yt function_elapsed_time3 = 00:00:00;   initial time: Mon 08:33:22 08/10/20
# Hello
# at start of outer block:             x=00:00:00
# after first calc in outer block:     x=00:00:06
# after second calc in outer block:    x=00:00:10
# Goodbye
# at start of inner block:             x=00:00:10    y=00:00:00
# at middle of inner block:            x=00:00:17    y=00:00:06
# inside inner block at end:           x=00:00:22    y=00:00:11
# yt function_elapsed_time1 = 00:00:11;   initial time: Mon 08:33:22 08/10/20
# just outside of inner block:         x=00:00:22    y=00:00:11
# after calc outside inner block:      x=00:00:26    y=00:00:15
# inside outer block at end            x=00:00:30    y=00:00:19
# yt function_elapsed_time2 = 00:00:11;   initial time: Mon 08:33:22 08/10/20
# just outside of function call:       x=00:00:30    y=00:00:19
# after calc after function call:      x=00:00:35    y=00:00:24
# yt function_elapsed_time4 = 00:00:11;   initial time: Mon 08:33:22 08/10/20


############################################################################################
# @contextmanager
# def elapsed_timer():
#     """
#     Use like:
#         with elapsed_timer() as x_time:
#             print(x_time()) ... print(x_time())
#         print(f'totaltime={x_time()}')
#     x_time() keeps increasing while you are inside the "with" block, and freezes at its final value once you exit the "with"
#     """
#     start = default_timer()
#     elapser = lambda: datetime.utcfromtimestamp(default_timer() - start).strftime('%H:%M:%S')
#     yield lambda: elapser()
#     end = default_timer()
#     elapser = lambda: datetime.utcfromtimestamp(end - start).strftime('%H:%M:%S')

# with elapsed_timer() as x_time:
#     print("Hello")
#     print(f"at start of outer block:             x={x_time()}")
#     for i in range(1500000): a=np.arctan(np.sqrt(i/3.367))**2; c = np.sqrt(a**a)
#     print(f"after first calc in outer block:     x={x_time()}")
#     for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b*b**b)
#     print(f"after second calc in outer block:    x={x_time()}")
#     with elapsed_timer() as y_time:
#         print("Goodbye")
#         print(f"at start of inner block:             x={x_time()}    y={y_time()}")
#         for i in range(1500000): a=np.arctan(np.sqrt(i/3.367))**2; c = np.sqrt(a**a)
#         print(f"at middle of inner block:            x={x_time()}    y={y_time()}")
#         for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b*b**b)
#         print(f"inside inner block at end:           x={x_time()}    y={y_time()}")
#     print(f"just outside of inner block:         x={x_time()}    y={y_time()}")
#     for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b**b)    
#     print(f"after calc outside inner block:      x={x_time()}    y={y_time()}")
#     for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b*b**b)
#     print(f"inside outer block at end            x={x_time()}    y={y_time()}")

# print(f"just outside of outer block:         x={x_time()}    y={y_time()}")
# for j in range(1000000): b=np.arctan(np.sqrt(j/3.367))**2; d = np.sqrt(b**b)
# print(f"after calc out of outer block (x,y): x={x_time()}    y={y_time()}")

# Hello
# at start of outer block:             x=00:00:00
# after first calc in outer block:     x=00:00:06
# after second calc in outer block:    x=00:00:11
# Goodbye
# at start of inner block:             x=00:00:11    y=00:00:00
# at middle of inner block:            x=00:00:18    y=00:00:06
# inside inner block at end:           x=00:00:23    y=00:00:11
# just outside of inner block:         x=00:00:23    y=00:00:11
# after calc outside inner block:      x=00:00:27    y=00:00:11
# inside outer block at end            x=00:00:32    y=00:00:11
# just outside of outer block:         x=00:00:32    y=00:00:11
# after calc out of outer block (x,y): x=00:00:32    y=00:00:11
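
The class-based timer above returns `True` from `__exit__`, which suppresses any exception raised inside the timed block. A minimal sketch of the same idea that lets exceptions propagate and freezes its reading on exit could look like the following. The name `ElapsedTimer` and its `elapsed()` method are illustrative assumptions, not part of the original notebook.

```python
from contextlib import ContextDecorator
from time import perf_counter, sleep


class ElapsedTimer(ContextDecorator):
    """Hypothetical timer usable as `with ElapsedTimer() as t:` or as a decorator."""

    def __init__(self):
        self.start = None   # set on entry
        self.end = None     # set on exit; None while the block is running

    def __enter__(self):
        self.start = perf_counter()
        self.end = None
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.end = perf_counter()
        return False        # do not suppress exceptions from the timed block

    def elapsed(self):
        """Seconds since entry; frozen at the final value after exit."""
        if self.start is None:
            return 0.0
        stop = self.end if self.end is not None else perf_counter()
        return stop - self.start


outer = ElapsedTimer()


@outer
def work():
    sleep(0.02)
    with ElapsedTimer() as inner:   # timers can be nested
        sleep(0.02)
    return inner.elapsed()


inner_seconds = work()
print(f"outer={outer.elapsed():.2f}s  inner={inner_seconds:.2f}s")
```

Returning `False` from `__exit__` is a design choice for this sketch: a failure inside a timed block still surfaces, while the elapsed time up to the failure remains available.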


###**Old code: LightGBM - Light Gradient-Boosting Machine (Gradient-Boosted Decision Trees)**
###**Old code: SK_HGBR - SKLearn Histogram Gradient Boosting Regressor**

In [None]:
# model_gbdt = lgb.LGBMRegressor(
#     objective='regression',
#     boosting_type='gbdt',           # gbdt= Gradient Boosting Decision Tree; dart= Dropouts meet Multiple Additive Regression Trees; goss= Gradient-based One-Side Sampling; rf= Random Forest
#     learning_rate=params["lr"],     # You can use callbacks parameter of fit method to shrink/adapt learning rate in training using reset_parameter callback
#     n_estimators=params["maxit"],   # Number of boosted trees to fit = max_iterations
#     metric='rmse',
#     subsample_for_bin=200000,       # Number of samples for constructing bins
#     num_leaves=31,                  # Maximum tree leaves for base learners
#     max_depth=-1,                   # Maximum tree depth for base learners, <=0 means no limit
#     min_split_gain=0.0,             # Minimum loss reduction required to make a further partition on a leaf node of the tree
#     min_child_weight=0.001,         # Minimum sum of instance weight (hessian) needed in a child (leaf)
#     min_child_samples=20,           # Minimum number of data needed in a child (leaf)
#     colsample_bytree=params["reg"], # fraction of columns randomly sampled for each tree (max=1 = use all columns)
#     random_state=params["seed"],    # seed value
#     silent=False,                   # whether to suppress info messages during fitting (False = print them)
#     importance_type='split',        # feature importance type: 'split'= N times feature is used in model; 'gain'= total gains of splits which use the feature
#     reg_alpha=0.0,                  # L1 regularization
#     reg_lambda=0.0,                 # L2 regularization
#     n_jobs=-1,                      # N parallel threads to use on computer
#     subsample=1.0,                  # row fraction used for training: keep at 1 for time series data
#     subsample_freq=0                # keep at 0 for time series
#     )


# model_gbdt.fit( 
#     data['X_train'],                        # Input feature matrix (array-like or sparse matrix of shape = [n_samples, n_features])
#     data['y_train'],                        # The target values (class labels in classification, real numbers in regression) (array-like of shape = [n_samples])
#     eval_set=[(data['X_val'], data['y_val'])],              # can have multiple tuples of validation data inside this list
#     eval_names=None,                        # Names of eval_set (list of strings or None, optional (default=None))
#     eval_metric='rmse',                     # Default: 'l2' (= mean squared error, 'mse') for LGBMRegressor; options include 'l2_root'='root_mean_squared_error'='rmse' and 'l1'='mean_absolute_error'='mae' + more
#     early_stopping_rounds=params["estop"],  # Activates early stopping. The model will train until the validation score stops improving. Validation score needs to improve at least every early_stopping_rounds 
#                                             #     to continue training. Requires at least one validation data and one metric. If there’s more than one, will check all of them. But the training data is ignored anyway. 
#                                             #     To check only the first metric, set the first_metric_only parameter to True in additional parameters **kwargs of the model constructor.
#     init_score=None,                        # Init score of train data
#     eval_init_score=None,                   # Init score of eval data (list of arrays or None, optional (default=None))
#     verbose=CONSTANTS["VERBOSITY"] ,        # If True, metric on the eval set is printed at each boosting stage. If n=int, the metric on the eval set is printed at every nth boosting stage. Best and final also print.
#     feature_name='auto',                    # Feature names. If 'auto' and data is pandas DataFrame, data columns names are used. (list of strings or 'auto', optional (default='auto'))
#     categorical_feature='auto',             # Categorical features (list of strings or int, or 'auto', optional (default='auto')) If list of int, interpreted as indices. 
#                                             # If list of strings, interpreted as feature names (need to specify feature_name as well). 
#                                             # If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). 
#                                             # Large values could be memory-consuming. Consider using consecutive integers starting from zero. All negative values in categorical features are treated as missing values.
#     callbacks=None                          # List of callback functions that are applied at each iteration (list of callback functions or None, optional (default=None)) See Callbacks in Python API for more information.
#     )

    # if mod_type == 'HGBR':
    #     # TTSplit should use TRAIN_FINAL = 33 (train on all data), and it will return also val=month33 for calculation at end (only)
    #     model_gbdt = HistGradientBoostingRegressor(
    #         learning_rate=LR, 
    #         max_iter=maxiter, 
    #         l2_regularization = reg,
    #         early_stopping=False, 
    #         verbose = verb,
    #         random_state=seed_val)
    
    #     tic = perf_counter()
    #     model_gbdt.fit(X_train_np, y_train)
    #     toc = perf_counter()
    #     model_fit_time = datetime.utcfromtimestamp(toc-tic).strftime('%H:%M:%S')
    #     print(f"model HGBR fit time: {model_fit_time}")
    #     best_iter = maxiter
    #     best_val_rmse = 0
        # model_params = {
        #         'objective':param_df.at[iternum,'objective'],
        #         'boosting_type':param_df.at[iternum,'boosting_type'],
        #         'learning_rate':param_df.at[iternum,'learning_rate'],
        #         'n_estimators':param_df.at[iternum,'n_estimators'],
        #         'metric':param_df.at[iternum,'metric'],
        #         'subsample_for_bin':param_df.at[iternum,'subsample_for_bin'],
        #         'num_leaves':param_df.at[iternum,'num_leaves'],
        #         'max_depth':param_df.at[iternum,'max_depth'],
        #         'min_split_gain':param_df.at[iternum,'min_split_gain'],
        #         'min_child_weight':param_df.at[iternum,'min_child_weight'],
        #         'min_child_samples':param_df.at[iternum,'min_child_samples'],
        #         'colsample_bytree':param_df.at[iternum,'colsample_bytree'],
        #         'random_state':param_df.at[iternum,'random_state'],
        #         'silent':param_df.at[iternum,'silent'],
        #         'importance_type':param_df.at[iternum,'importance_type'],
        #         'reg_alpha':param_df.at[iternum,'reg_alpha'],
        #         'reg_lambda':param_df.at[iternum,'reg_lambda'],
        #         'n_jobs':param_df.at[iternum,'n_jobs'],
        #         'subsample':param_df.at[iternum,'subsample'],
        #         'subsample_freq':param_df.at[iternum,'subsample_freq']
        #         }

        # fit_params = {
        #         'eval_metric':param_df.at[iternum,'eval_metric'],
        #         'early_stopping_rounds':param_df.at[iternum,'early_stopping_rounds'],
        #         'init_score':param_df.at[iternum,'init_score'],
        #         'eval_init_score':param_df.at[iternum,'eval_init_score'],
        #         'verbose':param_df.at[iternum,'verbose'],
        #         'feature_name':param_df.at[iternum,'feature_name'],
        #         'categorical_feature':param_df.at[iternum,'categorical_feature'],
        #         'callbacks':param_df.at[iternum,'callbacks']
        #         }

        # param_df.at[iternum,"feature_name_"]            = model_gbdt.feature_name_
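
The `early_stopping_rounds` comment above describes the stopping rule in words. A minimal pure-Python sketch of that rule follows, assuming a lower-is-better metric such as RMSE. The function name `early_stop_iteration` and the sample scores are hypothetical, and LightGBM's own implementation may differ in its details.

```python
def early_stop_iteration(val_scores, patience):
    """Return (best_iteration, stop_iteration), both 1-based.

    Training stops once `patience` consecutive rounds pass without the
    validation score improving on the best value seen so far.
    """
    best_score = float("inf")
    best_iter = 0
    for i, score in enumerate(val_scores, start=1):
        if score < best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            return best_iter, i          # stopped early at round i
    return best_iter, len(val_scores)    # ran through every round


# Hypothetical validation RMSE per boosting round
scores = [1.20, 1.10, 1.05, 1.06, 1.07, 1.08, 1.04]
print(early_stop_iteration(scores, patience=3))  # → (3, 6)
```

With `patience=3` the later improvement at round 7 is never reached, which illustrates why a small early-stopping value can end training before a better score appears.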




# # Parameters Dictionary stores everything for dumping to file later
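# SPEC = OrderedDict()
# FEATURES = SPEC   # the entries below are stored under the name FEATURES; GBDT_model() receives the same dictionary as CONSTANTS=SPEC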
# FEATURES["_MODEL_NAME"] = 'LGBMv13_15ens'   # 'LGBMv10_11ens'  # Name of file model substring to save data submission to (= False if user to input it below)
# FEATURES["_MODEL_TYPE"] = 'LGBM'  # 'HGBR'
# FEATURES["_TEST_MONTH"] = 34

# # Optional operations to delete irrelevant shops or item categories, and to scale sales by month length, etc.;  set to FALSE if no operation desired
# FEATURES["_EDA_DELETE_SHOPS"]     = [9,20] #[0,1,8,9,11,13,17,20,23,27,29,30,32,33,40,43,51,54] #[8, 9, 13, 20, 23, 32, 33, 40] # [9,20] #  # False # these are online shops, untested shops, and early-termination + online shops
# FEATURES["_EDA_DELETE_ITEM_CATS"] = [8, 10, 32, 59, 80, 81, 82]  #[1,4,8,10,13,14,17,18,32,39,46,48,50,51,52,53,59,66,68,80,81,82] #  #[8, 80, 81, 82]  # False # hokey categories, untested categories, really hokey categories
# FEATURES["_EDA_SCALE_MONTH"]         = 'week_retail_weight'  # False # scale sales by days in month, number of each weekday, and Russian recession retail sales index

# # columns to keep for this round of modeling (dropping some of the less important features to save memory):
# FEATURES["COLS_KEEP_ITEMS"]             = ['item_id', 'item_group', 'item_cluster', 'item_category_id']  #, 'item_category4']
# FEATURES["COLS_KEEP_SHOPS"]             = ['shop_id','shop_group']
# FEATURES["COLS_KEEP_DATE_SCALING"]      = ['month', 'days_in_M', 'weekday_weight', 'retail_sales', 'week_retail_weight']
# FEATURES["COLS_KEEP_BASE_TRAIN_TEST"]   = ['month', 'price', 'sales', 'shop_id', 'item_id']

# # re-order columns for organized readability, for the (to be created) combined sales-train-test (stt) dataset
# FEATURES["COLS_ORDER_STT"]        = ['month', 'sales', 'revenue', 'shop_id', 'item_id', 'shop_group', 'item_category_id', 'item_group', 'item_cluster'] #,   'revenue','item_category4','shop_group'
# FEATURES["PROVIDED_INTEGER_FEATURES"]        = [e for e in FEATURES["COLS_ORDER_STT"] if e not in {'sales','price','revenue'}]  
# FEATURES["FEATURES_MONTHLY_STT_START"]       = [e for e in FEATURES["COLS_ORDER_STT"] if e not in {'month','sales','price','revenue','shop_id','item_id'}]  # these are categorical features that need to be merged onto test data set
# FEATURES["PROVIDED_CATEGORICAL_FEATURES"]    = [e for e in FEATURES["COLS_ORDER_STT"] if e not in {'month','sales','price','revenue'}]
# FEATURES["_USE_CATEGORICAL"]         = True  # pd dataframe columns "PROVIDED_CATEGORICAL_FEATURES" are changed to categorical dtype just before model fitting/creation

# FEATURES["AGG_STATS"] = OrderedDict()
# FEATURES["AGG_STATS"]["sales"]     = ['sum', 'median', 'count']
# FEATURES["AGG_STATS"]["revenue"]   = ['sum']  # revenue can handle fillna(0) cartesian product; price doesn't make sense with fillna(0), so don't use that at this time
# #FEATURES["AGG_STATS"]["price"]     = ['median','std']

# # aggregate statistics columns (initial computation shall be 'sales per month' prediction target for shop_id-item_id pair grouping)
# FEATURES["STATS_FEATURES"] = [['shop_id', 'item_id'], ['shop_id', 'item_category_id'], ['shop_id', 'item_cluster']] + FEATURES["PROVIDED_CATEGORICAL_FEATURES"]

# FEATURES["LAGS_MONTHS"] = [1,2,3,4,5,6,7,8]  # month lags to include in model 
# FEATURES["LAG_FEATURES"] = {}
# for i in FEATURES["LAGS_MONTHS"]:
#     FEATURES["LAG_FEATURES"][i] = ['y_sales', 'shop_id_x_item_category_id_sales_sum', 'item_id_sales_sum', 'item_cluster_sales_sum'] 
# FEATURES["LAG_FEATURES"][1] = ['y_sales', 'shop_id_x_item_id_sales_median', 'shop_id_x_item_id_sales_count', 'shop_id_x_item_id_revenue_sum', 
#                      'shop_id_x_item_category_id_sales_sum', 'shop_id_x_item_category_id_sales_median', 'shop_id_x_item_category_id_sales_count', 
#                      'shop_id_x_item_cluster_sales_sum', 'shop_id_x_item_cluster_sales_median', 
#                      'shop_id_sales_sum', 'shop_id_sales_count', 
#                      'item_id_sales_sum', 'item_id_sales_median', 'item_id_sales_count', 'item_id_revenue_sum', 
#                      'shop_group_revenue_sum', 
#                      'item_category_id_sales_sum', 'item_category_id_sales_count', 'item_category_id_revenue_sum', 
#                      'item_group_sales_sum', 'item_group_revenue_sum', 
#                      'item_cluster_sales_sum', 'item_cluster_sales_count', 'item_cluster_revenue_sum']

# FEATURES["LAG_FEATURES"][2] = ['y_sales', 'shop_id_x_item_id_sales_count', 'shop_id_x_item_id_revenue_sum', 
#                      'shop_id_x_item_category_id_sales_sum', 'shop_id_x_item_category_id_sales_count', 'shop_id_x_item_category_id_revenue_sum', 
#                      'shop_id_x_item_cluster_sales_sum', 'shop_id_x_item_cluster_sales_count', 
#                      'shop_id_sales_sum', 'item_id_sales_sum', 'item_id_sales_count', 'item_id_revenue_sum', 
#                      'item_category_id_sales_sum', 'item_category_id_sales_count', 
#                      'item_group_sales_sum', 
#                      'item_cluster_sales_sum', 'item_cluster_sales_count', 'item_cluster_revenue_sum']

# FEATURES["LAG_FEATURES"][3] = ['y_sales', 'shop_id_x_item_id_sales_count', 
#                      'shop_id_x_item_category_id_sales_sum', 
#                      'shop_id_sales_sum', 
#                      'item_id_sales_sum', 'item_id_sales_count', 'item_id_revenue_sum', 
#                      'item_category_id_sales_sum', 'item_category_id_sales_count', 
#                      'item_cluster_sales_sum', 'item_cluster_sales_count']

# # keep at least the highest importance feature for each lag, but remove all others with < 20% importance (month 13-32 training)
# FEATURES["LAG_FEATURES"][2] = [e for e in FEATURES["LAG_FEATURES"][2] if e not in {'item_group_sales_sum','shop_id_x_item_category_id_sales_sum','shop_id_x_item_cluster_sales_sum','shop_id_x_item_cluster_sales_count'}]
# FEATURES["LAG_FEATURES"][3] = [e for e in FEATURES["LAG_FEATURES"][3] if e not in {'item_cluster_sales_sum','shop_id_x_item_category_id_sales_sum','shop_id_x_item_id_sales_count'}]
# FEATURES["LAG_FEATURES"][4] = [e for e in FEATURES["LAG_FEATURES"][4] if e not in {'shop_id_x_item_category_id_sales_sum','y_sales','item_cluster_sales_sum'}]
# FEATURES["LAG_FEATURES"][5] = [e for e in FEATURES["LAG_FEATURES"][5] if e not in {'item_cluster_sales_sum','item_id_sales_sum','shop_id_x_item_category_id_sales_sum'}]
# FEATURES["LAG_FEATURES"][6] = [e for e in FEATURES["LAG_FEATURES"][6] if e not in {'item_id_sales_sum','item_cluster_sales_sum','shop_id_x_item_category_id_sales_sum'}]
# FEATURES["LAG_FEATURES"][7] = [e for e in FEATURES["LAG_FEATURES"][7] if e not in {'y_sales','item_cluster_sales_sum','shop_id_x_item_category_id_sales_sum'}]
# FEATURES["LAG_FEATURES"][8] = [e for e in FEATURES["LAG_FEATURES"][8] if e not in {'item_id_sales_sum','item_cluster_sales_sum','shop_id_x_item_category_id_sales_sum'}]

# # LAG_STATS_SET is SET of all aggregate statistics columns for all lags (allows us to shed the other stats, keeping memory requirements low)
# LAG_STATS_SET = FEATURES["LAG_FEATURES"][1]
# for l in FEATURES["LAGS_MONTHS"][1:]:
#     LAG_STATS_SET = LAG_STATS_SET + [x for x in FEATURES["LAG_FEATURES"][l] if x not in LAG_STATS_SET]
# FEATURES["STT_MONTHLY_COLS"] = FEATURES["PROVIDED_INTEGER_FEATURES"] + LAG_STATS_SET

# # Define various constants that drive the attributes of the various features
# FEATURES["_CLIP_TRAIN_H"]   = 20          # clips sales after the monthly groupings (monthly_stt dataframe); item_cnt_month predictions are also clipped to 20 after the model runs
# FEATURES["_CLIP_TRAIN_L"]   = 0
# FEATURES["_CLIP_PREDICT_H"] = 20          # clips the final result before submission to Coursera
# FEATURES["_CLIP_PREDICT_L"] = 0

# FEATURES["_USE_ROBUST_SCALER"]         = True        # scale features to reduce influence of outliers
# FEATURES["_ROBUST_SCALER_QUANTILES"]   = (20,80)
# FEATURES["_USE_MINMAX_SCALER"]         = True        # scale features to use large range of np.int16
# FEATURES["_MINMAX_SCALER_RANGE"]       = (0,16000)   # int16 = (0,32700); uint16 = (0,65500)  --> keep this range positive for best results with LGBM; keep range smaller for faster LGBM fitting
# FEATURES["_FEATURE_DATA_TYPE"]         = np.int16    # alternatives: np.float32, np.uint16; if n/a values are filled with 0, features can be stored as integers to save memory (integer dtypes cannot store np.nan)
# FEATURES["_USE_CARTPROD_FILL"]         = True        # use cartesian fill, or not
# FEATURES["_CARTPROD_TEST_PAIRS"]  = False       # include all shop-item pairings from test month as well as the in-month pairings
# FEATURES["_CARTPROD_FILLNA0"]    = True        # fill n/a cartesian additions with zeros (not good for price-based stats, however)
# FEATURES["_CARTPROD_FIRST_MONTH"] = 13          # first month + max lag at which to start adding Cartesian product rows (e.g., maxlag=6 months and _CARTPROD_FIRST_MONTH=10 will Cartesian-fill months 4 to 33)
# FEATURES["TRAIN_MONTH_START"]         = [13]        # == 24 ==> less than a year of data, but avoids December 'outlier' of 2014
# FEATURES["TRAIN_MONTH_END"]           = [29]        # [29,32] #,30,32]
# FEATURES["N_VAL_MONTHS"]              = [False]     # e.g. [1]; if False, val is all months after training, up to and including 33; otherwise val is this many months after train_month_end

# # Define hyperparameters for modeling
# FEATURES["LEARNING_RATE"]       = [0.05]  # default = 0.1
# FEATURES["MAX_ITERATIONS"]      = [200] # default = 100
# FEATURES["EARLY_STOPPING"]      = [20]
# FEATURES["REGULARIZATION"]      = [0.4] # default = 1 for LGBM, 0 for HGBR (these models use inverse forms of regularization)
# FEATURES["VERBOSITY"]           = True  # 4 = print every 4th iteration; True = print every iteration; False = no print except best and last
# FEATURES["SEED_VALUES"]         = [42]

# FEATURES["ALL_exploded_shape[0]"] = (len(FEATURES["SEED_VALUES"])*len(FEATURES["N_VAL_MONTHS"])*len(FEATURES["TRAIN_MONTH_END"])*len(FEATURES["TRAIN_MONTH_START"])*
#                          len(FEATURES["EARLY_STOPPING"])*len(FEATURES["MAX_ITERATIONS"])*len(FEATURES["REGULARIZATION"])*len(FEATURES["LEARNING_RATE"]) )


# print(f'Done: {strftime("%a %X %x")}')
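
The `ALL_exploded_shape[0]` entry above counts the model runs implied by the list-valued settings, that is, the product of the list lengths. A minimal sketch of enumerating those combinations with `itertools.product` follows. The `grid` dictionary is an illustrative assumption whose keys mirror the `params[...]` keys used elsewhere in this old code, and the second values for `train_final_mo` and `seed` are hypothetical additions so that more than one run results.

```python
from itertools import product
from math import prod

grid = {
    "lr": [0.05],
    "maxit": [200],
    "estop": [20],
    "reg": [0.4],
    "train_start_mo": [13],
    "train_final_mo": [29, 32],   # hypothetical second value for illustration
    "val_mo": [False],
    "seed": [42, 43],             # hypothetical second seed for illustration
}

# Number of runs = product of the list lengths
n_runs = prod(len(values) for values in grid.values())

# One params dict per model run
runs = [dict(zip(grid, combo)) for combo in product(*grid.values())]

print(n_runs, len(runs))
print(runs[0])
```

Each entry of `runs` is a flat dictionary of the kind a single model-fitting call could take as its `params` argument.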








# def unscale(scaler,target):
#     return scaler.inverse_transform(target.reshape(-1, 1)).squeeze()

# def GBDT_model(data=df, CONSTANTS=SPEC, params=OrderedDict()):
#     """
#     data is the entire dataframe with train, validation, and test rows, and all columns including the prediction target "y_target"
#     CONSTANTS is the dictionary of setup constants
#     params is the dictionary of train/val split and model fitting/prediction parameters for this particular model
#     """
#     results = OrderedDict()
#     if CONSTANTS["_MODEL_TYPE"] == 'LGBM':
        
#         train_start = params["train_start_mo"]
#         train_end   = params["train_final_mo"]
#         val_months  = params["val_mo"]
#         test_month  = CONSTANTS["_TEST_MONTH"]

#         train   = data.query('(month >= @train_start) & (month <= @train_end)')
#         y_train = train['y_target'].astype(np.float32)
#         y_train = y_train.reset_index(drop=True)
#         X_train = train.drop(['y_target'], axis=1)
#         X_train = X_train.reset_index(drop=True)
#         feature_names = X_train.columns

#         if val_months:
#             val = data.query('(month > (@train_end)) & (month <= (@train_end + @val_months)) & (month < @test_month)')
#         else:
#             val = data.query('((month > (@train_end)) & (month < @test_month)) | (month == (@test_month-1))')
#         y_val = val['y_target'].astype(np.float32)
#         y_val = y_val.reset_index(drop=True)
#         X_val = val.drop(['y_target'], axis=1)
#         X_val = X_val.reset_index(drop=True)

#         X_test = data.query('month == @test_month').drop(['y_target'], axis=1)
#         X_test = X_test.reset_index(drop=True)

#         print('X_train:')
#         print_col_info(X_train,8)
#         print(f'\n{X_train.head(2)}\n\n')
#         print('X_val:')
#         print_col_info(X_val,8)
#         print(f'\n{X_val.head(2)}\n\n')
#         print('X_test:')
#         print_col_info(X_test,8)
#         print(f'\n{X_test.head(2)}\n\n')
#         data_types = X_train.dtypes

#         del [[data, train, val]]

#         print('Starting training...')
#         model_gbdt = lgb.LGBMRegressor(
#             objective='regression',
#             boosting_type='gbdt',           # gbdt= Gradient Boosting Decision Tree; dart= Dropouts meet Multiple Additive Regression Trees; goss= Gradient-based One-Side Sampling; rf= Random Forest
#             learning_rate=params["lr"],     # You can use callbacks parameter of fit method to shrink/adapt learning rate in training using reset_parameter callback
#             n_estimators=params["maxit"],   # Number of boosted trees to fit = max_iterations
#             metric='rmse',
#             subsample_for_bin=200000,       # Number of samples for constructing bins
#             num_leaves=31,                  # Maximum tree leaves for base learners
#             max_depth=-1,                   # Maximum tree depth for base learners, <=0 means no limit
#             min_split_gain=0.0,             # Minimum loss reduction required to make a further partition on a leaf node of the tree
#             min_child_weight=0.001,         # Minimum sum of instance weight (hessian) needed in a child (leaf)
#             min_child_samples=20,           # Minimum number of data needed in a child (leaf)
#             colsample_bytree=params["reg"], # fraction of columns randomly sampled for each tree (max=1 = use all columns)
#             random_state=params["seed"],    # seed value
#             silent=False,                   # whether to suppress info messages during fitting (False = print them)
#             importance_type='split',        # feature importance type: 'split'= N times feature is used in model; 'gain'= total gains of splits which use the feature
#             reg_alpha=0.0,                  # L1 regularization
#             reg_lambda=0.0,                 # L2 regularization
#             n_jobs=-1,                      # N parallel threads to use on computer
#             subsample=1.0,                  # row fraction used for training: keep at 1 for time series data
#             subsample_freq=0,               # keep at 0 for time series
#             )

#         tic = perf_counter()
#         model_gbdt.fit( 
#             X_train,                                # Input feature matrix (array-like or sparse matrix of shape = [n_samples, n_features])
#             y_train,                                # The target values (class labels in classification, real numbers in regression) (array-like of shape = [n_samples])
#             eval_set=[(X_val, y_val)],              # can have multiple tuples of validation data inside this list
#             eval_names=None,                        # Names of eval_set (list of strings or None, optional (default=None))
#             eval_metric='rmse',                     # Default: 'l2' (= mean squared error, 'mse') for LGBMRegressor; options include 'l2_root'='root_mean_squared_error'='rmse' and 'l1'='mean_absolute_error'='mae' + more
#             early_stopping_rounds=params["estop"],  # Activates early stopping. The model will train until the validation score stops improving. Validation score needs to improve at least every early_stopping_rounds 
#                                                     #     to continue training. Requires at least one validation data and one metric. If there’s more than one, will check all of them. But the training data is ignored anyway. 
#                                                     #     To check only the first metric, set the first_metric_only parameter to True in additional parameters **kwargs of the model constructor.
#             init_score=None,                        # Init score of train data
#             eval_init_score=None,                   # Init score of eval data (list of arrays or None, optional (default=None))
#             verbose=CONSTANTS["VERBOSITY"] ,        # If True, metric on the eval set is printed at each boosting stage. If n=int, the metric on the eval set is printed at every nth boosting stage. Best and final also print.
#             feature_name='auto',                    # Feature names. If 'auto' and data is pandas DataFrame, data columns names are used. (list of strings or 'auto', optional (default='auto'))
#             categorical_feature='auto',             # Categorical features (list of strings or int, or 'auto', optional (default='auto')) If list of int, interpreted as indices. 
#                                                     # If list of strings, interpreted as feature names (need to specify feature_name as well). 
#                                                     # If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). 
#                                                     # Large values could be memory-consuming. Consider using consecutive integers starting from zero. All negative values in categorical features are treated as missing values.
#             callbacks=None                          # List of callback functions that are applied at each iteration (list of callback functions or None, optional (default=None)) See Callbacks in Python API for more information.
#             )

#         toc = perf_counter()
#         results["model_fit_time"] = datetime.utcfromtimestamp(toc-tic).strftime('%H:%M:%S')
#         print(f'Model LGBM fit time: {results["model_fit_time"]}')
#         results["best_iter"] = model_gbdt.best_iteration_
#         results["best_val_rmse"] = 0 #best_score


#     # if mod_type == 'HGBR':
#     #     # TTSplit should use TRAIN_FINAL = 33 (train on all data), and it will return also val=month33 for calculation at end (only)
#     #     model_gbdt = HistGradientBoostingRegressor(
#     #         learning_rate=LR, 
#     #         max_iter=maxiter, 
#     #         l2_regularization = reg,
#     #         early_stopping=False, 
#     #         verbosity = verb,
#     #         random_state=seed_val)
    
#     #     tic = perf_counter()
#     #     model_gbdt.fit(X_train_np, y_train)
#     #     toc = perf_counter()
#     #     model_fit_time = datetime.utcfromtimestamp(toc-tic).strftime('%H:%M:%S')
#     #     print(f"model HGBR fit time: {model_fit_time}")
#     #     best_iter = maxiter
#     #     best_val_rmse = 0
        
#     print("Starting predictions...")
#     tic = perf_counter()
#     y_pred_train =  model_gbdt.predict( X_train, num_iteration=model_gbdt.best_iteration_ )
#     y_pred_val =    model_gbdt.predict( X_val,   num_iteration=model_gbdt.best_iteration_ )
#     y_pred_test =   model_gbdt.predict( X_test,  num_iteration=model_gbdt.best_iteration_ )
#     y_train =       y_train.to_numpy()
#     y_val =         y_val.to_numpy()
#     # always do minmax scaling after robust scaling; and do inverse scaling with minmax first, then robust
#     if CONSTANTS["_USE_MINMAX_SCALER"]:
#         y_pred_train =  unscale(minmax_scalers['y_sales'],  y_pred_train)
#         y_pred_val =    unscale(minmax_scalers['y_sales'],  y_pred_val)
#         y_pred_test =   unscale(minmax_scalers['y_sales'],  y_pred_test)
#         y_train =       unscale(minmax_scalers['y_sales'],  y_train)
#         y_val =         unscale(minmax_scalers['y_sales'],  y_val)
#     if CONSTANTS["_USE_ROBUST_SCALER"]:
#         y_pred_train =  unscale(robust_scalers['y_sales'],  y_pred_train)
#         y_pred_val =    unscale(robust_scalers['y_sales'],  y_pred_val)
#         y_pred_test =   unscale(robust_scalers['y_sales'],  y_pred_test)
#         y_train =       unscale(robust_scalers['y_sales'],  y_train)
#         y_val =         unscale(robust_scalers['y_sales'],  y_val)
#     y_pred_train =  y_pred_train.clip(CONSTANTS["_CLIP_PREDICT_L"], CONSTANTS["_CLIP_PREDICT_H"])
#     y_pred_val =    y_pred_val.clip(  CONSTANTS["_CLIP_PREDICT_L"], CONSTANTS["_CLIP_PREDICT_H"])
#     y_pred_test =   y_pred_test.clip( CONSTANTS["_CLIP_PREDICT_L"], CONSTANTS["_CLIP_PREDICT_H"]) 
#     toc = perf_counter()
#     results["predict_time"] = datetime.utcfromtimestamp(toc-tic).strftime('%H:%M:%S')
#     print(f'Transform and Predict train/val/test time: {results["predict_time"]}')

#     results["train_r2"],   results["val_r2"]    = sk_r2(y_train, y_pred_train),            sk_r2(y_val, y_pred_val)
#     results["train_rmse"], results["val_rmse"]  = np.sqrt(sk_mse(y_train, y_pred_train)),  np.sqrt(sk_mse(y_val, y_pred_val))
#     print(f'R^2 train  = {results["train_r2"]:.4f}      R^2 val  = {results["val_r2"]:.4f}')
#     print(f'RMSE train = {results["train_rmse"]:.4f}    RMSE val = {results["val_rmse"]:.4f}\n')

#     return model_gbdt, model_gbdt.get_params(), X_test, y_pred_test, feature_names, data_types, results

# print(f'Done: {strftime("%a %X %x")}')





# ensemble_feature_names = []
# ensemble_y_pred_test = []
# ensemble_df_columns = ['lr', 'reg', 'max_iter', 'estop', 'start', 'end', 'n_val_mo', 'seed', 'trR2', 'valR2', 'tr_rmse', 'val_rmse', 'best_iter', 'best_val_rmse', 'model_time','predict_time','total_time']
# ensemble_df_rows = []
# model_params = OrderedDict()
# itercount = 0
# for lr in FEATURES["LEARNING_RATE"]:
#     for reg in FEATURES["REGULARIZATION"]:
#         for maxit in FEATURES["MAX_ITERATIONS"]:
#             for estop in FEATURES["EARLY_STOPPING"]:
#                 for train_start_mo in FEATURES["TRAIN_MONTH_START"]:
#                     for train_final_mo in FEATURES["TRAIN_MONTH_END"]:
#                         for val_mo in FEATURES["N_VAL_MONTHS"]:
#                             for seed in FEATURES["SEED_VALUES"]:
#                                 itercount += 1
#                                 print(f'\n\nBelow: Model {itercount} of {FEATURES["ALL_exploded_shape[0]"]}: LR = {lr}; LFF = {reg}, train_start = {train_start_mo}; train_end = {train_final_mo}; seed = {seed}\n')
#                                 time0 = time.time()
#                                 model_params["lr"] = lr
#                                 model_params['reg'] = reg
#                                 model_params['maxit'] = maxit
#                                 model_params['estop'] = estop
#                                 model_params['train_start_mo'] = train_start_mo
#                                 model_params['train_final_mo'] = train_final_mo
#                                 model_params['val_mo'] = val_mo
#                                 model_params['seed'] = seed
#                                 ##model_fit, y_pred_test, train_r2, val_r2, train_rmse, val_rmse, best_iter, best_val_rmse, model_fit_time, predict_time = 
#                                 model_fit, model_params, X_test, y_pred_test, feature_names, data_types, results = GBDT_model(df, SPEC, model_params)
#                                 time2 = time.time(); model_time = datetime.utcfromtimestamp(time2 - time0).strftime('%H:%M:%S')

#                                 ensemble_feature_names.append(feature_names)
#                                 ensemble_y_pred_test.append(y_pred_test)
#                                 ##ensemble_df_rows.append([lr,reg,maxit,estop,train_start_mo,train_final_mo,val_mo,seed,train_r2,val_r2,train_rmse,val_rmse,best_iter,best_val_rmse,model_fit_time,predict_time,model_time])

#                                 # intermediate save after each model fit set of parameters, in case of crash or disconnect from Colab
#                                 # Simple ensemble averaging
#                                 y_test_pred_avg = np.mean(ensemble_y_pred_test, axis=0)
#                                 # Merge the test predictions with IDs from the original test dataset, and keep only columns "ID" and "item_cnt_month"
#                                 y_submission = pd.DataFrame.from_dict({'item_cnt_month':y_test_pred_avg,'shop_id':X_test.shop_id,'item_id':X_test.item_id})
#                                 y_submission = test.merge(y_submission, on=['shop_id','item_id'], how= 'left').reset_index(drop=True).drop(['shop_id','item_id'],axis=1)
#                                 y_submission.to_csv("./models_and_predictions/" + FEATURES["_MODEL_NAME"] + '_submission.csv', index=False)
#                                 ##ensemble_scores = pd.DataFrame(ensemble_df_rows, columns = ensemble_df_columns)
#                                 ##ensemble_scores.to_csv("./models_and_predictions/" + model_filename_ens, index=False)
#                                 time3 = time.time(); iteration_time = datetime.utcfromtimestamp(time3 - time0).strftime('%H:%M:%S')
#                                 #print(f'TTSplit Execution Time = {ttsplit_time};  
#                                 print(f'Model fit/predict Execution Time = {model_time};  Total Iteration Execution Time = {iteration_time}')
#                                 print(f'Below: Model {itercount} of {FEATURES["ALL_exploded_shape[0]"]}: LR = {lr}; LFF = {reg}, train_start = {train_start_mo}; train_end = {train_final_mo}; seed = {seed}\n')
# print(model_params)
# print(feature_names)
# print(data_types)
# print(results)
# #display(ensemble_scores)

# print(f'\nDone: {strftime("%a %X %x")}\n')

nocode=True
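
The eight nested `for` loops in the commented-out cell above visit every combination of the parameter lists. As a minimal sketch (the `param_grid` values below are hypothetical stand-ins, not the notebook's actual `FEATURES` entries), `itertools.product` yields the same combinations in the same order without the deep indentation:

```python
from itertools import product

# Hypothetical parameter lists, standing in for the FEATURES[...] entries.
param_grid = {
    "lr": [0.02, 0.005],
    "reg": [0.4],
    "seed": [42, 43],
}

def iter_param_combos(grid):
    """Yield one dict per combination, in the order nested for-loops would visit them."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

combos = list(iter_param_combos(param_grid))
for n, combo in enumerate(combos, start=1):
    print(f"Model {n} of {len(combos)}: {combo}")
```

Because the total number of combinations is known up front, the "Model n of N" progress message no longer needs a separately stored count.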

##**Random Stuff**

###**Parameters**

In [None]:

def display_params():
    """Print the dataframe size plus the FEATURES and ITERS settings for the current run."""
    df_mem = df.memory_usage(deep=True)
    print(f'{strftime("%a %X %x")};  df_size = {df_mem.sum()/(10**6):0.1f} MB, df_shape = {df.shape}; N train models: {ALL_exploded_shape[0]}; N features: {N_FEATURES}')
    
    print("------\n------------")
    pprint.pprint(dict(FEATURES),width=200,compact=True)
    print("\n")
    pprint.pprint(ITERS,width=200,compact=True)
    print("\n")


    print(f'COLS_ORDER_STT = {FEATURES["COLS_ORDER_STT"]}')
    print(f'LAGGED_FEATURE_GROUPS = {FEATURES["LAGGED_FEATURE_GROUPS"]}')
    print(f'AGG_STATS = {FEATURES["AGG_STATS"]};  _CLIP_TRAIN_L = {FEATURES["_CLIP_TRAIN_L"]}, _CLIP_PREDICT_L = {FEATURES["_CLIP_PREDICT_L"]}')
    print(f'EDA_DELETE_SHOPS = {FEATURES["_EDA_DELETE_SHOPS"]}; EDA_DELETE_ITEM_CATS = {FEATURES["_EDA_DELETE_ITEM_CATS"]}; EDA_SCALE_MONTH = {FEATURES["_EDA_SCALE_MONTH"]}')
    print(f'LAGS = {FEATURES["LAGS_MONTHS"]} (months)')
    for lag in FEATURES["LAGS_MONTHS"]:
        print(f'COLUMNS_TO_LAG[{lag}] = {FEATURES["LAG_FEATURES"][lag]}')
    print(f'USE_CARTPROD_FILL = {FEATURES["_USE_CARTPROD_FILL"]}, CARTPROD_WITH_TEST_PAIRS = {FEATURES["_CARTPROD_TEST_PAIRS"]}, CARTPROD_FILLNA_WITH_0 = {FEATURES["_CARTPROD_FILLNA0"]}')
    print(f'USE_ROBUST_SCALER = {FEATURES["_USE_ROBUST_SCALER"]}, ROBUST_SCALER_QUANTILES = {FEATURES["_ROBUST_SCALER_QUANTILES"]}, USE_MINMAX_SCALER = {FEATURES["_USE_MINMAX_SCALER"]}, ',end="")
    print(f'MINMAX_SCALER_RANGE = {FEATURES["_MINMAX_SCALER_RANGE"]}, FEATURE_DATA_TYPE = {FEATURES["_FEATURE_DATA_TYPE"]}')
    print(f'CARTPROD_FILL_MONTH_BEGIN = {FEATURES["_CARTPROD_FIRST_MONTH"]}, ',end='') 
    print(f'TRAIN_START_MONTH = {FEATURES["_TRAIN_START_MONTH"]},  TRAIN_FINAL_MONTH = {FEATURES["_TRAIN_FINAL_MONTH"]}, validate_months = {FEATURES["_validate_months"]}')
    print(f'LEARNING_RATE = {ITERS["learning_rate"].unique()}, MAX_ITERATIONS = {ITERS["n_estimators"].unique()}, EARLY_STOPPING = {ITERS["early_stopping_rounds"].unique()}, ',end='')
    print(f'REGULARIZATION = {ITERS["colsample_bytree"].unique()}, SEED_VALUES = {ITERS["random_state"].unique()}')
    print(f'SUBSAMPLE_FOR_BIN = {ITERS["subsample_for_bin"].unique()}')
    return

print(f'{FEATURES["_MODEL_NAME"]}  Model Type: {FEATURES["_MODEL_TYPE"]}')
display_params()  
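
The `N train models` count printed by `display_params()` appears to come from exploding list-valued parameter columns so that each row describes one training run. A minimal sketch of that idea, assuming a one-row dataframe named `params` with made-up values (both the name and the values are illustrative only):

```python
import pandas as pd

# Hypothetical one-row parameter table: each cell holds the list of values to try.
params = pd.DataFrame({
    "learning_rate": [[0.02, 0.005]],
    "n_estimators": [[8000]],
    "random_state": [[42, 43, 44]],
})

# Explode one column at a time; the row count multiplies by each list's length.
runs = params.copy()
for col in runs.columns:
    runs = runs.explode(col, ignore_index=True)

print(runs.shape)   # (6, 3): 2 learning rates x 1 n_estimators x 3 seeds
print(runs)
```

Each row of `runs` can then be read with `runs.at[i, "learning_rate"]`, in the same way the training loop reads `ITERS.at[itercount, ...]`.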

###**K-Fold Training Splits; Ensemble Average; Save Intermediate Results**

In [None]:

%cd "{GDRIVE_REPO_PATH}"

ensemble_y_pred_test = []
ensemble_df_columns = ['tr_rmse','val_rmse','trR2','valR2','lr','reg','max_iter','estop','bin_sample','start','end','val_key','seed','best_iter','best_val_rmse','model_t','predict_t','total_t']
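ensemble_df_rows = []

# Loop over every row of ITERS (one row per set of model parameters)
itercount = 0
while itercount < len(ITERS):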
    if not ITERS.at[itercount,"_model_filename"]:
        ITERS.at[itercount,"_model_filename"] = input("Enter the Base Model Name Substring for Output File Naming (like: 'v4mg_01' )")
    filename_parameters = ITERS.at[itercount,"_model_type"] + ITERS.at[itercount,"_model_filename"] + "_params.csv"
    filename_submission = ITERS.at[itercount,"_model_type"] + ITERS.at[itercount,"_model_filename"] + '_submission.csv'

    print(f'\n\nBelow: Model {itercount+1} of {len(ITERS)}: lr= {ITERS.at[itercount,"learning_rate"]}; Reg= {ITERS.at[itercount,"colsample_bytree"]}, ',end='')
    print(f'train_start = {ITERS.at[itercount,"_train_start_month"]}; train_end = {ITERS.at[itercount,"_train_final_month"]}; seed = {ITERS.at[itercount,"random_state"]}\n')
    
    time0 = time.time()
    # TODO: only redo this split inside the loop when the train/validation months change
    DataSets, ITERS.at[itercount,"feature_name_"] = TTSplit(data=df, params=ITERS, iternum=itercount)

    y_pred_test, results_dict, parameters_of_model = GBDT_model(DataSets, model_params_dict, fit_params_dict) #ITERS, itercount)
    time1 = time.time()
    

    ITERS.at[itercount,"time_predict_end"] = datetime.utcfromtimestamp(time1 - time0).strftime('%H:%M:%S')
    print(f'Total Iteration Execution Time = {ITERS.at[itercount,"time_predict_end"]}')

    # Intermediate save after each parameter set is fit, in case Colab crashes or disconnects.
    # Simple ensemble averaging over all models fit so far.
    ensemble_y_pred_test.append(y_pred_test)
    y_test_pred_avg = np.mean(ensemble_y_pred_test, axis=0)
    # Merge the test predictions with IDs from the original test dataset, and keep only columns "ID" and "item_cnt_month"
    y_submission = pd.DataFrame.from_dict({'item_cnt_month':y_test_pred_avg,'shop_id':DataSets['X_test'].shop_id,'item_id':DataSets['X_test'].item_id})
    y_submission = test.merge(y_submission, on=['shop_id','item_id'], how= 'left').reset_index(drop=True).drop(['shop_id','item_id'],axis=1)
    y_submission.to_csv("./models_and_predictions/" + filename_submission, index=False)

    ITERS.to_csv("./models_and_predictions/" + filename_parameters, index=False)

    ensemble_df_rows.append(ITERS[['tr_rmse','val_rmse','tr_R2','val_R2','learning_rate','colsample_bytree','n_estimators','early_stopping_rounds','subsample_for_bin','_train_start_month','_train_final_month',
                                          '_validate_months','random_state','best_iteration_','best_score_','time_model_fit','time_model_predict','time_predict_end']].iloc[itercount].to_list())
    ensemble_scores = pd.DataFrame(ensemble_df_rows, columns = ensemble_df_columns)

    print(f'\nModel {itercount+1} of {len(ITERS)}: lr= {ITERS.at[itercount,"learning_rate"]}; Reg= {ITERS.at[itercount,"colsample_bytree"]}, ',end='')
    print(f'train_start = {ITERS.at[itercount,"_train_start_month"]}; train_end = {ITERS.at[itercount,"_train_final_month"]}; seed = {ITERS.at[itercount,"random_state"]}\n')
    display(ensemble_scores)
    
    itercount += 1

print(f'\nDone: {strftime("%a %X %x")}\n')
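
The intermediate save above averages all test predictions collected so far and merges them onto the test IDs by `(shop_id, item_id)`. A toy sketch of that step, using small made-up frames in place of `test` and `DataSets['X_test']`:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins: `test` carries the submission IDs, `X_test` is in model row order.
test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [5, 5, 6], "item_id": [10, 11, 10]})
X_test = pd.DataFrame({"shop_id": [6, 5, 5], "item_id": [10, 10, 11]})

# One prediction vector per fitted model, each aligned with the rows of X_test.
ensemble_y_pred_test = [
    np.array([1.0, 0.2, 3.0]),
    np.array([2.0, 0.4, 1.0]),
]

# Simple ensemble average across models (axis 0 = model index).
y_test_pred_avg = np.mean(ensemble_y_pred_test, axis=0)

pred = pd.DataFrame({
    "item_cnt_month": y_test_pred_avg,
    "shop_id": X_test["shop_id"].to_numpy(),
    "item_id": X_test["item_id"].to_numpy(),
})

# A left merge keeps the row order of `test`, so IDs stay in submission order.
y_submission = test.merge(pred, on=["shop_id", "item_id"], how="left")[["ID", "item_cnt_month"]]
print(y_submission)
```

Recomputing the mean over the whole list on every iteration is cheap here and means the saved file always reflects every model finished so far.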

###**Document Results**

In [None]:
# Printout for copy-paste version control

print('\n------------------------------------------\n------------------------------------------')
print(f'{FEATURES["_MODEL_NAME"]}  Model Type: {FEATURES["_MODEL_TYPE"]}\nCoursera: \n------------------------------------------')
display_params()
print('------')
print(ensemble_scores)
print('------')
print(ensemble_scores.describe(percentiles=[], include=np.number))
print(f'------\nHighest and Lowest Feature Importance for Final Model:\n{fi.iloc[list(range(0,8))+list(range(-7,0)),:]}\n------')
print(y_submission.head(8))
print('------------------------------------------\n\n')
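
The feature-importance printout selects the first eight and last seven rows of `fi` by position. A small sketch with invented importance values, assuming `fi` is sorted by descending importance and that `norm_value` is the importance scaled so the top feature equals 100 (an assumption that agrees with the results recorded in this notebook, where 9199 / 9378 gives 98.09):

```python
import pandas as pd

# Invented importances for illustration; the real `fi` comes from the fitted model.
fi = pd.DataFrame({
    "feature": [f"feat_{i}" for i in range(20)],
    "value": [1000 - 40 * i for i in range(20)],
})
fi = fi.sort_values("value", ascending=False).reset_index(drop=True)
fi["norm_value"] = (100 * fi["value"] / fi["value"].max()).round(2)

# Same row selection as fi.iloc[list(range(0, 8)) + list(range(-7, 0)), :]
top_and_bottom = pd.concat([fi.head(8), fi.tail(7)])
print(top_and_bottom)
```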


###**Record Results**

In [None]:
# Results
'''

Best Coursera score so far: 8/10 public and private LB scores are: 0.985186 and 0.979359 on 5/12 with Andreas' numbers
Best with this model: v7_ens21 8/10 public and private LB scores are: 0.974590 and 0.971219

LGBMv10_15ens 8/10 public and private LB scores are: 0.984054 and 0.979126
* LGBMv10_13ens 8/10 public and private LB scores are: 0.976077 and 0.973442
* LGBMv10_11ens 8/10 public and private LB scores are: 0.977330 and 0.974200  same as v10_10ens, but removed all features with importance below 20%
*** LGBMv10_10ens 8/10 public and private LB scores are: 0.975422 and 0.971682
LGBMv10_09ens 8/10 public and private LB scores are: 0.984677 and 0.984238
LGBMv10_08ens 8/10 public and private LB scores are: 0.984238 and 0.982864
LGBMv10_07ens 8/10 public and private LB scores are: 0.985275 and 0.985093
LGBMv10_06ens 8/10 public and private LB scores are: 0.984912 and 0.983360
LGBMv10_v9_18noscaler 8/10 public and private LB scores are: 0.984643 and 0.985256
LGBMv10_v9_18 8/10 public and private LB scores are: 0.982740 and 0.983633  robust scaler used
LGBMv9_18ens 8/10 public and private LB scores are: 0.984137 and 0.984686
LGBMv9_09clip 8/10 public and private LB scores are: 0.984877 and 0.985790
LGBMv9_08clip 8/10 public and private LB scores are: 0.985158 and 0.986282
LGBMv9_04ens (less memory) 8/10 public and private LB scores are: 0.981707 and 0.985473
* LGBMv9_03ens 8/10 public and private LB scores are: 0.975438 and 0.973606
LGBMv8_v7_21B_ens 8/10 public and private LB scores are: 0.976147 and 0.972920
*** LGBMv6v7_bag06 8/10 public and private LB scores are: 0.974873 and 0.971385
* LGBMv6v7_bag05 8/10 public and private LB scores are: 0.975973 and 0.972537
**** v7_ens21 8/10 public and private LB scores are: 0.974590 and 0.971219
** v7_ens20 8/10 public and private LB scores are: 0.975499 and 0.971916
* v6_ens32 8/10 public and private LB scores are: 0.975826 and 0.972352
v6_10 8/10 public and private LB scores are: 0.984495 and 0.978631
v6_ens01 (avg v6 #17 through #31): 8/10 public and private LB scores are: 0.984457 and 0.978061
v6_ens33 8/10 public and private LB scores are: 0.980232 and 0.975554
v7_03 8/10 public and private LB scores are: 0.980832 and 0.975157
v7_ens07 8/10 public and private LB scores are: 0.980749 and 0.978082

------------------------------------------
------------------------------------------
LGBMv10_15ens  Model Type: LGBM
------------------------------------------
Coursera: 8/10 public and private LB scores are: 0.984054 and 0.979126
Wed 19:10:11 07/15/20;  Size of df = 734.8 MB, Shape = (6226880, 59);  Size of X_train_np = 595.7 MB;  N Training Runs for this Model: 2
EDA_DELETE_SHOPS = [9, 20]; EDA_DELETE_ITEM_CATS = [8, 10, 32, 59, 80, 81, 82]; EDA_SCALE_MONTH = week_retail_weight; Lags(months) = [1, 2, 3, 4, 5, 6, 7, 8]
KEEP_STT_COLUMN_ORDER = ['month', 'sales', 'revenue', 'shop_id', 'item_id', 'shop_group', 'item_category_id', 'item_group', 'item_cluster']
STATS_FEATURES = [['shop_id', 'item_id'], ['shop_id', 'item_category_id'], ['shop_id', 'item_cluster'], 'shop_id', 'item_id', 'shop_group', 'item_category_id', 'item_group', 'item_cluster']
AGG_STATS = OrderedDict([('sales', ['sum', 'median', 'count']), ('revenue', ['sum'])]); _CLIP_TRAIN_L = 0; _CLIP_TRAIN_H = 20; _CLIP_PREDICT_L = 0; _CLIP_PREDICT_H = 20
COLUMNS_TO_LAG[1] = ['y_sales', 'shop_id_x_item_id_sales_median', 'shop_id_x_item_id_sales_count', 'shop_id_x_item_id_revenue_sum', 'shop_id_x_item_category_id_sales_sum', 'shop_id_x_item_category_id_sales_median', 'shop_id_x_item_category_id_sales_count', 'shop_id_x_item_cluster_sales_sum', 'shop_id_x_item_cluster_sales_median', 'shop_id_sales_sum', 'shop_id_sales_count', 'item_id_sales_sum', 'item_id_sales_median', 'item_id_sales_count', 'item_id_revenue_sum', 'shop_group_revenue_sum', 'item_category_id_sales_sum', 'item_category_id_sales_count', 'item_category_id_revenue_sum', 'item_group_sales_sum', 'item_group_revenue_sum', 'item_cluster_sales_sum', 'item_cluster_sales_count', 'item_cluster_revenue_sum']
COLUMNS_TO_LAG[2] = ['y_sales', 'shop_id_x_item_id_sales_count', 'shop_id_x_item_id_revenue_sum', 'shop_id_x_item_category_id_sales_count', 'shop_id_x_item_category_id_revenue_sum', 'shop_id_sales_sum', 'item_id_sales_sum', 'item_id_sales_count', 'item_id_revenue_sum', 'item_category_id_sales_sum', 'item_category_id_sales_count', 'item_cluster_sales_sum', 'item_cluster_sales_count', 'item_cluster_revenue_sum']
COLUMNS_TO_LAG[3] = ['y_sales', 'shop_id_sales_sum', 'item_id_sales_sum', 'item_id_sales_count', 'item_id_revenue_sum', 'item_category_id_sales_sum', 'item_category_id_sales_count', 'item_cluster_sales_count']
COLUMNS_TO_LAG[4] = ['item_id_sales_sum']
COLUMNS_TO_LAG[5] = ['y_sales']
COLUMNS_TO_LAG[6] = ['y_sales']
COLUMNS_TO_LAG[7] = ['item_id_sales_sum']
COLUMNS_TO_LAG[8] = ['y_sales']
CARTPROD_FILL_MONTH_BEGIN = 13; TRAIN_MONTH_START = [13]; TRAIN_FINAL_MONTH = [29]; N_VAL_MONTHS = [False]; USE_CARTESIAN_FILL = True; CART_PROD_INCLUDES_TEST = False; CARTPROD_FILLNA_WITH_0 = True
USE_ROBUST_SCALER = True; ROBUST_SCALER_QUANTILES = (20, 80); USE_MINMAX_SCALER = True; MINMAX_RANGE = (0, 32000); DATA_TYPE = <class 'numpy.int16'>
LEARNING_RATE = [0.02, 0.005]; MAX_ITERATIONS = [8000]; EARLY_STOPPING = [200]; REGULARIZATION = [0.4]; SEED_VALUES = [42]
------
     lr   reg  max_iter  estop  start  end  n_val_mo  seed  trR2  valR2  tr_rmse  val_rmse  best_iter  best_val_rmse model_time predict_time total_time
0 0.020 0.400      8000    200     13   29     False    42 0.572  0.427    0.743     0.783          0      1,253.473   00:35:36     00:13:33   00:49:09
1 0.005 0.400      8000    200     13   29     False    42 0.549  0.427    0.763     0.784          0      1,254.303   01:24:53     00:35:09   02:00:02
------
         lr   reg  max_iter  estop  start  end  seed  trR2  valR2  tr_rmse  val_rmse  best_iter  best_val_rmse
count     2     2         2      2      2    2     2     2      2        2         2          2              2
mean  0.013 0.400      8000    200     13   29    42 0.561  0.427    0.753     0.784          0      1,253.888
std   0.011     0         0      0      0    0     0 0.017  0.001    0.014     0.000          0          0.587
min   0.005 0.400      8000    200     13   29    42 0.549  0.427    0.743     0.783          0      1,253.473
50%   0.013 0.400      8000    200     13   29    42 0.561  0.427    0.753     0.784          0      1,253.888
max   0.020 0.400      8000    200     13   29    42 0.572  0.427    0.763     0.784          0      1,254.303
------
Highest and Lowest Feature Importance for Final Model:
                             feature  value  norm_value lag                   feature_base
0             item_id_sales_count_L1   9378         100   1            item_id_sales_count
1               item_id_sales_sum_L1   9199      98.090   1              item_id_sales_sum
2                       item_cluster   8256      88.040   0                   item_cluster
3                            item_id   7862      83.830   0                        item_id
4                   item_category_id   7540      80.400   0               item_category_id
5             item_id_revenue_sum_L1   7261      77.430   1            item_id_revenue_sum
6                            shop_id   6958      74.190   0                        shop_id
7                         y_sales_L1   6754      72.020   1                        y_sales
51       item_cluster_sales_count_L2   2536      27.040   2       item_cluster_sales_count
52              item_id_sales_sum_L3   2535      27.030   3              item_id_sales_sum
53  shop_id_x_item_id_sales_count_L2   2293      24.450   2  shop_id_x_item_id_sales_count
54   item_category_id_sales_count_L3   2170      23.140   3   item_category_id_sales_count
55       item_cluster_sales_count_L3   2067      22.040   3       item_cluster_sales_count
56   item_category_id_sales_count_L2   1987      21.190   2   item_category_id_sales_count
57     item_category_id_sales_sum_L3   1879      20.040   3     item_category_id_sales_sum
------
   ID  item_cnt_month
0   0           0.781
1   1           0.079
2   2           1.174
3   3           0.247
4   4           0.875
5   5           0.697
6   6           0.514
7   7           0.095
------------------------------------------
------------------------------------------
------------------------------------------

'''
nocode=True
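
The elapsed times in the record above (for example `01:24:53`) are produced with `datetime.utcfromtimestamp(seconds).strftime('%H:%M:%S')`, which wraps around after 24 hours and is deprecated as of Python 3.12. A possible replacement, sketched here with a hypothetical helper named `format_elapsed`:

```python
from time import perf_counter

def format_elapsed(seconds):
    """Format a duration in seconds as HH:MM:SS; hours may exceed 23."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

tic = perf_counter()
# ... model fitting would go here ...
toc = perf_counter()
print(f"Model fit time: {format_elapsed(toc - tic)}")
print(format_elapsed(5093))    # 01:24:53
print(format_elapsed(90000))   # 25:00:00
```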

###**Older method of control loop**

In [None]:
# SPLIT_PARAMS_module_dfs = OrderedDict()
# SPLIT_PARAMS_module_dfs['MODEL']               =   [['model_type','feature_params','provided_data_files','provided_df_names']].copy(deep=True)
# SPLIT_PARAMS_module_dfs['EDA']                 =   [['eda_delete_shops','eda_delete_item_cats','eda_scale_month','feather_stt']].copy(deep=True)
# SPLIT_PARAMS_module_dfs['DATA_CONDITIONING']   =   [['cartprod_fillna0','cartprod_first_month','cartprod_test_pairs','clip_train_H','clip_train_L',
#                                         'feather_monthly_stt','feature_data_type','minmax_scaler_range','robust_scaler_quantiles',
#                                         'use_cartprod_fill','use_categorical','use_minmax_scaler','use_robust_scaler']].copy(deep=True)
# SPLIT_PARAMS_module_dfs['TRAIN_VAL_SPLIT']     =   [['feather_tvt_split','test_month','train_start_month','train_final_month','validate_months']].copy(deep=True)
# SPLIT_PARAMS_module_dfs['LGBM_SETUP']          =   [['boosting_type','metric','learning_rate','n_estimators','colsample_bytree','random_state',
#                                         'subsample_for_bin','num_leaves','max_depth','min_split_gain','min_child_weight','min_child_samples','silent',
#                                         'importance_type','reg_alpha','reg_lambda','n_jobs','subsample','subsample_freq','objective']].copy(deep=True)
# SPLIT_PARAMS_module_dfs['LGBM_FIT']            =   [['eval_metric','early_stopping_rounds','init_score','eval_init_score','verbose',
#                                         'feature_name','categorical_feature','callbacks']].copy(deep=True)
# SPLIT_PARAMS_module_dfs['OUTPUT_PROCESSING']   =   clip_predict_H,clip_predict_L,feather_tvt_split,test,model_type,feature_names,robust_scalers,minmax_scalers
#                       
# OUTPUTS_df                       =   [['best_iteration_','best_score_','feature_importances_','feature_name_','model_params',
#                                        'time_cumulative','time_data_manip','time_dataset_splits','time_eda','time_full_iteration',
#                                        'time_model_fit','time_model_predict','tr_R2','tr_rmse','val_R2','val_rmse']].copy(deep=True)

# # Explode the dataframes so each row is one iteration of modeling parameters
# #   DataFrames each contain columns grouped by level of looping iteration (outermost loop loads data; 
#               innermost loop saves results, for more compact assignment of looping variables)
# for df_name,par_df in SPLIT_PARAMS_module_dfs.items():
#     for col in par_df.columns:
#         par_df = par_df.explode(col)
#     par_df = par_df.reset_index(drop=True)
#     SPLIT_PARAMS_module_dfs[df_name] = par_df
# # also explode a dataframe with all parameters so we get an idea of how big our run will be
# SPLIT_PARAMS_module_dfs['ALL'] = ALL_PARAMS.copy(deep=True)
# for c in SPLIT_PARAMS_module_dfs['ALL'].columns:
#     SPLIT_PARAMS_module_dfs['ALL'] = SPLIT_PARAMS_module_dfs['ALL'].explode(c)
# SPLIT_PARAMS_module_dfs['ALL'] = SPLIT_PARAMS_module_dfs['ALL'].reset_index(drop=True)

# df = pd.DataFrame({'shop': [1, 2, 10],
#                    'item': [0.5, 0.75, 0.25],
#                    'cate': [22, 33,11]}) #,index=['row1', 'row2'])
# print(df.to_dict('records'))
# print(df.to_dict('index'))
# print(df.to_dict('index', into=OrderedDict))

# # >>  [{'shop': 1, 'item': 0.5, 'cate': 22}, {'shop': 2, 'item': 0.75, 'cate': 33}, {'shop': 10, 'item': 0.25, 'cate': 11}]
# # >>  {0: {'shop': 1, 'item': 0.5, 'cate': 22}, 1: {'shop': 2, 'item': 0.75, 'cate': 33}, 2: {'shop': 10, 'item': 0.25, 'cate': 11}}
# # >>  OrderedDict([(0, {'shop': 1, 'item': 0.5, 'cate': 22}), (1, {'shop': 2, 'item': 0.75, 'cate': 33}), (2, {'shop': 10, 'item': 0.25, 'cate': 11})])

# model_params_dict = SPLIT_PARAMS_module_dfs['MODEL'].to_dict('index', into=OrderedDict)
# eda_params_dict = SPLIT_PARAMS_module_dfs['EDA'].to_dict('index', into=OrderedDict)
# data_conditioning_params_dict = SPLIT_PARAMS_module_dfs['DATA_CONDITIONING'].to_dict('index', into=OrderedDict)
# train_val_split_params_dict = SPLIT_PARAMS_module_dfs['TRAIN_VAL_SPLIT'].to_dict('index', into=OrderedDict)
# lgbm_setup_params_dict = SPLIT_PARAMS_module_dfs['LGBM_SETUP'].to_dict('index', into=OrderedDict)
# lgbm_fit_params_dict = SPLIT_PARAMS_module_dfs['LGBM_FIT'].to_dict('index', into=OrderedDict)
# output_processing_params_dict = SPLIT_PARAMS_module_dfs['OUTPUT_PROCESSING'].to_dict('index', into=OrderedDict)
# RUN_n = 0
# for model_iter, model_params in model_params_dict.items():
#     # ['model_type','feature_params','provided_data_files','provided_df_names'] 
#     model_done = (model_iter+1 == len(model_params_dict))

#     for eda_iter, eda_params in eda_params_dict.items():
#         # ['eda_delete_shops','eda_delete_item_cats','eda_scale_month','feather_stt'] 
#         # + ['model_type','feature_params','provided_data_files','provided_df_names']
#         eda_params.update(model_params)  # add to eda dict because feature_params, data files, names are needed for eda module
#         # load datafiles, adjust stt for monthly stats grouping, output = stt, items_enc, shops_enc, test (?replicate "test" with simple code?)
        
#     multiproc
#         load_dfs = eda_function(**eda_params) # stt, shops_enc, items_enc, test dataframes in dictionary (or their filenames if feathered to disk)
#         output_processing_params_dict[RUN_n]['test'] = load_dfs['test']  # dataframe, or feather file name
#         eda_done = (model_done and (eda_iter+1 == len(eda_params_dict)))

#         for data_iter, data_params in data_conditioning_params_dict.items():
#             # ['cartprod_fillna0','cartprod_first_month','cartprod_test_pairs','clip_train_H','clip_train_L','feather_monthly_stt','feature_data_type',
#             #  'minmax_scaler_range','robust_scaler_quantiles','use_cartprod_fill','use_categorical','use_minmax_scaler','use_robust_scaler']
#             #  + ['eda_delete_shops','eda_delete_item_cats','eda_scale_month','feather_stt']
#             #  + ['model_type','feature_params','provided_data_files','provided_df_names']
#             data_params.update(eda_params)  # add to data dict because feature_params, feather_stt needed for data conditioning module
#             #data_params['stt'], data_params['items_enc'], data_params['shops_enc'] = stt, items_enc, shops_enc
            
#     multiproc  ## 
#             monthly_stt, robust_scalers, minmax_scalers = data_conditioning_function(
#                                                                 load_dfs['stt'],load_dfs['shops_enc'],load_dfs['items_enc'], **data_params)
#             data_cond_done = (eda_done and (data_iter+1 == len(data_conditioning_params_dict)))
#             if data_cond_done:
#                 del load_dfs    # (no more splits that will affect build of monthly_stt)
#                 # discard stt, items_enc, shops_enc, and any intermediate dfs

#                 # merge with cartesian products, then with items_enc, shops_enc (do not create a new df; just use monthly_stt)
# 				# 	multiprocess --> pool(merge,[months list]) do months in parallel?
# 				# 	maybe split monthly_stt by months, then do the merges in parallel, then concatenate all the months back together
# 				# merge with lags (inplace with monthly_stt)
# 				# 	multiprocess --> pool(merge,[lags list]) do lags in parallel?; check for proper column order / reset if necessary
# 				# 	can I just add the shifted columns, and delete an N/A things where I don't have a shop-item match?, 
#                 #         or make a big df with all the lags and then just one single merge (how="left")
#                 # ?write to disk: monthly_stt.ftr feather? and inverse_scaler_xform?
#                 # outputs = monthly_stt (possibly on disk in ftr format), inverse_scaler_xform (in params df? or global? or on disk?)
#             for tvt_iter, tvt_split_params in train_val_split_params_dict.items():
#                 # ['feather_tvt_split','test_month','train_start_month','train_final_month','validate_months']
#                 # (train_X, feather_monthly_stt=True, feather_tvt_split=False, test_month=34, 
#                 #        train_start_month=13, train_final_month=29, validate_months=999)
#                 tvt_split_params['feather_monthly_stt'] = data_params['feather_monthly_stt']
#                 tvt_split_params['feature_data_type'] = data_params['feature_data_type']
#                 tvt_split_params['use_categorical'] = data_params['use_categorical']
#                 tvt_split_params['categorical_features'] = model_params['feature_params']['categorical']
#                 tvt_split_params['model_type'] = model_params['model_type']
#                 tvt_xy_datasets = {'train_X':monthly_stt}
#                 # train_X,train_y,val_X,val_y,test_X, feature_names =
#     NOTE: the call below is a candidate for multiprocessing
#                 DataSets, feature_names = tvt_split_function(tvt_xy_datasets, **tvt_split_params)
#                     # ??? allow dropping of certain feature columns here???? need an entry in tvt params df 
#                     #         (can't delete monthly_stt if plan to drop like this; would be best if saved to .ftr on disk)
#                     # ??? allow arbitrary setting of val months with a list?
#                     # read monthly_stt_temp.ftr into train_X 
#                     # 	test_X --> remove month 34 from train, rename into test_X; remove y_target column from test_X
#                     # 	val_X --> remove val months from train, rename into val_X
#                     # 	val_y --> pop y_target column from val_X, rename into val_y
#                     # 	train_y --> pop y_target column from train, rename into train_y
#                     # feature_names = train_X.columns
#                 for lgbm_setup_iter, lgbm_setup_params in lgbm_setup_params_dict.items():
#                     for lgbm_fit_iter, lgbm_fit_params in lgbm_fit_params_dict.items():
#                         model = lgbm_function(DataSets,**lgbm_setup_params,**lgbm_fit_params)
#                         for output_processing_iter, output_processing_params in output_processing_params_dict.items():
#                             OUTPUTS_df[RUN_n]['feature_names'] = feature_names
#                             OUTPUTS_df[RUN_n][other stuff] = model_output_function(model, DataSets,inverse_scaler_xform, output_processing_params)
#                                 # * do predictions
#                                 # * do inverse scaling
#                                 # * do clipping
#                                 # * compute rmse, r2 for train, val
#                                 # * compute test prediction / merge into proper shape/columns, and save to disk as model_filename+str(iter number)
#                                 # * compute feature importances
#                                 # save intermediate (or final) parameters+outputs dataframe to disk
#                                 # run = [model_iter*max(eda_iter)+eda_iter*max(data_iter)+data_iter*max(tvt_iter)+tvt_iter + setup + fit + process], or:
#                                 # RUN_n += 1
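
The nested parameter loops in the outline above can be flattened into a single run counter. The following is a minimal sketch, assuming each stage keeps its parameter sets in a plain dict keyed by iteration number. The names `enumerate_runs`, `model_sets`, `eda_sets`, and `fit_sets` are hypothetical and do not appear in the notebook.

```python
from itertools import product

def enumerate_runs(*stage_param_dicts):
    """Yield (run_n, tuple_of_param_dicts), one per combination of stage settings.

    Each argument is a dict mapping an iteration key to that stage's parameters.
    The last stage varies fastest, matching the innermost loop of a nested for.
    """
    stage_values = [list(d.values()) for d in stage_param_dicts]
    for run_n, combo in enumerate(product(*stage_values)):
        yield run_n, combo

# Hypothetical parameter sets for three stages (illustrative values only)
model_sets = {0: {"model_type": "lgbm"}}
eda_sets = {0: {"eda_scale_month": True}, 1: {"eda_scale_month": False}}
fit_sets = {0: {"early_stopping_rounds": 20}, 1: {"early_stopping_rounds": 50}}

runs = list(enumerate_runs(model_sets, eda_sets, fit_sets))
```

This keeps `RUN_n` consistent without the hand-computed index formula, at the cost of recomputing every stage per run unless intermediate results are cached.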

# ens = ensembling_function(output file names):
#     # average, weighted-average, other method, to combine anything already saved to disk (default = straight avg of all runs in above loop)
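
The straight and weighted averages mentioned above might look like the sketch below. It assumes the per-run predictions have already been loaded from disk into equal-length arrays (loading is omitted), and the function name `ensemble_predictions` is hypothetical.

```python
import numpy as np

def ensemble_predictions(predictions, weights=None, clip_range=(0, 20)):
    """Combine per-run prediction vectors into one.

    predictions: list of equal-length 1-D arrays, one per run.
    weights: optional per-run weights; None gives a straight average.
    clip_range: final clipping bounds (the competition clips targets to [0, 20]).
    """
    stacked = np.vstack(predictions)                      # shape (n_runs, n_rows)
    blended = np.average(stacked, axis=0, weights=weights)
    return np.clip(blended, *clip_range)

# Toy predictions from two runs (made-up numbers)
run_a = np.array([0.0, 2.0, 30.0])
run_b = np.array([1.0, 4.0, 10.0])
straight = ensemble_predictions([run_a, run_b])
weighted = ensemble_predictions([run_a, run_b], weights=[3, 1])
```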

# next_runs = compute_trends(output  results):
#     # look at feature importances all together, and see if anything is obviously good or bad
#     # look at splits and see if any parameters are obviously good or bad (correlation matrix of parameters with output results?)
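
One way to read the "correlation matrix of parameters with output results" idea is sketched below. It assumes one row per run, with numeric parameter columns and a score column; the column names, the values, and the function `parameter_score_correlation` are made up for illustration.

```python
import pandas as pd

def parameter_score_correlation(runs_df, score_col):
    """Return the Pearson correlation of each numeric parameter with the score."""
    numeric = runs_df.select_dtypes("number")
    return numeric.corr()[score_col].drop(score_col).sort_values()

runs_df = pd.DataFrame({
    "learning_rate": [0.01, 0.05, 0.10, 0.20],
    "num_leaves":    [31, 63, 127, 255],
    "val_rmse":      [0.95, 0.93, 0.92, 0.94],   # made-up scores, not real results
})
trend = parameter_score_correlation(runs_df, "val_rmse")
```

A correlation only picks up roughly linear trends, so a parameter with a sweet spot in the middle of its range can look unimportant here.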


        # model_gbdt = lgb.LGBMRegressor(
        #     objective=param_df.at[iternum,'objective'],
        #     boosting_type=param_df.at[iternum,'boosting_type'],
        #     learning_rate=param_df.at[iternum,'learning_rate'],
        #     n_estimators=param_df.at[iternum,'n_estimators'],
        #     #metric=param_df.at[iternum,'metric'],
        #     subsample_for_bin=param_df.at[iternum,'subsample_for_bin'],
        #     num_leaves=param_df.at[iternum,'num_leaves'],
        #     max_depth=param_df.at[iternum,'max_depth'],
        #     min_split_gain=param_df.at[iternum,'min_split_gain'],
        #     min_child_weight=param_df.at[iternum,'min_child_weight'],
        #     min_child_samples=param_df.at[iternum,'min_child_samples'],
        #     colsample_bytree=param_df.at[iternum,'colsample_bytree'],
        #     random_state=param_df.at[iternum,'random_state'],
        #     silent=param_df.at[iternum,'silent'],          # removed in LightGBM 4.0; pass verbosity=-1 instead
        #     importance_type=param_df.at[iternum,'importance_type'],
        #     reg_alpha=param_df.at[iternum,'reg_alpha'],
        #     reg_lambda=param_df.at[iternum,'reg_lambda'],
        #     n_jobs=param_df.at[iternum,'n_jobs'],
        #     subsample=param_df.at[iternum,'subsample'],
        #     subsample_freq=param_df.at[iternum,'subsample_freq']
        #     )
# 'eval_metric','early_stopping_rounds','init_score','eval_init_score','verbose',
#                                          'feature_name','categorical_feature','callbacks'

        # model_gbdt.fit(
        #     data['X_train'],                                # Input feature matrix (array-like or sparse matrix of shape = [n_samples, n_features])
        #     data['y_train'],                                # The target values (class labels in classification, real numbers in regression) (array-like of shape = [n_samples])
        #     eval_set=[(data['X_val'], data['y_val'])],      # can have multiple tuples of validation data inside this list
        #     eval_names=None,                                # Names of eval_set (list of strings or None, optional (default=None))
        #     eval_metric=param_df.at[iternum,'eval_metric'],
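        #     early_stopping_rounds=param_df.at[iternum,'early_stopping_rounds'],  # removed from fit() in LightGBM 4.0; use callbacks=[lgb.early_stopping(...)]
        #     init_score=param_df.at[iternum,'init_score'],
        #     eval_init_score=param_df.at[iternum,'eval_init_score'],
        #     verbose=param_df.at[iternum,'verbose'],                              # removed from fit() in LightGBM 4.0; use callbacks=[lgb.log_evaluation(...)]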
        #     feature_name=param_df.at[iternum,'feature_name'],
        #     categorical_feature=param_df.at[iternum,'categorical_feature'],
        #     callbacks=param_df.at[iternum,'callbacks']
        #      )
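
After a fit like the one above, the clipping and RMSE steps from the output-processing outline could be scored as in this sketch. It assumes predictions are already on the original (unscaled) sales scale; `clipped_rmse` is a hypothetical helper, and the [0, 20] range comes from the competition description.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, low=0, high=20):
    """RMSE after clipping both targets and predictions into [low, high]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), low, high)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), low, high)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values: the third pair is clipped to (20, 20) and contributes no error
score = clipped_rmse([0, 3, 25], [1, 3, 40])
```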


###**Older method of defining features**

In [None]:

    # Lag Split Info:      months_list: [1, 2, 3, 4, 5, 6, 7, 8]
    #     params: OrderedDict([
    #     (1, OrderedDict([
    #         ('shop_id_x_item_id', {'group': ['shop_id', 'item_id'], 'stats': {'sales': ['sum', 'median', 'count'], 'revenue': ['sum']}}), 
    #         ('shop_id_x_item_category_id', {'group': ['shop_id', 'item_category_id'], 'stats': {'sales': ['count', 'sum', 'median'], 'revenue': ['sum']}}), 
    #         ('shop_id_x_item_cluster', {'group': ['shop_id', 'item_cluster'], 'stats': {'sales': ['sum', 'median']}}), 
    #         ('shop_id', {'group': ['shop_id'], 'stats': {'sales': ['sum', 'count']}}), 
    #         ('item_id', {'group': ['item_id'], 'stats': {'sales': ['median', 'sum', 'count'], 'revenue': ['sum']}}), 
    #         ('shop_group', {'group': ['shop_group'], 'stats': {'revenue': ['sum']}}), 
    #         ('item_category_id', {'group': ['item_category_id'], 'stats': {'sales': ['sum', 'count'], 'revenue': ['sum']}}), 
    #         ('item_group', {'group': ['item_group'], 'stats': {'sales': ['sum'], 'revenue': ['sum']}}), 
    #         ('item_cluster', {'group': ['item_cluster'], 'stats': {'sales': ['sum', 'count'], 'revenue': ['sum']}})])), 
    #     (2, OrderedDict([
    #         ('shop_id_x_item_id', {'group': ['shop_id', 'item_id'], 'stats': {'sales': ['sum', 'count'], 'revenue': ['sum']}}), ...
    #     ...

    #     stats_set: OrderedDict([
    #         ('shop_id_x_item_id', {'group': ['shop_id', 'item_id'], 'stats': {
    #             'shop_group': 'first', 'item_category_id': 'first', 'item_group': 'first', 'item_cluster': 'first', 
    #             'sales': ['count', 'sum', 'median'], 'revenue': ['sum']}}), 
    #         ('shop_id_x_item_category_id', {'group': ['shop_id', 'item_category_id'], 'stats': {'sales': ['count', 'sum', 'median'], 'revenue': ['sum']}}), 
    #         ('shop_id_x_item_cluster', {'group': ['shop_id', 'item_cluster'], 'stats': {'sales': ['sum', 'median']}}), 
    #         ('shop_id', {'group': ['shop_id'], 'stats': {'sales': ['sum', 'count']}}), 
    #         ('item_id', {'group': ['item_id'], 'stats': {'sales': ['median', 'sum', 'count'], 'revenue': ['sum']}}), 
    #         ('shop_group', {'group': ['shop_group'], 'stats': {'revenue': ['sum']}}), 
    #         ('item_category_id', {'group': ['item_category_id'], 'stats': {'sales': ['sum', 'count'], 'revenue': ['sum']}}), 
    #         ('item_group', {'group': ['item_group'], 'stats': {'sales': ['sum'], 'revenue': ['sum']}}), 
    #         ('item_cluster', {'group': ['item_cluster'], 'stats': {'sales': ['sum', 'count'], 'revenue': ['sum']}})])
    #     stats_set_feature_names: [ #'shop_group', 'item_category_id', 'item_group', 'item_cluster',
    #              'shop_id_x_item_id_sales_sum', 
    #             'shop_id_x_item_id_sales_median', 'shop_id_x_item_id_sales_count', 'shop_id_x_item_id_revenue_sum', 
    #             'shop_id_x_item_category_id_sales_sum', 'shop_id_x_item_category_id_sales_median', 'shop_id_x_item_category_id_sales_count', 
    #             'shop_id_x_item_cluster_sales_sum', 'shop_id_x_item_cluster_sales_median', 'shop_id_sales_sum', 'shop_id_sales_count', 
    #             'item_id_sales_sum', 'item_id_sales_median', 'item_id_sales_count', 'item_id_revenue_sum', 'shop_group_revenue_sum', 
    #             'item_category_id_sales_sum', 'item_category_id_sales_count', 'item_category_id_revenue_sum', 'item_group_sales_sum', 
    #             'item_group_revenue_sum', 'item_cluster_sales_sum', 'item_cluster_sales_count', 'item_cluster_revenue_sum', 
    #             'shop_id_x_item_category_id_revenue_sum']
    # stt_final: ['month', 'sales', 'revenue', 'shop_id', 'item_id', 'shop_group', 'item_category_id', 'item_group', 'item_cluster']
    # categorical: ['shop_id', 'shop_group', 'item_id', 'item_category_id', 'item_group', 'item_cluster']
    # integer: ['month', 'shop_id', 'shop_group', 'item_id', 'item_category_id', 'item_group', 'item_cluster']
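
The `group` / `stats` entries above map naturally onto a pandas `groupby(...).agg(...)` call, with output columns named `<set name>_<column>_<stat>`. The sketch below assumes that naming rule and uses a toy table; `grouped_stats` is a hypothetical helper, and the per-month grouping used in the notebook is left out for brevity.

```python
import pandas as pd

def grouped_stats(df, name, group, stats):
    """Aggregate `df` by `group` and name the columns '<name>_<column>_<stat>'."""
    agg = df.groupby(group).agg(stats)                    # columns become (column, stat) pairs
    agg.columns = [f"{name}_{col}_{stat}" for col, stat in agg.columns]
    return agg.reset_index()

# Toy sales rows (made-up numbers)
sales = pd.DataFrame({
    "shop_id": [5, 5, 5, 6],
    "item_id": [10, 10, 11, 10],
    "sales":   [1, 2, 4, 3],
    "revenue": [10.0, 20.0, 40.0, 30.0],
})
spec = {"group": ["shop_id", "item_id"],
        "stats": {"sales": ["sum", "count"], "revenue": ["sum"]}}
features = grouped_stats(sales, "shop_id_x_item_id", spec["group"], spec["stats"])
```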



# .apply(pd.to_numeric, downcast='float')   # valid downcast values: 'integer', 'signed', 'unsigned', 'float'


# print("\nModel Name Parameters:");               pprint.pprint(SPLIT_PARAMS_module_dfs['MODEL'].iloc[0].to_dict()) # convert all of these dataframes to dicts for pretty printing
# print("\nEDA Parameters:");                      pprint.pprint(SPLIT_PARAMS_module_dfs['EDA'].iloc[0].to_dict())
# print("\nData Conditioning Parameters:");        pprint.pprint(SPLIT_PARAMS_module_dfs['DATA_CONDITIONING'].iloc[0].to_dict()) # print(params_data_conditioning_dict)
# print("\nTrain-Val-Test Splitting Parameters:"); pprint.pprint(SPLIT_PARAMS_module_dfs['TRAIN_VAL_SPLIT'].iloc[0].to_dict())
# print("\nLGBM Regressor Setup Parameters:");     pprint.pprint(SPLIT_PARAMS_module_dfs['LGBM_SETUP'].iloc[0].to_dict()) # pprint.pprint(PARS.filter(like='', axis=1).iloc[0].to_dict())
# print("\nLGBM Regressor Fit Parameters:");       pprint.pprint(SPLIT_PARAMS_module_dfs['LGBM_FIT'].iloc[0].to_dict())  # ,width=200,compact=True)
# print("\nOutput Processing Parameters:");        pprint.pprint(SPLIT_PARAMS_module_dfs['OUTPUT_PROCESSING'].iloc[0].to_dict())   

# # convert to dict for pretty printing
# outputs_dict =              OUTPUTS_df.iloc[0].to_dict()
# params_model_dict =         PARAMS_MODEL_df.iloc[0].to_dict()
# params_lgbm_setup_dict =    PARAMS_LGBM_SETUP_df.iloc[0].to_dict()        
# params_lgbm_fit_dict =      PARAMS_LGBM_FIT_df.iloc[0].to_dict()
# params_eda_dict =           PARAMS_EDA_df.iloc[0].to_dict()
# params_tvt_split_dict =     PARAMS_TVT_SPLIT_df.iloc[0].to_dict()
# params_data_manip_dict =    PARAMS_DATA_MANIP_df.iloc[0].to_dict()        

# FEATURES["LAGS_MONTHS"] = [1,2,3,4,5,6,7,8]  # month lags to include in model 
# FEATURES["LAG_FEATURES"] = OrderedDict()
# for i in FEATURES["LAGS_MONTHS"]:
#     FEATURES["LAG_FEATURES"][i] = ['y_sales', 'shop_id_x_item_category_id_sales_sum', 'item_id_sales_sum', 'item_cluster_sales_sum'] 
# # SECTION BELOW: manually remove some of the above-included features, as determined by feature importances to be likely unhelpful
# # keep at least the highest importance feature for each lag, but remove all others with < 20% importance (month 13-32 training)
# FEATURES["LAG_FEATURES"][8] = [e for e in FEATURES["LAG_FEATURES"][8] if e not in {'item_id_sales_sum','item_cluster_sales_sum','shop_id_x_item_category_id_sales_sum'}]


# # LAG_STATS_SET is SET of all aggregate statistics columns for all lags (allows us to shed the other stats, keeping memory requirements lower)
# LAG_STATS_SET = [] # FEATURES["LAG_FEATURES"][1]
# FEATURES["ALL_FEATURES"] = [] #FEATURES["LAG_FEATURES"][1]
# for m in FEATURES["LAGS_MONTHS"]: #[1:]:
#     LAG_STATS_SET = LAG_STATS_SET + [x for x in FEATURES["LAG_FEATURES"][m] if x not in LAG_STATS_SET]
#     FEATURES["ALL_FEATURES"] = FEATURES["ALL_FEATURES"] + [x + "_L" + str(m) for x in FEATURES["LAG_FEATURES"][m]]
# FEATURES["ALL_FEATURES"] = INTEGER_COLS + FEATURES["ALL_FEATURES"]
# N_FEATURES = len(FEATURES["ALL_FEATURES"])
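
The `_L<m>` feature names built above suggest lagging by shifting the month index and left-merging. Here is a minimal sketch of that idea, assuming one row per (month, shop_id, item_id); `add_lag_features` is a hypothetical helper, not the notebook's implementation.

```python
import pandas as pd

def add_lag_features(monthly, lag_features, keys=("shop_id", "item_id")):
    """Left-merge lagged copies of selected columns onto `monthly`.

    lag_features maps a lag in months to the list of columns to lag.
    Rows with no match `lag` months earlier get NaN in the new columns.
    """
    keys = list(keys)
    out = monthly.copy()
    for lag, cols in lag_features.items():
        shifted = monthly[["month"] + keys + cols].copy()
        shifted["month"] += lag          # month m values become features for month m + lag
        shifted = shifted.rename(columns={c: f"{c}_L{lag}" for c in cols})
        out = out.merge(shifted, on=["month"] + keys, how="left")
    return out

# Toy monthly table for a single shop-item pair (made-up numbers)
monthly = pd.DataFrame({
    "month":   [0, 1, 2],
    "shop_id": [5, 5, 5],
    "item_id": [10, 10, 10],
    "y_sales": [3.0, 4.0, 6.0],
})
lagged = add_lag_features(monthly, {1: ["y_sales"], 2: ["y_sales"]})
```

Merging one lag at a time keeps the logic simple; building all lagged frames first and merging once may use less peak memory on the full dataset, but that is untested here.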

# FEATURES["_CARTPROD_FILLNA0"]         = ITERS["_cartprod_fillna0"]
# FEATURES["_CARTPROD_FIRST_MONTH"]     = ITERS["_cartprod_first_month"]
# FEATURES["_CARTPROD_TEST_PAIRS"]      = ITERS["_cartprod_test_pairs"]
# FEATURES["_CLIP_TRAIN_H"]             = ITERS["_clip_train_H"]            
# FEATURES["_CLIP_TRAIN_L"]             = ITERS["_clip_train_L"]                   
# FEATURES["_CLIP_PREDICT_H"]           = ITERS["_clip_predict_H"]          
# FEATURES["_CLIP_PREDICT_L"]           = ITERS["_clip_predict_L"]    
# FEATURES["_EDA_DELETE_SHOPS"]         = ITERS["_eda_delete_shops"]
# FEATURES["_EDA_DELETE_ITEM_CATS"]     = ITERS["_eda_delete_item_cats"]
# FEATURES["_EDA_SCALE_MONTH"]          = ITERS["_eda_scale_month"]
# FEATURES["_FEATURE_DATA_TYPE"]        = ITERS["_feature_data_type"]
# FEATURES["_MINMAX_SCALER_RANGE"]      = ITERS["_minmax_scaler_range"]
# FEATURES["_MODEL_NAME_BASE"]          = ITERS["_model_filename"]
# FEATURES["_MODEL_TYPE"]               = ITERS["_model_type"]
# FEATURES["_ROBUST_SCALER_QUANTILES"]  = ITERS["_robust_scaler_quantiles"]     
# FEATURES["_TEST_MONTH"]               = ITERS["_test_month"]
# FEATURES["_TRAIN_START_MONTH"]        = ITERS["_train_start_month"]
# FEATURES["_TRAIN_FINAL_MONTH"]        = ITERS["_train_final_month"]
# FEATURES["_USE_CARTPROD_FILL"]        = ITERS["_use_cartprod_fill"]
# FEATURES["_USE_CATEGORICAL"]          = ITERS["_use_categorical"]         
# FEATURES["_USE_ROBUST_SCALER"]        = ITERS["_use_robust_scaler"]      
# FEATURES["_USE_MINMAX_SCALER"]        = ITERS["_use_minmax_scaler"]
# FEATURES["_VALIDATE_MONTHS"]          = ITERS["_validate_months"]

# pprint.pprint(ITERS,width=200,compact=True)
# pprint.pprint(ALL_PARAMS,width=200,compact=True)

# print(ITERS[['random_state','n_estimators','boosting_type','metric']].iloc[0].to_list())   # selecting only certain variables from a certain iteration line (0 in this case)



# class Delete_Me:
#     def __init__(self, name, df):
#         self.name = name
#         self.df = df
#     def clear_memory(self,dataframe):
#         print(f'Removing {self.name}')
#         del dataframe   # NOTE: only unbinds this method's local name; the caller's reference keeps the data alive
#         gc.collect()
#         #dataframe = pd.DataFrame(np.zeros((1,1),dtype=np.int8)) # not sure whether this line really helps
#         return True

# def rm_df(rm_dict={'df':pd.DataFrame()}):
#     """
#     try to save memory by deleting unneeded dataframes
#     input is a dictionary of dataframe string names as keys, and dataframes as values
#     """
#     print(f'\nPrior to df delete, Google Colab runtime is using {virtual_memory().used / 1e9:.1f} GB of {virtual_memory().total / 1e9:.1f} GB available RAM\n')
#     for k,v in rm_dict.items():
#         rm_df_class = Delete_Me(k,v)
#         rm_df_class.clear_memory(v)
#         v = pd.DataFrame(np.zeros((1,1),dtype=np.int8))   # NOTE: rebinds the loop variable only; rm_dict[k] still holds the original
#         # try: del v
#         # except: print(f'DataFrame {k} delete error.')
#     #gc.disable()
#     #gc.collect()
#     print(f'\nAfter gc.collect(), Google Colab runtime is using {virtual_memory().used / 1e9:.1f} GB of {virtual_memory().total / 1e9:.1f} GB available RAM\n')
#     return 
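
A likely reason the helper above did not free memory is that `del` inside a function removes only that function's local name, while the caller (and the dictionary) still reference the object. The sketch below drops the reference from the container itself. `Payload` is a hypothetical stand-in for a large DataFrame, and `release` is an illustrative name; it assumes the dict holds the only remaining reference.

```python
import gc
import weakref

class Payload:
    """Stand-in for a large DataFrame (hypothetical; supports weak references)."""
    def __init__(self, n):
        self.data = list(range(n))

def release(store, names):
    """Drop the named objects from `store`, then run the garbage collector."""
    for name in names:
        store.pop(name, None)       # removes the dict's reference to the object
    gc.collect()

store = {"stt": Payload(1000), "items_enc": Payload(10)}
probe = weakref.ref(store["stt"])   # lets us observe whether the object is still alive

release(store, ["stt"])
```

If any other variable still points at the same object (for example a global `stt`), the memory stays allocated until that name is deleted or rebound as well.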


nocode=True