#**Quick processing of data sets for input to model**

**Data Cleaning, Feature Generation**

Andreas Theodoulou and Michael Gaidis (June, 2020)

#**Data Output from This Notebook**

##**1. A lightly-cleaned version of sales_train data set, merged with the test data set** (distinguishable as month = 34)
* *train_test_base*

##**2. Data sets to merge with the aforementioned data set, and also important to merge with Cartesian-Product rows that we insert into the training data.**
* *shops_features*
* *items_features*
* *date_adjustments*

###The intent is for the user to adapt these data sets as desired, in the IPynb notebook focused on modeling.




#**0.1 Mount Google Drive (Local File Storage/Repo For Colab)**

In [1]:
# click on the URL link presented to you by this command, get your authorization code from Google, then paste it into the input box and hit 'enter' to complete mounting of the drive
from google.colab import drive  
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


#**0.2 Configure Environment and Load Data Files**

In [2]:
# python libraries/modules used throughout this notebook (with some holdovers from other, similar notebooks)
# pandas data(database) storage, EDA, and manipulation
import pandas as pd
### pandas formatting
### Here's what I find works well for this particular IPynb, when using a FHD laptop monitor with a full-screen browser window containing my IPynb notebook:
pd.set_option("display.max_rows",120)     # Override pandas choice of how many rows to show, so, for example, we can see the full 84-row item_category dataframe instead of the first few rows, then ...., then the last few rows
pd.set_option("display.max_columns",26)   # Similar to row code above, we can show more columns than default  
pd.set_option("display.width", 230)       # Tune this to our monitor window size to avoid horiz scroll bars in output windows (but, the drawback is that we will get output text wrapping)
pd.set_option("max_colwidth", None)       # This is done, for example, so we can see full item name and not '...' in the middle
# Try to convince pandas to print without decimal places if a number is actually an integer (helps keep column width down, and highlights data types), or with precision = 3 decimals if a float
pd.options.display.float_format = lambda x : '{:.0f}'.format(x) if round(x,0) == x else '{:,.3f}'.format(x)

# Pandas additional enhancements
pd.set_option('compute.use_bottleneck', False)  # bottleneck accelerates NaN-aware reductions when it is installed; False turns that acceleration off (set to True to enable it)
pd.set_option('compute.use_numexpr', False)     # numexpr accelerates boolean/arithmetic operations on large dataframes, including subexpressions in DataFrame.query() and pandas.eval(); False turns it off (set to True to enable it)


# data visualization
import matplotlib.pyplot as plt
# ipynb magic command to render matplotlib figures inline (as static images) in the notebook output
%matplotlib inline  

# computations
import numpy as np

# file operations
import os
from urllib.parse import urlunparse
from pathlib import Path
import feather   # this is 3x to 8x faster than pd.read_csv and pd.to_hdf, but file size is 2x hdf and 10x csv.gz
import pickle

# misc. python enhancements
import re
import string
from itertools import product
from collections import OrderedDict, defaultdict   # defaultdict is used when building the shop category encodings
import time
import datetime
from time import sleep, localtime, strftime, tzset, strptime
os.environ['TZ'] = 'EST+05EDT,M3.2.0,M11.1.0'   # US Eastern time (DST runs from the second Sunday of March to the first Sunday of November); allows user to simply print a formatted version of the local date and time; helps keep track of what cells were run, and when
tzset()

# ML packages
import sklearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

print(f'done: {strftime("%a %X %x")}')

done: Sat 14:11:03 06/20/20


In [3]:
train_test_base_save = False #True  # set to false if you plan to read in from previously-created csv.gz file

GDRIVE_REPO_PATH = "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag"

data_files = [  #"readonly/final_project_data/shops.csv",
                #"data_output/shops_transl.csv",
                "data_output/shops_augmented.csv",
                "data_output/shops_new.csv",
               
                #"readonly/final_project_data/items.csv",
                #"data_output/items_transl.csv",
                "data_output/items_augmented.csv",
                "data_output/items_new.csv",
                "data_output/items_clustered_22170b.csv.gz",
               
                #"readonly/final_project_data/item_categories.csv",
                #"data_output/item_categories_transl.csv",
                "data_output/item_categories_augmented.csv",
                #"readonly/en_50k.csv",
               
                "readonly/final_project_data/sales_train.csv.gz",
                #"data_output/sales_train_cleaned.csv.gz",
               
                #"readonly/final_project_data/sample_submission.csv.gz",
                "readonly/final_project_data/test.csv.gz"
                ]


# Dict of helper code files, to be loaded and imported {filepath : import_as}
code_files = {}  # not used at this time; example dict = {"helper_code/kaggle_utils_at_mg.py" : "kag_utils"}


# GitHub file location info
git_hub_url = "https://raw.githubusercontent.com/migai"
repo_name = 'Kag'
branch_name = 'master'
base_url = os.path.join(git_hub_url, repo_name, branch_name)

if data_files:
    print('\n\ncsv files source directory: ', end='')
    %cd "{GDRIVE_REPO_PATH}"

    print("\nLoading csv Files from Google Drive repo into Colab...\n")

    # Loop to load the data files into appropriately-named pandas DataFrames
    for path_name in data_files:
        filename = path_name.rsplit("/")[-1]
        data_frame_name = filename.split(".")[0]
        exec(data_frame_name + " = pd.read_csv(path_name)")
        # if data_frame_name == 'sales_train':
        #     sales_train['date'] = pd.to_datetime(sales_train['date'], format = '%d.%m.%Y')
        print(f'DataFrame {data_frame_name}, shape = {eval(data_frame_name).shape} :')
        print(eval(data_frame_name).head(2))
        print("\n")
else: 
    %cd "{GDRIVE_REPO_PATH}"
    
print(f'\nDataFrame Loading Complete: {strftime("%a %X %x")}\n')



csv files source directory: /content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag

Loading csv Files from Google Drive repo into Colab...

DataFrame shops_augmented, shape = (60, 17) :
                       shop_name  shop_id                       en_shop_name  shop_city_population  shop_tested shop_type  shop_type_enc shop_city  shop_city_enc shop_federal_district  shop_federal_district_enc s_type_broad  \
0  !Якутск Орджоникидзе, 56 фран        0  ! Yakutsk Ordzhonikidze, 56 Franc                235600        False      Shop             20   Yakutsk             54               Eastern                         16         Shop   
1  !Якутск ТЦ "Центральный" фран        1       ! Yakutsk TC "Central" Franc                235600        False      Mall             50   Yakutsk             54               Eastern                         16         Mall   

   s_type_broad_enc fd_popdens  fd_popdens_enc        fd_gdp  fd_gdp_enc  
0                10     Remote     

#**1. *train_test_base***



***sales_train*** dataset outliers: 

* Clip these rows:
```
Shop 24, sales of item 20949 --> clip to 200
Shop 25, sales of item 20949 --> clip to 200 (row 1,708,207) or to 100 (rows 2,296,209 and 2,341,308)
```
* Delete these rows (use this reverse order for deleting if using .iloc): </br>
```
[2909818, 2909401, 2326930, 2257299, 1163158, 484683]
```
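
Why the deletion order matters can be seen in a minimal sketch that uses a plain Python list in place of the DataFrame. The `drop_positions` helper is hypothetical and not part of this notebook; it only illustrates that removing rows by position from the highest index downward keeps the remaining positions valid.

```python
def drop_positions(rows, positions):
    """Return a copy of rows with the given positional indices removed."""
    remaining = list(rows)
    # Highest position first, so positions still waiting to be deleted do not shift
    for pos in sorted(positions, reverse=True):
        del remaining[pos]
    return remaining

rows = ["a", "b", "c", "d", "e", "f"]
print(drop_positions(rows, [1, 3, 4]))   # ['a', 'c', 'f']
```

Deleting in ascending order instead would remove the wrong elements, because every deletion shifts all later positions down by one.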

***sales_train*** dataset shop overlap 
```
* Combine shop 11 into shop 10  (id == 11 --> set id = 10)
* Combine shop  0 into shop 57  (id ==  0 --> set id = 57)
* Combine shop  1 into shop 58  (id ==  1 --> set id = 58)
```

***sales_train*** dataset late-opening shop
```
* Multiply shop 36 sales by 31/15 to account for it being open only for the last 15 days of training.
The user can later scale again as desired to conform to the 30-day test month.
```
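
The shop merge and the partial-month scaling described above can be sketched on a single (shop_id, sales) pair. This is a plain-Python illustration under the assumption that the merge is applied before the scaling; `adjust_row` and the two constant names are hypothetical, chosen only for this example.

```python
MERGE_SHOPS = {0: 57, 1: 58, 11: 10}   # duplicate shop_id -> shop_id it is merged into
SCALE_SHOPS = {36: 31 / 15}            # shop_id -> multiplier for a partially-open month

def adjust_row(shop_id, sales):
    """Apply the shop merge, then the partial-month scaling, to one row."""
    shop_id = MERGE_SHOPS.get(shop_id, shop_id)      # unchanged if not a duplicate
    sales = sales * SCALE_SHOPS.get(shop_id, 1.0)    # unchanged if not a scaled shop
    return shop_id, sales

print(adjust_row(11, 3))                 # (10, 3.0)
shop, sales = adjust_row(36, 15)
print(shop, f"{sales:.2f}")              # 36 31.00
```

Multiplying the scaled value by a further 30/31 would express it in terms of the 30-day test month.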

***sales_train merge with test***
```
* Append the test shop-item pairs to the sales_train data set, so merging and feature generation
have the option of including these rows as well.
```

***Additional time-based features***
```
Replace the "date" and "date_block_num" columns with:
    1. 'day'    = integer value of day number, 
                starting at day = 0 for the first training set transaction, and incrementing by "calendar" day number 
                (not by "transaction" day number).
                Thus, 'day' may not include all possible integers from start to finish.  
                It only assigns integer values (based on the calendar) to days when there are transactions in the 
                input dataframe --> if the input dataframe has no transactions on a particular day, that day's 
                "calendar" integer value will not be present in the column.
    2. 'week'   = integer value of week number, 
                however, unlike 'day', the 'week' number is aligned not to start at the first training set transaction, 
                but rather so that there is a full 'week' of 7 days that ends on Oct. 31, 2015 (the final day of training data).  
                If using the full sales_train data set, this results in week = 0 having only 5 days in it. 
                The final week of October, 2015 is assigned 'week' number = 147.  
                Arbitrarily assigning test to "Nov. 1, 2015" results in test week = 148
    3. 'month'  = renamed from "date_block_num" of original data set (no changes).  
                Integer values from 0 to 33 represent the months starting at day0.  Test month == 34 is Nov. 2015.
    4. 'qtr'    = quarter = integer number of 3-month chunks of time, aligned with the end of October, 2015.  
                day0 is included in 'qtr' = 0, but 'qtr'=0 only contains 1 month (Jan 2013) of data due to the alignment.
                The months of August, Sept, Oct 2015 form 'qtr' = 11.  "qtr" in this sense is just 3-month chunks... 
                it is not the traditional Q1,Q2,Q3,Q4 beginning Jan 1, but instead is more like date_block_num in that it 
                is monotonically increasing integers, incremented every 3 months.
    5. 'season' = integer number of 3-month chunks of time, reset each year (allowed values = 0,1,2,3)... 
                not quite the same as spring-summer-winter-fall, or Q1,Q2,Q3,Q4, but instead shifted to 
                better capture seasonal spending trends aligned in particular with high December spending
                2 = Dec 1 to Feb 28/29 (biggest spending season), 3 = Mar 1 to May 31, 
                0 = June 1 to Aug 31 (lowest spending season), 1 = Sept 1 to Nov 30
```
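
The five definitions above can be checked with a small standard-library sketch that computes the features for one calendar date. It follows the prose description, assumes day0 = Jan. 1, 2013 (so that month 0 is a January), and uses a hypothetical helper name, `time_features`.

```python
import datetime

DAY0 = datetime.date(2013, 1, 1)   # first day of the training data (assumed to be Jan. 1)

def time_features(date, day0=DAY0):
    """Compute the five time-based features for one calendar date."""
    day = (date - day0).days
    month = (date.year - day0.year) * 12 + (date.month - day0.month)
    return {
        "day": day,
        "week": (day + 2) // 7,       # +2 days so that a full week ends on Oct. 31, 2015
        "month": month,
        "qtr": (month + 2) // 3,      # +2 months so that a full quarter ends with Oct. 2015
        "season": ((month + 1) // 3 + 2) % 4,   # Dec-Feb = 2, Mar-May = 3, Jun-Aug = 0, Sep-Nov = 1
    }

print(time_features(datetime.date(2015, 10, 31)))
# {'day': 1033, 'week': 147, 'month': 33, 'qtr': 11, 'season': 1}
print(time_features(datetime.date(2015, 11, 1)))
# {'day': 1034, 'week': 148, 'month': 34, 'qtr': 12, 'season': 1}
```

The two printed dates are the final training day and the date assigned to the test set, so the results can be compared directly with the week, month, and qtr values quoted in the description.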


**3,150,043 rows**: the 2,935,843 training rows that remain ***with outliers removed or clipped***</br>
and ***with identical shops merged together***, plus the 214,200 appended test rows
</br>

**9 columns:**
 * month (int8, ordinal, 0 to 34) = date_block_num
 * day (int16, ordinal, 0 to 1034; the test rows are at day 1034) 
 * week (int16, ordinal, 0 to 148; 148 exceeds the int8 maximum of 127)
 * qtr (int8, ordinal, 0 to 12)
 * season (int8, categorical, 0 to 3) 
 * shop_id (int8, categorical, 2 to 10 and 12 to 59)
 * item_id (int16, categorical, 0 to 22,169)
 * price (float32, continuous, max is near 60,000) = item_price
 * sales (int16, continuous, range is roughly -20 to 1000) = item_cnt_day
</br>

```
train_test_base dataframe creation started: Sat 14:11:09 06/20/20

Shape of original sales_train data set = (2935849, 6)
Rows being clipped:
  1,501,160 sales clipped to 200
  1,708,207 sales clipped to 200
  2,296,209 sales clipped to 100
  2,341,308 sales clipped to 100
Rows being deleted:
  2909818
  2909401
  2326930
  2257299
  1163158
  484683
Shape of sales_train_cleaned after 6 outlier rows were removed: (2935843, 6)
Shape of sales_train_cleaned after merging shops as in {0: 57, 1: 58, 11: 10}: (2935843, 6)
Shops being scaled:
  shop 36 scaled by 2.07
Shape of traintest after appending test to sales_train_cleaned: (3150043, 6)
Shape of traintest after creating time-based feature columns: (3150043, 9)
traintest DataFrame creation done: Sat 14:13:28 06/20/20

/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
train_test_base.csv.gz file stored on google drive in data_output directory
train_test_base file save done: Sat 14:14:02 06/20/20

Example: display(train_test_base[train_test_base.week == 102].tail(2))

           day    week    qtr    season    month    price    sales  shop_id  item_id
2257039	718	  102	  8	    1	    23      399	    1	59	21970
2257040	718	  102	  8	    1	    23	  499	    1	59	22060

train_test_base done: Sat 14:14:02 06/20/20
```

##**1.1 Merge data sets and create day, week, quarter, and season feature columns**

In [4]:
# determine which rows I need to clip for shops 24 and 25
stclip = sales_train.query('((shop_id == 24) and (item_cnt_day > 100)) or ((shop_id == 25) and (item_cnt_day > 100))')
stclip

               date  date_block_num  shop_id  item_id  item_price  item_cnt_day
862929   17.09.2013               8       25     3732    2545.135           264
862945   17.09.2013               8       25     3734    2548.455           110
868495   05.09.2013               8       25     2808       999.0           133
1501160  15.03.2014              14       24    20949         5.0           405
1708207  28.06.2014              17       25    20949         5.0           501
2176634  18.11.2014              22       25     3733    3070.565           147
2296209  30.12.2014              23       25    20949         5.0           205
2341308  17.01.2015              24       25    20949         5.0           222
2567454  14.04.2015              27       25     3731    1941.995           207
2615667  19.05.2015              28       25    10209    1490.892           148


In [5]:
def clean_merge_augment(day0 = datetime.datetime(2013,1,1),
                        clip_rows = {1501160: 200, 1708207: 200, 2296209: 100, 2341308: 100},
                        delete_rows = [2909818, 2909401, 2326930, 2257299, 1163158, 484683],
                        merge_shops = {0: 57, 1: 58, 11: 10},
                        scale_shops = {36: 31/15},
                        dropout_repair = {},
                        delete_shops = []):
    """
    Parameters:
    day0 = datetime.datetime object representing the day you wish to use as your reference when creating time-based features
    clip_rows = not quite as bad as erroneous outliers, but sales are so unlike other days that clipping should help
    delete_rows = list of integer row numbers that you wish to delete from the sales_train data set, e.g., from outliers/erroneous rows
    merge_shops = dictionary of integer shop_id key:value pairs where shop(=key) is merged into shop(=value)
    scale_shops = artificially adjust sales amounts for shops that are only open for partial amounts of months
    dropout_repair = optional, fill in the dropouts where shops are apparently erroneously missing sales (I am deferring this for now, as it looks to be of only marginal use, and not easy to do in a robust way)
    delete_shops = optional, can delete shops if you think they are not of value to training (I am pushing this to the modeling IPynb for easier iteration)

    Global Variables: this function assumes you have the following pandas dataframes available globally:
    1) unaltered sales_train
    2) unaltered test

    This function does the following:
    1) clips moderate outlier rows
    2) cleans (deletes) severe outlier rows from the training set that appear to be erroneous or irrelevant entries
    3) merges 3 shops into other shops where it appears that the sales_train set simply has different names for the 
        same shop at different time periods (shop 0 absorbed by 57; shop 1 absorbed by 58, shop 11 absorbed by 10)
    4) optionally delete shops entirely from the sales_train data set (e.g., for irrelevant shops)
    5) append the test set rows to the sales_train rows, using a date of November 1, 2015 for test
    6) adjust the 'date' column on the merged dataset to be in datetime format, so it looks like a string of format: 'YYYY-M-D'

    Then, creates and inserts new time-based feature columns as follows:
    Given a dataframe with a 'date' column containing strings like '2015-10-30', create new time-series columns:
    1. 'day'    = integer value of day number, starting at day = 0 for parameter day0, and incrementing by calendar day number (not by transaction day number)... 
                    Thus, 'day' may not include all possible integers from start to finish.  It only assigns integer values (based on the calendar) to days when 
                    there are transactions in the input dataframe --> if the input dataframe has no transactions on a particular day, that day's 'calendar' integer 
                    value will not be present in the column
    2. 'week'   = integer value of week number, with week = 0 at time= parameter day0.  However, unlike 'day', the 'week' number is aligned not to start at day0, but rather
                    so that there is a full 'week' of 7 days that ends on Oct. 31, 2015 (the final day of training data).  This results in week = 0 having only 5 days in it.
                    n.b., the final week of October, 2015 is assigned 'week' number = 147.  Artificially assigning test to Nov. 1, 2015 results in test week = 148
    3. 'month'  = renamed from "date_block_num" of original data set (no changes).  Integer values from 0 to 33 represent the months starting at day0.  Test month=34 is Nov. 2015.
    4. 'qtr'    = quarter = integer number of 3-month chunks of time, aligned with the end of October, 2015.  day0 is included in 'qtr' = 0, but 'qtr'=0 only contains 1 month (Jan 2013) of data due to the alignment
                    The months of August, Sept, Oct 2015 form 'qtr' = 11.  "qtr" in this sense is just 3-month chunks... it is not the traditional Q1,Q2,Q3,Q4 beginning Jan 1, but instead is more like
                    date_block_num in that it is monotonically increasing integers, incremented every 3 months such that #11 ends at the end of our training data
    5. 'season' = integer number of 3-month chunks of time, reset each year (allowed values = 0,1,2,3)... not quite the same as spring-summer-winter-fall, or Q1,Q2,Q3,Q4, but instead shifted to 
                    better capture seasonal spending trends aligned in particular with high December spending
                    2 = Dec 1 to Feb 28/29 (biggest spending season), 3 = Mar 1 to May 31, 0 = June 1 to Aug 31 (lowest spending season), 1 = Sept 1 to Nov 30

    Finally, drop the date column from the dataframe, and sort the dataframe by ['day','shop_id','item_id']  (original dataframe seems to be sorted by month, but unsorted within each month)

    returns: the cleaned/dated/feature-augmented DataFrame
    """

    print(f'Shape of original sales_train data set = {sales_train.shape}')

    # clip moderate outliers (first make a DataFrame copy so we can reuse sales_train later, if we need to)
    sales_train_cleaned = sales_train.copy(deep=True)
    if clip_rows:
        print('Rows being clipped:')
        for k,v in clip_rows.items(): 
            sales_train_cleaned.at[k,'item_cnt_day'] = v
            print(f'  {k:,d} sales clipped to {v}')

    # remove outlier rows from training set 
    print('Rows being deleted:')
    for i in sorted(delete_rows, reverse=True):   # delete the rows in reverse order to be sure we don't run into issues with indexing
        print(f'  {i}')
        sales_train_cleaned.drop(sales_train_cleaned.index[i],inplace=True)
    print(f'Shape of sales_train_cleaned after {len(delete_rows)} outlier rows were removed: {sales_train_cleaned.shape}')
    
    # Merge the 3 shops we are nearly certain must correctly fit into the other shops' dropout regions:
    sales_train_cleaned.shop_id = sales_train_cleaned.shop_id.replace(merge_shops)
    print(f'Shape of sales_train_cleaned after merging shops as in {merge_shops}: {sales_train_cleaned.shape}')

    # scale shops if desired
    if scale_shops:
        print('Shops being scaled:')
        for k,v in scale_shops.items(): 
            sales_train_cleaned.loc[sales_train_cleaned.shop_id == k, 'item_cnt_day'] *= v   # vectorized: scales only the rows of shop k, avoiding a slow row-by-row apply
            print(f'  shop {k} scaled by {v:.2f}')

    # Remove irrelevant shops entirely from the sales_train_cleaned DataFrame:
    if delete_shops:
        sales_train_cleaned = sales_train_cleaned.query('shop_id != @delete_shops')
        print(f'Shape of sales_train_cleaned after deleting shops {delete_shops}: {sales_train_cleaned.shape}')

    # sales_train_cleaned = sales_train_cleaned[sales_train_cleaned.shop_id != 9]
    # sales_train_cleaned = sales_train_cleaned[sales_train_cleaned.shop_id != 13]
    # print(f'Shape of sales_train_cleaned after removal of shops: {sales_train_cleaned.shape})
    # print(f'{sales_train_cleaned.shop_id.nunique()} shops remaining in sales_train_cleaned DataFrame: {sorted(sales_train_cleaned.shop_id.unique())})

    sales_train_cleaned = sales_train_cleaned.astype({'date_block_num':np.int8,'shop_id':np.int8,'item_id':np.int16,
                                                    'item_price':np.float32,'item_cnt_day':np.int16}).reset_index(drop=True)

    # merge dataframes so we optionally include test elements in our EDA and feature generation
    test_prep = test.copy(deep=True)
    test_prep['date_block_num'] = 34
    test_prep['date'] = '1.11.2015' #pd.Timestamp(year=2015, month=11, day=1)
    traintest = pd.concat([sales_train_cleaned, test_prep]).fillna(0)   # DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way

    traintest = traintest[['date', 'date_block_num', 'item_price', 'item_cnt_day', 'shop_id', 'item_id']]
    traintest.columns = ['date', 'month', 'price', 'sales', 'shop_id', 'item_id']
    print(f'Shape of traintest after appending test to sales_train_cleaned: {traintest.shape}')
        
    # Add in the time-based feature columns
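    traintest.date = pd.to_datetime(traintest.date, format='%d.%m.%Y')   # dates are day.month.year strings, e.g. '30.10.2015'; an explicit format avoids the deprecated infer_datetime_format option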
    traintest.insert(1,'day', traintest.date.apply(lambda x: (x - day0).days))
    traintest.insert(2,'week', (traintest.day+2) // 7 )             # add the 2 days so we have end of a week coinciding with end of training data Oct. 31, 2015
    traintest.insert(3,'qtr', (traintest.month + 2) // 3 )          # add the 2 months so we have end of a quarter aligning with end of training data Oct. 31, 2015
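    traintest.insert(4,'season', ((traintest.month + 1) // 3 + 2) % 4 )   # 3-month seasons as in the docstring (month 0 = Jan. 2013): Dec-Feb = 2, Mar-May = 3, Jun-Aug = 0, Sep-Nov = 1; (month + 2) % 4 would change value every month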
    traintest.drop('date',axis=1,inplace=True)
    traintest = traintest.sort_values(['day','shop_id','item_id']).reset_index(drop=True)  # note that the train dataset is sorted by month, but nothing obvious within the month; we sort it here for consistent results in calculations below
    print(f'Shape of traintest after creating time-based feature columns: {traintest.shape}')
    print(f'traintest DataFrame creation done: {strftime("%a %X %x")}\n')
    return traintest

print(f'\nDone: {strftime("%a %X %x")}\n')


Done: Sat 14:11:09 06/20/20



In [6]:
if train_test_base_save:
    print(f'train_test_base dataframe creation started: {strftime("%a %X %x")}\n')
    train_test_base = clean_merge_augment()

    %cd "{GDRIVE_REPO_PATH}"
    # can save as csv.gz for < 100 MB storage and sync with GitHub
    # ('archive_name' is a zip-only option, so plain gzip compression is requested here)
    train_test_base.to_csv('data_output/train_test_base.csv.gz', index=False, compression='gzip')
    print("train_test_base.csv.gz file stored on google drive in data_output directory")
    print(f'train_test_base file save done: {strftime("%a %X %x")}')
else:
    # reuse the file saved by an earlier run (path is relative to GDRIVE_REPO_PATH, the current working directory)
    train_test_base = pd.read_csv('data_output/train_test_base.csv.gz')

display(train_test_base[train_test_base.week == 102].tail(2))

print(f'\ntrain_test_base done: {strftime("%a %X %x")}')

train_test_base dataframe creation started: Sat 14:11:09 06/20/20

Shape of original sales_train data set = (2935849, 6)
Rows being clipped:
  1,501,160 sales clipped to 200
  1,708,207 sales clipped to 200
  2,296,209 sales clipped to 100
  2,341,308 sales clipped to 100
Rows being deleted:
  2909818
  2909401
  2326930
  2257299
  1163158
  484683
Shape of sales_train_cleaned after 6 outlier rows were removed: (2935843, 6)
Shape of sales_train_cleaned after merging shops as in {0: 57, 1: 58, 11: 10}: (2935843, 6)
Shops being scaled:
  shop 36 scaled by 2.07
Shape of traintest after appending test to sales_train_cleaned: (3150043, 6)
Shape of traintest after creating time-based feature columns: (3150043, 9)
traintest DataFrame creation done: Sat 14:13:28 06/20/20

/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
train_test_base.csv.gz file stored on google drive in data_output directory
train_test_base file save done: Sat 14:14:02 06/20/20


         day  week  qtr  season  month  price  sales  shop_id  item_id
2257039  718   102    8       1     23    399      1       59    21970
2257040  718   102    8       1     23    499      1       59    22060



train_test_base done: Sat 14:14:02 06/20/20


#1. Create ***shops_new*** data file </br>
**60 rows**, corresponding to the 60 original shops</br>
**14 columns:**
 * shop_id  (categorical; 0-59, int8, from original data set)
 * shop_tested (categorical; bool, indicating if the shop is in *test* set)
 * shop_type (categorical; string object: online, small shop, mall, SEC, Mega)
 * shop_type_enc (categorical; int8, ordinal/weighted encoding based on number of rows present in *test*, scaled to cover roughly the same range as shop_id values (0-59))
 * shop_city (categorical; string)
 * shop_city_enc (categorical; int8, encoding weighted like shop_type)
 * shop_federal_district (categorical; string object)
 * shop_federal_district_enc (categorical; int8, encoding weighted like shop_type)
 * s_type_broad (categorical; string object: like shop_type, but fewer categories by merging together "Mall","Mega","SEC")
 * s_type_broad_enc (categorical; int8, ordinal encoding roughly weighted by shop size, 0-60 scale)
 * fd_popdens (categorical; string object: 4 categories named by population density in the shop's federal district)
 * fd_popdens_enc (categorical; int8; ordinal encoding weight based on population density)
 * fd_gdp (categorical; string object: 3 categories named by gdp per person)
 * fd_gdp_enc (categorical; int8; ordinal encoding weight based on gdp/person)
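
The weighted ordinal encoding used for the `*_enc` columns can be illustrated with a short plain-Python sketch: categories are ranked by ascending weight, and the ranks are then spread over roughly the 0-60 range of shop_id. The helper name `weighted_ordinal_encoding` and the example weights are hypothetical; only the ranking-and-scaling idea is taken from this notebook.

```python
def weighted_ordinal_encoding(weights, scale_to=60):
    """Rank categories by ascending weight, then spread the ranks over roughly 0..scale_to."""
    step = int(round(scale_to / len(weights)))
    ranked = sorted(weights, key=weights.get)   # lowest weight gets code 0
    return {category: rank * step for rank, category in enumerate(ranked)}

# Hypothetical weights: (test rows * 50) + (thousands of training items sold)
weights = {
    "Online": 2 * 5100 * 50 + 120,
    "Shop": 5 * 5100 * 50 + 300,
    "Mall": 20 * 5100 * 50 + 1500,
}
print(weighted_ordinal_encoding(weights))   # {'Online': 0, 'Shop': 20, 'Mall': 40}
```

Because the test-row term is multiplied by 50, it dominates the ordering, and the training-sales term only breaks ties between categories with the same test presence.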

In [None]:
# each of the shops in the test set has 5100 rows in the test set, but not all shops are present in the test set
# encode categories in the new dataset such that there is more weight given to the category values with more presence in the test set
shops_new = shops_augmented[['shop_id','shop_type','shop_city','shop_federal_district','shop_tested']].copy()   # the loaded shops_augmented file already names this column 'shop_type'
shops_new['test_rows'] = shops_new.shop_tested.apply(lambda x: 5100 if x else 0)
types = defaultdict.fromkeys(shops_new.shop_type.unique(),0)
cities = defaultdict.fromkeys(shops_new.shop_city.unique(),0)
feddists = defaultdict.fromkeys(shops_new.shop_federal_district.unique(),0)
for i in range(len(shops_new)):  # most of the weighting used to order these categories comes from the number of test rows; number of train rows helps to break ties
    train_items_sold = int(round(sales_train[sales_train.shop_id == i].item_cnt_day.sum()/1e3))
    types[shops_new.at[i,'shop_type']] += shops_new.at[i,'test_rows']*50 + train_items_sold
    cities[shops_new.at[i,'shop_city']] += shops_new.at[i,'test_rows']*50 + train_items_sold
    feddists[shops_new.at[i,'shop_federal_district']] += shops_new.at[i,'test_rows']*50 + train_items_sold

enc_types = defaultdict(np.int8)
enc_cities = defaultdict(np.int8)
enc_feddists = defaultdict(np.int8)
for feat in [[types,enc_types],[cities,enc_cities],[feddists,enc_feddists]]:
    enc = 0
    for cat in sorted(feat[0], key=feat[0].get): #, reverse=True):
        feat[1][cat] = enc * int(round((60 / len(feat[0]))))  # scale encoding to be similar to 0-60 range of shop_id
        enc+=1

shops_new['shop_type_enc'] = shops_new.shop_type.apply(lambda x: enc_types[x])
shops_new['shop_city_enc'] = shops_new.shop_city.apply(lambda x: enc_cities[x])
shops_new['shop_federal_district_enc'] = shops_new.shop_federal_district.apply(lambda x: enc_feddists[x])


# # From Wikipedia (2010, 2014, and 2017 numbers)

# Federal district	Population density(per km2)	GDP per capita (2017)
# Central	          59                          $11423
# Northwestern        8                           $10088
# Southern            33                          $5592
# North Caucasian     55                          $3262
# Volga               29                          $6388
# Ural                7                           $14819
# Siberian            4                           $6887
# Far Eastern         1                           $10767
# popdensity categories = remote, intermediate, populous; encode values as: (1+4+7+8)/4 = 5, (29+33)/2 = 31, (55+59)/2 = 57, online = overall avg = 196/8 = 24.5 --> 25
# gdp categories = bottom(3262), low((5592+6388+6887)/3)=6289), intermediate((10767+10088+11423)/3) = 10759), high(14819) --> use "high" for online; 
#      divide all by 250 to scale closer to shop_id encoding range --> 13, 25, 43, 60

type2cats_encs = {'featurename':'s_type_broad','enc_name':'s_type_broad_enc','Shop': ['Shop',10], 'Mall': ['Mall',60], 'Mega': ['Mall',60], 'SEC': ['Mall',60], 'Itinerant': ['Online',35], 'Online': ['Online',35]}
popdensitycats_encs = {'featurename':'fd_popdens','enc_name':'fd_popdens_enc','Eastern': ['Remote',5], 'South': ['Intermediate',31], 'Central': ['Populous',57], 'Northwestern': ['Remote',5], 
                  'None': ['Online',25], 'Volga': ['Intermediate',31], 'Siberian': ['Remote',5], 'Ural': ['Remote',5]}
gdpcats_encs = {'featurename':'fd_gdp','enc_name':'fd_gdp_enc','Eastern': ['Intermediate',43], 'South': ['Low',25], 'Central': ['Intermediate',43], 'Northwestern': ['Intermediate',43], 
                  'None': ['High',60], 'Volga': ['Low',25], 'Siberian': ['Low',25], 'Ural': ['High',60]}

for feat in [type2cats_encs]:
    shops_new[feat['featurename']] = shops_new.shop_type.apply(lambda x: feat[x][0])
    shops_new[feat['enc_name']] = shops_new.shop_type.apply(lambda x: feat[x][1])

for feat in [popdensitycats_encs,gdpcats_encs]:
    shops_new[feat['featurename']] = shops_new.shop_federal_district.apply(lambda x: feat[x][0])
    shops_new[feat['enc_name']] = shops_new.shop_federal_district.apply(lambda x: feat[x][1])

shops_new.drop('test_rows',axis=1,inplace=True)
shops_new = shops_new[['shop_id','shop_tested','shop_type','shop_type_enc','shop_city','shop_city_enc','shop_federal_district',
                       'shop_federal_district_enc','s_type_broad','s_type_broad_enc','fd_popdens','fd_popdens_enc','fd_gdp','fd_gdp_enc']]

shops_new = shops_new.astype({'shop_id':np.int8,'shop_tested':'bool','shop_type':'str','shop_type_enc':np.int8,'shop_city':'str',
                              'shop_city_enc':np.int8,'shop_federal_district':'str','shop_federal_district_enc':np.int8,
                              's_type_broad':'str','s_type_broad_enc':np.int8,'fd_popdens':'str','fd_popdens_enc':np.int8,'fd_gdp':'str','fd_gdp_enc':np.int8})
print('\n',shops_new.dtypes)
print('\n',shops_new)

# shops_new.to_csv("data_output/shops_new.csv", index=False)


 shop_id                        int8
shop_tested                    bool
shop_type                    object
shop_type_enc                  int8
shop_city                    object
shop_city_enc                  int8
shop_federal_district        object
shop_federal_district_enc      int8
s_type_broad                 object
s_type_broad_enc               int8
fd_popdens                   object
fd_popdens_enc                 int8
fd_gdp                       object
fd_gdp_enc                     int8
dtype: object

     shop_id  shop_tested  shop_type  shop_type_enc        shop_city  shop_city_enc shop_federal_district  shop_federal_district_enc s_type_broad  s_type_broad_enc    fd_popdens  fd_popdens_enc        fd_gdp  fd_gdp_enc
0         0        False       Shop             20          Yakutsk             54               Eastern                         16         Shop                10        Remote               5  Intermediate          43
1         1        False       Mall     

#2. Create ***items_new*** data file </br>
**22,170 rows**, corresponding to the 22,170 original items</br>
**8 columns:**
 * item_id  (categorical; range(22170), int16, from original data set)
 * item_tested (categorical; bool, indicating if item is in *test* set)
 * item_category_id (categorical; range(84), int8, from original data set)
 * cluster_code (categorical; int32, weighted encoding based on similarity of item names in a given cluster)
 * item_category3 (categorical; string)
 * item_category3_enc (categorical; int8, random nominal encoding done by pandas)
 * item_category4 (categorical; string)
 * item_category4_enc (categorical; int8, random nominal encoding done by pandas)
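
As a minimal sketch of how the `_enc` columns above come about, the toy example below (made-up rows, not the real item data) converts a string column to the pandas `category` dtype and reads back `.cat.codes`. For string labels, pandas assigns the codes in sorted label order, so the integers are arbitrary with respect to sales and carry no ordinal meaning for the model.

```python
import pandas as pd

# Toy stand-in for item_categories_augmented (hypothetical rows)
toy = pd.DataFrame({'item_category3': ['Movies', 'Software', 'Books', 'Movies']})

# Convert to the pandas categorical dtype, then read the integer codes
toy['item_category3'] = toy['item_category3'].astype('category')
toy['item_category3_enc'] = toy['item_category3'].cat.codes

# Codes follow the sorted label order: Books, Movies, Software
print(dict(enumerate(toy['item_category3'].cat.categories)))
print(toy)
```

With fewer than 128 categories the codes are stored as int8, which matches the dtype listed above.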


In [None]:

item_categories_augmented["item_category3"] = item_categories_augmented["item_category3"].astype('category')
item_categories_augmented["item_category3_enc"] = item_categories_augmented['item_category3'].cat.codes
item_categories_augmented["item_category4"] = item_categories_augmented["item_category4"].astype('category')
item_categories_augmented["item_category4_enc"] = item_categories_augmented['item_category4'].cat.codes

items_new = items_clustered_22170[['item_id','item_tested','item_category_id','cluster_code']].copy(deep=True)

items_new = items_new.merge(item_categories_augmented[['item_category_id','item_category3','item_category3_enc','item_category4','item_category4_enc']],how='left',on='item_category_id')
# print(len(items_new))
# print(items_new.head())
# print(items_new['item_category3_enc'].unique())
# print(items_new['item_category4_enc'].unique())
items_new = items_new.astype({'item_id':np.int16,'item_tested':'bool','item_category_id':np.int8,'cluster_code':np.int32,
                              'item_category3':'str','item_category3_enc':np.int8, 'item_category4':'str','item_category4_enc':np.int8}) 

print('\n',items_new.dtypes)
print('\n',items_new.head(20))

# items_new.to_csv("data_output/items_new.csv", index=False)


 item_id                int16
item_tested             bool
item_category_id        int8
cluster_code           int32
item_category3        object
item_category3_enc      int8
item_category4        object
item_category4_enc      int8
dtype: object

     item_id  item_tested  item_category_id  cluster_code item_category3  item_category3_enc item_category4  item_category4_enc
0         0        False                40           920         Movies                   7         Movies                   3
1         1        False                76          2600       Software                  10             PC                   5
2         2        False                40           802         Movies                   7         Movies                   3
3         3        False                40           330         Movies                   7         Movies                   3
4         4        False                40          1686         Movies                   7         Movies         

#3. Create ***sales_train_cleaned*** data file 
Remove outliers, shops 9 and 13, and merge shops 0,1,11 (and downcast)

</br>

**2,935,849 - 6 = 2,935,843 rows**, corresponding to the original rows ***with outliers removed*** (6 outlier rows deleted)

* Delete these rows (use this order for deleting if using .iloc): [2909818,2909401,2326930,2257299,1163158,484683]
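
The reason the deletion order matters for positional (`.iloc`-style) deletion can be seen in a small sketch. The toy frame below is hypothetical. Removing rows from the highest position downward leaves the lower positions untouched, whereas removing from the lowest position upward shifts every later row by one.

```python
import pandas as pd

df = pd.DataFrame({'v': list('abcdef')})   # positions 0..5

# Descending order: positions 4 ('e') and 1 ('b') are removed as intended
good = df.copy()
for pos in [4, 1]:
    good = good.drop(good.index[pos])

# Ascending order: after dropping position 1, position 4 now points at 'f'
bad = df.copy()
for pos in [1, 4]:
    bad = bad.drop(bad.index[pos])

print(good['v'].tolist())
print(bad['v'].tolist())
```

Dropping by index label instead (`df.drop(index=[...])`) avoids the ordering concern altogether, provided the frame still carries its original row labels.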

</br>

**2,914,268 rows**, after removal and merging of the shops:

* Combine shop 11 into shop 10 (i.e., wherever you see shop_id == 11, set it to shop_id = 10), so *sales_train* no longer contains any shop 11.
* Combine shop 0 into shop 57 (id == 0 --> set id = 57)
* Combine shop 1 into shop 58 (id == 1 --> set id = 58)
* Delete all *sales_train* rows where shop_id == 9
* Delete all *sales_train* rows where shop_id == 13

</br>

**6 columns:**
 * date (pd.datetime format)
 * date_block_num (int8) 
 * shop_id (categorical; int8, 55 unique values, inside the range 2 to 59, having removed shops 9 and 13, and merged shops 0,1,11) 
 * item_id (categorical; int16, range 0 to 22,169, with 21,671 unique values present)
 * item_price (float32; max is near 60,000; more than 4,000 rows have 0 < price <= 0.5, so don't round off to integer values)  
 * item_cnt_day (int16, range is -22 to 669 after outlier removal)
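
Before downcasting to the narrow integer types listed above, it can be worth confirming that every value fits. The helper below is a hypothetical sketch (the name `fits` and the toy frame are not part of this notebook). It compares each column's range against the limits reported by `np.iinfo`.

```python
import numpy as np
import pandas as pd

def fits(series, dtype):
    """Return True if every value in `series` is representable in integer `dtype`."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

# Toy frame using the extreme values quoted in this notebook
toy = pd.DataFrame({'shop_id': [2, 59],
                    'item_id': [0, 22169],
                    'item_cnt_day': [-22, 669]})
targets = {'shop_id': np.int8, 'item_id': np.int16, 'item_cnt_day': np.int16}

# Only downcast when every column passes the range check
all_fit = all(fits(toy[col], dt) for col, dt in targets.items())
if all_fit:
    toy = toy.astype(targets)

print(toy.dtypes)
```

A check like this guards against silent wrap-around, since `astype` on integer data does not always raise when a value is out of range.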


In [None]:
sales_train_cleaned = sales_train.copy(deep=True)
print(len(sales_train_cleaned))
# Drop the 6 outlier rows by index label, so the row printed is exactly the row removed
for i in [2909818,2909401,2326930,2257299,1163158,484683]:
    print(sales_train_cleaned.loc[[i]])
    sales_train_cleaned.drop(index=i,inplace=True)
print(len(sales_train_cleaned))

print(sales_train_cleaned.shop_id.nunique())

sales_train_cleaned = sales_train_cleaned[sales_train_cleaned.shop_id != 9]
print(len(sales_train_cleaned))
sales_train_cleaned = sales_train_cleaned[sales_train_cleaned.shop_id != 13]
print(len(sales_train_cleaned))

# Merge untested shops into their tested counterparts: 0 -> 57, 1 -> 58, 11 -> 10
sales_train_cleaned['shop_id'] = sales_train_cleaned.shop_id.replace({0: 57, 1: 58, 11: 10})


sales_train_cleaned = sales_train_cleaned.astype({'date_block_num':np.int8,'shop_id':np.int8,'item_id':np.int16,
                              'item_price':np.float32,'item_cnt_day':np.int16}) 

print('\n',sales_train_cleaned.dtypes)
print('\n',sales_train_cleaned.head())

# Note: 'archive_name' is a zip-only option in pandas; for gzip output the method name alone is sufficient
sales_train_cleaned.to_csv('data_output/sales_train_cleaned.csv.gz', index=False, compression='gzip')

2935849
              date  date_block_num  shop_id  item_id  item_price  item_cnt_day
2909818 2015-10-28              33       12    11373       0.909          2169
              date  date_block_num  shop_id  item_id  item_price  item_cnt_day
2909401 2015-10-14              33       12    20949           4           500
              date  date_block_num  shop_id  item_id  item_price  item_cnt_day
2326930 2015-01-15              24       12    20949           4          1000
              date  date_block_num  shop_id  item_id  item_price  item_cnt_day
2257299 2014-12-19              23       12    20949           4           500
              date  date_block_num  shop_id  item_id  item_price  item_cnt_day
1163158 2013-12-13              11       12     6066      307980             1
             date  date_block_num  shop_id  item_id  item_price  item_cnt_day
484683 2013-05-15               4       32     2973          -1             1
2935843
60
2932092
2914268

 date             

#4. Create ***sales_train_cln_mrg*** and ***test_mrg*** data files </br>
Remove outliers, adjust shops, merge with several encoded features from *shops_new* and *items_new*

</br>

##Merge overview: (how = "left")
left = *sales_train_cleaned* dataframe and/or *test* dataframe

right = *shops_new* on 'shop_id'

right = *items_new* on 'item_id'

***only merge select columns, to keep sales_train_cln_mrg filesize manageable***
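
A minimal sketch of this merge pattern on toy frames (all values hypothetical) is shown here. Passing `validate='many_to_one'` asks pandas to raise if the right-hand key is not unique, which is one way to make sure the left merge cannot add rows to the sales data.

```python
import pandas as pd

# Toy stand-ins for sales_train_cleaned and shops_new
sales = pd.DataFrame({'shop_id': [2, 2, 5],
                      'item_id': [10, 11, 10],
                      'item_cnt_day': [1, 2, 1]})
shops = pd.DataFrame({'shop_id': [2, 5, 7],
                      'shop_type': ['Mall', 'Shop', 'Online'],
                      'shop_type_enc': [40, 20, 5]})

# Merge only the encoded column, keeping every row of the left frame
merged = sales.merge(shops[['shop_id', 'shop_type_enc']],
                     how='left', on='shop_id', validate='many_to_one')

# A left merge against a unique key preserves the row count
assert len(merged) == len(sales)
print(merged)
```

Selecting the columns before merging, rather than dropping them afterwards, keeps the intermediate frame small.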

</br>

###***sales_train_cleaned*** dataframe:

**2,914,268 rows** (down from original 2,935,849 by removing outliers, shop 9, shop 13)

**6 columns:**

| dtype: | datetime64[ns] | int8 | int8 | int16 | float32 | int16 |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| row | date | date_block_num | shop_id | item_id | item_price | item_cnt_day |
| 0 | 2013-01-02 | 0 | 59 | 22154 | 999 | 1 |
| 1 | 2013-01-03 | 0 | 25 | 2552  | 899 | 1 |
| 2 | 2013-01-05 | 0 | 25 | 2552  | 899 | -1 |
| 3 | 2013-01-06 | 0 | 25 | 2554  | 1709 | 1 |
| 4 | 2013-01-15 | 0 | 25 | 2555  | 1099 | 1 |

</br>

###***shops_new*** dataframe:
**60 rows**, corresponding to the 60 original shops</br>
**14 columns** -- we will merge the bold columns below, on = 'shop_id'
 * shop_id  (categorical; 0-59, int8, from original data set)
 * ***shop_tested*** (categorical; bool, indicating if the shop is in *test* set)
 * shop_type (categorical; string object: online, small shop, mall, SEC, Mega)
 * ***shop_type_enc*** (categorical; int8, ordinal/weighted encoding based on number of rows present in *test*, scaled to cover roughly the same range as shop_id values (0-59))
 * shop_city (categorical; string)
 * ***shop_city_enc*** (categorical; int8, encoding weighted like shop_type)
 * shop_federal_district (categorical; string object)
 * ***shop_federal_district_enc*** (categorical; int8, encoding weighted like shop_type)
 * s_type_broad (categorical; string object: like shop_type, but fewer categories by merging together "Mall","Mega","SEC")
 * ***s_type_broad_enc*** (categorical; int8, ordinal encoding roughly weighted by shop size, 0-60 scale)
 * fd_popdens (categorical; string object: 4 categories named by population density in the shop's federal district)
 * ***fd_popdens_enc*** (categorical; int8; ordinal encoding weight based on population density)
 * fd_gdp (categorical; string object: 3 categories named by gdp per person)
 * ***fd_gdp_enc*** (categorical; int8; ordinal encoding weight based on gdp/person)

</br>

###***items_new*** dataframe:

**22,170 rows**, corresponding to the 22,170 original items</br>
**8 columns** -- we will merge the 5 bold columns below, on = 'item_id'
 * item_id  (categorical; 0 - 22169, int16, from original data set)
 * ***item_tested*** (categorical; bool, indicating if item is in *test* set)
 * ***item_category_id*** (categorical; 0-83, int8, from original data set)
 * ***cluster_code*** (categorical; int32, weighted encoding based on similarity of item names in a given cluster)
 * item_category3 (categorical; string)
 * ***item_category3_enc*** (categorical; int8, random nominal encoding done by pandas)
 * item_category4 (categorical; string)
 * ***item_category4_enc*** (categorical; int8, random nominal encoding done by pandas)


In [None]:
print(f'Number of rows in sales_train: {len(sales_train)}')
print(f'Number of rows in sales_train_cleaned: {len(sales_train_cleaned)}')
print(f'Number of columns in sales_train_cleaned: {len(sales_train_cleaned.columns)}')
print(f'Column datatypes for sales_train_cleaned:\n{sales_train_cleaned.dtypes}')
print(f'\nFirst 2 rows of sales_train_cleaned:\n{sales_train_cleaned.head(2)}')
print('\n')
print(f'Number of rows in test: {len(test)}')
print(f'Number of columns in test: {len(test.columns)}')
print(f'Column datatypes for test:\n{test.dtypes}')
print(f'\nFirst 2 rows of test:\n{test.head(2)}')
print('\n')

# Merge shop category encodings
sales_train_cln_mrg = sales_train_cleaned.merge(shops_new[['shop_id','shop_tested','shop_type_enc','shop_city_enc','shop_federal_district_enc',
                                                           's_type_broad_enc','fd_popdens_enc','fd_gdp_enc']], how='left', on='shop_id')
test_mrg = test.merge(shops_new[['shop_id','shop_tested','shop_type_enc','shop_city_enc','shop_federal_district_enc',
                                                           's_type_broad_enc','fd_popdens_enc','fd_gdp_enc']], how='left', on='shop_id')

# Merge item category encodings
sales_train_cln_mrg = sales_train_cln_mrg.merge(items_new[['item_id','item_tested','item_category_id','cluster_code',
                                                           'item_category3_enc','item_category4_enc']], how='left', on='item_id')
test_mrg = test_mrg.merge(items_new[['item_id','item_tested','item_category_id','cluster_code',
                                                           'item_category3_enc','item_category4_enc']], how='left', on='item_id')

# Reduce size of test_mrg columns from int64 to 32 or 16 or 8
test_mrg = test_mrg.astype({'ID':np.int32,'shop_id':np.int8,'item_id':np.int16}) 



print(f'Number of rows in sales_train_cln_mrg: {len(sales_train_cln_mrg)}')
print(f'Number of columns in sales_train_cln_mrg: {len(sales_train_cln_mrg.columns)}')
print(f'Column datatypes for sales_train_cln_mrg:\n{sales_train_cln_mrg.dtypes}')
print(f'\nFirst 2 rows of sales_train_cln_mrg:\n{sales_train_cln_mrg.head(2)}')
print('\n')
print(f'Number of rows in test_mrg: {len(test_mrg)}')
print(f'Number of columns in test_mrg: {len(test_mrg.columns)}')
print(f'Column datatypes for test_mrg:\n{test_mrg.dtypes}')
print(f'\nFirst 2 rows of test_mrg:\n{test_mrg.head(2)}')


Number of rows in sales_train: 2935849
Number of rows in sales_train_cleaned: 2914268
Number of columns in sales_train_cleaned: 6
Column datatypes for sales_train_cleaned:
date              datetime64[ns]
date_block_num              int8
shop_id                     int8
item_id                    int16
item_price               float32
item_cnt_day               int16
dtype: object

First 2 rows of sales_train_cleaned:
        date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0 2013-01-02               0       59    22154         999             1
1 2013-01-03               0       25     2552         899             1


Number of rows in test: 214200
Number of columns in test: 3
Column datatypes for test:
ID         int64
shop_id    int64
item_id    int64
dtype: object

First 2 rows of test:
   ID  shop_id  item_id
0   0        5     5037
1   1        5     5320


Number of rows in sales_train_cln_mrg: 2914268
Number of columns in sales_train_cln_mrg: 18
Column datatypes

In [None]:
# save the gzipped sales_train_cln_mrg and test_mrg dataframes to GoogleDrive (& GitHub)

# Note: 'archive_name' is a zip-only option in pandas; for gzip output the method name alone is sufficient
sales_train_cln_mrg.to_csv('data_output/sales_train_cln_mrg.csv.gz', index=False, compression='gzip')

test_mrg.to_csv('data_output/test_mrg.csv.gz', index=False, compression='gzip')

In [None]:
print(sales_train_cln_mrg['item_id'].nunique(), sales_train_cln_mrg['shop_id'].nunique())
print(len(items_new))

21671 55
22170
22170


In [None]:
print(sales_train_cln_mrg.item_price.max(), sales_train_cln_mrg.item_price.min())
print(sales_train_cln_mrg.item_cnt_day.max(), sales_train_cln_mrg.item_cnt_day.min())
print(sales_train_cln_mrg.cluster_code.max(), sales_train_cln_mrg.cluster_code.min())

59200 0
669 -22
34420 19


In [None]:
print(len(sales_train_cln_mrg[sales_train_cln_mrg.item_price ==0]))
print(len(sales_train[sales_train.item_price < 1]))
print(len(sales_train[sales_train.item_price <= 0.5]))
st9_0 = sales_train[sales_train.shop_id == 9]
print(len(st9_0[st9_0.item_price < 1]))
st13_0 = sales_train[sales_train.shop_id == 13]
print(len(st13_0[st13_0.item_price < 1]))
st_rnd = sales_train.item_price.apply(lambda x: int(round(x)))
print(len(st_rnd[st_rnd == 0]))

4163
4658
4164
0
0
4163


#**Data Output from This Notebook**

##**1.  *shops_new*** </br>
**60 rows**, corresponding to the 60 original shops</br>
**14 columns:**
 * shop_id (categorical; 0-59, int8, from original data set)
 * shop_tested (categorical; bool, indicating if the shop is in test set)
 * shop_type (categorical; string object: online, small shop, mall, SEC, Mega)
 * shop_type_enc (categorical; int8, ordinal/weighted encoding based on number of rows present in test, scaled to cover roughly the same range as shop_id values (0-59))
 * shop_city (categorical; string)
 * shop_city_enc (categorical; int8, encoding weighted like shop_type)
 * shop_federal_district (categorical; string object)
 * shop_federal_district_enc (categorical; int8, encoding weighted like shop_type)
 * s_type_broad (categorical; string object: like shop_type, but fewer categories by merging together "Mall","Mega","SEC")
 * s_type_broad_enc (categorical; int8, ordinal encoding roughly weighted by shop size, 0-60 scale)
 * fd_popdens (categorical; string object: 4 categories named by population density in the shop's federal district)
 * fd_popdens_enc (categorical; int8; ordinal encoding weight based on population density)
 * fd_gdp (categorical; string object: 3 categories named by gdp per person)
 * fd_gdp_enc (categorical; int8; ordinal encoding weight based on gdp/person)

</br>

##**2.  *items_new*** </br>
**22,170 rows**, corresponding to the 22,170 original items</br>
**8 columns:**
 * item_id  (categorical; 0 - 22169, int16, from original data set)
 * item_tested (categorical; bool, indicating if item is in *test* set)
 * item_category_id (categorical; 0 - 83, int8, from original data set)
 * cluster_code (categorical; int32, weighted encoding based on similarity of item names in a given cluster)
 * item_category3 (categorical; string)
 * item_category3_enc (categorical; int8, random nominal encoding done by pandas)
 * item_category4 (categorical; string)
 * item_category4_enc (categorical; int8, random nominal encoding done by pandas)

</br>

##**3.  *sales_train_cleaned*** - remove outliers and adjust shops</br>
**2,914,268 rows**, corresponding to the original rows ***with outliers removed*** (6 outlier rows deleted)</br>
and ***with shops 9 and 13 rows removed*** 
</br>

**6 columns:**
 * date (pd.datetime format)
 * date_block_num (int8) 
 * shop_id (categorical; int8, range **2** to 59, and no entries for 9, 11, 13 either, having removed shops 9 and 13, and merged shops 0,1,11)
 * item_id (categorical; int16, range 0 to 22,169 (but only 21,671 present in *sales_train_cleaned*))
 * item_price (float32; downcast, max is near 60,000)  
 * item_cnt_day (int16, range is -22 to 669 after outlier removal)

</br>

</br>

---
---

##**4a. *sales_train_cln_mrg:***
* remove outliers, 
* adjust shops (remove 9,13, and merge away 0,1,11)
* merge with encoded features from *shops_new* and *items_new* dataframes</br>

**2,914,268 rows,** corresponding to the above *sales_train_cleaned* data set </br>
**18 columns,** corresponding to the 6 from *sales_train_cleaned* plus 12 (encoded) categorical columns coming from *shops_new* and *items_new*</br>

Column descriptions for sales_train_cln_mrg:</br>

| Column Name | DType | Description |
| ----------: | :---: | :--------- |
| date | datetime64 | ordinal day, month, year of transaction in that row |
| date_block_num | int8 | ordinal-encoded month # from start of train data |
| shop_id | int8 | categorical range(60) original shop_id values, minus 0,1,9,11,13 |
| item_id | int16 | categorical range(22170) original item_id values, 21671 present in *sales_train_cln_mrg* |
| item_price | float32 | continuous variable, downcast from float64; price is in range (0 to 59200] |
| item_cnt_day | int16 | continuous variable, items sold during the day of the sales_train row; range = [-22 to 669] |
| shop_tested | bool | True if shop id is in the test set |
| shop_type_enc | int8 | Categorical feature indicating small shop / mall / SEC / online... |
| shop_city_enc | int8 | Categorical feature indicating which city hosts the shop |
| shop_federal_district_enc | int8 | Categorical feature indicating which federal district the shop is in |
| s_type_broad_enc | int8 | Categorical feature like shop_type_enc, but merging together mall/Mega/SEC so fewer categories |
| fd_popdens_enc | int8 | Categorical feature indicating population density of the federal district the shop is in |
| fd_gdp_enc | int8 | Categorical feature indicating gdp/person for the federal district the shop is in |
| item_tested | bool | True if item id is in the test set |
| item_category_id | int8 | Original category codes for the items (0 to 83) |
| cluster_code | int32 | Categorical grouping of items by name similarity; encoding weighted </br>by avg. strength of the group coupling (19 to 34420; roughly 2000 groups) |
| item_category3_enc | int8 | reduction of original 84 categories, grouping primarily by item type |
| item_category4_enc | int8 | reduction of original 84 categories, grouping primarily by item brand |
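
Because CSV files do not store dtypes, the downcast types in the table above are lost on a round trip unless they are supplied again when loading. The sketch below uses a temporary file and a toy frame (both hypothetical) to show one way of restoring them with the `dtype` and `parse_dates` arguments of `pd.read_csv`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-02', '2013-01-03']),
    'shop_id': np.array([59, 25], dtype=np.int8),
    'item_price': np.array([999.0, 899.0], dtype=np.float32),
})

# Write a gzipped CSV to a scratch location (stand-in for data_output/)
path = os.path.join(tempfile.mkdtemp(), 'toy.csv.gz')
toy.to_csv(path, index=False, compression='gzip')

# Without hints, pandas falls back to its default 64-bit integer type
plain = pd.read_csv(path)

# With hints, the compact dtypes and the datetime column are restored
restored = pd.read_csv(path, parse_dates=['date'],
                       dtype={'shop_id': np.int8, 'item_price': np.float32})

print(plain.dtypes)
print(restored.dtypes)
```

In the modeling notebook, the same dictionary of dtypes could be reused for every file produced here.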




**Code in this ipynb does the following:**

***sales_train*** dataset outliers: 

* Delete these rows (use this order for deleting if using .iloc): [2909818,2909401,2326930,2257299,1163158,484683]


***sales_train*** dataset shop overlap / untested shops: 

* Combine shop 11 into shop 10 (i.e., wherever you see shop_id == 11, set it to shop_id = 10), so *sales_train* no longer contains any shop 11.
* Combine shop 0 into shop 57 (id == 0 --> set id = 57)
* Combine shop 1 into shop 58 (id == 1 --> set id = 58)
* Delete all *sales_train* rows where shop_id == 9
* Delete all *sales_train* rows where shop_id == 13
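
The shop adjustments listed above can be sketched on a toy frame as follows. The data is hypothetical, and `isin` plus `replace` is just one concise way to express the same steps. The final check confirms that none of the five adjusted shop ids remain.

```python
import pandas as pd

toy = pd.DataFrame({'shop_id': [0, 1, 9, 10, 11, 13, 57, 58],
                    'item_cnt_day': [1] * 8})

# Delete rows for the two dropped shops
toy = toy[~toy['shop_id'].isin([9, 13])].copy()

# Fold the untested shops into their tested counterparts
toy['shop_id'] = toy['shop_id'].replace({0: 57, 1: 58, 11: 10})

# Sanity check: none of the adjusted ids should survive
leftover = set(toy['shop_id']) & {0, 1, 9, 11, 13}
print(sorted(toy['shop_id'].unique()), leftover)
```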

***shops_new*** dataset features:

* continue using ***shop_id*** as a categorical feature for training/test; suggest encoding it by weighting with the number of rows present in the ***test set***
* add ***shop_type*** as a categorical feature (also recommend encoding these categories by weighting with the number of rows present in ***test*** set)
* also see if ***shop_federal_district*** has significant feature importance (also encode based on relative presence in ***test*** set), depending on your time/computing resources.  I will also add ***shop_city*** as another potential feature if you have time/computing resources to try it.
* I'm adding 3 other potential features (2 columns: one for text description, and one for integer encoding):  
  * s_type_broad (like shop_type, but fewer categories)
  * fd_popdens (4-category grouping based on shop federal district population density)
  * fd_gdp (3-category grouping based on gdp per person in the shop's federal district)
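
To illustrate what "encoding by weighting with the number of rows present in the test set" could look like, here is a hedged sketch on toy data. The scaling rule (linear, with the most frequent category mapped to 59) is an assumption made for illustration and is not necessarily the rule used to build the `_enc` columns in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the shops table and the test set (hypothetical values)
toy_shops = pd.DataFrame({'shop_id': [0, 1, 2, 3],
                          'shop_type': ['Mall', 'Mall', 'Online', 'Shop']})
toy_test = pd.DataFrame({'shop_id': [1, 1, 1, 2, 3, 3]})

# Count how many test rows fall into each shop_type
test_types = toy_test.merge(toy_shops, on='shop_id', how='left')['shop_type']
counts = test_types.value_counts()

# Assumed scaling: most frequent type -> 59, others proportionally lower
scaled = (counts / counts.max() * 59).round().astype(np.int8)

# Map the weights back onto every shop; types absent from test would get 0
toy_shops['shop_type_enc'] = (toy_shops['shop_type'].map(scaled)
                              .fillna(0).astype(np.int8))
print(toy_shops)
```

Note that shop 0 receives the "Mall" weight even though it is not itself in the toy test set, because the weight belongs to the category rather than to the individual shop.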

***items_new*** dataset features:

* delete ***item_name*** column (just a waste of space for training purposes)
* merge with ***item_categories*** dataset to add ***item_category3*** and ***item_category4*** as feature columns.  Encoding randomly done by pandas.
* merge with ***items_clustered_22170*** dataset to add ***cluster_code*** as a pre-encoded categorical feature.
* suggest using ***item_id***, ***cluster_code***, ***item_category3_enc***, ***item_category_id***, and ***item_category4_enc*** as features in the model, in that order of importance (depending on your time/computing resources).  

</br>

</br>

**Some Thoughts On The Cleaning and Use of Features Above:**

1.  **Category overlap** in the new features of the items dataset:</br>
If one or more of these 4 category/cluster groupings shows little feature importance upon training, drop it from the model inputs.  If one or two of the 4 groupings show substantially higher importance upon training, I suggest you do not use the remaining groupings as features at all.  There is *some* correlation between the categories of two different grouping methods, but I am concerned that there are also anti-correlations between the different feature variants, and that it is not straightforward for the model to resolve conflicting category information.  My guess is that the best results would come from using one or two of the "broad" category groupings (cat3 or cat4) and one (and only one) of the "finer" category groupings (the original 84-category feature, or the weighted 2000ish-category cluster_code feature).

2.  **Shops inclusion and feature priority**:</br>
* I suspect *shop_federal_district* will not have a great importance in the training, but I would be interested to see if you find the same conclusion after training and looking at relative feature importance.  Although I'm also including *shop_city* as a possible feature to use, I don't recommend this unless you have plenty of time/computing resources.  If you can, give it a try, and see if it actually does have significant feature importance.
* I am dropping shops 9 and 13 from the training set due to their odd behavior and because they are not in the test set.  I don't expect this to hurt model performance (I suspect it could actually improve it significantly), but if time permits, it would be worth checking this hypothesis by training the model with these two shops still in the *sales_train* data set and seeing whether the model improves.
* Similarly, the merging / removal of shops 0, 1, and 11 is being done on a hunch that these 3 untested shops are so similar to their (tested) merge partners that merging them, to create more training data for the tested shops, will be beneficial.  The issue is that I have not done a deep dive into this similarity.  If the shop similarity is not strong enough, the merging could distort the training behavior of time lag, item_id, or item_category.  For example, the 3 shop merges could create a substantially different sales distribution of items/categories/time for the 3 tested shops.  A shallow exploration suggests it's not an issue, but I'm not 100% confident, and we should consider also running the model training without merging these shops together.  (The good news is I don't think the 3 merges will have much effect, if any, on any of the other shops.)
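
One quick way to look at the category overlap raised in point 1 is a contingency table between a "fine" grouping and a "broad" one. The sketch below uses `pd.crosstab` on a toy items frame (values hypothetical). A fine category that maps to more than one broad category is a candidate source of conflicting information for the model.

```python
import pandas as pd

# Toy stand-in for items_new (hypothetical rows)
toy = pd.DataFrame({
    'item_category_id': [40, 40, 76, 76, 30, 30],
    'item_category3': ['Movies', 'Movies', 'Software', 'Software', 'Games', 'Software'],
})

# Rows: fine grouping; columns: broad grouping; cells: item counts
overlap = pd.crosstab(toy['item_category_id'], toy['item_category3'])

# Number of distinct broad groups each fine category falls into
n_parents = (overlap > 0).sum(axis=1)

print(overlap)
print(n_parents)
```

If every fine category has exactly one parent, the broad grouping is a strict coarsening of the fine one, and the two features are redundant rather than conflicting.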



In [None]:
df_name_dict = {'shops_new':shops_new,'items_new':items_new,'sales_train_cleaned':sales_train_cleaned,'sales_train_cln_mrg':sales_train_cln_mrg}

for k,v in df_name_dict.items():
    print(f'{k}: n_rows = {len(v)}, n_cols = {len(v.columns)}\n{v.columns}\n')

shops_new: n_rows = 60, n_cols = 14
Index(['shop_id', 'shop_tested', 'shop_type', 'shop_type_enc', 'shop_city', 'shop_city_enc', 'shop_federal_district', 'shop_federal_district_enc', 's_type_broad', 's_type_broad_enc', 'fd_popdens', 'fd_popdens_enc', 'fd_gdp', 'fd_gdp_enc'], dtype='object')

items_new: n_rows = 22170, n_cols = 8
Index(['item_id', 'item_tested', 'item_category_id', 'cluster_code', 'item_category3', 'item_category3_enc', 'item_category4', 'item_category4_enc'], dtype='object')

sales_train_cleaned: n_rows = 2914268, n_cols = 6
Index(['date', 'date_block_num', 'shop_id', 'item_id', 'item_price', 'item_cnt_day'], dtype='object')

sales_train_cln_mrg: n_rows = 2914268, n_cols = 18
Index(['date', 'date_block_num', 'shop_id', 'item_id', 'item_price', 'item_cnt_day', 'shop_tested', 'shop_type_enc', 'shop_city_enc', 'shop_federal_district_enc', 's_type_broad_enc', 'fd_popdens_enc', 'fd_gdp_enc', 'item_tested', 'item_category_id', 'cluster_code', 'item_category3_enc',
       'it