#**Quick processing of data sets for input to model**

**Data Cleaning, Feature Generation**

Andreas Theodoulou and Michael Gaidis (June, 2020)

#**Data Output from This Notebook**

##**1. A lightly-cleaned version of sales_train data set, merged with the test data set** (distinguishable as month = 34)
* *train_test_base*

##**2. Data sets to merge with the aforementioned data set, and also important to merge with Cartesian-Product rows that we insert into the training data.**
* *shops_enc*
* *items_enc*
* *date_adjustments*

###The intent is for the user to adapt these data sets as desired, in the IPynb notebook focused on modeling.
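
As a rough sketch of how these outputs are meant to fit together downstream, the three lookup tables can be left-merged onto the base table by their key columns. The tiny DataFrames below are made-up stand-ins (only the key column names `shop_id`, `item_id`, and `month` come from this notebook), so treat this as an illustration rather than the modeling notebook's actual code.

```python
import pandas as pd

# Made-up miniature versions of the four outputs; real tables have many more rows and columns
train_test_base = pd.DataFrame({"month": [33, 34], "shop_id": [2, 3], "item_id": [30, 31], "sales": [1, 0]})
shops_enc = pd.DataFrame({"shop_id": [2, 3], "shop_type": [2, 1]})
items_enc = pd.DataFrame({"item_id": [30, 31], "item_cluster": [100, 105]})
date_adjustments = pd.DataFrame({"month": [33, 34], "days_in_M": [31, 30]})

# Left merges keep every base row and attach the matching shop, item, and month attributes
model_input = (
    train_test_base
    .merge(shops_enc, on="shop_id", how="left")
    .merge(items_enc, on="item_id", how="left")
    .merge(date_adjustments, on="month", how="left")
)
print(model_input)
```

The same three merges would apply to any Cartesian-product rows inserted later, since those rows carry the same key columns.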




#**0.1 Mount Google Drive (Local File Storage/Repo For Colab)**

In [None]:
# click on the URL link presented to you by this command, get your authorization code from Google, then paste it into the input box and hit 'enter' to complete mounting of the drive
from google.colab import drive  
drive.mount('/content/drive')

Mounted at /content/drive


#**0.2 Configure Environment and Load Data Files**

In [None]:
# python libraries/modules used throughout this notebook (with some holdovers from other, similar notebooks)
# pandas data(database) storage, EDA, and manipulation
import pandas as pd
### pandas formatting
### Here's what I find works well for this particular IPynb, when using a FHD laptop monitor with a full-screen browser window containing my IPynb notebook:
pd.set_option("display.max_rows",120)     # Override pandas choice of how many rows to show, so, for example, we can see the full 84-row item_category dataframe instead of the first few rows, then ...., then the last few rows
pd.set_option("display.max_columns",26)   # Similar to row code above, we can show more columns than default  
pd.set_option("display.width", 230)       # Tune this to our monitor window size to avoid horiz scroll bars in output windows (but, the drawback is that we will get output text wrapping)
pd.set_option("max_colwidth", None)       # This is done, for example, so we can see full item name and not '...' in the middle
# Try to convince pandas to print without decimal places if a number is actually an integer (helps keep column width down, and highlights data types), or with precision = 3 decimals if a float
pd.options.display.float_format = lambda x : '{:.0f}'.format(x) if round(x,0) == x else '{:,.3f}'.format(x)

# Pandas additional enhancements
pd.set_option('compute.use_bottleneck', False)  # False turns off the bottleneck library (default is True), which otherwise accelerates NaN-aware reductions
pd.set_option('compute.use_numexpr', False)     # False turns off numexpr (default is True), which otherwise accelerates boolean/arithmetic operations on large dataframes, including subexpressions in DataFrame.query() and pandas.eval()

# computations
import numpy as np

# file operations
import os
from urllib.parse import urlunparse
from pathlib import Path

# misc. python enhancements
from collections import OrderedDict
import time
import datetime
from time import sleep, localtime, strftime, tzset, strptime
os.environ['TZ'] = 'EST+05EDT,M3.2.0,M11.1.0'   # US Eastern time (DST runs from the 2nd Sunday of March to the 1st Sunday of November); allows user to simply print a formatted version of the local date and time; helps keep track of what cells were run, and when
tzset()

print(f'done: {strftime("%a %X %x")}')

done: Mon 06:07:35 06/29/20


In [None]:
train_test_base_save = False #True  # set to false if you plan to read in from previously-created csv.gz file

GDRIVE_REPO_PATH = "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag"

data_files = [  "data_output/shops_augmented.csv",
                "data_output/items_clustered_22170b.csv.gz",
                "data_output/item_categories_augmented.csv",
                "data_output/days_by_month.csv",
                "readonly/final_project_data/sales_train.csv.gz",
                "readonly/final_project_data/test.csv.gz"
                ]


# Dict of helper code files, to be loaded and imported {filepath : import_as}
code_files = {}  # not used at this time; example dict = {"helper_code/kaggle_utils_at_mg.py" : "kag_utils"}


# GitHub file location info
git_hub_url = "https://raw.githubusercontent.com/migai"
repo_name = 'Kag'
branch_name = 'master'
base_url = "/".join([git_hub_url, repo_name, branch_name])   # build the URL with '/' explicitly rather than relying on the OS path separator

if data_files:
    print('\n\ncsv files source directory: ', end='')
    %cd "{GDRIVE_REPO_PATH}"

    print("\nLoading csv Files from Google Drive repo into Colab...\n")

    # Loop to load the data files into appropriately-named pandas DataFrames
    for path_name in data_files:
        filename = path_name.rsplit("/")[-1]
        data_frame_name = filename.split(".")[0]
        # bind each DataFrame to a global variable named after its file (e.g., sales_train, test)
        globals()[data_frame_name] = pd.read_csv(path_name)
        # if data_frame_name == 'sales_train':
        #     sales_train['date'] = pd.to_datetime(sales_train['date'], format = '%d.%m.%Y')
        print(f'DataFrame {data_frame_name}, shape = {globals()[data_frame_name].shape} :')
        print(globals()[data_frame_name].head(2))
        print("\n")
else: 
    %cd "{GDRIVE_REPO_PATH}"
    
print(f'\nDataFrame Loading Complete: {strftime("%a %X %x")}\n')



csv files source directory: /content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag

Loading csv Files from Google Drive repo into Colab...

DataFrame shops_augmented, shape = (60, 18) :
                       shop_name  shop_id shop_group shop_federal_district  shop_federal_district_enc  shop_tested                       en_shop_name  shop_city_population shop_type  shop_type_enc shop_city  shop_city_enc  \
0  !Якутск Орджоникидзе, 56 фран        0          N               Eastern                         16        False  ! Yakutsk Ordzhonikidze, 56 Franc                235600      Shop             20   Yakutsk             54   
1  !Якутск ТЦ "Центральный" фран        1          N               Eastern                         16        False       ! Yakutsk TC "Central" Franc                235600      Mall             50   Yakutsk             54   

  s_type_broad  s_type_broad_enc fd_popdens  fd_popdens_enc        fd_gdp  fd_gdp_enc  
0         Shop              

#**1. *train_test_base***
**3,150,043 rows**, corresponding to the original rows ***with outliers removed or clipped***</br>
and ***with identical shops merged together*** 
</br>

**9 columns:**
 * month (int8, ordinal, 0 to 34) = date_block_num
 * day (int16, ordinal, 0 to 1033 for training rows; test rows are assigned day 1034 = Nov. 1, 2015)
 * week (int8, ordinal, 0 to 148)
 * qtr (int8, ordinal, 0 to 12)
 * season (int8, categorical, 0 to 3) 
 * shop_id (int8, categorical, 2 to 10 and 12 to 59)
 * item_id (int16, categorical, 0 to 22,169)
 * price (float32, continuous, max is near 60,000) = item_price
 * sales (int16, continuous, range is roughly -20 to 1000) = item_cnt_day



***sales_train*** dataset outliers: 

* Clip these rows:
```
Shop 24, item 20949, row 1501160 --> clip sales to 200
Shop 25, item 20949, row 1708207 --> clip sales to 200; rows 2296209 and 2341308 --> clip sales to 100
```
* Delete these rows (use this reverse order for deleting if using .iloc): </br>
```
[2909818, 2909401, 2326930, 2257299, 1163158, 484683]
```
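
A minimal sketch of the clip-and-delete pattern on a toy frame follows. The three rows, their index labels, and their values are invented for illustration; only the approach (overwrite a cell by index label, then drop rows by label) mirrors the steps listed above.

```python
import pandas as pd

# Toy stand-in for sales_train; index labels and values are illustrative only
toy = pd.DataFrame(
    {"shop_id": [24, 25, 12], "item_id": [20949, 20949, 100], "item_cnt_day": [405.0, 501.0, 1000.0]},
    index=[10, 11, 12],
)

clip_rows = {10: 200, 11: 200}   # index label -> clipped sales value
delete_rows = [12]               # index labels of rows to remove

cleaned = toy.copy()
for label, value in clip_rows.items():
    cleaned.at[label, "item_cnt_day"] = value   # overwrite a single cell by label

# Dropping by label (rather than by position) makes the deletion order irrelevant
cleaned = cleaned.drop(index=delete_rows)
print(cleaned)
```

Dropping by label is one way to sidestep the reverse-order requirement that applies when rows are removed by position.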

***sales_train*** dataset shop overlap 
```
* Combine shop 11 into shop 10  (id == 11 --> set id = 10)
* Combine shop  0 into shop 57  (id ==  0 --> set id = 57)
* Combine shop  1 into shop 58  (id ==  1 --> set id = 58)
```
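
The shop merge amounts to relabeling ids. A small sketch, using a made-up Series of shop ids and the same mapping as above:

```python
import pandas as pd

merge_shops = {0: 57, 1: 58, 11: 10}   # duplicate shop id -> surviving shop id

# Illustrative shop ids; ids not in the mapping pass through unchanged
shop_ids = pd.Series([0, 1, 11, 10, 57, 36], name="shop_id")
merged = shop_ids.replace(merge_shops)
print(merged.tolist())   # [57, 58, 10, 10, 57, 36]
```

Because only the id column changes, the row count stays the same after this step.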

***sales_train*** dataset late-opening shop
```
* Multiply shop 36 sales by 31/15 to account for it being open only for the last 15 days of training.
The user can later scale again as desired to conform to the 30-day test month.
```
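
One possible vectorized form of this scaling is sketched below on invented data; the toy rows are assumptions, while the factor 31/15 is the one stated above.

```python
import pandas as pd

# Toy rows: two sales records for shop 36 and one for another shop (values are illustrative)
toy = pd.DataFrame({"shop_id": [36, 36, 5], "item_cnt_day": [15.0, 30.0, 4.0]})

scale_shops = {36: 31 / 15}   # shop id -> multiplicative factor
for shop, factor in scale_shops.items():
    mask = toy["shop_id"] == shop
    # Scale only the rows that belong to this shop
    toy.loc[mask, "item_cnt_day"] = toy.loc[mask, "item_cnt_day"] * factor
print(toy)
```

A boolean mask avoids a row-by-row `apply`, which matters on a table with millions of rows.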

***sales_train merge with test***
```
* Append the test shop-item pairs to the sales_train data set, so merging and feature generation
have the option of including these rows as well.
```
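
A hedged sketch of the append step, using one invented training row and one invented test row. The column names follow the original data set; using `pd.concat` here is a choice made for this sketch.

```python
import pandas as pd

# One illustrative training row and one illustrative test row
train = pd.DataFrame({
    "date": ["30.10.2015"], "date_block_num": [33], "shop_id": [5],
    "item_id": [100], "item_price": [399.0], "item_cnt_day": [1.0],
})
test = pd.DataFrame({"ID": [0], "shop_id": [5], "item_id": [200]})

# Give test rows the month and date used in this notebook (month 34, Nov. 1, 2015)
test_prep = test.drop(columns="ID").assign(date_block_num=34, date="1.11.2015")

# Test rows have no price or sales, so those cells become NaN and are filled with 0
traintest = pd.concat([train, test_prep], ignore_index=True).fillna(0)
print(traintest)
```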

***Additional time-based features***
```
Replace the "date" and "date_block_num" columns with:
    1. 'day'    = integer value of day number, 
                starting at day = 0 for the first training set transaction, and incrementing by "calendar" day number 
                (not by "transaction" day number).
                Thus, 'day' may not include all possible integers from start to finish.  
                It only assigns integer values (based on the calendar) to days when there are transactions in the 
                input dataframe --> if the input dataframe has no transactions on a particular day, that day's 
                "calendar" integer value will not be present in the column.
    2. 'week'   = integer value of week number, 
                however, unlike 'day', the 'week' number is aligned not to start at the first training set transaction, 
                but rather so that there is a full 'week' of 7 days that ends on Oct. 31, 2015 (the final day of training data).  
                If using the full sales_train data set, this results in week = 0 having only 5 days in it. 
                The final week of October, 2015 is assigned 'week' number = 147.  
                Arbitrarily assigning test to "Nov. 1, 2015" results in test week = 148
    3. 'month'  = renamed from "date_block_num" of original data set (no changes).  
                Integer values from 0 to 33 represent the months starting at day0.  Test month == 34 is Nov. 2015.
    4. 'qtr'    = quarter = integer number of 3-month chunks of time, aligned with the end of October, 2015.  
                day0 is included in 'qtr' = 0, but 'qtr'=0 only contains 1 month (Jan 2013) of data due to the alignment.
                The months of August, Sept, Oct 2015 form 'qtr' = 11.  "qtr" in this sense is just 3-month chunks... 
                it is not the traditional Q1,Q2,Q3,Q4 beginning Jan 1, but instead is more like date_block_num in that it 
                is monotonically increasing integers, incremented every 3 months.
    5. 'season' = integer number of 3-month chunks of time, reset each year (allowed values = 0,1,2,3)... 
                not quite the same as spring-summer-winter-fall, or Q1,Q2,Q3,Q4, but instead shifted to 
                better capture seasonal spending trends aligned in particular with high December spending
                2 = Dec 1 to Feb 28 (biggest spending season), 3 = Mar 1 to May 31, 
                0 = June 1 to Aug 31 (lowest spending season), 1 = Sept 1 to Nov 30
```
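
The alignment arithmetic described above can be checked on a handful of dates. In this sketch the five dates are chosen for illustration, and the `season` expression is an assumption derived from the calendar ranges listed in item 5; it is not the `(month + 2) % 4` expression that `clean_merge_augment` uses.

```python
import datetime
import pandas as pd

day0 = datetime.datetime(2013, 1, 1)

# A few illustrative dates with their date_block_num ("month") values
df = pd.DataFrame({
    "date": ["01.01.2013", "05.01.2013", "06.01.2013", "31.10.2015", "01.11.2015"],
    "month": [0, 0, 0, 33, 34],
})
dates = pd.to_datetime(df["date"], format="%d.%m.%Y")

df["day"] = (dates - day0).dt.days      # calendar days since day0
df["week"] = (df["day"] + 2) // 7       # +2 so that a full week ends on Oct. 31, 2015
df["qtr"] = (df["month"] + 2) // 3      # +2 so that a 3-month chunk ends with Oct. 2015
# Calendar seasons as listed in the text: Dec-Feb = 2, Mar-May = 3, Jun-Aug = 0, Sep-Nov = 1
df["season"] = ((df["month"] % 12 + 1) // 3 + 2) % 4
print(df)
```

With day0 = Jan. 1, 2013, week 0 covers days 0 through 4 (five days), and Oct. 31, 2015 lands on day 1033 in week 147.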


##Code Output:
```
train_test_base dataframe creation started: Sat 14:11:09 06/20/20

Shape of original sales_train data set = (2935849, 6)
Rows being clipped:
  1,501,160 sales clipped to 200
  1,708,207 sales clipped to 200
  2,296,209 sales clipped to 100
  2,341,308 sales clipped to 100
Rows being deleted:
  2909818
  2909401
  2326930
  2257299
  1163158
  484683
Shape of sales_train_cleaned after 6 outlier rows were removed: (2935843, 6)
Shape of sales_train_cleaned after merging shops as in {0: 57, 1: 58, 11: 10}: (2935843, 6)
Shops being scaled:
  shop 36 scaled by 2.07
Shape of traintest after appending test to sales_train_cleaned: (3150043, 6)
Shape of traintest after creating time-based feature columns: (3150043, 9)
traintest DataFrame creation done: Sat 14:13:28 06/20/20

/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
train_test_base.csv.gz file stored on google drive in data_output directory
train_test_base file save done: Sat 14:14:02 06/20/20

Example: display(train_test_base[train_test_base.week == 102].tail(2))

         day  week  qtr  season  month  price  sales  shop_id  item_id
2257039  718   102    8       1     23    399      1       59    21970
2257040  718   102    8       1     23    499      1       59    22060

train_test_base done: Sat 14:14:02 06/20/20
```

##**1.1 Merge data sets and create day, week, quarter, and season feature columns**

In [None]:
# determine which rows I need to clip for shops 24 and 25
stclip = sales_train.query('((shop_id == 24) and (item_cnt_day > 100)) or ((shop_id == 25) and (item_cnt_day > 100))')
stclip

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
862929,17.09.2013,8,25,3732,2545.135,264
862945,17.09.2013,8,25,3734,2548.455,110
868495,05.09.2013,8,25,2808,999.0,133
1501160,15.03.2014,14,24,20949,5.0,405
1708207,28.06.2014,17,25,20949,5.0,501
2176634,18.11.2014,22,25,3733,3070.565,147
2296209,30.12.2014,23,25,20949,5.0,205
2341308,17.01.2015,24,25,20949,5.0,222
2567454,14.04.2015,27,25,3731,1941.995,207
2615667,19.05.2015,28,25,10209,1490.892,148


In [None]:
def clean_merge_augment(day0 = datetime.datetime(2013,1,1),
                        clip_rows = {1501160: 200, 1708207: 200, 2296209: 100, 2341308: 100},
                        delete_rows = [2909818, 2909401, 2326930, 2257299, 1163158, 484683],
                        merge_shops = {0: 57, 1: 58, 11: 10},
                        scale_shops = {36: 31/15},
                        dropout_repair = {},
                        delete_shops = []):
    """
    Parameters:
    day0 = datetime.datetime object representing the day you wish to use as your reference when creating time-based features
    clip_rows = not quite as bad as erroneous outliers, but sales are so unlike other days that clipping should help
    delete_rows = list of integer row numbers that you wish to delete from the sales_train data set, e.g., from outliers/erroneous rows
    merge_shops = dictionary of integer shop_id key:value pairs where shop(=key) is merged into shop(=value)
    scale_shops = artificially adjust sales amounts for shops that are only open for partial amounts of months
    dropout_repair = optional, fill in the dropouts where shops are apparently erroneously missing sales (I am deferring this for now, as it looks to be of only marginal use, and not easy to do in a robust way)
    delete_shops = optional, can delete shops if you think they are not of value to training (I am pushing this to the modeling IPynb for easier iteration)

    Global Variables: this function assumes you have the following pandas dataframes available globally:
    1) unaltered sales_train
    2) unaltered test

    This function does the following:
    1) clips moderate outlier rows
    2) cleans (deletes) severe outlier rows from the training set that appear to be erroneous or irrelevant entries
    3) merges 3 shops into other shops where it appears that the sales_train set simply has different names for the 
        same shop at different time periods (shop 0 absorbed by 57; shop 1 absorbed by 58, shop 11 absorbed by 10)
    4) optionally delete shops entirely from the sales_train data set (e.g., for irrelevant shops)
    5) append the test set rows to the sales_train rows, using a date of November 1, 2015 for test
    6) convert the 'date' column of the merged dataset to datetime format, so it displays like a string of format: 'YYYY-MM-DD'

    Then, creates and inserts new time-based feature columns as follows:
    Given a dataframe with a 'date' column containing strings like '2015-10-30', create new time-series columns:
    1. 'day'    = integer value of day number, starting at day = 0 for parameter day0, and incrementing by calendar day number (not by transaction day number)... 
                    Thus, 'day' may not include all possible integers from start to finish.  It only assigns integer values (based on the calendar) to days when 
                    there are transactions in the input dataframe --> if the input dataframe has no transactions on a particular day, that day's 'calendar' integer 
                    value will not be present in the column
    2. 'week'   = integer value of week number, with week = 0 at time= parameter day0.  However, unlike 'day', the 'week' number is aligned not to start at day0, but rather
                    so that there is a full 'week' of 7 days that ends on Oct. 31, 2015 (the final day of training data).  This results in week = 0 having only 5 days in it.
                    n.b., the final week of October, 2015 is assigned 'week' number = 147.  Artificially assigning test to Nov. 1, 2015 results in test week = 148
    3. 'month'  = renamed from "date_block_num" of original data set (no changes).  Integer values from 0 to 33 represent the months starting at day0.  Test month=34 is Nov. 2015.
    4. 'qtr'    = quarter = integer number of 3-month chunks of time, aligned with the end of October, 2015.  day0 is included in 'qtr' = 0, but 'qtr'=0 only contains 1 month (Jan 2013) of data due to the alignment
                    The months of August, Sept, Oct 2015 form 'qtr' = 11.  "qtr" in this sense is just 3-month chunks... it is not the traditional Q1,Q2,Q3,Q4 beginning Jan 1, but instead is more like
                    date_block_num in that it is monotonically increasing integers, incremented every 3 months such that #11 ends at the end of our training data
    5. 'season' = integer number of 3-month chunks of time, reset each year (allowed values = 0,1,2,3)... not quite the same as spring-summer-winter-fall, or Q1,Q2,Q3,Q4, but instead shifted to 
                    better capture seasonal spending trends aligned in particular with high December spending
                    2 = Dec 1 to Feb 28 (biggest spending season), 3 = Mar 1 to May 31, 0 = June 1 to Aug 31 (lowest spending season), 1 = Sept 1 to Nov 30

    Finally, drop the date column from the dataframe, and sort the dataframe by ['day','shop_id','item_id']  (original dataframe seems to be sorted by month, but unsorted within each month)

    returns: the cleaned/dated/feature-augmented DataFrame
    """

    print(f'Shape of original sales_train data set = {sales_train.shape}')

    # clip moderate outliers (first make a DataFrame copy so we can reuse sales_train later, if we need to)
    sales_train_cleaned = sales_train.copy(deep=True)
    if clip_rows:
        print('Rows being clipped:')
        for k,v in clip_rows.items(): 
            sales_train_cleaned.at[k,'item_cnt_day'] = v
            print(f'  {k:,d} sales clipped to {v}')

    # remove outlier rows from training set 
    print('Rows being deleted:')
    for i in sorted(delete_rows, reverse=True):   # delete the rows in reverse order to be sure we don't run into issues with indexing
        print(f'  {i}')
        sales_train_cleaned.drop(sales_train_cleaned.index[i],inplace=True)
    print(f'Shape of sales_train_cleaned after {len(delete_rows)} outlier rows were removed: {sales_train_cleaned.shape}')
    
    # Merge the 3 shops we are nearly certain must correctly fit into the other shops' dropout regions:
    sales_train_cleaned.shop_id = sales_train_cleaned.shop_id.replace(merge_shops)
    print(f'Shape of sales_train_cleaned after merging shops as in {merge_shops}: {sales_train_cleaned.shape}')

    # scale shops if desired
    if scale_shops:
        print('Shops being scaled:')
        for k,v in scale_shops.items(): 
            # vectorized scaling: keep other shops' sales unchanged, multiply shop k's sales by v
            sales_train_cleaned.item_cnt_day = sales_train_cleaned.item_cnt_day.where(sales_train_cleaned.shop_id != k, sales_train_cleaned.item_cnt_day * v)
            print(f'  shop {k} scaled by {v:.2f}')

    # Remove irrelevant shops entirely from the sales_train_cleaned DataFrame:
    if delete_shops:
        sales_train_cleaned = sales_train_cleaned.query('shop_id != @delete_shops')
        print(f'Shape of sales_train_cleaned after deleting shops {delete_shops}: {sales_train_cleaned.shape}')

    # sales_train_cleaned = sales_train_cleaned[sales_train_cleaned.shop_id != 9]
    # sales_train_cleaned = sales_train_cleaned[sales_train_cleaned.shop_id != 13]
    # print(f'Shape of sales_train_cleaned after removal of shops: {sales_train_cleaned.shape})
    # print(f'{sales_train_cleaned.shop_id.nunique()} shops remaining in sales_train_cleaned DataFrame: {sorted(sales_train_cleaned.shop_id.unique())})

    sales_train_cleaned = sales_train_cleaned.astype({'date_block_num':np.int8,'shop_id':np.int8,'item_id':np.int16,
                                                    'item_price':np.float32,'item_cnt_day':np.int16}).reset_index(drop=True)

    # merge dataframes so we optionally include test elements in our EDA and feature generation
    test_prep = test.copy(deep=True)
    test_prep['date_block_num'] = 34
    test_prep['date'] = '1.11.2015' #pd.Timestamp(year=2015, month=11, day=1)
    traintest = pd.concat([sales_train_cleaned, test_prep]).fillna(0)   # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent

    traintest = traintest[['date', 'date_block_num', 'item_price', 'item_cnt_day', 'shop_id', 'item_id']]
    traintest.columns = ['date', 'month', 'price', 'sales', 'shop_id', 'item_id']
    print(f'Shape of traintest after appending test to sales_train_cleaned: {traintest.shape}')
        
    # Add in the time-based feature columns
    traintest.date = pd.to_datetime(traintest.date, format='%d.%m.%Y')   # dates are day-first strings such as '30.10.2015'
    traintest.insert(1,'day', (traintest.date - day0).dt.days)      # calendar days elapsed since day0
    traintest.insert(2,'week', (traintest.day+2) // 7 )             # add the 2 days so we have end of a week coinciding with end of training data Oct. 31, 2015
    traintest.insert(3,'qtr', (traintest.month + 2) // 3 )          # add the 2 months so we have end of a quarter aligning with end of training data Oct. 31, 2015
    traintest.insert(4,'season', (traintest.month + 2) % 4 )        # n.b., this expression cycles through 0-3 every 4 months, so it does not reproduce the 3-month calendar seasons described in the docstring
    traintest.drop('date',axis=1,inplace=True)
    # note that the train dataset is sorted by month, but nothing obvious within the month; we sort it here for consistent results in calculations below
    traintest = traintest.sort_values(['day','shop_id','item_id']).reset_index(drop=True)  
    print(f'Shape of traintest after creating time-based feature columns: {traintest.shape}')
    print(f'traintest DataFrame creation done: {strftime("%a %X %x")}\n')
    return traintest

print(f'\nDone: {strftime("%a %X %x")}\n')


Done: Sat 14:11:09 06/20/20



In [None]:
if train_test_base_save:
    print(f'train_test_base dataframe creation started: {strftime("%a %X %x")}\n')
    train_test_base = clean_merge_augment()

    %cd "{GDRIVE_REPO_PATH}"
    # can save as csv.gz for < 100 MB storage and sync with GitHub
    train_test_base.to_csv('data_output/train_test_base.csv.gz', index=False, compression='gzip')
    print("train_test_base.csv.gz file stored on google drive in data_output directory")
    print(f'train_test_base file save done: {strftime("%a %X %x")}')
else:
    # reuse the csv.gz file written by a previous run of this cell
    train_test_base = pd.read_csv('data_output/train_test_base.csv.gz')

display(train_test_base[train_test_base.week == 102].tail(2))

print(f'\ntrain_test_base done: {strftime("%a %X %x")}')



#**2. *shops_enc*** </br>
**60 rows**, corresponding to the 60 original shops</br>
**9 columns:**
 1. shop_id  (categorical; int8, from original data set)
 2. shop_tested (categorical; int8 0/1, indicating if the shop is in *test* set)
 3. shop_group (categorical; tighter-grouped clusters were assigned letters first, so tend to be closer to A than to Z)
 4. shop_type (categorical; online, small shop, mall, SEC, Mega)
 5. s_type_broad (categorical; like shop_type, but fewer categories by merging together "Mall","Mega","SEC")
 6. shop_federal_district (categorical)
 7. fd_popdens (categorical; 4 categories named by population density in the shop's fed district)
 8. fd_gdp (categorical; 3 categories named by gdp per person)
 9. shop_city (categorical)

In [None]:
print(shops_augmented.columns)

Index(['shop_name', 'shop_id', 'shop_group', 'shop_federal_district', 'shop_federal_district_enc', 'shop_tested', 'en_shop_name', 'shop_city_population', 'shop_type', 'shop_type_enc', 'shop_city', 'shop_city_enc', 's_type_broad',
       's_type_broad_enc', 'fd_popdens', 'fd_popdens_enc', 'fd_gdp', 'fd_gdp_enc'],
      dtype='object')


In [None]:
shops_enc = shops_augmented[['shop_id', 'shop_tested', 'shop_group', 'shop_type', 's_type_broad', 'shop_federal_district', 'fd_popdens', 'fd_gdp', 'shop_city']].copy(deep=True)
shops_enc.shop_tested = shops_enc.shop_tested.astype(np.int8)
for c in ['shop_group', 'shop_type', 's_type_broad', 'shop_federal_district', 'fd_popdens', 'fd_gdp', 'shop_city']:
    shops_enc[c] = shops_enc[c].astype('category')
    shops_enc[c] = shops_enc[c].cat.codes
    print(f'Column {c} number of unique category codes: {shops_enc[c].nunique()}')
print('\n')
display(shops_enc.head())

shops_enc.to_csv("data_output/shops_enc.csv", index=False)

Column shop_group number of unique category codes: 14
Column shop_type number of unique category codes: 6
Column s_type_broad number of unique category codes: 3
Column shop_federal_district number of unique category codes: 8
Column fd_popdens number of unique category codes: 4
Column fd_gdp number of unique category codes: 3
Column shop_city number of unique category codes: 29




Unnamed: 0,shop_id,shop_tested,shop_group,shop_type,s_type_broad,shop_federal_district,fd_popdens,fd_gdp,shop_city
0,0,0,7,5,2,1,3,1,26
1,1,0,7,1,0,1,3,1,26
2,2,1,9,2,0,5,0,2,0
3,3,1,6,1,0,0,2,1,1
4,4,1,9,1,0,5,0,2,23
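
A short sketch of what `.astype('category').cat.codes` does, on made-up labels: string categories are sorted, so the integer codes follow alphabetical order, and missing values receive the code -1.

```python
import pandas as pd

# Illustrative labels, including one missing value
labels = pd.Series(["Mall", "Shop", "Online", "Mall", None])

cat = labels.astype("category")
codes = cat.cat.codes

print(list(cat.cat.categories))   # ['Mall', 'Online', 'Shop']
print(codes.tolist())             # [0, 2, 1, 0, -1]
```

Because the codes are arbitrary integers, tree-based models can usually consume them directly, while linear models would generally need a different encoding.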


#**3. *items_enc*** </br>
**22,170 rows**, corresponding to the 22,170 original items</br>
**10 columns:**
 1. item_id  (categorical; int16, from original data set)
 2. item_tested (categorical; int8 0/1, indicating if the item is in *test* set)
 3. item_cluster (categorical; int; grouping from item sales correlations)
 ---
 1. item_category_id  (categorical; int8, from original data set)
 2. item_cat_tested (categorical; int8 0/1, indicating if the item category is in *test* set)
 3. item_group (categorical; int8)
 4. item_category1 (categorical; int8)
 5. item_category2 (categorical; int8)
 6. item_category3 (categorical; int8)
 7. item_category4 (categorical; int8)
 

In [None]:
items_enc = items_clustered_22170b[['item_id','item_tested','item_cluster','item_category_id']].copy(deep=True)
items_enc = items_enc.merge(item_categories_augmented[['item_category_id','item_cat_tested','item_group','item_category1','item_category2','item_category3','item_category4']],how='left',on='item_category_id')

items_enc.item_tested = items_enc.item_tested.astype(np.int8)
items_enc.item_cat_tested = items_enc.item_cat_tested.astype(np.int8)
for c in ['item_group','item_category1','item_category2','item_category3','item_category4']:
    items_enc[c] = items_enc[c].astype('category')
    items_enc[c] = items_enc[c].cat.codes
    print(f'Column {c} number of unique category codes: {items_enc[c].nunique()}')
print('\n')
display(items_enc.head())

items_enc.to_csv("data_output/items_enc.csv", index=False)

Column item_group number of unique category codes: 27
Column item_category1 number of unique category codes: 13
Column item_category2 number of unique category codes: 10
Column item_category3 number of unique category codes: 12
Column item_category4 number of unique category codes: 8




Unnamed: 0,item_id,item_tested,item_cluster,item_category_id,item_cat_tested,item_group,item_category1,item_category2,item_category3,item_category4
0,0,0,100,40,1,6,8,3,7,3
1,1,0,105,76,1,6,11,6,10,5
2,2,0,100,40,1,6,8,3,7,3
3,3,0,110,40,1,6,8,3,7,3
4,4,0,116,40,1,6,8,3,7,3


#**4. *date_adjustments*** </br>
**35 rows**, corresponding to the 34 training months + 1 test month</br>
**8 columns:**
 1. month  (numerical; int8 0-34, from original data set = date_block_num)
 2. year (numerical; int16, 2013-2015)
 3. season (categorical; int8 0-3, 3-month chunks aligned with seasons & shopping trends, repeating each year)
 4. MoY  (categorical; int8 1-12, month of the year)
 5. days_in_M (numerical; int8 28-31, number of days in that row's month)
 6. weekday_weight (numerical, float, scaling for weekly shopping trends)
 7. retail_sales (numerical, float, scaling for Russian economy)
 8. week_retail_weight (numerical, float, scaling for days_in_M, weekday_weight, and retail_sales combined)
 

* To normalize monthly sales by the number of days in the month (28-31), multiply by ( 30 / (column "days_in_M")).
* To normalize monthly sales by the number of days in the month and by how many of each weekday (Sun, Mon, Tues...) the month contains (using mean sales over all 34 train months), multiply by column "weekday_weight".
* To normalize monthly sales by the recorded monthly retail sales numbers for Russia, multiply by column "retail_sales".
* To normalize monthly sales by all of the above at once (days in month, number of each weekday, and retail sales numbers for Russia), multiply by column "week_retail_weight".
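
As an illustration of applying these weights, the sketch below merges two rows of adjustment factors onto monthly totals and multiplies. The factor values are the first two rows of *date_adjustments* as displayed in this section's output; the monthly sales totals are hypothetical.

```python
import pandas as pd

# First two rows of date_adjustments as displayed in this section
date_adjustments = pd.DataFrame({
    "month": [0, 1],
    "days_in_M": [31, 28],
    "weekday_weight": [0.979, 1.069],
    "retail_sales": [1.052, 1.072],
    "week_retail_weight": [1.030, 1.146],
})

# Hypothetical monthly sales totals
monthly = pd.DataFrame({"month": [0, 1], "sales": [3100.0, 2800.0]})

adj = monthly.merge(date_adjustments, on="month", how="left")
adj["sales_per_30d"] = adj["sales"] * 30 / adj["days_in_M"]              # days-in-month only
adj["sales_fully_adjusted"] = adj["sales"] * adj["week_retail_weight"]   # all adjustments combined
print(adj[["month", "sales", "sales_per_30d", "sales_fully_adjusted"]])
```

In the displayed rows, "week_retail_weight" is close to the product of "weekday_weight" and "retail_sales" (for example, 0.979 × 1.052 ≈ 1.030), so the combined column should not be multiplied together with the individual ones.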


In [None]:
date_adjustments = days_by_month[['month','year','season','MoY','days_in_M','weekday_weight','retail_sales','week_retail_weight']].copy(deep=True)
display(date_adjustments.head())

date_adjustments.to_csv("data_output/date_adjustments.csv", index=False)

Unnamed: 0,month,year,season,MoY,days_in_M,weekday_weight,retail_sales,week_retail_weight
0,0,2013,2,1,31,0.979,1.052,1.03
1,1,2013,3,2,28,1.069,1.072,1.146
2,2,2013,0,3,31,0.946,0.989,0.936
3,3,2013,1,4,30,1.01,0.989,0.999
4,4,2013,2,5,31,0.973,0.966,0.94
