## First things first
* Click **File -> Save a copy in Drive** and click **Open in new tab** in the pop-up window to save your progress in Google Drive.
* Click **Runtime -> Change runtime type** and select **GPU** in the **Hardware accelerator** box to enable faster GPU training.

#**For Jupyter Notebook Readability:**
Many sections are grouped so that they can be collapsed for easier navigation to the code of interest.  For example, the code that creates new features and saves them to a csv file remains in this notebook for future reference, but once it has been run, a simple csv import is all that is needed; it does not have to be re-run every time we start a Google Colab runtime.  Unfortunately, I haven't found a way in Colab to set cell metadata so that these unnecessary cells are skipped by the "Run All" or "Run Before" menu options.  Apparently this can be done in a standard (non-Colab) Jupyter notebook, or maybe with a plug-in like the one [here](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/freeze).
</br></br>
For now, **I've tried to mark the cells that are essential for the development team to run with red headers** (' ' ' code comment blocks).  Optional code cells, and code cells intended for non-team members, carry green (######) comment blocks at the top.  These comments indicate whether the cell needs to be run, or whether its results are already computed and stored in a file that has been loaded (to save time).

#   -- ***To Do* Items** (from Coursera/Kaggle class info) --

**Optional Section - Intended for Project Development Team**

**Course Instructors' Requirements**

Follow these guidelines to simplify life for you and for the fellow learner. This is a must.

1. The solution runs without errors. 

2. Specify required libraries and their versions in the first notebook cell or in requirements.txt -- https://pip.readthedocs.io/en/1.1/requirements.html This will save a lot of time for other students, assessing your project.

3. Serialize the trained model to disk. This enables code to use the trained model to make predictions on the test data without re-training the model (which is typically much more time-intensive)
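
As a minimal sketch of requirement 3, the standard-library `pickle` module can write fitted model parameters to disk and read them back without retraining. The tiny least-squares line fit, the helper names `fit_line` and `predict`, and the file name `model.pkl` are placeholders invented for this illustration; a real scikit-learn or CatBoost model object would normally take their place.

```python
import os
import pickle
import tempfile


def fit_line(xs, ys):
    """Hypothetical 'training' step: ordinary least squares for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(params, xs):
    """Apply the stored parameters to new inputs."""
    return [params["intercept"] + params["slope"] * x for x in xs]


# "Train" once (toy data lying exactly on y = 1 + 2x).
params = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Serialize the trained parameters to disk.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(params, f)

# Later (e.g. in a fresh runtime): restore and predict without re-training.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

predictions = predict(restored, [4])
```

The same `dump`/`load` pattern applies to most Python model objects, although some libraries also ship their own save and load methods.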

**Review Criteria for Coursera Peer Review**

Pay attention to the following criteria. Try to complete most of them and present results in a form that can be easily assessed.

1. Clarity

>* Clear step-by-step instructions on how to produce the final submit file are provided.

>* Code has comments where it is needed and meaningful function names

2. Feature preprocessing and generation with respect to models

>* Several simple features are generated

>* For non-tree-based models, preprocessing is used or the absence of it is explained

3. Feature extraction from text and images

>* Features from text are extracted

>* Special preprocessing for text is utilized (TF-IDF, stemming, Levenshtein distance...)

4. EDA

>* Several interesting observations about data are discovered and explained

>* Target distribution is visualized, time trend is assessed

5. Validation

>* Type of train/test split is identified and used for validation

>* Type of public/private split is identified

6. Data leakages

>* Data is investigated for data leakages and investigation process is described

>* Found data leakages are utilized

7. Metrics optimization

>* Correct metric is optimized

8. Advanced Features I: mean encodings

>* Mean-encoding is applied

>* Mean-encoding is set up correctly, i.e. KFold or expanding scheme are utilized correctly

9. Advanced Features II

>* At least one feature from this topic is introduced (Statistics & Distance-Based Features, Matrix Factorizations, Feature Interactions, tSNE)

10. Hyperparameter tuning

>* Parameters of models are roughly optimal

11. Ensembles

>* Ensembling is utilized (linear combination counts)

>* Validation with ensembling scheme is set up correctly, i.e. KFold or Holdout is utilized

>* Models from different classes are utilized (at least two from the following: KNN, linear models, RF, GBDT, NN)
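
To make criterion 8 concrete, here is a minimal sketch of an expanding-mean target encoding in pandas. The column names (`item_id`, `target`, `item_target_enc`) and the toy values are assumptions for illustration, and the rows are assumed to be sorted in time order. It is not necessarily the scheme used later in this notebook.

```python
import pandas as pd

# Toy data, assumed to be sorted chronologically.
df = pd.DataFrame({
    "item_id": [1, 1, 2, 1, 2],
    "target":  [1.0, 3.0, 2.0, 5.0, 4.0],
})

global_mean = df["target"].mean()

# Sum and count of the target over *earlier* rows of the same item only,
# so a row's own target never leaks into its encoding.
prior_sum = df.groupby("item_id")["target"].cumsum() - df["target"]
prior_count = df.groupby("item_id").cumcount()

# The first occurrence of each item has no history (0/0 -> NaN);
# fall back to the global mean there.
df["item_target_enc"] = (prior_sum / prior_count).fillna(global_mean)
```

A KFold scheme follows the same idea, except that each row's encoding is computed from the folds that do not contain that row.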

#**Final Project for Coursera's 'How to Win a Data Science Competition'**

**Optional Section - Description of Project Objectives, Inputs, Outputs**
</br></br>
April, 2020

Andreas Theodoulou and Michael Gaidis

(Competition Info last updated:  3 years ago)

##**About this Competition**

You are provided with **daily** historical sales data. The task is to forecast the total number of units of each product sold in each shop (one prediction per shop-item pair in the test set) for the **month** of November 2015. Note that the list of shops(!) and products *changes slightly every month*. Creating a robust model that can handle such situations is part of the challenge.

.

##**File descriptions**

***sales_train.csv*** - the training set. Daily historical data from January 2013 to October 2015.

***test.csv*** - the test set. You need to forecast the sales for these shops and products for November 2015.

***sample_submission.csv*** - a sample submission file in the correct format (two columns: "ID", which identifies a shop-item pair in the test set, and "item_cnt_month", the predicted number of units sold for that pair in Nov. 2015)

***items.csv*** - item names, their corresponding item_categories IDs, and item IDs to link with the other files

***item_categories.csv***  - item category names and corresponding IDs to link with the other files

***shops.csv***- shop names and corresponding IDs to link with the other files

.

##**Data fields**

***ID*** - an Id that represents a (Shop, Item) tuple within the test set

***shop_id*** - unique identifier of a shop

***item_id*** - unique identifier of a product

***item_category_id*** - unique identifier of item category

***item_cnt_day*** - number of products sold. You are predicting a monthly amount of this measure

***item_price*** - current price of an item

***date*** - date in format dd.mm.yyyy

***date_block_num*** - a consecutive month number. January 2013 is 0, February 2013 is 1,..., October 2015 is 33

***item_name*** - name of item

***shop_name*** - name of shop

***item_category_name*** - name of item category
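
Because the target is monthly while the training data is daily, the daily rows have to be aggregated per month, shop, and item before modeling. The sketch below shows one way to do that on toy rows, deriving `date_block_num` from the date and naming the result `item_cnt_month` as in the sample submission. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy daily rows in the same layout as sales_train (values are made up).
daily = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "05.02.2013", "15.10.2015"],
    "shop_id": [59, 59, 59, 25],
    "item_id": [22154, 22154, 22154, 2552],
    "item_cnt_day": [1.0, 2.0, 1.0, 3.0],
})

# Dates are stored as dd.mm.yyyy strings.
daily["date"] = pd.to_datetime(daily["date"], format="%d.%m.%Y")

# Consecutive month number: January 2013 -> 0, ..., October 2015 -> 33.
daily["date_block_num"] = (daily["date"].dt.year - 2013) * 12 + daily["date"].dt.month - 1

# Sum daily counts into one row per (month, shop, item).
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
```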

#**Workflow**

##1. Configure Environment


*   Fork/copy shared ipynb as necessary, to not conflict with teammate
*   Load competition data files
*   Load any utility code files
*   Import libraries



##2. Explore Data


*   Data formatting and translating
*   Descriptive explanations for the competition data
*   Grouping and statistical descriptions of the provided features
*   Data visualizations and correlations
*   Look for signs of data leakage
*   Record initial thoughts on features and models to use



##3. Prepare Data


*   Data formatting and translating (see above)
*   Data cleaning (--> handling missing entries, outliers, NaNs, ...)
*   Data grouping / Date-related issues / re-cleaning if needed after grouping
*   Data normalization (recheck cleaning & normalizing with data visualizations)
*   Initial feature selection (quick and dirty) and preparation
*   Save data in compressed or pickled format if helpful; use version control
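
The last step above can be sketched as follows: a DataFrame is written both as a gzip-compressed csv and as a pickle, then read back. The file names (with a `_v1` suffix as a crude version tag), the column `n_sales`, and the toy values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for an engineered feature table.
features = pd.DataFrame({"shop_id": [0, 1], "n_sales": [12.0, 7.5]})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "features_v1.csv.gz")
pkl_path = os.path.join(out_dir, "features_v1.pkl")

# Compressed csv: portable and readable by other tools.
features.to_csv(csv_path, index=False, compression="gzip")

# Pickle: preserves dtypes exactly, but is Python/pandas specific.
features.to_pickle(pkl_path)

# Reload; read_csv infers gzip compression from the ".gz" suffix.
from_csv = pd.read_csv(csv_path)
from_pkl = pd.read_pickle(pkl_path)
```

The csv form is usually the safer choice for files shared through a git repo, while the pickle form is convenient for intermediate results within one environment.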



##4. Quick Modeling (set up framework for more complex model improvement)


*   Choose and implement a fast and simple approach for train/val data splitting
*   Choose a simple and fast evaluation metric (comparable to Kaggle's metric)
*   Choose a simple, but appropriate, model to use (minimal hyperparameters)
*   Train the model, check for major issues (absolutely horrible performance)
*   Save the model parameters, etc., along with version control
*   Submit model to Kaggle to verify proper formatting of entry
*   Verify that Kaggle test performance is reasonably close to validation metric
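
A minimal sketch of the first two steps above: a time-based holdout that validates on the most recent month, scored with root-mean-squared error. RMSE is an assumption here (check the competition page for the official metric definition), and the function names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def time_holdout_split(df, val_block):
    """Train on all months before `val_block`, validate on `val_block` itself."""
    train = df[df["date_block_num"] < val_block]
    val = df[df["date_block_num"] == val_block]
    return train, val


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy monthly data (made-up values).
monthly = pd.DataFrame({
    "date_block_num": [31, 32, 32, 33, 33],
    "item_cnt_month": [1.0, 2.0, 0.0, 3.0, 1.0],
})

train, val = time_holdout_split(monthly, val_block=33)

# Trivial baseline "model": always predict the training mean.
baseline = train["item_cnt_month"].mean()
val_rmse = rmse(val["item_cnt_month"], [baseline] * len(val))
```

Holding out the last month mirrors the competition setup, in which the test set is the month that follows the training data.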



##5. Refine the Model and the Features


###a) Features


*   Explore the data more deeply for feature correlations and data leaks to exploit
*   Consider complex feature generation based on intuition
*   Save data in compressed or pickled format if helpful for faster future iteration
*   Employ version control on datasets generated with new features / groupings

###b) Modeling


*   Look at alternative metrics for training and validation
*   Version control
*   Explore hyperparameter tuning for the initial quick and dirty model
*   Version control
*   Consider other models as time allows
*   Version control
*   Create ensembles as time allows
*   Version control
*   Adjust methods of train/val splitting if desirable and timely
*   Version control







##6. Finalize Model


*   Restart kernel, clean any possible lingering variables
*   Train and tune hyperparameters until you run out of time
*   Submit model



---



---





#0. Configure Environment

*  **Section 0.1 is optional**

*  **Section 0.2 is NOT optional**

##0.1) Install Packages (for Google Colab)
**Optional Section**

To run this notebook in its entirety, you will need a few nonstandard packages (i.e., packages that are not preinstalled in Google Colab).  We install them here.

In [0]:
############################################################

# Run this cell only if you plan to redo all preprocessing and feature generation
#   (otherwise, you can use existing, already saved data files, and eliminate many hours of runtime)
#   ** If you are not running in Google Colab, remove the !pip statement, and instead make sure googletrans 2.4.0 is installed on your machine

############################################################

# Translating Package using Google API
#   used for translating Russian text in dataframes, so we can better understand potential features, data leaks, or outliers
!pip install googletrans==2.4.0  # pin the version this notebook was developed with

# Assuming you are planning to use this package (because you ran this cell and imported the googletrans package),
#  we will go ahead and import the library and instantiate a Translator class
from googletrans import Translator
translator = Translator()

In [0]:
############################################################

# Run this cell only if you plan to redo preprocessing and feature generation related to shop location
#   (otherwise, you can use existing, already saved data files, and eliminate many hours of runtime)
#   ** If you are not running in Google Colab, remove the !pip statement, and instead make sure geopy 1.17.0 is installed on your machine

############################################################

# Geocoding library 
#   used for creating features from shop location
!pip install geopy==1.17.0   # pin the version this notebook was developed with

# Assuming you are planning to use this package (because you ran this cell and imported the geopy package),
#  we will go ahead and import the library elements and instantiate two rate-limited geocoders
from geopy.geocoders import Nominatim
from geopy.geocoders import GeoNames
from geopy.extra.rate_limiter import RateLimiter

# Utilize "RateLimiter" to limit location queries to one per second, as the free services tend to throttle rate of use
# We will use Nominatim for location, and GeoNames for population
nominatum_service = Nominatim(timeout=10, user_agent = "mgaidis@yahoo.com", format_string="%s, Russia")
nominatum_geocode = RateLimiter(nominatum_service.geocode, min_delay_seconds=1)
geonames_service = GeoNames(username='gaidis', timeout=10, user_agent="mgaidis@yahoo.com")  # be sure to enable free web services when creating geonames account
geonames_geocode = RateLimiter(geonames_service.geocode, min_delay_seconds=1)

##0.2) Import Libraries/Modules
**NOT OPTIONAL**

In [0]:
'''
############################################################

# Everyone should run this cell if they want to recreate any of the computations or EDA

############################################################
'''

# General python libraries/modules used throughout the notebook
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import os
from itertools import product
import re
import json
import time
from time import sleep, localtime, strftime
import pickle

'''
# NLP packages
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
'''

'''
# ML packages
from sklearn.linear_model import LinearRegression

!pip install catboost
from catboost import CatBoostRegressor 
'''

# Magics
%matplotlib inline


In [0]:
'''
############################################################

# Run this cell only if you have a copy of the GitHub repo on your "local" Google Drive, and plan to use Colab for code execution

############################################################
'''
# We run this notebook in Google Colab, using data files copied into our Google Drive, and controlled by git
# Importing the google.colab.drive module is required to mount the Google Drive so Colab has access to these Google Drive files
# New users of this notebook can ignore this Google Drive methodology and simply import the files we have placed in our public GitHub repo (see below)

from google.colab import drive  

#1. Load Data and Code Utility Files

*  **Section 1.1 is NOT optional**

*  **Section 1.2 is NOT optional** (but choose only one of the three file load methods)

The list of data files contains the original files provided by Kaggle, which can be run through this notebook to generate the new features used in our model.

As the data set preprocessing and feature generation can take several hours, this notebook is set up to also read data files that contain the features we created.  This allows you to skip execution of the code cells that take a long time to generate features. (They will be highlighted, so you will know which ones to skip.)

The code cells below therefore load the original and the augmented data files, and allow you to choose whether or not to execute "re-creation" of the augmented data files.



##1.1) Enter Data File Names and Paths

**NOT Optional**

In [0]:
'''
############################################################

# Everyone should run this cell and one of the following 3 options for loading the data files

############################################################
'''
# List of the data files (path relative to GitHub master), to be loaded into pandas DataFrames
data_files = [  "readonly/final_project_data/items.csv",
                "readonly/final_project_data/item_categories.csv",
                "readonly/final_project_data/shops.csv",
                "readonly/final_project_data/sample_submission.csv.gz",
                "readonly/final_project_data/sales_train.csv.gz",
                "readonly/final_project_data/test.csv.gz",
                "data_output/shops_transl.csv",
                "data_output/shops_augmented.csv",
                "data_output/item_categories_transl.csv",
                "data_output/item_categories_augmented.csv",
                "data_output/items_transl.csv"  ]

# Dict of helper code files, to be loaded and imported {filepath : import_as}
code_files = {"helper_code/kaggle_utils_at_mg.py" : "kag_utils"}

# GitHub file location info
git_hub_url = "https://raw.githubusercontent.com/migai"
repo_name = 'Kag'
branch_name = 'master'
base_url = "/".join([git_hub_url, repo_name, branch_name])  # URLs always use "/", whereas os.path.join would use "\" on Windows

##1.2) Load Data Files

**NOT Optional** (but, choose only one of the 3 methods below)
</br></br>
Three options are provided for loading the source data files and (if desired) the same data files augmented with additional preprocessing material.  The files are located in GitHub, within a public repo [migai/Kag](https://github.com/migai/Kag)

1. If you are running in Google Colab without git

2. If you are running in Google Colab with git (and have already cloned the repo from GitHub to your Google Drive).  Contact me if you wish to do this, and I can get you set up with an appropriate git token, etc.

3. If you are running on a local machine / not using Colab

**Expand the appropriate ipynb section for your needs, and use those code cells to load the data**
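
All three options derive each DataFrame's name from its file path and then create the variable with `exec`. As an alternative sketch (not what the cells in this notebook do), the same naming rule can fill a dictionary of DataFrames, which avoids `exec` at the cost of writing `frames["shops"]` instead of `shops`. The function names `frame_name_from_path` and `load_frames` and the temporary toy files are hypothetical.

```python
import os
import tempfile

import pandas as pd


def frame_name_from_path(path_name):
    """'dir/sub/sales_train.csv.gz' -> 'sales_train'."""
    return path_name.rsplit("/", 1)[-1].split(".")[0]


def load_frames(base_dir, paths):
    """Read each csv (optionally gzipped) into a dict keyed by its derived name."""
    frames = {}
    for path_name in paths:
        frames[frame_name_from_path(path_name)] = pd.read_csv(os.path.join(base_dir, path_name))
    return frames


# Build two tiny files so the sketch runs without network access.
base_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(base_dir, "data"))
pd.DataFrame({"shop_id": [0, 1]}).to_csv(
    os.path.join(base_dir, "data", "shops.csv"), index=False)
pd.DataFrame({"ID": [0], "shop_id": [5]}).to_csv(
    os.path.join(base_dir, "data", "test.csv.gz"), index=False, compression="gzip")

frames = load_frames(base_dir, ["data/shops.csv", "data/test.csv.gz"])
```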

####**Option 1:  You are running in Colab and are not integrating git with Google Drive**

In [0]:
############################################################

# Run this cell if you are executing this notebook in Google Colab, but you
#   are not using git to integrate Colab with GitHub and Google Drive 
#   (e.g., you are not a code developer on this team, but you run in Colab)

############################################################

def xfer_github_to_colab(path, filename):
    os.system("wget " + base_url + "/{} -O {}".format(path, filename))
    print(base_url + "/" + path + " ---> loaded into ---> " + filename)
    return

try:
    import google.colab
    IN_COLAB = True
except ImportError:  # google.colab is only importable inside a Colab runtime
    IN_COLAB = False
if IN_COLAB:
    print("Loading Files from GitHub to Colab...\n")

    # Loop to load the above data files into appropriately-named pandas DataFrames
    for path_name in data_files:
      filename = path_name.rsplit("/")[-1]
      xfer_github_to_colab(path_name, filename)
      data_frame_name = path_name.rsplit("/")[-1].split(".")[0]
      exec(data_frame_name + " = pd.read_csv(filename)")
      if data_frame_name == 'sales_train':
        sales_train['date'] = pd.to_datetime(sales_train['date'], format = '%d.%m.%Y')
      print("Data Frame: " + data_frame_name)
      print(eval(data_frame_name).head(2))
      print("\n")


    # Load in any helper functions from the code_files dictionary
    for code_path, import_as in code_files.items():
      code_filename = code_path.rsplit("/")[-1]
      xfer_github_to_colab(code_path, code_filename)
      exec("import " + code_filename[:-3] + " as " + import_as)  # no ".py" on the filepath for import


Loading Files from GitHub to Colab...

https://raw.githubusercontent.com/migai/Kag/masterreadonly/final_project_data/items.csv ---> loaded into ---> items.csv
Data Frame: items
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76


https://raw.githubusercontent.com/migai/Kag/masterreadonly/final_project_data/item_categories.csv ---> loaded into ---> item_categories.csv
Data Frame: item_categories
        item_category_name  item_category_id
0  PC - Гарнитуры/Наушники                 0
1         Аксессуары - PS2                 1


https://raw.githubusercontent.com/migai/Kag/masterreadonly/final_project_data/shops.csv ---> loaded into ---> shops.csv
Data Frame: shops
                       shop_name  shop_id
0  !Якутск Орджоникидзе, 56 фран        0
1  !Якутск ТЦ "Центральный" фран        1


https://ra

####**Option 2:  You are running in Colab with local git repo on Google Drive**
(e.g., you are a code developer on this team --> **USE THIS**)

In [4]:
'''
############################################################

# Execute this cell if you are a code developer on this team, and are
#   using git to coordinate Google Drive with GitHub repos (and you are using Colab)

############################################################
'''

drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [5]:
'''
############################################################

# Execute this cell if you are a code developer on this team, and are
#   using git to coordinate Google Drive with GitHub repos (and you are using Colab)

############################################################
'''

# Replace this path with the path on *your* Google Drive where Kag repo master branch is stored
# (the assignment must sit outside the ''' comment block above, or it will never execute)
GDRIVE_REPO_PATH = "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag"

%cd "{GDRIVE_REPO_PATH}"

print("Loading Files from Google Drive repo into Colab...\n")

# Loop to load the data files into appropriately-named pandas DataFrames
for path_name in data_files:
  filename = path_name.rsplit("/")[-1]
  data_frame_name = filename.split(".")[0]
  exec(data_frame_name + " = pd.read_csv(path_name)")
  if data_frame_name == 'sales_train':
    sales_train['date'] = pd.to_datetime(sales_train['date'], format = '%d.%m.%Y')
  print("Data Frame: " + data_frame_name)
  print(eval(data_frame_name).head(2))
  print("\n")


# Load in any helper functions from the code_files dictionary
#    dictionary key is the path (replace "/" with "." when using Google Drive + Colab), 
#      and dictionary value is the module reference name
#    note that the directory chain on GitHub (and local repo) from current directory down to the .py file
#      must include a "__init__.py" file (it can be empty) in each of the directories
for filepath, module in code_files.items():
  path_name = filepath.replace("/",".")[:-3]  # Google Drive reference does not use .py, and uses a "." instead of "/" for directory delineation
  exec("import " + path_name + " as " + module)

# Sanity check test
#test1 = kag_utils.add_one(2)
#print(test1)

/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
Loading Files from Google Drive repo into Colab...

Data Frame: items
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76


Data Frame: item_categories
        item_category_name  item_category_id
0  PC - Гарнитуры/Наушники                 0
1         Аксессуары - PS2                 1


Data Frame: shops
                       shop_name  shop_id
0  !Якутск Орджоникидзе, 56 фран        0
1  !Якутск ТЦ "Центральный" фран        1


Data Frame: sample_submission
   ID  item_cnt_month
0   0             0.5
1   1             0.5


Data Frame: sales_train
        date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0 2013-01-02               0       59    22154       999.0           1.0
1 2013-01-03               0      

####**Option 3:  You are running this code on a local machine**

In [0]:
############################################################

# Execute this cell if you are executing this notebook on your local machine, and
#   therefore need no special accommodations for Colab integration

############################################################

print("Loading files from GitHub...\n")

# Loop to load the data files into appropriately-named pandas DataFrames
for path_name in data_files:
    full_url = base_url + "/" + path_name  # build the URL with "/" (os.path.join would use "\" on Windows)
    data_frame_name = path_name.rsplit("/")[-1].split(".")[0]
    exec(data_frame_name + " = pd.read_csv(full_url)")
    if data_frame_name == 'sales_train':
      sales_train['date'] = pd.to_datetime(sales_train['date'], format = '%d.%m.%Y')
    print("Data Frame: " + data_frame_name)
    print(eval(data_frame_name).head(2))
    print("\n")

# Load in any helper functions from the code_files dictionary
#############################
# You need to do this manually
#############################
  

Loading files from GitHub...

Data Frame: items
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76


Data Frame: item_categories
        item_category_name  item_category_id
0  PC - Гарнитуры/Наушники                 0
1         Аксессуары - PS2                 1


Data Frame: shops
                       shop_name  shop_id
0  !Якутск Орджоникидзе, 56 фран        0
1  !Якутск ТЦ "Центральный" фран        1


Data Frame: sample_submission
   ID  item_cnt_month
0   0             0.5
1   1             0.5


Data Frame: sales_train
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154       999.0           1.0
1  03.01.2013               0       25     2552       899.0           1.0


Data Frame: test
   ID  shop_id  item_id
0   0       
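
For Option 3, loading the helper module is left as a manual step. One possible way to do it, sketched here with a throwaway module written to a temporary directory, is to load a `.py` file from an explicit path with `importlib.util`. The function `import_from_path`, the file contents, and `add_one` are assumptions for illustration; the commented-out sanity check in Option 2 suggests that `kag_utils.add_one` exists, but that is not confirmed here.

```python
import importlib.util
import os
import tempfile


def import_from_path(module_name, file_path):
    """Load the Python file at `file_path` and return it as a module object."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in for a downloaded copy of the helper file (contents are hypothetical).
helper_path = os.path.join(tempfile.mkdtemp(), "kaggle_utils_at_mg.py")
with open(helper_path, "w") as f:
    f.write("def add_one(x):\n    return x + 1\n")

# Equivalent in spirit to "import kaggle_utils_at_mg as kag_utils".
kag_utils = import_from_path("kag_utils", helper_path)
result = kag_utils.add_one(2)
```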

#2. Explore Data (EDA) and Feature Generation

**Optional Section**

##2.1) Basic Data Overview

**Optional Section**

*  Investigation of *sales_train* dataset
*  Investigation of *test* dataset
*  Clues as to the overlap between the two, and what might make a feature important

In [7]:
print("Total number of rows in sales_train set: " + str(len(sales_train)))
print("Number of unique item IDs in sales_train set: " + str(sales_train.item_id.nunique()))
print("Number of unique shop IDs in sales_train set: " + str(sales_train.shop_id.nunique()))
shop_item_pairs = sales_train.groupby(['shop_id','item_id']).size().reset_index().rename(columns={0:'n_train_rows'})
shop_item_pairs['pair'] = list(zip(shop_item_pairs.shop_id, shop_item_pairs.item_id))  # make a column of (shop, item) tuples to compare with test set
print(shop_item_pairs.n_train_rows.describe())
shop_item_pairs.head()

Total number of rows in sales_train set: 2935849
Number of unique item IDs in sales_train set: 21807
Number of unique shop IDs in sales_train set: 60
count    424124.000000
mean          6.922148
std          15.694255
min           1.000000
25%           1.000000
50%           3.000000
75%           7.000000
max         867.000000
Name: n_train_rows, dtype: float64


   shop_id  item_id  n_train_rows     pair
0        0       30             9  (0, 30)
1        0       31             7  (0, 31)
2        0       32            11  (0, 32)
3        0       33             6  (0, 33)
4        0       35            12  (0, 35)


In [8]:
print("Total number of rows in test set: " + str(len(test)))
print("Number of unique item IDs in test set: " + str(test.item_id.nunique()))
print("Number of unique shop IDs in test set: " + str(test.shop_id.nunique()))
print("   note:  214,200 = 5,100 * 42")
test_shop_item_pairs = test.groupby(['shop_id','item_id']).size().reset_index().rename(columns={0:'n_test_rows'})
test_shop_item_pairs['pair'] = list(zip(test_shop_item_pairs.shop_id, test_shop_item_pairs.item_id))  # make a column of (shop, item) tuples to compare with train set
test_shop_item_pairs['in_train_data'] = test_shop_item_pairs.pair.isin(shop_item_pairs.pair)
print(test_shop_item_pairs.n_test_rows.describe())
test_shop_item_pairs.head()

Total number of rows in test set: 214200
Number of unique item IDs in test set: 5100
Number of unique shop IDs in test set: 42
   note:  214,200 = 5,100 * 42
count    214200.0
mean          1.0
std           0.0
min           1.0
25%           1.0
50%           1.0
75%           1.0
max           1.0
Name: n_test_rows, dtype: float64


   shop_id  item_id  n_test_rows     pair  in_train_data
0        2       30            1  (2, 30)           True
1        2       31            1  (2, 31)           True
2        2       32            1  (2, 32)           True
3        2       33            1  (2, 33)           True
4        2       38            1  (2, 38)          False


A standard deviation of 0 with a mean of 1 for n_test_rows indicates that every shop-item pair appears exactly once in the test data set (no duplicate rows).

Let's see if each of the test set's "shop-item" pairs is present in the train data set as well.

In [9]:
n_tuples_test = len(test_shop_item_pairs)
print("Total number of unique shop-item pairs in the test set: " + str(n_tuples_test))
n_tuples_test_and_train = test_shop_item_pairs.in_train_data.sum()
n_tuples_test_only = n_tuples_test - n_tuples_test_and_train
print("\nFor shop-item pairs in both the train and test sets:")
print("Number of shop-item pairs present in both train and test: " + str(n_tuples_test_and_train))
print("Number of unique item IDs: ", end="")
print(test_shop_item_pairs[test_shop_item_pairs.in_train_data == True].item_id.nunique())
print("Number of unique shop IDs: ", end="")
print(test_shop_item_pairs[test_shop_item_pairs.in_train_data == True].shop_id.nunique())
print("\nFor shop-item pairs present only in the test set:")
print("Number of shop-item pairs present only in the test set: " + str(n_tuples_test_only))
print("Number of unique item IDs: ", end="")
print(test_shop_item_pairs[test_shop_item_pairs.in_train_data == False].item_id.nunique())
print("Number of unique shop IDs: ", end="")
print(test_shop_item_pairs[test_shop_item_pairs.in_train_data == False].shop_id.nunique())
print("\n")
test_shop_item_pairs.head()

Total number of unique shop-item pairs in the test set: 214200

For shop-item pairs in both the train and test sets:
Number of shop-item pairs present in both train and test: 111404
Number of unique item IDs: 4716
Number of unique shop IDs: 42

For shop-item pairs present only in the test set:
Number of shop-item pairs present only in the test set: 102796
Number of unique item IDs: 5100
Number of unique shop IDs: 42




   shop_id  item_id  n_test_rows     pair  in_train_data
0        2       30            1  (2, 30)           True
1        2       31            1  (2, 31)           True
2        2       32            1  (2, 32)           True
3        2       33            1  (2, 33)           True
4        2       38            1  (2, 38)          False


In [10]:
test_items = pd.DataFrame({'item_id':test.item_id.unique(),'in_train':False})
test_items['in_train'] = test_items.item_id.isin(sales_train.item_id)
test_items['in_item_db'] = test_items.item_id.isin(items.item_id)
n_items_test_and_train = test_items.in_train.sum()
n_items_test_in_item_db = test_items.in_item_db.sum()
print(len(test_items))
print(str(n_items_test_and_train))
print(str(n_items_test_in_item_db))
test_items.head()

5100
4737
5100


   item_id  in_train  in_item_db
0     5037      True        True
1     5320     False        True
2     5233      True        True
3     5232      True        True
4     5268     False        True


*  There are 424,124 unique pairs of "shop_id + item_id" in the 2,935,849 rows of the *sales_train* dataset

*  Half the shop-item pairs have 3 or fewer rows in the entire *sales_train* dataset, and at least 25% of the shop-item pairs have only 1 row in the *sales_train* dataset.

*  There are 22,170 unique item_id values in the *items* dataset, of which only 21,807 are present in the *sales_train* dataset.  Only 5,100 of the 22,170 items are present in the *test* dataset.  The 22,170 - 21,807 = 363 items present in the *items* dataset but not in the *sales_train* dataset are all present in the *test* set.  Therefore, of the 5,100 unique items in the *test* set, only 5,100 - 363 = 4,737 items are also present in the *sales_train* set, and our model will have to make predictions on sales of 363 items for which we have no historical record of ever being sold.

*  There are 60 unique shop_id values in the *shops* dataset, all of which are present in the *sales_train* dataset, but only 42 of which are present in the *test* dataset.

*  The *test* dataset has exactly 214,200 unique pairs of "shop_id + item_id".  Because 214,200 = 5,100 \* 42, we know that the *test* set contains every possible pairing of the 42 shops with the 5,100 items.

*  Of the 214,200 unique shop-item pairs in the *test* set, only 111,404 are present in the *sales_train* data set.  Therefore, our model will need to predict sales for a very large number of items at shops that have no recorded history of selling that item previously.  This type of prediction makes up roughly half (102,796 of 214,200) of the total predictions we need to make.

*  The entire set of 42 unique *test* set shop IDs is present in both the set of shop-item pairs common to *test* and *sales_train* and the set of shop-item pairs found only in the *test* set.

*  All 5,100 unique item_id values in the *test* set are found in the set of shop-item pairs found in *test* but not in *sales_train*.  Only 4,716 unique item_id values are in the *test* set corresponding to shop-item pairs that are present in both *test* and *sales_train*.
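
The checks behind these observations can be sketched on toy data: whether the test set is a full shop-by-item grid, how many test pairs already occur in training, and which test items have never been sold. This variant uses `DataFrame.merge` with `indicator=True` rather than the tuple-column approach in the cells above. All values and variable names are made up for illustration.

```python
from itertools import product

import pandas as pd

# Toy training rows (one pair appears twice) and a toy test grid.
train_pairs = pd.DataFrame({"shop_id": [0, 0, 1, 0], "item_id": [10, 11, 10, 10]})
test = pd.DataFrame(list(product([0, 1], [10, 11, 12])), columns=["shop_id", "item_id"])

# Full grid: row count equals (#shops * #items) and no pair is repeated.
is_full_grid = bool(
    len(test) == test["shop_id"].nunique() * test["item_id"].nunique()
    and not test.duplicated().any()
)

# Left-merge against the de-duplicated training pairs; "_merge" records
# whether each test pair was also found in training.
seen = train_pairs.drop_duplicates()
merged = test.merge(seen, on=["shop_id", "item_id"], how="left", indicator=True)
n_seen_pairs = int((merged["_merge"] == "both").sum())
n_novel_pairs = int((merged["_merge"] == "left_only").sum())

# Items in the test set that never appear in training at any shop.
novel_items = sorted(set(test["item_id"]) - set(train_pairs["item_id"]))
```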

</br>

A few questions to think about:

1.  Do we train our model for the 111,404 shop-item pairs that are in both the *test* and the *sales_train* sets, and predict 0 sales for the shop-item pairs that historically have not existed?  Or, perhaps we assign weights so that the *test* set's novel pairs are treated differently in some way? </br> Is it possible that certain shops lag other shops in obtaining items for sale?  For the novel *test* set pairs, we should look to predict from our model by extrapolating *sales_train* data from similar shops and items in previous months.  It could help to have a "similarity" metric between each of the shop-item pairs in the *test* set and each of the shop-item pairs in the *sales_train* set, and use this metric to restrict which shop-item pairs in the *sales_train* set are useful for predicting the sales of novel shop-item pairs in the *test* set.

2.  Do we use the full *sales_train* dataset to train our model, or do we eliminate certain rows that pertain to irrelevant shop_id or item_id or shop-item pairs?
</br>
The *shops* dataframe is only 60 rows, so it shouldn't be too hard to analyze manually and see if we can find some sort of data leak that allows us to use only a subset of the shops when training our model.  We should certainly keep all *sales_train* data pertaining to the 42 shops in the *test* set, but how much of the data from the other 18 shops should we use to train our model?

3.  Because we will be predicting future sales of 373 items for which we have no record of any sale, we will need to rely heavily on item category or other similarities with items that we *do* have in our *sales_train* set.
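As a rough illustration of question 3, the sketch below finds *test* items that never appear in training and attaches a category-level average as a fallback.  The category-mean fallback is only one illustrative option (not a method this notebook has settled on), and all frames and values here are toy data.

```python
import pandas as pd

# Toy stand-ins (made-up values): item metadata and a few training rows.
items_toy = pd.DataFrame({
    "item_id": [10, 11, 12, 13],
    "item_category_id": [40, 40, 55, 55],
})
sales_toy = pd.DataFrame({
    "item_id": [10, 10, 11, 12],
    "item_cnt_day": [1.0, 2.0, 3.0, 4.0],
})
test_item_ids = {10, 13}

# Items we must predict for but have never seen sold.
never_sold = sorted(test_item_ids - set(sales_toy["item_id"]))

# Average units per training row, by category, as a crude fallback signal.
cat_mean = (sales_toy.merge(items_toy, on="item_id")
            .groupby("item_category_id")["item_cnt_day"].mean())

novel_items = items_toy[items_toy["item_id"].isin(never_sold)].copy()
novel_items["fallback_cnt"] = novel_items["item_category_id"].map(cat_mean)
print(novel_items)
```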

##2.2) Detailed Investigation of Descriptive Data Sets

**Optional Section**

*  Russian-to-English Translation
*  *shops* data set
*  *item_categories* data set
*  *items* data set


###2.2.1) ***shops*** Dataset: EDA and Feature Generation

---



---



#####2.2.1.1) **Translate and Ruminate**
We will start by translating the Russian text in the dataframe, and add our ruminations on possible new features we can generate.

The dataframe shops_transl (equivalent to shops + 'column for English translation') is saved as a .csv file so we do not have to repeat the translation process the next time we open a Google Colab runtime.

**Tip**
</br>
If you want to run this code to generate translated features for the shops, be sure to install the googletrans package and import Translator as in the code above.

If you have already created and saved this data, save time by importing the modified csv datafile and you won't have to re-run the translating.  (This can be a big deal, because Google Translate API restricts amount of usage and/or rate of usage for calls to the translator.)
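One way to make the "translate once, then reload" habit automatic is a small cache helper.  The sketch below is an assumption about how this could be organized, not code from this notebook: `load_or_build` and `fake_translate` are hypothetical names, and the stand-in builder avoids calling any translation service.

```python
import os
import tempfile

import pandas as pd


def load_or_build(csv_path, build_fn):
    """Return the cached frame at csv_path, or build it once and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = build_fn()
    df.to_csv(csv_path, index=False)
    return df


calls = []


def fake_translate():
    # Hypothetical stand-in for the slow, rate-limited translation step.
    calls.append(1)
    return pd.DataFrame({"shop_id": [0, 1], "En_Name": ["Yakutsk", "Adygea"]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shops_transl.csv")
    first = load_or_build(path, fake_translate)   # builds and saves
    second = load_or_build(path, fake_translate)  # reads the saved file

print("expensive step ran", len(calls), "time(s)")
```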

In [0]:
#################################################
#  Do NOT run unless recreating from beginning
#    This computation already stored in data file
#################################################

shops_transl = shops.copy(deep=True)
shops_transl['En_Name'] = shops_transl.shop_name.apply(lambda x: translator.translate(x, src='ru', dest='en').text)
print(len(shops_transl))
shops_transl.head()

Unnamed: 0,shop_name,shop_id,En_Name
0,"!Якутск Орджоникидзе, 56 фран",0,"! Yakutsk Ordzhonikidze, 56 Franc"
1,"!Якутск ТЦ ""Центральный"" фран",1,"! Yakutsk TC ""Central"" Franc"
2,"Адыгея ТЦ ""Мега""",2,"Adygea TC ""Mega"""
3,"Балашиха ТРК ""Октябрь-Киномир""",3,"Balashikha TRC ""October-Kinomir"""
4,"Волжский ТЦ ""Волга Молл""",4,"Volzhsky mall ""Volga Mall"""
5,"Вологда ТРЦ ""Мармелад""",5,"Vologda SEC ""Marmalade"""
6,"Воронеж (Плехановская, 13)",6,"Voronezh (Plekhanovskaya, 13)"
7,"Воронеж ТРЦ ""Максимир""",7,"Voronezh SEC ""Maksimir"""
8,"Воронеж ТРЦ Сити-Парк ""Град""",8,"Voronezh shopping center City Park ""Castle"""
9,Выездная Торговля,9,Itinerant trade


In [0]:
#################################################
#  Do NOT run unless recreating from beginning
#    This computation already stored in data file
#################################################

shops_transl.to_csv("data_output/shops_transl.csv", index=False)

Observations:


1.  The number of shops is only 60, so manual feature generation is not out of the question.
2.  Most shops have a city associated with their name, so it's reasonable that we can do some feature generation based on shop location.
3.  (After Googling several of the shops, we realized that...) The shop type may be categorized.  We noticed that "Mega" shops were located inside shopping malls that were anchored (and managed) by Ikea stores.  "SEC" acronym implies the shop is part of a shopping and entertainment center (like a large shopping mall with a cinema or other activities).  SC, TC, TRK, and TRC acronyms generally imply the shop is in a standard shopping mall, but careful inspection on the world wide interweb shows that some of these shops are actually in SECs.  We will try assigning a type to each store from the following:

**['Online', 'Itinerant', 'Shop', 'Mall', 'Mega', 'SEC']**



>*   Online (like Amazon or eBay)
*   Itinerant (traveling salesman)
*   Shop (small, isolated store)
*   Mall (store is based in a shopping mall)
*   Mega (store is based in an Ikea-managed mall)
*   SEC (store is located in a shopping-entertainment complex)
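A keyword pass over the translated names could give a first guess at these types before correcting by hand.  This is only a sketch: the keyword lists (including "online" and "internet") are assumptions, and, as noted above, some TC/TRK/TRC shops are really SECs, so the output would still need manual review.

```python
def guess_shop_type(en_name):
    """First-pass guess at shop type from the English name (hypothetical rules)."""
    name = en_name.lower()
    tokens = name.split()
    if "online" in name or "internet" in name:  # assumed keywords
        return "Online"
    if "itinerant" in name:
        return "Itinerant"
    if "mega" in name:
        return "Mega"
    if "sec" in tokens:
        return "SEC"
    if any(tok in tokens for tok in ("tc", "trc", "trk", "sc", "mall")):
        return "Mall"
    return "Shop"


examples = [
    '! Yakutsk TC "Central" Franc',
    'Adygea TC "Mega"',
    'Vologda SEC "Marmalade"',
    "Voronezh (Plekhanovskaya, 13)",
    "Itinerant trade",
]
for name in examples:
    print(f"{name!r:40} -> {guess_shop_type(name)}")
```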




Once we have extracted city information from the shop names, we can use a geolocator package to help categorize the location of the store, and even the population of the city in which the store is located.
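Most shop names begin with the city, so a first-token rule is a plausible starting point for the city extraction.  The sketch below is an assumption, not the procedure actually used here: `extract_city` and `NON_CITY_PREFIXES` are hypothetical names, multi-word city names would be truncated, and online shops would have to be added to the exclusion set by hand.

```python
# First tokens that are not cities; "Выездная" comes from row 9 of the shops table.
# Extend this set by hand (e.g., for online shops).
NON_CITY_PREFIXES = {"Выездная"}


def extract_city(shop_name, non_city=NON_CITY_PREFIXES):
    """Return the leading token of a shop name as its city, or 'None'."""
    tokens = shop_name.lstrip("!").split()
    if not tokens or tokens[0] in non_city:
        return "None"  # string 'None', matching the convention used for non-located shops
    return tokens[0]


for name in ["!Якутск Орджоникидзе, 56 фран",
             "Воронеж (Плехановская, 13)",
             "Выездная Торговля"]:
    print(name, "->", extract_city(name))
```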


The geopy package seems pretty good for performing the location-based categorization.  The free services Nominatim and GeoNames can work with geopy to give us longitude, latitude, federal district, and population, for example.
</br></br>
Latitude and longitude of the shops are likely too fine-grained and would invite overfitting in our model.  Instead, we can generate a feature based on Russian **Federal District**, as retrieved with the Nominatim geocoding service.  Due to religious preferences, for example, a certain region may or may not be biased toward higher sales in November (before Christmas).  The map below shows roughly how Nominatim would categorize the Russian Federal Districts (the red text in the image):

<img src="https://www.worldatlas.com/r/w728-h425-c728x425/upload/4c/4b/0f/shutterstock-183567236-1.jpg">

</br>
We have no shops in the dataframe that come from the North Caucasian Federal District, so the category types for district are as follows ('None' indicates online or itinerant shops):

**['Central', 'Northwestern', 'Siberian', 'Ural', 'Volga', 'South', 'Eastern', 'None']**



#####2.2.1.2) **Geocoding New Features**
We now apply geocoding to the shop locations to create features including Russian Federal District and population of the shop's city.

*shops_augmented.csv* file is saved to contain the original *shops.csv* data plus the English translation plus a feature column for Federal District and a feature column for population.

**Tip**
</br>
If you want to run this code to generate geo features for the shops, be sure to install the geopy package and import Nominatim and GeoNames as in the code above.

If you have already created and saved this data, save time by importing the modified csv datafile and you won't have to re-run the geocoding.

In [0]:
#################################################
#  Do NOT run unless recreating from beginning
#    This computation already stored in data file
#################################################

# Add 'district' column to the shops_augmented dataframe using Nominatim
shops_augmented['district'] = shops_augmented.City.apply(lambda x:  'None' if (x == 'None') else re.search(r'[,\s](\w*)\sFederal District', str(nominatum_geocode(x, language='en'))).group(1))

# Add 'population' column to the shops_augmented dataframe using GeoNames
shops_augmented['population'] = shops_augmented.City.apply(lambda x:  'None' if (x == 'None') else geonames_geocode2(x, timeout=10).raw['population'])
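The district lookup above calls `.group(1)` directly, which raises an `AttributeError` whenever the regular expression finds no match.  A slightly more defensive sketch of the same regex is shown below; the address strings are hypothetical examples of what a geocoder might return, and `district_from_address` is a made-up helper name.

```python
import re

# Same pattern as in the cell above: the single word before "Federal District".
DISTRICT_RE = re.compile(r"[,\s](\w*)\sFederal District")


def district_from_address(address):
    """Return the word preceding 'Federal District', or 'None' if absent."""
    match = DISTRICT_RE.search(str(address))
    return match.group(1) if match else "None"


# Hypothetical address strings, for illustration only.
samples = [
    "Voronezh, Voronezh Oblast, Central Federal District, Russia",
    "Yakutsk, Sakha Republic, Far Eastern Federal District, Russia",
    "An address with no district information",
]
for address in samples:
    print(district_from_address(address))
```

Because the pattern captures only one word, a two-word district such as "Far Eastern" comes back as "Eastern", which is consistent with the 'Eastern' label in the category list.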

In [0]:
#################################################
#  Do NOT run unless recreating from beginning
#    This computation already stored in data file
#################################################

# GeoNames had some trouble with Zhukovsky and Chekhov... insert values from Wikipedia
shops_augmented.at[56,'population'] = 61000
shops_augmented.at[10,'population'] = 105000
shops_augmented.at[11,'population'] = 105000

# for 'Itinerant' (traveling salesman) shop, set the population to 100,000 (roughly the number of people the salesman might have access to in day)
shops_augmented.at[9,'population'] = 100000

# for 'Online' shops, set the population to 20,000,000 (estimate of the number of people who have internet and would place an online order from Russia)
shops_augmented.at[12,'population'] = 20000000
shops_augmented.at[20,'population'] = 20000000
shops_augmented.at[55,'population'] = 20000000

In [0]:
shops_augmented.to_csv("data_output/shops_augmented.csv", index=False)

####2.2.1.3) **Discussion of manually-augmented *shops* data**

In [0]:
shops_augmented

While entering city/type information, we noticed a possible issue with two of the shops: they may actually be the same shop.  From extensive web searching, the only things at that address appear to be a security systems vendor and a trial attorney.

10	Жуковский ул. Чкалова 39м?	10	Zhukovsky Street. Chkalov 39m?

11	Жуковский ул. Чкалова 39м²	11	Zhukovsky Street. Chkalov 39m²

</br>
We are probably best off combining these two shops somehow.  We need to check whether sales are significantly different for the two shops.  Do we make them into one shop and then split the prediction in half between shops #10 and 11 when we submit for grading?... need to chew on this a bit.

Now, let's compare this 60-row *shops_augmented* dataframe with the 42 unique shops used in the *test* dataset and see if anything odd comes from it...

In [0]:
#################################################
#  Do NOT run unless recreating from beginning
#    This computation already stored in data file
#################################################

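test_shops = test.shop_id.unique()
# Flag each shop by matching on the shop_id column itself rather than on the row index,
# so the result stays correct even if the dataframe index is not identical to shop_id
shops_augmented["tested"] = shops_augmented["shop_id"].isin(test_shops)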

shops_augmented.to_csv("data_output/shops_augmented.csv", index=False)

In [0]:
shops_augmented

A few things pop out when comparing the full *shops_augmented* dataframe with only the shop_id values included in the *test* dataframe:

*  shop_id #9 is not in the *test* set.  This particular "shop" is the only "itinerant" shop out of all 60, which makes it unique.  This gives some confidence that we should discard shop_id == 9 from training.

*  shop_id #10 and #11 are located at the same place, and web search does not distinguish between the two.  Shop #11 is not included in the *test* set.  Thus, there is a good chance we can either discard shop #11 or combine it into shop #10 when training our model.

*  shop_id #12, 20, and 55 are the only "online" shops in the database.  Shop #20 is not included in the *test* set.  Because these three shops are set apart from the others as being online, they may behave quite differently than the others.  And, because there are only 3 to look at, we can quickly compare the characteristics of 12, 20, and 55 to decide whether 20 is a good supplement for training or should be discarded.

*  shop_id #0 and 1 are not included in the *test* set, leaving shops 57 and 58 as the only shops to be tested that come from the Eastern Federal District.  We might want to consider treating these 4 shops differently from the other shops when training, as they might be considered as "outliers."

*  shop_id #5, 42, and 43 are the only shops in the Northwestern Federal District, and shop 42 is not in the *test* set.  We need to watch these shops closely too, to see whether any of the 3 are outliers that we should treat differently during training.


####2.2.1.4) **A Typical Shop**

(tbd)

In [77]:
shops_sales_train = sales_train.groupby("shop_id").count()['item_cnt_day']
shops_sales_train.describe()

count        60.000000
mean      48930.816667
std       44692.572612
min         306.000000
25%       20503.750000
50%       42037.500000
75%       58211.000000
max      235636.000000
Name: item_cnt_day, dtype: float64

In [78]:
shops_sales_train.head()

shop_id
0     9857
1     5678
2    25991
3    25532
4    38242
Name: item_cnt_day, dtype: int64

####2.2.1.5) **Shop 9 - Analysis**

**Tentative Conclusion (TL;DR): We should dump shop \#9 from any model training**

In [9]:
# starting with shop #9, let's look at the behavior in the training dataset
shop9_sales = sales_train.loc[sales_train['shop_id']==9]
print(shop9_sales.item_id.nunique())
shop9_sales.info()

1404
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3751 entries, 1012860 to 2919700
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            3751 non-null   datetime64[ns]
 1   date_block_num  3751 non-null   int64         
 2   shop_id         3751 non-null   int64         
 3   item_id         3751 non-null   int64         
 4   item_price      3751 non-null   float64       
 5   item_cnt_day    3751 non-null   float64       
dtypes: datetime64[ns](1), float64(2), int64(3)
memory usage: 205.1 KB


In [11]:
shop9_sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
1012860,2013-10-05,9,9,16205,299.0,2.0
1012861,2013-10-06,9,9,16205,299.0,1.0
1012862,2013-10-03,9,9,16209,299.0,1.0
1012863,2013-10-05,9,9,16209,299.0,1.0
1012864,2013-10-06,9,9,16209,299.0,2.0


In [12]:
print(shop9_sales.item_cnt_day.sum())
shop9_sales.describe()

15866.0


Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day
count,3751.0,3751.0,3751.0,3751.0,3751.0
mean,18.944548,9.0,13141.027993,1256.812248,4.229805
std,9.239632,0.0,6627.185299,1421.3441,7.991928
min,9.0,9.0,1407.0,90.0,-1.0
25%,9.0,9.0,6740.0,629.05,1.0
50%,21.0,9.0,14828.0,974.125,2.0
75%,21.0,9.0,20404.5,1549.0,4.0
max,33.0,9.0,22102.0,27499.0,168.0


In [13]:
shop9_many_sales = shop9_sales[shop9_sales.item_cnt_day > 75]
shop9_many_sales

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
1013357,2013-10-04,9,9,7802,299.0,78.0
1013358,2013-10-05,9,9,7802,299.0,89.0
1014331,2013-10-06,9,9,7096,498.992262,168.0
1015280,2013-10-06,9,9,6457,598.99037,135.0
1015736,2013-10-05,9,9,1448,598.983133,83.0
2046634,2014-10-03,21,9,7018,599.0,78.0
2046784,2014-10-03,21,9,10199,298.994505,91.0
2050285,2014-10-03,21,9,19436,798.991935,124.0
2918705,2015-10-02,33,9,4201,399.0,110.0


In [14]:
print(shop9_sales.date.nunique())
shop9_sales.date.unique()

14


array(['2013-10-05T00:00:00.000000000', '2013-10-06T00:00:00.000000000',
       '2013-10-03T00:00:00.000000000', '2013-10-04T00:00:00.000000000',
       '2014-10-05T00:00:00.000000000', '2014-10-04T00:00:00.000000000',
       '2014-10-02T00:00:00.000000000', '2014-10-03T00:00:00.000000000',
       '2015-04-22T00:00:00.000000000', '2015-10-04T00:00:00.000000000',
       '2015-10-01T00:00:00.000000000', '2015-10-02T00:00:00.000000000',
       '2015-10-03T00:00:00.000000000', '2015-10-14T00:00:00.000000000'],
      dtype='datetime64[ns]')

After a bunch of stumbling around and looking at shop \#9 in different ways, we found that the total number of days this shop had sales was only 14 (out of 34 months of training data), with a total of 1404 different items sold, and 3751 rows in the *sales_train* dataset.  So, it looks like this shop is indeed an outlier in that the sales take place on a few select days, and they don't happen to include November.

**We need to dump shop \#9 from any model training**

The "typical" shop had roughly 42000 rows in the *sales_train* dataset, as opposed to 3751 rows for shop #9.
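To put shop #9 side by side with the other shops, the number of distinct sale dates per shop and a "has any November sales" flag can be computed in one pass.  The sketch below uses a toy frame with made-up rows; the `shop_id` and `date` column names are assumed to match *sales_train*.

```python
import pandas as pd

# Toy stand-in for sales_train (made-up rows): shop 9 sells on very few days.
sales_toy = pd.DataFrame({
    "shop_id": [9, 9, 9, 2, 2, 2, 2],
    "date": pd.to_datetime([
        "2013-10-05", "2013-10-05", "2014-10-03",
        "2013-01-02", "2013-02-10", "2013-11-15", "2014-11-20",
    ]),
})

# Rows and distinct trading days per shop.
per_shop = sales_toy.groupby("shop_id").agg(
    n_rows=("date", "count"),
    n_days=("date", "nunique"),
)

# Does the shop have at least one sale in any November?
per_shop["has_november"] = (
    sales_toy.assign(is_nov=sales_toy["date"].dt.month == 11)
    .groupby("shop_id")["is_nov"].any()
)
print(per_shop)
```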

In [99]:
#shop9_many_sales = shop9_sales[shop9_sales.item_cnt_day == 03.10.2014]
print(dates_sales_train[dates_sales_train.index == pd.Timestamp(year=2014, month=3, day=10)])
print(dates_sales_train[pd.Timestamp(year=2014, month=3, day=10)])

date
2014-03-10    3108
Name: item_cnt_day, dtype: int64
3108


####2.2.1.6) **Shops 10 and 11 - Analysis**

**Conclusion (TL;DR): in progress...**

In [15]:
# let's look at the behavior in the training dataset for shops 10 and 11
shop10_sales = sales_train.loc[sales_train['shop_id']==10]
print(shop10_sales.item_id.nunique())
print(shop10_sales.info())
print("\n")
shop11_sales = sales_train.loc[sales_train['shop_id']==11]
print(shop11_sales.item_id.nunique())
print(shop11_sales.info())

6002
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21397 entries, 53564 to 2919922
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            21397 non-null  datetime64[ns]
 1   date_block_num  21397 non-null  int64         
 2   shop_id         21397 non-null  int64         
 3   item_id         21397 non-null  int64         
 4   item_price      21397 non-null  float64       
 5   item_cnt_day    21397 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(3)
memory usage: 1.1 MB
None


371
<class 'pandas.core.frame.DataFrame'>
Int64Index: 499 entries, 2461045 to 2462007
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            499 non-null    datetime64[ns]
 1   date_block_num  499 non-null    int64         
 2   shop_id         499 non-null    int64         
 3   item_i

Shop 10 (the one included in the *test* dataset) has far more transactions than Shop 11.  Should we just dump Shop 11?

Let's look at a comparison of the items sold, and see if perhaps it is really the same shop, and we should concatenate the data.
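One possible way to run that comparison is to measure how much the two shops' item sets overlap and whether their active months overlap.  The sketch below uses toy data with made-up values; `shop_profile` is a hypothetical helper, and the `date_block_num` column is assumed to identify the month as in *sales_train*.

```python
import pandas as pd

# Toy stand-in for sales_train rows of shops 10 and 11 (made-up values).
sales_toy = pd.DataFrame({
    "shop_id":        [10, 10, 10, 10, 11, 11, 11],
    "date_block_num": [24, 24, 26, 27, 25, 25, 25],
    "item_id":        [1, 2, 3, 4, 2, 3, 5],
})


def shop_profile(df, shop_id):
    """Return (set of items sold, set of months with sales) for one shop."""
    rows = df[df["shop_id"] == shop_id]
    return set(rows["item_id"]), set(rows["date_block_num"])


items10, months10 = shop_profile(sales_toy, 10)
items11, months11 = shop_profile(sales_toy, 11)

shared = items10 & items11
jaccard = len(shared) / len(items10 | items11)  # overlap relative to all items
share_of_11 = len(shared) / len(items11)        # how much of shop 11 is also in shop 10
months_overlap = months10 & months11

print(f"Jaccard={jaccard:.2f}, share of shop 11 items in shop 10={share_of_11:.2f}")
print("months with sales in both shops:", sorted(months_overlap))
```

If the real data showed a high item overlap together with few or no shared months, that would be an argument for concatenating the two shops rather than discarding one.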

###2.2.2) ***item_categories*** Dataset: EDA and Feature *Generation*

---



---



#####2.2.2.1) **Translate and Ruminate**
We will start by translating the Russian text in the dataframe, and add our ruminations on possible new features we can generate.

The dataframe *item_categories_transl* (equivalent to item_categories plus a column for English translation) is saved as a .csv file so we do not have to repeat the translation process the next time we open a Google Colab runtime.

**Tip**
</br>
If you want to run this code to generate translated features for the item categories, be sure to install the googletrans package and import Translator as in the code above.

If you have already created and saved this data, save time by importing the modified csv datafile and you won't have to re-run the translating.  (This can be a big deal, because Google Translate API restricts amount of usage and/or rate of usage for calls to the translator.)

In [0]:
item_categories.describe

<bound method NDFrame.describe of            item_category_name  item_category_id
0     PC - Гарнитуры/Наушники                 0
1            Аксессуары - PS2                 1
2            Аксессуары - PS3                 2
3            Аксессуары - PS4                 3
4            Аксессуары - PSP                 4
..                        ...               ...
79                  Служебные                79
80         Служебные - Билеты                80
81    Чистые носители (шпиль)                81
82  Чистые носители (штучные)                82
83           Элементы питания                83

[84 rows x 2 columns]>

In [0]:
#################################################
#  Do NOT run unless recreating from beginning
#    This computation already stored in data file
#################################################

item_categories_transl = item_categories.copy(deep=True)
item_categories_transl['En_Name'] = item_categories_transl.item_category_name.apply(lambda x: translator.translate(x, src='ru', dest='en').text)

# Save the translated data
item_categories_transl.to_csv("data_output/item_categories_transl.csv", index=False)

In [0]:
item_categories_transl.head()

Observations...
There is clearly an overlap in item_category types that can be made into a new feature (e.g., all accessories for PlayStation, all accessories for XBox, ...)
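Many category names follow a "type - platform" pattern, so a simple split on " - " could seed such features before any hand-coding.  This is a sketch under that assumption; `split_category` and the `PLATFORM_GROUP` mapping are hypothetical and cover only the platforms visible in the first few rows.

```python
def split_category(name):
    """Split 'head - tail' category names; tail is None when there is no ' - '."""
    head, sep, tail = name.partition(" - ")
    return head.strip(), (tail.strip() if sep else None)


# Hypothetical grouping of platform labels into families.
PLATFORM_GROUP = {
    "PS2": "PlayStation", "PS3": "PlayStation", "PS4": "PlayStation",
    "PSP": "PlayStation", "PSVita": "PlayStation",
    "XBOX 360": "Xbox", "XBOX ONE": "Xbox",
}

for name in ["Аксессуары - PS2", "Аксессуары - XBOX ONE", "Служебные"]:
    head, tail = split_category(name)
    print(head, "|", tail, "|", PLATFORM_GROUP.get(tail, "Other"))
```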


#####2.2.2.2) ***item_categories* With Manually-Augmented Features**
Since there are only 84 categories, offline hand-coding new categorical features into a csv file isn't difficult.  We will add column "Subcategory1" and column "Subcategory2" which focus on (1) type of product (console, software, etc.), and (2) platform of product (playstation, xbox, pc, etc.).
</br>
These two new columns are added to the *item_categories_transl.csv* file using an external spreadsheet editor, and are saved as file *item_categories_augmented.csv* 

As the features were manually added, there is no code to generate the file *item_categories_augmented.csv*.  We therefore use the .csv file loaded from the repo in the first part of this notebook.


In [0]:
# These are the categories presently being used (hand-coded) in the two extra item_categories feature columns:
print(item_categories_augmented.Subcategory1.unique())
print(item_categories_augmented.Subcategory2.unique())

['Audio' 'Accessories' 'Tickets' 'Shipping' 'Consoles' 'Games'
 'Debit_Cards' 'Movies' 'Books' 'Music' 'Gifts' 'Software' 'Internet']
['PC' 'PlayStation' 'Xbox' 'Any' 'Other' 'Phone' 'Movies' 'Books' 'Music'
 'Gifts']


In [0]:
item_categories_augmented.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   item_category_name  84 non-null     object
 1   item_category_id    84 non-null     int64 
 2   En_Name             84 non-null     object
 3   Subcategory1        84 non-null     object
 4   Subcategory2        84 non-null     object
dtypes: int64(1), object(4)
memory usage: 3.4+ KB


In [0]:
item_categories_augmented.head(20)

Unnamed: 0,item_category_name,item_category_id,En_Name,Subcategory1,Subcategory2
0,PC - Гарнитуры/Наушники,0,PC - Headsets / Headphones,Audio,PC
1,Аксессуары - PS2,1,Accessories - PS2,Accessories,PlayStation
2,Аксессуары - PS3,2,Accessories - PS3,Accessories,PlayStation
3,Аксессуары - PS4,3,Accessories - PS4,Accessories,PlayStation
4,Аксессуары - PSP,4,Accessories - PSP,Accessories,PlayStation
5,Аксессуары - PSVita,5,Accessories - PSVita,Accessories,PlayStation
6,Аксессуары - XBOX 360,6,Accessories - XBOX 360,Accessories,Xbox
7,Аксессуары - XBOX ONE,7,Accessories - XBOX ONE,Accessories,Xbox
8,Билеты (Цифра),8,Tickets (digits),Tickets,Any
9,Доставка товара,9,Delivery of goods,Shipping,Any


###2.2.3) ***items*** Dataset: EDA and Feature *Generation*

---



---



####2.2.3.1) **Translate and Ruminate**
We will start by translating the Russian text in the dataframe, and add our ruminations on possible new features we can generate.

The dataframe *items_transl* (equivalent to *items* plus a column for English translation) is saved as a .csv file so we do not have to repeat the translation process the next time we open a Google Colab runtime.

In [0]:
items.describe

<bound method NDFrame.describe of                                                item_name  ...  item_category_id
0              ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D  ...                40
1      !ABBYY FineReader 12 Professional Edition Full...  ...                76
2          ***В ЛУЧАХ СЛАВЫ   (UNV)                    D  ...                40
3        ***ГОЛУБАЯ ВОЛНА  (Univ)                      D  ...                40
4            ***КОРОБКА (СТЕКЛО)                       D  ...                40
...                                                  ...  ...               ...
22165             Ядерный титбит 2 [PC, Цифровая версия]  ...                31
22166    Язык запросов 1С:Предприятия  [Цифровая версия]  ...                54
22167  Язык запросов 1С:Предприятия 8 (+CD). Хрустале...  ...                49
22168                                Яйцо для Little Inu  ...                62
22169                      Яйцо дракона (Игра престолов)  ...                69

[2217

In [0]:
items.head(2)

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76


In [0]:
items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22170 entries, 0 to 22169
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   item_name         22170 non-null  object
 1   item_id           22170 non-null  int64 
 2   item_category_id  22170 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 519.7+ KB


#####Translation Code for 22,170-row items dataframe
Skip this section if you have already loaded the items_transl dataframe from previous translation efforts (save yourself about 15 hours)

In [0]:
#use this if you don't already have items_transl loaded:
temp_store = []
items_transl = items.copy(deep=True)
items_transl['En_name']= ""  # initialize an empty column

In [0]:
# google translate API is reliable for me only if I submit fewer than about 1 request every 2 seconds
#   below is a loop to translate all item_name cells for the 22,170 rows of the items dataframe

translator = Translator()

progress_counter = 0
progress_interval = 500  # we will print out a row number every 500 translations, for confirmation things are working OK (it takes several hours at 2sec per translation)
for i in range(len(items_transl)):
  items_transl.at[i,'En_name'] = translator.translate(items_transl.at[i,'item_name'],src='ru',dest='en').text
  if i//progress_interval > progress_counter:
    progress_counter += 1
    print("Translation completed for row number: " + str(i) + " at " + strftime("%H:%M",localtime()))
  sleep(2)

items_transl.to_csv("data_output/items_transl.csv", index = False)
items_transl.head()

15000 15001 15002 15003 15004 15005 15006 15007 15008 15009 15010 15011 15012 15013 15014 15015 15016 15017 15018 15019 15020 15021 15022 15023 15024 15025 15026 15027 15028 15029 15500 16000 16500 17000 17500 18000 18500 19000 19500 20000 20500 21000 21500 22000 

Unnamed: 0,item_name,item_id,item_category_id,En_name
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40,! POWER IN glamor (PLAST.) D
1,!ABBYY FineReader 12 Professional Edition Full...,1,76,! ABBYY FineReader 12 Professional Edition Ful...
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40,*** In the glory (UNV) D
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40,*** BLUE WAVE (Univ) D
4,***КОРОБКА (СТЕКЛО) D,4,40,*** BOX (GLASS) D
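The loop above writes the csv only after the last row, so an API error partway through a multi-hour run loses everything.  Below is a sketch of an append-and-resume variant; `translate_with_checkpoints` is a hypothetical helper, and `flaky_upper` is a stand-in that simulates one rate-limit failure instead of calling a real translator.

```python
import csv
import os
import tempfile


def translate_with_checkpoints(names, translate_fn, out_path):
    """Append one translated row at a time; resume after rows already on disk."""
    done = 0
    if os.path.exists(out_path):
        with open(out_path, newline="", encoding="utf-8") as f:
            done = sum(1 for _ in csv.reader(f))
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for i in range(done, len(names)):
            writer.writerow([i, translate_fn(names[i])])
            f.flush()  # keep finished rows on disk even if a later call fails
    return len(names) - done  # number of rows translated in this call


calls = {"n": 0}


def flaky_upper(text):
    # Stand-in translator: fails once on its third call.
    calls["n"] += 1
    if calls["n"] == 3:
        raise RuntimeError("simulated rate limit")
    return text.upper()


names = ["a", "b", "c", "d"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "items_transl_partial.csv")
    try:
        translate_with_checkpoints(names, flaky_upper, path)
    except RuntimeError:
        pass  # the first two rows are already saved
    resumed = translate_with_checkpoints(names, flaky_upper, path)
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print("rows translated after resuming:", resumed)
```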


#####Thoughts regarding items dataframe
Let's first look at how many training examples we have to work with...

Many of the items have similar names, but slightly different punctuation, or only very slightly different version numbers or types.  (e.g., 'Call of Duty III' vs. 'Call of Duty III DVD')

One can expect that these two items would have similar sales in general, and by grouping them into a single feature category, we can eliminate some of the overfitting that might come from the relatively small ratio of (training set shop-item-date rows = 2935849)/(total number of unique items = 22170).  This is an average of about 132 rows in the *sales_train* data for each item_id.  Our task is to produce a monthly estimate of sales (for November 2015), so it is more relevant to count rows per month than rows in the entire training set.  Given that the *sales_train* dataset covers the time period from January 2013 to October 2015 (34 months), we have on average fewer than 4 rows (shop-item-date combinations) for a given item in any given month.  Furthermore, as we are trying to predict for a particular month (*November* 2015), it is relevant to consider how many rows in our training set occur in the month of November.  The *sales_train* dataset contains data for two 'November' months out of the total 34 months.  Assuming rows are spread evenly across months, 132 × 2/34 gives an average of only about 7.8 rows per item across both Novembers combined.

To summarize:

*  *sales_train* contains 34 months of data, including 2935849 shop-item-date combinations
*  *items* contains 22170 "unique" item_id values

In the *sales_train* data, we therefore have:
*  on average, 132 rows for a given item_id (over all shops and dates)
*  on average, fewer than 4 rows for a given item_id in a given month
*  on average, about 7.8 rows for a given item_id across the two 'November' months combined (roughly 4 per November)

If we wish to improve our model predictions for the following month of November, it behooves us to use monthly grouping of sales, or, even better, November grouping of sales.  This smooths out day-to-day variations in sales for a better monthly prediction.  However, the sparse number of available rows in the *sales_train* data will contribute to inaccuracy in our model training and predictions.
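The per-item averages can be recomputed directly from the row count, item count, and month count quoted above.  This is a back-of-the-envelope sketch that assumes rows are spread evenly across items and months, which the real data certainly does not satisfy exactly.

```python
n_rows = 2_935_849      # rows in sales_train
n_items = 22_170        # item_id values in items
n_months = 34           # January 2013 through October 2015
n_november_months = 2   # November 2013 and November 2014

rows_per_item = n_rows / n_items
rows_per_item_per_month = rows_per_item / n_months
rows_per_item_all_novembers = rows_per_item_per_month * n_november_months

print(f"rows per item:                 {rows_per_item:.1f}")
print(f"rows per item per month:       {rows_per_item_per_month:.2f}")
print(f"rows per item, both Novembers: {rows_per_item_all_novembers:.2f}")
```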

Imagine if we could reduce the number of item_id values from 22170 to perhaps half that or even less.  Given that the number of rows for training (per item, on a monthly or a November basis) is so small, such a reduction in the number of item_id values would have a big impact.  (The same is true for creating features to supplement "shop_id" so as to group and reduce the individuality of each shop, and thus effectively create, on average, more rows of training data for each shop-item pair.)
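One way to start collapsing near-duplicate items is to normalize each name into a grouping key.  The sketch below is only an illustration of the idea: `name_key` is a hypothetical helper, the bracket-stripping rule and the `MEDIA_TOKENS` set are assumptions, and real item names would likely need more careful rules.

```python
import re
from collections import defaultdict

# Assumed media/format tokens to ignore when comparing names.
MEDIA_TOKENS = {"dvd", "cd", "bd"}


def name_key(item_name):
    """Lowercase, drop bracketed qualifiers, punctuation, and media tokens."""
    text = item_name.lower()
    text = re.sub(r"[\[\(].*?[\]\)]", " ", text)  # remove [...] and (...) qualifiers
    text = re.sub(r"[^\w\s]", " ", text)          # punctuation becomes whitespace
    return " ".join(tok for tok in text.split() if tok not in MEDIA_TOKENS)


names = {
    0: "Call of Duty III",
    1: "Call of Duty III DVD",
    2: "Ядерный титбит 2 [PC, Цифровая версия]",
}

# Map each normalized key to the item ids that share it.
groups = defaultdict(list)
for item_id, item_name in names.items():
    groups[name_key(item_name)].append(item_id)

for key, ids in groups.items():
    print(repr(key), "->", ids)
```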

In [12]:
items_transl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22170 entries, 0 to 22169
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   item_name         22170 non-null  object
 1   item_id           22170 non-null  int64 
 2   item_category_id  22170 non-null  int64 
 3   En_name           22170 non-null  object
dtypes: int64(2), object(2)
memory usage: 692.9+ KB


It appears as though only 21807 out of 22170 possible item_ids are present in the training set.  Let's look at some distributions... first, the number of total sales of each individual item_id for the entire train set:

In [57]:
# How many rows in the sales_train set for each item?  How many total units of each item were sold (per the sales_train dataset)?
items_sales = sales_train.groupby("item_id", as_index=False).agg({'item_cnt_day': ['count', 'sum']})
items_sales.columns = ['item_id','item_total_train_rows','item_total_units_sold']
print(items_sales.describe())
items_sales.head(10)

            item_id  item_total_train_rows  item_total_units_sold
count  21807.000000           21807.000000            21807.00000
mean   11098.699271             134.628743              167.29518
std     6397.059362             406.938186             1366.22019
min        0.000000               1.000000              -11.00000
25%     5551.500000               6.000000                7.00000
50%    11105.000000              32.000000               33.00000
75%    16647.500000             119.000000              124.00000
max    22169.000000           31340.000000           187642.00000


Unnamed: 0,item_id,item_total_train_rows,item_total_units_sold
0,0,1,1.0
1,1,6,6.0
2,2,2,2.0
3,3,2,2.0
4,4,1,1.0
5,5,1,1.0
6,6,1,1.0
7,7,1,1.0
8,8,2,2.0
9,9,1,1.0


In [66]:
# How many rows in the sales_train set for each shop?  How many total items sold for each shop?
shops_sales = sales_train.groupby("shop_id", as_index=False).agg({'item_cnt_day': ['count', 'sum']})
shops_sales.columns = ['shop_id','shop_total_train_rows','shop_total_units_sold']
print(shops_sales.describe())
shops_sales.head(15)

         shop_id  shop_total_train_rows  shop_total_units_sold
count  60.000000              60.000000              60.000000
mean   29.500000           48930.816667           60803.433333
std    17.464249           44692.572612           57992.901750
min     0.000000             306.000000             330.000000
25%    14.750000           20503.750000           23333.000000
50%    29.500000           42037.500000           50176.000000
75%    44.250000           58211.000000           69562.250000
max    59.000000          235636.000000          310777.000000


Unnamed: 0,shop_id,shop_total_train_rows,shop_total_units_sold
0,0,9857,11705.0
1,1,5678,6311.0
2,2,25991,30620.0
3,3,25532,28355.0
4,4,38242,43942.0
5,5,38179,42762.0
6,6,82663,100489.0
7,7,58076,67058.0
8,8,3412,3595.0
9,9,3751,15866.0


In [18]:
# select rows where item_cnt_day is zero or missing; the empty result below confirms that every row has a nonzero value
nonzeros = sales_train[sales_train['item_cnt_day'].fillna(0).astype(bool) == False] #.sum(axis=0)
nonzeros

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day


In [69]:
# Compute the number of times each shop_id-item_id pair appears in the sales_train dataset (n_train_rows)
# Also, for each shop_id-item_id pair, compute the number of units of that item sold at that shop over the entire sales_train dataset (n_units_sold)
# Finally, for each shop_id-item_id pair, express both quantities as fractions of the shop's totals
#   and as fractions of the item's totals
shop_item_sales = sales_train.groupby(['shop_id','item_id'], as_index=False).agg({'item_cnt_day': ['count', 'sum']})
shop_item_sales.columns = ['shop_id','item_id','n_train_rows','n_units_sold']
shop_item_sales = shop_item_sales.merge(items_sales, on = 'item_id')
shop_item_sales = shop_item_sales.merge(shops_sales, on = 'shop_id')
shop_item_sales['shop_row_fraction'] = shop_item_sales.n_train_rows / shop_item_sales.shop_total_train_rows
shop_item_sales['shop_units_fraction'] = shop_item_sales.n_units_sold / shop_item_sales.shop_total_units_sold
shop_item_sales['item_row_fraction'] = shop_item_sales.n_train_rows / shop_item_sales.item_total_train_rows
shop_item_sales['item_units_fraction'] = shop_item_sales.n_units_sold / shop_item_sales.item_total_units_sold
shop_item_sales.drop(['item_total_train_rows','item_total_units_sold','shop_total_train_rows','shop_total_units_sold'], axis=1, inplace=True)

print(shop_item_sales.describe())
shop_item_sales.head()

             shop_id        item_id  ...  item_row_fraction  item_units_fraction
count  424124.000000  424124.000000  ...      424124.000000         4.241220e+05
mean       31.431223   11458.020213  ...           0.051417                  NaN
std        16.962064    6133.332458  ...           0.112798                  NaN
min         0.000000       0.000000  ...           0.000096                 -inf
25%        18.000000    6244.000000  ...           0.013100         1.252610e-02
50%        30.000000   11614.000000  ...           0.023810         2.334630e-02
75%        46.000000   16662.000000  ...           0.047059         4.724409e-02
max        59.000000   22169.000000  ...           1.000000                  inf

[8 rows x 8 columns]


Unnamed: 0,shop_id,item_id,n_train_rows,n_units_sold,shop_row_fraction,shop_units_fraction,item_row_fraction,item_units_fraction
0,0,30,9,31.0,0.000913,0.002648,0.006508,0.01484
1,0,31,7,11.0,0.00071,0.00094,0.006306,0.007666
2,0,32,11,16.0,0.001116,0.001367,0.005839,0.007648
3,0,33,6,6.0,0.000609,0.000513,0.007229,0.007177
4,0,35,12,15.0,0.001217,0.001282,0.066667,0.067568
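
The `-inf`, `inf`, and `NaN` entries in the `item_units_fraction` summary above most likely come from items whose `item_total_units_sold` is zero (returns can cancel out sales; the per-item minimum reported earlier is -11). Here is a minimal sketch of one way to guard the division, run on made-up toy data rather than the real `sales_train`. The helper name `safe_fraction` is hypothetical:

```python
import numpy as np
import pandas as pd

def safe_fraction(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two Series, returning NaN wherever the denominator is zero."""
    # Replacing 0 with NaN turns x/0 (inf) and 0/0 into NaN, which describe() skips.
    return numerator / denominator.replace(0, np.nan)

# Toy values (not from the dataset): rows 1 and 2 have a zero item total.
toy = pd.DataFrame({
    "n_units_sold": [3.0, -1.0, 0.0, 2.0],
    "item_total_units_sold": [6.0, 0.0, 0.0, 8.0],
})
toy["item_units_fraction"] = safe_fraction(toy["n_units_sold"], toy["item_total_units_sold"])
print(toy)
print(toy["item_units_fraction"].describe())
```

With the infinities gone, `describe()` reports a finite mean and standard deviation over the remaining rows. Whether such rows should be NaN, zero, or dropped is a modeling choice this sketch does not settle.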


In [2]:
shop_item_sales.iloc[:,[4,5,6,7]].describe()

NameError: ignored

In [28]:
60*22170

1330200
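
The product above is presumably the number of possible shop-item pairs (60 shops times 22,170 item ids, counting ids 0 through 22169). Under that assumption, this small sketch compares it with the 424,124 pairs that `describe()` reported for `shop_item_sales`:

```python
n_shops = 60               # shop_id runs 0..59 in shops_sales
n_items = 22170            # item_id runs 0..22169 (assumption: every id counts as an item)
n_observed_pairs = 424124  # row count of shop_item_sales reported by describe()

n_possible_pairs = n_shops * n_items
fill_fraction = n_observed_pairs / n_possible_pairs

print(n_possible_pairs)
print(f"{fill_fraction:.1%} of possible shop-item pairs appear in sales_train")
```

So roughly a third of the possible pairs have any training rows at all.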

####NLP for feature generation from items dataframe
Automate the search for commonality among items, and create a new categorical feature from it, so that the model is less likely to overfit on the many item names that differ only slightly.

### start from here

---



---

Start from the cell below when adding more rows to the translated items data file.


In [0]:
import re
from nltk.corpus import stopwords  # requires the NLTK 'stopwords' corpus: nltk.download('stopwords')

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
MULTIPLE_WHITESPACE_RE = re.compile(r'[ ]{2,}')
STOPWORDS = set(stopwords.words('english'))  # a set makes membership tests fast; note all stopwords are lowercase
#print("." in STOPWORDS)
def text_prepare(text):
    """
        text: a string

        return: the lowercased string with punctuation, extra spaces, and stopwords removed
    """
    text = text.lower() # lowercase first, because the stopwords are all lowercase
    text = REPLACE_BY_SPACE_RE.sub(' ',text) # replace REPLACE_BY_SPACE_RE symbols with a space
    text = BAD_SYMBOLS_RE.sub('',text) # delete every character matched by BAD_SYMBOLS_RE
    text = MULTIPLE_WHITESPACE_RE.sub(' ',text) # collapse runs of spaces into a single space
    text = " ".join([word for word in text.split(" ") if word not in STOPWORDS]) # delete stopwords, using the precomputed set instead of rebuilding the list for every word

    return text
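
To check the regex pipeline above without downloading the NLTK corpus, here is a dependency-free sketch of the same steps. The tiny `DEMO_STOPWORDS` set and the sample string are stand-ins chosen for illustration. They are not the NLTK list and not data from the items file:

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
MULTIPLE_WHITESPACE_RE = re.compile(r'[ ]{2,}')
DEMO_STOPWORDS = {"a", "any", "of", "the"}  # hypothetical stand-in for NLTK's English list

def text_prepare_demo(text):
    """Same steps as text_prepare, but with the stand-in stop-word set."""
    text = text.lower()                           # lowercase before stop-word filtering
    text = REPLACE_BY_SPACE_RE.sub(' ', text)     # separators become spaces
    text = BAD_SYMBOLS_RE.sub('', text)           # drop everything outside 0-9, a-z, space, #, +, _
    text = MULTIPLE_WHITESPACE_RE.sub(' ', text)  # collapse runs of spaces
    return " ".join(word for word in text.split(" ") if word not in DEMO_STOPWORDS)

sample = "SQL Server - any equivalent of Excel's CHOOSE function?"
print(text_prepare_demo(sample))  # sql server equivalent excels choose function
```

Note that `BAD_SYMBOLS_RE` keeps only ASCII letters and digits, so any untranslated Cyrillic text would be deleted entirely by this pipeline.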

In [7]:
sales_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  int64  
 2   shop_id         int64  
 3   item_id         int64  
 4   item_price      float64
 5   item_cnt_day    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB


#2. Explore Data

##2c) Grouping and statistical descriptions of the provided features

Next:
*  Data visualizations and correlations
*  Look for signs of data leakage
*  Record initial thoughts on features and models to use

#3. Save Data with New Features, etc.

In [0]:
shops_augmented.to_csv("data_output/shops_augmented.csv", index=False)

In [0]:
item_categories_transl.to_csv("data_output/item_categories_transl.csv", index=False)

#XX) Below this markdown cell can be ignored

These are just code snippets I used to debug how to accomplish some of the tasks above.  I may want to revisit them some day, so I am too scared to just delete them. :)

In [0]:
# Use "RateLimiter" to limit location queries to one per second, as the free services tend to throttle heavy use
# We will use Nominatim for location, and GeoNames for population
from geopy.geocoders import Nominatim, GeoNames
from geopy.extra.rate_limiter import RateLimiter
nominatum_service = Nominatim(timeout=10, user_agent = "mgaidis@yahoo.com", format_string="%s, Russia")
nominatum_geocode = RateLimiter(nominatum_service.geocode, min_delay_seconds=1)
geonames_service = GeoNames(country_bias='ru', username='gaidis', timeout=10, user_agent="mgaidis@yahoo.com", format_string="%s, Russia")  # be sure to enable free web services when creating geonames account
geonames_geocode = RateLimiter(geonames_service.geocode, min_delay_seconds=1)
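
For readers unfamiliar with the idea behind `min_delay_seconds`, here is a small standard-library sketch of a minimum-delay wrapper. It illustrates the concept only and is not geopy's `RateLimiter` implementation. The class name `MinDelayCaller` and its `clock` and `sleep` parameters are hypothetical:

```python
import time

class MinDelayCaller:
    """Wrap a function so consecutive calls start at least min_delay_seconds apart."""

    def __init__(self, func, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.func = func
        self.min_delay_seconds = min_delay_seconds
        self.clock = clock      # injectable so the behavior can be checked without waiting
        self.sleep = sleep
        self._last_call = None  # time at which the previous call started

    def __call__(self, *args, **kwargs):
        if self._last_call is not None:
            wait = self.min_delay_seconds - (self.clock() - self._last_call)
            if wait > 0:
                self.sleep(wait)
        self._last_call = self.clock()
        return self.func(*args, **kwargs)

# Demonstration with a fake clock, so nothing actually sleeps.
fake_now = [0.0]
slept = []

def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds

lookup = MinDelayCaller(str.upper, min_delay_seconds=1.0,
                        clock=lambda: fake_now[0], sleep=fake_sleep)
results = [lookup(city) for city in ("moscow", "yakutsk", "vladivostok")]
print(results, slept)
```

The first call goes through immediately. Each later call waits out the remainder of the one-second window, which is the behavior the free geocoding services expect from clients.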

In [0]:
# Example use of Nominatim
location = nominatum_geocode('Adygea', language="en")
print(json.dumps(location.raw))
print(location.latitude, location.longitude)
print(location.address.split(",")[-2].strip())
location

{"place_id": 234832475, "licence": "Data \u00a9 OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright", "osm_type": "relation", "osm_id": 253256, "boundingbox": ["43.7601459", "45.2171133", "38.6840182", "40.7776469"], "lat": "44.6939006", "lon": "40.1520421", "display_name": "Republic of Adygea, South Federal District, Russia", "class": "boundary", "type": "administrative", "importance": 0.7686671937378127, "icon": "https://nominatim.openstreetmap.org/images/mapicons/poi_boundary_administrative.p.20.png"}
44.6939006 40.1520421
South Federal District


Location(Republic of Adygea, South Federal District, Russia, (44.6939006, 40.1520421, 0.0))

In [0]:
# Check how Nominatim characterizes each federal district in Russia.  Below is a list of names of the biggest cities in each district
big_cities = ['Moscow','St. Petersburg','Novosibirsk','Yekaterinburg','Nizhny Novgorod','Rostov-on-Don','Makhachkala', 'Vladivostok','Sevastopol']

In [0]:
for bc in big_cities:
  location = nominatum_geocode(bc, language='en')
  print(location)

Moscow, Central Federal District, Russia
Saint Petersburg, Northwestern Federal District, 190000, Russia
Novosibirsk, Novosibirsk Oblast, Siberian Federal District, 630000, Russia
Yekaterinburg, Yekaterinburg Municipality, Sverdlovsk Oblast, Ural Federal District, Russia
Nizhny Novgorod, Nizhny Novgorod Oblast, Volga Federal District, Russia
Rostov-on-Don, Rostov Oblast, South Federal District, Russia
Makhachkala, Makhachkala Urban Okrug, Republic of Dagestan, North Caucasus Federal District, 367000, Russia
Vladivostok, Владивостокский городской округ, Primorsky Krai, Far Eastern Federal District, 690000, Russia
Sevastopol, Ленинский район, Sevastopol, South Federal District, 299000-299699, Russia


In [0]:
# It appears as though Crimea does not qualify as a district (Sevastopol falls into the South category, according to Nominatim)
# Here is a list of the districts as Nominatim reports them:
russian_districts = ['Central','Northwestern','Siberian','Ural','Volga','South','North Caucasus','Far Eastern']
# and, since Nominatim includes postal codes ("zip codes") in only some results, we will use a regex that relies on the fact that Nominatim always returns "xxx Federal District"
example_loc = 'Sevastopol, Ленинский район, Sevastopol, South Federal District, 299000-299699, Russia 299000-299699'
district_in_location = re.search(r'[,\s](\w*)\sFederal District', example_loc)
print(district_in_location.group(1))

South
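
One caveat with the pattern above: `(\w*)` captures a single word, so for the two-word districts in `russian_districts` it would return only "Caucasus" or "Eastern". Below is a sketch of a variant that captures everything between the preceding comma and " Federal District". The helper name `extract_district` is hypothetical, and the sample strings are copied from the Nominatim output shown earlier:

```python
import re

# Capture everything between the preceding comma (or string start) and " Federal District".
DISTRICT_RE = re.compile(r'(?:^|,)\s*([^,]+?)\s+Federal District')

def extract_district(address):
    """Return the district name from a Nominatim-style address string, or None."""
    match = DISTRICT_RE.search(address)
    return match.group(1) if match else None

examples = [
    "Moscow, Central Federal District, Russia",
    "Makhachkala, Makhachkala Urban Okrug, Republic of Dagestan, North Caucasus Federal District, 367000, Russia",
    "Vladivostok, Владивостокский городской округ, Primorsky Krai, Far Eastern Federal District, 690000, Russia",
]
for address in examples:
    print(extract_district(address))
```

Returning `None` when no district is found lets the caller decide how to handle addresses that Nominatim formats differently.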


In [0]:
# Check how well GeoNames does at retrieving populations...
# Note that GeoNames doesn't consider Sevastopol (Crimea) to be part of Russia, so I did not use country bias or format string to force GeoNames to only look for Russian cities
#     Results below are close to Wikipedia's figures.  We're good, at least for the big cities.
for bc in big_cities:
  g = geonames_geocode(bc,timeout=10)
  print(g.raw["population"])

10381222
5028000
1419007
1349772
1284164
1074482
497959
587022
416263


In [0]:
g = geonames_geocode('Yakutsk',timeout=10)

In [0]:
print(json.dumps(g.raw))

{"adminCode1": "63", "lng": "129.73306", "geonameId": 2013159, "toponymName": "Yakutsk", "countryId": "2017370", "fcl": "P", "population": 235600, "countryCode": "RU", "name": "Yakutsk", "fclName": "city, village,...", "adminCodes1": {"ISO3166_2": "SA"}, "countryName": "Russia", "fcodeName": "seat of a first-order administrative division", "adminName1": "Sakha", "lat": "62.03389", "fcode": "PPLA"}


In [0]:
g.raw["population"]

235600

In [0]:
path_name = "data_output/item_categories_augmented.csv"
filename = path_name.rsplit("/")[-1]
data_frame_name = filename.split(".")[0]
exec(data_frame_name + " = pd.read_csv(path_name)")
print("Data Frame: " + data_frame_name)
print(eval(data_frame_name).head(2))
print("\n")

Data Frame: item_categories3
        item_category_name  item_category_id  ... Subcategory1 Subcategory2
0  PC - Гарнитуры/Наушники                 0  ...        Audio           PC
1         Аксессуары - PS2                 1  ...  Accessories  PlayStation

[2 rows x 5 columns]
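
As an alternative to building variable names with `exec` and `eval`, the frames can be kept in a dictionary keyed by file stem. This is a sketch under the assumption that dictionary access is acceptable in place of separate global variables. The helper name `load_frames` and the throwaway demo file are hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_frames(paths):
    """Read each CSV into a dict keyed by the file name without its extension."""
    return {Path(p).stem: pd.read_csv(p) for p in paths}

# Demo on a temporary CSV so the sketch runs without the project's data_output folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "item_categories_augmented.csv"
    pd.DataFrame({"item_category_id": [0, 1]}).to_csv(demo_path, index=False)
    frames = load_frames([demo_path])

for name, frame in frames.items():
    print("Data Frame: " + name)
    print(frame.head(2))
```

In the notebook, the same call would take the real paths, for example `load_frames(["data_output/item_categories_augmented.csv"])`, and a frame would then be read as `frames["item_categories_augmented"]`.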


