## First things first
* Click **File -> Save a copy in Drive** and click **Open in new tab** in the pop-up window to save your progress in Google Drive.
* Click **Runtime -> Change runtime type** and select **GPU** in Hardware accelerator box to enable faster GPU training.

#**For Jupyter Notebook Readability:**
Many sections are grouped so they can be collapsed for easier navigation to the code of interest.  For example, the code that creates new features and saves them to a csv file stays in this notebook for future reference, but once it has been run, a simple csv import is all that is needed; it does not have to be re-run every time we start a Google Colab runtime.  Unfortunately, I haven't found a way in Colab to set cell metadata so that these unnecessary cells are skipped when selecting the "Run All" or "Run Before" menu options for the notebook.  Apparently this can be done in a standard (non-Colab) Jupyter notebook, or maybe using a plug-in like the one [here](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/freeze).

#**Final Project for Coursera's 'How to Win a Data Science Competition'**
April, 2020

Andreas Theodoulou and Michael Gaidis

(Competition Info last updated:  3 years ago)

##**About this Competition**

You are provided with **daily** historical sales data. The task is to forecast the total amount of each product sold in every shop for the test set (the **month** of November, 2015); each row of the test set is one (shop, item) pair. Note that the list of shops(!) and products slightly *changes every month*. Creating a robust model that can handle such situations is part of the challenge.


##**File descriptions**

***sales_train.csv*** - the training set. Daily historical data from January 2013 to October 2015.

***test.csv*** - the test set. You need to forecast the sales for these shops and products for November 2015.

***sample_submission.csv*** - a sample submission file in the correct format (two columns: "ID", the (shop, item) pair identifier from the test set, and "item_cnt_month", the number of units predicted to be sold in Nov. 2015)

***items.csv*** - item names, their corresponding item_categories IDs, and item IDs to link with the other files

***item_categories.csv***  - item category names and corresponding IDs to link with the other files

***shops.csv***- shop names and corresponding IDs to link with the other files


##**Data fields**

***ID*** - an Id that represents a (Shop, Item) tuple within the test set

***shop_id*** - unique identifier of a shop

***item_id*** - unique identifier of a product

***item_category_id*** - unique identifier of item category

***item_cnt_day*** - number of products sold. You are predicting a monthly amount of this measure

***item_price*** - current price of an item

***date*** - date in format dd.mm.yyyy

***date_block_num*** - a consecutive month number. January 2013 is 0, February 2013 is 1,..., October 2015 is 33

***item_name*** - name of item

***shop_name*** - name of shop

***item_category_name*** - name of item category
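Because `item_cnt_day` is daily while the target is monthly, the daily rows have to be summed per (`date_block_num`, `shop_id`, `item_id`). Here is a minimal sketch of that aggregation, assuming the day-first `dd.mm.yyyy` date format described above. The toy rows are made up for illustration, and the output column name `item_cnt_month` is borrowed from the sample submission file.

```python
import pandas as pd

# Toy rows in the same layout as sales_train.csv (values are made up for illustration)
toy_sales = pd.DataFrame({
    "date": ["02.01.2013", "15.01.2013", "03.02.2013"],
    "date_block_num": [0, 0, 1],
    "shop_id": [59, 59, 59],
    "item_id": [22154, 22154, 22154],
    "item_price": [999.0, 999.0, 999.0],
    "item_cnt_day": [1.0, 2.0, 1.0],
})

# Parse day-first dates explicitly so 02.01.2013 is read as 2 January, not 1 February
toy_sales["date"] = pd.to_datetime(toy_sales["date"], format="%d.%m.%Y")

# Sum daily counts into one row per (month, shop, item)
monthly = (
    toy_sales
    .groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
print(monthly)
```

Passing an explicit `format` avoids relying on pandas to guess whether the day or the month comes first.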

#**Workflow**

##1. Configure Environment


*   Fork/copy shared ipynb as necessary, to not conflict with teammate
*   Load competition data files
*   Load any utility code files
*   Import libraries



##2. Explore Data


*   Data formatting and translating
*   Descriptive explanations for the competition data
*   Grouping and statistical descriptions of the provided features
*   Data visualizations and correlations
*   Look for signs of data leakage
*   Record initial thoughts on features and models to use



##3. Prepare Data


*   Data formatting and translating (see above)
*   Data cleaning (--> handling missing entries, outliers, NaNs, ...)
*   Data grouping / Date-related issues / re-cleaning if needed after grouping
*   Data normalization (recheck cleaning & normalizing with data visualizations)
*   Initial feature selection (quick and dirty) and preparation
*   Save data in compressed or pickled format if helpful; use version control



##4. Quick Modeling (set up framework for more complex model improvement)


*   Choose and implement a fast and simple approach for train/val data splitting
*   Choose a simple and fast evaluation metric (comparable to Kaggle's metric)
*   Choose a simple, but appropriate, model to use (minimal hyperparameters)
*   Train the model, check for major issues (absolutely horrible performance)
*   Save the model parameters, etc., along with version control
*   Submit model to Kaggle to verify proper formatting of entry
*   Verify that Kaggle test performance is reasonably close to validation metric
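As a rough illustration of the "fast and simple" split and metric in this step, the sketch below holds out the most recent month for validation and scores a naive baseline with RMSE. The function names are hypothetical, the toy values are made up, and the optional clipping to [0, 20] is an assumption about how the competition scores predictions that should be checked against the Kaggle evaluation page.

```python
import numpy as np
import pandas as pd

def split_last_month(monthly, month_col="date_block_num"):
    """Hold out the most recent month as validation; train on everything earlier."""
    last_month = monthly[month_col].max()
    train = monthly[monthly[month_col] < last_month]
    val = monthly[monthly[month_col] == last_month]
    return train, val

def rmse(y_true, y_pred, clip_range=None):
    """Root-mean-squared error, optionally clipping both arrays first."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if clip_range is not None:
        y_true = np.clip(y_true, *clip_range)
        y_pred = np.clip(y_pred, *clip_range)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy monthly table (made-up values)
toy = pd.DataFrame({
    "date_block_num": [31, 31, 32, 32, 33, 33],
    "shop_id":        [5, 6, 5, 6, 5, 6],
    "item_cnt_month": [2.0, 0.0, 3.0, 1.0, 4.0, 1.0],
})
train, val = split_last_month(toy)

# Naive baseline: predict the training mean for every validation row
baseline_pred = np.full(len(val), train["item_cnt_month"].mean())
print(rmse(val["item_cnt_month"], baseline_pred, clip_range=(0, 20)))
```

Splitting by month rather than at random keeps future information out of the training set, which mirrors how the test month follows the training months.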



##5. Refine the Model and the Features


###a) Features


*   Explore the data more deeply for feature correlations and data leaks to exploit
*   Consider complex feature generation based on intuition
*   Save data in compressed or pickled format if helpful for faster future iteration
*   Employ version control on datasets generated with new features / groupings

###b) Modeling


*   Look at alternative metrics for training and validation
*   Version control
*   Explore hyperparameter tuning for the initial quick and dirty model
*   Version control
*   Consider other models as time allows
*   Version control
*   Create ensembles as time allows
*   Version control
*   Adjust methods of train/val splitting if desirable and timely
*   Version control







##6. Finalize Model


*   Restart kernel, clean any possible lingering variables
*   Train and tune hyperparameters until you run out of time
*   Submit model



---



---





#1. Configure Environment

##1a) Load Files
Load competition data files and import helpful custom code libraries from **GitHub Kag repo cloned onto your Google Drive**  
(similar to original Coursera template that loads files from GitHub directly, but by cloning the GitHub repo onto your "local" Google Drive, you can do add/commit/push/pull/status etc. from within Colab notebook, and have better version control when working with partners)

Note that you must use code similar to that in the "helper_code/Enable_Colab_git_GitHub-GDrive.ipynb" notebook to first clone the GitHub repo onto your Google Drive, and set up appropriate tokens and user info, to get all this to work "seamlessly." :)

In [0]:
# Import libraries needed for loading files:
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# List file names and paths needed for importing data and helper files

GDRIVE_REPO_PATH = "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag"

%cd "{GDRIVE_REPO_PATH}"

# List of the data files (path relative to master branch top), to be loaded into pandas DataFrames
data_files = [  "readonly/final_project_data/items.csv",
                "readonly/final_project_data/item_categories.csv",
                "readonly/final_project_data/shops.csv",
                "readonly/final_project_data/sample_submission.csv.gz",
                "readonly/final_project_data/sales_train.csv.gz",
                "readonly/final_project_data/test.csv.gz",
                "data_output/shops2.csv",
                "data_output/shops3.csv",
                "data_output/item_categories2.csv",
                "data_output/item_categories3.csv"  ]

# Dict of helper code files, to be loaded into Colab and available for python import
#    key is the path (replace / with . ), and value is the module reference name
#    note that the directory chain from current directory down to the .py file
#      must include a "__init__.py" file (it can be empty)
code_files = {"helper_code.kaggle_utils_at_mg" : "kag_utils"}

/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
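The cells shown here define `code_files` but not how it is consumed. One plausible way to turn such a dict of dotted module paths and aliases into imports is `importlib`, sketched below. The function name `import_helpers` is hypothetical, and the demo uses standard-library modules as stand-ins so it runs anywhere.

```python
import importlib

def import_helpers(module_map, namespace):
    """Import each dotted module path and bind it to its alias in `namespace`."""
    for module_path, alias in module_map.items():
        namespace[alias] = importlib.import_module(module_path)
    return namespace

# Demo with standard-library modules standing in for the helper code
demo_map = {"os.path": "osp", "json": "js"}
loaded = import_helpers(demo_map, {})
print(sorted(loaded))
```

In the notebook, a call like `import_helpers(code_files, globals())` would presumably make `kag_utils` available, assuming the repo root is the current directory and the `__init__.py` files mentioned in the comments are in place.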


In [0]:
# Loop to load the above data files into appropriately-named pandas DataFrames
for path_name in data_files:
  filename = path_name.rsplit("/", 1)[-1]
  data_frame_name = filename.split(".")[0]
  # bind each DataFrame to a global variable named after its file (e.g. "items")
  globals()[data_frame_name] = pd.read_csv(path_name)
  print("Data Frame: " + data_frame_name)
  print(globals()[data_frame_name].head(2))
  print("\n")


Data Frame: items
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76


Data Frame: item_categories
        item_category_name  item_category_id
0  PC - Гарнитуры/Наушники                 0
1         Аксессуары - PS2                 1


Data Frame: shops
                       shop_name  shop_id
0  !Якутск Орджоникидзе, 56 фран        0
1  !Якутск ТЦ "Центральный" фран        1


Data Frame: sample_submission
   ID  item_cnt_month
0   0             0.5
1   1             0.5


Data Frame: sales_train
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154       999.0           1.0
1  03.01.2013               0       25     2552       899.0           1.0


Data Frame: test
   ID  shop_id  item_id
0   0        5     5037
1   1        5    

##1b) Install Non-Standard Packages
These packages are used for EDA and for generating features.

Once this is done and the modified, feature-rich dataframes have been saved, we can cruise through this notebook quickly without repeating any lengthy operations (i.e., collapse this section).

In [4]:
# Translating Package using Google API
#   used for translating Russian text in dataframes, so we can better understand potential features, data leaks, or outliers
!pip install googletrans


# Assuming you are planning to use this package (because you ran this cell and imported the googletrans package),
#  we will go ahead and import the library and instantiate a Translator class
from googletrans import Translator
translator = Translator()

Collecting googletrans
  Downloading https://files.pythonhosted.org/packages/fd/f0/a22d41d3846d1f46a4f20086141e0428ccc9c6d644aacbfd30990cf46886/googletrans-2.4.0.tar.gz
Building wheels for collected packages: googletrans
  Building wheel for googletrans (setup.py) ... done
  Created wheel for googletrans: filename=googletrans-2.4.0-cp36-none-any.whl size=15777 sha256=4738d498b7df22b58879f8e71923014540f19187c8f12446398eac45bba2d717
  Stored in directory: /root/.cache/pip/wheels/50/d6/e7/a8efd5f2427d5eb258070048718fa56ee5ac57fd6f53505f95
Successfully built googletrans
Installing collected packages: googletrans
Successfully installed googletrans-2.4.0


In [0]:
# Geocoding library 
#   used for creating features from shop location
!pip install geopy

# Assuming you are planning to use this package (because you ran this cell and imported the geopy package),
#  we will go ahead and import the library elements and instantiate two rate-limited geocoders
from geopy.geocoders import Nominatim
from geopy.geocoders import GeoNames
from geopy.extra.rate_limiter import RateLimiter

# Utilize "RateLimiter" to limit location queries to one per second, as the free services tend to throttle rate of use
# We will use Nominatim for location, and GeoNames for population
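# Note: geopy 2.x removed the `format_string` argument, so ", Russia" is appended to each query here instead
nominatum_service = Nominatim(timeout=10, user_agent="mgaidis@yahoo.com")
nominatum_geocode = RateLimiter(lambda query, **kwargs: nominatum_service.geocode(f"{query}, Russia", **kwargs), min_delay_seconds=1)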
geonames_service = GeoNames(username='gaidis', timeout=10, user_agent="mgaidis@yahoo.com")  # be sure to enable free web services when creating geonames account
geonames_geocode = RateLimiter(geonames_service.geocode, min_delay_seconds=1)

##1c) Import "Standard" Libraries


In [0]:
# General python libraries used throughout the notebook

import matplotlib.pyplot as plt
import numpy as np

from itertools import product
import time
import re
import json
from time import sleep, localtime, strftime

from sklearn.linear_model import LinearRegression
import pickle

%matplotlib inline

#2. Explore Data

##2a) Data Formatting and Translating
##2b) Descriptive explanations of data in source files
</br>

First, we will consider the dataframes containing Russian text.  Let's translate to English and see if we can find useful feature creation ideas.

###</br>**shops.csv** EDA and Feature Generation

---



---



#####**Translate and Ruminate**
We will start by translating the Russian text in the dataframe, and add our ruminations on possible new features we can generate.

The dataframe shops2 (equivalent to shops + 'column for English translation') is saved as a .csv file so we do not have to repeat the translation process the next time we open a Google Colab runtime.

**Tip**
</br>
If you want to run this code to generate translated features for the shops, be sure to install the googletrans package and import Translator as in code above

If you have already created and saved this data, save time by importing the modified csv datafile and you won't have to re-run the translating.  (This can be a big deal, because Google Translate API restricts amount of usage and/or rate of usage for calls to the translator.)
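The "load it if it is already saved, otherwise build it once" idea in this tip can be captured in a small helper. This is only a sketch: `load_or_build` is a hypothetical name, and a counter-based stand-in replaces the slow, rate-limited translation step so the example runs without network access.

```python
import os
import tempfile
import pandas as pd

def load_or_build(csv_path, build_fn):
    """Return the cached DataFrame at `csv_path`, or build it once and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = build_fn()
    df.to_csv(csv_path, index=False)
    return df

# Demo: a counter stands in for the slow, rate-limited translation step
calls = {"n": 0}
def fake_build():
    calls["n"] += 1
    return pd.DataFrame({"shop_id": [0, 1], "En_Name": ["a", "b"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "shops2_demo.csv")
    first = load_or_build(cache_path, fake_build)    # builds and saves
    second = load_or_build(cache_path, fake_build)   # loads from disk
print(calls["n"])
```

The build function is called only on the first request, so later runtimes would pay the translation cost only when the csv file is missing.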

In [0]:
shops2 = shops.copy(deep=True)
shops2['En_Name'] = shops2.shop_name.apply(lambda x: translator.translate(x, src='ru', dest='en').text)
shops2.head(10)

Unnamed: 0,shop_name,shop_id,En_Name
0,"!Якутск Орджоникидзе, 56 фран",0,"! Yakutsk Ordzhonikidze, 56 Franc"
1,"!Якутск ТЦ ""Центральный"" фран",1,"! Yakutsk TC ""Central"" Franc"
2,"Адыгея ТЦ ""Мега""",2,"Adygea TC ""Mega"""
3,"Балашиха ТРК ""Октябрь-Киномир""",3,"Balashikha TRC ""October-Kinomir"""
4,"Волжский ТЦ ""Волга Молл""",4,"Volzhsky mall ""Volga Mall"""
5,"Вологда ТРЦ ""Мармелад""",5,"Vologda SEC ""Marmalade"""
6,"Воронеж (Плехановская, 13)",6,"Voronezh (Plekhanovskaya, 13)"
7,"Воронеж ТРЦ ""Максимир""",7,"Voronezh SEC ""Maksimir"""
8,"Воронеж ТРЦ Сити-Парк ""Град""",8,"Voronezh shopping center City Park ""Castle"""
9,Выездная Торговля,9,Itinerant trade


In [0]:
shops2.to_csv("data_output/shops2.csv", index=False)

In [0]:
len(shops2)

60

Observations:


1.  The number of shops is only 60, so manual feature generation is not out of the question.
2.  Most shops have a city associated with their name, so it's reasonable that we can do some feature generation based on shop location.
3.  (After Googling several of the shops, we realized that...) The shop type may be categorized.  We noticed that "Mega" shops were located inside shopping malls that were anchored (and managed) by Ikea stores.  "SEC" acronym implies the shop is part of a shopping and entertainment center (like a large shopping mall with a cinema or other activities).  SC, TC, TRK, and TRC acronyms generally imply the shop is in a standard shopping mall, but careful inspection on the world wide interweb shows that some of these shops are actually in SECs.  We will try assigning a type to each store from the following:

**['Online', 'Itinerant', 'Shop', 'Mall', 'Mega', 'SEC']**



>*   Online (like Amazon or eBay)
*   Itinerant (traveling salesman)
*   Shop (small, isolated store)
*   Mall (store is based in a shopping mall)
*   Mega (store is based in an Ikea-managed mall)
*   SEC (store is located in a shopping-entertainment complex)




Once we have extracted city information from the shop names, we can use a geolocator package to help categorize the location of the store, and even the population of the city in which the store is located.
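A rule-based first pass at the city and type columns could look like the sketch below. The keyword rules are assumptions for illustration only; the notebook's actual assignments came from manual inspection and web searches, so the output of a function like this would still need hand-correction.

```python
import re

# Keyword -> type rules, checked in order; first match wins (assumed mapping for illustration)
TYPE_RULES = [
    (r"\bonline\b|\binternet\b", "Online"),
    (r"\bitinerant\b", "Itinerant"),
    (r"\bmega\b", "Mega"),
    (r"\bSEC\b", "SEC"),
    (r"\b(?:SC|TC|TRC|TRK)\b|\bmall\b|shopping center", "Mall"),
]

def guess_city_and_type(en_name):
    """Guess (city, type) from a translated shop name; city is None for non-physical shops."""
    shop_type = "Shop"  # default when no keyword matches
    for pattern, candidate in TYPE_RULES:
        if re.search(pattern, en_name, flags=re.IGNORECASE):
            shop_type = candidate
            break
    if shop_type in ("Online", "Itinerant"):
        return None, shop_type
    # Take the first alphabetic word as the city guess (skips a leading "!")
    words = re.findall(r"[A-Za-z][A-Za-z.\-]*", en_name)
    return (words[0] if words else None), shop_type

for name in ['! Yakutsk TC "Central" Franc', 'Adygea TC "Mega"',
             'Vologda SEC "Marmalade"', 'Voronezh (Plekhanovskaya, 13)',
             'Itinerant trade']:
    print(name, "->", guess_city_and_type(name))
```

Rule order matters here: "Mega" is checked before the generic mall acronyms so that a name containing both "TC" and "Mega" is typed as Mega.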


The geopy package seems pretty good for performing the location-based categorization.  The free services Nominatim and GeoNames can work with geopy to give us longitude, latitude, federal district, and population, for example.
</br></br>
Latitude and longitude of the shops are likely too fine-grained and would encourage overfitting in our model.  Instead, we can generate a feature based on Russian **Federal District**, as retrieved with the Nominatim geocoding service.  Due to religious preferences, for example, there may be a bias for a certain region to have higher (or lower) sales in November (before Christmas).  The map below shows roughly how Nominatim would categorize the Russian Federal Districts (the red text in the image):

<img src="https://www.worldatlas.com/r/w728-h425-c728x425/upload/4c/4b/0f/shutterstock-183567236-1.jpg">

</br>
We have no shops in the dataframe that come from the North Caucasian district, so the category types for district are as follows ('None' indicates online or itinerant shops):

**['Central', 'Northwestern', 'Siberian', 'Ural', 'Volga', 'South', 'Eastern', 'None']**
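To show how a district label can be pulled out of a geocoder's address text, here is a small sketch using the same regular expression as the geocoding cell in this notebook, with a fallback when no district is found. The address strings are hypothetical examples of what an English-language geocoder result might look like, and `extract_district` is a name introduced here for illustration.

```python
import re

DISTRICT_PATTERN = re.compile(r"[,\s](\w*)\sFederal District")

def extract_district(address):
    """Return the word just before 'Federal District', or 'None' if absent."""
    match = DISTRICT_PATTERN.search(str(address))
    return match.group(1) if match else "None"

# Hypothetical address strings in the style of a geocoder's English output
print(extract_district("Voronezh, Voronezh Oblast, Central Federal District, Russia"))
print(extract_district("Yakutsk, Sakha Republic, Far Eastern Federal District, Russia"))
print(extract_district(None))
```

Because the pattern captures only the single word before "Federal District", a two-word name such as "Far Eastern" comes back as "Eastern". Returning a fallback instead of calling `.group(1)` on a failed match avoids an `AttributeError` when the geocoder returns nothing.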



#####**Geocoding New Features**
We now apply geocoding to the shop locations to create features including Russian Federal District and population of the shop's city.

*shops3.csv* file is saved to contain the original *shops.csv* data plus the English translation plus a feature column for Federal District and a feature column for population.

**Tip**
</br>
If you want to run this code to generate geo features for the shops, be sure to install the geopy package and import Nominatim and GeoNames as in code above

If you have already created and saved this data, save time by importing the modified csv datafile and you won't have to re-run the geocoding.

In [0]:
# Add 'district' column to the shops3 dataframe using Nominatim
shops3['district'] = shops3.City.apply(lambda x:  'None' if (x == 'None') else re.search(r'[,\s](\w*)\sFederal District', str(nominatum_geocode(x, language='en'))).group(1))

In [0]:
# Add 'population' column to the shops3 dataframe using GeoNames
shops3['population'] = shops3.City.apply(lambda x:  'None' if (x == 'None') else geonames_geocode(x, timeout=10).raw['population'])

In [0]:
# GeoNames had some trouble with Zhukovsky and Chekhov... insert values from Wikipedia
shops3.at[56,'population'] = 61000
shops3.at[10,'population'] = 105000
shops3.at[11,'population'] = 105000

# for 'Itinerant' (traveling salesman) shop, set the population to 100,000 (roughly the number of people the salesman might have access to in a day)
shops3.at[9,'population'] = 100000

# for 'Online' shops, set the population to 20,000,000 (estimate of the number of people who have internet and would place an online order from Russia)
shops3.at[12,'population'] = 20000000
shops3.at[20,'population'] = 20000000
shops3.at[55,'population'] = 20000000

In [0]:
shops3.to_csv("data_output/shops3.csv", index=False)

####Have a look at the enhanced "shops" data

In [0]:
shops3

Unnamed: 0,shop_name,shop_id,En_Name,City,Type,district,population
0,"!Якутск Орджоникидзе, 56 фран",0,"! Yakutsk Ordzhonikidze, 56 Franc",Yakutsk,Mall,Eastern,235600
1,"!Якутск ТЦ ""Центральный"" фран",1,"! Yakutsk TC ""Central"" Franc",Yakutsk,Mall,Eastern,235600
2,"Адыгея ТЦ ""Мега""",2,"Adygea TC ""Mega""",Adygea,Mega,South,144055
3,"Балашиха ТРК ""Октябрь-Киномир""",3,"Balashikha TRC ""October-Kinomir""",Balashikha,Mall,Central,150103
4,"Волжский ТЦ ""Волга Молл""",4,"Volzhsky mall ""Volga Mall""",Volgograd,Mall,South,1011417
5,"Вологда ТРЦ ""Мармелад""",5,"Vologda SEC ""Marmalade""",Vologda,SEC,Northwestern,314900
6,"Воронеж (Плехановская, 13)",6,"Voronezh (Plekhanovskaya, 13)",Voronezh,Shop,Central,848752
7,"Воронеж ТРЦ ""Максимир""",7,"Voronezh SEC ""Maksimir""",Voronezh,SEC,Central,848752
8,"Воронеж ТРЦ Сити-Парк ""Град""",8,"Voronezh shopping center City Park ""Castle""",Voronezh,Mall,Central,848752
9,Выездная Торговля,9,Itinerant trade,,Itinerant,,100000


While entering city/type information, we noticed a possible issue with a couple of the shops: the two below may actually be the same shop.  From extensive web searching, the only things at that address appear to be a security systems vendor and a trial attorney.

10	Жуковский ул. Чкалова 39м?	10	Zhukovsky Street. Chkalov 39m?

11	Жуковский ул. Чкалова 39м²	11	Zhukovsky Street. Chkalov 39m²

</br>
We are probably best off combining these two shops somehow.  We need to check whether sales are significantly different for the two shops.  Do we make them into one shop and then split the prediction in half between shops #10 and #11 when we submit for grading?  We need to chew on this a bit.
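One way to check whether the two shops' sales differ is to line up their monthly totals side by side. The sketch below uses made-up toy rows in the layout of `sales_train`, and `monthly_totals_by_shop` is a hypothetical helper name. It says nothing about what the real data shows.

```python
import pandas as pd

def monthly_totals_by_shop(sales, shop_ids):
    """Pivot total units sold per month, one column per requested shop."""
    subset = sales[sales["shop_id"].isin(shop_ids)]
    return subset.pivot_table(
        index="date_block_num",
        columns="shop_id",
        values="item_cnt_day",
        aggfunc="sum",
        fill_value=0,
    )

# Toy data (made up): shop 10 sells in months 0 and 1, shop 11 only in month 2
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 1, 2],
    "shop_id":        [10, 10, 10, 11],
    "item_cnt_day":   [1.0, 2.0, 4.0, 5.0],
})
totals = monthly_totals_by_shop(toy_sales, [10, 11])
print(totals)

# Months in which both shops recorded sales
overlap = totals[(totals[10] > 0) & (totals[11] > 0)].index.tolist()
print(overlap)
```

If the real data showed no months in which both shops have sales, that would be consistent with a single shop recorded under two IDs. Overlapping months with similar volumes would argue against merging.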

###</br>**item_categories.csv** EDA and Feature Generation

---



---



#####**Translate and Ruminate**
We will start by translating the Russian text in the dataframe, and add our ruminations on possible new features we can generate.

The dataframe *item_categories2* (equivalent to item_categories plus a column for English translation) is saved as a .csv file so we do not have to repeat the translation process the next time we open a Google Colab runtime.

**Tip**
</br>
If you want to run this code to generate translated features for the item categories, be sure to install the googletrans package and import Translator as in code above

If you have already created and saved this data, save time by importing the modified csv datafile and you won't have to re-run the translating.  (This can be a big deal, because Google Translate API restricts amount of usage and/or rate of usage for calls to the translator.)

In [0]:
print(item_categories)

           item_category_name  item_category_id
0     PC - Гарнитуры/Наушники                 0
1            Аксессуары - PS2                 1
2            Аксессуары - PS3                 2
3            Аксессуары - PS4                 3
4            Аксессуары - PSP                 4
..                        ...               ...
79                  Служебные                79
80         Служебные - Билеты                80
81    Чистые носители (шпиль)                81
82  Чистые носители (штучные)                82
83           Элементы питания                83

[84 rows x 2 columns]

In [0]:
item_categories2 = item_categories.copy(deep=True)
item_categories2['En_Name'] = item_categories2.item_category_name.apply(lambda x: translator.translate(x, src='ru', dest='en').text)

In [0]:
item_categories2.to_csv("data_output/item_categories2.csv", index=False)

In [0]:
item_categories2.head()

Unnamed: 0,item_category_name,item_category_id,En_Name
0,PC - Гарнитуры/Наушники,0,PC - Headsets / Headphones
1,Аксессуары - PS2,1,Accessories - PS2
2,Аксессуары - PS3,2,Accessories - PS3
3,Аксессуары - PS4,3,Accessories - PS4
4,Аксессуары - PSP,4,Accessories - PSP


Observations...
There is clearly an overlap in item_category types that can be made into a new feature (e.g., all accessories for PlayStation, all accessories for XBox, ...)


#####**item_categories With New Features**
Since there are only 84 categories, offline hand-coding new categorical features into a csv file isn't difficult.  We will add column "Subcategory1" and column "Subcategory2" which focus on (1) type of product (console, software, etc.), and (2) platform of product (playstation, xbox, pc, etc.).
</br>
These two new columns are added to the *item_categories2.csv* file using an external spreadsheet editor, and are saved as file *item_categories3.csv* 

As the features were manually added, there is no code to generate the file *item_categories3.csv*
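Although the file was built by hand, a rule-based first pass for the platform column could be sketched as follows. The keyword rules and the function name `guess_platform` are assumptions for illustration, not the method actually used, and the result would still need manual review.

```python
import re

# Keyword -> platform rules (an assumed mapping; first match wins, case-sensitive)
PLATFORM_RULES = [
    (r"\bPS\w*", "PlayStation"),
    (r"\bXBOX\b", "Xbox"),
    (r"\bPC\b", "PC"),
]

def guess_platform(en_name, default="Any"):
    """Return the first platform whose keyword appears in the translated category name."""
    for pattern, platform in PLATFORM_RULES:
        if re.search(pattern, en_name):
            return platform
    return default

for name in ["PC - Headsets / Headphones", "Accessories - PS2",
             "Accessories - PSVita", "Accessories - XBOX 360",
             "Tickets (digits)", "Delivery of goods"]:
    print(name, "->", guess_platform(name))
```

With only 84 categories, a pass like this mainly saves typing. Values such as 'Phone' or 'Other' in the hand-coded column are not covered by these rules.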

If you haven't done so already, load *item_categories3.csv* into a dataframe using the code cell below:


In [0]:
path_name = "data_output/item_categories3.csv"
filename = path_name.rsplit("/")[-1]
data_frame_name = filename.split(".")[0]
exec(data_frame_name + " = pd.read_csv(path_name)")
print("Data Frame: " + data_frame_name)
print(eval(data_frame_name).head(2))
print("\n")

Data Frame: item_categories3
        item_category_name  item_category_id  ... Subcategory1 Subcategory2
0  PC - Гарнитуры/Наушники                 0  ...        Audio           PC
1         Аксессуары - PS2                 1  ...  Accessories  PlayStation

[2 rows x 5 columns]




In [0]:
# These are the categories presently being used (hand-coded) in the two extra item_categories feature columns:
print(item_categories3.Subcategory1.unique())
print(item_categories3.Subcategory2.unique())

['Audio' 'Accessories' 'Tickets' 'Shipping' 'Consoles' 'Games'
 'Debit_Cards' 'Movies' 'Books' 'Music' 'Gifts' 'Software' 'Internet']
['PC' 'PlayStation' 'Xbox' 'Any' 'Other' 'Phone' 'Movies' 'Books' 'Music'
 'Gifts']


In [0]:
item_categories3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   item_category_name  84 non-null     object
 1   item_category_id    84 non-null     int64 
 2   En_Name             84 non-null     object
 3   Subcategory1        84 non-null     object
 4   Subcategory2        84 non-null     object
dtypes: int64(1), object(4)
memory usage: 3.4+ KB


####Have a look at the enhanced "item_categories" data

In [0]:
item_categories3.head(20)

Unnamed: 0,item_category_name,item_category_id,En_Name,Subcategory1,Subcategory2
0,PC - Гарнитуры/Наушники,0,PC - Headsets / Headphones,Audio,PC
1,Аксессуары - PS2,1,Accessories - PS2,Accessories,PlayStation
2,Аксессуары - PS3,2,Accessories - PS3,Accessories,PlayStation
3,Аксессуары - PS4,3,Accessories - PS4,Accessories,PlayStation
4,Аксессуары - PSP,4,Accessories - PSP,Accessories,PlayStation
5,Аксессуары - PSVita,5,Accessories - PSVita,Accessories,PlayStation
6,Аксессуары - XBOX 360,6,Accessories - XBOX 360,Accessories,Xbox
7,Аксессуары - XBOX ONE,7,Accessories - XBOX ONE,Accessories,Xbox
8,Билеты (Цифра),8,Tickets (digits),Tickets,Any
9,Доставка товара,9,Delivery of goods,Shipping,Any


###</br>**items.csv** EDA and Feature Generation

---



---



#####**Translate and Ruminate**
We will start by translating the Russian text in the dataframe, and add our ruminations on possible new features we can generate.

The dataframe *items2* (equivalent to *items* plus a column for English translation) is saved as a .csv file so we do not have to repeat the translation process the next time we open a Google Colab runtime.

In [0]:
print(items)

                                               item_name  ...  item_category_id
0              ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D  ...                40
1      !ABBYY FineReader 12 Professional Edition Full...  ...                76
2          ***В ЛУЧАХ СЛАВЫ   (UNV)                    D  ...                40
3        ***ГОЛУБАЯ ВОЛНА  (Univ)                      D  ...                40
4            ***КОРОБКА (СТЕКЛО)                       D  ...                40
...                                                  ...  ...               ...
22165             Ядерный титбит 2 [PC, Цифровая версия]  ...                31
22166    Язык запросов 1С:Предприятия  [Цифровая версия]  ...                54
22167  Язык запросов 1С:Предприятия 8 (+CD). Хрустале...  ...                49
22168                                Яйцо для Little Inu  ...                62
22169                      Яйцо дракона (Игра престолов)  ...                69

[2217

In [0]:
items.head(2)

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76


In [0]:
items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22170 entries, 0 to 22169
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   item_name         22170 non-null  object
 1   item_id           22170 non-null  int64 
 2   item_category_id  22170 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 519.7+ KB


In [0]:
# google translator has a limit to the rate at which you can send queries.  I'm not entirely sure, but it seems like 5 per second max.
#  so, for translation of the items dataframe, I'll create a looping function (instead of pandas .apply) to do the translations with a short wait between calls
#  to google translator.  I will also save the dataframe as a csv file every few hundred translations, so that even if I do exceed
#  google's limits, I will at least have something useful to start from again.
from time import sleep, localtime, strftime
translator = Translator()

temp_store = []
items2 = items.copy(deep=True)
items2['En_name']= ""  # initialize an empty column
save_counter = 0
save_every = 200  # print progress after this many rows have been translated (the periodic csv save below is commented out)
for i in range(len(items2)):
  #items2.at[i,'En_name'] = translator.translate(items2.at[i,'item_name'],src='ru',dest='en').text
  # argh! try to get something useful even if Google API kicks me out
  translator = Translator()
  temp_store.append(translator.translate(items2.at[i,'item_name'],src='ru',dest='en').text)
  items2.at[i,'En_name'] = temp_store[-1]
  if i//save_every > save_counter:
    save_counter += 1
    print(str(i),end=" ")
    #items2.to_csv("data_output/items2.csv", index=False)
    #print("Saved after row number: " + str(i) + " at " + strftime("%H:%M",localtime()))
    #print("Row number: " + str(i) + " at " + strftime("%H:%M",localtime()))
  sleep(0.4)

items2.to_csv("data_output/items2.csv", index = False)
items2.head()

200 400 

JSONDecodeError: ignored
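Since the translation service can reject requests partway through, a loop that checkpoints its progress and can be resumed is one way to avoid losing work. The sketch below is a generic pattern under stated assumptions, not the code used in this notebook: `translate_with_checkpoints`, `translate_fn`, and `save_fn` are hypothetical names, and a stand-in translator that fails on one row replaces the real API call.

```python
import time
import pandas as pd

def translate_with_checkpoints(df, src_col, dst_col, translate_fn, save_fn,
                               save_every=200, pause=0.0, max_retries=3):
    """Fill missing `dst_col` values row by row, saving periodically.

    Stops early (after a final save) if one row keeps failing, so a later
    call can resume from the rows that are still missing.
    """
    if dst_col not in df.columns:
        df[dst_col] = None
    pending = df.index[df[dst_col].isna()]
    done = 0
    for idx in pending:
        for attempt in range(max_retries):
            try:
                df.at[idx, dst_col] = translate_fn(df.at[idx, src_col])
                break
            except Exception:
                time.sleep(pause * 2 ** attempt)  # back off before retrying
        else:
            break  # retries exhausted: give up for now, keep what we have
        done += 1
        if done % save_every == 0:
            save_fn(df)
        time.sleep(pause)
    save_fn(df)
    return done

# Demo with a stand-in translator that fails permanently on one row
def flaky_upper(text):
    if text == "c":
        raise ValueError("simulated API rejection")
    return text.upper()

saves = []
demo = pd.DataFrame({"item_name": ["a", "b", "c", "d"]})
n_done = translate_with_checkpoints(
    demo, "item_name", "En_name", flaky_upper,
    save_fn=lambda d: saves.append(d["En_name"].tolist()),
    save_every=2,
)
print(n_done, demo["En_name"].tolist())
```

In this notebook's setting, `translate_fn` could wrap the `translator.translate(..., src='ru', dest='en').text` call and `save_fn` could wrap `to_csv("data_output/items2.csv", index=False)`, with `pause` set to the delay between requests.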

In [0]:
#save what we have so far
items2.to_csv("data_output/items2.csv", index = False)
items2.head()

Unnamed: 0,item_name,item_id,item_category_id,En_name
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40,! POWER IN glamor (PLAST.) D
1,!ABBYY FineReader 12 Professional Edition Full...,1,76,! ABBYY FineReader 12 Professional Edition Ful...
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40,*** In the glory (UNV) D
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40,*** BLUE WAVE (Univ) D
4,***КОРОБКА (СТЕКЛО) D,4,40,*** BOX (GLASS) D


In [0]:
len(temp_store)  # = 515 after above run

In [0]:
temp_store[-5:]

['1C: TOYS "Madagascar"',
 '1C: TOYS "Well, wait Issue 3. Song for a hare!" [PC, Digital Version]',
 '1C: TOYS "Rex and wizards" [PC, Digital Version]',
 '1C: TOYS "Rex and time machine" [PC, Digital Version]',
 '1C: TOYS "Rex and the treasures of the pirates" [PC, Digital Version]']

### start from here

---



---

Start from the cells below when adding more rows to the items datafile translation.


In [10]:
path_name = "data_output/items2.csv"
filename = path_name.rsplit("/")[-1]
data_frame_name = filename.split(".")[0]
exec(data_frame_name + " = pd.read_csv(path_name)")
print("Data Frame: " + data_frame_name)
print(eval(data_frame_name).head(2))
print("\n")

Data Frame: items2
                                           item_name  ...                                            En_name
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D  ...                       ! POWER IN glamor (PLAST.) D
1  !ABBYY FineReader 12 Professional Edition Full...  ...  ! ABBYY FineReader 12 Professional Edition Ful...

[2 rows x 4 columns]




In [11]:
row_start = items2.En_name.count()
row_start

22170
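Note that `count()` returns the number of non-null translations, which equals the index of the first untranslated row only when the translated rows form an unbroken block at the top of the column. A small sketch for checking that assumption is shown below; `first_missing_row` is a hypothetical helper and the toy column is made up.

```python
import pandas as pd

def first_missing_row(series):
    """Return the positional index of the first null entry, or len(series) if none."""
    missing = series.isna().to_numpy()
    return int(missing.argmax()) if missing.any() else len(series)

# Toy column (made up) with a gap in the middle
toy = pd.Series(["A", "B", None, "D", None])
print(toy.count())             # number of non-null entries
print(first_missing_row(toy))  # position of the first gap
```

When the two numbers disagree, as in this toy column, resuming from `count()` would skip over a gap, so resuming from the first missing row is the safer choice.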

In [8]:
# for giggles, load up the temp_store list
temp_store = items2.En_name.to_list()
print(len(temp_store))
print("\n")
print(temp_store[:5])
print("\n")
print(temp_store[1009:1014])
print("\n")
print(temp_store[-5:])


22170


['! POWER IN glamor (PLAST.) D', '! ABBYY FineReader 12 Professional Edition Full [PC, Digital Version]', '*** In the glory (UNV) D', '*** BLUE WAVE (Univ) D', '*** BOX (GLASS) D']


['3D Crystal Puzzle Drop L Lamp', '3D Crystal Puzzle Strawberry L Lamp', '3D Crystal Puzzle Crystal L', '3D Crystal Puzzle Cube Lamp L', '3D Crystal Puzzle Swan L']


[nan, nan, nan, nan, nan]


In [9]:
# So far, 515 --> 1012 --> 1487 rows were translated in the runs above.
# Try again, picking up where Google cut us off; change sleep from 0.4 to 0.6
#   --> no good; after about 5 minutes, ran this code cell and was immediately booted by Google
# Try changing VPN location (NYC was first) -> Washington, DC -> Dallas
#   --> no good
# Try restarting the runtime (first loading items2.csv to get what we already saved)
#   --> no good
# Try terminating the session, loading items2.csv, and running it again
#   --> YES, this works.  However, Google cuts us off at 1012 rows (adding only 497 to the previous total)
# Try reconnecting to a hosted runtime (top right menu), and change sleep from 0.6 to 1
#   (for some reason, "manage sessions" didn't give me the option to terminate this ipynb session this time)
#   --> no good
# Go back to closing/terminating the window and reopening; now at 1487; change sleep(1) --> sleep(2)

translator = Translator()

###temp_store = []
###items2 = items.copy(deep=True)
###items2['En_name']= ""  # initialize an empty column
save_counter = 0
save_every = 500  # print progress after this many rows (the csv file is only written once the loop finishes)
#row_start = 1012  # the place where the Google API kicked me out on the previous attempt
row_counter = row_start
###for i in range(len(items2)):
for i in range(row_start, len(items2)):
  translator = Translator()  # fresh Translator for every request
  temp_store[i] = translator.translate(items2.at[i,'item_name'],src='ru',dest='en').text
  items2.at[i,'En_name'] = temp_store[i]
  row_counter += 1
  # NOTE: save_counter starts at 0, so when row_start > 0 this check fires on every row
  #   until save_counter catches up with row_start // save_every
  if i//save_every > save_counter:
    save_counter += 1
    print(str(i),end=" ")
  sleep(2)

items2.to_csv("data_output/items2.csv", index = False)
items2.head()

15000 15001 15002 15003 15004 15005 15006 15007 15008 15009 15010 15011 15012 15013 15014 15015 15016 15017 15018 15019 15020 15021 15022 15023 15024 15025 15026 15027 15028 15029 15500 16000 16500 17000 17500 18000 18500 19000 19500 20000 20500 21000 21500 22000 

Unnamed: 0,item_name,item_id,item_category_id,En_name
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40,! POWER IN glamor (PLAST.) D
1,!ABBYY FineReader 12 Professional Edition Full...,1,76,! ABBYY FineReader 12 Professional Edition Ful...
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40,*** In the glory (UNV) D
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40,*** BLUE WAVE (Univ) D
4,***КОРОБКА (СТЕКЛО) D,4,40,*** BOX (GLASS) D
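
A minimal sketch of a resumable loop that checkpoints every `save_every` rows, with the checkpoint counter seeded from `row_start` so a resumed run does not trigger on every row while it catches up. `translate_fn` and `save_fn` are hypothetical stand-ins for the translation call and for `items2.to_csv(...)`; nothing here talks to Google.

```python
def translate_with_checkpoints(names, translated, translate_fn, save_fn,
                               row_start=0, save_every=500):
    """Fill translated[row_start:] in place, calling save_fn at each checkpoint.

    translate_fn and save_fn are hypothetical callables supplied by the caller.
    Returns the list of row indices at which a checkpoint was written.
    """
    checkpoints = []
    # Seed the counter with the block row_start already belongs to.
    save_counter = row_start // save_every
    for i in range(row_start, len(names)):
        translated[i] = translate_fn(names[i])
        if i // save_every > save_counter:
            save_counter = i // save_every
            save_fn(translated)
            checkpoints.append(i)
    save_fn(translated)  # final save once the loop finishes
    return checkpoints


# Usage with stand-ins: upper-casing replaces the real translation call.
names = [f"item {n}" for n in range(12)]
translated = [None] * len(names)
saves = []
marks = translate_with_checkpoints(
    names, translated,
    translate_fn=str.upper,
    save_fn=lambda rows: saves.append(list(rows)),
    row_start=5, save_every=3)
print(marks)  # [6, 9]
```

If the translation call raises partway through, the last checkpoint is what survives, so a smaller `save_every` trades more file writes for less lost work.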


In [0]:
items2.to_csv("data_output/items2.csv", index = False)
items2.head()

Unnamed: 0,item_name,item_id,item_category_id,En_name
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40,! POWER IN glamor (PLAST.) D
1,!ABBYY FineReader 12 Professional Edition Full...,1,76,! ABBYY FineReader 12 Professional Edition Ful...
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40,*** In the glory (UNV) D
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40,*** BLUE WAVE (Univ) D
4,***КОРОБКА (СТЕКЛО) D,4,40,*** BOX (GLASS) D


In [0]:
print(type(items2.at[22169,'En_name'])) 
print(type(items2.at[1,'En_name']))

<class 'float'>
<class 'str'>


In [0]:
items2.En_name.value_counts()
items2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22170 entries, 0 to 22169
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   item_name         22170 non-null  object
 1   item_id           22170 non-null  int64 
 2   item_category_id  22170 non-null  int64 
 3   En_name           15000 non-null  object
dtypes: int64(2), object(2)
memory usage: 692.9+ KB


In [0]:
row_counter

15000

Things to try to increase the amount of translating I can do on this dataframe:

1. Try a VPN to get a new IP address when Google chokes.
2. Try randomizing the Google server being used, by putting something like this inside the loop (with a few more URLs in the list):
> translator = Translator(service_urls=[
      'translate.google.com',
      'translate.google.co.kr',
    ])
3. Try sending a list (< 15k characters total) for translation, instead of a single cell.  Google may be restricting my number of requests rather than my character throughput.
4. Search the item_name column for repetitive words or word combinations, and translate only those.  Investigate them for possible categorization.  Then create categories from the Russian text column: reverse-translate the English categories, then search the Russian text column for matches.
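
Item 3 above can be sketched as a helper that groups item names into batches under a character budget. This only builds the batches; whether a given translation client accepts a list in one call, and whether Google counts that as a single request, are assumptions I have not verified. The function name `batch_by_chars` is made up for this illustration.

```python
def batch_by_chars(texts, max_chars=15000):
    """Group texts into lists whose combined length stays within max_chars.

    A single text longer than max_chars is placed in a batch by itself.
    """
    batches, current, current_len = [], [], 0
    for text in texts:
        # Close the current batch if adding this text would exceed the budget.
        if current and current_len + len(text) > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        batches.append(current)
    return batches


names = ["ВО ВЛАСТИ НАВАЖДЕНИЯ", "ГОЛУБАЯ ВОЛНА", "КОРОБКА", "В ЛУЧАХ СЛАВЫ"]
for batch in batch_by_chars(names, max_chars=30):
    print(len(batch), sum(len(t) for t in batch))
```

The order of the names is preserved, so the translated results can be written back to the dataframe row by row.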

In [0]:
# Google Translate limits the rate at which you can send queries.  I'm not entirely sure, but the maximum seems to be about 5 per second.
#  So, to translate the items dataframe, I'll use an explicit loop (instead of pandas .apply) with a short wait between calls
#  to Google Translate.  I will also save the dataframe as a csv file every few hundred translations, so that even if I do exceed
#  Google's limits, I will at least have something useful to start from again.
from time import sleep, localtime, strftime

#temp_store = []
#items2 = items.copy(deep=True)
#items2['En_name']= ""  # initialize an empty column
#save_counter = 0
#save_every = 200  # write csv file after this many rows have been translated
for i in range(218,len(items2)):
  #items2.at[i,'En_name'] = translator.translate(items2.at[i,'item_name'],src='ru',dest='en').text
  # argh! try to get something useful even if Google API kicks me out
  # country domains such as .es, .ca, .de take no "co." prefix (translate.google.es, not translate.google.co.es)
  translator = Translator(service_urls=[ 'translate.google.com', 'translate.google.co.kr', 'translate.google.co.jp', 'translate.google.co.uk',
                                      'translate.google.es', 'translate.google.ca', 'translate.google.de', 'translate.google.it',
                                      'translate.google.fr', 'translate.google.nl', 'translate.google.ie', 'translate.google.ch'
                                        ])
  temp_store.append(translator.translate(items2.at[i,'item_name'],src='ru',dest='en').text)
  items2.at[i,'En_name'] = temp_store[-1]
  sleep(0.5)

items2.to_csv("data_output/items2.csv", index = False)
items2.head()

ConnectionError: ignored
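
A minimal sketch of retrying a failed call with exponentially growing waits, so that a `ConnectionError` like the one above does not end the whole run. `call_with_backoff` and `flaky_translate` are hypothetical names for this illustration, and the failures are simulated. Whether any amount of waiting gets past Google's throttling is an open question; in my notes above, longer sleeps alone did not help.

```python
import time


def call_with_backoff(fn, *args, retries=4, base_delay=1.0, sleep=time.sleep,
                      exceptions=(Exception,)):
    """Call fn(*args); on failure wait base_delay * 2**attempt, then retry.

    Re-raises the last exception once all retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return fn(*args)
        except exceptions:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated translator that fails twice before succeeding.
calls = {"n": 0}

def flaky_translate(text):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("simulated throttling")
    return text.upper()


delays = []  # record the waits instead of actually sleeping
print(call_with_backoff(flaky_translate, "box", sleep=delays.append))  # BOX
print(delays)  # [1.0, 2.0]
```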

In [0]:

  #if i//save_every > save_counter:
  #  save_counter += 1
  #  #items2.to_csv("data_output/items2.csv", index=False)
  #  #print("Saved after row number: " + str(i) + " at " + strftime("%H:%M",localtime()))
  #  print("Row number: " + str(i) + " at " + strftime("%H:%M",localtime()))

len(temp_store)
for i in range(len(temp_store)):
  #items2.at[i,'En_name'] = translator.translate(items2.at[i,'item_name'],src='ru',dest='en').text
  # argh! try to get something useful even if Google API kicks me out
  items2.at[i,'En_name'] = temp_store[i]

In [0]:
items2.head()

Unnamed: 0,item_name,item_id,item_category_id,En_name
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40,! POWER IN glamor (PLAST.) D
1,!ABBYY FineReader 12 Professional Edition Full...,1,76,! ABBYY FineReader 12 Professional Edition Ful...
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40,*** In the glory (UNV) D
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40,*** BLUE WAVE (Univ) D
4,***КОРОБКА (СТЕКЛО) D,4,40,*** BOX (GLASS) D


In [0]:
items2.info()

In [0]:

translator = Translator()
print(translator.translate('ВО ВЛАСТИ НАВАЖДЕНИЯ ПЛАСТ',src='ru',dest='en').text)

JSONDecodeError: ignored

In [0]:
items2['En_Name'] = items2.En_Name0.apply(lambda x: translator.translate(x, src='ru', dest='en').text)
items2 = items2.drop(columns=['En_Name0'])  # columns= is required; drop() removes rows by default
items2.to_csv("data_output/items2.csv", index=False)

JSONDecodeError: ignored

#2. Explore Data

##2c) Grouping and statistical descriptions of the provided features

Next:
*  Data visualizations and correlations
*  Look for signs of data leakage
*  Record initial thoughts on features and models to use

#3. Save Data with New Features, etc.

In [0]:
shops3.to_csv("data_output/shops3.csv", index=False)

In [0]:
item_categories2.to_csv("data_output/item_categories2.csv", index=False)

#XX) Everything below this markdown cell can be ignored

These are just code snippets I used to debug how to accomplish some of the tasks above.  I may want to revisit them some day, so I am too scared to just delete them. :)

In [0]:
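# Use geopy's RateLimiter to limit location queries to one per second, as the free services tend to throttle the rate of use
# We will use Nominatim for location, and GeoNames for population
from geopy.geocoders import Nominatim, GeoNames
from geopy.extra.rate_limiter import RateLimiter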
nominatum_service = Nominatim(timeout=10, user_agent = "mgaidis@yahoo.com", format_string="%s, Russia")
nominatum_geocode = RateLimiter(nominatum_service.geocode, min_delay_seconds=1)
geonames_service = GeoNames(country_bias='ru', username='gaidis', timeout=10, user_agent="mgaidis@yahoo.com", format_string="%s, Russia")  # be sure to enable free web services when creating geonames account
geonames_geocode = RateLimiter(geonames_service.geocode, min_delay_seconds=1)
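
To show what a minimum delay between calls amounts to, here is a minimal stdlib sketch of a wrapper that spaces out consecutive calls. It is not geopy's implementation (geopy's `RateLimiter` also retries on errors, which this omits), and the class name `MinDelay` and the fake lookup function are made up for this illustration.

```python
import time


class MinDelay:
    """Wrap a callable so consecutive calls are at least min_delay_seconds apart."""

    def __init__(self, func, min_delay_seconds=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.func = func
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.sleep = sleep
        self._last_call = None  # time of the previous call, if any

    def __call__(self, *args, **kwargs):
        if self._last_call is not None:
            wait = self.min_delay - (self.clock() - self._last_call)
            if wait > 0:
                self.sleep(wait)
        self._last_call = self.clock()
        return self.func(*args, **kwargs)


# Stand-in for a geocoding call; no network access involved.
lookup = MinDelay(lambda city: f"{city}, Russia", min_delay_seconds=0.2)
results = [lookup(c) for c in ["Moscow", "Yakutsk", "Adygea"]]
print(results)
```

The first call goes through immediately; each later call waits only for whatever is left of the delay since the previous one.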

In [0]:
# Example use of Nominatim
location = nominatum_geocode('Adygea', language="en")
print(json.dumps(location.raw))
print(location.latitude, location.longitude)
print(location.address.split(",")[-2].strip())
location

{"place_id": 234832475, "licence": "Data \u00a9 OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright", "osm_type": "relation", "osm_id": 253256, "boundingbox": ["43.7601459", "45.2171133", "38.6840182", "40.7776469"], "lat": "44.6939006", "lon": "40.1520421", "display_name": "Republic of Adygea, South Federal District, Russia", "class": "boundary", "type": "administrative", "importance": 0.7686671937378127, "icon": "https://nominatim.openstreetmap.org/images/mapicons/poi_boundary_administrative.p.20.png"}
44.6939006 40.1520421
South Federal District


Location(Republic of Adygea, South Federal District, Russia, (44.6939006, 40.1520421, 0.0))

In [0]:
# Check how Nominatim characterizes each federal district in Russia.  Below is a list of the biggest city in each district
big_cities = ['Moscow','St. Petersburg','Novosibirsk','Yekaterinburg','Nizhny Novgorod','Rostov-on-Don','Makhachkala', 'Vladivostok','Sevastopol']

In [0]:
for bc in big_cities:
  location = nominatum_geocode(bc, language='en')
  print(location)

Moscow, Central Federal District, Russia
Saint Petersburg, Northwestern Federal District, 190000, Russia
Novosibirsk, Novosibirsk Oblast, Siberian Federal District, 630000, Russia
Yekaterinburg, Yekaterinburg Municipality, Sverdlovsk Oblast, Ural Federal District, Russia
Nizhny Novgorod, Nizhny Novgorod Oblast, Volga Federal District, Russia
Rostov-on-Don, Rostov Oblast, South Federal District, Russia
Makhachkala, Makhachkala Urban Okrug, Republic of Dagestan, North Caucasus Federal District, 367000, Russia
Vladivostok, Владивостокский городской округ, Primorsky Krai, Far Eastern Federal District, 690000, Russia
Sevastopol, Ленинский район, Sevastopol, South Federal District, 299000-299699, Russia


In [0]:
# It appears that Crimea does not qualify as a district (Sevastopol falls into the South district, according to Nominatim)
# Here is a list of the districts as Nominatim reports them:
russian_districts = ['Central','Northwestern','Siberian','Ural','Volga','South','North Caucasus','Far Eastern']
# Since Nominatim does not reliably include "zip codes", we use a regex that relies on the fact that Nominatim always returns "xxx Federal District"
example_loc = 'Sevastopol, Ленинский район, Sevastopol, South Federal District, 299000-299699, Russia 299000-299699'
# capture everything between the preceding comma and " Federal District", so two-word names ("North Caucasus", "Far Eastern") stay whole
district_in_location = re.search(r'(?:^|,)\s*([^,]+?)\s+Federal District', example_loc)
print(district_in_location.group(1))

South
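
A minimal sketch of applying the district extraction to several of the address strings Nominatim returned above, using a pattern that allows multi-word district names such as "North Caucasus" and "Far Eastern". The helper name `federal_district` is made up for this illustration, and the addresses are copied from the earlier output rather than fetched again.

```python
import re

# Everything between the preceding comma (or start of string) and " Federal District"
DISTRICT_RE = re.compile(r"(?:^|,)\s*([^,]+?)\s+Federal District")

addresses = [
    "Moscow, Central Federal District, Russia",
    "Makhachkala, Makhachkala Urban Okrug, Republic of Dagestan, North Caucasus Federal District, 367000, Russia",
    "Vladivostok, Владивостокский городской округ, Primorsky Krai, Far Eastern Federal District, 690000, Russia",
    "Sevastopol, Ленинский район, Sevastopol, South Federal District, 299000-299699, Russia",
]


def federal_district(address):
    """Return the district name, or None if the address has no 'Federal District' part."""
    match = DISTRICT_RE.search(address)
    return match.group(1) if match else None


for address in addresses:
    print(federal_district(address))
```

Returning `None` instead of raising keeps a missing district from stopping a loop over all shop cities.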


In [0]:
# Check how well GeoNames does at retrieving populations...
# Note that GeoNames doesn't consider Sevastopol (Crimea) to be part of Russia, so I did not use country bias or format string to force GeoNames to only look for Russian cities
#     Results below are close to Wikipedia.  We're good, at least for the big cities.
for bc in big_cities:
  g = geonames_geocode(bc,timeout=10)
  print(g.raw["population"])

10381222
5028000
1419007
1349772
1284164
1074482
497959
587022
416263


In [0]:
g = geonames_geocode('Yakutsk',timeout=10)

In [0]:
print(json.dumps(g.raw))

{"adminCode1": "63", "lng": "129.73306", "geonameId": 2013159, "toponymName": "Yakutsk", "countryId": "2017370", "fcl": "P", "population": 235600, "countryCode": "RU", "name": "Yakutsk", "fclName": "city, village,...", "adminCodes1": {"ISO3166_2": "SA"}, "countryName": "Russia", "fcodeName": "seat of a first-order administrative division", "adminName1": "Sakha", "lat": "62.03389", "fcode": "PPLA"}


In [0]:
g.raw["population"]

235600