## First things first
* Click **File -> Save a copy in Drive** and click **Open in new tab** in the pop-up window to save your progress in Google Drive.
* Click **Runtime -> Change runtime type** and select **GPU** in the **Hardware accelerator** box to enable faster GPU training.

#**Final Project for Coursera's 'How to Win a Data Science Competition'**
April, 2020

Andreas Theodoulou and Michael Gaidis

(Competition Info last updated:  3 years ago)

##**About this Competition**

You are provided with **daily** historical sales data. The task is to forecast the total amount of each product sold in every shop for the test set (the **month** of November, 2015), that is, one monthly total per (shop, item) pair listed in the test file. Note that the list of shops(!) and products *changes slightly every month*. Creating a robust model that can handle such situations is part of the challenge.
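
Because the set of shops and products shifts from month to month, it may help to measure how much of the test set was never seen in training. The following is a minimal sketch, assuming the (shop_id, item_id) pairs are available as plain tuples; `coverage_report` and the toy values are hypothetical and not taken from the competition files.

```python
def coverage_report(train_pairs, test_pairs):
    """Split test (shop_id, item_id) pairs by whether they appear in training."""
    train_pairs = set(train_pairs)
    train_items = {item for _, item in train_pairs}
    seen, new_pair, new_item = [], [], []
    for shop, item in test_pairs:
        if (shop, item) in train_pairs:
            seen.append((shop, item))
        elif item in train_items:
            new_pair.append((shop, item))   # item known, but never sold in this shop
        else:
            new_item.append((shop, item))   # item never seen in training at all
    return {"seen": seen, "new_pair": new_pair, "new_item": new_item}

# Toy data (made-up values)
toy_train = [(5, 100), (5, 101), (6, 100)]
toy_test = [(5, 100), (6, 101), (6, 999)]
report = coverage_report(toy_train, toy_test)
print({name: len(pairs) for name, pairs in report.items()})
```

Pairs in the `new_item` group have no sales history at all, so they would probably need a different treatment than pairs with history.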


##**File descriptions**

***sales_train.csv*** - the training set. Daily historical data from January 2013 to October 2015.

***test.csv*** - the test set. You need to forecast the sales for these shops and products for November 2015.

***sample_submission.csv*** - a sample submission file in the correct format (two columns: `ID`, the test-set row identifier, and `item_cnt_month`, the predicted number of products sold in Nov. 2015)

***items.csv*** - item names, their corresponding item_categories IDs, and item IDs to link with the other files

***item_categories.csv***  - item category names and corresponding IDs to link with the other files

***shops.csv*** - shop names and corresponding IDs to link with the other files


##**Data fields**

***ID*** - an Id that represents a (Shop, Item) tuple within the test set

***shop_id*** - unique identifier of a shop

***item_id*** - unique identifier of a product

***item_category_id*** - unique identifier of item category

***item_cnt_day*** - number of products sold. You are predicting a monthly amount of this measure

***item_price*** - current price of an item

***date*** - date in format dd.mm.yyyy

***date_block_num*** - a consecutive month number. January 2013 is 0, February 2013 is 1,..., October 2015 is 33

***item_name*** - name of item

***shop_name*** - name of shop

***item_category_name*** - name of item category
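
To make the relationship between `date`, `date_block_num`, and the monthly target concrete, here is a minimal sketch that derives the month index from a day-first date and sums `item_cnt_day` into a monthly total. The toy rows are made up, and the output column name `item_cnt_month` is borrowed from the sample submission; this is one possible aggregation, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Toy rows shaped like sales_train.csv (values are made up)
toy_sales = pd.DataFrame({
    "date": ["02.01.2013", "15.01.2013", "03.02.2013", "10.10.2015"],
    "shop_id": [59, 59, 59, 25],
    "item_id": [22154, 22154, 22154, 2552],
    "item_cnt_day": [1.0, 2.0, 1.0, 3.0],
})

# Parse day-first dates and derive the month index (January 2013 -> 0)
dates = pd.to_datetime(toy_sales["date"], format="%d.%m.%Y")
toy_sales["date_block_num"] = (dates.dt.year - 2013) * 12 + (dates.dt.month - 1)

# Sum daily counts into one monthly total per (month, shop, item)
monthly = (
    toy_sales
    .groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
print(monthly)
```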

#**Workflow**

##1. Configure Environment


*   Fork/copy shared ipynb as necessary, to not conflict with teammate
*   Load competition data files
*   Load any utility code files
*   Import libraries



##2. Explore Data


*   Data formatting and translating
*   Descriptive explanations for the competition data
*   Grouping and statistical descriptions of the provided features
*   Data visualizations and correlations
*   Look for signs of data leakage
*   Record initial thoughts on features and models to use



##3. Prepare Data


*   Data formatting and translating (see above)
*   Data cleaning (--> handling missing entries, outliers, NaNs, ...)
*   Data grouping / Date-related issues / re-cleaning if needed after grouping
*   Data normalization (recheck cleaning & normalizing with data visualizations)
*   Initial feature selection (quick and dirty) and preparation
*   Save data in compressed or pickled format if helpful; use version control
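
As a rough illustration of the cleaning and saving steps above, the sketch below drops rows with non-positive prices, caps extreme daily counts, and round-trips the result through a gzip-compressed pickle. The thresholds, file name, and toy values are all hypothetical; suitable limits would have to come from the data exploration.

```python
import os
import tempfile
import pandas as pd

# Toy rows (made-up values) with one invalid price and one extreme count
toy = pd.DataFrame({
    "item_price": [999.0, 899.0, -1.0, 1500.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 2500.0],
})

# Drop rows with non-positive prices, then cap extreme daily counts
PRICE_MIN, CNT_MAX = 0.0, 1000.0   # hypothetical thresholds
cleaned = toy[toy["item_price"] > PRICE_MIN].copy()
cleaned["item_cnt_day"] = cleaned["item_cnt_day"].clip(upper=CNT_MAX)

# Save to a gzip-compressed pickle and read it back
path = os.path.join(tempfile.mkdtemp(), "sales_clean_v1.pkl.gz")
cleaned.to_pickle(path, compression="gzip")
restored = pd.read_pickle(path, compression="gzip")
print(restored.equals(cleaned))
```

Putting a version tag in the file name (here `v1`) is one simple way to keep track of which cleaning rules produced which dataset.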



##4. Quick Modeling (set up framework for more complex model improvement)


*   Choose and implement a fast and simple approach for train/val data splitting
*   Choose a simple and fast evaluation metric (comparable to Kaggle's metric)
*   Choose a simple, but appropriate, model to use (minimal hyperparameters)
*   Train the model, check for major issues (absolutely horrible performance)
*   Save the model parameters, etc., along with version control
*   Submit model to Kaggle to verify proper formatting of entry
*   Verify that Kaggle test performance is reasonably close to validation metric
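
The quick-modeling steps above could be wired together roughly as follows. This is only a sketch under stated assumptions: the most recent month is held out for validation, the "model" is a previous-month baseline, and RMSE is assumed as the evaluation metric (the text above does not name the metric). The toy table is made up.

```python
import numpy as np
import pandas as pd

# Toy monthly table (made-up values): two shop/item pairs over four months
monthly = pd.DataFrame({
    "date_block_num": [30, 31, 32, 33] * 2,
    "shop_id": [5] * 4 + [6] * 4,
    "item_id": [100] * 8,
    "item_cnt_month": [2.0, 3.0, 1.0, 2.0, 0.0, 1.0, 1.0, 3.0],
})

# Time-based split: hold out the most recent month for validation
val_month = monthly["date_block_num"].max()
train = monthly[monthly["date_block_num"] < val_month]
val = monthly[monthly["date_block_num"] == val_month]

# Baseline "model": predict each pair's sales from the previous month
prev = train[train["date_block_num"] == val_month - 1]
pred = val.merge(prev, on=["shop_id", "item_id"], how="left", suffixes=("", "_prev"))
y_pred = pred["item_cnt_month_prev"].fillna(0.0).to_numpy()
y_true = pred["item_cnt_month"].to_numpy()

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse(y_true, y_pred))
```

A baseline like this gives a reference score that any real model should beat, and it exercises the split and metric code before more expensive training runs.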



##5. Refine the Model and the Features


###a) Features


*   Explore the data more deeply for feature correlations and data leaks to exploit
*   Consider complex feature generation based on intuition
*   Save data in compressed or pickled format if helpful for faster future iteration
*   Employ version control on datasets generated with new features / groupings
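
One common kind of generated feature for monthly sales data is a lagged target: the sales of the same shop/item pair some months earlier. The sketch below is an assumption about what such a feature could look like, not a method prescribed by this workflow; `add_lag` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

# Toy monthly table; shop 6 has no row for month 1
monthly = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 2],
    "shop_id": [5, 5, 5, 6, 6],
    "item_id": [100] * 5,
    "item_cnt_month": [2.0, 3.0, 1.0, 4.0, 5.0],
})

def add_lag(df, lag, col="item_cnt_month"):
    """Attach the value of `col` from `lag` months earlier for the same shop/item."""
    shifted = df[["date_block_num", "shop_id", "item_id", col]].copy()
    shifted["date_block_num"] += lag               # align month t-lag with month t
    shifted = shifted.rename(columns={col: f"{col}_lag_{lag}"})
    return df.merge(shifted, on=["date_block_num", "shop_id", "item_id"], how="left")

with_lag = add_lag(monthly, lag=1)
print(with_lag)
```

Joining on the shifted month index, rather than shifting rows within a group, leaves the lag empty when the earlier month is missing instead of silently picking up an older month.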

###b) Modeling


*   Look at alternative metrics for training and validation
*   Version control
*   Explore hyperparameter tuning for the initial quick and dirty model
*   Version control
*   Consider other models as time allows
*   Version control
*   Create ensembles as time allows
*   Version control
*   Adjust methods of train/val splitting if desirable and timely
*   Version control







##6. Finalize Model


*   Restart kernel, clean any possible lingering variables
*   Train and tune hyperparameters until you run out of time
*   Submit model



---








#1. Configure Environment

##1a) Load Files
Load competition data files and import helpful custom code libraries from the **GitHub Kag repo cloned onto Michael's Google Drive**  
(similar to the original template, which loads files from GitHub directly; with the repo cloned onto my Google Drive, I can add/commit/push etc. from within the Colab notebook)

In [0]:
# Import libraries needed for loading files:
import pandas as pd

In [17]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [18]:
# List file names and paths needed for importing data and helper files

GDRIVE_REPO_PATH = "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag"

%cd "{GDRIVE_REPO_PATH}"

# List of the data files (path relative to master branch top), to be loaded into pandas DataFrames
data_files = [  "readonly/final_project_data/items.csv",
                "readonly/final_project_data/item_categories.csv",
                "readonly/final_project_data/shops.csv",
                "readonly/final_project_data/sample_submission.csv.gz",
                "readonly/final_project_data/sales_train.csv.gz",
                "readonly/final_project_data/test.csv.gz"  ]

# Dict of helper code files, to be loaded into Colab and available for python import
#    key is the path (replace / with . ), and value is the module reference name
#    note that the directory chain from current directory down to the .py file
#      must include a "__init__.py" file (it can be empty)
code_files = {"helper_code.kaggle_utils_at_mg" : "kag_utils"}

/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
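
The `code_files` dict maps a dotted module path to the name it should be imported under. One way such a mapping could be consumed is sketched below with `importlib`; `load_helper_modules` is a hypothetical helper, and a standard-library module stands in for the helper file so the example runs anywhere.

```python
import importlib

def load_helper_modules(code_files):
    """Import each dotted module path and return {alias: module}."""
    modules = {}
    for dotted_path, alias in code_files.items():
        modules[alias] = importlib.import_module(dotted_path)
    return modules

# Demonstration with a standard-library module standing in for the helper file
demo = load_helper_modules({"os.path": "osp"})
print(demo["osp"].basename("readonly/final_project_data/items.csv"))
```

In the notebook, `globals().update(load_helper_modules(code_files))` would then make `kag_utils` available by name, provided the working directory is the repo root and the `__init__.py` files are in place.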


In [19]:
# Loop to load the above data files into appropriately-named pandas DataFrames
import os

for path_name in data_files:
  filename = os.path.basename(path_name)
  data_frame_name = filename.split(".")[0]     # e.g. "sales_train.csv.gz" -> "sales_train"
  # Bind the DataFrame to a notebook-level variable without resorting to exec/eval
  globals()[data_frame_name] = pd.read_csv(path_name)
  print("Data Frame: " + data_frame_name)
  print(globals()[data_frame_name].head(2))
  print("\n")


Data Frame: items
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76


Data Frame: item_categories
        item_category_name  item_category_id
0  PC - Гарнитуры/Наушники                 0
1         Аксессуары - PS2                 1


Data Frame: shops
                       shop_name  shop_id
0  !Якутск Орджоникидзе, 56 фран        0
1  !Якутск ТЦ "Центральный" фран        1


Data Frame: sample_submission
   ID  item_cnt_month
0   0             0.5
1   1             0.5


Data Frame: sales_train
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154       999.0           1.0
1  03.01.2013               0       25     2552       899.0           1.0


Data Frame: test
   ID  shop_id  item_id
0   0        5     5037
1   1        5    

##1b) Import Libraries
For now, just import libraries in the ipynb notebook here.  Perhaps later put this in a utility helper function in GitHub.

In [0]:
!pip install googletrans

In [0]:
import matplotlib.pyplot as plt
import numpy as np
from itertools import product
import time
from sklearn.linear_model import LinearRegression
import pickle
%matplotlib inline

from googletrans import Translator

In [0]:
translator = Translator()

In [28]:
rus_txt = shops.loc[0, 'shop_name']   # select by row label and column name rather than by position
print(rus_txt)
translated = translator.translate(rus_txt, src='ru', dest='en')
print(translated.text)

!Якутск Орджоникидзе, 56 фран
! Yakutsk Ordzhonikidze, 56 Franc


In [34]:
test

            ID  shop_id  item_id
0            0        5     5037
1            1        5     5320
2            2        5     5233
3            3        5     5232
4            4        5     5268
...        ...      ...      ...
214195  214195       45    18454
214196  214196       45    16188
214197  214197       45    15757
214198  214198       45    19648
214199  214199       45      969

[214200 rows x 3 columns]

In [35]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   ID       214200 non-null  int64
 1   shop_id  214200 non-null  int64
 2   item_id  214200 non-null  int64
dtypes: int64(3)
memory usage: 4.9 MB


In [29]:
items

                                               item_name  ...  item_category_id
0              ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D  ...                40
1      !ABBYY FineReader 12 Professional Edition Full...  ...                76
2          ***В ЛУЧАХ СЛАВЫ   (UNV)                    D  ...                40
3        ***ГОЛУБАЯ ВОЛНА  (Univ)                      D  ...                40
4            ***КОРОБКА (СТЕКЛО)                       D  ...                40
...                                                  ...  ...               ...
22165             Ядерный титбит 2 [PC, Цифровая версия]  ...                31
22166    Язык запросов 1С:Предприятия  [Цифровая версия]  ...                54
22167  Язык запросов 1С:Предприятия 8 (+CD). Хрустале...  ...                49
22168                                Яйцо для Little Inu  ...                62
22169                      Яйцо дракона (Игра престолов)  ...                69

[2217

In [30]:
item_categories

           item_category_name  item_category_id
0     PC - Гарнитуры/Наушники                 0
1            Аксессуары - PS2                 1
2            Аксессуары - PS3                 2
3            Аксессуары - PS4                 3
4            Аксессуары - PSP                 4
..                        ...               ...
79                  Служебные                79
80         Служебные - Билеты                80
81    Чистые носители (шпиль)                81
82  Чистые носители (штучные)                82
83           Элементы питания                83

[84 rows x 2 columns]

In [0]:
item_categories['En_Name'] = item_categories.item_category_name.apply(lambda x: translator.translate(x, src='ru', dest='en').text)

In [122]:
item_categories

           item_category_name  item_category_id                     En_Name
0     PC - Гарнитуры/Наушники                 0  PC - Headsets / Headphones
1            Аксессуары - PS2                 1           Accessories - PS2
2            Аксессуары - PS3                 2           Accessories - PS3
3            Аксессуары - PS4                 3           Accessories - PS4
4            Аксессуары - PSP                 4           Accessories - PSP
..                        ...               ...                         ...
79                  Служебные                79                System Tools
80         Служебные - Билеты                80         Utilities - Tickets
81    Чистые носители (шпиль)                81        Net carriers (spire)
82  Чистые носители (штучные)                82        Net carriers (piece)


In [0]:
shops['En_Name'] = shops.shop_name.apply(lambda x: translator.translate(x, src='ru', dest='en').text)

In [33]:
shops

                                          shop_name  ...                                         En_Name
0                     !Якутск Орджоникидзе, 56 фран  ...               ! Yakutsk Ordzhonikidze, 56 Franc
1                     !Якутск ТЦ "Центральный" фран  ...                    ! Yakutsk TC "Central" Franc
2                                  Адыгея ТЦ "Мега"  ...                                Adygea TC "Mega"
3                    Балашиха ТРК "Октябрь-Киномир"  ...                Balashikha TRC "October-Kinomir"
4                          Волжский ТЦ "Волга Молл"  ...                      Volzhsky mall "Volga Mall"
5                            Вологда ТРЦ "Мармелад"  ...                         Vologda SEC "Marmalade"
6                        Воронеж (Плехановская, 13)  ...                   Voronezh (Plekhanovskaya, 13)
7                            Воронеж ТРЦ "Максимир"  ...                         Voronezh SEC "Maksimir"
8                    

SEC = Shopping and Entertainment Center (ТРЦ; like a large shopping mall with a cinema)

SC / TC = Shopping Center (ТЦ)

TRC = Shopping and Entertainment Complex (ТРК; much the same kind of venue as an SEC)

It looks like nearly all of these shops are in big shopping malls, so there is probably little to gain from featurizing on the mall name itself. One possible exception: SECs may show different sales trends than plain shopping malls. Perhaps we create a feature for the type of outlet: an online shop, a small store, a huge entertainment complex, ...
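
A first pass at such a shop-type feature could key off the venue abbreviation in the original Russian shop name. The sketch below is an assumption-laden starting point: the codes follow the scheme used further down in this notebook (2 = small shop, 3 = shopping mall, 4 = shopping/entertainment complex), while the abbreviation-to-code mapping and the fallback value are guesses that would need manual review.

```python
# Assumed mapping from venue abbreviation to shop-type code; adjust after review
VENUE_CODES = {"ТЦ": 3, "ТРЦ": 4, "ТРК": 4}
DEFAULT_TYPE = 2  # hypothetical fallback: treat anything else as a small shop

def guess_shop_type(shop_name):
    """Return a shop-type code based on the first venue abbreviation found."""
    for token in shop_name.split():
        if token in VENUE_CODES:
            return VENUE_CODES[token]
    return DEFAULT_TYPE

for name in ['Адыгея ТЦ "Мега"', 'Вологда ТРЦ "Мармелад"', 'Воронеж (Плехановская, 13)']:
    print(name, "->", guess_shop_type(name))
```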


Most of the shops have city names associated with them.  Unfortunately, I think the city name has to be pulled from the translated shop name manually (and a few of the "shops" are not associated with cities at all).  Fortunately, there aren't too many of them.  I'll work on this.
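
Part of that manual work might be reduced by taking the first word of the translated name as a city guess and correcting the exceptions by hand. The sketch below assumes that approach; `guess_city` is a hypothetical helper, and the two overrides mirror choices made in the city list further down in this notebook. Multi-word city names would still need manual entries.

```python
import re

# Manual overrides for names whose first word is not the desired city
CITY_OVERRIDES = {"Adygea": "Krasnodar", "Volzhsky": "Volgograd"}

def guess_city(en_shop_name):
    """Take the first alphabetic word of a translated shop name as the city guess."""
    match = re.search(r"[A-Za-z][A-Za-z\-]*", en_shop_name)
    if match is None:
        return None
    word = match.group(0)
    return CITY_OVERRIDES.get(word, word)

for name in ['! Yakutsk Ordzhonikidze, 56 Franc', 'Adygea TC "Mega"', 'Voronezh (Plekhanovskaya, 13)']:
    print(name, "->", guess_city(name))
```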


Next, we can automate feature generation from the city name.  This shouldn't take long, except that these geo services tend to throttle the rate at which you can send them queries (unless you pay $$).  The geopy package has a rate limiter that is typically set to one query per second; for 60 shops, that's not too bad.
<br><br>

The GeoNames service can give us the **population** of the city surrounding the shop.  This could be a potential feature.
<br><br>
Latitude and longitude of the shops are likely too fine-grained to use without overfitting our model.  Instead, we can generate a feature based on the Russian **Federal District**, as retrieved with geopy's Nominatim geocoding service.  Due to religious preferences, for example, a given region may or may not show higher sales in November (before Christmas).  The map below shows how I believe Nominatim would categorize the locations:

<img src="https://ermakvagus.com/Europe/Russia/Map_of_Russia.png">

Below, I have started to manually refine the city names.  It's a bit painful entering them into Python like this.  I think on Sunday I will export to an MS Excel file and try to create a new "shops_v2" csv file with columns for city name and shop type.
<br><br>
Beneath that cell, I've worked out how to get the geolocation and population data from the list of city names.  I'll apply this feature generation to the new shops dataframe after I've finished refining the city names.

In [0]:
city_names = ["Yakutsk","Yakutsk","Krasnodar","Balashikha","Volgograd","Vologda","Voronezh"]
city_locations = []
# type 0 = internet, type 1 = traveling salesman (itinerant trade), type 2 = small shop, type 3 = shopping mall, type 4 = SEC shopping entertainment complex
shop_type = [3,3,3,3,3,4,]

In [36]:
# Geocoding library 
!pip install geopy



In [0]:
from geopy.geocoders import Nominatim
from geopy.geocoders import GeoNames
from geopy.extra.rate_limiter import RateLimiter
import json

In [0]:
# Note: geopy >= 2.0 no longer accepts format_string (or country_bias in the GeoNames constructor),
#   so the ", Russia" suffix and the country bias are applied per query instead
nominatum_service = Nominatim(timeout=10, user_agent="mgaidis@yahoo.com")
nominatum_geocode = RateLimiter(lambda query, **kwargs: nominatum_service.geocode(query + ", Russia", **kwargs), min_delay_seconds=1)
geonames_service = GeoNames(username='gaidis', timeout=10, user_agent="mgaidis@yahoo.com")  # be sure to enable free web services when creating geonames account
geonames_geocode = RateLimiter(lambda query, **kwargs: geonames_service.geocode(query + ", Russia", country_bias='RU', **kwargs), min_delay_seconds=1)

In [95]:
location = nominatum_geocode('Yakutsk', language="en")    #, country_codes='ru')
print(json.dumps(location.raw))
print(location.latitude, location.longitude)
print(location.address.split(",")[-2].strip())
location

{"place_id": 235844195, "licence": "Data \u00a9 OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright", "osm_type": "relation", "osm_id": 5325621, "boundingbox": ["61.9489433", "62.1463495", "129.5631302", "129.8269626"], "lat": "62.027287", "lon": "129.732086", "display_name": "Yakutsk, Yakutsk Urban District, Sakha Republic, Far Eastern Federal District, Russia", "class": "place", "type": "city", "importance": 0.7441035948640713, "icon": "https://nominatim.openstreetmap.org/images/mapicons/poi_place_city.p.20.png"}
62.027287 129.732086
Far Eastern Federal District


Location(Yakutsk, Yakutsk Urban District, Sakha Republic, Far Eastern Federal District, Russia, (62.027287, 129.732086, 0.0))

In [96]:
location = nominatum_geocode('Voronezh', addressdetails=True, language="en")    #, country_codes='ru')
print(location.latitude, location.longitude)
print(location.address.split(",")[-2].strip())
location

51.6605982 39.2005858
Central Federal District


Location(Voronezh, Voronezh Oblast, Central Federal District, Russia, (51.6605982, 39.2005858, 0.0))
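
Taking the second-to-last address component works for both results above, but it relies on the district always sitting in that position. A slightly more defensive sketch, assuming English-language addresses like the two shown, searches for the component that ends in "Federal District"; `federal_district` is a hypothetical helper.

```python
def federal_district(address):
    """Return the address component that names a federal district, or None."""
    for part in address.split(","):
        part = part.strip()
        if part.endswith("Federal District"):
            return part
    return None

# The two addresses returned by Nominatim above
yakutsk = "Yakutsk, Yakutsk Urban District, Sakha Republic, Far Eastern Federal District, Russia"
voronezh = "Voronezh, Voronezh Oblast, Central Federal District, Russia"
print(federal_district(yakutsk))
print(federal_district(voronezh))
```

Returning `None` when no district is found makes it easy to spot cities that need manual attention.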

In [0]:
g = geonames_geocode('Yakutsk',timeout=10)

In [119]:
print(json.dumps(g.raw))

{"adminCode1": "63", "lng": "129.73306", "geonameId": 2013159, "toponymName": "Yakutsk", "countryId": "2017370", "fcl": "P", "population": 235600, "countryCode": "RU", "name": "Yakutsk", "fclName": "city, village,...", "adminCodes1": {"ISO3166_2": "SA"}, "countryName": "Russia", "fcodeName": "seat of a first-order administrative division", "adminName1": "Sakha", "lat": "62.03389", "fcode": "PPLA"}


In [120]:
g.raw["population"]

235600
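
Since several shops share a city and the geo services are rate-limited, it may be worth querying each distinct city only once. The sketch below assumes a `lookup` callable that returns a dict of features for a city name; a stub is used here in place of the real Nominatim/GeoNames calls, and it returns only the Yakutsk values printed above. `build_city_features` and `fake_lookup` are hypothetical names.

```python
def build_city_features(city_names, lookup):
    """Call `lookup` once per distinct city and return one feature dict per input row."""
    cache = {}
    rows = []
    for city in city_names:
        if city not in cache:
            cache[city] = lookup(city)
        rows.append({"city": city, **cache[city]})
    return rows

# Stub lookup standing in for the rate-limited geocoding calls
calls = []
def fake_lookup(city):
    calls.append(city)
    known = {"Yakutsk": {"population": 235600, "district": "Far Eastern Federal District"}}
    return known.get(city, {"population": None, "district": None})

features = build_city_features(["Yakutsk", "Yakutsk", "Voronezh"], fake_lookup)
print(len(calls), features[0])
```

With a one-second delay per query, caching in this way keeps the total wait proportional to the number of distinct cities rather than the number of shops.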

#2. Explore Data

##2a) Data Formatting and Translating
##2b) Descriptive explanations of data in source files

In [14]:
!git status

On branch master
Your branch is ahead of 'origin/master' by 2 commits.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean


##2c) Grouping and statistical descriptions of the provided features

Next:
*  Data visualizations and correlations
*  Look for signs of data leakage
*  Record initial thoughts on features and models to use