## First things first
* Click **File -> Save a copy in Drive** and click **Open in new tab** in the pop-up window to save your progress in Google Drive.
* Click **Runtime -> Change runtime type** and select **GPU** in the Hardware accelerator box to enable faster GPU training.

#**Final Project for Coursera's 'How to Win a Data Science Competition'**
April, 2020

Andreas Theodoulou and Michael Gaidis

(Competition Info last updated:  3 years ago)

##**About this Competition**

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.


##**File descriptions**

***sales_train.csv*** - the training set. Daily historical data from January 2013 to October 2015.

***test.csv*** - the test set. You need to forecast the sales for these shops and products for November 2015.

***sample_submission.csv*** - a sample submission file in the correct format.

***items.csv*** - supplemental information about the items/products.

***item_categories.csv***  - supplemental information about the items categories.

***shops.csv***- supplemental information about the shops.


##**Data fields**

***ID*** - an Id that represents a (Shop, Item) tuple within the test set

***shop_id*** - unique identifier of a shop

***item_id*** - unique identifier of a product

***item_category_id*** - unique identifier of item category

***item_cnt_day*** - number of products sold. You are predicting a monthly amount of this measure

***item_price*** - current price of an item

***date*** - date in format dd.mm.yyyy (for example, 02.01.2013)

***date_block_num*** - a consecutive month number. January 2013 is 0, February 2013 is 1,..., October 2015 is 33

***item_name*** - name of item

***shop_name*** - name of shop

***item_category_name*** - name of item category
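
As a minimal sketch of how these fields relate, the snippet below parses the `date` strings (the loaded file uses dots as separators, e.g. `02.01.2013`), re-derives `date_block_num` from the calendar month, and sums `item_cnt_day` into a monthly total per (shop, item) pair. The toy rows are made up, and the output column name `item_cnt_month` is borrowed from the sample submission; this is an illustration under those assumptions, not the project's final pipeline.

```python
import pandas as pd

# Toy rows in the same layout as sales_train.csv (values are made up)
sales = pd.DataFrame({
    "date": ["02.01.2013", "15.01.2013", "03.02.2013", "10.10.2015"],
    "shop_id": [59, 59, 59, 25],
    "item_id": [22154, 22154, 22154, 2552],
    "item_cnt_day": [1.0, 2.0, 1.0, 3.0],
})

# Dates are day.month.year, so state the format explicitly
sales["date"] = pd.to_datetime(sales["date"], format="%d.%m.%Y")

# Consecutive month index: January 2013 -> 0, ..., October 2015 -> 33
sales["date_block_num"] = (
    (sales["date"].dt.year - 2013) * 12 + (sales["date"].dt.month - 1)
)

# Monthly target: total units sold per (month, shop, item)
monthly = (
    sales.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
print(monthly)
```

Passing `format=` explicitly avoids any day/month ambiguity when pandas parses dates such as `02.01.2013`.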

#**Workflow**

##1. Configure Environment


*   Fork/copy shared ipynb as necessary, to not conflict with teammate
*   Load competition data files
*   Load any utility code files
*   Import libraries



##2. Explore Data


*   Data formatting and translating
*   Descriptive explanations for the competition data
*   Grouping and statistical descriptions of the provided features
*   Data visualizations and correlations
*   Look for signs of data leakage
*   Record initial thoughts on features and models to use



##3. Prepare Data


*   Data formatting and translating (see above)
*   Data cleaning (handling missing entries, outliers, NaNs, ...)
*   Data grouping / Date-related issues / re-cleaning if needed after grouping
*   Data normalization (recheck cleaning & normalizing with data visualizations)
*   Initial feature selection (quick and dirty) and preparation
*   Save data in compressed or pickled format if helpful; use version control
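
The cleaning and saving steps above could look roughly like the sketch below. The toy data, the cap `CNT_CAP = 100`, and the file name `sales_clean.pkl.gz` are arbitrary placeholders chosen for illustration; real thresholds should come from the data exploration in step 2.

```python
import os
import tempfile

import pandas as pd

# Toy data with a missing price, a negative price, and an extreme count (made up)
df = pd.DataFrame({
    "item_price": [999.0, None, -1.0, 899.0, 1099.0],
    "item_cnt_day": [1.0, 2.0, 1.0, 1000.0, 3.0],
})

# 1) Drop rows with missing or non-positive prices
clean = df.dropna(subset=["item_price"])
clean = clean[clean["item_price"] > 0].copy()

# 2) Cap extreme daily counts (the cap value is a hypothetical placeholder)
CNT_CAP = 100
clean["item_cnt_day"] = clean["item_cnt_day"].clip(upper=CNT_CAP)

# 3) Save a compressed pickle so later sessions can skip the cleaning step
out_path = os.path.join(tempfile.gettempdir(), "sales_clean.pkl.gz")
clean.to_pickle(out_path, compression="gzip")

# Reload to confirm the round trip preserves the data
restored = pd.read_pickle(out_path, compression="gzip")
print(restored)
```

In the notebook, the output path would more likely point into the Google Drive repo folder so the file survives a runtime restart.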



##4. Quick Modeling (set up framework for more complex model improvement)


*   Choose and implement a fast and simple approach for train/val data splitting
*   Choose a simple and fast evaluation metric (comparable to Kaggle's metric)
*   Choose a simple, but appropriate, model to use (minimal hyperparameters)
*   Train the model, check for major issues (absolutely horrible performance)
*   Save the model parameters, etc., along with version control
*   Submit model to Kaggle to verify proper formatting of entry
*   Verify that Kaggle test performance is reasonably close to validation metric
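
A minimal sketch of such a quick framework follows: hold out the most recent month for validation, score with RMSE, and start from a trivial baseline. RMSE is used here as an assumed stand-in for the leaderboard metric, and the toy table and mean-prediction baseline are illustrative only.

```python
import numpy as np
import pandas as pd


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy monthly table (made-up values); in the notebook this would be the
# aggregated training data
monthly = pd.DataFrame({
    "date_block_num": [31, 31, 32, 32, 33, 33],
    "item_cnt_month": [1.0, 3.0, 2.0, 2.0, 0.0, 4.0],
})

# Time-based split: the latest month becomes the validation set
val_block = monthly["date_block_num"].max()
train = monthly[monthly["date_block_num"] < val_block]
val = monthly[monthly["date_block_num"] == val_block]

# Simplest possible baseline: predict the training mean for every validation row
baseline_pred = np.full(len(val), train["item_cnt_month"].mean())
val_score = rmse(val["item_cnt_month"], baseline_pred)
print("validation RMSE:", val_score)
```

Splitting by `date_block_num` rather than at random keeps future months out of the training set, which mirrors how the test month follows the training period.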



##5. Refine the Model and the Features


###a) Features


*   Explore the data more deeply for feature correlations and data leaks to exploit
*   Consider complex feature generation based on intuition
*   Save data in compressed or pickled format if helpful for faster future iteration
*   Employ version control on datasets generated with new features / groupings
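
One common kind of generated feature for this sort of monthly data is a lag: last month's sales for the same (shop, item) pair. The sketch below is one possible way to build it with a merge; the helper name `add_lag` and the toy table are hypothetical, and nothing here is claimed to be the feature set the project ends up using.

```python
import pandas as pd

# Toy monthly table (made-up values)
monthly = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 1],
    "shop_id": [5, 5, 5, 6, 6],
    "item_id": [100, 100, 100, 100, 100],
    "item_cnt_month": [1.0, 2.0, 3.0, 4.0, 5.0],
})


def add_lag(df, lag, col="item_cnt_month"):
    """Attach the value of `col` from `lag` months earlier as a new column."""
    keys = ["date_block_num", "shop_id", "item_id"]
    shifted = df[keys + [col]].copy()
    # Move each row `lag` months forward so month t lines up with month t + lag
    shifted["date_block_num"] += lag
    shifted = shifted.rename(columns={col: f"{col}_lag_{lag}"})
    # Left merge keeps every original row; pairs with no history get NaN
    return df.merge(shifted, on=keys, how="left")


with_lag = add_lag(monthly, lag=1)
print(with_lag)
```

A merge on the shifted month index, unlike a plain `groupby(...).shift()`, still gives the right answer when a (shop, item) pair has no row for some month.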

###b) Modeling


*   Look at alternative metrics for training and validation
*   Version control
*   Explore hyperparameter tuning for the initial quick and dirty model
*   Version control
*   Consider other models as time allows
*   Version control
*   Create ensembles as time allows
*   Version control
*   Adjust methods of train/val splitting if desirable and timely
*   Version control







##6. Finalize Model


*   Restart kernel, clean any possible lingering variables
*   Train and tune hyperparameters until you run out of time
*   Submit model



---








#1. Configure Environment

##1a) Load Files
Load the competition data files and import helpful custom code libraries from the **GitHub Kag repo cloned onto Michael's Google Drive**.  
(This is similar to the original template, which loads files from GitHub directly. Cloning the repo onto Google Drive instead makes it possible to run git add/commit/push from within the Colab notebook.)

In [0]:
# Import libraries needed for loading files:
import pandas as pd

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# List file names and paths needed for importing data and helper files

GDRIVE_REPO_PATH = "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag"

%cd "{GDRIVE_REPO_PATH}"

# List of the data files (path relative to master branch top), to be loaded into pandas DataFrames
data_files = [  "readonly/final_project_data/items.csv",
                "readonly/final_project_data/item_categories.csv",
                "readonly/final_project_data/shops.csv",
                "readonly/final_project_data/sample_submission.csv.gz",
                "readonly/final_project_data/sales_train.csv.gz",
                "readonly/final_project_data/test.csv.gz"  ]

# Dict of helper code files, to be loaded into Colab and available for python import
#    key is the path (replace / with . ), and value is the module reference name
#    note that the directory chain from current directory down to the .py file
#      must include a "__init__.py" file (it can be empty)
code_files = {"helper_code.kaggle_utils_at_mg" : "kag_utils"}

/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
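
The `code_files` dict above maps a dotted module path to a short alias. One way such a mapping could be turned into actual imports is `importlib.import_module`. The helper name `load_modules` is hypothetical, and the demo uses a standard-library module so the sketch runs without the repo.

```python
import importlib


def load_modules(module_map, namespace):
    """Import each dotted module path and bind it to its alias in `namespace`."""
    for dotted_path, alias in module_map.items():
        namespace[alias] = importlib.import_module(dotted_path)
    return namespace


# Demo with a standard-library module; the alias "json_dec" is arbitrary
demo = load_modules({"json.decoder": "json_dec"}, {})
print(demo["json_dec"].JSONDecoder)
```

In the notebook, something like `load_modules(code_files, globals())` would presumably make `kag_utils` available, assuming the working directory is the repo root and each package directory contains an `__init__.py` as noted above.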


In [6]:
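# Loop to load the above data files into appropriately-named pandas DataFrames
for path_name in data_files:
  filename = path_name.rsplit("/", 1)[-1]     # e.g. "sales_train.csv.gz"
  data_frame_name = filename.split(".")[0]    # e.g. "sales_train"
  # Bind the DataFrame to a global variable of that name (avoids exec/eval);
  # pandas infers gzip compression from the ".gz" extension
  globals()[data_frame_name] = pd.read_csv(path_name)
  print("Data Frame: " + data_frame_name)
  print(globals()[data_frame_name].head(2))
  print("\n")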


Data Frame: items
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76


Data Frame: item_categories
        item_category_name  item_category_id
0  PC - Гарнитуры/Наушники                 0
1         Аксессуары - PS2                 1


Data Frame: shops
                       shop_name  shop_id
0  !Якутск Орджоникидзе, 56 фран        0
1  !Якутск ТЦ "Центральный" фран        1


Data Frame: sample_submission
   ID  item_cnt_month
0   0             0.5
1   1             0.5


Data Frame: sales_train
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154       999.0           1.0
1  03.01.2013               0       25     2552       899.0           1.0


Data Frame: test
   ID  shop_id  item_id
0   0        5     5037
1   1        5    

##1b) Import Libraries
For now, the libraries are imported directly in the notebook. Later, this could move into a utility helper module in the GitHub repo.

In [0]:
import matplotlib.pyplot as plt
import numpy as np
from itertools import product
import time
from sklearn.linear_model import LinearRegression
import pickle
%matplotlib inline


#2. Explore Data

##2a) Data Formatting and Translating
##2b) Descriptive explanations of data in source files

In [14]:
!git status

On branch master
Your branch is ahead of 'origin/master' by 2 commits.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean


##2c) Grouping and statistical descriptions of the provided features

Next:
*  Data visualizations and correlations
*  Look for signs of data leakage
*  Record initial thoughts on features and models to use
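
As a rough sketch of what the grouping step might involve, the snippet below computes per-shop summary statistics and, since the list of shops and products changes from month to month, flags test (shop, item) pairs that never appear in the training data. All values are toy data, and the column name `seen_in_train` is made up for illustration.

```python
import pandas as pd

# Toy stand-ins for sales_train and test (values are made up)
train = pd.DataFrame({
    "shop_id": [5, 5, 6, 6],
    "item_id": [100, 101, 100, 102],
    "item_cnt_day": [1.0, 2.0, 1.0, 5.0],
})
test = pd.DataFrame({"shop_id": [5, 6, 7], "item_id": [100, 103, 100]})

# Per-shop summary statistics of the daily sales counts
shop_stats = train.groupby("shop_id")["item_cnt_day"].agg(["count", "mean", "max"])
print(shop_stats)

# Flag test (shop, item) pairs that never occur in the training data
train_pairs = train[["shop_id", "item_id"]].drop_duplicates()
coverage = test.merge(train_pairs, on=["shop_id", "item_id"], how="left", indicator=True)
coverage["seen_in_train"] = coverage["_merge"] == "both"
print(coverage[["shop_id", "item_id", "seen_in_train"]])
```

The share of unseen pairs gives an early hint of how much the model will have to rely on shop-level or item-level information rather than pair history.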