In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Frame the problem and look at the big picture

## The objective: what needs to be done?
The task is to forecast, for the test set, the total number of each product sold in every shop.

! Note that the list of shops and products changes slightly every month.

## Frame this problem
* A typical supervised learning task: we have labeled training examples.
* A typical regression task: we are asked to predict a value.
    * A multiple regression problem: the system uses multiple features to make a prediction.
    * Also a univariate regression problem: we predict a single value (*total amount of products sold*) for every *shop*.
* Plain batch learning: there is no continuous flow of data coming into the system, the model does not need to adapt to rapidly changing data, and the data is small enough to fit in memory (`is it so?`).

## How should performance be measured?
`todo`


In [None]:
import os
import csv
import numpy as np

ifile = os.path.abspath(os.path.join('input', 'sales_train.csv'))
rows = []
# newline='' is the mode the csv module recommends for files passed to csv.reader
with open(ifile, newline='') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        rows.append(row)

# f-strings format the numbers directly, so no str() conversion is needed
print(f"csv row count (including header): {readCSV.line_num}")
print(f"size of rows converted to a NumPy array: {np.array(rows).nbytes} bytes")

During this run, Python uses about 2.4 GB of additional RAM.
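A likely reason for the high memory use is that `csv.reader` returns every field as a string, and NumPy stores a string array as fixed-width Unicode sized to the longest cell. The sketch below illustrates this on three rows taken from the `head()` output; the compact dtypes in `compact_dtypes` are an assumption about the value ranges, not something the dataset documentation guarantees.

```python
import csv
import io

import numpy as np
import pandas as pd

# Tiny stand-in for sales_train.csv, using rows from the head() output.
csv_text = (
    "date,date_block_num,shop_id,item_id,item_price,item_cnt_day\n"
    "02.01.2013,0,59,22154,999.0,1.0\n"
    "03.01.2013,0,25,2552,899.0,1.0\n"
    "05.01.2013,0,25,2552,899.0,-1.0\n"
)

# csv.reader yields strings; NumPy stores each cell as fixed-width Unicode
# sized to the longest string in the array (4 bytes per character).
rows = list(csv.reader(io.StringIO(csv_text)))[1:]  # skip the header row
as_strings = np.array(rows)

# Assumed compact dtypes; verify the real value ranges before relying on them.
compact_dtypes = {
    "date_block_num": "int8",
    "shop_id": "int8",
    "item_id": "int16",
    "item_price": "float32",
    "item_cnt_day": "float32",
}
typed = pd.read_csv(io.StringIO(csv_text), dtype=compact_dtypes)
numeric_bytes = int(typed.drop(columns="date").memory_usage(index=False).sum())

print(f"string array: dtype={as_strings.dtype}, {as_strings.nbytes} bytes")
print(f"typed numeric columns: {numeric_bytes} bytes")
```

On the full file the same effect is multiplied by millions of rows, which is why loading with typed columns is usually much lighter than a list of string rows.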

In [1]:
import os
import csv
import numpy as np
import pandas as pd

sales_train = pd.read_csv(os.path.abspath(os.path.join('input', 'sales_train.csv')))
sales_train.head()

| | date | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|---|
| 0 | 02.01.2013 | 0 | 59 | 22154 | 999.0 | 1.0 |
| 1 | 03.01.2013 | 0 | 25 | 2552 | 899.0 | 1.0 |
| 2 | 05.01.2013 | 0 | 25 | 2552 | 899.0 | -1.0 |
| 3 | 06.01.2013 | 0 | 25 | 2554 | 1709.05 | 1.0 |
| 4 | 15.01.2013 | 0 | 25 | 2555 | 1099.0 | 1.0 |


# Get the data

## Input files

* shops.csv - supplemental information about the shops -- `61 entries`
    * shop_name (e.g., "СПб ТК ""Сенная""")
    * shop_id (e.g., 43)

* item_categories.csv - supplemental information about the item categories -- `85 entries`
    * item_category_name (e.g., Кино - DVD)
    * item_category_id (e.g., 40)

* items.csv - supplemental information about the items/products -- `22,171 entries`
    * item_name (e.g., 1812: 4 СЕРИИ (регион))
    * item_id (e.g., 97)
    * item_category_id (e.g., 40)

* sales_train.csv - the training set. Daily historical data from January 2013 to October 2015 -- `2,935,850 entries | 587,170 entries should be allotted to the training set`
    * <strike>date (e.g., 23.02.2013)</strike> *I see no reason to use this for ML training because date_block_num is already available as an attribute*
    * date_block_num (e.g., 1)
    * shop_id (e.g., 43) - `shop_id and item_id should be concatenated into ID`
    * item_id (e.g., 97) - `shop_id and item_id should be concatenated into ID`
    * item_price (e.g., 149.0)
    * item_cnt_day (e.g., 1.0)

* sample_submission.csv - a sample submission file in the correct format -- `214,201 entries`
    * ID (e.g., 0)
    * item_cnt_month (e.g., 0.5)

* test.csv - the test set. You need to forecast the sales for these shops and products for November 2015 -- `214,201 entries`
    * ID (e.g., 0)
    * shop_id (e.g., 43)
    * item_id (e.g., 97)
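
One possible way to attach the test-set `ID` to training rows is a left merge on the `(shop_id, item_id)` pair. This is a minimal sketch on made-up rows, not the notebook's chosen method; the frames `test` and `sales` below are illustrative stand-ins for test.csv and sales_train.csv.

```python
import pandas as pd

# Illustrative stand-in for test.csv: one ID per (shop_id, item_id) pair.
test = pd.DataFrame({
    "ID": [0, 1, 2],
    "shop_id": [5, 5, 4],
    "item_id": [5037, 5320, 5037],
})

# Illustrative stand-in for sales_train.csv (values are made up).
sales = pd.DataFrame({
    "shop_id": [5, 5, 4, 7],
    "item_id": [5037, 5037, 5037, 11],
    "item_cnt_day": [1.0, 2.0, 1.0, 3.0],
})

# A left merge keeps every sales row; pairs absent from the test set get NaN.
with_id = sales.merge(test, on=["shop_id", "item_id"], how="left")
unmatched = int(with_id["ID"].isna().sum())

print(with_id)
print(f"sales rows without a test ID: {unmatched}")
```

Because the list of shops and products changes every month, some training pairs will have no `ID`, and some test pairs may have no sales history at all.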

## Data fields
* ID - an Id that represents a (Shop, Item) tuple within the test set
* shop_id - unique identifier of a shop
* item_id - unique identifier of a product
* item_category_id - unique identifier of item category
* item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
* item_price - current price of an item
* date - date in format dd.mm.yyyy (e.g., 23.02.2013)
* date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
* item_name - name of item
* shop_name - name of shop
* item_category_name - name of item category
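
Since the target is a monthly total of `item_cnt_day`, the daily rows need to be aggregated per month, shop, and item. The sketch below does this on the five rows shown in the `head()` output and also checks that `date_block_num` can be derived from `date`. The output column name `item_cnt_month` follows sample_submission.csv; the grouping itself is an assumption about how the target should be built.

```python
import pandas as pd

# The five rows shown in the head() output above.
sales = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "05.01.2013", "06.01.2013", "15.01.2013"],
    "date_block_num": [0, 0, 0, 0, 0],
    "shop_id": [59, 25, 25, 25, 25],
    "item_id": [22154, 2552, 2552, 2554, 2555],
    "item_cnt_day": [1.0, 1.0, -1.0, 1.0, 1.0],
})

# date_block_num counts months from January 2013 (block 0).
parsed = pd.to_datetime(sales["date"], format="%d.%m.%Y")
block_from_date = (parsed.dt.year - 2013) * 12 + (parsed.dt.month - 1)

# Sum daily counts into one monthly value per (month, shop, item).
monthly = (
    sales.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

print(monthly)
```

Note that item 2552 in shop 25 sums to zero here because one of its daily counts is negative.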

## Data format
`todo`

In [None]:
# Sample a test set, put it aside, and never look at it (no data snooping!)
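
A minimal sketch of one way to set a hold-out set aside, assuming a time-ordered split on `date_block_num` (the notebook has not yet decided how to split; a random sample is the other obvious option). The frame and the name `n_holdout_blocks` are illustrative.

```python
import pandas as pd

# Illustrative frame: two sales rows in each of five monthly blocks.
sales = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "item_cnt_day": [1.0, 2.0, 1.0, 1.0, 3.0, 1.0, 2.0, 1.0, 1.0, 4.0],
})

# Hypothetical choice: hold out the most recent block(s), so the hold-out set
# lies strictly after everything the model is trained on.
n_holdout_blocks = 1
cutoff = sales["date_block_num"].max() - n_holdout_blocks

train = sales[sales["date_block_num"] <= cutoff]
holdout = sales[sales["date_block_num"] > cutoff]

print(f"train rows: {len(train)}, hold-out rows: {len(holdout)}")
```

Splitting by time mirrors the actual task, where the model predicts a month that follows all of the training data.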


# Explore the Data

* Study each attribute and its characteristics:
    * Name
    * Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
    * % of missing values
    * Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    * Possibly useful for the task?
    * Type of distribution (Gaussian, uniform, logarithmic, etc.)
* For supervised learning tasks, identify the target attribute(s).
* Visualize the data.
* Study the correlations between attributes.
* Study how you would solve the problem manually.
* Identify the promising transformations you may want to apply.
* Identify extra data that would be useful (go back to “Get the Data”).
* Document what you have learned.
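
The first few checklist items can be sketched with a handful of pandas calls. The frame below reuses values from the `head()` output, with one price replaced by NaN purely to illustrate the missing-value check; it says nothing about whether the real data has gaps.

```python
import numpy as np
import pandas as pd

# Values from the head() output; the NaN is inserted only for illustration.
df = pd.DataFrame({
    "shop_id": [59, 25, 25, 25, 25],
    "item_price": [999.0, 899.0, np.nan, 1709.05, 1099.0],
    "item_cnt_day": [1.0, 1.0, -1.0, 1.0, 1.0],
})

# Name, type, and percentage of missing values per attribute.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_pct": df.isna().mean() * 100,
})

# Basic distribution statistics and pairwise correlations of numeric columns.
stats = df.describe()
corr = df.select_dtypes("number").corr()

print(summary)
print(stats)
print(corr)
```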


# Prepare the Data

## Data cleaning
* Fix or remove outliers (optional).
* Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).
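
As a rough sketch of both cleaning steps, the code below removes rows above a price cap and then fills missing prices with the median. The data, the cap `PRICE_CAP`, and the choice of the median are all illustrative assumptions; a real threshold should come from exploring the data.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing price and one implausibly large price.
df = pd.DataFrame({
    "item_price": [999.0, 899.0, np.nan, 1709.05, 1099.0, 300000.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})

# Hypothetical outlier threshold.
PRICE_CAP = 100_000.0

# Drop rows above the cap. Negating the "greater than" test keeps NaN rows,
# which a plain "<=" comparison would silently drop.
no_outliers = df[~(df["item_price"] > PRICE_CAP)]

# Fill the remaining gaps with the median of the outlier-free prices.
median_price = no_outliers["item_price"].median()
cleaned = no_outliers.assign(
    item_price=no_outliers["item_price"].fillna(median_price)
).reset_index(drop=True)

print(cleaned)
```

Removing outliers before computing the median keeps extreme values from influencing the fill value.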


## Feature selection
* Drop the attributes that provide no useful information for the task.


## Feature engineering
* Discretize continuous features.
* Decompose features (e.g., categorical, date/time, etc.).
* Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
* Aggregate features into promising new features.
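
Each of the four feature-engineering ideas can be illustrated in a line or two of pandas. This is a sketch on the rows from the `head()` output; the bin edges, the `revenue` feature, and the per-shop mean price are hypothetical examples, not features the notebook has committed to.

```python
import numpy as np
import pandas as pd

# Rows from the head() output.
df = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "05.01.2013", "06.01.2013", "15.01.2013"],
    "shop_id": [59, 25, 25, 25, 25],
    "item_price": [999.0, 899.0, 899.0, 1709.05, 1099.0],
    "item_cnt_day": [1.0, 1.0, -1.0, 1.0, 1.0],
})

# Decompose: split the date string into year and month.
parsed = pd.to_datetime(df["date"], format="%d.%m.%Y")
df["year"] = parsed.dt.year
df["month"] = parsed.dt.month

# Transform: log(1 + x) compresses the long right tail of prices.
df["log_price"] = np.log1p(df["item_price"])

# Discretize: hypothetical price bands with right-closed bins.
df["price_band"] = pd.cut(
    df["item_price"], bins=[0, 500, 1000, np.inf], labels=["low", "mid", "high"]
)

# Aggregate: revenue per row and mean price per shop.
df["revenue"] = df["item_price"] * df["item_cnt_day"]
df["shop_mean_price"] = df.groupby("shop_id")["item_price"].transform("mean")

print(df)
```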



## Feature scaling
* Standardize or normalize features.
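
For reference, standardization rescales a feature to zero mean and unit variance, while min-max normalization maps it onto the range [0, 1]. A minimal NumPy sketch on the prices from the `head()` output:

```python
import numpy as np

# Prices from the head() output.
prices = np.array([999.0, 899.0, 899.0, 1709.05, 1099.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (prices - prices.mean()) / prices.std()

# Min-max normalization: map the smallest value to 0 and the largest to 1.
normalized = (prices - prices.min()) / (prices.max() - prices.min())

print(standardized)
print(normalized)
```

In practice the mean, standard deviation, minimum, and maximum should be computed on the training data only and then reused to transform validation and test data.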

# Short-List Promising Models
* Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
* Measure and compare their performance.
* For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.
* Analyze the most significant variables for each algorithm.
* Analyze the types of errors the models make.
* What data would a human have used to avoid these errors?
* Have a quick round of feature selection and engineering.
* Have one or two more quick iterations of the five previous steps.
* Short-list the top three to five most promising models, preferring models that make different types of errors.
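
The "N-fold cross-validation with mean and standard deviation" step can be sketched in plain NumPy. Everything here is an assumption made for illustration: the data is synthetic, the two candidate models are a mean baseline and ordinary least squares, and RMSE stands in for the performance measure that this notebook has not yet chosen.

```python
import numpy as np

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)


def rmse(y_true, y_pred):
    """Root mean squared error (assumed metric)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_mean(X_train, y_train):
    """Baseline: always predict the training mean."""
    mean = y_train.mean()
    return lambda X_new: np.full(len(X_new), mean)


def fit_linear(X_train, y_train):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def cross_validate(fit, X, y, n_folds=5, seed=0):
    """Return the mean and standard deviation of RMSE over shuffled folds."""
    indices = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(indices, n_folds)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predict = fit(X[train_idx], y[train_idx])
        scores.append(rmse(y[val_idx], predict(X[val_idx])))
    return float(np.mean(scores)), float(np.std(scores))


results = {
    "mean baseline": cross_validate(fit_mean, X, y),
    "linear regression": cross_validate(fit_linear, X, y),
}
for name, (mean_score, std_score) in results.items():
    print(f"{name}: RMSE {mean_score:.3f} +/- {std_score:.3f}")
```

Shuffled folds ignore time ordering; for a forecasting task like this one, folds that respect `date_block_num` order may give a more honest estimate.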


# Fine-Tune the System

! Use as much data as possible for this step, especially as you move toward the end of fine-tuning
* Fine-tune the hyperparameters using cross-validation.
    * Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or with the median value? Or just drop the rows?).
    * Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams).
* Try ensemble methods. Combining your best models will often perform better than running them individually.
* Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.
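
Random search can be sketched in a few lines: sample hyperparameter values at random, score each one, and keep the best. The example below is an illustration under several assumptions: synthetic data, a closed-form ridge regression as the model, a log-uniform search range for `alpha`, RMSE as the score, and a single validation split instead of full cross-validation to keep it short.

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=120)

# Single train/validation split (a simplification of cross-validation).
X_train, X_val = X[:90], X[90:]
y_train, y_val = y[:90], y[90:]


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def rmse(y_true, y_pred):
    """Root mean squared error (assumed metric)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


best_alpha, best_score = None, np.inf
for _ in range(20):
    # Sample alpha log-uniformly between 1e-3 and 1e3.
    alpha = 10 ** rng.uniform(-3, 3)
    weights = fit_ridge(X_train, y_train, alpha)
    score = rmse(y_val, X_val @ weights)
    if score < best_score:
        best_alpha, best_score = alpha, score

print(f"best alpha: {best_alpha:.4g}, validation RMSE: {best_score:.3f}")
```

The same loop extends to data transformation choices: sample the imputation strategy or scaling method alongside `alpha` and score the whole pipeline.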


# Present the Solution & Launch!