
# Predict Future Sales

Competition data: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/data

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

## File descriptions

**sales_train.csv** - the training set. Daily historical data from January 2013 to October 2015.

**test.csv** - the test set. You need to forecast the sales for these shops and products for November 2015.

**sample_submission.csv** - a sample submission file in the correct format.

**items.csv** - supplemental information about the items/products.

**item_categories.csv** - supplemental information about the item categories.

**shops.csv** - supplemental information about the shops.

**Data fields**

**ID** - an Id that represents a (Shop, Item) tuple within the test set

**shop_id** - unique identifier of a shop

**item_id** - unique identifier of a product

**item_category_id** - unique identifier of item category

**item_cnt_day** - number of products sold. You are predicting a monthly amount of this measure

**item_price** - current price of an item

**date** - date in format dd/mm/yyyy

**date_block_num** - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1, ..., October 2015 is 33

**item_name** - name of item

**shop_name** - name of shop

**item_category_name** - name of item category
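
The `date_block_num` numbering described above can be reproduced from a date column. The following is a minimal sketch, not part of the competition code: the `to_date_block_num` helper and the toy dates are hypothetical, and the default `date_format` simply follows the dd/mm/yyyy description above. If the dates in your copy of the file use a different separator, pass the matching format string.

```python
import pandas as pd


def to_date_block_num(dates, date_format="%d/%m/%Y"):
    """Map date strings to consecutive month numbers (January 2013 -> 0).

    `date_format` is an assumption; pass the format your file actually uses.
    """
    parsed = pd.to_datetime(dates, format=date_format)
    return (parsed.dt.year - 2013) * 12 + (parsed.dt.month - 1)


# Toy dates (hypothetical), chosen to hit the first and last months
toy = pd.Series(["15/01/2013", "03/02/2013", "31/10/2015", "01/11/2015"])
blocks = to_date_block_num(toy)
print(blocks.tolist())  # [0, 1, 33, 34]
```

Under this numbering, November 2015 (the month to forecast) is block 34.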

**Pipeline**

- load data
- clean the data and remove outliers
- work with the shops/items/cats tables and derive features from them
- create a matrix as the product of item/shop pairs within each month in the train set
- get monthly sales for each item/shop pair in the train set and merge them into the matrix
- clip item_cnt_month to the range [0, 20]
- append the test set to the matrix and fill the NaNs for month 34 with zeros
- merge shops/items/cats into the matrix
- add target lag features
- add mean-encoded features
- add price trend features
- add month
- add days
- add months since last sale/months since first sale features
- cut the first year and drop columns that cannot be calculated for the test set
- select the best features
- set the validation strategy: month 34 is test, month 33 is validation, months before 33 are train
- fit the model, predict, and clip the targets for the test set
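
The matrix, monthly aggregation, clipping, lag, and validation-split steps above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the notebook's implementation: the toy sales rows and the `add_lag` helper are hypothetical, only a one-month lag is shown, and the last toy month stands in for the test month (34 in the real data).

```python
from itertools import product

import pandas as pd

# Toy daily sales (hypothetical values) using the competition's column names
train = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1, 1, 2],
    "shop_id":        [1, 1, 2, 1, 2, 2],
    "item_id":        [10, 10, 11, 10, 11, 10],
    "item_cnt_day":   [3, 30, 1, 2, -1, 4],
})

keys = ["date_block_num", "shop_id", "item_id"]

# Matrix: every shop x item combination seen within each month
rows = []
for block, month in train.groupby("date_block_num"):
    shops_in_month = month["shop_id"].unique()
    items_in_month = month["item_id"].unique()
    rows.extend(product([block], shops_in_month, items_in_month))
matrix = pd.DataFrame(rows, columns=keys)

# Monthly sales per shop/item pair, merged into the matrix
monthly = (
    train.groupby(keys, as_index=False)["item_cnt_day"].sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
matrix = matrix.merge(monthly, on=keys, how="left")

# Pairs with no sales get 0; the target is then clipped to [0, 20]
matrix["item_cnt_month"] = matrix["item_cnt_month"].fillna(0).clip(0, 20)


def add_lag(df, lag, col="item_cnt_month"):
    """Attach the value of `col` from `lag` months earlier (hypothetical helper)."""
    shifted = df[keys + [col]].copy()
    shifted["date_block_num"] += lag  # month m now carries the value from m - lag
    shifted = shifted.rename(columns={col: f"{col}_lag_{lag}"})
    return df.merge(shifted, on=keys, how="left")


matrix = add_lag(matrix, 1)

# Time-based split: last month held out, the one before it used for validation
last = matrix["date_block_num"].max()
train_part = matrix[matrix["date_block_num"] < last - 1]
valid_part = matrix[matrix["date_block_num"] == last - 1]
test_part = matrix[matrix["date_block_num"] == last]
print(matrix)
```

Rows in the first month have no earlier data, so their lag column is NaN. Splitting by month rather than at random keeps future information out of the training rows.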

## Part 1, perfect features

In [1]:
import numpy as np
import pandas as pd
# Display settings: show up to 500 rows and 100 columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)

from itertools import product
from sklearn.preprocessing import LabelEncoder

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from xgboost import XGBRegressor
from xgboost import plot_importance

def plot_features(booster, figsize):
    """Plot XGBoost feature importances on a figure of the given size."""
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    return plot_importance(booster=booster, ax=ax)

import time
import sys
import gc
import pickle
sys.version_info

sys.version_info(major=3, minor=7, micro=10, releaselevel='final', serial=0)

In [3]:
# Mount Google Drive so the notebook can read the competition data from it
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
!ls /content/drive

MyDrive


In [5]:
!ls /content/drive/MyDrive

'Colab Notebooks'   competitive-data-science-predict-future-sales


In [6]:
!ls /content/drive/MyDrive/competitive-data-science-predict-future-sales

item_categories.csv  sales_train.csv	    shops.csv
items.csv	     sample_submission.csv  test.csv


In [8]:
# Folder holding the competition CSV files (listed in the previous cell)
DATA_DIR = "/content/drive/MyDrive/competitive-data-science-predict-future-sales"

items = pd.read_csv(f"{DATA_DIR}/items.csv")
shops = pd.read_csv(f"{DATA_DIR}/shops.csv")
cats = pd.read_csv(f"{DATA_DIR}/item_categories.csv")
train = pd.read_csv(f"{DATA_DIR}/sales_train.csv")
# Use ID as the index to avoid dropping it later
test = pd.read_csv(f"{DATA_DIR}/test.csv").set_index('ID')