## First things first
* Click **File -> Save a copy in Drive** and click **Open in new tab** in the pop-up window to save your progress in Google Drive.
* Click **Runtime -> Change runtime type** and select **GPU** in Hardware accelerator box to enable faster GPU training.

#**Final Project for Coursera's 'How to Win a Data Science Competition'**
April, 2020

Andreas Theodoulou and Michael Gaidis

(Competition Info last updated:  3 years ago)

##**About this Competition**

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

Evaluation: root mean squared error (RMSE). True target values are clipped into [0,20] range.

.

##**File descriptions**

***sales_train.csv*** - the training set. Daily historical data from January 2013 to October 2015.

***test.csv*** - the test set. You need to forecast the sales for these shops and products for November 2015.

***sample_submission.csv*** - a sample submission file in the correct format.

***items.csv*** - supplemental information about the items/products.

***item_categories.csv***  - supplemental information about the items categories.

***shops.csv***- supplemental information about the shops.

.

##**Data fields**

***ID*** - an Id that represents a (Shop, Item) tuple within the test set

***shop_id*** - unique identifier of a shop

***item_id*** - unique identifier of a product

***item_category_id*** - unique identifier of item category

***item_cnt_day*** - number of products sold. You are predicting a monthly amount of this measure

***item_price*** - current price of an item

***date*** - date in format dd/mm/yyyy

***date_block_num*** - a consecutive month number. January 2013 is 0, February 2013 is 1,..., October 2015 is 33

***item_name*** - name of item

***shop_name*** - name of shop

***item_category_name*** - name of item category

# Connect Google Drive to Github Repo for any relevant pushes/pulls
Github/Google Drive configurartions

In [0]:
OAUTH_TOKEN_FILENAME = 'GitHub_Token.txt'
COLAB_GDRIVE_MOUNTPOINT = '/content/drive'  # leave this unchanged unless you know something
COLAB_DEFAULT_DIR = 'My Drive/Colab Notebooks'  # leave this unchanged unless you explicitly created a different default Colab directory
GDRIVE_PATH_TO_LOCAL_REPO = 'Coursera_Data_Science_Competitions_Kaggle_project'  # this is the directory (relative to Colab Default) in which you will have cloned the remote GitHub repo
GIT_REPO_MASTER = 'Kag'  # Name of master branch on GitHub
GIT_USERNAME = 'migai'
GIT_USER_EMAIL = "gaidis@alum.mit.edu"
GIT_REPO_USERNAME = 'migai'

from pathlib import Path
import os

GDRIVE_HOME = Path(COLAB_GDRIVE_MOUNTPOINT)                 # "/content/drive/My Drive/Colab Notebooks/
COLAB_HOME = GDRIVE_HOME / COLAB_DEFAULT_DIR                # "/content/drive/My Drive/Colab Notebooks/
TOKEN_FILE = COLAB_HOME / OAUTH_TOKEN_FILENAME              # "/content/drive/My Drive/Colab Notebooks/
GDRIVE_CLONE_PATH = COLAB_HOME / GDRIVE_PATH_TO_LOCAL_REPO  # "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final"
GDRIVE_REPO_PATH = GDRIVE_CLONE_PATH / GIT_REPO_MASTER      # "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag"

Mount collab to google drive

In [2]:
# This code will mount your personal Google Drive in Colab at "/content/drive"

from google.colab import drive
drive.mount(COLAB_GDRIVE_MOUNTPOINT)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [3]:
GIT_TOKEN = Path(TOKEN_FILE).read_text()
GITHUB_REPO_PATH = Path("https://" + GIT_TOKEN + "@github.com") / GIT_REPO_USERNAME / (GIT_REPO_MASTER + ".git")
print(GITHUB_REPO_PATH)

https:/620336781d85add615845c6c8d251087ec3e011d@github.com/migai/Kag.git


In [4]:
push_message = "added time series features & modelling script"

################################################
os.chdir(GDRIVE_REPO_PATH)
!git config user.email "{GIT_USER_EMAIL}"
!git config user.name "{GIT_USERNAME}"

# make sure we are in the correct location on GitHub
!git remote remove origin   
!git remote add origin "{GITHUB_REPO_PATH}"

!git add .
!git commit -m "{push_message}"
!git push origin master

[master 8673036] added time series features & modelling script
 2 files changed, 2 insertions(+), 2 deletions(-)
 rewrite Time Series Features & Modelling.ipynb (84%)
 rewrite ipynb_versions/Time Series Features & Modelling.ipynb (85%)
ssh: Could not resolve hostname https: Name or service not known
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.


In [0]:
!git add .

In [0]:
!git config --global user.email "andreas.theodoulou3@gmail.com"
!git config --global user.name "AndreasTheodoulou"
!git commit -m "commit of time series feautres & modelling script "

[master dad0d07] commit of time series feautres & modelling script
 2 files changed, 2 insertions(+)
 create mode 100644 Time Series Features & Modelling.ipynb
 create mode 100644 ipynb_versions/Time Series Features & Modelling.ipynb


In [0]:
!git push origin master

fatal: could not read Username for 'https://github.com': No such device or address


#Load Files
Load competition data files and import helpful custom code libraries from shared GitHub repository

In [0]:
# GitHub file location info
git_hub_url = "https://raw.githubusercontent.com/migai/"
repo_name = 'Kag/'
branch_name = 'master/'
base_url = git_hub_url + repo_name + branch_name

# List of the data files (path relative to GitHub branch), to be loaded into pandas DataFrames
data_files = [  "readonly/final_project_data/items.csv",
                "readonly/final_project_data/item_categories.csv",
                "readonly/final_project_data/shops.csv",
                "readonly/final_project_data/sample_submission.csv.gz",
                "readonly/final_project_data/sales_train.csv.gz",
                "readonly/final_project_data/test.csv.gz"  ]

# List of helper code files, to be loaded into Colab and available for python import
code_files = [  "kaggle_utils_at_mg.py"]

In [0]:
import pandas as pd
import os

def xfer_github_to_colab(path):
    filename = path.rsplit("/")[-1]
    os.system("wget " + base_url + "{} -O {}".format(path, filename))
    print(base_url + path + " ---> loaded into ---> " + filename)
    return filename

try:
    import google.colab
    IN_COLAB = True
except:
    IN_COLAB = False
if IN_COLAB:
    print("Loading Files from GitHub to Colab\n")

    # Loop to load the above data files into appropriately-named pandas DataFrames
    for path_name in data_files:
      filename = xfer_github_to_colab(path_name)
      data_frame_name = path_name.rsplit("/")[-1].split(".")[0]
      exec(data_frame_name + " = pd.read_csv(filename)")
      print("Data Frame: " + data_frame_name)
      print(eval(data_frame_name).head(2))
      print("\n")


    # to load a code (".py") file into Colab, first shred to make sure you aren't using an old version
    for path_name in code_files:
      filename = path_name.rsplit("/")[-1]
      ! shred -u {filename}
      filename = xfer_github_to_colab(path_name)

Loading Files from GitHub to Colab

https://raw.githubusercontent.com/migai/Kag/master/readonly/final_project_data/items.csv ---> loaded into ---> items.csv
Data Frame: items
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76


https://raw.githubusercontent.com/migai/Kag/master/readonly/final_project_data/item_categories.csv ---> loaded into ---> item_categories.csv
Data Frame: item_categories
        item_category_name  item_category_id
0  PC - Гарнитуры/Наушники                 0
1         Аксессуары - PS2                 1


https://raw.githubusercontent.com/migai/Kag/master/readonly/final_project_data/shops.csv ---> loaded into ---> shops.csv
Data Frame: shops
                       shop_name  shop_id
0  !Якутск Орджоникидзе, 56 фран        0
1  !Якутск ТЦ "Центральный" фран        1


https://ra

In [0]:
# test to check that .py utility file loaded into Colab OK
'''
import kaggle_utils_at_mg as kag_utils
test1 = kag_utils.add_one(2)
print(test1)
'''

'\nimport kaggle_utils_at_mg as kag_utils\ntest1 = kag_utils.add_one(2)\nprint(test1)\n'

In [0]:
import matplotlib.pyplot as plt
import pdb; pdb.set_trace()
import numpy as np
from itertools import product
import time
from sklearn.linear_model import LinearRegression
#from catboost import CatBoostRegressor
import pickle

from catboost import CatBoostRegressor 

--Return--
> <ipython-input-27-abcee8ea5336>(2)<module>()->None
-> import pdb; pdb.set_trace()
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76
2      ***В ЛУЧАХ СЛАВЫ   (UNV)                    D        2                40
3    ***ГОЛУБАЯ ВОЛНА  (Univ)                      D        3                40
4        ***КОРОБКА (СТЕКЛО)                       D        4                40
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# **Data Preparation**

*   Make data table monthly from daily (is there any point in using the daily data in more advanced modelling versions? Probably just to create more relevant monthly related features (e.g. mean/std or any other type) rather than keeping the format of the table daily)
*   To do: Merge item_category_id as a feature




Make table monthly

In [0]:
matrix = []
cols = ['date_block_num','shop_id','item_id']
for i in range(34):
    sales = sales_train[sales_train.date_block_num==i]
    matrix.append(np.array(list(product([i], sales.shop_id.unique(), sales.item_id.unique())), dtype='int16'))
    
matrix = pd.DataFrame(np.vstack(matrix), columns=cols)
matrix['date_block_num'] = matrix['date_block_num'].astype(np.int8)
matrix['shop_id'] = matrix['shop_id'].astype(np.int8)
matrix['item_id'] = matrix['item_id'].astype(np.int16)
matrix.sort_values(cols,inplace=True)
print("monthly table is")
matrix.head()


In [0]:
ts = time.time()
group = sales_train.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day': ['sum']})
group.columns = ['item_cnt_month']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=cols, how='left')
matrix['item_cnt_month'] = (matrix['item_cnt_month']
                                .fillna(0)
                                .clip(0,20) # NB clip target here
                                .astype(np.float16))


In [0]:
matrix.tail()

In [0]:
test['date_block_num'] = 34
test['date_block_num'] = test['date_block_num'].astype(np.int8)
test['shop_id'] = test['shop_id'].astype(np.int8)
test['item_id'] = test['item_id'].astype(np.int16)
matrix = pd.concat([matrix, test], ignore_index=True, sort=False, keys=cols)
matrix.fillna(0, inplace=True) # 34 month
matrix.head()

In [0]:
#Count of monthly data points for each category
#item_id category
matrix[matrix['date_block_num'] <= 6].groupby('item_id').agg({'item_id': 'count'}).describe()
#shop_id category
matrix[matrix['date_block_num'] <= 6].groupby(['shop_id']).agg({'shop_id': 'count'}).describe()
#shop_id & item_id category
matrix[matrix['date_block_num'] <= 6].groupby(['shop_id', 'item_id']).agg({'shop_id': 'count'}).describe()

In [0]:
sales_train.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day': ['count']}).describe()

In [0]:
group = sales_train.groupby(['date_block_num','item_id']).agg({'item_price': ['mean'],
                                                               'item_cnt_day': ['mean']})
group.columns = ['item_price_mean_per_item_and_month', 'item_cnt_mean_per_item_and_month']
group

### **Featue Generation/Engineering**

Time series features
*   Statistics of previous months (e.g. mean of item_price for a specific item/shop in previous months)
*   Trends of previous months - rate of change of the above statistics based features (e.g. rate of change of mean item_price from today to the past 3 months for a specific shop/item)




In [0]:
def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id',col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df

Stage 1: Statistics based features

> 1st step: Compute their Values


In [0]:
#mean of item price at specific date_block_num and item_id
group = sales_train.groupby(['date_block_num','item_id']).agg({'item_price': ['mean'],
                                                               'item_cnt_day': ['mean']})
group.columns = ['item_price_mean_per_item_and_month', 'item_cnt_mean_per_item_and_month']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['item_price_mean_per_item_and_month'] = matrix['item_price_mean_per_item_and_month'].astype(np.float16)
matrix['item_cnt_mean_per_item_and_month'] = matrix['item_cnt_mean_per_item_and_month'].astype(np.float16)


#mean of item price at specific date_block_num and shop_id
group = sales_train.groupby(['date_block_num','shop_id']).agg({'item_price': ['mean'],
                                                               'item_cnt_day': ['mean']})
group.columns = ['item_price_mean_per_shop_and_month', 'item_cnt_mean_per_shop_and_month']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','shop_id'], how='left')
matrix['item_price_mean_per_shop_and_month'] = matrix['item_price_mean_per_shop_and_month'].astype(np.float16)
matrix['item_cnt_mean_per_shop_and_month'] = matrix['item_cnt_mean_per_shop_and_month'].astype(np.float16)

matrix

> 2nd step: Lag them (put them in the same row/month as the one you'll be using them to predict - e.g e.g if going to use 6month ago mean of item_price to predict item_cnt of next month, put 6 month ago mean of item_price in the same row as current month's values, used to predict next month)




In [0]:
ts = time.time()

lags = [1, 2, 3, 4, 6, 7, 12, 13]

features_engineered = ['item_price_mean_per_item_and_month', 'item_cnt_mean_per_item_and_month', 'item_price_mean_per_shop_and_month', 'item_cnt_mean_per_shop_and_month']
features_to_lag = features_engineered
matrix_tmp = []
for i in range(len(features_to_lag)):
  matrix_tmp.append(lag_feature(matrix, lags, features_to_lag[i]))
for matrix_lagged in matrix_tmp:
  import pdb; pdb.set_trace()
  matrix = pd.merge(matrix, matrix_lagged, on=['date_block_num','shop_id','item_id'], how='left')


fetures_to_drop = features_engineered #features are renamed and added as a new column within the lag_features functions, so remove these one
matrix = matrix.drop(fetures_to_drop, axis = 1)
matrix = matrix.fillna(0)

matrix = matrix.loc[:,~matrix.columns.str.contains('_y')]
matrix = matrix.loc[:,~matrix.columns.str.contains('_x')]
import pdb; pdb.set_trace()


time.time()-ts

In [0]:
#matrix = matrix[matrix.columns.drop(list(matrix.filter(regex='y')))]
#df.drop(list(df.filter(regex = '_x')), axis = 1, inplace = True)#
matrix.tail()
import pdb; pdb.set_trace()


2nd Stage: Trend based features


> Rate of change of price/item count in the past 1m, 3m, 6m, 12m



In [0]:
ts = time.time()
trend_lags = [2, 4, 7, 13]
for feature_engineered in features_engineered:
  for i in trend_lags:
    matrix['trend_' + feature_engineered + '_lag_'+str(i-1)] = \
        (matrix[feature_engineered +'_lag_'+str(i)] - matrix[feature_engineered + '_lag_1']) / matrix[feature_engineered + '_lag_1']
print(time.time()-ts)
matrix.tail()



In [0]:
#Is this code really needed? it's really slow as well
'''
ts = time.time()
def select_trend(row):
  #for i in trend_lags:
  print(i)
  if row['trend_' + feature_engineered + '_lag_' + str(i-1)]:
    return row['trend_' + feature_engineered + '_lag_' + str(i-1)]
  return 0

for feature_engineered in features_engineered:
      for i in trend_lags:
                matrix['trend_' + feature_engineered + '_lag_' + str(i-1)] = matrix.apply(select_trend, axis=1)
                matrix['trend_' + feature_engineered + '_lag_' + str(i-1)] = matrix['trend_' + feature_engineered + '_lag_' + str(i-1)].astype(np.float16)
                matrix['trend_' + feature_engineered + '_lag_' + str(i-1)].fillna(0, inplace=True)

time.time() - ts
'''

In [0]:
matrix.head()



> Categorical feature for whether mean value of feature_engineered (e.g.price/item cnt of current month) is above mean value of past 12 months of that feature 



In [0]:
#if price_lag_1 > mean(price_lag_1,3,6,12)
for feature_engineered in features_engineered:
  matrix['above_12m_avg_' + feature_engineered] = matrix[feature_engineered + '_lag_1'] >= matrix[[feature_engineered + '_lag_1', feature_engineered + '_lag_3', feature_engineered + '_lag_6', feature_engineered + '_lag_12']].mean(axis = 1)


In [0]:
matrix.head()

In [0]:
'''
from google.colab import files
from google.colab import drive
drive.mount('/content/drive')
#matrix.to_csv('Full-TS-Features-DataSet.csv')
files.download('Full-TS-Features-DataSet.csv')
'''

In [0]:
'''
matrix = pd.read_csv('Full-TS-Features-DataSet.csv')
matrix.head()
'''

In [0]:
#features_to_remove_post_trend = ['item_price_mean_per_item_and_month', 'item_price_mean_per_shop_and_month'] #for all lags - do not sound like useful features -> their trends should be more useful
lags_to_remove_post_trend = ['_4', '_7', '_13'] #for all features - not needed any more - were just needed to calculate 1m (2m-1m), 3m (4m-1m), 6m (7m-1m), 12m (13m-1m) trends
'''
for feature_to_remove_post_trend in features_to_remove_post_trend:
  matrix = matrix.loc[:,~matrix.columns.str.startswith(feature_to_remove_post_trend)]
'''
for lag_to_remove_post_trend in lags_to_remove_post_trend:
  matrix = matrix.loc[:,~matrix.columns.str.endswith(lag_to_remove_post_trend)]


matrix.head()

In [0]:
import numpy as np
matrix = matrix.replace([np.inf, -np.inf], np.nan)
matrix.fillna(0, inplace=True)
matrix.head()

In [0]:
'''
from google.colab import files
matrix.to_csv('for-modelling-TS-Features-DataSet.csv')
'''

# Modelling



*   Train/Val/Test split
*   Model specific feature set
*   Model Fit & Validate
*   Test/Submission Results





Train/Test split

In [0]:
use_toy_data = False #to be used just for code to run quicker when tests are needed to be made
data = matrix
if use_toy_data == True:
  train_start_index = 28
else:
  train_start_index = 14 #skip first 13 months - used to caclulate time series features
train_final_index = 28 #makes validation set to be 20% of the non-test data (threshold is surely debatable)

data = data[data['date_block_num'] >= train_start_index ]  
X_train = data[data.date_block_num <= train_final_index].drop(['item_cnt_month', 'ID'], axis=1)
y_train = data[data.date_block_num <= train_final_index]['item_cnt_month']
X_val = data[(data.date_block_num > train_final_index) & (data.date_block_num <= 33)].drop(['item_cnt_month', 'ID'], axis=1)
y_val = data[(data.date_block_num > train_final_index) & (data.date_block_num <= 33)]['item_cnt_month']
X_test = data[data.date_block_num == 34].drop(['item_cnt_month', 'ID'], axis=1)



In [0]:
X_train.head()

In [0]:
y_train.head()

Model Specific feature set

In [0]:
'''
#Remove categorical features unless encoded (e.g one-hot encoding) for basically any method other than a tree method (Linear Regresion, Neural Networks etc)
LinRegFeaturesToDrop= ['date_block_num', 'shop_id', 'item_id'] 
X_train_LinReg = X_train.drop(LinRegFeaturesToDrop, axis = 1)
X_val_LinReg = X_val.drop(LinRegFeaturesToDrop, axis = 1)
X_test_LinReg = X_test.drop(LinRegFeaturesToDrop, axis = 1)
'''

Model Fit & Validate

In [0]:
import sklearn
model = LinearRegression()
model.fit(X_train, y_train)
y_pred_train, y_pred_val =  model.predict(X_train) , model.predict(X_val)
train_score, val_score = sklearn.metrics.r2_score(y_train, y_pred_train), sklearn.metrics.r2_score(y_val, y_pred_val)
train_rmsle, val_rmse = np.sqrt(sklearn.metrics.mean_squared_error(y_train, y_pred_train)), np.sqrt(sklearn.metrics.mean_squared_error(y_val, y_pred_val))
print('R^2 train_score is ' + str(train_score))
print('R^2 val_score is ' + str(val_score))

Test/Submission Results

In [0]:
Y_pred = model.predict(X_val).clip(0, 20)
Y_test = model.predict(X_test).clip(0, 20)

submission = pd.DataFrame({
    "ID": test.index, 
    "item_cnt_month": Y_test
})
submission.to_csv('xgb_submission.csv', index=False)

# save predictions for an ensemble
pickle.dump(Y_pred, open('linReg_train.pickle', 'wb'))
pickle.dump(Y_test, open('linReg_test.pickle', 'wb'))

In [0]:
submission.head()

Catboost

In [0]:
'''
# Prepare Categorical Variables

categorical = []
for feature_engineered in features_engineered:
  categorical.append('above_12m_avg' + feature_engineered)

categorical.extend(['date_block_num','shop_id', 'item_id'])

def column_index(df, query_cols):
    indices = []
    for query_col in query_cols:
      index=df.columns.get_loc(query_col)
      indices.append(index)
    return indices
categorical_features_pos = column_index(X_train,categorical)

model = CatBoostRegressor()
model.fit(
    X_train, y_train,
    #cat_features=categorical_features_pos,
    eval_set=(X_val, y_val)
#     logging_level='Verbose',  # you can uncomment this for text output
    #plot=True
)
y_pred_train, y_pred_val =  model.predict(X_train) , model.predict(X_val)
train_score, val_score = sklearn.metrics.r2_score(y_train, y_pred_train), sklearn.metrics.r2_score(y_val, y_pred_val)
train_rmse, val_rmse = np.sqrt(sklearn.metrics.mean_squared_error(y_train, y_pred_train)), np.sqrt(sklearn.metrics.mean_squared_error(y_val, y_pred_val))
print('R^2 train_score is ' + str(train_score))
print('R^2 val_score is ' + str(val_score))

Y_pred = model.predict(X_val).clip(0, 20)
Y_test = model.predict(X_test).clip(0, 20)

submission = pd.DataFrame({
    "ID": test.index, 
    "item_cnt_month": Y_test
})
submission.to_csv('xgb_submission.csv', index=False)

# save predictions for an ensemble
pickle.dump(Y_pred, open('linReg_train.pickle', 'wb'))
pickle.dump(Y_test, open('linReg_test.pickle', 'wb'))

'''

In [0]:
from sklearn.ensemble import GradientBoostingRegressor
gbt = GradientBoostingRegressor(max_depth = 7)
gbt.fit(X_train, y_train)
model = gbt

y_pred_train, y_pred_val =  model.predict(X_train) , model.predict(X_val)
train_score, val_score = sklearn.metrics.r2_score(y_train, y_pred_train), sklearn.metrics.r2_score(y_val, y_pred_val)
train_rmse, val_rmse = np.sqrt(sklearn.metrics.mean_squared_error(y_train, y_pred_train)), np.sqrt(sklearn.metrics.mean_squared_error(y_val, y_pred_val))
print('R^2 train_score is ' + str(train_score))
print('R^2 val_score is ' + str(val_score))

Y_pred = model.predict(X_val).clip(0, 20)
Y_test = model.predict(X_test).clip(0, 20)

submission = pd.DataFrame({
    "ID": test.index, 
    "item_cnt_month": Y_test
})
submission.to_csv('xgb_submission.csv', index=False)

# save predictions for an ensemble
pickle.dump(Y_pred, open('linReg_train.pickle', 'wb'))
pickle.dump(Y_test, open('linReg_test.pickle', 'wb'))

# save the model to disk
filename = 'gbt_model.sav'
pickle.dump(model, open(filename, 'wb'))
 
# some time later...
 '''
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, Y_test)
print(result)
'''

In [0]:
%debug
from google.colab import drive
drive.mount('/content/drive')

In [0]:
!ls "/content/drive/My Drive"


In [0]:
# Plot feature importance
feature_importance = gbt.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(10,13)) 
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_train.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.tick_params(axis='y', which='major', labelsize = 13)
plt.show()

In [0]:
len(X_train.columns)//4

In [0]:
X_train.columns

EDA

In [0]:
df1 = data.describe(include = 'all')
df1.loc['dtype'] = data.dtypes
df1.loc['size'] = len(data)
df1.loc['% Null_count'] = data.isnull().mean()
df1

**Data Cleaning**

In [0]:
#impute any potential missing values or deal with outliers

Feature Engineering

In [0]:
# To construct month, year feature from data
# count of days in a month
# time components of item_price and item_cnt (value at t-1, t-2, t-3, t-6, t-12 maybe)
# rate of change of item_cnt (between t-1 and t-2 e.g.), 
# statistics on item_price and item_cnt - mean, std, range, mode, skew?


In [0]:
#Create a distinct day, month, year column
'''
df['date'] = pd.to_datetime(df['date'], format = "%d.%m.%Y")
df['year'], df['month'], features['day'] = df['date'].dt.year, df['date'].dt.month, df['date'].dt.day
df.head()
'''
#also get day count (days in a month)