### Sales Prediction for 6 Weeks for over 1000 Stores across Germany
This notebook cleans the data, imputes NaN values, fits a model and produces the predictions. Although it is not part of this notebook, imputing the 6 months of sales for 180 stores from their historical values could probably improve the model as well.

The three data sets imported in this notebook are the historical sales data (train), the store attributes (store) and the data for the 6 weeks of sales to be predicted (test).

For data modelling, the extreme gradient boosting approach (recommended by many Kaggle competitors) is chosen.

In [1]:
# Importing the essential libraries
import pandas as pd
import csv
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import matplotlib
import matplotlib.pylab as pl
import xgboost as xgb

In [2]:
train = pd.read_csv(r'C:...\data\train.csv', header=0, dtype={'StateHoliday':'str'})
store = pd.read_csv(r'C:...\data\store.csv', header=0)
test = pd.read_csv(r'C:...\data\test.csv', header=0, dtype={'StateHoliday':'str'})

In [3]:
train.columns = [x.lower() for x in train.columns]
store.columns = [x.lower() for x in store.columns]
test.columns = [x.lower() for x in test.columns]

In [4]:
# Drop the attributes that are not used for data modelling
train = train.drop(['stateholiday','schoolholiday'], axis = 1)
test = test.drop(['stateholiday','schoolholiday'], axis = 1)
store = store.drop(['competitionopensincemonth','competitionopensinceyear','promo2','promo2sinceweek',
                    'promo2sinceyear','promointerval'], axis = 1)

In [5]:
# Merge the store attributes into both train and test on the store id
train = pd.merge(train,store, on='store')
test = pd.merge(test,store, on='store')
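A side note on the merge: `pd.merge` performs an inner join by default, so a row whose `store` id is absent from the store table would be dropped without warning. The sketch below is only an illustration on made-up frames (`sales_demo` and `store_demo` are hypothetical names, not part of this notebook). It shows a left join with `validate="many_to_one"`, which keeps every sales row and raises an error if a store id is duplicated in the store table.

```python
import pandas as pd

# Hypothetical toy frames standing in for train/test and store.
sales_demo = pd.DataFrame({"store": [1, 1, 2, 3], "sales": [100, 120, 80, 60]})
store_demo = pd.DataFrame({"store": [1, 2], "storetype": ["a", "c"]})

# how="left" keeps every sales row; validate raises if store ids repeat on the right side.
merged = pd.merge(sales_demo, store_demo, on="store", how="left", validate="many_to_one")

# Rows without a matching store show up as NaN instead of disappearing.
unmatched = merged["storetype"].isnull().sum()
print(len(merged), unmatched)
```

For the test set this matters in particular, because every `id` needs a prediction in the final file.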

In [6]:
# All categorical attributes used for data modelling have to be converted into numerical attributes
replace_alph_numer = {'assortment': {'a':1,'b':2,'c':3},
                      'storetype': {'a':1,'b':2,'c':3,'d':4}}
train.replace(replace_alph_numer,inplace= True)
test.replace(replace_alph_numer, inplace= True)
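As an alternative sketch (not what this notebook runs), the same encoding can be written with `Series.map`. A value that is missing from the mapping then becomes NaN, which makes unexpected categories easy to spot. The frame `demo` and the category `"x"` are made up for the illustration.

```python
import pandas as pd

# Hypothetical toy frame; "x" is deliberately not a known assortment level.
demo = pd.DataFrame({"storetype": ["a", "d", "b"], "assortment": ["c", "a", "x"]})

mappings = {"assortment": {"a": 1, "b": 2, "c": 3},
            "storetype": {"a": 1, "b": 2, "c": 3, "d": 4}}

# Map each column with its own dictionary; unmapped values turn into NaN.
for col, mapping in mappings.items():
    demo[col] = demo[col].map(mapping)

print(demo)
```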

In [7]:
# I extract the year, month and day for each row from the date for both test and train data
train['year'] = train.date.apply(lambda x: x.split('-'))
train['month'] = train.year.apply(lambda x: int(x[1]))
train['day'] = train.year.apply(lambda x: int(x[2]))
train['year'] = train.year.apply(lambda x: int(x[0]))

In [8]:
test['year'] = test.date.apply(lambda x: x.split('-'))
test['month'] = test.year.apply(lambda x: int(x[1]))
test['day'] = test.year.apply(lambda x: int(x[2]))
test['year'] = test.year.apply(lambda x: int(x[0]))
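The same date parts can also be obtained without splitting strings. The following is a minimal sketch on a made-up frame (`demo` is hypothetical) that parses the column once with `pd.to_datetime` and reads the parts through the `.dt` accessor.

```python
import pandas as pd

# Hypothetical toy frame with dates in the same 'YYYY-MM-DD' format as the data.
demo = pd.DataFrame({"date": ["2015-08-01", "2015-09-17"]})

# Parse once, then read year, month and day from the datetime accessor.
dates = pd.to_datetime(demo["date"])
demo["year"] = dates.dt.year
demo["month"] = dates.dt.month
demo["day"] = dates.dt.day

print(demo)
```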

In [9]:
train.date = pd.to_datetime(train.date)
test.date = pd.to_datetime(test.date)

In [10]:
for col in test:
    print(type(test[col][1]))

<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'pandas._libs.tslib.Timestamp'>
<class 'numpy.float64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.float64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>


In [11]:
test.describe()

                 id       store  dayofweek      open     promo  storetype  assortment  competitiondistance     year     month        day
count       41088.0     41088.0    41088.0   41077.0   41088.0    41088.0     41088.0              40992.0  41088.0   41088.0    41088.0
mean        20544.5  555.899533   3.979167  0.854322  0.395833   2.252336    2.001168          5088.583138   2015.0  8.354167  13.520833
std    11861.228267  320.274496   2.015481  0.352787  0.489035   1.397401    0.994741          7225.487467      0.0  0.478266    8.44845
min             1.0         1.0        1.0       0.0       0.0        1.0         1.0                 20.0   2015.0       8.0        1.0
25%        10272.75      279.75        2.0       1.0       0.0        1.0         1.0                720.0   2015.0       8.0       6.75
50%         20544.5       553.5        4.0       1.0       0.0        1.0         2.0               2425.0   2015.0       8.0       12.5
75%        30816.25      832.25        6.0       1.0       1.0        4.0         3.0               6480.0   2015.0       9.0      19.25
max         41088.0      1115.0        7.0       1.0       1.0        4.0         3.0              75860.0   2015.0       9.0       31.0


In [12]:
# 'open' and 'competitiondistance' contain missing values; stores with a missing 'open' flag are assumed to be open (1)
test.loc[test.open.isnull(), 'open'] = 1

In [13]:
# Replace the remaining NaN values in test with 0 (the same is done for train further down)
test.fillna(0,inplace = True)
test.isnull().sum()

id                     0
store                  0
dayofweek              0
date                   0
open                   0
promo                  0
storetype              0
assortment             0
competitiondistance    0
year                   0
month                  0
day                    0
dtype: int64
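Filling every NaN with 0 also sets a missing `competitiondistance` to 0, which reads as a competitor right next door. As a hedged alternative (an assumption on my side, not what this notebook does), `fillna` accepts a dictionary so that each column gets its own fill value. The frame `demo` and the choice of the median are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
demo = pd.DataFrame({"open": [1.0, np.nan, 0.0],
                     "competitiondistance": [500.0, np.nan, 1500.0]})

# One fill value per column: assume "open" for the flag, the median for the distance.
fill_values = {"open": 1, "competitiondistance": demo["competitiondistance"].median()}
filled = demo.fillna(fill_values)

print(filled.isnull().sum())
```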

In [14]:
# Stores that are flagged as open but recorded no sales are assumed to have been closed.
train.loc[(train.open == 1)&(train.sales == 0), 'open'] = 0

In [15]:
train.fillna(0,inplace = True)
train.isnull().sum()

store                  0
dayofweek              0
date                   0
sales                  0
customers              0
open                   0
promo                  0
storetype              0
assortment             0
competitiondistance    0
year                   0
month                  0
day                    0
dtype: int64

In [16]:
# I drop some more columns for modelling
train = train.drop(['date','customers'],axis=1)
test = test.drop(['date'],axis=1)

In [37]:
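# courtesy of Chenglong Chen from Kaggle forum
def ToWeight(y):
    # Weight 1/y^2 for non-zero targets, 0 for zero targets (avoids a division by zero)
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1./(y[ind]**2)
    return w

def ToZero(y):
    # Set all non-positive values to 0 and keep the positive values unchanged
    w = np.zeros(y.shape, dtype=float)
    ind = y > 0
    w[ind] = y[ind]
    return w

def rmspe(ytest, y):
    # Root mean square percentage error; ytest holds the predictions, y the true values
    w = ToWeight(y)
    return np.sqrt(np.mean(w*(y-ytest)**2))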
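A small worked example may help to read `rmspe`. The sketch below repeats the same arithmetic in a stand-alone function (`rmspe_demo` is a hypothetical name) on three made-up values. One detail it makes visible: rows with zero sales get weight 0 but are still counted in the mean, so the result is lower than when those rows are filtered out beforehand.

```python
import numpy as np

def rmspe_demo(y_pred, y_true):
    # Same arithmetic as rmspe: weight 1/y^2 where y != 0, weight 0 otherwise.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = np.zeros_like(y_true)
    nonzero = y_true != 0
    w[nonzero] = 1.0 / y_true[nonzero] ** 2
    return np.sqrt(np.mean(w * (y_true - y_pred) ** 2))

# Made-up values: two open days with a 10% error each and one day with zero sales.
y_true = np.array([100.0, 200.0, 0.0])
y_pred = np.array([110.0, 180.0, 50.0])

all_rows = rmspe_demo(y_pred, y_true)                               # zero-sales row included in the mean
open_rows = rmspe_demo(y_pred[y_true != 0], y_true[y_true != 0])    # zero-sales row filtered out
print(all_rows, open_rows)
```

With the zero-sales row filtered out the error is 0.1, as expected for two 10% errors; with the row included it drops to about 0.082.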

In [18]:
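import math
def error_evaluation(preds, dtrain):
    # Custom xgboost metric: RMSPE on the original sales scale, ignoring rows with zero sales.
    # Labels and predictions are log(sales + 1), so they are transformed back with exp(x) - 1.
    labels = dtrain.get_label()
    y = [math.exp(x)-1 for x in labels[labels > 0]]
    yhat = [math.exp(x)-1 for x in preds[labels > 0]]
    ssquare = [math.pow((y[i] - yhat[i])/y[i],2) for i in range(len(y))]
    return 'rmpse', math.sqrt(np.mean(ssquare))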
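The list comprehensions in `error_evaluation` loop over every row in Python, which gets slow on large matrices. A vectorised sketch of the same calculation is shown below; `rmspe_log_space` is a hypothetical helper name, and it takes the label array directly so that it can be tried without xgboost.

```python
import numpy as np

def rmspe_log_space(preds, labels):
    # preds and labels are log(sales + 1); rows with labels <= 0 (zero sales) are skipped.
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mask = labels > 0
    y = np.expm1(labels[mask])      # back to the sales scale: exp(x) - 1
    yhat = np.expm1(preds[mask])
    return float(np.sqrt(np.mean(((y - yhat) / y) ** 2)))

# Made-up values: two days with a 10% error each and one day with zero sales.
labels = np.log1p(np.array([100.0, 200.0, 0.0]))
preds = np.log1p(np.array([110.0, 180.0, 50.0]))
score = rmspe_log_space(preds, labels)
print(score)
```

Inside the metric function this would presumably be used as `return 'rmpse', rmspe_log_space(preds, dtrain.get_label())`.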

In [19]:
features = train.drop('sales', axis = 1).columns

In [29]:
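# Set the parameters for xgboost
# Note: recent xgboost releases call the "reg:linear" objective "reg:squarederror" and use "verbosity" instead of "silent"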
params = {"objective": "reg:linear",
          "eta": 0.25,
          "max_depth": 10,
          "silent": 1,
          "subsample": 0.9,
          "colsample_bytree": 0.7,
          "seed": 1,
          "booster": "gbtree"}
num_trees = 400

In [30]:
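# Split the train data into a training part and a 20% hold-out part for validation
# (sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides the same function)
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(train,test_size = 0.2)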
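A random split mixes past and future days, whereas the test set covers a block of 6 weeks. As a hedged alternative (a different validation scheme, not the one used in this notebook), the last 6 weeks can be held out instead. The sketch assumes a frame that still has a datetime `date` column, so on the real data it would have to run before `date` is dropped; `demo`, `fit_part` and `holdout_part` are hypothetical names.

```python
import pandas as pd

# Hypothetical daily data; the real frame would be the merged train set with its date column.
demo = pd.DataFrame({"date": pd.date_range("2015-01-01", "2015-07-31", freq="D")})
demo["sales"] = range(len(demo))

# Everything after the cutoff (the last 6 weeks) becomes the hold-out part.
cutoff = demo["date"].max() - pd.Timedelta(weeks=6)
fit_part = demo[demo["date"] <= cutoff]
holdout_part = demo[demo["date"] > cutoff]

print(len(fit_part), len(holdout_part))
```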

In [31]:
# Build the DMatrix objects for training, validation and prediction, plus a watchlist that is evaluated during training.
# The target is log(sales + 1).
dtrain = xgb.DMatrix(X_train[features], np.log(X_train["sales"] + 1))
dvalid = xgb.DMatrix(X_test[features], np.log(X_test["sales"] + 1))
dtest = xgb.DMatrix(test[features])
watchlist = [(dvalid, 'eval'), (dtrain, 'train')]
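The model is trained on `log(sales + 1)`, so every prediction has to be transformed back with `exp(x) - 1`. The sketch below shows the round trip on made-up numbers with NumPy's `log1p` and `expm1`, which compute the same two expressions. It also shows why negative predictions are set to 0 before the back-transformation (this is what `ToZero` does): a negative value in log space would otherwise turn into negative sales.

```python
import numpy as np

# Made-up sales values, including a closed day with zero sales.
sales = np.array([0.0, 1.0, 5263.0, 41551.0])

target = np.log1p(sales)       # same as np.log(sales + 1)
recovered = np.expm1(target)   # same as np.exp(target) - 1

# Made-up raw predictions in log space; the first one is slightly negative.
raw_pred = np.array([-0.2, 8.5])
forecast = np.expm1(np.clip(raw_pred, 0, None))   # negative values become 0 sales

print(recovered, forecast)
```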

In [32]:
# Train the model. The metrics are printed every 5 rounds and training stops early when the monitored metric
# has not improved for 25 rounds. xgboost monitors the last entry of evals, which here is the training set
# (see 'train-rmpse' in the log below).
gbm = xgb.train(params, dtrain, num_trees, evals=watchlist, early_stopping_rounds=25, verbose_eval=5, feval=error_evaluation)

[0]	eval-rmse:5.65991	train-rmse:5.65632	eval-rmpse:0.997949	train-rmpse:0.99795
Multiple eval metrics have been passed: 'train-rmpse' will be used for early stopping.

Will train until train-rmpse hasn't improved in 25 rounds.
[5]	eval-rmse:1.3753	train-rmse:1.37475	eval-rmpse:0.763071	train-rmpse:0.761868
[10]	eval-rmse:0.42967	train-rmse:0.429556	eval-rmpse:0.386655	train-rmpse:0.362551
[15]	eval-rmse:0.280966	train-rmse:0.28056	eval-rmpse:0.353489	train-rmpse:0.310144
[20]	eval-rmse:0.251396	train-rmse:0.250796	eval-rmpse:0.35687	train-rmpse:0.308797
[25]	eval-rmse:0.228911	train-rmse:0.228097	eval-rmpse:0.341954	train-rmpse:0.289782
[30]	eval-rmse:0.209107	train-rmse:0.208378	eval-rmpse:0.324683	train-rmpse:0.26796
[35]	eval-rmse:0.195218	train-rmse:0.194219	eval-rmpse:0.311654	train-rmpse:0.246028
[40]	eval-rmse:0.185905	train-rmse:0.184787	eval-rmpse:0.303389	train-rmpse:0.234677
[45]	eval-rmse:0.179269	train-rmse:0.177923	eval-rmpse:0.306712	train-rmpse:0.225578
[50]	eval-rmse:

In [33]:
# Calculate the model error on the hold-out part of train by comparing the predicted values
# with the real sales values
print("Model Validation in Process")
train_probs = gbm.predict(xgb.DMatrix(X_test[features]))
error = rmspe(np.exp(train_probs) - 1, X_test['sales'].values)
print(error)

Model Validation in Process
0.245963696278


In [34]:
# Perform the prediction on the test data set (the result is still on the log(sales + 1) scale)
real_test = gbm.predict(xgb.DMatrix(test[features]))

In [35]:
len(real_test)

41088

In [38]:
#generate a data frame for the predicted values
sale_forecast = pd.DataFrame({"id": test["id"], "sales": np.exp(ToZero(real_test)) - 1})

In [39]:
# save the predicted values in a csv format file
sale_forecast.to_csv("some_file.csv", index=False)