# Use Neural Networks to solve Rossmann Sales

- Ideas from http://rstudio-pubs-static.s3.amazonaws.com/423824_f67e5db15f6e4e59bcf284208e441cfd.html

- Main toolset or library: [Keras](https://keras.io/) 

- Keras is a Python deep learning library that runs on both CPU and GPU

- A neural network is a biologically inspired model consisting of interconnected nodes, loosely simulating the way the human brain reacts to stimuli from afferent signals

- A Jupyter notebook with GCP in its name runs on a GCP VM

In [1]:
# Install keras on the server
#!pip install keras
#!pip install tensorflow


In [2]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# Import the keras sequential model
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from sklearn.model_selection  import train_test_split
from keras.optimizers import SGD,RMSprop
from keras_tqdm import TQDMNotebookCallback

Using TensorFlow backend.


# Problem Description

- The task is a supervised regression problem: given sales data from Rossmann, a German drugstore chain, I want to build a model that can predict Rossmann's future sales for the next 6 weeks.

- Before switching to an automated machine learning method, I tried building complete machine learning models, such as a decision tree and AdaBoost, entirely by hand. That is, the data cleaning, feature engineering, model selection and parameter tuning (the tuning has not yet been done for the decision tree and AdaBoost) were done manually.

## Dataset

- Colab is able to retrieve data from GitHub directly

- The store, train and test data sets are given, but the features and labels need to be split

- **Pay attention that test.csv does not include** ***Customers***

In [3]:
# read data from GitHub

train = pd.read_csv('https://raw.githubusercontent.com/lidatou1991/udacity_final_rossmann/master/inputs/train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/lidatou1991/udacity_final_rossmann/master/inputs/test.csv')
store = pd.read_csv('https://raw.githubusercontent.com/lidatou1991/udacity_final_rossmann/master/inputs/store.csv')

print('Training data shape: ', train.shape)
print('testing data shape: ', test.shape)
print('Store data shape: ', store.shape)


Training data shape:  (1017209, 9)
testing data shape:  (41088, 8)
Store data shape:  (1115, 10)


## Processing outliers and NaN in data

1. Unlike with TPOT, outliers and missing values have to be preprocessed manually for a neural network model

2. Otherwise the model fails to fit and returns **loss = nan**, which is what I encountered at first

3. Since no additional information is given, I keep the preprocessing of the raw data simple and consistent with the other models, so that performance can be compared.
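
To illustrate the second point, here is a minimal sketch on made-up numbers (not the Rossmann data, and a single hypothetical weight instead of a real network): one NaN in the inputs is enough to turn a mean squared error into NaN, which is why the missing values are filled before fitting.

```python
import numpy as np

# Toy feature column with one missing value (hypothetical numbers)
x = np.array([1.0, 2.0, np.nan, 4.0])
y = np.array([0.5, 1.0, 1.5, 2.0])
w = 0.5                                  # a single made-up weight

pred = w * x                             # NaN propagates through the product
mse = np.mean((pred - y) ** 2)
print(mse)                               # nan

# After filling the gap with 0 the loss is finite again
x_filled = np.nan_to_num(x, nan=0.0)
mse_filled = np.mean((w * x_filled - y) ** 2)
print(mse_filled)                        # 0.5625
```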

In [4]:
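# Drop rows from train with Open=1 but Sales=0 (the filter below keeps only Sales > 0, so closed-store days are dropped as well)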

train = train.loc[train['Sales']>0]

print('{} train datas were deleted'.format(1017209 - len(train)))

172871 train datas were deleted


In [5]:
# StateHoliday in train mixes the string '0' with the integer 0, which gives 5 distinct values

train.StateHoliday = train.StateHoliday.map({'0':'0',0:'0','a':'a','b':'b','c':'c'})

print('StateHoliday unique values {}'.format(len(train.StateHoliday.unique())))


StateHoliday unique values 4
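
The mixed-type problem can be reproduced on a toy Series. The sketch below uses made-up values and shows `astype(str)` as an alternative to the explicit `map` above; it is not the approach used in this notebook.

```python
import pandas as pd

# Toy column mixing the integer 0 with the string '0' (hypothetical values)
s = pd.Series([0, '0', 'a', 'b', 'c', 0])
print(s.nunique())                  # 5 distinct values before cleaning

# Casting everything to str collapses 0 and '0' into one category
cleaned = s.astype(str)
print(sorted(cleaned.unique()))     # ['0', 'a', 'b', 'c']
```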


In [6]:
# Fill NaN in store with 0, meaning no competition or no promotion

store.fillna(0,inplace = True)

In [7]:
# Fill NaN in test with 1: all NaNs are in Open, and if Open=0 there would be no need to predict

test.fillna(1,inplace = True)

### Check whether there is any NaN left in the raw data

In [8]:
train.isnull().any() 


Store            False
DayOfWeek        False
Date             False
Sales            False
Customers        False
Open             False
Promo            False
StateHoliday     False
SchoolHoliday    False
dtype: bool

In [9]:
store.isnull().any() 

Store                        False
StoreType                    False
Assortment                   False
CompetitionDistance          False
CompetitionOpenSinceMonth    False
CompetitionOpenSinceYear     False
Promo2                       False
Promo2SinceWeek              False
Promo2SinceYear              False
PromoInterval                False
dtype: bool

In [10]:
test.isnull().any() 

Id               False
Store            False
DayOfWeek        False
Date             False
Open             False
Promo            False
StateHoliday     False
SchoolHoliday    False
dtype: bool

## Note that train/test data have different columns (features)

- the feature **Customers** should be deleted from the train data 

- store data provides additional information and should be merged with train data

In [11]:
#combine raw data of train and store
train_comb = pd.merge(train,store,on ='Store',how ='left')

train_comb.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,5,2015-07-31,5263,555,1,1,0,1,c,a,1270.0,9.0,2008.0,0,0.0,0.0,0
1,2,5,2015-07-31,6064,625,1,1,0,1,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,5,2015-07-31,8314,821,1,1,0,1,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,5,2015-07-31,13995,1498,1,1,0,1,c,c,620.0,9.0,2009.0,0,0.0,0.0,0
4,5,5,2015-07-31,4822,559,1,1,0,1,a,a,29910.0,4.0,2015.0,0,0.0,0.0,0


In [12]:
#delete the 'Customers' column
train_comb.drop('Customers',inplace = True,axis = 1)

train_comb.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Open,Promo,StateHoliday,SchoolHoliday,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,5,2015-07-31,5263,1,1,0,1,c,a,1270.0,9.0,2008.0,0,0.0,0.0,0
1,2,5,2015-07-31,6064,1,1,0,1,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,5,2015-07-31,8314,1,1,0,1,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,5,2015-07-31,13995,1,1,0,1,c,c,620.0,9.0,2009.0,0,0.0,0.0,0
4,5,5,2015-07-31,4822,1,1,0,1,a,a,29910.0,4.0,2015.0,0,0.0,0.0,0


In [13]:
# obtain train feature and label
train_feature = train_comb.drop('Sales',inplace = False, axis =1)
train_label = train_comb['Sales']

## Check categorical features

Categorical features need to be one-hot encoded

In [14]:
train_feature.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 844338 entries, 0 to 844337
Data columns (total 16 columns):
Store                        844338 non-null int64
DayOfWeek                    844338 non-null int64
Date                         844338 non-null object
Open                         844338 non-null int64
Promo                        844338 non-null int64
StateHoliday                 844338 non-null object
SchoolHoliday                844338 non-null int64
StoreType                    844338 non-null object
Assortment                   844338 non-null object
CompetitionDistance          844338 non-null float64
CompetitionOpenSinceMonth    844338 non-null float64
CompetitionOpenSinceYear     844338 non-null float64
Promo2                       844338 non-null int64
Promo2SinceWeek              844338 non-null float64
Promo2SinceYear              844338 non-null float64
PromoInterval                844338 non-null object
dtypes: float64(5), int64(6), object(5)
memory usage: 109.

## One-hot encoding object type features

- The 'Date' feature is not used here, so it is dropped

- One-hot encode 'StateHoliday', 'StoreType', 'Assortment' and 'PromoInterval'

In [15]:
#drop Date feature
train_feature.drop('Date',inplace = True, axis =1)

train_feature_ready = pd.get_dummies(train_feature,prefix=['StateHoliday', 'StoreType', 'Assortment','PromoInterval'],drop_first=True)
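
As a small illustration of what `drop_first=True` does, the sketch below encodes a made-up frame with a single categorical column. The first category (`'a'`) gets no column of its own, which is why no `StateHoliday_0` column appears in the encoded training features.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (hypothetical values)
toy = pd.DataFrame({
    'Promo': [1, 0, 1],
    'StoreType': ['a', 'b', 'c'],
})

encoded = pd.get_dummies(toy, prefix=['StoreType'], drop_first=True)
print(list(encoded.columns))   # ['Promo', 'StoreType_b', 'StoreType_c']
```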

In [16]:
train_feature_ready.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 844338 entries, 0 to 844337
Data columns (total 22 columns):
Store                             844338 non-null int64
DayOfWeek                         844338 non-null int64
Open                              844338 non-null int64
Promo                             844338 non-null int64
SchoolHoliday                     844338 non-null int64
CompetitionDistance               844338 non-null float64
CompetitionOpenSinceMonth         844338 non-null float64
CompetitionOpenSinceYear          844338 non-null float64
Promo2                            844338 non-null int64
Promo2SinceWeek                   844338 non-null float64
Promo2SinceYear                   844338 non-null float64
StateHoliday_a                    844338 non-null uint8
StateHoliday_b                    844338 non-null uint8
StateHoliday_c                    844338 non-null uint8
StoreType_b                       844338 non-null uint8
StoreType_c                       84433

In [17]:
train_feature_ready.head()

Unnamed: 0,Store,DayOfWeek,Open,Promo,SchoolHoliday,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,...,StateHoliday_b,StateHoliday_c,StoreType_b,StoreType_c,StoreType_d,Assortment_b,Assortment_c,"PromoInterval_Feb,May,Aug,Nov","PromoInterval_Jan,Apr,Jul,Oct","PromoInterval_Mar,Jun,Sept,Dec"
0,1,5,1,1,1,1270.0,9.0,2008.0,0,0.0,...,0,0,0,1,0,0,0,0,0,0
1,2,5,1,1,1,570.0,11.0,2007.0,1,13.0,...,0,0,0,0,0,0,0,0,1,0
2,3,5,1,1,1,14130.0,12.0,2006.0,1,14.0,...,0,0,0,0,0,0,0,0,1,0
3,4,5,1,1,1,620.0,9.0,2009.0,0,0.0,...,0,0,0,1,0,0,1,0,0,0
4,5,5,1,1,1,29910.0,4.0,2015.0,0,0.0,...,0,0,0,0,0,0,0,0,0,0


The cell below would convert the features and labels to numpy arrays. This is not strictly necessary, so both lines are left commented out. If the labels are converted, they should form a one-dimensional vector, otherwise Scikit-Learn may show a warning message.

In [18]:
# Convert to numpy arrays
#train_feature_ready = train_feature_ready.values


# Sklearn wants the labels as one-dimensional vectors
#train_label = train_label.values
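
A small sketch of the shapes involved, on made-up numbers: selecting a single column, as in `train_comb['Sales']`, yields a pandas Series whose `.values` is already one-dimensional, while a one-column DataFrame gives a 2-D array that `ravel()` flattens.

```python
import pandas as pd

toy = pd.DataFrame({'Sales': [5263, 6064, 8314]})

series_values = toy['Sales'].values      # Series -> 1-D array
frame_values = toy[['Sales']].values     # one-column DataFrame -> 2-D array
flattened = frame_values.ravel()         # back to a 1-D vector

print(series_values.shape, frame_values.shape, flattened.shape)
# (3,) (3, 1) (3,)
```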

In [19]:
x_train, x_valid, y_train, y_valid  = train_test_split(train_feature_ready, 
                                                     train_label, 
                                                     test_size=0.20, 
                                                     random_state=42)

After this minimal data preparation, we can create the neural network. The Keras workflow of building a model and then calling `fit` and `predict` is similar to that of Scikit-Learn models.

I set the following parameters when building the neural network:

- hidden layers = 1 (for the first trial)

- hidden units = 150

- dropout = 0.2

- activation function = relu

- loss function = MSE

- optimizer = RMSprop (the SGD optimizer is left commented out in the code)

After we create the neural network, we fit it to the training data as with any Scikit-Learn machine learning model. We can compare the training time between the neural network and TPOT optimization.

In [20]:
# build sequential neural network model with:
# 150 hidden units
# 0.2 dropout regularizer
# relu activation function
# mean squared error loss function
# RMSprop optimizer

model = Sequential()
model.add(Dense(units=150, input_dim=np.shape(train_feature_ready)[1]))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(units=1))

#sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
rmsprop = RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=1e-6)

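# Note: the string 'rmsprop' selects RMSprop with its default settings,
# so the `rmsprop` object configured above is not used here
model.compile(loss='mean_squared_error', optimizer='rmsprop')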


In [21]:
#fit model to data
history = model.fit(x_train, y_train, 
                    verbose=0, 
                    callbacks=[TQDMNotebookCallback()], 
                    epochs=200, 
                    batch_size=500,
                    validation_data=(x_valid, y_valid))


In [22]:
# evaluate training model
score = model.evaluate(x_train, y_train, batch_size=200)
print('Training error %f' % (score))
print('')

score = model.evaluate(x_valid, y_valid, batch_size=200)
print('Validation error %f' % (score))

Training error 6972980.964292

Validation error 6980199.453419
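
The two numbers above are mean squared errors, so their square root gives an error in sales units. The sketch below also defines a hypothetical `rmspe` helper, assuming the leaderboard metric is the root mean squared percentage error; the helper and its toy inputs are illustrative and not part of the original notebook.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error; rows with y_true == 0 are skipped."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2))

# RMSE implied by the validation MSE printed above
valid_mse = 6980199.453419
valid_rmse = np.sqrt(valid_mse)
print(valid_rmse)                        # roughly 2642 sales units

# Toy check with made-up numbers: both predictions are 10% off
toy_score = rmspe([100, 200], [110, 180])
print(toy_score)
```

On the real data the helper could be called as `rmspe(y_valid, model.predict(x_valid).flatten())`.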


In [23]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 150)               3450      
_________________________________________________________________
dropout_1 (Dropout)          (None, 150)               0         
_________________________________________________________________
activation_1 (Activation)    (None, 150)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 151       
Total params: 3,601
Trainable params: 3,601
Non-trainable params: 0
_________________________________________________________________
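
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. A short sketch of that arithmetic, using the 22 feature columns and 150 hidden units from above:

```python
n_features = 22       # columns in train_feature_ready
hidden_units = 150

dense_1 = n_features * hidden_units + hidden_units   # weights + biases
dense_2 = hidden_units * 1 + 1                       # one output unit
total = dense_1 + dense_2

print(dense_1, dense_2, total)                       # 3450 151 3601
```

Dropout and activation layers have no trainable parameters, which matches the zeros in the summary.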


## Prepare testing data features

1. The test and store data should be combined first

2. Drop the unused features **Date** and **Id**, since they are not included in the train features

3. Categorical features should be one-hot encoded

In [24]:
# merge test and store
test_comb = pd.merge(test,store,on = 'Store',how = 'left')

test_comb.drop('Date',inplace = True, axis =1)

In [25]:
# one-hot encode

test_feature_ready = pd.get_dummies(test_comb,prefix=['StateHoliday', 'StoreType', 'Assortment','PromoInterval'], drop_first=True)

In [26]:
test_feature_ready.drop('Id',inplace = True, axis =1)

In [27]:
test_feature_ready.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41088 entries, 0 to 41087
Data columns (total 20 columns):
Store                             41088 non-null int64
DayOfWeek                         41088 non-null int64
Open                              41088 non-null float64
Promo                             41088 non-null int64
SchoolHoliday                     41088 non-null int64
CompetitionDistance               41088 non-null float64
CompetitionOpenSinceMonth         41088 non-null float64
CompetitionOpenSinceYear          41088 non-null float64
Promo2                            41088 non-null int64
Promo2SinceWeek                   41088 non-null float64
Promo2SinceYear                   41088 non-null float64
StateHoliday_a                    41088 non-null uint8
StoreType_b                       41088 non-null uint8
StoreType_c                       41088 non-null uint8
StoreType_d                       41088 non-null uint8
Assortment_b                      41088 non-null uint8

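## Missing 2 features in testing data

1. only the StateHoliday values '0' and 'a' exist in the testing data

2. therefore the features *'StateHoliday_b'* and *'StateHoliday_c'* are missing (*'StateHoliday_0'* is dropped from both sets by `drop_first=True`), which is why the test set has 20 columns instead of 22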

## Possible ideas to solve missing features in test

1. use the map() method to one-hot encode the feature 'StateHoliday' rather than get_dummies()

2. use a 'label' to combine the train and test sets (to be continued)
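
A further option, sketched here on toy data and not the approach taken in this notebook, is to let pandas align the encoded test columns to the training columns with `DataFrame.reindex`, filling any absent dummy column with 0. The frame names `train_toy` and `test_toy` are hypothetical.

```python
import pandas as pd

# Toy data: the test set lacks the 'b' and 'c' holiday categories
train_toy = pd.DataFrame({'StateHoliday': ['0', 'a', 'b', 'c']})
test_toy = pd.DataFrame({'StateHoliday': ['0', 'a', '0']})

train_enc = pd.get_dummies(train_toy, drop_first=True)
test_enc = pd.get_dummies(test_toy, drop_first=True)

# Align test to the training columns; absent dummies become all-zero columns
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
# ['StateHoliday_a', 'StateHoliday_b', 'StateHoliday_c']
```

This adds the missing columns and fixes the column order in one step, without hard-coding the column names.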

In [28]:
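# Columns expected from the train features but missing in the test features
# ('StateHoliday_0' is not a train column and is discarded by the reordering step below)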
missing_cols = {'StateHoliday_0','StateHoliday_b','StateHoliday_c' }

In [29]:
# Add a missing column in test set with default value equal to 0
for c in missing_cols:
    test_feature_ready[c] = 0

In [30]:
# Ensure the columns in the test set are in the same order as in the train set

#train, test = train.align(test, axis=1)

test_feature_ready = test_feature_ready[train_feature_ready.columns]

In [31]:
# check if the test feature number is 22 or not
test_feature_ready.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41088 entries, 0 to 41087
Data columns (total 22 columns):
Store                             41088 non-null int64
DayOfWeek                         41088 non-null int64
Open                              41088 non-null float64
Promo                             41088 non-null int64
SchoolHoliday                     41088 non-null int64
CompetitionDistance               41088 non-null float64
CompetitionOpenSinceMonth         41088 non-null float64
CompetitionOpenSinceYear          41088 non-null float64
Promo2                            41088 non-null int64
Promo2SinceWeek                   41088 non-null float64
Promo2SinceYear                   41088 non-null float64
StateHoliday_a                    41088 non-null uint8
StateHoliday_b                    41088 non-null int64
StateHoliday_c                    41088 non-null int64
StoreType_b                       41088 non-null uint8
StoreType_c                       41088 non-null uint8

## Predicting on the test data with the final neural network model

In [32]:
# make predictions on the testing data
test_sales = model.predict(test_feature_ready)

In [33]:
test_sales

array([[ 7919.264 ],
       [ 7447.2773],
       [ 9241.919 ],
       ...,
       [ 7663.4795],
       [10060.766 ],
       [ 5899.2393]], dtype=float32)

In [34]:
len(test_sales)

41088

In [35]:
# output test results to csv file

df = pd.DataFrame({"Id":range(1,len(test_sales) + 1),'Sales':test_sales.flatten()})
df.to_csv('submission_NN.csv',index = False)

## The final score on the public leaderboard is ~0.16, which is not good enough

1. For the TPOT regressor model, the light configuration was chosen because the Jupyter kernel dies on my local MBP

2. The light TPOT configuration was chosen following the instructions from https://github.com/EpistasisLab/tpot/issues/745 and https://github.com/EpistasisLab/tpot/issues/546

3. Therefore only simple models are fitted in light TPOT, according to https://epistasislab.github.io/tpot/using/#built-in-tpot-configurations

4. Besides, only 3 hours of training time were given. All of the above factors may prevent TPOT from finding the best model

5. It would be interesting to see what happens on GCP with a GPU, a longer training time and the normal TPOT configuration.