# Task #2 forecasting #

The goal of this task is to forecast the number of enquiries from US customers for the period from 2017-05-02 to 2017-07-31, given historical data.

First of all, import all needed libraries and set up plotting.

In [1]:
import pandas as pd
import pandasql as ps
import numpy as np

# import matplotlib.pylab as plt
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

Read the file, parse the date column, and use it as the index.

In [2]:
# ISO dates (YYYY-MM-DD) are parsed directly, so no custom date parser is needed.
data = pd.read_csv("forecast_data.csv", parse_dates=["date"], index_col="date")
data.head()

date,user_country,sessions,enquiries
2014-01-01,United Kingdom,404,8
2014-01-01,United States,623,1
2014-01-02,United Kingdom,591,1
2014-01-02,United States,563,8
2014-01-03,United Kingdom,560,4


We are interested only in the US, so keep only the corresponding rows.

In [3]:
data_us = data[data.user_country=="United States"]
data_us.head()

date,user_country,sessions,enquiries
2014-01-01,United States,623,1
2014-01-02,United States,563,8
2014-01-03,United States,690,5
2014-01-04,United States,792,5
2014-01-05,United States,714,4


Plot the data to get an overview, and compute a 100-day rolling mean of enquiries.

In [4]:
rolmean = data_us["enquiries"].rolling(window=100).mean()

iplot(go.Figure(
    data=[go.Scatter(x=data_us.index, y=data_us["enquiries"], name="enquiries"),
          go.Scatter(x=data_us.index, y=data_us["sessions"], name="sessions"),
          go.Scatter(x=data_us.index, y=rolmean, name="mean")], 
    layout=go.Layout(title="Enquiries by US users",
                     yaxis=dict(type="log", title="Number of enquiries"),
                     xaxis=dict(title="Date"))))
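
With `window=100` and default settings, the rolling mean is undefined until 100 observations are available, so the "mean" trace starts later than the raw series. The sketch below illustrates this on a made-up five-value series with a window of 3; the series and the `min_periods=1` variant are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy series standing in for data_us["enquiries"] (values are made up).
s = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

# Default behaviour: NaN until the window is full.
strict = s.rolling(window=3).mean()

# min_periods=1 averages whatever is available so far.
relaxed = s.rolling(window=3, min_periods=1).mean()

print(strict.tolist())
print(relaxed.tolist())
```

Whether the partially filled windows are useful depends on the plot; they are noisier than the full-window values.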

Plot enquiries during the "Expos". It is hard to see any pattern here.

In [5]:
us_expos = [data_us["2016-01-23":"2016-02-01"],
            data_us["2016-09-15":"2016-09-23"],
            data_us["2016-11-25":"2016-11-29"],
            data_us["2016-01-25":"2016-02-03"]]
expo1 = go.Scatter(x=us_expos[0].index, y=us_expos[0]["enquiries"])
expo2 = go.Scatter(x=us_expos[1].index, y=us_expos[1]["enquiries"])
expo3 = go.Scatter(x=us_expos[2].index, y=us_expos[2]["enquiries"])
expo4 = go.Scatter(x=us_expos[3].index, y=us_expos[3]["enquiries"])
fig = tools.make_subplots(rows=1, cols=4)
fig.append_trace(expo1, 1, 1)
fig.append_trace(expo2, 1, 2)
fig.append_trace(expo3, 1, 3)
fig.append_trace(expo4, 1, 4)
fig['layout'].update(height=300, title='Expos from 1 to 4')
iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]  [ (1,4) x4,y4 ]



Check the correlation between sessions and enquiries. They are quite strongly correlated (r ≈ 0.87), so the prediction model uses only the enquiries series as features.

In [125]:
data_us[["sessions", "enquiries"]].corr()

,sessions,enquiries
sessions,1.0,0.874518
enquiries,0.874518,1.0
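
By default `DataFrame.corr()` computes the Pearson coefficient. As a rough illustration of what that number measures, here is a minimal sketch that computes it by hand; the two arrays are made-up values, not the real columns, and `pearson` is a hypothetical helper name.

```python
import numpy as np

# Made-up daily counts standing in for the sessions and enquiries columns.
sessions = np.array([620.0, 560.0, 690.0, 790.0, 710.0])
enquiries = np.array([1.0, 8.0, 5.0, 5.0, 4.0])

def pearson(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

r = pearson(sessions, enquiries)
print(round(r, 4))
```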


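Prepare the dataset. Each sample is a window of 30 consecutive daily enquiry counts (positions i to i+29), and its target is the count at position i+31.

In [6]:
feat_1 = data_us["enquiries"].tolist()
feat_2 = data_us["sessions"].tolist()  # not used below

X = []
y = []
for i in range(len(feat_1)-31):
    X.append(feat_1[i:i+30])
    y.append(feat_1[i+31])
X = np.array(X).astype(float)
y = np.array(y).astype(float)

# Date axis used when plotting predictions: history from the 61st day onward plus May 2017.
x_pred_month = pd.date_range("2017-05-02", "2017-05-31")
x_pred_range = data_us.index[60:].append(x_pred_month)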
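
To make the window-to-target pairing easier to check, the loop above can be sketched as a small function and run on a toy series. `make_windows` and its `window`/`offset` parameters are hypothetical names introduced for illustration; the notebook's values correspond to `window=30, offset=31`.

```python
import numpy as np

def make_windows(series, window=30, offset=31):
    """Pair each run of `window` values with the value `offset` steps after the run's start."""
    X, y = [], []
    for i in range(len(series) - offset):
        X.append(series[i:i + window])
        y.append(series[i + offset])
    return np.array(X, dtype=float), np.array(y, dtype=float)

# Toy series 0..9 with a window of 3 and an offset of 4 (same one-step gap as 30/31).
toy = list(range(10))
X_toy, y_toy = make_windows(toy, window=3, offset=4)
print(X_toy[0], y_toy[0])  # [0. 1. 2.] 4.0
```

On the toy series the first window `[0, 1, 2]` is paired with `4`, so the value `3` directly after the window is skipped. Setting `offset` equal to `window` would instead pair each window with the value immediately following it.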


Split it into training and testing parts...

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

skl = StandardScaler()
X = skl.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
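
`train_test_split` shuffles rows by default, so test windows are interleaved with overlapping training windows, and the scaler above is fitted on all rows. For time series, one commonly used alternative is to hold out the most recent samples and fit the scaling on the training part only. The following is a minimal sketch of that idea on random data, not what this notebook does; `chronological_split` and `standardize` are hypothetical helpers.

```python
import numpy as np

def chronological_split(X, y, test_fraction=0.2):
    """Keep the last `test_fraction` of samples as the test set, preserving order."""
    n_test = int(round(len(X) * test_fraction))
    split = len(X) - n_test
    return X[:split], X[split:], y[:split], y[split:]

def standardize(train, test):
    """Scale both parts using the mean and standard deviation of the training part only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# Random stand-in data: 10 samples with 3 features each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 3))
y_demo = np.arange(10, dtype=float)

X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo)
X_tr_s, X_te_s = standardize(X_tr, X_te)
print(y_te)
```

An error measured this way is usually closer to what can be expected on genuinely future dates, at the cost of a test set that covers only one stretch of the history.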

...and fit an SVM regressor, which is a good baseline for other methods.

In [9]:
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

clf = SVR()
clf.fit(X_train, y_train)
print("\nMSE:", mean_squared_error(y_test, clf.predict(X_test)))
y_predicted_svm = clf.predict(X)
iplot(go.Figure(
    data=[go.Scatter(x=data_us.index, y=data_us["enquiries"], name="enquiries"),
          go.Scatter(x=x_pred_range, y=y_predicted_svm, name="predicted"),
          go.Scatter(x=data_us.index, y=rolmean, name="mean")], 
    layout=go.Layout(title="Prediction by SVM",
                     yaxis=dict(title="Number of enquiries"),
                     xaxis=dict(title="Date"))))


MSE: 24.0010982336
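
MSE is expressed in squared enquiries, so its square root (RMSE) is easier to read against the daily counts. The sketch below spells out the metric on three made-up values and converts the MSE reported above; `mse` is a hypothetical helper that mirrors what `mean_squared_error` computes.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up example: squared errors are 1, 0 and 4, so the mean is 5/3.
demo_mse = mse([3.0, 5.0, 8.0], [4.0, 5.0, 6.0])

# RMSE for the SVR run reported above, in enquiries per day.
svr_rmse = float(np.sqrt(24.0010982336))
print(demo_mse, svr_rmse)
```

An MSE of about 24 therefore corresponds to a typical error of roughly 4.9 enquiries per day.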


Try a simple neural network as the regressor (execution may take a while).

In [11]:
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasRegressor


def model():
    model = Sequential()
    model.add(Dense(64, input_shape=[X_train.shape[1]], activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

regressor = KerasRegressor(build_fn=model, epochs=30, batch_size=5, verbose=0)
regressor.fit(X_train, y_train, verbose=1)
y_predicted_nn = regressor.predict(X)

print("\nMSE:", mean_squared_error(y_test, regressor.predict(X_test)))
iplot(go.Figure(
    data=[go.Scatter(x=data_us.index, y=data_us["enquiries"], name="enquiries"),
          go.Scatter(x=x_pred_range, y=y_predicted_nn, name="prediction"),
          go.Scatter(x=data_us.index, y=rolmean, name="mean")], 
    layout=go.Layout(title="Prediction by NN",
                     yaxis=dict(title="Number of enquiries"),
                     xaxis=dict(title="Date"))))


Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30

MSE: 25.4331379748


In this run the NN (MSE ≈ 25.43) does not beat the SVR baseline (MSE ≈ 24.00). It may perform better with a more sophisticated architecture (a CNN or some kind of RNN) and careful hyperparameter tuning.

Finally, let's take a close look at our predictions for May 2017.

In [425]:
iplot(go.Figure(
    data=[go.Scatter(x=x_pred_month, y=y_predicted_svm[-30:], name="svm"),
          go.Scatter(x=x_pred_month, y=y_predicted_nn[-30:], name="nn")], 
    layout=go.Layout(title="Comparison of predictions by SVR and NN",
                     yaxis=dict(title="Number of enquiries"),
                     xaxis=dict(title="Date"))))
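
As far as the code shows, the values plotted for May are the model outputs for the last 30 history windows. One common way to produce values beyond the end of the data is iterated forecasting: predict one step, append it to the window, and repeat. The sketch below shows the mechanics only. `iterated_forecast` and `predict_next` are hypothetical names; in this notebook `predict_next` would have to wrap the scaler and a model trained to predict the value directly after its window, which is an assumption, not something the cells above provide.

```python
import numpy as np

def iterated_forecast(history, predict_next, window=30, steps=30):
    """Forecast `steps` values by feeding each prediction back into the input window."""
    buffer = [float(v) for v in history[-window:]]
    forecast = []
    for _ in range(steps):
        next_value = float(predict_next(np.array(buffer[-window:])))
        forecast.append(next_value)
        buffer.append(next_value)
    return np.array(forecast)

def window_mean(window_values):
    """Stand-in predictor for the demo: the mean of the current window."""
    return window_values.mean()

demo = iterated_forecast([1.0, 2.0, 3.0, 4.0], window_mean, window=2, steps=3)
print(demo)
```

Errors tend to accumulate with each step, because later predictions depend on earlier ones rather than on observed data.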

With this forecast we can already make some marketing decisions. It is hard to say how holding a new "expo" during this period would affect sales.