# Simple time series model
### Using Naive Bayes for forecasting
This model relies on an approach commonly known as binning (also called discretization): reducing the scale level of a random variable from numerical to categorical. Binning reduces noise, but at the cost of discarding a considerable amount of information.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from simple_time_series import *
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import mean_absolute_error

## Import and fix dataset

In [None]:
df = pd.read_csv('../../data/dataset.csv')
df.index = pd.to_datetime(df.pop('date'))
df = df.filter(['Canteen'])

Plot number of people vs time

In [None]:
plt.figure(figsize=(18,9))
plt.plot(df)
plt.title('Number of people at Telenor 2016-2019')

In [None]:
plt.figure(figsize=(16,8))
plt.plot(df['2018-02-01':'2018-02-28'])
plt.title('Number of people at Telenor, February 2018')

In [None]:
business_days = df.resample('B').sum()  # 'B' resamples to business-day frequency

In [None]:
plt.figure(figsize=(18,9))
plt.plot(business_days)
plt.title('Number of people on business days')
#plt.style(style=[':', '---', '-'])

## Test for stationarity

Stationarity is defined using a very strict criterion. For practical purposes, however, we can treat a series as stationary if its statistical properties are constant over time, i.e. if it has:

- constant mean
- constant variance
- an autocovariance that does not depend on time.

### Moving average
We can plot the moving average or moving variance and check whether it varies with time. By moving average/variance I mean that at any instant 't' we take the average/variance over the last year, i.e. the last 12 months (a window of 365 observations in the call below). This is more of a visual technique.

In [None]:
test_stationarity(df, 365)

Both the rolling mean and the rolling standard deviation decrease slightly over time, so this is not a stationary series. Also, the test statistic is far smaller than the critical values. Note that the signed values should be compared, not the absolute values.
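
`test_stationarity` comes from the local `simple_time_series` module and is not shown here. As a rough sketch of the visual check described above, assuming a daily series and a 365-observation window, the rolling statistics could be computed as follows. The synthetic data and the name `rolling_stats` are illustrative and are not taken from the module.

```python
import numpy as np
import pandas as pd

def rolling_stats(series, window):
    """Return the rolling mean and rolling standard deviation of a series."""
    roll = series.rolling(window)
    return roll.mean(), roll.std()

# Synthetic stand-in for the canteen counts: a downward trend plus noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2016-01-01", periods=1000, freq="D")
demo = pd.Series(1500 - 0.5 * np.arange(1000) + rng.normal(0, 50, 1000), index=idx)

roll_mean, roll_std = rolling_stats(demo, window=365)

# A stationary series would keep both curves roughly flat;
# here the rolling mean drifts downwards with the trend.
print(roll_mean.dropna().iloc[[0, -1]])
```

Plotting `roll_mean` and `roll_std` on top of the raw series gives the kind of picture the paragraph above refers to.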

In [None]:
split = int(df.shape[0]/2)
x1 = df.iloc[:-split]
x2 = df.iloc[-split:]

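# Select the column so that the statistics are scalars, not one-element Series
mean1, mean2 = x1['Canteen'].mean(), x2['Canteen'].mean()
var1, var2 = x1['Canteen'].var(), x2['Canteen'].var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))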

In [None]:
df.hist()
plt.show()

In [None]:
moving_avg_diff = df - df.rolling(365).mean()
moving_avg_diff.dropna(inplace=True)
test_stationarity(moving_avg_diff)

## Split dataset

In [None]:
test_period = 8 # 20% of total

train = df.iloc[:-test_period]
test = df.iloc[-test_period:]

## Preprocessing

In [None]:
trend_removed = train.diff()

In [None]:
plt.figure(figsize=(18,9))
plt.plot(trend_removed)

In [None]:
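train.std()  # df2 is not defined in this notebook; train is assumed here

In [None]:
from sklearn import preprocessing

# Assumption: the undefined df2 referred to the training data
x = train.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
normalized = pd.DataFrame(x_scaled, index=train.index, columns=train.columns)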

In [None]:
diffs = normalized.diff()

## Binning the data
To bin the data, $K \in \mathbb{Z}^+$ intervals on $\mathbb{R}$ are defined. Each continuous observation $x_t$ is then replaced by an indicator $x_t = k$, $k \in \{1, \dots, K\}$, where $k$ is the interval that $x_t$ falls in. The number of intervals was chosen arbitrarily.

In [None]:
bins = [-1, 1, 200, 500, 1000, 1500, 1700, 1800, 1900, 10000] # Desired bins for people in canteen
# bins = [-1, 1, 100, 300, 500, 700, 1000, 1200, 1500, 1700, 1800, 1900, 10000]

The data are binned, and the mean of the realizations $x_t$ in each interval is saved in a dictionary (`bin_means`) so that an interval category can be mapped back to an actual value.

In [None]:
binned_series, bin_means = bin_data(train, bins)
binned_test_series, bin_test_means = bin_data(test, bins)

In [None]:
binned_series.head()

In [None]:
bin_means
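
`bin_data` is defined in `simple_time_series` and its implementation is not shown here. A minimal sketch of the same idea, assuming `pd.cut` is an acceptable way to assign intervals, might look like this. The function name `bin_series` and the sample counts are hypothetical.

```python
import pandas as pd

def bin_series(values, edges):
    """Assign each value to an interval and return (bin index, per-bin mean)."""
    # With labels=False, pd.cut returns the 0-based index of the interval (a, b],
    # so the classes run from 0 to K-1 rather than 1 to K.
    codes = pd.cut(values, bins=edges, labels=False)
    # Mean of the raw values that fell into each bin, keyed by bin index.
    means = values.groupby(codes).mean().to_dict()
    return codes, means

counts = pd.Series([0, 150, 180, 650, 1750, 1760])
edges = [-1, 1, 200, 500, 1000, 1500, 1700, 1800, 1900, 10000]

codes, means = bin_series(counts, edges)
print(codes.tolist())  # [0, 1, 1, 3, 6, 6]

# Mapping a class back to a value replaces it with the mean of its bin.
restored = codes.map(means)
print(restored.tolist())
```

The round trip shows the information loss mentioned at the start: 150 and 180 both come back as their shared bin mean.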

To forecast future realizations, the classic approach of using $S \in \mathbb{Z}^+$ lagged realizations of $x_t$ is applied. I chose 365 lags, assuming that the process has no autodependence beyond a horizon of one year.

In [None]:
train_x, train_y = get_lagged_list(binned_series, test_period)

In [None]:
train_x.head()

In [None]:
train_y.head()
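
`get_lagged_list` is part of `simple_time_series` and is not shown here. As a sketch of what a lagged design matrix looks like, assuming each row should hold the previous `n_lags` classes and the target is the current class, one could write the following. The name `make_lagged` and the toy sequence are hypothetical.

```python
import pandas as pd

def make_lagged(series, n_lags):
    """Build a feature table whose columns are the previous n_lags values."""
    lagged = pd.concat(
        {f"lag_{k}": series.shift(k) for k in range(1, n_lags + 1)}, axis=1
    )
    # The first n_lags rows have an incomplete history and are dropped.
    lagged = lagged.dropna().astype(int)
    target = series.loc[lagged.index]
    return lagged, target

classes = pd.Series([3, 4, 4, 5, 3, 2, 4])
X, y = make_lagged(classes, n_lags=2)
print(X.assign(target=y))
```

With seven observations and two lags, five usable rows remain; with 365 lags the first year of data serves only as history.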

## Predictive Model

In [None]:
model = create_sts_model(train_x, train_y)
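
`create_sts_model` lives in `simple_time_series` and is not shown here. Assuming it fits a categorical Naive Bayes classifier on the lagged classes, a sketch with scikit-learn's `CategoricalNB` might look like this. The random class sequence and the choice of three lags are purely illustrative.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

n_bins = 9   # ten bin edges define nine intervals
n_lags = 3   # kept small for the example

rng = np.random.default_rng(0)
# Toy class sequence standing in for the binned series.
classes = rng.integers(0, n_bins, size=200)

# Row j holds classes[j], ..., classes[j + n_lags - 1]; the target is classes[j + n_lags].
X = np.column_stack([classes[i : len(classes) - n_lags + i] for i in range(n_lags)])
y = classes[n_lags:]

# min_categories tells the model that every lag feature can take n_bins values,
# even if some of them never appear in the training rows.
model = CategoricalNB(alpha=1.0, min_categories=n_bins)
model.fit(X, y)

next_class = model.predict(X[-1:])          # most probable class for the last window
class_probs = model.predict_proba(X[-1:])   # posterior over the classes seen in y
print(next_class, class_probs.round(3))
```

Naive Bayes treats the lags as conditionally independent given the target class, which is a strong simplification for a time series but keeps the model cheap to fit.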

To map the predicted classes back to the class means from before, I wrote a small function that takes the predicted class as input and returns the corresponding bin mean.

In [None]:
resulting_prediction = find_training_prediction(train_x, train_y, model, bin_means)

### Plot training prediction

In [None]:
plt.figure(figsize = (18,9))
plt.plot(train)
plt.plot(resulting_prediction)

### Finding predictions for test set (out-of-sample forecast)

Out-of-sample forecasts need to be calculated iteratively since lagged values are required.

In [None]:
predictions, pred_class = find_prediction_forecast(test, train_x, train_y, model, bin_means)
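
The iterative scheme can be sketched independently of the model. In the example below, `recursive_forecast` and the stand-in predictor `rounded_mean` are hypothetical names. The actual `find_prediction_forecast` is not shown here and may work differently.

```python
from collections import deque

def recursive_forecast(predict_one, history, n_lags, horizon):
    """Forecast `horizon` steps, feeding each prediction back in as a lag."""
    window = deque(history[-n_lags:], maxlen=n_lags)
    forecasts = []
    for _ in range(horizon):
        next_class = predict_one(list(window))
        forecasts.append(next_class)
        # The oldest lag drops out and the new prediction becomes the latest lag.
        window.append(next_class)
    return forecasts

def rounded_mean(window):
    """Stand-in for a fitted classifier: the rounded mean of the window."""
    return round(sum(window) / len(window))

print(recursive_forecast(rounded_mean, [2, 4, 6], n_lags=3, horizon=4))
# [4, 5, 5, 5]
```

Because every step after the first uses earlier predictions as inputs, errors can accumulate over the forecast horizon.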

# Results

### Plotting prediction of test set

In [None]:
plt.figure(figsize = (18,9))
plt.plot(test)
plt.plot(predictions)

### Plotting both train and test prediction

In [None]:
plt.figure(figsize = (18,9))
plt.plot(train)
plt.plot(resulting_prediction)

plt.plot(test)
plt.plot(predictions)
plt.xlabel("Time")
plt.legend(["In-sample truth", "In-sample prediction", "Out-of-sample truth", "Out-of-sample forecast"])

## RMSE and accuracy

### Testing accuracy

In [None]:
accuracy = test_accuracy(pred_class, binned_test_series)
print('The accuracy is ' + str(accuracy) + ' %')

### RMSE and MAE

RMSE for test data and prediction:

In [None]:
find_RMSE(test, predictions)

In [None]:
mae = mean_absolute_error(test, predictions)
mae

RMSE for training data and prediction:

In [None]:
find_RMSE(train, resulting_prediction)

In [None]:
mae_train = mean_absolute_error(train, resulting_prediction)
mae_train
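
`find_RMSE` and `test_accuracy` come from `simple_time_series` and are not shown here. A minimal sketch of what such metrics usually compute, with hypothetical function names and made-up numbers:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE does."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the data."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def class_accuracy(true_classes, pred_classes):
    """Share of correctly predicted classes, in percent."""
    true_classes = np.asarray(true_classes)
    pred_classes = np.asarray(pred_classes)
    return float(np.mean(true_classes == pred_classes) * 100)

truth = [100, 200, 300, 400]
forecast = [110, 190, 300, 420]

print(rmse(truth, forecast))   # sqrt(150), about 12.25
print(mae(truth, forecast))    # 10.0
print(class_accuracy([1, 2, 2, 3], [1, 2, 3, 3]))  # 75.0
```

Accuracy is measured on the bin classes, while RMSE and MAE are measured on the values restored from the bin means, so the two kinds of metric answer different questions.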