<img src="https://www.telenor.rs/media/TelenorSrbija/media/Telenor_horizontalni.jpg" alt="Drawing" style="width: 300px;"/>

<h1><center>TELENOR DIGITAL: SMARTE BYGG</center></h1>
<h2><center> Summer project 2019 </center></h2>
<h4><center> By Maria Hilmo Jensen, May Helen Storvik and Odd Eirik Igland </center></h4>

# Introduction
The main purpose of this project is to predict how many people will come in to work at Telenor Fornebu for up to approximately one week into the future. We were provided with data on the number of transactions made in the canteens, as well as parking information, for the period October 2016 to February 2019. Since this was the only available data, we assumed that the number of people at work equals the number of people who eat in the canteens. The data set did not contain canteen transactions for every date between the start and end dates, so we used the parking information to estimate how many people were at Telenor on the days where canteen data were missing. We also decided to consider external parameters, such as temperature, precipitation, vacations, holidays, day of the week and time of year. Finally, we assumed that no one is at work on weekends, so all canteen data for weekends were set to zero.

The team decided to test several different prediction models to find the one performing the best. The models we used were:
* Linear regression
* Simple time series model with Naive Bayes
* Facebook Prophet for time series analysis
* Feed Forward Neural Network
* Catboost Decision Tree
* LSTM Neural Network

## Imports
All the necessary packages and files are imported here.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.path.abspath(''), os.path.pardir)))

from all_models import *
from helpers.helpers import plot_history_df, plot_history, load_model, plot_prediction, plot_history_and_prediction_df, plot_history_and_prediction_ml

# 2 Data Analysis
The team were provided with raw canteen transaction data and parking data from Telenor. This part is about understanding the provided data and collecting more data from external APIs.

## Parking and canteen data
parking_and_canteen.py: 
* Combining the raw data from parking and canteen transactions
* Finding the correlation between parking and canteen
* Removing outliers
* Filling in missing canteen data based on parking
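
The fill-in step can be illustrated with a minimal sketch. It assumes a straight-line relation between parking and canteen counts and uses hypothetical column names (`parking`, `canteen`); the actual logic in parking_and_canteen.py may differ.

```python
import numpy as np
import pandas as pd

def fill_canteen_from_parking(df):
    """Fill missing canteen counts using a straight-line fit on parking counts."""
    known = df.dropna(subset=["canteen"])
    # Least-squares line canteen = slope * parking + intercept, fitted on complete rows
    slope, intercept = np.polyfit(known["parking"], known["canteen"], deg=1)
    estimate = slope * df["parking"] + intercept
    filled = df.copy()
    filled["canteen"] = df["canteen"].fillna(estimate)
    # Assumption stated in the introduction: nobody is at work on weekends
    filled.loc[filled.index.dayofweek >= 5, "canteen"] = 0
    return filled

# Toy data (not project data): canteen is exactly 2 * parking, one weekday value missing.
# 2019-01-07 is a Monday, so the last two rows are a weekend.
dates = pd.date_range("2019-01-07", periods=7, freq="D", name="date")
toy = pd.DataFrame(
    {
        "parking": [100.0, 120.0, 110.0, 130.0, 90.0, 10.0, 5.0],
        "canteen": [200.0, 240.0, np.nan, 260.0, 180.0, 20.0, 10.0],
    },
    index=dates,
)
filled = fill_canteen_from_parking(toy)
```

In this toy example the missing Wednesday is estimated from its parking count, and the weekend rows are set to zero.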

### Correlation
Computes the correlation between the parking data and the canteen transactions.

In [None]:
get_correlation()
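
As an illustration of what such a correlation measures, the sketch below computes the Pearson correlation coefficient with pandas on toy numbers. The column names are assumptions, and `get_correlation()` itself may be implemented differently.

```python
import pandas as pd

# Toy data (not project data); column names are hypothetical
toy = pd.DataFrame({
    "parking": [100, 120, 110, 130, 90],
    "canteen": [210, 235, 225, 265, 175],
})

# Pearson correlation: values close to +1 mean that days with many parked cars
# also tend to be days with many canteen transactions
r = toy["parking"].corr(toy["canteen"])
```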

### Display data
Plots the canteen data for visualization.

In [None]:
display_canteen_data()

## Weather
weather.py:
* Historic weather data are collected from the [Frost API](https://frost.met.no/index.html) provided by Meteorologisk institutt.
* Using the weather station at Blindern (no. SN18700)
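
A rough sketch of how a request for daily observations from this station could be assembled is shown below. The endpoint, the element names and the date range are assumptions that should be checked against the Frost documentation, and `build_frost_query` is a hypothetical helper, not a function from weather.py. The sketch only builds the query and does not send it.

```python
# Assumed observations endpoint of the Frost API
FROST_ENDPOINT = "https://frost.met.no/observations/v0.jsonld"

def build_frost_query(start, end, station="SN18700"):
    """Return query parameters for daily temperature and precipitation (hypothetical helper)."""
    return {
        "sources": station,
        # Element names are assumptions; verify them in the Frost element catalogue
        "elements": ",".join([
            "max(air_temperature P1D)",
            "min(air_temperature P1D)",
            "sum(precipitation_amount P1D)",
        ]),
        "referencetime": f"{start}/{end}",
    }

params = build_frost_query("2016-10-01", "2019-03-01")
# With the third-party `requests` package and a registered client ID, the call
# would look roughly like:
#   requests.get(FROST_ENDPOINT, params, auth=(client_id, ""))
```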

## Holidays and vacation
holiday.py:
* Historic holiday data are collected from [WebAPI](https://webapi.no/api/v1/holidays/).

## Combination
combined_dataset.py:
* Merges the combined canteen/parking data with weather data and holidays, vacations and "inneklemt" days (working days squeezed between a holiday and a weekend)
* Stores the created dataframe into a .csv file in the data folder: dataset.csv
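
The merge can be sketched as a join on the date index. The column names and values below are toy assumptions, not the real contents of dataset.csv.

```python
import pandas as pd

dates = pd.date_range("2019-01-07", periods=3, freq="D", name="date")
canteen = pd.DataFrame({"canteen": [2000, 2100, 1900]}, index=dates)
weather = pd.DataFrame(
    {
        "max_temp": [1.5, 0.0, -3.0],
        "min_temp": [-2.5, -4.0, -8.0],
        "precipitation": [0.0, 2.4, 0.1],
    },
    index=dates,
)
# Toy holiday table that only has rows for special days
holidays = pd.DataFrame({"holiday": [1]}, index=dates[:1])

# Left joins keep every canteen date; dates missing from the holiday table become 0
combined = canteen.join(weather, how="left").join(holidays, how="left")
combined["holiday"] = combined["holiday"].fillna(0).astype(int)
# combined.to_csv("../data/dataset.csv")  # would store the result as described above
```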

In [None]:
dataset = pd.read_csv('../data/dataset.csv', index_col='date')

The data in the file looks like this:

In [None]:
dataset.head()

# 3 Preprocessing
After the raw data files are combined into dataset.csv, the data need to be preprocessed further before they are used by the models. All preprocessing methods are found in the preprocessing folder and are summarized in two methods in preprocessing.py.

## Including historic canteen data 
The team decided to add historic canteen data as columns to the data file, because the number of transactions on a given day correlates strongly with earlier values. The added columns are the number of canteen transactions made one week ago and on the previous day.
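
A minimal sketch of adding such lag columns with pandas is shown below. It assumes one row per calendar day with no gaps, and the column names are hypothetical.

```python
import pandas as pd

dates = pd.date_range("2019-01-07", periods=10, freq="D", name="date")
df = pd.DataFrame(
    {"canteen": [2000, 2100, 2050, 1950, 1700, 0, 0, 2020, 2080, 2040]},
    index=dates,
)

# With one row per day, shifting by rows is the same as shifting by days
df["canteen_prev_day"] = df["canteen"].shift(1)
df["canteen_week_ago"] = df["canteen"].shift(7)

# The first seven rows have no complete history and are dropped
df = df.dropna()
```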

The following plots show the relation between these numbers for our data.

## Including weekday and time of the year
We were also interested in finding the effect of the day of the week and the time of year. Two columns were therefore added: one with the weekday and one with the distance from the start of the year (as a number between 0 and 1).
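
These two columns could be derived from the date index roughly as follows. This is a sketch with hypothetical column names, and the exact definition of the year fraction used in the project is an assumption.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2019-01-01", "2019-12-31", freq="D", name="date")
df = pd.DataFrame(index=dates)

# Monday = 0 ... Sunday = 6
df["weekday"] = np.asarray(df.index.dayofweek)

# Fraction of the year that has passed: 0.0 on 1 January, just below 1.0 on 31 December
days_in_year = np.where(df.index.is_leap_year, 366, 365)
df["time_of_year"] = (np.asarray(df.index.dayofyear) - 1) / days_in_year
```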

## Categorizing the temperature
The original data contained both maximum and minimum temperature. We combined these into one column by taking the average of the two values. We assumed that temperatures in specific intervals increase the probability that people stay home from work. Therefore, we categorized the average temperatures into two groups: temperatures where you are more likely to skip work (stay_home_temp) and temperatures where you most likely go to work (preferred_work_temp). We assumed that very low temperatures (below -10 degrees), temperatures around 0 degrees (between -2 and +2 degrees) and temperatures above +20 degrees are more likely to affect whether people come to work, and these intervals were therefore chosen as our stay_home_temp. All other temperatures are preferred_work_temp.

With x as the average temperature in degrees:
* preferred_work_temp: -10 <= x <= -2 || +2 <= x <= 20
* stay_home_temp: x < -10 || -2 < x < +2 || x > 20
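
The two intervals translate directly into a small function. This is a sketch; `categorize_temperature` is a hypothetical name, not necessarily the one used in the preprocessing folder.

```python
def categorize_temperature(max_temp, min_temp):
    """Return the temperature category for one day, based on the average temperature."""
    avg = (max_temp + min_temp) / 2
    # Boundaries follow the intervals listed above
    if -10 <= avg <= -2 or 2 <= avg <= 20:
        return "preferred_work_temp"
    return "stay_home_temp"

mild_day = categorize_temperature(6, 2)          # average 4
slippery_day = categorize_temperature(1, -1)     # average 0
freezing_day = categorize_temperature(-8, -16)   # average -12
hot_day = categorize_temperature(25, 19)         # average 22
```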

## Resulting data format
After the preprocessing steps described above, it was necessary to create two different input data formats. The reason is that the machine learning models apply weights to their inputs, so each categorical column is expanded into one new column per category (one-hot encoding). All the other models can use the previously created format (now named decision tree dataframe).
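
The difference between the two formats can be illustrated with `pandas.get_dummies`. The column names below are assumptions, and the project may build ml_df.csv in another way.

```python
import pandas as pd

# Decision-tree style: categories stay in a single column each
dt_style = pd.DataFrame({
    "weekday": [0, 1, 4],
    "temp_category": ["preferred_work_temp", "stay_home_temp", "preferred_work_temp"],
    "canteen": [2000, 2100, 1700],
})

# Machine-learning style: every category becomes its own 0/1 column
ml_style = pd.get_dummies(dt_style, columns=["weekday", "temp_category"])
```

Here `ml_style` has one column per observed weekday and one per temperature category, so a weighted model never treats the weekday number as a magnitude.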

Running main_file.py adds all necessary files to the data folder. It performs all the preprocessing steps and stores two new data files, decision_tree_df.csv and ml_df.csv, in the data folder.

### Loading the data files
After the new files are created they can be loaded:

In [None]:
dt_df, ml_df = load_datafiles()

The top five rows of the decision tree and machine learning data sets are shown respectively:

In [None]:
dt_df.head()

In [None]:
ml_df.head()

# Models
Throughout this summer the team has worked on several different prediction models, both statistical and machine learning models. All the models, explanations and visualizations are presented below.

## Linear regression
Linear regression is a linear approach to modeling the relationship between two variables. The linear() function uses the inputs date and historic canteen values in order to predict future values. 

A model using linear regression tries to draw the straight line that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the line. (Source: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)

In [None]:
x, y, y_pred, x_test, y_test = linear(dataset) 
plot_linear(x, y, x_test, y_pred)

Because our canteen data fluctuate, a straight line cannot produce a good prediction of future values, as can be seen in the graph above. The mean absolute error of the linear regression model is usually around 800 people.
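
The effect can be reproduced on toy data. The sketch below fits a least-squares line with NumPy (not the project's `linear()` function) to an invented weekly pattern and computes the mean absolute error.

```python
import numpy as np

# Toy series (not project data): four weeks of weekday attendance with empty weekends
week = [2000, 2100, 2050, 1950, 1700, 0, 0]
y = np.array(week * 4, dtype=float)
x = np.arange(len(y), dtype=float)  # day number stands in for the date

# Ordinary least squares fit of y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

# Mean absolute error: the average size of the prediction error, in people
mae = np.mean(np.abs(y - y_pred))
```

The fitted line ends up between the weekday level and the weekend level, so it is far from almost every observation.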

The model performs slightly better if weekends are removed from the dataset, but the result is still not sufficient compared to the other models. 

## Simple time series with Naive Bayes for forecasting
This model takes advantage of an approach commonly known as binning (discretization, or reducing the scale level of a random variable from numerical to categorical). An advantage of this technique is the reduction of noise. However, this comes at the cost of losing a considerable amount of information.
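
Binning itself can be sketched with `numpy.digitize`. The bin edges and labels below are illustrative assumptions, not the ones used by `simple_time_series()`.

```python
import numpy as np

canteen = np.array([0, 150, 1700, 1950, 2050, 2100])

# Hypothetical bin edges and category labels
edges = np.array([500, 1800, 2000])
labels = np.array(["closed", "low", "normal", "high"])

# For each value, np.digitize returns the index of the bin it falls into
bins = np.digitize(canteen, edges)
categories = labels[bins]
```

The values 2050 and 2100 end up in the same category, which shows both the noise reduction and the loss of information.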

This model only considers the canteen data as a function of time (date).

In [None]:
simple_time_series(dt_df, 30, True)

The graph displays the real and the predicted canteen values for the data set. As we can see, the prediction captures some of the trends in the data set, but fails to detect the decreasing trend over time. The mean absolute error for this model is 181.15 people, which is quite high but still a major improvement compared to the linear regression.

Since this is a time series model, it is made for predicting the values directly following the data set it is given. Therefore, when this model is used to predict the future, it has to predict every day between the last day in the data set and today before it can produce a prediction for future dates.
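
The consequence is easy to quantify. The sketch below counts how many steps the model must forecast; the dates are hypothetical and `steps_to_forecast` is an illustrative helper.

```python
from datetime import date

def steps_to_forecast(last_observed, today, horizon_days):
    """Days the model must predict: the gap up to and including today, plus the horizon."""
    gap = (today - last_observed).days
    return gap + horizon_days

# Hypothetical dates: data ending on 28 February 2019, forecast made on 24 June 2019
steps = steps_to_forecast(date(2019, 2, 28), date(2019, 6, 24), horizon_days=8)
```

With these dates the model has to produce 124 consecutive predictions to reach 8 days ahead, and errors can accumulate along the way.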

## Facebook Prophet
Prophet is designed for analyzing time series with daily observations that display patterns on different time scales. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. It also has advanced capabilities for modeling the effects of holidays on a time-series.

This model also considers only the canteen data and the date, as well as Norwegian holidays (provided by Prophet).
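
Prophet expects its input as a dataframe with a date column named `ds` and a value column named `y`. The sketch below shows that reshaping step on toy data; the source column names are assumptions, and the commented lines only outline how a fit could look, not how `prophet()` in this project is implemented.

```python
import pandas as pd

dates = pd.date_range("2019-01-07", periods=5, freq="D", name="date")
dt_style = pd.DataFrame({"canteen": [2000, 2100, 2050, 1950, 1700]}, index=dates)

# Move the date index into a column and rename to the names Prophet requires
prophet_input = (
    dt_style.reset_index()
    .rename(columns={"date": "ds", "canteen": "y"})[["ds", "y"]]
)

# With the Prophet package installed, a fit could then look roughly like:
#   model = Prophet(interval_width=0.95)
#   model.add_country_holidays(country_name="NO")
#   model.fit(prophet_input)
#   forecast = model.predict(model.make_future_dataframe(periods=8))
```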

In [None]:
prophet(dt_df)

The plot displays the prediction made by Prophet. The black dots represent the real values, the dark blue line is the prediction and the light blue band is a 95 % confidence interval for the prediction. As we can see, the model fits some of the data points and is able to detect trends from holidays. The MAE for this model is 69.91 people, which is a great improvement over the simple time series model. The model performs surprisingly well considering that it only uses the number of people over time and holidays.

A major drawback of this model is that it has to predict the values directly following the data set, in the same way as the simple time series model. This implies that the model requires recent canteen data in order to perform well.

## Feed Forward Neural Network

In [None]:
plot_history_and_prediction_ml(
    load_model("feed_forward_history"),
    load_model("feed_forward_epoch"),
    load_model("feed_forward_test_set_prediction")
)

## Catboost Decision Tree

In [None]:
plot_history_and_prediction_df(
    load_model("catboost_evaluation_result"), 
    load_model("catboost_test_set_prediction")
)

## LSTM Neural Network

A model was made using a recurrent neural network with LSTM (Long Short-Term Memory). The model was made using TensorFlow, which provides a high-level API called Keras for modeling and training neural networks.

The model input includes all relevant external factors that might affect the number of people that go to work on a given day in the near future. All input is scaled to be in the range of 0 to 1.
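
Scaling to the range 0 to 1 is commonly done with min-max scaling. The sketch below shows the idea with NumPy; the feature values are invented, and the project may use a library scaler instead.

```python
import numpy as np

def min_max_scale(values):
    """Scale each column linearly so its minimum maps to 0 and its maximum to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(axis=0), values.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant columns
    return (values - lo) / span

# Toy features: average temperature and precipitation
features = np.array([[-5.0, 0.0], [10.0, 2.5], [25.0, 10.0]])
scaled = min_max_scale(features)
```

When predicting, the minimum and maximum from the training data would normally be reused so that new inputs are scaled consistently.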

Models with a Mean Absolute Error (MAE) score of less than 40 are stored in lstm_model.h5, so that it is easy to reuse the model and its weights.

In [None]:
plot_history(load_model("lstm_history"), load_model("lstm_epoch"))

The graph above displays the result of a trained model (test dataset size=8) with a MAE score of less than 40. This threshold can of course be changed to a lower or higher number as needed. The model's MAE usually differs with the test dataset size, so if you have a test dataset of size 175 then a good MAE score is usua

# Compare the models
All the different models are compared based on how they perform when predicting the last given number of days of our dataset, as well as 8 days into the future (from today).

## Test data from data set

### Create test dataframe for model comparison
All the models will be tested against the last given number of days of the full data set.  
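
The idea of holding out the last days can be sketched as follows. `split_last_days` is a hypothetical stand-in, and `create_dataframe_for_comparison` may work differently.

```python
import pandas as pd

def split_last_days(df, test_size, target="canteen"):
    """Return the real target values and the model inputs for the last `test_size` rows."""
    test = df.iloc[-test_size:]
    real_values = test[target]
    test_inputs = test.drop(columns=[target])
    return real_values, test_inputs

# Toy data (not project data)
dates = pd.date_range("2019-02-01", periods=28, freq="D", name="date")
toy = pd.DataFrame(
    {"canteen": list(range(28)), "weekday": list(dates.dayofweek)},
    index=dates,
)
real, inputs = split_last_days(toy, test_size=8)
```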

In [None]:
test_size = 8

real_canteen, dt_df_test = create_dataframe_for_comparison(dt_df, test_size)
_, ml_df_test = create_dataframe_for_comparison(ml_df, test_size)

The decision tree test dataframe looks like this:

In [None]:
dt_df_test

### Creating the predictions
The following table displays the predictions from all the different models.

In [None]:
merged = create_predictions(dt_df, ml_df, dt_df_test, ml_df_test, False, real_canteen)
merged

Plotting the predictions against the real values:

In [None]:
plot_all_test_predictions(merged)

As we can see, Catboost and LSTM give the most accurate predictions. The simple time series model with Naive Bayes performs worst.
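
A ranking like this can also be expressed numerically by computing the MAE of each model column against the real values. The sketch below uses an invented table with hypothetical column names, since the exact layout of `merged` is not shown here.

```python
import pandas as pd

# Toy stand-in for the table of predictions
merged_toy = pd.DataFrame({
    "real": [2000, 2100, 1900],
    "model_a": [1990, 2120, 1930],
    "model_b": [1800, 2300, 1700],
})

# Absolute error of every model column against the real values, row by row
errors = merged_toy.drop(columns="real").sub(merged_toy["real"], axis=0).abs()

# Mean absolute error per model, best model first
mae_per_model = errors.mean().sort_values()
```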

## Predicting the future
We start by getting the data sets for the next days. These sets are created with the same format as the test data sets from the previous section. 

In [None]:
dt_next, ml_next = load_next_days()

The dataframe for decision trees looks like this:

In [None]:
dt_next

In [None]:
future_merged = create_predictions(dt_df, ml_df, dt_next, ml_next, True)
future_merged

In [None]:
plot_all_test_predictions(future_merged)

As the figure above shows, the predictions vary greatly among the models, but Catboost and LSTM still appear to provide the most plausible results.

# Conclusions and next steps

Based on the achieved results, we recommend using either the Catboost or the LSTM model.


Next steps:
* Update daily with new data to provide a stronger prediction