<img src="https://www.telenor.rs/media/TelenorSrbija/media/Telenor_horizontalni.jpg" alt="Drawing" style="width: 300px;"/>

<h1><center>TELENOR DIGITAL: SMARTE BYGG</center></h1>
<h2><center> Summer project 2019 </center></h2>
<h4><center> By Maria Hilmo Jensen, May Helen Storvik and Odd Eirik Igland </center></h4>

# Introduction
The main purpose of this project is to predict how many people will come to work at Telenor Fornebu up to approximately one week into the future. The number of transactions made in the canteens, as well as parking information for the period October 2016 to February 2019, were provided. It was assumed that the number of people at work equals the number of people who eat in the canteens. The data set did not contain canteen transactions for all dates between the start date and the end date. Based on the correlation between the parking data and the canteen transactions, we were able to estimate how many people were at Telenor when canteen data were missing.

The team also decided to consider outside parameters, such as temperature, precipitation, vacations, holidays, "inneklemte dager", day of the week and time of year. It was also assumed that no one is at work on weekends, so the number of people at work on those days was set to zero.

Several different prediction models were created to compare performance and determine suitability for the given data set. The models created were:
* Linear regression
* Simple time series model with Naive Bayes
* Facebook Prophet for time series analysis
* Feed Forward Neural Network
* Catboost Decision Tree
* LSTM Neural Network

## Imports
All the necessary packages and files are imported here.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.path.abspath(''), os.path.pardir)))

from all_models import *
from helpers.helpers import plot_history_df, plot_history, load_model, plot_prediction, plot_history_and_prediction_df, plot_history_and_prediction_ml

## Setup

To be able to use all the following methods it is necessary to create all the relevant .csv files. By running the main.py file, they will be added to the data folder. This method includes all the data analysis and preprocessing steps.

To be able to run this method, it is necessary to have a folder called "data" containing "kdr_transactions" and "parking_data" folders with Excel data files.

In [None]:
# main()

# Data Analysis
The team were provided with raw canteen transaction data and parking data from Telenor. This part of the report is about understanding the provided data and collecting more data from external APIs.

## Parking and canteen data
The team created one main Python file for processing the parking and canteen transaction data. This is the parking_and_canteen.py file, and it contains code for the following:
* Combining the raw data from parking and canteen transactions
* Removing outliers (data points that differ significantly from other observations)
* Finding the correlation/dependency between parking and canteen data
* Filling in missing canteen data based on parking

### Correlation
In statistics, dependence is any statistical relationship between two random variables or bivariate data. Correlation is defined as any statistical dependence, though it commonly refers to the degree to which a pair of variables are linearly related. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.

To determine if we could supply the missing canteen data using the provided parking data, it was necessary to first calculate the correlation between these two variables. The correlation coefficient ranges from -1 to 1, and a value close to 1 indicates a strong positive dependency.

In [None]:
get_correlation()

This plot shows the pairwise relationship between parking and canteen data. The plots on the diagonal show the univariate distribution of the data for the variable in that column. A univariate distribution is a probability distribution of only one random variable.

As we can see from the calculated correlation and the graphs above, there is a strong relationship between the two variables. We can therefore exploit their dependency to supply the missing canteen data.  
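As a rough illustration of the idea, the sketch below computes the Pearson correlation between two columns and fills missing canteen values using a least-squares line fitted on the days where both values exist. The column names `parking` and `canteen`, the toy numbers and the helper `fill_missing_canteen` are assumptions made for this example; the actual implementation is in parking_and_canteen.py and may work differently.

```python
import numpy as np
import pandas as pd


def fill_missing_canteen(df):
    """Fill NaN canteen values using a straight line fitted on parking.

    Hypothetical helper for illustration only.
    """
    known = df.dropna(subset=["canteen"])
    # Least-squares fit: canteen ~ slope * parking + intercept
    slope, intercept = np.polyfit(known["parking"], known["canteen"], deg=1)
    estimate = slope * df["parking"] + intercept
    filled = df.copy()
    # Only the missing rows are replaced by the estimate
    filled["canteen"] = df["canteen"].fillna(estimate)
    return filled


# Toy data (made-up numbers, not the Telenor data)
toy = pd.DataFrame({
    "parking": [900.0, 1000.0, 1100.0, 1200.0, 1050.0],
    "canteen": [1500.0, 1650.0, 1800.0, 1950.0, np.nan],
})

correlation = toy["parking"].corr(toy["canteen"])  # Pearson by default
filled = fill_missing_canteen(toy)
print(f"correlation: {correlation:.2f}")
print(filled)
```

A straight-line fit is only reasonable when the correlation is high, which is why the correlation is checked first.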

### Display data
The following plot displays the complete data set.

In [None]:
display_canteen_data()

As can be seen from the plot above, the number of people on weekdays normally fluctuates between 1500 and 2000. It is also possible to spot periods where the number of people declines significantly, for example during summer vacation, Easter and Christmas.

## Weather
Historical weather data was collected from the [Frost API](https://frost.met.no/index.html) provided by Meteorologisk institutt. The team decided to use the weather station at Blindern (no. SN18700) because this was the closest weather station containing all necessary data.

It was decided to only use precipitation and temperature, because these variables were considered to have the most impact on whether people show up for work.

## Holidays and vacation
Holidays, vacations and "inneklemte dager" will affect the use of the office spaces. They are defined as:
* Holidays: All days registered as official Norwegian holidays
* Vacation: Summer, winter and autumn vacation given by the Oslo and Akershus municipality
* "Inneklemte dager": One or two working days that fall between two days off from work (either holiday or weekend)

Historical holiday data were collected using [WebAPI](https://webapi.no/api/v1/holidays/). An algorithm for finding "inneklemte dager" was created. The code can be found in holiday.py.
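The definition of "inneklemte dager" given above can be sketched as a small check. This is not the algorithm from holiday.py; the function names `is_day_off` and `is_inneklemt` and the `max_run` parameter are hypothetical, and the only assumption is the definition itself: a run of one or two working days with a day off on both sides.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def is_day_off(day, holidays):
    """A day off is a Saturday, a Sunday or a listed holiday."""
    return day.weekday() >= 5 or day in holidays


def is_inneklemt(day, holidays, max_run=2):
    """True if `day` is a working day in a run of at most `max_run`
    consecutive working days with a day off on both sides."""
    if is_day_off(day, holidays):
        return False
    # Walk backwards and forwards to the edges of the working-day run,
    # stopping early once the run is already too long to qualify.
    first = day
    while not is_day_off(first - ONE_DAY, holidays) and (day - first).days < max_run:
        first -= ONE_DAY
    last = day
    while not is_day_off(last + ONE_DAY, holidays) and (last - day).days < max_run:
        last += ONE_DAY
    return (last - first).days + 1 <= max_run


# Ascension Day 2019 fell on Thursday 30 May, so Friday 31 May is squeezed
# between the holiday and the weekend, while Wednesday 29 May is not.
holidays_2019 = {date(2019, 5, 30)}
print(is_inneklemt(date(2019, 5, 31), holidays_2019))
print(is_inneklemt(date(2019, 5, 29), holidays_2019))
```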

As seen in the figure above, summer vacation has a bigger impact on the number of people at work than autumn and winter vacation. Holidays have an even larger effect.

## Combination
combined_dataset.py:
* Merges the combined canteen/parking data with weather data and holidays/vacation/inneklemt
* Stores the created dataframe into a .csv file in the data folder: dataset.csv
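A merge of this kind could look roughly like the sketch below, which joins three small dataframes on a shared date index. The column names and values are invented for the example, and combined_dataset.py may combine the sources differently.

```python
import pandas as pd

dates = pd.to_datetime(["2019-01-07", "2019-01-08", "2019-01-09"])

# Toy stand-ins for the three data sources (made-up values)
canteen = pd.DataFrame({"canteen": [1800, 1750, 1900]}, index=dates)
weather = pd.DataFrame(
    {"precipitation": [0.0, 3.2], "max_temp": [1.5, -0.5]},
    index=dates[:2],  # weather is missing for the last date
)
holidays = pd.DataFrame({"holiday": [0, 0, 0]}, index=dates)

# Left joins keep every canteen date, even when another source lacks it
combined = canteen.join(weather, how="left").join(holidays, how="left")
combined.index.name = "date"
print(combined)
```

Using left joins from the canteen data means that a gap in the weather data shows up as a missing value rather than silently dropping the date.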

In [None]:
dataset = pd.read_csv('../data/dataset.csv', index_col='date')

The data in the file looks like this:

In [None]:
dataset.head()

# Preprocessing
After the raw data files are combined into dataset.csv, the data need to be preprocessed further before they are used by the models. All preprocessing methods are found in the preprocessing folder and are summarized in one method in preprocessing.py.

## Including historic canteen data 
The team discovered that the number of canteen transactions made one week earlier and on the previous day correlated strongly with the number of transactions on the current date.

The following plots show the relation between these numbers for our data.
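Historic values of this kind are often added as lag columns. The sketch below shows one way to do that with `shift` in pandas; the column names `canteen_prev_day` and `canteen_week_ago` and the toy numbers are assumptions for the example, not the names used in the preprocessing folder.

```python
import pandas as pd

# Ten consecutive days starting on Monday 7 January 2019 (made-up values,
# with the weekend set to zero as in the rest of the report)
dates = pd.date_range("2019-01-07", periods=10, freq="D")
toy = pd.DataFrame(
    {"canteen": [1800, 1750, 1900, 1850, 1600, 0, 0, 1780, 1760, 1880]},
    index=dates,
)

# Shift the series down so each row sees an earlier day's value
toy["canteen_prev_day"] = toy["canteen"].shift(1)
toy["canteen_week_ago"] = toy["canteen"].shift(7)
print(toy)
```

The first rows have no earlier value to look up, so they contain missing values that need to be dropped or filled before training.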

## Including weekday and time of the year
It was interesting to find the effect of the day of the week and the time of year. Two columns were therefore added, one containing the weekday and one containing the distance from the start of the year (as a number between 0 and 1).

These variables will most likely not have a great effect on the final prediction, due to the small amount of data. They are added for future usage.  
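A minimal sketch of these two features is shown below. The function name `weekday_and_year_fraction` is hypothetical, and dividing the day offset by the number of days in the year is an assumption about how the distance from the start of the year is measured.

```python
import calendar
from datetime import date


def weekday_and_year_fraction(day):
    """Return (weekday, fraction of the year elapsed) for a date.

    The weekday is 0 for Monday through 6 for Sunday. The fraction is 0.0
    on 1 January and stays below 1.0 on 31 December.
    """
    days_in_year = 366 if calendar.isleap(day.year) else 365
    day_of_year = day.timetuple().tm_yday  # 1 on 1 January
    return day.weekday(), (day_of_year - 1) / days_in_year


print(weekday_and_year_fraction(date(2019, 1, 1)))
print(weekday_and_year_fraction(date(2019, 7, 2)))
```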

## Categorizing the temperature
The original data contained both maximum and minimum temperature. It was decided to combine these into one column by taking the average of the two values. The assumption was made that temperatures in specific intervals increase the probability that people stay home from work. Therefore, the average temperatures were categorized into two groups:
* Temperatures where you are more likely to skip work (stay_home_temp)
* Temperatures where you most likely go to work (preferred_work_temp)

It was assumed that very low temperatures (below -10 degrees), temperatures around 0 degrees (from -2 to +2 degrees) and temperatures above +20 degrees are more likely to affect whether people come to work, and these intervals were therefore chosen as stay_home_temp. All other temperatures are preferred_work_temp.

With x as the average temperature in degrees, the two categories are:
```
preferred_work_temp:  -10 <= x <= -2  or  2 <= x <= 20
stay_home_temp:       x < -10  or  -2 < x < 2  or  x > 20
```
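The intervals can be expressed as a small function. This is a sketch based only on the intervals stated in this section; the function name `categorize_temperature` is hypothetical and the code in the preprocessing folder may be organized differently.

```python
def categorize_temperature(max_temp, min_temp):
    """Categorize a day by the average of its maximum and minimum temperature."""
    average = (max_temp + min_temp) / 2
    # Very cold, around freezing, or hot: assumed to keep people at home
    if average < -10 or -2 < average < 2 or average > 20:
        return "stay_home_temp"
    return "preferred_work_temp"


print(categorize_temperature(1, -1))   # average 0
print(categorize_temperature(12, 4))   # average 8
```

The boundary values -10, -2, 2 and 20 themselves fall in preferred_work_temp, matching the inequalities listed above.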

## Resulting data format
After the previously mentioned preprocessing steps were taken, it was necessary to create two different input data formats: one containing categorical values and one without. The reason for this is that the machine learning models consider weighted inputs, and the input needs to be normalized to the range 0 to 1. To do this, all columns must have numeric values and can therefore not be categorical.
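One common way to turn a categorical column into numeric input and scale everything to the range 0 to 1 is one-hot encoding followed by min-max scaling, sketched below. This is an assumption about the general technique, not a description of how ml_df.csv is actually produced; the column names and the helper `min_max_scale` are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "precipitation": [0.0, 2.5, 10.0],
    "temperature_category": [
        "preferred_work_temp",
        "stay_home_temp",
        "preferred_work_temp",
    ],
})

# One numeric 0/1 column per category value
encoded = pd.get_dummies(toy, columns=["temperature_category"], dtype=float)


def min_max_scale(column):
    """Scale a numeric column linearly so its values lie between 0 and 1."""
    span = column.max() - column.min()
    if span == 0:
        return column * 0.0  # constant column: map everything to 0
    return (column - column.min()) / span


scaled = encoded.apply(min_max_scale)
print(scaled)
```

When scaling data for prediction, the minimum and maximum from the training data should be reused so that new inputs are on the same scale.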

After following the instructions in *Setup*, all necessary files have been added to the data folder. These include, among others, two new data files named decision_tree_df.csv and ml_df.csv.

### Loading the data files
After the new files are created they can be loaded:

In [None]:
dt_df, ml_df = load_datafiles()

The top five rows of the decision tree and machine learning data sets are shown respectively:

In [None]:
dt_df.head()

In [None]:
ml_df.head()

# Models
Throughout this summer the team has worked on several different prediction models, both statistical and machine learning models. All the models, explanations and visualizations are presented below.

## Linear regression
Linear regression is a linear approach to modeling the relationship between two variables. The linear() function uses the date and the historic canteen values as inputs in order to predict future values.

A model using linear regression will try to draw a straight line that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. (Source: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)

In [None]:
x, y, y_pred, x_test, y_test = linear(dataset) 
plot_linear(x, y, x_test, y_pred)

As our canteen data fluctuates, a straight line cannot produce a good prediction of future values, as can be seen in the graph above. The mean absolute error of the linear regression model is usually around 800 people.

The model performs slightly better if weekends are removed from the dataset, but the result is still not sufficient compared to the other models. 
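The mean absolute error (MAE) used to compare the models in this report is the average of the absolute differences between the real and the predicted values. A minimal sketch with made-up numbers is shown below; the function is written out here for illustration and is not taken from the project code.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up example: errors of 50, 100 and 0 people
real = [1800, 1700, 0]
guess = [1750, 1800, 0]
print(mean_absolute_error(real, guess))
```

Because the error is measured in people, an MAE of 800 means the prediction is off by about 800 people on an average day.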

## Simple time series with Naive Bayes for forecasting
This model takes advantage of one approach that is commonly known as binning (discretization or reducing the scale level of a random variable from numerical to categorical). An advantage of this technique is the reduction of noise - however, this comes at the cost of losing quite an amount of information.
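To make the idea of binning concrete, the sketch below maps numeric values to equal-width bins with NumPy. Equal-width bins, the function name `bin_values` and the toy numbers are assumptions for the example; the binning inside simple_time_series() may be done differently.

```python
import numpy as np


def bin_values(values, n_bins):
    """Map each value to a bin label from 0 to n_bins - 1 (equal-width bins)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Use only the inner edges so the maximum value lands in the last bin
    return np.digitize(values, edges[1:-1])


counts = [0, 500, 1000, 1500, 2000]
print(bin_values(counts, 4))
```

All values inside one bin become indistinguishable after this step, which is the loss of information mentioned above.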

This model only considers the canteen data as a function of time (date).

In [None]:
simple_time_series(dt_df, 30, True)

The graph displays the real canteen values and the predicted canteen values for the data set. As we can see, the prediction can discover some of the trends in the data set, but fails to detect the decreasing trend over time. The mean absolute error for this model is 181.15 people, which is quite high but still a major improvement compared to the linear regression.

Since this is a time series model, it is made for predicting the values directly following the data set it is provided. Therefore, when this model is going to predict the future, it has to predict every day between the last day in the data set and today before it can come up with a prediction for future dates.

## Facebook Prophet
Prophet is designed for analyzing time series with daily observations that display patterns on different time scales. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. It also has advanced capabilities for modeling the effects of holidays on a time-series.

This model also only considers the canteen data and date, as well as Norwegian holidays (provided by Prophet).

In [None]:
prophet(dt_df)

The plot displays the prediction made by Prophet. The black dots represent the real values, the dark blue line is the prediction and the light blue lines are a 95 % confidence interval for the prediction. As we can see, the model fits some of the data points and is able to detect trends from holidays. The MAE for this model is 69.91 people, which is a great improvement over the simple time series model. The model performs surprisingly well considering that it only uses the number of people over time and holidays.

A major drawback of this model is that it has to predict the values directly following the data set, in the same way as the simple time series model. This implies that the model requires recent canteen data in order to perform well.

## Feed Forward Neural Network

The feed forward model is made using TensorFlow with Keras. It is a simple neural network with one hidden layer. This model is trained with the machine learning dataset.

20% of the dataset is used as a test set. From the plots below it is possible to see that the training error decreases with the number of epochs and stays close to the validation error. This indicates that the model is not overfitted.

The plot on the right shows the quality of the prediction. The red line shows the true values, so dots close to the line are good predictions. This model is most accurate when many people are working, probably because that is where the model has most of its data.

In [None]:
plot_history_and_prediction_ml(
    load_model("feed_forward_history"),
    load_model("feed_forward_epoch"),
    load_model("feed_forward_test_set_prediction")
)

## Catboost Decision Tree

Catboost is a Python library for gradient boosting on decision trees, for both regression and classification problems. A decision tree splits the dataset by one attribute at a time. Attributes that split most of the dataset are chosen first. The data point you want to predict goes through the tree and ends up in a leaf node, which holds the predicted value. Since this is a regression problem, that value is the number of people in the canteen.

Looking at the training and validation errors, both flatten out at a mean absolute error (MAE) of around 70-80. The graph on the right shows quite a good result. Compared to the feed forward model, this model has its dots closer to the red line, which indicates that it is the better model.

In [None]:
plot_history_and_prediction_df(
    load_model("catboost_evaluation_result"), 
    load_model("catboost_test_set_prediction")
)

## LSTM Neural Network

A model was made using a recurrent neural network with LSTM (Long Short-Term Memory). The model was made using TensorFlow, which provides a high-level API called Keras for modeling and training neural networks.

The model input includes all relevant external factors that might affect the amount of people that go to work a given day in the near future. All input is scaled to be in the range of 0 to 1. 

Models with a Mean Absolute Error (MAE) score of less than 40 are stored in lstm_model.h5, so that it is easy to reuse the model and its weights.

In [None]:
plot_history(load_model("lstm_history"), load_model("lstm_epoch"))

The graph above displays the result of a trained model (test dataset size=8) with a MAE score of less than 40. This threshold can of course be changed to a lower or higher number as needed. The model's MAE usually differs with the test dataset size, so if you have a test dataset of size 175 then a good MAE score is usua

# Compare the models
All the different models are compared based on how well they predict a given number of the last days in our dataset, as well as 8 days into the future (from today).

## Test data from data set

### Create test dataframe for model comparison
All the models will be tested against a given number of the last days in the full data set.

In [None]:
test_size = 8

real_canteen, dt_df_test = create_dataframe_for_comparison(dt_df, test_size)
_, ml_df_test = create_dataframe_for_comparison(ml_df, test_size)

The decision tree test dataframe looks like this:

In [None]:
dt_df_test

### Creating the predictions
The following table displays the predictions from all the different models.

In [None]:
merged = create_predictions(dt_df, ml_df, dt_df_test, ml_df_test, False, real_canteen)
merged

Plotting the predictions against the real values:

In [None]:
plot_all_test_predictions(merged)

As we can see, Catboost and LSTM are the most accurate models. The simple time series model with Naive Bayes performs worst.

## Predicting the future
We start by getting the data sets for the next days. These sets are created with the same format as the test data sets from the previous section. 

In [None]:
dt_next, ml_next = load_next_days()

The dataframe for decision trees looks like this:

In [None]:
dt_next

In [None]:
future_merged = create_predictions(dt_df, ml_df, dt_next, ml_next, True)
future_merged

In [None]:
plot_all_test_predictions(future_merged)

As we can see from the figure above, the prediction results vary greatly among the different models, but it appears that Catboost and LSTM still provide the most likely results.

# Conclusions and next steps

Based on the achieved results, we recommend using either the Catboost or the LSTM model. This recommendation is based on the current amount of data; with a bigger data set, other results could occur. Some of the input parameters probably make no difference at the moment, but several years of data could give models that predict yearly trends.

Next steps:
* Update daily with new data to provide a stronger prediction
* Make models more general, so they can be used by other buildings and companies
* Figure out which input parameters affect the result. Could some of them be excluded, or others added (for example change in air pressure)?

Use this prediction to:
* Avoid food waste
* Save energy by closing floors in quiet periods
* Strengthen the utilization algorithm
* Find the distribution on the canteens based on the menu
