<img src="https://www.telenor.rs/media/TelenorSrbija/media/Telenor_horizontalni.jpg" alt="Drawing" style="width: 300px;"/>

<h1><center>TELENOR DIGITAL: SMARTE BYGG</center></h1>
<h2><center> Summer project 2019 </center></h2>
<h4><center> By Maria Hilmo Jensen, May Helen Storvik and Odd Eirik Igland </center></h4>


# 1 Introduction
Introduce the project and the desired outcome. 

Models we are using:
* Linear regression
* Simple time series model with Naive Bayes
* Facebook Prophet for time series analysis
* Feed Forward Neural Network
* CatBoost Decision Tree
* LSTM Neural Network

## Imports
All the necessary packages and files are imported here.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# All the created prediction models
from models.all_models import *
from models.linear_regression.linear_regression.linear_regression import *
from models.prophet.prophet_model import *
from models.simple_time_series.simple_time_series import *
from models.feed_forward.feed_forward import *
from models.catboost_model.catboost_model import *
from models.linear_regressionlstm.lstm import *
#from analysis.parking_and_canteen import *

import warnings
warnings.filterwarnings("ignore")

# 2 Data Analysis
The team was provided with raw canteen transaction data and parking data from Telenor. This part is about understanding the provided data and collecting additional data from external APIs.

## Parking and canteen data
parking_and_canteen.py:
* Combines the raw data from parking and canteen transactions.
* Finds the correlation between parking and canteen (graph to be added).
* Removes outliers.
* Fills in missing canteen data based on parking data.
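
How parking_and_canteen.py implements the correlation and fill-in steps is not shown here. A minimal sketch of the idea, assuming a linear relation between the two series, hypothetical column names `Parking` and `Canteen`, and made-up toy values, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: daily parked cars and canteen transactions.
df = pd.DataFrame(
    {
        "Parking": [800, 950, 1000, 700, 900, 1050],
        "Canteen": [1600, 1900, np.nan, 1400, 1800, 2100],
    },
    index=pd.date_range("2019-01-07", periods=6, freq="D"),
)

# Pearson correlation, computed on the rows where both values are known.
known = df.dropna()
correlation = known["Parking"].corr(known["Canteen"])

# Fit canteen = slope * parking + intercept on the known rows.
slope, intercept = np.polyfit(known["Parking"], known["Canteen"], deg=1)

# Fill only the missing canteen values with the fitted estimate.
estimate = slope * df["Parking"] + intercept
df["Canteen"] = df["Canteen"].fillna(estimate)
```

The straight-line fit is an assumption for illustration; any regression fitted on the days where both values are known could take its place.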

### Correlation
Describe

In [None]:
get_correlation_parking_canteen()

### Display data
Plots the canteen data for visualization.

In [None]:
#canteen_plot = dt_df_test.filter(['date', 'Canteen'])
#plt.figure(figsize=(16,8))
#plt.plot(canteen_df)
#plt.title('Number of people at Telenor Oct 2016 - Feb 2019')

## Weather
weather.py:
* Historic weather data are collected from the [Frost API](https://frost.met.no/index.html) provided by Meteorologisk institutt.
* Using the weather station at Blindern (no. SN18700)
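
As a rough sketch of how such data could be requested, the snippet below builds a query URL for the Frost observations endpoint and flattens a response into one row per day. The element names, the date range and the helper names `build_frost_url` and `parse_observations` are assumptions, not taken from weather.py, and the sample payload is hand-written to mimic the response layout. An actual request also needs a Frost client ID, which is left out here.

```python
from urllib.parse import urlencode

FROST_OBSERVATIONS_URL = "https://frost.met.no/observations/v0.jsonld"


def build_frost_url(source, elements, start, end):
    """Build an observations query URL for one station and a date range."""
    params = {
        "sources": source,
        "elements": ",".join(elements),
        "referencetime": f"{start}/{end}",
    }
    return f"{FROST_OBSERVATIONS_URL}?{urlencode(params)}"


def parse_observations(payload):
    """Flatten a response payload into one dict per reference time."""
    rows = []
    for item in payload.get("data", []):
        row = {"date": item["referenceTime"][:10]}
        for obs in item["observations"]:
            row[obs["elementId"]] = obs["value"]
        rows.append(row)
    return rows


# Assumed element names for daily maximum and minimum temperature.
elements = ["max(air_temperature P1D)", "min(air_temperature P1D)"]
url = build_frost_url("SN18700", elements, "2019-01-01", "2019-01-03")

# Hand-written sample payload, for illustration only (not real measurements).
sample_payload = {
    "data": [
        {
            "sourceId": "SN18700:0",
            "referenceTime": "2019-01-01T00:00:00.000Z",
            "observations": [
                {"elementId": "max(air_temperature P1D)", "value": 1.5},
                {"elementId": "min(air_temperature P1D)", "value": -3.2},
            ],
        }
    ]
}
rows = parse_observations(sample_payload)
```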

## Holidays and vacation
holiday.py:
* Historic holiday data are collected from [WebAPI](https://webapi.no/api/v1/holidays/).

## Combination
combined_dataset.py:
* Merges the combined canteen/parking data with weather data and with holiday, vacation and "inneklemt" data (an inneklemt day is a working day squeezed in between a holiday and a weekend)
* Stores the created dataframe into a .csv file in the data folder: dataset.csv

In [None]:
dataset = pd.read_csv('../data/dataset.csv', index_col='date')

The data in the file looks like this:

In [None]:
dataset.head()

# 3 Preprocessing
After the raw data files are combined into dataset.csv, the data need to be preprocessed further before they are used by the models. All preprocessing methods are found in the preprocessing folder and are summarized in two methods in preprocessing.py.

## Including historic canteen data 
The team decided to add historic canteen data as columns to the data file because the number of canteen transactions on a given day correlates strongly with recent numbers. The added columns contain the number of canteen transactions made one week ago and on the previous day.
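
A minimal sketch of how such lag columns can be built with pandas, assuming a date-indexed frame. The column names `Canteen_prev_day` and `Canteen_prev_week` and the toy values are hypothetical. Shifting by calendar days rather than by rows means that a day without a predecessor in the data (for example a Monday, when weekends are absent) gets a missing value instead of Friday's number.

```python
import pandas as pd

# Hypothetical toy data: two working weeks of canteen transactions.
dates = pd.bdate_range("2019-01-07", periods=10)
df = pd.DataFrame({"Canteen": range(100, 110)}, index=dates)

# shift(freq="D") moves the index forward in time; assigning the result
# back aligns on the date index, so each row gets the value from exactly
# one (or seven) calendar days earlier, or NaN if that date is absent.
df["Canteen_prev_day"] = df["Canteen"].shift(1, freq="D")
df["Canteen_prev_week"] = df["Canteen"].shift(7, freq="D")
```

Rows whose lagged value is missing have to be dropped or filled before the data are passed to a model.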

The following plots show the relation between these numbers for our data.

## Including weekday and time of the year
We were also interested in the effect of the day of the week and the time of year. Two columns were therefore added: one with the weekday and one with the distance from the start of the year (as a number between 0 and 1).
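
A minimal sketch of these two features, assuming the rows are indexed by date. The column names `weekday` and `time_of_year` are hypothetical, and dividing the zero-based day of the year by the number of days in that year is only one possible way to obtain a value between 0 and 1.

```python
import numpy as np
import pandas as pd

# Hypothetical example dates.
dates = pd.to_datetime(["2019-01-01", "2019-06-28", "2019-12-31"])
df = pd.DataFrame(index=dates)

# Day of the week as an integer: Monday = 0, ..., Sunday = 6.
df["weekday"] = df.index.dayofweek.to_numpy()

# Fraction of the year that has passed: 0 on 1 January, just below 1 on
# 31 December. Leap years are divided by 366 instead of 365.
days_in_year = np.where(df.index.is_leap_year, 366, 365)
df["time_of_year"] = (df.index.dayofyear.to_numpy() - 1) / days_in_year
```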

## Categorizing the temperature
The original data contained both the maximum and the minimum temperature. We combined these into one column by taking the average of the two values. We assumed that temperatures in specific intervals increase the probability that people stay home from work. Therefore, we categorized the average temperatures into two groups: temperatures where you are more likely to skip work (stay_home_temp) and temperatures where you most likely go to work (preferred_work_temp). We assumed that very low temperatures (below -10 degrees), temperatures around 0 degrees (from -2 to +2 degrees) and temperatures above +20 degrees are more likely to affect whether people come to work, and these intervals were therefore chosen as our stay_home_temp. All other temperatures are preferred_work_temp.

With x as the average temperature, the two categories are:

* preferred_work_temp: -10 <= x <= -2 or +2 <= x <= 20
* stay_home_temp: x < -10, -2 < x < +2 or x > 20
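
These thresholds translate directly into a small function. The column names `max_temp`, `min_temp`, `avg_temp` and `temp_category`, the function name and the toy values are hypothetical; only the interval limits and the two category names come from the text.

```python
import pandas as pd


def categorize_temperature(avg_temp):
    """Map an average temperature to one of the two categories."""
    if avg_temp < -10 or -2 < avg_temp < 2 or avg_temp > 20:
        return "stay_home_temp"
    return "preferred_work_temp"


# Hypothetical toy data with daily maximum and minimum temperatures.
df = pd.DataFrame(
    {
        "max_temp": [-8, 3, 12, 27, -1],
        "min_temp": [-16, -3, 4, 17, -3],
    }
)

# Combine the two columns into their average, then categorize it.
df["avg_temp"] = (df["max_temp"] + df["min_temp"]) / 2
df["temp_category"] = df["avg_temp"].apply(categorize_temperature)
```

The boundary values -10, -2, +2 and +20 themselves fall in preferred_work_temp, matching the inclusive limits in the definition.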

## Resulting data format
After the preprocessing steps above, it was necessary to create two different input data formats. The reason is that the machine learning models apply weights to their inputs, so each category of a categorical column is turned into a separate column. All the other models can use the format created so far (now named the decision tree dataframe).
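
One common way to turn each category into its own column is one-hot encoding. The sketch below applies `pandas.get_dummies` to made-up rows; whether preprocessing.py does exactly this, and the column names used here, are assumptions.

```python
import pandas as pd

# Hypothetical rows in the decision tree format: categories stay in one column.
dt_example = pd.DataFrame(
    {
        "Canteen": [1500, 1720, 1610],
        "weekday": [0, 1, 4],
        "temp_category": [
            "preferred_work_temp",
            "stay_home_temp",
            "preferred_work_temp",
        ],
    }
)

# Machine learning format: one 0/1 column per category value.
ml_example = pd.get_dummies(
    dt_example, columns=["weekday", "temp_category"], dtype=int
)
```

Without this step a weighted model would treat the weekday as a quantity, as if Friday (4) were four times Tuesday (1).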

By reading the data from dataset.csv and storing this as a DataFrame, we can then use the save_dataframes method in preprocessing.py. This method includes all the preprocessing steps and stores two new data files named decision_tree_df.csv and ml_df.csv in the data folder.

### Loading the data files
After the new files are created they can be loaded:

In [None]:
dt_df, ml_df = load_datafiles()

The top five rows of the decision tree and machine learning data sets are shown respectively:

In [None]:
dt_df.head()

In [None]:
ml_df.head()

# 4 Models
Throughout this summer the team has worked on several different prediction models, both statistical and machine learning models. All the models, with explanations and visualizations, are presented below.

## Linear regression

Information here

In [None]:
x, y, y_pred, x_test, y_test = linear(dataset) 
plot_linear(x, y, x_test, y_pred)

## Simple time series with Naive Bayes

In [None]:
sts_pred = simple_time_series(dt_df, 30, True)

## Facebook Prophet
Prophet is designed for analyzing time series with daily observations that display patterns on different time scales. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. It also has advanced capabilities for modeling the effects of holidays on a time series.

This model only considers the canteen data and date, as well as Norwegian holidays (provided by Prophet).

In [None]:
prophet(dt_df)

## Feed Forward Neural Network

## CatBoost Decision Tree

## LSTM Neural Network

# 5 Compare the models
All the models are compared on how well they predict the last given number of days of our dataset, as well as 8 days into the future (from today).

### Create test dataframe for model comparison
All the models will be tested against the last given number of days of the data set.  
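
The idea of holding out the last days can be sketched as follows. The helper `hold_out_last_days` is hypothetical and is not the project's `create_dataframe_for_comparison`; it assumes the rows are sorted by date and that the target column is named `Canteen`.

```python
import pandas as pd


def hold_out_last_days(df, test_size, target="Canteen"):
    """Split off the last `test_size` rows.

    Returns the training rows, the real target values of the held-out
    rows, and the held-out rows without the target column.
    """
    train = df.iloc[:-test_size]
    test = df.iloc[-test_size:]
    real = test[[target]]
    test_features = test.drop(columns=[target])
    return train, real, test_features


# Hypothetical toy data: 12 consecutive days.
toy = pd.DataFrame(
    {"Canteen": range(1000, 1012), "weekday": [d % 7 for d in range(12)]},
    index=pd.date_range("2019-02-01", periods=12, freq="D"),
)
train, real, test_features = hold_out_last_days(toy, test_size=8)
```

Splitting by position rather than at random keeps the test days strictly after the training days, which matters for time series.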

In [None]:
test_size = 8

real_canteen, dt_df_test = create_dataframe_for_comparison(dt_df, test_size)
_, ml_df_test = create_dataframe_for_comparison(ml_df, test_size)

### Creating the predictions

In [None]:
merged = create_predictions(dt_df, dt_df_test, ml_df_test)
merged

Plot the predictions from the different models.

In [None]:
plot_all_test_figures(real_canteen, merged)
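
Besides the plots, the models can be ranked with an error metric. The sketch below assumes mean absolute error as the metric and uses made-up numbers with hypothetical model names; it does not reflect actual results from this project.

```python
import pandas as pd

# Made-up real values and predictions, one column per model.
real = pd.Series([1500, 1600, 1550], name="Canteen")
predictions = pd.DataFrame(
    {
        "model_a": [1480, 1650, 1500],
        "model_b": [1400, 1700, 1600],
    }
)

# Mean absolute error per model, sorted so the best model comes first.
mae = predictions.sub(real, axis=0).abs().mean().sort_values()
```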