_____

<table align="left" width=100%>
    <td>
        <div style="text-align: center;">
          <img src="./images/bar.png" alt="entidades financiadoras"/>
        </div>
    </td>
    <td>
        <p style="text-align: center; font-size:24px;"><b>Introduction to Data Science</b></p>
        <p style="text-align: center; font-size:18px;"><b>Master in Electrical and Computer Engineering</b></p>
        <p style="text-align: center; font-size:14px;"><b>Pedro Cardoso (pcardoso@ualg.pt)</b></p>
    </td>
</table>

_____

__Short Lesson Title:__ Time Series Regression Models

*__Summary:__ This lesson introduces various regression techniques specifically designed for time series data. Students will learn about the challenges of applying standard regression methods to time series (like autocorrelation and non-stationarity) and how specialized approaches address these issues. The lesson will likely cover models such as Autoregressive (AR), Moving Average (MA), Autoregressive Moving Average (ARMA), and Autoregressive Integrated Moving Average (ARIMA) models. It will explain the underlying principles of each model, including model order selection and parameter estimation. Students will gain practical experience in fitting these models to time series data, evaluating their performance using appropriate metrics, and making predictions. The lesson may also touch upon more advanced techniques like SARIMA (Seasonal ARIMA) for data with seasonal components.*


# Regression in Time Series

**Time series regression** is a statistical technique used to model the relationship between a dependent variable and one or more independent variables that change over time. In other words, it is a way of **predicting future values of a variable based on its past values and the values of other related variables**.

In time series regression, the dependent variable is a time series, which means that it is **assumed** to be a sequence of **observations taken at regular intervals over time**. The **independent variables** can also be time series or other types of variables that are believed to affect the dependent variable.

The goal of **time series regression is to identify the underlying patterns and trends** in the data and use this information to **make accurate predictions about future values of the dependent variable**. This is typically done by fitting a statistical model to the data and estimating the coefficients of the model using a variety of techniques, such as least squares or maximum likelihood estimation.

We should distinguish between **time series regression** and **time series forecasting**.

- In **time series forecasting**, the goal is to predict the future values of a variable based on its past values alone.
- In **time series regression**, the goal is to predict the past, present or future values of a variable based on its past values and the values of other related variables.

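In both cases the model is trained on known input-output pairs, so both are supervised learning problems. They differ in their inputs: forecasting uses only the history of the target variable, while regression also uses other related variables.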
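The difference in inputs can be illustrated with a minimal sketch. The series, the `temperature` column and the `_lag` column names are all hypothetical and serve only to show how the two feature sets could be built with `pandas`.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly target series plus one related (exogenous) variable
idx = pd.date_range("2020-01-01", periods=8, freq="MS")
df_demo = pd.DataFrame(
    {"y": np.arange(10.0, 18.0), "temperature": np.linspace(5.0, 19.0, 8)},
    index=idx,
)

# Forecasting-style inputs: past values of the target only
df_demo["y_lag1"] = df_demo["y"].shift(1)
df_demo["y_lag2"] = df_demo["y"].shift(2)
features_forecasting = ["y_lag1", "y_lag2"]

# Regression-style inputs: past values of the target plus other variables
features_regression = features_forecasting + ["temperature"]

# The first two rows have no complete lag history, so they are dropped
df_model = df_demo.dropna()
print(df_model[features_regression + ["y"]])
```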

## Evaluate Models

Evaluating a time series regression model involves assessing the model's performance and accuracy in predicting future values of the time series. Next, some common methods for evaluating time series regression models are listed.

### Train-test split
When training a machine learning model, it's essential to evaluate its performance on data that hasn't been used during training. This is done to avoid overfitting, where the model performs well on the training data but poorly on new data. To achieve this, the original dataset is usually split into a training set and a test set. However, there are cases where a third subset, called the validation set, is added to the mix. Here are the main differences between two popular data splitting strategies: train-test and train-validation-test.

One potential issue with the train-test split is that the test set might be too small to provide a reliable estimate of the model's performance. The smaller the test set, the more likely it is that the evaluation metric will be affected by chance. A second issue is that, if the test set is also used to choose hyperparameters, it no longer gives an unbiased estimate of the performance on new data. To address the latter, the train-validation-test split is often used.

#### Train-Validation-Test Split

The train-validation-test split involves dividing the original dataset into three subsets: a **training set, a validation set, and a test set**. The training set is used to train the model, the validation set is used to evaluate the model's performance during training and to tune the hyperparameters of the model, and the test set is used to evaluate the final performance of the model after training and tuning.

During training, the model is evaluated on the validation set after every epoch or after a fixed number of iterations. The hyperparameters of the model are then adjusted based on the performance on the validation set. This process is repeated until the best hyperparameters are found. Once the model has been fully trained and tuned, it is evaluated on the test set to obtain an estimate of its performance on new data.

One advantage of the train-validation-test split is that it provides a more accurate estimate of the model's performance on new data. By using a separate validation set, we can detect when the model is overfitting to the training data. By keeping the test set out of both training and hyperparameter tuning, we ensure that the final evaluation is not optimistically biased by the tuning process.


#### Train-Test Split in time series

In time series analysis, a common approach to model evaluation is also to use a train-test split. However, the traditional random train-test split is not suitable for time series data, because it assumes that the data points are independent and identically distributed (iid). In time series data, the order of the data points matters and consecutive observations are correlated, so a random split would let the model train on observations that come after the ones it is tested on.

A more appropriate approach for time series data is to use a "rolling window" or "walk-forward" validation technique. This involves training the model on a portion of the data and testing it on a subsequent portion, sliding the window forward until the entire dataset has been used for testing. This allows for the evaluation of the model's performance on unseen data while maintaining the temporal ordering of the data.

In the rolling window technique, the size of the test set can be chosen based on the application requirements and the available data. However, a common practice is to use a test set that is large enough to provide a reliable estimate of the model's performance but small enough to allow for multiple iterations of the rolling window approach. The performance metrics are then aggregated over the different iterations to obtain a more robust estimate of the model's performance.
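A minimal sketch of walk-forward validation, assuming scikit-learn's `TimeSeriesSplit` and a synthetic series (linear trend plus noise) invented for illustration. Note that `TimeSeriesSplit` grows the training window at each fold by default; a fixed-size sliding window can be obtained with its `max_train_size` parameter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic series (assumption): linear trend plus Gaussian noise
rng = np.random.default_rng(0)
t_demo = np.arange(120).reshape(-1, 1)
y_demo = 0.5 * t_demo.ravel() + rng.normal(scale=2.0, size=120)

# 5 folds, each tested on the 12 observations that follow its training data
tscv = TimeSeriesSplit(n_splits=5, test_size=12)

fold_mae = []
for train_idx, test_idx in tscv.split(t_demo):
    # training indices always precede the test indices
    fold_model = LinearRegression().fit(t_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(t_demo[test_idx])
    fold_mae.append(mean_absolute_error(y_demo[test_idx], fold_pred))

# aggregate the metric over the folds
print("MAE per fold:", np.round(fold_mae, 2))
print("Mean MAE:", np.mean(fold_mae))
```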

Let us see some examples using the airline passengers dataset.

In [None]:
import pandas as pd

df_passengers = pd.read_csv('./data/passengers_TS/passengers.csv', parse_dates=True)
df_passengers['Month'] = pd.to_datetime(df_passengers['Month'])
df_passengers['Year'] = df_passengers['Month'].dt.year
df_passengers['Month'] = df_passengers['Month'].dt.month

df_passengers.drop('Unnamed: 0', axis=1, inplace=True)

df_passengers

### Polynomial features
Besides using month and year as features and #Passengers as the target, it is also possible to increase the number of features using polynomial features. Polynomial features are a feature engineering technique used in machine learning to create non-linear models by adding polynomial terms to the original features.

In simpler terms, polynomial features are created by taking the original features of a dataset and raising them to a power. For example, if we have a dataset with one feature $x$, we can create polynomial features by adding a new feature $x^2$. Obviously, we can also add higher order terms like $x^3$, $x^4$, and so on.
For degree 2, the polynomial features are generated as follows:
$$(x_1, x_2, \dots, x_n) \rightarrow (1, x_1, x_2, \dots, x_n, x_1^2, x_1x_2, \dots, x_1x_n, x_2^2, x_2x_3,\dots, x_n^2)$$



The purpose of adding polynomial features is to capture non-linear relationships between the features and the target variable, which a purely linear model on the original features would miss. Adding polynomial terms increases the number of dimensions of the dataset, which leads to a more flexible model that can fit more complex relationships. On the other hand, it can lead to overfitting, where the model performs well on the training data but poorly on new, unseen data; to increased model complexity, which can make it harder to interpret the model and identify the most important features; and to increased computation time to train the model, especially if a large number of features is generated.

Overall, polynomial features can be a useful tool for improving the performance of machine learning models when the relationship between the features and the target variable is non-linear.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# get the features and target
X = df_passengers[['Month', 'Year']]
y = df_passengers['#Passengers']

# create the PolynomialFeatures model
poly_features_model = PolynomialFeatures(degree=5)

# fit and apply model
X = pd.DataFrame(poly_features_model.fit_transform(X),
                 columns = poly_features_model.get_feature_names_out(['Month', 'Year']))

X

#### Model

Now, we will use a pipeline to implement a ML solution, which starts by scaling data, before training the model. More about the Ridge model will be discussed below.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# split the data into training and test sets, holding out the last year (12 months) for testing
n_test = 12
X_train = X.iloc[:-n_test]
y_train = y.iloc[:-n_test]

X_test = X.iloc[-n_test:]
y_test = y.iloc[-n_test:]

# fit a regression model to the training data
reg_model = Pipeline([
    ('scaler', StandardScaler()),
    ('Ridge', Ridge(alpha=0.001, max_iter=10**4))
])
reg_model.fit(X_train, y_train)

# use the model to predict values for the test set
y_test_pred = reg_model.predict(X_test)

# build a dataframe with the real, predicted and error values
df_test_pred = pd.DataFrame()
df_test_pred['# passenger'] = y_test
df_test_pred['# pred passenger'] = y_test_pred
df_test_pred['error'] = df_test_pred['# passenger'] - df_test_pred['# pred passenger']
df_test_pred

### Visual inspection
In addition to numerical metrics, it's important to visually inspect the model's performance. This can include plotting the predicted values against the actual values and examining the residual plots to ensure that the model is capturing the underlying patterns in the data.


In [None]:
df_test = pd.DataFrame(y_test)
df_test["#Passengers predicted"] =  y_test_pred
df_test.set_index(pd.to_datetime(dict(year=X_test.Year, month=X_test.Month, day=1)), inplace=True)
df_test.plot()

### Mean absolute error (MAE)
The MAE is a common metric used to evaluate time series regression models. It measures the average absolute difference between the predicted values $\hat{y}_t$ and the actual values $y_t$:
$$\text{MAE} = \frac{1}{n}\sum_{t=1}^{n}|\hat{y}_t - y_t|$$

A lower MAE indicates a more accurate model. Let us see an example.

In [None]:
# compute the MAE for the test set
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_test_pred)
print('MAE:', mae)

### Mean squared error (MSE)
The MSE is another common metric used to evaluate time series regression models. It measures the average squared difference between the predicted values and the actual values:
$$\text{MSE} = \frac{1}{n}\sum_{t=1}^{n}(\hat{y}_t - y_t)^2$$

The MSE penalizes large errors more heavily than small errors.

In [None]:
from sklearn.metrics import mean_squared_error

MSE = mean_squared_error(y_test, y_test_pred)
print('MSE:', MSE)

### Root mean squared error (RMSE)
The RMSE is the square root of the MSE and is often used as a measure of the model's accuracy. Like the MSE, the RMSE penalizes large errors more heavily than small errors:
$$\text{RMSE} = \sqrt{\text{MSE}}$$


In [None]:
RMSE = MSE ** .5
print('RMSE:', RMSE)

### Coefficient of determination (R-squared)
The R-squared value measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model. A higher R-squared value indicates a better fit. It is calculated as follows:
$$\text{R-squared} = 1 - \frac{\text{SSE}}{\text{SST}}$$

where $\text{SSE}$ is the sum of squared errors and $\text{SST}$ is the total sum of squares:

$$\text{SSE} = \sum_{t=1}^{n}(\hat{y}_t - y_t)^2$$

$$\text{SST} = \sum_{t=1}^{n}(y_t - \bar{y})^2$$

where $\bar{y}$ is the mean of the actual values.
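To connect the four formulas above with the library functions, here is a minimal sketch that computes each metric by hand with NumPy on a small made-up example and compares the result with scikit-learn. The arrays `y_true_demo` and `y_pred_demo` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# made-up actual and predicted values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 6.0, 10.0])

errors = y_pred_demo - y_true_demo

mae_demo = np.mean(np.abs(errors))      # MAE
mse_demo = np.mean(errors ** 2)         # MSE
rmse_demo = np.sqrt(mse_demo)           # RMSE

sse_demo = np.sum(errors ** 2)                                # sum of squared errors
sst_demo = np.sum((y_true_demo - y_true_demo.mean()) ** 2)    # total sum of squares
r2_demo = 1 - sse_demo / sst_demo                             # R-squared

print("MAE:", mae_demo, "MSE:", mse_demo, "RMSE:", rmse_demo, "R2:", r2_demo)

# the hand-computed values agree with scikit-learn
print(np.isclose(mae_demo, mean_absolute_error(y_true_demo, y_pred_demo)))
print(np.isclose(mse_demo, mean_squared_error(y_true_demo, y_pred_demo)))
print(np.isclose(r2_demo, r2_score(y_true_demo, y_pred_demo)))
```

Note that, on a test set, R-squared can be negative: this happens when the model predicts worse than simply using the mean of the actual values.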


In [None]:
from sklearn.metrics import r2_score

R2 = r2_score(y_test, y_test_pred)
print('R2:', R2)

## ("Standard") Machine Learning models

Several "standard" machine learning models exist that can be used for time series regression. Next, some examples are shown. Note that no special attention was paid to the tuning of the methods.

Instead of the airline passengers data, we will be using the house consumption data.

In [None]:
df_energy = pd.read_csv("./data/house_consumption_TS/house_consumption.csv", parse_dates=True)
df_energy.date = pd.to_datetime(df_energy.date)

# use only data previous to "2022-12-28" (the house installed a photovoltaic panel system on that date, which probably changed the energy consumption behaviour)
query = df_energy.date < "2022-12-28"
df_energy = df_energy[query]

# resample to hour consumptions instead of 15 minutes step
df_energy.set_index('date', inplace=True)
df_energy = df_energy.resample('h').mean()

# expand the date to its components (we can't feed the algorithms with dates/strings)
df_energy['hour'] = df_energy.index.hour
df_energy['day'] = df_energy.index.day
df_energy['day of week'] = df_energy.index.dayofweek
df_energy['month'] = df_energy.index.month

df_energy.head()

In [None]:
# We can see that 2 values are missing in the kw column: let us remove those 2 rows
df_energy.info()

In [None]:
df_energy.dropna(inplace=True)

Further, we will be using polynomial features.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

independent_cols = ['hour', 'day', 'day of week', 'month']
X = df_energy[independent_cols]
y = df_energy['kw']

# build the polynomial features
poly_features_model = PolynomialFeatures(degree=5)
X = pd.DataFrame(
    poly_features_model.fit_transform(X),
    columns=poly_features_model.get_feature_names_out(independent_cols)
)

# keep the index for later
X.index = df_energy.index

X

Now, let us define the train and test data sets. Since the temporal order of the observations matters, the data should not be shuffled. We hold out the last 2 days (48 hours) and try to predict them.

In [None]:
# number of hours held out for testing (last 2 days)
n_test = 48

# training data
X_train = X.iloc[:-n_test]
y_train = y.iloc[:-n_test]

# testing data
X_test = X.iloc[-n_test:]
y_test = y.iloc[-n_test:]


### Linear regression

Linear regression is a simple and widely used method for time series regression. It models the relationship between the independent variables and the dependent variable using a linear function. Four examples will be shown next.

#### Ordinary Least Squares (OLS):
OLS is a linear regression method that minimizes the sum of the squared residuals (the differences between the predicted and actual values of the dependent variable) using a closed-form solution. The mathematical formulation for OLS is:

$$\beta_{OLS} = (X^TX)^{-1}X^Ty$$

where $\beta_{OLS}$ is the vector of coefficients, $X$ is the matrix of independent variables, $y$ is the vector of dependent variable values. The OLS method seeks to find the values of $\beta$ that minimize the residual sum of squares (RSS):

$$RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2$$

where $\hat{y}_i$ is the predicted value of $y_i$ based on the model, $p$ is the number of independent variables or predictors in the regression model, and $n$ is the number of observations or data points in the dataset (i.e., $n$ is the number of rows in the matrix $X$ and the length of the vector $y$). To account for the intercept $\beta_0$ in the closed-form solution, $X$ includes a leading column of ones, so $X$ has $p+1$ columns and $\beta$ has $p+1$ entries.
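The closed-form solution can be checked numerically. The sketch below uses synthetic data (an assumption made for illustration) and solves the normal equations $(X^TX)\beta = X^Ty$ with NumPy, then compares the result with scikit-learn's `LinearRegression`. Solving the linear system is used here instead of forming the inverse explicitly, which is usually the numerically safer choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data (assumption): 3 predictors, known coefficients, small noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = 1.5 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# add a column of ones so that beta_0 (the intercept) is estimated too
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# solve the normal equations (X^T X) beta = X^T y
beta_ols = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_demo)

# compare with scikit-learn
sk_ols = LinearRegression().fit(X_demo, y_demo)
print("closed form :", beta_ols)
print("scikit-learn:", np.r_[sk_ols.intercept_, sk_ols.coef_])
```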

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

ols_model = Pipeline([
    ('scaler', StandardScaler()),
    ('OLS', LinearRegression())
]).fit(X_train, y_train)

# use the model to predict values for the test set
y_test_pred = ols_model.predict(X_test)

Let us define a method to present the results in a systematic way

In [None]:
def print_results(y_test, y_test_pred):
    from sklearn.metrics import mean_absolute_error
    from sklearn.metrics import mean_squared_error
    from sklearn.metrics import r2_score

    # show the real and the predicted values
    print('Real values:', y_test.values)
    print('Predictions:', y_test_pred)

    # mean absolute error
    mae = mean_absolute_error(y_test, y_test_pred)
    print('MAE:', mae)

    # mean squared error and its square root
    MSE = mean_squared_error(y_test, y_test_pred)
    print('MSE:', MSE)

    RMSE = MSE ** .5
    print('RMSE:', RMSE)

    # coefficient of determination
    R2 = r2_score(y_test, y_test_pred)
    print('R2:', R2)

    # plot real vs. predicted values (y_test already carries the time index)
    df_test = pd.DataFrame(y_test)
    df_test["predicted"] = y_test_pred
    df_test.plot(figsize=(20, 8))

print_results(y_test, y_test_pred)

#### Lasso

Lasso (Least Absolute Shrinkage and Selection Operator) is a linear regression method that includes a penalty term on the absolute values of the coefficients to encourage sparsity in the model. The mathematical formulation for Lasso is:

$$\beta_{lasso} = \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}$$

where $\beta_{lasso}$ is the vector of coefficients, $y_i$ is the value of the dependent variable for the $i$-th observation, $x_{ij}$ is the value of the $j$-th independent variable for the $i$-th observation, and $\lambda \geq 0$ is the regularization parameter. The larger $\lambda$ is, the more coefficients are driven to exactly zero. In scikit-learn's `Lasso`, the regularization parameter is called `alpha` and the squared-error term is scaled by $\frac{1}{2n}$ instead of $\frac{1}{n}$, which only rescales $\lambda$.
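The sparsity induced by the absolute-value penalty can be seen in a small experiment. This is only a sketch on synthetic data (an assumption): 10 features are generated, but only the first two influence the target. The value `alpha=0.1` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# synthetic data (assumption): only features 0 and 1 are informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ols_coef = LinearRegression().fit(X_demo, y_demo).coef_
lasso_coef = Lasso(alpha=0.1).fit(X_demo, y_demo).coef_

# OLS gives small but non-zero weights to the irrelevant features,
# while Lasso sets some coefficients to exactly zero
print("OLS coefficients:  ", np.round(ols_coef, 3))
print("Lasso coefficients:", np.round(lasso_coef, 3))
print("Lasso coefficients equal to zero:", np.sum(lasso_coef == 0))
```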

In [None]:
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

lasso_model = Pipeline([
    ('scaler', StandardScaler()),
    ('Lasso', Lasso())
]).fit(X_train, y_train)

# use the model to predict values for the test set
y_test_pred = lasso_model.predict(X_test)

print_results(y_test, y_test_pred)

#### Ridge
Ridge regression is a linear regression method that includes a penalty term on the squared values of the coefficients to prevent overfitting. The mathematical formulation for Ridge is:

$$\hat{\beta}_{ridge} = \arg \min_{\beta} \left\{\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^{p}\beta_j^2\right\}$$

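where $\hat{\beta}_{ridge}$ is the vector of coefficients, $y_i$ is the value of the dependent variable for the $i$-th observation, $x_{ij}$ is the value of the $j$-th independent variable for the $i$-th observation, and $\lambda \geq 0$ is the regularization parameter (called `alpha` in scikit-learn's `Ridge`). Unlike Lasso, Ridge shrinks the coefficients towards zero but does not, in general, set them exactly to zero.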
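Like OLS, Ridge has a closed-form solution, $\hat{\beta}_{ridge} = (X^TX + \lambda I)^{-1}X^Ty$, when no intercept is fitted. The sketch below checks this on synthetic data (an assumption) against scikit-learn's `Ridge` with `fit_intercept=False`. The value `lam = 5.0` is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

# synthetic data (assumption), no intercept for simplicity
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=60)

lam = 5.0
p = X_demo.shape[1]

# closed-form Ridge solution: (X^T X + lambda I) beta = X^T y
beta_ridge = np.linalg.solve(X_demo.T @ X_demo + lam * np.eye(p), X_demo.T @ y_demo)

# OLS solution for comparison (lambda = 0)
beta_ols = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

sk_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X_demo, y_demo)

print("closed form :", np.round(beta_ridge, 4))
print("scikit-learn:", np.round(sk_ridge.coef_, 4))
# the penalty shrinks the coefficient vector
print("norms (OLS, Ridge):", np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```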


In [None]:
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

ridge_model = Pipeline([
    ('scaler', StandardScaler()),
    ('Ridge', Ridge())
]).fit(X_train, y_train)

# use the model to predict values for the test set
y_test_pred = ridge_model.predict(X_test)

print_results(y_test, y_test_pred)

#### Elastic Net
Elastic Net is a linear regression method that combines both Lasso and Ridge penalties to achieve both sparsity and prevent overfitting. The mathematical formulation for Elastic Net is:

$$\beta_{elastic} = \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2 \right\}$$

where $\beta_{elastic}$ is the vector of coefficients, $y_i$ is the value of the dependent variable for the $i$-th observation, $x_{ij}$ is the value of the $j$-th independent variable for the $i$-th observation, and $\lambda_1 \geq 0$ and $\lambda_2 \geq 0$ are the regularization parameters of the Lasso and Ridge penalties, respectively. In scikit-learn's `ElasticNet`, the two penalties are set through `alpha` (overall strength) and `l1_ratio` (share of the Lasso penalty), and the squared-error term is scaled by $\frac{1}{2n}$ instead of $\frac{1}{n}$.

Note that when $\lambda_1 = 0$ and $\lambda_2 > 0$, Elastic Net reduces to Ridge regression, and when $\lambda_1 > 0$ and $\lambda_2 = 0$, Elastic Net reduces to Lasso regression.
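The Lasso special case can be verified with scikit-learn, where `ElasticNet` with `l1_ratio=1.0` applies only the absolute-value penalty. The data below are synthetic (an assumption) and `alpha=0.1` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# synthetic data (assumption): only features 0 and 1 are informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_l1_only = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)  # Lasso penalty only
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)    # both penalties

print("Lasso:              ", np.round(lasso_demo.coef_, 3))
print("ElasticNet (l1 only):", np.round(enet_l1_only.coef_, 3))
print("ElasticNet (mixed):  ", np.round(enet_mixed.coef_, 3))
```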


In [None]:
from sklearn.linear_model import ElasticNet

elasticnet_model = Pipeline([
    ('scaler', StandardScaler()),
    ('ElasticNet', ElasticNet())
]).fit(X_train, y_train)

# use the model to predict values for the test set
y_test_pred = elasticnet_model.predict(X_test)

print_results(y_test, y_test_pred)

### Decision trees

A decision tree algorithm is a predictive modeling tool used in machine learning to identify relationships between input variables and their corresponding output variables. It creates a tree-like model of decisions and their possible consequences, based on a set of rules and data. At each node of the tree, the algorithm chooses the best attribute to split the data into two or more subsets, based on their level of impurity. The attribute that creates the most homogeneous subsets is selected as the best splitter. This process continues recursively, creating more nodes and subtrees, until a stopping criterion is met, such as a predefined maximum depth or a minimum number of samples per leaf. Once the decision tree is built, it can be used to make predictions for new input data by traversing the tree from the root to a leaf node and outputting the corresponding class label or numerical value. Decision trees are easy to interpret and visualize, and can handle both categorical and continuous data.

Decision trees cannot extrapolate, that is, they cannot make sensible predictions outside the range of the training data. Each leaf of a regression tree outputs a constant (the mean of the training targets that fall in that leaf), so the predictions are bounded by the smallest and largest target values seen during training. This matters for time series with a trend: if future values lie above everything observed so far, the tree will keep predicting values within the training range.
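The limitation outside the training range can be seen in a toy example. This is a sketch on made-up data ($y = 2x$ for $x \in [0, 10)$), comparing a decision tree with a linear model at $x = 20$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# made-up training data: y = 2x for x in [0, 10)
x_train_demo = np.arange(0, 10, 0.5).reshape(-1, 1)
y_train_demo = 2.0 * x_train_demo.ravel()

# a point far outside the training range
x_new = np.array([[20.0]])

tree_pred = DecisionTreeRegressor(random_state=0).fit(x_train_demo, y_train_demo).predict(x_new)[0]
lin_pred = LinearRegression().fit(x_train_demo, y_train_demo).predict(x_new)[0]

print("largest target seen in training:", y_train_demo.max())
print("decision tree prediction at x=20:", tree_pred)  # cannot exceed the training maximum
print("linear model prediction at x=20: ", lin_pred)   # follows the trend
```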


In [None]:
from sklearn.tree import DecisionTreeRegressor

model = Pipeline([
    ('scaler', StandardScaler()),
    ('method', DecisionTreeRegressor())
]).fit(X_train, y_train)

# use the model to predict values for the test set
y_test_pred = model.predict(X_test)

print_results(y_test, y_test_pred)

### Random forests

Random forest is a popular machine learning algorithm that combines the power of decision trees with the benefits of ensemble learning. It creates a forest of decision trees, where each tree is trained on a bootstrap sample of the training data and, at each split, may consider only a randomly selected subset of the input features. The outputs of the individual trees are then combined by voting (classification) or averaging (regression) to produce the final prediction.

The main advantage of random forests over single decision trees is their ability to reduce overfitting and improve accuracy. By combining the predictions of multiple trees, random forests are able to better capture the underlying relationships between the input and output variables and generalize to new data.

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = Pipeline([
    ('scaler', StandardScaler()),
    ('method', RandomForestRegressor())
]).fit(X_train, y_train)

# use the model to predict values for the test set
y_test_pred = model.predict(X_test)

print_results(y_test, y_test_pred)

### Support vector machines (SVM)
SVM (Support Vector Machine) is a machine learning algorithm that can be used for both classification and regression tasks; the regression variant is known as Support Vector Regression (SVR). In SVR, the goal is to predict a continuous output variable by finding a function that is linear in a (possibly high-dimensional) feature space induced by a kernel.

The basic idea of SVR is to fit a function that is as flat as possible while keeping the training points inside a tube of half-width $\epsilon$ around it. The function is defined by a set of support vectors, which are the training examples lying on or outside the tube; points strictly inside the tube do not influence the solution.

In contrast to traditional linear regression methods, which aim to minimize the sum of squared errors between the predicted and actual output values, SVR uses an $\epsilon$-insensitive loss: deviations smaller than $\epsilon$ are ignored and larger ones are penalized linearly. The penalty parameter C controls the trade-off between the flatness of the function and the tolerated training error, and can be tuned to optimize the performance of the algorithm.
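The loss commonly used in SVR is the $\epsilon$-insensitive loss, $\max(0, |y - \hat{y}| - \epsilon)$. The following sketch compares it with the squared error on three made-up residuals. The helper name `epsilon_insensitive_loss` is hypothetical, and `epsilon=0.1` matches the default of scikit-learn's `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors inside the epsilon tube cost nothing; larger errors grow linearly."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# made-up values: absolute errors of 0.05, 0.3 and 0.5
y_true_demo = np.array([1.0, 1.0, 1.0])
y_pred_demo = np.array([1.05, 1.3, 0.5])

eps_loss = epsilon_insensitive_loss(y_true_demo, y_pred_demo, epsilon=0.1)
sq_loss = (y_true_demo - y_pred_demo) ** 2

print("epsilon-insensitive loss:", eps_loss)  # first error falls inside the tube
print("squared error:           ", sq_loss)
```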

SVM regression is a powerful technique that can handle both linear and nonlinear relationships between the input and output variables. It is particularly useful in cases where there are many input variables and a small number of training examples. However, it can be sensitive to outliers and may require careful parameter tuning to achieve good performance.

In [None]:
from sklearn.svm import SVR

model = Pipeline([
    ('scaler', StandardScaler()),
    ('method', SVR())
]).fit(X_train, y_train)

# use the model to predict values for the test set
y_test_pred = model.predict(X_test)

print_results(y_test, y_test_pred)

### Neural networks

MLP (Multi-Layer Perceptron) is a type of artificial neural network commonly used in machine learning for both classification and regression tasks. In MLP regression, the goal is to predict a continuous output variable by mapping the input data through a series of hidden layers to produce a final output value.

An MLP regression model typically consists of an input layer, one or more hidden layers, and an output layer. Each layer consists of a set of nodes or neurons that perform a weighted sum of the inputs, followed by the application of an activation function. The weights between the nodes are learned during training using backpropagation, a gradient-based optimization algorithm that adjusts the weights to minimize the difference between the predicted and actual output values.

One advantage of MLP regression over other regression techniques is its ability to learn complex, nonlinear relationships between the input and output variables. The number of hidden layers and the number of nodes in each layer can be adjusted to optimize the performance of the model, although this requires careful tuning to avoid overfitting or underfitting.

In [None]:
from sklearn.neural_network import MLPRegressor

model = Pipeline([
    ('scaler', StandardScaler()),
    ('method', MLPRegressor())
]).fit(X_train, y_train)

# use the model to predict values for the test set
y_test_pred = model.predict(X_test)

print_results(y_test, y_test_pred)

### Gradient Boosting
Gradient Boosting is a powerful machine learning technique for building regression and classification models. It is an ensemble method that combines the predictions of multiple weak learners, typically decision trees, to produce a final prediction.

The basic idea of gradient boosting is to sequentially add decision trees to the model, with each tree attempting to correct the errors of the previous trees. At each stage of the algorithm, the model computes the difference between the predicted and actual output values, known as the residual error. The next decision tree is then trained to predict the residual error, rather than the original output values.

The key to gradient boosting's success is that it can be viewed as gradient descent in function space. At each stage, the new tree is fitted to the negative gradient of the loss function with respect to the current predictions (for the squared-error loss, this is exactly the residual error), and its output, scaled by a learning rate, is added to the model. This process is repeated until the model reaches a pre-defined stopping criterion, such as a maximum number of trees or a minimum improvement in the loss function.
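The sequential fitting of trees to residuals can be sketched by hand for the squared-error loss. This is a simplified illustration on synthetic data (an assumption), not the implementation used by `GradientBoostingRegressor`; the learning rate, tree depth and number of rounds are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic data (assumption): noisy sine curve
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y_demo = np.sin(X_demo.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# start from a constant prediction (the mean of the target)
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # for the squared-error loss, the negative gradient is the residual
    residuals = y_demo - prediction
    # fit a small tree (weak learner) to the residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # add a fraction of the tree's output to the current prediction
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("training MSE before boosting:", mse_history[0])
print("training MSE after boosting: ", mse_history[-1])
```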

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

model = Pipeline([
    ('scaler', StandardScaler()),
    ('method', GradientBoostingRegressor())
]).fit(X_train, y_train)

# use the model to predict values for the test set
y_test_pred = model.predict(X_test)

print_results(y_test, y_test_pred)

### Long Short-Term Memory
LSTM (Long Short-Term Memory) is a type of Recurrent Neural Network (RNN) that is particularly useful for time series forecasting. LSTMs are capable of handling time series data with long-term dependencies, which can be challenging for traditional neural networks.

LSTMs work by maintaining memory cells that can selectively retain or forget information at each time step, based on the current input and the previous state of the cell. The output of the LSTM at each time step is determined by both the current input and the previous state of the cell. This allows LSTMs to capture patterns in the time series data that are dependent on past events.
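The gating mechanism described above can be written down for a single time step. The sketch below uses the standard LSTM equations with NumPy and random weights; it only illustrates the data flow and is not how Keras implements the layer. The function name `lstm_step` and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.
    W: (4*units, n_features), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new information to store
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * g              # new cell state (memory)
    h_t = o * np.tanh(c_t)                # new hidden state (output)
    return h_t, c_t

# random weights and a random input sequence (assumptions for illustration)
rng = np.random.default_rng(0)
n_features, units, n_steps = 2, 4, 5
W = rng.normal(scale=0.5, size=(4 * units, n_features))
U = rng.normal(scale=0.5, size=(4 * units, units))
b = np.zeros(4 * units)
sequence = rng.normal(size=(n_steps, n_features))

# process the sequence step by step, carrying the states forward
h, c = np.zeros(units), np.zeros(units)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print("final hidden state:", np.round(h, 3))
```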

An example of using LSTM for time series forecasting might involve predicting the stock prices of a particular company based on historical price data. The LSTM model would be trained on a dataset of past prices and other relevant data (such as trading volume, news articles, and economic indicators). Once the model is trained, it can be used to make predictions for future prices based on new input data.

Another example could be predicting weather patterns based on historical climate data. The LSTM model would be trained on data from previous years, including temperature, humidity, air pressure, and other weather-related variables. The model would then be used to predict future weather patterns, which could be useful for a variety of applications such as agriculture, transportation, and energy production.

Overall, LSTMs are a powerful tool for time series forecasting, capable of capturing complex patterns in the data, although their accuracy depends on the amount of data available, the network architecture, and careful tuning.

In [None]:
import numpy as np

# get data prior to 2022-12-28 and resample to hour
df_energy = pd.read_csv("./data/house_consumption_TS/house_consumption.csv", parse_dates=True)
df_energy.date = pd.to_datetime(df_energy.date)
query = df_energy.date < "2022-12-28"
df_energy = df_energy[query]
df_energy.set_index('date', inplace=True)
df_energy = df_energy.resample('h').mean()
df_energy.bfill(inplace=True) #fill some missing values

# expand the date to its components (we can't feed the algorithms with dates/strings)
df_energy['hour'] = df_energy.index.hour
df_energy['day of week'] = df_energy.index.dayofweek

df_target = df_energy['kw']
df_energy.drop('kw', axis=1, inplace=True)

# 48 hours span
window_size = 48
X = np.array([df_energy.iloc[i:i+window_size, :] for i in range(len(df_energy)-window_size+1)])
X = np.reshape(X, (X.shape[0], -1))
X = pd.DataFrame(X)

# use the index of the last reading as reference
X.index = df_energy.index[window_size-1:]
y = df_target[X.index]

X.head()

In [None]:
n1 = 21 * 24
n2 = 7 * 24

X_train = X[:-n1]
y_train = y[:-n1]

X_val = X[-n1: -n2]
y_val = y[-n1: -n2]

X_test = X[-n2:]
y_test = y[-n2:]

print('shapes:')
X_train.shape, X_val.shape, X_test.shape

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Define the LSTM model
model = keras.Sequential()
model.add(keras.Input(shape=(X_train.shape[1],1)))
model.add(layers.LSTM(units=256, activation='relu'))
model.add(layers.Dense(units=64, activation='relu'))
model.add(layers.Dense(units=1, activation='relu'))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
hist = model.fit(X_train, y_train, 
                 epochs=5, 
                 batch_size=32, 
                 validation_data=(X_val, y_val))

# Make predictions on test data
y_test_pred = model.predict(X_test)

In [None]:
import matplotlib.pyplot as plt

# training vs. validation loss per epoch (the model was compiled with MSE loss)
plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.xlabel('Epochs')
plt.ylabel('MSE Loss')
plt.legend(['loss', 'val_loss'])
plt.show()

In [None]:
print_results(y_test, y_test_pred.reshape(1,-1)[0])

### Conclusion
In conclusion, there are many different machine learning regression methods available to choose from, each with their own strengths and weaknesses depending on the specific task and dataset. The examples provided here are just a small subset of the many possible approaches that can be used.

It is important to carefully tune the hyperparameters of each method to achieve optimal performance, as different hyperparameters can have a significant impact on the final results. Additionally, other preprocessing techniques or data transformations, such as the addition of time lags, could further improve the accuracy of the models. These were not used in the above examples.
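
As a hedged illustration of the time-lag idea mentioned above, the following sketch builds a matrix of lagged values from a univariate series using NumPy only. The helper name `make_lag_features` and the toy series are assumptions made for this example; they are not part of the lesson's code.

```python
import numpy as np

def make_lag_features(y, n_lags):
    """Return (X, target): each row of X holds y[t-1], ..., y[t-n_lags] for target y[t]."""
    y = np.asarray(y, dtype=float)
    if n_lags < 1 or n_lags >= len(y):
        raise ValueError("n_lags must be between 1 and len(y) - 1")
    # column k-1 contains the series shifted back by k time steps
    X = np.column_stack([y[n_lags - k: len(y) - k] for k in range(1, n_lags + 1)])
    target = y[n_lags:]  # the first n_lags points have no complete history
    return X, target

# toy series (hypothetical values)
series = np.array([10.0, 11.0, 13.0, 12.0, 15.0, 16.0])
X_lag, y_lag = make_lag_features(series, n_lags=2)
print(X_lag)
print(y_lag)
```

The resulting columns could be appended to the other features before training any of the regressors discussed in this lesson.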

Furthermore, it is worth noting that time series data require a specific type of cross-validation called time series cross-validation. This method takes into account the sequential nature of the data and avoids any leakage of future information into the training set. This was not considered for the examples provided here, and therefore the reported performance estimates may be less reliable than they could have been.
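
A minimal sketch of time series cross-validation with an expanding training window is given next. The function `expanding_window_splits` is a hypothetical helper written for this illustration; scikit-learn's `TimeSeriesSplit` offers a ready-made splitter based on the same idea.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices); each test fold comes strictly after its train fold."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        # the training window grows by one fold at every split
        train_end = n_samples - (n_splits - k + 1) * fold_size
        train = list(range(0, train_end))
        test = list(range(train_end, train_end + fold_size))
        yield train, test

splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Because every test index is larger than every training index, no future information reaches the model during training.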

In summary, while there are many different regression methods available, it is important to carefully consider the specific task and dataset at hand, tune the hyperparameters, and use appropriate cross-validation techniques, such as time series cross-validation, to ensure the most accurate results.


## Autoregressive Integrated Moving Average models

ARIMA (Autoregressive Integrated Moving Average) models are a popular class of time series models used for forecasting, with applications in a variety of industries. They are widely used in demand forecasting, for example to predict future demand in the food industry.

There are several variants of ARIMA models, such as:

- __Seasonal ARIMA (SARIMA) models__: These models are an extension of ARIMA models that take into account the seasonal component of a time series. SARIMA models incorporate seasonal differences, seasonal autoregressive (SAR) terms, and seasonal moving average (SMA) terms in addition to the non-seasonal terms of an ARIMA model.

- __ARIMA with exogenous variables (ARIMAX)__: These models are an extension of ARIMA models that include additional independent variables, also known as exogenous variables, that may be useful in predicting the time series. ARIMAX models can be useful when there are external factors that influence the time series.

The choice of which model to use depends on the specific characteristics of the time series being modeled and the nature of the forecasting problem.

### ARIMA

In theory, ARIMA models are the most general class of models for forecasting a time series that can be made "stationary" by differencing (if necessary), possibly in conjunction with nonlinear transformations such as logging or deflating.

A stationary random variable is one whose statistical properties remain constant over time. A stationary series has no trend, constant amplitude variations around its mean, and wiggles in a consistent manner, i.e., its short-term random time patterns always look the same statistically. The latter condition implies that its autocorrelations (correlations with its own prior deviations from the mean) are constant over time, or that its power spectrum is constant over time.

A random variable of this type can be viewed (as usual) as a combination of signal and noise, with the signal (if present) being a pattern of fast or slow mean reversion, sinusoidal oscillation, or rapid sign alternation, and possibly a seasonal component. An ARIMA model can be thought of as a "filter" that attempts to separate the signal from the noise before extrapolating the signal into the future to generate forecasts.

The ARIMA forecasting equation for a stationary time series is a linear (i.e., regression-type) equation in which the predictors are lags of the dependent variable and/or lags of the forecast errors.

__Predicted value of Y = "a constant" + "a weighted sum of one or more recent values of Y" + "a weighted sum of one or more recent values of the errors"__

So, ARIMA stands for Auto-Regressive Integrated Moving Average. In the forecasting equation, lags of the stationarized series are called __autoregressive terms__, lags of forecast errors are called __moving average__ terms, and a time series that must be differenced to become stationary is called an "integrated" version of a stationary series.

An $ARIMA(p, d, q)$ model is a nonseasonal ARIMA model, where:

- $p$  represents the number of lagged observations of the dependent variable (also called "lags") included in the model. In other words, it is the number of previous time steps that are used to predict the current value of the time series.

- $d$ represents the degree of differencing used to make the time series stationary. Differencing involves subtracting the previous observation from each observation, which removes a trend in the level of the series (seasonal patterns require seasonal differencing, as in SARIMA). $d$ is the number of times this differencing is performed.

- $q$ represents the number of lagged forecast errors (also called "residuals") included in the model. These are the errors that result from the difference between the actual and predicted values in the time series. The inclusion of these lagged errors allows the model to capture any remaining patterns or dependencies in the time series.

ARIMA models can be used to model any 'non-seasonal' time series that has patterns and is not random white noise. If a time series possesses seasonal patterns, it is necessary to add seasonal terms and it becomes SARIMA, short for 'Seasonal ARIMA'. More on that once ARIMA is completed.

### Making a series stationary

Making the time series stationary is the first step in developing an ARIMA model, because the term "Auto Regressive" in ARIMA refers to a linear regression model that employs its own lags as predictors. Linear regression models perform best when the predictors are uncorrelated and independent of one another. So, to make the series stationary, the most common method is to difference it, that is, to subtract the previous value from the current value. Depending on the complexity of the series, more than one round of differencing may be required. The value of $d$ is therefore the smallest number of differencing operations required to make the series stationary, and $d = 0$ if the time series is already stationary.
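
To illustrate, a short NumPy sketch (with made-up values) shows that one round of differencing turns a series with a linear trend into a constant series, while a quadratic trend needs two rounds. This corresponds to $d = 1$ and $d = 2$, respectively.

```python
import numpy as np

t = np.arange(8)
linear = 3.0 + 2.0 * t          # series with a linear trend
quadratic = 1.0 + 0.5 * t**2    # series with a quadratic trend

d1_linear = np.diff(linear, n=1)        # Y_t - Y_{t-1}: the trend is gone
d1_quadratic = np.diff(quadratic, n=1)  # still increasing after one differencing
d2_quadratic = np.diff(quadratic, n=2)  # differenced twice: constant

print(d1_linear)
print(d1_quadratic)
print(d2_quadratic)
```

Note that each round of differencing shortens the series by one observation.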


### Pure Auto Regressive
A simple Auto Regressive, $AR(p)$, model is one in which $Y_t$ is solely determined by its own lags. That is, $Y_t$ is a function of $Y_t$'s lags.

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p} + \epsilon_t
= \alpha + \sum_{i=1}^p \beta_i Y_{t-i} + \epsilon_t $$

where $Y_{t-i}$ is the $i$-th lag of the series, $\beta_i$ is the coefficient of lag $i$ estimated by the model, $\alpha$ is the intercept term estimated by the model, and $\epsilon_t$ is the error (noise) term.
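
The AR equation can be explored numerically. The sketch below simulates an $AR(2)$ process with assumed parameter values and recovers $\alpha$, $\beta_1$ and $\beta_2$ by regressing $Y_t$ on its two lags. Ordinary least squares is used here for simplicity; it is not necessarily how a library such as statsmodels estimates the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta1, beta2 = 1.0, 0.6, -0.3   # assumed "true" parameters (stationary case)
n = 5000

# simulate Y_t = alpha + beta1 * Y_{t-1} + beta2 * Y_{t-2} + eps_t
y = np.zeros(n)
eps = rng.normal(0.0, 1.0, size=n)
for t in range(2, n):
    y[t] = alpha + beta1 * y[t - 1] + beta2 * y[t - 2] + eps[t]

# regress Y_t on [1, Y_{t-1}, Y_{t-2}] with least squares
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
target = y[2:]
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
alpha_hat, beta1_hat, beta2_hat = coef
print(alpha_hat, beta1_hat, beta2_hat)
```

With a few thousand observations the estimates should land close to the parameters used in the simulation.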

### Pure Moving Average
Similarly, a pure Moving Average, $MA(q)$, model is one in which $Y_t$ is determined solely by the lagged forecast errors.

$$ Y_t = \alpha + \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \cdots + \phi_q \epsilon_{t-q}
= \alpha+ \epsilon_t + \sum_{i=1}^q \phi_i \epsilon_{t-i} $$

where the error terms $\epsilon_t, \epsilon_{t-1}, \ldots$ are the forecast errors at the respective lags, which are assumed to be white noise.
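
A characteristic property of an $MA(q)$ process is that its autocorrelation vanishes beyond lag $q$. The following sketch simulates an $MA(1)$ series with assumed parameters and compares the sample autocorrelation at lag 1 with the theoretical value $\phi_1 / (1 + \phi_1^2)$. The helper `autocorr` is written only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, phi1 = 2.0, 0.8      # assumed parameters
n = 20000

# Y_t = alpha + eps_t + phi1 * eps_{t-1}
eps = rng.normal(0.0, 1.0, size=n + 1)
y = alpha + eps[1:] + phi1 * eps[:-1]

def autocorr(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    x = x - x.mean()
    return float(np.dot(x[lag:], x[:-lag]) / np.dot(x, x))

rho1_theory = phi1 / (1.0 + phi1**2)
print(autocorr(y, 1), rho1_theory)   # close to each other
print(autocorr(y, 2))                # close to zero
```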

### ARMA

Then, an $ARMA(p,q)$ is simply the combination of both AR and MA models into a single equation:

$$Y_t =  \alpha + \epsilon_t + \sum_{i=1}^q \phi_i \epsilon_{t-i} + \sum_{i=1}^p \beta_i Y_{t-i}$$
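
As a small sketch of how this equation is used for prediction, the function below computes a one-step-ahead value from given coefficients, past observations and past errors. The unknown future error $\epsilon_t$ is replaced by its expected value, zero. The function name `arma_one_step` and all numeric values are assumptions for illustration only.

```python
def arma_one_step(alpha, betas, phis, y_hist, eps_hist):
    """One-step-ahead ARMA(p, q) prediction, taking the future error eps_t as 0.

    y_hist and eps_hist are ordered from oldest to newest.
    """
    p, q = len(betas), len(phis)
    if len(y_hist) < p or len(eps_hist) < q:
        raise ValueError("need at least p past values and q past errors")
    # betas[i] multiplies Y_{t-(i+1)}, phis[i] multiplies eps_{t-(i+1)}
    ar_part = sum(betas[i] * y_hist[-(i + 1)] for i in range(p))
    ma_part = sum(phis[i] * eps_hist[-(i + 1)] for i in range(q))
    return alpha + ar_part + ma_part

# ARMA(2, 1) with hypothetical coefficients and history
pred = arma_one_step(alpha=1.0, betas=[0.5, 0.2], phis=[0.4],
                     y_hist=[3.0, 4.0], eps_hist=[-0.5])
print(pred)
```

Setting `phis=[]` reduces the function to a pure AR prediction, and `betas=[]` to a pure MA prediction.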

### Autoregression + Integrated + Moving average

An ARIMA model is one where the time series was differenced at least once to make it stationary and you combine the AR and the MA terms.

- AR (Autoregression): Model that shows a changing variable that regresses on its own lagged/prior values.

- I (Integrated): Differencing of raw observations to allow for the time series to become stationary

- MA (Moving average): Dependency between an observation and a residual error from a moving average model
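
As a worked example in the notation used above, consider an $ARIMA(1,1,1)$ model. Let $W_t$ (a symbol introduced here only for illustration) denote the once-differenced series. The ARMA equation is applied to $W_t$ and can then be expanded back in terms of $Y_t$:

$$W_t = Y_t - Y_{t-1}$$

$$W_t = \alpha + \beta_1 W_{t-1} + \epsilon_t + \phi_1 \epsilon_{t-1}$$

$$Y_t = \alpha + (1 + \beta_1) Y_{t-1} - \beta_1 Y_{t-2} + \epsilon_t + \phi_1 \epsilon_{t-1}$$

The last line follows by substituting $W_t = Y_t - Y_{t-1}$ and $W_{t-1} = Y_{t-1} - Y_{t-2}$ into the second equation and moving $Y_{t-1}$ to the right-hand side.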

For further, more advanced reading, we suggest the following page: https://online.stat.psu.edu/stat510/book/export/html/665

In the following, we can see that the rolling mean itself has a trend component, even though the rolling standard deviation is fairly constant over time.

For a time series to be stationary, both the rolling mean and the rolling standard deviation must remain fairly constant with respect to time.

Both curves need to be parallel to the x-axis; in our case they are not.
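
The rolling-statistics check described above can be sketched with NumPy on a synthetic series (the data and the helper `rolling_stats` are assumptions for this example). In pandas, the same quantities are usually obtained with `Series.rolling(window).mean()` and `Series.rolling(window).std()`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(42)
n, window = 240, 12
trend_series = 0.5 * np.arange(n) + rng.normal(0.0, 1.0, size=n)  # level drifts upward
diffed_series = np.diff(trend_series)                              # trend removed

def rolling_stats(x, window):
    """Rolling mean and standard deviation over a fixed-size window."""
    windows = sliding_window_view(x, window)   # shape: (len(x) - window + 1, window)
    return windows.mean(axis=1), windows.std(axis=1)

mean_raw, std_raw = rolling_stats(trend_series, window)
mean_diff, std_diff = rolling_stats(diffed_series, window)

# spread of the rolling mean: large when the mean drifts, small when it is flat
range_raw = mean_raw.max() - mean_raw.min()
range_diff = mean_diff.max() - mean_diff.min()
print(range_raw, range_diff)
```

A rolling mean that spans a wide range indicates a trend, whereas after differencing the rolling mean stays within a narrow band.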

## Useful libraries

There are many Python libraries available for time series forecasting, some of which were used above. The most popular are:

- pandas: A popular library for data manipulation and analysis, including time series data. It provides convenient functions for working with time series data, such as resampling, shifting, and rolling calculations.

- numpy: A fundamental library for scientific computing in Python, which provides powerful tools for mathematical calculations and array manipulation, which are often used in time series forecasting.

- statsmodels: A library for statistical modeling and analysis, which provides a range of time series analysis tools, including ARIMA and SARIMA models, as well as seasonal decomposition and forecasting functions.

- scikit-learn: A machine learning library with general-purpose regression models, such as Support Vector Regression (SVR) and Random Forest Regression, which can be applied to time series forecasting once the data is framed as a supervised learning problem.

- prophet: A time series forecasting library developed by Facebook, which uses an additive model with seasonality, trend, and holidays components to make predictions.

- pyflux: A library for time series modeling and forecasting, which provides a range of models, including ARIMA, VAR, and Bayesian Structural Time Series (BSTS) models.

- fbprophet: The former package name of prophet, used before version 1.0. It is the same library, and new code should install and import `prophet` instead.

These libraries can be used in combination with each other to perform various tasks related to time series forecasting, such as data preparation, model selection, model fitting, and prediction.

### Prophet

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

Prophet is open source software released by Facebook’s Core Data Science team. It is available for download on CRAN and PyPI.

Let us see how to use Prophet for time series forecasting in Python.

In [None]:
from prophet import Prophet
import pandas as pd

# import airline passenger data
df = pd.read_csv('./data/passengers_TS/passengers.csv', parse_dates=['Month']).drop(columns=['Unnamed: 0'])

# prophet requires columns ds (Date) and y (value)
df = df.rename(columns={'Month':'ds', '#Passengers': 'y'})

m = Prophet()
m.fit(df)

# predict the next 3 years
future = m.make_future_dataframe(periods=36, freq='MS')
forecast = m.predict(future)

# plot forecast
m.plot(forecast)
m.plot_components(forecast)