# Exchange Rate Dataset

**Approximate Learning Time:** Up to 1 hour

---

In this notebook, we will introduce the **Exchange Rate** dataset, split it into training, validation, and test sets, and explore the training data using the techniques learned in previous modules, thereby setting the stage for the upcoming modeling approaches.

---

## About Dataset

The **exchange rate dataset** contains daily exchange rates from 1990 to 2016 for 8 countries: Australia, Britain, Canada, Switzerland, China, Japan, New Zealand, and Singapore. It includes a total of 8 univariate time series, each with 7,588 time steps. 

For ease of training in this tutorial series, we will resample the dataset to a weekly frequency, reducing it to a time series with 1,084 time steps.
This is done within the `load_tutorial_data` function.
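
To make the resampling step concrete, here is a minimal sketch on synthetic data. The start date, the column name `rate`, and the random values are assumptions for illustration only; they are not taken from `Exchange.csv`. With a start date that falls on a Monday, 7,588 daily observations fall into exactly 1,084 calendar weeks.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one daily series; the start date is an assumption.
daily_index = pd.date_range("1990-01-01", periods=7588, freq="D")
rng = np.random.default_rng(0)
daily = pd.DataFrame(
    {"rate": 1.0 + 0.01 * rng.standard_normal(7588).cumsum()},
    index=daily_index,
)

# "W" groups observations into calendar weeks ending on Sunday;
# mean() then averages the daily values inside each week.
weekly = daily.resample("W").mean()
print(len(daily), "->", len(weekly))
```

Averaging within each week smooths day-to-day noise; taking the last value of each week (`.last()`) would be an alternative if end-of-week rates were preferred.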

**Note:** We will move the loading and splitting code into helper functions in `utils.py` to avoid repeating it in every notebook.

In [2]:
import pathlib
import numpy as np
import matplotlib.pyplot as plt

import sys; sys.path.append("../")
import utils_tfb
import utils

PLOTTING_COLORS = utils.PLOTTING_COLORS

FORECASTING_HORIZON = [4, 8, 12, 24, 52] # weeks 
MAX_FORECASTING_HORIZON = max(FORECASTING_HORIZON)

%load_ext autoreload
%autoreload 2

## Load, Transform, and Split Dataset

---

### Load and Downsample 

Let's load the dataset and downsample it to weekly frequency using the `resample` method from `pandas` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html)).

In [3]:
def load_tutorial_data(dataset):
    """Load a tutorial dataset and resample it to weekly frequency."""
    TS_DATA_FOLDER = pathlib.Path("../forecasting").resolve()
    if dataset == 'exchange_rate':
        dataset = TS_DATA_FOLDER / "Exchange.csv"
        data = utils_tfb.read_data(str(dataset))
        data.index.freq = 'D'  # since we know that the frequency is daily
        data = data.resample("W").mean()  # resample to obtain a weekly time series
        return data
    else:
        raise ValueError(f"Unrecognized dataset: {dataset}")


In [None]:
data = load_tutorial_data(dataset='exchange_rate')
print("Sampling frequency of data", data.index.freq)
data.head()

---

### Transformation 

Because of the nature of exchange rates, the common practice is to predict daily returns or log daily returns rather than the raw rates.
Given a time series $\{x_t\}_{t=0}^{T} = \{x_0, x_1, x_2, \dots, x_T\}$, the following two transformations are usually considered:

1. **Daily returns** are calculated as

$$
r_t = \frac{x_t - x_{t-1}}{x_{t-1}}
$$

Daily returns have the following properties:
- May have skewed distributions with heavy tails, so they are often not normally distributed.
- Can include extreme values or outliers.
- Not additive over time.


2. **Log daily returns** are calculated as

$$
r_t = \ln\left(\frac{x_t}{x_{t-1}}\right)
$$

Log daily returns have the following properties:
- Tend to be closer to normally distributed than daily returns, especially over short intervals.
- Additive over time, which is advantageous for cumulative analyses.
- Less affected by outliers, because the logarithm compresses large movements.
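
The additivity property can be checked numerically. The sketch below uses a short, hypothetical price path (the values are made up for illustration) and compares the sum of per-step returns with the return over the whole period.

```python
import numpy as np

# Hypothetical price path, used only to illustrate the two definitions.
prices = np.array([100.0, 102.0, 99.0, 105.0])

simple_returns = prices[1:] / prices[:-1] - 1.0
log_returns = np.log(prices[1:] / prices[:-1])

# Return over the whole period, computed directly from first and last price.
total_simple = prices[-1] / prices[0] - 1.0
total_log = np.log(prices[-1] / prices[0])

# Log returns add up across periods.
print(np.isclose(log_returns.sum(), total_log))
# Simple returns do not add up; they compound multiplicatively instead.
print(np.isclose(simple_returns.sum(), total_simple))
print(np.isclose(np.prod(1.0 + simple_returns) - 1.0, total_simple))
```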

Which transformation to use depends on context, and justifying the choice reliably may require domain expertise. In this notebook, we will perform EDA on both the raw data and its log daily returns. **For our tutorial, we will use log daily returns because they are closer to normally distributed, an assumption behind many modeling approaches.**


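Let's transform the data to log daily returns. Because the data has already been resampled to weekly frequency, each return computed below spans one week; we keep the name "log daily returns" for consistency with the rest of this notebook.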

In [None]:
print(f"Number of observations before transformation:{data.shape[0]}")
transformed_data = np.log(data / data.shift(1)).dropna()
print(f"Number of observations after transformation:{transformed_data.shape[0]}")
data = data.iloc[1:]  # keep the raw data aligned with the transformed data


**Exercise**: Why does the number of observations decrease by one?

--- 

### Train-Test Split

As discussed in the previous module, we split the dataset into training, validation, and test subsets. For time series, this split is chronological.

We are interested in building models that can predict $H$ time steps ahead. For our tutorial series, we will consider several values of $H$, specified in `FORECASTING_HORIZON`, and evaluate our forecasting models on all of these horizons. The validation and test sets each hold `MAX_FORECASTING_HORIZON` time steps so that the longest horizon can be evaluated. In practice, the choice of horizon depends on the task at hand.
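
The chronological split described above can be sketched as a small reusable helper. The function name `chronological_split` and the synthetic frame are hypothetical and used for illustration; this is not the helper that ends up in `utils.py`.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, horizon):
    """Reserve the last `horizon` rows for testing and the `horizon`
    rows before them for validation; everything earlier is training."""
    if len(frame) <= 2 * horizon:
        raise ValueError("Series is too short for the requested horizon.")
    train = frame.iloc[: -2 * horizon]
    val = frame.iloc[-2 * horizon : -horizon]
    test = frame.iloc[-horizon:]
    return train, val, test

# Synthetic weekly frame standing in for the real data.
index = pd.date_range("2000-01-02", periods=200, freq="W")
frame = pd.DataFrame({"value": np.arange(200.0)}, index=index)

train, val, test = chronological_split(frame, horizon=52)
print(len(train), len(val), len(test))
```

No shuffling takes place: every training timestamp precedes every validation timestamp, which in turn precedes every test timestamp.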


In [None]:
train_val_data = data.iloc[:-MAX_FORECASTING_HORIZON]
train_data, val_data = train_val_data.iloc[:-MAX_FORECASTING_HORIZON], train_val_data.iloc[-MAX_FORECASTING_HORIZON:]
test_data = data.iloc[-MAX_FORECASTING_HORIZON:]
print(f"Number of steps in training data: {len(train_data)}\nNumber of steps in validation data: {len(val_data)}\nNumber of steps in test data: {len(test_data)}")

transformed_train_val_data = transformed_data.iloc[:-MAX_FORECASTING_HORIZON]
transformed_train_data, transformed_val_data = transformed_train_val_data.iloc[:-MAX_FORECASTING_HORIZON], transformed_train_val_data.iloc[-MAX_FORECASTING_HORIZON:]
transformed_test_data = transformed_data.iloc[-MAX_FORECASTING_HORIZON:]



--- 

## EDA of Exchange Rate Dataset

**Note:** We will only examine the **training data** to ensure that our choice of modeling techniques is not influenced by the validation or test data, thus preventing bias in the metrics.

### Raw Data Visualization

Let's plot the raw data as well as the log daily returns defined above.

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 8), dpi=100)
_ = utils.plot_raw_data(train_data, ax=axs[0])
axs[0].set_title("Exchange Rates")
_ = utils.plot_raw_data(transformed_train_data, ax=axs[1], cols=data.columns)
_ = axs[1].set_title("Log daily returns")

**Observation**: Although the exchange rate magnitudes vary significantly across countries, the log daily returns tend to fall within a similar range.

---

### Autocorrelation / Partial Autocorrelation Functions

In [None]:
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 8), dpi=100)
_ = utils.plot_acf_pacf(train_data,  axs=axs[:, 0], n_lags=50)
axs[0, 0].set_title("Exchange Rates\n\nAuto Correlations")
_ = utils.plot_acf_pacf(transformed_train_data, axs=axs[:, 1], n_lags=10)
axs[0, 1].set_title("Log daily returns\n\nAuto Correlations")

**Observations**: The raw exchange rates remain strongly autocorrelated even at large lags, whereas the autocorrelation of the log daily returns drops off sharply. For the log-transformed data, the partial autocorrelations, which account for intermediate lags, are more in line with the autocorrelations. This suggests that modeling log daily returns could result in a better prediction model, potentially reducing the need to include many lag terms.
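
The contrast between levels and returns can be reproduced on synthetic data. The sketch below is an illustration only: it uses a hand-written `sample_acf` helper (a hypothetical name, not the implementation behind `utils.plot_acf_pacf`) and a simulated series whose log returns are uncorrelated by construction.

```python
import numpy as np

def sample_acf(x, n_lags):
    """Sample autocorrelation for lags 0..n_lags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(n_lags + 1)]
    )

rng = np.random.default_rng(42)
log_returns = 0.01 * rng.standard_normal(1000)  # uncorrelated synthetic returns
levels = np.exp(np.cumsum(log_returns))         # price-like series built from them

acf_levels = sample_acf(levels, n_lags=10)
acf_returns = sample_acf(log_returns, n_lags=10)

print(acf_levels.round(2))   # stays high across lags
print(acf_returns.round(2))  # close to zero beyond lag 0
```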

--- 

### Time Series Decomposition



In [None]:
fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(15, 8), dpi=100)
_ = utils.plot_seasonality_decompose(train_data,  axs=axs[:, 0])
axs[0, 0].set_title("Exchange Rates\n\nOriginal")
_ = utils.plot_seasonality_decompose(transformed_train_data, axs=axs[:, 1])
_ = axs[0, 1].set_title("Log daily returns\n\nOriginal")

**Observations**: 

- The trend appears relatively stable in both the raw and log-transformed data.
  
- Seasonality in the log-transformed data is twice the magnitude of the trend, whereas in the raw data, the opposite is true, with trend dominating over seasonality.
  
- There are sudden peaks in some time series, as revealed by the residuals, which may warrant further investigation into their underlying causes. These anomalies could provide insight into ways to improve the model.


--- 

### Check Stationarity 

In [None]:
utils.check_stationarity(train_data)

In [None]:
utils.check_stationarity(transformed_train_data)

**Observations:** Raw exchange rates exhibit non-stationarity, but the returns are more consistent over time. The log transformation of returns reduces the impact of large price movements and helps stabilize the variance over time, thereby making them appropriate for modeling. 


**Note**: In the notebooks that follow, we will build forecasting models to predict future log daily returns instead of raw data.
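
Forecasts made in log-return space can be mapped back to exchange rate levels when needed, since $x_{T+h} = x_T \exp(r_{T+1} + \dots + r_{T+h})$. The sketch below illustrates this inverse transformation; the helper name `log_returns_to_levels` and the sample values are hypothetical.

```python
import numpy as np

def log_returns_to_levels(last_level, log_returns):
    """Map a path of log returns back to levels, starting from `last_level`."""
    return last_level * np.exp(np.cumsum(log_returns))

# Hypothetical levels, used to check that the round trip is exact.
levels = np.array([0.78, 0.79, 0.77, 0.80, 0.81])
log_returns = np.log(levels[1:] / levels[:-1])

reconstructed = log_returns_to_levels(levels[0], log_returns)
print(np.allclose(reconstructed, levels[1:]))
```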

--- 

## Forecasting Strategy

In this section, we will discuss various types of forecasting formulations and outline the approach we will use in this tutorial.

Given a time series $ \{x_0, x_1, \dots, x_T\} $, forecasting problems broadly fall into two types:
1. **Single-step forecasting**: Predicting just one step ahead, $ \hat{x}_{T+1} $.
2. **Multi-step forecasting**: Predicting $ H $ future time steps, i.e., $ \{\hat{x}_{T+1}, \hat{x}_{T+2}, \dots, \hat{x}_{T+H}\} $.

**Approaches to Multi-step Forecasting**

While single-step forecasting is relatively simple, multi-step forecasting can be approached in several ways:

- **Iterative forecasting**: Train a single model to predict the next time step. The model is used iteratively, where the output from one step becomes the input for predicting the next step. This continues until $ H $ steps are forecasted.
  
- **Direct forecasting**: Train **H different models**, one for each time step in the forecasting horizon, where each model predicts a specific step ahead.

- **Hybrid approach**: Train **H different models**, where each model takes in the prediction from the previous model as input, making the models for later steps more informative by incorporating earlier predictions.


In **this tutorial**, owing to its simplicity, we will focus on the **iterative approach**, where a single model is trained to predict the next time step, and it is used iteratively to generate multi-step forecasts.
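
To illustrate the iterative approach, here is a minimal sketch that fits a linear autoregressive model by least squares and then feeds each prediction back in as input for the next step. The helpers `fit_ar` and `iterative_forecast`, the lag order, and the simulated series are assumptions chosen for illustration; the models in later notebooks may differ.

```python
import numpy as np

def fit_ar(series, n_lags):
    """Fit an AR(n_lags) model with intercept by ordinary least squares."""
    series = np.asarray(series, dtype=float)
    # Each row holds the n_lags values preceding the target, oldest first.
    lags = np.column_stack(
        [series[i : len(series) - n_lags + i] for i in range(n_lags)]
    )
    design = np.column_stack([np.ones(len(lags)), lags])
    target = series[n_lags:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef  # [intercept, weight of oldest lag, ..., weight of newest lag]

def iterative_forecast(series, coef, horizon):
    """Predict `horizon` steps ahead by reusing a one-step model."""
    n_lags = len(coef) - 1
    window = list(np.asarray(series, dtype=float)[-n_lags:])
    forecasts = []
    for _ in range(horizon):
        next_value = float(coef[0] + np.dot(coef[1:], window))
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction.
        window = window[1:] + [next_value]
    return np.array(forecasts)

# Simulated return-like series with mild autocorrelation.
rng = np.random.default_rng(0)
returns = np.zeros(300)
for t in range(1, 300):
    returns[t] = 0.3 * returns[t - 1] + 0.01 * rng.standard_normal()

coef = fit_ar(returns, n_lags=2)
forecasts = iterative_forecast(returns, coef, horizon=4)
print(forecasts)
```

Because each step consumes earlier predictions rather than observed values, errors can accumulate as the horizon grows, which is the main trade-off of this approach against direct forecasting.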

---

## Conclusion

In this module, we explored the dataset that will be used throughout the remaining notebooks. We decided on a specific transformation, namely log daily returns. Additionally, we examined several visualizations to form an initial intuition about the data. Finally, we outlined the specifics of the forecasting problem that will be used in the upcoming notebooks.

--- 

## Exercises

- Generate plots of the rolling statistics (e.g., mean and standard deviation) of the raw exchange rates and of their log daily returns.

- Use EDA techniques for multivariate time series to analyze the log daily returns and identify similar time series.

---

## Next Steps

- Proceed to the other notebooks in this module to explore classical forecasting methods on this dataset.
- To learn about machine learning based approaches, check out module 4 (XGBoost), module 5 (LSTM-based models), module 6 (Transformer-based models), or module 7 (LLM-based models).

---