# Introduction and Preprocessing - Exercises

In [178]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Basic Information

--- 
Load the macroeconomic_missing dataset from the csv file `macroeconomic_missing.csv`. 

This is a multivariate time series of US Macroeconomic Data. More info at https://www.statsmodels.org/dev/datasets/generated/macrodata.html.

In [179]:
df = pd.read_csv("data/macroeconomic_missing.csv", index_col=0)

---

Extract basic information from the time series:
- shape of the time series
- number of channels
- number of points
- basic statistics
- missing values count

*hint: these are all attributes or methods of the time series dataframe (`df.something` or `df.something()`)*
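As a rough illustration of the attribute/method distinction in the hint, here is a minimal sketch on a small made-up dataframe. The frame `toy` and its column names are hypothetical and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame(
    {"a": [1.0, 2.0, np.nan, 4.0], "b": [10.0, np.nan, np.nan, 40.0]}
)

n_points, n_channels = toy.shape  # attribute: (rows, columns)
stats = toy.describe()            # method: count, mean, std, min, quartiles, max
missing = toy.isna().sum()        # method chain: NaN count per column

print(toy.shape)
print(stats)
print(missing)
```

The same calls should carry over to `df` unchanged.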

## Plots

---

Plot all channels in a single plot. Make sure that all the time series are readable by choosing an appropriate scale.

Does a logarithmic scale work in this case? Why?
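One thing that may be worth checking before choosing a log axis is whether every channel is strictly positive, since a logarithm is undefined for zero and negative values. The sketch below uses a hypothetical frame `toy`; it makes no claim about which channels of the real dataset pass the check.

```python
import pandas as pd

# Hypothetical toy frame: one strictly positive channel, one that crosses zero
toy = pd.DataFrame(
    {"level": [1000.0, 1100.0, 1250.0], "rate": [2.5, -0.3, 0.0]}
)

# min() skips NaN, so missing values do not affect the check
log_safe = toy.min() > 0

print(log_safe)
```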


---

Plot a boxplot of each channel. 

*hint: boxplot is a method of the dataframe*

## Time Series manipulation

---

Save the channel `tbilrate` into the variable `ts1`, convert it into a numpy array and plot it.
`tbilrate` is the quarterly average of the monthly 3-month Treasury bill secondary market rate.

*Do you notice something strange in the plot? What could be the reason for that?*

## Missing Values

---

Fill the missing values of `tbilrate` using forward fill and plot the original and imputed time series in the same plot.

*bonus: try to use a different imputation method and compare the results*
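To see how two imputation methods can differ, here is a minimal sketch on a made-up series. It uses the plain pandas methods `ffill` and `interpolate` rather than the sktime `Imputer`, and the series `s` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0])

ffilled = s.ffill()                            # repeat the last observed value
interpolated = s.interpolate(method="linear")  # straight line between neighbours

print(pd.DataFrame({"original": s, "ffill": ffilled, "linear": interpolated}))
```

Forward fill produces flat steps across each gap, while linear interpolation connects the surrounding observations.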

In [191]:
from sktime.transformations.series.impute import Imputer

---

Save the channel `infl` (inflation) in the variable `ts2`, converting it into a numpy array. Plot `ts1` (after imputation) and `ts2` in the same plot.

## Anomalies

---
`ts2` is very noisy, so try to find out if there are anomalies in the time series. Test at least two methods to detect anomalies and plot the time series with the detected anomalies.

*hint: don't go too heavy on the anomalies, try to find the ~5 most obvious ones*
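Two simple candidates are a z-score threshold and an interquartile-range (IQR) rule. The sketch below applies both to synthetic data with three injected spikes. The data, the spike positions and both thresholds (3 standard deviations, 3 times the IQR) are assumptions chosen for illustration and will likely need tuning for `ts2`.

```python
import numpy as np

# Synthetic noise with three injected spikes (hypothetical data)
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)
x[[20, 75, 150]] = [9.0, -8.0, 10.0]

# Method 1: z-score, flag points far from the mean in units of std
z = (x - x.mean()) / x.std()
z_mask = np.abs(z) > 3

# Method 2: IQR rule, flag points far outside the central 50% of the data
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_mask = (x < q1 - 3 * iqr) | (x > q3 + 3 * iqr)

print(np.flatnonzero(z_mask))
print(np.flatnonzero(iqr_mask))
```

Raising a threshold flags fewer points, which is one way to stay close to the handful of anomalies the hint suggests.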

### Outlier replacement

---
Treat the anomalies you found as missing values, and use an imputation method to fill them. Plot the original and imputed time series in the same plot.

*hint: again, don't remove too many points, just the most extreme ones*
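A possible way to do this with pandas alone is to mask the flagged points as `NaN` and then interpolate. In this sketch the series `x`, the median-distance rule and the threshold of 5 are all hypothetical.

```python
import pandas as pd

# Hypothetical series with two extreme values
x = pd.Series([1.0, 1.2, 0.9, 15.0, 1.1, 1.0, -12.0, 0.8])

# Flag points far from the median (threshold chosen for this toy data only)
mask = (x - x.median()).abs() > 5

# Replace flagged points with NaN, then fill them by linear interpolation
cleaned = x.mask(mask).interpolate(method="linear")

print(pd.DataFrame({"original": x, "cleaned": cleaned}))
```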

## Normalizations

In [208]:
from sktime.transformations.series.adapt import TabularToSeriesAdaptor
from sklearn.preprocessing import StandardScaler

---

From the previous steps, you should have a `ts1_imputed` (after missing value replacement) and a `ts2_imputed` (after anomaly replacement). Convert them to the same scale and plot them.
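For reference, standardization (z-score scaling) subtracts the mean and divides by the standard deviation. The sketch below does this with numpy on two hypothetical arrays `a` and `b`. With numpy's default `ddof=0` this should match what `StandardScaler` computes by default.

```python
import numpy as np

# Two hypothetical series on very different scales
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([100.0, 300.0, 200.0, 500.0, 400.0])


def zscore(x):
    """Rescale x to zero mean and unit (population) standard deviation."""
    return (x - x.mean()) / x.std()


a_scaled = zscore(a)
b_scaled = zscore(b)

print(a_scaled)
print(b_scaled)
```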

## Stationarity

In [212]:
from sktime.param_est.stationarity import StationarityADF
from statsmodels.tsa.stattools import adfuller
from sktime.transformations.series.difference import Differencer

---

- test if the two normalized time series are stationary
- try the same test after differencing the time series
- plot the autocorrelation of the differenced time series
- looking at the plots, what's the strongest seasonality in each time series? Do the two series share any seasonal component?
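The differencing and autocorrelation steps can be sketched with pandas alone. The stationarity test itself is left out here. The series below is synthetic, with a linear trend and a period of 12 that are assumptions made up for this example, and lags are restricted to 3 through 20 only to keep the toy example small.

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend + period-12 seasonality + small noise
rng = np.random.default_rng(0)
t = np.arange(240)
s = pd.Series(0.05 * t + np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.05, t.size))

# First-order differencing removes the linear trend
diffed = s.diff().dropna()

# Autocorrelation of the differenced series at each candidate lag
acf = pd.Series({lag: diffed.autocorr(lag=lag) for lag in range(3, 21)})
strongest = int(acf.idxmax())

print(acf.round(2))
print("strongest lag:", strongest)
```

The lag with the highest autocorrelation is a reasonable first guess for the seasonal period.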

## Decompose the time series

In [222]:
from sktime.forecasting.trend import STLForecaster

---

- Decompose the normalized (not differenced) time series into trend, seasonal and residual components.

*hint: for the seasonality, use the strongest seasonality (>2) you found from the ACF plot. Each time series can have a different seasonality, so you will probably need to decompose them separately.*

- Plot the decomposed parts of time series.

- Now compare only the trends of the two time series.
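To make the three components concrete, here is a minimal sketch of a classical additive decomposition with pandas. This is a simpler technique than STL and is not what `STLForecaster` does internally. The synthetic series and the period of 7 are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend + period-7 seasonality + small noise
period = 7
rng = np.random.default_rng(1)
t = np.arange(120)
s = pd.Series(0.1 * t + 2 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.1, t.size))

# Trend: centered moving average over one full period (NaN at both edges)
trend = s.rolling(window=period, center=True).mean()

# Seasonal: average of the detrended values at each position within the period
detrended = s - trend
seasonal = detrended.groupby(t % period).transform("mean")

# Residual: whatever is left after removing trend and seasonality
residual = s - trend - seasonal

print(pd.DataFrame({"trend": trend, "seasonal": seasonal, "residual": residual}).head(10))
```

By construction the three components add back up to the original series wherever the trend is defined.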

## Discussion

---

Compare the first plot of the time series with the last plot of the trends. What can you say about the time series? 

Do they have similar trends?

What can you say about the seasonality of the time series?

What can you say about the residuals of the time series?

Did normalization help in comparing the time series?

## Bonus Exercise 1

In [228]:
def moving_average(x, w):
    return x.rolling(window=w).mean()

---
Can you perform all the steps above at once for `ts1`?

*hint: you can use a Pipeline to do that*

Start from `ts1` and then:
- impute missing values (Imputer)
- normalize the time series (TabularToSeriesAdaptor + a scaler of your choice)
- deseasonalize the time series (Deseasonalizer)
- smooth the time series (with a moving average)
- impute missing values (generated by the moving average)

Plot the original and the final time series.
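Before building the sktime pipeline, it may help to see the idea of chaining steps in plain pandas. The sketch below is not an sktime `Pipeline`: it chains a subset of the steps with `Series.pipe`, leaves out deseasonalization, and runs on a hypothetical series `raw`. The helper `zscore` is also an assumption introduced for this example.

```python
import numpy as np
import pandas as pd


def moving_average(x, w):
    return x.rolling(window=w).mean()


def zscore(x):
    # Hypothetical helper: zero mean, unit population standard deviation
    return (x - x.mean()) / x.std(ddof=0)


# Hypothetical input with missing values
raw = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0])

result = (
    raw.ffill()                 # impute missing values
    .pipe(zscore)               # normalize
    .pipe(moving_average, w=3)  # smooth (leaves NaN in the first w - 1 points)
    .bfill()                    # impute the NaN created by the moving average
)

print(result)
```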


## Bonus Exercise 2

Given the following time series (`time_series1`, `time_series2`, `time_series3`, `time_series4`, `time_series5`), make them stationary if necessary, using the appropriate transformations.
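Differencing is one option. Another, for a series with a deterministic linear trend, is to fit a line and subtract it. The sketch below shows that second option on a hypothetical series that is not one of the five below. Which transformation suits each exercise series is left for you to decide.

```python
import numpy as np
import pandas as pd

# Hypothetical series: intercept 2, slope 0.5, unit-variance noise
rng = np.random.default_rng(3)
t = np.arange(80)
s = pd.Series(2.0 + 0.5 * t + rng.normal(0, 1, t.size))

# Fit a degree-1 polynomial (highest power first) and subtract it
slope, intercept = np.polyfit(t, s.to_numpy(), deg=1)
detrended = s - (slope * t + intercept)

print("estimated slope:", round(slope, 3))
print("mean after detrending:", round(detrended.mean(), 6))
```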

### Time Series 1

In [233]:
np.random.seed(0)
n = 100
x1 = np.linspace(0, 10, n)
noise1 = np.random.normal(0, 1, n)
time_series1 = pd.Series(x1 + noise1)

### Time Series 2

In [239]:
np.random.seed(1)
n = 120
x2 = 10 * np.sin(np.linspace(0, 3 * np.pi, n))
noise2 = np.random.normal(0, 1, n)
time_series2 = pd.Series(x2 + noise2)

### Time Series 3

In [241]:
np.random.seed(2)
n = 100
time_series3 = pd.Series(np.cumsum(np.random.normal(0, 1, n)))

### Time Series 4

In [246]:
np.random.seed(4)
n = 100
time_series4 = pd.Series(np.exp(np.linspace(0, 3, n)) + np.random.normal(0, 5, n)) + 1000

### Time Series 5

In [251]:
np.random.seed(4)
n = 150
x5 = np.linspace(0, 15, n)
sp = 5 * np.sin(np.linspace(0, 5 * np.pi, n))
noise5 = np.random.normal(0, 1, n)
time_series5 = pd.Series(x5 + sp + noise5)