[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nepslor/B5203E-TSAF/blob/main/W1/L1_data_visualization.ipynb)

# Time Series visualization and analysis

In this exercise we will go through an example of exploratory analysis and time series visualization with Python. We will use a dataset containing power measurements and meteorological forecasts for a set of **24 power meters** located in Rolle (Switzerland).


<img src="https://raw.githubusercontent.com/nepslor/teaching/main/TimeSeriesForecasting/figs/REeL_Demo_grid.png" width="500"/>

The yellow dots in the image show the positions of the power meters.
Besides power readings, the dataset includes **temperature** and **irradiance** measurements from a local meteo station.

Let's start by downloading the dataset and looking at its first rows:

In [None]:
import pandas as pd
df_all = pd.read_pickle("https://github.com/nepslor/teaching/raw/refs/heads/main/TimeSeriesForecasting/data/power_dataset.pk")
df_all.head()

In [None]:
# last timestep in the dataset
print(df_all.index[-1])

# descriptive statistics for the columns in the dataset
df_all.describe()

We see that:
* The dataset contains 3 signals: the power and two covariates, irradiance and temperature, possibly useful to predict the power
* The dataset has hourly timestamps
* It starts in January 2018 and ends in January 2019
* Each signal contains 8928 values

## Check for missing values and timestamp regularity
We can check whether the data contains missing values:

In [None]:
df_all.isna().sum()

We can also look at the distribution of the sampling intervals:

In [None]:
df_all.index.diff().value_counts()

The only sampling interval present is 1 hour, which means the series is regularly sampled.
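
To see what an irregular series would look like under the same check, here is a minimal sketch on a hypothetical toy index (not the course dataset) where one hourly timestamp has been removed. It uses `index.to_series().diff()`, which should behave like `index.diff()` but also works on older pandas versions.

```python
import pandas as pd

# hypothetical hourly index with one missing timestamp (03:00 is dropped)
idx = pd.date_range("2018-01-01", periods=6, freq="h").delete(3)
toy = pd.Series(range(len(idx)), index=idx)

# distribution of the sampling intervals: a gap shows up as a second value
steps = toy.index.to_series().diff().value_counts()
print(steps)

# reindexing on the full hourly grid exposes the gap as a NaN
full_idx = pd.date_range(toy.index[0], toy.index[-1], freq="h")
toy_regular = toy.reindex(full_idx)
print(toy_regular.isna().sum())
```

A series with more than one distinct interval usually needs to be reindexed (and possibly interpolated) before lag-based tools such as `shift` can be read as "k hours ago".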

## Line plots
We can try to plot a subset of the dataset as a line plot via pandas:

In [None]:
data = df_all[['power', 'temperature', 'irradiance']]

# simple plot
data.plot(figsize=(20, 3))

# last 7 days plot
data.tail(24*7).plot(figsize=(20, 3))

# since signals have different scales, it is useful to plot them in separate axes:
data.tail(24*7).plot(figsize=(20, 3), subplots=True)

Let's plot the power meters' readings, filtering the dataset using the `.filter` method and the `like` argument:

In [None]:
df_all.filter(like='meter').plot(figsize=(20, 3), alpha=0.5).legend(ncols=3, loc='upper right')
df_all.filter(like='meter').tail(24*7).plot(figsize=(20, 3), alpha=0.5).legend(ncols=3, loc='upper right')

## Scatter plots

We start with a simple analysis, scattering all the signals against each other using the `pairplot` function of the `seaborn` library. We are also interested in seeing whether the current value of the power is correlated with its value on the previous day. To see this we can use the `pandas` `shift` method. Note how the first values become NaNs, since we cannot retrieve past values for the first 24 observations:

In [None]:
# we shift the signal by 24 steps
data['power'].shift(24)

We can use the `.assign` method, which temporarily adds a column to the dataframe (it returns a new dataframe and leaves the original untouched), and scatter all 4 variables:

In [None]:
import seaborn as sb

sb.pairplot(data.assign(power_lag24=data['power'].shift(24)),
            plot_kws={"s": 3, "alpha":0.2})
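
The mechanics of `shift` and `assign` used above can be checked on a tiny dataframe. The sketch below uses made-up values and a hypothetical column name `power_lag2`; it only illustrates how the lagged column is built and why the first rows are incomplete.

```python
import pandas as pd

# hypothetical toy series, only to illustrate the mechanics of shift and assign
toy = pd.DataFrame({"power": [10.0, 11.0, 12.0, 13.0, 14.0]})

lagged = toy["power"].shift(2)          # row t now holds the value observed at t-2
print(lagged.tolist())                  # the first 2 entries are NaN

# assign returns a new dataframe: toy itself is left untouched
toy_lag = toy.assign(power_lag2=lagged)
print(list(toy.columns), list(toy_lag.columns))

# dropping the incomplete rows leaves only valid (x_t, x_{t-2}) pairs
pairs = toy_lag.dropna()
print(pairs)
```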


❓ **which patterns/relations can you spot between these variables?**



We can try to visualize the relationship of the signal with itself at increasing lags by producing a set of scatter plots, here for lags from 1 to 48:

In [None]:
#@title Lag Animation
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML
import numpy as np

# Create the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# Initialize the scatter plot
x = data['power']
y = data['power'].shift(1)
sc = ax.scatter(x, y, s=1, alpha=0.5)

# Set axis labels and title
ax.set_xlabel('Power')
ax.set_ylabel('Power Lag')
ax.set_title('Scatter Plot of Power vs. Power Lag')

# despine axes
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Animation update function
def update(frame):
    k = frame + 1
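    sc.set_offsets(np.column_stack([data['power'], data['power'].shift(k)]))  # Update the scatter plot data
    ax.set_ylabel(f'Power Lag {k}')
    # compute the correlation between the signal and its k-step lag (NaN pairs are ignored)
    rho = data['power'].corr(data['power'].shift(k))
    ax.set_title(f'Scatter Plot of Power vs. Power Lag {k} (corr = {rho:.2f})')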
    return sc,

# Create the animation
ani = FuncAnimation(fig, update, frames=48, blit=True, interval=80, repeat=True)
plt.close(fig)
# Display the animation in HTML5
HTML(ani.to_jshtml())


## Autocorrelation function
This information can be summarized by plotting just the **linear dependence** of the signal with its previous lags, the autocorrelation function.

$$\begin{align}\rho_k = \frac{cov(x_t, x_{t-k})}{\sigma(x_t)\sigma(x_{t-k})}
\stackrel{stationary}{=}& \frac{cov(x_t, x_{t-k})}{\sigma^2(x)}\\
\stackrel{sampling}{=}& \frac{\sum_{t=k+1}^{T} (x_t-\hat{\mu}_x) (x_{t -k}-\hat{\mu}_x)}{T \hat{\sigma}_x^2}
\end{align}$$
where
$$
\begin{align}
\hat{\mu}_x &= \frac{1}{T} \sum_{t=1}^T x_t \qquad \qquad \ \ \  \color{green}{\text{df.mean()}} \\
\hat{\sigma}_x^2 &= \frac{1}{T} \sum_{t=1}^T (x_t-\hat{\mu}_x)^2 \qquad \color{green}{\text{df.var(ddof=0)}}\\
\end{align}
$$
❓ **Try to code the autocorrelation function using the .shift method**

You can define it as `acf = lambda x, k: your code here`

In [None]:
acf = lambda x, k: float('nan')  # placeholder (plots nothing): complete this line
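
If you get stuck, the following is one possible solution sketch, checked on a synthetic signal rather than on the course data. The name `acf_sketch` and the toy series are illustrative assumptions; the function follows the sampled estimator of the autocorrelation, with the population variance (`ddof=0`) in the denominator.

```python
import numpy as np
import pandas as pd

def acf_sketch(x: pd.Series, k: int) -> float:
    """Sample autocorrelation at lag k, built with the .shift method."""
    x_c = x - x.mean()                       # centered signal
    # shift(k) aligns x_{t-k} with x_t; the first k products are NaN and sum() skips them
    num = (x_c * x_c.shift(k)).sum()
    den = len(x) * x.var(ddof=0)             # T * sigma^2 (population variance)
    return num / den

# hypothetical hourly signal with a daily cycle plus a little noise
rng = np.random.default_rng(0)
t = np.arange(24 * 30)
toy = pd.Series(np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size))

print(acf_sketch(toy, 0), acf_sketch(toy, 12), acf_sketch(toy, 24))
```

On this toy signal the value at lag 0 is 1 by construction, lag 12 (half a period) is strongly negative and lag 24 (one full period) is strongly positive.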

In [None]:
from statsmodels.graphics.tsaplots import plot_acf
fig, ax = plt.subplots(figsize=(15, 3))
plot_acf(data['power'], lags=24*8, ax=ax, label='statsmodels ACF');
plt.plot([acf(data['power'], k) for k in range(24*8)], label='our ACF')
plt.grid()
plt.ylim(-0.25, 1)
plt.legend(loc='lower right');

Strong seasonality can be spotted at 24-hour intervals, with a second local maximum after 7 days (lag 168), indicating a strong weekly seasonality.
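
As a rough illustration of how daily and weekly cycles show up in the ACF, the sketch below builds a hypothetical signal made of two sinusoids (24-hour and 168-hour periods) and compares the autocorrelation at multiples of 24 hours. The helper `sample_acf` is an illustrative numpy implementation of the biased estimator, not the statsmodels one.

```python
import numpy as np

# hypothetical hourly signal: a daily cycle plus a weekly cycle, over 8 weeks
t = np.arange(24 * 7 * 8)
x = np.sin(2 * np.pi * t / 24) + np.sin(2 * np.pi * t / (24 * 7))

def sample_acf(x, max_lag):
    """Biased sample autocorrelation for lags 0..max_lag."""
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    return np.array([np.sum(xc[k:] * xc[:len(xc) - k]) / denom
                     for k in range(max_lag + 1)])

rho = sample_acf(x, 24 * 8)

# compare the ACF at multiples of 24 hours: the weekly lag (168) stands out
for k in range(24, 24 * 8 + 1, 24):
    print(k, round(rho[k], 3))
```

Every multiple of 24 is a local maximum, and the one at 168 is higher than its neighbours at 144 and 192, which is the signature of a weekly pattern on top of the daily one.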


## Embeddings
We can now use the lags of the two local maxima (24 and 24*7 steps) as embedding dimensions for the time series, and check whether the signal shows evident patterns.
Loosely speaking, the following chain holds:

                        patterns -->  compressibility --> forecastability

The idea: if we reshape the signal using the lag of an ACF maximum as the period, we can plot it as a matrix, making patterns evident.

In [None]:
daily_power = data.assign(                            # the assign method temporarily adds new features to a dataframe
    day=data.index.date,
    hour=data.index.hour
).pivot(index='hour', columns='day', values='power')  # pivot creates a matrix from "index" and "columns"

# plot heatmap
fig, ax = plt.subplots(figsize=(15, 3))
sb.heatmap(daily_power, cmap='viridis', ax=ax)
ax.set_title('Daily power consumption')
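
For a regularly sampled series that starts at midnight and covers full days, the pivot above should be equivalent to a plain reshape with period 24. The sketch below checks this on a hypothetical 5-day signal; the names `toy`, `by_pivot` and `by_reshape` are illustrative.

```python
import numpy as np
import pandas as pd

# hypothetical signal: 5 full days of hourly data with a repeating daily profile
idx = pd.date_range("2018-01-01", periods=24 * 5, freq="h")
toy = pd.Series(np.tile(np.arange(24.0), 5), index=idx, name="power")

# pivot-based embedding: rows are hours of the day, columns are days
by_pivot = toy.to_frame().assign(day=idx.date, hour=idx.hour).pivot(
    index="hour", columns="day", values="power")

# reshape-based embedding: one row per day, then transpose to hour x day
by_reshape = toy.to_numpy().reshape(-1, 24).T

print(by_pivot.shape, by_reshape.shape)
print(np.allclose(by_pivot.to_numpy(), by_reshape))
```

The pivot version is more robust on real data, since it still works when the series has missing timestamps or does not start at hour 0.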


In [None]:
# the dataset spans more than one year, so the "week" index is not unique ->
# -> we use pivot_table, which averages observations falling in the same index-column bin
weekly_power = data.assign(
    week=data.index.isocalendar().week,
    weekhour= data.index.hour + data.index.dayofweek * 24
).pivot_table(index='weekhour', columns='week', values='power')

# plot heatmap
fig, ax = plt.subplots(figsize=(15, 3))
sb.heatmap(weekly_power, cmap='viridis', ax=ax)
ax.set_title('Weekly power consumption')
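
The difference between `pivot` and `pivot_table` can be seen on a small hypothetical table in which one index-column pair appears twice (values are made up for illustration):

```python
import pandas as pd

# hypothetical long-format table where the (hour, week) pair (0, 1) appears twice
toy = pd.DataFrame({
    "hour":  [0, 1, 0, 1, 0],
    "week":  [1, 1, 2, 2, 1],
    "power": [10.0, 20.0, 30.0, 40.0, 14.0],
})

# pivot refuses to reshape when index/column pairs are duplicated
try:
    toy.pivot(index="hour", columns="week", values="power")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table aggregates duplicates, using the mean by default
table = toy.pivot_table(index="hour", columns="week", values="power")
print(table)
```

A different aggregation can be requested through the `aggfunc` argument of `pivot_table`, for example `aggfunc="median"`.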


❓ **Try to obtain the same plots for temperature and irradiance. What do you observe?**

In [None]:
# get the day of the week of each column
dow = data.assign(day=data.index.date,
      hour=data.index.hour,
      dayofweek=data.index.dayofweek).pivot(index='hour',
                                            columns='day',
                                            values='dayofweek').mean()

# daily profiles: weekdays (Mon-Fri) in pink, weekends (Sat-Sun) in red
daily_power.loc[:,dow<5].plot(color='pink', alpha=0.3, legend=False)
daily_power.loc[:,dow>=5].plot(color='r', alpha=0.3, legend=False, ax=plt.gca())

## ❓Some exploratory analysis
* Among the bottom time series, find the most similar and most dissimilar pairs
* Look at the most dissimilar pair. Try to scatter the two series against the values of the predicted GHI (global horizontal irradiance)
* Can you spot other series for which the GHI has a similar effect?
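
As a possible starting point for the first question, the sketch below ranks pairs of series by their Pearson correlation. Using correlation as the similarity measure is an assumption (other distances are equally legitimate), and the three `meter_*` columns are synthetic stand-ins, not the real meters.

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for three meter series
rng = np.random.default_rng(0)
base = rng.standard_normal(500)
toy = pd.DataFrame({
    "meter_a": base,
    "meter_b": base + 0.1 * rng.standard_normal(500),   # nearly a copy of meter_a
    "meter_c": rng.standard_normal(500),                # unrelated to the others
})

corr = toy.corr()
# keep each pair once: mask the diagonal and the lower triangle, then flatten
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

print("most similar:   ", pairs.idxmax(), round(pairs.max(), 3))
print("most dissimilar:", pairs.idxmin(), round(pairs.min(), 3))
```

On the real dataset the same steps could be applied to `df_all.filter(like='meter')`.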


# BONUS: forecasting by whitening
In the following we will explore a powerful method to produce probabilistic forecasts of a time series.  

Idea: **if we can make the TS ~white noise through transformations, we can predict it by sampling from the noise and inverting the transform**

Mathematically, if
$$f_1(f_2(..f_n(x))) \quad \sim \text{i.i.d.} \quad \mathcal{N}(0, \sigma)$$
then
$$\hat{x}_{t:t+h} = f_n^{-1}(f_{n-1}^{-1}(..f_1^{-1}(\mathcal{N}(0, \sigma))))$$
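
To make the inversion concrete, here is a minimal sketch with two transformations on a hypothetical random-walk series: a seasonal difference (playing the role of $f_2$) followed by a first difference (playing the role of $f_1$). It checks that undoing them in reverse order recovers the original values. Note that an exact inversion of a difference needs the initial values as well, which is why the cumulative sum is anchored at the first valid weekly difference.

```python
import numpy as np
import pandas as pd

# hypothetical series: 4 weeks of hourly values
rng = np.random.default_rng(0)
x = pd.Series(rng.standard_normal(24 * 7 * 4).cumsum())

season = 24 * 7
w = x.diff(season)        # f_2: difference from the value one week earlier
e = w.diff(1)             # f_1: difference from the previous step

# invert f_1: cumulative sum, anchored at the first valid value of w
start = season            # w is defined from position `season` onwards
w_rec = w.iloc[start] + e.iloc[start + 1:].cumsum()

# invert f_2: add back the value observed one week earlier
x_rec = w_rec + x.shift(season).iloc[start + 1:]

print(np.allclose(x_rec, x.iloc[start + 1:]))
```

When forecasting, the observed `e` is replaced by samples from its (approximately white) distribution, and the same two inverse steps turn those samples into future values of `x`.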



In [None]:
# let's start by splitting the time series in a training and a test set
p_tr = data['power'].iloc[:-24]
p_te = data['power'].iloc[-24:]


We will apply just two transformations: the difference from the previous week's values and the difference from the previous step.

In [None]:
p_week_diff = p_tr.diff(24*7)
p_week_diff_1 = p_week_diff.diff(1)

# plot histograms of p_tr and the other two transformations
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 3, figsize=(15, 3))
p_tr.plot.hist(ax=axs[0], bins=50, alpha=0.5, label='p_tr')
p_week_diff.plot.hist(ax=axs[1], bins=50, alpha=0.5, label='p_week_diff')
p_week_diff_1.plot.hist(ax=axs[2], bins=50, alpha=0.5, label='p_week_diff_1')
for ax in axs:
    ax.legend()


# plot ACF functions for all the three signals
fig, axs = plt.subplots(1, 3, figsize=(15, 3))
plot_acf(p_tr, lags=24*8, ax=axs[0], label='p_tr')
plot_acf(p_week_diff.dropna(), lags=24*8, ax=axs[1], label='p_week_diff')
plot_acf(p_week_diff_1.dropna(), lags=24*8, ax=axs[2], label='p_week_diff_1')
for ax in axs:
    ax.legend()

plt.figure(figsize=(15, 3))
(p_tr).tail(24*30).plot()
p_week_diff.tail(24*30).plot()
p_week_diff_1.tail(24*30).plot()



In [None]:
import numpy as np

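# residuals to resample: drop the NaNs introduced by the two differencing steps
eps = p_week_diff_1.dropna().to_numpy()

for i in range(100):
  # values of previous week + integrated samples of p_week_diff_1
  # (an exact inversion of diff(1) would also add the last observed p_week_diff value)
  p_hat = p_tr.iloc[-24*7:-24*6] + np.random.choice(eps, 24).cumsum()
  p_hat.index = p_te.index
  p_hat.plot(color='r', alpha=0.3, linewidth=0.5)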


# plot the forecast and some history
(p_tr).tail(24*14).plot(figsize=(15, 5))
p_week_diff.tail(24*14).plot()
p_week_diff_1.tail(24*14).plot()
p_te.plot(color='k', linewidth=2)
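
The sampled trajectories can also be summarized as prediction bands. The sketch below is a hedged illustration on synthetic stand-ins (`prev_week` and `residuals` are made up, and the 10%, 50% and 90% quantile levels are an arbitrary choice): it stacks the scenarios in a matrix and takes empirical quantiles for each forecast step.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# hypothetical stand-ins: last week's profile and the whitened residuals
prev_week = pd.Series(np.sin(2 * np.pi * np.arange(24) / 24))
residuals = 0.05 * rng.standard_normal(1000)

# one row per scenario: previous-week values + integrated resampled residuals
n_scen, horizon = 200, 24
steps = rng.choice(residuals, size=(n_scen, horizon))
scenarios = prev_week.to_numpy() + steps.cumsum(axis=1)

# empirical quantiles across scenarios give a prediction band per time step
q10, q50, q90 = np.quantile(scenarios, [0.1, 0.5, 0.9], axis=0)
print(scenarios.shape, (q90 - q10).round(2))
```

Because the residuals are integrated with a cumulative sum, the band widens with the forecast horizon, as expected for a random-walk-like error.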
