**Data Science and AI for Energy Systems** 

Karlsruhe Institute of Technology

Institute of Automation and Applied Informatics

Summer Term 2024

---

# Exercise IV: Data Analysis

**Imports**

In [1]:
#!pip install emd
import emd
import scipy as sc
import scipy.stats  # ensures sc.stats is loaded, also on older SciPy versions
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq
import statsmodels.tsa.stattools  # provides acf, used in parts (d)
import seaborn as sns
import matplotlib

## Problem IV.2 (Programming) -- Basic data analysis methods in Python 

#### We consider several basic data analysis methods: calculating the moments, computing the autocorrelation, Fourier analysis and Empirical Mode Decomposition. To understand the methods in detail, we apply them to both synthetic data and an empirical load time series. In this problem we consider a synthetic dataset, part of which was already shown in task I.1 (d). The dataset is provided as *synthetic\_data\_ex4.csv* in the BW-Sync-and-share folder [https://bwsyncandshare.kit.edu/s/QPySS7eZCWjSjYP](https://bwsyncandshare.kit.edu/s/QPySS7eZCWjSjYP).

***
**(a) Calculate the mean, standard deviation, skewness and kurtosis of the synthetic dataset.**


In [3]:
df_syn = pd.read_csv('data/synthetic_data_ex4.csv', index_col=0)
df_syn_values = df_syn.values
'''You can use the following functions:
mean: np.mean
standard deviation: np.std
skewness: sc.stats.skew
kurtosis: sc.stats.kurtosis'''
# Now calculate the mean, standard deviation, skewness and kurtosis of the synthetic data:


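
As a hedged illustration of the four functions listed above, the following sketch applies them to a stand-in signal rather than to the CSV file. The signal follows the recipe from the additional note at the end of this sheet; the NumPy random generator and the seed are assumptions added for reproducibility.

```python
import numpy as np
from scipy import stats

# Stand-in signal (assumption): built like the recipe in the additional note,
# but with a seeded NumPy generator so that the numbers are reproducible.
rng = np.random.default_rng(0)
time = np.linspace(0, 1, 1000, endpoint=False)
x = (2 * np.sin(time * 10 * 2 * np.pi)
     + 0.3 * np.sin(time * 85 * 2 * np.pi)
     + rng.normal(loc=0.0, scale=0.1, size=time.size))

mean = np.mean(x)
std = np.std(x)             # population standard deviation (ddof=0)
skewness = stats.skew(x)
kurt = stats.kurtosis(x)    # Fisher definition: a normal distribution gives 0

print(f"mean={mean:.3f}, std={std:.3f}, skewness={skewness:.3f}, kurtosis={kurt:.3f}")
```

Note that *np.std* uses `ddof=0` by default, whereas *pd.DataFrame.std* uses `ddof=1`, so the two can differ slightly. *sc.stats.kurtosis* returns the excess kurtosis by default.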

***
**(b) Show the probability distribution for the time series (use *seaborn.kdeplot*). Which kind of modality can you observe?**

***
**(c) Calculate the increments of the time series and plot the probability distribution (use again *seaborn.kdeplot*). Which kind of modality can you observe?**

In [10]:
'''Use pd.DataFrame.diff() or np.diff() to calculate the increments of the synthetic data:'''


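
The two options differ in how they handle the first sample. The sketch below shows this on a hypothetical stand-in DataFrame (`df_demo` and its column name `value` are made up for illustration and are not part of the exercise data).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_syn: a noise-free sine sampled at 1000 points.
time = np.linspace(0, 1, 1000, endpoint=False)
df_demo = pd.DataFrame({"value": 2 * np.sin(time * 10 * 2 * np.pi)}, index=time)

# pandas: the first row has no predecessor and becomes NaN, so drop it.
increments_pd = df_demo.diff().dropna()

# NumPy: the result is simply one element shorter than the input.
increments_np = np.diff(df_demo["value"].to_numpy())

print(increments_pd.shape, increments_np.shape)
print(np.allclose(increments_pd["value"].to_numpy(), increments_np))
```

Either result can then be passed to *seaborn.kdeplot*.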

***
**(d) Plot the autocorrelation $\rho_{xx}(\tau)$ for $\tau \in [0,1000]$. You can use the function *acf* from the module *statsmodels.tsa.stattools* with *nlags* $= 1000$. What behaviour do you observe in the autocorrelation?**
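
To see what such a function computes, here is a manual NumPy sketch of the sample autocorrelation. It is a stand-in for illustration, not the statsmodels routine: the helper name `autocorrelation` is hypothetical, and the normalisation (every lag divided by the lag-0 sum) is assumed here to correspond to the default of *acf*, which should be checked against the statsmodels documentation.

```python
import numpy as np

def autocorrelation(x, nlags):
    """Sample autocorrelation rho(tau) for tau = 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    n = centered.size
    # np.correlate in "full" mode returns lags -(n-1)..(n-1); keep lags >= 0.
    autocov = np.correlate(centered, centered, mode="full")[n - 1:]
    return autocov[:nlags + 1] / autocov[0]

# Noise-free sine with a period of exactly 100 samples.
time = np.linspace(0, 1, 1000, endpoint=False)
x = 2 * np.sin(time * 10 * 2 * np.pi)

rho = autocorrelation(x, nlags=300)
print(rho[0], rho[50], rho[100])
```

With this estimator, the envelope of the autocorrelation of a finite periodic signal shrinks with growing lag, because fewer and fewer sample pairs overlap. This is a property of the estimator and not of the signal.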

***
**(e) We perform a Fourier analysis by calculating the Discrete Fourier Transform (DFT), first directly and then with the Fast Fourier Transform (FFT).**

(i) Write a function to calculate the DFT, which is given as $X_k = \sum_{m=0}^{n-1}x_m e^{-2i\pi km/n}$ for discrete measurements $x_0,x_1,\ldots , x_{n-1}$. Calculate the DFT with this function.

(ii) Now use *scipy.fft.fft* and *scipy.fft.fftfreq* to calculate the Fast Fourier Transform.

In [19]:
'''
Use the following example as a guide for (ii) and (iii) below:

# Number of sample points
N = 600
# sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N*T, N, endpoint=False)
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = fft(y)
xf = fftfreq(N, T)[:N//2]
plt.plot(xf, 2.0/N * np.abs(yf[0:N//2]))
plt.grid()
plt.show()

'''
# The Fast Fourier Transform (FFT) is calculated as follows:



(iii) Plot the Fourier amplitudes as a function of frequency, using the results from both (i) and (ii). Are the results (almost) identical? What are the main frequencies of the signal?
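
One possible way to approach (i) to (iii) is sketched below: the DFT sum is written as a matrix-vector product and compared with *scipy.fft.fft*. The helper name `dft` is hypothetical, and the noise-free test signal is an assumption chosen so that the result is deterministic; the exercise itself uses the data from the CSV file.

```python
import numpy as np
from scipy.fft import fft, fftfreq

def dft(x):
    """Direct O(n^2) evaluation of X_k = sum_m x_m * exp(-2j*pi*k*m/n)."""
    x = np.asarray(x, dtype=complex)
    n = x.size
    m = np.arange(n)
    k = m.reshape(-1, 1)
    # Row k of this n-by-n matrix holds the factors exp(-2j*pi*k*m/n).
    dft_matrix = np.exp(-2j * np.pi * k * m / n)
    return dft_matrix @ x

# Noise-free version of the synthetic signal (1000 samples over one time unit).
n = 1000
dt = 1.0 / n
time = np.arange(n) * dt
x = 2 * np.sin(time * 10 * 2 * np.pi) + 0.3 * np.sin(time * 85 * 2 * np.pi)

X_dft = dft(x)
X_fft = fft(x)
freqs = fftfreq(n, dt)[: n // 2]
amplitudes = 2.0 / n * np.abs(X_fft[: n // 2])

print(np.allclose(X_dft, X_fft))
# Frequencies of the two largest amplitudes, strongest first.
main_freqs = freqs[np.argsort(amplitudes)[-2:][::-1]]
print(main_freqs)
```

The direct version needs on the order of $n^2$ operations and an $n \times n$ matrix in memory, which is fine for $n = 1000$ but becomes impractical for long time series. The FFT yields the same coefficients in $O(n \log n)$ operations.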

***
**(f) Finally, we carry out an Empirical Mode Decomposition. To extract the Intrinsic Mode Functions (IMFs) of the signal, use the function *emd.sift.sift*. You can plot the IMFs with *emd.plotting.plot\_imfs*. <br>How many mode functions are extracted, and which of them represent the long-term oscillations, the short-term oscillations and the noise?**

## Problem IV.3 (Programming) -- Basic data analysis methods for a load time series

#### Having applied some basic data analysis methods to a synthetic dataset, we now apply the same methods to an empirical time series. It shows several months of energy demand of an industrial building with a time resolution of $15$ minutes. The empirical dataset is provided as *empirical\_load\_data\_ex4.csv* in the BW-Sync-and-share folder [https://bwsyncandshare.kit.edu/s/QPySS7eZCWjSjYP](https://bwsyncandshare.kit.edu/s/QPySS7eZCWjSjYP).

***
**(a) Calculate the mean, standard deviation, skewness and kurtosis for the load time series.**

In [4]:
df_load = pd.read_csv('data/empirical_data_ex4.csv', parse_dates=True, index_col=0, names=['Load time series']).squeeze()
df_load_values = df_load.values
# Now calculate the mean, standard deviation, skewness and kurtosis of the empirical data:


***
**(b) Show the probability distribution for the time series (use *seaborn.kdeplot*). In particular, plot the results for the smoothing parameters *bw\_adjust* $\in \{0.1,0.5,1\}$, and also plot the histogram using *plt.hist* with $50$ bins.**

***
**(c) Calculate the increments of the time series and plot the probability distribution (use again *seaborn.kdeplot*).**


Calculate the increments:

***
**(d) Plot the autocorrelation $\rho_{xx}(\tau)$ for $\tau \in [0, 24 \cdot 4 \cdot 10]$ (10 days). Which behaviour can you observe in the autocorrelation?**

***
**(e) We perform a Fourier analysis with the Fast Fourier Transform: Use *scipy.fft.fft* and *scipy.fft.fftfreq* to calculate it. Plot the Fourier amplitudes as a function of frequency. What are the main frequencies of the signal?**
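
For a time series with 15-minute resolution, the sample spacing passed to *fftfreq* determines the unit of the frequency axis. The sketch below illustrates this on a hypothetical load-like signal with a daily and a weekly cycle. All numbers are made up for illustration and do not come from the empirical dataset.

```python
import numpy as np
from scipy.fft import fft, fftfreq

# Hypothetical load-like signal: 28 days at 15-minute resolution with a daily
# and a weekly cycle. All values are made up for illustration only.
samples_per_day = 24 * 4
n = 28 * samples_per_day
t_days = np.arange(n) / samples_per_day
load = (100
        + 20 * np.sin(2 * np.pi * t_days)          # one cycle per day
        + 10 * np.sin(2 * np.pi * t_days / 7))     # one cycle per week

# Remove the mean so the zero-frequency bin does not dominate the spectrum.
spectrum = fft(load - load.mean())

# Sample spacing in days, so the frequencies come out in cycles per day.
freqs = fftfreq(n, d=1 / samples_per_day)[: n // 2]
amplitudes = 2.0 / n * np.abs(spectrum[: n // 2])

top_two = np.argsort(amplitudes)[-2:][::-1]
for f, a in zip(freqs[top_two], amplitudes[top_two]):
    print(f"frequency = {f:.4f} cycles/day, period = {1 / f:.2f} days, amplitude = {a:.2f}")
```

Expressing the sample spacing in days makes the peaks easy to read: a daily pattern appears at 1 cycle per day and a weekly pattern at 1/7 cycles per day. In real data the peaks are usually broader, since the record rarely contains an exact integer number of periods.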

***
**(f) As for the synthetic time series in Problem IV.2, we carry out an Empirical Mode Decomposition: <br>Use the function *emd.sift.sift* to extract the Intrinsic Mode Functions (IMFs) of the load signal. You can plot the IMFs with *emd.plotting.plot\_imfs*. How many mode functions are extracted, and do the modes provide information about the oscillations and trends that determine the time series?**

### Additional note: The synthetic dataset for Problem IV.2 is created as follows:

In [21]:
time = np.linspace(0, 1, 1000, endpoint=False)
# Two sine waves with 10 and 85 cycles per unit time
cleanData = 2*np.sin(time*10*2*np.pi) + 0.3*np.sin(time*85*2*np.pi)
# Add Gaussian noise (mean 0, standard deviation 0.1)
noiseProcess = sc.stats.norm(loc=0, scale=0.1).rvs(size=len(time))
synthetic_data = cleanData + noiseProcess

You can change the dataset and analyze the data for different parameters.
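
To make such experiments easier, the recipe can be wrapped in a small function. The following is only a sketch: the function name `make_synthetic_data` and its parameters are hypothetical, and it draws the noise from a seeded NumPy generator instead of *sc.stats.norm* so that runs are reproducible.

```python
import numpy as np

def make_synthetic_data(n_samples=1000, freqs=(10, 85), amps=(2.0, 0.3),
                        noise_std=0.1, seed=None):
    """Sum of sine waves plus Gaussian noise on the time interval [0, 1)."""
    rng = np.random.default_rng(seed)
    time = np.linspace(0, 1, n_samples, endpoint=False)
    clean = sum(a * np.sin(time * f * 2 * np.pi) for f, a in zip(freqs, amps))
    noise = rng.normal(loc=0.0, scale=noise_std, size=n_samples)
    return time, clean + noise

# Defaults mirror the recipe above; the second call triples the noise level.
time, baseline = make_synthetic_data(seed=0)
_, noisier = make_synthetic_data(noise_std=0.3, seed=0)
print(baseline.std(), noisier.std())
```

Varying `noise_std` or the frequencies and repeating parts (a) to (f) shows how each method reacts, for example how a higher noise level affects the separation of the IMFs.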