**Data Science and AI for Energy Systems** 

Karlsruhe Institute of Technology

Institute of Automation and Applied Informatics

Summer Term 2024

---

# Exercise V: Central Limit Theorem and Superstatistics

**Imports**

In [32]:
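import numpy as np
import matplotlib.pyplot as plt
import scipy as sc
# explicit submodule imports so that sc.stats, sc.integrate and sc.optimize are always available
import scipy.stats
import scipy.integrate
import scipy.optimize
import pandas as pd
import seaborn as sns
from scipy.special import gamma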

## Problem V.2 (programming) -- Stable distributions

**After examining some properties of Gaussian random variables, we now turn to non-Gaussian distributions; in this exercise we consider stable distributions. We start by writing a function for the probability density that depends on the stability, skewness, scale and location parameters.**

***
**(a) The characteristic function $\varphi(t,\alpha, \beta, c, \mu)$ of a stable probability distribution is given by**


In [33]:
def char_fun_stable(t,alpha, beta, c, mu):
    return np.exp(1j*t*mu - ((np.abs(c*t))**(alpha))*
            (1-1j*beta*np.sign(t)*np.tan(np.pi*alpha/2)))

**for $\alpha \neq 1$.<br>
From this we can write down a formula for the probability density function (PDF) of the stable distribution. The PDF is the Fourier transform of the characteristic function:
\begin{align*}p(x)=\frac{1}{2\pi}\int_{-\infty}^{+\infty}\varphi(t)e^{-ixt}dt.\end{align*} 
Write a function for the PDF using *scipy.integrate.quad*.**
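
One possible way to organise this computation is sketched here; it is not the reference solution. Because $\varphi(-t)$ is the complex conjugate of $\varphi(t)$, the integral can be reduced to the real part over $t \geq 0$. The names `stable_integrand` and `stable_pdf` are hypothetical, and the characteristic function is repeated only so that the block runs on its own.

```python
import numpy as np
from scipy.integrate import quad

def char_fun_stable(t, alpha, beta, c, mu):
    # characteristic function from (a), valid for alpha != 1
    return np.exp(1j*t*mu - np.abs(c*t)**alpha
                  * (1 - 1j*beta*np.sign(t)*np.tan(np.pi*alpha/2)))

def stable_integrand(t, x, alpha, beta, c, mu):
    # real part of phi(t)*exp(-i x t); the imaginary parts at t and -t cancel
    return np.real(char_fun_stable(t, alpha, beta, c, mu)*np.exp(-1j*x*t))

def stable_pdf(x, alpha, beta, c=1.0, mu=0.0):
    # p(x) = (1/pi) * integral from 0 to infinity of Re[phi(t) exp(-i x t)] dt
    value, _ = quad(stable_integrand, 0, np.inf,
                    args=(x, alpha, beta, c, mu), limit=200)
    return value/np.pi

# sanity check: alpha = 2, beta = 0 is a Gaussian with variance 2*c**2
p0 = stable_pdf(0.0, alpha=2.0, beta=0.0)
```

For small $\alpha$ the characteristic function decays slowly, so the oscillatory integral may converge poorly for large $|x|$; warnings from `quad` should be taken seriously in that regime.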

In [34]:
'''Write a separate function for the integrand and then write a function for the integration, using scipy.integrate.quad'''


***
**(b) Consider the parameters $c = 1, \mu = 0$. Plot the PDF for each combination of $\alpha \in \{0.4,1.5\}$ and $\beta \in \{0,0.5,0.75\}$, for $x \in [-5,5]$ with a step size of $0.05$.**

***
**(c) Instead of our own implementation we can also use *scipy.stats.levy\_stable*. Plot the PDF using *scipy.stats.levy\_stable.pdf* with the same parameters as in (b). What do you observe?** 
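
The following sketch illustrates how the parameters map onto *scipy.stats.levy\_stable*: the shape arguments are $\alpha$ and $\beta$, while $\mu$ and $c$ are passed as `loc` and `scale`. To my understanding the default parameterization of `levy_stable` ("S1") corresponds to the characteristic function in (a); treat this as an assumption and check it against the SciPy documentation. The Gaussian case $\alpha = 2$ serves as a simple check.

```python
import numpy as np
from scipy.stats import levy_stable, norm

x = np.linspace(-5, 5, 11)
c, mu = 1.0, 0.0

# levy_stable.pdf(x, alpha, beta, loc, scale); scale plays the role of c
pdf_stable = levy_stable.pdf(x, 2.0, 0.0, loc=mu, scale=c)

# for alpha = 2 the stable law is a Gaussian with standard deviation sqrt(2)*c
pdf_normal = norm.pdf(x, loc=mu, scale=np.sqrt(2)*c)

max_diff = np.max(np.abs(pdf_stable - pdf_normal))
```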

***
**(d) In the following step we draw samples from a stable distribution. Consider the parameters $\alpha = 0.5, \beta = 0.5, c = 1, \mu = 0$. Draw 10000 realizations from the distribution using *scipy.stats.levy\_stable.rvs*. In the next step, plot a histogram of the samples.**
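
A minimal sketch of the sampling step, assuming a fixed seed for reproducibility. For $\alpha = 0.5$ the tails are so heavy that a few extreme samples dominate the range, so the bins here are restricted to the central 90% of the samples. This percentile range and the bin count are illustrative choices, not prescribed by the exercise.

```python
import numpy as np
from scipy.stats import levy_stable

alpha, beta, c, mu = 0.5, 0.5, 1.0, 0.0
samples = levy_stable.rvs(alpha, beta, loc=mu, scale=c,
                          size=10000, random_state=0)

# restrict the histogram to the central 90% of the samples (illustrative choice)
lo, hi = np.percentile(samples, [5, 95])
counts, edges = np.histogram(samples, bins=100, range=(lo, hi))
```

The arrays `counts` and `edges` can be passed to a bar plot, or the same `bins` and `range` arguments can be given directly to `plt.hist`.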

In [35]:
'''choose a suitable number of bins for the histogram'''



## Problem V.3 (programming) -- q-Gaussian distributions and parameter fitting

**We continue with non-Gaussian distributions, now with q-Gaussian distributions. We consider a dataset of power grid frequency data with a 1-second time resolution. The dataset *frequency\_sample\_2015\_ex5.csv* can be found at [https://bwsyncandshare.kit.edu/s/QPySS7eZCWjSjYP](https://bwsyncandshare.kit.edu/s/QPySS7eZCWjSjYP).**

***
**(a) Plot a histogram of the data, both with linear and logarithmic scale on the y-axis, and calculate the kurtosis with *scipy.stats.kurtosis* (use *fisher=False*).**
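
As a reminder of what *fisher=False* does, here is a short sketch on synthetic data (the frequency dataset is not used here). With *fisher=False* the function returns the Pearson kurtosis, which is 3 for a Gaussian; the default *fisher=True* subtracts 3. The Laplace samples are only an illustrative heavy-tailed comparison.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
gauss = rng.normal(size=200_000)      # Gaussian reference, kurtosis 3
laplace = rng.laplace(size=200_000)   # heavy-tailed example, kurtosis 6

k_gauss = kurtosis(gauss, fisher=False)
k_laplace = kurtosis(laplace, fisher=False)
k_gauss_excess = kurtosis(gauss)      # default fisher=True gives the excess kurtosis
```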

In [37]:
data = pd.read_csv('data/frequency_sample_2015_ex5.csv').values.reshape(-1)
'''Plot the histograms and calculate the kurtosis:'''



***
**(b) The probability density function of a q-Gaussian distribution is given as**

In [38]:
def q_Gauss_pdf(x,q,beta,mu):
    if q==1:
        # Gaussian limit; the general expression below would divide by zero
        return np.sqrt(beta/np.pi)*np.exp(-beta*(x-mu)**2)
    elif 1<q<3:
        constant=np.sqrt(np.pi)*gamma((3-q)/(2*(q-1)))/(np.sqrt(q-1)*gamma(1/(q-1)))
    else:
        # case q < 1; the normalization contains sqrt(1-q)
        constant = 2*np.sqrt(np.pi)*gamma(1/(1-q))/((3-q)*np.sqrt(1-q)*gamma((3-q)/(2*(1-q))))
    
    pdf=np.sqrt(beta)/constant*(1+(1-q)*(-beta*(x-mu)**2))**(1/(1-q))
    return pdf

'''We use beta > 0 in this definition'''


**We assume $\beta > 0$ in this definition.<br>
Now we write a function for fitting the distribution, i.e. for a maximum likelihood estimate of the parameter $q$, with fixed $\beta$ and $\mu$ and an initial value $q_0$ for the optimization. Maximize the log-likelihood 
\begin{align*}\log(l_q(x_1,\ldots,x_n)) = \sum_{i=1}^n\log(p(x_i|q,\beta,\mu))\end{align*}
with respect to $q$, where $x_i$ are the data points. For the optimization you can use *scipy.optimize.minimize*, applied to the negative log-likelihood.**
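
A possible shape for such a fit is sketched here under some assumptions: only the heavy-tailed range $1 < q < 3$ is covered, the log-density is written with `gammaln` for numerical stability, and the bounds `(1.01, 2.9)` as well as the names `q_gauss_logpdf` and `fit_q` are hypothetical choices. The check uses synthetic data: a rescaled Student-t distribution with $\nu$ degrees of freedom is a q-Gaussian with $q = (\nu+3)/(\nu+1)$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def q_gauss_logpdf(x, q, beta, mu):
    # log of the q-Gaussian PDF, valid for 1 < q < 3
    log_c = (0.5*np.log(np.pi) + gammaln((3 - q)/(2*(q - 1)))
             - 0.5*np.log(q - 1) - gammaln(1/(q - 1)))
    return 0.5*np.log(beta) - log_c + np.log1p((q - 1)*beta*(x - mu)**2)/(1 - q)

def neg_log_likelihood(params, data, beta, mu):
    # minimize() passes the parameters as an array; q is its only entry
    return -np.sum(q_gauss_logpdf(data, params[0], beta, mu))

def fit_q(data, beta=1.0, mu=0.0, q0=1.2):
    result = minimize(neg_log_likelihood, x0=[q0], args=(data, beta, mu),
                      method="L-BFGS-B", bounds=[(1.01, 2.9)])
    return result.x[0]

# synthetic test data: Student-t with nu = 5, rescaled so that beta = 1
rng = np.random.default_rng(0)
nu = 5
samples = np.sqrt((nu + 1)/(2*nu))*rng.standard_t(nu, size=20000)
q_hat = fit_q(samples)   # should lie close to (nu + 3)/(nu + 1) = 4/3
```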

***
**(c) Normalize the frequency data by subtracting the mean and dividing by the standard deviation:**

In [39]:
data_normalized = (data-np.mean(data))/np.std(data)

**Fit the normalized frequency data with the following probability distributions:**
-  **Gaussian distribution $\rightarrow$ output: mean $\mu$ and variance $\sigma^2$,**
- **q-Gaussian distribution for fixed $\beta=1$, $\mu=0$ and initial parameter value $q_0=1.2$ $\rightarrow$ output: $q$.**

**For the normal distribution you can use *scipy.stats.norm.fit* and for the q-Gaussian distribution use your own function from (b).**
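
Note that *scipy.stats.norm.fit* returns the mean and the standard deviation, so the variance is the square of the second return value. The sketch uses synthetic data (with hypothetical values standing in for the frequency measurements) and shows that the fit of normalized data is $\mu = 0$, $\sigma^2 = 1$ up to rounding.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(50.0, 0.02, size=50_000)   # synthetic stand-in for the frequency data
data_normalized = (data - np.mean(data))/np.std(data)

mu_hat, sigma_hat = norm.fit(data_normalized)  # maximum likelihood estimates (loc, scale)
variance_hat = sigma_hat**2
```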

***
**Plot the resulting probability density functions (PDFs) in a figure together with the histogram of the normalized frequency data. Use a logarithmic scaling for the y-axis. Compare the result to the figure on slide 35 of lecture 5.**

In [40]:
'''
Use 
x = np.arange(np.min(data_normalized),np.max(data_normalized),0.01) as the x-axis for the plots
of the PDFs, with q_Gauss_pdf(x,...) and sc.stats.norm.pdf(x,...)
'''



## Problem V.4 (programming) -- Superstatistics - Examples for a synthetic frequency dataset

**We consider the dataset *time\_series\_superstatistics.csv* which contains a time series consisting of several shorter time series linked together. The time series is constructed as follows:**

In [50]:
'''You don't need to run the code! It is just for illustration purposes.'''

damping=0.00211224
T= 20000 
logNormalFit=(0.5228696759042721, -6.139031040500482, 115.71274435712661)
#set some further parameters: Initial condition very close to zero and time resolution of the simulation
oneSecond=10
initialFrequency=10**-10
t = np.linspace(0, T, oneSecond*T+1)
delta_t = np.diff(t)[0]
#define the deterministic and the probabilistic contributions to the frequency dynamics
def bulkFrequency(y, t):
    omega = y
    dydt = -damping*omega
    return dydt

# define function for solving the stochastic differential equation 
def frequencyTrajectory(eps,initialFrequency):
    omega = np.zeros(len(t))
    omega[0]=initialFrequency
    dW = np.random.normal(0,1,len(t)) * np.sqrt(delta_t) # noise dynamics
    for i in range(1,len(t)):
        omega[i] = omega[i-1] - delta_t * damping*omega[i-1] + eps* dW[i] #  noise dynamics with noise amplitude epsilons
    sol = omega 
    return sol

'''An alternative method using the library sdeint for solving the stochastic differential equation would be:'''
# def frequencyTrajectory(eps,initialFrequency):
#     def noiseDynamicsDirect(x, t):
#         #define noise process
#         return eps
#     sol = sdeint.itoint(bulkFrequency,noiseDynamicsDirect, initialFrequency, t)
#     return sol.flatten()

#initialize list of frequency measurements
aggregatedTrajectory=[]
#draw several random realizations from the log-normal distribution to then use as noise amplitudes, similar to the Mathematica distribution, see also https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html
numberOfGaussians = 500
randomBetas= sc.stats.lognorm(*logNormalFit).rvs(size=numberOfGaussians)
epsilons=np.sqrt(2*damping/randomBetas) # effective friction: randomBetas = 2*damping/epsilons^2 
#run process in series: to use previous end point of trajectory as new initial condition
for eps in epsilons:
    trajectory=frequencyTrajectory(eps,initialFrequency)
    initialFrequency=trajectory[-1]
    aggregatedTrajectory.append(trajectory[:-1])
#rescale to frequencies
frequency=(50+1/(2*np.pi)*np.array(aggregatedTrajectory).flatten())[0::oneSecond]

**The dynamics of the time series are given by a constant damping and a noise amplitude $\epsilon$ that changes over time, i.e. each data snippet has a different noise amplitude. The "effective friction" $\beta = (2 \cdot damping)/ \epsilon^2$ follows a log-normal distribution.<br>
The list of "randomBetas" which was used for the construction of the time series can be found in *randomBetas.txt* at [https://bwsyncandshare.kit.edu/s/QPySS7eZCWjSjYP](https://bwsyncandshare.kit.edu/s/QPySS7eZCWjSjYP).**
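
The role of $\beta$ as an inverse variance can be checked numerically: for constant $\epsilon$ the process above is an Ornstein-Uhlenbeck process with stationary variance $\epsilon^2/(2 \cdot damping) = 1/\beta$. The sketch below uses the same Euler-Maruyama update as the construction code, written as a linear filter for speed. The parameter values are hypothetical and chosen so that the check runs quickly; they are not those of the dataset.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
damping, eps = 0.5, 0.3          # hypothetical values, not those of the dataset
dt, n_steps = 0.01, 2_000_000

# Euler-Maruyama step: omega[i] = (1 - damping*dt)*omega[i-1] + eps*dW[i]
noise = eps*np.sqrt(dt)*rng.normal(size=n_steps)
omega = lfilter([1.0], [1.0, -(1.0 - damping*dt)], noise)

empirical_var = np.var(omega)
expected_var = eps**2/(2*damping)   # equals 1/beta
```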

***
**(a) Plot a histogram of the data, both with linear and logarithmic scale on the y-axis, and calculate the kurtosis with *scipy.stats.kurtosis* (use *fisher=False*).**
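
Why a kurtosis above 3 is expected here can be seen in a simplified setting: if each value is Gaussian with variance $1/\beta$ and $\beta$ is log-normal with shape parameter $s$ and location 0, the kurtosis of the mixture is $3e^{s^2}$. The sketch ignores the temporal correlations of the actual time series and uses a hypothetical $s = 0.5$ and scale.

```python
import numpy as np
from scipy.stats import kurtosis, lognorm

rng = np.random.default_rng(0)
s = 0.5   # hypothetical shape parameter of the log-normal distribution
betas = lognorm(s, loc=0.0, scale=100.0).rvs(size=500_000, random_state=rng)

# each sample is Gaussian with variance 1/beta
x = rng.normal(size=betas.size)/np.sqrt(betas)

k_mixture = kurtosis(x, fisher=False)
k_theory = 3*np.exp(s**2)   # 3*E[beta^-2]/E[beta^-1]^2 for log-normal beta
```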

In [41]:
frequency = pd.read_csv('data/time_series_superstatistics.csv').values.reshape(-1)
'''Plot the histograms of the synthetic power grid frequency time series and calculate the kurtosis:'''



***
**(b) We want to derive the superstatistics of the time series, whose probability density function (PDF) is
\begin{align*}
p(x)= \int_0^{\infty}f(\beta)p(x|\beta)\,d\beta,
\end{align*} 
where $f(\beta)$ is the distribution of the effective friction $\beta$. For details, see slides 32 and 33 of lecture 5. <br>
First we determine the local kurtosis $\kappa(\Delta t)$ and the "long time scale" $T$ for which $\kappa(\Delta t = T) \approx 3$, i.e. the window length at which the data are locally Gaussian. <br>
The formula for the average kurtosis $\kappa(\Delta t)$ is given by**

In [42]:
def averageKappa(data,DeltaT):
    # make sure that calls with a window length below 1 still return a number
    if DeltaT<1:
        return 0
    meanData=np.mean(data) # here we use a global mean, but a local mean for each data snippet of length DeltaT would also be ok. Results do not change much
    tMax=len(data)
    # fourth and second central moments of the first window (t_0 = 0)
    numerator=sum((data[0:DeltaT]-meanData)**4)/DeltaT
    denominator=sum((data[0:DeltaT]-meanData)**2)/DeltaT
    sumOfFractions=numerator/(denominator**2)

    # slide the window by one step: add the new point and drop the oldest one
    for i in range(0,tMax-DeltaT):
        numerator=numerator+((data[i+DeltaT]-meanData)**4-(data[i]-meanData)**4)/DeltaT
        denominator=denominator+((data[i+DeltaT]-meanData)**2-(data[i]-meanData)**2)/DeltaT
        sumOfFractions = sumOfFractions + numerator/(denominator**2)
    # the sum runs over tMax-DeltaT+1 windows (t_0 = 0,...,tMax-DeltaT)
    return sumOfFractions/(tMax-DeltaT+1)

**Run *averageKappa(data=frequency, DeltaT = T)* with the "long time scale" $T=20000$. The output value should be $\approx 3$.**

In [28]:
averageKappa(frequency,DeltaT=20000)
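
For long time series the Python loop in *averageKappa* can be slow. A vectorized variant based on cumulative sums is sketched below; it is an alternative formulation under the same global-mean convention, not the function used in the lecture, and the name `average_kappa_cumsum` is hypothetical. It is tried out on synthetic Gaussian data, for which the local kurtosis should be close to 3.

```python
import numpy as np

def average_kappa_cumsum(data, delta_t):
    # centre with the global mean, as in averageKappa
    centred = np.asarray(data, dtype=float) - np.mean(data)
    # prefix sums with a leading zero, so that window sums are differences
    c2 = np.concatenate(([0.0], np.cumsum(centred**2)))
    c4 = np.concatenate(([0.0], np.cumsum(centred**4)))
    m2 = (c2[delta_t:] - c2[:-delta_t])/delta_t   # local second moments
    m4 = (c4[delta_t:] - c4[:-delta_t])/delta_t   # local fourth moments
    # average of the local kurtosis over all len(data)-delta_t+1 windows
    return np.mean(m4/m2**2)

rng = np.random.default_rng(0)
gaussian = rng.normal(size=20000)
kappa = average_kappa_cumsum(gaussian, 500)
```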

***
**(c) Now, in a similar way to the function *averageKappa*, complete the following function for calculating the distribution of $\beta$, 
\begin{align*}\beta(t_0) = \frac{1}{\langle x^2\rangle_{t_0,T}-\langle x\rangle^2_{t_0,T}}\end{align*} 
for $t_0 = 0,\ldots,\mathrm{length}(data)-T$, depending on the dataset and $T$:** 

In [86]:
def betaList(data,T):
    tMax=len(data)
    xSquareMean=sum(data[0:T]**2)/T # here: t_0 = 0
    xMean=sum(data[0:T])/T # here: t_0 = 0
    betaValues=[1/(xSquareMean-xMean**2)]
    # calculate the local averages for every t_0 = 1,...,tMax-T
    for i in range(0,tMax-T):
        #???  # Complete the function similarly to the function "averageKappa"
        pass  # placeholder so that the cell runs; replace it with your update of the local averages
    return np.array(betaValues) # this list should contain len(data)-T+1 entries

**Note: We denote by $\langle\ldots\rangle_{t_0,\Delta t} = \frac{1}{\Delta t}\int_{t_0}^{t_0+\Delta t}\ldots~dt$ the local averages.**
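
One way the sliding-window update could look is sketched here, as a hedged illustration rather than the official solution. It keeps running sums of $x$ and $x^2$ and shifts the window by one sample per step. The function name `beta_list_sketch` and the test data are hypothetical.

```python
import numpy as np

def beta_list_sketch(data, T):
    data = np.asarray(data, dtype=float)
    x_sum = np.sum(data[:T])        # running sum of x over the current window
    x2_sum = np.sum(data[:T]**2)    # running sum of x^2 over the current window
    betas = [1.0/(x2_sum/T - (x_sum/T)**2)]        # t_0 = 0
    for i in range(len(data) - T):
        # shift the window by one step: add data[i+T], drop data[i]
        x_sum += data[i + T] - data[i]
        x2_sum += data[i + T]**2 - data[i]**2
        betas.append(1.0/(x2_sum/T - (x_sum/T)**2))
    return np.array(betas)

rng = np.random.default_rng(0)
demo = rng.normal(size=500)
betas = beta_list_sketch(demo, 50)   # one inverse local variance per window
```

Each entry equals the inverse of the local variance of the corresponding window, so `1/np.var(demo[i:i+50])` gives the same value as `betas[i]` up to rounding.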

***
**(d) Use your function from (c) to calculate the values of $\beta$ for $T=20000$ and for the dataset $data = (frequency-50) \cdot 2\pi$ (this is the angular velocity, which is also used in the construction of the dataset), i.e. calculate 
\begin{align*}betaList\left(\text{data} = (frequency-50)\cdot 2\pi,\text{T} = 20000\right).\end{align*}  
<br>
    Then fit a log-normal distribution to the resulting values of $\beta$; use *scipy.stats.lognorm.fit* to obtain the parameters of the log-normal distribution.**

***
**(e) Plot the following results in one figure:**
- **histogram of the output distribution from the *betaList*-function from (d),**
- **the probability density function given the parameters that were calculated in (d), use *scipy.stats.lognorm.pdf*,**
- **histogram of the randomBetas in *randomBetas.txt*.**

**Compare your result to the figure at the bottom in slide 36 of lecture 5.**

In [43]:
'''Use x = np.arange(np.min(beta_list),np.max(beta_list),0.1) as x-axis for the plots:'''


***
**(f) Normalize the time series as in Problem V.3. Fit the distribution of the normalized time series, as in Problem V.3, with a normal distribution
    and a q-Gaussian distribution via the maximum likelihood method. <br>
For the normal distribution you can use *scipy.stats.norm.fit* 
    and for the q-Gaussian use your own function from Problem V.3, using $\beta=1,\mu=0$ and $q_0=1.2$. Then plot the PDFs together with the histogram of the time series values. Use a logarithmic scaling for the y-axis.**

Normalize the frequency and fit the data with the distributions:

Plot the histograms and the PDFs: