# Bayesian Blocks

Routines for processing Loughborough pyranometer measurements and calculating Bayesian Blocks change points.

Tested on
* JupyterLab v3.2.4
* Pandas v1.3.4
* Matplotlib v3.5.1
* NumPy v1.23.0
* SciPy v1.8.1
* pvlib v0.9.1
* Skyfield v1.4.2
* Astropy v5.0.2


In [1]:
import pandas as pd
import numpy as np
from numpy.random import default_rng
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from pytz import timezone
#from pvlib import clearsky, atmosphere, solarposition
from pvlib.location import Location
import gc # for garbage collection on large file processing
import scipy.signal as signal # for the periodogram (scipy.signal.periodogram)
from scipy import stats

Some general housekeeping; adapt as appropriate.

In [5]:
# some housekeeping

# tz = timezone('Europe/London')
tz = timezone('UTC')
lat = 52.7616
lon = -1.2406 # +ve East from Greenwich Meridian
location = Location(lat, lon, name="Loughborough", altitude=79, tz=tz)

# define some colours for the plots
c_sim = 'k' # NB: overridden by 'cyan' further down
c_meas = 'g'
c_mean = 'b'

c_ghi = 'g'
c_dhi = 'b'

c_obs = 'grey'
c_sim = 'cyan'

YEAR=2018

# from the year long night time measurements for the measurement uncertainty
mean = -3.7
#sigma = 1.39
#sigma = 5 # value for negligible instrument noise contribution to n_cp
# 23.75 +/- 4.54 to be equivalent to 5 W/m2 fraction of mean dhi
sigma = 25 # good compromise between maximising n_cp and minimising impact of aerosol based intermittency
# sigma = 50 # value when expect intermittency to be fully time resolved in 1s measurements
p0=0.01

prefix=f"."
#prefix=f"D:/Data/Lboro"
#indir="."
indir=f"{prefix}/Date" # this directory must exist in advance
#savedir=f"{prefix}/Blocks-sigma-{sigma:.2f}"
savedir=f"{prefix}/Data" # this directory must exist in advance

# Point Measurements

Method 3 of Scargle et al. Assuming normally distributed measurement uncertainties, the maximised log-likelihood of block $k$ reduces to

$ \log L_{max}^{(k)} = \frac{b_{k}^2}{4 a_{k}} $ <br>
where the sums run over the measurements $x_n$ (with uncertainties $\sigma_n$) that fall in block $k$:
* $a_k = \frac{1}{2} \sum \frac{1}{\sigma_n^2}$
* $b_k = - \sum \frac{x_n}{\sigma_n^2}$

The routine follows the astropy implementation, including the log factor missing from the prior in the Scargle et al. paper, and is reproduced here for clarity.
*To keep up with any changes to the implementation, consider importing it from astropy instead.*
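
As a quick numerical check of the expression above, the sketch below evaluates the Gaussian log-likelihood of a single block on a grid of candidate block levels and compares its maximum with $b_k^2/(4a_k) - c_k$. Here $c_k = \frac{1}{2}\sum x_n^2/\sigma_n^2$ is the constant term that the fitness function drops, since its sum over all blocks does not depend on the partition. The sample values and variable names are illustrative assumptions, not part of the original routine.

```python
import numpy as np

# illustrative block of measurements with a common uncertainty (assumed values)
x = np.array([101.0, 98.0, 105.0, 96.0])
sigma = 25.0

a_k = 0.5 * np.sum(np.ones_like(x) / sigma**2)
b_k = -np.sum(x / sigma**2)
c_k = 0.5 * np.sum(x**2 / sigma**2)

# the level that maximises the likelihood is the weighted mean, -b_k / (2 a_k)
lam_hat = -b_k / (2 * a_k)

# brute-force check: log-likelihood (up to an additive constant) on a fine grid
lam = np.linspace(50.0, 150.0, 100001)
log_like = -0.5 * np.sum((x[:, None] - lam[None, :])**2, axis=0) / sigma**2

closed_form = b_k**2 / (4 * a_k) - c_k
print(lam_hat, closed_form, log_like.max())
```

The grid maximum and the closed form should agree to within the grid resolution, and `lam_hat` is simply the mean of the block when all uncertainties are equal.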


In [6]:
# following https://docs.astropy.org/en/stable/_modules/astropy/stats/bayesian_blocks.html#bayesian_blocks
import warnings

import numpy as np

from inspect import signature

__all__ = ['FitnessFunc', 'PointMeasures',
           'bayesian_blocks']
def bayesian_blocks(t, x=None, sigma=None,
                    fitness='events', **kwargs):
    r"""Compute optimal segmentation of data with Scargle's Bayesian Blocks

    This is a flexible implementation of the Bayesian Blocks algorithm
    described in Scargle 2013 [1]_.

    Parameters
    ----------
    t : array-like
        data times (one dimensional, length N)
    x : array-like, optional
        data values
    sigma : array-like or float, optional
        data errors
    fitness : str or object
        the fitness function to use for the model.
        If a string, the following options are supported:

        - 'events' (not available in this reproduction) : binned or unbinned event data.  Arguments are ``gamma``,
          which gives the slope of the prior on the number of bins, or
          ``ncp_prior``, which is :math:`-\ln({\tt gamma})`.
        - 'regular_events' (not available in this reproduction) : non-overlapping events measured at multiples of a
          fundamental tick rate, ``dt``, which must be specified as an
          additional argument.  Extra arguments are ``p0``, which gives the
          false alarm probability to compute the prior, or ``gamma``, which
          gives the slope of the prior on the number of bins, or ``ncp_prior``,
          which is :math:`-\ln({\tt gamma})`.
        - 'measures' : fitness for a measured sequence with Gaussian errors.
          Extra arguments are ``p0``, which gives the false alarm probability
          to compute the prior, or ``gamma``, which gives the slope of the
          prior on the number of bins, or ``ncp_prior``, which is
          :math:`-\ln({\tt gamma})`.

        In all three cases, if more than one of ``p0``, ``gamma``, and
        ``ncp_prior`` is chosen, ``ncp_prior`` takes precedence over ``gamma``
        which takes precedence over ``p0``.

        Alternatively, the fitness parameter can be an instance of
        :class:`FitnessFunc` or a subclass thereof.

    **kwargs :
        any additional keyword arguments will be passed to the specified
        :class:`FitnessFunc` derived class.

    Returns
    -------
    edges : ndarray
        array containing the (N+1) edges defining the N bins

    Examples
    --------

    .. testsetup::

        >>> np.random.seed(12345)

    Measured point data with errors:

    >>> t = 100 * np.random.random(100)
    >>> x = np.exp(-0.5 * (t - 50) ** 2)
    >>> sigma = 0.1
    >>> x_obs = np.random.normal(x, sigma)
    >>> edges = bayesian_blocks(t, x_obs, sigma, fitness='measures')

    References
    ----------
    .. [1] Scargle, J et al. (2013)
       https://ui.adsabs.harvard.edu/abs/2013ApJ...764..167S

    .. [2] Bellman, R.E., Dreyfus, S.E., 1962. Applied Dynamic
       Programming. Princeton University Press, Princeton.
       https://press.princeton.edu/books/hardcover/9780691651873/applied-dynamic-programming

    .. [3] Bellman, R., Roth, R., 1969. Curve fitting by segmented
       straight lines. J. Amer. Statist. Assoc. 64, 1079–1084.
       https://www.tandfonline.com/doi/abs/10.1080/01621459.1969.10501038

    See Also
    --------
    astropy.stats.histogram : compute a histogram using bayesian blocks
    """
    #FITNESS_DICT = {'events': Events,
    #                'regular_events': RegularEvents,
    #                'measures': PointMeasures}
    FITNESS_DICT = {'measures': PointMeasures}
    fitness = FITNESS_DICT.get(fitness, fitness)

    if type(fitness) is type and issubclass(fitness, FitnessFunc):
        fitfunc = fitness(**kwargs)
    elif isinstance(fitness, FitnessFunc):
        fitfunc = fitness
    else:
        raise ValueError("fitness parameter not understood")

    return fitfunc.fit(t, x, sigma)

class FitnessFunc:
    """Base class for bayesian blocks fitness functions

    Derived classes should overload the following method:

    ``fitness(self, **kwargs)``:
      Compute the fitness given a set of named arguments.
      Arguments accepted by fitness must be among ``[T_k, N_k, a_k, b_k, c_k]``
      (See [1]_ for details on the meaning of these parameters).

    Additionally, other methods may be overloaded as well:

    ``__init__(self, **kwargs)``:
      Initialize the fitness function with any parameters beyond the normal
      ``p0`` and ``gamma``.

    ``validate_input(self, t, x, sigma)``:
      Enable specific checks of the input data (``t``, ``x``, ``sigma``)
      to be performed prior to the fit.

    ``compute_ncp_prior(self, N)``: If ``ncp_prior`` is not defined explicitly,
      this function is called in order to define it before fitting. This may be
      calculated from ``gamma``, ``p0``, or whatever method you choose.

    ``p0_prior(self, N)``:
      Specify the form of the prior given the false-alarm probability ``p0``
      (See [1]_ for details).

    For an implemented fitness function, see :class:`PointMeasures`. The astropy
    classes ``Events`` and ``RegularEvents`` are not reproduced here.

    References
    ----------
    .. [1] Scargle, J et al. (2013)
       https://ui.adsabs.harvard.edu/abs/2013ApJ...764..167S
    """
    def __init__(self, p0=0.05, gamma=None, ncp_prior=None):
        self.p0 = p0
        self.gamma = gamma
        self.ncp_prior = ncp_prior

    def validate_input(self, t, x=None, sigma=None):
        """Validate inputs to the model.

        Parameters
        ----------
        t : array-like
            times of observations
        x : array-like, optional
            values observed at each time
        sigma : float or array-like, optional
            errors in values x

        Returns
        -------
        t, x, sigma : array-like, float or None
            validated and perhaps modified versions of inputs
        """
        # validate array input
        t = np.asarray(t, dtype=float)

        # find unique values of t
        t = np.array(t)
        if t.ndim != 1:
            raise ValueError("t must be a one-dimensional array")
        unq_t, unq_ind, unq_inv = np.unique(t, return_index=True,
                                            return_inverse=True)

        # if x is not specified, x will be counts at each time
        if x is None:
            if sigma is not None:
                raise ValueError("If sigma is specified, x must be specified")
            else:
                sigma = 1

            if len(unq_t) == len(t):
                x = np.ones_like(t)
            else:
                x = np.bincount(unq_inv)

            t = unq_t

        # if x is specified, then we need to simultaneously sort t and x
        else:
            # TODO: allow broadcasted x?
            x = np.asarray(x, dtype=float)

            if x.shape not in [(), (1,), (t.size,)]:
                raise ValueError("x does not match shape of t")
            x += np.zeros_like(t)

            if len(unq_t) != len(t):
                raise ValueError("Repeated values in t not supported when "
                                 "x is specified")
            t = unq_t
            x = x[unq_ind]

        # verify the given sigma value
        if sigma is None:
            sigma = 1
        else:
            sigma = np.asarray(sigma, dtype=float)
            if sigma.shape not in [(), (1,), (t.size,)]:
                raise ValueError('sigma does not match the shape of x')

        return t, x, sigma

    def fitness(self, **kwargs):
        raise NotImplementedError()

    def p0_prior(self, N):
        """
        Empirical prior, parametrized by the false alarm probability ``p0``
        See  eq. 21 in Scargle (2013)

        Note that there was an error in this equation in the original Scargle
        paper (the "log" was missing). The following corrected form is taken
        from https://arxiv.org/abs/1304.2818
        """
        return 4 - np.log(73.53 * self.p0 * (N ** -0.478))

    # the fitness_args property will return the list of arguments accepted by
    # the method fitness().  This allows more efficient computation below.
    @property
    def _fitness_args(self):
        return signature(self.fitness).parameters.keys()

    def compute_ncp_prior(self, N):
        """
        If ``ncp_prior`` is not explicitly defined, compute it from ``gamma``
        or ``p0``.
        """

        if self.gamma is not None:
            return -np.log(self.gamma)
        elif self.p0 is not None:
            return self.p0_prior(N)
        else:
            raise ValueError("``ncp_prior`` cannot be computed as neither "
                             "``gamma`` nor ``p0`` is defined.")

    def fit(self, t, x=None, sigma=None):
        """Fit the Bayesian Blocks model given the specified fitness function.

        Parameters
        ----------
        t : array-like
            data times (one dimensional, length N)
        x : array-like, optional
            data values
        sigma : array-like or float, optional
            data errors

        Returns
        -------
        edges : ndarray
            array containing the (M+1) edges defining the M optimal bins
        """
        t, x, sigma = self.validate_input(t, x, sigma)

        # compute values needed for computation, below
        if 'a_k' in self._fitness_args:
            ak_raw = np.ones_like(x) / sigma ** 2
        if 'b_k' in self._fitness_args:
            bk_raw = x / sigma ** 2
        if 'c_k' in self._fitness_args:
            ck_raw = x * x / sigma ** 2

        # create length-(N + 1) array of cell edges
        edges = np.concatenate([t[:1],
                                0.5 * (t[1:] + t[:-1]),
                                t[-1:]])
        block_length = t[-1] - edges

        # arrays to store the best configuration
        N = len(t)
        best = np.zeros(N, dtype=float)
        last = np.zeros(N, dtype=int)

        # Compute ncp_prior if not defined
        if self.ncp_prior is None:
            ncp_prior = self.compute_ncp_prior(N)
        else:
            ncp_prior = self.ncp_prior

        # ----------------------------------------------------------------
        # Start with first data cell; add one cell at each iteration
        # ----------------------------------------------------------------
        for R in range(N):
            # Compute fit_vec : fitness of putative last block (end at R)
            kwds = {}

            # T_k: width/duration of each block
            if 'T_k' in self._fitness_args:
                kwds['T_k'] = block_length[:R + 1] - block_length[R + 1]

            # N_k: number of elements in each block
            if 'N_k' in self._fitness_args:
                kwds['N_k'] = np.cumsum(x[:R + 1][::-1])[::-1]

            # a_k: eq. 31
            if 'a_k' in self._fitness_args:
                kwds['a_k'] = 0.5 * np.cumsum(ak_raw[:R + 1][::-1])[::-1]

            # b_k: eq. 32
            if 'b_k' in self._fitness_args:
                kwds['b_k'] = - np.cumsum(bk_raw[:R + 1][::-1])[::-1]

            # c_k: eq. 33
            if 'c_k' in self._fitness_args:
                kwds['c_k'] = 0.5 * np.cumsum(ck_raw[:R + 1][::-1])[::-1]

            # evaluate fitness function
            fit_vec = self.fitness(**kwds)

            A_R = fit_vec - ncp_prior
            A_R[1:] += best[:R]

            i_max = np.argmax(A_R)
            last[R] = i_max
            best[R] = A_R[i_max]

        # ----------------------------------------------------------------
        # Now find changepoints by iteratively peeling off the last block
        # ----------------------------------------------------------------
        change_points = np.zeros(N, dtype=int)
        i_cp = N
        ind = N
        while i_cp > 0:
            i_cp -= 1
            change_points[i_cp] = ind
            if ind == 0:
                break
            ind = last[ind - 1]
        if i_cp == 0:
            change_points[i_cp] = 0
        change_points = change_points[i_cp:]

        return edges[change_points]

class PointMeasures(FitnessFunc):
    r"""Bayesian blocks fitness for point measures

    Parameters
    ----------
    p0 : float, optional
        False alarm probability, used to compute the prior on :math:`N_{\rm
        blocks}` (see eq. 21 of Scargle 2013). If gamma is specified, p0 is
        ignored.
    ncp_prior : float, optional
        If specified, use the value of ``ncp_prior`` to compute the prior as
        above, using the definition :math:`{\tt ncp\_prior} = -\ln({\tt
        gamma})`.  If ``ncp_prior`` is specified, ``gamma`` and ``p0`` are
        ignored.
    """
    def __init__(self, p0=0.05, gamma=None, ncp_prior=None):
        super().__init__(p0, gamma, ncp_prior)

    def fitness(self, a_k, b_k):
        # eq. 41 from Scargle 2013
        return (b_k * b_k) / (4 * a_k)

    def validate_input(self, t, x, sigma):
        if x is None:
            raise ValueError("x must be specified for point measures")
        return super().validate_input(t, x, sigma)

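For a sense of scale, the sketch below evaluates the same expression as `FitnessFunc.p0_prior` for one day of 1 s samples (N = 86400, an assumed record length) at a few false-alarm probabilities. The helper name `ncp_prior_from_p0` is hypothetical and exists only for this illustration.

```python
import numpy as np

def ncp_prior_from_p0(p0, n):
    """Penalty per change point, same expression as FitnessFunc.p0_prior."""
    return 4 - np.log(73.53 * p0 * (n ** -0.478))

n_samples = 24 * 60 * 60  # one day sampled at 1 Hz (assumption)
priors = {p0: ncp_prior_from_p0(p0, n_samples) for p0 in (0.05, 0.01, 0.001)}
for p0, ncp in priors.items():
    print(f"p0={p0}: ncp_prior={ncp:.2f}")
```

A smaller `p0` gives a larger penalty per change point and therefore fewer blocks.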

* `do_stuff` calculates blocks on the (measured - modelled) residual
* `doOtherStuff` [**Preferred**] calculates blocks on the observed and modelled series independently
* `sigma` is the threshold ($\sigma$) for the BB analysis, passed to `bayesian_blocks` as the measurement uncertainty; it defaults to the value set in the housekeeping cell
* `density` normalises the matplotlib histograms, although in practice this has not proved very useful
* `verbose` controls how much progress text is printed during processing

By default the routines process both GHI and DHI, for measurements and simulations. Processing takes approximately 4 minutes per day of data, or about 24 hours for a year's dataset. To speed things up, consider whether you need blocks for all four series.
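
To illustrate how `sigma` acts as a threshold, the following sketch applies a condensed version of the same dynamic programme to a noise-free 10 W/m$^2$ step, once with `sigma=5` and once with `sigma=25`. The function `toy_blocks`, the step data and the sample count are assumptions made for illustration. Unlike `bayesian_blocks` above, it returns sample indices rather than mid-point edges and skips input validation.

```python
import numpy as np

def toy_blocks(x, sigma, p0=0.01):
    """Condensed point-measures Bayesian Blocks; returns block start indices plus N."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ncp_prior = 4 - np.log(73.53 * p0 * (n ** -0.478))
    w = np.ones(n) / sigma**2  # 1 / sigma_n^2
    best = np.zeros(n)
    last = np.zeros(n, dtype=int)
    for r in range(n):
        # a_k, b_k for every candidate last block starting at i and ending at r
        a_k = 0.5 * np.cumsum(w[:r + 1][::-1])[::-1]
        b_k = -np.cumsum((x * w)[:r + 1][::-1])[::-1]
        total = b_k**2 / (4 * a_k) - ncp_prior
        total[1:] += best[:r]
        last[r] = np.argmax(total)
        best[r] = total[last[r]]
    # peel off the last block repeatedly to recover the change points
    cps = [n]
    while cps[-1] > 0:
        cps.append(last[cps[-1] - 1])
    return [int(i) for i in cps[::-1]]

# 50 samples at 0 W/m2 followed by 50 samples at 10 W/m2 (made-up data)
step = np.concatenate([np.zeros(50), np.full(50, 10.0)])
blocks_sigma5 = toy_blocks(step, sigma=5)
blocks_sigma25 = toy_blocks(step, sigma=25)
print(blocks_sigma5, blocks_sigma25)
```

With `sigma=5` the step is resolved as a change point, whereas with `sigma=25` the penalty per change point outweighs the gain in fitness and the record stays a single block.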


In [7]:
fs = 1 # sampling frequency
x = np.arange(1,24)

# to save memory only load the columns you will be working on
cols = ['MTR TimeStamp', 'GHI [W/m2]', 'DHI [W/m2]']

def do_stuff(date, sigma=sigma, verbose=True):
    # the pyranometer measurements
    # only the columns listed in cols are loaded (usecols) to save memory
    if verbose:
        print(f"Opening {indir}/{date}.csv")
    df = pd.read_csv(f"{indir}/{date}.csv", sep=',', index_col=0, header='infer', parse_dates=True, usecols=cols)
    #for i, (t,g,d) in enumerate(zip(df.index, df['GHI [W/m2]'], df['DHI [W/m2]'])):
    # the simulated expected clear sky values
    if verbose:
        print(f"Simulating {date}")
    cs = location.get_clearsky(df.index)

    ncols=2
    nrows=4
    fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(12*ncols,4*nrows))

    ax = axs[0,0]
    ax.set_ylabel("GHI [W/m$^2$]")
    ax.plot(df.index, df['GHI [W/m2]'], label='pyranometer ghi')
    ax.plot(cs.index, cs['ghi'], label='pvlib ghi')
    
    # push the repeated work off to a subroutine...
    cs['residuals_ghi'] = df['GHI [W/m2]'] - cs['ghi']
    ax = axs[1,0]
    ax.set_ylabel("GHI residuals (obs-sim) [W/m$^2$]")
    ax.plot(cs.index, cs['residuals_ghi'], color=c_ghi)

    # sample number 0..N-1 as the time axis, so edges map back to df.index
    t = np.arange(len(cs.index))
    if verbose:
        print(f"Calculating blocks, sigma={sigma}, p0={p0}")
    edges_ghi = bayesian_blocks(t, cs['residuals_ghi'], sigma, fitness='measures', p0=p0)
    if verbose:
        print(f"Writing blocks")
    with open(f"{savedir}/{date}-ghi-BB.txt", "w") as WriteMe:
        for edge in edges_ghi:
            WriteMe.write(f"{edge}\n")
    times=[]
    for i in edges_ghi:
        i = int(i)
        times.append(df.index[i])
    #print(times)
    ax.vlines(times, ymin=0, ymax=sigma, ls=':', color='r')
    #if True:
    #    return

    ax = axs[2,0]
    ax.set_ylabel("Entries [-]")
    ax.set_xlabel("Time [s]")
    # block durations: differences between all consecutive edges
    dt_ghi = np.diff(edges_ghi)
    # print(dt_ghi, dt_ghi.mean(), np.median(dt_ghi), stats.mode(dt_ghi)[0], dt_ghi.std(ddof=1), dt_ghi.min(), dt_ghi.max())
    ax.hist(dt_ghi, bins=np.arange(0,1000, 10), color=c_ghi, alpha=0.3)
    ax.vlines([60, 15*60], ymin=0, ymax=10, ls=':', color='r')
    ax.set_yscale('log')

    ax = axs[3,0]
    if verbose:
        print(f"Calculating Lomb-Scargle periodogram")
    f, Pper_spec = signal.periodogram(cs['residuals_ghi'], fs, 'flattop', scaling='spectrum')
    ax.loglog(f, Pper_spec, color=c_ghi, alpha=0.3, label='ghi')
    ax.set_xlabel('frequency [Hz]')
    ax.set_ylabel('PSD')
    ax.grid()
    ax.vlines(1/(x*3600), ymin=0.01*x, ymax=x, color='red', alpha=0.3) # per hour
    ax.vlines([1/(60), 1/(5*60), 1/(7.5*60), 1/(15*60)], ymin=0.01, ymax=1, color='red')

    if verbose:
        print("ditto all for dhi")

    ax = axs[0,1]
    ax.set_ylabel("DHI [W/m$^2$]")
    ax.plot(df.index, df['DHI [W/m2]'], label='pyranometer dhi')
    ax.plot(cs.index, cs['dhi'], label='pvlib dhi')

    ax = axs[1,1]
    cs['residuals_dhi'] = df['DHI [W/m2]'] - cs['dhi']
    ax.set_ylabel("DHI residuals (obs-sim) [W/m$^2$]")
    ax.plot(cs.index, cs['residuals_dhi'], color=c_dhi)
    edges_dhi = bayesian_blocks(t, cs['residuals_dhi'], sigma, fitness='measures', p0=p0)
    with open(f"{savedir}/{date}-dhi-BB.txt", "w") as WriteMe:
        for edge in edges_dhi:
            WriteMe.write(f"{edge}\n")
    times=[]
    for i in edges_dhi:
        i = int(i)
        times.append(df.index[i])
    ax.vlines(times, ymin=0, ymax=sigma, ls=':', color='r')

    ax = axs[2,1]
    ax.set_ylabel("Entries [-]")
    ax.set_xlabel("Time [s]")
    # block durations: differences between all consecutive edges
    dt_dhi = np.diff(edges_dhi)
    # print(dt_dhi, dt_dhi.mean(), np.median(dt_dhi), stats.mode(dt_dhi)[0], dt_dhi.std(ddof=1), dt_dhi.min(), dt_dhi.max())
    ax.hist(dt_dhi, bins=np.arange(0,1000, 10), color=c_dhi, alpha=0.3)
    ax.vlines([60, 15*60], ymin=0, ymax=10, ls=':', color='r')
    ax.set_yscale('log')

    ax = axs[3,1]
    f, Pper_spec = signal.periodogram(cs['residuals_dhi'], fs, 'flattop', scaling='spectrum')
    ax.loglog(f, Pper_spec, color=c_dhi, alpha=0.3, label='dhi')
    ax.set_xlabel('frequency [Hz]')
    ax.set_ylabel('PSD')
    ax.grid()
    ax.vlines(1/(x*3600), ymin=0.01*x, ymax=x, color='red', alpha=0.3) # per hour
    ax.vlines([1/(60), 1/(5*60), 1/(7.5*60), 1/(15*60)], ymin=0.01, ymax=1, color='red')

    with open(f"{savedir}/{YEAR}-blocks-results.txt", "a") as WriteMe:
        WriteMe.write(f"{date}\t{len(edges_ghi)}\t{dt_ghi.mean():.2f}\t{np.median(dt_ghi)}\t{stats.mode(dt_ghi)[0][0]}\t{len(edges_dhi)}\t{dt_dhi.mean():.2f}\t{np.median(dt_dhi)}\t{stats.mode(dt_dhi)[0][0]}\n")

    plt.tight_layout()
    plt.savefig(f"{savedir}/{date}-blocks.png", facecolor='w')
    plt.close(fig)
    del df, cs, dt_ghi, dt_dhi, edges_ghi, edges_dhi, fig, axs, times
    #gc.collect()


def doOtherStuff(date, sigma=sigma, verbose=True, density=True):
    # the pyranometer measurements
    # only the columns listed in cols are loaded (usecols) to save memory
    if verbose:
        print(f"Opening {indir}/{date}.csv")
    df = pd.read_csv(f"{indir}/{date}.csv", sep=',', index_col=0, header='infer', parse_dates=True, usecols=cols)
    #for i, (t,g,d) in enumerate(zip(df.index, df['GHI [W/m2]'], df['DHI [W/m2]'])):
    # the simulated expected clear sky values

    if verbose:
        print(f"Simulating {date}")
    cs = location.get_clearsky(df.index)

    ncols=2
    nrows=4
    fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(12*ncols,4*nrows))

    ax = axs[0,0]
    ax.set_ylabel("GHI [W/m$^2$]")
    ax.plot(df.index, df['GHI [W/m2]'], label='pyranometer ghi', color=c_obs)
    ax.plot(cs.index, cs['ghi'], label='pvlib ghi', color=c_sim)
    ax.legend(loc='best')

    # sample number 0..N-1 as the time axis, so edges map back to df.index
    t = np.arange(len(cs.index))
    if verbose:
        print(f"Calculating blocks, sigma={sigma}, p0={p0}")
    edges_obs_ghi = bayesian_blocks(t, df[cols[1]], sigma, fitness='measures', p0=p0)
    if verbose:
        print(f"Writing blocks")
    with open(f"{savedir}/{date}-obs-ghi-BB.txt", "w") as WriteMe:
        for edge in edges_obs_ghi:
            WriteMe.write(f"{edge}\n")
    times=[]
    for i in edges_obs_ghi:
        i = int(i)
        times.append(df.index[i])
    #print(times)
    ax.vlines(times, ymin=0, ymax=sigma, ls=':', color='g', alpha=0.5)

    edges_sim_ghi = bayesian_blocks(t, cs['ghi'], sigma, fitness='measures', p0=p0)
    if verbose:
        print(f"Writing blocks")
    with open(f"{savedir}/{date}-sim-ghi-BB.txt", "w") as WriteMe:
        for edge in edges_sim_ghi:
            WriteMe.write(f"{edge}\n")
    times=[]
    for i in edges_sim_ghi:
        i = int(i)
        times.append(df.index[i])
    #print(times)
    ax.vlines(times, ymin=0, ymax=sigma, ls=':', color=c_sim, alpha=0.5)

    ax = axs[1,0]
    ax.set_xlabel("dt [s]")
    ax.set_ylabel("Entries [-]")
    ax.set_xlim(0.9, 4000)
    ax.set_xscale('log')
    ax.set_yscale('log')
    #bins = np.arange(0,1000, 10)
    bins = np.arange(0, 60*60)
    # print(dt, dt.mean(), np.median(dt), stats.mode(dt)[0], dt.std(ddof=1), dt.min(), dt.max())
    n_obs, b, p = ax.hist(np.diff(edges_obs_ghi), bins=bins, color=c_obs, alpha=0.3, density=density)
    n_sim, b, p = ax.hist(np.diff(edges_sim_ghi), bins=bins, color=c_sim, alpha=0.3, density=density)
    ax = axs[2,0]
    ax.set_xlabel("dt [s]")
    ax.set_ylabel("Entries (obs - sim) [-]")
    ax.set_xscale('log')
    ax.set_xlim(0.9, 4000)
    diff = n_obs-n_sim
    ax.bar(b[:-1], diff, width=1, align='edge')
    ax.vlines([60, 15*60], ymin=diff.min(), ymax=diff.max(), ls=':', color='r')

    ax = axs[3,0]
    if verbose:
        print(f"Calculating Lomb-Scargle periodogram")
    f, Pper_spec = signal.periodogram(df[cols[1]], fs, 'flattop', scaling='spectrum')
    ax.loglog(f, Pper_spec, color=c_obs, alpha=0.3, label='obs')
    f, Pper_spec = signal.periodogram(cs['ghi'], fs, 'flattop', scaling='spectrum')
    ax.loglog(f, Pper_spec, color=c_sim, alpha=0.3, label='sim')
    ax.set_xlabel('frequency [Hz]')
    ax.set_ylabel('PSD')
    ax.grid()
    ax.vlines(1/(x*3600), ymin=0.01*x, ymax=x, color='red', alpha=0.3) # per hour
    ax.vlines([1/(60), 1/(5*60), 1/(7.5*60), 1/(15*60)], ymin=0.01, ymax=1, color='red')


    ax = axs[0,1]
    ax.set_ylabel("DHI [W/m$^2$]")
    ax.plot(df.index, df['DHI [W/m2]'], label='pyranometer dhi', color=c_obs)
    ax.plot(cs.index, cs['dhi'], label='pvlib dhi', color=c_sim)
    ax.legend(loc='best')

    # sample number 0..N-1 as the time axis, so edges map back to df.index
    t = np.arange(len(cs.index))
    if verbose:
        print(f"Calculating blocks, sigma={sigma}, p0={p0}")
    edges_obs_dhi = bayesian_blocks(t, df[cols[2]], sigma, fitness='measures', p0=p0)
    if verbose:
        print(f"Writing blocks")
    with open(f"{savedir}/{date}-obs-dhi-BB.txt", "w") as WriteMe:
        for edge in edges_obs_dhi:
            WriteMe.write(f"{edge}\n")
    times=[]
    for i in edges_obs_dhi:
        i = int(i)
        times.append(df.index[i])
    #print(times)
    ax.vlines(times, ymin=0, ymax=sigma, ls=':', color='g', alpha=0.5)

    edges_sim_dhi = bayesian_blocks(t, cs['dhi'], sigma, fitness='measures', p0=p0)
    if verbose:
        print(f"Writing blocks")
    with open(f"{savedir}/{date}-sim-dhi-BB.txt", "w") as WriteMe:
        for edge in edges_sim_dhi:
            WriteMe.write(f"{edge}\n")
    times=[]
    for i in edges_sim_dhi:
        i = int(i)
        times.append(df.index[i])
    #print(times)
    ax.vlines(times, ymin=0, ymax=sigma, ls=':', color=c_sim, alpha=0.5)

    ax = axs[1,1]
    ax.set_xlabel("dt [s]")
    ax.set_ylabel("Entries [-]")
    ax.set_xlim(0.9, 4000)
    ax.set_xscale('log')
    ax.set_yscale('log')
    # print(dt, dt.mean(), np.median(dt), stats.mode(dt)[0], dt.std(ddof=1), dt.min(), dt.max())
    n_obs, b, p = ax.hist(np.diff(edges_obs_dhi), bins=bins, color=c_obs, alpha=0.3, density=density)
    n_sim, b, p = ax.hist(np.diff(edges_sim_dhi), bins=bins, color=c_sim, alpha=0.3, density=density)
    ax = axs[2,1]
    ax.set_xlabel("dt [s]")
    ax.set_ylabel("Entries (obs - sim) [-]")
    ax.set_xscale('log')
    ax.set_xlim(0.9, 4000)
    diff = n_obs-n_sim
    ax.bar(bins[:-1], diff, width=1, align='edge')
    ax.plot(bins[:-1], diff)
    ax.vlines([60, 15*60], ymin=diff.min(), ymax=diff.max(), ls=':', color='r')

    ax = axs[3,1]
    if verbose:
        print(f"Calculating Lomb-Scargle periodogram")
    f, Pper_spec = signal.periodogram(df[cols[2]], fs, 'flattop', scaling='spectrum')
    ax.loglog(f, Pper_spec, color=c_obs, alpha=0.3, label='obs')
    f, Pper_spec = signal.periodogram(cs['dhi'], fs, 'flattop', scaling='spectrum')
    ax.loglog(f, Pper_spec, color=c_sim, alpha=0.3, label='sim')
    ax.set_xlabel('frequency [Hz]')
    ax.set_ylabel('PSD')
    ax.grid()
    ax.vlines(1/(x*3600), ymin=0.01*x, ymax=x, color='red', alpha=0.3) # per hour
    ax.vlines([1/(60), 1/(5*60), 1/(7.5*60), 1/(15*60)], ymin=0.01, ymax=1, color='red')

    with open(f"{savedir}/{YEAR}-ind-blocks-results.txt", "a") as WriteMe:
        WriteMe.write(f"{date}\t{len(edges_obs_ghi)}\t{len(edges_sim_ghi)}\t{len(edges_obs_dhi)}\t{len(edges_sim_dhi)}\n")

    plt.tight_layout()
    plt.savefig(f"{savedir}/{date}-ind-blocks.png", facecolor='w')
    plt.close(fig)
    del df, cs, edges_obs_ghi, edges_sim_ghi, edges_obs_dhi, edges_sim_dhi, fig, axs, times
    #gc.collect()
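
The per-day `*-BB.txt` files written above hold one block edge per line, so they can be read back with `numpy.loadtxt` for later analysis. The sketch below writes a small made-up edge list to a temporary file in the same format and recovers the block durations. The file name and edge values are placeholders, not real results.

```python
import os
import tempfile

import numpy as np

# made-up block edges in sample-index units (placeholders, not measured data)
edges_demo = np.array([0.0, 120.5, 300.5, 86399.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo-obs-ghi-BB.txt")  # hypothetical file name
    # same layout as the routines above: one edge per line
    with open(path, "w") as fh:
        for edge in edges_demo:
            fh.write(f"{edge}\n")
    edges = np.loadtxt(path)

# block durations are the differences between consecutive edges
durations = np.diff(edges)
print(len(edges), durations)
```

With 1 s sampling, the durations are in seconds and can be histogrammed in the same way as in `doOtherStuff`.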

**Look before you leap**: test on a single day to check that everything works before committing your computer to a day's processing time.

In [8]:
#do_stuff("2018-09-21")
#doOtherStuff('2018-05-14', density=False) # the clearest clear day
doOtherStuff('2018-06-26', density=False) # a clear day
#doOtherStuff('2018-05-07', density=False) 
#doOtherStuff('2018-10-15', density=False) # an overcast day
#doOtherStuff('2018-10-21', density=False) # a mixed day

Opening ./2018-06-26.csv
Simulating 2018-06-26
Calculating blocks, sigma=25, p0=0.01
Writing blocks
Writing blocks
Calculating Lomb-Scargle periodogram
Calculating blocks, sigma=25, p0=0.01
Writing blocks
Writing blocks
Calculating Lomb-Scargle periodogram


Loop for processing a year's data. To process more than one year, wrap it in an outer loop over years.

The loop turns off matplotlib interactive mode; otherwise the figures are held in memory to be displayed after the loop finishes, and the memory allocation runs out long before it gets there.
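
If more than one year is needed, one option is to iterate over calendar dates directly rather than nesting month and day loops. A minimal sketch using only the standard library follows; `iter_dates` is a hypothetical helper and the date range is an arbitrary example. Note that the routines above name their summary file from the global `YEAR`, which would need handling separately.

```python
from datetime import date, timedelta

def iter_dates(start, end):
    """Yield ISO-formatted date strings from start to end, inclusive."""
    day = start
    while day <= end:
        yield day.isoformat()
        day += timedelta(days=1)

# example range spanning two years (arbitrary choice)
dates = list(iter_dates(date(2018, 1, 1), date(2019, 12, 31)))
print(dates[0], dates[-1], len(dates))
```

Each string has the same `YYYY-MM-DD` form as the `date` argument used by `do_stuff` and `doOtherStuff`.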

In [6]:
import calendar

for month in range(1,12+1):
    # number of days in this month (also correct for leap years)
    end = calendar.monthrange(YEAR, month)[1]
    for day in range(1, end+1):
        date = f"{YEAR}-{month:02}-{day:02}"
        print(date)
        plt.ioff() # interactive mode is off, figure will not be automatically shown [let us hope it is saved though]
        #do_stuff(date, sigma=sigma) # irradiance obs-sim
        doOtherStuff(date, sigma=sigma, density=False) # blocks obs-sim
        gc.collect()
plt.ion()

2018-01-01
Opening ./2018-01-01.csv
Simulating 2018-01-01
Calculating blocks, sigma=24, p0=0.01
Writing blocks
Writing blocks
Calculating Lomb-Scargle periodogram
Calculating blocks, sigma=24, p0=0.01
Writing blocks
Writing blocks
Calculating Lomb-Scargle periodogram
2018-01-02
Opening ./2018-01-02.csv
Simulating 2018-01-02
Calculating blocks, sigma=24, p0=0.01
Writing blocks
Writing blocks
Calculating Lomb-Scargle periodogram
Calculating blocks, sigma=24, p0=0.01
Writing blocks
Writing blocks
Calculating Lomb-Scargle periodogram
2018-01-03
Opening ./2018-01-03.csv
Simulating 2018-01-03
Calculating blocks, sigma=24, p0=0.01
Writing blocks
Writing blocks
Calculating Lomb-Scargle periodogram
Calculating blocks, sigma=24, p0=0.01
Writing blocks
Writing blocks
Calculating Lomb-Scargle periodogram
2018-01-04
Opening ./2018-01-04.csv
Simulating 2018-01-04
Calculating blocks, sigma=24, p0=0.01
Writing blocks
Writing blocks
Calculating Lomb-Scargle periodogram
Calculating blocks, sigma=24, p0=