# Bayesian Blocks in HEP

## BB Motivation Outline:

1. Choose optimal bin edges for a given histogram
    1. Bin edges should convey meaning
    2. Content in each bin should be self-similar
    3. Edge choice should be unbiased
2. Decide how optimal edges are chosen
    1. Assume all data within a bin is consistent with a single pdf
    2. Variation is solely due to statistical fluctuation
    3. Default case: each bin is modeled by a uniform pdf
    4. Additional criterion:
        1. There should be a penalty for more bins
3. Benefits:
    1. Visual:
        1. Removes user bias
        2. Not bound by fixed-width bins
        3. Can cover changes in count spanning orders of magnitude
        4. No empty bins (good for particle physics)
        5. No wild variations in ratio plot
        6. Can make signal easy to spot
    2. Statistical:
        1. Natural way to do binned shape analysis?
        2. Some sort of bias-variance tradeoff?
        3. Remove some statistical variation, but also remove some shape info


### Hgg Outline

Can BB be used for a classic bump hunt?

1. Simplest case: no signal model, no bg model, simply "look" for a bump
    1. The algorithm in its current state cannot locate peaks on a falling background
    2. May require change in underlying pdf assumptions
2. More typical case: signal model and bg model exist
    1. Apply BB on bg and signal model independently
    2. Combine bin edges to create a hybrid binning
    3. Apply hybrid binning to data
        1. Is hybrid binning more visually appealing than a standard binning?
        2. Does the hybrid binning have comparable statistical power to an unbinned likelihood?
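
The hybrid-binning steps above could be sketched roughly as follows. This is only an illustration, not the method used in this analysis: the helper names `hybrid_edges` and `fill_histogram`, the `min_width` merging tolerance, and the example edge values (a 100 to 180 mass window) are all hypothetical.

```python
import bisect


def hybrid_edges(bg_edges, signal_edges, min_width=0.0):
    """Union of two edge sets, dropping interior edges that would
    create a bin narrower than or equal to min_width (hypothetical rule)."""
    merged = sorted(set(bg_edges) | set(signal_edges))
    if len(merged) < 2:
        raise ValueError("need at least two distinct edges")
    kept = [merged[0]]
    for edge in merged[1:-1]:
        if edge - kept[-1] > min_width:
            kept.append(edge)
    # The upper boundary is always kept; drop the last interior edge
    # if it sits too close to that boundary.
    if len(kept) > 1 and merged[-1] - kept[-1] <= min_width:
        kept.pop()
    kept.append(merged[-1])
    return kept


def fill_histogram(data, edges):
    """Count entries per bin; the last bin includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        if edges[0] <= x <= edges[-1]:
            i = min(bisect.bisect_right(edges, x) - 1, len(counts) - 1)
            counts[i] += 1
    return counts


# Hypothetical edges, standing in for BB run separately on bg and signal models.
bg_edges = [100, 120, 150, 180]
signal_edges = [100, 122, 125, 128, 180]
edges = hybrid_edges(bg_edges, signal_edges, min_width=2.5)
counts = fill_histogram([101, 121, 124, 126, 126.5, 179, 180, 200], edges)
```

A plain union (`min_width=0.0`) keeps every edge from both models; the tolerance is one possible way to avoid very narrow bins where a signal edge falls next to a background edge.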
        

## Visualization Examples

![black_hole_1](figures/ST_mul8_mc_and_data_normed_databin_signal_shape.png)
![black_hole_2](figures/ST_mul8_mc_and_data_normed_databin_nobb_signal_shape.png)
![DY comparison](figures/z_data_hist_binsVbb.png)
![DY_ratio1](figures/bb_Z_gen_reco.png)
![DY_ratio2](figures/b25_Z_gen_reco.png)

# Section 2: Explanation of Bayesian Blocks

Bayesian Blocks is a nonparametric modeling technique for determining the optimal segmentation of a given data set.  The optimal segmentation presented here is the one that maximizes an expression that models the data as a series of piecewise-constant values.  This expression, referred to as the 'fitness function', is constructed from the unbinned likelihood known as the Cash statistic.  Each segment is equivalent to a histogram bin, and the data contained in each bin are consistent with a single Poisson rate for that bin.  The bin edges are therefore statistically significant, because each one denotes a change in the expected event rate for a group of data.  To prevent the extreme case in which each data point is contained in its own bin, a regularization parameter penalizes the fitness function as the number of bins grows.
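
As a rough sketch of what such a fitness function looks like for unbinned event data: for a block with N events and width T, the Poisson log-likelihood maximized over the rate is N(log N - log T), up to a term that is constant over all segmentations. The code below assumes that form and a constant per-block penalty; the function names and the toy numbers are hypothetical.

```python
import math


def block_fitness(n_events, width):
    """Maximized Poisson log-likelihood of one block (constant term dropped)."""
    if n_events == 0:
        return 0.0
    return n_events * (math.log(n_events) - math.log(width))


def total_fitness(data, edges, penalty):
    """Sum of block fitnesses minus a fixed penalty per block."""
    total = 0.0
    n_edges = len(edges)
    for i in range(n_edges - 1):
        lo, hi = edges[i], edges[i + 1]
        last = i == n_edges - 2  # the last block includes its upper edge
        n = sum(1 for x in data if lo <= x < hi or (last and x == hi))
        total += block_fitness(n, hi - lo) - penalty
    return total


# Toy data: four events clustered near zero, two spread over the rest.
data = [0.1, 0.2, 0.3, 0.4, 5.0, 9.0]
one_block = total_fitness(data, [0.0, 10.0], penalty=2.0)
two_blocks = total_fitness(data, [0.0, 0.5, 10.0], penalty=2.0)
```

With a modest penalty the two-block segmentation scores higher because it isolates the dense cluster; with a very large penalty the single block wins, which is how the regularization suppresses extra bins.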

This analysis takes advantage of the optimization algorithm presented in [Scargle], which guarantees that the global maximum of the fitness function is found for any given set of data.  The regularization parameter must be determined empirically for each dataset.  Following [Scargle], we use a parameter equivalent to a false-positive rate of 1%, which sufficiently suppresses non-monotonic behavior when applied to monotonically increasing or decreasing datasets.
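
A minimal pure-Python sketch of a dynamic-programming search of this kind is given below. It is not the implementation used in this analysis. It assumes the event-data block fitness N(log N - log T), candidate edges at the midpoints between sorted data points, and the empirical calibration `4 - log(73.53 * p0 * N**-0.478)` for converting a false-positive rate `p0` into a per-block penalty; that formula is taken as an assumption here and should be checked against [Scargle]. The function names are hypothetical.

```python
import math


def penalty_from_p0(n_points, p0=0.01):
    # Assumed calibration for event data (not stated in this document).
    return 4.0 - math.log(73.53 * p0 * n_points ** -0.478)


def bayesian_blocks(data, p0=0.01):
    """Return the bin edges that maximize the penalized block fitness."""
    t = sorted(data)
    n = len(t)
    if n < 2 or t[0] == t[-1]:
        raise ValueError("need at least two distinct values")
    # Cell i spans edges[i]..edges[i + 1] and holds exactly one data point.
    edges = [t[0]] + [0.5 * (a + b) for a, b in zip(t[:-1], t[1:])] + [t[-1]]
    penalty = penalty_from_p0(n, p0)

    best = [0.0] * n  # best[r]: optimal fitness for cells 0..r
    start = [0] * n   # start[r]: first cell of the last block in that optimum
    for r in range(n):
        best_val, best_k = -math.inf, 0
        for k in range(r + 1):
            width = edges[r + 1] - edges[k]
            if width <= 0.0:
                continue  # zero-width block (repeated values) is skipped
            count = r - k + 1
            fit = count * (math.log(count) - math.log(width)) - penalty
            if k > 0:
                fit += best[k - 1]
            if fit > best_val:
                best_val, best_k = fit, k
        best[r], start[r] = best_val, best_k

    # Walk back through the stored block starts to recover the edges.
    result = [edges[n]]
    r = n - 1
    while True:
        k = start[r]
        result.append(edges[k])
        if k == 0:
            break
        r = k - 1
    return result[::-1]


# Toy data: a dense region on [0, 1) followed by a sparse one on [1, 10.5].
data = [i / 100 for i in range(100)] + [1.0 + 0.5 * j for j in range(20)]
edges = bayesian_blocks(data, p0=0.01)
```

Because every prefix optimum is stored and reused, the search covers all possible segmentations in O(N^2) fitness evaluations rather than enumerating them explicitly.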



# Section 3: A Classic Bump Hunt

The Bayesian Blocks technique was developed with the astronomy community in mind.  The data there are typically time-series photon counts, where the background and signals are well described by uniform pdfs.  This is rarely the case in particle physics, where the data are typically time-independent and follow a smoothly increasing or decreasing spectrum.  In the context of reconstructed invariant mass, a signal can usually be modeled as a Gaussian-like peak on top of a rising or falling spectrum.
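
For concreteness, a toy spectrum of this kind can be generated as in the sketch below. Everything here is an illustrative assumption rather than a description of the data used in this study: the exponential background shape, the mass window, the peak position and width, and the function name `toy_mass_spectrum`.

```python
import math
import random


def toy_mass_spectrum(n_bg, n_sig, lo=100.0, hi=180.0, slope=0.03,
                      peak=125.0, width=2.0, seed=1):
    """Falling exponential background plus a Gaussian peak, both within [lo, hi]."""
    rng = random.Random(seed)
    events = []
    # Background: inverse-CDF sampling of an exponential truncated to [lo, hi].
    norm = 1.0 - math.exp(-slope * (hi - lo))
    for _ in range(n_bg):
        u = rng.random()
        events.append(lo - math.log(1.0 - u * norm) / slope)
    # Signal: Gaussian peak, rejecting the rare draws outside the window.
    while len(events) < n_bg + n_sig:
        m = rng.gauss(peak, width)
        if lo <= m <= hi:
            events.append(m)
    return events


events = toy_mass_spectrum(n_bg=1000, n_sig=100)
background, signal = events[:1000], events[1000:]
```

The background events come first in the returned list, so the two components can be inspected separately before being combined into a single unbinned sample.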

The naive application of the Bayesian Block algorithm leads to...