# Backlog Forecasting

Fitting data to known distributions is done using [Chi2Fit](https://hex.pm/packages/chi2fit).

## Table of contents

* [Set-up](#Set-up)
* [Data and simulation set-up](#Data-and-simulation-set-up)
* [Preparation](#Preparation)
* [Simple forecast using the empirical data](#Simple-forecast-using-the-empirical-data)
* [Forecasting using a Poisson distribution](#Forecasting-using-a-Poisson-distribution)
* [Total Monte Carlo](#Total-Monte-Carlo)
* [References](#References)
* [Tear-down](#Tear-down)

## Set-up

### Chi2fit

In [1]:
Boyle.mk("chi2fit")
Boyle.list()

All dependencies up to date


{:ok, ["chi2fit", "gnuplot"]}

In [2]:
Boyle.activate("chi2fit")
Boyle.install({:chi2fit, path: "/app/chi2fit"})

All dependencies up to date
Resolving Hex dependencies...
Dependency resolution completed:
Unchanged:
  exalgebra 0.0.5
  exboost 0.2.4
==> exboost
make: 'priv/libboostnif.so' is up to date.



:ok

In [3]:
require Chi2fit.Distribution
alias Chi2fit.Distribution, as: D
alias Chi2fit.Fit, as: F
alias Chi2fit.Matrix, as: M
alias Chi2fit.Utilities, as: U
alias Exboost.Math

Exboost.Math

### GnuPlot

In [4]:
Boyle.mk("gnuplot")
Boyle.list()

{:ok, ["chi2fit", "gnuplot"]}

In [5]:
Boyle.activate("gnuplot")
Boyle.install({:gnuplot, "~> 1.19"})

All dependencies up to date
Resolving Hex dependencies...
Dependency resolution completed:
Unchanged:
  gnuplot 1.19.95
All dependencies up to date


:ok

In [6]:
alias Gnuplot, as: G

Gnuplot

In [7]:
defmodule Plots do

    def histogram(data, options) do
        hist = data |> U.make_histogram(1,0) |> Enum.map(&Tuple.to_list/1)
        commands = [
            ['width=1.'],
            ['hist(x,width)=width*floor(x/width)+width/2.0'],
            [:set, :boxwidth, 'width*0.9'],
            [:set, :style, :fill, :solid, 0.5],
            if(options[:plottitle], do: [:set, :title, options[:plottitle]], else: []),
            if(options[:xrange], do: [:set, :xrange, options[:xrange]], else: []),
            if(options[:yrange], do: [:set, :yrange, options[:yrange]], else: []),
            if(options[:xlabel], do: [:set, :xlabel, options[:xlabel]], else: []),
            if(options[:ylabel], do: [:set, :ylabel, options[:ylabel], :rotate, :by, 90], else: [])
        ]
        G.plot(commands ++
            [[:plot, "-", :u, '(hist($1,width)):2', :smooth, :freq, :w, :boxes, :lc, 'rgb"green"', :notitle]],
            [hist])
    end
    
    def ecdf(data, options) do
        hist = data|>Enum.map(&Tuple.to_list/1)
        commands = [
            [:set, :style, :line, 1,
                :linecolor, :rgb, "#0060ad",
                :linetype, 1, :linewidth, 2,
                :pointtype, 7, :pointsize, 1.5],
            [:set, :style, :line, 2,
                :linecolor, :rgb, "#dd181f",
                :linetype, 1, :linewidth, 2],
            [:set, :style, :line, 3,
                :linecolor, :rgb, "green",
                :linetype, 1, :linewidth, 2],
            if(options[:plottitle], do: [:set, :title, options[:plottitle]], else: []),
            if(options[:xrange], do: [:set, :xrange, options[:xrange]], else: []),
            [:set, :yrange, '[0:1.2]'],
            if(options[:xlabel], do: [:set, :xlabel, options[:xlabel]], else: []),
            if(options[:ylabel], do: [:set, :ylabel, options[:ylabel], :rotate, :by, 90], else: [])
        ]
        G.plot(commands ++
          [
            [:plot, G.list([
                    ~w('-' u 1:2 w steps ls 1 notitle)a,
                    ~w('' u 1:2 w points ls 1 notitle)a,
                    ~w('' u 1:2:3:4 w yerrorbars ls 2 title 'Empirical CDF')a,
                    if(options[:func], do: ["", :u, '1:2', :w, :lines, :ls, 3, :title, options[:title]], else: [])
                ])
            ]
          ],
          [
              [[0,0,0,0]|hist]++[[14,1,0,0]],
              hist,
              hist
          ] ++ if(options[:func], do: [options[:func]], else: []))
    end
    
    def pdf(data, options) do
        hist = data |> U.make_histogram(1,0) |> Enum.map(&Tuple.to_list/1)
        commands = [
            ['count=#{length(data)}'],
            ['width=1.'],
            ['hist(x,width)=width*floor(x/width)+width/2.0'],
            [:set, :boxwidth, 'width*0.9'],
            [:set, :style, :fill, :solid, 0.5],
            if(options[:plottitle], do: [:set, :title, options[:plottitle]], else: []),
            if(options[:yrange], do: [:set, :yrange, options[:yrange]], else: []),
            if(options[:xlabel], do: [:set, :xlabel, options[:xlabel]], else: []),
            if(options[:ylabel], do: [:set, :ylabel, options[:ylabel], :rotate, :by, 90], else: [])
        ]
        G.plot(commands ++
          [
            [:plot, G.list([
                    ~w|'-' u (hist($1,width)):($2/count) smooth freq w boxes lc rgb "green" title "Empirical PDF"|a,
                    ~w|'-' u (hist($1,width)):($2/count):(sqrt($2)/count) w errorbars ls 3 notitle|a,
                    ["", :u, '1:2', :w, :lines, :ls, 3, :title, options[:title]]
                ])
            ]
          ],
          [ hist,hist,options[:pdf] ])
    end
    
end

{:module, Plots, <<70, 79, 82, 49, 0, 0, 23, 192, 66, 69, 65, 77, 65, 116, 85, 56, 0, 0, 1, 137, 0, 0, 0, 46, 12, 69, 108, 105, 120, 105, 114, 46, 80, 108, 111, 116, 115, 8, 95, 95, 105, 110, 102, 111, 95, 95, 9, ...>>, {:pdf, 2}}

## Data and simulation set-up

As an example, consider the throughput of completed backlog items. At the end of a fixed time period we count the number of backlog items that a team has completed. Partially completed items are excluded from the count.

In [8]:
data = [3,3,4,4,7,5,1,11,5,6,3,6,6,5,4,10,4,5,8,2,4,12,5]

[3, 3, 4, 4, 7, 5, 1, 11, 5, 6, 3, 6, 6, 5, 4, 10, 4, 5, 8, 2, 4, 12, 5]

In [9]:
Plots.histogram(data,
    plottitle: "Throughput histogram",
    xlabel: "Throughput (items per 2 weeks)",
    ylabel: "Frequency",
    yrange: '[0:7]')

{:ok, "width=1.;\nhist(x,width)=width*floor(x/width)+width/2.0;\nset boxwidth width*0.9;\nset style fill solid 0.5;\nset title \"Throughput histogram\";\n;\nset yrange [0:7];\nset xlabel \"Throughput (items per 2 weeks)\";\nset ylabel \"Frequency\" rotate by 90;\nplot \"-\" u (hist($1,width)):2 smooth freq w boxes lc rgb\"green\" notitle"}

Other parameters that affect the forecasting are listed below. Please adjust to your needs.

In [10]:
# The size of the backlog, e.g. 100 backlog items
size = 100

# Number of iterations to use in the Monte Carlo
iterations = 1000

# Number of probes to use in the chi2 fit
probes = 10_000

# The range of the parameter to look for a (global) minimum
initial = {1,10}

{1, 10}

## Preparation

Next, we convert the throughput data to a histogram. To this end we group the data in bins of size 1 starting at 0.

In [11]:
hdata = U.to_bins data, {1,0}

[{1, 0.043478260869565216, 0.0058447102657877, 0.13689038224309594}, {2, 0.08695652173913043, 0.02977628442357071, 0.1905937209791003}, {3, 0.21739130434782608, 0.1263699563343216, 0.33774551477037923}, {4, 0.43478260869565216, 0.3160946312914347, 0.5600249434832333}, {5, 0.6521739130434783, 0.5263221461493021, 0.7626298741894857}, {6, 0.782608695652174, 0.6622544852296207, 0.8736300436656784}, {7, 0.8260869565217391, 0.709703289667655, 0.9080712005244068}, {8, 0.8695652173913043, 0.7585954661422599, 0.9405937209791002}, {10, 0.9130434782608695, 0.8094062790208998, 0.9702237155764294}, {11, 0.9565217391304348, 0.8631096177569041, 0.9941552897342123}, {12, 1.0, 0.9225113769324543, 1.0}]

The data returned contains a list of tuples each describing a bin:
* the end-point of the bin,
* the cumulative fraction of events up to and including this bin (the total count is normalized to one),
* the lower value of the error bound,
* the upper value of the error bound.

As can be seen, the lower and upper error bounds differ in size, i.e. they are asymmetrical. The contribution or weight to the likelihood function used in fitting known distributions will be different depending on whether the observed value is larger or smaller than the predicted value. This is specified by using the option `model: :linear` (see below). See [3] for details.
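The first two fields of each tuple can be reproduced with plain Elixir. The sketch below is only an illustration: `EcdfSketch` is a hypothetical module, not part of Chi2fit, and it does not compute the error bounds.

```elixir
# Hypothetical helper (not part of Chi2fit): cumulative fraction per bin of size 1.
defmodule EcdfSketch do
  def ecdf(data) do
    total = length(data)

    data
    # Count how often each throughput value occurs
    |> Enum.reduce(%{}, fn x, acc -> Map.update(acc, x, 1, &(&1 + 1)) end)
    # Sort the {value, count} pairs by value
    |> Enum.sort()
    # Accumulate the counts and normalize by the total number of observations
    |> Enum.map_reduce(0, fn {value, count}, running ->
      cumulative = running + count
      {{value, cumulative / total}, cumulative}
    end)
    |> elem(0)
  end
end

data = [3,3,4,4,7,5,1,11,5,6,3,6,6,5,4,10,4,5,8,2,4,12,5]
EcdfSketch.ecdf(data) |> Enum.take(3)
```

With the 23 observations used here, the first three entries are `{1, 1/23}`, `{2, 2/23}` and `{3, 5/23}`, which agrees with the second field of the tuples returned by `U.to_bins`.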

In [12]:
Plots.ecdf(hdata,
    plottitle: "Empirical CDF",
    xlabel: "Throughput (items per 2 weeks)",
    ylabel: "Probability",
    xrange: '[0:15]')

{:ok, "set style line 1 linecolor rgb \"#0060ad\" linetype 1 linewidth 2 pointtype 7 pointsize 1.5;\nset style line 2 linecolor rgb \"#dd181f\" linetype 1 linewidth 2;\nset style line 3 linecolor rgb \"green\" linetype 1 linewidth 2;\nset title \"Empirical CDF\";\nset xrange [0:15];\nset yrange [0:1.2];\nset xlabel \"Throughput (items per 2 weeks)\";\nset ylabel \"Probability\" rotate by 90;\nplot '-' u 1:2 w steps ls 1 notitle,'' u 1:2 w points ls 1 notitle,'' u 1:2:3:4 w yerrorbars ls 2 title 'Empirical CDF',"}

## Simple forecast using the empirical data

Using the observed throughput data we perform a Monte Carlo simulation to estimate the number of iterations needed to deplete the backlog. For a large enough number of samples the results of a Monte Carlo simulation approximate a normal distribution, which provides a range for the uncertainty in the number of iterations. We express this as a probability using percentages.

In [13]:
{avg,sd,all} = U.mc(iterations, fn -> U.forecast(fn -> Enum.random(data) end, size) end,true)
U.display {avg,sd}

50% with      20.0
84% within    22.0 iterations
97.5% within  24.0 iterations
99.85% within 26.0 iterations


:ok
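To make the idea concrete, here is a minimal sketch of such a simulation in plain Elixir. It assumes that a forecast simply subtracts a randomly drawn throughput from the backlog until nothing is left; `ForecastSketch` is a hypothetical module and may differ from what `U.forecast` and `U.mc` do internally.

```elixir
defmodule ForecastSketch do
  # Count the iterations needed to deplete `size` items when each
  # iteration completes `draw.()` items.
  def forecast(draw, size, iterations \\ 0)
  def forecast(_draw, size, iterations) when size <= 0, do: iterations
  def forecast(draw, size, iterations), do: forecast(draw, size - draw.(), iterations + 1)

  # Repeat `fun` a number of times; return mean, standard deviation and raw results.
  def mc(runs, fun) do
    results = for _ <- 1..runs, do: fun.()
    avg = Enum.sum(results) / runs
    var = Enum.reduce(results, 0.0, fn x, acc -> acc + (x - avg) * (x - avg) end) / runs
    {avg, :math.sqrt(var), results}
  end
end

data = [3,3,4,4,7,5,1,11,5,6,3,6,6,5,4,10,4,5,8,2,4,12,5]
{avg, sd, _all} =
  ForecastSketch.mc(1000, fn -> ForecastSketch.forecast(fn -> Enum.random(data) end, 100) end)

# Under the normal approximation, mean + k standard deviations covers
# roughly 50%, 84%, 97.5% and 99.85% of the outcomes for k = 0, 1, 2, 3.
for {pct, k} <- [{"50%", 0}, {"84%", 1}, {"97.5%", 2}, {"99.85%", 3}] do
  IO.puts("#{pct} within #{Float.round(avg + k * sd, 1)} iterations")
end
```

The results vary from run to run because the throughput is drawn at random.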

In [14]:
Plots.histogram(all,
    plottitle: "Monte Carlo result for duration",
    xlabel: "Duration (number of iterations)",
    ylabel: "Frequency",
    xrange: '[0:30]')

{:ok, "width=1.;\nhist(x,width)=width*floor(x/width)+width/2.0;\nset boxwidth width*0.9;\nset style fill solid 0.5;\nset title \"Monte Carlo result for duration\";\nset xrange [0:30];\n;\nset xlabel \"Duration (number of iterations)\";\nset ylabel \"Frequency\" rotate by 90;\nplot \"-\" u (hist($1,width)):2 smooth freq w boxes lc rgb\"green\" notitle"}

## Forecasting using a Poisson distribution

Instead of directly using the raw data captured one can also use a known probability distribution. The parameter of the distribution is matched to the data. After matching the parameter value one uses the known distribution to forecast.

Here, we will use the __Poisson distribution__ [1]. This basically assumes that the data points are independent of each other.
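For reference, the Poisson distribution with rate $\lambda$ assigns the following probability to observing $k$ items in one period, with the corresponding cumulative distribution (the symbols $\lambda$ and $k$ are introduced here for illustration):

$$
P(X = k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad
F(k \mid \lambda) = \sum_{i=0}^{k} \frac{\lambda^{i} e^{-\lambda}}{i!}, \qquad k = 0, 1, 2, \ldots
$$

Both the mean and the variance of this distribution equal $\lambda$, so the single parameter to fit is the average throughput per period.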

The code below uses basic settings of the commands provided by `Chi2Fit`. More advanced options can be found at [2]. First, a fixed number of random parameter values is tried to get a rough estimate. The option `probes` sets the number of tries. Furthermore, since we are fitting a probability distribution, which has values on the interval `[0,1]`, the errors are asymmetrical. This is specified by the option `model: :linear`.

In [15]:
model = D.model "poisson"
options = [probes: probes, smoothing: false, model: :linear, saved?: true]
result = {_,parameters,_,saved} = F.chi2probe hdata, [initial], {model[:fun], &F.nopenalties/2}, options
U.display result

Initial guess:
    chi2:		3.768092967691295
    pars:		[5.467582508516224]
    ranges:		{[5.261720953421076, 5.687367816309191]}



:ok

The reported range is the interval of parameter values for which the corresponding `chi2` values are within 1 of the minimum value found.
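The probing idea can be sketched in plain Elixir as follows. This is an illustration under simplifying assumptions, not the Chi2fit implementation: `ProbeSketch` is a hypothetical module, the chi-squared uses symmetric errors instead of the asymmetric ones used in this notebook, and the data points are synthetic.

```elixir
defmodule ProbeSketch do
  # Poisson CDF at integer k: sum of lambda^i * exp(-lambda) / i! for i = 0..k
  def poisson_cdf(lambda, 0), do: :math.exp(-lambda)
  def poisson_cdf(lambda, k) when is_integer(k) and k > 0 do
    first = :math.exp(-lambda)
    {sum, _term} =
      Enum.reduce(1..k, {first, first}, fn i, {sum, term} ->
        next = term * lambda / i
        {sum + next, next}
      end)
    sum
  end

  # Plain chi-squared with symmetric errors (a simplification).
  def chi2(lambda, points) do
    points
    |> Enum.map(fn {k, y, sigma} -> :math.pow((y - poisson_cdf(lambda, k)) / sigma, 2) end)
    |> Enum.sum()
  end

  # Try `tries` random parameter values in {lo, hi}; keep the best one and
  # the range of values whose chi2 lies within 1 of the minimum found.
  def probe(points, {lo, hi}, tries) do
    probes =
      for _ <- 1..tries do
        lambda = lo + :rand.uniform() * (hi - lo)
        {chi2(lambda, points), lambda}
      end
    {best_chi2, best} = Enum.min(probes)
    near = for {c, l} <- probes, c <= best_chi2 + 1, do: l
    {best_chi2, best, {Enum.min(near), Enum.max(near)}}
  end
end

# Synthetic CDF points generated at rate 5.0 with a made-up error of 0.1
points = for k <- 1..12, do: {k, ProbeSketch.poisson_cdf(5.0, k), 0.1}
ProbeSketch.probe(points, {1, 10}, 10_000)
```

For these synthetic points the best parameter should land close to 5.0, and the returned range brackets it.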

After roughly locating the minimum we do a more precise (and computationally more expensive) search for the minimum.

In [16]:
options = [{:probes,saved}|options]
result = {_,cov,parameters,_} = F.chi2fit hdata, {parameters, model[:fun], &F.nopenalties/2}, 10, options
U.display(hdata,model,result,options)

Final:
    chi2:		3.7680887442307545
    Degrees of freedom:	10
    gradient:		[-2.286421879453262e-7]
    parameters:		[5.46802637200297]
    errors:		[0.21599806081548534]
    ranges:
			chi2:		3.7680887442307545	-	4.762162293040989
			parameter:	5.261720953421076	-	5.687367816309191


:ok

For a (local) minimum the value of the gradient should be very close to zero.

In [17]:
cdf = 0..100 |> Enum.map(fn i -> [i*15.0/100.0,D.poissonCDF(hd(parameters)).(i*15.0/100.0)] end)
Plots.ecdf(hdata,
    plottitle: "Fit of Poisson to CDF",
    xlabel: "Throughput (items per 2 weeks)",
    ylabel: "Probability",
    xrange: '[0:15]',
    title: "Poisson",
    func: cdf)

{:ok, "set style line 1 linecolor rgb \"#0060ad\" linetype 1 linewidth 2 pointtype 7 pointsize 1.5;\nset style line 2 linecolor rgb \"#dd181f\" linetype 1 linewidth 2;\nset style line 3 linecolor rgb \"green\" linetype 1 linewidth 2;\nset title \"Fit of Poisson to CDF\";\nset xrange [0:15];\nset yrange [0:1.2];\nset xlabel \"Throughput (items per 2 weeks)\";\nset ylabel \"Probability\" rotate by 90;\nplot '-' u 1:2 w steps ls 1 notitle,'' u 1:2 w points ls 1 notitle,'' u 1:2:3:4 w yerrorbars ls 2 title 'Empirical CDF',\"\" u 1:2 w lines ls 3 title \"Poisson\""}

In [18]:
rate = hd(parameters)
pdf = 0..100 |> Enum.map(fn i -> x = i*15.0/100.0; [x,:math.pow(rate,x)*:math.exp(-rate)/Math.tgamma(x+1)] end)
Plots.pdf(data,
    plottitle: "Fit of PDF to Poisson",
    xlabel: "Throughput (items per 2 weeks)",
    ylabel: "Frequency",
    yrange: '[0:0.35]',
    pdf: pdf,
    title: "Poisson")

{:ok, "count=23;\nwidth=1.;\nhist(x,width)=width*floor(x/width)+width/2.0;\nset boxwidth width*0.9;\nset style fill solid 0.5;\nset title \"Fit of PDF to Poisson\";\nset yrange [0:0.35];\nset xlabel \"Throughput (items per 2 weeks)\";\nset ylabel \"Frequency\" rotate by 90;\nplot '-' u (hist($1,width)):($2/count) smooth freq w boxes lc rgb \"green\" title \"Empirical PDF\",'-' u (hist($1,width)):($2/count):(sqrt($2)/count) w errorbars ls 3 notitle,\"\" u 1:2 w lines ls 3 title \"Poisson\""}

Again, using a Monte Carlo simulation we estimate the number of iterations and the range to expect.

In [19]:
[rate] = parameters
{avg,sd,all} = U.mc(iterations, fn -> U.forecast(D.poisson(rate), size) end, true)
U.display {avg,sd}

50% with      19.0
84% within    21.0 iterations
97.5% within  23.0 iterations
99.85% within 25.0 iterations


:ok
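The forecast above draws the throughput of each iteration from a Poisson distribution with the fitted rate. As a minimal sketch of how such a random draw can be produced, the code below uses Knuth's multiplication method. `PoissonSketch` is a hypothetical module, and `D.poisson` may well sample in a different way.

```elixir
defmodule PoissonSketch do
  # Knuth's method: multiply uniform random numbers until the product
  # drops to exp(-rate) or below; the number of extra factors needed
  # is Poisson distributed with the given rate.
  def sample(rate) when rate > 0 do
    limit = :math.exp(-rate)
    draw(limit, 0, :rand.uniform())
  end

  defp draw(limit, k, product) when product <= limit, do: k
  defp draw(limit, k, product), do: draw(limit, k + 1, product * :rand.uniform())
end

# The sample mean should be close to the rate
samples = for _ <- 1..10_000, do: PoissonSketch.sample(5.47)
Enum.sum(samples) / length(samples)
```

Note that a Poisson draw can be zero, in which case an iteration passes without reducing the backlog. This cannot happen when sampling from the empirical data used here, where every period has at least one completed item.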

In [20]:
Plots.histogram(all,
    plottitle: "Monte Carlo simulation for duration",
    xlabel: "Duration (number of iterations)",
    ylabel: "Frequency",
    xrange: '[0:30]')

{:ok, "width=1.;\nhist(x,width)=width*floor(x/width)+width/2.0;\nset boxwidth width*0.9;\nset style fill solid 0.5;\nset title \"Monte Carlo simulation for duration\";\nset xrange [0:30];\n;\nset xlabel \"Duration (number of iterations)\";\nset ylabel \"Frequency\" rotate by 90;\nplot \"-\" u (hist($1,width)):2 smooth freq w boxes lc rgb\"green\" notitle"}

## Total Monte Carlo

In the results of a Monte Carlo simulation, the reported error and the range of the number of iterations reflect the statistical error associated with the Monte Carlo simulation. They do not take into account the uncertainty of the parameter used in the fitted probability distribution function.

In Total Monte Carlo [4], multiple Monte Carlo simulations are done that correspond to the extreme values of the error bounds of the parameters used. The resulting error is of a different nature than the statistical error from the Monte Carlo simulation. These errors are reported separately.

In [21]:
# Pick up the error in the parameter value
param_errors = cov |> M.diagonal |> Enum.map(fn x->x|>abs|>:math.sqrt end)
[sd_rate] = param_errors

{avg_min,_} = U.mc(iterations, fn -> U.forecast(D.poisson(rate-sd_rate), size) end)
{avg_max,_} = U.mc(iterations, fn -> U.forecast(D.poisson(rate+sd_rate), size) end)

sd_min = avg - avg_max
sd_plus = avg_min - avg

IO.puts "Number of iterations to complete the backlog:"
IO.puts "#{Float.round(avg,1)} (+/- #{Float.round(sd,1)}) (-#{Float.round(sd_min,1)} +#{Float.round(sd_plus,1)})"

Number of iterations to complete the backlog:
18.8 (+/- 1.8) (-0.7 +0.7)


:ok

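The first error is the symmetric statistical error of the Monte Carlo simulation, while the second error, which stems from the uncertainty in the fitted parameter, is asymmetric.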
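The reported result can be written as follows. The symbols $\bar{N}$ (mean number of iterations), $\sigma_{MC}$ (statistical error), $\Delta_{\pm}$, $S$ (backlog size) and $\sigma_\lambda$ (error of the fitted rate $\lambda$) are introduced here for illustration only.

$$
N = \bar{N}(\lambda) \pm \sigma_{MC} \; \left(-\Delta_{-},\; +\Delta_{+}\right), \qquad
\Delta_{-} = \bar{N}(\lambda) - \bar{N}(\lambda + \sigma_\lambda), \qquad
\Delta_{+} = \bar{N}(\lambda - \sigma_\lambda) - \bar{N}(\lambda)
$$

As a rough plausibility check, assuming $\bar{N} \approx S / \lambda$, a first-order estimate with the values found above gives

$$
\Delta_{\pm} \approx \frac{S \, \sigma_\lambda}{\lambda^{2}} = \frac{100 \times 0.216}{5.468^{2}} \approx 0.72,
$$

which is consistent with the value of 0.7 printed above.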

## References

[1] _Poisson distribution_, https://en.wikipedia.org/wiki/Poisson_distribution/<br>
[2] _Chi2Fit_, Pieter Rijken, 2018, https://hex.pm/packages/chi2fit<br>
[3] _Asymmetric errors_, Roger Barlow, Manchester University, UK and Stanford University, USA, PHYSTAT2003, SLAC, Stanford, California, September 8-11, 2003, https://www.slac.stanford.edu/econf/C030908/papers/WEMT002.pdf<br>
[4] _Efficient use of Monte Carlo: uncertainty propagation_, D. Rochman et al., Nuclear Science and Engineering, 2013, ftp://ftp.nrg.eu/pub/www/talys/bib_rochman/fastTMC.pdf

## Tear-down

In [None]:
Boyle.deactivate()