# Template - forecasting - using Cycle Times

For the purpose of this notebook, _Cycle Time_ is defined as follows:

__Cycle Time__:
  > "...the time between two items emerging from a process"

This notebook illustrates how to analyze and forecast data based on _Cycle Time_ as defined above.
In particular, the following aspects are considered:

* working days versus calendar days
* working hours versus 24h
* batches of deliveries or single items
* consistency with the 'Delivery Rate'
* noise in the data due to sloppy or faulty entry of completion dates
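
As a minimal sketch of the definition above, the following cell computes _Cycle Times_ from a short list of completion timestamps. The timestamps are hypothetical and not taken from any data set used in this notebook; they are assumed to be ordered with the most recent completion first.

```elixir
# Hypothetical completion timestamps, most recent first (illustration only)
completed = [
  ~N[2019-05-13 16:00:00],
  ~N[2019-05-13 10:00:00],
  ~N[2019-05-10 10:00:00]
]

# Cycle Time = time between two consecutive completions, expressed in calendar days
cycle_times =
  completed
  |> Enum.chunk_every(2, 1, :discard)
  |> Enum.map(fn [later, earlier] ->
    NaiveDateTime.diff(later, earlier, :second) / (24 * 3600)
  end)

IO.inspect(cycle_times)
```

Three completions give two _Cycle Times_: six hours (a quarter of a calendar day) and three calendar days.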

## Set-up

In [None]:
require Chi2fit.Distribution
alias Chi2fit.Distribution, as: D
alias Chi2fit.Fit, as: F
alias Chi2fit.Matrix, as: M
alias Chi2fit.Utilities, as: U
alias Gnuplotlib, as: P
alias Exboost.Math
:"do not show this result in output"

## Data and simulation set-up

In [None]:
#
# Completed items have a resolution date, stored in the column "Resolved".
# Jira exports these timestamps in the format passed to `csv_to_list` below.
#
deliveries = "/app/notebooks/<filename>"
|> File.stream!
|> U.csv_to_list("Resolved", header?: true, format: "{0D}/{Mshort}/{YY} {h24}:{0m}")

IO.inspect(deliveries, print: false, limit: 3)
IO.puts "Number of completed items: #{length(deliveries)}"
:"do not show this result in output"

First, we set some parameters that will be used later on in this notebook. For analyzing the data it is especially important to choose how to handle:

* working days vs calendar days,
* working hours vs 24h,
* batches of deliveries vs single deliveries,
* the size of a cluster of data for fitting to a known distribution.

In [None]:
##
## Data analysis
##

# Working hours: 8AM to 8PM
workhours = {8,20}

# Correct for working days and/or working hours (:weekday, :worktime, :"weekday+worktime")
correct = :"weekday+worktime"

# Cutoff for minimum amount of time between consecutive deliveries, in working days
# (15/(12*60) corresponds to 15 minutes of a 12-hour working day)
cutoff = 15/(12*60)

# Size of the bins to group the data (2/12 means a granularity of 2 hours in a 12-hour working day)
binsize = 2/12

# The noise to add to the delivery times to estimate the error due to sloppy/faulty administration
#noise = D.normal(0.0, 2.0/12)
noise = fn -> 0.0 end # No noise

##
## Forecasting
##

# The size of the backlog, e.g. 1000 backlog items
size = 1000

##
## Monte Carlo simulations stuff
##

# Number of iterations to use in the Monte Carlo
iterations = 100

# Number of probes to use in the chi2 fit
probes = 50_000

##
## Fitting a distribution
##

# The range of the parameter to look for a (global) minimum
initial = [{0.1,50},{0.1,50}]
:"do not show this result in output"

In [None]:
{startofday,endofday} = workhours
hours_in_day = endofday - startofday
:"do not show this result in output"
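
The `cutoff` and `binsize` parameters above are expressed as fractions of a working day. The following sketch spells out that arithmetic. The helper functions and the variable name `day_length` are hypothetical and only serve as illustration; a 12-hour working day (8AM to 8PM) is assumed.

```elixir
# Assumed length of a working day in hours (8AM to 8PM)
day_length = 20 - 8

# Hypothetical helpers: convert minutes or hours into fractions of a working day
minutes_to_workdays = fn minutes -> minutes / (day_length * 60) end
hours_to_workdays = fn hours -> hours / day_length end

cutoff_in_workdays = minutes_to_workdays.(15)  # 15 minutes
binsize_in_workdays = hours_to_workdays.(2)    # 2 hours

IO.puts "cutoff  = #{Float.round(cutoff_in_workdays, 4)} working days"
IO.puts "binsize = #{Float.round(binsize_in_workdays, 4)} working days"
```

Note that the same numbers mean something different when applied to calendar days: a bin size of 2/12 of a calendar day spans 4 hours.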

In [None]:
# Cycle Times in number of calendar days
ctlist = deliveries
|> Stream.chunk_every(2, 1, :discard)
|> Stream.map(fn [d1,d2] -> NaiveDateTime.diff(d1,d2) end) # Calculate the time difference between two consecutive deliveries in seconds
|> Enum.map(& &1/24/3600) # Convert the number of seconds to number of days
IO.inspect(ctlist, print: false, limit: 3)
:"do not show this result in output"

In [None]:
P.histogram(ctlist,
    bin: binsize,
    plottitle: "Cycle Time histogram",
    xlabel: "Cycle Time (calendar days)",
    ylabel: "Frequency",
    xrange: '[0:]')
:"this is an inline image"

## Data analysis

At first sight there appear to be bumps around _Cycle Times_ of a whole number of days. In addition, the _Cycle Time_ exhibits a dip between 0 and 1. Three factors may explain this observed behaviour:

1. __Work hours__. Most people have regular working hours, somewhere between 8AM and 6PM depending on how early they start work.
1. __Weekdays__. People don't work during the weekends.
1. __Sloppy administration__. Work is often completed at some time of the day but registered as done only a couple of hours later, the next day, or even at the end of an iteration (just before reporting).

First we will examine the first two factors. One way of handling the third factor is to add random noise to the completion dates and estimate its effect.

For now, we'll assume no noise.
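
To make the effect of working hours and weekdays concrete, here is a minimal sketch of one possible way to map a timestamp onto a working-day time axis. The module `WorkTimeSketch` is hypothetical and is not how `Chi2fit` implements its corrections; it assumes dates on or after January 5th, 1970 (a Monday), collapses weekends to the start of the next Monday, and clamps time outside working hours to the start or end of the working day.

```elixir
defmodule WorkTimeSketch do
  # Reference Monday used as day zero (1970-01-05 was a Monday)
  @reference ~D[1970-01-05]

  # Map a timestamp to a number of working days since the reference Monday.
  def to_workdays(%NaiveDateTime{} = dt, {startofday, endofday}) do
    days = Date.diff(NaiveDateTime.to_date(dt), @reference)
    weeks = div(days, 7)
    weekday = rem(days, 7) # 0 = Monday .. 6 = Sunday

    if weekday >= 5 do
      # Weekend: collapse to the end of Friday / start of next Monday
      weeks * 5 + 5.0
    else
      hour = dt.hour + dt.minute / 60 + dt.second / 3600
      fraction = (hour - startofday) / (endofday - startofday)
      # Clamp time outside working hours to the interval 0..1
      weeks * 5 + weekday + min(max(fraction, 0.0), 1.0)
    end
  end
end

hours = {8, 20}
friday = ~N[2019-05-10 17:00:00]
monday = ~N[2019-05-13 09:00:00]

calendar = NaiveDateTime.diff(monday, friday, :second) / (24 * 3600)
working = WorkTimeSketch.to_workdays(monday, hours) - WorkTimeSketch.to_workdays(friday, hours)

IO.puts "calendar days: #{Float.round(calendar, 3)}, working days: #{Float.round(working, 3)}"
```

In this example a delivery on Friday at 5PM followed by one on Monday at 9AM is 2.667 calendar days apart, but only 4 working hours (one third of a 12-hour working day) apart. Corrections of this kind remove the bumps at whole numbers of days.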

In [None]:
# No noise
noise = fn -> 0.0 end

# Assume that in practice deliveries never happen at exactly the same time. If the data shows
# identical delivery times, we further assume that this is due to 'sloppy' administration.
# In that case we set the _Cycle Time_ between them to a certain minimum (the cutoff).
fun = fn dat -> dat
  # Map the delivery times to numbers: the number of days since the epoch Jan 1st, 1970:
  |> Stream.map(& NaiveDateTime.diff(&1, ~N[1970-01-01 00:00:00], :second)/(24*3600))

  # Adjust the time for working hours (8AM - 8PM, as set in `workhours`)
  # This maps the period of working hours to the interval 0..1
  |> U.adjust_times(correct: correct, workhours: workhours)

  # Apply noise to our data
  |> Stream.map(& &1+noise.())
  
  # Sort again to get properly ordered completion dates (most recent first)
  |> Enum.sort(& &1>&2)
  
  # Calculate time differences with cut-off
  |> U.time_diff(cutoff: cutoff)
end
:"do not show this result in output"

Next, recalculate the _Cycle Times_ with the corrections specified above. This basically switches from _calendar days_ to _working days_.

In [None]:
# Cycle Times in number of working days
ctlist = deliveries
|> fun.()

ctlist |> P.histogram(
    bin: binsize,
    plottitle: "Cycle Time histogram",
    xlabel: "Cycle Time (working days)",
    ylabel: "Frequency",
    xrange: '[0:]')
:"this is an inline image"

## Empirical CDF

In [None]:
hdata = ctlist |> U.to_bins({binsize,0})
IO.puts "#{length(ctlist)} Cycle Times reduced to #{length(hdata)} bins"
:"do not show this result in output"

The returned data is a list of tuples, each describing a bin:

* the end-point of the bin,
* the proportional number of events for this bin (the total count is normalized to one),
* the lower value of the error bound,
* the upper value of the error bound.
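
The following sketch shows one way such bins could be built. It is not the implementation of `U.to_bins`: the module `BinSketch` is hypothetical, it assumes that the second element is the cumulative fraction of events up to the end-point of the bin (as the section title suggests), and it leaves out the error bounds.

```elixir
defmodule BinSketch do
  # Group values into bins of width `binsize` and return
  # {bin end-point, cumulative fraction of events} pairs.
  def ecdf(values, binsize) do
    total = length(values)

    values
    # Bin number 1 covers [0, binsize), bin number 2 covers [binsize, 2*binsize), ...
    |> Enum.frequencies_by(fn x -> trunc(Float.floor(x / binsize)) + 1 end)
    |> Enum.sort()
    |> Enum.map_reduce(0, fn {bin, count}, seen ->
      {{bin * binsize, (seen + count) / total}, seen + count}
    end)
    |> elem(0)
  end
end

# Hypothetical Cycle Times in working days
bins = BinSketch.ecdf([0.1, 0.2, 0.6, 1.4], 0.5)
IO.inspect(bins)
```

For these four values the result is `[{0.5, 0.5}, {1.0, 0.75}, {1.5, 1.0}]`: half of the events fall in the first bin and the cumulative fraction reaches one at the last bin.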

## Simple forecast using the empirical data

In [None]:
{avg,sd,all} = U.mc(iterations, U.forecast_items(ctlist, size), collect_all?: true)
U.display {avg,sd,:+}
:"do not show this result in output"

In [None]:
P.histogram(all,
    plottitle: "Monte Carlo result for duration after the first item is delivered",
    xlabel: "Duration (number of working days)",
    ylabel: "Frequency",
    yrange: '[0:7]')
:"this is an inline image"
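
The idea behind a forecast based on empirical data can be sketched as follows: draw a _Cycle Time_ at random from the observed values for every backlog item, add them up, and repeat this many times. The module `ForecastSketch` below is a hypothetical illustration under that assumption and is not the implementation of `U.mc` or `U.forecast_items`.

```elixir
defmodule ForecastSketch do
  # One simulated run: draw `size` Cycle Times at random (with replacement) and add them up
  def run(cycle_times, size) do
    Enum.reduce(1..size, 0.0, fn _, total -> total + Enum.random(cycle_times) end)
  end

  # Repeat the run `iterations` times; summarise with mean and standard deviation
  def mc(cycle_times, size, iterations) do
    results = for _ <- 1..iterations, do: run(cycle_times, size)
    avg = Enum.sum(results) / iterations
    var = Enum.sum(Enum.map(results, fn r -> (r - avg) * (r - avg) end)) / iterations
    {avg, :math.sqrt(var)}
  end
end

# Hypothetical Cycle Times in working days, a backlog of 100 items, 200 iterations
{mean, spread} = ForecastSketch.mc([0.1, 0.25, 0.5, 1.0, 2.0], 100, 200)
IO.puts "Forecast: #{Float.round(mean, 1)} +/- #{Float.round(spread, 1)} working days"
```

Because the draws are random, the printed numbers differ from run to run. The mean should be close to the backlog size times the average _Cycle Time_ (here 100 × 0.77 = 77 working days).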

## Forecasting using an Erlang distribution

#### Fitting an Erlang distribution to the data

In [None]:
model = D.model {"erlang", 1.0}
options = [model: :linear]

result = {_,cov,parameters,_} = F.chi2fit hdata, {[2.5], model[:fun], &F.nopenalties/2}, 50, model: :linear
U.display(hdata,model,result,options)

#### Second try: time between batches of 10 deliveries

In [None]:
batch = 10
binsize = 2/hours_in_day

bdel = deliveries
|> Stream.chunk_every(batch, batch, :discard)
|> Stream.map(& hd &1)

hdata = bdel
|> U.binerror(fun, bin: binsize, iterations: 1, correct: correct, workhours: workhours, cutoff: cutoff)

IO.puts "#{length(ctlist)} Cycle Times reduced to #{length(hdata)} bins"
:"do not show this result in output"

In this second try the data is altered to determine the _Cycle Times_ between batches of 10 completed items. We still expect an Erlang distribution, and more specifically an Erlang-10 distribution: if single-item _Cycle Times_ are independent and exponentially distributed, the time to complete a batch of 10 items is the sum of 10 such times.
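
For an integer shape `k` and rate `lambda`, the Erlang cumulative distribution has the closed form `F(x) = 1 - exp(-lambda*x) * sum_{n=0}^{k-1} (lambda*x)^n / n!`. The sketch below evaluates this formula directly. The module `ErlangSketch` is hypothetical and is not the `D.erlangCDF` function used in this notebook.

```elixir
defmodule ErlangSketch do
  # CDF of the Erlang distribution with integer shape k and rate lambda:
  # F(x) = 1 - exp(-lambda*x) * sum_{n=0}^{k-1} (lambda*x)^n / n!
  def cdf(x, _k, _lambda) when x <= 0, do: 0.0

  def cdf(x, k, lambda) when is_integer(k) and k > 0 do
    lx = lambda * x

    # Start with the n = 0 term (equal to 1) and build the terms
    # (lambda*x)^n / n! for n = 1..k-1 iteratively to avoid large factorials
    {sum, _last_term} =
      0..(k - 1)
      |> Enum.drop(1)
      |> Enum.reduce({1.0, 1.0}, fn n, {sum, term} ->
        next = term * lx / n
        {sum + next, next}
      end)

    1.0 - :math.exp(-lx) * sum
  end
end

# Shape 1 reduces to the exponential distribution
IO.inspect(ErlangSketch.cdf(1.0, 1, 1.0))

# Probability that a batch of 10 items takes at most 10 working days at 1 item per working day
IO.inspect(ErlangSketch.cdf(10.0, 10, 1.0))
```

With shape 1 the sum has a single term and the formula reduces to `1 - exp(-lambda*x)`, the exponential distribution.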

In [None]:
P.ecdf(hdata,
    plottitle: "Empirical CDF (batches of 10 items)",
    xlabel: "Cycle Time (working days)",
    ylabel: "Probability",
    xrange: '[0:15]')
:"this is an inline image"

In [None]:
model = D.model {"erlang", batch*1.0}

result = {_,cov,parameters,_} = F.chi2fit hdata, {[2.5], model[:fun], &F.nopenalties/2}, 50, model: :linear
U.display(hdata,model,result,options)
:"do not show this result in output"

#### Third try: considering only recent data

Perhaps not all data is relevant. As a variation we will consider only data after January 1st, 2019.

Again, batches of 10 deliveries.

In [None]:
recent = deliveries
|> Stream.filter(fn t -> Timex.after?(t, ~D[2019-01-01]) end)
|> Stream.chunk_every(batch, batch, :discard)
|> Stream.map(& hd &1)

hdata = recent
|> U.binerror(fun, bin: binsize, iterations: 1, correct: correct, workhours: workhours, cutoff: cutoff)

IO.puts "#{length(ctlist)} Cycle Times reduced to #{length(hdata)} bins"
:"do not show this result in output"

Next, perform a fit against the Erlang distribution.

In [None]:
result = {_,cov,[lambda],_} = F.chi2fit hdata, {[2.5], model[:fun], &F.nopenalties/2}, 50, model: :linear
U.display(hdata,model,result,options)
:"do not show this result in output"

In [None]:
P.ecdf(hdata,
    plottitle: "Fit of Erlang distribution to CDF",
    xlabel: "Cycle Times (working days)",
    ylabel: "Probability",
    title: "Erlang_{10}",
    func: D.erlangCDF(batch*1.0,lambda))
:"this is an inline image"

## Finding an appropriate subsequence

Instead of manually removing old data from our data set, `Chi2fit` provides a function for partitioning the data set into longest subsequences that will fit the chosen model.

In [None]:
batch = 10
binsize = 2/hours_in_day

options = [probes: probes, bin: binsize, init: List.duplicate({0.1,50.0},model[:df]), fitmodel: model, model: :linear]

# Find points in the delivery dates that indicate a change in the model.
# `find_all` is a lazy function, meaning it does not always traverse the entire data set.
# In the example below, it stops after finding 5 jumping points.
trends = bdel
|> fun.()
|> F.find_all(options)
|> Enum.take(5)
:"do not show this result in output"

In [None]:
trends
|> Enum.map(fn {chi,[rate],list} -> {chi, rate, Enum.sum(list), length(list)} end)
|> U.as_table({"Goodness of fit", "Delivery rate (items per working day)", "Duration (work days)", "Count of items"})
:"do not show this result in output"

In [None]:
# Get the most recent sequence:
{_chi,[lambda],list} = hd trends
delivery_rate = lambda * 10 # items per 2 weeks (2 weeks = 10 working days)
startdate = Timex.shift(~D[2019-05-13], days: -round(Enum.sum(list)/5*7)) # work days to calendar days

IO.puts ~s[The found sequence runs from #{Timex.format! startdate, "{Mshort} {D}, {YYYY}"} till #{Timex.format! ~D[2019-05-13], "{Mshort} {D}, {YYYY}"}]
IO.puts "Delivery rate = #{Float.round(delivery_rate,2)} items per 2 weeks"
:"do not show this result in output"

#### Forecasting

In [None]:
# The Erlang_1 distribution is equivalent to the Exponential distribution.
# We could have also used Erlang_10 and divided size by 10. This gives equivalent results.
{avg,sd} = U.mc(iterations, U.forecast_items(D.exponential(lambda), size))

IO.puts "Forecast using the parameter as fit with the Erlang_10 distribution:"
U.display {avg,sd,:+}
:"do not show this result in output"
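
As a rough cross-check of such a forecast, exponentially distributed _Cycle Times_ can be simulated with inverse transform sampling and compared with the expected duration, which is the backlog size divided by the rate. The module `ExponentialSketch` and the values of `rate` and `backlog` are hypothetical; the sketch does not use the fitted `lambda` or the `Chi2fit` functions.

```elixir
defmodule ExponentialSketch do
  # Draw one exponentially distributed Cycle Time with the given rate
  # (inverse transform sampling; 1.0 - uniform avoids taking the log of zero)
  def sample(rate), do: -:math.log(1.0 - :rand.uniform()) / rate

  # Total duration of `backlog` items, each with its own random Cycle Time
  def duration(rate, backlog) do
    Enum.reduce(1..backlog, 0.0, fn _, total -> total + sample(rate) end)
  end
end

rate = 2.0      # hypothetical delivery rate: 2 items per working day
backlog = 1000  # hypothetical backlog size

runs = for _ <- 1..100, do: ExponentialSketch.duration(rate, backlog)
simulated = Enum.sum(runs) / length(runs)

IO.puts "Simulated: #{Float.round(simulated, 1)} working days, expected: #{backlog / rate}"
```

The simulated average varies from run to run but should stay close to the expected 500 working days.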

Or using the empirical data of the subsequence instead of the whole data set:

In [None]:
# Remember to divide size by the batch size (10) since `list` corresponds to cycle times of batches of 10.
# Integer division keeps the number of batches an integer.
{avg,sd} = U.mc(iterations, U.forecast_items(list, div(size, batch)))

IO.puts "Forecast directly using the subsequence of the data set:"
U.display {avg,sd,:+}
:"do not show this result in output"