# Template: Forecasting - based on empirical data

In [None]:
alias Chi2fit.Utilities, as: U
alias Chi2fit.Distribution, as: D
alias Chi2fit.Fit, as: F
alias Gnuplotlib, as: P

#### Using this template

Follow these steps to start with forecasting based on your team's capability to deliver:

1. Extract the delivery dates of completed items from your favourite tool ([Team data](#Team-data))
1. Adjust any simulation parameters ([Simulation settings](#Simulation-settings)) to your needs
1. Run the initial simulation to get a first forecast ([Simple forecast using the empirical data: number of completed items](#Simple-forecast-using-the-empirical-data:-number-of-completed-items))
1. Select the most recent subsequence of deliveries ([Finding the largest set of most recent relevant data](#Finding-the-largest-set-of-most-recent-relevant-data))
1. Run the simulation based on the found subsequence ([Forecast: number of completed items revisited](#Forecast:-number-of-completed-items-revisited))

#### README

The [README](README.ipynb) contains more information on how to use the notebooks.

## Data and simulation set-up

In the set-up below, the data is assumed to be uploaded as a file named `team.csv` in which one column holds the date on which an item was completed. Time stamps are expected in the format `<day>/<month as a 3 letter code>/<2 digit year> <hours>:<minutes>`. One tool that exports this format is [Jira](https://jira.atlassian.com/).

An example file is
```csv
Issue key,Issue id,Issue Type,Custom field (Status),Status,Custom field (Created),Created,Resolved
<key1>,<id>,Story,,Done,,07/May/19 13:21,13/May/19 12:37
<key2>,<id>,Story,,Done,,07/May/19 13:20,10/May/19 09:31
.....
```
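To make the time stamp format concrete, the following is a minimal sketch of a parser written with the Elixir standard library only. The module `TimestampSketch` is a hypothetical helper and not part of Chi2fit (the notebook itself parses the column through `U.csv_to_list` with a format string). It assumes English month abbreviations and that two-digit years belong to the 2000s.

```elixir
defmodule TimestampSketch do
  # Hypothetical helper for illustration; not part of Chi2fit.
  @months ~w(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)

  # Parses "<day>/<3 letter month>/<2 digit year> <hours>:<minutes>"
  def parse(string) do
    [_, day, mon, year, hour, minute] =
      Regex.run(~r|^(\d{2})/([A-Za-z]{3})/(\d{2}) (\d{1,2}):(\d{2})$|, string)

    # Position of the month abbreviation in the list, 1-based
    month = Enum.find_index(@months, &(&1 == mon)) + 1

    # Assumption: two-digit years are in the 2000s
    {:ok, datetime} =
      NaiveDateTime.new(
        2000 + String.to_integer(year),
        month,
        String.to_integer(day),
        String.to_integer(hour),
        String.to_integer(minute),
        0
      )

    datetime
  end
end

# A "Resolved" value taken from the example file above
resolved = TimestampSketch.parse("13/May/19 12:37")
```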

#### Team data

In [None]:
#
# Completed items have a resolution date, which is stored in the column "Resolved".
# Jira exports time data as shown above.
#
deliveries = "/app/notebooks/team.csv"
|> File.stream!
|> U.csv_to_list("Resolved", header?: true, format: "{0D}/{Mshort}/{YY} {h24}:{0m}")
IO.inspect(deliveries, print: false, limit: 3)
:"do not show this result in output"

#### Extract the Cycle Times

The definition of _Cycle Time_ used here is:

Cycle Time
: "...the time between two items emerging from a process"

Note: See ["Essential Kanban Condensed"](http://leankanban.com/wp-content/uploads/2016/06/Essential-Kanban-Condensed.pdf)
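As a worked example of this definition, the sketch below computes the cycle time between the two resolution dates from the example file. It assumes the dates are ordered most recent first, as in that file, so that the difference comes out positive.

```elixir
# Resolution dates from the example file, most recent first
resolved = [~N[2019-05-13 12:37:00], ~N[2019-05-10 09:31:00]]

cycle_times_in_days =
  resolved
  # Pair each delivery with the one that preceded it
  |> Enum.chunk_every(2, 1, :discard)
  # Difference in seconds, converted to days
  |> Enum.map(fn [later, earlier] -> NaiveDateTime.diff(later, earlier) / (24 * 3600) end)
```

The two deliveries are 3 days, 3 hours and 6 minutes apart, which gives a single cycle time of about 3.13 days.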

In [None]:
# Cycle Times in number of days
ctlist = deliveries
|> Stream.chunk_every(2, 1, :discard)
|> Stream.map(fn [d1,d2] -> NaiveDateTime.diff(d1,d2) end) # Calculate the time difference between two consecutive deliveries in seconds
|> Enum.map(& &1/24/3600) # Convert the number of seconds to number of days
IO.inspect(ctlist, print: false, limit: 3)
:"do not show this result in output"

#### Extract the Delivery Rates

In [None]:
# Every first and sixteenth of the month
intervals = U.intervals()

data = intervals
|> U.throughput(deliveries)
|> tl # Skip the first data point because it corresponds to an incomplete iteration

IO.inspect Enum.zip(Enum.take(intervals, length(data)) |> tl, data)
:"do not show this result in output"

A visualization of the data using a histogram or frequency chart is shown below. The horizontal axis indicates the number of completed items in an iteration. The vertical axis shows how often a certain throughput occurred.
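The counts behind such a frequency chart can be sketched in plain Elixir. The throughput values below are hypothetical and serve only to illustrate the counting; they are not the team's data.

```elixir
# Hypothetical throughput per iteration
throughput = [3, 5, 3, 8, 5, 3]

frequencies =
  throughput
  # Count how often each throughput value occurs
  |> Enum.reduce(%{}, fn items, acc -> Map.update(acc, items, 1, &(&1 + 1)) end)
  # Sort by throughput so the pairs follow the horizontal axis
  |> Enum.sort()
```

Each `{throughput, count}` pair corresponds to one bar: the first element is the position on the horizontal axis and the second is the bar height.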

In [None]:
P.histogram(data,
    plottitle: "Throughput histogram",
    xlabel: "Throughput (items per 2 weeks)",
    ylabel: "Frequency",
    xrange: '[0:100]',
    yrange: '[0:3]')
:"this is an inline image"

#### Simulation settings

Parameters that affect the forecasting are listed below. Please adjust to your needs.

In [None]:
# The size of the backlog (number of backlog items)
size = 500

# Number of iterations (runs) to use in the Monte Carlo simulation
iterations = 500000

# Number of iterations (periods) over which to forecast the number of completed items
periods = 6

## Simple forecast using the empirical data: number of completed items

In [None]:
{avg,sd,all} = U.mc iterations, U.forecast_items(data,periods), collect_all?: true
U.display {avg,sd,:-}

Here, the interpretation is that 230 or more work items were completed within 6 iterations in 50% of the runs, and 179 or more in 84% of the runs.
Finally, we expect with near certainty to complete at least 79 work items.
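The three levels appear to correspond to the average, the average minus one standard deviation, and the average minus three standard deviations; for a roughly normal distribution these are exceeded in about 50%, 84% and 99.9% of the runs. The following is a minimal sketch of such a simulation, assuming each iteration's throughput is drawn at random, with replacement, from the observed data. The module `ForecastSketch` and the sample data are hypothetical, and this is not necessarily how `U.mc` and `U.forecast_items` are implemented.

```elixir
defmodule ForecastSketch do
  # Hypothetical module for illustration; not part of Chi2fit.

  # One simulated run: draw `periods` throughputs at random and add them up
  def run(data, periods) do
    1..periods
    |> Enum.map(fn _ -> Enum.random(data) end)
    |> Enum.sum()
  end

  # Repeat the run `iterations` times; return the mean, the standard deviation and all totals
  def simulate(data, periods, iterations) do
    totals = for _ <- 1..iterations, do: run(data, periods)
    avg = Enum.sum(totals) / iterations
    variance = Enum.sum(Enum.map(totals, fn t -> (t - avg) * (t - avg) end)) / iterations
    {avg, :math.sqrt(variance), totals}
  end
end

# Hypothetical throughput per iteration
sample = [30, 45, 38, 52, 41]
{avg, sd, _totals} = ForecastSketch.simulate(sample, 6, 10_000)

# Levels exceeded in roughly 50%, 84% and 99.9% of the runs (normal approximation)
levels = [p50: avg, p84: avg - sd, near_certain: avg - 3 * sd]
```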

In [None]:
P.histogram(all,
    plottitle: "Monte Carlo result for completed items after #{periods} iterations",
    xlabel: "Completed items (count)",
    ylabel: "Frequency")
:"this is an inline image"

## Finding the largest set of most recent relevant data

In [None]:
# The size of the bins
binsize = 5

# Number of probes to use in the chi2 fit
probes = 10_000

# The range of the parameter to look for a (global) minimum
initial = {1,100}

Next, we use the Poisson distribution to model the data.
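For reference, the Poisson distribution gives the probability of completing `k` items in an iteration as `rate^k * exp(-rate) / k!`, where the single parameter `rate` is the average number of items per iteration. The sketch below evaluates this formula directly; `PoissonSketch` is a hypothetical module and independent of how `D.model "poisson"` is implemented.

```elixir
defmodule PoissonSketch do
  # Hypothetical module for illustration; not part of Chi2fit.

  # Probability of exactly `k` completed items, given an average `rate` per iteration
  def pmf(0, rate) when rate > 0, do: :math.exp(-rate)

  def pmf(k, rate) when is_integer(k) and k > 0 and rate > 0 do
    # Sum of logarithms instead of k! to avoid overflow for large k
    log_factorial = Enum.reduce(1..k, 0.0, fn i, acc -> acc + :math.log(i) end)
    :math.exp(k * :math.log(rate) - rate - log_factorial)
  end
end

# Probability of completing exactly 2 items when the average rate is 2.0
p = PoissonSketch.pmf(2, 2.0)
```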

In [None]:
# Use the Poisson distribution as a model; in most cases this is a more than reasonable assumption
model = D.model "poisson"
options = [probes: probes, smoothing: false, model: :linear, saved?: true, bin: binsize, fitmodel: model, init: initial]
:"do not show this result in output"

In [None]:
# Find points in the delivery dates that indicate a change in the model
trends = F.find_all data, options
:"do not show this result in output"

In [None]:
trends
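# Pair each subsequence with its interval date; the index starts at 1 because the first data point was skipped
|> Stream.transform(1, fn arg={_,_,data}, index -> { [{arg, Enum.at(intervals,index)}], index+length(data)} end)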
|> Enum.map(fn {{chi, [rate], sub}, date} ->
    [ Timex.format!(date,~S({Mshort}, {D} {YYYY})), Float.round(chi,4), Float.round(rate,1), "#{inspect(sub, charlists: :as_lists)}" ]
  end)
|> U.as_table({"End date of sequence", "Goodness of fit", "Delivery Rate", "Subsequence"})
:"do not show this result in output"

In [None]:
# Pick the first (and most recent) subsequence and extract its data
{_, _, subdata} = hd(trends)

## Forecast: number of completed items revisited

In [None]:
# If you're not interested in plotting a histogram of the simulation data, use `collect_all?: false`
{avg,sd,all} = U.mc iterations, U.forecast_items(subdata,periods), collect_all?: true
U.display {avg,sd,:-}

Here, the interpretation is that 332 or more work items were completed within 6 iterations in 50% of the runs, and 308 or more in 84% of the runs.
Finally, we expect with near certainty to complete at least 259 work items.

In [None]:
P.histogram(all,
    plottitle: "Monte Carlo result for completed items after #{periods} iterations",
    xlabel: "Completed items (count)",
    ylabel: "Frequency")
:"this is an inline image"