# Regularized Unfolding

Regularized Unfolding (RUN) fits the target distribution `f` to the convolution model `g = R * f` using maximum likelihood. The regularization strength is configured with `n_df`, the effective number of degrees of freedom in the second-order local model of the solution.
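
To make the convolution model concrete, here is a minimal sketch in plain Julia that estimates a response matrix by counting training examples and then applies it to an assumed target distribution. The helper `estimate_response` and all numbers are hypothetical and only illustrate `g = R * f`; this is not how CherenkovDeconvolution.jl builds `R` internally.

```julia
# Toy illustration of the convolution model g = R * f (not the package's code).
# x: observed cluster index of each training example, y: its target index.
x_train = [1, 1, 2, 2, 2, 3, 1, 3, 2, 3]   # hypothetical observations
y_train = [1, 1, 1, 2, 2, 2, 1, 2, 2, 2]   # hypothetical target indices

# Hypothetical helper: estimate R[i, j] = P(x = i | y = j) by counting.
function estimate_response(x, y, n_x, n_y)
    R = zeros(n_x, n_y)
    for (i, j) in zip(x, y)
        R[i, j] += 1                # count co-occurrences of (x, y)
    end
    for j in 1:n_y
        R[:, j] /= sum(R[:, j])     # normalize each column to a conditional pdf
    end
    return R
end

R = estimate_response(x_train, y_train, 3, 2)
f = [0.25, 0.75]   # an assumed target distribution
g = R * f          # the observable distribution implied by f
```

Deconvolution goes the other way: given `R` and an observed `g`, it searches for the `f` that explains `g` best.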

For a quick start, we deconvolve the distribution of Iris plant types in the famous Iris data set.

In [1]:
# load the example data
using MLDataUtils
X, y_labels, _ = load_iris()

# discretize the target quantity (for numerical values, we'd use LinearDiscretizer)
using Discretizers: encode, CategoricalDiscretizer
y = encode(CategoricalDiscretizer(y_labels), y_labels) # vector of target value indices
;

In [2]:
# Split the data into training and observed data sets.
#
# CherenkovDeconvolution.jl follows the convention of ScikitLearn.jl
# (and others), which is size(X_train) == (n_examples, n_features).
# MLDataUtils assumes size(X_train) == (n_features, n_examples) by default,
# so we transpose X and state obsdim = 1 explicitly.
#
srand(42) # make split reproducible
(X_train, y_train), (X_data, y_data) = splitobs(shuffleobs((X', y), obsdim = 1), obsdim = 1);

In [3]:
#
# RUN is only applicable with a single discrete observable dimension. In order to obtain 
# a dimension that contains as much information as possible, we discretize the feature
# space with a decision tree, using its leaves as clusters. The cluster indices are the
# discrete values of the observed dimension. This concept is related to supervised clustering.
#
using ScikitLearn, CherenkovDeconvolution.Sklearn

td = TreeDiscretizer(X_train, y_train, 6) # obtain (up to) 6 clusters
x_train = encode(td, X_train)
x_data  = encode(td, X_data)

# have a look at the content of x_train
unique(x_train) # these are the cluster indices

[1m[36mINFO: [39m[22m[36mUtilities of ScikitLearn.jl are available in CherenkovDeconvolution.Sklearn
[39m

6-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6

In [4]:
#
# Now let's estimate the target distribution!
#
using CherenkovDeconvolution

f_est = CherenkovDeconvolution.run(x_data, x_train, y_train) # returns a vector of target value probabilities



3-element Array{Float64,1}:
 0.333333
 0.356313
 0.35    

In [5]:
#
# Compare the result to the true target distribution, which we are estimating
#
f_true = Util.fit_pdf(y_data) # f_est is almost equal to f_true!

3-element Array{Float64,1}:
 0.333333
 0.355556
 0.311111
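
The printed `f_true` is consistent with relative class frequencies among 45 observed examples (15, 16, and 14 per class). The sketch below reproduces those values with a hypothetical helper, `relative_frequencies`, and measures how far the printed estimate is from the printed truth. The helper is an assumption for illustration; the actual implementation of `Util.fit_pdf` may differ.

```julia
# Hypothetical stand-in for a relative-frequency estimate; the actual
# implementation of Util.fit_pdf may differ.
function relative_frequencies(y, n_classes)
    counts = zeros(n_classes)
    for label in y
        counts[label] += 1          # count how often each index occurs
    end
    return counts / length(y)       # normalize the counts to a pdf
end

# 45 examples with class counts 15, 16, 14 (inferred from the printed f_true)
y_example = vcat(fill(1, 15), fill(2, 16), fill(3, 14))
f_example = relative_frequencies(y_example, 3)

# largest absolute deviation between the printed estimate and the printed truth
f_est_printed  = [0.333333, 0.356313, 0.35]
f_true_printed = [0.333333, 0.355556, 0.311111]
max_dev = maximum(abs.(f_est_printed - f_true_printed))
```

A single summary number such as `max_dev` makes "almost equal" easier to judge than comparing the vectors by eye.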

In [7]:
?CherenkovDeconvolution.run # You can find more information in the documentation

```
run(data, train, y, x; kwargs...)
```

Regularized Unfolding of the target distribution in the DataFrame `data`. The deconvolution is inferred from the DataFrame `train`, where the target column `y` and the observable column `x` are given.

This function wraps `run(R, g; kwargs...)`, constructing `R` and `g` from the examples in the two DataFrames.

```
run(x_data, x_train, y_train; kwargs...)
```

Regularized Unfolding of the target distribution, given the observations in the one-dimensional array `x_data`. The deconvolution is inferred from `x_train` and `y_train`.

This function wraps `run(R, g; kwargs...)`, constructing `R` and `g` from the examples in the three arrays.

```
run(R, g, n_df = size(R, 2); kwargs...)
```

Perform RUN with the observed pdf `g`, the detector response matrix `R`, and `n_df` degrees of freedom. The default `n_df` results in no regularization (there is one degree of freedom for each dimension in the result).

**Keyword arguments**

  * `K = 100` is the maximum number of iterations.
  * `epsilon = 1e-6` is the minimum difference in the loss function between iterations. RUN stops when the absolute loss difference drops below `epsilon`.
  * `inspect = nothing` is a function `(k::Int, tau::Float64, ldiff::Float64, f_k::Array) -> Any` optionally called in every iteration.
  * `loggingstream = DevNull` is an optional `IO` stream to write log messages to.
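
As an illustration of the `inspect` keyword, the following sketch defines a callback with the documented signature that records every iteration. The names `history` and `record_progress` are hypothetical, and the call to `run` is shown only as a comment because it needs the package and the arrays from the cells above; the two direct calls use made-up values to show what would be recorded.

```julia
# Sketch of an inspection callback matching the documented signature
# (k, tau, ldiff, f_k). The history container is a hypothetical addition.
history = []   # collects one tuple per iteration

function record_progress(k::Int, tau::Float64, ldiff::Float64, f_k::Array)
    push!(history, (k, tau, ldiff, copy(f_k)))   # copy in case f_k is reused
    return nothing
end

# Intended use (requires CherenkovDeconvolution.jl and the arrays from above):
# f_est = CherenkovDeconvolution.run(x_data, x_train, y_train;
#                                    K = 50, epsilon = 1e-4,
#                                    inspect = record_progress)

# Simulated calls with made-up values, only to show what gets recorded.
record_progress(1, 0.0, 0.5, [0.3, 0.4, 0.3])
record_progress(2, 0.0, 1e-7, [0.33, 0.36, 0.31])
```

After a real run, `history` could be used to plot the loss difference per iteration and to check whether the run stopped because of `epsilon` or because it reached `K`.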
