# Getting Started

For a quick start, we compare several deconvolution algorithms on the well-known Iris data set, estimating the distribution of Iris plant types.

In [1]:
# load the example data
using MLDataUtils
X, y_labels, _ = load_iris()

# discretize the target quantity (for numerical values, we'd use LinearDiscretizer)
using Discretizers: encode, CategoricalDiscretizer
y = encode(CategoricalDiscretizer(y_labels), y_labels) # vector of target value indices

# have a look at the content of y
unique(y) # it's just indices

3-element Array{Int64,1}:
 1
 2
 3

In [2]:
# Split the data into training and observed data sets.
#
# CherenkovDeconvolution.jl follows the convention of ScikitLearn.jl
# (and others), which is size(X_train) == (n_examples, n_features).
# MLDataUtils assumes the transposed layout by default, i.e.,
# size(X_train) == (n_features, n_examples). Therefore, we transpose X
# and state obsdim = 1 explicitly.
#
using Random
Random.seed!(42) # make split reproducible
(X_train, y_train), (X_data, y_data) = splitobs(shuffleobs((X', y), obsdim = 1), obsdim = 1);
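
The split can also be spelled out by hand, which makes the `(n_examples, n_features)` layout explicit. The following is a minimal sketch with random stand-in data rather than the Iris data. The 70/30 ratio is an assumption about the default of `splitobs`, and all names ending in `_manual` are hypothetical.

```julia
using Random

# Stand-in data in the (n_features, n_examples) layout; not the actual Iris data
X_demo = rand(MersenneTwister(1), 4, 150)
y_demo = repeat(1:3, inner = 50)

X_rows = permutedims(X_demo)                           # (n_examples, n_features), like X'
perm = randperm(MersenneTwister(42), size(X_rows, 1))  # shuffle the example indices
n_train = round(Int, 0.7 * length(perm))               # assumed 70/30 split

i_train, i_data = perm[1:n_train], perm[n_train+1:end]
X_train_manual, y_train_manual = X_rows[i_train, :], y_demo[i_train]
X_data_manual,  y_data_manual  = X_rows[i_data, :],  y_demo[i_data]

size(X_train_manual), size(X_data_manual)  # ((105, 4), (45, 4))
```

Every row of `X_train_manual` and `X_data_manual` is one example, which is the layout that the deconvolution functions in this notebook receive.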

## Deconvolution with DSEA

The Dortmund Spectrum Estimation Algorithm (DSEA) reconstructs the target distribution from classifier predictions on the target quantity of individual examples. CherenkovDeconvolution.jl implements the improved version DSEA+, which adds adaptive step sizes and a corrected re-weighting of training examples.
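
To illustrate the idea, here is a heavily simplified sketch of a DSEA-style loop. It is not the implementation in CherenkovDeconvolution.jl: it omits adaptive step sizes, smoothing, and stopping criteria, and it uses a hypothetical toy classifier (a weighted Gaussian naive Bayes on a single feature). The names `toy_train_and_predict_proba` and `dsea_sketch` are made up for this example.

```julia
# Toy stand-in for a classifier (hypothetical): weighted Gaussian naive Bayes on one feature.
# Returns a matrix of size (length(x_data), m) whose rows are class probabilities.
function toy_train_and_predict_proba(x_data, x_train, y_train, w_train, m)
    proba = zeros(length(x_data), m)
    for j in 1:m
        idx = y_train .== j
        w, x = w_train[idx], x_train[idx]
        mu = sum(w .* x) / sum(w)                          # weighted class mean
        sigma2 = sum(w .* (x .- mu) .^ 2) / sum(w) + 1e-9  # weighted class variance
        prior = sum(w) / sum(w_train)                      # weighted class prior
        z = (x_data .- mu) .^ 2 ./ (2 * sigma2)
        proba[:, j] = prior .* exp.(-z) ./ sqrt(2 * pi * sigma2)
    end
    return proba ./ sum(proba, dims = 2)                   # normalize each row
end

# Simplified DSEA-style loop (a sketch, not the library's algorithm)
function dsea_sketch(x_data, x_train, y_train, m; K = 3)
    f_train = [count(y_train .== j) for j in 1:m] ./ length(y_train)
    f = fill(1 / m, m)                                     # uniform prior
    for k in 1:K
        # re-weight the training examples so that they resemble the current estimate
        w_train = f[y_train] ./ f_train[y_train]
        proba = toy_train_and_predict_proba(x_data, x_train, y_train, w_train, m)
        f = vec(sum(proba, dims = 1)) ./ size(proba, 1)    # average the predictions
    end
    return f
end

# Toy data: two well-separated classes, balanced in training but not in the observed data
x_train_toy = vcat(collect(-1.0:0.1:1.0), collect(2.0:0.1:4.0))
y_train_toy = vcat(fill(1, 21), fill(2, 21))
x_data_toy  = vcat(collect(-1.0:0.25:1.0), collect(2.0:0.1:4.0))  # 9 vs. 21 examples

f_toy = dsea_sketch(x_data_toy, x_train_toy, y_train_toy, 2)
```

The estimate `f_toy` should be close to the true class proportions of the toy data, which are 0.3 and 0.7.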

In [3]:
using ScikitLearn, CherenkovDeconvolution

# deconvolve with a Naive Bayes classifier
@sk_import naive_bayes : GaussianNB
tp_function = Sklearn.train_and_predict_proba(GaussianNB()) # trains and applies the classifier in each iteration
                                                            # Sklearn is a sub-module of CherenkovDeconvolution.jl

f_dsea = dsea(X_data, X_train, y_train, tp_function) # returns a vector of target value probabilities

┌ Info: DSEA iteration 1/1 uses alpha = 1.0 (chi2s = 0.0028011676660929376)
└ @ CherenkovDeconvolution /home/mbunse/.julia/dev/CherenkovDeconvolution/src/methods/dsea.jl:119


3-element Array{Float64,1}:
 0.3333333327844613
 0.3549289670574495
 0.3117377001580892

In [4]:
# compare the result to the true target distribution, which we are estimating
f_true = Util.fit_pdf(y_data) # f_dsea is almost equal to f_true!

3-element Array{Float64,1}:
 0.3333333333333333 
 0.35555555555555557
 0.3111111111111111 
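
The true distribution is simply the normalized histogram of the label indices. A minimal sketch of that computation follows. The function `label_pdf` is a hypothetical stand-in and not the implementation of `Util.fit_pdf`.

```julia
# Hypothetical stand-in: relative frequency of each label index 1, ..., m
function label_pdf(y, m = maximum(y))
    counts = zeros(Int, m)
    for label in y
        counts[label] += 1
    end
    return counts ./ length(y)
end

pdf_demo = label_pdf([1, 1, 2, 3, 3, 3])  # fractions 2/6, 1/6, and 3/6
```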

## Regularized Unfolding and Iterative Bayesian Unfolding

Regularized Unfolding (RUN) fits the target distribution `f` to the convolution model `g = R * f` by maximum likelihood. The regularization strength is configured with `n_df`, the effective number of degrees of freedom in the second-order local model of the solution.
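
The convolution model can be made concrete with a small sketch. It assumes that `R[i, j]` estimates the probability of observing the discrete value `i` given the target value `j`, obtained by counting training examples. The helper `response_matrix` and the toy arrays are hypothetical and only serve this illustration.

```julia
# Estimate R[i, j] = P(x = i | y = j) from discrete training examples
function response_matrix(x_train, y_train, n_x, n_y)
    R = zeros(n_x, n_y)
    for (x, y) in zip(x_train, y_train)
        R[x, y] += 1                 # count co-occurrences
    end
    return R ./ sum(R, dims = 1)     # normalize each column to sum to one
end

x_toy = [1, 1, 2, 2, 2, 3]           # observed (cluster) indices
y_toy = [1, 1, 1, 2, 2, 2]           # target indices
R_toy = response_matrix(x_toy, y_toy, 3, 2)

f_assumed = [0.25, 0.75]             # a target distribution
g_toy = R_toy * f_assumed            # the observable distribution it implies
```

Deconvolution goes in the opposite direction: it estimates `f` from an observed `g` and a response matrix `R` learned from training data.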

Iterative Bayesian Unfolding (IBU) reconstructs the target distribution by iteratively applying Bayes' rule to the conditional probabilities contained in the detector response matrix.
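
The iteration can be sketched in a few lines. This is a simplified illustration and not the library's `ibu`: it assumes that the columns of `R` sum to one, that every row of `R` has some probability mass, and that `g` is normalized. It applies neither smoothing nor a stopping criterion, and the name `ibu_sketch` is hypothetical.

```julia
# Simplified Iterative Bayesian Unfolding (a sketch under the stated assumptions)
function ibu_sketch(R, g; K = 3)
    m = size(R, 2)
    f = fill(1 / m, m)                             # uniform prior
    for k in 1:K
        joint = R .* f'                            # joint[i, j] = P(x = i | y = j) * f[j]
        posterior = joint ./ sum(joint, dims = 2)  # Bayes' rule: P(y = j | x = i)
        f = vec(posterior' * g)                    # f[j] = sum over i of P(y = j | x = i) * g[i]
    end
    return f
end

R_demo = [0.8 0.2; 0.2 0.8]                        # toy response matrix
g_demo = R_demo * [0.3, 0.7]                       # observable distribution of a known f
f_1  = ibu_sketch(R_demo, g_demo; K = 1)
f_10 = ibu_sketch(R_demo, g_demo; K = 10)
```

With more iterations, the estimate moves away from the uniform prior and towards the distribution `[0.3, 0.7]` that generated `g_demo`.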

In [5]:
#
# RUN and IBU are only applicable with a single discrete observable dimension. In order to
# obtain a dimension that contains as much information as possible, we discretize the feature
# space with a decision tree, using its leaves as clusters. The cluster indices are the
# discrete values of the observed dimension. This concept is related to supervised clustering.
#
td = Sklearn.TreeDiscretizer(X_train, y_train, 6) # obtain (up to) 6 clusters
x_train = encode(td, X_train)
x_data  = encode(td, X_data)

# have a look at the content of x_train
unique(x_train) # it's the cluster indices

6-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6

In [6]:
# Unlike DSEA, RUN and IBU do not need a classifier for deconvolution
f_run = CherenkovDeconvolution.run(x_data, x_train, y_train) # module qualification required because Base also exports run

└ @ CherenkovDeconvolution /home/mbunse/.julia/dev/CherenkovDeconvolution/src/methods/run.jl:82


3-element Array{Float64,1}:
 0.32062181199902845
 0.34272528540199176
 0.33665290259897984

In [7]:
f_ibu = ibu(x_data, x_train, y_train)

3-element Array{Float64,1}:
 0.3333333333333333
 0.3463872738499534
 0.3202793928167133

## More Information

In [8]:
?dsea # You can find more information in the documentation

search: [0m[1md[22m[0m[1ms[22m[0m[1me[22m[0m[1ma[22m f_[0m[1md[22m[0m[1ms[22m[0m[1me[22m[0m[1ma[22m Gri[0m[1md[22m[0m[1mS[22m[0m[1me[22m[0m[1ma[22mrch [0m[1mD[22men[0m[1ms[22m[0m[1me[22m[0m[1mA[22mrray [0m[1mD[22men[0m[1ms[22m[0m[1me[22mM[0m[1ma[22mtrix [0m[1mD[22men[0m[1ms[22m[0m[1me[22mVecOrM[0m[1ma[22mt



```
dsea(data, train, y, train_and_predict_proba[, bins;
     features = setdiff(names(train), [y]),
     kwargs...])
```

Deconvolve the `y` distribution in the DataFrame `data`, as learned from the DataFrame `train`. This function wraps `dsea(::Matrix, ::Matrix, ::Array, ::Function)`.

The additional keyword argument `features` specifies the columns in `data` and `train` that are used as features.

---

```
dsea(X_data, X_train, y_train, train_and_predict_proba[, bins; kwargs...])
```

Deconvolve the target distribution of `X_data`, as learned from `X_train` and `y_train`.

The function `train_and_predict_proba(X_data, X_train, y_train, w_train) -> Any` trains and applies a classifier. All of its arguments but `w_train`, which is updated in each iteration, are simply passed through from `dsea`. To facilitate classification, `y_train` has to be discrete, i.e., it must contain label indices rather than actual values. All expected indices (for cases where `y_train` may not contain some of the indices) are optionally provided as `bins`.

**Keyword arguments**

  * `f_0 = ones(m) ./ m` defines the prior, which is uniform by default.
  * `fixweighting = true` sets whether or not the weight update fix is applied. This fix is proposed in my Master's thesis and in the corresponding paper.
  * `alpha = 1.0` is the step size taken in every iteration. This parameter can be either a constant value or a function with the signature `(k::Int, pk::AbstractArray{Float64,1}, f_prev::AbstractArray{Float64,1}) -> Float`, where `f_prev` is the estimate of the previous iteration and `pk` is the direction that DSEA takes in the current iteration `k`.
  * `smoothing = Base.identity` is a function that optionally applies smoothing in between iterations.
  * `K = 1` is the maximum number of iterations.
  * `epsilon = 0.0` is the minimum symmetric Chi Square distance between iterations. If the actual distance is below this threshold, convergence is assumed and the algorithm stops.
  * `inspect = nothing` is a function `(f_k::Array, k::Int, chi2s::Float64, alpha::Float64) -> Any` optionally called in every iteration.
  * `return_contributions = false` sets whether or not the contributions of individual examples in `X_data` are returned as a tuple together with the deconvolution result.
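
As an illustration of the function-valued keyword arguments, the following sketch defines a step size schedule and an inspection callback with the signatures documented above. Both `decaying_alpha` and `record` are hypothetical helpers, and the commented call only shows an assumed usage that has not been run here.

```julia
# Hypothetical step size schedule: (k, pk, f_prev) -> step size, shrinking with the iteration k
decaying_alpha(k, pk, f_prev) = 1.0 / k

# Hypothetical inspection callback: (f_k, k, chi2s, alpha) -> Any, recording every estimate
history = Vector{Vector{Float64}}()
record(f_k, k, chi2s, alpha) = push!(history, copy(f_k))

# Assumed usage, following the keyword arguments documented above:
# f = dsea(X_data, X_train, y_train, tp_function;
#          K = 10, epsilon = 1e-6, alpha = decaying_alpha, inspect = record)

# The helpers can be tried in isolation
step = decaying_alpha(2, [0.1, -0.1], [0.5, 0.5])
record([0.5, 0.5], 1, 0.0, 1.0)
```

After a deconvolution run with `inspect = record`, `history` would hold one estimate per call of the callback.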


In [9]:
?CherenkovDeconvolution.run

```
run(data, train, y, x; kwargs...)
```

Regularized Unfolding of the target distribution in the DataFrame `data`. The deconvolution is inferred from the DataFrame `train`, where the target column `y` and the observable column `x` are given.

This function wraps `run(R, g; kwargs...)`, constructing `R` and `g` from the examples in the two DataFrames.

---

```
run(x_data, x_train, y_train; kwargs...)
```

Regularized Unfolding of the target distribution, given the observations in the one-dimensional array `x_data`. The deconvolution is inferred from `x_train` and `y_train`.

This function wraps `run(R, g; kwargs...)`, constructing `R` and `g` from the examples in the three arrays.

---

```
run(R, g; kwargs...)
```

Perform RUN with the observed frequency distribution `g` (absolute counts!) and the detector response matrix `R`.

**Keyword arguments**

  * `n_df = size(R, 2)` is the effective number of degrees of freedom. The default `n_df` results in no regularization (there is one degree of freedom for each dimension in the result).
  * `K = 100` is the maximum number of iterations.
  * `epsilon = 1e-6` is the minimum difference in the loss function between iterations. RUN stops when the absolute loss difference drops below `epsilon`.
  * `inspect = nothing` is a function `(f_k::Array, k::Int, ldiff::Float64, tau::Float64) -> Any` optionally called in every iteration.


In [10]:
?ibu

search: [0m[1mi[22m[0m[1mb[22m[0m[1mu[22m f_[0m[1mi[22m[0m[1mb[22m[0m[1mu[22m [0m[1mI[22mO[0m[1mB[22m[0m[1mu[22mffer @[0m[1mi[22mn[0m[1mb[22mo[0m[1mu[22mnds P[0m[1mi[22mpe[0m[1mB[22m[0m[1mu[22mffer [0m[1mi[22ms_[0m[1mb[22msd



```
ibu(data, train, x, y[, bins_y; kwargs...])
```

Iterative Bayesian Unfolding of the target distribution in the DataFrame `data`. The deconvolution is inferred from the DataFrame `train`, where the target column `y` and the observable column `x` are given.

This function wraps `ibu(R, g; kwargs...)`, constructing `R` and `g` from the examples in the two DataFrames.

---

```
ibu(x_data, x_train, y_train[, bins_y; kwargs...])
```

Iterative Bayesian Unfolding of the target distribution, given the observations in the one-dimensional array `x_data`.

The deconvolution is inferred from `x_train` and `y_train`. Both of these arrays have to be discrete, i.e., they must contain indices instead of actual values. All expected label indices (for cases where `y_train` may not contain some of the indices) are optionally provided as `bins_y`.

This function wraps `ibu(R, g; kwargs...)`, constructing `R` and `g` from the examples in the three arrays.

---

```
ibu(R, g; kwargs...)
```

Iterative Bayesian Unfolding with the detector response matrix `R` and the observable density function `g`.

**Keyword arguments**

  * `f_0 = ones(m) ./ m` defines the prior, which is uniform by default.
  * `smoothing = Base.identity` is a function that optionally applies smoothing in between iterations. The operation is neither applied to the initial prior, nor to the final result. The function `inspect` is called before the smoothing is performed.
  * `K = 3` is the maximum number of iterations.
  * `epsilon = 0.0` is the minimum symmetric Chi Square distance between iterations. If the actual distance is below this threshold, convergence is assumed and the algorithm stops.
  * `inspect = nothing` is a function `(f_k::Array, k::Int, chi2s::Float64) -> Any` optionally called in every iteration.
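
The stopping criterion `epsilon` compares consecutive estimates with a symmetric chi-square distance. The sketch below uses one common definition of this distance. The name `chi2s_sketch` is hypothetical, and the exact formula in CherenkovDeconvolution.jl may differ in details such as normalization.

```julia
# One common definition of the symmetric chi-square distance between two histograms
function chi2s_sketch(a, b)
    total = 0.0
    for (a_i, b_i) in zip(a, b)
        if a_i + b_i > 0                   # skip bins that are empty in both histograms
            total += (a_i - b_i)^2 / (a_i + b_i)
        end
    end
    return 2 * total
end

# Distance between a uniform prior and the DSEA estimate printed earlier in this notebook
f_uniform = fill(1 / 3, 3)
f_printed = [0.3333333327844613, 0.3549289670574495, 0.3117377001580892]
d = chi2s_sketch(f_uniform, f_printed)
```

For these two vectors, the distance is roughly 0.0028, which is in line with the `chi2s` value in the DSEA log message above.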
