# Getting Started

For a quick start, we compare the deconvolution algorithms of CherenkovDeconvolution.jl on the Iris data set, estimating the probability density of the Iris plant types.

In [1]:
# load the example data
using MLDataUtils
X, y_labels, _ = load_iris()

# discretize the target quantity (for numerical values, we'd use LinearDiscretizer)
using Discretizers: encode, CategoricalDiscretizer
y = encode(CategoricalDiscretizer(y_labels), y_labels) # vector of target value indices

# have a look at the content of y
unique(y) # it's just indices

3-element Array{Int64,1}:
 1
 2
 3
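
To make the encoding step concrete, here is a minimal sketch of what a categorical encoder does: it maps each distinct label to an integer index. The helper `encode_labels` and the toy labels are hypothetical and written for Julia 1.x; the index order chosen by `CategoricalDiscretizer` may differ.

```julia
# Hypothetical stand-in for a categorical encoder (not the Discretizers.jl API):
# map each distinct label to an integer index in order of first appearance.
function encode_labels(labels::AbstractVector)
    index_of = Dict{eltype(labels),Int}()
    indices = Vector{Int}(undef, length(labels))
    for (n, label) in enumerate(labels)
        # get! returns the stored index or stores the next free one
        indices[n] = get!(index_of, label, length(index_of) + 1)
    end
    return indices, index_of
end

labels = ["setosa", "setosa", "versicolor", "virginica", "versicolor"]
y_example, mapping = encode_labels(labels)
```

The deconvolution methods only ever see such indices, which is why `unique(y)` contains `1`, `2`, and `3` rather than species names.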

In [2]:
# Split the data into training and observed data sets.
#
# CherenkovDeconvolution.jl follows the convention of ScikitLearn.jl
# (and others), which is size(X_train) == (n_examples, n_features).
# MLDataUtils assumes size(X_train) == (n_features, n_examples) by default,
# so we transpose X and are explicit about obsdim = 1.
#
srand(42) # make split reproducible (on Julia 0.7 and later: Random.seed!(42))
(X_train, y_train), (X_data, y_data) = splitobs(shuffleobs((X', y), obsdim = 1), obsdim = 1);

## Deconvolution with DSEA

The Dortmund Spectrum Estimation Algorithm (DSEA) reconstructs the target density from classifier predictions on the target quantity of individual examples. CherenkovDeconvolution.jl implements the improved version DSEA+, which extends the original DSEA by adaptive step sizes and a corrected reweighting of examples.
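
The reconstruction step at the heart of this idea can be sketched in a few lines: given a confidence matrix with one row per observed example and one column per target value, the density estimate is the column-wise average. This is a simplified illustration in Julia 1.x syntax, assuming that form of the estimate; it omits the retraining, reweighting, and step sizes of DSEA+. The helper `reconstruct_density` and the toy matrix are hypothetical.

```julia
# Hypothetical helper: average the per-example class confidences
# to obtain an estimate of the target density.
function reconstruct_density(proba::AbstractMatrix{<:Real})
    f = vec(sum(proba, dims = 1)) ./ size(proba, 1)
    return f ./ sum(f)  # guard against rows that do not sum exactly to one
end

# toy confidences of 4 examples for 3 target values
proba = [0.9 0.1 0.0;
         0.2 0.7 0.1;
         0.1 0.3 0.6;
         0.8 0.1 0.1]
f_sketch = reconstruct_density(proba)
```

In the full algorithm, an estimate of this kind is used to reweight the training examples before the classifier is trained again.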

In [3]:
using ScikitLearn, CherenkovDeconvolution

# deconvolve with a Naive Bayes classifier
@sk_import naive_bayes : GaussianNB
tp_function = Sklearn.train_and_predict_proba(GaussianNB()) # trains and applies the classifier in each iteration
                                                            # Sklearn is a sub-module of CherenkovDeconvolution.jl

f_dsea = dsea(X_data, X_train, y_train, tp_function) # returns a vector of target value probabilities

3-element Array{Float64,1}:
 0.333333
 0.354929
 0.311738

In [4]:
# compare the result to the true target distribution, which we are estimating
f_true = Util.fit_pdf(y_data) # f_dsea is almost equal to f_true!

3-element Array{Float64,1}:
 0.333333
 0.355556
 0.311111
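
The true distribution is simply the relative frequency of each target index. A minimal sketch of that computation, using a hypothetical helper `relative_frequencies` rather than `Util.fit_pdf` itself, could look like this:

```julia
# Hypothetical helper: relative frequency of each index 1..m in y.
function relative_frequencies(y::AbstractVector{<:Integer}, m::Integer = maximum(y))
    counts = zeros(Float64, m)
    for i in y
        counts[i] += 1
    end
    return counts ./ length(y)
end

f_example = relative_frequencies([1, 2, 2, 3, 3, 3])
```

The values of `f_true` above are consistent with 15, 16, and 14 examples per class among 45 observed examples.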

## Regularized Unfolding and Iterative Bayesian Unfolding

RUN fits the density distribution `f` to the convolution model `g = R * f`, using maximum likelihood. The regularization strength is configured with `n_df`, the effective number of degrees of freedom in the second-order local model of the solution.
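
To illustrate the convolution model `g = R * f`, the following sketch estimates a detector response matrix from paired index vectors and applies it to a target density. It is an illustration in Julia 1.x syntax, not the library's implementation; `response_matrix` and the toy data are hypothetical, and every target index is assumed to occur at least once.

```julia
# Hypothetical helper: estimate R[i, j] = P(x = i | y = j) from paired indices.
function response_matrix(x::AbstractVector{<:Integer}, y::AbstractVector{<:Integer})
    R = zeros(Float64, maximum(x), maximum(y))
    for (i, j) in zip(x, y)
        R[i, j] += 1
    end
    return R ./ sum(R, dims = 1)  # normalize each column to a conditional pdf
end

x_toy = [1, 1, 2, 2, 2, 3]  # observed (discrete) values
y_toy = [1, 1, 1, 2, 2, 2]  # target value indices
R = response_matrix(x_toy, y_toy)

f = [0.5, 0.5]  # a target density
g = R * f       # observed density implied by f under the model
```

Deconvolution is the inverse task: given `g` and `R`, find an `f` that explains the observation.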

IBU reconstructs the target density by iteratively applying Bayes' rule to the conditional probabilities contained in the detector response matrix.
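
The iteration can be sketched as follows. This is a bare-bones illustration in Julia 1.x syntax, assuming `R[i, j] = P(x = i | y = j)` and pdfs for `g` and the prior; it leaves out smoothing, the convergence check, and ratio fitting. The helper `ibu_sketch` and the toy numbers are hypothetical.

```julia
# Hypothetical sketch of iterative Bayesian unfolding (not the library's ibu).
function ibu_sketch(R::AbstractMatrix{<:Real}, g::AbstractVector{<:Real};
                    f_0 = fill(1 / size(R, 2), size(R, 2)), K::Integer = 3)
    f = collect(float.(f_0))
    for _ in 1:K
        joint = R .* f'                            # P(x = i, y = j) under the current estimate
        posterior = joint ./ sum(joint, dims = 2)  # Bayes' rule: P(y = j | x = i)
        f = posterior' * g                         # redistribute the observed density
    end
    return f
end

R = [0.8 0.1;
     0.2 0.9]
g = R * [0.3, 0.7]                 # observed density generated by a known target density
f_est = ibu_sketch(R, g; K = 100)  # approaches [0.3, 0.7] as K grows
```

With few iterations, the estimate stays close to the prior, which is why the number of iterations acts as a form of regularization.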

In [5]:
#
# RUN and IBU are only applicable with a single discrete observable dimension. In order to
# obtain a dimension that contains as much information as possible, we discretize the feature
# space with a decision tree, using its leaves as clusters. The cluster indices are the
# discrete values of the observed dimension. This concept relates to supervised clustering.
#
td = Sklearn.TreeDiscretizer(X_train, y_train, 6) # obtain (up to) 6 clusters
x_train = encode(td, X_train)
x_data  = encode(td, X_data)

# have a look at the content of x_train
unique(x_train) # it's the cluster indices

6-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6

In [6]:
# In contrast to DSEA, RUN and IBU do not need a classifier for deconvolution
f_run = CherenkovDeconvolution.run(x_data, x_train, y_train) # module qualification required

3-element Array{Float64,1}:
 0.320622
 0.342725
 0.336653

In [7]:
f_ibu = ibu(x_data, x_train, y_train)

3-element Array{Float64,1}:
 0.333333
 0.346387
 0.320279

## More Information

In [8]:
?dsea # You can find more information in the documentation

search: dsea GridSearch DenseArray deserialize DenseMatrix DenseVecOrMat

```
dsea(data, train, y, train_predict[, bins]; kwargs...)

dsea(X_data, X_train, y_train, train_predict[, bins]; kwargs...)
```

Deconvolve the observed data with *DSEA/DSEA+* trained on the given training set.

The first form of this function works on the two DataFrames `data` and `train`, where `y` specifies the target column to be deconvolved (this column has to be present in `train`). The second form works on vectors and matrices.

To facilitate classification, `y_train` (or `train[y]` in the first form) must contain label indices rather than actual values. All expected indices are optionally provided as `bins`. The function object `train_predict(X_data, X_train, y_train, w_train) -> Matrix` trains and applies a classifier, obtaining a confidence matrix. All of its arguments but `w_train`, which is updated in each iteration, are simply passed through from `dsea`.

**Keyword arguments**

  * `f_0 = ones(m) ./ m` defines the prior, which is uniform by default.
  * `fixweighting = true` sets whether or not the weight update fix is applied. This fix is proposed in my Master's thesis and in the corresponding paper.
  * `alpha = 1.0` is the step size taken in every iteration. This parameter can be either a constant value or a function with the signature `(k::Int, pk::AbstractVector{Float64}, f_prev::AbstractVector{Float64}) -> Float64`, where `f_prev` is the estimate of the previous iteration and `pk` is the direction that DSEA takes in the current iteration `k`.
  * `smoothing = Base.identity` is a function that optionally applies smoothing in between iterations.
  * `K = 1` is the maximum number of iterations.
  * `epsilon = 0.0` is the minimum symmetric Chi Square distance between iterations. If the actual distance is below this threshold, convergence is assumed and the algorithm stops.
  * `inspect = nothing` is a function `(f_k::Vector, k::Int, chi2s::Float64, alpha::Float64) -> Any` optionally called in every iteration.
  * `loggingstream = DevNull` is an optional `IO` stream to write log messages to.
  * `return_contributions = false` sets whether or not the contributions of individual examples in `X_data` are returned as a tuple together with the deconvolution result.
  * `features = setdiff(names(train), [y])` specifies which columns of `data` and `train` are used as features. It is only applicable to the first form of this function.
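
As an illustration of the `alpha` keyword, the sketch below defines a step size function with the documented `(k, pk, f_prev)` signature and applies it to a search direction. It assumes that the update has the form `f_prev + alpha * pk`; the names `decaying_alpha` and `take_step` are hypothetical and not part of CherenkovDeconvolution.jl.

```julia
# Hypothetical decaying step size with the documented (k, pk, f_prev) signature.
decaying_alpha(k::Int, pk::AbstractVector{Float64}, f_prev::AbstractVector{Float64}) = 1 / k

# Assumed form of the update: move from f_prev along pk by the chosen step size.
function take_step(alpha, k::Int, pk::Vector{Float64}, f_prev::Vector{Float64})
    step = alpha isa Function ? alpha(k, pk, f_prev) : alpha
    return f_prev .+ step .* pk
end

f_prev = [0.4, 0.4, 0.2]      # estimate of the previous iteration
f_proposed = [0.2, 0.5, 0.3]  # estimate proposed in the current iteration
pk = f_proposed .- f_prev     # search direction

f_full = take_step(1.0, 2, pk, f_prev)             # constant step size of 1.0 adopts the proposal
f_half = take_step(decaying_alpha, 2, pk, f_prev)  # step size 1/2 moves halfway
```

Under this assumption, a function of this kind would be passed to `dsea` as `alpha = decaying_alpha`.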


In [9]:
?CherenkovDeconvolution.run

```
run(data, train, x, y[, bins]; kwargs...)

run(x_data, x_train, y_train[, bins]; kwargs...)

run(R, g; kwargs...)
```

Deconvolve the observed data applying the *Regularized Unfolding* trained on the given training set.

The first form of this function works on the two DataFrames `data` and `train`, where `y` specifies the target column to be deconvolved (this column has to be present in `train`) and `x` specifies the observed column present in both DataFrames. The second form accordingly works on vectors and the third form makes use of a pre-defined detector response matrix `R` and an observed (discrete) frequency distribution `g` (absolute counts, not a pdf). In the first two forms, `R` and `g` are directly obtained from the data and the keyword arguments.

The vectors `x_data`, `x_train`, and `y_train` (or accordingly `data[x]`, `train[x]`, and `train[y]`) must contain label/observation indices rather than actual values. All expected indices in `y_train` are optionally provided as `bins`.

**Keyword arguments**

  * `n_df = size(R, 2)` is the effective number of degrees of freedom. The default `n_df` results in no regularization (there is one degree of freedom for each dimension in the result).
  * `K = 100` is the maximum number of iterations.
  * `epsilon = 1e-6` is the minimum difference in the loss function between iterations. RUN stops when the absolute loss difference drops below `epsilon`.
  * `inspect = nothing` is a function `(f_k::Vector, k::Int, ldiff::Float64, tau::Float64) -> Any` optionally called in every iteration.
  * `loggingstream = DevNull` is an optional `IO` stream to write log messages to.


In [10]:
?ibu

search: ibu IOBuffer @inbounds PipeBuffer Distributed is_bsd

```
ibu(data, train, x, y[, bins]; kwargs...)

ibu(x_data, x_train, y_train[, bins]; kwargs...)

ibu(R, g; kwargs...)
```

Deconvolve the observed data applying the *Iterative Bayesian Unfolding* trained on the given training set.

The first form of this function works on the two DataFrames `data` and `train`, where `y` specifies the target column to be deconvolved (this column has to be present in `train`) and `x` specifies the observed column present in both DataFrames. The second form accordingly works on vectors and the third form makes use of a pre-defined detector response matrix `R` and an observed (discrete) probability density `g`. In the first two forms, `R` and `g` are directly obtained from the data and the keyword arguments.

The vectors `x_data`, `x_train`, and `y_train` (or accordingly `data[x]`, `train[x]`, and `train[y]`) must contain label/observation indices rather than actual values. All expected indices in `y_train` are optionally provided as `bins`.

**Keyword arguments**

  * `f_0 = ones(m) ./ m` defines the prior, which is uniform by default.
  * `smoothing = Base.identity` is a function that optionally applies smoothing in between iterations. The operation is neither applied to the initial prior, nor to the final result. The function `inspect` is called before the smoothing is performed.
  * `K = 3` is the maximum number of iterations.
  * `epsilon = 0.0` is the minimum symmetric Chi Square distance between iterations. If the actual distance is below this threshold, convergence is assumed and the algorithm stops.
  * `fit_ratios = false` determines if ratios are fitted (i.e. `R` has to contain counts so that the ratio `f_est / f_train` is estimated) or if the probability density `f_est` is fitted directly.
  * `inspect = nothing` is a function `(f_k::Vector, k::Int, chi2s::Float64) -> Any` optionally called in every iteration.
  * `loggingstream = DevNull` is an optional `IO` stream to write log messages to.

**Caution:** Depending on the value of `fit_ratios`, the keyword argument `f_0` specifies a ratio prior or a pdf prior, but only in the third form. In the second form, `f_0` always specifies a pdf prior.
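
The `epsilon` criterion of `dsea` and `ibu` compares consecutive estimates by their symmetric Chi Square distance. The sketch below uses one common definition of that distance; the exact normalization in CherenkovDeconvolution.jl may differ, and both function names are hypothetical.

```julia
# One common definition of the symmetric Chi Square distance between two pdfs;
# bins in which both entries are zero contribute nothing.
function chi2s_sketch(a::AbstractVector{<:Real}, b::AbstractVector{<:Real})
    total = 0.0
    for (a_i, b_i) in zip(a, b)
        if a_i + b_i > 0
            total += (a_i - b_i)^2 / (a_i + b_i)
        end
    end
    return 2 * total
end

# Convergence is assumed when the distance drops below the threshold.
has_converged(f_prev, f_next, epsilon) = chi2s_sketch(f_prev, f_next) < epsilon

d_same = chi2s_sketch([0.5, 0.5], [0.5, 0.5])  # identical estimates have distance zero
d_far = chi2s_sketch([1.0, 0.0], [0.0, 1.0])   # disjoint estimates are far apart
```

With a strict comparison like this one, the default `epsilon = 0.0` never triggers an early stop, so all `K` iterations are performed.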
