# Getting Started

For a quick start, we compare several deconvolution algorithms on the Iris data set, estimating the probability density of the Iris plant types.

In [1]:
# load the example data
using MLDataUtils
X, y_labels, _ = load_iris()

# discretize the target quantity (for numerical values, we'd use LinearDiscretizer)
using Discretizers: encode, CategoricalDiscretizer
y = encode(CategoricalDiscretizer(y_labels), y_labels) # vector of target value indices

# have a look at the content of y
unique(y) # it's just indices

3-element Array{Int64,1}:
 1
 2
 3
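The comment in the cell above mentions `LinearDiscretizer` for numerical target values. A minimal sketch of that case could look as follows; the values and bin edges are hypothetical and not part of the Iris example.

```julia
using Discretizers: encode, LinearDiscretizer

# hypothetical numerical target values and bin edges (assumptions for illustration)
y_values  = [0.1, 0.7, 0.4, 0.9]
bin_edges = [0.0, 0.5, 1.0] # two bins, separated at 0.5

# map each value to the index of the bin it falls into
y_indices = encode(LinearDiscretizer(bin_edges), y_values)
```

As with the categorical case, the result is a vector of indices, which is the form the deconvolution methods expect.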

In [2]:
# Split the data into training and observed data sets.
#
# CherenkovDeconvolution.jl follows the convention of ScikitLearn.jl
# (and others), which is size(X_train) == (n_examples, n_features).
# MLDataUtils assumes size(X_train) == (n_features, n_examples) by default,
# so we transpose X and set obsdim = 1 explicitly.
#
using Random; Random.seed!(42) # make split reproducible
(X_train, y_train), (X_data, y_data) = splitobs(shuffleobs((X', y), obsdim = 1), obsdim = 1);
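To make the `(n_examples, n_features)` convention concrete, here is a sketch of a shuffled split written with Base Julia only. The toy data, the variable names, and the split fraction of 0.7 are assumptions for illustration; this is not how MLDataUtils implements `shuffleobs` and `splitobs`.

```julia
using Random

# toy feature matrix following the (n_examples, n_features) convention
X_toy = reshape(collect(1.0:40.0), 10, 4) # 10 examples, 4 features
y_toy = repeat(1:2, 5)                    # one label index per example

rng = MersenneTwister(42)
perm = shuffle(rng, 1:size(X_toy, 1))     # shuffle the row (example) indices
n_train = round(Int, 0.7 * length(perm))  # assumed split fraction

i_train, i_data = perm[1:n_train], perm[n_train+1:end]
X_train_toy, y_train_toy = X_toy[i_train, :], y_toy[i_train]
X_data_toy,  y_data_toy  = X_toy[i_data, :],  y_toy[i_data]

size(X_train_toy), size(X_data_toy) # rows are examples, columns are features
```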

## Deconvolution with DSEA

The Dortmund Spectrum Estimation Algorithm (DSEA) reconstructs the target density from classifier predictions on the target quantity of individual examples. CherenkovDeconvolution.jl implements the improved version DSEA+, which is extended by adaptive step sizes and a fixed reweighting of examples.
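The core idea can be sketched in a few lines: the density estimate is the average of the classifier's confidence values over all observed examples, and the training examples are then reweighted with that estimate before the classifier is retrained. The confidence matrix below is made up, and the density-ratio weighting is an assumed form of the reweighting step, not necessarily the exact rule used by the package.

```julia
using Statistics: mean

# hypothetical confidence matrix of a classifier: one row per observed example,
# one column per target value; each row sums to one
proba = [0.8 0.1 0.1;
         0.2 0.7 0.1;
         0.1 0.3 0.6;
         0.3 0.4 0.3]

# the density estimate is the average confidence over all observed examples
f_est = vec(mean(proba, dims = 1))

# weight each training example by the ratio between the estimated density and
# the training density of its label (assumed form of the reweighting)
y_train_toy = [1, 1, 2, 3, 3, 3]
f_train_toy = [count(y_train_toy .== i) for i in 1:3] ./ length(y_train_toy)
w_train_toy = f_est[y_train_toy] ./ f_train_toy[y_train_toy]
```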

In [3]:
using ScikitLearn, CherenkovDeconvolution

# deconvolve with a Naive Bayes classifier
@sk_import naive_bayes : GaussianNB
tp = DeconvUtil.train_and_predict_proba(GaussianNB()) # will train and apply the classifier in each iteration

f_dsea = dsea(X_data, X_train, y_train, tp) # returns a vector of target value probabilities

┌ Info: DSEA iteration 1/1 uses alpha = 1.0 (chi2s = 0.0028011676660929232)
└ @ CherenkovDeconvolution /home/bunse/.julia/dev/CherenkovDeconvolution/src/methods/dsea.jl:150


3-element Array{Float64,1}:
 0.3333333327844613 
 0.3549289670574494 
 0.31173770015808927

In [4]:
# compare the result to the true target distribution, which we are estimating
f_true = DeconvUtil.fit_pdf(y_data) # f_dsea is almost equal to f_true!

3-element Array{Float64,1}:
 0.3333333333333333 
 0.35555555555555557
 0.3111111111111111 

## Classical Deconvolution Algorithms

Regularized Unfolding (RUN) fits the density `f` to the convolution model `g = R * f` by maximum likelihood, where `g` is the observed density and `R` is the detector response matrix. The regularization strength is configured with `n_df`, the effective number of degrees of freedom in the second-order local model of the solution.
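The convolution model `g = R * f` can be illustrated with a small sketch in plain Julia. The index vectors and the density `f` are hypothetical, and the response matrix is estimated here by simple counting, which may differ from what the package does internally.

```julia
# hypothetical cluster indices x and label indices y of a training set
x_toy = [1, 1, 2, 2, 2, 3, 3, 1, 2, 3]
y_toy = [1, 1, 1, 2, 2, 2, 2, 1, 2, 2]

# count how often each cluster index i co-occurs with each label index j
R = zeros(3, 2)
for (i, j) in zip(x_toy, y_toy)
    R[i, j] += 1
end
R ./= sum(R, dims = 1) # normalize columns, so that R[i, j] estimates P(x = i | y = j)

f = [0.25, 0.75] # an assumed target density
g = R * f        # the observed density that the model predicts for f
```

Deconvolution is the inverse task: given `g` and `R`, find the `f` that explains the observation.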

The Iterative Bayesian Unfolding (IBU) reconstructs the target density by iteratively applying Bayes' rule to the conditional probabilities contained in the detector response matrix.
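A minimal sketch of this iteration, assuming a column-normalized response matrix and a uniform prior, might look like this. The function names `ibu_step` and `ibu_sketch` are hypothetical, and the sketch omits the step sizes and smoothing that the package offers.

```julia
# assumed detector response matrix with R[i, j] = P(x = i | y = j)
R = [0.75 0.0;
     0.25 0.5;
     0.0  0.5]
g = [0.3, 0.4, 0.3] # assumed observed density, generated from f = [0.4, 0.6]

# one application of Bayes' rule under the current prior f
function ibu_step(R, g, f)
    joint = R .* f'                           # joint[i, j] = P(x = i | y = j) * f[j]
    posterior = joint ./ sum(joint, dims = 2) # posterior[i, j] = P(y = j | x = i)
    return posterior' * g                     # average the posterior over the observation
end

# repeat the update, using each estimate as the prior of the next iteration
function ibu_sketch(R, g; K = 3)
    f = fill(1 / size(R, 2), size(R, 2)) # uniform prior
    for _ in 1:K
        f = ibu_step(R, g, f)
    end
    return f
end

f_ibu_toy = ibu_sketch(R, g)
```

Each iteration keeps the estimate normalized and, in this toy setting, moves it from the uniform prior toward the density that generated `g`.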

The SVD-based method computes the singular value decomposition of the detector response matrix `R`, fitting `f` according to the method of least squares.
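The least-squares idea behind this method can be sketched with the standard library. The matrix and the observed density are assumptions for illustration, and the sketch applies no regularization, unlike the package's method. Note that `svd` below is `LinearAlgebra.svd`, not `CherenkovDeconvolution.svd`.

```julia
using LinearAlgebra

# assumed detector response matrix and observed density
R = [0.75 0.0;
     0.25 0.5;
     0.0  0.5]
g = [0.3, 0.4, 0.3]

F = svd(R)                       # R == F.U * Diagonal(F.S) * F.Vt
f_ls = F.V * ((F.U' * g) ./ F.S) # least-squares solution via the pseudo-inverse
```

Small singular values amplify noise in `g` when they are inverted, which is why practical SVD-based deconvolution limits their influence.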

In [5]:
#
# The classical algorithms are only applicable with a single discrete observable dimension.
# In order to obtain a dimension that contains as much information as possible, we discretize
# the feature space with a decision tree, using its leaves as clusters. The cluster indices
# are the discrete values of the observed dimension.
#
td = TreeDiscretizer(X_train, y_train, 6) # obtain (up to) 6 clusters
x_train = encode(td, X_train)
x_data  = encode(td, X_data)

# have a look at the content of x_train
unique(x_train) # these are the cluster indices

6-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6

In [6]:
# Unlike DSEA, RUN and IBU do not need a classifier for deconvolution
f_run = CherenkovDeconvolution.run(x_data, x_train, y_train)

└ @ CherenkovDeconvolution /home/bunse/.julia/dev/CherenkovDeconvolution/src/methods/run.jl:103
└ @ CherenkovDeconvolution /home/bunse/.julia/dev/CherenkovDeconvolution/src/methods/run.jl:133


3-element Array{Float64,1}:
 0.32062181199902845
 0.34272528540199176
 0.33665290259897984

In [7]:
f_p_run = CherenkovDeconvolution.p_run(x_data, x_train, y_train)

└ @ CherenkovDeconvolution /home/bunse/.julia/dev/CherenkovDeconvolution/src/methods/p_run.jl:105
└ @ CherenkovDeconvolution /home/bunse/.julia/dev/CherenkovDeconvolution/src/methods/p_run.jl:129


3-element Array{Float64,1}:
 0.32062185296050366
 0.34272525419868627
 0.3366528928408101 

In [8]:
f_ibu = CherenkovDeconvolution.ibu(x_data, x_train, y_train)

3-element Array{Float64,1}:
 0.3333333333333333
 0.3463872738499534
 0.3202793928167133

In [9]:
f_svd = CherenkovDeconvolution.svd(x_data, x_train, y_train)

3-element Array{Float64,1}:
 0.3206218119990271 
 0.34272528540199193
 0.33665290259898095
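To compare the methods at a glance, the following sketch computes the Euclidean distance between each estimate and the true density. All numbers are copied from the outputs above; the choice of the Euclidean distance is an assumption made for illustration.

```julia
using LinearAlgebra: norm

# values copied from the cell outputs of this notebook
f_true = [0.3333333333333333, 0.35555555555555557, 0.3111111111111111]
estimates = [
    "dsea"  => [0.3333333327844613, 0.3549289670574494, 0.31173770015808927],
    "run"   => [0.32062181199902845, 0.34272528540199176, 0.33665290259897984],
    "p_run" => [0.32062185296050366, 0.34272525419868627, 0.3366528928408101],
    "ibu"   => [0.3333333333333333, 0.3463872738499534, 0.3202793928167133],
    "svd"   => [0.3206218119990271, 0.34272528540199193, 0.33665290259898095]
]

# Euclidean distance of each estimate to the true density
distances = [name => norm(f - f_true) for (name, f) in estimates]
for (name, d) in distances
    println(rpad(name, 6), round(d, digits = 4))
end
```

On this particular split, DSEA is closest to the true density, followed by IBU. A single small data set does not justify a general ranking of the methods.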

## More Information

In [10]:
?dsea # You can find more information in the documentation

search: dsea f_dsea GridSearch DenseArray DenseMatrix DenseVecOrMat



```
dsea(data, train, y, train_predict[, bins_y, features]; kwargs...)

dsea(X_data, X_train, y_train, train_predict[, bins_y]; kwargs...)
```

Deconvolve the observed data with *DSEA/DSEA+* trained on the given training set.

The data is provided as feature matrices `X_data`, `X_train` and the label vector `y_train` (or accordingly `data[features]`, `train[features]`, and `train[y]`). Here, `y_train` must contain label indices rather than actual values. All expected indices are optionally provided as `bins_y`.

The function object `train_predict(X_data, X_train, y_train, w_train) -> Matrix` trains and applies a classifier to obtain a confidence matrix.

**Keyword arguments**

  * `f_0 = ones(m) ./ m` defines the prior, which is uniform by default.
  * `fixweighting = true` sets whether or not the weight update fix is applied. This fix is proposed in my Master's thesis and in the corresponding paper.
  * `alpha = DEFAULT_STEPSIZE` is the step size taken in every iteration.
  * `smoothing = Base.identity` is a function that optionally applies smoothing in between iterations.
  * `K = 1` is the maximum number of iterations.
  * `epsilon = 0.0` is the minimum symmetric Chi Square distance between iterations. If the actual distance is below this threshold, convergence is assumed and the algorithm stops.
  * `inspect = nothing` is a function `(f_k::Vector, k::Int, chi2s::Float64, alphak::Float64) -> Any` optionally called in every iteration.
  * `return_contributions = false` sets whether or not the contributions of individual examples in `X_data` are returned as a tuple together with the deconvolution result.
  * `features = setdiff(names(train), [y])` specifies which columns in `data` and `train` are used as features. It is only applicable to the first form of this function.
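As a sketch of how these keyword arguments combine, the following helper runs DSEA for several iterations and records the estimate of each one through `inspect`. The helper `dsea_with_trace` is hypothetical and not part of the package; it only uses the keyword arguments documented above.

```julia
using CherenkovDeconvolution

# run DSEA for up to K iterations and record the estimate of every iteration;
# `dsea_with_trace` is a hypothetical helper, not part of the package
function dsea_with_trace(X_data, X_train, y_train, train_predict; K = 5, epsilon = 1e-6)
    trace = Vector{Float64}[]
    # the callback follows the documented signature (f_k, k, chi2s, alphak)
    inspect = (f_k, k, chi2s, alphak) -> push!(trace, copy(f_k))
    f_est = dsea(X_data, X_train, y_train, train_predict;
                 K = K, epsilon = epsilon, inspect = inspect)
    return f_est, trace
end
```

With the variables of the cells above, a call would read `f_est, trace = dsea_with_trace(X_data, X_train, y_train, tp)`.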


In [11]:
?CherenkovDeconvolution.run

```
run(data, train, x, y[, bins_y]; kwargs...)

run(x_data, x_train, y_train[, bins_y]; kwargs...)

run(R, g; kwargs...)
```

Deconvolve the observed data applying the *Regularized Unfolding* trained on the given training set.

The vectors `x_data`, `x_train`, and `y_train` (or accordingly `data[x]`, `train[x]`, and `train[y]`) must contain label/observation indices rather than actual values. All expected indices in `y_train` are optionally provided as `bins_y`. Alternatively, the detector response matrix `R` and the observed density vector `g` can be given directly.

**Keyword arguments**

  * `n_df = size(R, 2)` is the effective number of degrees of freedom. The default `n_df` results in no regularization (there is one degree of freedom for each dimension in the result).
  * `K = 100` is the maximum number of iterations.
  * `epsilon = 1e-6` is the minimum difference in the loss function between iterations. RUN stops when the absolute loss difference drops below `epsilon`.
  * `acceptance_correction = nothing` is a tuple of functions `(ac(d), inv_ac(d))` representing the acceptance correction `ac` and its inverse operation `inv_ac` for a data set `d`.
  * `ac_regularisation = true` decides whether the acceptance correction is taken into account for regularization. Requires `acceptance_correction != nothing`.
  * `log_constant = 1/18394` is a selectable constant used in log regularization to prevent the undefined case `log(0)`.
  * `inspect = nothing` is a function `(f_k::Vector, k::Int, ldiff::Float64, tau::Float64) -> Any` optionally called in every iteration.
  * `loggingstream = devnull` is an optional `IO` stream to write log messages to.
  * `fit_ratios = false` determines if ratios are fitted (i.e. `R` has to contain counts so that the ratio `f_est / f_train` is estimated) or if the probability density `f_est` is fitted directly.

**Caution:** According to the value of `fit_ratios`, the keyword argument `f_0` specifies a ratio prior or a pdf prior, but only in the third form. In the other forms, `f_0` always specifies a pdf prior.


In [12]:
?CherenkovDeconvolution.ibu

```
ibu(data, train, x, y[, bins_y]; kwargs...)

ibu(x_data, x_train, y_train[, bins_y]; kwargs...)

ibu(R, g; kwargs...)
```

Deconvolve the observed data applying the *Iterative Bayesian Unfolding* trained on the given training set.

The vectors `x_data`, `x_train`, and `y_train` (or accordingly `data[x]`, `train[x]`, and `train[y]`) must contain label/observation indices rather than actual values. All expected indices in `y_train` are optionally provided as `bins_y`. Alternatively, the detector response matrix `R` and the observed density vector `g` can be given directly.

**Keyword arguments**

  * `f_0 = ones(m) ./ m` defines the prior, which is uniform by default.
  * `smoothing = Base.identity` is a function that optionally applies smoothing in between iterations. The operation is neither applied to the initial prior, nor to the final result. The function `inspect` is called before the smoothing is performed.
  * `K = 3` is the maximum number of iterations.
  * `epsilon = 0.0` is the minimum symmetric Chi Square distance between iterations. If the actual distance is below this threshold, convergence is assumed and the algorithm stops.
  * `alpha = DEFAULT_STEPSIZE` is the step size taken in every iteration.
  * `fit_ratios = false` determines if ratios are fitted (i.e. `R` has to contain counts so that the ratio `f_est / f_train` is estimated) or if the probability density `f_est` is fitted directly.
  * `inspect = nothing` is a function `(f_k::Vector, k::Int, chi2s::Float64, alphak::Float64) -> Any` optionally called in every iteration.
  * `loggingstream = devnull` is an optional `IO` stream to write log messages to.

**Caution:** According to the value of `fit_ratios`, the keyword argument `f_0` specifies a ratio prior or a pdf prior, but only in the third form. In the other forms, `f_0` always specifies a pdf prior.


In [13]:
?CherenkovDeconvolution.svd

```
svd(data, train, x, y[, bins_y]; kwargs...)

svd(x_data, x_train, y_train[, bins_y]; kwargs...)

svd(R, g; kwargs...)
```

Deconvolve the observed data applying the *SVD-based deconvolution algorithm* trained on the given training set.

The vectors `x_data`, `x_train`, and `y_train` (or accordingly `data[x]`, `train[x]`, and `train[y]`) must contain label/observation indices rather than actual values. All expected indices in `y_train` are optionally provided as `bins_y`. Alternatively, the detector response matrix `R` and the observed density vector `g` can be given directly.

**Keyword arguments**

  * `effective_rank = -1` is a regularization parameter which defines the effective rank of the solution. This rank must be <= dim(f). Any value smaller than one turns off regularization.
  * `N = length(x_data)` is the number of observations. In the third form of the method, `N=sum(g)` is the default, assuming that `g` contains absolute counts, not probabilities.
  * `B = DeconvUtil.cov_Poisson(g, N)` is the variance-covariance matrix of the observed bins. The default value represents the assumption that each observed bin is Poisson-distributed with rate `g[i]*N`.
  * `epsilon_C = 1e-3` is a small constant that is added to each diagonal entry of the regularization matrix `C`. Without this constant, inverting `C` would not be possible.
  * `fit_ratios = false` determines if ratios are fitted (i.e. `R` has to contain counts so that the ratio `f_est / f_train` is estimated) or if the probability density `f_est` is fitted directly.

**Caution:** According to the value of `fit_ratios`, the keyword argument `f_0` specifies a ratio prior or a pdf prior, but only in the third form. In the other forms, `f_0` always specifies a pdf prior.
