# Deconvolution with DSEA

The Dortmund Spectrum Estimation Algorithm (DSEA) reconstructs the target distribution from classifier predictions on the target quantity of individual examples. CherenkovDeconvolution.jl implements the improved version DSEA+, which extends DSEA with adaptive step sizes and a corrected reweighting of examples.
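To illustrate the reconstruction step, here is a minimal sketch in plain Julia (1.x syntax). It assumes that the classifier returns one row of class probabilities per observed example; the matrix `proba` is made up for illustration and is not produced by the package.

```julia
# hypothetical classifier output: one row per observed example, one column per class
proba = [0.8 0.1 0.1;
         0.2 0.7 0.1;
         0.1 0.3 0.6;
         0.6 0.3 0.1]

# the estimate of the target distribution is the average predicted probability per class
f_est = vec(sum(proba, dims = 1)) ./ size(proba, 1)
```

Because every row sums to one, the resulting `f_est` sums to one as well. The actual algorithm repeats this step, reweighting the training examples with the latest estimate before each retraining.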

For a quick start, we deconvolve the distribution of Iris plant types in the famous IRIS data set.

In [1]:
# load the example data
using MLDataUtils
X, y_labels, _ = load_iris()

# discretize the target quantity (for numerical values, we'd use LinearDiscretizer)
using Discretizers: encode, CategoricalDiscretizer
y = encode(CategoricalDiscretizer(y_labels), y_labels) # vector of target value indices

# have a look at the content of y
unique(y) # it's just indices

3-element Array{Int64,1}:
 1
 2
 3
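The encoding step can be pictured with a few lines of plain Julia. This is only a sketch of the idea, using made-up labels; the order in which `CategoricalDiscretizer` assigns indices is an assumption here and may differ.

```julia
# made-up labels standing in for y_labels
labels = ["setosa", "versicolor", "setosa", "virginica"]

# assign an index to each distinct label, in order of first appearance (assumed order)
index_of = Dict(label => i for (i, label) in enumerate(unique(labels)))

# replace each label by its index
indices = [index_of[label] for label in labels]
```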

In [2]:
# Split the data into training and observed data sets.
#
# CherenkovDeconvolution.jl follows the convention of ScikitLearn.jl
# (and others), which is size(X_train) == (n_examples, n_features).
# MLDataUtils assumes the transposed layout by default, namely
# size(X_train) == (n_features, n_examples). Thus, we transpose X
# and are explicit about obsdim = 1.
#
(X_train, y_train), (X_data, y_data) = splitobs(shuffleobs((X', y), obsdim = 1), obsdim = 1);
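The following plain-Julia sketch shows what the two layouts look like and what a split along the first dimension does. It uses toy data and skips the shuffling; the 70/30 split ratio is an assumption that matches the 45 observed examples in the outputs below.

```julia
# toy data in the features-by-examples layout: 2 features, 6 examples
X_toy = reshape(collect(1.0:12.0), 2, 6)
y_toy = [1, 1, 2, 2, 3, 3]

# transpose to the examples-by-features layout: 6 examples, 2 features
X_rows = permutedims(X_toy)

# split along the first dimension (the examples), without shuffling
n_train = floor(Int, 0.7 * size(X_rows, 1))
X_tr, y_tr = X_rows[1:n_train, :], y_toy[1:n_train]
X_obs, y_obs = X_rows[n_train+1:end, :], y_toy[n_train+1:end]
```

Each row of `X_tr` and `X_obs` is one example, so the rows stay aligned with the entries of `y_tr` and `y_obs`.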

In [3]:
#
# Now let's estimate the target distribution!
#
using ScikitLearn, CherenkovDeconvolution

# deconvolve with a Naive Bayes classifier
@sk_import naive_bayes : GaussianNB
tp_function = Sklearn.train_and_predict_proba(GaussianNB())

f_est = dsea(X_data, X_train, y_train, tp_function) # returns a vector of target value probabilities

[1m[36mINFO: [39m[22m[36mUtilities of ScikitLearn.jl are available in CherenkovDeconvolution.Sklearn
[39m

3-element Array{Float64,1}:
 0.355556
 0.357834
 0.28661 

In [4]:
#
# Compare the result to the true target distribution, which we are estimating
#
f_true = Util.fit_pdf(y_data) # f_est is close to f_true (every component differs by less than 0.05)

3-element Array{Float64,1}:
 0.355556
 0.311111
 0.333333
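To put a number on the agreement, one can compute a distance between the two vectors. The sketch below copies the values printed above and uses a symmetric Chi Square distance written out by hand; the factor of 2 is an assumption about the convention and may not match the package's own utility.

```julia
# values copied from the outputs above
est = [0.355556, 0.357834, 0.28661]
truth = [0.355556, 0.311111, 0.333333]

# largest deviation in any single component
max_abs_diff = maximum(abs.(est .- truth))

# symmetric Chi Square distance (hand-written; the factor 2 is an assumed convention)
chi2s_dist = 2 * sum((est .- truth) .^ 2 ./ (est .+ truth))
```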

In [6]:
?dsea # You can find more information in the documentation

search: [1md[22m[1ms[22m[1me[22m[1ma[22m Gri[1md[22m[1mS[22m[1me[22m[1ma[22mrch [1mD[22men[1ms[22m[1me[22m[1mA[22mrray [1md[22me[1ms[22m[1me[22mri[1ma[22mlize [1mD[22men[1ms[22m[1me[22mM[1ma[22mtrix [1mD[22men[1ms[22m[1me[22mVecOrM[1ma[22mt



```
dsea(data, train, y, train_and_predict_proba;
     features = setdiff(names(train), [y]),
     kwargs...)
```

Deconvolve the `y` distribution in the DataFrame `data`, as learned from the DataFrame `train`. This function wraps `dsea(::Matrix, ::Matrix, ::Array, ::Function)`.

The additional keyword argument `features` specifies the columns in `data` and `train` that are used as features.

```
dsea(X_data, X_train, y_train, train_and_predict_proba; kwargs...)
```

Deconvolve the target distribution of `X_data`, as learned from `X_train` and `y_train`. The function `train_and_predict_proba` trains and applies a classifier. It has the signature `(X_data, X_train, y_train, w_train) -> Any` where all arguments but `w_train`, which is updated in each iteration, are simply passed through. To facilitate classification, `y_train` has to be discrete, i.e., it has to have a limited number of unique values that are used as labels for the classifier.
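As an illustration of the documented signature, here is a sketch of a hand-written `train_and_predict_proba` function: a weighted Gaussian naive Bayes in plain Julia (1.x syntax). The function name is hypothetical, and the return format (one row per example in `X_data`, one column per class, columns ordered by sorted label) is an assumption that the signature above does not spell out.

```julia
# hypothetical stand-in for a real classifier: a weighted Gaussian naive Bayes
function my_train_and_predict_proba(X_data, X_train, y_train, w_train)
    classes = sort(unique(y_train))
    n_data, n_features = size(X_data)
    log_post = zeros(n_data, length(classes))
    for (c, label) in enumerate(classes)
        rows = findall(y_train .== label)
        w = w_train[rows] ./ sum(w_train[rows])                  # weights normalized within the class
        log_post[:, c] .= log(sum(w_train[rows]) / sum(w_train)) # weighted class prior
        for j in 1:n_features
            x = X_train[rows, j]
            mu = sum(w .* x)                                     # weighted mean
            sigma2 = sum(w .* (x .- mu) .^ 2) + 1e-9             # weighted variance, kept above zero
            log_post[:, c] .-= 0.5 .* (log(2pi * sigma2) .+ (X_data[:, j] .- mu) .^ 2 ./ sigma2)
        end
    end
    log_post .-= maximum(log_post, dims = 2)  # stabilize before exponentiating
    proba = exp.(log_post)
    return proba ./ sum(proba, dims = 2)      # each row sums to one
end

# toy usage with two well-separated classes and uniform weights
X_tr_toy = [0.0 0.1; 0.2 -0.1; 5.0 5.1; 5.2 4.9]
y_tr_toy = [1, 1, 2, 2]
X_obs_toy = [0.1 0.0; 5.1 5.0]
proba_toy = my_train_and_predict_proba(X_obs_toy, X_tr_toy, y_tr_toy, ones(4))
```

The weights `w_train` enter both the class priors and the per-class moments, which is how a reweighting of the training examples can change the predictions from one iteration to the next.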

# Keyword arguments

  * `f_0 = ones(m) ./ m` defines the prior, which is uniform by default
  * `fixweighting = false` sets whether or not the weight update fix is applied. This fix is proposed in my Master's thesis and in the corresponding paper.
  * `alpha = 1.0` is the step size taken in every iteration. This parameter can be either a constant value or a function with the signature `(k::Int, pk::AbstractArray{Float64,1}, f_prev::AbstractArray{Float64,1}) -> Float`, where `f_prev` is the estimate of the previous iteration and `pk` is the direction that DSEA takes in the current iteration `k`.
  * `smoothing = Base.identity` is a function that optionally applies smoothing in between iterations
  * `K = 1` is the maximum number of iterations.
  * `epsilon = 0.0` is the minimum symmetric Chi Square distance between iterations. If the actual distance is below this threshold, convergence is assumed and the algorithm stops.
  * `inspect = nothing` is a function `(k::Int, alpha::Float64, chi2s::Float64, spectrum::Array) -> Any` optionally called in every iteration.
  * `loggingstream = DevNull` is an optional `IO` stream to write log messages to.
  * `return_contributions = false` sets whether or not the contributions of individual examples in `X_data` are returned as a tuple together with the deconvolution result.
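
The function-valued keyword arguments can be sketched as follows. Both functions follow the signatures listed above, but their names and bodies are hypothetical: a step size that decays as `1/k` and an `inspect` callback that records the progress. The commented call assumes the variables from the quick start and has not been run here.

```julia
# hypothetical step size function: decays as 1/k, ignoring the direction and the previous estimate
decaying_alpha(k::Int, pk::AbstractArray{Float64,1}, f_prev::AbstractArray{Float64,1}) = 1.0 / k

# hypothetical inspect callback: records iteration number, step size, and distance
history = Tuple{Int,Float64,Float64}[]
record(k::Int, alpha::Float64, chi2s::Float64, spectrum::Array) = push!(history, (k, alpha, chi2s))

# assumed usage with the quick start variables:
# f_est = dsea(X_data, X_train, y_train, tp_function;
#              K = 10, epsilon = 1e-6, alpha = decaying_alpha, inspect = record)
```

With `K = 1`, the default, only a single iteration is performed, so a step size schedule or a convergence threshold only takes effect once `K` is increased.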
