# Getting Started

For a quick start, we compare several deconvolution algorithms on the Iris data set, estimating the probability density of the Iris plant types.

In [None]:
# load the example data
using MLDataUtils
X, y_labels, _ = load_iris()

# discretize the target quantity (for numerical values, we'd use LinearDiscretizer)
using Discretizers: encode, CategoricalDiscretizer
y = encode(CategoricalDiscretizer(y_labels), y_labels) # vector of target value indices

# have a look at the content of y
unique(y) # these are just indices
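
To make the encoding step concrete, here is a minimal sketch in plain Julia of what mapping categorical labels to indices amounts to. The helper `encode_labels` and the toy labels are hypothetical and only for illustration; Discretizers.jl may assign the indices in a different order.

```julia
# Hypothetical helper (not part of Discretizers.jl): map each distinct label
# to an integer index, numbered in order of first appearance.
function encode_labels(labels::AbstractVector)
    index_of = Dict{eltype(labels),Int}()
    indices = Vector{Int}(undef, length(labels))
    for (i, label) in enumerate(labels)
        # reuse the index of a known label, or assign the next free index
        indices[i] = get!(index_of, label, length(index_of) + 1)
    end
    return indices, index_of
end

toy_labels = ["setosa", "virginica", "setosa", "versicolor"]
toy_indices, toy_mapping = encode_labels(toy_labels)
```

The deconvolution methods below only ever see such indices, never the original label names.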

In [None]:
# Split the data into training and observed data sets.
#
# CherenkovDeconvolution.jl follows the convention of ScikitLearn.jl
# (and others), which is size(X_train) == (n_examples, n_features).
# MLDataUtils instead assumes size(X_train) == (n_features, n_examples)
# by default, so we transpose X and pass obsdim = 1 explicitly.
#
using Random; Random.seed!(42) # make split reproducible
(X_train, y_train), (X_data, y_data) = splitobs(shuffleobs((X', y), obsdim = 1), obsdim = 1);

## Deconvolution with DSEA

The Dortmund Spectrum Estimation Algorithm (DSEA) reconstructs the target density from classifier predictions on the target quantity of individual examples. CherenkovDeconvolution.jl implements the improved version DSEA+, which is extended by adaptive step sizes and a fixed reweighting of examples.
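
The basic idea can be sketched in a few lines of plain Julia. This is a simplified illustration of the original DSEA loop, not the package's implementation: it omits the adaptive step sizes of DSEA+, and `toy_classifier`, `dsea_sketch`, and the toy data are hypothetical names introduced here. The classifier is assumed to be a weighted histogram over a single discrete feature.

```julia
# Toy classifier (assumption): a weighted histogram over one discrete feature.
# Returns P[n, j], the estimated probability of class j for x_data[n].
function toy_classifier(x_train, y_train, w_train, x_data, n_x, n_y)
    joint = fill(1e-9, n_x, n_y)  # small constant avoids division by zero
    for (x, y, w) in zip(x_train, y_train, w_train)
        joint[x, y] += w
    end
    posterior = joint ./ sum(joint, dims = 2)  # normalize each row over the classes
    return posterior[x_data, :]
end

function dsea_sketch(x_data, x_train, y_train; n_iterations = 20)
    n_x, n_y = maximum(x_train), maximum(y_train)
    f_train = [count(==(j), y_train) for j in 1:n_y] ./ length(y_train)
    f = fill(1 / n_y, n_y)  # uniform initial estimate
    for _ in 1:n_iterations
        # reweight the training examples so that they follow the current estimate
        w_train = f[y_train] ./ f_train[y_train]
        P = toy_classifier(x_train, y_train, w_train, x_data, n_x, n_y)
        # the new estimate is the average prediction over all observed examples
        f = vec(sum(P, dims = 1)) ./ length(x_data)
    end
    return f
end

toy_x_train = [1, 1, 2, 2, 2, 3]
toy_y_train = [1, 1, 1, 2, 2, 2]
toy_x_data = [1, 1, 1, 1, 2, 2, 3]
toy_f_dsea = dsea_sketch(toy_x_data, toy_x_train, toy_y_train)
```

Any classifier that accepts example weights and returns class probabilities could take the place of `toy_classifier` in this loop.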

In [None]:
using ScikitLearn, CherenkovDeconvolution
@sk_import naive_bayes : GaussianNB

# deconvolve with a Naive Bayes classifier
dsea = DSEA(GaussianNB()) # instantiate the deconvolution method
f_dsea = deconvolve(dsea, X_data, X_train, y_train) # returns a vector of target value probabilities

In [None]:
# compare the result to the true target distribution, which we are estimating
f_true = DeconvUtil.fit_pdf(y_data) # f_dsea is almost equal to f_true!
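
To quantify "almost equal", one option is a distance between the two densities. The sketch below uses plain Julia and toy values only; `index_pdf` is a hypothetical stand-in for a histogram-based density estimate, and the Hellinger distance is just one of several possible choices.

```julia
# Hypothetical stand-in for a histogram-based density estimate:
# count how often each index occurs and normalize the counts to sum to 1.
function index_pdf(indices::AbstractVector{<:Integer}, n_bins::Integer = maximum(indices))
    counts = zeros(Float64, n_bins)
    for i in indices
        counts[i] += 1
    end
    return counts ./ sum(counts)
end

# Hellinger distance between two discrete densities (0 = identical, 1 = disjoint)
hellinger(p, q) = sqrt(sum((sqrt.(p) .- sqrt.(q)) .^ 2) / 2)

toy_true = index_pdf([1, 1, 2, 3, 3, 3])  # density of the toy indices
toy_estimate = [0.3, 0.2, 0.5]            # a made-up estimate
toy_distance = hellinger(toy_estimate, toy_true)
```

A smaller distance means a better reconstruction; the same function could be applied to `f_dsea` and `f_true` from the cells above.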

## Classical Deconvolution Algorithms

The Regularized Unfolding (RUN) fits the density distribution `f` to the convolution model `g = R * f`, using maximum likelihood. The regularization strength is configured with `n_df`, the effective number of degrees of freedom in the second-order local model of the solution.
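
The convolution model `g = R * f` itself can be illustrated without any package. In this sketch, `response_matrix` is a hypothetical helper that estimates `R[i, j] = P(x = i | y = j)` from paired training indices; it assumes that every target index occurs at least once. The maximum likelihood fit and the regularization of RUN are not shown.

```julia
# Hypothetical helper: estimate R[i, j] = P(x = i | y = j) from paired indices.
# Assumes every target index in 1:maximum(y) occurs at least once.
function response_matrix(x::AbstractVector{<:Integer}, y::AbstractVector{<:Integer})
    R = zeros(Float64, maximum(x), maximum(y))
    for (xi, yi) in zip(x, y)
        R[xi, yi] += 1
    end
    return R ./ sum(R, dims = 1)  # normalize each column to sum to 1
end

toy_x = [1, 1, 2, 2, 2, 3]  # observed (cluster) indices
toy_y = [1, 1, 1, 2, 2, 2]  # target indices
toy_R = response_matrix(toy_x, toy_y)
toy_f = [0.5, 0.5]          # a target density
toy_g = toy_R * toy_f       # observable density under the model g = R * f
```

Deconvolution goes the other way: given `g` and `R`, it estimates `f`.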

The Iterative Bayesian Unfolding (IBU) reconstructs the target density by iteratively applying Bayes' rule to the conditional probabilities contained in the detector response matrix.
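
A minimal sketch of this iteration in plain Julia, assuming `R[i, j] = P(x = i | y = j)` and a uniform prior. The function `ibu_sketch` and the toy values are hypothetical; the package's `IBU` may offer further options that are not reproduced here.

```julia
function ibu_sketch(R::AbstractMatrix, g::AbstractVector; n_iterations::Integer = 100)
    f = fill(1 / size(R, 2), size(R, 2))  # uniform prior
    for _ in 1:n_iterations
        joint = R .* f'                            # joint[i, j] = R[i, j] * f[j]
        posterior = joint ./ sum(joint, dims = 2)  # Bayes' rule: P(y = j | x = i)
        f = posterior' * g                         # re-estimate f from the observed density
    end
    return f
end

toy_R = [2/3 0.0; 1/3 2/3; 0.0 1/3]  # columns sum to 1
toy_g = toy_R * [0.8, 0.2]           # observed density generated by a known target density
toy_f_ibu = ibu_sketch(toy_R, toy_g)
```

Each update keeps the estimate normalized, because every row of `posterior` sums to one and `g` sums to one.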

The SVD-based method computes the singular value decomposition of the detector response matrix `R`, fitting `f` according to the method of least squares.
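
The least-squares core of this approach can be sketched with the standard library alone. This is an unregularized illustration under the assumption that `R` and `g` are given; `svd_least_squares` is a hypothetical name, and any regularization the package's `SVD` method applies is not shown.

```julia
using LinearAlgebra: svd  # import only `svd` to avoid a name clash with the package's `SVD`

function svd_least_squares(R::AbstractMatrix, g::AbstractVector; rtol::Real = 1e-10)
    F = svd(R)                         # R = U * Diagonal(S) * V'
    keep = F.S .> rtol * maximum(F.S)  # drop negligible singular values
    # least-squares solution via the (truncated) pseudo-inverse of R
    return F.V[:, keep] * ((F.U[:, keep]' * g) ./ F.S[keep])
end

toy_R = [2/3 0.0; 1/3 2/3; 0.0 1/3]
toy_g = toy_R * [0.8, 0.2]
toy_f_svd = svd_least_squares(toy_R, toy_g)
```

Small singular values amplify noise in `g`, which is why practical variants truncate or damp them.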

In [None]:
#
# The classical algorithms are only applicable with a single discrete observable dimension.
# In order to obtain a dimension that contains as much information as possible, we discretize
# the feature space with a decision tree, using its leaves as clusters. The cluster indices
# are the discrete values of the observed dimension.
#
binning = TreeBinning(6) # obtain (up to) 6 clusters

# inspect the way in which the TreeBinning discretizes the data
td = BinningDiscretizer(binning, X_train, y_train) # fit the tree with labeled data
x_train = encode(td, X_train) # apply it to the feature vectors
unique(x_train) # the results are the cluster indices

In [None]:
# the classical methods (IBU, RUN, PRUN, SVD) need a binning instead of a classifier
f_ibu = deconvolve(IBU(binning), X_data, X_train, y_train)

In [None]:
f_run = deconvolve(RUN(binning), X_data, X_train, y_train)

In [None]:
f_p_run = deconvolve(PRUN(binning), X_data, X_train, y_train)

In [None]:
f_svd = deconvolve(SVD(binning), X_data, X_train, y_train)

## More Information

In [None]:
?DSEA # You can find more information in the documentation

In [None]:
?IBU

In [None]:
?RUN

In [None]:
?PRUN

In [None]:
?SVD