### Sparse linear models for image denoising

### What are sparse methods?

Sparse methods in signal processing exploit a structure shared by many signals: they can be represented as linear combinations of base signals ('atoms') in which only a few coefficients are nonzero (or only a few are large).

Sparse methods have many applications, for example image denoising, deblurring, super-resolution, source separation and image compression.

#### Teaser

* JPEG algorithm
* Heard about Dictionary Learning? What is `sklearn.decomposition.DictionaryLearning`?
* Seen Orthogonal Matching Pursuit? What is `sklearn.linear_model.OrthogonalMatchingPursuit`?

** TODO Edit JPEG comment** 

One example of a method that in practice works like a sparse method is the JPEG algorithm. Its quantization step, performed after the Discrete Cosine Transform, divides the coefficients by a quantization table and rounds them, which zeroes out many small coefficients (this is the main lossy step of the algorithm).
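The idea of zeroing out small transform coefficients can be tried on a 1-D signal. The sketch below is not the JPEG pipeline: it skips quantization tables and 8x8 blocks and simply keeps the largest DCT coefficients. The test signal and the value of `keep` are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

# A smooth test signal (hypothetical; stands in for one row of an image block).
n = 64
t = np.linspace(0.0, 1.0, n)
signal = np.cos(2 * np.pi * t) + 0.5 * t

# Orthonormal DCT-II: for smooth signals most energy sits in a few coefficients.
coeffs = dct(signal, norm="ortho")

# Keep only the `keep` largest-magnitude coefficients and zero out the rest.
keep = 8
threshold = np.sort(np.abs(coeffs))[-keep]
sparse_coeffs = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)

# Reconstruct from the sparse coefficient vector and measure the error.
reconstruction = idct(sparse_coeffs, norm="ortho")
relative_error = np.linalg.norm(signal - reconstruction) / np.linalg.norm(signal)
print(f"nonzero coefficients: {np.count_nonzero(sparse_coeffs)} of {n}")
print(f"relative reconstruction error: {relative_error:.4f}")
```

For a smooth signal like this one, the reconstruction error should stay small even though most coefficients were discarded.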

### Linear equations and sparsity

Consider the linear system $Ax = y$. Depending on $A$ and $y$, it may have no solution, exactly one solution, or infinitely many solutions.

When there are many candidate solutions, sparse methods aim at singling out a canonical one. The original problem becomes

Find $x$ such that $Ax = y$, and $x$ is *sparse*.
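As a rough illustration, the sketch below builds an underdetermined system with a known sparse solution and compares two ways of picking one solution: the minimum Euclidean norm solution (via the pseudoinverse) and a greedy sparse solver, `sklearn.linear_model.OrthogonalMatchingPursuit`. The problem sizes and the random matrix are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)

# Underdetermined system: fewer equations than unknowns (sizes are illustrative).
n_equations, n_unknowns, n_nonzero = 50, 100, 3
A = rng.standard_normal((n_equations, n_unknowns))
A /= np.linalg.norm(A, axis=0)  # unit-norm columns

# Ground-truth solution with only `n_nonzero` nonzero entries.
x_true = np.zeros(n_unknowns)
support = rng.choice(n_unknowns, size=n_nonzero, replace=False)
signs = rng.choice([-1.0, 1.0], size=n_nonzero)
x_true[support] = signs * rng.uniform(1.0, 2.0, size=n_nonzero)
y = A @ x_true

# Minimum Euclidean norm solution: satisfies Ax = y but is dense.
x_min_norm = np.linalg.pinv(A) @ y

# Greedy sparse solution with at most `n_nonzero` nonzero coefficients.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
omp.fit(A, y)
x_sparse = omp.coef_

print("nonzeros, minimum-norm solution:", np.count_nonzero(np.abs(x_min_norm) > 1e-8))
print("nonzeros, OMP solution:", np.count_nonzero(x_sparse))
print("OMP residual:", np.linalg.norm(A @ x_sparse - y))
```

Both vectors are candidate solutions of the same system; only the second one is constrained to be sparse. Whether OMP recovers `x_true` exactly depends on the matrix and the sparsity level.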

### How to define sparsity precisely?

In mathematics and signal processing, several measures are used to quantify sparsity.

#### $L_0$ and $L_1$

Sparsity is most commonly measured using the $L_0$ 'norm' or the $L_1$ norm.

The $L_1$ norm, which induces the Manhattan distance, is defined as

$\|x\|_1 = \sum_{i}{|x_i|}$ 

Whereas $L_0$ is not actually a norm (scaling $x$ by a nonzero constant does not change it), and is defined as

$\|x\|_0 = |\mathrm{supp}(x)| = $*number of nonzero coefficients in* $x$
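Both quantities are easy to compute with NumPy. The vector below is an arbitrary example; the last two lines illustrate why $L_0$ is not a norm: scaling the vector changes $L_1$ but not $L_0$.

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, -0.5, 0.0, 2.0])

l0 = np.count_nonzero(x)   # number of nonzero entries
l1 = np.sum(np.abs(x))     # sum of absolute values

print(l0)  # 3
print(l1)  # 5.5

# Scaling by 2 doubles the L1 norm but leaves the L0 count unchanged.
print(np.count_nonzero(2 * x))  # 3
print(np.sum(np.abs(2 * x)))    # 11.0
```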

One might ask: why use two separate notions? The answer is that $L_0$ is not even continuous, which makes it hard to optimize. In general, the problem of finding the solution of a linear system with the smallest $L_0$ is NP-hard!

$L_1$, on the other hand, is continuous and convex. This makes it appealing, since a whole subfield of optimization (convex optimization) deals with such problems. Also, $L_1$ can be thought of as a convexification of $L_0$:

(image of L1 vs L0 'unit balls')
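Because $L_1$ is convex, minimizing $\|x\|_1$ subject to $Ax = y$ can be written as a linear program. The sketch below uses the standard split $x = u - v$ with $u, v \ge 0$ and solves it with `scipy.optimize.linprog`. The problem sizes and the random matrix are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Underdetermined system with a known sparse solution (sizes are illustrative).
m, n, k = 40, 100, 3
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = [1.5, -2.0, 1.0]
y = A @ x_true

# Write x = u - v with u, v >= 0, so that ||x||_1 <= sum(u) + sum(v),
# with equality at the optimum. The problem becomes a linear program:
#   minimize sum(u) + sum(v)  subject to  A u - A v = y,  u >= 0,  v >= 0.
cost = np.ones(2 * n)
A_eq = np.hstack([A, -A])
result = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None))
x_l1 = result.x[:n] - result.x[n:]

print("solver succeeded:", result.success)
print("L1 norm of LP solution:", np.sum(np.abs(x_l1)))
print("L1 norm of true sparse vector:", np.sum(np.abs(x_true)))
print("entries above 1e-6 in LP solution:", np.count_nonzero(np.abs(x_l1) > 1e-6))
```

Since `x_true` is itself feasible, the $L_1$ norm of the LP solution cannot exceed that of `x_true`; whether the two vectors coincide depends on the matrix and the sparsity level.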

### L1 vs L2 (Euclidean) norm

A person familiar with regularization in linear regression might ask: 

What's the difference between minimizing $\|x\|^2$ and $\|x\|_1$ while solving $Ax = y$? Doesn't minimizing the Euclidean norm lead to small coefficients?

The difference is in the structure of the solution: while both norms favor 'small coefficients', the squared Euclidean norm penalizes big coefficients heavily and barely penalizes small ones, so it has little incentive to push a small coefficient all the way to zero. The $L_1$ norm penalizes every coefficient in proportion to its magnitude, so shrinking a small coefficient to zero still pays off.

Consider an example:

$x \in \mathbb{R}^d, x_i = \sqrt{\frac{1}{d}}$.

Then

$\|x\|^2 = \sum_{i=1}^d \frac{1}{d} = 1$

But

$\|x\|_1 = \sum_{i=1}^d{\sqrt{\frac{1}{d}}} = d {\sqrt{\frac{1}{d}}} = \sqrt{d}$

So even though the coefficients of $x$ get smaller as $d$ gets bigger, its $L_1$ norm actually gets bigger!
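The computation above can be checked numerically. The helper name `norms_of_uniform_vector` is made up for this sketch.

```python
import numpy as np

def norms_of_uniform_vector(d):
    """Return (squared L2 norm, L1 norm) of the vector with all entries sqrt(1/d)."""
    x = np.full(d, np.sqrt(1.0 / d))
    return np.sum(x ** 2), np.sum(np.abs(x))

for d in (1, 4, 100, 10_000):
    l2_squared, l1 = norms_of_uniform_vector(d)
    print(f"d={d:>6}  ||x||^2={l2_squared:.3f}  ||x||_1={l1:.3f}")
```

The squared Euclidean norm stays at 1 for every $d$, while the $L_1$ norm grows like $\sqrt{d}$.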

### Sparsity and sparse data structures

Sparsity in mathematics/signal processing is related, but not identical, to sparsity from the computer science point of view.
In computer science, a sparse matrix is a data structure that stores only the nonzero coefficients.
The connection is that the sparse matrix data structure works well for representing matrices that are sparse in the precise mathematical sense, and it is in fact used in many 'sparse algorithms'.
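A small sketch of the data-structure side, using `scipy.sparse` (the matrix contents are arbitrary): a compressed sparse row (CSR) matrix stores only the nonzero entries, yet behaves like the dense matrix in a matrix-vector product.

```python
import numpy as np
from scipy import sparse

# A 1000 x 1000 matrix with only three nonzero entries.
dense = np.zeros((1000, 1000))
dense[0, 1] = 2.0
dense[10, 500] = -1.0
dense[999, 999] = 0.5

# CSR format keeps the nonzero values plus their index information.
csr = sparse.csr_matrix(dense)
print("stored nonzeros:", csr.nnz)              # 3
print("entries in dense array:", dense.size)    # 1000000

# Matrix-vector products give the same result as with the dense array.
v = np.ones(1000)
print(np.allclose(csr @ v, dense @ v))          # True
```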

### Dictionary learning and related terminology

In practical image/signal processing problems it is common to use the convention

$x \approx D\alpha$ where $\alpha$ is found with some kind of sparse method.

The matrix $D$ is called the dictionary, and its columns are called atoms; in other words, $\alpha$ gives the *decomposition* of $x$ *into atoms*.

Signal processing provides many examples of potential dictionaries: above we mentioned how the JPEG algorithm uses the Discrete Cosine Transform basis as $D$. Other commonly used dictionaries come from wavelet transforms.

$D$ might also be learned from training data, which is the task of dictionary learning.
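As a rough sketch of dictionary learning, the example below fits `sklearn.decomposition.DictionaryLearning` to synthetic signals that were generated as sparse combinations of random atoms. The data, the sizes and the solver settings are assumptions made for the example. Note that scikit-learn stores signals and atoms as rows, so its convention is `X ≈ alpha @ D`, the transpose of $x \approx D\alpha$ used in the text.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
n_samples, n_features, n_atoms, n_nonzero = 200, 16, 24, 2

# Synthetic training data: each signal combines `n_nonzero` random unit-norm atoms.
true_atoms = rng.standard_normal((n_atoms, n_features))
true_atoms /= np.linalg.norm(true_atoms, axis=1, keepdims=True)
codes = np.zeros((n_samples, n_atoms))
for row in codes:  # each `row` is a view, so this fills `codes` in place
    idx = rng.choice(n_atoms, size=n_nonzero, replace=False)
    row[idx] = rng.standard_normal(n_nonzero)
X = codes @ true_atoms

# Learn a dictionary, then sparse-code the signals with OMP.
learner = DictionaryLearning(
    n_components=n_atoms,
    transform_algorithm="omp",
    transform_n_nonzero_coefs=n_nonzero,
    max_iter=20,
    random_state=0,
)
alpha = learner.fit(X).transform(X)  # sparse codes, one row per signal
D = learner.components_              # learned atoms, one row per atom

relative_error = np.linalg.norm(X - alpha @ D) / np.linalg.norm(X)
print("dictionary shape:", D.shape)
print("max nonzeros per code:", np.count_nonzero(alpha, axis=1).max())
print(f"relative reconstruction error: {relative_error:.3f}")
```

The learned atoms need not match `true_atoms`; dictionary learning is non-convex and the result depends on initialization and the number of iterations.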

### Denoising problem formulation

### Sparse models for denoising

#### Used dictionary

For this part we will use Daubechies wavelet coefficients.

#### Other considerations

Use other wavelets?

### Advanced topics

#### Other methods that could utilize sparsity

Sparsity naturally comes up in matrix factorization (from using some form of $L_1$ regularization):

* Nonnegative Matrix Factorization (**TODO** link this with NMF on arXiv )

* Sparse PCA (PCA with added L1 penalty)

* Robust PCA

A precise way to formulate the problem is to pose it as Maximum A Posteriori (MAP) estimation.

If Gaussian noise is assumed and a Laplace prior is placed on the transform coefficients, we can derive the following (here $D$ is the dictionary, which in particular could correspond to a transform matrix, $X$ are the coefficients, and $Y$ is the data used for estimation; the noise variance is absorbed into $\lambda$):

$-\log P(X \mid Y) = -\log\big(P(Y \mid X)P(X)\big) + \mathrm{const} = -\log\big(e^{-\|Y - DX\|^2} e^{-\lambda\|X\|_1}\big) + \mathrm{const} = \|Y - DX\|^2 + \lambda\|X\|_1 + \mathrm{const}$
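One simple way to minimize $\|Y - DX\|^2 + \lambda\|X\|_1$ is iterative soft-thresholding (ISTA), a proximal gradient method. The text does not prescribe a solver, so this is a minimal sketch; the random dictionary, the noise level and the value of `lam` are assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(D, x, y, lam):
    """The MAP objective ||y - D x||^2 + lam * ||x||_1."""
    return np.sum((y - D @ x) ** 2) + lam * np.sum(np.abs(x))

def ista(D, y, lam, n_iter=500):
    """Minimize ||y - D x||^2 + lam * ||x||_1 by iterative soft-thresholding."""
    # Lipschitz constant of the gradient of the quadratic term.
    lipschitz = 2.0 * np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        gradient = 2.0 * D.T @ (D @ x - y)
        x = soft_threshold(x - gradient / lipschitz, lam / lipschitz)
    return x

# Hypothetical test problem: random dictionary, sparse coefficients, Gaussian noise.
rng = np.random.default_rng(0)
n_samples, n_atoms = 30, 60
D = rng.standard_normal((n_samples, n_atoms)) / np.sqrt(n_samples)
x_true = np.zeros(n_atoms)
x_true[[3, 17, 42, 55]] = [2.0, -1.5, 1.0, -2.0]
y = D @ x_true + 0.05 * rng.standard_normal(n_samples)

lam = 0.5
x_hat = ista(D, y, lam)
print("objective at zero:", objective(D, np.zeros(n_atoms), y, lam))
print("objective at ISTA estimate:", objective(D, x_hat, y, lam))
print("nonzero coefficients:", np.count_nonzero(x_hat), "of", n_atoms)
```

The soft-thresholding step is what produces exact zeros in the estimate; a larger `lam` gives a sparser solution at the cost of a worse data fit.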