# Dual-sPLS

Dual-sPLS implements a modified version of sPLS. It offers a more intuitive way to decide how much information to keep: a shrinking ratio takes the place of the penalty parameter $\lambda$ used in sPLS.

## Theory

#### Dual Norm: Definition

According to the paper:

Definition 3.1: Dual Norm

Let $\Omega (.)$ be a norm on $\mathbb{R}^p$. For any $z \in \mathbb{R}^p$, the associated dual norm, denoted $\Omega^*(.)$, is defined as

$$
\Omega^*(z) = \max_w (z^Tw) \quad s.t. \quad \Omega(w) = 1 \quad (21)
$$
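
As a quick sanity check of this definition, here is a minimal sketch (not taken from the paper) for two classical cases: the $\ell_2$ norm, which is its own dual, and the $\ell_1$ norm, whose dual is the max norm. The vector `z` and the random sampling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)  # arbitrary test vector (assumption)

# Closed forms of two classical dual norms evaluated at z.
dual_of_l2 = np.linalg.norm(z, 2)        # the l2 norm is its own dual
dual_of_l1 = np.linalg.norm(z, np.inf)   # the dual of the l1 norm is the max norm

# Maximizers w with Omega(w) = 1 that attain these values.
w_l2 = z / np.linalg.norm(z, 2)          # unit l2 vector aligned with z
k = np.argmax(np.abs(z))
w_l1 = np.zeros_like(z)
w_l1[k] = np.sign(z[k])                  # unit l1 vector on the largest |z_k|

# Brute-force check: random points on each unit sphere never beat the closed form.
samples = rng.normal(size=(20000, z.size))
on_l2_sphere = samples / np.linalg.norm(samples, 2, axis=1, keepdims=True)
on_l1_sphere = samples / np.linalg.norm(samples, 1, axis=1, keepdims=True)
best_l2 = (on_l2_sphere @ z).max()
best_l1 = (on_l1_sphere @ z).max()

print(z @ w_l2, dual_of_l2, best_l2)
print(z @ w_l1, dual_of_l1, best_l1)
```

In both cases the first two printed values coincide and the sampled maximum stays below them.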

#### Generalizing sPLS to other regularizations

Start from the optimization problem that defines the PLS weight vector:

$$
\max_w (y^TXw) \quad s.t. \quad ||w||_2 = 1
$$

We can generalize it to any norm $\Omega$:

$$
\max_w (y^TXw) \quad s.t. \quad \Omega(w) = 1
$$

Writing $z = X^Ty$, so that $y^TXw = z^Tw$, we get the expression for $\hat{w}$:

$$
\hat{w} = \arg\min_w (-z^Tw) \quad s.t. \quad \Omega(w) = 1
$$
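
For the plain $\ell_2$ case, the problem has the closed-form solution $\hat{w} = z / ||z||_2$ with $z = X^Ty$. The sketch below illustrates this on toy data; `X`, `y` and the random comparison directions are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))   # toy predictors (assumption)
y = rng.normal(size=30)        # toy response (assumption)

z = X.T @ y                    # z = X^T y, so that y^T X w = z^T w
w_hat = z / np.linalg.norm(z)  # maximizer of z^T w on the unit l2 sphere

# Any other unit-norm direction gives a smaller (or equal) objective value.
other = rng.normal(size=(1000, z.size))
other /= np.linalg.norm(other, axis=1, keepdims=True)

print("objective at w_hat:      ", z @ w_hat)
print("best of random directions:", (other @ z).max())
```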

The strength of the method is that $\Omega$ can be any norm. A lasso penalty gives the same result as sPLS, but combinations of norms are also possible, such as the first one proposed in the paper, the pseudo-lasso:

$$
\Omega(w) = \lambda ||w||_1 + ||w||_2
$$

which will be used to illustrate the method in this notebook.

We can then apply our Lagrangian method:

$$
\mathcal{L}(w) = -z^Tw + \mu(\Omega(w) - 1) \quad ; \quad \mu > 0 ^*
$$

\* The multiplier of an equality constraint is usually not restricted in sign, but here we want the constraint to be active.

With reasoning very similar to the sPLS case (see sPLS.ipynb), setting the gradient of the Lagrangian to zero, $-z + \mu \nabla \Omega(w) = 0$, gives:

$$
\nabla \Omega(w) = \frac{z}{\mu}
$$

To handle the non-differentiability of the $\ell_1$ term at $w_i = 0$, the gradient of $||w||_1$ is replaced by a subgradient $u$:

($u_i = +1 \quad if \quad  w_i > 0$; $u_i = -1 \quad if \quad  w_i < 0$; $u_i \in [-1, +1] \quad if \quad  w_i = 0$)

This gives us the same soft-thresholding as in sPLS.
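
The soft-thresholding operator referred to here, $\delta_p(|z_p| - \nu)_+$, can be sketched in a few lines. The function name `soft_threshold` and the sample values are illustrative assumptions, not taken from the paper or from sPLS.ipynb.

```python
import numpy as np

def soft_threshold(z, nu):
    """Return sign(z_p) * max(|z_p| - nu, 0) for every coordinate of z."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - nu, 0.0)

z = np.array([3.0, -0.5, 1.2, -2.0, 0.1])
z_nu = soft_threshold(z, nu=1.0)
print(z_nu)
```

Coordinates with $|z_p| \le \nu$ are set to zero, and the others are shrunk toward zero by $\nu$ while keeping their sign.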

___

Let's now derive the expression for the pseudo-lasso, which we need next:

$$
\nabla \Omega(w) = \lambda \delta + \frac{w}{||w||_2}
$$

where $\delta = sign(w) = sign(z)$, see sPLS

$$
\nabla \Omega(w) = \frac{z}{\mu} = \lambda \delta + \frac{w}{||w||_2} \iff \frac{w}{||w||_2} = \frac{z}{\mu} - \lambda \delta \Rightarrow \frac{w_p}{||w||_2} = \frac{1}{\mu}\delta_p(|z_p| - \nu)_+
$$

where $\nu = \lambda \mu$

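Then, we can decide to keep the $\xi \%$ of coordinates with the largest $|z_p|$ and find the matching $\nu$ as a quantile of the $|z_p|$: the $(100 - \xi)\%$ quantile, so that all remaining coordinates fall below the threshold and are set to zero.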

<div style="text-align: center;">
    <img src="assets/dualsplsfig1.png" alt="Description" width="400" />
</div>
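
As a minimal sketch of this quantile step, assuming $\xi$ is expressed as a fraction between 0 and 1 and that `z` is a toy stand-in for $X^Ty$, the threshold can be obtained with `np.quantile`:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=200)   # toy stand-in for z (assumption)
xi = 0.10                  # fraction of coordinates to keep (assumption)

# nu is the (1 - xi) quantile of |z|: about a fraction xi of the |z_p| exceed it.
nu = np.quantile(np.abs(z), 1.0 - xi)

# Soft-thresholding at nu zeroes every coordinate with |z_p| <= nu.
z_nu = np.sign(z) * np.maximum(np.abs(z) - nu, 0.0)
kept = np.count_nonzero(z_nu)

print(f"nu = {nu:.3f}, kept {kept} of {z.size} coordinates")
```

Choosing $\xi$ directly controls the sparsity of the result, which is harder to do by tuning $\lambda$.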


But we cannot simply keep zeros in $z$ after the soft-thresholding; we still need to respect $\Omega(w) = 1$. "To guarantee the unit norm property for $w$, we set $\mu = ||z_\nu||_2$ where $z_\nu$ is the vector of coordinates $\delta_p(|z_p| - \nu)_+$ for $p\in \{ 1, ..., P\}$. Consequently,

$$
w = \frac{\mu}{\nu||z_\nu||_1 + ||z_\nu||_2^2}z_\nu
$$

The rationale behind constraining the direction $w$ instead of the regression coefficients $\hat{\beta}$ is their collinearity. Indeed, the estimator writes
$\hat{\beta} = W(T^TT)^{-1}T^Ty$. Being collinear, soft-thresholding $w$ performs a variable selection at the same location in $\hat{\beta}$ coordinates."
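
Putting the steps together, the following is a hedged sketch of how one pseudo-lasso direction $w$ could be computed from $z = X^Ty$. The function name `pseudo_lasso_direction`, the toy data and the convention that `xi` is a fraction in $(0, 1]$ are assumptions for illustration; this is not the paper's reference implementation. The final lines check numerically that $\Omega(w) = \lambda ||w||_1 + ||w||_2 = 1$ with $\lambda = \nu / \mu$.

```python
import numpy as np

def pseudo_lasso_direction(z, xi):
    """Hypothetical helper: direction w keeping roughly a fraction xi of z's coordinates."""
    z = np.asarray(z, dtype=float)
    nu = np.quantile(np.abs(z), 1.0 - xi)                 # threshold from the quantile of |z|
    z_nu = np.sign(z) * np.maximum(np.abs(z) - nu, 0.0)   # soft-thresholded z
    mu = np.linalg.norm(z_nu, 2)                          # mu = ||z_nu||_2
    lam = nu / mu                                         # from nu = lambda * mu
    w = mu / (nu * np.linalg.norm(z_nu, 1) + mu**2) * z_nu
    return w, lam

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 50))   # toy predictors (assumption)
y = rng.normal(size=40)         # toy response (assumption)

w, lam = pseudo_lasso_direction(X.T @ y, xi=0.2)

# Check the constraint Omega(w) = lambda * ||w||_1 + ||w||_2 = 1.
omega = lam * np.linalg.norm(w, 1) + np.linalg.norm(w, 2)
print("non-zero coordinates:", np.count_nonzero(w))
print("Omega(w) =", omega)
```

The sketch assumes at least one coordinate survives the thresholding, otherwise $\mu = 0$ and the normalization is undefined.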

