### Group Lasso
#### Definition
The mixed $\ell_1/\ell_q$ norm for group sparsity, denoted $\Omega(x)$, is defined as:
$$
\Omega(x) = \sum_{g=1}^G d_g \lVert x_g \rVert_q \quad \text{for any } q \in (1, \infty],
$$
where $d_g$ is a positive scalar weight (usually set to $1.0$), $x_g$ is the $g$th group of $x$, and $G$ is the number of groups in $x$. If $q=2$, the formula above gives the mixed $\ell_1/\ell_2$ norm, i.e., $\Omega(x) = \sum_{g=1}^G d_g \lVert x_g \rVert_2$.

In the context of least-squares regression, this regularization is known as the *group Lasso*.
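
As a small illustration of the definition, the norm can be evaluated directly from a list of index groups. This is a minimal sketch, not a reference implementation: the function name `group_norm` and the representation of groups as lists of indices are assumptions made for the example, and every group is assumed to be non-empty.

```python
import math

def group_norm(x, groups, q=2.0, weights=None):
    """Mixed l1/lq norm: sum over groups of d_g * ||x_g||_q.

    `groups` is a list of index lists (hypothetical representation).
    """
    if weights is None:
        weights = [1.0] * len(groups)  # d_g = 1.0, the usual choice
    total = 0.0
    for d_g, idx in zip(weights, groups):
        abs_vals = [abs(x[i]) for i in idx]
        if math.isinf(q):
            norm = max(abs_vals)  # l_inf norm of the group
        else:
            norm = sum(v ** q for v in abs_vals) ** (1.0 / q)
        total += d_g * norm
    return total

x = [3.0, 4.0, 0.0, 0.0, -1.0]
groups = [[0, 1], [2, 3], [4]]
print(group_norm(x, groups))              # l1/l2: 5 + 0 + 1
print(group_norm(x, groups, q=math.inf))  # l1/l_inf: 4 + 0 + 1
```

The second group contributes nothing to the sum, which is why the penalty favors solutions in which whole groups are zero.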




#### Proximal Gradient Method for Group Lasso
*Step 1.* How do we update variables $x$ if we use group Lasso regularization? 


We use the same update strategy as Prox-FG and Prox-SG, i.e., 
$$x_g^{(k)} = \textbf{prox}_{\eta_k R}\left(x_g^{(k-1)} - \eta_k \nabla F(x_g^{(k-1)})\right).$$
Note that for group sparsity the proximal gradient step is applied to each group $x_g$ separately, rather than to all variables at once (Prox-FG) or to randomly selected variables (Prox-SG).

*Step 2.* The remaining question is how to compute the proximal mapping for group Lasso, i.e., $\textbf{prox}_{\eta R}(u)$, where $u = x_g^{(k-1)} - \eta_k \nabla F(x_g^{(k-1)})$, $\eta = \eta_k$ for convenience, and $R$ is the group Lasso regularizer.


When $R$ is the group Lasso regularizer with coefficient $\lambda$, $\textbf{prox}_{\eta R}(u)$ has the closed form
$$
\textbf{prox}_{\eta R}(u) =  \begin{cases} 
      u - u \frac{\eta \lambda}{\lVert u \rVert_2}, & \lVert u \rVert_2 \geq \eta \lambda \\
      0, & \lVert u \rVert_2 \lt \eta \lambda 
   \end{cases}
$$
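
The closed form above shrinks the whole group toward the origin and sets it to zero once its norm drops below $\eta \lambda$. A minimal sketch of that operation for a single group follows; the function name `prox_group_lasso` is chosen for this example, and the group is represented as a plain list of floats.

```python
import math

def prox_group_lasso(u, eta, lam):
    """Proximal mapping of the group Lasso for one group u.

    Returns u * (1 - eta*lam / ||u||_2) if ||u||_2 >= eta*lam, else 0.
    """
    norm = math.sqrt(sum(v * v for v in u))
    threshold = eta * lam
    # The norm == 0 check avoids a division by zero when eta * lam == 0.
    if norm < threshold or norm == 0.0:
        return [0.0] * len(u)
    scale = 1.0 - threshold / norm
    return [scale * v for v in u]

# ||u||_2 = 5 and eta*lam = 1, so the group is scaled by 0.8.
print(prox_group_lasso([3.0, 4.0], eta=0.5, lam=2.0))
# ||u||_2 = 0.5 < 1, so the whole group is set to zero.
print(prox_group_lasso([0.3, 0.4], eta=0.5, lam=2.0))
```

Note that the direction of $u$ is preserved; only its length changes.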


*Step 3.* Based on the discussion above, the update for group Lasso is:
$$
\begin{aligned}
u &= x_g^{(k-1)} - \eta_k \nabla F(x_g^{(k-1)}), \\
x_g^{(k)} &= \begin{cases} 
      u - u \frac{\eta_k \lambda}{\lVert u \rVert_2}, & \lVert u \rVert_2 \geq \eta_k \lambda \\
      0, & \lVert u \rVert_2 \lt \eta_k \lambda 
   \end{cases}
\end{aligned}
$$
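
To see the update in action, the sketch below iterates it on a toy problem. Everything about the problem is an assumption made for illustration: the smooth term is $F(x) = \frac{1}{2}\lVert x - b \rVert_2^2$ (so its gradient is $x - b$), the step size is fixed at $\eta_k = 0.5$, and the helper names are hypothetical.

```python
import math

def prox_group_lasso(u, eta, lam):
    """Group Lasso proximal mapping for one group (block soft-thresholding)."""
    norm = math.sqrt(sum(v * v for v in u))
    if norm < eta * lam or norm == 0.0:
        return [0.0] * len(u)
    scale = 1.0 - eta * lam / norm
    return [scale * v for v in u]

def prox_grad_step(x, grad, groups, eta, lam):
    """One proximal gradient update, applied group by group."""
    x_new = list(x)
    for idx in groups:
        u = [x[i] - eta * grad[i] for i in idx]  # gradient step on x_g
        for i, v in zip(idx, prox_group_lasso(u, eta, lam)):
            x_new[i] = v                          # proximal step on x_g
    return x_new

# Toy smooth term F(x) = 0.5 * ||x - b||^2, so grad F(x) = x - b.
b = [3.0, 4.0, 0.1, -0.2]
groups = [[0, 1], [2, 3]]
x = [0.0] * len(b)
for k in range(100):
    grad = [xi - bi for xi, bi in zip(x, b)]
    x = prox_grad_step(x, grad, groups, eta=0.5, lam=1.0)
print(x)
```

For this particular $F$, the iterates approach $b_g (1 - \lambda / \lVert b_g \rVert_2)$ on the first group, whose norm exceeds $\lambda$, and stay at exactly zero on the second group, whose norm is below $\lambda$.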


### OBProx-SG

#### Case 1. Ball
1. The parameters $x$ are divided into $G$ groups, namely $x_1, x_2, \cdots, x_G$, where each $x_g$ may consist of multiple elements.
2. We execute the following steps only for groups that contain at least one nonzero element:
    1. Compute the radius for each group: $\mathbf{radius}(g) = \lVert x_g \rVert_2$
    2. Apply the update from Step 3 above to obtain $\hat{x}_g$ for the $g$th group.
    3. Project each element of $\hat{x}_g$ onto the orthant it occupied at the beginning of the epoch. We denote the resulting group by $\mathbf{proj}(\hat{x}_g)$. Note that this projection acts on each element of the group individually rather than on the group as a whole.
    4. Rescale $\mathbf{proj}(\hat{x}_g)$ if $\lVert \mathbf{proj}(\hat{x}_g) \rVert_2 > \mathbf{radius}(g)$:
    $$
\mathbf{proj}(\hat{x}_g) =  \begin{cases} 
      \frac{\mathbf{proj}(\hat{x}_g)}{\lVert \mathbf{proj}(\hat{x}_g)  \rVert_2} \mathbf{radius}(g), & \lVert \mathbf{proj}(\hat{x}_g) \rVert_2 \gt \mathbf{radius}(g) \\
      \mathbf{proj}(\hat{x}_g), & \text{otherwise} 
   \end{cases}
    $$
    5. Update $x_g$ with $\mathbf{proj}(\hat{x}_g)$: $x_g = \mathbf{proj}(\hat{x}_g)$
    
In summary, we apply the proximal gradient step and the projection step to every $x_g$ that contains at least one nonzero element, and then restrict the result to a ball whose radius equals $\lVert x_g \rVert_2$, the norm of the group before the update.
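
The steps of the ball case can be sketched for a single group as follows. This is an illustration under stated assumptions rather than the actual OBProx-SG code: the orthant projection is read as "zero out every element whose sign differs from the reference point `ref_g` recorded at the start of the epoch", and all function names are hypothetical.

```python
import math

def l2(v):
    return math.sqrt(sum(t * t for t in v))

def prox_group_lasso(u, eta, lam):
    """Group Lasso proximal mapping for one group."""
    norm = l2(u)
    if norm < eta * lam or norm == 0.0:
        return [0.0] * len(u)
    return [(1.0 - eta * lam / norm) * v for v in u]

def orthant_project(z, ref):
    """Element-wise: keep z_i only if it has the same sign as ref_i (assumed rule)."""
    return [zi if zi * ri > 0 else 0.0 for zi, ri in zip(z, ref)]

def ball_step(x_g, grad_g, ref_g, eta, lam):
    if all(v == 0.0 for v in x_g):
        return list(x_g)                        # all-zero groups are skipped
    radius = l2(x_g)                            # step 1: radius before the update
    u = [xi - eta * gi for xi, gi in zip(x_g, grad_g)]
    x_hat = prox_group_lasso(u, eta, lam)       # step 2: proximal gradient update
    proj = orthant_project(x_hat, ref_g)        # step 3: orthant projection
    norm = l2(proj)
    if norm > radius:                           # step 4: pull back onto the ball
        proj = [p * radius / norm for p in proj]
    return proj                                 # step 5: new value of x_g

x_g = [1.0, -1.0]
# Large step: the result is rescaled so that its norm equals ||x_g||_2.
print(ball_step(x_g, [-10.0, 3.0], x_g, eta=0.5, lam=1.0))
# First element changes sign, so the orthant projection zeroes it.
print(ball_step(x_g, [4.0, 0.0], x_g, eta=0.5, lam=1.0))
```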
    
    
#### Case 2. Trust Region
We denote by $\lambda_i$ the trust region coefficient for group $i$, $G$ the number of groups, $T$ the number of batches, $g_t$ the gradient on the $t$th batch, $\bar{g}$ the online average of the gradients, $g$ the gradient, $v_i$ the violation flag for the $i$th group, and $\bar{v}_i$ the online average of the violation flag for the $i$th group.

For each epoch, we have the following algorithm:
1. Compute the radius for each group: $\mathbf{radius}(i) = \lambda_i \lVert x_i \rVert_2, i = 1, 2, \cdots, G$
2. for $t$ in $1, 2, \cdots, T$:
    1. Compute the objective value $F_t$, including the regularizer, on the $t$th batch.
    2. Update the online average of the objective value: $\bar{F} = \frac{t - 1}{t} \bar{F} + \frac{1}{t} F_t$
    3. Compute the gradients: $g_t = \begin{cases}\nabla F_t(x), & x \neq 0 \\ 0, & x = 0\end{cases}$.
    4. Update the online average of the gradients: $\bar{g} = \frac{t - 1}{t} \bar{g} + \frac{1}{t} {g_t}$
    5. Apply the proximal gradient step followed by the projection step, keeping zero entries at zero: $\mathbf{proj}(\hat{x}) = \begin{cases}\mathbf{proj}(\hat{x}), & x \ne 0 \\ 0, & x =0\end{cases}$.
    6. Restrict the new group value to a ball: $\mathbf{proj}(\hat{x}_i) =  \begin{cases} 
      \frac{\mathbf{proj}(\hat{x}_i)}{\lVert \mathbf{proj}(\hat{x}_i)  \rVert_2} \mathbf{radius}(i), & \lVert \mathbf{proj}(\hat{x}_i) \rVert_2 \gt \mathbf{radius}(i) \\
      \mathbf{proj}(\hat{x}_i), & \text{otherwise} 
   \end{cases}$
    7. Compute the violation flag for each group, using the norm before the rescaling in the previous step: $v_i = \begin{cases}1, & \lVert \mathbf{proj}(\hat{x}_i) \rVert_2 \gt \mathbf{radius}(i)\\ 0, & \text{otherwise} \end{cases}$
    8. Update the online average of the violation flag for each group: $\bar{v}_i = \frac{t - 1}{t} \bar{v}_i + \frac{1}{t} v_i$
    9. Update each group of $x$: $x_i = \mathbf{proj}(\hat{x}_i), i = 1, 2, \cdots, G$
3. Update trust region coefficients for each group at the end of the epoch:
    1. $\rho = \frac{\bar{F_1}- \bar{F}}{\alpha^2 \lVert \bar{g} \rVert_2}$, where $\bar{F_1}$ is the value of $\bar{F}$ from the previous epoch and $\alpha$ is set to $1.0$.
    2. $\lambda_i = \begin{cases}\begin{cases}2 \lambda_i, & \bar{v}_i \ge 0.3 \\ \lambda_i, & \text{otherwise}\end{cases}, & \rho \lt 0.25\\\lambda_i, & 0.25 \le \rho \le 0.75 \\\begin{cases}2 \lambda_i, & \bar{v}_i \ge 0.5 \\ \frac{\lambda_i}{2}, & \bar{v}_i = 0\\ \lambda_i, & \text{otherwise}\end{cases}, & 0.75 \lt \rho \le 1\end{cases}$
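
The bookkeeping in this algorithm (online averages, the ratio $\rho$, and the coefficient update) can be sketched as below. Two points are assumptions rather than statements from the text: the thresholds $0.3$ and $0.5$ are compared against the online-averaged violation of the group, and for $\rho > 1$, which the rule above does not cover, the coefficient is left unchanged. Function names are hypothetical, and $\lVert \bar{g} \rVert_2$ is assumed to be nonzero.

```python
import math

def online_mean(avg, value, t):
    """Running average after the t-th observation (t starts at 1)."""
    return (t - 1) / t * avg + value / t

def trust_ratio(F_prev, F_curr, g_bar, alpha=1.0):
    """rho = (F_prev - F_curr) / (alpha^2 * ||g_bar||_2); g_bar must be nonzero."""
    g_norm = math.sqrt(sum(v * v for v in g_bar))
    return (F_prev - F_curr) / (alpha ** 2 * g_norm)

def update_coefficient(lam, v_bar, rho):
    """Per-group trust region coefficient update at the end of an epoch."""
    if rho < 0.25:
        return 2.0 * lam if v_bar >= 0.3 else lam
    if rho <= 0.75:
        return lam
    if rho <= 1.0:
        if v_bar >= 0.5:
            return 2.0 * lam
        if v_bar == 0.0:
            return lam / 2.0
        return lam
    return lam  # rho > 1 is not specified in the text; unchanged by assumption

# Violation flags of one group over four batches.
v_bar = 0.0
for t, v in enumerate([1, 0, 1, 0], start=1):
    v_bar = online_mean(v_bar, v, t)

rho = trust_ratio(F_prev=2.0, F_curr=1.0, g_bar=[3.0, 4.0])
print(v_bar, rho, update_coefficient(1.0, v_bar, rho))
```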
        
        
