# SwiGLU

## Swish

We propose a new activation function that we name Swish:

$$f(x) = x\cdot\sigma(x)$$

where $\sigma(x) = (1 + \exp(-x))^{-1}$ is the sigmoid function.
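As a minimal sketch of the definition above, Swish can be written for scalars using only the Python standard library. The function names `sigmoid` and `swish` are chosen here for illustration; they do not come from the source.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, (1 + exp(-x))^-1."""
    return 1.0 / (1.0 + math.exp(-x))

def swish(x: float) -> float:
    """Swish: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)

# Evaluate at a few sample points (inputs picked arbitrarily).
values = [swish(x) for x in (-2.0, 0.0, 2.0)]
```

This naive form overflows in `math.exp` for very large negative inputs, so a production implementation would likely use a numerically stable sigmoid instead.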

## FFN

The position-wise feed-forward network (FFN) takes a vector $x$ (the hidden representation at a particular position in the sequence) and passes it through two learned linear transformations, represented by the matrices $W_{1}$ and $W_{2}$ and the bias vectors $b_{1}$ and $b_{2}$. A rectified-linear (ReLU) activation function is applied between the two linear transformations.

$$\text{FFN}(x, W_{1}, W_{2}, b_{1}, b_{2}) = \max(0, xW_{1}+b_{1})W_{2} + b_{2}$$

If we use a version with no bias:

$$\text{FFN}_{\text{ReLU}}(x, W_{1}, W_{2}) = \max(0, xW_{1})W_{2}$$
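The bias-free layer can be sketched in plain Python, treating $x$ as a row vector (a list) and each weight matrix as a list of rows. The helper `vec_mat` and the toy weights are assumptions made for this example only.

```python
def vec_mat(x, W):
    """Row vector times matrix: (xW)[j] = sum_i x[i] * W[i][j]."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def ffn_relu(x, W1, W2):
    """FFN_ReLU(x, W1, W2) = max(0, x W1) W2, with no bias terms."""
    hidden = [max(0.0, h) for h in vec_mat(x, W1)]
    return vec_mat(hidden, W2)

# Toy shapes: d_model = 2, d_ff = 3 (illustrative values only).
x = [1.0, -2.0]
W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
y = ffn_relu(x, W1, W2)
```

Here $xW_{1} = (1, -2, -3)$, so ReLU zeroes the last two hidden units before the second projection.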

Subsequent work has proposed replacing the ReLU with other nonlinear activation functions, such as Gaussian Error Linear Units, $\text{GELU}(x) = x\Phi(x) = xP(X\le x)$ where $X\sim \mathcal{N}(0,1)$, and $\text{Swish}_{\beta}(x) = x\sigma(\beta x)$:

$$
\begin{aligned}
\text{FFN}_{\text{GELU}}(x, W_{1}, W_{2}) &= \text{GELU}(xW_{1})W_{2}\\
\text{FFN}_{\text{Swish}}(x, W_{1}, W_{2}) &= \text{Swish}_{1}(xW_{1})W_{2}
\end{aligned}
$$
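The two activations can be sketched for scalars as follows. This assumes the exact form of GELU, with $\Phi$ computed through the error function as $\Phi(x) = \tfrac{1}{2}\left(1 + \text{erf}(x/\sqrt{2})\right)$; the function names are illustrative.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x: float, beta: float = 1.0) -> float:
    """Swish_beta(x) = x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

# Compare the two activations at the same (arbitrary) input.
g = gelu(1.0)
s = swish(1.0)  # beta = 1, as used in FFN_Swish
```

Either function can be applied element-wise to $xW_{1}$ in place of $\max(0, \cdot)$ to obtain $\text{FFN}_{\text{GELU}}$ or $\text{FFN}_{\text{Swish}}$.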

## Gated Linear Units (GLU) and Variants

GLU is a neural network layer defined as the component-wise product of two linear transformations of the input, one of which is sigmoid-activated. The authors who introduced GLU also suggest omitting the activation, which they call a "bilinear" layer.

$$
\begin{aligned}
\text{GLU}(x, W, V, b, c) &= \sigma(xW + b)\otimes(xV+c)\\
\text{Bilinear}(x, W, V, b, c) &= (xW + b)\otimes(xV+c)
\end{aligned}
$$

We can also define GLU variants using other activation functions:

$$
\begin{aligned}
\text{ReGLU}(x, W, V, b, c) &= \max(0, xW + b)\otimes(xV+c)\\
\text{GEGLU}(x, W, V, b, c) &= \text{GELU}(xW+b)\otimes(xV+c)\\
\text{SwiGLU}(x, W, V, b, c, \beta) &= \text{Swish}_{\beta}(xW + b)\otimes(xV+c)
\end{aligned}
$$
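Since all of these layers differ only in the activation applied to the first branch, one way to sketch them is a single function parameterized by that activation. The name `glu_variant`, the helpers, and the toy inputs below are assumptions for illustration, not part of the source.

```python
import math

def vec_mat(x, W):
    """Row vector times matrix: (xW)[j] = sum_i x[i] * W[i][j]."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def glu_variant(x, W, V, b, c, act):
    """act(xW + b) component-wise multiplied by (xV + c)."""
    gate = [act(z + bj) for z, bj in zip(vec_mat(x, W), b)]
    value = [z + cj for z, cj in zip(vec_mat(x, V), c)]
    return [g * v for g, v in zip(gate, value)]

# Toy inputs (illustrative values only).
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 2.0]]
b = [0.0, 0.0]
c = [1.0, 1.0]

glu = glu_variant(x, W, V, b, c, sigmoid)                            # GLU
bilinear = glu_variant(x, W, V, b, c, lambda z: z)                   # Bilinear
reglu = glu_variant(x, W, V, b, c, lambda z: max(0.0, z))            # ReGLU
swiglu = glu_variant(x, W, V, b, c, lambda z: z * sigmoid(1.0 * z))  # SwiGLU, beta = 1
```

GEGLU follows the same pattern with a GELU function passed as `act`.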

We propose additional variations on the Transformer FFN layer which use GLU or one of
its variants in place of the first linear transformation and the activation function. Again, we omit the bias
terms.

$$
\begin{aligned}
\text{FFN}_{\text{GLU}}(x, W, V, W_{2}) &= (\sigma(xW)\otimes xV)W_{2}\\
\text{FFN}_{\text{Bilinear}}(x, W, V, W_{2}) &= (xW\otimes xV)W_{2}\\
\text{FFN}_{\text{ReGLU}}(x, W, V, W_{2}) &= (\max(0, xW)\otimes xV)W_{2}\\
\text{FFN}_{\text{GEGLU}}(x, W, V, W_{2}) &= (\text{GELU}(xW)\otimes xV)W_{2}\\
\text{FFN}_{\text{SwiGLU}}(x, W, V, W_{2}) &= (\text{Swish}_{1}(xW)\otimes xV)W_{2}
\end{aligned}
$$
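A minimal sketch of the bias-free $\text{FFN}_{\text{SwiGLU}}$ layer in plain Python, with $x$ as a row vector and matrices as lists of rows. The helper names and toy weights are assumptions for this example; a real implementation would use a tensor library.

```python
import math

def vec_mat(x, W):
    """Row vector times matrix: (xW)[j] = sum_i x[i] * W[i][j]."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def swish(z: float) -> float:
    """Swish_1(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def ffn_swiglu(x, W, V, W2):
    """(Swish_1(xW) component-wise times xV) W2, with no bias terms."""
    gate = [swish(z) for z in vec_mat(x, W)]
    value = vec_mat(x, V)
    hidden = [g * v for g, v in zip(gate, value)]
    return vec_mat(hidden, W2)

# Toy shapes: d_model = 2, hidden size = 3 (illustrative values only).
x = [1.0, 0.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
V = [[2.0, 1.0, 1.0],
     [0.0, 0.0, 0.0]]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
y = ffn_swiglu(x, W, V, W2)
```

Swapping `swish` for a sigmoid, identity, ReLU, or GELU function gives the other four layers in the same way.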

All of these layers have three weight matrices, as opposed to two for the original FFN. To keep the
number of parameters and the amount of computation constant, we reduce the number of hidden units $d_{ff}$
(the second dimension of $W$ and $V$ and the first dimension of $W_{2}$) by a factor of $\frac{2}{3}$ when comparing these
layers to the original two-matrix version.
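The arithmetic behind the $\frac{2}{3}$ factor can be checked directly: two matrices of size $d_{model}\times d_{ff}$ hold as many weights as three matrices of size $d_{model}\times \frac{2}{3}d_{ff}$. The sizes below are illustrative assumptions, chosen so that $d_{ff}$ is divisible by 3.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Two-matrix FFN without biases: W1 (d_model x d_ff) and W2 (d_ff x d_model)."""
    return 2 * d_model * d_ff

def glu_ffn_params(d_model: int, d_ff: int) -> int:
    """GLU-variant FFN without biases: W, V (d_model x d_ff) and W2 (d_ff x d_model)."""
    return 3 * d_model * d_ff

d_model, d_ff = 768, 3072      # illustrative sizes
d_ff_glu = 2 * d_ff // 3       # hidden size reduced by a factor of 2/3

baseline = ffn_params(d_model, d_ff)
gated = glu_ffn_params(d_model, d_ff_glu)
```

When $d_{ff}$ is not divisible by 3, the reduced hidden size has to be rounded, so the two counts match only approximately.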