# Pseudocode: GLU and Its Variants in Place of Traditional Activation Functions

This is a pseudocode description of the feed-forward sublayers from the paper **GLU Variants Improve Transformer** by Noam Shazeer (2020). It shows how the GLU-based feed-forward sublayers introduced in the paper differ from the traditional feed-forward layer of the Transformer: they use a Gated Linear Unit (GLU), or one of its variants, in place of a standard activation function such as ReLU or GELU.

## Traditional Transformer Feed-Forward Network (FFN)

Recall that $\mathrm{ReLU}(x) = \max(0, x)$, and that GELU can be approximated as $\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^3\right)\right)\right)$.
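
As a quick sanity check on the tanh approximation, the sketch below compares it with the exact definition $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF, using only the Python standard library. The helper names (`gelu_exact`, `gelu_tanh`) and the test grid are illustrative choices, not something taken from the paper.

```python
import math

def relu(x: float) -> float:
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, as in the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Compare both versions on an evenly spaced grid over [-4, 4]
xs = [i / 10 for i in range(-40, 41)]
max_abs_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - approx| on [-4, 4]: {max_abs_err:.2e}")
```

On this grid the two curves stay very close, which is why the approximation can usually stand in for the exact form.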

In [None]:
# Traditional FFN Layer (using ReLU or GELU)
def FFN(x, W1, W2, activation_function):
    # Step 1: Linear Transformation
    h = x @ W1

    # Step 2: Activation (ReLU or GELU), passed in as a callable
    h_activated = activation_function(h)

    # Step 3: Output Linear Transformation
    out = h_activated @ W2

    return out


In practice, the same layer can be written with PyTorch. ReLU is available directly, and GELU is written out here using the tanh approximation given earlier:

In [None]:
import torch
import torch.nn.functional as F

# Traditional FFN Layer using ReLU or GELU
def FFN(x, W1, W2, activation='relu'):
    # Step 1: Linear Transformation
    h = x @ W1
    
    # Step 2: Apply the activation function (ReLU or GELU)
    if activation == 'relu':
        h_activated = F.relu(h)  # ReLU: max(0, x)
    elif activation == 'gelu':
        # GELU via the tanh approximation: 0.5 * h * (1 + tanh(sqrt(2/pi) * (h + 0.044715 * h^3)))
        h_activated = 0.5 * h * (1 + torch.tanh((2 / torch.pi) ** 0.5 * (h + 0.044715 * h**3)))
    else:
        raise ValueError("Invalid activation function specified. Use 'relu' or 'gelu'.")

    # Step 3: Output Linear Transformation
    out = h_activated @ W2

    return out
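
To make the shapes concrete, here is a minimal usage sketch of a bias-free feed-forward layer on random inputs. The sizes (`batch`, `seq_len`, `d_model`, `d_ff`), the weight scaling, and the helper name `ffn` are assumptions made for illustration only. The sketch also uses PyTorch's built-in `F.gelu` rather than the hand-written approximation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only to keep the example small
batch, seq_len, d_model, d_ff = 2, 5, 8, 32

x = torch.randn(batch, seq_len, d_model)
W1 = torch.randn(d_model, d_ff) / d_model**0.5   # input projection
W2 = torch.randn(d_ff, d_model) / d_ff**0.5      # output projection

def ffn(x, W1, W2, act=F.relu):
    """Bias-free position-wise FFN: act(x W1) W2."""
    return act(x @ W1) @ W2

out_relu = ffn(x, W1, W2, act=F.relu)
out_gelu = ffn(x, W1, W2, act=F.gelu)

# The layer maps d_model -> d_ff -> d_model, so the input shape is preserved
print(tuple(out_relu.shape), tuple(out_gelu.shape))
```

The matrix products act on the last dimension only, so the same weights are applied independently at every position in the sequence.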


## GLU-Based FFN Layer (from the paper)

Recall that Gated Linear Units consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function.
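
In symbols, with bias terms omitted to match the pseudocode in this section, and writing $\sigma$ for the sigmoid and $\otimes$ for the component-wise product, the gated unit and the feed-forward layer built from it are:

$$
\mathrm{GLU}(x, W, V) = \sigma(xW) \otimes xV,
\qquad
\mathrm{FFN}_{\mathrm{GLU}}(x, W, V, W_2) = \left(\sigma(xW) \otimes xV\right) W_2
$$

Compared with the traditional layer, the single input projection $W_1$ is replaced by two projections $W$ and $V$ of the same shape, so the gated layer has three weight matrices instead of two.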

In [None]:
# GLU-Based FFN Layer
def GLU(x, W, V):
    # Step 1: Compute Linear Transformations
    a = x @ W
    b = x @ V

    # Step 2: Apply Sigmoid Activation for Gating
    gate = torch.sigmoid(a)

    # Step 3: Element-wise product (gated linear unit)
    h_glu = gate * b

    return h_glu

# Pseudocode for the FFNGLU Layer using GLU within the transformer feed-forward block
def FFNGLU(x, W, V, W2):
    # Step 1: GLU Transformation (element-wise product)
    h_glu = GLU(x, W, V)

    # Step 2: Apply final linear transformation
    out = h_glu @ W2

    return out
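
Because the gated layer has three weight matrices rather than two, the paper reduces the hidden width by a factor of 2/3 when comparing it with the traditional layer, which keeps the parameter count and the amount of computation the same. The sketch below illustrates that bookkeeping and runs one forward pass. The function name `ffn_glu`, the variable `d_ff_glu`, the sizes, and the weight scaling are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

d_model, d_ff = 768, 3072      # illustrative sizes for the two-matrix baseline
d_ff_glu = d_ff * 2 // 3       # hidden width reduced by 2/3 for the gated layer

def ffn_glu(x, W, V, W2):
    """Bias-free GLU feed-forward block: (sigmoid(x W) * (x V)) W2."""
    return (torch.sigmoid(x @ W) * (x @ V)) @ W2

W = torch.randn(d_model, d_ff_glu) / d_model**0.5    # gate projection
V = torch.randn(d_model, d_ff_glu) / d_model**0.5    # value projection
W2 = torch.randn(d_ff_glu, d_model) / d_ff_glu**0.5  # output projection

x = torch.randn(4, d_model)
out = ffn_glu(x, W, V, W2)

params_baseline = 2 * d_model * d_ff                  # W1 and W2 of the baseline FFN
params_glu = W.numel() + V.numel() + W2.numel()       # W, V and W2 of the gated FFN
print(tuple(out.shape), params_baseline, params_glu)
```

With these sizes the two parameter counts come out identical, since 3 × (2/3) × `d_ff` equals 2 × `d_ff`.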


## GLU Variants

Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid.

In [None]:
# ReGLU Variant (using ReLU instead of Sigmoid)
def ReGLU(x, W, V):
    a = x @ W
    b = x @ V
    gate = F.relu(a)  # ReLU activation as the gating function
    h_glu = gate * b
    return h_glu

# GEGLU Variant (using GELU instead of Sigmoid)
def GEGLU(x, W, V):
    a = x @ W
    b = x @ V
    gate = F.gelu(a)  # GELU activation as the gating function
    h_glu = gate * b
    return h_glu

# SwiGLU Variant (using Swish activation)
def SwiGLU(x, W, V, beta=1.0):
    a = x @ W
    b = x @ V
    gate = a * torch.sigmoid(beta * a)  # Swish_beta(a) = a * sigmoid(beta * a) as the gating function
    h_glu = gate * b
    return h_glu
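
Since the variants differ only in the gating function, they can be collected behind a single feed-forward function. The following is a sketch under that assumption, not the paper's implementation. The `GATES` table, the `gated_ffn` name, and the tensor sizes are hypothetical. The `"bilinear"` entry is the linear case, in which no nonlinearity is applied to the gate.

```python
import torch
import torch.nn.functional as F

def swish(a, beta=1.0):
    """Swish_beta(a) = a * sigmoid(beta * a)."""
    return a * torch.sigmoid(beta * a)

# Gate functions keyed by variant name (names are illustrative)
GATES = {
    "glu": torch.sigmoid,
    "bilinear": lambda a: a,   # linear gate: no nonlinearity
    "reglu": F.relu,
    "geglu": F.gelu,
    "swiglu": swish,           # beta defaults to 1
}

def gated_ffn(x, W, V, W2, variant="swiglu"):
    """Bias-free gated FFN: (gate(x W) * (x V)) W2."""
    gate = GATES[variant](x @ W)
    return (gate * (x @ V)) @ W2

torch.manual_seed(0)
d_model, d_ff = 8, 16          # small hypothetical sizes
x = torch.randn(3, d_model)
W = torch.randn(d_model, d_ff)
V = torch.randn(d_model, d_ff)
W2 = torch.randn(d_ff, d_model)

outputs = {name: gated_ffn(x, W, V, W2, name) for name in GATES}
for name, out in outputs.items():
    print(name, tuple(out.shape))
```

Keeping the gate as a lookup makes it easy to swap variants in an experiment without touching the rest of the layer.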
