<div style="display: flex; justify-content: space-between; align-items: center;">
    <div style="text-align: left; flex: 4">
        <strong>Author:</strong> Amirhossein Heydari — 
        📧 <a href="mailto:amirhosseinheydari78@gmail.com">amirhosseinheydari78@gmail.com</a> — 
        🐙 <a href="https://github.com/mr-pylin/pytorch-workshop" target="_blank" rel="noopener">github.com/mr-pylin</a>
    </div>
    <div style="text-align: right; flex: 1;">
        <a href="https://pytorch.org/" target="_blank" rel="noopener noreferrer">
            <img src="../../assets/images/pytorch/logo/pytorch-logo-dark.svg" 
                 alt="PyTorch Logo"
                 style="max-height: 48px; width: auto; background-color: #ffffff; border-radius: 8px;">
        </a>
    </div>
</div>
<hr>


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Generate Artificial Inputs](#toc2_)    
- [Activation Functions](#toc3_)    
  - [Built-in Activations](#toc3_1_)    
    - [Elementwise](#toc3_1_1_)    
      - [Linear](#toc3_1_1_1_)    
      - [Sigmoid](#toc3_1_1_2_)    
      - [LogSigmoid](#toc3_1_1_3_)    
      - [Hyperbolic Tangent (Tanh)](#toc3_1_1_4_)    
      - [Softplus](#toc3_1_1_5_)    
      - [Rectified Linear Unit (ReLU)](#toc3_1_1_6_)    
      - [LeakyReLU](#toc3_1_1_7_)    
      - [Exponential Linear Unit (ELU)](#toc3_1_1_8_)    
      - [Sigmoid Linear Unit (SiLU)](#toc3_1_1_9_)    
      - [Mish](#toc3_1_1_10_)    
      - [Gaussian Error Linear Units (GeLU)](#toc3_1_1_11_)    
    - [Non-Elementwise](#toc3_1_2_)    
      - [Softmax](#toc3_1_2_1_)    
      - [LogSoftmax](#toc3_1_2_2_)    
  - [Custom Activations](#toc3_2_)    
    - [Example 1: Custom Sigmoid](#toc3_2_1_)    
    - [Example 2: Custom Softmax](#toc3_2_2_)    
  - [Plot Activations](#toc3_3_)    
- [Threshold Functions](#toc4_)    
  - [Step](#toc4_1_)    
  - [Sign](#toc4_2_)    
  - [Plot Thresholds](#toc4_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [None]:
import matplotlib.pyplot as plt
import torch
from torch import nn

# <a id='toc2_'></a>[Generate Artificial Inputs](#toc0_)

In [None]:
batch_size, num_features = 2, 4
input_values = torch.randn(batch_size, num_features) * 5

# log
print(f"input_values:\n{input_values}")

In [None]:
in_features, out_features = input_values.shape[1], 5
hidden_layer = nn.Linear(in_features, out_features, bias=False).requires_grad_(False)
hidden_values = hidden_layer(input_values)

# log
print(f"hidden_values:\n{hidden_values}")

In [None]:
in_features, out_features = hidden_values.shape[1], 4
output_layer = nn.Linear(in_features, out_features, bias=False).requires_grad_(False)
logits = output_layer(hidden_values)

# log
print(f"logits:\n{logits}")

# <a id='toc3_'></a>[Activation Functions](#toc0_)

- Activation functions are used to introduce non-linearity into the neural network.
- Without activation functions, a stack of linear layers collapses into a single linear transformation, so the network is no more expressive than a linear model, no matter how many layers it has!

<figure style="text-align: center;">
  <img src="../../assets/images/original/mlp/no-activation-network.svg" alt="no-activation-network.svg" style="width: 100%;">
  <figcaption style="text-align: center;">A neural network without any activation functions is just a linear transformation of the input to the output</figcaption>
</figure>
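
The collapse described above can be checked numerically. The following is a minimal sketch, assuming two bias-free `nn.Linear` layers with arbitrary sizes; the names `layer_1`, `layer_2`, and `merged_weight` are illustrative and not part of this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

# two stacked linear layers with no activation in between (bias omitted for simplicity)
layer_1 = nn.Linear(4, 5, bias=False)
layer_2 = nn.Linear(5, 3, bias=False)

x = torch.randn(2, 4)

with torch.no_grad():
    # output of the two-layer stack
    stacked = layer_2(layer_1(x))

    # a single weight matrix that reproduces both layers: W = W2 @ W1
    merged_weight = layer_2.weight @ layer_1.weight  # shape: (3, 4)
    merged = x @ merged_weight.T

    # placing a non-linearity between the layers generally breaks the equivalence
    with_relu = layer_2(torch.relu(layer_1(x)))

# log
print(f"stacked == merged  : {torch.allclose(stacked, merged, atol=1e-6)}")
print(f"with_relu == merged: {torch.allclose(with_relu, merged, atol=1e-6)}")
```

The same argument extends to layers with biases: the composition is then a single affine map instead of a purely linear one.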

📥 **Importing Activation Functions**:

- `torch`: Some activation functions, such as `torch.sigmoid` and `torch.tanh`, are available directly under the `torch` namespace.
- `torch.nn`: Many activation functions are available as **classes** under `torch.nn`, such as `nn.ReLU`, `nn.Sigmoid`, and `nn.Tanh`.
- `torch.nn.functional`: The functional API provides activation functions that can be applied **directly** in the forward pass, like `F.relu`, `F.sigmoid`, and `F.leaky_relu`.
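
As a quick illustration of the three entry points listed above, the sketch below applies ReLU through each namespace and confirms the results match. The tensor values are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch import nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

out_torch = torch.relu(x)    # function under the `torch` namespace
out_module = nn.ReLU()(x)    # module: instantiate first, then call
out_functional = F.relu(x)   # functional API

# log
print(f"torch.relu : {out_torch}")
print(f"nn.ReLU    : {out_module}")
print(f"F.relu     : {out_functional}")
```

The module form is convenient inside containers such as `nn.Sequential`, while the function forms are handy inside a hand-written `forward` method.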

📝 **Docs**:

- Non-linear Activations (weighted sum, nonlinearity): [docs.pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity](https://docs.pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity)
- Non-linear Activations (other): [docs.pytorch.org/docs/stable/nn.html#non-linear-activations-other](https://docs.pytorch.org/docs/stable/nn.html#non-linear-activations-other)
- Non-linear activation functions: [docs.pytorch.org/docs/stable/nn.functional.html#non-linear-activation-functions](https://docs.pytorch.org/docs/stable/nn.functional.html#non-linear-activation-functions)

✍️ **Notes**:

- Plain Python functions are not the idiomatic way to implement an activation function in PyTorch, even though they work with autograd when built from PyTorch operations.
- The idiomatic implementation is covered in later notebooks.


## <a id='toc3_1_'></a>[Built-in Activations](#toc0_)

### <a id='toc3_1_1_'></a>[Elementwise](#toc0_)

#### <a id='toc3_1_1_1_'></a>[Linear](#toc0_)


In [None]:
linear = nn.Identity()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"linear(hidden_values):\n{linear(hidden_values)}")

#### <a id='toc3_1_1_2_'></a>[Sigmoid](#toc0_)

- Maps inputs to the range `(0, 1)`; still common at the output of `binary classification` models, but less common in hidden layers now because it saturates and causes [vanishing gradient](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484) issues.


In [None]:
sigmoid = nn.Sigmoid()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"sigmoid(hidden_values):\n{sigmoid(hidden_values)}")
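
To make the vanishing gradient remark concrete, here is a minimal sketch that uses autograd to evaluate the derivative of sigmoid at a few arbitrary points. The derivative `sigmoid(x) * (1 - sigmoid(x))` peaks at `0.25` for `x = 0` and shrinks quickly as `|x|` grows.

```python
import torch

x = torch.tensor([0.0, 5.0, 10.0], requires_grad=True)

# forward pass and backward pass through sigmoid
y = torch.sigmoid(x)
y.sum().backward()

# analytic derivative: sigmoid(x) * (1 - sigmoid(x))
analytic_grad = (y * (1 - y)).detach()

# log
print(f"autograd gradient: {x.grad}")
print(f"analytic gradient: {analytic_grad}")
```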

#### <a id='toc3_1_1_3_'></a>[LogSigmoid](#toc0_)

- Logarithm of `sigmoid`, computed in a numerically stable way; less common as an activation, but useful when log-probabilities of a binary outcome are needed.

In [None]:
log_sigmoid = nn.LogSigmoid()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"log_sigmoid(hidden_values):\n{log_sigmoid(hidden_values)}")
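
One reason to prefer the built-in over composing `log` and `sigmoid` by hand is numerical stability. The sketch below compares the two on an extreme input (the values are arbitrary); in `float32`, `sigmoid(-200)` underflows to zero, so the naive composition returns `-inf`.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-200.0, -5.0, 0.0, 5.0])

# naive composition: sigmoid underflows to 0 for very negative inputs
naive = torch.log(torch.sigmoid(x))

# built-in log-sigmoid stays finite
stable = F.logsigmoid(x)

# log
print(f"log(sigmoid(x)): {naive}")
print(f"logsigmoid(x)  : {stable}")
```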

#### <a id='toc3_1_1_4_'></a>[Hyperbolic Tangent (Tanh)](#toc0_)

- Similar to `sigmoid` but zero-centered, with outputs in the range `(-1, 1)`; used in [recurrent neural networks (RNNs)](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) and older architectures.


In [None]:
tanh = nn.Tanh()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"tanh(hidden_values):\n{tanh(hidden_values)}")

#### <a id='toc3_1_1_5_'></a>[Softplus](#toc0_)

- Smooth approximation of `ReLU`.


In [None]:
softplus = nn.Softplus()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"softplus(hidden_values):\n{softplus(hidden_values)}")
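
Softplus computes `(1 / beta) * log(1 + exp(beta * x))`, and the `beta` argument of `nn.Softplus` controls how closely it hugs ReLU. The following sketch, using an arbitrary input range, measures the largest gap to ReLU for two values of `beta`; the gap occurs at `x = 0` and equals `log(2) / beta`.

```python
import math

import torch
from torch import nn

x = torch.linspace(-5, 5, 101)
relu_out = torch.relu(x)

# largest absolute difference between softplus and ReLU for each beta
gaps = {}
for beta in (1, 10):
    softplus_out = nn.Softplus(beta=beta)(x)
    gaps[beta] = (softplus_out - relu_out).abs().max().item()

# log
for beta, gap in gaps.items():
    print(f"beta={beta:<2} max gap={gap:.4f} (log(2)/beta={math.log(2) / beta:.4f})")
```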

#### <a id='toc3_1_1_6_'></a>[Rectified Linear Unit (ReLU)](#toc0_)

- Most commonly used and computationally efficient, but suffers from the [dying ReLU](https://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks) problem: a unit whose inputs stay negative outputs zero and receives zero gradient, so it stops learning (related to the [vanishing gradient](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484) problem).


In [None]:
relu = nn.ReLU()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"relu(hidden_values):\n{relu(hidden_values)}")
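
The dying ReLU issue comes from the gradient being exactly zero for negative inputs. A minimal sketch with arbitrary input values:

```python
import torch

x = torch.tensor([-3.0, -0.1, 0.5, 2.0], requires_grad=True)

# backpropagate through ReLU
torch.relu(x).sum().backward()

# gradient is 0 for negative inputs and 1 for positive inputs
print(f"x.grad: {x.grad}")
```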

#### <a id='toc3_1_1_7_'></a>[LeakyReLU](#toc0_)

- Addresses the `dying ReLU` problem by allowing a small, non-zero gradient for negative inputs.


In [None]:
leaky_relu = nn.LeakyReLU()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"leaky_relu(hidden_values):\n{leaky_relu(hidden_values)}")
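
For comparison with ReLU, the sketch below shows the gradient of `nn.LeakyReLU` on arbitrary inputs. The `negative_slope` argument is written out explicitly here; `0.01` is its default value.

```python
import torch
from torch import nn

x = torch.tensor([-3.0, -0.1, 0.5, 2.0], requires_grad=True)

# negative inputs are scaled by `negative_slope` instead of being zeroed
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
y = leaky_relu(x)
y.sum().backward()

# log
print(f"leaky_relu(x): {y.detach()}")
print(f"x.grad       : {x.grad}")
```

Negative inputs keep a small gradient equal to `negative_slope`, so the unit can still recover during training.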

#### <a id='toc3_1_1_8_'></a>[Exponential Linear Unit (ELU)](#toc0_)

- Similar to `LeakyReLU` but uses an exponential function for negative inputs, so the output saturates smoothly at `-alpha`; this can give better results than `ReLU` on some tasks.


In [None]:
elu = nn.ELU()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"elu(hidden_values):\n{elu(hidden_values)}")
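
ELU returns `x` for positive inputs and `alpha * (exp(x) - 1)` otherwise. The sketch below writes that formula by hand and compares it with `nn.ELU`; `manual` is an illustrative name and the inputs are arbitrary.

```python
import torch
from torch import nn

alpha = 1.0
x = torch.tensor([-20.0, -1.0, 0.0, 1.0, 3.0])

# hand-written ELU formula
manual = torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

# built-in module
builtin = nn.ELU(alpha=alpha)(x)

# log
print(f"manual : {manual}")
print(f"builtin: {builtin}")
```

For very negative inputs the output approaches `-alpha`, which bounds how negative an activation can become.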

#### <a id='toc3_1_1_9_'></a>[Sigmoid Linear Unit (SiLU)](#toc0_)

- Also known as **Swish**: computes `x * sigmoid(x)`, combining ReLU-like behavior with a smooth curve; it is reported to give better results than `ReLU` in some deep models.


In [None]:
silu = nn.SiLU()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"silu(hidden_values):\n{silu(hidden_values)}")
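
SiLU is defined as `x * sigmoid(x)`. A short sketch that checks `nn.SiLU` against that definition on an arbitrary input range:

```python
import torch
from torch import nn

x = torch.linspace(-5, 5, 11)

# definition: x * sigmoid(x)
manual = x * torch.sigmoid(x)

# built-in module
builtin = nn.SiLU()(x)

# log
print(f"max abs difference: {(manual - builtin).abs().max().item():.2e}")
```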

#### <a id='toc3_1_1_10_'></a>[Mish](#toc0_)

- Self-regularized, smooth, non-monotonic activation function; it is reported to match or outperform `ReLU` and its variants on some benchmarks.


In [None]:
mish = nn.Mish()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"mish(hidden_values):\n{mish(hidden_values)}")
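
Mish is defined as `x * tanh(softplus(x))`. The sketch below composes that expression from existing PyTorch functions and compares it with `nn.Mish` on arbitrary inputs.

```python
import torch
import torch.nn.functional as F
from torch import nn

x = torch.linspace(-5, 5, 11)

# definition: x * tanh(softplus(x))
manual = x * torch.tanh(F.softplus(x))

# built-in module
builtin = nn.Mish()(x)

# log
print(f"max abs difference: {(manual - builtin).abs().max().item():.2e}")
```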

#### <a id='toc3_1_1_11_'></a>[Gaussian Error Linear Units (GeLU)](#toc0_)

- Weights each input by the standard Gaussian CDF, `x * Φ(x)`, which behaves like a smooth version of `ReLU`; often used in `transformer-based` models.


In [None]:
gelu = nn.GELU()

# log
print(f"hidden_values:\n{hidden_values}\n")
print(f"gelu(hidden_values):\n{gelu(hidden_values)}")
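
GELU can be written as `0.5 * x * (1 + erf(x / sqrt(2)))`, which equals `x` times the standard Gaussian CDF. The sketch below checks this against `nn.GELU` and also evaluates the faster `approximate="tanh"` variant, assuming a PyTorch version that supports that argument. The input range is arbitrary.

```python
import math

import torch
from torch import nn

x = torch.linspace(-4, 4, 81)

# exact form via the Gaussian error function
exact = 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))

# built-in module (exact by default) and its tanh-based approximation
builtin = nn.GELU()(x)
approx = nn.GELU(approximate="tanh")(x)

# log
print(f"max |builtin - exact|: {(builtin - exact).abs().max().item():.2e}")
print(f"max |approx  - exact|: {(approx - exact).abs().max().item():.2e}")
```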

### <a id='toc3_1_2_'></a>[Non-Elementwise](#toc0_)

#### <a id='toc3_1_2_1_'></a>[Softmax](#toc0_)

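- Used for `multi-class classification` with [mutually exclusive](https://en.wikipedia.org/wiki/Softmax_function) classes; it turns each row of logits into a probability distribution that sums to 1. `CrossEntropyLoss` applies it `internally` (as `LogSoftmax`), so a model trained with that loss should output raw logits.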


In [None]:
softmax = nn.Softmax(dim=1)

# log
print(f"logits:\n{logits}\n")
print(f"softmax(logits):\n{softmax(logits)}")
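
Softmax is non-elementwise because each output depends on every logit along `dim`. The sketch below uses made-up logits to show that each row sums to 1 and that changing a single logit shifts all probabilities in that row while leaving other rows untouched.

```python
import torch

logits = torch.tensor([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]])

# normalize across classes (dim=1): one distribution per sample
probs = torch.softmax(logits, dim=1)

# change a single logit in the first row
bumped = logits.clone()
bumped[0, 0] += 1.0
probs_bumped = torch.softmax(bumped, dim=1)

# log
print(f"probs:\n{probs}\n")
print(f"row sums: {probs.sum(dim=1)}\n")
print(f"probs after bumping logits[0, 0]:\n{probs_bumped}")
```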

#### <a id='toc3_1_2_2_'></a>[LogSoftmax](#toc0_)

- Logarithm of `softmax`, typically paired with `NLLLoss`.
- Computing it in a single step is more numerically stable than applying `log` to the output of `Softmax`.


In [None]:
log_softmax = nn.LogSoftmax(dim=1)

# log
print(f"logits:\n{logits}\n")
print(f"log_softmax(logits):\n{log_softmax(logits)}")
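
The pairing with `NLLLoss` can be verified directly: `LogSoftmax` followed by `NLLLoss` gives the same value as `CrossEntropyLoss` applied to raw logits. A minimal sketch with random logits and made-up target labels:

```python
import torch
from torch import nn

torch.manual_seed(0)

logits = torch.randn(3, 4)          # 3 samples, 4 classes
targets = torch.tensor([0, 2, 3])   # hypothetical class indices

# cross-entropy on raw logits
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# log-softmax followed by negative log-likelihood
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, targets)

# log
print(f"CrossEntropyLoss    : {loss_ce.item():.6f}")
print(f"LogSoftmax + NLLLoss: {loss_nll.item():.6f}")
```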

## <a id='toc3_2_'></a>[Custom Activations](#toc0_)

- You can define **custom** activation functions in PyTorch using `torch.nn.Module` or simple **Python functions**.
- To create a custom activation, extend `torch.nn.Module` and implement the `forward` method, or define a function using PyTorch operations.

📝 **Docs**:

- `nn.Module`: [docs.pytorch.org/docs/stable/generated/torch.nn.Module.html](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html)


### <a id='toc3_2_1_'></a>[Example 1: Custom Sigmoid](#toc0_)

In [None]:
def custom_sigmoid(x: torch.Tensor) -> torch.Tensor:
    return 1 / (1 + torch.exp(-x))


# compute the activations
activations = custom_sigmoid(hidden_values)

# log
print(f"activations:\n{activations}")

In [None]:
class CustomSigmoid(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 1 / (1 + torch.exp(-x))


# compute the activations
sigmoid = CustomSigmoid()
activations = sigmoid(hidden_values)

# log
print(f"activations:\n{activations}")
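
One practical benefit of the `nn.Module` version is that it composes with other modules. The sketch below redefines the same `CustomSigmoid` class so the block is self-contained, then places it inside an `nn.Sequential`; the layer sizes and the name `model` are illustrative assumptions.

```python
import torch
from torch import nn


class CustomSigmoid(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 1 / (1 + torch.exp(-x))


torch.manual_seed(0)

# a module-based activation can be used like any built-in layer
model = nn.Sequential(nn.Linear(4, 5), CustomSigmoid())

x = torch.randn(2, 4)
out = model(x)

# reference result using the built-in sigmoid on the same linear output
reference = torch.sigmoid(model[0](x))

# gradients flow through the custom activation back to the linear layer
out.sum().backward()

# log
print(f"matches torch.sigmoid: {torch.allclose(out, reference, atol=1e-6)}")
print(f"weight.grad shape    : {model[0].weight.grad.shape}")
```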

### <a id='toc3_2_2_'></a>[Example 2: Custom Softmax](#toc0_)

In [None]:
def custom_softmax(x: torch.Tensor, dim: int) -> torch.Tensor:
    exp_tensor = torch.exp(x)
    sum_exp_tensor = torch.sum(exp_tensor, dim=dim, keepdim=True)
    return exp_tensor / sum_exp_tensor


# compute the activations
probs = custom_softmax(logits, dim=1)

# log
print(f"probs:\n{probs}")

In [None]:
class CustomSoftmax(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        exp_tensor = torch.exp(x)
        sum_exp_tensor = torch.sum(exp_tensor, dim=self.dim, keepdim=True)
        return exp_tensor / sum_exp_tensor


# compute the activations
softmax = CustomSoftmax(dim=1)
probs = softmax(logits)

# log
print(f"probs:\n{probs}")
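
The straightforward formula used above can overflow when logits are large, because `torch.exp` returns `inf` and the division then yields `nan`. A common remedy is to subtract the per-row maximum before exponentiating, which leaves the result mathematically unchanged. The sketch below is one possible variant under that assumption; `naive_softmax` mirrors the function above, and `stable_softmax` is a hypothetical name.

```python
import torch


def naive_softmax(x: torch.Tensor, dim: int) -> torch.Tensor:
    exp_tensor = torch.exp(x)
    return exp_tensor / torch.sum(exp_tensor, dim=dim, keepdim=True)


def stable_softmax(x: torch.Tensor, dim: int) -> torch.Tensor:
    # shift by the maximum so the largest exponent is exp(0) = 1
    shifted = x - torch.amax(x, dim=dim, keepdim=True)
    exp_tensor = torch.exp(shifted)
    return exp_tensor / torch.sum(exp_tensor, dim=dim, keepdim=True)


# deliberately large logits to trigger overflow in float32
large_logits = torch.tensor([[1000.0, 1001.0, 1002.0]])

naive_probs = naive_softmax(large_logits, dim=1)
stable_probs = stable_softmax(large_logits, dim=1)
builtin_probs = torch.softmax(large_logits, dim=1)

# log
print(f"naive  : {naive_probs}")
print(f"stable : {stable_probs}")
print(f"builtin: {builtin_probs}")
```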

## <a id='toc3_3_'></a>[Plot Activations](#toc0_)


In [None]:
# domain [-10, +10]
x = torch.linspace(-10, +10, 1001)

# log
print(x)

In [None]:
fig, axs = plt.subplots(nrows=3, ncols=4, figsize=(12, 8), layout="compressed")
fig.suptitle("Activation Functions")
axs[0, 0].plot(x, nn.ReLU()(x))
axs[0, 0].set(title="Rectified Linear Unit (ReLU)", xlim=[-10, 10], ylim=[-10, 10])
axs[0, 1].plot(x, nn.LeakyReLU()(x))
axs[0, 1].set(title="LeakyReLU", xlim=[-10, 10], ylim=[-10, 10])
axs[0, 2].plot(x, nn.ELU()(x))
axs[0, 2].set(title="Exponential Linear Unit (ELU)", xlim=[-10, 10], ylim=[-10, 10])
axs[0, 3].plot(x, nn.SiLU()(x))
axs[0, 3].set(title="Sigmoid Linear Unit (SiLU)", xlim=[-10, 10], ylim=[-10, 10])
axs[1, 0].plot(x, nn.Mish()(x))
axs[1, 0].set(title="Mish", xlim=[-10, 10], ylim=[-10, 10])
axs[1, 1].plot(x, nn.Sigmoid()(x))
axs[1, 1].set(title="Sigmoid", xlim=[-10, 10], ylim=[-4, 4])
axs[1, 2].plot(x, nn.LogSigmoid()(x))
axs[1, 2].set(title="LogSigmoid", xlim=[-10, 10], ylim=[-10, 10])
axs[1, 3].plot(x, nn.Tanh()(x))
axs[1, 3].set(title="Hyperbolic Tangent (Tanh)", xlim=[-10, 10], ylim=[-4, 4])
axs[2, 0].plot(x, nn.Softplus()(x))
axs[2, 0].set(title="Softplus", xlim=[-10, 10], ylim=[-10, 10])
axs[2, 1].plot(x, nn.GELU()(x))
axs[2, 1].set(title="Gaussian Error Linear Units (GeLU)", xlim=[-10, 10], ylim=[-10, 10])
axs[2, 2].plot(x, nn.Softmax(dim=0)(x))
axs[2, 2].set(title="Softmax", xlim=[-10, 10], ylim=[-0.05, 0.05])
axs[2, 3].plot(x, nn.LogSoftmax(dim=0)(x))
axs[2, 3].set(title="LogSoftmax", xlim=[-10, 10], ylim=[-25, 0])
for ax in fig.axes:
    ax.grid(True)
plt.show()

# <a id='toc4_'></a>[Threshold Functions](#toc0_)

- Threshold functions are a simpler type of activation function, used primarily in the early development of neural networks.
- These functions decide whether a neuron is activated based on whether its input surpasses a certain threshold.


## <a id='toc4_1_'></a>[Step](#toc0_)


In [None]:
def step(x: torch.Tensor) -> torch.Tensor:
    return torch.where(x >= 0, torch.ones_like(x), torch.zeros_like(x))

## <a id='toc4_2_'></a>[Sign](#toc0_)


In [None]:
def sign(x: torch.Tensor) -> torch.Tensor:
    return torch.where(x > 0, torch.ones_like(x), torch.where(x < 0, torch.ones_like(x) * -1, torch.zeros_like(x)))
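
PyTorch also ships built-in equivalents of both functions. The sketch below copies the two definitions so it runs on its own and compares them with `torch.heaviside` and `torch.sign` on arbitrary inputs. For `torch.heaviside`, the second argument sets the output at exactly `0`; it is set to `1` here to match the `x >= 0` convention used by `step`.

```python
import torch


def step(x: torch.Tensor) -> torch.Tensor:
    return torch.where(x >= 0, torch.ones_like(x), torch.zeros_like(x))


def sign(x: torch.Tensor) -> torch.Tensor:
    return torch.where(x > 0, torch.ones_like(x), torch.where(x < 0, torch.ones_like(x) * -1, torch.zeros_like(x)))


x = torch.tensor([-2.0, 0.0, 0.5, 3.0])

# built-in counterparts
builtin_step = torch.heaviside(x, torch.tensor(1.0))
builtin_sign = torch.sign(x)

# log
print(f"step(x)        : {step(x)}")
print(f"torch.heaviside: {builtin_step}")
print(f"sign(x)        : {sign(x)}")
print(f"torch.sign     : {builtin_sign}")
```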

## <a id='toc4_3_'></a>[Plot Thresholds](#toc0_)


In [None]:
# domain [-10, +10]
x = torch.linspace(-10, +10, 1001)

# log
print(x)

In [None]:
# plot
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(8, 4), layout="compressed")
fig.suptitle("Threshold Functions")
axs[0].plot(x, step(x))
axs[0].grid(True)
axs[0].set(title="step", xlim=[-10, 10], ylim=[-2, 2])
axs[1].plot(x, sign(x))
axs[1].grid(True)
axs[1].set(title="sign", xlim=[-10, 10], ylim=[-2, 2])
plt.show()