#### Characteristics of an Ideal Activation Function
1. `Non-linearity` – enables the network to learn complex, non-linear patterns instead of just linear combinations.  
2. `Non-saturating` – avoids vanishing gradients by keeping derivatives from approaching zero in large input regions.  
3. `Continuous and differentiable` – guarantees gradient existence everywhere so back-propagation can flow smoothly.  
4. `Zero-centered` – keeps the mean activation near zero, helping gradients update both positive and negative weights evenly.  
5. `Smooth (continuous derivative)` – eliminates abrupt slope changes that can destabilize training.  
6. `Non-monotonic` – allows local gradient sign changes, giving the optimizer more flexible paths to escape plateaus.  
7. `Non-convex` – provides the multimodal landscape necessary for deep nets to find rich, hierarchical feature representations.

*Hidden-layer activation functions inject the non-linearity that lets a deep stack of layers model complex, curved decision boundaries instead of collapsing back into a single linear transformation.*
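
The collapse mentioned above can be checked by hand. The sketch below uses two small, made-up weight matrices (they are not taken from the notebook's model) and shows that two stacked linear layers give the same result as one merged layer, while putting a ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Illustrative weights and input, chosen only for this example
W1 = [[1.0, 2.0], [0.5, -1.0]]
W2 = [[-1.0, 3.0], [2.0, 0.0]]
x = [0.5, -2.0]

# Two linear layers applied one after the other
stacked = matvec(W2, matvec(W1, x))
# The same mapping expressed as a single linear layer with weights W2 @ W1
collapsed = matvec(matmul(W2, W1), x)

# With a ReLU between the layers, the result no longer matches the merged layer
hidden = [max(0.0, h) for h in matvec(W1, x)]
with_relu = matvec(W2, hidden)

print("stacked  :", stacked)
print("collapsed:", collapsed)
print("with ReLU:", with_relu)
```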

In [1]:
# Importing Modules
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
from keras import Sequential
from keras.layers import Input, Dense

In [2]:
# Importing Dataset

X, y = make_moons(
    n_samples = 500,
    noise = 0.5,
    random_state = 42
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

#### Sigmoid Activation Function
> 1. `Continuous and Differentiable`: The sigmoid function is smooth and differentiable, which makes it suitable for gradient-based optimization methods.
> 2. `Non-linear`: It introduces non-linearity into the model, allowing it to learn complex patterns.

In [3]:
# Model building
model = Sequential()

model.add(Input(shape = (X_train.shape[1],)))
model.add(Dense(units = 10, activation = 'sigmoid'))
model.add(Dense(units = 5, activation = 'sigmoid'))
model.add(Dense(units = 1, activation = 'sigmoid'))

model.summary()

In [4]:
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)

model.fit(X_train, y_train, epochs = 10, validation_split = 0.2, verbose = 1)

Epoch 1/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - accuracy: 0.5000 - loss: 0.7115 - val_accuracy: 0.4125 - val_loss: 0.7340
Epoch 2/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5000 - loss: 0.7084 - val_accuracy: 0.4125 - val_loss: 0.7283
Epoch 3/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.5000 - loss: 0.7063 - val_accuracy: 0.4125 - val_loss: 0.7229
Epoch 4/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5000 - loss: 0.7033 - val_accuracy: 0.4125 - val_loss: 0.7190
Epoch 5/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.5000 - loss: 0.7013 - val_accuracy: 0.4125 - val_loss: 0.7160
Epoch 6/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.5000 - loss: 0.6996 - val_accuracy: 0.4125 - val_loss: 0.7120
Epoch 7/10
10/10 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x267294223c0>

#### Why you should not use the Sigmoid Activation Function:
> 1. `Vanishing Gradient Problem`: Gradients can become very small during backpropagation, leading to slow learning or even stopping the learning process.
> 2. `Cannot Produce Negative Values`: The output range is between 0 and 1, so a sigmoid neuron can never pass a negative activation to the next layer, whatever the sign of its input.
> 3. `Not Zero-Centered`: The outputs are always positive, which can lead to inefficient gradient updates.
> 4. `Expensive to Compute`: Evaluating the sigmoid requires an exponential, which costs more than the simple comparison used by ReLU-style activations and can slow down training.
> 5. `Saturated`: When the input values are very large or very small, the function saturates and the gradients become almost zero, leading to slow learning.
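
To put numbers on the saturation and vanishing-gradient points, the sketch below evaluates the sigmoid and its derivative σ(z)(1 − σ(z)) at a few inputs. The derivative peaks at 0.25 for z = 0, so the activation-derivative factor picked up across n sigmoid layers is at most 0.25ⁿ. The sample inputs and layer counts are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid, expressed through its own output
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient is largest at z = 0 and almost zero once |z| is large (saturation)
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  grad={sigmoid_grad(z):.5f}")

# Upper bound on the product of sigmoid derivatives across n layers
for n in (1, 2, 5, 10):
    print(f"{n:2d} layers: at most {0.25 ** n:.2e}")
```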

<u>Core Issue</u> – *Because every sigmoid neuron outputs a positive value, each neuron in the next layer receives only positive inputs. The gradients for all of that neuron's incoming weights therefore share the same sign (all positive or all negative) in a given update step. The neuron cannot increase some of its weights while decreasing others in the same step, so the optimizer has to zig-zag toward a good solution, which slows convergence.*
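
A small sketch of the sign argument: for a neuron with pre-activation z = Σ wᵢaᵢ + b, the gradient for weight wᵢ is δ·aᵢ, where δ is the gradient arriving at z. The input values and δ below are made up for illustration.

```python
def weight_grads(delta, inputs):
    """Gradient of the loss w.r.t. each incoming weight: delta * input."""
    return [delta * a for a in inputs]

sigmoid_inputs = [0.9, 0.2, 0.6]   # outputs of a previous sigmoid layer: all in (0, 1)
tanh_inputs = [0.8, -0.6, 0.2]     # outputs of a previous tanh layer: mixed signs

for delta in (0.5, -0.5):
    print("delta =", delta)
    print("  after sigmoid:", weight_grads(delta, sigmoid_inputs))
    print("  after tanh   :", weight_grads(delta, tanh_inputs))
```

With all-positive inputs every weight gradient takes the sign of δ. With zero-centered inputs the signs can differ within one step.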

---

#### Tanh Activation Function
> 1. `Non-linear`: The tanh function introduces non-linearity into the model, allowing it to learn complex patterns.
> 2. `Differentiable and Continuous`: It is smooth and differentiable, making it suitable for gradient-based optimization methods.
> 3. `Zero-Centered`: The outputs range from -1 to 1, which helps in centering the data and making the optimization process more efficient.

In [5]:
# Model Building
model = Sequential()

model.add(Input(shape = (X_train.shape[1],)))
model.add(Dense(units = 10, activation = 'tanh'))
model.add(Dense(units = 5, activation = 'tanh'))
model.add(Dense(units = 1, activation = 'sigmoid'))

model.summary()

In [6]:
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)

model.fit(X_train, y_train, epochs = 10, validation_split = 0.2, verbose = 1)

Epoch 1/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - accuracy: 0.6031 - loss: 0.6422 - val_accuracy: 0.7000 - val_loss: 0.6075
Epoch 2/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6469 - loss: 0.6179 - val_accuracy: 0.7625 - val_loss: 0.5770
Epoch 3/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.6844 - loss: 0.5976 - val_accuracy: 0.7875 - val_loss: 0.5499
Epoch 4/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.7188 - loss: 0.5784 - val_accuracy: 0.8250 - val_loss: 0.5267
Epoch 5/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.7344 - loss: 0.5607 - val_accuracy: 0.8375 - val_loss: 0.5071
Epoch 6/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.7375 - loss: 0.5467 - val_accuracy: 0.8500 - val_loss: 0.4893
Epoch 7/10
10/10 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x267294bd310>

#### Why you should not use the Tanh Activation Function:

> 1. `Vanishing Gradient Problem`: Similar to the sigmoid function, gradients can become very small during backpropagation, leading to slow learning or even stopping the learning process.
> 2. `Saturation`: When the input values are very large or very small, the function saturates and the gradients become almost zero, leading to slow learning.
> 3. `Expensive to Compute`: Like the sigmoid, tanh relies on exponentials, so it costs more to evaluate than ReLU-style activations, which can slow down training.
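
For comparison with the sigmoid, the sketch below checks two properties of tanh on a symmetric set of sample inputs (chosen arbitrarily): its outputs average to about zero, while the sigmoid's average to about 0.5, and its derivative 1 − tanh²(z) peaks at 1 but still collapses toward zero for large |z|.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_grad(z):
    # Derivative of tanh, expressed through its own output
    return 1.0 - math.tanh(z) ** 2

inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]   # symmetric sample inputs
mean_tanh = sum(math.tanh(z) for z in inputs) / len(inputs)
mean_sigmoid = sum(sigmoid(z) for z in inputs) / len(inputs)

print(f"mean tanh output   : {mean_tanh:.4f}")
print(f"mean sigmoid output: {mean_sigmoid:.4f}")

for z in (0.0, 2.0, 5.0):
    print(f"z={z:4.1f}  tanh grad={tanh_grad(z):.5f}")
```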

---

#### ReLU Activation Function
> 1. `Non-linear`: ReLU introduces non-linearity, allowing the network to learn complex patterns by combining multiple ReLU activations.
> 2. `Less Expensive for Gradient Calculation`: The gradient calculation is computationally efficient, which speeds up the training process.
> 3. `Mitigates the Vanishing Gradient Problem`: For positive inputs the gradient is exactly 1, so it does not shrink as it passes back through many layers, which makes ReLU suitable for deep networks.
> 4. `No Saturation for Positive Inputs`: Unlike sigmoid and tanh, ReLU does not flatten out for large positive inputs, which helps maintain strong gradients. For negative inputs both the output and the gradient are zero.

In [7]:
# Model Building
model = Sequential()

model.add(Input(shape = (X_train.shape[1],)))
model.add(Dense(units = 10, activation = 'relu', kernel_initializer = 'he_normal'))
model.add(Dense(units = 5, activation = 'relu', kernel_initializer = 'he_normal'))
model.add(Dense(units = 1, activation = 'sigmoid'))

model.summary()

In [8]:
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)

model.fit(X_train, y_train, epochs = 10, validation_split = 0.2)

Epoch 1/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 19ms/step - accuracy: 0.3906 - loss: 1.2015 - val_accuracy: 0.2750 - val_loss: 1.3632
Epoch 2/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.3812 - loss: 1.1243 - val_accuracy: 0.2750 - val_loss: 1.2594
Epoch 3/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.3625 - loss: 1.0543 - val_accuracy: 0.2750 - val_loss: 1.1690
Epoch 4/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.3688 - loss: 0.9945 - val_accuracy: 0.3125 - val_loss: 1.0910
Epoch 5/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.3875 - loss: 0.9449 - val_accuracy: 0.3625 - val_loss: 1.0257
Epoch 6/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.4406 - loss: 0.9050 - val_accuracy: 0.4625 - val_loss: 0.9638
Epoch 7/10
10/10 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x267294be350>

It is completely natural to feel that "throwing away" negative values looks like a mistake. You are spotting a real trade-off that researchers debated for years.

#### 1. The Intuition: It’s Not "Data Loss," It’s a "Noise Filter"

You asked if converting negative values to zero is **data loss**.
**Short Answer:** Yes, it is data loss, but it is **intentional** data loss.

Think of a neuron in a neural network like a **detective** looking for a specific clue. Let's say one neuron is responsible for detecting a "horizontal edge" (like the brim of a hat) in an image.

* **Positive Value (+10):** "I found a strong horizontal edge here!" (The neuron shouts).
* **Zero (0):** "I found nothing relevant to me."
* **Negative Value (-10):** "I found the *opposite* of a horizontal edge (e.g., a vertical edge)."

**Why ignore the negative?**
If you are the "Horizontal Edge Detective," you don't care about vertical edges. You just want to report what *you* found. If you report a negative value, you are effectively sending "anti-information" that might confuse the next layer. By sending **0**, you are essentially saying, *"I have nothing to report right now. Keep looking."*

This is called **Sparsity**. It turns off neurons that aren't relevant to the current task, making the network efficient and focused.

> **Analogy:** Imagine a radio. If you tune into a station (the signal), you want to hear the music. If there is static or silence between songs (negative/irrelevant data), you want the **Squelch** circuit to mute it completely (turn it to 0). You don't want to hear "negative sound"; you just want silence so you can focus on the actual signal.

---

#### 2. The Gradient: Correcting the "Direction" Misconception

You mentioned: *"The direction of values helps to the values of gradients."*

This is a very sharp observation, but here is the nuance:
In deep learning, we care about the **slope** (how much changing the input changes the output), not just the raw value.

* **For Positive Values (ReLU is linear):** The slope is **1**. This means if you increase the input a little, the output increases a little. The gradient flows perfectly. The network learns: *"Hey, this feature was useful! Strengthen this connection!"*
* **For Negative Values (ReLU is 0):** The slope is **0**. This stops the gradient. The network learns: *"This neuron didn't fire, so it didn't contribute to the error. Don't blame it, and don't change it."*
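
The two slope cases above can be written out directly. In this sketch the `upstream` value is a made-up gradient arriving from the next layer. The slope at exactly z = 0 is undefined, and treating it as 0 here is just a convention.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Slope 1 for positive inputs, 0 otherwise (value at z == 0 is a convention)
    return 1.0 if z > 0 else 0.0

upstream = 0.7  # hypothetical gradient arriving from the next layer

for z in (-2.0, -0.1, 0.5, 3.0):
    passed_back = upstream * relu_grad(z)
    print(f"z={z:5.1f}  relu={relu(z):4.1f}  gradient passed back={passed_back:.1f}")
```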

**Why is this "Data Loss" okay?**
We assume that if a feature was truly important, the network would have learned **weights and biases** that shift that value into the positive range so the ReLU *would* see it. If it stays negative, the network effectively decided it was noise.

---

#### 3. The "Dying ReLU" (When Your Intuition is 100% Right)

You are actually correct that sometimes this "zeroing out" is too harsh. This is a famous problem called the **"Dying ReLU" problem**.

If a neuron gets stuck in the negative region (e.g., a large negative bias), it outputs **0** forever. It has "died."
* It stops learning.
* The gradient is always 0.
* It becomes a wasted part of your brain.
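
A minimal sketch of a dead neuron, assuming a single-input ReLU neuron z = w·x + b with made-up inputs. With a large negative bias every pre-activation is negative, so the gradients for both w and b are zero and gradient descent can no longer move them.

```python
def relu_neuron_grads(w, b, xs, upstream=1.0):
    """Gradients of sum(upstream * relu(w*x + b)) over xs, w.r.t. w and b."""
    grad_w = grad_b = 0.0
    for x in xs:
        active = 1.0 if w * x + b > 0 else 0.0   # ReLU slope at this input
        grad_w += upstream * active * x
        grad_b += upstream * active
    return grad_w, grad_b

xs = [-1.5, -0.5, 0.0, 1.0, 2.5]   # hypothetical inputs

dead = relu_neuron_grads(w=1.0, b=-10.0, xs=xs)   # never active on these inputs
alive = relu_neuron_grads(w=1.0, b=0.0, xs=xs)    # active for the positive inputs

print("dead neuron grads :", dead)
print("alive neuron grads:", alive)
```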

This is exactly why **Leaky ReLU** and **ELU** (Exponential Linear Unit) exist.

* **Leaky ReLU:** For negative inputs, instead of 0 it outputs a small fraction of the input (like $0.01x$).
* **The Logic:** It says, *"I'm mostly sure this is noise, but I'll let a tiny bit of information through just in case I'm wrong, so the gradient can still flow back and correct me."*

---

#### Leaky ReLU Activation Function
$$
\text{activation}(z) = 
\begin{cases}
\alpha \cdot z & \text{if } z \leq 0 \\
z & \text{if } z > 0
\end{cases}
$$
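
The piecewise definition above translates directly into code. This is a plain-Python sketch for single values rather than the Keras layer itself, and the default `alpha=0.01` is just a common small leak.

```python
def leaky_relu(z, alpha=0.01):
    # Identity for positive inputs, a small slope alpha for the rest
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    # The slope never becomes zero, so a gradient can always flow back
    return 1.0 if z > 0 else alpha

for z in (-10.0, -1.0, 0.5, 3.0):
    print(f"z={z:6.1f}  alpha=0.01 -> {leaky_relu(z):7.3f}  alpha=0.2 -> {leaky_relu(z, 0.2):7.3f}")
```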

In [9]:
# Model Building
model = Sequential()

model.add(Input(shape = (X_train.shape[1],)))
model.add(Dense(units = 10, activation = keras.layers.LeakyReLU(negative_slope=0.2), kernel_initializer = 'he_normal'))
model.add(Dense(units = 5, activation = keras.layers.LeakyReLU(negative_slope=0.2), kernel_initializer = 'he_normal'))
model.add(Dense(units = 1, activation = 'sigmoid'))

model.summary()

In [10]:
model.compile(
    optimizer = 'Adam',
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)

model.fit(X_train, y_train, epochs = 10, validation_split = 0.2)

Epoch 1/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - accuracy: 0.2719 - loss: 1.1325 - val_accuracy: 0.2250 - val_loss: 1.3204
Epoch 2/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.3000 - loss: 1.0712 - val_accuracy: 0.2250 - val_loss: 1.2331
Epoch 3/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.3094 - loss: 1.0118 - val_accuracy: 0.2500 - val_loss: 1.1550
Epoch 4/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.3625 - loss: 0.9564 - val_accuracy: 0.2750 - val_loss: 1.0828
Epoch 5/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.4250 - loss: 0.9077 - val_accuracy: 0.3750 - val_loss: 1.0140
Epoch 6/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.4969 - loss: 0.8588 - val_accuracy: 0.4000 - val_loss: 0.9531
Epoch 7/10
10/10 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x2672941ab10>

> In one reported comparison of ReLU variants, the leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak).

---

#### Parametric ReLU Activation Function
> α is learned during training: instead of being a hyperparameter, it becomes a parameter that backpropagation can modify like any other. This was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.

$$
\text{activation}(z) = 
\begin{cases}
\alpha \cdot z & \text{if } z \leq 0 \\
z & \text{if } z > 0
\end{cases}
$$
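
Because α is a parameter here, it needs its own gradient. For the definition above, the derivative of the output with respect to α is z when z ≤ 0 and 0 otherwise. The sketch below applies one plain gradient-descent step to α. The starting α, learning rate, pre-activation, and incoming gradient are all made-up values, and this is not how Keras implements `PReLU` internally.

```python
def prelu(z, alpha):
    return z if z > 0 else alpha * z

def prelu_grad_alpha(z):
    # d prelu / d alpha: z on the negative side, 0 on the positive side
    return 0.0 if z > 0 else z

alpha = 0.25              # starting value, assumed for illustration
learning_rate = 0.1       # hypothetical
z, upstream = -2.0, 0.5   # hypothetical pre-activation and incoming gradient

# One gradient-descent step on alpha
alpha_new = alpha - learning_rate * upstream * prelu_grad_alpha(z)

print("output before:", prelu(z, alpha))
print("alpha after one step:", alpha_new)
print("output after :", prelu(z, alpha_new))
```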

In [11]:
# Model Building
model = Sequential()

model.add(Input(shape = (X_train.shape[1],)))
# PReLU holds a trainable alpha, so it is added as its own layer after each linear Dense layer
model.add(Dense(units = 10, kernel_initializer = 'he_normal'))
model.add(keras.layers.PReLU())
model.add(Dense(units = 5, kernel_initializer = 'he_normal'))
model.add(keras.layers.PReLU())
model.add(Dense(units = 1, activation = 'sigmoid'))

model.summary()

In [12]:
model.compile(
    optimizer = 'Adam',
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)

model.fit(X_train, y_train, epochs = 10, validation_split = 0.2)

Epoch 1/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - accuracy: 0.5000 - loss: 0.7127 - val_accuracy: 0.4250 - val_loss: 0.7454
Epoch 2/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5156 - loss: 0.6727 - val_accuracy: 0.4875 - val_loss: 0.7050
Epoch 3/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.5531 - loss: 0.6355 - val_accuracy: 0.5875 - val_loss: 0.6717
Epoch 4/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.6062 - loss: 0.6053 - val_accuracy: 0.6625 - val_loss: 0.6439
Epoch 5/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.6562 - loss: 0.5813 - val_accuracy: 0.7375 - val_loss: 0.6200
Epoch 6/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.7125 - loss: 0.5589 - val_accuracy: 0.8125 - val_loss: 0.6025
Epoch 7/10
10/10 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x2672941b100>

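*Unlike ReLU, Leaky ReLU, and PReLU, which have a sharp corner at z = 0, ELU replaces the negative side with a smooth exponential curve, and with α = 1 its derivative is continuous everywhere. SELU is a scaled ELU and keeps the same smooth negative branch, although its slope still jumps at z = 0.*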

---

#### ELU (Exponential Linear Unit) Activation Function

$$
\text{activation}(z) = 
\begin{cases}
\alpha \left(e^{z} - 1\right), & \text{if } z \leq 0 \\
z, & \text{if } z > 0
\end{cases}
$$

In [13]:
# Model Building
model = Sequential()

model.add(Input(shape = (X_train.shape[1],)))
model.add(Dense(units = 10, activation = keras.layers.ELU(), kernel_initializer = 'he_normal'))
model.add(Dense(units = 5, activation = keras.layers.ELU(), kernel_initializer = 'he_normal'))
model.add(Dense(units = 1, activation = 'sigmoid'))

model.summary()

In [None]:
model.compile(
    optimizer = 'Adam',
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)

model.fit(X_train, y_train, epochs = 10, validation_split = 0.2)

Epoch 1/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 2s 25ms/step - accuracy: 0.6177 - loss: 0.6080 - val_accuracy: 0.7375 - val_loss: 0.5017
Epoch 2/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6392 - loss: 0.6193 - val_accuracy: 0.7625 - val_loss: 0.4786
Epoch 3/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7348 - loss: 0.5479 - val_accuracy: 0.8000 - val_loss: 0.4592
Epoch 4/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7527 - loss: 0.5400 - val_accuracy: 0.8375 - val_loss: 0.4441
Epoch 5/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7438 - loss: 0.5213 - val_accuracy: 0.8375 - val_loss: 0.4325
Epoch 6/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7687 - loss: 0.5197 - val_accuracy: 0.8125 - val_loss: 0.4238
Epoch 7/10
10/10 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x79762d2f2680>

>- The hyperparameter α sets the value (−α) that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter if you want.
>- It has a nonzero gradient for z < 0, which avoids the dead neurons problem.
>- If α is equal to 1 then the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.
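
These properties can be checked with a few lines of plain Python. The sketch implements the ELU formula above for single values (it is not the Keras layer) and shows that the output levels off at −α, that the gradient stays nonzero for z < 0, and that the slope just left of z = 0 equals α, so it meets the positive-side slope of 1 only when α = 1.

```python
import math

def elu(z, alpha=1.0):
    # expm1(z) computes exp(z) - 1 accurately for small |z|
    return z if z > 0 else alpha * math.expm1(z)

def elu_grad(z, alpha=1.0):
    return 1.0 if z > 0 else alpha * math.exp(z)

# Large negative inputs approach -alpha
print("elu(-50), alpha=1:", elu(-50.0))
print("elu(-50), alpha=2:", elu(-50.0, alpha=2.0))

# Nonzero gradient on the negative side
print("grad at z=-2:", elu_grad(-2.0))

# Slope just left of zero equals alpha; the slope on the right is always 1
for alpha in (1.0, 0.5):
    print(f"alpha={alpha}: left slope={elu_grad(-1e-9, alpha):.4f}, right slope={elu_grad(1e-9, alpha):.4f}")
```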

*The main drawback of the ELU activation function is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.*


---

#### SELU (Scaled Exponential Linear Unit) Activation Function
$$ \text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \le 0 \end{cases} $$

Where $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$.
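
The constants are chosen so that standard-normal pre-activations come out of SELU with mean close to 0 and variance close to 1. The sketch below estimates this with random samples. It uses more digits of α and λ than the rounded values above, and the seed and sample count are arbitrary.

```python
import math
import random

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # Scaled ELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise
    return LAMBDA * (x if x > 0 else ALPHA * math.expm1(x))

rng = random.Random(0)   # fixed seed so the estimate is repeatable
samples = [selu(rng.gauss(0.0, 1.0)) for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(f"mean of SELU outputs    : {mean:.4f}")
print(f"variance of SELU outputs: {var:.4f}")
```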

Assumptions
>- The input features must be standardized (mean 0 and standard deviation 1).
>- Every hidden layer’s weights must also be initialized using LeCun normal initialization. In Keras, this means setting kernel_initializer="lecun_normal".
>- The network’s architecture must be sequential. Unfortunately, if you try to use SELU in non-sequential architectures, such as recurrent networks or networks with skip connections (i.e., connections that skip layers, such as in wide & deep nets), self-normalization will not be guaranteed, so SELU will not necessarily outperform other activation functions.
>- It only guarantees self-normalization if all layers are dense. However, in practice the SELU activation function seems to work great with convolutional neural nets as well.
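
The first assumption deserves a concrete step, since the moons data loaded earlier is used as-is. Below is a dependency-free sketch of standardization on toy rows (not the notebook's data): the mean and standard deviation are computed on the training rows only and then reused for the test rows. In the notebook's own stack, `sklearn.preprocessing.StandardScaler` would normally play this role. The helper names `fit_standardizer` and `standardize` are made up for this example, and a constant column (standard deviation 0) is not handled.

```python
from statistics import fmean, pstdev

def fit_standardizer(rows):
    """Per-column mean and population standard deviation of the training rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy data for illustration only
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
test = [[2.0, 30.0]]

means, stds = fit_standardizer(train)      # statistics come from the training set only
train_std = standardize(train, means, stds)
test_std = standardize(test, means, stds)  # test set reuses the training statistics

print("means:", means, "stds:", stds)
print("standardized train:", train_std)
print("standardized test :", test_std)
```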

In [None]:
# Model Building
model = Sequential()

model.add(Input(shape = (X_train.shape[1],)))
model.add(Dense(units = 10, activation = 'selu', kernel_initializer = 'lecun_normal'))
model.add(Dense(units = 5, activation = 'selu', kernel_initializer = 'lecun_normal'))
model.add(Dense(units = 1, activation = 'sigmoid'))

model.summary()

In [None]:
model.compile(
    optimizer = 'Adam',
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)

model.fit(X_train, y_train, epochs = 10, validation_split = 0.2)

Epoch 1/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 2s 24ms/step - accuracy: 0.7769 - loss: 0.4625 - val_accuracy: 0.8375 - val_loss: 0.4031
Epoch 2/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7187 - loss: 0.5044 - val_accuracy: 0.8500 - val_loss: 0.3937
Epoch 3/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.8159 - loss: 0.4195 - val_accuracy: 0.8625 - val_loss: 0.3895
Epoch 4/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.8246 - loss: 0.4294 - val_accuracy: 0.8625 - val_loss: 0.3885
Epoch 5/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.8165 - loss: 0.4294 - val_accuracy: 0.8625 - val_loss: 0.3886
Epoch 6/10
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.8302 - loss: 0.4035 - val_accuracy: 0.8500 - val_loss: 0.3883
Epoch 7/10
10/10 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x79762d0f2170>

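#### Persistent limitations across ReLU-based activations
1. Monotonic behavior: with a positive α, ReLU, Leaky ReLU, PReLU, ELU, and SELU never decrease as z grows, so they lack the non-monotonicity listed among the ideal characteristics.  
2. Convexity: ReLU, Leaky ReLU (with α < 1), and ELU (with α = 1) are convex, so they also lack the non-convexity listed there.  

*Modern alternatives such as GELU, Swish, and Mish are non-monotonic and non-convex, which addresses both points.*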
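
For reference, here is a plain-Python sketch of the three functions for single values, using their standard definitions: Swish as x·sigmoid(x) (the β = 1 form, also known as SiLU), GELU as x·Φ(x) with Φ the standard normal CDF, and Mish as x·tanh(softplus(x)). Each one dips below zero for moderately negative inputs and then rises back toward zero as x becomes more negative, which is the non-monotonic behavior that ReLU lacks.

```python
import math

def swish(x):
    return x / (1.0 + math.exp(-x))                         # x * sigmoid(x)

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))   # x * Phi(x)

def mish(x):
    return x * math.tanh(math.log1p(math.exp(x)))           # x * tanh(softplus(x))

points = (-5.0, -1.0, 0.0, 1.0)
for f in (swish, gelu, mish):
    values = [round(f(x), 4) for x in points]
    print(f"{f.__name__:5s} at {points}: {values}")
```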

---

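> - If you want to train your DNN faster - `Leaky ReLU`
> - If you don't know the exact alpha value - `PReLU`
> - If your network cannot self-normalize (for example, it has skip connections or unstandardized inputs) - `ELU`
> - If your network can self-normalize (a plain stack of dense layers, standardized inputs, LeCun normal initialization) - `SELU`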