## Activation Functions in Neural Networks

Activation functions are a crucial component of artificial neural networks: they introduce non-linearity into the model. Without them, any stack of layers would collapse into a single linear transformation, no matter how deep the network is. Each neuron applies its activation function to the weighted sum of its inputs to produce its output. Activation functions differ in output range, gradient behavior, and computational cost. Here are some of the most common ones:

### Step/Threshold Function:

- **Formula:** $f(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$
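- This function was used in early models such as the perceptron but is rarely used in modern neural networks: it is discontinuous at $x = 0$ and its derivative is zero everywhere else, so gradient-based training receives no learning signal.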
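
To make the zero-gradient issue concrete, here is a minimal sketch in plain Python. The helper `numeric_grad` is a hypothetical name introduced only for this illustration; it estimates the derivative with a central finite difference.

```python
def step(x: float) -> float:
    """Threshold activation: 1.0 if x > 0, otherwise 0.0."""
    return 1.0 if x > 0 else 0.0


def numeric_grad(f, x: float, h: float = 1e-4) -> float:
    """Central finite-difference estimate of f'(x) (illustrative helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Away from x = 0 the estimated slope is zero on both sides of the jump,
# so a gradient-based update would leave the weights unchanged.
for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, step(x), numeric_grad(step, x))
```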

### Sigmoid Function:

- **Formula:** $f(x) = \frac{1}{1 + e^{-x}}$
- **Range:** $(0, 1)$
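- It squashes the input into the range between 0 and 1, so its output can be read as a probability. For this reason it is commonly used in the output layer of binary classification models. It saturates for large $|x|$, where its gradient approaches zero.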
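
The formula can be sketched in plain Python as follows. The two-branch form is an assumption of this sketch, not something the formula requires: it avoids the `OverflowError` that `math.exp(-x)` raises for very negative `x`. The derivative uses the standard identity $f'(x) = f(x)(1 - f(x))$.

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid, evaluated so that exp() never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x < 0, so z is in (0, 1)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s * (1 - s), at most 0.25 (reached at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)


print(sigmoid(0.0), sigmoid_grad(0.0))  # midpoint of the curve and its peak slope
print(sigmoid_grad(10.0))               # saturated region: slope close to zero
```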

### Hyperbolic Tangent (tanh) Function:

- **Formula:** $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
- **Range:** $(-1, 1)$
- Similar to the sigmoid function, but it squashes the input into the range between -1 and 1, so its output is zero-centered.
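
The similarity to the sigmoid is exact: $\tanh(x) = 2\,\text{sigmoid}(2x) - 1$. The sketch below checks this identity numerically against Python's built-in `math.tanh`; the function names `tanh_via_sigmoid` and `tanh_grad` are made up for this illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Plain sigmoid; fine for the small inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh written as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2, which peaks at 1 for x = 0."""
    return 1.0 - math.tanh(x) ** 2


for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, math.tanh(x), tanh_via_sigmoid(x), tanh_grad(x))
```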

### Rectified Linear Unit (ReLU):

- **Formula:** $f(x) = \max(0, x)$
- **Range:** $[0, \infty)$
- It is one of the most widely used activation functions due to its simplicity and effectiveness. Because its gradient is exactly 1 for every positive input, it helps mitigate the vanishing gradient problem.
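
A rough way to see the effect on gradients is to multiply the activation derivatives along a chain of layers. The sketch below is a simplification that ignores the weight matrices entirely: it compares the best case for the sigmoid (derivative 0.25 at every layer) with ReLU on a path where every unit is active. The depth of 10 is an arbitrary choice for the example.

```python
def relu(x: float) -> float:
    """ReLU: passes positive inputs through, clamps the rest to 0."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for x > 0, 0 otherwise (0 chosen at x = 0)."""
    return 1.0 if x > 0 else 0.0


depth = 10  # illustrative number of stacked layers

# The sigmoid derivative is at most 0.25, so this is the most favorable case.
sigmoid_chain = 0.25 ** depth

# On an active path every ReLU contributes a factor of exactly 1.
relu_chain = relu_grad(1.0) ** depth

print(sigmoid_chain, relu_chain)
```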

### Leaky ReLU:

- **Formula:** $f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{else} \end{cases}$ (where $\alpha$ is a small positive constant, typically around 0.01)
- It addresses the "dying ReLU" problem, in which a neuron whose input stays negative outputs zero and stops receiving gradient, by allowing a small gradient when $x < 0$.
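
A minimal sketch of the function and its gradient, using the typical $\alpha = 0.01$ mentioned above as the default:

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: identity for x > 0, a small slope alpha otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Derivative: 1 for x > 0, alpha otherwise, so it is never exactly zero."""
    return 1.0 if x > 0 else alpha


for x in (-2.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```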

### Parametric ReLU (PReLU):

- **Formula:** $f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{else} \end{cases}$ (where $\alpha$ is a learnable parameter)
- Similar to Leaky ReLU, but $\alpha$ is updated by backpropagation together with the network's weights instead of being fixed in advance.
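
What "learnable" means in practice is that $\alpha$ receives its own gradient: $\partial f / \partial \alpha = x$ for $x \le 0$ and $0$ otherwise. The scalar `PReLU` class below is a hypothetical sketch for illustration, not the API of any particular library. The initial value of 0.25, the squared-error loss, and the learning rate are all assumptions of the example.

```python
class PReLU:
    """Scalar PReLU with a single learnable slope (illustrative sketch)."""

    def __init__(self, alpha: float = 0.25):
        self.alpha = alpha  # assumed starting value for the example

    def forward(self, x: float) -> float:
        return x if x > 0 else self.alpha * x

    def backward(self, x: float, upstream: float):
        """Return (d loss / d x, d loss / d alpha) given d loss / d output."""
        if x > 0:
            return upstream, 0.0
        return upstream * self.alpha, upstream * x


layer = PReLU()
x, target, lr = -2.0, -1.0, 0.1
y = layer.forward(x)                      # negative branch: alpha * x
upstream = y - target                     # derivative of 0.5 * (y - target) ** 2
_, grad_alpha = layer.backward(x, upstream)
layer.alpha -= lr * grad_alpha            # alpha is updated like any other weight
print(layer.alpha, layer.forward(x))
```

After this single step the output for the same input is closer to the target than before, which is the only thing the example is meant to show.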

### Exponential Linear Unit (ELU):

- **Formula:** $f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{x} - 1) & \text{else} \end{cases}$ (where $\alpha$ is a positive constant)
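- Like ReLU, it has a gradient of 1 for positive inputs, which helps with the vanishing gradient problem. Unlike ReLU, its negative branch saturates smoothly at $-\alpha$, which keeps a non-zero gradient for negative inputs and pushes mean activations closer to zero. The exponential makes it more expensive to compute than ReLU.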
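
A minimal sketch with $\alpha = 1$ as an assumed default. It uses `math.expm1`, which computes $e^{x} - 1$ directly and is more accurate than `math.exp(x) - 1` for `x` close to zero; that choice is a detail of this sketch, not part of the definition.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for x > 0, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative: 1 for x > 0, alpha * e^x otherwise (positive for finite x)."""
    return 1.0 if x > 0 else alpha * math.exp(x)


for x in (-5.0, -1.0, 0.0, 2.0):
    print(x, elu(x), elu_grad(x))
```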

### Swish:

- **Formula:** $f(x) = x \cdot \text{sigmoid}(x)$
- Swish is a smooth, non-monotonic function that behaves like ReLU for large $|x|$ and has been found to perform well in some cases.
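
The sketch below evaluates the formula at a few points to show how it compares with ReLU: it dips slightly below zero for moderately negative inputs, returns toward zero for very negative ones, and approaches $x$ for large positive ones. The sample points are arbitrary.

```python
import math


def swish(x: float) -> float:
    """Swish: x * sigmoid(x), written as x / (1 + e^-x)."""
    return x / (1.0 + math.exp(-x))


for x in (-5.0, -1.0, 0.0, 1.0, 10.0):
    print(x, swish(x), max(0.0, x))  # Swish next to ReLU for comparison
```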

### Softmax:

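- Used in the output layer of multi-class classification problems, where it turns a vector of scores into a probability distribution: every output lies in $(0, 1)$ and the outputs sum to 1.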
- **Formula:** $f(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$ for all $i$, where $i$ is the current class and $j$ ranges over all classes.
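
A minimal sketch of the formula in plain Python. Subtracting the largest score before exponentiating is an addition of this sketch: it leaves the result unchanged, because the common factor cancels in the ratio, and it keeps `math.exp` from overflowing on large scores.

```python
import math


def softmax(scores):
    """Softmax over a list of scores, shifted by the maximum for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))

# Large scores would overflow a direct exp(); the shifted version handles them.
print(softmax([1000.0, 1000.0]))
```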

These activation functions have different properties and can be chosen based on the specific requirements and characteristics of the neural network and the problem being solved. ReLU and its variants are among the most popular choices in deep learning due to their training efficiency and effectiveness in mitigating gradient-related issues.


# Optimizers

In [None]:
"""
self.opt_dict = {
    "sgd": SGD(),      # stochastic gradient descent
    "bgd": BGD(),      # batch gradient descent
    "mbsgd": MBGD(),   # mini-batch gradient descent
    # TODO: implement these someday
    # "adagrad": Adagrad(),
    # "adadelta": Adadelta(),
    # "rmsprop": RMSProp(),
    # "adam": Adam(),
    # "nadam": Nadam(),
    # "adamax": Adamax(),
    # "momentum": Momentum(),
    # "nag": NesterovAcceleratedGradient(),
    # "lbfgs": LBFGS(),
    # "rprop": RProp(),
    # "yf": YurrisFriend(),
    # "la": LookAhead(),
    # "ranger": Ranger(),
    # "FTRL": FollowTheRegularizedLeader(),
}
"""

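The three optimizers registered above differ mainly in how many samples feed each parameter update. The sketch below is not the implementation behind `SGD`, `BGD`, or `MBGD`, which is not shown here. It is a hypothetical `fit_slope` function that fits $y \approx w x$ on toy data and exposes the difference through a single `batch_size` argument; the data, learning rate, and epoch count are assumptions of the example.

```python
import random


def fit_slope(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x by gradient descent on mean squared error (illustrative)."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w * x - y) ** 2) over the batch with respect to w.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]  # toy data generated by y = 2x

w_sgd = fit_slope(xs, ys, batch_size=1)         # one sample per update
w_bgd = fit_slope(xs, ys, batch_size=len(xs))   # the whole dataset per update
w_mbgd = fit_slope(xs, ys, batch_size=2)        # a small batch per update
print(w_sgd, w_bgd, w_mbgd)
```

On this noiseless toy problem all three settings recover a slope close to 2. On real data they trade off differently: smaller batches give noisier but cheaper updates, while the full batch gives a smoother but more expensive one.
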
# Resources for Backpropagation

https://cs231n.github.io/optimization-2/