# Introduction to Artificial Neural Networks

    Artificial Neural Networks (ANNs) are computational models inspired by the structure and functioning of the human brain.
    
    ANNs are used in machine learning to solve tasks such as pattern recognition, classification, and regression.
    
    ANNs consist of interconnected nodes, called neurons or units, organized in layers: input, hidden, and output layers.
    
    Components:
        Neurons: 
            These are the basic units of an ANN. 
            Each neuron receives inputs, applies a mathematical operation to them (typically a weighted sum followed by an activation function), and produces an output.
            Neurons are organized into layers, and each neuron in one layer is connected to the neurons in the next layer.
        
        Layers:
            ANNs have three main types of layers: input, hidden, and output.
                Input Layer: receives the data
                Hidden Layer: processes the information
                Output Layer: produces the network's prediction or result

        Weights and biases: 
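            Weights scale the strength of each connection and biases shift each neuron's weighted sum. Both are adjusted during the learning process to reduce the prediction error.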
            
        Activation functions:
            These functions introduce non-linearity, allowing the network to model complex relationships within the data.

    Biological and Artificial Neurons

image of neuron and simple nn

# Activation Functions

    Activation functions are crucial components of ANNs that introduce non-linearity into the network's computations.
    
    They are applied to the output of each neuron, influencing whether and to what extent the neuron should be activated and its output passed to the next layer.
    
    
    Sigmoid Function: 
$$ f(x) = \frac{1}{1 + e^{-x}} $$
        
        Range: (0, 1)
        
        Outputs values between 0 and 1.
        
        Historically used in the hidden layers of neural networks, but less common now due to issues like vanishing gradients.
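
        The vanishing-gradient issue can be seen numerically. The following is a minimal sketch using only Python's standard library; the function names `sigmoid` and `sigmoid_grad` are illustrative choices, not part of any particular framework.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-x), branched to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x)). Its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is largest at 0 and shrinks quickly as |x| grows.
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

        Because the derivative never exceeds 0.25, multiplying many such factors together across layers tends to shrink the gradient.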
        
        
    Hyperbolic Tangent (tanh): 
$$ f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} $$
        
        Range: (-1, 1)
        
        Similar to the sigmoid but outputs values between -1 and 1.
        
        Also prone to vanishing gradient problems.
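
        As a rough check of the formula and of the saturation behaviour, here is a short sketch. `tanh_from_definition` and `tanh_grad` are hypothetical helper names; `math.tanh` is the standard-library implementation used for comparison.

```python
import math

def tanh_from_definition(x: float) -> float:
    """tanh written exactly as in the formula above (fine for moderate |x|)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2. Equals 1 at x = 0, near 0 for large |x|."""
    t = math.tanh(x)
    return 1.0 - t * t

for x in (-5.0, 0.0, 0.7, 5.0):
    print(x, tanh_from_definition(x), math.tanh(x), tanh_grad(x))
```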
        
        
    Rectified Linear Unit (ReLU): 
$$ f(x) = \max(0, x) $$
        Range: [0, ∞)
        
        Outputs the input if it's positive, zero otherwise.
        
        Simple and computationally efficient, allowing for faster training. Widely used in hidden layers.
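
        A minimal sketch of ReLU and its gradient in plain Python. The value returned for the gradient at exactly x = 0 is a convention assumed here, since the derivative is undefined at that point.

```python
def relu(x: float) -> float:
    """ReLU: returns x when x is positive, otherwise 0."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

for x in (-2.0, 0.0, 3.5):
    print(x, relu(x), relu_grad(x))
```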
        
        
        
    Leaky ReLU: 
$$ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases} $$
        
        Range: (-∞, ∞)
        
        Similar to ReLU, but a small positive slope α (for example 0.01) gives negative inputs a non-zero gradient, addressing the "dying ReLU" problem.
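
        The difference from ReLU shows up only for negative inputs. In this sketch the default `alpha=0.01` is an assumed, commonly used value rather than something fixed by the definition.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: x for x > 0, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient of Leaky ReLU: 1 for x > 0, alpha otherwise, so it is never exactly 0."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```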
        
    Exponential Linear Unit (ELU):
$$ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^{x} - 1) & \text{otherwise} \end{cases} $$
    
        Range: (-α, ∞)
        
        Similar to ReLU for positive values but with a smooth curve for negative values.
        
        Because negative inputs still produce non-zero outputs and gradients, ELU avoids "dead" units and keeps the mean activation closer to zero than ReLU does.
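
        A short sketch of ELU, assuming `alpha=1.0` as the default. `math.expm1` is used for the negative branch because it computes e^x - 1 accurately when x is close to 0.

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: x for x > 0, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Gradient of ELU: 1 for x > 0, alpha * e^x otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)

# Large negative inputs saturate toward -alpha instead of being cut to 0.
for x in (-50.0, -1.0, 0.0, 2.0):
    print(x, elu(x), elu_grad(x))
```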
        
        
        
    Softmax Function:
$$ f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}} $$


        Primarily used in the output layer of a neural network for multi-class classification.
        
        It squashes the outputs to a probability distribution over multiple classes, ensuring that the sum of probabilities is 1.
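
        A minimal softmax sketch in plain Python. Subtracting the maximum before exponentiating is a common numerical-stability trick assumed here; it does not change the result because the factor cancels in the ratio. The input scores are made up for illustration.

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])                 # hypothetical scores for 3 classes
print(probs, sum(probs))
```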

## Perceptron

    Let's begin with the simplest ANN, the "Perceptron".
        It is a single-layer neural network used for binary classification tasks, where it distinguishes between two classes.

image of single layer 

- Inputs: $[x_1, x_2, \ldots, x_n]$
- Weights: $[w_1, w_2, \ldots, w_n]$
- Bias: $b$

$$ z = \sum_{i=1}^{n} w_i \cdot x_i + b $$

    The perceptron then makes a prediction based on a threshold or step function.

$$ y = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases} $$
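
    The weighted sum and step function above can be sketched in a few lines. The weights and bias below are hand-picked, hypothetical values that make the perceptron behave like a logical AND; they are not learned.

```python
def perceptron_predict(inputs, weights, bias):
    """Return 1 if z = sum(w_i * x_i) + b is positive, otherwise 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

# Hypothetical parameters implementing AND for two binary inputs.
and_weights = [1.0, 1.0]
and_bias = -1.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron_predict([x1, x2], and_weights, and_bias))
```

    Only the input (1, 1) gives z = 0.5 > 0, so it is the only case classified as 1.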

    This simple model is effective for linearly separable binary classification tasks, but it cannot solve problems that are not linearly separable, such as XOR.

    In practice, single-layer perceptrons have been largely replaced by multi-layer perceptrons (MLPs), which use activation functions such as ReLU, sigmoid, or tanh in their hidden layers, enabling them to learn and approximate non-linear functions.

    Feed Forward Network
        A feedforward network is a sequential network in which information moves in one direction, from input to output.
    
        Data enters through the input layer and is passed to the hidden layers, which learn the patterns in the data.
        
        An activation function is applied to break the linearity, and the final layer generates the output.
        
        The number of nodes in the output layer depends on the nature of the problem.
            e.g. binary -> 2 nodes (or 1 node with a sigmoid output)
                 multiclass -> more than 2 nodes (one per label in the categorical variable)
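
    A forward pass through such a network can be sketched as follows. The layer sizes (3 inputs, 2 hidden units, 2 output nodes) and all weight values are hypothetical and chosen by hand, and the helper names `dense`, `relu`, `softmax`, and `forward` are illustrative.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer: z_j = sum_i weights[j][i] * inputs[i] + biases[j]."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters: 3 inputs -> 2 hidden units -> 2 output classes.
W1 = [[0.5, -0.2, 0.1],
      [-0.3, 0.8, 0.4]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0],
      [-1.0, 1.0]]
b2 = [0.0, 0.0]

def forward(x):
    """Input -> hidden layer (ReLU) -> output layer (softmax probabilities)."""
    hidden = relu(dense(x, W1, b1))
    return softmax(dense(hidden, W2, b2))

probs = forward([1.0, 2.0, 3.0])
print(probs)
```

    Information only flows from `x` to `hidden` to `probs`; nothing is fed back, which is what makes the network feedforward.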

### Multi-layer Perceptron (MLP)
    
    A Multi-layer Perceptron (MLP) is a type of feedforward artificial neural network that consists of multiple layers of nodes, including an input layer, one or more hidden layers, and an output layer.
    
    Input Layer : receives input features
    Hidden Layers : one or more layers that transform the features; many hidden layers make it a deep neural network
    Output Layer : produces the network's output

![image-2.png](attachment:image-2.png)

    Characteristics of MLPs
    
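        Ability to learn complex patterns:
            With non-linear activation functions in the hidden layers, MLPs can model non-linear relationships between inputs and outputs.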
        
        Overfitting: 
            Without proper regularization techniques (e.g., dropout, weight decay), MLPs can overfit to the training data, leading to reduced generalization on unseen data.
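
    As one illustration of the regularization techniques mentioned above, here is a sketch of "inverted" dropout applied to a list of activations. The function name, the `drop_prob` default, and the `rng` parameter are assumptions made for this example; deep learning frameworks provide their own dropout layers.

```python
import random

def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Zero each activation with probability drop_prob during training.

    Kept activations are scaled by 1 / (1 - drop_prob) so the expected value
    is unchanged; at inference time the input is returned as-is.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)                       # seeded for reproducibility
print(dropout([1.0, 2.0, 3.0, 4.0], drop_prob=0.5, rng=rng))
print(dropout([1.0, 2.0, 3.0, 4.0], training=False))
```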

## Back Propagation

    Another important part of how an ANN learns is "Back Propagation", short for backward propagation of errors.
    
    This technique is used to adjust the weights of the network's connections so as to minimize the difference between the actual and predicted outputs.


    During the backward pass, the gradient of the error with respect to each weight is calculated to update the weights and minimize the error. This involves the use of the chain rule to propagate the error backward through the network.
    
    The weights are updated using the learning rate and the calculated gradients:
    
$$ W_{new} = W_{old} - \text{learning rate} \cdot \frac{\partial E_{total}}{\partial W} $$

    This process is repeated iteratively to train the network.
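
    The chain rule and the update rule above can be sketched on a tiny network with one input, one sigmoid hidden unit, and one linear output. The squared-error function, the initial parameter values, the learning rate, and the number of steps are all assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: x -> hidden unit h (sigmoid) -> linear output y."""
    h = sigmoid(params["w1"] * x + params["b1"])
    y = params["w2"] * h + params["b2"]
    return h, y

def error(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2

def backward(params, x, target):
    """Gradients of E = 0.5 * (y - target)^2, obtained with the chain rule."""
    h, y = forward(params, x)
    dE_dy = y - target                    # error at the output
    dE_dh = dE_dy * params["w2"]          # propagated back to the hidden unit
    dE_dz1 = dE_dh * h * (1.0 - h)        # through the sigmoid derivative
    return {"w2": dE_dy * h, "b2": dE_dy, "w1": dE_dz1 * x, "b1": dE_dz1}

def train(params, x, target, learning_rate=0.1, steps=200):
    params = dict(params)                 # leave the caller's dict untouched
    for _ in range(steps):
        grads = backward(params, x, target)
        for name in params:
            # W_new = W_old - learning_rate * dE/dW
            params[name] -= learning_rate * grads[name]
    return params

initial = {"w1": 0.5, "b1": 0.0, "w2": -0.3, "b2": 0.1}   # hypothetical start
trained = train(initial, x=1.0, target=1.0)
print(error(initial, 1.0, 1.0), error(trained, 1.0, 1.0))
```

    The error printed after training should be far smaller than the initial one, since each step moves every parameter against its gradient.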

# Deep ANN

    A deep ANN is a neural network architecture that contains multiple hidden layers between the input and output layers.
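
    One way to see what extra hidden layers add is to count the trainable parameters of a fully connected network. The layer sizes below are hypothetical, and `count_parameters` is an illustrative helper rather than a library function.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

shallow = [4, 8, 3]          # input, one hidden layer, output
deep = [4, 8, 8, 8, 3]       # input, three hidden layers, output
print(count_parameters(shallow), count_parameters(deep))
```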