<p style="color:#153462; 
          font-weight: bold; 
          font-size: 30px; 
          font-family: Gill Sans, sans-serif; 
          text-align: center;">
          Commonly Used Terminology in Neural Networks </p>

### <span style="color:#C738BD; font-weight: bold;">Importance of Bias</span>

**1. What happens without bias?**

*   **Weighted Sum:** Each input to a neuron is multiplied by its corresponding weight. These weighted inputs are then summed together. Let's represent this as:
    ```
    weighted_sum = (input1 * weight1) + (input2 * weight2) + ... + (inputN * weightN)
    ```
*   **Direct Input to Activation Function:** This `weighted_sum` is directly passed to the activation function. There's no additional term added to it.

**2. The "Centering" Effect**

*   **Consider a simple case:** Let's say we have one input (x) and one weight (w). The input to the activation function is simply `x * w`.
    *   If `x = 0`, then `x * w = 0`, regardless of the value of `w`.
    *   If `w = 0`, then `x * w = 0`, regardless of the value of `x`.
*   **Generalizing:** In the case of multiple inputs and weights, if all inputs are zero, the `weighted_sum` will always be zero.

**3. Impact on Activation Functions**

*   Many common activation functions behave in a specific way around zero:
    *   **Sigmoid:** The sigmoid function outputs 0.5 when the input is 0.
    *   **Tanh:** The tanh function outputs 0 when the input is 0.
    *   **ReLU:** The ReLU function outputs 0 when the input is 0.

*   Because the `weighted_sum` is directly fed into the activation function, and the `weighted_sum` is zero when all inputs are zero, the output of the activation function is determined by its behavior at zero. This is what we mean by "centered around the origin."

**4. Why is this a limitation?**

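*   This means the neuron's output for an all-zero input is fixed by the activation function (0.5 for sigmoid, 0 for tanh and ReLU), and no choice of weights can change it. More generally, the point at which the neuron switches on cannot be shifted: it cannot activate based on a threshold or offset. This limits the types of patterns the network can learn.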

**In summary:** Without a bias term, the neuron's activation is strictly dependent on the weighted sum of its inputs. When all inputs are zero, the activation function is evaluated at zero, effectively centering its behavior around the origin and limiting the neuron's ability to learn diverse patterns.
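
The effect described above can be checked with a few lines of plain Python. This is a minimal sketch, not a reference implementation: the `neuron` helper, the weight values, and the bias value of 1.5 are illustrative assumptions that do not come from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of the inputs plus an optional bias, passed through an activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)

zero_inputs = [0.0, 0.0, 0.0]
weights = [0.7, -1.2, 2.5]  # arbitrary illustrative values

# Without a bias, the weights have no influence when every input is zero.
no_bias_sigmoid = neuron(zero_inputs, weights)               # sigmoid(0) = 0.5
no_bias_relu = neuron(zero_inputs, weights, activation=relu)  # relu(0) = 0.0

# With a bias, the neuron can produce a different output for the same input.
with_bias_relu = neuron(zero_inputs, weights, bias=1.5, activation=relu)  # relu(1.5) = 1.5
```

Changing `weights` leaves the two bias-free outputs unchanged, which is the limitation the summary describes.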

### Linear and Non-Linear

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       A neural network without an activation function in any of its layers is called a linear neural network: however many layers it has, it can only represent a linear function of its inputs. A neural network that uses activation functions such as ReLU, sigmoid, or tanh in one or more of its layers is called a non-linear neural network. Introducing non-linearity helps the model learn complex patterns.
   </font>
</p>
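
To see why stacking layers without activation functions adds no expressive power, the sketch below composes two one-dimensional linear layers and compares the result with a single equivalent layer. The weight and bias values are arbitrary assumptions chosen for illustration.

```python
def linear_layer(x, w, b):
    return w * x + b

def relu(z):
    return max(0.0, z)

# Illustrative parameters for two stacked layers (assumed values).
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    # No activation between the layers.
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

def collapsed_layer(x):
    # w2 * (w1 * x + b1) + b2 == (w2 * w1) * x + (w2 * b1 + b2)
    return linear_layer(x, w2 * w1, w2 * b1 + b2)

def two_layers_with_relu(x):
    # A ReLU between the layers breaks the equivalence above.
    return linear_layer(relu(linear_layer(x, w1, b1)), w2, b2)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_outputs = [two_linear_layers(x) for x in xs]
collapsed_outputs = [collapsed_layer(x) for x in xs]
relu_outputs = [two_layers_with_relu(x) for x in xs]
```

Here `linear_outputs` and `collapsed_outputs` match for every input, while `relu_outputs` is flat for the negative pre-activations and then bends, which no single linear layer can reproduce.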

### Epoch

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       An epoch in a neural network is a single pass through the entire training dataset. This means that the network will see each training example once during an epoch. The number of epochs is a hyperparameter that determines how many times the network will see the entire training dataset.

The number of epochs required to train a neural network depends on several factors, including the size of the training dataset, the complexity of the network, and the learning rate. A more complex network or a harder task often needs more epochs to converge. The effect of dataset size is less direct: a larger dataset provides more weight updates per epoch, so the number of epochs needed does not necessarily grow with it. The learning rate controls how much the weights of the network are updated at each optimization step. A higher learning rate can make the network learn faster, but if it is too high, training may become unstable or fail to converge.

The best way to determine the number of epochs to use is to experiment with different values and see what works best for the specific problem. A common approach is to start with a small number of epochs and then increase the number of epochs until the performance of the network stops improving.

Here is an example. Let's say we have a training dataset of 1000 examples and a neural network with 10 layers. If we set the number of epochs to 1, then the network will see each training example once during the epoch. If we set the number of epochs to 10, then the network will see each training example 10 times during the training process.

It is important to note that more epochs are not always better. If the network is overfitting the training dataset, then increasing the number of epochs will not improve the performance of the network on the test dataset. In this case, it is better to use a technique called early stopping, which stops the training process when the performance of the network on a held-out validation dataset starts to deteriorate.
   </font>
</p>
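
The idea of epochs and early stopping can be sketched with a toy training loop. Everything here is an assumption made for illustration: the one-weight model `y = w * x`, the tiny datasets, the `patience` and `min_delta` settings, and the `train` function itself. Real frameworks provide their own training loops and early-stopping utilities.

```python
def train(train_data, val_data, max_epochs=100, learning_rate=0.05,
          patience=3, min_delta=1e-9):
    w = 0.0  # single learnable weight of the toy model y = w * x
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0

    for _ in range(max_epochs):
        # One epoch: a single pass over every training example.
        for x, y in train_data:
            gradient = 2 * (w * x - y) * x  # d/dw of the squared error
            w -= learning_rate * gradient
        epochs_run += 1

        # Early stopping monitors the loss on held-out validation data.
        val_loss = sum((w * x - y) ** 2 for x, y in val_data) / len(val_data)
        if val_loss < best_val_loss - min_delta:
            best_val_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving

    return w, epochs_run

train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
val_data = [(4.0, 8.0)]
w, epochs_run = train(train_data, val_data)
```

With these settings the loop stops well before `max_epochs`, because the validation loss stops improving once `w` is close to 2.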

### What are Parameters in Neural Networks?

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       <p>In neural networks, parameters refer to the learnable components of the model that are updated during the training process. Parameters are the key elements that define the behavior and functionality of a neural network. They are used to represent the weights and biases of the network's layers.</p>
  
  <b>Types of Parameters</b>
  <p>There are two main types of parameters in neural networks:</p>
  <ol>
    <li><strong>Weights:</strong> Weights are the values associated with the connections between neurons in the network. Each connection between two neurons has an associated weight, which determines the strength or importance of that connection. Weights are adjusted during the training process to optimize the network's performance on a given task.</li>
    <li><strong>Biases:</strong> Biases are additional learnable parameters that are added to each neuron in a layer. They provide an additional degree of freedom to the model, allowing it to fit the data more accurately. Biases help shift the activation function of each neuron and control the overall output of the neuron. Like weights, biases are updated during the training process.</li>
  </ol>
  <b>Training Neural Networks</b>
  <p>Parameters in neural networks are typically initialized randomly, and their values are updated through an optimization algorithm such as gradient descent or its variants. The objective is to find the optimal set of parameter values that minimize the loss function, which measures the discrepancy between the model's predictions and the actual targets.</p>
  <p>The process of training a neural network involves iteratively adjusting the parameters based on the gradients of the loss function with respect to those parameters. This iterative process aims to find the optimal parameter values that enable the network to make accurate predictions on unseen data.</p>
   </font>
</p>
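
As a rough illustration of how weights and biases add up, the sketch below counts the parameters of a fully connected network. It assumes every layer is dense with one bias per neuron; the `count_parameters` helper and the layer sizes are hypothetical and only serve the example.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, starting with the input layer.
    """
    # Each pair of consecutive layers is connected by n_in * n_out weights.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # Every neuron outside the input layer has one bias.
    biases = sum(layer_sizes[1:])
    return weights, biases

# 3 inputs, one hidden layer of 4 neurons, 2 outputs (assumed sizes).
n_weights, n_biases = count_parameters([3, 4, 2])
total_parameters = n_weights + n_biases  # 20 weights + 6 biases = 26
```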

<img src="images\how_nn_works.jpg" alt="how_nn_works" style="width: 600px;"/>

### Difference between loss & cost functions?

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       Generally, both terms refer to a function that measures the discrepancy or
       error between the predicted outputs of a machine learning model and the
       true labels. The goal is to minimize this discrepancy during the training
       process.<br><br>
       <b>Loss Function</b>:<br>
       A loss function, also known as an error function, is
       typically defined for an individual training example or data point. It
       quantifies the error between the predicted output of the model and the
       true label associated with that particular data point. In other words,
       it measures how well the model performs on a single instance.<br><br>
       The choice of a specific loss function depends on the type of machine
       learning task being performed. For example, in classification tasks,
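       common loss functions include cross-entropy loss (sometimes called softmax loss
       when paired with a softmax output) and hinge loss. For regression tasks, squared
       error and absolute error are commonly used per example; averaged over a dataset,
       they give the mean squared error (MSE) and mean absolute error (MAE).<br><br>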
       During training, the loss function is evaluated for each training example,
       and the model's parameters are adjusted to minimize the cumulative error 
       across all examples. This optimization process, typically performed using
       gradient descent or its variants, aims to find the set of model parameters
       that minimizes the average loss over the entire training dataset.<br><br>
       <b>Cost Function</b>:<br>
       A cost function, also known as the objective function or the average 
       loss, measures the overall performance of the model by aggregating the
       individual losses from all training examples. It represents the average
       loss over the entire training dataset.<br><br>
       The cost function is computed by taking the average or the sum (depending
       on the specific context) of the individual losses over the training examples.
       It provides a measure of how well the model is performing on average across
       the entire dataset.<br><br>
       The cost function is used in the training process to guide the 
       model's optimization. The goal is to find the optimal set of model
       parameters that minimizes the cost function. By minimizing the cost function,
       the model aims to reduce the overall error or discrepancy between its
       predictions and the true labels across the entire dataset.<br><br>
       In summary, the loss function quantifies the error between the predicted
       outputs and true labels for individual training examples, while the cost
       function measures the overall performance of the model by aggregating
       the individual losses across the entire training dataset. The cost function
       guides the model's optimization process by providing a single scalar
       value to minimize during training.
   </font>
</p>
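
The distinction can be made concrete with a small sketch that uses squared error as the per-example loss and its mean as the cost. The function names and the sample predictions and targets are assumptions chosen for illustration.

```python
def squared_error(prediction, target):
    """Loss: the error for a single training example."""
    return (prediction - target) ** 2

def mean_squared_error(predictions, targets):
    """Cost: the average of the per-example losses over the dataset."""
    losses = [squared_error(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

predictions = [2.5, 0.0, 2.0, 8.0]
targets = [3.0, -0.5, 2.0, 7.0]

per_example_losses = [squared_error(p, t) for p, t in zip(predictions, targets)]
# per_example_losses == [0.25, 0.25, 0.0, 1.0]

cost = mean_squared_error(predictions, targets)
# cost == 0.375, the single scalar that training would try to minimize
```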

### Softmax Activation Function

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       The softmax function is often used as the last activation function of a neural network to normalize the network's output into a probability distribution over the predicted output classes.
   </font>
</p>

<img src="images\softmax.png" alt="Drawing" style="width: 600px;"/>

$$\sigma(\vec{Z})_i = \frac{e^{Z_{i}}}{\sum_{j=1}^{K} e^{Z_{j}}}$$

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
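       Where,<br>
       $\sigma$ = softmax function<br>
       $\vec{Z}$ = input vector (one raw score per class)<br>
       $e^{Z_{i}}$ = standard exponential function applied to the $i$-th element of the input vector<br>
       $K$ = number of classes in the multi-class classifier<br>
       $e^{Z_{j}}$ = standard exponential function applied to the $j$-th element of the input vector; the denominator sums these terms over all $K$ classes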
   </font>
</p>
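
The formula translates almost directly into code. The sketch below is one possible implementation in plain Python; subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the formula itself (it leaves the result unchanged), and the input scores are assumed values.

```python
import math

def softmax(z):
    """Map a list of raw scores to probabilities that sum to 1."""
    max_z = max(z)  # shift for numerical stability; does not change the result
    exps = [math.exp(value - max_z) for value in z]
    total = sum(exps)  # the denominator: sum over all K classes
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # illustrative input vector with K = 3
probabilities = softmax(scores)
# roughly [0.659, 0.242, 0.099]
```

The largest score receives the largest probability, and the ordering of the classes is preserved.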

### Cross Entropy

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       <b>TODO</b>: Yet to explore more <br>
       Cross-entropy is a commonly used cost function for classification tasks, especially when the predicted outputs are probabilities. It measures the average logarithmic loss between the predicted class probabilities and the true labels.
   </font>
</p>
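
As a minimal sketch of the definition above, the code below computes cross-entropy for integer class labels: the loss for one example is the negative logarithm of the probability assigned to the true class, and the cost is the mean over examples. The function names and the probability values are assumptions for illustration.

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(predicted_probs[true_class])

def mean_cross_entropy(batch_probs, true_classes):
    """Cost: the average loss over a batch of examples."""
    losses = [cross_entropy(p, c) for p, c in zip(batch_probs, true_classes)]
    return sum(losses) / len(losses)

confident_and_right = cross_entropy([0.9, 0.05, 0.05], true_class=0)  # about 0.105
confident_and_wrong = cross_entropy([0.1, 0.8, 0.1], true_class=0)    # about 2.303

average_loss = mean_cross_entropy(
    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]],
    [0, 0],
)
```

A confident but wrong prediction is penalized far more heavily than a confident and correct one.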

### Logits

In a neural network for classification tasks, the output layer typically produces a raw, unnormalized score for each class. These scores are often referred to as "logits." A larger logit indicates higher model confidence in that class, and a function such as softmax converts the logits into probabilities.

In code that calls a model such as XLNet or BERT, <i>outputs</i> typically represents the object returned by the model, which includes the <i>logits</i> among other things. By accessing <code>outputs.logits</code>, we obtain the raw scores for each class.
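
The sketch below shows how logits are typically turned into class predictions. The `outputs` object here is a hypothetical stand-in built with `SimpleNamespace`, not the output of a real XLNet or BERT model, and the logit values are made up for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a model output: two examples, three classes each.
outputs = SimpleNamespace(logits=[[1.2, -0.3, 3.1], [0.4, 2.2, -1.0]])

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

# The predicted class is the index of the largest logit. Softmax preserves the
# ordering of the scores, so applying it first would give the same prediction.
predicted_classes = [argmax(row) for row in outputs.logits]
# predicted_classes == [2, 1]
```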

### <span style="color:#3C4048; font-weight: bold; font-size: 18px; font-family: Gill Sans, sans-serif;">Multimodal Models</span>

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
      We, humans, rely on our five senses to interpret the world around us. We use our senses of sight, hearing, touch,
      taste, and smell to gather information about our environment and make sense of it.<br><br>
      In a similar vein, multimodal learning is an exciting new field of AI that seeks to replicate this ability by
      combining information from multiple modalities. By integrating information from diverse sources such as text, image,
      audio, and video, multimodal models can build a richer and more complete understanding of the underlying data,
      unlock new insights, and enable a wide range of applications. 
   </font>
</p>

Source: <a href="https://www.kdnuggets.com/2023/03/multimodal-models-explained.html">Multimodal Models Explained</a>