# "MLP Notation"
Lecture 8, CampusX

***

### Lecture Notes: Multi-layer Perceptron (MLP) Notation

**1. Introduction and Importance of Notation**
*   This video builds on the previous discussion of Multi-layer Perceptron (MLP) intuition, which explored how MLPs work and why they are effective.
*   The most challenging aspect of understanding MLPs is often the **training algorithm, known as Backpropagation**.
*   A common source of confusion when learning backpropagation is the large number of weights and biases in a neural network. Without a proper system of notation, it is hard to tell these parameters apart during the longer calculations.
*   **The primary goals of this video are**:
    1.  To learn how to **calculate the total number of trainable parameters** (weights and biases) in any given neural network architecture.
    2.  To establish a **standardised notation for weights, biases, and outputs** that is commonly followed in the industry, to avoid confusion during backpropagation.

**2. Neural Network Architecture Setup**

<img src='https://i.ibb.co/twpX7JCM/image.png'>

*   The lecture demonstrates the notation on one specific neural network architecture (shown above).
*   This architecture consists of **four layers in total**:
    *   **Layer 0**: The Input Layer.
    *   **Layer 1**: The first Hidden Layer.
    *   **Layer 2**: The second Hidden Layer.
    *   **Layer 3**: The Output Layer.
*   The input data for this example is **four-dimensional**, meaning each input instance has four features or columns.

**3. Calculating Trainable Parameters (Weights and Biases)**
*   **Trainable parameters** are the values of weights and biases that the backpropagation algorithm will determine during the training process of the model.
*   It is crucial to be able to calculate these from any given architecture.
*   **For the demonstrated architecture**:
    *   **From Input Layer (Layer 0) to Hidden Layer 1 (Layer 1)**:
        *   Layer 0 has 4 nodes and Layer 1 has 3 nodes.
        *   **Weights**: 4 (nodes in Layer 0) × 3 (nodes in Layer 1) = **12 weights**.
        *   **Biases**: 3 (biases, one for each node in Layer 1).
        *   *Subtotal for this segment*: 12 weights + 3 biases = **15 parameters**.
    *   **From Hidden Layer 1 (Layer 1) to Hidden Layer 2 (Layer 2)**:
        *   Layer 1 has 3 nodes and Layer 2 has 2 nodes.
        *   **Weights**: 3 (nodes in Layer 1) × 2 (nodes in Layer 2) = **6 weights**.
        *   **Biases**: 2 (biases, one for each node in Layer 2).
        *   *Subtotal for this segment*: 6 weights + 2 biases = **8 parameters**.
    *   **From Hidden Layer 2 (Layer 2) to Output Layer (Layer 3)**:
        *   Layer 2 has 2 nodes and Layer 3 has 1 node.
        *   **Weights**: 2 (nodes in Layer 2) × 1 (node in Layer 3) = **2 weights**.
        *   **Biases**: 1 (bias, for the node in Layer 3).
        *   *Subtotal for this segment*: 2 weights + 1 bias = **3 parameters**.
    *   **Total Trainable Parameters for the entire network**: 15 + 8 + 3 = **26 parameters**. This means the backpropagation algorithm will find the values for these 26 weights and biases.
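
The layer-by-layer count above generalises to any fully connected architecture. The following is a minimal sketch, assuming every node in one layer connects to every node in the next; the function name `count_parameters` is a hypothetical helper, not something from the lecture.

```python
def count_parameters(layer_sizes):
    """Return (weights, biases, total) for a fully connected network.

    layer_sizes lists the number of nodes per layer, input layer first.
    """
    weights = 0
    biases = 0
    # Pair each layer with the layer that follows it.
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights += n_in * n_out  # one weight per connection
        biases += n_out          # one bias per node in the receiving layer
    return weights, biases, weights + biases


# The 4-3-2-1 architecture from these notes.
weights, biases, total = count_parameters([4, 3, 2, 1])
print(weights, biases, total)  # 20 6 26
```

The input layer contributes no biases because it only passes the features through.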

**4. Notation for Biases (b)**
*   The notation for biases is straightforward and uses two indices.
*   **Standard Notation**: **`b_i^j`**
    *   **`i`**: Represents the **layer number**.
    *   **`j`**: Represents the **node number** within that layer.
*   **Examples**:
    *   **`b_1^1`**: Bias for the first node in Layer 1.
    *   **`b_1^2`**: Bias for the second node in Layer 1.
    *   **`b_2^1`**: Bias for the first node in Layer 2.
    *   **`b_3^1`**: Bias for the first (and only) node in Layer 3.

**5. Notation for Outputs (o)**
*   The notation for outputs is identical to that for biases.
*   **Standard Notation**: **`o_i^j`**
    *   **`i`**: Represents the **layer number**.
    *   **`j`**: Represents the **node number** within that layer.
*   Any output originating from a node will follow this notation.
*   **Examples**:
    *   **`o_1^1`**: Output from the first node in Layer 1.
    *   **`o_1^2`**: Output from the second node in Layer 1.
    *   **`o_2^1`**: Output from the first node in Layer 2.
    *   **`o_3^1`**: Output from the first (and only) node in Layer 3.

Example:
<img src='https://miro.medium.com/v2/resize:fit:1400/1*2vLiWsyesKLAfDcezIfBRQ.png'>

**6. Notation for Weights (W)**
*   The notation for weights is slightly more complex, requiring three indices.
*   **Standard Notation**: **`W_k_i^j`**
    *   **`k`**: Represents the **layer number that the weight is entering**. This is the layer containing the destination node.
    *   **`i`**: Represents the **node number in the previous layer from which the weight is originating**.
    *   **`j`**: Represents the **node number in the current layer (layer `k`) that the weight is entering**.
*   **Examples (referencing the network diagram in the source)**:
    *   **`W_1_1^1`**: Weight entering **Layer 1**, originating from the **1st node of the previous layer** (Layer 0), and entering the **1st node of Layer 1**.
    *   **`W_1_4^2`**: Weight entering **Layer 1**, originating from the **4th node of the previous layer** (Layer 0), and entering the **2nd node of Layer 1**.
    *   **`W_1_1^3`**: Weight entering **Layer 1**, originating from the **1st node of the previous layer** (Layer 0), and entering the **3rd node of Layer 1**.
    *   **`W_2_2^2`**: Weight entering **Layer 2**, originating from the **2nd node of the previous layer** (Layer 1), and entering the **2nd node of Layer 2**.
    *   **`W_3_1^1`**: Weight entering **Layer 3**, originating from the **1st node of the previous layer** (Layer 2), and entering the **1st node of Layer 3**.

Example:

<img src='https://miro.medium.com/v2/resize:fit:1200/1*n5YNnh_vG2exS-YnjDPoPA.png'>
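
One way to check your understanding of the indices is to enumerate every valid bias and weight for the 4-3-2-1 network. This is only an illustrative sketch: storing parameters under tuple keys and the helper `weight_name` are assumptions made here, not part of the lecture.

```python
layer_sizes = [4, 3, 2, 1]  # nodes in Layer 0..3, as in these notes

# Biases (and outputs) are identified by (layer, node); nodes count from 1.
bias_keys = [
    (layer, node)
    for layer in range(1, len(layer_sizes))
    for node in range(1, layer_sizes[layer] + 1)
]

# Weights are identified by (k, i, j):
#   k = layer the weight enters,
#   i = source node in layer k - 1,
#   j = destination node in layer k.
weight_keys = [
    (k, i, j)
    for k in range(1, len(layer_sizes))
    for i in range(1, layer_sizes[k - 1] + 1)
    for j in range(1, layer_sizes[k] + 1)
]


def weight_name(k, i, j):
    """Format a key in the W_k_i^j style used in these notes."""
    return f"W_{k}_{i}^{j}"


print(len(bias_keys), len(weight_keys))  # 6 20
print(weight_name(1, 4, 2))              # W_1_4^2
print((1, 4, 2) in weight_keys)          # True
print((3, 2, 2) in weight_keys)          # False: Layer 3 has only one node
```

The counts match the 20 weights and 6 biases behind the 26 trainable parameters calculated in section 3.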


***

# "Multi Layer Perceptron | MLP Intuition"

Lecture 9, CampusX deep learning playlist

***

### Lecture Notes: Multi-layer Perceptron (MLP) Intuition

**1. Introduction: Overcoming the Perceptron's Limitation**
*   The fundamental problem with a single Perceptron is its **inability to create non-linear decision boundaries**; it can only draw a linear boundary (a straight line in two dimensions).
*   This limitation means a Perceptron cannot capture non-linear relationships in data, as seen with complex datasets where a straight line cannot separate classes (e.g., data requiring a curved boundary).
*   The solution is the **Multi-layer Perceptron (MLP)**, which combines multiple Perceptrons to form a larger neural network.
*   MLPs act as **universal function approximators**: given enough hidden nodes, they can approximate a very wide range of non-linear decision boundaries.

**2. Perceptron (Logistic Regression) in this Context**
*   For the purpose of this explanation, the Perceptron used has a **Sigmoid activation function** and **Log Loss**.
*   This setup essentially makes the Perceptron behave like a **Logistic Regression model**.
*   Instead of binary outputs (0 or 1), it provides a **probability between 0 and 1** (e.g., probability of placement).
*   **How it works**:
    *   Input features (e.g., CGPA, IQ) are multiplied by weights (w1, w2) and summed with a bias (`Z = w1*CGPA + w2*IQ + bias`).
    *   This `Z` value is then passed through the Sigmoid function (`1 / (1 + e^-Z)`) to produce a probability.
    *   The **decision boundary** is where this probability is **0.5**.
    *   Points further away from this line will have probabilities increasingly closer to 0 or 1, forming a gradient of probabilities.
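
The steps above can be sketched in a few lines. The weights, bias, and student values below are hypothetical numbers chosen so the arithmetic is easy to follow; they are not learned from data or taken from the lecture.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron_probability(cgpa, iq, w1, w2, bias):
    """Z = w1*CGPA + w2*IQ + bias, followed by the Sigmoid."""
    z = w1 * cgpa + w2 * iq + bias
    return sigmoid(z)


# Hypothetical parameters, for illustration only.
w1, w2, bias = 1.0, 0.0625, -12.0

on_boundary = perceptron_probability(6.0, 96.0, w1, w2, bias)   # Z = 0
above = perceptron_probability(9.0, 112.0, w1, w2, bias)        # Z = 4
below = perceptron_probability(5.0, 80.0, w1, w2, bias)         # Z = -2

print(on_boundary)      # 0.5
print(round(above, 3))  # 0.982
print(round(below, 3))  # 0.119
```

A point with `Z = 0` sits exactly on the decision boundary, and the probability moves towards 1 or 0 as `Z` grows in either direction.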

**3. The Core Idea of MLP: Combining Multiple Perceptrons**
*   To capture non-linearity, an MLP uses **more than one Perceptron** to solve the same problem.
*   **Intuition (Abstract Idea)**: Imagine two separate Perceptrons, each creating its own linear decision boundary. The idea is to somehow "superimpose" these boundaries on top of each other and then "smooth" them out to create a more complex, non-linear boundary. This initial explanation is purely intuitive without mathematical detail.

**4. Mathematical Justification: Linear Combination with Sigmoid**
*   Consider a single data point (student) and two Perceptrons. Each Perceptron will output a probability (e.g., Perceptron 1: 0.7, Perceptron 2: 0.8).
*   **Initial thought**: Simply add the probabilities (`0.7 + 0.8 = 1.5`).
*   **Problem**: Probabilities must be between 0 and 1. A sum can exceed 1.
*   **Solution**: Sum the probabilities from the individual Perceptrons, and then pass this sum through *another* **Sigmoid function**.
    *   Example: `Sigmoid(P_Perceptron1 + P_Perceptron2)`. This new model would then output a valid probability (e.g., 0.82).
*   This process of addition followed by a Sigmoid function mathematically achieves the "superimposition and smoothing" previously discussed conceptually.
*   This is essentially creating a **linear combination of multiple Perceptrons**.
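
The arithmetic in this example is easy to verify. A minimal sketch, using the two example probabilities from above:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


p1, p2 = 0.7, 0.8            # outputs of Perceptron 1 and Perceptron 2
raw_sum = p1 + p2            # not a valid probability on its own
combined = sigmoid(raw_sum)  # squashed back into (0, 1)

print(round(raw_sum, 2), round(combined, 2))  # 1.5 0.82
```

Because the unweighted sum of two probabilities always lies between 0 and 2, this combined output can only range from 0.5 to about 0.88, which is one reason weights and a bias are needed on the combination.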

**5. Adding Flexibility: Weighted Combinations and Bias**
*   To allow different Perceptrons to have more or less influence on the final decision, **weights** can be assigned to their outputs.
    *   Instead of `P_Perceptron1 + P_Perceptron2`, it becomes `w_new1 * P_Perceptron1 + w_new2 * P_Perceptron2`.
    *   Example: `10 * P_Perceptron1 + 5 * P_Perceptron2` gives Perceptron 1's output twice the weight of Perceptron 2's.
*   Additionally, a **bias term** can be added to this weighted sum: `Sigmoid(w_new1 * P_Perceptron1 + w_new2 * P_Perceptron2 + bias_new)`.
*   **Key Realisation**: The entire operation of taking outputs from previous Perceptrons, applying new weights and a bias, and then passing it through a Sigmoid function, is itself the operation of **another Perceptron**.
*   Therefore, an MLP is essentially a **combination of Perceptrons where the outputs of some Perceptrons serve as inputs to subsequent Perceptrons**.
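
To make the key realisation concrete, the sketch below writes the combining step as an ordinary Perceptron whose inputs happen to be the outputs of two other Perceptrons. The weights 10 and 5 come from the example above; the bias of -10 is a hypothetical value chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the Sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Outputs of the two first-stage Perceptrons for one student.
p1, p2 = 0.7, 0.8

# The combining step is just another Perceptron.
# Z = 10*0.7 + 5*0.8 - 10 = 1
final_probability = perceptron([p1, p2], weights=[10, 5], bias=-10)
print(round(final_probability, 2))  # 0.73
```

The same `perceptron` function could equally produce `p1` and `p2` from the raw features, which is exactly the stacking that defines an MLP.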

**6. MLP Architecture: Input, Hidden, and Output Layers**
*   An MLP consists of at least three types of layers:
    *   **Input Layer (Layer 0)**: Receives the raw features of the data (e.g., CGPA, IQ).
    *   **Hidden Layer(s)**: Contains multiple Perceptrons (nodes). Each hidden node takes inputs from the previous layer (e.g., input layer), applies its own weights and bias, and uses an activation function (e.g., Sigmoid) to produce an output. This is where the non-linear transformations and feature learning primarily occur.
    *   **Output Layer**: Contains one or more Perceptrons that take inputs from the last hidden layer and produce the final output of the network (e.g., probability of placement).
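
The layer structure described above can be sketched as a loop that feeds each layer's outputs into the next. This is a minimal sketch under stated assumptions: every layer is fully connected and uses the Sigmoid, the nested-list layout (`weights[j][i]` is the weight from node `i` of the previous layer into node `j`) is a choice made here, and all numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(features, layers):
    """Run one input through the network and return every layer's outputs.

    layers is a list of (weights, biases) pairs, one per non-input layer.
    """
    activations = list(features)
    all_outputs = [activations]  # Layer 0 simply holds the raw features
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(node_weights, activations)) + b)
            for node_weights, b in zip(weights, biases)
        ]
        all_outputs.append(activations)
    return all_outputs


# Hypothetical 2-2-1 network: two features, two hidden nodes, one output node.
layers = [
    ([[0.5, -0.25], [0.1, 0.2]], [0.0, -0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                      # output layer
]
outputs = forward([8.0, 1.1], layers)
print([len(layer) for layer in outputs])  # [2, 2, 1]
print(0.0 < outputs[-1][0] < 1.0)         # True
```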

**7. Modifying MLP Architecture for Enhanced Flexibility**
*   Neural network architecture defines how many layers and nodes (Perceptrons) the network has and how they are connected; the weights on those connections are learned during training.
*   **1. Increase Number of Nodes in Hidden Layer(s)**:
    *   Adding more Perceptrons to a hidden layer allows the network to create more distinct linear decision boundaries internally.
    *   This enables the MLP to capture **more complex non-linear relationships** in the data.
*   **2. Increase Number of Nodes in Input Layer**:
    *   This is done when the dataset has **more input features/columns** (e.g., adding "12th marks" to CGPA and IQ).
    *   More input nodes increase the dimensionality of the input space (e.g., a single Perceptron's decision boundary goes from a line in 2D to a plane in 3D).
*   **3. Increase Number of Nodes in Output Layer**:
    *   Primarily used for **multi-class classification problems** (e.g., classifying an image as Dog, Cat, or Human).
    *   Each output node corresponds to a specific class, providing a probability for that class.
*   **4. Increase Number of Hidden Layers**:
    *   This is the basis of **Deep Neural Networks**.
    *   Multiple hidden layers allow the network to learn increasingly **abstract and hierarchical features** from the data.
    *   Early layers might capture simple patterns, while deeper layers combine these patterns to understand more complex relationships.
    *   Given enough nodes and layers, neural networks are **universal function approximators**: they can approximate almost any continuous function, although finding suitable weights still depends on training.

**8. TensorFlow Playground Demonstration**
*   The demonstration in TensorFlow Playground (`playground.tensorflow.org`) illustrates these concepts:
    *   A single Perceptron fails to converge on **XOR data**, which is not linearly separable.
    *   A small **MLP (with two hidden nodes)** successfully learns to classify XOR data quickly.
    *   For more complex non-linear datasets (e.g., concentric circles, spirals), MLPs can converge by adding more hidden layers and nodes.
    *   Changing **activation functions** (e.g., to ReLU) can significantly improve training speed and convergence for complex data.
    *   The tool visually shows how each layer transforms the decision boundaries, with early layers often starting with linear boundaries and deeper layers creating more complex, non-linear ones.
    *   It highlights that MLPs can capture the "essence" of even highly complex, non-linear data.
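
The XOR result can also be checked by hand. The sketch below is not what TensorFlow Playground learns; it uses hand-picked, hypothetical weights to show that two hidden Sigmoid nodes are enough. One hidden node behaves roughly like OR, the other like NAND, and the output node combines them roughly like AND.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def xor_mlp(x1, x2):
    """A 2-2-1 MLP with hand-set (not trained) weights."""
    h_or = sigmoid(20 * x1 + 20 * x2 - 10)     # near 1 if either input is 1
    h_nand = sigmoid(-20 * x1 - 20 * x2 + 30)  # near 0 only if both inputs are 1
    return sigmoid(20 * h_or + 20 * h_nand - 30)  # near 1 only if both hidden nodes fire


predictions = [
    round(xor_mlp(x1, x2)) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]
]
print(predictions)  # [0, 1, 1, 0]
```

Each hidden node on its own draws a straight line; it is the output node combining the two that carves out the XOR region.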

**9. Conclusion**
*   MLPs solve the problem of non-linear data by creating **linear combinations of multiple Perceptrons**, which themselves are combined into layers.
*   This architecture allows them to act as **universal function approximators**, capable, given enough nodes and layers, of capturing highly complex non-linear relationships within data.

***