# Lecture Notes

Class 6, Deep Learning playlist (CampusX)

# Perceptron Loss Functions

## 1. Introduction to Perceptron

*   A **Perceptron is fundamentally a mathematical model** based on a biological neuron.
*   It processes inputs (e.g., CGPA, IQ) with associated weights and a bias.
*   The core operation is a **dot product** of inputs and weights, summed with the bias, producing a value `z`.
*   This `z` value then goes through a **step activation function** for binary classification:
    *   If `z >= 0`, output is `1`.
    *   If `z < 0`, output is `0`.
*   **Geometric Intuition**: A Perceptron represents a line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions), separating data into two regions (positive and negative) for binary classification.
*   Like other machine learning algorithms, Perceptrons have two phases:
    1.  **Prediction**: Generating an output based on inputs.
    2.  **Training**: Calculating optimal weights and biases to improve prediction accuracy.
*   The **Perceptron Rule** was introduced as a preliminary, "makeshift" (jugaad) technique for training, where the line adjusts if a point is misclassified.
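
The prediction step described above can be sketched in a few lines of Python. The weights, bias, and student values below are made-up illustrative numbers, not values from the lecture.

```python
def predict(x, w, b):
    """Perceptron prediction: dot product plus bias, then a step function."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # z = w . x + b
    return 1 if z >= 0 else 0                     # step activation

# Hypothetical weights for the inputs (CGPA, IQ) and a hypothetical bias.
w = [1.0, 0.05]
b = -12.0

print(predict([8.0, 100.0], w, b))  # z is about 8 + 5 - 12 = 1, so the output is 1
print(predict([6.0, 90.0], w, b))   # z is about 6 + 4.5 - 12 = -1.5, so the output is 0
```

Changing `w` or `b` moves the decision line, which is exactly what the training phase is meant to do.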

## 2. Problems with the Perceptron Rule

While the Perceptron Rule often works, it has significant drawbacks:

*   **No Guarantee of Best Values**: It cannot definitively confirm that the obtained weights (`w1`, `w2`) and bias (`b`) represent the **absolute best line** for classification. Different runs might yield different satisfactory lines.
*   **Lack of Quantification**: It does not provide a quantitative measure of **how good** the classification is. If a point is correctly classified, the line doesn't change, regardless of how close or far it is from the boundary.
*   **Potential Convergence Issues**: In some rare scenarios, particularly with random point selection, the model might fail to converge to an optimal solution (e.g., if it repeatedly picks correctly classified points, the line won't move).
*   The lecture describes the rule as a "jugaad" that works about **98% of the time**; the remaining 2% (roughly 1% convergence issues and 1% uncertainty about optimality) is what makes it unreliable.
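
A minimal sketch of the Perceptron Rule with random point selection, assuming a made-up, linearly separable toy dataset and 0/1 labels. It is meant to illustrate the first two drawbacks: two runs with different random seeds may end on different lines, and the rule itself gives no number that says which line is better. The function names and the data are hypothetical.

```python
import random

def step(z):
    return 1 if z >= 0 else 0

def perceptron_rule(X, y, steps=5000, lr=0.1, seed=0):
    """Pick a random point each step; move the line only if that point is misclassified."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(steps):
        i = rng.randrange(len(X))
        z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
        error = y[i] - step(z)  # 0 if correct, +1 or -1 if misclassified
        w = [wj + lr * error * xj for wj, xj in zip(w, X[i])]
        b += lr * error
    return w, b

def accuracy(X, y, w, b):
    hits = sum(
        step(sum(wj * xj for wj, xj in zip(w, xi)) + b) == yi
        for xi, yi in zip(X, y)
    )
    return hits / len(X)

# Toy, linearly separable data (made-up values).
X = [[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]]
y = [0, 0, 0, 1, 1, 1]

for seed in (0, 1):
    w, b = perceptron_rule(X, y, seed=seed)
    print(f"seed={seed} w={w} b={b} accuracy={accuracy(X, y, w, b)}")
```

Both runs separate the toy data, but the printed `w` and `b` need not match, and accuracy alone cannot rank two lines that both classify every point correctly.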

## 3. Introduction to Loss Functions

To overcome the Perceptron Rule's limitations, **Loss Functions** are used.

*   **Definition**: A loss function is a method to **quantify** how well or poorly a machine learning model is performing on a given problem.
*   For Perceptron, the loss function is typically a mathematical function of the weights (`w1`, `w2`) and bias (`b`).
*   **Output**: For any given line (defined by `w1`, `w2`, `b`), the loss function outputs a **single number representing the error**.
*   **Goal of Training**: The objective is to find the specific values of `w1`, `w2`, and `b` that **minimise this loss function's output**. This minimum value signifies the "best" line.
*   **Examples of other Loss Functions**:
    *   **Mean Squared Error (MSE)** for Linear Regression.
    *   **Log Loss** for Logistic Regression.
    *   **Hinge Loss** for Support Vector Machines (SVM).
*   It is also possible to design **custom loss functions**.

## 4. Developing a Loss Function for Perceptron

Let's consider how to design a loss function for the Perceptron:

*   **Simple Approach (Ineffective)**: Counting the **number of misclassified points**. This is too simplistic because all misclassifications are treated equally, regardless of how far a point is from the decision boundary.
*   **Improved Approach**: Summing the **perpendicular distances of misclassified points** from the line. This accounts for the *magnitude* of the error (larger distance implies a greater error).
*   **Perceptron's Practical Approach**: Instead of computing exact perpendicular distances, the Perceptron uses the value $|w_1x_1 + w_2x_2 + b|$, obtained by substituting the point's coordinates into the line's equation (essentially a dot product). For a fixed line, this equals the perpendicular distance multiplied by $\sqrt{w_1^2 + w_2^2}$, so it is **directly proportional to the distance** while avoiding the division and square root.
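
The three candidate measures can be compared on a small example. This is a sketch under assumed values: the line $3x_1 + 4x_2 - 12 = 0$, the four points, and the function names are all hypothetical, and labels are taken as +1 / -1.

```python
import math

def raw_score(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def candidate_losses(X, y, w, b):
    """Return (misclassified count, summed perpendicular distance, summed |w.x + b|),
    each computed over the misclassified points only."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    count, dist_sum, score_sum = 0, 0.0, 0.0
    for xi, yi in zip(X, y):
        z = raw_score(xi, w, b)
        if yi * z < 0:                 # point lies on the wrong side of the line
            count += 1
            dist_sum += abs(z) / norm  # exact perpendicular distance
            score_sum += abs(z)        # proportional to distance, no sqrt or division
    return count, dist_sum, score_sum

w, b = [3.0, 4.0], -12.0
X = [[4.0, 5.0], [1.0, 1.0], [0.0, 0.0], [4.0, 0.5]]
y = [+1, +1, -1, -1]

# Two points are misclassified: distances 1 and 0.4, raw scores 5 and 2.
print(candidate_losses(X, y, w, b))
```

Here the raw scores are exactly 5 times the distances, because $\sqrt{3^2 + 4^2} = 5$ for this line.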

## 5. Perceptron's Actual Loss Function (Hinge Loss Variant)

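According to the scikit-learn documentation, where the Perceptron is implemented through SGD (Stochastic Gradient Descent), the loss function used is a variant of **Hinge Loss**: it differs from the standard hinge loss $\max(0, 1 - y_i \cdot f(x_i))$ only in dropping the margin of 1.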

*   The loss function for Perceptron is given by:
    $L(\mathbf{w}, b) = \frac{1}{N} \sum_{i=1}^{N} \max(0, -y_i \cdot f(x_i))$
    *   **$N$**: Number of data points (rows).
    *   **$y_i$**: The **actual output label** for the $i$-th data point, encoded as +1 for one class and -1 for the other (not the 1/0 encoding used for the step function's output in Section 1).
    *   **$f(x_i)$**: Represents the output of the dot product for the $i$-th data point: $w_1x_{i1} + w_2x_{i2} + b$. This is often denoted as `z`.
    *   The $\max(0, \cdot)$ term ensures that the loss is only incurred for misclassified points. If the term inside is negative or zero, the loss for that point is zero.

*   **Mathematical Goal**: The training process aims to find the values of $w_1, w_2,$ and $b$ that **minimise** this loss function, written as $\arg\min_{\mathbf{w}, b} L(\mathbf{w}, b)$.
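
The formula translates almost directly into code. The following is a minimal sketch with made-up points and two hypothetical candidate lines; the function name `perceptron_loss` is an assumption, not a library call.

```python
def perceptron_loss(X, y, w, b):
    """Mean of max(0, -y_i * f(x_i)), with f(x) = w . x + b and labels in {+1, -1}."""
    total = 0.0
    for xi, yi in zip(X, y):
        f = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total += max(0.0, -yi * f)
    return total / len(X)

X = [[2.0, 1.0], [1.0, 3.0], [0.0, 0.0], [3.0, 0.0]]
y = [+1, +1, -1, -1]

# Line A misclassifies the point (3, 0); line B separates all four points.
print(perceptron_loss(X, y, w=[1.0, 1.0], b=-2.5))  # 0.125
print(perceptron_loss(X, y, w=[0.0, 1.0], b=-0.5))  # 0.0
```

Unlike the Perceptron Rule, this gives a single number per line, so two candidate lines can be compared directly.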

## 6. Geometric Intuition of the Perceptron Loss Function

Let's break down the term $\max(0, -y_i \cdot f(x_i))$ to understand its geometric meaning:

*   **`y_i` (Actual Label)**:
    *   If a point belongs to the positive class, $y_i = +1$.
    *   If a point belongs to the negative class, $y_i = -1$.
*   **`f(x_i)` (Model's Raw Output)**:
    *   If the point falls on the positive side of the line, $f(x_i)$ is positive.
    *   If the point falls on the negative side of the line, $f(x_i)$ is negative.

We can analyse two main scenarios:

1.  **Correctly Classified Point**:
    *   If $y_i = +1$ and $f(x_i) > 0$ (positive class point on the positive side of the line).
        *   Then $y_i \cdot f(x_i)$ will be positive.
        *   So, $-y_i \cdot f(x_i)$ will be negative.
        *   $\max(0, \text{negative number}) = \mathbf{0}$.
    *   If $y_i = -1$ and $f(x_i) < 0$ (negative class point on the negative side of the line).
        *   Then $y_i \cdot f(x_i)$ will be positive (negative times negative).
        *   So, $-y_i \cdot f(x_i)$ will be negative.
        *   $\max(0, \text{negative number}) = \mathbf{0}$.
    *   **Conclusion**: **Correctly classified points contribute zero to the total loss**.

2.  **Misclassified Point**:
    *   If $y_i = +1$ but $f(x_i) < 0$ (positive class point on the negative side of the line).
        *   Then $y_i \cdot f(x_i)$ will be negative.
        *   So, $-y_i \cdot f(x_i)$ will be positive.
        *   $\max(0, \text{positive number}) = \text{positive number}$.
    *   If $y_i = -1$ but $f(x_i) > 0$ (negative class point on the positive side of the line).
        *   Then $y_i \cdot f(x_i)$ will be negative.
        *   So, $-y_i \cdot f(x_i)$ will be positive.
        *   $\max(0, \text{positive number}) = \text{positive number}$.
    *   **Conclusion**: **Misclassified points contribute a positive, non-zero value to the total loss**. The magnitude of this contribution is proportional to the point's "distance" from the decision boundary (calculated via the dot product).
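
The four cases above can be checked numerically. The raw outputs `f` used here are arbitrary illustrative values.

```python
def point_loss(y, f):
    """Per-point perceptron loss max(0, -y * f)."""
    return max(0.0, -y * f)

cases = [
    ("positive point, positive side", +1, 2.0),   # correctly classified
    ("negative point, negative side", -1, -3.0),  # correctly classified
    ("positive point, negative side", +1, -2.0),  # misclassified
    ("negative point, positive side", -1, 0.5),   # misclassified
]

for name, y, f in cases:
    print(f"{name}: loss = {point_loss(y, f)}")
```

The two correctly classified cases print a loss of `0.0`; the misclassified ones print `2.0` and `0.5`, which are the magnitudes of their raw outputs.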

## 7. Training Perceptron using Gradient Descent

To minimise the loss function and find the optimal weights and bias, **Gradient Descent** (an optimisation algorithm) is used.

*   **Gradient Descent Algorithm**:
    1.  **Initialisation**: Randomly initialise $w_1, w_2,$ and $b$.
    2.  **Loop for a fixed number of epochs** (iterations):
        *   For each data point $x_i, y_i$:
            *   Calculate $f(x_i) = w_1x_{i1} + w_2x_{i2} + b$.
            *   Check the misclassification condition: If $-y_i \cdot f(x_i) > 0$ (i.e., the point is misclassified), then update the parameters.
            *   **Update Rule**: The parameters are updated in the direction opposite to the gradient of the loss function, which moves them towards the minimum loss.
                *   $w_1 \leftarrow w_1 - \text{learning\_rate} \times \frac{\partial L}{\partial w_1}$
                *   $w_2 \leftarrow w_2 - \text{learning\_rate} \times \frac{\partial L}{\partial w_2}$
                *   $b \leftarrow b - \text{learning\_rate} \times \frac{\partial L}{\partial b}$
            *   The `learning_rate` (e.g., 0.1) controls the step size in each update.

*   **Partial Derivatives of the Perceptron Loss Function**:
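    *   Only **misclassified points** (where $-y_i \cdot f(x_i) > 0$) contribute. Writing $M$ for the set of their indices:
        *   $\frac{\partial L}{\partial w_1} = -\frac{1}{N} \sum_{i \in M} y_i x_{i1}$
        *   $\frac{\partial L}{\partial w_2} = -\frac{1}{N} \sum_{i \in M} y_i x_{i2}$
        *   $\frac{\partial L}{\partial b} = -\frac{1}{N} \sum_{i \in M} y_i$
    *   Correctly classified points contribute 0 to each derivative.
    *   For a single misclassified point, stepping against this gradient therefore means adding $\text{learning\_rate} \times y_i x_{ij}$ to each $w_j$ and $\text{learning\_rate} \times y_i$ to $b$ (the constant $\frac{1}{N}$ can be absorbed into the learning rate).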
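
The training loop can be sketched as follows. This is a minimal per-point (stochastic) version under a few stated assumptions: the data is a made-up separable toy set with +1 / -1 labels, parameters start at zero rather than at random values so the run is reproducible, the $\frac{1}{N}$ factor is absorbed into the learning rate, and the check uses `<=` so that points lying exactly on the line are also treated as misclassified. The function names are hypothetical.

```python
def train_perceptron_sgd(X, y, epochs=200, lr=0.1):
    w = [0.0] * len(X[0])  # zeros instead of random init, for reproducibility
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            f = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * f <= 0:  # misclassified (or exactly on the line)
                grad_w = [-yi * xj for xj in xi]  # gradient of -y * (w . x + b) w.r.t. w
                grad_b = -yi                      # gradient of -y * (w . x + b) w.r.t. b
                # Step in the direction opposite to the gradient.
                w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
                b = b - lr * grad_b
    return w, b

def mean_perceptron_loss(X, y, w, b):
    total = 0.0
    for xi, yi in zip(X, y):
        f = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total += max(0.0, -yi * f)
    return total / len(X)

# Toy, linearly separable data (made-up values).
X = [[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]]
y = [-1, -1, -1, +1, +1, +1]

w, b = train_perceptron_sgd(X, y)
print("w =", w, "b =", b)
print("loss =", mean_perceptron_loss(X, y, w, b))
```

Because the gradient for a misclassified point is $-y_i x_i$, subtracting it is the same as adding $y_i x_i$, which is why this loop ends up nudging the line towards each misclassified point.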

## 8. Perceptron's Flexibility: Different Behaviours

The Perceptron is a **highly flexible mathematical model** due to its design. By changing its **activation function** and **loss function**, it can be adapted to solve various types of problems. This flexibility demonstrates that the same underlying model can perform different tasks.

Here's a summary of how Perceptron components can be combined for different machine learning algorithms:

| Algorithm Type            | Activation Function      | Loss Function               | Output                                   | Problem Type                 |
| :------------------------ | :----------------------- | :-------------------------- | :--------------------------------------- | :--------------------------- |
| **Perceptron**            | **Step Function**        | **Hinge Loss (variant)**    | Binary (1 or -1)                         | Binary Classification        |
| **Logistic Regression**   | **Sigmoid Function**     | **Log Loss (Binary Cross-Entropy)** | Probabilities (0 to 1)                   | Binary Classification        |
| **Softmax Regression**    | **Softmax Function**     | **Categorical Cross-Entropy** | Probabilities for each class             | Multi-Class Classification   |
| **Linear Regression**     | **Linear (or None)**     | **Mean Squared Error**      | A continuous number                      | Regression                   |

*   **Key takeaway**: The Perceptron provides a foundational mathematical model. By selectively altering its activation and loss functions, while consistently using **Stochastic Gradient Descent (SGD)** for parameter optimisation, it can be re-purposed for diverse tasks, including different types of classification and regression problems. This adaptability is crucial for building more complex neural networks later on.
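
As a rough illustration of the table, the same linear computation $z = w \cdot x + b$ can be passed through different activation functions to get the different kinds of output. The parameters and inputs below are hypothetical, and the loss functions are omitted to keep the sketch short.

```python
import math

def linear(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def step(z):
    return 1 if z >= 0 else 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in zs]
    total = sum(exps)
    return [e / total for e in exps]

x = [2.0, 1.0]
z = linear(x, w=[0.5, -1.0], b=0.25)  # 1.0 - 1.0 + 0.25 = 0.25

print(step(z))     # Perceptron: a hard class label
print(sigmoid(z))  # Logistic regression: probability of the positive class
print(z)           # Linear regression: the raw value itself

# Softmax regression uses one linear unit per class (three hypothetical units here).
zs = [linear(x, [0.5, -1.0], 0.25), linear(x, [1.0, 0.0], 0.0), linear(x, [-1.0, 1.0], 0.5)]
print(softmax(zs))  # one probability per class, summing to 1
```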

---

# Problem with Perceptron

Lecture 7, Deep Learning playlist (CampusX)

**1. The Core Problem of Perceptron: Linearity Limitation**
*   The primary reason Perceptrons on their own did not become widely used in deep learning is their **inability to work with non-linear data**, i.e. classes that no single line (or hyperplane) can separate.
*   Perceptrons can only handle **linearly separable data**.
*   Even with extensive training time, a Perceptron will **never converge** on non-linear data, because its decision boundary is inherently linear.

**2. Practical Demonstration of the Problem**

The video practically demonstrates this limitation using code and a visual tool to show that a Perceptron cannot truly work with non-linear data.

**2.1. Demonstration using Python Code (`sklearn` Perceptron)**
*   Three distinct datasets were created for demonstration: **AND, OR, and XOR**.
*   **AND Data:**
    *   The output is `1` only when both inputs are `1`; otherwise, the output is `0`.
    *   **Result:** The Perceptron successfully classified the AND data, easily dividing the two classes with a **clear linear boundary**.
*   **OR Data:**
    *   The output is `1` if any input is `1`; the output is `0` only when both inputs are `0`.
    *   **Result:** The Perceptron successfully classified the OR data, separating the classes with a **single straight line**.
*   **XOR Data:**
    *   The output is `0` when both inputs are the same (both `0` or both `1`); the output is `1` when inputs are different (one `0` and one `1`).
    *   **Result:** When the Perceptron was run on the XOR data, it **failed to find any clear-cut boundary**. This visually confirms that a single straight line cannot separate the classes in XOR data, highlighting the Perceptron's limitation with non-linear relationships.
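
The lecture runs this experiment with `sklearn`'s Perceptron. The sketch below swaps in a small hand-written trainer based on the Perceptron Rule, so that it runs without any dependencies; the function names, the number of epochs, and the learning rate of 1 are assumptions made for this illustration.

```python
def step(z):
    return 1 if z >= 0 else 0

def train(X, y, epochs=100, lr=1):
    """Perceptron Rule, cycling through the points in order for a fixed number of epochs."""
    w, b = [0, 0], 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = yi - step(w[0] * xi[0] + w[1] * xi[1] + b)
            w = [w[0] + lr * error * xi[0], w[1] + lr * error * xi[1]]
            b += lr * error
    return w, b

def accuracy(X, y, w, b):
    hits = sum(step(w[0] * xi[0] + w[1] * xi[1] + b) == yi for xi, yi in zip(X, y))
    return hits / len(X)

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
gates = {
    "AND": [0, 0, 0, 1],
    "OR":  [0, 1, 1, 1],
    "XOR": [0, 1, 1, 0],
}

results = {}
for name, y in gates.items():
    w, b = train(X, y)
    results[name] = accuracy(X, y, w, b)
    print(name, "accuracy:", results[name])
```

AND and OR reach an accuracy of 1.0, while XOR stays below 1.0 no matter how many epochs are used, since no single straight line separates its two classes.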

**2.2. Demonstration using TensorFlow Playground**
*   This is a powerful web-based tool (`playground.tensorflow.org`) that allows users to create and train neural networks on various datasets without writing extensive code.
*   **Linearly Separable Data (Example):**
    *   A simple, linearly separable dataset was chosen where both classes are clearly separated.
    *   **Result:** The Perceptron (simulated as a single-layer network with no hidden layers) quickly found a dividing line and provided correct results.
*   **Non-linear Data (XOR Dataset):**
    *   The XOR dataset (visualised as crosses and circles in a specific arrangement) was loaded.
    *   **Result:** Even after running the simulation for a significant amount of time (many epochs), the Perceptron **could not separate the two classes**. The output did not converge, visually confirming the Perceptron's inability to handle non-linear data even when given ample training time.

**3. Conclusion and Future Implications**
*   The practical demonstrations clearly show that a Perceptron can only capture **linear relationships**.
*   This fundamental limitation of Perceptrons with non-linear data is the primary motivation for **Multi-layer Perceptrons**.
*   The next steps involve moving on to Multi-layer Perceptrons, where the concepts will become even more interesting.
