# Week 2: Neural Networks Training

## TensorFlow Implementation

Training a neural network in TensorFlow (using the Keras API) is accomplished in three sequential steps, each a single call:

| Step | Function | Purpose |
| :--- | :--- | :--- |
| **1. Define the Model** | `model = tf.keras.Sequential(...)` | Specifies the architecture of the neural network by sequentially chaining the layers (e.g., Dense layers with Sigmoid activation). |
| **2. Compile the Model** | `model.compile(loss='binary_crossentropy', ...)` | Configures the learning process by specifying the **loss function** that will measure the error between predictions and true labels. (The loss function used here is `binary_crossentropy`, which will be detailed later). |
| **3. Fit the Model** | `model.fit(X, Y, epochs=100)` | Executes the training. It applies the learning algorithm (like gradient descent) to the dataset ($X, Y$) for a specified number of **epochs**. |

### Key Concepts Introduced
* **Loss Function:** A metric (e.g., `binary_crossentropy`) specified during the compile step that quantifies how well the model is performing. This is the value the optimizer tries to minimize.
* **Epochs:** A technical term defining how many complete passes the learning algorithm (gradient descent) should make through the **entire** training dataset ($X, Y$).
* **Deeper Understanding:** It is important not just to call these lines of code, but to understand the underlying mechanisms (like the loss function and gradient descent) so that you can effectively **debug** the model when it doesn't work as expected.

## Training Details

### 1. Training is a Three-Step Process
Training a neural network in TensorFlow follows the same three fundamental steps used for a simpler model like logistic regression:

1.  **Specify the Model (Forward Computation):** Define the input-to-output function.
2.  **Specify Loss and Cost:** Define the metric for measuring error.
3.  **Minimize the Cost:** Use an optimization algorithm (like Gradient Descent/Backpropagation) to find the best parameters.

### 2. Step 1: Specify the Model (Architecture)
* **TensorFlow Code:** This is done using the `tf.keras.Sequential` code snippet from the previous video.
* **Function:** This defines the entire neural network architecture (number of layers, units per layer, and activation functions), allowing TensorFlow to compute the output $f(\mathbf{x})$ given the input $\mathbf{x}$ and the parameters ($W$ and $B$ for all layers).

### 3. Step 2: Specify Loss and Cost Functions
* **Loss Function ($\mathcal{L}$):** Measures the error on a **single training example** ($\mathbf{x}, y$).
* **Cost Function ($J$):** Measures the average error (loss) across the **entire training set**. This is the function that the optimizer tries to minimize.
* **Classification (Binary):** The standard loss is **Binary Cross-Entropy Loss**, which is the same function used for logistic regression:
    $$\mathcal{L}(f(\mathbf{x}), y) = -y \log f(\mathbf{x}) - (1-y) \log (1 - f(\mathbf{x}))$$
    In TensorFlow, you specify this in the compile step: `loss=tf.keras.losses.BinaryCrossentropy()` (or, equivalently, the string `loss='binary_crossentropy'`).
* **Regression:** If solving a regression problem (predicting a continuous value), you would typically use the **Mean Squared Error (MSE)** loss, which minimizes the squared difference between the prediction and the true value.
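
The distinction between loss (one example) and cost (average over the training set) can be sketched in plain Python. This is a minimal illustration, not the Keras implementation; the function names and the sample numbers are made up, and the squared-error loss is written without the factor of 1/2 that some presentations include.

```python
import math

def binary_cross_entropy(f_x, y):
    """Loss on ONE example: f_x is the predicted probability, y is 0 or 1."""
    return -y * math.log(f_x) - (1 - y) * math.log(1 - f_x)

def squared_error(f_x, y):
    """Loss on ONE example for regression (no 1/2 factor; an assumption)."""
    return (f_x - y) ** 2

def cost(loss_fn, predictions, targets):
    """Cost J: the average of the per-example losses over the whole set."""
    losses = [loss_fn(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

# Made-up predictions and labels for three training examples.
predictions = [0.9, 0.2, 0.7]
targets = [1, 0, 1]
print("classification cost:", cost(binary_cross_entropy, predictions, targets))
print("regression cost:", cost(squared_error, [1.0, 3.0], [1.0, 1.0]))
```

Passing the loss in as an argument mirrors the way the compile step lets you swap the loss without touching the model.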

### 4. Step 3: Minimize the Cost Function
* **Goal:** To find the optimal values for all parameters ($W^1, b^1, W^2, b^2, \dots$) that minimize the overall Cost Function $J$.
* **Algorithm:** This is achieved by repeatedly using the **Gradient Descent** algorithm.
* **Backpropagation:** TensorFlow uses the **backpropagation** algorithm behind the scenes (within the `model.fit` function) to efficiently compute the partial derivatives (gradients) of the cost function with respect to every single parameter.
* **TensorFlow Code:** The entire optimization process is executed by calling: `model.fit(X, Y, epochs=100)`.
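
To make the idea of repeated gradient descent steps concrete, here is a small plain-Python sketch for the simplest possible "network": a single sigmoid unit with one input feature. It is only an illustration under stated assumptions: the helper `fit`, the toy data, and the learning rate are hypothetical, and each epoch performs exactly one full-batch update, whereas `model.fit` in Keras typically performs many mini-batch updates per epoch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, epochs=100, alpha=0.1):
    """Full-batch gradient descent for f(x) = sigmoid(w * x + b)."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # Prediction error f(x) - y for every training example.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the binary cross-entropy cost J.
        dw = sum(e * x for e, x in zip(errors, xs)) / m
        db = sum(errors) / m
        # Simultaneous parameter update.
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Made-up, linearly separable toy data.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit(xs, ys, epochs=1000, alpha=0.1)
print("w =", w, "b =", b)
```

For a multi-layer network the update rule is the same; backpropagation is what supplies `dw` and `db` for every layer.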

### 5. Importance of Libraries vs. Scratch Implementation
* Modern deep learning development relies heavily on mature libraries like TensorFlow and PyTorch for efficiency. Developers rarely implement complex functions like sorting, square roots, or even backpropagation from scratch.
* However, understanding the underlying math and algorithms (like forward prop and backprop) is essential for **debugging** and making smart design choices when models fail or produce unexpected results.

## Alternative to the Sigmoid Activation

### 1. The Need for New Activations
* **Sigmoid Limitation:** The sigmoid function, used previously (because it was based on logistic regression), squashes the output ($g(z)$) to a range between 0 and 1. This is appropriate for modeling probabilities or binary outcomes.
* **Expanding Capability:** For hidden layers, especially when the modeled concepts (like awareness or quality) can have a non-binary, large, or non-negative impact, activation values may need to exceed 1. Using other activation functions allows the neural network to become much more powerful.

### 2. Introduction of ReLU
* **ReLU (Rectified Linear Unit):** This is a very common and highly effective activation function.
* **Equation:** $g(z) = \max(0, z)$.
* **Functionality:**
    * If the input $z$ is negative, the output is $0$.
    * If the input $z$ is positive, the output is simply $z$.
* **Range:** The output is always **non-negative** ($0$ or any positive value), so activations can grow larger than $1$.

### 3. Most Common Activation Functions

The three most common activation functions for neural networks are:

| Name | Equation | Common Use Case | Notes |
| :--- | :--- | :--- | :--- |
| **Sigmoid** | $g(z) = \frac{1}{1 + e^{-z}}$ | Output layer for **Binary Classification** (predicting probability). | Output is always between 0 and 1. |
| **ReLU** | $g(z) = \max(0, z)$ | **Hidden Layers** (Default choice for most new networks). | Output is always non-negative (0 or any positive number). |
| **Linear** | $g(z) = z$ | Output layer for **Regression Problems** (predicting a continuous value). | Often referred to as "no activation function" since the output equals the input ($a=z$). |
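
The three functions in the table are short enough to write out directly. The following plain-Python sketch (scalar inputs only; the function names are chosen for illustration) evaluates each one at a few sample points.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """g(z) = max(0, z); output is never negative."""
    return max(0.0, z)

def linear(z):
    """g(z) = z; the 'no activation' case."""
    return z

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  relu={relu(z):.1f}  linear={linear(z):+.1f}")
```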

The choice of activation function for each neuron is a critical design decision explored further in the following video. A fourth activation function, **Softmax**, will be introduced later for multi-class classification.

## Choosing Activation Functions

### 1. Choosing the Output Layer Activation

The choice of activation function for the output layer is determined primarily by the **nature of the target variable ($y$)**:

| Target Variable ($y$) | Problem Type | Recommended Activation | Rationale |
| :--- | :--- | :--- | :--- |
| **0 or 1** | **Binary Classification** | **Sigmoid** | Predicts a probability between 0 and 1. (Most natural choice). |
| **Positive or Negative** | **Regression (General)** | **Linear** | Allows the output $f(\mathbf{x})$ to take on any real value. (e.g., predicting stock change). |
| **Non-Negative Only** | **Regression (Constrained)** | **ReLU** | Constrains the output $f(\mathbf{x})$ to be 0 or positive. (e.g., predicting house price). |

### 2. Choosing the Hidden Layer Activation

* **ReLU is the Default:** The **Rectified Linear Unit (ReLU)** activation function is by far the most common and recommended choice for **all hidden layers** in modern neural networks.
* **Why ReLU is Preferred over Sigmoid:**
    * **Computational Speed:** ReLU is faster to compute ($\max(0, z)$) than the sigmoid function (which involves exponentiation).
    * **Avoids "Flatness":** The sigmoid function is "flat" (has a very small gradient) at both ends, which significantly **slows down** the Gradient Descent algorithm. ReLU is only flat on the negative side, allowing the network to learn faster.
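
The "flatness" argument can be checked numerically by comparing the derivatives of the two functions. This is a small sketch using the standard derivative formulas; setting the ReLU derivative to 0 at exactly $z = 0$ is a convention assumed here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """d/dz max(0, z): 1 for z > 0, 0 for z < 0 (0 at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0

for z in (-10.0, -1.0, 1.0, 10.0):
    print(f"z={z:+5.1f}  sigmoid'={sigmoid_grad(z):.6f}  relu'={relu_grad(z):.1f}")
```

The sigmoid derivative shrinks toward zero at both extremes, while the ReLU derivative stays at 1 for every positive input.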

### 3. Implementation in TensorFlow

In TensorFlow, you specify the activation function within the layer definition:

| Activation | TensorFlow Syntax |
| :--- | :--- |
| **ReLU (Hidden Layer)** | `tf.keras.layers.Dense(units=25, activation='relu')` |
| **Sigmoid (Output Layer)** | `tf.keras.layers.Dense(units=1, activation='sigmoid')` |
| **Linear (Output Layer)** | `tf.keras.layers.Dense(units=1, activation='linear')` |

### 4. Other Activation Functions
While **Sigmoid, ReLU, and Linear** are the most crucial, various other specialized activation functions exist (such as tanh, Leaky ReLU, and Swish) that researchers sometimes use for marginal performance gains. The three above are sufficient for the vast majority of applications.

## More on Swish and SiLU Activation Functions

The **Swish activation function** is a smooth, non-linear activation function that has gained popularity in deep learning research as an alternative to the more common ReLU. It was shown to often outperform ReLU in various deep learning tasks.

### Swish Definition

The Swish function is defined as:

$$\text{Swish}(\mathbf{x}) = \mathbf{x} \cdot \sigma(\beta \mathbf{x})$$

Where:

* $\mathbf{x}$ is the input to the neuron (often denoted as $z$ in introductory material).
* $\sigma$ is the **sigmoid function** ($\frac{1}{1 + e^{-x}}$).
* $\beta$ (beta) is a learned or fixed parameter. In the original paper, the simplest and most common form sets $\beta = 1$.

When $\beta = 1$, the function is often referred to as the **SiLU** (Sigmoid Linear Unit).

$$\text{SiLU}(\mathbf{x}) = \mathbf{x} \cdot \sigma(\mathbf{x})$$
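
Both definitions translate directly into code. The sketch below is plain Python for scalar inputs, with `beta` treated as a fixed number rather than a learned parameter (a simplification made for illustration).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    """Swish(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

def silu(x):
    """SiLU is Swish with beta fixed to 1."""
    return swish(x, beta=1.0)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  silu={silu(x):+.4f}")
```

Evaluating at $-5$ and $-1$ shows that the output at $-1$ is the more negative of the two, which is the dip that makes the function non-monotonic.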

### Key Characteristics and Advantages

The Swish function blends the input linearly with the output of a sigmoid function, giving it several desirable properties:

### 1. Smoothness
Unlike ReLU, which has an abrupt change in gradient at $\mathbf{x}=0$ (a sharp kink), Swish is **smooth** everywhere. This smoothness can make the optimization process (Gradient Descent) easier and more stable for deeper models.

### 2. Non-Monotonicity
For negative values of $\mathbf{x}$, the Swish function is **non-monotonic**: moving left from $\mathbf{x}=0$, it dips to a minimum of about $-0.28$ near $\mathbf{x} \approx -1.28$ (for $\beta = 1$), then rises back toward $0$ as $\mathbf{x} \to -\infty$. Small negative inputs therefore produce small negative outputs with a non-zero gradient, which can lead to better information flow through the network compared to ReLU, which forces all negative inputs to zero.

### 3. Better Performance
Research, particularly in complex architectures like those used for image classification, has shown that using Swish/SiLU can result in **faster training and better generalization** (higher accuracy on unseen data) compared to standard ReLU. It became famous after being used successfully in models like **EfficientNet**.

### Comparison to ReLU

| Feature | ReLU ($\max(0, \mathbf{x})$) | Swish ($\mathbf{x} \cdot \sigma(\mathbf{x})$) |
| :--- | :--- | :--- |
| **Equation** | Simple: output is $\mathbf{x}$ or $0$ | Complex: involves multiplication and sigmoid |
| **Gradient at $\mathbf{x}=0$** | Discontinuous (abrupt change) | Smooth and continuous |
| **Output for $\mathbf{x} < 0$** | Exactly 0 (can cause the "dead neurons" issue) | Small negative values (non-monotonic) |
| **Computational Cost** | Very low (fast) | Higher (due to sigmoid) |

## Why do we need activation functions?

### 1. The Problem with All-Linear Activation
* **Collapsing to Linearity:** If every neuron in a neural network uses the **linear activation function** ($g(z) = z$), the entire, multi-layered network collapses mathematically into a single, equivalent linear function.
* **Loss of Complexity:** A neural network with any number of hidden layers using only linear activations performs no better than a simple **Linear Regression** model.
    * *Demonstration:* For a simple two-layer network, substituting $a^1 = w^1x + b^1$ into $a^2 = w^2a^1 + b^2$ results in $a^2 = (w^2w^1)x + (w^2b^1 + b^2)$, which is simply $a^2 = Wx + B$.
* **Defeats the Purpose:** Using multiple layers becomes redundant if they are all linear, as the model cannot learn complex, non-linear features or decision boundaries.
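
The collapse in the demonstration above can be verified numerically. This sketch uses scalar weights with made-up values and checks that the two-layer linear network and the single collapsed linear function agree on every input tried.

```python
def two_layer_linear(x, w1, b1, w2, b2):
    a1 = w1 * x + b1     # hidden unit with linear activation g(z) = z
    a2 = w2 * a1 + b2    # output unit with linear activation
    return a2

def collapsed(x, w1, b1, w2, b2):
    W = w2 * w1          # combined weight
    B = w2 * b1 + b2     # combined bias
    return W * x + B

params = dict(w1=2.0, b1=-1.0, w2=0.5, b2=3.0)   # arbitrary example values
for x in (-3.0, 0.0, 4.0):
    print(x, two_layer_linear(x, **params), collapsed(x, **params))
```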

### 2. Consequences for Model Type
* **All Linear Layers:** A network with linear activation in all hidden and output layers is equivalent to **Linear Regression**.
* **Linear Hidden + Sigmoid Output:** A network with linear activations in the hidden layers but a sigmoid output is equivalent to **Logistic Regression**.
* **Conclusion:** To gain the benefit of a deep architecture and learn complex, non-linear relationships, the hidden layers **must** employ a non-linear activation function.

### 3. Recommended Hidden Layer Activation
* **Rule of Thumb:** **Do not use the linear activation function in hidden layers.**
* **Recommendation:** The **ReLU (Rectified Linear Unit)** activation function is the recommended default for hidden layers because it introduces the necessary non-linearity while allowing for faster training.

## Multiclass

### 1. Definition of Multiclass Classification
* **Classification Problem:** The target variable ($y$) still takes on only a small number of discrete categories.
* **Key Difference:** $y$ can have **more than two possible output labels** (e.g., $y = 1, 2, 3, 4, \dots$).
* **Examples:**
    * Recognizing 10 possible handwritten digits (0 through 9).
    * Classifying a patient into one of several diseases.
    * Identifying the type of defect on a manufactured part (e.g., scratch, discoloration, chip).

### 2. The Goal of the Model
* In binary classification (where $y$ is 0 or 1), the model estimates $P(y=1 | \mathbf{x})$.
* In multiclass classification, the model must estimate the probability for **each possible class**:
    * $P(y=1 | \mathbf{x})$
    * $P(y=2 | \mathbf{x})$
    * $P(y=3 | \mathbf{x})$
    * ...and so on.
* The model's decision boundary will divide the feature space into multiple distinct categories (e.g., four regions for four classes) instead of just two.

![multi class](images/multi_class.png)

## Softmax

Softmax Regression is the generalization of Logistic Regression to handle classification problems with more than two output classes (multiclass classification).

### 1. Generalizing Logistic to Softmax
* **Logistic Regression (Binary):** Handles $Y$ with two classes (0 or 1). It computes $Z = W^T X + b$ and uses the sigmoid function to output $A_1 = P(Y=1|X)$. The probability of the other class is $A_2 = 1 - A_1 = P(Y=0|X)$.
* **Softmax Regression (Multiclass):** Handles $Y$ with $N$ possible classes (e.g., 1, 2, 3, up to $N$). It computes **separate linear scores ($Z_j$)** for each class.

### 2. The Softmax Activation Function
Softmax computes the probability for each class ($A_1$ through $A_N$) such that all probabilities sum to 1.

* **Linear Scores:** For each class $j$ (from 1 to $N$), compute:
    $$Z_j = W_j^T X + b_j$$
* **Softmax Output (Activation):** The probability that $Y$ belongs to class $j$ is:
    $$A_j = P(Y=j|X) = \frac{e^{Z_j}}{\sum_{k=1}^{N} e^{Z_k}}$$
    * The denominator is the sum of $e^Z$ for all classes, which ensures that $\sum A_j = 1$.
* **Special Case:** When $N=2$ (only two classes), Softmax Regression mathematically reduces to Logistic Regression.
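
The softmax formula can be written as a few lines of plain Python. This is a direct, unoptimized translation intended for moderate-sized scores; the example scores are made up. The last lines illustrate the two-class special case by fixing the second score at 0, which gives the same value as the sigmoid of the first score.

```python
import math

def softmax(z):
    """A_j = exp(Z_j) / sum_k exp(Z_k), computed for every class j."""
    exps = [math.exp(z_j) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

scores = [1.0, 2.0, 3.0, 4.0]          # made-up linear scores for N = 4 classes
probs = softmax(scores)
print(probs, "sum =", sum(probs))

# Two-class case: softmax([z, 0]) assigns class 1 the probability sigmoid(z).
print(softmax([0.7, 0.0])[0], sigmoid(0.7))
```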

---

![Softmax Regression](images/softmax_regression.png)

### 3. The Parameters
* For $N$ classes, Softmax Regression has $N$ sets of parameters:
    * $W_1, W_2, \dots, W_N$ (weight vectors)
    * $b_1, b_2, \dots, b_N$ (bias terms)

### 4. The Loss and Cost Functions
The standard loss function used with Softmax is called the **Cross-Entropy Loss** (or Softmax Loss), which is a generalization of the loss used in Logistic Regression.

* **Loss Function:** For a single training example, if the true label is $Y=j$, the loss is simply the negative log of the probability assigned to that true class:
    $$\text{Loss} = - \log(A_j)$$
    * **Goal:** This function incentivizes the algorithm to make the probability of the true class ($A_j$) as large as possible (close to 1). If $A_j$ is close to 1, the loss is small. If $A_j$ is small, the loss is large.
* **Cost Function:** The total cost $J$ is the average of the loss function calculated over the entire training set.
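
As a small illustration of the loss and cost just described, the sketch below takes already-computed class probabilities and a true label. Note one assumption: classes are indexed from 0 in the code, whereas the text numbers them from 1. The probability values are made up.

```python
import math

def cross_entropy_loss(a, y):
    """Loss for one example: -log of the probability given to the true class y."""
    return -math.log(a[y])

def cost(all_a, all_y):
    """Cost J: average loss over all training examples."""
    losses = [cross_entropy_loss(a, y) for a, y in zip(all_a, all_y)]
    return sum(losses) / len(losses)

confident = [0.05, 0.90, 0.05]   # true class (index 1) gets high probability
unsure = [0.40, 0.30, 0.30]      # true class (index 1) gets low probability
print(cross_entropy_loss(confident, 1), cross_entropy_loss(unsure, 1))
print(cost([confident, unsure], [1, 1]))
```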

---

![Softmax Cost](images/softmax_cost.png)

## Neural Network with Softmax Output

Here we briefly discuss how to adapt a standard neural network architecture for **multiclass classification** by incorporating the **Softmax function** into the output layer.

### 1. Softmax Neural Network Architecture
* **Goal:** To classify an input (like a handwritten digit) into one of $N$ possible classes (e.g., 10 classes for digits 0-9).
* **Output Layer:** The final layer of the network must have **$N$ output units** (e.g., 10 units for 10 classes).
* **Softmax Layer:** The activation function applied to these output units is the **Softmax function**. This transforms the raw scores into probabilities that sum to 1.
    * This output layer is often called a **softmax layer**, and the network is said to have a **softmax output**.

### 2. Forward Propagation with Softmax
* **Hidden Layers:** Activations in the hidden layers ($A^{[1]}, A^{[2]}, \dots$) are computed exactly as before (using ReLU, Sigmoid, etc.).
* **Output Layer Calculation (for example, $L=3$ in a three-layer network):**
    1.  **Compute Linear Scores ($Z^{[L]}$):** The raw scores for each output unit are calculated using the activations from the previous layer ($A^{[L-1]}$) and the final layer's parameters ($W^{[L]}, b^{[L]}$).
    2.  **Compute Softmax Activations ($A^{[L]}$):** The Softmax function is applied to the $Z$ values to get the output probabilities:
        $$A_j^{[L]} = P(Y=j|X) = \frac{e^{Z_j^{[L]}}}{\sum_{k} e^{Z_k^{[L]}}}$$
* **Unique Property:** Unlike ReLU or Sigmoid, where $A_j$ only depends on $Z_j$, the **Softmax activation $A_j$ depends on all $Z$ values** ($Z_1, Z_2, \dots, Z_N$) simultaneously due to the shared denominator.
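
The output-layer computation and the "depends on all $Z$ values" property can be sketched in plain Python. The weights, biases, and previous-layer activations below are made-up numbers for a three-class output layer; the helper names are hypothetical.

```python
import math

def dense_linear(a_prev, W, b):
    """Z_j = W_j . a_prev + b_j for each output unit j."""
    return [sum(w * a for w, a in zip(W_j, a_prev)) + b_j for W_j, b_j in zip(W, b)]

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

a_prev = [1.0, 2.0]                            # activations from the previous layer
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]     # one weight row per output unit
b = [0.0, 0.1, -0.1]

z = dense_linear(a_prev, W, b)
a = softmax(z)

# Change only the first score: every output probability moves, not just the first.
z_changed = [z[0] + 1.0] + z[1:]
a_changed = softmax(z_changed)
print(a)
print(a_changed)
```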

### 3. Implementation and Loss Function
* **TensorFlow Loss Function:** The loss function corresponding to the Softmax output (the generalization of logistic loss) is called **SparseCategoricalCrossentropy**.
    * **Categorical:** Refers to classifying into categories (classes).
    * **Sparse:** Refers to the fact that each training example belongs to **only one** category (e.g., an image is a "2" or a "7", but not both).

## Improved Implementation of Softmax

Here we address a crucial implementation detail in deep learning: achieving **numerical stability** when calculating loss for classification problems, particularly with the Softmax function.

### 1. The Numerical Stability Problem
* **Round-off Error:** Computers store numbers (floating-point numbers) with finite precision. When performing complex calculations, especially involving very large or very small intermediate values, the result can accumulate significant **numerical round-off error**.
    * **Example:** Calculating a simple number in two complex ways can yield slightly different results due to this finite precision.
* **Softmax and Exponentials:** The Softmax function involves calculating $e^Z$. If any $Z$ is very large or very small, $e^Z$ can become extremely large or extremely close to zero, leading to severe round-off errors when performing division.

### 2. The Solution: Combining Activation and Loss (Logits)
The recommended solution is to avoid calculating the activation probabilities (like $A$ in Logistic Regression or $A_1$ through $A_{10}$ in Softmax) as an explicit intermediate step.

* **Logits:** The linear output $Z$ (i.e., $W^T X + b$) is known as the **logits** in TensorFlow/Keras terminology.
* **Optimized Loss Calculation:** By telling the framework to use the logits ($Z$) directly in the loss function, the framework can mathematically rearrange the terms in the combined $\text{Loss} = -\log(\text{Softmax}(Z))$ expression. This rearrangement avoids calculating the extreme exponential values and produces a **more numerically accurate result**.

### 3. Implementation in TensorFlow (Recommended Method)

| Step | Less Recommended (Two Steps) | **Recommended (Numerically Stable)** |
| :--- | :--- | :--- |
| **Output Layer** | Use **Softmax** activation. | Use **Linear** activation (outputs only the **logits**, $Z$). |
| **Loss Function** | Use `SparseCategoricalCrossentropy`. | Use `SparseCategoricalCrossentropy` and set the argument **`from_logits=True`**. |
| **Concept** | Explicitly computes probabilities $A$, then computes $\text{Loss} = -\log(A)$. | Computes the loss directly from the logits $Z$ using a mathematically optimized formula. |

### 4. SparseCategoricalCrossentropy or CategoricalCrossentropy
TensorFlow has two potential formats for target values, and the choice of loss determines which one is expected.
* **SparseCategoricalCrossentropy:** Expects the target to be an integer corresponding to the class index. For example, if there are 10 potential target values, $y$ would be between 0 and 9.
* **CategoricalCrossentropy:** Expects the target value of an example to be one-hot encoded, where the value at the target index is 1 and the other $N-1$ entries are 0. For example, with 10 potential target values and a target of 2, the label would be `[0,0,1,0,0,0,0,0,0,0]`.
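
The two target formats carry the same information, and the two losses give the same number for the same example. The plain-Python sketch below illustrates this with hypothetical stand-in functions (they are not the Keras classes) operating on made-up, already-normalized probabilities.

```python
import math

def one_hot(index, num_classes):
    """Convert an integer class index into a one-hot list."""
    return [1 if k == index else 0 for k in range(num_classes)]

def sparse_categorical_loss(probs, target_index):
    """Integer target: pick out the probability of the true class."""
    return -math.log(probs[target_index])

def categorical_loss(probs, target_one_hot):
    """One-hot target: only the entry where the target is 1 contributes."""
    return -sum(t * math.log(p) for t, p in zip(target_one_hot, probs))

probs = [0.1, 0.2, 0.7]                  # made-up predicted probabilities
print(one_hot(2, 10))
print(sparse_categorical_loss(probs, 2), categorical_loss(probs, one_hot(2, 3)))
```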

### 5. Key Takeaways
* The recommended implementation (using linear output and `from_logits=True`) achieves the **same conceptual result** as the two-step approach but is **more numerically stable and accurate**.
* While the recommended code may be slightly **less legible** (as it uses the non-intuitive linear activation), it is the **best practice** for robust deep learning implementation.
* This principle applies to both binary (Logistic) and multiclass (Softmax) classification, but the numerical issues are often more pronounced with Softmax.

## Bonus: Mathematical rearrangement of loss function for numerical stability

TensorFlow (or any framework using optimized numerical libraries) rearranges the Softmax and Cross-Entropy loss calculation to avoid computing extreme exponentials by working with the **logarithm of the probabilities**, a technique often called the "log-sum-exp trick."

Here's how the rearrangement works, focusing on the core mathematical transformation:

### 1. The Standard (Unstable) Calculation

The Softmax probability for a class $j$ is defined as:

$$A_j = \frac{e^{Z_j}}{\sum_{k} e^{Z_k}}$$

The Cross-Entropy Loss for a true class $y$ is:

$$\text{Loss} = - \log(A_y)$$

Substituting the Softmax equation into the loss gives:

$$\text{Loss} = - \log\left(\frac{e^{Z_y}}{\sum_{k} e^{Z_k}}\right)$$

If any of the logits, $Z_k$, are very large (e.g., $Z_k = 1000$), $e^{Z_k}$ becomes astronomically large, leading to **overflow** (a number too big to store). If they are very negative (e.g., $Z_k = -1000$), $e^{Z_k}$ **underflows** to zero; if this happens to every term, the denominator is zero and the division fails, and if it happens to the true class, the loss becomes $-\log(0)$.

### 2. The Numerically Stable Rearrangement

The framework rearranges the loss expression using logarithm properties: $\log(a/b) = \log(a) - \log(b)$.

$$\text{Loss} = - \left[ \log(e^{Z_y}) - \log\left(\sum_{k} e^{Z_k}\right) \right]$$

Since $\log(e^{Z_y}) = Z_y$, the equation simplifies to:

$$\text{Loss} = -Z_y + \log\left(\sum_{k} e^{Z_k}\right)$$

#### The Log-Sum-Exp Trick

The term $\log\left(\sum_{k} e^{Z_k}\right)$ is still dangerous because of the sum of large exponentials. To stabilize this term, the framework finds the **maximum logit**, $Z_{\max} = \max_k (Z_k)$, and adds and subtracts it inside the logarithm:

$$\log\left(\sum_{k} e^{Z_k}\right) = \log\left(\sum_{k} e^{Z_k - Z_{\max} + Z_{\max}}\right)$$

Using the rule $e^{a+b} = e^a e^b$:

$$\log\left(\sum_{k} e^{Z_{\max}} \cdot e^{Z_k - Z_{\max}}\right)$$

Pulling $e^{Z_{\max}}$ out of the sum:

$$\log\left(e^{Z_{\max}} \sum_{k} e^{Z_k - Z_{\max}}\right)$$

Using the rule $\log(ab) = \log(a) + \log(b)$:

$$\log(e^{Z_{\max}}) + \log\left(\sum_{k} e^{Z_k - Z_{\max}}\right)$$

Which simplifies to:

$$Z_{\max} + \log\left(\sum_{k} e^{Z_k - Z_{\max}}\right)$$

#### Why this is Stable

1.  **Reduced Magnitude:** The term $Z_k - Z_{\max}$ ensures that the arguments to the exponentials are all **negative or zero**.
2.  **No Overflow:** This means the largest exponential value being computed is $e^0 = 1$. The sum is now stable and will not overflow.
3.  **No Underflow:** While some terms in the sum may underflow to zero, the term corresponding to $Z_{\max}$ is always 1, ensuring the sum itself remains non-zero and stable.

By implementing the loss calculation using this mathematically equivalent, but numerically stable, formula, TensorFlow avoids the intermediate step of calculating the prone-to-error probabilities ($A_j$) and computes the final loss accurately directly from the logits ($Z$). This is why setting **`from_logits=True`** is the recommended best practice.
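
The derivation above can be tried out in plain Python. This is a sketch of the rearranged formula only, not TensorFlow's actual implementation; the function names and the example logits are made up. Python's `math.exp` raises `OverflowError` for very large arguments, which makes the failure of the naive version easy to see.

```python
import math

def naive_loss(z, y):
    """-log(softmax(z)[y]) computed literally; can overflow for large logits."""
    exps = [math.exp(v) for v in z]
    return -math.log(exps[y] / sum(exps))

def stable_loss(z, y):
    """Loss = -Z_y + Z_max + log(sum_k exp(Z_k - Z_max))."""
    z_max = max(z)
    log_sum_exp = z_max + math.log(sum(math.exp(v - z_max) for v in z))
    return -z[y] + log_sum_exp

moderate = [2.0, 1.0, 0.1]
extreme = [1000.0, 0.0, -1000.0]

# On moderate logits the two versions agree up to round-off.
print(naive_loss(moderate, 0), stable_loss(moderate, 0))

# On extreme logits only the stable version produces a result.
print(stable_loss(extreme, 0), stable_loss(extreme, 2))
try:
    naive_loss(extreme, 0)
except OverflowError as err:
    print("naive version failed:", err)
```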

## Classification with multiple outputs

Here, we introduce **multi-label classification**, contrasting it with **multi-class classification**.

* **Multi-class Classification** vs. **Multi-label Classification**:
    * **Multi-class** classification has an output label $Y$ that can be one of two or more possible categories (e.g., classifying a handwritten digit as a single number from 0 to 9).
    * **Multi-label** classification associates **multiple labels** with a single input $X$ (e.g., an image).

* **Definition and Example of Multi-label Classification**:
    * In a multi-label problem, the output $Y$ is a **vector** of numbers, where each element corresponds to a separate classification question.
    * **Example**: In a driver assistance system, a single image $X$ is checked for three separate features: Is there a **car**? Is there a **bus**? Is there a **pedestrian**?
    * The output $Y$ would be a vector of three binary values (e.g., $Y = [1, 0, 1]$ means **yes car**, **no bus**, **yes pedestrian**).

* **Neural Network Implementation**:
    * **Option 1 (Separate Networks)**: Treat each label as a completely separate machine learning problem and train independent neural networks for each (e.g., one network for cars, one for buses, etc.).
    * **Option 2 (Single Network)**: Train a **single neural network** to simultaneously predict all labels.
        * The **output layer** will have multiple neurons (e.g., three for the car/bus/pedestrian example).
        * Since each output is a separate **binary classification** (yes/no), a **sigmoid activation function** is used for **each node** in the output layer.

* The goal of the discussion is to clearly define multi-label classification to avoid confusion with multi-class classification, allowing practitioners to choose the correct approach for their application.
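
The single-network option can be illustrated by looking only at the output layer: each output score goes through its own sigmoid, so the three probabilities are independent and need not sum to 1. The scores below are made up, and the 0.5 decision threshold is an assumption for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multi_label_output(z, threshold=0.5):
    """Apply a separate sigmoid to each output score, then threshold each one."""
    probs = [sigmoid(v) for v in z]
    labels = [1 if p >= threshold else 0 for p in probs]
    return probs, labels

z = [2.0, -3.0, 0.5]       # made-up scores for car, bus, pedestrian
probs, labels = multi_label_output(z)
print(probs)
print(labels)              # one yes/no answer per question
```

Contrast this with softmax, where raising one score necessarily lowers the probabilities of the others.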

---

![Multi-label Classification](images/multi_label_classification.png)

## Advanced Optimization

Here, we introduce the **Adam (Adaptive Moment Estimation) optimization algorithm** as an alternative to standard **Gradient Descent** that typically trains neural networks faster, focusing on its ability to automatically adapt the learning rate.

* **Motivation for a Better Optimizer**: While **Gradient Descent** is foundational, its single fixed learning rate $\alpha$ is hard to choose: a small $\alpha$ takes tiny steps and converges slowly, while a large $\alpha$ can oscillate back and forth and may overshoot the minimum.

* **Introducing the Adam Algorithm**:
    * Adam stands for **Adaptive Moment Estimation**.
    * It is an optimization algorithm that can help a neural network **train much faster** than standard gradient descent.
    * Adam is now the **de facto standard** used by most practitioners for training neural networks.

* **Adaptive Learning Rates**: The core advantage of Adam is its ability to **automatically adjust the learning rate** based on the cost function's landscape:
    * **Increases $\alpha$**: If a parameter is taking many small steps consistently in the same direction (i.e., making slow progress), Adam **increases** its specific learning rate to speed up convergence.
    * **Decreases $\alpha$**: If a parameter is **oscillating** back and forth across the minimum, Adam **reduces** its specific learning rate to allow for a smoother path.

* **Parameter-Specific Learning Rates**:
    * Unlike Gradient Descent, which uses a single global learning rate $\alpha$, Adam uses a **different learning rate for every single parameter** ($\mathbf{w}_j$ and $b$) in the model.
    * This allows the algorithm to tailor the learning speed precisely for each parameter's gradient behavior.

* **Implementation**:
    * In a deep learning framework like Keras, Adam is used by passing it as the **optimizer** during model compilation (e.g., `optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3)`).
    * Adam still requires an initial **default global learning rate** (often $10^{-3}$), but it is more robust to the exact choice of this initial value compared to Gradient Descent.
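
These notes describe Adam only at the level of intuition. For readers who want to see the mechanics, the sketch below writes out the commonly published Adam update rule for a single parameter in plain Python. The formula and the default values of `beta1`, `beta2`, and `eps` come from the standard description of the algorithm, not from the text above; the toy cost function, the starting point, and the step count are made up.

```python
import math

def adam_minimize(grad_fn, w0, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a one-parameter function given its gradient, using Adam updates."""
    w = w0
    m = 0.0   # running average of gradients (first moment)
    v = 0.0   # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        # The effective step size alpha / sqrt(v_hat) adapts per parameter.
        w -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy cost J(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0, alpha=0.01, steps=3000)
print(w_final)
```

In a real network the same update is applied independently to every weight and bias, each with its own `m` and `v`, which is where the per-parameter learning rate comes from.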

## Additional Layer Types

This section introduces the concept of different neural network layer types, specifically focusing on the **Convolutional Layer** as an alternative to the standard **Dense Layer**.

* **Dense Layer Recap**:
    * The **dense layer** (or fully connected layer) is where every neuron in the layer takes as input **all activations** from the previous layer.

* **Convolutional Layer Introduction**:
    * A **convolutional layer** is a different layer type where each neuron **only looks at a limited, local region (or "window")** of the activations from the previous layer or the input data.
    * A neural network using convolutional layers is often called a **Convolutional Neural Network (CNN)**.

* **Advantages of Convolutional Layers**:
    * **Faster Computation**: By restricting the inputs to each neuron, computation is sped up.
    * **Better Data Efficiency**: The network may require **less training data**.
    * **Reduced Overfitting**: The network can be **less prone to overfitting** (a topic to be discussed in more detail later).

* **Illustrative Examples**:
    * **2D Image (Handwritten Digit)**: A neuron in the first layer may only look at the pixels in a **small rectangular region** of the input image, not the entire picture.
    * **1D Time Series (ECG Signal)**:
        * The input is a time series of numbers (e.g., $X_1$ through $X_{100}$).
        * The first hidden unit might only look at $X_1$ through $X_{20}$.
        * The second hidden unit might look at a shifted window, $X_{11}$ through $X_{30}$.
        * This process of looking at a local window and sliding it across the input is the essence of a convolutional layer.

* **Architecture and Future Research**:
    * Convolutional layers introduce new **architectural choices** (e.g., the size of the input window, the number of neurons).
    * Inventing new layer types and combining them is a key focus of advanced neural network research (e.g., the development of **Transformer** and **LSTM** models).
    * **Note**: While introduced for intuition, the details of CNNs are **not required** for the current course or homework.
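
The sliding-window idea from the ECG example can be sketched in plain Python. The window size of 20 and shift of 10 match the example above ($X_1$ to $X_{20}$, then $X_{11}$ to $X_{30}$). Two details are assumptions made for illustration: every window reuses the same weights (as in a standard convolution), and each unit applies ReLU. The input signal and weight values are made up.

```python
def conv1d_layer(x, weights, bias, stride):
    """Each unit sees a window of len(weights) inputs; windows start `stride` apart."""
    window = len(weights)
    outputs = []
    for start in range(0, len(x) - window + 1, stride):
        local_inputs = x[start:start + window]            # the unit's limited view
        z = sum(w * v for w, v in zip(weights, local_inputs)) + bias
        outputs.append(max(0.0, z))                       # ReLU activation
    return outputs

signal = [1.0] * 100          # stand-in for a 100-sample time series
weights = [0.1] * 20          # one weight per position in the window
activations = conv1d_layer(signal, weights, bias=0.0, stride=10)
print(len(activations), activations)
```

With 100 inputs, a window of 20, and a shift of 10, the layer has 9 units, each connected to only 20 inputs instead of all 100.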

---

![Convolutional Neural Network](images/cnn.png)