# Introduction to Neural Networks

Welcome to this basic introduction to neural networks. While this course focuses on building **state-of-the-art models** and **practical implementations in Lightning**, we will also cover the **foundations of neural networks**. The foundations covered here are not exhaustive, since this topic could fill an entire course on its own. Instead, we draw on the best current **open-source resources** to give you enough background to navigate the subsequent topics.

---

Some recommended resources for learning more about this topic include:

1.  **[Dive into Deep Learning](https://d2l.ai/chapter_introduction/index.html)** - For a more coding-oriented approach.

2.  **[Understanding Deep Learning](https://udlbook.github.io/udlbook/)** - Covering deep learning content from both theoretical and practical standpoints.

3.  **[DeepLearning.AI](https://www.deeplearning.ai/)** - Courses produced by Professor Andrew Ng, which have had a significant impact on the machine learning and deep learning community.

---

Let us now get familiar with some common **industry terms** in machine learning and deep learning.

---

## Supervised Learning

As the name suggests, the model learns under supervision: it is shown examples of inputs together with their known outputs. From these examples, the **model** learns the relationship between **one or more inputs** and **one or more outputs**. For example, a model might take **multiple features of an image** as input and output the **class of animal** in the image, such as "cat" or "dog."

---

### Mathematical Representation - Classification

Let's represent the concept of classification mathematically:

* **Input ($X$):**
    If the model receives multiple features, we can represent the input as a vector:
    $$X = [x_1, x_2, \ldots, x_n]$$
    Here, $n$ is the total number of input features. In our image recognition example, $X$ could be a flattened array of pixel values or a set of extracted visual features from the image.

* **Output ($Y$):**
    For classification, the output is typically a probability distribution over the possible classes. For instance, for "cat" and "dog" classes:
    $$Y = [P(\text{cat}), P(\text{dog})]$$
    Here, $P(\text{cat})$ is the predicted probability that the image contains a cat, and $P(\text{dog})$ is the predicted probability that it contains a dog. The model's final prediction would be the class with the highest probability.

* **The Model ($f$):**
    The model itself is essentially a function that maps the input to the output. This function, $f$, encapsulates all the learned relationships:
    $$Y = f(X)$$
    In a neural network, this function $f$ is a complex arrangement of interconnected nodes (neurons) that perform linear transformations followed by non-linear activation functions. The "learning" part involves adjusting the internal **parameters** (weights and biases) of this function during training.
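
To make the mapping from $X$ to $Y$ concrete, here is a minimal sketch of a classifier built from one linear transformation followed by a softmax. The feature values, weights, and bias are made-up illustrative numbers rather than learned parameters, and `softmax` and `classify` are hypothetical helper names.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def classify(x: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """f(X): a linear transformation followed by a softmax."""
    return softmax(weights @ x + bias)

classes = ["cat", "dog"]
x = np.array([0.2, 0.7, 0.1])            # n = 3 illustrative input features
weights = np.array([[1.0, -0.5, 0.3],    # one row of weights per class
                    [-0.4, 0.8, 0.1]])
bias = np.array([0.0, 0.1])

probs = classify(x, weights, bias)             # Y = [P(cat), P(dog)]
prediction = classes[int(np.argmax(probs))]    # class with the highest probability
print(dict(zip(classes, probs.round(3))), "->", prediction)
```

A real neural network stacks several such transformations with non-linear activations in between, and learns `weights` and `bias` during training instead of having them fixed by hand.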

---

### Mathematical Representation - Regression

If you recall, in Chapter 1 we discussed estimating an output from multiple input variables. Here, we demonstrate a regression example in which we predict a continuous output $y$.

The equation below defines $y$, our **Output Variable**, as an underlying true function of two inputs plus noise:
$$y = 10x_{1}^{2} + 5x_{2}^{2} + 2x_{1}x_{2} + 3x_{1} + 4x_{2} + \varepsilon$$
**The Model**, often a Fully Connected Neural Network (or Multi-Layer Perceptron), aims to approximate this true function and can be generally expressed as $f(x_{1}, x_{2}, \phi)$.

The **Input Variables** and noise used to generate the data, along with the model parameters, are defined as follows:
* $x_1$ follows a continuous uniform distribution between -10 and 10, denoted as $x_{1} \sim \mathcal{U}(-10, 10)$.
* $x_{2}$ follows a continuous uniform distribution between 0 and 5, denoted as $x_{2} \sim \mathcal{U}(0, 5)$.
* $\varepsilon$ represents Gaussian noise with a mean of 0 and a variance of $2^{2} = 4$, denoted as $\varepsilon \sim \mathcal{N}(0, 2^{2})$.
* $\phi$ represents the **parameters** (e.g., weights and biases) that the model learns in order to approximate this second-order (quadratic) equation.
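
As a rough sketch, the data-generating process described above can be simulated with NumPy. The sample size and random seed are arbitrary assumptions made for illustration; only the equation and the distributions come from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # the seed is an arbitrary choice
num_samples = 1000                     # the sample size is an assumption

x1 = rng.uniform(-10, 10, num_samples)   # x1 ~ U(-10, 10)
x2 = rng.uniform(0, 5, num_samples)      # x2 ~ U(0, 5)
eps = rng.normal(0, 2, num_samples)      # noise ~ N(0, 2^2): std 2, variance 4

# The underlying true function plus noise
y = 10 * x1**2 + 5 * x2**2 + 2 * x1 * x2 + 3 * x1 + 4 * x2 + eps

X = np.column_stack([x1, x2])  # shape (num_samples, 2): one row per example
print(X.shape, y.shape)
```

Note that `rng.normal` takes the standard deviation (2), not the variance (4), as its second argument.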

---

### General Representation of Supervised Learning

In simple terms, supervised learning always tries to estimate $Y$ (a single output or multiple outputs) from one or more inputs, $X$. The model contains **parameters** $\phi$, and a particular choice of these parameters represents the learned relationship between $X$ and $Y$:
$$Y = f(X, \phi)$$

### What is Learning? How Does a Model Estimate the $\hat{\phi}$ Parameters?

At its core, **learning** in supervised machine learning is the process of finding the optimal **parameters ($\hat{\phi}$)** for a model such that its estimated output ($\hat{Y}$) is as close as possible to the true, actual output ($Y_{actual}$). We achieve this by exposing the model to a **training dataset**, which consists of numerous examples of input-output pairs ($X, Y_{actual}$).

During this training process, we quantify the model's performance using a **loss function ($L$)**. This is a scalar value that summarizes the overall inaccuracy of the model's predictions across the entire training dataset. A **lower loss value indicates higher accuracy**, meaning the model's predictions are closer to the actual values. The discrepancy between the model's prediction and the true value for a single training example is often referred to as the **error ($e_i$)** for that specific example $i$.

Therefore, the parameters $\hat{\phi}$ are estimated by **minimizing the loss** over the training dataset. Mathematically, this objective is represented as:

$$\hat{\phi} = \underset{\phi}{\text{argmin}} \ L(\phi)$$

This equation states that we are searching for the set of parameters $\phi$ that minimizes the loss function $L$.
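
To illustrate what the $\text{argmin}$ means, the sketch below evaluates the loss for a list of candidate parameter values and keeps the one with the lowest loss. The one-parameter model $f(x, \phi) = \phi x$ and the toy data are assumptions chosen for illustration. Neural networks are not trained by brute-force search like this, but the objective is the same.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_actual = 3.0 * x                       # toy data generated with a "true" phi of 3

def loss(phi: float) -> float:
    """L(phi): mean squared error of the model f(x, phi) = phi * x."""
    return float(np.mean((phi * x - y_actual) ** 2))

candidates = np.linspace(0.0, 6.0, 61)   # candidate values 0.0, 0.1, ..., 6.0
losses = np.array([loss(phi) for phi in candidates])
phi_hat = candidates[np.argmin(losses)]  # the candidate with the lowest loss
print(f"phi_hat = {phi_hat:.1f}")
```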

### What is Loss?

Now, let's clarify the loss function. The simplest conceptualization of loss is the difference between the predicted and actual values. However, for practical and mathematical reasons (like ensuring differentiability for optimization), loss functions are usually defined using operations like squaring the difference for regression or using cross-entropy for classification.

A more precise representation of the loss function, taking into account the entire dataset and commonly used forms, would be:

$$L(\phi) = \frac{1}{M} \sum_{i=1}^{M} \text{Loss}(\hat{Y}_i, Y_{actual,i})$$

Where:
* $M$ is the total number of examples in the training dataset.
* $\hat{Y}_i = f(X_i, \phi)$ is the model's predicted output for the $i$-th input example $X_i$, using the current parameters $\phi$.
* $Y_{actual,i}$ is the true, known output for the $i$-th input example.
* $\text{Loss}(\cdot, \cdot)$ represents a specific function that quantifies the mismatch between the predicted and actual values for a single example. Common examples include:
    * **Squared error** for regression, which becomes the **Mean Squared Error (MSE)** once averaged over the dataset: $\text{Loss}(\hat{Y}_i, Y_{actual,i}) = (\hat{Y}_i - Y_{actual,i})^2$
    * **Binary Cross-Entropy** or **Categorical Cross-Entropy** for classification.
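
The formula above can be sketched in a few lines: a per-example loss is computed for every example and then averaged over the dataset. The helper names and the small arrays of predictions and targets below are illustrative assumptions.

```python
import numpy as np

def squared_error(y_hat: np.ndarray, y_actual: np.ndarray) -> np.ndarray:
    """Per-example loss for regression."""
    return (y_hat - y_actual) ** 2

def binary_cross_entropy(p_hat: np.ndarray, y_actual: np.ndarray,
                         eps: float = 1e-12) -> np.ndarray:
    """Per-example loss for binary classification with labels in {0, 1}."""
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -(y_actual * np.log(p_hat) + (1 - y_actual) * np.log(1 - p_hat))

def dataset_loss(per_example_loss, y_hat: np.ndarray, y_actual: np.ndarray) -> float:
    """L(phi): the per-example losses averaged over all M examples."""
    return float(np.mean(per_example_loss(y_hat, y_actual)))

# Regression: averaging the squared errors gives the MSE
mse = dataset_loss(squared_error,
                   np.array([2.5, 0.0, 2.0]), np.array([3.0, -0.5, 2.0]))

# Classification: predicted probabilities of class 1 versus the true labels
bce = dataset_loss(binary_cross_entropy,
                   np.array([0.9, 0.2, 0.6]), np.array([1.0, 0.0, 1.0]))
print(f"MSE = {mse:.4f}, BCE = {bce:.4f}")
```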

The minimization process typically involves an optimization algorithm (like Gradient Descent) that iteratively adjusts $\phi$ in the direction that most rapidly reduces $L(\phi)$ until a satisfactory minimum is reached. This iterative adjustment is the essence of "learning."

#### Example of Regression Loss

The interactive plot produced by the code below shows the regression loss (MSE) along with the individual errors between the predictions and the actual values.

In [1]:
import numpy as np
import plotly.graph_objects as go
from ipywidgets import interact, FloatSlider, Layout
from IPython.display import display, HTML

# --- 1. Data Generation Function ---
def generate_regression_data(num_samples: int = 50, true_w: float = 2.0, true_b: float = 5.0, noise_std: float = 2.0) -> tuple[np.ndarray, np.ndarray]:
    """
    Generates synthetic linear regression data for demonstration purposes.

    The data follows a linear relationship with added Gaussian noise:
    y_true = true_w * X + true_b + noise

    Args:
        num_samples (int): The number of data points to generate.
        true_w (float): The true slope (weight) of the underlying linear relationship.
        true_b (float): The true intercept (bias) of the underlying linear relationship.
        noise_std (float): The standard deviation of the Gaussian noise added to the outputs.

    Returns:
        tuple[np.ndarray, np.ndarray]:
            A tuple containing:
            - X (np.ndarray): The input features, uniformly distributed between -10 and 10.
            - y_true (np.ndarray): The true output values corresponding to X, with noise.
    """
    np.random.seed(42) # for reproducibility
    X = np.random.uniform(-10, 10, num_samples) # Input features
    y_true = true_w * X + true_b + np.random.normal(0, noise_std, num_samples) # True outputs with noise
    return X, y_true

# --- 2. Model Prediction Function ---
def linear_regression_predict(X: np.ndarray, w: float, b: float) -> np.ndarray:
    """
    Predicts outputs using a simple linear regression model.

    The model's prediction is given by:
    y_hat = w * X + b

    Args:
        X (np.ndarray): Input features for which to make predictions.
        w (float): The weight (slope) parameter of the linear model.
        b (float): The bias (intercept) parameter of the linear model.

    Returns:
        np.ndarray: The predicted output values (y_hat).
    """
    return w * X + b

# --- 3. Loss Function ---
def mean_squared_error(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """
    Calculates the Mean Squared Error (MSE) between predicted and true values.

    MSE is a common loss function for regression problems, defined as:
    MSE = (1/N) * sum((y_pred_i - y_true_i)^2)

    Args:
        y_pred (np.ndarray): The array of predicted output values from the model.
        y_true (np.ndarray): The array of true (actual) output values from the dataset.

    Returns:
        float: The calculated Mean Squared Error.
    """
    return np.mean((y_pred - y_true)**2)

# --- 4. Function to Setup Initial Plotly FigureWidget ---
def setup_interactive_plot(X_train: np.ndarray, y_train: np.ndarray, initial_w: float, initial_b: float) -> go.FigureWidget:
    """
    Sets up and returns an initial Plotly FigureWidget for the interactive
    linear regression visualization. It initializes all traces (actual data,
    predicted line, error lines, predicted points) and the plot layout.

    Args:
        X_train (np.ndarray): The input training data features.
        y_train (np.ndarray): The true output training data values.
        initial_w (float): The initial weight (slope) parameter for the predicted line.
        initial_b (float): The initial bias (intercept) parameter for the predicted line.

    Returns:
        go.FigureWidget: The initialized Plotly FigureWidget instance.
    """
    fig = go.FigureWidget()

    # Define a static range for the regression line to ensure it covers the plot width
    x_line_for_plot = np.array([-10, 10])

    # Calculate initial predictions and loss based on initial_w and initial_b
    initial_y_pred_line = linear_regression_predict(x_line_for_plot, initial_w, initial_b)
    initial_y_pred_points = linear_regression_predict(X_train, initial_w, initial_b)
    initial_loss = mean_squared_error(initial_y_pred_points, y_train)

    # Add traces to the figure
    # Trace 0: Actual Training Data (Scatter plot)
    fig.add_trace(go.Scatter(x=X_train, y=y_train, mode='markers',
                             name='Actual Training Data',
                             marker=dict(color='blue', opacity=0.7, size=8)))

    # Trace 1: Predicted Line (Line plot)
    fig.add_trace(go.Scatter(x=x_line_for_plot, y=initial_y_pred_line, mode='lines',
                             name=f'Predicted Line: y = {initial_w:.2f}x + {initial_b:.2f}',
                             line=dict(color='red', width=3)))

    # Trace 2: Individual Error Lines (Scatter plot with 'lines' mode and None for breaks)
    error_x = []
    error_y = []
    for i in range(len(X_train)):
        error_x.extend([X_train[i], X_train[i], None]) # 'None' creates a break between segments
        error_y.extend([y_train[i], initial_y_pred_points[i], None])

    fig.add_trace(go.Scatter(x=error_x, y=error_y, mode='lines',
                             line=dict(color='gray', width=1, dash='dot'),
                             hoverinfo='none', # Disable hover info to keep it clean
                             showlegend=False)) # Hide from legend as it's a visual aid, not a main data series


    # Trace 3: Predicted Points (Scatter plot on the regression line)
    fig.add_trace(go.Scatter(x=X_train, y=initial_y_pred_points, mode='markers',
                             marker=dict(color='green', symbol='circle', size=6, opacity=0.7,
                                         line=dict(color='green', width=1)),
                             hoverinfo='none', # Disable hover info
                             showlegend=False)) # Hide from legend


    # Update general layout properties of the figure
    fig.update_layout(
        xaxis_title='Input Feature $X$', # LaTeX for axis title
        yaxis_title='Output $Y$', # LaTeX for axis title
        xaxis_range=[-11, 11],
        yaxis_range=[-18, 28],
        title_text='Interactive Linear Regression: Estimating Parameters ɸ', # Unicode phi for main title
        title_x=0.5, # Center the main title
        hovermode='closest', # Optimizes hover interactions
        template="plotly_white", # Sets a clean white background theme
        
        # --- Legend Position Adjustment ---
        legend=dict(
            x=1.05,        # X-coordinate relative to the plot area (1.0 is right edge)
            y=1,           # Y-coordinate (1.0 is top edge)
            xanchor='left', # Anchor the legend's left side at 'x'
            yanchor='top',  # Anchor the legend's top side at 'y'
            bgcolor='rgba(255,255,255,0.7)', # Semi-transparent background
            bordercolor='Black',
            borderwidth=1
        ),
        # --- Add a margin to the right to make space for the legend ---
        margin=dict(r=150) # Right margin in pixels. Adjust as needed.
    )

    # Add the Mean Squared Error (MSE) loss annotation
    fig.add_annotation(
        x=0.02, y=1.05, xref="paper", yref="paper", # Position: 2% from left, 105% from bottom (above plot)
        text=f'Mean Squared Error (MSE) Loss: {initial_loss:.4f}',
        showarrow=False, # Do not show an arrow pointing from the annotation
        bgcolor="yellow", # Background color for the text box
        opacity=0.9,
        borderpad=4,
        bordercolor="black",
        borderwidth=1,
        font=dict(size=12),
        align="left",
        valign="bottom" # Align bottom of text box to the y-coordinate
    )

    return fig

# --- 5. Function for Interactive Update Logic ---
def create_update_function(fig: go.FigureWidget, X_train: np.ndarray, y_train: np.ndarray, x_line_for_plot: np.ndarray):
    """
    Creates the inner callback function that ipywidgets.interact will execute
    whenever the slider values (w or b) change. This function updates the
    existing Plotly FigureWidget in place for smooth, live interaction.
    """
    def update_plot_plotly(w: float, b: float):
        """
        Updates the Plotly FigureWidget's traces and annotations based on
        the current weight (w) and bias (b) slider values.
        """
        # Calculate new predictions and loss based on current parameters
        y_pred_line = linear_regression_predict(x_line_for_plot, w, b)
        y_pred_points = linear_regression_predict(X_train, w, b)
        current_loss = mean_squared_error(y_pred_points, y_train)

        # Update Predicted Line trace (Trace 1)
        fig.data[1].x = x_line_for_plot # Ensure x-data is consistent
        fig.data[1].y = y_pred_line
        fig.data[1].name = f'Predicted Line: y = {w:.2f}x + {b:.2f}' # Update legend label

        # Update Error Lines trace (Trace 2)
        # Reconstruct x and y data for error lines for each update
        error_x_updated = []
        error_y_updated = []
        for i in range(len(X_train)):
            error_x_updated.extend([X_train[i], X_train[i], None])
            error_y_updated.extend([y_train[i], y_pred_points[i], None])
        fig.data[2].x = error_x_updated
        fig.data[2].y = error_y_updated

        # Update Predicted Points trace (Trace 3)
        fig.data[3].x = X_train # Ensure x-data is consistent
        fig.data[3].y = y_pred_points

        # Update the text of the MSE loss annotation (located at index 0 in annotations list)
        if fig.layout.annotations: # Robustly check if annotations exist
            fig.layout.annotations[0].text = f'Mean Squared Error (MSE) Loss: {current_loss:.4f}'

    return update_plot_plotly

# --- 6. Main Orchestration Function ---
def run_interactive_regression_demo():
    """
    Orchestrates the entire interactive linear regression demonstration.
    This function generates the data, sets up the Plotly FigureWidget,
    creates the interactive widgets, and links them to the plot update function.
    """
    # Define initial parameters for the demo (deliberately off from the true w=2.0, b=5.0)
    initial_w, initial_b = 1.78, 6.20
    num_samples = 50

    # 1. Generate the synthetic training data
    X_train, y_train = generate_regression_data(num_samples=num_samples)

    # 2. Set up the initial Plotly FigureWidget
    interactive_fig = setup_interactive_plot(X_train, y_train, initial_w, initial_b)

    # 3. Create the update function, passing the figure and data
    x_line_for_plot = np.array([-10, 10]) # This range is static for the line
    update_func = create_update_function(interactive_fig, X_train, y_train, x_line_for_plot)

    # 4. Create interactive FloatSlider widgets for weight (w) and bias (b)
    w_slider = FloatSlider(min=-5.0, max=5.0, step=0.01, value=initial_w,
                           description='Weight ɸ₁:', # Using Unicode subscript 1
                           continuous_update=True, layout=Layout(width='auto'))
    b_slider = FloatSlider(min=-10.0, max=15.0, step=0.01, value=initial_b,
                           description='Bias ɸ₂:', # Using Unicode subscript 2
                           continuous_update=True, layout=Layout(width='auto'))

    # 5. Display the Plotly FigureWidget in the Jupyter output
    display(interactive_fig)

    # 6. Link the sliders to the update function using ipywidgets.interact
    # This establishes the dynamic connection between slider movements and plot updates.
    interact(update_func, w=w_slider, b=b_slider);

    # Print guiding instructions for the user
    print("\nAdjust the sliders above to see how changing the model's parameters (ɸ₁ and ɸ₂) affects the predicted line, individual errors, and the overall MSE loss.")
    print("The goal of 'learning' is to find the ɸ₁ and ɸ₂ values that result in the lowest possible MSE loss, meaning the red line best fits the blue data points.")

# --- Execute the main function to run the demo ---

run_interactive_regression_demo()

*[Interactive output: a Plotly figure titled "Interactive Linear Regression: Estimating Parameters ɸ" showing the actual training data (blue), the predicted line (red), the individual error lines (gray, dotted), and the current MSE loss, with sliders for the weight ɸ₁ and the bias ɸ₂.]*

Adjust the sliders above to see how changing the model's parameters (ɸ₁ and ɸ₂) affects the predicted line, individual errors, and the overall MSE loss.
The goal of 'learning' is to find the ɸ₁ and ɸ₂ values that result in the lowest possible MSE loss, meaning the red line best fits the blue data points.

Let's expand on that concept, delving deeper into optimization, loss functions, and the role of parameters $\phi$.

### What is Optimization? 

As the plot above shows, the **Mean Squared Error (MSE)**, which serves as our **loss function**, produces a single scalar value that represents the quality of the fit. This scalar value quantifies how well our model's predictions align with the actual observed data. In the context of our interactive linear regression demo, a lower MSE signifies that the "Predicted Line" (red) is a better approximation of the "Actual Training Data" (blue points), with shorter "Individual Error Lines" (gray dotted lines).

The process of searching for the parameters $\phi$ (in our linear regression case, the weight $w$ and bias $b$) that yield the lowest value of a **penalty objective function** (like MSE) or the highest value of a **reward objective function** is called **optimization**.

Let's unpack this:

1.  **Objective Function (Loss Function / Cost Function / Reward Function):**
    At its core, an objective function is a mathematical expression that we want to either minimize or maximize.
    * **Loss/Cost Function (Penalty Objective Function):** When we aim to minimize it, it's typically called a loss function or cost function. Its value represents a "penalty" for how poorly our model is performing. The higher the loss, the worse the model's fit. MSE is a classic example of a loss function – we want to find $w$ and $b$ that make MSE as small as possible. Other common loss functions include Mean Absolute Error (MAE), cross-entropy (for classification), etc.
    * **Reward Function (Utility Function):** Less common in basic regression but prevalent in fields like reinforcement learning or economics, a reward function is what we aim to maximize. Its value represents how "good" a model's performance or a decision is.

2.  **Parameters $\phi$:**
    These are the tunable elements within our model that we adjust during the optimization process. In our linear regression example, $\phi$ represents the pair $(w, b)$. These parameters define the specific instance of our linear model (the slope and y-intercept of the red line). The goal of optimization is to find the *optimal* values for these parameters.

3.  **Optimization:**
    Optimization is the algorithmic process of iteratively adjusting the model's parameters ($\phi$) to achieve the best possible value of the objective function.
    * **The Search:** Imagine the MSE as a landscape where different combinations of $w$ and $b$ create different "altitudes" (MSE values). We want to find the lowest point in this landscape (the global minimum). Optimization algorithms are like sophisticated hikers traversing this landscape.
    * **Iterative Adjustment:** Instead of randomly trying values, optimization algorithms use systematic approaches. A very common one in machine learning is **Gradient Descent**.
        * **Gradient Descent:** This method works by calculating the "gradient" (the direction of the steepest ascent) of the loss function with respect to each parameter. To *minimize* the loss, the algorithm takes small steps in the *opposite* direction of the gradient. For example, if increasing $w$ increases MSE, the algorithm will slightly decrease $w$. This process is repeated many times, gradually moving the parameters towards the minimum loss.
    * **Learning Rate:** A crucial concept in gradient descent is the "learning rate," which determines the size of each step taken. A small learning rate means slow but precise convergence; a large learning rate can lead to overshooting the minimum or instability.
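
The effect of the learning rate can be sketched on the simplest possible loss, $L(w) = w^2$, whose gradient is $2w$ and whose minimum is at $w = 0$. The starting point, the number of steps, and the three learning rates are illustrative assumptions.

```python
def gradient_descent(lr: float, w0: float = 5.0, steps: int = 20) -> list[float]:
    """Minimize L(w) = w^2 (gradient 2w) and return the visited values of w."""
    w = w0
    path = [w]
    for _ in range(steps):
        grad = 2 * w          # dL/dw, which points in the direction of steepest ascent
        w = w - lr * grad     # step in the opposite direction
        path.append(w)
    return path

for lr in (0.01, 0.1, 1.1):
    final_w = gradient_descent(lr)[-1]
    print(f"lr = {lr:<4} -> w after 20 steps = {final_w:.4f}")
```

With the smallest learning rate, $w$ moves toward 0 only slowly; with the moderate one it gets close to 0 within the 20 steps; with the largest one every step overshoots the minimum by more than the previous step, so $w$ grows in magnitude instead of shrinking.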

**In summary, for our linear regression:**

We start with arbitrary values for $w$ and $b$ (e.g., $1.78$ and $6.20$). These choices result in a certain MSE. The optimization process then systematically tweaks $w$ and $b$ based on how these changes affect the MSE. The ultimate aim is to discover the $w$ and $b$ values that yield the lowest possible MSE, thereby fitting the best straight line through our data points. This search for the ideal $\phi$ values is the very essence of how machine learning models "learn" from data.
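
The summary above can be sketched as a short gradient descent loop for the linear model. The data are generated in the same way as in the interactive demo (true $w = 2$, true $b = 5$, noise standard deviation 2), although the random draws differ, and the learning rate and number of steps are illustrative, untuned assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)                 # draws differ from the demo's
X = rng.uniform(-10, 10, 50)
y = 2.0 * X + 5.0 + rng.normal(0, 2.0, 50)      # true w = 2, true b = 5, plus noise

def mse(w: float, b: float) -> float:
    return float(np.mean((w * X + b - y) ** 2))

w, b = 1.78, 6.20                               # starting values used in the demo
learning_rate, steps = 0.01, 1000               # illustrative, untuned values
initial_loss = mse(w, b)

for _ in range(steps):
    error = w * X + b - y
    grad_w = np.mean(2 * error * X)             # dMSE/dw
    grad_b = np.mean(2 * error)                 # dMSE/db
    w -= learning_rate * grad_w                 # move against the gradient
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(f"w = {w:.3f}, b = {b:.3f}, MSE: {initial_loss:.3f} -> {final_loss:.3f}")
```

Because of the noise, the fitted values land near, but not exactly on, the true slope and intercept.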

**Note**:

Keep in mind that most deep learning and machine learning objectives aim for optimization that generalizes. The goal is not always to reach the global minimum of the training loss, because doing so can lead to overfitting, where the model performs worse on unseen data. We will return to this in the more detailed use cases.
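
As a small, hedged illustration of this point, the sketch below fits a straight line and a degree-9 polynomial to the same noisy linear data and reports the loss on the training points and on held-out points. The data, the noise level, the polynomial degrees, and the seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed

def make_data(x: np.ndarray) -> np.ndarray:
    return 2 * x + 1 + rng.normal(0, 0.3, x.shape)   # a noisy straight line

x_train = np.linspace(-1, 1, 15)
x_val = np.linspace(-0.95, 0.95, 15)   # held-out inputs, unseen during fitting
y_train, y_val = make_data(x_train), make_data(x_val)

def mse(coeffs: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"validation MSE = {results[degree][1]:.4f}")
```

The more flexible model can always push the training loss at least as low as the straight line, but a lower training loss does not by itself guarantee a lower loss on the held-out points, which is why both are worth tracking.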

### Example of Using Gradient Descent

Neural networks often have thousands of parameters and are trained on large-scale datasets, so the optimal parameters usually cannot be found with a closed-form solution. We therefore rely on optimization algorithms to find a minimum of the loss function.

#### Example: Quadratic Regression

Let us estimate ${\phi}_1$ and ${\phi}_2$ of a quadratic equation using the MSE loss function described above.

Let

$$Y = {\phi}_1 X^2 + {\phi}_2 X + 10$$

where the true values of ${\phi}_1$ and ${\phi}_2$ are 5 and 10, respectively. Assume that we do not know how to solve for them directly: we only have measurements of $X$ and $Y$, we know that the relationship is quadratic, and we want to estimate ${\phi}_1$ and ${\phi}_2$ using an optimization algorithm.


##### Implement Gradient Descent

In [32]:
import torch

def generate_input(start: float = -10.0, end: float = 10.0, steps: int = 100) -> torch.Tensor:
    """
    Generates a sequence of evenly spaced values between `start` and `end`.

    Args:
        start (float): Start of the range.
        end (float): End of the range.
        steps (int): Number of values to generate.

    Returns:
        torch.Tensor: A 1D tensor of evenly spaced values.
    """
    return torch.linspace(start=start, end=end, steps=steps)

def true_quadratic_function(x: torch.Tensor) -> torch.Tensor:
    """
    Computes the ground truth quadratic function y = 5x^2 + 10x + 10.

    Args:
        x (torch.Tensor): Input tensor.

    Returns:
        torch.Tensor: Output tensor after applying the true function.
    """
    return 5 * x**2 + 10 * x + 10

def model(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """
    Predicts the output using a parameterized quadratic model.

    Args:
        x (torch.Tensor): Input tensor.
        w1 (torch.Tensor): Weight for the quadratic term.
        w2 (torch.Tensor): Weight for the linear term.

    Returns:
        torch.Tensor: Predicted output tensor.
    """
    return w1 * x**2 + w2 * x + 10

def mse_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """
    Computes the Mean Squared Error (MSE) loss.

    Args:
        y_pred (torch.Tensor): Predicted output tensor.
        y_true (torch.Tensor): Ground truth output tensor.

    Returns:
        torch.Tensor: Scalar tensor representing the MSE loss.
    """
    return torch.mean((y_pred - y_true) ** 2)

def compute_gradients(
    x: torch.Tensor, 
    y_true: torch.Tensor, 
    w1: torch.Tensor, 
    w2: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Computes gradients of the loss with respect to model weights.

    Args:
        x (torch.Tensor): Input tensor.
        y_true (torch.Tensor): Ground truth output tensor.
        w1 (torch.Tensor): Current value of weight for x^2.
        w2 (torch.Tensor): Current value of weight for x.

    Returns:
        tuple[torch.Tensor, torch.Tensor]: Gradients with respect to w1 and w2.
    """
    y_pred = model(x, w1, w2)
    error = y_pred - y_true

    grad_w1 = torch.mean(2 * error * x**2)
    grad_w2 = torch.mean(2 * error * x)

    return grad_w1, grad_w2

def train(
    x: torch.Tensor, 
    y: torch.Tensor, 
    w1_init: float = 2.0, 
    w2_init: float = 1.0, 
    alpha: float = 1e-4, 
    epochs: int = 10000
) -> tuple[torch.Tensor, torch.Tensor, list[float]]:
    """
    Trains the quadratic model using gradient descent.

    Args:
        x (torch.Tensor): Input tensor.
        y (torch.Tensor): Ground truth output tensor.
        w1_init (float): Initial value for weight w1.
        w2_init (float): Initial value for weight w2.
        alpha (float): Learning rate.
        epochs (int): Maximum number of training iterations.

    Returns:
        tuple[torch.Tensor, torch.Tensor, list[float]]: 
            Final weights (w1, w2) and loss history.
    """
    w1 = torch.tensor(w1_init, dtype=torch.float32)
    w2 = torch.tensor(w2_init, dtype=torch.float32)
    loss_history = []

    for epoch in range(epochs):
        y_pred = model(x, w1, w2)
        current_loss = mse_loss(y_pred, y)
        loss_history.append(current_loss.item())

        # Early stopping if loss increases
        if epoch > 2 and loss_history[-1] > loss_history[-2]:
            print(f"Loss increased at epoch {epoch}, stopping training.")
            break

        grad_w1, grad_w2 = compute_gradients(x, y, w1, w2)
        w1 -= alpha * grad_w1
        w2 -= alpha * grad_w2

        if epoch % 100 == 0:
            print(f"[Epoch {epoch:4d}] Loss: {current_loss:.6f} | w1 = {w1:.4f}, w2 = {w2:.4f}")

    return w1, w2, loss_history


X = generate_input()
Y = true_quadratic_function(X)
final_w1, final_w2, losses = train(X, Y)

[Epoch    0] Loss: 21486.664062 | w1 = 3.2488, w2 = 1.0612
[Epoch  100] Loss: 703.516663 | w1 = 5.0000, w2 = 5.4826
[Epoch  200] Loss: 179.679733 | w1 = 5.0000, w2 = 7.7170
[Epoch  300] Loss: 45.890491 | w1 = 5.0000, w2 = 8.8462
[Epoch  400] Loss: 11.720507 | w1 = 5.0000, w2 = 9.4169
[Epoch  500] Loss: 2.993507 | w1 = 5.0000, w2 = 9.7053
[Epoch  600] Loss: 0.764523 | w1 = 5.0000, w2 = 9.8511
[Epoch  700] Loss: 0.195235 | w1 = 5.0000, w2 = 9.9247
[Epoch  800] Loss: 0.049870 | w1 = 5.0000, w2 = 9.9620
[Epoch  900] Loss: 0.012731 | w1 = 5.0000, w2 = 9.9808
[Epoch 1000] Loss: 0.003252 | w1 = 5.0000, w2 = 9.9903
[Epoch 1100] Loss: 0.000830 | w1 = 5.0000, w2 = 9.9951
[Epoch 1200] Loss: 0.000212 | w1 = 5.0000, w2 = 9.9975
[Epoch 1300] Loss: 0.000054 | w1 = 5.0000, w2 = 9.9987
[Epoch 1400] Loss: 0.000014 | w1 = 5.0000, w2 = 9.9994
[Epoch 1500] Loss: 0.000004 | w1 = 5.0000, w2 = 9.9997
[Epoch 1600] Loss: 0.000001 | w1 = 5.0000, w2 = 9.9998
[Epoch 1700] Loss: 0.000000 | w1 = 5.0000, w2 = 9.9999


#### Explanation

In the previous example, we used the entire training dataset to compute the gradients at every step while minimizing the loss of a quadratic model. This is a simple and effective way to demonstrate how iterative optimization estimates the parameters of a model, and the same idea applies to neural networks. However, in real-world applications, datasets are often large and high-dimensional, making full-batch gradient descent computationally expensive and memory-intensive.

To address these challenges, we use extensions of gradient descent such as:

* **Stochastic Gradient Descent (SGD):** Updates model parameters using a single randomly selected data point at a time.
* **Mini-batch Gradient Descent:** A compromise between full-batch gradient descent and SGD: it uses small random batches (e.g., 16–128 samples) to compute updates, offering a good trade-off between speed and stability.
* **Momentum:** Accelerates convergence by maintaining a velocity vector that helps the model build up speed in consistent gradient directions.
* **Adaptive optimizers like AdaGrad, RMSprop, and Adam:** These algorithms adapt the learning rate dynamically for each parameter based on past gradients, helping to improve convergence, especially in problems with sparse or noisy gradients.

These optimization techniques can reduce memory usage, often converge faster, and tend to be more robust to local minima and plateaus than plain full-batch gradient descent. They form the backbone of how modern deep learning systems are trained at scale.
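
As a rough sketch of two of these ideas combined, the loop below applies mini-batch gradient descent with momentum to the same quadratic problem as before. The batch size, learning rate, momentum coefficient, number of epochs, and seed are illustrative, untuned assumptions, and NumPy is used here only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)            # seed chosen arbitrarily
x = np.linspace(-10.0, 10.0, 100)
y = 5 * x**2 + 10 * x + 10                # same noise-free target as above

w = np.array([2.0, 1.0])                  # [w1, w2], same starting point as above
velocity = np.zeros_like(w)
lr, beta, batch_size, epochs = 2e-5, 0.9, 16, 300   # illustrative values

for epoch in range(epochs):
    order = rng.permutation(len(x))       # reshuffle the data every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]           # one mini-batch
        error = (w[0] * xb**2 + w[1] * xb + 10) - yb
        # Gradient of the mini-batch MSE with respect to [w1, w2]
        grad = np.array([np.mean(2 * error * xb**2), np.mean(2 * error * xb)])
        velocity = beta * velocity - lr * grad   # momentum: blend of past steps
        w = w + velocity

print(f"w1 = {w[0]:.4f}, w2 = {w[1]:.4f}")
```

Each update only touches `batch_size` examples, which is what keeps memory use low on large datasets; the velocity term smooths out the noise that comes from using a different batch at every step.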

However, due to time constraints, we will not be covering most of the optimization algorithms in depth, as our primary focus is on learning PyTorch Lightning.

Additionally, PyTorch abstracts away much of the mathematical complexity behind gradient computation, optimization algorithms, and learning rate scheduling. As a result, we will defer detailed discussions on optimizer selection and tuning, and revisit them later if needed, based on the specific use case.
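
To give a feel for this abstraction, here is a minimal sketch of the same quadratic fit in which PyTorch computes the gradients through autograd and `torch.optim.SGD` applies the update. The learning rate and number of epochs are assumptions carried over from the manual example, not tuned values.

```python
import torch

x = torch.linspace(-10.0, 10.0, 100)
y = 5 * x**2 + 10 * x + 10

# requires_grad=True asks autograd to track operations on these tensors
w1 = torch.tensor(2.0, requires_grad=True)
w2 = torch.tensor(1.0, requires_grad=True)

optimizer = torch.optim.SGD([w1, w2], lr=1e-4)   # plain full-batch gradient descent
loss_fn = torch.nn.MSELoss()

for epoch in range(2000):
    optimizer.zero_grad()                  # clear gradients from the previous step
    y_pred = w1 * x**2 + w2 * x + 10
    loss = loss_fn(y_pred, y)
    loss.backward()                        # autograd fills w1.grad and w2.grad
    optimizer.step()                       # w <- w - lr * grad

print(f"w1 = {w1.item():.4f}, w2 = {w2.item():.4f}, loss = {loss.item():.6f}")
```

Compared with the manual version, no gradient formula is written by hand: swapping in a different optimizer would only change the line that constructs `optimizer`.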

Bonus Materials: 
1. [Optimization](https://d2l.ai/chapter_optimization/index.html)
2. [Gradient Descent Quest](https://www.youtube.com/watch?v=vMh0zPT0tLI)

Next, we will give a basic introduction to unsupervised learning, reinforcement learning, and semi-supervised learning as examples, building on the generic terms introduced above, before moving on to other algorithms.


We will also cover basic PyTorch fundamentals before constructing neural networks.