<div style="text-align: justify;">

# **BLOCK 1: Linear and Logistic Regression**

### **Block Objective**
Given a data set of pairs $(x_n, t_n)$, a supervised learning task is posed: build a model capable of predicting the output from the input.
<br>

#### The general process consists of:

1. Identifying the type of problem based on the data **(1.1)**
2. Choosing an appropriate model type **(1.2)**
3. Defining a cost function that allows for comparison of different models **(1.3)**
4. Optimizing the model parameters to obtain the best possible fit to the data **(1.4, 1.5)**
5. **Extra:** Alternative optimization using Newton's method **(1.6)**

$\text{data → model choice → many possible models → criteria for comparing them → finding the optimum}$ <br><br>


</div>

---



<div style="text-align: justify;">

## **1.1 Problem Contextualization**
**Objective:** To determine the nature of the problem based on the type of target variable.

It answers the question: What type of problem do I have and what type of output do I want to predict?

Each data point is represented as a pair $(x_n, t_n)$ where:

- $x_n$ is the input (observed variables)
- $t_n$ is the actual output or target variable

Depending on the type of target variable, two types of tasks are distinguished:

### **Output Type $t$**
- **Continuous Output** ($t \in \mathbb{R}$)

  The goal is to predict a numerical value
  
  e.g., height based on age, house price based on square meters, weight based on height

- **Binary Output** ($t \in \{0,1\}$)

  The goal is to decide between two possible classes → Classification
  
  e.g., pass/fail, healthy/sick patient, yes/no, smoker/non-smoker

The analysis of the output type determines the formulation of the problem, but does not yet establish the specific model, as this will be defined in the following section.

<div style="border:2px solid black; padding:15px; text-align:center; margin:20px 0;">
Based on the analysis carried out in this section, a mathematical model is selected that is suitable for representing the relationship between inputs and output (1.2).
</div> <br>

</div>

---

<div style="text-align: justify;">

## **1.2 Regression Models: Hypotheses and Parameterization**
### **Types of Models** 
The model type is determined by the nature of the target variable $t$.<br><br>

#### **1.2.1 Linear Regression**
When the target variable $t$ is continuous, it is assumed that there is a linear relationship between the input $x$ and the output $t$ . The model prediction is obtained directly from a linear combination of the input.

The model is expressed as:
$$
y = wx + b \quad \text{with }  w, b \in \mathbb{R}
$$
where:

- $w$ → weight or slope of the line
- $b$ → intercept ($y$ when $x=0$)
- $y$ → model prediction

**note:** $w$ and $b$ are the model parameters

Each pair $(w,b)$ defines a line in the plane $(x,t)$ , which represents a candidate model. These parameters are common to all data, describing a global line, while what changes in each observation is the input value $x_n$ . In general, there are infinitely many candidate $(w,b)$. 

Given a dataset with more than one point, multiple possible lines can be drawn, each corresponding to a different pair $(w,b)$ . Intuitively, some lines will fit the observed data better than others. This graphical intuition motivates the need to define a quantitative measure that allows lines to be compared and determines which one is the best, leading to the introduction of the cost function (1.3).

If the data exhibit a perfect linear relationship, there exists a line that passes through all the points, which is therefore the best possible model.

However, in real-world situations the relationship between variables is not perfectly linear, so there is no line that passes through all the points. In this case, it is necessary to introduce a criterion that allows different models to be compared and to select the one that best fits the data (cost function).

<div align="center">
  <img src="imagenes/foto1.png" width="50%">
</div>

**Limit case:** if there is only a single data point, there are infinitely many lines that pass through that point, so the model is not uniquely determined. This shows that the information is insufficient to identify an optimal model without additional criteria. (Note: all of these lines are equally valid, because each one passes through the point and therefore has zero error, $e = 0$.)

<div align="center">
  <img src="imagenes/foto2.png" width="50%">
</div><br><br>

**Illustrative example:** Given a set of points, graphically represent three candidate lines and visually compare which one appears to fit the data best. Each line represents a possible model.
<div align="center">
  <img src="imagenes/foto3.png" width="50%">
</div><br><br>

**Keep in mind !** In multiple linear regression, the prediction depends on several inputs:
$$
y = w_1 x_1 + w_2 x_2 + \cdots + w_k x_k + b
$$
This increases the model parameters from $2$ to $k+1$ . <br><br><br>
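
The two prediction formulas above can be sketched with one small function. This is only an illustration: the name `predict` and the numeric values are assumptions chosen for the example, not part of the original notes.

```python
def predict(weights, bias, inputs):
    """Return y = w_1*x_1 + ... + w_k*x_k + b for a single observation."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Simple linear regression is the special case k = 1 (illustrative values).
y_simple = predict([2.0], 1.0, [3.0])            # 2*3 + 1

# Multiple linear regression with k = 2 inputs (illustrative values).
y_multi = predict([2.0, -1.0], 0.5, [3.0, 4.0])  # 2*3 - 1*4 + 0.5
```

The same parameters `weights` and `bias` are reused for every observation; only `inputs` changes from one data point to the next.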

#### **1.2.2 Logistic Regression**
When the target variable $t$ is binary, the linear combination is not interpreted directly as a prediction, but as an intermediate variable that is transformed through a sigmoid activation function to obtain a probability.

The model starts with a linear combination of the inputs:
$$
z = wx + b \quad \text{with } w, b \in \mathbb{R}
$$
and transforms it using the sigmoid function:
$$y = \sigma(z) = \frac{1}{1 + e^{-z}}$$
<br>

<div style="margin-left: 2em;">

**Clarification:**  
As this is a binary classification problem: 
$$
y = P(t = 1 \mid x)
$$
Thus, the model does not directly predict the class, but rather a probability of belonging to class $1$ (event occurring/positive result).
</div>
<br>

**Decision boundary**
To finally assign a class, a decision boundary is defined, which is usually set at $0.5$:
- If $y \ge 0.5$, the model assigns class $1$
- If $y < 0.5$, the model assigns class $0$

<br><br>

&emsp;&emsp;&emsp;&emsp;**Illustrative graphic:**
<div align="center">
  <img src="imagenes/foto4.png" width="50%">
</div><br><br>

**Example:** 
If $P(t = 1 \mid x)=0.8$, the prediction is class $1$.<br>
If $P(t = 1 \mid x)=0.4$, the prediction is class $0$.
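
A minimal sketch of the logistic model and its decision rule, assuming a single input and hypothetical parameter values ($w = 1.5$, $b = -3$) picked only for illustration. The function names are likewise assumptions.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Model output y = sigma(w*x + b), read as P(t = 1 | x)."""
    return sigmoid(w * x + b)


def predict_class(w, b, x, threshold=0.5):
    """Turn the probability into a class using the decision boundary."""
    return 1 if predict_proba(w, b, x) >= threshold else 0


w, b = 1.5, -3.0                       # hypothetical parameters
class_high = predict_class(w, b, 3.0)  # z = 1.5  -> probability above 0.5
class_low = predict_class(w, b, 1.5)   # z = -0.75 -> probability below 0.5
class_edge = predict_class(w, b, 2.0)  # z = 0 -> probability exactly 0.5
```

With these values the boundary $y = 0.5$ corresponds to $z = 0$, that is, $x = -b/w = 2$.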

<br>

**Important !**  Defining the decision boundary allows the predicted probability to be transformed into a specific class. The following section (1.3) analyzes how the model parameters are adjusted using a cost function that penalizes classification errors.  

**Model parameters**

The parameters  $w$ and $b$ completely define the model and are common to all data, as in linear regression.

Logistic regression is linear in the parameters through $z = wx + b$, but the output $y = \sigma(z)$ is nonlinear because of the sigmoid activation function. The sigmoid keeps $y$ in the interval $(0,1)$, which allows the prediction $y$ to be interpreted as a probability.
<br>

</div>

---


<div style="text-align: justify;">

## **1.3 Cost function: what it is and what it is used for**
<div style="border:2px solid black; padding:15px; text-align:center; margin:20px 0;">
Once the model type has been chosen (1.2), there are multiple possible parameter configurations that define different candidate models. To compare which one best fits the data, the cost function is used, which quantifies the discrepancy between the model predictions and the actual values and allows the most appropriate model to be selected.
</div> <br>

### **Individual loss and total cost**
For each data point $n$ , an individual loss function is defined:
$$
l(t_n, y_n)
$$
which measures the error made by the model when predicting $y_n$ when the actual value is $t_n$ . This function must be optimizable.

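**Important !** The error (or residual) is the difference between the actual value and the predicted value, $e_n = t_n - y_n$, which is the convention used in the cost functions of this section. It can be positive (underestimation, $y_n < t_n$) or negative (overestimation, $y_n > t_n$).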


### **Total cost function**
This is defined as the sum of the individual losses $\equiv \text{TOTAL loss}$:
$$
L(w,b) = \sum_{n=1}^{N} l(t_n, y_n)
$$
Therefore, the total cost function converts the learning problem into a quantifiable and optimizable problem.

<div style="margin-left: 3em;">

**Interpretation of the cost function**  
Establishes a criterion of discrepancy or suitability for evaluating the model in relation to the task or question. It measures how close or far the model's predictions are from the actual values.

$\text{The lower the cost function, the better the model, because it will be closer and better adjusted to the actual data}$
</div>

**Important !** It does not have to be strictly “distance”: it can be any measure
that adequately captures the error made by the model.

### **Types of cost functions**
#### **Linear regression :**
**[Common examples]**

- **Quadratic** error:  $$L(w,b)=\sum_{n=1}^{N}(t_n-y_n)^2 = \sum_{n=1}^{N}\big(t_n-(wx_n+b)\big)^2$$

<div style="margin-left: 3em;">
  <strong>Motivation:</strong>
  <ul style="margin-left: 0;">
    <li>Symmetry: treats positive and negative errors equally</li>
    <li>Amplification of major errors: the greater the discrepancy, the greater the penalty</li>
    <li>Smooth optimization: It is differentiable and convex → guarantees a global minimum</li>
  </ul>
</div> <br>

- **Cubic** error: $$L(w,b)=\sum_{n=1}^{N}(t_n-y_n)^3 = \sum_{n=1}^{N}\big(t_n-(wx_n+b)\big)^3$$

<div style="margin-left: 3em;">
  <strong>Motivation:</strong>
  <ul style="margin-left: 0;">
    <li>Asymmetry: it treats positive and negative errors differently. With $e_n = t_n - y_n$, positive errors increase the cost function while negative errors decrease it.
    → Result: negative errors are rewarded instead of penalized, so the cost has no lower bound and minimizing it can push the model toward ever larger overestimations → mismatch</li>
    <li>Error amplification: Large errors are amplified even further, while small errors carry less weight.</li>
  </ul>
</div> <br>

- Errors with **higher exponents**: $$L(w,b)=\sum_{n=1}^{N}(t_n-y_n)^k = \sum_{n=1}^{N}\big(t_n-(wx_n+b)\big)^k$$

<div style="margin-left: 3em;">
  <strong>Motivation:</strong>
  <ul style="margin-left: 0;">
    <li> $k$ even → symmetry</li>
    <li> $k$ odd → asymmetry</li>
    <li>The higher $k$ is, the more large errors are amplified and the less weight small errors carry.</li>
  </ul>
</div> <br>

- **Absolute** error: $$L(w,b)=\sum_{n=1}^{N}|t_n-y_n| = \sum_{n=1}^{N}\big|t_n-(wx_n+b)\big|$$

<div style="margin-left: 3em;">
  <strong>Motivation:</strong>
  <ul style="margin-left: 0;">
    <li>Symmetry: positive and negative errors are treated equally</li>
    <li>Linear penalty: Each error is counted proportionally, without amplifying large ones.</li>
  </ul>
</div> <br><br>

**Illustrative example:** Given a set of points $(x_1,t_1)=(1,1)$ , $(x_2,t_2)=(2,6)$ and choosing the line as the candidate model:    
$$
y = 2x
$$
Compare the cost function $L$ using the above forms:

<div align="center">
  <img src="imagenes/foto5.png" width="50%">
</div><br><br>
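
The comparison in this example can also be reproduced numerically. The sketch below uses the two points and the candidate line $y = 2x$ stated above, with residuals taken as $t_n - y_n$ as in the formulas of this section. Variable names are illustrative.

```python
xs = [1.0, 2.0]
ts = [1.0, 6.0]


def predict(x):
    """Candidate model from the example: y = 2x (w = 2, b = 0)."""
    return 2.0 * x


# Residuals t_n - y_n for each data point: [-1.0, 2.0]
residuals = [t - predict(x) for x, t in zip(xs, ts)]

quadratic = sum(e ** 2 for e in residuals)   # 1 + 4 = 5
cubic = sum(e ** 3 for e in residuals)       # -1 + 8 = 7
absolute = sum(abs(e) for e in residuals)    # 1 + 2 = 3
```

Note how the cubic cost lets the negative residual cancel part of the positive one, while the quadratic and absolute costs count both residuals as errors.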

#### **Logistic regression :**

- **Binary cross-entropy** $\equiv \text{log-loss}$:
$$
L = -\sum_{n=1}^{N} \left[ t_n \log(y_n) + (1 - t_n)\log(1 - y_n) \right]
$$

<div style="margin-left: 2em;">

**Motivation:**

- It strongly penalizes confident but incorrect predictions, that is, when the prediction differs from the actual label.

- For $t \in \{0,1\}$:
  - If $y_n = t_n$ → $l \approx 0$ (minimum error)
  - If $y_n \neq t_n$ → $l > 0$, increases the more confident the incorrect prediction is

  Extreme cases:

  - $t_n = 1 \Rightarrow l(t_n, y_n) = -\log(y_n)$
    - if $y_n = 1$ → $l = 0$  *(Confident and correct)*
    - if $y_n = 0$ → $l = +\infty$  *(Confident and incorrect)*

  - $t_n = 0 \Rightarrow l(t_n, y_n) = -\log(1 - y_n)$
    - if $y_n = 0$ → $l = 0$  *(Confident and correct)*
    - if $y_n = 1$ → $l = +\infty$  *(Confident and incorrect)*

- Rewards correct predictions and penalizes incorrect ones.

- Differentiable and convex in $(w,b)$ → any minimum is the global minimum, which allows it to be found efficiently using gradient descent.
</div><br><br>
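
To make the behaviour of the log-loss concrete, here is a small sketch of the per-observation loss. The clipping constant `eps` is an assumption added for numerical safety (it avoids `log(0)`), not something prescribed by the notes, so the "infinite" loss of the extreme cases shows up as a large finite number.

```python
import math


def bce_loss(t, y, eps=1e-12):
    """Binary cross-entropy for one observation with label t in {0, 1}."""
    y = min(max(y, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
    return -(t * math.log(y) + (1 - t) * math.log(1.0 - y))


def total_cost(ts, ys):
    """Total cost L: sum of the individual losses."""
    return sum(bce_loss(t, y) for t, y in zip(ts, ys))


confident_correct = bce_loss(1, 0.99)  # small loss
undecided = bce_loss(1, 0.5)           # equals log(2)
confident_wrong = bce_loss(1, 0.01)    # large loss
```

The three values increase in that order, which is the "penalize confident but incorrect predictions" behaviour described above.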

**Illustrative graph of the prediction, decision boundary, and misclassification:**

<div align="center">
  <img src="imagenes/foto6.png" width="50%">
</div><br><br>

<div style="border:2px solid black; padding:15px; text-align:center; margin:20px 0;">
The cost function converts an intuitive judgment (“which line fits best”) into a mathematical optimization problem.
</div> <br>

</div>

---


<div style="text-align: justify;">

## **1.4 Partial derivatives and chain rule**

**Objective:** obtain the partial derivatives of the cost function $L(w,b)$ with respect to the model parameters:

$$
\frac{\partial L(w,b)}{\partial w}, \quad \frac{\partial L(w,b)}{\partial b}
$$

### **Chain rule**
The cost function does not depend directly on the parameters $w$ and $b$, but rather through one or more intermediate variables. For this reason, the chain rule is used to calculate the partial derivatives, which allows all dependencies between variables to be chained together.

**General procedure:**

1. Define intermediate variables (in order):  
   Identify all intermediate variables that connect the parameters with the cost function.
2. Apply the chain rule to differentiate with respect to the parameter $w$ or $b$.
   <span style="display:block;">
   **Important:** we differentiate with respect to one parameter while the other remains constant.
   </span>

**Difference** between linear and logistic regression: The difference between the two models lies in
the intermediate variables that relate the parameters to the cost function.
<br><br>

<div style="margin-left: 2.5em;">

#### Linear regression

1. Intermediate variable: $y_n = w x_n + b$ <br>

2. Apply chain rule (where $L_n = l(t_n, y_n)$ is the loss of observation $n$):
   - Differentiate with respect to $w$: $\frac{\partial L_n}{\partial w} = \frac{\partial L_n}{\partial y_n} \cdot \frac{\partial y_n}{\partial w}$

   - Differentiate with respect to $b$: $\frac{\partial L_n}{\partial b} = \frac{\partial L_n}{\partial y_n} \cdot \frac{\partial y_n}{\partial b}$ <br><br>


#### Logistic regression

1. Intermediate variables: $z_n = w x_n + b$ ,&nbsp;&nbsp;&nbsp;&nbsp;$y_n = \sigma(z_n) = \frac{1}{1 + e^{-z_n}}$

2. Apply chain rule:
   - Differentiate with respect to $w$: $\frac{\partial L_n}{\partial w} = \frac{\partial L_n}{\partial y_n} \cdot \frac{\partial y_n}{\partial z_n} \cdot \frac{\partial z_n}{\partial w}$

   - Differentiate with respect to $b$: $\frac{\partial L_n}{\partial b} = \frac{\partial L_n}{\partial y_n} \cdot \frac{\partial y_n}{\partial z_n} \cdot \frac{\partial z_n}{\partial b}$

</div>
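
As a worked instance of the chain rule, assume the quadratic loss $L_n = (t_n - y_n)^2$ for linear regression and the binary cross-entropy $L_n = -\big[t_n \log y_n + (1 - t_n)\log(1 - y_n)\big]$ for logistic regression (both taken from section 1.3). Writing out each factor gives:

$$
\begin{aligned}
\textbf{Linear:}\quad
& \frac{\partial L_n}{\partial y_n} = -2\,(t_n - y_n), \qquad
\frac{\partial y_n}{\partial w} = x_n, \qquad
\frac{\partial y_n}{\partial b} = 1 \\
& \Rightarrow\quad
\frac{\partial L_n}{\partial w} = -2\,(t_n - y_n)\,x_n, \qquad
\frac{\partial L_n}{\partial b} = -2\,(t_n - y_n)
\end{aligned}
$$

$$
\begin{aligned}
\textbf{Logistic:}\quad
& \frac{\partial L_n}{\partial y_n} = -\frac{t_n}{y_n} + \frac{1 - t_n}{1 - y_n} = \frac{y_n - t_n}{y_n (1 - y_n)}, \qquad
\frac{\partial y_n}{\partial z_n} = y_n (1 - y_n), \qquad
\frac{\partial z_n}{\partial w} = x_n, \qquad
\frac{\partial z_n}{\partial b} = 1 \\
& \Rightarrow\quad
\frac{\partial L_n}{\partial w} = (y_n - t_n)\,x_n, \qquad
\frac{\partial L_n}{\partial b} = y_n - t_n
\end{aligned}
$$

In both cases the derivative of the total cost $L(w,b)$ is the sum over $n$ of these per-observation terms.
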
<br><br>

</div>

---

<div style="text-align: justify;">

## **1.5 Optimization**

**Important !** While the cost function allows you to compare models, optimization allows you to select the best one among them.

**Objective:** Find the parameters $w$ and $b$ that minimize the cost function $L(w,b)$, that is, those that produce the smallest possible discrepancy between the model's predictions and the actual values.

<div align="center">

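**Note:** maximizing $f(w,b)$ is equivalent to minimizing $-f(w,b)$: both problems have the same optimal parameters, $\arg\max f(w,b) = \arg\min \big(-f(w,b)\big)$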
</div>

### **Cost function gradient**
Based on the partial derivatives obtained in section 1.4, the gradient of the cost function is defined as:
$$
\nabla L(w,b)
=
\left(
\frac{\partial L(w,b)}{\partial w},\;
\frac{\partial L(w,b)}{\partial b}
\right)^T
=
\begin{pmatrix}
\frac{\partial L(w,b)}{\partial w} \\
\frac{\partial L(w,b)}{\partial b}
\end{pmatrix}
$$

This vector reflects the variation in the cost function in response to small changes in each of the model parameters.

<div style="margin-left: 2.5em;">

**Interpretation of the gradient:**

- Direction: points in the direction in which the cost function increases fastest.
- Magnitude: indicates the intensity of this variation (slope).
- Optimization: moving in the opposite direction to the gradient reduces the value of the cost function.
<br><br>

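**Meaning of terms** in the partial derivatives (for example, with the quadratic error, $\frac{\partial L}{\partial w} = -2\sum_{n=1}^{N}(t_n - y_n)\,x_n$ and $\frac{\partial L}{\partial b} = -2\sum_{n=1}^{N}(t_n - y_n)$):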

- $t_n - y_n$: represents the prediction error for observation $n$.
- $x_n$: measures the influence of the input on the adjustment of parameter $w$.
- $\text{sign}$: indicates the direction in which the parameter should be modified to reduce the value of $L$.

Example: if $\frac{\partial L(w,b)}{\partial w} > 0$, an increase in $w$ increases the cost function; therefore, to reduce $L$, it is necessary to decrease $w$.
</div>
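
A short sketch of how the gradient can be computed and sanity-checked, assuming linear regression with the quadratic error from section 1.3. The finite-difference check and all function names are illustrative additions, and the data are the two points used in the example of section 1.3.

```python
def cost(w, b, xs, ts):
    """Quadratic cost L(w, b) = sum of (t_n - (w*x_n + b))^2."""
    return sum((t - (w * x + b)) ** 2 for x, t in zip(xs, ts))


def gradient(w, b, xs, ts):
    """Analytic partial derivatives (dL/dw, dL/db) of the quadratic cost."""
    dw = sum(-2.0 * (t - (w * x + b)) * x for x, t in zip(xs, ts))
    db = sum(-2.0 * (t - (w * x + b)) for x, t in zip(xs, ts))
    return dw, db


def numerical_gradient(w, b, xs, ts, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    dw = (cost(w + h, b, xs, ts) - cost(w - h, b, xs, ts)) / (2.0 * h)
    db = (cost(w, b + h, xs, ts) - cost(w, b - h, xs, ts)) / (2.0 * h)
    return dw, db


xs = [1.0, 2.0]
ts = [1.0, 6.0]
dw, db = gradient(2.0, 0.0, xs, ts)        # gradient at the line y = 2x
dw_num, db_num = numerical_gradient(2.0, 0.0, xs, ts)
```

Here both partial derivatives come out negative, so increasing $w$ and $b$ slightly from $(2, 0)$ lowers the cost, in line with the sign rule described above.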
<br>

**Important !** The role of the gradient does not depend on the model: it is always the vector of partial derivatives of the cost function. What changes between models is the explicit expression of the derivatives, not their meaning or their role in optimization.
<br><br>

### **Optimization Methods**
#### **1.5.1 Analytical Method**
**Objective:** Find the critical points of $L(w,b)$ , where the slope is zero.

The partial derivatives are set equal to zero:
$$
\frac{\partial L(w,b)}{\partial w} = 0, \quad
\frac{\partial L(w,b)}{\partial b} = 0
$$

Solving the system yields $w^*$ and $b^*$, which correspond to critical points (minima, maxima, or saddle points).<br>


**Note:** To classify a critical point, the Hessian is analyzed (see 1.6)<br><br>


**Important !** : If $L$ is **convex**, the critical point is the global minimum

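Therefore, the best possible model is the one defined by the optimal parameters:
$$
z = w^* x + b^*
$$
where the prediction is $y = z$ in linear regression and $y = \sigma(z)$ in logistic regression.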

<div style="margin-left: 2.5em;">

**Geometric interpretation:** The cost function forms a bowl (paraboloid) in the space $(w,b,L)$
</div> 

<div align="center">
  <img src="imagenes/foto7.png" width="30%">
</div><br><br>
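
For simple linear regression with quadratic error, setting both partial derivatives to zero can be solved by hand: $\frac{\partial L}{\partial b} = 0$ gives $b^* = \bar{t} - w^* \bar{x}$, and substituting into $\frac{\partial L}{\partial w} = 0$ gives $w^* = \frac{\sum_n (x_n - \bar{x})(t_n - \bar{t})}{\sum_n (x_n - \bar{x})^2}$. The sketch below implements these two formulas; the function name and the three data points are assumptions chosen for illustration.

```python
def fit_closed_form(xs, ts):
    """Return (w*, b*) minimizing the quadratic cost for one input variable."""
    n = len(xs)
    x_mean = sum(xs) / n
    t_mean = sum(ts) / n
    s_xt = sum((x - x_mean) * (t - t_mean) for x, t in zip(xs, ts))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        # All x values coincide (e.g. a single data point): the slope is not
        # determined, which is the limit case discussed in section 1.2.1.
        raise ValueError("all x values are equal; the slope is not determined")
    w_star = s_xt / s_xx
    b_star = t_mean - w_star * x_mean
    return w_star, b_star


# Hypothetical data, not taken from the notes.
xs = [0.0, 1.0, 2.0]
ts = [1.0, 3.0, 4.0]
w_star, b_star = fit_closed_form(xs, ts)  # approximately (1.5, 1.1667)
```

Because the quadratic cost is convex, this critical point is the global minimum.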


#### **1.5.2 Iterative method: Gradient Descent (GD)**
This method allows us to approximate the minimum of the cost function when it is not possible (or convenient) to solve the problem analytically. It consists of iteratively updating the model parameters in the opposite direction to the gradient $\nabla L(w,b)$.

**Requirement:** $L(w,b)$ must be differentiable. 

**General procedure:**

The parameters are initialized at iteration $t=0$ with given values $w_0$ and $b_0$ (here $t$ is the iteration index, not the target variable).
From there, they are updated iteratively:

$$
w_{t+1} = w_t - \alpha \cdot \frac{\partial L}{\partial w}(w_t, b_t)
\qquad\qquad
b_{t+1} = b_t - \alpha \cdot \frac{\partial L}{\partial b}(w_t, b_t)
$$

for $t = 0, 1, 2, \ldots$, until the algorithm converges or approaches the minimum of $L$.

- $\alpha$ is the learning rate and controls the step size.
  It can be any positive value, but **be careful!**
  - $\alpha$ too small → slow convergence.
  - $\alpha$ too large → risk of oscillations or divergence.
  - The optimal $\alpha$ depends on the curvature of $L$: Greater curvature → lower $\alpha$, because the slope changes more quickly.

In each iteration, the parameters move in the opposite direction to the gradient, progressively reducing the value of $L$.
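
The effect of the learning rate can be seen on a toy problem. The sketch below assumes a one-parameter cost $L(w) = (w - 3)^2$, which is not from the notes and is used only because its behaviour is easy to follow: each update multiplies the distance to the minimum by $(1 - 2\alpha)$.

```python
def gradient_descent_1d(alpha, w0=0.0, steps=20):
    """Minimize the toy cost L(w) = (w - 3)^2, whose derivative is 2*(w - 3)."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2.0 * (w - 3.0)
    return w


# Distance to the minimum w = 3 after 20 steps, starting from w = 0.
err_small = abs(gradient_descent_1d(0.01) - 3.0)  # alpha too small: still far away
err_good = abs(gradient_descent_1d(0.4) - 3.0)    # suitable alpha: essentially converged
err_large = abs(gradient_descent_1d(1.1) - 3.0)   # alpha too large: farther than at the start
```

With $\alpha = 1.1$ the factor $(1 - 2\alpha) = -1.2$ has magnitude above one, so the iterates overshoot the minimum on alternating sides and move away from it.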
<br>

**Convergence properties**

- If $L(w,b)$ is convex: GD converges to the global minimum regardless of the starting point, provided $\alpha$ is small enough
<div align="center">
  <img src="imagenes/foto8.png" width="50%">
</div><br><br>

- If $L(w,b)$ is not convex: there may be multiple local minima and the result depends on the starting point

<div align="center">
  <img src="imagenes/foto9.png" width="50%">
</div><br><br>


##### **1.5.2.1 Application to linear regression**  
Using the quadratic error as a cost function:

- $L(w,b)$ is convex
- There is an analytical solution
- GD and the analytical method converge to the same global minimum
<br><br>
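
A minimal sketch of Gradient Descent for linear regression with quadratic error. The data set, the learning rate $\alpha = 0.05$ and the number of steps are assumptions chosen so that the run is stable; for these data the analytical solution is $w^* = 1.5$, $b^* = 7/6$.

```python
def gradient(w, b, xs, ts):
    """Partial derivatives of the quadratic cost with respect to w and b."""
    dw = sum(-2.0 * (t - (w * x + b)) * x for x, t in zip(xs, ts))
    db = sum(-2.0 * (t - (w * x + b)) for x, t in zip(xs, ts))
    return dw, db


def gradient_descent(xs, ts, alpha=0.05, steps=2000, w0=0.0, b0=0.0):
    """Repeat the update rule, moving against the gradient at every step."""
    w, b = w0, b0
    for _ in range(steps):
        dw, db = gradient(w, b, xs, ts)
        w, b = w - alpha * dw, b - alpha * db
    return w, b


# Hypothetical data, not taken from the notes.
xs = [0.0, 1.0, 2.0]
ts = [1.0, 3.0, 4.0]
w_gd, b_gd = gradient_descent(xs, ts)
```

Starting from a different `(w0, b0)` leads to the same result, as expected for a convex cost.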

##### **1.5.2.2 Application to logistic regression**  
Using binary cross-entropy as the cost function:

- $L(w,b)$ is convex
- There is no closed-form analytical solution
- GD is the usual method for optimizing the parameters
<br><br>
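
The same loop can be sketched for logistic regression, where only the gradient expression changes: with cross-entropy and a sigmoid output, the per-observation derivative with respect to $z_n$ is $y_n - t_n$. The data set (chosen so that the two classes overlap), the learning rate and the step count are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(w, b, xs, ts):
    """Total binary cross-entropy for the 1D logistic model."""
    total = 0.0
    for x, t in zip(xs, ts):
        y = sigmoid(w * x + b)
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total


def gradient(w, b, xs, ts):
    """dL/dw = sum (y_n - t_n) x_n and dL/db = sum (y_n - t_n)."""
    dw = sum((sigmoid(w * x + b) - t) * x for x, t in zip(xs, ts))
    db = sum(sigmoid(w * x + b) - t for x, t in zip(xs, ts))
    return dw, db


def gradient_descent(xs, ts, alpha=0.1, steps=5000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradient(w, b, xs, ts)
        w, b = w - alpha * dw, b - alpha * db
    return w, b


# Hypothetical, non-separable data: the classes overlap along x.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [0, 1, 0, 1]
w_gd, b_gd = gradient_descent(xs, ts)
```

If the classes were perfectly separable, the cost would keep decreasing as $|w|$ grows and the iterates would not settle at a finite minimum, which is why overlapping data are used here.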

Gradient Descent is a general optimization method that does not depend on the model, but only on the cost function and its gradient.
<br><br><br>

</div>

---

<div style="text-align: justify;">

## **1.6 Advanced optimization: replacing Gradient Descent with the Newton method**

### **Motivation: limitations of GD**
In the previous section, we studied the Gradient Descent method, which uses only first-order information (the gradient) to minimize the cost function $L(w,b)$. Although GD is simple and robust, it can converge slowly, especially when:

- The cost function has a pronounced curvature (very steep or very flat areas).
- The gradient changes significantly in different directions (ill-conditioned problems).

To improve convergence speed, the Newton method is introduced, an optimization method
that uses second-order information, incorporating the local curvature
of the cost function.

### **General idea of Newton's method**
Newton's method is based on approximating the cost function $L(w,b)$ using a second-order Taylor expansion around a current point $\theta_t = (w_t, b_t)$:

$$
L(\theta) \approx L(\theta_t) + \nabla L(\theta_t)^T(\theta - \theta_t)
\;+\;
\frac{1}{2}(\theta - \theta_t)^T H(\theta_t)(\theta - \theta_t)
$$

- $L(\theta_t)$: current value of the function
- $\nabla L(\theta_t)^T(\theta - \theta_t)$: slope (gradient)
- $\frac{1}{2}(\theta - \theta_t)^T H(\theta_t)(\theta - \theta_t)$: curvature (Hessian)

**Analysis:**

- GD only follows the slope → uniform steps defined by $\alpha$
- Newton, apart from the slope, also looks at the curvature → adapted steps, approaching the minimum faster.
<br>

**Illustrative graph of Gradient Descent vs. Newton in 2D:**

<div align="center">
  <img src="imagenes/foto10.png" width="50%">
</div><br><br>


### **Hessian: definition and interpretation**
It is the matrix of second partial derivatives of the cost function with respect to the parameters:

$$
H =
\begin{pmatrix}
\frac{\partial^2 L(w,b)}{\partial w^2} & \frac{\partial^2 L(w,b)}{\partial w \partial b} \\
\frac{\partial^2 L(w,b)}{\partial b \partial w} & \frac{\partial^2 L(w,b)}{\partial b^2}
\end{pmatrix}
$$

<div style="margin-left: 2.5em;">

#### **Interpretation of the Hessian:**

- Describes the local curvature of $L(w,b)$
- Evaluated at a critical point, it allows the region around that point to be classified as:
  - local minimum (valley)
  - local maximum (hill)
  - saddle point
- For smooth functions, the symmetry of cross derivatives is satisfied:
$$
\frac{\partial^2 L(w,b)}{\partial w \partial b} = \frac{\partial^2 L(w,b)}{\partial b \partial w} 
$$
which guarantees that $H$ is symmetric.
<br><br>

#### **Smoothness requirements for Newton:**

For Newton's method to be applicable, the cost function must:

- Be differentiable at least twice with respect to all parameters.
- Not have any discontinuities or irregularities that prevent the existence of continuous partial derivatives.
- Satisfy the symmetry of the second cross derivatives, ensuring that the Hessian is symmetric and usable in Newton's update.
<br>

**Note:** Both linear regression with quadratic error and logistic regression with cross-entropy fulfil these conditions. Therefore, the Hessian is symmetric and Newton can be applied safely.
</div>
<br>

#### **Side note: Local vs. global analysis**
Before classifying a critical point, it is important to distinguish between local and global properties.

- **Local minimum and maximum**
  - A local minimum is the lowest point in a region
  - A local maximum is the highest point in a region
  <br><br>
  
- **Global minimum and maximum**
  - A global minimum is the lowest point in the entire function
  - A global maximum is the highest point in the entire function
  
  Every global minimum is local, but not every local minimum is global !!!
 
<br>

#### **Global Form of the Function**

When we analyze the function as a whole, we talk about:

- **Convex** function:
  - Every local minimum is also a global minimum (and it is unique if the function is strictly convex)
<br><br>

- **Concave** function:
  - Every local maximum is also a global maximum (and it is unique if the function is strictly concave)
 <br><br>
 
- **Non-convex and non-concave** function:
  - It can present multiple local minima, local maxima, and saddle points
 <br><br>

#### **Local classification using the Hessian (region around the point)**

<div style="margin-left: 3em;">

**Case 1: Two-variable functions (2D)**  
When the function depends on two parameters, local classification can
be performed using the Hessian determinant:
$$
H =
\begin{pmatrix}
L_{ww} & L_{wb} \\
L_{bw} & L_{bb}
\end{pmatrix}
$$


$$
D(H) = L_{ww}\cdot L_{bb} - (L_{wb})^2
$$

Then:

- If $D(H) > 0$ and $L_{ww} > 0$ → local minimum (valley)
- If $D(H) > 0$ and $L_{ww} < 0$ → local maximum (hill)
- If $D(H) < 0$ → saddle point
- If $D(H) = 0$ → inconclusive
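
The four rules above translate directly into a small function. This is only a sketch: the function name is an assumption, and the exact comparison with zero is kept for readability (with floating-point Hessians a small tolerance would be more appropriate).

```python
def classify_critical_point(l_ww, l_wb, l_bb):
    """Classify a critical point of L(w, b) from its second derivatives."""
    d = l_ww * l_bb - l_wb ** 2  # determinant of the symmetric 2x2 Hessian
    if d > 0 and l_ww > 0:
        return "local minimum"
    if d > 0 and l_ww < 0:
        return "local maximum"
    if d < 0:
        return "saddle point"
    return "inconclusive"


bowl = classify_critical_point(2.0, 0.0, 2.0)     # e.g. L = w^2 + b^2
hill = classify_critical_point(-2.0, 0.0, -2.0)   # e.g. L = -w^2 - b^2
saddle = classify_critical_point(2.0, 0.0, -2.0)  # e.g. L = w^2 - b^2
flat = classify_critical_point(1.0, 1.0, 1.0)     # D(H) = 0
```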

<br>

**Case 2: Functions of more than two variables**  
The determinant criterion is no longer sufficient, so the classification is performed
using the eigenvalues of $H$, which are obtained by solving the characteristic equation:

$$
D(H - \lambda I) = 0
$$

| **Hessian** | **Property** | **Conclusion** |
|---|---|---|
| Positive definite | All eigenvalues > 0 | Local minimum |
| Negative definite | All eigenvalues < 0 | Local maximum |
| Indefinite | Eigenvalues with different signs | Saddle point |
| Semidefinite | If any eigenvalue = 0 | Inconclusive |

</div>

**Connection with the Newton method**

For the Newton method, it is particularly important that the Hessian is symmetric and positive definite in the vicinity of the minimum.

This guarantees that the update direction corresponds to a descent and that the method converges to a minimum.

In convex cost functions, such as the quadratic error or the cross-entropy, the Hessian is positive semidefinite throughout the domain, which ensures that any minimum reached by Newton's method is also the global minimum.
<br><br>

### **Illustrative graph of curvature and gradient in 3D:**

<div align="center">
  <img src="imagenes/foto11.png" width="30%">
</div><br><br>


### **Newton's update rule**

The update of the parameters is given by:
$$
\theta_{t+1}
=
\theta_t
-
H^{-1}\nabla L(\theta_t)
$$
where:

- $\theta = (w, b)^T$ is the model parameter vector
- $\nabla L(\theta_t)$ is the gradient
- $H^{-1}$ is the inverse of the Hessian
<br><br>

Unlike GD, no learning rate $\alpha$ is needed, as the step size is automatically adjusted according to the curvature of $L$ . 
<br><br>

#### 1.6.1 Application to linear regression

**Pseudocode: Newton's method for linear regression**

1. Initialize the parameters $w$ and $b$
2. Repeat:
   - Calculate the predictions: $y_n = w x_n + b$
   - Calculate the partial derivatives of the cost function:
     $$\frac{\partial L(w,b)}{\partial w}, \quad
     \frac{\partial L(w,b)}{\partial b}$$
   - Calculate the Hessian $H$ (matrix of second derivatives):
     $$H =
     \begin{pmatrix}
     \frac{\partial^2 L(w,b)}{\partial w^2} &
     \frac{\partial^2 L(w,b)}{\partial w \, \partial b} \\
     \frac{\partial^2 L(w,b)}{\partial b \, \partial w} &
     \frac{\partial^2 L(w,b)}{\partial b^2}
     \end{pmatrix}$$
   - Update the parameters using Newton's method:
     $$\theta_{t+1} = \theta_t - H^{-1} \nabla L(\theta_t)$$
3. Until the minimum of the cost function is reached
<br><br>


**For linear regression with quadratic error:**
$$L(w,b)=\sum_{n=1}^{N}(t_n-y_n)^2 = \sum_{n=1}^{N}\big(t_n-(wx_n+b)\big)^2$$

- The cost function is convex and quadratic.
- The Hessian is constant and does not depend on the parameters:
$$
H =
2\begin{pmatrix}
\sum x_n^2 & \sum x_n \\
\sum x_n   & N
\end{pmatrix}
$$
- Newton's method converges to the global minimum in a single iteration (in theory).
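- This coincides with the closed-form analytical solution: Newton generalizes the closed-form approach, since for a quadratic cost a single Newton step solves the same linear system as setting the gradient to zero.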
<br><br>

**Illustrative example:**  Given a set of points  $(x_1,t_1)=(1,2)$ , $(x_2,t_2)=(2,3)$, and considering a linear regression model with quadratic error, demonstrate that the minimum obtained using the analytical method coincides with the minimum reached after a single iteration of Newton's method:

<div align="center">
  <img src="imagenes/foto12.png" width="50%">
</div><br><br>
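
The same check can be sketched in code for the two points of this example. The gradient and Hessian below are those of the quadratic cost exactly as written in this section (no $\tfrac{1}{2}$ factor), so the Hessian carries a factor $2$. The starting points and function names are arbitrary choices for illustration.

```python
def gradient(w, b, xs, ts):
    """Partial derivatives of the quadratic cost with respect to w and b."""
    g_w = sum(-2.0 * (t - (w * x + b)) * x for x, t in zip(xs, ts))
    g_b = sum(-2.0 * (t - (w * x + b)) for x, t in zip(xs, ts))
    return g_w, g_b


def hessian(xs):
    """Second derivatives (L_ww, L_wb, L_bb); they do not depend on (w, b)."""
    return 2.0 * sum(x * x for x in xs), 2.0 * sum(xs), 2.0 * len(xs)


def newton_step(w, b, xs, ts):
    """One update theta <- theta - H^{-1} grad, using the explicit 2x2 inverse."""
    g_w, g_b = gradient(w, b, xs, ts)
    h_ww, h_wb, h_bb = hessian(xs)
    det = h_ww * h_bb - h_wb ** 2
    if det == 0:
        raise ValueError("singular Hessian: the Newton step is not defined")
    d_w = (h_bb * g_w - h_wb * g_b) / det
    d_b = (-h_wb * g_w + h_ww * g_b) / det
    return w - d_w, b - d_b


xs = [1.0, 2.0]
ts = [2.0, 3.0]
w1, b1 = newton_step(0.0, 0.0, xs, ts)    # one step from (0, 0)
w2, b2 = newton_step(5.0, -7.0, xs, ts)   # one step from another start
```

Both starts land on $w = 1$, $b = 1$ after a single step, that is, the line $y = x + 1$ that passes through the two points, which is also what the analytical method gives.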

#### 1.6.2 Application to logistic regression (cross-entropy)

**Pseudocode: Newton's method for logistic regression**

1. Initialize the parameters $w$ and $b$
2. Repeat:
   - Calculate the intermediate variable: $z_n = w x_n + b$
   - Calculate the predictions: $y_n = \sigma(z_n)$
   - Calculate the partial derivatives of the cost function:
     $$\frac{\partial L(w,b)}{\partial w}, \quad
     \frac{\partial L(w,b)}{\partial b}$$
   - Calculate the Hessian $H$, which depends on $y_n(1 - y_n)$:
     $$
     H = X^T R X
     $$
   - Update the parameters using Newton's method:
     $$\theta_{t+1} = \theta_t - H^{-1} \nabla L(\theta_t)$$
3. Until the minimum of the cost function is reached
<br><br>

**For logistic regression with cross-entropy:**
$$
L = -\sum_{n=1}^{N} \left[ t_n \log(y_n) + (1 - t_n)\log(1 - y_n) \right]
$$
 
- The cost function is convex
- The Hessian depends on the parameters, normally defined as:
$$
H = X^T R X
$$
where $X$ is the matrix whose $n$-th row is $(x_n, 1)$ and $R$ is a diagonal matrix with entries $R_{nn} = y_n(1 - y_n)$.
<br>

- Newton is applied iteratively, also called Newton–Raphson or IRLS (Iteratively Reweighted Least Squares)
- Much faster convergence than GD, especially near the minimum
<br><br>
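
A compact sketch of the Newton iteration for one-input logistic regression. For $\theta = (w, b)$ the product $X^T R X$ reduces to three sums weighted by $r_n = y_n(1 - y_n)$, so the $2 \times 2$ system is solved explicitly. The data set (with overlapping classes), the start at $(0, 0)$ and the iteration count are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(w, b, xs, ts):
    """Cross-entropy gradient: sum (y_n - t_n) x_n and sum (y_n - t_n)."""
    ys = [sigmoid(w * x + b) for x in xs]
    g_w = sum((y - t) * x for x, t, y in zip(xs, ts, ys))
    g_b = sum(y - t for t, y in zip(ts, ys))
    return g_w, g_b


def hessian(w, b, xs):
    """Entries of H = X^T R X with rows (x_n, 1) and R_nn = y_n (1 - y_n)."""
    rs = []
    for x in xs:
        y = sigmoid(w * x + b)
        rs.append(y * (1.0 - y))
    h_ww = sum(r * x * x for r, x in zip(rs, xs))
    h_wb = sum(r * x for r, x in zip(rs, xs))
    h_bb = sum(rs)
    return h_ww, h_wb, h_bb


def newton_logistic(xs, ts, iterations=10):
    w, b = 0.0, 0.0
    for _ in range(iterations):
        g_w, g_b = gradient(w, b, xs, ts)
        h_ww, h_wb, h_bb = hessian(w, b, xs)
        det = h_ww * h_bb - h_wb ** 2
        if abs(det) < 1e-12:
            break  # Hessian numerically singular: stop instead of dividing
        w -= (h_bb * g_w - h_wb * g_b) / det
        b -= (-h_wb * g_w + h_ww * g_b) / det
    return w, b


# Hypothetical, non-separable data: the classes overlap along x.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [0, 1, 0, 1]
w_first, b_first = newton_logistic(xs, ts, iterations=1)  # (0.8, -1.2)
w_newton, b_newton = newton_logistic(xs, ts)
```

Unlike the linear case, the Hessian is recomputed at every iteration because $y_n$ changes with the parameters, but a handful of iterations is enough here to drive the gradient to essentially zero.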

### Comparison: Gradient Descent vs Newton
<div style="text-align: center;">

| **Aspect** | **GD** | **Newton** |
|------------|--------|------------|
| **Information used** | Gradient (1st order) | Gradient + Hessian (2nd order) |
| **Step size** | Fixed learning rate | Determined by curvature |
| **Convergence** | Linear | Quadratic near the minimum |
| **Computational cost** | Low | High |
| **Robustness** | High, stable even far from the minimum | Sensitive to poor initialization or ill-conditioned Hessians |
| **Typical application** | Large or high-dimensional data | Small or medium-sized problems requiring high precision |
</div><br>

</div>

---


<div style="text-align: justify;">

## **1.7 Final conclusions**
In this section, we have seen how we can use a data set to build linear and logistic regression models to predict a target variable. The general process begins with identifying the type of problem, followed by selecting an appropriate model, defining a cost function, and optimizing its parameters.

The Newton method is presented as an advanced alternative to Gradient Descent. While GD uses only first-order information and requires a carefully chosen learning rate, Newton takes advantage of second-order information (Hessian) to adapt the step size to the curvature of the cost function, allowing for much faster convergence, albeit at a higher computational cost. For linear regression, Newton practically coincides with the analytical solution, reaching the minimum in a single iteration. For logistic regression, it is applied iteratively (Newton–Raphson or IRLS) and remains more efficient than GD, especially near the minimum.

In summary, Newton is another way to achieve the final objective of the block: the optimization of the model parameters.

</div>
