------------------
```markdown
# Copyright © 2024 Meysam Goodarzi
This notebook is licensed under CC BY-NC 4.0 with the following amendments:
- Individuals may use, share, and adapt this material for non-commercial purposes with attribution.
- Institutions/Companies must obtain written consent to use this material, except for nonprofits.
- Commercial use is prohibited without permission.  
Contact: analytica@meysam-goodarzi.com
```
------------------------------
❗❗❗ **IMPORTANT**❗❗❗ **Create a copy of this notebook**

In order to work with this Google Colab notebook, you need to create a copy of it. Please **DO NOT** provide your answers here. Instead, work on your copy. To make a copy:

**Click on: File -> Save a copy in Drive**

Have you successfully created the copy? If so, a new tab should have opened in your browser. Now move to the copy and start from there!

----------------------------------------------


In [4]:
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import clear_output

# Optimization
This notebook introduces Gradient Descent, Steepest Descent, Newton's Method, and Stochastic Gradient Descent.

## Gradient Descent
The general gradient descent update rule is
$$\theta := \theta - \alpha\cdot\nabla J(\theta)$$
where:
* $\theta$ is the parameter vector (the values over which you want to optimize your objective function)
* $\alpha$ is the learning rate
* $\nabla J(\theta)$ is the gradient of the function $J(\theta)$

### Example
Let us consider the function $$f(x) = x^2 + 2x + 1$$ with the derivative $$f'(x) = 2x + 2.$$ We can implement the gradient descent algorithm as follows:

In [None]:
# Define the function and its gradient
def f(x: float):
    return x**2 + 2*x + 1

def grad_f(x: float):
    return 2*x + 2

# Gradient Descent implementation
def gradient_descent(starting_point: float, learning_rate: float, num_iterations: int):
    x = starting_point
    x_values = [x]
    for i in range(num_iterations):
        x = x - learning_rate * grad_f(x)
        x_values.append(x)
    return x_values

# Parameters
learning_rate = 0.1
num_iterations = 10
starting_point = 5

# Run gradient descent
x_values = gradient_descent(starting_point, learning_rate, num_iterations)

# Visualization
fig, ax = plt.subplots(figsize=(7, 3))
x_range = np.linspace(-5, 5, 100)
ax.plot(x_range, f(x_range), label="f(x) = x^2 + 2x + 1")
ax.scatter(x_values, f(np.array(x_values)), color='red', label='GD Steps')
ax.set(title="Gradient Descent on f(x) = x^2 + 2x + 1", xlabel="x", ylabel="f(x)")
plt.legend()
plt.show()
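
The choice of learning rate matters in this example. Since $f'(x) = 2(x + 1)$, each update gives $x_{k+1} + 1 = (1 - 2\alpha)(x_k + 1)$, so the iterates shrink toward the minimizer $x = -1$ only when $|1 - 2\alpha| < 1$, that is, for $0 < \alpha < 1$. The sketch below illustrates this. The helper names `quad_grad` and `run_gd` and the learning rates tried are illustrative assumptions, not part of the example above.

```python
# Sketch: effect of the learning rate on gradient descent for f(x) = x^2 + 2x + 1.
# Helper names and the learning rates below are illustrative assumptions.

def quad_grad(x: float) -> float:
    """Derivative of f(x) = x^2 + 2x + 1."""
    return 2 * x + 2

def run_gd(start: float, alpha: float, steps: int) -> float:
    """Return the final iterate after `steps` gradient descent updates."""
    x = start
    for _ in range(steps):
        x = x - alpha * quad_grad(x)
    return x

gd_results = {alpha: run_gd(5.0, alpha, 10) for alpha in (0.1, 0.5, 0.9, 1.1)}
for alpha, x_final in gd_results.items():
    print(f"alpha = {alpha}: x after 10 steps = {x_final:.4f}")
```

With $\alpha = 0.5$ the first step lands exactly on the minimizer, while $\alpha = 1.1$ moves further away from it at every step.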


### Exercise 1
Let us consider the function $$f(x, y) = x^2 + y^2 + 2x - 3y + 1 + 2.$$
This could represent a policy scenario where there are two adjustable parameters, such as:
* $x$: public spending on healthcare.
* $y$: public spending on education.

The function represents the total cost of these interventions, and we want to find the optimal combination that minimizes the overall cost. Use the following code to complete the minimization using gradient descent.

In [None]:
# Define the function (a convex quadratic with a single minimum)
def f(x: float, y: float):
    return x**2 + y**2 + 2*x - 3*y + 1 + 2

# Gradient of the function (partial derivatives)
def grad_f(x: float, y: float):
    return np.array([2*x + 2, 2*y - 3])


# Preparing the axes of the plot
x_vals = np.linspace(-10, 10, 100)
y_vals = np.linspace(-10, 10, 100)
X, Y = np.meshgrid(x_vals, y_vals)
Z = f(X, Y)

steps_x = []
steps_y = []
steps_z = []

# Parameters for the gradient descent
learning_rate = 0.1
num_iterations = 10
starting_point = [8, 6]  # Try different starting points, e.g., [8, 6], [-5, 2], [3, -4]

point = np.array(starting_point)

for i in range(num_iterations):
    clear_output(wait=True)

    # Append current point
    steps_x.append(point[0])
    steps_y.append(point[1])
    steps_z.append(f(point[0], point[1]))

    # Create plotly figure
    fig = go.Figure()

    # Add the 3D surface plot for the function
    fig.add_trace(go.Surface(x=X, y=Y, z=Z, colorscale='Viridis', opacity=0.7))

    # Add the steps taken by gradient descent
    fig.add_trace(go.Scatter3d(x=steps_x, y=steps_y, z=steps_z,
                                mode='markers+lines', marker=dict(size=5, color='red'),
                                line=dict(color='red', width=3),
                                name='Steps'))

    # Customize plot details
    fig.update_layout(title="3D Gradient Descent",
                      scene=dict(xaxis_title="x", yaxis_title="y", zaxis_title="f(x, y)"),
                      showlegend=True)
    fig.show()

    # Update point using gradient descent rule
    point = point - learning_rate * grad_f(point[0], point[1])
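
As a cross-check that does not rely on the plot: setting the gradient above to zero gives the stationary point $(x, y) = (-1, 1.5)$, where $f = -0.25$. Below is a minimal sketch that uses the same learning rate and starting point as above but more iterations (200 is an arbitrary choice). The names `cost`, `cost_grad`, and `gd_point` are hypothetical and chosen to avoid overwriting `f`, `grad_f`, and `point`.

```python
import numpy as np

def cost(p: np.ndarray) -> float:
    """f(x, y) = x^2 + y^2 + 2x - 3y + 1 + 2."""
    x, y = p
    return x**2 + y**2 + 2*x - 3*y + 1 + 2

def cost_grad(p: np.ndarray) -> np.ndarray:
    """Partial derivatives of the cost with respect to x and y."""
    x, y = p
    return np.array([2*x + 2, 2*y - 3])

gd_point = np.array([8.0, 6.0])
for _ in range(200):
    gd_point = gd_point - 0.1 * cost_grad(gd_point)

print("Final point:", gd_point)
print("Cost there:", cost(gd_point))
```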

**Question**: What happens if gradient descent is applied to a function with multiple minima/maxima?

### Exercise 2
Let us consider the function $$f(x) = x^4 - 4x^2 + 2.$$
This function has multiple minima. Write a gradient descent algorithm to find the minima.
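
To have something to check a gradient descent result against, the stationary points can be found analytically: $4x^3 - 8x = 4x(x^2 - 2) = 0$ gives $x = 0$ and $x = \pm\sqrt{2}$, and the sign of the second derivative $12x^2 - 8$ distinguishes minima from maxima. Here is a small sketch using `np.roots`; the variable names are illustrative.

```python
import numpy as np

# Coefficients of the derivative 4x^3 + 0x^2 - 8x + 0, highest degree first
stationary = np.sort(np.real(np.roots([4, 0, -8, 0])))

for x0 in stationary:
    second = 12 * x0**2 - 8          # second derivative at x0
    kind = "minimum" if second > 0 else "maximum"
    value = x0**4 - 4 * x0**2 + 2    # function value at x0
    print(f"x = {x0:+.4f}: local {kind}, f(x) = {value:.4f}")
```

Which of the two minima gradient descent reaches depends on the starting point.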

In [None]:
# Define the function and its gradient
def f(x: float):
    return x**4 - 4*x**2 + 2

def grad_f(x: float):
    return 4*x**3 - 8*x

# Visualization
fig, ax = plt.subplots(figsize=(7, 3))
x_range = np.linspace(-2, 2, 100)
ax.plot(x_range, f(x_range), label="f(x) = x^4 - 4x^2 + 2")

# Parameters
learning_rate = 0.02
num_iterations = 20
starting_point = 0.6

# Gradient Descent implementation
x = starting_point
x_values = [x]
for i in range(num_iterations):
    x = # Your code
    x_values.append(x)
    ax.scatter(x_values, f(np.array(x_values)), color='red')

ax.set(title="Gradient Descent", xlabel="x", ylabel="f(x)")
plt.legend()
plt.show()


**Question**: What happens if you change the minus sign to a plus in the gradient descent update formula?

In [None]:
def f(x: float):
    return x**4 - 4*x**2 + 2

def grad_f(x: float):
    return 4*x**3 - 8*x

# Visualization
fig, ax = plt.subplots(figsize=(7, 3))
x_range = np.linspace(-2, 2, 100)
ax.plot(x_range, f(x_range), label="f(x) = x^4 - 4x^2 + 2")

# Parameters
learning_rate = 0.02
num_iterations = 20
starting_point = 0.6

# Gradient Descent implementation
x = starting_point
x_values = [x]
for i in range(num_iterations):
    x = # Your code
    x_values.append(x)
    ax.scatter(x_values, f(np.array(x_values)), color='red')

ax.set(title="Update with a plus sign on f(x) = x^4 - 4x^2 + 2", xlabel="x", ylabel="f(x)")
plt.legend()
plt.show()


## Newton's Method
Newton’s Method uses second-order information (the [Hessian](https://en.wikipedia.org/wiki/Hessian_matrix#:~:text=In%20mathematics%2C%20the%20Hessian%20matrix,a%20function%20of%20many%20variables.) matrix) to choose its steps. Near a minimum it typically needs fewer iterations than gradient descent, but each iteration is computationally more expensive because the Hessian has to be computed and a linear system solved. The general update rule is
$$\theta := \theta - H^{-1}\cdot\nabla J(\theta)$$
where:
* $\theta$ is the parameter vector (the values over which you want to optimize your objective function)
* $\nabla J(\theta)$ is the gradient of the function $J(\theta),$ and
* $H$ denotes the Hessian matrix.

<details>
  <summary>Mathematical explanations</summary>
<p>

Newton’s method is derived from the second-order Taylor expansion of the function $J(\theta)$ around $\theta_k$:

$$
J(\theta) \approx J(\theta_k) + \nabla J(\theta_k)^T (\theta - \theta_k) + \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k).
$$

To find the optimal update step, we set the derivative of the Taylor approximation to zero:

$$
\nabla J(\theta) \approx \nabla J(\theta_k) + H (\theta - \theta_k) = 0.
$$

Solving for $\theta$ and calling the solution $\theta_{k+1}$ gives the iterative update rule for Newton’s method:

$$
\theta_{k+1} = \theta_k - H^{-1} \nabla J(\theta_k).
$$
</p>
</details>

### Example
Let us implement the method on a simple function given by
$$f(x) = (x-3)^2$$

In [None]:
def newtons_method(f_grad, hess, x_init, tol=1e-6, max_iter=100):
    x = x_init
    for _ in range(max_iter):
        grad = f_grad(x)
        if np.linalg.norm(grad) < tol:
            break
        x = x - 1/hess * grad
    return x

# Example: f(x) = (x - 3)^2
f_grad = lambda x: 2 * (x - 3)
f_hess = 2

x_opt = newtons_method(f_grad, f_hess, x_init=np.array([10.0]))
print("Optimized x:", x_opt)
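
Because $f(x) = (x - 3)^2$ is itself quadratic, the second-order Taylor model is exact and Newton's method reaches the minimum in a single update from any starting point. The sketch below counts updates for Newton's method and for gradient descent on this function. The helper name `count_steps` and the gradient descent learning rate of 0.1 are assumptions made for illustration.

```python
def count_steps(step_fn, x_init: float, tol: float = 1e-6, max_iter: int = 1000):
    """Apply x <- step_fn(x) until |f'(x)| < tol; return (x, number of updates)."""
    x = x_init
    for n_updates in range(max_iter):
        if abs(2 * (x - 3)) < tol:      # f'(x) = 2(x - 3)
            return x, n_updates
        x = step_fn(x)
    return x, max_iter

newton_step = lambda x: x - (2 * (x - 3)) / 2      # divide by f''(x) = 2
gd_step = lambda x: x - 0.1 * (2 * (x - 3))        # learning rate 0.1 (assumed)

x_newton, n_newton = count_steps(newton_step, 10.0)
x_gd, n_gd = count_steps(gd_step, 10.0)
print(f"Newton: x = {x_newton:.6f} after {n_newton} update(s)")
print(f"GD:     x = {x_gd:.6f} after {n_gd} update(s)")
```

On non-quadratic functions Newton's method generally needs more than one update, since the quadratic model is then only an approximation.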

### Exercise 3

Consider the function:
$$f(x, y) = e^{x + y} + (x - 2)^2 + (y + 3)^2.$$ Find the minimum of the function using Newton's method.
<details>
<summary>
Hint
</summary>
<p>
The gradient of $f(x, y)$ is given by:
$$
    \nabla f(x, y) =
\begin{bmatrix}
\frac{\partial f}{\partial x} \\
\frac{\partial f}{\partial y}
\end{bmatrix}
=
\begin{bmatrix}
e^{x+y} + 2(x - 2) \\
e^{x+y} + 2(y + 3)
\end{bmatrix}
.$$

The Hessian matrix is:
$$
H(x, y) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\
\frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2}
\end{bmatrix}
=
\begin{bmatrix}
e^{x+y} + 2 & e^{x+y} \\
e^{x+y} & e^{x+y} + 2
\end{bmatrix}
$$
</p>
</details>
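
Before running Newton's method, it is worth checking a hand-derived gradient numerically. The sketch below compares the gradient from the hint with a central finite-difference approximation at an arbitrary test point. The names `f_xy`, `hint_grad`, and `numerical_grad`, the step size `h`, and the test point are illustrative assumptions. The same idea can be applied to a Hessian by differencing the gradient.

```python
import numpy as np

def f_xy(p: np.ndarray) -> float:
    """f(x, y) = e^(x+y) + (x - 2)^2 + (y + 3)^2."""
    return np.exp(p[0] + p[1]) + (p[0] - 2)**2 + (p[1] + 3)**2

def hint_grad(p: np.ndarray) -> np.ndarray:
    """Analytical gradient as given in the hint."""
    e = np.exp(p[0] + p[1])
    return np.array([e + 2 * (p[0] - 2), e + 2 * (p[1] + 3)])

def numerical_grad(func, p: np.ndarray, h: float = 1e-6) -> np.ndarray:
    """Central-difference approximation of the gradient of func at p."""
    grad = np.zeros_like(p, dtype=float)
    for i in range(len(p)):
        step = np.zeros_like(p, dtype=float)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad

test_point = np.array([0.5, -1.0])
grad_diff = np.max(np.abs(hint_grad(test_point) - numerical_grad(f_xy, test_point)))
print("Max abs difference:", grad_diff)
```

A difference many orders of magnitude below the size of the gradient itself suggests the analytical formula is correct.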

In [None]:
import numpy as np

def newtons_method(f_grad, f_hess, x_init, tol=1e-6, max_iter=100):
    x = x_init
    for i in range(max_iter):
        grad = f_grad(x)  # Gradient vector
        # Hessian matrix
        hess = # Your code
        if np.linalg.norm(grad) < tol:
            print(f"Converged after {i} iterations")
            break
        # Newton's update step: solve H d = grad instead of forming the inverse explicitly
        x = x - np.linalg.solve(hess, grad)
        print(x)
    return x

# Function: f(x, y) = e^(x+y) + (x - 2)^2 + (y + 3)^2
f_grad = lambda x: np.array([
    np.exp(x[0] + x[1]) + 2 * (x[0] - 2),
    np.exp(x[0] + x[1]) + 2 * (x[1] + 3)
])

f_hess = lambda x: np.array([
    # Your code
])

x_opt = newtons_method(f_grad, f_hess, x_init=np.array([10.0, -15.0]))
print("Optimized x:", x_opt)

## Stochastic Gradient Descent (SGD)
SGD is used for large datasets where computing the full gradient is expensive. It updates the parameters using randomly selected data points. The general idea is to compute the gradient on a batch of $B$ data points instead of the whole dataset. That is:
$$\nabla f(x) \approx \frac{1}{B}\sum_{i=1}^{B}\nabla f_i(x),$$
where $f_i$ is the loss on the $i$-th data point in the batch.
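
To make the approximation concrete, the sketch below uses a least-squares loss $(y_i - w x_i)^2$ per data point on synthetic data. The data, the model without an intercept, and all names are assumptions for illustration. It compares the gradient averaged over one random batch with the full-data gradient, and shows that averaging the batch gradients over equally sized batches that partition the data recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 64)
ys = 3 * xs + rng.normal(0, 0.1, 64)
w_current = 0.0  # parameter value at which the gradient is evaluated

# Per-sample gradients of (y_i - w * x_i)^2 with respect to w
per_sample = -2 * (ys - w_current * xs) * xs

full_grad = per_sample.mean()

batch_size = 8
batch_idx = rng.choice(len(xs), size=batch_size, replace=False)
batch_grad = per_sample[batch_idx].mean()   # (1/B) * sum over one random batch

# Mean of the batch gradients over a partition into batches of equal size
partition_mean = per_sample.reshape(-1, batch_size).mean(axis=1).mean()

print(f"Full gradient:       {full_grad:.4f}")
print(f"One-batch estimate:  {batch_grad:.4f}")
print(f"Mean over partition: {partition_mean:.4f}")
```

A single batch gives a noisy estimate of the full gradient, and that noise is the price paid for the cheaper update.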

### Example
Let us generate some dummy data using the following relationship between $x$ and $y$, plus some random noise:
$$y = 2x + 1 + \text{random value}.$$

Let us pretend that we are solving a linear regression problem $$y = wx + b$$ and that we do not know the coefficient $w$ and the intercept $b$. The goal is to use SGD to estimate the coefficient and the intercept.

In [None]:
import random

# Generate dummy dataset (y = 2x + 1 with noise)
random.seed(42)
np.random.seed(42)  # also seed NumPy, which sgd() uses for shuffling
X = np.array([random.uniform(-10, 10) for _ in range(256)])
y = np.array([2 * x + 1 + random.uniform(-1, 1) for x in X])

def sgd(X, y, alpha=0.01, epochs=100, batch_size=16):
    w, b = 0, 0
    n = len(X)
    losses = []
    for epoch in range(epochs):
        grad_w, grad_b = 0, 0
        indices = np.random.permutation(n)
        X, y = X[indices], y[indices]

        for i in range(0, n, batch_size):
            xi = X[i:i+batch_size]
            yi = y[i:i+batch_size]
            # Average over the actual batch length, so a shorter final batch is handled correctly
            grad_w = np.mean(-2 * (yi - (w * xi + b)) * xi)
            grad_b = np.mean(-2 * (yi - (w * xi + b)))

            # Update parameters using the average gradient
            w -= alpha * grad_w
            b -= alpha * grad_b
        epoch_loss = 1/n*np.sum((y-(w*X+b))**2)
        losses.append(float(epoch_loss.round(2)))

    return w, b, losses

num_epochs = 100
w_batch, b_batch, losses = sgd(X, y, epochs=num_epochs)
print(f"Batch SGD Parameters: w = {w_batch:.2f}, b = {b_batch:.2f}")
print(f"Epoch Losses: {losses}")
# Plot loss over epochs
plt.plot(range(num_epochs), losses)
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Loss Over Time Using SGD")
plt.show()
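
For a problem this small, the least-squares solution is also available in closed form, which gives a reference for the SGD estimate. Here is a minimal sketch using `np.polyfit` on data generated the same way as above; the `_ref` names are illustrative.

```python
import random
import numpy as np

random.seed(42)
X_ref = np.array([random.uniform(-10, 10) for _ in range(256)])
y_ref = np.array([2 * x + 1 + random.uniform(-1, 1) for x in X_ref])

# A degree-1 polynomial fit returns [slope, intercept]
w_ref, b_ref = np.polyfit(X_ref, y_ref, 1)
print(f"Least-squares reference: w = {w_ref:.2f}, b = {b_ref:.2f}")
```

The SGD values should land close to these, with some residual fluctuation caused by the noisy mini-batch updates.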



### Exercise 4
Predict students' exam scores based on their study hours using a randomly generated dataset given below.

<details>
<summary>
Hint
</summary>
<p>
The exam score data is generated using this formula:
$$y = 5x + 50 + \text{noise}$$
where $y$ is the score, $x$ is the number of study hours, and $\text{noise}$ is a random term added to the data.
</p>
</details>

In [None]:

# Generate random study hours (between 0 and 12 hours)
np.random.seed(42)
X = np.random.uniform(0, 12, 100)

# Generate exam scores using y = 5 * x + 50 + noise
true_w, true_b = 5, 50  # True relationship
noise = np.random.normal(0, 5, size=X.shape)  # Add some noise
y = true_w * X + true_b + noise

# Normalize X for better convergence
X = (X - np.mean(X)) / np.std(X)

def sgd(X, y, alpha=0.01, epochs=100, batch_size=10):
    w, b = 0, 0  # Initialize weights and bias
    n = len(X)
    losses = []

    for epoch in range(epochs):
        indices = np.random.permutation(n)
        X, y = X[indices], y[indices]  # Shuffle data

        for i in range(0, n, batch_size):
            xi = X[i:i+batch_size]
            yi = y[i:i+batch_size]

            # Compute gradients
            grad_w = -2/batch_size * np.sum((yi - (w * xi + b)) * xi)
            grad_b = # Your code

            # Update weights
            w -= # Your code
            b -= alpha * grad_b

        # Compute loss
        epoch_loss = # Your code
        losses.append(epoch_loss)

    return w, b, losses

# Train the model using SGD
num_epochs = 100
w_sgd, b_sgd, losses = sgd(X, y, epochs=num_epochs)

# Predictions using learned parameters
X_test = np.linspace(min(X), max(X), 100)
y_pred = w_sgd * X_test + b_sgd

# Plot loss over epochs
plt.plot(range(num_epochs), losses)
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Loss Over Time Using SGD")
plt.show()

# Plot the data and model prediction
plt.scatter(X, y, label="Actual Data", alpha=0.6)
plt.plot(X_test, y_pred, color="red", label="SGD Prediction")
plt.xlabel("Normalized Study Hours")
plt.ylabel("Exam Score")
plt.legend()
plt.title("Study Hours vs Exam Score Prediction using SGD")
plt.show()

print(f"Estimated Parameters: w = {w_sgd:.2f}, b = {b_sgd:.2f}")
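
Note that the model is trained on normalized study hours, so the learned `w` and `b` are not directly comparable to the true values 5 and 50. If $x_{\text{norm}} = (x - \mu)/\sigma$, then $y = w\,x_{\text{norm}} + b = (w/\sigma)\,x + (b - w\mu/\sigma)$. The sketch below illustrates the conversion. It uses `np.polyfit` as a stand-in for the SGD fit so that it runs on its own, and the names `hours`, `scores`, `w_demo`, `b_demo`, `w_orig`, and `b_orig` are assumptions.

```python
import numpy as np

np.random.seed(42)
hours = np.random.uniform(0, 12, 100)
scores = 5 * hours + 50 + np.random.normal(0, 5, size=hours.shape)

mu, sigma = np.mean(hours), np.std(hours)
hours_norm = (hours - mu) / sigma

# Stand-in for the SGD result: least-squares fit on the normalized data
w_demo, b_demo = np.polyfit(hours_norm, scores, 1)

# Convert back to the original units (score per study hour, score at 0 hours)
w_orig = w_demo / sigma
b_orig = b_demo - w_demo * mu / sigma
print(f"Normalized units: w = {w_demo:.2f}, b = {b_demo:.2f}")
print(f"Original units:   w = {w_orig:.2f}, b = {b_orig:.2f}")
```

The same two conversion lines can be applied to `w_sgd` and `b_sgd`, provided the mean and standard deviation of the raw study hours are stored before normalizing.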


**Congratulations! You have finished the Notebook! Great Job!**
🤗🙌👍👏💪