# Introduction and Overview of Existing Gradient Algorithms

In this assignment, we explore the evolution and significance of gradient descent algorithms, focusing on their application to complex data-driven problems in fields such as machine learning and natural language processing. We cover the foundations of both classical and adaptive stochastic gradient techniques and investigate their convergence properties.

### Historical Context

Gradient descent algorithms have evolved significantly, from the stochastic approximation methods of Robbins-Monro and Kiefer-Wolfowitz in the early 1950s, through Polyak's momentum (heavy-ball) method in the 1960s, to Nesterov's accelerated method in the 1980s. The 2010s marked a shift towards adaptive methods such as AdaGrad, RMSProp, and ADAM, each adjusting the learning rate in its own way and proving effective in many applications, particularly in deep learning.

### Learning Objectives

1. Understand the fundamental concepts of gradient, gradient descent, and stochastic optimization;
2. Explore the theoretical foundations and practical applications of various stochastic gradient descent algorithms;
3. Compare the performance of different gradient descent algorithms on a test convex and smooth objective function.

## Prerequisites

Before delving into the implementation and comparison of gradient descent algorithms, it is essential to set up the necessary computational environment. We will be utilizing the PyTorch library to perform all calculations, as it offers a flexible and efficient platform for scientific computing, particularly in machine learning.

In [None]:
# Import the PyTorch library
import torch

# Import typing annotations for assignment hints
from typing import Callable

# Check the version of PyTorch
print("PyTorch Version:", torch.__version__)

# Perform a basic operation to test PyTorch
a = torch.tensor([1.0, 2.0])
b = torch.tensor([3.0, 4.0])

# Assert the result of the sum
assert torch.equal(a + b, torch.tensor([4.0, 6.0])), "The sum of a and b is incorrect!"

## Stochastic Optimization Problem

Stochastic optimization problems provide the framework for handling the uncertainty and randomness inherent in domains such as finance, machine learning, and operations research. In contrast to deterministic optimization, where the objective function and constraints are known exactly, stochastic optimization involves components that are random. In this section, we present the mathematical formulation of a stochastic optimization problem and explore how stochastic gradient descent algorithms tackle the challenges it raises.

### Mathematical Formulation

Given an objective function $ f: X \to \mathbb{R} $ with domain $ X \subset \mathbb{R} ^ n $, and a convex and differentiable function $ F: X \times \Xi \to \mathbb{R} $ that depends on the deterministic variable $ x \in X $ and a random variable $ \xi \in \Xi $ defined on a probability space $ (\Xi, \Sigma, P) $, a stochastic optimization problem can be represented as:

$$
\min_{x \in X} \left[f(x) = \mathbb{E} F(x, \xi) = \int_{\xi \in \Xi} F(x, \xi) P(d \xi), X \subset \mathbb{R} ^ n\right]
$$

Here, $ \mathbb{E} $ denotes the mathematical expectation. The intrinsic challenge of this problem lies in the difficulty of explicitly calculating the value of an integral (mathematical expectation) and the gradient of this integral. Stochastic gradient descent algorithms, leveraging gradients $ \nabla_{x} F(x, \xi) $ of a stochastic function $ F(\cdot, \xi) $ or their finite-difference counterparts, offer a solution to this challenge.
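To make the formulation concrete, the sketch below replaces the expectation with a sample average. The toy function $ F(x, \xi) = \|x - \xi\|^2 $ with $ \xi \sim N(0, I) $ is an assumption chosen for illustration (it is not part of the assignment), because its expectation $ f(x) = \|x\|^2 + n $ and gradient $ \nabla f(x) = 2x $ are known in closed form and can be compared with the estimates.

```python
import torch

# Hypothetical toy problem (not from the assignment):
# F(x, xi) = ||x - xi||^2 with xi ~ N(0, I), so f(x) = E F(x, xi) = ||x||^2 + n.
def F(x: torch.Tensor, xi: torch.Tensor) -> torch.Tensor:
    return ((x - xi) ** 2).sum(dim=-1)

def grad_F(x: torch.Tensor, xi: torch.Tensor) -> torch.Tensor:
    # Analytic gradient of F with respect to x for fixed samples xi.
    return 2.0 * (x - xi)

torch.manual_seed(0)
x = torch.tensor([1.0, -2.0], dtype=torch.float64)
xi = torch.randn(100_000, 2, dtype=torch.float64)  # samples of the random variable

f_estimate = F(x, xi).mean()            # sample-average estimate of E F(x, xi)
g_estimate = grad_F(x, xi).mean(dim=0)  # sample-average estimate of the gradient of f

f_exact = (x ** 2).sum() + x.numel()    # closed form: ||x||^2 + n
g_exact = 2.0 * x                       # closed form: gradient of f

print("f estimate:", f_estimate.item(), "exact:", f_exact.item())
print("gradient estimate:", g_estimate, "exact:", g_exact)
```

The estimates only approach the exact values as the number of samples grows, which is why the algorithms in this assignment work with noisy gradients rather than the exact gradient of $ f $.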

### Practical Implications

These optimization problems are pivotal in scenarios where decision-making is dependent on incomplete or uncertain information. Employing techniques such as random sampling, Monte Carlo simulations, and stochastic gradients, stochastic optimization methods effectively and efficiently traverse the optimization landscape, aiming for convergence to the optimal solution.
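As a rough illustration of how stochastic gradients drive an iterate toward the optimum, here is a minimal sketch of plain stochastic gradient descent. The toy objective, the mean vector `mu`, the helper `stochastic_grad`, and the step schedule `0.5 / k` are all assumptions made for this example, not part of the assignment.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy problem: F(x, xi) = ||x - xi||^2 with xi ~ N(mu, I).
# Its expectation f(x) = ||x - mu||^2 + n is minimized at x = mu.
mu = torch.tensor([3.0, -1.0], dtype=torch.float64)

def stochastic_grad(x: torch.Tensor) -> torch.Tensor:
    xi = mu + torch.randn(2, dtype=torch.float64)  # draw one random sample
    return 2.0 * (x - xi)                          # gradient of F(., xi) at x

x = torch.zeros(2, dtype=torch.float64)
for k in range(1, 5001):
    step = 0.5 / k                      # diminishing step size (assumed schedule)
    x = x - step * stochastic_grad(x)   # one noisy gradient step

print("final iterate:", x, "minimizer:", mu)
```

With this particular schedule each update reduces to a running average of the drawn samples, so the iterate settles near `mu` even though every individual gradient is noisy.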

### Gradient Approximation

While libraries like PyTorch offer automatic differentiation, this assignment encourages a hands-on approach. We will be utilizing the second-order accurate central differences method to estimate gradients, offering insight into the intricacies of gradient computation and its role in optimization algorithms.

We can derive the approximation formula from the Taylor series expansion, discarding remainder terms of order higher than one:

$$ f(x_0 + h) = f(x_0) + f'(x_0) h + o(h) $$

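Solving this expansion for the derivative with steps $ +h $ and $ -h $ gives the left (forward) and right (backward) finite differences, where $ \xi $ denotes an intermediate point of the interval spanned by the grid nodes rather than the random variable used earlier: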

$$
\begin{cases}
  f'_{-}(x) = \frac{f(x + h) - f(x)}{h} - \frac{h f''(\xi)}{2} \\
  f'_{+}(x) = \frac{f(x) - f(x - h)}{h} + \frac{h f''(\xi)}{2}
\end{cases}
$$
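The first-order behaviour of these two formulas can be checked numerically. The sketch below uses $ \sin $ as an assumed test function, with hypothetical helper names, and prints the error for a sequence of halved steps. If the error is $ O(h) $, it should roughly halve each time $ h $ is halved.

```python
import math

def forward_diff(f, x, h):
    # First-order forward difference: uses the nodes x and x + h.
    return (f(x + h) - f(x)) / h

def backward_diff(f, x, h):
    # First-order backward difference: uses the nodes x - h and x.
    return (f(x) - f(x - h)) / h

x0 = 1.0
exact = math.cos(x0)  # exact derivative of sin at x0

for h in (1e-1, 5e-2, 2.5e-2):
    err_fwd = abs(forward_diff(math.sin, x0, h) - exact)
    err_bwd = abs(backward_diff(math.sin, x0, h) - exact)
    print(f"h={h:<6} forward error={err_fwd:.2e} backward error={err_bwd:.2e}")
```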

The accuracy of the approximation depends on the step of the numerical partitioning grid: the smaller the step $ h $, the smaller the truncation error. Using the Runge-Romberg-Richardson algorithm, we can raise the order of accuracy to $ O(h^2) $ without adding extra iterations to the approximation algorithm:

$$
\begin{cases}
  f'_{-}(x) = \frac{-3 f(x) + 4 f(x + h) - f(x + 2h)}{2h} + \frac{h^2 f'''(\xi)}{3} \\
  f'(x) = \frac{f(x + h) - f(x - h)}{2h} - \frac{h^2 f'''(\xi)}{6} \\
  f'_{+}(x) = \frac{f(x - 2h) - 4 f(x - h) + 3 f(x)}{2h} + \frac{h^2 f'''(\xi)}{3}
\end{cases}
$$
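One way to see where the one-sided formulas come from is to combine the first-order forward difference at two step sizes. As a sketch, write $ D_h $ for the forward difference with step $ h $ (a symbol introduced here only for illustration) and assume $ f $ is three times differentiable:

$$
\begin{aligned}
D_h &= \frac{f(x + h) - f(x)}{h} = f'(x) + \frac{h}{2} f''(x) + O(h^2), \\
D_{2h} &= \frac{f(x + 2h) - f(x)}{2h} = f'(x) + h f''(x) + O(h^2), \\
2 D_h - D_{2h} &= \frac{-3 f(x) + 4 f(x + h) - f(x + 2h)}{2h} = f'(x) + O(h^2).
\end{aligned}
$$

The $ O(h) $ terms cancel in the combination $ 2 D_h - D_{2h} $, which leaves the second-order forward formula. The backward formula follows in the same way with $ h $ replaced by $ -h $.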

Approximating the gradient of the target function with a finite-difference scheme lets us apply the same optimization algorithms to any sufficiently smooth objective function, without deriving its gradient analytically.
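For a function of several variables, the central formula is applied to one coordinate at a time. The following is a minimal sketch of that idea, not a reference solution: the function name `central_gradient_sketch`, the cubic test function, and the use of `float64` are assumptions made for this example.

```python
import torch
from typing import Callable

def central_gradient_sketch(
    F: Callable[[torch.Tensor], torch.Tensor], x: torch.Tensor, h: float = 1e-3
) -> torch.Tensor:
    """Illustrative central-difference gradient for a 1-D input vector x."""
    grad = torch.zeros_like(x)
    for i in range(x.numel()):
        e = torch.zeros_like(x)
        e[i] = h  # perturb only the i-th coordinate
        grad[i] = (F(x + e) - F(x - e)) / (2 * h)
    return grad

# Hypothetical test function: F(v) = sum(v_i^3), with gradient 3 * v^2.
# Float64 is an assumption made here to keep round-off error small.
x = torch.tensor([2.0, -1.0], dtype=torch.float64)
F = lambda v: (v ** 3).sum()

approx = central_gradient_sketch(F, x)
exact = 3.0 * x ** 2
print("approximation:", approx, "exact:", exact)
```

The loop costs two function evaluations per coordinate, which is the price of not needing an analytic gradient.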

In [None]:
def grad_left(F: Callable[[torch.Tensor], torch.Tensor], x: torch.Tensor, h: float = 0.001) -> torch.Tensor:
  """A finite-difference approximation for left-side gradient ∇F₋(x) with the precision order O(h^2).

  Args:
      F (Callable[[torch.Tensor], torch.Tensor]): an objective function F(x) with a single input argument x ∈ ℝⁿ.
      x (torch.Tensor): an input vector x ∈ ℝⁿ, where the derivative is calculated.
      h (float, optional): a step of the derivative partitioning grid with the range of 0 < h < 1. A smaller step lowers the truncation error, but round-off error grows once h becomes too small. Defaults to 0.001.

  Returns:
      torch.Tensor: a gradient vector approximation ∇F₋(x).
  """

  pass # TODO: Implement second-order accurate forward difference algorithm


def grad_center(F: Callable[[torch.Tensor], torch.Tensor], x: torch.Tensor, h: float = 0.001) -> torch.Tensor:
  """A finite-difference approximation for central gradient ∇F(x) with the precision order O(h^2).

  Args:
      F (Callable[[torch.Tensor], torch.Tensor]): a target function F(x) with a single input argument x ∈ ℝⁿ.
      x (torch.Tensor): an input vector x ∈ ℝⁿ, where the derivative is calculated.
      h (float, optional): a step of the derivative partitioning grid with the range of 0 < h < 1. A smaller step lowers the truncation error, but round-off error grows once h becomes too small. Defaults to 0.001.

  Returns:
      torch.Tensor: a gradient vector approximation ∇F(x).
  """

  pass # TODO: Implement second-order accurate central difference algorithm


def grad_right(F: Callable[[torch.Tensor], torch.Tensor], x: torch.Tensor, h: float = 0.001) -> torch.Tensor:
  """A finite-difference approximation for right-side gradient ∇F₊(x) with the precision order O(h^2).

  Args:
      F (Callable[[torch.Tensor], torch.Tensor]): a target function F(x) with a single input argument x ∈ ℝⁿ.
      x (torch.Tensor): an input vector x ∈ ℝⁿ, where the derivative is calculated.
      h (float, optional): a step of the derivative partitioning grid with the range of 0 < h < 1. A smaller step lowers the truncation error, but round-off error grows once h becomes too small. Defaults to 0.001.

  Returns:
      torch.Tensor: a gradient vector approximation ∇F₊(x).
  """

  pass # TODO: Implement second-order accurate backward difference algorithm

We will check the accuracy of the implemented methods on the following test case:

$$
F(x, y) = x^2 + xy + y^2 \implies
\begin{cases}
  \frac{\partial F(x, y)}{\partial x} = 2x + y, & \frac{\partial F(2.0, -1.0)}{\partial x} = 3.0 \\
  \frac{\partial F(x, y)}{\partial y} = x + 2y, & \frac{\partial F(2.0, -1.0)}{\partial y} = 0.0
\end{cases}
$$

In [None]:
h = 0.001
x = torch.tensor([2.0, -1.0])
f = lambda x: x[0] ** 2 + x[0] * x[1] + x[1] ** 2

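# An absolute tolerance is needed because one expected component is exactly 0.0,
# where a purely relative tolerance leaves almost no room for round-off error.
assert torch.allclose(grad_left(f, x, h=h), torch.tensor([3.0, 0.0]), rtol=h, atol=h)
assert torch.allclose(grad_center(f, x, h=h), torch.tensor([3.0, 0.0]), rtol=h, atol=h)
assert torch.allclose(grad_right(f, x, h=h), torch.tensor([3.0, 0.0]), rtol=h, atol=h)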