# 📘 Chapter 4: Numerical Computation

Deep learning algorithms depend heavily on numerical computation to solve optimization problems that cannot be solved analytically. This chapter covers the fundamental numerical methods and considerations essential for deep learning.

## Key Knowledge Points

### 1. Importance of Numerical Computation
- Most deep learning objectives have no closed-form solution, so they are solved with iterative numerical methods.
- The practical concerns are stability, efficiency, and keeping approximation error acceptably small.

### 2. Numerical Precision and Stability
- Floating-point representation (finite precision, rounding error).
- Overflow and underflow.
- Ill-conditioned problems (condition number).
- Numerical stability and error propagation.
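
The points above can be illustrated with a few lines of NumPy. This is a minimal sketch, not a complete treatment: the matrix `B` is a hypothetical example chosen only because it is nearly singular.

```python
import numpy as np

# Rounding error: 0.1 and 0.2 have no exact binary representation,
# so their float64 sum differs slightly from 0.3.
rounding_gap = abs((0.1 + 0.2) - 0.3)

# Machine epsilon: spacing between 1.0 and the next representable float64.
eps = np.finfo(np.float64).eps

# Largest finite float64; results beyond it overflow to inf.
max_float = np.finfo(np.float64).max

# Condition number of a nearly singular (hypothetical) matrix.
# A large value means small input perturbations can cause large output changes.
B = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-8]])
cond_B = np.linalg.cond(B)

print("rounding gap:", rounding_gap)
print("machine epsilon:", eps)
print("max float64:", max_float)
print("cond(B):", cond_B)
```

A rough rule of thumb is that solving a linear system with condition number around $10^k$ can lose about $k$ significant digits.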

### 3. Gradient Computation
- Numerical gradients (finite difference approximation).
- Symbolic differentiation.
- Automatic differentiation.
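
To make the contrast between numerical and automatic differentiation concrete, here is a small sketch in plain Python. The `Dual` class is a hypothetical, stripped-down helper for forward-mode automatic differentiation (it supports only `+` and `*`), and the test function `g` is an arbitrary choice; real frameworks use far more general machinery.

```python
def central_difference(f, x, h=1e-5):
    """Numerical gradient: symmetric finite difference with step h."""
    return (f(x + h) - f(x - h)) / (2 * h)

class Dual:
    """Minimal dual number value + deriv*eps with eps**2 = 0 (illustrative only)."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule applied to the derivative part.
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)
    __rmul__ = __mul__

def g(x):
    # g(x) = x^3 + 4x, so symbolic differentiation gives g'(x) = 3x^2 + 4.
    return x * x * x + 4 * x

x0 = 1.5
numeric = central_difference(g, x0)    # approximate, error depends on h
autodiff = g(Dual(x0, 1.0)).deriv      # exact up to floating-point rounding
exact = 3 * x0**2 + 4                  # symbolic result evaluated at x0

print(numeric, autodiff, exact)
```

The finite difference carries a truncation error that shrinks with `h` and a rounding error that grows as `h` shrinks, while the dual-number result applies the chain rule exactly at each operation.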

### 4. Iterative Optimization Methods
- Gradient Descent (GD).
- Stochastic Gradient Descent (SGD).
- Batch vs. Mini-batch updates.
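
The three update schemes differ only in how many samples feed each gradient estimate. The sketch below assumes a synthetic one-dimensional regression problem (the data, the `train` helper, and the hyperparameters are all illustrative choices, not taken from the chapter).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (hypothetical) data: y = 2*x + 1 plus small noise.
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=200)

def grad_mse(w, b, X_batch, y_batch):
    """Gradient of the mean squared error for the model y_hat = w*x + b."""
    residual = w * X_batch[:, 0] + b - y_batch
    return 2.0 * np.mean(residual * X_batch[:, 0]), 2.0 * np.mean(residual)

def train(batch_size, epochs=200, lr=0.05):
    """Run gradient descent using batches of the given size."""
    w, b = 0.0, 0.0
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            gw, gb = grad_mse(w, b, X[idx], y[idx])
            w, b = w - lr * gw, b - lr * gb
    return w, b

w_full, b_full = train(batch_size=200)  # batch GD: one update per epoch
w_mini, b_mini = train(batch_size=32)   # mini-batch SGD: several updates per epoch
w_sgd, b_sgd = train(batch_size=1)      # pure SGD: one sample per update

print(w_full, b_full)
print(w_mini, b_mini)
print(w_sgd, b_sgd)
```

All three runs should land close to the generating parameters (2 and 1); the smaller the batch, the cheaper and noisier each individual update.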

### 5. Hessian and Second-Order Information
- Newton's Method.
- Conjugate Gradient.
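
Both methods can be tried on a small quadratic $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$, whose Hessian is $A$ and whose minimizer solves $Ax = b$. The matrix and vector below are hypothetical values picked for illustration, and the conjugate gradient routine is a textbook-style sketch rather than a production implementation.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])   # symmetric positive definite Hessian
b = np.array([1.0, 2.0])

def grad(x):
    """Gradient of f(x) = 0.5 * x^T A x - b^T x."""
    return A @ x - b

# Newton step: solve H d = grad rather than forming the inverse Hessian.
# On a quadratic, a single step reaches the minimizer.
x0 = np.zeros(2)
x_newton = x0 - np.linalg.solve(A, grad(x0))

def conjugate_gradient(A, b, x, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A using only products A @ d."""
    r = b - A @ x            # residual, equal to the negative gradient
    d = r.copy()             # first search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Ad = A @ d
        alpha = rs_old / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs_old) * d    # next direction, A-conjugate to the previous ones
        rs_old = rs_new
    return x

x_cg = conjugate_gradient(A, b, np.zeros(2))

print(x_newton, x_cg)
```

Conjugate gradient never factorizes or inverts `A`, which is why it is attractive when the Hessian is too large to store or solve against directly.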

### 6. Constrained Optimization
- Projected Gradient Descent.
- Lagrange multipliers.
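
As a small illustration of both ideas, consider minimizing $\|x - c\|^2$ subject to $\|x\| \le 1$. The point `c`, the step size, and the iteration count below are assumptions made for this sketch. Projected gradient descent alternates a gradient step with a projection back onto the feasible set, and the result can be checked against the stationarity condition of the Lagrangian $\|x - c\|^2 + \lambda(\|x\|^2 - 1)$.

```python
import numpy as np

c = np.array([2.0, 2.0])   # hypothetical target lying outside the unit ball

def grad(x):
    """Gradient of f(x) = ||x - c||^2."""
    return 2.0 * (x - c)

def project_unit_ball(x):
    """Euclidean projection onto the set {x : ||x|| <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

x = np.zeros(2)
for _ in range(100):
    x = project_unit_ball(x - 0.1 * grad(x))   # gradient step, then project

# For this problem the multiplier has the closed form lambda = ||c|| - 1.
# At the constrained optimum, grad f(x) + 2 * lambda * x should vanish.
lam = np.linalg.norm(c) - 1.0
stationarity = grad(x) + 2.0 * lam * x

print(x, lam, stationarity)
```

The iterate settles at `c / ||c||`, the point of the ball closest to `c`, and the stationarity residual is zero up to rounding.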

### 7. Numerical Tricks in Deep Learning
- Avoiding vanishing/exploding gradients.
- Numerically stable softmax and log-sum-exp trick.
- Parameter initialization considerations.
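
Two of these issues can be seen with very little code. The sketch below uses hypothetical inputs: a vector of large negative scores for log-sum-exp, and per-layer scaling factors of 0.9 and 1.1 as a crude stand-in for repeated multiplication of gradients through a deep network.

```python
import numpy as np

def logsumexp(z):
    """Compute log(sum(exp(z))) stably by factoring out the maximum."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([-1000.0, -1000.5, -1001.0])

# Naive version: every exp underflows to 0, so the log evaluates to -inf.
with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(z)))

# Stable version: the largest shifted exponent is exp(0) = 1, so nothing underflows.
stable = logsumexp(z)

# Repeated scaling across many layers: factors slightly below or above 1
# shrink or blow up geometrically with depth.
depth = 100
shrink = 0.9 ** depth   # about 2.7e-5: the signal vanishes
grow = 1.1 ** depth     # about 1.4e4: the signal explodes

print(naive, stable, shrink, grow)
```

Initialization schemes aim to keep the per-layer scaling close to 1 so that neither of the last two effects dominates.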

---

# 📝 Exercises

## Basic Understanding

### Exercise 1
Explain why deep learning relies on numerical approximation rather than on analytical (closed-form) solutions.  
👉 Hint: Consider model size, nonlinearity, and parameter dimensionality.

In [None]:
# TODO: Write your explanation here as a comment or markdown cell

### Exercise 2
Given matrix  
$$
A = \begin{bmatrix} 1 & 2 \\ 2 & 4.0001 \end{bmatrix}
$$  
Compute its condition number and explain why it is ill-conditioned.

In [None]:
# TODO: Implement condition number calculation
import numpy as np

A = np.array([[1, 2], 
              [2, 4.0001]])

# Calculate condition number using np.linalg.cond()
# Explain why this matrix is ill-conditioned

---

## Numerical Gradient

### Exercise 3
Using finite difference approximation, compute the derivative of  
$$
f(x) = x^3 - 3x^2 + 2x
$$  
at $x = 2$, and compare it with the analytical derivative.

In [None]:
# TODO: Implement finite difference approximation
import numpy as np

def f(x):
    return x**3 - 3*x**2 + 2*x

def analytical_derivative(x):
    # TODO: Calculate analytical derivative f'(x) = 3x^2 - 6x + 2
    pass

def finite_difference(f, x, h=1e-5):
    # TODO: Implement (f(x+h) - f(x-h)) / (2*h)
    pass

x = 2
# Compare numerical vs analytical derivative

---

## Optimization Methods

### Exercise 4
Perform **Gradient Descent** manually:  
- Function: $f(x) = (x-3)^2$  
- Initial point: $x_0 = 0$  
- Learning rate: $\eta = 0.1$  
- Compute the first 3 iterations of $x$.

In [None]:
# TODO: Implement gradient descent manually
import numpy as np

def f(x):
    return (x - 3)**2

def df_dx(x):
    # TODO: Calculate derivative f'(x) = 2(x-3)
    pass

# Initial conditions
x = 0
learning_rate = 0.1

# TODO: Perform 3 iterations of gradient descent
# x_new = x_old - learning_rate * gradient

### Exercise 5
Write down the main differences between **Stochastic Gradient Descent (SGD)** and **Batch Gradient Descent (BGD)**. Provide suitable application scenarios for each.

In [None]:
# TODO: Write your comparison here as comments or markdown cell
# Consider:
# - Computational cost per iteration
# - Memory requirements
# - Convergence properties
# - Noise in gradient estimates
# - When to use each method

---

## Second-Order Methods

### Exercise 6
Apply Newton's Method (in its root-finding form) to:  
- Function: $f(x) = x^2 - 2$  
- Initial point: $x_0 = 1$  
- Perform 2 iterations and compare with the true root $\sqrt{2}$.

In [None]:
# TODO: Implement Newton's method
import numpy as np

def f(x):
    return x**2 - 2

def f_prime(x):
    # TODO: Calculate f'(x) = 2x
    pass

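def f_double_prime(x):
    # TODO: Calculate f''(x) = 2 (only needed for the optimization form of Newton's method)
    pass

# Root-finding form, which converges to sqrt(2): x_new = x_old - f(x) / f'(x)
# The optimization form, x_new = x_old - f'(x) / f''(x), would instead reach the minimizer x = 0.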
x = 1
true_solution = np.sqrt(2)

# TODO: Perform 2 iterations and compare with true solution

---

## Numerical Stability

### Exercise 7
Why is softmax usually computed as  
$$
\text{softmax}(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}
$$  
instead of directly as $\frac{e^{x_i}}{\sum_j e^{x_j}}$?

In [None]:
# TODO: Implement both versions and demonstrate numerical issues
import numpy as np

def naive_softmax(x):
    # TODO: Implement direct softmax
    pass

def stable_softmax(x):
    # TODO: Implement numerically stable softmax
    pass

# Test with large values that cause overflow
x = np.array([1000, 1001, 1002])

# TODO: Compare results and explain the difference

---

## Overflow and Underflow

### Exercise 8
Run the following code in Python / NumPy and explain the results:

In [None]:
# TODO: Run this code and explain the results
import numpy as np

print("exp(1000) =", np.exp(1000))     # Test overflow
print("exp(-1000) =", np.exp(-1000))   # Test underflow

# TODO: Explain what overflow and underflow mean
# TODO: Discuss implications for deep learning computations