# **In-class 15. Pickle, Automatic differentiation, PyTorch, and gradient descent.** #

# CONTRIBUTORS #

This in-class exercise is to be done in pairs. Add the names of the two students in this text block.


## Workflow for today ##

Today we'll learn to use PyTorch as a tool to perform automatic differentiation (AD). AD is special because it allows us to take the derivatives of very complex functions and use those derivatives to perform optimization. On its own, AD is simply a mathematical tool. When combined with ideas from artificial neural networks, it allows us to build and train models inspired by the functionality of the human brain. Today we will first get started on the basic syntax of PyTorch, and use it to fit a very simple model to data.


## Using pickle to save and load data. ##

First, we are going to need some tools to build datasets and save them to disk. There are more sophisticated tools (typically one would work with [Pandas dataframes](https://pandas.pydata.org/) in a more serious data science setting), but we will use [Pickle](https://docs.python.org/3/library/pickle.html) to get started because it is incredibly simple to use and is fast for relatively small datasets. For very large datasets Pickle won't cut it; in that case you would use something like [HDF5](https://docs.h5py.org/en/stable/). As always, when learning a new library you will find it useful to keep an LLM open in your browser to help answer specific questions about syntax.

Pickle will take *almost any Python object* and save it to disk as a `.pkl` file (pronounced "pickle"). There are three key functions to understand:
* `open`: Python's built-in function for opening a file, here a pickle file. The first argument is the file name, and the second argument sets whether to open the file in read or write mode. Pickle files are binary, so the modes we use are `'rb'` and `'wb'`.
* `pickle.dump`: Writes the object into the pkl file. The first argument is the Python object, while the second is the opened pkl file.
* `pickle.load`: Reads the object back from the pkl file on disk. The argument is the opened pkl file.

In what follows you'll be seeing us use the command `with` for the first time. `with` is a convenient way to limit the scope in which a resource, such as an open file, is active: at the end of the code block, the file is closed automatically.

The syntax is
``` 
with OBJECT as OBJECT_NAME:
  # do things 
```

This makes it hard to accidentally keep using an object, such as an open file, outside of the place where it is meant to be used. In the example code below, we use `with` together with `open` so that our pickle file is open in either write mode (`'wb'`) or read mode (`'rb'`) only for the duration of each block.

In [1]:
import numpy as np
import pickle

# Create a list of NumPy arrays
array_list = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]

# Save the list of NumPy arrays to a file ('wb' = write, binary)
#   Note: `pkl_file` is the opened file object, not the file name
with open('array_list.pkl', 'wb') as pkl_file:
    pickle.dump(array_list, pkl_file)

# Load the list of NumPy arrays from the file ('rb' = read, binary)
with open('array_list.pkl', 'rb') as pkl_file:
    loaded_array_list = pickle.load(pkl_file)

# Verify the loaded data
for i, array in enumerate(loaded_array_list):
    print(f"Array {i}:", array)

Array 0: [1 2 3]
Array 1: [4 5 6]
Array 2: [7 8 9]
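
To see what `with` is doing for us, the following minimal sketch writes a pickle file using `with` and then reads it back using a hand-written `try`/`finally` that does roughly the same job. The file name `with_demo.pkl` and the temporary directory are assumptions made only for this illustration.

```python
import os
import pickle
import tempfile

# Hypothetical file location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "with_demo.pkl")

# Version 1: `with` closes the file automatically at the end of the block
with open(path, "wb") as pkl_file:
    pickle.dump({"a": 1}, pkl_file)
closed_after_with = pkl_file.closed  # the file object still exists, but it is closed

# Version 2: roughly the same behavior written out by hand
pkl_file = open(path, "rb")
try:
    loaded = pickle.load(pkl_file)
finally:
    pkl_file.close()  # runs even if pickle.load raises an error

print("Closed after with-block:", closed_after_with)
print("Loaded object:", loaded)
```

The `with` version is shorter and makes it much harder to forget the `close()` call.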


**Your turn.** Use the bouncy rock paper scissors code to generate 10 trajectories. Use pickle to save them as separate pkl files on your disk (e.g. `dataset_001.pkl`, `dataset_002.pkl`,...). Write code here to load all of the pkl files at once and store them as a list of numpy arrays.

In [1]:
# Write code here.

## Introduction to PyTorch ##

As described in the workflow above, PyTorch is our tool for automatic differentiation (AD). We will first get started using AD to minimize some simple functions.

**Tensor syntax.** PyTorch operates on tensors, which you can think of as being much like NumPy arrays, except that they have the "plumbing" in place to do automatic differentiation. The following gives some examples of how you can build tensors and interact with them; the syntax is nearly identical to NumPy.

In [None]:
import torch

# Create PyTorch tensors
tensor_a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
tensor_b = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)

# Transpose
tensor_transpose = tensor_a.t()

# Reshape
tensor_reshape = tensor_a.view(4, 1)

# Slicing
tensor_slice = tensor_a[:, 1]

# Concatenation - dim states which dimension to concatenate in (i.e. stack in rows vs stack in columns)
tensor_concat = torch.cat((tensor_a, tensor_b), dim=0)

# Sum
tensor_sum = torch.sum(tensor_a)

# Mean
tensor_mean = torch.mean(tensor_a)

# Element-wise sine
tensor_sin = torch.sin(tensor_a)

# Norm
tensor_norm = torch.norm(tensor_a)

# In-place addition
tensor_a.add_(tensor_b)

print("Transpose:\n", tensor_transpose)
print("Reshape:\n", tensor_reshape)
print("Slicing:\n", tensor_slice)
print("Concatenation:\n", tensor_concat)
print("Sum:\n", tensor_sum)
print("Mean:\n", tensor_mean)
print("Element-wise Sine:\n", tensor_sin)
print("Norm:\n", tensor_norm)
print("In-place Addition:\n", tensor_a)
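
As a rough check of the "nearly identical to NumPy" claim, the sketch below applies a handful of the operations above to the same matrix in both libraries and compares the results. The names `a_np`, `a_t`, `pairs`, and `matches` are made up for this illustration.

```python
import numpy as np
import torch

# The same 2x2 matrix as a NumPy array and as a PyTorch tensor
a_np = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
a_t = torch.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=torch.float32)

# Each pair applies the corresponding operation in the two libraries
pairs = {
    "transpose": (a_np.T, a_t.t()),
    "reshape": (a_np.reshape(4, 1), a_t.reshape(4, 1)),
    "slice": (a_np[:, 1], a_t[:, 1]),
    "sum": (a_np.sum(), a_t.sum()),
    "mean": (a_np.mean(), a_t.mean()),
    "sin": (np.sin(a_np), torch.sin(a_t)),
}

# Convert each PyTorch result to NumPy and compare
matches = {}
for name, (np_result, torch_result) in pairs.items():
    matches[name] = bool(np.allclose(np_result, torch_result.numpy(), atol=1e-6))
    print(f"{name}: results agree = {matches[name]}")
```

The main syntax differences in this small sample are `.t()` versus `.T` and `torch.cat` versus `np.concatenate`.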

**Taking derivatives.** The following code implements a very simple polynomial function and takes its derivative. For this there are a few parts to pay careful attention to:
* `requires_grad=True` when defining `x` is important, because it tells PyTorch to track every later variable that depends on `x`. This is the cue to start building up the computational graph used to compute derivatives.
* `y.backward()` computes the derivatives of `y` with respect to all variables that `y` depends upon which were initialized with the `requires_grad` flag.
* `x.grad` returns the computed derivative.

In [None]:
import torch

# Create a tensor with requires_grad=True to track computations
x = torch.tensor(2.0, requires_grad=True)

# Perform some operations
y = x ** 2 + 3 * x + 1

# Compute the gradient
y.backward()

# Print the gradient
print("Gradient of y with respect to x:", x.grad)
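
A quick sanity check on the example above: by hand, $dy/dx = 2x + 3$, which is $7$ at $x = 2$. The sketch below compares the AD result against this hand calculation and against a central finite difference. The step size `h` and the helper name `f_plain` are choices made for this illustration.

```python
import torch

# Same function as above, written so it works on plain Python floats too
def f_plain(v):
    return v ** 2 + 3 * v + 1

# Derivative from automatic differentiation
x = torch.tensor(2.0, requires_grad=True)
y = f_plain(x)
y.backward()
autograd_value = x.grad.item()

# Derivative by hand: dy/dx = 2x + 3
analytic_value = 2 * 2.0 + 3

# Derivative by central finite difference (h is an arbitrary small step)
h = 1e-3
finite_difference_value = (f_plain(2.0 + h) - f_plain(2.0 - h)) / (2 * h)

print("Autograd:         ", autograd_value)
print("By hand:          ", analytic_value)
print("Finite difference:", finite_difference_value)
```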

**Zeroing out the gradients.**
To make things fast, PyTorch accumulates gradients on the fly: each call to `backward()` adds the newly computed derivative to whatever is already stored in `.grad` instead of overwriting it. This is because when performing the chain rule, many derivatives are aggregated across the model over and over. It means, however, that you need to "clear the buffer" before calculating a derivative a second time. The following code illustrates this.

In [None]:
# Define a simple function
def f(x):
    return x ** 2

# Initialize the tensor with requires_grad=True
x = torch.tensor(2.0, requires_grad=True)

# Perform the first backward pass
y = f(x)
y.backward()
print("Gradient after first backward pass:", x.grad)

# Perform the second backward pass without zeroing gradients
y = f(x)
y.backward()
print("Gradient after second backward pass without zeroing:", x.grad)

# Zero the gradients
x.grad.zero_()

# Perform the third backward pass after zeroing gradients
y = f(x)
y.backward()
print("Gradient after third backward pass with zeroing:", x.grad)
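
One way to see why accumulation can be useful rather than just a nuisance: calling `backward()` on two separate pieces and letting the gradients add up gives the same answer as differentiating their sum. The following is a small sketch of that idea, with variable names chosen for this illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Two separate backward passes; the gradients are added together in x.grad
(x ** 2).backward()   # contributes 2x = 4
(3 * x).backward()    # contributes 3
accumulated = x.grad.item()

# Clear the buffer, then differentiate the sum in a single pass
x.grad.zero_()
(x ** 2 + 3 * x).backward()  # 2x + 3 = 7
combined = x.grad.item()

print("Accumulated over two passes:", accumulated)
print("Single pass on the sum:     ", combined)
```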

**Detaching tensors from the graph.** PyTorch will track every variable as you build up more and more complicated expressions. Sometimes though, you just want to do some operations on a tensor that don't need derivatives (e.g. you want to make some plots). For this you can use the `detach()` method.

In [None]:
import torch

# Create a tensor with requires_grad=True
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Perform some operations
y = x ** 2 + 3 * x + 1

# Detach the tensor
y_detached = y.detach()

# Print the original and detached tensors
print("Original tensor:", y)
print("Detached tensor:", y_detached)

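# Perform an in-place operation on the detached tensor
#   Note: detach() shares memory with the original, so this also changes the values stored in y
y_detached += 1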

# Print the modified detached tensor and the original tensor
print("Modified detached tensor:", y_detached)
print("Original tensor after modifying detached tensor:", y)
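
Because `detach()` returns a tensor that shares memory with the original, an in-place change to the detached tensor shows up in the original too. If you want an independent copy, one option is to add `.clone()`. The sketch below illustrates the difference; the names `y_view` and `y_copy` are made up for this example.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                      # tracked by autograd

y_view = y.detach()             # no gradient tracking, shares memory with y
y_copy = y.detach().clone()     # no gradient tracking, independent copy

# In-place change to the detached view
y_view += 1

print("y.requires_grad:     ", y.requires_grad)
print("y_view.requires_grad:", y_view.requires_grad)
print("Same memory as y?    ", y_view.data_ptr() == y.data_ptr())
print("y after the change:  ", y)
print("y_copy (unchanged):  ", y_copy)
```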

**Going back and forth between numpy.** It is possible to interpret PyTorch tensors as NumPy arrays and vice versa.

In [None]:
import torch
import numpy as np

# Create a PyTorch tensor
tensor = torch.tensor([1.0, 2.0, 3.0])

# Convert PyTorch tensor to NumPy array
numpy_array = tensor.numpy()
print("NumPy array:", numpy_array)

# Convert NumPy array back to PyTorch tensor
tensor_from_numpy = torch.from_numpy(numpy_array)
print("PyTorch tensor from NumPy array:", tensor_from_numpy)
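
Two details of the conversion are worth knowing, sketched below under the assumption that everything lives on the CPU: the NumPy array returned by `numpy()` shares memory with the tensor, and `numpy()` raises a `RuntimeError` on a tensor that requires gradients unless you call `detach()` first.

```python
import torch

# 1) Shared memory: editing the NumPy array also edits the tensor
tensor = torch.tensor([1.0, 2.0, 3.0])
numpy_array = tensor.numpy()
numpy_array[0] = 10.0
print("Tensor after editing the NumPy array:", tensor)

# 2) Tensors that require gradients must be detached before converting
tracked = torch.tensor([1.0], requires_grad=True)
try:
    tracked.numpy()
    raised = False
except RuntimeError as err:
    raised = True
    print("Direct conversion failed:", err)

safe_array = tracked.detach().numpy()
print("Conversion after detach():", safe_array)
```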

**Example - Generating plots of PyTorch variables.** The following code will generate a plot of a polynomial function and gives examples of how to use `numpy()` and `detach()` when making plots.

In [None]:
import torch
import matplotlib.pyplot as plt
import numpy as np

# Define the function
def f(x):
    return x ** 2 + 3 * x + 1

# Generate x values between 0 and 1
x_values = torch.linspace(0, 1, steps=100, requires_grad=True)

# Compute y values
y_values = f(x_values)

# Plot the function
#   Note: we use detach() here because we don't need derivatives of plots with respect to inputs,
#   and because numpy() cannot be called directly on a tensor that requires gradients
plt.plot(x_values.detach().numpy(), y_values.detach().numpy(), label='y = x^2 + 3x + 1')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Plot of y = x^2 + 3x + 1 for x between 0 and 1')
plt.legend()
plt.show()
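
One detail to be aware of when differentiating a whole vector of values: `backward()` without arguments only works on a scalar. A common workaround, sketched here on a different function ($y = x^3$, chosen only for illustration), is to call `backward()` on the sum. Because each entry of `y` depends only on the matching entry of `x`, `x.grad` then holds the pointwise derivative.

```python
import torch

# A small grid of x values that tracks gradients
x = torch.linspace(0, 1, steps=5, requires_grad=True)

# Elementwise function of x
y = x ** 3

# y is a vector, so reduce it to a scalar before calling backward()
y.sum().backward()

pointwise_derivative = x.grad           # d(y_i)/d(x_i) for every grid point
expected = 3 * x.detach() ** 2          # derivative by hand: 3x^2

print("Autograd:", pointwise_derivative)
print("By hand: ", expected)
```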

**Your turn.** Modify the previous code examples to plot the derivative of the function $f$. Confirm that you have the correct derivative by plotting the true derivative (calculated by hand).

**Your turn.** Modify that same code to take the derivative of $y = \sin 2 \pi x$.

## Combining AD with optimizers to fit models. ##

Early in the semester we implemented Newton's method to find solutions to nonlinear equations.
$$F(x) = 0 $$

Recall that the following formula told us how to find our next guess for a solution.
$$x_{n+1} = x_n - \frac{F (x_n)}{F' (x_n)}$$
Remember also that if we instead want to minimize a function
$$\underset{x}{\min}\, G(x)$$
that amounts to setting the first derivative to zero

$$G'(x) = 0$$

in which case Newton reduces to
$$x_{n+1} = x_n - \frac{G'(x_n)}{G''(x_n)}$$
This requires taking two derivatives. For the toy functions we've looked at this isn't a big deal. But let's do a little exercise to see how quickly these derivatives get complicated. If you cook up the following (physically meaningless but complicated) expression:

$$G(x) = \frac{\sin \left[\exp \left( \cos x^2 \right)^3\right] }{\cos \left[\exp \left( -\left( \tan x^4 \right)^2 \right) \right]}$$
The derivative has many terms, even after simplifying:
$$G'(x) = \frac{
    \cos \left( \exp \left( \cos x^2 \right)^3 \right) \cdot \left( \exp \left( \cos x^2 \right)^3 \right)' \cos \left( \exp \left( -\left( \tan x^4 \right)^2 \right) \right) - \sin \left( \exp \left( \cos x^2 \right)^3 \right) \cdot -\sin \left( \exp \left( -\left( \tan x^4 \right)^2 \right) \right) \cdot \left( \exp \left( -\left( \tan x^4 \right)^2 \right) \right)'
}{
    \left( \cos \left( \exp \left( -\left( \tan x^4 \right)^2 \right) \right) \right)^2
}$$

Taking a second derivative would get even gnarlier! So as we move toward optimizing complicated nonlinear functions (like what shows up in neural networks) we want to avoid taking a second derivative if we can help it.
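
To make the Newton update for minimization concrete, here is a minimal sketch for a function simple enough that both derivatives can be written by hand. The function $G(x) = e^x - 2x$ is an assumption chosen for this illustration (it is not from Lecture06); its minimizer is $x = \ln 2$.

```python
import math

# Hand-coded derivatives of the illustrative function G(x) = exp(x) - 2x
def G_prime(x):
    return math.exp(x) - 2.0

def G_double_prime(x):
    return math.exp(x)

# Newton iteration for minimization: x_{n+1} = x_n - G'(x_n) / G''(x_n)
x = 0.0
history = [x]
for _ in range(6):
    x = x - G_prime(x) / G_double_prime(x)
    history.append(x)

print("Newton iterates:", history)
print("Exact minimizer:", math.log(2.0))
```

Note that every step needs both `G_prime` and `G_double_prime`, which is exactly the cost that the discussion above is trying to avoid for complicated functions.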

**Gradient descent.** For gradient descent when minimizing a function $G$, we simplify the Newton update as follows:
$$x_{n+1} = x_n - \eta G'(x_n)$$

where $\eta \ll 1$ is called the **learning rate**. The idea is that the derivative $G'$ tells you which direction to go, and the learning rate is a small number that we pick to set how big of a step to take. This is less efficient than Newton's method: it will typically take many more steps to get to the right answer. You will also have to play with $\eta$. You'll run the optimizer for some guess about what $\eta$ should be (maybe $\eta = 0.001$ is a good starting point), and then you'll run it again with a smaller $\eta$ and make sure the answer doesn't change. But what you get is a method that only needs information that pops out of automatic differentiation.

$\eta$ is an example of a *hyperparameter*. A hyperparameter is a number in your code that you have to choose yourself, and that changes the behavior of the model fitting (or *training* in machine learning speak) process. In contrast, we call the variables that belong to the model itself *parameters* (for example, the polynomial coefficients when we do regression).

**Implement gradient descent.** The following code implements gradient descent. As usual, use an LLM to get an explanation of any components that don't make sense to you.

In [None]:
import torch

# Define the function to be minimized
def f(x):
    return x ** 2 + 3 * x + 1

# Initialize the tensor with requires_grad=True
x = torch.tensor(2.0, requires_grad=True)

# Hyperparameters
learning_rate = 0.1
num_iterations = 100

# Gradient descent loop
for iteration in range(num_iterations):
    # Compute the function value
    y = f(x)
    
    # Perform backpropagation to compute gradients
    y.backward()
    
    # Update the tensor using the computed gradients
    with torch.no_grad():
        x -= learning_rate * x.grad
    
    # Zero the gradients
    x.grad.zero_()

# Print the optimized value
print("Optimized value of x:", x.item())
print("Minimum value of the function:", f(x).item())

The following is an example of what I got when I asked Gemini to explain the code:

### Explanation
- **Function Definition**: `f(x)` is the function to be minimized.
- **Tensor Initialization**: `x` is initialized with `requires_grad=True`.
- **Hyperparameters**: `learning_rate` and `num_iterations` are set.
- **Gradient Descent Loop**:
  - Compute the function value `y`.
  - Perform backpropagation with `y.backward()` to compute gradients.
  - Update `x` using the gradient and learning rate.
  - Zero the gradients with `x.grad.zero_()`.
- **Results**: Print the optimized value of `x` and the minimum value of the function.
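
For the specific quadratic used in the code above, the effect of the learning rate can also be worked out by hand. The following short derivation applies to this function only; the symbols $x^*$ (the minimizer) and $e_n$ (the error at step $n$) are introduced here for convenience.

$$G(x) = x^2 + 3x + 1, \qquad G'(x) = 2x + 3, \qquad x^* = -\tfrac{3}{2}$$

Substituting $G'$ into the gradient descent update gives

$$x_{n+1} = x_n - \eta \left( 2 x_n + 3 \right)$$

Writing $e_n = x_n - x^* = x_n + \tfrac{3}{2}$ and noting that $2 x_n + 3 = 2 e_n$, we get

$$e_{n+1} = e_n - 2 \eta \, e_n = \left( 1 - 2\eta \right) e_n \quad \Longrightarrow \quad e_n = \left( 1 - 2\eta \right)^n e_0$$

So for this function the error shrinks by a constant factor $|1 - 2\eta|$ each step, and the iteration converges whenever $0 < \eta < 1$. With the value $\eta = 0.1$ used above, the factor is $0.8$ per step.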

**Your turn.** In `Code examples/Lecture06.ipynb` earlier in the semester we wrote a Newton solver to optimize a function. Run it for the same function `def f(x): return x ** 2 + 3 * x + 1`. Generate a log-log plot showing how the error decreases during training. Plot the training step on the x-axis and the absolute value of the error on the y-axis. Generate several plots for different choices of learning rate, and see if you can get gradient descent to be competitive with Newton.

In [3]:
# Put stuff here.

## Jumpstart on the homework ##

The homework will be pretty involved. My aim is to give you lots of time to prepare for it. It will be due a week from Wednesday, and I have cleared class on the day that it's due to allow everyone to work on it together if there are last-minute questions. (**Do not leave it until then to get started - there is no way that you'll be able to complete it.**) Like the last homework, this is a larger, multi-component project that you will want to read through today and **get help at OH early**.

Today if you still have time at the end of class, you can read through the problem statement and ask questions. You are in a good spot right now to make sure you collect all of your data from the simulator and have it ready to work with in Wednesday's class.

# Turning in assignments on Canvas #
In order to submit your assignment as a pair, you need to create a group on Canvas. This will enable you to both receive the same grade for one submission.

On Canvas, navigate to People > Groups > In-Class 15.
Find an empty group and add the names of both members of the pair.

Submit your work as both an ipynb and a pdf to Canvas.

Save the ipynb and upload from your hard drive. Also print a pdf file to ensure the graders can see you have completed the exercise, even if there are issues with the formatting in your jupyter notebook.

The student who did not submit should make sure that the group was created successfully by checking that they can also access the files on their Canvas page.