# Intro to PyTorch

### PyTorch is one of the premier packages for machine learning in Python. It is used to train state-of-the-art AI systems, e.g. Tesla's Autopilot for self-driving cars and Stable Diffusion.
### It is derived from the Lua-based Torch library. This notebook shows how to use PyTorch to construct neural network models for simple applications.

In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

## The Tensor Class

### The default Torch array data structure is the Torch tensor. This class is almost completely analogous to its NumPy counterpart (the NumPy array), and shares many of the same methods:
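
### As a minimal sketch of this analogy (the array values and variable names below are illustrative and not taken from the original notebook), the same data can be built in both libraries and queried with identically named methods:

```python
import numpy as np
import torch

# Build the same 2x3 array in both libraries.
a_np = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
a_t = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Many attributes and methods share names and behavior.
print(a_np.shape, tuple(a_t.shape))      # both describe a 2x3 array
print(a_np.sum(), a_t.sum().item())      # both sum to 21.0
print(a_np.reshape(3, 2).shape, tuple(a_t.reshape(3, 2).shape))

# Converting between the two is a single call in either direction.
back_to_np = a_t.numpy()
back_to_t = torch.from_numpy(a_np)
```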

### Array operations work similarly:
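
### A short sketch with made-up values (the names u, v, and A are illustrative only): arithmetic operators act element-wise, and @ performs a matrix product, just as in NumPy.

```python
import torch

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([4.0, 5.0, 6.0])

elementwise_sum = u + v       # element-wise addition
elementwise_prod = u * v      # element-wise multiplication
dot = torch.dot(u, v)         # 1*4 + 2*5 + 3*6 = 32

A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
matrix_prod = A @ A           # matrix product, same operator as in NumPy

print(elementwise_sum, elementwise_prod, dot)
print(matrix_prod)
```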

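### I.e: the shapes of the arrays **must** be compatible. Shapes are compared dimension by dimension, starting from the last: each pair of dimensions must be equal, or one of the two must have size 1 (in which case it is stretched to match the other). E.g: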
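
### The rule can be sketched as follows. The tensors here are placeholders chosen for illustration; the last case shows the error raised when shapes cannot be aligned.

```python
import torch

a = torch.ones((3, 4))
row = torch.tensor([[1.0, 2.0, 3.0, 4.0]])     # shape (1, 4)
col = torch.tensor([[10.0], [20.0], [30.0]])   # shape (3, 1)

sum_row = a + row      # (3, 4) + (1, 4): the size-1 dimension is stretched
sum_both = row + col   # (1, 4) + (3, 1): both operands are stretched
print(sum_row.shape, sum_both.shape)

# Shapes (3, 4) and (3,) cannot be aligned: the last dimensions are 4 and 3.
mismatch_raised = False
try:
    a + torch.ones(3)
except RuntimeError as err:
    mismatch_raised = True
    print("Shape mismatch:", err)
```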

### Torch tensors start to differ from NumPy in their seamless integration with automatic differentiation. The keyword *requires_grad* decides whether operations on the declared tensor are recorded so that gradients can be computed.

### $$y =\sum_i x_i^2$$

### $$\Rightarrow \frac{\partial y}{\partial x_i} = 2x_i$$
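
### A minimal check of this derivative with autograd (the input values are arbitrary): calling backward() on $y$ fills x.grad with $\partial y/\partial x_i$.

```python
import torch

# requires_grad=True tells PyTorch to record operations on x.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = (x ** 2).sum()   # y = sum_i x_i^2
y.backward()         # computes dy/dx_i and stores it in x.grad

print(x.grad)        # equals 2 * x
```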

### Automatically computing gradients is convenient for ML models, including those trained through gradient descent. In this case, automatic differentiation will be used in the context of $y = f(x)$ being our "error" or loss function.

### Finally, the tensor object offers speed benefits for (some) operations on large arrays:

In [None]:
import time
N = 1000
X = torch.randn((N,N))
Y = torch.randn((N,N))

t0 = time.perf_counter()
torch.matmul(X,Y)
t1 = time.perf_counter()

print("Time is {}s".format(t1-t0))

In [None]:
import time
X1 = X.detach().numpy() #Converts to numpy array.
Y1 = Y.detach().numpy()

t0 = time.perf_counter()
X1@Y1
t1 = time.perf_counter()

print("Time is {}s".format(t1-t0))

### In general, the goal of a neural network is to approximate a function $f$ that maps input data $x$ to output data $y$. For example, suppose we want a model that predicts the temperature ($y$) from the hour ($t$), the humidity ($h$), and the month ($m$). We write $\hat{y} = f(t,h,m;\alpha)$, where $\hat{y}$ is the predicted temperature, $y$ is the actual temperature, and $\alpha$ represents the parameters we can tune through gradient descent.

### Thus, we define a *loss* function $L(y,\hat{y})$, which measures our error. E.g: $L(y,\hat{y}) = \frac{1}{n}\sum_i (y_i-\hat{y}_i)^2$ (mean squared error). Minimizing this function by changing the $\alpha$ parameters minimizes the "error," and so leads to a more accurate approximation $f$ (at least for the inputs provided).

### We define two linear transformations $M_1, M_2$ that correspond to the $\alpha$ parameters mentioned above (more on this later). Our input is a 4x2 matrix, and we want a 4x1 vector as output. So if we multiply $x$ by a 2xN matrix, and then by an Nx1 matrix, our output will be a 4x1 matrix. The nn.Linear class creates a layer that applies a matrix multiplication plus a bias vector $b$ to its input. I.e: $g(x) = Ax+b$.
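
### A sketch of the shape bookkeeping, under stated assumptions: the hidden width N = 8 and the random 4x2 input x_demo are placeholders, not values from the original notebook. Note that nn.Linear stores its matrix with shape (out_features, in_features) and applies it to row vectors as $xA^T + b$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N = 8                          # hypothetical hidden width
x_demo = torch.randn(4, 2)     # placeholder 4x2 input

M1 = nn.Linear(2, N)           # maps 2 features to N features
M2 = nn.Linear(N, 1)           # maps N features to 1 output

out = M2(M1(x_demo))
print(out.shape)               # one output per input row: (4, 1)

# Reproduce the first layer by hand: x @ A.T + b
manual = x_demo @ M1.weight.T + M1.bias
print(torch.allclose(manual, M1(x_demo), atol=1e-6))
```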

### Create a class representing the aforementioned neural network.
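
### One possible way to write such a class is sketched below. The class name TwoLayerNet, the hidden width of 8, and the choice bias=False (the later discussion treats this first model as bias-free) are assumptions made for illustration, not the original notebook's code.

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    """Hypothetical two-layer linear network: x -> M2(M1(x))."""

    def __init__(self, hidden=8):
        super().__init__()
        self.M1 = nn.Linear(2, hidden, bias=False)   # 2 inputs -> hidden
        self.M2 = nn.Linear(hidden, 1, bias=False)   # hidden -> 1 output

    def forward(self, x):
        return self.M2(self.M1(x))

net = TwoLayerNet()
y_hat = net(torch.randn(4, 2))   # placeholder 4x2 input
n_params = sum(p.numel() for p in net.parameters())
print(y_hat.shape, n_params)
```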

### We define the loss by the aforementioned mean-square error using nn.MSELoss():
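
### As a quick check with made-up numbers (y_true and y_pred are illustrative), nn.MSELoss() with its default settings averages the squared differences, which matches the formula computed by hand:

```python
import torch
import torch.nn as nn

y_true = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y_pred = torch.tensor([[1.5], [2.0], [2.0], [4.0]])

loss_fn = nn.MSELoss()                      # default reduction is the mean
loss = loss_fn(y_pred, y_true)

manual = ((y_true - y_pred) ** 2).mean()    # (0.25 + 0 + 1 + 0) / 4
print(loss.item(), manual.item())
```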

# Gradient Descent

### Gradient descent is a relatively simple minimization method. Here, we minimize the error/loss by searching for the $\alpha$ parameters that bring the loss as close to 0 as possible. For a single parameter $\alpha$, a minimum satisfies $\partial L/\partial \alpha = 0$. Rather than solving this condition directly, we adjust each of our parameters $\alpha_i$ iteratively: 

### $$\alpha_i \rightarrow \alpha_i-l\frac{\partial L}{\partial \alpha_i} ,$$
### for n iterations (epochs), or until the loss function reaches a minimum. Here, $l$ is the learning rate of the algorithm. PyTorch's stochastic gradient descent optimizer (torch.optim.SGD) performs this update using gradients computed by automatic differentiation. 
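
### To make the update rule concrete, here is a sketch on a toy problem. The loss toy_loss, the target values, and the learning rate are hypothetical stand-ins for the notebook's model and data. It applies one update by hand, confirms that torch.optim.SGD (default settings, no momentum) produces the same step, and then iterates.

```python
import torch
from torch.optim import SGD

lr = 0.1                                          # learning rate l
alpha = torch.tensor([1.0, -2.0], requires_grad=True)
target = torch.tensor([3.0, 0.0])

def toy_loss(a):
    # Hypothetical loss with its minimum at a = target.
    return ((a - target) ** 2).sum()

# One update by hand: alpha_i -> alpha_i - l * dL/dalpha_i
toy_loss(alpha).backward()
manual_step = alpha.detach() - lr * alpha.grad

# The same update through the optimizer, using the gradient computed above.
optimizer = SGD([alpha], lr=lr)
optimizer.step()
first_sgd_step = alpha.detach().clone()

# Keep iterating; each epoch clears the old gradient before recomputing it.
for epoch in range(100):
    optimizer.zero_grad()
    toy_loss(alpha).backward()
    optimizer.step()

print(manual_step, first_sgd_step, alpha.detach())
```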

In [None]:
from torch.optim import SGD


In [None]:
from tqdm import tqdm


In [None]:
plt.plot(loss_num)
plt.xlabel("epoch")
plt.ylabel("Loss")
plt.show()

In [None]:
print(y)
print(f(x))

In [None]:
print(y_test)
print(f(x_test))

### Consider the following case:

In [None]:


plt.plot(losses1)
plt.xlabel("epoch")
plt.ylabel("Loss")
plt.show()

print(np.min(losses1))

In [None]:
print(y1)
print(g(x1))

### For this function (which is likely not linear), layers that are purely linear transformations are not good approximations. As such, we add *activation layers* to our model. These are nonlinear functions placed between the linear layers, which keep consecutive matrices from merging into a single linear map. Consider our previous model, in which the product $M_2M_1$ is itself just one matrix: $$M_2(M_1x+b_1)+b_2 = M_2M_1x \ \ (b_1=b_2=0).$$
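
### This collapse can be checked numerically. In the sketch below, the layer sizes and the random input x_demo are placeholders; the two stacked bias-free layers give the same output as a single matrix built from their weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
M1 = nn.Linear(2, 8, bias=False)
M2 = nn.Linear(8, 1, bias=False)
x_demo = torch.randn(4, 2)             # placeholder input

stacked = M2(M1(x_demo))               # two linear layers in sequence

combined = M2.weight @ M1.weight       # single matrix M2 M1, shape (1, 2)
single = x_demo @ combined.T           # one linear map applied to each row

print(torch.allclose(stacked, single, atol=1e-6))
```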

### We now add a nonlinearity N (a **slightly** nonlinear function applied element-wise) right after our first linear transformation by $M_1$. Thus, our new model is given by:  $$M_2N(M_1x).$$

## Examples of common nonlinearities:

### ReLU

In [None]:
t = torch.linspace(-np.pi,np.pi, 1000)
act = nn.ReLU()
y = act(t)

plt.plot(t.numpy(), (y).numpy())
plt.grid()
plt.show()

### Tanh

In [None]:
t = torch.linspace(-np.pi,np.pi, 1000)
act = nn.Tanh()
y = act(t)

plt.plot(t.numpy(), (y).numpy())
plt.grid()
plt.show()

### Sigmoid

In [None]:
t = torch.linspace(-np.pi,np.pi, 1000)
act = nn.Sigmoid()
y = act(t)

plt.plot(t.numpy(), (y).numpy())
plt.grid()
plt.show()

In [None]:
plt.plot(lossNL)
plt.xlabel("epoch")
plt.ylabel("Loss")
plt.show()

### No significant difference yet... but the addition of more parameters leads to significant corrections now that we have this nonlinearity: 

In [None]:
plt.plot(lossNL)
plt.xlabel("epoch")
plt.ylabel("Loss")
plt.show()

In [None]:
print(y1)
print(fNL(x1))

### Now, we can add bias vectors to our layers so that our model is given by: 

### $$M_2N(M_1x+b_1)+b_2.$$
### The biases let each layer shift its output towards the positive or negative side, independently of the input. 
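
### A sketch of what the biases change, assuming a ReLU nonlinearity and a hidden width of 8 (both are illustrative choices, as are the model names): without biases, a zero input is always mapped to zero, while with biases the output at zero is set by $b_1$ and $b_2$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 8   # hypothetical hidden width

no_bias = nn.Sequential(
    nn.Linear(2, hidden, bias=False), nn.ReLU(), nn.Linear(hidden, 1, bias=False)
)
with_bias = nn.Sequential(
    nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1)
)

origin = torch.zeros(1, 2)
out_no_bias = no_bias(origin)       # exactly zero: nothing can shift it
out_with_bias = with_bias(origin)   # equals M2 ReLU(b1) + b2

# The biases add one parameter per output unit of each layer.
n_no_bias = sum(p.numel() for p in no_bias.parameters())
n_with_bias = sum(p.numel() for p in with_bias.parameters())
print(out_no_bias, out_with_bias, n_no_bias, n_with_bias)
```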

In [None]:
plt.plot(lossNLbias)
plt.xlabel("epoch")
plt.ylabel("Loss")
plt.show()

In [None]:
print(y1)
print(fNLbias(x1))

### Finally, we add another layer to our sequential neural network:
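
### One way such a deeper model might look is sketched here. The hidden width, the ReLU activations, and the name deeper_net are assumptions, since the notebook's own definition is not shown in this section.

```python
import torch
import torch.nn as nn

hidden = 8   # hypothetical hidden width

deeper_net = nn.Sequential(
    nn.Linear(2, hidden), nn.ReLU(),        # first layer
    nn.Linear(hidden, hidden), nn.ReLU(),   # the additional layer
    nn.Linear(hidden, 1),                   # output layer
)

out = deeper_net(torch.randn(4, 2))         # placeholder 4x2 input
n_params = sum(p.numel() for p in deeper_net.parameters())
print(out.shape, n_params)
```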

In [None]:
plt.plot(lossNLbiasF)
plt.xlabel("epoch")
plt.ylabel("Loss")
plt.show()

In [None]:
print(y1)
print(fNLbiasF(x1))

### Predicting our output almost exactly!

# Project: Integer Image Classification

Reference: https://www.youtube.com/watch?v=gBw0u_5u0qU

### Description of the problem: We want to take an N x N image of a digit from 0 to 9. The image might be hand-drawn or computer-generated, and the training data should include whichever kinds of image we expect to classify. Our output should be a prediction of the digit depicted in the input image (either a scalar output, or the probability of each digit being the one depicted). 

In [None]:
for i in range(4):
    plt.imshow(x0[i].numpy())
    plt.title(str(y0[i].numpy()))
    plt.colorbar()
    plt.show()

### We use one-hot encoding to construct an array of values analogous to the probability of each class. In this case, our classes are the digits 0-9 (10 classes total). For each training image, we know exactly which digit it shows. Thus, we set the probability of that digit to 1 and the probability of every other digit to 0. E.g:
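
### A minimal sketch using torch.nn.functional.one_hot; the three labels are made up for illustration. Taking the argmax of each row recovers the original digit.

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([3, 0, 9])              # integer class labels
one_hot = F.one_hot(labels, num_classes=10)   # shape (3, 10), entries 0 or 1

print(one_hot[0])   # tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

# argmax along the class dimension recovers the labels.
recovered = torch.argmax(one_hot, dim=1)
print(recovered)
```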

In [None]:
import torch.nn.functional as F


### The linear layers we use expect each sample to be a 1-D vector of features, not a 2-D image. As such, we flatten the images for our x, so that instead of having a 50000 x 28 x 28 array, we have a 50000 x $28^2$ array.
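
### The flattening step can be sketched on a small stand-in batch (5 blank images rather than the full dataset; the variable names are illustrative):

```python
import torch

images = torch.zeros(5, 28, 28)                # stand-in batch of 28x28 images

# Keep the batch dimension and merge the two image dimensions.
flat = images.reshape(images.shape[0], -1)     # shape (5, 784)

# torch.flatten with start_dim=1 does the same thing.
flat_alt = torch.flatten(images, start_dim=1)
print(flat.shape, flat_alt.shape)
```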

### We define our neural net and our loss function. In particular, our loss is now the cross-entropy loss. This loss function is better suited to classification problems.
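
### A sketch of one possible setup; the layer sizes, the name classifier, and the random batch are assumptions rather than the notebook's own code. nn.CrossEntropyLoss takes the raw network outputs (logits) and applies a log-softmax internally, so it equals the average negative log-probability assigned to the true class. In PyTorch 1.10 or later it also accepts one-hot (probability) targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
classifier = nn.Sequential(        # hypothetical architecture
    nn.Linear(28 * 28, 100), nn.ReLU(), nn.Linear(100, 10)
)
loss_fn = nn.CrossEntropyLoss()

x_batch = torch.randn(5, 28 * 28)             # placeholder flattened images
labels = torch.tensor([3, 0, 9, 1, 7])        # placeholder digit labels

logits = classifier(x_batch)                  # shape (5, 10), one score per class
loss = loss_fn(logits, labels)

# By hand: minus the log-softmax at the true class, averaged over the batch.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(5), labels].mean()

# One-hot float targets give the same value (PyTorch 1.10 or later).
one_hot_targets = F.one_hot(labels, num_classes=10).float()
loss_one_hot = loss_fn(logits, one_hot_targets)
print(loss.item(), manual.item(), loss_one_hot.item())
```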

In [None]:
plt.plot(lossC)
plt.xlabel("epoch")
plt.ylabel("Loss")
plt.show()
print(min(lossC))

In [None]:
plt.imshow(x_train[0])
plt.title(torch.argmax(y_train[0]).numpy())

### Let's see the algorithm's predictions on training data...

In [None]:
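fig, axes = plt.subplots(10, 4, figsize=(20, 30))
# NOTE: y_train holds the one-hot labels, so the title below shows the true label.
# To show the model's prediction, use the argmax of the trained network's output instead.
for i, ax in enumerate(axes.flat):
    ax.imshow(x_train[i])
    ax.set_title("Model prediction is: {}".format(torch.argmax(y_train[i]).item()))
fig.tight_layout()
plt.show()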

### Now on test data:

In [None]:
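fig, axes = plt.subplots(10, 4, figsize=(20, 30))
# NOTE: the title uses the label stored in y_train; swap in the argmax of the
# trained network's output to display its prediction.
for k, ax in enumerate(axes.flat):
    i = 59000 + k  # samples 59000 to 59039
    ax.imshow(x_train[i])
    ax.set_title("Model prediction is: {}".format(torch.argmax(y_train[i]).item()))
fig.tight_layout()
plt.show()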