<a href="https://colab.research.google.com/github/patrickmsshin/PyTorch/blob/main/PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Curriculum
[Reference1]
https://www.youtube.com/watch?time_continue=1&v=GIsg-ZUy0MY&feature=emb_logo

[Reference2]
https://www.youtube.com/playlist?list=PLWKjhJtqVAbm5dir5TLEy2aZQMG7cHEZp

1. PyTorch Basics: Tensor & Gradients
2. Linear Regression & Gradient Descent
3. Image Classification using Logistic Regression
4. Training Deep Neural Network on a GPU
5. CNN, Regularization and ResNets
6. Generative Adversarial Networks (GAN)

# 1.PyTorch Basics

##1.1.Tensors
* Tensors: PyTorch is a library for processing tensors. A tensor is a number, vector, matrix or any n-dimensional array.

In [None]:
import torch

In [None]:
# Create a tensor with a single number
t1 = torch.tensor(4.)  # 4. = 4.0
t1

tensor(4.)

In [None]:
t1.dtype

torch.float32

In [None]:
# Create a slightly more complex tensor
# Vector 
t2 = torch.tensor([1.,2,3,4]) # size = [4]
t2

tensor([1., 2., 3., 4.])

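If any value in a tensor is a float, all the other values are converted to float as well, so the whole tensor shares one dtype.

For example, in `[1., 2, 3, 4]` the literal `1.` is a float and `2, 3, 4` are ints, yet every element is stored as float32.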
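
This rule can be checked directly. A minimal sketch (the names `ints` and `mixed` are only illustrative, and the default float dtype is assumed to be unchanged): a list of Python ints yields an integer tensor, while a single float in the list promotes every element.

```python
import torch

# All Python ints: PyTorch infers an integer dtype
ints = torch.tensor([1, 2, 3, 4])

# One float among ints: every element is stored as a float
mixed = torch.tensor([1., 2, 3, 4])

print(ints.dtype)   # torch.int64
print(mixed.dtype)  # torch.float32
```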

In [None]:
# Matrix
t3 = torch.tensor([[5., 6.],
                   [7., 8.],
                   [9, 10]])  # shape -> size([3, 2])
t3

tensor([[ 5.,  6.],
        [ 7.,  8.],
        [ 9., 10.]])

In [None]:
# 3-dimensional array
t4 = torch.tensor([[[1.,2,3],
                   [5, 6,4]],
                   [[8.,9,4],
                   [12, 13,4]],
                   [[8.,9,4],
                   [12, 13,4]],
                   [[8.,9,4],
                   [12, 13,4]]])  # shape -> size([4, 2, 3]) depths, rows, columns
t4.shape

torch.Size([4, 2, 3])

In [None]:
t4 = torch.tensor([[[1., 2., 5, 6],
                    [3, 4., 7, 8],
                    [5, 6, 9, 10]],
                   [[1., 2., 3, 9],
                    [3, 4, 12, 11],
                    [5, 6, 7, 9.]]])

t4.shape  # size([2,3,4])

torch.Size([2, 3, 4])

In [None]:
# A tensor must have a regular (rectangular) shape
# t5 raises an error because its last row has a different length from the others
#t5 = torch.tensor([[5.,6.], [7.,8.], [9., 10., 11.]])
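
To see the failure without stopping the notebook, the commented-out `t5` line can be wrapped in a `try` block. This is a small sketch; the names `ragged` and `created` are made up for the example.

```python
import torch

# Rows of length 2, 2 and 3 do not form a rectangular matrix
ragged = [[5., 6.], [7., 8.], [9., 10., 11.]]

try:
    t5 = torch.tensor(ragged)
    created = True
except ValueError as err:
    # PyTorch refuses to build a tensor from rows of unequal length
    created = False
    print("Could not create tensor:", err)
```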

##1.2.Tensor operations and gradients
- We can combine tensors with the usual arithmetic operations.

In [None]:
# Create tensors
x = torch.tensor(3.)                      # we are not interested in the derivative w.r.t. x
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

x, w, b

(tensor(3.), tensor(4., requires_grad=True), tensor(5., requires_grad=True))

In [None]:
# Arithmetic operations
y = w * x + b
y

tensor(17., grad_fn=<AddBackward0>)

- What makes PyTorch special is that we can automatically compute the derivative of 'y' w.r.t. the tensors that have "requires_grad" set to "True".
- To compute the derivatives, we can call the ".backward" method on our output function.

In [None]:
# Compute Derivatives
# Differentiate with respect to every input tensor created with requires_grad=True.
# Here that means w and b, each differentiated separately.
# Tracking gradients for every input becomes wasteful as the number of variables grows.
y.backward()

* The derivatives of 'y' w.r.t. the input tensors that were created with requires_grad=True are stored in the '.grad' property of the respective tensors.

In [None]:
print('dy/dx', x.grad)  # None: requires_grad was not set for x, so no derivative is computed
print('dy/dw', w.grad)  # derivative of y = w*3 + 5 w.r.t. w
print('dy/db', b.grad)  # derivative of y = 4*3 + b w.r.t. b

dy/dx None
dy/dw tensor(3.)
dy/db tensor(1.)
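
For comparison, here is a short sketch of the same computation with `requires_grad=True` also set on `x`, so that all three derivatives are available. For y = w*x + b the expected values follow from calculus: dy/dx = w, dy/dw = x and dy/db = 1.

```python
import torch

x = torch.tensor(3., requires_grad=True)  # this time x is tracked as well
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

y = w * x + b
y.backward()

# For y = w*x + b: dy/dx = w = 4, dy/dw = x = 3, dy/db = 1
print(x.grad, w.grad, b.grad)
```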


##1.3.Tensor functions
- Apart from arithmetic operations, the torch module also contains many functions for creating and manipulating tensors. Let's look at some examples.

In [None]:
# Create a tensor with a fixed value for every element
t6 = torch.full((3,2), 42)
t6

tensor([[42, 42],
        [42, 42],
        [42, 42]])

In [None]:
# Concatenate two tensors with compatible shapes
t7 = torch.cat((t3, t6)) # 3x2 , 3x2  -> size([6,2])
t7

tensor([[ 5.,  6.],
        [ 7.,  8.],
        [ 9., 10.],
        [42., 42.],
        [42., 42.],
        [42., 42.]])

In [None]:
# Compute the sin of each element
t8 = torch.sin(t7)
t8

tensor([[-0.9589, -0.2794],
        [ 0.6570,  0.9894],
        [ 0.4121, -0.5440],
        [-0.9165, -0.9165],
        [-0.9165, -0.9165],
        [-0.9165, -0.9165]])

In [None]:
# Change the shape of a tensor
t9 = t8.reshape(3, 2, 2)   # t8 -> size 6x2 = 12 elements
t9

tensor([[[-0.9589, -0.2794],
         [ 0.6570,  0.9894]],

        [[ 0.4121, -0.5440],
         [-0.9165, -0.9165]],

        [[-0.9165, -0.9165],
         [-0.9165, -0.9165]]])

You can learn more about tensor operations here: https://pytorch.org/docs/stable/torch.html

##1.4.Interoperability with Numpy
Numpy is a popular open-source library used for mathematical and scientific computing in Python. It enables efficient operations on large multi-dimensional arrays and has a vast ecosystem of supporting libraries, including
- Pandas for file I/O and data analysis
- Matplotlib for plotting and visualization
- OpenCV for image and video processing


In [None]:
# Create an array in Numpy
import numpy as np

x = np.array([[1, 2], [3,4.]])
x

array([[1., 2.],
       [3., 4.]])

We can convert a Numpy array to PyTorch tensor using torch.from_numpy

In [None]:
import torch

y = torch.from_numpy(x)
y

tensor([[1., 2.],
        [3., 4.]], dtype=torch.float64)

The numpy array and torch tensor have the same data type.

In [None]:
x.dtype, y.dtype

(dtype('float64'), torch.float64)

We can convert PyTorch tensor to a Numpy array using the .numpy method of a tensor.

In [None]:
z = y.numpy()
z

array([[1., 2.],
       [3., 4.]])

The interoperability between PyTorch and Numpy is essential because most datasets you'll work with will likely be read and preprocessed as Numpy arrays.
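
One detail worth knowing when converting: `torch.from_numpy` does not copy the data, so the tensor and the array share the same memory, whereas `torch.tensor` makes a copy. The sketch below illustrates the difference; the names `arr`, `t` and `independent` are chosen only for this example.

```python
import numpy as np
import torch

arr = np.array([[1., 2.], [3., 4.]])
t = torch.from_numpy(arr)        # no copy: the tensor reuses arr's memory

arr[0, 0] = 100.                 # modify the NumPy array in place
print(t[0, 0])                   # the tensor sees the change

independent = torch.tensor(arr)  # torch.tensor() copies the data instead
arr[0, 0] = -1.
print(independent[0, 0])         # the copy keeps the old value
```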

You might wonder why we need a library like PyTorch at all since Numpy already provides data structures and utilities for working with multi-dimensional numeric data. There are two main reasons:

1. AutoGrad: The ability to automatically compute gradients for tensor operations is essential for training deep learning models.
2. GPU support: While working with massive datasets and large models, PyTorch tensor operations can be performed efficiently using Graphics Processing Units (GPUs). Computations that might typically take hours can be completed within minutes using GPUs.
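
As a rough sketch of the GPU point, a tensor can be moved to a GPU with `.to(device)`. The code below falls back to the CPU when no CUDA device is available, so it runs on any machine; the names `a`, `b` and `c` are illustrative.

```python
import torch

# Use the GPU when one is visible to PyTorch, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.randn(3, 3).to(device)
b = torch.randn(3, 3).to(device)
c = a @ b   # runs on whichever device holds a and b

print(c.device)
```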

#2.Gradient Descent and Linear Regression with PyTorch

##2.1.Introduction to Linear Regression
In this tutorial, we'll discuss one of the foundational algorithms in machine learning: Linear regression. We'll create a model that predicts crop yields for apples and oranges (target variables) by looking at the average temperature, rainfall, and humidity (input variables or features) in a region. Here's the training data:


<p align="center"><img src="https://drive.google.com/uc?export=view&id=1E_YIReAANKaFNbSiiQFQwuZyYiFpJi7I" width="60%">

In a linear regression model, each target variable is estimated to be a weighted sum of the input variables, offset by some constant, known as a bias :

```
yield_apple  = w11 * temp + w12 * rainfall + w13 * humidity + b1
yield_orange = w21 * temp + w22 * rainfall + w23 * humidity + b2
```



Visually, it means that the yield of apples is a linear or planar function of temperature, rainfall and humidity:

The learning part of linear regression is to figure out a set of weights w11, w12,... w23, b1 & b2 using the training data, to make accurate predictions for new data. The learned weights will be used to predict the yields for apples and oranges in a new region using the average temperature, rainfall, and humidity for that region.

We'll train our model by adjusting the weights slightly many times to make better predictions, using an optimization technique called `gradient descent`. Let's begin by importing Numpy and PyTorch.

In [None]:
import numpy as np
import torch

##2.2.Training data

We can represent the training data using two matrices: `inputs` and `targets`, each with one row per observation, and one column per variable.

In [None]:
# Input (temp, rainfall, humidity)
inputs = np.array([[73, 67, 43], 
                   [91, 88, 64], 
                   [87, 134, 58], 
                   [102, 43, 37], 
                   [69, 96, 70]], dtype='float32')

In [None]:
# Targets (apples, oranges)
targets = np.array([[56, 70], 
                    [81, 101], 
                    [119, 133], 
                    [22, 37], 
                    [103, 119]], dtype='float32')

We've separated the input and target variables because we'll operate on them separately. Also, we've created numpy arrays, because this is typically how you would work with training data: 
1. Read some CSV files as numpy arrays.
2. Do some processing.
3. And then convert them to PyTorch tensors.

Let's convert the arrays to PyTorch tensors.

In [None]:
# Convert Numpy array to PyTorch tensors
inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)

In [None]:
print(inputs)
print(targets)

##2.3.Linear regression model from scratch
The weights and biases (`w11, w12,... w23, b1 & b2`) can also be represented as matrices, initialized as random values. The first row of `w` and the first element of `b` are used to predict the first target variable, i.e., yield of apples, and similarly, the second for oranges.

In [None]:
# Define initial weight and bias
w = torch.randn(2,3, requires_grad=True)
b = torch.randn(1,2, requires_grad=True)
print(w)
print(b)

`torch.randn` creates a tensor with the given shape, with elements picked randomly from a normal distribution with mean 0 and standard deviation 1.

Our model is simply a function that performs a matrix multiplication of the `inputs` and the weights `w` (transposed) and adds the bias `b` (replicated for each observation).



```
Y = X @ W.t() + B
```



We can define the model as follows:

In [None]:
# Linear regression model
def model(x):
  return x @ w.t() + b

`@` represents matrix multiplication in PyTorch, and the `.t` method returns the transpose of a tensor.
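
To make the shapes concrete, here is a small worked sketch with fixed weights instead of random ones. The weight and bias values are hypothetical, picked only so that the arithmetic is easy to follow by hand.

```python
import torch

# One observation: temp=73, rainfall=67, humidity=43
x = torch.tensor([[73., 67., 43.]])            # shape (1, 3)

# Hypothetical weights and biases, chosen only for illustration
w = torch.tensor([[0.5, 0.25, 0.0],            # row used for apples
                  [1.0, 0.0, 0.5]])            # row used for oranges, shape (2, 3)
b = torch.tensor([[1., 2.]])                   # shape (1, 2)

preds = x @ w.t() + b                          # (1, 3) @ (3, 2) + (1, 2) -> (1, 2)

# The same numbers as the weighted sums written out by hand
apple = 0.5 * 73 + 0.25 * 67 + 0.0 * 43 + 1    # 54.25
orange = 1.0 * 73 + 0.0 * 67 + 0.5 * 43 + 2    # 96.5
print(preds, apple, orange)
```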

The matrix obtained by passing the input data into the model is a set of predictions for the target variables.

In [None]:
# Generate predictions
preds= model(inputs)
print(preds)

Let's compare the predictions of our model with the actual targets.

In [None]:
print(targets)

You can see a big difference between our model's predictions and the actual targets because we've initialized our model with random weights and biases. Obviously, we can't expect a randomly initialized model to just work.

##2.4.Loss Function
[Reference]
https://www.youtube.com/watch?v=TxIVr-nk1so&list=PLlMkM4tgfjnLSOjrEJN31gZATbcj_MpUm&index=6

- Before we improve our model, we need a way to evaluate how well our model is performing. We can compare the model's predictions with the actual targets, using the following method:

  - Calculate the difference between the two matrices (`preds` and `targets`).
  - Square all elements of the difference matrix to remove negative values.
  - Calculate the average of the elements in the resulting matrix.

- The result is a single number, known as the mean squared error (MSE).


<p align="center"><img src="https://drive.google.com/uc?export=view&id=1qJf62D5EGVgJi3xD0JGzk433CZz9A1Ge" width="40%">



In [None]:
# Define Loss Function: MSE
def mse(t1, t2):
  diff = t1 - t2
  return torch.sum(diff*diff) / diff.numel()

`torch.sum` returns the sum of all the elements in a tensor. The `.numel` method of a tensor returns the number of elements in a tensor. Let's compute the mean squared error for the current predictions of our model.

In [None]:
# Compute the mse
loss = mse(preds, targets)
print(loss)

Here's how we can interpret the result: on average, each element in the prediction differs from the actual target by roughly the square root of the loss. That's pretty bad, considering the numbers we are trying to predict are themselves in the range 22 to 133. The result is called the loss because it indicates how bad the model is at predicting the target variables. It represents information loss in the model: `the lower the loss, the better the model`.
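
A tiny worked example may help with this interpretation. The tensors below are made up so that the result can be checked by hand, and the hand-written `mse` is compared against PyTorch's built-in `F.mse_loss`.

```python
import torch
import torch.nn.functional as F

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

preds = torch.tensor([[1., 2.], [3., 4.]])
targets = torch.tensor([[1., 1.], [1., 1.]])

# Differences: 0, 1, 2, 3 -> squares: 0, 1, 4, 9 -> mean: 14 / 4 = 3.5
loss = mse(preds, targets)
builtin = F.mse_loss(preds, targets)   # PyTorch's built-in equivalent

print(loss.item(), builtin.item())     # 3.5 3.5
print(torch.sqrt(loss).item())         # typical size of an error, about 1.87
```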


##2.5.Compute Gradients
- With PyTorch, we can automatically compute the gradient or derivative of the loss w.r.t. the weights and biases, because they have requires_grad set to True.

* `.backward()` differentiates the loss with respect to every tensor that has requires_grad set (e.g. w, b) and evaluates each derivative at the current values of w and b.


[Explanation of the computation]

https://www.youtube.com/watch?v=ma2KXWblllc&list=PLlMkM4tgfjnJ3I-dbhO9JTw7gNty6o_2m&index=4

In [None]:
# Compute Gradients
loss.backward()

The gradients are stored in the '.grad' property of the respective tensors. Note that the derivative of the loss w.r.t. the weights matrix is itself a matrix, with the same dimensions.


In [None]:
print(w)
print(w.grad)

In [None]:
print(b)
print(b.grad)

##2.6.Adjust weights and biases to reduce the loss

[Reference: Sung Kim]
https://www.youtube.com/watch?v=b4Vyma9wPHo&list=PLlMkM4tgfjnJ3I-dbhO9JTw7gNty6o_2m&index=3

[Other Ref.]https://computer-nerd.tistory.com/5

The loss is a quadratic function of our weights and biases, and our objective is to find the set of weights where the loss is the lowest. If we plot a graph of the loss w.r.t any individual weight or bias element, it will look like the figure shown below. An important insight from calculus is that the gradient indicates the rate of change of the loss, i.e., the loss function's slope w.r.t. the weights and biases.

If a gradient element is `positive`:

* `increasing` the weight element's value slightly will `increase` the loss
* `decreasing` the weight element's value slightly will `decrease` the loss

<p align="center"><img src="https://drive.google.com/uc?export=view&id=1lpYTykep-CW0cx13O062Mw5hikdl8m3B" width="50%">


If a gradient element is `negative`:
* `increasing` the weight element's value slightly will `decrease` the loss
* `decreasing` the weight element's value slightly will `increase` the loss

<p align="center"><img src="https://drive.google.com/uc?export=view&id=1Eq2LWUxbjzna9ocyFle1mBlYWuA09dh6" width="50%">

The increase or decrease in the loss by changing a weight element is proportional to the gradient of the loss w.r.t. that element. This observation forms the basis of the gradient descent optimization algorithm that we'll use to improve our model (by descending along the gradient).

We can subtract from each weight element a small quantity proportional to the derivative of the loss w.r.t. that element to reduce the loss slightly.
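
The idea is easiest to see with a single weight. The sketch below uses a toy loss, (w - 3)^2, rather than the crop-yield model; the starting value, the step size of 0.1 and the 50 steps are arbitrary choices made for illustration.

```python
import torch

# Toy loss with a single weight: loss(w) = (w - 3)^2, minimised at w = 3
w = torch.tensor(0., requires_grad=True)
lr = 0.1   # step size, picked for this toy problem

for step in range(50):
    loss = (w - 3) ** 2
    loss.backward()              # d(loss)/dw = 2 * (w - 3)
    with torch.no_grad():
        w -= lr * w.grad         # step against the gradient
        w.grad.zero_()           # clear the gradient for the next step

print(w.item())                  # close to 3
```

While `w` is below 3 the gradient is negative, so subtracting it increases `w`; above 3 the opposite happens. Either way the weight moves toward the minimum.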

In [None]:
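# 1. Prediction
preds = model(inputs)
print(preds)

# 2. Calculate the loss
loss = mse(preds, targets)
print(loss)

# 3. Compute gradients w.r.t the weights and biases
# Gradients accumulate across backward() calls, so clear any left over from an earlier call
if w.grad is not None:
    w.grad.zero_()
    b.grad.zero_()
loss.backward()
print(w.grad)
print(b.grad)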


Finally, we update the weights and biases using the gradients computed above.

In [None]:
# 4. Adjust the weights by subtracting a small quantity proportional to the gradient
# 5. Reset the gradients to zero

# Adjust weights & reset gradients
with torch.no_grad():
    # Step 4. 
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5        
    # Step 5.
    w.grad.zero_()
    b.grad.zero_()

A few things to note above:

- We use torch.no_grad to indicate to PyTorch that we shouldn't track, calculate or modify gradients while updating the weights and biases.

- We multiply the gradients with a really small number (10^-5 in this case), to ensure that we don't modify the weights by a really large amount, since we only want to take a small step in the downhill direction of the gradient. This number is called the learning rate of the algorithm.

- After we have updated the weights, we reset the gradients back to zero, to avoid affecting any future computations.
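
The reason for the reset is that PyTorch adds each newly computed gradient to whatever is already stored in `.grad`. A minimal sketch with a single made-up weight shows the effect:

```python
import torch

w = torch.tensor(2., requires_grad=True)

(3 * w).backward()
first = w.grad.item()        # d(3w)/dw = 3

(3 * w).backward()           # no reset in between
accumulated = w.grad.item()  # 3 + 3 = 6: the new gradient was added

w.grad.zero_()               # reset, as in step 5
(3 * w).backward()
after_reset = w.grad.item()  # back to 3

print(first, accumulated, after_reset)
```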

Let's take a look at the new weights and biases.

In [None]:
# New Weight
print(w)
print(b)

In [None]:
# Calculate loss
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

##2.7.Train the model using gradient descent
As seen above, we reduce the loss and improve our model using the gradient descent optimization algorithm, which has the following steps:

1. Generate predictions
2. Calculate the loss
3. Compute gradients w.r.t the weights and biases
4. Adjust the weights by subtracting a small quantity proportional to the gradient
5. Reset the gradients to zero

Let's implement the above step by step.
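
The five steps can be put together in a single loop. The following is a minimal sketch that repeats the data and helper definitions from the earlier sections so that it runs on its own; the random seed and the choice of 100 iterations are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the run is repeatable

inputs = torch.tensor([[73., 67., 43.], [91., 88., 64.], [87., 134., 58.],
                       [102., 43., 37.], [69., 96., 70.]])
targets = torch.tensor([[56., 70.], [81., 101.], [119., 133.],
                        [22., 37.], [103., 119.]])

w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(1, 2, requires_grad=True)

def model(x):
    return x @ w.t() + b

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

initial_loss = mse(model(inputs), targets).item()

for epoch in range(100):
    preds = model(inputs)            # 1. generate predictions
    loss = mse(preds, targets)       # 2. calculate the loss
    loss.backward()                  # 3. compute gradients
    with torch.no_grad():
        w -= w.grad * 1e-5           # 4. adjust weights and biases
        b -= b.grad * 1e-5
        w.grad.zero_()               # 5. reset gradients
        b.grad.zero_()

final_loss = mse(model(inputs), targets).item()
print(initial_loss, final_loss)
```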

##2.8.Train for multiple epochs
To reduce the loss further, we can repeat the process of adjusting the weights and biases using the gradients multiple times. Each iteration is called an epoch. Let's train the model for up to 10,000 epochs, stopping early once the loss drops below 100.

In [None]:
# Train the model for up to 10,000 epochs
for i in range(10000):
  preds = model(inputs)
  loss = mse(preds, targets)
  # Step 3. Compute gradients
  loss.backward()
  with torch.no_grad():
    # Step 4. 
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5        
    # Step 5.
    w.grad.zero_()
    b.grad.zero_()
  if i%1000 == 0:
    print("Epoch:", i, "Loss:", loss.item())
  # Stop early once the loss is small enough
  if loss < 100: break

In [None]:
# Calculate loss
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

In [None]:
print(preds)
print(targets)

##2.8.Linear regression using PyTorch built-ins

The model and training process above were implemented using basic matrix operations. But since this is such a common pattern, PyTorch has several built-in functions and classes to make it easy to create and train models.

Let's begin by importing the torch.nn package from PyTorch, which contains utility classes for building neural networks.

In [None]:
import numpy as np
import torch
import torch.nn as nn

In [None]:
# Input (temp, rainfall, humidity)
inputs = np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58], 
                   [102, 43, 37], [69, 96, 70], [73, 67, 43], 
                   [91, 88, 64], [87, 134, 58], [102, 43, 37], 
                   [69, 96, 70], [73, 67, 43], [91, 88, 64], 
                   [87, 134, 58], [102, 43, 37], [69, 96, 70]], 
                  dtype='float32')

# Targets (apples, oranges)
targets = np.array([[56, 70], [81, 101], [119, 133], 
                    [22, 37], [103, 119], [56, 70], 
                    [81, 101], [119, 133], [22, 37], 
                    [103, 119], [56, 70], [81, 101], 
                    [119, 133], [22, 37], [103, 119]], 
                   dtype='float32')

inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)

##2.9.Dataset and DataLoader
We'll create a TensorDataset, which allows access to rows from inputs and targets as tuples, and provides standard APIs for working with many different types of datasets in PyTorch.

In [None]:
from torch.utils.data import TensorDataset

In [None]:
# Define dataset
train_ds = TensorDataset(inputs, targets)
train_ds[0:3]

In [None]:
# Define DataLoader
from torch.utils.data import DataLoader

batch_size = 5
train_dl = DataLoader(train_ds, batch_size, shuffle=True)

The data loader is typically used in a for-in loop. Let's look at an example.

In [None]:
for xb, yb in train_dl:
    print(xb)
    print(yb)
    break

In each iteration, the data loader returns one batch of data, with the given batch size. If shuffle is set to True, it shuffles the training data before creating batches. Shuffling helps randomize the input to the optimization algorithm, which can lead to faster reduction in the loss.
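
When the number of rows is not a multiple of the batch size, the last batch is simply smaller. The sketch below uses a made-up dataset of 10 rows (the names `features`, `labels`, `ds` and `dl` are illustrative) to show how the rows are divided.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Ten toy rows: features are (10, 3), labels are (10, 1)
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10, dtype=torch.float32).reshape(10, 1)

ds = TensorDataset(features, labels)
dl = DataLoader(ds, batch_size=4, shuffle=True)

# 10 rows with batch_size=4 -> batches of 4, 4 and 2 rows
batch_sizes = [xb.shape[0] for xb, yb in dl]
print(batch_sizes)
```

Shuffling changes which rows land in which batch, but not the batch sizes.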

##2.10.nn.Linear
Instead of initializing the weights & biases manually, we can define the model using the nn.Linear class from PyTorch, which does it automatically.

In [None]:
model = nn.Linear(3,2)
print(model.weight)
print(model.bias)

We can use the model to generate predictions in the exact same way as before:

In [None]:
# Generate predictions
preds = model(inputs)
print(preds)
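
To connect `nn.Linear` with the hand-written model, the sketch below checks that a linear layer computes the same `x @ w.t() + b` formula. The names `layer`, `manual` and `same` are illustrative, and the input is random.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)          # 3 input features, 2 outputs
x = torch.randn(5, 3)            # 5 observations

# weight has shape (out_features, in_features); bias has shape (out_features,)
print(layer.weight.shape, layer.bias.shape)

manual = x @ layer.weight.t() + layer.bias   # same formula as the hand-written model
same = torch.allclose(layer(x), manual, atol=1e-6)
print(same)
```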

##2.11.Loss Function
Instead of defining a loss function manually, we can use the built-in loss function mse_loss.

In [None]:
# Import nn.functional
import torch.nn.functional as F

The nn.functional package contains many useful loss functions and several other utilities.

In [None]:
# Define Loss Function
loss_fn = F.mse_loss

In [None]:
# Compute the loss for the current predictions of our model.
loss = loss_fn(preds, targets)
print(loss)

##2.12.Optimizer
Instead of manually manipulating the model's weights & biases using gradients, we can use the optimizer optim.SGD. SGD stands for stochastic gradient descent. It is called stochastic because samples are selected in batches (often with random shuffling) instead of as a single group.

In [None]:
# Define Optimizer
opt = torch.optim.SGD(model.parameters(), lr=1e-5)

Note that model.parameters() is passed as an argument to optim.SGD, so that the optimizer knows which matrices should be modified during the update step. Also, we can specify a learning rate which controls the amount by which the parameters are modified.
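
As a sanity check on what the optimizer does, the sketch below compares one `opt.step()` against the manual update `parameter - lr * gradient`. It assumes plain SGD with no momentum or weight decay (the defaults); the data, the seed and the learning rate of 1e-2 are made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed
layer = nn.Linear(3, 2)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)

x = torch.randn(4, 3)
y = torch.randn(4, 2)

loss = ((layer(x) - y) ** 2).mean()
loss.backward()

# What plain SGD should produce: parameter - lr * gradient
expected = layer.weight.detach() - 1e-2 * layer.weight.grad

opt.step()        # update every parameter the optimizer was given
opt.zero_grad()   # clear gradients before the next batch

matches = torch.allclose(layer.weight.detach(), expected)
print(matches)
```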

##2.13.Train the model
We are now ready to train the model. We'll follow the exact same process to implement gradient descent:

1. Generate predictions
2. Calculate the loss
3. Compute gradients w.r.t the weights and biases
4. Adjust the weights by subtracting a small quantity proportional to the gradient
5. Reset the gradients to zero

The only change is that we'll work with batches of data, instead of processing the entire training data in every iteration. Let's define a utility function fit which trains the model for a given number of epochs.

In [None]:
# Utility function to train the model
def fit(num_epochs, model, loss_fn, opt, train_dl):

  # Repeat for given number of epoch
  for epoch in range(num_epochs):

    # Train with batches of data:
    for xb, yb in train_dl:

      # 1. Generate predictions
      preds = model(xb)

      # 2. Calculate the loss
      loss = loss_fn(preds, yb)
      
      # 3. Compute gradients w.r.t the weights and biases
      loss.backward()

      # 4. Adjust the weights by subtracting a small quantity proportional to the gradient
      opt.step()

      # 5. Reset the gradients to zero
      opt.zero_grad()

    # Print parameters
    if (epoch+1) % 10 == 0:
      print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, loss.item()))

Some things to note above:

- We use the data loader defined earlier to get batches of data for every iteration.
- Instead of updating parameters (weights and biases) manually, we use opt.step to perform the update, and opt.zero_grad to reset the gradients to zero.
- We've also added a log statement which prints the loss from the last batch of data for every 10th epoch, to track the progress of training. loss.item returns the actual value stored in the loss tensor.

Let's train the model for 100 epochs.

In [None]:
# Generate a trained model(fit)
fit(100, model, loss_fn, opt, train_dl)

In [None]:
# Generate Prediction
final_pred = model(inputs)
print(final_pred)
print(targets)

#3.Image Classification using Logistic Regression in PyTorch

### 3.1.Exploring the Data
We begin by importing torch and torchvision. torchvision contains some utilities for working with image data. It also contains helper classes to automatically download and import popular datasets like MNIST.

In [None]:
# Import library
import torch
import torchvision
from torchvision.datasets import MNIST



In [None]:
# Download dataset
dataset = MNIST(root='MNIST/', download=True)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
len(dataset)

The dataset has 60,000 images which can be used to train the model. There is also an additional test set of 10,000 images which can be created by passing train=False to the MNIST class.

In [None]:
test_dataset = MNIST(root='MNIST/', train=False)
print(len(test_dataset))

Each element of the dataset is a pair, consisting of a 28x28 image and a label. The image is an object of the class PIL.Image.Image, which is a part of the Python imaging library Pillow. We can view the image within Jupyter using matplotlib, the de-facto plotting and graphing library for data science in Python.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

Along with importing matplotlib, a special statement %matplotlib inline is added to indicate to Jupyter that we want to plot the graphs within the notebook. Without this line, Jupyter will show the image in a popup. Statements starting with % are called IPython magic commands, and are used to configure the behavior of Jupyter itself. You can find a full list of magic commands here: https://ipython.readthedocs.io/en/stable/interactive/magics.html .

Let's look at a couple of images from the dataset.

In [None]:
image, label = dataset[2]
plt.imshow(image, cmap='gray')
print('Label: ',label)

It's evident that these images are quite small in size, and recognizing the digits can sometimes be hard even for the human eye. While it's useful to look at these images, there's just one problem here: PyTorch doesn't know how to work with images. We need to convert the images into tensors. We can do this by specifying a transform while creating our dataset.

In [None]:
import torchvision.transforms as transforms

PyTorch datasets allow us to specify one or more transformation functions which are applied to the images as they are loaded. torchvision.transforms contains many such predefined functions, and we'll use the ToTensor transform to convert images into PyTorch tensors.

In [None]:
# download MNIST data in tensor format
dataset = MNIST(root='MNIST/', 
                train=True, 
                transform=transforms.ToTensor())

In [None]:
img_tensor, label = dataset[0]
print(img_tensor.shape, label)

The image is now converted to a 1x28x28 tensor. The first dimension is used to keep track of the color channels. Since images in the MNIST dataset are grayscale, there's just one channel. Other datasets have images with color, in which case there are 3 channels: red, green and blue (RGB). Let's look at some sample values inside the tensor:

In [None]:
print(img_tensor[0,10:15,10:15])
print(torch.max(img_tensor), torch.min(img_tensor))

The values range from 0 to 1, with 0 representing black, 1 white and the values in between different shades of grey. We can also plot the tensor as an image using plt.imshow.

In [None]:
# Plot a 5x5 patch of the image (rows and columns 10 to 14 of the 28 x 28 matrix)
plt.imshow(img_tensor[0,10:15,10:15], cmap='gray')

### 3.2.Training and Validation Datasets
While building real world machine learning models, it is quite common to split the dataset into 3 parts:

1. Training set - used to train the model i.e. compute the loss and adjust the weights of the model using gradient descent.
2. Validation set - used to evaluate the model while training, adjust hyperparameters (learning rate etc.) and pick the best version of the model.
3. Test set - used to compare different models, or different types of modeling approaches, and report the final accuracy of the model.

In the MNIST dataset, there are 60,000 training images, and 10,000 test images. The test set is standardized so that different researchers can report the results of their models against the same set of images.

Since there's no predefined validation set, we must manually split the 60,000 images into training and validation datasets. Let's set aside 10,000 randomly chosen images for validation. We can do this using the random_split function from PyTorch.

In [None]:
from torch.utils.data import random_split

trn_ds, val_ds = random_split(dataset, [50000, 10000])
len(trn_ds), len(val_ds)
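
The split is random, so it differs from run to run unless a seeded generator is passed. The sketch below shows this on a made-up dataset of 10 samples; the seed 42 and the names `gen`, `train_part` and `val_part` are arbitrary.

```python
import torch
from torch.utils.data import TensorDataset, random_split

ds = TensorDataset(torch.arange(10).reshape(10, 1))   # 10 toy samples

# A seeded generator (optional) makes the split repeatable
gen = torch.Generator().manual_seed(42)
train_part, val_part = random_split(ds, [8, 2], generator=gen)

# Each part is a Subset holding indices into the original dataset
overlap = set(train_part.indices) & set(val_part.indices)
print(len(train_part), len(val_part), overlap)
```

The empty overlap confirms that no sample ends up in both the training and validation parts.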

We can now create data loaders to help us load the data in batches. We'll use a batch size of 128.

In [None]:
from torch.utils.data import DataLoader

batch_size = 128

trn_dl = DataLoader(trn_ds, batch_size, shuffle=True)
val_dl = DataLoader(val_ds, batch_size)

We set shuffle=True for the training dataloader, so that the batches generated in each epoch are different, and this randomization helps generalize & speed up the training process. On the other hand, since the validation dataloader is used only for evaluating the model, there is no need to shuffle the images.

### 3.3.Model
Now that we have prepared our data loaders, we can define our model.

- A logistic regression model is almost identical to a linear regression model i.e. there are weights and bias matrices, and the output is obtained using simple matrix operations (pred = x @ w.t() + b).

- Just as we did with linear regression, we can use nn.Linear to create the model instead of defining and initializing the matrices manually.

- Since nn.Linear expects each training example to be a vector, each 1x28x28 image tensor needs to be flattened out into a vector of size 784 (28*28), before being passed into the model.

- The output for each image is a vector of size 10, with each element of the vector signifying the probability of a particular target label (i.e. 0 to 9). The predicted label for an image is simply the one with the highest probability.

In [None]:
import torch.nn as nn

input_size = 28*28
num_classes = 10

# Logistic Regression Model
model = nn.Linear(input_size, num_classes)

Of course, this model is a lot larger than our previous model, in terms of the number of parameters. Let's take a look at the weights and biases.

In [None]:
print(model.weight.shape)
model.weight

In [None]:
print(model.bias.shape)
model.bias

Although there are a total of 7850 parameters here, conceptually nothing has changed so far. Let's try and generate some outputs using our model. We'll take the first batch of 128 images from our dataset, and pass them into our model.

In [None]:
for images, labels in trn_dl: # images = feature values of each image
  print(labels)
  print(images.shape)
  outputs = model(images)
  break

This leads to an error, because our input data does not have the right shape. Our images are of the shape 1x28x28, but we need them to be vectors of size 784 i.e. we need to flatten them out. We'll use the .reshape method of a tensor, which will allow us to efficiently 'view' each image as a flat vector, without really changing the underlying data.

To include this additional functionality within our model, we need to define a custom model, by extending the nn.Module class from PyTorch.

In [None]:
class MnistModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.linear = nn.Linear(input_size, num_classes)

  def forward(self, xb):
    xb = xb.reshape(-1, 784)
    out = self.linear(xb)
    return out
  
model = MnistModel()

Inside the __init__ constructor method, we instantiate the weights and biases using nn.Linear. And inside the forward method, which is invoked when we pass a batch of inputs to the model, we flatten out the input tensor, and then pass it into self.linear.

xb.reshape(-1, 28*28) indicates to PyTorch that we want a view of the xb tensor with two dimensions, where the length along the 2nd dimension is 28*28 (i.e. 784). One argument to .reshape can be set to -1 (in this case the first dimension), to let PyTorch figure it out automatically based on the shape of the original tensor.

Note that the model no longer has .weight and .bias attributes (as they are now inside the .linear attribute), but it does have a .parameters method which returns an iterator over the weights and bias (wrapped in list() below to display them), and can be used by a PyTorch optimizer.

In [None]:
# print(model.linear.weight.shape, model.linear.bias.shape)
list(model.parameters())
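
The reshape and the parameter count can be checked without downloading MNIST. The sketch below redefines the model so that it runs on its own and feeds it a random stand-in batch (`fake_batch` is a made-up name, not real image data).

```python
import torch
import torch.nn as nn

class MnistModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(28 * 28, 10)

    def forward(self, xb):
        xb = xb.reshape(-1, 28 * 28)   # (batch, 1, 28, 28) -> (batch, 784)
        return self.linear(xb)

model = MnistModel()
fake_batch = torch.rand(4, 1, 28, 28)  # random stand-in for 4 MNIST images

outputs = model(fake_batch)
n_params = sum(p.numel() for p in model.parameters())

print(outputs.shape)   # torch.Size([4, 10])
print(n_params)        # 784 * 10 weights + 10 biases = 7850
```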

Our new custom model can be used in the exact same way as before. Let's see if it works.

In [None]:
for images, labels in trn_dl: # images = feature values of each image
  outputs = model(images)
  break

print(outputs)
print('outputs.shape : ', outputs.shape)
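# [128, 10]: 128 = number of images in the batch.
# For the 128 input images, the model returns 10 values each.
# The 10 values are the scores computed for each of the 10 classes.
# Softmax rescales the 10 values so that each lies between 0 and 1 and they sum to 1.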

print('Sample outputs :\n', outputs[:2].data)

For each input image in the batch, we get 10 outputs, one for each class. As discussed earlier, we'd like these outputs to represent probabilities, but for that the elements of each output row must lie between 0 and 1 and add up to 1, which is clearly not the case here.

To convert the output rows into probabilities, we use the softmax function, which has the following formula:

$$\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$$

First we replace each element yi in an output row by e^yi, which makes all the elements positive, and then we divide each element by the sum of all elements to ensure that they add up to 1.

While it's easy to implement the softmax function (you should try it!), we'll use the implementation that's provided within PyTorch, because it works well with multidimensional tensors (a list of output rows in our case).
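For reference, a hand-written version might look like the sketch below. The helper name softmax_rows and the sample values are made up for illustration. Subtracting the row maximum before exponentiating is a common numerical-stability trick rather than part of the formula, and it does not change the result.

```python
import torch
import torch.nn.functional as F

def softmax_rows(outputs):
    """Hypothetical helper: apply softmax to each row of a 2-D tensor."""
    # Shifting by the row maximum avoids overflow in exp and leaves the result unchanged.
    shifted = outputs - outputs.max(dim=1, keepdim=True).values
    exps = torch.exp(shifted)
    # Divide each element by the sum of its row so that every row adds up to 1.
    return exps / exps.sum(dim=1, keepdim=True)

# Made-up output rows, standing in for the model outputs.
sample = torch.tensor([[1.0, 2.0, 3.0],
                       [0.5, -1.0, 0.0]])

manual = softmax_rows(sample)
builtin = F.softmax(sample, dim=1)

print(manual)
print(manual.sum(dim=1))                # each row sums to (approximately) 1
print(torch.allclose(manual, builtin))  # the two implementations agree
```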

In [None]:
import torch.nn.functional as F

The softmax function is included in the torch.nn.functional package, and requires us to specify a dimension along which the softmax must be applied.

In [None]:
# Apply Softmax for each output row
probs = F.softmax(outputs, dim=1)

# Look at sample probability
print("Sample probability: \n", probs[:2].data)

# Add up the probabilities of the first output row (should be 1)
print("Sum: ", torch.sum(probs[0]).item())

Finally, we can determine the predicted label for each image by simply choosing the index of the element with the highest probability in each output row. This is done using torch.max, which returns the largest element and the index of the largest element along a particular dimension of a tensor.

In [None]:
max_prob, preds = torch.max(probs, dim=1)
print(preds)
print(max_prob)

The numbers printed above are the predicted labels for the first batch of training images. Let's compare them with the actual labels.

In [None]:
labels

Clearly, the predicted and the actual labels are completely different. Obviously, that's because we have started with randomly initialized weights and biases. We need to train the model i.e. adjust the weights using gradient descent to make better predictions.

### 3.4.Evaluation Metric and Loss Function
Just as with linear regression, we need a way to evaluate how well our model is performing. A natural way to do this would be to find the percentage of labels that were predicted correctly i.e. the accuracy of the predictions.

In [None]:
def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

The == operator performs an element-wise comparison of two tensors with the same shape and returns a boolean tensor of the same shape, containing False for unequal elements and True for equal elements. Passing the result to torch.sum counts the True values, i.e. the number of labels that were predicted correctly. Finally, we divide by the total number of images to get the accuracy.
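To make the counting step concrete, here is a tiny sketch with made-up predictions and labels (five values each, chosen only for illustration):

```python
import torch

# Made-up predictions and labels for five images.
preds = torch.tensor([3, 1, 4, 1, 5])
labels = torch.tensor([3, 1, 2, 1, 9])

matches = preds == labels                # element-wise comparison
num_correct = torch.sum(matches).item()  # number of positions where they agree
acc = num_correct / len(preds)

print(matches)
print(num_correct, acc)  # 3 0.6
```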

Note that we don't need to apply softmax to the outputs, since it doesn't change the relative order of the results. This is because e^x is an increasing function, i.e. if y1 > y2, then e^y1 > e^y2, and the same holds true after dividing by the sum of the exponentials to get the softmax.
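A quick way to convince yourself of this is to compare the predicted indices before and after softmax. The logits below are made-up values, not outputs of the trained model.

```python
import torch
import torch.nn.functional as F

# Made-up raw outputs for three images and three classes.
logits = torch.tensor([[ 2.0, -1.0,  0.5],
                       [ 0.1,  0.2,  3.0],
                       [-2.0, -0.5, -1.0]])

probs = F.softmax(logits, dim=1)

# torch.max returns (values, indices); we only need the indices here.
_, preds_from_logits = torch.max(logits, dim=1)
_, preds_from_probs = torch.max(probs, dim=1)

print(preds_from_logits)  # tensor([0, 2, 1])
print(torch.equal(preds_from_logits, preds_from_probs))
```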

Let's calculate the accuracy of the current model, on the first batch of data. Obviously, we expect it to be pretty bad.

In [None]:
accuracy(outputs, labels)

While the accuracy is a great way for us (humans) to evaluate the model, it can't be used as a loss function for optimizing our model using gradient descent, for the following reasons:

1. It's not a differentiable function. torch.max and == are both non-continuous and non-differentiable operations, so we can't use the accuracy for computing gradients w.r.t the weights and biases.

2. It doesn't take into account the actual probabilities predicted by the model, so it can't provide sufficient feedback for incremental improvements.

Due to these reasons, accuracy is a great evaluation metric for classification, but not a good loss function. A commonly used loss function for classification problems is the cross entropy, which has the following formula:

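$$\text{loss} = -\sum_j t_j \ln p_j$$

where p_j is the predicted probability for class j, and t_j is 1 for the correct label and 0 for every other class.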

While it looks complicated, it's actually quite simple:

- For each output row, pick the predicted probability for the correct label. E.g. if the predicted probabilities for an image are [0.1, 0.3, 0.2, ...] and the correct label is 1, we pick the corresponding element 0.3 and ignore the rest.

- Then, take the logarithm of the picked probability. If the probability is high, i.e. close to 1, then its logarithm is a negative value very close to 0. And if the probability is low (close to 0), then the logarithm is a very large negative value. We also multiply the result by -1, which results in a large positive value of the loss for poor predictions.

- Finally, take the average of the cross entropy across all the output rows to get the overall loss for a batch of data.

Unlike accuracy, cross-entropy is a continuous and differentiable function that also provides good feedback for incremental improvements in the model (a slightly higher probability for the correct label leads to a lower loss). This makes it a good choice for the loss function.

As you might expect, PyTorch provides an efficient and tensor-friendly implementation of cross entropy as part of the torch.nn.functional package. Moreover, it also performs softmax internally, so we can directly pass in the outputs of the model without converting them into probabilities.
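The three steps described above can be sketched by hand and compared against F.cross_entropy. The logits and labels below are made-up values used only for this check.

```python
import torch
import torch.nn.functional as F

# Made-up raw outputs for two images and three classes, with their correct labels.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2,  3.0]])
labels = torch.tensor([0, 2])

# Step 1: convert to probabilities and pick the probability of the correct label per row.
probs = F.softmax(logits, dim=1)
picked = probs[torch.arange(len(labels)), labels]

# Steps 2 and 3: negative logarithm, averaged over the batch.
manual_loss = -torch.log(picked).mean()

# The built-in function takes the raw outputs directly and applies softmax internally.
builtin_loss = F.cross_entropy(logits, labels)

print(manual_loss.item(), builtin_loss.item())
```

The two numbers should agree up to floating-point rounding.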

In [None]:
loss_fn = F.cross_entropy

In [None]:
# Loss for current batch of data
loss = loss_fn(outputs, labels)
print(loss)

Since the cross entropy is the negative logarithm of the predicted probability of the correct label, averaged over all samples in the batch, one way to interpret the resulting number, e.g. 2.23, is to look at e^-2.23, which is around 0.1, as roughly the predicted probability of the correct label on average. The lower the loss, the better the model.
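The arithmetic behind this interpretation is easy to check. The value 2.23 is the example from the text; the comparison with a uniform guess over 10 classes is added here for context.

```python
import math

loss = 2.23
implied_prob = math.exp(-loss)
print(round(implied_prob, 4))  # 0.1075

# A model that assigns probability 1/10 to every class has a loss of ln(10).
uniform_loss = -math.log(1 / 10)
print(round(uniform_loss, 4))  # 2.3026
```

So a loss near 2.3 is about what a 10-class model produces before it has learned anything.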

### 3.5.Training the model
Now that we have defined the data loaders, model, loss function and optimizer, we are ready to train the model. The training process is identical to linear regression, with the addition of a "validation phase" to evaluate the model in each epoch. Here's what it looks like in pseudocode:

    for epoch in range(num_epochs):
      # Training phase
      for batch in train_loader:
          # Generate predictions
          # Calculate loss
          # Compute gradients
          # Update weights
          # Reset gradients
    
      # Validation phase
      for batch in val_loader:
          # Generate predictions
          # Calculate loss
          # Calculate metrics (accuracy etc.)
      # Calculate average validation loss & metrics
      
      # Log epoch, loss & metrics for inspection

Some parts of the training loop are specific to the problem we're solving (e.g. loss function, metrics etc.), whereas others are generic and can be applied to any deep learning problem. Let's implement the problem-specific parts within our MnistModel class:

In [None]:
class MnistModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.linear = nn.Linear(input_size, num_classes)

  def forward(self, xb):
    xb = xb.reshape(-1, 784)  # Reshape the input data
    out = self.linear(xb)
    return out

  def training_step(self, batch):
    images, labels = batch
    out = self(images)                    # Generate Prediction
    loss = F.cross_entropy(out, labels)   # Calculate Loss
    return loss

  def validation_step(self, batch):
    images, labels = batch
    out = self(images)                    # Generate Prediction
    loss = F.cross_entropy(out, labels)   # Calculate Loss
    acc = accuracy(out, labels)           # Calculate Accuracy
    return {'val_loss': loss, 'val_acc': acc}

  def validation_epoch_end(self, outputs):
    batch_losses = [x['val_loss'] for x in outputs]
    epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
    batch_accs = [x['val_acc'] for x in outputs]
    epoch_acc = torch.stack(batch_accs).mean()      # Combine accuracies
    return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}

  def epoch_end(self, epoch, result):
    print("Epoch [{}], val_acc: {:.4f}, val_loss: {:.4f}".format(epoch, result['val_acc'], result['val_loss']))


model = MnistModel()

out = self(images) 
- 'self' of a class in Python refers to the Object itself. It is similar to the following:

    * model = MnistModel()
    * out = model(images)

- But since we cannot do the above inside a class, we use 'self(images)' as an equivalent to 'model(images)', where 'self' refers to the object itself. 
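A minimal sketch of this equivalence, using a hypothetical TinyModel with made-up layer sizes (4 inputs, 2 outputs) and a hypothetical predict method:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):  # hypothetical model, used only for this illustration
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, xb):
        return self.linear(xb)

    def predict(self, xb):
        # Inside the class, self(xb) runs the same forward pass as tiny(xb) does outside it.
        return self(xb)

tiny = TinyModel()
x = torch.ones(3, 4)

out_outside = tiny(x)         # calling the object from outside the class
out_inside = tiny.predict(x)  # calling self(...) from inside the class
print(torch.equal(out_outside, out_inside))
```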

* 'evaluate' function: perform the validation phase
* 'fit' function: perform the entire training process.

In [None]:
def evaluate(model, val_dl):
  outputs = [model.validation_step(batch) for batch in val_dl]
  return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, trn_dl, val_dl):
  history = []
  optimizer = torch.optim.SGD(model.parameters(), lr)
  for epoch in range(epochs):
    # Training Phase
    for batch in trn_dl:
      loss = model.training_step(batch)
      loss.backward()
      optimizer.step()        # update the weights using the computed gradients
      optimizer.zero_grad()   # reset gradients to zero so they don't accumulate across batches

    # Validation Phase
    result = evaluate(model, val_dl)
    model.epoch_end(epoch, result)
    history.append(result)

  return history

The 'fit' function records the validation loss and metric from each epoch and returns a history of the training process. This is useful for debugging and visualizing the training process. Before we train the model, let's see how the model performs on the validation set with the initial set of randomly initialized weights & biases.

Configurations like batch size, learning rate etc. need to be picked in advance while training machine learning models, and are called hyperparameters. Picking the right hyperparameters is critical for training an accurate model within a reasonable amount of time, and is an active area of research and experimentation. Feel free to try different learning rates and see how they affect the training process.

In [None]:
result0 = evaluate(model, val_dl)
result0

The initial accuracy is around 10%, which is what one might expect from a randomly initialized model (since it has a 1 in 10 chance of getting a label right by guessing randomly). Also note that we are using the .format method with the message string to print only the first four digits after the decimal point.

We are now ready to train the model. Let's train for 5 epochs and look at the results.

In [None]:
history1 = fit(5, 0.001, model, trn_dl, val_dl)

That's a great result! With just 5 epochs of training, our model has reached an accuracy of over 80% on the validation set. Let's see if we can improve that by training for a few more epochs.

In [None]:
history2 = fit(5, 0.001, model, trn_dl, val_dl)

In [None]:
history3 = fit(5, 0.001, model, trn_dl, val_dl)

In [None]:
history4 = fit(5, 0.001, model, trn_dl, val_dl)

While the accuracy does continue to increase as we train for more epochs, the improvements get smaller with every epoch. This is easier to see using a line graph.
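One possible way to draw such a graph is sketched below. The helper plot_accuracies is hypothetical; it assumes each entry of the history is a dict with a 'val_acc' key, as produced by validation_epoch_end, and that fit returns its history list.

```python
import matplotlib.pyplot as plt

def plot_accuracies(results):
    """Hypothetical helper: plot 'val_acc' from a list of per-epoch result dicts."""
    accuracies = [r['val_acc'] for r in results]
    plt.plot(accuracies, '-x')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.title('Accuracy vs. No. of epochs')
    return accuracies

# Usage, assuming fit returns its history list and result0 holds the
# evaluation result from before training:
# plot_accuracies([result0] + history1 + history2 + history3 + history4)
```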

### 3.6.Saving and loading the model
Since we've trained our model for a long time and achieved a reasonable accuracy, it would be a good idea to save the weights and bias matrices to disk, so that we can reuse the model later and avoid retraining from scratch. Here's how you can save the model.

https://www.youtube.com/watch?v=g6kQl_EFn84

In [None]:
torch.save(model.state_dict(), 'mnist-logistic.pth')

The .state_dict method returns an OrderedDict containing all the weights and bias matrices mapped to the right attributes of the model.
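To see what that mapping looks like, the sketch below builds a hypothetical DemoModel with the same single nn.Linear layer and prints the keys and shapes of its state dict. The sizes 784 and 10 are assumed to match input_size and num_classes in this notebook.

```python
import torch.nn as nn

class DemoModel(nn.Module):  # hypothetical stand-in with the same layer layout as MnistModel
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(784, 10)  # assumed input_size=784, num_classes=10

demo = DemoModel()
state = demo.state_dict()

# Keys are built from the attribute name ('linear') and the parameter name.
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
# linear.weight (10, 784)
# linear.bias (10,)
```

Because the keys are derived from attribute names, load_state_dict expects the receiving model to define its layers under the same names.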



To load the model weights, we can instantiate a new object of the class MnistModel, and use the .load_state_dict method.

In [None]:
model2 = MnistModel()
model2.load_state_dict(torch.load('mnist-logistic.pth'))
model2.state_dict()

In [None]:
# Define test dataset
test_dataset = MNIST(root='MNIST/', 
                     train=False,
                     transform=transforms.ToTensor())

In [None]:
test_loader = DataLoader(test_dataset, batch_size=256)
result = evaluate(model2, test_loader)
result

## 4.Training DNN on a GPU with PyTorch

### 4.1. Preparing the Data
- Import required modules and classes.

In [None]:
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torchvision.utils import make_grid
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import random_split
%matplotlib inline

In [None]:
dataset = MNIST(root='data/',
                download=True,
                transform=ToTensor())

In [None]:
# Use the random_split helper function to set aside 10000 images for our validation set.
val_size = 10000
train_size = len(dataset) - val_size

train_ds, val_ds = random_split(dataset, [train_size, val_size])
len(train_ds), len(val_ds)

#### PyTorch DataLoader
[reference]
https://www.youtube.com/watch?v=zN49HdDxHi8
* Create a dataloader which creates batches of data
* one epoch: one forward pass and one backward pass of all the training examples.
* batch size: the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
* number of iterations: the number of passes, each pass using [batch size] examples. To be clear, one pass = one forward pass + one backward pass (we don't count the forward pass and the backward pass as two different passes).

* ex) if you have 1,000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
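The relationship between these quantities can be written as a one-line calculation. The helper name iterations_per_epoch is made up for illustration, and it assumes the last, possibly smaller, batch is kept (the DataLoader default, drop_last=False).

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Hypothetical helper: number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(1000, 500))   # 2
print(iterations_per_epoch(50000, 128))  # 391
```

The second line corresponds to the 50,000 training images left after setting aside 10,000 for validation, with a batch size of 128.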

In [None]:
batch_size=128

In [None]:
# create a dataloader for training data
train_loader = DataLoader(train_ds, batch_size, shuffle=True, num_workers=4, pin_memory=True)

# create a dataloader for validation data
val_loader = DataLoader(val_ds, batch_size*2, num_workers=4, pin_memory=True)

In [None]:
def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

In [None]:
# Setting up the GPU environment in Colab: Runtime > Change runtime type > Hardware accelerator > GPU
torch.cuda.is_available()