A large portion of the content is copied from the [PyTorch Tutorial](https://pytorch.org/tutorials/) and [mrdbourke/pytorch-deep-learning](https://www.learnpytorch.io/).

Created on 2023/1/29 for CS440/ECE448 Artificial Intelligence, Spring 2023 

## 1. What is *__PyTorch__*?
*PyTorch* is an open source machine learning framework that accelerates the path from research prototyping to production deployment.  
*PyTorch* allows you to manipulate and process data and write machine learning algorithms using Python code (user-friendly!).  
*PyTorch* also offers some domain-specific libraries such as [TorchText](https://pytorch.org/text/stable/index.html), [TorchVision](https://pytorch.org/vision/stable/index.html), and [TorchAudio](https://pytorch.org/audio/stable/index.html).

![domain-libraries.png](attachment:domain-libraries.png)

You can [install](https://pytorch.org/get-started/locally/) *PyTorch* with conda or pip. For more information, please refer to the [PyTorch website](https://pytorch.org/).  
Let's verify the installation by printing the *PyTorch* version. The code block below should run without error if *PyTorch* was installed correctly.

In [1]:
import torch    # the library name is torch
print("PyTorch version:", torch.__version__)

PyTorch version: 1.13.1


## 2. PyTorch Workflow 
__Source__: [mrdbourke/pytorch-deep-learning](https://www.learnpytorch.io/01_pytorch_workflow/)

Machine learning is a game of two parts:
1. Turn your data, whatever it is, into numbers (a representation).
2. Pick or build a model to learn the representation as best as possible.

![pytorch_workflow.png](attachment:pytorch_workflow.png)

In this MP, you will mostly work on the second step: *__building a model__*.  
But for now, let's go over some fundamentals first.

## 3. Fundamentals
### Datasets and Dataloaders
__Source__: [PyTorch Tutorial](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)

Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. PyTorch provides two data primitives: [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html) and [`torch.utils.data.Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples. 

A dataloader is a way for you to handle loading and transforming data before it enters your network for training or prediction. It will let you write code that looks like you're just looping through the dataset, with the division into batches happening automatically. Details on how to use a dataloader can be found in this [tutorial](https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel) by Shervine Amidi. For more information about datasets and dataloaders, please refer to __Source__.

In this MP, you don't need to write the dataset and dataloader code; we have provided it for you.
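
To make the division of labor between the two primitives concrete, here is a minimal sketch. `ToyDataset` and its contents are hypothetical and are not the dataset provided with the MP; the sketch only shows the `__len__`/`__getitem__` interface and how a `DataLoader` groups samples into batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical dataset: 10 samples with 4 features each and a 0/1 label."""
    def __init__(self):
        self.features = torch.arange(40, dtype=torch.float32).reshape(10, 4)
        self.labels = torch.arange(10) % 2

    def __len__(self):
        # number of samples in the dataset
        return len(self.features)

    def __getitem__(self, idx):
        # return one (sample, label) pair
        return self.features[idx], self.labels[idx]

# The DataLoader batches the samples; the last batch holds the 2 leftover samples
toy_loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
batch_shapes = []
for batch_features, batch_labels in toy_loader:
    batch_shapes.append((tuple(batch_features.shape), tuple(batch_labels.shape)))
    print(batch_features.shape, batch_labels.shape)
```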

### Tensors
__Source__: [mrdbourke/pytorch-deep-learning](https://www.learnpytorch.io/00_pytorch_fundamentals/#introduction-to-tensors)

A tensor is a specialized data structure that is very similar to an array or a matrix. Its job is to represent data numerically. Tensors are similar to NumPy’s arrays, except that tensors can run on GPUs or other hardware accelerators (better performance!). In PyTorch, we use tensors to encode the inputs and outputs of a model, as well as the model’s parameters.

The code cell below may give you some idea of how to use tensors.

In [2]:
rand_tensor = torch.rand(2, 3)              # create a random tensor of size (2, 3)
zeros_tensor = torch.zeros(5)               # create a tensor of size 5 that is filled with 0's
print("Random Tensor: \n", rand_tensor, "\n")
print("Zeros Tensor: \n", zeros_tensor, "\n")

# explore some of the attributes of a Tensor
tensor = torch.tensor([[7, 7, 5], [1, 3, 0], [2, 2, 1], [9, 4, 8]])
print("My Tensor: \n", tensor)
print("Shape of tensor: ", tensor.shape)
print("Datatype of tensor: ", tensor.dtype)
print("Device tensor is stored on: ", tensor.device, "\n")

# element-wise multiplication
tensor1 = torch.tensor([1, 2, 3])
tensor2 = torch.tensor([2, 3, 4])
print("Element-wise multiplication:")
print(tensor1, "*", tensor2, "=", tensor1 * tensor2)

Random Tensor: 
 tensor([[0.1447, 0.4911, 0.9916],
        [0.4106, 0.9788, 0.4347]]) 

Zeros Tensor: 
 tensor([0., 0., 0., 0., 0.]) 

My Tensor: 
 tensor([[7, 7, 5],
        [1, 3, 0],
        [2, 2, 1],
        [9, 4, 8]])
Shape of tensor:  torch.Size([4, 3])
Datatype of tensor:  torch.int64
Device tensor is stored on:  cpu 

Element-wise multiplication:
tensor([1, 2, 3]) * tensor([2, 3, 4]) = tensor([ 2,  6, 12])
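
As a small complement on the relationship between tensors, NumPy arrays, and devices, the sketch below converts between the two types and moves a tensor to a GPU only if one is available. The variable names are illustrative, and the fallback to `"cpu"` is an assumption about your machine.

```python
import numpy as np
import torch

np_array = np.array([[1.0, 2.0], [3.0, 4.0]])
from_np = torch.from_numpy(np_array)   # shares memory with np_array, keeps dtype float64
back_to_np = from_np.numpy()           # convert back to a NumPy array

# Because memory is shared, a change to the array is visible through the tensor
np_array[0, 0] = 10.0
print(from_np[0, 0].item())            # 10.0

# Pick a GPU if one is available, otherwise stay on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
on_device = from_np.to(device)
print(from_np.dtype, on_device.device)
```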


One of the most common errors you'll run into in deep learning is shape mismatches, because matrix multiplication has a strict rule about what shapes and sizes can be combined.

The code cell below is such an example.

In [3]:
# matrix multiplication
tensor_A = torch.tensor([[1, 2],
                         [3, 4],
                         [5, 6]], dtype=torch.float32)

tensor_B = torch.tensor([[7, 10],
                         [8, 11], 
                         [9, 12]], dtype=torch.float32)
print("tensor_A, shape =", tensor_A.shape)
print(tensor_A, "\n")
print("tensor_B, shape =", tensor_B.shape)
print(tensor_B)

# torch.matmul() is a built-in matrix multiplication function
torch.matmul(tensor_A, tensor_B)    # this will error because of shape mismatch

tensor_A, shape = torch.Size([3, 2])
tensor([[1., 2.],
        [3., 4.],
        [5., 6.]]) 

tensor_B, shape = torch.Size([3, 2])
tensor([[ 7., 10.],
        [ 8., 11.],
        [ 9., 12.]])


RuntimeError: mat1 and mat2 shapes cannot be multiplied (3x2 and 3x2)

In [4]:
# tensor_A and tensor_B cannot be multiplied, but multiplying tensor_A with the transpose of tensor_B is legal (3x2 and 2x3)
# transpose of a tensor:    tensor.T - where tensor is the desired tensor to transpose
print("tensor_A, shape =", tensor_A.shape)
print(tensor_A, "\n")
print("transpose of tensor_B, shape =", tensor_B.T.shape)
print(tensor_B.T, "\n")         # transpose of tensor_B
print("tensor_A * tensor_B.T equals")
print(torch.matmul(tensor_A, tensor_B.T))

tensor_A, shape = torch.Size([3, 2])
tensor([[1., 2.],
        [3., 4.],
        [5., 6.]]) 

transpose of tensor_B, shape = torch.Size([2, 3])
tensor([[ 7.,  8.,  9.],
        [10., 11., 12.]]) 

tensor_A * tensor_B.T equals
tensor([[ 27.,  30.,  33.],
        [ 61.,  68.,  75.],
        [ 95., 106., 117.]])
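
The shape rule behind these two cells can be sketched as follows: for 2-D tensors, the inner dimensions must match, and the result takes the outer dimensions. `can_matmul` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import torch

def can_matmul(a, b):
    """Hypothetical helper for 2-D tensors: (n, k) @ (k, m) requires matching k."""
    return a.shape[1] == b.shape[0]

left = torch.ones(3, 2)
right = torch.ones(3, 2)

print(can_matmul(left, right))      # False: inner dimensions are 2 and 3
print(can_matmul(left, right.T))    # True: inner dimensions are 2 and 2

product = left @ right.T            # the @ operator is equivalent to torch.matmul
print(product.shape)                # torch.Size([3, 3])
```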


### Neural Net Layers

So far we have looked into the tensors, their properties and basic operations on tensors. These are especially useful to get familiar with if we are building the layers of our network from scratch. PyTorch also provides some built-in blocks in the [torch.nn](https://pytorch.org/docs/stable/nn.html) module.

We can use [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear)(in_features, out_features) to create a linear layer that applies a linear transformation to the incoming data $x$:  
- $y = xA^{T} + b$, where the weight matrix $A$ (of shape out_features × in_features) and the bias $b$ are initialized __randomly__.

This will take an input of size (∗, *__in_features__*) where ∗ means any number of dimensions including none and *__in_features__* is the size of each input sample. It will yield an output of size (∗, *__out_features__*) where all but the last dimension are the same shape as the input and *__out_features__* is the size of each output sample. 
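
As a quick sanity check of the formula and the shape rule, the sketch below recomputes a linear layer's output by hand from its `weight` and `bias` attributes and then feeds it an input with an extra leading dimension. The seed and the variable names are arbitrary choices for this illustration.

```python
import torch

torch.manual_seed(0)                       # arbitrary seed, only for reproducibility
layer = torch.nn.Linear(5, 7)              # weight has shape (7, 5), bias has shape (7,)
x_batch = torch.randn(8, 5)

# Recompute y = x A^T + b manually and compare with the layer's output
manual = x_batch @ layer.weight.T + layer.bias
matches = torch.allclose(layer(x_batch), manual, atol=1e-6)
print("Manual computation matches:", matches)

# Any number of leading dimensions is allowed; only the last one must equal in_features
x_extra = torch.randn(2, 3, 5)
extra_out = layer(x_extra)
print(extra_out.shape)                     # torch.Size([2, 3, 7])
```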

In [5]:
# Create the input (8 samples, each of size 5)
input = torch.randn(8, 5)
print("Input:")
print(input)
print("Input size:", input.size(), "\n")

# Make a linear layer transforming (*, 5)-dimensional inputs to (*, 7)-dimensional outputs
linear_layer = torch.nn.Linear(5, 7)

# Apply the linear layer
output = linear_layer(input)
print("Output:")
print(output)
print("Output size:", output.size())

Input:
tensor([[ 1.2311, -1.6106,  1.1904, -1.3760, -1.0768],
        [ 1.4457, -0.9221,  0.2620, -0.4303, -0.1025],
        [ 0.6957, -0.1886, -0.2860, -0.8013, -0.0345],
        [ 1.9111,  1.4318, -0.7400, -1.6043,  0.9857],
        [-0.3254, -0.1228, -1.8418,  0.5205,  0.4083],
        [-0.2250,  0.9231, -0.1863,  0.6050, -0.5626],
        [ 0.3139,  0.3218, -0.4420, -1.3299, -0.2689],
        [ 1.2476, -0.5054, -0.9550, -2.1420,  0.2721]])
Input size: torch.Size([8, 5]) 

Output:
tensor([[ 1.3293e+00,  6.8910e-01,  7.5343e-01,  4.8346e-02,  1.0342e+00,
          4.6120e-02,  2.5880e-01],
        [ 6.4500e-03,  1.0862e+00,  7.1512e-01, -1.3738e-01,  3.4276e-01,
          4.2982e-01,  7.4599e-01],
        [-1.7132e-01,  5.6373e-01,  1.8710e-01, -4.3903e-03,  2.2715e-01,
          4.9964e-01,  6.0173e-02],
        [-1.1930e+00,  5.7070e-01,  2.8576e-01,  3.7716e-01, -3.4330e-01,
          1.2255e+00,  8.0821e-02],
        [-1.1761e+00,  8.4956e-01, -2.5006e-01, -5.5140e-01, -6.0904e-0

We can also use the torch.nn module to apply activation functions to our tensors. Activation functions are used to add non-linearity to our network. They operate on each element separately, so the output tensor has the same shape as the input tensor. Let's try [nn.Sigmoid()](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html).

In [6]:
# Pass the output of previous layer to the current layer as the input
print("Previous output:")
print(output)
print("Output size:", output.size(), "\n")

# Create a sigmoid function
sigmoid = torch.nn.Sigmoid()

# Apply the activation function
activated_output = sigmoid(output)
print("Activated output:")
print(activated_output)
print("Activated output size:", activated_output.size())

Previous output:
tensor([[ 1.3293e+00,  6.8910e-01,  7.5343e-01,  4.8346e-02,  1.0342e+00,
          4.6120e-02,  2.5880e-01],
        [ 6.4500e-03,  1.0862e+00,  7.1512e-01, -1.3738e-01,  3.4276e-01,
          4.2982e-01,  7.4599e-01],
        [-1.7132e-01,  5.6373e-01,  1.8710e-01, -4.3903e-03,  2.2715e-01,
          4.9964e-01,  6.0173e-02],
        [-1.1930e+00,  5.7070e-01,  2.8576e-01,  3.7716e-01, -3.4330e-01,
          1.2255e+00,  8.0821e-02],
        [-1.1761e+00,  8.4956e-01, -2.5006e-01, -5.5140e-01, -6.0904e-02,
          6.6462e-01,  1.6914e-01],
        [-8.1182e-01, -2.2811e-01, -4.3557e-01, -1.1723e-01, -4.8372e-01,
          5.1117e-01, -3.5015e-01],
        [-4.4120e-02,  9.4943e-02, -1.3788e-01,  1.3130e-01,  3.0328e-01,
          5.7239e-01, -5.3382e-01],
        [ 1.0347e-01,  9.8029e-01,  4.2258e-01,  2.2643e-02,  8.4057e-01,
          7.9757e-01, -1.2799e-03]], grad_fn=<AddmmBackward0>)
Output size: torch.Size([8, 7]) 

Activated output:
tensor([[0.7907, 0.6658,

So far we have seen that we can create layers and pass the output of one as the input of the next. Instead of creating intermediate tensors and passing them around, we can use [nn.Sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html), which does exactly that.

In [7]:
# Create a sequential container
block = torch.nn.Sequential(
    torch.nn.Linear(5, 7),
    torch.nn.Sigmoid()
)

print("Input:")
print(input)
print("Input size:", input.size(), "\n")

# Apply the block
block_output = block(input)
print("Block output:")
print(block_output)
print("Block output size:", block_output.size())

Input:
tensor([[ 1.2311, -1.6106,  1.1904, -1.3760, -1.0768],
        [ 1.4457, -0.9221,  0.2620, -0.4303, -0.1025],
        [ 0.6957, -0.1886, -0.2860, -0.8013, -0.0345],
        [ 1.9111,  1.4318, -0.7400, -1.6043,  0.9857],
        [-0.3254, -0.1228, -1.8418,  0.5205,  0.4083],
        [-0.2250,  0.9231, -0.1863,  0.6050, -0.5626],
        [ 0.3139,  0.3218, -0.4420, -1.3299, -0.2689],
        [ 1.2476, -0.5054, -0.9550, -2.1420,  0.2721]])
Input size: torch.Size([8, 5]) 

Block output:
tensor([[0.5038, 0.6579, 0.2936, 0.7488, 0.5123, 0.4349, 0.6728],
        [0.4912, 0.5293, 0.3327, 0.6359, 0.5016, 0.5841, 0.5915],
        [0.4351, 0.5619, 0.3979, 0.6545, 0.4264, 0.5064, 0.5577],
        [0.4677, 0.7703, 0.2803, 0.6743, 0.2310, 0.5434, 0.4912],
        [0.3174, 0.2679, 0.6050, 0.5022, 0.4819, 0.5958, 0.5162],
        [0.4263, 0.4943, 0.5767, 0.5954, 0.6059, 0.5832, 0.4461],
        [0.3975, 0.6503, 0.4262, 0.7030, 0.3661, 0.4229, 0.5587],
        [0.3713, 0.6600, 0.3161, 0.7148, 0.
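
To see that a sequential container really just chains its modules, the following sketch builds one from two existing layer objects and compares its output with applying the same layers one at a time. The names `demo_linear`, `demo_sigmoid`, and `seq_block` are made up for this illustration.

```python
import torch

torch.manual_seed(0)                          # arbitrary seed, only for reproducibility
demo_linear = torch.nn.Linear(5, 7)
demo_sigmoid = torch.nn.Sigmoid()

# The container holds references to the same layer objects
seq_block = torch.nn.Sequential(demo_linear, demo_sigmoid)

samples = torch.randn(8, 5)
step_by_step = demo_sigmoid(demo_linear(samples))   # manual chaining
same_result = torch.allclose(seq_block(samples), step_by_step)

print("Same result:", same_result)
print("First module is demo_linear:", seq_block[0] is demo_linear)
```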

## 4. Build a Model
__Source__: [mrdbourke/pytorch-deep-learning](https://www.learnpytorch.io/01_pytorch_workflow/#2-build-model) and [PyTorch Tutorial](https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html)


Now that we have covered some fundamentals, let's focus on how to build a model. In this MP, you will implement a neural network.

Neural networks consist of layers/modules that perform operations on data. The [torch.nn](https://pytorch.org/docs/stable/nn.html) namespace (a PyTorch module) provides all the building blocks you need to build your own neural network. Every module in PyTorch subclasses [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) (this is object-oriented programming; if you are not familiar with Python class notation, or if you don't quite understand what it means, you may want to take a look at [this](https://realpython.com/python3-object-oriented-programming/)). A neural network is a derived class of nn.Module that consists of other modules (layers). This nested structure allows for building and managing complex architectures easily.

The neural network model is defined in two steps: 
1. we first specify the parameters (e.g., layers) of the model ( in `__init__()` ), 

2. and then outline how they are applied to the inputs ( in `forward()` ).

![net-workflow.png](attachment:net-workflow.png)

The code snippet provided below is a simple example of how to construct a network and use it to make a prediction. Note that there is *__more than one way__* to construct a network architecture; you can find many examples on the internet.

![simple-neural-network.png](attachment:simple-neural-network.png)

In [8]:
import torch

class SimpleNet(torch.nn.Module):
  def __init__(self):
    """
    In the initialization function we specify the parameters of our network.
    """
    super().__init__()  # call the initialization function of the base class (nn.Module)
    # network architecture, please try to relate the code to the picture
    self.hidden = torch.nn.Linear(4, 3)     # input has 4 values
    self.output = torch.nn.Linear(3, 2)     # output has 2 values
    self.relu = torch.nn.ReLU()             # activation function

  def forward(self, x):
    """
    In the forward function we accept a Tensor of input data (the variable x) and we must return a Tensor of output data. 
    We can use Modules defined in the __init__() as well as arbitrary (differentiable) operations on Tensors.
    """
    x_temp = self.hidden(x)             # input data x flows through the hidden layer
    x_temp = self.relu(x_temp)          # use relu as the activation function for intermediate data x_temp 
    y_pred = self.output(x_temp)        # predicted value
    return y_pred

# Create an instance of the SimpleNet model (this is a subclass of nn.Module)
model = SimpleNet()

# Create inputs, here we use a random tensor, but in reality, the input should be loaded from a real-world dataset
x = torch.rand(3, 4)   # 3 samples, each sample of size 4

# Forward pass: compute predicted y by passing x to the model
# Note that the model is randomly initialized, so this prediction probably doesn't make sense
# We need to train our model and teach it to make reasonable predictions (we will see that later)
y_pred = model(x)

print("y_pred.shape: ", y_pred.shape)   # since our output layer has 2 values, y_pred should be of size (3, 2)
print(y_pred)

y_pred.shape:  torch.Size([3, 2])
tensor([[0.3849, 0.6309],
        [0.4664, 0.5541],
        [0.4284, 0.5552]], grad_fn=<AddmmBackward0>)


| Name | What does it do? |
| ----- | ----- |
| [`torch.nn`](https://pytorch.org/docs/stable/nn.html) | Contains all of the building blocks (network layers) for computational graphs (essentially a series of computations executed in a particular way). |
| [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module) | The base class for all neural network modules; all the building blocks for neural networks are subclasses of it. If you're building a neural network in PyTorch, your models should subclass `nn.Module`. Requires a `forward()` method to be implemented. | 
| `__init__()` | `__init__()` is your network's initialization function, where you will initialize the neural network layers.|
| `forward()` | All `nn.Module` subclasses (e.g., your own network) require a `forward()` method, which defines the computation that will take place on the data passed to that particular `nn.Module`. Simply put, `forward()` should perform a forward pass through your network. Note that you should *__NOT__* call the `forward(x)` method directly. To use the model, call the model itself and pass it the input data, as in `model(x)`, to perform a forward pass and output predictions. This executes the model's `forward()` automatically.  |
| [`torch.optim`](https://pytorch.org/docs/stable/optim.html) | Contains various optimization algorithms (these determine how the model parameters, stored as `nn.Parameter`, should change during gradient descent in order to reduce the loss). | 
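
To connect the table to something you can inspect, the sketch below lists the parameters that a module registers when layers are assigned in `__init__()`. `TinyNet` is a hypothetical class with the same layout as `SimpleNet` (4 inputs, 3 hidden units, 2 outputs), redefined here so that the snippet is self-contained.

```python
import torch

class TinyNet(torch.nn.Module):
    """Hypothetical network with the same layout as SimpleNet."""
    def __init__(self):
        super().__init__()
        self.hidden = torch.nn.Linear(4, 3)
        self.output = torch.nn.Linear(3, 2)
        self.relu = torch.nn.ReLU()          # has no learnable parameters

    def forward(self, x):
        return self.output(self.relu(self.hidden(x)))

net = TinyNet()

# Layers assigned as attributes are registered automatically, along with their parameters
param_shapes = {name: tuple(p.shape) for name, p in net.named_parameters()}
for name, shape in param_shapes.items():
    print(name, shape)

# (4*3 + 3) + (3*2 + 2) = 23 learnable values in total
total_params = sum(p.numel() for p in net.parameters())
print("Total parameters:", total_params)     # 23
```

These are the same parameters that an optimizer receives when you pass it `net.parameters()`.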

### Train a Model
__Source__: [PyTorch Tutorial](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html)

Training a model is an iterative process; in each iteration the model makes a guess about the output, calculates the error in its guess (loss), collects the derivatives of the error with respect to its parameters, and optimizes these parameters using gradient descent.

Note that the parameters are initialized randomly, and for our model to update its parameters on its own, we'll need to add a *__loss function__* as well as an *__optimizer__*.

#### *Loss Function and Optimizer*

| Function | What does it do? | Where does it live in PyTorch? | Common values |
| ----- | ----- | ----- | ----- |
| **Loss function** | Measures how wrong your model's predictions (e.g. `y_preds`) are compared to the truth labels (e.g. `y_test`). The lower, the better. | PyTorch has plenty of built-in loss functions in [`torch.nn`](https://pytorch.org/docs/stable/nn.html#loss-functions). | Mean absolute error (MAE) for regression problems ([`torch.nn.L1Loss()`](https://pytorch.org/docs/stable/generated/torch.nn.L1Loss.html)). Binary cross entropy for binary classification problems ([`torch.nn.BCELoss()`](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html)). Cross entropy for multi-class classification problems ([`torch.nn.CrossEntropyLoss()`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)). |
| **Optimizer** | Tells your model how to update its internal parameters to best lower the loss. | You can find various optimization function implementations in [`torch.optim`](https://pytorch.org/docs/stable/optim.html). | Stochastic gradient descent ([`torch.optim.SGD()`](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD)). Adam optimizer ([`torch.optim.Adam()`](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam)). | 

#### *Gradients*
When training neural networks, the most frequently used algorithm is back propagation. In this algorithm, parameters (model weights, biases, ...) are adjusted according to the gradient of the loss function with respect to the given parameter. To compute those gradients, PyTorch has a built-in differentiation engine called `torch.autograd`. It supports automatic computation of gradients for any computational graph. Take a look at [this](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html) for more information about automatic differentiation in PyTorch (you should be able to understand everything that is covered in the linked tutorial by now).
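
Before looking at a full network, a tiny sketch may help show what `torch.autograd` does. The tensor `w` and the function below are made up for illustration: for a toy loss equal to the sum of $w_i^2$, the gradient with respect to each $w_i$ is $2w_i$.

```python
import torch

# requires_grad=True asks autograd to track operations on this tensor
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

toy_loss = (w ** 2).sum()   # toy "loss": 1 + 4 + 9 = 14
toy_loss.backward()         # fills w.grad with d(toy_loss)/dw

print(toy_loss.item())      # 14.0
print(w.grad)               # tensor([2., 4., 6.]), i.e. 2 * w
```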

Time for an example: 

In [9]:
y_pred = model(x)
print("Predicted values:")
print(y_pred, "\n")

# Create a loss function, e.g., Mean Squared Error
loss_fn = torch.nn.MSELoss()

# Create true values, in reality, they should come from a real-world dataset
y_true = torch.tensor(
    [
        [1, 1], 
        [1, 1],
        [1, 1]
    ], dtype=torch.float32)
print("True values:")
print(y_true, "\n")

# Calculate MSE
loss = loss_fn(y_pred, y_true)      # convention is loss_fn(input, target); MSE is symmetric, so the value is unchanged
print("MSE:", loss, "\n\n\n")       # You can verify the results manually


# Create an optimizer, e.g., SGD optimizer
optimizer = torch.optim.SGD(params=model.parameters(), lr=1)
# params:   parameters of target model to optimize
# lr:       learning rate (how much the optimizer should change parameters at each step)

print("Weights of hidden linear layer, before back propagation:")
print(model.hidden.weight, "\n") 

# Perform back propagation
optimizer.zero_grad()   # Clear previous gradients, will see more about this later
loss.backward()         # back propagation
# Here I only print the gradients of hidden.weight, but backward() computes gradients for all related parameters
print("gradients of weights of the hidden layer:")
print(model.hidden.weight.grad, "\n")   

# Update parameters
optimizer.step()
print("Weights of hidden linear layer, after back propagation:")
print(model.hidden.weight)          # You can verify the results manually, after = before - gradient x learning rate.

Predicted values:
tensor([[0.3849, 0.6309],
        [0.4664, 0.5541],
        [0.4284, 0.5552]], grad_fn=<AddmmBackward0>) 

True values:
tensor([[1., 1.],
        [1., 1.],
        [1., 1.]]) 

MSE: tensor(0.2538, grad_fn=<MseLossBackward0>) 



Weights of hidden linear layer, before back propagation:
Parameter containing:
tensor([[ 0.0913, -0.3333,  0.3799, -0.1272],
        [ 0.2785, -0.1365, -0.4936,  0.3962],
        [-0.0153,  0.2444,  0.2186, -0.2870]], requires_grad=True) 

gradients of weights of the hidden layer:
tensor([[-0.0263, -0.0080, -0.0165, -0.0045],
        [ 0.0000,  0.0000,  0.0000,  0.0000],
        [-0.1944, -0.0914, -0.1011, -0.0288]]) 

Weights of hidden linear layer, after back propagation:
Parameter containing:
tensor([[ 0.1177, -0.3253,  0.3964, -0.1226],
        [ 0.2785, -0.1365, -0.4936,  0.3962],
        [ 0.1790,  0.3358,  0.3197, -0.2583]], requires_grad=True)
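
The manual verification suggested in the code comment can be sketched as below. This is a self-contained illustration with its own small layer, hypothetical names, and an arbitrary learning rate of 0.1; it assumes plain SGD without momentum or weight decay, for which the update is `after = before - lr * gradient`.

```python
import torch

torch.manual_seed(0)                          # arbitrary seed, only for reproducibility
demo_layer = torch.nn.Linear(4, 3)
demo_optimizer = torch.optim.SGD(demo_layer.parameters(), lr=0.1)

demo_inputs = torch.rand(3, 4)
demo_targets = torch.ones(3, 3)

demo_loss = torch.nn.MSELoss()(demo_layer(demo_inputs), demo_targets)
demo_optimizer.zero_grad()
demo_loss.backward()

# Save copies of the weights and their gradients before the update
weight_before = demo_layer.weight.detach().clone()
weight_grad = demo_layer.weight.grad.clone()

demo_optimizer.step()

# Plain SGD: after = before - learning rate * gradient
expected_weight = weight_before - 0.1 * weight_grad
update_matches = torch.allclose(demo_layer.weight.detach(), expected_weight)
print("Update matches the formula:", update_matches)
```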


#### *Hyperparameters*
Hyperparameters are adjustable parameters that let you control the model optimization process. Different hyperparameter values can impact model training and convergence rates.

We define the following hyperparameters for training:
- *__Number of Epochs__* - the number of times to iterate over the dataset (one epoch = one forward pass and one backward pass of all the training samples)

- *__Batch Size__* - the number of data samples propagated through the network before the parameters are updated

- *__Learning Rate__* - how much to update the model's parameters at each batch/epoch. Smaller values yield slow learning speed, while large values may result in unpredictable behavior during training.

#### *Batch size*
Batch size might be confusing if this is your first time hearing it. You can find a wonderful explanation [here](https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network#:~:text=Advantages%20of%20using%20a%20batch%20size%20%3C%20number,we%20update%20the%20weights%20after%20each%20propagation.%20).

> Let's say you have 1050 training samples and you want to set up a batch_size equal to 100. The algorithm takes the first 100 samples (from 1st to 100th) from the training dataset and trains the network. Next, it takes the second 100 samples (from 101st to 200th) and trains the network again. We can keep doing this procedure until we have propagated all samples through of the network. Problem might happen with the last set of samples. In our example, we've used 1050 which is not divisible by 100 without remainder. The simplest solution is just to get the final 50 samples and train the network.
>
> Advantages of using a batch size < number of all samples:
> - It requires less memory. Since you train the network using fewer samples, the overall training procedure requires less memory. That's especially important if you are not able to fit the whole dataset in your machine's memory.
>
> - Typically networks train faster with mini-batches. That's because we update the weights after each propagation. In our example we've propagated 11 batches (10 of them had 100 samples and 1 had 50 samples) and after each of them we've updated our network's parameters. If we used all samples during propagation we would make only 1 update for the network's parameter.
>
> Disadvantages of using a batch size < number of all samples:
> - The smaller the batch the less accurate the estimate of the gradient will be.
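
The arithmetic in the quoted example can be checked with a short sketch. The dataset here is a placeholder filled with zeros; only its size (1050 samples) and the batch size (100) come from the quote.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 1050 samples with 4 features each and one target per sample
placeholder_data = TensorDataset(torch.zeros(1050, 4), torch.zeros(1050))
batch_loader = DataLoader(placeholder_data, batch_size=100)

batch_sizes = [len(batch_features) for batch_features, _ in batch_loader]

print(len(batch_loader), math.ceil(1050 / 100))   # 11 11
print(batch_sizes[0], batch_sizes[-1])            # 100 50
```

By default the `DataLoader` keeps the smaller final batch, which corresponds to the "simplest solution" described in the quote.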

#### *Annotated Code*
Please note that the code snippets below do not make use of batch size; instead, they use one single "batch" containing all of the training data. Take a look [here](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html) for an example with batch size. In this MP, you will write a training loop and a testing loop that operate on data in batches.

![train-loop.png](attachment:train-loop.png)

![test-loop.png](attachment:test-loop.png)
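
For comparison with the single-batch snippets above, here is one possible sketch of training and testing loops that iterate over mini-batches. Everything in it is an assumption made for illustration (the toy data, the network, the names, and the hyperparameter values); it is not the MP's reference solution.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

# Hypothetical toy data: 64 samples, 4 features, target = sum of the features
toy_features = torch.rand(64, 4)
toy_targets = toy_features.sum(dim=1, keepdim=True)
train_loader = DataLoader(TensorDataset(toy_features[:48], toy_targets[:48]),
                          batch_size=16, shuffle=True)
test_loader = DataLoader(TensorDataset(toy_features[48:], toy_targets[48:]),
                         batch_size=16)

toy_model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
toy_loss_fn = torch.nn.MSELoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.05)

def train_one_epoch():
    """One pass over the training data, updating parameters after every batch."""
    toy_model.train()
    running_loss = 0.0
    for batch_x, batch_y in train_loader:
        batch_loss = toy_loss_fn(toy_model(batch_x), batch_y)  # forward pass + loss
        toy_optimizer.zero_grad()    # clear gradients left over from the previous batch
        batch_loss.backward()        # back propagation
        toy_optimizer.step()         # parameter update
        running_loss += batch_loss.item()
    return running_loss / len(train_loader)

def evaluate():
    """Average loss over the test data, without tracking gradients."""
    toy_model.eval()
    running_loss = 0.0
    with torch.no_grad():
        for batch_x, batch_y in test_loader:
            running_loss += toy_loss_fn(toy_model(batch_x), batch_y).item()
    return running_loss / len(test_loader)

first_epoch_loss = train_one_epoch()
for epoch in range(19):
    last_epoch_loss = train_one_epoch()
test_loss = evaluate()

print("Train loss, first epoch:", first_epoch_loss)
print("Train loss, last epoch: ", last_epoch_loss)
print("Test loss:", test_loss)
```

Calling `zero_grad()` once per batch matters because PyTorch accumulates gradients across `backward()` calls by default.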

## 5. Summary
![Summary.png](attachment:Summary.png)

Note that some of the modules are not covered and will not be used in this MP but may be useful to know. 