# Neural Network

- **Part I: Implementation of a feed-forward neural network**
    - Build your feed-forward network with different layers and activation functions
    - Define the gradient descent function to update the parameters
    - Adjust the learning rate to achieve better performance 
    - Run the evaluation function


- **Part II: Implementation of a Convolutional Neural Network**
    - Train the CNN and compare it with the feed-forward neural network
    


Let's first import all the packages that you will need.

- **torch, torch.nn, torch.nn.functional** are the core modules of the PyTorch library, which supports building deep learning projects in Python.
- **torchvision** is a library for computer vision that goes hand in hand with PyTorch.
- **numpy** is the fundamental package for scientific computing in Python.
- **matplotlib** is a library to plot graphs and images in Python.
- **math, random** are standard-library Python modules.

In [18]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import random
import math
import numpy as np
import matplotlib.pyplot as plt
from project1_utils import *

print("Import packages successfully!")

Import packages successfully!


A helper function is provided:

```python
def set_seed(seed):
    """
    Use random seed to ensure that results are reproducible.
    """
```
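
The body of `set_seed` lives in `project1_utils` and is not shown here. As a rough idea of what such a helper usually does, here is a minimal sketch. The name `set_seed_sketch` and its contents are assumptions for illustration, not the provided implementation.

```python
import random

import numpy as np
import torch


def set_seed_sketch(seed: int) -> None:
    """Hypothetical stand-in for the provided set_seed helper."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy's global RNG
    torch.manual_seed(seed)           # PyTorch RNG
    torch.cuda.manual_seed_all(seed)  # GPU RNGs (silently ignored without CUDA)


# Re-seeding reproduces the same random draws.
set_seed_sketch(265)
first = (random.random(), np.random.rand(), torch.rand(1).item())
set_seed_sketch(265)
second = (random.random(), np.random.rand(), torch.rand(1).item())
print(first == second)
```

Seeding every random number generator that the project touches is what makes weight initialization and data shuffling repeatable between runs.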

In [19]:
seed = 265
set_seed(seed)

##  Dataset

Let's first load the dataset using PyTorch's dataset and data loader modules.

In [20]:
# the number of images in a batch
batch_size = 8

# load dataset
trainset = dataset(path='use the path of your training set')
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)
testset = dataset(path='use the path of your testing set')
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=True, num_workers=2)

# name of classes
classes = ("define your classes")


Number of training examples: 1034
Number of testing examples: 126


Let's visualize some examples from the dataset. A helper for displaying images is provided:

```python
def imshow(images):
    """
    Display the input images in a plot
    """
```

# Part I

---

## Feed-forward neural network.

In this cell, we will build a **four-layer multilayer perceptron (MLP)** to classify images into different categories. 


The input is a batch of images, i.e. a tensor $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ denotes the batch size. Vectorizing the image pixels means reshaping this tensor into $X_{vector} \in \mathbb{R}^{B \times CHW}$.

To process image data with a feed-forward neural network, the image pixels must be vectorized, so we first implement a vectorization function.

In [22]:
def image_vectorization(image_batch):
    """
    Input: 
        image_batch: a batch of images with shape [b, c, h, w]
    Output: 
        vectorized_image_batch: the batch flattened to shape [b, c*h*w]
    """
    vectorized_image_batch = image_batch.view(image_batch.size(0), -1) # change shape to R^(B, CHW) 
    return vectorized_image_batch
    

In [23]:
# # IGNORE: testing 
# def test_vect(image_batch):
#     print(f"this is my size: {image_batch.size()}") 

# B1 = 10  # Batch size of 10 images
# X_batch1 = torch.randn(B1, 1, 64, 64)
# test_vect(X_batch1)

# vectorized = image_vectorization(X_batch1)
# print(f"this is my size after vect: {vectorized.size()}") 

# # this is my size: torch.Size([10, 1, 64, 64])
# # this is my size after vect: torch.Size([10, 4096])

**Each layer** of an MLP can be written as the following mathematical operation:

$$z = W^T x + b$$ 

Here, $W, b$ denote the weights and biases. The function is **parameterized by $W, b$**.
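
To connect the formula to batched code, the sketch below checks that the batched expression `x.mm(W) + b` gives the same result as applying $z = W^T x + b$ to each sample separately. The tensor sizes are made up for illustration, and `W` is assumed to have shape `(input_dim, output_dim)`.

```python
import torch

torch.manual_seed(0)
B, input_dim, output_dim = 4, 3, 2
x = torch.randn(B, input_dim)           # one sample per row
W = torch.randn(input_dim, output_dim)  # assumed layout: (input_dim, output_dim)
b = torch.zeros(output_dim)

# Batched form: every row of x is multiplied by W at once.
z_batch = x.mm(W) + b

# Per-sample form from the formula: z = W^T x + b for a single vector x.
z_single = torch.stack([W.t().mv(x[i]) + b for i in range(B)])

print(z_batch.shape)
print(torch.allclose(z_batch, z_single, atol=1e-6))
```

Writing the layer in batched form avoids the Python loop and lets one matrix multiplication process the whole mini-batch.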


In [24]:
"""
torch.randn(input_dim, output_dim) creates a tensor of shape (input_dim, output_dim)
filled with random numbers drawn from a standard normal distribution (mean 0, variance 1).

torch.sqrt(torch.tensor(2.0) / (input_dim + output_dim)) is the Xavier/Glorot scaling factor:
for sigmoid or tanh activations, the weights are scaled by the square root of
2 / (number of input units + number of output units).

Multiplying the random tensor by this factor keeps the variance of the activations and gradients
roughly constant across layers, which helps against vanishing/exploding gradients in deep networks.
"""
def get_layer_params(input_dim: int, output_dim: int):
    """
    Input: 
        input_dim: number of neurons in the input
        output_dim: number of neurons produced by the layer
    Output: 
        a dictionary of generated parameters
            - w: weights
            - b: biases
    """
    
    # Weights drawn from a standard normal distribution; the Xavier/Glorot scaling factor is left commented out
    w = torch.randn(input_dim, output_dim, requires_grad=True) #* torch.sqrt(torch.tensor(2.0) / (input_dim + output_dim))
    
    # Biases initialization (zeros)
    b = torch.zeros(output_dim,requires_grad=True)
    
    return {'w': w,
            'b': b}
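
For reference, here is a minimal sketch of how the Xavier/Glorot scaling could be applied. The function name `xavier_layer_params` is hypothetical and not part of the assignment. One detail worth noting: multiplying a tensor that was created with `requires_grad=True` yields a non-leaf tensor whose `.grad` stays `None` after `backward()`, so the sketch scales first and enables gradients afterwards.

```python
import math

import torch


def xavier_layer_params(input_dim: int, output_dim: int) -> dict:
    """Hypothetical variant of get_layer_params with the scaling applied."""
    scale = math.sqrt(2.0 / (input_dim + output_dim))
    # Scale first, then mark the result as a leaf tensor that requires gradients.
    w = (torch.randn(input_dim, output_dim) * scale).requires_grad_()
    b = torch.zeros(output_dim, requires_grad=True)
    return {'w': w, 'b': b}


torch.manual_seed(0)
layer = xavier_layer_params(512, 256)
expected_std = math.sqrt(2.0 / (512 + 256))
print(layer['w'].is_leaf, layer['w'].std().item(), expected_std)
```

The empirical standard deviation of the weights should land close to `expected_std`, which is the property the scaling is meant to provide.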




Following the linear layer, an activation function is needed to add non-linearity to the network:

 $$a = \sigma(z)$$

Here $a$ and $\sigma$ denote the activation output and the activation function, respectively.
The entire layer function is also **parameterized by the choice of $\sigma(\cdot)$**.

**Question 3 (4-3 points):** We need an activation wrapper function to support the following three activation functions.
- Sigmoid
- tanh
- ReLU


In [25]:
def activation_wrapper(x, activation='relu'):
    """
    Input: 
        x: the input neuron values
        activation: name of activation, could be one in ['relu', 'sigmoid', 'tanh']
    Output: 
        a: the corresponding activated output
    """

    if activation == 'relu':
        # ReLU(x) = max(0, x): zero out all negative values by taking the element-wise
        # maximum of x and a scalar zero tensor (broadcast to the shape of x)
        a = torch.max(torch.tensor(0.0), x)

    elif activation == 'sigmoid':
        # sigmoid(x) = 1 / (1 + exp(-x))
        # Note: sigmoid is relatively expensive to compute, can cause vanishing gradients,
        # and is not zero-centred. It is commonly used for binary classification outputs.
        a = 1 / (1 + torch.exp(-x))

    elif activation == 'tanh':
        # tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
        # Note: compared to sigmoid, tanh only fixes the zero-centring problem.
        a = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))

    else:
        raise ValueError(f"Unsupported activation function: {activation}")

    return a
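
One caveat about writing tanh by hand, shown as a small sketch with made-up inputs: for large inputs `torch.exp` overflows in float32, so the manual formula evaluates to `inf / inf = nan`, whereas the built-in `torch.tanh` stays finite.

```python
import torch

x = torch.tensor([0.5, 100.0])  # float32 by default

# Manual formula, as in activation_wrapper: exp(100) overflows to inf in float32,
# so the second entry becomes inf / inf = nan.
manual_tanh = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))

# The built-in implementation does not overflow for large inputs.
builtin_tanh = torch.tanh(x)

print(manual_tanh)
print(builtin_tanh)
```

For moderate inputs both versions agree, so the manual formula is fine for learning purposes. The built-in is the safer choice once activations can grow large.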

Given the layer parameters $W, b$ and the choice of $\sigma(\cdot)$, compute the output for an MLP layer with input $x$. 


In [26]:
def layer_forward_computation(x, params, activation):
    """
    Input: 
        x: the input to the layer
        params: parameters of each layer
        activation: activation type
    Output: 
        a: the output after the activation
    """

    
    # linear part: Z = xW + b, with x of shape [B, input_dim] and w of shape [input_dim, output_dim]
    Z = x.mm(params["w"]) + params["b"]

    # apply activation
    a = activation_wrapper(Z, activation)
    
    return a

---


**Architecture**:

We now describe in detail how our four-layer MLP should be built in PyTorch.

1. In the dataset, each input batch is a tensor $X \in \mathbb{R}^{B \times 1 \times 64 \times 64}$, where $B$ denotes the batch size.
2. Vectorize the image pixels to $X_{vector} \in \mathbb{R}^{B \times 4096}$.
3. The remaining steps describe one specific architecture. This is not the only possible design, so feel free to change the hidden dimensions.
4. Layer1: set your parameters so the input is projected from $\mathbb{R}^{B \times 4096}$ to $\mathbb{R}^{B \times 2048}$, use ReLU as your activation function
5. Layer2: set your parameters so the input is projected from $\mathbb{R}^{B \times 2048}$ to $\mathbb{R}^{B \times 1024}$, use ReLU as your activation function
6. Layer3: set your parameters so the input is projected from $\mathbb{R}^{B \times 1024}$ to $\mathbb{R}^{B \times 256}$, use ReLU as your activation function
7. Layer4: set your parameters so the input is projected from $\mathbb{R}^{B \times 256}$ to $\mathbb{R}^{B \times 2}$, use sigmoid function as your activation function
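
To get a feel for the size of this architecture, the short sketch below counts the weights and biases of each layer from the dimensions listed above.

```python
# Layer sizes taken from the architecture described above.
layer_dims = [4096, 2048, 1024, 256, 2]

total = 0
for fan_in, fan_out in zip(layer_dims[:-1], layer_dims[1:]):
    n_weights = fan_in * fan_out  # entries of W
    n_biases = fan_out            # entries of b
    total += n_weights + n_biases
    print(f"{fan_in:>4} -> {fan_out:>4}: {n_weights + n_biases:,} parameters")

print(f"total: {total:,}")  # 10,751,746
```

Most of the parameters sit in the first layer, which is typical when a fully connected layer is applied directly to flattened image pixels.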

---

In [27]:
layer1_params: dict = dict()
layer2_params: dict = dict()
layer3_params: dict = dict()
layer4_params: dict = dict()

def net(X, params, activations):
    """
    Input: 
        X: the input images to the network
        params: a dictionary of parameters(W and b) for the four different layers
        activations: a dictionary of activation function names for the four different layers
    Output: 
        output: the final output from the four layer
    """
    
    # build your network forward
    # 1- vectorize image
    X_vectorized = image_vectorization(X)

    # 2- Forward Pass Layer 1 
    A1 = layer_forward_computation(X_vectorized, params["layer1"], activations["layer1"])
    
    # 3- Forward pass through Layer 2
    A2 = layer_forward_computation(A1, params['layer2'], activations['layer2'])
    
    # 4- Forward pass through Layer 3
    A3 = layer_forward_computation(A2, params['layer3'], activations['layer3'])
    
    # 5- Forward pass through Layer 4
    output = layer_forward_computation(A3, params['layer4'], activations['layer4'])
    
    return output



In [28]:
""" We prepare several dictionaries to store the parameters and activations for the different layers """
layer1_params: dict = dict()
layer2_params: dict = dict()
layer3_params: dict = dict()
layer4_params: dict = dict()
params: dict = dict()
activations: dict = dict()


# Define layer parameters using the get_layer_params function
layer1_params = get_layer_params(4096, 2048) 
layer2_params = get_layer_params(2048, 1024)
layer3_params = get_layer_params(1024, 256)
layer4_params = get_layer_params(256, 2)

# Pack layer parameters into the params dictionary
params['layer1'] = layer1_params
params['layer2'] = layer2_params
params['layer3'] = layer3_params
params['layer4'] = layer4_params

# Define activations for each layer
activations['layer1'] = 'relu'
activations['layer2'] = 'relu'
activations['layer3'] = 'relu'
activations['layer4'] = 'sigmoid'


In [29]:
# #IGNORE: testing

# # Test case 1
# B1 = 10  # Batch size of 10 images
# X_batch1 = torch.randn(B1, 1, 64, 64)
# output1 = net(X_batch1, params, activations)
# print(f"Test case 1 - Output shape: {output1.shape}")  # Expected shape: [10, 2]

# # Test case 2
# B2 = 5  # Batch size of 5 images
# X_batch2 = torch.randn(B2, 1, 64, 64)
# output2 = net(X_batch2, params, activations)
# print(f"Test case 2 - Output shape: {output2.shape}")  # Expected shape: [5, 2]

# # Test case 1 - Output shape: torch.Size([10, 2])
# # Test case 2 - Output shape: torch.Size([5, 2])

##  Backpropagation and optimization






Gradient descent minimizes the objective function (loss) $J(\theta)$, parameterized by the model's parameters $\theta$, by updating the parameters in the opposite direction of the gradient $\nabla_\theta J(\theta)$. The learning rate $\lambda$ determines the size of the steps taken to reach a (local) minimum.

Vanilla gradient descent, however, runs through all the samples in the training set for a single update, which is time-consuming on large-scale datasets. We instead use Stochastic Gradient Descent, which needs only a subset of the training samples per update. In popular deep learning frameworks, this subset is usually the mini-batch selected during training.

Now, let's look at the equation to update parameters for each layer in your network.

$$\large \theta = \theta - \lambda\cdot\nabla_\theta J(\theta)$$
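
As a worked example of this update rule, consider the toy objective $J(\theta) = (\theta - 3)^2$, chosen here purely for illustration. Its gradient is $2(\theta - 3)$, so repeated updates should move $\theta$ toward 3.

```python
def grad_J(theta: float) -> float:
    """Gradient of the toy objective J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta = 0.0          # initial parameter
learning_rate = 0.1  # lambda in the update rule

for step in range(200):
    theta = theta - learning_rate * grad_J(theta)

print(theta)  # very close to the minimizer theta = 3
```

With this learning rate, each step shrinks the distance to the minimum by a factor of 0.8. A learning rate above 1.0 would make the iterates diverge for this objective.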

---

In [30]:
def update_params(params, learning_rate):
    """
    Input: 
        params: the dictionary storing all the layer parameters
        learning_rate: the step length to update the parameters
    Output: 
        params: the updated parameters
    """
        
    for layer_params in params.values():
        if layer_params['w'].grad is not None: # skip parameters that have no gradient yet
            layer_params['w'].data -= learning_rate * layer_params['w'].grad.data

        if layer_params['b'].grad is not None:
            layer_params['b'].data -= learning_rate * layer_params['b'].grad.data
            
    return params

In [31]:
def zero_grad(params):
    """
    Input: 
        params: the dictionary storing all the layer parameters
    Output: 
        params: the updated parameters with gradients cleared
    """
    # set the gradients with respect to the parameters to zero

    for layer, layer_params in params.items():
        if layer_params['w'].grad is not None:
            layer_params['w'].grad.data.zero_()
        if layer_params['b'].grad is not None:
            layer_params['b'].grad.data.zero_()
    
    return params
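
The functions above modify tensors through `.data`. A commonly used alternative is to wrap the update in `torch.no_grad()`, which keeps autograd from tracking the in-place change. The sketch below combines the update and the gradient reset in one hypothetical function, `sgd_step`. It is an illustration of the pattern, not a required part of the assignment.

```python
import torch


def sgd_step(params: dict, learning_rate: float) -> dict:
    """Hypothetical one-function version of update_params + zero_grad."""
    with torch.no_grad():  # the update itself must not be tracked by autograd
        for layer_params in params.values():
            for tensor in layer_params.values():
                if tensor.grad is not None:
                    tensor -= learning_rate * tensor.grad
                    tensor.grad.zero_()
    return params


# Tiny check: loss = sum(2 * w) + sum(b) has gradient 2 for w and 1 for b.
params = {'layer1': {'w': torch.ones(3, requires_grad=True),
                     'b': torch.zeros(1, requires_grad=True)}}
loss = (2.0 * params['layer1']['w']).sum() + params['layer1']['b'].sum()
loss.backward()
sgd_step(params, learning_rate=0.1)
print(params['layer1']['w'])       # each entry: 1 - 0.1 * 2 = 0.8
print(params['layer1']['w'].grad)  # zeros
```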

    

In [32]:
def backprop(loss, params, learning_rate):
    """
    Input: 
        loss: the loss tensor from the objective function, used to compute gradients
        params: parameters of the four layers
        learning_rate: the size of steps when updating parameters
    Output:
        params: parameters after one backpropagation step
    """    
    
    # 1- calculate gradients for all relevant params (w, b)
    loss.backward()

    # 2- update params after calculating grads
    params = update_params(params, learning_rate)

    # 3- zero gradients so they do not accumulate into future grad calculations
    params = zero_grad(params) 
   
    return params
    


## Training loop

For this binary classification task, the standard **Binary Cross-Entropy Loss** is used as the objective function:

$$\large L = -\frac{1}{N}\sum_{i=1}^{N}( y_i \cdot \log(p(y_i))+(1-y_i)\log(1-p(y_i)))$$

where $y_i$ is the label of sample $i$ (1 for dog and 0 for cat in our case), $p(y_i)$ is the predicted probability, and $N$ is the batch size.
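
As a small worked example of the formula above, the sketch below evaluates the loss by hand for a batch of two samples. The labels and probabilities are made up for illustration.

```python
import math


def binary_cross_entropy(labels, probs):
    """Mean BCE over a batch; probs must lie strictly between 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


labels = [1, 0]      # made-up labels
probs = [0.9, 0.2]   # made-up predicted probabilities of class 1
loss = binary_cross_entropy(labels, probs)
print(round(loss, 4))
```

Both predictions are on the correct side of 0.5, so the loss is small. Swapping the two probabilities would make it much larger.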


An initialization function is provided to help your network converge faster.

```python
def init_params(params):
    """
    Initialize the parameters of each layer
    """
```


In [33]:
def adjust_lr(learning_rate, epoch):
    """
    Input: 
        learning_rate: the input learning rate
        epoch: which epoch you are in
    Output:
        learning_rate: the updated learning rate
    """    
    
    if (epoch + 1) % 15 == 0:
        learning_rate *= 0.3
        
    return learning_rate
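
To see what this schedule does over a full run, the sketch below replays it for 100 epochs starting from a learning rate of 1e-2, the values used in the training cell. The function is repeated so that the snippet runs on its own.

```python
def adjust_lr(learning_rate, epoch):
    """Same rule as above: multiply by 0.3 after every 15th epoch."""
    if (epoch + 1) % 15 == 0:
        learning_rate *= 0.3
    return learning_rate


learning_rate = 1e-2
decay_epochs = []
for epoch in range(100):
    new_lr = adjust_lr(learning_rate, epoch)
    if new_lr != learning_rate:
        decay_epochs.append(epoch + 1)  # 1-based epoch, as in the training log
    learning_rate = new_lr

print(decay_epochs)   # [15, 30, 45, 60, 75, 90]
print(learning_rate)  # roughly 7.29e-06
```

Six decays shrink the learning rate by a factor of $0.3^6 \approx 7.3 \times 10^{-4}$, so updates in the last epochs are very small.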

**Loss function**: see https://pytorch.org/docs/stable/nn.html#loss-functions for the loss functions available in PyTorch.
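
When picking a loss from that list, note that `nn.BCELoss` expects probabilities (for example sigmoid outputs), while `nn.CrossEntropyLoss` applies log-softmax to its input internally. The sketch below feeds the same made-up sigmoid outputs to both and shows that the results differ. The label is given as a class index for `nn.CrossEntropyLoss`, which is equivalent to the one-hot label used for `nn.BCELoss`.

```python
import torch
import torch.nn as nn

# Made-up sigmoid outputs of the last layer for one sample, and its label.
probs = torch.tensor([[0.9, 0.1]])
one_hot = torch.tensor([[1.0, 0.0]])
class_index = torch.tensor([0])  # same label as one_hot, as a class index

# BCELoss matches the binary cross-entropy formula, averaged over all entries.
bce = nn.BCELoss()(probs, one_hot)

# CrossEntropyLoss first applies log-softmax to its input, so feeding it
# sigmoid outputs yields a different quantity.
ce = nn.CrossEntropyLoss()(probs, class_index)

print(bce.item(), ce.item())
```

Both losses still decrease as the prediction improves, but only `nn.BCELoss` reproduces the binary cross-entropy formula when the network already ends in a sigmoid.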

In [34]:
# define the initial learning rate here
learning_rate = 1e-2
n_epochs = 100 # how many epochs to run

# define loss function for this binary classification

# Note: nn.CrossEntropyLoss applies log-softmax to its input internally, so it is not
# the Binary Cross-Entropy described above; nn.BCELoss() would match that formula
# for the sigmoid outputs of layer 4.
criterion = nn.CrossEntropyLoss()

# initialize network parameters
# print(f"this is params: {params}")
# for layer in params:
#     print(f"Layer {layer} Weights requires_grad: {params[layer]['w'].requires_grad}")
#     print(f"Layer {layer} Biases requires_grad: {params[layer]['b'].requires_grad}")
init_params(params)
    
for epoch in range(n_epochs):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):

        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        labels = labels.float()
        inputs = inputs.float()

        # Forward 
        output = net(inputs, params, activations)
        
        # Compute the loss using the final output
        loss = criterion(output, labels)
        
        # Backpropagation
        params = backprop(loss, params, learning_rate)
        
        # print statistics
        running_loss += loss.item()
        if i % 10 == 9:  # print every 10 mini-batches
            print('[Epoch %d, Step %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 10))
            running_loss = 0.0
            
    # adjust learning rate
    learning_rate = adjust_lr(learning_rate, epoch)
print('Finished Training')

[Epoch 1, Step    10] loss: 0.656
[Epoch 1, Step    20] loss: 0.668
[Epoch 1, Step    30] loss: 0.576
[Epoch 1, Step    40] loss: 0.553
[Epoch 1, Step    50] loss: 0.541
[Epoch 1, Step    60] loss: 0.432
[Epoch 1, Step    70] loss: 0.510
[Epoch 1, Step    80] loss: 0.482
[Epoch 1, Step    90] loss: 0.534
[Epoch 1, Step   100] loss: 0.560
[Epoch 1, Step   110] loss: 0.540
[Epoch 1, Step   120] loss: 0.479
[Epoch 1, Step   130] loss: 0.523
[Epoch 2, Step    10] loss: 0.492
[Epoch 2, Step    20] loss: 0.465
[Epoch 2, Step    30] loss: 0.452
[Epoch 2, Step    40] loss: 0.500
[Epoch 2, Step    50] loss: 0.456
[Epoch 2, Step    60] loss: 0.520
[Epoch 2, Step    70] loss: 0.510
[Epoch 2, Step    80] loss: 0.470
[Epoch 2, Step    90] loss: 0.475
[Epoch 2, Step   100] loss: 0.445
[Epoch 2, Step   110] loss: 0.468
[Epoch 2, Step   120] loss: 0.450
[Epoch 2, Step   130] loss: 0.445
[Epoch 3, Step    10] loss: 0.457
[Epoch 3, Step    20] loss: 0.454
[Epoch 3, Step    30] loss: 0.482
[Epoch 3, Step

**Evaluation**: Now test your trained model on the whole test set!

In [36]:
correct = 0
total = 0

# since you're not training, you don't need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
        images, labels = data
        _, labels = torch.max(labels, 1)
        
        # calculate outputs by running images through the network
        output = net(images, params, activations)

        # the class with the highest energy is what we choose as prediction
        _, predicted = torch.max(output.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the test images: %d %%' % (
        100 * correct / total))

Accuracy of the network on the test images: 84 %


**Evaluation**: Test your trained model on each class.

In [37]:
# prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}

# again no gradients needed
with torch.no_grad():
    for data in testloader:
        images, labels = data
        _, labels = torch.max(labels, 1)
        output = net(images, params, activations)
        _, predictions = torch.max(output, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels, predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1

# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print("Accuracy for class {:5s} is: {:.1f} %".format(classname,
                                                         accuracy))

Accuracy for class panda is: 82.0 %
Accuracy for class grizzly is: 86.8 %


# Part II

---

## Convolutional neural network.


**Architecture**:

1. CNN Layer1
2. CNN Layer2
3. Pooling Layer
4. FC layer

In [38]:
"""
IGNORE: NOTES FOR PERSONAL USE

the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), 
the stride with which they are applied (S), and the amount of zero padding used (P) on the border. 

the Conv Layer:

Accepts a volume of size W1×H1×D1
Requires four hyperparameters:
Number of filters K, their spatial extent F,the stride S ,the amount of zero padding P.
Produces a volume of size W2×H2×D2
 where:
W2=(W1−F+2P)/S+1
H2=(H1−F+2P)/S+1
 (i.e. width and height are computed equally by symmetry)

pooling layer: Accepts a volume of size W1×H1×D1
Requires two hyperparameters:
their spatial extent F,the stride S ,Produces a volume of size W2×H2×D2
where:
W2=(W1−F)/S+1
H2=(H1−F)/S+1
D2=D1
"""


In [39]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # the conv layer applies a bank of small learned filters to the image and does most of the heavy lifting; each filter produces a 2-D activation map
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=5, padding=2) # padding=2 keeps the 64x64 spatial size
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2) # halve the width and height
        self.conv2 = nn.Conv2d(in_channels=64, out_channels=256, kernel_size=5)
        self.fc1 = nn.Linear(in_features=256*14*14, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=2) # compute class scores for the two classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x))) # uses relu activation method 
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 256*14*14)
        x = F.relu(self.fc1(x))
        output = torch.sigmoid(self.fc2(x))  # Using sigmoid activation for the final layer
        return output
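
The value `256*14*14` used for `fc1` follows from the output-size formula $(W - F + 2P)/S + 1$ applied to each layer in turn. The sketch below traces the spatial size for a 64×64 input. The helper name `conv_out` is made up for this illustration.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a conv or pooling layer: (W - F + 2P) / S + 1."""
    return (size - kernel + 2 * padding) // stride + 1


size = 64                                   # input images are 1 x 64 x 64
size = conv_out(size, kernel=5, padding=2)  # conv1 -> 64
size = conv_out(size, kernel=2, stride=2)   # pool  -> 32
size = conv_out(size, kernel=5)             # conv2 -> 28
size = conv_out(size, kernel=2, stride=2)   # pool  -> 14

flat_features = 256 * size * size           # 256 channels out of conv2
print(size, flat_features)                  # 14 50176
```

If the kernel sizes, padding, or input resolution change, the same trace gives the new `in_features` for the first fully connected layer.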


In [40]:
# define the initial learning rate here
learning_rate = 1e-2
n_epochs = 50 # how many epochs to run


# define loss function
criterion = nn.BCELoss()

cnn_net = Net().cuda()
optimizer = torch.optim.SGD(cnn_net.parameters(), lr=learning_rate)

for epoch in range(n_epochs):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):

        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        labels = labels.float().cuda()
        

        # Forward 
        output = cnn_net(inputs.cuda())
        
        # Compute the loss using the final output
        loss = criterion(output, labels)

        # Backpropagation: clear old gradients, compute new ones, then take an SGD step
        optimizer.zero_grad()

        loss.backward()

        optimizer.step()
        
        # print statistics
        running_loss += loss.item()
        if i % 10 == 9:  # print every 10 mini-batches
            print('[Epoch %d, Step %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 10))
            running_loss = 0.0

print('Finished Training')

[Epoch 1, Step    10] loss: 0.686
[Epoch 1, Step    20] loss: 0.686
[Epoch 1, Step    30] loss: 0.680
[Epoch 1, Step    40] loss: 0.674
[Epoch 1, Step    50] loss: 0.645
[Epoch 1, Step    60] loss: 0.631
[Epoch 1, Step    70] loss: 0.648
[Epoch 1, Step    80] loss: 0.616
[Epoch 1, Step    90] loss: 0.573
[Epoch 1, Step   100] loss: 0.637
[Epoch 1, Step   110] loss: 0.525
[Epoch 1, Step   120] loss: 0.615
[Epoch 1, Step   130] loss: 0.588
[Epoch 2, Step    10] loss: 0.520
[Epoch 2, Step    20] loss: 0.525
[Epoch 2, Step    30] loss: 0.587
[Epoch 2, Step    40] loss: 0.532
[Epoch 2, Step    50] loss: 0.563
[Epoch 2, Step    60] loss: 0.486
[Epoch 2, Step    70] loss: 0.464
[Epoch 2, Step    80] loss: 0.525
[Epoch 2, Step    90] loss: 0.513
[Epoch 2, Step   100] loss: 0.414
[Epoch 2, Step   110] loss: 0.544
[Epoch 2, Step   120] loss: 0.485
[Epoch 2, Step   130] loss: 0.512
[Epoch 3, Step    10] loss: 0.345
[Epoch 3, Step    20] loss: 0.389
[Epoch 3, Step    30] loss: 0.346
[Epoch 3, Step

**Evaluation**: Now test your trained model on the whole test set!

In [41]:
correct = 0
total = 0

# since you're not training, you don't need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
        images, labels = data
        _, labels = torch.max(labels, 1)
        images = images.cuda()
        labels = labels.cuda()
        # calculate outputs by running images through the network
        output = cnn_net(images)

        # the class with the highest energy is what we choose as prediction
        _, predicted = torch.max(output.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the test images: %d %%' % (
        100 * correct / total))

Accuracy of the network on the test images: 96 %


**Evaluation**: Test your trained model on each class.

In [42]:
# prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}

# again no gradients needed
with torch.no_grad():
    for data in testloader:
        images, labels = data
        _, labels = torch.max(labels, 1)
        images = images.cuda()
        labels = labels.cuda()
        output = cnn_net(images)
        _, predictions = torch.max(output, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels, predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1

# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print("Accuracy for class {:5s} is: {:.1f} %".format(classname,
                                                         accuracy))


Accuracy for class panda is: 98.0 %
Accuracy for class grizzly is: 94.7 %
