
# Task 1

### import needed libraries

In [15]:
# import needed libraries

import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
import numpy as np

### Train models

In [16]:
# use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# print("Using device:", device)

# make 2 different networks
net1 = Net().to(device)
net2 = Net().to(device)


# shuffle the data differently for each network by seeding each DataLoader generator separately
order1 = np.random.randint(0, 100000)  # random seed for the first network
order2 = np.random.randint(0, 100000)  # random seed for the second network

rand_shuffle_net1 = torch.Generator().manual_seed(order1)  # manual_seed fixes the starting point of the shuffle at order1
rand_shuffle_net2 = torch.Generator().manual_seed(order2)

batch_size = 4

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)

# trainset is the same dataset for both but independently shuffled
trainset1 = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainset2 = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)

trainloader1 = torch.utils.data.DataLoader(trainset1, batch_size=batch_size,
                                          shuffle=True, num_workers=2,
                                          generator=rand_shuffle_net1)

trainloader2 = torch.utils.data.DataLoader(trainset2, batch_size=batch_size,
                                          shuffle=True, num_workers=2,
                                          generator=rand_shuffle_net2)

# both networks have testloader
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)


# independent loss functions + optimizers
criterion1 = nn.CrossEntropyLoss()
criterion2 = nn.CrossEntropyLoss()

optimizer1 = optim.SGD(net1.parameters(), lr=0.001, momentum=0.9)
optimizer2 = optim.SGD(net2.parameters(), lr=0.001, momentum=0.9)


# Train loop for net1
print("Training Model 1")

for epoch in range(5):

    running_loss = 0.0
    for i, data in enumerate(trainloader1, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].to(device), data[1].to(device)

        # zero the parameter gradients
        optimizer1.zero_grad()

        # forward + backward + optimize
        outputs = net1(inputs)
        loss = criterion1(outputs, labels)
        loss.backward()
        optimizer1.step()

        running_loss += loss.item()
        # print statistics
        if i % 2000 == 1999:
            print(f"[{epoch+1}, {i+1}] loss: {running_loss/2000:.3f}")
            running_loss = 0.0
print("Finished training model 1\n")


# train loop for net2
print("Training Model 2")


for epoch in range(5):

    running_loss = 0.0
    for i, data in enumerate(trainloader2, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].to(device), data[1].to(device)

        # zero the parameter gradients
        optimizer2.zero_grad()

        # forward + backward + optimize
        outputs = net2(inputs)
        loss = criterion2(outputs, labels)
        loss.backward()
        optimizer2.step()

        running_loss += loss.item()
        # print statistics
        if i % 2000 == 1999:
            print(f"[{epoch+1}, {i+1}] loss: {running_loss/2000:.3f}")
            running_loss = 0.0

print("Finished training model 2\n")


Training Model 1
[1, 2000] loss: 2.173
[1, 4000] loss: 1.827
[1, 6000] loss: 1.651
[1, 8000] loss: 1.581
[1, 10000] loss: 1.503
[1, 12000] loss: 1.451
[2, 2000] loss: 1.380
[2, 4000] loss: 1.348
[2, 6000] loss: 1.330
[2, 8000] loss: 1.287
[2, 10000] loss: 1.284
[2, 12000] loss: 1.269
[3, 2000] loss: 1.176
[3, 4000] loss: 1.184
[3, 6000] loss: 1.181
[3, 8000] loss: 1.152
[3, 10000] loss: 1.164
[3, 12000] loss: 1.160
[4, 2000] loss: 1.083
[4, 4000] loss: 1.079
[4, 6000] loss: 1.071
[4, 8000] loss: 1.070
[4, 10000] loss: 1.063
[4, 12000] loss: 1.064
[5, 2000] loss: 0.986
[5, 4000] loss: 0.985
[5, 6000] loss: 1.009
[5, 8000] loss: 1.013
[5, 10000] loss: 1.022
[5, 12000] loss: 0.999
Finished training model 1

Training Model 2
[1, 2000] loss: 2.127
[1, 4000] loss: 1.785
[1, 6000] loss: 1.625
[1, 8000] loss: 1.555
[1, 10000] loss: 1.512
[1, 12000] loss: 1.458
[2, 2000] loss: 1.409
[2, 4000] loss: 1.363
[2, 6000] loss: 1.338
[2, 8000] loss: 1.323
[2, 10000] loss: 1.311
[2, 12000] loss: 1.282
[

### Get accuracy and agreement score

In [24]:
# get test accuracy for both models
def test_accuracy(net):
    correct = 0
    total = 0
    # since we're not training, we don't need to calculate the gradients for our outputs
    with torch.no_grad():
        for data in testloader:
            # images, labels = data # had to change this line
            images, labels = data[0].to(device), data[1].to(device)
            # calculate outputs by running images through the network
            outputs = net(images)
            # the class with the highest energy is what we choose as prediction
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return 100.0 * correct / total

# print results
accuracy1 = test_accuracy(net1)
accuracy2 = test_accuracy(net2)
print(f'Accuracy of the net1 on the 10000 test images: {accuracy1:.2f} %')
print(f'Accuracy of the net2 on the 10000 test images: {accuracy2:.2f} %')


# calculate agreement score between both networks
def ascore(net1, net2, loader):

    matches = 0 # count of images where both networks predict the same class
    total = 0 # count of images seen so far

    # again no gradients needed
    with torch.no_grad():
        for images, _ in loader:   # labels are not needed; we only check whether the two predictions match
            images = images.to(device)

            # top1 predictions from both networks
            outputs1 = net1(images)
            outputs2 = net2(images)

            _, predicted1 = torch.max(outputs1, 1) # the class with the highest score out of all 10 is the prediction
            _, predicted2 = torch.max(outputs2, 1)

            # count images where both predictions match
            matches += (predicted1 == predicted2).sum().item()
            total += images.size(0)

    return 100.0 * matches / total

agreement_score = ascore(net1, net2, testloader)
print(f"Agreement Score between net1 and net2: {agreement_score:.2f}%")


Accuracy of the net1 on the 10000 test images: 63.73 %
Accuracy of the net2 on the 10000 test images: 59.60 %
Agreement Score between net1 and net2: 60.60%


## Analysis

In this task, I initialized and trained two different convolutional neural networks (CNNs), net1 and net2, on CIFAR-10 for five epochs each. The only differences between the two models were their random weight initializations and the random order in which they saw the training data. These small sources of randomness naturally lead each model down a different optimization path, which explains why their final accuracies are not identical. Net1 achieved a top-1 accuracy of 63.73% on the test set, while net2 reached 59.60%.

When comparing the predictions of the two networks directly, I found that they agreed on 60.60% of the test set. This agreement score reflects how often both models predicted the same class label, even if that label was incorrect. The value being close to the individual accuracies suggests that the models learn broadly similar decision rules but still diverge on a large share of the images. Since the networks start from different random weights and see the training samples in different orders, they naturally form slightly different decision boundaries, especially with the limited training time of only 5 epochs.

The agreement score is nowhere close to 100% because the two networks are sensitive to initialization and data ordering, and some of the CIFAR-10 classes share similar visual features that are hard for the models to distinguish (e.g., cat vs. dog and truck vs. car). The stochastic differences lead the models to emphasize different features during training. As a result, they may agree on some images and confidently disagree on others. Overall, the results match the expected behavior of independently trained neural networks and show how randomness affects the training and prediction outcomes in computer vision.
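
The agreement score mixes two cases: images that both networks classify correctly and images where both pick the same wrong label. Since the share of images both networks get right cannot exceed the lower accuracy (59.60%), an agreement of 60.60% implies that at least 1% of the test images receive the same wrong label from both networks. The sketch below shows one way such a breakdown could be computed. The helper `agreement_breakdown` and the toy predictions are hypothetical and are not part of the notebook's run.

```python
import numpy as np

def agreement_breakdown(pred1, pred2, labels):
    """Split the agreement between two models into shared-correct and shared-wrong parts (in %)."""
    pred1, pred2, labels = map(np.asarray, (pred1, pred2, labels))
    agree = pred1 == pred2                    # both models predict the same class
    both_correct = agree & (pred1 == labels)  # ...and that class is the true label
    return {
        "agreement": 100.0 * agree.mean(),
        "both_correct": 100.0 * both_correct.mean(),
        "same_wrong_label": 100.0 * (agree & ~both_correct).mean(),
    }

# Hypothetical toy predictions for six images (illustration only)
labels = [0, 1, 2, 3, 4, 5]
pred1 = [0, 1, 2, 9, 4, 7]
pred2 = [0, 1, 8, 9, 3, 7]

breakdown = agreement_breakdown(pred1, pred2, labels)
for name, value in breakdown.items():
    print(f"{name}: {value:.2f}%")
```

In the notebook, the three arrays could be collected by concatenating the per-batch predictions of `net1` and `net2` over `testloader`.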

# Task 2

### get samples set for each class

In [35]:
# get 100 test images from each class
random.seed(0)  # make results reproducible

# dictionary to store test image indices for each class
indices_by_class = {a: [] for a in range(10)}

# fill the dictionary with the indices belonging to each class
for ind, (_, label) in enumerate(testset): # place each image in designated class
    indices_by_class[label].append(ind)

# sample 100 indices from each class
sampled_indices_by_class = {}
for a in range(10):
    sampled_indices_by_class[a] = random.sample(indices_by_class[a], 100)


### calculate entropy for each class

In [52]:
# get average entropy for each class
def avg_entropy(net, testset, sampled_indices_by_class):

    avg_entropy = {}
    with torch.no_grad():
        for a in range(10):
            entropies = []

            # fetch all sampled images for class a
            indices = sampled_indices_by_class[a]

            for index in indices:
                image, _ = testset[index]
                image = image.unsqueeze(0).to(device)  # add batch dimension because pytorch expects 4D input for CNN

                # forward pass -> probabilities
                logits = net(image) # network outputs raw scores not probabilities
                probs = F.softmax(logits, dim=1) # use softmax to get probabilities and all values sum to 1

                # entropy is the -sum(p * log(p))
                entropy = -torch.sum(probs * torch.log(probs + 1e-12))
                entropies.append(entropy.item()) # add entropy to list

            # average entropy for this class
            avg_entropy[a] = sum(entropies) / len(entropies)

    return avg_entropy

net1_entrop = avg_entropy(net1, testset, sampled_indices_by_class)

print("Average entropy per class for net1:\n")
for classname in range(10):
    print(f"Class {classes[classname]:5s}: {net1_entrop[classname]:.4f}")



Average entropy per class for net1:

Class plane: 1.0838
Class car  : 0.6380
Class bird : 1.0738
Class cat  : 1.2797
Class deer : 1.3570
Class dog  : 1.0907
Class frog : 0.8397
Class horse: 0.9125
Class ship : 0.7504
Class truck: 1.0301


### find accuracy for each class

In [59]:
def per_class_accuracy(net, testloader, classes):
    # prepare to count predictions for each class
    correct_pred = {classname: 0 for classname in classes}
    total_pred = {classname: 0 for classname in classes}

    # again no gradients needed
    with torch.no_grad():
        for images, labels in testloader: # unpack each batch in the for loop
            images, labels = images.to(device), labels.to(device) # move the batch to the same device as the network
            outputs = net(images)
            _, predictions = torch.max(outputs, 1)

            # collect the correct predictions for each class
            for label, prediction in zip(labels, predictions):
                label_name = classes[label.item()]
                if label == prediction:
                    correct_pred[label_name] += 1
                total_pred[label_name] += 1

    per_class_acc = {}
    for classname in classes:
        per_class_acc[classname] = 100.0 * float(correct_pred[classname]) / total_pred[classname]
    return per_class_acc

net1_class_acc = per_class_accuracy(net1, testloader, classes)

# print accuracy for each class
print("Per-class accuracy for net1:\n")
for classname in classes:
    print(f'Accuracy for class: {classname:5s} is {net1_class_acc[classname]:.2f}%')

Per-class accuracy for net1:

Accuracy for class: plane is 59.90%
Accuracy for class: car   is 76.30%
Accuracy for class: bird  is 52.90%
Accuracy for class: cat   is 45.60%
Accuracy for class: deer  is 51.60%
Accuracy for class: dog   is 55.80%
Accuracy for class: frog  is 78.10%
Accuracy for class: horse is 67.50%
Accuracy for class: ship  is 78.50%
Accuracy for class: truck is 71.10%


### Analysis

For the second task, I examined how confident the network is when making predictions by computing the entropy of its softmax output for 100 randomly sampled test images from each class. Low entropy indicates high confidence in the class prediction, while high entropy suggests uncertainty. Classes such as car, frog, and ship had relatively low entropy scores (0.64, 0.84, and 0.75), meaning the network tended to produce confident probability distributions for these images. In contrast, cat, deer, and dog showed higher entropy (1.28, 1.36, and 1.09), which indicates more uncertainty in their predictions.
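
To give these numbers a scale, the sketch below computes the softmax entropy for two hypothetical 10-class logit vectors: one with a single dominant score and one that is completely flat. It uses NumPy instead of PyTorch to stay self-contained, and the logits are made up for illustration. With 10 classes the entropy can range from 0 up to ln(10) ≈ 2.30.

```python
import numpy as np

def softmax_entropy(logits):
    """Entropy (in nats) of the softmax distribution over one logit vector."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

# Hypothetical logits (illustration only, not outputs of net1)
peaked = softmax_entropy([10.0] + [0.0] * 9)  # one class dominates
flat = softmax_entropy([0.0] * 10)            # all classes equally likely

print(f"peaked: {peaked:.4f}")
print(f"flat:   {flat:.4f}")
print(f"ln(10): {np.log(10):.4f}")
```

The per-class averages reported above (0.64 to 1.36) sit between these two extremes.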

When the entropies are lined up with the per-class accuracies, a clear pattern emerges: classes with low entropy generally have higher accuracy, and classes with high entropy generally have lower accuracy. For example, car, frog, and ship are the three lowest-entropy classes and all exceed 76% accuracy. Truck also exceeds 70% accuracy, although its entropy (1.03) sits in the middle of the range, so the relationship is not exact. On the other hand, cat and deer have the highest average entropy and the two lowest accuracies (45.60% and 51.60%), and bird combines above-average entropy with 52.90% accuracy. This relationship makes sense because when the model is less certain (high entropy), it tends to make more mistakes, and when it is more confident (low entropy), it is more often correct.
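
One way to quantify this relationship is the Pearson correlation between the per-class entropy and the per-class accuracy. The sketch below uses the values copied from the two output cells above. Note the assumption it inherits: the entropies come from 100 sampled images per class, while the accuracies come from all test images of each class, so the result is only a rough indicator.

```python
import numpy as np

# Values copied from the per-class outputs reported above for net1
classes = ["plane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
entropy = np.array([1.0838, 0.6380, 1.0738, 1.2797, 1.3570, 1.0907, 0.8397, 0.9125, 0.7504, 1.0301])
accuracy = np.array([59.90, 76.30, 52.90, 45.60, 51.60, 55.80, 78.10, 67.50, 78.50, 71.10])

# Pearson correlation: a negative value means higher entropy goes with lower accuracy
r = float(np.corrcoef(entropy, accuracy)[0, 1])
print(f"Pearson r between entropy and accuracy: {r:.3f}")

# List the classes from most to least confident for a side-by-side view
for i in np.argsort(entropy):
    print(f"{classes[i]:5s} entropy={entropy[i]:.4f} accuracy={accuracy[i]:.2f}%")
```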

From this task, we see that the model's confidence is a useful indicator of its performance. Classes with visually distinct features, such as cars and ships, lead the model to make more confident and accurate predictions. Meanwhile, classes with higher variation or closer similarity to another class, such as cats and dogs, cause the model to be more uncertain and less accurate.

# Task 3

### get new cifar100 images

In [78]:
# get Cifar100 data set
cifar100 = torchvision.datasets.CIFAR100(root='./data', train=False,
                                              download=True, transform=transform)
cifar100_classes = cifar100.classes

### group images based on their classes

In [91]:
# make dictionary to store and group indices into their own classes
indices_c100 = {c: [] for c in range(100)}

for index, (_, label) in enumerate(cifar100):
    indices_c100[label].append(index) # add right image label to the corresponding class


### get probability distribution of class

In [99]:
# for each class get the prediction distribution
def pred_distrib(net, data, indices_by_class):
    class_distributions = {}

    # gradient not needed
    with torch.no_grad():
        for class_id in range(100): # go through all 100 classes for cifar100 dataset
            counts = torch.zeros(10)  # cifar10 has 10 classes

            for class_index in indices_by_class[class_id]:  # loop over 100 test images belonging to this class id
                image, _ = data[class_index] # ignore label
                image = image.unsqueeze(0).to(device) # add batch dimension because pytorch expects 4D input for CNN

                logits = net(image)
                _, pred = torch.max(logits, 1)  # cifar10 prediction

                counts[pred.item()] += 1

            class_distributions[class_id] = counts # store this per class result

    return class_distributions

c100_distribs = pred_distrib(net1, cifar100, indices_c100)

# convert counts to probability distribution
prob_distribs = {c: dist / dist.sum() for c, dist in c100_distribs.items()}

### find entropy of each class

In [107]:
# compute entropy for each class
def prob_distr_entropy(p):
    p = p + 1e-12
    return float(-(p * torch.log(p)).sum())

class_entropies = {} # create dict to store entropy for all 100 classes
for class_id in range(100):
    prob_vec = prob_distribs[class_id]   # 10-class probability distribution
    class_entropies[class_id] = prob_distr_entropy(prob_vec)


### find best and worst classes for entropy

In [96]:
# convert entropy dictionary to list of (class_index, entropy)
entropy_list = list(class_entropies.items())

# selection sort based on entropy (no lambda)
for i in range(len(entropy_list)):
    for j in range(i + 1, len(entropy_list)):
        if entropy_list[j][1] < entropy_list[i][1]: # sort in ascending order by entropy
            entropy_list[i], entropy_list[j] = entropy_list[j], entropy_list[i]

# get lowest and highest 5 classes
lowest_5  = entropy_list[:5]
highest_5 = entropy_list[-5:]

# print results
print("\nLowest 5 entropy classes:")
for classname, ent in lowest_5:
    print(f"Class {classname:3d} ({cifar100_classes[classname]}): {ent:.4f}")

print("\nHighest 5 entropy classes:")
for classname, ent in highest_5:
    print(f"Class {classname:3d} ({cifar100_classes[classname]}): {ent:.4f}")



Lowest 5 entropy classes:
Class  71 (sea): 1.0272
Class  58 (pickup_truck): 1.0765
Class   2 (baby): 1.3340
Class  13 (bus): 1.3534
Class  36 (hamster): 1.3587

Highest 5 entropy classes:
Class  91 (trout): 2.1523
Class  55 (otter): 2.1553
Class  99 (worm): 2.1610
Class  84 (table): 2.1752
Class  75 (skunk): 2.2401


### Gaussian Noise

In [116]:
# create gaussian noise
# create 100 Gaussian noise images (mean=0, std=1) with shape [batch_size=100, channels=3, height=32, width=32]
noise_images = torch.randn(100, 3, 32, 32).to(device) # torch.randn() draws from a normal distribution with mean = 0 and standard deviation = 1

# predict classes given noise
with torch.no_grad():
    logits = net1(noise_images) # outputs for all 100 noise images
    preds = logits.argmax(dim=1) # top1 class predictions

# count predictions
noise_counts = torch.bincount(preds, minlength=10)
noise_prob = noise_counts / noise_counts.sum()

# compute entropy of the noise distribution
noise_entropy = prob_distr_entropy(noise_prob)

# print results
print("Noise prediction distribution:\n", noise_prob)
print("\nNoise entropy:", noise_entropy)


Noise prediction distribution:
 tensor([0.0000, 0.0700, 0.0000, 0.0100, 0.0000, 0.0000, 0.0400, 0.0500, 0.0000,
        0.8300], device='cuda:0')

Noise entropy: 0.6653951406478882


### Analysis

For each CIFAR-100 class, I collected all 100 predictions made by the network and transformed them into a probability distribution over the ten CIFAR-10 labels. The lowest-entropy classes, such as sea, pickup_truck, and bus, produced concentrated prediction distributions, meaning the model consistently mapped most images from these categories into a single CIFAR-10 class. In contrast, the highest-entropy classes, such as trout, otter, worm, and skunk, generated very scattered predictions, suggesting the network had difficulty associating these image types with any single CIFAR-10 category.

These entropy patterns reflect how visually related each CIFAR-100 class is to the CIFAR-10 classes the network was trained on. Classes with low entropy often resemble or share features with a specific CIFAR-10 category. For example, the pickup_truck is visually similar to CIFAR-10’s “truck”, and sea often contains textures or colors that the model might consistently map to classes like “ship.” On the other hand, classes like trout, otter, and worm do not strongly resemble any one CIFAR-10 class, leading to a higher entropy because the network spreads its predictions across many categories. This behavior makes sense given that CIFAR-10 contains only a few broad animal categories (bird, cat, deer, dog, frog, horse), while many CIFAR-100 classes represent more specific species or objects (for example, a bus resembles a truck without being one) with features that do not exactly match the CIFAR-10 classes.

Finally, when the network was fed 100 samples of pure Gaussian noise, the predictions collapsed heavily into a few classes, most notably truck, which accounted for 83% of predictions. The resulting entropy was relatively low (0.665), below even the lowest CIFAR-100 class entropy (1.03 for sea), showing that the model labeled the noise very consistently even though the inputs contained no meaningful structure. Because this entropy is computed over the predicted labels, it measures how concentrated the predictions are rather than the softmax confidence of each individual prediction. Instead of recognizing that noise is nonsensical, the network has no way to abstain, so it maps every input to one of its ten learned classes.
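
As a sanity check on the reported value, the sketch below recomputes the entropy from the printed noise prediction distribution and compares it with ln(10), the entropy of predictions spread evenly over all ten classes. It is a NumPy re-implementation written for illustration; the function name `distribution_entropy` is hypothetical and mirrors `prob_distr_entropy` from the notebook.

```python
import numpy as np

def distribution_entropy(p):
    """Entropy (in nats) of a discrete distribution, with a small epsilon to avoid log(0)."""
    p = np.asarray(p, dtype=np.float64) + 1e-12
    return float(-(p * np.log(p)).sum())

# Noise prediction distribution copied from the output above
noise_prob = [0.00, 0.07, 0.00, 0.01, 0.00, 0.00, 0.04, 0.05, 0.00, 0.83]

noise_entropy = distribution_entropy(noise_prob)
max_entropy = float(np.log(10))  # entropy if predictions were uniform over 10 classes

print(f"noise entropy:        {noise_entropy:.4f}")
print(f"maximum entropy:      {max_entropy:.4f}")
print(f"fraction of maximum:  {noise_entropy / max_entropy:.2f}")
```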