In [2]:
import torch
torch.cuda.is_available()

True

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has grown in popularity since its inception in 2010. The training set for ILSVRC consists of 1.2 million images, each labeled with one of 1,000 nouns (for example, “dog”), referred to as the class of the image.



In [5]:
from torchvision import models

In [6]:
dir(models)

['AlexNet',
 'AlexNet_Weights',
 'ConvNeXt',
 'ConvNeXt_Base_Weights',
 'ConvNeXt_Large_Weights',
 'ConvNeXt_Small_Weights',
 'ConvNeXt_Tiny_Weights',
 'DenseNet',
 'DenseNet121_Weights',
 'DenseNet161_Weights',
 'DenseNet169_Weights',
 'DenseNet201_Weights',
 'EfficientNet',
 'EfficientNet_B0_Weights',
 'EfficientNet_B1_Weights',
 'EfficientNet_B2_Weights',
 'EfficientNet_B3_Weights',
 'EfficientNet_B4_Weights',
 'EfficientNet_B5_Weights',
 'EfficientNet_B6_Weights',
 'EfficientNet_B7_Weights',
 'EfficientNet_V2_L_Weights',
 'EfficientNet_V2_M_Weights',
 'EfficientNet_V2_S_Weights',
 'GoogLeNet',
 'GoogLeNetOutputs',
 'GoogLeNet_Weights',
 'Inception3',
 'InceptionOutputs',
 'Inception_V3_Weights',
 'MNASNet',
 'MNASNet0_5_Weights',
 'MNASNet0_75_Weights',
 'MNASNet1_0_Weights',
 'MNASNet1_3_Weights',
 'MaxVit',
 'MaxVit_T_Weights',
 'MobileNetV2',
 'MobileNetV3',
 'MobileNet_V2_Weights',
 'MobileNet_V3_Large_Weights',
 'MobileNet_V3_Small_Weights',
 'RegNet',
 'RegNet_X_16GF_Weights'

## AlexNet
The AlexNet architecture won the 2012 ILSVRC by a large margin, with a top-5 test
error rate (that is, the correct label must be in the top 5 predictions) of 15.4%. By
comparison, the second-best submission, which wasn’t based on a deep network,
trailed at 26.2%.
In order to run the AlexNet architecture on an input image, we can create an
instance of the AlexNet class. This is how it’s done:

In [7]:
alexnet = models.AlexNet()
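
To make the top-5 error rate quoted above concrete, here is a minimal sketch in plain Python. The function name `top5_error` and the toy scores are made up for illustration; they are not part of torchvision or of the ILSVRC tooling.

```python
def top5_error(scores, labels):
    """Fraction of samples whose true label is not among the 5 highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the five largest scores in this row.
        top5 = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:5]
        if label not in top5:
            misses += 1
    return misses / len(labels)

# Toy data: 2 samples, 8 classes (made-up scores).
scores = [
    [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.6, 0.5],  # top 5 classes: 1, 3, 5, 6, 7
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.1, 0.2, 0.3],  # top 5 classes: 0, 1, 2, 3, 4
]
labels = [7, 6]  # sample 0 is a hit (7 is in its top 5), sample 1 is a miss

error = top5_error(scores, labels)
print(error)  # 0.5
```

A prediction counts as correct as long as the true class appears anywhere in the five best-scoring classes, which is why top-5 error is always at most the top-1 error.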

## ResNet
Just to put things in perspective, before the advent of residual networks in 2015, achieving stable training at depths of a hundred or so layers was considered extremely hard. Residual networks pulled a trick that made it possible and, in doing so, beat several benchmarks in one sweep that year.
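
The trick is not spelled out here. In the original ResNet design it is the skip connection: each block adds its input back to its output, so the block only has to learn a correction to the identity. The sketch below is a simplified, hypothetical block (`TinyResidualBlock` is a name chosen for this example), not torchvision's `Bottleneck` implementation.

```python
import torch
from torch import nn

class TinyResidualBlock(nn.Module):
    """Illustrative block computing relu(x + F(x)); not torchvision's Bottleneck."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                              # keep the input for the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)          # add the input back before the final ReLU

block = TinyResidualBlock(channels=8)
x = torch.randn(2, 8, 16, 16)                     # batch of 2 random feature maps
y = block(x)
print(y.shape)                                    # same shape as the input
```

Because the addition requires matching shapes, blocks that change the number of channels or the resolution need an extra projection on the skip path, which is the role of the `downsample` entry visible in the printed ResNet structure.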

In [8]:
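# `pretrained=True` is deprecated in torchvision 0.13 and later; use the weights enum instead.
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)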

Downloading: "https://download.pytorch.org/models/resnet101-63fe2227.pth" to C:\Users\System2/.cache\torch\hub\checkpoints\resnet101-63fe2227.pth
100.0%


In [9]:
print(resnet)


ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 

The resnet variable can be called like a function, taking as input one or more images and producing an equal number of scores for each of the 1,000 ImageNet classes. Before we can do that, however, we have to preprocess the input images so they are the right size and so that their values (colors) sit roughly in the same numerical range. In order to do that, the torchvision module provides transforms, which allow us to quickly define pipelines of basic preprocessing functions:

In [10]:
from torchvision import transforms
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )])

In this case, we defined a preprocess function that will scale the input image so that
its shorter side is 256 pixels (keeping the aspect ratio), crop the image to 224 × 224
around the center, transform it to a tensor (a PyTorch multidimensional array: in this
case, a 3D array with color, height, and width), and normalize its RGB (red, green, blue)
components so that they have defined means and standard deviations. These need to match
what was presented to the network during training if we want the network to produce meaningful answers.
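
As a rough illustration of the arithmetic behind two of these steps, the plain-Python sketch below mimics what `Resize(256)` does to the image size when given a single integer (the shorter side becomes 256) and what `Normalize` does to one pixel. The helper names `resized_size` and `normalize_pixel` and the 640 × 480 image size are assumptions made for this example; exact rounding may differ between torchvision versions.

```python
def resized_size(width, height, size=256):
    """Output (width, height) when the shorter side is scaled to `size`."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size

def normalize_pixel(rgb, mean, std):
    """Per-channel (value - mean) / std, for values already scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

new_size = resized_size(640, 480)                      # hypothetical landscape photo
white = normalize_pixel([1.0, 1.0, 1.0], mean, std)    # a pure white pixel
average = normalize_pixel(mean, mean, std)             # a pixel equal to the channel means

print(new_size)   # (341, 256)
print(average)    # [0.0, 0.0, 0.0]
```

The center crop then cuts a 224 × 224 window out of the resized image, so the tensor that reaches the network has shape 3 × 224 × 224 regardless of the original image size.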

In [12]:
from PIL import Image
img = Image.open("C:/Users/System2/Desktop/Abhimanyu/dog-puppy-on-garden-royalty-free-image-1586966191.jpg")

In [13]:
img.show()

In [14]:
img_t = preprocess(img)

In [16]:
import torch

In [17]:
batch_t = torch.unsqueeze(img_t, 0)
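
The `unsqueeze` call adds a leading batch dimension, since the network expects a batch of images rather than a single one. A small sketch with a zero-filled stand-in tensor (the names `dummy_img` and `dummy_batch` are placeholders, not the real image) shows the change in shape:

```python
import torch

dummy_img = torch.zeros(3, 224, 224)           # stand-in for one preprocessed image
dummy_batch = torch.unsqueeze(dummy_img, 0)    # insert a new dimension at position 0

print(dummy_img.shape)    # torch.Size([3, 224, 224])
print(dummy_batch.shape)  # torch.Size([1, 3, 224, 224])
```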

The process of running a trained model on new data is called inference in deep learning circles. In order to do inference, we need to put the network in eval mode:

In [18]:
resnet.eval()

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 

If we forget to do that, some layers in pretrained models, such as batch normalization and dropout,
will not produce meaningful answers, because they behave differently during training and inference.
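
The difference between the two modes is easy to see on a standalone dropout layer. This is only a sketch on a toy input, not on the ResNet above: in training mode `nn.Dropout` zeroes random elements and rescales the survivors by 1 / (1 - p), while in eval mode it passes the input through unchanged.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

drop.train()              # training mode: random elements are zeroed
train_out = drop(x)

drop.eval()               # eval mode: dropout acts as the identity
eval_out = drop(x)

print(torch.equal(eval_out, x))   # True
print(train_out.unique())         # only zeros and rescaled values (1 / (1 - 0.5) = 2.0)
```

Batch normalization changes in a similar way: in training mode it normalizes with the statistics of the current batch, while in eval mode it uses the running statistics stored during training, which is what makes single-image inference behave sensibly.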

In [19]:
out = resnet(batch_t)
out

tensor([[-3.2184e+00, -9.4353e-01, -4.3547e+00, -4.9982e+00, -4.4868e+00,
         -3.6622e+00, -4.5088e+00, -2.6504e+00, -6.5589e-01, -2.1134e+00,
          6.0493e-01,  6.0023e-01, -7.7637e-01, -8.3863e-01, -1.4292e+00,
         -1.0995e+00, -5.3213e-01,  6.2361e-01, -2.8421e-01, -9.3961e-04,
         -1.1592e+00, -1.8192e+00, -1.1642e+00, -3.1179e-01, -1.0828e+00,
         -5.9177e-01, -1.8535e+00, -1.0341e+00, -1.7673e+00, -1.6795e+00,
         -2.0407e+00, -1.3571e+00, -9.2553e-01, -2.1798e+00, -1.1610e+00,
         -1.7341e+00, -1.0691e-01, -9.8468e-01, -7.8110e-01, -1.3071e+00,
         -5.3563e-01, -1.0375e+00,  1.2434e+00, -1.7774e+00, -7.8122e-01,
         -1.0403e+00, -2.8123e-01,  7.6159e-02, -2.3334e+00, -2.4288e+00,
         -2.4754e+00, -1.6916e+00, -4.0977e-01, -1.3788e+00, -1.2360e+00,
         -5.2387e-01, -4.8639e-01, -9.7038e-02, -1.0204e+00, -2.9690e-01,
          7.6534e-01, -1.6702e+00,  2.7858e-02,  6.0604e-02, -2.0752e+00,
         -2.1413e+00, -8.6118e-01, -6.

Let’s load the file containing the 1,000 labels for the ImageNet dataset classes: