I chose PIL for the raw image transformations because ResNet expects its input in a specific layout, including RGB channel order, whereas cv2 (for instance) loads images in BGR order.
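The channel-order difference can be illustrated with a minimal sketch. It uses a hand-built NumPy array in place of a real cv2 image (an assumption, so the snippet runs without OpenCV), and `bgr_to_rgb` is a hypothetical helper name, not part of either library.

```python
import numpy as np


def bgr_to_rgb(image):
    """Reverse the last (channel) axis of an H x W x 3 array."""
    return image[..., ::-1].copy()


# One pure-red pixel stored the way OpenCV would hold it: (B=0, G=0, R=255).
bgr_pixel = np.array([[[0, 0, 255]]], dtype=np.uint8)

# After the swap, red is in the first channel, as an RGB model expects.
rgb_pixel = bgr_to_rgb(bgr_pixel)
print(rgb_pixel.tolist())
```

Feeding the unswapped array to a model trained on RGB input would silently exchange the red and blue channels.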

Initially, inference returned exactly the same predictions with 100% confidence on every run. The root cause turned out to be that, with cv2, the model was effectively receiving an entirely white 100x100px image surrounded by 62px of black padding.
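A simple sanity check on the model input could flag this kind of failure early. The sketch below is illustrative only: `looks_degenerate`, its `border` and `tol` parameters, and the channels-last NumPy layout are assumptions, not part of the original pipeline.

```python
import numpy as np


def looks_degenerate(image, border=62, tol=1e-6):
    """Return True if the area inside the padding has (almost) no variation."""
    inner = image[border:-border, border:-border]
    return bool(inner.std() <= tol)


# Reproduce the symptom: a white 100x100 square inside 62px of black padding.
blank = np.zeros((224, 224, 3), dtype=np.float32)
blank[62:-62, 62:-62] = 1.0

# A stand-in for a real crop: the same layout, but with varying pixel values.
rng = np.random.default_rng(0)
textured = blank.copy()
textured[62:-62, 62:-62] = rng.random((100, 100, 3)).astype(np.float32)

print(looks_degenerate(blank), looks_degenerate(textured))
```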

The preprocessing requirements were somewhat confusing at first. The model requires images of exactly 224x224px, so after the earlier 100x100px random-crop step every image has to be padded by 62px on each side (100 + 2 * 62 = 224), which produces a constant, thick black border. This lowers the confidence of the results. Without this requirement, typically only a few of the crops would contain some padding, which would affect the confidence of the results only marginally.
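The padding arithmetic can be checked with a short sketch. `symmetric_padding` is a hypothetical helper introduced only for illustration; the sizes are the ones stated above.

```python
def symmetric_padding(crop_size, target_size):
    """Per-side padding needed to grow a square crop to the target size."""
    diff = target_size - crop_size
    if diff < 0 or diff % 2 != 0:
        raise ValueError("target must exceed the crop by an even number of pixels")
    return diff // 2


pad = symmetric_padding(100, 224)

# Share of the final 224x224 image that is padding rather than image content.
padding_fraction = 1 - (100 * 100) / (224 * 224)

print(pad, round(padding_fraction, 3))
```

Roughly four fifths of each input is therefore black border, which is consistent with the drop in confidence described above.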

Without these specific requirements, the dataset could usually have been assembled using only `torchvision.transforms`, without a custom class.

In [None]:
from PIL import Image
import torchvision.transforms as T


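class ZoomIn:
    """Scale a PIL image by a constant factor `r`, keeping the aspect ratio."""

    def __init__(self, r):
        self.r = r

    def __call__(self, img):
        # PIL sizes are (width, height); int() truncates to whole pixels.
        newsize = (int(img.width * self.r), int(img.height * self.r))
        return img.resize(newsize)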


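def preprocess(path):
    """Load the image at `path` and return a normalized 3x224x224 tensor."""
    augs = T.Compose([
        ZoomIn(1.5),        # enlarge by 50% before cropping
        T.RandomCrop(100),  # random 100x100px patch
        T.Pad(62),          # 62px of black per side: 100 + 2 * 62 = 224
        T.ToTensor(),       # PIL image -> float tensor in [0, 1], shape (C, H, W)
        # ImageNet per-channel means and standard deviations
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    # Force three RGB channels regardless of the file's original mode.
    img = Image.open(path).convert("RGB")
    return augs(img)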
