<a href="https://colab.research.google.com/github/harvard-visionlab/psy1410/blob/master/psy1410_imagenet_transfer_zoo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ImageNet Transfer Model Zoo

This notebook is our storehouse of models that were first trained on one task (e.g., video action recognition or face recognition) and then transfer-trained to test their performance on ImageNet classification. For each model, the original "task head" was removed and replaced with a fresh fully-connected layer with 1000 output units (one per ImageNet category). The convolutional backbone's weights were frozen, and only the new 1000-unit fully-connected layer had its weights adjusted. So we're asking: "How well can we use the features of this model, trained on X, to perform ImageNet classification?"
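
As a rough illustration of that recipe, here is a minimal sketch of freezing a backbone and attaching a fresh 1000-way head. `TinyNet` and `attach_imagenet_head` are hypothetical names made up for this example, and the random batch stands in for real ImageNet data; the actual training code for the checkpoints in this notebook is not shown here and may differ.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained backbone with its original task head."""

    def __init__(self, num_classes=305):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, num_classes)  # original task head

    def forward(self, x):
        return self.fc(self.backbone(x))


def attach_imagenet_head(model, num_classes=1000):
    """Freeze every existing weight, then swap in a fresh, trainable linear head."""
    for param in model.parameters():
        param.requires_grad = False
    # A newly created layer has requires_grad=True by default.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


torch.manual_seed(0)
model = attach_imagenet_head(TinyNet())
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['fc.weight', 'fc.bias']

# One illustrative optimization step on random data: only the head can change.
optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)
images = torch.randn(4, 3, 32, 32)
labels = torch.randint(0, 1000, (4,))
backbone_before = model.backbone[0].weight.clone()
loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(torch.equal(backbone_before, model.backbone[0].weight))  # True
```

Passing only the parameters with `requires_grad=True` to the optimizer makes the intent explicit: the backbone acts as a fixed feature extractor and the linear layer is the only thing being fit.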



# installations

In [1]:
!pip install facenet-pytorch

Collecting facenet-pytorch
  Downloading https://files.pythonhosted.org/packages/18/e8/5ea742737665ba9396a8a2be3d2e2b49a13804b56a7e7bb101e8731ade8f/facenet_pytorch-2.5.2-py3-none-any.whl (1.9MB)
Installing collected packages: facenet-pytorch
Successfully installed facenet-pytorch-2.5.2


# helpers

In [2]:
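import matplotlib.pyplot as plt
import torch.nn as nn
from torchvision.utils import make_grid
from IPython.core.debugger import set_trace

def show_conv1(model):
    """Plot the kernels of the first Conv2d layer found in `model` as an image grid."""
    # Find the first 2D convolution; fail loudly if the model has none.
    conv = next((m for m in model.modules() if isinstance(m, nn.Conv2d)), None)
    if conv is None:
        raise ValueError('model has no nn.Conv2d layer')
    kernels = conv.weight.detach().clone().cpu()
    # Rescale the weights to [0, 1] so they can be displayed as RGB images.
    kernels = kernels - kernels.min()
    kernels = kernels / kernels.max()
    img = make_grid(kernels, nrow=16)
    ax = plt.imshow(img.permute(1, 2, 0))
    return ax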

# models

## moments action-recognition models


### resnet backbone

In [5]:
import os
import re
import subprocess
import functools
from functools import partial

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
from torchvision import transforms
from pathlib import Path
import numpy as np     # used by load_frames
from PIL import Image  # used by load_frames

def conv3x3x3(in_planes, out_planes, stride=1):
    """3x3x3 convolution with padding."""
    return nn.Conv3d(
        in_planes, out_planes, kernel_size=3, stride=stride, padding=1, bias=False
    )

def downsample_basic_block(x, planes, stride):
    """Type-A shortcut: subsample with a strided pool, then zero-pad the channels."""
    out = F.avg_pool3d(x, kernel_size=1, stride=stride)
    # Allocate the padding with the same dtype and device as the activations.
    zero_pads = torch.zeros(
        out.size(0), planes - out.size(1),
        out.size(2), out.size(3), out.size(4),
        dtype=out.dtype, device=out.device)
    out = torch.cat([out, zero_pads], dim=1)
    return out


class BasicBlock(nn.Module):
    expansion = 1
    Conv3d = staticmethod(conv3x3x3)

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = self.Conv3d(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm3d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = self.Conv3d(planes, planes)
        self.bn2 = nn.BatchNorm3d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)
        return out


class Bottleneck(nn.Module):
    expansion = 4
    Conv3d = nn.Conv3d

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = self.Conv3d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm3d(planes)
        self.conv2 = self.Conv3d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(planes)
        self.conv3 = self.Conv3d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm3d(planes * 4)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)
        return out


class ResNet3D(nn.Module):

    Conv3d = nn.Conv3d

    def __init__(self, block, layers, shortcut_type='B', num_classes=305):
        self.inplanes = 64
        super(ResNet3D, self).__init__()
        self.conv1 = self.Conv3d(3, 64, kernel_size=7, stride=(1, 2, 2), padding=(3, 3, 3), bias=False)
        self.bn1 = nn.BatchNorm3d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0], shortcut_type)
        self.layer2 = self._make_layer(block, 128, layers[1], shortcut_type, stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], shortcut_type, stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], shortcut_type, stride=2)
        self.avgpool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        self.init_weights()

    def _make_layer(self, block, planes, blocks, shortcut_type, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            if shortcut_type == 'A':
                downsample = partial(
                    downsample_basic_block,
                    planes=planes * block.expansion,
                    stride=stride,
                )
            else:
                downsample = nn.Sequential(
                    self.Conv3d(
                        self.inplanes,
                        planes * block.expansion,
                        kernel_size=1,
                        stride=stride,
                        bias=False,
                    ),
                    nn.BatchNorm3d(planes * block.expansion),
                )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def init_weights(self):
        for m in self.modules():
            if isinstance(m, self.Conv3d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
            elif isinstance(m, nn.BatchNorm3d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)

        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x


def modify_resnets(model):
    # Modify attributes
    model.last_linear, model.fc = model.fc, None

    def features(self, input):
        x = self.conv1(input)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        return x

    def logits(self, features):
        x = self.avgpool(features)
        x = x.view(x.size(0), -1)
        x = self.last_linear(x)
        return x

    def forward(self, input):
        x = self.features(input)
        x = self.logits(x)
        return x

    # Modify methods
    setattr(model.__class__, 'features', features)
    setattr(model.__class__, 'logits', logits)
    setattr(model.__class__, 'forward', forward)
    return model


ROOT_URL = 'http://moments.csail.mit.edu/moments_models'
weights = {
    'resnet50': 'moments_v2_RGB_resnet50_imagenetpretrained.pth.tar',
    'resnet3d50': 'moments_v2_RGB_imagenet_resnet3d50_segment16.pth.tar',
    'multi_resnet3d50': 'multi_moments_v2_RGB_imagenet_resnet3d50_segment16.pth.tar',
}
default_model_dir = '/content/weights'
if not os.path.exists(default_model_dir):
    os.makedirs(default_model_dir)

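def load_checkpoint(weight_file):
    """Download the checkpoint into default_model_dir if needed, then load it on CPU."""
    weight_file_name = weight_file
    weight_file = os.path.join(default_model_dir, weight_file_name)
    if not os.path.exists(weight_file):
        # Build the URL by hand (os.path.join is meant for filesystem paths)
        # and save into default_model_dir, where torch.load looks for the file.
        weight_url = ROOT_URL + '/' + weight_file_name
        subprocess.run(['wget', '-O', weight_file, weight_url], check=True)
    checkpoint = torch.load(weight_file, map_location=lambda storage, loc: storage)  # Load on cpu
    return {str.replace(str(k), 'module.', ''): v for k, v in checkpoint['state_dict'].items()}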


def resnet50(num_classes=305, pretrained=True):
    model = models.__dict__['resnet50'](num_classes=num_classes)
    if pretrained:
        model.load_state_dict(load_checkpoint(weights['resnet50']))
    model = modify_resnets(model)
    return model


def resnet3d50(num_classes=305, pretrained=True, **kwargs):
    """Constructs a ResNet3D-50 model."""
    model = modify_resnets(ResNet3D(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, **kwargs))
    if pretrained:
        model.load_state_dict(load_checkpoint(weights['resnet3d50']))
    return model


def multi_resnet3d50(num_classes=292, pretrained=True, **kwargs):
    """Constructs a ResNet3D-50 model."""
    model = modify_resnets(ResNet3D(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, **kwargs))
    if pretrained:
        model.load_state_dict(load_checkpoint(weights['multi_resnet3d50']))
    return model


def load_model(arch):
    # Fall back to the resnet3d50 constructor (not its name) for unknown archs.
    model = {'resnet3d50': resnet3d50,
             'multi_resnet3d50': multi_resnet3d50, 'resnet50': resnet50}.get(arch, resnet3d50)()
    model.eval()
    return model


def load_transform():
    """Load the image transformer."""
    return transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406],
                             [0.229, 0.224, 0.225])])


def load_categories(filename):
    """Load categories."""
    with open(filename) as f:
        return [line.rstrip() for line in f.readlines()]

def load_frames(frame_paths, num_frames=8):
    """Load PIL images from a list of file paths."""
    frames = [Image.open(frame).convert('RGB') for frame in frame_paths]
    if len(frames) >= num_frames:
        return frames[::int(np.ceil(len(frames) / float(num_frames)))]
    else:
        raise ValueError('Video must have at least {} frames'.format(num_frames))
        
def extract_frames(video_file, num_frames=16):
    """Return a list of PIL image frames uniformly sampled from an mp4 video."""
    try:
        os.makedirs(os.path.join(os.getcwd(), 'frames'))
    except OSError:
        pass
    output = subprocess.Popen(['ffmpeg', '-i', video_file],
                              stderr=subprocess.PIPE).communicate()
    # Search and parse 'Duration: 00:05:24.13,' from ffmpeg stderr.
    re_duration = re.compile(r'Duration: (.*?)\.')
    duration = re_duration.search(str(output[1])).groups()[0]

    seconds = functools.reduce(lambda x, y: x * 60 + y,
                               map(int, duration.split(':')))
    rate = num_frames / float(seconds)

    output = subprocess.Popen(['ffmpeg', '-i', video_file,
                               '-vf', 'fps={}'.format(rate),
                               '-vframes', str(num_frames),
                               '-loglevel', 'panic',
                               'frames/%d.jpg']).communicate()
    frame_paths = sorted([os.path.join('frames', frame)
                          for frame in os.listdir('frames')])
    frames = load_frames(frame_paths, num_frames=num_frames)
    subprocess.call(['rm', '-rf', 'frames'])
    return frames
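
To make the frame-sampling arithmetic in `extract_frames` and `load_frames` concrete, here is a small sketch that runs the same parsing and stride logic without ffmpeg or a video file. The sample banner string and the helper names `parse_duration_seconds` and `sampled_indices` are assumptions made up for illustration.

```python
import functools
import math
import re


def parse_duration_seconds(ffmpeg_stderr):
    """Whole seconds from an ffmpeg 'Duration: HH:MM:SS.xx' line (fraction dropped)."""
    duration = re.search(r'Duration: (.*?)\.', ffmpeg_stderr).group(1)
    return functools.reduce(lambda x, y: x * 60 + y, map(int, duration.split(':')))


def sampled_indices(num_available, num_frames):
    """Indices kept by frames[::ceil(num_available / num_frames)]."""
    step = int(math.ceil(num_available / float(num_frames)))
    return list(range(0, num_available, step))


sample = "  Duration: 00:05:24.13, start: 0.000000, bitrate: 1205 kb/s"  # made-up example
seconds = parse_duration_seconds(sample)
print(seconds)                      # 324
print(round(16 / seconds, 4))       # 0.0494 frames per second for 16 frames
print(len(sampled_indices(16, 8)))  # 8
print(len(sampled_indices(20, 8)))  # 7: the stride can return fewer frames than requested
```

Two edge cases follow from this arithmetic: a clip shorter than one second parses to 0 seconds, so computing the rate would raise `ZeroDivisionError`, and when the number of extracted frames is not a multiple of `num_frames`, the ceiling stride can yield fewer than `num_frames` frames.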

### ImageNet-trained task heads

In [6]:
from torchvision import transforms, models  
from torch.hub import load_state_dict_from_url
from pathlib import Path 
from IPython.core.debugger import set_trace 

VISLAB_URL = "https://visionlab-pretrainedmodels.s3.amazonaws.com/model_zoo/psy1410/"
default_model_dir = '/content/weights'
if not os.path.exists(default_model_dir):
  os.makedirs(default_model_dir)

def load_checkpoint_imagenet_head(model, weights_url, weight_dir=None, device='cpu'):
  model_dir = default_model_dir if weight_dir is None else weight_dir

  weights_url = str(weights_url)

  print(f"=> loading checkpoint: {Path(weights_url).name}")
  checkpoint = load_state_dict_from_url(weights_url, model_dir=model_dir, 
                                        map_location=torch.device(device))
  state_dict = {str.replace(k,'module.',''): v for k,v in checkpoint['state_dict'].items()}
  model.load_state_dict(state_dict)
  print("=> state loaded.")

  model.top1 = checkpoint['top1']
  model.num_epochs = checkpoint['epoch']
  print(f"=> top1 accuracy {model.top1:3.2f}% (num_epochs={model.num_epochs})")

  return model

def moments_resnet3d50_imagenet_head():
  model = resnet3d50(num_classes=1000, pretrained=False)
  weights_url = VISLAB_URL+'moments_resnet3d50_avgpool_onecycle.pth.tar'
  model = load_checkpoint_imagenet_head(model, weights_url)

  normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                   std=[0.229, 0.224, 0.225])
  
  test_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize
  ])

  return model, test_transforms

def moments_resnet50_imagenet_head():
  model = resnet50(num_classes=1000, pretrained=False)

  # weights_url = VISLAB_URL+'moments_resnet50_avgpool_onecycle.pth.tar'
  weights_url = VISLAB_URL+'moments_resnet50_avgpool_onecycle1_madgrad.pth.tar'
  model = load_checkpoint_imagenet_head(model, weights_url)

  normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                   std=[0.229, 0.224, 0.225])
  
  test_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize
  ])

  return model, test_transforms

def imagenet_resnet50_imagenet_head():
  model = models.resnet50(num_classes=1000, pretrained=False)  
  # weights_url = VISLAB_URL+'imagenet_resnet50_avgpool_onecycle.pth.tar'
  weights_url = VISLAB_URL+'imagenet_resnet50_avgpool_onecycle1_madgrad.pth.tar'

  model = load_checkpoint_imagenet_head(model, weights_url)

  normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                   std=[0.229, 0.224, 0.225])
  
  test_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize
  ])

  return model, test_transforms
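
The checkpoint loader above removes `module.` from every key before calling `load_state_dict`. That prefix typically appears when a model was saved while wrapped in `nn.DataParallel`; whether that is how these checkpoints were produced is an assumption. The sketch below uses a hypothetical helper, `strip_module_prefix`, that removes only a leading `module.`, whereas `str.replace` removes every occurrence. Plain strings stand in for tensors and the key names are made up.

```python
def strip_module_prefix(state_dict):
    """Drop a leading 'module.' from each key, leaving other occurrences alone."""
    prefix = 'module.'
    return {(k[len(prefix):] if k.startswith(prefix) else k): v
            for k, v in state_dict.items()}


# Plain strings stand in for tensors; the key names are made up for illustration.
saved = {'module.conv1.weight': 'w0', 'module.last_linear.bias': 'b0'}
cleaned = strip_module_prefix(saved)
print(cleaned)
# {'conv1.weight': 'w0', 'last_linear.bias': 'b0'}

# For keys like these, the notebook's str.replace approach gives the same result.
replaced = {k.replace('module.', ''): v for k, v in saved.items()}
print(cleaned == replaced)  # True
```

The two approaches only differ when a key contains `module.` somewhere other than the start, for example a submodule whose own name ends in `module`.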


### FaceNet face-recognition models

In [7]:
from facenet_pytorch import MTCNN, InceptionResnetV1
from facenet_pytorch import fixed_image_standardization

from torchvision import transforms, models  
from torch.hub import load_state_dict_from_url
from pathlib import Path 
from IPython.core.debugger import set_trace 

VISLAB_URL = "https://visionlab-pretrainedmodels.s3.amazonaws.com/model_zoo/psy1410/"
default_model_dir = '/content/weights'
if not os.path.exists(default_model_dir):
  os.makedirs(default_model_dir)

def load_checkpoint_imagenet_head(model, weights_url, weight_dir=None, device='cpu'):
  model_dir = default_model_dir if weight_dir is None else weight_dir

  weights_url = str(weights_url)

  print(f"=> loading checkpoint: {Path(weights_url).name}")
  checkpoint = load_state_dict_from_url(weights_url, model_dir=model_dir, 
                                        map_location=torch.device(device))
  state_dict = {str.replace(k,'module.',''): v for k,v in checkpoint['state_dict'].items()}
  model.load_state_dict(state_dict)
  print("=> state loaded.")

  model.top1 = checkpoint['top1']
  model.num_epochs = checkpoint['epoch']
  print(f"=> top1 accuracy {model.top1:3.2f}% (num_epochs={model.num_epochs})")

  return model

def vggface2_inceptionV1_imagenet_head():
  model = InceptionResnetV1(pretrained='vggface2')
  in_features = model.last_linear.in_features
  model.last_linear = nn.Linear(in_features, 1000)
  model.last_linear.weight.data.normal_(mean=0.0, std=0.01)
  model.last_linear.bias.data.zero_()
  model.last_bn = nn.BatchNorm1d(1000, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  
  #weights_url = VISLAB_URL+'vggface2_inceptionV1_avgpool_1a_onecycle.pth.tar'
  weights_url = VISLAB_URL+'vggface2_inceptionV1_avgpool_1a_onecycle1_madgrad.pth.tar'
  model = load_checkpoint_imagenet_head(model, weights_url)

  test_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    fixed_image_standardization
  ])

  return model, test_transforms 

def casia_inceptionV1_imagenet_head():
  model = InceptionResnetV1(pretrained='casia-webface')
  in_features = model.last_linear.in_features
  model.last_linear = nn.Linear(in_features, 1000)
  model.last_linear.weight.data.normal_(mean=0.0, std=0.01)
  model.last_linear.bias.data.zero_()
  model.last_bn = nn.BatchNorm1d(1000, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)

  #weights_url = VISLAB_URL+'casia_inceptionV1_avgpool_1a_onecycle.pth.tar'
  weights_url = VISLAB_URL+'casia_inceptionV1_avgpool_1a_onecycle1_madgrad.pth.tar'
  model = load_checkpoint_imagenet_head(model, weights_url)

  test_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    fixed_image_standardization
  ])

  return model, test_transforms 


# Loading Moments-trained Models (Annie)

These models were pre-trained on (1) action recognition from video (moments_resnet3d50), (2) action recognition from still images (moments_resnet50), or (3) ImageNet classification (imagenet_resnet50).

For each network, I removed the fully-connected layer and replaced it with one that has 1000 output units, for the 1000 ImageNet classes.

All of the weights are frozen, except those of this last fully-connected layer, which I trained on ImageNet classification.

So this new FC layer is used to "read out a fixed set of features from the convolutional backbone."

One tricky part was that resnet3d50 expects sequences of 16 video frames as its input. What I did was repeat the same image 16 times, which we can think of as a "very slow moving video"! I couldn't really think of a better way to train this model to do ImageNet recognition! As you'll see, it doesn't perform as well as the models trained on still images (either on Moments or ImageNet), but it still does well above chance (34.7% vs. chance, which is 1/1000 = 0.1%).
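
A minimal sketch of the "repeat the image 16 times" trick, assuming the clip is laid out as (batch, channels, time, height, width), which is the input layout `nn.Conv3d` expects. The helper name `image_to_static_clip` is hypothetical, and small random tensors stand in for preprocessed images.

```python
import torch


def image_to_static_clip(images, num_frames=16):
    """Repeat each image along a new time axis: (N, C, H, W) -> (N, C, T, H, W)."""
    return images.unsqueeze(2).repeat(1, 1, num_frames, 1, 1)


batch = torch.randn(2, 3, 8, 8)      # tiny stand-in for a batch of 224x224 images
clip = image_to_static_clip(batch)
print(tuple(clip.shape))             # (2, 3, 16, 8, 8)
print(torch.equal(clip[:, :, 5], batch))  # True: every frame is the same image
```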

In [8]:
# model trained with videos (Moments "action recognition")
# remove its fully-connected layer, replace with a new one that
# has 1000 outputs (corresponding to the 1000 imagenet categories)
# train this new layer on ImageNet Classification.
model, test_transforms = moments_resnet3d50_imagenet_head()

=> loading checkpoint: moments_resnet3d50_avgpool_onecycle.pth.tar


Downloading: "https://visionlab-pretrainedmodels.s3.amazonaws.com/model_zoo/psy1410/moments_resnet3d50_avgpool_onecycle.pth.tar" to /content/weights/moments_resnet3d50_avgpool_onecycle.pth.tar


=> state loaded.
=> top1 accuracy 34.72% (num_epochs=10)


In [9]:
# model trained with static images on Moments "action recognition"
# remove its fully-connected layer, replace with a new one that
# has 1000 outputs (corresponding to the 1000 imagenet categories)
# train this new layer on ImageNet Classification.
model, test_transforms = moments_resnet50_imagenet_head()

=> loading checkpoint: moments_resnet50_avgpool_onecycle1_madgrad.pth.tar


Downloading: "https://visionlab-pretrainedmodels.s3.amazonaws.com/model_zoo/psy1410/moments_resnet50_avgpool_onecycle1_madgrad.pth.tar" to /content/weights/moments_resnet50_avgpool_onecycle1_madgrad.pth.tar


=> state loaded.
=> top1 accuracy 47.19% (num_epochs=20)


In [10]:
# This is a sanity check to make sure the ImageNet training is reasonable.
# Here we take a network trained on ImageNet classification to begin with,
# then cut off its fully-connected layer, and replace it with an untrained
# one. That new layer is then trained on ImageNet Classification. If the
# procedure is reasonable we should get close to the original performance
# (which we do here, 74% new, 76% original).
model, test_transforms = imagenet_resnet50_imagenet_head()

=> loading checkpoint: imagenet_resnet50_avgpool_onecycle1_madgrad.pth.tar


Downloading: "https://visionlab-pretrainedmodels.s3.amazonaws.com/model_zoo/psy1410/imagenet_resnet50_avgpool_onecycle1_madgrad.pth.tar" to /content/weights/imagenet_resnet50_avgpool_onecycle1_madgrad.pth.tar


=> state loaded.
=> top1 accuracy 74.07% (num_epochs=20)


# Loading FaceNet Models (Jolade?)

These models were pre-trained on the FaceNet triplet task.

Then I removed the fully-connected layer and replaced it with one that has 1000 output units, for the 1000 ImageNet classes.

All of the weights are frozen, except those of this last fully-connected layer, which I trained on ImageNet classification.

So this new FC layer is used to "read out a fixed set of features from the convolutional backbone specialized for face processing."

There are two different FaceNet models, trained on different face datasets (VGGFace2 and CASIA-WebFace).

One issue we might have to address is that the overall performance level for these networks on ImageNet classification is low. Perhaps that's to be expected: these are highly face-specialized networks, and they might not have "the right kinds of features" for performing ImageNet classification.

That said, even though performance is low (15% and 18%), they are far above chance (which is 1/1000, or 0.1%)!
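
To put "far above chance" in numbers, the short sketch below compares each reported top-1 accuracy with the 1-in-1000 chance level. The accuracies are copied from the checkpoint logs printed in this notebook; the dictionary keys are just labels chosen for this example.

```python
num_classes = 1000
chance = 100.0 / num_classes  # chance top-1 accuracy, in percent

# Top-1 accuracies (percent) copied from the checkpoint logs in this notebook.
top1 = {
    'moments_resnet3d50': 34.72,
    'moments_resnet50': 47.19,
    'imagenet_resnet50': 74.07,
    'vggface2_inceptionV1': 15.03,
    'casia_inceptionV1': 17.96,
}

print(f"chance = {chance:.1f}%")  # chance = 0.1%
for name, acc in sorted(top1.items(), key=lambda item: item[1]):
    print(f"{name:22s} {acc:5.2f}%  about {acc / chance:.0f}x chance")
```

Even the weakest readout here (the VGGFace2 model) is roughly 150 times better than guessing.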

In [11]:
model, test_transform = vggface2_inceptionV1_imagenet_head()

=> loading checkpoint: vggface2_inceptionV1_avgpool_1a_onecycle1_madgrad.pth.tar


Downloading: "https://visionlab-pretrainedmodels.s3.amazonaws.com/model_zoo/psy1410/vggface2_inceptionV1_avgpool_1a_onecycle1_madgrad.pth.tar" to /content/weights/vggface2_inceptionV1_avgpool_1a_onecycle1_madgrad.pth.tar


=> state loaded.
=> top1 accuracy 15.03% (num_epochs=20)


In [12]:
model, test_transform = casia_inceptionV1_imagenet_head()

=> loading checkpoint: casia_inceptionV1_avgpool_1a_onecycle1_madgrad.pth.tar


Downloading: "https://visionlab-pretrainedmodels.s3.amazonaws.com/model_zoo/psy1410/casia_inceptionV1_avgpool_1a_onecycle1_madgrad.pth.tar" to /content/weights/casia_inceptionV1_avgpool_1a_onecycle1_madgrad.pth.tar


=> state loaded.
=> top1 accuracy 17.96% (num_epochs=20)
