# Problem 2: Long-Tailed Recognition on Imbalanced Dataset

In the standard visual recognition setting, both the training data and the testing data are balanced under a closed-world assumption, e.g., the ImageNet dataset. However, this setting is not a good proxy for real-world scenarios, where the class distribution is typically long-tailed: a few head classes have many samples while many tail classes have only a few. Such an imbalanced distribution in the training set may substantially degrade the performance of machine learning and deep learning-based methods.

Our goal is to build a CNN model that can accurately classify the images into their respective categories under imbalanced settings.

### Readings before you start

1. Bag of tricks for long-tailed visual recognition with deep convolutional neural networks [[Paper]](http://www.lamda.nju.edu.cn/zhangys/papers/AAAI_tricks.pdf) [[Github]](https://github.com/zhangyongshun/BagofTricks-LT)

In [None]:
%matplotlib inline

import csv
import math
import os

import numpy as np
import pandas as pd

from tqdm import tqdm

# PyTorch
import torch
import torch.nn as nn

from torch.utils.tensorboard import SummaryWriter
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split

import torchvision
import torchvision.datasets as datasets
import torchvision.transforms as transforms


import matplotlib.pyplot as plt
import seaborn as sns
sns.set(palette='pastel')

np.random.seed(568)

## Prepare Imbalanced CIFAR-30 Dataset from CIFAR-100

You will be building an imbalanced version of CIFAR-30 from CIFAR-100. The degree of imbalance is measured by the imbalance factor $\beta$:

\begin{equation*}
    \beta = \frac{\max(\{n_1, n_2, \cdots, n_k\})}{\min(\{n_1, n_2, \cdots, n_k\})}
\end{equation*}

where $n_i$ represents the number of images for class $i$. Therefore, the larger the imbalance factor $\beta$, the harder long-tailed recognition on such data becomes. With a $\beta=100$ version of CIFAR-100, the head classes have $500$ training samples while the tail classes have only $5$. Note that the `imb_factor` argument in the code below is the reciprocal $1/\beta$, so `imb_factor=0.01` corresponds to $\beta=100$.
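
As a worked example of this definition, the sketch below computes an exponentially decaying class-size profile and its imbalance factor. The helper name `exp_profile` and its arguments are illustrative assumptions; the formula mirrors the `'exp'` branch of `get_img_num_per_cls` in the dataset class below, using 30 classes and 500 images in the largest class.

```python
# Illustrative sketch (hypothetical helper, not part of the assignment code).
def exp_profile(img_max, cls_num, imb_factor):
    """Class sizes that decay exponentially from img_max down to img_max * imb_factor."""
    return [int(img_max * imb_factor ** (i / (cls_num - 1.0))) for i in range(cls_num)]

counts = exp_profile(img_max=500, cls_num=30, imb_factor=0.01)
beta = max(counts) / min(counts)   # imbalance factor: largest class size / smallest class size

print(counts[0], counts[-1])  # head and tail class sizes
print(beta)
```

The head class keeps all 500 images, the last class keeps 5, and the classes in between shrink geometrically.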

In [None]:
# create a custom dataset CIFAR30 from CIFAR100
class CIFAR30(torchvision.datasets.CIFAR100):
    # cifar100 has 100 classes, we only want 30
    cls_num = 30

    def __init__(self, root, imb_type='exp', imb_factor=0.01, rand_number=0, train=True,
                 transform=None, target_transform=None,
                 download=False, imbalanced=False):
        super(CIFAR30, self).__init__(root, train, transform, target_transform, download)
        np.random.seed(rand_number)

        self.remove_extra_class(self.cls_num)

        if self.train and imbalanced:
            img_num_list = self.get_img_num_per_cls(self.cls_num, imb_type, imb_factor)
            self.gen_imbalanced_data(img_num_list)

        self.update_num_per_cls()

    # remove extra classes to make it 30 classes
    def remove_extra_class(self, cls_num):
        new_data = []
        new_targets = []
        targets_np = np.array(self.targets, dtype=np.int64)
        classes = np.unique(targets_np)
        for i in range(cls_num):
            idx = np.where(targets_np == i)[0]
            new_data.append(self.data[idx, ...])
            new_targets.extend([i, ] * len(idx))
        new_data = np.vstack(new_data)
        self.data = new_data
        self.targets = new_targets


    # get the number of images per class we desire
    def get_img_num_per_cls(self, cls_num, imb_type, imb_factor):
        img_max = len(self.data) / cls_num
        img_num_per_cls = []
        if imb_type == 'exp':
            for cls_idx in range(cls_num):
                num = img_max * (imb_factor ** (cls_idx / (cls_num - 1.0)))
                img_num_per_cls.append(int(num))
        elif imb_type == 'step':
            for cls_idx in range(cls_num // 2):
                img_num_per_cls.append(int(img_max))
            for cls_idx in range(cls_num // 2):
                img_num_per_cls.append(int(img_max * imb_factor))
        else:
            img_num_per_cls.extend([int(img_max)] * cls_num)
        return img_num_per_cls

    # generate imbalanced data from original dataset with given img_num_per_cls
    def gen_imbalanced_data(self, img_num_per_cls):
        new_data = []
        new_targets = []
        targets_np = np.array(self.targets)
        classes = np.unique(targets_np)
        self.num_per_cls_dict = dict()
        for the_class, the_img_num in zip(classes, img_num_per_cls):
            self.num_per_cls_dict[the_class] = the_img_num
            idx = np.where(targets_np == the_class)[0]
            np.random.shuffle(idx)
            selec_idx = idx[:the_img_num]
            new_data.append(self.data[selec_idx, ...])
            new_targets.extend([the_class, ] * the_img_num)
        new_data = np.vstack(new_data)
        self.data = new_data
        self.targets = new_targets

    def get_cls_num_list(self):
        cls_num_list = []
        for i in range(self.cls_num):
            cls_num_list.append(self.num_per_cls_dict[i])
        return cls_num_list

    def update_num_per_cls(self):
        targets_np = np.array(self.targets, dtype=np.int64)
        classes = np.unique(targets_np)
        self.num_per_cls_dict = dict()
        for cls in classes:
            self.num_per_cls_dict[cls] = len(np.where(targets_np == cls)[0])

We will also apply some extra transforms (augmentations) to our CIFAR-30 training data:

In [None]:
# transforms for training and testing
training_transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.RandomHorizontalFlip(p=0.5),  # flip each image with probability 0.5
     transforms.RandomAffine(degrees=60),
     transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)
testing_transform = transforms.Compose(
    [transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)

# create the datasets
cifar30_trainset = CIFAR30(root='./data', train=True, download=True,
                                transform=training_transform, imbalanced=False)
im_cifar30_trainset = CIFAR30(root='./data', train=True, download=True,
                                transform=training_transform, imbalanced=True)
cifar30_testset = CIFAR30(root='./data', train=False, download=True,
                               transform=testing_transform, imbalanced=False)

Compare the data (label) distribution of the three datasets we constructed, `cifar30_trainset`, `im_cifar30_trainset`, and `cifar30_testset`:

1. Balanced Training Data
2. Imbalanced Training Data
3. Balanced Testing Data

In [None]:
training_distribution = list(cifar30_trainset.num_per_cls_dict.values())
im_training_distribution = list(im_cifar30_trainset.num_per_cls_dict.values())
testing_distribution = list(cifar30_testset.num_per_cls_dict.values())
training_cls = list(im_cifar30_trainset.num_per_cls_dict.keys())

plt.subplots(1, 3, sharey=True, figsize=(14,4))

plt.subplot(1, 3, 1)
plt.bar(training_cls, training_distribution, color='blue')
plt.title('Training Data (Balanced)')
plt.ylabel('Number of Images')
plt.subplot(1, 3, 2)
plt.bar(training_cls, im_training_distribution, color='blue')
plt.title('Training Data (Imbalanced)')
plt.subplot(1, 3, 3)
plt.bar(training_cls, testing_distribution, color='red')
plt.title('Testing Data')

Show some images with their labels (class names) from the dataset.

In [None]:
def cifar_imshow(img):
  img = img / 2 + 0.5 # unnormalize the image
  npimg = img.numpy()
  return np.transpose(npimg, (1, 2, 0)) # reorganize the channel

# visualize some samples in the CIFAR-30 dataset
fig, axs = plt.subplots(3, 10, figsize = (12, 4))

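# loop through subplots and images
for i, ax in enumerate(axs.flat):
  # the test set is sorted by class with 100 images each, so index i*100 is the first image of class i
  img, label = cifar30_testset[i*100]
  ax.imshow(cifar_imshow(img))
  ax.axis('off')
  ax.set_title(cifar30_testset.classes[label], fontsize=8)  # show the class name, not the index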
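
The un-normalization step in `cifar_imshow` inverts the `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` transform, which maps each channel value $x \in [0, 1]$ to $(x - 0.5) / 0.5 \in [-1, 1]$. A minimal arithmetic check on scalar pixel values (the function names below are illustrative, not torchvision APIs):

```python
def normalize(x, mean=0.5, std=0.5):
    """Per-channel normalization as applied by the transform."""
    return (x - mean) / std

def unnormalize(y):
    """Inverse mapping used before plotting: [-1, 1] back to [0, 1]."""
    return y / 2 + 0.5

for x in (0.0, 0.25, 0.5, 1.0):
    y = normalize(x)
    print(x, '->', y, '->', unnormalize(y))
```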

## 2-a. Train CNN Baseline

Check whether your runtime is on GPU or not.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Using device:', device)
!nvidia-smi

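The CNN we will be using in this problem is ResNet. The cell below defines the architecture and instantiates its ResNet-18 configuration (`BasicBlock` with `[2, 2, 2, 2]` blocks per stage):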



In [None]:
class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_planes, planes, stride=1):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(
            in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, in_planes, planes, stride=1):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion *
                               planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out


class ResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=30):
        super(ResNet, self).__init__()
        self.in_planes = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        self.linear = nn.Linear(512*block.expansion, num_classes)

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = F.avg_pool2d(out, 4)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out

my_cnn = ResNet(BasicBlock, [2, 2, 2, 2]).to(device)
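
To see why the final `nn.Linear` layer takes `512*block.expansion` inputs, it helps to trace the spatial size of a 32x32 CIFAR image through the network. The sketch below uses the standard convolution output-size formula `floor((n + 2p - k) / s) + 1`; the helper `conv_out` is a hypothetical name introduced only for illustration.

```python
def conv_out(n, kernel, stride, padding):
    """Spatial output size of a convolution applied to an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

size = conv_out(32, kernel=3, stride=1, padding=1)   # conv1 keeps the 32 x 32 resolution
sizes = [size]
for stride in (1, 2, 2, 2):                          # strided 3x3 conv that opens layer1..layer4
    size = conv_out(size, kernel=3, stride=stride, padding=1)
    sizes.append(size)
pooled = size // 4                                   # F.avg_pool2d(out, 4)

print(sizes)    # spatial size after conv1 and after each of the four stages
print(pooled)   # spatial size after average pooling
```

Since the map is 1x1 after pooling, flattening yields `512 * block.expansion` features, i.e. 512 for `BasicBlock`.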

## Trainer and Tester code

In [None]:
def trainer(train_loader, valid_loader, model, config, device, weight=None):

    criterion = nn.CrossEntropyLoss(reduction='mean', weight=weight)
    optimizer = torch.optim.SGD(model.parameters(), lr=config['learning_rate'], momentum=0.7)

    if not os.path.isdir('./models'):
        os.mkdir('./models') # Create the directory for saving models.

    n_epochs, best_loss, step, early_stop_count = config['n_epochs'], math.inf, 0, 0

    for epoch in range(n_epochs):
        model.train() # Set your model to train mode.
        loss_record = []

        # tqdm is a package to visualize your training progress.
        train_pbar = tqdm(train_loader, position=0, leave=True)

        for x, y in train_pbar:
            optimizer.zero_grad()               # Set gradient to zero.
            x, y = x.to(device), y.to(device)   # Move your data to device.
            pred = model(x)
            loss = criterion(pred, y)
            loss.backward()                     # Compute gradient(backpropagation).
            optimizer.step()                    # Update parameters.
            step += 1
            loss_record.append(loss.detach().item())

            # Display current epoch number and loss on tqdm progress bar.
            train_pbar.set_description(f'Epoch [{epoch+1}/{n_epochs}]')
            train_pbar.set_postfix({'loss': loss.detach().item()})

        mean_train_loss = sum(loss_record)/len(loss_record)
        # writer.add_scalar('Loss/train', mean_train_loss, step)

        model.eval() # Set your model to evaluation mode.
        loss_record = []
        val_accuracy = []
        for x, y in valid_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                loss = criterion(pred, y)

                _, predicted = torch.max(pred.data, 1)
                val_accuracy.append((predicted == y).sum().item() / predicted.size(0))
            loss_record.append(loss.item())
        print('Accuracy:', sum(val_accuracy)/len(val_accuracy))

    torch.save(model.state_dict(), config['save_path'])

def tester(test_loader, model, config, device):

    model.eval() # Set your model to evaluation mode.
    loss_record = []
    test_accuracy = []
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():
            pred = model(x)
            _, predicted = torch.max(pred.data, 1)
            test_accuracy.append((predicted == y).sum().item() / predicted.size(0))
    print('Test accuracy:', sum(test_accuracy)/len(test_accuracy))
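
The `tester` above reports a single overall accuracy. On long-tailed data it is often informative to also look at accuracy per class, since a model can score reasonably overall while failing on tail classes. The following is a minimal sketch on plain Python lists; `per_class_accuracy` is a hypothetical helper, and in the notebook the predictions and labels would first have to be collected from `test_loader` (for example via `predicted.tolist()` and `y.tolist()`).

```python
def per_class_accuracy(preds, labels, num_classes):
    """Return a list whose i-th entry is the accuracy on class i (None if class i is absent)."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for p, y in zip(preds, labels):
        total[y] += 1
        correct[y] += int(p == y)
    return [c / t if t else None for c, t in zip(correct, total)]

# Toy example with 3 classes: class 2 (a "tail" class) is never predicted.
labels = [0, 0, 0, 1, 1, 2, 2]
preds  = [0, 0, 1, 1, 1, 0, 1]
acc = per_class_accuracy(preds, labels, num_classes=3)
print(acc)
```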

## Sample Config Dict

In [None]:
config = {
    'seed': 1968990,      # Your seed number, you can pick your lucky number. :)
    'valid_ratio': 0.2,   # validation_size = train_size * valid_ratio
    'n_epochs': 50,      # Number of epochs.
    'batch_size': 32,
    'learning_rate': 0.01,
    'early_stop': 20,    # If model has not improved for this many consecutive epochs, stop training.
    'save_path': './models/baseline_model.ckpt'  # Your model will be saved here.
}

## Prepare the Dataloader

In [None]:
# Original CIFAR-30
cifar30_train_data, cifar30_valid_data = random_split(cifar30_trainset, [0.8, 0.2])

train_loader = torch.utils.data.DataLoader(cifar30_train_data, batch_size=config['batch_size'], shuffle=True)
valid_loader = torch.utils.data.DataLoader(cifar30_valid_data, batch_size=config['batch_size'], shuffle=True)

# Imbalanced CIFAR-30
im_cifar30_train_data, im_cifar30_valid_data = random_split(im_cifar30_trainset, [0.8, 0.2])

im_train_loader = torch.utils.data.DataLoader(im_cifar30_train_data, batch_size=config['batch_size'], shuffle=True)
im_valid_loader = torch.utils.data.DataLoader(im_cifar30_valid_data, batch_size=config['batch_size'], shuffle=True)

# CIFAR-30 Testing (always balanced)
test_loader = torch.utils.data.DataLoader(cifar30_testset, batch_size=config['batch_size'], shuffle=False)

### Train on imbalanced CIFAR-30

In [None]:
trainer(im_train_loader,  im_valid_loader, my_cnn, config, device)

In [None]:
tester(test_loader, my_cnn, config, device)

## 2-b. Implement Re-Weighting

Hint:

Notice there is a "weight" argument for the loss we use:

```
criterion = nn.CrossEntropyLoss(reduction='mean', weight=weight)
```
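
One common way to fill in `weight` is inverse class frequency: classes with fewer training images get proportionally larger weights. The sketch below is only one possible choice under that assumption, not the required solution; `inverse_frequency_weights` is a hypothetical helper and the toy counts are made up for illustration. In the notebook, the counts could come from `im_cifar30_trainset.get_cls_num_list()`, and the resulting list would be converted with `torch.tensor(weights, dtype=torch.float32).to(device)` before being passed to `trainer`.

```python
def inverse_frequency_weights(class_counts):
    """Weights proportional to 1 / n_i, rescaled so that they average to 1."""
    inv = [1.0 / n for n in class_counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]

toy_counts = [500, 50, 5]   # made-up head, middle, and tail class sizes
weights = inverse_frequency_weights(toy_counts)
print(weights)              # the tail class receives the largest weight
```

Rescaling to an average of 1 keeps the overall loss magnitude comparable to the unweighted case, which may matter when the learning rate is fixed by the config.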


In [None]:
# Please do not modify the config
re_weighting_config = {
    'seed': 1968990,      # Your seed number, you can pick your lucky number. :)
    'select_all': True,   # Whether to use all features.
    'valid_ratio': 0.2,   # validation_size = train_size * valid_ratio
    'n_epochs': 50,      # Number of epochs.
    'batch_size': 32,
    'learning_rate': 0.001,
    'early_stop': 20,    # If model has not improved for this many consecutive epochs, stop training.
    'save_path': './models/re_weighting_model.ckpt'  # Your model will be saved here.
}

# TODO

# ENDS HERE

## 2-c. Evaluate Re-Weighting

In [None]:
tester(test_loader, my_cnn_re_weighting, re_weighting_config, device)

## 2-d. Implement Re-Sampling

Hint:

Check out how the `sampler` argument works in PyTorch's DataLoader!


```
from torch.utils.data import WeightedRandomSampler

sampler = WeightedRandomSampler(weights=sample_weights, num_samples=len(class_counts) * 500)
rs_train_loader = DataLoader(im_cifar30_trainset, batch_size=config['batch_size'], sampler=sampler)
```
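
`WeightedRandomSampler` draws sample indices with probability proportional to a per-sample weight, so giving every image the weight `1 / n_c` of its class `c` makes each class roughly equally likely to be drawn. The sketch below illustrates that idea using only the standard library, with `random.choices` standing in for the sampler; the toy label list and the helper name `make_sample_weights` are assumptions for illustration. In the notebook the labels would come from `im_cifar30_trainset.targets`.

```python
import random
from collections import Counter

def make_sample_weights(targets):
    """One weight per sample: the inverse of the size of that sample's class."""
    class_counts = Counter(targets)
    return [1.0 / class_counts[t] for t in targets]

targets = [0] * 90 + [1] * 9 + [2] * 1      # toy long-tailed labels
sample_weights = make_sample_weights(targets)

rng = random.Random(0)
# Draw indices with replacement, proportionally to the per-sample weights.
drawn = rng.choices(range(len(targets)), weights=sample_weights, k=3000)
drawn_counts = Counter(targets[i] for i in drawn)
print(drawn_counts)                         # class frequencies among the drawn samples
```

Because sampling is with replacement, tail images are repeated many times per epoch, which is the usual trade-off of re-sampling.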



In [None]:
# Please do not modify the config
config = {
    'seed': 1968990,      # Your seed number, you can pick your lucky number. :)
    'select_all': True,   # Whether to use all features.
    'valid_ratio': 0.2,   # validation_size = train_size * valid_ratio
    'n_epochs': 50,      # Number of epochs.
    'batch_size': 32,
    'learning_rate': 0.001,
    'early_stop': 20,    # If model has not improved for this many consecutive epochs, stop training.
    'save_path': './models/re_sampling_model.ckpt'  # Your model will be saved here.
}


# TODO

# ENDS HERE

In [None]:
my_cnn_re_sampling = ResNet(BasicBlock, [2, 2, 2, 2]).to(device)
trainer(rs_train_loader, test_loader, my_cnn_re_sampling, config, device)

## 2-e. Evaluate Re-Sampling

In [None]:
tester(test_loader, my_cnn_re_sampling, config, device)