## Deep Learning Lab: ECE-00450107
## Meeting 1 - Part 2: Dealing with unbalanced datasets

Before running the code in this file, make sure that you have **activated the environment** in which the packages imported below are installed.

#### Definitions and Imports:

In [1]:
####################################
## DO NOT EDIT THIS CODE SECTION
from DL_Lab1_functions import *
import albumentations as A
warnings.filterwarnings('ignore')
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
matplotlib.use('Agg')

%matplotlib tk
####################################


Set fixed seeds to enable reproducing the results:

In [2]:
# write the last ID digit of each student in the team
id_digit1 = 206492910
id_digit2 = 319046504
seed = (id_digit1+id_digit2)%10

####################################
## DO NOT EDIT THIS CODE SECTION
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
torch.use_deterministic_algorithms(True)
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = False
train_flag = False
####################################
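
The comment in the cell above asks for the last ID digit of each student, while the values entered look like full ID numbers. Because the seed is taken modulo 10, both choices lead to the same seed. A small sketch of that arithmetic, using the two values from the cell (the `*_last` and `seed_from_*` names are illustrative only):

```python
# Values copied from the cell above; the other names are illustrative.
id_full1, id_full2 = 206492910, 319046504
id_last1, id_last2 = id_full1 % 10, id_full2 % 10  # last digits: 0 and 4

# (a + b) % 10 depends only on the last digits of a and b.
seed_from_full = (id_full1 + id_full2) % 10
seed_from_last = (id_last1 + id_last2) % 10

print(seed_from_full, seed_from_last)  # both are 4
```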

In this part of the meeting we will use the "MNIST-Fashion" dataset.  
For the purpose of learning, we will remove some of its samples and classes to create a smaller, unbalanced dataset.

#### Define the device and load the dataset 

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
fulltrain = torchvision.datasets.FashionMNIST(root="/usr/share/DL_exp/datasets/mnist-fashion",train=True, download=True)
fulltest = torchvision.datasets.FashionMNIST(root="/usr/share/DL_exp/datasets/mnist-fashion",train=False, download=True)

###############################################################################################################################
# Flags for code enabling/disabling:
dataset_flag = 2 # 1-clothes balanced dataset, 2-clothes unbalanced dataset
train_code_flag = 3 # 0-regular training, 1-training with oversampling, 2-training with weights, 3-training with augmentations
###############################################################################################################################

if(dataset_flag == 1):
    train, test = fashion_mnist_leave_only_clothes(fulltrain, fulltest)
    class_names = ['0: T-shirt/top', '1: Trouser', '2: Pullover', '3: Dress', '4: Coat', '5: Shirt']
elif(dataset_flag == 2):
    train, test = fashion_mnist_imbalanced(fulltrain, fulltest, seed=seed)
    class_names = ['0: T-shirt/top', '1: Trouser', '2: Pullover', '3: Dress', '4: Coat', '5: Shirt']
else:
    print("Wrong flag value")

train_size = len(train)
test_size = len(test)

# Display dataset distribution - train and test
labels1 = [label for _, label in train]
label_counts1 = Counter(labels1)
plot_counts1 = []
plot_labels1 = []
print(f"The train set contains {train_size} images, divided into {len(label_counts1.items())} sections:")
for label, count in label_counts1.items():
    print(f"Label {class_names[label]}: {count} examples")
    plot_counts1.append(count)
    plot_labels1.append(f"class {label}:\n{count} examples")

labels2 = [label for _, label in test]
label_counts2 = Counter(labels2)
plot_counts2 = []
plot_labels2 = []
print(f"The test set contains {test_size} images, divided into {len(label_counts2.items())} sections:")
for label, count in label_counts2.items():
    print(f"Label {label}: {count} examples")
    plot_counts2.append(count)
    plot_labels2.append(f"class {label}:\n{count} examples")   
        
if train_code_flag == 0:
    # Display a few samples
    display_Mnist_Sample(train) 

    # Plot the Dataset as pie chart
    fig, axs = plt.subplots(1, 2, figsize=(20, 6))
    axs[0].pie(plot_counts1, labels=plot_labels1, autopct='%1.1f%%', startangle=140)
    axs[0].set_title('Class Distribution in Train Dataset')    
    axs[1].pie(plot_counts2, labels=plot_labels2, autopct='%1.1f%%', startangle=140)
    axs[1].set_title('Class Distribution in Test Dataset')
    plt.show()

The train set contains 26700 images, divided into 6 sections:
Label 0: T-shirt/top: 6000 examples
Label 2: Pullover: 1500 examples
Label 1: Trouser: 6000 examples
Label 5: Shirt: 6000 examples
Label 4: Coat: 6000 examples
Label 3: Dress: 1200 examples
The test set contains 6000 images, divided into 6 sections:
Label 2: 1000 examples
Label 1: 1000 examples
Label 5: 1000 examples
Label 4: 1000 examples
Label 3: 1000 examples
Label 0: 1000 examples
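
To put numbers on the imbalance, the train counts printed above can be turned into class shares and a largest-to-smallest ratio. This is only a sketch in plain Python; the dictionary is typed in by hand from the output above, and the variable names are illustrative.

```python
# Train-set class counts, copied from the printed output above.
train_counts = {0: 6000, 1: 6000, 2: 1500, 3: 1200, 4: 6000, 5: 6000}

total = sum(train_counts.values())
shares = {label: count / total for label, count in train_counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())

for label in sorted(shares):
    print(f"class {label}: {train_counts[label]} examples ({shares[label]:.1%})")
print(f"total: {total}, imbalance ratio: {imbalance_ratio}")
```

The test set, in contrast, holds 1000 examples per class, so test accuracy weighs every class equally.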


Save data in tensors

In [4]:
def subset_to_tensor(subset, device):
    data = []
    labels = []
    for idx in range(len(subset)):
        sample, label = subset[idx]
        data.append(np.array(sample))
        labels.append(label)
    
    data_tensor = torch.tensor(np.array(data), dtype=torch.float32).to(device)
    labels_tensor = torch.tensor(np.array(labels), dtype=torch.long).to(device)    
    
    return data_tensor, labels_tensor

train_data_notNorm, train_labels = subset_to_tensor(train, device)
test_data_notNorm, test_labels = subset_to_tensor(test, device)

Complete the following code (copy the relevant parts from the previous file):

Normalize the data

In [5]:
# TODO: complete the code
train_mean = torch.mean(train_data_notNorm)
train_std = torch.std(train_data_notNorm)

train_data = (train_data_notNorm-train_mean)/train_std
test_data = (test_data_notNorm-train_mean)/train_std
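
The cell above computes the mean and standard deviation on the training data only and reuses them for the test data. A minimal sketch of the same idea in plain Python, on made-up pixel values (the lists and the `normalize` helper are hypothetical, not part of the lab code):

```python
from statistics import mean, stdev

# Toy "pixel" values standing in for the real tensors.
train_pixels = [0.0, 50.0, 100.0, 150.0, 200.0]
test_pixels = [25.0, 125.0, 255.0]

train_mean = mean(train_pixels)
# stdev() is the sample standard deviation (Bessel's correction),
# which is also the default behaviour of torch.std.
train_std = stdev(train_pixels)

def normalize(values, mu, sigma):
    """Shift by mu and scale by sigma, element by element."""
    return [(v - mu) / sigma for v in values]

train_norm = normalize(train_pixels, train_mean, train_std)
# The test data is normalized with the TRAIN statistics, never its own.
test_norm = normalize(test_pixels, train_mean, train_std)

print(mean(train_norm), stdev(train_norm))
print(mean(test_norm))
```

After this step the training values have zero mean and unit standard deviation, while the test values generally do not, which is expected.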


Define the class of the Neural Network

In [6]:
# Creating a Fully Connected Neural Network Architecture Class 
class OurClothNetwork(nn.Module):
    def __init__(self, input_size: int, hidden_layer_size: int, output_size: int):
        super(OurClothNetwork, self).__init__()
        # define the network fully connected layers
        self.fc_layer1 = nn.Linear(input_size, hidden_layer_size)
        self.fc_layer2 = nn.Linear(hidden_layer_size, output_size)
        # define a flatten layer
        self.flatten = nn.Flatten()
        # define a sigmoid (activation) layer
        self.activation = nn.Sigmoid()
      
    def forward(self, x):
        # define the input layer, operating on flattened inputs
        flattened_x = self.flatten(x)
        # define the first layer, using linear operation and then activation
        z1 = self.fc_layer1(flattened_x)
        z2 = self.activation(z1)
        # define the output layer
        return self.fc_layer2(z2)


Define the hyper-parameters that will be used during the training.

In [7]:
# Defining hyper-parameters
hparams = Hyper_Params()
hparams.train_size = train_size
hparams.lr = 0.3
hparams.batch_size = 100
hparams.epochs = 30

Set up the model, and define Stochastic Gradient Descent (SGD) as the optimizer and Cross Entropy as the loss.  
*Attention:* What are the sizes of the input and output layers?

In [8]:
# Set the network model with a hidden layer of size 200 and send to device
model = OurClothNetwork(input_size=784, hidden_layer_size=200, output_size=6).to(device)

# Define the optimizer
optimizer = optim.SGD(model.parameters(), lr=hparams.lr)
# Define the loss criterion
loss_function = nn.CrossEntropyLoss()
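
As a sanity check on the sizes passed to the model above, the following sketch derives the input size from the 28x28 Fashion-MNIST images and counts the trainable parameters of the two linear layers. The `linear_params` helper is illustrative and not part of the lab code.

```python
# Fashion-MNIST images are 28x28 grayscale, flattened before the first layer.
image_height, image_width = 28, 28
input_size = image_height * image_width
hidden_layer_size = 200
output_size = 6  # one output per remaining clothing class

def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out

params_fc1 = linear_params(input_size, hidden_layer_size)
params_fc2 = linear_params(hidden_layer_size, output_size)
total_params = params_fc1 + params_fc2

print(input_size, params_fc1, params_fc2, total_params)
```

The flatten and sigmoid layers have no trainable parameters, so they do not change the count.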

Train the Model

In [9]:
if train_code_flag == 0:
    if not train_flag:
        print("Training started...")
        train_flag = True
        
        # Start a progress graph and a performance table, for visualization of the training process:
        hparams.fig, (hparams.ax1, hparams.ax2) = plt.subplots(2, 1, figsize=(15, 9))
        print_performance_grid(Flag=True)
        # Calculate how many iterations the model trains in each epoch
        iter_num = int(np.ceil(hparams.train_size/hparams.batch_size))
        
        # Set the model to training mode
        model.train()
        start_time = time.time() # time the start of training
        # Training loop:
        for epoch in range(hparams.epochs):
            # for each epoch, do:
            hparams.epoch_accuracy_train = np.zeros(iter_num)
            hparams.epoch_loss_train = np.zeros(iter_num)
            # randomly reshuffle the training set before each new epoch:
            index = torch.randperm(hparams.train_size)
            train_data_perm = train_data[index]
            train_labels_perm = train_labels[index]
            # for each batch, do:
            for i, batch in enumerate(range(0, hparams.train_size, hparams.batch_size)):
                # Get a new batch of images and labels
                data = train_data_perm[batch:batch+hparams.batch_size].to(device) 
                target = train_labels_perm[batch:batch+hparams.batch_size].squeeze().to(device)
                
                # Forward pass
                output = model(data) # Apply the network on the new examples
                loss = loss_function(output, target) # calculate the value of the loss function
                
                # Backward pass - ALWAYS IN THIS ORDER!
                optimizer.zero_grad() # First, delete the gradients from the previous iteration
                loss.backward() # Run backward pass on the loss
                optimizer.step() # Perform an algorithm step (using the optimizer)
        
                # Save the loss and accuracy for the graph visualization
                hparams.epoch_accuracy_train[i] = multi_class_accuracy(output, target.squeeze().to(device))
                hparams.epoch_loss_train[i] = loss.item()
                if(i == 0 and epoch == 0) or ((i+1) == iter_num):
                    # Switch the model to evaluation mode in order to evaluate the loss and accuracy on the test set
                    model.eval()
                    test_out = model(test_data)
                    test_loss = loss_function(test_out, test_labels.squeeze()).item()
                    test_accuracy = multi_class_accuracy(test_out, test_labels.squeeze())
                    print_performance(epoch, i, hparams, test_loss, test_accuracy)
                    model.train()
                              
        plt.show()
        print(f"Total training took {time.time() - start_time:.2f} seconds")
        
        print("Training finished.")
    else:
        print("Error: Please restart the kernel before running the train again.")
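
The batching pattern used in the training loop (shuffle once per epoch, then slice in steps of the batch size) can be seen on a toy example. This is a plain-Python sketch with made-up sizes; `random.Random.shuffle` stands in for `torch.randperm`.

```python
import math
import random

train_size = 10   # toy value; the notebook uses len(train)
batch_size = 4    # toy value; the notebook uses hparams.batch_size
iter_num = math.ceil(train_size / batch_size)

rng = random.Random(0)        # fixed seed so the sketch is repeatable
index = list(range(train_size))
rng.shuffle(index)            # plays the role of torch.randperm

# Slice the shuffled indices into consecutive batches.
batches = [index[start:start + batch_size]
           for start in range(0, train_size, batch_size)]
batch_sizes = [len(b) for b in batches]

print(iter_num, batch_sizes)  # 3 iterations, the last batch is smaller
```

Every sample appears exactly once per epoch. With the notebook's values (26700 samples, batch size 100) the division is exact, so all 267 batches are full.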

Train the Model using random oversampling

In [10]:
if train_code_flag == 1:
    if not train_flag:
        print("Training with oversampling started...")
        train_flag = True
        
        num_of_classes = len(class_names)
        # ------------------ Oversampling - Before train -----------------------
        # a "balanced" batch should contain an equal number of samples from each class:
        num_of_samples_in_batch_from_each_class = int(hparams.batch_size/num_of_classes) # (how many samples from each class should be in a 'balanced batch'?)
 
        # Prepare a list of the indices for each class:
        class_indices = [[] for _ in range(num_of_classes)] # initiate class indices with empty lists
        for idx, label in enumerate(train_labels):
            class_indices[label].append(idx) # insert index to the list of its class according to the label
        
        # make sure that batch_size represents the required number of samples
        if(hparams.batch_size > num_of_samples_in_batch_from_each_class*num_of_classes):
            num_of_samples_in_batch_from_each_class = num_of_samples_in_batch_from_each_class + 1
            print(f"Number of samples per class was increased to {num_of_samples_in_batch_from_each_class}.")
        if(hparams.batch_size < num_of_samples_in_batch_from_each_class*num_of_classes):
            hparams.batch_size = num_of_samples_in_batch_from_each_class*num_of_classes
            print(f"Batch size was updated to {hparams.batch_size} so it will contain the same amount from each class.")
        # ----------------------------------------------------------------------

        # Start a progress graph and a performance table, for visualization of the training process:
        hparams.fig, (hparams.ax1, hparams.ax2) = plt.subplots(2, 1, figsize=(15, 9))
        print_performance_grid(Flag=True)
        # Calculate how many iterations the model trains in each epoch
        iter_num = int(np.ceil(hparams.train_size/hparams.batch_size))

        # Set the model to training mode
        model.train()
        start_time = time.time() # time the start of training
        # Training loop:
        for epoch in range(hparams.epochs):
            # for each epoch, do:
            hparams.epoch_accuracy_train = np.zeros(iter_num)
            hparams.epoch_loss_train = np.zeros(iter_num)

            # for each batch, do:
            for i, batch in enumerate(range(0, hparams.train_size, hparams.batch_size)):
                # --------------- Oversampling - During train --------------------------
                # Get a new batch of images and labels
                batch_indices = []
                # for each class: randomly select the required amount of samples (from that class) to be included in the batch
                for idx in range(num_of_classes):
                    batch_indices.extend(np.random.choice(class_indices[idx], num_of_samples_in_batch_from_each_class, replace=False))
                np.random.shuffle(batch_indices)
                # ----------------------------------------------------------------------
                data = train_data[batch_indices].to(device)
                target = train_labels[batch_indices].squeeze().to(device)
                
                # Forward pass
                output = model(data) # Apply the network on the new examples
                loss = loss_function(output, target) # calculate the value of the loss function
                
                # Backward pass - ALWAYS IN THIS ORDER!
                optimizer.zero_grad() # First, delete the gradients from the previous iteration
                loss.backward() # Run backward pass on the loss
                optimizer.step() # Perform an algorithm step (using the optimizer)
        
                # Save the loss and accuracy for the graph visualization
                hparams.epoch_accuracy_train[i] = multi_class_accuracy(output, target.squeeze().to(device))
                hparams.epoch_loss_train[i] = loss.item()
                if(i == 0 and epoch == 0) or ((i+1) == iter_num):
                    # Switch the model to evaluation mode in order to evaluate the loss and accuracy on the test set
                    model.eval()
                    test_out = model(test_data)
                    test_loss = loss_function(test_out, test_labels.squeeze()).item()
                    test_accuracy = multi_class_accuracy(test_out, test_labels.squeeze())
                    print_performance(epoch, i, hparams, test_loss, test_accuracy)
                    model.train()
                              
        plt.show()
        print(f"Total training took {time.time() - start_time:.2f} seconds")
        
        print("Training finished.")
    else:
        print("Error: Please restart the kernel before running the train again.")
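
The two oversampling steps in the cell above (adjusting the batch size, then drawing the same number of samples from every class) can be traced on a toy label list. The sketch below uses plain Python instead of NumPy; the class sizes are hypothetical and only mimic the imbalance of the real data.

```python
import random
from collections import Counter

batch_size = 100
num_of_classes = 6

# Same adjustment as in the cell above: 100 // 6 = 16, bumped to 17,
# which in turn raises the batch size to 17 * 6 = 102.
per_class = batch_size // num_of_classes
if batch_size > per_class * num_of_classes:
    per_class += 1
batch_size = per_class * num_of_classes

# Hypothetical toy dataset: classes 2 and 3 are under-represented.
class_sizes = [60, 60, 20, 18, 60, 60]
labels = [c for c, n in enumerate(class_sizes) for _ in range(n)]

class_indices = [[] for _ in range(num_of_classes)]
for idx, label in enumerate(labels):
    class_indices[label].append(idx)

rng = random.Random(0)
batch_indices = []
for c in range(num_of_classes):
    # Sample without replacement inside one batch, as replace=False does.
    batch_indices.extend(rng.sample(class_indices[c], per_class))
rng.shuffle(batch_indices)

batch_counts = Counter(labels[i] for i in batch_indices)
print(per_class, batch_size, dict(batch_counts))
```

Within a single batch no sample repeats, but across batches the samples of the small classes are drawn far more often than those of the large ones, which is where the oversampling comes from.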

Train the Model using weighting

In [11]:
if train_code_flag == 2:
    if not train_flag:
        print("Training with weights started...")
        train_flag = True
        
        num_of_classes = len(class_names)
        # Start a progress graph and a performance table, for visualization of the training process:
        hparams.fig, (hparams.ax1, hparams.ax2) = plt.subplots(2, 1, figsize=(15, 9))
        print_performance_grid(Flag=True)
        # Calculate how many iterations the model trains in each epoch
        iter_num = int(np.ceil(hparams.train_size/hparams.batch_size))

        # ------------------ Weighting - Before train -----------------------
        # Prepare a list of the indices of each class (same as in oversampling):
        class_indices = [[] for _ in range(num_of_classes)]
        for idx, label in enumerate(train_labels):
            class_indices[label].append(idx)
            
        # Calculate weights for each class based on their size:
        class_weights = []
        class_counts = []
        for i in range(num_of_classes):
            
            count = len(class_indices[i]) 
            weight = 6000/count             # Inverse-frequency weighting, relative to the largest class (6000 samples)
            print(weight)
            class_weights.append(weight)
            class_counts.append(count)

        class_weights = torch.FloatTensor(class_weights).to(device) # Convert to torch tensor
        weighted_loss_function = nn.CrossEntropyLoss(weight=class_weights) # Create weighted loss function
        # --------------------------------------------------------------------

        # Set the model to training mode
        model.train()
        start_time = time.time() # time the start of training
        # Training loop:
        for epoch in range(hparams.epochs):
            # for each epoch, do:
            hparams.epoch_accuracy_train = np.zeros(iter_num)
            hparams.epoch_loss_train = np.zeros(iter_num)
            # randomly reshuffle the training set before each new epoch:
            index = torch.randperm(hparams.train_size)
            train_data_perm = train_data[index]
            train_labels_perm = train_labels[index]
            # for each batch, do:
            for i, batch in enumerate(range(0, hparams.train_size, hparams.batch_size)):
                # Get a new batch of images and labels
                data = train_data_perm[batch:batch+hparams.batch_size].to(device) 
                target = train_labels_perm[batch:batch+hparams.batch_size].squeeze().to(device)
                
                # --------------- Weighting - During train --------------------------
                # Forward pass
                output = model(data)
                loss = weighted_loss_function(output, target)
                # -------------------------------------------------------------
                
                # Backward pass - ALWAYS IN THIS ORDER!
                optimizer.zero_grad() # First, delete the gradients from the previous iteration
                loss.backward() # Run backward pass on the loss
                optimizer.step() # Perform an algorithm step (using the optimizer)
        
                # Save the loss and accuracy for the graph visualization
                hparams.epoch_accuracy_train[i] = multi_class_accuracy(output, target)
                hparams.epoch_loss_train[i] = loss.item()
                
                if(i == 0 and epoch == 0) or ((i+1) == iter_num):
                    # Switch the model to evaluation mode in order to evaluate the loss and accuracy on the test set
                    model.eval()
                    test_out = model(test_data)
                    test_loss = weighted_loss_function(test_out, test_labels.squeeze()).item()
                    test_accuracy = multi_class_accuracy(test_out, test_labels.squeeze())
                    print_performance(epoch, i, hparams, test_loss, test_accuracy)
                    model.train()
                              
        plt.show()
        print(f"Total training took {time.time() - start_time:.2f} seconds")
        
        print("Training finished.")
        
        print("Classes distribution:", class_counts)
        print("Classes weights:", class_weights.cpu().numpy())        
    else:
        print("Error: Please restart the kernel before running the train again.")
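
To see what the class weights do to the loss, the sketch below recomputes the weights from the train counts and evaluates a weighted cross entropy by hand. It assumes the behaviour of `nn.CrossEntropyLoss` with its default mean reduction, where the weighted per-sample losses are divided by the sum of the weights of the targets. The notebook hardcodes 6000; here `max(class_counts)` is used instead, which gives the same value for this dataset. The function and the logits are illustrative.

```python
import math

class_counts = [6000, 6000, 1500, 1200, 6000, 6000]
class_weights = [max(class_counts) / c for c in class_counts]

def weighted_cross_entropy(logits_batch, targets, weights):
    """Weighted mean of per-sample cross entropy, normalized by the target weights."""
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target in zip(logits_batch, targets):
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        nll = log_sum_exp - logits[target]  # -log softmax(logits)[target]
        weighted_sum += weights[target] * nll
        weight_total += weights[target]
    return weighted_sum / weight_total

# Two made-up samples: a confident T-shirt (class 0) and an undecided Dress (class 3).
logits_batch = [[2.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
targets = [0, 3]

loss = weighted_cross_entropy(logits_batch, targets, class_weights)
unweighted = weighted_cross_entropy(logits_batch, targets, [1.0] * 6)
print(class_weights, round(loss, 3), round(unweighted, 3))
```

The poorly classified Dress sample carries weight 5, so the weighted loss is larger than the unweighted one and the gradient pushes harder on the rare class.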

Define a function that uses off-line augmentation for balancing the dataset

In [12]:
def augment_and_balance_dataset(dataset, device):

    # save images and labels as NUMPY arrays
    images = []
    labels = []
    for idx in range(len(dataset)):
        im, lb = dataset[idx]
        images.append(np.array(im))
        labels.append(lb)
    
    images = np.array(images)
    labels = np.array(labels)
        
    # find the different sizes of each class
    class_counts = Counter(labels)
    max_class_count = max(class_counts.values())
    
    # define augmentations
    # replace each ??? with a name of an augmentation function from the list in the booklet
    # transform = A.Compose([A.???()])
    # or
    # transform = A.Compose([A.???(),A.???()])
    transform = A.Compose([
        A.HorizontalFlip(),
        A.RandomGamma(),
    ])



    if len(transform.transforms) > 0: # verify that at least one augmentation was defined
        # create a list containing the augmented copies
        augmented_images = []
        augmented_labels = []
        class_items = list(class_counts.items())
        num_of_classes = len(class_counts)
        
        # add augmentation to the smaller classes in order to create a balanced dataset
        for i in range(num_of_classes):
            class_label, count = class_items[i] # 'count' holds the amount of samples in the class, and 'class_label' holds the label of the class
            if count < max_class_count:
                num_of_augments_per_class = max_class_count-count  # calculate the number of augmented copies that should be added to each class
                class_indices = np.where(labels == class_label)[0] # find the class's samples
                for _ in range(num_of_augments_per_class):
                    image_idx = np.random.choice(class_indices) # randomly select a sample
                    augmented = transform(image=images[image_idx])['image'] # create an augmented copy of the sample
                    augmented_images.append(augmented) # add the augmented copy to the list
                    augmented_labels.append(class_label)
    
        # add the augmented images to the original dataset
        augmented_images = np.array(augmented_images)
        augmented_labels = np.array(augmented_labels)
        
        images = np.concatenate([images, augmented_images])
        labels = np.concatenate([labels, augmented_labels])
    
    # convert data to Tensors and create united dataset
    images_tensor = torch.tensor(images, dtype=torch.float32).to(device)
    labels_tensor = torch.tensor(labels, dtype=torch.long).to(device)

    return images_tensor, labels_tensor
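
The balancing arithmetic inside `augment_and_balance_dataset` can be checked without running any augmentation. The sketch below uses the train counts printed earlier in this notebook and the batch size of 100 from the hyper-parameters; the variable names are illustrative.

```python
import math

# Train-set class counts before augmentation (from the output earlier in the notebook).
class_counts = {0: 6000, 1: 6000, 2: 1500, 3: 1200, 4: 6000, 5: 6000}
max_class_count = max(class_counts.values())

# Number of augmented copies added to each class that is smaller than the largest one.
copies_needed = {label: max_class_count - count
                 for label, count in class_counts.items()
                 if count < max_class_count}

balanced_size = sum(class_counts.values()) + sum(copies_needed.values())
iter_num = math.ceil(balanced_size / 100)  # batch size 100

print(copies_needed, balanced_size, iter_num)
```

On average each Dress image is therefore the source of 4 augmented copies and each Pullover image of 3. The resulting 360 iterations per epoch agree with the `Iter No.` column of the training log for the augmentation run.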

Train the Model using augmentations

In [13]:
if train_code_flag == 3:
    if not train_flag:
        print("Training with augmentations started...")
        train_flag = True

        train_data_aug_notNorm, train_labels_aug = augment_and_balance_dataset(train, device) # expand the training set with augmentations
        # since we changed the training set, we re-calculate the mean and std values and normalize the dataset again
        train_mean_aug = torch.mean(train_data_aug_notNorm)
        train_std_aug = torch.std(train_data_aug_notNorm)
        train_data_aug = (train_data_aug_notNorm-train_mean_aug)/train_std_aug
        test_data = (test_data_notNorm-train_mean_aug)/train_std_aug        

        hparams.train_size = len(train_data_aug)

        # Start a progress graph and a performance table, for visualization of the training process:
        hparams.fig, (hparams.ax1, hparams.ax2) = plt.subplots(2, 1, figsize=(15, 9))
        print_performance_grid(Flag=True)
        # Calculate how many iterations the model trains in each epoch
        iter_num = int(np.ceil(hparams.train_size/hparams.batch_size))
       
        # Set the model to training mode
        model.train()
        start_time = time.time() # time the start of training
        # Training loop:
        for epoch in range(hparams.epochs):
            # for each epoch, do:
            hparams.epoch_accuracy_train = np.zeros(iter_num)
            hparams.epoch_loss_train = np.zeros(iter_num)
            # randomly reshuffle the training set before each new epoch:
            index = torch.randperm(hparams.train_size)
            train_data_perm = train_data_aug[index]
            train_labels_perm = train_labels_aug[index]
            # for each batch, do:
            for i, batch in enumerate(range(0, hparams.train_size, hparams.batch_size)):
                # Get a new batch of images and labels
                data = train_data_perm[batch:batch+hparams.batch_size].to(device) 
                target = train_labels_perm[batch:batch+hparams.batch_size].squeeze().to(device)
                
                # Forward pass
                output = model(data) # Apply the network on the new examples
                loss = loss_function(output, target) # calculate the value of the loss function
                
                # Backward pass - ALWAYS IN THIS ORDER!
                optimizer.zero_grad() # First, delete the gradients from the previous iteration
                loss.backward() # Run backward pass on the loss
                optimizer.step() # Perform an algorithm step (using the optimizer)
        
                # Save the loss and accuracy for the graph visualization
                hparams.epoch_accuracy_train[i] = multi_class_accuracy(output, target.squeeze().to(device))
                hparams.epoch_loss_train[i] = loss.item()
                if(i == 0 and epoch == 0) or ((i+1) == iter_num):
                    # Switch the model to evaluation mode in order to evaluate the loss and accuracy on the test set
                    model.eval()
                    test_out = model(test_data)
                    test_loss = loss_function(test_out, test_labels.squeeze()).item()
                    test_accuracy = multi_class_accuracy(test_out, test_labels.squeeze())
                    print_performance(epoch, i, hparams, test_loss, test_accuracy)
                    model.train()
                              
        plt.show()
        print(f"Total training took {time.time() - start_time:.2f} seconds")
        
        print("Training finished.")
    else:
        print("Error: Please restart the kernel before running the train again.")

Training with augmentations started...
| Epoch No. | Iter No. | Train Loss | Train Accuracy | Test Loss | Test Accuracy |
----------------------------------------------------------------------------------
|     0     |    1     |    1.82    |      0.13      |   1.94    |     0.17      |
----------------------------------------------------------------------------------
|     0     |   360    |    0.7     |      0.73      |   0.65    |     0.75      |
----------------------------------------------------------------------------------
|     1     |   360    |    0.52    |      0.8       |   0.55    |      0.8      |
----------------------------------------------------------------------------------
|     2     |   360    |    0.47    |      0.82      |   0.57    |     0.79      |
----------------------------------------------------------------------------------
|     3     |   360    |    0.44    |      0.84      |   0.55    |      0.8      |
------------------------------------------------

Evaluate the model on the test set, and calculate measures for each class

In [14]:
test_out = model(test_data)
test_true = test_labels.squeeze()
_, test_predicted = torch.max(test_out, dim=1)
correct = (test_predicted == test_true).sum().item()
accuracy = correct / test_true.size(0)

cm = confusion_matrix(test_true.cpu(), test_predicted.cpu())
cm_display = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = class_names)
cm_display.plot()
plt.title('Confusion Matrix')
plt.show()

precision = precision_score(test_true.cpu(), test_predicted.cpu(), average=None)
recall = recall_score(test_true.cpu(), test_predicted.cpu(), average=None)
print(f"Test Set's total accuracy is: {accuracy: .2f}")
for i, class_name in enumerate(class_names):
    print(f"Class: {class_name} Measures: Precision: {precision[i]:.2f}, Recall: {recall[i]:.2f}")

Test Set's total accuracy is:  0.82
Class: 0: T-shirt/top Measures: Precision: 0.86, Recall: 0.76
Class: 1: Trouser Measures: Precision: 0.99, Recall: 0.97
Class: 2: Pullover Measures: Precision: 0.84, Recall: 0.73
Class: 3: Dress Measures: Precision: 0.88, Recall: 0.87
Class: 4: Coat Measures: Precision: 0.77, Recall: 0.82
Class: 5: Shirt Measures: Precision: 0.64, Recall: 0.79
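
Per-class precision and recall can be read directly off a confusion matrix whose rows are true labels and whose columns are predicted labels (the layout scikit-learn's `confusion_matrix` uses). The sketch below does this by hand on a hypothetical 3-class matrix, and then checks one property of the results above: because the test set is balanced, the mean of the per-class recalls should match the total accuracy.

```python
def per_class_precision_recall(cm):
    """cm[r][c] = number of samples with true label r predicted as c."""
    n = len(cm)
    precision, recall = [], []
    for k in range(n):
        true_positive = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision.append(true_positive / predicted_k if predicted_k else 0.0)
        recall.append(true_positive / actual_k if actual_k else 0.0)
    return precision, recall

# Hypothetical confusion matrix with 10 test samples per class.
cm = [[9, 1, 0],
      [2, 6, 2],
      [1, 1, 8]]
precision, recall = per_class_precision_recall(cm)
accuracy = sum(cm[k][k] for k in range(3)) / sum(map(sum, cm))
print(precision, recall, round(accuracy, 3))

# Recalls reported above for the six clothing classes (rounded to two decimals).
reported_recall = [0.76, 0.97, 0.73, 0.87, 0.82, 0.79]
mean_recall = sum(reported_recall) / len(reported_recall)
print(round(mean_recall, 2))  # agrees with the reported total accuracy of 0.82
```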
