
# Introduction

**DESCRIPTION:** In this challenge task I have provided you with skeleton code. There is an image dataset and a text dataset, and you must train deep learning models for them.

In these tasks you will be required to write code and to write short-answer responses to questions in a structured report. You have been provided with a Word template for this report; simply fill in the blanks (1-3 sentences are expected per answer).

**INSTRUCTIONS:**

1.   Copy the skeleton files to your Google Drive.
2.   Edit `SKELETON_DIR` in the first cell to point to the skeleton files you uploaded in step 1. The provided code assumes you have uploaded them to "Telemus/DLTasks" in your Google Drive.
3.   Run the setup code below.


In [None]:
!nvidia-smi

from google.colab import drive
drive.mount('/content/drive')

# Set the working directory for the tasks
import os
SKELETON_DIR = '/content/drive/MyDrive/TranTasks'
os.chdir(SKELETON_DIR)
! mkdir -p "$SKELETON_DIR/saved_models"
! mkdir -p "$SKELETON_DIR/logs"

# Set up auto-reloading modules from the working directory
%load_ext autoreload
%autoreload 2

# Install extra dependencies
!pip install -q transformers==4.27.0
!pip install -q wandb==0.15.0
!pip install -q torchmetrics==0.11.3


# Set the default figure size
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 120

# Task 1 - Image Classification

**MARKS**: 66

In this first task, you will create a deep learning model to classify images of skin lesions into one of seven classes: 

1.   "MEL" = Melanoma
2.   "NV" = Melanocytic nevus
3.   "BCC" = Basal cell carcinoma
4.   "AKIEC" = Actinic keratosis
5.   "BKL" = Benign keratosis
6.   "DF" = Dermatofibroma
7.   "VASC" = Vascular lesion

The data for this task is a subset of: https://challenge2018.isic-archive.com/task3/

The data for this task is inside the `data/img` folder. It contains ~3,800 images named like `ISIC_000000.jpg` and the following label files:

*   `data/img/train.csv`
*   `data/img/val.csv`
*   `data/img/train_small.csv`
*   `data/img/val_small.csv`

The `small` versions are the first 200 lines of each partition and are included for debugging purposes. To save time, ensure your code runs on the `small` versions first.

## Task 1a. Explore the training set

**INSTRUCTIONS**: Check for data issues. Check the class distribution and at least 1 other potential data issue. Hint: Look in `explore.py` for a function that can plot the class distribution.

**REPORT**: What did you check for? What data issues are present in this dataset?

In [None]:
import pandas as pd

IMG_CLASS_NAMES = ["MEL", "NV", "BCC", "AKIEC", "BKL", "DF", "VASC"]

train_df = pd.read_csv('/content/drive/MyDrive/TranTasks/data/img/train.csv')
val_df = pd.read_csv('/content/drive/MyDrive/TranTasks/data/img/val.csv')
train_df.head()

# count = train_df[IMG_CLASS_NAMES].sum()
# print(count)
# plt.bar(IMG_CLASS_NAMES, count)

# count = val_df[IMG_CLASS_NAMES].sum()
# print(count)
# plt.bar(IMG_CLASS_NAMES, count)

train_df.hist(bins=20, figsize=(15, 10));

In [None]:
from PIL import Image
# Change the filename to view other examples from the dataset 
display(Image.open('/content/drive/MyDrive/TranTasks/data/img/ISIC_0024307.jpg'))

In [None]:
import explore
import numpy as np


# TODO - Check for data issues
# Hint: You can convert from one-hot to integers with argmax
#       This way you can convert 1, 0, 0, 0, 0, 0, 0  to class 0 
#                                0, 1, 0, 0, 0, 0, 0  to class 1
#                                0, 0, 1, 0, 0, 0, 0  to class 2
# so it should be something like the following: 
# train_labels = train_df.values[....].argmax(....)
# val_labels = val_df.values[....].argmax(....)
#     - you need to fill in the ... parts with the correct values.
# You should then print output the contents of train_labels to see if 
# it matches the contents of train.csv
#
# Next you can plot the class distributions like the following:
# explore.plot_label_distribution(....)
#    - do the above for both the train and val labels.
#
# Following this look for other potential problems with the data
#   You can look at practiceTorch1 notebook to see what was checked there.
#   You may also think of any other potential problems with the data.

print(train_df.dtypes)

print(train_df.isnull().sum())

# Convert the one-hot encoded labels to integers
train_labels = train_df.iloc[:,1:].values.argmax(axis=1)
val_labels = val_df.iloc[:,1:].values.argmax(axis=1)

# Print out the contents of train_labels
print(train_labels)

# Print out the contents of val_labels
print(val_labels)

# Calculate the class counts for the training and validation sets
train_counts = train_df[IMG_CLASS_NAMES].sum().values
print(train_counts)

val_counts = val_df[IMG_CLASS_NAMES].sum().values
print(val_counts)

# Combine the class counts for both sets
total_counts = train_counts + val_counts


# Plot the class distribution of each split using explore.plot_label_distribution()
explore.plot_label_distribution(train_labels, "Train Class Distribution", IMG_CLASS_NAMES)
explore.plot_label_distribution(val_labels, "Validation Class Distribution", IMG_CLASS_NAMES)



## Task 1b. Implement Training loop

**INSTRUCTIONS**:

*   Implement LesionDataset in `datasets.py`. Use the cell below to test your implementation. 
*   Implement the incomplete functions in `train.py` marked as "Task 1b"
*   Go to the [Model Training Cell](#task-1-model-training) at the end of Task 1 and fill in the required code for "Task 1b".

**REPORT**: Why should you *not use* `random_split` in your code here?

In [None]:
import datasets

ds = datasets.LesionDataset('/content/drive/MyDrive/TranTasks/data/img',
                            '/content/drive/MyDrive/TranTasks/data/img/train.csv')
# Use `image` rather than `input` to avoid shadowing the Python built-in
image, label = ds[0]
print(image)
print(label)


## Task 1c. Implement a baseline convolutional neural network

You will implement a baseline convolutional neural network which you can compare results to. This allows you to evaluate any improvements made by hyperparameter tuning or transfer learning.

**INSTRUCTIONS**:

*   Implement a `SimpleBNConv` in `models.py` with:
    *   5 `nn.Conv2d` layers, with 8, 16, 32, 64, 128 output channels respectively, with the following between each convolution layer:
        *   `nn.ReLU()` for the activation function, and
        *   `nn.BatchNorm2d`, and
        *   finally a `nn.MaxPool2d` to downsample by a factor of 2.
*   Use a normalised confusion matrix on the model's validation predictions in `train.py`.
*   Go to the [Model Training Cell](#task-1-model-training) at the end of Task 1 and fill in the required code to train the model.
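The layer ordering described above can be sketched for a single stage as follows. This is only an illustration under assumptions: the helper name `conv_stage`, the 3x3 kernel and the padding of 1 are not specified by the task, and a full `SimpleBNConv` would still need five such stages plus a classifier head.

```python
import torch
import torch.nn as nn

def conv_stage(in_channels: int, out_channels: int) -> nn.Sequential:
    """One stage: convolution, ReLU, batch norm, then 2x downsampling."""
    return nn.Sequential(
        # kernel_size=3 and padding=1 are assumptions, not part of the task spec
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.BatchNorm2d(out_channels),
        nn.MaxPool2d(2),
    )

stage = conv_stage(3, 8)
x = torch.randn(2, 3, 64, 64)  # dummy batch of two RGB images
y = stage(x)
print(y.shape)  # channels go from 3 to 8; height and width are halved
```

Printing the output shape after each stage is a quick way to work out the input size of the first fully connected layer.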

Training should take about 1 minute/epoch. Validation accuracy should be 60-70%, but UAR should be around 20-40%.
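Accuracy and UAR can differ widely on the same predictions. The sketch below assumes UAR means unweighted average recall, i.e. the mean of the per-class recalls, which are the diagonal of a row-normalised confusion matrix. The helper name and the toy labels are made up for illustration and are not taken from `train.py`.

```python
import numpy as np

def normalised_confusion_matrix(y_true, y_pred, num_classes):
    """Row-normalised confusion matrix: entry [i, j] is the fraction of
    class-i samples that were predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    # Leave rows at zero for classes that have no samples
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

# Toy labels: class 0 is common, classes 1 and 2 are rare
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

cm = normalised_confusion_matrix(y_true, y_pred, num_classes=3)
accuracy = float((y_true == y_pred).mean())
uar = float(np.diag(cm).mean())  # mean of per-class recall

print(cm)
print(f"accuracy={accuracy:.2f}, UAR={uar:.2f}")
```

Because every class contributes equally to UAR regardless of how many samples it has, the two metrics are worth reading side by side.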

**REPORT**: As training sets get larger, the length of time per epoch also gets larger. Some datasets take over an hour per epoch. This makes it impractical to debug typos in your code since it can take hours after starting for the program to reach new code. Name two ways to significantly reduce how long each epoch takes - for debugging purposes - while still using real data and using the real training code.

**REPORT**: Show the confusion matrix and plots of the validation accuracy and UAR in your report, and explain what is going wrong. 
(Right-click a plot and select "save image as..." to save the image to your computer)

## Task 1d. Account for data issues

**INSTRUCTIONS**: Account for the data issues in Task 1a and retrain your model.

**REPORT**: How did you account for the data issues? Was it effective? How can you tell? Show another confusion matrix.

## Task 1e. Data Augmentation

**INSTRUCTIONS**: 

*   Add an `augment` flag to LesionDataset which specifies whether any augmentation is done to the images. Ensure it is set to `True` *only* for the training dataset.
*   Use random horizontal flips
*   Use at least 2 other different non-deterministic augmentations

**REPORT:** Are random vertical flips appropriate for this dataset? Why?

Using data augmentation does not guarantee improved model performance. Data augmentation can hurt test performance by making the model train on unrealistic images.

**REPORT**: What effect did Data Augmentation have on performance? Show a screenshot of the relevant graphs from Weights & Biases for evidence.

**CHALLENGE**: Apply five-crop augmentation with a crop size of 200x300. Create a separate model that uses all five crops at once to produce a single prediction. Include in your report how you did this and the effect on performance.
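One way to turn several crops into a single answer is to run a base model on every crop and average the logits. The sketch below assumes the crops arrive stacked as a `[batch, n_crops, channels, height, width]` tensor (for example, by stacking the tuple returned by `torchvision.transforms.FiveCrop`). The wrapper class `FiveCropAverager` and the tiny stand-in classifier are hypothetical, and averaging logits is only one of several reasonable ways to combine crops.

```python
import torch
import torch.nn as nn

class FiveCropAverager(nn.Module):
    """Hypothetical wrapper: runs a base model on every crop and averages the logits."""

    def __init__(self, base_model: nn.Module):
        super().__init__()
        self.base_model = base_model

    def forward(self, crops: torch.Tensor) -> torch.Tensor:
        # crops: [batch, n_crops, channels, height, width]
        b, n, c, h, w = crops.shape
        # Fold the crops into the batch dimension so the base model sees ordinary images
        logits = self.base_model(crops.reshape(b * n, c, h, w))  # [b * n, num_classes]
        # Unfold and average over the crop dimension
        return logits.reshape(b, n, -1).mean(dim=1)  # [b, num_classes]

# Tiny stand-in classifier so the sketch runs on its own
base = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 7))
model = FiveCropAverager(base)
crops = torch.randn(4, 5, 3, 200, 300)
out = model(crops)
print(out.shape)
```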

## Task 1f. Chase improved performance

**INSTRUCTIONS**: 
*   Create a model from a pre-trained model from the torchvision model zoo. We recommend ResNet18, but you may use any model you like. You may freeze the weights of all layers except the last, or fine-tune all the weights. https://cloudstor.aarnet.edu.au/plus/s/TsYJXyJWch0h7TD
*   Create your own models by modifying the model architecture and trying different losses and learning rates. Change anything you like, except the evaluation metrics, in search of a better model.

Train at least 10 different models, each with a different combination.

**REPORT**: Create a table in an Excel spreadsheet to record your results. Make sure it includes every parameter that varies between your combinations as a separate column. Include notes about what you were thinking or hoping for with each combination as an additional column in the spreadsheet.

In addition to the Excel spreadsheet, generate a report using Weights and Biases covering the models you trained and their performance curves. Save the report as a PDF and include it in your submission. See the Weights and Biases documentation on generating reports: https://docs.wandb.ai/guides/reports

Play around with Weights and Biases to see what cool features you can dig out and use to better visualize the training results and use that to improve the information shared via the report. 

Write a discussion about the key findings from the experimental results.

**CHALLENGE REPORT**: Assuming you use the full dataset in a single epoch, if you halve the batch size, what happens to the number of weight updates per epoch? With reference to the gradients, under what circumstances is this good?

<a name="task-1-model-training"></a>
## Model Training Cell

Note we will be using Weights and Biases to keep track of our experimental runs and evaluation metrics.

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import datasets
import models
import train
from train import plot_confusion_matrix
from sklearn.metrics import confusion_matrix

torch.cuda.empty_cache()
!nvidia-smi

torch.manual_seed(42)

NUM_EPOCHS = 2
BATCH_SIZE = 64

NUM_CLASSES = len(IMG_CLASS_NAMES)

device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    torch.cuda.set_device(device)

print(device)

# model = models.SimpleBNConv(device) 
model = models.SimpleBNConv(num_classes=NUM_CLASSES)

print(model)

model = model.to(device)



# Create datasets/loaders
# TODO Task 1b - Create the data loaders from LesionDatasets
# TODO Task 1d - Account for data issues, if applicable

train_dataset = datasets.LesionDataset('/content/drive/MyDrive/TranTasks/data/img',
                            '/content/drive/MyDrive/TranTasks/data/img/train.csv', augment = True)
val_dataset = datasets.LesionDataset('/content/drive/MyDrive/TranTasks/data/img',
                            '/content/drive/MyDrive/TranTasks/data/img/val.csv', augment = False)

width, height = train_dataset.get_image_size(0)
print(f'{width}x{height}')

width, height = val_dataset.get_image_size(0)
print(f'{width}x{height}')

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

# TODO Task 1d - Account for data issues, if applicable
# defining the Optimizer 
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Train model
# TODO Task 1c: Set ident_str to a string that identifies this particular
#               training run. Note this line in the training code:
#                     exp_name = f"{model.__class__.__name__}_{ident_str}"
#               This means the model class name is already included in the
#               exp_name string. Consider adding other information particular
#               to this training run, e.g. the learning rate (lr) used,
#               whether augmentation (aug) was used, etc.

train.train_model(model, train_loader, val_loader, optimizer, criterion,
                  IMG_CLASS_NAMES, NUM_EPOCHS, project_name = "CSE5DL Assignment Task 1",
                  ident_str= "TranTasks")



# Task 2 - News article classification

You will first create your own model to classify news articles into one of the following classes:

*   World
*   Sport
*   Business
*   Sci/Tech

You will then compare it to a pre-trained DistilBERT model that has been fine-tuned, similar to Lab 6. Note: using a model pre-trained on a source task for a new target task is called "transfer learning" whether you fine-tune it or not.

The data for this task is a subset of: https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv

## Task 2a. Exploring the dataset

**INSTRUCTIONS**: Check for at least 2 data issues.

**REPORT**: What did you check for? What data issues exist, if any? Report anything you checked even if it turned out the data did not have that issue. We want to know what you are checking.

In [None]:
import pandas as pd

with open('/content/drive/MyDrive/TranTasks/data/txt/classes.txt') as f:
    TXT_CLASS_NAMES = [line.rstrip('\n') for line in f]

train_df = pd.read_csv('/content/drive/MyDrive/TranTasks/data/txt/train.csv', header=None)
val_df = pd.read_csv('/content/drive/MyDrive/TranTasks/data/txt/val.csv', header=None)
train_df.head()
val_df.head()

print(TXT_CLASS_NAMES)

In [None]:
import explore
import numpy as np  # needed for np.bincount below
# TODO Check for data issues.
# Again you should fill in the following:
# train_labels = ...
# val_labels = ....
#   - Note the csv file has class labels start from 1 but
#     pytorch expects class labels to start from 0 instead. 
#
# explore.plot_label_distribution(....) for train labels
# explore.plot_label_distribution(....) for val labels
# 
# check for other kinds of problems with the data like you did for Task 1a.
# View the first few rows of the dataframes

print(train_df.head())
print(val_df.head())

# Print classes
print(TXT_CLASS_NAMES)

# Check data types and null values
print(train_df.dtypes)
print(train_df.isnull().sum())
print(val_df.dtypes)
print(val_df.isnull().sum())

# Subtract 1 from labels to make them start from 0, as PyTorch expects
train_labels = train_df[0] - 1
val_labels = val_df[0] - 1

# Check distributions of labels
explore.plot_label_distribution(train_labels, "Train Class Distribution", TXT_CLASS_NAMES)
explore.plot_label_distribution(val_labels, "Validation Class Distribution", TXT_CLASS_NAMES)

# Check for class imbalance
train_counts = np.bincount(train_labels)
val_counts = np.bincount(val_labels)
print("Training set class counts:", train_counts)
print("Validation set class counts:", val_counts)

# Check if there are any other potential problems with the data
# Here we check if there are any classes in the validation set that do not appear in the training set
missing_classes = set(val_labels) - set(train_labels)
if missing_classes:
    print("Warning: The following classes appear in the validation set but not in the training set:", missing_classes)


## Task 2b. Clustering and visualising embeddings from a pre-trained model

**INSTRUCTIONS**: 

*   Implement the `TextDataset` class in the `datasets.py` file. Consider adding a small code block to test your implementation, like the one provided in Task 1b.

*   Complete `visualise_embeddings.py` and run it. Make sure you instantiate two different models to visualise: the sequence classification model and the token classification model. For the sequence classification model, the code visualises the CLS token output. For the token classification model, the code average-pools all output tokens except the CLS token output.

* The `visualise_embeddings.py` file does the following:
    *   It visualises embeddings of the news articles from the two pre-trained `'distilbert-base-uncased'` models (i.e. models that have not yet been fine-tuned on the labels) using t-SNE. t-SNE is a popular dimensionality reduction method that maps data from a high-dimensional space down to just two dimensions while trying to keep nearby points close together. Each article is drawn as a point coloured by its true label. Ideally the colours form well-separated clusters. If they do, the pre-trained model can already separate the classes without any fine-tuning on our data.
    *   Next the code will run K-Means clustering on the validation set to group the data into separate clusters. The code will then colour the points based on which cluster they belong to rather than the ground truth label. 
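The pooling, projection and clustering steps described above can be sketched on random stand-in data. Everything here is an assumption for illustration: the array sizes, the perplexity, and the choice to cluster the pooled embeddings rather than the 2-D points. The real `visualise_embeddings.py` may do these steps differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for transformer outputs: 60 articles, 80 token positions, 16 features
hidden = rng.normal(size=(60, 80, 16))

cls_embedding = hidden[:, 0, :]                  # output at the first (CLS) position
mean_embedding = hidden[:, 1:, :].mean(axis=1)   # average over all other positions

# Reduce to 2-D for plotting; perplexity must be smaller than the number of samples
points_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(mean_embedding)

# Group the embeddings into 4 clusters, one per news class
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(mean_embedding)

print(cls_embedding.shape, points_2d.shape, cluster_ids[:10])
```

Plotting `points_2d` once coloured by true label and once by `cluster_ids` gives the two views described above.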


**REPORT**: By looking at the resulting images of the two models (sequence classification and token classification), which two classes have the most similar embeddings? How can you tell? Did you expect this, if so, why, if not why not?

**CHALLENGE**: Only attempt this after completing the rest of Task 2.

*   Modify `visualise_embeddings.py` so that it can load the weights for a fine-tuned DistilBERT model. Then visualize the data points with their corresponding true labels. 
*   Next instead of using K-Means for the second visualisation, use the model's own predicted labels to colour the points.

Present the resulting images in your report.

In [None]:
import visualise_embeddings
SENTENCE_LEN = 80
# Run this code to visualize the results from embedding text using the sequence classification model
visualise_embeddings.mk_plots(SENTENCE_LEN, sequenceClassificationModel = True)

In [None]:
import visualise_embeddings
SENTENCE_LEN = 80
# Run this code to visualize the results from embedding text using the token classification model
visualise_embeddings.mk_plots(SENTENCE_LEN, sequenceClassificationModel = False)

## Task 2c. Models

**INSTRUCTIONS**:

*   Complete `TextMLP` in `models.py`. It should be a simple MLP with 8 Linear layers. It should first pass the inputs through an embedding layer with a vocabulary size of 30522. Use an output feature size of 256 in all hidden layers and a feature size of 128 for the embeddings. Flatten the sentence after embedding, but before it goes into any Linear layers. Use batch norm and ReLU. Train for 1000 epochs with a learning rate of 0.001 and a batch size of 512.
*   Complete `DistilBertForClassification` in `models.py`. This model should replace the last layer with an `nn.Linear` with 4 outputs for classification. Hint: Call `print()` on the DistilBERT model to observe the layers and their names before attempting this. Train for 4 epochs with learning rate of 0.001 and a batch size of 64.

Each of these should take around 10 minutes to complete.

Go to the [Model Training Cell](#task-2-model-training) at the end of Task 2 and fill in the required code to train the model.
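The embed-then-flatten step for `TextMLP` determines the input size of the first Linear layer. The shape walk-through below is a sketch only: it assumes a sentence length of 80 (the `SENTENCE_LEN` used elsewhere in this notebook) and leaves out batch norm, ReLU and the remaining layers.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, SENTENCE_LEN, HIDDEN = 30522, 128, 80, 256

embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
first_linear = nn.Linear(SENTENCE_LEN * EMBED_DIM, HIDDEN)

# Dummy batch of 4 sentences; torch.randint produces int64 by default
token_ids = torch.randint(0, VOCAB_SIZE, (4, SENTENCE_LEN))

embedded = embed(token_ids)                # [4, 80, 128]
flattened = embedded.flatten(start_dim=1)  # [4, 80 * 128] = [4, 10240]
hidden = first_linear(flattened)           # [4, 256]
print(embedded.shape, flattened.shape, hidden.shape)
```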

**REPORT**: The saved model weights of a fine-tuned DistilBERT model are >200MB, but you only created one small `nn.Linear` layer. Why is the saved model so large? 

**REPORT**: These models should accept only input with a dtype of `torch.int64`. What does each of these longs (`int64`) represent?

## Task 2d. Learning Rate

Fine-tuning `DistilBertForSequenceClassification` with Adam at a learning rate of 0.001 results in very poor accuracy (~26%).

**INSTRUCTIONS**: 

*   Uncomment the lines marked `Task 2d` in `train.py`
*   Execute the cell below to begin training and observe the class distribution per batch
*   Comment out the lines marked `Task 2d` in `train.py` again so they no longer interfere with training.


**REPORT**: What is wrong with the class distributions? The learning rate can be changed to fix it. Should you increase or decrease the learning rate? How can you tell?

**REPORT**: After fixing the learning rate, comment on the relative train/val performance between these two models. Which model performed better on each partition? Is this expected? If so, why?

When you have finished Task 2d, go back to Task 2b and attempt the challenge if you are up to it. You should get a pleasant surprise if you have done everything correctly.


<a name="task-2-model-training"></a>
## Model Training Cell

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

import datasets
import models
import train

torch.manual_seed(42)

SENTENCE_LEN = 80
NUM_EPOCHS = 4
BATCH_SIZE = 64

device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    torch.cuda.set_device(device)

print(device)

model = models.DistilBertForClassification(n_classes=4)
model = model.to(device)

print(model)

# Create datasets/loaders
# TODO: Create the data loaders from TextDatasets
# train_dataset = ...
# val_dataset = ...
# train_loader = ...
# val_loader = ...

train_dataset = datasets.TextDataset(fname='/content/drive/MyDrive/TranTasks/data/txt/train.csv', sentence_len=SENTENCE_LEN)
val_dataset = datasets.TextDataset(fname='/content/drive/MyDrive/TranTasks/data/txt/val.csv', sentence_len=SENTENCE_LEN)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)



# Instantiate model, optimizer and criterion
# TODO: Make an instance of your model
# model = models.<**put the name of the model class you created in the model file here**>

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss()

# TODO Change ident_str to something that identifies this experiment, e.g. lr0001
# Train model. We are using the same train model function we wrote for task 1.
train.train_model(model, train_loader, val_loader, optimizer, criterion,
                  TXT_CLASS_NAMES, NUM_EPOCHS, project_name = "CSE5DL Assignment Task 2",
                  ident_str='**TranTasks**')

# Super challenge task

This challenge task is quite difficult and will really test your mastery of PyTorch and `nn.Linear` layers.

We can manually assign weights to an `nn.Linear` like this:


In [None]:
import torch
import torch.nn as nn
lin = nn.Linear(10, 20)
manual_weights = torch.arange(20*10).reshape(lin.weight.shape)
lin.weight.data[:] = manual_weights
lin.bias.data[:] = 0

But this does not calculate anything useful. A Linear layer simply performs a weighted sum (plus bias). We can choose weights/biases to perform known operations.
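As a small illustration of choosing weights for a known operation, the sketch below makes an `nn.Linear(2, 1)` compute the difference of its two inputs. This operation is deliberately not one of the challenge items, and using `torch.no_grad()` with in-place `copy_` is just one way to set the parameters.

```python
import torch
import torch.nn as nn

lin = nn.Linear(2, 1)
with torch.no_grad():
    # weight has shape [out_features, in_features] = [1, 2]
    lin.weight.copy_(torch.tensor([[1.0, -1.0]]))
    lin.bias.zero_()

x = torch.tensor([[5.0, 3.0], [2.0, 7.0]])
y = lin(x)  # each output row is 1 * x0 + (-1) * x1 + 0
print(y)
```

Each row of the weight matrix defines one output as a weighted sum of the inputs, and the bias adds a constant to that output.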

**INSTRUCTIONS**: 
1.   Given an `nn.Linear(1, 1)` layer, set the weights such that the layer adds 1 to its input.
2.   Given an `nn.Linear(1, 1)` layer, set the weights such that the layer calculates `y = 3x + 2`.
3.   Given an `nn.Linear(4, 1)` layer, set the weights such that the layer calculates the average of its inputs.
4.   Given an `nn.Linear(4, 2)` layer, set the weights such that the layer calculates both the average of its inputs and the sum of its inputs.
5.   Given an `nn.Linear(3, 3)` layer, set the weights such that the layer returns the inputs, but in reverse order.
6.   Given an `nn.Linear(5, 2)` layer, set the weights such that the layer always returns `(4,2)`


Note: We would never use this in a deep learning model; this challenge is to prove that you understand the mathematics and coding mechanics of the `nn.Linear` layer.

In [None]:
import sc1
sc1.test_1(sc1.modify_lin_1)
sc1.test_2(sc1.modify_lin_2)
sc1.test_3(sc1.modify_lin_3)
sc1.test_4(sc1.modify_lin_4)
sc1.test_5(sc1.modify_lin_5)
sc1.test_6(sc1.modify_lin_6)