# NTU Deep Learning Bootcamp Hackathon Challenge

## Goal
Your goal is to classify Pokémon by primary type and secondary type given an image of the Pokémon and a textual description.  The training dataset is provided to you, but you will be evaluated on a hidden test dataset after you submit your solution.  The submission with the highest classification accuracy on primary type wins.  Ties are broken by classification accuracy on secondary type.  If two submissions tie on both metrics, the model with fewer weights is declared the winner.
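The ranking rule above can be expressed as a sort key. The sketch below is only an illustration: the team names, accuracies, and weight counts are made up, and the organizers' actual scoring script is not shown in this notebook.

```python
# Hypothetical leaderboard entries: (team, primary_acc, secondary_acc, n_weights).
# All names and numbers here are invented for illustration.
entries = [
    ("team_a", 0.81, 0.60, 25_000_000),
    ("team_b", 0.81, 0.60, 11_000_000),
    ("team_c", 0.79, 0.70, 5_000_000),
    ("team_d", 0.81, 0.55, 1_000_000),
]


def ranking_key(entry):
    _, primary_acc, secondary_acc, n_weights = entry
    # Higher accuracy is better, so negate it; fewer weights is better.
    return (-primary_acc, -secondary_acc, n_weights)


ranked = sorted(entries, key=ranking_key)
for place, (team, pri, sec, n_weights) in enumerate(ranked, start=1):
    print(place, team, pri, sec, n_weights)
```

Note that secondary accuracy and model size only matter among teams tied on the earlier criteria, so `team_c` ranks last here despite its higher secondary accuracy.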

## Background
If you are unfamiliar with Pokémon and their types, you can read more about them [here](https://bulbapedia.bulbagarden.net/wiki/Type).  Basically, each Pokémon has a primary type (water, fire, electric, etc.) and sometimes a secondary type.  Our claim is that given a picture of a Pokémon and its textual description, you should be able to determine both its primary and secondary types.

## Rules
* You can only use the dataset provided to you.  Hardcoding a table of Pokémon names and their corresponding types is forbidden.
* You have until the deadline to complete the challenge.  You cannot start before, and work continued after the deadline cannot be submitted for the cash prize.
* You are allowed to use the Internet, ChatGPT, or any other resources to help you write your code, but you must train your model yourself, i.e., you cannot take weights from somebody else online and submit them as your solution.
* You are allowed to start with a pretrained model, including transformers.
* You must submit both your iPython notebook and the weights (```*.pth```) file.  Judging will be based on the accuracy measurement function included in this notebook.  Please make sure that your model outputs guesses in the format supported by this function, or your submission will be disqualified.
* You cannot get help from other teams enrolled in the competition.
* You cannot get help from students studying computer science or a related field unless they are enrolled on your team.

## Submission
The following items are needed for a complete submission:
1. iPython notebook named ```hackathon_TEAMNAME.ipynb```. This notebook must contain the class for your model, as well as the transforms and tokenizer that you want us to use for testing.
2. Your trained model weights stored in a file called ```submission_TEAMNAME.pth```.  Please save your weights as a state dictionary; do not pickle your entire model.

More information is available in the *Testing and Submission* subsection below.
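The state-dictionary requirement in item 2 can be sketched as follows. This is a minimal round-trip example, assuming a stand-in `TinyModel` class (hypothetical, not part of the challenge) and a temporary directory; in practice you would use your own model class and working directory.

```python
import os
import tempfile

import torch


# Stand-in model for illustration only; replace with your own model class.
class TinyModel(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


model = TinyModel()
path = os.path.join(tempfile.mkdtemp(), "submission_TEAMNAME.pth")

# Save only the state dictionary (the parameter tensors), not the pickled
# model object.
torch.save(model.state_dict(), path)

# Loading requires re-creating the class first, then restoring the weights.
restored = TinyModel()
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()
```

Because only the tensors are stored, the notebook must contain the model class definition so the weights can be loaded into a fresh instance.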

## Getting Started
An example solution is provided below to show you how to load the dataset and use the accuracy measurement function.  Feel free to use this as a starting point.  Lines starting with ```# HINT:``` contain advice that you may find helpful if you are unsure of how to proceed.

In [18]:
# Some basic libraries to get you started
import numpy
import pandas
import torch
import torchvision
import transformers
from PIL import Image

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# This code loads the dataset.  There is no need to modify this block of code.
# https://drive.google.com/file/d/1tXKD4OXxJKvopkNpTOv2w4II4jCP_OrZ/view?usp=sharing
!pip install --upgrade --no-cache-dir gdown
!gdown https://drive.google.com/uc?id=1tXKD4OXxJKvopkNpTOv2w4II4jCP_OrZ -O PokeData.zip
!unzip PokeData.zip

### Loading the Dataset
In this section we load the dataset and create some sample dataloaders.  Please feel free to modify any of the blocks in this section.

In [20]:
# We provide a MultimodalPokemonDataset class for your convenience.
# This block shows you how to load the dataset and explore its contents.
from PokeData.pokedata import MultimodalPokemonDataset


# You will need to define a transform to be applied to the image data in the
# dataset.  The transform below is a simple example, but may not be optimal.
# HINT: Torch can take care of things like data augmentation in the transform.
train_transform = torchvision.transforms.Compose([
  torchvision.transforms.ToTensor(),
  torchvision.transforms.Resize((224, 224), antialias=True),
])


# You will also need to define a tokenizer for the text data.  The max_length
# argument gives the maximum number of tokens returned for an input sequence.
# By default, sequences shorter than max_length are padded.
# HINT: If you load a pretrained text model, make sure it was trained on data
# from the same tokenizer you load here.
train_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')


# Load our dataset
dataset = MultimodalPokemonDataset(
  root='PokeData',
  transform=train_transform,
  tokenizer=train_tokenizer,
  max_length=128,
)

In [None]:
# You can access the list of images like this:
dataset.img_list[:10]

In [None]:
# To view an untransformed image:
img = Image.open(dataset.img_list[0])
img

In [None]:
# To view the raw text labels:
dataset.descr_list[:10]

In [None]:
# To view the type data:
print(f'Primary Type: {dataset.primary_type_list[:10]}')
print(f'Secondary Type: {dataset.secondary_type_list[:10]}')

In [None]:
# To convert the number to a type
dataset.index_to_type(dataset.primary_type_list[0])

In [None]:
# In total we have 19 possible types for Pokemon (including NA when there is no
# secondary type)
NUM_TYPES = len(dataset.type_to_index)
print("All possible Pokemon types:")
print(dataset.type_to_index)

In [26]:
# Let's split the dataset and create data loaders.
# HINT: Right now the train/val split is set at 90/10, but this may not be
# optimal.
# HINT: This dataset is highly imbalanced.  You may want to oversample or
# undersample some classes or use K-fold cross-validation.
n_train = int(0.9 * len(dataset))
train_set, val_set = torch.utils.data.random_split(
  dataset,
  [n_train, len(dataset) - n_train]  # lengths must sum to len(dataset)
)


# HINT: Remember that batch size is a hyperparameter you can adjust.
BATCH_SIZE = 16
train_loader = torch.utils.data.DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=BATCH_SIZE, shuffle=False)
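One way to act on the class-imbalance hint is to oversample rare classes with `torch.utils.data.WeightedRandomSampler`. The sketch below uses a toy label list so that it runs on its own; for the real training split you would presumably build `labels` from something like `[dataset.primary_type_list[i] for i in train_set.indices]`, which assumes the entries of `primary_type_list` are integer class indices.

```python
from collections import Counter

import torch

# Toy integer labels standing in for the primary types of the training samples.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2]

counts = Counter(labels)
# Weight each sample by the inverse frequency of its class, so every class
# has the same total probability of being drawn.
sample_weights = torch.tensor(
    [1.0 / counts[y] for y in labels], dtype=torch.double
)

sampler = torch.utils.data.WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),
    replacement=True,
)

# Passing sampler= takes the place of shuffle=True; the two cannot be combined.
toy_set = torch.utils.data.TensorDataset(torch.tensor(labels))
balanced_loader = torch.utils.data.DataLoader(
    toy_set, batch_size=3, sampler=sampler
)
```

Sampling with replacement means rare samples are repeated within an epoch, which can encourage overfitting on them; combining this with augmentation is a common mitigation.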

### Create the Model
Below is a simple example of a multimodal model.  It uses a separate backbone for text and image data and combines them in a final linear layer.  You can use this as a starting point, or try something completely different.  Remember, nothing says you necessarily have to use both data modes.

In [27]:
class PokeClassifier(torch.nn.Module):

  def __init__(self):
    # HINT: Here we are using pretrained BERT as a text backbone, but other
    # language models with varying levels of performance may be available.
    super().__init__()
    self.text_backbone = transformers.BertModel.from_pretrained('bert-base-uncased')
    text_backbone_outsize = self.text_backbone.config.hidden_size
    # HINT: Below we freeze the layers of pretrained BERT before fine tuning
    # to save time, but if you are using a simpler language model, you may want
    # to investigate updating the backbone weights as well.
    for param in self.text_backbone.parameters():
      param.requires_grad = False

    # HINT: Here we are using pretrained Resnet50 as a vision backbone, but
    # other image encoders are available.
    self.vision_backbone = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
    vision_backbone_outsize = self.vision_backbone.fc.in_features
    self.vision_backbone.fc = torch.nn.Identity()
    # HINT: Below we freeze the layers of the pretrained Resnet before fine
    # tuning, but you may want to experiment with updating the backbone weights
    # as well.
    for param in self.vision_backbone.parameters():
      param.requires_grad = False

    # HINT: Here we are just taking all the outputs of BERT and all the outputs
    # of Resnet, shmooshing them together and feeding them into a linear layer
    # for classification (actually, two linear layers running in parallel, one
    # for primary type and the other for secondary type).  This is probably the
    # simplest way to combine data of different modalities, but it may lack the
    # capacity to learn more complex relationships.  Can you find a better one?
    self.fc_primary_type = torch.nn.Linear(
      text_backbone_outsize + vision_backbone_outsize,
      NUM_TYPES,
    )
    self.fc_secondary_type = torch.nn.Linear(
      text_backbone_outsize + vision_backbone_outsize,
      NUM_TYPES,
    )

  def forward(self, img, txt, attn):
    # Inference the text backbone
    bertput = self.text_backbone(
      input_ids=txt.squeeze(1),
      attention_mask=attn.squeeze(1)
    )
    text_features = bertput.pooler_output

    # Inference the vision backbone
    vision_features = self.vision_backbone(img)

    # Combine the feature spaces and run our classifiers
    feature_space = torch.cat((text_features, vision_features), dim=-1)
    primary_type = self.fc_primary_type(feature_space)
    secondary_type = self.fc_secondary_type(feature_space)
    return primary_type, secondary_type
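As one possible answer to the "Can you find a better one?" hint, the two parallel linear layers could be preceded by a small shared hidden layer. The sketch below is an assumption-laden illustration rather than a tuned design: `FusionHead`, the hidden size, and the dropout rate are hypothetical, while 768 and 2048 are the feature sizes of `bert-base-uncased` and ResNet-50 respectively.

```python
import torch


class FusionHead(torch.nn.Module):
    """Hypothetical fusion head: shared hidden layer, then two classifiers."""

    def __init__(self, text_dim=768, vision_dim=2048, hidden_dim=512,
                 num_types=19, p_drop=0.3):
        super().__init__()
        # Shared non-linear projection of the concatenated features.
        self.shared = torch.nn.Sequential(
            torch.nn.Linear(text_dim + vision_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Dropout(p_drop),
        )
        self.fc_primary_type = torch.nn.Linear(hidden_dim, num_types)
        self.fc_secondary_type = torch.nn.Linear(hidden_dim, num_types)

    def forward(self, text_features, vision_features):
        fused = self.shared(
            torch.cat((text_features, vision_features), dim=-1)
        )
        return self.fc_primary_type(fused), self.fc_secondary_type(fused)


# Random features standing in for the backbone outputs of a batch of 4.
head = FusionHead()
pri, sec = head(torch.randn(4, 768), torch.randn(4, 2048))
print(pri.shape, sec.shape)
```

The head returns logits in the same `(primary, secondary)` order as the example model, so it stays compatible with the accuracy measurement function. Keep in mind that extra layers add weights, which matters for the final tie-breaker.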


### Training the Model
Below is some sample training code for the model.  Remember, the training process (loss function, learning rate, optimizer, etc.) has just as big an impact on model performance as the architecture.

In [None]:
# Create an instance of our classifier
model = PokeClassifier()
model = model.to(DEVICE)


# Optional: load from a previous checkpoint
#model.load_state_dict(torch.load('checkpoint.pth'))


# HINT: Adam may not be the optimal optimizer, and the hyperparameters passed
# here may not be the best.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)


# HINT: Here we are using cross entropy loss, but other more complex loss
# functions are possible.
primary_loss = torch.nn.CrossEntropyLoss()
secondary_loss = torch.nn.CrossEntropyLoss()


# HINT: Remember, training for more epochs can improve loss, but at some point
# it will lead to overfitting.
N_EPOCHS = 10
for epoch in range(N_EPOCHS):

  # Training Loop
  model.train()
  train_loss = 0
  for batch in train_loader:
    optimizer.zero_grad()

    # Get data from the data loader and push it to the GPU
    img, txt, attn, primary_gt, secondary_gt = batch
    img = img.to(DEVICE)
    txt = txt.to(DEVICE)
    attn = attn.to(DEVICE)
    primary_gt = primary_gt.to(DEVICE)
    secondary_gt = secondary_gt.to(DEVICE)

    # Inference the model
    primary_guess, secondary_guess = model(img, txt, attn)

    # Calculate loss and update weights
    # HINT: Here we are weighting both losses equally, but it may be more
    # beneficial to focus on one loss over the other.  Also, there is nothing
    # that says you can't use two separate optimizers for two separate losses.
    L = primary_loss(primary_guess, primary_gt) + secondary_loss(secondary_guess, secondary_gt)
    L.backward()
    optimizer.step()
    train_loss += L.item()  # .item() keeps a float, so the graph can be freed

  # Validation Loop
  model.eval()
  val_loss = 0
  n_correct = 0
  n_samples = 0
  for batch in val_loader:
    # Get data from the data loader and push it to the GPU
    img, txt, attn, primary_gt, secondary_gt = batch
    img = img.to(DEVICE)
    txt = txt.to(DEVICE)
    attn = attn.to(DEVICE)
    primary_gt = primary_gt.to(DEVICE)
    secondary_gt = secondary_gt.to(DEVICE)

    # Inference the model and calculate loss
    with torch.no_grad():
      primary_guess, secondary_guess = model(img, txt, attn)
      L = primary_loss(primary_guess, primary_gt) + secondary_loss(secondary_guess, secondary_gt)
      val_loss += L.item()

  # Report results
  print(f'Epoch: {epoch} '
        f'-- Training Loss: {train_loss / len(train_loader):.3f} '
        f'-- Val Loss: {val_loss / len(val_loader):.3f} '
  )

  # Save a checkpoint
  torch.save(model.state_dict(), 'checkpoint.pth')
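The overfitting hint can be handled by tracking the best validation loss, saving a checkpoint only when it improves, and stopping after several epochs without improvement. The sketch below is a minimal, framework-free illustration; `EarlyStopper`, its parameters, and the loss values are all hypothetical.

```python
class EarlyStopper:
    """Tracks the best validation loss and signals when to stop training."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # minimum decrease that counts
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return (improved, should_stop) for this epoch's validation loss."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Made-up validation losses standing in for a real training run.
stopper = EarlyStopper(patience=2)
history = []
for epoch, val_loss in enumerate([1.0, 0.8, 0.85, 0.9, 0.7]):
    improved, should_stop = stopper.step(val_loss)
    history.append((epoch, improved, should_stop))
    # In a real loop, save the checkpoint here only when improved is True.
    if should_stop:
        break
```

With a patience of 2, this run stops after the second consecutive epoch without improvement and never reaches the final value of 0.7, which illustrates the trade-off in choosing the patience.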

### Testing and Submission
A sample test function is provided in this section.  Note that all you are required to supply is a model, an input transform, tokenizer, and max length for a tokenized string.  Here, we load the dataset provided to you, but in your final submission, you will be evaluated against a blind test set.

Also, don't forget to save your model weights and download them **before logging off your VM**.

In [None]:
def measure_accuracy(model, transform, tokenizer, max_length):
  # Create a data loader specifically for testing
  test_set = MultimodalPokemonDataset(
    root='PokeData',
    transform=transform,
    tokenizer=tokenizer,
    max_length=max_length,
  )
  test_loader = torch.utils.data.DataLoader(test_set)

  # Evaluate the model
  model.eval()
  n_correct_pri = 0
  n_correct_sec = 0
  n_samples = 0
  for batch in test_loader:
    # Get data from the data loader and push it to the GPU
    img, txt, attn, pri_gt, sec_gt = batch
    img = img.to(DEVICE)
    txt = txt.to(DEVICE)
    attn = attn.to(DEVICE)
    pri_gt = pri_gt.to(DEVICE)
    sec_gt = sec_gt.to(DEVICE)

    # Run inference and accumulate the number of correct guesses
    with torch.no_grad():
      pri_type_guess, sec_type_guess = model(img, txt, attn)
      n_correct_pri += torch.sum(torch.argmax(pri_type_guess, 1) == pri_gt)
      n_correct_sec += torch.sum(torch.argmax(sec_type_guess, 1) == sec_gt)
      n_samples += pri_gt.shape[-1]

  # Display the results
  print(f'Accuracy on Primary Type: {100 * n_correct_pri / n_samples:.3f}%')
  print(f'Accuracy on Secondary Type: {100 * n_correct_sec / n_samples:.3f}%')


# Please remember to update this line if you use different variable names or
# a separate transform for testing.
measure_accuracy(model, train_transform, train_tokenizer, 128)

In [33]:
# You can save a model with the command below.  Remember to name the file
# submission_TEAMNAME.pth when you submit.
torch.save(model.state_dict(), 'submission.pth')