# BERT Baseline

## DataThon @ IndoML'24

This tutorial walks you through a very simple BERT-based baseline for the [DataThon](https://sites.google.com/view/datathon-indoml24) Challenge at IndoML'24.
Once you are done with this tutorial, feel free to play around with the repository; it has been developed to work with multiple GPUs as well.
At the end, you will also be able to make your first submission (albeit a very bad one!).

### Table of Contents: 

1. [Preprocessing](#preprocessing)
2. [Dataset Creation](#dataset-creation)
3. [BERT Models](#bert-models)
4. [Training](#training)
5. [Evaluation](#evaluation)

Please feel free to create a PR if you find any bugs or need help with running this code.

In [None]:
# On Colab, uncomment the next line to clone the repository; skip it if you are running locally.
# !git clone https://github.com/karannb/indoml-bert-baseline.git

import sys
sys.path.append("./") # easier relative imports

## Preprocessing

Before preprocessing, we need to get the data. Please download it from [here](https://codalab.lisn.upsaclay.fr/competitions/19907#participate) after registering for the challenge.
You can store it in any directory, but ideally store it in a new directory `data/`.
Once you have the data in place, the next cell will preprocess it into the format our `torch.utils.data.Dataset` objects expect; you can of course change this as well.
Feel free to go through the code if you want to change anything.

### Preprocess and Categorize

In [None]:
from src.preprocess import preprocess_fn, categorize

# this will preprocess the data to a particular format (in json)
preprocess_fn("attribute_test.data", "data")
preprocess_fn("attribute_val.data", "data")
preprocess_fn("attribute_val.solution", "data")
preprocess_fn("attribute_train.data", "data")
preprocess_fn("attribute_train.solution", "data")


# this will create maps of index2class and class2index for all columns or outputs
for col in ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']:
    categorize(col)
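
To make the idea of the `index2class` and `class2index` maps concrete, here is a minimal sketch of how such maps can be built for one column. The helper name `build_class_maps` and the toy labels are hypothetical; this is not the repository's `categorize` implementation, which may order or store the classes differently.

```python
def build_class_maps(labels):
    """Return (class2index, index2class) for an iterable of label strings."""
    # Sort the unique labels so the mapping is deterministic across runs.
    classes = sorted(set(labels))
    class2index = {name: idx for idx, name in enumerate(classes)}
    index2class = {idx: name for name, idx in class2index.items()}
    return class2index, index2class


# Toy labels standing in for one output column (hypothetical values).
class2index, index2class = build_class_maps(["acme", "zenith", "acme", "bolt"])
print(class2index)  # {'acme': 0, 'bolt': 1, 'zenith': 2}
print(index2class[2])  # zenith
```

A model predicts an integer index for each column, so `index2class` is what turns predictions back into the label strings a submission needs.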

## Dataset Creation

Now we will create PyTorch Datasets for convenient data loading (across multiple devices as well).
If you want to build on the BERT baseline itself, we strongly recommend starting your modifications here; we have added a `TODO: CAN CHANGE HERE` comment wherever we thought modifications are possible.
DO NOT USE `trim=True` when you are training your own model! It is only helpful for debugging purposes.

In [None]:
from src.dataset import ReviewsDataset, ReviewsDataLoader

# this is just a simple interface run
dataset = ReviewsDataset(data_dir="data/", split="test", output="L4_category", trim=True)
print(dataset[0])

dataloader = ReviewsDataLoader(dataset, batch_size=2, shuffle=True)

for batch in dataloader:
    print(batch)
    break

## BERT Models

This is a simple script to download a BERT model and its tokenizer; you can also play around with different types of BERT models from HuggingFace.
We stick to the basic one [here](https://huggingface.co/google-bert/bert-base-uncased).
Please save them appropriately, or modify the paths in `trainer.py`; by default they are saved in `bert-model/` and `bert-tokenizer/`.

In [None]:
from src.downloadBERT import download

# this function will download the model and tokenizer
download()

## Training

Now the interesting part: training.
Note that there are **several** parameters that can be modified here, including the learning rate, batch size, optimizer, and weight decay.
The code also does not have checkpointing as of now; you will probably need that.
You can also change the criterion for best model selection (currently F1-score).
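
If you want to experiment with the selection criterion, the sketch below shows one way to compute an F1-score and pick the best epoch from a list of validation scores. It assumes macro averaging, which may differ from the averaging used in `trainer.py`, and the names `macro_f1` and `best_epoch` are hypothetical helpers rather than functions from the repository.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in either list."""
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def best_epoch(val_scores):
    """Index of the epoch with the highest validation score (first one on ties)."""
    best_idx, best_score = -1, float("-inf")
    for idx, score in enumerate(val_scores):
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


# Toy validation history (made-up numbers, for illustration only).
print(best_epoch([0.2, 0.5, 0.5, 0.4]))  # 1
```

When you add checkpointing, the epoch returned by a rule like `best_epoch` is a natural point at which to save the model weights.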

In [None]:
import torch

from src.trainer import Trainer  # the Trainer class instantiated below

# note that this code IS NOT going to run on multiple GPUs, 
# please see the README / trainer.py for that. If you run
# on colab, it will utilize the GPU there. Please raise an
# issue if you don't see that happen.

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}")

trainer = Trainer(data_dir="data/", device=device, output="details_Brand", trim=True)
trainer.train()
trainer.test()

## Evaluation

We have also provided utility evaluation functions (which we use for validation) to judge your model.
Please note that, since we are ideally training 6 different models for 6 different columns, to get the `item accuracy` you need to store results from the validation run the same way we have done for the test run.
To make a valid submission, run the previous code cell 6 times, once with each of the 6 columns as the `output`; the next cell will then generate a zip file that you can upload to CodaLab.
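
As a rough illustration of why results for all 6 columns are needed, here is a minimal sketch of an item-level accuracy. It assumes that an item counts as correct only when every one of the 6 columns is predicted correctly; please check the challenge page for the official definition. The function `item_accuracy` and the list-of-dicts format are hypothetical and do not mirror the repository's evaluation utilities.

```python
COLUMNS = [
    "details_Brand", "L0_category", "L1_category",
    "L2_category", "L3_category", "L4_category",
]


def item_accuracy(preds, golds, columns=COLUMNS):
    """Fraction of items whose prediction matches the gold label in every column."""
    if not golds:
        return 0.0
    correct = sum(
        all(pred[col] == gold[col] for col in columns)
        for pred, gold in zip(preds, golds)
    )
    return correct / len(golds)


# Two toy items (made-up labels): the second has a wrong brand, so it does not count.
gold = [{col: "a" for col in COLUMNS}, {col: "b" for col in COLUMNS}]
pred = [dict(gold[0]), {**gold[1], "details_Brand": "x"}]
print(item_accuracy(pred, gold))  # 0.5
```

Under this assumption, a single weak column model caps the overall score, which is worth keeping in mind when deciding where to spend tuning effort.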

In [None]:
from src.evaluate import postprocess

postprocess("outputs/results.json")

## Thank you!
Repository and tutorial by [Karan Bania](https://karannb.github.io).