# BERT Baseline

## DataThon @ IndoML'24

This tutorial walks you through a very simple BERT-based baseline for the [DataThon](https://sites.google.com/view/datathon-indoml24) Challenge at IndoML'24.
Once you are done with this tutorial, feel free to play around with the repository; it has been developed to work with multiple GPUs as well.
At the end, you might also be able to make your first submission (albeit a very bad one!).

### Table of Contents:

1. [Preprocessing](#preprocessing)
2. [Dataset Creation](#dataset-creation)
3. [BERT Models](#bert-models)
4. [Training](#training)
5. [Evaluation](#evaluation)

Please feel free to create a PR if you find any bugs or need help with running this code.

In [1]:
# Run this cell on Colab; you can skip it if you are running locally.
!git clone https://github.com/karannb/indoml-bert-baseline.git
!cd indoml-bert-baseline/ && pip install -r requirements.txt

fatal: destination path 'indoml-bert-baseline' already exists and is not an empty directory.


## Preprocessing

Before preprocessing, we need to get the data. Please download it from [here](https://codalab.lisn.upsaclay.fr/competitions/19907#participate) after registering for the challenge.
You can store it in any directory, but ideally store it in a new directory `data/`.
Once you have the data, the next cells will preprocess it into the format that our `torch.utils.data.Dataset` objects expect; you can of course change this as well.
Feel free to go through the code if you want to change anything.

### Preprocess and Categorize

In [2]:
!unzip input_data.zip -d indoml-bert-baseline/data/

Archive:  input_data.zip
replace indoml-bert-baseline/data/attribute_test.data? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: indoml-bert-baseline/data/attribute_test.data  
  inflating: indoml-bert-baseline/data/attribute_train.data  
  inflating: indoml-bert-baseline/data/attribute_train.solution  
  inflating: indoml-bert-baseline/data/attribute_val.data  
  inflating: indoml-bert-baseline/data/attribute_val.solution  


In [3]:
%cd indoml-bert-baseline/

/content/indoml-bert-baseline


In [4]:
from src.preprocess import preprocess_fn, categorize

# this will preprocess the data to a particular format (in json)
preprocess_fn("attribute_test.data", "data")
preprocess_fn("attribute_val.data", "data")
preprocess_fn("attribute_val.solution", "data")
preprocess_fn("attribute_train.data", "data")
preprocess_fn("attribute_train.solution", "data")


# this will create maps of index2class and class2index for all columns or outputs
for col in ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']:
    categorize(col)

data/test.json already exists, skipped.
data/val.json already exists, skipped.
data/val_sol.json already exists, skipped.
data/train.json already exists, skipped.
data/train_sol.json already exists, skipped.
Attribute: details_Brand has 5066 unique values.
Maps for details_Brand already exist, skipped.
Attribute: L0_category has 27 unique values.
Maps for L0_category already exist, skipped.
Attribute: L1_category has 163 unique values.
Maps for L1_category already exist, skipped.
Attribute: L2_category has 612 unique values.
Maps for L2_category already exist, skipped.
Attribute: L3_category has 1252 unique values.
Maps for L3_category already exist, skipped.
Attribute: L4_category has 962 unique values.
Maps for L4_category already exist, skipped.
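
The `categorize` step above builds, for each output column, a pair of lookup tables between class names and integer indices. The sketch below illustrates the idea in plain Python. `build_label_maps` and the sample labels are hypothetical, and the actual implementation in `src/preprocess.py` may differ (for example in ordering or in how `'na'` labels are handled).

```python
def build_label_maps(labels):
    """Return (class2index, index2class) for an iterable of label strings."""
    # Sort the unique labels so the mapping is deterministic across runs.
    classes = sorted(set(labels))
    class2index = {name: idx for idx, name in enumerate(classes)}
    index2class = {idx: name for name, idx in class2index.items()}
    return class2index, index2class


# Hypothetical labels, for illustration only.
sample = ["Toys", "Books", "Toys", "Garden"]
class2index, index2class = build_label_maps(sample)
print(f"{len(class2index)} unique values")
print(class2index["Toys"], index2class[0])
```

The model predicts an index, so `index2class` is what turns a prediction back into a label string for the submission.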


## Dataset Creation

Now we will create PyTorch Datasets for convenient data loading (across multiple devices as well).
If you want to stick with a BERT baseline, we strongly recommend that you start your modifications here; we have also added `TODO: CAN CHANGE HERE` comments wherever we thought modifications are possible.
DO NOT USE `trim=True` when you are training your own model! It is only helpful for debugging purposes.

In [5]:
from src.dataset import ReviewsDataset, ReviewsDataLoader

# this is just a simple interface run
dataset = ReviewsDataset(data_dir="data/", split="train", output="L4_category", trim=True)
print(dataset[0])

dataloader = ReviewsDataLoader(dataset, batch_size=2, shuffle=True)

for batch in dataloader:
    print(batch)
    break

Loading train data: 100%|██████████| 443499/443499 [00:00<00:00, 750706.80it/s]


Found 'na' as the label, removing examples with 'na' label.
Original data has 443499 examples.
Data loaded in 0.058 minutes.
train data has 1000 examples.
{'input': 'Product Name: FEL-PRO 60977 Throttle Body Gasket, Sold at store: Fel-Pro, Manufactured by: FEL-PRO', 'output': 695, 'id': 188905}
{'input': ['Product Name: Microsoft Surface Book 2 Sleeve Case Evecase Slim Neoprene Pouch Carrying Laptop Bag for 2017 Microsoft Surface Book 2 / 1 13.5inch Chromebook - Black, Sold at store: Evecase, Manufactured by: Evecase', 'Product Name: GSP NCV36558 CV Axle Shaft Assembly - Right Front (Passenger Side), Sold at store: GSP, Manufactured by: GSP'], 'output': tensor([932, 193]), 'ids': tensor([94512, 61450], dtype=torch.int32)}
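
The two printed dictionaries show how single examples (`input`, `output`, `id`) are merged into a batch (`input`, `output`, `ids`). A minimal sketch of such a collate step follows, with plain lists standing in for tensors. `collate_examples` is a hypothetical helper and not necessarily how `ReviewsDataLoader` does it.

```python
def collate_examples(examples):
    """Merge a list of per-example dicts into one batch dict."""
    return {
        # Raw strings stay in a list; tokenization can happen later.
        "input": [ex["input"] for ex in examples],
        # In a PyTorch pipeline these two lists would be wrapped in torch.tensor(...).
        "output": [ex["output"] for ex in examples],
        "ids": [ex["id"] for ex in examples],
    }


# Made-up examples with the same keys as the dataset output above.
examples = [
    {"input": "Product Name: A, Sold at store: S, Manufactured by: M", "output": 3, "id": 10},
    {"input": "Product Name: B, Sold at store: T, Manufactured by: N", "output": 7, "id": 11},
]
batch = collate_examples(examples)
print(batch["output"], batch["ids"])
```

Keeping the `ids` in the batch is what lets predictions be matched back to items later, when results from several columns are combined.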


## BERT Models

This is a simple script to download a BERT model and its tokenizer; you can also play around with different types of BERT models from HuggingFace.
We stick to the basic one [here](https://huggingface.co/google-bert/bert-base-uncased).
Please save them appropriately, or modify the paths in `trainer.py`; by default they are saved in `bert-model/` and `bert-tokenizer/`.

In [6]:
from src.downloadBERT import download

# this function will download the model and tokenizer
download()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
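
If you want to try a different BERT variant, a download helper along the following lines should work. This is a sketch using the `transformers` `Auto*` classes, not the repository's `download()` itself. The function name `download_bert` is hypothetical, and the default directories are taken from the paths mentioned above.

```python
def download_bert(model_name="google-bert/bert-base-uncased",
                  model_dir="bert-model/",
                  tokenizer_dir="bert-tokenizer/"):
    """Fetch a BERT checkpoint from the Hugging Face Hub and save it locally."""
    # Imported lazily so that defining the function does not require transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # save_pretrained writes the files that from_pretrained(<local dir>) can reload.
    tokenizer.save_pretrained(tokenizer_dir)
    model.save_pretrained(model_dir)
    return model_dir, tokenizer_dir
```

Calling `download_bert("google-bert/bert-base-cased")`, for example, would store the cased variant in the same directories; calling it requires network access and the `transformers` package.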


## Training

Now the interesting part: training.
Note that there are **several** parameters that can be modified here, from the learning rate to the batch size to the optimizer used, weight decay, etc.
Apart from that, the code doesn't have checkpointing as of now; you will probably need that.
You can also change the criterion for best model selection (currently F1-score).
Again, remember NOT to use `trim=True` when you are actually training your model.
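
Since checkpointing is not built in, here is one possible way to add "save on best validation score" logic. This is a sketch, not part of the repository: `BestCheckpoint` and `save_fn` are hypothetical names, and the F1 values are illustrative.

```python
class BestCheckpoint:
    """Call save_fn whenever the monitored validation score improves."""

    def __init__(self, save_fn):
        self.save_fn = save_fn
        self.best_score = float("-inf")
        self.best_epoch = None

    def update(self, epoch, score):
        """Return True (and save) if score beats the best seen so far."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.save_fn(epoch)
            return True
        return False


saved_epochs = []
# With PyTorch, save_fn could be something like
#   lambda epoch: torch.save(model.state_dict(), f"ckpt_{epoch}.pt")
checkpoint = BestCheckpoint(save_fn=saved_epochs.append)

# Illustrative validation F1 scores for epochs 1..5.
for epoch, f1 in enumerate([0.33, 0.49, 0.39, 0.47, 0.58], start=1):
    checkpoint.update(epoch, f1)

print(saved_epochs, checkpoint.best_epoch)
```

Swapping the score passed to `update` (accuracy instead of F1, say) is all it takes to change the selection criterion.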

In [8]:
import torch
from src.trainer import Trainer

# note that this code IS NOT going to run on multiple GPUs,
# please see the README / trainer.py for that. If you run
# on colab, it will utilize the GPU there. Please raise an
# issue if you don't see that happen.

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}")

trainers = {}
for col in ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']:
    print("*"*88)
    print(f"Training for {col}")
    print("*"*88)
    trainers[col] = Trainer(data_dir="data/", device=device, output=col, trim=True)
    trainers[col].train()
    print(f"Training finished for {col}")
    print("*"*88)
    # print(f"Testing on {col}")
    # print("*"*88)
    # trainers[col].test()
    # print(f"Testing finished for {col}")
    # print("*"*88)

Running on cuda
Using device: cuda


Loading train data: 100%|██████████| 443499/443499 [00:00<00:00, 678456.50it/s]


Data loaded in 0.064 minutes.
train data has 1000 examples.


Loading val data: 100%|██████████| 95035/95035 [00:00<00:00, 566259.97it/s]


Data loaded in 0.013 minutes.
val data has 200 examples.


Loading test data: 100%|██████████| 95036/95036 [00:00<00:00, 753410.44it/s]


Data loaded in 0.008 minutes.
test data has 95036 examples.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./bert-model/ and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5


100%|██████████| 125/125 [00:32<00:00,  3.85it/s]
100%|██████████| 25/25 [00:06<00:00,  3.81it/s]


Validation Accuracy: 0.0950, Precision: 0.9289, Recall: 0.1160, F1: 0.0516
Epoch 2/5


100%|██████████| 125/125 [00:31<00:00,  3.95it/s]
100%|██████████| 25/25 [00:06<00:00,  4.05it/s]


Validation Accuracy: 0.1650, Precision: 0.7659, Recall: 0.3190, F1: 0.0938
Epoch 3/5


100%|██████████| 125/125 [00:31<00:00,  3.96it/s]
100%|██████████| 25/25 [00:06<00:00,  3.94it/s]


Validation Accuracy: 0.1900, Precision: 0.7228, Recall: 0.3806, F1: 0.1170
Epoch 4/5


100%|██████████| 125/125 [00:31<00:00,  3.93it/s]
100%|██████████| 25/25 [00:06<00:00,  3.99it/s]


Validation Accuracy: 0.2250, Precision: 0.7560, Recall: 0.3648, F1: 0.1421
Epoch 5/5


100%|██████████| 125/125 [00:31<00:00,  3.96it/s]
100%|██████████| 25/25 [00:06<00:00,  3.98it/s]


Validation Accuracy: 0.2450, Precision: 0.7156, Recall: 0.4237, F1: 0.1502
Using device: cuda


Loading train data: 100%|██████████| 443499/443499 [00:00<00:00, 514507.26it/s]


Data loaded in 0.073 minutes.
train data has 1000 examples.


Loading val data: 100%|██████████| 95035/95035 [00:00<00:00, 609729.34it/s]


Data loaded in 0.013 minutes.
val data has 200 examples.


Loading test data: 100%|██████████| 95036/95036 [00:00<00:00, 681699.03it/s]


Data loaded in 0.007 minutes.
test data has 95036 examples.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./bert-model/ and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5


100%|██████████| 125/125 [00:31<00:00,  3.93it/s]
100%|██████████| 25/25 [00:06<00:00,  3.91it/s]


Validation Accuracy: 0.5500, Precision: 0.7930, Recall: 0.3432, F1: 0.3255
Epoch 2/5


100%|██████████| 125/125 [00:31<00:00,  3.98it/s]
100%|██████████| 25/25 [00:06<00:00,  4.00it/s]


Validation Accuracy: 0.6550, Precision: 0.8066, Recall: 0.4700, F1: 0.4880
Epoch 3/5


100%|██████████| 125/125 [00:31<00:00,  3.99it/s]
100%|██████████| 25/25 [00:06<00:00,  3.94it/s]


Validation Accuracy: 0.6200, Precision: 0.7147, Recall: 0.4003, F1: 0.3874
Epoch 4/5


100%|██████████| 125/125 [00:31<00:00,  3.95it/s]
100%|██████████| 25/25 [00:06<00:00,  3.96it/s]


Validation Accuracy: 0.6900, Precision: 0.7597, Recall: 0.4836, F1: 0.4702
Epoch 5/5


100%|██████████| 125/125 [00:31<00:00,  3.97it/s]
100%|██████████| 25/25 [00:06<00:00,  3.97it/s]


Validation Accuracy: 0.7100, Precision: 0.7376, Recall: 0.5806, F1: 0.5775
Using device: cuda


Loading train data: 100%|██████████| 443499/443499 [00:00<00:00, 696781.47it/s]


Data loaded in 0.063 minutes.
train data has 1000 examples.


Loading val data: 100%|██████████| 95035/95035 [00:00<00:00, 579742.86it/s]


Data loaded in 0.012 minutes.
val data has 200 examples.


Loading test data: 100%|██████████| 95036/95036 [00:00<00:00, 789456.24it/s]


Data loaded in 0.008 minutes.
test data has 95036 examples.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./bert-model/ and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5


100%|██████████| 125/125 [00:31<00:00,  3.95it/s]
100%|██████████| 25/25 [00:06<00:00,  3.93it/s]


Validation Accuracy: 0.3550, Precision: 0.8312, Recall: 0.1777, F1: 0.1048
Epoch 2/5


100%|██████████| 125/125 [00:31<00:00,  3.96it/s]
100%|██████████| 25/25 [00:06<00:00,  3.99it/s]


Validation Accuracy: 0.4400, Precision: 0.7704, Recall: 0.2859, F1: 0.2131
Epoch 3/5


100%|██████████| 125/125 [00:31<00:00,  3.98it/s]
100%|██████████| 25/25 [00:06<00:00,  3.95it/s]


Validation Accuracy: 0.4850, Precision: 0.8116, Recall: 0.3562, F1: 0.2911
Epoch 4/5


100%|██████████| 125/125 [00:31<00:00,  3.95it/s]
100%|██████████| 25/25 [00:06<00:00,  3.95it/s]


Validation Accuracy: 0.4900, Precision: 0.8210, Recall: 0.3085, F1: 0.2704
Epoch 5/5


100%|██████████| 125/125 [00:31<00:00,  3.96it/s]
100%|██████████| 25/25 [00:06<00:00,  3.96it/s]


Validation Accuracy: 0.5300, Precision: 0.7769, Recall: 0.4116, F1: 0.3406
Using device: cuda


Loading train data: 100%|██████████| 443499/443499 [00:00<00:00, 677091.58it/s]


Data loaded in 0.060 minutes.
train data has 1000 examples.


Loading val data: 100%|██████████| 95035/95035 [00:00<00:00, 583853.70it/s]


Data loaded in 0.012 minutes.
val data has 200 examples.


Loading test data: 100%|██████████| 95036/95036 [00:00<00:00, 455465.13it/s]


Data loaded in 0.011 minutes.
test data has 95036 examples.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./bert-model/ and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5


100%|██████████| 125/125 [00:31<00:00,  3.94it/s]
100%|██████████| 25/25 [00:06<00:00,  3.89it/s]


Validation Accuracy: 0.2250, Precision: 0.8527, Recall: 0.1659, F1: 0.0904
Epoch 2/5


100%|██████████| 125/125 [00:31<00:00,  3.97it/s]
100%|██████████| 25/25 [00:06<00:00,  4.00it/s]


Validation Accuracy: 0.2600, Precision: 0.8216, Recall: 0.2291, F1: 0.1346
Epoch 3/5


100%|██████████| 125/125 [00:31<00:00,  3.98it/s]
100%|██████████| 25/25 [00:06<00:00,  3.96it/s]


Validation Accuracy: 0.3350, Precision: 0.8145, Recall: 0.3064, F1: 0.1965
Epoch 4/5


100%|██████████| 125/125 [00:31<00:00,  3.95it/s]
100%|██████████| 25/25 [00:06<00:00,  3.94it/s]


Validation Accuracy: 0.3250, Precision: 0.7713, Recall: 0.3138, F1: 0.1837
Epoch 5/5


100%|██████████| 125/125 [00:31<00:00,  3.97it/s]
100%|██████████| 25/25 [00:06<00:00,  3.96it/s]


Validation Accuracy: 0.3650, Precision: 0.6890, Recall: 0.3888, F1: 0.2206
Using device: cuda


Loading train data: 100%|██████████| 443499/443499 [00:00<00:00, 709655.52it/s]


Found 'na' as the label, removing examples with 'na' label.
Original data has 443499 examples.
Data loaded in 0.065 minutes.
train data has 1000 examples.


Loading val data: 100%|██████████| 95035/95035 [00:00<00:00, 580503.57it/s]


Found 'na' as the label, removing examples with 'na' label.
Original data has 95035 examples.
Data loaded in 0.013 minutes.
val data has 200 examples.


Loading test data: 100%|██████████| 95036/95036 [00:00<00:00, 734028.69it/s]


Data loaded in 0.007 minutes.
test data has 95036 examples.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./bert-model/ and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5


100%|██████████| 125/125 [00:31<00:00,  3.94it/s]
100%|██████████| 25/25 [00:06<00:00,  3.90it/s]


Validation Accuracy: 0.0800, Precision: 0.9334, Recall: 0.0793, F1: 0.0399
Epoch 2/5


100%|██████████| 125/125 [00:31<00:00,  3.96it/s]
100%|██████████| 25/25 [00:06<00:00,  4.00it/s]


Validation Accuracy: 0.1200, Precision: 0.8208, Recall: 0.1733, F1: 0.0469
Epoch 3/5


100%|██████████| 125/125 [00:31<00:00,  3.97it/s]
100%|██████████| 25/25 [00:06<00:00,  3.91it/s]


Validation Accuracy: 0.1700, Precision: 0.7229, Recall: 0.2894, F1: 0.0784
Epoch 4/5


100%|██████████| 125/125 [00:31<00:00,  3.95it/s]
100%|██████████| 25/25 [00:06<00:00,  3.94it/s]


Validation Accuracy: 0.2450, Precision: 0.7181, Recall: 0.3311, F1: 0.1062
Epoch 5/5


100%|██████████| 125/125 [00:31<00:00,  3.97it/s]
100%|██████████| 25/25 [00:06<00:00,  3.97it/s]


Validation Accuracy: 0.1950, Precision: 0.7060, Recall: 0.3014, F1: 0.1033
Using device: cuda


Loading train data: 100%|██████████| 443499/443499 [00:00<00:00, 679673.91it/s]


Found 'na' as the label, removing examples with 'na' label.
Original data has 443499 examples.
Data loaded in 0.058 minutes.
train data has 1000 examples.


Loading val data: 100%|██████████| 95035/95035 [00:00<00:00, 534864.57it/s]


Found 'na' as the label, removing examples with 'na' label.
Original data has 95035 examples.
Data loaded in 0.012 minutes.
val data has 200 examples.


Loading test data: 100%|██████████| 95036/95036 [00:00<00:00, 704250.26it/s]


Data loaded in 0.008 minutes.
test data has 95036 examples.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./bert-model/ and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5


100%|██████████| 125/125 [00:31<00:00,  3.94it/s]
100%|██████████| 25/25 [00:06<00:00,  3.92it/s]


Validation Accuracy: 0.1200, Precision: 0.9232, Recall: 0.0897, F1: 0.0376
Epoch 2/5


100%|██████████| 125/125 [00:31<00:00,  3.96it/s]
100%|██████████| 25/25 [00:06<00:00,  4.01it/s]


Validation Accuracy: 0.1950, Precision: 0.8193, Recall: 0.2324, F1: 0.1028
Epoch 3/5


100%|██████████| 125/125 [00:31<00:00,  3.97it/s]
100%|██████████| 25/25 [00:06<00:00,  3.96it/s]


Validation Accuracy: 0.2500, Precision: 0.7281, Recall: 0.3413, F1: 0.1329
Epoch 4/5


100%|██████████| 125/125 [00:31<00:00,  3.95it/s]
100%|██████████| 25/25 [00:06<00:00,  3.99it/s]


Validation Accuracy: 0.3450, Precision: 0.7118, Recall: 0.4087, F1: 0.2040
Epoch 5/5


100%|██████████| 125/125 [00:31<00:00,  3.98it/s]
100%|██████████| 25/25 [00:06<00:00,  3.99it/s]

Validation Accuracy: 0.2950, Precision: 0.6880, Recall: 0.3940, F1: 0.1797





## Evaluation

We have also provided utility evaluation functions (which we use for validation) to judge your model.
Please note that we train 6 different models for 6 different columns, so if you want to perform model selection based on `item accuracy` (the main metric for the contest), you need to store the results from the validation run the same way we do for the test run.
To make a valid submission, run the previous code cell with the testing lines uncommented; you can then use the next cell to generate a valid submission. (NOTE: this takes very long, ~49 minutes on one GPU on Colab, as the test output must cover the entire test dataset; you can probably reduce this by using a bigger batch size.)
The next cell will generate a zip file that you can upload to CodaLab.

In [None]:
from src.evaluate import postprocess

# postprocess()
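
To use `item accuracy` for model selection, the per-column validation predictions have to be combined per item. The sketch below assumes that an item counts as correct only when all of its columns are predicted correctly; please check the challenge page for the exact definition. `item_accuracy` and the toy data are hypothetical and not part of `src.evaluate`.

```python
def item_accuracy(predictions, targets):
    """Fraction of items for which every column is predicted correctly.

    Both arguments map column name -> {item id -> label}.
    """
    columns = list(targets)
    item_ids = list(targets[columns[0]])
    correct = sum(
        all(predictions[col].get(i) == targets[col][i] for col in columns)
        for i in item_ids
    )
    return correct / len(item_ids)


# Toy data with two columns and three items.
targets = {
    "L0_category": {1: "a", 2: "b", 3: "d"},
    "L1_category": {1: "x", 2: "q", 3: "z"},
}
predictions = {
    "L0_category": {1: "a", 2: "b", 3: "c"},
    "L1_category": {1: "x", 2: "y", 3: "z"},
}
acc = item_accuracy(predictions, targets)
print(f"Item accuracy: {acc:.4f}")
```

Only item 1 is right in both columns here, so a model can score well on each column separately and still have a low item accuracy.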

## Thank you!
Repository and tutorial by [Karan Bania](https://karannb.github.io).