# Subtask 2: Product Category Prediction using BERT

For this subtask I fine-tune a pretrained BERT model on the challenge's data. <br> <br>
__The documentation remains the same as in the Hazard Category Prediction notebook (st1_hazard.ipynb). The only difference is the number of epochs, which I set to 5 to account for the larger number of classes.__

In [1]:
%pip install transformers -q

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install -U accelerate -q

Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install torch -q

Note: you may need to restart the kernel to use updated packages.


In [4]:
%pip install tf-keras -q

Note: you may need to restart the kernel to use updated packages.


In [5]:
%pip install 'accelerate>=0.26.0' -q

Note: you may need to restart the kernel to use updated packages.


In [6]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
device

'cpu'

In [7]:
import re
import string
import tensorflow as tf
import torch, os
import torch.nn as nn
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import pipeline, BertForSequenceClassification, BertTokenizerFast, TrainingArguments, Trainer
from torch.utils.data import Dataset

2025-02-13 13:56:20.123381: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1739454980.141993   64943 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1739454980.147756   64943 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-13 13:56:20.166855: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Data preparation

Load CSV datasets

In [8]:
labeled_train_df = pd.read_csv("labeled_training_incidents.csv").rename(columns={"Unnamed: 0": "index"})
labeled_valid_df = pd.read_csv("labeled_validation_incidents.csv").rename(columns={"Unnamed: 0": "index"})
labeled_test_df = pd.read_csv("labeled_test_incidents.csv").rename(columns={"Unnamed: 0": "index"})

We extract the unique product categories <br>
Then create the mapping dictionaries (label2id and id2label)<br>
And apply this mapping to the training dataset<br>

In [9]:
# Extract unique product-category values and create a list
unique_categories = labeled_train_df["product-category"].unique().tolist()

# Create the mapping dictionaries
label2id = {category: idx for idx, category in enumerate(unique_categories)}
id2label = {v: k for k, v in label2id.items()}

# Create a new dataframe with the text and the numeric product-category label
train_st1 = labeled_train_df[['text']].copy()
train_st1['label'] = labeled_train_df['product-category'].map(label2id)


We apply the same mapping to the validation and test datasets

In [10]:
# Create a new dataframe with the text and the numeric product-category label
valid_st1 = labeled_valid_df[['text']].copy()
valid_st1['label'] = labeled_valid_df['product-category'].map(label2id)

# Create a new dataframe with the text and the numeric product-category label
test_st1 = labeled_test_df[['text']].copy()
test_st1['label'] = labeled_test_df['product-category'].map(label2id)
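
Because `label2id` is built from the training split only, a product category that appears only in the validation or test split has no entry in the dictionary, and pandas' `Series.map` returns `NaN` for it. A minimal coverage check, sketched here with plain Python lists and made-up category names (the real values come from the `product-category` column), might look like this:

```python
# Hypothetical category values; the real ones come from the CSV files.
train_categories = ["meat", "dairy", "meat", "seafood"]
valid_categories = ["dairy", "fruits", "meat"]

# Same construction as above: one integer id per category,
# numbered in order of first appearance in the training data.
unique_categories = list(dict.fromkeys(train_categories))
label2id = {category: idx for idx, category in enumerate(unique_categories)}
id2label = {idx: category for category, idx in label2id.items()}

# Categories in the validation data that the mapping cannot encode.
unseen = sorted(set(valid_categories) - set(label2id))

print(label2id)  # {'meat': 0, 'dairy': 1, 'seafood': 2}
print(unseen)    # ['fruits']
```

If `unseen` is not empty, the affected rows would need to be handled before training or evaluation, for example by dropping them or by building the mapping from all splits.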

We define a custom standardization function that:
* Converts text to lowercase
* Replaces each line break followed by indentation whitespace with a single space
* Strips punctuation

In [11]:
def custom_standardization(input_data):
    # Lowercase the text
    lowercase = tf.strings.lower(input_data)
    # Replace each newline followed by indentation spaces with a single space
    no_line_breaks = tf.strings.regex_replace(lowercase, '\n            ', ' ')
    # Remove all ASCII punctuation characters
    no_punctuation = tf.strings.regex_replace(no_line_breaks,
                                              '[%s]' % re.escape(string.punctuation),
                                              '')
    return no_punctuation.numpy().decode('utf-8')  # Convert from Tensor to string
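
To make the effect of these three steps concrete, here is a rough pure-Python equivalent that needs no TensorFlow. It is only a sketch: the name `standardize` and the sample sentence are invented for illustration, the line-break pattern is assumed to be a newline followed by twelve spaces, and `str.lower` may treat non-ASCII characters differently from `tf.strings.lower`.

```python
import re
import string

# Newline followed by twelve spaces (assumed to mirror the pattern used above).
INDENTED_NEWLINE = "\n" + " " * 12
# Character class matching any single ASCII punctuation character.
PUNCTUATION = re.compile("[%s]" % re.escape(string.punctuation))


def standardize(text: str) -> str:
    """Lowercase, replace indented line breaks with a space, strip punctuation."""
    lowered = text.lower()
    unwrapped = lowered.replace(INDENTED_NEWLINE, " ")
    return PUNCTUATION.sub("", unwrapped)


sample = "Recall: Salmonella" + INDENTED_NEWLINE + "found in eggs!"
print(standardize(sample))  # recall salmonella found in eggs
```

Applying a plain string function like this with `Series.apply` would avoid creating one tensor per row.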

We apply the function to all datasets

In [12]:
# Apply the custom_standardization function to the text column (the first column) of each dataset
valid_st1.iloc[:, 0] = valid_st1.iloc[:, 0].apply(lambda x: custom_standardization(tf.constant(x)))
train_st1.iloc[:, 0] = train_st1.iloc[:, 0].apply(lambda x: custom_standardization(tf.constant(x)))
test_st1.iloc[:, 0] = test_st1.iloc[:, 0].apply(lambda x: custom_standardization(tf.constant(x)))

2025-02-13 13:56:24.839665: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


### Tokenization and Model Initialization

Then, we load a BERT Tokenizer from Google's BERT base model and a BERT model for sequence classification

In [13]:
tokenizer = BertTokenizerFast.from_pretrained("google-bert/bert-base-uncased", model_max_length=512)

In [14]:
num_labels = len(label2id)

model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=num_labels, id2label=id2label, label2id=label2id)
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

We first convert the text and labels into lists, then tokenize all the datasets with padding and truncation

In [15]:
train_texts = train_st1['text'].tolist()
train_labels = train_st1['label'].tolist()
val_texts = valid_st1['text'].tolist()
val_labels = valid_st1['label'].tolist()
test_texts = test_st1['text'].tolist()
test_labels = test_st1['label'].tolist()

In [16]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings  = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
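
With `truncation=True` and `padding=True`, sequences longer than the tokenizer's maximum length are cut, and every sequence in the call is padded to the longest one that remains. Since each split is tokenized in a single call, all examples of a split end up with the same length. The toy function below sketches that idea on made-up token ids. It is a simplification, not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, and a real BERT tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, not real vocabulary entries.
toy_batch = [[5, 6, 7, 8], [5, 6, 7, 8, 9, 10, 11], [5, 6]]
encoded = pad_and_truncate(toy_batch, max_length=5)

print(encoded["input_ids"])
# [[5, 6, 7, 8, 0], [5, 6, 7, 8, 9], [5, 6, 0, 0, 0]]
print(encoded["attention_mask"])
# [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
```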

### Dataloader

Despite its name, the `DataLoader` class below is a PyTorch `Dataset`: it stores the tokenized text and labels, returns each sample as a dictionary of PyTorch tensors, and reports the dataset length.

In [None]:
class DataLoader(Dataset):

    def __init__(self, encodings, labels):
        """
        Initializes the DataLoader class with encodings and labels.

        Args:
            encodings (dict): A dictionary containing tokenized input text data
                              (e.g., 'input_ids', 'token_type_ids', 'attention_mask').
            labels (list): A list of integer labels for the input text data.
        """
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        """
        Returns a dictionary containing tokenized data and the corresponding label for a given index.

        Args:
            idx (int): The index of the data item to retrieve.

        Returns:
            item (dict): A dictionary containing the tokenized data and the corresponding label.
        """
        # Retrieve tokenized data for the given index
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Add the label for the given index to the item dictionary
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        """
        Returns the number of data items in the dataset.

        Returns:
            (int): The number of data items in the dataset.
        """
        return len(self.labels)


We create dataloaders for all the datasets

In [18]:
train_dataloader = DataLoader(train_encodings, train_labels)
val_dataloader = DataLoader(val_encodings, val_labels)
test_dataset = DataLoader(test_encodings, test_labels)

### Training and Evaluation

We define a custom metric function that computes Accuracy, F1-score, Precision, and Recall (sklearn) and
uses argmax on model predictions to get class labels

In [None]:
def compute_metrics(pred):

    # Extract true labels from the input object
    labels = pred.label_ids
    
    # Obtain predicted class labels by taking the index of the largest logit in each row
    preds = pred.predictions.argmax(-1)
    
    # Compute macro precision, recall, and F1 score using sklearn's precision_recall_fscore_support function
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    
    # Calculate the accuracy score using sklearn's accuracy_score function
    acc = accuracy_score(labels, preds)
    
    # Return the computed metrics as a dictionary
    return {
        'Accuracy': acc,
        'F1': f1,
        'Precision': precision,
        'Recall': recall
    }
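
Macro averaging gives every class the same weight, so a model that predicts only the most frequent class can have a reasonable accuracy and a very low macro F1. The sketch below recomputes the macro scores by hand on a toy example to show this. `macro_scores` is a hypothetical helper written for illustration; it counts undefined ratios as 0, which is what scikit-learn does by default when it emits its "ill-defined" warning.

```python
def macro_scores(labels, preds):
    """Unweighted mean of per-class precision, recall and F1 (undefined ratios count as 0)."""
    classes = sorted(set(labels) | set(preds))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: the model always predicts the majority class 0.
labels = [0, 0, 0, 0, 1, 2]
preds = [0, 0, 0, 0, 0, 0]

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
precision, recall, f1 = macro_scores(labels, preds)

print(round(accuracy, 3))   # 0.667
print(round(precision, 3))  # 0.222
print(round(recall, 3))     # 0.333
print(round(f1, 3))         # 0.267
```

This is consistent with the early evaluation steps of a run, where accuracy can be well above macro F1 until the model starts predicting the rarer classes.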


In [None]:
training_args = TrainingArguments(
    use_cpu=True,
    # The output directory where the model predictions and checkpoints will be written
    output_dir='./Model-st1-product',
    do_train=True,
    do_eval=True,
    # The number of epochs (defaults to 3.0)
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=48,
    # Number of steps used for a linear warmup
    warmup_steps=100,
    weight_decay=0.01,
    logging_strategy='steps',
    # TensorBoard log directory
    logging_dir='./multi-class-logs-st1-product',
    logging_steps=50,
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_F1"
)
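
The `warmup_steps=100` setting means the learning rate rises linearly during the first 100 optimizer steps and then decays linearly to zero. The sketch below reproduces that shape in plain Python. Several values are assumptions: the peak learning rate of 5e-5 and the linear schedule are the `TrainingArguments` defaults (neither is set in the cell above), the total of 795 steps is taken from the training output of this run, and `linear_schedule_with_warmup` is a hypothetical helper rather than a library function.

```python
def linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 5e-5       # assumed: default learning rate
WARMUP_STEPS = 100   # from the arguments above
TOTAL_STEPS = 795    # assumed: global_step reported for this run

for step in (0, 50, 100, 500, 795):
    lr = linear_schedule_with_warmup(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(step, f"{lr:.2e}")
```

Under these assumptions the first evaluation at step 50 happens at half the peak learning rate, and the peak is reached exactly at step 100.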



We initialize the Trainer with the BERT model, the training and validation dataset and the custom metric function we made

In [21]:
trainer = Trainer(
    # the pre-trained model that will be fine-tuned 
    model=model,
     # training arguments that we defined above                        
    args=training_args,                 
    train_dataset=train_dataloader,         
    eval_dataset=val_dataloader,            
    compute_metrics= compute_metrics
)

We train the model

In [22]:
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 50 | 2.8676 | 2.531655 | 0.258407 | 0.022848 | 0.014381 | 0.055556 |
| 100 | 2.3143 | 1.970199 | 0.431858 | 0.122044 | 0.179139 | 0.157987 |
| 150 | 1.7457 | 1.492316 | 0.614159 | 0.350433 | 0.395098 | 0.387079 |
| 200 | 1.2651 | 1.230021 | 0.646018 | 0.46329 | 0.555968 | 0.503928 |
| 250 | 1.0461 | 1.078735 | 0.695575 | 0.496662 | 0.524372 | 0.509446 |
| 300 | 0.9937 | 1.014523 | 0.723894 | 0.567535 | 0.573485 | 0.583499 |
| 350 | 0.7715 | 1.025426 | 0.715044 | 0.615187 | 0.60843 | 0.642815 |
| 400 | 0.6625 | 0.968962 | 0.739823 | 0.615946 | 0.627995 | 0.630611 |
| 450 | 0.6558 | 0.944281 | 0.730973 | 0.622198 | 0.634843 | 0.643314 |
| 500 | 0.5867 | 0.983027 | 0.722124 | 0.620882 | 0.617033 | 0.645509 |
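
Since `metric_for_best_model="eval_F1"` ranks evaluations by macro F1, it can be useful to see which of the logged steps scores highest. The snippet below is a small sketch that parses the step, validation loss and F1 columns of the rows shown above with the standard library. It only looks at these ten rows and says nothing about which checkpoint the `Trainer` actually restored.

```python
import csv
import io

# Step, validation loss and macro F1 copied from the training log above.
log = """Step,Validation Loss,F1
50,2.531655,0.022848
100,1.970199,0.122044
150,1.492316,0.350433
200,1.230021,0.46329
250,1.078735,0.496662
300,1.014523,0.567535
350,1.025426,0.615187
400,0.968962,0.615946
450,0.944281,0.622198
500,0.983027,0.620882
"""

rows = list(csv.DictReader(io.StringIO(log)))

# Highest macro F1 and lowest validation loss among the logged evaluations.
best_f1 = max(rows, key=lambda row: float(row["F1"]))
best_loss = min(rows, key=lambda row: float(row["Validation Loss"]))

print(best_f1["Step"], best_f1["F1"])                   # 450 0.622198
print(best_loss["Step"], best_loss["Validation Loss"])  # 450 0.944281
```

Both criteria point to step 450 here; between steps 450 and 500 the training loss keeps falling while the validation loss rises slightly.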


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

TrainOutput(global_step=795, training_loss=0.9424421262441192, metrics={'train_runtime': 9699.5327, 'train_samples_per_second': 2.62, 'train_steps_per_second': 0.082, 'total_flos': 6686852472115200.0, 'train_loss': 0.9424421262441192, 'epoch': 5.0})
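
The step count in this output can be sanity-checked against the training arguments. The arithmetic below is a sketch that assumes a single device, no gradient accumulation and a kept final partial batch (the defaults, as far as the arguments shown indicate). The training-set size itself is not stated in this notebook, so the code only derives the range of sizes compatible with the reported numbers.

```python
import math

global_step = 795       # from the TrainOutput above
num_train_epochs = 5
train_batch_size = 32
eval_steps = 50

steps_per_epoch = global_step // num_train_epochs

# Smallest and largest training-set sizes that need exactly this many batches per epoch.
min_examples = (steps_per_epoch - 1) * train_batch_size + 1
max_examples = steps_per_epoch * train_batch_size

# Steps at which an evaluation is scheduled over the whole run.
eval_points = list(range(eval_steps, global_step + 1, eval_steps))

print(steps_per_epoch)                    # 159
print(min_examples, max_examples)         # 5057 5088
print(len(eval_points), eval_points[-1])  # 15 750
```

The reported throughput agrees with this range: 2.62 samples per second over 9699.5 seconds and 5 epochs gives roughly 5,080 examples per epoch. The schedule also implies 15 evaluations, of which the log above lists the first ten.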

Lastly, to evaluate the model on the test data, we save its predictions to a CSV file, which is then reloaded in the overview.ipynb notebook for the assessment.

In [23]:
predictions = trainer.predict(test_dataset)

# Raw model outputs (logits), one row per test example
logits = predictions.predictions

# Convert logits into class ids by taking the index of the largest value in each row
predicted_labels = logits.argmax(-1)

# Map predicted numeric labels back to category names
predicted_categories = [id2label[label] for label in predicted_labels]

# Create a DataFrame with the index, true category and predicted category of each test example
results_df = pd.DataFrame({
    "index": test_st1.index,  # Use the index of the original test dataset
    "true_label": test_st1["label"].map(id2label),  # True category name
    "predicted_label": predicted_categories  # Model's predicted category
})

# Save to CSV
results_df.to_csv("predictions_st1_product.csv", index=False)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
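
The decoding step above works directly on logits rather than probabilities. This is safe because softmax preserves the ordering of the values, so the index of the largest logit is also the index of the largest probability. The sketch below illustrates this with made-up logits and a hypothetical three-class `id2label`; `argmax` and `softmax` are small helpers written for this example.

```python
import math


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def softmax(values):
    """Convert logits to probabilities (shifted by the maximum for numerical stability)."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical mapping and logits, for illustration only.
id2label = {0: "meat", 1: "dairy", 2: "seafood"}
logits = [[2.1, -0.3, 0.4], [-1.0, 0.2, 3.5]]

from_logits = [argmax(row) for row in logits]
from_probs = [argmax(softmax(row)) for row in logits]
predicted_categories = [id2label[i] for i in from_logits]

print(from_logits == from_probs)  # True
print(predicted_categories)       # ['meat', 'seafood']
```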
