# __BERT__

Next, we performed text classification with DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers), loaded through the Hugging Face transformers library. We chose DistilBERT for its efficiency and effectiveness in natural language processing tasks.

In [23]:
!pip install transformers datasets



In [24]:
!pip install tf-keras



In [25]:
import os
import csv
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
#from transformers import AutoModelForSequenceClassification, TFDistilBertForSequenceClassification, TFTrainingArguments, TFTrainer

from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Custom libraries
import sys
sys.path.append('..')
from functions.models import *

We started by loading the training, testing, and validation splits from TSV files using Pandas. The text comments and their corresponding labels are stored in separate files, which we then combined into a single Hugging Face `DatasetDict`.

In [54]:
train_path = '../data/X_train.tsv'
test_path = '../data/X_test.tsv'
validation_path = '../data/X_val.tsv'

X_train = pd.read_csv(train_path, sep='\t')
X_test = pd.read_csv(test_path, sep='\t')
X_val = pd.read_csv(validation_path, sep='\t')

In [55]:
train_path = '../data/y_train.tsv'
test_path = '../data/y_test.tsv'
validation_path = '../data/y_val.tsv'

y_train = pd.read_csv(train_path, sep='\t')
y_test = pd.read_csv(test_path, sep='\t')
y_val = pd.read_csv(validation_path, sep='\t')

In [73]:
import datasets

train_dataset = datasets.Dataset.from_dict({"text": X_train["comment"], "label": y_train["label"]})
test_dataset = datasets.Dataset.from_dict({"text": X_test["comment"], "label": y_test["label"]})
validation_dataset = datasets.Dataset.from_dict({"text": X_val["comment"], "label": y_val["label"]})

dataset = datasets.DatasetDict({"train": train_dataset, "test": test_dataset, "validation": validation_dataset})


In [74]:
dataset.shape

{'train': (39900, 2), 'test': (4833, 2), 'validation': (4891, 2)}

### Tokenization, Padding and Sequencing

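We then tokenized the text with the DistilBERT tokenizer, which converts each comment into a sequence of integer token IDs, together with an attention mask, that the model can process. Comments longer than the model's maximum input length are truncated. Padding is deferred to the data collator, which pads each batch to the length of its longest sequence.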

In [75]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")



In [76]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [80]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)


In [82]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
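
To make the role of the data collator concrete, here is a minimal NumPy sketch of dynamic padding: each batch is padded only to the length of its longest sequence, and an attention mask marks the real tokens. The helper name `pad_batch`, the token IDs, and the choice of 0 as the pad ID are assumptions for illustration; this is not the library's implementation.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Pad variable-length ID lists to the longest one in the batch (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq       # copy the real tokens
        attention_mask[row, :len(seq)] = 1    # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Made-up token IDs, for illustration only
batch = [[101, 2023, 102], [101, 2023, 2003, 1037, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids.shape)
print(attention_mask)
```

Padding per batch rather than to a global maximum keeps short batches small, which is the motivation for combining `truncation=True` in the tokenizer with a padding collator.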

### Model Building

For our text classification model, we used the AutoModelForSequenceClassification class from the transformers library. This class loads the pre-trained DistilBERT weights and adds a new, randomly initialized classification head on top, which is why the warning below asks for the model to be trained on the downstream task. We set the number of labels to the number of unique labels in the training data.

In [83]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=y_train["label"].nunique())


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
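
Using `nunique()` for `num_labels` implicitly assumes that the labels are integers running from 0 to `num_labels - 1`, which is what the classification head expects. The following is a small sketch of a sanity check under that assumption; the helper `count_contiguous_labels` is hypothetical and not part of the original notebook.

```python
def count_contiguous_labels(labels):
    """Return the number of classes, or raise if the IDs are not 0..n-1 (hypothetical helper)."""
    unique = sorted(set(int(label) for label in labels))
    expected = list(range(len(unique)))
    if unique != expected:
        raise ValueError(f"label ids {unique} are not contiguous from 0")
    return len(unique)

# Toy labels, for illustration only
num_classes = count_contiguous_labels([0, 2, 1, 1, 0])
print(num_classes)
```

In the notebook, this could be called as `count_contiguous_labels(y_train["label"])` before loading the model.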


In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

### Model Training

We defined the training configuration with the TrainingArguments class, setting the learning rate, batch size, number of epochs, and weight decay. The Trainer class from the transformers library handled the training loop. We passed it the model, the training arguments, the tokenized training split, the tokenized test split as the evaluation dataset, the tokenizer, and the data collator.

In [85]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

{'loss': 2.5003, 'grad_norm': 6.8331170082092285, 'learning_rate': 1.9198075380914196e-05, 'epoch': 0.2}
{'loss': 2.154, 'grad_norm': 8.398826599121094, 'learning_rate': 1.8396150761828387e-05, 'epoch': 0.4}
{'loss': 2.0152, 'grad_norm': 10.167052268981934, 'learning_rate': 1.7594226142742585e-05, 'epoch': 0.6}
{'loss': 2.0048, 'grad_norm': 11.380202293395996, 'learning_rate': 1.679230152365678e-05, 'epoch': 0.8}
{'loss': 1.9495, 'grad_norm': 7.773941993713379, 'learning_rate': 1.5990376904570973e-05, 'epoch': 1.0}
{'loss': 1.8047, 'grad_norm': 7.956082820892334, 'learning_rate': 1.5188452285485164e-05, 'epoch': 1.2}
{'loss': 1.8158, 'grad_norm': 11.588322639465332, 'learning_rate': 1.4386527666399359e-05, 'epoch': 1.4}
{'loss': 1.815, 'grad_norm': 8.168700218200684, 'learning_rate': 1.3584603047313553e-05, 'epoch': 1.6}
{'loss': 1.8091, 'grad_norm': 9.24846363067627, 'learning_rate': 1.2782678428227749e-05, 'epoch': 1.8}
{'loss': 1.8097, 'grad_norm': 8.791220664978027, 'learning_rate'

TrainOutput(global_step=12470, training_loss=1.706114694798767, metrics={'train_runtime': 15500.0635, 'train_samples_per_second': 12.871, 'train_steps_per_second': 0.805, 'total_flos': 785712641419104.0, 'train_loss': 1.706114694798767, 'epoch': 5.0})
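
The step count and the logged learning rates can be reproduced from the configuration. The sketch below assumes the Trainer defaults of a linear decay schedule with no warmup and logging every 500 steps, on a single device; the training set size is taken from `dataset.shape`.

```python
import math

train_samples = 39900   # from dataset.shape
batch_size = 16         # per_device_train_batch_size
epochs = 5              # num_train_epochs
base_lr = 2e-5          # learning_rate

# The last, partial batch of each epoch still counts as one step
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def linear_lr(step):
    """Learning rate after `step` updates, assuming linear decay to zero and no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

print(steps_per_epoch, total_steps)
print(linear_lr(500), 500 / steps_per_epoch)
```

Under these assumptions the total agrees with `global_step=12470` in the training output, and the rate after 500 steps agrees with the first logged value at epoch 0.2.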

### Model Evaluation

After training, we evaluated the model on the test split, which had been passed to the Trainer as the evaluation dataset. `trainer.evaluate()` reports the loss on that split. We then obtained the model's predictions with the `predict` method of the Trainer object, took the argmax over the logits of each example, and computed the accuracy by comparing the predicted classes with the true labels.

In [86]:
# See loss
trainer.evaluate()

{'eval_loss': 1.9426915645599365,
 'eval_runtime': 57.9295,
 'eval_samples_per_second': 83.429,
 'eval_steps_per_second': 5.23,
 'epoch': 5.0}
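
Because no `compute_metrics` function was passed to the Trainer, `evaluate()` reports only the loss. Below is a minimal sketch of a metrics function that could be supplied through the Trainer's `compute_metrics` argument. It is an assumption about how the notebook might be extended, not what was run, and the logits are made-up values.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Turn a (logits, labels) pair into an accuracy score."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # most likely class per example
    accuracy = np.mean(predictions == np.asarray(labels))
    return {"accuracy": float(accuracy)}

# Toy logits for 4 examples and 3 classes, for illustration only
toy_logits = np.array([[2.0, 0.1, -1.0],
                       [0.3, 1.5, 0.2],
                       [0.0, 0.2, 0.9],
                       [1.2, 0.4, 0.1]])
toy_labels = [0, 1, 1, 0]
result = compute_metrics((toy_logits, toy_labels))
print(result)
```

With `Trainer(..., compute_metrics=compute_metrics)`, the returned keys are reported by `trainer.evaluate()` with an `eval_` prefix, alongside the loss.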

In [95]:
import numpy as np

# Get the predictions (logits) for the test split
eval_predictions = trainer.predict(tokenized_dataset["test"]).predictions

# Get the true labels of the test split as an array
eval_labels = np.array(tokenized_dataset["test"]["label"])

# Compute the final predictions (predicted class) using argmax
predicted_classes = np.argmax(eval_predictions, axis=1)

# Compute the accuracy by comparing the true labels with the predictions
accuracy = np.mean(predicted_classes == eval_labels)
print("Accuracy:", accuracy)

Accuracy: 0.45934202358783366
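
Overall accuracy hides which classes are confused with one another. The notebook imports `confusion_matrix` and `classification_report` from scikit-learn for this kind of breakdown. As a dependency-light illustration, the sketch below builds a confusion matrix and per-class recall with NumPy on made-up labels; the helper name `confusion_counts` is hypothetical.

```python
import numpy as np

def confusion_counts(true_labels, predicted_labels, num_classes):
    """Count examples per (true class, predicted class) pair (hypothetical helper)."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    # np.add.at accumulates correctly when the same cell appears more than once
    np.add.at(matrix, (np.asarray(true_labels), np.asarray(predicted_labels)), 1)
    return matrix

# Toy labels, for illustration only
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_counts(toy_true, toy_pred, num_classes=3)

# Rows are true classes, so the diagonal over the row sum is per-class recall
recall = matrix.diagonal() / matrix.sum(axis=1)
print(matrix)
print(recall)
```

In the notebook, `eval_labels` and `predicted_classes` would take the place of the toy lists.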


The DistilBERT model achieved an accuracy of 0.46 on the test set, with an evaluation loss of 1.94 against an average training loss of 1.71. This shows that a fine-tuned transformer such as DistilBERT can categorize text comments into the predefined labels, and it underscores the potential of transformer architectures for capturing complex textual patterns in natural language processing applications.