# Training a Classifier based on deep learning

Codes in this notebook are executed on a Linux-based virtual machine with the following **computational requirements**:
* GPU:  RTX2080 Super
* vCPU:  8 
* CPU Memory: 48GB 
* GPU Memory: 8GB

## Import necessary dependencies and data

In [1]:
import os
from data_extraction import get_raw_dataset
import tensorflow as tf

2025-03-06 23:40:20.575020: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1741322420.600298  254411 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741322420.607915  254411 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-06 23:40:20.636707: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
# Reload Raw Data
X_train, y_train = get_raw_dataset(mode='train')
X_dev, y_dev = get_raw_dataset(mode='dev')
X_test, _ = get_raw_dataset(mode='test')

## Load the Pre-Trained DistilBERT Classification-based Model

Note: This pre-trained model has a classification head which is suitable for our problem. 

In [3]:
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

# Load a pretrained model
# https://huggingface.co/distilbert/distilbert-base-uncased
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Initialize DistilBERT model for sequence classification
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2) # binary classification

# Convert X_train to a list of strings
X_train_list = X_train.tolist()

encoded_input = tokenizer(
    X_train_list,
    padding=True,
    truncation=True,
    max_length=32,
    return_tensors='tf'
)

# Convert y_train to a tensor
y_train_tensor = tf.convert_to_tensor(y_train.values)

I0000 00:00:1741322430.082501  254411 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 705 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 SUPER, pci bus id: 0000:00:05.0, compute capability: 7.5
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights 

## Fine-tuning the DistilBERT Model

In [4]:
# Construct a Tensorflow-based dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(encoded_input),  # model expects a dict of input_ids/attention_mask
    y_train_tensor
)).batch(4)

In [12]:
# Compile the Model
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5) # Learning rate inspired by: https://arxiv.org/pdf/1810.04805
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy'])

In [6]:
# Verify the shape and content of the tensors
for batch in train_dataset.take(1):
    inputs, labels = batch
    print({k: v.shape for k, v in inputs.items()}, labels.shape)

{'input_ids': TensorShape([4, 32]), 'attention_mask': TensorShape([4, 32])} (4,)


2025-03-07 00:02:40.431325: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [17]:
# Prepare validation data
X_dev_list = X_dev.tolist()
encoded_dev_input = tokenizer(
    X_dev_list,
    padding=True,
    truncation=True,
    max_length=32,
    return_tensors='tf'
)
y_dev_tensor = tf.convert_to_tensor(y_dev.values)

# Check if the tensors are correctly created
print(encoded_dev_input['input_ids'].shape)
print(encoded_dev_input['attention_mask'].shape)
print(y_dev_tensor.shape)

(5000, 32)
(5000, 32)
(5000,)


In [None]:
# Fine-tune the Model with less epochs via another training
history = model.fit(train_dataset, epochs=2, validation_data=(dict(encoded_dev_input), y_dev_tensor))
history

Epoch 1/2


AttributeError: in user code:

    File "/home/student/.local/lib/python3.10/site-packages/tf_keras/src/engine/training.py", line 1398, in train_function  *
        return step_function(self, iterator)
    File "/home/student/.local/lib/python3.10/site-packages/transformers/modeling_tf_utils.py", line 1588, in compute_loss  *
        return super().compute_loss(*args, **kwargs)
    File "/home/student/.local/lib/python3.10/site-packages/tf_keras/src/engine/training.py", line 1206, in compute_loss  **
        return self.compiled_loss(
    File "/home/student/.local/lib/python3.10/site-packages/tf_keras/src/engine/compile_utils.py", line 275, in __call__
        y_t, y_p, sw = match_dtype_and_rank(y_t, y_p, sw)
    File "/home/student/.local/lib/python3.10/site-packages/tf_keras/src/engine/compile_utils.py", line 854, in match_dtype_and_rank
        if (y_t.dtype.is_floating and y_p.dtype.is_floating) or (

    AttributeError: 'NoneType' object has no attribute 'dtype'


In [None]:
# Define the directory path
dir = os.path.dirname(os.curdir)

# Save the pretrained model
model.save_pretrained(os.path.join(dir, 'models', 'model_deep_learning_distilBERT'))