# Training a Classifier Based on Deep Learning

The code in this notebook was executed on a Linux-based virtual machine with the following **computational specifications**:
* GPU: RTX 2080 Super
* vCPU: 8
* CPU Memory: 48 GB
* GPU Memory: 8 GB

## Import necessary dependencies and data

In [1]:
import os
from data_extraction import get_raw_dataset
import tensorflow as tf

2025-03-11 00:42:43.289988: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-11 00:42:43.494499: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-11 00:42:44.400781: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/student/.local/lib/python3.10/site-packages/tensorrt_libs:/usr/local/cuda-12.3/lib64:/usr/lib/x86_64-linux-gnu
2025-03-11 00:42:44.401002: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Cou

In [None]:
# Reload Raw Data
X_train, y_train = get_raw_dataset(mode='train')
X_dev, y_dev = get_raw_dataset(mode='dev')
X_test, _ = get_raw_dataset(mode='test')

## Load the Pre-Trained DistilBERT Classification-based Model

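Note: `TFDistilBertForSequenceClassification` places a classification head on top of the pre-trained DistilBERT encoder, which suits our binary classification problem. Only the encoder weights are pre-trained; the head is newly initialized and is learned during fine-tuning.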

In [None]:
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

# Load the pretrained tokenizer
# https://huggingface.co/distilbert/distilbert-base-uncased
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Initialize DistilBERT model for sequence classification
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2) # binary classification

# Convert X_train to a list of strings
X_train_list = X_train.tolist()

encoded_input = tokenizer(
    X_train_list,
    padding=True,
    truncation=True,
    max_length=32,
    return_tensors='tf'
)

# Convert y_train to a tensor
y_train_tensor = tf.convert_to_tensor(y_train.values)

# Model Description
model.summary()

2025-03-11 00:42:51.802027: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-03-11 00:42:51.804103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-03-11 00:42:51.804527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-03-11 00:42:51.805402: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________
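
The parameter counts in the summary can be checked by hand. The sketch below assumes a hidden size of 768, which is the documented dimension of DistilBERT-base but is not printed in the summary itself; the helper name `dense_params` is illustrative only.

```python
# Sanity check of the parameter counts reported by model.summary().
# Assumption: DistilBERT-base uses a hidden size of 768.
hidden_size = 768
num_labels = 2


def dense_params(n_in, n_out):
    """Number of weights plus biases in a fully connected layer."""
    return n_in * n_out + n_out


pre_classifier = dense_params(hidden_size, hidden_size)
classifier = dense_params(hidden_size, num_labels)
encoder = 66_362_880  # copied from the summary, not derived here

total = encoder + pre_classifier + classifier
print(pre_classifier, classifier, total)  # 590592 1538 66955010
```

The two dense layers account for well under 1% of the total, so almost all of the trainable weights sit in the encoder.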


In [4]:
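# Define the base directory. os.path.dirname(os.curdir) evaluates to '',
# so the path below is relative to the current working directory.
base_dir = os.path.dirname(os.curdir)

# Save the model before fine-tuning
model.save(os.path.join(base_dir, 'models', 'model_deep_learning_distilBERT_pretrained'))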





INFO:tensorflow:Assets written to: models/model_deep_learning_distilBERT_pretrained/assets


## Fine-tuning the DistilBERT Model

In [5]:
# Construct a Tensorflow-based dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(encoded_input),  # model expects a dict of input_ids/attention_mask
    y_train_tensor
)).batch(4)

In [6]:
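# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5) # Learning rate inspired by: https://arxiv.org/pdf/1810.04805
# The model returns two raw logits per example, so use a sparse categorical loss on logits
# (assumes the labels are integer class ids 0/1)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])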
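
Because the model was loaded with `num_labels=2`, its head returns two raw logits per example rather than a single probability. As a rough illustration of the loss commonly paired with such an output (softmax cross-entropy over integer labels), here is a minimal pure-Python sketch. The function name and the logit values are hypothetical and are not part of the notebook.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Cross-entropy for one example, from raw logits and an integer class label."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical logits for a single example (illustrative values only).
logits = [0.0, 0.0]
loss = sparse_softmax_cross_entropy(logits, label=1)
print(round(loss, 4))  # 0.6931, i.e. ln(2) for an undecided two-class model
```

A loss near ln(2) at the start of training is therefore what one would expect from a freshly initialized two-class head.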

In [7]:
# Verify the shape and content of the tensors
for batch in train_dataset.take(1):
    inputs, labels = batch
    print({k: v.shape for k, v in inputs.items()}, labels.shape)

{'input_ids': TensorShape([4, 32]), 'attention_mask': TensorShape([4, 32])} (4,)
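
The leading dimension of 4 comes from `.batch(4)`, which groups consecutive examples into chunks of four. As a small sketch of that grouping in plain Python (the helper `batch` and the ten-element dataset are hypothetical stand-ins, not the notebook's data):

```python
import math


def batch(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


examples = list(range(10))  # hypothetical dataset of 10 examples
batches = list(batch(examples, 4))
print([len(b) for b in batches])  # [4, 4, 2]

# Number of optimizer steps per epoch for this toy dataset.
steps_per_epoch = math.ceil(len(examples) / 4)
```

As with `tf.data.Dataset.batch` under its default `drop_remainder=False`, the final batch is smaller when the dataset size is not a multiple of the batch size.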


In [8]:
# Prepare validation data
X_dev_list = X_dev.tolist()
encoded_dev_input = tokenizer(
    X_dev_list,
    padding=True,
    truncation=True,
    max_length=32,
    return_tensors='tf'
)
y_dev_tensor = tf.convert_to_tensor(y_dev.values)

# Check if the tensors are correctly created
print(encoded_dev_input['input_ids'].shape)
print(encoded_dev_input['attention_mask'].shape)
print(y_dev_tensor.shape)

(5000, 32)
(5000, 32)
(5000,)
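
The second dimension of 32 follows from the tokenizer settings: `truncation=True` with `max_length=32` caps every sequence at 32 tokens, and `padding=True` pads the shorter ones up to the longest sequence in the batch. The toy sketch below mimics that behavior on plain lists. It assumes a padding id of 0 and ignores special-token handling, and the function name and token ids are hypothetical.

```python
def encode_batch(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Hypothetical token ids: one short text (3 tokens) and one long text (40 tokens).
toy_batch = [[101, 7592, 102], list(range(1, 41))]
ids, mask = encode_batch(toy_batch, max_length=32)
print(len(ids), len(ids[0]))       # 2 32
print(sum(mask[0]), sum(mask[1]))  # 3 32
```

A width of exactly 32 in the real output therefore indicates that at least one text in the set reached the 32-token cap.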


In [9]:
# Fine-tune the model for a small number of epochs
history = model.fit(
    train_dataset,
    epochs=5,
    validation_data=(dict(encoded_dev_input), y_dev_tensor),
    validation_batch_size=4  # train_dataset is already batched; this sets only the validation batch size
)
history

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f62ca61ee00>

## Model after Fine-Tuning

In [10]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


In [11]:
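# Define the base directory. os.path.dirname(os.curdir) evaluates to '',
# so the path below is relative to the current working directory.
base_dir = os.path.dirname(os.curdir)

# Save the fine-tuned model
model.save(os.path.join(base_dir, 'models', 'model_deep_learning_distilBERT_tuned'))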

























INFO:tensorflow:Assets written to: models/model_deep_learning_distilBERT_tuned/assets
