#### Objective of the Notebook

The objective of this notebook is to fine-tune DeBERTa to classify disaster tweets. In the previous notebook (the one this notebook was forked from), I established a machine learning pipeline that I can reuse whenever I fine-tune pre-trained models from Hugging Face. Here, we simply swap the model that was fine-tuned in that notebook, replacing DistilBERT with DeBERTa (Decoding-enhanced BERT with Disentangled Attention). DeBERTa is reported to perform significantly better on a wide variety of NLP tasks such as NLI and STS, so we expect this fine-tuned model to perform moderately better than the previous binary classifiers built on BERTweet and DistilBERT.

In [1]:
!pip install evaluate --quiet
!pip install emoji --quiet

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from emoji import demojize
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import os, re, random, datasets, evaluate
from sklearn.model_selection import train_test_split

pd.set_option('display.max_colwidth', None)
from transformers import AutoTokenizer, TFAutoModel, EarlyStoppingCallback


In [4]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

# Preprocess the Data Sets

In the past few notebooks, I established a pre-processing pipeline in which I: (1) identify mislabeled tweets (duplicate tweets whose labels are not identical), (2) concatenate the contents of the location column with those of the text column, and (3) clean the tweets following the pre-processing steps that VinAI used before training the BERTweet model. You can read more about these steps in one of my previous notebooks [here](https://www.kaggle.com/code/l048596/disaster-tweets-bertweet-pytorch-ii-82-62?kernelSessionId=139348416).

In [5]:
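# Texts that occur more than once in the training set
duplicates = train[train.duplicated('text')]
duplicate_texts = duplicates.text.unique()
problematic_duplicates = []

# Keep the positions of duplicated texts whose copies carry conflicting labels
for i, text in enumerate(duplicate_texts):
    duplicate_subset = train[train.text == text]
    if len(duplicate_subset) > 1 and duplicate_subset.target.nunique() == 2:
        problematic_duplicates.append(i)

# Manually assigned labels, one per problematic duplicate (in order)
target_list = [0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0]

# Look up each text through its stored position rather than the loop counter
for problematic_index, target in zip(problematic_duplicates, target_list):
    train.target = np.where(train.text == duplicate_texts[problematic_index],
                            target, train.target)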
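
The idea behind step (1), finding duplicated tweets whose copies disagree on the label, can be sketched without pandas. This is only an illustration: the function name `find_conflicting_duplicates` and the toy rows are hypothetical and are not taken from the competition data.

```python
from collections import defaultdict

def find_conflicting_duplicates(rows):
    """Return texts that occur with more than one distinct label.

    `rows` is an iterable of (text, label) pairs.
    """
    labels_by_text = defaultdict(set)
    for text, label in rows:
        labels_by_text[text].add(label)
    # A text with two or more distinct labels must appear at least twice
    return sorted(t for t, labels in labels_by_text.items() if len(labels) > 1)

# Hypothetical toy rows, not taken from the competition data
toy_rows = [
    ("flood downtown", 1),
    ("flood downtown", 0),
    ("nice weather", 0),
    ("nice weather", 0),
    ("earthquake now", 1),
]
conflicts = find_conflicting_duplicates(toy_rows)
print(conflicts)
```

Only "flood downtown" is flagged, because "nice weather" is duplicated with a consistent label and therefore needs no manual relabeling.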

In [6]:
def clean_tweets(text):
    
    text = text.lower()
    
    text = text.replace("n't", " n't ")
    text = text.replace("n 't", " n't ")
    text = text.replace("ca n't", "can't")
    text = text.replace("ai n't", "ain't")
    
    text = text.replace("'m", " 'm ")
    text = text.replace("'re", " 're ")
    text = text.replace("'s", " 's ")
    text = text.replace("'ll", " 'll ")
    text = text.replace("'d", " 'd ")
    text = text.replace("'ve", " 've ")
    text = text.replace("\n", " ")
    
    text = text.replace(" p . m .", " p.m.")
    text = text.replace(" p . m ", " p.m ")
    text = text.replace(" a . m .", " a.m.")
    text = text.replace(" a . m ", " a.m ")
    
    token_list = text.split(' ')
    
    token_list = [re.sub('#', '', x) for x in token_list]
    token_list = [re.sub(r'@\S+', '@USER', x) for x in token_list]
    token_list = [re.sub(r'http\S+', 'HTTPURL', x) for x in token_list]
    token_list = [re.sub(r'www\S+', 'HTTPURL', x) for x in token_list]
    token_list = [demojize(x) if len(x) == 1 else x for x in token_list]
    
    return(" ".join(token_list))
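
To make the token-level normalization easier to check, here is a reduced sketch that keeps only the hashtag, mention, and URL rules from `clean_tweets`. The helper name `normalize_tokens` and the example tweet are made up for illustration; contraction splitting and emoji handling are deliberately left out.

```python
import re

def normalize_tokens(text):
    # Lowercase, then split on single spaces as clean_tweets does
    tokens = text.lower().split(' ')
    tokens = [re.sub('#', '', t) for t in tokens]                # drop the hashtag symbol
    tokens = [re.sub(r'@\S+', '@USER', t) for t in tokens]       # mask user mentions
    tokens = [re.sub(r'http\S+', 'HTTPURL', t) for t in tokens]  # mask http(s) links
    tokens = [re.sub(r'www\S+', 'HTTPURL', t) for t in tokens]   # mask www links
    return ' '.join(tokens)

# Hypothetical example tweet
example = normalize_tokens("Fire near @CityHall #wildfire http://t.co/abc123")
print(example)  # fire near @USER wildfire HTTPURL
```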

In [7]:
train.text = train.text.apply(clean_tweets)
test_df.text = test_df.text.apply(clean_tweets)

In [8]:
# Undersample each class to the size of the smaller class; no random_state is set, so the sample differs between runs
train = train.groupby('target').sample(np.min(train.target.value_counts().to_list()))
train_df, val_df = np.split(train.sample(frac = 1), [int(0.85 * len(train))])
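
The cell above balances the classes by undersampling and then cuts a shuffled copy of the data at 85%. A minimal pure-Python sketch of the same two steps is shown here, assuming rows are (text, label) pairs. The function name `balance_and_split`, the `seed` parameter, and the toy rows are hypothetical additions; the notebook itself does not fix a seed.

```python
import random

def balance_and_split(rows, train_frac=0.85, seed=0):
    rng = random.Random(seed)  # seeded only so that this sketch is reproducible
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    # Undersample every class to the size of the smallest class
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    # Shuffle, then cut at train_frac of the balanced data
    rng.shuffle(balanced)
    cut = int(train_frac * len(balanced))
    return balanced[:cut], balanced[cut:]

# Hypothetical toy data: 10 negatives and 6 positives
toy_rows = [(f"tweet {i}", 0) for i in range(10)] + [(f"alert {i}", 1) for i in range(6)]
train_rows, val_rows = balance_and_split(toy_rows)
print(len(train_rows), len(val_rows))  # 10 2
```

Note that the split itself is not stratified, so the validation part is not guaranteed to be perfectly balanced even though the combined data is.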

# Load Pre-trained Model for Tokenization

In [9]:
model_name = 'microsoft/deberta-v3-small'

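# use_fast = False loads the slow (SentencePiece-based) tokenizer.
# Padding and truncation are set explicitly when the tokenizer is called in the cells below.
tokenizer = AutoTokenizer.from_pretrained(model_name, 
                                          normalization = True,
                                          use_fast = False,
                                          add_special_tokens = True,
                                          pad_to_max_length = True, 
                                          return_attention_mask = True)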


In [10]:
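# Length of the longest tweet, measured in characters (len of a string); used as the padded sequence length
max_len = np.max([len(x) for x in train.text])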
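
Since `len(x)` counts characters rather than tokens, the padded length chosen here is usually more generous than the tokenized tweets need. The sketch below compares the two measures. The helper `length_stats` is hypothetical, and `str.split` is only a stand-in for a real tokenizer so that the example runs without downloading a model.

```python
def length_stats(texts, tokenize):
    """Return (longest text in characters, longest text in tokens)."""
    max_chars = max(len(t) for t in texts)
    max_tokens = max(len(tokenize(t)) for t in texts)
    return max_chars, max_tokens

# Hypothetical toy texts; str.split stands in for a subword tokenizer
toy_texts = ["forest fire near la ronge", "heard about the earthquake ?"]
max_chars, max_tokens = length_stats(toy_texts, str.split)
print(max_chars, max_tokens)  # 28 5
```

With the tokenizer loaded in this notebook, the same helper could presumably be called as `length_stats(train.text.to_list(), lambda t: tokenizer(t)["input_ids"])` to see how many token positions are actually needed.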

In [11]:
train_tokens = tokenizer(train_df.text.to_list(),
                         padding = "max_length",
                         max_length = max_len,
                         truncation = True).data

val_tokens = tokenizer(val_df.text.to_list(),
                       padding = "max_length",
                       max_length = max_len,
                       truncation = True).data

In [12]:
def extract_features(tokens, labels, batch_size = 16): # Note that a batch size of 64 will result in a GPU OOM error
    features = {x: tokens[x] for x in tokenizer.model_input_names}
    features = tf.data.Dataset.from_tensor_slices((features, labels))
    return features.shuffle(len(labels)).batch(batch_size).prefetch(tf.data.AUTOTUNE)

train_features = extract_features(train_tokens, train_df.target)
val_features = extract_features(val_tokens, val_df.target)

In [17]:
bert_model = TFAutoModel.from_pretrained(model_name)
text = ["Replace me by any text you'd like.", "My name is Messi Lee"]

encoded_input = tokenizer(text, 
                          padding = "max_length", 
                          max_length = max_len,
                          truncation = True,
                          return_tensors='tf')

# encoded_input
output = bert_model([encoded_input['input_ids'], encoded_input['attention_mask']])
output

All model checkpoint layers were used when initializing TFDebertaV2Model.

All the layers of TFDebertaV2Model were initialized from the model checkpoint at microsoft/deberta-v3-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDebertaV2Model for predictions without further training.


TFBaseModelOutput(last_hidden_state=<tf.Tensor: shape=(2, 159, 768), dtype=float32, numpy=
array([[[-8.21725652e-03,  9.34619643e-03, -3.16984877e-02, ...,
         -7.06309378e-02, -3.45765762e-02, -8.09677467e-02],
        [-1.04211187e+00, -2.73463845e-01,  1.55843809e-01, ...,
          4.24775302e-01,  5.86446047e-01,  4.63497162e-01],
        [ 4.20123369e-01, -1.39614537e-01, -2.98552603e-01, ...,
          2.81707257e-01, -2.22064167e-01, -1.91090941e-01],
        ...,
        [-5.77143192e-01, -2.09147632e-02,  1.35440663e-01, ...,
         -2.57291123e-02, -3.41671556e-01, -1.85759217e-01],
        [-5.77143192e-01, -2.09147632e-02,  1.35440663e-01, ...,
         -2.57291123e-02, -3.41671556e-01, -1.85759217e-01],
        [-5.77143192e-01, -2.09147632e-02,  1.35440663e-01, ...,
         -2.57291123e-02, -3.41671556e-01, -1.85759217e-01]],

       [[ 1.24452747e-02, -3.14728394e-02,  1.28026213e-03, ...,
         -6.23275191e-02, -5.48804179e-02, -5.14611900e-02],
        [-5.

The last hidden state of the DeBERTa model has shape (2, 159, 768). The 2 is the number of input texts; it corresponds to the None in the (None, 159, 768) output shape of the bert_model layer shown in the model summary. The 159 is the sequence length: every text is padded or truncated to *max_len*, which is 159 here. The 768 is the dimension of the DeBERTa embeddings. I wanted to keep the changes from the previous notebook minimal, but here I set the padded length to the maximum length of the texts in the training data set (159, measured in characters), which shrinks the second dimension of the input from 512 to 159.

In [25]:
bert_model = TFAutoModel.from_pretrained(model_name)

input_ids = tf.keras.Input(shape=(max_len,), dtype = 'int32', name = 'input_ids')
attention_masks = tf.keras.Input(shape=(max_len,), dtype ='int32', name = 'attention_mask')

output = bert_model([input_ids, attention_masks])[0]
output = tf.keras.layers.Dropout(0.7)(output)
output = tf.keras.layers.Flatten()(output)
output = tf.keras.layers.Dense(1, activation = 'sigmoid')(output)

model = tf.keras.models.Model(inputs = [input_ids, attention_masks], outputs = output)

model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-5), 
              loss = tf.keras.losses.BinaryCrossentropy(), 
              metrics = ['accuracy'])
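
As a rough check on the size of the classification head defined above, the shapes can be worked out by hand. The numbers below are derived from the (None, 159, 768) output shape and the layer definitions, not read from the truncated model summary.

```python
seq_len = 159      # padded sequence length (max_len)
hidden_size = 768  # DeBERTa embedding dimension

# Dropout keeps the shape; Flatten turns (159, 768) into one vector per tweet
flattened_features = seq_len * hidden_size

# Dense(1) has one weight per flattened feature plus a single bias
dense_params = flattened_features * 1 + 1

print(flattened_features)  # 122112
print(dense_params)        # 122113
```

Because the head is attached to the flattened sequence, its size is tied to *max_len*: changing the padded length changes the number of weights in the final layer.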


In [26]:
model.summary()

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 159)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 159)]        0           []                               
                                                                                                  
 tf_deberta_v2_model_7 (TFDeber  TFBaseModelOutput(l  141304320  ['input_ids[0][0]',              
 taV2Model)                     ast_hidden_state=(N               'attention_mask[0][0]']         
                                one, 159, 768),                                                   
                                 hidden_states=None                                         

In [27]:
early_stopping = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', 
                                                  patience = 2)

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath = 'model/best_performed_model',
    save_weights_only = True,
    save_best_only = True,
    monitor = 'val_loss',
    verbose = 1
)
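
The interaction between `patience = 2` and `save_best_only = True` can be illustrated with a simplified simulation. This is a sketch of the general logic, not the Keras implementation. The first three validation losses are the ones reported in this notebook's training log; the last two are hypothetical, since the log only states that they did not improve.

```python
def simulate_early_stopping(val_losses, patience=2):
    """Return (epoch with the best loss, epoch at which training stops)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: a checkpoint would be saved and the counter reset
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Epochs 1 to 3 come from the training log; epochs 4 and 5 are hypothetical values
losses = [0.46172, 0.45008, 0.44797, 0.45500, 0.46000]
best_epoch, stopped_epoch = simulate_early_stopping(losses)
print(best_epoch, stopped_epoch)  # 3 5
```

Training stops after two consecutive epochs without improvement, while the checkpoint on disk still holds the weights from the best epoch.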

In [28]:
model.fit(train_features, 
          validation_data = val_features,
          epochs = 30, 
          callbacks = [early_stopping, model_checkpoint_callback])

Epoch 1/30
Epoch 1: val_loss improved from inf to 0.46172, saving model to model/best_performed_model
Epoch 2/30
Epoch 2: val_loss improved from 0.46172 to 0.45008, saving model to model/best_performed_model
Epoch 3/30
Epoch 3: val_loss improved from 0.45008 to 0.44797, saving model to model/best_performed_model
Epoch 4/30
Epoch 4: val_loss did not improve from 0.44797
Epoch 5/30
Epoch 5: val_loss did not improve from 0.44797


<keras.callbacks.History at 0x79987ec7f370>

# Prepare for Submission

Here, we simply load the weights that were saved to *model/best_performed_model* during training. Because the checkpoint callback monitors `val_loss`, these are the weights from the epoch with the lowest validation loss (0.44797, reached in epoch 3). If the correct weights are loaded, evaluating on the validation set should reproduce that loss, and the output below confirms it: a validation loss of 0.44797 with a validation accuracy of 82.09%.

In [29]:
model.load_weights('model/best_performed_model')
model.evaluate(val_features)



[0.44796693325042725, 0.8208802342414856]

In [30]:
test_token = tokenizer(test_df.text.tolist(), 
                       padding = "max_length", 
                       max_length = max_len,
                       truncation = True,
                       return_tensors='tf').data

In [31]:
predictions = model.predict(test_token)
pred = [(x > 0.5).astype(int)[0] for x in predictions]
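
The thresholding step above turns each sigmoid output into a hard label. A minimal stand-alone sketch follows, with a hypothetical helper `to_labels` and made-up probabilities; note that the comparison is strict, so a probability of exactly 0.5 maps to class 0.

```python
def to_labels(probabilities, threshold=0.5):
    # Strictly greater than the threshold counts as the positive class
    return [int(p > threshold) for p in probabilities]

# Hypothetical sigmoid outputs
labels = to_labels([0.91, 0.12, 0.50, 0.73])
print(labels)  # [1, 0, 0, 1]
```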



In [32]:
submission = pd.DataFrame(list(zip(test_df.id, pred)), columns = ["id", "target"])
submission.to_csv("submission.csv", index = False)

The model achieved a test accuracy of 81.64%. Although DeBERTa is reported to perform significantly better than other pre-trained models on a wide variety of tasks, simply swapping DistilBERT and BERTweet for DeBERTa did not significantly improve the performance of our classifier. Here are a few possible reasons why:

- The data pre-processing pipeline used here was adapted from the BERTweet model. There may be a pre-processing pipeline that is better suited to fine-tuning DeBERTa. 
- Given the unique nature of the text being classified (tweets), BERTweet, which was pre-trained on Twitter data, may be a more appropriate base for training a classifier. 

Let me know in the discussion if you can think of other reasons or if I have made any mistakes in the notebook. Thanks. 