# Objective of the Notebook

The objective of this notebook is twofold: 1) to build a classifier that reliably predicts whether or not a tweet is about an actual disaster (hopefully better than the classifiers that we've built previously); and 2) to establish a set of steps for fine-tuning a Hugging Face model for text classification. The steps that we take to fine-tune the model are therefore identical to the ones in [one of my previous notebooks that I use to classify contradictions](https://www.kaggle.com/code/l048596/contradiction-classifier-w-huggingface-87-41). I am doing two things differently: 

1. We are not "cleaning" the tweets **at all**. In the next iteration, I intend to perform the exact same pre-processing steps that were used to train BERTweet.
2. I am introducing an EarlyStoppingCallback so that we can prevent the model from over-fitting on the training data set. 
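
To give an idea of what the BERTweet-style pre-processing planned for the next iteration could look like, here is a minimal sketch. It assumes, as I understand the BERTweet documentation, that user mentions are replaced with `@USER` and links with `HTTPURL`; the emoji-to-text conversion is left out. The function name `normalize_tweet` and the regular expressions are my own simplification, not the official normalizer.

```python
import re

# Simplified patterns (assumptions for illustration, not BERTweet's own code)
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def normalize_tweet(text):
    """Replace links and user mentions with placeholder tokens and collapse whitespace."""
    text = URL_PATTERN.sub("HTTPURL", text)      # links first, so their contents are not touched again
    text = MENTION_PATTERN.sub("@USER", text)    # then user mentions
    return " ".join(text.split())                # newlines and repeated spaces become single spaces

example = "@bob Forest fire near http://t.co/abc\nstay safe"
print(normalize_tweet(example))
```

This notebook does not call the function; the tweets are passed to the tokenizer as they are.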

In [1]:
!pip install transformers --quiet
!pip install evaluate --quiet

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import os, re, random, datasets, evaluate

pd.set_option('display.max_colwidth', None)
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer, EarlyStoppingCallback


In [8]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

# Preprocess the Training and Test Data Sets

Again, we are going to start from the pre-processing steps that we used in previous notebooks covering this data set: we identify duplicate observations with different labels, manually determine what the correct labels are, and assign the correct labels to them. In previous notebooks we then also accounted for the fact that tweets contain urls, mentions, hashtags, and newlines; as noted above, we skip that cleaning step in this notebook.

In [9]:
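duplicates = train[train.duplicated('text')]
unique_duplicate_texts = duplicates.text.unique()
problematic_duplicates = []

# Record the position of every duplicated text that appears with both labels
for i, text in enumerate(unique_duplicate_texts):
    duplicate_subset = train[train.text == text]
    if len(duplicate_subset) > 1 and duplicate_subset.target.nunique() == 2:
        problematic_duplicates.append(i)

# Manually determined labels, one per problematic duplicate, in the order found above
target_list = [0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0]

# Look up each problematic text by its stored position (not by the loop counter) before relabelling
for problematic_index, target in zip(problematic_duplicates, target_list):
    train.target = np.where(train.text == unique_duplicate_texts[problematic_index],
                            target, train.target)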
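
To make the idea of a "problematic duplicate" concrete, here is a small plain-Python sketch on made-up rows (the texts and labels are hypothetical, not taken from the competition data). It finds texts that appear with more than one label and then applies manually chosen corrections that are keyed by the text itself.

```python
from collections import defaultdict

# Toy rows (hypothetical data, for illustration only)
rows = [
    {"text": "fire downtown", "target": 1},
    {"text": "fire downtown", "target": 0},
    {"text": "nice day", "target": 0},
    {"text": "nice day", "target": 0},
    {"text": "hello", "target": 1},
]

# Collect the set of labels seen for each distinct text
labels_by_text = defaultdict(set)
for row in rows:
    labels_by_text[row["text"]].add(row["target"])

# A text is problematic when it appears with more than one label
conflicting_texts = [text for text, labels in labels_by_text.items() if len(labels) > 1]

# Manually chosen corrections, keyed by text rather than by position
corrections = {"fire downtown": 1}
for row in rows:
    row["target"] = corrections.get(row["text"], row["target"])

print(conflicting_texts)                 # ['fire downtown']
print([row["target"] for row in rows])   # [1, 1, 0, 0, 1]
```

Keying the corrections by text means that a label cannot be applied to the wrong tweet if the order of the duplicates changes.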

## Split Data Sets into Train, Validation, and Test Data Sets

In [10]:
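# Balance the classes (3,200 tweets per label), then shuffle with a fixed seed and split 80/20
train = train.groupby('target').sample(3200, random_state = 1048596)
shuffled = train.sample(frac = 1, random_state = 1048596)
train_df, val_df = np.split(shuffled, [int(0.8 * len(shuffled))])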
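
The sampling and splitting logic above can be sanity-checked with a plain-Python sketch. The helper `balanced_split` and the toy rows are hypothetical; the sketch only mirrors the idea (a fixed number of rows per label, shuffle, 80/20 cut), not the exact rows that pandas would pick.

```python
import random
from collections import Counter

def balanced_split(rows, per_class, train_frac, seed):
    """Sample `per_class` rows per label, shuffle, and cut into train/validation parts."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row["target"], []).append(row)
    sampled = []
    for label in sorted(by_label):
        sampled.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(sampled)
    cut = int(train_frac * len(sampled))
    return sampled[:cut], sampled[cut:]

# Toy data: 4,000 rows per label (made-up ids and labels)
rows = [{"id": i, "target": i % 2} for i in range(8000)]
train_rows, val_rows = balanced_split(rows, per_class=3200, train_frac=0.8, seed=0)

print(len(train_rows), len(val_rows))                    # 5120 1280
print(Counter(r["target"] for r in train_rows + val_rows))
```

Note that only the combined sample is guaranteed to be balanced; after shuffling, the class shares inside the training and validation parts can differ slightly.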

We are going to retain the columns that we are going to need for training and evaluation: *id* for test data set evaluation, *text* and *target* for both. Moving forward, I am going to stick to a pre-processing pipeline where we store the training, validation, and test data sets (if applicable) as Datasets inside one **DatasetDict**. 

In [11]:
train_df = train_df[['id', 'text', 'target']]
val_df = val_df[['id', 'text', 'target']]
test_df = test_df[['id', 'text']]

In [12]:
train_dict = datasets.Dataset.from_dict(train_df.to_dict(orient="list"))
val_dict = datasets.Dataset.from_dict(val_df.to_dict(orient="list"))
test_dict = datasets.Dataset.from_dict(test_df.to_dict(orient="list"))

In [13]:
tweets_ds = datasets.DatasetDict({"train": train_dict, "val": val_dict, "test": test_dict})

In [14]:
tweets_ds

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'target'],
        num_rows: 5120
    })
    val: Dataset({
        features: ['id', 'text', 'target'],
        num_rows: 1280
    })
    test: Dataset({
        features: ['id', 'text'],
        num_rows: 3263
    })
})

# Finetune Hugging Face Model

We are going to use the **BERTweet-base** model on Hugging Face. The justification for this is the following: 
1. BERTweet has been trained on 850 million English Tweets. As we are trying to classify tweets, this model should capture subtleties that are specific to Tweets;
2. BERTweet has been trained using the RoBERTa pre-training procedure. RoBERTa is generally a good model to fine-tune for classification purposes.

In [15]:
model_name = 'vinai/bertweet-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = 2)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Some weights of the model checkpoint at vinai/bertweet-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.bias', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at vinai/bertweet-base and are newly initialized: 

In [16]:
def tokenize_function(batch):
    # Truncate to the model's maximum length; padding is added per batch by the data collator
    return tokenizer(batch['text'], truncation = True)

tokenized_data = tweets_ds.map(tokenize_function, batched = True)

In [17]:
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'target', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5120
    })
    val: Dataset({
        features: ['id', 'text', 'target', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1280
    })
    test: Dataset({
        features: ['id', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3263
    })
})

In [18]:
tokenized_data['train'] = tokenized_data['train'].rename_column('target', 'labels')
tokenized_data['val'] = tokenized_data['val'].rename_column('target', 'labels')
# Note: with_format returns a new DatasetDict; the result is only displayed here, not assigned
tokenized_data.with_format('pt')

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5120
    })
    val: Dataset({
        features: ['id', 'text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1280
    })
    test: Dataset({
        features: ['id', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3263
    })
})

I came to learn that the column containing the labels (the values that we want to predict) has to be named "labels". If it is not, **trainer.train()** will raise an error. This is why we renamed the "target" columns inside the train and validation datasets to "labels" above; the test dataset inside the DatasetDict object does not contain labels, so it needs no renaming. 

In [22]:
training_args = TrainingArguments(model_name,  
                                  evaluation_strategy = 'epoch',
                                  num_train_epochs = 5,
                                  learning_rate = 5e-5,
                                  weight_decay = 0.005,
                                  per_device_train_batch_size = 16,
                                  per_device_eval_batch_size = 16,
                                  report_to = 'none',
                                  load_best_model_at_end = True,
                                  save_strategy = 'epoch')

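# Load the metric once instead of on every evaluation
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis = -1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Stop when the monitored metric (validation loss by default) has not improved
# by more than 0.01 for 2 consecutive evaluations
early_stop = EarlyStoppingCallback(early_stopping_patience = 2, early_stopping_threshold = 0.01)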

trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_data["train"],
    eval_dataset = tokenized_data["val"],
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics,
    callbacks = [early_stop]
)
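
For reference, here is what `compute_metrics` boils down to, written as a plain-Python sketch: take the index of the largest logit in each row as the predicted class, then report the share of predictions that match the labels. The helper name `accuracy_from_logits` and the toy logits are made up for illustration.

```python
def accuracy_from_logits(logits, labels):
    """Argmax each row of logits, then compare the predicted classes with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy logits for four tweets and two classes (hypothetical values)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.5, 1.0]]
toy_labels = [0, 1, 0, 0]

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```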

In [23]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.429196 | 0.828906 |
| 2 | 0.464900 | 0.43289 | 0.822656 |
| 3 | 0.464900 | 0.56237 | 0.80625 |


TrainOutput(global_step=960, training_loss=0.3944559574127197, metrics={'train_runtime': 2984.3629, 'train_samples_per_second': 8.578, 'train_steps_per_second': 0.536, 'total_flos': 361950367750080.0, 'train_loss': 0.3944559574127197, 'epoch': 3.0})
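
Training stopped after 3 of the 5 requested epochs. The sketch below is my simplified re-implementation of the patience logic, not the library's code. It assumes that the monitored metric is the validation loss, and it replays the three validation losses logged above with the patience of 2 and threshold of 0.01 that were passed to EarlyStoppingCallback. The function name `stopping_epoch` is hypothetical.

```python
def stopping_epoch(val_losses, patience, threshold):
    """Return the epoch at which training would stop, or None if it runs to the end."""
    best = None
    evaluations_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        # An evaluation counts as an improvement only if it beats the best loss by more than the threshold
        improved = best is None or (loss < best and abs(loss - best) > threshold)
        if improved:
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
        # Track the best loss seen so far
        if best is None or loss < best:
            best = loss
        if evaluations_without_improvement >= patience:
            return epoch
    return None

logged_val_losses = [0.429196, 0.43289, 0.56237]
print(stopping_epoch(logged_val_losses, patience=2, threshold=0.01))  # 3
```

Epochs 2 and 3 both fail to beat the epoch-1 loss, so the patience of 2 is used up at epoch 3. Because `load_best_model_at_end` is set, the epoch-1 checkpoint is the one that should be restored for prediction.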

# Prepare for Submission

In [24]:
test_predictions = trainer.predict(tokenized_data["test"])
preds = np.argmax(test_predictions.predictions, axis=1)

In [25]:
submission = pd.DataFrame(list(zip(test_df.id, preds)), 
                          columns = ["id", "target"])
submission.to_csv("submission.csv", index=False)