In [None]:
!pip install tensorflow --quiet
!pip install tensorflow-hub --quiet
!pip install tensorflow-text --quiet
!pip install transformers --quiet

In [None]:
import os, re, random
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')
pd.set_option('display.max_colwidth', None)
os.environ["TFHUB_MODEL_LOAD_FORMAT"]="UNCOMPRESSED"

## Import Dataset

In [None]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

In [None]:
train.columns

## Inspect Dataset for Mislabeled Tweets

In the discussion for this [competition](https://www.kaggle.com/competitions/nlp-getting-started/discussion/157982), I read that there are a lot of mislabeled tweets. Before we train models, let's check for mislabeled tweets and assign appropriate labels to them. As we cannot manually inspect all tweets, we are going to look for duplicate tweets and check that the duplicates have all been assigned the same labels. 

In [None]:
duplicates = train[train.duplicated('text')]
duplicates.text.nunique()

There are **69 distinct tweets** that appear more than once in the training dataset. We are going to iterate through them and check whether the copies of each tweet carry conflicting labels, which would indicate that at least one copy has been mislabeled. We store the positions of these "problematic duplicates" (their indices in `duplicates.text.unique()`) in a list so that we can inspect them and re-assign the correct labels.

In [None]:
problematic_duplicates = []

for i in range(duplicates.text.nunique()):
    duplicate_subset = train[train.text == duplicates.text.unique()[i]]
    if len(duplicate_subset) > 1 and duplicate_subset.target.nunique() == 2:
        problematic_duplicates.append(i)
        
print(problematic_duplicates)
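
The same check can also be written without an explicit loop. The sketch below is a minimal alternative and not the approach used in this notebook: it groups a small made-up data frame by `text` and keeps the texts whose `target` column takes more than one distinct value. The helper name `conflicting_texts` and the toy rows are assumptions for illustration only.

```python
import pandas as pd

def conflicting_texts(df):
    """Return the texts that appear with more than one distinct target label."""
    labels_per_text = df.groupby('text')['target'].nunique()
    return labels_per_text[labels_per_text > 1].index.tolist()

# Toy data (made up): 'a' is duplicated with conflicting labels,
# 'b' is duplicated with matching labels, and 'c' is unique.
toy = pd.DataFrame({
    'text':   ['a', 'a', 'b', 'b', 'c'],
    'target': [0,   1,   1,   1,   0],
})

print(conflicting_texts(toy))
```

Running a helper like this again after the labels have been corrected is one way to confirm that no conflicting duplicates remain.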

In [None]:
train[train.text == duplicates.text.unique()[58]]

Above is the duplicate at index 58 of `duplicates.text.unique()`. These tweets have conflicting labels even though their texts are identical. The tweet is not about an actual disaster, so we assign the label 0 (not a disaster) to both copies:

In [None]:
train.target = np.where(train.text == duplicates.text.unique()[58], 0, train.target)
train[train.text == duplicates.text.unique()[58]]

Let's repeat this step for all problematic duplicates. After inspecting each of them, we store the correct labels in a list, in the same order as `problematic_duplicates`, and assign them one after the other.

In [None]:
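target_list = [0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0]

# problematic_duplicates holds positions in duplicates.text.unique(), so look up
# each problematic text by that position rather than by the loop counter.
unique_duplicates = duplicates.text.unique()

for duplicate_index, correct_label in zip(problematic_duplicates, target_list):
    train.target = np.where(train.text == unique_duplicates[duplicate_index],
                            correct_label, train.target)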

## Pre-process the Tweets

Before we use the tweets to train the model, we are going to perform some basic pre-processing. To identify the appropriate steps, let's look at some of the tweets.

In [None]:
sample_train = train.sample(frac = 1, random_state = 1048596).head(5)
sample_train

In the randomly selected tweets above, we see that the tweets contain links (http://...), hashtags (#..), and mentions (@..). We are going to remove links entirely but strip only the `#` and `@` symbols, keeping the words themselves in case they are useful for classifying the tweets. We define a function that performs these steps on the data frame, in addition to lower-casing, removing numbers, and removing new-line characters (\n), and returns the cleaned data frame.

In [None]:
def clean_text(dataframe):
    # Work on a copy so the caller's data frame is not modified in place.
    dataframe = dataframe.copy()
    dataframe['text'] = (
        dataframe['text']
        .str.lower()                              # lower-case
        .str.replace(r'http\S+', '', regex=True)  # remove links
        .str.replace(r'[#@]', '', regex=True)     # drop '#' and '@' but keep the words
        .str.replace(r'\n', '', regex=True)       # remove new-line characters
        .str.replace(r'\d+', '', regex=True)      # remove numbers
    )
    return dataframe
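
To see what these steps do to a single tweet, the sketch below applies the same regular expressions to one made-up string. The function name `clean_tweet` and the example tweet are assumptions for illustration; only the standard `re` module is used.

```python
import re

def clean_tweet(tweet):
    """Apply the same cleaning steps as clean_text to one string."""
    tweet = tweet.lower()                  # lower-case
    tweet = re.sub(r'http\S+', '', tweet)  # remove links
    tweet = re.sub(r'#', '', tweet)        # drop the hashtag symbol
    tweet = re.sub(r'@', '', tweet)        # drop the mention symbol
    tweet = re.sub(r'\n', '', tweet)       # remove new-line characters
    tweet = re.sub(r'\d+', '', tweet)      # remove numbers
    return tweet

# Made-up tweet containing a line break, a link, a hashtag, a mention, and a number.
example = "Evacuation ordered!\nSee http://t.co/xyz #Wildfire @CalFire 2015"
cleaned = clean_tweet(example)
print(repr(cleaned))
```

Note that the new-line character is removed without inserting a space, so the words on either side of it are joined ("ordered!see"), and that removing the link leaves two consecutive spaces behind. Replacing `\n` with a space and collapsing repeated whitespace would be a possible refinement.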

In [None]:
sample_train = clean_text(sample_train)

In [None]:
sample_train

In [None]:
clean_train = clean_text(train)

## Balance the Training Dataset

In [None]:
clean_train.target.value_counts()

In [None]:
clean_train_balanced = clean_train.groupby('target').sample(3000, random_state = 1048596)

In [None]:
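shuffled = clean_train_balanced.sample(frac = 1, random_state = 1048596)  # fixed seed for a reproducible 80/20 split
train_df, val_df = shuffled.iloc[:int(0.8 * len(shuffled))], shuffled.iloc[int(0.8 * len(shuffled)):]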

In [None]:
print('Number of observations inside the training dataset: {}'.format(len(train_df)))
print('Number of observations inside the validation dataset: {}'.format(len(val_df)))

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_df.text, train_df.target)).shuffle(1000).batch(32)
val_dataset = tf.data.Dataset.from_tensor_slices((val_df.text, val_df.target)).shuffle(1000).batch(32)

## Define and Train Model - First Model

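In our first model, we are going to build a classifier on top of a pre-trained BERT encoder. A tutorial for fine-tuning BERT for text classification can be found in [this TensorFlow tutorial page](https://www.tensorflow.org/text/tutorials/classify_text_with_bert). Similar to the model introduced in the tutorial, we are going to use a preprocessing layer, a 'bert-base-uncased' model, a Dropout layer, and two Dense layers. Note that `hub.KerasLayer` defaults to `trainable = False`, so as written in this notebook the BERT weights stay frozen and only the layers on top are trained.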

In [None]:
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'

In [None]:
bert_preprocess = hub.KerasLayer(tfhub_handle_preprocess)
bert_encoder = hub.KerasLayer(tfhub_handle_encoder)

In [None]:
text_input = tf.keras.layers.Input(shape = (), dtype = tf.string)
encoder_input = bert_preprocess(text_input)
encoder_output = bert_encoder(encoder_input)
    
l = tf.keras.layers.Dropout(0.3)(encoder_output['pooled_output'])
l = tf.keras.layers.Dense(16, activation = 'relu')(l)
l = tf.keras.layers.Dense(1, activation = 'sigmoid')(l)
    
model = tf.keras.Model(inputs=[text_input], outputs = [l])
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),
              loss = tf.keras.losses.BinaryCrossentropy(),
              metrics = ['accuracy'])

We are going to introduce a callback so that the model stops training when validation loss has not improved for two epochs in a row. That way, when the model starts to over-fit and fails to generalize to the validation dataset, training stops early. A second callback saves the weights of the model whenever validation loss reaches a new best value.

In [None]:
early_stopping = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', 
                                                  patience = 2)

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath = 'model/best_performed_model.ckpt',
    save_weights_only = True,
    save_best_only = True,
    monitor = 'val_loss',
    verbose = 1
)
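
The effect of `patience = 2` can be illustrated with a small simulation. The sketch below is a simplified illustration of the patience rule and not Keras' actual implementation; the function name `stopping_epoch` and the validation losses are made up for the example. An epoch counts as an improvement only if its loss is lower than the best loss seen so far, not merely lower than the previous epoch's loss.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None.

    Simplified patience rule: stop once `patience` consecutive epochs
    have failed to beat the best validation loss seen so far.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 1,
# and the next two epochs fail to beat it.
losses = [0.60, 0.55, 0.56, 0.57, 0.50]
print(stopping_epoch(losses))
```

In this made-up run, training stops at epoch 3, so the lower loss at epoch 4 is never reached. The weights at the stopping point are also not the best ones seen; `EarlyStopping` accepts a `restore_best_weights = True` argument for that case.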

In [None]:
history = model.fit(train_dataset,
                    validation_data = val_dataset,
                    epochs = 30, 
                    callbacks = [early_stopping, model_checkpoint_callback])

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training', 'validation'])
plt.show()

The plot suggested that the model stopped training before over-fitting on the training data. This first model (BERT model) returned a validation accuracy of .70. 

## Define and Train Model - Second Model

In the second model, we are going to use a preprocessing layer (the same one as in the first model), a 'Small BERT' model, a Dropout layer, and two Dense layers. Small BERT has fewer layers and a smaller hidden size than the original BERT, which makes it cheaper to train and run on downstream tasks such as text classification.

In [None]:
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1'

In [None]:
bert_preprocess = hub.KerasLayer(tfhub_handle_preprocess)
bert_encoder = hub.KerasLayer(tfhub_handle_encoder)

In [None]:
text_input = tf.keras.layers.Input(shape = (), dtype = tf.string)
encoder_input = bert_preprocess(text_input)
encoder_output = bert_encoder(encoder_input)

l = tf.keras.layers.Dense(32, activation = 'relu')(encoder_output['pooled_output'])
l = tf.keras.layers.Dropout(0.3)(l)
l = tf.keras.layers.Dense(1, activation = 'sigmoid')(l)

model = tf.keras.Model(inputs=[text_input], outputs = [l])

In [None]:
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),
              loss = tf.keras.losses.BinaryCrossentropy(),
              metrics = ['accuracy'])

In [None]:
history = model.fit(train_dataset,
                    validation_data = val_dataset,
                    epochs = 30, 
                    callbacks = [early_stopping, model_checkpoint_callback])

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training', 'validation'])
plt.show()

The second model returned a validation accuracy of .78, a considerable improvement over the first model. Early stopping ended training even though there was no indication of over-fitting; both training and validation accuracy were plateauing by the end of the fifth epoch.

## Define and Train Model - Second Model (2)

In [None]:
text_input = tf.keras.layers.Input(shape = (), dtype = tf.string)
encoder_input = bert_preprocess(text_input)
encoder_output = bert_encoder(encoder_input)

l = tf.keras.layers.Dense(32, activation = 'relu')(encoder_output['pooled_output'])
l = tf.keras.layers.Dropout(0.3)(l)
l = tf.keras.layers.Dense(16, activation = 'relu')(l)
l = tf.keras.layers.Dropout(0.3)(l)
l = tf.keras.layers.Dense(1, activation = 'sigmoid')(l)

model = tf.keras.Model(inputs=[text_input], outputs = [l])

In [None]:
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),
              loss = tf.keras.losses.BinaryCrossentropy(),
              metrics = ['accuracy'])

In [None]:
history = model.fit(train_dataset,
                    validation_data = val_dataset,
                    epochs = 30, 
                    callbacks = [early_stopping, model_checkpoint_callback])

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training', 'validation'])
plt.show()

The updated model returned a validation accuracy of .76. This was a regression in model performance from the previous one, and validation accuracy started to decrease from the second epoch while training accuracy continued to increase. This indicates that the model started over-fitting from the second epoch. Adding complexity was not the right way to go.

## Define and Train Model - Third Model

As the second model, which uses the 'Small BERT' model, performed considerably better than the first model, which uses the 'bert-base-uncased' model, we are going to continue using 'Small BERT' for the models that follow. This time, we are going to feed the sequence output of the BERT layer (one embedding per token) through a Dense layer and into a Bidirectional LSTM, instead of feeding the pooled BERT embedding directly to a Dense layer as we did in the previous models.

In [None]:
text_input = tf.keras.layers.Input(shape = (), dtype = tf.string)
encoder_input = bert_preprocess(text_input)
encoder_output = bert_encoder(encoder_input)

l = tf.keras.layers.Dense(200, activation = 'relu')(encoder_output['sequence_output'])
l = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(200, return_sequences = False, dropout = 0.3))(l)
l = tf.keras.layers.Dense(50, activation = 'relu')(l)
l = tf.keras.layers.Dropout(0.3)(l)
l = tf.keras.layers.Dense(1, activation = 'sigmoid')(l)

model = tf.keras.Model(inputs=[text_input], outputs = [l])

In [None]:
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),
              loss = tf.keras.losses.BinaryCrossentropy(),
              metrics = ['accuracy'])

In [None]:
history = model.fit(train_dataset,
                    validation_data = val_dataset,
                    epochs = 30, 
                    callbacks = [early_stopping, model_checkpoint_callback])

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training', 'validation'])
plt.show()

The third model returned a validation accuracy of .78. This was a considerable improvement over the first model but not over the second model. We are going to resume working on this task in a new notebook.

## Prepare for Submission

In [None]:
cleaned_test = clean_text(test)

In [None]:
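predictions = model.predict(cleaned_test.text)
# The model has a single sigmoid unit, so predictions has shape (n, 1); flatten it to (n,).
pred_list = predictions.ravel()

In [None]:
# Label a tweet as a disaster (1) when the predicted probability exceeds 0.5.
final_predictions = (pred_list > 0.5).astype(int)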

In [None]:
predictions_df = pd.DataFrame(list(zip(test.id, final_predictions)),
                              columns = ['id', 'target'])

In [None]:
predictions_df.to_csv('predictions.csv', index = False)