# Tweets BERT

Adapted from [this Kaggle notebook](https://www.kaggle.com/brendanartley/roberta-w-tensorflow-explained-0-844/notebook). Compared to the Kaggle author's version, we have:
1. updated the head to use a configurable MLP
1. added L2 regularization
1. made the pipeline configurable
1. performed a small hyperparameter search (about a dozen configurations, searched by hand and without cross-validation due to limited computing resources)
1. implemented evaluation metrics for the best model
1. logged the best checkpoint according to validation loss
1. trained the best model on the whole dataset to make it ready for production (production for us is Quotebank)

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model
import tensorflow_hub as hub
import h5py
from sklearn import metrics
from tensorflow.keras import regularizers

!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
import tokenization

2021-12-10 03:01:57.106165: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2


## <center>NLP Disaster Tweet Classification w/ BERT</center>

This notebook fine-tunes a BERT model in TensorFlow to classify whether a tweet is about a disaster or not. I have added explanations throughout to give a better understanding of what the model is actually doing.

I got most of my understanding for this notebook from a discussion thread by @Chris Deotte that explains how the components of the roBERTa model work, and from his roBERTa starter notebook. These are the top two links below. I also found the TensorFlow documentation quite informative (third and fourth links).

### Useful Links

This is a collection of links that I found helpful in understanding the structure of the roBERTa model, how it works, and more.

- [TensorFlow roBERTa Explained Discussion](https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/143281#807401)
- [tensorflow-roberta-0-705 Notebook](https://www.kaggle.com/cdeotte/tensorflow-roberta-0-705)
- [Bert_en_uncased Docs](https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/4)
- [bert_en_uncased_preprocess](https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3)
- [How to get meaning from text with language model BERT](https://www.youtube.com/watch?v=-9vVhYEXeyQ)
- [TF Bert Tokenizer](https://github.com/google-research/bert)

In [2]:
FINAL_TRAIN = True # True for production. Will train the model on all available data (train+test)

TWEETS_DATASET_PATH = "../datasets/disaster-tweets"
SEED = 72
MAX_LEN = 300
LEARNING_RATE = 5e-6
N_HIDDEN_UNITS = "1024,512,256" # The head we will use
EPOCHS = 100
BATCH_SIZE = 8
L2_LOSS_CONST = 3e-3
MODEL_CHECKPOINT_NAME = f"model_lr={LEARNING_RATE}_hid={N_HIDDEN_UNITS}_maxlen={MAX_LEN}_batch={BATCH_SIZE}_epochs={EPOCHS}_seed={SEED}_l2={L2_LOSS_CONST}.h5"


def seed_everything(seed):
    # seed both NumPy and TensorFlow so runs are repeatable
    np.random.seed(seed)
    tf.random.set_seed(seed)


seed_everything(SEED)

Reading in the data using pandas. We will tokenize the text later in the notebook.

In [3]:
#reading input data with pandas
train = pd.read_csv(os.path.join(TWEETS_DATASET_PATH, "train.csv"))
test = pd.read_csv(os.path.join(TWEETS_DATASET_PATH, "test.csv"))
# submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

if FINAL_TRAIN: # let's get this model into production!
    train = pd.concat([train, test], sort=False)

#visualizing some of the tweets
for i, val in enumerate(train.iloc[:2]["text"].to_list()):
    print("Tweet {}: {}".format(i+1, val))
train

Tweet 1: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Tweet 2: Forest fire near La Ronge Sask. Canada


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...,0
3259,10865,,,Storm in RI worse than last hurricane. My city...,1
3260,10868,,,Green Line derailment in Chicago http://t.co/U...,1
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...,1


## <center>bert_encode function</center>


### tokenizer
We are using the [Tensorflow Research's BERT tokenization method](https://github.com/tensorflow/models/blob/master/official/nlp/bert/tokenization.py). This tokenization method can be thought of as three steps.

- Text Normalization
    - The first part of the tokenizer converts the text to lowercase (given that we are using the uncased version of BERT), collapses all whitespace into single spaces, and strips out accent markers.
    ```
    "Alex Pättason's, "  -> "alex pattason's,"
    ```
    <br></br>
- Punctuation splitting
    - This next step adds spaces on each side of every "punctuation" character. Note that this includes any ASCII character that is not a letter, digit, or space (e.g. `$` and `@`). See the docs for more detail.
    ```
    "Alex Pättason's, "  -> "alex pattason ' s ,"
    ```
    <br></br>
- WordPiece tokenization
    - This step applies whitespace tokenization to the output of the previous step, then applies WordPiece tokenization to each word separately. See the example below.
    ```
    "Alex Pättason's, "  -> "alex pat ##ta ##son ' s ,"
    ```
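The three steps above can be sketched in plain Python. This is a simplified illustration, not the code in `tokenization.py`: the helper names and `TOY_VOCAB` are made up for this example, and the real tokenizer handles more cases (control characters, CJK text, a maximum word length).

```python
import unicodedata

# Hypothetical toy vocabulary, just large enough for the example sentence.
TOY_VOCAB = {"alex", "pat", "##ta", "##son", "'", "s", ","}


def normalize(text):
    """Lowercase, collapse whitespace, and strip accent marks."""
    text = " ".join(text.lower().split())
    decomposed = unicodedata.normalize("NFD", text)
    # "Mn" is the Unicode category of combining accent marks.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_punctuation(ch):
    """Treat every ASCII non-letter/digit/space as punctuation, plus Unicode P*."""
    if ord(ch) < 128 and not ch.isalnum() and not ch.isspace():
        return True
    return unicodedata.category(ch).startswith("P")


def split_punctuation(text):
    """Put spaces around each punctuation character."""
    spaced = "".join(f" {ch} " if is_punctuation(ch) else ch for ch in text)
    return " ".join(spaced.split())


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab=TOY_VOCAB):
    words = split_punctuation(normalize(text)).split()
    return [piece for word in words for piece in wordpiece(word, vocab)]


print(toy_tokenize("Alex Pättason's, "))
```

With this toy vocabulary the call reproduces the pieces from the example above; with a different vocabulary the same word could be split differently.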
   
### tags
The next part of the function truncates the token list to `max_len - 2` tokens and then adds the [CLS] tag at the start and the [SEP] tag at the end, so the result never exceeds `max_len`. The [CLS] tag is short for classification and marks the start of the input. Similarly, the [SEP] tag marks the end of the sentence.
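A minimal sketch of this step on its own, assuming the text has already been split into tokens. The helper name `add_special_tokens` is hypothetical and only mirrors the slicing done inside `bert_encode`.

```python
def add_special_tokens(tokens, max_len):
    """Truncate to leave room for two tags, then wrap in [CLS] ... [SEP]."""
    body = tokens[:max_len - 2]
    return ["[CLS]"] + body + ["[SEP]"]


short = add_special_tokens(["forest", "fire"], max_len=6)
long = add_special_tokens(["a", "b", "c", "d", "e", "f", "g"], max_len=6)
print(short)  # a short input is only wrapped
print(long)   # a long input is cut so the total length equals max_len
```

Reserving two positions before truncating is what guarantees that the closing [SEP] is never cut off.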

### convert_tokens_to_ids + pad_masks

We then use the tokenizer's `convert_tokens_to_ids` method to replace the string tokens with integer ids. We also create the input mask (a.k.a. pad_masks) and the segment ids. Note that we do not take full advantage of segment_ids here, since we only pass an array of zeros. More on the tokens, pad_masks, and segment_ids further in the notebook.
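To make the three arrays concrete, here is a small sketch that uses a plain dictionary in place of the real tokenizer. The vocabulary and its ids are invented for the example; the only assumption carried over from `bert_encode` is that padding uses id 0.

```python
# Hypothetical vocabulary; real BERT ids are different.
TOY_IDS = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "forest": 3, "fire": 4}


def encode(tokens, vocab, max_len):
    """Return (token ids, pad mask, segment ids), each of length max_len."""
    ids = [vocab[token] for token in tokens]
    pad_len = max_len - len(ids)
    input_ids = ids + [0] * pad_len
    pad_mask = [1] * len(ids) + [0] * pad_len  # 1 = real token, 0 = padding
    segment_ids = [0] * max_len                # single segment: all zeros
    return input_ids, pad_mask, segment_ids


input_ids, pad_mask, segment_ids = encode(
    ["[CLS]", "forest", "fire", "[SEP]"], TOY_IDS, max_len=6
)
print(input_ids)
print(pad_mask)
print(segment_ids)
```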

In [4]:
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)

## <center>build_model function</center>

The first three components of the function are the three `Input` layers, which receive the already-encoded text in the format expected by the BERT model.

### input_word_ids
- Maps each token to its id in the vocabulary. Each WordPiece token has exactly one id, but a single word can be split into several tokens and therefore produce several ids. The ids below are illustrative, not real vocabulary entries.
<br></br>
```
text = "I love this notebook. It is Great."
input_word_ids = [10, 235, 123, 938, 184, 301, 567]
```

### the input_mask
- Marks which positions hold real tokens and which hold padding. Every non-padding token gets a 1 and every padding position gets a 0. If the text reaches or exceeds max_len, the entire vector is 1s.
<br></br>
```
text = "I love this notebook. It is Great."
input_mask = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```
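Since padding is always written as id 0 in `bert_encode`, the mask can also be recovered from the padded ids. The sketch below assumes that no real token uses id 0; the helper name is made up, and the ids are the illustrative ones from the example above.

```python
def mask_from_ids(input_ids, pad_id=0):
    """1 where a real token sits, 0 where the position is padding."""
    return [int(token_id != pad_id) for token_id in input_ids]


padded = [10, 235, 123, 938, 184, 301, 567, 0, 0, 0]
full = [10, 235, 123, 938, 184]  # no padding left after truncation

print(mask_from_ids(padded))
print(mask_from_ids(full))
print(sum(mask_from_ids(padded)))  # number of real tokens
```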

### segment_ids
- Segment ids (also called token type ids) tell BERT which of two input sentences each token belongs to: tokens of the first sentence get a 0 and tokens of the second sentence get a 1. This matters for sentence-pair tasks. Each tweet here is passed as a single segment, so the array is all zeros, padding included.
<br></br>
```
text = "I love this notebook. It is Great."
segment_ids = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```
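For contrast with the single-tweet case used in this notebook, here is a sketch of how segment ids look in BERT's sentence-pair input format, where the first sentence (with [CLS] and its [SEP]) is marked 0 and the second sentence (with its [SEP]) is marked 1. The helper name and the token lists are hypothetical, and no truncation is attempted.

```python
def pair_segment_ids(tokens_a, tokens_b, max_len):
    """Build tokens and segment ids for a two-sentence input, padded with 0."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("this sketch does not truncate; increase max_len")
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    segments += [0] * (max_len - len(tokens))  # padding is also marked 0
    return tokens, segments


tokens, segments = pair_segment_ids(["i", "love", "it"], ["great"], max_len=10)
print(tokens)
print(segments)
```

`bert_encode` never builds a second sentence, which is why it can simply pass `[0] * max_len`.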

In [5]:
def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    #the layer returns (pooled_output, sequence_output); sequence_output holds one contextual vector per input token, and we keep the one at position 0 ([CLS])
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    hidden = clf_output
    for h in N_HIDDEN_UNITS.split(","):
        hidden = Dense(int(h), activation='relu', kernel_regularizer=regularizers.l2(L2_LOSS_CONST))(hidden)
    out = Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(L2_LOSS_CONST))(hidden)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    
    #compile with Adam and binary cross-entropy for the single sigmoid output
    model.compile(Adam(learning_rate=LEARNING_RATE), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model
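To get a feel for the size of the configurable head, the sketch below parses the `N_HIDDEN_UNITS` string the same way `build_model` does and counts the weights of each Dense layer (kernel plus bias). It assumes the [CLS] vector has 1024 dimensions, which is the hidden size of the Large model; the helper names are made up for this example.

```python
def head_layer_shapes(n_hidden_units, input_dim):
    """Return (fan_in, fan_out) for every Dense layer in the head."""
    widths = [int(h) for h in n_hidden_units.split(",")] + [1]  # + sigmoid output
    dims = [input_dim] + widths
    return list(zip(dims[:-1], dims[1:]))


def dense_param_count(fan_in, fan_out):
    """Kernel weights plus one bias per output unit."""
    return fan_in * fan_out + fan_out


layers = head_layer_shapes("1024,512,256", input_dim=1024)
total = sum(dense_param_count(fan_in, fan_out) for fan_in, fan_out in layers)

for fan_in, fan_out in layers:
    print(f"Dense {fan_in} -> {fan_out}: {dense_param_count(fan_in, fan_out)} params")
print("head total:", total)
```

The L2 penalty in `build_model` is applied to the kernels of these layers only, not to the BERT encoder itself.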

## Build Model + Preprocess Data

The first cell below loads the version of the BERT model that we want to use: the Large uncased model (24 layers, hidden size 1024, 16 attention heads). The simplest way to use a pre-trained BERT model and adapt it to a specific use case is to wrap it in a `hub.KerasLayer`.

Note there are many different variations of BERT models that you can look through here --> [TFhub Bert](https://tfhub.dev/google/collections/bert/1)


In [6]:
#load uncased bert model
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

2021-12-10 03:02:19.959003: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-12-10 03:02:20.052590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:86:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-12-10 03:02:20.052629: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2021-12-10 03:02:20.060707: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-12-10 03:02:20.063324: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-12-10 03:02:20.063931: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so

In the next cell, we set up the tokenizer that converts our input data into what BERT understands. We have to specify a vocab file so that the tokenizer knows which number to encode each token as, and we have to specify whether we want uncased or cased text. We will use the same vocab file that the pre-trained model was trained with (a WordPiece vocabulary in this case) and the same casing that the model was built for (uncased).
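A BERT vocab file is a plain text file with one token per line, and a token's id is simply its line number. The sketch below shows that mapping with a tiny in-memory file; the six tokens are invented for the example, and the `[UNK]` fallback in the lookup is a simplification of this sketch.

```python
import io


def load_vocab(lines):
    """Map each token to its zero-based line number."""
    return {line.rstrip("\n"): index for index, line in enumerate(lines)}


# Hypothetical stand-in for the vocab file shipped with the model.
toy_vocab_file = io.StringIO("[PAD]\n[UNK]\n[CLS]\n[SEP]\nfire\n##s\n")
vocab = load_vocab(toy_vocab_file)

tokens = ["[CLS]", "fire", "##s", "flood", "[SEP]"]
ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
print(ids)
```

This is why the tokenizer must use the vocab file that belongs to the pre-trained model: the ids index into the model's embedding table, so a different file would point at the wrong rows.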

Finally, once we have these two variables, we create the tokenizer and tokenize the training and testing data using the bert_encode function that we created above. 

In [7]:
#vocab file from pre-trained BERT for tokenization
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()

#returns true/false depending on if we selected cased/uncased bert layer
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

#Create the tokenizer
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

#tokenizing the training and testing data
train_input = bert_encode(train.text.values, tokenizer, max_len=MAX_LEN)
test_input = bert_encode(test.text.values, tokenizer, max_len=MAX_LEN)
train_labels = train.target.values

Having a look at the model summary. We can see the three input layers that we created, followed by the BERT model in the keras_layer. After the hidden dense layers of the head, the final dense layer predicts the probability that the tweet is about a disaster, on a scale of 0-1.

In [8]:
model = build_model(bert_layer, max_len=MAX_LEN)
model.summary()

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'


Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 300)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 300)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 300)]        0                                            
_______________________________________________________________________________

## Training Model

The next cell is a simple way of training the model using Keras. We include the built-in ModelCheckpoint callback with `save_best_only=True`, so a checkpoint is written only when the validation loss reaches a new minimum. The saved file therefore always holds the best model seen so far.
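The checkpointing rule can be sketched without Keras: keep track of the lowest validation loss so far and save only when a new epoch beats it. The loss values below are made up for illustration, and the helper is a simplified stand-in for the callback, not its implementation.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a best-only checkpoint is written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly better than everything seen so far
            best = loss
            saved.append(epoch)
    return saved


# Hypothetical validation losses for five epochs.
history = [0.52, 0.45, 0.47, 0.44, 0.60]
print(epochs_that_save(history))
```

Here the file on disk after training corresponds to epoch 4, even though training continued to epoch 5 with a worse loss.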


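Keras takes the validation samples from the end of the data (the last 10% here) before shuffling, so the split is not random. We could make the evaluation more robust with a stratified split or cross-validation, but this will do for now.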
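As a rough sketch of what `validation_split=0.1` does, the function below reproduces the "last fraction of the samples" rule described in the Keras documentation. The function name and the sample count are made up, and the exact rounding inside Keras is an assumption here.

```python
import math


def tail_split(n_samples, validation_split):
    """Return (train indices, validation indices) with validation at the end."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return range(0, split_at), range(split_at, n_samples)


train_idx, val_idx = tail_split(20, 0.1)
print(list(train_idx))
print(list(val_idx))
```

When `FINAL_TRAIN` is set, the test rows are concatenated after the training rows, so the validation tail is drawn from the rows that came from `test.csv`.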

In [9]:
checkpoint = ModelCheckpoint(MODEL_CHECKPOINT_NAME, monitor='val_loss', save_best_only=True)

train_history = model.fit(
    train_input, train_labels,
    validation_split=0.1,
    epochs=EPOCHS,
    callbacks=[checkpoint],
    batch_size=BATCH_SIZE
)

Epoch 1/100
...
Epoch 100/100


## Make Prediction

We use the model to make predictions on the test set and round each predicted probability to 0 or 1: 1 is a disaster tweet, and 0 is a regular tweet.

In [13]:
def print_metrics(y_true, y_predicted, log_filename=None):
    acc = metrics.accuracy_score(y_true, y_predicted)
    precision, recall, f1, support = metrics.precision_recall_fscore_support(y_true, y_predicted)
    cf = metrics.confusion_matrix(y_true, y_predicted)
    
    lines = []
    lines += [f"Accuracy: {acc}"]
    lines += [f"Precision: {precision}"]
    lines += [f"Recall: {recall}"]
    lines += [f"F1: {f1}"]
    lines += [f"support: {support}"]
    lines += [f"{cf}"]
    
    for line in lines:
        print(line)
    
    if log_filename is not None:
        with open(log_filename, "a") as lf:
            for line in lines:
                lf.write(line)
                lf.write("\n")
    
    return lines
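`print_metrics` prints one precision, recall, F1, and support value per class, because `precision_recall_fscore_support` is called without an `average` argument. The sketch below computes the same quantities by hand for a small made-up example, which may help when reading the logged arrays. The helper name and the labels are hypothetical.

```python
def per_class_metrics(y_true, y_pred, labels=(0, 1)):
    """Return {label: (precision, recall, f1, support)} from raw label lists."""
    results = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1, tp + fn)
    return results


# Made-up labels: 1 = disaster tweet, 0 = regular tweet.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
scores = per_class_metrics(y_true, y_pred)
print(scores[0])
print(scores[1])
```

The first element of each printed array in the notebook corresponds to class 0 and the second to class 1.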

In [16]:
if not FINAL_TRAIN:
    print("Overfitted model test results:")
    test_pred = model.predict(test_input)
    print_metrics(test.target, test_pred.round().astype(int).squeeze())

In [17]:
if not FINAL_TRAIN:
    print("Best checkpoint test results (best according to validation loss):")
    best_model = load_model(MODEL_CHECKPOINT_NAME, custom_objects={'KerasLayer': hub.KerasLayer})
    test_pred = best_model.predict(test_input)

    submission = test[["id"]].copy()
    submission["target"] = test_pred.round().astype(int)
    submission.to_csv('submission.csv', index=False)

    print(MODEL_CHECKPOINT_NAME)
    print_metrics(test.target, submission.target, f"{MODEL_CHECKPOINT_NAME}.log")

In [19]:
if FINAL_TRAIN:
    print(f"Model ready for production! Find the best weights in:\n{MODEL_CHECKPOINT_NAME}")

Model ready for production! Find the best weights in:
model_lr=5e-06_hid=1024,512,256_maxlen=300_batch=8_epochs=100_seed=72_l2=0.003.h5
