## AUXILIARY NOTEBOOK: Fine-tuning BERT for multi-class prediction of document topics
----
by Juan Equihua on April 10th, 2022.

The purpose of this notebook is to extend the 1-D labeling provided in the dataset to multiple classes. To do this, we fine-tune the BERT model for multi-class prediction and then save the model to disk, from where it can be deployed via Docker.

NOTE: This notebook saves the tuned BERT model in **/Fine_tuned_BERT/multiclassification_bert/**

----

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification, DefaultDataCollator

from utils.utilities import load_data, get_predictions_from_bert_api

In [2]:
## Specify input data:
data_path = './Data/data.csv'
data = load_data(data_path)

In [3]:
## Auxiliary functions:

def get_labels(data):
    ''' Return the unique topic labels, in order of first appearance. '''
    return list(data.clusters_0.unique())

def clean_text(text):
    ''' Clean text. (Further cleaning can be done) '''
    return str(text).lower()

def encode_label(label, labels):
    ''' One-hot encode `label` against `labels` as a 1-D array. '''
    return np.array([int(label == x) for x in labels])

def encode_label_bkp(label, labels):
    ''' Backup variant of encode_label: same encoding, shape (len(labels), 1). '''
    return np.array([int(label == x) for x in labels])[..., np.newaxis]

def preprocess_data(data, labels=None):
    ''' Preprocess data for training: clean the body and one-hot encode the topic. '''
    if labels is None:
        labels = get_labels(data)
    data = data.copy()  # do not modify the caller's DataFrame
    data['Clean_body'] = [clean_text(x) for x in data.body]
    data['tag'] = [encode_label(str(x), labels) for x in data.clusters_0]
    return data[['Clean_body', 'tag']]

def dict_row(row):
    ''' Convert one preprocessed row into a {"label", "text"} dictionary. '''
    return {"label": row.tag, "text": row.Clean_body}

def create_examples(data):
    ''' Create examples for the whole dataset (not used in the rest of this notebook). '''
    return [dict_row(data.iloc[i]) for i in range(data.shape[0])]

def tokenize_function(examples):
    ''' Tokenize a batch of texts to exactly 50 tokens. Uses the global `tokenizer`. '''
    return tokenizer(examples["Clean_body"], padding="max_length", truncation=True, max_length=50)

def make_predictions_for_string(string, tokenizer, model, labels):
    ''' Return the three most probable labels for a single string. '''
    tokenized_string_test = tokenizer.encode(string,
                                             truncation=True,
                                             padding=True,
                                             return_tensors="tf")
    # The first element of the model output holds the logits, shape (1, num_labels).
    prediction = model(tokenized_string_test)[0]
    prediction_probs = tf.nn.softmax(prediction, axis=1).numpy()

    # Label indices sorted from most to least probable.
    sorted_prob_index = np.argsort(-prediction_probs)[0]

    return {"Top 1": labels[sorted_prob_index[0]],
            "Top 2": labels[sorted_prob_index[1]],
            "Top 3": labels[sorted_prob_index[2]]}



In [4]:
## Create Labels: 
labels = get_labels(data)
print(labels)

['Poor Pay', 'Cost of Living', 'Wage Growth', 'Rich People', 'Low Income Families', 'Public Sector Pay', 'Government Support', 'Mental Health', 'Leaseholding', 'State Pension', 'Pay Rises', 'Long Hours', 'Income Tax', 'Poor People', 'Council Tax', 'Small Businesses', 'Statutory Sick Pay', 'Social Care', 'House Prices', 'Job', 'Minimum Wage Increase', 'National Insurance', 'Gender Pay Gap']
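
To make the label encoding concrete, here is a minimal, dependency-free sketch of what `encode_label` produces: a one-hot vector with a 1 at the position of the matching label. The helper name `one_hot` and the three-label list are illustrative assumptions; the notebook itself returns NumPy arrays over all 23 labels.

```python
def one_hot(label, labels):
    """Return a list with 1 at the position of `label` in `labels`, 0 elsewhere."""
    return [int(label == candidate) for candidate in labels]

# Illustrative subset of the labels printed above.
example_labels = ["Poor Pay", "Cost of Living", "Wage Growth"]
encoded = one_hot("Cost of Living", example_labels)
print(encoded)  # [0, 1, 0]
```

A label that does not appear in the list is encoded as all zeros, so unexpected label values would silently produce rows with no positive class.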


In [5]:
## Clean Data:
dataset = preprocess_data(data)

In [6]:
## Load tokenizer for BERT: 
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [7]:
## Split data into train_test (hardcoded for now):
dataset_train = dataset[:700]
dataset_test = dataset[700:]

## Convert pandas into Dataset: 
train_dataset = Dataset.from_pandas(dataset_train)
test_dataset = Dataset.from_pandas(dataset_test)
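
The positional split above implicitly assumes the rows are already in random order. If they are not (for example, if the file happens to be sorted by topic), a shuffled split may be preferable. The following is a minimal sketch using only the standard library; `shuffled_split`, the seed, and the 0.8 fraction are hypothetical choices, not part of the original notebook.

```python
import random

def shuffled_split(n_rows, train_fraction=0.8, seed=42):
    """Return (train_idx, test_idx): disjoint, shuffled row positions."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, test_idx = shuffled_split(10)
print(len(train_idx), len(test_idx))  # 8 2
```

The resulting position lists could then be used with `dataset.iloc[train_idx]` and `dataset.iloc[test_idx]` in place of the hardcoded slices.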

In [8]:
## Tokenize datasets: 
train_tokenized_datasets = train_dataset.map(tokenize_function, batched=True)
tokenized_datasets = test_dataset.map(tokenize_function, batched=True)
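
With `padding="max_length"`, `truncation=True` and `max_length=50`, every document ends up as exactly 50 token ids: longer inputs are cut and shorter ones are filled with a padding id, while the attention mask marks the real tokens. The sketch below imitates this on plain integer lists. `pad_or_truncate` is a hypothetical helper, the ids are made up, a padding id of 0 is assumed, and the special tokens that the real tokenizer adds are ignored.

```python
def pad_or_truncate(token_ids, max_length=50, pad_id=0):
    """Return (input_ids, attention_mask), both of length `max_length`."""
    kept = token_ids[:max_length]                 # truncation
    n_pad = max_length - len(kept)                # padding needed, 0 if truncated
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

ids, mask = pad_or_truncate([7, 8, 9, 10], max_length=6)
print(ids)   # [7, 8, 9, 10, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Note that a 50-token limit discards most of a long document body, so the model only sees the opening of each text.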


In [9]:
## Define model:
data_collator = DefaultDataCollator(return_tensors="tf")
## Each document has exactly one topic, so this is single-label (multi-class) classification:
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="single_label_classification", 
                                                           num_labels=len(labels))

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
## Create training and validation sets for TensorFlow: 

tf_train_dataset = train_tokenized_datasets.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["tag"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

## Validate on the held-out test split, not on the training data:
tf_validation_dataset = tokenized_datasets.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["tag"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

In [11]:
## Compile model:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.CategoricalAccuracy(),
)
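
For intuition about the loss chosen above: categorical cross-entropy computed from logits is the negative log of the softmax probability assigned to the true class. The sketch below reproduces that calculation for a single example in plain Python. `cross_entropy_from_logits` is a hypothetical helper for illustration and not the Keras implementation, which also averages over the batch.

```python
import math

def cross_entropy_from_logits(one_hot_target, logits):
    """Negative log-softmax of the true class, computed in a numerically stable way."""
    m = max(logits)
    # log of the softmax normaliser: log(sum(exp(logit)))
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_norm) for t, x in zip(one_hot_target, logits))

# Uniform logits over three classes: the model is maximally unsure,
# so the loss equals ln(3).
loss = cross_entropy_from_logits([0, 1, 0], [0.0, 0.0, 0.0])
print(round(loss, 4))  # 1.0986
```

By the same reasoning, an untrained classifier over the 23 labels would be expected to start near a loss of ln(23), which is roughly 3.14.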

In [12]:
## Model Training: (This might take a while).
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fd5c97f4e90>

In [13]:
## Save model:
model_path = './Fine_tuned_BERT/multiclassification_bert'
model.save_pretrained(model_path)

In [17]:
## Testing predictions: 
string_test = 'Economic impact of covid in real state'
make_predictions_for_string(string_test, tokenizer, model, labels)

{'Top 1': 'Cost of Living', 'Top 2': 'Mental Health', 'Top 3': 'Pay Rises'}
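
The ranking step inside `make_predictions_for_string` can be illustrated without loading the model: apply a softmax to the logits, sort the classes by probability, and keep the first three. The sketch below does this in plain Python. `top_k_labels` is a hypothetical helper, and the logits are made-up values, not real model output.

```python
import math

def top_k_labels(logits, labels, k=3):
    """Return the k most probable labels as {"Top 1": ..., "Top 2": ...}."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]      # shifted for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(labels)), key=lambda i: -probs[i])
    return {f"Top {rank + 1}": labels[i] for rank, i in enumerate(ranked[:k])}

example_labels = ["Poor Pay", "Cost of Living", "Mental Health", "Pay Rises"]
example_logits = [-1.0, 2.5, 1.0, 0.3]  # made-up values for illustration
result = top_k_labels(example_logits, example_labels)
print(result)
# {'Top 1': 'Cost of Living', 'Top 2': 'Mental Health', 'Top 3': 'Pay Rises'}
```

Because softmax preserves the order of the logits, sorting the raw logits would give the same ranking; the probabilities only matter if they are reported alongside the labels.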

In [None]:
## Testing loading the model and making predictions: 

In [18]:
classifier = TFAutoModelForSequenceClassification.from_pretrained(model_path)

Some layers from the model checkpoint at ./Fine_tuned_BERT/multiclassification_bert were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at ./Fine_tuned_BERT/multiclassification_bert.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [19]:
make_predictions_for_string(string_test, tokenizer, classifier, labels)

{'Top 1': 'Cost of Living', 'Top 2': 'Mental Health', 'Top 3': 'Pay Rises'}

In [313]:
### Test API requests after deployment:

In [316]:
## NOTE: get_predictions_from_bert_api only works if the Flask API is already running.
get_predictions_from_bert_api('Covid impact in household income')

{'Top 1': 'Government Support',
 'Top 2': 'Poor Pay',
 'Top 3': 'Low Income Families'}