# BERT fine tuning for Question-Answering

This notebook demonstrates fine tuning BERT models from TF Hub using the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). Scripts from the [TensorFlow Model Garden](https://github.com/tensorflow/models) are used for preprocessing the training dataset and fine tuning.

The notebook performs the following steps:
1. [Install dependencies and setup parameters](#1.-Install-dependencies-and-setup-parameters)
2. [Prepare the dataset](#2.-Prepare-the-dataset)
3. [Fine tuning and evaluation](#3.-Fine-tuning-and-evaluation)
4. [Export the saved model](#4.-Export-the-saved-model)

## 1. Install dependencies and setup parameters

The notebook assumes that you have already followed the README.md instructions to install Intel-optimized TensorFlow, or that you are running the Intel-optimized TensorFlow Jupyter Docker container. The next cell installs the additional packages that the notebook needs.

In [None]:
!pip install --upgrade -q pip
!pip install -q gin-config==0.5.0 \
                sentencepiece==0.1.96 \
                tensorflow-addons==0.15.0 \
                tensorflow-datasets==4.5.2 \
                tensorflow-hub==0.12.0 \
                'pandas>=1.1.5' \
                pyyaml==6.0 \
                wget==3.2
!pip install --no-deps -q tf-models-official==2.7.0
!apt-get -q update && apt-get -q install -y git

In [None]:
import json
import os
import pandas as pd
import tensorflow as tf
import wget

from bert_qa_utils import create_mini_dataset_file, \
                          display_predictions, \
                          get_config_and_vocab_from_zip, \
                          predict_squad_customized

from bert_utils import get_model_map

This notebook will run one of the supported [BERT models from TF Hub](https://tfhub.dev/google/collections/bert/1). The table below has a list of the available models and links to their URLs in TF Hub.

In [None]:
tfhub_model_map, models_df = get_model_map("tfhub_bert_model_map_qa.json", return_data_frame=True)
models_df.style.hide(axis="index")

Specify the name of the BERT model to use. This string must match one of the models listed in the table above.

In [None]:
model_name = "bert_en_wwm_uncased_L-24_H-1024_A-16"
if model_name not in tfhub_model_map.keys():
    raise ValueError("The specified model name ({}) is not supported".format(model_name))
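The check above can be extended to suggest near matches when the model name contains a typo. The following is a minimal sketch, not part of `bert_utils`: the `resolve_model` helper and the placeholder URLs in `example_map` are hypothetical, and only the `bert_encoder` and `checkpoint_zip` keys mirror how this notebook reads the model map.

```python
import difflib


def resolve_model(model_map, name):
    """Return the map entry for `name`, or raise with close-match suggestions."""
    if name in model_map:
        return model_map[name]
    # Suggest up to three similar names (hypothetical convenience, not in bert_utils)
    suggestions = difflib.get_close_matches(name, list(model_map), n=3)
    hint = " Did you mean: {}?".format(", ".join(suggestions)) if suggestions else ""
    raise ValueError("The specified model name ({}) is not supported.{}".format(name, hint))


# Placeholder map for illustration only; the URLs are not real
example_map = {
    "bert_en_wwm_uncased_L-24_H-1024_A-16": {
        "bert_encoder": "<encoder url>",
        "checkpoint_zip": "<checkpoint zip url>",
    },
}

entry = resolve_model(example_map, "bert_en_wwm_uncased_L-24_H-1024_A-16")
print(entry["bert_encoder"])
```

With the real map, the call would be `resolve_model(tfhub_model_map, model_name)`.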

In [None]:
# Define a working directory where the dataset and TensorFlow models repo will be downloaded
if "WORKING_DIR" in os.environ and os.environ["WORKING_DIR"] != "":
    working_dir = os.environ["WORKING_DIR"]
else:
    working_dir = input("Path to a working directory (to download datasets, vocab files, etc): ")

# Define an output directory for the saved model to be exported
if "OUTPUT_DIR" in os.environ and os.environ["OUTPUT_DIR"] != "":
    output_dir = os.environ["OUTPUT_DIR"]
else:
    output_dir = input("Path to an output directory (for checkpoints and the saved model): ")
    
# Directory for downloading the BERT config and vocab file
bert_dir = os.path.join(working_dir, model_name)

# Output directory for logs and checkpoints generated during training
if not os.path.isdir(output_dir):
    os.makedirs(output_dir)

# Create the directory where the BERT checkpoint zip is downloaded to get vocab.txt and bert_config.json
if not os.path.isdir(bert_dir):
    os.makedirs(bert_dir)
    
# Get the BERT TF Hub URL from the model map
tfhub_bert_encoder = tfhub_model_map[model_name]["bert_encoder"]
checkpoint_url = tfhub_model_map[model_name]["checkpoint_zip"]

# Extract the vocab.txt and bert_config.json from the checkpoint zip file
vocab_txt, bert_config = get_config_and_vocab_from_zip(checkpoint_url, bert_dir)

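# Fail early if the expected files were not extracted from the checkpoint zip
if not os.path.exists(vocab_txt):
    raise ValueError("The vocab file could not be found at {}".format(vocab_txt))

if not os.path.exists(bert_config):
    raise ValueError("The bert config could not be found at {}".format(bert_config))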
    
print("Using TF Hub model:", model_name)
print("BERT encoder URL:", tfhub_bert_encoder)
print("Vocab file:", vocab_txt)
print("BERT config:", bert_config)

In [None]:
# Path where the https://github.com/tensorflow/models repo will be cloned
tf_models_dir = os.path.join(working_dir, "tensorflow-models")
os.environ["TF_MODELS_DIR"] = tf_models_dir
tf_models_branch = "v2.7.0"

# Clone the TensorFlow models repo
if not os.path.exists(tf_models_dir):
    !git clone --depth=1 --branch=$tf_models_branch https://github.com/tensorflow/models.git $tf_models_dir
        
os.environ["PYTHONPATH"] = tf_models_dir

## 2. Prepare the dataset

Download the SQuAD dataset, create smaller JSON files with a subset of the dev and train datasets, and then create TF records for the mini training dataset. The SQuAD dataset provides separate JSON files for the train and dev datasets. The JSON files are formatted as follows:

```
{
    "data": [
        {
            "title": "...",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "...",
                            "id": "<unique id>",
                            "answers": [
                                {
                                    "text": "...",
                                    "answer_start": <index>
                                },
                                {
                                    "text": "...",
                                    "answer_start": <index>
                                },
                                {
                                    "text": "...",
                                    "answer_start": <index>
                                }
                            ],
                            "is_impossible": <true/false>
                        },
                        ...
                    ],
                    "context": "....."
                },
                ...
            ]
        }
    ],
    "version": "v2.0"
}
```

Each item in the `data` list has a title and a list of paragraphs. Each paragraph has a context string and a list of questions and answers (`qas`) about that context. The answer to each question is a segment of text from the context paragraph (unless the question is impossible).
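To make the nesting concrete, the sketch below walks a SQuAD-style dictionary and checks that each answer can be recovered from its context using `answer_start`. The `sample` data and the `iter_questions` helper are made up for illustration and are not part of the notebook's utilities.

```python
# Tiny in-memory sample that mirrors the SQuAD layout (illustrative data only)
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The sky is blue during the day.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What color is the sky?",
                            "answers": [{"text": "blue", "answer_start": 11}],
                            "is_impossible": False,
                        }
                    ],
                }
            ],
        }
    ],
}


def iter_questions(squad):
    """Yield (title, context, qa) for every question in a SQuAD-style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield article["title"], paragraph["context"], qa


for title, context, qa in iter_questions(sample):
    for answer in qa["answers"]:
        start = answer["answer_start"]
        end = start + len(answer["text"])
        # answer_start is a character offset into the context string
        assert context[start:end] == answer["text"]
    print(title, "|", qa["question"], "->", [a["text"] for a in qa["answers"]])
```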

For this example, we will be using a subset of the dev and train dataset in order to speed up the execution time. The size of the datasets can be increased (or the full dataset can be used) to try to improve accuracy.
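One way to build such a subset is to keep only the first few entries of the top-level `data` list. The sketch below assumes that a "dataset item" means one entry in `data`. The `make_mini_squad` and `count_questions` helpers are hypothetical and may differ from what `create_mini_dataset_file` in `bert_qa_utils` actually does.

```python
def make_mini_squad(squad, num_items):
    """Return a SQuAD-style dict that keeps only the first `num_items` articles.

    Hypothetical helper for illustration; it is not the notebook's
    create_mini_dataset_file implementation.
    """
    return {"version": squad.get("version"), "data": squad["data"][:num_items]}


def count_questions(squad):
    """Count the questions across all articles and paragraphs."""
    return sum(len(paragraph["qas"])
               for article in squad["data"]
               for paragraph in article["paragraphs"])


# Synthetic dataset: 5 articles, each with 1 paragraph and 3 questions
full = {
    "version": "v1.1",
    "data": [
        {
            "title": "article-{}".format(i),
            "paragraphs": [
                {
                    "context": "...",
                    "qas": [{"id": "{}-{}".format(i, j), "question": "...", "answers": []}
                            for j in range(3)],
                }
            ],
        }
        for i in range(5)
    ],
}

mini = make_mini_squad(full, 2)
print("articles:", len(mini["data"]), "questions:", count_questions(mini))
```

Because each article carries many paragraphs and questions in the real dataset, even a small number of items can yield a substantial number of training examples.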

In [None]:
# SQuAD version to use: "v1.1" or "v2.0"
squad_version = "v1.1"

# Maximum sequence length
max_seq_length = 384

# Specify the number of dataset items to grab from the dev and train datasets.
# More dataset items can increase accuracy, but will also increase the training/evaluation time.
num_dev_dataset_items = 2
num_train_dataset_items = 12

# Flag to overwrite previously generated mini dataset .json files and the TF records file
overwrite = False

# Dataset download directory
squad_dir = os.path.join(working_dir, "squad")

squad_dev_dataset = os.path.join(squad_dir, "dev-{}.json".format(squad_version))
squad_train_dataset = os.path.join(squad_dir, "train-{}.json".format(squad_version))
version_2_with_negative = squad_version == "v2.0"

# Create a directory for the SQuAD files, if the folder does not exist
if not os.path.isdir(squad_dir):
    os.makedirs(squad_dir)

# Download the SQuAD dev dataset file, if it doesn't exist
if not os.path.exists(squad_dev_dataset):
    squad_dev_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-{}.json".format(squad_version)
    wget.download(squad_dev_url, squad_dir)

# Download the SQuAD train dataset file, if it doesn't exist
if not os.path.exists(squad_train_dataset):
    squad_train_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-{}.json".format(squad_version)
    wget.download(squad_train_url, squad_dir)
    
# Create a smaller version of the dev dataset
squad_mini_file = "mini-dev-{}.json".format(squad_version)
mini_dataset_path = os.path.join(squad_dir, squad_mini_file)
create_mini_dataset_file(squad_dev_dataset, mini_dataset_path, num_dev_dataset_items, overwrite=overwrite)

# Create a smaller version of the train dataset
squad_mini_train_file = "mini-train-{}.json".format(squad_version)
mini_train_dataset_path = os.path.join(squad_dir, squad_mini_train_file)
create_mini_dataset_file(squad_train_dataset, mini_train_dataset_path, num_train_dataset_items, overwrite=overwrite)

# Create TF Records for the mini training dataset
train_mini_tfrecords_path = os.path.join(squad_dir, "squad_mini_{}_train.tf_record".format(squad_version))
squad_metadata_path = os.path.join(squad_dir, "squad_{}_meta_data".format(squad_version))

# Preprocess the dataset, if we don't already have the files
if not os.path.exists(train_mini_tfrecords_path) or not os.path.exists(squad_metadata_path) or overwrite:
    !python $tf_models_dir/official/nlp/data/create_finetuning_data.py \
        --squad_data_file=$mini_train_dataset_path \
        --vocab_file=$vocab_txt \
        --version_2_with_negative=$version_2_with_negative \
        --train_data_output_path=$train_mini_tfrecords_path \
        --meta_data_file_path=$squad_metadata_path \
        --fine_tuning_task_type=squad \
        --max_seq_length=$max_seq_length
    
    if os.path.exists(train_mini_tfrecords_path):
        print("Preprocessed dataset: ", train_mini_tfrecords_path)
else:
    print("The preprocessed training dataset was found at:", train_mini_tfrecords_path)
    print("The SQuAD metadata file was found at:", squad_metadata_path)

## 3. Fine tuning and evaluation

Fine tune the model using the `run_squad.py` script from the [TensorFlow Model Garden](https://github.com/tensorflow/models/blob/v2.7.0/official/nlp/bert/run_squad.py) with the mode set to `train_and_eval`. The [TF Hub](https://tfhub.dev) model URL is passed through the `hub_module_url` flag.

In [None]:
%%time

# Learning rate
learning_rate = 8e-5

# Number of training epochs
num_train_epochs = 1

# Batch sizes
train_batch_size = 4
predict_batch_size = 4

# Directory for checkpoints
checkpoint_dir = os.path.join(output_dir, "{}_checkpoints".format(model_name))

if os.path.exists(checkpoint_dir):
    if len(os.listdir(checkpoint_dir)) > 0:
        print("WARNING: The model checkpoint directory is not empty and fine tuning may pick up " 
              "previously generated checkpoint files.\n")
else:
    os.makedirs(checkpoint_dir)

!python $tf_models_dir/official/nlp/bert/run_squad.py \
  --mode=train_and_eval \
  --input_meta_data_path=$squad_metadata_path \
  --train_data_path=$train_mini_tfrecords_path \
  --predict_file=$mini_dataset_path \
  --vocab_file=$vocab_txt \
  --bert_config_file=$bert_config \
  --hub_module_url=$tfhub_bert_encoder \
  --train_batch_size=$train_batch_size \
  --predict_batch_size=$predict_batch_size \
  --learning_rate=$learning_rate \
  --num_train_epochs=$num_train_epochs \
  --model_dir=$checkpoint_dir \
  --distribution_strategy=one_device

In [None]:
display_predictions(mini_dataset_path, os.path.join(checkpoint_dir, "predictions.json"), n=25)
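Beyond inspecting individual predictions, a rough exact-match score can be computed by comparing each prediction with the reference answers. This is a minimal sketch that assumes `predictions.json` maps each question `id` to the predicted answer text. The normalization follows the style of the official SQuAD evaluation script (lowercase, strip punctuation and articles, collapse whitespace), and the `exact_match_score` helper and sample data are illustrative only.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_score(dataset, predictions):
    """Percentage of questions whose prediction matches any reference answer."""
    total = 0
    correct = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                total += 1
                predicted = normalize_answer(predictions.get(qa["id"], ""))
                # Questions without answers are treated as expecting an empty string
                references = [normalize_answer(a["text"]) for a in qa["answers"]] or [""]
                correct += int(predicted in references)
    return 100.0 * correct / total if total else 0.0


# Illustrative data only
dataset = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The sky is blue. Grass is green.",
                    "qas": [
                        {"id": "q1", "question": "What color is the sky?",
                         "answers": [{"text": "blue", "answer_start": 11}]},
                        {"id": "q2", "question": "What color is grass?",
                         "answers": [{"text": "green", "answer_start": 26}]},
                    ],
                }
            ],
        }
    ]
}
predictions = {"q1": "The Blue.", "q2": "red"}

print("Exact match: {:.1f}%".format(exact_match_score(dataset, predictions)))
```

In the notebook, `dataset` would come from calling `json.load` on the file at `mini_dataset_path`, and `predictions` from the `predictions.json` file in `checkpoint_dir`.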

## 4. Export the saved model

Using the TensorFlow Model Garden API, export the saved model using the checkpoint files that were generated during fine tuning.

In [None]:
import tensorflow as tf
from official.nlp.bert import bert_models
from official.nlp.bert import configs as bert_configs
from official.nlp.bert import model_saving_utils

tf.keras.mixed_precision.set_global_policy('float32')
bert_config_obj = bert_configs.BertConfig.from_json_file(bert_config)
squad_model, _ = bert_models.squad_model(bert_config_obj,
                                         max_seq_length,
                                         hub_module_url=tfhub_bert_encoder)

saved_model_dir = os.path.join(output_dir, "{}_saved_model".format(model_name))

if not os.path.exists(saved_model_dir):
    os.makedirs(saved_model_dir)

model_saving_utils.export_bert_model(saved_model_dir, model=squad_model, checkpoint_dir=checkpoint_dir)

## Citations

```
@misc{tensorflowmodelgarden2020,
  author = {Hongkun Yu and Chen Chen and Xianzhi Du and Yeqing Li and
            Abdullah Rashwan and Le Hou and Pengchong Jin and Fan Yang and
            Frederick Liu and Jaeyoun Kim and Jing Li},
  title = {{TensorFlow Model Garden}},
  howpublished = {\url{https://github.com/tensorflow/models}},
  year = {2020}
}

@article{2016arXiv160605250R,
       author = { {Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
                 Konstantin and {Liang}, Percy},
        title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
      journal = {arXiv e-prints},
         year = 2016,
          eid = {arXiv:1606.05250},
        pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
       eprint = {1606.05250},
}
```