<a href="https://colab.research.google.com/github/parker-erickson/BERT-question-answering/blob/main/question_answering_bot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###### Parker Erickson
###### Ryan Beck
###### Cassandra Cabrera
###### 12/18/2020




# **Question Answering with BERT**

## **Introduction**


___
***BERT***

>BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by conditioning on both the left and right context in every layer of the network. Through fine-tuning, BERT can produce highly accurate models for many problems without requiring architecture specific to each task. In this project, it is used to build a question answering model that takes in a question and a context passage and responds with the best-fitting span of text.

>Developed by Google and introduced in the paper https://arxiv.org/pdf/1810.04805.pdf,
BERT builds on similar work such as OpenAI GPT and ELMo. With the shift from LSTM models to transformers, models such as BERT are increasing in popularity, as they are generally faster to train and more accurate than earlier LSTM-based models. The model architecture of BERT is a multi-layer Transformer encoder that uses bidirectional self-attention. This means that every token can attend to every other token in the input, on both its left and its right, which is why it excels at language processing tasks. Language is, after all, a sequence of words whose meaning depends on the words around them.
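
As a rough illustration of bidirectional self-attention, the sketch below computes attention weights for a handful of random token vectors with NumPy. It is a simplification and not BERT's actual implementation: the function name `self_attention` is hypothetical, and the learned query, key and value projections and the multiple heads are left out. The point is only that each row of the weight matrix spreads attention over every position, left and right.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (illustration only).

    x has shape (seq_len, hidden_size); returns the attended vectors and the
    (seq_len, seq_len) matrix of attention weights.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])       # similarity of every token pair
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 made-up token vectors of size 8
attended, weights = self_attention(tokens)
print(weights.round(2))  # each row sums to 1 and covers all 5 positions
```

No entry of `weights` is masked out, so the token in the middle of the sequence draws on the tokens before it and after it at the same time.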
___
***DATASET***

>For the dataset, the Stanford Question Answering Dataset 2.0 (SQuAD 2.0) is used to train the BERT model. It contains more than 100,000 question and answer pairs, as well as over 50,000 questions that have no answer. This adds extra complexity, as the model must identify whether or not a question has a definite answer.

_______________________________________________________
BERT paper: https://arxiv.org/pdf/1810.04805.pdf

SQuAD 2.0: https://rajpurkar.github.io/SQuAD-explorer/
________________________________________________________

## **Setup**



### **Imports**

In [None]:
!pip install transformers
import requests
import random
import gc
import os
import re
import json
import string
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras import layers
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer, TFBertModel, BertConfig

max_len = 384
configuration = BertConfig()  # default parameters and configuration for BERT

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/50/0c/7d5950fcd80b029be0a8891727ba21e0cd27692c407c51261c3c921f6da3/transformers-4.1.1-py3-none-any.whl (1.5MB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=ea7bfb1

### **Retrieve Data**

The data was downloaded from the SQuAD website and uploaded to GitHub.

Initially there was an issue with the file size limits on GitHub, as the initial commit cannot be above 25MB while the overall limit is 100MB. With the dataset file being over 40MB, this initially posed a problem, but Cassandra was able to upload the data.

The data loads into the notebook quickly and contains titles of topics, each with context paragraphs and the questions that fall under that topic. A question can have multiple answers or be impossible to answer. Each answer also records the character index at which it starts in its context.
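
To make that layout concrete, here is a small sketch that walks a hand-written record shaped like the SQuAD 2.0 JSON. The sample content, the topic title and the helper `summarize` are made up for illustration; only the key names (`data`, `paragraphs`, `context`, `qas`, `answers`, `plausible_answers`, `is_impossible`, `answer_start`) come from the dataset format used in this notebook.

```python
# Hypothetical one-paragraph sample in the SQuAD 2.0 layout (not real dataset content)
context = "The Dutch returned the island to England in 1674."
sample = {
    "data": [{
        "title": "Example_Topic",
        "paragraphs": [{
            "context": context,
            "qas": [
                {
                    "question": "In what year was the island returned to England?",
                    "is_impossible": False,
                    "answers": [{"text": "1674", "answer_start": context.index("1674")}],
                },
                {
                    "question": "In what year was the island returned to France?",
                    "is_impossible": True,
                    "answers": [],
                    "plausible_answers": [{"text": "1674", "answer_start": context.index("1674")}],
                },
            ],
        }],
    }]
}

def summarize(raw):
    """Return (question, first answer text or None, is_impossible) per question."""
    rows = []
    for item in raw["data"]:
        for para in item["paragraphs"]:
            for qa in para["qas"]:
                answers = qa["answers"]
                text = answers[0]["text"] if answers else None
                rows.append((qa["question"], text, qa["is_impossible"]))
    return rows

rows = summarize(sample)
for row in rows:
    print(row)
```

The `answer_start` value is a character offset into `context`, so slicing the context from that offset recovers the answer text.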

In [None]:
train_url = 'https://raw.githubusercontent.com/cass-cabrera/data/master/train-v2.0.json'
dev_url = 'https://raw.githubusercontent.com/cass-cabrera/data/master/dev-v2.0.json'
# r = requests.get(train_url).json()
train_path = keras.utils.get_file("train.json", train_url)
eval_path = keras.utils.get_file("dev.json", dev_url)


Downloading data from https://raw.githubusercontent.com/cass-cabrera/data/master/train-v2.0.json
Downloading data from https://raw.githubusercontent.com/cass-cabrera/data/master/dev-v2.0.json


In [None]:
with open(train_path) as f:
  raw_train_data = json.load(f)

with open(eval_path) as f:
  raw_eval_data = json.load(f)

## **Set-up BERT tokenizer**


When pre-training a BERT model, WordPiece tokenization is applied and about 15% of the tokens are masked by default. Tokenization splits the text into subword units and maps each one to an integer ID from the vocabulary; the model then embeds these IDs as high-dimensional vectors.
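
The sketch below shows the greedy longest-match-first idea behind WordPiece on a single word. It is a toy version under stated assumptions: the function `wordpiece` and the five-entry vocabulary are invented for illustration, whereas the real tokenizer loads the roughly 30,000-entry `vocab.txt` saved further down.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces, always taking the longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"token", "##izer", "##s", "play", "##ing"}  # made-up vocabulary
print(wordpiece("tokenizers", toy_vocab))
print(wordpiece("playing", toy_vocab))
```

Rare words are broken into smaller known pieces instead of being mapped to a single unknown token, which keeps the vocabulary small.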

The word masking is important for the bidirectional encoding that BERT uses. If words were not masked during pre-training, each word could indirectly see itself through the surrounding context, and the model could predict it trivially instead of learning from that context. Random masking therefore forces each token's representation to draw on both the left and right sides of the sentence being trained on.

This helps the model predict better answers when it comes to question answering.


In [None]:
# Save the slow pretrained tokenizer
slow_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
save_path = "bert_base_uncased/"
if not os.path.exists(save_path):
    os.makedirs(save_path)
slow_tokenizer.save_pretrained(save_path)

# Load the fast tokenizer from saved file
tokenizer = BertWordPieceTokenizer("bert_base_uncased/vocab.txt", lowercase=True)





## **Preprocessing and Data Exploration**

---



In the preprocessing and data exploration step, the SQuAD 2.0 data is cleaned and tokenized, and each answer is mapped to the token indexes where it starts and ends in its context. This is essential for the training process because, as described earlier, the character data must be converted into numerical representations.
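
The mapping from a character span to a token span can be sketched as follows. This is an illustration only: `whitespace_offsets` is a hypothetical stand-in for the `offsets` that the WordPiece tokenizer returns, and `char_span_to_token_span` is an invented helper that selects the tokens overlapping the answer characters.

```python
import re

def whitespace_offsets(text):
    """(start, end) character offsets of each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """First and last token index overlapping [start_char, end_char), or None."""
    hits = [i for i, (s, e) in enumerate(offsets) if s < end_char and e > start_char]
    return (hits[0], hits[-1]) if hits else None

context = "the Dutch returned the island to England in 1674."
offsets = whitespace_offsets(context)

answer = "England"
start_char = context.index(answer)
span = char_span_to_token_span(offsets, start_char, start_char + len(answer))
print(span)
```

The resulting pair of token indexes plays the role of the `start_token_idx` and `end_token_idx` targets that the model is trained to predict.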

In [None]:
class SquadExample:
  def __init__(self, question, context, start_char_idx, answer_text, all_answers, is_impossible):
    self.question = question
    self.context = context
    self.start_char_idx = start_char_idx
    self.answer_text = answer_text
    self.all_answers = all_answers
    self.is_impossible = is_impossible
    self.skip = False

  def preprocess(self):
    context = self.context
    question = self.question
    answer_text = self.answer_text
    start_char_idx = self.start_char_idx

    # Clean context, answer and question
    context = " ".join(str(context).split())
    question = " ".join(str(question).split())
    answer = " ".join(str(answer_text).split())

    # Find end character index of answer in context
    end_char_idx = start_char_idx + len(answer)
    if end_char_idx >= len(context):
      self.skip = True
      return

    # Mark the character indexes in context that are in answer
    is_char_in_ans = [0] * len(context)
    for idx in range(start_char_idx, end_char_idx):
      is_char_in_ans[idx] = 1

    # Tokenize context
    tokenized_context = tokenizer.encode(context)

    # Find tokens that were created from answer characters
    ans_token_idx = []
    for idx, (start, end) in enumerate(tokenized_context.offsets):
      if sum(is_char_in_ans[start:end]) > 0:
        ans_token_idx.append(idx)

    if len(ans_token_idx) == 0:
      self.skip = True
      return

    # Find start and end token index for tokens from answer
    start_token_idx = ans_token_idx[0]
    end_token_idx = ans_token_idx[-1]

    # Tokenize question
    tokenized_question = tokenizer.encode(question)

    # Create inputs
    input_ids = tokenized_context.ids + tokenized_question.ids[1:]
    token_type_ids = [0] * len(tokenized_context.ids) + [1] * len(
        tokenized_question.ids[1:]
    )
    attention_mask = [1] * len(input_ids)

    # Pad and create attention masks.
    # Skip if truncation is needed
    padding_length = max_len - len(input_ids)
    if padding_length > 0:  # pad
      input_ids = input_ids + ([0] * padding_length)
      attention_mask = attention_mask + ([0] * padding_length)
      token_type_ids = token_type_ids + ([0] * padding_length)
    elif padding_length < 0:  # skip
      self.skip = True
      return

    self.input_ids = input_ids
    self.token_type_ids = token_type_ids
    self.attention_mask = attention_mask
    self.start_token_idx = start_token_idx
    self.end_token_idx = end_token_idx
    self.context_token_to_char = tokenized_context.offsets


In [None]:
def create_squad_examples(raw_data):
  global is_impossible
  global plausible_answers
  global no_answer
  squad_examples = []
  for item in raw_data["data"]:
    for para in item["paragraphs"]:
      context = para["context"]
      for qa in para["qas"]:
        question = qa["question"]
        if not qa["is_impossible"]:
          answer_text = qa["answers"][0]["text"]
          all_answers = [_["text"] for _ in qa["answers"]]
          start_char_idx = qa["answers"][0]["answer_start"]
        elif "plausible_answers" in qa and len(qa['plausible_answers']) > 0:
          answer_text = qa["plausible_answers"][0]["text"]
          all_answers = [_["text"] for _ in qa["plausible_answers"]]
          start_char_idx = qa["plausible_answers"][0]["answer_start"]
          is_impossible += 1
          plausible_answers += 1
        else:
          is_impossible +=1
          no_answer +=1
          continue
        squad_eg = SquadExample(
            question, context, start_char_idx, answer_text, all_answers, qa["is_impossible"]
        )
        squad_eg.preprocess()
        squad_examples.append(squad_eg)
  return squad_examples

In [None]:
def create_inputs_targets(squad_examples):
  dataset_dict = {
      "input_ids": [],
      "token_type_ids": [],
      "attention_mask": [],
      "start_token_idx": [],
      "end_token_idx": [],
  }
  for item in squad_examples:
    if not item.skip:
      for key in dataset_dict:
        dataset_dict[key].append(getattr(item, key))
  for key in dataset_dict:
    dataset_dict[key] = np.array(dataset_dict[key])

  x = [
      dataset_dict["input_ids"],
      dataset_dict["token_type_ids"],
      dataset_dict["attention_mask"],
  ]
  y = [dataset_dict["start_token_idx"], dataset_dict["end_token_idx"]]
  return x, y

After creating SQuAD examples from the raw data, we can see that we initially have around 130,000 training records and 12,000 evaluation records. Each question in the dataset carries a boolean flag, `is_impossible`, that marks it as impossible to answer. We counted the flagged questions across both the training and evaluation sets, and then checked how many of them come with a suggested plausible answer and how many have no answer at all.

In [None]:
is_impossible = 0
plausible_answers = 0
no_answer = 0

train_squad_examples = create_squad_examples(raw_train_data)
print(f"{len(train_squad_examples)} training data records to begin with.")

eval_squad_examples = create_squad_examples(raw_eval_data)
print(f"{len(eval_squad_examples)} evaluation data records to begin with.")

print(f"{is_impossible} of questions are considered impossible to answer.")
print(f"{plausible_answers} of the impossible questions are suggested to have a plausible answer.")
print(f"{no_answer} of the impossible questions have no possible answer.")

130319 training data records to begin with.
11858 evaluation data records to begin with.
49443 of questions are considered impossible to answer.
49428 of the impossible questions are suggested to have a plausible answer.
15 of the impossible questions have no possible answer.


Because so few questions have no possible answer at all, we decided to remove these from our dataset, along with any records whose tokenized input exceeds our maximum length of 384 tokens (`max_len`).
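
The length rule can be sketched in isolation like this. The helper `pad_or_skip` is hypothetical and the token IDs are illustrative; it mirrors the pad-or-skip branch of `SquadExample.preprocess` above, where short inputs are padded with zeros and inputs longer than `max_len` are dropped.

```python
def pad_or_skip(input_ids, max_len=384, pad_id=0):
    """Pad input_ids to max_len, or return None when the input is too long."""
    padding_length = max_len - len(input_ids)
    if padding_length < 0:
        return None  # too long: the record would be skipped
    attention_mask = [1] * len(input_ids) + [0] * padding_length
    return input_ids + [pad_id] * padding_length, attention_mask

print(pad_or_skip([101, 7592, 102], max_len=5))
print(pad_or_skip(list(range(6)), max_len=5))
```

The attention mask is 1 for real tokens and 0 for padding, so the model ignores the padded positions.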

In [None]:
x_train, y_train = create_inputs_targets(train_squad_examples)
print(f"{np.array(x_train).shape[1]} training points created.")

x_eval, y_eval = create_inputs_targets(eval_squad_examples)
print(f"{np.array(x_eval).shape[1]} evaluation points created.")

128178 training points created.
11587 evaluation points created.


In [None]:
for x in random.sample(train_squad_examples, 3):
  print("Question: ", x.question)
  print("Context: ", x.context)
  print("Starting Answer Index: ", x.start_char_idx)
  print("Answer: ", x.answer_text)
  print("All Possible Answers: ", x.all_answers)
  print("Is Impossible: ", x.is_impossible, "\n")

Question:  In what year did the Dutch give New York back to the English?
Context:  On August 24, 1673, Dutch captain Anthonio Colve took over the colony of New York from England and rechristened it "New Orange" to honor the Prince of Orange, King William III. However, facing defeat from the British and French, who had teamed up to destroy Dutch trading routes, the Dutch returned the island to England in 1674.
Starting Answer Index:  324
Answer:  1674
All Possible Answers:  ['1674']
Is Impossible:  False 

Question:  One rod in the second box from the right is what number?
Context:  Mathematics: From the earliest the Chinese used a positional decimal system on counting boards in order to calculate. To express 10, a single rod is placed in the second box from the right. The spoken language uses a similar system to English: e.g. four thousand two hundred seven. No symbol was used for zero. By the 1st century BC, negative numbers and decimal fractions were in use and The Nine Chapters on t

## **Model Creation**

The model is defined with three input layers (input IDs, token type IDs, and the attention mask), which are required by the TFBertModel class from the transformers library. The first model we run uses default parameters, the most important being the BertConfig, which defines the attributes of the model itself. Some notable parameters of the config include the number of hidden layers and the dropout rate. On top of the BERT encoder, two dense layers produce the start and end logits for every token position; a softmax then normalizes these into probabilities of the answer beginning or ending at each index. Loss is calculated with SparseCategoricalCrossentropy.
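
As a worked example of that last step, the sketch below turns a few made-up start logits into probabilities and computes the sparse categorical cross-entropy for one target index. It uses plain Python rather than Keras, and the logit values and function names are assumptions chosen only for illustration.

```python
import math

def softmax(logits):
    """Normalize a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, target_idx):
    """Negative log-probability assigned to the true index."""
    return -math.log(probs[target_idx])

start_logits = [0.5, 2.0, 0.1, -1.0]  # one made-up logit per token position
start_probs = softmax(start_logits)
loss = sparse_categorical_crossentropy(start_probs, target_idx=1)

# With no preference between 4 positions, the loss equals log(4)
uniform_loss = sparse_categorical_crossentropy(softmax([0.0] * 4), target_idx=2)
print(start_probs, loss, uniform_loss)
```

The target is a single integer index rather than a one-hot vector, which is why the sparse variant of the loss fits the `start_token_idx` and `end_token_idx` labels.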



In [None]:
def create_model(bert_config=configuration):
  ## BERT encoder
  encoder = TFBertModel.from_pretrained("bert-base-uncased", config=bert_config)

  ## QA Model
  input_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
  token_type_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
  attention_mask = layers.Input(shape=(max_len,), dtype=tf.int32)
  embedding = encoder(
    input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask
  )[0]

  start_logits = layers.Dense(1, name="start_logit", use_bias=False)(embedding)
  start_logits = layers.Flatten()(start_logits)

  end_logits = layers.Dense(1, name="end_logit", use_bias=False)(embedding)
  end_logits = layers.Flatten()(end_logits)

  start_probs = layers.Activation(keras.activations.softmax)(start_logits)
  end_probs = layers.Activation(keras.activations.softmax)(end_logits)

  model = keras.Model(
    inputs=[input_ids, token_type_ids, attention_mask],
    outputs=[start_probs, end_probs],
  )
  loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
  optimizer = keras.optimizers.Adam(learning_rate=5e-5)
  model.compile(optimizer=optimizer, loss=[loss, loss], metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")])
  return model

In [None]:
K.clear_session()
use_tpu = True
if use_tpu:
  # Create distribution strategy
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
  tf.config.experimental_connect_to_cluster(tpu)
  tf.tpu.experimental.initialize_tpu_system(tpu)
  strategy = tf.distribute.experimental.TPUStrategy(tpu)

  # Create model
  with strategy.scope():
    model = create_model()
else:
  model = create_model()

model.summary()

INFO:absl:Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
INFO:tensorflow:Initializing the TPU system: grpc://10.83.19.66:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f69fced4e58> is not a module, class, method, function, traceback, frame, or code object


Cause: while/else statement not yet supported
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 384)]        0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 384)]        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 384)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     TFBaseModelOutputWit 109482240   input_1[0][0]                    
                                                

## **Evaluation Callback**

This callback runs at the end of every epoch. Its main purpose is to test the model on the evaluation data after each epoch, as opposed to only at the end of training, so we can see how it improves as training continues. The evaluation data is tested in two ways. The first checks whether the predicted text is an exact match of one of the target texts after normalization. The second is a looser check of whether the target is contained in the prediction; in the code, it is counted when the start of the target answer falls inside the predicted character span. We found many examples where the model would predict something like “in Normandy, France”, but the dataset would have the target value as just “France”.

Aside from testing the evaluation data, the callback also returns some data in an output dataframe. This includes the normalized and decoded predicted and target values in each row, along with other related data. This dataframe can be used for diagnosing some issues as well as simply getting a visual look at the predicted values in plain text.
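
The difference between the two checks can be seen on the “in Normandy, France” case. The sketch below copies the notebook's `normalize_text` so that it runs on its own; the substring test at the end is a simplified stand-in for the looser check and is not the exact span comparison the callback performs.

```python
import re
import string

def normalize_text(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    exclude = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in exclude)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

prediction = "in Normandy, France"
targets = ["France"]

norm_pred = normalize_text(prediction)
norm_targets = [normalize_text(t) for t in targets]

exact_match = norm_pred in norm_targets                  # strict check
contains_target = any(t in norm_pred for t in norm_targets)  # looser check
print(norm_pred, exact_match, contains_target)
```

A prediction like this one is penalized by the exact match score even though it contains the right answer, which is why both scores are reported.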


In [None]:
def normalize_text(text):
  text = text.lower()

  # Remove punctuations
  exclude = set(string.punctuation)
  text = "".join(ch for ch in text if ch not in exclude)

  # Remove articles
  regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
  text = re.sub(regex, " ", text)

  # Remove extra white space
  text = " ".join(text.split())
  return text


class DiagnosticCallback(keras.callbacks.Callback):

  def __init__(self, x_eval, y_eval, num_epochs):
    super().__init__()
    self.x_eval = x_eval
    self.y_eval = y_eval
    self.num_epochs = num_epochs
    self.q_and_a = []

  def on_epoch_end(self, epoch, logs=None):
    pred_start, pred_end = self.model.predict(self.x_eval)
    count = 0
    count2 = 0
    eval_examples_no_skip = [_ for _ in eval_squad_examples if not _.skip]
    for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
      squad_eg = eval_examples_no_skip[idx]
      offsets = squad_eg.context_token_to_char
      start = np.argmax(start)
      end = np.argmax(end)
      if start >= len(offsets):
        continue
      pred_char_start = offsets[start][0]
      if end < len(offsets):
        pred_char_end = offsets[end][1]
      else:
        # Predicted end lies beyond the context tokens: read to the end of the context
        pred_char_end = len(squad_eg.context)
      pred_ans = squad_eg.context[pred_char_start:pred_char_end]

      normalized_pred_ans = normalize_text(pred_ans)
      normalized_true_ans = [normalize_text(_) for _ in squad_eg.all_answers]

      if normalized_pred_ans in normalized_true_ans:
        count += 1
        exact = True
      else:
        exact = False

      real_start = squad_eg.start_char_idx
      if real_start >= pred_char_start and real_start < pred_char_end:
        count2 += 1

      if epoch + 1 == self.num_epochs:
        self.q_and_a.append([squad_eg.question, normalized_pred_ans, normalized_true_ans, exact, squad_eg.is_impossible])


    self.output = pd.DataFrame(self.q_and_a, columns=["question", "prediction", "target", "exact_match", "is_impossible"])
    acc = count / len(self.y_eval[0])
    acc2 = count2 / len(self.y_eval[0])
    print(f"\nepoch={epoch+1}, exact match score={acc:.2f}")
    print(f"\nepoch={epoch+1}, answer is a substring of context score={acc2:.2f}")

In [None]:
num_epochs = 3
diagnostic_callback = DiagnosticCallback(x_eval, y_eval, num_epochs)
history = model.fit(
    x_train,
    y_train,
    epochs=num_epochs, 
    verbose=2,
    batch_size=32,
    callbacks=[diagnostic_callback],
)

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




4006/4006 - 542s - loss: 2.5477 - activation_loss: 1.3288 - activation_1_loss: 1.2189 - activation_acc: 0.6212 - activation_1_acc: 0.6621





epoch=1, exact match score=0.66

epoch=1, answer is a substring of context score=0.72
Epoch 2/3
4006/4006 - 426s - loss: 1.6550 - activation_loss: 0.8733 - activation_1_loss: 0.7817 - activation_acc: 0.7282 - activation_1_acc: 0.7684

epoch=2, exact match score=0.66

epoch=2, answer is a substring of context score=0.72
Epoch 3/3
4006/4006 - 426s - loss: 1.2057 - activation_loss: 0.6403 - activation_1_loss: 0.5654 - activation_acc: 0.7908 - activation_1_acc: 0.8253

epoch=3, exact match score=0.65

epoch=3, answer is a substring of context score=0.74


We wanted to perform a grid search on our TFBertModel to find the best hyperparameters for its configuration object, BertConfig. We initially attempted to do this using scikit-learn's GridSearchCV, but found that it would not work with the transformer model and the data format we are using. This led us to look into Hugging Face's TFTrainer with Ray Tune, which is intended for fine-tuning Transformers, but again our data format was not suitable for TFTrainer. We finally decided to search exhaustively through our selected hyperparameters (num_hidden_layers, hidden_dropout_prob). While this is less efficient than a built-in tool would have been, we found that 10 hidden layers and a hidden dropout probability of 0.1 achieved an accuracy of 82%, the best accuracy obtained throughout our manual search. Note that this is the training accuracy of the end-token output (`activation_1_acc`), which is what the search loop compares.
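
The exhaustive search can be written generically with `itertools.product`, as in the sketch below. Both `grid_search` and `toy_score` are hypothetical: `toy_score` is a placeholder that returns made-up numbers so the example runs instantly, and it does not represent any result from this project. In the notebook, the score comes from training a model for each configuration.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and keep the best-scoring one."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

def toy_score(num_hidden_layers, hidden_dropout_prob):
    """Placeholder score (not a real accuracy): favors 10 layers and low dropout."""
    return -abs(num_hidden_layers - 10) - hidden_dropout_prob

param_grid = {"num_hidden_layers": [10, 12, 15, 18], "hidden_dropout_prob": [0.1, 0.3, 0.5]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

With 4 layer counts and 3 dropout values, the search visits 12 configurations, so the cost grows with the product of the list lengths.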

In [None]:

# Defaults: num_hidden_layers=12, hidden_dropout_prob=0.1
params = {'num_hidden_layers': [10, 12, 15, 18], 'hidden_dropout_prob': [0.1, 0.3, 0.5]}
best_acc = 0
best_acc_params = None
num_epochs = 3
for n_layers in params['num_hidden_layers']:
  for dropout in params['hidden_dropout_prob']:
    K.clear_session()
    config = BertConfig(num_hidden_layers=n_layers, hidden_dropout_prob=dropout)
    with strategy.scope():
      model = create_model(config)

    # Only training metrics are tracked here; the diagnostic callback is not used
    history = model.fit(
      x_train,
      y_train,
      epochs=num_epochs,
      verbose=2,
      batch_size=32,
    )

    # Best end-token training accuracy reached by this configuration
    curr_acc = max(history.history['activation_1_acc'])
    if curr_acc > best_acc:
      best_acc = curr_acc
      best_acc_params = (n_layers, dropout)
    # Release the model before collecting garbage so its memory can be freed
    del model
    gc.collect()



Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls', 'bert/encoder/layer_._11/output/LayerNorm/gamma:0', 'bert/encoder/layer_._10/intermediate/dense/kernel:0', 'bert/encoder/layer_._10/attention/self/query/bias:0', 'bert/encoder/layer_._11/output/dense/bias:0', 'bert/encoder/layer_._11/output/LayerNorm/beta:0', 'bert/encoder/layer_._11/attention/output/dense/kernel:0', 'bert/encoder/layer_._11/attention/self/key/bias:0', 'bert/encoder/layer_._10/intermediate/dense/bias:0', 'bert/encoder/layer_._11/attention/self/query/bias:0', 'bert/encoder/layer_._10/output/dense/kernel:0', 'bert/encoder/layer_._10/output/LayerNorm/beta:0', 'bert/encoder/layer_._10/attention/self/query/kernel:0', 'bert/encoder/layer_._10/attention/self/value/bias:0', 'bert/encoder/layer_._10/output/LayerNorm/gamma:0', 'bert/encoder/layer_._11/attention/self/value/bias:0', 'bert/encoder/layer_._10/output/dense/bias:0', 'bert/encoder/layer_._1

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f1383d31e58> is not a module, class, method, function, traceback, frame, or code object


Cause: while/else statement not yet supported
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 473s - loss: 2.5652 - activation_loss: 1.3405 - activation_1_loss: 1.2248 - activation_acc: 0.6178 - activation_1_acc: 0.6582
Epoch 2/3
4006/4006 - 363s - loss: 1.6390 - activation_loss: 0.8663 - activation_1_loss: 0.7727 - activation_acc: 0.7318 - activation_1_acc: 0.7709
Epoch 3/3
4006/4006 - 364s - loss: 1.1620 - activation_loss: 0.6169 - activation_1_loss: 0.5451 - activation_acc: 0.7966 - activation_1_acc: 0.8296


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls', 'bert/encoder/layer_._11/output/LayerNorm/gamma:0', 'bert/encoder/layer_._10/intermediate/dense/kernel:0', 'bert/encoder/layer_._10/attention/self/query/bias:0', 'bert/encoder/layer_._11/output/dense/bias:0', 'bert/encoder/layer_._11/output/LayerNorm/beta:0', 'bert/encoder/layer_._11/attention/output/dense/kernel:0', 'bert/encoder/layer_._11/attention/self/key/bias:0', 'bert/encoder/layer_._10/intermediate/dense/bias:0', 'bert/encoder/layer_._11/attention/self/query/bias:0', 'bert/encoder/layer_._10/output/dense/kernel:0', 'bert/encoder/layer_._10/output/LayerNorm/beta:0', 'bert/encoder/layer_._10/attention/self/query/kernel:0', 'bert/encoder/layer_._10/attention/self/value/bias:0', 'bert/encoder/layer_._10/output/LayerNorm/gamma:0', 'bert/encoder/layer_._11/attention/self/value/bias:0', 'bert/encoder/layer_._10/output/dense/bias:0', 'bert/encoder/layer_._1

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 473s - loss: 3.0614 - activation_loss: 1.5955 - activation_1_loss: 1.4659 - activation_acc: 0.5570 - activation_1_acc: 0.5971
Epoch 2/3
4006/4006 - 365s - loss: 2.1836 - activation_loss: 1.1467 - activation_1_loss: 1.0369 - activation_acc: 0.6580 - activation_1_acc: 0.6989
Epoch 3/3
4006/4006 - 364s - loss: 1.8360 - activation_loss: 0.9678 - activation_1_loss: 0.8682 - activation_acc: 0.7004 - activation_1_acc: 0.7410


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls', 'bert/encoder/layer_._11/output/LayerNorm/gamma:0', 'bert/encoder/layer_._10/intermediate/dense/kernel:0', 'bert/encoder/layer_._10/attention/self/query/bias:0', 'bert/encoder/layer_._11/output/dense/bias:0', 'bert/encoder/layer_._11/output/LayerNorm/beta:0', 'bert/encoder/layer_._11/attention/output/dense/kernel:0', 'bert/encoder/layer_._11/attention/self/key/bias:0', 'bert/encoder/layer_._10/intermediate/dense/bias:0', 'bert/encoder/layer_._11/attention/self/query/bias:0', 'bert/encoder/layer_._10/output/dense/kernel:0', 'bert/encoder/layer_._10/output/LayerNorm/beta:0', 'bert/encoder/layer_._10/attention/self/query/kernel:0', 'bert/encoder/layer_._10/attention/self/value/bias:0', 'bert/encoder/layer_._10/output/LayerNorm/gamma:0', 'bert/encoder/layer_._11/attention/self/value/bias:0', 'bert/encoder/layer_._10/output/dense/bias:0', 'bert/encoder/layer_._1

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 474s - loss: 4.6505 - activation_loss: 2.4085 - activation_1_loss: 2.2421 - activation_acc: 0.3824 - activation_1_acc: 0.4174
Epoch 2/3
4006/4006 - 365s - loss: 3.1426 - activation_loss: 1.6462 - activation_1_loss: 1.4963 - activation_acc: 0.5399 - activation_1_acc: 0.5837
Epoch 3/3
4006/4006 - 367s - loss: 2.7620 - activation_loss: 1.4494 - activation_1_loss: 1.3126 - activation_acc: 0.5851 - activation_1_acc: 0.6269


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config o

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 551s - loss: 2.5507 - activation_loss: 1.3324 - activation_1_loss: 1.2182 - activation_acc: 0.6199 - activation_1_acc: 0.6600
Epoch 2/3
4006/4006 - 431s - loss: 1.6576 - activation_loss: 0.8761 - activation_1_loss: 0.7815 - activation_acc: 0.7284 - activation_1_acc: 0.7683
Epoch 3/3
4006/4006 - 431s - loss: 1.2229 - activation_loss: 0.6511 - activation_1_loss: 0.5718 - activation_acc: 0.7882 - activation_1_acc: 0.8251


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config o

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 553s - loss: 2.9733 - activation_loss: 1.5523 - activation_1_loss: 1.4211 - activation_acc: 0.5650 - activation_1_acc: 0.6095
Epoch 2/3
4006/4006 - 431s - loss: 2.1570 - activation_loss: 1.1329 - activation_1_loss: 1.0240 - activation_acc: 0.6611 - activation_1_acc: 0.7035
Epoch 3/3
4006/4006 - 430s - loss: 1.8241 - activation_loss: 0.9638 - activation_1_loss: 0.8603 - activation_acc: 0.7014 - activation_1_acc: 0.7439


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config o

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 552s - loss: 4.5831 - activation_loss: 2.3664 - activation_1_loss: 2.2167 - activation_acc: 0.3912 - activation_1_acc: 0.4254
Epoch 2/3
4006/4006 - 430s - loss: 3.0859 - activation_loss: 1.6069 - activation_1_loss: 1.4790 - activation_acc: 0.5478 - activation_1_acc: 0.5907
Epoch 3/3
4006/4006 - 431s - loss: 2.7290 - activation_loss: 1.4252 - activation_1_loss: 1.3038 - activation_acc: 0.5911 - activation_1_acc: 0.6312


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert/encoder/layer_._13/output/dense/kernel:0', 'bert/encoder/layer_._14/attention/self/value/kernel:0', 'bert/encoder/layer_._13/attention/output/dense/bias:0', 'bert/encoder/layer_._14/intermediate/dense/bias:0', 'bert/encoder/layer_._14/intermediate/dense/kernel:0', 'bert/encoder/laye

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 677s - loss: 2.5512 - activation_loss: 1.3339 - activation_1_loss: 1.2173 - activation_acc: 0.6201 - activation_1_acc: 0.6613
Epoch 2/3
4006/4006 - 529s - loss: 1.6662 - activation_loss: 0.8794 - activation_1_loss: 0.7868 - activation_acc: 0.7290 - activation_1_acc: 0.7677
Epoch 3/3
4006/4006 - 527s - loss: 1.2268 - activation_loss: 0.6500 - activation_1_loss: 0.5768 - activation_acc: 0.7888 - activation_1_acc: 0.8236


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert/encoder/layer_._13/output/dense/kernel:0', 'bert/encoder/layer_._14/attention/self/value/kernel:0', 'bert/encoder/layer_._13/attention/output/dense/bias:0', 'bert/encoder/layer_._14/intermediate/dense/bias:0', 'bert/encoder/layer_._14/intermediate/dense/kernel:0', 'bert/encoder/laye

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 679s - loss: 3.0131 - activation_loss: 1.5670 - activation_1_loss: 1.4461 - activation_acc: 0.5621 - activation_1_acc: 0.6043
Epoch 2/3
4006/4006 - 528s - loss: 2.1916 - activation_loss: 1.1518 - activation_1_loss: 1.0398 - activation_acc: 0.6576 - activation_1_acc: 0.7003
Epoch 3/3
4006/4006 - 530s - loss: 1.8583 - activation_loss: 0.9795 - activation_1_loss: 0.8788 - activation_acc: 0.6988 - activation_1_acc: 0.7405


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert/encoder/layer_._13/output/dense/kernel:0', 'bert/encoder/layer_._14/attention/self/value/kernel:0', 'bert/encoder/layer_._13/attention/output/dense/bias:0', 'bert/encoder/layer_._14/intermediate/dense/bias:0', 'bert/encoder/layer_._14/intermediate/dense/kernel:0', 'bert/encoder/laye

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 676s - loss: 4.5409 - activation_loss: 2.3439 - activation_1_loss: 2.1970 - activation_acc: 0.3944 - activation_1_acc: 0.4302
Epoch 2/3
4006/4006 - 528s - loss: 3.1329 - activation_loss: 1.6332 - activation_1_loss: 1.4998 - activation_acc: 0.5421 - activation_1_acc: 0.5853
Epoch 3/3
4006/4006 - 528s - loss: 2.7730 - activation_loss: 1.4488 - activation_1_loss: 1.3241 - activation_acc: 0.5862 - activation_1_acc: 0.6277


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert/encoder/layer_._17/attention/self/query/bias:0', 'bert/encoder/layer_._13/attention/output/dense/bias:0', 'bert/encoder/layer_._12/attention/self/query/kernel:0', 'bert/encoder/layer_._14/output/LayerNorm/gamma:0', 'bert/encoder/layer_._16/attention/output/dense/bias:0', 'bert/encod

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 803s - loss: 2.5623 - activation_loss: 1.3390 - activation_1_loss: 1.2234 - activation_acc: 0.6191 - activation_1_acc: 0.6607
Epoch 2/3
4006/4006 - 627s - loss: 1.6798 - activation_loss: 0.8864 - activation_1_loss: 0.7934 - activation_acc: 0.7284 - activation_1_acc: 0.7668
Epoch 3/3
4006/4006 - 628s - loss: 1.2325 - activation_loss: 0.6537 - activation_1_loss: 0.5788 - activation_acc: 0.7884 - activation_1_acc: 0.8215


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert/encoder/layer_._17/attention/self/query/bias:0', 'bert/encoder/layer_._13/attention/output/dense/bias:0', 'bert/encoder/layer_._12/attention/self/query/kernel:0', 'bert/encoder/layer_._14/output/LayerNorm/gamma:0', 'bert/encoder/layer_._16/attention/output/dense/bias:0', 'bert/encod

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 803s - loss: 3.0193 - activation_loss: 1.5728 - activation_1_loss: 1.4466 - activation_acc: 0.5612 - activation_1_acc: 0.6049
Epoch 2/3
4006/4006 - 627s - loss: 2.2379 - activation_loss: 1.1775 - activation_1_loss: 1.0604 - activation_acc: 0.6527 - activation_1_acc: 0.6963
Epoch 3/3
4006/4006 - 627s - loss: 1.8986 - activation_loss: 1.0003 - activation_1_loss: 0.8983 - activation_acc: 0.6944 - activation_1_acc: 0.7353


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert/encoder/layer_._17/attention/self/query/bias:0', 'bert/encoder/layer_._13/attention/output/dense/bias:0', 'bert/encoder/layer_._12/attention/self/query/kernel:0', 'bert/encoder/layer_._14/output/LayerNorm/gamma:0', 'bert/encoder/layer_._16/attention/output/dense/bias:0', 'bert/encod

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 804s - loss: 4.5108 - activation_loss: 2.3267 - activation_1_loss: 2.1841 - activation_acc: 0.3977 - activation_1_acc: 0.4366
Epoch 2/3
4006/4006 - 627s - loss: 3.1597 - activation_loss: 1.6443 - activation_1_loss: 1.5154 - activation_acc: 0.5401 - activation_1_acc: 0.5846
Epoch 3/3
4006/4006 - 628s - loss: 2.8050 - activation_loss: 1.4666 - activation_1_loss: 1.3384 - activation_acc: 0.5809 - activation_1_acc: 0.6257


In [None]:
print("Best accuracy found: ", best_acc)
print("Parameters for best accuracy: ", best_acc_params)

Best accuracy found:  0.8296
Parameters for best accuracy:  (10, 0.1)
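
The selection of `best_acc` and `best_acc_params` above can be sketched as a plain grid search over the number of hidden layers and the dropout probability. This is a minimal sketch and not the notebook's actual search loop: `grid_search`, `toy_score`, and the grid values are hypothetical, and `toy_score` stands in for a full training run that returns an accuracy.

```python
from itertools import product


def grid_search(layer_options, dropout_options, train_and_score):
    """Try every (num_hidden_layers, dropout) pair and keep the best one.

    train_and_score is any callable that trains a model for the given pair
    and returns a single accuracy value (an assumption for this sketch).
    """
    best_acc, best_params = float("-inf"), None
    for num_layers, dropout in product(layer_options, dropout_options):
        acc = train_and_score(num_layers, dropout)
        if acc > best_acc:
            best_acc, best_params = acc, (num_layers, dropout)
    return best_acc, best_params


def toy_score(num_layers, dropout):
    # Placeholder for a real training run: made-up formula, not measured data.
    return 1.0 - dropout - 0.01 * abs(num_layers - 10)


# Hypothetical grid; the real search values are not shown in this section.
best_acc, best_acc_params = grid_search([10, 12], [0.1, 0.3], toy_score)
print("Best accuracy found: ", best_acc)
print("Parameters for best accuracy: ", best_acc_params)
```

In a real run, `train_and_score` would build a `BertConfig`, fit the model, and return the accuracy to compare, which makes each grid point as expensive as one of the three-epoch runs logged above.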


In [None]:
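# Reset Keras state left over from the hyperparameter search.
K.clear_session()
# Rebuild the model with the best (num_hidden_layers, hidden_dropout_prob) pair found above.
config = BertConfig(num_hidden_layers=best_acc_params[0], hidden_dropout_prob=best_acc_params[1])
with strategy.scope():
  model = create_model(config)
num_epochs = 3
# Score predictions on the evaluation set after every epoch.
diagnostic_callback = DiagnosticCallback(x_eval, y_eval, num_epochs)

history = model.fit(
  x_train,
  y_train,
  epochs=num_epochs,
  verbose=2,
  batch_size=32,
  callbacks=[diagnostic_callback]
)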

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls', 'bert/encoder/layer_._11/output/dense/bias:0', 'bert/encoder/layer_._10/attention/output/dense/bias:0', 'bert/encoder/layer_._11/attention/self/value/kernel:0', 'bert/encoder/layer_._11/attention/output/dense/kernel:0', 'bert/encoder/layer_._10/attention/self/value/bias:0', 'bert/encoder/layer_._11/attention/output/LayerNorm/beta:0', 'bert/encoder/layer_._11/attention/self/key/bias:0', 'bert/encoder/layer_._10/attention/output/LayerNorm/beta:0', 'bert/encoder/layer_._10/output/LayerNorm/beta:0', 'bert/encoder/layer_._10/intermediate/dense/kernel:0', 'bert/encoder/layer_._11/attention/output/dense/bias:0', 'bert/encoder/layer_._11/attention/self/query/bias:0', 'bert/encoder/layer_._11/intermediate/dense/bias:0', 'bert/encoder/layer_._11/output/dense/kernel:0', 'bert/encoder/layer_._10/attention/self/query/bias:0', 'bert/encoder/layer_._11/attention/self/key/

Epoch 1/3


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.




The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.






4006/4006 - 468s - loss: 2.5742 - activation_loss: 1.3452 - activation_1_loss: 1.2290 - activation_acc: 0.6183 - activation_1_acc: 0.6578


The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.



epoch=1, exact match score=0.66

epoch=1, answer is a substring of context score=0.73
Epoch 2/3
4006/4006 - 365s - loss: 1.6373 - activation_loss: 0.8638 - activation_1_loss: 0.7735 - activation_acc: 0.7331 - activation_1_acc: 0.7698

epoch=2, exact match score=0.65

epoch=2, answer is a substring of context score=0.74
Epoch 3/3
4006/4006 - 365s - loss: 1.1646 - activation_loss: 0.6173 - activation_1_loss: 0.5473 - activation_acc: 0.7980 - activation_1_acc: 0.8288

epoch=3, exact match score=0.65

epoch=3, answer is a substring of context score=0.72


We can see here that the accuracy is unfortunately not much higher than the original accuracy. However, we believe our model is performing reasonably well on the SQuAD dataset: we have been able to reach up to 83% accuracy. The record for best accuracy on the Kaggle competition using this data is roughly 92%.
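
The two per-epoch scores printed during training can be illustrated with a small scoring sketch. The code that computes them is not shown in this section, so everything here is an assumption: `normalize_text`, `exact_match`, `lenient_match`, and `score` are hypothetical helpers, and the lenient score is taken to mean that the prediction contains, or is contained in, one of the reference answers. The three examples are made up for illustration.

```python
import re
import string


def normalize_text(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, targets):
    """True if the normalized prediction equals any normalized target."""
    pred = normalize_text(prediction)
    return any(pred == normalize_text(t) for t in targets)


def lenient_match(prediction, targets):
    """True if prediction and some target overlap by containment (assumed rule)."""
    pred = normalize_text(prediction)
    if not pred:
        return False
    for t in targets:
        ref = normalize_text(t)
        if ref and (pred in ref or ref in pred):
            return True
    return False


def score(examples):
    """Return (exact match rate, lenient match rate) over the examples."""
    n = len(examples)
    em = sum(exact_match(e["prediction"], e["targets"]) for e in examples)
    lenient = sum(lenient_match(e["prediction"], e["targets"]) for e in examples)
    return em / n, lenient / n


# Illustrative examples only; these are not rows from the evaluation set.
examples = [
    {"prediction": "the Eiffel Tower", "targets": ["Eiffel Tower"]},
    {"prediction": "in Paris, France", "targets": ["Paris"]},
    {"prediction": "London", "targets": ["Paris"]},
]
em_rate, lenient_rate = score(examples)
print(f"exact match score={em_rate:.2f}")
print(f"lenient score={lenient_rate:.2f}")
```

Under this reading the lenient score can never be lower than the exact match score, which is consistent with the gap between the two numbers in the logs.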

## Sample Output from the Final Model

In [None]:
output = diagnostic_callback.output
output.sample(20)

| Unnamed: 0 | question | prediction | target | exact_match | is_impossible |
|---|---|---|---|---|---|
| 8309 | What was the source of the mistake? | icsi report | [wwf report, ipcc from wwf report, wwf report] | False | False |
| 5318 | What public entity of learning is often target... | governmental | [universities, private universities, private u... | False | False |
| 10259 | When was the Russian Policy "Indigenization" ... | 1923 | [1923] | True | True |
| 2400 | What did Lavoisier perceive the air had lost a... | weight | [weight, weight, weight, weight, weight] | True | False |
| 10037 | Friedrich Ratzel thought imperialism was what ... | necessary | [geographical societies in europe, necessary f... | True | False |
| 8385 | Who was the author of the fourth assessment re... | michael oppenheimer | [michael oppenheimer] | True | True |
| 7529 | What discouraged cultural exchange under the ... | mongols extensive west asian and european cont... | [mongols extensive west asian and european] | False | True |
| 1140 | What kind of education does Victoria have? | public universities | [diversified] | False | True |
| 10657 | What South Korean car manufacturer purchased t... | daewoo | [daewoo, daewoo, daewoo] | True | False |
| 5038 | What are responsibilities pharmacy technicians... | patients prescriptions and patient safety issues | [patients prescriptions and patient safety iss... | True | True |
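
As a quick check on the sample above, the exact-match rate can be tallied separately for rows flagged `is_impossible` and for the rest. This sketch uses only the `exact_match` and `is_impossible` values of the ten rows shown; the helper name `exact_match_rate_by_flag` is ours, and ten rows are far too few to draw conclusions about the full evaluation set.

```python
from collections import defaultdict

# (exact_match, is_impossible) for the ten sampled rows, in the order shown.
rows = [
    (False, False), (False, False), (True, True), (True, False), (True, False),
    (True, True), (False, True), (False, True), (True, False), (True, True),
]


def exact_match_rate_by_flag(rows):
    """Group rows by is_impossible and return the exact-match rate per group."""
    totals = defaultdict(lambda: [0, 0])  # flag -> [hits, count]
    for exact_match, is_impossible in rows:
        totals[is_impossible][0] += int(exact_match)
        totals[is_impossible][1] += 1
    return {flag: hits / count for flag, (hits, count) in totals.items()}


rates = exact_match_rate_by_flag(rows)
for flag in sorted(rates):
    print(f"is_impossible={flag}: exact match rate={rates[flag]:.2f}")
```

The same grouping could be run on the full `output` frame to see whether the flagged questions are answered less reliably.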


## **Conclusion**


In the end, BERT proved itself to be a workhorse of a pre-trained model. Answering questions with high accuracy, along with the prospect of tasks such as language inference and text generation, seems at first glance to be something that would demand significant computing power, if a computer could do it at all. However, this notebook shows that Google's BERT is able to train an accurate model on a modest computer or cloud processor, reaching a training accuracy of roughly 80% to 83% across its two outputs, with a final-epoch exact match score of 0.65 on the evaluation data and a higher ‘close-enough score’ (reported in the logs as "answer is a substring of context") of 0.72. The accuracy would be higher without the addition of the ‘impossible’ answers, but those are important for the goal of this project, because they help bridge the gap between an AI and a program.

Overall, with fine-tuning, BERT’s ability to generate answers to the questions it is asked is impressive. However, there is room for improvement: this notebook did not reach the roughly 92% accuracy that some others have achieved on Kaggle with BERT for question answering. The analysis used to select the most effective number of hidden layers did increase our accuracy beyond the baseline, though only slightly. The data we chose is not as widely used as SQuAD 1.1 and posed some challenges in pre-processing and restructuring, but in the end it led to a deep learning model that is one step closer to the cutting edge than if we had used the tried and tested SQuAD 1.1.




## **Sources**

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/pdf/1810.04805.pdf

https://keras.io/examples/nlp/text_extraction_with_bert/

https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a
