# **Question Answering System using BERT + SQuAD 2.0 + BERT Official Script**

Based on Colab TPU + TensorFlow 1.x, using the official BERT script.

* [BERT Github](https://github.com/google-research/bert)
* [BERT Paper](https://arxiv.org/pdf/1810.04805.pdf)


## **Preparation**

### **Clone the Official BERT GitHub Repository**

In [0]:
!git clone https://github.com/google-research/bert.git

Cloning into 'bert'...
remote: Enumerating objects: 340, done.

In [0]:
cd bert

/content/bert


### **Set Up TPU Environment**

In [0]:
%tensorflow_version 1.x

import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('***** Using TPU: {} *****'.format(TPU_ADDRESS))

from google.colab import auth
auth.authenticate_user()

with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TensorFlow 1.x selected.
***** Using TPU: grpc://10.9.11.2:8470 *****
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 16821745434637512471),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 2462824988703932859),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 5301574468892394867),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 4955575301680682698),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 11891016979516753538),
 _DeviceAttributes(/jo

### **Set Up Google Cloud Storage (GCS)**

In [0]:
from google.colab import auth

auth.authenticate_user()

In [0]:
BUCKET = 'gzp-nlp-bert' # @param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
output_dir_name = 'bert_output' # @param {type:"string"}

BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET,output_dir_name)
tf.io.gfile.makedirs(OUTPUT_DIR)

print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

***** Model output directory: gs://gzp-nlp-bert/bert_output *****


### **Download the BERT Pretrained Model**

BERT pretrained model list:


* BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
* BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
* BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
* BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
* BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
* BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
* BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* BERT-Base, Multilingual Uncased (Orig, not recommended): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

BERT is released in BERT-Base and BERT-Large sizes, each in uncased and cased versions. Uncased means that the text is lowercased before WordPiece tokenization (e.g., `John Smith` becomes `john smith`) and that accent markers are stripped. Cased means that the true case and accent markers are preserved.
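
As a rough illustration of what "uncased" preprocessing does, the sketch below lowercases text and strips accent marks using only the standard library. The function name `uncased_normalize` is hypothetical. The official tokenizer also handles punctuation, whitespace and WordPiece splitting, none of which is attempted here.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase text and drop combining accent marks (illustrative only)."""
    text = text.lower()
    # NFD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("John Smith"))
print(uncased_normalize("Héllo Wörld"))
```

A cased model would skip both steps and pass the text to the tokenizer unchanged.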

Because training runs on a Cloud TPU, the pre-trained model and the output directory must be stored in Google Cloud Storage (GCS): the TPU is a remote Google Cloud service and cannot read the Colab VM's local disk.
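
One way to catch a local path before a long TPU job starts is a small guard like the sketch below. The helper name `require_gcs_uri` and the bucket name in the example are hypothetical, and the check only inspects the string; it does not verify that the bucket exists.

```python
def require_gcs_uri(path):
    """Return path unchanged if it looks like a gs:// URI, otherwise raise ValueError."""
    prefix = "gs://"
    if not path.startswith(prefix) or len(path) <= len(prefix):
        raise ValueError("Expected a gs:// URI for TPU access, got: {!r}".format(path))
    return path


# Hypothetical bucket name, used only for illustration.
print(require_gcs_uri("gs://my-bucket/bert_output"))
```

In this notebook the guard would apply to the model directory and the output directory; the SQuAD JSON files are read by the Colab host and can stay local.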

In [0]:
BERT_MODEL = 'uncased_L-24_H-1024_A-16' # @param {type:"string"}
assert BERT_MODEL, '*** Must specify a BERT Model name ***'

#### **Method 1: Download by Ourselves**

In [0]:
BERT_MODEL_ZIP='{}.zip'.format(BERT_MODEL)

In [0]:
# download BERT
!wget https://storage.googleapis.com/bert_models/2018_10_18/$BERT_MODEL_ZIP

# Unzip the pretrained model
!unzip $BERT_MODEL_ZIP

--2020-05-16 05:32:16--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.120.128, 2607:f8b0:4001:c16::80
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.120.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1247797031 (1.2G) [application/zip]
Saving to: ‘uncased_L-24_H-1024_A-16.zip’


2020-05-16 05:32:22 (187 MB/s) - ‘uncased_L-24_H-1024_A-16.zip’ saved [1247797031/1247797031]

Archive:  uncased_L-24_H-1024_A-16.zip
   creating: uncased_L-24_H-1024_A-16/
  inflating: uncased_L-24_H-1024_A-16/bert_model.ckpt.meta  
  inflating: uncased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001  
  inflating: uncased_L-24_H-1024_A-16/vocab.txt  
  inflating: uncased_L-24_H-1024_A-16/bert_model.ckpt.index  
  inflating: uncased_L-24_H-1024_A-16/bert_config.json  


Move Pretrained Model to GCS Bucket

In [0]:
!gsutil mv ./$BERT_MODEL $BUCKET_NAME

Copying file://./uncased_L-24_H-1024_A-16/bert_model.ckpt.meta [Content-Type=application/octet-stream]...
Removing file://./uncased_L-24_H-1024_A-16/bert_model.ckpt.meta...
Copying file://./uncased_L-24_H-1024_A-16/vocab.txt [Content-Type=text/plain]...
Removing file://./uncased_L-24_H-1024_A-16/vocab.txt...
Copying file://./uncased_L-24_H-1024_A-16/bert_config.json [Content-Type=application/json]...
Removing file://./uncased_L-24_H-1024_A-16/bert_config.json...
Copying file://./uncased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001 [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means 

In [0]:
BERT_DIR = '{}/{}'.format(BUCKET_NAME, BERT_MODEL)

#### **Method 2: Use Official Path in GCS**

In [0]:
# Uncomment to read the model directly from the official public bucket instead:
# BERT_DIR = 'gs://bert_models/2018_10_18/{}'.format(BERT_MODEL)

### **Download the SQuAD 2.0 Dataset**

For the question answering task, we use the SQuAD 2.0 dataset.

In [0]:
SQUAD_DIR='squad'

In [0]:
# Download the SQuAD 2.0 train and dev sets and the official evaluation script
!wget -P $SQUAD_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget -P $SQUAD_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
!wget https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ -O $SQUAD_DIR/evaluate-v2.0.py

--2020-05-16 05:33:06--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.108.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘squad/train-v2.0.json’


2020-05-16 05:33:07 (57.1 MB/s) - ‘squad/train-v2.0.json’ saved [42123633/42123633]

--2020-05-16 05:33:09--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.110.153, 185.199.111.153, 185.199.108.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘squad/dev-v2.0.json’


2020-05-16 05:33:09 (47.9 MB/s) - ‘squad/dev-v2.0.json’ saved [4370528/4370528]
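
To get a feel for the layout of the downloaded files, the sketch below walks a SQuAD 2.0 style dictionary and counts answerable and unanswerable questions. The function name `count_questions` and the tiny in-memory sample are made up for illustration; only the key names follow the dataset format.

```python
def count_questions(squad):
    """Count answerable and unanswerable questions in a SQuAD 2.0 style dict."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


# Tiny made-up sample in the same layout as the SQuAD 2.0 files.
sample = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Google was founded in 1998.",
            "qas": [
                {"id": "q1", "question": "When was Google founded?",
                 "is_impossible": False,
                 "answers": [{"text": "1998", "answer_start": 22}]},
                {"id": "q2", "question": "Who is the CEO?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}

print(count_questions(sample))
```

To run it on the real data, load `squad/dev-v2.0.json` with `json.load` and pass the result in; the two counts should correspond to the `HasAns_total` and `NoAns_total` values reported by the evaluation script.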



## **Fine-tuned Training**

### **Fine-tuning with BERT**

In [0]:
!python run_squad.py \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --init_checkpoint=$BERT_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v2.0.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --use_tpu=True \
  --tpu_name=$TPU_ADDRESS \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=$OUTPUT_DIR \
  --version_2_with_negative=True

Streaming output truncated to the last 5000 lines.
I0516 06:48:18.893177 140450216970112 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0516 06:48:18.909507 140450216970112 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0516 06:48:18.909883 140450216970112 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0516 06:48:18.927773 140450216970112 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0516 06:48:18.928057 140450216970112 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0516 06:48:18.944298 140450216970112 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed
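
The final checkpoint produced by this run is `model.ckpt-10859`. A back-of-the-envelope sketch of where that step count may come from is shown below, assuming the number of steps is the example count divided by the batch size, times the number of epochs, truncated to an integer. The training-set size of 130319 questions is an assumption, not a value printed in this notebook, and the helper name `num_train_steps` is only illustrative.

```python
def num_train_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps, truncated to an integer."""
    return int(num_examples / batch_size * num_epochs)


# 130319 is an assumed training-set size, not a value taken from this notebook.
steps = num_train_steps(num_examples=130319, batch_size=24, num_epochs=2.0)
print(steps)  # 10859
```

Under these assumptions, changing `--train_batch_size` or `--num_train_epochs` changes the number of the last checkpoint, which matters later when a checkpoint path is passed to `--init_checkpoint`.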

### **Evaluate**

First, download all output files from GCS.

In [0]:
!gsutil cp -r $OUTPUT_DIR ./

Copying gs://gzp-nlp-bert/bert_output/checkpoint...
Copying gs://gzp-nlp-bert/bert_output/eval.tf_record...
Copying gs://gzp-nlp-bert/bert_output/events.out.tfevents.1589607759.db5366b6a968...
\ [3 files][ 48.7 MiB/ 48.7 MiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://gzp-nlp-bert/bert_output/graph.pbtxt...
Copying gs://gzp-nlp-bert/bert_output/model.ckpt-10000.data-00000-of-00001...
Copying gs://gzp-nlp-bert/bert_output/model.ckpt-10000.index...
Copying gs://gzp-nlp-bert/bert_output/model.ckpt-10000.meta...
Copying gs://gzp-nlp-bert/bert_output/model.ckpt-10859.data-00000-of-00001...
Copying gs://gzp-nlp-bert/bert_output/model.ckpt-10859.index...
Copying gs://gzp-nlp-bert/bert_output/model.ckpt-10859.meta...
Copying gs:

Run the official SQuAD `evaluate-v2.0.py` script.

In [0]:
OUTPUT_DIR_NAME=output_dir_name

In [0]:
!python $SQUAD_DIR/evaluate-v2.0.py $SQUAD_DIR/dev-v2.0.json $OUTPUT_DIR_NAME/predictions.json --na-prob-file $OUTPUT_DIR_NAME/null_odds.json > $SQUAD_DIR/evaluate_result.json

In [0]:
!cat $SQUAD_DIR/evaluate_result.json

{
  "exact": 76.52657289648783,
  "f1": 79.72123519895787,
  "total": 11873,
  "HasAns_exact": 76.23144399460189,
  "HasAns_f1": 82.62993008050383,
  "HasAns_total": 5928,
  "NoAns_exact": 76.82085786375106,
  "NoAns_f1": 76.82085786375106,
  "NoAns_total": 5945,
  "best_exact": 77.70571885791291,
  "best_exact_thresh": -8.116376399993896,
  "best_f1": 80.48224564613157,
  "best_f1_thresh": -4.964851379394531
}
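
The `exact` and `f1` figures are averages of per-question scores. The sketch below shows how a single prediction might be scored, following the usual SQuAD recipe of normalizing both strings and then comparing tokens. It is a simplified re-implementation for illustration, not the evaluation script itself, and the example strings are made up.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(s):
    """Lowercase, remove punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(gold, pred):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(gold) == normalize_answer(pred))


def f1_score(gold, pred):
    """Token-level F1 between a gold answer and a predicted answer."""
    gold_toks = normalize_answer(gold).split()
    pred_toks = normalize_answer(pred).split()
    if not gold_toks or not pred_toks:
        # For unanswerable questions both sides must be empty to score 1.
        return float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Sundar Pichai", "sundar pichai."))
print(f1_score("Larry Page and Sergey Brin", "Larry Page"))
```

For unanswerable questions the two metrics coincide, which is consistent with `NoAns_exact` and `NoAns_f1` being identical in the output above.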


Take the `best_f1_thresh` value as `THRESH` and use it to run prediction again.

In [0]:
with open(SQUAD_DIR + '/evaluate_result.json') as f:
  result = json.load(f)
  THRESH = result['best_f1_thresh']
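
The sketch below illustrates how such a threshold is typically applied when deciding between "no answer" and the best span: the null score is compared against the best span's combined logits, and the empty string is returned when the difference exceeds the threshold. The function name `choose_answer` and all scores in the example are made up; only the decision rule is the point.

```python
def choose_answer(null_score, best_start_logit, best_end_logit, best_text, threshold):
    """Return "" (no answer) when the null score beats the best span by more than threshold."""
    score_diff = null_score - (best_start_logit + best_end_logit)
    return "" if score_diff > threshold else best_text


# Made-up scores for illustration only.
print(repr(choose_answer(2.0, 4.0, 5.0, "August 19, 2004", threshold=-4.96)))
print(repr(choose_answer(9.5, 2.0, 1.0, "Larry Page", threshold=-4.96)))
```

A more negative threshold makes the model abstain more often, so it trades accuracy on answerable questions for accuracy on unanswerable ones.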

### **Predict Again Using the Threshold Value**

In [0]:
!python run_squad.py \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --init_checkpoint=$BERT_DIR/bert_model.ckpt \
  --do_train=False \
  --train_file=squad/train-v2.0.json \
  --do_predict=True \
  --predict_file=squad/dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=$OUTPUT_DIR/ \
  --use_tpu=True \
  --tpu_name=$TPU_ADDRESS \
  --version_2_with_negative=True \
  --null_score_diff_threshold=$THRESH

Streaming output truncated to the last 5000 lines.
I0516 06:56:47.400126 140441624778624 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0516 06:56:47.415967 140441624778624 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0516 06:56:47.416217 140441624778624 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0516 06:56:47.432798 140441624778624 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0516 06:56:47.433064 140441624778624 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0516 06:56:47.450024 140441624778624 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed
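
The `--max_seq_length=384` and `--doc_stride=128` flags used in these commands control how a long context is split into overlapping windows. A minimal sketch of that sliding-window idea follows; the names `DocSpan` and `split_into_spans` and the token counts in the example are assumptions chosen for illustration.

```python
import collections

DocSpan = collections.namedtuple("DocSpan", ["start", "length"])


def split_into_spans(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Cover a document of num_doc_tokens tokens with overlapping windows."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append(DocSpan(start=start, length=length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return spans


# Assumed numbers: a 500-token context and a 10-token question.
# Three positions are reserved for the [CLS] token and two [SEP] tokens.
max_tokens_for_doc = 384 - 10 - 3
for span in split_into_spans(500, max_tokens_for_doc, doc_stride=128):
    print(span)
```

Each window is scored separately, so a smaller stride gives more overlap and more windows per context, at the cost of longer prediction time.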

### **Evaluate Again**

Update the local copies of the files related to the prediction result.

In [0]:
!gsutil cp -r $OUTPUT_DIR/eval.tf_record ./$OUTPUT_DIR_NAME/eval.tf_record
!gsutil cp -r $OUTPUT_DIR/nbest_predictions.json ./$OUTPUT_DIR_NAME/nbest_predictions.json
!gsutil cp -r $OUTPUT_DIR/null_odds.json ./$OUTPUT_DIR_NAME/null_odds.json
!gsutil cp -r $OUTPUT_DIR/predictions.json ./$OUTPUT_DIR_NAME/predictions.json

Copying gs://gzp-nlp-bert/bert_output/eval.tf_record...
- [1 files][ 17.0 MiB/ 17.0 MiB]                                                
Operation completed over 1 objects/17.0 MiB.                                     
Copying gs://gzp-nlp-bert/bert_output/nbest_predictions.json...
- [1 files][ 54.1 MiB/ 54.1 MiB]                                                
Operation completed over 1 objects/54.1 MiB.                                     
Copying gs://gzp-nlp-bert/bert_output/null_odds.json...
/ [1 files][604.6 KiB/604.6 KiB]                                                
Operation completed over 1 objects/604.6 KiB.                                    
Copying gs://gzp-nlp-bert/bert_output/predictions.json...
/ [1 files][541.5 KiB/541.5 KiB]                                                
Operation completed over 1 objects/541.5 KiB.                                    


In [0]:
!python $SQUAD_DIR/evaluate-v2.0.py $SQUAD_DIR/dev-v2.0.json $OUTPUT_DIR_NAME/predictions.json

{
  "exact": 77.59622673292344,
  "f1": 80.48224564613213,
  "total": 11873,
  "HasAns_exact": 73.22874493927125,
  "HasAns_f1": 79.00905913571627,
  "HasAns_total": 5928,
  "NoAns_exact": 81.95121951219512,
  "NoAns_f1": 81.95121951219512,
  "NoAns_total": 5945
}
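
Compared with the first evaluation, `HasAns_exact` dropped from about 76.23 to 73.23 while `NoAns_exact` rose from about 76.82 to 81.95. The overall score is the average of the two subsets weighted by their sizes, which the sketch below checks using the numbers reported above. The helper name `overall_score` is hypothetical.

```python
def overall_score(has_ans_score, has_ans_total, no_ans_score, no_ans_total):
    """Combine per-subset percentages into the overall percentage."""
    total = has_ans_total + no_ans_total
    return (has_ans_score * has_ans_total + no_ans_score * no_ans_total) / total


# Values copied from the second evaluation output above.
exact = overall_score(73.22874493927125, 5928, 81.95121951219512, 5945)
print(round(exact, 4))
```

Because the two subsets are almost the same size, a gain on unanswerable questions has to outweigh the loss on answerable ones for the threshold to improve the overall score.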


## **Prediction**

We create a file to test the model's performance.

In [0]:
!touch input_predict_file.json

In [0]:
%%writefile input_predict_file.json
{
    "version": "v2.0",
    "data": [
        {
            "title": "your_title",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "Who is current CEO?",
                            "id": "56ddde6b9a695914005b9628",
                            "is_impossible": ""
                        },
                        {
                            "question": "Who founded google?",
                            "id": "56ddde6b9a695914005b9629",
                            "is_impossible": ""
                        },
                        {
                            "question": "when did IPO take place?",
                            "id": "56ddde6b9a695914005b962a",
                            "is_impossible": ""
                        }
                    ],
                    "context": "Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a privately held company on September 4, 1998. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its headquarters in Mountain View, California, nicknamed the Googleplex. In August 2015, Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc. Google is Alphabet's leading subsidiary and will continue to be the umbrella company for Alphabet's Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page who became the CEO of Alphabet."                
                 }
            ]
        }
    ]
}

Overwriting input_predict_file.json
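
Writing the JSON by hand is fine for a few questions. For more, a file in the same layout could be generated programmatically, as in the sketch below. The helper name `build_predict_file` is hypothetical, the ids are random, and `is_impossible` is set to a placeholder `False` because the true label is unknown at prediction time.

```python
import json
import uuid


def build_predict_file(title, context, questions):
    """Return a SQuAD 2.0 style dict with one paragraph and the given questions."""
    qas = [
        # Random hex id per question; placeholder label since no gold answer exists.
        {"question": q, "id": uuid.uuid4().hex, "is_impossible": False}
        for q in questions
    ]
    return {
        "version": "v2.0",
        "data": [{"title": title,
                  "paragraphs": [{"context": context, "qas": qas}]}],
    }


payload = build_predict_file(
    title="your_title",
    context="An initial public offering (IPO) took place on August 19, 2004.",
    questions=["When did the IPO take place?"],
)
print(json.dumps(payload, indent=4))
```

The resulting dictionary can be written with `json.dump` to `input_predict_file.json` and passed to `--predict_file` in the same way as the handwritten file.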


In [0]:
!python run_squad.py \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --init_checkpoint=$OUTPUT_DIR/model.ckpt-10859 \
  --do_train=False \
  --max_query_length=30  \
  --do_predict=True \
  --predict_file=input_predict_file.json \
  --predict_batch_size=8 \
  --n_best_size=3 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=predict_output/




W0516 06:58:40.903041 139742874584960 module_wrapper.py:139] From run_squad.py:1127: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0516 06:58:40.903277 139742874584960 module_wrapper.py:139] From run_squad.py:1127: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0516 06:58:40.903454 139742874584960 module_wrapper.py:139] From /content/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W0516 06:58:42.374191 139742874584960 module_wrapper.py:139] From run_squad.py:1133: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related op

In [0]:
!cat predict_output/predictions.json

{
    "56ddde6b9a695914005b9628": "Sundar Pichai",
    "56ddde6b9a695914005b9629": "Larry Page and Sergey Brin",
    "56ddde6b9a695914005b962a": "August 19, 2004"
}


`nbest_predictions.json` contains the three best answers for each question, as set by `--n_best_size=3`.
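
The `probability` field of each candidate appears to be a softmax over the candidates' combined scores, where each score is `start_logit + end_logit`. The sketch below recomputes the probabilities for the first question from the logits reported in this notebook's output, on that assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# (start_logit, end_logit) pairs for the first question's three candidates.
logits = [
    (4.848094940185547, 5.458774566650391),
    (4.848094940185547, 0.15726786851882935),
    (4.848094940185547, -1.0833160877227783),
]
probs = softmax([start + end for start, end in logits])
print(probs)
```

The probabilities are therefore relative to the listed candidates only: with a larger `--n_best_size`, the same span would receive a slightly lower value.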

In [0]:
!cat predict_output/nbest_predictions.json

{
    "56ddde6b9a695914005b9628": [
        {
            "text": "Sundar Pichai",
            "probability": 0.9936154736243725,
            "start_logit": 4.848094940185547,
            "end_logit": 5.458774566650391
        },
        {
            "text": "Sundar Pichai was appointed CEO of Google, replacing Larry Page",
            "probability": 0.0049522577624558466,
            "start_logit": 4.848094940185547,
            "end_logit": 0.15726786851882935
        },
        {
            "text": "Sundar Pichai was appointed CEO of Google, replacing Larry Page who became the CEO of Alphabet.",
            "probability": 0.0014322686131717137,
            "start_logit": 4.848094940185547,
            "end_logit": -1.0833160877227783
        }
    ],
    "56ddde6b9a695914005b9629": [
        {
            "text": "Larry Page and Sergey Brin",
            "probability": 0.9998895490978671,
            "start_logit": 8.884753227233887,
            "end_logit": 8.898839950561523
    