<a href="https://colab.research.google.com/github/jorgeramirez/bert-qa-colab/blob/master/bert_finetuning_qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Changes
This is a modified version of [this notebook](https://github.com/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb), adapted for QA on the SQuAD dataset using [bert-qa](https://github.com/chiayewken/bert-qa). It also includes some scripts for running predictions on a custom dataset.

# BERT End to End (Fine-tuning + Predicting) with Cloud TPU for SQuAD

## Overview

**BERT**, or **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper can be found here: https://arxiv.org/abs/1810.04805.

This Colab demonstrates using a free Colab Cloud TPU to fine-tune a pretrained BERT model for question answering on SQuAD and to run predictions with the fine-tuned model. The original Colab demonstrates loading pretrained BERT models from both [TF Hub](https://www.tensorflow.org/hub) and checkpoints; this version loads them from a checkpoint.

**Note:** You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud
Storage) bucket for this Colab to run.

Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to create a GCP account and a GCS bucket. New accounts get [$300 free credit](https://cloud.google.com/free/) to get started with any GCP product. You can learn more about Cloud TPU at https://cloud.google.com/tpu/docs.

This notebook is hosted on GitHub. To view it in its original repository, after opening the notebook, select **File > View on GitHub**.

## Instructions

<h3><a href="https://cloud.google.com/tpu/"><img valign="middle" src="https://raw.githubusercontent.com/GoogleCloudPlatform/tensorflow-without-a-phd/master/tensorflow-rl-pong/images/tpu-hexagon.png" width="50"></a>  &nbsp;&nbsp;Train on TPU</h3>

   1. Create a Cloud Storage bucket at http://console.cloud.google.com/storage for the pretrained model, checkpoints, and TensorBoard logs, and set `BUCKET_NAME` in the "Prepare for training" section below.
   1. On the main menu, click Runtime and select **Change runtime type**. Set "TPU" as the hardware accelerator.
   1. Click Runtime again and select **Runtime > Run All** (watch out: the authentication cell in "Set up your TPU environment" requires user input). You can also run the cells manually with Shift-ENTER.

## Set up your TPU environment

In this section, you perform the following tasks:

*   Set up a Colab TPU running environment
*   Verify that you are connected to a TPU device
*   Upload your credentials to the TPU so it can access your GCS bucket

In [0]:
# run all of this to prepare for training bert-qa
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the "Instructions" section of this notebook!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

## Prepare and import BERT modules

With your environment configured, you can now prepare and import the BERT modules. The following step clones the source code from GitHub and imports the modules from the source. Alternatively, you can install BERT using pip (`!pip install bert-tensorflow`).

In [0]:
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
  sys.path += ['bert_repo']

# import python modules defined by BERT
import modeling
import optimization
import run_classifier
import run_classifier_with_tfhub
import tokenization

# import tfhub 
import tensorflow_hub as hub

## Prepare for training
Run the following lines to download dependencies for training bert-qa.

In [0]:
# replace with the name of your own GCS bucket
BUCKET_NAME = "gs://bert-qa-demo"

# set the bert directory from the bucket
BERT_BASE_DIR = f"{BUCKET_NAME}/bert_base"

In [0]:
! git clone https://github.com/chiayewken/bert-qa.git
! wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
! wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py
! wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
! mkdir -p squad_dir
! mv train-v1.1.json squad_dir/
! mv dev-v1.1.json squad_dir/ && mv evaluate-v1.1.py squad_dir/
! ls squad_dir
! wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
! unzip uncased_L-12_H-768_A-12.zip -d bert_base 
! mv bert_base/uncased_L-12_H-768_A-12/* bert_base

! gsutil cp -r bert_base $BUCKET_NAME

In [0]:
# we start the training process with the following line

!cd bert-qa && python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=../squad_dir/train-v1.1.json \
  --do_predict=True \
  --predict_file=../squad_dir/dev-v1.1.json \
  --train_batch_size=32 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --use_tpu=True   \
  --tpu_name=$TPU_ADDRESS \
  --output_dir=$BUCKET_NAME
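
The `--max_seq_length` and `--doc_stride` flags control how contexts that are too long for one input are split into overlapping windows. The sketch below illustrates that sliding window, assuming bert-qa keeps the windowing logic of the upstream BERT `run_squad.py`; the function name `doc_spans` and its parameters are illustrative, not part of either repository. In upstream BERT the per-window budget is `max_seq_length` minus the question length minus 3 special tokens.

```python
from typing import List, Tuple


def doc_spans(num_doc_tokens: int, max_tokens_for_doc: int,
              doc_stride: int) -> List[Tuple[int, int]]:
    """Return (start, length) windows that cover a tokenized context.

    Illustrative sketch: consecutive windows start doc_stride tokens apart
    and hold at most max_tokens_for_doc tokens each.
    """
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the context
        start += min(length, doc_stride)
    return spans


# A 500-token context with a 300-token budget and a stride of 128.
print(doc_spans(500, 300, 128))  # [(0, 300), (128, 300), (256, 244)]
```

A smaller stride produces more overlapping windows, so each answer is more likely to appear with enough surrounding context, at the cost of more features to train and predict on.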

In [0]:
# let's evaluate the predictions

!mkdir squad
!gsutil cp $BUCKET_NAME/predictions.json squad/predictions.json

!cd bert-qa && python ../squad_dir/evaluate-v1.1.py ../squad_dir/dev-v1.1.json  ../squad/predictions.json
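
The evaluation script reports two numbers, exact match and F1. As a rough guide to what they measure, here is a minimal sketch of both metrics for a single prediction, assuming the normalization used by the SQuAD v1.1 evaluation (lowercasing, dropping punctuation and the articles a/an/the, collapsing whitespace). The function names are illustrative; the downloaded script remains the reference.

```python
import collections
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truth: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower!", "eiffel tower"))  # True
```

When a question has several reference answers, the score for that question is the best one over all references, and the reported figures are averages over the dev set.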


## Prepare custom dataset for predictions

We should format our custom dataset based on the format found in [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json). When running `run_squad.py` in prediction mode the dev file just needs to contain the following:

```json
{
    "data": [
        {
            "title": "",
            "paragraphs": [
                {
                    "context": "<our text>",
                    "qas": [
                        {"id": "<a unique ID, e.g. a random UUID>", "question": "<our question>"},
                        {"id": "<another unique ID>", "question": "<another question>"}
                    ]
                }
            ]
        }
    ]
}
```

In [0]:
import io
import json
import uuid

import pandas as pd
from google.colab import files


def gen_question_entry(q):
  """Wrap a question string in a SQuAD-style entry with a random unique ID."""
  return {
      "id": str(uuid.uuid4()),
      "question": q
  }


def to_prediction_format(item, questions):
  """Convert one CSV row (columns: title, text) into a SQuAD-style article."""
  return {
      "title": item["title"],
      "paragraphs": [
          {
              "context": item["text"],
              "qas": [gen_question_entry(q) for q in questions]
          }
      ]
  }


def run(input_file, output_file, questions):
  # Prompt for an upload. The CSV must be named input_file and have
  # the columns: title, text.
  uploaded = files.upload()

  # Parse the uploaded bytes of input_file.
  df = pd.read_csv(io.BytesIO(uploaded[input_file]))
  data = []

  # Every row becomes one article that is asked all of the questions.
  for _, item in df.iterrows():
    formatted_entry = to_prediction_format(item, questions)
    data.append(formatted_entry)

  with open(output_file, "w") as f:
    json.dump({"data": data}, f)

In [0]:
run("MY_FILE.csv", "./my_file_qa.json", ["MY QUESTION?"])

In [0]:
# copy the generated file to the bucket
!gsutil cp ./my_file_qa.json  $BUCKET_NAME/my_file_qa.json

In [0]:
# let's inspect the formatted file
!cat my_file_qa.json | python -m json.tool
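
Beyond reading the file by eye, a small structural check can catch common problems such as empty CSV cells or duplicate question IDs before prediction starts. The following is a minimal sketch under the assumption that the file follows the format shown above; `find_format_problems` is a hypothetical helper, not part of bert-qa.

```python
from typing import Any, Dict, List


def find_format_problems(dataset: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a prediction-format dataset."""
    articles = dataset.get("data")
    if not isinstance(articles, list):
        return ['top-level "data" must be a list']

    problems = []
    seen_ids = set()
    for a_idx, article in enumerate(articles):
        for p_idx, paragraph in enumerate(article.get("paragraphs", [])):
            where = f"data[{a_idx}].paragraphs[{p_idx}]"
            context = paragraph.get("context")
            # An empty CSV cell is read by pandas as NaN, which is not a string.
            if not isinstance(context, str) or not context.strip():
                problems.append(f"{where}: missing or empty context")
            for qa in paragraph.get("qas", []):
                qid = qa.get("id")
                if not qid:
                    problems.append(f"{where}: question without an id")
                elif qid in seen_ids:
                    problems.append(f"{where}: duplicate id {qid}")
                else:
                    seen_ids.add(qid)
                if not qa.get("question"):
                    problems.append(f"{where}: empty question")
    return problems


example = {"data": [{"title": "t", "paragraphs": [
    {"context": "Some text.", "qas": [{"id": "q1", "question": "What?"}]}]}]}
print(find_format_problems(example))  # []
```

To check the generated file, load it with `json.load` and pass the result to the function; an empty list means no problems were found.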

## Run prediction on the custom dataset

In [0]:
!cd bert-qa && python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BUCKET_NAME/model.ckpt-5474 \
  --do_train=False \
  --train_file=../squad_dir/train-v1.1.json \
  --do_predict=True \
  --predict_file=../my_file_qa.json \
  --train_batch_size=32 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --use_tpu=True   \
  --tpu_name=$TPU_ADDRESS \
  --output_dir=$BUCKET_NAME/     
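
The step number in `model.ckpt-5474` depends on the training settings. A short worked calculation, assuming bert-qa computes the number of training steps the way the upstream BERT `run_squad.py` does and that the SQuAD v1.1 training file holds 87,599 examples (both are assumptions, not values read from this notebook's output):

```python
def num_train_steps(num_examples: int, batch_size: int, epochs: float) -> int:
    """Training steps as examples / batch size * epochs, truncated to an int."""
    return int(num_examples / batch_size * epochs)


# Settings from the training cell: batch size 32, 2 epochs.
print(num_train_steps(87599, 32, 2.0))  # 5474
```

If you train with a different batch size, number of epochs, or dataset, the final checkpoint will carry a different number; `gsutil ls $BUCKET_NAME` shows the checkpoints that were actually written.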

In [0]:
# let's inspect the answers

!gsutil cp  $BUCKET_NAME/predictions.json predictions.json

!cat predictions.json | python -m json.tool
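
The predictions are keyed by question ID, which is hard to read on its own. The sketch below pairs each answer with its title and question, assuming `predictions.json` is a flat object that maps question ID to answer text (as the upstream BERT `run_squad.py` writes it); `join_answers` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple


def join_answers(dataset: Dict[str, Any],
                 predictions: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (title, question, answer) rows for every question in dataset."""
    rows = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Questions without a prediction get an empty answer.
                answer = predictions.get(qa["id"], "")
                rows.append((article.get("title", ""), qa["question"], answer))
    return rows


dataset = {"data": [{"title": "Doc A", "paragraphs": [
    {"context": "BERT was released in 2018.",
     "qas": [{"id": "q1", "question": "When was BERT released?"}]}]}]}
predictions = {"q1": "2018"}
print(join_answers(dataset, predictions))
# [('Doc A', 'When was BERT released?', '2018')]
```

For the real files, load `my_file_qa.json` and `predictions.json` with `json.load` and pass both objects to the function.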
