<a href="https://colab.research.google.com/github/hogo56/BertQA/blob/master/BERTjoint_yes_no_Howard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTjoint Question Answering Contest

This code is designed to run on Google Colab. Because we also want to submit the kernel to the Kaggle QA competition, it needs to run in either location. This is handled by Python variables that hold the environment-specific directories:

* Kaggle &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; *(cwd = /kaggle/working/)*<br>
  {datadir} = /kaggle/input<br>
  {outdir} = /kaggle/working<br>
* Colab &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; *(cwd = /content/)*<br>
  {datadir} = /content/data<br>
  {outdir} = /content/output<br>
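
The mapping above can be sketched as a small lookup. This is an illustration only: `resolve_dirs`, `LAYOUTS`, and the `exists` parameter are hypothetical names, not part of the notebook, and the directory values are the ones listed above. The environments are probed in the same order as the setup cell later in the notebook (Colab first).

```python
import os

# Directory layout per environment, taken from the list above.
LAYOUTS = {
    "Colab": {"root": "/content", "datadir": "/content/data", "outdir": "/content/output"},
    "Kaggle": {"root": "/kaggle", "datadir": "/kaggle/input", "outdir": "/kaggle/working"},
}

def resolve_dirs(exists=os.path.exists):
    """Return (kernel, datadir, outdir) for the first environment whose root exists.

    `exists` is injectable so the lookup can be exercised off-platform.
    """
    for kernel, layout in LAYOUTS.items():
        if exists(layout["root"]):
            return kernel, layout["datadir"], layout["outdir"]
    raise RuntimeError("Cannot determine file locations: neither /content nor /kaggle exists")
```

Passing a fake `exists` function (for example `lambda p: p == "/kaggle"`) lets the lookup be checked on a machine that is neither Colab nor Kaggle.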

#### - Required Libraries

#### - Required Data

#### - Inputs
   * {datadir}/tensorflow2-question-answering/simplified-nq-test.jsonl

#### - Outputs
   * {outdir}/predictions.json
   * {outdir}/submission.csv<br>
   * {outdir}/eval.tf_record<br>
   * {outdir}/.ipynb_checkpoints/<br>

#### - Notes
See https://www.kaggle.com/prokaj/bert-joint-baseline-notebook/notebook, which is similar to this notebook.


#### - Credit

This notebook is a fork of [mmmarchetti's notebook](https://www.kaggle.com/mmmarchetti/tensorflow-2-0-bert-yes-no-answers) which was a fork of [prokaj's - bert joint baseline notebook](https://www.kaggle.com/prokaj/bert-joint-baseline-notebook/notebook).<br>
mmmarchetti made some modifications to slightly improve the code and get the YES / NO answers and leave the unknowns blank.

In [0]:
## Required setup code (will eventually be cut out into a lib file)
import os                                   # needed by list_files below
class ExecutionStop(Exception):             # Custom Error Handler
    def __init__(self, value): self.value=value
    def __str__(self): return(str(self.value))

def list_files(startpath):                  #  Show files
    for root, dirs, files in os.walk(startpath):
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))
# raise ExecutionStop("Message")

In [0]:
## Set file locations    (these variables are not implemented in the FLAGS code yet)
import os
kernel = ''
if os.path.exists('/content'):
    print("Detected running on Colab")
    kernel = 'Colab'
    basedir = '/content'
    libdir = '/content/lib'
    datadir = '/content/data'
    outdir = '/content/output'          # maybe this should be "/gdrive/My Drive/colab/bertqa/output"
elif os.path.exists('/kaggle'):
    print("Detected running on Kaggle")
    kernel = 'Kaggle'
    basedir = '/kaggle'
    libdir = '/kaggle/lib'
    datadir = '/kaggle/input'           # this may need to be '../input' for scoring
    outdir = '/kaggle/working'          # this may need to be '.' for scoring
else:
    raise ExecutionStop("Cannot continue without determining file locations")

## Config Variables
competition = 'tensorflow2-question-answering'
train_file = 'simplified-nq-train.jsonl'
test_file = 'simplified-nq-test.jsonl'
gprojdir = 'bertqa'                 # The project directory on Drive for this competition
glibdir = 'BERTjoint_yes_no'        # lib subdir for this notebook on Drive
verbose = True
EnableAllCode = False               # Prevent codeblocks that should not execute on Run All
DownloadBigFiles = True             # Files will not download if already on drive

# ToDo -------- search for these vars in /lib and make go away
on_kaggle_server = kernel == 'Kaggle'
nq_test_file = f"{datadir}/{competition}/{test_file}"


# ============= Machine Spinup =============

In [0]:
! zdump PST
! pwd
if verbose:
    list_files('/content')

In [0]:
## Reset kernel without removing downloaded data files and libs
if False:
    %reset
    ! rm -i {outdir}

## -- Main System Config --
<Details><Summary>Global Config</Summary>
Put any global system configuration here.</Details>

In [0]:
%%bash -s {libdir} {datadir} {outdir}
zdump PST               # Not sure what is up with the time, PST is running about 8 hrs ahead
[ -d $1 ] || mkdir -p $1
[ -d $2 ] || mkdir -p $2
[ -d $3 ] || mkdir -p $3
if [ -d /content/sample_data ]; then
    rm -rf /content/sample_data
fi

In [0]:
import os, sys
sys.path.append(libdir)

## -- Setup --

### Google Drive
<Details>There are several ways to access your Google Drive from Colab. (What about the Drive FUSE wrapper?)<br>
I am not sure this one is the best. It mounts your Drive into the machine.<br>
I expect there will be a folder in the Drive that we all share.</Details>
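
The Drive path contains a space, so the notebook links it to a shorter name. A Python sketch of that symlink step that is safe to re-run follows; `link_project_dir` is a hypothetical helper, not something the notebook defines.

```python
import os

def link_project_dir(target, link_path):
    """Create link_path -> target unless something already exists at link_path.

    Returns True when a new link was created, False when it was already there.
    """
    if os.path.islink(link_path) or os.path.exists(link_path):
        return False          # already linked on an earlier run
    os.symlink(target, link_path)
    return True
```

On Colab this could be called as `link_project_dir('/content/gdrive/My Drive/bertqa', '/content/bertqa')` once the Drive is mounted.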

In [0]:
## File link to Google Drive
if kernel == 'Colab':
    from google.colab import drive
    drive.mount('/content/gdrive', force_remount=False)   # true to reread drive
    # Create a shorter shared directory name than one with a space
    # (-sfn replaces an existing link, so this cell can be re-run)
    ! ln -sfn '/content/gdrive/My Drive/{gprojdir}' /content/{gprojdir}

In [0]:
if False:
    ## Flush and unmount Google Drive
    # You probably won't do this but if you want to at some point click the play button
    drive.flush_and_unmount()

## Install the large file downloader for Google Drive if needed (Colab already has it installed)
#  This works from bash or Python
#! pip install gdown   (if you need to install it)

### Kaggle API
<Details>You will need a Kaggle API token to link the Colab instance to your Kaggle account to get data, etc.<br>
Go to https://www.kaggle.com/yourID/account and click the "Create New API Token" button to get a file named kaggle.json.<p>You can put your kaggle.json file in your Google Drive at My Drive/colab/kaggle.json.<br>
Alternatively, you can store it on your local machine and the script will ask you to upload it.</Details>
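
The lookup order described above can be sketched as follows. This is an assumption-laden illustration: `find_kaggle_config` is a hypothetical helper, and the two candidate directories are the ones this notebook uses. `KAGGLE_CONFIG_DIR` is the environment variable the cell below sets for the Kaggle client.

```python
import os

def find_kaggle_config(candidate_dirs):
    """Return the first directory that contains kaggle.json, or None."""
    for directory in candidate_dirs:
        if os.path.isfile(os.path.join(directory, "kaggle.json")):
            return directory
    return None

# Search order assumed here: Drive first, then the Colab working directory.
config_dir = find_kaggle_config(["/content/gdrive/My Drive/colab", "/content"])
if config_dir is not None:
    # Point the Kaggle client at the directory holding the token.
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
```

When neither location holds the file, `config_dir` is `None` and the token still has to be uploaded by hand.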

In [0]:
## Link to Kaggle
if kernel == 'Colab':
    from google.colab import files

    # see if there is a kaggle.json file in gdrive
    try:
        # see if auth file is in gdrive
        with open("/content/gdrive/My Drive/colab/kaggle.json"):
            pass                # only checking that the file can be opened
        os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/colab/"
        ! ls -l "/content/gdrive/My Drive/colab/kaggle.json"
    except IOError:
        # Have user upload file
        ! rm /content/kaggle.json  2> /dev/null
        print('Upload kaggle.json.')
        # The files.upload() command is failing sporadically with:
        #   TypeError: Cannot read property '_uploadFiles' of undefined (just run again)
        files.upload()
        os.environ['KAGGLE_CONFIG_DIR'] = "/content/"
        ! ls -l /content/kaggle.json

    import kaggle

# =========== Project Specific Stuff ===========

## -- Project Setup --

### Download Dataset and Support Files

Kaggle Competition Files

In [0]:
## Competition Dataset  (5GB zipped)
if DownloadBigFiles and kernel == 'Colab':
    if not os.path.exists(f"{datadir}/compdata.flag"):
        if verbose:
            ! kaggle competitions list
            print()
        print("Downloading Competition Data\n")
        ! kaggle competitions download -c {competition} -p {datadir}
        ! mv {datadir}/sample_submission.csv {outdir}/
        ! mkdir -p {datadir}/{competition}/
        ! unzip {datadir}/{train_file}.zip -d {datadir}/{competition}
        ! rm {datadir}/{train_file}.zip
        ! unzip {datadir}/{test_file}.zip -d {datadir}/{competition}
        ! rm {datadir}/{test_file}.zip
        ! touch {datadir}/compdata.flag
    else:
        print("Competition Data already exists. Not downloading.\n")
        !ls -l {datadir}/*
else:
    pass        # if Kaggle, data will already be there

public_dataset = os.path.getsize(f"{datadir}/{competition}/{test_file}")<20_000_000
private_dataset = os.path.getsize(f"{datadir}/{competition}/{test_file}")>=20_000_000

BERTjoint files from: https://www.kaggle.com/prokaj/bert-joint-baseline

In [0]:
# Get BERTjoint model files
if DownloadBigFiles and kernel == 'Colab':
    if not os.path.exists(f"{datadir}/bertfiles.flag"):
        print("Downloading BERT-joint Model\n")
        ! mkdir -p {datadir}/bert-joint-baseline/
        ! gdown -O {datadir}/prokaj-bert-joint-baseline.zip --id 1vnJ052xpw1fmpWLMyLhrkUn5j_ymoGP7
        getfiles = "bert_config* model_cpkt* nq-test* vocab*"
        ! unzip {datadir}/prokaj-bert-joint-baseline.zip {getfiles} -d {datadir}/bert-joint-baseline/
        ! rm {datadir}/prokaj-bert-joint-baseline.zip
        if verbose:
            ! ls -l {datadir}/bert-joint-baseline/
        ! touch {datadir}/bertfiles.flag
    else:
        print("BERT-joint files already exist. Not downloading.\n")
        ! ls -l {datadir}/bert-joint-baseline/
else:
    pass        # if Kaggle, data will already be there

In [0]:
# Get BERTjoint model files
if False and DownloadBigFiles and kernel == 'Colab':
    if not os.path.exists(f"{datadir}/bertfiles.flag"):
        print("Downloading BERT-joint Model\n")
        ! mkdir -p {datadir}/bert-joint-baseline/
        getfiles = "bert_config* model_cpkt* nq-test* vocab*"
        ! unzip {basedir}/{gprojdir}/data/prokaj-bert-joint-baseline.zip {getfiles} -d {datadir}/bert-joint-baseline/
        if verbose:
            ! ls -l {datadir}/bert-joint-baseline/
        ! touch {datadir}/bertfiles.flag
    else:
        print("BERT-joint files already exist. Not downloading.\n")
        ! ls -l {datadir}/bert-joint-baseline/
else:
    pass        # if Kaggle, data will already be there

BERTjoint files from: 
https://github.com/google-research/language/tree/master/language/question_answering/bert_joint


In [0]:
# Get BERTjoint model files
if False and DownloadBigFiles and kernel == 'Colab':
    if not os.path.exists(f"{datadir}/bertfiles.flag"):
        print("Downloading BERT-joint Model\n")
        ! gsutil cp -R gs://bert-nq/bert-joint-baseline {datadir}
        ! touch {datadir}/bertfiles.flag
    else:
        print("BERT-joint files already exist. Not downloading.\n")
        ! ls -l {datadir}/bert-joint-baseline/
else:
    pass        # if Kaggle, data will already be there

BERT files from: https://github.com/google-research/bert<br>
(Not the model we are using at the moment)

In [0]:
## get BERT (this is unlikely to be the BERT-joint files needed for competition)
# This version of BERT won't import as is. On line 88 of lib/bert/optimization.py
#    change   tf.train.Optimizer to tf.keras.optimizers.Optimizer
if False and DownloadBigFiles and kernel == 'Colab':
    ! git clone https://github.com/google-research/bert.git
    ! mv bert lib

    # get some pretrained models  (I really  have no idea what these are or if useful)
    ! wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
    ! unzip cased_L-12_H-768_A-12.zip
    ! rm cased_L-12_H-768_A-12.zip

<Details><Summary>BERT tf.compat.v1 Notes</Summary>
baseline_w_bert_translated_to_tf2_0 (next code block) comes from /dimitreoliveira with this warning:<br>
This baseline uses code that was migrated from TF1.x. Be aware that it contains use of tf.compat.v1, which is not permitted to be eligible for TF2.0 prizes in this competition. It is intended to be used as a starting point, but we're excited to see how much better you can do using TF2.0!<br>
https://www.kaggle.com/dimitreoliveira/using-tf-2-0-w-bert-on-nq-translated-to-tf2-0</Details>

### Library Setup

In [0]:
## Copy lib files over from Google Drive
if kernel == 'Colab':
    ! cp -a /content/{gprojdir}/lib/{glibdir}/* {libdir}    # ToDo ----- Subdir this by Notebook
if kernel == 'Kaggle':
    pass                            # ToDo ------ Need to figure out how Kaggle gets updated lib files
if verbose:
    ! ls -l {libdir}

In [0]:
## Load Libraries
import os, sys, importlib

# Colab magic that selects TensorFlow 2.x
%tensorflow_version 2.x
import tensorflow as tf
print("TensorFlow", tf.__version__)

import numpy as np
import pandas as pd
import collections

sys.path.extend([f"{libdir}/"])      ## ToDo make sure this works and extra (old) libs are not in datadir
import bert_utils
import modeling
import tokenization

import json

In [0]:
## uncomment and use this cell to reimport libs you have updated
# importlib.reload(bert_utils)

In [0]:
! zdump PST
! pwd
if verbose:
    list_files('/content')

In [0]:
# raise ExecutionStop("Execution stopped")

## -- Code Implementation in Tensorflow 2.0 --

**A few notes:**
- If you want to keep using **flags** and **logging**, you will have to use the **absl** lib (this is recommended by the TF team).
- Most of the **TPU**-related code was removed to reduce complexity, since the kernels don't use it.
- TensorFlow 2 doesn't let us use global variables **(tf.compat.v1.trainable_variables())**.
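
This notebook does not use absl itself: it keeps its settings in a plain attribute container (the `FLAGS` object defined in a later cell). A minimal standard-library sketch of the same idea follows. `FLAGS_SKETCH` is a hypothetical name, and the three values mirror settings from that later cell purely for illustration.

```python
from types import SimpleNamespace

# Illustrative subset of settings; the notebook defines the full set later.
FLAGS_SKETCH = SimpleNamespace(max_seq_length=512, doc_stride=128, n_best_size=20)

# Settings are read as attributes, the same way absl flags would be.
print(FLAGS_SKETCH.max_seq_length)
```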

In this notebook, we'll be using the BERT baseline for TensorFlow to create predictions for the Natural Questions test set. Note that this uses a model that has already been pre-trained, so we're only doing inference here. A GPU is required, and the run should take 1 to 2 hours.

The original script can be found [here](https://github.com/google-research/language/blob/master/language/question_answering/bert_joint/run_nq.py).<br>
The supporting modules were drawn from the [official Tensorflow model repository](https://github.com/tensorflow/models/tree/master/official).<br>
The bert-joint-baseline data is described [here](https://github.com/google-research/language/tree/master/language/question_answering/bert_joint).

**Note:** This baseline uses code that was migrated from TF1.x. Be aware that it contains use of tf.compat.v1, which is not permitted to be eligible for [TF2.0 prizes in this competition](https://www.kaggle.com/c/tensorflow2-question-answering/overview/prizes).

In [0]:
%%bash
zdump PST

In [0]:
## grab bert config

with open(f"{datadir}/bert-joint-baseline/bert_config.json", 'r') as f:
    bertconf = json.load(f)
if verbose:
    print(json.dumps(bertconf, indent=4))
# ToDo ------- search for variable config in ./lib and change to bertconf
config = bertconf

### Support Classes

In [0]:
class TDense(tf.keras.layers.Layer):
    def __init__(self,
                 output_size,
                 kernel_initializer=None,
                 bias_initializer="zeros",
                **kwargs):
        super().__init__(**kwargs)
        self.output_size = output_size
        self.kernel_initializer = kernel_initializer
        self.bias_initializer = bias_initializer

    def build(self,input_shape):
        dtype = tf.as_dtype(self.dtype or tf.keras.backend.floatx())
        if not (dtype.is_floating or dtype.is_complex):
          raise TypeError("Unable to build `TDense` layer with "
                          "non-floating point (and non-complex) "
                          "dtype %s" % (dtype,))
        input_shape = tf.TensorShape(input_shape)
        if tf.compat.dimension_value(input_shape[-1]) is None:
          raise ValueError("The last dimension of the inputs to "
                           "`TDense` should be defined. "
                           "Found `None`.")
        last_dim = tf.compat.dimension_value(input_shape[-1])
        self.input_spec = tf.keras.layers.InputSpec(min_ndim=3, axes={-1: last_dim})
        self.kernel = self.add_weight(
            "kernel",
            shape=[self.output_size,last_dim],
            initializer=self.kernel_initializer,
            dtype=self.dtype,
            trainable=True)
        self.bias = self.add_weight(
            "bias",
            shape=[self.output_size],
            initializer=self.bias_initializer,
            dtype=self.dtype,
            trainable=True)
        super(TDense, self).build(input_shape)

    def call(self,x):
        return tf.matmul(x,self.kernel,transpose_b=True)+self.bias


def mk_model(config):
    seq_len = config['max_position_embeddings']
    unique_id  = tf.keras.Input(shape=(1,),dtype=tf.int64,name='unique_id')
    input_ids   = tf.keras.Input(shape=(seq_len,),dtype=tf.int32,name='input_ids')
    input_mask  = tf.keras.Input(shape=(seq_len,),dtype=tf.int32,name='input_mask')
    segment_ids = tf.keras.Input(shape=(seq_len,),dtype=tf.int32,name='segment_ids')
    BERT = modeling.BertModel(config=config,name='bert')
    pooled_output, sequence_output = BERT(input_word_ids=input_ids,
                                          input_mask=input_mask,
                                          input_type_ids=segment_ids)
    
    logits = TDense(2,name='logits')(sequence_output)
    start_logits,end_logits = tf.split(logits,axis=-1,num_or_size_splits= 2,name='split')
    start_logits = tf.squeeze(start_logits,axis=-1,name='start_squeeze')
    end_logits   = tf.squeeze(end_logits,  axis=-1,name='end_squeeze')
    
    ans_type      = TDense(5,name='ans_type')(pooled_output)
    return tf.keras.Model([input_ for input_ in [unique_id,input_ids,input_mask,segment_ids] 
                           if input_ is not None],
                          [unique_id,start_logits,end_logits,ans_type],
                          name='bert-baseline')    

### Create Model

In [0]:
## A reduced copy of bert_config.json for quick experiments
## (not used below: the model is built from the full config)
small_config = bertconf.copy()
small_config['vocab_size']=16
small_config['hidden_size']=64
small_config['max_position_embeddings'] = 32
small_config['num_hidden_layers'] = 4
small_config['num_attention_heads'] = 4
small_config['intermediate_size'] = 256
small_config

model= mk_model(config)

model.summary()

In [0]:
#                                   ### This was excluded from my copy
if False:
    model_params = {v.name:v for v in model.trainable_variables}
    model_roots = np.unique([v.name.split('/')[0] for v in model.trainable_variables])
    print(model_roots)
    saved_names = [k for k,v in tf.train.list_variables('../input/bertjointbaseline/bert_joint.ckpt')]
    a_map = {v:v+':0' for v in saved_names}
    model_roots = np.unique([v.name.split('/')[0] for v in model.trainable_variables])
    def transform(x):
        x = x.replace('attention/self','attention')
        x = x.replace('attention','self_attention')
        x = x.replace('attention/output','attention_output')  

        x = x.replace('/dense','')
        x = x.replace('/LayerNorm','_layer_norm')
        x = x.replace('embeddings_layer_norm','embeddings/layer_norm')  

        x = x.replace('attention_output_layer_norm','attention_layer_norm')  
        x = x.replace('embeddings/word_embeddings','word_embeddings/embeddings')

        x = x.replace('/embeddings/','/embedding_postprocessor/')  
        x = x.replace('/token_type_embeddings','/type_embeddings')  
        x = x.replace('/pooler/','/pooler_transform/')  
        x = x.replace('answer_type_output_bias','ans_type/bias')  
        x = x.replace('answer_type_output_','ans_type/')
        x = x.replace('cls/nq/output_','logits/')
        x = x.replace('/weights','/kernel')

        return x
    a_map = {k:model_params.get(transform(v),None) for k,v in a_map.items() if k!='global_step'}
    tf.compat.v1.train.init_from_checkpoint(ckpt_dir_or_file='../input/bertjointbaseline/bert_joint.ckpt',
                                            assignment_map=a_map)

### Checkpoint

In [0]:
cpkt = tf.train.Checkpoint(model=model)
cpkt.restore(f"{datadir}/bert-joint-baseline/model_cpkt-1").assert_consumed()

### Setting the Flags

In [0]:
#                                   ### This cell was in my copy but not in former
class DummyObject:
    def __init__(self,**kwargs):
        self.__dict__.update(kwargs)

FLAGS=DummyObject(skip_nested_contexts=True,
                 max_position=50,
                 max_contexts=48,
                 max_query_length=64,
                 max_seq_length=512,
                 doc_stride=128,
                 include_unknowns=-1.0,
                 n_best_size=20,
                 max_answer_length=30)

In [0]:
import tqdm
eval_records = f"{datadir}/bert-joint-baseline/nq-test.tfrecords"
#nq_test_file = f"{datadir}/tensorflow2-question-answering/simplified-nq-test.jsonl"
if on_kaggle_server and private_dataset:
    eval_records='nq-test.tfrecords'
if not os.path.exists(eval_records):
    # tf2baseline.FLAGS.max_seq_length = 512
    eval_writer = bert_utils.FeatureWriter(
        filename=os.path.join(eval_records),
        is_training=False)

    tokenizer = tokenization.FullTokenizer(vocab_file=f"{datadir}/bert-joint-baseline/vocab-nq.txt", 
                                           do_lower_case=True)

    features = []
    convert = bert_utils.ConvertExamples2Features(tokenizer=tokenizer,
                                                   is_training=False,
                                                   output_fn=eval_writer.process_feature,
                                                   collect_stat=False)

    n_examples = 0
    tqdm_notebook= tqdm.tqdm_notebook if not on_kaggle_server else None
    for examples in bert_utils.nq_examples_iter(input_file=nq_test_file, 
                                           is_training=False,
                                           tqdm=tqdm_notebook):
        for example in examples:
            n_examples += convert(example)

    eval_writer.close()
    print('number of test examples: %d, written to file: %d' % (n_examples,eval_writer.num_features))

In [0]:
seq_length = FLAGS.max_seq_length #config['max_position_embeddings']
name_to_features = {
      "unique_id": tf.io.FixedLenFeature([], tf.int64),
      "input_ids": tf.io.FixedLenFeature([seq_length], tf.int64),
      "input_mask": tf.io.FixedLenFeature([seq_length], tf.int64),
      "segment_ids": tf.io.FixedLenFeature([seq_length], tf.int64),
  }

def _decode_record(record, name_to_features=name_to_features):
    """Decodes a record to a TensorFlow example."""
    example = tf.io.parse_single_example(serialized=record, features=name_to_features)

    # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
    # So cast all int64 to int32.
    for name in list(example.keys()):
        t = example[name]
        if name != 'unique_id': #t.dtype == tf.int64:
            t = tf.cast(t, dtype=tf.int32)
        example[name] = t

    return example

def _decode_tokens(record):
    return tf.io.parse_single_example(serialized=record, 
                                      features={
                                          "unique_id": tf.io.FixedLenFeature([], tf.int64),
                                          "token_map" :  tf.io.FixedLenFeature([seq_length], tf.int64)
                                      })
      


In [0]:
raw_ds = tf.data.TFRecordDataset(eval_records)
token_map_ds = raw_ds.map(_decode_tokens)
decoded_ds = raw_ds.map(_decode_record)
ds = decoded_ds.batch(batch_size=16,drop_remainder=False)

In [0]:
#                                       ### This cell was not in my copy but former
# next(iter(decoded_ds))

In [0]:
# Model.predict accepts a tf.data.Dataset directly; predict_generator is deprecated
result = model.predict(ds, verbose=0 if on_kaggle_server else 1)

In [0]:
np.savez_compressed('bert-joint-baseline-output.npz',
                    **dict(zip(['unique_id','start_logits','end_logits','answer_type_logits'],
                               result)))

In [0]:
Span = collections.namedtuple("Span", ["start_token_idx", "end_token_idx"])

In [0]:
class ScoreSummary(object):
  def __init__(self):
    self.predicted_label = None
    self.short_span_score = None
    self.cls_token_score = None
    self.answer_type_logits = None

In [0]:
class EvalExample(object):
  """Eval data available for a single example."""
  def __init__(self, example_id, candidates):
    self.example_id = example_id
    self.candidates = candidates
    self.results = {}
    self.features = {}

In [0]:
def get_best_indexes(logits, n_best_size):
  """Get the n-best logits from a list."""
  index_and_score = sorted(
      enumerate(logits[1:], 1), key=lambda x: x[1], reverse=True)
  best_indexes = []
  for i in range(len(index_and_score)):
    if i >= n_best_size:
      break
    best_indexes.append(index_and_score[i][0])
  return best_indexes

def top_k_indices(logits,n_best_size,token_map):
    indices = np.argsort(logits[1:])+1
    indices = indices[token_map[indices]!=-1]
    return indices[-n_best_size:]

## 1- Understanding the code
#### A brief explanation, for better understanding.
#### The "answer_type" item, set in the last lines of the code block below, stores the predicted answer type, which, according to the [GitHub project repository](https://github.com/google-research/language/blob/master/language/question_answering/bert_joint/run_nq.py), can be:
* UNKNOWN = 0
* YES = 1
* NO = 2
* SHORT = 3
* LONG = 4
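
As a sketch of how these codes are used, the predicted type is the index of the largest answer-type logit. The `AnswerType` enum and `predicted_answer_type` helper below are hypothetical names for illustration (the notebook itself works with bare integers), and the logits are made-up values.

```python
from enum import IntEnum

class AnswerType(IntEnum):
    """Answer type codes listed above."""
    UNKNOWN = 0
    YES = 1
    NO = 2
    SHORT = 3
    LONG = 4

def predicted_answer_type(answer_type_logits):
    """Return the AnswerType whose logit is largest (first one wins on ties)."""
    best = max(range(len(answer_type_logits)), key=lambda i: answer_type_logits[i])
    return AnswerType(best)

example_logits = [0.1, 2.3, -0.5, 1.9, 0.2]   # made-up values for illustration
print(predicted_answer_type(example_logits).name)   # YES
```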

In [0]:
def compute_predictions(example):
  """Converts an example into an NQEval object for evaluation."""
  predictions = []
  n_best_size = FLAGS.n_best_size
  max_answer_length = FLAGS.max_answer_length
  i = 0
  for unique_id, result in example.results.items():
    if unique_id not in example.features:
      raise ValueError("No feature found with unique_id:", unique_id)
    token_map = np.array(example.features[unique_id]["token_map"]) #.int64_list.value
    start_indexes = top_k_indices(result.start_logits,n_best_size,token_map)
    if len(start_indexes)==0:
        continue
    end_indexes   = top_k_indices(result.end_logits,n_best_size,token_map)
    if len(end_indexes)==0:
        continue
    indexes = np.array(list(np.broadcast(start_indexes[None],end_indexes[:,None])))  
    indexes = indexes[(indexes[:,0]<indexes[:,1])*(indexes[:,1]-indexes[:,0]<max_answer_length)]
    for start_index,end_index in indexes:
        summary = ScoreSummary()
        summary.short_span_score = (
            result.start_logits[start_index] +
            result.end_logits[end_index])
        summary.cls_token_score = (
            result.start_logits[0] + result.end_logits[0])
        summary.answer_type_logits = result.answer_type_logits-result.answer_type_logits.mean()
        start_span = token_map[start_index]
        end_span = token_map[end_index] + 1

        # Span logits minus the cls logits seems to be close to the best.
        score = summary.short_span_score - summary.cls_token_score
        predictions.append((score, i, summary, start_span, end_span))
        i += 1 # to break ties

  # Default empty prediction.
  score = -10000.0
  short_span = Span(-1, -1)
  long_span  = Span(-1, -1)
  summary    = ScoreSummary()

  if predictions:
    score, _, summary, start_span, end_span = sorted(predictions, reverse=True)[0]
    short_span = Span(start_span, end_span)
    for c in example.candidates:
      start = short_span.start_token_idx
      end = short_span.end_token_idx
      ## print(c['top_level'],c['start_token'],start,c['end_token'],end)
      if c["top_level"] and c["start_token"] <= start and c["end_token"] >= end:
        long_span = Span(c["start_token"], c["end_token"])
        break

  summary.predicted_label = {
      "example_id": int(example.example_id),
      "long_answer": {
          "start_token": int(long_span.start_token_idx),
          "end_token": int(long_span.end_token_idx),
          "start_byte": -1,
          "end_byte": -1
      },
      "long_answer_score": float(score),
      "short_answers": [{
          "start_token": int(short_span.start_token_idx),
          "end_token": int(short_span.end_token_idx),
          "start_byte": -1,
          "end_byte": -1
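      }],
      "short_answer_score": float(score),
      "yes_no_answer": "NONE",
      # answer_type_logits stays None when no valid span was found;
      # report UNKNOWN (0) in that case instead of raising.
      "answer_type_logits": ([] if summary.answer_type_logits is None
                             else summary.answer_type_logits.tolist()),
      # here:
      "answer_type": (0 if summary.answer_type_logits is None
                      else int(np.argmax(summary.answer_type_logits)))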
  }

  return summary

In [0]:
def compute_pred_dict(candidates_dict, dev_features, raw_results,tqdm=None):
    """Computes official answer key from raw logits."""
    raw_results_by_id = [(int(res.unique_id),1, res) for res in raw_results]

    examples_by_id = [(int(k),0,v) for k, v in candidates_dict.items()]
  
    features_by_id = [(int(d['unique_id']),2,d) for d in dev_features] 
  
    # Join examples with features and raw results.
    examples = []
    print('merging examples...')
    merged = sorted(examples_by_id + raw_results_by_id + features_by_id)
    print('done.')
    for idx, type_, datum in merged:
        if type_==0: #isinstance(datum, list):
            examples.append(EvalExample(idx, datum))
        elif type_==2: #"token_map" in datum:
            examples[-1].features[idx] = datum
        else:
            examples[-1].results[idx] = datum

    # Construct prediction objects.
    print('Computing predictions...')
   
    nq_pred_dict = {}
    #summary_dict = {}
    if tqdm is not None:
        examples = tqdm(examples)
    for e in examples:
        summary = compute_predictions(e)
        #summary_dict[e.example_id] = summary
        nq_pred_dict[e.example_id] = summary.predicted_label

    return nq_pred_dict


In [0]:
import gzip                     # needed for the .gz branch below

def read_candidates_from_one_split(input_path):
  """Read candidates from a single jsonl file."""
  candidates_dict = {}
  print("Reading examples from: %s" % input_path)
  if input_path.endswith(".gz"):
    with gzip.GzipFile(fileobj=tf.io.gfile.GFile(input_path, "rb")) as input_file:
      for index, line in enumerate(input_file):
        e = json.loads(line)
        candidates_dict[e["example_id"]] = e["long_answer_candidates"]
        
  else:
    with tf.io.gfile.GFile(input_path, "r") as input_file:
      for index, line in enumerate(input_file):
        e = json.loads(line)
        candidates_dict[e["example_id"]] = e["long_answer_candidates"]
        # candidates_dict['question'] = e['question_text']
  return candidates_dict

In [0]:
def read_candidates(input_pattern):
  """Read candidates from every file matching input_pattern, one file at a time."""
  input_paths = tf.io.gfile.glob(input_pattern)
  final_dict = {}
  for input_path in input_paths:
    final_dict.update(read_candidates_from_one_split(input_path))
  return final_dict

In [0]:
all_results = [bert_utils.RawResult(*x) for x in zip(*result)]
    
print ("Going to candidates file")

candidates_dict = read_candidates(f"{datadir}/{competition}/{test_file}")

print ("setting up eval features")

eval_features = list(token_map_ds)

print ("compute_pred_dict")

tqdm_notebook= tqdm.tqdm_notebook
nq_pred_dict = compute_pred_dict(candidates_dict, 
                                       eval_features,
                                       all_results,
                                      tqdm=tqdm_notebook)

predictions_json = {"predictions": list(nq_pred_dict.values())}

print ("writing json")

with tf.io.gfile.GFile(f"{outdir}/predictions.json", "w") as f:
    json.dump(predictions_json, f, indent=4)

## 2- Main Change
#### Here is the small but main change: an `if` chain checks the predicted answer type and uses it to filter the answers that are passed to the submission file.

### Filtering the Answers

In [0]:
def create_short_answer(entry):
    answer = []    
    if entry['answer_type'] == 0:
        return ""
    
    elif entry['answer_type'] == 1:
        return 'YES'
    
    elif entry['answer_type'] == 2:
        return 'NO'
        
    elif entry["short_answer_score"] < 1.5:
        return ""
    
    else:
        for short_answer in entry["short_answers"]:
            if short_answer["start_token"] > -1:
                answer.append(str(short_answer["start_token"]) + ":" + str(short_answer["end_token"]))
    
        return " ".join(answer)

def create_long_answer(entry):
    
    answer = []
    
    if entry['answer_type'] == 0:
        return ''
    
    elif entry["long_answer_score"] < 1.5:
        return ""

    elif entry["long_answer"]["start_token"] > -1:
        answer.append(str(entry["long_answer"]["start_token"]) + ":" + str(entry["long_answer"]["end_token"]))
        return " ".join(answer)

    return ""       # no valid long-answer span: leave the prediction blank

### Creating a DataFrame

In [0]:
test_answers_df = pd.read_json(f"{outdir}/predictions.json")
for var_name in ['long_answer_score','short_answer_score','answer_type']:
    test_answers_df[var_name] = test_answers_df['predictions'].apply(lambda q: q[var_name])
test_answers_df["long_answer"] = test_answers_df["predictions"].apply(create_long_answer)
test_answers_df["short_answer"] = test_answers_df["predictions"].apply(create_short_answer)
test_answers_df["example_id"] = test_answers_df["predictions"].apply(lambda q: str(q["example_id"]))

long_answers = dict(zip(test_answers_df["example_id"], test_answers_df["long_answer"]))
short_answers = dict(zip(test_answers_df["example_id"], test_answers_df["short_answer"]))

test_answers_df.head()

### Generating the Submission File

In [0]:
sample_submission = pd.read_csv("../input/tensorflow2-question-answering/sample_submission.csv")

long_prediction_strings = sample_submission[sample_submission["example_id"].str.contains("_long")].apply(lambda q: long_answers[q["example_id"].replace("_long", "")], axis=1)
short_prediction_strings = sample_submission[sample_submission["example_id"].str.contains("_short")].apply(lambda q: short_answers[q["example_id"].replace("_short", "")], axis=1)

sample_submission.loc[sample_submission["example_id"].str.contains("_long"), "PredictionString"] = long_prediction_strings
sample_submission.loc[sample_submission["example_id"].str.contains("_short"), "PredictionString"] = short_prediction_strings
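
The lookup above relies on every `example_id` in the sample submission carrying a `_long` or `_short` suffix. A minimal sketch of that mapping in plain Python, without pandas, may make the logic easier to follow. The IDs and predictions are invented, and the helper name `fill_prediction` is hypothetical.

```python
# Invented per-example predictions, keyed by the bare example id.
long_answers = {"111": "10:50", "222": ""}
short_answers = {"111": "YES", "222": ""}


def fill_prediction(row_id):
    """Return the prediction string for one submission row id such as '111_long'."""
    base_id, _, kind = row_id.rpartition("_")  # split on the last underscore
    if kind == "long":
        return long_answers[base_id]
    if kind == "short":
        return short_answers[base_id]
    raise ValueError(f"unexpected example_id suffix: {row_id!r}")


row_ids = ["111_long", "111_short", "222_long", "222_short"]
submission = {row_id: fill_prediction(row_id) for row_id in row_ids}
print(submission)
```

Each example contributes two rows to the submission, one for the long answer and one for the short answer, and a blank string means no prediction for that row.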


In [0]:
sample_submission.to_csv('submission.csv', index=False)

In [0]:
sample_submission.head()

* Yes Answers

In [0]:
yes_answers = sample_submission[sample_submission['PredictionString'] == 'YES']
yes_answers

* No Answers


In [0]:
no_answers = sample_submission[sample_submission['PredictionString'] == 'NO']
no_answers

* Blank Answers

In [0]:
blank_answers = sample_submission[sample_submission['PredictionString'] == '']
blank_answers.head()

In [0]:
blank_answers.count()

### I [original author] am only sharing modifications that I believe may help. I left out tuning and any significant code changes I made.

In [0]:
! zdump PST
! pwd
if verbose:
    list_files('/content')
! ls -l {outdir}

## -- Submitting Results --

In [0]:
raise ExecutionStop("Don't let run all go beyond this")

In [0]:
%%bash
## View Previous Results
#kaggle competitions list
kaggle competitions submissions -c tensorflow2-question-answering

In [0]:
## Make Submission
# I am not sure we can submit to this competition from here, as it has to be a kernel submission
#! kaggle competitions submit -c tensorflow2-question-answering -f $RESULT_CSV  -m 'test kaggle cli 3'

Verify submission by viewing previous results

End of Project Notebook
# ====== Please fold this stuff up and ignore =====

### SSH Setup
This is only needed if you want to log into the Colab machine. Otherwise fold it up and ignore.<br>
To use it you have to create a login at https://ngrok.com
<Details>Thanks to Imad El Hanafi (https://imadelhanafi.com) for showing me how to do this.<p>
You will need to create a free account at https://ngrok.com/ for the SSH tunnel to work.</Details>

In [0]:
assert False        # Make sure user does not accidentally drop into this code

In [0]:
%%bash
## Install sshd; Set to allow login and config
apt-get install -o=Dpkg::Use-Pty=0 openssh-server pwgen > /dev/null
mkdir -p /var/run/sshd
echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
echo "PasswordAuthentication yes" >> /etc/ssh/sshd_config
# set host key to known value (need to test if exist)
if [ -f "/content/bertqa/colab/ssh_host_rsa_key.pub" ]; then
    cp "/content/bertqa/colab/ssh_host_rsa_key.pub" /etc/ssh/
    echo "Using ssh_host_rsa_key from gdrive"
fi
# this script fixes the login shell so Python will work
if [ -f "/content/bertqa/colab/init_shell.sh" ]; then
    echo "source /content/bertqa/colab/init_shell.sh" >> /root/.bashrc
fi

In [0]:
## setup ssh user / pass and start sshd

#Generate a random root password
import random, string
sshpass = ''.join(random.choice(string.ascii_letters + string.digits) for i in range(30))

#Set root password
! echo root:$sshpass | chpasswd

#Run sshd
get_ipython().system_raw('/usr/sbin/sshd -D &')
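
The password above is drawn with the `random` module, which is not designed for security-sensitive values. One possible alternative, sketched here as a suggestion rather than as part of the original notebook, is the standard-library `secrets` module. The function name `make_password` is hypothetical.

```python
import secrets
import string


def make_password(length=30):
    """Build a random password from letters and digits using a cryptographic RNG."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


sshpass = make_password()
print(len(sshpass))  # 30
```

The resulting `sshpass` could be passed to `chpasswd` in the same way as in the cell above.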

In [0]:
%%bash
## Get Ngrok from gdrive or try to download (see: https://ngrok.com/download)
if [ -f "/content/bertqa/colab/ngrok-stable-linux-amd64.zip" ]; then
    cp "/content/bertqa/colab/ngrok-stable-linux-amd64.zip" .
    echo "Using ngrok-stable-linux-amd64.zip from gdrive"
else
    wget -q -c -nc https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
fi
unzip -qq -n ngrok-stable-linux-amd64.zip
rm ngrok-stable-linux-amd64.zip

In [0]:
## Get user to enter auth token from ngrok and start tunnel

# Get token from ngrok for the tunnel
print("Get your authtoken from https://dashboard.ngrok.com/auth")
import getpass
authtoken = getpass.getpass()

#Create tunnel
get_ipython().system_raw('./ngrok authtoken $authtoken && ./ngrok tcp 22 &')

#### ==============================<br>|====&nbsp;&nbsp;  SSH Login Credentials &nbsp;&nbsp;====||<br>==============================

In [0]:
#@title
print("username: root")
print("password: ", sshpass)

Get the host name and port number at: https://dashboard.ngrok.com/status

```bash
ssh root@0.tcp.ngrok.io -p [ngrok_port]
Login as: root
Server refused our key
root@0.tcp.ngrok.io's password: [see above]

(Colab):/content$
```


Install vim

In [0]:
! apt-get install vim > /dev/null

If you need to kill Ngrok run this cell

In [0]:
if EnableAllCode and False:
    !kill $(ps aux | grep './ngrok' | awk '{print $2}')

## -- Misc Notes --

### Prevent Disconnects
Colab periodically disconnects the browser.<br>
Save model checkpoints to Google Drive so you don't lose work.<br>
See: https://mc.ai/google-colab-drive-as-persistent-storage-for-long-training-runs/<br>
Something to try:<br>
Press Ctrl+Shift+I in the browser and run this code in the console:
```
function KeepAlive(){
    console.log("Maintaining Connection");
    document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(KeepAlive,60000);
```
There have been reports of people having their GPU privileges suspended for letting processes run for over 12 hours. It seems that they may penalize you rather than just cutting you off.
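
The advice above about saving checkpoints to Google Drive can be sketched as a small copy helper. This is a minimal sketch under stated assumptions: the helper name `backup_checkpoints`, the `*.ckpt*` file pattern, and the Drive folder name are all hypothetical, and the Drive mount is commented out so the snippet also runs outside Colab.

```python
import shutil
import tempfile
from pathlib import Path


def backup_checkpoints(src_dir, dst_dir, pattern="*.ckpt*"):
    """Copy files matching `pattern` from src_dir to dst_dir; return the copied names."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob(pattern)):
        if path.is_file():
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied


# In Colab, Drive would be mounted first (paths below are assumptions):
# from google.colab import drive
# drive.mount('/content/drive')
# backup_checkpoints('/content/output', '/content/drive/MyDrive/bertqa_checkpoints')

# Stand-alone demo with temporary directories and dummy files.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output"
    out.mkdir()
    (out / "model.ckpt-100.index").write_text("demo")
    (out / "notes.txt").write_text("not a checkpoint")
    copied = backup_checkpoints(out, Path(tmp) / "drive_backup")

print(copied)  # ['model.ckpt-100.index']
```

Calling such a helper periodically during a long run means a disconnect costs at most the work done since the last copy.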