# BERTjoint Question Answering Contest

I prefer to do my primary development in a Colab virtual machine, but competitions require your kernel to run on Kaggle for scoring. You can start with this notebook and add your own project code, which should run in either location if you use the directory variables for file locations and correctly configure data and user libraries in Kaggle.<p>
The main differences are the file locations. On Colab you symlink to a Google Drive; on Kaggle you upload your files as a .zip data file. If you want SSH access, there is code at the bottom to allow this for Colab (not possible on Kaggle).<p>
If you run into problems or have suggestions I'd love to hear from you.

## Explanation

### Data & Directories
There are a couple of differences between Kaggle and Colab.<p>
**Colab** - Is a live Linux system with nothing being persistent. You can attach a Google Drive to your kernel and/or download files from gs://, GitHub, Kaggle, etc. Using Drive can carry a performance penalty but is the easiest way to access persistent files.<br>
I am symlinking Library files and output files.<br>
Because my google drive space is limited I choose to download large data files each time the kernel is used.<p>
**Kaggle** - The kernel has a persistent ./input directory for data that is read only. You can attach data and library files there. (zip your files into a single file and upload the .zip).<br>
Your private ./lib directory can be zipped and uploaded there as ./input/lib<br>
You can also create notebooks of kernel type "script" and then file->Add Utility Script in your competition notebook to attach the script files under ./usr/lib but you have to do this one file at a time. (see: https://www.kaggle.com/product-feedback/91185 for more information)<br>
I don't think competition scoring allows internet access so all files have to be attached to your notebook at submit time.
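
Because nothing on the Colab VM persists, the setup cells in this notebook copy data once and then leave a marker file (such as compdata.flag) so the copy is skipped on a re-run. A minimal sketch of that pattern follows. The helper name `copy_once` and the demo paths are hypothetical, and `dirs_exist_ok` assumes Python 3.8 or newer.

```python
import shutil
import tempfile
from pathlib import Path

def copy_once(src_dir, dst_dir, flag_name):
    """Copy src_dir into dst_dir unless a marker file already exists.

    Returns True if a copy was made, False if it was skipped.
    (Hypothetical helper; the notebook itself uses shell commands.)
    """
    dst_dir = Path(dst_dir)
    flag = dst_dir / flag_name
    if flag.exists():
        return False                      # an earlier run already copied the data
    shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
    flag.touch()                          # marker written only after a full copy
    return True

# Demo with throwaway directories standing in for Drive and {datadir}
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "drive_data"
    src.mkdir()
    (src / "sample.jsonl").write_text("{}\n")
    dst = Path(tmp) / "data"
    print(copy_once(src, dst, "compdata.flag"))   # first call copies
    print(copy_once(src, dst, "compdata.flag"))   # second call is skipped
```

Writing the marker last means an interrupted copy is retried on the next run instead of being treated as complete.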

### SSH
You can SSH into the Colab. If you do this you can vim the ./lib files directly and they will be
persistent. Also, since all project and notebook ./lib files are in gdrive you can access all of them through any of your Colab kernels by using the gdrive directory.

### Switching between Colab & Kaggle
One way of moving your script from Colab to Kaggle to run is:<br>
   * delete all cells from your Kaggle competition notebook<br>
   * download the .ipynb from Colab<br>
   * upload it into the blank Kaggle notebook.<br>
   * delete the cells near the bottom of the notebook that need to be deleted when running on Kaggle<br>
   * update any script parameters (e.g. verbose)
   * zip your current library files into a lib.zip and upload to your Kaggle dataset file
   * if you have changed ./data files make sure your Kaggle kernel has the current versions of those files
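
The "zip your library files" step above can be scripted instead of done by hand. This is only a sketch using the standard library; the function name `zip_lib` and the demo file are assumptions, and you still upload the resulting archive to your Kaggle dataset yourself.

```python
import shutil
import tempfile
from pathlib import Path

def zip_lib(libdir, out_base):
    """Zip everything under libdir into <out_base>.zip and return the archive path."""
    # make_archive adds the ".zip" suffix itself, so out_base has no extension
    return shutil.make_archive(str(out_base), "zip", root_dir=str(libdir))

# Demo on a throwaway directory standing in for {libdir}
with tempfile.TemporaryDirectory() as tmp:
    lib = Path(tmp) / "lib"
    lib.mkdir()
    (lib / "bert_utils.py").write_text("# placeholder\n")
    archive = zip_lib(lib, Path(tmp) / "lib")
    print(Path(archive).name)
```

Using `root_dir` keeps the file names inside the archive relative to the library directory, so no extra parent folder ends up in the zip.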

### Directory Structure (Notebook)

* Kaggle &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>(cwd = /kaggle/working/)</em><br>
  {datadir} = /kaggle/input<br>
  {libdir} = /kaggle/input/lib &nbsp; &nbsp; &nbsp; (or /kaggle/usr/lib)<br>
  {outdir} = /kaggle/working<br>
* Colab &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>(cwd = /content/)</em><br>
  {datadir} = /content/data<br>
  {libdir} = /content/lib<br>
  {outdir} = /content/output<br>

#### - Required Libraries
   * {libdir}/bert_utils.py
   * {libdir}/modeling.py
   * {libdir}/tokenization.py

#### - Inputs (competition data)
   * {datadir}/tensorflow2-question-answering/ (from https://www.kaggle.com/c/tensorflow2-question-answering/data)

#### - Required Data (additional packages)
   * {datadir}/bert-joint-baseline/ (from https://www.kaggle.com/prokaj/bert-joint-baseline; contains model and scripts)

#### - Outputs
   * {outdir}/predictions.json
   * {outdir}/submission.csv<br>
   * {outdir}/eval.tf_record<br>
   * {outdir}/.ipynb_checkpoints/<br>

### Directory Structure (Google Drive)
The following is the file structure for your google drive:<p>

/My Drive/Colab/ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>(this directory should be private)</em><br>
/My Drive/Colab/kaggle.json &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>(your personal Kaggle auth file)</em><br>
/My Drive/Colab/{projdir}/ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (folder for a specific project/competition)<br>
/My Drive/Colab/{projdir}/data/&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;(optional source for project data)<br>
/My Drive/Colab/{projdir}/{notebook}/&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;(individual notebook within project)<br>
/My Drive/Colab/{projdir}/{notebook}/notebook.ipynb<br>
/My Drive/Colab/{projdir}/{notebook}/lib/<br>
/My Drive/Colab/{projdir}/{notebook}/output/<br>
/My Drive/{projdir}/ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>(optional directory to be shared between teams)</em><br><p>
Using this directory structure and the variables in the config variables section below you can have any number of <br>projects on your google drive and any number of notebook versions in each project.<br> Each notebook can have a unique set of libraries.<p>
There is also a {nbver} which, if not '' is appended to the end of {outdir} and {libdir} offering additional flexibility <br>if you want. If not just set it to '' (empty string)
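
To make the role of {nbver} concrete, here is a small sketch that builds the Drive-side directories from {projdir}, {notebook} and {nbver}. The function name `drive_dirs` and the `_v2` suffix are illustrative assumptions; the root path is the Colab mount point used elsewhere in this notebook.

```python
from pathlib import PurePosixPath

# Mount point of the private Colab folder as seen from inside the VM
DRIVE_ROOT = PurePosixPath("/content/gdrive/My Drive/Colab")

def drive_dirs(projdir, notebook, nbver=""):
    """Return the Drive directories for one notebook version.

    nbver is appended to the lib and output folder names; '' means no suffix.
    """
    nb_root = DRIVE_ROOT / projdir / notebook
    return {
        "data": DRIVE_ROOT / projdir / "data",
        "lib": nb_root / f"lib{nbver}",
        "output": nb_root / f"output{nbver}",
    }

# "_v2" is a made-up version suffix used only for this demo
for name, path in drive_dirs("bertqa", "BERTjoint_yes_no", nbver="_v2").items():
    print(f"{name:6s} {path}")
```

With `nbver=""` the lib entry is `/content/gdrive/My Drive/Colab/bertqa/BERTjoint_yes_no/lib`, which is the default layout described above.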


### GitHub
As I learn more about GitHub I will add information here. For now, I think you can choose to make repositories out of either /My Drive/Colab/{projdir}/ or /My Drive/Colab/{projdir}/{notebook}/   <p>
You can then pull/push to the shared repo directly from the Colab (using SSH or %%bash commands in the notebook) or by linking the Google Drive to your local machine and doing it from there.<p>
From within the Colab all files are in your Google Drive under the path /content/gdrive/My\ Drive/Colab/...

### - Notes
Put notes about your Notebook here







### - Credits / Ancestry
If your notebook is a fork or combination of other notebooks here you should provide links so other people can look at where you built your current work from.<p>
This notebook is a fork of [mmmarchetti's notebook](https://www.kaggle.com/mmmarchetti/tensorflow-2-0-bert-yes-no-answers) which was a fork of [prokaj's - bert joint baseline notebook](https://www.kaggle.com/prokaj/bert-joint-baseline-notebook/notebook).<br>
mmmarchetti made some modifications to slightly improve the code and get the YES / NO answers and leave the unknowns blank.

# ========== Notebook Variables ==========

In [0]:
CompSubmission = False                       # Set to True if submitting to Competition

In [0]:
## Turn this on for development to make sure library updates get reimported
#  will reduce performance so comment out for production.
%load_ext autoreload
%autoreload 2

from IPython.core.debugger import set_trace      ## allow set_trace()

In [0]:
## Helper Functions   (probably will move to a library eventually)
import os

# Custom Error Handler
class ExecutionStop(Exception):
    ''' forces notebook to stop with a raised exception '''
    def __init__(self, value): self.value=value
    def __str__(self): return(str(self.value))

#  Show list of files and directories                 (it seems that this is skipping the symlinks)
def list_files(startpath, exclude=("/.config", "/gdrive", "/bertqa", "__pycache__")):
    ''' Lists files in {startpath} optionally excluding {exclude} directories '''
    for root, _, files in os.walk(startpath, followlinks=True):
        if any([e in root for e in exclude]):
            continue
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print(f"{indent}{os.path.basename(root)}/")
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print(f"{subindent}{f}")

def print_flags():
    ''' Prints the program config flags '''
    if verbose:
        print("\nParameters:")
        FLAGS = tf.compat.v1.flags.FLAGS      # tf.flags was removed in TensorFlow 2.x
        for attr, obj in sorted(FLAGS.__flags.items()):
            print(f"{attr.upper()}={obj.value}")
        print("")

def del_all_flags(FLAGS):
    ''' Deletes all program flags '''
    flags_dict = FLAGS._flags()
    keys_list = [keys for keys in flags_dict]
    for keys in keys_list:
        FLAGS.__delattr__(keys)

# raise ExecutionStop("Message")

In [4]:
## Set file locations    (these variables are not implemented in the FLAGS code yet)
import os, sys
from pathlib import Path

## Config Variables
verbose = not CompSubmission          # Turn this off to suppress some of the "fyi" output
competition = 'tensorflow2-question-answering'
train_file = ''                       # Set this below when you download the training file
test_file = ''                        # Set this below when you download the test file
projdir = 'bertqa'                    # The project directory on Drive for this competition
notebook = 'BERTjoint_yes_no'         # Subdir on Drive for files specific to this notebook
nbver = ''                            # library/output subfolder for this notebook version, or ''
DownloadBigFiles = True               # Files will not download if already on drive
EnableSMSMessages = False             # Allows notebook to attempt to send SMS alerts

if Path('/content').exists():
    print("Detected running on Colab")
    kernel = 'Colab'
    basedir = '/content'
    libdir = f"{basedir}/lib"
    datadir = f"{basedir}/data"
    outdir = f"{basedir}/output"      # will be symlinked to a user's private gdrive for persistence
elif Path('/kaggle').exists():
    print("Detected running on Kaggle")
    kernel = 'Kaggle'
    basedir = '/kaggle'
    libdir = f"{basedir}/input/BERTjoint_yes_no_lib2"   # you will upload a {name}.zip file as a dataset
    datadir = f"{basedir}/input"      # this may need to be '../input' for scoring
    outdir = f"{basedir}/working"     # this may need to be '.' for scoring
else:
    raise ExecutionStop("Cannot continue without determining file locations")

Detected running on Colab


# ============= Machine Spinup =============

In [5]:
! zdump PST
if verbose:
    list_files(basedir)

PST  Tue Jan 14 12:05:19 2020 PST
content/
    sample_data/
        anscombe.json
        README.md
        california_housing_test.csv
        california_housing_train.csv
        mnist_train_small.csv
        mnist_test.csv


In [0]:
if kernel == "Colab":
    if Path(f"{basedir}/sample_data").exists():
        ! rm -rf "{basedir}/sample_data"

## -- Setup --

### Google Drive

<font color="pink">Note: I am using an abbreviated section of the train_file as the {test_file} so we have answers to eval.<br>
Whenever you change the source for {test_file} you have to delete the nq-test.tfrecords file and remake the<br>
sample_submission.csv file (see code block after Google Drive)</font>
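
The notebook does not show how the abbreviated eval file was produced. One possible way, sketched here purely as an assumption, is to copy the first few records of the training JSONL file; the helper name `write_jsonl_head` and the record count are hypothetical.

```python
import itertools

def write_jsonl_head(src_path, dst_path, n_lines):
    """Copy the first n_lines records of a JSON-lines file; return how many were written."""
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n_lines, so the large source file is never fully read
        for line in itertools.islice(src, n_lines):
            dst.write(line)
            written += 1
    return written

# Hypothetical usage (100 is an arbitrary example size):
# write_jsonl_head(train_file, f"{datadir}/{competition}/simplified-nq-eval.jsonl", 100)
```

Reading line by line matters here because the full training file is too large to load into memory comfortably.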

In [7]:
## File link to Google Drive

######
##  WARNING: This is convenient but a rather bad idea from a security point of view. Mounting your
##           Google Drive in this way makes your entire drive accessible read/write and then you
##           are likely to run code from libraries that could be untrustworthy.
##           Possible alternatives would be to use read only file sharing links or if you want
##           read/write access (to allow file output for example) I would recommend creating a
##           special Google Drive account that only has files related to your Colab work.
######

if kernel == 'Colab':
    from google.colab import drive
    drive.mount(f"{basedir}/gdrive", force_remount=False)   # true to reread drive

    # this is a link to our team shared google folder (no longer sure this is useful)
    # Create a shorter shared directory name and avoid having to deal with the space in "My Drive"
    if Path(f"{basedir}/{projdir}").is_symlink():
        ! rm "{basedir}/{projdir}"
    ! ln -s "{basedir}/gdrive/My Drive/{projdir}/" "{basedir}/{projdir}"
    if not Path(f"{basedir}/{projdir}").exists():
        raise ExecutionStop("Symlink to shared project directory not found!")

    ## If you do not want to use the lib directory from your Google Drive set this block False
    if True:
        if Path(libdir).is_symlink():
            ! rm "{libdir}"
        ! ln -s "{basedir}/gdrive/My Drive/Colab/{projdir}/{notebook}/lib{nbver}/" "{libdir}"
        if not Path(libdir).exists():
            raise ExecutionStop("Project libdir directory not found!")

    ## If you do not want output to be written to your Google Drive set this block False
    if True:
        if Path(outdir).is_symlink():
            ! rm "{outdir}"
        ! ln -s "{basedir}/gdrive/My Drive/Colab/{projdir}/{notebook}/output{nbver}/" "{outdir}"
        if not Path(outdir).exists():
            raise ExecutionStop("Project outdir directory not found!")

    ## If you want to use data from gDrive set to True, if False this will download data
    #  Reading off google drive appears to be 8 times slower so rather than mapping we are
    #  copying so you only pay the price once
    if True:
        altdatasrc = f"{basedir}/gdrive/My Drive/Colab/{projdir}/data"
        if not Path(f"{altdatasrc}/compdata.flag").exists():   ## f-string and .exists() are required; a bare Path is always truthy
            raise ExecutionStop("Could not find compdata on Google Drive!")
        if not Path(f"{datadir}/compdata.flag").exists():      ## Don't do anything if flag exists
            print("\nGetting Competition Data From Google Drive")
            ! [ -d "{datadir}/{competition}" ] || mkdir -p "{datadir}/{competition}"
            ! cp "{altdatasrc}/{competition}/sample_submission.csv" "{datadir}/{competition}/"
            ! cp "{altdatasrc}/{competition}/simplified-nq-train.jsonl" "{datadir}/{competition}"
            ! cp "{altdatasrc}/{competition}/simplified-nq-test.jsonl" "{datadir}/{competition}"
            ! cp "{altdatasrc}/{competition}/simplified-nq-eval.jsonl" "{datadir}/{competition}"
            ! cp "{altdatasrc}/{competition}"/*.tfrecords "{datadir}/{competition}"
            ! touch "{datadir}/compdata.flag"
            train_file = f"{datadir}/{competition}/simplified-nq-train.jsonl"
            test_file = f"{datadir}/{competition}/simplified-nq-eval.jsonl"
        if not Path(f"{altdatasrc}/bertdata.flag").exists():   ## f-string and .exists() are required; a bare Path is always truthy
            raise ExecutionStop("Could not find bertfiles on Google Drive!")
        if not Path(f"{datadir}/bertfiles.flag").exists():      ## Don't do anything if flag exists
            print("Getting BERTjoint Model From Google Drive")
            ! [ -d "{datadir}/bert-joint-baseline" ] || mkdir -p "{datadir}/bert-joint-baseline"
            ! cp "{altdatasrc}/bert-joint-baseline"/* "{datadir}/bert-joint-baseline/"
            ! touch "{datadir}/bertfiles.flag"

    if verbose:
        print(f"\nCWD: {basedir}\n")
        ! ls -lh "{basedir}"
        print()
        list_files(basedir)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive

Getting Competition Data From Google Drive
Getting BERTjoint Model From Google Drive

CWD: /content

total 12K
lrwxrwxrwx 1 root root   32 Jan 14 12:05 bertqa -> '/content/gdrive/My Drive/bertqa/'
drwxr-xr-x 4 root root 4.0K Jan 14 12:06 data
drwx------ 4 root root 4.0K Jan 14 12:05 gdrive
lrwxrwxrwx 1 root root   59 Jan 14 12:05 lib -> '/content/gdrive/My Drive/Colab/bertqa/BERTjoint_yes_no/lib/'
lrwxrwxrwx 1 root root   62 Jan 14 12:05 output -> '/content/

In [0]:
## Rewrite the sample_submission.csv file. You only need to do this if you have changed the
#  data in {test_file}. If you have done that you need to recreate this file and copy it back to
#  wherever you are copying it to the VM from.

if False:
    import csv
    ! rm "{datadir}/{competition}/sample_submission.csv"
    with open(f"{datadir}/{competition}/sample_submission.csv", 'w') as file:
        writer = csv.writer(file)
        writer.writerow(["example_id","PredictionString"])

        ## examples seems to be lists of len 1 containing an bert_utils.NQ_Example
        for examples in bert_utils.nq_examples_iter( input_file = f"{test_file}", 
                                                    is_training = True,     # whether to include answer
                                                    tqdm = tqdm_notebook):
            ## uncomment this line to see answers
            # print(examples[0].example_id, examples[0].answer)
            writer.writerow([f"{examples[0].example_id}_long", ''])
            writer.writerow([f"{examples[0].example_id}_short", ''])

### SMS Messaging
<Details>You will need an account with https://www.twilio.com/ for this to work and must have copied your auth API token into a file on your Google Drive.<p>
You will also need a json file with the following config information:<p>
```
{
"account_sid":"{your account_sid}",
"auth_token":"{your auth_token}",
"send_to":"+1{delivery number}",
"send_from":"+1{your twilio number}"
}
```
</Details>
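
Since a missing key in the config file only shows up when the first message is sent, it can help to validate the file when it is loaded. The following is a sketch, not part of the notebook; `load_sms_config` is a hypothetical helper and the key names are the four listed above.

```python
import json

REQUIRED_KEYS = ("account_sid", "auth_token", "send_to", "send_from")

def load_sms_config(path):
    """Load the SMS config JSON and fail early if a required key is missing or empty."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"{path} is missing: {', '.join(missing)}")
    return config

# Hypothetical usage:
# twil_auth = load_sms_config("/content/gdrive/My Drive/Colab/twilio.json")
```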

In [0]:
if EnableSMSMessages and kernel == 'Colab':
    ! pip3 install twilio > /dev/null
    from twilio.rest import Client as TC
    import json
    twil_file = f"{basedir}/gdrive/My Drive/Colab/twilio.json" 
    if Path(twil_file).exists():
        # if there is a twilio.json file in gdrive use it
        with open(twil_file, 'r') as f:
            twil_auth = json.load(f)

        account_sid = twil_auth["account_sid"]
        auth_token  = twil_auth["auth_token"]

        sms = TC(account_sid, auth_token)
    else:
        sms = None
else:
    sms = None

def sms_message(str_msg):
    ''' Sends messages through sms gateway to notify user of alerts.
        If not enabled calls to sms_message() will silently do nothing.
        For this to work you need an account at https://www.twilio.com/
        Config through a twilio.json file (called above) that must contain
        {account_sid, auth_token, send_to, send_from}
    '''
    if sms is not None:
        message = sms.messages.create(
            to=twil_auth['send_to'], 
            from_=twil_auth['send_from'],
            body=str_msg)
        if message.error_message is not None:
            print(f"\nSMS Error: {message.error_message}")

sms_message("Twilio SMS gateway from Colab enabled")

### Kaggle API
<Details>You will need a Kaggle API token to link the Colab instance to your Kaggle account to get data, etc.<br>
Go to: https://www.kaggle.com/yourID/account and click on the "Create New API Token" button to get a file named kaggle.json.<p>You can put your kaggle.json file in your Google Drive at My Drive/Colab/kaggle.json.<br>
Alternatively, you can store it on your local machine and the script will ask you to upload it.</Details>
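
Before pointing KAGGLE_CONFIG_DIR at a folder, a quick check that the token file looks right can save a confusing CLI error. This sketch assumes the usual kaggle.json layout with `username` and `key` fields; `check_kaggle_token` is a hypothetical helper, and the permission change may have no effect on a mounted Drive.

```python
import json
import os
import stat
from pathlib import Path

def check_kaggle_token(config_dir):
    """Return True if config_dir holds a kaggle.json with 'username' and 'key'."""
    token = Path(config_dir) / "kaggle.json"
    if not token.exists():
        return False
    with open(token, "r", encoding="utf-8") as f:
        creds = json.load(f)
    if not {"username", "key"} <= set(creds):
        return False
    try:
        # Owner read/write only, the same effect as `chmod 600`
        os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    except OSError:
        pass    # some mounted file systems do not allow permission changes
    return True

# Hypothetical usage:
# if check_kaggle_token("/content/gdrive/My Drive/Colab"):
#     os.environ["KAGGLE_CONFIG_DIR"] = "/content/gdrive/My Drive/Colab/"
```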

In [10]:
## Link to Kaggle
if kernel == 'Colab':
    from google.colab import files
    if Path(f"{basedir}/gdrive/My Drive/Colab/kaggle.json").exists():
        # if there is a kaggle.json file in gdrive use it
        os.environ['KAGGLE_CONFIG_DIR'] = f"{basedir}/gdrive/My Drive/Colab/"
        ! ls -l "{basedir}/gdrive/My Drive/Colab/kaggle.json"
    else:
        # Have user upload file
        print('Upload kaggle.json.')
        # The files.upload() command is failing sporadically with:
        #   TypeError: Cannot read property '_uploadFiles' of undefined (just run this cell again)
        ! rm "{basedir}/kaggle.json"  2> /dev/null
        files.upload()
        ! chmod 600 kaggle.json
        os.environ['KAGGLE_CONFIG_DIR'] = f"{basedir}/"
        ! ls -l "{basedir}/kaggle.json"
    # import kaggle                           # gives us access to kaggle.api
    # help(kaggle)

-rw------- 1 root root 66 Dec 15 08:31 '/content/gdrive/My Drive/Colab/kaggle.json'


## -- Main System Config --
<Details><Summary>Global Config</Summary>
Put any global system configuration here</Details>

In [0]:
%%bash -s "{libdir}" "{datadir}" "{outdir}"
# make directories if not already exist
[ -d "$1" ] || mkdir -p "$1"        # {libdir}
[ -d "$2" ] || mkdir -p "$2"        # {datadir}
[ -d "$3" ] || mkdir -p "$3"        # {outdir}

In [0]:
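import sys, os
if libdir not in sys.path:                       # don't add multiple times
    sys.path.append(libdir)
# needed to run python scripts from shell; PYTHONPATH may be unset on some kernels
pythonpath = os.environ.get('PYTHONPATH', '')
if libdir not in pythonpath.split(':'):
    os.environ['PYTHONPATH'] = f"{pythonpath}:{libdir}" if pythonpath else libdir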

In [13]:
if verbose:
    print('', "sys.path:", *sys.path, '', sep='\n')
    !printenv |grep -E 'KAGGLE|PYTHON'
    print()


sys.path:

/env/python
/usr/lib/python36.zip
/usr/lib/python3.6
/usr/lib/python3.6/lib-dynload
/usr/local/lib/python3.6/dist-packages
/usr/lib/python3/dist-packages
/usr/local/lib/python3.6/dist-packages/IPython/extensions
/root/.ipython
/content/lib

KAGGLE_CONFIG_DIR=/content/gdrive/My Drive/Colab/
PYTHONPATH=/env/python:/content/lib



# =========== Project Setup Stuff ===========

### Download Dataset and Support Files

Kaggle Competition Files

In [14]:
## Competition Dataset
if DownloadBigFiles and kernel == 'Colab':
    if not Path(f"{datadir}/compdata.flag").exists():      ## Don't download again if exists
        # comp data might exist from previous run or because you copied it from gDrive
        print("Downloading Competition Data\n")
        ! kaggle competitions download -c "{competition}" -p "{datadir}"
        ! mkdir -p "{datadir}/{competition}/"
        ! mv "{datadir}/sample_submission.csv"  "{datadir}/{competition}"
        ! unzip "{datadir}/simplified-nq-train.jsonl.zip" -d "{datadir}/{competition}"
        ! rm "{datadir}/simplified-nq-train.jsonl.zip"
        ! unzip "{datadir}/simplified-nq-test.jsonl.zip" -d "{datadir}/{competition}"
        ! rm "{datadir}/simplified-nq-test.jsonl.zip"
        ! touch "{datadir}/compdata.flag"
        train_file = f"{datadir}/{competition}/simplified-nq-train.jsonl"
        test_file = f"{datadir}/{competition}/simplified-nq-test.jsonl"
    else:
        print("Competition Data already exists. Not downloading.\n")
        !ls -lh "{datadir}/{competition}"/*
else:
    print("For Kaggle, make sure you download a copy of the competition data into your kernel")
    ! ls -lh "{datadir}/{competition}"/*
    train_file = f"{datadir}/{competition}/simplified-nq-train.jsonl"
    test_file = f"{datadir}/{competition}/simplified-nq-test.jsonl"

    # public_dataset = os.path.getsize(f"{test_file}") < 20_000_000
    private_dataset = os.path.getsize(f"{test_file}") >= 20_000_000

Competition Data already exists. Not downloading.

-rw------- 1 root root 7.9M Jan 14 12:06 /content/data/tensorflow2-question-answering/nq-test.tfrecords
-rw------- 1 root root 5.5K Jan 14 12:05 /content/data/tensorflow2-question-answering/sample_submission.csv
-rw------- 1 root root 4.4M Jan 14 12:06 /content/data/tensorflow2-question-answering/simplified-nq-eval.jsonl
-rw------- 1 root root 477K Jan 14 12:06 /content/data/tensorflow2-question-answering/simplified-nq-test.jsonl
-rw------- 1 root root  53M Jan 14 12:06 /content/data/tensorflow2-question-answering/simplified-nq-train.jsonl


BERTjoint files from: https://www.kaggle.com/prokaj/bert-joint-baseline<p>
<font color='pink'>We probably want to use a different source if possible because this archive has a 1GB .tfrecords file we are not currently using so slower than it should be. For now I am using a smaller copy on my Google Drive</font>


In [15]:
# Get BERTjoint model files (this is a copy of the prokaj file from my Google Drive)
if DownloadBigFiles and kernel == 'Colab':
    if not Path(f"{datadir}/bertfiles.flag").exists():      ## Don't download again if exists
        print("Downloading BERT-joint Model\n")
        ! mkdir -p "{datadir}/bert-joint-baseline/"
        filestoget = "bert_config* model_cpkt* nq-test* vocab*"
        ! kaggle datasets download -d prokaj/bert-joint-baseline -p "{datadir}"
        ! unzip "{datadir}/bert-joint-baseline.zip" {filestoget} -d "{datadir}/bert-joint-baseline/"
        ! rm "{datadir}/bert-joint-baseline.zip"
        if Path(f"{datadir}/bert-joint-baseline-output.npz").exists():
            ! rm "{datadir}/bert-joint-baseline-output.npz" # if kaggle downloaded this delete it
        if verbose:
            ! ls -lh "{datadir}/bert-joint-baseline/"
        ! touch "{datadir}/bertfiles.flag"
    else:
        print("BERT-joint Files already exists. Not downloading.\n")
        ! ls -lh "{datadir}/bert-joint-baseline/"
else:
    print("For Kaggle, make sure you download a copy of prokaj's bert-joint-baseline to your kernel")
    ! ls -lh "{datadir}/bert-joint-baseline/"


BERT-joint Files already exists. Not downloading.

total 1.3G
-rw------- 1 root root  314 Jan 14 12:06 bert_config.json
-rw------- 1 root root    0 Jan 14 12:06 bert-model-prokaj.flag
-rw------- 1 root root  78K Jan 14 12:06 model_cpkt-1.data-00000-of-00002
-rw------- 1 root root 1.3G Jan 14 12:06 model_cpkt-1.data-00001-of-00002
-rw------- 1 root root  29K Jan 14 12:06 model_cpkt-1.index
-rw------- 1 root root 227K Jan 14 12:06 vocab-nq.txt


BERTjoint files from: 
https://github.com/google-research/language/tree/master/language/question_answering/bert_joint


In [0]:
# Get BERTjoint model files    (we are not using this model / files)
if False and DownloadBigFiles and kernel == 'Colab':
    if not Path(f"{datadir}/bertfiles.flag").exists():
        print("Downloading BERT-joint Model\n")
        ! gsutil cp -R gs://bert-nq/bert-joint-baseline "{datadir}"
        ! touch "{datadir}/bertfiles.flag"
    else:
        print("BERT-joint Files already exists. Not downloading.\n")
        ! ls -lh "{datadir}/bert-joint-baseline/"
else:
    pass        # if Kaggle, data will already be there

BERT files from: https://github.com/google-research/bert<br>
(Not the model we are using at the moment)

In [0]:
## get BERT (this is unlikely to be the BERT-joint files needed for competition)
# this version of BERT does not seem to import as is. On line 88 of lib/bert/optimization.py
#    change   tf.train.Optimizer to tf.keras.optimizers.Optimizer
if False and DownloadBigFiles and kernel == 'Colab':
    ! git clone https://github.com/google-research/bert.git
    ! mv bert lib

    # get some pretrained models  (I really  have no idea what these are or if useful)
    ! wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
    ! unzip cased_L-12_H-768_A-12.zip
    ! rm cased_L-12_H-768_A-12.zip

### Library Setup

Custom libraries are uploaded to google Drive and symlinked if on Colab and zipped and<br>
uploaded to a dataset if on Kaggle.
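
On Colab the symlinks are created with shell commands (`rm` followed by `ln -s`). The same idempotent relinking can be written in plain Python, as sketched below; `relink` is a hypothetical helper, not something the notebook defines, and it assumes a POSIX file system.

```python
import os
import tempfile
from pathlib import Path

def relink(target, link):
    """Point the symlink `link` at `target`, replacing an older symlink if present.

    Returns True if the link resolves to an existing path afterwards.
    """
    link = Path(link)
    if link.is_symlink():
        link.unlink()                     # drop a stale link from an earlier run
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    os.symlink(str(target), str(link), target_is_directory=True)
    return link.exists()                  # False when the target is missing

# Demo with throwaway directories standing in for the Drive folder and {libdir}
with tempfile.TemporaryDirectory() as tmp:
    drive_lib = Path(tmp) / "drive_lib"
    drive_lib.mkdir()
    print(relink(drive_lib, Path(tmp) / "lib"))   # created
    print(relink(drive_lib, Path(tmp) / "lib"))   # safely re-created
```

Refusing to replace a real directory avoids deleting local files by accident when the link name is already in use.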

In [18]:
# if you need to move/copy any lib files to final locations do it here
if kernel == 'Colab':
    pass
if kernel == 'Kaggle':
    ! cp "{datadir}/bert-joint-baseline"/*.py "{libdir}"

if verbose:
    print(f"{libdir}/")
    ! ls -lh "{libdir}"/*

/content/lib/
-rw------- 1 root root  43K Jan 14 08:07 /content/lib/bert_utils.py
-rw------- 1 root root 7.5K Jan  7 06:53 /content/lib/flag_defaults.py
-rw------- 1 root root  41K Jan  5 00:36 /content/lib/modeling.py
-rw------- 1 root root  13K Jan  5 00:36 /content/lib/tokenization.py

/content/lib/natural_questions:
total 36K
-rw------- 1 root root  15K Jan 14 11:31 eval_utils.py
-rw------- 1 root root  17K Jan 14 11:25 nq_eval.py
drwx------ 2 root root 4.0K Jan 13 08:24 __pycache__

/content/lib/__pycache__:
total 72K
-rw------- 1 root root  25K Jan 14 08:07 bert_utils.cpython-36.pyc
-rw------- 1 root root 5.1K Jan 13 08:24 flag_defaults.cpython-36.pyc
-rw------- 1 root root  31K Jan 13 08:23 modeling.cpython-36.pyc
-rw------- 1 root root 9.7K Jan 13 08:23 tokenization.cpython-36.pyc


In [19]:
## Load Libraries
import os, sys, importlib

if kernel == "Colab":               # Kaggle is V2 by default
    # magic that makes Colab select TensorFlow 2.x
    %tensorflow_version 2.x 

import tensorflow as tf
print("TensorFlow", tf.__version__)

import numpy as np
import pandas as pd
import collections

import bert_utils
import modeling
import tokenization

import json

TensorFlow 2.x selected.
TensorFlow 2.1.0-rc1


In [0]:
# raise ExecutionStop("Execution stopped")

# =========== Project Code ===========

## -- Code Implementation in Tensorflow 2.0 --

**A few notes:**
- Since the kernels do not use TPUs, most of the **TPU**-related code was removed to reduce complexity.
- TensorFlow 2 does not let us use global variable collections **(tf.compat.v1.trainable_variables())**.

In this notebook, we'll be using the Bert baseline for TensorFlow to create predictions for the Natural Questions test set. Note that this uses a model that has already been pre-trained, so we're only doing inference here. A GPU is required, and the run should take between 1 and 2 hours.

The original script can be found [here](https://github.com/google-research/language/blob/master/language/question_answering/bert_joint/run_nq.py).<br>
The supporting modules were drawn from the [official Tensorflow model repository](https://github.com/tensorflow/models/tree/master/official).<br>
The bert-joint-baseline data is described [here](https://github.com/google-research/language/tree/master/language/question_answering/bert_joint).

In [21]:
! zdump PST

PST  Tue Jan 14 12:07:40 2020 PST


### Support Classes

In [0]:
# ### Values in bert_config.json
#  = {
# 'attention_probs_dropout_prob':0.1,
# 'hidden_act':'gelu', # 'gelu',
# 'hidden_dropout_prob':0.1,
# 'hidden_size':1024,
# 'initializer_range':0.02,
# 'intermediate_size':4096,
# 'max_position_embeddings':512,
# 'num_attention_heads':16,
# 'num_hidden_layers':24,
# 'type_vocab_size':2,
# 'vocab_size':30522
# }

#### Note from ChrisM
The Bert-Joint model builds on Bert by adding some new layers to the original Bert model. One is a 2-dimensional layer that is applied to the sequence hidden states and produces the start and end logits. In other words, each sequence token gets two logits, which are used to generate the probabilities that the token is a start or end token. The other layer produces a 5-dimensional vector of logits over the response types as encoded in the Bert-Joint paper (i.e., [no response, yes, no, short, long]). Both of these additional layers are created as instances of the class TDense.

The mk_model function adds these new layers to the Bert model to create the Bert-Joint model. This is the same model that the Bert-Joint paper builds.
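
To make the shapes in the note above concrete, here is a NumPy-only sketch of the two extra heads. It uses random stand-in tensors and toy sizes rather than the real Bert outputs, and the helper names `dense_t` and `softmax` are illustrative; it mirrors the `x @ kernel.T + bias` form, with the kernel stored as (output_size, input_size).

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 8, 16     # toy sizes; bert_config.json uses 512 and 1024

# Random stand-ins for the two Bert outputs
sequence_output = rng.normal(size=(batch, seq_len, hidden))
pooled_output = rng.normal(size=(batch, hidden))

def dense_t(x, kernel, bias):
    """y = x @ kernel.T + bias, kernel stored as (output_size, input_size)."""
    return x @ kernel.T + bias

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

span_kernel, span_bias = rng.normal(size=(2, hidden)), np.zeros(2)
type_kernel, type_bias = rng.normal(size=(5, hidden)), np.zeros(5)

logits = dense_t(sequence_output, span_kernel, span_bias)         # (batch, seq_len, 2)
start_logits, end_logits = logits[..., 0], logits[..., 1]         # each (batch, seq_len)
ans_type_logits = dense_t(pooled_output, type_kernel, type_bias)  # (batch, 5)

start_probs = softmax(start_logits)   # one distribution over token positions per example
print(start_logits.shape, end_logits.shape, ans_type_logits.shape)
```

The span head is applied per token, so it keeps the sequence axis, while the answer-type head works on the pooled vector and yields one 5-way score per example.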




In [0]:
class TDense(tf.keras.layers.Layer):
    def __init__(self,
                 output_size,
                 kernel_initializer=None,
                 bias_initializer="zeros",
                **kwargs):
        super().__init__(**kwargs)
        self.output_size = output_size
        self.kernel_initializer = kernel_initializer
        self.bias_initializer = bias_initializer

    def build(self,input_shape):
        dtype = tf.as_dtype(self.dtype or tf.keras.backend.floatx())
        if not (dtype.is_floating or dtype.is_complex):
            raise TypeError("Unable to build `TDense` layer with "
                            "non-floating point (and non-complex) "
                            "dtype %s" % (dtype,))
        input_shape = tf.TensorShape(input_shape)
        if tf.compat.dimension_value(input_shape[-1]) is None:
            raise ValueError("The last dimension of the inputs to "
                             "`TDense` should be defined. "
                             "Found `None`.")
        last_dim = tf.compat.dimension_value(input_shape[-1])
#       self.input_spec = tf.keras.layers.InputSpec(min_ndim=3, axes={-1: last_dim})    # original value <<< mmm still same >>>
        self.input_spec = tf.keras.layers.InputSpec(min_ndim=2, axes={-1: last_dim})    # ChrisM changed
        self.kernel = self.add_weight(
            "kernel",
            shape=[self.output_size,last_dim],
            initializer=self.kernel_initializer,
            dtype=self.dtype,
            trainable=True)
        self.bias = self.add_weight(
            "bias",
            shape=[self.output_size],
            initializer=self.bias_initializer,
            dtype=self.dtype,
            trainable=True)
        super(TDense, self).build(input_shape)

    def call(self,x):
        return tf.matmul(x,self.kernel,transpose_b=True)+self.bias


def mk_model(config):
    seq_len = config['max_position_embeddings']
    unique_id  = tf.keras.Input(shape=(1,),dtype=tf.int64,name='unique_id')
    input_ids   = tf.keras.Input(shape=(seq_len,),dtype=tf.int32,name='input_ids')
    input_mask  = tf.keras.Input(shape=(seq_len,),dtype=tf.int32,name='input_mask')
    segment_ids = tf.keras.Input(shape=(seq_len,),dtype=tf.int32,name='segment_ids')
    BERT = modeling.BertModel(config=config,name='bert')
    pooled_output, sequence_output = BERT(input_word_ids=input_ids,
                                          input_mask=input_mask,
                                          input_type_ids=segment_ids)
    
    logits = TDense(2,name='logits')(sequence_output)
    start_logits,end_logits = tf.split(logits,axis=-1,num_or_size_splits= 2,name='split')
    start_logits = tf.squeeze(start_logits,axis=-1,name='start_squeeze')
    end_logits   = tf.squeeze(end_logits,  axis=-1,name='end_squeeze')
    
    ans_type      = TDense(5,name='ans_type')(pooled_output)
    return tf.keras.Model([input_ for input_ in [unique_id,input_ids,input_mask,segment_ids] 
                           if input_ is not None],
                          [unique_id,start_logits,end_logits,ans_type],
                          name='bert-baseline')    

### Create Model

In [24]:
print("\nGPU Memory\n")
!nvidia-smi --query-gpu=utilization.memory,memory.total,memory.free,memory.used --format=csv


GPU Memory

utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
0 %, 11441 MiB, 11441 MiB, 0 MiB


In [0]:
## grab bert config
with open(f"{datadir}/bert-joint-baseline/bert_config.json", 'r') as f:
    bert_config = json.load(f)

In [26]:
model = mk_model(bert_config)    # build the model even when verbose is False; the checkpoint cell needs it
if verbose:
    print("BERT config:")
    print(json.dumps(bert_config, indent=4))
    print("\nModel Summary")
    model.summary()
    print()
    tf.keras.utils.plot_model( model,
                        to_file=f"{outdir}/model.png",    # Graphic output in outdir
                        show_shapes=True,
                        show_layer_names = True,
                        rankdir = 'TB',            # TB creates vertical; LR creates horizontal
                        expand_nested = True,     # expand nested models into clusters
                        dpi = 300
                        )

BERT config:
{
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 1024,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "max_position_embeddings": 512,
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "type_vocab_size": 2,
    "vocab_size": 30522
}

Model Summary
Model: "bert-baseline"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 512)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 512)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputL

### Checkpoint

In [27]:
cpkt = tf.train.Checkpoint(model=model)
cpkt.restore(f"{datadir}/bert-joint-baseline/model_cpkt-1").assert_consumed()

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f2fdea4ec50>

### Setting the Flags

In [0]:
class DummyObject:
    def __init__(self,**kwargs):
        self.__dict__.update(kwargs)

## 0.60 new code mmm    (ALC values in parens if different)
# Lots of stuff here but he only really uses the three values below
# FLAGS=DummyObject(skip_nested_contexts=True, #True
#                   max_position=50,            # set to same in bert_utils.py
#                   max_contexts=48,            # set to same in bert_utils.py
#                   max_query_length=64,        # set to same in bert_utils.py
#                   max_seq_length=512, #512      (384)    # set to same in bert_utils.py
#                   doc_stride=128,             # set to same in bert_utils.py
#                   include_unknowns=0.02, #0.02  (-1.0)   # changed from bert_utils.py
#                   n_best_size=5, #20            (20)     # changed from bert_utils.py
#                   max_answer_length=30, #30   # set to same in bert_utils.py
                  
#                   warmup_proportion=0.1,
#                   learning_rate=1e-5,   #       (5e-5)
#                   num_train_epochs=3.0,
#                   train_batch_size=32,
#                   num_train_steps=100000,   #   (None: features / batch_size * epochs)
#                   num_warmup_steps=10000,  #    (None: train_steps * warmup_proportion)
#                   max_eval_steps=100,     #     (Not present)
#                   use_tpu=False,
#                   eval_batch_size=8,      #     (Not present)
#                   max_predictions_per_seq=20) # (Not present)

# I tried changing n_best_size back to 20 and resubmitted. Still got 0.60
FLAGS=DummyObject(
                  max_seq_length=512, #512      (384)    # set to same in bert_utils.py
                  n_best_size=5, #20            (20)     # changed from bert_utils.py
                  max_answer_length=30  #30   # set to same in bert_utils.py
                 )

#### Note from ChrisM
This next section does the processing from the json file to a tokenized representation.  The output from the nq_examples_iter iterator is an 'example' which holds a single question along with the candidate answers.  At the nq_examples_iter output level, tokenization to the Bert-Joint wordpiece level has not yet been performed.  **NOTE: the nq_examples_iter will by default skip over any candidate long answers that are subsets of other candidate long answers ( as indicated by the top_level=False indicator ).**

Each 'example' contains the text of the question and an element called 'doc_tokens' which is a list of strings representing the long answer candidates all concatenated together:
*   The long answers have been split into whitespace delimited words
*   HTML characters have been removed
*   For each candidate long answer, the 'type' and 'position' of the answer in the text are recorded at the start in special tokens as in the Bert-Joint paper
*   'type' tells whether the long answer is a paragraph, table, list, or other as indicated by the first HTML character in the candidate answer
*   'position' indicates how many of the given answer type were seen previously in the document.  Ie, is the answer the first paragraph or the second, etc...
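
As a rough sketch of the 'type' and 'position' bookkeeping in the list above, the snippet below numbers each candidate within its own type. The function name `label_candidates`, the bracketed token spelling and the choice to start counting at 1 are all assumptions made for illustration; the real logic lives in bert_utils, which is not shown here.

```python
from collections import defaultdict


def label_candidates(candidate_types):
    """Return (type, position) pairs, where position counts per type from 1."""
    seen = defaultdict(int)
    labelled = []
    for cand_type in candidate_types:
        seen[cand_type] += 1                 # how many of this type so far
        labelled.append((cand_type, seen[cand_type]))
    return labelled


# Made-up document layout: the type of each long answer candidate in order.
types = ["Paragraph", "Table", "Paragraph", "List", "Paragraph"]
for cand_type, position in label_candidates(types):
    # Hypothetical special-token spelling, for illustration only.
    print(f"[{cand_type}={position}]")
```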

After processing as above, the convert(example) line tokenizes to the Bert-Joint wordpiece vocabulary and formats the question-document pairs as Bert-Joint requires: the question first, then a SEP token, and then a fragment of the text, where the text is stepped over in overlapping windows that advance 128 tokens at a time.  I am not sure, but I think the original Bert-Joint fed in the entire document, whereas here we have been given a set of 'plausible' long answers and we only include those.  **Question:  do our long answer candidates in the dataset span the entire document? If not, would it be better to feed the whole document in?**
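
The overlapping-window idea can be sketched in a few lines. This is not the code from bert_utils; `sliding_windows` is a hypothetical helper, and the window size of 445 used in the demo is an assumed budget (sequence length minus question and special tokens), not a value taken from the notebook. Only the stride of 128 comes from the flags.

```python
def sliding_windows(n_tokens, window_size, stride):
    """Return (start, length) spans that cover n_tokens with overlap."""
    spans = []
    start = 0
    while start < n_tokens:
        length = min(window_size, n_tokens - start)
        spans.append((start, length))
        if start + length == n_tokens:
            break                      # the last window reached the end
        start += min(length, stride)   # advance by the stride, keeping overlap
    return spans


# Demo: a 1000-token document, assumed window of 445 tokens, stride of 128.
for start, length in sliding_windows(1000, 445, 128):
    print(f"tokens {start}..{start + length - 1}")
```

Each window becomes one question/document-fragment pair, which is why a single json entry turns into several features.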

After processing by convert(example), each json entry has been split into multiple InputFeatures objects which have the required tokenized input form for Bert-Joint as well as two identifiers:
*   example_index = the id from the json file
*   unique_id = example_index + 'an incrementing integer'

The incrementing integer that contributes to unique_id is just an attempt to give unique ids to each question/document-fragment pair.  The unique_ids are, I think, not guaranteed to be unique unfortunately but they probably are.  A collision needs two ids from the json file to be closer together than the number of fragments generated for the lower one.  We could test for this just to be sure.
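
A uniqueness test along those lines could look like the sketch below. Both helper names are hypothetical, and `ids_for_example` encodes my reading of the scheme (example id plus a running integer), which is an assumption rather than something checked against bert_utils.

```python
from collections import Counter


def find_duplicate_ids(unique_ids):
    """Return the sorted list of ids that occur more than once."""
    counts = Counter(unique_ids)
    return sorted(uid for uid, n in counts.items() if n > 1)


def ids_for_example(example_index, n_features):
    """Assumed scheme: unique_id = example_index + running feature index."""
    return [example_index + i for i in range(n_features)]


# Two made-up examples whose ids are only 2 apart, while the first one
# produces 3 fragments, so the third fragment collides with the second example.
ids = ids_for_example(1000, 3) + ids_for_example(1002, 2)
print("duplicates:", find_duplicate_ids(ids))
```

In the notebook the same check could be run on the `unique_id` values read back from the tfrecords file; an empty list would mean there are no collisions.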












In [0]:
import tqdm
eval_records = f"{datadir}/{competition}/nq-test.tfrecords"
if kernel == 'Kaggle' and private_dataset:
    eval_records='nq-test.tfrecords'
if not Path(eval_records).exists():
    # tf2baseline.FLAGS.max_seq_length = 512
    eval_writer = bert_utils.FeatureWriter(
        filename=os.path.join(eval_records),
        is_training=False)

    tokenizer = tokenization.FullTokenizer(vocab_file=f"{datadir}/bert-joint-baseline/vocab-nq.txt", 
                                           do_lower_case=True)

    features = []
    convert = bert_utils.ConvertExamples2Features(tokenizer=tokenizer,
                                                   is_training=False,
                                                   output_fn=eval_writer.process_feature,
                                                   collect_stat=False)

    n_examples = 0
    tqdm_notebook= tqdm.tqdm_notebook if not kernel == 'Kaggle' else None
    for examples in bert_utils.nq_examples_iter(input_file=f"{test_file}", 
                                           is_training=False,
                                           tqdm=tqdm_notebook):
        for example in examples:
            n_examples += convert(example)

    eval_writer.close()
    print('number of test examples: %d, written to file: %d' % (n_examples,eval_writer.num_features))

In [0]:
seq_length = FLAGS.max_seq_length       # bertconfig['max_position_embeddings']
name_to_features = {
      "unique_id": tf.io.FixedLenFeature([], tf.int64),
      "input_ids": tf.io.FixedLenFeature([seq_length], tf.int64),
      "input_mask": tf.io.FixedLenFeature([seq_length], tf.int64),
      "segment_ids": tf.io.FixedLenFeature([seq_length], tf.int64),
  }

def _decode_record(record, name_to_features=name_to_features):
    """Decodes a record to a TensorFlow example."""
    example = tf.io.parse_single_example(serialized=record, features=name_to_features)

    # tf.Example only supports tf.int64. The original baseline cast every feature
    # to tf.int32 for TPU use; this version keeps tf.int64, so the cast below is a no-op.
    for name in list(example.keys()):
        t = example[name]
        if name != 'unique_id': #t.dtype == tf.int64:
            # t = tf.cast(t, dtype=tf.int32)
            t = tf.cast(t, dtype=tf.int64)      ### new code mmm
        example[name] = t

    return example

def _decode_tokens(record):
    return tf.io.parse_single_example(serialized=record, 
                                      features={
                                          "unique_id": tf.io.FixedLenFeature([], tf.int64),
                                          "token_map" :  tf.io.FixedLenFeature([seq_length], tf.int64)
                                      })
      


In [0]:
## Create ds which is a generator of input data to be fed into model.predict
raw_ds = tf.data.TFRecordDataset(eval_records)
token_map_ds = raw_ds.map(_decode_tokens)
decoded_ds = raw_ds.map(_decode_record)
ds = decoded_ds.batch(batch_size=32,drop_remainder=False)   ## a generator yielding batches of input samples

The cells above just reformat the data into a TensorFlow-compatible structure.<br> The actual prediction is then run by the model.predict cell further down.<p>
Each ds item is a batch of <class 'dict'>. keys: input_ids, input_mask, segment_ids, unique_id<p>

In [32]:
if True and verbose:
    # print out information from the first ds (batch)
    for d in ds:
        print(type(d), "keys: ", d.keys())
        print("shape: ", d['unique_id'].shape, d['input_ids'].shape, d['input_mask'].shape, d['segment_ids'].shape)
        print(f"\nunique_id: {d['unique_id'][0]}")
        print(f"\ninput_ids:\n{d['input_ids'][0]}")
        print(f"\ninput_mask:\n{d['input_mask'][0]}")
        print(f"\nsegment_ids:\n{d['segment_ids'][0]}")
        break

<class 'dict'> keys:  dict_keys(['input_ids', 'input_mask', 'segment_ids', 'unique_id'])
shape:  (32,) (32, 512) (32, 512) (32, 512)

unique_id: 5655493461695504401

input_ids:
[  101   104  2029  2003  1996  2087  2691  2224  1997 23569  1011  1999
  1041  1011  5653  5821   102   259   107   260   159  1006  5342  1007
  2023  3720  2038  3674  3314  1012  3531  2393  5335  2009  2030  6848
  2122  3314  2006  1996  2831  3931  1012  1006  4553  2129  1998  2043
  2000  6366  2122 23561  7696  1007  2023  3720  3791  3176 22921  2005
 22616  1012  3531  2393  5335  2023  3720  2011  5815 22921  2000 10539
  4216  1012  4895  6499  3126 11788  3430  2089  2022  8315  1998  3718
  1012  1006  2244  2297  1007  1006  4553  2129  1998  2043  2000  6366
  2023 23561  4471  1007  2023  3720  4298  3397  2434  2470  1012  3531
  5335  2009  2011 20410  2075  1996  4447  2081  1998  5815 23881 22921
  1012  8635  5398  2069  1997  2434  2470  2323  2022  3718  1012  1006
  2254  2325  1007  

On full training set this will count to 284 and take about 600s (10 / 24s on the short data set)


In [33]:
result=model.predict(ds, verbose = 1 if verbose else 0)

     76/Unknown - 604s 8s/step

Save the results in a compressed file. We may want to analyze them later,<br>
or reload them to avoid rerunning the top half of the program when testing the bottom half.

In [0]:
if False:
    print("Saving compressed results")
    np.savez_compressed(f'{outdir}/bert-joint-baseline-output.npz',
                    **dict(zip(['unique_id','start_logits','end_logits','answer_type_logits'],
                               result)))   # result is a list of 4 np.ndarray

In [35]:
if True and verbose:
    # print out the first result
    np.set_printoptions(suppress=True, precision=0)
    print(f"result is a {type(result)} of len ({len(result)})")
    print(f"result[0] (unique_id) is a {type(result[0])} of shape {result[0].shape}")
    print(f"result[1] (start_logits) is a {type(result[1])} of shape {result[1].shape}")
    print(f"result[2] (end_logits) is a {type(result[2])} of shape {result[2].shape}")
    print(f"result[3] (answer_type_logits) is a {type(result[3])} of shape {result[3].shape}")
    print(f"\nunique_id: \n{result[0][0]}")
    print(f"\nstart_logits: \n{result[1][0]}")
    print(f"\nend_logits: \n{result[2][0]}")
    print(f"\nanswer_type_logits: \n{result[3][0]}")


result is a <class 'list'> of len (4)
result[0] (unique_id) is a <class 'numpy.ndarray'> of shape (2405, 1)
result[1] (start_logits) is a <class 'numpy.ndarray'> of shape (2405, 512)
result[2] (end_logits) is a <class 'numpy.ndarray'> of shape (2405, 512)
result[3] (answer_type_logits) is a <class 'numpy.ndarray'> of shape (2405, 5)

unique_id: 
[5655493461695504401]

start_logits: 
[  2.  -9.  -8.  -9.  -9.  -9.  -9.  -9.  -9.  -7.  -9.  -9.  -8.  -8.
  -9.  -9.  -8.   2.  -1.  -1.  -1.  -2.  -1.  -2.  -2.  -2.  -2.  -2.
  -2.  -3.  -2.  -2.  -3.  -3.  -4.  -3.  -3.  -3.  -4.  -4.  -3.  -4.
  -4.  -3.  -3.  -4.  -4.  -3.  -4.  -3.  -4.  -3.  -4.  -4.  -4.  -3.
  -3.  -3.  -3.  -4.  -3.  -5.  -4.  -4.  -4.  -4.  -4.  -4.  -4.  -4.
  -4.  -4.  -4.  -5.  -3.  -6.  -6.  -5.  -4.  -4.  -5.  -4.  -5.  -4.
  -5.  -5.  -4.  -4.  -5.  -5.  -4.  -5.  -5.  -5.  -5.  -4.  -5.  -4.
  -5.  -5.  -5.  -4.  -5.  -4.  -4.  -4.  -5.  -4.  -4.  -5.  -5.  -4.
  -6.  -5.  -4.  -5.  -6.  -5.  -5.  -5.  -6. 

In [0]:
Span = collections.namedtuple("Span", ["start_token_idx", "end_token_idx", "score"])   ## Added score

In [0]:
class ScoreSummary(object):
  def __init__(self):
    self.predicted_label = None
    self.short_span_score = None
    self.cls_token_score = None
    self.answer_type_logits = None

In [0]:
class EvalExample(object):
  """Eval data available for a single example."""
  def __init__(self, example_id, candidates):
    self.example_id = example_id
    self.candidates = candidates
    self.results = {}
    self.features = {}

In [0]:
def get_best_indexes(logits, n_best_size):
  """Get the n-best logits from a list."""
  index_and_score = sorted(
      enumerate(logits[1:], 1), key=lambda x: x[1], reverse=True)
  best_indexes = []
  for i in range(len(index_and_score)):
    if i >= n_best_size:
      break
    best_indexes.append(index_and_score[i][0])
  return best_indexes

def top_k_indices(logits,n_best_size,token_map):
    indices = np.argsort(logits[1:])+1
    indices = indices[token_map[indices]!=-1]
    return indices[-n_best_size:]

In [0]:
def remove_duplicates(span):
    start_end = []
    for s in span:
        cont = 0
        if not start_end:
            start_end.append(Span(s[0], s[1], s[2]))
            cont += 1
        else:
            for i in range(len(start_end)):
                if start_end[i][0] == s[0] and start_end[i][1] == s[1]:
                    cont += 1
        if cont == 0:
            start_end.append(Span(s[0], s[1], s[2]))
            
    return start_end

In [0]:
def get_short_long_span(predictions, example):
    
    sorted_predictions = sorted(predictions, reverse=True)
    short_span = []
    long_span = []
    for prediction in sorted_predictions:
        score, _, summary, start_span, end_span = prediction
        # get scores > zero
        if score > 0:
            short_span.append(Span(int(start_span), int(end_span), float(score)))

    short_span = remove_duplicates(short_span)

    for s in range(len(short_span)):
        for c in example.candidates:
            start = short_span[s].start_token_idx
            end = short_span[s].end_token_idx
            ## print(c['top_level'],c['start_token'],start,c['end_token'],end)
            if c["top_level"] and c["start_token"] <= start and c["end_token"] >= end:
                long_span.append(Span(int(c["start_token"]), int(c["end_token"]), float(short_span[s].score)))
                break
    long_span = remove_duplicates(long_span)
    
    if not long_span:
        long_span = [Span(-1, -1, -10000.0)]
    if not short_span:
        short_span = [Span(-1, -1, -10000.0)]
        
    
    return short_span, long_span

### Understanding the code
The "answer_type" item in the last lines of compute_predictions (currently commented out) stores the identified response type, which, according to the [GitHub project repository](https://github.com/google-research/language/blob/master/language/question_answering/bert_joint/run_nq.py), can be:
1. UNKNOWN = 0
2. YES = 1
3. NO = 2
4. SHORT = 3
5. LONG = 4
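
As a small illustration of how these values would be used, the sketch below maps the index of the largest answer-type logit to a named type. The class and function names are hypothetical and the logits are made up; the notebook itself does not currently do this step.

```python
import enum


class AnswerType(enum.IntEnum):
    """Response types, using the integer values listed above."""
    UNKNOWN = 0
    YES = 1
    NO = 2
    SHORT = 3
    LONG = 4


def predicted_answer_type(answer_type_logits):
    """Pick the answer type whose logit is largest (a plain argmax)."""
    best = max(range(len(answer_type_logits)),
               key=lambda i: answer_type_logits[i])
    return AnswerType(best)


# Made-up logits for one feature; index 3 is the largest.
example_logits = [-1.2, 0.3, -0.5, 2.1, 0.9]
print(predicted_answer_type(example_logits).name)
```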

#### Note from ChrisM
The other thing that compute_predictions does is look through the logits for the start and end indexes and finds the pairs of start/end indexes which seem most likely to be the short answer.  It does this by creating a score for the short answers based on the combined start + end logit scores and for some reason subtracting from this the logit scores for the initial CLS token in the Bert input.  The long span is determined from the best short answer.

I am wondering here if we might be able to get some improvement by building an additional model to select the best short span in some other way rather than just taking the max logit values.  Maybe a tree ensemble, for example, could look at the logits in a different way and make a better decision for the short and long answers than what is coded here.
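
The scoring rule described in this note can be sketched without TensorFlow. This is a simplified illustration, not the notebook's compute_predictions: the names `top_k` and `best_span` are hypothetical, the logits are made up, and the token_map filtering is left out.

```python
from itertools import product


def top_k(logits, k):
    """Indices of the k largest logits, skipping position 0 (the CLS token)."""
    order = sorted(range(1, len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def best_span(start_logits, end_logits, n_best=5, max_answer_length=30):
    """Return (score, start, end) for the best valid span, or None."""
    cls_score = start_logits[0] + end_logits[0]
    candidates = []
    for s, e in product(top_k(start_logits, n_best), top_k(end_logits, n_best)):
        # Keep spans that run forwards and are not too long.
        if s < e and e - s < max_answer_length:
            score = start_logits[s] + end_logits[e] - cls_score
            candidates.append((score, s, e))
    return max(candidates) if candidates else None


# Made-up logits for a 5-token sequence; position 0 is the CLS token.
start_logits = [2.0, -9.0, 5.0, -1.0, 0.0]
end_logits = [1.0, -8.0, -2.0, 6.0, 0.5]
print(best_span(start_logits, end_logits, n_best=2))
```

One way to read the subtraction: the score measures how much better the span is than pointing at CLS, which appears to act as the "no answer" position, so a score above zero means the model prefers the span over answering nothing.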

In [0]:
def compute_predictions(example):
    """Converts an example into an NQEval object for evaluation."""
    predictions = []
    n_best_size = FLAGS.n_best_size
    max_answer_length = FLAGS.max_answer_length
    i = 0
    for unique_id, result in example.results.items():
        if unique_id not in example.features:
            raise ValueError("No feature found with unique_id:", unique_id)
        token_map = np.array(example.features[unique_id]["token_map"]) #.int64_list.value
        start_indexes = top_k_indices(result.start_logits,n_best_size,token_map)
        if len(start_indexes)==0:
            continue
        end_indexes   = top_k_indices(result.end_logits,n_best_size,token_map)
        if len(end_indexes)==0:
            continue
        indexes = np.array(list(np.broadcast(start_indexes[None],end_indexes[:,None])))  
        indexes = indexes[(indexes[:,0]<indexes[:,1])*(indexes[:,1]-indexes[:,0]<max_answer_length)]
        for _, (start_index,end_index) in enumerate(indexes):  
            summary = ScoreSummary()
            summary.short_span_score = (
                result.start_logits[start_index] +
                result.end_logits[end_index])
            summary.cls_token_score = (
                result.start_logits[0] + result.end_logits[0])
            summary.answer_type_logits = result.answer_type_logits-result.answer_type_logits.mean()
            start_span = token_map[start_index]
            end_span = token_map[end_index] + 1

            # Span logits minus the cls logits seems to be close to the best.
            score = summary.short_span_score - summary.cls_token_score
            predictions.append((score, i, summary, start_span, end_span))
            i += 1 # to break ties

    # Default empty prediction.
    #score = -10000.0
    short_span = [Span(-1, -1, -10000.0)]
    long_span  = [Span(-1, -1, -10000.0)]
    summary    = ScoreSummary()

    if predictions:
        short_span, long_span = get_short_long_span(predictions, example)
      
    summary.predicted_label = {
        "example_id": int(example.example_id),
        "long_answers": {
          "tokens_and_score": long_span,
          #"end_token": long_span,
          "start_byte": -1,
          "end_byte": -1
        },
        #"long_answer_score": answer_score,
        "short_answers": {
          "tokens_and_score": short_span,
          #"end_token": short_span,
          "start_byte": -1,
          "end_byte": -1,
          "yes_no_answer": "NONE"
        }
        #"short_answer_score": answer_scores,
        
        #"answer_type_logits": summary.answer_type_logits.tolist(),
        #"answer_type": int(np.argmax(summary.answer_type_logits))
       }

    return summary

In [0]:
def compute_pred_dict(candidates_dict, dev_features, raw_results,tqdm=None):
    """Computes official answer key from raw logits."""
    raw_results_by_id = [(int(res.unique_id),1, res) for res in raw_results]

    examples_by_id = [(int(k),0,v) for k, v in candidates_dict.items()]
  
    features_by_id = [(int(d['unique_id']),2,d) for d in dev_features] 
  
    # ChrisM Note    (Join examples with features and raw results.)
    # NOTE:  this strange looking merge where we are sorting tuples is intended.  
    #        In the examples_by_id, each question / document pair has an id.
    #        Then, in the raw_results_by_id, a document with id 'i' from the examples_by_id
    #        will be split across multiple raw results with ids: i, i+1, i+2, ...
    #        The features_by_id has the same structure as raw_results_by_id and 
    #        contains a map from Bert-Joint sequence tokens to whitespace delimited
    #        words in the original document ( which is what we report in the answer )
    examples = []
    print('merging examples...')
    merged = sorted(examples_by_id + raw_results_by_id + features_by_id)
    print('done.')
    for idx, type_, datum in merged:
        if type_==0: #isinstance(datum, list):
            examples.append(EvalExample(idx, datum))
        elif type_==2: #"token_map" in datum:
            examples[-1].features[idx] = datum
        else:
            examples[-1].results[idx] = datum

    # Construct prediction objects.
    print('Computing predictions...')
   
    nq_pred_dict = {}
    #summary_dict = {}
    if tqdm is not None:
        examples = tqdm(examples)
    for e in examples:
        summary = compute_predictions(e)
        #summary_dict[e.example_id] = summary
        nq_pred_dict[e.example_id] = summary.predicted_label
    return nq_pred_dict
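
The sorted-tuple merge in compute_pred_dict is easier to follow on toy data. The sketch below is an assumption-labelled illustration, not the notebook code: `group_by_sorted_merge` is a hypothetical name, the ids and payload strings are made up, and I sort on (id, tag) only so that payloads are never compared.

```python
def group_by_sorted_merge(examples, results, features):
    """Attach results and features to the example that precedes them by id."""
    # Tag 0 = example, 1 = raw result, 2 = feature, as in compute_pred_dict.
    tagged = ([(k, 0, v) for k, v in examples] +
              [(k, 1, v) for k, v in results] +
              [(k, 2, v) for k, v in features])
    grouped = []
    for key, tag, value in sorted(tagged, key=lambda t: (t[0], t[1])):
        if tag == 0:
            grouped.append({"example_id": key, "results": {}, "features": {}})
        elif tag == 1:
            grouped[-1]["results"][key] = value
        else:
            grouped[-1]["features"][key] = value
    return grouped


# Example 100 was split into two fragments (ids 100 and 101), example 200 into one.
examples = [(200, "cands B"), (100, "cands A")]
results = [(101, "res A1"), (100, "res A0"), (200, "res B0")]
features = [(100, "map A0"), (101, "map A1"), (200, "map B0")]

for group in group_by_sorted_merge(examples, results, features):
    print(group)
```

The trick only works while every fragment id of one example sorts below the next example id, which is the same condition as the unique_ids being unique.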

In [0]:
def read_candidates_from_one_split(input_path):
  """Read candidates from a single jsonl file."""
  candidates_dict = {}
  print("Reading examples from: %s" % input_path)
  if input_path.endswith(".gz"):
    with gzip.GzipFile(fileobj=tf.io.gfile.GFile(input_path, "rb")) as input_file:
      for index, line in enumerate(input_file):
        e = json.loads(line)
        candidates_dict[e["example_id"]] = e["long_answer_candidates"]
        
  else:
    with tf.io.gfile.GFile(input_path, "r") as input_file:
      for index, line in enumerate(input_file):
        e = json.loads(line)
        candidates_dict[e["example_id"]] = e["long_answer_candidates"]  # TODO: try joining with question_text
  return candidates_dict

In [0]:
def read_candidates(input_pattern):
  """Read candidates from every file matching input_pattern, one file at a time."""
  input_paths = tf.io.gfile.glob(input_pattern)
  final_dict = {}
  for input_path in input_paths:
    final_dict.update(read_candidates_from_one_split(input_path))
  return final_dict

In [46]:
all_results = [bert_utils.RawResult(*x) for x in zip(*result)]
    
print ("About to read_candidates()")

candidates_dict = read_candidates(f"{test_file}")

print ("setting up eval_features as list")

eval_features = list(token_map_ds)

print ("going to compute_pred_dict()")

tqdm_notebook= tqdm.tqdm_notebook
nq_pred_dict = compute_pred_dict(candidates_dict, 
                                       eval_features,
                                       all_results,
                                      tqdm=tqdm_notebook)

predictions_json = {"predictions": list(nq_pred_dict.values())}

print ("about to write predictions.json")

with tf.io.gfile.GFile(f"{outdir}/predictions.json", "w") as f:
    json.dump(predictions_json, f, indent=4)
print('done writing!')

About to read_candidates()
Reading examples from: /content/data/tensorflow2-question-answering/simplified-nq-eval.jsonl
setting up eval_features as list
going to compute_pred_dict()
merging examples...
done.
Computing predictions...


HBox(children=(IntProgress(value=0), HTML(value='')))


about to write predictions.json
done writing!


### Processing the Output


#### Filtering the Answers

In [47]:
answers_df = pd.read_json(f"{outdir}/predictions.json")
answers_df.head()

Unnamed: 0,predictions
0,"{'example_id': -8799945603687418006, 'long_ans..."
1,"{'example_id': -8627347779381584683, 'long_ans..."
2,"{'example_id': -8114175076810279695, 'long_ans..."
3,"{'example_id': -8062182676792486818, 'long_ans..."
4,"{'example_id': -7766157450214546755, 'long_ans..."


In [0]:
# {long score > 2, cont = 5 | short score > 2, cont = 5} = 0.18
# { long score > 2, cont = 5 | short score > 6, cont = 5}
# { long score > 2, cont = 1 | short score > 6, cont = 5}

def df_long_index_score(df):
    answers = []
    cont = 0
    for e in df['long_answers']['tokens_and_score']:
        # keep only candidates with score > 3
        if e[2] > 3: 
            index = {}
            index['start'] = e[0]
            index['end'] = e[1]
            index['score'] = e[2]
            answers.append(index)
            cont += 1
        # number of answers
        if cont == 1:
            break
            
    return answers

def df_short_index_score(df):
    answers = []
    cont = 0
    for e in df['short_answers']['tokens_and_score']:
        # keep only candidates with score > 8
        if e[2] > 8:
            index = {}
            index['start'] = e[0]
            index['end'] = e[1]
            index['score'] = e[2]
            answers.append(index)
            cont += 1
        # number of answers
        if cont == 1:
            break
            
    return answers

def df_example_id(df):
    return df['example_id']

In [49]:
answers_df['example_id'] = answers_df['predictions'].apply(df_example_id)

answers_df['long_indexes_and_scores'] = answers_df['predictions'].apply(df_long_index_score)

answers_df['short_indexes_and_scores'] = answers_df['predictions'].apply(df_short_index_score)

answers_df.head()

Unnamed: 0,predictions,example_id,long_indexes_and_scores,short_indexes_and_scores
0,"{'example_id': -8799945603687418006, 'long_ans...",-8799945603687418006,"[{'start': 122, 'end': 348, 'score': 6.9993805...",[]
1,"{'example_id': -8627347779381584683, 'long_ans...",-8627347779381584683,"[{'start': 243, 'end': 474, 'score': 7.1751813...",[]
2,"{'example_id': -8114175076810279695, 'long_ans...",-8114175076810279695,[],[]
3,"{'example_id': -8062182676792486818, 'long_ans...",-8062182676792486818,"[{'start': 553, 'end': 629, 'score': 12.610366...","[{'start': 565, 'end': 567, 'score': 12.610366..."
4,"{'example_id': -7766157450214546755, 'long_ans...",-7766157450214546755,"[{'start': 178, 'end': 233, 'score': 4.7214059...",[]


In [50]:
answers_df = answers_df.drop(['predictions'], axis=1)
answers_df.head()

Unnamed: 0,example_id,long_indexes_and_scores,short_indexes_and_scores
0,-8799945603687418006,"[{'start': 122, 'end': 348, 'score': 6.9993805...",[]
1,-8627347779381584683,"[{'start': 243, 'end': 474, 'score': 7.1751813...",[]
2,-8114175076810279695,[],[]
3,-8062182676792486818,"[{'start': 553, 'end': 629, 'score': 12.610366...","[{'start': 565, 'end': 567, 'score': 12.610366..."
4,-7766157450214546755,"[{'start': 178, 'end': 233, 'score': 4.7214059...",[]


In [0]:
def create_answer(entry):
    # Format each span as "start:end"; an empty list yields an empty string.
    return ", ".join(f"{e['start']}:{e['end']}" for e in entry)


In [52]:
answers_df["long_answer"] = answers_df['long_indexes_and_scores'].apply(create_answer)
answers_df["short_answer"] = answers_df['short_indexes_and_scores'].apply(create_answer)
answers_df["example_id"] = answers_df['example_id'].apply(lambda q: str(q))

long_answers = dict(zip(answers_df["example_id"], answers_df["long_answer"]))
short_answers = dict(zip(answers_df["example_id"], answers_df["short_answer"]))

answers_df.head()

Unnamed: 0,example_id,long_indexes_and_scores,short_indexes_and_scores,long_answer,short_answer
0,-8799945603687418006,"[{'start': 122, 'end': 348, 'score': 6.9993805...",[],122:348,
1,-8627347779381584683,"[{'start': 243, 'end': 474, 'score': 7.1751813...",[],243:474,
2,-8114175076810279695,[],[],,
3,-8062182676792486818,"[{'start': 553, 'end': 629, 'score': 12.610366...","[{'start': 565, 'end': 567, 'score': 12.610366...",553:629,565:567
4,-7766157450214546755,"[{'start': 178, 'end': 233, 'score': 4.7214059...",[],178:233,


In [53]:
answers_df = answers_df.drop(['long_indexes_and_scores', 'short_indexes_and_scores'], axis=1)
answers_df.head()

Unnamed: 0,example_id,long_answer,short_answer
0,-8799945603687418006,122:348,
1,-8627347779381584683,243:474,
2,-8114175076810279695,,
3,-8062182676792486818,553:629,565:567
4,-7766157450214546755,178:233,


### Generating the Submission File

sample_submission.csv has to have two lines for every line in {test_file} and example_id must match.
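
To illustrate the two-rows-per-example layout, here is a minimal sketch in plain Python. The helper `submission_rows` is hypothetical and the second example id is made up; the notebook itself fills in the rows with pandas.

```python
def submission_rows(long_answers, short_answers):
    """Build (example_id, PredictionString) rows: one _long and one _short each."""
    rows = []
    for example_id, long_answer in long_answers.items():
        rows.append((f"{example_id}_long", long_answer))
        rows.append((f"{example_id}_short", short_answers.get(example_id, "")))
    return rows


# The first id and its spans appear in the notebook output; the second id is
# made up to show an example with no answers.
long_answers = {"5328212470870865242": "212:310", "42": ""}
short_answers = {"5328212470870865242": "213:215", "42": ""}

for example_id, prediction in submission_rows(long_answers, short_answers):
    print(f"{example_id},{prediction}")
```

An empty PredictionString means no answer is predicted for that row.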

In [54]:
sample_submission = pd.read_csv(f"{datadir}/{competition}/sample_submission.csv")

print(f"Size of sample_submission: {len(sample_submission)} (sum of long and short answers)")
print(f"Number of long_answers: {len(long_answers)}")
print(f"Number of short_answers: {len(short_answers)}")
print()

long_prediction_strings = sample_submission[sample_submission["example_id"].str.contains("_long")].apply(lambda q: long_answers[q["example_id"].replace("_long", "")], axis=1)
short_prediction_strings = sample_submission[sample_submission["example_id"].str.contains("_short")].apply(lambda q: short_answers[q["example_id"].replace("_short", "")], axis=1)

sample_submission.loc[sample_submission["example_id"].str.contains("_long"), "PredictionString"] = long_prediction_strings
sample_submission.loc[sample_submission["example_id"].str.contains("_short"), "PredictionString"] = short_prediction_strings


Size of sample_submission: 200 (sum of long and short answers)
Number of long_answers: 100
Number of short_answers: 100
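
The two-rows-per-example layout can also be filled with a single `map` over `example_id`. The sketch below is not the notebook's code: `lookup` is a hypothetical helper, and the ids and dictionaries are toy stand-ins for `long_answers` and `short_answers`. It splits on the last underscore only, so negative ids are safe, and it falls back to an empty string when an id is missing.

```python
import pandas as pd

# Toy stand-ins for the notebook's long_answers / short_answers dicts (assumed shape)
long_answers = {"111": "122:348", "-222": ""}
short_answers = {"111": "", "-222": "5:7"}

# Same layout as sample_submission.csv: one _long and one _short row per example
sample_submission = pd.DataFrame({
    "example_id": ["111_long", "111_short", "-222_long", "-222_short"],
    "PredictionString": [None] * 4,
})

def lookup(row_id):
    """Hypothetical helper: map '<id>_long' / '<id>_short' to its answer string."""
    base_id, kind = row_id.rsplit("_", 1)   # split on the last underscore only
    source = long_answers if kind == "long" else short_answers
    return source.get(base_id, "")          # a missing id becomes a blank prediction

sample_submission["PredictionString"] = sample_submission["example_id"].map(lookup)
print(sample_submission)
```

Because `map` works row by row, this version does not depend on index alignment between the filtered frames and the original frame.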



In [0]:
## Deliver compiled submission results
if kernel == 'Colab':
    sample_submission.to_csv(f'{outdir}/submission.csv', index=False)
else:
    # Kaggle wants submission.csv dropped in cwd
    sample_submission.to_csv('submission.csv', index=False)

In [56]:
! zdump PST

PST  Tue Jan 14 12:18:11 2020 PST


In [57]:
sample_submission

Unnamed: 0,example_id,PredictionString
0,5655493461695504401_long,1952:2019
1,5655493461695504401_short,
2,5328212470870865242_long,212:310
3,5328212470870865242_short,213:215
4,4435104480114867852_long,
...,...,...
195,-6753967926867752330_short,
196,-6874546130423309582_long,
197,-6874546130423309582_short,
198,67874408688779239_long,190:309


#>> Delete to end before submitting to competition <<
<Details> This is the end of the project notebook. The rest of this notebook is helpful for<br>
development but should not be submitted to the competition.</Details>

### Check Scoring w/ nq_eval by Reading predictions.json
This is currently not working. The way scores are stored in predictions.json changed and is no longer compatible<br>
with natural_questions.eval_utils. Look in that file for notes around line 222.
Among other potential problems, the<br>
conversion from predictions.json to nq_pred_dict does nothing to compare the different long_answer<br>
and short_answer candidates to find the best answer. Also, the way scores are tracked<br>
was changed and is likely not correct.

In [0]:
if False:
    import natural_questions.eval_utils as utils
    import natural_questions.nq_eval as nq_eval
    import flag_defaults

    if kernel == "Colab":               # Kaggle is V2 by default
        # Colab magic to select TensorFlow V2
        %tensorflow_version 2.x

    import tensorflow as tf
    print("TensorFlow", tf.__version__)
    flags = tf.compat.v1.flags
    FLAGS = flags.FLAGS

    flag_defaults.setflags()

    flags.DEFINE_integer(
        'long_non_null_threshold', 2,
        'Require this many non-null long answer annotations to count gold as containing a long answer.')
    flags.DEFINE_integer(
        'short_non_null_threshold', 2,
        'Require this many non-null short answer annotations to count gold as containing a short answer.')
    flags.DEFINE_string(
        'gold_path', None, 'Path to the gzip JSON data. For '
        'multiple files, should be a glob pattern (e.g. "/path/to/files-*")')
    flags.DEFINE_string('predictions_path', None, 'Path to prediction JSON.')
    flags.DEFINE_bool(
        'cache_gold_data', False,
        'Whether to cache gold data in Pickle format to speed up multiple evaluations.')
    flags.DEFINE_integer('num_threads', 10, 'Number of threads for reading.')
    flags.DEFINE_bool('pretty_print', False, 'Whether to pretty print output.')

    nq_gold_dict = utils.read_annotation(f"{test_file}", n_threads=10)

    nq_pred_dict = utils.read_prediction_json( f"{outdir}/predictions.json")

    long_answer_stats, short_answer_stats = nq_eval.score_answers(nq_gold_dict, nq_pred_dict)

    metrics = nq_eval.get_metrics_with_answer_stats(long_answer_stats,
                                                short_answer_stats)
    print(*metrics.items(), sep='\n')


<font color="pink"> Can we write nq_gold_dict and nq_pred_dict out to a file to evaluate where they are different?</font><br>
Might not mean a lot because my code for reading predictions.json is questionable.
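
One way to approach the question above is to diff the two dictionaries by key and dump only the mismatches to a JSON-lines file. This is a minimal sketch under assumptions: `diff_by_example_id` is a hypothetical helper, the toy dicts hold plain strings, and the real `nq_gold_dict` / `nq_pred_dict` hold label objects that would need `default=str` (or a custom encoder) to serialize.

```python
import json
import os
import tempfile

def diff_by_example_id(gold, pred):
    """Return {example_id: (gold_value, pred_value)} for entries that differ."""
    diffs = {}
    for example_id in sorted(set(gold) | set(pred), key=str):
        g, p = gold.get(example_id), pred.get(example_id)
        if g != p:
            diffs[example_id] = (g, p)
    return diffs

# Toy stand-ins; the real dicts map example_id to label objects, not strings
gold = {1: "10:20", 2: "", 3: "5:9"}
pred = {1: "10:20", 2: "7:8"}
diffs = diff_by_example_id(gold, pred)

# Write one JSON object per mismatch so the file can be inspected line by line
out_path = os.path.join(tempfile.gettempdir(), "gold_vs_pred_diff.jsonl")
with open(out_path, "w") as f:
    for example_id, (g, p) in diffs.items():
        record = {"example_id": example_id, "gold": g, "pred": p}
        f.write(json.dumps(record, default=str) + "\n")

print(diffs)
```

An id that is missing from one side shows up with `None` on that side, which makes dropped examples easy to spot.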

### Check scoring with eval script
from: https://www.kaggle.com/kenkrige/possible-evaluation-metric

In [0]:
def long_annotations(example):
    longs = [('%s:%s' % (l['start_token'], l['end_token']))
             for l in [a['long_answer'] for a in example['annotations']]
             if l['candidate_index'] != -1   # -1 marks "no long answer"
            ]
    return longs  # list of long annotations

def short_annotations(example):
    shorts = [('%s:%s' % (s['start_token'],s['end_token']))
              for s in 
              # sum(list_of_lists, []) is not very efficient but gives an easy flat map for short lists
              sum([a['short_answers'] for a in example['annotations']], [])
             ]
    return shorts #list of short annotations

def yes_nos(example):
    return [
        yesno for yesno in [a['yes_no_answer'] for a in example['annotations']]
        if not yesno == 'NONE'
    ]

# This is the critical part where I guess at the competition metric.
class Score():
    def __init__(self):
        self.TP = 0
        self.FP = 0
        self.FN = 0
        self.TN = 0
    def F1(self):
        denom = 2 * self.TP + self.FP + self.FN
        # avoid ZeroDivisionError when nothing has been counted yet
        return 2 * self.TP / denom if denom else 0.0
    def increment(self, prediction, annotations, yes_nos):
        if prediction in yes_nos:
            print(prediction, yes_nos)
            self.TP += 1
        elif len(prediction) > 0:
            if prediction in annotations:
                self.TP += 1
            else:
                self.FP += 1
        elif len(annotations) == 0:
            self.TN += 1
        else:
            self.FN +=1
    def scores(self):
        return 'TP = {}   FP = {}   FN = {}   TN = {}   F1 = {:.2f}'.format(
            self.TP, self.FP, self.FN, self.TN, self.F1())


In [60]:
## He called this predictions but I think he means submission
#  predictions = pd.read_csv('../input/tinydev/ken_predictions.csv', na_filter=False).set_index('example_id')
#  This file should have been created from your reserved records in nq-dev-sample.jsonl (your training file)
submission = pd.read_csv(f"{outdir}/submission.csv", na_filter=False).set_index('example_id')

long_score = Score()
short_score = Score()
total_score = Score()
with open(test_file, 'r') as f:     # close the file when done
    for example in map(json.loads, f):
        long_pred = submission.loc[str(example['example_id']) + '_long', 'PredictionString']
        long_score.increment(long_pred, long_annotations(example), [])
        total_score.increment(long_pred, long_annotations(example), [])
        short_pred = submission.loc[str(example['example_id']) + '_short', 'PredictionString']
        short_score.increment(short_pred, short_annotations(example), yes_nos(example))
        # pass the yes/no annotations here too so the total agrees with the short score
        total_score.increment(short_pred, short_annotations(example), yes_nos(example))

print("short_scores:", short_score.scores())
print("long_scores:", long_score.scores())
print("total_scores:", total_score.scores(), '(LB score)')


short_scores: TP = 15   FP = 23   FN = 9   TN = 53   F1 = 0.48
long_scores: TP = 35   FP = 35   FN = 6   TN = 24   F1 = 0.63
total_scores: TP = 50   FP = 58   FN = 15   TN = 77   F1 = 0.58 (LB score)
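
The F1 values printed above follow directly from the counts, using the same formula as `Score.F1`: F1 = 2·TP / (2·TP + FP + FN). True negatives do not enter the formula, so correctly leaving an answer blank neither helps nor hurts. The sketch below re-derives the three numbers; `micro_f1` is a name introduced here for illustration.

```python
def micro_f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Counts copied from the output above
short_f1 = micro_f1(15, 23, 9)    # 30 / 62
long_f1 = micro_f1(35, 35, 6)     # 70 / 111
total_f1 = micro_f1(50, 58, 15)   # 100 / 173

print(f"short {short_f1:.2f}  long {long_f1:.2f}  total {total_f1:.2f}")
# prints: short 0.48  long 0.63  total 0.58
```

Note that the total is computed from the pooled counts, not as the average of the long and short F1 values.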


In [61]:
# Make sure the user does not accidentally execute beyond the end
raise ExecutionStop("Stopping execution")

ExecutionStop: ignored

# ====== Development Support Files (safe to ignore) =====

### SSH Setup
This is only needed if you want to log into the Colab machine. Otherwise fold it up and ignore.<br>
To use it you have to create a free account at https://ngrok.com for the SSH tunnel to work.
<Details>Thanks to Imad El Hanafi (https://imadelhanafi.com) for showing me how to do this.</Details>

File paths are hard coded here because this may be run before program variables are established.

In [0]:
## if you want to use the Kaggle api from command line you will need a kaggle.json file
from pathlib import Path
if Path('/content/gdrive/My Drive/Colab/kaggle.json').exists() or \
                                    Path('/content/kaggle.json').exists():
    pass    # we found a kaggle.json file
else:
    # Give user opportunity to upload a kaggle.json file
    from google.colab import files
    print('Upload kaggle.json if you want the Kaggle API to be available in bash.')
    # The files.upload() command is failing sporadically with:
    #   TypeError: Cannot read property '_uploadFiles' of undefined (just run this cell again)
    ! rm -f "/content/kaggle.json"
    files.upload()

In [0]:
%%bash
## Install sshd; Set to allow login and config
apt-get install -y -o=Dpkg::Use-Pty=0 openssh-server pwgen > /dev/null
mkdir -p /var/run/sshd
echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
echo "PasswordAuthentication yes" >> /etc/ssh/sshd_config

# set host key to known value (need to test if exist)
gdown -O "/etc/ssh/ssh_host_rsa_key" --id 17Vp-rLM0kLVsIqxo7GkV3YXibGCJ7WCR
chmod 600 "/etc/ssh/ssh_host_rsa_key"    # private key will be ignored if not secure
gdown -O "/etc/ssh/ssh_host_rsa_key.pub" --id 1-5yW1EwMdBN0YlRe7McmwDxzmGyvq-gW
# get script to modify login shell to match env of Notebook
gdown -O "/root/init_shell.sh" --id 1-9s5wuq5TkebgKbFvBYy4EeM8c2Ee0xc

# this script fixes the login shell so Python will work
if [ -f "/root/init_shell.sh" ]; then
    echo "source /root/init_shell.sh" >> /root/.bashrc
fi

In [0]:
## setup ssh user / pass and start sshd

#Generate a random root password
import random, string
sshpass = ''.join(random.choice(string.ascii_letters + string.digits) for i in range(30))

#Set root password
! echo root:$sshpass | chpasswd

#Run sshd
get_ipython().system_raw('/usr/sbin/sshd -D &')

In [0]:
%%bash
## Get Ngrok from gdrive or try to download (see: https://ngrok.com/download)
if [ -f "/content/bertqa/colab/ngrok-stable-linux-amd64.zip" ]; then
    cp "/content/bertqa/colab/ngrok-stable-linux-amd64.zip" .
    echo "Using ngrok-stable-linux-amd64.zip from gdrive"
else
    wget -q -c -nc https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
fi
unzip -qq -n ngrok-stable-linux-amd64.zip
rm ngrok-stable-linux-amd64.zip

In [0]:
## Get user to enter auth token from ngrok and start tunnel

# Get token from ngrok for the tunnel
print("Get your authtoken from https://dashboard.ngrok.com/auth")
import getpass
authtoken = getpass.getpass()

#Create tunnel
get_ipython().system_raw('./ngrok authtoken $authtoken && ./ngrok tcp 22 &')

#### ===============================<br> ||====&nbsp;&nbsp;  SSH Login Credentials &nbsp;&nbsp;====||<br> ===============================

In [0]:
#@title
print("username: root")
print("password: ", sshpass)

Get the host name and port number at: https://dashboard.ngrok.com/status

```bash
ssh root@0.tcp.ngrok.io -p [ngrok_port]
Login as: root
Server refused our key
root@0.tcp.ngrok.io's password: [see above]

(Colab):/content$
```


Install programs

In [0]:
%%bash
# vim
apt-get install -y vim > /dev/null
echo "set tabstop     =4" >> ~/.vimrc
echo "set softtabstop =4" >> ~/.vimrc
echo "set shiftwidth  =4" >> ~/.vimrc
echo "set expandtab"      >> ~/.vimrc

# jq is a JSON processor
apt-get install -y jq > /dev/null

apt-get install -y tree > /dev/null


If you need to kill Ngrok run this cell

In [0]:
if False:
    !kill $(ps aux | grep './ngrok' | awk '{print $2}')

## -- Misc Notes --

### Prevent Disconnects
Colab periodically disconnects the browser.<br>
You have to save model checkpoints to Google Drive so you don't lose work.<br>
See: https://mc.ai/google-colab-drive-as-persistent-storage-for-long-training-runs/<br>
Something to try:<br>
Press Ctrl+Shift+I in the browser and run this code in the console.
```
function KeepAlive(){
    console.log("Maintaining Connection");
    document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(KeepAlive,60000);
```
There have been reports of people having their GPU privileges suspended for letting processes run for over 12 hours. It seems that they may penalize you rather than just cutting you off.
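
Saving checkpoints to Drive can be as simple as copying new files to the mounted folder after each save. The sketch below is one possible approach, not code from this notebook: `backup_checkpoints` and the `*.ckpt*` pattern are assumptions, and temporary directories stand in for the VM's working directory and the Drive mount.

```python
import shutil
import tempfile
from pathlib import Path

def backup_checkpoints(src_dir, dst_dir, pattern="*.ckpt*"):
    """Copy checkpoint files that are missing from, or newer than, the backup."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.glob(pattern)):
        if not src.is_file():
            continue
        dst = dst_dir / src.name
        if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
            shutil.copy2(src, dst)      # copy2 also preserves the modification time
            copied.append(src.name)
    return copied

# Demo: temporary directories stand in for the VM disk and the Drive mount
with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as drive:
    (Path(work) / "model.ckpt-100.index").write_text("demo")
    (Path(work) / "notes.txt").write_text("not a checkpoint")
    copied = backup_checkpoints(work, Path(drive) / "checkpoints")
    print(copied)
```

On Colab the destination would be a folder under the mounted Drive, and the call would go wherever the training loop writes a checkpoint.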

### Monitor GPU
```
# From the CLI (I think), to monitor the GPU while fitting
$ nvidia-smi dmon
$ nvidia-smi pmon
```
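
The same information can be polled from Python with `nvidia-smi`'s CSV query mode. This is a rough sketch, assuming a single NVIDIA driver install where the queried fields return plain integers; `parse_gpu_csv` and `gpu_snapshot` are hypothetical helper names, and fields reported as unsupported would need extra handling.

```python
import shutil
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"

def parse_gpu_csv(text):
    """Parse 'util, used, total' lines (one per GPU) into dicts of ints."""
    rows = []
    for line in text.strip().splitlines():
        util, used, total = (int(x.strip()) for x in line.split(","))
        rows.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return rows

def gpu_snapshot():
    """Return one dict per GPU, or an empty list if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)

# Parsing a sample line of the expected shape (values are made up)
sample = parse_gpu_csv("35, 1024, 15109\n")
print(sample)
```

Calling `gpu_snapshot()` between training steps, or from a background thread, gives a lightweight log of utilization without leaving the notebook.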

### Code From Elsewhere

In [0]:
raise ExecutionStop("Stop Here")

In [0]:
!nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE

In [0]:
%%bash
## Convert notebook to HTML or PDF for printing

### Clear All Output & Save Before Doing This ###

apt-get install -y texlive texlive-xetex texlive-latex-extra pandoc > /dev/null
pip install pypandoc
# jupyter nbconvert --to HTML /content/gdrive/My\ Drive/Colab/bertqa/BERTjoint_yes_no/BERTjoint\ yes\ no2.ipynb
jupyter nbconvert --to HTML /content/gdrive/My\ Drive/bertqa/BERTjoint_yes_no/BERTjoint\ yes\ no2.1.ipynb

In [0]:
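# Load the first nrows records of the training file for inspection
import itertools
data_line = []
nrows = 10000

# read the data (islice stops cleanly if the file has fewer than nrows lines)
with open(train_file, 'rt') as f:
    for line in itertools.islice(f, nrows):
        data_line.append(json.loads(line))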
 
train_df = pd.DataFrame(data_line) # convert to data frame
train_df   # peek at data

index = 0
train_df.loc[index, 'question_text']                # outputs the first question

train_df.loc[index,'long_answer_candidates'][:5]    # outputs the first 5 

train_df.loc[index,'annotations'][:5]               # outputs an array with answer information

train_df.loc[index,'long_answer_candidates'][54]    # data for long answer

' '.join(train_df.loc[index,'document_text'].split()[1952:2019])        # 1952 and 2019 are start/end tokens for long answer

' '.join(train_df.loc[index,'document_text'].split()[1960:1969])        # 1960 and 1969 are start/end tokens for short answer
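
The slicing above generalizes to any `start:end` prediction string: split the document on whitespace and join the tokens in that range. The helper below is a small sketch with a made-up document; `span_text` is a name introduced here, not part of the notebook.

```python
def span_text(document_text, span):
    """Return the tokens covered by a 'start:end' span (end exclusive), or '' if blank."""
    if not span:
        return ""
    start, end = (int(x) for x in span.split(":"))
    return " ".join(document_text.split()[start:end])

# Toy document; real document_text values contain HTML tags as separate tokens too
doc = "<P> The quick brown fox jumps over the lazy dog </P>"
print(span_text(doc, "2:5"))   # prints: quick brown fox
```

This makes it easy to eyeball whether a predicted span in submission.csv actually reads like an answer.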



# ============ Notes / Errata ============

### Things to possibly look at...<p>
TDS on Bert in Keras: https://towardsdatascience.com/bert-in-keras-with-tensorflow-hub-76bcbc9417b<br>
HuggingFace Transformers: https://github.com/huggingface/transformers (includes tokenization code)<br>
cloud_tpu_custom_training (by TensorFlow): https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/custom_training.ipynb#scrollTo=FbVhjPpzn6BM<br>
Kaggle attempt to train with huggingface: https://www.kaggle.com/yihdarshieh/tf2-training-on-gcp-tpu/comments?scriptVersionId=26464952<br>
Comments on above: https://www.kaggle.com/c/tensorflow2-question-answering/discussion/124914<br>
BERT Fine Tuning w/TPU: https://github.com/tensorflow/models/blob/master/official/nlp/bert/bert_cloud_tpu.md
