<a href="https://colab.research.google.com/github/hogo56/BertQA/blob/master/ColabGettingStarted.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Colab / Kaggle Getting Started

I prefer to do my primary development in a Colab virtual machine, but some competitions require your kernel to run on Kaggle for scoring. You can start with this notebook and add your own project code, which should run in either location if you use the directory variables for file locations and correctly configure data and user libraries in Kaggle.<br>
If you run into problems or have suggestions I'd love to hear from you.

## Explanation

### Data & Directories
There are a couple of differences between Kaggle and Colab.<p>
**Kaggle** - you create persistent links in your kernel to datasets and utility scripts that your notebook can access. If you have private data you can create your own datasets at Kaggle. You can put utility scripts (user libraries) in a dataset and copy them to a lib directory on Kaggle at runtime. Internet access may or may not be allowed when scoring a Kaggle notebook.<br>
**Colab** - nothing is persistent. You have to (and can) download data and user library scripts to your VM each time you run it. Kaggle provides an API that makes fetching data and scripts easy.<p>

Also, parts of the Kaggle directory structure are read-only so file locations are different. (eg. ./lib and ./working/lib)<p> 
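
Because some locations are read-only, it can help to check where you are allowed to write before copying libraries. The following is a minimal sketch, not part of the original notebook; `first_writable` is a hypothetical helper, and the demo uses a temporary directory in place of real Kaggle or Colab paths.

```python
import os
import tempfile

def first_writable(candidates):
    """Return the first existing directory we may write to, or None."""
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return None

# Demo with a temporary directory standing in for e.g. /kaggle/working
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "does_not_exist")
    chosen = first_writable([missing, tmp])   # skips the missing directory
    nothing = first_writable([missing])       # no usable candidate
    found_tmp = (chosen == tmp)

print(found_tmp, nothing)
```

In a real notebook you would pass candidates such as `["/kaggle/working", "/content"]`.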

Finally, if you happen to like to SSH into your VM to watch the process run or to edit files and mung stuff around, there is code at the bottom of this Notebook that will allow you to do that in Colab. This code needs to be deleted from your notebook before submitting it to a competition.


### User Libraries
**Colab** - you have a file system you can write to and if you need libraries you download them from somewhere. (eg. Google Drive, Kaggle, GitHub)<br>
**Kaggle** - you have two options (internet connections are disabled during competition scoring):<p>
   * Add custom libraries to a dataset and include the dataset in your Kaggle kernel.<br>
   * Create a new kernel as a script, set it as a "Utility Script", add the kernel-script as a utility script in your competition kernel. The script will be linked to your kernel in the /kaggle/usr/lib directory (see: https://www.kaggle.com/product-feedback/91185 for more information) 
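
The first option (libraries shipped in a dataset) boils down to copying `.py` files into a writable directory and putting that directory on `sys.path`. Here is a rough sketch of that idea under stated assumptions: `install_user_libs` and `demo_project_lib` are hypothetical names, and temporary directories stand in for the dataset and lib directories.

```python
import importlib
import shutil
import sys
import tempfile
from pathlib import Path

def install_user_libs(src_dir, lib_dir):
    """Copy every .py file from src_dir into lib_dir and make lib_dir importable."""
    lib_dir = Path(lib_dir)
    lib_dir.mkdir(parents=True, exist_ok=True)
    names = []
    for script in sorted(Path(src_dir).glob("*.py")):
        shutil.copy2(script, lib_dir / script.name)
        names.append(script.name)
    if str(lib_dir) not in sys.path:          # don't add multiple times
        sys.path.append(str(lib_dir))
    return names

# Demo: a throwaway "dataset" containing one library script
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "dataset"
    src.mkdir()
    (src / "demo_project_lib.py").write_text("ANSWER = 42\n")
    copied = install_user_libs(src, Path(tmp) / "lib")
    importlib.invalidate_caches()
    answer = importlib.import_module("demo_project_lib").ANSWER

print(copied, answer)
```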

### Switching between Colab & Kaggle
One way of moving your script from Colab to Kaggle to run is:<br>
   * delete all cells from your Kaggle competition notebook<br>
   * download the .ipynb from Colab<br>
   * upload it into the blank Kaggle notebook.<br>
   * delete the cells near the bottom of the notebook that need to be deleted when running on Kaggle<br>
   * update any script parameters (eg. verbose)
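
If you would rather not delete the bottom cells by hand, the step can be scripted, since an .ipynb file is JSON with a top-level `cells` list. The sketch below is an assumption-laden illustration, not part of the original workflow: `strip_after_marker` is a hypothetical helper that drops the marker heading and everything after it, and the demo notebook is a tiny in-memory stand-in.

```python
import json

MARKER = "Cells below this need to be deleted"   # text of the marker heading

def strip_after_marker(nb, marker=MARKER):
    """Return a copy of a notebook dict without the marker cell and all cells after it."""
    kept = []
    for cell in nb["cells"]:
        source = cell.get("source", "")
        text = "".join(source) if isinstance(source, list) else source
        if cell.get("cell_type") == "markdown" and marker in text:
            break
        kept.append(cell)
    trimmed = dict(nb)
    trimmed["cells"] = kept
    return trimmed

# Tiny in-memory notebook used only to demonstrate the function
demo = {
    "nbformat": 4, "nbformat_minor": 0, "metadata": {},
    "cells": [
        {"cell_type": "code", "source": ["print('model code')\n"]},
        {"cell_type": "markdown",
         "source": ["# Cells below this need to be deleted before submitting\n"]},
        {"cell_type": "code", "source": ["raise ExecutionStop('stop')\n"]},
    ],
}
trimmed = strip_after_marker(demo)
print(len(trimmed["cells"]), "of", len(demo["cells"]), "cells kept")
```

For a real file you would `json.load` the downloaded .ipynb, call the function, and `json.dump` the result before uploading it to Kaggle.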

### Directory Structure (Notebook)

* Kaggle &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>(cwd = /kaggle/working/)</em><br>
  {datadir} = /kaggle/input/<br>
  {outdir} = /kaggle/working/<br>
* Colab &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>(cwd = /content/)</em><br>
  {datadir} = /content/data/<br>
  {outdir} = /content/output/<br>

#### - Required Libraries
**Kaggle** will be copied from a data directory<br>
**Colab** will be copied from a gDrive directory<br>
   * project_lib.py

#### - Inputs (competition data)
   * {datadir}/{competition}/ (from: https://www.kaggle.com/_path_/)

#### - Required Data (additional packages)
   * {datadir}/kaggle-dataset (from https://www.kaggle.com/_user_/_dataset_/)

#### - Outputs
   * {outdir}/predictions.json
   * {outdir}/submission.csv<br>
   * {outdir}/eval.tf_record<br>
   * {outdir}/.ipynb_checkpoints/<br>

### Directory Structure (Google Drive)
Because Google Colab virtual machines are not persistent I am using a link to your Google Drive.<br>
The following is the file structure<p>

/My Drive/colab/ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>(this directory should be private)</em><br>
/My Drive/colab/{gprojdir}/output/<br>
/My Drive/colab/kaggle.json &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>(your personal Kaggle auth file)</em><br>
/My Drive/{gprojdir}/ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>(this directory should be shared between teams)</em><br>
/My Drive/{gprojdir}/lib/{gnotedir}/ &nbsp; &nbsp; &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<em>(libraries for this Notebook)</em><br>


### - Notes
Put notes about your Notebook here







### - Credits / Ancestry
If your notebook is a fork or combination of other notebooks, provide links here so other people can see what you built your current work from.<p>
This notebook is a fork of [mmmarchetti's notebook](https://www.kaggle.com/mmmarchetti/tensorflow-2-0-bert-yes-no-answers) which was a fork of [prokaj's - bert joint baseline notebook](https://www.kaggle.com/prokaj/bert-joint-baseline-notebook/notebook).<br>
mmmarchetti made some modifications to slightly improve the code and get the YES / NO answers and leave the unknowns blank.

## Notebook Variables

In [0]:
CompSubmission = False                       # Set to True if submitting to Competition
from pathlib import Path
## Reset kernel without removing downloaded data files and libs
if True and not Path('/kaggle').exists():    # the rm confirmation messages don't show up right on Kaggle so False there
    %reset
    from pathlib import Path                 # have to reimport after reset 
    for p in ['/content/output/', '/kaggle/working/']:
        if Path(p).exists() and sum(1 for _ in Path(p).iterdir()) > 1:    # are there files in dir?
            print("\nWARNING: Files found in output directory.")
            print("Removing previous output files is not reversible.")
            ! rm -i "{p}"*

In [0]:
## Required for all
import os

class ExecutionStop(Exception):             # Custom Error Handler
    def __init__(self, value): self.value=value
    def __str__(self): return(str(self.value))

def list_files(startpath):                  #  Show files
    for root, dirs, files in os.walk(startpath):
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))
# raise ExecutionStop("Message")

In [0]:
## Set file locations    (these variables are not implemented in the FLAGS code yet)
import os, sys
from pathlib import Path

## Config Variables
verbose = not CompSubmission        # Turn this off to suppress some of the "fyi" output
competition = 'tensorflow2-question-answering'
train_file = 'simplified-nq-train.jsonl'
test_file = 'simplified-nq-test.jsonl'
gprojdir = 'bertqa'                 # The project directory on Drive for this competition
gnotedir = 'BERTjoint_yes_no'       # Subdir on Drive for files specific to this notebook
DownloadBigFiles = True             # Files will not download if already on drive
RunSmallConfig = False              # Reduce the bert_config.json values

if Path('/content').exists():
    print("Detected running on Colab")
    kernel = 'Colab'
    basedir = '/content'
    libdir = f"{basedir}/lib"
    datadir = f"{basedir}/data"
    outdir = f"{basedir}/output"    # will be symlinked to a user's private gdrive for persistence
elif Path('/kaggle').exists():
    print("Detected running on Kaggle")
    kernel = 'Kaggle'
    basedir = '/kaggle'
    libdir = f"{basedir}/working/lib"      # this has to be in a writable location
    datadir = f"{basedir}/input"    # this may need to be '../input' for scoring
    outdir = f"{basedir}/working"   # this may need to be '.' for scoring
else:
    raise ExecutionStop("Cannot continue without determining file locations")

# ============= Machine Spinup =============

In [0]:
! zdump PST
if verbose:
    ! pwd
    list_files(basedir)

## -- Setup --

### Google Drive

In [0]:
## File link to Google Drive
if kernel == 'Colab':
    from google.colab import drive
    drive.mount(f"{basedir}/gdrive", force_remount=False)   # true to reread drive

    # Create a shorter shared directory name and avoid having to deal with the space
    if Path(f"{basedir}/{gprojdir}").is_symlink():
        ! rm "{basedir}/{gprojdir}"
    ! ln -s "{basedir}/gdrive/My Drive/{gprojdir}" "{basedir}/{gprojdir}"
    if not Path(f"{basedir}/{gprojdir}").exists():
        raise ExecutionStop("You cannot continue without gdrive project dir symlink")

    ## If you do not want output to be written to your Google Drive set block False
    if True:
        if Path(f"{outdir}").is_symlink():
            ! rm "{outdir}"
        ! ln -s "{basedir}/gdrive/My Drive/colab/{gprojdir}/output" "{outdir}"
        if not Path(outdir).exists():
            raise ExecutionStop("You cannot continue without gdrive output symlink")


In [0]:
if False:
    ## Flush and unmount Google Drive
    # You probably won't do this but if you want to at some point click the play button
    drive.flush_and_unmount()

## Install the large file downloader for Google Drive if needed (Colab already has it installed)
#  This works from bash or Python. Already installed in Colab by default.
if False:
    ! pip install gdown

### Kaggle API
<Details>You will need a Kaggle API token to link the Colab instance to your Kaggle account to get data, etc.<br>
Go to: https://www.kaggle.com/yourID/account and click on the "Create New API Token" button to get a file named kaggle.json.<p>You can put your kaggle.json file in your Google Drive at My Drive/colab/kaggle.json.<br>
Alternatively, you can store it on your local machine and the script will ask you to upload it.</Details>
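
A kaggle.json token is a small JSON file with `username` and `key` fields, and the Kaggle CLI warns when the file is readable by other users. The sketch below shows one way to sanity-check a token before importing `kaggle`. `check_kaggle_token` is a hypothetical helper, not part of the Kaggle API, and the demo writes a fake token to a temporary directory.

```python
import json
import os
import stat
import tempfile
from pathlib import Path

def check_kaggle_token(path):
    """Return a list of problems found with a kaggle.json file (empty list = looks fine)."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]
    problems = [f"missing '{field}' field"
                for field in ("username", "key") if not token.get(field)]
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:                      # readable or writable by group/others
        problems.append(f"permissions are {oct(mode)}; consider chmod 600")
    return problems

# Demo with a fake token (the values are placeholders, not real credentials)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "kaggle.json"
    demo.write_text(json.dumps({"username": "example_user", "key": "not-a-real-key"}))
    os.chmod(demo, 0o600)
    good = check_kaggle_token(demo)
    missing = check_kaggle_token(Path(tmp) / "nope.json")

print(good, len(missing))
```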

In [0]:
## Link to Kaggle
if kernel == 'Colab':
    from google.colab import files

    if Path(f"{basedir}/gdrive/My Drive/colab/kaggle.json").exists():
        # if there is a kaggle.json file in gdrive use it
        os.environ['KAGGLE_CONFIG_DIR'] = f"{basedir}/gdrive/My Drive/colab/"
        ! ls -l "{basedir}/gdrive/My Drive/colab/kaggle.json"
        import kaggle
    else:
        # Have user upload file
        print('Upload kaggle.json.')
        # The files.upload() command is failing sporadically with:
        #   TypeError: Cannot read property '_uploadFiles' of undefined (just run this cell again)
        ! rm "{basedir}/kaggle.json"  2> /dev/null
        files.upload()
        ! chmod 600 kaggle.json
        os.environ['KAGGLE_CONFIG_DIR'] = f"{basedir}/"
        ! ls -l "{basedir}/kaggle.json"
        import kaggle


## -- Main System Config --
<Details><Summary>Global Config</Summary>
Put any global system configuration here</Details>

In [0]:
if kernel == "Colab":
    if Path(f"{basedir}/sample_data").exists():
        !rm -rf "{basedir}/sample_data"

In [0]:
%%bash -s "{libdir}" "{datadir}" "{outdir}"
# make directories if not already exist
[ -d "$1" ] || mkdir -p "$1"        # {libdir}
[ -d "$2" ] || mkdir -p "$2"        # {datadir}
[ -d "$3" ] || mkdir -p "$3"        # {outdir}
zdump PST

In [0]:
import sys
if libdir not in sys.path:          # don't add multiple times
    sys.path.append(libdir)

In [0]:
if verbose:
    !pwd
    !ls -l
    print()
    !printenv |grep -E 'KAGGLE|PYTHON'
    print("\n[sys.path]", *sys.path, sep='\n')

# =========== Project Specific Stuff ===========

## -- Project Setup --

### Download Dataset and Support Files

Kaggle Competition Files<br>
Here is an example of downloading and unpacking competition data. If the competition set has different files you will need to adjust.

In [0]:
## Competition Dataset  (5GB zipped)
if DownloadBigFiles and kernel == 'Colab':
    if not Path(f"{datadir}/compdata.flag").exists():      ## Don't download again if exists
        if verbose:
            ! kaggle competitions list
            print()
        print("Downloading Competition Data\n")
        ! kaggle competitions download -c "{competition}" -p "{datadir}"
        ! mkdir -p "{datadir}/{competition}/"
        ! mv "{datadir}/sample_submission.csv"  "{datadir}/{competition}"
        ! unzip "{datadir}/{train_file}.zip" -d "{datadir}/{competition}"
        ! rm "{datadir}/{train_file}.zip"
        ! unzip "{datadir}/{test_file}.zip" -d "{datadir}/{competition}"
        ! rm "{datadir}/{test_file}.zip"
        ! touch "{datadir}/compdata.flag"
    else:
        print("Competition Data already exists. Not downloading.\n")
        !ls -l "{datadir}/{competition}"/*
else:
    print(" For Kaggle, make sure you download a copy of the competition data into your kernel")
    ! ls -l "{datadir}/{competition}"/*

public_dataset = os.path.getsize(f"{datadir}/{competition}/{test_file}")<20_000_000
private_dataset = os.path.getsize(f"{datadir}/{competition}/{test_file}")>=20_000_000
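
The two flags above are complementary, so the size only needs to be read once. A minimal sketch of that check as a function follows; `is_public_test_set` is a hypothetical name, the 20 MB threshold is simply the value this notebook already uses, and the demo file is a throwaway stand-in for the real test file.

```python
import os
import tempfile

PUBLIC_TEST_MAX_BYTES = 20_000_000   # same threshold as the cell above

def is_public_test_set(test_path, max_bytes=PUBLIC_TEST_MAX_BYTES):
    """True when the test file is small enough to be the public test set."""
    return os.path.getsize(test_path) < max_bytes

# Demo with a tiny temporary file standing in for the real test file
with tempfile.NamedTemporaryFile(suffix=".jsonl") as tmp:
    tmp.write(b'{"example_id": 1}\n')
    tmp.flush()
    public_dataset = is_public_test_set(tmp.name)
    private_dataset = not public_dataset
    forced_private = not is_public_test_set(tmp.name, max_bytes=1)

print(public_dataset, private_dataset, forced_private)
```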

Additional Data Files<br>
Here is an example of downloading and unpacking additional data files.

In [0]:
# Get BERTjoint model files (this is a copy of the prokaj file from my Google Drive)
if DownloadBigFiles and kernel == 'Colab':
    if not Path(f"{datadir}/bertfiles.flag").exists():      ## Don't download again if exists
        print("Downloading BERT-joint Model\n")
        ! mkdir -p "{datadir}/bert-joint-baseline/"
        filestoget = "bert_config* model_cpkt* nq-test* vocab*"
        ! kaggle datasets download -d prokaj/bert-joint-baseline -p "{datadir}"
        ! unzip "{datadir}/bert-joint-baseline.zip" {filestoget} -d "{datadir}/bert-joint-baseline/"
        ! rm "{datadir}/bert-joint-baseline.zip"
        if Path(f"{datadir}/bert-joint-baseline-output.npz").exists():
            ! rm "{datadir}/bert-joint-baseline-output.npz" # if kaggle downloaded this delete it
        if verbose:
            ! ls -l "{datadir}/bert-joint-baseline/"
        ! touch "{datadir}/bertfiles.flag"
    else:
        print("BERT-joint files already exist. Not downloading.\n")
        ! ls -l "{datadir}/bert-joint-baseline/"
else:
    print("For Kaggle, make sure you download a copy of prokaj's bert-joint-baseline to your kernel")
    ! ls -l "{datadir}/bert-joint-baseline/"
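
Both download cells use the same trick: a `.flag` file marks work that has already been done so it is skipped on the next run. Note that IPython `!` commands do not raise when they fail, so the cells above touch the flag even if a download breaks. The sketch below is a hypothetical pure-Python variant (`run_once` is not part of the notebook) that writes the flag only after the action returns without an exception.

```python
import tempfile
from pathlib import Path

def run_once(flag_path, action):
    """Run action() unless flag_path exists; create the flag only on success."""
    flag = Path(flag_path)
    if flag.exists():
        return False              # already done, skip
    action()                      # an exception here leaves no flag behind
    flag.touch()
    return True

# Demo: the "download" just records that it was called
calls = []
with tempfile.TemporaryDirectory() as tmp:
    flag = Path(tmp) / "compdata.flag"
    first = run_once(flag, lambda: calls.append("download"))
    second = run_once(flag, lambda: calls.append("download"))

print(first, second, calls)
```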

### Library Setup

In [0]:
## Copy lib files from Google Drive
if kernel == 'Colab':
    # each Notebook has its own set of lib files in the shared Google Drive folder
    ! cp -a "{basedir}/{gprojdir}/lib/{gnotedir}/"* "{libdir}"
if kernel == 'Kaggle':
    ! cp "{datadir}/bert-joint-baseline"/*.py "{libdir}"
if verbose:
    ! ls -l "{libdir}"

In [0]:
## Load Libraries
import os, sys, importlib

if kernel == "Colab":
    # Colab magic that selects TensorFlow 2.x
    %tensorflow_version 2.x 

import tensorflow as tf
print("TensorFlow", tf.__version__)

import numpy as np
import pandas as pd
import collections

import bert_utils
import modeling
import tokenization

import json

In [0]:
## uncomment and use this cell to reimport libs you have updated
# importlib.reload(bert_utils)
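
If you edit several library files at once, reloading them in a loop can be convenient. This is a small sketch, with `reload_modules` as a hypothetical helper; it is demonstrated on a standard-library module so it runs anywhere, whereas in this notebook you would pass names such as `"bert_utils"`, `"modeling"`, and `"tokenization"`.

```python
import importlib
import sys

def reload_modules(names):
    """Reload each named module if already imported; import it otherwise."""
    done = []
    for name in names:
        if name in sys.modules:
            importlib.reload(sys.modules[name])
        else:
            importlib.import_module(name)
        done.append(name)
    return done

reloaded = reload_modules(["json"])
print(reloaded)
```

Keep in mind that objects created before the reload still use the old code, and names pulled in with `from module import name` are not refreshed.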

In [0]:
! zdump PST
! pwd
if verbose:
    list_files(basedir)

In [0]:
# raise ExecutionStop("Execution stopped")

## -- Code Implementation For Your Project --

In [0]:
! zdump PST

### Support Functions

In [0]:
class Sample(tf.keras.layers.Layer):
    def __init__(self,
                 output_size,
                 kernel_initializer=None,
                 bias_initializer="zeros",
                **kwargs):
        super().__init__(**kwargs)


def mk_model(config):
    return          # tf.keras.Model()    

### Setting the Flags

In [0]:
class DummyObject:
    def __init__(self,**kwargs):
        self.__dict__.update(kwargs)

FLAGS=DummyObject(skip_nested_contexts=True,
                 max_position=50,
                 max_contexts=48,
                 max_query_length=64,
                 max_seq_length=512,
                 doc_stride=128,
                 include_unknowns=-1.0,
                 n_best_size=20,
                 max_answer_length=30)
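
For what it's worth, the standard library's `types.SimpleNamespace` behaves like `DummyObject` and adds a readable `repr`. The sketch below uses the same values as the cell above; the lowercase name `flags` is only there to avoid clobbering `FLAGS`.

```python
from types import SimpleNamespace

# Same settings as FLAGS above, held in a standard-library container
flags = SimpleNamespace(skip_nested_contexts=True,
                        max_position=50,
                        max_contexts=48,
                        max_query_length=64,
                        max_seq_length=512,
                        doc_stride=128,
                        include_unknowns=-1.0,
                        n_best_size=20,
                        max_answer_length=30)

print(flags.max_seq_length, vars(flags)["doc_stride"])
```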

### Create Model

In [0]:
print("\nGPU Memory\n")
!nvidia-smi --query-gpu=utilization.memory,memory.total,memory.free,memory.used --format=csv

In [0]:
## grab a config file  (make sure to use variables for file locations)
with open(f"{datadir}/some_path/config.json", 'r') as f:
    config = json.load(f)

In [0]:
if RunSmallConfig:          # it seems these values won't work
    small_config = config.copy()
    small_config['value_to_override']=16                # was 30522
    model = mk_model(small_config)
    print(json.dumps(small_config, indent=4))
else:
    model= mk_model(config)
    print(json.dumps(config, indent=4))

model.summary()
print("\nGPU Memory\n")
!nvidia-smi --query-gpu=utilization.memory,memory.total,memory.free,memory.used --format=csv

### Checkpoint

In [0]:
cpkt = tf.train.Checkpoint(model=model)
cpkt.restore(f"{datadir}/some_path/model_cpkt-1").assert_consumed()

In [0]:
result = model.predict(ds, verbose=1 if verbose else 0)   # predict_generator is deprecated; predict accepts datasets and generators


In [0]:
np.savez_compressed('some_file.npz',
                    **dict(zip(['uniqe_id','start_logits','end_logits','answer_type_logits'],
                               result)))

#### Creating a DataFrame

In [0]:
test_answers_df = pd.read_json(f"{outdir}/predictions.json")

test_answers_df.head()

### Generating the Submission File

In [0]:
sample_submission = pd.read_csv(f"{datadir}/{competition}/sample_submission.csv")

In [0]:
sample_submission.to_csv(f"{outdir}/submission.csv", index=False)

In [0]:
sample_submission.head()

In [0]:
! zdump PST
if verbose:
    ! pwd
    list_files(basedir)
! ls -l {outdir}

# Cells below this need to be deleted before submitting Notebook to competition

## -- Submitting Results --

In [0]:
raise ExecutionStop("Don't let run all go beyond this")

In [0]:
%%bash -s "{competition}"
## View previous results you have submitted
#kaggle competitions list
kaggle competitions submissions -c "$1"    # {competition}

In [0]:
## Make Submission
# You may be able to submit to some competitions through the API
! kaggle competitions submit -c "{competition}" -f "{outdir}/submission.csv" -m 'test kaggle cli 3'

Verify submission by viewing previous results

End of Project Notebook

In [0]:
# Make sure user does not accidentally execute beyond end
raise ExecutionStop("Stopping execution")

# ====== Please fold this stuff up and ignore =====

### SSH Setup
This is only needed if you want to log into the Colab machine. Otherwise fold it up and ignore.<br>
To use it you have to create a login at https://ngrok.com
<Details>Thanks to Imad El Hanafi (https://imadelhanafi.com) for showing me how to do this.<p>
You will need to create a free account at https://ngrok.com/ for the SSH tunnel to work.</Details>

File paths are hard coded here because this may be run before program variables are established.

In [0]:
## if you want to use the Kaggle api from command line you will need a kaggle.json file
from pathlib import Path
if Path('/content/gdrive/My Drive/colab/kaggle.json').exists() or \
                                    Path('/content/kaggle.json').exists():
    pass    # we found a kaggle.json file
else:
    # Give user opportunity to upload a kaggle.json file
    from google.colab import files
    print('Upload kaggle.json if you want the Kaggle API to be available in bash.')
    # The files.upload() command is failing sporadically with:
    #   TypeError: Cannot read property '_uploadFiles' of undefined (just run this cell again)
    ! rm "/content/kaggle.json"  2> /dev/null
    files.upload()

In [0]:
%%bash
## Install sshd; Set to allow login and config
apt-get install -y -o=Dpkg::Use-Pty=0 openssh-server pwgen > /dev/null
mkdir -p /var/run/sshd
echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
echo "PasswordAuthentication yes" >> /etc/ssh/sshd_config

# set host key to known value (need to test if exist)
gdown -O "/etc/ssh/ssh_host_rsa_key" --id 17Vp-rLM0kLVsIqxo7GkV3YXibGCJ7WCR
gdown -O "/etc/ssh/ssh_host_rsa_key.pub" --id 1-5yW1EwMdBN0YlRe7McmwDxzmGyvq-gW
# get script to modify login shell to match env of Notebook
gdown -O "/root/init_shell.sh" --id 1-9s5wuq5TkebgKbFvBYy4EeM8c2Ee0xc

# this script will fix the login shell so Python will work
if [ -f "/root/init_shell.sh" ]; then
    echo "source /root/init_shell.sh" >> /root/.bashrc
fi

In [0]:
## setup ssh user / pass and start sshd

#Generate a random root password
import random, string
sshpass = ''.join(random.choice(string.ascii_letters + string.digits) for i in range(30))

#Set root password
! echo root:$sshpass | chpasswd

#Run sshd
get_ipython().system_raw('/usr/sbin/sshd -D &')

In [0]:
%%bash
## Get Ngrok from gdrive or try to download (see: https://ngrok.com/download)
if [ -f "/content/bertqa/colab/ngrok-stable-linux-amd64.zip" ]; then
    cp "/content/bertqa/colab/ngrok-stable-linux-amd64.zip" .
    echo "Using ngrok-stable-linux-amd64.zip from gdrive"
else
    wget -q -c -nc https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
fi
unzip -qq -n ngrok-stable-linux-amd64.zip
rm ngrok-stable-linux-amd64.zip

In [0]:
## Get user to enter auth token from ngrok and start tunnel

# Get token from ngrok for the tunnel
print("Get your authtoken from https://dashboard.ngrok.com/auth")
import getpass
authtoken = getpass.getpass()

#Create tunnel
get_ipython().system_raw('./ngrok authtoken $authtoken && ./ngrok tcp 22 &')

#### ==============================<br>|====&nbsp;&nbsp;  SSH Login Credentials &nbsp;&nbsp;====||<br>==============================

In [0]:
#@title
print("username: root")
print("password: ", sshpass)

Get the host name and port number at: https://dashboard.ngrok.com/status

```bash
ssh root@0.tcp.ngrok.io -p [ngrok_port]
Login as: root
Server refused our key
root@0.tcp.ngrok.io's password: [see above]

(Colab):/content$
```


Install vim

In [0]:
! apt-get install vim > /dev/null

If you need to kill Ngrok run this cell

In [0]:
if False:
    !kill $(ps aux | grep './ngrok' | awk '{print $2}')

## -- Misc Notes --

### Prevent Disconnects
Colab periodically disconnects the browser.<br>
You have to save model checkpoints to Google Drive so you don't lose work<br>
See: https://mc.ai/google-colab-drive-as-persistent-storage-for-long-training-runs/<br>
Something to try...<br>
Ctrl+Shift+i in browser and in console run this code...
```
function KeepAlive(){
    console.log("Maintaining Connection");
    document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(KeepAlive,60000);
```
There have been reports of people having their GPU privileges suspended for letting processes run for over 12 hours. It seems that they may penalize you rather than just cutting you off.
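
If you are not already writing output straight to Drive through the `outdir` symlink, checkpoints can be copied to a persistent directory between training steps. Here is a rough sketch of that idea, assuming checkpoints are ordinary files in one directory. `backup_checkpoints` and the default pattern are hypothetical; this notebook's files are named `model_cpkt-*`, so you would pass `pattern="*cpkt*"` for them. The demo uses temporary directories instead of a real Drive mount.

```python
import shutil
import tempfile
from pathlib import Path

def backup_checkpoints(src_dir, dst_dir, pattern="*ckpt*"):
    """Copy checkpoint files that are missing from, or newer than, the backup."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.glob(pattern)):
        if not src.is_file():
            continue
        dst = dst_dir / src.name
        if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
            shutil.copy2(src, dst)        # copy2 keeps the modification time
            copied.append(src.name)
    return copied

# Demo with throwaway directories standing in for the output dir and Drive
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "output"
    drive = Path(tmp) / "drive_backup"
    work.mkdir()
    (work / "model.ckpt-1.index").write_text("weights")
    (work / "notes.txt").write_text("not a checkpoint")
    first = backup_checkpoints(work, drive)
    second = backup_checkpoints(work, drive)   # nothing changed, nothing copied

print(first, second)
```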

### Monitor GPU
```
# From the CLI (I think), to monitor the GPU while fitting
$ nvidia-smi dmon
$ nvidia-smi pmon
```

### Code From Elsewhere

In [0]:
!nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE