In [None]:
BRANCH = 'v1.0.0b2'

In [None]:
"""
You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

In [None]:
from nemo.utils.exp_manager import exp_manager
from nemo.collections import nlp as nemo_nlp

import os
import wget 
import torch
import pytorch_lightning as pl
from omegaconf import OmegaConf

# Task Description
Given a question and a context, both in natural language, predict the span within the context that answers the question. The span is defined by a start and an end position.
For every word in the input, we predict:
- likelihood this word is the start of the span 
- likelihood this word is the end of the span 

We use a pretrained [BERT](https://arxiv.org/pdf/1810.04805.pdf) encoder with two span prediction heads that predict the start and end positions of the answer. The span prediction heads are token classifiers, each consisting of a single linear layer. 
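
To make the two-head idea concrete, here is a minimal sketch of how a span could be chosen from per-token start and end scores. The scores are made up for illustration, and `best_span` and `max_answer_len` are hypothetical names, not part of NeMo. The actual post-processing in NeMo (for example, n-best lists) may differ.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximizing start_scores[i] + end_scores[j] with i <= j."""
    best = (0, 0, float("-inf"))
    for i, start_score in enumerate(start_scores):
        # only consider end positions at or after the start, within the length limit
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# toy tokens and made-up scores (assumption: one score per token from each head)
tokens = ["the", "game", "was", "played", "in", "santa", "clara"]
start_scores = [0.1, 0.0, 0.2, 0.1, 0.3, 2.5, 0.4]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.1, 0.6, 3.0]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)
```

Requiring `i <= j` rules out spans that end before they start, which independent argmax over the two heads would not guarantee.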

# Dataset
This model expects the dataset to be in [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) format, i.e. one JSON file for each dataset split. 
Below is an example of a training file. Each title has one or more paragraph entries, each consisting of the text ("context") and question-answer entries. Each question-answer entry has:
* a question
* a globally unique id
* a boolean flag "is_impossible" which shows if the question is answerable or not
* if the question is answerable, one answer entry containing the answer text and its starting character index in the context; if it is not answerable, the "answers" list is empty

The evaluation files (for validation and testing) follow the same format, except that they can provide more than one answer to the same question. 
The inference file follows the same format, except that it does not require the "answers" and "is_impossible" keys.




```
{
    "data": [
        {
            "title": "Super_Bowl_50", 
            "paragraphs": [
                {
                    "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.", 
                    "qas": [
                        {
                            "question": "Where did Super Bowl 50 take place?", 
                            "is_impossible": false, 
                            "id": "56be4db0acb8001400a502ee", 
                            "answers": [
                                {
                                    "answer_start": 403, 
                                    "text": "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question": "What was the winning score of the Super Bowl 50?", 
                            "is_impossible": true, 
                            "id": "56be4db0acb8001400a502ez", 
                            "answers": [
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
...
```
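
As a rough illustration of this nesting, the sketch below walks a tiny SQuAD-style document and checks that each `answer_start` offset really points at the answer text. The inline sample and the helper `iter_examples` are made up for this example, and the sample assumes JSON booleans and integer offsets as in the official SQuAD files.

```python
import json

# tiny made-up sample in SQuAD layout (not taken from the real dataset)
squad_json = """
{"data": [{"title": "Example", "paragraphs": [{
  "context": "The game was played at Levi's Stadium in Santa Clara, California.",
  "qas": [
    {"id": "q1", "question": "Where was the game played?", "is_impossible": false,
     "answers": [{"text": "Santa Clara, California", "answer_start": 41}]},
    {"id": "q2", "question": "Who sang at halftime?", "is_impossible": true,
     "answers": []}
  ]}]}]}
"""


def iter_examples(squad):
    """Yield (context, qa) pairs from the data -> paragraphs -> qas nesting."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield paragraph["context"], qa


rows = []
for context, qa in iter_examples(json.loads(squad_json)):
    if qa.get("is_impossible", False):
        rows.append((qa["id"], None))  # unanswerable: no span to extract
        continue
    answer = qa["answers"][0]
    start = answer["answer_start"]
    span = context[start:start + len(answer["text"])]
    # the character offset must reproduce the answer text exactly
    assert span == answer["text"]
    rows.append((qa["id"], span))

print(rows)
```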



## Download the data

In this notebook we are going to download the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset to showcase how to do training and inference. There are two versions, SQuAD1.1 and SQuAD2.0. SQuAD1.1, the previous version of the dataset, contains 100,000+ question-answer pairs on 500+ articles. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. 


To download both datasets, we use [NeMo/examples/nlp/question_answering/get_squad.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/question_answering/get_squad.py). 




In [None]:
# set the following paths
DATA_DIR = "PATH_TO_DATA"
WORK_DIR = "PATH_TO_CHECKPOINTS_AND_LOGS"

In [None]:
## download get_squad.py script to download and preprocess the SQuAD data
os.makedirs(WORK_DIR, exist_ok=True)
if not os.path.exists(WORK_DIR + '/get_squad.py'):
    print('Downloading get_squad.py...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/question_answering/get_squad.py', WORK_DIR)
else:
    print ('get_squad.py already exists')

In [None]:
# download and preprocess the data
! python $WORK_DIR/get_squad.py --destDir $DATA_DIR

After executing the above cell, your data folder will contain a subfolder "squad" with the following 4 files for training and evaluation:
- v1.1/train-v1.1.json
- v1.1/dev-v1.1.json
- v2.0/train-v2.0.json
- v2.0/dev-v2.0.json

In [None]:
! ls -LR {DATA_DIR}/squad

## Data preprocessing

The input into the model is the concatenation of two tokenized sequences:
" [CLS] query [SEP] context [SEP]".
This is the tokenization used for BERT, i.e. the [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) tokenizer, which uses [Google's BERT vocabulary](https://github.com/google-research/bert). This tokenizer is configured with `model.tokenizer.tokenizer_name=bert-base-uncased` and is automatically instantiated using [Huggingface](https://huggingface.co/)'s API. 
The benefit of this tokenizer is that it is compatible with a pretrained BERT model, which we can fine-tune instead of training the question answering model from scratch. However, we also support other tokenizers, such as `model.tokenizer.tokenizer_name=sentencepiece`. Unlike the BERT WordPiece tokenizer, the [SentencePiece](https://github.com/google/sentencepiece) tokenizer model first needs to be created from a text file.
See [02_NLP_Tokenizers.ipynb](https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/nlp/02_NLP_Tokenizers.ipynb) for more details on how to use NeMo Tokenizers.
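
The input layout described above can be sketched as follows. This is a simplified illustration, not NeMo's preprocessing: whitespace splitting stands in for WordPiece, `build_input` is a hypothetical helper, and overly long contexts are simply truncated here, whereas real pipelines typically split them into overlapping windows.

```python
def build_input(query_tokens, context_tokens, max_seq_length=384):
    """Assemble [CLS] query [SEP] context [SEP] plus segment ids (0 = query, 1 = context)."""
    # reserve three positions for [CLS] and the two [SEP] markers
    budget = max_seq_length - len(query_tokens) - 3
    context_tokens = context_tokens[:max(budget, 0)]
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # [CLS], the query and the first [SEP] belong to segment 0; the rest to segment 1
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, segment_ids


# whitespace tokenization is an assumption made only to keep the example self-contained
query = "where was it played".split()
context = "it was played in santa clara".split()
tokens, segment_ids = build_input(query, context, max_seq_length=12)
print(tokens)
print(segment_ids)
```

With `max_seq_length=12`, the last context token is dropped so that the sequence fits, which shows why `MAX_SEQ_LENGTH` matters for long paragraphs.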

# Data and Model Parameters


Note, this is only an example to showcase usage and is not optimized for accuracy. In the following, we will download and adjust the model configuration to create a toy example, where we only use a small fraction of the original dataset. 

To train the full SQuAD model, leave the model parameters from the configuration file unchanged. This sets NUM_SAMPLES=-1 to use the entire dataset, which makes training significantly slower. We recommend using a bash script and multiple GPUs to speed this up. 


In [None]:
# This is the model configuration file that we will download, do not change this
MODEL_CONFIG = "question_answering_squad_config.yaml"

# model parameters, play with these
BATCH_SIZE = 12
MAX_SEQ_LENGTH = 384
# specify the BERT-like model you want to use
PRETRAINED_BERT_MODEL = "bert-base-uncased"
TOKENIZER_NAME = "bert-base-uncased" # tokenizer name

# Number of data examples used for training, validation, test and inference
TRAIN_NUM_SAMPLES = VAL_NUM_SAMPLES = TEST_NUM_SAMPLES = 5000 
INFER_NUM_SAMPLES = 5

TRAIN_FILE = f"{DATA_DIR}/squad/v1.1/train-v1.1.json"
VAL_FILE = f"{DATA_DIR}/squad/v1.1/dev-v1.1.json"
TEST_FILE = f"{DATA_DIR}/squad/v1.1/dev-v1.1.json"
INFER_FILE = f"{DATA_DIR}/squad/v1.1/dev-v1.1.json"

INFER_PREDICTION_OUTPUT_FILE = "output_prediction.json"
INFER_NBEST_OUTPUT_FILE = "output_nbest.json"

# training parameters
LEARNING_RATE = 0.00003

# number of epochs
MAX_EPOCHS = 1

# Model Configuration

The model is defined in a config file which declares multiple important sections. They are:
- **model**: All arguments that will relate to the Model - language model, span prediction, optimizer and schedulers, datasets and any other related information

- **trainer**: Any argument to be passed to PyTorch Lightning

In [None]:
# download the model's default configuration file 
config_dir = WORK_DIR + '/configs/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/question_answering/conf/{MODEL_CONFIG}', config_dir)
else:
    print('config file already exists')

In [None]:
# this line will print the entire default config of the model
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
print(OmegaConf.to_yaml(config))

## Setting up data within the config

Among other things, the config file contains dictionaries called dataset, train_ds, validation_ds and test_ds. These configurations are used to set up the Dataset and DataLoader of the corresponding split.

Specify data paths using `model.train_ds.file`, `model.validation_ds.file` and `model.test_ds.file`.

Let's now add the data paths to the config.

In [None]:
config.model.train_ds.file = TRAIN_FILE
config.model.validation_ds.file = VAL_FILE
config.model.test_ds.file = TEST_FILE

config.model.train_ds.num_samples = TRAIN_NUM_SAMPLES
config.model.validation_ds.num_samples = VAL_NUM_SAMPLES
config.model.test_ds.num_samples = TEST_NUM_SAMPLES

config.model.tokenizer.tokenizer_name = TOKENIZER_NAME

# Building the PyTorch Lightning Trainer

NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem!

Let's first instantiate a Trainer object!

In [None]:
# let's modify some trainer configs
# checks if we have GPU available and uses it
cuda = 1 if torch.cuda.is_available() else 0
config.trainer.gpus = cuda
config.trainer.precision = 16 if torch.cuda.is_available() else 32

# For mixed precision training, use precision=16 and amp_level=O1

config.trainer.max_epochs = MAX_EPOCHS

# Remove distributed training flags if only running on a single GPU or CPU
config.trainer.accelerator = None

print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

trainer = pl.Trainer(**config.trainer)

# Setting up a NeMo Experiment

NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it!

In [None]:
config.exp_manager.exp_dir = WORK_DIR
exp_dir = exp_manager(trainer, config.get("exp_manager", None))

# the exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)

# Using an Out-Of-Box Model

In [None]:
# list available pretrained models
nemo_nlp.models.QAModel.list_available_models()

In [None]:
# load pretrained model
pretrained_model_name = "BERTBaseUncasedSQuADv1.1"
model = nemo_nlp.models.QAModel.from_pretrained(model_name=pretrained_model_name)

# Model Training

Before initializing the model, we might want to modify some of the model configs.

In [None]:
# complete list of supported BERT-like models
nemo_nlp.modules.get_pretrained_lm_models_list()

In [None]:
# add the model parameters specified above to the config
config.model.language_model.pretrained_model_name = PRETRAINED_BERT_MODEL
config.model.train_ds.batch_size = BATCH_SIZE
config.model.validation_ds.batch_size  = BATCH_SIZE
config.model.test_ds.batch_size = BATCH_SIZE
config.model.optim.lr = LEARNING_RATE

print("Updated model config - \n")
print(OmegaConf.to_yaml(config.model))

In [None]:
# initialize the model
# the datasets will be prepared for training and evaluation during model initialization
model = nemo_nlp.models.QAModel(cfg=config.model, trainer=trainer)

## Monitoring Training Progress
Optionally, you can create a TensorBoard visualization to monitor training progress.

In [None]:
try:
  from google import colab
  COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
  COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
  %load_ext tensorboard
  %tensorboard --logdir {exp_dir}
else:
  print("To use tensorboard, please use this notebook in a Google Colab environment.")

In [None]:
# start the training
trainer.fit(model)

After training for 1 epoch, exact match on the evaluation data should be around 59.2%, F1 around 70.2%.
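
For readers unfamiliar with these two metrics, the sketch below shows how exact match and token-level F1 are commonly computed for a single prediction, following the approach of the official SQuAD evaluation script. The helper names are chosen for this example, and NeMo's own implementation may differ in details.

```python
import collections
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # if either side is empty, score 1.0 only when both are empty
        return float(pred_tokens == true_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Santa Clara, California", "santa clara california")
f1 = f1_score("in Santa Clara", "Santa Clara, California")
print(em, f1)
```

When several reference answers exist for a question, the score is usually taken as the maximum over the references, and the dataset-level numbers are averages over all questions.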

# Evaluation

To see how the model performs, let’s run evaluation on the test dataset.

In [None]:
model.setup_test_data(test_data_config=config.model.test_ds)
trainer.test(model)

# Inference

To use the model for creating predictions, let’s run inference on the unlabeled inference dataset.

In [None]:
# store the inference predictions under the experiment output folder
output_prediction_file = f"{exp_dir}/{INFER_PREDICTION_OUTPUT_FILE}"
output_nbest_file = f"{exp_dir}/{INFER_NBEST_OUTPUT_FILE}"
all_preds, all_nbests = model.inference(file=INFER_FILE, batch_size=5, num_samples=INFER_NUM_SAMPLES, output_nbest_file=output_nbest_file, output_prediction_file=output_prediction_file)

In [None]:
for question_id, answer in all_preds.items():
    if answer != "empty":
        print(f"Question ID: {question_id}, answer: {answer}")
# The prediction file contains the predicted answer to each question id for the first INFER_NUM_SAMPLES.
! python -m json.tool {output_prediction_file}

If you have NeMo installed locally, you can also train the model with 
[NeMo/examples/nlp/question_answering/question_answering_squad.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/question_answering/question_answering_squad.py).

To run the training script, use:

`python question_answering_squad.py model.train_ds.file=TRAIN_FILE model.validation_ds.file=VAL_FILE model.test_ds.file=TEST_FILE`

To improve the performance of the model, train with multi-GPU and a global batch size of 24. So if you use 8 GPUs with `trainer.gpus=8`, set `model.train_ds.batch_size=3`
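
The batch size arithmetic above can be written as a small helper. This is only a sketch: `per_gpu_batch_size` is a hypothetical function, and it assumes plain data parallelism without gradient accumulation, so the global batch size is the per-GPU batch size times the number of GPUs.

```python
def per_gpu_batch_size(global_batch_size, num_gpus):
    """Return the per-GPU batch size that yields the requested global batch size."""
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the number of GPUs")
    return global_batch_size // num_gpus


batch_size = per_gpu_batch_size(global_batch_size=24, num_gpus=8)
print(f"trainer.gpus=8 model.train_ds.batch_size={batch_size}")
```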