<a href="https://colab.research.google.com/github/Heimine/NLU_project/blob/Yichen-Liu/Fine_tuning_for_Question_Answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine_tuning_for_Question_Answering

Origin: @techno246

Adaptation: Yichen Liu


## Introduction

Reading comprehension, also known as question answering (QA), is one of the tasks that NLP tries to solve. The goal is to answer an arbitrary question given a context. In our project, we want to build a QA model that automatically extracts the answer from a given context. For instance, given the following context:

> Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in December 2019 in Wuhan, the capital of China's Hubei province, and has since spread globally, resulting in the ongoing 2019–20 coronavirus pandemic. As of 1 May 2020, more than 3.27 million cases have been reported across 187 countries and territories, resulting in more than 233,000 deaths. More than 1.02 million people have recovered.

We ask the question

> How many cases have been reported as of 1 May 2020?

We expect the QA system to respond with something like this:

> more than 3.27 million


This notebook demonstrates how we fine-tune ALBERT for the QA task and use it for inference. For this tutorial, we use the Transformers library built by [Hugging Face](https://huggingface.co/), which provides implementations of transformer models (including ALBERT) in both TensorFlow and PyTorch. You can also just use a fine-tuned model from their [model repository](https://huggingface.co/models) (which I encourage in general to save money and reduce emissions). However, in our project we modified some code in the original Transformers library to load our hand-made test set. The modified version can be accessed [here](https://github.com/lyc1005/transformers).

## 1.0 Setup

Let's check out what kind of GPU our friends at Google gave us. This notebook should be configured to give you a P100 😃 (saved in metadata)

In [0]:
!nvidia-smi

First, we clone our modified fork of the Hugging Face Transformers library from [here](https://github.com/lyc1005/transformers).

In [0]:
# Upstream repository, for reference:
#!git clone https://github.com/huggingface/transformers
!git clone https://github.com/lyc1005/transformers.git
# Optional commit pin (disabled):
#!cd transformers && git checkout a3085020ed0d81d4903c50967687192e3101e770

In [0]:
!pip install ./transformers
!pip install tensorboardX

## 2.0 Train Model

This is where we can train our own model. Note you can skip this step if you don't want to wait 1.5 hours!

### 2.1 Get Training and Evaluation Data

The SQuAD dataset contains question/answer pairs for training the ALBERT model on the QA task.

Now get the SQuAD V2.0 dataset: `train-v2.0.json` is used for training and `dev-v2.0.json` for evaluating how well the model trained.

Read more about this dataset here: https://rajpurkar.github.io/SQuAD-explorer/

In [0]:
!mkdir SQuAD2.0 \
&& cd SQuAD2.0 \
&& wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json \
&& wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
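For orientation, the SQuAD V2.0 files nest questions under paragraphs, and paragraphs under articles. The sketch below walks that layout on a tiny hand-written record; the record and the `iter_questions` helper are illustrative and not part of the dataset or the library. The same loop should work on the result of `json.load` applied to `train-v2.0.json`.

```python
# A tiny hand-written record that mimics the SQuAD V2.0 layout (illustrative only).
context = "The disease was first identified in December 2019 in Wuhan."
sample = {
    "version": "v2.0",
    "data": [{
        "title": "COVID-19",
        "paragraphs": [{
            "context": context,
            "qas": [
                {"id": "q1",
                 "question": "Where was the disease first identified?",
                 "is_impossible": False,
                 # answer_start is the character offset of the answer in the context
                 "answers": [{"text": "Wuhan", "answer_start": context.index("Wuhan")}]},
                {"id": "q2",
                 "question": "Who discovered the vaccine?",
                 "is_impossible": True,  # V2.0 adds unanswerable questions
                 "answers": []},
            ],
        }],
    }],
}

def iter_questions(squad):
    """Yield (question, context, answer_texts, is_impossible) for every QA pair."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa["answers"]]
                yield (qa["question"], paragraph["context"], answers,
                       qa.get("is_impossible", False))

rows = list(iter_questions(sample))
n_unanswerable = sum(1 for row in rows if row[3])
print(len(rows), "questions,", n_unanswerable, "unanswerable")
```

The unanswerable entries are the reason the training commands later pass `--version_2_with_negative`.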

### Download the test dataset from our GitHub repository

In [0]:
!git clone https://github.com/Heimine/NLU_project.git

### Convert TF-based BioBERT to a PyTorch model

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
!tar -xzf "/content/drive/My Drive/biobert_v1.1_pubmed.tar.gz"

In [0]:
!pip install pytorch-transformers
!pytorch_transformers bert biobert_v1.1_pubmed/model.ckpt-1000000 biobert_v1.1_pubmed/bert_config.json biobert_v1.1_pubmed/pytorch_model.bin

In [0]:
!mv biobert_v1.1_pubmed/bert_config.json biobert_v1.1_pubmed/config.json

### Convert TF-based CovidAlbert to a PyTorch model

In [0]:
!unzip "/content/drive/My Drive/Covid-Albert.zip"

In [0]:
!python /content/transformers/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path=/content/Covid-Albert/albert_100000_check/train_output_model.ckpt-100000 \
    --albert_config_file=/content/Covid-Albert/albert_100000_check/albert_config.json \
    --pytorch_dump_path=/content/Covid-Albert/albert_100000_check/pytorch_model.bin

In [0]:
!mv /content/Covid-Albert/albert_100000_check/albert_config.json /content/Covid-Albert/albert_100000_check/config.json

In [0]:
!mv /content/Covid-Albert/albert_100000_check/30k-clean.model /content/Covid-Albert/albert_100000_check/spiece.model
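After the conversion and renames, the model directory needs to hold the files that `from_pretrained` looks for. A minimal sanity check, assuming the three file names produced by the cells above (`missing_model_files` is a hypothetical helper, and the BioBERT directory would need a BERT `vocab.txt` in place of `spiece.model`):

```python
import os

# File names follow the renames above; treat the list as an assumption for other models.
EXPECTED_FILES = ("config.json", "pytorch_model.bin", "spiece.model")

def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files that are not present in model_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]

model_dir = "/content/Covid-Albert/albert_100000_check"
missing = missing_model_files(model_dir)
print("missing:", missing if missing else "nothing")
```

An empty list means the directory can be passed to `--model_name_or_path` as is.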

### 2.2 Run training 

We can now train the model with the training set. 

### Notes about parameters
`per_gpu_train_batch_size` specifies the number of training examples per iteration per GPU. *In general*, a larger batch trains faster and gives smoother gradient estimates, but the limiting factor is GPU memory. 12 is what I use for a GPU with 16 GB of memory.

`save_steps` specifies the number of training steps between checkpoint files. I've increased it to save disk space.

`num_train_epochs` sets the number of passes over the training data. I recommend two epochs; it is currently set to one to save time.

`version_2_with_negative` is required for SQuAD V2.0. If training with V1.1, remove this flag.

Warning: it takes about 1.5 hours to train an epoch! If you don't want to wait this long, feel free to skip this step and note the comment in the code to use a pretrained model!
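To see how these parameters interact, here is a rough back-of-the-envelope sketch. The helper `training_schedule` and the feature count of 120,000 are made up for illustration; the real count depends on how the tokenizer splits the training set, and gradient accumulation is ignored.

```python
import math

def training_schedule(num_features, per_gpu_batch_size, num_epochs, save_steps, n_gpu=1):
    """Return (steps_per_epoch, total_steps, num_checkpoints).

    Assumes one optimizer step per batch (no gradient accumulation) and one
    checkpoint every `save_steps` optimizer steps.
    """
    steps_per_epoch = math.ceil(num_features / (per_gpu_batch_size * n_gpu))
    total_steps = steps_per_epoch * num_epochs
    num_checkpoints = total_steps // save_steps
    return steps_per_epoch, total_steps, num_checkpoints

# 120_000 is a hypothetical feature count, used only to make the arithmetic concrete.
for save_steps in (1000, 5000):
    print(save_steps, training_schedule(120_000, 12, 1, save_steps))
```

Under these assumptions, raising `save_steps` from 1000 to 5000 cuts the number of checkpoints written per epoch from 10 to 2.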

### Fine tuning Albert

In [0]:
# Albert
# after fine-tuning on SQuAD 2.0
{'exact': 63.5, 'f1': 83.9611555413186, 'total': 200, 'HasAns_exact': 62.63157894736842, 'HasAns_f1': 84.16963741191432, 'HasAns_total': 190, 'NoAns_exact': 80.0, 'NoAns_f1': 80.0, 'NoAns_total': 10, 'best_exact': 63.5, 'best_exact_thresh': 0.0, 'best_f1': 83.96115554131856, 'best_f1_thresh': 0.0}
# after fine-tuning on SQuAD and BioASQ 2
{'exact': 48.5, 'f1': 72.63644087813776, 'total': 200, 'HasAns_exact': 47.89473684210526, 'HasAns_f1': 73.30151671382926, 'HasAns_total': 190, 'NoAns_exact': 60.0, 'NoAns_f1': 60.0, 'NoAns_total': 10, 'best_exact': 48.5, 'best_exact_thresh': 0.0, 'best_f1': 72.63644087813778, 'best_f1_thresh': 0.0}
# after fine-tuning on SQuAD and BioASQ 1
{'exact': 48.0, 'f1': 72.69904379246483, 'total': 200, 'HasAns_exact': 47.36842105263158, 'HasAns_f1': 73.36741451838407, 'HasAns_total': 190, 'NoAns_exact': 60.0, 'NoAns_f1': 60.0, 'NoAns_total': 10, 'best_exact': 48.0, 'best_exact_thresh': 0.0, 'best_f1': 72.69904379246483, 'best_f1_thresh': 0.0}
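The `exact` and `f1` entries in the dictionaries above are SQuAD-style exact-match and token-overlap F1 scores, in percent; the `HasAns_*` and `NoAns_*` entries are the same scores restricted to answerable and unanswerable questions. The sketch below is a simplified version modeled on the official SQuAD evaluation logic, for a single gold answer; the function names are my own, and the real script also takes the maximum over several gold answers.

```python
import collections
import re
import string

def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(gold, pred):
    return int(normalize_answer(gold) == normalize_answer(pred))

def f1_score(gold, pred):
    gold_tokens = normalize_answer(gold).split()
    pred_tokens = normalize_answer(pred).split()
    if not gold_tokens or not pred_tokens:
        # No-answer case: full credit only when both sides are empty.
        return float(gold_tokens == pred_tokens)
    common = collections.Counter(gold_tokens) & collections.Counter(pred_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

gold, pred = "more than 3.27 million", "3.27 million cases"
print(exact_match(gold, pred), round(f1_score(gold, pred), 3))

# The overall score is the count-weighted mean of the two subsets,
# here using the first ALBERT result above (190 answerable, 10 unanswerable).
overall_exact = (62.63157894736842 * 190 + 80.0 * 10) / 200
print(round(overall_exact, 2))
```

This also explains why `exact` can be much lower than `f1`: a prediction that overlaps the gold answer but adds or drops a word earns partial F1 credit and no exact-match credit.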

In [0]:
!python transformers/examples/run_squad.py \
  --model_type albert \
  --model_name_or_path albert-base-v2 \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file /content/SQuAD2.0/train-v2.0.json \
  --predict_file /content/NLU_project/COVID19_QA_testset.csv \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 1 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /content/SQuAD_Albert_model \
  --save_steps 1000 \
  --threads 4 \
  --version_2_with_negative
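The `--max_seq_length 384` and `--doc_stride 128` flags in the command above matter when a context is too long to fit in one input: the context is cut into overlapping windows, each paired with the question. A simplified sketch of that idea, following the sliding-window scheme of the original BERT SQuAD script (the library's implementation differs in detail, and `split_into_windows`, the 14-token question, and the 3 special tokens are assumptions for illustration):

```python
def split_into_windows(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Return (start, length) spans that cover the document with overlapping windows."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return spans

max_seq_length = 384
question_tokens = 14   # hypothetical question length
special_tokens = 3     # assumed: [CLS] question [SEP] context [SEP]
room_for_context = max_seq_length - question_tokens - special_tokens

print(split_into_windows(600, room_for_context, doc_stride=128))
```

With these numbers a 600-token context yields three windows starting at tokens 0, 128, and 256, so an answer near a window boundary still appears whole in at least one window.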

### Fine tuning Biobert

In [0]:
# Biobert
# after fine-tuning on SQuAD 2.0
{'exact': 64.5, 'f1': 83.89901381595338, 'total': 200, 'HasAns_exact': 63.1578947368421, 'HasAns_f1': 83.57790927995093, 'HasAns_total': 190, 'NoAns_exact': 90.0, 'NoAns_f1': 90.0, 'NoAns_total': 10, 'best_exact': 64.5, 'best_exact_thresh': 0.0, 'best_f1': 83.89901381595337, 'best_f1_thresh': 0.0}
# after fine-tuning on SQuAD and BioASQ 2
{'exact': 47.5, 'f1': 67.90449716949715, 'total': 200, 'HasAns_exact': 45.78947368421053, 'HasAns_f1': 67.26789175736543, 'HasAns_total': 190, 'NoAns_exact': 80.0, 'NoAns_f1': 80.0, 'NoAns_total': 10, 'best_exact': 47.5, 'best_exact_thresh': 0.0, 'best_f1': 67.90449716949716, 'best_f1_thresh': 0.0}
# after fine-tuning on SQuAD and BioASQ 1
{'exact': 48.5, 'f1': 70.26100288600288, 'total': 200, 'HasAns_exact': 46.8421052631579, 'HasAns_f1': 69.74842409052934, 'HasAns_total': 190, 'NoAns_exact': 80.0, 'NoAns_f1': 80.0, 'NoAns_total': 10, 'best_exact': 48.5, 'best_exact_thresh': 0.0, 'best_f1': 70.26100288600287, 'best_f1_thresh': 0.0}

In [0]:
!python transformers/examples/run_squad.py \
  --model_type bert \
  --model_name_or_path /content/biobert_v1.1_pubmed \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file /content/NLU_project/BioASQ-7b/train/Full-Abstract/BioASQ-train-factoid-7b-full-annotated.json \
  --predict_file /content/NLU_project/COVID19_QA_testset.csv \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 1 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /content/SQuAD_Biobert_model \
  --save_steps 1000 \
  --threads 4 \
  --version_2_with_negative

### Fine tuning CovidAlbert

In [0]:
# CovidAlbert
# after fine-tuning on SQuAD 2.0

# after fine-tuning on SQuAD and BioASQ 2

# after fine-tuning on SQuAD and BioASQ 1

In [0]:
!python transformers/examples/run_squad.py \
  --model_type albert \
  --model_name_or_path /content/Covid-Albert/albert_100000_check \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file /content/SQuAD2.0/train-v2.0.json \
  --predict_file /content/NLU_project/COVID19_QA_testset.csv \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 1 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /content/SQuAD_CovidAlbert_model \
  --save_steps 5000 \
  --threads 4 \
  --version_2_with_negative

## 3.0 Setup prediction code

Now we can use the Hugging Face library to make predictions with our newly trained model. Note that much of the code is pulled from `run_squad.py` in the Hugging Face repository, with all the training parts removed. This modified code lets us run predictions on questions and contexts passed in directly as strings, rather than in the .json format used by the training/test sets.

NOTE: if you decided to train your own model, change the flag `use_own_model` to `True`.


In [0]:
import os
import torch
import time
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

from transformers import (
    AlbertConfig,
    AlbertForQuestionAnswering,
    AlbertTokenizer,
    squad_convert_examples_to_features
)

from transformers.data.processors.squad import SquadResult, SquadV2Processor, SquadExample

from transformers.data.metrics.squad_metrics import compute_predictions_logits

# READER NOTE: Set this flag to use own model, or use pretrained model in the Hugging Face repository
use_own_model = True

if use_own_model:
  model_name_or_path = "/content/SQuAD_Biobert_model"
else:
  model_name_or_path = "ktrapeznikov/albert-xlarge-v2-squad-v2"

output_dir = ""

# Config
n_best_size = 1
max_answer_length = 30
do_lower_case = True
null_score_diff_threshold = 0.0

def to_list(tensor):
    return tensor.detach().cpu().tolist()

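# Setup model
# The Auto* classes pick the architecture from the checkpoint's config, so this
# also works for the BERT-based BioBERT checkpoint selected above, not only ALBERT.
from transformers import AutoConfig, AutoModelForQuestionAnswering, AutoTokenizer

config_class, model_class, tokenizer_class = (
    AutoConfig, AutoModelForQuestionAnswering, AutoTokenizer)
config = config_class.from_pretrained(model_name_or_path)
tokenizer = tokenizer_class.from_pretrained(
    model_name_or_path, do_lower_case=do_lower_case)
model = model_class.from_pretrained(model_name_or_path, config=config)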

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

processor = SquadV2Processor()

def run_prediction(question_texts, context_text):
    """Setup function to compute predictions"""
    examples = []

    for i, question_text in enumerate(question_texts):
        example = SquadExample(
            qas_id=str(i),
            question_text=question_text,
            context_text=context_text,
            answer_text=None,
            start_position_character=None,
            title="Predict",
            is_impossible=False,
            answers=None,
        )

        examples.append(example)

    features, dataset = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        return_dataset="pt",
        threads=1,
    )

    eval_sampler = SequentialSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=10)

    all_results = []

    for batch in eval_dataloader:
        model.eval()
        batch = tuple(t.to(device) for t in batch)

        with torch.no_grad():
            inputs = {
                "input_ids": batch[0],
                "attention_mask": batch[1],
                "token_type_ids": batch[2],
            }

            example_indices = batch[3]

            outputs = model(**inputs)

            for i, example_index in enumerate(example_indices):
                eval_feature = features[example_index.item()]
                unique_id = int(eval_feature.unique_id)

                output = [to_list(output[i]) for output in outputs]

                start_logits, end_logits = output
                result = SquadResult(unique_id, start_logits, end_logits)
                all_results.append(result)

    output_prediction_file = "predictions.json"
    output_nbest_file = "nbest_predictions.json"
    output_null_log_odds_file = "null_predictions.json"

    predictions = compute_predictions_logits(
        examples,
        features,
        all_results,
        n_best_size,
        max_answer_length,
        do_lower_case,
        output_prediction_file,
        output_nbest_file,
        output_null_log_odds_file,
        False,  # verbose_logging
        True,  # version_2_with_negative
        null_score_diff_threshold,
        tokenizer,
    )

    return predictions

## 4.0 Run predictions

Now for the fun part: testing the model on different inputs. The example here is rudimentary, but `run_prediction` accepts any list of questions together with any context string.

In [0]:
context = "Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in December 2019 in Wuhan, the capital of China's Hubei province, and has since spread globally, resulting in the ongoing 2019–20 coronavirus pandemic. As of 1 May 2020, more than 3.27 million cases have been reported across 187 countries and territories, resulting in more than 233,000 deaths. More than 1.02 million people have recovered."
questions = ["Where did COVID-19 originate from",              
             "How many cases have been reported as of 1 May 2020",
             "How many people have died from COVID-19",
             "Which country suffers most from COVID-19"]

predictions = run_prediction(questions, context)

# Print results
for key in predictions.keys():
  print(predictions[key])
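Because `run_prediction` assigns `qas_id=str(i)` to the i-th question, the returned dictionary can be matched back to the questions, and with SQuAD V2.0 settings an empty string means the model predicts that the context holds no answer. A small formatting sketch under those assumptions; `format_answers` is a hypothetical helper, and the dictionary below is a made-up placeholder, not real model output.

```python
def format_answers(questions, predictions):
    """Pair each question with its predicted answer; '' is shown as '<no answer>'."""
    lines = []
    for i, question in enumerate(questions):
        answer = predictions.get(str(i), "")
        lines.append(f"{question}? -> {answer if answer else '<no answer>'}")
    return lines

# Made-up placeholder values standing in for the output of run_prediction.
example_questions = ["How many cases have been reported as of 1 May 2020",
                     "Which country suffers most from COVID-19"]
example_predictions = {"0": "more than 3.27 million", "1": ""}

for line in format_answers(example_questions, example_predictions):
    print(line)
```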

## 5.0 Store the model

In [0]:
#!zip -r model_output_squad.zip model_output_squad

In [0]:
from google.colab import files
files.download("/content/SQuAD_Biobert_model/pytorch_model.bin")