<a href="https://colab.research.google.com/github/SpencerPao/Natural-Language-Processing/blob/main/Question_Answering_Modeling_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question Answering (QA) Models

## Deep learning models that can answer all of your questions... if they are any good, of course.

### There are 2 different ways to categorize QA modeling
- Domain - the system is bounded by its input data; there are open and closed systems
     - <b> Open-domain systems </b> target broad questions that are not specific to any one category of discussion (Wikipedia, the World Wide Web, Alexa, etc.)
     - <b> Closed-domain systems </b> have a narrower vocabulary and focus on a particular industry or topic (soccer, finance, technology, law, etc.)
- Question type - open-ended questions, Yes/No questions, inference questions, etc.
     - Once you have decided which kind of system you want to build, you then need to work out which types of question you want your model to focus on



### Types of Question Answering Models
- <i> Extractive Question Answering </i> is a deep learning model that can provide an answer when given a corpus of text (i.e. the context). So, when you give the model a question, it “searches” the document to determine the best answer to that question. In many ways, this is essentially a search tool...
- <i> Open Generative Question Answering </i> is a deep learning model that generates text based on a context. So, unlike the extractive question answering model, the answer does not <b> literally </b> have to be in the text.
- <i> Closed Generative Question Answering </i> is a deep learning model that is given no context; the answer is generated entirely by the model.

More information on these 3 subsets of Question Answering modeling can be found [at HuggingFace.co](https://huggingface.co/tasks/question-answering) -- we will <b> focus mainly on <i> Extractive Question Answering models </i> in this notebook. </b>

- QA datasets are painstakingly designed and tagged to fit the "mold" of a QA model. A training dataset will look like this:
     - _Also note that QA training data does not always need to have an associated answer._
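
As a rough illustration of that format, the sketch below builds two records laid out like the SQuAD-style examples loaded later in this notebook (`id`, `title`, `context`, `question`, `answers`). The record contents and the `is_consistent` helper are made up for illustration, and representing an unanswerable question with empty lists is an assumption based on how SQuAD2.0-style data is commonly distributed.

```python
# Two hypothetical SQuAD-style records (contents invented for illustration).
context = "RoBERTa was introduced in 2019 as a robustly optimized variant of BERT."

answerable = {
    "id": "example-0001",
    "title": "RoBERTa",
    "context": context,
    "question": "When was RoBERTa introduced?",
    # answer_start is the character offset of the answer inside the context.
    "answers": {"text": ["2019"], "answer_start": [context.index("2019")]},
}

# Assumed convention: an unanswerable question carries empty answer lists.
unanswerable = {
    "id": "example-0002",
    "title": "RoBERTa",
    "context": context,
    "question": "Who funded RoBERTa?",
    "answers": {"text": [], "answer_start": []},
}

def is_consistent(record):
    """Check that every answer text really sits at its answer_start offset."""
    ctx = record["context"]
    answers = record["answers"]
    return all(
        ctx[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )

print(is_consistent(answerable), is_consistent(unanswerable))
```

A check like `is_consistent` is worth running on hand-labeled data, since an off-by-one `answer_start` silently produces wrong training labels.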


### Resources
- NLP-Progress [current state-of-the-art NLP tasks](http://nlpprogress.com/english/question_answering.html) --> Contains the QA datasets commonly used for state-of-the-art NLP QA models.

- [HuggingFace repository](https://huggingface.co/); This hosts many popular NLP QA models, some of which I will use in this notebook.

- [Explanation of QA metadata formats](https://simpletransformers.ai/docs/qa-data-formats/)

# Extractive QA Model Structure and Use

[Check out the BERT Video I did here!](https://www.youtube.com/watch?v=72Ylk77PqR8)

- QA models are essentially extensions of the BERT model with slightly different output layers
- Interested in the base model, [BERT?](https://www.geeksforgeeks.org/explanation-of-bert-model-nlp/) Here is the actual [paper of RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)
- Link to an [overview of RoBERTa](https://www.geeksforgeeks.org/overview-of-roberta-model/)
    - RoBERTa is a new and improved version of BERT
        - Removes the Next Sentence Prediction (NSP) objective
        - Trained with bigger batch sizes & on longer sequences
        - Dynamically changes the masking pattern
        - Pretrained on a large corpus of English data with <b> no </b> labeling whatsoever (just the raw text)
            - Masks 15% of the input tokens; RoBERTa runs the entire masked sentence through the model, and the model attempts to predict the masked tokens correctly.
            - This is where the model learns context and picks up a basic understanding of the language, which the QA model later builds on!
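
To make the masking idea concrete, here is a minimal sketch in plain Python. It is a simplification, not RoBERTa's actual implementation: it works on whitespace-split words rather than subword tokens, always substitutes a `<mask>` placeholder, and the `mask_tokens` helper and its parameters are hypothetical names. Calling it with a different seed each epoch mimics dynamic masking, where the masked positions change every time a sentence is seen.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels); labels hold the original token at masked positions."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask roughly mask_prob of the tokens, but always at least one.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

sentence = ("the option to convert models between frameworks gives "
            "freedom to the user and lets people switch easily").split()

# A fresh seed per epoch gives a fresh masking pattern (dynamic masking).
for epoch in range(3):
    masked, _ = mask_tokens(sentence, seed=epoch)
    print(epoch, " ".join(masked))
```

During pretraining, the model only sees `masked` and is scored on how well it recovers the non-`None` entries of `labels`.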

Using the very popular question-answering model: **RoBERTa-base for QA** 

    "This is the roberta-base model, fine-tuned using the SQuAD2.0 dataset. It's been trained on question-answer pairs, including unanswerable questions, for the task of Question Answering." - deepset.ai

In [1]:
!pip install transformers



In [2]:
"""https://huggingface.co/deepset/roberta-base-squad2"""
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
res = nlp(QA_input)

# b) Load model & tokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


# QA Metrics Material
- [Metrics on QA (implementation and explanation)](https://qa.fastforwardlabs.com/methods/background/2020/04/28/Intro-to-QA.html)
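
As a companion to the link above, here is a minimal sketch of the two metrics usually reported for SQuAD-style QA, exact match (EM) and token-level F1. It follows the general recipe of the SQuAD evaluation (lowercase, strip punctuation and English articles, then compare tokens), but the function names are my own and this is not the official evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(truth)

def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match (e.g. an unanswerable question).
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The user", "user"))
print(f1_score("gives freedom to the user", "freedom to the user"))
```

EM is all-or-nothing, while F1 gives partial credit when the predicted span overlaps the reference answer without matching it exactly.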

In [3]:
res # score is the model's confidence in the predicted span (a probability between 0 and 1), not an F1 score; F1 is computed separately by comparing the predicted words against a reference answer

{'score': 0.2117140293121338,
 'start': 59,
 'end': 84,
 'answer': 'gives freedom to the user'}
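
The `start` and `end` values are character offsets into the context string, so slicing the context with them recovers the answer. A quick check using the input and output shown above:

```python
context = (
    "The option to convert models between FARM and transformers gives freedom "
    "to the user and let people easily switch between frameworks."
)
# Values copied from the pipeline output shown above.
res = {"start": 59, "end": 84, "answer": "gives freedom to the user"}

# Slice the context with the reported character offsets.
span = context[res["start"]:res["end"]]
print(span)
```

This is what makes the model "extractive": the answer is always a literal substring of the context.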

# RoBERTa Architecture
If you are curious...

In [4]:
model

RobertaForQuestionAnswering(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (Lay
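
The printout above is cut off before the final layer. In the Hugging Face implementation, `RobertaForQuestionAnswering` ends with a small linear head that turns each token's hidden state into two numbers, a start logit and an end logit. The sketch below shows, in plain Python, one simple way such logits can be turned into an answer span: pick the (start, end) pair with the highest summed score, subject to start <= end and a maximum length. The `best_span` helper, the toy logits, and the `max_answer_len` default are illustrative assumptions, not the library's own decoding code.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]."""
    best = None
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Toy logits, one pair per token (values invented for illustration).
tokens = ["<s>", "gives", "freedom", "to", "the", "user", "</s>"]
start_logits = [0.1, 4.0, 0.5, 0.2, 0.1, 0.3, 0.0]
end_logits = [0.0, 0.2, 1.0, 0.1, 0.3, 3.5, 0.1]

s, e, score = best_span(start_logits, end_logits)
print(tokens[s:e + 1], score)
```

Scoring start and end jointly, rather than taking each argmax independently, avoids spans whose end comes before their start.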

# How to fine-tune a QA Model?
- You definitely need a GPU. Otherwise, expect fine-tuning to be roughly an order of magnitude slower, or worse.
- Let's leverage an already-built training script.
    - [Located here](https://github.com/huggingface/transformers/blob/b90745c5901809faef3136ed09a689e7d733526c/examples/run_squad.py); the execution is in the cell below.
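
Before starting a long fine-tuning run, it is worth confirming that a GPU is actually visible. A minimal sketch, assuming PyTorch is the backend (the `pick_device` helper is a hypothetical name; it falls back to CPU when PyTorch is not installed):

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch we simply report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Training would run on: {device}")
```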
    
What is this script doing?
- Loads whichever pretrained model you want to use (we are using RoBERTa in this case, but you can use a different pretrained model)
- Converts the input dataset into features
    - The featurized dataset is cached, so you don't necessarily have to rerun this step for the same model.
- Make sure the <b> --do_train </b> flag is enabled; this starts the training.
- When training is done, the model outputs are saved in <b> output_dir/checkpoint-step_number </b>
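
Since checkpoints land in folders named `checkpoint-<step>` under the output directory, the most recent one can be found by sorting on the step number. A minimal sketch using only the standard library; the `latest_checkpoint` helper is a hypothetical name, and the demo builds throwaway folders rather than reading a real training run:

```python
import re
import tempfile
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in Path(output_dir).iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            # Compare steps as integers so 1500 sorts after 500.
            candidates.append((int(match.group(1)), path))
    return max(candidates)[1] if candidates else None

# Demo on a temporary directory that mimics a training output folder.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    latest_name = latest_checkpoint(tmp).name
    print(latest_name)
```

Comparing the step as an integer matters: a plain string sort would rank `checkpoint-500` above `checkpoint-1500`.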

In [5]:
"""You could go through this route and run your training script here via command line with the associated parameters
and the script."""
## More information on how to do this can be found here: https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html
# Download the training script from the transformers library
# (use the raw.githubusercontent.com URL; the github.com/.../blob/... URL returns an HTML page, not the script)
# !curl -L -O https://raw.githubusercontent.com/huggingface/transformers/b90745c5901809faef3136ed09a689e7d733526c/examples/run_squad.py



### BUT there are some functions that enable us to do this without going the command-line route.

'You could go through this route and run your training script here via command line with the associated parameters\nand the script.'

# Using Popular Libraries

In [6]:
!pip install datasets

Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Installing collected packages: dill
  Attempting uninstall: dill
    Found existing installation: dill 0.3.1.1
    Uninstalling dill-0.3.1.1:
      Successfully uninstalled dill-0.3.1.1
Successfully installed dill-0.3.6


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.6 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 11.0.0 which is incompatible.


In [7]:
from datasets import load_dataset
# More documentation about the dataset can be found here: https://huggingface.co/datasets/viewer/?dataset=squad
# load_dataset returns a DatasetDict, essentially a wrapper around the train/validation splits.
# SQuAD is a popular dataset based on Wikipedia articles in which every question has an answer in the provided context
# (unlike SQuAD2.0, which also includes unanswerable questions).
squad = load_dataset("squad")

ImportError: cannot import name 'load_dataset' from 'datasets' (c:\Users\Itzzgvry\anaconda3\envs\tensorflow\lib\site-packages\datasets\__init__.py)

In [None]:
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [8]:
# More specific information about the dataset can be found here: https://huggingface.co/datasets/squad#data-instances
squad["train"][0]
# id -> unique identifier of the question/answer example
# title -> title of the document (Wikipedia article) the context comes from
# context -> passage in which the answer resides
# question -> the question you are trying to answer
# answers -> the answer text(s), plus the character offset where each answer begins in the context (answer_start)

NameError: name 'squad' is not defined

In [5]:
# Preprocess the data to a BERT format
def preprocess_function(examples):
    """Courtesy of https://huggingface.co/docs/transformers/tasks/question_answering"""
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
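
The core of `preprocess_function` is converting a character-level answer span into token indices using each token's (start, end) character offsets. The toy example below reproduces that search on whitespace-split words instead of real subword tokens; the `char_span_to_token_span` helper is a hypothetical name, and a real tokenizer would also interleave question tokens and special tokens.

```python
import re

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span [start_char, end_char) to inclusive token indices."""
    first, last = 0, len(offsets) - 1
    # Answer lies outside the available tokens: mirror the (0, 0) fallback.
    if offsets[first][0] > end_char or offsets[last][1] < start_char:
        return 0, 0
    # Walk forward past every token that starts at or before the answer start.
    idx = first
    while idx <= last and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # Walk backward past every token that ends at or after the answer end.
    idx = last
    while idx >= first and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

context = "The option to convert models gives freedom to the user"
answer = "gives freedom"
start_char = context.index(answer)
end_char = start_char + len(answer)

# Whitespace "tokenizer": one (start, end) character offset pair per word.
offsets = [m.span() for m in re.finditer(r"\S+", context)]
words = [context[a:b] for a, b in offsets]

start_token, end_token = char_span_to_token_span(offsets, start_char, end_char)
print(words[start_token:end_token + 1])
```

The `(0, 0)` fallback corresponds to the case where truncation at `max_length` cuts the answer out of the context, so the example is labeled as having no answer in view.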

In [6]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)




In [7]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

# CPU Training Result... Let's think about GPUs, okay?


In [8]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_name) # remember that model_name is deepset/roberta-base-squad2

In [None]:
# Let's start training!
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,  # small learning rate, typical for fine-tuning a pretrained transformer
    per_device_train_batch_size=1,
    per_device_eval_batch_size=11,
    num_train_epochs=1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

***** Running training *****
  Num examples = 87599
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 87599


Epoch,Training Loss,Validation Loss


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1500/special_toke
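
The step count in the log follows from the dataset size and batch settings: each optimization step consumes one effective batch, so with 87,599 examples and a batch size of 1 there are 87,599 steps per epoch. A small sketch of that arithmetic (the helper name is hypothetical, it only approximates how the trainer counts steps, and the batch-size-16 line is a what-if, not a run from this notebook):

```python
import math

def total_optimization_steps(num_examples, per_device_batch_size,
                             num_devices=1, grad_accum_steps=1, num_epochs=1):
    """Steps = ceil(examples / effective batch size) * epochs."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch) * num_epochs

# Settings from the log above: 87,599 examples, batch size 1, 1 epoch.
steps_logged = total_optimization_steps(87599, 1)
# What-if: a batch size of 16 would need far fewer steps per epoch.
steps_batch16 = total_optimization_steps(87599, 16)
print(steps_logged, steps_batch16)
```

With a checkpoint written every 500 steps, as in the log, a batch size of 1 also means a very large number of checkpoint folders on disk over a full epoch.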

# How to train a QA model from scratch?
- Well, it's a matter of feeding a large corpus of text to a BERT model with randomly initialized (untrained) weights.
    - This is the pre-training phase of the BERT model; the pretrained model is then fine-tuned on QA data as shown above.
    - Check out this [link on how to pretrain your very own model!](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/zh/latest/nlp/bert_pretraining.html)