## Day 2 (Topic 9.1): BERT-based QnA Model
---


The Hugging Face transformers package is a widely used Python library that provides pretrained models for a variety of natural language processing (NLP) tasks.

Note: https://huggingface.co/models hosts a selection of pretrained models that can be used to quickly build prediction models for various NLP tasks. This demo uses the deepset/bert-base-cased-squad2 model.

---

## Step 2: Import the BertForQuestionAnswering class and use it to define a QnA prediction model based on your selected pretrained BERT model from Hugging Face.

In [1]:
from transformers import BertForQuestionAnswering
model = BertForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2',force_download=True, resume_download=False)


---

## Step 3: Import the BertTokenizer class and use it to construct the tokenizer from the same BERT model. The tokenizer prepares string inputs for the prediction model by splitting them into sub-word tokens and converting those tokens into transformer-readable token IDs.

In [3]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('deepset/bert-base-cased-squad2')


Categories of special tokens used in the tokenization process and their corresponding token IDs:

| Token | Meaning | Token ID |
| --- | --- | --- |
| **[PAD]** | Padding token; lets sequences of different lengths be padded to the same length (up to 512 tokens for BERT) | 0 |
| **[UNK]** | Used when a token is not in BERT's vocabulary | 100 |
| **[CLS]** | Appears at the start of every sequence | 101 |
| **[SEP]** | Separator; marks the boundary between the question and the context, and appears at the end of every sequence | 102 |
| **[MASK]** | Used when masking tokens, for example in training with masked language modelling (MLM) | 103 |
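
To make the roles of these tokens concrete, here is a minimal sketch of how a question/context pair is typically laid out for a BERT QA model: `[CLS] question [SEP] context [SEP]`, followed by padding. The function `pack_pair` and the token IDs in the 2000s and 3000s are hypothetical placeholders, not output of the real tokenizer; only the special-token IDs come from the table above.

```python
# Special-token IDs taken from the table above.
PAD, UNK, CLS, SEP, MASK = 0, 100, 101, 102, 103

def pack_pair(question_ids, context_ids, max_length):
    """Lay out a pair as [CLS] question [SEP] context [SEP], then pad to max_length."""
    ids = [CLS] + question_ids + [SEP] + context_ids + [SEP]
    if len(ids) > max_length:
        raise ValueError("pair is longer than max_length; truncate the context first")
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 0 marks padding positions
    return ids + [PAD] * n_pad, attention_mask

# Hypothetical token IDs standing in for a tokenized question and context.
question_ids = [2000, 2001]
context_ids = [3000, 3001, 3002]

input_ids, attention_mask = pack_pair(question_ids, context_ids, max_length=10)
print(input_ids)       # [101, 2000, 2001, 102, 3000, 3001, 3002, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

In practice the tokenizer builds this layout itself when it is given two text arguments, so the sketch is only meant to show where each special token ends up.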

In [4]:
tokenizer.encode("Import the bert tokenizer class and use it to construct the tokenizer using the same BERT model.", max_length=512, truncation=True, padding=True)

[101,
 146,
 24729,
 3740,
 1103,
 1129,
 3740,
 22559,
 17260,
 1705,
 1105,
 1329,
 1122,
 1106,
 9417,
 1103,
 22559,
 17260,
 1606,
 1103,
 1269,
 139,
 9637,
 1942,
 2235,
 119,
 102]
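
The list above contains 27 IDs for a sentence of only 17 words and one full stop. Besides the `[CLS]` and `[SEP]` IDs at each end, the extra IDs come from words being split into sub-word pieces. The following is a simplified sketch of greedy longest-match-first WordPiece splitting, assuming a tiny made-up vocabulary (`toy_vocab` and its IDs are hypothetical; the real vocabulary is far larger, and the library's implementation differs in details).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # try a shorter candidate
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary mapping pieces to IDs.
toy_vocab = {"[UNK]": 100, "the": 5, "token": 6, "##izer": 7, "model": 8}

tokens = [p for w in "the tokenizer model".split() for p in wordpiece(w, toy_vocab)]
ids = [toy_vocab[t] for t in tokens]
print(tokens)  # ['the', 'token', '##izer', 'model']
print(ids)     # [5, 6, 7, 8]
```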

---

## Step 4: Import the pipeline wrapper class and use it to construct the pipeline for a specific NLP task, in this case the Q&A task, by passing to it the built model and tokenizer. 

In [5]:
from transformers import pipeline
qna = pipeline('question-answering', model=model, tokenizer=tokenizer)
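
Conceptually, a BERT question-answering head produces two scores per input token: one for the token being the start of the answer and one for it being the end. The pipeline then picks a span from those scores. The sketch below illustrates that selection step in plain Python, assuming made-up logits; `best_span` and `max_answer_len` are illustrative names, and the library's own post-processing handles more cases (for example, excluding question tokens).

```python
import math

def softmax(xs):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) for the best span with start <= end."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        # Only consider ends at or after the start, within the length limit.
        for j in range(i, min(i + max_answer_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Hypothetical logits for a 6-token input.
start_logits = [0.1, 0.2, 4.0, 0.3, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.5, 3.5, 0.2, 0.1]

start, end, score = best_span(start_logits, end_logits)
print(start, end, round(score, 3))
```

Here the selected span runs from token 2 to token 3, and the score is the product of the two probabilities, which is why pipeline scores always fall between 0 and 1.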

---

## Step 5: Upload the document to be used as context for the QnA along with the question that the BERT model will need to answer. The QnA ability of this model revolves around answering questions about a passage of text that it has read.

In [6]:
context = "The Intergovernmental Panel on Climate Change (IPCC) is a scientific intergovernmental body under the auspices of the United Nations, set up at the request of member governments. It was first established in 1988 by two United Nations organizations, the World Meteorological Organization (WMO) and the United Nations Environment Programme (UNEP), and later endorsed by the United Nations General Assembly through Resolution 43/53. Membership of the IPCC is open to all members of the WMO and UNEP. The IPCC produces reports that support the United Nations Framework Convention on Climate Change (UNFCCC), which is the main international treaty on climate change. The ultimate objective of the UNFCCC is to \"stabilize greenhouse gas concentrations in the atmosphere at a level that would prevent dangerous anthropogenic [i.e., human-induced] interference with the climate system\". IPCC reports cover \"the scientific, technical and socio-economic information relevant to understanding the scientific basis of risk of human-induced climate change, its potential impacts and options for adaptation and mitigation.\""
question = "What organization is the IPCC a part of?"
qna({'question': question,'context': context})


{'answer': 'United Nations,',
 'end': 133,
 'score': 0.4881569445133209,
 'start': 118}
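
The `start` and `end` values in this result are character offsets into the context string, so slicing the context with them recovers the answer. A small check, using only the first sentence of the context (the answer lies inside it, so the offsets are unchanged) and the values copied from the output above:

```python
# First sentence of the context used above.
context = ("The Intergovernmental Panel on Climate Change (IPCC) is a scientific "
           "intergovernmental body under the auspices of the United Nations, "
           "set up at the request of member governments.")

# Values copied from the pipeline output above.
result = {'answer': 'United Nations,', 'start': 118, 'end': 133}

span = context[result['start']:result['end']]
print(span)              # United Nations,
print(span.rstrip(','))  # United Nations
```

The model's span includes the trailing comma, so stripping punctuation before displaying the answer may be worthwhile.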

---

## Step 6: Plug the QnA model into a basic user interface and start testing the capability of the QnA model

In [7]:
import textwrap

# Read the context passage and trim surrounding whitespace before use.
context = textwrap.dedent(input("Enter Context Article: ")).strip()
print("Context Article:\n")
print(textwrap.fill(context, width=120))

# Keep answering questions until the user enters '*'.
inquiry = input("\nType your question: ")
while inquiry != '*':
    answer = qna({'question': inquiry, 'context': context})

    print("Answer found: " + answer['answer'])
    print("At Index :", answer['start'], " - ", answer['end'])
    print("with Probability:", answer['score'], "\n")
    inquiry = input("Enter another question (* to stop):")


Enter Context Article: ?
Context Article:

?

Type your question: *
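
For testing without typing questions by hand, the same loop can be written as a function that takes a list of questions. The sketch below is one possible shape, not part of the original demo: `ask_all`, the `min_score` cutoff, and the stand-in `fake_qna` are all hypothetical, and `fake_qna` exists only so the example runs without downloading a model. In real use, the `qna` pipeline built in Step 4 would be passed instead.

```python
def ask_all(qna, context, questions, min_score=0.3):
    """Run each question through a QA callable; drop answers below min_score."""
    results = []
    for q in questions:
        out = qna({'question': q, 'context': context})
        confident = out['score'] >= min_score
        results.append((q, out['answer'] if confident else None, out['score']))
    return results

# Stand-in for the real pipeline, returning fixed, made-up results.
def fake_qna(inputs):
    if 'IPCC' in inputs['question']:
        return {'answer': 'United Nations', 'score': 0.49, 'start': 0, 'end': 14}
    return {'answer': 'set up', 'score': 0.02, 'start': 0, 'end': 6}

rows = ask_all(fake_qna, "placeholder context",
               ["What organization is the IPCC a part of?", "Who won the match?"])
for question, answer, score in rows:
    print(question, "->", answer, f"(score {score:.2f})")
```

A cutoff like this can be useful because the model always returns some span from the context, even when the question has little to do with it.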
