# Assignment 3: Question Answering with a Language Model

**Description:** This assignment covers question answering with a language model. There are many ways to formulate the question answering task, and this is one of them. You will use the masked token with T5 to develop a sentence construct that allows the model to answer the question more than 75% of the time. You should also be able to develop an intuition for:


* Working with masked language models
* Working with prompt-based models
* The depths and limits of knowledge in these large models

 
This notebook can run on your GCP instance or in Google Colab. Generating sentences does not require a GPU to finish in a timely fashion, and by default Colab does not configure a GPU when you open the notebook.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2022-summer-main/blob/master/assignment/a3/QuestionAnswering_test.ipynb)

The overall assignment structure is as follows:

1. Setup
  
  1.1 Libraries & Helper Functions

  1.2 Data Acquisition

  1.3 Training/Test/Validation Sets for BERT-based models

**INSTRUCTIONS:** 

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.





In [1]:
!pip install -q sentencepiece

In [2]:
!pip install -q transformers

In [3]:
from collections import Counter
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [4]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

In [5]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

t5_model.summary()

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (TFSharedEmbeddings)  multiple                 24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  84954240  
                                                                 
 decoder (TFT5MainLayer)     multiple                  113275008 
                                                                 
Total params: 222,903,552
Trainable params: 222,903,552
Non-trainable params: 0
_________________________________________________________________


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


"\<extra_id_0\>" is the special token we can use with T5 to invoke its masked word modeling ability. This means we can construct sentences, like a fill-in-the-blank test, that let us probe the knowledge embedded in the model by its pre-training. Here's an example that works well. After you've run it, try substituting beagle for poodle and you'll see the model gets confused.

Notice too that we are using a beam search approach and accepting the top three choices rather than just the first choice.

In [7]:
PROMPT_SENTENCE = ( "A beagle is a type of <extra_id_0> .")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=9,
                                   no_repeat_ngram_size=1,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=3)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['be', '', 'bird']
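
The generation cell above can be wrapped in a small helper so the same beam-search settings are reused for every prompt. This is a minimal sketch, not part of the assignment: the name `top_answers` is hypothetical, and it assumes `model` and `tokenizer` behave like `t5_model` and `t5_tokenizer` loaded earlier. It also strips surrounding whitespace from each decoded string, which the original cell does not do.

```python
def top_answers(prompt, model, tokenizer, k=3, num_beams=9):
    """Return the k beam-search completions for a masked prompt.

    Hypothetical helper: the generate() arguments mirror the cell above,
    with num_return_sequences exposed as `k`.
    """
    inputs = tokenizer([prompt], return_tensors='tf')
    output_ids = model.generate(inputs['input_ids'],
                                num_beams=num_beams,
                                no_repeat_ngram_size=1,
                                num_return_sequences=k,
                                min_length=1,
                                max_length=3)
    # Decode each returned sequence; strip() is an added convenience
    # so that ' dog' and 'dog' compare equal later on.
    return [tokenizer.decode(ids, skip_special_tokens=True,
                             clean_up_tokenization_spaces=False).strip()
            for ids in output_ids]
```

With the model loaded, a call would look like `top_answers(PROMPT_SENTENCE, t5_model, t5_tokenizer)`.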


**QUESTION:**

1.1 Given the following countries (England, France, Germany, Russia, Egypt, Thailand, Japan, Canada, India, China), construct **two** different PROMPT_SENTENCE values containing the special token so that, in at least 7 of the 10 cases, one of the top three answers is correct.  Use the string COUNTRY to stand in for each element of the list.  For example, "I always wanted to \<extra_id_0\> COUNTRY".

In [47]:
#Use this space to craft your sentence.  You do NOT need to modify the hyperparameters!
PROMPT_SENTENCE = ( "The <extra_id_0> is in China.")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=9,
                                   no_repeat_ngram_size=1,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=3)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['', 'Chinese', 'company']


In [48]:
sent1 = "There is not <extra_id_0> in China."
sent2 = "The <extra_id_0> is in China."
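
The 7-of-10 check from question 1.1 can be scripted instead of being run by hand for each country. The sketch below is an assumption-laden illustration, not the assignment's grading method: `count_hits`, `predict`, and `expected` are hypothetical names. `predict` is any callable that maps a prompt string to a list of candidate answers (for example, a wrapper around the generation cell above), and `expected` is a dictionary you supply that maps each country to the answer you consider correct for your template.

```python
def count_hits(template, expected, predict):
    """Count the countries whose expected answer is among the predictions.

    template: sentence containing the placeholder string COUNTRY
    expected: dict mapping country name -> answer treated as correct
              (supplied by you; the notebook does not define it)
    predict:  callable taking a prompt and returning a list of strings
    """
    hits = 0
    for country, answer in expected.items():
        # Substitute the real country name for the COUNTRY placeholder.
        prompt = template.replace("COUNTRY", country)
        # Compare case-insensitively and ignore surrounding whitespace.
        predictions = [p.strip().lower() for p in predict(prompt)]
        if answer.strip().lower() in predictions:
            hits += 1
    return hits
```

A template meets the bar in the question when `count_hits(...)` returns 7 or more for the ten listed countries.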