<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/RR/Question_Generation_with_T5_Base_Fine_Tuned_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Question Generation with T5 Base fine-tuned model

We load a T5 base sequence-to-sequence model that has been fine-tuned to generate
questions from a (context, answer) pair.

The model was trained on the SQuAD dataset with a maximum input length of 512 tokens.

In [4]:
!pip install -q transformers

In [5]:
!pip install -q sentencepiece

In [6]:
import os
import numpy as np
import pandas as pd

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [7]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
# Some important file locations and constants

dataset_root = "/content/drive/MyDrive/w266 NLP Final Project/Data/"
dataset_name = "squad"
dataset_folder = dataset_root+dataset_name+".hf"
validation_file = dataset_folder + '/valid_pairs.csv'

source_model_name = "google/t5-v1_1-base"
model_path = "/content/drive/MyDrive/w266 NLP Final Project/Models/T5_base_pt_squad/"

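The model was not trained on the SQuAD validation set, so these samples are unseen data. We have already formatted them so they can be fed directly to the model: the `orig` column holds the input prompt and the `target` column holds the reference question.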
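The prompt layout can be read off the `orig` column shown in this notebook: a fixed prefix, the answer, a ` context: ` marker, then the passage. The sketch below builds such a prompt and splits it back apart. The names `build_prompt` and `parse_prompt` are hypothetical helpers written for illustration and are not part of the project code; the example passage is a shortened string, and the layout is inferred from the sample rows rather than from the training script.

```python
from typing import Tuple

# Assumed prompt layout, inferred from the `orig` column of the validation file.
PREFIX = "generate question: answer: "
CONTEXT_MARKER = " context: "


def build_prompt(answer: str, context: str) -> str:
    """Combine an answer and its context into a single model input string."""
    return f"{PREFIX}{answer.strip()}{CONTEXT_MARKER}{context.strip()}"


def parse_prompt(prompt: str) -> Tuple[str, str]:
    """Recover (answer, context) from a prompt built by build_prompt."""
    if not prompt.startswith(PREFIX):
        raise ValueError("prompt does not start with the expected prefix")
    # Split on the first marker only; assumes the answer never contains the marker.
    answer, sep, context = prompt[len(PREFIX):].partition(CONTEXT_MARKER)
    if not sep:
        raise ValueError("prompt has no context marker")
    return answer, context


# Shortened, illustrative passage
prompt = build_prompt("1835", "The Grainger Market replaced an earlier market")
print(prompt)
print(parse_prompt(prompt))
```

Splitting on the first marker only keeps a passage intact even if it happens to contain the text `context: ` itself.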

In [9]:
validation_df = pd.read_csv(validation_file)
validation_df

Unnamed: 0.1,Unnamed: 0,orig,target
0,0,generate question: answer: four context: Princ...,How many levels of galleries do the façades su...
1,1,generate question: answer: ink context: When s...,What are the secretions commonly called?
2,2,generate question: answer: 1835 context: The G...,When did Newcastle's first indoor market open?
3,3,generate question: answer: Bills context: Bill...,What may be presented to Parliament in various...
4,4,generate question: answer: the Timucua context...,"Prior to the arrival of the French, the area n..."
...,...,...,...
10565,10565,generate question: answer: wireless context: O...,What sort of power transmission did Tesla show...
10566,10566,generate question: answer: TFEU article 294 co...,Which TFEU article defines the ordinary legisl...
10567,10567,generate question: answer: 45 million people c...,What was the population of Kenya in 2014?
10568,10568,generate question: answer: spring of 1349 cont...,When did the y. pestis reach England?


In [10]:
# Download the tokenizer, load the fine-tuned model from Drive, and move the model to the GPU

t5_tokenizer = T5Tokenizer.from_pretrained(source_model_name)
t5_model = T5ForConditionalGeneration.from_pretrained(model_path)
t5_model.to(torch.device('cuda:0'))

max_length = 512

In [11]:
sample_count = 25
df = validation_df[0:sample_count].copy()

In [14]:
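# Generate one question per validation prompt
predictions = []
for input_text in df['orig']:
    # Tokenize the prompt, truncating it to the model's maximum input length
    inputs = t5_tokenizer(input_text, return_tensors='pt', max_length=max_length, truncation=True)
    # Move the input IDs to the model's device and generate with the default settings
    output_ids = t5_model.generate(inputs['input_ids'].to(t5_model.device))
    # generate() returns one sequence per input; decode it back to text
    prediction = "".join([t5_tokenizer.decode(out_ids, skip_special_tokens=True,
                                              clean_up_tokenization_spaces=False) for out_ids in output_ids])
    predictions.append(prediction)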
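The loop above sends one prompt at a time to the model. For larger sample counts, a batched variant might be faster. The following is a minimal sketch, not the method used in this notebook: `chunks` and `generate_in_batches` are hypothetical helper names, and the batch size of 8 is an arbitrary assumption. It relies on the tokenizer accepting a list of strings with `padding=True`, and on passing the resulting attention mask to `generate` so that padding tokens are ignored.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_in_batches(texts, tokenizer, model, batch_size=8, max_length=512):
    """Generate one output string per input text, several texts per forward pass."""
    predictions = []
    for batch in chunks(list(texts), batch_size):
        # Pad to the longest prompt in the batch; the attention mask marks the padding.
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length)
        # Move input IDs and attention mask to wherever the model lives.
        inputs = inputs.to(model.device)
        output_ids = model.generate(**inputs)
        predictions.extend(
            tokenizer.batch_decode(output_ids, skip_special_tokens=True,
                                   clean_up_tokenization_spaces=False)
        )
    return predictions
```

With the objects defined in this notebook, a call would look like `generate_in_batches(df['orig'], t5_tokenizer, t5_model)`. The outputs should match the one-at-a-time loop in most cases, though padding can introduce small numerical differences.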



In [15]:
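# Each prompt has the form "generate question: answer: <answer> context: <context>"
prompt_prefix = 'generate question: answer: '
# Split on the first 'context: ' only, so a context containing the marker stays intact
parts = [text.split('context: ', 1) for text in df['orig']]
df['context'] = [p[1] for p in parts]
df['answer'] = [p[0][len(prompt_prefix):].strip() for p in parts]
df['prediction'] = predictions
df = df[['context', 'answer', 'target', 'prediction']]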

In [16]:
df

Unnamed: 0,context,answer,target,prediction
0,Prince Albert appears within the main arch abo...,four,How many levels of galleries do the façades su...,What levels of galleries are in the galleries?
1,"When some species, including Bathyctena chuni,...",ink,What are the secretions commonly called?,What is the name of the secretions produced by...
2,The Grainger Market replaced an earlier market...,1835,When did Newcastle's first indoor market open?,What year was the Grainger Market opened?
3,Bills can be introduced to Parliament in a num...,Bills,What may be presented to Parliament in various...,What can be introduced to Parliament in a numb...
4,Jacksonville is in the First Coast region of n...,the Timucua,"Prior to the arrival of the French, the area n...",What people lived in the area?
5,"In addition to the Riemann hypothesis, many mo...",1912,When did Landau propose his four conjectural p...,What year was Landau's problem solved?
6,"In Marxian analysis, capitalist firms increasi...",stagnant,What type of wages does mechanization and auto...,What is the situation for the working class?
7,The final major evolution of the steam engine ...,90,What percentage of electrical power in the Uni...,What percentage of the electric power is produ...
8,"In 1968, ABC took advantage of new FCC ownersh...",1985,When was the ABC Pictures division eventually ...,What year was ABC Motion Pictures dissolved?
9,The 2007 Lisbon Treaty explicitly recognised f...,the Charter of Fundamental Rights of the Euro...,What charter has become an important aspect of...,What document has become an integral part of E...
