<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/BB/bb_t5_simple_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Transformers Attempt

### Imports and installs

In [1]:
import numpy as np
import pandas as pd

import json

# Make longer output readable without scrolling
from pprint import pprint

# Stop warning messages from showing
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [7]:
#!pip install -q sentencepiece

In [8]:
#!pip install -q transformers

In [8]:
!pip install -q datasets

In [10]:
#!pip install -q evaluate
import evaluate

In [2]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
# Import only the loaders this notebook actually uses
from datasets import load_dataset_builder, load_from_disk

In [5]:
def summarize_dataset(dataset, config=None):
    """Print the description and feature schema of a Hub dataset."""
    builder = load_dataset_builder(dataset, config)
    print(f"Description:\n{builder.info.description}")
    print("Features:")
    pprint(builder.info.features)

### Prep data

In [10]:
data_squad = load_from_disk("/content/drive/MyDrive/w266 NLP Final Project/Data/squad.hf")

In [11]:
data_squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

### Set up dataframes

In [14]:
# Create dataframe for train data:
# shuffle with a fixed seed, then keep the first `count` rows as a random sample
count = 100
data_shuffle = data_squad['train'].shuffle(seed=1962).select(range(count))
train_df = pd.DataFrame({
    "target_text": data_shuffle['question'],  # the question the model should generate
    "input_text": data_shuffle['context'],    # the passage the question is about
})
# train_df['answer'] = [answer['text'][0] for answer in data_shuffle['answers']]
train_df['prefix'] = 'ask_question'  # task prefix column used by Simple Transformers T5

train_df.head()



| | target_text | input_text | prefix |
|---|---|---|---|
| 0 | What type of businesses did Nickles want to at... | Prior to moving its headquarters to Chicago, a... | ask_question |
| 1 | To whom did Chopin reveal in letters which par... | Four boarders at his parents' apartments becam... | ask_question |
| 2 | If a species may be harmed, who holds final sa... | The question to be answered is whether a liste... | ask_question |
| 3 | What country has the dog as part of its 12 ani... | In Asian countries such as China, Korea, and J... | ask_question |
| 4 | How long did his episcopate last? | Saint Athanasius of Alexandria (/ˌæθəˈneɪʃəs/;... | ask_question |


In [15]:
# Create dataframe for validation data:
# shuffle with a fixed seed, then keep the first `count` rows as a random sample
count = 10
data_shuffle = data_squad['validation'].shuffle(seed=1962).select(range(count))
eval_df = pd.DataFrame({
    "target_text": data_shuffle['question'],  # the question the model should generate
    "input_text": data_shuffle['context'],    # the passage the question is about
})
# eval_df['answer'] = [answer['text'][0] for answer in data_shuffle['answers']]
eval_df['prefix'] = 'ask_question'  # task prefix column used by Simple Transformers T5

eval_df.head()



| | target_text | input_text | prefix |
|---|---|---|---|
| 0 | How many levels of galleries do the façades su... | Prince Albert appears within the main arch abo... | ask_question |
| 1 | What are the secretions commonly called? | When some species, including Bathyctena chuni,... | ask_question |
| 2 | When did Newcastle's first indoor market open? | The Grainger Market replaced an earlier market... | ask_question |
| 3 | What may be presented to Parliament in various... | Bills can be introduced to Parliament in a num... | ask_question |
| 4 | Prior to the arrival of the French, the area n... | Jacksonville is in the First Coast region of n... | ask_question |
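
Both dataframes use the three columns the Simple Transformers T5 docs ask for: `prefix`, `input_text`, and `target_text`. As far as we understand the library, the prefix and the input are joined as `prefix: input_text` before tokenizing. The sketch below only illustrates that assumed joining rule on one row. The helper `to_t5_input` is hypothetical and is not part of Simple Transformers, and the passage is shortened to the part visible in the preview above.

```python
def to_t5_input(prefix, input_text):
    """Join a task prefix and a passage as 'prefix: text' (assumed T5 input format)."""
    return f"{prefix}: {input_text}"


# One row shaped like train_df / eval_df (passage shortened for illustration)
row = {
    "prefix": "ask_question",
    "input_text": "The Grainger Market replaced an earlier market",
    "target_text": "When did Newcastle's first indoor market open?",
}

model_input = to_t5_input(row["prefix"], row["input_text"])
print(model_input)
```

If that assumption holds, strings passed to `model.predict([...])` would need to carry the same `ask_question: ` prefix, since prediction takes plain strings rather than a dataframe.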


How do we save a transformer model?
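
One possible answer, offered as a sketch rather than something tested in this notebook: Simple Transformers writes the trained model to its output directory (`outputs/` unless `output_dir` is set in the model args), and the wrapped Hugging Face model and tokenizer both expose `save_pretrained`. The helper name `save_t5` and any target directory are hypothetical. The code assumes the wrapper object exposes `.model` and `.tokenizer` attributes.

```python
from pathlib import Path


def save_t5(st_model, out_dir):
    """Save the model and tokenizer held by a Simple Transformers wrapper to out_dir.

    Assumes st_model.model and st_model.tokenizer both provide save_pretrained(),
    as Hugging Face models and tokenizers do.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    st_model.model.save_pretrained(str(out_path))      # weights and config
    st_model.tokenizer.save_pretrained(str(out_path))  # vocabulary / sentencepiece files
    return out_path
```

After training, a call such as `save_t5(model, "<some folder on the mounted Drive>")` would keep the files across Colab sessions, and `T5Model("t5", "<that folder>")` should load them back, since the second argument can be a local directory instead of a model name.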

## Simple Transformers

Sources: 

https://towardsdatascience.com/asking-the-right-questions-training-a-t5-transformer-model-on-a-new-task-691ebba2d72c

https://simpletransformers.ai/docs/t5-specifics/

https://github.com/ThilinaRajapakse/simpletransformers

### Install Requirements

In [None]:
!pip install -q simpletransformers

In [17]:
# !pip uninstall torch torchvision -y   # uncomment this line if having issues with cuda
!pip install torch torchvision



In [None]:
# !git clone https://github.com/NVIDIA/apex
# %cd apex
# !pip install -v --no-cache-dir ./

In [18]:
from simpletransformers.t5 import T5Model

In [20]:
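model_args = {
    "reprocess_input_data": True,             # re-tokenize instead of reusing cached features
    "overwrite_output_dir": True,             # allow writing into an existing output directory
    "max_seq_length": 128,                    # inputs are truncated to at most 128 tokens
    "train_batch_size": 32,
    "num_train_epochs": 1,
    "save_eval_checkpoints": True,            # save a checkpoint at each evaluation
    "save_steps": -1,                         # disable step-based checkpoints
    "use_multiprocessing": False,
    "evaluate_during_training": True,         # requires eval_data in train_model()
    "evaluate_during_training_steps": 15000,  # far more steps than this small run takes
    "evaluate_during_training_verbose": True,
    "fp16": False,                            # train in full precision

    "wandb_project": "Question Generation with T5",  # log the run to Weights & Biases
}

# Load the pretrained t5-base checkpoint
model = T5Model("t5", "t5-base", args=model_args)

# Fine-tune on train_df and evaluate on eval_df
model.train_model(train_df, eval_data=eval_df)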


(4,
 {'global_step': [4],
  'eval_loss': [2.6202396154403687],
  'train_loss': [3.0853590965270996]})
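
The returned tuple reports a `global_step` of 4 and a single `eval_loss` entry. A quick arithmetic check, assuming one optimizer step per batch (the model args above do not set gradient accumulation), shows where the 4 comes from and why the step-based evaluation interval of 15000 is never reached on this small sample:

```python
import math

train_examples = 100     # `count` used for the training sample
train_batch_size = 32    # from model_args
num_train_epochs = 1     # from model_args
eval_every = 15000       # evaluate_during_training_steps from model_args

# The last batch is partial (100 = 3 * 32 + 4), so round up
steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 4 4

# How many times the step counter hits a multiple of eval_every
step_based_evals = total_steps // eval_every
print(step_based_evals)  # 0
```

This suggests the one `eval_loss` value comes from an end-of-epoch evaluation rather than from the step interval. To get intermediate evaluations on a sample this small, `evaluate_during_training_steps` would have to be lower than 4.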