# Data Augmentation using T5

## Install Dependencies

In [2]:
!pip install simpletransformers


In [1]:
!pip install torch



In [28]:
import pandas as pd
from simpletransformers.t5 import T5Model
from pprint import pprint


## Upload PAWS Dataset

In [None]:
from google.colab import files
files.upload()


## Prepare Dataset for training

In [5]:
df = pd.read_csv('train.tsv',sep='\t')
df.head(5)

|   | id | sentence1 | sentence2 | label |
|---|----|-----------|-----------|-------|
| 0 | 1 | In Paris , in October 1560 , he secretly met t... | In October 1560 , he secretly met with the Eng... | 0 |
| 1 | 2 | The NBA season of 1975 -- 76 was the 30th seas... | The 1975 -- 76 season of the National Basketba... | 1 |
| 2 | 3 | There are also specific discussions , public p... | There are also public discussions , profile sp... | 0 |
| 3 | 4 | When comparable rates of flow can be maintaine... | The results are high when comparable flow rate... | 1 |
| 4 | 5 | It is the seat of Zerendi District in Akmola R... | It is the seat of the district of Zerendi in A... | 1 |


In [6]:
df.describe()

|       | id | label |
|-------|----|-------|
| count | 49401.0 | 49401.0 |
| mean | 24701.0 | 0.441874 |
| std | 14260.984661 | 0.496615 |
| min | 1.0 | 0.0 |
| 25% | 12351.0 | 0.0 |
| 50% | 24701.0 | 0.0 |
| 75% | 37051.0 | 1.0 |
| max | 49401.0 | 1.0 |


In [22]:
paraphrase_train = df[df['label']==1]
paraphrase_train.head(5)

|   | id | sentence1 | sentence2 | label |
|---|----|-----------|-----------|-------|
| 1 | 2 | They were there to enjoy us and they were ther... | They were there for us to enjoy and they were ... | 1 |
| 2 | 3 | After the end of the war in June 1902 , Higgin... | In August , after the end of the war in June 1... | 1 |
| 3 | 4 | From the merger of the Four Rivers Council and... | Shawnee Trails Council was formed from the mer... | 1 |
| 4 | 5 | The group toured extensively and became famous... | The group toured extensively and was famous in... | 1 |
| 5 | 6 | Kathy and her husband Pete Beale ( Peter Dean ... | Kathy and her husband Peter Dean ( Pete Beale ... | 1 |


In [23]:
paraphrase_train.describe()

|       | id | label |
|-------|----|-------|
| count | 3539.0 | 3539.0 |
| mean | 4005.63323 | 1.0 |
| std | 2309.309114 | 0.0 |
| min | 2.0 | 1.0 |
| 25% | 2037.0 | 1.0 |
| 50% | 3985.0 | 1.0 |
| 75% | 5999.5 | 1.0 |
| max | 7999.0 | 1.0 |


In [29]:
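# rename() returns a new DataFrame; assign() then adds the prefix without writing into a slice of df
paraphrase_train = paraphrase_train.rename(columns={"sentence1": "input_text", "sentence2": "target_text"})
paraphrase_train = paraphrase_train.assign(prefix="Generate Paraphrase for this line")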

In [30]:
df = pd.read_csv('dev.tsv', sep='\t')
# .copy() makes the filtered frame independent of df, which avoids pandas' SettingWithCopyWarning
paraphrase_dev = df[df['label'] == 1].copy()
paraphrase_dev["prefix"] = "Generate Paraphrase for this line"
paraphrase_dev = paraphrase_dev.rename(columns={"sentence1": "input_text", "sentence2": "target_text"})
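The training frames keep the task prefix and the sentence in separate columns (`prefix`, `input_text`), while the prediction prompt in this notebook is a single string. The sketch below assumes the two are joined with `: `, which matches the prompt passed to `model.predict` in this notebook; `build_model_input` is a hypothetical helper for illustration, not part of Simple Transformers.

```python
PREFIX = "Generate Paraphrase for this line"


def build_model_input(prefix, input_text):
    """Join a task prefix and a sentence into one prompt string.

    Assumption: the separator is ": ", as in the prediction prompt
    used in this notebook.
    """
    return f"{prefix}: {input_text}"


prompt = build_model_input(PREFIX, "the last NBA season was amazing")
print(prompt)
```

Building the prompt from the same constant used for the `prefix` column keeps training and prediction inputs consistent.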


## Fine-tune T5 for paraphrase generation

In [17]:
model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 128,
    "train_batch_size": 16,
    "num_train_epochs": 10,
    "num_beams": None,
    "do_sample": True,
    "max_length": 20,
    "top_k": 50,
    "top_p": 0.95,
    "use_multiprocessing": False,
    "save_steps": -1,
    "save_eval_checkpoints": True,
    "evaluate_during_training": True,
    "evaluate_during_training_verbose": True,
    "num_return_sequences": 5
}
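The `do_sample`, `top_k`, and `top_p` settings above control how the next token is sampled during generation. As a rough illustration of what the two filters do, here is a minimal pure-Python sketch over a toy probability table. It is a simplified model of the idea, not the Hugging Face implementation, and the function name and toy vocabulary are made up for this example.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return the renormalized distribution left after top-k, then top-p filtering."""
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # Top-k: keep only the k most probable tokens (0 disables this filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (hypothetical values).
toy = {"amazing": 0.5, "incredible": 0.3, "impressive": 0.15, "purple": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.8)
print(filtered)
```

With `top_k=3` the unlikely token `purple` is cut first, and `top_p=0.8` then trims the tail further, so sampling only ever picks among the most plausible continuations.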

In [31]:
model = T5Model("t5","t5-small", args=model_args)
model.train_model(paraphrase_train, eval_data=paraphrase_dev)

(2220,
 {'eval_loss': [0.4193559979208436,
   0.2775603665152468,
   0.21003003053484867,
   0.16024893183370476,
   0.12563965706739147,
   0.10886048804758365,
   0.09206325020094488,
   0.0776768736204902,
   0.06836123866266644,
   0.06781832502553614,
   0.0548245133558797],
  'global_step': [222,
   444,
   666,
   888,
   1110,
   1332,
   1554,
   1776,
   1998,
   2000,
   2220],
  'train_loss': [0.46710172295570374,
   0.7751969695091248,
   0.6343348622322083,
   0.12746398150920868,
   0.15671375393867493,
   0.5032305121421814,
   0.0991746038198471,
   0.06581386923789978,
   0.19765092432498932,
   0.10003523528575897,
   0.13117152452468872]})
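The training details returned above pair every evaluation loss with the global step at which it was measured. Note that the lists hold eleven entries for ten epochs, because one evaluation (step 2000) falls inside an epoch. A small sketch for locating the evaluation with the lowest loss, using the values copied from that output; `best_checkpoint` is a hypothetical helper name.

```python
# Values copied from the training output in this notebook.
eval_loss = [
    0.4193559979208436, 0.2775603665152468, 0.21003003053484867,
    0.16024893183370476, 0.12563965706739147, 0.10886048804758365,
    0.09206325020094488, 0.0776768736204902, 0.06836123866266644,
    0.06781832502553614, 0.0548245133558797,
]
global_step = [222, 444, 666, 888, 1110, 1332, 1554, 1776, 1998, 2000, 2220]


def best_checkpoint(losses, steps):
    """Return (step, loss) for the evaluation with the lowest loss."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return steps[best_index], losses[best_index]


best_step, best_loss = best_checkpoint(eval_loss, global_step)
print(f"lowest eval_loss {best_loss:.4f} at global step {best_step}")
```

Here the evaluation loss is still falling at the final step, so the last evaluation is also the best one.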

## Save best model to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!cp -r /content/outputs/best_model/ /content/gdrive/'My Drive'/T5/

In [None]:
import torch
model_save_name = "TT5"
path = f"/content/gdrive/MyDrive/{model_save_name}"
# T5Model is a wrapper; the underlying PyTorch module is exposed as model.model
torch.save(model.model.state_dict(), path)

## Generate paraphrases



In [None]:
# T5Model expects the model type first, then the model name or path
pre_trained_model = T5Model("t5", "/content/outputs/best_model", args=model_args)

In [27]:
pred = model.predict(["Generate Paraphrase for this line: the last NBA season was amazing"])
pprint(pred)

[['The last season in the NBA was amazing.',
  'The last NBA season was amazing.',
  'The last NBA season was amazing.',
  'The last NBA season was impressive.',
  'The last NBA season was incredible.']]
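For data augmentation, candidates that merely repeat the input or each other add nothing new. The sketch below filters the five outputs shown above down to the distinct ones; the normalization rule (lowercase, collapsed whitespace, no trailing punctuation) and both function names are assumptions made for this example.

```python
def normalize(sentence):
    """Lowercase, collapse whitespace, and drop trailing punctuation for comparison."""
    return " ".join(sentence.lower().split()).rstrip(".!?")


def usable_paraphrases(source, candidates):
    """Keep candidates that differ from the source and from each other, in order."""
    seen = {normalize(source)}
    kept = []
    for candidate in candidates:
        key = normalize(candidate)
        if key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


source = "the last NBA season was amazing"
# Generated outputs copied from the prediction result in this notebook.
generated = [
    "The last season in the NBA was amazing.",
    "The last NBA season was amazing.",
    "The last NBA season was amazing.",
    "The last NBA season was impressive.",
    "The last NBA season was incredible.",
]
augmented = usable_paraphrases(source, generated)
print(augmented)
```

Three of the five sampled outputs survive; the other two only restate the input with different casing and punctuation.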


In [None]:
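import torch

torch.manual_seed(0)

# Encode the prompt with the fine-tuned model's own tokenizer
prompt = "Generate Paraphrase for this line: the last NBA season was amazing"
input_ids = model.tokenizer.encode(prompt, return_tensors="pt").to(model.device)

# activate sampling and deactivate top_k by setting top_k to 0
sample_output = model.model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(model.tokenizer.decode(sample_output[0], skip_special_tokens=True))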

## References

*   [T5: Text-To-Text Transfer Transformer](https://arxiv.org/abs/1910.10683)
*   [PAWS: Paraphrase Adversaries from Word Scrambling](https://github.com/google-research-datasets/paws)
*   [PAWS wiki labeled dataset](https://storage.googleapis.com/paws/english/paws_wiki_labeled_final.tar.gz)
*   [Simple Transformers - T5 Model](https://simpletransformers.ai/docs/t5-model/)
*   [Top-k and Top-p sampling](https://huggingface.co/blog/how-to-generate)
*   [Hugging Face T5 models](https://huggingface.co/models?search=t5)
