<a href="https://colab.research.google.com/github/honzas83/t5s/blob/main/examples/t5s_dstc11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dialogue state tracking on DSTC11 using the t5s library
## Install the t5s library and its dependencies

In [1]:
%%capture pip_install
!pip install git+https://github.com/honzas83/t5s --upgrade

## Download the DSTC11 corpus

In [2]:
!curl https://storage.googleapis.com/gresearch/dstc11/train.tts-verbatim.2022-07-27.txt -o train.tts-verbatim.2022-07-27.txt
!curl https://storage.googleapis.com/gresearch/dstc11/dev-dstc11.2022-07-27.txt -o dev-dstc11.2022-07-27.txt
!curl https://storage.googleapis.com/gresearch/dstc11/test-dstc11.2022-09-21.txt -o test-dstc11.2022-09-21.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21.2M  100 21.2M    0     0  8071k      0  0:00:02  0:00:02 --:--:-- 8069k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2892k  100 2892k    0     0  1518k      0  0:00:01  0:00:01 --:--:-- 1518k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1427k  100 1427k    0     0  1406k      0  0:00:01  0:00:01 --:--:-- 1407k


## Download the T5 SentencePiece model

This is the standard SentencePiece model provided by Google for its pre-trained T5 models. The `t5-base` model itself is downloaded by the `t5s` library (via the Hugging Face Transformers library). The `gsutil` command copies the vocabulary files from the Google Cloud Storage bucket to the local directory.

In [3]:
!gsutil cp -r gs://t5-data/vocabs/cc_all.32000/ .


Copying gs://t5-data/vocabs/cc_all.32000/sentencepiece.model...
Copying gs://t5-data/vocabs/cc_all.32000/sentencepiece.vocab...
\ [2 files][  1.3 MiB/  1.3 MiB]                                                
Operation completed over 2 objects/1.3 MiB.                                      


In [4]:
import os
from glob import glob
import random

In [5]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.basicConfig()

## Convert the dataset formats


In [6]:
def convert_to_tsv(fn_in, fn_out):
    """Write matching user utterances and their states as input<TAB>output rows."""
    n = 0
    with open(fn_out, "w", encoding="utf-8") as fw, \
         open(fn_in, "r", encoding="utf-8") as fr:
        for line in fr:
            line = line.strip()

            # Keep only lines that hold a user utterance with turn_id 1
            if "user:" in line and "turn_id: 1 " in line:
                try:
                    # Text after "user:" is the input, text after "state:" is the target
                    text, state = line.split("user:", 1)[1].split("state:", 1)
                except ValueError:
                    print("Invalid line in file", fn_in, ":", line)
                    continue
                print(text.strip(), state.strip(), sep="\t", file=fw)
                n += 1
    print("Written", n, "lines")

In [7]:
convert_to_tsv("train.tts-verbatim.2022-07-27.txt", "train.tsv")
convert_to_tsv("dev-dstc11.2022-07-27.txt", "dev.tsv")

Written 8434 lines
Written 1000 lines
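
To make the string handling in `convert_to_tsv` concrete, the sketch below applies the same filter and the same two splits to a single line. The raw line is hypothetical: its field layout (`line_nr`, `dialog_id`, `text`) is an assumption for illustration and is not copied from the corpus. Only the `turn_id: 1 `, `user:` and `state:` markers come from the function above.

```python
# Hypothetical raw line: the field layout is an assumption, not corpus data.
raw = ("line_nr: 1 dialog_id: demo turn_id: 1 text: "
       "user: i need a train to floyd. state: train-destination=floyd")

# Same filter as convert_to_tsv. The trailing space in "turn_id: 1 "
# keeps turn ids such as 10 or 11 from matching.
assert "user:" in raw and "turn_id: 1 " in raw

# Everything after the first "user:" is split once more at the first "state:"
text, state = raw.split("user:", 1)[1].split("state:", 1)
row = "\t".join([text.strip(), state.strip()])
print(repr(row))
```

The resulting row has the same two-column shape as the lines written to `train.tsv` and `dev.tsv`.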


## t5s configuration

The configuration consists of different sections:

### `tokenizer`

*   `spm` - the name of the SentencePiece model

### `t5_model`

* `pre_trained` - the name of the pre-trained model to load for fine-tuning,
* `save_checkpoint` - save fine-tuned checkpoints under this name,
* `save_checkpoint_every` - integer, which specifies how often the checkpoints are saved, e.g. the value 1 means save every epoch.

### `dataset`

* `*_tsv` - names of TSV files used as training, development and test sets,
* `loader` - specifies how the training data is loaded
  * `loader.input_size` - maximum number of input tokens in the batch
  * `loader.output_size` - maximum number of output tokens in the batch
  * `loader.min_batch_size` - minimum number of examples in the batch. Together with `input_size` and `output_size`, it determines the maximum length of an input and an output sequence (`input_size//min_batch_size` and `output_size//min_batch_size`).
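
As a quick check of the two expressions above, the following sketch evaluates them for the `loader` values used in this notebook's configuration. The variable names are illustrative and are not part of the `t5s` API.

```python
# Loader values taken from the configuration used in this notebook
loader = {"input_size": 3072, "output_size": 256, "min_batch_size": 4}

# Longest sequences that still allow min_batch_size examples per batch
max_input_len = loader["input_size"] // loader["min_batch_size"]
max_output_len = loader["output_size"] // loader["min_batch_size"]

print(max_input_len, max_output_len)  # 768 64
```

These limits equal the `max_input_length` and `max_output_length` values in the `predict` section of the configuration.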

### `training`

* `shared_trainable` - boolean, if `True`, the parameters of shared embedding layer are trained,
* `encoder_trainable` - boolean, if `True`, the parameters of the encoder are trained,
* `n_epochs` - number of training epochs,
* `initial_epoch` - number of training epochs already performed, the next epoch will be `initial_epoch+1`,
* `steps_per_epoch` - the length of each epoch in steps; if omitted, an epoch is one pass over the training TSV,
* `learning_rate` - initial learning rate for `epoch=1`,
* `learning_rate_schedule` - boolean, if `True`, the sqrt learning rate schedule is used, i.e. the learning rate decays with the inverse square root of the epoch number.
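
The sqrt schedule can be sketched as follows. The formula `initial_lr / sqrt(epoch)` is inferred from the learning rates printed in the training log of this notebook (0.001 for epoch 1, about 0.000707 for epoch 2, 0.0005 for epoch 4). `sqrt_schedule` is a hypothetical helper written for illustration, not a function of the `t5s` library.

```python
import math

def sqrt_schedule(epoch, initial_lr=0.001):
    """Learning rate for a 1-based epoch number: initial_lr / sqrt(epoch)."""
    return initial_lr / math.sqrt(epoch)

# Print the rate for the ten epochs configured in this notebook
for epoch in range(1, 11):
    print(epoch, sqrt_schedule(epoch))
```

With this rule the rate halves by epoch 4 and falls to a third by epoch 9, so most of the decay happens in the first few epochs.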

In [8]:
config = {
    "tokenizer": {
        "spm": "cc_all.32000/sentencepiece.model",
    },
    "t5_model": {
        "pre_trained": "t5-base",
        "save_checkpoint": "T5_DSTC11",
        "save_checkpoint_every": 1,
    },
    "dataset": {
        "train_tsv": "train.tsv",
        "devel_tsv": "dev.tsv",
        "loader": {
            "input_size": 3072,
            "output_size": 256,
            "min_batch_size": 4,
        },
    },
    "training": {
        "shared_trainable": False,
        "encoder_trainable": True,
        "n_epochs": 10,
        "initial_epoch": 0,
        "steps_per_epoch": 200,
        "learning_rate": 0.001,
        "learning_rate_schedule": True,
    },
    "predict": {
        "batch_size": 50,
        "max_input_length": 768,
        "max_output_length": 64,
    }
}

### Import the t5s library

In [9]:
from t5s import T5

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


### Instantiate the T5 class and fine-tune it

In [10]:
t5 = T5(config)

In [11]:
t5.fine_tune()

INFO:t5s.T5:Loaded tokenizer from: cc_all.32000/sentencepiece.model
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
INFO:t5s.T5:Loading model from t5-base
All PyTorch model weights were used when initializing T5Training.

Some weights or buffers of the TF 2.0 model T5Training were not initialized from the PyTorch model and are newly initialized: ['total', 'count']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:t5s.T5:Trained model will be saved into T5_DSTC11
INFO:t5s.T5:Training 


Epoch 1: LearningRateScheduler setting learning rate to 0.001.
Epoch 1/10

INFO:t5s.CheckpointSaver:Consumed 4364 training examples
INFO:t5s.CheckpointSaver:Saving checkpoint to T5_DSTC11




Epoch 2: LearningRateScheduler setting learning rate to 0.0007071067811865475.
Epoch 2/10

INFO:t5s.CheckpointSaver:Consumed 8522 training examples
INFO:t5s.CheckpointSaver:Saving checkpoint to T5_DSTC11




Epoch 3: LearningRateScheduler setting learning rate to 0.0005773502691896258.
Epoch 3/10

INFO:t5s.CheckpointSaver:Consumed 12656 training examples
INFO:t5s.CheckpointSaver:Saving checkpoint to T5_DSTC11




Epoch 4: LearningRateScheduler setting learning rate to 0.0005.
Epoch 4/10

INFO:t5s.CheckpointSaver:Consumed 16762 training examples
INFO:t5s.CheckpointSaver:Saving checkpoint to T5_DSTC11




Epoch 5: LearningRateScheduler setting learning rate to 0.0004472135954999579.
Epoch 5/10

INFO:t5s.CheckpointSaver:Consumed 20852 training examples
INFO:t5s.CheckpointSaver:Saving checkpoint to T5_DSTC11




Epoch 6: LearningRateScheduler setting learning rate to 0.0004082482904638631.
Epoch 6/10

INFO:t5s.CheckpointSaver:Consumed 24943 training examples
INFO:t5s.CheckpointSaver:Saving checkpoint to T5_DSTC11




Epoch 7: LearningRateScheduler setting learning rate to 0.0003779644730092272.
Epoch 7/10

INFO:t5s.CheckpointSaver:Consumed 29041 training examples
INFO:t5s.CheckpointSaver:Saving checkpoint to T5_DSTC11




Epoch 8: LearningRateScheduler setting learning rate to 0.00035355339059327376.
Epoch 8/10

INFO:t5s.CheckpointSaver:Consumed 33167 training examples
INFO:t5s.CheckpointSaver:Saving checkpoint to T5_DSTC11




Epoch 9: LearningRateScheduler setting learning rate to 0.0003333333333333333.
Epoch 9/10

INFO:t5s.CheckpointSaver:Consumed 37278 training examples
INFO:t5s.CheckpointSaver:Saving checkpoint to T5_DSTC11




Epoch 10: LearningRateScheduler setting learning rate to 0.00031622776601683794.
Epoch 10/10

INFO:t5s.CheckpointSaver:Consumed 41386 training examples
INFO:t5s.CheckpointSaver:Saving checkpoint to T5_DSTC11





## Predict using the model

To run the fine-tuned model on new inputs, call the `predict()` method with a list of input strings.

In [12]:
!head dev.tsv

i need to book a hotel in the east that has 4 stars.	hotel-area=east; hotel-stars=4
howdy, i need a train heading into floyd.	train-destination=floyd
what can you tell me about the eleven madison park?	restaurant-name=eleven madison park
i am looking for a specific hotel, its name is disney's contemporary resort	hotel-name=disney's contemporary resort
hi i'm looking for lodging in cambridge that includes free wifi and is upscale and expensive	hotel-pricerange=expensive; hotel-internet=yes
can you recommend some fun entertainment in the centre?	attraction-area=centre
i looking for information about a hotel in the moderate price range that includes free wifi.	hotel-pricerange=moderate; hotel-internet=yes
hello, i am trying to find a place to stay that has free wifi and 3 stars. do you have anything like that?	hotel-stars=3; hotel-internet=yes
i'm looking for a italian restaurant centre.	restaurant-food=italian; restaurant-area=centre
i'm looking for a train that departs ripley after 1:42
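
Each target in the listing above is a list of `slot=value` pairs separated by `; `. When comparing predictions with references, it may help to compare parsed states rather than raw strings, so that slot order does not matter. The sketch below assumes that format. `parse_state` is a hypothetical helper, not part of `t5s`, and this is not the official DSTC11 scoring.

```python
def parse_state(state):
    """Turn 'slot=value; slot=value' into a dict; an empty string gives {}."""
    pairs = (item.split("=", 1) for item in state.split(";") if "=" in item)
    return {slot.strip(): value.strip() for slot, value in pairs}

reference = "hotel-pricerange=expensive; hotel-internet=yes"
predicted = "hotel-internet=yes; hotel-pricerange=expensive"

# The raw strings differ, but the parsed states are identical
print(reference == predicted)                            # False
print(parse_state(reference) == parse_state(predicted))  # True
```

A value that differs in spelling, such as `fliyd` instead of `floyd`, still counts as a mismatch under this comparison.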

In [13]:
batch = []
reference = []
with open("dev.tsv", "r", encoding="utf-8") as fr:
    for line in fr:
        # Strip only the newline so that an empty target column is preserved
        text, _, ref = line.rstrip("\n").partition("\t")
        batch.append(text)
        reference.append(ref)
        if len(batch) >= 10:
            break
predictions = t5.predict(batch)
for text, ref, hyp in zip(batch, reference, predictions):
    print(text)
    print("Reference:", ref)
    print("Predicted:", hyp)
    print()

INFO:t5s.T5:Loaded tokenizer from: cc_all.32000/sentencepiece.model


i need to book a hotel in the east that has 4 stars.
Reference: hotel-area=east; hotel-stars=4
Predicted: hotel-area=east; hotel-stars=4

howdy, i need a train heading into floyd.
Reference: train-destination=floyd
Predicted: train-destination=fliyd

what can you tell me about the eleven madison park?
Reference: restaurant-name=eleven madison park
Predicted: attraction-name=11 madison park

i am looking for a specific hotel, its name is disney's contemporary resort
Reference: hotel-name=disney's contemporary resort
Predicted: hotel-name=disneys contemporary resort

hi i'm looking for lodging in cambridge that includes free wifi and is upscale and expensive
Reference: hotel-pricerange=expensive; hotel-internet=yes
Predicted: hotel-pricerange=expensive; hotel-internet=yes

can you recommend some fun entertainment in the centre?
Reference: attraction-area=centre
Predicted: attraction-area=centre

i looking for information about a hotel in the moderate price range that includes free wifi.
