<a href="https://colab.research.google.com/github/honzas83/t5s/blob/main/examples/t5s_aclimdb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis using the t5s library
## Install the t5s library and its dependencies

In [None]:
%%capture pip_install
!pip install git+https://github.com/honzas83/t5s --upgrade

## Download and extract the ACL IMDB corpus

In [None]:
!curl http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz | tar xz

## Download the T5 SentencePiece model

This is the standard SentencePiece model provided by Google for their pre-trained T5 models. The `t5-base` model itself is downloaded by the `t5s` library (via the Hugging Face Transformers library). The `gsutil` command copies the vocabulary files from a Google Cloud Storage bucket to the local directory.

In [None]:
!gsutil cp -r gs://t5-data/vocabs/cc_all.32000/ .

In [None]:
import os
from glob import glob
import random

In [None]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.basicConfig()

## Convert the dataset formats

The ACL IMDB dataset consists of a set of TXT files in `pos` and `neg` directories. We use `glob` to find these files.

In [None]:
def find_data(dn):
    fns = glob(os.path.join(dn, "pos", "*.txt"))+glob(os.path.join(dn, "neg", "*.txt"))
    return fns

def convert_data(fns, out_fn):
    with open(out_fn, "w", encoding="utf-8") as fw:
        for fn in fns:
            if "/pos/" in fn:
                label = "positive"
            elif "/neg/" in fn:
                label = "negative"
            else:
                continue
            with open(fn, "r", encoding="utf-8") as fr:
                text = fr.read().strip()
            if not text:
                continue
            
            text = text.replace("\n", " ").replace("\t", " ")
            print(text, label, sep="\t", file=fw)

We search for all `*.txt` files in the `train` subdirectory, shuffle the filenames, and hold out 2,000 files as the development set. The rest is used as the training data.

The `*.txt` files are converted to tab-separated values (TSV) format using the `convert_data()` function. Each output row contains the review text and its label (`positive` or `negative`).

In [None]:
train_fns = find_data("aclImdb/train")
random.shuffle(train_fns)
dev_fns = train_fns[-2000:]
del train_fns[-2000:]
convert_data(train_fns, "aclImdb.train.tsv")
convert_data(dev_fns, "aclImdb.dev.tsv")
test_fns = find_data("aclImdb/test")
convert_data(test_fns, "aclImdb.test.tsv")
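
As a quick sanity check on the converted files, it can help to count how many rows carry each label. The sketch below is not part of `t5s`; `count_labels()` is a hypothetical helper that assumes the two-column layout written by `convert_data()` (text, then label, separated by a tab). It is demonstrated on a tiny temporary file so that it runs on its own.

```python
import os
import tempfile
from collections import Counter


def count_labels(tsv_path):
    """Count how often each label appears in the second column of a TSV file.

    Hypothetical helper (not part of t5s); assumes rows of the form
    "<text>\t<label>" as written by convert_data().
    """
    counts = Counter()
    with open(tsv_path, "r", encoding="utf-8") as fr:
        for line in fr:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 2:  # skip malformed rows
                counts[fields[1]] += 1
    return counts


# Demonstration on a tiny temporary file with made-up reviews.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.tsv")
    with open(demo_path, "w", encoding="utf-8") as fw:
        fw.write("A great film.\tpositive\n")
        fw.write("Dull and far too long.\tnegative\n")
        fw.write("Loved every minute.\tpositive\n")
    demo_counts = count_labels(demo_path)

print(demo_counts)
```

On the real data, a call such as `count_labels("aclImdb.dev.tsv")` should report roughly, though not exactly, equal counts for the two labels, because the development set is a random sample of the shuffled training files.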

## t5s configuration

The configuration consists of different sections:

### `tokenizer`

* `spm` - the path to the SentencePiece model file

### `t5_model`

* `pre_trained` - the name of the pre-trained model to load for fine-tuning,
* `save_checkpoint` - save fine-tuned checkpoints under this name,
* `save_checkpoint_every` - integer specifying how often checkpoints are saved, e.g. the value 1 means a checkpoint is saved after every epoch.

### `dataset`

* `*_tsv` - names of TSV files used as training, development and test sets,
* `loader` - specification of how to load the training data
  * `loader.input_size` - maximum number of input tokens in the batch
  * `loader.output_size` - maximum number of output tokens in the batch
  * `loader.min_batch_size` - minimum number of examples in the batch. Together with `input_size` and `output_size`, it determines the maximum length of an input and an output sequence (`input_size//min_batch_size` and `output_size//min_batch_size`, respectively).
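
The arithmetic behind the maximum sequence lengths can be worked through with the loader values from the configuration used later in this notebook. This is only an illustration of the two integer divisions described above, not code taken from `t5s`.

```python
# Values copied from the `dataset.loader` section of this notebook's config.
input_size = 3072
output_size = 256
min_batch_size = 4

# Longest sequences that still leave room for `min_batch_size` examples per batch.
max_input_length = input_size // min_batch_size
max_output_length = output_size // min_batch_size

print(max_input_length, max_output_length)  # 768 64
```

These results equal the `max_input_length` and `max_output_length` values in the `predict` section of the same configuration.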

### `training`

* `shared_trainable` - boolean, if `True`, the parameters of shared embedding layer are trained,
* `encoder_trainable` - boolean, if `True`, the parameters of the encoder are trained,
* `n_epochs` - number of training epochs,
* `initial_epoch` - number of training epochs already performed, the next epoch will be `initial_epoch+1`,
* `steps_per_epoch` - the length of each epoch in steps; if omitted, an epoch is one pass over the training TSV,
* `learning_rate` - initial learning rate for `epoch=1`,
* `learning_rate_schedule` - boolean, if `True`, the sqrt learning rate schedule is used.
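
The exact formula behind the sqrt schedule is not spelled out here. One plausible reading, given that `learning_rate` is the value for `epoch=1`, is a rate that decays with the inverse square root of the epoch number. The sketch below illustrates that assumption only; `sqrt_schedule()` is a hypothetical function and may differ from what `t5s` actually implements.

```python
import math


def sqrt_schedule(initial_lr, epoch):
    """Hypothetical schedule (an assumption, not the t5s implementation):
    returns initial_lr at epoch 1 and decays as 1/sqrt(epoch) afterwards."""
    if epoch < 1:
        raise ValueError("epoch is 1-based")
    return initial_lr / math.sqrt(epoch)


# Learning rate at a few epochs, starting from the 0.001 used in this notebook.
for epoch in (1, 4, 9):
    print(epoch, sqrt_schedule(0.001, epoch))
```

Under this assumption, the rate halves by epoch 4 and drops to one third by epoch 9.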

In [None]:
config = {
    "tokenizer": {
        "spm": "cc_all.32000/sentencepiece.model",
    },
    "t5_model": {
        "pre_trained": "t5-base",
        "save_checkpoint": "T5_aclImdb",
        "save_checkpoint_every": 1,
    },
    "dataset": {
        "train_tsv": "aclImdb.train.tsv",
        "devel_tsv": "aclImdb.dev.tsv",
        "test_tsv": "aclImdb.test.tsv",
        "loader": {
            "input_size": 3072,
            "output_size": 256,
            "min_batch_size": 4,
        },
    },
    "training": {
        "shared_trainable": False,
        "encoder_trainable": True,
        "n_epochs": 1,
        "initial_epoch": 0,
        "steps_per_epoch": 500,
        "learning_rate": 0.001,
        "learning_rate_schedule": True,
    },
    "predict": {
        "batch_size": 50,
        "max_input_length": 768,
        "max_output_length": 64,
    }
}

### Import the t5s library

In [None]:
from t5s import T5

### Instantiate the T5 class and fine-tune it

In [None]:
t5 = T5(config)

In [None]:
t5.fine_tune()

## Predict using the model

To use the T5 model in code, call the `predict()` method. To evaluate the model, `predict_tsv()` is usually more convenient, together with evaluation using the `eval_tsv.py` script.

In [None]:
batch = []
reference = []
with open("aclImdb.dev.tsv", "r", encoding="utf-8") as fr:
    for line in fr:
        # Each row has the form "<review text>\t<label>"
        text, label = line.rstrip("\n").split("\t")
        batch.append(text)
        reference.append(label)
        if len(batch) >= 10:
            break
print(reference)
print(t5.predict(batch))

In [None]:
t5.predict_tsv("aclImdb.dev.tsv", "aclImdb.dev.pred.tsv")

The evaluation script `eval_tsv.py` takes three parameters: the name of the metric to compute, the reference TSV, and the predicted TSV. The `match` metric computes sentence accuracy (`SAcc`) and word-level accuracy (`WAcc`). The output, in JSON format, also contains the number of correct and erroneous sentences and words.

In [None]:
!eval_tsv.py match aclImdb.dev.tsv aclImdb.dev.pred.tsv
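
To make the sentence accuracy metric concrete, the sketch below computes the fraction of predictions that exactly match their reference. It is a minimal stand-in written for illustration, not the implementation in `eval_tsv.py`; `sentence_accuracy()` is a hypothetical helper, and the labels are made-up examples.

```python
def sentence_accuracy(references, predictions):
    """Fraction of predictions that exactly match their reference string.

    Hypothetical helper for illustration; not taken from eval_tsv.py.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        return 0.0
    correct = sum(ref == pred for ref, pred in zip(references, predictions))
    return correct / len(references)


# Made-up labels: three of the four predictions match the reference.
refs = ["positive", "negative", "positive", "negative"]
preds = ["positive", "negative", "negative", "negative"]
accuracy = sentence_accuracy(refs, preds)
print(accuracy)  # 0.75
```

The same function can be applied to the `reference` list and the output of `t5.predict(batch)` from the earlier cell, assuming `predict()` returns one string per input.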