<a href="https://colab.research.google.com/github/honzas83/t5s/blob/main/examples/t5s_aclimdb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis using the t5s library
## Install the t5s library and its dependencies

In [1]:
%%capture pip_install
!pip install git+https://github.com/honzas83/t5s --upgrade

Collecting git+https://github.com/honzas83/t5s
  Cloning https://github.com/honzas83/t5s to /tmp/pip-req-build-2y_582bu
  Running command git clone -q https://github.com/honzas83/t5s /tmp/pip-req-build-2y_582bu
Building wheels for collected packages: t5s
  Building wheel for t5s (setup.py) ... done
  Created wheel for t5s: filename=t5s-0.1-cp36-none-any.whl size=13589 sha256=f79c1523ef24169e00a90d6be229191bfcd747b0570622c6fa534c68eb717dd1
  Stored in directory: /tmp/pip-ephem-wheel-cache-6gaks6we/wheels/04/e5/71/24b59a9d225bfaead43ca97afe95fce46b5d56ddba98ac4b2d
Successfully built t5s
Installing collected packages: t5s
  Found existing installation: t5s 0.1
    Uninstalling t5s-0.1:
      Successfully uninstalled t5s-0.1
Successfully installed t5s-0.1


## Download and extract the ACL IMDB corpus

In [2]:
!curl http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz | tar xz

--2020-12-10 14:26:28--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.4’


2020-12-10 14:26:31 (23.3 MB/s) - ‘aclImdb_v1.tar.gz.4’ saved [84125825/84125825]



## Download the T5 SentencePiece model

This is the standard SentencePiece model provided by Google for their pre-trained T5 models. The `t5-base` model itself is downloaded by the `t5s` library (via the Hugging Face Transformers library). The `gsutil` command copies the SentencePiece files from a Google Cloud Storage bucket to the local directory.

In [3]:
!gsutil cp -r gs://t5-data/vocabs/cc_all.32000/ .

Copying gs://t5-data/vocabs/cc_all.32000/sentencepiece.model...
Copying gs://t5-data/vocabs/cc_all.32000/sentencepiece.vocab...
/ [2 files][  1.3 MiB/  1.3 MiB]                                                
Operation completed over 2 objects/1.3 MiB.                                      


In [4]:
import os
from glob import glob
import random

## Convert the dataset formats

The ACL IMDB dataset consists of a set of TXT files in `pos` and `neg` directories. We use `glob` to find these files.

In [5]:
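def find_data(dn):
    """Return the paths of all review files in the pos and neg subdirectories of dn."""
    fns = glob(os.path.join(dn, "pos", "*.txt")) + glob(os.path.join(dn, "neg", "*.txt"))
    return fns

def convert_data(fns, out_fn):
    """Write one "text<TAB>label" line per review file to out_fn."""
    with open(out_fn, "w", encoding="utf-8") as fw:
        for fn in fns:
            # The parent directory name (pos or neg) determines the label.
            # Checking the directory name works with any path separator.
            subdir = os.path.basename(os.path.dirname(fn))
            if subdir == "pos":
                label = "positive"
            elif subdir == "neg":
                label = "negative"
            else:
                continue
            with open(fn, "r", encoding="utf-8") as fr:
                text = fr.read().strip()
            if not text:
                continue

            # Tabs and newlines would break the TSV layout, so replace them with spaces.
            text = text.replace("\n", " ").replace("\t", " ")
            print(text, label, sep="\t", file=fw)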
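
To check the converted files, it can help to read a TSV file back and count the labels. The following is a minimal sketch, not part of the `t5s` library: `read_tsv` is a hypothetical helper, and the two demo rows are made-up stand-ins for real reviews.

```python
import os
import tempfile
from collections import Counter


def read_tsv(fn):
    """Yield (text, label) pairs from a two-column TSV file (hypothetical helper)."""
    with open(fn, "r", encoding="utf-8") as fr:
        for line_no, line in enumerate(fr, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                raise ValueError(f"{fn}:{line_no}: expected 2 columns, got {len(fields)}")
            yield fields[0], fields[1]


# Made-up demo rows written in the same "text<TAB>label" layout as convert_data().
rows = [("A great movie.", "positive"), ("Dull and too long.", "negative")]

with tempfile.TemporaryDirectory() as tmp:
    demo_fn = os.path.join(tmp, "demo.tsv")
    with open(demo_fn, "w", encoding="utf-8") as fw:
        for text, label in rows:
            print(text, label, sep="\t", file=fw)

    pairs = list(read_tsv(demo_fn))
    label_counts = Counter(label for _, label in pairs)

print(label_counts)
```

Once the real files exist, calling `read_tsv("aclImdb.train.tsv")` in the same way should show whether both labels are present.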

We search for all `*.txt` files in the `train` subdirectory, shuffle the filenames, and hold out 2,000 files as the development set. The rest is used as the training data.

The `*.txt` files are converted to the tab-separated values (TSV) format using the `convert_data()` function.

In [6]:
train_fns = find_data("aclImdb/train")
random.shuffle(train_fns)
dev_fns = train_fns[-2000:]
del train_fns[-2000:]
convert_data(train_fns, "aclImdb.train.tsv")
convert_data(dev_fns, "aclImdb.dev.tsv")
test_fns = find_data("aclImdb/test")
convert_data(test_fns, "aclImdb.test.tsv")
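
The split above uses the global random state, so the development set changes on every run. If a repeatable split is wanted, one option is to sort the filenames and shuffle them with a seeded generator. This is only a sketch: `split_train_dev`, the seed value, and the demo filenames are assumptions made for illustration.

```python
import random


def split_train_dev(fns, n_dev, seed):
    """Return (train, dev) lists; the same seed always gives the same split."""
    fns = sorted(fns)  # glob order depends on the filesystem, so sort first
    if not 0 < n_dev < len(fns):
        raise ValueError("n_dev must be between 1 and len(fns) - 1")
    random.Random(seed).shuffle(fns)  # private generator, global state untouched
    return fns[:-n_dev], fns[-n_dev:]


# Hypothetical filenames standing in for the output of find_data().
demo_fns = [f"review_{i}.txt" for i in range(10)]

train_a, dev_a = split_train_dev(demo_fns, n_dev=2, seed=0)
# The input order does not matter, because the names are sorted before shuffling.
train_b, dev_b = split_train_dev(reversed(demo_fns), n_dev=2, seed=0)

print(dev_a == dev_b)
```

With the real data, the equivalent call would be `split_train_dev(find_data("aclImdb/train"), n_dev=2000, seed=0)`.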

## t5s configuration

The configuration consists of the following sections:

### `tokenizer`

*   `spm` - the name of the SentencePiece model

### `t5_model`

* `pre_trained` - the name of the pre-trained model to load for fine-tuning,
* `save_checkpoint` - save fine-tuned checkpoints under this name,
* `save_checkpoint_every` - integer, which specifies how often the checkpoints are saved, e.g. the value 1 means save every epoch.

### `dataset`

* `*_tsv` - names of TSV files used as training, development and test sets,
* `loader` - specification of how to load the training data
  * `loader.input_size` - maximum number of input tokens in the batch
  * `loader.output_size` - maximum number of output tokens in the batch
  * `loader.min_batch_size` - minimum number of examples in the batch. Together with `input_size` and `output_size`, it determines the maximum length of an input and an output sequence (`input_size//min_batch_size` and `output_size//min_batch_size`, respectively).
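
As a worked example of the formulas for the maximum sequence lengths, the sketch below applies them to the loader values used in this notebook's configuration. The function name `max_sequence_lengths` is hypothetical and only illustrates the arithmetic; it is not a `t5s` function.

```python
def max_sequence_lengths(input_size, output_size, min_batch_size):
    """Return (max input length, max output length) in tokens."""
    return input_size // min_batch_size, output_size // min_batch_size


# Loader values taken from the configuration in this notebook.
loader = {"input_size": 3072, "output_size": 256, "min_batch_size": 4}

max_in, max_out = max_sequence_lengths(**loader)
print(max_in, max_out)  # 768 64
```

With these settings, an input can be at most 768 tokens long and an output at most 64 tokens long.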

### `training`

* `shared_trainable` - boolean, if `True`, the parameters of shared embedding layer are trained,
* `encoder_trainable` - boolean, if `True`, the parameters of the encoder are trained,
* `n_epochs` - number of training epochs,
* `initial_epoch` - number of training epochs already performed, the next epoch will be `initial_epoch+1`,
* `steps_per_epoch` - the length of each epoch in steps; if omitted, an epoch is one pass over the training TSV,
* `learning_rate` - initial learning rate for `epoch=1`,
* `learning_rate_schedule` - boolean, if `True`, the sqrt learning rate schedule is used.
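
The exact formula of the sqrt schedule is not stated here. The training log at the end of this notebook reports 0.001 for epoch 1 and 0.0007071067811865475 for epoch 2, which is consistent with dividing the initial learning rate by the square root of the epoch number. The sketch below assumes that formula; `sqrt_schedule` is a hypothetical function, and the actual `t5s` implementation may differ.

```python
import math


def sqrt_schedule(epoch, initial_lr):
    """Learning rate for a 1-based epoch, ASSUMING lr = initial_lr / sqrt(epoch)."""
    if epoch < 1:
        raise ValueError("epoch is 1-based")
    return initial_lr / math.sqrt(epoch)


# Learning rates for the first four epochs with the configured initial rate.
rates = [sqrt_schedule(epoch, initial_lr=0.001) for epoch in range(1, 5)]

for epoch, lr in enumerate(rates, start=1):
    print(f"epoch {epoch}: {lr:.6f}")
```

Under this assumption the rate halves by epoch 4 and drops to roughly a quarter of the initial value by epoch 16.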

In [8]:
config = {
    "tokenizer": {
        "spm": "cc_all.32000/sentencepiece.model",
    },
    "t5_model": {
        "pre_trained": "t5-base",
        "save_checkpoint": "T5_aclImdb",
        "save_checkpoint_every": 1,
    },
    "dataset": {
        "train_tsv": "aclImdb.train.tsv",
        "devel_tsv": "aclImdb.dev.tsv",
        "test_tsv": "aclImdb.test.tsv",
        "loader": {
            "input_size": 3072,
            "output_size": 256,
            "min_batch_size": 4,
        },
    },
    "training": {
        "shared_trainable": False,
        "encoder_trainable": True,
        "n_epochs": 20,
        "initial_epoch": 0,
        "steps_per_epoch": 1000,
        "learning_rate": 0.001,
        "learning_rate_schedule": True,
    },
}

### Import the t5s library

In [9]:
from t5s import T5

### Instantiate the T5 class and fine-tune it

In [10]:
t5 = T5(config)

In [None]:
t5.fine_tune()

All model checkpoint weights were used when initializing T5Training.

Some weights of T5Training were not initialized from the model checkpoint at t5-base and are newly initialized: ['loss']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Epoch 00001: LearningRateScheduler reducing learning rate to 0.001.
Epoch 1/20

Epoch 00002: LearningRateScheduler reducing learning rate to 0.0007071067811865475.
Epoch 2/20