<a href="https://colab.research.google.com/github/honzas83/t5s/blob/main/t5s/examples/t5s_csfd_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis in Czech using the t5s library
## Install the t5s library and its dependencies

In [None]:
%%capture pip_install
!pip install git+https://github.com/honzas83/t5s --upgrade

## Download and extract the Czech CSFD corpus

In [None]:
!curl https://corpora.kiv.zcu.cz/sentiment/csfd.zip > csfd.zip
!unzip -u csfd.zip

## Download the Czech T5-small model

A Czech equivalent of the T5-small model, pre-trained on Common Crawl data.

In [None]:
!gdown 1fvN7FhFA-ofiKXas73AXv3fTR6apK2rS && unzip -u t5_32k_cccs_jmzw_small.v2.zip

## Convert the dataset formats

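The CSFD dataset consists of three files with positive, neutral and negative sentiment. Only the positive and negative files are used here.

The code below randomly shuffles the examples and splits them into the test set (the first 10,000 examples), the development set (the next 5,000) and the training set (the remainder).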

In [None]:
import random

In [None]:
def read_data(input_files):
    ret = []
    for label, fn in input_files:
        with open(fn, "r", encoding="utf-8") as fr:
            for line in fr:
                text = line.strip()
                ret.append((text, label))
    random.shuffle(ret)
    return ret

def write_tsv(fn, data):
    with open(fn, "w", encoding="utf-8") as fw:
        for text, label in data:
            print(text, label, sep="\t", file=fw)

In [None]:
data = read_data([("pozitivní", "csfd/positive.txt"), ("negativní", "csfd/negative.txt")])

test_data = data[:10000]
dev_data = data[10000:15000]
train_data = data[15000:]

write_tsv("csfd.train.tsv", train_data)
write_tsv("csfd.dev.tsv", dev_data)
write_tsv("csfd.test.tsv", test_data)
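
Because `read_data` shuffles with the global `random` state, each run of the notebook produces a different split. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable. The helper `split_data`, its parameters and the seed value are hypothetical and not part of the notebook or the t5s library.

```python
import random


def split_data(data, n_test, n_dev, seed=42):
    """Shuffle a copy of `data` with a fixed seed and cut it into three parts.

    Returns (train, dev, test) in the same order as the notebook uses:
    the first `n_test` items form the test set, the next `n_dev` the dev set.
    """
    rng = random.Random(seed)   # private generator, global state is untouched
    shuffled = list(data)       # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# Toy example with 10 items instead of the real corpus.
toy = [(f"text {i}", "pozitivní" if i % 2 else "negativní") for i in range(10)]
train, dev, test = split_data(toy, n_test=3, n_dev=2)
print(len(train), len(dev), len(test))
```

Calling `split_data` twice with the same seed returns the same three lists, so the TSV files can be regenerated without changing the evaluation data.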

## Import the t5s library

In [None]:
from t5s import T5

## t5s configuration

The configuration is a dictionary with the following sections:

### `tokenizer`

*   `spm` - the name of the SentencePiece model

### `t5_model`

* `pre_trained` - the name of the pre-trained model to load for fine-tuning,
* `save_checkpoint` - save fine-tuned checkpoints under this name,
* `save_checkpoint_every` - integer, which specifies how often the checkpoints are saved, e.g. the value 1 means save every epoch.

### `dataset`

* `*_tsv` - names of TSV files used as training, development and test sets,
* `loader` - specification of how to load the training data
  * `loader.input_size` - maximum number of input tokens in the batch
  * `loader.output_size` - maximum number of output tokens in the batch
  * `loader.min_batch_size` - minimum number of examples in the batch. Together with `input_size` and `output_size`, it determines the maximum length of an input and an output sequence (`input_size//min_batch_size` and `output_size//min_batch_size`).
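
As a worked example of the `min_batch_size` relation, the following sketch computes the per-sequence limits implied by a loader configuration. The helper name `max_sequence_lengths` is hypothetical and only restates the integer division given in the text; it is not a t5s function.

```python
def max_sequence_lengths(loader):
    """Return (max_input_len, max_output_len) implied by a loader config."""
    max_input = loader["input_size"] // loader["min_batch_size"]
    max_output = loader["output_size"] // loader["min_batch_size"]
    return max_input, max_output


# Loader values used in the configuration of this notebook.
loader = {"input_size": 3072, "output_size": 256, "min_batch_size": 4}
print(max_sequence_lengths(loader))  # (768, 64)
```

With these values the limits are 768 input tokens and 64 output tokens, the same numbers that appear as `max_input_length` and `max_output_length` in the `predict` section of the configuration.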

### `training`

* `shared_trainable` - boolean, if `True`, the parameters of shared embedding layer are trained,
* `encoder_trainable` - boolean, if `True`, the parameters of the encoder are trained,
* `n_epochs` - number of training epochs,
* `initial_epoch` - number of training epochs already performed, the next epoch will be `initial_epoch+1`,
* `steps_per_epoch` - the length of each epoch in steps; if omitted, an epoch is one pass over the training TSV,
* `learning_rate` - initial learning rate for `epoch=1`,
* `learning_rate_schedule` - boolean, if `True`, the sqrt learning rate schedule is used.
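
The exact formula behind `learning_rate_schedule` is not given here. A common inverse square-root decay, shown purely as an assumption about what a sqrt schedule could look like, divides the initial rate by the square root of the epoch number, so that `epoch=1` uses `learning_rate` unchanged. The function `sqrt_schedule` is illustrative and not taken from t5s.

```python
import math


def sqrt_schedule(initial_lr, epoch):
    """Inverse square-root decay: initial_lr at epoch 1, then lr / sqrt(epoch)."""
    if epoch < 1:
        raise ValueError("epochs are numbered from 1")
    return initial_lr / math.sqrt(epoch)


# Learning rate for each of five epochs, starting from 0.001.
for epoch in range(1, 6):
    print(epoch, round(sqrt_schedule(0.001, epoch), 6))
```

Under this assumed form the rate halves by epoch 4 and keeps decreasing slowly afterwards.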

In [None]:
config = {
    "tokenizer": {
        "spm": "t5_32k_cccs_jmzw_small.v2/T5_32k_CCcs.model",
    },
    "t5_model": {
        "pre_trained": "t5_32k_cccs_jmzw_small.v2",
        "save_checkpoint": "T5_csfd",
        "save_checkpoint_every": 1,
    },
    "dataset": {
        "train_tsv": "csfd.train.tsv",
        "devel_tsv": "csfd.dev.tsv",
        "test_tsv": "csfd.test.tsv",
        "loader": {
            "input_size": 3072,
            "output_size": 256,
            "min_batch_size": 4,
        },
    },
    "training": {
        "shared_trainable": False,
        "encoder_trainable": True,
        "n_epochs": 5,
        "initial_epoch": 0,
        "steps_per_epoch": 500,
        "learning_rate": 0.001,
        "learning_rate_schedule": True,
    },
    "predict": {
        "batch_size": 50,
        "max_input_length": 768,
        "max_output_length": 64,
    }
}

### Instantiate the T5 class and fine-tune it

In [None]:
t5 = T5(config)

In [None]:
t5.fine_tune()

## Predict using the model

To use the T5 model in code, call the `predict()` method. To evaluate the model, `predict_tsv()` is usually more convenient, together with the `eval_tsv.py` evaluation script.

In [None]:
batch = []
reference = []
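with open("csfd.dev.tsv", "r", encoding="utf-8") as fr:
    for line in fr:
        # Each row is "<text>\t<label>"; split it only once
        fields = line.rstrip("\n").split("\t")
        batch.append(fields[0])
        reference.append(fields[1])
        # Use only the first 10 examples for this demonstration
        if len(batch) >= 10:
            break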
predictions = t5.predict(batch)
for text, ref, hyp in zip(batch, reference, predictions):
    print(text)
    print("Reference:", ref)
    print("Predicted:", hyp)
    print()

In [None]:
t5.predict_tsv("csfd.dev.tsv", "csfd.dev.pred.tsv")

The evaluation script `eval_tsv.py` takes three parameters: the name of the metric to compute, the reference TSV and the predicted TSV. The `match` metric computes sentence accuracy `SAcc` and word-level accuracy `WAcc`. The output is in JSON format and also contains the number of correct and erroneous sentences and words.

In [None]:
!eval_tsv.py match csfd.dev.tsv csfd.dev.pred.tsv
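
For intuition, sentence accuracy can be read as the fraction of rows whose predicted output equals the reference exactly. The sketch below computes that quantity from two parallel lists. It is an assumption about what `SAcc` measures, not the implementation of `eval_tsv.py`, and `sentence_accuracy` is a hypothetical helper.

```python
def sentence_accuracy(references, hypotheses):
    """Fraction of positions where the hypothesis equals the reference exactly."""
    if len(references) != len(hypotheses):
        raise ValueError("reference and hypothesis lists differ in length")
    if not references:
        return 0.0
    correct = sum(ref == hyp for ref, hyp in zip(references, hypotheses))
    return correct / len(references)


# Toy labels: three of the four predictions match the reference.
refs = ["pozitivní", "negativní", "negativní", "pozitivní"]
hyps = ["pozitivní", "negativní", "pozitivní", "pozitivní"]
print(sentence_accuracy(refs, hyps))  # 0.75
```

Since every target in this task is a single word, an exact-match score of this kind is the natural headline number for the sentiment classifier.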

In [None]:
t5.predict_tsv("csfd.test.tsv", "csfd.test.pred.tsv")

In [None]:
!eval_tsv.py match csfd.test.tsv csfd.test.pred.tsv