In [1]:
%load_ext autoreload
%autoreload 2

import os

while "notebooks" in os.getcwd():
    os.chdir("..")

In [2]:
from datasets import load_dataset
import torch
from torch.nn import MSELoss

from belt_nlp.bert_regressor_truncated import BertRegressorTruncated


# Example - BERT model with truncation of longer texts

This notebook shows how to use the basic `fit` and `predict` methods of the BERT model that truncates texts longer than 512 tokens.
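To make the idea of truncation concrete, here is a minimal sketch that works on plain lists of token ids. The helper name `truncate_token_ids` is hypothetical and is not part of `belt_nlp`; the library's actual implementation may differ, for example in how it accounts for special tokens.

```python
# Hypothetical helper for illustration only; not the belt_nlp implementation.
def truncate_token_ids(token_ids, max_length=512):
    """Keep only the first max_length token ids; shorter inputs are returned unchanged."""
    return list(token_ids[:max_length])


short_text = list(range(100))   # stands in for a tokenized short review
long_text = list(range(2000))   # stands in for a tokenized long review

print(len(truncate_token_ids(short_text)))  # 100
print(len(truncate_token_ids(long_text)))   # 512
```

Everything after the first 512 tokens is simply discarded, so any information near the end of a long review does not reach the model.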

## Load data - predicting a 5-star rating from reviews in Polish

In [3]:
dataset = load_dataset("allegro_reviews")

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'rating'],
        num_rows: 9577
    })
    test: Dataset({
        features: ['text', 'rating'],
        num_rows: 1006
    })
    validation: Dataset({
        features: ['text', 'rating'],
        num_rows: 1002
    })
})

## Split into train and test sets

The `validation` split of the dataset serves as the test set here.

In [5]:
X_train = dataset["train"]["text"]
y_train = dataset["train"]["rating"]
X_test = dataset["validation"]["text"]
y_test = dataset["validation"]["rating"]

In [6]:
set(y_train)

{1.0, 2.0, 3.0, 4.0, 5.0}

Use a Polish BERT-family model (RoBERTa):

In [7]:
pretrained_model_name_or_path = "sdadas/polish-roberta-base-v2"

## Fit the model

In [8]:
MODEL_PARAMS = {
    "pretrained_model_name_or_path": pretrained_model_name_or_path,
    "batch_size": 32,
    "learning_rate": 5e-5,
    "epochs": 3,
    "device": "cuda",
    "many_gpus": True,
}
model = BertRegressorTruncated(**MODEL_PARAMS)

Some weights of the model checkpoint at sdadas/polish-roberta-base-v2 were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at sdadas/polish-roberta-base-v2 and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference

In [9]:
%%time
# Note: epochs=5 is passed here, although MODEL_PARAMS sets "epochs": 3.
model.fit(X_train, y_train, epochs=5)

CPU times: user 12min, sys: 2min 13s, total: 14min 14s
Wall time: 10min 15s


## Get predictions

In [10]:
scores = model.predict(X_test)

In [11]:
results = torch.flatten(scores).cpu()

In [12]:
results

tensor([3.7279, 3.7382, 1.6435,  ..., 2.0021, 0.9900, 1.0638])
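The predictions printed above include values such as 0.9900, which lies below the lowest rating of 1.0 seen in `set(y_train)`. A regression head is not constrained to the label range, so one optional post-processing step is to clamp each score to [1, 5]. The sketch below uses plain Python and a hypothetical helper `clamp_rating`; it is not something the notebook or `belt_nlp` does, and the MSE reported later is computed on the unclamped scores.

```python
# Hypothetical post-processing helper; not part of the original notebook.
def clamp_rating(score, low=1.0, high=5.0):
    """Clamp a predicted score into the valid rating range [low, high]."""
    return max(low, min(high, score))


# A few of the raw scores shown in the output above.
raw_scores = [3.7279, 1.6435, 0.9900]
print([clamp_rating(s) for s in raw_scores])  # [3.7279, 1.6435, 1.0]
```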

## Calculate the model's mean squared error on the test data

In [13]:
mse = MSELoss()

In [14]:
mse(results, torch.Tensor(y_test))

tensor(0.6345)
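With its default settings, `MSELoss` averages the squared differences between predictions and targets. The following sketch reproduces that computation in plain Python on made-up toy values (they are not taken from the dataset) and converts the MSE reported above into a root mean squared error, which is expressed in rating units.

```python
import math


def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Toy values for illustration only; not taken from the dataset.
toy_predictions = [4.0, 2.0, 5.0]
toy_targets = [5.0, 2.0, 3.0]
toy_mse = mean_squared_error(toy_predictions, toy_targets)  # (1 + 0 + 4) / 3

# Square root of the MSE reported above.
rmse = math.sqrt(0.6345)
print(round(rmse, 2))  # 0.8
```

An RMSE of about 0.8 means the predictions are typically off by a bit less than one star.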