In [1]:
%load_ext autoreload
%autoreload 2

import os

while "notebooks" in os.getcwd():
    os.chdir("..")

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch

from belt_nlp.bert_truncated import BertClassifierTruncated

# Example: BERT model with truncation of longer texts

This notebook shows how to use the basic `fit` and `predict_classes` methods of the BERT classifier that truncates texts longer than 512 tokens.
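
To make the truncation step concrete, here is a minimal sketch of what truncating an input to 512 tokens can look like. It is an illustration under stated assumptions, not the code of `BertClassifierTruncated`: the helper `truncate_token_ids` is hypothetical, it keeps the first tokens of the text (head truncation), and it uses the `[CLS]` and `[SEP]` ids of the `bert-base-uncased` vocabulary (101 and 102).

```python
# Hypothetical helper, not part of belt_nlp: illustrates head truncation only.
CLS_ID = 101  # [CLS] id in the bert-base-uncased vocabulary
SEP_ID = 102  # [SEP] id in the bert-base-uncased vocabulary
MAX_LENGTH = 512  # maximum sequence length accepted by BERT


def truncate_token_ids(token_ids: list[int], max_length: int = MAX_LENGTH) -> list[int]:
    """Keep the first tokens so the result, with [CLS] and [SEP], fits max_length."""
    body = token_ids[: max_length - 2]  # reserve two positions for the special tokens
    return [CLS_ID] + body + [SEP_ID]


long_text_ids = list(range(1000, 1700))  # stand-in for a 700-token review
short_text_ids = list(range(1000, 1010))  # stand-in for a 10-token review

truncated_long = truncate_token_ids(long_text_ids)
truncated_short = truncate_token_ids(short_text_ids)
print(len(truncated_long), len(truncated_short))  # prints: 512 12
```

Under this assumption, everything after the first 510 tokens of a long review is discarded, so the model never sees the end of the text. Short texts pass through unchanged apart from the two special tokens.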

## Load data: a sample of IMDB reviews in English

In [3]:
SAMPLE_DATA_PATH = "sample_data/sample_data_eng.csv"

In [4]:
df = pd.read_csv(SAMPLE_DATA_PATH)
df

| Unnamed: 0 | sentence | target |
|---|---|---|
| 0 | I saw this movie not knowing anything about it... | 0 |
| 1 | OK, don't let my summary fool you. This movie ... | 0 |
| 2 | This should be re-titled "The Curious Case Of ... | 0 |
| 3 | Those 2 points are dedicated the reasonable pe... | 0 |
| 4 | Following the success of the (awful) Gilligan'... | 0 |
| ... | ... | ... |
| 1995 | What if Marylin Monroe, Albert Einstein, Joe D... | 1 |
| 1996 | Such a film of beauty that it's hard to descri... | 1 |
| 1997 | I saw this movie with my friend and we couldnt... | 1 |
| 1998 | This is the best piece of film ever created It... | 1 |


## Split into train and test sets

In [5]:
texts = df["sentence"].tolist()
labels = df["target"].tolist()
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

## Fit the model

In [6]:
MODEL_PARAMS = {
    "batch_size": 8,
    "learning_rate": 5e-5,
    "epochs": 3,
}
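# The original run used device="cuda:6"; fall back to the CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = BertClassifierTruncated(**MODEL_PARAMS, device=device)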

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
model.fit(X_train, y_train, epochs=3)

## Get predictions

In [8]:
preds = model.predict_classes(X_test)

## Calculate model accuracy on the test data

In [9]:
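# Cast the 0/1 labels to bool so they compare element-wise with the predicted classes.
accurate = sum(preds == np.array(y_test).astype(bool))
# Accuracy is the fraction of test examples classified correctly.
accuracy = accurate / len(y_test)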

print(f"Test accuracy: {accuracy}")

Test accuracy: 0.865
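
The accuracy computation above can be checked by hand on a toy example. The values below are made up for illustration and are not taken from the notebook's data, and the helper `accuracy_from_classes` is a hypothetical name. It is written in plain Python so that it runs without NumPy.

```python
def accuracy_from_classes(predicted: list[bool], labels: list[int]) -> float:
    """Fraction of predictions that match the labels after casting labels to bool."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must have the same length")
    matches = sum(p == bool(y) for p, y in zip(predicted, labels))
    return matches / len(labels)


toy_preds = [True, False, True, True]  # made-up predicted classes
toy_labels = [1, 0, 0, 1]  # made-up 0/1 targets
toy_accuracy = accuracy_from_classes(toy_preds, toy_labels)
print(f"Toy accuracy: {toy_accuracy}")  # 3 of 4 match, prints: Toy accuracy: 0.75
```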
