<a href="https://colab.research.google.com/github/LUMII-AILab/NLP_Course/blob/main/notebooks/MSP/TextClassificationWithBERT.ipynb" target="_new"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

# Text classification with logistic regression and BERT

Note: In the Colab runtime settings, select the free GPU (T4).

In [1]:
!pip install transformers
!pip install datasets
!pip install scikit-learn



## The GoEmotions dataset

* Publication: https://aclanthology.org/2020.acl-main.372/
* Original dataset: https://github.com/google-research/google-research/tree/master/goemotions
* Preprocessed **EN** version: https://huggingface.co/datasets/google-research-datasets/go_emotions
* Preprocessed **LV** version: https://huggingface.co/datasets/AiLab-IMCS-UL/go_emotions-lv

## Choosing a BERT model and loading the tokenizer

* Official Google BERT models, `base` and `large` versions: https://huggingface.co/google-bert
* Unofficial smaller BERT versions, e.g. `small`: https://huggingface.co/prajjwal1/bert-small
* BERT model pretrained for Latvian: https://huggingface.co/AiLab-IMCS-UL/lvbert
* and others

Note: You must use the tokenizer that matches the chosen model.
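
To illustrate why the tokenizer has to match the model, here is a minimal toy sketch. The two vocabularies and the `encode` helper are hypothetical and far simpler than a real WordPiece tokenizer. They only show that the same text can map to different ids under different vocabularies, so a model trained with one vocabulary would misread ids produced by another.

```python
# Two hypothetical vocabularies (not taken from any real model).
vocab_a = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "good": 3, "food": 4}
vocab_b = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "food": 3, "bad": 4}

def encode(text, vocab):
    """Map whitespace-separated tokens to ids, falling back to [UNK]."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

ids_a = encode("good food", vocab_a)
ids_b = encode("good food", vocab_b)

print(ids_a)  # [1, 3, 4, 2]
print(ids_b)  # [1, 0, 3, 2]
```

In this toy setup id 3 means "good" under the first vocabulary but "food" under the second, which is the kind of mismatch the note warns about.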

In [2]:
from transformers import BertTokenizer

In [3]:
# Load the tokenizer of the chosen BERT model into CPU memory
bert_tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased')

## Loading and preprocessing the dataset

In [4]:
from datasets import load_dataset

In [5]:
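# Keep only the samples that are annotated with exactly one label
def is_single_label(sample):
    value = sample["labels"]
    if isinstance(value, (list, tuple)):
        return len(value) == 1
    else:
        return False

# Replace the one-element label list with the plain integer label
def to_int_label(sample):
    return {"labels": sample["labels"][0]}

# Tokenize a batch of texts; padding is left to the data collator
def tokenize(batch):
    return bert_tokenizer(batch["text"], truncation=True)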

In [6]:
data_set = load_dataset("google-research-datasets/go_emotions", "simplified")

filtered_data_set = data_set.filter(is_single_label)
flattened_data_set = filtered_data_set.map(to_int_label)

tokenized_data_set = flattened_data_set.map(tokenize, batched=True)
final_data_set = tokenized_data_set.select_columns(["input_ids", "labels"])

print("data_set:", data_set["train"][0])
print("filtered_data_set:", filtered_data_set["train"][0])
print("flattened_data_set:", flattened_data_set["train"][0])
print("tokenized_data_set:", tokenized_data_set["train"][0])
print("final_data_set:", final_data_set["train"][0])

train_set = final_data_set["train"]
validation_set = final_data_set["validation"]
test_set = final_data_set["test"]

data_set: {'text': "My favourite food is anything I didn't have to cook myself.", 'labels': [27], 'id': 'eebbqej'}
filtered_data_set: {'text': "My favourite food is anything I didn't have to cook myself.", 'labels': [27], 'id': 'eebbqej'}
flattened_data_set: {'text': "My favourite food is anything I didn't have to cook myself.", 'labels': 27, 'id': 'eebbqej'}
tokenized_data_set: {'text': "My favourite food is anything I didn't have to cook myself.", 'labels': 27, 'id': 'eebbqej', 'input_ids': [101, 2026, 8837, 2833, 2003, 2505, 1045, 2134, 1005, 1056, 2031, 2000, 5660, 2870, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
final_data_set: {'input_ids': [101, 2026, 8837, 2833, 2003, 2505, 1045, 2134, 1005, 1056, 2031, 2000, 5660, 2870, 1012, 102], 'labels': 27}
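
The `input_ids` printed above are not padded, because `tokenize` only truncates. Padding happens later, per batch, in the `DataCollatorWithPadding` passed to the `Trainer`. The sketch below imitates that idea in plain Python. The `pad_batch` helper and the second example sequence are made up for illustration, and `pad_id=0` is stated here as an assumption about the `[PAD]` id of `bert-base-uncased`.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 2026, 8837, 2833, 102],  # illustrative id sequence
    [101, 2026, 102],              # shorter illustrative sequence
]
padded = pad_batch(batch)

print(padded["input_ids"][1])       # [101, 2026, 102, 0, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 0, 0]
```

Padding to the longest sequence of each batch, rather than to a fixed global length, keeps batches of short texts small.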


## Logistic regression

### Preparing the runtime environment

In [7]:
from transformers import BertModel

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import numpy as np
import torch

In [8]:
# Load the chosen BERT model into GPU memory (if a GPU is available), in inference mode
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device in use:", device)

bert_model = BertModel.from_pretrained("google-bert/bert-base-uncased").to(device).eval()

Device in use: cuda



### Tokenizing the input and computing contextual embeddings

In [9]:
def get_embeddings(texts, batch_size=256, max_length=512):
    embeddings = []

    for i in range(0, len(texts), batch_size):
        tokenized_batch = bert_tokenizer(
            texts[i:i+batch_size],
            padding = True,
            truncation = True,
            max_length = max_length,
            return_tensors = "pt"
        )

        # Move the input tensors to the GPU/CPU
        tokenized_batch = {k: v.to(device) for k, v in tokenized_batch.items()}

        with torch.no_grad():
            out = bert_model(**tokenized_batch)
            cls = out.last_hidden_state[:, 0, :]  # [CLS] vector

        embeddings.append(cls.detach().cpu().numpy())

    return np.vstack(embeddings)
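
The indexing in `get_embeddings` can be checked on array shapes alone. The sketch below uses random NumPy arrays as stand-ins for `last_hidden_state`; no model is involved. The batch size and sequence length are arbitrary, and 768 is the hidden size of BERT base.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size = 4, 16, 768  # 768: hidden size of BERT base

# Stand-in for out.last_hidden_state: one vector per token of every text
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Position 0 of every sequence holds the [CLS] token
cls_vectors = last_hidden_state[:, 0, :]
print(last_hidden_state.shape)  # (4, 16, 768)
print(cls_vectors.shape)        # (4, 768)

# np.vstack concatenates the per-batch results along the sample axis
stacked = np.vstack([cls_vectors, cls_vectors[:2]])
print(stacked.shape)            # (6, 768)
```

Each text therefore ends up as one fixed-size row, which is the input format that `LogisticRegression` expects.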

In [10]:
X_train = get_embeddings(flattened_data_set["train"]["text"])
y_train = np.array(flattened_data_set["train"]["labels"])

X_valid = get_embeddings(flattened_data_set["validation"]["text"])
y_valid = np.array(flattened_data_set["validation"]["labels"])

X_test = get_embeddings(flattened_data_set["test"]["text"])
y_test = np.array(flattened_data_set["test"]["labels"])

In [11]:
# Free GPU memory (!)
del bert_model
torch.cuda.empty_cache()

### Training and evaluating the classifier

In [12]:
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

In [13]:
print(accuracy_score(y_valid, clf.predict(X_valid)), "- validation set")
print(accuracy_score(y_test, clf.predict(X_test)), "- test set")

0.5079155672823219 - validation set
0.5152505446623094 - test set


Note: The text embeddings obtained from a pretrained BERT model are contextualized, but the model's weights are frozen: they are not adapted to the task while the logistic regression is trained. The base BERT model therefore most likely does not capture task-specific nuances as effectively as it would if it were fine-tuned for this particular classification task.
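
As a rough illustration of what "frozen" means here, the following sketch trains a small softmax (multinomial logistic) regression with plain gradient descent on fixed feature vectors. The features are synthetic stand-ins for precomputed embeddings. The dimensions, learning rate, and step count are arbitrary assumptions, and this is not how scikit-learn's `LogisticRegression` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for precomputed embeddings: 3 classes, 50 samples each, 8 dims
centers = 5.0 * np.eye(3, 8)
y = np.repeat(np.arange(3), 50)
X = centers[y] + rng.normal(size=(150, 8))
X_before = X.copy()

# Only the classifier parameters W and b are trainable
W = np.zeros((8, 3))
b = np.zeros(3)
one_hot = np.eye(3)[y]

for _ in range(200):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - one_hot) / len(X)            # gradient of the mean cross-entropy
    W -= 0.05 * (X.T @ grad)
    b -= 0.05 * grad.sum(axis=0)

accuracy = float(((X @ W + b).argmax(axis=1) == y).mean())
print("training accuracy:", accuracy)
print("features unchanged:", np.array_equal(X, X_before))
```

The gradient updates touch only `W` and `b`. The features `X` stay exactly as they were computed, which is the limitation that fine-tuning removes.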

## Fine-tuning BERT

### Preparing the runtime environment

In [14]:
from transformers import BertForSequenceClassification
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

### Preparing to fine-tune the base model

In [15]:
# Determine the number of distinct classes in the training set
label_count = len(data_set["train"].features["labels"].feature.names)
print("label_count", label_count)

# Load the chosen BERT model into RAM and create a matching classification "head"
bert_model = BertForSequenceClassification.from_pretrained(
    'google-bert/bert-base-uncased', num_labels=label_count
)

# Define a simplified evaluation metric: accuracy
def eval_metrics(p):
    preds = p.predictions.argmax(-1)
    return {"accuracy": float((preds == p.label_ids).mean())}

# Specify the training hyperparameters
args = TrainingArguments(
    output_dir = "bert-base-uncased-go_emotions",
    learning_rate = 2e-5,              # typical for BERT models
    per_device_train_batch_size = 64,  # depends on GPU memory; can affect the result
    per_device_eval_batch_size = 128,  # depends on GPU memory
    num_train_epochs = 5,
    fp16 = True,                       # for speed on the T4
    metric_for_best_model = "accuracy",
    save_strategy = "epoch",
    eval_strategy = "epoch",
    load_best_model_at_end = True,
    report_to = "none"                 # do not use the W&B service
)

# Create the training "engine"
trainer = Trainer(
    model = bert_model,
    args = args,
    train_dataset = train_set,
    eval_dataset = validation_set,
    compute_metrics = eval_metrics,
    processing_class = bert_tokenizer,
    data_collator = DataCollatorWithPadding(bert_tokenizer)
)

label_count 28


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
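
The `eval_metrics` function can be tried without a model. The sketch below calls it on a hand-made stand-in for the object the `Trainer` passes in. The `SimpleNamespace` object and the toy logits are assumptions for illustration; during training the function receives the real predictions and label ids.

```python
from types import SimpleNamespace

import numpy as np

def eval_metrics(p):
    # argmax over the last axis picks the highest-scoring class per sample
    preds = p.predictions.argmax(-1)
    return {"accuracy": float((preds == p.label_ids).mean())}

# Toy logits for 4 samples and 3 classes, with their gold labels
toy = SimpleNamespace(
    predictions=np.array([
        [0.1, 2.0, -1.0],  # predicted class 1
        [1.5, 0.2, 0.3],   # predicted class 0
        [0.0, 0.1, 3.0],   # predicted class 2
        [2.0, 1.0, 0.0],   # predicted class 0
    ]),
    label_ids=np.array([1, 0, 2, 1]),
)

result = eval_metrics(toy)
print(result)  # {'accuracy': 0.75}
```

Three of the four toy predictions match their labels, hence 0.75.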


### Fine-tuning the base model for the classification task

In [16]:
trainer.train()

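| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 1.8699 | 1.391387 | 0.604881 |
| 2 | 1.3081 | 1.297076 | 0.620712 |
| 3 | 1.1423 | 1.286037 | 0.622911 |
| 4 | 1.0292 | 1.30485 | 0.620932 |
| 5 | 0.9242 | 1.336853 | 0.613237 |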


TrainOutput(global_step=2840, training_loss=1.2118559501540493, metrics={'train_runtime': 479.7743, 'train_samples_per_second': 378.386, 'train_steps_per_second': 5.919, 'total_flos': 3611501508638112.0, 'train_loss': 1.2118559501540493, 'epoch': 5.0})

### Testing the best fine-tuned version

In [17]:
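# evaluate() returns a dict of metrics; keep it (e.g. print(test_metrics)) so the test result is not discarded
test_metrics = trainer.evaluate(test_set)

# Both values below refer to the validation set used during training
print("Best checkpoint:", trainer.state.best_model_checkpoint)
print("Best validation accuracy:", trainer.state.best_metric)

Best checkpoint: bert-base-uncased-go_emotions/checkpoint-1704
Best validation accuracy: 0.6229111697449429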
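
The checkpoint number can be related to the training table with a little arithmetic. The sketch below assumes 36,308 training examples remain after the single-label filter (the count reported when this notebook was run), a single GPU, and no gradient accumulation.

```python
import math

train_examples = 36308  # filtered training split size in this run (assumption, see above)
batch_size = 64         # per_device_train_batch_size
epochs = 5              # num_train_epochs

# The last, smaller batch of an epoch still counts as one optimizer step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
best_epoch = 1704 // steps_per_epoch  # 1704 comes from the best checkpoint name

print(steps_per_epoch)  # 568
print(total_steps)      # 2840
print(best_epoch)       # 3
```

The total matches `global_step=2840`, and `checkpoint-1704` corresponds to the end of epoch 3, the epoch with the highest validation accuracy (0.622911).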


In [18]:
# Free GPU memory (!); the trainer also holds a reference to the model
del trainer, bert_model
torch.cuda.empty_cache()