# Cost-efficient use of BERT embeddings in 8-way emotion classification on a Hungarian media corpus

Accompanying script for the project found in [this GitHub repo](https://github.com/poltextlab/Cost-efficient-use-of-BERT-in-sentiment-classification).

This project needs a GPU-enabled Colab notebook. After setup, first check that the GPU is working and which GPU was assigned. It will most probably be a K80. Any GPU will do except a P4: if you are assigned one, restart the runtime, because on a P4 the batch sizes would have to be very low.

In [None]:
import tensorflow as tf
tf.test.gpu_device_name() # the output should be /device:GPU:0

In [None]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices() # the GPU model is shown in the last entry, after 'name:'

An advantage of Google Colab is that Google Drive can be used for storage. You can find more info about this [here](https://colab.research.google.com/notebooks/io.ipynb). The code below mounts your Drive, so that the Drive root is available under `/content/gdrive/My Drive`.

In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

In [None]:
!pip install transformers
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # this is needed because of sklearn
import torch
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import json

The corpus needs to be a UTF-8 encoded TSV file with one row per text. It can contain any number of columns, but "text" (the text itself) and "topik" (the class label) are compulsory.
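
Before loading a large file, it can help to confirm that the two compulsory columns are present. The following is a minimal sketch using only the standard library; the helper `check_tsv_header` and the two-row sample are hypothetical and not part of the original script.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "topik"}  # column names this notebook relies on

def check_tsv_header(handle):
    """Return the header row of a TSV file; raise ValueError if a required column is missing."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    missing = REQUIRED_COLUMNS - set(header)
    if missing:
        raise ValueError(f"Missing required column(s): {sorted(missing)}")
    return header

# Hypothetical two-row corpus, used only for illustration
sample = io.StringIO("id\ttext\ttopik\n1\tElső mondat.\t3\n2\tMásodik mondat.\t5\n")
print(check_tsv_header(sample))  # ['id', 'text', 'topik']
```

For a real file, the same function could be called on `open(path, encoding="utf-8", newline="")`.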

In [None]:
corpus = pd.read_csv('/content/gdrive/My Drive/etl.tsv', sep='\t') # this reads the corpus from Drive root, named etl.tsv in this example

In [None]:
tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
model = AutoModel.from_pretrained("SZTAKI-HLT/hubert-base-cc", output_hidden_states=True) # return the hidden states of every layer

In [None]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") # this line assigns our GPU to the variable "device"

In [None]:
tokenized = corpus["text"].apply((lambda x: tokenizer.encode(x, add_special_tokens=True))) # this creates the tokenized version of the corpus
print(tokenized)

Max length (here: max_len) is the length of the longest token sequence in the corpus. This example presumes that it does not exceed the 512 tokens a BERT-base model accepts, that is, 510 tokens of text plus the [CLS] and [SEP] special tokens. Anything longer must be truncated.

It is advisable to look at the distribution of lengths in the corpus: if only a few texts are very long, it is sensible to prune them to fit the majority of the corpus. If need be, `tokenizer.encode` can truncate to a specified length through its `truncation` and `max_length` arguments.
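
A minimal sketch of both ideas follows: it summarizes the length distribution and truncates over-long id lists by hand. The helpers `length_summary` and `truncate_ids`, the toy id lists, and the limit of 5 are illustrative assumptions; the ids 2 and 3 merely play the roles of [CLS] and [SEP].

```python
import numpy as np

def length_summary(token_lists, percentiles=(50, 95, 99, 100)):
    """Map each requested percentile to the corresponding token-sequence length."""
    lengths = np.array([len(t) for t in token_lists])
    return {p: float(np.percentile(lengths, p)) for p in percentiles}

def truncate_ids(ids, limit):
    """Cut a sequence to `limit` ids, keeping the first id ([CLS]) and the last id ([SEP])."""
    if len(ids) <= limit:
        return list(ids)
    return list(ids[:limit - 1]) + [ids[-1]]

# Toy id lists standing in for `tokenized.values`
toy = [[2, 10, 11, 3], [2, 10, 11, 12, 13, 14, 3], [2, 10, 3]]
print(length_summary(toy))
print([truncate_ids(ids, limit=5) for ids in toy])
```

In the notebook itself, the same effect should be achievable by passing `truncation=True, max_length=512` to `tokenizer.encode`, which leaves the special tokens in place.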

In [None]:
max_len = max(len(i) for i in tokenized.values)
print('Max length is:', max_len)

# right-pad every sequence with 0 (assumed to be the [PAD] id) and mask out the padding
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)
n_batches = (len(corpus) // 200) + 1 # about 200 texts per batch; change the divisor to change the batch size
print('Number of batches:', n_batches)

splitpadded = np.array_split(padded, n_batches)
splitmask = np.array_split(attention_mask, n_batches)

model = model.to(device)
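
The padding, masking, and batching steps can be tried out on toy data. The sketch below assumes a padding id of 0, as in the cell above; the helper `pad_and_mask` and the toy id lists are hypothetical. Note that `np.array_split` may return batches of unequal size, so ids and masks should always be split the same way and paired batch by batch.

```python
import numpy as np

def pad_and_mask(token_lists, pad_id=0):
    """Right-pad id lists to a common length and build the matching attention mask."""
    max_len = max(len(t) for t in token_lists)
    padded = np.array([list(t) + [pad_id] * (max_len - len(t)) for t in token_lists])
    mask = (padded != pad_id).astype(int)
    return padded, mask

# Five toy sequences of different lengths
toy = [[2, 10, 3], [2, 10, 11, 12, 3], [2, 3], [2, 10, 11, 3], [2, 12, 3]]
padded, mask = pad_and_mask(toy)

# Splitting 5 rows into 2 batches gives one batch of 3 rows and one of 2 rows
id_batches = np.array_split(padded, 2)
mask_batches = np.array_split(mask, 2)
print([b.shape for b in id_batches])  # [(3, 5), (2, 5)]
```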

The model provides contextual embeddings for the tokens. For each text, the vector of the [CLS] token (position 0) is sliced from the hidden-state outputs and averaged across the layers, and the result is stored as a NumPy ndarray. The variable *all_hidden_states* holds the output of the model, containing all tensors, while *featuresfinal* holds the final embeddings.
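
The slicing and averaging can be illustrated on random arrays, without loading the model. This is a sketch under assumptions: the toy sizes are arbitrary, `mean_cls` is a hypothetical helper, and the list `hidden_states` only mimics the shape of the model output, which for a BERT-base model holds 13 tensors (the embedding output followed by the 12 encoder layers).

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 4, 6, 8  # toy sizes; a BERT-base model has hidden size 768
n_states = 13  # embedding output + 12 encoder layers

# Stand-in for the model's hidden states: one (batch, seq_len, hidden) array per state
hidden_states = [rng.normal(size=(batch, seq_len, hidden)) for _ in range(n_states)]

def mean_cls(states, indices):
    """Average the position-0 ([CLS]) vector over the selected hidden states."""
    cls_vectors = np.stack([states[k][:, 0, :] for k in indices], axis=1)  # (batch, n_selected, hidden)
    return cls_vectors.mean(axis=1)  # (batch, hidden)

features = mean_cls(hidden_states, indices=range(12))
print(features.shape)  # (4, 8)
```

Which indices to average is a design choice: `range(12)` covers the embedding output and the first 11 encoder layers, while `range(1, 13)` would cover the 12 encoder layers only.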

In [None]:
def last_hidden():
    """Return one averaged [CLS] embedding per text, shape (n_texts, hidden_size)."""
    batch_features = []
    for count, (ids, mask) in enumerate(zip(splitpadded, splitmask)):
        input_batch = torch.tensor(ids, dtype=torch.long).to(device)
        # take the mask from splitmask: np.array_split can return batches of unequal size,
        # so slicing attention_mask by length*count would misalign ids and masks
        mask_batch = torch.tensor(mask, dtype=torch.long).to(device)
        # no_grad disables gradient tracking, as we only run inference and do not train the model
        with torch.no_grad():
            all_hidden_states = model(input_batch, attention_mask=mask_batch, output_hidden_states=True)
        print('Hidden states created for batch', count+1)
        # hidden_states[0] is the embedding output, hidden_states[1] to [12] are the encoder layers;
        # the [CLS] vectors (position 0) at indices 0-11 are stacked and averaged
        cls_per_layer = [all_hidden_states.hidden_states[k][:, 0, :].cpu() for k in range(12)]
        final_tensor = torch.stack(cls_per_layer, dim=1).mean(dim=1).numpy()
        batch_features.append(final_tensor)
        print('Finished with batch', count+1)
    return np.concatenate(batch_features, axis=0)

print('Model is running on', model.device) # one last check for a proper GPU-run
featuresfinal = last_hidden()

At this point it is advisable to save *featuresfinal* and the labels, so that the embeddings need not be recomputed if the corpus stays the same.

In [None]:
from google.colab import files
np.save("featuresfinal", featuresfinal)
!cp featuresfinal.npy "/content/gdrive/My Drive/"
np.save("labels", corpus["topik"])
!cp labels.npy "/content/gdrive/My Drive/"

If you already have your files ready, you can load them with this snippet.

In [None]:
#featuresfinal = np.load("/content/gdrive/My Drive/featuresfinal.npy")
#labels = np.load("/content/gdrive/My Drive/labels.npy")

In [None]:
# MinMax scaling is applied to the features
scaler = MinMaxScaler()
featuresfinal = scaler.fit_transform(featuresfinal)

# the parameter space is defined below
C = [0.1, 1]
tol = [0.001, 0.005, 0.01]
weighting = ['balanced']
solver = ['liblinear']
max_iter = [6000]
parameters = dict(C=C, tol=tol, class_weight=weighting, solver=solver, max_iter=max_iter)

# Necessary objects and variables are initialized
clasrep = list()
paramlist = list()
labels = corpus["topik"].to_numpy()

The loop below draws a new stratified train/test split and refits the grid search over the logistic regression in every iteration. It appends the classification reports as dicts to the list *clasrep*, and the best parameters to the list *paramlist*.

In [None]:
for i in range(3):
    train_features, test_features, train_labels, test_labels = train_test_split(featuresfinal, labels, stratify=labels)
    lr = LogisticRegression()
    lrmodel = GridSearchCV(lr, parameters, cv = 3, scoring = 'f1_weighted', n_jobs = -1)
    lrmodel.fit(train_features, train_labels)
    predictions = lrmodel.predict(test_features)
    classifrep = classification_report(test_labels, predictions, output_dict = True)
    clasrep.append(classifrep)
    paramlist.append(lrmodel.best_params_)
    print("Finished with run!")

The code below exports the resulting lists of dicts as JSON files, then copies them to Drive root.

In [None]:
with open('clasrep_bert.json', 'w') as report_file:
    json.dump(clasrep, report_file)

with open('param_bert.json', 'w') as param_file:
    json.dump(paramlist, param_file)

# the file names must match the ones written above
!cp clasrep_bert.json "/content/gdrive/My Drive/"
!cp param_bert.json "/content/gdrive/My Drive/"

In [None]:
results = pd.json_normalize(clasrep) # flattens the list of nested report dicts into one row per run
results.mean() # weighted precision, recall, F1 scores and sample sizes averaged over the runs
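
To make the flattening and averaging step explicit, here is a plain-Python sketch of the same idea. The helpers `flatten` and `average_reports` are hypothetical, and the numbers in `runs` are made up; they only imitate the nested shape of `classification_report(..., output_dict=True)`.

```python
def flatten(report, prefix=""):
    """Flatten a nested report dict into {'label.metric': value}."""
    flat = {}
    for key, value in report.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

def average_reports(reports):
    """Average every flattened metric over a list of report dicts."""
    flat_reports = [flatten(r) for r in reports]
    keys = flat_reports[0].keys()
    return {k: sum(fr[k] for fr in flat_reports) / len(flat_reports) for k in keys}

# Made-up numbers for two runs and a single class "0"
runs = [
    {"0": {"precision": 0.5, "recall": 0.75}, "accuracy": 0.5},
    {"0": {"precision": 1.0, "recall": 0.25}, "accuracy": 1.0},
]
print(average_reports(runs))  # {'0.precision': 0.75, '0.recall': 0.5, 'accuracy': 0.75}
```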

From this point, one has the option to fit on the total labeled set and predict for unlabeled data, after tokenizing the unlabeled texts and extracting their contextual embeddings in the same way. scikit-learn also offers the `.predict_proba()` method, which returns class probabilities for more fine-grained output.

In [None]:
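# unlabeled_features: embeddings of the unlabeled texts, extracted as above and scaled with scaler.transform
lrmodel.predict(unlabeled_features)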
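
One possible use of the probabilistic output is to keep only confident predictions. The sketch below works on a plain probability matrix, assuming it has the shape returned by `lrmodel.predict_proba(unlabeled_features)` with columns ordered as in `lrmodel.classes_`. The helper `confident_predictions`, the threshold of 0.5, and the numbers are illustrative assumptions.

```python
import numpy as np

def confident_predictions(proba, classes, threshold=0.5):
    """Return (labels, confidences, keep); keep flags rows whose top probability reaches the threshold."""
    proba = np.asarray(proba)
    top = proba.argmax(axis=1)                       # column index of the most probable class
    confidences = proba[np.arange(len(proba)), top]  # probability of that class
    labels = np.asarray(classes)[top]
    return labels, confidences, confidences >= threshold

# Made-up probabilities for three texts and three classes
proba = [[0.7, 0.2, 0.1], [0.4, 0.35, 0.25], [0.1, 0.1, 0.8]]
labels, conf, keep = confident_predictions(proba, classes=[1, 2, 3], threshold=0.5)
print(labels.tolist(), keep.tolist())  # [1, 1, 3] [True, False, True]
```

Texts flagged `False` could then be set aside for manual coding instead of being labeled automatically.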