# BERT Fine-Tuning (pure)

*By Asmik Nalmpatian and Lisa Wimmer*

*Last edited on 30.12.20* 

*For our consulting project: Aspect-Based Sentiment Analysis for Twitter Data of German MPs*

*Methodology based on: https://arxiv.org/pdf/1810.04805.pdf*

This notebook gives an overview of the functionality we use. We show how to use BERT (Bidirectional Encoder Representations from Transformers) with the Hugging Face `transformers` library for PyTorch to fine-tune a pretrained model for tweet classification.

The following pretrained BERT models are available in the Hugging Face `transformers` library and can be used here:

*   *bert-base-german-cased (trained by Deepset.ai)*
*   *bert-base-german-dbmdz-cased (trained by DBMDZ)*
*   *bert-base-german-dbmdz-uncased (trained by DBMDZ)*
*   *distilbert-base-german-cased (distilled from DBMDZ)*

After using these models to extract (hopefully) high-quality features from our text data, we fine-tune them on our specific task using a manually labeled sample of German tweets, with the aim of obtaining state-of-the-art predictions.

In short, the tweets are classified as *positive* or *negative* using a pretrained BERT model: we take the model, add one untrained layer of neurons on top, and train the resulting model for our classification task.

Advantages of fine-tuning:

*   Fast: the lower layers of the network are already extensively pretrained, so after adding one fully connected layer on top, only 2-4 epochs of training are needed, as the authors recommend
*   Less data is often enough to achieve good performance
*   Usually better results, because the weights are adjusted to the specific task





**Prepare GPU:**

1. Check: Edit -> Notebook settings -> Hardware accelerator -> *GPU*


2. The datasets are uploaded to the *content* folder: *data-germeval-2017-train.tsv* & *labeling_asmik_final.csv*


# Google Colab GPU Connection

We connect to a GPU first; otherwise, training a large neural network may take a very long time.

In [None]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [None]:
# Identify and specify the GPU as the device
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


In [None]:
!python --version

Python 3.7.10


In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
# Where to save our data
import os, sys

os.getcwd()

'/content'

In [None]:
# # Clear RAM for more memory
#!nvidia-smi
#!kill 460.27.04

/bin/bash: line 0: kill: 460.27.04: arguments must be process or job IDs


# Fine-Tuning 

Install the Hugging Face `transformers` package and specify the pretrained transformer model. (Uncased means that the text is lowercased before tokenization, so the vocabulary contains only lowercase tokens.)


In [None]:
#!pip install transformers
from transformers import BertTokenizer

pretrained_model = "./drive/MyDrive/BERT_Files/tweets_unlabeled_pt/" # "bert_german_cased"

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained(pretrained_model) # , do_lower_case=True for uncased models

Load and prepare the tweets and labels 

In [None]:
import pandas as pd

# Load the dataset into a pandas dataframe.
df = pd.read_csv("./drive/MyDrive/BERT_Files/BERT_Fine-Tuning_pure/data_labeled_processed_bert.csv" , delimiter=',', header = 0)

df['label_binary'] = [1 if label == 'positive' else 0 for label in df.label]

df = df.dropna(subset=['label'])

# Get the lists of tweets and their labels.
tweets = df.twitter_full_text.values
labels = df.label_binary.values

df.head()
#len(df)

Unnamed: 0,doc_id,twitter_username,twitter_available,twitter_created_at,twitter_full_text_topic,twitter_full_text,twitter_retweet_count,twitter_favorite_count,twitter_followers_count,twitter_location,twitter_word_count,twitter_year,twitter_month,twitter_week,twitter_time_index_month,twitter_time_index_week,twitter_emojis,twitter_hashtags,twitter_tags,label,topic,from,to,label_binary
0,ABaerbock15156546001,ABaerbock,True,2018-01-11T07:10:00Z,Habe mir das Gro Ko-Sondierungspapier zu Klima...,Habe mir das Gro Ko-Sondierungspapier zu Klima...,111,233,107693,Brandenburg,32,2018,1,2,5,17,,#GroKo|#Klima|#Klimadiplomatie|#Merkel|#kohlea...,,negative,Klimapolitik,239,251,0
1,ABaerbock15210084001,ABaerbock,True,2018-03-14T06:20:00Z,"Auch weltweit sieht man, dass Angela Merkel s ...","Auch weltweit sieht man, dass Angela Merkel s ...",24,96,107693,Brandenburg,27,2018,3,11,7,26,,#Merkel|#GroKo,,negative,Klimapolitik,203,215,0
2,ABaerbock15216341401,ABaerbock,True,2018-03-21T12:09:00Z,"Der groessten globalen Herausforderung, der Kl...","Der groessten globalen Herausforderung, der Kl...",39,153,107693,Brandenburg,38,2018,3,12,7,27,,#Klimakrise|#Regierungserklaerung|#Koalitionsv...,,negative,Klimapolitik,284,296,0
3,ABaerbock15252349801,ABaerbock,True,2018-05-02T04:23:00Z,Wir brauchen eine andere Verkehrspolitik - weg...,Wir brauchen eine andere Verkehrspolitik - weg...,49,213,94139,Brandenburg,38,2018,5,18,9,33,,,,negative,Verkehrspolitik,25,40,0
4,ABaerbock15256297201,ABaerbock,True,2018-05-06T18:02:00Z,Das ist die Leistung von Hunderten Wahlkaempfe...,Das ist die Leistung von Hunderten Wahlkaempfe...,39,301,107693,Brandenburg,40,2018,5,19,9,34,,#kwsh|#kow18|#Kommunalwahl,,positive,Kommunalpolitik,251,266,1


In [None]:
# Prepare GermEval Dataset
# data-germeval-2017-train.tsv
germeval = pd.read_csv("./drive/MyDrive/BERT_Files/BERT_Fine-Tuning_pure/data-germeval-2017-train.tsv", 
                   sep='\t',
                   header=None, 
                   names = ["full_text", "0", "label", "2"])
germeval['label_binary'] = [1 if label == 'positive' else 0 for label in germeval.label]
germeval = germeval.loc[lambda x: x['label'] != "neutral"]
germeval['label'].value_counts()


# Get the lists of tweets and their labels.
tweets_germeval = germeval.full_text.values[0:100]
labels_germeval = germeval.label_binary.values[0:100]

germeval.shape

(6444, 5)

First, we need to tokenize our text to be able to feed it into the BERT model.

Below we show an example tweet in raw and tokenized form. Tokenization has to use the tokenizer of the chosen pretrained BERT model, because each model has a fixed vocabulary of tokens (wordpieces), and the BERT tokenizer splits words that are not in this vocabulary into known wordpieces.

Each token is then mapped to its index in the tokenizer vocabulary.
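To make the splitting into wordpieces concrete, here is a minimal sketch of greedy longest-match-first tokenization. It is not the Hugging Face implementation; the function `wordpiece_tokenize` and the toy vocabulary are hypothetical and only illustrate the idea for already lowercased, whitespace-separated words.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary mapping wordpieces to IDs (not the real BERT vocabulary).
toy_vocab = {"kl": 0, "##ima": 1, "noch": 2, "##mal": 3, "genau": 4}

tokens = [piece
          for word in "klima nochmal genau".split()
          for piece in wordpiece_tokenize(word, toy_vocab)]
token_ids = [toy_vocab[token] for token in tokens]

print(tokens)
print(token_ids)
```

A word that cannot be assembled from vocabulary entries is mapped to the unknown token in this sketch.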

In [None]:
# Print the raw tweet.
print('Raw: ', tweets[0])

# Print the tweet split into tokens.
print('Tokenized: ', tokenizer.tokenize(tweets[0]))

# Print the tweet mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(tweets[0])))

Raw:  Habe mir das Gro Ko-Sondierungspapier zu Klima nochmal genau angeschaut. Krass: De facto wird sogar das Kyoto-Protokoll, der Meilenstein der Klimadiplomatie fuer nichtig erklaert! Frau Merkel, das geht so nicht! Nachbessern! kohleausstieg
Tokenized:  ['habe', 'mir', 'das', 'gro', 'ko', '-', 'so', '##nd', '##ierungs', '##papier', 'zu', 'kl', '##ima', 'noch', '##mal', 'genau', 'angesch', '##aut', '.', 'k', '##rass', ':', 'de', 'fa', '##ct', '##o', 'wird', 'sogar', 'das', 'k', '##yo', '##to', '-', 'pro', '##to', '##kol', '##l', ',', 'der', 'mei', '##len', '##stein', 'der', 'kl', '##ima', '##di', '##plomat', '##ie', 'f', '##uer', 'nichtig', 'erk', '##la', '##ert', '!', 'fra', '##u', 'merk', '##el', ',', 'das', 'geht', 'so', 'nicht', '!', 'nach', '##besser', '##n', '!', 'ko', '##hle', '##auss', '##ti', '##eg']
Token IDs:  [555, 3667, 93, 649, 7424, 26935, 181, 12251, 1737, 14900, 81, 714, 3988, 357, 446, 2971, 16745, 956, 26914, 96, 21242, 26964, 900, 20568, 1920, 26910, 292, 2215, 93

In the next step we add the special tokens `[CLS]` at the start and `[SEP]` at the end of each tweet.

In [None]:
# Tokenize all of the tweets and map the tokens to their word IDs.
input_ids = []

# For every tweet...
for tweet in tweets:
    encoded_tweet = tokenizer.encode(
                        tweet,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                   )
    
    # Add the encoded tweet to the list.
    input_ids.append(encoded_tweet)

# Print sentence 0, now as a list of IDs.
print('Original: ', tweets[0])
print('Token IDs:', input_ids[0])



Original:  Habe mir das Gro Ko-Sondierungspapier zu Klima nochmal genau angeschaut. Krass: De facto wird sogar das Kyoto-Protokoll, der Meilenstein der Klimadiplomatie fuer nichtig erklaert! Frau Merkel, das geht so nicht! Nachbessern! kohleausstieg
Token IDs: [3, 555, 3667, 93, 649, 7424, 26935, 181, 12251, 1737, 14900, 81, 714, 3988, 357, 446, 2971, 16745, 956, 26914, 96, 21242, 26964, 900, 20568, 1920, 26910, 292, 2215, 93, 96, 18886, 14481, 26935, 1017, 14481, 13603, 26907, 26918, 21, 2377, 2941, 1407, 21, 714, 3988, 3748, 13461, 12, 69, 667, 20719, 895, 129, 335, 26982, 6458, 26906, 14540, 77, 26918, 93, 1398, 181, 149, 26982, 188, 4379, 26898, 26982, 7424, 2039, 10685, 15099, 640, 4]


In [None]:
# Do the same for GermEval 

input_ids_germeval = []

# For every tweet...
for tweet_germeval in tweets_germeval:
    encoded_tweet = tokenizer.encode(
                        tweet_germeval,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                   )
    
    # Add the encoded tweet to the list.
    input_ids_germeval.append(encoded_tweet)

# Print sentence 0, now as a list of IDs.
print('Original: ', tweets_germeval[0])
print('Token IDs:', input_ids_germeval[0])

Original:  @nordschaf theoretisch kannste dir überall im Kölner Stadtbereich was suchen. Mit der KVB + S-Bahn kommt man überall fix hin.
Token IDs: [3, 26991, 3194, 17051, 23888, 479, 116, 14843, 2118, 73, 211, 106, 18313, 344, 19516, 5492, 1859, 961, 8083, 26914, 114, 21, 96, 26919, 26912, 26986, 19, 26935, 40, 558, 1471, 478, 2118, 73, 211, 19004, 461, 26914, 4]


Then we pad or truncate all tweets to the same fixed length.
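As a rough illustration of what padding and truncating at the end of a sequence does, here is a plain-Python sketch. The helper `pad_or_truncate` is hypothetical and only mimics the "post" behaviour described in this notebook; the example IDs are made up apart from 3 and 4, which are the `[CLS]` and `[SEP]` IDs seen in the outputs above.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Cut the sequence at the end if it is too long, otherwise fill the end with pad_id."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


short_tweet = [3, 555, 3667, 4]                  # shorter than max_len
long_tweet = [3] + list(range(100, 110)) + [4]   # longer than max_len

padded_short = pad_or_truncate(short_tweet, max_len=8)
padded_long = pad_or_truncate(long_tweet, max_len=8)

print(padded_short)
print(padded_long)
```

Note that truncating at the end also cuts off the closing `[SEP]` token of a sequence that is too long.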

In [None]:
print('Max sentence length: ', max([len(tweet) for tweet in input_ids]))

Max sentence length:  97


In [None]:
print('Max sentence length for GermEval data: ', max([len(tweet_germeval) for tweet_germeval in input_ids_germeval]))

Max sentence length for GermEval data:  868


In [None]:
from keras.preprocessing.sequence import pad_sequences


# Set the maximum sequence length.
# The longest labeled tweet has 97 tokens, while the GermEval texts can be much longer (up to 868 tokens in our sample).
# BERT accepts at most 512 tokens; we truncate longer texts and keep only the first MAX_LEN tokens.
MAX_LEN = 250

print('\nPadding/truncating all sentences to %d values...' % MAX_LEN)

print('\nPadding token: "{:}", ID: {:}'.format(tokenizer.pad_token, tokenizer.pad_token_id))

# Pad our input tokens with value 0 (the attention masks created below identify padding by this value).
# "post" - pad and truncate at the end of the sequence, as opposed to the beginning.
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", 
                          value=0, truncating="post", padding="post")

print('\nDone.')


Padding/truncating all sentences to 250 values...

Padding token: "[PAD]", ID: 30000

Done.


In [None]:
# same for germeval 

# MAX_LEN_germeval = 3300
input_ids_germeval = pad_sequences(input_ids_germeval, maxlen=MAX_LEN, dtype="long", 
                          value=0, truncating="post", padding="post")

Create attention masks that indicate which positions hold real tokens and which are padding: if the token ID is 0, the position is padding and its attention mask entry is set to 0; otherwise it is set to 1.
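A tiny worked example of this rule, written as a sketch with a hypothetical helper `attention_mask` and made-up token IDs:

```python
def attention_mask(padded_ids, pad_id=0):
    """Return 1 for real tokens and 0 for padding positions."""
    return [int(token_id != pad_id) for token_id in padded_ids]


example_ids = [3, 555, 3667, 4, 0, 0, 0, 0]
example_mask = attention_mask(example_ids)

print(example_mask)
```

The mask has the same length as the padded sequence, so both can be stacked into arrays of identical shape.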

In [None]:
# Create attention masks
attention_masks = []

# For each sentence...
for tweet in input_ids:
    
    att_mask = [int(token_id > 0) for token_id in tweet]
    
    # Store the attention mask for each tweet.
    attention_masks.append(att_mask)

In [None]:
# Create attention masks for Germeval data 

attention_masks_germeval = []

# For each sentence...
for tweet_germeval in input_ids_germeval:
    
    att_mask = [int(token_id > 0) for token_id in tweet_germeval]
    
    # Store the attention mask for each tweet.
    attention_masks_germeval.append(att_mask)

Now we use `train_test_split` to split the data into a training set and a test set, and then split the training set (after adding the GermEval samples) further into the final training set and a validation set.
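The resulting set sizes can be worked out by hand. The sketch below assumes that scikit-learn rounds the test share up to the next whole sample, and it uses the 1215 labeled tweets and the 100 GermEval samples from this notebook. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
def split_sizes(n_samples, test_percent):
    """Return (n_train, n_test), rounding the test share up (assumed rounding rule)."""
    n_test = -(-n_samples * test_percent // 100)  # integer ceiling division
    return n_samples - n_test, n_test


n_labeled = 1215    # labeled tweets after dropping missing labels
n_germeval = 100    # GermEval samples added to the training data

n_train_first, n_test = split_sizes(n_labeled, 20)
n_train_mixed = n_train_first + n_germeval
n_train, n_validation = split_sizes(n_train_mixed, 20)

print(n_test, n_train_mixed, n_train, n_validation)
```

The test set size of 243 agrees with the `len(test_inputs)` output shown in this notebook; the other sizes are derived under the stated assumption.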


In [None]:
# # without germeval case
# from sklearn.model_selection import train_test_split
# import numpy as np

# # Use 80% for training and 20% for testing.
# train_inputs2, test_inputs, train_labels2, test_labels = train_test_split(input_ids, labels, 
#                                                             random_state=2020, test_size=0.2, stratify = labels)

# # Do the same for the masks.
# train_masks2, test_masks, _, _ = train_test_split(attention_masks, labels,
#                                              random_state=2020, test_size=0.2, stratify = labels)





# # Use 20% of train set as validation set 
# train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(train_inputs2, train_labels2, 
#                                                                                     test_size=0.2, random_state=2020) 

# # Do the same for the masks.
# train_masks, validation_masks, _, _ = train_test_split(train_masks2, train_labels2,
#                                              random_state=2020, test_size=0.2)    

In [None]:
len(test_inputs)

243

In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

# Use 80% for training and 20% for testing.
train_inputs1, test_inputs, train_labels1, test_labels = train_test_split(input_ids, labels, 
                                                            random_state=2020, test_size=0.2, stratify = labels)

# Do the same for the masks.
train_masks1, test_masks, _, _ = train_test_split(attention_masks, labels,
                                             random_state=2020, test_size=0.2, stratify = labels)

# Mix train_inputs1, train_labels1, train_masks1 with germeval data
train_inputs2 = np.concatenate((train_inputs1, input_ids_germeval), axis=0)
train_labels2 = np.concatenate((train_labels1, labels_germeval), axis=None)
train_masks2  = np.concatenate((train_masks1, attention_masks_germeval), axis=0)



# Use 20% of train set as validation set 
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(train_inputs2, train_labels2, 
                                                                                    test_size=0.2, random_state=2020) 

# Do the same for the masks.
train_masks, validation_masks, _, _ = train_test_split(train_masks2, train_labels2,
                                             random_state=2020, test_size=0.2)    

Convert all inputs and labels into torch tensors, the required datatype 
for our model.

In [None]:

train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
test_inputs = torch.tensor(test_inputs)

train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
test_labels = torch.tensor(test_labels)

train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)
test_masks = torch.tensor(test_masks)

The torch `DataLoader` class creates an iterator over our data so that it is processed in batches, which saves memory during training.
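Conceptually, such an iterator shuffles the example indices (for training) and hands out one slice at a time. The following plain-Python sketch shows the idea; `iterate_batches` is a hypothetical stand-in, not the PyTorch implementation.

```python
import random


def iterate_batches(examples, batch_size, shuffle, seed=0):
    """Yield lists of at most batch_size examples, optionally in random order."""
    order = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


examples = list(range(50))  # stand-in for 50 encoded tweets
batches = list(iterate_batches(examples, batch_size=16, shuffle=True))
batch_sizes = [len(batch) for batch in batches]

print(batch_sizes)
```

The last batch is smaller whenever the number of examples is not a multiple of the batch size.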

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Define the batch size used by the DataLoaders.
# The BERT authors recommend a batch size of 16 or 32 for fine-tuning.

batch_size = 16

# Create the DataLoader for our training set.
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

# Create the DataLoader for our test set.
test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)


Load `BertForSequenceClassification`, the pretrained BERT model with a single linear classification layer on top. This is the general-purpose model class for sequence classification tasks. *(For more details, see https://huggingface.co/transformers/v2.2.0/model_doc/bert.html)*
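What a single linear classification layer does can be sketched in a few lines of plain Python: it maps one feature vector per tweet to one score (logit) per class. The 3-dimensional feature vector, the weights, and the bias below are made-up stand-ins for the much larger pooled BERT representation and the trained parameters.

```python
import math


def linear(features, weights, bias):
    """One logit per class: dot product of the features with each weight row, plus bias."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities (shifted by the maximum for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


pooled = [0.5, -1.0, 2.0]                      # hypothetical pooled tweet representation
weights = [[0.1, 0.2, -0.3], [-0.2, 0.1, 0.4]]  # one row per class (0 = negative, 1 = positive)
bias = [0.0, 0.1]

logits = linear(pooled, weights, bias)
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(logits, probs, predicted_class)
```

During fine-tuning, the weights of this layer start out untrained, while the layers below it start from the pretrained values.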

In [None]:
from transformers import BertForSequenceClassification, AdamW, BertConfig


model = BertForSequenceClassification.from_pretrained(
    pretrained_model,
    num_labels = 2, # The number of output labels: 2 for binary classification. Can be increased for multiclass.
    output_attentions = False,
    output_hidden_states = False
)

# Run this model on the GPU.
model.cuda()

In [None]:
# All of the model's parameters as a list of tuples.
params = list(model.named_parameters())

After loading the model, create the optimizer that updates the model weights during training.

We chose the following hyperparameter values based on the authors' recommendations:

*   Batch size: 16 (the authors recommend 16 or 32)
*   Learning rate (Adam): 2e-5
*   Number of epochs: 4

*(see for more details on AdamW Optimizer https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109)*
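For intuition, here is a sketch of one textbook AdamW update for a single scalar weight. The learning rate and `eps` are the values used in this notebook; the beta values and the zero weight decay are assumed defaults, and the Hugging Face implementation may differ in detail. `adamw_step` is a hypothetical helper.

```python
import math


def adamw_step(param, grad, m, v, step,
               lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    """One AdamW update for a scalar parameter (textbook form, illustrative only)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay, applied directly to the parameter.
    param = param - lr * weight_decay * param
    # Adaptive gradient step.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


new_param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(new_param)
```

On the first step the update has a size of roughly the learning rate, regardless of the gradient's magnitude.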

In [None]:
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, 
                  eps = 1e-8 
                )

In [None]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs (authors recommend between 2 and 4)
epochs = 4

# Total number of training steps is number of batches * number of epochs.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
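The effect of a linear schedule with warmup can be sketched as a multiplier on the base learning rate: it rises linearly during warmup and then decays linearly to zero. `linear_schedule_factor` is a hypothetical helper written for illustration, and the 100 training steps are a made-up number.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
example_total_steps = 100  # hypothetical number of training steps

lr_start = base_lr * linear_schedule_factor(0, 0, example_total_steps)
lr_middle = base_lr * linear_schedule_factor(50, 0, example_total_steps)
lr_end = base_lr * linear_schedule_factor(100, 0, example_total_steps)

print(lr_start, lr_middle, lr_end)
```

With zero warmup steps, as in this notebook, the learning rate starts at its full value and decreases with every step.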

In [None]:
# helper function for calculating accuracy
import numpy as np

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [None]:
# helper function for formatting elapsed times
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

Let's start the actual training process

In [None]:
import random

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

seed_val = 55

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

loss_values = []

# For each epoch:
for epoch_i in range(0, epochs):
    
    ## TRAIN
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # how long does the training epoch take.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_loss = 0

    # Turn on the training mode for the model. 
    model.train()

    # For each batch of training data:
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Format the elapsed time as hh:mm:ss.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack the batch and load onto the GPU.
        # Each batch contains input ids, attention masks, labels. 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Remove any previously calculated gradients before performing a
        # backward pass. 
        model.zero_grad()        

        # Do a forward pass. 
        outputs = model(b_input_ids, 
                    token_type_ids=None, 
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        
        
        loss = outputs[0]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. 
        total_loss += loss.item()

        # Do a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0. 
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step.
        optimizer.step()

        scheduler.step()

    # Avg loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)            
    
    # Need for plotting the learning curve later.
    loss_values.append(avg_train_loss)

    print("")
    print("  Avg training loss: {0:.2f}".format(avg_train_loss))
    print("  Epoch time: {:}".format(format_time(time.time() - t0)))
        
    ## VALIDATE
    # After each training epoch, we measure performance on
    # validation set.

    t0 = time.time()

    # Turn on the evaluation mode of the model.
    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask)
        
        # Get logit values predicted for each class.
        logits = outputs[0]

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Accuracy for this batch of validation tweets.
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Number of batches
        nb_eval_steps += 1

    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation time: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")
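The gradient clipping step in the training loop rescales all gradients together whenever their combined norm exceeds the threshold. Here is a plain-Python sketch of that idea for a flat list of gradient values; `clip_by_global_norm` is a hypothetical helper, and the small `eps` in the denominator is an assumption made to avoid division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by the same factor so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0])
norm_after = math.sqrt(sum(g * g for g in clipped))

small, _ = clip_by_global_norm([0.3, 0.4])  # norm 0.5, below the threshold

print(norm_before, norm_after, small)
```

Gradients whose norm is already below the threshold are left unchanged, so clipping only limits unusually large updates.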

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# Use plot styling from seaborn.
sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (12,6)

# Plot the learning curve.
plt.plot(loss_values, 'b-o')

# Label the plot.
plt.title("Training loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")

plt.show()


# Evaluate on Test Set

In [None]:
# Prediction on test set

print('Predicting labels for {:,} test tweets...'.format(len(test_inputs)))

# Turn on the evaluation mode of the model
model.eval()

predictions , true_labels = [], []

# Predict 
for batch in test_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  
  # Disable gradient computation to save memory and speed up prediction
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)

  logits = outputs[0]

  # Move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  
  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

print('    DONE.')

Predicting labels for 243 test tweets...
    DONE.


In [None]:
print('Positive samples: %d of %d (%.2f%%)' % (df.label_binary.sum(), len(df.label_binary), (df.label_binary.sum() / len(df.label_binary) * 100.0)))

Positive samples: 335 of 1215 (27.57%)


Performance measures

In [None]:
# Combine the per-batch logits into a single list with one row per tweet.
flat_predictions = [item for sublist in predictions for item in sublist]
# Take the index of the highest logit in each row as the predicted class (0 or 1).
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

# Combine the correct labels for each batch into a single list.
flat_true_labels = [item for sublist in true_labels for item in sublist]

Calculate the F1 score and the accuracy, and show the confusion matrix.
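To show how a weighted F1 score is built from confusion-matrix counts, here is a plain-Python sketch. The counts in `toy_confusion` are hypothetical and not the results of this notebook; each class's F1 score is weighted by the number of true examples of that class.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score of one class from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(confusion):
    """confusion[true_label][predicted_label] holds the count for that combination."""
    classes = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    score = 0.0
    for c in classes:
        tp = confusion[c][c]
        fn = sum(confusion[c][p] for p in classes if p != c)
        fp = sum(confusion[t][c] for t in classes if t != c)
        support = tp + fn  # number of true examples of class c
        score += support / total * f1_from_counts(tp, fp, fn)
    return score


# Hypothetical counts: 10 true positives-class tweets and 10 true negative-class tweets.
toy_confusion = {1: {1: 8, 0: 2}, 0: {1: 1, 0: 9}}

toy_accuracy = (8 + 9) / 20
toy_weighted_f1 = weighted_f1(toy_confusion)

print(toy_accuracy, toy_weighted_f1)
```

Because the weights come from the true labels, swapping true and predicted labels can change the weighted F1 score, while the accuracy stays the same.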

In [None]:
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

# scikit-learn expects the true labels first, then the predictions.
f1_value = f1_score(flat_true_labels, flat_predictions, average="weighted")
accuracy = accuracy_score(flat_true_labels, flat_predictions)

print("F1 Score (Weighted): {}".format(f1_value))
print("Accuracy: {}".format(accuracy))

F1 Score (Weighted): 0.9055715566326372
Accuracy: 0.9053497942386831


In [None]:
df.columns

In [None]:
def display_confusion_matrix(true_labels, predicted_labels, classes=[1,0]):
    
    cm = confusion_matrix(y_true=true_labels, 
                                  y_pred=predicted_labels, 
                                  labels=classes)
    cm_frame = pd.DataFrame(data=cm, 
                            columns=pd.MultiIndex(levels=[['Predicted:'], classes], 
                                                  codes=[[0,0],[0,1]]), 
                            index=pd.MultiIndex(levels=[['Actual:'], classes], 
                                                codes=[[0,0],[0,1]])) 
    return cm_frame   

In [None]:
confusion_mat = display_confusion_matrix(true_labels = flat_true_labels, predicted_labels = flat_predictions)

confusion_mat

|           | Predicted: 1 | Predicted: 0 |
|-----------|--------------|--------------|
| Actual: 1 | 55           | 12           |
| Actual: 0 | 11           | 165          |
