## Text Analytics - Knowledge Graph, BERT, spaCy, NLTK - Notebook 02

This notebook covers several language modeling and natural language processing (NLP) tools and methods.

References: \
https://arxiv.org/abs/1810.04805 \
https://cloud.google.com/ai-platform/training/docs/algorithms/bert-start \
https://www.kaggle.com/code/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk \
https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/ \
https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/ \
https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03

<b>Bidirectional Encoder Representations from Transformers | BERT </b>

BERT is a method of pre-training language representations. Pre-training refers to how BERT is first trained on a large source of text, such as Wikipedia. You can then apply the training results to other Natural Language Processing (NLP) tasks, such as question answering and sentiment analysis. 

BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. 

BERT was pre-trained by masking 15% of the input tokens and training the model to predict them. A second objective was to predict whether one sentence follows another.

<i>Side Note: Transformers (Attention Is All You Need); (Pre-Trained) Contextualized Word Embeddings (ELMO)</i>

<b>Architecture</b>:\
The original BERT model was developed and trained by Google using TensorFlow. BERT is released in two sizes: BERTBASE and BERTLARGE. The BASE model is sized so that its performance can be compared against other architectures, while the LARGE model produces the state-of-the-art results that were reported in the research paper. One of the main reasons for the good performance of BERT on different NLP tasks was the use of <b><u><i>Semi-Supervised Learning</i></u></b>. This means the model is first trained on a generic task over unlabeled text that enables it to learn the patterns of the language. After this training, the model (BERT) has language processing capabilities that can be used to empower other models that we build and train using supervised learning.

BERT is basically an Encoder stack of transformer architecture. A transformer architecture is an encoder-decoder network that uses self-attention on the encoder side and attention on the decoder side. 

BERTBASE has 12 layers in the Encoder stack while BERTLARGE has 24 layers in the Encoder stack. These are more than the Transformer architecture described in the original paper (6 encoder layers). The BERT architectures (BASE and LARGE) also have larger hidden sizes (768 and 1024 hidden units respectively) and more attention heads (12 and 16 respectively) than the Transformer suggested in the original paper, which has 512 hidden units and 8 attention heads. BERTBASE contains 110M parameters while BERTLARGE has 340M parameters.

In summary: \
BERT-Base: 12-layer Encoder, d = 768, 110M parameters \
BERT-Large: 24-layer Encoder, d = 1024, 340M parameters, where d is the dimensionality of the final hidden vector output by BERT. Both of these have a Cased and an Uncased version (the Uncased version converts all words to lowercase).
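
The parameter totals in the summary can be roughly reproduced from the layer counts and hidden sizes. The sketch below is a back-of-the-envelope count, not code from the BERT release. It assumes values that are not stated in this notebook: a 30,522-token vocabulary (as in the uncased models), 512 positions, 2 segment types, a feedforward inner size of 4 × d, and a pooler layer on top of [CLS].

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512,
                     type_vocab=2, ffn_mult=4):
    """Approximate parameter count of a BERT encoder (embeddings + layers + pooler).

    vocab_size, max_positions, type_vocab and ffn_mult are assumed defaults,
    not values taken from this notebook.
    """
    layer_norm = 2 * hidden  # one scale and one shift vector

    # Token, position, and segment embedding tables, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections (weights + biases).
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feedforward block: hidden -> ffn_mult * hidden -> hidden (weights + biases).
    inner = ffn_mult * hidden
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden) + layer_norm

    # Pooler: one dense layer applied to the [CLS] hidden state.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the counts land close to the 110M and 340M figures quoted above; most of the parameters sit in the encoder layers rather than in the embedding tables.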

The model first takes the [CLS] token as input, followed by the sequence of word tokens. Here [CLS] is a classification token. The input is then passed up through the encoder layers: each layer applies self-attention, passes the result through a feedforward network, and hands it off to the next encoder layer. For each input position, the model outputs a vector of the hidden size (768 for BERTBASE). If we want to build a classifier on top of this model, we can take the output corresponding to the [CLS] token.

<b>Why need such models?</b>\
Researchers have developed various techniques for training general purpose language representation models using the enormous piles of unannotated text on the web (this is known as pre-training). These general purpose pre-trained models can then be fine-tuned on smaller task-specific datasets, e.g., when working with problems like question answering and sentiment analysis. This approach results in great accuracy improvements compared to training on the smaller task-specific datasets from scratch. 
>Easy training, less data, good results 

<b>Core Idea:</b>
In the pre-BERT world, a language model would have looked at a text sequence during training either left-to-right or as a combination of separately trained left-to-right and right-to-left passes. The one-directional approach works well for generating sentences: we can predict the next word, append it to the sequence, then predict the word after that, until we have a complete sentence.

Now enters BERT, a language model which is bidirectionally trained (this is also its key technical innovation). This means we can now have a deeper sense of language context and flow compared to the single-direction language models.

Instead of predicting the next word in a sequence, BERT makes use of a novel technique called <b><u><i>Masked LM (MLM)</i></u></b>: it randomly masks words in the sentence and then it tries to predict them. Masking means that the model looks in both directions and it uses the full context of the sentence, both left and right surroundings, in order to predict the masked word. Unlike the previous language models, it takes both the previous and next tokens into account at the same time. The existing combined left-to-right and right-to-left LSTM based models were missing this “same-time part”. (It might be more accurate to say that BERT is non-directional though.)
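
To make the difference concrete, the sketch below lists which tokens each kind of model may look at when predicting one position. The helper names and the example sentence are made up for illustration; they are not part of BERT or any library.

```python
def left_to_right_context(tokens, i):
    """Tokens a left-to-right language model may use to predict position i."""
    return tokens[:i]


def bidirectional_context(tokens, i):
    """Tokens a masked language model may use to predict position i."""
    return tokens[:i] + tokens[i + 1:]


tokens = "the man went to the bank to deposit money".split()
target = tokens.index("bank")

print(left_to_right_context(tokens, target))
print(bidirectional_context(tokens, target))
```

Only the bidirectional context contains "deposit money", which is the part of the sentence that tells us "bank" is a financial institution rather than a river bank.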

<b>How does it work</b>?\
BERT relies on a Transformer (the attention mechanism that learns contextual relationships between words in a text). A basic Transformer consists of an encoder to read the text input and a decoder to produce a prediction for the task. Since BERT’s goal is to generate a language representation model, it only needs the encoder part. The input to the encoder for BERT is a sequence of tokens, which are first converted into vectors and then processed in the neural network. But before processing can start, BERT needs the input to be massaged and decorated with some extra metadata:

><u>Token embeddings</u>: A [CLS] token is added to the input word tokens at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence. \
<u>Segment embeddings</u>: A marker indicating Sentence A or Sentence B is added to each token. This allows the encoder to distinguish between sentences. \
<u>Positional embeddings</u>: A positional embedding is added to each token to indicate its position in the sentence.

<b>BERT is pre-trained on 2 NLP Tasks</b>: 
>1. Masked Language Modeling
2. Next Sentence Prediction

<b>MLM</b>:\
BERT is designed as a deeply bidirectional model. The network captures information from both the left and right context of a token, from the first layer all the way through to the last layer. Traditionally, language models were either trained to predict the next word in a sentence (left-to-right context, as used in GPT) or combined separately trained left-to-right and right-to-left models (as in ELMo). Both approaches are susceptible to errors due to loss of information.

Let us take an example to understand it better: Let’s say we have a sentence – “I love to read data science blogs on Kaggle”. We want to train a bi-directional language model. Instead of trying to predict the next word in the sequence, we can build a model to predict a missing word from within the sequence itself. Let’s replace “Kaggle” with “[MASK]”. This is a token to denote that the token is missing. We’ll then train the model in such a way that it should be able to predict “Kaggle” as the missing token: “I love to read data science blogs on [MASK].” This is the crux of a Masked Language Model. The authors of BERT also include some caveats to further improve this technique: To prevent the model from focusing too much on a particular position or tokens that are masked, the researchers randomly masked 15% of the words.

The masked words were not always replaced by the masked tokens [MASK] because the [MASK] token would never appear during fine-tuning. So, the researchers used the below technique:
1. 80% of the time the words were replaced with the masked token [MASK]
2. 10% of the time the words were replaced with random words
3. 10% of the time the words were left unchanged
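
The 80/10/10 rule can be sketched in a few lines. This is a simplified illustration, not the original pre-processing code: it selects each token independently with probability 0.15, and the function name, the toy vocabulary, and the use of `None` for positions that are not predicted are all assumptions made here.

```python
import random


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted_tokens, labels); labels[i] is None where nothing is predicted."""
    corrupted, labels = [], []
    for token in tokens:
        # Special tokens are never selected; other tokens are selected with mask_prob.
        if token in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            corrupted.append(token)
            labels.append(None)
            continue

        labels.append(token)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            corrupted.append(token)              # 10%: keep the word unchanged
    return corrupted, labels


toy_vocab = ["i", "love", "to", "read", "data", "science", "blogs", "on", "kaggle"]
sentence = ["[CLS]"] + toy_vocab + ["[SEP]"]
corrupted, labels = mask_tokens(sentence, toy_vocab, random.Random(0))
print(corrupted)
print(labels)
```

Because the model cannot tell which of the unmasked words were swapped or kept, it has to maintain a useful representation for every input token, not only for [MASK].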

<b>NSP</b>:\
Masked Language Models (MLMs) learn to understand the relationship between words. Additionally, BERT is trained on Next Sentence Prediction (NSP) for tasks that require an understanding of the relationship between sentences. A pre-trained model with this kind of understanding is relevant for tasks like question answering. During training, the model receives pairs of sentences as input and learns to predict whether the second sentence follows the first in the original text.

As we have seen earlier, BERT separates sentences with a special [SEP] token. During training the model is fed with two input sentences at a time such that:
1. 50% of the time the second sentence comes after the first one.
2. 50% of the time it is a random sentence from the full corpus.

BERT is then required to predict whether the second sentence is random or not, with the assumption that the random sentence will be disconnected from the first sentence. To predict if the second sentence is connected to the first one or not, basically the complete input sequence goes through the Transformer based model, the output of the [CLS] token is transformed into a 2×1 shaped vector using a simple classification layer, and the IsNext-Label is assigned using softmax. The model is trained with both Masked LM and Next Sentence Prediction together. This is to minimize the combined loss function of the two strategies — “together is better”.
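
As a rough illustration of how such training pairs could be assembled, the sketch below builds NSP examples from a short document. The function name and data are hypothetical, and it assumes the corpus contains at least two distinct sentences so that a true negative can always be drawn.

```python
import random


def make_nsp_pairs(sentences, corpus, rng):
    """Build (sentence_a, sentence_b, is_next) examples from consecutive sentences."""
    examples = []
    for first, actual_next in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows.
            examples.append((first, actual_next, True))
        else:
            # Negative example: a random corpus sentence that is not the real successor.
            candidate = rng.choice(corpus)
            while candidate == actual_next:
                candidate = rng.choice(corpus)
            examples.append((first, candidate, False))
    return examples


document = ["I went to the store.", "I bought some milk.", "Then I walked home."]
corpus = document + ["Penguins cannot fly.", "The train was late."]

for sentence_a, sentence_b, is_next in make_nsp_pairs(document, corpus, random.Random(0)):
    print(f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP] -> IsNext={is_next}")
```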

<b>Applications:</b>
1. Natural Language Inference
2. Sentiment Analysis
3. Question Answering
4. Paraphrase Detection
5. Linguistic Acceptability

Let's try Text Classification using BERT:

In [1]:
# !pip3 install pytorch-pretrained-bert pytorch-nlp

In [5]:
import tensorflow as tf
import torch

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from pytorch_pretrained_bert import BertTokenizer, BertConfig
from pytorch_pretrained_bert import BertAdam, BertForSequenceClassification
from tqdm import tqdm, trange
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

<b>Dataset:</b>

The Corpus of Linguistic Acceptability (CoLA) dataset for single sentence classification.
It's a set of sentences labeled as grammatically correct or incorrect. The data is as follows:
>Column 1: the code representing the source of the sentence. \
Column 2: the acceptability judgment label (0=unacceptable, 1=acceptable). \
Column 3: the acceptability judgment as originally notated by the author. \
Column 4: the sentence.

In [6]:
df = pd.read_csv("cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

In [7]:
df.shape

(8551, 4)

In [9]:
df.sample(5)

| | sentence_source | label | label_notes | sentence |
|---|---|---|---|---|
| 2038 | rhl07 | 1 | | I sent the package all the way around the world. |
| 1992 | r-67 | 1 | | That anybody ever left at all is impossible. |
| 6621 | m_02 | 1 | | The truck spread salts. |
| 7494 | sks13 | 1 | | Mary sent Bill a book,…. |
| 5761 | c_13 | 1 | | I didn't read a single book the whole time I was in the library. |


In [16]:
# create sentences and label lists 
sentences = df.sentence.values

# add special tokens at the beginning and end of each sentence for BERT to work properly
sentences = ["[CLS] " + sentence + " [SEP] " for sentence in sentences]
labels = df.label.values

In [18]:
# import the BERT tokenizer, used to convert our text into tokens that correspond to BERT's vocabulary

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case = True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print("First sentence tokenized: ", tokenized_texts[0])

First sentence tokenized:  ['[CLS]', 'our', 'friends', 'won', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.', '[SEP]']
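
BERT's tokenizer splits words it does not know into sub-word pieces from its vocabulary using a greedy longest-match-first strategy (WordPiece). The sketch below illustrates the idea on a single word with a tiny made-up vocabulary; it is not the `BertTokenizer` implementation, and the real vocabulary is far larger.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "play", "##ing", "the"}  # hypothetical vocabulary
print(wordpiece("unaffable", toy_vocab))
print(wordpiece("playing", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```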


BERT requires specifically formatted inputs. For each tokenized input sentence, we need to create:
1. input ids: a sequence of integers identifying each input token to its index number in the BERT tokenizer vocabulary
2. segment mask: (optional) a sequence of 1s and 0s used to identify whether the input is one sentence or two sentences long. For one sentence inputs, this is simply a sequence of 0s. For two sentence inputs, there is a 0 for each token of the first sentence, followed by a 1 for each token of the second sentence
3. attention mask: (optional) a sequence of 1s and 0s, with 1s for all input tokens and 0s for all padding tokens (we'll detail this in the next paragraph)
4. labels: a single value of 1 or 0. In our task 1 means "grammatical" and 0 means "ungrammatical"
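
The special tokens and the segment mask described in the list can be sketched as follows. The helper name is hypothetical, and the returned position indices are included only to show what the positional embeddings index into.

```python
def build_bert_inputs(tokens_a, tokens_b=None):
    """Add [CLS]/[SEP] and return (tokens, segment_ids, position_ids)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # first sentence, including [CLS] and its [SEP]

    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second sentence and its [SEP]

    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_inputs(["my", "dog"], ["he", "barks"])
print(tokens)
print(segment_ids)
print(position_ids)
```

For a single-sentence task such as the one in this notebook, the segment mask is all zeros, which is why it can be left out.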

Although input sentences can vary in length, BERT requires the input arrays to be the same size. So we choose a maximum sentence length and pad or truncate each input as required.

pad_sequences is a utility function that we're borrowing from Keras. It simply handles the truncating and padding of Python lists.

In [19]:
# maximum sequence length; the original paper uses up to 512 tokens
max_len = 128

# using bert tokenizer to convert the tokens to their index numbers in the bert vocab
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

# padding the input token
input_ids = pad_sequences(input_ids, maxlen = max_len, dtype = "long", truncating = 'post', padding = 'post')

# create attention masks
attention_masks = []

# create a mask of 1s for each token followed by 0s for padding 
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)
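
The padding and masking steps can also be written without Keras. The sketch below is a plain-Python stand-in for post-padding and post-truncation; the helper names and the token ids are illustrative only, and it assumes, as the cell above does, that the padding id is 0.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Cut the sequence to max_len, then pad on the right with pad_id."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


def attention_mask(padded_ids, pad_id=0):
    """1.0 for real tokens, 0.0 for padding."""
    return [1.0 if token_id != pad_id else 0.0 for token_id in padded_ids]


example_ids = [101, 7, 8, 102]  # illustrative ids, not looked up in the real vocabulary
padded = pad_or_truncate(example_ids, max_len=6)
print(padded)
print(attention_mask(padded))
```

One consequence of truncating at the end is that a sentence longer than the maximum length loses its trailing [SEP] token.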

<b>Splitting data into train, validation sets for training:</b>

In [20]:
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
    input_ids, labels, random_state = 2018, test_size = 0.1)
train_masks, validation_masks, _, _ = train_test_split(
    attention_masks, input_ids, random_state = 2018, test_size = 0.1)

<b>Converting all data into torch tensors:</b>

In [21]:
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

In [22]:
# selecting batch size, authors recommend 16 or 32 for fine-tuning bert for specific task
batch_size = 32

# creating an iterator of our data with torch dataloader, helps save on memory during training
# unlike a for loop, with an iterator the entire dataset doesn't need to be loaded into the memory
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler = train_sampler, batch_size = batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler = validation_sampler, batch_size=batch_size)

<u><b>Model Training:</b></u>

For this task, we first modify the pre-trained BERT model to give outputs for classification, and then continue training the model on our dataset until the entire model, end-to-end, is well-suited for our task. Thankfully, the huggingface pytorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained BERT model, each has different top layers and output types designed to accommodate its specific NLP task. We'll load BertForSequenceClassification. This is the normal BERT model with a single linear layer added on top for classification, which we will use as a sentence classifier. As we feed in input data, the entire pre-trained BERT model and the additional untrained classification layer are trained on our specific task.

<b>Fine-Tuning Structure & Process:</b> \
The first token of every sequence is the special classification token ([CLS]). Unlike the hidden state vector corresponding to a normal word token, the hidden state corresponding to this special token is designated by the authors of BERT as an aggregate representation of the whole sentence used for classification tasks. As such, when we feed in an input sentence to our model during training, the output is the length 768 hidden state vector corresponding to this token. The additional layer that we've added on top consists of untrained linear neurons of size [hidden_state, number_of_labels], so [768,2], meaning that the output of BERT plus our classification layer is a vector of two numbers representing the "score" for "grammatical/non-grammatical" that are then fed into cross-entropy loss.
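
The classification head amounts to a small piece of arithmetic: one linear layer on the [CLS] vector, followed by softmax and cross-entropy. The sketch below works through it with a 4-dimensional stand-in for the 768-dimensional hidden state. All numbers are made up, and the weights are stored as one row per label rather than as a [768, 2] matrix.

```python
import math


def classify(cls_vector, weights, bias):
    """Linear layer: one logit per label from the [CLS] hidden state."""
    return [sum(h * w for h, w in zip(cls_vector, row)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct label."""
    return -math.log(softmax(logits)[label])


cls_vector = [0.5, -1.0, 0.25, 2.0]   # stand-in for the 768-dim [CLS] state
weights = [[0.1, 0.2, -0.3, 0.4],     # weights for label 0 ("ungrammatical")
           [-0.2, 0.1, 0.5, 0.3]]     # weights for label 1 ("grammatical")
bias = [0.0, 0.1]

logits = classify(cls_vector, weights, bias)
print("logits:", logits)
print("probabilities:", softmax(logits))
print("loss if the true label is 1:", cross_entropy(logits, 1))
```

During fine-tuning, the gradient of this loss flows back through the linear layer and into every BERT layer beneath it.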

Because the pre-trained BERT layers already encode a lot of information about the language, training the classifier is relatively inexpensive. Rather than training every layer in a large model from scratch, it's as if we have already trained the bottom layers 95% of where they need to be, and only really need to train the top layer, with a bit of tweaking going on in the lower levels to accommodate our task. Sometimes practitioners will opt to "freeze" certain layers when fine-tuning, or to apply different learning rates, apply diminishing learning rates, etc., all in an effort to preserve the good quality weights in the network and speed up training (often considerably). In fact, recent research on BERT specifically has demonstrated that freezing the majority of the weights results in only minimal accuracy declines, but there are exceptions and broader rules of transfer learning that should also be considered. For example, if your task and fine-tuning dataset are very different from the dataset used to train the transfer learning model, freezing the weights may not be a good idea. We'll cover the broader scope of transfer learning in NLP in a future post.

Now, let's load BERT. There are a few different pre-trained BERT models available. "bert-base-uncased" means the version that has only lowercase letters ("uncased") and is the smaller version of the two ("base" vs "large").

In [23]:
# Loading BertForSequenceClassification
# pretrained BERT model with a single linear classification layer on top

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

100%|████████████████████████| 407873900/407873900 [00:38<00:00, 10579582.38B/s]


Now that we have our model loaded we need to grab the training hyperparameters from within the stored model. For the purposes of fine-tuning, the authors recommend the following hyperparameter ranges:

Batch size: 16, 32 \
Learning rate (Adam): 5e-5, 3e-5, 2e-5 \
Number of epochs: 2, 3, 4

In [24]:
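param_optimizer = list(model.named_parameters())
# biases and LayerNorm parameters are excluded from weight decay
# (in pytorch-pretrained-bert 0.6.x the LayerNorm parameters are named weight/bias,
#  and BertAdam reads the 'weight_decay' key of each parameter group)
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
                     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
                     'weight_decay': 0.0}]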

In [25]:
# the optimizer holds the hyperparameter information our training loop needs
# note: without t_total, BertAdam does not apply the warmup schedule (see the warning below)
optimizer = BertAdam(optimizer_grouped_parameters, lr=2e-5, warmup=.1)

t_total value of -1 results in schedule not being applied
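
The warning above appears because the optimizer was not told how many training steps to expect, so it cannot place the warmup phase. A minimal sketch of the missing arithmetic follows. The schedule function is a generic linear warmup followed by linear decay, written for illustration; it is not necessarily the exact formula used inside `BertAdam`, and both function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps: batches per epoch times number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_warmup_decay(step, total_steps, warmup=0.1):
    """Learning-rate multiplier: ramps up over the first `warmup` fraction, then decays to 0."""
    progress = step / total_steps
    if progress < warmup:
        return progress / warmup
    return max(0.0, (1.0 - progress) / (1.0 - warmup))


# Roughly 90% of the 8,551 CoLA training sentences, batch size 32, 2 epochs.
steps = total_training_steps(num_examples=7695, batch_size=32, epochs=2)
print("total steps:", steps)

for step in (0, steps // 20, steps // 10, steps // 2, steps):
    print(step, round(2e-5 * linear_warmup_decay(step, steps), 8))
```

With a step count like this available, it could be passed to the optimizer as `t_total` so that `warmup=.1` takes effect.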


For each pass in the training loop we have a training phase and a validation phase.

At each pass we need to:

<b>Training loop:</b>
<li>Put the model in train mode (this enables training-only behavior such as dropout)
<li>Unpack our data inputs and labels
<li>Load data onto the GPU for acceleration
<li>Clear out the gradients calculated in the previous pass. In pytorch the gradients accumulate by default (useful for things like RNNs) unless you explicitly clear them out
<li>Forward pass (feed input data through the network)
<li>Backward pass (backpropagation)
<li>Tell the network to update parameters with optimizer.step()
<li>Track variables for monitoring progress</li>


<b>Evaluation loop:</b>
<li>Put the model in evaluation mode (this disables dropout); gradient computation is switched off separately with torch.no_grad()
<li>Unpack our data inputs and labels
<li>Load data onto the GPU for acceleration
<li>Forward pass (feed input data through the network)
<li>Compute loss on our validation data and track variables for monitoring progress

In [26]:
# function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis = 1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [None]:
t = []

# storing loss, accuracy for plotting
train_loss_set = []

# number of training epochs
epochs = 2

# trange --> tqdm wrapper around basic python range
for _ in trange(epochs, desc = "Epoch"):  
    
    ############    
    # training #
    ############
    
    # set our model to training mode 
    model.train()
    
    # tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    
    # train data for one epoch
    for step, batch in enumerate(train_dataloader):
        # add batch to GPU
        # batch = tuple(t.to(device) for t in batch)
        
        # unpack inputs from dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # clear out gradients, they accumulate by default
        optimizer.zero_grad()
        
        # forward pass
        loss = model(b_input_ids, token_type_ids = None, attention_mask = b_input_mask, labels = b_labels)
        train_loss_set.append(loss.item())
        
        # backward pass
        loss.backward()
        
        # update parameters and take a step using the computed gradient
        optimizer.step()
        
        # update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
    
    print("Train loss: {}".format(tr_loss/nb_tr_steps))
    
    ##############    
    # validation #
    ##############
    
    # set model to evaluation mode
    model.eval()
    
    # tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    
    # evaluate data for one epoch
    for batch in validation_dataloader:
        # add batch to GPU
        # batch = tuple(t.to(device) for t in batch)
        
        # unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # telling model NOT to compute or store gradients, saving memory and speeding up validation
        with torch.no_grad():
            # forward pass, calculate logit predictions
            logits = model(b_input_ids, token_type_ids = None, 
                               attention_mask = b_input_mask)
            
        # move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
            
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
            
        eval_accuracy+=tmp_eval_accuracy
        nb_eval_steps+=1
    
    print("Validation accuracy: {}".format(eval_accuracy/nb_eval_steps))
 

<b>Training Evaluation:</b>

In [None]:
# training loss over all batches
plt.title("Training Loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

<b>Prediction & Evaluation on Holdout Set:</b>\
Load the holdout dataset and prepare the inputs as before. Predictions are evaluated using the Matthews correlation coefficient (MCC), the metric used by the wider NLP community to evaluate performance on CoLA: +1 is the best score and -1 is the worst.
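
For reference, the binary MCC is computed from the four cells of the confusion matrix. The sketch below is a plain-Python version written for illustration (the notebook itself uses scikit-learn's implementation later on); returning 0.0 when the denominator is zero is a convention adopted here.

```python
import math


def matthews_corrcoef_binary(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # undefined case, e.g. only one class predicted
    return (tp * tn - fp * fn) / denominator


print(matthews_corrcoef_binary([1, 0, 1, 0], [1, 0, 1, 0]))  # perfect agreement
print(matthews_corrcoef_binary([1, 0, 1, 0], [0, 1, 0, 1]))  # perfect disagreement
print(matthews_corrcoef_binary([1, 1, 1, 0], [1, 0, 1, 1]))  # worse than chance
```

Unlike plain accuracy, MCC stays informative when the classes are imbalanced, which matters for CoLA because most sentences are labeled acceptable.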

In [None]:
df = pd.read_csv("cola_public/raw/out_of_domain_dev.tsv", delimiter='\t', header=None, 
                         names=['sentence_source', 'label', 'label_notes', 'sentence'])

In [None]:
# create sentence and label list
sentences = df.sentence.values

# add special tokens at first & last of each sentence for BERT to work properly
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

max_len = 128

# using BERT tokenizer to convert token to their index numbers in the BERT vocab
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

# padding the input
input_ids = pad_sequences(input_ids, maxlen = max_len, dtype = "long", 
                                 truncating = "post", padding = "post")

# create attention mask
attention_masks = []

# create a mask of 1s for each token followed by 0s for padding 
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)
    
prediction_inputs = torch.tensor(input_ids)
prediction_masks = torch.tensor(attention_masks)
prediction_labels = torch.tensor(labels)

batch_size = 32

prediction_data = TensorDataset(prediction_inputs, prediction_masks, prediction_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler = prediction_sampler, batch_size = batch_size)


In [None]:
# prediction on test set

# put the model in evaluation mode
model.eval()

# tracking variables 
predictions, true_labels = [], []

# predict 
for batch in prediction_dataloader:
    # add batch to GPU 
    # batch = tuple(t.to(device) for t in batch)
    # unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    
    # telling the model not to compute or store gradients, saving memory and speeding up predictions
    with torch.no_grad():
        # forward pass, calculate logit predictions
        logits = model(b_input_ids, token_type_ids = None, attention_mask = b_input_mask)
    
    # move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    
    # store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)


<b>Import and evaluate each test batch using the Matthews correlation coefficient:</b>

In [None]:
from sklearn.metrics import matthews_corrcoef
matthews_set = []

for i in range(len(true_labels)):
    matthews = matthews_corrcoef(true_labels[i],
                 np.argmax(predictions[i], axis=1).flatten())
    matthews_set.append(matthews)

Other NLP tools continued in next notebook. 