# BERT-based models: Step by step

## BERT-based models are **transformers**. They transform text into *contextualized embeddings*: vector representations of sentences that capture each word's semantic meaning.

![](https://imgur.com/dFCbxsY.jpg)

## So we use these models to "read" text. Once the text is "read", we can do three tasks: classify it (for example, by sentiment), answer questions about it, or recognize the named entities in it.

## Task 1: Classification

![](https://imgur.com/pDMub6B.jpg)

## Task 2: Question Answering

![](https://imgur.com/MzIIHCp.jpg)

## Task 3: Named Entity Recognition

# Example 1: Classification

# Step 1: Import data, split into training and validation set (nothing specific to BERT)

Here we just read in the data, as we would for any other machine learning task.

In [None]:
import pandas as pd
import numpy as np

train = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
test = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/test.csv')
sample = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/sample_submission.csv')

train = train.dropna()

In [None]:
# Convert text labels to integers
from sklearn import preprocessing

features = train['text']

encoder = preprocessing.LabelEncoder()
y = encoder.fit_transform(train['sentiment'])

In [None]:
# Split into a training set and a validation set

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(features, y)

# Step 2: Create inputs that the BERT model can read

![](https://imgur.com/PITly25.jpg)

## 2a. Tokenization

It's not necessarily the case that each word in a text gets its own token. Though this is often true, the tokenizer needs to be able to handle text that it has never seen before, such as misspellings or a string like "asdfa".

Text is encoded using a **vocab file** and a **merges file**. It works like this: the text is broken down to the character level, the characters are merged into larger pieces (sometimes, but not necessarily, whole words) according to the rules in the merges file, and the pieces are then converted to integers using the vocab file.

![](https://imgur.com/WkMWjRV.jpg)
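To make the merge idea concrete, here is a minimal sketch of the character-to-pieces-to-integers process. The merge rules and vocabulary below are made up for illustration; the real RoBERTa files are far larger and work on bytes rather than plain characters, and the function `bpe_encode` is a hypothetical helper, not part of the transformers library.

```python
def bpe_encode(word, merges, vocab):
    """Toy encoding: split into characters, apply each merge rule in order, map pieces to ids."""
    pieces = list(word)
    for left, right in merges:  # rules earlier in the list have higher priority
        merged = []
        i = 0
        while i < len(pieces):
            # Join the two neighbouring pieces if they match the current rule
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces, [vocab[p] for p in pieces]

# Hypothetical merge rules and vocabulary, invented for this example
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
toy_vocab = {"low": 0, "er": 1, "l": 2, "o": 3, "w": 4,
             "e": 5, "r": 6, "s": 7, "t": 8, "lo": 9}

pieces, ids = bpe_encode("lower", toy_merges, toy_vocab)
print(pieces, ids)

# A string the rules were never written for still gets encoded, piece by piece
unseen_pieces, unseen_ids = bpe_encode("slow", toy_merges, toy_vocab)
print(unseen_pieces, unseen_ids)
```

Note how "slow" is still encodable even though no rule produces it as a whole word: it simply falls back to smaller pieces.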

In [None]:
import transformers

# Tokenizer does the encoding to create the input ids
from transformers import RobertaTokenizer

# Use the 'Roberta Vocab File' dataset to get the vocab and merge files
tokenizer = RobertaTokenizer(vocab_file = '/kaggle/input/roberta-vocab-file/vocab.json',
                            merges_file = '/kaggle/input/roberta-vocab-file/merge.txt',
                            lowercase = True,
                            add_prefix_space = True)

## 2b. Creating input_ids and attention_mask

Now we can use the tokenizer to encode the text from the dataset. 

The maximum input length of a BERT-based model is 512 tokens. You can set MAX_LENGTH lower to save memory: encode every text string in the training, validation, and test datasets, and use the length of the longest encoding.
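As a rough sketch of that check, assuming you already have the encoded id lists for every row (here replaced by made-up ids, with `longest_encoding` and `MODEL_LIMIT` as hypothetical names), the shortest safe length could be found like this:

```python
def longest_encoding(encoded_texts):
    """Return the length of the longest encoded sequence."""
    return max(len(ids) for ids in encoded_texts)

# Made-up token ids standing in for tokenizer.encode(text) on each row
toy_encodings = [
    [0, 100, 200, 2],
    [0, 100, 200, 300, 400, 2],
    [0, 100, 2],
]

MODEL_LIMIT = 512  # the model's maximum input length
max_length = min(longest_encoding(toy_encodings), MODEL_LIMIT)
print(max_length)
```

In practice you would run this over the training, validation, and test texts together, so that no row is longer than the arrays you allocate.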

Then we'll initialize arrays for the input_ids and attention_mask, based on the size of our data. 

![](https://imgur.com/FxGhdDm.jpg)

# ** How to figure out what padding token to use

In [None]:
MAX_LENGTH = 512

# input_ids starts out filled with ones because the padding token id is 1 for RoBERTa.
# It might be 0 for other BERT-based models; tokenizer.pad_token_id tells you which id to use
train_input_ids = np.ones((X_train.shape[0], MAX_LENGTH), dtype = 'int32')
train_attention_mask = np.zeros((X_train.shape[0], MAX_LENGTH), dtype = 'int32')

Now, loop through the training data, use the tokenizer to encode it, and set the mask to 1 at those same locations.

![](https://imgur.com/7dNHM54.jpg)

In [None]:
for k in range(X_train.shape[0]):
    encode = tokenizer.encode(X_train.iloc[k])
    train_input_ids[k, :len(encode)] = encode
    train_attention_mask[k, :len(encode)] = 1
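To see what the resulting arrays look like, here is a small self-contained version of the same loop. The token ids are made up and the maximum length is shortened to 6 so the output fits on one line; only the padding id of 1 comes from the notebook.

```python
import numpy as np

PAD_ID = 1       # RoBERTa's padding id, as noted above
max_length = 6   # shortened for readability

# Made-up encodings standing in for tokenizer.encode(...)
toy_encodings = [[0, 713, 16, 2], [0, 100, 657, 24, 328, 2]]

input_ids = np.full((len(toy_encodings), max_length), PAD_ID, dtype="int32")
attention_mask = np.zeros((len(toy_encodings), max_length), dtype="int32")

for k, ids in enumerate(toy_encodings):
    input_ids[k, :len(ids)] = ids       # real tokens go at the front
    attention_mask[k, :len(ids)] = 1    # 1 marks a real token, 0 marks padding

print(input_ids)
print(attention_mask)
```

The first row ends in two padding ids with a mask of 0 over them, which is how the model knows to ignore those positions.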

Now we have inputs BERT-based models can actually use.

# Step 3: Construct a PyTorch Neural Network

## 3a. Create torch data loaders

In order to use these in a PyTorch neural network, we need to create PyTorch dataloaders.
To get there from numpy arrays, we do:

Numpy array -> Torch tensor -> Torch dataset -> Torch dataloader

In [None]:
import torch

In [None]:
# When we build the torch dataloader, we need to pass in the batch size
batch_size = 8

# Make tensors. Making the data type long is important, since there will be an error without it
train_input_ids = torch.tensor(train_input_ids, dtype = torch.long)
train_attention_mask = torch.tensor(train_attention_mask, dtype = torch.long)
train_label = torch.tensor(y_train, dtype = torch.long)

# Make a torch dataset
train_t = torch.utils.data.TensorDataset(train_input_ids, train_attention_mask, train_label)

# Make a torch dataloader.
train_loader = torch.utils.data.DataLoader(train_t, batch_size = batch_size)

## 3b. Build a neural network

In [None]:
# For the neural network
import torch.nn as nn

# For the RobertaConfig and the RobertaModel
from transformers import RobertaConfig, RobertaModel

# For some elements of the neural network
import torch.nn.functional as F

# ** What are some other configurations/ways to initialize?

In [None]:
# The Roberta Base dataset has a configuration file for roberta
PATH = '/kaggle/input/roberta-base/'
config = RobertaConfig.from_pretrained(PATH + 'config.json')

In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        # The base RoBERTa model
        self.roberta = RobertaModel.from_pretrained(PATH + 'pytorch_model.bin', config = config)
        
        # Update weights during training
        for param in self.roberta.parameters():
            param.requires_grad = True
        
        # A dropout layer
        self.drop_out = nn.Dropout()
        
        # A fully connected layer. 768 is the size of the output, and 3 is the number of classes
        self.fc = nn.Linear(768, 3)
        
    def forward(self, input_ids, input_mask):
        
        # Get the RoBERTa output. Indexing with [0] picks out the last hidden state
        last_hidden_state = self.roberta(input_ids, attention_mask = input_mask)[0]
        
        # Get the CLS token, which holds the embedding for the text as a whole
        last_hidden_state = last_hidden_state[:, 0, :] # indices are : (all batches), 0 (for the CLS token), : (all 768 elements of the output)
        
        # Dropout and fully connected layers
        out = self.drop_out(last_hidden_state)
        out = self.fc(out)
        
        return out

# Step 4: Training

In [None]:
# Initialize the network
model = Net()

# Set the neural network to run on the GPU
model.cuda()

## 4a. Pick hyperparameters

In addition to the structure of the neural network, you choose the learning rate and loss function. It's typical to use cross-entropy loss for classification tasks, but different learning rates should be tested with cross validation.

# ** What is the difference between optimizers??

In [None]:
# Pick a learning rate. This is a parameter you can tune yourself with cross-validation
learning_rate = 1e-5

# Pick a loss function. Usually crossentropy loss for classification tasks
loss_fn = torch.nn.CrossEntropyLoss()

# Pick an optimizer. This determines how the neural network converges to a solution
opt = torch.optim.Adam(model.parameters(), lr = learning_rate)

# Pick a number of epochs for which to train the model
n_epochs = 2

## 4b. Train the model

In [None]:
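# from_pretrained leaves the RoBERTa layers in evaluation mode, so switch dropout back on
model.train()

for epoch in range(n_epochs):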
    
    for i, batch in enumerate(train_loader):
        
        # A batch from the data loader. It has the input_ids, attention_mask, and labels
        # Make sure to send each of these things to the GPU
        batch = tuple(t.cuda() for t in batch)
        
        # Extract the input_ids, attention_mask, and labels from the batch
        input_ids = batch[0]
        attention_mask = batch[1]
        labels = batch[2]
        
        # Use the model to make predictions
        y_pred = model(input_ids, attention_mask)

        # Calculate loss using the chosen loss function
        loss = loss_fn(y_pred, labels)
        
        # Backpropagate and update the weights
        opt.zero_grad()
        loss.backward()
        opt.step()
        
        # Status updates
        print('Epoch {}/{} | Batch {}/{} | Loss: {:.4f}'.format(
                epoch + 1, n_epochs, i + 1, len(train_loader), loss.item()))

# Step 5: Validation

All the steps done to convert the training data into the pytorch data loader need to be repeated for the validation set. Some people build the data loaders for the training, validation, and test sets at the same point in their notebook, but I've spread them out here so that we can go step-by-step.

I won't comment each of these steps, since they're the same steps we followed above for the training set.

## 5a. Build PyTorch datasets for validation data

In [None]:
ct = X_val.shape[0]

val_input_ids = np.ones((ct, MAX_LENGTH), dtype = 'int32')
val_attention_mask = np.zeros((ct, MAX_LENGTH), dtype = 'int32')

for k in range(X_val.shape[0]):
    encode = tokenizer.encode(X_val.iloc[k])
    val_input_ids[k, :len(encode)] = encode
    val_attention_mask[k, :len(encode)] = 1


val_input_ids = torch.tensor(val_input_ids, dtype = torch.long)
val_attention_mask = torch.tensor(val_attention_mask, dtype = torch.long)
val_label = torch.tensor(y_val, dtype = torch.long)

val_t = torch.utils.data.TensorDataset(val_input_ids, val_attention_mask, val_label)

val_loader = torch.utils.data.DataLoader(val_t, batch_size = batch_size)

## 5b. Make predictions for the validation set

In [None]:
# Holds the predictions
preds = []

# Set the model to evaluation mode
model.eval()

# torch.no_grad turns off gradient tracking, which saves memory during inference
with torch.no_grad():

    # Code below here is the same as during training
    for i, batch in enumerate(val_loader):

        # Get a batch from the validation data loader
        batch = tuple(t.cuda() for t in batch)
        input_ids = batch[0]
        attention_mask = batch[1]
        labels = batch[2]

        # Make predictions, which will be probabilities of being in each class
        y_pred = model(input_ids, attention_mask)

        # Track the predictions 
        preds[i * batch_size:(i + 1) * batch_size] = F.softmax(y_pred, dim=1).detach().cpu().numpy()


## 5c. Evaluate performance

You can evaluate the performance using any metric you'd like. Here, I use sklearn's classification report.

In [None]:
from sklearn.metrics import classification_report

In [None]:
y_pred = np.argmax(preds, axis = 1)

# classification_report expects the true labels first, then the predictions
print(classification_report(y_val, y_pred))

# Step 6: Make predictions for the test set

Once you think you've maximized the validation score, use the model to make predictions for the test set, following the same steps that we did in Step 5.

In [None]:
X_test = test['text']

In [None]:
ct = X_test.shape[0]

test_input_ids = np.ones((ct, MAX_LENGTH), dtype = 'int32')
test_attention_mask = np.zeros((ct, MAX_LENGTH), dtype = 'int32')

for k in range(X_test.shape[0]):
    encode = tokenizer.encode(X_test.iloc[k])
    test_input_ids[k, :len(encode)] = encode
    test_attention_mask[k, :len(encode)] = 1

# Note that there's no label for the test set
test_input_ids = torch.tensor(test_input_ids, dtype = torch.long)
test_attention_mask = torch.tensor(test_attention_mask, dtype = torch.long)

test_t = torch.utils.data.TensorDataset(test_input_ids, test_attention_mask)

test_loader = torch.utils.data.DataLoader(test_t, batch_size = batch_size)

In [None]:
# Holds the predictions
test_preds = []

# Set the model to evaluation mode
model.eval()

# torch.no_grad turns off gradient tracking, which saves memory during inference
with torch.no_grad():

    # The loop is the same as for the validation set, except that there are no labels
    for i, batch in enumerate(test_loader):

        # Get a batch from the test data loader
        batch = tuple(t.cuda() for t in batch)
        input_ids = batch[0]
        attention_mask = batch[1]

        # Make predictions, which will be probabilities of being in each class
        y_pred = model(input_ids, attention_mask)

        # Track the predictions 
        test_preds[i * batch_size:(i + 1) * batch_size] = F.softmax(y_pred, dim=1).detach().cpu().numpy()


# Step 7: Submit predictions

Note that classification of tweets was actually not the goal of this competition. But to make submissions, code should look something like the following.

In [None]:
y_test = np.argmax(test_preds, axis = 1)

sample['label'] = y_test

# index = False keeps the pandas row index out of the submission file
sample.to_csv('submission.csv', index = False)

# Example 2: Question-Answering

For question-answering tasks, we encode the data in the same way. The model answers by taking the dot product of each token's embedding with a 'start' vector and an 'end' vector; after a softmax over the tokens, the results are the predicted probabilities that each token is the start or the end of the answer.

The 'start vector' and 'end vector' are really just the weights of a 1D convolutional layer with a kernel size of 1 (since such a convolution is equivalent to a dot product with a vector at every position).

![](https://imgur.com/j7B3C0M.jpg)
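A minimal numerical sketch of this idea, using random numbers in place of real RoBERTa embeddings and a toy hidden size of 8 instead of 768 (all names below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 5, 8   # toy sizes; roberta-base has a hidden size of 768

token_embeddings = rng.normal(size=(seq_len, hidden))  # one row per token
start_vector = rng.normal(size=hidden)
end_vector = rng.normal(size=hidden)

# One score per token: the dot product of its embedding with each vector
start_logits = token_embeddings @ start_vector
end_logits = token_embeddings @ end_vector

# Stacking the two vectors gives the weight matrix of a layer with 2 outputs
weights = np.stack([start_vector, end_vector], axis=1)  # shape (hidden, 2)
both_logits = token_embeddings @ weights                # shape (seq_len, 2)

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

start_probs = softmax(start_logits)  # probability that each token starts the answer
print(start_probs.argmax(), softmax(end_logits).argmax())
```

This is why the network later in this example can use a single layer with two outputs per token and then split the result into start and end logits.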

# ** How to do 5-fold cross validation (evaluate on out of fold sample, etc) 

# Step 1: Import data, split into training and validation set 

In the big picture, the way question-answering problems work is that you provide a context, like a Wikipedia article, and a question that can be answered using text from the context. 

In this competition, the goal is to predict the part of each tweet that best identifies its sentiment. So the "question" is the sentiment, the "answer" is the selected portion of the tweet that identifies the sentiment, and the "context" is the entire tweet.

In order to use BERT-based models for question answering, we need to structure our inputs a little bit differently, as shown in the following image.

![](https://imgur.com/asOb599.jpg)

Our input ids are now the question and the context joined by a separator token ([SEP] for BERT, `</s>` for RoBERTa), and we have two outputs: predictions for the tokens that start and end the answer.
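Here is a small sketch of how such a combined input could be laid out. It uses the RoBERTa ids mentioned in this notebook (0 for CLS, 2 for SEP, 1 for padding) and puts the tweet first and the sentiment second, which mirrors the loop in section 1a; the token ids for the words themselves are made up, and `build_pair_input` is a hypothetical helper.

```python
CLS_ID, SEP_ID, PAD_ID = 0, 2, 1  # RoBERTa special-token ids used in this notebook

def build_pair_input(context_ids, question_ids, max_len):
    """Lay out CLS + context + SEP + question + SEP, then pad to max_len."""
    ids = [CLS_ID] + context_ids + [SEP_ID] + question_ids + [SEP_ID]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding

# Made-up ids: three tokens for the tweet, one token for the sentiment word
tweet_ids = [100, 200, 300]
sentiment_ids = [400]

pair_ids, pair_mask = build_pair_input(tweet_ids, sentiment_ids, max_len=10)
print(pair_ids)
print(pair_mask)
```

The attention mask covers both segments and both separators, so the model can attend to the sentiment token while scoring the tweet's tokens.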

In [None]:
import pandas as pd
import numpy as np

train = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
test = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/test.csv')
sample = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/sample_submission.csv')

train = train.dropna()

In [None]:
from sklearn import preprocessing

features = train[['text', 'selected_text', 'sentiment']]
y = train['selected_text']

In [None]:
# Split into a training set and a validation set

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(features, y)

In [None]:
MAX_LEN = 512

ct = X_train.shape[0]
input_ids = np.ones((ct,MAX_LEN),dtype='int32')
attention_mask = np.zeros((ct,MAX_LEN),dtype='int32')
start_tokens = np.zeros((ct,MAX_LEN),dtype='int32')
end_tokens = np.zeros((ct,MAX_LEN),dtype='int32')

## 1a. Build the input_ids and attention_mask

Note that we use 0 for the CLS token and 2 for the SEP token. These may change based on the model you're using. But you can check using the following example.

In [None]:
import transformers

# Tokenizer does the encoding to create the input ids
from transformers import RobertaTokenizer

# Use the 'Roberta Vocab File' dataset to get the vocab and merge files
tokenizer = RobertaTokenizer(vocab_file = '/kaggle/input/roberta-vocab-file/vocab.json',
                            merges_file = '/kaggle/input/roberta-vocab-file/merge.txt',
                            lowercase = True,
                            add_prefix_space = True)

In [None]:
# Encoding for the cls token
tokenizer.convert_tokens_to_ids(tokenizer.cls_token)

# Encoding for the sep token
tokenizer.convert_tokens_to_ids(tokenizer.sep_token)

In [None]:
# So we can loop through
X_train.reset_index(inplace = True)

In [None]:
for k in range(X_train.shape[0]):
    
    # encode the text
    enc = tokenizer.encode(' '.join(X_train.loc[k, 'text'].split()))
    
    # get the token for the current sentiment
    s_tok = tokenizer.encode(X_train.loc[k,'sentiment'])
    s_tok = s_tok[1] # Ignore the cls and sep tokens
    
    input_ids[k,:len(enc)+2] = enc + [s_tok] + [2]
    attention_mask[k,:len(enc)+2] = 1

## 1b. Build arrays to track the tokens associated with the start and end of the selected text

This logic is borrowed from this notebook: https://www.kaggle.com/nkoprowicz/tensorflow-roberta-0-705/edit. I've just done my best to dissect it.

This isn't as straightforward as it might seem, since again, there's not a 1-1 correspondence between the encoded values and the words in the tweets. So here's how the logic works:

![](https://imgur.com/AV8MuMq.jpg)
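The same idea can be tried on a tiny example before reading the full loop. This sketch assumes the sub-word pieces concatenate back to the original text; the pieces below are invented rather than produced by the real tokenizer, and `find_span_tokens` is a hypothetical helper.

```python
def find_span_tokens(text, selected, pieces):
    """Return (first, last) indices of the pieces that overlap `selected` in `text`."""
    start_char = text.find(selected)
    if start_char < 0:
        return None  # the selected text does not occur in the text
    end_char = start_char + len(selected)

    # Character offsets (begin, end) of every piece
    offsets, pos = [], 0
    for piece in pieces:
        offsets.append((pos, pos + len(piece)))
        pos += len(piece)

    # A piece overlaps the selection if the two character ranges intersect
    overlapping = [i for i, (a, b) in enumerate(offsets)
                   if a < end_char and b > start_char]
    if not overlapping:
        return None
    return overlapping[0], overlapping[-1]

text = " so happy today"
pieces = [" so", " hap", "py", " today"]  # hypothetical sub-word pieces

span = find_span_tokens(text, "happy", pieces)
print(span)
```

In the notebook's version, 1 is added to both indices afterwards, because the encoded input begins with the CLS token.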

In [None]:
y_train = y_train.reset_index(drop = True)

In [None]:
y_train.iloc[0]

In [None]:
start_tokens = np.zeros((ct,MAX_LEN),dtype='int32')
end_tokens = np.zeros((ct,MAX_LEN),dtype='int32')

In [None]:
for k in range(X_train.shape[0]):
    # Find where the selected text overlaps the full tweet.
    # You need the extra space at the beginning because when the first token is decoded later, it will add a space before the first word
    text1 = " " + " ".join(X_train.loc[k,'text'].split())
    text2 = " ".join(y_train[k].split())
    idx = text1.find(text2) # get the index of the first character where there's overlap
    
    # chars is 1 at every character of the tweet that belongs to the selected text
    chars = np.zeros((len(text1)))
    chars[idx:idx+len(text2)] = 1
    if text1[idx-1] == ' ': chars[idx-1] = 1 # include the space before the selection
    
    # Encode the context
    enc = tokenizer.encode(text1)
    
    # Build the offsets array: the (start, end) character positions of each token
    offsets = []
    idx = 0
    for t in enc[1:-1]: # skip the CLS and SEP tokens
        w = tokenizer.decode([t]) # get the characters for the current token
        offsets.append((idx,idx+len(w))) # append the indices where those characters start and end
        idx += len(w) # move the index to the first character of the next decoded chunk
    
    # toks will track all the tokens that overlap with the selected text
    toks = []
    for i,(a,b) in enumerate(offsets):
        sm = np.sum(chars[a:b])
        if sm>0: toks.append(i) 

    # The start token is the first element of toks and the end token is the last element
    # We add 1 to account for the CLS token
    if len(toks)>0:
        start_tokens[k,toks[0] + 1] = 1
        end_tokens[k,toks[-1] + 1] = 1

In [None]:
X_train.loc[0, ['selected_text', 'text']]

In [None]:
tokenizer.encode(' '.join(X_train.loc[0, 'text'].split()))

In [None]:
tokenizer.decode([1437])


In [None]:
end_tokens[0]

## 1c. Create torch data loaders

In [None]:
import torch

In [None]:
# When we build the torch dataloader, we need to pass in the batch size
batch_size = 8

# Make tensors. Making the data type long is important, since there will be an error without it
train_input_ids = torch.tensor(input_ids, dtype = torch.long)
train_attention_mask = torch.tensor(attention_mask, dtype = torch.long)
start_labels = torch.tensor(start_tokens, dtype = torch.long)
end_labels = torch.tensor(end_tokens, dtype = torch.long)

# Make a torch dataset
train_t = torch.utils.data.TensorDataset(train_input_ids, train_attention_mask, start_labels, end_labels)

# Make a torch dataloader.
train_loader = torch.utils.data.DataLoader(train_t, batch_size = batch_size)

# Step 2: Build the neural network

For the loss function, we'll compute a cross-entropy loss for the predicted start token and another for the predicted end token, and add the two.
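As a worked illustration of that sum, here is a numpy sketch for a single example with six token positions. The logits are made up, and `position_cross_entropy` is a hypothetical stand-in for what a cross-entropy loss computes for one example.

```python
import numpy as np

def position_cross_entropy(logits, target_index):
    """Cross-entropy over token positions: -log(softmax(logits)[target_index])."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_index]

# The labels are stored as one-hot rows, so argmax recovers the position
start_one_hot = np.array([0, 0, 1, 0, 0, 0])
end_one_hot = np.array([0, 0, 0, 1, 0, 0])
start_index = int(start_one_hot.argmax())
end_index = int(end_one_hot.argmax())

# Made-up model outputs, one score per token position
start_logits = np.array([0.1, 0.2, 3.0, 0.3, 0.0, -1.0])
end_logits = np.array([0.0, 0.1, 0.5, 2.5, 0.2, -1.0])

start_loss = position_cross_entropy(start_logits, start_index)
end_loss = position_cross_entropy(end_logits, end_index)
total_loss = start_loss + end_loss
print(round(float(total_loss), 4))
```

Adding the two terms weights the start and end predictions equally; the loss is small only when both positions receive high scores.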

In [None]:
# For the neural network
import torch.nn as nn

# For the RobertaConfig and the RobertaModel
from transformers import RobertaConfig, RobertaModel

# For some elements of the neural network
import torch.nn.functional as F

In [None]:
# The Roberta Base dataset has a configuration file for roberta
PATH = '/kaggle/input/roberta-base/'
config = RobertaConfig.from_pretrained(PATH + 'config.json')

In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        # The base RoBERTa model
        self.roberta = RobertaModel.from_pretrained(PATH + 'pytorch_model.bin', config = config)
        
        # Update weights during training
        for param in self.roberta.parameters():
            param.requires_grad = True
        
        # A dropout layer
        self.drop_out = nn.Dropout()
        
        # A fully connected layer. 768 is the size of each token's output, and 2 gives one logit each for the start and the end
        self.fc = nn.Linear(768, 2)
        
    def forward(self, input_ids, input_mask):
        
        # Get the RoBERTa output. Indexing with [0] picks out the last hidden state
        last_hidden_state = self.roberta(input_ids, attention_mask = input_mask)[0]
        
        # Dropout and fully connected layers
        out = self.drop_out(last_hidden_state) # NOTE: We use the whole hidden layer, not just the embedding for the CLS token
        out = self.fc(out)
        
        start_logits, end_logits = out.split(1, dim=-1) # split the output to get the start and end logits
        
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)
        
        return start_logits, end_logits

# Train the model

In [None]:
# Initialize the network
model = Net()

# Set the neural network to run on the GPU
model.cuda()

In [None]:
# Pick a learning rate. This is a parameter you can tune yourself with cross-validation
learning_rate = 1e-5

# Pick an optimizer. This determines how the neural network converges to a solution
opt = torch.optim.Adam(model.parameters(), lr = learning_rate)

# Pick a number of epochs for which to train the model
n_epochs = 1

# For the loss function, add the cross entropy loss for the start and end tokens
def loss_fn(start_logits, end_logits, start_tokens, end_tokens):
    ce_loss = nn.CrossEntropyLoss()
    
    start_loss = ce_loss(start_logits, start_tokens)
    end_loss = ce_loss(end_logits, end_tokens)    
    total_loss = start_loss + end_loss
    return total_loss

In [None]:
start_preds = []
end_preds = []

In [None]:
model.train()

for epoch in range(n_epochs):
    
    for i, batch in enumerate(train_loader):
        
        # A batch from the data loader. It has the input_ids, attention_mask, and labels
        # Make sure to send each of these things to the GPU
        batch = tuple(t.cuda() for t in batch)
        
        # Extract the input_ids, attention_mask, and labels from the batch
        input_ids = batch[0]
        attention_mask = batch[1]
        
        # Now we have the start and end tokens
        start_tokens = torch.argmax(batch[2], dim = 1) # Cross entropy loss wants the index of the answer, not the whole array
        end_tokens = torch.argmax(batch[3], dim = 1)
        
        # Use the model to make predictions
        start_logits, end_logits = model(input_ids, attention_mask)
        
        # Calculate loss using the chosen loss function
        loss = loss_fn(start_logits, end_logits, start_tokens, end_tokens)
        
        # Backpropagate and update the weights
        opt.zero_grad()
        loss.backward()
        opt.step()
        
        # Status updates
        print('Epoch {}/{} | Batch {}/{} | Loss: {:.4f}'.format(
                epoch + 1, n_epochs, i + 1, len(train_loader), loss.item()))
        
        # Keep the predicted start and end positions for this batch
        start_preds.extend(start_logits.argmax(dim = 1).cpu().tolist())
        end_preds.extend(end_logits.argmax(dim = 1).cpu().tolist())
    
        # Stop after 101 batches to keep this demonstration short; remove these lines to train on everything
        if i == 100:
            break

In [None]:
start_preds

In [None]:
end_preds

In [None]:
tokenizer.encode(' ' + ' '.join(' not  you, me, just drank too much.'.split()))

In [None]:
# jaccard is defined in the Metric section below, so run that cell before this one
X_train['predicted_text'] = ' '

jaccard_scores = []

for k in range(100):
    
    # encode the text
    enc = tokenizer.encode(' ' + ' '.join(X_train.loc[k, 'text'].split()))
    
    # Need the + 1 since we predicted the last token of the answer, but slicing stops one before the end index
    X_train.loc[k, 'predicted_text'] = tokenizer.decode(enc[start_preds[k]:end_preds[k] + 1])
        
    jaccard_scores.append(jaccard(X_train.loc[k, 'selected_text'], X_train.loc[k, 'predicted_text']))

In [None]:
X_train.head(20)

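# Saving model weights so you don't have to retrain each time

Always save the weights after training (and download them from Kaggle) so you don't have to train the model again.

In [None]:
torch.save(model.state_dict(), 'model_weights')

# To reload later: build the network with model = Net(), then call
# model.load_state_dict(torch.load('model_weights'))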

# Validation Set

In [None]:
MAX_LEN = 512

ct = X_val.shape[0]
val_input_ids = np.ones((ct,MAX_LEN),dtype='int32')
val_attention_mask = np.zeros((ct,MAX_LEN),dtype='int32')
val_start_tokens = np.zeros((ct,MAX_LEN),dtype='int32')
val_end_tokens = np.zeros((ct,MAX_LEN),dtype='int32')

In [None]:
# So we can loop through
X_val.reset_index(inplace = True)

In [None]:
for k in range(X_val.shape[0]):
    
    # encode the text
    enc = tokenizer.encode(' '.join(X_val.loc[k, 'text'].split()))
    
    # get the token for the current sentiment
    s_tok = tokenizer.encode(X_val.loc[k,'sentiment'])
    s_tok = s_tok[1] # Ignore the cls and sep tokens
    
    val_input_ids[k,:len(enc)+2] = enc + [s_tok] + [2]
    val_attention_mask[k,:len(enc)+2] = 1

In [None]:
# When we build the torch dataloader, we need to pass in the batch size
batch_size = 8

# Make tensors. Making the data type long is important, since there will be an error without it
val_input_ids = torch.tensor(val_input_ids, dtype = torch.long)
val_attention_mask = torch.tensor(val_attention_mask, dtype = torch.long)

# Make a torch dataset
val_t = torch.utils.data.TensorDataset(val_input_ids, val_attention_mask)

# Make a torch dataloader.
val_loader = torch.utils.data.DataLoader(val_t, batch_size = batch_size)

In [None]:
start_preds = []
end_preds = []

# Set the model to evaluation mode
model.eval()

# torch.no_grad turns off gradient tracking, which saves memory during inference
with torch.no_grad():
    
    for i, batch in enumerate(val_loader):

        batch = tuple(t.cuda() for t in batch)

        # Extract the input_ids and attention_mask from the batch (there are no labels here)
        input_ids = batch[0]
        attention_mask = batch[1]

        # Use the model to make predictions
        start_logits, end_logits = model(input_ids, attention_mask)

        # Keep the predicted start and end positions for this batch
        start_preds.extend(start_logits.argmax(dim = 1).cpu().tolist())
        end_preds.extend(end_logits.argmax(dim = 1).cpu().tolist())

In [None]:
start_preds

## Metric

In [None]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    if len(a) == 0 and len(b) == 0: return 0.5
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
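As a quick sanity check of the metric, the word-level Jaccard score is the number of shared words divided by the number of distinct words across both strings. The two strings below are made up for illustration.

```python
# Made-up true and predicted selections
truth = set("so happy today".lower().split())
pred = set("happy today !!".lower().split())

shared = truth & pred                     # words that appear in both
score = len(shared) / len(truth | pred)   # intersection over union
print(shared, score)
```

Two of the four distinct words are shared, so the score is 0.5. A perfect prediction scores 1.0, and a prediction with no words in common scores 0.0.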

In [None]:
X_val['predicted_text'] = ' '

jaccard_scores = []

for k in range(X_val.shape[0]):
    
    # encode the text
    enc = tokenizer.encode(' ' + ' '.join(X_val.loc[k, 'text'].split()))
    
    X_val.loc[k, 'predicted_text'] = tokenizer.decode(enc[start_preds[k]:end_preds[k] + 1])
        
    jaccard_scores.append(jaccard(X_val.loc[k, 'selected_text'], X_val.loc[k, 'predicted_text']))

In [None]:
X_val.sample(20)

In [None]:
start_preds[3489]

In [None]:
end_preds[3489]

In [None]:
X_val.loc[3489, 'text']

In [None]:
b = tokenizer.encode(X_val.loc[3489, 'text'])

In [None]:
" ".join(tokenizer.decode(b[5:7]).split())

In [None]:
np.mean(jaccard_scores)

# Upgrades

By now, you understand how BERT-based methods work and can be used for classification or question-answering tasks. In order to show detail in the examples above, I left things simple, and didn't try to condense any code. Now we'll learn some strategies for making things easier and improving.

## 1. Doing 5-fold CV while training instead of splitting into training set and validation set from the get-go

In [None]:
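from sklearn.model_selection import StratifiedKFold

# Stratifying on sentiment keeps the class balance similar in every fold
skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
for fold, (train_idx, val_idx) in enumerate(skf.split(train, train['sentiment']), start=1): 
    print(f'Fold: {fold}')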
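To show what happens inside each fold, here is a rough numpy-only sketch of out-of-fold prediction. It uses plain (unstratified) folds, made-up labels, and a stand-in "model" that just predicts the most common training label; `kfold_indices` is a hypothetical helper, and in the real notebook the body of the loop would train the network on `train_idx` and predict on `val_idx`.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is in a validation fold exactly once."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, val_idx

labels = np.array([0, 1, 2, 1, 1, 0, 2, 1, 1, 0])  # made-up sentiment labels
oof_preds = np.full(len(labels), -1)               # out-of-fold predictions

for fold, (train_idx, val_idx) in enumerate(kfold_indices(len(labels), n_splits=5), start=1):
    # Stand-in for training: remember the most common label in the training fold
    majority = np.bincount(labels[train_idx]).argmax()
    # Stand-in for prediction: fill in the rows this fold did not train on
    oof_preds[val_idx] = majority

oof_accuracy = (oof_preds == labels).mean()
print(oof_preds, oof_accuracy)
```

Because every row is predicted by a model that never saw it, the out-of-fold score uses all of the training data for validation without leaking labels.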