## BERT for Classification

*Based on Jay Alammar's ["A Visual Notebook to Using BERT for the First Time"](https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb).*

In [1]:
import numpy as np
import pandas as pd
import time
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

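Please note that the movie reviews are also available in the `data` folder, so the `dataurl` below could simply be `../data/moviereviews.csv`. If you do load the local copy, adjust the `read_csv` arguments to match: a file written by `to_csv` is comma-separated and, by default, includes a header row and an index column. (The commented-out code is there to remind you just how easy it is to save to a CSV file when you use **pandas**.)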

In [2]:
# This is simply to keep the load line from getting too long
# (You could also break at commas, but this seems cleaner.)
dataurl = 'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'

# Load the data
df = pd.read_csv(dataurl, delimiter='\t', header=None)

# Save the data locally (optional)
# df.to_csv("../data/moviereviews.csv")

# See the results
df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


There are approximately 7,000 reviews in this dataset, and running all of them through the BERT model takes over six minutes. To speed things along, we can work with just the first 2,000 reviews. (Feel free to remove this cap when you work on your own.)

In [3]:
batch_1 = df[:2000]
batch_1[1].value_counts()

1
1    1041
0     959
Name: count, dtype: int64

In [4]:
# You can toggle between DistilBERT and plain BERT by 
# commenting/uncommenting the following two lines:

model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [5]:
# Grab just the texts
texts = batch_1[0].tolist()

# Convert them to token IDs (indices into the model's vocabulary)
identified = [tokenizer.encode(x, add_special_tokens=True) for x in texts]

# Check the results
print(identified[0])

[101, 1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603, 1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152, 102]
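The first and last numbers in that list are worth noticing. With `add_special_tokens=True`, the tokenizer wraps each text in a `[CLS]` token (ID 101 in the `bert-base-uncased` vocabulary, which DistilBERT shares) and a `[SEP]` token (ID 102). The sketch below only mimics that wrapping step on the IDs printed above; `wrap_with_special_tokens` is a hypothetical helper written for illustration, not part of the `transformers` API.

```python
# IDs of the special tokens in the bert-base-uncased vocabulary
CLS_ID = 101  # [CLS], placed at the start of every sequence
SEP_ID = 102  # [SEP], placed at the end of every sequence


def wrap_with_special_tokens(body_ids):
    """Hypothetical helper: mimic what add_special_tokens=True adds."""
    return [CLS_ID] + list(body_ids) + [SEP_ID]


# The first review's IDs as printed above, minus the first and last entries
body = [1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603,
        1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152]

wrapped = wrap_with_special_tokens(body)
print(wrapped[0], wrapped[-1], len(wrapped))
```

The `[CLS]` token at position 0 matters later on, because its output vector is what gets used as the feature vector for each review.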


In [6]:
# One-liner using lambda
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

# Ask the question: Do these produce the same results?
print(tokenized[10] == identified[10])

True


If we want BERT to process our texts in a single batch, they all need to be in a single array. For that to work, every list of token IDs has to be the same length, so we pad the shorter lists with zeros.

In [7]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

# What does our array look like?
# (How many texts and how "long"?)
padded.shape

(2000, 59)
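To see the padding step on something small enough to read, here is a sketch with three made-up token-ID lists (the values are illustrative, not real reviews). It uses `max()` in place of the explicit loop, which gives the same result.

```python
# Toy token-ID lists of unequal length (made-up values, not real reviews)
toy_tokenized = [
    [101, 7, 8, 102],
    [101, 7, 102],
    [101, 7, 8, 9, 10, 102],
]

# The longest list sets the target length
toy_max_len = max(len(ids) for ids in toy_tokenized)

# Right-pad every list with zeros up to that length
toy_padded = [ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized]

for row in toy_padded:
    print(row)
```

Every row now has six entries, so the lists can be stacked into one rectangular array.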

Having padded our sequences, we need to make sure BERT ignores the zeros, so we create a mask:

In [8]:
# Ignore zeros
attention_mask = np.where(padded != 0, 1, 0)

# Check to see that shapes match
attention_mask.shape

(2000, 59)

In [9]:
# Or we could do a direct comparison again
# (padded is already a NumPy array, so no conversion is needed)
padded.shape == attention_mask.shape

True
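A small example makes the mask easy to inspect. This sketch assumes, as the notebook's own mask cell does, that ID 0 appears only as padding (0 is the `[PAD]` ID in the `bert-base-uncased` vocabulary). The token IDs are made up.

```python
import numpy as np

# Toy padded batch: three sequences, zero-padded to length 6 (made-up IDs)
toy_padded = np.array([
    [101, 7, 8, 102, 0, 0],
    [101, 7, 102, 0, 0, 0],
    [101, 7, 8, 9, 10, 102],
])

# 1 where there is a real token, 0 where there is padding
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_mask)
# Summing each row recovers the original, unpadded lengths
print(toy_mask.sum(axis=1))
```

The row sums (4, 3, and 6) are the number of real tokens in each sequence, which is a quick sanity check that the mask lines up with the padding.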

I've got a timer on the process below. On my M1 MacBook Air, the full dataset takes about six minutes; the 2,000-review batch used here finished in just over a minute.

In [10]:
# We have used Jupyter cell magic before
# now we are using the Python time library
start_time = time.time()

# The actual code
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# More time measurement
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time}")

Execution time: 65.00770401954651
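`time.time()` reads the wall clock, which is fine for a run this long. For measuring intervals, the standard library also offers `time.perf_counter()`, a monotonic, higher-resolution clock. A minimal sketch, with a short `sleep` standing in for the model call (`slow_step` is a hypothetical placeholder):

```python
import time


def slow_step():
    """Stand-in for the model call; sleeps briefly instead of running BERT."""
    time.sleep(0.05)


start = time.perf_counter()
slow_step()
elapsed = time.perf_counter() - start

# Format to two decimals so the output is easier to read
print(f"Execution time: {elapsed:.2f} seconds")
```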


In [11]:
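# Keep only the [CLS] vector (position 0) from the last hidden layer
# of each review; that vector serves as the review's feature vector
features = last_hidden_states[0][:,0,:].numpy()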
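The indexing in that line is compact. `last_hidden_states[0]` is the model's last hidden layer, a tensor of shape (number of texts, sequence length, hidden size), and `[:, 0, :]` keeps only position 0 of every sequence, which is the `[CLS]` token. A sketch on a small random NumPy array standing in for the real output (the sizes are toy values chosen for readability; with `distilbert-base-uncased` the hidden size is 768):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the last hidden layer:
# (number of texts, sequence length, hidden size)
n_texts, seq_len, hidden_size = 4, 6, 8
toy_hidden = rng.normal(size=(n_texts, seq_len, hidden_size))

# Keep every text, only position 0, every hidden dimension
toy_features = toy_hidden[:, 0, :]

print(toy_hidden.shape)    # (4, 6, 8)
print(toy_features.shape)  # (4, 8)
```

The result is one row per text, so it has the two-dimensional shape that a scikit-learn classifier expects as input.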

In [12]:
batch_1 = df[:2000]
# Column 1 holds the sentiment labels (1 = positive, 0 = negative)
labels = batch_1[1]
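The imports in the first cell (`train_test_split`, `LogisticRegression`, `cross_val_score`) suggest where `features` and `labels` are headed: a simple classifier trained on the BERT vectors. The sketch below shows one way that step could look. It runs on synthetic stand-in data so that it is self-contained; the array sizes, the `random_state` value, and the `max_iter` setting are assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-ins for `labels` and `features` (not real BERT output)
toy_labels = rng.integers(0, 2, size=200)
# Shift each feature vector by its label so the classes can be told apart
toy_features = rng.normal(size=(200, 16)) + toy_labels[:, None]

# Hold out 25% of the rows (the scikit-learn default) for testing
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.2f}")

# Five-fold cross-validation gives a less split-dependent estimate
scores = cross_val_score(
    LogisticRegression(max_iter=1000), toy_features, toy_labels, cv=5
)
print(f"Cross-validated accuracy: {scores.mean():.2f}")
```

With the real data, `features` and `labels` would take the place of `toy_features` and `toy_labels`.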

## Summarizer

If you are interested in having BERT summarize things for you, check out the [BERT Extractive Summarizer](https://pypi.org/project/bert-extractive-summarizer/) -- and make sure you understand the difference between abstractive and extractive summarizing.