# Text Classification with BERT

This project aims to use **BERT** to perform **text classification**.

We'll be using the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) (SST-2) dataset of movie reviews and a smaller version of BERT - [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) - developed by HuggingFace. Our goal is to classify each movie review as positive or negative.

We'll start by loading and checking the data.

### Step 1

#### 1. Perform initial imports

In [1]:
import numpy as np
import pandas as pd

#### 2. Load data

In [2]:
# data from https://github.com/clairett/pytorch-sentiment-classification/tree/master/data/SST2

df_train = pd.read_csv("data/SST2/train.tsv", sep='\t', names=['review', 'label'])

#### 3. Check data

In [3]:
df_train.head()

| | review | label |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


In [4]:
len(df_train)

6920

We have 6920 movie reviews labeled as 1 (positive review) or 0 (negative review).

In [5]:
# check number of both labels

df_train['label'].value_counts()

1    3610
0    3310
Name: label, dtype: int64

We have 3610 positive reviews and 3310 negative reviews.

#### 4. Select a subset of our data

For performance reasons, we'll be using only the **first 3000 reviews** of the training dataset for our text classification task.

In [6]:
df = df_train[:3000]

In [7]:
df['label'].value_counts()

1    1565
0    1435
Name: label, dtype: int64

We now have 3000 reviews, of which 1565 are positive and 1435 are negative.

In [8]:
# length of longest review

max([len(review.split()) for review in df['review']])

50

In [9]:
# average length of the reviews

np.rint(np.mean([len(review.split()) for review in df['review']]))

19.0

Our reviews have an **average length of 19 words** and the **longest review has 50 words**.

We can now use **DistilBERT to tokenize our reviews**. This will be our **step 2**.

### Step 2

#### 1. Perform necessary imports

In [10]:
import torch
import transformers as pytt

#### 2. Load pretrained model

In [11]:
# using DistilBERT
model_class, tokenizer_class, pretrained_weights = (pytt.DistilBertModel, pytt.DistilBertTokenizer, 'distilbert-base-uncased')

# load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

We'll be using the **DistilBERT model**, more specifically the [distilbert-base-uncased model](https://huggingface.co/transformers/pretrained_models.html), which was trained on lower-cased English text. This suits our data, since our reviews are also lower-cased.

We are now ready to tokenize our reviews. Besides tokenization, we'll also perform padding so that all our reviews have the same length.

#### 3. Tokenize and pad reviews

In [12]:
# max_length=60 should be enough for all tokens, including special tokens
# note: newer transformers releases deprecate pad_to_max_length in favor of padding='max_length'

df_tokenized_padded = df['review'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True, 
                                                              max_length=60, 
                                                              pad_to_max_length=True)))

In [13]:
df_tokenized_padded.head()

0    [101, 1037, 18385, 1010, 6057, 1998, 2633, 182...
1    [101, 4593, 2128, 27241, 23931, 2013, 1996, 62...
2    [101, 2027, 3653, 23545, 2037, 4378, 24185, 10...
3    [101, 2023, 2003, 1037, 17453, 14726, 19379, 1...
4    [101, 5655, 6262, 1005, 1055, 12075, 2571, 376...
Name: review, dtype: object

In [14]:
df_tokenized_padded[0]

[101,
 1037,
 18385,
 1010,
 6057,
 1998,
 2633,
 18276,
 2128,
 16603,
 1997,
 5053,
 1998,
 1996,
 6841,
 1998,
 5687,
 5469,
 3152,
 102,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [15]:
len(df_tokenized_padded[0])

60

Our reviews are now tokenized and include special tokens: the **\[CLS\]** token - id 101 - at the beginning of each review and the **\[SEP\]** token - id 102 - at the end.

They've also been padded with 0's so that they all have the same length.
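
As a rough illustration of the layout seen in the output above, the following pure-Python sketch rebuilds it by hand. The helper `add_special_tokens_and_pad` is hypothetical and is not part of the transformers library; the ids 101, 102 and 0 are simply the ones visible in the output, and the sketch does not attempt to reproduce the tokenizer's actual implementation.

```python
# Hypothetical helper, for illustration only: it mimics the padded layout
# shown above, not the tokenizer's real implementation.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids taken from the output above


def add_special_tokens_and_pad(token_ids, max_length):
    """Wrap token ids in [CLS] ... [SEP], then right-pad with PAD_ID."""
    ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    return ids + [PAD_ID] * (max_length - len(ids))


# The first three token ids of the first review, used here as toy input.
padded = add_special_tokens_and_pad([1037, 18385, 1010], max_length=8)
print(padded)  # [101, 1037, 18385, 1010, 102, 0, 0, 0]
```

Padding on the right keeps the \[CLS\] token at position 0 of every review, regardless of the review's length.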

#### 4. Get reviews in the correct shape

In [16]:
df_tokenized_padded.shape

(3000,)

In [17]:
len(df_tokenized_padded.shape)

1

Our reviews are stored in a one-dimensional pandas Series with 3000 rows, each row holding a list of 60 token ids, but we need a two-dimensional array (a matrix).

In [18]:
arr_reviews = np.array([review for review in df_tokenized_padded.values])

In [19]:
arr_reviews

array([[  101,  1037, 18385, ...,     0,     0,     0],
       [  101,  4593,  2128, ...,     0,     0,     0],
       [  101,  2027,  3653, ...,     0,     0,     0],
       ...,
       [  101,  2433, 20922, ...,     0,     0,     0],
       [  101,  2045,  1005, ...,     0,     0,     0],
       [  101, 11317,  7545, ...,     0,     0,     0]])

In [20]:
arr_reviews.shape

(3000, 60)
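
The conversion above works because every padded review has the same length. A minimal sketch with toy data (the three short lists below are made up for illustration, not taken from the dataset) shows the same effect:

```python
import numpy as np

# Toy stand-in for the padded reviews: 3 lists, each of length 5.
padded_reviews = [
    [101, 1037, 102, 0, 0],
    [101, 4593, 2128, 102, 0],
    [101, 2027, 3653, 23545, 102],
]

# Equal-length lists stack into a regular 2D integer array.
toy_reviews = np.array(padded_reviews)
print(toy_reviews.shape)  # (3, 5)
print(toy_reviews.ndim)   # 2
```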

We now need to create a mask so that our model is able to distinguish the padding from the rest of the tokens.

#### 5. Create attention mask

In [21]:
arr_attention_mask = np.where(arr_reviews != 0, 1, 0)

In [22]:
arr_attention_mask

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])

In [23]:
arr_attention_mask.shape

(3000, 60)
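
To make the mask easier to inspect, here is a small sketch on toy data (the two rows below are illustrative, not real reviews). It assumes, as in the output above, that id 0 is used only for padding, which is why comparing against 0 is enough to separate real tokens from padding.

```python
import numpy as np

PAD_ID = 0  # assumption: id 0 appears only as padding

toy_reviews = np.array([
    [101, 1037, 102, 0, 0],
    [101, 4593, 2128, 102, 0],
])

# 1 where there is a real token, 0 where there is padding.
toy_mask = np.where(toy_reviews != PAD_ID, 1, 0)
print(toy_mask)
# [[1 1 1 0 0]
#  [1 1 1 1 0]]

# Summing each row gives the number of real tokens per review.
print(toy_mask.sum(axis=1))  # [3 4]
```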

We now have everything we need to **feed the reviews to our DistilBERT model** - this is our **step 3**.

### Step 3

#### 1. Create torch tensors

In [37]:
input_ids = torch.tensor(arr_reviews, dtype=torch.int64) # a long tensor is expected
attention_mask = torch.tensor(arr_attention_mask, dtype=torch.int64) # a long tensor is expected
# PyTorch data types: https://pytorch.org/docs/stable/tensor_attributes.html

#### 2. Feed reviews to DistilBERT

In [38]:
%%time

# disabling gradient calculation
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Wall time: 2min 56s


In [49]:
type(last_hidden_states)

tuple

In [63]:
len(last_hidden_states)

1

In [67]:
last_hidden_states[0].shape

torch.Size([3000, 60, 768])

Our output is a tuple with a single element: a torch tensor of shape **(number of reviews, length of padded reviews, number of hidden units)**.

We are interested in the **output for the \[CLS\] token only**, so let's slice our output.

#### 3. Slice output

We want to keep the output for all our reviews, but only for the first token of each one (the \[CLS\] token), while keeping all the hidden units, i.e., the slice `[:, 0, :]`.

In [70]:
reviews_BERT = last_hidden_states[0][:, 0, :].numpy()

In [71]:
reviews_BERT.shape

(3000, 768)

We now have a 2D numpy array with the desired embeddings for each review.
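
The effect of the slice `[:, 0, :]` can be checked on a small toy array with the same three-axis layout. The sizes below are arbitrary and chosen only to keep the example readable; this is a sketch in NumPy, not the model output itself.

```python
import numpy as np

# Toy array laid out like the model output:
# (number of reviews, tokens per review, hidden units).
n_reviews, n_tokens, n_hidden = 4, 6, 8
toy_output = np.arange(n_reviews * n_tokens * n_hidden).reshape(
    n_reviews, n_tokens, n_hidden
)

# Keep every review, only token position 0, and every hidden unit.
cls_vectors = toy_output[:, 0, :]
print(cls_vectors.shape)  # (4, 8)

# Row i of the result is the vector at token position 0 of review i.
print(np.array_equal(cls_vectors[1], toy_output[1, 0]))  # True
```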

To train our classifier, we only need to check our labels and we are good to go!

#### 4. Check labels

In [76]:
labels = df['label']

In [77]:
labels.shape

(3000,)

The labels for our 3000 reviews are stored in the variable `labels`. We can proceed to our **step 4 and train our classifier**.

### Step 4

In [27]:
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [28]:
X_train, X_test, y_train, y_test = train_test_split(reviews_BERT, labels, test_size = 0.3, random_state = 42)

In [29]:
X_train.shape

(2100, 768)

In [30]:
review_clf = LogisticRegression(max_iter=1000)

review_clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [31]:
predictions = review_clf.predict(X_test)

In [32]:
# confusion matrix

print(confusion_matrix(y_test, predictions))

[[365  56]
 [ 84 395]]


In [33]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.81      0.87      0.84       421
           1       0.88      0.82      0.85       479

    accuracy                           0.84       900
   macro avg       0.84      0.85      0.84       900
weighted avg       0.85      0.84      0.84       900



In [34]:
print(accuracy_score(y_test, predictions))

0.8444444444444444
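
The figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below copies the four counts from the printed matrix, where rows are true labels and columns are predicted labels, and treats label 1 (positive review) as the positive class.

```python
# Counts copied from the confusion matrix printed above
# (rows: true labels 0 and 1, columns: predicted labels 0 and 1).
tn, fp = 365, 56   # true label 0
fn, tp = 84, 395   # true label 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)  # of the predicted positives, how many are right
recall_pos = tp / (tp + fn)     # of the true positives, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(accuracy, 4))       # 0.8444
print(round(precision_pos, 2))  # 0.88
print(round(recall_pos, 2))     # 0.82
print(round(f1_pos, 2))         # 0.85
```

These values match the row for label 1 and the accuracy line of the report.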
