# Classifying text with BERT and SVM

In this approach, we use BERT embeddings as input features to an SVM classifier.

In [2]:
import os
import shutil
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

from tqdm.notebook import tqdm

tf.get_logger().setLevel('ERROR')

## Dataset

The dataset used is the IMDb reviews dataset (available at [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/)).

In [3]:
url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

dataset = tf.keras.utils.get_file(
        'aclImdb_v1.tar.gz', url,
        untar=True, cache_dir='../../data/aclImdb',
        cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

train_dir = os.path.join(dataset_dir, 'train')

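# remove the unlabeled 'unsup' folder so that only the 'pos' and 'neg' classes are loaded;
# ignore_errors=True keeps the cell safe to re-run once the folder is already gone
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir, ignore_errors=True)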

The raw dataset has train and test sets but lacks a validation set, so 20% of the training set is held out for validation.

In [4]:
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    os.path.join(dataset_dir, 'train'),
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.



In [5]:
val_ds = tf.keras.utils.text_dataset_from_directory(
    os.path.join(dataset_dir, 'train'),
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
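
The file counts reported above follow directly from the split fraction and the batch size. As a quick arithmetic check, using only numbers that appear in this notebook (25,000 training files, `validation_split=0.2`, `batch_size=32`):

```python
import math

total_train_files = 25000   # "Found 25000 files belonging to 2 classes."
validation_split = 0.2
batch_size = 32

n_val = int(total_train_files * validation_split)
n_train = total_train_files - n_val

# Number of batches per pass; the last batch may be smaller than batch_size.
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)

print(n_train, n_val)              # 20000 5000
print(train_batches, val_batches)  # 625 157
```

The 625 training batches match the total shown by the progress bar when the full training set is iterated.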


In [6]:
test_ds = tf.keras.utils.text_dataset_from_directory(
    os.path.join(dataset_dir, 'test'),
    batch_size=batch_size)

test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

Found 25000 files belonging to 2 classes.


Print a few of the reviews to check that everything is working so far:

In [7]:
for text_batch, label_batch in train_ds.take(1):
    # we'll print 3 reviews from the batch
    for i in range(3):
        print(f'Review: {text_batch.numpy()[i]}')
        label = label_batch.numpy()[i]
        print(f'Label : {label} ({class_names[label]})')
        print()

Review: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label : 0 (neg)

Review: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as the



## Loading the model from TensorFlow Hub

In [8]:
def get_tfhub_model():
    model_size = [
        (2, 128, 2),
        (6, 256, 4),
        (10, 256, 4),
        (2, 768, 12),
        (12, 768, 12),
    ][0]

    # Number of layers (i.e., residual blocks)
    L = model_size[0]

    # Size of hidden layers
    H = model_size[1]

    # Number of attention heads
    A = model_size[2]

    tfhub_handle_encoder = f"https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-{L}_H-{H}_A-{A}/2"
    tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
    
    input_layer = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    
    encoder_inputs = preprocessing_layer(input_layer)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    
    return tf.keras.Model(input_layer, outputs['pooled_output'])
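
Each tuple in `model_size` is `(L, H, A)`, and the encoder's `pooled_output` holds one `H`-dimensional vector per input text, so the selected `(2, 128, 2)` configuration yields 128-dimensional features. The sketch below only rebuilds the handle strings with the same pattern as `get_tfhub_model()`; the helper name `small_bert_handle` is hypothetical and nothing is downloaded.

```python
def small_bert_handle(num_layers: int, hidden_size: int, num_heads: int) -> str:
    """Build the encoder URL using the same pattern as get_tfhub_model()."""
    return (
        "https://tfhub.dev/tensorflow/small_bert/"
        f"bert_en_uncased_L-{num_layers}_H-{hidden_size}_A-{num_heads}/2"
    )

# The configurations listed in get_tfhub_model().
configs = [(2, 128, 2), (6, 256, 4), (10, 256, 4), (2, 768, 12), (12, 768, 12)]

for L, H, A in configs:
    # In every configuration listed here, the head count equals H / 64.
    assert A == H // 64
    print(f"L={L:>2} H={H:>3} A={A:>2} -> feature size {H}: {small_bert_handle(L, H, A)}")
```

Larger `H` gives the SVM a wider feature vector at the cost of slower feature extraction.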

## Preparing the feature extractor

The feature extractor simply returns the output from the model.

In [9]:
def get_features(model, X):
    model_output = model(X)

    return model_output
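
Because the SVM is fitted on the whole training set at once, the per-batch outputs have to be gathered into a single `(n_samples, H)` matrix. The following is a minimal sketch of that step, assuming a stand-in encoder: `fake_encoder` and `collect_features` are hypothetical names, and the random vectors merely mimic the shape of BERT's pooled output.

```python
import numpy as np

def fake_encoder(texts, hidden_size=128):
    """Hypothetical stand-in for the BERT model: one random vector per text."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), hidden_size))

def collect_features(encoder, batches):
    """Run the encoder batch by batch and stack the results row-wise."""
    feature_blocks, labels = [], []
    for texts, batch_labels in batches:
        feature_blocks.append(np.asarray(encoder(texts)))
        labels.extend(batch_labels)
    return np.concatenate(feature_blocks, axis=0), np.asarray(labels)

# Two toy batches of (texts, labels); the last batch is smaller, as in a real dataset.
batches = [
    (["good film", "bad film"], [1, 0]),
    (["loved it"], [1]),
]
X, y = collect_features(fake_encoder, batches)
print(X.shape, y.shape)  # (3, 128) (3,)
```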

In [10]:
def get_tfidf(X):
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
    return tfidf_vectorizer.fit_transform(X.numpy())
    
for text_batch, label_batch in tqdm(train_ds.take(1)):
    features = get_tfidf(text_batch)
    print(features)


  (0, 251)	0.2242209547563127
  (0, 92)	0.14477297004421424
  (0, 192)	0.10598599263132838
  (0, 18)	0.2242209547563127
  (0, 40)	0.2242209547563127
  (0, 299)	0.20523733584733736
  (0, 165)	0.2242209547563127
  (0, 97)	0.2242209547563127
  (0, 222)	0.2242209547563127
  (0, 305)	0.2242209547563127
  (0, 324)	0.15172551964264824
  (0, 7)	0.1905124970555341
  (0, 63)	0.1594978088360175
  (0, 39)	0.17848142774499282
  (0, 263)	0.1905124970555341
  (0, 25)	0.2242209547563127
  (0, 217)	0.2242209547563127
  (0, 28)	0.2242209547563127
  (0, 261)	0.20523733584733736
  (0, 259)	0.2242209547563127
  (0, 125)	0.12256977657084164
  (0, 37)	0.20523733584733736
  (0, 294)	0.13848362390550242
  (0, 126)	0.17848142774499282
  (0, 301)	0.13848362390550242
  :	:
  (31, 274)	0.10466151523107292
  (31, 372)	0.10466151523107292
  (31, 376)	0.11275087400439358
  (31, 363)	0.12317987131580588
  (31, 351)	0.12317987131580588
  (31, 211)	0.0760785047524466
  (31, 163)	0.09246378590723518
  (31, 112)	0.1961040



## Preparing the classifier

For the classifier, a simple SVM is used: one instance for the BERT features and one for the TF-IDF features.

In [11]:
classifier_tfhub = svm.SVC()
classifier_tfidf = svm.SVC()
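
To illustrate how `svm.SVC()` behaves on dense feature vectors, here is a small sketch on synthetic data. The two Gaussian clusters are made up and only stand in for embeddings, and the `StandardScaler` step is an addition that the notebook itself does not use.

```python
import numpy as np
from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for embeddings: two well-separated clusters in 128 dimensions.
n_per_class, dim = 100, 128
X_neg = rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, dim))
X_pos = rng.normal(loc=+1.0, scale=1.0, size=(n_per_class, dim))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * n_per_class + [1] * n_per_class)

# svm.SVC() defaults to an RBF kernel with C=1.0 and gamma='scale'.
clf = make_pipeline(StandardScaler(), svm.SVC())
clf.fit(X, y)

train_accuracy = clf.score(X, y)
print(f"training accuracy on synthetic data: {train_accuracy:.2f}")
```

Real embeddings are far less cleanly separated, so this only shows the API, not the accuracy to expect on IMDb.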

## Making predictions

In [12]:
tfhub_model = get_tfhub_model()

Training.

In [13]:
X = []
y = []
for text_batch, label_batch in tqdm(train_ds):
    features = get_features(tfhub_model, text_batch)
    
    X.extend(features.numpy())
    y.extend(label_batch.numpy())
    
print(len(X))

classifier_tfhub.fit(X=X, y=y)


20000


SVC()

Predict labels for a single test batch (32 reviews) using the TF Hub BERT features.

In [14]:
y_pred_tfhub = []
y_true = []
for text_batch, label_batch in tqdm(test_ds.take(1)):
    features = get_features(tfhub_model, text_batch)
    
    y_pred_tfhub.extend(classifier_tfhub.predict(features.numpy()))
    y_true.extend(label_batch.numpy())



In [14]:
X = []
y = []
for text_batch, label_batch in tqdm(train_ds):
    features = get_tfidf(text_batch)
    
    for f in features:
        print(f.shape)
        
        exit()
    
    [X.append(f.numpy()) for f in features]
    [y.append(l) for l in label_batch]

    
classifier_tfidf.fit(X=X, y=y)


(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)
(1, 532)


AttributeError: numpy not found
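
The error arises because `get_tfidf` returns a SciPy sparse matrix, and its rows have no `.numpy()` method. A second problem is that `get_tfidf` fits a new `TfidfVectorizer` on every batch, so the vocabulary, and with it the number of columns (532 for this batch), will generally differ from one batch to the next. One possible way around both issues, sketched here on a tiny made-up corpus rather than the IMDb data, is to fit the vectorizer once on all training texts and reuse it for the test texts:

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the IMDb reviews.
train_texts = [
    "great movie, loved the acting",
    "great acting and a wonderful story",
    "wonderful movie, loved it",
    "terrible movie, awful acting",
    "awful story and terrible pacing",
    "boring, terrible and awful",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["loved this wonderful movie", "terrible and awful"]

vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary once
X_test = vectorizer.transform(test_texts)        # reuse it; do not refit

classifier = svm.SVC()
classifier.fit(X_train, train_labels)            # SVC accepts SciPy sparse input
predictions = classifier.predict(X_test)

print(X_train.shape, X_test.shape, predictions.shape)
```

In the notebook, this would mean collecting the decoded review texts from `train_ds` first and vectorizing them in one call.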

Predict values using TF-IDF features.

In [None]:
y_pred_tfidf = []
y_true = []
for text_batch, label_batch in tqdm(test_ds.take(1)):
    features = get_tfidf(text_batch)
    
    [y_pred_tfidf.append(prediction) for prediction in classifier_tfidf.predict(features)]
    [y_true.append(label_list) for label_list in label_batch]

Computing accuracy.

In [None]:
y_pred_tfhub

In [15]:
accuracy_score(y_true, y_pred_tfhub)

0.78125
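
This score comes from `test_ds.take(1)`, that is, a single batch of 32 test reviews, so it is a coarse estimate. A short check of what the number means, using the binomial standard error as a rough measure of uncertainty:

```python
import math

batch_size = 32
accuracy = 0.78125

# Number of correctly classified reviews in the batch.
correct = accuracy * batch_size

# Binomial standard error of an accuracy estimated from n samples.
std_error = math.sqrt(accuracy * (1 - accuracy) / batch_size)

print(f"correct predictions: {correct:.0f} of {batch_size}")
print(f"standard error: {std_error:.3f}")
```

With an uncertainty of roughly seven percentage points, evaluating on the full test set would give a much more reliable figure.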

In [None]:
accuracy_score(y_true, y_pred_tfidf)