## Movie Review Polarity Analysis Using TPUs
#### This is based on https://colab.research.google.com/github/bentoml/gallery/blob/master/tensorflow/bert/bert_movie_reviews.ipynb#scrollTo=j0a4mTk9o1Qg

This is a simplified version of the above notebook: the tokenizing and padding functions rely on the built-in capabilities of pandas, which arguably makes the code simpler and faster.
The model is the same as in the notebook above, with two major changes: the length of the input vectors is changed to 512 and the number of epochs is increased to 10. The model is trained on TPUs and achieves about 91.6% test accuracy.


### TPU initialization

We start by connecting to the TPU provided by Colab and creating a TPU distribution strategy from it.

In [1]:
## Initialize TPUs and setup env
import os
import sys
import math
import datetime
from tqdm import tqdm
import pandas as pd
import numpy as np
import re

import tensorflow as tf
from tensorflow import keras
import bert
from bert import BertModelLayer
from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights
from bert.tokenization.bert_tokenization import FullTokenizer

print("Tensorflow: ", tf.__version__)
print("Python: ", sys.version)

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
strategy = tf.distribute.TPUStrategy(resolver)

Tensorflow:  2.3.0
Python:  3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0]
INFO:tensorflow:Initializing the TPU system: grpc://10.67.152.218:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
All devices:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type

### Loading Dataset

In [2]:
# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
    data = {}
    data["sentence"] = []
    data["sentiment"] = []
    for file_path in tqdm(os.listdir(directory), desc=os.path.basename(directory)):
        with tf.io.gfile.GFile(os.path.join(directory, file_path), "r") as f:
            data["sentence"].append(f.read())
            # File names look like "<id>_<rating>.txt"; a raw string keeps the regex escapes intact.
            data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
    return pd.DataFrame.from_dict(data)

# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
    pos_df = load_directory_data(os.path.join(directory, "pos"))
    neg_df = load_directory_data(os.path.join(directory, "neg"))
    pos_df["polarity"] = 1
    neg_df["polarity"] = 0
    return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)
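
The loader relies on the aclImdb file-name convention `<id>_<rating>.txt`. As a minimal sketch of that parsing step, the hypothetical helpers below (`parse_rating` and `rating_to_polarity`, neither of which is part of the notebook) extract the rating and derive a polarity from it, assuming that ratings of 4 or lower are negative and ratings of 7 or higher are positive:

```python
import re

# Matches names such as "123_8.txt": a review id, an underscore, then the rating.
FILENAME_RE = re.compile(r"\d+_(\d+)\.txt")

def parse_rating(file_name):
    """Return the integer rating encoded in an aclImdb-style file name."""
    match = FILENAME_RE.fullmatch(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    return int(match.group(1))

def rating_to_polarity(rating):
    """Map a rating to 1 (positive) or 0 (negative).

    Assumption: ratings of 5 and 6 do not occur, so a single threshold is enough.
    """
    return 1 if rating >= 7 else 0

print(parse_rating("123_8.txt"), rating_to_polarity(8))
print(parse_rating("77_3.txt"), rating_to_polarity(3))
```

The notebook itself takes the polarity from the `pos`/`neg` directory name; the threshold here is only a cross-check on that labeling.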

In [3]:
dataset = tf.keras.utils.get_file(
        fname="aclImdb.tar.gz", 
        origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
        extract=True)

train = load_dataset(os.path.join(os.path.dirname(dataset), 
                           "aclImdb", "train"))
test = load_dataset(os.path.join(os.path.dirname(dataset), 
                          "aclImdb", "test"))

pos: 100%|██████████| 12500/12500 [00:01<00:00, 7962.30it/s]
neg: 100%|██████████| 12500/12500 [00:01<00:00, 9683.67it/s]
pos: 100%|██████████| 12500/12500 [00:01<00:00, 8891.18it/s]
neg: 100%|██████████| 12500/12500 [00:01<00:00, 8574.02it/s]


### Class Counts

Counting the number of elements in each class for the train and test datasets shows a very balanced dataset.
Here we model only polarity, which is a perfect split between negative and positive reviews.

In [4]:
from collections import Counter
sc = Counter(train['sentiment'])
pc = Counter(train['polarity'])
tsc = Counter(test['sentiment'])
tpc = Counter(test['polarity'])
print('train sentiment:',sc)
print('train polarity:',pc)
print('test sentiment:',tsc)
print('test polarity:',tpc)

train sentiment: Counter({'1': 5100, '10': 4732, '8': 3009, '4': 2696, '7': 2496, '3': 2420, '2': 2284, '9': 2263})
train polarity: Counter({1: 12500, 0: 12500})
test polarity: Counter({1: 12500, 0: 12500})


In [5]:
bert_model_name = "uncased_L-12_H-768_A-12"
bert_ckpt_dir    = os.path.join(bert_model_name)
bert_ckpt_file   = os.path.join(bert_ckpt_dir, "bert_model.ckpt")
bert_config_file = os.path.join(bert_ckpt_dir, "bert_config.json")

### Preparing the Data

We increased the length of the input vectors to 512.

The following functions are a simplification of the `MovieReviewData` class from the original notebook.

In [12]:
DATA_COLUMN = "sentence"
LABEL_COLUMN = "polarity"
global_max = 512                # Maximum sequence length accepted by the BERT layer

def tokener(text,tokenizer):
    tokens = tokenizer.tokenize(text)
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    return token_ids

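def padder(input_ids,maxlen):
    # Truncate to maxlen - 2 ids (for long reviews this also cuts off the trailing [SEP]),
    # then right-pad with 0, the [PAD] id, up to maxlen.
    input_ids = input_ids[:maxlen - 2]
    input_ids = input_ids + [0] * (maxlen - len(input_ids))
    return np.array(input_ids)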

def prepare(df):
    global global_max
    tqdm.pandas()
    tokenizer = FullTokenizer(vocab_file=os.path.join(bert_ckpt_dir, "vocab.txt"))
    tokens = df[DATA_COLUMN].progress_apply(lambda text: tokener(text,tokenizer))
    maxlen = min(max(len(x) for x in tokens.values),global_max)
    print('')
    print('padded length:',maxlen)
    print('')
    padded_tokens = tokens.apply(lambda ids: padder(ids,maxlen))
    labels = df[LABEL_COLUMN].apply(int)
    return np.array(padded_tokens.values.tolist()),np.array(labels.values.tolist())
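
To make the truncate-and-pad step concrete, here is a small sketch on plain Python lists. `padder_like_notebook` mirrors the arithmetic of `padder` above without NumPy. `pad_keep_sep` is a hypothetical alternative, not used in this notebook, that fills all `maxlen` positions and keeps the closing `[SEP]`. The ids 0, 101 and 102 for `[PAD]`, `[CLS]` and `[SEP]` are assumed to match the stock uncased BERT vocabulary.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # assumed ids from the stock uncased BERT vocab

def padder_like_notebook(input_ids, maxlen):
    """Plain-list copy of the notebook's padder: cut to maxlen - 2, then right-pad."""
    input_ids = input_ids[:maxlen - 2]
    return input_ids + [PAD_ID] * (maxlen - len(input_ids))

def pad_keep_sep(input_ids, maxlen):
    """Hypothetical variant: use all maxlen positions and keep the final [SEP]."""
    if len(input_ids) > maxlen:
        input_ids = input_ids[:maxlen - 1] + [SEP_ID]
    return input_ids + [PAD_ID] * (maxlen - len(input_ids))

long_ids = [CLS_ID, 7, 8, 9, 10, SEP_ID]   # longer than maxlen=4, so it gets truncated
short_ids = [CLS_ID, 7, SEP_ID]            # shorter than maxlen=5, so it gets padded

print(padder_like_notebook(long_ids, 4))
print(pad_keep_sep(long_ids, 4))
print(pad_keep_sep(short_ids, 5))
```

Both functions always return exactly `maxlen` ids; they differ only in how many real tokens survive truncation and whether the sequence still ends with `[SEP]`.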

In [13]:
((train_x, train_y),
    (test_x, test_y)) = map(prepare, [train, test])

100%|██████████| 25000/25000 [01:43<00:00, 240.77it/s]

padded length: 512

100%|██████████| 25000/25000 [01:39<00:00, 252.05it/s]

padded length: 512



In [15]:
print("   train_x", train_x.shape)
print("   train_y", train_y.shape)
print("    test_x", test_x.shape)
print('    test_y',test_y.shape)
print("global max", global_max)

   train_x (25000, 512)
   train_y (25000,)
    test_x (25000, 512)
    test_y (25000,)
global max 512


In [16]:
def create_model(maxlen, adapter_size=64):
        with strategy.scope():
                with tf.io.gfile.GFile(bert_config_file, "r") as reader:
                        bc = StockBertConfig.from_json_string(reader.read())
                        bert_params = map_stock_config_to_params(bc)
                        bert_params.adapter_size = adapter_size
                        bert = BertModelLayer.from_params(bert_params, name="bert")

                input_ids      = keras.layers.Input(shape=(maxlen,), dtype='int32', name="input_ids")
                output         = bert(input_ids)

                print("bert shape", output.shape)
                cls_out = keras.layers.Lambda(lambda seq: seq[:, 0, :])(output)
                cls_out = keras.layers.Dropout(0.5)(cls_out)
                logits = keras.layers.Dense(units=768, activation="tanh")(cls_out)
                logits = keras.layers.Dropout(0.5)(logits)
                logits = keras.layers.Dense(units=2, activation="softmax")(logits)

                model = keras.Model(inputs=input_ids, outputs=logits)
                model.build(input_shape=(None, maxlen))

                load_stock_weights(bert, bert_ckpt_file)

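                # freeze weights if adapter-BERT is used
                # (freeze_bert_layers is not defined in this notebook; this branch is
                # only reached when adapter_size is not None)
                if adapter_size is not None:
                        freeze_bert_layers(bert)

                # Note: the last Dense layer already applies softmax, while the loss
                # is configured with from_logits=True (kept as in the run reported here).
                model.compile(optimizer=keras.optimizers.Adam(),
                        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                        metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")])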

                model.summary()

                return model
                
adapter_size = None # use None to fine-tune all of BERT
model = create_model(512, adapter_size=adapter_size)

bert shape (None, 512, 768)
Done loading 196 BERT weights from: uncased_L-12_H-768_A-12/bert_model.ckpt into <bert.model.BertModelLayer object at 0x7f3266f82128> (prefix:bert). Count of weights not found in the checkpoint was: [0]. Count of weights with mismatched shape: [0]
Unused weights from checkpoint: 
	bert/embeddings/token_type_embeddings
	bert/pooler/dense/bias
	bert/pooler/dense/kernel
	cls/predictions/output_bias
	cls/predictions/transform/LayerNorm/beta
	cls/predictions/transform/LayerNorm/gamma
	cls/predictions/transform/dense/bias
	cls/predictions/transform/dense/kernel
	cls/seq_relationship/output_bias
	cls/seq_relationship/output_weights
Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_ids (InputLayer)       [(None, 512)]             0         
_________________________________________________________________
bert (BertModelLayer)        (None, 512, 768)        
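
The classification head in `create_model` ends in a softmax layer, while the loss is created with `from_logits=True`, which asks the loss to treat its input as unnormalised scores. The following is a rough pure-Python illustration, not the Keras implementation, of what happens to the loss value if probabilities are normalised a second time; `softmax` and `cross_entropy` are hypothetical helpers written for this sketch.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])

probs = [0.99, 0.01]   # a confident, correct softmax output for label 0
label = 0

loss_on_probs = cross_entropy(probs, label)            # probabilities used directly
loss_as_logits = cross_entropy(softmax(probs), label)  # probabilities normalised again

print("loss on probabilities:      ", loss_on_probs)
print("loss with a second softmax: ", loss_as_logits)

# Even a perfect prediction [1.0, 0.0] cannot push the second value to zero.
print("floor with a second softmax:", cross_entropy(softmax([1.0, 0.0]), 0))
```

With two classes, the doubly normalised loss cannot drop below `log(1 + exp(-1))`, so the reported loss values are compressed. The arg-max of each row is unchanged, so accuracy is not affected by this pairing.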

In [17]:
def create_learning_rate_scheduler(max_learn_rate=5e-5,
                                   end_learn_rate=1e-7,
                                   warmup_epoch_count=10,
                                   total_epoch_count=90):

    def lr_scheduler(epoch):
        if epoch < warmup_epoch_count:
            res = (max_learn_rate/warmup_epoch_count) * (epoch + 1)
        else:
            res = max_learn_rate*math.exp(
                math.log(end_learn_rate/max_learn_rate)*(epoch-warmup_epoch_count+1)/(total_epoch_count-warmup_epoch_count+1))
        return float(res)
    learning_rate_scheduler = tf.keras.callbacks.LearningRateScheduler(lr_scheduler, verbose=1)

    return learning_rate_scheduler
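
The schedule is a linear warm-up followed by an exponential decay towards `end_learn_rate`. The sketch below repeats the same arithmetic as `lr_scheduler` in a standalone function (the name `lr_at_epoch` is made up for this example) so the values can be inspected without Keras. The second call uses the settings passed in the training cell of this notebook.

```python
import math

def lr_at_epoch(epoch, max_learn_rate=5e-5, end_learn_rate=1e-7,
                warmup_epoch_count=10, total_epoch_count=90):
    """Same arithmetic as lr_scheduler, without the Keras callback wrapper."""
    if epoch < warmup_epoch_count:
        # Linear ramp: reaches max_learn_rate at the last warm-up epoch.
        return (max_learn_rate / warmup_epoch_count) * (epoch + 1)
    # Exponential decay from max_learn_rate towards end_learn_rate.
    decay_steps = total_epoch_count - warmup_epoch_count + 1
    progress = (epoch - warmup_epoch_count + 1) / decay_steps
    return max_learn_rate * math.exp(math.log(end_learn_rate / max_learn_rate) * progress)

# Default settings: 10 warm-up epochs, then decay until epoch 89.
for epoch in (0, 9, 10, 89):
    print(epoch, lr_at_epoch(epoch))

# Settings from the training cell: the warm-up (20 epochs) is longer than the
# run (10 epochs), so every epoch falls on the linear ramp.
run = [lr_at_epoch(e, max_learn_rate=1e-5, warmup_epoch_count=20, total_epoch_count=10)
       for e in range(10)]
print(run)
```

With the training settings the decay branch is never reached, and the rate grows by `1e-5 / 20` per epoch.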

### Training Model

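The TensorBoard callback from the original notebook has been removed. Because `warmup_epoch_count=20` exceeds the 10 training epochs, the learning rate stays in the linear warm-up phase for the whole run, rising by 5e-7 per epoch.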
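
When a Keras model is trained under a distribution strategy, the `batch_size` given to `fit` is the global batch size, which is split across the replicas. The arithmetic below sketches what the `fit` call in the next cell implies. The replica count of 8 is an assumption (one replica per TPU core); in the notebook it can be read from `strategy.num_replicas_in_sync`.

```python
import math

num_examples = 25000      # rows in train_x
validation_split = 0.1    # fraction held out by model.fit
global_batch_size = 48    # batch_size passed to model.fit
num_replicas = 8          # assumption: one replica per TPU core

# Rows left for training after the validation split is taken out.
num_train = round(num_examples * (1 - validation_split))
num_val = num_examples - num_train

# Each replica processes an equal share of every global batch.
per_replica_batch = global_batch_size // num_replicas

# Number of optimizer steps per epoch; the last batch is smaller than the rest.
steps_per_epoch = math.ceil(num_train / global_batch_size)
last_batch_size = num_train - (steps_per_epoch - 1) * global_batch_size

print("training rows:     ", num_train)
print("validation rows:   ", num_val)
print("per-replica batch: ", per_replica_batch)
print("steps per epoch:   ", steps_per_epoch)
print("size of last batch:", last_batch_size)
```

Choosing a global batch size that is a multiple of the replica count, as 48 is for 8 replicas, keeps the per-replica share a whole number.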

In [19]:
total_epoch_count = 10
model.fit(x=train_x, y=train_y,
          validation_split=0.1,
          batch_size=48,
          shuffle=True,
          epochs=total_epoch_count,
          callbacks=[create_learning_rate_scheduler(max_learn_rate=1e-5,
                                                    end_learn_rate=1e-7,
                                                    warmup_epoch_count=20,
                                                   total_epoch_count=total_epoch_count)])


Epoch 00001: LearningRateScheduler reducing learning rate to 5.000000000000001e-07.
Epoch 1/10
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.

Epoch 00002: LearningRateScheduler reducing learning rate to 1.0000000000000002e-06.
Epoch 2/10

Epoch 00003: LearningRateScheduler reducing learning rate to 1.5000000000000002e-06.
Epoch 3/10

Epoch 00004: LearningRateScheduler reducing learning rate to 2.0000000000000003e-06.
Epoch 4/10

Epoch 00005: LearningRateScheduler reducing learning rate to 2.5000000000000006e-06.
Epoch 5/10

Epoch 00006: LearningRateScheduler reducing learning rate to 3.0000000000000005e-06.
Epoch 6/10

Epoch 00007: LearningRateScheduler reducing learning rate to 3.5000000000000004e-06.
Epoch 7/10

Epoch 00008: LearningRateScheduler reducing learning rate to 4.000000000000001e-06.
Epoch 8/10

Epoch 00009: LearningRateScheduler reducing learning rate to 

<tensorflow.python.keras.callbacks.History at 0x7f3264ac9ba8>

### Model Evaluation

The model reaches 91.63% accuracy on the test set and 94.55% on the training set.

In [20]:
_, train_acc = model.evaluate(train_x, train_y)
_, test_acc = model.evaluate(test_x, test_y)

print('First Model Acc:')
print("train acc", train_acc)
print(" test acc", test_acc)

First Model Acc:
train acc 0.9455199837684631
 test acc 0.9162799715995789
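
The `acc` metric reported above is a sparse categorical accuracy: the arg-max of each two-element output row is compared with the integer label. The sketch below reproduces that computation on made-up toy values (`toy_probs` and `toy_labels` are illustrative only) and converts the reported test accuracy into a count of misclassified reviews.

```python
def argmax(values):
    """Index of the largest element in a list."""
    return max(range(len(values)), key=values.__getitem__)

def accuracy(probabilities, labels):
    """Fraction of rows whose arg-max class equals the integer label."""
    predictions = [argmax(row) for row in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example: four reviews, the third one is misclassified.
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_labels = [0, 1, 1, 1]
print("toy accuracy:", accuracy(toy_probs, toy_labels))

# Reported test accuracy expressed as a number of wrong predictions.
test_size = 25000
test_acc = 0.9162799715995789
misclassified = round(test_size * (1 - test_acc))
print("misclassified test reviews:", misclassified)
```

In the notebook, the same per-row predictions could be obtained with `model.predict(test_x).argmax(axis=-1)`.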
