# Predicting Movie Review Sentiment with BERT on TF Hub

If you’ve been following Natural Language Processing over the past year, you’ve probably heard of BERT: Bidirectional Encoder Representations from Transformers. It’s a neural network architecture designed by Google researchers that’s totally transformed what’s state-of-the-art for NLP tasks, like text classification, translation, summarization, and question answering.

Now that BERT's been added to [TF Hub](https://www.tensorflow.org/hub) as a loadable module, it's easy(ish) to add into existing TensorFlow text pipelines. In an existing pipeline, BERT can replace text embedding layers like ELMo and GloVe. Alternatively, [finetuning](http://wiki.fast.ai/index.php/Fine_tuning) BERT can provide both an accuracy boost and faster training time in many cases.

Here, we'll train a model to predict whether an IMDB movie review is positive or negative using BERT in TensorFlow with TF Hub. Some code was adapted from [this colab notebook](https://colab.sandbox.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb). Let's get started!

In [1]:
import tensorflow as tf
import pandas as pd
import tensorflow_hub as hub
import os
import re
import numpy as np
from bert.tokenization import FullTokenizer
from tqdm import tqdm_notebook
from tensorflow.keras import backend as K

# Initialize session
sess = tf.Session()

# Params for bert model and tokenization
bert_path = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
max_seq_length = 256

W0504 18:46:30.614220 10548 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


In [2]:
## GPU setup

from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]
print(get_available_devices()) 

## Expected output when a GPU is available:
## ['/device:CPU:0', '/device:GPU:0']

['/device:CPU:0', '/device:GPU:0']


In addition to the standard libraries we imported above, we'll need BERT's Python package (for example, installed with `pip install bert-tensorflow`).

In [3]:
import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization


In [4]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

# Data

First, let's download the dataset, hosted by Stanford. The code below, which downloads, extracts, and imports the IMDB Large Movie Review Dataset, is borrowed from [this TensorFlow tutorial](https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub).

In [5]:
# Load all files from a directory in a DataFrame
def load_directory_data(directory):
    data = {"sentence": [], "sentiment": []}
    for file_path in os.listdir(directory):
        with open(os.path.join(directory, file_path), "r", encoding="utf-8") as f:
            data["sentence"].append(f.read())
        # File names follow "<id>_<rating>.txt"; keep the rating as the sentiment score
        data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
    return pd.DataFrame.from_dict(data)

# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
    pos_df = load_directory_data(os.path.join(directory, "pos"))
    neg_df = load_directory_data(os.path.join(directory, "neg"))
    pos_df['polarity'] = 1
    neg_df['polarity'] = 0
    
    return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

In [6]:
# dataset_dir = os.path.dirname('C:\\Users\\Ajou\\.keras\\datasets\\')

# train_df = load_dataset(os.path.join(dataset_dir,"aclImdb", "train"))
# test_df = load_dataset(os.path.join(dataset_dir,"aclImdb", "test"))

In [7]:
import pickle

# load
with open('IMDb binary dataset/train_data.pickle', 'rb') as f:
    train = pickle.load(f)
    
# load
with open('IMDb binary dataset/test_data.pickle', 'rb') as f:
    test = pickle.load(f)

#### Data description
Within these directories, reviews are stored in text files named following the convention `[id]_[rating].txt`, where `[id]` is a unique id and `[rating]` is the star rating for that review on a 1-10 scale. For example, the file `test/pos/200_8.txt` is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb.
In [8]:
train.head()

Unnamed: 0,sentence,sentiment,polarity
0,"Once again, Pia Zadora, the woman who owes her...",1,0
1,I felt like I was watching the Fast and the Fu...,3,0
2,I was so entertained throughout this insightfu...,8,1
3,This movie is really bad. The acting is plain ...,2,0
4,"I only saw this movie once, and that was enoug...",1,0


To keep training fast, we'll take a sample of 5,120 train and test examples, respectively.

In [9]:
train = train.sample(5120)
test = test.sample(5120)

Our input data is the 'sentence' column and our label is the 'polarity' column (0 and 1 for negative and positive, respectively).

In [10]:
DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'polarity'
# label_list is the list of labels, i.e. True, False or 0, 1 or 'dog', 'cat'
label_list = [0, 1]

In [11]:
# Create datasets (only take up to max_seq_length words, to save memory)
train_text = train['sentence'].tolist()
# Keep the first max_seq_length whitespace-separated words and rejoin them
train_text = [' '.join(t.split()[0:max_seq_length]) for t in train_text]
# Turn the 1-D list into a 2-D array of shape (N, 1), one text per row
train_text = np.array(train_text, dtype=object)[:, np.newaxis]
# Labels: 0 or 1
train_label = train['polarity'].tolist()

test_text = test['sentence'].tolist()
test_text = [' '.join(t.split()[0:max_seq_length]) for t in test_text]
test_text = np.array(test_text, dtype=object)[:, np.newaxis]
test_label = test['polarity'].tolist()
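
To make the truncate-and-reshape step concrete, here is a minimal sketch on two toy strings. The texts and the small `max_words` limit are made up for illustration; the notebook itself uses `max_seq_length = 256`.

```python
import numpy as np

max_words = 5  # hypothetical small limit, for illustration only
texts = ["one two three four five six seven", "short review"]

# Keep at most max_words whitespace-separated words per text
truncated = [" ".join(t.split()[:max_words]) for t in texts]

# Add a trailing axis so each text sits in its own row: shape (N, 1)
column = np.array(truncated, dtype=object)[:, np.newaxis]

print(column.shape)  # (2, 1)
print(column[0, 0])  # one two three four five
```

Note that this cut counts whitespace-separated words, not WordPiece tokens, so a text that survives here can still be truncated again once it is tokenized.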

# Data Preprocessing

We'll need to transform our data into a format BERT understands. This involves two steps. First, we create `InputExample`s using the `InputExample` class defined below, which mirrors the one in the BERT library.

- `text_a` is the text we want to classify, which in this case is the `sentence` column of our DataFrame.

- `text_b` is used if we're training a model to understand the relationship between sentences (e.g. is `text_b` a translation of `text_a`? Is `text_b` an answer to the question asked by `text_a`?). This doesn't apply to our task, so we can leave `text_b` blank.

- `label` is the label for our example, here 0 or 1 for negative or positive.

Next, we need to preprocess our data so that it matches the data BERT was trained on. For this, we'll need to do a couple of things (but don't worry--this is also included in the Python library):


1. Lowercase our text (if we're using a BERT lowercase model)
2. Tokenize it (e.g. "sally says hi" -> ["sally", "says", "hi"])
3. Break words into WordPieces (e.g. "calling" -> ["call", "##ing"])
4. Map our words to indexes using a vocab file that BERT provides
5. Add special "CLS" and "SEP" tokens (see the [readme](https://github.com/google-research/bert))
6. Append "index" and "segment" tokens to each input (see the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf))

Happily, we don't have to worry about most of these details.
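
For intuition about step 3, here is a simplified sketch of greedy longest-match-first WordPiece splitting. It is not the library's implementation: `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and the real tokenizer handles punctuation, casing, and a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            # No piece matched: the whole word becomes unknown
            return [unk_token]
        pieces.append(current)
        start = end
    return pieces

# Toy vocabulary, made up for this example
toy_vocab = {"call", "##ing", "johan", "##son", "house"}

print(wordpiece_tokenize("calling", toy_vocab))   # ['call', '##ing']
print(wordpiece_tokenize("johanson", toy_vocab))  # ['johan', '##son']
```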

In [12]:
## fake example

class PaddingInputExample(object):
    """Fake example so the num input examples is a multiple of the batch size.

    When running eval/predict on the TPU, we need to pad the number of examples
    to be a multiple of the batch size, because the TPU requires a fixed batch
    size. The alternative is to drop the last batch, which is bad because it means
    the entire output data won't be generated.
    We use this class instead of `None` because treating `None` as padding
    batches could cause silent errors.
    """

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs an InputExample.
    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only must be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train and dev examples, but not for test examples.
    """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

In [13]:
def create_tokenizer_from_hub_module():
    """Get the vocab file and casing info from the Hub module."""
    bert_module =  hub.Module(bert_path)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    vocab_file, do_lower_case = sess.run(
        [
            tokenization_info["vocab_file"],
            tokenization_info["do_lower_case"],
        ]
    )

    return FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

# Instantiate tokenizer
tokenizer = create_tokenizer_from_hub_module()
tokenizer.tokenize("John Johanson 's   house")
# Input:  John Johanson 's   house
# Tokens: ["john", "johan", "##son", "'", "s", "house"]  (the uncased model lowercases and splits into WordPieces)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore




['john', 'johan', '##son', "'", 's', 'house']

In [14]:
# input, label
# Append "index" and "segment" tokens to each input

def convert_single_example(tokenizer, example, max_seq_length=256):
    """Converts a single `InputExample` into a single `InputFeatures`."""
    
    # make fake example
    if isinstance(example, PaddingInputExample):
        input_ids = [0] * max_seq_length
        input_mask = [0] * max_seq_length
        segment_ids = [0] * max_seq_length
        label = 0
        return input_ids, input_mask, segment_ids, label

    # tokenize the text we want to classify
    tokens_a = tokenizer.tokenize(example.text_a)  
    ## tokens_a:  ["john" ,"johan" ,"##son" ,"'" ,"s" ,"house"]
    
    # If there are more than max_seq_length - 2 tokens, keep only the first max_seq_length - 2
    if len(tokens_a) > max_seq_length - 2:
        tokens_a = tokens_a[0 : (max_seq_length - 2)]

        
    # Add special "CLS" and "SEP" tokens
    # segment_ids: a list of zeros, one per token plus the two special tokens
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)  
    ## tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]

    # convert tokens to id
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    ## [101, 2198, 13093, 3385, 1005, 1055, 2160, 102]
    
    
    # The mask has 1 for real tokens and 0 for padding tokens. 
    # Only real tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    # Sanity checks: an AssertionError is raised if any length is wrong
    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    return input_ids, input_mask, segment_ids, example.label
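
The padding and masking logic can be tried in isolation, without loading the tokenizer. The sketch below uses a hypothetical helper, `pad_features`, and the token ids from the comment in the cell above; the short `max_seq_length` of 12 is chosen only to keep the output readable.

```python
def pad_features(token_ids, max_seq_length, pad_id=0):
    """Pad ids to max_seq_length and build the matching mask and segment ids."""
    if len(token_ids) > max_seq_length:
        raise ValueError("Sequence is longer than max_seq_length")
    pad_len = max_seq_length - len(token_ids)
    input_ids = list(token_ids) + [pad_id] * pad_len
    # 1 marks a real token, 0 marks padding
    input_mask = [1] * len(token_ids) + [0] * pad_len
    # Single-sentence task: every position belongs to segment 0
    segment_ids = [0] * max_seq_length
    return input_ids, input_mask, segment_ids

# Ids for [CLS] john johan ##son ' s house [SEP], as in the comment above
ids = [101, 2198, 13093, 3385, 1005, 1055, 2160, 102]
input_ids, input_mask, segment_ids = pad_features(ids, max_seq_length=12)

print(input_ids)   # [101, 2198, 13093, 3385, 1005, 1055, 2160, 102, 0, 0, 0, 0]
print(input_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```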

In [15]:
def convert_examples_to_features(tokenizer, examples, max_seq_length=256):
    """Convert a set of `InputExample`s to a list of `InputFeatures`."""

    input_ids, input_masks, segment_ids, labels = [], [], [], []
    for example in tqdm_notebook(examples, desc="Converting examples to features"):
        input_id, input_mask, segment_id, label = convert_single_example(
            tokenizer, example, max_seq_length
        )
        input_ids.append(input_id)
        input_masks.append(input_mask)
        segment_ids.append(segment_id)
        labels.append(label)
    return (
        np.array(input_ids),
        np.array(input_masks),
        np.array(segment_ids),
        np.array(labels).reshape(-1, 1),
    )

In [16]:
display(train_text[:2])
display(train_label[:2])

array([['"Capitães de Abril" is a very good. The story isn\'t a documentary about the 1974 revolution in Portugal. But it gives us an idea of how it was like. The fiction of the story isn\'t of great interest, but it doesn\'t spoil the movie. The heroic actions of Captain Salgueiro Maia aren\'t exaggerations and the film is also a tribute for his deeds. Captain Salgueiro Maia remains one of the greatest heroes of the 25th of April Revolution.<br /><br />All the actors are very good and even the smallest roles are played wonderfully. Lisbon looks beautiful as ever. Don\'t miss it! I liked this film very much.'],
       ["Three horror stories based on members of a transgressive Hindu cult that return home but changed in some way. In the first story our former cult member is now in an insane asylum and is visited by a reported who wants to find out about what went on at the cult. Somewhat slow going as story is told in flashbacks while the two sit on chairs and face each other. Reporter i

[1, 1]

In [17]:
def convert_text_to_examples(texts, labels):
    """Create InputExamples"""
    InputExamples = []
    for text, label in zip(texts, labels):
        InputExamples.append(
            InputExample(guid=None, text_a=" ".join(text), text_b=None, label=label)
        )
    return InputExamples

# Instantiate tokenizer
tokenizer = create_tokenizer_from_hub_module()

# Convert data to InputExample format
train_examples = convert_text_to_examples(train_text, train_label)
test_examples = convert_text_to_examples(test_text, test_label)

# Convert to features
(train_input_ids, train_input_masks, train_segment_ids, train_labels 
) = convert_examples_to_features(tokenizer, train_examples, max_seq_length=max_seq_length)
(test_input_ids, test_input_masks, test_segment_ids, test_labels
) = convert_examples_to_features(tokenizer, test_examples, max_seq_length=max_seq_length)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore








In [18]:
test_segment_ids

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

*** Example ***<br>
guid: None<br>
tokens: [CLS] i was over ##taken by the emotion . un ##for ##get ##table rendering of a wartime story which is unknown to most people . the performances were fault ##less and outstanding . [SEP]<br>
input_ids: 101 1045 2001 2058 25310 2011 1996 7603 1012 4895 29278 18150 10880 14259 1997 1037 12498 2466 2029 2003 4242 2000 2087 2111 1012 1996 4616 2020 6346 3238 1998 5151 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br>
input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br>
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br>
label: 1 (id = 1)

# Creating a model

Now that we've prepared our data, let's focus on building a model. `build_model` does just this below.

First, it loads the BERT TF Hub module again (this time to extract the computation graph), wrapped in a custom `BertLayer`.

Next, it adds a small classification head on top of BERT's pooled output: a 256-unit ReLU layer followed by a single sigmoid unit. These new layers are trained to adapt BERT to our sentiment task (i.e. classifying whether a movie review is positive or negative).

This strategy of using a mostly trained model is called [fine-tuning](http://wiki.fast.ai/index.php/Fine_tuning).

In [19]:
# We next build a custom layer using Keras, integrating BERT from tf-hub. 
# The model is very large (110,302,011 parameters!!!) so we fine tune a subset of layers.

class BertLayer(tf.layers.Layer):
    def __init__(self, n_fine_tune_layers=10, **kwargs):
        self.n_fine_tune_layers = n_fine_tune_layers
        self.trainable = True
        self.output_size = 768
        super(BertLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.bert = hub.Module(
            bert_path,
            trainable=self.trainable,
            name="{}_module".format(self.name)
        )

        trainable_vars = self.bert.variables

        # Remove unused layers
        trainable_vars = [var for var in trainable_vars if "/cls/" not in var.name]

        # Select how many layers to fine tune
        trainable_vars = trainable_vars[-self.n_fine_tune_layers :]

        # Add to trainable weights
        for var in trainable_vars:
            self._trainable_weights.append(var)
            
        for var in self.bert.variables:
            if var not in self._trainable_weights:
                self._non_trainable_weights.append(var)

        super(BertLayer, self).build(input_shape)

    def call(self, inputs):
        inputs = [K.cast(x, dtype="int32") for x in inputs]
        input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(
            input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids
        )
        result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
            "pooled_output"
        ]
        return result

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_size)
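
One detail of `build` worth spelling out: `trainable_vars[-self.n_fine_tune_layers:]` slices a flat list of variables, so the parameter counts individual variables (a kernel or a bias) rather than whole Transformer blocks. The sketch below imitates the filter-then-slice logic on plain strings. The variable names and their order are hypothetical stand-ins; the real names and ordering come from the Hub module.

```python
# Hypothetical variable names, standing in for self.bert.variables
variable_names = [
    "bert_layer_module/bert/embeddings/word_embeddings",
    "bert_layer_module/bert/encoder/layer_11/output/dense/kernel",
    "bert_layer_module/bert/encoder/layer_11/output/dense/bias",
    "bert_layer_module/bert/pooler/dense/kernel",
    "bert_layer_module/bert/pooler/dense/bias",
    "bert_layer_module/cls/predictions/output_bias",
]

n_fine_tune_layers = 3

# Drop the pre-training head, which the classifier does not use
kept = [name for name in variable_names if "/cls/" not in name]

# Keep only the last n variables as trainable; the rest stay frozen
trainable = kept[-n_fine_tune_layers:]
frozen = [name for name in variable_names if name not in trainable]

for name in trainable:
    print(name)
```

With these made-up names, the trainable set is one bias from the last encoder block plus the pooler's kernel and bias, which is far less than three full Transformer layers.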

In [20]:

def build_model(max_seq_length): 
    in_id = tf.keras.layers.Input(shape=(max_seq_length,), name="input_ids")
    in_mask = tf.keras.layers.Input(shape=(max_seq_length,), name="input_masks")
    in_segment = tf.keras.layers.Input(shape=(max_seq_length,), name="segment_ids")
    bert_inputs = [in_id, in_mask, in_segment]
    
    bert_output = BertLayer(n_fine_tune_layers=3)(bert_inputs)
    dense = tf.keras.layers.Dense(256, activation='relu')(bert_output)
    pred = tf.keras.layers.Dense(1, activation='sigmoid')(dense)
    
    model = tf.keras.models.Model(inputs=bert_inputs, outputs=pred)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    
    return model

def initialize_vars(sess):
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    K.set_session(sess)
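
The parameter total quoted in the `BertLayer` comment can be checked by hand from the layer sizes in `build_model`. This is plain arithmetic: the BERT count is taken from the model summary printed below, and `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

bert_layer_params = 110104890     # BertLayer row of the model summary
hidden = dense_params(768, 256)   # pooled output (768) -> Dense(256)
output = dense_params(256, 1)     # Dense(256) -> Dense(1)

total = bert_layer_params + hidden + output
print(hidden, output, total)  # 196864 257 110302011
```

The two new dense layers therefore account for only 197,121 of the roughly 110 million parameters.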

In [21]:
model = build_model(max_seq_length)

initialize_vars(sess)

model.fit(
    [train_input_ids, train_input_masks, train_segment_ids], 
    train_labels,
    validation_data=([test_input_ids, test_input_masks, test_segment_ids], test_labels),
    epochs=3,
    # If you hit a GPU ResourceExhaustedError, reduce the batch size
    batch_size=16)

model.save('BertModel_3epoch.h5')

pre_save_preds = model.predict([test_input_ids[0:100], 
                            test_input_masks[0:100], 
                            test_segment_ids[0:100]]
                          ) # predictions before we clear and reload model

INFO:tensorflow:Saver not created because there are no variables in the graph to restore




INFO:tensorflow:Saver not created because there are no variables in the graph to restore




__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          (None, 256)          0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        (None, 256)          0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        (None, 256)          0                                            
__________________________________________________________________________________________________
bert_layer_1 (BertLayer)        (None, 768)          110104890   input_ids[0][0]                  
                                                                 input_masks[0][0]                
          



Epoch 2/3




Epoch 3/3






Voila! We have a sentiment classifier! As a sanity check, below we clear the model, rebuild it, and reload the saved weights so we can predict again on the same 100 test examples.
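
The model's sigmoid output is a probability of the positive class, not a label. A minimal sketch of turning such probabilities into 0/1 predictions and an accuracy figure follows; the probabilities and labels here are made-up values, and the 0.5 threshold is an assumption rather than something the notebook fixes.

```python
import numpy as np

# Illustrative values only, shaped (N, 1) like the output of model.predict
probs = np.array([[0.003], [0.968], [0.467], [0.532]])
labels = np.array([[0], [1], [1], [1]])

# Probabilities at or above 0.5 count as positive reviews
predicted = (probs >= 0.5).astype(int)

accuracy = float((predicted == labels).mean())
print(predicted.ravel().tolist())  # [0, 1, 0, 1]
print(accuracy)                    # 0.75
```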

In [23]:
# Clear and load model
model = None
model = build_model(max_seq_length)
initialize_vars(sess)
model.load_weights('BertModel_3epoch.h5')

post_save_preds = model.predict([test_input_ids[0:100], 
                                test_input_masks[0:100], 
                                test_segment_ids[0:100]]
                              ) # predictions after we clear and reload model
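
To confirm that the reloaded model reproduces the earlier predictions, the two arrays can be compared numerically. The helper below, `predictions_match`, is a hypothetical name and the tolerance is an arbitrary choice; the toy arrays stand in for `pre_save_preds` and `post_save_preds`.

```python
import numpy as np

def predictions_match(before, after, atol=1e-5):
    """True if both prediction arrays have the same shape and close values."""
    before = np.asarray(before)
    after = np.asarray(after)
    return before.shape == after.shape and bool(np.allclose(before, after, atol=atol))

# Toy stand-ins for predictions before and after reloading the weights
before = np.array([[0.0034], [0.9681], [0.9828]])
after_same = before + 1e-7        # tiny floating-point noise
after_different = before[::-1]    # same values, wrong order

print(predictions_match(before, after_same))       # True
print(predictions_match(before, after_different))  # False
```

In the notebook this would be called as `predictions_match(pre_save_preds, post_save_preds)`.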

INFO:tensorflow:Saver not created because there are no variables in the graph to restore




INFO:tensorflow:Saver not created because there are no variables in the graph to restore




__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          (None, 256)          0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        (None, 256)          0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        (None, 256)          0                                            
__________________________________________________________________________________________________
bert_layer_3 (BertLayer)        (None, 768)          110104890   input_ids[0][0]                  
                                                                 input_masks[0][0]                
          

In [24]:
post_save_preds

array([[3.4457862e-03],
       [9.6813858e-01],
       [9.8280299e-01],
       [7.5745098e-02],
       [4.6677220e-01],
       [9.9224585e-01],
       [9.5921111e-01],
       [6.4939745e-03],
       [7.4180928e-03],
       [9.4558853e-01],
       [9.9092382e-01],
       [9.7183630e-02],
       [5.3208077e-01],
       [9.9250728e-01],
       [9.5702046e-01],
       [1.6069709e-03],
       [9.7688019e-01],
       [1.0215268e-02],
       [8.1035399e-01],
       [8.0544382e-01],
       [7.7812260e-01],
       [9.7519207e-01],
       [1.3380524e-02],
       [4.8946780e-03],
       [9.9204761e-01],
       [3.5089269e-01],
       [4.5742360e-03],
       [3.5204613e-03],
       [8.0536211e-01],
       [5.2231675e-01],
       [7.5015584e-03],
       [8.9300768e-03],
       [9.4740337e-01],
       [4.4505000e-03],
       [9.5870096e-04],
       [1.9042206e-01],
       [9.3479329e-01],
       [9.5588309e-01],
       [8.3491793e-03],
       [9.9396425e-01],
       [2.7839243e-01],
       [9.559593