# BERT for Sentiment Analysis

Use BERT for sentiment analysis with 1.6 million tweets.

Dataset: http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip

Pretrained BERT model: BERT base
- L=12: 12 encoder layers
- H=768: 768 embedding dimension
- A=12: 12 self attention heads

Highlights:
- BERT tokenizer
- BERT pretrained embedding layer
- Bigram, trigram, and four-gram 1D Convolution model

Major data preprocessing:
1. Tokenization:
    - Build the BERT tokenizer from the vocabulary of the BERT base model (about 30,000 WordPiece tokens)
    - Use the BERT tokenizer to tokenize all tweets
2. 3 inputs
    - The BERT embedding layer is the first layer of our model, so the model must be fed inputs in the format BERT expects
    - Each input/sentence has 3 parts:
        1. Tokenized sentence
            - Consists of:
            - Tokenized tweet
            - CLS token for classification
            - SEP token to separate sentences (in our case there's only one sentence, so SEP goes at the end of the sentence)
        2. Position of padding/mask
            - Sequence of 0s and 1s indicating the position of PAD tokens
            - 0 for the PAD token and 1 for all other tokens
        3. Sentence segmentation
            - ex:
            - first sentence corresponds to list of 0s
            - second sentence corresponds to list of 1s
            - third sentence corresponds to list of 0s
            - etc.
3. Pretrained BERT embedding layer
    - Use pretrained BERT embedding layer as is
    - No fine tuning
    - trainable=False

<a id="top"></a>
## Content

1. [Import dependencies](#import)
2. [Data preprocessing](#dataprep)
    - [Load data](#load)
    - [Drop unused features](#drop)
    - [Prepare inputs](#prep)
        - [Clean tweets](#clean)
        - [Adjust sentiment labels](#label)
        - [BERT tokenizer](#tokenizer)
        - [Tokenization](#token)
        - [3 inputs](#inputs)
        - [Data generator](#datagen)
3. [Modeling](#model)
4. [Training](#train)
5. [Evaluation](#eval)

<a id="import"></a>
## 1. Import dependencies

In [1]:
import requests 
import os
from zipfile import ZipFile
import numpy as np
import pandas as pd
import re
import math
from bs4 import BeautifulSoup
import random

try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

import tensorflow_hub as hub

from tensorflow.keras import layers
import bert

In [2]:
bert_base_model = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1"

[Back to top](#top)

<a id="dataprep"></a>
## 2. Data preprocessing

<a id="load"></a>
### Load data

In [3]:
def download_url(url, save_path, chunk_size=128):
    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)
            
def extract_file(file_path, destination):
    with ZipFile(file_path, 'r') as file: 
        # print all the contents of the zip file 
        file.printdir() 
        # extract all the files         
        file.extractall(destination) 
        print('All the files extracted to {}'.format(destination)) 

In [4]:
download_url('http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip', 'trainingandtestdata.zip')
extract_file('trainingandtestdata.zip', "data")

File Name                                             Modified             Size
testdata.manual.2009.06.14.csv                 2010-03-04 20:20:12        74326
training.1600000.processed.noemoticon.csv      2010-03-04 20:20:42    238803811
All the files extracted to data


In [5]:
# load data
cols = ["sentiment", "id", "date", "query", "user", "text"]
data = pd.read_csv(os.path.join('data', "training.1600000.processed.noemoticon.csv"), 
                   header=None,
                   names=cols,
                   engine='python',
                   encoding='latin1')
data.head()

| | sentiment | id | date | query | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


<a id="drop"></a>
#### Drop unused features

In [6]:
# drop unused features
data.drop(['id', 'date', 'query', 'user'], axis=1, inplace=True)
data.head()

| | sentiment | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


[Back to top](#top)

<a id="prep"></a>
### Prepare inputs

<a id="clean"></a>
#### Clean tweets

In [7]:
def clean_tweet(tweet):
    tweet = BeautifulSoup(tweet, 'lxml').get_text()
    # remove @tag
    tweet = re.sub(r"@\S+", ' ', tweet)
    # remove link
    tweet = re.sub(r"http\S+", ' ', tweet)
    # remove special char
    tweet = re.sub(r"[^A-Za-z ?!,.\'\"]", ' ', tweet)
    # remove excess whitespace
    tweet = re.sub(r" +", ' ', tweet)
    return tweet

In [9]:
data_clean = data.text.apply(clean_tweet)
data_clean.head()



0     Awww, that's a bummer. You shoulda got David ...
1    is upset that he can't update his Facebook by ...
2     I dived many times for the ball. Managed to s...
3      my whole body feels itchy and like its on fire 
4     no, it's not behaving at all. i'm mad. why am...
Name: text, dtype: object
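
To see what each cleaning step contributes, here is a minimal standalone sketch of the same regex pipeline applied to a made-up tweet. It swaps BeautifulSoup for the standard library's `html.unescape`, an assumption made only to keep the sketch dependency-free (it decodes entities such as `&amp;` but does not strip HTML tags). The name `clean_tweet_sketch` and the sample text are hypothetical.

```python
import html
import re


def clean_tweet_sketch(tweet):
    # Stand-in for BeautifulSoup(...).get_text(): decode HTML entities only
    tweet = html.unescape(tweet)
    # Remove @mentions
    tweet = re.sub(r"@\S+", " ", tweet)
    # Remove links
    tweet = re.sub(r"http\S+", " ", tweet)
    # Keep letters, spaces, and a few punctuation marks; drop everything else
    tweet = re.sub(r"[^A-Za-z ?!,.'\"]", " ", tweet)
    # Collapse runs of spaces into a single space
    tweet = re.sub(r" +", " ", tweet)
    return tweet


sample = "@bob Tom &amp; Jerry http://t.co/x1 it's 100% great!!"
cleaned = clean_tweet_sketch(sample)
print(repr(cleaned))  # ' Tom Jerry it's great!!'
```

The leading space is left where the @mention was removed, which matches the leading spaces visible in the cleaned tweets above.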

<a id="label"></a>
#### Adjust sentiment labels
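The original data labels negative as 0, neutral as 2, and positive as 4. Change 4 to 1 so the labels are binary; the output below shows that only two classes are present in the training file.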

In [10]:
data_labels = data.sentiment.values
data_labels[data_labels == 4] = 1
set(data_labels)

{0, 1}

<a id="tokenizer"></a>
#### BERT tokenizer
Create a BERT tokenizer.

But first create a BERT layer to get access to the metadata the BERT tokenizer needs:
1. Vocabulary file
2. Whether to lowercase inputs when tokenizing

In [11]:
# BERT layer
bert_layer = hub.KerasLayer(bert_base_model, trainable=False)

In [12]:
# Get meta data for BERT tokenizer from BERT layer
vocab_file =    bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

In [13]:
# BERT tokenizer
FullTokenizer = bert.bert_tokenization.FullTokenizer
tokenizer = FullTokenizer(vocab_file, do_lower_case)

<a id="token"></a>
#### Tokenization

Tokenize each tweet with the BERT tokenizer and wrap it as a sentence. Each sentence is composed of the [CLS] token, the tokenized tweet, and the [SEP] token.

In [14]:
def encode_sentence(sentence):
    return ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]

In [15]:
data_sentences = data_clean.apply(encode_sentence)
data_sentences.head()

0    [[CLS], aw, ##w, ##w, ,, that, ', s, a, bum, #...
1    [[CLS], is, upset, that, he, can, ', t, update...
2    [[CLS], i, dive, ##d, many, times, for, the, b...
3    [[CLS], my, whole, body, feels, it, ##chy, and...
4    [[CLS], no, ,, it, ', s, not, be, ##ha, ##ving...
Name: text, dtype: object
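
Splits such as `aw, ##w, ##w` and `dive, ##d` come from WordPiece, which breaks each word into the longest vocabulary pieces it can find, scanning left to right. The sketch below illustrates that greedy longest-match-first idea on a tiny hypothetical vocabulary. The function name `wordpiece_sketch` and `toy_vocab` are made up for illustration, and the real tokenizer also lowercases and splits punctuation before this step.

```python
def wordpiece_sketch(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary file
toy_vocab = {"aw", "##w", "dive", "##d", "it", "##chy"}

print(wordpiece_sketch("awww", toy_vocab))   # ['aw', '##w', '##w']
print(wordpiece_sketch("dived", toy_vocab))  # ['dive', '##d']
print(wordpiece_sketch("itchy", toy_vocab))  # ['it', '##chy']
```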

<a id="inputs"></a>
#### 3 inputs for each sentence

To prepare inputs for the BERT embedding layer (the first layer of our model), create 3 types of inputs:
1. Sentence with tokens
    - Prepared in previous step
2. Mask
    - Indicates the position of the padding tokens
    - Tells BERT to ignore those positions when computing embeddings
    - 0 for padding token, [PAD], and 1 for other tokens
3. Segment input
    - Indicates separation of sentences
    - For example, tokens in first sentence have value 0, second sentence 1, third 0, ...
    - But in our case we only have one sentence
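
As a rough illustration of the three inputs, the sketch below builds them in plain Python for a two-sentence example with trailing padding. `TOY_VOCAB` and the `toy_*` helpers are hypothetical: the word ids are made up, and only the special-token ids (101, 102, 0) mirror values that appear in this notebook's outputs.

```python
# Hypothetical toy vocabulary: word ids are invented for illustration
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
             "roses": 1, "are": 2, "red": 3, "violets": 4, "blue": 5}


def toy_ids(tokens):
    # Input 1: map each token to its vocabulary id
    return [TOY_VOCAB[token] for token in tokens]


def toy_masks(tokens):
    # Input 2: 1 for real tokens, 0 for padding
    return [0 if token == "[PAD]" else 1 for token in tokens]


def toy_segments(tokens):
    # Input 3: segment id flips between 0 and 1 after every [SEP]
    seg_ids, current = [], 0
    for token in tokens:
        seg_ids.append(current)
        if token == "[SEP]":
            current = 1 - current
    return seg_ids


tokens = ["[CLS]", "roses", "are", "red", "[SEP]",
          "violets", "are", "blue", "[SEP]", "[PAD]", "[PAD]"]

ids = toy_ids(tokens)
masks = toy_masks(tokens)
segments = toy_segments(tokens)
print(ids)       # [101, 1, 2, 3, 102, 4, 2, 5, 102, 0, 0]
print(masks)     # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
```

With a single tweet per input, as in this notebook, the segment list is all 0s.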
    
We also want to shuffle all sentences and filter out sentences that are too short.

Note: we create padded batches (each batch is padded independently) instead of padding the whole dataset at once. This adds the minimum number of padding tokens, which also keeps the dataset small.
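
The saving from per-batch padding can be checked with a small count. The sketch below is plain Python with hypothetical sentence lengths and helper names. It compares padding the whole dataset to one width against padding each batch to its own longest sentence. Sorting by length before batching is an extra assumption that the cells in this notebook do not show, included only because it reduces padding further.

```python
def pad_batch(batch, pad_id=0):
    # Pad every sequence in the batch to the longest sequence in that batch
    width = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]


def count_padding(sequences, batch_size):
    # Total number of pad tokens added when batches are padded independently
    total = 0
    for i in range(0, len(sequences), batch_size):
        batch = sequences[i:i + batch_size]
        padded = pad_batch(batch)
        total += sum(len(p) - len(s) for p, s in zip(padded, batch))
    return total


# Hypothetical token-id sequences with lengths 3, 9, 4, 8, 3, 10
sequences = [[1] * n for n in (3, 9, 4, 8, 3, 10)]

whole_dataset = count_padding(sequences, batch_size=len(sequences))
per_batch = count_padding(sequences, batch_size=2)
per_batch_sorted = count_padding(sorted(sequences, key=len), batch_size=2)

print(whole_dataset)     # 23 pad tokens
print(per_batch)         # 17 pad tokens
print(per_batch_sorted)  # 5 pad tokens
```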

In [16]:
def get_ids(tokens):
    return tokenizer.convert_tokens_to_ids(tokens)

def get_masks(tokens):
    return np.char.not_equal(tokens, "[PAD]").astype(int)

def get_segments(tokens):
    seg_ids = []
    current_seg_id = 0
    for token in tokens:
        seg_ids.append(current_seg_id)
        if token=="[SEP]":
            current_seg_id = 1-current_seg_id
    return seg_ids

Test the functions.

Expect bert_layer to accept the formatted inputs and return:
1. BERT representation of the whole sentence
    - Vector size should be 768 because we use the BERT base model
2. BERT representation of each individual token
    - Expect 6 length-768 vectors representing "[CLS]", "roses", "are", "red", ".", "[SEP]" (the uncased tokenizer lowercases "Roses")

In [17]:
my_sent = ['[CLS]'] + tokenizer.tokenize("Roses are red.") + ["[SEP]"]
my_input = [tf.expand_dims(tf.cast(get_ids(my_sent), tf.int32), 0), 
            tf.expand_dims(tf.cast(get_masks(my_sent), tf.int32), 0), 
            tf.expand_dims(tf.cast(get_segments(my_sent), tf.int32), 0)]

bert_layer(my_input)

[<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
 array([[-9.2793560e-01, -4.1033486e-01, -9.6575487e-01,  9.0731782e-01,
          8.1291342e-01, -1.7417410e-01,  9.1123444e-01,  3.4195185e-01,
         -8.7452102e-01, -9.9998927e-01, -7.7840954e-01,  9.6938503e-01,
          9.8616052e-01,  6.3696265e-01,  9.4863129e-01, -7.5119293e-01,
         -4.5833918e-01, -7.0810443e-01,  4.6209806e-01, -6.5792716e-01,
          7.6041436e-01,  9.9999493e-01, -3.9686024e-01,  3.4416601e-01,
          6.1648846e-01,  9.9440002e-01, -7.7663356e-01,  9.3831652e-01,
          9.5945227e-01,  7.3287946e-01, -6.9343668e-01,  2.9308012e-01,
         -9.9378556e-01, -1.6455150e-01, -9.6701938e-01, -9.9554950e-01,
          5.3293502e-01, -6.8806088e-01,  1.3471809e-02,  2.9819820e-02,
         -9.1835654e-01,  4.2052594e-01,  9.9998909e-01,  2.5267580e-01,
          6.0623521e-01, -3.5075003e-01, -1.0000000e+00,  4.9758524e-01,
         -8.9518732e-01,  9.6256083e-01,  9.4373035e-01,  9.0328515e-01,


Pass!

bert_layer returns:
1. Vector representation of the whole sentence, size=(768)
2. Vector representation of each word, size=(6, 768)

Note1: axis=0 is the batch dimension, which holds just 1 sentence in our test case, so the returned tensors have shapes (1, 768) and (1, 6, 768)

Note2: vector length is 768 because we're using BERT base model

Next, shuffle the sentences and apply the 3 formatting functions to all sentences.

In [18]:
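# Shuffle sentences and labels together so each tweet keeps its own label
data_with_labels = list(zip(data_sentences, data_labels))
random.shuffle(data_with_labels)

In [19]:
short_sent_len = 7

# Build the 3 inputs per sentence and drop sentences that are too short
all_inputs = [([get_ids(sentence), get_masks(sentence), get_segments(sentence)], label)
              for sentence, label in data_with_labels
              if len(sentence) > short_sent_len]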

<a id="datagen"></a>
#### Data generator
Generate batches of data:
- Training data: size=(batch size, 3 (for the 3 inputs), padded sentence length)
    - Each batch can have a different padded sentence length, but sentences should not be too short!
- Training label: size=(batch size)

In [26]:
# A list is an iterable, so a lambda returning it can serve as the generator for a dataset
data_gen = tf.data.Dataset.from_generator(lambda: all_inputs, output_types=(tf.int32, tf.int32))

In [27]:
BATCH_SIZE = 32
batch_inputs = data_gen.padded_batch(BATCH_SIZE,
                                     padded_shapes=((3, None), ()),
                                     padding_values=(0,0))

In [38]:
# Expect to see size (BATCH_SIZE, 3, sentence length)
next(iter(batch_inputs))

(<tf.Tensor: shape=(32, 3, 8), dtype=int32, numpy=
 array([[[  101,  5742,  1029,  8501,  4402,  4402,  2098,   102],
         [    1,     1,     1,     1,     1,     1,     1,     1],
         [    0,     0,     0,     0,     0,     0,     0,     0]],
 
        [[  101,  2821, 23644,  4067,  2017,  8840,  2140,   102],
         [    1,     1,     1,     1,     1,     1,     1,     1],
         [    0,     0,     0,     0,     0,     0,     0,     0]],
 
        [[  101,  9875,  2026,  2047,  2880,  5440,  2773,   102],
         [    1,     1,     1,     1,     1,     1,     1,     1],
         [    0,     0,     0,     0,     0,     0,     0,     0]],
 
        [[  101,  4931,  1010,  4283,  2005,  3582,  5959,   102],
         [    1,     1,     1,     1,     1,     1,     1,     1],
         [    0,     0,     0,     0,     0,     0,     0,     0]],
 
        [[  101,  2736, 13869,  2005,  1012,  8038,  2100,   102],
         [    1,     1,     1,     1,     1,     1,     1,     1],

In [39]:
# shuffle batches and split training and testing set
NUM_BATCHES = math.ceil(len(all_inputs) / BATCH_SIZE)
test_size = 0.1
NUM_BATCHES_TEST = int(NUM_BATCHES * test_size)

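# Dataset.shuffle returns a new dataset, so keep the result.
# reshuffle_each_iteration=False fixes the order so that take/skip
# yield the same disjoint test/train batches on every epoch.
batch_inputs = batch_inputs.shuffle(NUM_BATCHES, reshuffle_each_iteration=False)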
test_inputs = batch_inputs.take(NUM_BATCHES_TEST)
train_inputs = batch_inputs.skip(NUM_BATCHES_TEST)
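
A take/skip split only separates training and test data cleanly when the batch order stays fixed between passes. The plain-Python sketch below mimics the idea with a list of batch indices. The example count, the seed, and the name `split_batches` are hypothetical, and list slicing stands in for `Dataset.take` and `Dataset.skip`.

```python
import math
import random


def split_batches(batches, test_fraction=0.1, seed=42):
    # Shuffle once with a fixed seed so the order is reproducible
    order = list(batches)
    random.Random(seed).shuffle(order)
    num_test = int(len(order) * test_fraction)
    test = order[:num_test]    # plays the role of Dataset.take(num_test)
    train = order[num_test:]   # plays the role of Dataset.skip(num_test)
    return train, test


num_examples = 1000  # hypothetical dataset size
batch_size = 32
num_batches = math.ceil(num_examples / batch_size)  # 32 batches

train, test = split_batches(range(num_batches))
print(len(train), len(test))  # 29 3
```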

[Back to top](#top)

<a id="model"></a>
## 3. Modeling
Use BERT for embedding layer.

In [29]:
class DCNNBERTEmbedding(tf.keras.Model):
    def __init__(self, 
                 num_filters=50, 
                 FFN_units=512, 
                 num_classes=2, 
                 dropout_rate=0.1,
                 name='dcnn'):
        super(DCNNBERTEmbedding, self).__init__(name=name)
        
        # Define layers
        
        # Instead of tf.keras.layers.Embedding, use BERT layer (as is) for word embedding
        self.bert_layer = hub.KerasLayer(bert_base_model, trainable=False)
        
        # Len 2 feature detector (bi gram)
        self.bigram = layers.Conv1D(kernel_size=2, filters=num_filters, padding='valid', activation='relu')
        
        # Len 3 feature detector (tri gram)
        self.trigram = layers.Conv1D(kernel_size=3, filters=num_filters, padding='valid', activation='relu')
        
        # Len 4 feature detector (four gram)
        self.fourgram = layers.Conv1D(kernel_size=4, filters=num_filters, padding='valid', activation='relu')        
        
        # Pooling layer
        self.pool = layers.GlobalMaxPooling1D()
        
        # Fully connected hidden layer
        self.dense_1 = layers.Dense(units=FFN_units, activation='relu')
        
        # Dropout layer
        self.dropout = layers.Dropout(rate=dropout_rate)
        
        # Fully connected final layer
        if num_classes==2:
            self.last_dense = layers.Dense(units=1, activation='sigmoid')
        else:
            self.last_dense = layers.Dense(units=num_classes, activation='softmax')
            
    def embd_with_bert(self, all_tokens):
        # bert_layer returns:
        # 1. BERT (vector) representation of whole sentence (not used here)
        # 2. BERT (vector) representation of individual words 
        _, embs = self.bert_layer([all_tokens[:,0,:],  # ids
                                   all_tokens[:,1,:],  # masks
                                   all_tokens[:,2,:]]) # segments
        return embs
    
    def call(self, inputs, training):
        x = self.embd_with_bert(inputs)
        x_1 = self.bigram(x)
        x_1 = self.pool(x_1)
        x_2 = self.trigram(x)
        x_2 = self.pool(x_2)
        x_3 = self.fourgram(x)
        x_3 = self.pool(x_3) # shape = (batch_size, num_filters)
        
        merged = tf.concat([x_1, x_2, x_3], axis=-1) # shape = (batch_size, 3*num_filters)
        merged = self.dense_1(merged)
        merged = self.dropout(merged, training)
        output = self.last_dense(merged)
        
        return output
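
To make the n-gram branches concrete, here is a minimal plain-Python sketch of a single "valid" 1D convolution filter with ReLU, followed by global max pooling. The tiny 2-dimensional embeddings, the all-ones kernels, and the helper names are hypothetical. In the model above each token vector has 768 dimensions and each branch has `num_filters` learned kernels.

```python
def conv1d_valid(sequence, kernel):
    """Slide one filter over the sequence without padding and apply ReLU."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        # Dot product between the window of token vectors and the kernel
        score = sum(x * w
                    for vec, w_vec in zip(window, kernel)
                    for x, w in zip(vec, w_vec))
        outputs.append(max(0.0, score))
    return outputs


def global_max_pool(values):
    # Keep only the strongest activation along the sequence
    return max(values)


# Hypothetical embeddings for a 6-token sentence
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
            [0.0, 0.0], [2.0, -1.0], [1.0, 0.0]]

features = {}
for k in (2, 3, 4):  # bigram, trigram, four-gram
    kernel = [[1.0, 1.0]] * k  # hypothetical all-ones filter
    activations = conv1d_valid(sentence, kernel)
    features[k] = (len(activations), global_max_pool(activations))

print(features)  # {2: (5, 3.0), 3: (4, 4.0), 4: (3, 4.0)}
```

A kernel of size k over L tokens yields L - k + 1 positions, so with `padding='valid'` the four-gram branch needs at least 4 tokens. That is presumably one reason very short sentences are filtered out earlier.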

[Back to top](#top)

<a id="train"></a>
## 4. Training

Compared to training a model that uses tf.keras.layers.Embedding for word embedding, we should expect:

1. No embedding weights to train: the pretrained BERT embedding layer is frozen, so only the convolution and dense layers are updated
2. Less overfitting
    - A customized embedding layer trained from scratch is likely to overfit
    - The pretrained BERT embedding layer (without fine tuning) is well trained and general enough for our dataset, hence less overfitting
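
As a back-of-the-envelope check on how little is actually trained, the sketch below counts the trainable weights of the layers on top of BERT. It assumes the values used in this notebook (768-dimensional BERT outputs, `NUM_FILTERS = 100`, `FFN_UNITS = 256`, a single sigmoid output) and the standard parameter formulas for `Conv1D` and `Dense` layers with bias. The helper names are hypothetical.

```python
HIDDEN = 768       # size of each BERT token vector
NUM_FILTERS = 100  # filters per convolution branch
FFN_UNITS = 256    # units in the hidden dense layer


def conv1d_params(kernel_size, in_channels, filters):
    # One weight per (kernel position, input channel, filter) plus one bias per filter
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    # Weight matrix plus one bias per output unit
    return in_units * out_units + out_units


head = {
    "bigram": conv1d_params(2, HIDDEN, NUM_FILTERS),
    "trigram": conv1d_params(3, HIDDEN, NUM_FILTERS),
    "fourgram": conv1d_params(4, HIDDEN, NUM_FILTERS),
    # The three pooled branches are concatenated: 3 * NUM_FILTERS inputs
    "dense_1": dense_params(3 * NUM_FILTERS, FFN_UNITS),
    "last_dense": dense_params(FFN_UNITS, 1),
}
total_trainable = sum(head.values())

for name, count in head.items():
    print(f"{name:>10}: {count:,}")
print(f"{'total':>10}: {total_trainable:,}")  # 768,813
```

Under these assumptions the trainable head has well under a million weights, while the frozen BERT base encoder holds roughly 110 million.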

In [30]:
# Hyperparameters
VOCAB_SIZE = len(tokenizer.vocab)
EMB_DIM = 200
NUM_FILTERS = 100
FFN_UNITS = 256
NUM_CLASSES = 2
DROPOUT_RATE = 0.2
EPOCHS = 5

In [31]:
# Define model
Dcnn = DCNNBERTEmbedding(num_filters=NUM_FILTERS, 
                         FFN_units=FFN_UNITS, 
                         num_classes=NUM_CLASSES, 
                         dropout_rate=DROPOUT_RATE)

In [32]:
# Compile model
if NUM_CLASSES == 2:
    Dcnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
else:
    Dcnn.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['sparse_categorical_accuracy'])

In [33]:
# Create the checkpoint directory (portable alternative to the shell mkdir)
os.makedirs(os.path.join('models', 'embding'), exist_ok=True)


In [34]:
# Checkpoint manager
# Build the path with os.path.join to avoid the invalid '\e' escape sequence
checkpoint_path = os.path.join('models', 'embding')
ckpt = tf.train.Checkpoint(Dcnn=Dcnn)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=3)

# If there is checkpoint in folder, restore latest checkpoint
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print("latest checkpoint restored!")
else:
    print("no checkpoint available...")

no checkpoint available...


In [35]:
# Checkpoint manager callback at end of epoch
class MyCustomCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        """Save checkpoint at end of each epoch"""
        ckpt_manager.save()
        print("Checkpoint saved at {}".format(checkpoint_path))

In [36]:
# Early stopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', verbose=1, restore_best_weights=True)

In [40]:
# Train model
Dcnn.fit(train_inputs,
         epochs=EPOCHS,
         validation_data=test_inputs,
         callbacks=[MyCustomCallback(), early_stopping])

Epoch 1/5
Checkpoint saved at models\embding
Epoch 2/5
Checkpoint saved at models\embding
Epoch 3/5
Checkpoint saved at models\embding
Epoch 4/5
Checkpoint saved at models\embding
Epoch 5/5
Checkpoint saved at models\embding
Restoring model weights from the end of the best epoch.
Epoch 00005: early stopping


<tensorflow.python.keras.callbacks.History at 0x27d6c3d98b0>

[Back to top](#top)

<a id="eval"></a>
## 5. Evaluation
Note that the pretrained BERT layer is used as is, without fine-tuning on our dataset. Even so, the model reaches about 86% accuracy on the test batches.

In [41]:
# evaluate returns [loss, accuracy] for the compiled loss and metric
results = Dcnn.evaluate(test_inputs)
print(results)

[0.3308480978012085, 0.8604309558868408]


In [51]:
def get_prediction(sentence):
    tokens = encode_sentence(sentence)
    
    inputs = [tf.cast(get_ids(tokens), dtype=tf.int32),
              tf.cast(get_masks(tokens), dtype=tf.int32),
              tf.cast(get_segments(tokens), dtype=tf.int32)]

    inputs = tf.stack(inputs, axis=0)
    inputs = tf.expand_dims(inputs, axis=0) # simulates a batch

    # Sigmoid output with shape (1, 1): probability that the tweet is positive
    output = Dcnn(inputs, training=False).numpy()
    # Threshold at 0.5 (valid for the binary, single-sigmoid configuration)
    sentiment = int(output[0, 0] >= 0.5)
    
    print(f"Output of the model: {output}\nPredicted sentiment:", "Positive" if sentiment else "Negative")

In [52]:
get_prediction("Never am I gonna train this again!")

Output of the model: [[0.20388371]]
Predicted sentiment: Negative


In [53]:
get_prediction("Let's do another BERT LOL")

Output of the model: [[0.9463612]]
Predicted sentiment: Positive


[Back to top](#top)