# Sentiment Analysis (Classification) using BERT (Keras API) 

![maxresdefault](./img/maxresdefault.jpeg)

We will perform sentiment analysis on movie reviews using Google's BERT model with the TF Keras API.

Two things set this walkthrough apart:
1. We use the Keras API, since PyTorch examples are already plentiful.
2. We use a **Korean Movie Review** dataset, since analyses of English movie reviews (IMDB) are easy to find online.

![sent](./img/sent.jpeg)

### How BERT works

BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT's goal is to generate a language model, only the encoder is needed.

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word). When training language models, there is a challenge of defining a prediction goal. Many models predict the next word in a sequence (e.g. "The child came home from ----"), a directional approach which inherently limits context learning. 

To overcome this challenge, BERT uses two training strategies: **Masked LM (MLM)** and **Next Sentence Prediction (NSP)**

##### Masked LM (MLM)

Before feeding word sequences into BERT, 15% of the tokens in each sequence are selected for prediction, and most of them are replaced with a [MASK] token (the BERT paper replaces 80% of the selected tokens with [MASK], 10% with a random token, and leaves 10% unchanged). The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires:

1. Adding a classification layer on top of the encoder output.
2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
3. Calculating the probability of each word in the vocabulary with softmax.
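
The three steps above can be sketched with toy numbers. This is only an illustration, not BERT's actual implementation: the 3-word vocabulary, the 2-dimensional embeddings, and the helper names `softmax` and `mlm_probabilities` are assumptions made for the example, and the classification layer of step 1 is reduced to the identity.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_probabilities(hidden, embedding_matrix):
    """Score one masked position against every vocabulary entry."""
    # Step 2: dot product of the output vector with each word embedding
    logits = [sum(h * e for h, e in zip(hidden, row)) for row in embedding_matrix]
    # Step 3: softmax over the vocabulary dimension
    return softmax(logits)

# Hypothetical toy vocabulary and embedding matrix (one row per word)
vocab = ["home", "school", "work"]
embedding_matrix = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Hypothetical encoder output at the masked position
hidden = [2.0, 0.1]

probs = mlm_probabilities(hidden, embedding_matrix)
predicted = vocab[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

The word whose embedding is most aligned with the output vector receives the highest probability.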

![mlm](./img/mlm.png)

##### Next Sentence Prediction (NSP)

In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
2. A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
3. A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.
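
As a rough sketch of how the three embeddings combine, the snippet below adds a token, a segment, and a position vector for every token. The tiny lookup tables, their values, and the function name `embed_tokens` are made up for illustration; real BERT uses learned 768-dimensional tables.

```python
def embed_tokens(token_ids, segment_ids, token_emb, segment_emb, position_emb):
    """Return one vector per token: token + segment + position embedding."""
    vectors = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[position])
        ])
    return vectors

# Hypothetical 2-dimensional lookup tables
token_emb = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]}  # 0=[CLS], 1=a word, 2=[SEP]
segment_emb = {0: [1.0, 1.0], 1: [2.0, 2.0]}               # Sentence A / Sentence B
position_emb = [[0.00, 0.00], [0.01, 0.01], [0.02, 0.02], [0.03, 0.03], [0.04, 0.04]]

# "[CLS] word [SEP] word [SEP]": the first three tokens are Sentence A, the rest Sentence B
token_ids = [0, 1, 2, 1, 2]
segment_ids = [0, 0, 0, 1, 1]

vectors = embed_tokens(token_ids, segment_ids, token_emb, segment_emb, position_emb)
for vec in vectors:
    print([round(v, 2) for v in vec])
```

The same word (id 1) gets a different input vector in Sentence A and in Sentence B, because the segment and position terms differ.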

![nsp](./img/nsp.png)

> BERT paper [Link](https://arxiv.org/abs/1810.04805)

##### How to use BERT (Fine-tuning)
 
BERT can be used for a wide variety of language tasks, while only adding a small layer to the core model:

1. Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
2. In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
3. In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.

This time, we will fine-tune the language model to perform the first task, **classification**.
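
To make the classification case concrete, here is a minimal sketch of a classification layer that reads only the [CLS] vector. The 3-dimensional vectors, the weights, and the function name `classify_from_cls` are hypothetical; in the real model the vector has 768 dimensions and the weights are learned during fine-tuning.

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def classify_from_cls(sequence_output, weights, bias):
    """Apply a single-unit classification layer to the [CLS] vector only."""
    cls_vector = sequence_output[0]  # [CLS] is always the first token
    logit = sum(w * h for w, h in zip(weights, cls_vector)) + bias
    return sigmoid(logit)

# Hypothetical encoder output: one vector per token, [CLS] first
sequence_output = [
    [0.5, -1.0, 2.0],   # [CLS]
    [0.3, 0.3, 0.3],    # a word
    [0.0, 0.1, 0.0],    # [SEP]
]
weights = [1.0, 0.5, 0.25]  # made-up weights
bias = 0.0

probability = classify_from_cls(sequence_output, weights, bias)
print(round(probability, 4))
```

Only the first vector enters the prediction; the other token vectors influence it indirectly, through the attention layers that produced the [CLS] vector.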

![bert-sentence-pair](./img/bert-sentence-pair.png)

In [1]:
import numpy as np
import pandas as pd
import keras
from keras import backend as K
from keras import optimizers
from keras.layers import Embedding, Dense, Input, LSTM, Bidirectional, Activation, Conv1D, GRU, TimeDistributed, Dropout
from keras.models import Model, load_model
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

import warnings
import tensorflow as tf
import os
import re
import pickle
import codecs
from tqdm import tqdm
import matplotlib.pyplot as plt
warnings.filterwarnings(action='ignore')

Using TensorFlow backend.


# 1. Load the BERT model

Google provides a multilingual BERT model that can be used for languages other than English.

In [None]:
# Create a "bert" directory inside the "data" directory (no error if it already exists)
os.makedirs("./data/bert", exist_ok=True)

Navigate to the [BERT GitHub repository](https://github.com/google-research/bert) and download the "BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters" zip file into the "bert" folder created above.

In [None]:
import zipfile
import shutil

# Unzip the archive into ./data/bert; the context manager closes the file afterwards
with zipfile.ZipFile('./data/bert/multi_cased_L-12_H-768_A-12.zip') as bert_zip:
    bert_zip.extractall('./data/bert')

In [None]:
def copytree(src, destination, symlinks=False, ignore=None):
    # Copy every file and sub-directory in src directly into destination
    for item in os.listdir(src):
        s = os.path.join(src, item)
        d = os.path.join(destination, item)
        if os.path.isdir(s):
            shutil.copytree(s, d, symlinks, ignore)
        else:
            shutil.copy2(s, d)

In [None]:
copytree("./data/bert/multi_cased_L-12_H-768_A-12", "./data/bert")

keras-bert makes it easier to use BERT in Keras.
We also import keras-radam, which provides RAdam (Rectified Adam), a variant of the Adam optimizer.

In [None]:
# !pip install keras-bert
# !pip install keras-radam

In [3]:
from keras_bert import load_trained_model_from_checkpoint, load_vocabulary
from keras_bert import Tokenizer
from keras_bert import AdamWarmup, calc_train_steps

from keras_radam import RAdam

In [4]:
os.listdir('./data/bert')

['bert_config.json',
 'bert_model.ckpt.data-00000-of-00001',
 'bert_model.ckpt.index',
 'bert_model.ckpt.meta',
 'vocab.txt']

# 2. Download the Korean Movie Data

This is a binary classification task: positive reviews are labeled 1 and negative reviews are labeled 0.
Neutral reviews are not included.

In [None]:
# Clone the dataset repository with Git
# !git clone https://github.com/e9t/nsmc.git

Or download it directly from [GitHub](https://github.com/e9t/nsmc.git)

In [5]:
path = os.path.abspath('./data')

In [6]:
train = pd.read_table(os.path.join(path,"ratings_train.txt"))
test = pd.read_table(os.path.join(path,"ratings_test.txt"))

In [7]:
print(train.shape)
print(test.shape)

(150000, 3)
(50000, 3)


In [8]:
train[0:10]

| | id | document | label |
|---|---|---|---|
| 0 | 9976970 | 아 더빙.. 진짜 짜증나네요 목소리 | 0 |
| 1 | 3819312 | 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 | 1 |
| 2 | 10265843 | 너무재밓었다그래서보는것을추천한다 | 0 |
| 3 | 9045019 | 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 | 0 |
| 4 | 6483659 | 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... | 1 |
| 5 | 5403919 | 막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움. | 0 |
| 6 | 7797314 | 원작의 긴장감을 제대로 살려내지못했다. | 0 |
| 7 | 9443947 | 별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단... | 0 |
| 8 | 7156791 | 액션이 없는데도 재미 있는 몇안되는 영화 | 1 |
| 9 | 5912145 | 왜케 평점이 낮은건데? 꽤 볼만한데.. 헐리우드식 화려함에만 너무 길들여져 있나? | 1 |


# 3. Hyperparameter setting

In [9]:
# Maximum sequence length in tokens; shorter sequences are padded with 0 up to 128
SEQ_LEN = 128
# Batch size
BATCH_SIZE = 16
# Number of training epochs
EPOCHS = 50
# Learning rate
LR = 1e-4

pretrained_path = os.path.abspath('./data/bert')

config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')

DATA_COLUMN = "document"
LABEL_COLUMN = "label"
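
To get a feel for what the batch size means for training, the number of weight updates per epoch can be worked out directly. This is a small arithmetic sketch: the helper name `steps_per_epoch` is hypothetical, the 150,000 comes from the training-set shape printed earlier, and 64 is the batch size passed to `fit` in the training cell.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)

NUM_TRAIN = 150000  # rows in ratings_train.txt, per the shape printed earlier

steps_16 = steps_per_epoch(NUM_TRAIN, 16)
steps_64 = steps_per_epoch(NUM_TRAIN, 64)
print(steps_16, steps_64)
```

A larger batch means fewer updates per epoch, but each update needs more GPU memory.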

Create a dictionary called `token_dict` that maps each token in vocab.txt to an integer index.
The overall flow is:
**Tokenize the sentence into words ==> Convert the words to indices (numbers) ==> Feed the indices into the BERT model**

In [10]:
token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        # Rewrite tokens containing "_" into the "##" sub-word form
        if "_" in token:
            token = token.replace("_","")
            token = "##" + token
        # Assign the next free index to this token
        token_dict[token] = len(token_dict)

Below is a helper class that tokenizes a sentence.

In [11]:
class inherit_Tokenizer(Tokenizer):
    def _tokenize(self, text):
        # Lower-case the text only when the tokenizer is uncased
        if not self._cased:
            text = text.lower()
        spaced = ''
        for ch in text:
            if self._is_punctuation(ch) or self._is_cjk_character(ch):
                spaced += ' ' + ch + ' '
            elif self._is_space(ch):
                spaced += ' '
            elif ord(ch) == 0 or ord(ch) == 0xfffd or self._is_control(ch):
                continue
            else:
                spaced += ch
        tokens = []
        for word in spaced.strip().split():
            tokens += self._word_piece_tokenize(word)
        return tokens

In [12]:
tokenizer = inherit_Tokenizer(token_dict)

In [13]:
tokenizer.tokenize("이 영화 정말 좋다.") # I really like this movie

['[CLS]', '이', '영화', '정', '##말', '좋', '##다', '.', '[SEP]']
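
The `##` pieces in the output come from WordPiece tokenization. The sketch below shows the greedy longest-match-first idea used by the original BERT tokenizer, with a toy vocabulary chosen so that it reproduces the output above. The function name `word_piece_tokenize` and the vocabulary are assumptions for illustration, and keras-bert's own handling of unknown pieces may differ.

```python
def word_piece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, just large enough for the example sentence
toy_vocab = {"이", "영화", "정", "##말", "좋", "##다", "."}

pieces = []
for word in ["이", "영화", "정말", "좋다", "."]:
    pieces += word_piece_tokenize(word, toy_vocab)

print(["[CLS]"] + pieces + ["[SEP]"])
```

Because "정말" is not in the vocabulary as a whole word, it is split into "정" and the continuation piece "##말".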

===================================================================

Now we need to encode the reviews into tokens to be fed into the network.

In [14]:
# Convert each review into token indices and collect the labels
def convert_data(data_df):
    indices, targets = [], []
    for i in tqdm(range(len(data_df))):
        # encode() returns (token ids, segment ids), padded to SEQ_LEN
        ids, segments = tokenizer.encode(data_df[DATA_COLUMN][i], max_len=SEQ_LEN)
        indices.append(ids)
        targets.append(data_df[LABEL_COLUMN][i])

    indices = np.array(indices)
    # Single-sentence input, so the segment ids are all 0
    return [indices, np.zeros_like(indices)], np.array(targets)

# Cast the review column to str and convert the dataframe into model inputs and labels
def load_data(df):
    data_df = df
    data_df[DATA_COLUMN] = data_df[DATA_COLUMN].astype(str)
    data_x, data_y = convert_data(data_df)
    return data_x, data_y

In [15]:
train_x, train_y = load_data(train)
test_x, test_y = load_data(test)

100%|████████████████████████████████████████████████████████████████████████| 150000/150000 [00:29<00:00, 5094.30it/s]
100%|██████████████████████████████████████████████████████████████████████████| 50000/50000 [00:08<00:00, 5670.69it/s]


The pre-trained BERT model takes numericalized tokens and a sentence order (segment) vector as input. Because we train on single sentences, the segment vector is all 0.

We do not use the masking input, which hides a certain portion of the tokens in each sentence.

In [16]:
train_x

[array([[  101,  9519,  9074, ...,     0,     0,     0],
        [  101,   100,   119, ...,     0,     0,     0],
        [  101,  9004, 32537, ...,     0,     0,     0],
        ...,
        [  101,  9638, 14153, ...,     0,     0,     0],
        [  101,  9751, 97707, ...,     0,     0,     0],
        [  101, 48556, 42428, ...,     0,     0,     0]]),
 array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])]

The input consists of 3 things: Token, Segment, and Position.

Token: The indexed numbers of the tokens

Segment: Numbers that tell whether a token belongs to the first or the second sentence

Position: Automatically assigned
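
The layout of the three inputs for a single sentence can be sketched as follows. The function name `encode_single_sentence` and its truncation rule are assumptions for illustration, not keras-bert's code. The token ids are the ones shown in this notebook's output for "이 영화 정말 좋다.".

```python
def encode_single_sentence(token_ids, max_len, pad_id=0):
    """Pad (or truncate) token ids and build the matching segment and position ids."""
    if len(token_ids) > max_len:
        # Too long: cut the tail but keep the final [SEP] token
        token_ids = token_ids[:max_len - 1] + token_ids[-1:]
    ids = list(token_ids) + [pad_id] * (max_len - len(token_ids))
    segments = [0] * max_len          # single sentence: every token is in segment 0
    positions = list(range(max_len))  # 0, 1, 2, ... assigned purely by position
    return ids, segments, positions

# [CLS] 이 영화 정 ##말 좋 ##다 . [SEP]
sentence_ids = [101, 9638, 42428, 9670, 89523, 9685, 11903, 119, 102]

ids, segments, positions = encode_single_sentence(sentence_ids, max_len=12)
print(ids)
print(segments)
print(positions)
```

With `SEQ_LEN = 128` the same idea applies; only the amount of zero padding grows.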

In [17]:
def sentence_convert_data(data):
    indices = []
    for i in tqdm(range(len(data))):
        # Print the tokens so the split can be inspected
        print(tokenizer.tokenize(data[i]))
        ids, segments = tokenizer.encode(data[i], max_len=SEQ_LEN)
        indices.append(ids)

    indices = np.array(indices)
    return [indices, np.zeros_like(indices)]

def sentence_load_data(sentences):  # sentences must be a list of strings
    data_x = sentence_convert_data(sentences)
    return data_x

In [18]:
sentence_load_data(["이 영화 정말 좋다.", "진짜 노잼"])   

100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1003.30it/s]

['[CLS]', '이', '영화', '정', '##말', '좋', '##다', '.', '[SEP]']
['[CLS]', '진', '##짜', '노', '##잼', '[SEP]']





[array([[   101,   9638,  42428,   9670,  89523,   9685,  11903,    119,
            102,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,   

# 4. Build a model

In [19]:
layer_num = 12
model = load_trained_model_from_checkpoint(
    config_path,
    checkpoint_path,
    training=True,
    trainable=True,
    seq_len=SEQ_LEN)

The default model has 12 Transformer layers.

In [20]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, 128)          0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, 128)          0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 128, 768), ( 91812096    Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 128, 768)     1536        Input-Segment[0][0]              
____________________________________________________________________________________________

Below is the most important part.
It takes the pre-trained BERT model and modifies it for our task.

We define the input as:
```
inputs = model.inputs[:2]
```
We then take the output of the third-from-last layer (`model.layers[-3]`) and add a Dense output layer on top of it as the classification layer. Note that the code below does not freeze any layers, so all BERT weights are fine-tuned.

The output is close to 1 for a positive review and close to 0 for a negative one.
We use binary cross-entropy as the loss function and RAdam as the optimizer, then return the model.

In [21]:
def get_bert_finetuning_model(model):
    # Token and segment inputs only
    inputs = model.inputs[:2]
    # Output of the third-from-last layer of the pre-trained model
    dense = model.layers[-3].output

    # Single sigmoid unit for binary classification
    outputs = Dense(1, activation='sigmoid',
                    kernel_initializer=keras.initializers.TruncatedNormal(stddev=0.02),
                    name='real_output')(dense)

    bert_model = Model(inputs, outputs)

    bert_model.compile(
        optimizer=RAdam(learning_rate=0.00001, weight_decay=0.0025),
        loss='binary_crossentropy',
        metrics=['accuracy'])

    return bert_model
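
For intuition about the `binary_crossentropy` loss used in `compile`, here is a minimal pure-Python sketch of the formula. The function name and the clipping constant `eps` are assumptions for the example, and Keras' internal implementation may differ in detail.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of sigmoid outputs."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# A model that always answers 0.5 scores ln(2), whatever the labels are
loss_uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])

# Confident and correct predictions give a much lower loss
loss_confident = binary_cross_entropy([1, 0], [0.9, 0.1])

print(round(loss_uninformed, 4), round(loss_confident, 4))
```

The value ln(2) ≈ 0.693 is a useful reference point: the training log in this notebook starts at a loss of 0.6939, which is what one would expect from a classification layer that has not learned anything yet.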

In [24]:
!nvidia-smi

Thu Jun 25 23:23:25 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 446.14       Driver Version: 446.14       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 1050   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   50C    P0    N/A /  N/A |     83MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU                  PID   Type   Process name                  GPU Memory |
|       

# 5. Start Training

In [25]:
bert_model = get_bert_finetuning_model(model)
history = bert_model.fit(train_x, 
                         train_y, 
                         epochs=1, 
                         batch_size=64,
                         verbose = True, 
                         validation_data=(test_x, test_y), 
                         shuffle=True)

Train on 150000 samples, validate on 50000 samples
Epoch 1/1
   192/150000 [..............................] - ETA: 53:47:42 - loss: 0.6939 - accuracy: 0.5156

KeyboardInterrupt: 

# 6. Save / Load Model

In [26]:
# Save model
bert_model.save_weights(path + "/bert.h5")

In [None]:
# Load model
bert_model = get_bert_finetuning_model(model)
bert_model.load_weights(path + "/bert.h5")

Check the F1 score on the test set.
The `predict_convert_data` function does not take the labels, since it is used for prediction on the test set.

In [27]:
def predict_convert_data(data_df):
    indices = []
    for i in tqdm(range(len(data_df))):
        ids, segments = tokenizer.encode(data_df[DATA_COLUMN][i], max_len=SEQ_LEN)
        indices.append(ids)

    indices = np.array(indices)
    # Single-sentence input, so the segment ids are all 0
    return [indices, np.zeros_like(indices)]

def predict_load_data(x):
    # x should be a dataframe with a DATA_COLUMN column
    data_df = x
    data_df[DATA_COLUMN] = data_df[DATA_COLUMN].astype(str)
    data_x = predict_convert_data(data_df)
    return data_x

In [28]:
test_set = predict_load_data(test)

100%|██████████████████████████████████████████████████████████████████████████| 50000/50000 [00:07<00:00, 6301.41it/s]


In [None]:
preds = bert_model.predict(test_set)
preds

In [None]:
# Check the F1-score
from sklearn.metrics import classification_report
y_true = test['label']

print(classification_report(y_true, np.round(preds,0)))
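
To see what the F1 score for the positive class measures, it can be computed by hand on a few made-up predictions. The labels, probabilities, and the function name `binary_f1` below are illustrative only and are not results from this model.

```python
def binary_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up labels and sigmoid outputs
toy_labels = [1, 1, 1, 0, 0, 0]
toy_probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]

# Threshold the probabilities at 0.5 to get hard 0/1 predictions
toy_preds = [1 if p >= 0.5 else 0 for p in toy_probs]

precision, recall, f1 = binary_f1(toy_labels, toy_preds)
print(toy_preds, round(precision, 3), round(recall, 3), round(f1, 3))
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.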

============================================================= Official Model Building is over =============================================================

A good thing about Keras is that we can play around with the layers.
Instead of the final output between 0 and 1, we can extract the feature map: the 768 features that feed the output layer.

In [None]:
def get_feature_map(model):
    inputs = model.input
    outputs = model.layers[-2].output
    feature_model = Model(inputs, outputs)
    
    return feature_model

In [None]:
bert_feature = get_feature_map(bert_model)

Let's get the feature map for the test data and plot the t-SNE embedding.

In [None]:
bert_weight_list = bert_feature.predict(test_set)
bert_weight_list

In [None]:
labels = test['label']
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

PCA reduces the 768 dimensions to 256. Then, t-SNE reduces the 256 dimensions to 3.
t-SNE is an algorithm that maps high-dimensional points to a low-dimensional space so that points with similar traits end up close together, forming clusters.
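
As a rough illustration of the PCA step only (t-SNE is not covered here), the sketch below finds the single direction of greatest variance in a toy 2-D dataset using power iteration and projects the points onto it. `PCA(n_components=256)` keeps 256 such directions; the function names, the toy points, and the use of power iteration instead of scikit-learn's solver are all assumptions made for the example. It also assumes the points are not all identical.

```python
import math

def first_principal_component(points, iterations=100):
    """Return the data mean and the unit direction of greatest variance."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]
    # Sample covariance matrix (dim x dim)
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration: repeatedly multiply by the covariance matrix and normalize
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return mean, vec

def project(points, mean, direction):
    """Reduce every point to one coordinate along the given direction."""
    return [sum((p[d] - mean[d]) * direction[d] for d in range(len(direction)))
            for p in points]

# Toy data lying exactly on the line y = 2x
points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]

mean, direction = first_principal_component(points)
reduced = project(points, mean, direction)
print([round(d, 3) for d in direction], [round(r, 3) for r in reduced])
```

Since all of the variance lies along one line, a single coordinate per point keeps all the information in this toy case.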

In [None]:
bert_embedded = PCA(n_components=256).fit_transform(bert_weight_list)
bert_embedded = TSNE(n_components=3).fit_transform(bert_embedded)
bert_embedded

Save Bert Embedding for later use

In [None]:
with open(path+"/bertembedding.pkl", "wb") as f:
    pickle.dump(bert_embedded, f)

Below is a helper function that tells whether a sentence is positive or negative.

In [None]:
def sentence_convert_data(data):
    # Encode a single sentence into a batch of size 1
    ids, segments = tokenizer.encode(data, max_len=SEQ_LEN)
    indices = np.array([ids])
    return [indices, np.zeros_like(indices)]

def movie_evaluation_predict(sentence):
    data_x = sentence_convert_data(sentence)
    predict = bert_model.predict(data_x)
    # Round the sigmoid output to 0 (negative) or 1 (positive)
    predict_answer = np.round(np.ravel(predict), 0).item()

    if predict_answer == 0:
        print("It's a negative Movie Review")
    elif predict_answer == 1:
        print("It's a positive Movie Review")

In [None]:
movie_evaluation_predict("나만 이걸 보고 울었는지 모르겠지만. 그 모든것이 절 슬프게 하였네요 ")

In [None]:
movie_evaluation_predict("너무잼있어엉 진짜 연기가 예술이고 다시보고싶은영화")

In [None]:
movie_evaluation_predict("배우들이 맞지 않는 옷을 입은 것처럼 연기력 대부분이 별로였습니다.")

In [None]:
movie_evaluation_predict("평범한 스토리. 볼만한 영상미. 스타워즈도 이제는...")

##### Reference

> https://www.kdnuggets.com/2018/12/bert-sota-nlp-model-explained.html
> https://www.lucypark.kr/docs/2015-pyconkr/#39
> https://pypi.org/project/keras-bert/
> https://github.com/CyberZHG/keras-bert/tree/master/keras_bert