# TensorFlow 2 - BERT: Movie Review Sentiment Analysis

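$BERT$ (Bidirectional Encoder Representations from Transformers) is a bidirectional encoder built from the encoder part of the Transformer. Fine-tuning a pretrained BERT model produces state-of-the-art models for a wide range of natural language processing tasks, such as question answering, sentiment analysis, and named entity recognition. $BERT_{BASE}$ has 110M parameters in total (L=12, H=768, A=12), and $BERT_{LARGE}$ has 340M (L=24, H=1024, A=16) (Devlin et al., 2019, [BERT paper link](https://arxiv.org/pdf/1810.04805.pdf)).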
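
As a rough check on the 110M figure, the parameter count can be reproduced from L and H. The sketch below is not taken from the paper's text: it assumes a WordPiece vocabulary of 30,522 tokens, 512 position embeddings, 2 segment types, and a feed-forward width of 4H, which are the values used by the uncased English checkpoints. The number of heads A does not change the count, because the heads share the same H×H projections.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2):
    """Count the parameters of a BERT-style encoder (assumed layout, see lead-in)."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections (H -> 4H -> H), then LayerNorm.
    ffn = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden) + layer_norm
    # Pooler: one dense layer applied to the [CLS] position.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + ffn) + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-base:  {base:,}")   # BERT-base:  109,482,240
print(f"BERT-large: {large:,}")
```

Under these assumptions the base model comes out just under 110M, and the large model lands a little below the rounded 340M quoted in the paper.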


**Dataset**

The IMDB dataset consists of 50,000 movie reviews for natural language processing. It can be downloaded from [Kaggle](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).


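**Task**

Each review in the IMDB dataset is labeled either positive or negative. The movie-review sentiment analysis task is therefore a supervised binary classification problem.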

In [1]:
# Install the bert-for-tf2 package.
!pip install bert-for-tf2

Collecting bert-for-tf2
  Downloading https://files.pythonhosted.org/packages/18/d3/820ccaf55f1e24b5dd43583ac0da6d86c2d27bbdfffadbba69bafe73ca93/bert-for-tf2-0.14.7.tar.gz (41kB)
Collecting py-params>=0.9.6
  Downloading https://files.pythonhosted.org/packages/a4/bf/c1c70d5315a8677310ea10a41cfc41c5970d9b37c31f9c90d4ab98021fd1/py-params-0.9.7.tar.gz
Collecting params-flow>=0.8.0
  Downloading https://files.pythonhosted.org/packages/a9/95/ff49f5ebd501f142a6f0aaf42bcfd1c192dc54909d1d9eb84ab031d46056/params-flow-0.8.2.tar.gz
Building wheels for collected packages: bert-for-tf2, py-params, params-flow
  Building wheel for bert-for-tf2 (setup.py) ...

In [2]:
# Import the required modules.
import pandas as pd
import numpy as np
import bert
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import  Model
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tqdm import tqdm
import matplotlib.pyplot as plt

print("TensorFlow Version:",tf.__version__)
print("Hub version: ",hub.__version__)
pd.set_option('display.max_colwidth',1000)


TensorFlow Version: 2.3.0
Hub version:  0.10.0


## Data Preprocessing

In [3]:
# from google.colab import drive
# drive.mount("/content/drive")

Mounted at /content/drive


In [12]:
# df=pd.read_csv('/content/drive/My Drive/IMDB Dataset.csv')

In [13]:
# Read the IMDB dataset into a pandas DataFrame.
df=pd.read_csv('IMDB Dataset.csv')

In [14]:
# Preview the first rows of the dataset.
df.head(5)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.",positive
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.",negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case wi...",positive


In [15]:
print("The number of rows and columns in the dataset is: {}".format(df.shape))

The number of rows and columns in the dataset is: (50000, 2)


In [16]:
# Check for missing values in each column.
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [17]:
# Check the balance of the target classes.
df["sentiment"].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

In [18]:
# Helper functions that build the BERT inputs: input_ids, input_masks, and input_segments
MAX_SEQ_LEN = 500  # maximum sequence length

def get_masks(tokens):
    """Mask: 1 for real tokens, 0 for padding."""
    return [1] * len(tokens) + [0] * (MAX_SEQ_LEN - len(tokens))
 
def get_segments(tokens):
    """Segments: 0 for the first sentence, 1 for the second."""
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (MAX_SEQ_LEN - len(tokens))

def get_ids(tokens, tokenizer):
    """Look up the token IDs in the tokenizer vocabulary and pad with 0."""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (MAX_SEQ_LEN - len(token_ids))
    return input_ids

def create_single_input(sentence, tokenizer, max_len):
    """Build the three BERT inputs from a single sentence."""
    stokens = tokenizer.tokenize(sentence)
    stokens = stokens[:max_len]
    stokens = ["[CLS]"] + stokens + ["[SEP]"]
 
    ids = get_ids(stokens, tokenizer)
    masks = get_masks(stokens)
    segments = get_segments(stokens)

    return ids, masks, segments
 
def convert_sentences_to_features(sentences, tokenizer):
    """Convert sentences to features: input_ids, input_masks, and input_segments."""
    input_ids, input_masks, input_segments = [], [], []

    for sentence in tqdm(sentences, position=0, leave=True):
        # Reserve two positions for the [CLS] and [SEP] tokens.
        ids, masks, segments = create_single_input(sentence, tokenizer, MAX_SEQ_LEN - 2)
        assert len(ids) == MAX_SEQ_LEN
        assert len(masks) == MAX_SEQ_LEN
        assert len(segments) == MAX_SEQ_LEN
        input_ids.append(ids)
        input_masks.append(masks)
        input_segments.append(segments)

    return [np.asarray(input_ids, dtype=np.int32),
            np.asarray(input_masks, dtype=np.int32),
            np.asarray(input_segments, dtype=np.int32)]

def create_tonkenizer(bert_layer):
    """Instantiate a tokenizer from the vocabulary file and lowercase setting bundled with the BERT layer."""
    vocab_file=bert_layer.resolved_object.vocab_file.asset_path.numpy()
    do_lower_case=bert_layer.resolved_object.do_lower_case.numpy() 
    tokenizer=bert.bert_tokenization.FullTokenizer(vocab_file,do_lower_case)
    return tokenizer
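
To make the three inputs concrete, the sketch below runs the same padding logic on a short sentence. It is self-contained and does not load BERT: `ToyTokenizer`, its seven-entry vocabulary, the token IDs, and `max_seq_len=8` are all made up for illustration (the real WordPiece vocabulary assigns different IDs).

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for bert's FullTokenizer."""

    def __init__(self, vocab):
        self.vocab = vocab

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(token, self.vocab["[UNK]"]) for token in tokens]


def encode(sentence, tokenizer, max_seq_len):
    """Return (ids, masks, segments), each padded to max_seq_len."""
    # Truncate so that [CLS] and [SEP] still fit.
    tokens = tokenizer.tokenize(sentence)[:max_seq_len - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    padding = [0] * (max_seq_len - len(tokens))
    ids = tokenizer.convert_tokens_to_ids(tokens) + padding
    masks = [1] * len(tokens) + padding
    segments = [0] * len(tokens) + padding  # a single sentence stays in segment 0
    return ids, masks, segments


# Made-up vocabulary; only the role of each special token mirrors BERT.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "a": 4, "wonderful": 5, "film": 6}
ids, masks, segments = encode("A wonderful film", ToyTokenizer(vocab), max_seq_len=8)
print(ids)       # [2, 4, 5, 6, 3, 0, 0, 0]
print(masks)     # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Because every review is a single text, the segment IDs are all 0; they would only switch to 1 for a second sentence placed after the first `[SEP]`.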

## Modeling

In [19]:
def nlp_model(callable_object):
    # Load the pretrained BERT model.
    bert_layer = hub.KerasLayer(handle=callable_object, trainable=True)  
   
    # The three inputs of the BERT layer: ids, masks, and segments
    input_ids = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_ids")           
    input_masks = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_masks")       
    input_segments = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="segment_ids")
    
    inputs = [input_ids, input_masks, input_segments] # BERT inputs
    pooled_output, sequence_output = bert_layer(inputs) # BERT outputs
    
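    ###################################
    # Exercise 1: on top of the outputs above, add Dense (768 units) + Dropout (0.1)
    # and Dense (2) with softmax, then build the model.
    # 1. Add one hidden layer.
    # Assumptions: the head is attached to pooled_output and uses a ReLU activation;
    # the exercise statement does not specify either choice.
    x = Dense(768, activation="relu")(pooled_output)
    x = Dropout(0.1)(x)

    # 2. Add the output layer.
    outputs = Dense(2, activation="softmax")(x)

    # 3. Build the new model.
    model = Model(inputs=inputs, outputs=outputs)
    ##################################

    return model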

model = nlp_model("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1")
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 500)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 500)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 500)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_ids[0][0]                  
                                                                 input_masks[0][0]     

## Model Training

In [20]:
# Build the training and test examples.
df = df.sample(frac=1) # Shuffle the dataset
tokenizer = create_tonkenizer(model.layers[3])
X_train = convert_sentences_to_features(df['review'][:40000], tokenizer)
X_test = convert_sentences_to_features(df['review'][40000:], tokenizer)

# Map the string labels to numeric classes: positive -> 1, negative -> 0.
df['sentiment'] = df['sentiment'].map({'positive': 1., 'negative': 0.})
one_hot_encoded = to_categorical(df['sentiment'].values)
y_train = one_hot_encoded[:40000]
y_test =  one_hot_encoded[40000:]

100%|██████████| 40000/40000 [02:04<00:00, 321.27it/s]
100%|██████████| 10000/10000 [00:31<00:00, 321.76it/s]
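
The label encoding used above can be illustrated without TensorFlow. The sketch below mimics what `to_categorical` produces for two classes and shows that `argmax` recovers the original class index; the four sample sentiments and the helper names `one_hot` and `argmax` are made up for illustration.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows, similar to to_categorical."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]


def argmax(row):
    """Index of the largest value in a row (the first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


sentiments = ["positive", "negative", "negative", "positive"]  # made-up sample
labels = [1 if s == "positive" else 0 for s in sentiments]
encoded = one_hot(labels, num_classes=2)
print(encoded)                           # [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print([argmax(row) for row in encoded])  # [1, 0, 0, 1]
```

Column 0 therefore corresponds to negative and column 1 to positive, which is the convention the two-unit softmax output follows.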


In [None]:
# Train the model.
BATCH_SIZE = 8
EPOCHS = 1

# Minimize the categorical_crossentropy loss with the Adam optimizer.
opt = Adam(learning_rate=2e-5)
model.compile(optimizer=opt, 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

# Fit the model to the data.
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    verbose = 1)

# Save the trained model.
model.save('nlp_model.h5') 
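
For intuition about the loss being minimized, the sketch below computes categorical cross-entropy by hand for a single one-hot target, and the number of optimizer steps in one epoch for the 40,000 training reviews and batch size of 8 used above. The probability vectors and the `eps` guard against `log(0)` are illustrative assumptions, not values taken from the training run.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


target = [0.0, 1.0]  # one-hot label for a positive review
confident = categorical_crossentropy(target, [0.1, 0.9])  # mostly correct prediction
wrong = categorical_crossentropy(target, [0.9, 0.1])      # mostly wrong prediction
print(round(confident, 4))  # 0.1054
print(round(wrong, 4))      # 2.3026

steps_per_epoch = math.ceil(40000 / 8)  # training examples / batch size
print(steps_per_epoch)  # 5000
```

The loss is small when the probability assigned to the true class is close to 1 and grows quickly as that probability approaches 0.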





## Model Performance Evaluation

In [29]:
# Load the saved (fine-tuned) NLP model.
from tensorflow.keras.models import load_model
new_model = load_model('nlp_model.h5',custom_objects={'KerasLayer':hub.KerasLayer})

In [30]:
# Predict on the test dataset.
from sklearn.metrics import classification_report
pred_test = np.argmax(new_model.predict(X_test), axis=1)

In [31]:
print(classification_report(np.argmax(y_test,axis=1), pred_test))

              precision    recall  f1-score   support

           0       0.94      0.93      0.93      5005
           1       0.93      0.94      0.93      4995

    accuracy                           0.93     10000
   macro avg       0.93      0.93      0.93     10000
weighted avg       0.93      0.93      0.93     10000
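
The per-class columns of the report can be reproduced by hand. The sketch below computes precision, recall, and F1 for one class from true and predicted labels; the eight labels are made up for illustration and are not the notebook's predictions, and `binary_report` is a hypothetical helper rather than a scikit-learn function.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up ground truth
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]  # made-up predictions
precision, recall, f1 = binary_report(y_true, y_pred, positive=1)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.75 0.667
```

Precision is the share of predicted positives that are correct, recall is the share of actual positives that were found, and F1 is their harmonic mean.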



In [32]:
pred_test[:10]

array([0, 1, 1, 0, 0, 0, 1, 0, 0, 1])

In [33]:
# The model predicted 0 (negative) for the first review in the test set.
df['review'][40000:40001]

7401    Ahh, the dull t.v. shows and pilots that were slammed together in the 70's to make equally dull t.v. movies! Some examples would be Riding With Death(the most hysterically cheesy of the lot), Stranded in Space(confusing and uninteresting), San Francisco International(horribly dull and unbelievably confusing), and this turgid bit of Quinn Martin glamor. <br /><br />Shot in Hawaii(although you wouldn't know it from the outside shots), it's apparently a failed pilot for a lame spy show. The real problem is that you don;'t like most of the characters, including the drab main character Diamond Head, who seemed half asleep for the entire movie; his boss 'Aunt Mary', who had a really weird delivery of his lines and shellacked white hair as well as the a tan that looked like it had been stuccoed on; Diamnd Head's girlfriend/fellow agent(hell, I can't even remember her name) a skinny, wooden woman with a flat way of speaking that is just not sexy or interesting; and the singing sidekick

In [34]:
# The model predicted 1 (positive) for the second review in the test set.
df['review'][40001:40002]

34224    That film is absolutely fantastic!! If you watch it with your friends it can be a very nice day... Obviously you have to know that the film is stupid and very bad directed and acted (Tomba/Unziker what a couple), and that is probably the worse film in the world, but you can enjoy it very much. We watched it in 19 and it was a very nice evening. The best scenes are the first one, when the criminals kill the friend of Alex, and he tries to act like a desperate, and the result is a comic scene of first category... And then when he shows to Leva (Antevleva, what a name) the "Palassio di giusstissia", and then the accident of Leva, that once is going on her car out of the road, and a second later, the car is completely empty! What a magic!
Name: review, dtype: object

In [35]:
# The model predicted 1 (positive) for the third review in the test set.
df['review'][40002:40003]

4325    Normally I love finding old (and some not-so-old) westerns I haven't seen, to be the entertainment for the evening. It's such a great way to sit back, relax and escape the politics and world problems for a few hours. But this was not to be the case with this version of The Magnificent Seven. The casting and storyline of this series closely follow the Hollywood formula for politically correct entertainment; good old get-your-mind-right, revisionist history, where the 'bad guys' must all be white, male, Confederate (in this case), and preferably Christian (if it can somehow be worked into the script). It's sad, really. The best movies out there, are now and have always been about simply telling a good story up on the big screen - not about forwarding someone's political ideology.
Name: review, dtype: object