# Natural Language Processing using BERT

Please study AMA Lecture 12 "Natural Language Processing Using BERT" before practicing this code.

In [1]:
# mount your Google Drive so you can locate your data files.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import numpy as np
import pandas as pd

In [3]:
# Need tf version >=2.0 and hub version >=0.7
import tensorflow as tf
import tensorflow_hub as hub
print("TF version: ", tf.__version__)
print("Hub version: ", hub.__version__)

TF version:  2.4.1
Hub version:  0.12.0


In [4]:
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers

## Case study: the IMDB dataset

This is a widely used large dataset for text mining from a [2011 ACL meeting paper](https://ai.stanford.edu/~amaas/data/sentiment/) by Maas et al. I processed the data so it fits in a single CSV file 'IMDB_small.csv'.

The original dataset has 50000 balanced records, and the data file takes too long to upload to Google Colab. File 'IMDB_small.csv' contains a smaller 10000-record balanced sample, where the first 5000 are negative reviews and the rest are positive reviews.

In [7]:
# load the IMDB dataset
df = pd.read_csv('/content/drive/MyDrive/AMA/12_NLP_using_BERT/IMDB_small.csv')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | Congratulations to Christina Ricci for making ... | 0 |
| 1 | Another British cinema flag waver. Real garbag... | 0 |
| 2 | Hi there. I watched the first part when it cam... | 0 |
| 3 | If you merely look at the cover of this movie,... | 0 |
| 4 | This movie was extremely depressing. The cha... | 0 |


In [8]:
df.sentiment.value_counts()

1    5000
0    5000
Name: sentiment, dtype: int64

In [9]:
# one negative example:
import textwrap
print(textwrap.fill(df.review[2], 80))

Hi there. I watched the first part when it came out, and I don't remember having
left such a bad impression on me as this one.  First, the animation is choppy,
wooden, not worked on, lacks naturality - I understand the drawing style was to
be of some 'atlantean' kind, but, it could be done with the usual Disney
finesse... see "Tarzan" to see what I mean. If I didn't see the DISNEY logo in
the beginning, I would never say it was a Disney movie.  Second, the plot was
more like a PC game style, like a good old quest. Not that it was bad, but it
lacked a story that binds the viewer to the characters and their goals. It was
inconvincing, at least. The film was meant for children, but this was waaay to
childish at times.  Third, the music... I would say it was improper, but it just
fits the whole scene with the plot and animation...  Overall, I think this was
some kind of an amusement, just by-the-way kind of project by several apprentice
animators, just to fill in the count for Disney movie

In [10]:
# one positive example:
print(textwrap.fill(df.review[5000], 80))

After losing the Emmy for her performance as Mama Rose in the television version
of GYPSY, Bette won an Emmy the following year for BETTE MIDLER: DIVA LAS VEGAS,
a live concert special filmed for HBO from Las Vegas. Midler, who has been
performing live on stage since the 1970's, proves that she is still one of the
most electrifying live performers in the business. From her opening number, her
classic "Friends", where she descends from the wings atop a beautiful prop
cloud, Bette commands the stage with style and charisma from a rap-styled number
called "I Look Good" she then proves that she has a way with a joke like few
other performers in this business as she segues her way through a variety of
musical selections. The section of the show where she salutes burlesque goes on
a little too long but she does manage to incorporate her old Sophie Tucker jokes
here to good advantage (even though she actually forgets one joke in the middle
of telling it, but her ad-libbing until she remembers

In [11]:
# The following constants make it easier to adapt
# this notebook to other text mining datasets.
DATA_COLUMN = 'review'
LABEL_COLUMN = 'sentiment'
label_list = [0, 1]  # 0 = negative, 1 = positive

## Introducing BERT

**BERT (Bidirectional Encoder Representations from Transformers)** is a pre-trained Transformer model, introduced by Google in 2018, that is widely used as a feature extractor for natural language.

Some resources on BERT:
- The BERT paper: https://arxiv.org/pdf/1810.04805.pdf
- BERT on GitHub: https://github.com/google-research/bert
- BERT on TensorFlow Hub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1
- An older way of using BERT, for comparison: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb

Next, we will use BERT in four steps:
* Import and build the BERT model
* Tokenization
* Convert tokens to BERT input format
* Sentence/word embedding

## Importing and building the BERT model

This part of the code may look confusing for now. We will come back and explain it in more detail.

In [12]:
# !pip install sentencepiece
!pip install bert-for-tf2
import bert

Collecting bert-for-tf2
  Downloading https://files.pythonhosted.org/packages/a5/a1/acb891630749c56901e770a34d6bac8a509a367dd74a05daf7306952e910/bert-for-tf2-0.14.9.tar.gz (41kB)
Collecting py-params>=0.9.6
  Downloading https://files.pythonhosted.org/packages/aa/e0/4f663d8abf83c8084b75b995bd2ab3a9512ebc5b97206fde38cef906ab07/py-params-0.10.2.tar.gz
Collecting params-flow>=0.8.0
  Downloading https://files.pythonhosted.org/packages/a9/95/ff49f5ebd501f142a6f0aaf42bcfd1c192dc54909d1d9eb84ab031d46056/params-flow-0.8.2.tar.gz
Building wheels for collected packages: bert-for-tf2, py-params, params-flow
  Building wheel for bert-for-tf2 (setup.py) ... 

In [13]:
# BERT requires a MAX_SEQ_LENGTH that can be any integer<=512.
# Here we pick a smaller number to cut down computation cost.
max_seq_length = 256

In [14]:
# BERT requires the following three types of inputs (more on them later)
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")

In [15]:
# Now we load the already pre-trained BERT layers
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

In [16]:
model = models.Model(inputs=[input_word_ids, input_mask, segment_ids], 
                     outputs=[pooled_output, sequence_output])

In [17]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 256)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 256)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 256)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

## BERT for tokenization

Import tokenizer using the original vocab file:

In [18]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = bert.bert_tokenization.FullTokenizer(vocab_file, do_lower_case)

In [21]:
# The tokenizer converts a sentence to a sequence of tokens. Here's an example:
text = "Here is an example sentence that I want to tokenize."
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['here', 'is', 'an', 'example', 'sentence', 'that', 'i', 'want', 'to', 'token', '##ize', '.']
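
The `##` prefix marks a word piece that continues the preceding token. As a rough illustration of how such a split can arise, the sketch below applies greedy longest-match-first matching against a tiny, made-up vocabulary. The names `wordpiece_split` and `toy_vocab` are hypothetical and not part of the `bert` package; the real tokenizer also handles lowercasing, punctuation, and a far larger vocabulary.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"token", "##ize", "##s", "want", "example"}  # made-up vocabulary
print(wordpiece_split("tokenize", toy_vocab))
print(wordpiece_split("tokens", toy_vocab))
```

Because rare words are broken into known pieces, the model rarely has to fall back on an unknown-word token.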


Now we tokenize every review in the IMDB dataset. This may take a minute to finish.

In [22]:
df['tokens'] = df[DATA_COLUMN].apply(lambda x : tokenizer.tokenize(x))

In [23]:
# An example of what the tokens for a review look like:
print(df['tokens'][2])

['hi', 'there', '.', 'i', 'watched', 'the', 'first', 'part', 'when', 'it', 'came', 'out', ',', 'and', 'i', 'don', "'", 't', 'remember', 'having', 'left', 'such', 'a', 'bad', 'impression', 'on', 'me', 'as', 'this', 'one', '.', 'first', ',', 'the', 'animation', 'is', 'chop', '##py', ',', 'wooden', ',', 'not', 'worked', 'on', ',', 'lacks', 'natural', '##ity', '-', 'i', 'understand', 'the', 'drawing', 'style', 'was', 'to', 'be', 'of', 'some', "'", 'at', '##lan', '##tea', '##n', "'", 'kind', ',', 'but', ',', 'it', 'could', 'be', 'done', 'with', 'the', 'usual', 'disney', 'fines', '##se', '.', '.', '.', 'see', '"', 'tarzan', '"', 'to', 'see', 'what', 'i', 'mean', '.', 'if', 'i', 'didn', "'", 't', 'see', 'the', 'disney', 'logo', 'in', 'the', 'beginning', ',', 'i', 'would', 'never', 'say', 'it', 'was', 'a', 'disney', 'movie', '.', 'second', ',', 'the', 'plot', 'was', 'more', 'like', 'a', 'pc', 'game', 'style', ',', 'like', 'a', 'good', 'old', 'quest', '.', 'not', 'that', 'it', 'was', 'bad', ','

In [26]:
# Some reviews are long. For example:
len(df['tokens'][0])

637

In [27]:
# We now truncate any review with more than (MAX_SEQ_LENGTH-2) tokens,
# and add the special tokens [CLS] (start) and [SEP] (end).

def truncate_and_add(x, max_seq_length):
  a = ["[CLS]"] + x
  if len(a)>max_seq_length-1:
    a[max_seq_length-1] = "[SEP]"
    return a[:max_seq_length]
  else:
    return a + ["[SEP]"]

df['tokens'] = df['tokens'].apply(lambda x : truncate_and_add(x, max_seq_length))
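
To see the truncation rule in action, here is the same function applied with a deliberately tiny limit. The limit of 6 and the two word lists are illustrative only.

```python
def truncate_and_add(x, max_seq_length):
    # Same logic as the cell above, repeated so this example runs on its own.
    a = ["[CLS]"] + x
    if len(a) > max_seq_length - 1:
        a[max_seq_length - 1] = "[SEP]"
        return a[:max_seq_length]
    else:
        return a + ["[SEP]"]


short_review = ["a", "fine", "film"]
long_review = ["one", "two", "three", "four", "five", "six", "seven"]

# Short input: both special tokens are added, nothing is cut.
print(truncate_and_add(short_review, 6))
# Long input: only the first 4 word tokens survive, so the total length is 6.
print(truncate_and_add(long_review, 6))
```

In both cases the result starts with `[CLS]`, ends with `[SEP]`, and never exceeds the limit.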

## Converting tokens to BERT input format

We'll need to transform our data into a format BERT understands. This involves two steps. First, we identify the fields of each example. The BERT library's `InputExample` constructor names them as follows (this notebook works with the DataFrame columns directly):

- `text_a` is the text we want to classify, which in this case is the `review` field in our DataFrame.
- `text_b` is used if we're training a model to understand the relationship between sentences (e.g., is `text_b` a translation of `text_a`? Is `text_b` an answer to the question asked by `text_a`?). This doesn't apply to our task, so we can leave `text_b` blank.
- `label` is the target in supervised learning, which is `sentiment` in our example.

Second, to use BERT embeddings, we convert the tokens of each text input into the following format:
 - input token ids (the tokenizer converts tokens to integers using the vocab file)
 - input masks (1 for real tokens, 0 for padding)
 - segment ids (for inputs made of two texts: 0 for the first text, 1 for the second)
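
As a small worked example of these three arrays, the sketch below encodes a two-text input by hand. The vocabulary `toy_vocab` and its ids are made up for illustration; the real ids come from BERT's vocab file.

```python
import numpy as np

# Hypothetical vocabulary; real BERT ids differ.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "is": 3, "it": 4, "good": 5, "yes": 6}

tokens = ["[CLS]", "is", "it", "good", "[SEP]", "yes", "[SEP]"]
max_len = 10
pad = max_len - len(tokens)

# Token ids, padded with 0 up to max_len.
ids = np.array([toy_vocab[t] for t in tokens] + [0] * pad, dtype=np.int32)

# Mask: 1 for real tokens, 0 for padding.
mask = np.array([1] * len(tokens) + [0] * pad, dtype=np.int32)

# Segment ids: switch from 0 to 1 after the first [SEP].
segments, current = [], 0
for t in tokens:
    segments.append(current)
    if t == "[SEP]":
        current = 1
segment_ids = np.array(segments + [0] * pad, dtype=np.int32)

print(ids)
print(mask)
print(segment_ids)
```

For a single-text task such as sentiment analysis there is nothing after the first `[SEP]`, so the segment ids are all zeros.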


Define some functions for ease of preprocessing:

In [28]:
def get_ids(tokens, tokenizer, max_seq_length):
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    token_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return np.array(token_ids, dtype=np.int32)
    
def get_masks(tokens, max_seq_length):
    token_masks = [1]*len(tokens) + [0] * (max_seq_length - len(tokens))
    return np.array(token_masks, dtype=np.int32)

def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    segments = segments + [0] * (max_seq_length - len(tokens))
    return np.array(segments, dtype=np.int32)


In [29]:
df['ids'] = df['tokens'].apply(lambda x : get_ids(x, tokenizer, max_seq_length))
df['masks'] = df['tokens'].apply(lambda x : get_masks(x, max_seq_length))
df['segments'] = df['tokens'].apply(lambda x : get_segments(x, max_seq_length))

In [30]:
# Let's see what the first movie review is now converted to:
df.iloc[0]

review       Congratulations to Christina Ricci for making ...
sentiment                                                    0
tokens       [[CLS], congratulations, to, christina, ric, #...
ids          [101, 23156, 2000, 12657, 26220, 6895, 2005, 2...
masks        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
segments     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: 0, dtype: object

In [31]:
# Now assemble the data as required by the definition of BERT inputs.
# The arrays use int32 to match the dtype of the model's Input layers.
n = df.shape[0]
all_ids = np.zeros(shape=(n, max_seq_length), dtype=np.int32)
all_masks = np.zeros(shape=(n, max_seq_length), dtype=np.int32)
all_segments = np.zeros(shape=(n, max_seq_length), dtype=np.int32)
for i, (_, row) in enumerate(df.iterrows()):
  all_ids[i] = row['ids']
  all_masks[i] = row['masks']
  all_segments[i] = row['segments']
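
An alternative to the explicit loop is `np.stack`, which joins a list of equal-length 1-D arrays into one 2-D array. The sketch below uses a toy DataFrame (`toy_df` is a made-up stand-in for `df`) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'ids' column: each cell holds a 1-D int32 array.
toy_df = pd.DataFrame({
    "ids": [np.array([101, 7, 102, 0], dtype=np.int32),
            np.array([101, 8, 9, 102], dtype=np.int32)],
})

# Stack the per-row arrays into a single (n_rows, seq_length) matrix.
all_ids_toy = np.stack(toy_df["ids"].tolist())
print(all_ids_toy.shape, all_ids_toy.dtype)
```

The stacked array keeps the `int32` dtype of the per-row arrays.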


## Using the pre-trained BERT model for sentence embedding

BERT converts each text input (in our example, a tokenized movie review) into the following.
* pooled output (also called pooled embedding, sentence embedding): this is a vector of size `768`, which represents the whole sentence.
* sequence outputs (also called sequence embeddings, word embeddings): this is a matrix of size `[max_seq_length, 768]`, where each token is now represented by a vector of size `768`.

**For sentiment analysis, we only need the pooled output.**

Similar to other deep learning models, BERT doesn't transform text one record at a time. Instead, BERT takes a batch of texts (e.g., a batch of movie reviews in our case) and converts them all at once. Thus the output shapes are:
 - pooled output of shape `[batch_size, 768]` with representations for the entire input sequences
 - sequence output of shape `[batch_size, max_seq_length, 768]`
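
To make the relationship between the two outputs concrete: in the original BERT implementation, the pooled output is obtained by passing the final hidden vector of the first token (`[CLS]`) through a dense layer with a `tanh` activation. The NumPy sketch below mimics only the shapes of that step; the weights `W` and `b` and the input values are random placeholders, not the trained BERT parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden = 2, 8, 768

# Placeholder for BERT's sequence output: one 768-vector per token.
sequence_output = rng.normal(size=(batch_size, seq_len, hidden)).astype(np.float32)

# Placeholder pooler weights (random, for illustration only).
W = rng.normal(scale=0.02, size=(hidden, hidden)).astype(np.float32)
b = np.zeros(hidden, dtype=np.float32)

# Take the vector at position 0 ([CLS]) and apply dense + tanh.
cls_vectors = sequence_output[:, 0, :]
pooled_output = np.tanh(cls_vectors @ W + b)

print(sequence_output.shape)  # (batch_size, seq_len, 768)
print(pooled_output.shape)    # (batch_size, 768)
```

The `tanh` is why every pooled value lies between -1 and 1.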

### A big data problem

The output from BERT can be huge. For example, in our dataset of 10000 movie reviews, where each review has a (truncated) length of 256 tokens, the sequence embeddings stored as 4-byte floats take `10000 * 256 * 768 * 4 ~= 7.9 GB`. This is too large to fit in Google Colab memory, so the following single line of code will likely trigger a `ResourceExhaustedError`.

`pool_embs, seq_embs = model.predict([all_ids,all_masks,all_segments])`
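
The size estimate can be checked with a few lines of arithmetic. The sketch assumes 4 bytes per value (float32) and compares the sequence embeddings with the much smaller pooled embeddings.

```python
n_reviews = 10000
max_seq_length = 256
hidden_size = 768
bytes_per_float32 = 4

# One 768-vector per token per review.
seq_bytes = n_reviews * max_seq_length * hidden_size * bytes_per_float32
# One 768-vector per review.
pooled_bytes = n_reviews * hidden_size * bytes_per_float32

print(f"sequence embeddings: {seq_bytes / 1e9:.2f} GB")    # 7.86 GB
print(f"pooled embeddings:   {pooled_bytes / 1e6:.2f} MB")  # 30.72 MB
```

The pooled embeddings are 256 times smaller, which is why keeping only them is feasible.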

Below is a workaround to avoid this big data problem. We process our data 1000 records at a time, i.e., we set the batch size to 1000. After each batch is processed, we discard the sequence embeddings because we don't need them, and save only the pooled embeddings.

In [32]:
pool_embs = np.zeros(shape=(n,768))
for i in np.arange(10):
  j = i*1000
  pool_embs[j:j+1000], seq_embs = model.predict([all_ids[j:j+1000],
                                                 all_masks[j:j+1000],
                                                 all_segments[j:j+1000]])
  print(f'{i+1}/10 of the data processed.')

1/10 of the data processed.
2/10 of the data processed.
3/10 of the data processed.
4/10 of the data processed.
5/10 of the data processed.
6/10 of the data processed.
7/10 of the data processed.
8/10 of the data processed.
9/10 of the data processed.
10/10 of the data processed.
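
The loop above assumes the number of records is an exact multiple of 1000. A slightly more general sketch is shown below; `embed_in_batches` and `fake_predict` are hypothetical helpers written for this illustration, with `fake_predict` standing in for `model.predict` so the example runs without TensorFlow.

```python
import numpy as np


def embed_in_batches(predict_fn, inputs, batch_size, out_dim):
    """Run predict_fn batch by batch and keep only the pooled output."""
    n = inputs[0].shape[0]
    pooled = np.zeros((n, out_dim), dtype=np.float32)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)          # the last batch may be shorter
        batch = [x[start:end] for x in inputs]
        pooled_batch, _ = predict_fn(batch)       # discard the sequence output
        pooled[start:end] = pooled_batch
    return pooled


def fake_predict(batch):
    """Stand-in for model.predict: returns (pooled, sequence) arrays."""
    ids = batch[0]
    pooled = np.tile(ids.sum(axis=1, keepdims=True), (1, 4)).astype(np.float32)
    seq = np.zeros(ids.shape + (4,), dtype=np.float32)
    return pooled, seq


toy_ids = np.arange(25 * 3).reshape(25, 3)
result = embed_in_batches(fake_predict, [toy_ids, toy_ids, toy_ids],
                          batch_size=10, out_dim=4)
print(result.shape)
```

In this notebook the call would presumably look like `embed_in_batches(model.predict, [all_ids, all_masks, all_segments], 1000, 768)`.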


In [33]:
pool_embs.shape

(10000, 768)

In [34]:
pool_embs[0]

array([-0.39288506, -0.38200733, -0.93822885,  0.39840695,  0.69078875,
        0.03995161, -0.41821665,  0.18299808, -0.86370063, -0.99983764,
       -0.42773068,  0.93359691,  0.93113583,  0.49542025,  0.44924992,
       -0.11027248,  0.42844871, -0.38838491,  0.29029307,  0.84075254,
        0.54123586,  0.99999559, -0.07752901,  0.22819921,  0.37814   ,
        0.980977  , -0.57791328,  0.73325461,  0.71572924,  0.70902973,
        0.18873926,  0.09612697, -0.96260536, -0.00846757, -0.97580045,
       -0.97711515,  0.22195676, -0.17037886,  0.27903038,  0.22599238,
       -0.49821594,  0.01342047,  0.99994415, -0.13116795,  0.58893472,
       -0.00399859, -0.99987859,  0.19726965, -0.51845223,  0.92321241,
        0.85511887,  0.95291692,  0.15571323,  0.2313665 ,  0.36370099,
       -0.61507559, -0.25904995,  0.01653609, -0.09299102, -0.35973966,
       -0.48011357,  0.27778593, -0.80756056, -0.66026044,  0.86174649,
        0.91938794, -0.15493563, -0.05324274,  0.09184919, -0.18

## Assembling a new dataset with features extracted by BERT

For each text, the corresponding pooled output is a vector of 768 numbers that summarizes the whole text. We can now treat these 768 numbers as features extracted by BERT. Let's assemble a new DataFrame with these features and the sentiment labels.

In [35]:
feature_df = pd.DataFrame(pool_embs)
feature_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767
0,-0.392885,-0.382007,-0.938229,0.398407,0.690789,0.039952,-0.418217,0.182998,-0.863701,-0.999838,-0.427731,0.933597,0.931136,0.49542,0.44925,-0.110272,0.428449,-0.388385,0.290293,0.840753,0.541236,0.999996,-0.077529,0.228199,0.37814,0.980977,-0.577913,0.733255,0.715729,0.70903,0.188739,0.096127,-0.962605,-0.008468,-0.9758,-0.977115,0.221957,-0.170379,0.27903,0.225992,...,0.075194,-0.225154,-0.23454,-0.320392,0.486677,-0.763093,-0.335556,-0.250378,0.448193,0.0337,0.999996,-0.778875,-0.862276,-0.560151,-0.222567,0.214504,0.028949,-1.0,0.176051,-0.697325,0.747258,-0.645595,0.810977,-0.339198,-0.551214,-0.176504,0.839319,0.864375,-0.223968,-0.34722,0.543835,-0.56437,0.956683,0.429475,0.181101,0.199112,0.635255,-0.811525,-0.435742,0.390948
1,-0.776412,-0.479143,-0.966163,0.483075,0.735439,0.11049,0.52094,0.373213,-0.884016,-0.999994,-0.464932,0.939287,0.963481,0.724542,0.811432,-0.593823,0.150257,-0.514505,0.174578,0.545818,0.727452,1.0,-0.179627,0.383355,0.363103,0.986535,-0.615079,0.885119,0.937005,0.822033,-0.332285,0.308077,-0.981316,-0.081927,-0.981529,-0.991082,0.470549,-0.43218,-0.033664,-0.094756,...,0.417558,-0.388657,-0.325882,-0.581247,0.735591,-0.750473,-0.612798,-0.58187,0.756416,0.273035,1.0,-0.854301,-0.946476,-0.443064,-0.43986,0.480186,-0.488955,-1.0,0.374267,-0.578545,0.88965,-0.754009,0.912782,-0.667797,-0.892207,-0.399767,0.878065,0.866837,-0.45935,-0.521142,0.688332,-0.457478,0.979939,0.630784,-0.082554,-0.106779,0.773375,-0.826193,-0.651949,0.81231
2,-0.532591,-0.387242,-0.952996,0.497544,0.763823,-0.018741,-0.007338,0.19736,-0.838852,-0.999875,-0.567266,0.950285,0.938886,0.641195,0.643097,-0.359462,0.31458,-0.5622,0.267115,0.781857,0.489104,0.999998,-0.152318,0.361618,0.392454,0.979932,-0.697706,0.84031,0.802404,0.710169,-0.183059,0.168304,-0.967066,-0.194827,-0.971231,-0.980333,0.359173,-0.262316,0.10326,0.168209,...,0.361302,-0.31231,-0.210683,-0.299113,0.486185,-0.736889,-0.539592,-0.343858,0.49078,0.18276,0.999998,-0.900402,-0.927199,-0.50091,-0.391939,0.445694,-0.261197,-1.0,0.214347,-0.733179,0.861573,-0.735236,0.921946,-0.400561,-0.834726,-0.252357,0.867712,0.881082,-0.290217,-0.187555,0.616574,-0.20355,0.968841,0.482654,-0.51732,-0.143238,0.614655,-0.894548,-0.570303,0.546067
3,-0.594817,-0.403342,-0.953941,0.517494,0.841254,-0.026739,0.022725,0.156407,-0.892972,-0.999742,-0.608055,0.945132,0.939643,0.599365,0.584011,-0.390083,0.086453,-0.428089,0.170764,0.626081,0.425075,0.999995,-0.038928,0.245449,0.258321,0.986758,-0.605088,0.781417,0.805887,0.589119,-0.073113,0.118716,-0.970241,-0.125292,-0.944507,-0.97719,0.2412,-0.269667,0.189351,0.18129,...,0.208609,-0.227695,-0.219675,-0.348739,0.594702,-0.777287,-0.378595,-0.372857,0.435433,0.070352,0.999995,-0.859133,-0.928382,-0.434548,-0.349002,0.312163,-0.296398,-1.0,0.090986,-0.733399,0.795553,-0.807325,0.921434,-0.501338,-0.788751,-0.087494,0.801062,0.885823,-0.373333,-0.343366,0.600299,-0.545601,0.980657,0.575146,0.13579,0.12016,0.696775,-0.911567,-0.503772,0.692356
4,-0.410833,-0.506333,-0.995662,0.53096,0.96025,-0.236363,-0.282453,0.447984,-0.960383,-0.999265,-0.781264,0.974797,0.893978,0.797164,0.428669,-0.476356,0.003985,-0.647214,0.304032,0.872063,0.531742,1.0,-0.455619,0.363179,0.364398,0.995275,-0.730364,0.680403,0.570509,0.578997,0.228405,0.039234,-0.953021,-0.27755,-0.998966,-0.962857,0.441451,-0.007218,-0.027191,0.172031,...,0.726821,-0.360532,-0.351759,-0.493682,0.436126,-0.871435,-0.586574,-0.393124,0.491337,0.28518,1.0,-0.982884,-0.977736,-0.594445,-0.442828,0.509552,-0.372019,-1.0,0.203671,-0.767502,0.953264,-0.941762,0.990971,-0.633268,-0.637323,-0.443813,0.915532,0.961086,-0.354065,-0.19967,0.555582,-0.100036,0.997626,0.311818,-0.217166,-0.397853,0.669593,-0.970973,-0.529365,0.628633


In [36]:
feature_df['sentiment'] = df['sentiment']
feature_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,sentiment
0,-0.392885,-0.382007,-0.938229,0.398407,0.690789,0.039952,-0.418217,0.182998,-0.863701,-0.999838,-0.427731,0.933597,0.931136,0.49542,0.44925,-0.110272,0.428449,-0.388385,0.290293,0.840753,0.541236,0.999996,-0.077529,0.228199,0.37814,0.980977,-0.577913,0.733255,0.715729,0.70903,0.188739,0.096127,-0.962605,-0.008468,-0.9758,-0.977115,0.221957,-0.170379,0.27903,0.225992,...,-0.225154,-0.23454,-0.320392,0.486677,-0.763093,-0.335556,-0.250378,0.448193,0.0337,0.999996,-0.778875,-0.862276,-0.560151,-0.222567,0.214504,0.028949,-1.0,0.176051,-0.697325,0.747258,-0.645595,0.810977,-0.339198,-0.551214,-0.176504,0.839319,0.864375,-0.223968,-0.34722,0.543835,-0.56437,0.956683,0.429475,0.181101,0.199112,0.635255,-0.811525,-0.435742,0.390948,0
1,-0.776412,-0.479143,-0.966163,0.483075,0.735439,0.11049,0.52094,0.373213,-0.884016,-0.999994,-0.464932,0.939287,0.963481,0.724542,0.811432,-0.593823,0.150257,-0.514505,0.174578,0.545818,0.727452,1.0,-0.179627,0.383355,0.363103,0.986535,-0.615079,0.885119,0.937005,0.822033,-0.332285,0.308077,-0.981316,-0.081927,-0.981529,-0.991082,0.470549,-0.43218,-0.033664,-0.094756,...,-0.388657,-0.325882,-0.581247,0.735591,-0.750473,-0.612798,-0.58187,0.756416,0.273035,1.0,-0.854301,-0.946476,-0.443064,-0.43986,0.480186,-0.488955,-1.0,0.374267,-0.578545,0.88965,-0.754009,0.912782,-0.667797,-0.892207,-0.399767,0.878065,0.866837,-0.45935,-0.521142,0.688332,-0.457478,0.979939,0.630784,-0.082554,-0.106779,0.773375,-0.826193,-0.651949,0.81231,0
2,-0.532591,-0.387242,-0.952996,0.497544,0.763823,-0.018741,-0.007338,0.19736,-0.838852,-0.999875,-0.567266,0.950285,0.938886,0.641195,0.643097,-0.359462,0.31458,-0.5622,0.267115,0.781857,0.489104,0.999998,-0.152318,0.361618,0.392454,0.979932,-0.697706,0.84031,0.802404,0.710169,-0.183059,0.168304,-0.967066,-0.194827,-0.971231,-0.980333,0.359173,-0.262316,0.10326,0.168209,...,-0.31231,-0.210683,-0.299113,0.486185,-0.736889,-0.539592,-0.343858,0.49078,0.18276,0.999998,-0.900402,-0.927199,-0.50091,-0.391939,0.445694,-0.261197,-1.0,0.214347,-0.733179,0.861573,-0.735236,0.921946,-0.400561,-0.834726,-0.252357,0.867712,0.881082,-0.290217,-0.187555,0.616574,-0.20355,0.968841,0.482654,-0.51732,-0.143238,0.614655,-0.894548,-0.570303,0.546067,0
3,-0.594817,-0.403342,-0.953941,0.517494,0.841254,-0.026739,0.022725,0.156407,-0.892972,-0.999742,-0.608055,0.945132,0.939643,0.599365,0.584011,-0.390083,0.086453,-0.428089,0.170764,0.626081,0.425075,0.999995,-0.038928,0.245449,0.258321,0.986758,-0.605088,0.781417,0.805887,0.589119,-0.073113,0.118716,-0.970241,-0.125292,-0.944507,-0.97719,0.2412,-0.269667,0.189351,0.18129,...,-0.227695,-0.219675,-0.348739,0.594702,-0.777287,-0.378595,-0.372857,0.435433,0.070352,0.999995,-0.859133,-0.928382,-0.434548,-0.349002,0.312163,-0.296398,-1.0,0.090986,-0.733399,0.795553,-0.807325,0.921434,-0.501338,-0.788751,-0.087494,0.801062,0.885823,-0.373333,-0.343366,0.600299,-0.545601,0.980657,0.575146,0.13579,0.12016,0.696775,-0.911567,-0.503772,0.692356,0
4,-0.410833,-0.506333,-0.995662,0.53096,0.96025,-0.236363,-0.282453,0.447984,-0.960383,-0.999265,-0.781264,0.974797,0.893978,0.797164,0.428669,-0.476356,0.003985,-0.647214,0.304032,0.872063,0.531742,1.0,-0.455619,0.363179,0.364398,0.995275,-0.730364,0.680403,0.570509,0.578997,0.228405,0.039234,-0.953021,-0.27755,-0.998966,-0.962857,0.441451,-0.007218,-0.027191,0.172031,...,-0.360532,-0.351759,-0.493682,0.436126,-0.871435,-0.586574,-0.393124,0.491337,0.28518,1.0,-0.982884,-0.977736,-0.594445,-0.442828,0.509552,-0.372019,-1.0,0.203671,-0.767502,0.953264,-0.941762,0.990971,-0.633268,-0.637323,-0.443813,0.915532,0.961086,-0.354065,-0.19967,0.555582,-0.100036,0.997626,0.311818,-0.217166,-0.397853,0.669593,-0.970973,-0.529365,0.628633,0


In [37]:
# Warning: this file will be large, about 150MB
feature_df.to_csv("/content/drive/MyDrive/AMA/12_NLP_using_BERT/IMDB_small_BERT.csv", index=False)
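
If the CSV size becomes a problem, one possible alternative is NumPy's compressed binary format, which stores each float32 value in 4 bytes instead of a long decimal string. The sketch below is an assumption-laden illustration: it uses random stand-in data and a temporary file rather than the notebook's embeddings and Google Drive path.

```python
import os
import tempfile

import numpy as np

# Random stand-ins for pool_embs and the sentiment labels.
toy_embs = np.random.default_rng(0).uniform(-1, 1, size=(100, 768))
toy_labels = np.array([0, 1] * 50)

# Save features (as float32) and labels together in one compressed file.
path = os.path.join(tempfile.mkdtemp(), "toy_features.npz")
np.savez_compressed(path, X=toy_embs.astype(np.float32), y=toy_labels)

# Load them back.
loaded = np.load(path)
X_loaded, y_loaded = loaded["X"], loaded["y"]
print(X_loaded.shape, X_loaded.dtype, y_loaded.shape)
```

The trade-off is that the file is no longer human-readable or directly usable by tools that expect CSV.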

## Building and evaluating the prediction model

The rest is similar to what we did with the business loan dataset earlier this semester. I'll use a simple neural network with one hidden layer; removing the hidden layer would reduce it to logistic regression.

In [38]:
from sklearn.model_selection import train_test_split

In [39]:
X = feature_df.drop(columns=['sentiment'])
y = feature_df['sentiment']

# reserve 30% of the dataset for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=1,
                                                    stratify=y)

In [40]:
model2 = models.Sequential()
model2.add(layers.Dense(128, activation='relu', input_dim=768))
# model2.add(layers.Dropout(0.5))
model2.add(layers.Dense(1, activation='sigmoid'))

In [41]:
model2.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])
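
For intuition about what the loss and metric measure, the sketch below computes binary cross-entropy and accuracy by hand in NumPy on four made-up predictions. It is an illustration of the formulas, not Keras's internal implementation; the function name `binary_crossentropy` and the clipping constant `eps` are choices made for this example.

```python
import numpy as np


def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))


y_true = np.array([0, 1, 1, 0])          # made-up labels
p_pred = np.array([0.1, 0.8, 0.4, 0.7])  # made-up predicted probabilities

loss = binary_crossentropy(y_true, p_pred)
# Accuracy: a prediction counts as class 1 when its probability is at least 0.5.
acc = float(np.mean((p_pred >= 0.5).astype(int) == y_true))

print(f"loss = {loss:.4f}, acc = {acc:.2f}")
```

Confident wrong predictions (such as 0.7 for a negative review) contribute the most to the loss.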

In [42]:
model2.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 128)               98432     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 98,561
Trainable params: 98,561
Non-trainable params: 0
_________________________________________________________________
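
The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-output pair plus one bias per output unit.

```python
input_dim, hidden_units, output_units = 768, 128, 1

# weights + biases for each Dense layer
dense_params = input_dim * hidden_units + hidden_units          # 98432
dense_1_params = hidden_units * output_units + output_units     # 129
total_params = dense_params + dense_1_params                    # 98561

print(dense_params, dense_1_params, total_params)
```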


In [43]:
model2.fit(X_train, y_train, epochs=30)

Epoch 1/30
...
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7fd86229d310>

In [44]:
test_loss, test_acc = model2.evaluate(X_test,  y_test, verbose=2)

94/94 - 0s - loss: 0.3827 - acc: 0.8313


In [45]:
# prediction
model2.predict(X_test.iloc[[0]])

array([[0.1053965]], dtype=float32)
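
The sigmoid output is the predicted probability of a positive review. A common way to turn it into a class label is to threshold it at 0.5, as sketched below. The first probability is the one shown above; the other two are made up for illustration.

```python
import numpy as np

# Shape (n, 1), like the output of model2.predict.
probs = np.array([[0.1053965], [0.93], [0.5]], dtype=np.float32)

# Flatten to 1-D and apply the 0.5 threshold.
labels = (probs.ravel() >= 0.5).astype(int)
names = ["negative" if label == 0 else "positive" for label in labels]

print(labels)
print(names)
```

A probability of about 0.11 therefore maps to label 0, a negative review.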

In [46]:
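# y_test keeps the original row labels after the shuffled split, so use
# positional indexing to match X_test.iloc[[0]] above.
print(y_test.iloc[0])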

0
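
One point to watch when comparing a prediction with its true label: `train_test_split` shuffles the rows, so the index labels of `y_test` no longer run 0, 1, 2, and so on. The toy Series below (made-up values and index) shows the difference between looking up by label and by position.

```python
import pandas as pd

# A Series whose index labels are shuffled, as after a random split.
y = pd.Series([1, 0, 1, 0], index=[7, 0, 3, 5], name="sentiment")

first_by_position = y.iloc[0]  # first row in the Series
by_label = y.loc[0]            # row whose index label is 0 (y[0] behaves the same
                               # way when the index holds integers)

print(first_by_position, by_label)
```

Since `X_test.iloc[[0]]` selects by position, the matching label is the one at position 0 as well.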
