## 1 basic sentiment analysis

In [1]:
from transformers import pipeline

In [2]:
senti = pipeline('sentiment-analysis')

In [3]:
sentence1 = 'BERT stands for Bidirectional Encoder Representations from Transformers and it is a state-of-the-art machine learning model used for NLP tasks.'
sentence2 = 'i hate you'

In [4]:
result = senti(sentence1)

In [10]:
result

[{'label': 'POSITIVE', 'score': 0.9933399558067322}]

In [14]:
res1, res2 = senti(sentence1)[0], senti(sentence2)[0]  # run the pipeline once per sentence
print('sentence 1 sentiment analysis result is {}, score is {:.4f}'.format(res1['label'], res1['score']))
print('sentence 2 sentiment analysis result is {}, score is {:.4f}'.format(res2['label'], res2['score']))

sentence 1 sentiment analysis result is POSITIVE, score is 0.9933
sentence 2 sentiment analysis result is NEGATIVE, score is 0.9991
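
The pipeline returns a list holding one dict per input, as the `result` output above shows. The sketch below wraps the formatting in a small helper so each result is computed once and reused. `describe` is a hypothetical name, and the sample input is the result recorded above for sentence 1, so the block runs without loading the model.

```python
def describe(name, result):
    """Format one pipeline result (a list holding a single dict) as a line of text."""
    top = result[0]
    return '{} sentiment analysis result is {}, score is {:.4f}'.format(
        name, top['label'], top['score'])

# Output recorded earlier in this notebook for sentence 1.
recorded = [{'label': 'POSITIVE', 'score': 0.9933399558067322}]
print(describe('sentence 1', recorded))
```

In the notebook itself the call would presumably look like `describe('sentence 1', senti(sentence1))`.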


## 2 BERT model on IMDB reviews

In [20]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
import tensorflow as tf
import numpy as np  # used by gen_bert_tokenizer below
import pandas as pd
import os
import shutil

### 2.1 load pre-trained model

In [17]:
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = TFBertForSequenceClassification.from_pretrained('/Users/wegzheng/Downloads/BERT/',num_labels=2)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at /Users/wegzheng/Downloads/BERT/ and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
bert_model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


### 2.2 data cleaning

In [21]:
os.listdir('data/aclImdb/')

['imdbEr.txt', 'test', 'imdb.vocab', 'README', 'train']

In [39]:
# Create a training dataset and a validation dataset
# from the "aclImdb/train" directory with an 80/20 split.
train = tf.keras.preprocessing.text_dataset_from_directory(
    'data/aclImdb/train', batch_size=30000, validation_split=0.2, 
    subset='training', seed=2021)
valid = tf.keras.preprocessing.text_dataset_from_directory(
    'data/aclImdb/train', batch_size=30000, validation_split=0.2, 
    subset='validation', seed=2021)
test = tf.keras.preprocessing.text_dataset_from_directory(
    'data/aclImdb/test/', batch_size=25000)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


In [35]:
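def ge_dataframe(dataset):
    """Convert the first batch of a text dataset into a DataFrame with 'content' and 'label' columns."""
    # Each dataset above uses a batch size at least as large as its split,
    # so the first batch holds every row.
    for texts, labels in dataset.take(1):
        text = texts.numpy()
        label = labels.numpy()
    df = pd.DataFrame({'content': text,
                       'label': label})
    df['content'] = df['content'].str.decode("utf-8")  # convert b'i like...' to 'i like ...'
    print(df.shape)
    return df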

In [36]:
df_train = ge_dataframe(train)

(20000, 2)


In [62]:
df_train.head(4)

| | content | label |
|---|---|---|
| 0 | Joe Don Baker is an alright to good actor in s... | 0 |
| 1 | I really wanted to like this film, but so much... | 0 |
| 2 | I happened upon this flick on a rainy Sunday, ... | 0 |
| 3 | I'm writing this because I somehow felt being ... | 0 |


In [37]:
df_valid = ge_dataframe(valid)

(5000, 2)


In [40]:
df_test = ge_dataframe(test)

(25000, 2)


In [41]:
def gen_bert_tokenizer(df):
    sentences = df['content']
    input_ids=[]
    attention_masks=[]

    for sent in sentences:
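        # padding='max_length' with truncation=True replaces the deprecated pad_to_max_length flag
        bert_inp = bert_tokenizer.encode_plus(sent, add_special_tokens=True, max_length=64,
                                              padding='max_length', truncation=True,
                                              return_attention_mask=True)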
        input_ids.append(bert_inp['input_ids'])
        attention_masks.append(bert_inp['attention_mask'])

    input_ids=np.asarray(input_ids)
    attention_masks=np.array(attention_masks)
    labels=np.array(df['label'])
    
    return input_ids, attention_masks, labels
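
To make the fixed-length encoding above concrete, here is a minimal plain-Python sketch of what padding to `max_length` and the attention mask amount to. It is an illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids in the middle are made up, and the real tokenizer keeps the final `[SEP]` token when truncating whereas this sketch simply cuts the list. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`.

```python
def pad_or_truncate(token_ids, max_length=64, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]      # drop tokens beyond max_length
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks padding positions the model should ignore
    return ids + [pad_id] * padding, mask + [0] * padding

# Made-up ids between [CLS] (101) and [SEP] (102).
ids, mask = pad_or_truncate([101, 11, 12, 13, 102], max_length=8)
print(ids)   # [101, 11, 12, 13, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Because every row ends up the same length, the lists can be stacked into rectangular arrays, which is what the `np.asarray` calls in `gen_bert_tokenizer` rely on.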

In [48]:
input_ids_tr, attenten_masks_tr, labels_tr = gen_bert_tokenizer(df_train.sample(frac=0.1, random_state=2021))
input_ids_va, attenten_masks_va, labels_va = gen_bert_tokenizer(df_valid.sample(frac=0.1, random_state=2021))

In [72]:
%time input_ids_ts, attenten_masks_ts, labels_ts = gen_bert_tokenizer(df_test[:1000])

CPU times: user 5.22 s, sys: 155 ms, total: 5.37 s
Wall time: 6.06 s


### 2.3 model compile and training

In [43]:
bert_model.summary()  # summary() prints the table itself and returns None, so it is not wrapped in print()

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08)

bert_model.compile(loss=loss, optimizer=optimizer, metrics=[metric])

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


In [49]:
callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, mode='auto')]
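
As a rough illustration of what `patience=2` means, the sketch below replays the early-stopping rule on a list of validation losses in plain Python. It is a simplification of the Keras callback (no `min_delta`, no weight restoring), `stopping_epoch` is a hypothetical helper, and the loss values are invented for the example.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float('inf')
    wait = 0                                 # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Invented losses: the best value is at epoch 1, then two epochs without improvement.
print(stopping_epoch([0.40, 0.45, 0.50, 0.30]))  # 3
```

With `epochs=4`, a pattern like this would end training after the third epoch. The fourth value is never reached.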

In [50]:
history = bert_model.fit([input_ids_tr, attenten_masks_tr], labels_tr,
                         batch_size=64,
                         epochs=4,
                         validation_data=([input_ids_va, attenten_masks_va], labels_va),
                         callbacks=callbacks)

Epoch 1/4
Epoch 2/4
Epoch 3/4


## 3 model validation

### 3.1 validation on the test dataset

In [73]:
%time tf_outputs = bert_model([input_ids_ts,attenten_masks_ts])

CPU times: user 4min 51s, sys: 3min 16s, total: 8min 7s
Wall time: 1min 2s


In [74]:
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)

In [76]:
tf_predictions[:5]

<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[0.9329982 , 0.0670018 ],
       [0.99812835, 0.0018717 ],
       [0.00405917, 0.9959408 ],
       [0.00184835, 0.9981516 ],
       [0.9883062 , 0.01169381]], dtype=float32)>
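
The model returns raw logits, and `tf.nn.softmax` turns each row into probabilities that sum to 1. A minimal plain-Python sketch of that step for a single row follows. The logits are hypothetical values chosen for illustration, not output from this model.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, -1.0])                     # hypothetical logits for [Negative, Positive]
print(probs)
print(probs.index(max(probs)))                   # index of the larger probability: 0
```

Softmax preserves the ordering of the logits, so taking the argmax of the probabilities selects the same class as taking the argmax of the logits.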

In [83]:
df_test.head(10)

| | content | label |
|---|---|---|
| 0 | This was a movie that could have been great if... | 0 |
| 1 | Surely this deserves to be in the bottom 10 fi... | 0 |
| 2 | I enjoyed it. There you go, I said it again. I... | 1 |
| 3 | I agree with all aforementioned comments. This... | 1 |
| 4 | When I was 13 or so I was lucky enough to find... | 0 |
| 5 | First off, let me say I wasted Halloween movie... | 0 |
| 6 | I love documentaries. They are among my favori... | 0 |
| 7 | This wonderful 3 part BBC production is one of... | 1 |
| 8 | Finally! Other people who have actually seen t... | 1 |
| 9 | This is one of the greatest films ever made. B... | 1 |


In [84]:
labels = ['Negative', 'Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(5):
    print(df_test['content'][i], ": \n", labels[label[i]])

This was a movie that could have been great if there were not so many unnecessary historical inaccuracies and if the actors had been chosen or made up to look a little more like the real persons (not very difficult). Sissi did not go to Mayerling to see her dead son, she also did not die in the street; they carried her on to the boat and then back to the hotel, which was much more dramatic. I am not sure about the wedding night, but I find it exaggerated that a lady-in-waiting would undress the empress and leave her completely naked (and that in the 1850's) or that the emperor would announce very proudly "yes I finally laid her" to the assembled court. As far as I know this was done right away on the first night and nobody rewarded her as if she were a streetwalker. The saving grace of the movie is really Stephane Audran, excellent actress and true to character. : 
 Negative
Surely this deserves to be in the bottom 10 films of all time, pity it's just a TV movie. Rubbish that only we B
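
Reading predictions one by one is slow, so a simple check is to compare predicted and true labels directly. The sketch below does this in plain Python for the first five test rows, using the probabilities and labels recorded earlier in this notebook. Five rows say nothing reliable about overall test accuracy; this only shows the mechanics.

```python
# First five probability rows and true labels as recorded in this notebook.
probs = [[0.9329982, 0.0670018],
         [0.99812835, 0.0018717],
         [0.00405917, 0.9959408],
         [0.00184835, 0.9981516],
         [0.9883062, 0.01169381]]
true_labels = [0, 0, 1, 1, 0]

predicted = [row.index(max(row)) for row in probs]   # argmax per row
accuracy = sum(p == t for p, t in zip(predicted, true_labels)) / len(true_labels)
print(predicted, accuracy)  # [0, 0, 1, 1, 0] 1.0
```

On the full 1000-row slice, the same comparison could presumably be written as `(label == labels_ts).mean()`, assuming `label` and `labels_ts` are the NumPy arrays built in the cells above.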

### 3.2 validation on text

In [78]:
pred_sentences = ['This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good',
                  'One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie']

In [85]:
tf_batch2 = bert_tokenizer(pred_sentences, max_length=64, padding=True, truncation=True, return_tensors='tf')
tf_outputs2 = bert_model(tf_batch2)
tf_predictions2 = tf.nn.softmax(tf_outputs2[0], axis=-1)
labels = ['Negative', 'Positive']
label = tf.argmax(tf_predictions2, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
    print(pred_sentences[i], ": \n", labels[label[i]])

This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good : 
 Positive
One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie : 
 Negative


In [82]:
label

array([1, 0])

In [80]:
tf_predictions2

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[0.00161889, 0.99838114],
       [0.9979481 , 0.00205192]], dtype=float32)>