# Natural Language Processing with BERT and GloVe

The Real or Not:  NLP with Disaster Tweets Data is a collection of tweets in english taken from twitter. The data comes with the initial text, a key word column, and a location column.

The Objective of this notebook is to predict whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0. The two types of RNN models I chose for this paper are BERT and GloVe.


BERT stands for Bidirectional Encoder Representations from Transformers. BERT’s key technical innovation is applying the bidirectional training of Transformer, a popular attention model, to language modelling. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. 

For my BERT model my hyperparameters consisted of 3 epochs and a batch size of 16. 

For my GloVe model my hyperparameters consisted of 400 iterations at a learning rate of 0.01.

As for the results of the models on Kaggle, the BERT Model got a score of 0.83333, while the GloVe Model got a 0.66871.

If management is thinking about using a language model to classify written customer reviews and call and complaint logs, and the most critical customer messages can be identified, then customer support personnel could be assigned to contact those customers. The key transformation which needs to be made is to calibrate the models to look for sentiment as opposed to simply looking for whether or not a catastrophe was being referenced. The model could theoretically be used to respond to customer emails. 

If management wanted to use this to respond to calls instead of emails, then a method to encode the customer's speech into text would be necessay, since the models take text as an input and then later encodes it.

This notebook was inspired by two notebooks:

https://www.kaggle.com/mashiat/nlp-rnn

https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub


# Appendix

In [0]:
import tensorflow as tf 
tf.test.gpu_device_name() 

'/device:GPU:0'

In [0]:
from google.colab import drive 
drive.mount('/mntDrive') 

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /mntDrive


In [0]:
! ls "/mntDrive/My Drive"

 403				     'submission_1 assignment 5.csv'
'Colab Notebooks'		     'submission_2 assignment 5.csv'
 dogs-vs-cats-redux-kernels-edition   test.csv
 nlp-getting-started		      train.csv


In [0]:
# We will use the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [0]:
!pip3 install tensorflow_text>=2.0.0rc0
!pip install sentencepiece

Collecting sentencepiece
[?25l  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
[K     |▎                               | 10kB 22.7MB/s eta 0:00:01[K     |▋                               | 20kB 3.2MB/s eta 0:00:01[K     |█                               | 30kB 4.2MB/s eta 0:00:01[K     |█▏                              | 40kB 4.5MB/s eta 0:00:01[K     |█▌                              | 51kB 3.7MB/s eta 0:00:01[K     |█▉                              | 61kB 4.2MB/s eta 0:00:01[K     |██▏                             | 71kB 4.4MB/s eta 0:00:01[K     |██▍                             | 81kB 4.8MB/s eta 0:00:01[K     |██▊                             | 92kB 5.2MB/s eta 0:00:01[K     |███                             | 102kB 4.9MB/s eta 0:00:01[K     |███▍                            | 112kB 4.9MB/s eta 0:00:01[K     |███▋                     

In [0]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub

import tokenization

import matplotlib.pyplot as plt
import re
import nltk

%matplotlib inline 

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 150)

import warnings
warnings.filterwarnings('ignore')

In [0]:
train = pd.read_csv("/mntDrive/My Drive/nlp-getting-started/train.csv")
test = pd.read_csv("/mntDrive/My Drive/nlp-getting-started/test.csv")
submission=pd.read_csv("/mntDrive/My Drive/nlp-getting-started/sample_submission.csv")

# BERT Model Data Prep

In [0]:
def bert_encode(texts, tokenizer, max_len=512):    
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)

In [0]:
def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

In [0]:
%%time
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

CPU times: user 21.6 s, sys: 3.91 s, total: 25.5 s
Wall time: 26.3 s


In [0]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [0]:
train_input = bert_encode(train.text.values, tokenizer, max_len=160)
test_input = bert_encode(test.text.values, tokenizer, max_len=160)
train_labels = train.target.values

# BERT Model Training

In [0]:
model = build_model(bert_layer, max_len=160)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 160)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 160)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 160)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

In [0]:
checkpoint = ModelCheckpoint('model.h5', monitor='val_accuracy', save_best_only=True)

train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=3,
    callbacks=[checkpoint],
    batch_size=16
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


#BERT Model on Kaggle Test Set

In [0]:
model.load_weights('model.h5')
test_pred = model.predict(test_input)

In [0]:
submission['target'] = test_pred.round().astype(int)
submission.to_csv('Assignment 8 BERT submission.csv', index=False)

# GloVe Model Data Prep

In [0]:
#reading from the file to turn the words to word embedding vector
def read_glove_vecs(glove_file):
    with open(glove_file, 'r') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map

In [0]:
#reading from the file to learn the word embedding into the list word_to_vec_map
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('/mntDrive/My Drive/nlp-getting-started/glove.6B.50d.txt')

In [0]:
print(train["text"])

0       Our Deeds are the Reason of this #earthquake M...
1                  Forest fire near La Ronge Sask. Canada
2       All residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       Just got sent this photo from Ruby #Alaska as ...
                              ...                        
7608    Two giant cranes holding a bridge collapse int...
7609    @aria_ahrary @TheTawniest The out of control w...
7610    M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...
7611    Police investigating after an e-bike collided ...
7612    The Latest: More Homes Razed by Northern Calif...
Name: text, Length: 7613, dtype: object


In [0]:
def clean(text):
    regex = re.compile('([^\s\w]|_)+')
    sentence = regex.sub('', text).lower()
    sentence = sentence.split(" ")
    
    for word in list(sentence):
        if word not in word_to_vec_map:
            sentence.remove(word)  
            
    sentence = " ".join(sentence)
    return sentence

In [0]:
for i in range (train.shape[0]):
    train.at[i,'text']=clean(train.loc[i,'text'])
    
for i in range(test.shape[0]):
    test.at[i,'text']=clean(test.loc[i,'text'])

In [0]:
#determining the max length of a text in training set
maxLen = len(max(train["text"], key=len).split())

In [0]:
#trying the length of the text of id=1
length=len(str(train[train['id']==1]["text"]).split())

In [0]:
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,our deeds are the reason of this earthquake ma...,1
1,4,,,forest fire near la ronge sask canada,1
2,5,,,all residents asked to shelter in place are be...,1
3,6,,,13000 people receive wildfires evacuation orde...,1
4,7,,,just got sent this photo from ruby alaska as s...,1


In [0]:
#One hot encoding of the target to 2 dimensional vector
Y_oh_train = tf.one_hot(train["target"],2,dtype='int32')

Y_oh_train[0]

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([0, 1], dtype=int32)>

In [0]:
train["text"].values[1]

'forest fire near la ronge sask canada'

In [0]:
import string
def sentence_to_avg(sentence, word_to_vec_map):
    """
    Converts a sentence (string) into a list of words (strings). Extracts the GloVe representation of each word
    and averages its value into a single vector encoding the meaning of the sentence.
    
    Arguments:
    sentence -- string, one training example from X
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    
    Returns:
    avg -- average vector encoding information about the sentence, numpy-array of shape (50,)
    """
    
    # Step 1: Split sentence into list of lower case words (≈ 1 line)
    #sentence=str(sentence.translate(str.maketrans('', '', string.punctuation)))
    words = (sentence.lower()).split()
    # Initialize the average word vector, should have the same shape as your word vectors.
    avg = np.zeros((50,))
    
    # Step 2: average the word vectors. You can loop over the words in the list "words".
    #using try except disregard some words which doesn't exist in the glove Vector such as 'dtype'
    total = 0
    for w in words:
        total += word_to_vec_map[w]
    if len(words):
        avg = total/len(words)
    
    return avg

In [0]:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

In [0]:
def predict(X, Y, W, b, word_to_vec_map):
    """
    Given X (sentences) and Y (emoji indices), predict emojis and compute the accuracy of your model over the given set.
    
    Arguments:
    X -- input data containing sentences, numpy array of shape (m, None)
    Y -- labels, containing index of the label emoji, numpy array of shape (m, 1)
    
    Returns:
    pred -- numpy array of shape (m, 1) with your predictions
    """
    m = Y.shape[0]
    pred = np.zeros((m, 1))
    Y_oh=tf.one_hot(Y,n_y,dtype='int32')
    Y=np.zeros((m,1))
    
    
    for j in range(m):                       # Loop over training examples
        
        # Split jth test example (sentence) into list of lower case words
        if Y_oh[j][0]==1:
            Y[j]=0
        else:
            Y[j]=1
        words = X[j].lower().split()
        
        avg = np.zeros((50,))
    
        total = 0
        for w in words:
            total += word_to_vec_map[w]
        if len(words):
            avg = total/len(words)
        

        # Forward propagation
        Z = np.dot(W, avg) + b
        A = softmax(Z)
        pred[j] = np.argmax(A)
        
    print("Accuracy: "  + str(np.mean((pred[:] == np.reshape(Y,(Y.shape[0],1)[:])))))
    return pred

In [0]:
def model(X, Y, word_to_vec_map, learning_rate = 0.01, num_iterations = 400):
    """
    Model to train word vector representations in numpy.
    
    Arguments:
    X -- input data, numpy array of sentences as strings, of shape (m, 1)
    Y -- labels, numpy array of integers between 0 and 7, numpy-array of shape (m, 1)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    learning_rate -- learning_rate for the stochastic gradient descent algorithm
    num_iterations -- number of iterations
    
    Returns:
    pred -- vector of predictions, numpy-array of shape (m, 1)
    W -- weight matrix of the softmax layer, of shape (n_y, n_h)
    b -- bias of the softmax layer, of shape (n_y,)
    """
    
    np.random.seed(1)

    # Define number of training examples
    m = Y.shape[0]                          # number of training examples
    n_y = 2                                # number of classes  
    n_h = 50                                # dimensions of the GloVe vectors 
    
    # Initialize parameters using Xavier initialization
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,1))
    pred=np.zeros((m,1))
    
    # Convert Y to Y_onehot with n_y classes
    Y_oh=tf.one_hot(Y,n_y,dtype='int32')
    
    # Optimization loop
    for t in range(num_iterations):# Loop over the number of iterations
        print("Number of iterations",t)
        for i in range(m):          # Loop over the training examples
            
            ### START CODE HERE ### (≈ 4 lines of code)
            # Average the word vectors of the words from the i'th training example
            avg = sentence_to_avg(X[i], word_to_vec_map)
            avg=np.reshape(avg,(n_h,1))

            # Forward propagate the avg through the softmax layer
            z = np.dot(W,avg)+b
            a = softmax(z)

            # Compute cost using the i'th training label's one hot representation and "A" (the output of the softmax)
            cost =-np.dot(np.transpose(Y_oh[i]),np.log(a))
            ### END CODE HERE ###
            
            # Compute gradients 
            Y_oh_try=np.reshape(Y_oh[i],(n_y,1))
            dz = a - Y_oh_try
            dz=np.reshape(dz,(n_y,1))
            avg=np.reshape(avg,(1, n_h))
            dW = np.dot(dz,avg)
            db = dz

            # Update parameters with Stochastic Gradient Descent
            W = W - learning_rate * dW
            b = b - learning_rate * db
        
        if t % 100 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            print(Y.shape)
            pred = predict(X, Y, W, b, word_to_vec_map) #predict is defined in emo_utils.py
        
    return pred, W, b

# GloVe Model Training 

In [0]:
print(train["text"].shape)
print(Y_oh_train[0].shape)
X=train["text"]
n_y=2
n_h=50
W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
b = np.zeros((n_y,1))
avg = sentence_to_avg(X[0], word_to_vec_map)
avg=np.reshape(avg,(n_h,1))
# Forward propagate the avg through the softmax layer
z = np.dot(W,avg)+b
a = softmax(z)
print("shape of b",b.shape)
print("shape of W",W.shape)
print("shape of avg",avg.shape)
print("z shape",z.shape)
print()
cost =-np.dot(np.transpose(Y_oh_train[0]),np.log(a))
dz = a - Y_oh_train[0]
print("a shape",a.shape)
print("y_oh shape",train["target"].shape)
print("X_shape",train["text"].shape)
print("shape of dz",dz.shape)

(7613,)
(2,)
shape of b (2, 1)
shape of W (2, 50)
shape of avg (50, 1)
z shape (2, 1)

a shape (2, 1)
y_oh shape (7613,)
X_shape (7613,)
shape of dz (2, 2)


In [0]:
pred, W, b = model(train["text"], train["target"], word_to_vec_map)
print(pred)

Number of iterations 0
Epoch: 0 --- cost = [0.05533201]
(7613,)
Accuracy: 0.7166688559043741
Number of iterations 1
Number of iterations 2
Number of iterations 3
Number of iterations 4
Number of iterations 5
Number of iterations 6
Number of iterations 7
Number of iterations 8
Number of iterations 9
Number of iterations 10
Number of iterations 11
Number of iterations 12
Number of iterations 13
Number of iterations 14
Number of iterations 15
Number of iterations 16
Number of iterations 17
Number of iterations 18
Number of iterations 19
Number of iterations 20
Number of iterations 21
Number of iterations 22
Number of iterations 23
Number of iterations 24
Number of iterations 25
Number of iterations 26
Number of iterations 27
Number of iterations 28
Number of iterations 29
Number of iterations 30
Number of iterations 31
Number of iterations 32
Number of iterations 33
Number of iterations 34
Number of iterations 35
Number of iterations 36
Number of iterations 37
Number of iterations 38
Numb

In [0]:
def pre(X, W, b, word_to_vec_map):
    
    print(type(X))
    m=X.shape[0]
    
    pred=np.zeros((m,1))
    
    
    for j in range(m):                       # Loop over training examples
        
        # Split jth test example (sentence) into list of lower case words
        words = X[j].lower().split()
        
        avg = np.zeros((50,))
    
        total = 0
        for w in words:
            total += word_to_vec_map[w]
        if len(words):
            avg = total/len(words)
        

        # Forward propagation
        Z = np.dot(W, avg) + b
        A = softmax(Z)
        pred[j] = np.argmax(A)
    return pred

# GloVe Model on Kaggle Test Set

In [0]:
# %% [code]

submission.head()
# %% [code]
submission["target"]= pre(test["text"], W, b, word_to_vec_map)
submission["target"]=submission["target"].astype(int)

<class 'pandas.core.series.Series'>


In [0]:
submission.head()

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,1


In [0]:
submission.to_csv("Assignment 8 GloVe submission.csv",index=False)