(based on https://towardsdatascience.com/simple-bert-using-tensorflow-2-0-132cb19e9b22)

# Kaggle Disaster Tweets Challenge
## BERT Embeddings with TensorFlow 2.0 + Random Forest

https://www.kaggle.com/c/nlp-getting-started

This notebook shows a simple use of the BERT model with TensorFlow 2.0: BERT produces an embedding for each tweet, and a random forest classifies the embeddings.
- See BERT on paper: https://arxiv.org/pdf/1810.04805.pdf
- See BERT on GitHub: https://github.com/google-research/bert
- See BERT on TensorFlow Hub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1
- See 'old' use of BERT for comparison: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb

## Update TF
This notebook needs TensorFlow 2.0 and TensorFlow Hub 0.7.

In [1]:
#!pip install tensorflow==2.0
#!pip install tensorflow_hub==0.7
#!pip install bert-for-tf2
#!pip install sentencepiece
#!pip install pandas

In [2]:
import tensorflow as tf
import tensorflow_hub as hub
print("TF version: ", tf.__version__)
print("Hub version: ", hub.__version__)

TF version:  2.0.0
Hub version:  0.7.0


If TensorFlow Hub 0.7 has not been released yet, use the nightly build:

In [3]:
#!pip install tf-hub-nightly

In [4]:
# hub.__version__

## Import modules

In [5]:
import tensorflow_hub as hub
import tensorflow as tf
import bert
FullTokenizer = bert.bert_tokenization.FullTokenizer
from tensorflow.keras.models import Model       # Keras is the new high level API for TensorFlow
import math

Build the model with tf.keras and TensorFlow Hub, going from sentences to embeddings.

Inputs:
 - input token ids (the tokenizer converts tokens to ids using the vocab file)
 - input masks (1 for real tokens, 0 for padding)
 - segment ids (for two-text inputs: 0 for the first text, 1 for the second)

Outputs:
 - pooled_output of shape `[batch_size, 768]` with representations for the entire input sequences
 - sequence_output of shape `[batch_size, max_seq_length, 768]` with representations for each input token (in context)
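
As a rough illustration of the three inputs listed above, the sketch below builds them by hand for a sentence pair. The vocabulary, the ids, and the `toy_` names are made up for the example; real ids come from the BERT vocab file.

```python
# Toy illustration of the three BERT inputs for a sentence pair.
# The vocabulary is hypothetical; real ids come from the BERT vocab file.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
             "forest": 3, "fire": 4, "near": 5, "town": 6}
toy_max_len = 10

toy_tokens = ["[CLS]", "forest", "fire", "[SEP]", "near", "town", "[SEP]"]
toy_padding = toy_max_len - len(toy_tokens)

# Token ids, padded with 0 up to the maximum length.
toy_ids = [toy_vocab[t] for t in toy_tokens] + [0] * toy_padding

# Mask: 1 for real tokens, 0 for padding.
toy_mask = [1] * len(toy_tokens) + [0] * toy_padding

# Segment ids: 0 up to and including the first [SEP], 1 afterwards.
toy_segments = []
current_segment = 0
for t in toy_tokens:
    toy_segments.append(current_segment)
    if t == "[SEP]":
        current_segment = 1
toy_segments += [0] * toy_padding

print(toy_ids)       # [1, 3, 4, 2, 5, 6, 2, 0, 0, 0]
print(toy_mask)      # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(toy_segments)  # [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
```

For the single-text tweets used in this notebook, the segment ids are all 0.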

In [6]:
max_seq_length = 128  # Your choice here.
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

In [7]:
model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=[pooled_output, sequence_output])

In [9]:
#model.save('modeldump.h5')
#tf.keras.experimental.export_saved_model(model, 'modeldump2.h5')

In [15]:
 #model2 = tf.keras.models.load_model('modeldump2.h5')

In [69]:
reloaded_model = tf.keras.experimental.load_from_saved_model('modeldump2.h5', custom_objects={'KerasLayer':hub.KerasLayer})
#print(reloaded_model.get_config())

#Get input shape from model.get_config()
#reloaded_model.build((None, 224, 224, 3))
#reloaded_model.summary()

In [13]:
#reloaded_model.build((None, 224, 224, 3))
reloaded_model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 128)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

Generate segment ids and masks following the original BERT implementation.

In [14]:
# See BERT paper: https://arxiv.org/pdf/1810.04805.pdf
# And BERT implementation convert_single_example() at https://github.com/google-research/bert/blob/master/run_classifier.py

def get_masks(tokens, max_seq_length):
    """Mask for padding"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1]*len(tokens) + [0] * (max_seq_length - len(tokens))


def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))


def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return input_ids
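
The helpers above raise an `IndexError` when a text is longer than `max_seq_length`. One way to avoid that, in the spirit of the truncation done in `convert_single_example()`, is to cut the word pieces before adding the special tokens. The helper name `truncate_tokens` below is hypothetical and not part of the notebook; treat it as a minimal sketch.

```python
def truncate_tokens(word_pieces, max_seq_length):
    """Keep at most max_seq_length - 2 word pieces, then add [CLS] and [SEP].

    `word_pieces` is the tokenizer output WITHOUT the special tokens.
    """
    if max_seq_length < 2:
        raise ValueError("max_seq_length must leave room for [CLS] and [SEP]")
    kept = word_pieces[: max_seq_length - 2]
    return ["[CLS]"] + kept + ["[SEP]"]


# A text with 8 pieces squeezed into a sequence length of 5.
print(truncate_tokens(list("abcdefgh"), 5))
# ['[CLS]', 'a', 'b', 'c', '[SEP]']
```

Tweets rarely reach 128 word pieces, so this mostly matters if `max_seq_length` is lowered.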

Create the tokenizer from the original vocab file

In [15]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)
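
`FullTokenizer` first splits text into words and then splits each word into word pieces from the vocab file. The sketch below shows only the greedy longest-match-first idea behind the word-piece step, with a made-up vocabulary; it is a simplified illustration, not the library's code.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, for illustration only.
wp_vocab = {"wild", "fire", "##fire", "##s"}
print(wordpiece("wildfires", wp_vocab))  # ['wild', '##fire', '##s']
print(wordpiece("quake", wp_vocab))      # ['[UNK]']
```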

## Prepare data

In [16]:
import pandas as pd

In [17]:
train = pd.read_csv("./data/input/train.csv")

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [20]:
def tokenize(text):
    tokens = tokenizer.tokenize(text)
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    return tokens


In [21]:
def process_tokens(text):
    #TODO tags to separate column
    #TODO strip hash from tags in text
    return text

In [22]:
def vectorize(text):
    """Return the pooled BERT embedding (768 floats) for a single text."""
    tokens = tokenize(text)  # adds [CLS] and [SEP]

    input_ids = get_ids(tokens, tokenizer, max_seq_length)
    input_masks = get_masks(tokens, max_seq_length)
    input_segments = get_segments(tokens, max_seq_length)

    # Batch of one: each input has shape (1, max_seq_length)
    pool_embs, all_embs = model.predict([[input_ids],[input_masks],[input_segments]])

    return pool_embs[0]

In [None]:
import time

def vectorize_column(df, text_column, result_column):
    """Vectorize df[text_column] into df[result_column] and print the elapsed time."""
    start = time.time()
    print("started")
    print(start)

    df[result_column] = df[text_column].map(vectorize)

    end = time.time()
    print("ended")
    print(end)

    print("time elapsed")
    print(end - start)

In [23]:
#train["tokenized"] = train["text"].map(tokenize)
train_small = train.sample(2000)

In [24]:


start = time.time()
print("started")
print(start)

train_small["vectorized"] = train_small["text"].map(vectorize)

end = time.time()
print("ended")
print(end)

print("time elapsed")
print(end - start)

started
1578513534.959518
ended
1578514365.737824
time elapsed
830.7783060073853
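
The roughly 830 seconds above were spent calling `model.predict` once per tweet. A likely faster approach is to pad all texts into batched arrays and call `predict` once; the speed-up has not been measured here. The helper `build_batch` and the ids in the sketch are hypothetical stand-ins for the output of `tokenizer.convert_tokens_to_ids`.

```python
import numpy as np


def build_batch(token_id_lists, max_seq_length):
    """Pad variable-length id lists into (n, max_seq_length) int32 arrays."""
    n = len(token_id_lists)
    ids = np.zeros((n, max_seq_length), dtype=np.int32)
    masks = np.zeros((n, max_seq_length), dtype=np.int32)
    for row, token_ids in enumerate(token_id_lists):
        if len(token_ids) > max_seq_length:
            raise IndexError("Token length more than max seq length!")
        ids[row, : len(token_ids)] = token_ids
        masks[row, : len(token_ids)] = 1
    # Single-text inputs: every segment id is 0.
    segments = np.zeros_like(ids)
    return ids, masks, segments


# Made-up ids for two short texts.
batch_ids, batch_masks, batch_segments = build_batch(
    [[101, 7, 8, 102], [101, 9, 102]], max_seq_length=6
)
print(batch_ids.shape)  # (2, 6)

# With the notebook's model, one call would then cover the whole batch:
# pool_embs, _ = model.predict([batch_ids, batch_masks, batch_segments], batch_size=32)
```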


In [27]:
train_small.to_csv("./data/work/trained-vectorized.csv")

In [42]:
train_small2 = pd.read_csv("./data/work/trained-vectorized.csv")

In [25]:
train_small

Unnamed: 0,id,keyword,location,text,target,vectorized
3637,5186,fatalities,,-??-\n; kitana\n? her fatalities slay me\nÛÓk...,1,"[-0.807587, -0.5216827, -0.8024632, 0.69634426..."
1137,1638,bombing,,The only country claiming the moral high groun...,1,"[-0.8450406, -0.6705742, -0.9464853, 0.7645291..."
5088,7255,nuclear%20disaster,Marbella. Spain,http://t.co/GaM7otGISw\nANOTHER DISASTER WAITI...,1,"[-0.7676656, -0.44996068, -0.8376965, 0.653483..."
611,882,bioterrorism,OES 4th Point. sisSTAR & TI,I liked a @YouTube video http://t.co/XO2ZbPBJB...,1,"[-0.5953094, -0.39899623, -0.8330214, 0.410473..."
6550,9373,survived,,Well me and dad survived my driving ????????,0,"[-0.8154489, -0.5718806, -0.94418716, 0.745205..."
...,...,...,...,...,...,...
5145,7337,nuclear%20reactor,,Salem 2 nuclear reactor shut down over electri...,1,"[-0.8021079, -0.5850645, -0.96756816, 0.676609..."
5635,8036,refugees,,./.....hmm 12000 Nigerian refugees repatriated...,1,"[-0.85007364, -0.56564945, -0.9505792, 0.77903..."
3023,4341,dust%20storm,"Atlanta, GA",@deadlydemi even staying up all night to he ba...,1,"[-0.39099488, -0.285781, -0.84145904, 0.274905..."
855,1237,blood,,A friend is like blood they are not beside us ...,0,"[-0.9279789, -0.540044, -0.9374038, 0.8930397,..."


In [53]:
train_small2["vectorized"][0:2]

0    [-0.807587   -0.5216827  -0.8024632   0.696344...
1    [-0.8450406  -0.6705742  -0.9464853   0.764529...
Name: vectorized, dtype: object
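
As the `dtype: object` output shows, the vectors come back from the CSV file as strings rather than arrays. A minimal sketch of turning such a string back into numbers is shown below; `parse_vector` is a hypothetical helper, and it assumes the string is the plain NumPy print format with no `...` elision.

```python
import numpy as np


def parse_vector(text):
    """Turn a string like '[-0.8  0.5\n  0.1]' back into a float32 array."""
    numbers = text.strip().strip("[]").split()
    return np.array([float(x) for x in numbers], dtype=np.float32)


# Round trip on a small example vector.
original = np.array([-0.807587, -0.5216827, 0.69634426], dtype=np.float32)
restored = parse_vector(str(original))
print(np.allclose(original, restored))

# Applied to the reloaded frame, this would look like:
# train_small2["vectorized"] = train_small2["vectorized"].map(parse_vector)
```

Saving the stacked array with `np.save` instead of writing it into a CSV column would avoid the string round trip altogether.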

In [28]:
X = train_small["vectorized"]

In [29]:
y = train_small["target"]

## Train classifier

In [30]:
!pip install scikit-learn


In [31]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

In [32]:
np.random.seed(0)

In [33]:
clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=42) #TODO params

In [34]:
num_features = len(X.values[0])
num_observations = len(X.values)

In [35]:
X1 = np.vstack(X.values)

In [36]:
y1 = np.array(y)

In [37]:
clf.fit(X1, y1)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [27]:
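from sklearn.metrics import classification_report

# Note: this report is computed on the same rows the classifier was fitted on,
# so it says nothing about performance on unseen tweets.
classification_report(y1, clf.predict(X1))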

'              precision    recall  f1-score   support\n\n           0       1.00      1.00      1.00        24\n           1       1.00      1.00      1.00        26\n\n    accuracy                           1.00        50\n   macro avg       1.00      1.00      1.00        50\nweighted avg       1.00      1.00      1.00        50\n'
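
The perfect scores above come from predicting the rows the forest was trained on. To estimate performance on unseen tweets, part of the data can be held out before fitting. The sketch below uses synthetic data in place of `X1` and `y1`, and the `_demo` names are hypothetical; no results for the real data are implied.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X1 (embeddings) and y1 (labels); not the real data.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 16))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Hold out 20% of the rows, keeping the class balance in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=42)
demo_clf.fit(X_fit, y_fit)

# Scores on rows the forest never saw during fitting.
print(classification_report(y_val, demo_clf.predict(X_val)))
```

In the notebook, the same split would be applied to `X1` and `y1` before `clf.fit`.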

## Prepare test data

In [None]:
test = pd.read_csv("./data/input/test.csv")
test.head()

In [39]:
len(test)

3263

In [40]:
test["vectorized"] = test["text"].map(vectorize)

In [41]:
test.to_csv("./data/work/test-vectorized.csv")

## Predict 

In [54]:
X_to_predict = test["vectorized"]
X_to_predict = np.vstack(X_to_predict)

In [55]:
X_to_predict

array([[-0.88337725, -0.43604568, -0.7839005 , ..., -0.62412053,
        -0.6474962 ,  0.8741402 ],
       [-0.9134144 , -0.7204528 , -0.9842551 , ..., -0.91506916,
        -0.79635906,  0.8841588 ],
       [-0.8509606 , -0.54145586, -0.97840333, ..., -0.7116491 ,
        -0.8331551 ,  0.9537944 ],
       ...,
       [-0.86718804, -0.5773119 , -0.96989584, ..., -0.7826181 ,
        -0.79885864,  0.78483367],
       [-0.82174265, -0.52890515, -0.9784757 , ..., -0.8178168 ,
        -0.7279344 ,  0.75817096],
       [-0.7993479 , -0.5381027 , -0.97783107, ..., -0.82674366,
        -0.70411426,  0.7400354 ]], dtype=float32)

In [56]:
predicted = clf.predict(X_to_predict) 

In [59]:
len(predicted)

3263

In [60]:
len(test)

3263

In [64]:
test["target"] = predicted

In [68]:
(test[["id", "target"]]).to_csv("./data/output/result.csv", index=False)

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,1
4,11,1
...,...,...
3258,10861,0
3259,10865,1
3260,10868,1
3261,10874,1
