(based on https://towardsdatascience.com/simple-bert-using-tensorflow-2-0-132cb19e9b22)

# Kaggle Disaster Tweets Challenge
## BERT Embeddings with TensorFlow 2.0 + Random Forest

https://www.kaggle.com/c/nlp-getting-started

With the release of TensorFlow 2.0, this notebook shows a simple way to use the BERT model.
- See BERT on paper: https://arxiv.org/pdf/1810.04805.pdf
- See BERT on GitHub: https://github.com/google-research/bert
- See BERT on TensorFlow Hub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1
- See 'old' use of BERT for comparison: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb

## Update TF
We need TensorFlow 2.0 and TensorFlow Hub 0.7 for this notebook.

In [3]:
#!pip install tensorflow==2.0
#!pip install tensorflow_hub==0.7
#!pip install bert-for-tf2
#!pip install sentencepiece
#!pip install pandas
#!pip install category_encoders==2.1.0
!which pip

/Users/ivp/dev/work/ai/bert/kaggle_disaster/venv_bert4/bin/pip


In [4]:
import tensorflow as tf
import tensorflow_hub as hub
print("TF version: ", tf.__version__)
print("Hub version: ", hub.__version__)

TF version:  2.0.0
Hub version:  0.7.0


If TensorFlow Hub 0.7 has not been released yet, use the nightly build:

In [3]:
### !pip install tf-hub-nightly

In [4]:
# hub.__version__

## Prepare BERT

In [5]:
import tensorflow_hub as hub
import tensorflow as tf
import bert
FullTokenizer = bert.bert_tokenization.FullTokenizer
from tensorflow.keras.models import Model       # Keras is the new high level API for TensorFlow
import math

Build the model with tf.keras and TensorFlow Hub, going from sentences to embeddings.

Inputs:
 - input token ids (tokenizer converts tokens using vocab file)
 - input masks (1 for useful tokens, 0 for padding)
 - segment ids (for two-text inputs: 0 for the first text, 1 for the second)

Outputs:
 - pooled_output of shape `[batch_size, 768]` with representations for the entire input sequences
 - sequence_output of shape `[batch_size, max_seq_length, 768]` with representations for each input token (in context)
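
As a rough illustration of the three inputs, the sketch below builds them by hand for a two-sentence input. `toy_vocab` is a made-up vocabulary (not the real BERT vocab file), and `max_seq_length` is shortened to 10 so the lists stay readable.

```python
# Toy illustration of the three BERT inputs; `toy_vocab` is a made-up vocabulary.
max_seq_length = 10

toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
             "fire": 3, "in": 4, "town": 5, "stay": 6, "safe": 7}

tokens = ["[CLS]", "fire", "in", "town", "[SEP]", "stay", "safe", "[SEP]"]
padding = max_seq_length - len(tokens)

# Token ids, padded with 0 up to max_seq_length
input_word_ids = [toy_vocab[t] for t in tokens] + [0] * padding

# 1 for real tokens, 0 for padding
input_mask = [1] * len(tokens) + [0] * padding

# 0 up to and including the first [SEP], 1 afterwards, 0 for padding
segment_ids = []
current_segment = 0
for t in tokens:
    segment_ids.append(current_segment)
    if t == "[SEP]":
        current_segment = 1
segment_ids += [0] * padding

print(input_word_ids)  # [1, 3, 4, 5, 2, 6, 7, 2, 0, 0]
print(input_mask)      # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(segment_ids)     # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
```

All three lists have length `max_seq_length`, which is what the `Input(shape=(max_seq_length,))` layers expect per example.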

In [6]:
max_seq_length = 128  # Your choice here.
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

In [7]:
model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=[pooled_output, sequence_output])

In [8]:
#model.save('modeldump.h5')
#tf.keras.experimental.export_saved_model(model, 'modeldump2.h5')

In [9]:
 #model2 = tf.keras.models.load_model('modeldump2.h5')

In [13]:
reloaded_model = tf.keras.experimental.load_from_saved_model('modeldump2.h5', custom_objects={'KerasLayer':hub.KerasLayer})
#print(reloaded_model.get_config())

#Get input shape from model.get_config()
#reloaded_model.build((None, 224, 224, 3))
#reloaded_model.summary()


Generate segment ids and masks as in the original BERT implementation

In [14]:
# See BERT paper: https://arxiv.org/pdf/1810.04805.pdf
# And BERT implementation convert_single_example() at https://github.com/google-research/bert/blob/master/run_classifier.py

def get_masks(tokens, max_seq_length):
    """Mask for padding"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1]*len(tokens) + [0] * (max_seq_length - len(tokens))


def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))


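def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab, padded with 0 up to max_seq_length"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return input_ids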
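
The helpers above raise `IndexError` when a text produces more than `max_seq_length` tokens. A minimal sketch of one way to avoid that, assuming single-text inputs and `max_seq_length >= 2`: truncate the word pieces before adding `[CLS]` and `[SEP]`. The names `truncate_tokens` and `add_special_tokens` are hypothetical helpers, not part of the notebook.

```python
def truncate_tokens(tokens, max_seq_length):
    """Keep at most max_seq_length - 2 word pieces, leaving room for [CLS] and [SEP]."""
    return tokens[: max_seq_length - 2]


def add_special_tokens(tokens, max_seq_length):
    """Truncate first, then wrap in [CLS] ... [SEP], so the result never exceeds the limit."""
    return ["[CLS]"] + truncate_tokens(tokens, max_seq_length) + ["[SEP]"]


word_pieces = ["flood", "##ing", "in", "the", "city", "center"]
wrapped = add_special_tokens(word_pieces, max_seq_length=6)
print(wrapped)  # ['[CLS]', 'flood', '##ing', 'in', 'the', '[SEP]']
```

Truncation drops the tail of long texts; for tweets with `max_seq_length = 128` this should rarely matter.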

Import tokenizer using the original vocab file

In [15]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)
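
To give an idea of what the tokenizer does with the vocab file, here is a simplified sketch of greedy longest-match-first WordPiece splitting for a single lowercase word. It is an illustration under a made-up `toy_vocab`, not the `FullTokenizer` code itself (which also handles text cleaning, punctuation and casing).

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until it is in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"earth", "##quake", "##s", "fire"}
print(wordpiece("earthquakes", toy_vocab))  # ['earth', '##quake', '##s']
print(wordpiece("tsunami", toy_vocab))      # ['[UNK]']
```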

## Read data

In [16]:
import pandas as pd

In [17]:
train = pd.read_csv("./data/input/train.csv")

In [18]:
def tf_idfy():
    pass

## Explore data

In [108]:
from itertools import chain

def freq(lst):
    d = {}
    for i in lst:
        if d.get(i):
            d[i] += 1
        else:
            d[i] = 1
    return d

def series_to_list(series):
    return list(chain.from_iterable(series.values))

#len(set(tags_lowercased))
## sorted(list(tags))
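
The same counting can be sketched with `collections.Counter` from the standard library, which also gives the most frequent values directly. `freq_counter` is a hypothetical name and the location values are made up for the example.

```python
from collections import Counter


def freq_counter(items):
    """Count occurrences of each item; same counts as freq() for hashable items."""
    return Counter(items)


sample_locations = ["USA", "London", "USA", "Mumbai", "USA", "London"]
counts = freq_counter(sample_locations)
print(counts.most_common(2))  # [('USA', 3), ('London', 2)]
```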

In [110]:
#tags_list = series_to_list(train["hashtags"])
#tags_lowercased = list(map(lambda tag: tag.lower(), tags_list))

#freq(tags_lowercased)

In [51]:
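# Missing locations are read as NaN, which is a float, so m marks rows without a location
m = train["location"].apply(lambda x: isinstance(x, float))

len(train[~m])
locations  = list(train[~m]["location"])


freq(locations)
set(locations)


# add_location_columns is not defined in this notebook, so these calls stay commented out
#ddd = add_location_columns(train[~m])
#ddd1 = add_location_columns(train[m])
#train[~m]
#train[m]["location"]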

## Feature engineering

In [19]:
def tokenize(text):
    tokens = tokenizer.tokenize(text)
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    return tokens


In [20]:
def process_tokens(text):
    #TODO tags to separate column
    #TODO strip hash from tags in text
    return text

In [21]:
def vectorize(text):
    """Return the pooled BERT embedding (768 values) for one text."""
    tokens = tokenize(text)

    input_ids = get_ids(tokens, tokenizer, max_seq_length)
    input_masks = get_masks(tokens, max_seq_length)
    input_segments = get_segments(tokens, max_seq_length)

    # Each input is wrapped in a list to form a batch of size 1
    pool_embs, all_embs = model.predict([[input_ids],[input_masks],[input_segments]])

    return pool_embs[0]

In [22]:
import re

def extract_hashtags(text):
    pattern = r"#\S*"  # raw string, so \S reaches the regex engine unchanged
    return re.findall(pattern, text)
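
The TODOs in `process_tokens` (tags to a separate column, strip the hash from tags in the text) could be sketched as below. Note this uses a stricter pattern, `#(\w+)`, than `extract_hashtags`, so a lone `#` or trailing punctuation is not captured. Both function names are hypothetical and the tweet is a made-up example.

```python
import re

HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtag_words(text):
    """Return hashtag bodies, lowercased and without the leading '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]


def strip_hash_signs(text):
    """Remove the '#' but keep the word, so the tokenizer sees plain text."""
    return HASHTAG_PATTERN.sub(r"\1", text)


tweet = "Forest #Fire near La Ronge, stay safe! #wildfire"
print(extract_hashtag_words(tweet))  # ['fire', 'wildfire']
print(strip_hash_signs(tweet))       # Forest Fire near La Ronge, stay safe! wildfire
```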

In [23]:
import time

def vectorize_text_in_daraframe(df, text_column, result_column):
    """Add result_column with the BERT embedding of each text and report the elapsed time."""
    start = time.time()
    print("started text vectorization")

    df[result_column] = df[text_column].map(vectorize)

    end = time.time()
    print("ended text vectorization")

    print("time elapsed")
    print(str(round(end - start)) + " seconds")

    return df
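
Mapping `vectorize` over the column calls `model.predict` once per text. A possible alternative is to encode the texts first and predict in batches. The sketch below assumes NumPy is available; `encode_fn` and `predict_fn` are placeholders for the notebook's own helpers and `model.predict`, and the `fake_*` functions exist only so the example runs without TensorFlow.

```python
import numpy as np


def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def vectorize_in_batches(texts, encode_fn, predict_fn, batch_size=32):
    """Call predict_fn once per batch instead of once per text.

    encode_fn(text) -> (input_ids, input_mask, segment_ids), three equal-length lists.
    predict_fn([ids, masks, segments]) -> (pooled_output, sequence_output).
    """
    pooled = []
    for batch in batches(list(texts), batch_size):
        encoded = [encode_fn(text) for text in batch]
        # Regroup per-text triples into three 2-D arrays of shape (batch, seq_len)
        ids, masks, segments = (np.array(part, dtype=np.int32) for part in zip(*encoded))
        pooled_batch, _ = predict_fn([ids, masks, segments])
        pooled.append(pooled_batch)
    return np.concatenate(pooled, axis=0)


# Stand-ins so the sketch runs without TensorFlow (not real BERT behaviour)
def fake_encode(text, length=4):
    ids = [len(word) for word in text.split()][:length]
    pad = length - len(ids)
    return ids + [0] * pad, [1] * len(ids) + [0] * pad, [0] * length


def fake_predict(inputs):
    ids, masks, segments = inputs
    return ids.sum(axis=1, keepdims=True).astype(np.float32), ids


embeddings = vectorize_in_batches(["a bb", "ccc", "dd e f"], fake_encode, fake_predict, batch_size=2)
print(embeddings.shape)  # (3, 1)
```

In the notebook, `predict_fn` would be `model.predict` and `encode_fn` would return the outputs of `get_ids`, `get_masks` and `get_segments` for one text; whether this is faster in practice depends on the hardware and batch size.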

#### Encode categorical column

In [24]:
import category_encoders as ce
import math
import pandas as pd

#TODO reviews.region_2.fillna("Unknown")

def fill_na(df, colname):
    return df[[colname]].fillna("Unknown")[colname]


def encode_categorical_column(df, colname, encoder=None):
    name_no_nan = colname + '_no_nan'
    df[name_no_nan] = fill_na(df, colname) #df[colname].map(lambda x: x if type(x)=="str" else "NA")
    if encoder:
        hash_columns = encoder.transform(df[name_no_nan])
    else:    
        encoder = ce.HashingEncoder(cols = [name_no_nan], verbose=1)
        hash_columns = encoder.fit_transform(df[name_no_nan], df['target'])
    
    hash_columns.index = df.index
    #print(hash_columns)
    hash_col_names = list(map(lambda col_name: colname + '_' + col_name, list(hash_columns.columns.values)))
    hash_columns.columns = hash_col_names
    #print(hash_columns)
    #return df.join(hash_columns)
    return pd.concat([df,hash_columns], axis=1), encoder

#def encode_categorical_columns(df, colnames):
#    res = df
#    for colname in colnames:
#        res = encode_categorical_column(res, colname)
#    
#    return res
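
For intuition, the hashing trick behind this kind of encoder can be sketched in plain Python: each category string is hashed and the digest picks one of a fixed number of output columns. This is a simplified illustration, not the `category_encoders` implementation; the choice of MD5 and 8 buckets is an assumption made to match the eight `col_0` ... `col_7` columns used later.

```python
import hashlib


def hash_bucket(value, n_components=8):
    """Map a string to one of n_components buckets via an MD5 digest."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components


def hash_encode(value, n_components=8):
    """One encoded row: a 1 in the value's bucket, 0 elsewhere."""
    row = [0] * n_components
    row[hash_bucket(value, n_components)] = 1
    return row


for location in ["London", "Unknown", "New York"]:
    print(location, hash_encode(location))
```

Because the number of buckets is fixed, unseen categories in the test set still get a valid row, at the cost of possible collisions between different categories.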
        
        

#### Prepare hashtags

In [25]:
train["hashtags"] = train["text"].map(extract_hashtags)

#train["hashtags_vector"] = tf_idfy(train["hashtags"])

#### Sample

In [26]:
train_small = train.sample(2000)
#train_small = train.sample(50)

### Prepare features

In [27]:
import numpy

def append_col_to_np_array(df, col, arr):
    col_arr = df[col].to_numpy()
    col_arr1 = col_arr.reshape(col_arr.shape[0],1)
    
    res = numpy.append(arr, col_arr1, axis=1)
    return res
    
def append_cols_to_np_array(df, cols, arr):
    res = arr
    for col in cols:
        res = append_col_to_np_array(df, col, res)
    
    return res
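
The column-by-column append above copies the array once per column. A minimal sketch of doing it in one step with `np.column_stack` and `np.hstack`; `append_columns` is a hypothetical helper and the arrays are toy data.

```python
import numpy as np


def append_columns(arr, columns):
    """Append several 1-D columns to a 2-D array in a single concatenation."""
    extra = np.column_stack(columns)  # shape: (n_rows, len(columns))
    return np.hstack([arr, extra])


toy_embeddings = np.zeros((3, 4))
col_a = np.array([1, 0, 1])
col_b = np.array([0, 1, 0])

features = append_columns(toy_embeddings, [col_a, col_b])
print(features.shape)  # (3, 6)
```

With a DataFrame, the columns could be passed as `[df[c].to_numpy() for c in cols]`.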

In [28]:
def copy_column(df_from, df_to, colname):
    df_to[colname] = df_from[colname]
    return df_to
    

In [29]:
def to_feature_array(df):
    x = numpy.stack(df["vectorized"], axis=0)
    
    x = append_cols_to_np_array(df, ['location_col_0', 'location_col_1', 'location_col_2',
       'location_col_3', 'location_col_4', 'location_col_5', 'location_col_6',
       'location_col_7'], x)
    
    x = append_cols_to_np_array(df, ['keyword_col_0', 'keyword_col_1',
       'keyword_col_2', 'keyword_col_3', 'keyword_col_4', 'keyword_col_5',
       'keyword_col_6', 'keyword_col_7'], x)
    
    return x

class FeaturePreparationResult:
    def __init__(self, df, encoder_location, encoder_keyword):
        self.df = df
        self.encoder_location = encoder_location
        self.encoder_keyword = encoder_keyword

def prepare_features(df, encoder_location=None, encoder_keyword=None):
    print("preparing location columns")
    df1, encoder_location = encode_categorical_column(df, 'location', encoder_location)
    df2, encoder_keyword = encode_categorical_column(df1, 'keyword', encoder_keyword)
    
    print("vectorizing text")
    df3 = vectorize_text_in_daraframe(df2, "text", "vectorized")
    
    return FeaturePreparationResult(df3, encoder_location, encoder_keyword)

In [30]:
feat_prep_result = prepare_features(train_small)

train_small_prepared = feat_prep_result.df
encoder_location = feat_prep_result.encoder_location
encoder_keyword = feat_prep_result.encoder_keyword

preparing location columns
vectorizing text
started text vectorization
ended text vectorization
time elapsed
1199 seconds


In [31]:
X = to_feature_array(train_small_prepared)

In [32]:
train_small_prepared.to_csv("./data/work/trained-vectorized.csv")

In [33]:
y = train_small_prepared["target"].to_numpy()

## Train classifier

In [30]:
!pip install scikit-learn


In [34]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

In [35]:
np.random.seed(0)

In [36]:
clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=42) #TODO params

In [37]:
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [38]:
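from sklearn.metrics import classification_report

# Evaluated on the same rows the model was trained on, so the perfect scores
# below reflect memorization of the training data, not generalization.
print(classification_report(y, clf.predict(X)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1154
           1       1.00      1.00      1.00       846

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000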
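
Since the report above is computed on the training rows, a held-out split gives a more honest estimate. The sketch below uses synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`; the split ratio and the smaller forest are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's feature matrix X and labels y
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)

# Hold out 25% of the rows; stratify keeps the class balance in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
clf_demo.fit(X_train, y_train)

# Score only on rows the model has not seen during fitting
val_predictions = clf_demo.predict(X_val)
val_f1 = f1_score(y_val, val_predictions)
print(classification_report(y_val, val_predictions))
```

With the real data, `X_demo, y_demo` would be replaced by `X, y` before fitting `clf`.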

## Prepare test data

In [52]:
test = pd.read_csv("./data/input/test.csv")

In [40]:
len(test)

3263

In [41]:
feat_prep_result_test = prepare_features(test, encoder_location, encoder_keyword)


preparing location columns
vectorizing text
started text vectorization
ended text vectorization
time elapsed
1797 seconds


In [43]:
test_prepared = feat_prep_result_test.df

In [44]:
X_test = to_feature_array(test_prepared)

In [45]:
test_prepared.to_csv("./data/work/test-vectorized.csv")

In [46]:
len(X_test)

3263

## Predict 

In [47]:
#X_to_predict = test["vectorized"]
#X_to_predict = np.vstack(X_to_predict)

In [48]:
#X_to_predict

In [49]:
predicted = clf.predict(X_test) 

In [50]:
len(X_test)

3263

In [51]:
len(test)

2000

In [53]:
test["target"] = predicted

In [54]:
(test[["id", "target"]]).to_csv("./data/output/result.csv", index=False)

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,1
4,11,1
...,...,...
3258,10861,0
3259,10865,1
3260,10868,1
3261,10874,1
