**Twitter Disaster Prediction** <br>

Submitted by <br>
* Haigal Harrison - C0816642



A BERT model (large, uncased) loaded from TensorFlow Hub is used to predict whether a tweet is about a real disaster or not.

In [None]:
# Importing the required libraries
import pandas as pd
import numpy as np
from google.colab import drive

In [None]:
# Mounting Drive
drive.mount('/content/drive/',force_remount=True)

Mounted at /content/drive/


In [None]:
# importing the required tensorflow libraries
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub

In [None]:
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [None]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

In [None]:
import tokenization

In [None]:
# a seed is set for better reproducibility
SEED = 1002
def seed_everything(seed):
    np.random.seed(seed)
    tf.random.set_seed(seed) 
    
seed_everything(SEED)

In [None]:
# reading the training data from the google drive using pandas
train_df = pd.read_csv('/content/drive/My Drive/My data files/twitter_disaster_data.csv')
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


BERT tokenization is used to tokenize the tweets, which were collected via the Twitter API. It consists of the following three steps:

* Text Normalization - The text is converted to lowercase, all whitespace characters are converted to spaces, and accent markers are removed.

* Punctuation Splitting - A space is added on each side of every punctuation mark.

* WordPiece Tokenization - The text is split on whitespace, and WordPiece tokenization is applied to every resulting word.

Finally, the token sequence is truncated to the maximum length we provide, and the [CLS] and [SEP] tags are added: [CLS] marks the start and [SEP] marks the end of each sequence. <br>
The tokenizer is then used to convert the tokens to integers. An input mask and segment ids are also created.
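The three steps above can be illustrated with a minimal pure-Python sketch. This is not the code in `tokenization.py`: the vocabulary `TOY_VOCAB` and the helper names `normalize`, `split_punctuation`, `wordpiece` and `toy_tokenize` are hypothetical, and the punctuation check is simplified to Unicode "P" categories only.

```python
import unicodedata

# Hypothetical toy vocabulary; a real BERT vocabulary is far larger.
TOY_VOCAB = {"[UNK]", "forest", "fire", "near", "evac", "##uation", "##s", "!"}


def normalize(text):
    """Lowercase, strip accent marks, and collapse whitespace runs to single spaces."""
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return " ".join(text.split())


def split_punctuation(text):
    """Put a space on each side of every punctuation character."""
    out = []
    for ch in text:
        if unicodedata.category(ch).startswith("P"):
            out.append(f" {ch} ")
        else:
            out.append(ch)
    return "".join(out)


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Run normalization, punctuation splitting, then WordPiece on each word."""
    words = split_punctuation(normalize(text)).split()
    return [p for w in words for p in wordpiece(w, vocab)]


tokens = toy_tokenize("Forest  FIRE near évacuations!")
print(tokens)
```

With this toy vocabulary the call yields `['forest', 'fire', 'near', 'evac', '##uation', '##s', '!']`: the accent and extra spaces disappear, the exclamation mark becomes its own token, and the unseen word is broken into known pieces.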

In [None]:
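def encode_bert(txt, tokenizer, max_length=512):
  # Encode each text into the three fixed-length arrays the BERT layer expects
  tokens = []
  masks = []
  segments = []
  for t in txt:
    t = tokenizer.tokenize(t)
    # Truncate so that [CLS] and [SEP] still fit within max_length
    t = t[:max_length-2]
    inp_seq = ["[CLS]"] + t + ["[SEP]"]
    pad_length = max_length - len(inp_seq)

    # Convert tokens to ids and pad with 0 up to max_length
    token = tokenizer.convert_tokens_to_ids(inp_seq)
    token += [0] * pad_length
    # 1 marks a real token, 0 marks padding
    pad_masks = [1] * len(inp_seq) + [0] * pad_length
    # Each tweet is a single segment, so every segment id is 0
    seg_ids = [0] * max_length

    tokens.append(token)
    masks.append(pad_masks)
    segments.append(seg_ids)

  return np.array(tokens), np.array(masks), np.array(segments)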

Plain text input is converted to the input format expected by the BERT model. <br>

* input_word_ids : Maps each token to its corresponding id in the vocabulary.
* input_mask : Marks which positions hold real tokens. Non-padding tokens are given a value of 1 and padding tokens are given a value of 0.
* segment_ids : Identifies which segment (sentence) each token belongs to. Each tweet is a single segment here, so all values are 0.
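As a small worked example of these three inputs, the sketch below pads one already-tokenized text to a fixed length. The helper `build_inputs` is hypothetical, the token ids 7, 8, 9 are made up, and the ids 101, 102 and 0 for [CLS], [SEP] and padding are assumptions that should be checked against the vocab file actually in use.

```python
def build_inputs(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Return (input_word_ids, input_mask, segment_ids) for one single-segment text."""
    body = token_ids[: max_length - 2]      # leave room for [CLS] and [SEP]
    ids = [cls_id] + body + [sep_id]
    pad = max_length - len(ids)
    input_word_ids = ids + [pad_id] * pad   # token ids, padded on the right
    input_mask = [1] * len(ids) + [0] * pad  # 1 = real token, 0 = padding
    segment_ids = [0] * max_length          # one segment only, so all zeros
    return input_word_ids, input_mask, segment_ids


# Made-up token ids for a three-token text, padded to length 8
ids, mask, segs = build_inputs([7, 8, 9], max_length=8)
print(ids)
print(mask)
print(segs)
```

Here `ids` is `[101, 7, 8, 9, 102, 0, 0, 0]`, `mask` is `[1, 1, 1, 1, 1, 0, 0, 0]` and `segs` is eight zeros, which mirrors what `encode_bert` produces per tweet.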

In [None]:
def model_build(bert_layer, max_length=512):
  # Three inputs, in the format produced by encode_bert
  inp_word_ids = Input(shape=(max_length,), dtype=tf.int32, name="inp_word_ids")
  inp_mask = Input(shape=(max_length,), dtype=tf.int32, name="inp_mask")
  seg_ids = Input(shape=(max_length,), dtype=tf.int32, name="seg_ids")

  # The layer returns (pooled_output, sequence_output); only the sequence output is used
  _, sequence_output = bert_layer([inp_word_ids, inp_mask, seg_ids])
  # Take the vector at position 0, i.e. the [CLS] token
  clf_output = sequence_output[:, 0, :]
  # Single sigmoid unit for binary classification
  out = Dense(1, activation='sigmoid')(clf_output)

  model = Model(inputs=[inp_word_ids, inp_mask, seg_ids], outputs=out)

  model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

  return model
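The classification head in `model_build` is a single dense unit with a sigmoid activation, trained with binary cross-entropy. A rough pure-Python sketch of that arithmetic follows. The three-dimensional `cls_vector`, the `weights` and the `bias` are invented for illustration (the real [CLS] vector has 1024 dimensions), and the 0.5 decision threshold is an assumption, matching the default used by Keras' `accuracy` metric for binary outputs.

```python
import math


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_sigmoid(features, weights, bias):
    """One dense unit: weighted sum plus bias, followed by a sigmoid."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(z)


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# Invented values standing in for the [CLS] vector and the learned parameters
cls_vector = [0.5, -1.0, 2.0]
weights = [1.0, 0.5, 0.25]
bias = 0.0

prob = dense_sigmoid(cls_vector, weights, bias)  # probability of "real disaster"
label = int(prob >= 0.5)                          # assumed 0.5 threshold
print(round(prob, 4), label)
print(round(binary_crossentropy(1, prob), 4), round(binary_crossentropy(0, prob), 4))
```

The loss is smaller when the true label agrees with the side of 0.5 the probability falls on, which is what pushes the fine-tuned weights toward correct predictions.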


The BERT model is loaded from TensorFlow Hub; the large uncased variant is used. It is wrapped in a Keras layer with trainable=True so that it can be fine-tuned for our specific task.

In [None]:
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

The tokenizer converts the input data to a format BERT understands. The vocab file is used so that the tokenizer knows which number to use to encode each word.

In [None]:
voc_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()

#returns true/false based on selected cased/uncased bert layer
lower_case = bert_layer.resolved_object.do_lower_case.numpy()

#Initialize the tokenizer
tokenizer = tokenization.FullTokenizer(voc_file, lower_case)

In [None]:
train_inp = encode_bert(train_df.text.values, tokenizer, max_length=160)
train_label = train_df.target.values

The model summary shows the three input layers we created, followed by the BERT model in the Keras layer. The final dense layer predicts the output on a scale of 0-1.

In [None]:
model = model_build(bert_layer, max_length=160)
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 inp_word_ids (InputLayer)      [(None, 160)]        0           []                               
                                                                                                  
 inp_mask (InputLayer)          [(None, 160)]        0           []                               
                                                                                                  
 seg_ids (InputLayer)           [(None, 160)]        0           []                               
                                                                                                  
 keras_layer (KerasLayer)       [(None, 1024),       335141889   ['inp_word_ids[0][0]',           
                                 (None, 160, 1024)]               'inp_mask[0][0]',           

The model is trained using Keras. ModelCheckpoint is used to save the model with the highest validation accuracy.

In [None]:
model_checkpoint = ModelCheckpoint('model.h5', monitor='val_accuracy', save_best_only=True)
history = model.fit(
    train_inp, train_label,
    validation_split=0.1,
    epochs=3,
    callbacks=[model_checkpoint],
    batch_size=5
)

Epoch 1/3
Epoch 2/3
Epoch 3/3
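To make the `save_best_only=True` behaviour concrete, here is a small simulation of the decision a checkpoint callback makes after each epoch when monitoring `val_accuracy`. The function `best_checkpoint_epochs` is a hypothetical helper, not a Keras API, and the validation accuracies are invented numbers, not results from this training run.

```python
def best_checkpoint_epochs(val_accuracies):
    """Return the 1-based epochs at which a save-best-only checkpoint would be written."""
    best = float("-inf")
    saved = []
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:  # only a strict improvement triggers a save
            best = acc
            saved.append(epoch)
    return saved


# Invented validation accuracies for three epochs (illustration only)
example_val_accuracy = [0.81, 0.84, 0.83]
print(best_checkpoint_epochs(example_val_accuracy))
```

In this made-up run the file would be written after epochs 1 and 2 and left untouched after epoch 3, so `model.h5` would hold the epoch 2 weights.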


The model is saved to Google Drive so that it can be used to predict tweets in real time.

In [None]:
model.save('/content/drive/My Drive/My data files/twitter_model')



INFO:tensorflow:Assets written to: /content/drive/My Drive/My data files/twitter_model/assets


