# NLP for disaster tweets


## Outline
- [0. Overview](#0)
- [1. Read data and expand dataset](#1)
- [2. Preprocessing](#2)
- [3. BERT](#3)
  - [3.1 Tokenization](#3-1)
  - [3.2 BERT model](#3-2)
- [4. GloVe Bi-LSTM](#4)
  - [4.1 Tokenization](#4-1)
  - [4.2 GloVe embeddings](#4-2)
  - [4.3 LSTM model](#4-3)
- [5. NB classifier + Tf-idf features](#5)
- [6. Ensemble](#6)
- [7. TO DO](#7)


<a name='0'></a>
# 0. Overview
Natural language processing is used to tackle a sentence classification problem: deciding whether a tweet is about a disaster or not. The following model scores in the top 10% of all submissions to the Kaggle competition, with a final leaderboard F1 score of 0.84094.

Uncomment the cell below if using Google Colab.

In [None]:
#!pip install sentencepiece
#import nltk
#nltk.download('stopwords')

Import libraries, together with auxiliary scripts.

In [2]:
import numpy as np
import pandas as pd
import datetime, sys, string
import regex as re
from random import randint
from sklearn.linear_model import Ridge
from sklearn.naive_bayes import MultinomialNB
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import TweetTokenizer

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.initializers import Constant
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, LSTM, Embedding, SpatialDropout1D, Dropout, Input, GlobalAveragePooling1D, Concatenate, Bidirectional
from tensorflow.keras.optimizers import Adam

# Auxiliary scripts
import tokenization
from utils import *

<a name='1'></a>
# 1. Read data and expand dataset
Each sample in the train and test set has the following information:


*   The **text** of a tweet
*   A **keyword** from that tweet
*   The **location** the tweet was sent from
*   Label

In this script, only the text was used, ignoring the other two given features.
Moreover, some of the tweets were mislabelled and this is addressed below (by changing to the correct label).
Furthermore, the train set was expanded using additional tweets chosen at random from the following dataset: [link](https://www.kaggle.com/kazanova/sentiment140). These 1000 extra tweets were inspected by hand and labelled accordingly; all of them turned out to have label 0.

In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Correct the target value of some mislabelled tweets
ids_with_target_error = [328,443,513,2619,3640,3900,4342,5781,6552,6554,6570,6701,6702,6729,6861,7226]
train.loc[train['id'].isin(ids_with_target_error),'target'] = 0

# Expand the training set by adding tweets from: https://www.kaggle.com/kazanova/sentiment140
extra_train = pd.read_csv('expand_train_dataset.csv')  # They are all label 0
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
train = pd.concat([train, extra_train], sort=False).reset_index()
y_train = train.target.values
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)
print('Some tweet examples: \n', train.text.values[0:10])

Train shape:  (8604, 6)
Test shape:  (3263, 4)
Some tweet examples: 
 ['Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'
 'Forest fire near La Ronge Sask. Canada'
 "All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected"
 '13,000 people receive #wildfires evacuation orders in California '
 'Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school '
 '#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires'
 '#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas'
 "I'm on top of the hill and I can see a fire in the woods..."
 "There's an emergency evacuation happening now in the building across the street"
 "I'm afraid that the tornado is coming to our area..."]


<a name='2'></a>
# 2. Preprocessing
Since the dataset consists of raw text, it must be 'cleaned' into a more suitable form before the models use it as input. The preprocessing is separated into two functions:

- preprocessing: Its main function is to correct misspelled words using a dictionary stored in utils.py
- glove_preprocessing: This is the Python version of the Ruby script created by the GloVe project at Stanford to preprocess Twitter data ([link](https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb)). It takes care of URLs, hashtags, usernames and others.
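
As a rough illustration of the kind of substitutions glove_preprocessing performs, the sketch below applies a reduced set of rules to one made-up tweet. The function name `toy_glove_clean` and the sample text are hypothetical, and the patterns are simplified compared with the full script.

```python
import re

def toy_glove_clean(text):
    """Hypothetical, reduced version of the Twitter normalization rules."""
    text = re.sub(r"https?://\S+", "<url>", text)    # URLs
    text = re.sub(r"@\w+", "<user>", text)           # user mentions
    text = re.sub(r"#(\w+)", r"<hashtag> \1", text)  # hashtags keep their body
    text = re.sub(r"\d+", "<number>", text)          # runs of digits
    return text.lower()

sample = "@bob 3 houses lost to #wildfire http://t.co/abc"
cleaned = toy_glove_clean(sample)
print(cleaned)
# <user> <number> houses lost to <hashtag> wildfire <url>
```

Replacing these items with shared placeholder tokens lets the model reuse the pre-trained vectors that GloVe learned for the same placeholders.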


In [4]:
def preprocessing(tweet):
    # Remove leading and trailing spaces
    tweet = tweet.strip(' ')
    # Remove old-style retweet marker
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # Tokenize so that misspelled words can be corrected one by one
    tokenizer = TweetTokenizer(preserve_case=True, strip_handles=False,
                               reduce_len=False)
    tweet_tokens = tokenizer.tokenize(tweet)
    tweet_clean = ''
    # Use the dictionary in utils.py to correct misspelled words
    for word in tweet_tokens:
        if word.lower() in mispell_dict:
            word = mispell_dict[word.lower()].lower()
        # Drop stop words
        if word not in stop:
            tweet_clean += ' ' + word

    return tweet_clean

In [5]:
FLAGS = re.MULTILINE | re.DOTALL

def hashtag(text):
    text = text.group()
    hashtag_body = text[1:]
    if hashtag_body.isupper():
        result = "<hashtag> {} <allcaps>".format(hashtag_body.lower())
    else:
        result = " ".join(["<hashtag>"] + re.split(r"(?=[A-Z])", hashtag_body, flags=FLAGS))
    return result

def allcaps(text):
    text = text.group()
    return text.lower() + " <allcaps> "

def glove_preprocessing(text):
    # Different regex parts for smiley faces
    eyes = r"[8:=;]"
    nose = r"['`\-]?"

    # function so code less repetitive
    def re_sub(pattern, repl):
        return re.sub(pattern, repl, text, flags=FLAGS)

    text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", "<url>")
    text = re_sub(r"@\w+", "<user>")
    text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), "<smile>")
    text = re_sub(r"{}{}p+".format(eyes, nose), "<lolface>")
    text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), "<sadface>")
    text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), "<neutralface>")
    text = re_sub(r"/"," / ")
    text = re_sub(r"<3","<heart>")
    text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", "<number>")
    text = re_sub(r"#\w+", hashtag)
    text = re_sub(r"([!?.]){2,}", r"\1 <repeat>")
    text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 <elong>")

    text = re_sub(r"([a-zA-Z<>()])([?!.:;,])", r"\1 \2")
    text = re_sub(r"\(([a-zA-Z<>]+)\)", r"( \1 )")
    text = re_sub(r"  ", r" ")
    text = re_sub(r" ([A-Z]){2,} ", allcaps)
    
    return text.lower()

Apply preprocessing functions to the datasets.

In [None]:
tweets = train['text']
tweets_test = test['text']

for i, line in enumerate(tweets):
    pre_tweet = preprocessing(line)
    tweets[i] = glove_preprocessing(pre_tweet)

for i, line in enumerate(tweets_test):
    pre_tweet_test = preprocessing(line)
    tweets_test[i] = glove_preprocessing(pre_tweet_test)

<a name='3'></a>
# 3. BERT
Bidirectional Encoder Representations from Transformers (BERT) is a powerful transformer encoder that produces SOTA results in a variety of NLP tasks. A pre-trained version of BERT base from TensorFlow Hub will be used.


In [7]:
# Load BERT
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2"
bert_layer = hub.KerasLayer(module_url, trainable=True)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

<a name='3-1'></a>
#### 3.1 Tokenization
The first step is to split the text into tokens with the BERT tokenizer (found in the tokenization.py script). Next, the special tokens needed for sentence classification are added: [CLS] at the first position and [SEP] at the end of the sentence. The tokens are then replaced by unique ids from the embedding table given by the model. BERT also works with a constant input length: a sentence shorter than this hyperparameter is padded with 0s up to the assigned length, and a longer sentence is truncated to it.
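
The wrapping, truncation and padding steps can be sketched on already-tokenized text with a toy vocabulary. The names `toy_bert_encode` and `toy_vocab` are hypothetical; the real ids come from the vocabulary file shipped with the pre-trained model.

```python
def toy_bert_encode(tokens, vocab, max_len):
    """Wrap, truncate and pad one tokenized sentence (toy vocabulary)."""
    tokens = tokens[:max_len - 2]                       # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = [vocab[t] for t in sequence] + [0] * pad_len  # 0 is the [PAD] id in this toy vocab
    mask = [1] * len(sequence) + [0] * pad_len          # 1 = real token, 0 = padding
    return ids, mask

# Hypothetical vocabulary, for illustration only.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
             "forest": 3, "fire": 4, "near": 5, "town": 6}

short_ids, short_mask = toy_bert_encode(["forest", "fire"], toy_vocab, max_len=6)
long_ids, long_mask = toy_bert_encode(
    ["forest", "fire", "near", "town", "fire"], toy_vocab, max_len=6)

print(short_ids, short_mask)  # [1, 3, 4, 2, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 3, 4, 5, 6, 2] [1, 1, 1, 1, 1, 1]
```

The short sentence is padded to six positions, while the long one loses its last token so that [SEP] still fits.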

In [8]:
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
          
        # Truncate text
        text = text[:max_len-2]

        # Add special tokens
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        # Look-up the value of each token in the embedding table
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)

In [9]:
max_len = 32
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
bert_train_input = bert_encode(train.text.values, tokenizer, max_len=max_len)
bert_test_input = bert_encode(test.text.values, tokenizer, max_len=max_len)

<a name='3-2'></a>
#### 3.2 BERT model
The BERT model expects three inputs, which are produced by the bert_encode function above. For this task, two outputs of the BERT model are concatenated and fed to a Dense layer for classification: BERT's output for the [CLS] token and the outputs for all tokens in the sequence. Since the latter is a 3D tensor, global average pooling is performed along the sentence-length axis.
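
The shapes involved can be checked with a small NumPy sketch. The random tensor stands in for the BERT sequence output, and the toy sizes are assumptions chosen for readability (the notebook uses a length of 32 and a hidden size of 768).

```python
import numpy as np

batch, seq_len, hidden = 2, 4, 3   # toy sizes, not the notebook's
rng = np.random.default_rng(0)
sequence_output = rng.normal(size=(batch, seq_len, hidden))  # stand-in for BERT output

cls_feat = sequence_output[:, 0, :]       # vector at the [CLS] position
emb_feat = sequence_output.mean(axis=1)   # plain average over the sequence axis
features = np.concatenate([cls_feat, emb_feat], axis=-1)

print(cls_feat.shape, emb_feat.shape, features.shape)  # (2, 3) (2, 3) (2, 6)
```

With the real model, each tweet would therefore be described by 2 x 768 = 1536 values before the dropout and Dense layers.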


In [10]:
input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

_, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
cls_feat = sequence_output[:, 0, :]
emb_feat = GlobalAveragePooling1D()(sequence_output)
x = Concatenate()([cls_feat, emb_feat])
x = Dropout(0.3)(x)
out = Dense(1, activation='sigmoid')(x)
bert_model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
bert_model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 32)]         0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 32)]         0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 32)]         0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

In [11]:
bert_model.compile(Adam(learning_rate=1e-6), loss='binary_crossentropy', metrics=['accuracy'])
train_history = bert_model.fit(
    bert_train_input, y_train,
    validation_split=0.2,
    epochs=8,
    batch_size=32,
    verbose=2
)

Epoch 1/8
216/216 - 64s - loss: 0.6358 - accuracy: 0.6436 - val_loss: 0.4059 - val_accuracy: 0.8908
Epoch 2/8
216/216 - 51s - loss: 0.5097 - accuracy: 0.7588 - val_loss: 0.3011 - val_accuracy: 0.9082
Epoch 3/8
216/216 - 52s - loss: 0.4569 - accuracy: 0.7943 - val_loss: 0.2756 - val_accuracy: 0.9123
Epoch 4/8
216/216 - 53s - loss: 0.4258 - accuracy: 0.8158 - val_loss: 0.2779 - val_accuracy: 0.9123
Epoch 5/8
216/216 - 53s - loss: 0.4086 - accuracy: 0.8194 - val_loss: 0.2782 - val_accuracy: 0.9059
Epoch 6/8
216/216 - 53s - loss: 0.3924 - accuracy: 0.8322 - val_loss: 0.2832 - val_accuracy: 0.9064
Epoch 7/8
216/216 - 53s - loss: 0.3832 - accuracy: 0.8357 - val_loss: 0.2494 - val_accuracy: 0.9140
Epoch 8/8
216/216 - 53s - loss: 0.3740 - accuracy: 0.8437 - val_loss: 0.2501 - val_accuracy: 0.9146


Create features from the trained BERT model to later feed the ensemble algorithm.

In [12]:
bert_feat = bert_model.predict(bert_train_input).flatten()
bert_out = bert_model.predict(bert_test_input).flatten()

<a name='4'></a>
# 4. GloVe Bi-LSTM


<a name='4-1'></a>
#### 4.1 Tokenization
Similar to BERT, the LSTM model also expects ids instead of words as inputs. This is performed using the TensorFlow tokenizer.
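
The mapping from words to padded id sequences can be sketched in plain Python. This is a simplified stand-in with hypothetical helper names, not the Keras implementation: it assumes whitespace-separated, already lowercased text, reserves id 1 for out-of-vocabulary words, and pads on the left as `pad_sequences` does by default.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<UNK>"):
    """Assign ids by descending frequency; id 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_padded_ids(texts, word_index, maxlen):
    """Map words to ids and left-pad with zeros up to maxlen."""
    rows = []
    for text in texts:
        ids = [word_index.get(w, 1) for w in text.split()][-maxlen:]
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows

toy_index = fit_word_index(["fire in the forest", "the fire spreads"])
padded = to_padded_ids(["the fire", "flood in the forest"], toy_index, maxlen=4)

print(toy_index)  # {'<UNK>': 1, 'fire': 2, 'the': 3, 'in': 4, 'forest': 5, 'spreads': 6}
print(padded)     # [[0, 0, 3, 2], [1, 4, 3, 5]]
```

The unseen word "flood" maps to the OOV id 1, and id 0 never refers to a word, which is why the embedding matrix later needs one extra row.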

In [13]:
tokenizer_glove = Tokenizer(split=' ', oov_token='<UNK>')
tokenizer_glove.fit_on_texts(tweets)
glove_x = tokenizer_glove.texts_to_sequences(tweets)
word_index = tokenizer_glove.word_index
print('Number of unique words: ', len(word_index))
glove_x = sequence.pad_sequences(glove_x)
glove_x_test = tokenizer_glove.texts_to_sequences(tweets_test)
glove_x_test = sequence.pad_sequences(glove_x_test, maxlen=np.shape(glove_x)[1])
Y = pd.get_dummies(y_train).values

Number of unique words:  15241


<a name='4-2'></a>
#### 4.2 GloVe embeddings
The GloVe embeddings were pre-trained on 2 billion tweets and will be used as inputs to the model. The length of each embedding is 200, meaning that each word is represented by 200 floats. Moreover, an embedding matrix will be created for our corpus to map between ids and embeddings.
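
A minimal sketch of how such an embedding matrix is assembled, using made-up 3-dimensional vectors in place of the 200-dimensional GloVe ones. The words and values in `toy_embeddings` and `toy_word_index` are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for the GloVe file contents.
toy_embeddings = {
    "fire": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "flood": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_word_index = {"<UNK>": 1, "fire": 2, "flood": 3, "sharknado": 4}

# Row 0 is reserved for padding, hence the +1.
matrix = np.zeros((len(toy_word_index) + 1, 3))
for word, idx in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:   # words without a pre-trained vector stay all-zero
        matrix[idx] = vector

print(matrix)
```

Row `i` of the matrix is the vector for the word with id `i`, so the Embedding layer can look up a whole padded sequence with a single indexing operation.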



In [14]:
embedding_dict = {}

# The with-block closes the file automatically
with open('glove.twitter.27B.200d.txt', encoding="utf8") as glove:
    for line in glove:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], 'float32')
        embedding_dict[word] = vectors

num_words = len(word_index) + 1
embedding_matrix = np.zeros((num_words,200))

for word, i in tqdm(word_index.items()):
    if i >= num_words:
        continue
    embedding_vector = embedding_dict.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

100%|██████████| 15241/15241 [00:00<00:00, 493289.51it/s]


<a name='4-3'></a>
#### 4.3 LSTM model
Long Short-Term Memory (LSTM) models are a type of recurrent neural network that can capture longer-range dependencies than traditional RNNs. For this task, a bidirectional LSTM is used, which can leverage information from both past and future to create the output at the current timepoint. The LSTM has 512 hidden units in each direction. Dropout and recurrent dropout are added for regularization, and spatial dropout is applied to the embedding features for the same reason.

In [15]:
glove_model = Sequential()
glove_model.add(Embedding(num_words, 200, input_length = np.shape(glove_x)[1], embeddings_initializer=Constant(embedding_matrix), trainable=True))
glove_model.add(SpatialDropout1D(0.1))
glove_model.add(Bidirectional(LSTM(512, dropout=0.1, recurrent_dropout=0.2)))
glove_model.add(Dense(64, activation = 'relu'))
glove_model.add(Dropout(0.2))
glove_model.add(Dense(2, activation='sigmoid'))
glove_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 43, 200)           3048400   
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 43, 200)           0         
_________________________________________________________________
bidirectional (Bidirectional (None, 1024)              2920448   
_________________________________________________________________
dense_1 (Dense)              (None, 64)                65600     
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 130       
Total params: 6,034,578
Trainable params: 6,034,578
Non-trainable params: 0
______________________________________________
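
The parameter counts in the summary above can be reproduced by hand, which is a useful sanity check on the architecture. The sketch assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM direction: 4 gates with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)

vocab_rows, emb_dim, units = 15241 + 1, 200, 512   # +1 for the padding row

embedding = vocab_rows * emb_dim
bilstm = 2 * lstm_params(emb_dim, units)   # forward + backward direction
dense_1 = (2 * units) * 64 + 64            # concatenated states -> 64 units
dense_2 = 64 * 2 + 2

print(embedding, bilstm, dense_1, dense_2)       # 3048400 2920448 65600 130
print(embedding + bilstm + dense_1 + dense_2)    # 6034578
```

About half of the trainable parameters sit in the embedding table, since it was created with `trainable=True`.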

In [16]:
x_1, x_val, y_1, y_val = train_test_split(glove_x, Y, test_size=0.2, random_state=40, stratify=y_train)
glove_model.compile(Adam(learning_rate=2e-5), loss='categorical_crossentropy', metrics=['accuracy'])
train_history = glove_model.fit(x_1, y_1, 
                epochs=10, 
                batch_size=64, 
                verbose=2, 
                validation_data=(x_val, y_val)
                )

Epoch 1/10
108/108 - 43s - loss: 0.5978 - accuracy: 0.6924 - val_loss: 0.5057 - val_accuracy: 0.7775
Epoch 2/10
108/108 - 39s - loss: 0.4454 - accuracy: 0.8043 - val_loss: 0.4221 - val_accuracy: 0.8175
Epoch 3/10
108/108 - 39s - loss: 0.4070 - accuracy: 0.8241 - val_loss: 0.4069 - val_accuracy: 0.8257
Epoch 4/10
108/108 - 39s - loss: 0.3961 - accuracy: 0.8297 - val_loss: 0.4043 - val_accuracy: 0.8222
Epoch 5/10
108/108 - 39s - loss: 0.3848 - accuracy: 0.8320 - val_loss: 0.3952 - val_accuracy: 0.8239
Epoch 6/10
108/108 - 39s - loss: 0.3773 - accuracy: 0.8361 - val_loss: 0.3933 - val_accuracy: 0.8234
Epoch 7/10
108/108 - 39s - loss: 0.3744 - accuracy: 0.8350 - val_loss: 0.3934 - val_accuracy: 0.8222
Epoch 8/10
108/108 - 39s - loss: 0.3697 - accuracy: 0.8402 - val_loss: 0.3909 - val_accuracy: 0.8263
Epoch 9/10
108/108 - 39s - loss: 0.3683 - accuracy: 0.8389 - val_loss: 0.3900 - val_accuracy: 0.8251
Epoch 10/10
108/108 - 39s - loss: 0.3674 - accuracy: 0.8443 - val_loss: 0.3892 - val_accura

Create second set of features using trained LSTM.

In [17]:
glove_feat = glove_model.predict(glove_x)[:,1]
glove_out = glove_model.predict(glove_x_test)[:,1]

<a name='5'></a>
# 5. NB classifier + Tf-idf features
The Naive Bayes classifier is used together with tf-idf features to produce the last set of features. Tf-idf is a variant of the bag-of-words model that weights each word by taking its raw frequency of occurrence in a document and scaling it down by its frequency across the corpus.
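
The weighting can be worked through on a toy corpus. The sketch uses raw counts and the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, which is the default variant in scikit-learn. It leaves out the per-document L2 normalization that TfidfVectorizer also applies by default, and the three toy documents are made up.

```python
import math

docs = [["fire", "in", "forest"], ["fire", "fire", "alarm"], ["forest", "walk"]]
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = {}
for doc in docs:
    for word in set(doc):
        df[word] = df.get(word, 0) + 1

def tfidf(doc):
    """Raw term count times a smoothed inverse document frequency."""
    weights = {}
    for word in set(doc):
        idf = math.log((1 + n_docs) / (1 + df[word])) + 1
        weights[word] = doc.count(word) * idf
    return weights

weights = tfidf(docs[1])
print(weights)
```

Per occurrence, "alarm" (found in one document) gets a larger weight than "fire" (found in two), which is the down-scaling of common words described above.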

In [18]:
# Named tfidf rather than tf so the TensorFlow import alias is not overwritten
tfidf = TfidfVectorizer(max_features=2500, stop_words=stop).fit(tweets)
x_train_tf = tfidf.transform(tweets)
x_test_tf = tfidf.transform(tweets_test)

NB = MultinomialNB(alpha=1).fit(x_train_tf, y_train)

Create the third set of features using the trained NB classifier.

In [19]:
NB_tfidf_feat = NB.predict_proba(x_train_tf)[:,1]
NB_tfidf_out = NB.predict_proba(x_test_tf)[:,1]

<a name='6'></a>
# 6. Ensemble


An L2-regularized (ridge) regression model is used to combine the predictions of the models described above. Strong regularization is used so that the ensemble is not biased towards a single model and generalizes better.
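
The effect of the regularization strength can be sketched with the closed-form ridge solution on made-up base-model probabilities. This is a simplified illustration with no intercept term (scikit-learn's Ridge fits an unpenalized intercept by default), and `ridge_weights` and the toy data are hypothetical.

```python
import numpy as np

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution without intercept: (X^T X + alpha I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical probabilities from three base models for five tweets.
X = np.array([[0.9, 0.8, 0.7],
              [0.2, 0.3, 0.4],
              [0.8, 0.6, 0.9],
              [0.1, 0.2, 0.1],
              [0.7, 0.9, 0.6]])
y = np.array([1, 0, 1, 0, 1])

w_weak = ridge_weights(X, y, alpha=0.01)
w_strong = ridge_weights(X, y, alpha=10.0)

print("alpha=0.01:", w_weak)
print("alpha=10:  ", w_strong)
```

A larger `alpha` shrinks the weight vector, so no single base model can dominate the combined prediction through a large coefficient.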

In [20]:
feat_train = pd.DataFrame({'bert':bert_feat, 'lstm_glove':glove_feat, 'nb':NB_tfidf_feat})
feat_test = pd.DataFrame({'bert':bert_out, 'lstm_glove':glove_out, 'nb':NB_tfidf_out})

In [21]:
ensemble_model = Ridge(alpha=10)
predictions = ensemble_model.fit(feat_train, y_train).predict(feat_test).round().astype(int)

Create csv for competition submission.

In [22]:
name = 'final_ensemble'
submission = pd.read_csv('sample_submission.csv')
submission['target'] = predictions
print(submission.head(10))
submission.to_csv('submit_'+name+'.csv', index=False)

   id  target
0   0       1
1   2       1
2   3       1
3   9       1
4  11       1
5  12       1
6  21       0
7  22       0
8  27       0
9  29       0


<a name='7'></a>
# 7. TO DO


*   Introduce the other two features (location and keywords) to the models
*   Expand the preprocessing step (cleaning of data)
*   Create meta-features from the tweets that might boost performance
*   Use other pre-trained embeddings that might be more useful in this task
*   Use other SOTA models such as BERT Large, T5, GPT-2 and others.

