In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

https://github.com/nisode/MSDS/tree/main/Deep%20Learning

# Task

The goal is to categorize Twitter posts as disaster tweets or not.

Twitter posts appear quickly, so monitoring them can help spot the beginnings of a disaster.

We are given about 7.6k labeled tweets to train a model, which is then used to categorize an unseen test set.

Some posts contain misleading text, which can lead to false positives.

# Load Data

In [3]:
train_df = pd.read_csv('../input/nlp-getting-started/train.csv')
train_df.head()

Positive target (1) means a disaster related tweet.

# EDA

Let's see what our data looks like and what the data types are:

In [4]:
train_df.info()

In [5]:
train_df.isnull().sum().plot(kind='bar', color=['red', 'blue'])

Checking the columns for missing values, location has around a third of its data missing. That would make it hard to use for categorization, so I don't think we will use it. Keyword has a few missing values as well.

In [6]:
test_df = pd.read_csv('../input/nlp-getting-started/test.csv')
test_df.head()

In [7]:
test_df.info()

In [8]:
test_df.isnull().sum()

In the test data, location also has about a third of its data missing. I will drop both location and keyword, because I'm not sure how to include either in a matrix of tokens.

In [9]:
import matplotlib.pyplot as plt

Let's take a look at the target value breakdown:

In [10]:
plt.figure(figsize=(6,6))
train_df['target'].value_counts(normalize=True).plot(kind="pie", autopct='%1.1f%%')

In [11]:
train_df['target'].value_counts().plot(kind='barh', color=['blue','red'])

About 57% of the tweets are not disasters and 43% are. The classes are a bit imbalanced, but I won't worry about it this time.
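If the imbalance ever did need handling, one common option is to weight each class inversely to its frequency. The sketch below is only an illustration: `balanced_class_weights` is a hypothetical helper, and the label counts are made-up round numbers that mimic the 57/43 split rather than the real dataset counts.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels with a 57/43 split (illustrative, not the real counts).
toy_labels = [0] * 57 + [1] * 43
weights = balanced_class_weights(toy_labels)
print(weights)  # the minority class (1) receives the larger weight
```

A dictionary of this shape can be passed to Keras `fit` through its `class_weight` argument.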

In [12]:
train_df['keyword'].value_counts()[0:10]

This is just a quick look at the most common keywords and how they might be used in another attempt.
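One simple way the keywords might be used in a later attempt is to prepend each keyword to its tweet text before tokenizing, so it becomes part of the token sequence. This is a sketch, not something this notebook does: `add_keyword` is a hypothetical helper, and the `%20` handling assumes keywords are URL-encoded.

```python
from urllib.parse import unquote

def add_keyword(text, keyword):
    """Prepend the keyword to the tweet text; return the text unchanged if the keyword is missing."""
    # Missing keywords may arrive as None or as a float NaN when read by pandas.
    # NaN is the only value that is not equal to itself.
    if keyword is None or keyword != keyword:
        return text
    # unquote turns URL-style escapes such as "%20" into spaces (assumed encoding).
    return f"{unquote(str(keyword))} {text}"

print(add_keyword("Crash near the airport", "airplane%20accident"))
print(add_keyword("Nice weather today", None))
```

With a DataFrame, the same helper could be applied row by row to the `keyword` and `text` columns before the cleaning step.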

# Cleaning Data

I see a lot of notebooks use this preprocessing pipeline or something very similar, so it must be a popular and effective solution to this problem. One of the notebooks where I found it is linked in the code comment below.

It goes through a 9-step process:

1. removes any @ mentions
2. removes any digits
3. removes any hashtag symbols
4. removes links (URLs starting with http://, https://, or www.)
5. replaces any remaining non-letter characters with spaces
6. lowercases everything
7. removes any stopwords
8. lemmatizes, which reduces words to their base forms
9. joins it all back together
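To make the steps concrete, here is a dependency-free walk-through on a single made-up tweet. It uses the same regular expressions as the pipeline, but it is only a sketch: `clean_tweet` is a hypothetical helper, the stopword set is a tiny stand-in for NLTK's list, and lemmatization is skipped.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for this sketch).
STOP_WORDS = {"the", "is", "a", "in"}

def clean_tweet(text):
    text = re.sub(r'(\s*)@\w+(\s*)', '', text)         # 1. @ mentions (and surrounding whitespace)
    text = re.sub(r'\d+', '', text)                     # 2. digits
    text = re.sub(r'#', '', text)                       # 3. hashtag symbols
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # 4. links
    text = re.sub(r'[^A-Za-z]', ' ', text)              # 5. non-letters become spaces
    text = text.lower()                                 # 6. lowercase
    words = [w for w in text.split() if w not in STOP_WORDS]  # 7. stopwords
    # 8. lemmatization is skipped here to keep the sketch dependency-free
    return ' '.join(words)                              # 9. join back together

tweet = "@user The forest fire near La Ronge is spreading #wildfire 2 dead http://t.co/abc123 !!!"
print(clean_tweet(tweet))                # forest fire near la ronge spreading wildfire dead
print(clean_tweet("hello @user world"))  # helloworld
```

The second call shows a side effect of step 1: because the pattern also consumes the whitespace around a mention, a mention in the middle of a sentence joins its neighbouring words.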

In [13]:
# preprocessPipeline base from here: https://www.kaggle.com/code/alid3bs/lstm-vs-gru-vs-bidirectional/notebook
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

import nltk
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
stop_words = stopwords.words('english')

def preprocessPipline(data, labelName):
    # Work on a copy so the original DataFrame is left untouched
    dataTemp = data.copy()
    # 1. remove @ mentions (the pattern also consumes the whitespace around them)
    dataTemp[labelName] = dataTemp[labelName].apply(lambda x: re.sub(r'(\s*)@\w+(\s*)','', x))
    # 2. remove digits
    dataTemp[labelName] = dataTemp[labelName].apply(lambda x: re.sub(r'\d+','', x))
    # 3. remove hashtag symbols (the tag word itself is kept)
    dataTemp[labelName] = dataTemp[labelName].apply(lambda x: re.sub(r'#','', x))
    # 4. remove http(s) and www links
    dataTemp[labelName] = dataTemp[labelName].apply(lambda x: re.sub(r'https?://\S+|www\.\S+','',x))
    # 5. replace every remaining non-letter character with a space
    dataTemp[labelName] = dataTemp[labelName].apply(lambda x: re.sub(r'[^A-Za-z]',' ',x))
    # 6. lowercase
    dataTemp[labelName] = dataTemp[labelName].apply(lambda x: x.lower())
    # 7. split into words and drop stopwords
    dataTemp[labelName] = dataTemp[labelName].apply(lambda x : [word for word in x.split()  if word not in stop_words])
    # 8. lemmatize each word
    dataTemp[labelName] = dataTemp[labelName].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
    # 9. join the words back into a single string
    dataTemp[labelName] = dataTemp[labelName].apply(lambda x: ' '.join(x))
    return dataTemp

In [14]:
train_df_cleaned = preprocessPipline(train_df, "text")
clean_text = train_df_cleaned['text']
clean_text[0:5]

# Preparing Data

Using the Keras Tokenizer, we turn each cleaned post into a sequence of integers, where each integer is the index of a word in the tokenizer's dictionary, so the word order of the post is preserved. Padding then makes every sequence the same length as the longest one.
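The idea can be illustrated without Keras. The sketch below is a simplified stand-in for `Tokenizer` and `pad_sequences`, not their actual implementation: `build_word_index` and `to_padded_sequences` are hypothetical helpers, index 1 is given to the out-of-vocabulary token, and 0 is used only for padding.

```python
from collections import Counter

def build_word_index(texts, oov_token="unk"):
    """Map each word to an integer: 1 is the out-of-vocabulary token, then words by frequency."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_padded_sequences(texts, word_index, oov_token="unk"):
    """Convert texts to integer sequences and right-pad with 0 up to the longest length."""
    oov_id = word_index[oov_token]
    sequences = [[word_index.get(w, oov_id) for w in t.split()] for t in texts]
    max_len = max(len(s) for s in sequences)
    return [s + [0] * (max_len - len(s)) for s in sequences]

toy_texts = ["forest fire near town", "fire spreading", "flood"]
word_index = build_word_index(toy_texts)
padded = to_padded_sequences(toy_texts, word_index)
print(padded)  # [[3, 2, 4, 5], [2, 6, 0, 0], [7, 0, 0, 0]]

# A word that was never seen during fitting maps to the "unk" index.
print(to_padded_sequences(["earthquake fire"], word_index))  # [[1, 2]]
```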

In [15]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="unk")
tokenizer.fit_on_texts(clean_text)
X = tokenizer.texts_to_sequences(clean_text)

X = pad_sequences(X, padding='post',truncating='post')
X

In [16]:
X.shape

We can see the longest post (after removing stopwords) is 23 tokens long.

We put the targets in their own series and split the data into training and validation sets, holding out 20% for validation and stratifying on Y so both sets keep the same class proportions.
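To show what stratifying achieves, here is a small pure-Python sketch. It is not scikit-learn's algorithm: `stratified_split` is a hypothetical helper, and the labels are made-up values that mimic the 57/43 class balance.

```python
import random

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, val_idx) so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Take the validation share from each class separately.
        n_val = round(len(idx) * test_size)
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return train_idx, val_idx

toy_labels = [0] * 57 + [1] * 43
train_idx, val_idx = stratified_split(toy_labels)
val_share = sum(toy_labels[i] for i in val_idx) / len(val_idx)
train_share = sum(toy_labels[i] for i in train_idx) / len(train_idx)
print(len(train_idx), len(val_idx), train_share, val_share)
```

Both shares stay close to the overall 43% positive rate. With `train_test_split`, passing a fixed `random_state` in addition to `stratify` would make the split reproducible between runs.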

In [17]:
Y = train_df_cleaned['target']
Y 

In [18]:
from sklearn.model_selection import train_test_split
x_train, x_val ,y_train ,y_val = train_test_split(X, Y, test_size=0.2, stratify=Y)
x_train.shape

In [19]:
vocab_size = len(tokenizer.word_index)
vocab_size

# Model Architecture

In [20]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dropout, Flatten, Conv2D, MaxPooling2D, Dense, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

Some explanations for architecture choices:
1. The embedding layer needs an input dimension of vocab_size + 1 according to the Keras documentation, because index 0 is reserved for padding.
2. ReLU activations worked well in previous projects, so I reuse them here.
3. The output layer is a single sigmoid unit because this is a binary classification problem.
4. Adam seems to work well in general, so it is also reused.
5. I will tune the number of hidden layers, the dropout rates, and the learning rate.
6. I will try versions where the LSTM is bidirectional and versions where it is not.
7. I picked LSTM for this problem because sentences can be treated as sequence data, and LSTMs help with the vanishing gradient problem.
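The first point can be illustrated with a plain NumPy lookup table. This is a sketch of the indexing idea only, not of the Keras layer itself, and the sizes below are made-up small values.

```python
import numpy as np

vocab_size = 5       # hypothetical: tokenizer indices run from 1 to 5
embedding_dim = 4    # hypothetical embedding width

rng = np.random.default_rng(0)
# One row per possible index: 0 (padding) plus 1..vocab_size.
embedding_table = rng.normal(size=(vocab_size + 1, embedding_dim))

padded_sequence = np.array([3, 5, 1, 0, 0])   # index 5 is the largest valid id
vectors = embedding_table[padded_sequence]    # an embedding is a row lookup
print(vectors.shape)  # (5, 4)

# With only vocab_size rows, the largest index no longer fits.
too_small = embedding_table[:vocab_size]
try:
    too_small[padded_sequence]
    lookup_failed = False
except IndexError:
    lookup_failed = True
print(lookup_failed)  # True
```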

In [23]:
lstm_model1 = Sequential([
    layers.Embedding(vocab_size + 1,64),
    Dropout(0.2),
    layers.LSTM(64),
    Dense(23, activation = 'relu'),
    Dropout(0.2),
    Dense(23, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])

lstm_model1.compile(optimizer = Adam(learning_rate=0.001),
                    loss="binary_crossentropy", 
                    metrics = ['accuracy', tf.keras.metrics.AUC()])

lstm_model1.summary()

# Results

In [24]:
lstm_model1_hist = lstm_model1.fit(x = x_train,
                                   y = y_train,
                                   validation_data = (x_val, y_val),
                                   epochs = 10,
                                   verbose = 1)

In [27]:
def metrics_plot(history):
    xs = np.arange(1, len(history['loss'])+1)
    plt.figure(figsize=[16,4])
    
    plt.subplot(1,3,1)
    plt.plot(xs, history['loss'], label='Training')
    plt.plot(xs, history['val_loss'], label='Validation')
    plt.xlabel('Epoch') 
    plt.ylabel('Loss') 
    plt.title('Loss')
    plt.legend()
    
    plt.subplot(1,3,2)
    plt.plot(xs, history['accuracy'], label='Training')
    plt.plot(xs, history['val_accuracy'], label='Validation')
    plt.xlabel('Epoch') 
    plt.ylabel('Accuracy')
    plt.title('Accuracy')
    plt.legend()
    
    plt.subplot(1,3,3)
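    # AUC metric names may carry a numeric suffix on later models (auc_1, auc_2, ...),
    # so look the keys up by prefix instead of relying on their position
    auc_key = next(k for k in history if k.startswith('auc'))
    val_auc_key = next(k for k in history if k.startswith('val_auc'))
    plt.plot(xs, history[auc_key], label='Training')
    plt.plot(xs, history[val_auc_key], label='Validation')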
    plt.xlabel('Epoch') 
    plt.ylabel('AUC') 
    plt.title('AUC')
    plt.legend()
    plt.tight_layout()
    
    plt.show()

In [28]:
metrics_plot(lstm_model1_hist.history)

In [25]:
lstm_model2 = Sequential([
    layers.Embedding(vocab_size + 1,64),
    Dropout(0.5),
    layers.LSTM(64),
    Dense(23, activation = 'relu'),
    Dropout(0.5),
    Dense(23, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])

lstm_model2.compile(optimizer = Adam(learning_rate=0.001),
                    loss="binary_crossentropy", 
                    metrics = ['accuracy', tf.keras.metrics.AUC()])

lstm_model2.summary()

In [26]:
lstm_model2_hist = lstm_model2.fit(x = x_train,
                                   y = y_train,
                                   validation_data = (x_val, y_val),
                                   epochs = 10,
                                   verbose = 1)

In [29]:
metrics_plot(lstm_model2_hist.history)

In [30]:
lstm_model3 = Sequential([
    layers.Embedding(vocab_size + 1,64),
    Dropout(0.5),
    layers.LSTM(64),
    Dense(23, activation = 'relu'),
    Dropout(0.5),
    Dense(23, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])

lstm_model3.compile(optimizer = Adam(learning_rate=0.0001),
                    loss="binary_crossentropy", 
                    metrics = ['accuracy', tf.keras.metrics.AUC()])

lstm_model3.summary()

In [31]:
lstm_model3_hist = lstm_model3.fit(x = x_train,
                                   y = y_train,
                                   validation_data = (x_val, y_val),
                                   epochs = 10,
                                   verbose = 1)

In [32]:
metrics_plot(lstm_model3_hist.history)

In [33]:
lstm_model4 = Sequential([
    layers.Embedding(vocab_size + 1,64),
    Dropout(0.2),
    layers.LSTM(64),
    Dense(23, activation = 'relu'),
    Dropout(0.2),
    Dense(23, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])

lstm_model4.compile(optimizer = Adam(learning_rate=0.0001),
                    loss="binary_crossentropy", 
                    metrics = ['accuracy', tf.keras.metrics.AUC()])

lstm_model4.summary()

In [34]:
lstm_model4_hist = lstm_model4.fit(x = x_train,
                                   y = y_train,
                                   validation_data = (x_val, y_val),
                                   epochs = 10,
                                   verbose = 1)

In [35]:
metrics_plot(lstm_model4_hist.history)

In [38]:
lstm_model5 = Sequential([
    layers.Embedding(vocab_size + 1,64),
    Dropout(0.2),
    layers.LSTM(64),
    Dense(23, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])

lstm_model5.compile(optimizer = Adam(learning_rate=0.0001),
                    loss="binary_crossentropy", 
                    metrics = ['accuracy', tf.keras.metrics.AUC()])

lstm_model5.summary()

lstm_model5_hist = lstm_model5.fit(x = x_train,
                                   y = y_train,
                                   validation_data = (x_val, y_val),
                                   epochs = 10,
                                   verbose = 1)

In [39]:
metrics_plot(lstm_model5_hist.history)

So far these results have not been very interesting, with little variation between models. Let's see whether making the LSTM bidirectional changes anything.

In [40]:
bid_model1 = Sequential([
    layers.Embedding(vocab_size + 1,128),
    Dropout(0.2),
    layers.Bidirectional(layers.LSTM(64)),
    Dense(23, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])

bid_model1.compile(optimizer = Adam(learning_rate=0.0001),
                    loss="binary_crossentropy", 
                    metrics = ['accuracy', tf.keras.metrics.AUC()])

bid_model1.summary()

bid_model1_hist = bid_model1.fit(x = x_train,
                                 y = y_train,
                                 validation_data = (x_val, y_val),
                                 epochs = 10,
                                 verbose = 1)

In [46]:
metrics_plot(bid_model1_hist.history)

In [41]:
bid_model2 = Sequential([
    layers.Embedding(vocab_size + 1,128),
    Dropout(0.5),
    layers.Bidirectional(layers.LSTM(64)),
    Dense(1, activation = 'sigmoid')
])

bid_model2.compile(optimizer = Adam(learning_rate=0.0001),
                    loss="binary_crossentropy", 
                    metrics = ['accuracy', tf.keras.metrics.AUC()])

bid_model2.summary()

bid_model2_hist = bid_model2.fit(x = x_train,
                                 y = y_train,
                                 validation_data = (x_val, y_val),
                                 epochs = 10,
                                 verbose = 1)

In [45]:
metrics_plot(bid_model2_hist.history)

In [42]:
bid_model3 = Sequential([
    layers.Embedding(vocab_size + 1,128),
    Dropout(0.5),
    layers.Bidirectional(layers.LSTM(64)),
    Dense(1, activation = 'sigmoid')
])

bid_model3.compile(optimizer = Adam(learning_rate=0.001),
                    loss="binary_crossentropy", 
                    metrics = ['accuracy', tf.keras.metrics.AUC()])

bid_model3.summary()

bid_model3_hist = bid_model3.fit(x = x_train,
                                 y = y_train,
                                 validation_data = (x_val, y_val),
                                 epochs = 10,
                                 verbose = 1)

In [44]:
metrics_plot(bid_model3_hist.history)

I ran every model for 10 epochs. Training was much faster than on the cancer image set; the first epoch was always the slowest, but even that was many times faster.

Disappointingly, validation accuracy did not vary much from 79%. The best-performing model appears to be the last one, but it reached its best score at the first epoch. Models with the lower learning rate of 0.0001 reached around 0.79 much more steadily than those with 0.001.

Training accuracy usually got very high toward the final epochs, but validation accuracy rarely improved consistently. Instead, it usually dropped off early, as the metric graphs for every model show. This likely means the models overfit the training data, and do so quickly.

The bidirectional models show a very slight improvement in validation accuracy over the unidirectional LSTMs. With scores this close, and with 8 models that vary several hyperparameters, it is hard to tell which direction to head. It is also notable that the bidirectional models had roughly double the parameters, mostly because I gave them a larger embedding (128 instead of 64). Maybe a different architecture would do better.

Here are the results if I were to stop each model at its highest validation accuracy:

| model | type | drop_out | learning rate | train_acc | val_acc | val_auc |
| ---- | ---- | ---- | ---- | --- | --- | --- |
| lstm_model1 | 2 Hidden | 0.2 | 0.001 | 0.8713 | 0.7905 | 0.8514 |
| lstm_model2 | 2 Hidden | 0.5 | 0.001 | 0.8997 | 0.7814 | 0.8478 |
| lstm_model3 | 2 Hidden | 0.5 | 0.0001 | 0.8568 | 0.7866 | 0.8498 |
| lstm_model4 | 2 Hidden | 0.2 | 0.0001 | 0.9112 | 0.7859 | 0.8495 |
| lstm_model5 | 1 Hidden | 0.2 | 0.0001 | 0.9039 | 0.7919 | 0.850 |
| bid_model1 | 1 Hidden | 0.2 | 0.0001 | 0.8363 | 0.7951 | 0.8540 |
| bid_model2 | 0 Hidden | 0.5 | 0.0001 | 0.8747 | 0.7938 | 0.8580 |
| bid_model3 | 0 Hidden | 0.5 | 0.001 | 0.7223 | 0.7965 | 0.8556 |
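The remark about parameter counts can be checked by hand for the recurrent layers. The sketch below uses the standard count for an LSTM layer with biases; `lstm_params` is a hypothetical helper, and the input sizes are the embedding widths used in the models above (64 for the plain LSTMs, 128 for the bidirectional ones).

```python
def lstm_params(input_dim, units):
    """Weights in one LSTM layer: 4 gates, each with input, recurrent and bias weights."""
    return 4 * (units * (input_dim + units) + units)

# Plain LSTM(64) fed by a 64-wide embedding.
unidirectional = lstm_params(input_dim=64, units=64)
# Bidirectional LSTM(64) fed by a 128-wide embedding: two independent LSTMs.
bidirectional = 2 * lstm_params(input_dim=128, units=64)

print(unidirectional, bidirectional)  # 33024 98816
```

The embedding table itself has (vocab_size + 1) × width weights, so it doubles when the width goes from 64 to 128, while the recurrent layer roughly triples.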

# Conclusion

In [51]:
callback = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=2, restore_best_weights=True)
bid_model1 = Sequential([
    layers.Embedding(vocab_size + 1,128),
    Dropout(0.2),
    layers.Bidirectional(layers.LSTM(64)),
    Dense(23, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])

bid_model1.compile(optimizer = Adam(learning_rate=0.0001),
                    loss="binary_crossentropy", 
                    metrics = ['accuracy', tf.keras.metrics.AUC()])

bid_model1.summary()

bid_model1_hist = bid_model1.fit(x = x_train,
                                 y = y_train,
                                 validation_data = (x_val, y_val),
                                 epochs = 10,
                                 callbacks=[callback],
                                 verbose = 1)

In [53]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
y_preds1 = bid_model1.predict(x_val, verbose=1)
y_hat_labels1 = np.round(y_preds1)
conf_mat1 = confusion_matrix(y_val, y_hat_labels1)/len(y_val)
sns.heatmap(conf_mat1, annot=True, linewidths=0.01,cmap="Greens",linecolor="gray")
plt.title("Confusion Matrix")

I reran the bid_model1 architecture with early stopping, since the bidirectional models performed best, although the difference was very slight. Early stopping restores the weights from the epoch with the best validation accuracy. This run stopped at a slightly higher validation accuracy and managed to break 0.8, but increases this small are within random variation and are not good indicators of a real improvement. The confusion matrix shows that this model has a harder time identifying true positives: there are nearly twice as many false negatives as false positives. The model is likely biased toward predicting "not disaster" for tweets that are about disasters. To improve, we need to try different architectures and perhaps find a way to include the keyword data.
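The effect of that imbalance between false negatives and false positives shows up as recall falling below precision. The sketch below uses made-up counts with twice as many false negatives as false positives, not the values from the heatmap above; `binary_rates` is a hypothetical helper.

```python
def binary_rates(tn, fp, fn, tp):
    """Precision and recall for the positive (disaster) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts: 180 false negatives versus 90 false positives.
precision, recall = binary_rates(tn=780, fp=90, fn=180, tp=470)
print(round(precision, 4), round(recall, 4))  # 0.8393 0.7231

# Rounding the sigmoid output corresponds to a 0.5 cut-off; a lower cut-off
# turns some false negatives into positives, at the cost of more false positives.
probs = [0.92, 0.45, 0.30, 0.55]                   # made-up predicted probabilities
labels_at_half = [int(p >= 0.5) for p in probs]    # [1, 0, 0, 1]
labels_at_04 = [int(p >= 0.4) for p in probs]      # [1, 1, 0, 1]
print(labels_at_half, labels_at_04)
```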

# Create Submission

In [54]:
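test_df_clean = preprocessPipline(test_df, 'text')

# Reuse the tokenizer fitted on the training text; unseen words map to the "unk" token
test_sequences = tokenizer.texts_to_sequences(test_df_clean['text'])

# Pad or truncate to the same length as the training matrix X (23 tokens)
test_pad = pad_sequences(test_sequences,
                         maxlen=X.shape[1],
                         truncating='post',
                         padding='post')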

In [55]:
test_y = bid_model1.predict(test_pad, verbose=1)
test_y = np.round(test_y).ravel()  # round to 0/1 labels and flatten (n, 1) to a 1-D vector

In [56]:
submission = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')
submission['target']=test_y.astype('int')
submission.head()

In [57]:
submission.to_csv('submission.csv', index=False)

# References:

https://www.kaggle.com/code/mariapushkareva/nlp-disaster-tweets-with-glove-and-lstm/notebook

https://www.kaggle.com/code/andreshg/nlp-glove-bert-tf-idf-lstm-explained

https://www.kaggle.com/code/alid3bs/lstm-vs-gru-vs-bidirectional/notebook

https://www.kaggle.com/code/danilastepochkin/disaster-tweets-dl-with-lstm-and-language-model

https://www.kaggle.com/code/ltrahul/nlp-disaster-tweets-prediction-with-nltk-and-lstm