# Problem Description

This project uses recurrent neural networks (RNNs) and natural language processing to classify sequential text data: tweets that may or may not report a natural disaster that has occurred. The training data consists of tweets, each labelled according to whether it contains information pertaining to a disaster. Because word order and context matter in this data, a bag-of-words approach is less effective: a word that is common in disaster tweets (e.g. "fire") also appears in unrelated contexts. Our task is to build a recurrent neural network that classifies these tweets accurately, then use it to label the unlabelled test tweets for submission to the Kaggle competition.
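
To make the word-order point concrete, here is a minimal sketch showing that a bag-of-words representation cannot tell apart two sentences built from the same words. The two example sentences are made up for illustration and are not taken from the competition data; `bag_of_words` is a hypothetical helper.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, discarding word order."""
    return Counter(text.lower().split())

# Hypothetical examples: same words, different meaning.
sentence_a = "storm hit the town not the coast"
sentence_b = "storm hit the coast not the town"

# The bags are identical even though the sentences say opposite things.
print(bag_of_words(sentence_a) == bag_of_words(sentence_b))  # True
# A sequence model sees the words in order, so the inputs differ.
print(sentence_a.split() == sentence_b.split())              # False
```

A recurrent network consumes the ordered token sequence, so it can in principle use the surrounding words to disambiguate.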

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import string

In [None]:
import os
os.environ['KERAS_BACKEND'] = 'tensorflow'

# EDA

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/test.csv')
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


In [None]:
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


## Data Cleaning

We remove punctuation, convert the text to lowercase, and drop stopwords (common words such as "the"). This lets us convert the remaining words to tokens that can be fed into our recurrent neural network with minimal confounds.

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)  # no-op if the corpus is already present
stop = set(stopwords.words('english'))

def clean_str(s):
    # strip punctuation
    s = s.translate(str.maketrans('', '', string.punctuation))
    # lowercase
    s = s.lower()
    # remove stopwords
    lists = [word for word in s.split() if word not in stop]
    return ' '.join(lists)

train['text'] = train['text'].astype(str)
train['text'] = train['text'].map(lambda s: clean_str(s))
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,deeds reason earthquake may allah forgive us,1
1,4,,,forest fire near la ronge sask canada,1
2,5,,,residents asked shelter place notified officer...,1
3,6,,,13000 people receive wildfires evacuation orde...,1
4,7,,,got sent photo ruby alaska smoke wildfires pou...,1
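
One side effect of stripping punctuation first is that URLs and @mentions collapse into single junk tokens (for instance, `http://t.co/abc123` becomes `httptcoabc123`). A possible refinement, sketched below under the assumption that links and mentions carry no signal for this task, removes them with regular expressions before the punctuation step. The helper `strip_urls_and_mentions` and the sample tweet are hypothetical and not part of the pipeline above.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_urls_and_mentions(s):
    """Replace links and @mentions with spaces, then collapse whitespace."""
    s = URL_RE.sub(" ", s)
    s = MENTION_RE.sub(" ", s)
    return " ".join(s.split())

# Hypothetical tweet, not from the dataset.
sample = "Wildfire near @CityNews HQ http://t.co/abc123 #evacuate"

# Punctuation stripping alone fuses the URL into one token.
naive = sample.translate(PUNCT_TABLE).lower()
# Removing links and mentions first leaves only ordinary words.
refined = strip_urls_and_mentions(sample).translate(PUNCT_TABLE).lower()

print(naive)    # wildfire near citynews hq httptcoabc123 evacuate
print(refined)  # wildfire near hq evacuate
```

Whether this helps is an empirical question; fused URL tokens are usually unique per tweet, so they mostly add vocabulary entries without a matching GloVe vector.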


## Preprocessing

### Tokenizing

We now convert the cleaned text (stripped of punctuation, capitalization, and stopwords) to sequences of integer word indices using the Keras Tokenizer, then pad the sequences to a common length.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_features=5000
tokenizer=Tokenizer(num_words=max_features,split=' ')
tokenizer.fit_on_texts(train['text'].values)
X = tokenizer.texts_to_sequences(train['text'].values)
X = pad_sequences(X)
y = train['target']
X

array([[   0,    0,    0, ..., 1454, 4391,   13],
       [   0,    0,    0, ...,  134,  575, 1237],
       [   0,    0,    0, ...,  547, 1238,  950],
       ...,
       [   0,    0,    0, ..., 4382,  482, 1443],
       [   0,    0,    0, ..., 1074, 2447,  203],
       [   0,    0,    0, ...,   79,  596,   10]], dtype=int32)
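
The leading zeros in the array above come from padding. As a rough illustration of what the tokenizing and padding steps do, the pure-Python sketch below mimics the default behaviour as I understand it: words are indexed from 1 in order of decreasing frequency, index 0 is reserved for padding, only indices below `num_words` are kept, and shorter sequences are padded on the left. The helper names and toy sentences are hypothetical; this is not the Keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an index, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by indices, dropping unknown or out-of-range words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

def pad_pre(sequences):
    """Left-pad every sequence with zeros to the length of the longest one."""
    maxlen = max(len(s) for s in sequences)
    return [[0] * (maxlen - len(s)) + s for s in sequences]

toy_texts = ["forest fire near town", "fire crews contain forest fire"]
toy_index = build_word_index(toy_texts)
print(toy_index)
# {'fire': 1, 'forest': 2, 'near': 3, 'town': 4, 'crews': 5, 'contain': 6}
print(pad_pre(to_sequences(toy_texts, toy_index)))
# [[0, 2, 1, 3, 4], [1, 5, 6, 2, 1]]
print(pad_pre(to_sequences(toy_texts, toy_index, num_words=4)))
# [[2, 1, 3], [1, 2, 1]]
```

The last call shows the effect of a vocabulary cap such as `max_features`: rarer words are silently dropped from the sequences.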

### Embedding - GloVe

To save training time, we initialize the embedding layer with weights that were trained on a much larger corpus of text. For this purpose, we use the pretrained GloVe vectors downloaded from Stanford.

In [None]:
# Source: https://www.kaggle.com/code/mariapushkareva/nlp-disaster-tweets-with-glove-and-lstm

embeddings_dictionary = dict()
embedding_dim = 100
# Each line of the GloVe file holds a word followed by its vector components.
with open('/content/drive/MyDrive/Colab Notebooks/glove.6B.100d.txt', encoding='utf-8') as glove:
    for line in glove:
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

In [None]:
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, embedding_dim))

for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
embedding_matrix.shape

(22565, 100)
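
Words that are missing from GloVe keep an all-zero row in this matrix, so it can be useful to check how much of the vocabulary actually receives a pretrained vector. The sketch below repeats the matrix construction on toy stand-ins (`toy_word_index` and `toy_glove` are hypothetical, with 3-dimensional vectors) and adds a hypothetical `coverage` helper that counts non-zero rows.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 is padding."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, index in word_index.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[index] = vector
    return matrix

def coverage(matrix):
    """Fraction of vocabulary rows (padding row excluded) with a non-zero vector."""
    found = np.count_nonzero(np.any(matrix[1:] != 0, axis=1))
    return found / (matrix.shape[0] - 1)

# Toy stand-ins for tokenizer.word_index and the GloVe dictionary.
toy_word_index = {"fire": 1, "forest": 2, "httptcoabc": 3, "flood": 4}
toy_glove = {
    "fire": np.array([0.1, 0.2, 0.3]),
    "forest": np.array([0.4, 0.5, 0.6]),
    "flood": np.array([0.7, 0.8, 0.9]),
}

toy_matrix = build_embedding_matrix(toy_word_index, toy_glove, dim=3)
print(toy_matrix.shape)      # (5, 3)
print(coverage(toy_matrix))  # 0.75
```

In the notebook, calling `coverage(embedding_matrix)` would report the corresponding share for the real vocabulary.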

# Modeling

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state =42)

In [None]:
from keras import models, layers

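## Base Model - LSTM

For our model, we will use an LSTM, which stands for long short-term memory. This is a recurrent neural network whose gated memory cell learns how much earlier context to keep or forget at each step, so the effective context window varies with how relevant the surrounding information is. Our model feeds the GloVe-initialized embeddings into a bidirectional LSTM, followed by global max pooling, batch normalization, and two dense layers with dropout.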
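
To illustrate how an LSTM decides what to keep and what to forget, the following NumPy sketch implements a single time step of a standard LSTM cell. It is an illustration of the textbook equations, not the Keras implementation; the gate ordering inside the weight matrices and the random toy weights are arbitrary choices for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x: (input_dim,), h_prev and c_prev: (units,)
    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,)
    Gate order in this sketch: input, forget, candidate, output.
    """
    z = x @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g   # forget part of the old memory, add new content
    h = o * np.tanh(c)       # expose a gated view of the memory
    return h, c

# Run a toy sequence of 5 random "word vectors" through the cell.
rng = np.random.default_rng(0)
input_dim, units = 3, 2
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (2,) (2,)
```

The forget gate `f` scales the previous cell state, which is what lets the network retain information over a variable number of steps.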

In [None]:
lstm_model = models.Sequential([
    layers.Embedding(input_dim=22565,
        output_dim=100,
        weights = [embedding_matrix]),
    layers.Bidirectional(layers.LSTM(
        100,
        return_sequences = True,
        recurrent_dropout=0.2)),
    layers.GlobalMaxPool1D(),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(100, activation = "relu"),
    layers.Dropout(0.5),
    layers.Dense(100, activation = "relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation = 'sigmoid'),


])

lstm_model.compile(optimizer='rmsprop',
                       loss='binary_crossentropy',
                       metrics=['accuracy'])

lstm_model.summary()





In [None]:
lstm_model.fit(X_train, y_train, epochs = 10, validation_data=(X_test, y_test))

Epoch 1/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 29s 107ms/step - accuracy: 0.6325 - loss: 0.7896 - val_accuracy: 0.8004 - val_loss: 0.5536
Epoch 2/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 40s 102ms/step - accuracy: 0.7551 - loss: 0.5380 - val_accuracy: 0.8050 - val_loss: 0.4743
Epoch 3/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 22s 112ms/step - accuracy: 0.7830 - loss: 0.4718 - val_accuracy: 0.7984 - val_loss: 0.4662
Epoch 4/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 41s 113ms/step - accuracy: 0.8146 - loss: 0.4438 - val_accuracy: 0.8076 - val_loss: 0.4377
Epoch 5/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 20s 105ms/step - accuracy: 0.8309 - loss: 0.4093 - val_accuracy: 0.8030 - val_loss: 0.4511
Epoch 6/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 22s 111ms/step - accuracy: 0.8339 - loss: 0.3952 - val_accuracy: 0.8148 - val_loss: 0.4559
Epoch 7/10

<keras.src.callbacks.history.History at 0x7df1fe11a4d0>

# Results and Analysis

## Comparison of Different Architectures

### GRU

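Here we compare our LSTM model to another recurrent architecture, the GRU (gated recurrent unit). In the epochs shown, the GRU's validation loss bottoms out one epoch earlier than the LSTM's (epoch 3 versus epoch 4), which suggests it overfits slightly sooner, and its validation accuracy is not noticeably better.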
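
One structural difference between the two architectures is size: an LSTM layer has four gated transformations while a GRU has three, so the GRU carries fewer recurrent weights. The sketch below counts trainable parameters for the layer sizes used in this notebook (100-dimensional input, 100 units, two directions). The helper functions are hypothetical, and the GRU formula assumes the Keras default `reset_after=True`, which keeps separate input and recurrent biases.

```python
def lstm_params(input_dim, units):
    """4 gates, each with an input kernel, a recurrent kernel, and a bias."""
    return 4 * (input_dim * units + units * units + units)

def gru_params(input_dim, units, reset_after=True):
    """3 gates; reset_after=True doubles the bias vectors."""
    n_bias = 2 * units if reset_after else units
    return 3 * (input_dim * units + units * units + n_bias)

# A Bidirectional wrapper holds one forward and one backward copy.
print(2 * lstm_params(100, 100))  # 160800
print(2 * gru_params(100, 100))   # 121200
```

These counts can be compared against the recurrent-layer rows of `lstm_model.summary()` and `gru_model.summary()`.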

In [None]:
gru_model = models.Sequential([
    layers.Embedding(input_dim=22565,
        output_dim=100,
        weights = [embedding_matrix]),
    layers.Bidirectional(layers.GRU(
        100,
        return_sequences = True,
        recurrent_dropout=0.2)),
    layers.GlobalMaxPool1D(),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(100, activation = "relu"),
    layers.Dropout(0.5),
    layers.Dense(100, activation = "relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation = 'sigmoid'),


])

gru_model.compile(optimizer='rmsprop',
                       loss='binary_crossentropy',
                       metrics=['accuracy'])

gru_model.fit(X_train, y_train, epochs = 10, validation_data=(X_test, y_test))

gru_model.summary()

Epoch 1/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 29s 108ms/step - accuracy: 0.6351 - loss: 0.7432 - val_accuracy: 0.7761 - val_loss: 0.5608
Epoch 2/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 20s 107ms/step - accuracy: 0.7529 - loss: 0.5319 - val_accuracy: 0.8168 - val_loss: 0.4767
Epoch 3/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 24s 123ms/step - accuracy: 0.7923 - loss: 0.4679 - val_accuracy: 0.8181 - val_loss: 0.4289
Epoch 4/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 38s 106ms/step - accuracy: 0.8210 - loss: 0.4263 - val_accuracy: 0.8050 - val_loss: 0.4599
Epoch 5/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 20s 104ms/step - accuracy: 0.8300 - loss: 0.3997 - val_accuracy: 0.8181 - val_loss: 0.4410
Epoch 6/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 21s 106ms/step - accuracy: 0.8345 - loss: 0.3998 - val_accuracy: 0.8155 - val_loss: 0.4572
Epoch 7/10

## Performance

## Hyperparameter Tuning

The hyperparameters for the LSTM model, which appears to perform better than the GRU, can now be tuned using the KerasTuner package.

In [None]:
!pip install keras_tuner
import keras_tuner as kt

Collecting keras_tuner
  Downloading keras_tuner-1.4.7-py3-none-any.whl.metadata (5.4 kB)
Collecting kt-legacy (from keras_tuner)
  Downloading kt_legacy-1.0.5-py3-none-any.whl.metadata (221 bytes)
Downloading keras_tuner-1.4.7-py3-none-any.whl (129 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.1/129.1 kB 1.8 MB/s eta 0:00:00
Downloading kt_legacy-1.0.5-py3-none-any.whl (9.6 kB)
Installing collected packages: kt-legacy, keras_tuner
Successfully installed keras_tuner-1.4.7 kt-legacy-1.0.5


In [None]:
def model_builder(hp):
    # Tune the recurrent dropout rate of the LSTM layer
    # Choose an optimal value between 0.1 and 0.5 in steps of 0.1
    hp_dropout = hp.Float('dropout', min_value=.1, max_value=.5, step=.1)

    model = models.Sequential([
    layers.Embedding(input_dim=22565,
        output_dim=100,
        weights = [embedding_matrix]),
    layers.Bidirectional(layers.LSTM(
        100,
        return_sequences = True,
        recurrent_dropout=hp_dropout)),
    layers.GlobalMaxPool1D(),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(100, activation = "relu"),
    layers.Dropout(0.5),
    layers.Dense(100, activation = "relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation = 'sigmoid'),
    ])

    # Tune the learning rate for the optimizer
    # Choose an optimal value from 0.01, 0.001, or 0.0001
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
                loss='binary_crossentropy',
                metrics=['accuracy'])


    return model

In [None]:
tuner = kt.Hyperband(model_builder,
                     objective='val_accuracy',
                     max_epochs=30,
                     factor=3)

In [None]:
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
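
The callback above ends a training run once `val_loss` has failed to improve for 5 consecutive epochs. As a rough illustration of that patience rule (a simplified sketch, not the Keras implementation), the hypothetical helper below replays the validation losses printed for the base LSTM run and reports when training would stop and which epoch was best.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch); stop_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# val_loss values from the first six epochs of the base LSTM run.
lstm_val_loss = [0.5536, 0.4743, 0.4662, 0.4377, 0.4511, 0.4559]

print(early_stop_epoch(lstm_val_loss, patience=2))  # (6, 4)
print(early_stop_epoch(lstm_val_loss, patience=5))  # (None, 4)
```

With `patience=5`, six epochs are not enough to trigger a stop, so a smaller patience may be worth considering when each trial only runs for a handful of epochs.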

In [None]:
tuner.search(X_train, y_train, epochs = 10, validation_data=(X_test, y_test), callbacks=[stop_early])

Trial 15 Complete [00h 01m 03s]
val_accuracy: 0.7892317771911621

Best val_accuracy So Far: 0.8128693103790283
Total elapsed time: 00h 19m 38s


In [None]:
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]
print(f"""
The hyperparameter search is complete. The optimal recurrent dropout rate
for the LSTM layer is {best_hps.get('dropout')} and the optimal learning rate for the optimizer
is {best_hps.get('learning_rate')}.
""")
tuner.search_space_summary(extended=True)


The hyperparameter search is complete. The optimal recurrent dropout rate
for the LSTM layer is 0.5 and the optimal learning rate for the optimizer
is 0.001.

Search space summary
Default search space size: 2
dropout (Float)
{'default': 0.1, 'conditions': [], 'min_value': 0.1, 'max_value': 0.5, 'step': 0.1, 'sampling': 'linear'}
learning_rate (Choice)
{'default': 0.01, 'conditions': [], 'values': [0.01, 0.001, 0.0001], 'ordered': True}


## Best Model

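After tuning the recurrent dropout and learning rate for the LSTM, we can train our best model on the training data in order to process the test data. The model below uses the tuned learning rate of 0.001; note that it keeps a recurrent dropout of 0.2 rather than the tuned value of 0.5.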

In [None]:
best_model = models.Sequential([
    layers.Embedding(input_dim=22565,
        output_dim=100,
        weights = [embedding_matrix]),
    layers.Bidirectional(layers.LSTM(
        100,
        return_sequences = True,
        recurrent_dropout=0.2)),
    layers.GlobalMaxPool1D(),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(100, activation = "relu"),
    layers.Dropout(0.5),
    layers.Dense(100, activation = "relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation = 'sigmoid'),


])

best_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=.001),
                       loss='binary_crossentropy',
                       metrics=['accuracy'])


best_model.fit(X_train, y_train, epochs = 5, validation_data=(X_test, y_test))

Epoch 1/5
191/191 ━━━━━━━━━━━━━━━━━━━━ 37s 123ms/step - accuracy: 0.6479 - loss: 0.7337 - val_accuracy: 0.7978 - val_loss: 0.5861
Epoch 2/5
191/191 ━━━━━━━━━━━━━━━━━━━━ 41s 121ms/step - accuracy: 0.7514 - loss: 0.5260 - val_accuracy: 0.8142 - val_loss: 0.4725
Epoch 3/5
191/191 ━━━━━━━━━━━━━━━━━━━━ 41s 122ms/step - accuracy: 0.8055 - loss: 0.4476 - val_accuracy: 0.8122 - val_loss: 0.4399
Epoch 4/5
191/191 ━━━━━━━━━━━━━━━━━━━━ 40s 117ms/step - accuracy: 0.8321 - loss: 0.4007 - val_accuracy: 0.8142 - val_loss: 0.4547
Epoch 5/5
191/191 ━━━━━━━━━━━━━━━━━━━━ 21s 108ms/step - accuracy: 0.8375 - loss: 0.3989 - val_accuracy: 0.8188 - val_loss: 0.4370


<keras.src.callbacks.history.History at 0x7df1f219c050>

## Submission

In [None]:
test['text'] = test['text'].astype(str)
test['text'] = test['text'].map(lambda s: clean_str(s))
test_data = tokenizer.texts_to_sequences(test['text'].values)
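# Pad to the same length as the training sequences so train and test inputs match
test_data = pad_sequences(test_data, maxlen=X.shape[1])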

In [None]:
submission = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/sample_submission.csv')
# predict returns sigmoid probabilities with shape (n, 1); flatten and threshold at 0.5
probabilities = best_model.predict(test_data).ravel()
submission['target'] = (probabilities > 0.5).astype(int)
submission.to_csv("/content/drive/MyDrive/Colab Notebooks/submission.csv", index=False)

102/102 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step


# Conclusion

While the two model architectures, LSTM and GRU, performed largely similarly, and hyperparameter tuning yielded only minimal improvement, these types of models appear to classify this kind of natural language data consistently at approximately 80% validation accuracy.

# References
1. https://www.kaggle.com/code/alexia/kerasnlp-starter-notebook-disaster-tweets
2. https://keras.io/examples/nlp/masked_language_modeling/
3. https://www.kaggle.com/code/tuckerarrants/disaster-tweets-eda-glove-rnns-bert#2.-Simple-LSTM-Model
4. https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
5. https://www.kaggle.com/code/andreshg/nlp-glove-bert-tf-idf-lstm-explained#7.-LSTM
6. https://towardsdatascience.com/mamba-ssm-theory-and-implementation-in-keras-and-tensorflow-32d6d4b32546/