NLP Disaster Tweets Analysis

Week 4 Assignment

In [None]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.layers import Dense, LSTM, Embedding, Bidirectional, Dropout
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

Introduction: 
The first step in this analysis is to read the data into the Python script from the files provided online. Once the files are loaded, we can explore the data and get a better sense of its structure.

In [None]:
# Load Dataset
dataset = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

sentences = dataset['text']
target = dataset['target']
target = np.array(target)

EDA:

We have both train and test datasets. The training set contains five columns: id, keyword (a keyword from the tweet), location (the location the tweet was sent from), text (the text of the tweet), and target (1 if the tweet is about a real disaster and 0 if it is not). The test set contains the same columns except target.
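
As a quick check of the column description above, the sketch below summarizes the dtype and missing-value count of each column and the class balance of the target. It runs on a tiny hand-made frame with the same column names; the rows are invented for illustration and are not taken from the dataset, and `summarize_columns` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and missing-value count of every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })


# Toy rows that mimic the training schema; all values are invented.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": [None, "flood", "fire", None],
    "location": [None, "Texas", None, None],
    "text": [
        "Forest fire near the lake",
        "Flood warning issued",
        "This mixtape is fire",
        "What a day",
    ],
    "target": [1, 1, 0, 0],
})

summary = summarize_columns(toy)
class_balance = toy["target"].value_counts(normalize=True)

print(summary)
print(class_balance)
```

Running the same two calls on `dataset` instead of `toy` would show how sparse the keyword and location columns are and whether the two classes are roughly balanced.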

The code below reports the following shape and memory usage for each dataset:

Training Set Shape = (7613, 5)
Training Set Memory Usage = 0.29 MB
Test Set Shape = (3263, 4)
Test Set Memory Usage = 0.10 MB

Additionally, here are summary statistics for the length of each tweet, measured in characters:

Train Length Stat
count    7613.000000
mean      101.037436
std        33.781325
min         7.000000
25%        78.000000
50%       107.000000
75%       133.000000
max       157.000000
Name: length, dtype: float64

Test Length Stat
count    3263.000000
mean      102.108183
std        33.972158
min         5.000000
25%        78.000000
50%       109.000000
75%       134.000000
max       151.000000
Name: length, dtype: float64

In [None]:
print('Training Set Shape = {}'.format(dataset.shape))
print('Training Set Memory Usage = {:.2f} MB'.format(dataset.memory_usage().sum() / 1024**2))
print('Test Set Shape = {}'.format(test.shape))
print('Test Set Memory Usage = {:.2f} MB'.format(test.memory_usage().sum() / 1024**2))

dataset.head()

In [None]:
# Character length of each tweet
dataset["length"] = dataset["text"].str.len()
test["length"] = test["text"].str.len()

print("Train Length Stat")
print(dataset["length"].describe())
print()

print("Test Length Stat")
print(test["length"].describe())
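
The statistics above count characters, while the padding length used later counts tokens. A minimal sketch of the difference, using two invented example strings rather than rows from the dataset, and whitespace splitting as a rough stand-in for the tokenizer:

```python
import pandas as pd

# Invented example tweets, for illustration only.
toy_text = pd.Series(["Forest fire near the lake", "Flood warning"])

char_len = toy_text.str.len()           # number of characters per tweet
word_len = toy_text.str.split().str.len()  # number of whitespace-separated words

print(pd.DataFrame({"chars": char_len, "words": word_len}))
```

Applying `word_len.describe()` to the real `text` column would give a more direct basis for choosing the sequence length than the character counts do.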

In [None]:
# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
vocab_size = len(tokenizer.word_index) + 1

encoded_text = tokenizer.texts_to_sequences(sentences)
padded_text = pad_sequences(encoded_text, maxlen=65, padding='post')
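
To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences` with `padding='post'` does to integer sequences, assuming the Keras default `truncating='pre'` (a sequence longer than `maxlen` keeps its last `maxlen` tokens). `pad_post` is a hypothetical stand-in written for illustration, not the Keras implementation, and the token ids are made up.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value`; truncate from the front if too long."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]             # 'pre' truncation keeps the tail
        out[row, :len(trimmed)] = trimmed   # 'post' padding leaves the fill value at the end
    return out


# Made-up token ids: a short, a very short, and an over-long sequence.
toy_sequences = [[4, 9, 2], [7], [1, 2, 3, 4, 5, 6]]
padded = pad_post(toy_sequences, maxlen=4)
print(padded)
```

Index 0 is reserved for padding, which is why the notebook sets `vocab_size` to `len(tokenizer.word_index) + 1`.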

Model Architecture: 

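The model is a stacked bidirectional LSTM classifier. Tokenized tweets, padded to 65 tokens, pass through a 64-dimensional embedding layer, then two bidirectional LSTM layers (64 and 32 units; the second uses dropout and recurrent dropout of 0.2), a 64-unit dense layer with ReLU activation, a dropout layer (0.2), and a single sigmoid output unit. The model is trained with the Adam optimizer and binary cross-entropy loss. Please see the EDA section above for specific details on the dataset.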


In [None]:
model = Sequential([
    Embedding(vocab_size, 64, input_length=65),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(32, dropout=0.2, recurrent_dropout=0.2)),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
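
The parameter counts that `model.summary()` prints can be reproduced by hand. The sketch below applies the standard formulas for LSTM and dense layers to the layer sizes used above. The helper names are hypothetical, and `vocab_size_example` is an invented placeholder, since the real vocabulary size comes from the fitted tokenizer.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim, units):
    # A bidirectional wrapper holds one forward and one backward LSTM.
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


embed_dim = 64
vocab_size_example = 20_000  # placeholder; use the notebook's vocab_size in practice

counts = {
    "embedding": vocab_size_example * embed_dim,
    "bilstm_64": bidirectional_lstm_params(embed_dim, 64),
    "bilstm_32": bidirectional_lstm_params(2 * 64, 32),  # input is the 128-wide BiLSTM output
    "dense_64": dense_params(2 * 32, 64),                # input is the 64-wide BiLSTM output
    "dense_1": dense_params(64, 1),
}

for name, n in counts.items():
    print(f"{name:>10}: {n:,}")
```

With any realistic vocabulary size, the embedding layer holds far more parameters than the rest of the network combined, which is worth keeping in mind on a training set of 7,613 tweets.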

In [None]:
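# Learning Rate Schedule and Early Stopping
# Note: both callbacks use patience=3, so training will usually stop in the same
# epoch in which the learning rate is first reduced. A smaller patience for
# reduce_lr would give the lower learning rate a chance to take effect.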
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=1e-6)
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

In [None]:
# Model Training
x = padded_text
y = target

model.fit(x, y, epochs=10, batch_size=64, validation_split=0.2, callbacks=[reduce_lr, early_stopping])
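
`model.fit` returns a `History` object whose `.history` attribute is a dict of per-epoch metrics, which makes it easy to see where validation loss bottomed out. The sketch below shows one way to read it. `best_epoch` is a hypothetical helper, and the numbers in `example_history` are invented for illustration; they are not results from this notebook.

```python
def best_epoch(history_dict, monitor="val_loss"):
    """Return (1-based epoch, value) at which the monitored metric was lowest."""
    values = history_dict[monitor]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]


# Invented per-epoch values with the same keys Keras would report.
example_history = {
    "loss": [0.62, 0.41, 0.28, 0.19],
    "val_loss": [0.55, 0.47, 0.52, 0.61],
}

epoch, value = best_epoch(example_history)
print(f"Lowest val_loss {value:.2f} at epoch {epoch}")
```

In the notebook this would be used as `history = model.fit(...)` followed by `best_epoch(history.history)`. A training loss that keeps falling while validation loss rises, as in the invented numbers here, is the usual sign of overfitting.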

In [None]:
# Test Data
test_sentences = test['text']
encoded_test_text = tokenizer.texts_to_sequences(test_sentences)
padded_test_text = pad_sequences(encoded_test_text, maxlen=65, padding='post')

In [None]:
# Prediction
predictions = model.predict(padded_test_text)
binary_pred = (predictions > 0.5).astype(int)

ids = test['id']
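
The thresholding step above turns sigmoid outputs of shape `(n, 1)` into 0/1 labels. A small sketch with made-up probabilities shows the shapes involved and that a value of exactly 0.5 is mapped to 0 because the comparison is strict:

```python
import numpy as np

# Made-up sigmoid outputs with shape (n, 1), matching what model.predict returns here.
example_probs = np.array([[0.91], [0.12], [0.50], [0.73]])

example_labels = (example_probs > 0.5).astype(int)  # still shape (n, 1)
flat_labels = example_labels.flatten()              # shape (n,), as needed for a DataFrame column

print(example_labels.shape, flat_labels)
```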

Results and Analysis

The model's submission scored 0.654.
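
The notebook does not name the metric behind this score. Assuming it is an F1 score (an assumption, not something stated above), the sketch below shows how such a score could be computed locally on held-out labels before submitting. `f1_score_binary` is a hypothetical helper and the labels are invented.

```python
import numpy as np


def f1_score_binary(y_true, y_pred):
    """F1 for binary labels: harmonic mean of precision and recall."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # predicted 1, actually 1
    fp = np.sum(~y_true & y_pred)   # predicted 1, actually 0
    fn = np.sum(y_true & ~y_pred)   # predicted 0, actually 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))


# Invented labels and predictions, for illustration only.
y_true_example = [1, 0, 1, 1, 0, 0]
y_pred_example = [1, 0, 0, 1, 1, 0]

f1_example = f1_score_binary(y_true_example, y_pred_example)
print(f"F1 = {f1_example:.3f}")
```

Scoring the last 20% of the training data this way, since that is the portion `validation_split=0.2` holds out, would give a local estimate to compare against the submitted score.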

In [None]:
# Save predictions to CSV file
submission_df = pd.DataFrame({'id': ids, 'target': binary_pred.flatten()})
submission_df.to_csv('submission.csv', index=False)

Conclusion

Looking back on this model and assignment, there is definitely room for improvement. That said, I was pleasantly surprised by how the model turned out, and I learned more about how to implement code like this to analyze text.

In my next assignment, I would love to learn more about ways to tune the model so that it better captures the themes and sentiment of the texts.