# 1. Title : Natural Language Processing with Disaster Tweets

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g., disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But, it’s not always clear whether a person’s words are actually announcing a disaster.

The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

# 2. Importing required Libraries

In [1]:
import re
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords 

import warnings
warnings.filterwarnings('ignore')

import keras
from keras.initializers import Constant
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, Sequential
from tensorflow.keras import optimizers
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.layers import LSTM

# 3. Loading the data

In [2]:
train_data = pd.read_csv('../input/nlp-getting-started/train.csv', dtype={'id': np.int16, 'target': np.int16})
test_data = pd.read_csv('../input/nlp-getting-started/test.csv', dtype={'id': np.int16})

# 4. EDA

**1) Check missing values**

In [3]:
train_data.isnull().sum()

Missing values exist in the keyword and location columns.

In [4]:
miss_cols = ['keyword', 'location']

fig, axes = plt.subplots(2,figsize=(10, 15))

# Top panel: missing counts in the training set; bottom panel: missing counts in the test set
sns.barplot(x=train_data[miss_cols].isnull().sum().index, y=train_data[miss_cols].isnull().sum().values, ax=axes[0])
sns.barplot(x=test_data[miss_cols].isnull().sum().index, y=test_data[miss_cols].isnull().sum().values, ax=axes[1])

axes[0].set_ylabel('Missing Count', size=15, labelpad=20)
axes[0].tick_params(axis='x', labelsize=15)
axes[0].tick_params(axis='y', labelsize=15)
axes[1].set_ylabel('Missing Count', size=15, labelpad=20)
axes[1].tick_params(axis='x', labelsize=15)
axes[1].tick_params(axis='y', labelsize=15)

axes[0].set_title('Train data', fontsize=15)
axes[1].set_title('Test data', fontsize=15)

plt.show()

The location column has many missing values.

In [5]:
train_data.groupby('target').count()['id']

There are more tweets in class 0 (no disaster) than in class 1 (disaster).

In [6]:
train_x = train_data['text'].copy()
train_y = train_data['target'].copy()

Class 0 is more frequent than class 1, but the difference is not large.
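
The size of the imbalance can be expressed as class fractions, which also gives a reference point for accuracy: a model that always predicts the majority class scores the majority fraction. The sketch below uses hypothetical labels (not the competition data), and `class_balance` is an illustrative helper name, not part of the notebook.

```python
from collections import Counter

def class_balance(labels):
    """Return {label: fraction of samples} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels for illustration only: 6 non-disaster, 4 disaster.
toy_labels = [0] * 6 + [1] * 4

balance = class_balance(toy_labels)
# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(balance.values())

print(balance)
print(majority_baseline)
```

On the real data the same function could be called as `class_balance(train_data['target'])`.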

**2) Data Cleaning**

The tweet text must be cleaned before it can be used for modeling. The function below removes URLs, HTML tags, hashtags, user mentions, digits, and English stop words.

In [8]:
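stop = stopwords.words('english')
def clean(text):
    # Remove URLs
    text = re.sub(r'http\S+', ' ', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    # Remove hashtags (the tagged word is dropped as well)
    text = re.sub(r'#\w+', ' ', text)
    # Remove user mentions
    text = re.sub(r'@\w+', ' ', text)
    # Remove digits
    text = re.sub(r'\d+', ' ', text)
    # Drop English stop words (case-sensitive match against NLTK's lowercase list)
    text = text.split()
    text = ' '.join([word for word in text if word not in stop])
    return text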

Apply the cleaning function (including stop-word removal)

In [9]:
train_x_cleaned = train_x.apply(clean)
train_x_cleaned.head()
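
To see what the cleaning steps do to a single tweet, here is a minimal self-contained sketch. It repeats the same regular expressions, but the stop-word set is a tiny hard-coded stand-in for NLTK's list, and both the sample tweet and the name `clean_sketch` are made up for illustration.

```python
import re

# Tiny stand-in for nltk's English stop-word list (assumption for this sketch).
TOY_STOP_WORDS = {"the", "is", "a", "in", "on", "at"}

def clean_sketch(text, stop_words=TOY_STOP_WORDS):
    text = re.sub(r'http\S+', ' ', text)  # URLs
    text = re.sub(r'<.*?>', ' ', text)    # HTML tags
    text = re.sub(r'#\w+', ' ', text)     # hashtags, including the tagged word
    text = re.sub(r'@\w+', ' ', text)     # user mentions
    text = re.sub(r'\d+', ' ', text)      # digits
    # The membership test is case-sensitive, so "The" is not treated as "the".
    return ' '.join(word for word in text.split() if word not in stop_words)

# Hypothetical tweet, not taken from the dataset.
sample = "The fire is spreading in #California http://t.co/abc @user 911"
cleaned = clean_sketch(sample)
print(cleaned)  # The fire spreading
```

Note that the capitalized "The" survives because the comparison is case-sensitive; lowercasing the text before filtering would remove it as well.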

Maximum length of a cleaned tweet (in characters)

In [10]:
max_len = max(train_x_cleaned.apply(len))
print('max length: {}'.format(max_len))
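
Because `len` applied to a string counts characters, the value computed above is a character count, while the tokenizer later produces one integer per word. The sketch below, on hypothetical cleaned tweets, contrasts the two measures; a token-based maximum would give a tighter padding length, though this is a suggestion rather than what the notebook does.

```python
# Hypothetical cleaned tweets, for illustration only.
cleaned_tweets = [
    "forest fire near la ronge",
    "flood",
    "residents asked shelter place",
]

# Length in characters (what len() on a string measures).
max_chars = max(len(tweet) for tweet in cleaned_tweets)

# Length in whitespace-separated tokens (what a word-level tokenizer emits).
max_tokens = max(len(tweet.split()) for tweet in cleaned_tweets)

print('max characters:', max_chars)  # 29
print('max tokens:', max_tokens)     # 5
```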

**3) Tokenize**

The text must be vectorized: each tweet is converted to a sequence of integer word indices and padded to a fixed length.
The Keras Tokenizer is used for this.

In [11]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_x_cleaned)
vocab_size = len(tokenizer.word_index) + 1
x = tokenizer.texts_to_sequences(train_x_cleaned)
x = pad_sequences(x, max_len, padding='post')
y = train_y
# Print the cleaned text and the encoded sequence for the same tweet
print('train_x_clean:', train_x_cleaned[4])
print('*'*50)
print('x:',x[4])
print('vocabulary size:{}'.format(vocab_size))

In [12]:
x.shape

In [13]:
y.shape
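
The tokenization and padding steps can be illustrated without Keras. The sketch below is a simplified approximation, not the library's implementation: it assigns word ids by descending frequency starting at 1 (0 is reserved for padding) and pads at the end of each sequence, as `padding='post'` does. The helper names and toy texts are hypothetical, and details such as punctuation filtering are omitted.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a 1-based id, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def texts_to_padded(texts, word_index, max_len):
    """Encode texts as id lists, zero-padded at the end to max_len."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if len(ids) > max_len:
            # Keep the last max_len ids (truncate from the front).
            ids = ids[len(ids) - max_len:]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows

# Hypothetical cleaned tweets.
toy_texts = ["fire fire flood", "flood warning"]

toy_index = fit_word_index(toy_texts)
toy_vocab_size = len(toy_index) + 1  # +1 for the padding id 0
toy_x = texts_to_padded(toy_texts, toy_index, max_len=4)

print(toy_index)       # {'fire': 1, 'flood': 2, 'warning': 3}
print(toy_vocab_size)  # 4
print(toy_x)           # [[1, 1, 2, 0], [2, 3, 0, 0]]
```

The `+ 1` mirrors the `vocab_size` computation above: the embedding layer needs one row per word id plus one for the padding id.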

# 5. Create Models

**model 1**

GRU implementation with basic embedding layer

In [14]:
epoch_size =10
batch_size = 32
embedding_dim = 16
optimizer = optimizers.Adam(learning_rate=3e-4)

model = Sequential([
    layers.Embedding(vocab_size, embedding_dim, input_length=max_len),
    layers.Bidirectional(layers.GRU(256, return_sequences=True)),
    layers.GlobalMaxPool1D(),
#     layers.Dense(128, activation='relu'),
#     layers.Dropout(0.4),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(2, activation='sigmoid')
])
model.summary()

Training the model

In [15]:
model.compile(loss='sparse_categorical_crossentropy', optimizer = 'adam', metrics=['accuracy'])
history = model.fit(x, y, epochs=epoch_size, validation_split=0.1)

In [16]:
test_x = test_data['text'].copy()
test_x = test_x.apply(clean)
test_x = tokenizer.texts_to_sequences(test_x)
test_x = pad_sequences(test_x, max_len, padding='post')

Prediction

In [17]:
test_pred = np.argmax(model.predict(test_x), axis=1)
print(test_pred)

Check the loss and accuracy

In [18]:
history1 = history.history

trg_loss = history1['loss']
val_loss = history1['val_loss']

trg_acc = history1['accuracy']
val_acc = history1['val_accuracy']

epochs = range(1, len(trg_acc) + 1)

# plot losses and accuracies for training and validation 
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(1, 2, 1)
plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
plt.title("Training / Validation Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Epochs")
plt.legend(loc='best')

ax = fig.add_subplot(1, 2, 2)
plt.plot(epochs, trg_acc, marker='o', label='Training Accuracy')
plt.plot(epochs, val_acc, marker='^', label='Validation Accuracy')
plt.title("Training / Validation Accuracy")
ax.set_ylabel("Accuracy")
ax.set_xlabel("Epochs")
plt.legend(loc='best')
plt.show()

On the training data, the loss is small and the accuracy is high, which is the desired state. On the validation data, however, the loss does not converge and the values are poor, which suggests the model is overfitting.
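
When validation loss stops improving while training loss keeps falling, one common response is to keep the model from the epoch with the lowest validation loss. Keras provides an `EarlyStopping` callback for this; the sketch below only shows the selection logic on a dictionary shaped like `history.history`. The numbers are invented for illustration and `best_epoch` is a hypothetical helper.

```python
def best_epoch(history, monitor="val_loss"):
    """Return (1-based epoch, value) where the monitored metric is lowest."""
    values = history[monitor]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]

# Invented values: training loss keeps falling, validation loss turns upward.
toy_history = {
    "loss": [0.60, 0.40, 0.25, 0.15],
    "val_loss": [0.55, 0.48, 0.52, 0.61],
}

epoch, value = best_epoch(toy_history)
print(epoch, value)  # 2 0.48
```

With the notebook's objects, the same call would be `best_epoch(history.history)`.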

**model 2**

Add Dropout layer

In [19]:
epoch_size =10
batch_size = 32
embedding_dim = 16
optimizer = optimizers.Adam(learning_rate=3e-4)

model2 = Sequential([
    layers.Embedding(vocab_size, embedding_dim, input_length=max_len),
    layers.Bidirectional(layers.GRU(256, return_sequences=True)),
    layers.GlobalMaxPool1D(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(2, activation='sigmoid')
])
model2.summary()

In [20]:
model2.compile(loss='sparse_categorical_crossentropy', optimizer = 'adam', metrics=['accuracy'])
history = model2.fit(x, y, epochs=epoch_size, validation_split=0.1)

In [21]:
test_pred = np.argmax(model2.predict(test_x), axis=1)
print(test_pred)

In [22]:
history2 = history.history

trg_loss = history2['loss']
val_loss = history2['val_loss']

trg_acc = history2['accuracy']
val_acc = history2['val_accuracy']

epochs = range(1, len(trg_acc) + 1)

# plot losses and accuracies for training and validation 
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(1, 2, 1)
plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
plt.title("Training / Validation Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Epochs")
plt.legend(loc='best')

ax = fig.add_subplot(1, 2, 2)
plt.plot(epochs, trg_acc, marker='o', label='Training Accuracy')
plt.plot(epochs, val_acc, marker='^', label='Validation Accuracy')
plt.title("Training / Validation Accuracy")
ax.set_ylabel("Accuracy")
ax.set_xlabel("Epochs")
plt.legend(loc='best')
plt.show()

This is an improvement over model 1, but the validation loss still does not converge and the validation results are still not good.

**model 3**

LSTM implementation with basic embedding layer

In [23]:
epoch_size =10
batch_size = 32
embedding_dim = 16
optimizer = optimizers.Adam(learning_rate=3e-4)

model3 = Sequential([
    layers.Embedding(vocab_size, embedding_dim, input_length=max_len, trainable=False),
    layers.SpatialDropout1D(0.2),

    layers.Bidirectional(layers.LSTM(64, recurrent_dropout=0.5, dropout=0.5, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(64, recurrent_dropout=0.5, dropout=0.5)),

    layers.Dense(64, activation='relu'),
    layers.Dense(2, activation='sigmoid')
])
model3.summary()

In [24]:
model3.compile(loss='sparse_categorical_crossentropy', optimizer = optimizer, metrics=['accuracy'])
history = model3.fit(x, y, epochs=epoch_size, validation_split=0.1)

In [25]:
test_pred = np.argmax(model3.predict(test_x), axis=1)
print(test_pred)

In [26]:
history3 = history.history

trg_loss = history3['loss']
val_loss = history3['val_loss']

trg_acc = history3['accuracy']
val_acc = history3['val_accuracy']

epochs = range(1, len(trg_acc) + 1)

# plot losses and accuracies for training and validation 
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(1, 2, 1)
plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
plt.title("Training / Validation Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Epochs")
plt.legend(loc='best')

ax = fig.add_subplot(1, 2, 2)
plt.plot(epochs, trg_acc, marker='o', label='Training Accuracy')
plt.plot(epochs, val_acc, marker='^', label='Validation Accuracy')
plt.title("Training / Validation Accuracy")
ax.set_ylabel("Accuracy")
ax.set_xlabel("Epochs")
plt.legend(loc='best')
plt.show()

This is an improvement over model 2, but the validation loss still does not converge and the validation results are still not good.

**model 4**

GRU implementation with basic embedding layer and 3 Dense layers

In [27]:
epoch_size =10
batch_size = 32
embedding_dim = 16
optimizer = optimizers.Adam(learning_rate=3e-4)

model4 = Sequential([
    layers.Embedding(vocab_size, embedding_dim, input_length=max_len),
    layers.Bidirectional(layers.GRU(256, return_sequences=True)),
    layers.GlobalMaxPool1D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(2, activation='sigmoid')
])
model4.summary()

In [28]:
model4.compile(loss='sparse_categorical_crossentropy', optimizer = optimizer, metrics=['accuracy'])
history = model4.fit(x, y, epochs=epoch_size, validation_split=0.2)

In [29]:
test_pred = np.argmax(model4.predict(test_x), axis=1)
print(test_pred)

In [30]:
history4 = history.history

trg_loss = history4['loss']
val_loss = history4['val_loss']

trg_acc = history4['accuracy']
val_acc = history4['val_accuracy']

epochs = range(1, len(trg_acc) + 1)

# plot losses and accuracies for training and validation 
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(1, 2, 1)
plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
plt.title("Training / Validation Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Epochs")
plt.legend(loc='best')

ax = fig.add_subplot(1, 2, 2)
plt.plot(epochs, trg_acc, marker='o', label='Training Accuracy')
plt.plot(epochs, val_acc, marker='^', label='Validation Accuracy')
plt.title("Training / Validation Accuracy")
ax.set_ylabel("Accuracy")
ax.set_xlabel("Epochs")
plt.legend(loc='best')
plt.show()

The validation loss is high, and the validation accuracy differs significantly from the training accuracy.

**model 5**

Bidirectional LSTM with pretrained GloVe embeddings and BatchNormalization

In [31]:
embeddings_dictionary = dict()
embedding_dim = 100
# Read the GloVe file line by line: the first field is the word, the rest is its vector
with open('../input/glove6b100dtxt/glove.6B.100d.txt', encoding='UTF8') as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

In [32]:
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
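
Words that have no GloVe vector keep an all-zero row in `embedding_matrix`, so it can be useful to check how much of the vocabulary is actually covered. The following is a minimal sketch with a hypothetical helper, `embedding_coverage`, and toy dictionaries standing in for `tokenizer.word_index` and `embeddings_dictionary`.

```python
def embedding_coverage(word_index, embeddings):
    """Return (fraction of vocabulary with a vector, list of uncovered words)."""
    if not word_index:
        return 0.0, []
    missing = [word for word in word_index if word not in embeddings]
    covered = len(word_index) - len(missing)
    return covered / len(word_index), missing

# Toy stand-ins for tokenizer.word_index and embeddings_dictionary.
toy_word_index = {"fire": 1, "flood": 2, "omg": 3, "prebreak": 4}
toy_embeddings = {
    "fire": [0.1, 0.2],
    "flood": [0.3, 0.4],
    "omg": [0.5, 0.6],
}

coverage, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(coverage)  # 0.75
print(missing)   # ['prebreak']
```

A low coverage value would suggest that the cleaning step produces many tokens the pretrained vocabulary does not contain.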

In [33]:
epoch_size =10
batch_size = 32

model5 = Sequential([
    layers.Embedding(input_dim=embedding_matrix.shape[0], 
                        output_dim=embedding_matrix.shape[1], 
                        weights = [embedding_matrix]
                        ),
    layers.Bidirectional(LSTM(64, return_sequences = True, recurrent_dropout=0.2)),
    layers.GlobalMaxPool1D(),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])
model5.summary()

In [34]:
model5.compile(loss='binary_crossentropy', optimizer = RMSprop(learning_rate=0.0001), metrics=['accuracy'])
history5 = model5.fit(x, y, epochs=epoch_size, validation_split=0.2)

In [35]:
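# model5 has a single sigmoid output, so threshold the probability instead of taking argmax
test_pred = (model5.predict(test_x) > 0.5).astype(int).ravel()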
print(test_pred)
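
Model 5 ends in a single sigmoid unit, so its predictions have one column per tweet. Taking the argmax across that single column always returns index 0, whereas comparing the probability with a threshold (0.5 is the usual default) yields the class label. The dependency-free sketch below illustrates the difference on invented probabilities; with NumPy the same operations are `np.argmax(p, axis=1)` and `(p > 0.5).astype(int)`.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# Invented outputs of a 1-unit sigmoid layer for three tweets (shape 3 x 1).
single_unit_probs = [[0.91], [0.08], [0.55]]

# With only one column, the largest value is always at index 0.
argmax_labels = [argmax(row) for row in single_unit_probs]

# Thresholding the probability gives the intended 0/1 labels.
threshold_labels = [int(row[0] > 0.5) for row in single_unit_probs]

print(argmax_labels)     # [0, 0, 0]
print(threshold_labels)  # [1, 0, 1]
```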

In [36]:
history = history5.history

trg_loss = history['loss']
val_loss = history['val_loss']

trg_acc = history['accuracy']
val_acc = history['val_accuracy']

epochs = range(1, len(trg_acc) + 1)

# plot losses and accuracies for training and validation 
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(1, 2, 1)
plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
plt.title("Training / Validation Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Epochs")
plt.legend(loc='best')

ax = fig.add_subplot(1, 2, 2)
plt.plot(epochs, trg_acc, marker='o', label='Training Accuracy')
plt.plot(epochs, val_acc, marker='^', label='Validation Accuracy')
plt.title("Training / Validation Accuracy")
ax.set_ylabel("Accuracy")
ax.set_xlabel("Epochs")
plt.legend(loc='best')
plt.show()

Both the loss and accuracy values converge, and the gap between the training and validation results tends to decrease.
However, the loss value is still large and the accuracy value seems to need further improvement.

**model 6**

GloVe embedded GRU

In [37]:
epoch_size =10
batch_size = 32

model6 = Sequential([
    layers.Embedding(input_dim=embedding_matrix.shape[0], 
                        output_dim=embedding_matrix.shape[1], 
                        weights = [embedding_matrix]
                        ),
    layers.Bidirectional(layers.GRU(256, return_sequences=True)),
    layers.GlobalMaxPool1D(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(2, activation='sigmoid')
])
model6.summary()

In [38]:
model6.compile(loss='sparse_categorical_crossentropy', optimizer = RMSprop(learning_rate=0.0001), metrics=['accuracy'])
history6 = model6.fit(x, y, epochs=epoch_size, validation_split=0.1)

In [39]:
test_pred = np.argmax(model6.predict(test_x), axis=1)
print(test_pred)

In [40]:
history = history6.history

trg_loss = history['loss']
val_loss = history['val_loss']

trg_acc = history['accuracy']
val_acc = history['val_accuracy']

epochs = range(1, len(trg_acc) + 1)

# plot losses and accuracies for training and validation 
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(1, 2, 1)
plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
plt.title("Training / Validation Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Epochs")
plt.legend(loc='best')

ax = fig.add_subplot(1, 2, 2)
plt.plot(epochs, trg_acc, marker='o', label='Training Accuracy')
plt.plot(epochs, val_acc, marker='^', label='Validation Accuracy')
plt.title("Training / Validation Accuracy")
ax.set_ylabel("Accuracy")
ax.set_xlabel("Epochs")
plt.legend(loc='best')
plt.show()

Both loss and accuracy values were improved in the desired direction.

# 6. Create a submission

Write the predictions to a CSV file. At this point `test_pred` holds the predictions of model 6, the last model evaluated.

In [41]:
submission = pd.DataFrame({'id':test_data['id'], 'target':test_pred})
submission.to_csv('submission.csv', index=False)