# Disaster Detection Tweets

### For this assignment, I will predict whether a tweet refers to a real natural or personal disaster. Many tweets contain words that a model may associate with a disaster, such as "wreck", but use them as exclamations.

### I will convert each tweet into a sequence of encoded words and use those encoded sequences to determine whether the tweet is about an actual disaster.


## Import the packages that will be used

In [70]:
import numpy as np
import pandas as pd
import os


In [75]:
import nltk
import re
import matplotlib.pyplot as plt
import string
import seaborn as sns
import spacy
from time import time
from nltk.corpus import stopwords
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA, TruncatedSVD
from keras.utils import pad_sequences
from keras.layers import preprocessing

In [76]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [77]:
pip install -q kaggle

In [78]:
pip install -q opendatasets



In [81]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [82]:
import opendatasets as od

od.download("https://www.kaggle.com/competitions/nlp-getting-started/data")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: joshuatrahan
Your Kaggle Key: ··········
Downloading nlp-getting-started.zip to ./nlp-getting-started


100%|██████████| 593k/593k [00:00<00:00, 77.0MB/s]


Extracting archive ./nlp-getting-started/nlp-getting-started.zip to ./nlp-getting-started





## Update the cwd and create dataframes for the train and test data. Get value counts for outputs.

In [83]:
# os.chdir returns None, so there is nothing useful to assign
os.chdir('/content/nlp-getting-started')

train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
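The class counts above give a useful reference point for later accuracy numbers. As a small sketch, using only the two counts printed above, this computes the share of the majority class, which is the accuracy a model would get by always predicting "not a disaster":

```python
# Class counts copied from the value_counts() output above
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
majority_share = max(counts.values()) / total

print(f"Total tweets: {total}")
print(f"Majority-class baseline accuracy: {majority_share:.4f}")
```

Any validation accuracy reported later is only meaningful to the extent that it beats this baseline.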

In [180]:
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | our deeds are the reason of this earthquake ma... | 1 |
| 1 | 4 | | | forest fire near la ronge sask canada | 1 |
| 2 | 5 | | | all residents asked to shelter in place are be... | 1 |
| 3 | 6 | | | people receive wildfires evacuation orders in... | 1 |
| 4 | 7 | | | just got sent this photo from ruby alaska as s... | 1 |


## Create a text cleaner to remove punctuation and random characters

In [85]:
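def cleantext(text):
    """Lowercase a tweet and strip brackets, punctuation, and standalone numbers."""
    text = str(text).lower()
    text = re.sub(r"\[(.*?)\]", "", text)  # drop [bracketed] text
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    text = re.sub(r"\w+…|…", "", text)  # drop words cut off by an ellipsis
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)  # split hyphenated words
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # strip punctuation
    text = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", text)  # drop standalone numbers
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)  # replace any remaining symbols with a space

    return text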

In [203]:
train_df['text'] = train_df.text.apply(cleantext)
test_df['text'] = test_df.text.apply(cleantext)
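To make the effect of the cleaner easier to see, here is a small trace of five of its substitutions applied one at a time. The sample tweet is made up for illustration and is not taken from the dataset; the patterns are copied from the function above.

```python
import re
import string

# Hypothetical sample tweet, not taken from the dataset
sample = "BREAKING: Forest-fire near La Ronge, Sask. #Canada [photo] evacuat…"

steps = [
    ("drop [bracketed] text", r"\[(.*?)\]", ""),
    ("collapse whitespace", r"\s+", " "),
    ("drop words cut off by an ellipsis", r"\w+…|…", ""),
    ("split hyphenated words", r"(?<=\w)-(?=\w)", " "),
    ("strip punctuation", f"[{re.escape(string.punctuation)}]", ""),
]

text = sample.lower()
trace = [text]
for label, pattern, repl in steps:
    text = re.sub(pattern, repl, text)
    trace.append(text)
    print(f"{label:35s} -> {text!r}")

# Splitting on whitespace gives the words the tokenizer will see
tokens = text.split()
print(tokens)
```

Note that the truncated word at the end is removed entirely, and the hashtag survives only as a plain word.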

## Preprocess training and test text for use in the model. 

### This will involve creating tokens, encoding words with numbers, creating sequences, and padding the sequences with zeros so that all sequences are the same size.

In [204]:
from keras.preprocessing.text import Tokenizer

# test_df was already loaded and cleaned above, so it is reused as-is
train_labels = train_df.target.values

# Index 0 is reserved for padding and index 1 for out-of-vocabulary words
tokenizer1 = Tokenizer(num_words=75000, oov_token='<OOV>')
tokenizer1.fit_on_texts(train_df.text)
words = tokenizer1.word_index
sequences = tokenizer1.texts_to_sequences(train_df.text)
pad = pad_sequences(sequences, padding='post', truncating='post', maxlen=150)

# Encode the test tweets with the vocabulary fitted on the training set
test_seq = tokenizer1.texts_to_sequences(test_df.text)
test_pad = pad_sequences(test_seq, padding='post', truncating='post', maxlen=150)
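The encoding and padding steps can be hard to picture from the Keras calls alone. The following is a rough pure-Python sketch of the same idea, not the Keras implementation: words are ranked by frequency, index 1 is kept for unknown words, and index 0 is used for padding. The helper names (`fit_word_index`, `encode`, `pad_post`) and the toy tweets are made up for illustration.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer, most frequent first; 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def encode(text, index, oov_token="<OOV>"):
    """Replace each word with its integer id, or the OOV id if unseen."""
    return [index.get(word, index[oov_token]) for word in text.split()]

def pad_post(seq, maxlen):
    """Cut the sequence at maxlen, then append zeros up to maxlen."""
    seq = seq[:maxlen]
    return seq + [0] * (maxlen - len(seq))

# Toy training tweets (hypothetical)
train_texts = ["forest fire near town", "fire crews near forest fire"]
index = fit_word_index(train_texts)

encoded = encode("fire near school", index)  # "school" was never seen
padded = pad_post(encoded, maxlen=6)
print(index)
print(encoded, padded)
```

Because the vocabulary is built only from the training tweets, any new word in the test tweets falls back to the OOV id.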

### Word Frequency

In [267]:
word_dict = dict(tokenizer1.word_counts)

keys = list(word_dict.keys())
values = list(word_dict.values())
sorted_value_index = np.argsort(values)
sorted_dict = {keys[i]: values[i] for i in sorted_value_index[::-1]}

sorted_dict

{'the': 3276,
 'a': 2193,
 'in': 1982,
 'to': 1950,
 'of': 1829,
 'and': 1423,
 'i': 1405,
 'is': 944,
 'for': 895,
 'on': 860,
 'you': 810,
 'my': 678,
 'it': 579,
 'with': 574,
 'that': 567,
 'at': 543,
 'by': 527,
 'this': 478,
 'from': 420,
 'be': 408,
 'are': 403,
 'have': 386,
 'was': 385,
 'like': 347,
 'as': 330,
 'up': 329,
 'me': 323,
 'just': 322,
 'but': 317,
 'so': 317,
 'amp': 300,
 'im': 299,
 'not': 297,
 'your': 293,
 'out': 271,
 'its': 268,
 'no': 262,
 'all': 261,
 'after': 260,
 'will': 257,
 'when': 255,
 'fire': 252,
 'an': 251,
 'has': 250,
 'we': 242,
 'if': 242,
 'get': 229,
 'new': 226,
 'now': 222,
 'via': 220,
 'more': 217,
 'about': 214,
 'or': 203,
 'what': 199,
 'one': 197,
 'people': 196,
 'news': 195,
 'he': 194,
 'they': 193,
 'dont': 191,
 'been': 191,
 'how': 191,
 'over': 189,
 'who': 180,
 'into': 173,
 'do': 168,
 'were': 167,
 'us': 167,
 'can': 165,
 's': 163,
 'video': 158,
 'emergency': 157,
 'disaster': 154,
 'there': 151,
 'police': 141,
 ...
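The same kind of frequency ranking can also be produced with the standard library. This is a minimal sketch on made-up tweets, using `collections.Counter` rather than the tokenizer's `word_counts`:

```python
from collections import Counter

# Toy cleaned tweets (hypothetical)
tweets = ["fire in the hills", "the fire is out", "the news"]

counts = Counter(word for tweet in tweets for word in tweet.split())

# most_common(n) returns the n highest counts as (word, count) pairs
top = counts.most_common(2)
print(top)
```

As in the real output above, the top of the list is dominated by function words such as "the", which carry little information about disasters.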

## Find the average tweet length in tokens to help choose maxlen.

In [205]:
total = 0
for seq in sequences:
  total += len(seq)
print('Average number of tokens: ' + str(total/len(sequences)))


Average number of tokens: 14.53684487061605
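The average alone does not say how long the longest tweets are, and that is what matters for `maxlen`. As a sketch, a nearest-rank percentile over the sequence lengths gives a tighter guide. The lengths below are made up for illustration; in the notebook they would come from `len(seq)` for each entry in `sequences`.

```python
import math

def length_percentile(lengths, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical token counts per tweet
lengths = [3, 5, 8, 12, 14, 15, 17, 20, 24, 31]

mean_len = sum(lengths) / len(lengths)
p50 = length_percentile(lengths, 50)
p95 = length_percentile(lengths, 95)
print(mean_len, p50, p95)
```

With an average near 15 tokens, a `maxlen` of 150 leaves most of each padded sequence as zeros, so a value closer to a high percentile of the real lengths would likely work as well.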


## Build a stacked bidirectional GRU RNN with a single dense layer to produce the output.

In [206]:
tf.random.set_seed(123)

model = keras.Sequential()

model.add(layers.Embedding(input_dim = 75000, output_dim = 64, input_length=150))

model.add(layers.Bidirectional(keras.layers.GRU(64,activation= 'tanh', dropout=.35, return_sequences=True)))
model.add(layers.Bidirectional(keras.layers.GRU(32, activation= 'tanh', dropout=.4)))

model.add(layers.Dense(1, activation = 'sigmoid'))

model.summary()


Model: "sequential_39"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_39 (Embedding)    (None, 150, 64)           4800000   
                                                                 
 bidirectional_65 (Bidirecti  (None, 150, 128)         49920     
 onal)                                                           
                                                                 
 bidirectional_66 (Bidirecti  (None, 64)               31104     
 onal)                                                           
                                                                 
 dense_68 (Dense)            (None, 1)                 65        
                                                                 
Total params: 4,881,089
Trainable params: 4,881,089
Non-trainable params: 0
_________________________________________________________________
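The parameter counts in the summary can be reproduced by hand. This sketch assumes Keras's default GRU variant (`reset_after=True`), which keeps two bias vectors per gate; the helper names are made up for illustration.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def gru_params(input_dim, units):
    # 3 gates, each with an input kernel, a recurrent kernel, and two bias vectors
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    # Weights plus one bias per output unit
    return (input_dim + 1) * units

emb = embedding_params(75000, 64)
gru1 = 2 * gru_params(64, 64)    # bidirectional doubles the layer; output width is 128
gru2 = 2 * gru_params(128, 32)   # takes the 128-wide output; its own output width is 64
out = dense_params(64, 1)

total = emb + gru1 + gru2 + out
print(emb, gru1, gru2, out, total)
```

Almost all of the weights sit in the embedding layer, since its size is set by the 75,000-word vocabulary limit rather than by the words that actually occur.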


In [207]:
model.compile(loss = 'binary_crossentropy', optimizer= keras.optimizers.Adam(learning_rate=.0001),
              metrics = ['accuracy'])

## Add callbacks and fit the model, holding out 15% of the training data as a validation set to test predictions

In [208]:
from keras.callbacks import ModelCheckpoint, EarlyStopping

early_stop = EarlyStopping(monitor='val_accuracy', min_delta=0.005, patience = 5, verbose = 1, mode = 'auto')
mod_checkpt = ModelCheckpoint(monitor = 'val_accuracy', filepath='./best_mod.b1', verbose = 1, save_best_only= True)
y_train = train_df.target
cb = [early_stop, mod_checkpt]

results = model.fit(pad,y_train, epochs = 20, verbose = 1, callbacks=cb, validation_split=.15)

Epoch 1/20
Epoch 1: val_accuracy improved from -inf to 0.53415, saving model to ./best_mod.b1
Epoch 2/20
Epoch 2: val_accuracy improved from 0.53415 to 0.73730, saving model to ./best_mod.b1
Epoch 3/20
Epoch 3: val_accuracy improved from 0.73730 to 0.74431, saving model to ./best_mod.b1
Epoch 4/20
Epoch 4: val_accuracy did not improve from 0.74431
Epoch 5/20
Epoch 5: val_accuracy improved from 0.74431 to 0.75219, saving model to ./best_mod.b1
Epoch 6/20
Epoch 6: val_accuracy improved from 0.75219 to 0.75569, saving model to ./best_mod.b1
Epoch 7/20
Epoch 7: val_accuracy did not improve from 0.75569
Epoch 8/20
Epoch 8: val_accuracy did not improve from 0.75569
Epoch 9/20
Epoch 9: val_accuracy did not improve from 0.75569
Epoch 10/20
Epoch 10: val_accuracy did not improve from 0.75569
Epoch 10: early stopping
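The log shows the checkpoint saving at epoch 6 but training stopping at epoch 10, only four epochs later, even though patience is 5. A likely explanation is that the early-stopping rule only counts a gain larger than `min_delta` as an improvement, while the checkpoint accepts any gain. The sketch below is a simplified re-implementation of that rule, not the Keras source. The scores for epochs 4 and 7 to 10 are not shown in the log, so the values used for them are hypothetical placeholders that stay below the best score.

```python
def early_stop_epoch(val_scores, min_delta, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score - min_delta > best:   # must beat the best score by more than min_delta
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

val_scores = [
    0.53415, 0.73730, 0.74431,
    0.74000,                  # epoch 4: hypothetical, log only says "did not improve"
    0.75219, 0.75569,
    0.75000, 0.75000, 0.75000, 0.75000,  # epochs 7-10: hypothetical placeholders
]

stop = early_stop_epoch(val_scores, min_delta=0.005, patience=5)
print(stop)
```

Under this rule the gain at epoch 6 (0.75219 to 0.75569) is smaller than 0.005, so the patience counter starts at epoch 6 and reaches 5 at epoch 10.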


In [190]:
# Predict with the checkpointed model that had the best validation accuracy
best_model = keras.models.load_model('./best_mod.b1')

pred = best_model.predict(test_pad)




In [214]:
df1 = pd.DataFrame(pred.round(), test_df.id, columns = ['Prediction']).reset_index()

In [232]:
# Flatten the (n, 1) prediction array into a plain list of probabilities
p_lst = [p[0] for p in pred]

In [241]:
final_df = pd.DataFrame( p_lst,test_df.id.values).reset_index()
final_df.to_csv('./pred1')
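For a Kaggle submission, the probabilities need to become hard 0/1 labels next to each tweet id. Here is a minimal standard-library sketch, assuming the competition expects `id` and `target` columns as in its `sample_submission.csv`. The function name, the ids, and the probabilities are made up for illustration.

```python
import csv
import io

def write_submission(ids, probs, handle, threshold=0.5):
    """Write id,target rows, labelling a tweet 1 when its probability exceeds the threshold."""
    writer = csv.writer(handle)
    writer.writerow(["id", "target"])
    for tweet_id, prob in zip(ids, probs):
        writer.writerow([tweet_id, int(prob > threshold)])

# Hypothetical ids and sigmoid outputs
ids = [0, 2, 3]
probs = [0.91, 0.12, 0.50]

buffer = io.StringIO()  # stands in for an open file
write_submission(ids, probs, buffer)
rows = buffer.getvalue().splitlines()
print(rows)
```

With pandas, the same result would come from a DataFrame with those two columns written with `to_csv(..., index=False)`, so that the row index is not saved as an extra column.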

### My submission scored 74% accuracy on the test data. I kept my model as simple as I could, with a conservative learning rate. The GRU model performed better than an LSTM model.

### I considered removing stopwords from the text, which might have given a slightly better result. I will likely try this at a later time. The Keras vectorizer and tokenizer were the easiest to use with a Keras model.

### The end result without removing stopwords was better than I expected.
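As a starting point for the stopword idea mentioned above, here is a small sketch of filtering stopwords from a cleaned tweet. The stopword set is a hand-picked assumption made of the ten most frequent words in the frequency output earlier; NLTK's English stopword list would be a more complete choice.

```python
# Hand-picked set for illustration: the ten most frequent words in the training tweets
STOPWORDS = {"the", "a", "in", "to", "of", "and", "i", "is", "for", "on"}

def drop_stopwords(text, stopwords=STOPWORDS):
    """Remove stopwords from an already cleaned, lowercased tweet."""
    return " ".join(word for word in text.split() if word not in stopwords)

cleaned = "our deeds are the reason of this earthquake"
filtered = drop_stopwords(cleaned)
print(filtered)
```

This step would run after `cleantext` and before fitting the tokenizer, so that the vocabulary and the sequence lengths both reflect the filtered text.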