<a href="https://colab.research.google.com/github/hecshzye/nlp-disaster-tweet-detection/blob/main/disaster_tweet_detection_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Disaster Tweets Detection using Natural Language Processing

### The goal is to predict which tweets are about real disasters and which are not, using `NLP` and `TensorFlow`

- The dataset used in this notebook comes from the `Real-or-Not` Kaggle competition: https://www.kaggle.com/c/nlp-getting-started/data


In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import random
import os

In [27]:
# Download a few helper functions written to simplify the workflow
!wget https://raw.githubusercontent.com/hecshzye/natural_language_processing-cases/main/helper_functions.py

--2022-01-14 05:20:01--  https://raw.githubusercontent.com/hecshzye/natural_language_processing-cases/main/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6442 (6.3K) [text/plain]
Saving to: ‘helper_functions.py’


2022-01-14 05:20:01 (74.9 MB/s) - ‘helper_functions.py’ saved [6442/6442]



In [28]:
from helper_functions import plot_loss_curves, create_confusion_matrix, create_tensorboard_callback, unzip_data, compare_history

## Dataset & EDA

In [30]:
!wget https://github.com/hecshzye/nlp-disaster-tweet-detection/blob/main/nlp_getting_started.zip?raw=true
unzip_data("nlp_getting_started.zip?raw=true")

--2022-01-14 05:21:30--  https://github.com/hecshzye/nlp-disaster-tweet-detection/blob/main/nlp_getting_started.zip?raw=true
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/hecshzye/nlp-disaster-tweet-detection/raw/main/nlp_getting_started.zip [following]
--2022-01-14 05:21:30--  https://github.com/hecshzye/nlp-disaster-tweet-detection/raw/main/nlp_getting_started.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/hecshzye/nlp-disaster-tweet-detection/main/nlp_getting_started.zip [following]
--2022-01-14 05:21:30--  https://raw.githubusercontent.com/hecshzye/nlp-disaster-tweet-detection/main/nlp_getting_started.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting

In [32]:
# Converting CSV to DataFrame
train_df = pd.read_csv("/content/train.csv")
test_df = pd.read_csv("/content/test.csv")

In [33]:
train_df.head()

|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [34]:
# Shuffling the dataset
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction |  | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge |  | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock |  | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


In [35]:
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64

**Reference dictionary**

Real disaster tweet = `1` (3271)

Not real disaster tweet = `0` (4342)
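
The two counts imply a mild class imbalance. A minimal sketch of the arithmetic, using only the numbers reported by `value_counts()` above (the variable names are illustrative and do not appear elsewhere in the notebook):

```python
# Class counts as reported by train_df.target.value_counts()
class_counts = {0: 4342, 1: 3271}  # 0 = not a real disaster, 1 = real disaster

total = sum(class_counts.values())
class_share = {label: count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"target {label}: {share:.1%} of {total} tweets")
```

The split is roughly 57% to 43%, so a model that always predicts `0` would already reach about 57% accuracy on the training data. That figure is a useful floor when judging later results.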

In [37]:
# Number of samples in the training and test sets
print(f"train - {len(train_df)}")
print(f"test - {len(test_df)}")
print(f"total size - {len(train_df) + len(test_df)}")

train - 7613
test - 3263
total size - 10876


In [39]:
# Print 10 consecutive samples from the shuffled training data, starting at a random index
random_index = random.randint(0, len(train_df)-10)
for row in train_df_shuffled[["text", "target"]][random_index:random_index+10].itertuples():
  _, text, target = row
  print(f"target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"text:\n{text}\n")
  print(f"---\n")

target: 1 (real disaster)
text:
Drones Under Fire: Officials Offer $75000 Reward Leading To Pilots Who Flew Over Wildfire http://t.co/d2vEppeh8S #photography #arts

---

target: 1 (real disaster)
text:
Flood Advisory in effect for Shelby County in AL until 9 PM #alwx http://t.co/gTqMGsgcsB

---

target: 0 (not real disaster)
text:
Buyout Giants Bid To Derail å£6bn Worldpay IPO ÛÒ SkyåÊNews http://t.co/94GjsKUR0r

---

target: 0 (not real disaster)
text:
He's being put on a stretcher ?? don't want to see that.

---

target: 0 (not real disaster)
text:
I wish that the earth sea and sky up above
would send me someone to lava????

---

target: 0 (not real disaster)
text:
Flat out bomb by @FlavaFraz21 #whatcanthedo

---

target: 1 (real disaster)
text:
Russian 'food crematoria' provoke outrage amid crisis famine memories: MOSCOW (Reuters) - Russian government ... http://t.co/Mphog0QDDN

---

target: 1 (real disaster)
text:
Mexico: construction of bridge collapse killsåÊone http://t.co/I2C0

In [43]:
# Split the shuffled training data into training (90%) and validation (10%) sets
from sklearn.model_selection import train_test_split
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
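
The sizes above follow from `test_size=0.1` applied to 7,613 rows. The sketch below reproduces only that arithmetic, assuming scikit-learn rounds a fractional `test_size` up and assigns the remainder to training; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_val) for a fractional test_size.

    Assumption: the validation count is rounded up and the rest is training.
    """
    n_val = math.ceil(test_size * n_samples)
    return n_samples - n_val, n_val

n_train, n_val = split_sizes(7613, 0.1)
print(n_train, n_val)
```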

In [44]:
train_sentences[:5], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk'],
       dtype=object), array([0, 0, 1, 0, 0, 1, 1, 0, 1, 1]))

In [46]:
# Preprocessing: turn text into integer token sequences
# (TF >= 2.6; older releases expose this layer under tensorflow.keras.layers.experimental.preprocessing)
from tensorflow.keras.layers import TextVectorization
text_vectorizer = TextVectorization(max_tokens=None,
                                    standardize="lower_and_strip_punctuation",
                                    split="whitespace",
                                    ngrams=None,
                                    output_mode="int",
                                    output_sequence_length=None)

In [49]:
# Average number of whitespace-separated tokens per tweet in the training set
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15
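
The rounded mean is a common starting point for a sequence length, but a mean does not show how many tweets would be truncated. The sketch below measures both, run on the five sample sentences printed earlier rather than on the full training set, so its numbers differ from the 15 computed above. `token_lengths`, `mean_length`, and `share_within` are illustrative names.

```python
# The five sentences shown by train_sentences[:5]
sample_sentences = [
    "@mogacola @zamtriossu i screamed after hitting tweet",
    "Imagine getting flattened by Kurt Zouma",
    "@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx "
    "65000 members har time disaster victim ki help ke liye tyar hai....",
    "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
    "Somehow find you and I collide http://t.co/Ee8RpOahPk",
]

# Whitespace token count per sentence, as in the cell above
token_lengths = [len(sentence.split()) for sentence in sample_sentences]
mean_length = round(sum(token_lengths) / len(token_lengths))

# Share of sentences that fit within a candidate sequence length without truncation
candidate_length = 15
share_within = sum(n <= candidate_length for n in token_lengths) / len(token_lengths)

print(f"token counts: {token_lengths}")
print(f"rounded mean: {mean_length}")
print(f"share with <= {candidate_length} tokens: {share_within:.0%}")
```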

In [51]:
# Text Vectorization using custom variables
max_vocab_length = 1000
max_length = 15
text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

In [52]:
# Fit the text vectorizer to the training sentences (this builds the vocabulary)
text_vectorizer.adapt(train_sentences)

# Tokenizing sample sentences
sample_sentence = "Floor is lava at the end of the day"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[  1,   9, 434,  17,   2, 304,   6,   2, 101,   0,   0,   0,   0,
          0,   0]])>
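
In this output, index `0` is padding and index `1` is the out-of-vocabulary token, so the 9-word sample becomes 9 ids followed by 6 zeros. The following is a simplified pure-Python sketch of that behaviour, not the Keras implementation. `standardize`, `build_vocab`, `vectorize`, and the toy corpus are all hypothetical, and the alphabetical tie-breaking between equally frequent words is an assumption.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and drop ASCII punctuation (approximates "lower_and_strip_punctuation")."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences, max_tokens):
    """Most frequent tokens first; slots 0 and 1 are reserved for padding and [UNK]."""
    counts = Counter(tok for s in sentences for tok in standardize(s).split())
    # Ties are broken alphabetically here (an assumption made for determinism)
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

def vectorize(sentence, vocab, seq_len):
    """Map tokens to ids, truncate to seq_len, then right-pad with 0."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(sentence).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

toy_corpus = ["The floor is lava", "The day is over", "Lava at the end"]
toy_vocab = build_vocab(toy_corpus, max_tokens=6)
toy_ids = vectorize("Floor is lava at the end of the day", toy_vocab, seq_len=15)

print(toy_vocab)
print(toy_ids)
```

With a vocabulary capped at 6 entries, most words in the sample fall back to id `1`, which mirrors how a small `max_tokens` pushes rarer words into `[UNK]`.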

In [53]:
# Vectorize a random sentence from the training set
random_sentence = random.choice(train_sentences)
print(f"Original Text:\n{random_sentence}\
         \n\nVectorized version:")
text_vectorizer([random_sentence])

Original Text:
The once desolate valley was transformed into a thriving hub of hiÛÓtech business.         

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[  2, 838, 744,   1,  23,   1,  66,   3,   1,   1,   6,   1, 691,
          0,   0]])>
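
Index `1` in the vectorized output stands for `[UNK]`, so counting the ones shows how much of a tweet falls outside the 1,000-word vocabulary. A quick sketch using the ids printed above (`oov_rate` is an illustrative helper, not part of TensorFlow):

```python
def oov_rate(token_ids, pad_id=0, oov_id=1):
    """Fraction of non-padding tokens that map to the out-of-vocabulary id."""
    real_tokens = [i for i in token_ids if i != pad_id]
    if not real_tokens:
        return 0.0
    return sum(i == oov_id for i in real_tokens) / len(real_tokens)

# Ids copied from the vectorized sentence above
vectorized = [2, 838, 744, 1, 23, 1, 66, 3, 1, 1, 6, 1, 691, 0, 0]
print(f"{oov_rate(vectorized):.0%} of the non-padding tokens are out of vocabulary")
```

Here 5 of the 13 real tokens are unknown. A larger `max_vocab_length` would likely reduce that share, at the cost of a larger embedding table.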

In [54]:
# Checking unique tokens in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
print(f"Number of words in the vocab: {len(words_in_vocab)}")
top_5_words = words_in_vocab[:5]
print(f"Top 5 common words: {top_5_words}")
bottom_5_words = words_in_vocab[-5:]
print(f"Bottom 5 least common words: {bottom_5_words}")

Number of words in the vocab: 1000
Top 5 common words: ['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words: ['reported', 'r', 'pray', 'playlist', 'patience']


In [56]:
# Create an embedding layer that maps each token id to a 128-dimensional vector
from tensorflow.keras import layers
tf.random.set_seed(42)
embedding = layers.Embedding(input_dim=max_vocab_length,
                             output_dim=128,
                             embeddings_initializer="uniform",
                             input_length=max_length,
                             name="embedding_1")
embedding

<keras.layers.embeddings.Embedding at 0x7fcd9c5659d0>

In [58]:
# Embed a random sentence from the training set
random_sentence = random.choice(train_sentences)
print(f"Original Text:\n{random_sentence}\
        \n\nEmbedded version:")
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original Text:
&gt;As soon as maintenance ends everyone floods the servers
&gt;Servers destroyed by extreme load
&gt;Maintenance starts anew        

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.03977952, -0.03782602, -0.03646283, ...,  0.00236253,
          0.03332629,  0.02803668],
        [-0.03606297, -0.03847123, -0.02388229, ..., -0.01015524,
          0.01009201,  0.01856624],
        [-0.00724739, -0.04718477, -0.02565417, ..., -0.03481182,
          0.01107268, -0.03028326],
        ...,
        [ 0.03977952, -0.03782602, -0.03646283, ...,  0.00236253,
          0.03332629,  0.02803668],
        [ 0.03977952, -0.03782602, -0.03646283, ...,  0.00236253,
          0.03332629,  0.02803668],
        [ 0.03977952, -0.03782602, -0.03646283, ...,  0.00236253,
          0.03332629,  0.02803668]]], dtype=float32)>
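
Conceptually, an embedding layer is a trainable lookup table with one row per vocabulary index. The sketch below mimics that lookup in plain Python, assuming the `"uniform"` initializer draws from [-0.05, 0.05] (the Keras default, consistent with the values shown above). `make_embedding_table` and `embed` are hypothetical helpers, and the random numbers will not match TensorFlow's.

```python
import random

def make_embedding_table(vocab_size, dim, seed=42):
    """One row of `dim` floats per token id, drawn uniformly from [-0.05, 0.05]."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Replace each token id with its row from the table."""
    return [table[i] for i in token_ids]

table = make_embedding_table(vocab_size=1000, dim=128)
token_ids = [2, 838, 744, 1, 23, 1, 66, 3, 1, 1, 6, 1, 691, 0, 0]  # example ids
embedded = embed(token_ids, table)

print(len(embedded), len(embedded[0]))  # sequence length, embedding dimension
print(embedded[3] == embedded[5])       # both positions hold id 1, so they share a row
print(len(table) * len(table[0]))       # number of weights in the table
```

This is why identical rows appear in the embedded output: every position carrying the same token id receives the same 128-dimensional vector until training updates the table.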

In [59]:
sample_embed[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([ 0.03977952, -0.03782602, -0.03646283, -0.02449075, -0.00015752,
        0.02220254,  0.00162981,  0.00603487,  0.0085157 , -0.02620113,
        0.04101599,  0.03715892,  0.02397566,  0.00281113, -0.02704906,
       -0.04870148,  0.01457943,  0.0059551 , -0.02334484,  0.03581132,
        0.04377897,  0.04186075,  0.03245703, -0.045092  ,  0.04260418,
        0.03398135, -0.01812425, -0.03539513,  0.02954218,  0.02556742,
       -0.03345481,  0.04272738, -0.00798845, -0.0406163 , -0.00644834,
        0.00232404,  0.01703629,  0.03645121, -0.02622857,  0.03498118,
       -0.03059715,  0.02576998, -0.04221511,  0.02654583, -0.02192564,
       -0.0346157 ,  0.00075326,  0.01427345,  0.01027539, -0.04311384,
       -0.03973336, -0.00966626,  0.01032177, -0.04011822, -0.018892  ,
       -0.01233201,  0.02721632, -0.01232889, -0.02504088, -0.04715574,
        0.00558523, -0.00801403,  0.03058865, -0.01923352, -0.04175536,
       -0.036542

## Modelling