# Natural Language Processing with Disaster Tweets

Predict which tweets are about real disasters and which are not.

This notebook predicts whether the content of a tweet refers to a real disaster: it outputs `1` if so and `0` if not.

For more information, visit https://www.kaggle.com/c/nlp-getting-started/data.

## Getting the data

In [1]:
# Importing needed packages
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
# Download helper functions file
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

# Import helper functions
from helper_functions import unzip_data, plot_loss_curves, compare_historys

# Download dataset (same data as the Kaggle competition)
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzip dataset into directory
unzip_data("nlp_getting_started.zip")

--2021-12-03 09:16:37--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2021-12-03 09:16:37 (70.2 MB/s) - ‘helper_functions.py’ saved [10246/10246]

--2021-12-03 09:16:37--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.103.128, 108.177.120.128, 142.250.128.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.103.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2021-12-03 09:16:37 (56.

## Getting data ready for modelling

In [3]:
# Import .csv's to pandas dataframes
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


Note that the test dataset doesn't have a `target` column, so we can't use it to evaluate our models locally. Instead, we'll split the training dataset into training and validation sets.

In [5]:
# Check class balance (0 = not a disaster, 1 = disaster)
train_df[["target"]].value_counts()

target
0         4342
1         3271
dtype: int64
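
The counts above show a mild imbalance. As a rough sketch (the variable names here are illustrative, not part of the notebook), the class proportions and the accuracy of a trivial "always predict the majority class" baseline can be worked out directly from those two numbers:

```python
# Class counts copied from the value_counts() output above
counts = {0: 4342, 1: 3271}
total = sum(counts.values())

# Share of each class in the training set
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a model that always predicts the most frequent class
majority_baseline = max(proportions.values())

print(total)                        # 7613
print(round(proportions[1], 3))     # 0.43
print(round(majority_baseline, 3))  # 0.57
```

Any model worth keeping should beat that baseline of roughly 57% accuracy on the validation set.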

In [6]:
# Shuffle train dataset
train_df_shuffled = train_df.sample(frac=1, random_state=42)

In [7]:
# Split train dataset into train/validation sets
from sklearn.model_selection import train_test_split
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=.1,
                                                                            random_state=42)

# Check splits shapes
train_sentences.shape, train_labels.shape, val_sentences.shape, val_labels.shape

((6851,), (6851,), (762,), (762,))
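
The shapes above can be checked by hand. This small sketch assumes, as far as I understand scikit-learn's behaviour, that a float `test_size` is converted to a sample count by rounding up, with the remainder going to the training split:

```python
import math

n_samples = 7613   # rows in train.csv
test_size = 0.1    # same fraction passed to train_test_split

# Validation count is rounded up; the rest is used for training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 6851 762
```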

In [8]:
# Create a text vectorizer layer
from tensorflow.keras.layers import TextVectorization

text_vectorizer = TextVectorization(max_tokens=None,
                                    standardize="lower_and_strip_punctuation",
                                    split="whitespace",
                                    ngrams=None,
                                    output_mode="int",
                                    output_sequence_length=None)

# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)
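
To make the settings above more concrete, here is a simplified pure-Python sketch of what `lower_and_strip_punctuation`, whitespace splitting, and `output_mode="int"` amount to. It is an illustration under assumptions, not the Keras implementation: the helper names are hypothetical, only ASCII punctuation is stripped, and the ordering of equally frequent tokens may differ from what `TextVectorization` produces.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and drop ASCII punctuation (a simplification of the Keras rule)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts):
    """Index 0 is reserved for padding, index 1 for unknown tokens,
    followed by the observed tokens from most to least frequent."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    return ["", "[UNK]"] + [tok for tok, _ in counts.most_common()]

def vectorize(text, vocab):
    """Map each whitespace-separated token to its vocabulary index (1 if unseen)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text).split()]

toy_corpus = ["Forest fire near La Ronge!", "Fire, fire everywhere"]
toy_vocab = build_vocab(toy_corpus)

print(toy_vocab[:3])                         # ['', '[UNK]', 'fire']
print(vectorize("A fire near me", toy_vocab))  # [1, 2, 4, 1]
```

Words that never appeared during `adapt` (here "a" and "me") all collapse to the unknown index, which is why the vectorizer is fitted on the training text only.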

### Model 1

In [9]:
# Create an embedding layer
from tensorflow.keras.layers import Embedding

embedding = Embedding(input_dim=20000,
                      output_dim=512,
                      embeddings_initializer=tf.keras.initializers.RandomUniform(),
                      input_length=20000,
                      name="embedding_1")

In [10]:
# Create model_1 using Sequential API
tf.random.set_seed(42)
model_1 = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,), dtype="string", name="input_layer"),
    text_vectorizer,
    embedding,
    tf.keras.layers.GlobalAveragePooling1D(name="global_average_pooling_1d_layer"),
    tf.keras.layers.Dense(1, activation="sigmoid", name="output_layer")
])

# Compile model_1
model_1.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# Check model_1 summary
model_1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 embedding_1 (Embedding)     (None, None, 512)         10240000  
                                                                 
 global_average_pooling_1d_l  (None, 512)              0         
 ayer (GlobalAveragePooling1                                     
 D)                                                              
                                                                 
 output_layer (Dense)        (None, 1)                 513       
                                                                 
Total params: 10,240,513
Trainable params: 10,240,513
Non-trainable params: 0
____________________________________________
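
The parameter counts in this summary follow directly from the layer sizes. A short arithmetic check (variable names are illustrative):

```python
vocab_size = 20000    # input_dim of the Embedding layer
embedding_dim = 512   # output_dim of the Embedding layer

# One 512-dimensional vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# Dense(1) on the pooled 512-dimensional vector: 512 weights + 1 bias
dense_params = embedding_dim * 1 + 1

print(embedding_params)                 # 10240000
print(dense_params)                     # 513
print(embedding_params + dense_params)  # 10240513
```

Nearly all of the model's capacity sits in the embedding table; the pooling layer has no parameters at all.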

In [11]:
# Fit model_1
history_model_1 = model_1.fit(train_sentences, train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Model 2

In [12]:
tf.random.set_seed(42)
# Create new embedding for model_2
embedding_2 = Embedding(input_dim=20000, output_dim=512,
                        embeddings_initializer=tf.keras.initializers.RandomUniform(),
                        input_length=20000,
                        name="embedding_2")

# Create model_2
model_2 = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,), dtype="string", name="input_layer"),
    text_vectorizer,
    embedding_2,
    tf.keras.layers.LSTM(64, name="LSTM_layer"),
    tf.keras.layers.Dense(1, activation="sigmoid", name="output_layer")
])

# Compile model_2
model_2.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# Model_2 summary
model_2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 embedding_2 (Embedding)     (None, None, 512)         10240000  
                                                                 
 LSTM_layer (LSTM)           (None, 64)                147712    
                                                                 
 output_layer (Dense)        (None, 1)                 65        
                                                                 
Total params: 10,387,777
Trainable params: 10,387,777
Non-trainable params: 0
_________________________________________________________________
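
The LSTM parameter count can be reproduced with the standard formula: four gates, each with an input kernel, a recurrent kernel, and a bias. The helper below is a hypothetical illustration, not a Keras function:

```python
def lstm_param_count(input_dim, units):
    """Parameters of a standard LSTM layer with bias terms."""
    # 4 gates x (input kernel + recurrent kernel + bias)
    return 4 * (input_dim * units + units * units + units)

lstm_params = lstm_param_count(input_dim=512, units=64)
dense_params = 64 * 1 + 1          # Dense(1) on the 64-dimensional LSTM output
embedding_params = 20000 * 512

print(lstm_params)                                     # 147712
print(embedding_params + lstm_params + dense_params)   # 10387777
```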


In [13]:
# Fit model_2
history_model_2 = model_2.fit(train_sentences, train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Model 3

In [14]:
tf.random.set_seed(42)
# Create new embedding for model_3
embedding_3 = Embedding(input_dim=20000,
                        output_dim=512,
                        embeddings_initializer=tf.keras.initializers.RandomUniform(),
                        input_length=20000)

# Create model_3
model_3 = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,), dtype="string"),
    text_vectorizer,
    embedding_3,
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, activation="relu", name="conv_1d_layer"),
    tf.keras.layers.GlobalAveragePooling1D(name="global_average_pooling_1d_layer"),
    tf.keras.layers.Dense(1, activation="sigmoid", name="output_layer")
])

# Compile model_3
model_3.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# Check summary
model_3.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, None, 512)         10240000  
                                                                 
 conv_1d_layer (Conv1D)      (None, None, 32)          49184     
                                                                 
 global_average_pooling_1d_l  (None, 32)               0         
 ayer (GlobalAveragePooling1                                     
 D)                                                              
                                                                 
 output_layer (Dense)        (None, 1)                 33        
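
The `Conv1D` figure in this summary can be verified the same way. In this sketch (helper names are hypothetical), each filter spans `kernel_size` consecutive token embeddings across all 512 channels and has one bias; with the Keras default of "valid" padding and stride 1, the sequence also gets shorter:

```python
def conv1d_param_count(in_channels, filters, kernel_size):
    """One (kernel_size x in_channels) kernel per filter, plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def conv1d_output_length(sequence_length, kernel_size):
    """Output length for 'valid' padding with stride 1."""
    return sequence_length - kernel_size + 1

conv_params = conv1d_param_count(in_channels=512, filters=32, kernel_size=3)

print(conv_params)                                              # 49184
print(conv1d_output_length(sequence_length=10, kernel_size=3))  # 8
```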
                                                      

In [15]:
# Fit model_3
history_model_3 = model_3.fit(train_sentences, train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Try fine-tuning the TF Hub Universal Sentence Encoder model by setting `trainable=True` when instantiating it as a Keras layer

```
# We can use this encoding layer in place of our text_vectorizer and embedding layer
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[],
                                        dtype=tf.string,
                                        trainable=True) # turn training on to fine-tune the TensorFlow Hub model
```

In [18]:
# Try the Universal Sentence Encoder (USE) v4 on two example sentences
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
    "The quick brown fox jumps over the lazy dog.",
    "I am a sentence for which I would like to get its embedding"])
print(embeddings)

tf.Tensor(
[[-0.03133017 -0.06338634 -0.01607498 ... -0.03242779 -0.04575739
   0.05370455]
 [ 0.05080865 -0.01652431  0.01573783 ...  0.0097666   0.03170121
   0.01788121]], shape=(2, 512), dtype=float32)


In [20]:
# Create model_4
tf.random.set_seed(42)
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[],
                                        dtype=tf.string,
                                        trainable=True)
model_4 = tf.keras.Sequential([
    sentence_encoder_layer,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
], name="model_USE")

# Compile model_4
model_4.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# Fit model_4
history_model_4 = model_4.fit(train_sentences, train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5



## Retrain the best model you've got so far on the whole training set (no validation split).

Our best model is `model_4`, in which we used the Universal Sentence Encoder model from TensorFlow Hub and fine-tuned it for our own purposes.
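
A comparison like this is easiest to trust when every model is scored the same way on the validation set. The sketch below is one possible way to do that, assuming each model's validation predictions have already been converted to 0/1 labels; `binary_metrics` and the toy data are hypothetical and not part of the notebook. F1 is included because, to my knowledge, it is the metric this Kaggle competition uses for scoring.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels purely for illustration
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
print(toy_scores)
```

In the notebook, the same function could be called with `val_labels` and each model's rounded predictions on `val_sentences`, giving one row of scores per model.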

In [21]:
# Create full sentence/label sets
whole_training_sentences = train_df_shuffled["text"].to_numpy()
whole_training_labels = train_df_shuffled["target"].to_numpy()

# Check shapes
whole_training_sentences.shape, whole_training_labels.shape

((7613,), (7613,))

In [23]:
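# Create new model
# Note: this reuses the sentence_encoder_layer instance from model_4,
# so the encoder starts from the weights already fine-tuned above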
model_5 = tf.keras.Sequential([
    sentence_encoder_layer,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

# Compile model
model_5.compile(loss="binary_crossentropy",
                optimizer="Adam",
                metrics=["accuracy"])

# Fit model with whole data
history_model_5 = model_5.fit(whole_training_sentences, whole_training_labels,
                              epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Make predictions on the test dataset and format them to match Kaggle's `sample_submission.csv` file.

[Make a submission to the Kaggle competition](https://www.kaggle.com/c/nlp-getting-started/data)

In [24]:
# Get model_5 prediction probabilities
model_5_pred_probs = model_5.predict(test_df["text"].to_numpy())
model_5_pred_probs[:10]

array([[9.9921286e-01],
       [9.9712342e-01],
       [9.9967146e-01],
       [9.9740207e-01],
       [9.9970657e-01],
       [9.9581063e-01],
       [4.6759538e-04],
       [4.1384361e-04],
       [3.9360445e-04],
       [3.6974449e-04]], dtype=float32)
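
Each value above is a sigmoid output between 0 and 1. Turning such probabilities into class labels amounts to thresholding at 0.5, which is what rounding does. A plain-Python sketch using the ten values shown (the `threshold` variable is an assumption added for illustration; the notebook itself simply rounds):

```python
# The first ten prediction probabilities printed above
pred_probs = [9.9921286e-01, 9.9712342e-01, 9.9967146e-01, 9.9740207e-01,
              9.9970657e-01, 9.9581063e-01, 4.6759538e-04, 4.1384361e-04,
              3.9360445e-04, 3.6974449e-04]

threshold = 0.5  # probabilities above this count as "disaster"
labels = [int(p > threshold) for p in pred_probs]

print(labels)  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Keeping the threshold explicit makes it easy to experiment with other cut-offs on a validation set, for example to trade precision against recall.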

In [25]:
# Convert model_5's prediction probabilities into labels (round)
model_5_preds = tf.cast(tf.squeeze(tf.round(model_5_pred_probs)), dtype=tf.int32)
model_5_preds[:10]

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=int32)>

In [26]:
# Create submission dataframe
submission = pd.DataFrame({"id": test_df["id"].values,
                           "target": (model_5_preds.numpy())})
submission

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,1
...,...,...
3258,10861,1
3259,10865,1
3260,10868,1
3261,10874,1


In [27]:
# Create submission CSV to upload to Kaggle
submission.to_csv("submission.csv", index=False)
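
Before uploading, it can be worth checking that the file has the shape built above: an `id` column, a `target` column, and only 0/1 targets. The checker below is a hypothetical helper written for this purpose, not something Kaggle provides:

```python
import csv
import io

def check_submission(csv_file, expected_rows=None):
    """Return the row count if the file has id,target columns and 0/1 targets."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != ["id", "target"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")

    n_rows = 0
    for row in reader:
        if row["target"] not in {"0", "1"}:
            raise ValueError(f"non-binary target for id {row['id']}")
        n_rows += 1

    if expected_rows is not None and n_rows != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {n_rows}")
    return n_rows

# In-memory example; a real file opened with open(..., newline="") works the same way
sample = io.StringIO("id,target\n0,1\n2,1\n3,0\n")
print(check_submission(sample, expected_rows=3))  # 3
```

In the notebook this could be run as `check_submission(open("submission.csv", newline=""), expected_rows=len(test_df))`.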

I held the result back until the end to avoid spoilers: this model scores 0.78823 on Kaggle 😀