# **Data Preparation**

We'll be using the [Real or Not?](https://www.kaggle.com/c/nlp-getting-started/data) dataset from Kaggle, which contains text-based Tweets about natural disasters.

We explore text vectorization and text embedding as preprocessing techniques, and use TF-IDF features with a Multinomial Naive Bayes classifier as the model.


In [None]:
import zipfile

def unzip_data(filename):
  """
  Unzips filename into the current working directory.

  Args:
    filename (str): a filepath to a target zip folder to be unzipped.
  """
  # The context manager closes the archive even if extraction fails
  with zipfile.ZipFile(filename, "r") as zip_ref:
    zip_ref.extractall()

# Unzip data
unzip_data("real-or-not.zip")

In [2]:
# Turn .csv files into pandas DataFrames
import pandas as pd
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [3]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1, random_state=42) # shuffle with random_state=42 for reproducibility
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


# **Data Exploration**

In [4]:
# Check the test dataframe
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [5]:
# How many samples of each label?
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
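
The two counts above suggest a mildly imbalanced dataset. As a minimal sketch that uses only those printed counts (the variable names are illustrative, not part of the notebook), the class proportions and the accuracy of a trivial "always predict the majority class" model can be worked out like this:

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = proportions[majority_label]

for label, share in proportions.items():
    print(f"Class {label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any trained model should beat this figure to be considered useful.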

In [6]:
# How many samples total?
print(f"Total training samples: {len(train_df)}")
print(f"Total test samples: {len(test_df)}")
print(f"Total samples: {len(train_df) + len(test_df)}")

Total training samples: 7613
Total test samples: 3263
Total samples: 10876


In [7]:
# Let's visualize some random training examples
import random
random_index = random.randint(0, len(train_df)-5) # pick a random start index that leaves room for 5 rows
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 1 (real disaster)
Text:

---

Target: 1 (real disaster)
Text:
Police Officer Wounded Suspect Dead After Exchanging Shots: Richmond police officer wounded suspect killed a... http://t.co/5uFTRXPpV0

---

Target: 0 (not real disaster)
Text:
act my age was a MESS everyone was so wild it was so fun my videos a wreck

---

Target: 0 (not real disaster)
Text:
A true #TBT  Eyewitness News WBRE WYOU http://t.co/JHVigsX5Jg

---

Target: 1 (real disaster)
Text:
LLF TALK  WORLD NEWS U.S. in record hurricane drought - The United States hasn't been hit by a major hurricane in ... http://t.co/ML8IrhWg7O

---



# **Data Preprocessing**

In [8]:
from sklearn.model_selection import train_test_split

# Use train_test_split to split training data into training and validation sets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1, # dedicate 10% of samples to validation set
                                                                            random_state=42) # random state for reproducibility


In [9]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
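
The lengths above follow from how a fractional `test_size` is turned into a row count. A small sketch of that arithmetic, assuming scikit-learn's behaviour of rounding the validation share up and giving the remainder to training:

```python
import math

n_samples = 7613   # total training rows reported earlier
test_size = 0.1    # fraction passed to train_test_split

# The validation size is rounded up; the rest goes to training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```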

In [10]:
# View the first 10 training sentences and their labels
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object), array([0, 

In [11]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization # in TensorFlow versions before 2.6, import from tensorflow.keras.layers.experimental.preprocessing instead

# Use the default TextVectorization parameters
text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary (all of the different words in our text)
                                    standardize="lower_and_strip_punctuation", # how to process text
                                    split="whitespace", # how to split tokens
                                    ngrams=None, # create groups of n-words?
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None) # how long should the output sequence of tokens be?
                                    # pad_to_max_tokens=True) # Not valid if using max_tokens=None

In [12]:
# Find average number of tokens (words) in training Tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15
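
Using the average length as the sequence length means longer Tweets get truncated. The sketch below illustrates the trade-off on four example Tweets taken from the tables above; the list `sentences` is a stand-in for `train_sentences`, and `covered` is an illustrative name for the share of sentences that fit without truncation:

```python
import statistics

# Small stand-in for train_sentences
sentences = [
    "Forest fire near La Ronge Sask. Canada",
    "Just happened a terrible car crash",
    "Apocalypse lighting. #Spokane #wildfires",
    "Typhoon Soudelor kills 28 in China and Taiwan",
]

lengths = [len(s.split()) for s in sentences]
mean_length = round(statistics.mean(lengths))

# Share of sentences that fit entirely within the mean length
covered = sum(n <= mean_length for n in lengths) / len(lengths)

print(f"Lengths: {lengths}, mean: {mean_length}, fully covered: {covered:.0%}")
```

If too many samples are cut off, a higher percentile of the length distribution may be a better choice than the mean.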

In [13]:
# Setup text vectorization with custom variables
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be 

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

In [14]:
text_vectorizer.adapt(train_sentences)
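
To make the adapt-then-vectorize step more concrete, here is a rough pure-Python approximation of what the layer does in `"int"` mode. It is a sketch only: the helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, the punctuation rule is simplified, and the real layer is implemented differently. It does follow the layer's convention that id 0 is padding and id 1 is the out-of-vocabulary token.

```python
import re
from collections import Counter

def standardize(text):
    """Lowercase and drop punctuation (simplified version of the default)."""
    return re.sub(r"[^\w\s]", "", text.lower())

def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to ids; 0 = padding, 1 = unknown."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def vectorize(text, vocab, seq_len):
    """Turn text into exactly seq_len ids, truncating or right-padding with 0."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

corpus = ["Forest fire near La Ronge", "Forest fire at spot pond"]
vocab = build_vocab(corpus, max_tokens=10000)   # plays the role of adapt()
encoded = vectorize("Forest fire near the pond!", vocab, seq_len=8)
print(encoded)
```

The word "the" never appeared in the toy corpus, so it maps to the unknown id, and the trailing zeros are padding.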

In [15]:
tf.random.set_seed(42)
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length, # set input shape
                             output_dim=128, # set size of embedding vector
                             embeddings_initializer="uniform", # default, initialize randomly
                             input_length=max_length, # how long is each input
                             name="embedding_1") 

In [16]:

# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nEmbedded version:")

# Embed the random sentence (turn it into numerical representation)
sample_embed = embedding(text_vectorizer([random_sentence]))

Original text:
when you're taking a shower and someone flushes the toilet and you have .1 second to GTFO or you get burned??????????????????????????????????????????????????      

Embedded version:
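
Conceptually, an embedding layer is a lookup table with one row per token id. The following pure-Python sketch shows that idea with tiny stand-in sizes (10 tokens and 4 dimensions instead of 10000 and 128); the names `table` and `embedded` are illustrative, and the real layer stores its weights as a trainable tensor:

```python
import random

random.seed(42)
vocab_size, embed_dim = 10, 4   # tiny stand-ins for 10000 and 128

# One row of embed_dim floats per token id, initialised uniformly at random
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

token_ids = [2, 3, 4, 1, 9, 0, 0, 0]        # one vectorized sentence
embedded = [table[i] for i in token_ids]    # lookup: id -> row

print(len(embedded), len(embedded[0]))      # sequence length, embedding size
```

With the settings used in this notebook, embedding a single sentence would give a tensor of shape `(1, 15, 128)`: one sentence, 15 tokens, 128 values per token.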


# **Modelling**

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline
model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
                    ("clf", MultinomialNB()) # model the text
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
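
The `tfidf` step turns each Tweet into a weighted bag of words. A minimal sketch of the weighting, assuming scikit-learn's default settings (smoothed IDF and L2 normalisation) and using pre-tokenized toy documents instead of the library's own tokenizer:

```python
import math
from collections import Counter

# Toy, already-tokenized documents (hypothetical data)
docs = [["forest", "fire", "near", "town"],
        ["car", "crash", "near", "town"],
        ["forest", "fire", "spreads"]]

n_docs = len(docs)
# Document frequency: in how many documents each token appears
df = Counter(tok for doc in docs for tok in set(doc))

def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(docs[2])
print(vec)
```

Rarer tokens such as "spreads" receive a higher weight than tokens that occur in several documents, which is what lets the classifier focus on distinctive words.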

In [18]:
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of: 79.27%


In [19]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

In [20]:
# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a binary classification model.

  Args:
  -----
  y_true = true labels in the form of a 1D array
  y_pred = predicted labels in the form of a 1D array

  Returns a dictionary of accuracy, precision, recall, f1-score.
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
  return model_results

In [21]:
# Get baseline results
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'f1': 0.7862189758049549,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706}
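
Note that the weighted recall above is identical to the accuracy (as a fraction). That is expected: weighting each class's recall by its share of the samples adds up to the overall fraction of correct predictions. A pure-Python sketch of the "weighted" averaging on toy labels (the function name `weighted_scores` is hypothetical):

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all classes."""
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        p_l = tp / (tp + fp) if tp + fp else 0.0
        r_l = tp / (tp + fn) if tp + fn else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        support = (tp + fn) / n          # share of samples in this class
        precision += support * p_l
        recall += support * r_l
        f1 += support * f_l
    return precision, recall, f1

y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]
precision, recall, f1 = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```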

# **Save Model**

In [22]:
import joblib
joblib.dump(model_0, 'model.h5') # note: despite the .h5 extension, joblib writes a pickle-based file, not HDF5

['model.h5']

In [23]:
loaded_model_SavedModel = joblib.load('model.h5')

In [24]:
loaded_model_preds = loaded_model_SavedModel.predict(val_sentences)
loaded_model_results = calculate_results(y_true=val_labels,
                                     y_pred=loaded_model_preds)
loaded_model_results
# The loaded model produces the same results as the model we created,
#  which means the model was saved and restored successfully

{'accuracy': 79.26509186351706,
 'f1': 0.7862189758049549,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706}

# **Model Exploration**

In [25]:
loaded_model_probs = loaded_model_SavedModel.predict_proba(val_sentences)[:,1]

In [26]:
val_df = pd.DataFrame({"text": val_sentences,
                       "target": val_labels,
                       "pred": loaded_model_preds,
                       "pred_prob": loaded_model_probs})
val_df.head()

Unnamed: 0,text,target,pred,pred_prob
0,DFR EP016 Monthly Meltdown - On Dnbheaven 2015...,0,1,0.555075
1,FedEx no longer to transport bioterror germs i...,0,1,0.701657
2,Gunmen kill four in El Salvador bus attack: Su...,1,1,0.86646
3,@camilacabello97 Internally and externally scr...,1,0,0.219887
4,Radiation emergency #preparedness starts with ...,1,0,0.368094


In [27]:
# Find the wrong predictions and sort by prediction probabilities
most_wrong = val_df[val_df["target"] != val_df["pred"]].sort_values("pred_prob", ascending=False)
most_wrong[:10]

Unnamed: 0,text,target,pred,pred_prob
606,Maid charged with stealing Dh30000 from police...,0,1,0.823563
303,Trafford Centre film fans angry after Odeon ci...,0,1,0.78665
209,Ashes 2015: AustraliaÛªs collapse at Trent Br...,0,1,0.750554
129,Drowning in Actavis suicide,0,1,0.712391
1,FedEx no longer to transport bioterror germs i...,0,1,0.701657
182,@ONU_France 74/75 Bioterrorism on '@Rockefelle...,0,1,0.699034
284,Truth...\nhttps://t.co/h6amECX5K7\n#News\n#BBC...,0,1,0.695575
381,Deaths 3 http://t.co/nApviyGKYK,0,1,0.680968
698,åÈMGN-AFRICAå¨ pin:263789F4 åÈ Correction: Ten...,0,1,0.670889
759,FedEx will no longer transport bioterror patho...,0,1,0.638878
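
Sorting by `pred_prob` in descending order puts the most confident false positives at the top and the most confident false negatives at the bottom. An alternative, sketched here on illustrative rows, is to rank every mistake by how far the predicted probability is from the true label, so both kinds of error share one ranking (`predicted_label` and `ranked` are hypothetical names):

```python
# Illustrative validation rows: (true label, predicted probability of class 1)
rows = [
    (0, 0.823563),
    (0, 0.638878),
    (1, 0.219887),
    (1, 0.025297),
    (1, 0.866460),  # correct prediction, filtered out below
]

def predicted_label(prob, threshold=0.5):
    """Convert a probability into a 0/1 label."""
    return int(prob >= threshold)

# Keep only the mistakes, then sort by distance between label and probability
wrong = [(target, prob) for target, prob in rows
         if predicted_label(prob) != target]
ranked = sorted(wrong, key=lambda row: abs(row[0] - row[1]), reverse=True)

for target, prob in ranked:
    print(f"Target: {target}, Prob: {prob}, Error: {abs(target - prob):.3f}")
```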


In [28]:
# Check the most wrong false positives (model predicted 1 when it should've predicted 0)
for row in most_wrong[:5].itertuples(): # loop through the top 5 rows (change the slice to view different rows)
  _, text, target, pred, prob = row
  print(f"Target: {target}, Pred: {int(pred)}, Prob: {prob}")
  print(f"Text:\n{text}\n")
  print("----\n")

Target: 0, Pred: 1, Prob: 0.8235632593451624
Text:
Maid charged with stealing Dh30000 from police officer sponsor http://t.co/y35qtVDSOH | https://t.co/qhUJAjCTR5

----

Target: 0, Pred: 1, Prob: 0.7866496291577006
Text:
Trafford Centre film fans angry after Odeon cinema evacuated following false fire alarm   http://t.co/6GLDwx71DA

----

Target: 0, Pred: 1, Prob: 0.7505539881510137
Text:
Ashes 2015: AustraliaÛªs collapse at Trent Bridge among worst in history: England bundled out Australia for 60 ... http://t.co/t5TrhjUAU0

----

Target: 0, Pred: 1, Prob: 0.7123912297994762
Text:
Drowning in Actavis suicide

----

Target: 0, Pred: 1, Prob: 0.701656549293724
Text:
FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps http://t.co/qZQc8WWwcN via @usatoday

----



In [29]:
# Check the most wrong false negatives (model predicted 0 when it should've predicted 1)
for row in most_wrong[-5:].itertuples():
  _, text, target, pred, prob = row
  print(f"Target: {target}, Pred: {int(pred)}, Prob: {prob}")
  print(f"Text:\n{text}\n")
  print("----\n")

Target: 1, Pred: 0, Prob: 0.05213190573918038
Text:
@willienelson We need help! Horses will die!Please RT &amp; sign petition!Take a stand &amp; be a voice for them! #gilbert23 https://t.co/e8dl1lNCVu

----

Target: 1, Pred: 0, Prob: 0.046861198890915376
Text:
Just came back from camping and returned with a new song which gets recorded tomorrow. Can't wait! #Desolation #TheConspiracyTheory #NewEP

----

Target: 1, Pred: 0, Prob: 0.04477045194954551
Text:
Why are you deluged with low self-image? Take the quiz: http://t.co/XsPqdOrIqj http://t.co/CQYvFR4UCy

----

Target: 1, Pred: 0, Prob: 0.0357261825511069
Text:
When you go to a concert and someone screams in your ear... Does it look like I wanna loose my hearing anytime soon???

----

Target: 1, Pred: 0, Prob: 0.02529689983647664
Text:
You can never escape me. Bullets don't harm me. Nothing harms me. But I know pain. I know pain. Sometimes I share it. With someone like you.

----



In [30]:
import numpy as np

In [31]:
# Making predictions on the test dataset
test_sentences = test_df["text"].to_list()
test_samples = random.sample(test_sentences, 5)
for test_sample in test_samples:
  pred_prob = model_0.predict_proba([test_sample])[:,1] # input has to be a list; returns a 1-element array
  pred = np.round(pred_prob)
  print(f"Pred: {int(pred[0])}, Prob: {pred_prob}") # index the array before converting to int
  print(f"Text:\n{test_sample}\n")
  print("----\n")

Pred: 1, Prob: [0.82674113]
Text:
MEG issues Hazardous Weather Outlook (HWO) http://t.co/3X6RBQJHn3

----

Pred: 1, Prob: [0.88732242]
Text:
Green Line train derails on South Side passengers safely evacuated CTA says http://t.co/w6F7ZiS3KA http://t.co/t7L8jCjyq3

----

Pred: 1, Prob: [0.83290383]
Text:
#ISIL has claimed credit for three suicide bombing attacks targeting regime checkpoints on the outskirts of Al-Qaryatayn. #Homs #Syria

----

Pred: 1, Prob: [0.53216523]
Text:
SB228 [Passed] Relating to sources of radiation; and declaring an emergency. http://t.co/D1xlFKNJsM

----

Pred: 0, Prob: [0.20238492]
Text:
*standing in line at JoAnn's little girl and her mom behind me*

Little girl: Mommy is that a boy or a girl?

...
Welp www

----



In [32]:
def predict_on_sentence(model, sentence):
  """
  Uses model to make a prediction on sentence.

  Prints the predicted label, the prediction probability and the sentence.
  """
  pred_prob = model.predict_proba([sentence])[:,1]
  pred_label = np.round(pred_prob)
  print(f"Pred: {pred_label[0]}", "(real disaster)" if pred_label[0] > 0 else "(not real disaster)", f"Prob: {round(pred_prob[0],3)}")
  print(f"Text:\n{sentence}")
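
The function above relies on rounding to turn a probability into a label. Both Python's `round` and `np.round` round halves to the nearest even number, so a probability of exactly 0.5 becomes 0. If a different cut-off is wanted, an explicit threshold is clearer. A small sketch (the helper `label_from_prob` is hypothetical, not part of the notebook):

```python
def label_from_prob(prob, threshold=0.5):
    """Return 1 when prob reaches the threshold, else 0."""
    return int(prob >= threshold)

# Rounding sends exactly 0.5 to 0, while the explicit threshold gives 1
print(round(0.5), label_from_prob(0.5))

# A stricter threshold trades recall for precision
print(label_from_prob(0.781), label_from_prob(0.781, threshold=0.8))
```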


# **Tweet Classification : @TheJakartaPost Twitter account**

In [33]:
# Sample Tweet from @TheJakartaPost
text = 'Biden needed to demonstrate himself as a decisive leader in front of Xi because he knew very well the Republicans were waiting for him to commit “a slip of the tongue” that could be used to help defeat the Democrats in the mid-term elections next year'

In [34]:
# Make a prediction on Tweet from the wild
predict_on_sentence(model=model_0,
                    sentence=text)


Pred: 0.0 (not real disaster) Prob: 0.204
Text:
Biden needed to demonstrate himself as a decisive leader in front of Xi because he knew very well the Republicans were waiting for him to commit “a slip of the tongue” that could be used to help defeat the Democrats in the mid-term elections next year


In [35]:
# https://twitter.com/jakpost/status/1460782801414987777
text1 = 'Jakarta disaster agency warns residents of the capital and its satellite cities of possible extreme weather and floods.'
# https://twitter.com/jakpost/status/1460626574374539265
text2 = 'Mental health communities and their volunteers are reaching out across the archipelago to help people deal with their psychological and emotional impacts from the pandemic.'

In [36]:
# Predict on disaster Tweet 1
predict_on_sentence(model=model_0, 
                    sentence=text1)


Pred: 1.0 (real disaster) Prob: 0.781
Text:
Jakarta disaster agency warns residents of the capital and its satellite cities of possible extreme weather and floods.


In [37]:
# Predict on Tweet 2
predict_on_sentence(model=model_0, 
                    sentence=text2)

Pred: 0.0 (not real disaster) Prob: 0.303
Text:
Mental health communities and their volunteers are reaching out across the archipelago to help people deal with their psychological and emotional impacts from the pandemic.
