## Natural Disaster Tweet Prediction Model (Notebook)

In [1]:
import numpy as np
import pandas as pd
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import string
import warnings
import pickle
from spacy.lang.char_classes import LIST_PUNCT
from collections import defaultdict
from spellchecker import SpellChecker
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer, WordPunctTokenizer, RegexpTokenizer
from nltk.corpus import stopwords, wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from normalization import text_cleaning, text_preprocessing
from imblearn.over_sampling import SMOTE, RandomOverSampler
from evaluation import score_df, avg_score_list
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import metrics, svm
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

warnings.filterwarnings('ignore')

nlp = spacy.load("en_core_web_sm")
stopwords = nlp.Defaults.stop_words
regexp = RegexpTokenizer(r"[\w']+")  # raw string avoids the invalid "\w" escape warning

### Data Acquisition

Load the downloaded disaster tweets dataset. The raw Twitter dataset obtained from Kaggle covers many kinds of disaster, including non-natural ones such as airplane crashes, car accidents, riots, arson, explosions, and war, which are outside our research scope. The dataset is therefore filtered down to natural disaster tweets using the keyword feature. Tweets are selected according to the list of natural disasters defined by the Federal Emergency Management Agency (FEMA).

In [2]:
raw_tweets_df = pd.read_csv("./data/tweets.csv")
raw_tweets_df['keyword'] = raw_tweets_df['keyword'].str.replace(r"%20", "_")

natural_disaster_keywords = ["avalanche", "bush_fires", "cyclone", "flood", "flooding", "floods", "flames", "forest_fire", "forest_fires", "drought", "dust_storm", 
                            "earthquake", "hail", "hailstorm", "heat_wave", "hurricane", "icestorm", "landslide", "lava", "lightning", "mudslide", "rainstorm", 
                            "sandstorm", "snowstorm",  "strong_wind", "storm", "thunder", "thunderstorm", "tornado", "twister", "typhoon", "tsunami", "violent_storm", 
                            "volcanic_activity", "volcano", "volcanic", "wildfire", "wildfires", "wild_fire", "wild_fires", "whirlwind", "windstorm"]

natural_disaster_keywords_df = raw_tweets_df[raw_tweets_df['keyword'].isin(natural_disaster_keywords)].reset_index(drop=True)
non_natural_disaster_keywords_df = raw_tweets_df[~raw_tweets_df['keyword'].isin(natural_disaster_keywords)].reset_index(drop=True)
extra_natural_disaster_df = non_natural_disaster_keywords_df[non_natural_disaster_keywords_df.apply(lambda x: bool(re.search('|'.join(natural_disaster_keywords), x.text)) and x.target == 1, axis=1)]
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
filtered_natural_disaster_df = pd.concat([natural_disaster_keywords_df, extra_natural_disaster_df], ignore_index=True)
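
The `re.search` call above is case-sensitive and also matches inside longer words (for example, "hail" in "hailed"), and keywords containing "_" will rarely match raw tweet text. Below is a minimal sketch of a stricter matcher, assuming case-insensitive, whole-word matching is what is wanted. The short keyword list and the helper name `mentions_natural_disaster` are hypothetical and not part of the notebook.

```python
import re

# Hypothetical subset of the keyword list, for illustration only.
keywords = ["hail", "storm", "forest_fire", "flood"]

# Keywords use "_" where tweet text normally has a space, so accept either.
alternatives = [re.escape(k).replace("_", "[ _]") for k in keywords]

# \b anchors the match to whole words; IGNORECASE handles "Hail", "STORM", etc.
pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def mentions_natural_disaster(text):
    """Return True if the text contains any keyword as a whole word."""
    return pattern.search(text) is not None
```

With this pattern, "Hail damaged the roof" and "Forest fire near the town" match, while "She hailed a taxi" and "brainstorm session" do not.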

### Data Cleaning


Once the raw dataset has been loaded, we clean it with the following steps:
1) Convert text to lowercase
2) Remove extra whitespace
3) Remove punctuation
4) Remove HTML tags
5) Remove HTML entities
6) Remove URL links
7) Remove emoji
8) Remove non-ASCII characters
9) Remove numbers
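
The `text_cleaning` function is imported from the local `normalization` module, whose source is not shown here. The steps listed above could be sketched roughly as follows, using only the standard library. The function name `clean_text_sketch`, the regular expressions, and the order of operations are assumptions, not the notebook's actual implementation.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")      # URL links
TAG_RE = re.compile(r"<[^>]+>")                    # HTML tags such as <b>
ENTITY_RE = re.compile(r"&(?:[a-zA-Z]+|#\d+);")    # HTML entities such as &amp;
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text_sketch(text):
    """Apply the nine cleaning steps in an order where they do not interfere."""
    text = text.lower()
    # URLs, tags and entities must go before punctuation is stripped,
    # otherwise their delimiters disappear and they can no longer be matched.
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = ENTITY_RE.sub(" ", text)
    # Dropping non-ASCII characters also removes emoji.
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"\d+", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


example = clean_text_sketch("Flood &amp; storm <b>NOW</b> 🌊 at 8:30 http://t.co/abc  ")
```

For the example string this sketch returns `flood storm now at`.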

In [3]:
filtered_natural_disaster_df["cleaned_text"] = filtered_natural_disaster_df["text"].apply(text_cleaning) # implementing text cleaning

### Data Pre-Processing

Next, the cleaned disaster Twitter dataset is processed further before it is used to train the model. The pre-processing steps are:
1) Expand contractions
2) Correct spellings
3) Remove stopwords
4) Lemmatize words
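
The `text_preprocessing` function also comes from the `normalization` module and is not shown. Two of the listed steps, contraction expansion and stopword removal, can be sketched with the standard library alone. The tiny `CONTRACTIONS` and `STOPWORDS` tables and the name `preprocess_sketch` are hypothetical, and the sketch assumes the input still contains apostrophes. Spelling correction and lemmatization are left out because they depend on external resources (for example the `SpellChecker` and spaCy objects imported earlier).

```python
# Hypothetical lookup tables; a real pipeline would use much larger ones.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}
STOPWORDS = {"a", "an", "the", "is", "it", "i", "am", "to", "of", "in"}


def preprocess_sketch(text):
    """Expand contractions, then drop stopwords (case-insensitively)."""
    tokens = []
    for token in text.split():
        expanded = CONTRACTIONS.get(token.lower(), token)
        tokens.extend(expanded.split())  # "it is" becomes two tokens
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(kept)


example = preprocess_sketch("it's the flood i can't escape")
```

Here `example` is `flood cannot escape`: the contractions are expanded first so that their parts can be checked against the stopword list.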

In [4]:
filtered_natural_disaster_df["normalized_text"] = filtered_natural_disaster_df["cleaned_text"].apply(text_preprocessing) # implementing text normalization
filtered_natural_disaster_df[["id", "keyword", "location", "text", "normalized_text", "target"]]

Unnamed: 0,id,keyword,location,text,normalized_text,target
0,665,avalanche,Hell,Washington has an avalanche rescue goat. His n...,Washington avalanche rescue goat his name Mazama,0
1,666,avalanche,South Africa,petition for KFC to bring back the avalanche,petition KFC bring back avalanche,0
2,667,avalanche,,BSF ‘Borderman’ killed in avalanche: On Jan 13...,BSF Borderman kill avalanche on Jan evening pm...,0
3,668,avalanche,"Bangalore, India",1 BSF soldier killed in avalanche in Kashmir's...,BSF soldier kill avalanche Naugam sector perso...,1
4,669,avalanche,,"Very sad news coming in. 3 soldiers killed, 1 ...",very sad news come soldier kill miss avalanche...,1
...,...,...,...,...,...,...
1719,10337,trapped,Turkey,Five soldiers were trapped under the #avalanch...,five soldier trap avalanche effort trace one m...,1
1720,10378,trapped,Jammu And Kashmir,Border Security Force: Last evening at 8:30 pm...,Border Security Force last evening pm avalanch...,1
1721,10381,trapped,"Mumbai, India",Please pray for the safety of 5 soldiers who h...,please pray safety soldier trap avalanche Mach...,1
1722,10382,trapped,UNION REPUBLIC of Hindustan,Border Security Force: Last evening at 8:30 pm...,Border Security Force last evening pm avalanch...,1


### Feature Engineering

Machine learning algorithms cannot operate on raw text, so the cleaned and normalized tweet contents need to be converted to numeric vectors. This research uses two (2) vectorization methods, both based on the Bag-of-Words model:
1) Count
2) TF-IDF

In [5]:
X = filtered_natural_disaster_df['normalized_text'].tolist()
y = filtered_natural_disaster_df['target'].tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

CountVectorizer
- Counts how often each word occurs in each document/text and converts the result into a numerical feature vector, where each element is the number of occurrences of one vocabulary word within that document/text.
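
As a hand-rolled illustration of the idea (not scikit-learn's implementation), the sketch below builds count vectors for a toy corpus. The corpus, the whitespace tokenization, and the helper name `count_vector` are assumptions made for this example.

```python
from collections import Counter

# Toy "training" corpus, for illustration only.
train_docs = ["flood warning flood", "storm warning"]

# The vocabulary is learned from the training documents only.
vocabulary = sorted({word for doc in train_docs for word in doc.split()})


def count_vector(doc):
    """Map a document onto the fixed training vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


train_vectors = [count_vector(doc) for doc in train_docs]

# A new document reuses the same vocabulary; unseen words are ignored.
test_vector = count_vector("storm storm tsunami")
```

The vocabulary is `['flood', 'storm', 'warning']`, the training vectors are `[2, 0, 1]` and `[0, 1, 1]`, and the new document maps to `[0, 2, 0]`. This is why a fitted vectorizer should call `transform`, not `fit_transform`, on test data: refitting would build a different vocabulary and the columns would no longer line up.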

In [6]:
# bow = CountVectorizer()
# X_train_bow = bow.fit_transform(X_train)
# X_test_bow = bow.transform(X_test)  # transform only: reuse the vocabulary fitted on X_train
# X_bow = bow.fit_transform(X)
# bow_model_evaluation_df = score_df(X_bow.toarray(), y)
# bow_model_evaluation_df
# bow_model_evaluation_df.to_csv("bow_model_evaluation.csv", index=False)

TF-IDF Vectorizer
- Weights a word's frequency within a document against how many documents in the collection contain it, so words that are frequent in one document but rare across the collection score highest. This reflects how important each word is to a document within the collection.
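
The weighting can be worked through by hand on a toy corpus. The sketch below follows the formula scikit-learn's `TfidfVectorizer` uses with its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, but it is a plain-Python illustration with unigrams only and whitespace tokenization, not the library's implementation. The corpus and the helper name `tfidf_vector` are assumptions.

```python
import math
from collections import Counter

docs = ["flood warning flood", "storm warning"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = {word: sum(word in doc for doc in tokenized) for word in vocab}

# Smoothed inverse document frequency.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}


def tfidf_vector(tokens):
    """Raw term count times idf, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


vectors = [tfidf_vector(doc) for doc in tokenized]
```

"warning" appears in both documents, so its idf is exactly 1, the minimum under this formula. In the second document "storm" therefore receives a larger weight than "warning" even though both occur once.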

In [7]:
TFIDF = TfidfVectorizer(ngram_range = (1, 2))
X_train_tfidf = TFIDF.fit_transform(X_train)
# tfidf_model_evaluation_df = score_df(X_train_tfidf.toarray(), y_train)
# tfidf_model_evaluation_df
# tfidf_model_evaluation_df.to_csv("tfidf_model_evaluation.csv", index=False)

##### Resampling
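
Resampling balances the classes in the training split before the classifiers are fitted. The idea behind random oversampling can be sketched in plain Python: minority-class rows are duplicated at random until every class is as large as the largest one. The function name `random_oversample` and the toy data are hypothetical, and this is an illustration of the idea, not imbalanced-learn's implementation.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=123):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [row for row, lbl in zip(X, y) if lbl == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Toy data: four rows of class 0 and a single row of class 1.
X_toy = [[0], [1], [2], [3], [10]]
y_toy = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X_toy, y_toy)
```

After resampling both classes have four rows, and the three added rows are copies of `[10]`. SMOTE differs in that it synthesizes new minority points by interpolating between neighbouring minority samples instead of copying existing ones.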

In [8]:
# Random oversampler with TF-IDF

# ROS = RandomOverSampler(random_state=123)
# X_train_tfidf = TFIDF.fit_transform(X_train)
# X_test_tfidf = TFIDF.transform(X_test)  # transform only: reuse the vocabulary fitted on X_train
# X_train_ros, y_train_ros = ROS.fit_resample(X_train_tfidf, y_train)
# ros_tfidf_model_evaluation_df = score_df(X_train_ros.toarray(), y_train_ros)
# ros_tfidf_model_evaluation_df.to_csv("ros_model_evaluation.csv", index=False)

In [9]:
# Synthetic Minority Oversampling Technique (SMOTE) oversampler with TF-IDF

smote = SMOTE(random_state=123)
X_train_smote, y_train_smote = smote.fit_resample(X_train_tfidf, y_train)
smote_tfidf_model_evaluation_df = score_df(X_train_smote.toarray(), y_train_smote)
smote_tfidf_model_evaluation_df
#smote_tfidf_model_evaluation_df.to_csv("smote_model_evaluation.csv", index=False)

Unnamed: 0,Classifier,Average_accuracy_score,Average_precision_score,Average_recall_score,Average_f1-score
0,Linear SVM,0.897981,0.992954,0.802439,0.874226
1,Naive Bayes,0.871638,0.810117,0.971951,0.883284
2,Random Forest,0.902857,0.961091,0.840236,0.889119
3,XGBoost,0.867392,0.897199,0.832852,0.856655
4,Soft Voting,0.931553,0.944718,0.919512,0.929384
5,Hard Voting,0.909577,0.9822,0.835366,0.893682


### Model Deployment

The evaluation above shows that the soft-voting ensemble performs best, with the highest average accuracy and F1-score, so we train that model on the resampled training set for deployment.
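
The difference between the two voting schemes can be shown with a small hand-computed example. Hard voting takes the majority of the predicted labels, while soft voting averages the predicted class probabilities and picks the class with the highest average, which is why the SVC in the ensemble needs `probability=True`. The probabilities and the function names below are hypothetical.

```python
from collections import Counter

# Hypothetical [P(class 0), P(class 1)] from three classifiers for one tweet.
probas = [[0.45, 0.55], [0.40, 0.60], [0.95, 0.05]]


def hard_vote_predict(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote_predict(probas):
    """Average the probabilities per class, then take the argmax."""
    n_classes = len(probas[0])
    averages = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=averages.__getitem__)


hard_label = hard_vote_predict(probas)
soft_label = soft_vote_predict(probas)
```

Two of the three classifiers lean slightly towards class 1, so hard voting returns 1. The third is very confident in class 0, which pulls the averaged probabilities to 0.6 versus 0.4, so soft voting returns 0.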

In [10]:
X = filtered_natural_disaster_df['normalized_text'].tolist()
y = filtered_natural_disaster_df['target'].tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

TFIDF = TfidfVectorizer(ngram_range = (1, 2))
smote = SMOTE(random_state=123)

# NOTE: SVC defaults to kernel='rbf'; pass kernel='linear' if a linear SVM is intended.
svm_linear = svm.SVC(probability=True, random_state = 123)
nb = GaussianNB()
rf = RandomForestClassifier(n_jobs = -1, random_state = 123)
xgb = XGBClassifier(eval_metric = 'logloss', random_state = 123)
soft_vote = VotingClassifier(estimators = [('svm', svm_linear), ('nb', nb), ('rf', rf), ('xgb',xgb)], voting = 'soft', n_jobs=-1)
hard_vote = VotingClassifier(estimators = [('svm', svm_linear), ('nb', nb), ('rf', rf), ('xgb',xgb)], voting = 'hard', n_jobs=-1)

X_train_tfidf = TFIDF.fit_transform(X_train)
# Pickle the vectorizer only after fitting, so the saved object carries its vocabulary.
# with open('tfidf.pkl', 'wb') as f:
#     pickle.dump(TFIDF, f)
X_test_tfidf = TFIDF.transform(X_test)
X_train_smote, y_train_smote = smote.fit_resample(X_train_tfidf, y_train)

In [11]:
soft_vote.fit(X_train_smote.toarray(), y_train_smote)
# pickle.dump(soft_vote, open('model.pkl', 'wb'))

VotingClassifier(estimators=[('svm', SVC(probability=True, random_state=123)),
                             ('nb', GaussianNB()),
                             ('rf',
                              RandomForestClassifier(n_jobs=-1,
                                                     random_state=123)),
                             ('xgb',
                              XGBClassifier(base_score=None, booster=None,
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=None,
                                            enable_categorical=False,
                                            eval_metric='logloss', gamma=None,
                                            gpu_id=None, importanc...
                                            learning_rate=None,
                                            max_delta_step=None, max_depth=None,
                   