# Social Media Disaster Alert System

## Training Data Collection and Cleaning

We collected data from several emergency datasets so as not to limit the types of emergencies that FEMA could monitor. We combined them into one large dataset and used it to test and tune our models while we collected real-time data.

In [1]:
import pandas as pd
import numpy as np
import requests
import time
import matplotlib.pyplot as plt
import seaborn as sns
import regex as re
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split, GridSearchCV
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')

In order to train a classification model, we used historical datasets from crisislex.org. Each dataset contains tweets pertaining to a particular disaster that occurred within the last 10 years, along with a label indicating whether each tweet was relevant to that disaster. This label gave us a target variable to train the model on.

In [3]:
sandy = pd.read_csv('./Data/CrisisLexT6/2012_Sandy_Hurricane/2012_Sandy_Hurricane-ontopic_offtopic.csv')

In [4]:
alberta = pd.read_csv('./Data/CrisisLexT6/2013_Alberta_Floods/2013_Alberta_Floods-ontopic_offtopic.csv')

In [5]:
boston = pd.read_csv('./Data/CrisisLexT6/2013_Boston_Bombings/2013_Boston_Bombings-ontopic_offtopic.csv')

In [6]:
oklahoma = pd.read_csv('./Data/CrisisLexT6/2013_Oklahoma_Tornado/2013_Oklahoma_Tornado-ontopic_offtopic.csv')

In [7]:
queensland = pd.read_csv('./Data/CrisisLexT6/2013_Queensland_Floods/2013_Queensland_Floods-ontopic_offtopic.csv')

In [8]:
texas = pd.read_csv('./Data/CrisisLexT6/2013_West_Texas_Explosion/2013_West_Texas_Explosion-ontopic_offtopic.csv')
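The six `read_csv` calls above follow a single path pattern, so the paths could also be generated rather than typed out. The sketch below is one way to do that; `EVENTS` and `crisislex_path` are hypothetical names introduced here for illustration, and the folder layout is assumed to match the paths shown above.

```python
# Event folder names as they appear in the read_csv calls above.
EVENTS = [
    '2012_Sandy_Hurricane',
    '2013_Alberta_Floods',
    '2013_Boston_Bombings',
    '2013_Oklahoma_Tornado',
    '2013_Queensland_Floods',
    '2013_West_Texas_Explosion',
]

def crisislex_path(event, root='./Data/CrisisLexT6'):
    """Build the CSV path for one event (hypothetical helper, assumed layout)."""
    return f"{root}/{event}/{event}-ontopic_offtopic.csv"

paths = [crisislex_path(event) for event in EVENTS]

# With pandas imported, the frames could then be loaded in one pass, e.g.:
# frames = {event: pd.read_csv(crisislex_path(event)) for event in EVENTS}
```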

We checked the first few rows of the datasets to ensure that the formatting was uniform across all 6 datasets.

In [9]:
sandy.head()

Unnamed: 0,tweet id,tweet,label
0,'262596552399396864',I've got enough candles to supply a Mexican fa...,off-topic
1,'263044104500420609',Sandy be soooo mad that she be shattering our ...,on-topic
2,'263309629973491712',@ibexgirl thankfully Hurricane Waugh played it...,off-topic
3,'263422851133079552',@taos you never got that magnificent case of B...,off-topic
4,'262404311223504896',"I'm at Mad River Bar &amp; Grille (New York, N...",off-topic


In [10]:
alberta.head()

Unnamed: 0,tweet id,tweet,label
0,'348351442404376578',@Jay1972Jay Nope. Mid 80's. It's off Metallica...,off-topic
1,'348167215536803841',Nothing like a :16 second downpour to give us ...,off-topic
2,'348644655786778624',@NelsonTagoona so glad that you missed the flo...,on-topic
3,'350519668815036416',"Party hard , suns down , still warm , lovin li...",off-topic
4,'351446519733432320',@Exclusionzone if you compare yourself to wate...,off-topic
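Inspecting `head()` by eye works for a handful of files; a programmatic check of the column labels is a possible complement. The sketch below uses small toy frames rather than the real data, and `same_columns` is a hypothetical helper name.

```python
import pandas as pd

def same_columns(frames):
    """Return True when every DataFrame has identical column labels in the same order."""
    first = list(frames[0].columns)
    return all(list(frame.columns) == first for frame in frames[1:])

# Toy frames standing in for the real datasets (labels copied from the notebook output).
a = pd.DataFrame({'tweet id': ["'1'"], ' tweet': ['x'], ' label': ['on-topic']})
b = pd.DataFrame({'tweet id': ["'2'"], ' tweet': ['y'], ' label': ['off-topic']})
# A frame whose labels lack the leading spaces would not match.
c = pd.DataFrame({'tweet id': ["'3'"], 'tweet': ['z'], 'label': ['on-topic']})

print(same_columns([a, b]))
print(same_columns([a, c]))
```

In the notebook this could be called as `same_columns([sandy, alberta, boston, oklahoma, queensland, texas])`.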


We then concatenated all of the datasets into a single dataset for modeling.

In [11]:
df = pd.concat(objs = [sandy, alberta, boston, oklahoma, queensland, texas], ignore_index=True)

In [12]:
df.head()

Unnamed: 0,tweet id,tweet,label
0,'262596552399396864',I've got enough candles to supply a Mexican fa...,off-topic
1,'263044104500420609',Sandy be soooo mad that she be shattering our ...,on-topic
2,'263309629973491712',@ibexgirl thankfully Hurricane Waugh played it...,off-topic
3,'263422851133079552',@taos you never got that magnificent case of B...,off-topic
4,'262404311223504896',"I'm at Mad River Bar &amp; Grille (New York, N...",off-topic
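After concatenation, the combined frame no longer records which disaster each tweet came from. If that information were wanted later (for example, to evaluate per event), one option is to tag each frame before combining. This is a sketch on toy data, not what the notebook does; the `event` column is an assumption introduced here.

```python
import pandas as pd

# Toy stand-ins for two of the six datasets.
frames = {
    'sandy': pd.DataFrame({' tweet': ['a', 'b'], ' label': ['on-topic', 'off-topic']}),
    'alberta': pd.DataFrame({' tweet': ['c'], ' label': ['on-topic']}),
}

# Add a hypothetical 'event' column to each frame, then concatenate as before.
tagged = [frame.assign(event=name) for name, frame in frames.items()]
combined = pd.concat(tagged, ignore_index=True)

print(combined)
```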


Fortunately, we were able to create a robust dataset containing a little over 60,000 observations.

In [13]:
df.shape

(60082, 3)

In [14]:
df.columns

Index(['tweet id', ' tweet', ' label'], dtype='object')
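Note that two of the column labels, `' tweet'` and `' label'`, begin with a space, which is why the cells that follow index them that way. The notebook keeps these names as they are. As an optional alternative, the labels could be stripped once up front, as in this sketch on an empty toy frame:

```python
import pandas as pd

# Empty toy frame with the same column labels as the notebook's df.
toy = pd.DataFrame(columns=['tweet id', ' tweet', ' label'])

# rename accepts a function and applies it to each label; str.strip removes
# leading and trailing whitespace.
cleaned = toy.rename(columns=str.strip)

print(list(cleaned.columns))
```

If this were applied to `df`, every later reference to `' tweet'` and `' label'` would need to drop the leading space as well.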

As part of the preprocessing, we wanted to use only legitimate "dictionary" words in the model, so tokenizing and lemmatizing the tweets was a necessary step.

In [20]:
# match runs of ASCII letters only; the range A-z would also admit [ \ ] ^ _ and `
tokenizer = RegexpTokenizer(pattern = r'[A-Za-z]+')

In [21]:
lemmatizer = WordNetLemmatizer()

In [22]:
token_titles = [tokenizer.tokenize(title.lower()) for title in df[' tweet']]
lem_tokens = [' '.join([lemmatizer.lemmatize(word) for word in token_list]) for token_list in token_titles]
df['lemmatized_title'] = lem_tokens

In [23]:
df.head()

Unnamed: 0,tweet id,tweet,label,lemmatized_title
0,'262596552399396864',I've got enough candles to supply a Mexican fa...,off-topic,i ve got enough candle to supply a mexican family
1,'263044104500420609',Sandy be soooo mad that she be shattering our ...,on-topic,sandy be soooo mad that she be shattering our ...
2,'263309629973491712',@ibexgirl thankfully Hurricane Waugh played it...,off-topic,ibexgirl thankfully hurricane waugh played it ...
3,'263422851133079552',@taos you never got that magnificent case of B...,off-topic,tao you never got that magnificent case of bur...
4,'262404311223504896',"I'm at Mad River Bar &amp; Grille (New York, N...",off-topic,i m at mad river bar amp grille new york ny ht...
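A note on character classes for letter-only tokenizing: in ASCII, a range written as `A-z` spans code points 65 to 122, which includes the six punctuation characters between `Z` and `a` (the square brackets, backslash, caret, underscore, and backtick), whereas `A-Za-z` covers letters only. The sketch below shows the difference using the standard-library `re.findall` on a made-up sample string; it is meant to mirror the pattern-matching behavior of a regex tokenizer, not to replace the cells above.

```python
import re

# Made-up sample containing characters that sit between 'Z' and 'a' in ASCII.
sample = "stay_safe [nyc] hurricane^sandy"

# The wide range keeps '_', '[', ']' and '^' inside tokens.
wide = re.findall(r'[A-z]+', sample)

# The explicit ranges split on every non-letter character.
narrow = re.findall(r'[A-Za-z]+', sample)

print(wide)    # ['stay_safe', '[nyc]', 'hurricane^sandy']
print(narrow)  # ['stay', 'safe', 'nyc', 'hurricane', 'sandy']
```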


To classify whether a tweet is on-topic, the target variable needs to be binarized so that the model can predict it.

In [24]:
df[' label'].value_counts()

on-topic     32462
off-topic    27620
Name:  label, dtype: int64
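These counts also give a simple reference point: a classifier that always predicted the majority class would be right on that class's share of the data. The arithmetic below uses only the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {'on-topic': 32462, 'off-topic': 27620}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Share of the data belonging to the majority class.
baseline_accuracy = counts[majority_label] / total

print(total)                                        # 60082
print(f"{majority_label}: {baseline_accuracy:.3f}")  # on-topic: 0.540
```

The classes are therefore roughly balanced, with on-topic tweets making up about 54% of the observations.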

In [25]:
# replace on-topic with 1.
df[' label'] = df[' label'].replace('on-topic', 1)

In [26]:
# replace off-topic with 0.
df[' label'] = df[' label'].replace('off-topic', 0)

In [27]:
# check if previous step was done correctly. 
df.head()

Unnamed: 0,tweet id,tweet,label,lemmatized_title
0,'262596552399396864',I've got enough candles to supply a Mexican fa...,0,i ve got enough candle to supply a mexican family
1,'263044104500420609',Sandy be soooo mad that she be shattering our ...,1,sandy be soooo mad that she be shattering our ...
2,'263309629973491712',@ibexgirl thankfully Hurricane Waugh played it...,0,ibexgirl thankfully hurricane waugh played it ...
3,'263422851133079552',@taos you never got that magnificent case of B...,0,tao you never got that magnificent case of bur...
4,'262404311223504896',"I'm at Mad River Bar &amp; Grille (New York, N...",0,i m at mad river bar amp grille new york ny ht...
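An alternative way to binarize, shown here only as a sketch on toy data, is a single `map` with an explicit dictionary. One possible advantage is that any label missing from the dictionary becomes `NaN`, so unexpected values are easy to spot. The `'maybe'` label below is made up to demonstrate that behavior.

```python
import pandas as pd

# Toy labels; 'maybe' is a made-up value that is not in the mapping.
labels = pd.Series(['on-topic', 'off-topic', 'on-topic', 'maybe'])

mapping = {'on-topic': 1, 'off-topic': 0}

# Values absent from the mapping come back as NaN.
binary = labels.map(mapping)

# Collect the original labels that could not be mapped.
unexpected = labels[binary.isna()]

print(binary.tolist())
print(unexpected.tolist())
```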


In [30]:
# check for null values.
df.isnull().sum()

tweet id            0
 tweet              0
 label              0
lemmatized_title    0
dtype: int64

We saved this new dataset as our final dataset to be used in modeling and evaluation.

In [29]:
df.to_csv('./Data/final_df.csv')
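Two things are worth knowing when this file is read back in. With the default settings, `to_csv` writes the index as an unlabeled first column, which `read_csv` then names `Unnamed: 0`. Also, a tweet with no alphabetic tokens yields an empty string, which is not null in memory but is read back as `NaN` by default. The sketch below demonstrates both on a toy frame using an in-memory buffer; passing `index=False` is an option, not something the notebook does.

```python
import io
import pandas as pd

# Toy frame: the second row has an empty lemmatized string.
toy = pd.DataFrame({'lemmatized_title': ['flood warning', ''], ' label': [1, 0]})

# No nulls are reported in memory, because '' is a valid string.
nulls_before = int(toy['lemmatized_title'].isnull().sum())

# Default to_csv writes the index; read_csv names that column 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False omits the index column from the file.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# The empty string comes back as NaN after the round trip.
nulls_after = int(without_index['lemmatized_title'].isnull().sum())

print(list(with_index.columns))
print(list(without_index.columns))
print(nulls_before, nulls_after)
```

When loading the saved file for modeling, it may be worth checking for nulls again and dropping or filling those rows before vectorizing.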