# Disaster Relief Project

This notebook analyses disaster data from [Figure Eight](https://www.figure-eight.com), a company specialising in data analytics and machine learning, to build a machine learning pipeline that classifies messages sent during a natural disaster. The purpose is to classify messages that were created during a disaster into 36 categories to help in aid efforts. This project is done to complete the requirements for Udacity's Data Scientist Nanodegree.

# ETL Pipeline Preparation

This section analyses thousands of disaster messages which are real messages that were sent during disaster events. First, it loads the messages and categories datasets, then merges the two datasets and cleans the data and finally stores it in a SQLite database.

Import needed libraries and packages

In [1]:
import re
import nltk
import pickle
import sqlite3

In [2]:
import numpy as np
import pandas as pd

In [3]:
import warnings
warnings.simplefilter('ignore')
from sqlalchemy import create_engine

In [4]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [5]:
nltk.download('punkt')
nltk.download('stopwords')

In [6]:
# load csv files
categories = pd.read_csv('disaster_categories.csv')
messages = pd.read_csv('disaster_messages.csv')

# Data Exploration

In [7]:
# explore message dataset
messages.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [8]:
# explore categories dataset
categories.head()

Unnamed: 0,id,categories
0,2,related-1;request-0;offer-0;aid_related-0;medi...
1,7,related-1;request-0;offer-0;aid_related-1;medi...
2,8,related-1;request-0;offer-0;aid_related-0;medi...
3,9,related-1;request-1;offer-0;aid_related-1;medi...
4,12,related-1;request-0;offer-0;aid_related-0;medi...


In [9]:
# merge these two datasets into a master dataframe
master_df = messages.merge(categories, on = ['id'], how = 'left')
master_df.head()

Unnamed: 0,id,message,original,genre,categories
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,related-1;request-0;offer-0;aid_related-0;medi...
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,related-1;request-0;offer-0;aid_related-1;medi...
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,related-1;request-0;offer-0;aid_related-0;medi...
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,related-1;request-1;offer-0;aid_related-1;medi...
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,related-1;request-0;offer-0;aid_related-0;medi...


# Data Cleaning

Now lets start by cleaning categoreis dataset, <br>
(1) first we gonna split the values of categores column on (;), <br>
(2) then create a new column for each value and use the first row to name these columns, <br>
(3) finally, we ganna keep only the last character of each value (which is 0 or 1, i.e. related-1 will be 1) then convert it into numeric value

In [10]:
# split categories column on (;)
lables = master_df['categories'].str.split(';', expand = True)
lables.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


In [11]:
# name new colmuns after the first row
lables.columns = lables.iloc[1].map(lambda x: str(x)[:-2])
lables.head()

1,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


In [12]:
# keep only the last character of each category
for column in lables:
    lables[column] = lables[column].map(lambda x: str(x)[-1:])
lables.head()

1,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# convert all values into numeric type
for value in lables:
    lables[value] = lables[value].astype(int)

In [14]:
# make sure that all values are just zeros & ones
lables.apply(lambda x: x.nunique())
# for column in lables:
#     print column + str(': ') + str(lables[column].unique())

1
related                   3
request                   2
offer                     2
aid_related               2
medical_help              2
medical_products          2
search_and_rescue         2
security                  2
military                  2
child_alone               1
water                     2
food                      2
shelter                   2
clothing                  2
money                     2
missing_people            2
refugees                  2
death                     2
other_aid                 2
infrastructure_related    2
transport                 2
buildings                 2
electricity               2
tools                     2
hospitals                 2
shops                     2
aid_centers               2
other_infrastructure      2
weather_related           2
floods                    2
storm                     2
fire                      2
earthquake                2
cold                      2
other_weather             2
direct_report     

From above script it's obvious that related colmun has three values and child_alone has just one values, so lets check both of them

In [15]:
lables['related'].unique()

array([1, 0, 2])

In [16]:
lables['related'].value_counts()

1    20042
0     6140
2      204
Name: related, dtype: int64

Now we have to explore the messages of these 193 rows to decide whether to keep or drop them.

In [17]:
lables['child_alone'].unique()

array([0])

Since the child_alone column has the same value, it's better off drop this column.

In [18]:
lables.drop('child_alone', axis = 1, inplace = True)

Next step is to drop categories column of master_fd as it's no longer needed and replace it with the lables dataframe

In [19]:
master_df.head()

Unnamed: 0,id,message,original,genre,categories
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,related-1;request-0;offer-0;aid_related-0;medi...
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,related-1;request-0;offer-0;aid_related-1;medi...
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,related-1;request-0;offer-0;aid_related-0;medi...
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,related-1;request-1;offer-0;aid_related-1;medi...
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,related-1;request-0;offer-0;aid_related-0;medi...


In [20]:
# replace the catogeries column in master_df with categories of lables
master_df.drop('categories', axis = 1, inplace = True)
master_df = pd.concat([master_df, lables], axis = 1)
master_df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# check for doublications
master_df.duplicated().any()

True

In [22]:
# check number of doublications
sum(master_df.duplicated())

170

So we have 41 duplicate lines that they need to be removed!

In [23]:
# drop doublications
master_df = master_df.drop_duplicates()

In [24]:
# check again
master_df.duplicated().any()

False

In [25]:
# now lets fetch all the message related to value 2
master_df['message'][master_df['related'] == 2].sample(15)

12290    BBC HELP LINE K TUFAIL MIN TEH.JAMPUR DISTT.RA...
6461     The SMS:  Evitons 2 traiter 1 tas bLaissons 2 ...
7448                     Wesantyahoo.fr.Pepayisenyahoo.fr 
20636    Families also have solar lamps which can be re...
12758    atrickSF @antmarshall @kleankut1 @somarlavous ...
12304                            thatta,         (, ,)    
3138     we are in rue Dessalines Petit goaves and it n...
7235     annot ni batiman dada cheri se pwason ni ou ni...
11745    elipepolesi @vivisabetudo po to sempre do lado...
12086       mviick to ligada , santa cruz (: ja colei , rs
11989    odricao to em santiago, aqui nao houve tanto p...
18644    Refugees International therefore recommends that:
7609                                          //// // @:@ 
12254                          kamea     kole asman K     
903      It's Over in Gressier. The population in the a...
Name: message, dtype: object

In [26]:
# drop rows with raltaed = 2
i = master_df[master_df['related'] == 2].index
master_df = master_df.drop(i)

In [27]:
# check again
master_df['related'].value_counts()

1    19906
0     6122
Name: related, dtype: int64

In [28]:
# save the cleaned dataset into a sql database
conn = sqlite3.connect('cleaned_dataset.db')
conn.text_factory = lambda x: unicode(x, 'utf-8', 'ignore')
master_df.to_sql('cleaned_dataset', conn, if_exists = 'replace', index = False)

# ML Pipeline Preparation

This section loads data from the SQLite database, then splits the dataset into training and test sets and builds a text processing and machine learning pipeline, finally trains and exports the final model as a pickle file.

In [29]:
# load data from database
engine = create_engine('sqlite:///cleaned_dataset.db')
cleaned_df = pd.read_sql("SELECT * FROM cleaned_dataset", engine)

In [30]:
X = cleaned_df['message']
Y = cleaned_df.drop(['id', 'message', 'original', 'genre'], axis = 1)

Create process_data function that takes text and returns list of normalised and stemmed tokens

In [31]:
def process_data(text):
    # convert text to lowercase and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize words
    tokens = word_tokenize(text)
    
    # stem word tokens and remove stop words
    stemmer = PorterStemmer()
    stop_words = stopwords.words("english")
    
    stemmed = [stemmer.stem(word) for word in tokens if word not in stop_words]
    
    return stemmed

In [32]:
# create the model
model = Pipeline([
    ('vect', CountVectorizer(tokenizer = process_data)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))])

In [33]:
# split data into train & test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 1)

In [34]:
np.random.seed(17)
model.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st..._score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=None))])

# Model Evaluation

In this section we gonna evaluate f1 score, accuracy, precision and recall for both train and test set

In [35]:
col_names = list(Y.columns.values)

In [36]:
# calculate evaluation metrics for each category on train dataset
train_metrics = []
Y_train_pred = model.predict(X_train)

for i in range(len(col_names)):
    f1 = f1_score(np.array(Y_train)[:, i], Y_train_pred[:, i])
    accuracy = accuracy_score(np.array(Y_train)[:, i], Y_train_pred[:, i])
    precision = precision_score(np.array(Y_train)[:, i], Y_train_pred[:, i])
    recall = recall_score(np.array(Y_train)[:, i], Y_train_pred[:, i])
    
    train_metrics.append([accuracy, precision, recall, f1])

In [37]:
# create dataframe for train metrics
train_metrics = np.array(train_metrics)
train_metrics_df = pd.DataFrame(data = train_metrics, index = col_names, columns = ['F1', 'Accuracy', 'Precision', 'Recall'])
train_metrics_df.head()

Unnamed: 0,F1,Accuracy,Precision,Recall
related,0.990267,0.992651,0.994645,0.993647
request,0.987603,0.997442,0.930233,0.962666
offer,0.998617,1.0,0.715789,0.834356
aid_related,0.984325,0.994847,0.967608,0.981039
medical_help,0.988935,0.99926,0.86262,0.925926


In [38]:
# calculate evaluation metrics for each category on test dataset
test_metrics = []
Y_test_pred = model.predict(X_test)

for i in range(len(col_names)):
    f1 = f1_score(np.array(Y_test)[:, i], Y_test_pred[:, i])
    accuracy = accuracy_score(np.array(Y_test)[:, i], Y_test_pred[:, i])
    precision = precision_score(np.array(Y_test)[:, i], Y_test_pred[:, i])
    recall = recall_score(np.array(Y_test)[:, i], Y_test_pred[:, i])
    
    test_metrics.append([accuracy, precision, recall, f1])

In [39]:
# create dataframe for test metrics
test_metrics = np.array(train_metrics)
test_metrics_df = pd.DataFrame(data = test_metrics, index = col_names, columns = ['F1', 'Accuracy', 'Precision', 'Recall'])
test_metrics_df

Unnamed: 0,F1,Accuracy,Precision,Recall
related,0.990267,0.992651,0.994645,0.993647
request,0.987603,0.997442,0.930233,0.962666
offer,0.998617,1.0,0.715789,0.834356
aid_related,0.984325,0.994847,0.967608,0.981039
medical_help,0.988935,0.99926,0.86262,0.925926
medical_products,0.992367,0.998817,0.850806,0.918889
search_and_rescue,0.993955,0.995624,0.796848,0.885214
security,0.995543,1.0,0.74928,0.856672
military,0.99498,0.996317,0.849294,0.916949
water,0.995031,0.999139,0.923567,0.959868


In [40]:
# Get summary stats for the model
test_metrics_df.describe()

Unnamed: 0,F1,Accuracy,Precision,Recall
count,35.0,35.0,35.0,35.0
mean,0.992954,0.998388,0.849565,0.91605
std,0.004996,0.002032,0.078207,0.045158
min,0.978331,0.992651,0.708333,0.829268
25%,0.990036,0.997484,0.795965,0.885529
50%,0.994314,0.999139,0.849294,0.916949
75%,0.996286,1.0,0.912365,0.953516
max,0.998822,1.0,0.994645,0.993647


In [41]:
# pickle the results
pickle.dump(model, open('disaster_model.pkl', 'wb'))