#### Machine Learning Pipeline Preparation

In this notebook, we reorganize the cleaned data from our database so that it can be fed into a machine learning model. The task is a multi-label classification problem: each message can belong to any of 36 different categories at once.

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

import pickle
import os
import re
import nltk
# nltk.download(['punkt', 'wordnet', 'stopwords', 'averaged_perceptron_tagger'])
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore")

The notebook runs from the models folder, so we first derive the project path from the current working directory, then locate the database and read the cleaned table, as follows.

In [2]:
cwd = os.getcwd()
cwd_path = cwd.replace('models', '')

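# os.path.join avoids the invalid '\D' escape and works on any operating system
db_path = os.path.join(cwd_path, 'data', 'DisasterResponse.db')
engine = create_engine('sqlite:///' + db_path)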
df = pd.read_sql_table('disaster_cleaned', con = engine)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
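
As a hedged alternative to string concatenation, the database URL can be assembled with pathlib. This is only a sketch: it assumes the notebook lives in a models folder that sits next to the data folder, and the helper name `sqlite_url` and the example path `/project/models` are illustrative rather than part of the original notebook.

```python
from pathlib import Path


def sqlite_url(cwd, db_relative="data/DisasterResponse.db"):
    """Build a SQLite URL for a database stored relative to the project root."""
    cwd = Path(cwd)
    # Assumption: when run from <root>/models, the project root is one level up.
    root = cwd.parent if cwd.name == "models" else cwd
    # as_posix() yields forward slashes, so no backslash escapes are involved.
    return "sqlite:///" + (root / db_relative).as_posix()


url = sqlite_url("/project/models")
print(url)
```

The resulting string can be passed to create_engine in place of the concatenated path.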


The dataframe is shown above. Because it contains a large number of classes, let's examine each class to see whether the data contains any outliers or errors.

In [3]:
classes = df.iloc[:,4:]
classes.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
for col in classes.columns.tolist():
    print(col, classes[col].unique())

related [1 0 2]
request [0 1]
offer [0 1]
aid_related [0 1]
medical_help [0 1]
medical_products [0 1]
search_and_rescue [0 1]
security [0 1]
military [0 1]
child_alone [0]
water [0 1]
food [0 1]
shelter [0 1]
clothing [0 1]
money [0 1]
missing_people [0 1]
refugees [0 1]
death [0 1]
other_aid [0 1]
infrastructure_related [0 1]
transport [0 1]
buildings [0 1]
electricity [0 1]
tools [0 1]
hospitals [0 1]
shops [0 1]
aid_centers [0 1]
other_infrastructure [0 1]
weather_related [0 1]
floods [0 1]
storm [0 1]
fire [0 1]
earthquake [0 1]
cold [0 1]
other_weather [0 1]
direct_report [0 1]


The cell above lists the disaster response categories together with the values each one takes. Every category should be binary (0 or 1, nothing more); any other value, or a category with only a single value, can cause problems when fitting a machine learning model.

Two categories need attention: related, which contains the value 2, and child_alone, which contains only 0.

Since a message either falls within a category or does not, we replace the value 2 in the related category with 1 (assuming it is a data-entry error). The child_alone category has no positive examples, so a classifier cannot learn anything from it, and the column is dropped.

In [5]:
df['related'] = df['related'].replace(2, 1)
df = df.drop('child_alone', axis = 1)

In [6]:
for col in df.iloc[:,4:].columns.tolist():
    print(col, df[col].unique())

related [1 0]
request [0 1]
offer [0 1]
aid_related [0 1]
medical_help [0 1]
medical_products [0 1]
search_and_rescue [0 1]
security [0 1]
military [0 1]
water [0 1]
food [0 1]
shelter [0 1]
clothing [0 1]
money [0 1]
missing_people [0 1]
refugees [0 1]
death [0 1]
other_aid [0 1]
infrastructure_related [0 1]
transport [0 1]
buildings [0 1]
electricity [0 1]
tools [0 1]
hospitals [0 1]
shops [0 1]
aid_centers [0 1]
other_infrastructure [0 1]
weather_related [0 1]
floods [0 1]
storm [0 1]
fire [0 1]
earthquake [0 1]
cold [0 1]
other_weather [0 1]
direct_report [0 1]
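
The manual inspection above can also be automated. The following sketch runs on a small made-up label table (the values are illustrative, not taken from the database). It flags columns that hold values other than 0 or 1, and columns that never vary.

```python
import pandas as pd

# Toy label table (made-up values) mimicking the issues found above.
labels = pd.DataFrame({
    "related": [1, 0, 2, 1],
    "request": [0, 1, 0, 0],
    "child_alone": [0, 0, 0, 0],
})

# Columns holding any value other than 0 or 1.
non_binary = [c for c in labels.columns if not labels[c].isin([0, 1]).all()]

# Columns with a single distinct value carry no signal for a classifier.
constant = [c for c in labels.columns if labels[c].nunique() == 1]

print("non-binary:", non_binary)
print("constant:", constant)
```

Applied to the real target columns, an empty result for both lists would confirm that the cleanup worked.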


Now that every category is binary, let's split the data into the feature and the target values for the machine learning model.

In [7]:
X = df['message']
Y = df.iloc[:,4:]
category_names = Y.columns

In [8]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [9]:
Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The model input, X, and the model output (target), Y, are shown above.

Before a machine learning model can be defined, the messages must be broken up into tokens (individual words) and normalized. The tokenize function in the following cell takes text as an argument, replaces any URLs with a placeholder, splits the text into tokens, and lemmatizes each token. It does not lowercase the text itself; that is left to CountVectorizer, whose default lowercase=True is applied before the tokenizer is called.

In [10]:
def tokenize(text):
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    urls_found = re.findall(url_regex, text)
    for url in urls_found:
        text = text.replace(url, 'urlplaceholder')
    
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    
    clean_tokens = [lemmatizer.lemmatize(words).strip() for words in tokens]

    return clean_tokens
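
To see the URL step in isolation, the sketch below applies the same pattern with a single re.sub call instead of findall followed by replace. It uses only the standard library. The helper name `replace_urls` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# Same pattern as in tokenize, written as a raw string.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder='urlplaceholder'):
    """Replace every URL matched by URL_REGEX with a placeholder token."""
    return re.sub(URL_REGEX, placeholder, text)


cleaned = replace_urls('Visit https://example.com/help now')
print(cleaned)  # Visit urlplaceholder now
```

Note that the character class in the pattern also matches punctuation such as commas, so a comma directly after a URL would be absorbed into the match.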

Let's build an initial model using MultiOutputClassifier, which fits one classifier per category (36 in the original data, 35 without child_alone). The estimator used in this project is RandomForestClassifier. Note that RandomForestClassifier accepts a class_weight parameter to account for unbalanced data. Depending on the data, setting class_weight = 'balanced' may be beneficial.

The spread of the categories below shows that the data is quite unbalanced: the related category is positive for most messages, while many other categories are positive for only a few percent of them.

In [11]:
total = len(Y)
for col in Y.columns.tolist():
    print(col, str(round(len(Y[Y[col] == 1])*100/total,2)) + str('%'))

related 76.65%
request 17.07%
offer 0.45%
aid_related 41.43%
medical_help 7.95%
medical_products 5.01%
search_and_rescue 2.76%
security 1.8%
military 3.28%
water 6.38%
food 11.15%
shelter 8.83%
clothing 1.54%
money 2.3%
missing_people 1.14%
refugees 3.34%
death 4.55%
other_aid 13.14%
infrastructure_related 6.5%
transport 4.58%
buildings 5.08%
electricity 2.03%
tools 0.61%
hospitals 1.08%
shops 0.46%
aid_centers 1.18%
other_infrastructure 4.39%
weather_related 27.83%
floods 8.22%
storm 9.32%
fire 1.08%
earthquake 9.36%
cold 2.02%
other_weather 5.25%
direct_report 19.36%
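
Because the labels are 0/1, the share of positive examples in a category equals the column mean. A compact equivalent of the loop above might look like the following sketch, shown on a made-up toy frame (`Y_demo` is hypothetical and stands in for Y).

```python
import pandas as pd

# Made-up toy labels standing in for the real target frame.
Y_demo = pd.DataFrame({
    "related": [1, 1, 1, 0],
    "offer": [0, 0, 0, 1],
    "water": [0, 1, 0, 1],
})

# Mean of a 0/1 column = fraction of positives; scale to percent.
positive_share = (Y_demo.mean() * 100).round(2)
print(positive_share)
```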


The build_model() function below defines the pipeline for our machine learning model. To account for the data imbalance above, class_weight = 'balanced' is specified for the RandomForestClassifier.

In [12]:
def build_model():
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer = tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(estimator = RandomForestClassifier(class_weight = 'balanced')))
    ])
    
    return pipeline
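
For intuition about what class_weight = 'balanced' does, scikit-learn documents the heuristic as n_samples / (n_classes * np.bincount(y)). The sketch below reproduces that formula with NumPy on made-up labels (95 negatives, 5 positives). The variable names are illustrative.

```python
import numpy as np

# Made-up binary labels: 95 negatives and 5 positives.
y_demo = np.array([0] * 95 + [1] * 5)

n_samples = y_demo.size
class_counts = np.bincount(y_demo)   # samples per class: [95, 5]
n_classes = class_counts.size

# Rare classes receive proportionally larger weights.
balanced_weights = n_samples / (n_classes * class_counts)
print(balanced_weights)
```

Here the rare positive class is weighted 10.0, while the common negative class is weighted roughly 0.53. Errors on rare categories therefore count for more during training.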

sklearn's train_test_split splits X and Y into a training set (X_train, Y_train), which is used to fit the model, and a held-out test set (X_test, Y_test), which is used to evaluate each fitted model. Hyperparameter tuning with GridSearchCV, discussed later in the notebook, can then be used to search for better model parameters.

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

model = build_model()
model.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                                                                        class_weight='balanced',
                                                                        criterion='gini',
                                                                        max_depth=None,
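
train_test_split shuffles the data randomly, so the evaluation numbers can vary from run to run. One option, not used in the original cell, is to pass random_state so that the split is reproducible. The sketch below illustrates this on made-up arrays. The seed 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data standing in for the messages and labels.
X_demo = np.arange(10)
Y_demo = np.arange(10) % 2

# The same seed produces the same split on every call.
split_a = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=42)

X_test_a, X_test_b = split_a[1], split_b[1]
print(len(X_test_a), np.array_equal(X_test_a, X_test_b))
```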
                                                            

We create a metric report on how well the model performs overall using classification_report from sklearn. The function below prints the classification report for each category, followed by the overall accuracy of the model, computed as the fraction of individual category predictions that are correct.

In [14]:
def evaluate_model(model, X_test, Y_test, category_names):
    
    Y_prediction = model.predict(X_test)
    Y_prediction_df = pd.DataFrame(Y_prediction, columns = category_names)
    
    for col in category_names:
        print(str(col) + ' category:')
        print(classification_report(Y_test[col], Y_prediction_df[col]))
        print(str('_____________________________________________________'))
    
    accuracy = (Y_prediction == Y_test).mean().mean()
    
    print('Accuracy of model: ' + str(round(accuracy*100,2)) + '%')
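
The accuracy computed in evaluate_model averages over every individual label, which can look high when most labels are 0. The sketch below contrasts that label-wise accuracy with the stricter exact-match accuracy, where every label of a message must be correct. The arrays are made up for illustration.

```python
import numpy as np

# Made-up ground truth and predictions: 4 messages, 3 categories.
Y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
Y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 0, 0],
                   [1, 0, 0]])

# Label-wise accuracy: fraction of individual cells that match.
label_accuracy = (Y_true == Y_pred).mean()

# Exact-match accuracy: fraction of rows where every label matches.
exact_match = (Y_true == Y_pred).all(axis=1).mean()

print(round(label_accuracy, 3), exact_match)
```

In this toy case 10 of 12 labels are correct, but only 2 of 4 messages are fully correct. For the same reason, the per-category precision and recall in the reports are more informative than the single accuracy figure.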

In [15]:
evaluate_model(model, X_test, Y_test, category_names)

related category:
              precision    recall  f1-score   support

           0       0.72      0.31      0.44      1206
           1       0.82      0.96      0.89      4038

    accuracy                           0.81      5244
   macro avg       0.77      0.64      0.66      5244
weighted avg       0.80      0.81      0.79      5244

_____________________________________________________
request category:
              precision    recall  f1-score   support

           0       0.89      0.99      0.94      4329
           1       0.87      0.43      0.58       915

    accuracy                           0.89      5244
   macro avg       0.88      0.71      0.76      5244
weighted avg       0.89      0.89      0.87      5244

_____________________________________________________
offer category:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5219
           1       0.00      0.00      0.00        25

    accuracy          

To try to improve the model, RandomForestClassifier is replaced with another classifier, AdaBoostClassifier. AdaBoost fits a sequence of weak learners and re-weights the training samples after each round, so that later learners concentrate on previously misclassified examples, which may suit this problem better.

In [16]:
def build_model_improved():
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer = tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(estimator = AdaBoostClassifier()))
    ])
    
    return pipeline

In [17]:
model_improved = build_model_improved()
model_improved.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at 0x000001D9CB4DEDC8>,
                                 vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultiOutputClassifier(estimator=AdaBoostClassifier(algorithm='SAMM

In [18]:
evaluate_model(model_improved, X_test, Y_test, category_names)

related category:
              precision    recall  f1-score   support

           0       0.67      0.32      0.44      1206
           1       0.83      0.95      0.88      4038

    accuracy                           0.81      5244
   macro avg       0.75      0.64      0.66      5244
weighted avg       0.79      0.81      0.78      5244

_____________________________________________________
request category:
              precision    recall  f1-score   support

           0       0.91      0.96      0.93      4329
           1       0.74      0.53      0.62       915

    accuracy                           0.89      5244
   macro avg       0.83      0.75      0.78      5244
weighted avg       0.88      0.89      0.88      5244

_____________________________________________________
offer category:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5219
           1       0.00      0.00      0.00        25

    accuracy          

The classifier can be improved further through hyperparameter tuning with GridSearchCV. Other parts of the pipeline could also be tuned, but a grid search takes significantly longer than a regular pipeline run, so only two parameters are considered, with two options each.
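
The grid search itself is not shown in the notebook. A minimal sketch of such a search is given below under several assumptions: the toy corpus and labels are made up, the candidate values in `param_grid` are hypothetical (the values actually tried are not stated), and CountVectorizer's default tokenizer is used so that the example does not depend on NLTK.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up toy corpus; each row of labels is [aid, weather].
texts = [
    "need water and food", "storm destroyed the bridge",
    "we need food please", "heavy storm and flood",
    "send water now", "flood in the city",
    "food and water needed", "storm is coming",
]
labels = np.array([[1, 0], [0, 1]] * 4)

pipeline = Pipeline([
    ('vect', CountVectorizer()),   # default tokenizer (assumption)
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(estimator=AdaBoostClassifier())),
])

# Hypothetical grid: two parameters with two options each.
param_grid = {
    'clf__estimator__n_estimators': [25, 50],
    'clf__estimator__learning_rate': [0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

The parameter names follow the step names of the pipeline: clf is the MultiOutputClassifier step, and estimator is the wrapped AdaBoostClassifier.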

Optimal AdaBoost:

'clf__estimator__n_estimators': 50, 'clf__estimator__learning_rate': 1

The parameters above are the default configuration of AdaBoostClassifier, so the grid search did not improve on the untuned model. Because a larger grid search would demand considerable computing power and time, we instead add a feature to the pipeline. The following class is taken from the ML Pipelines lesson of the Udacity course.

In [19]:
class StartingVerbExtractor(BaseEstimator, TransformerMixin):

    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # Skip sentences that yield no tokens to avoid an IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True
        return False

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

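To use this class, we combine it with the text pipeline through FeatureUnion (imported from sklearn.pipeline at the top of the notebook), which concatenates the outputs of both transformers into a single feature matrix inside our Pipeline.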

In [20]:
def build_model_final():
    pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            ('starting_verb', StartingVerbExtractor())
        ])),

        ('clf', MultiOutputClassifier(estimator = AdaBoostClassifier(n_estimators = 50, learning_rate = 1.2)))
    ])
    
    return pipeline
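
To illustrate how FeatureUnion combines features, the sketch below joins word counts with one extra numeric column and checks the resulting shape. `TextLengthExtractor` is a hypothetical stand-in for StartingVerbExtractor, chosen so the example runs without NLTK. The sample texts are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding the character count."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Shape (n_samples, 1) so it can be stacked next to the counts.
        return np.array([[len(text)] for text in X])


texts = ["help us now", "send food", "storm coming now"]

union = FeatureUnion([
    ('counts', CountVectorizer()),
    ('length', TextLengthExtractor()),
])
combined = union.fit_transform(texts)

# Vocabulary size computed separately, for comparison with the union width.
vocab_size = len(CountVectorizer().fit(texts).vocabulary_)
print(combined.shape, vocab_size)
```

The combined matrix has one more column than the vocabulary: the extra feature is appended to the right of the text features, and the classifier sees both.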

In [21]:
model_final = build_model_final()
model_final.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('text_pipeline',
                                                 Pipeline(memory=None,
                                                          steps=[('vect',
                                                                  CountVectorizer(analyzer='word',
                                                                                  binary=False,
                                                                                  decode_error='strict',
                                                                                  dtype=<class 'numpy.int64'>,
                                                                                  encoding='utf-8',
                                                                                  input='content',
                                                                                  low

In [22]:
evaluate_model(model_final, X_test, Y_test, category_names)

related category:
              precision    recall  f1-score   support

           0       0.65      0.37      0.47      1206
           1       0.83      0.94      0.88      4038

    accuracy                           0.81      5244
   macro avg       0.74      0.66      0.68      5244
weighted avg       0.79      0.81      0.79      5244

_____________________________________________________
request category:
              precision    recall  f1-score   support

           0       0.91      0.96      0.93      4329
           1       0.74      0.55      0.63       915

    accuracy                           0.89      5244
   macro avg       0.83      0.75      0.78      5244
weighted avg       0.88      0.89      0.88      5244

_____________________________________________________
offer category:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5219
           1       0.00      0.00      0.00        25

    accuracy          

Save the final model as a pickle file, classifier.pkl.

In [23]:
# The context manager closes the file even if pickling fails
with open('classifier.pkl', 'wb') as f:
    pickle.dump(model_final, f)
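
Loading works the same way in reverse. The sketch below shows a round trip with a stand-in object and a temporary directory (both are assumptions, so that the example does not need the fitted pipeline). Keep in mind that pickle stores functions and classes by reference, so tokenize and StartingVerbExtractor must be importable wherever the real model is loaded.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted pipeline.
obj = {'name': 'demo-model', 'params': {'n_estimators': 50}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifier.pkl')

    # Write the object to disk.
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

    # Read it back.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == obj)
```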