# F8 Disaster Response ML Pipeline

## Overview

This notebook explores different ML models and techniques for multi-label classification. The end goal is a model that automatically categorizes each message according to the nature of the disaster it describes. The results will then be turned into scripts that process the data and can be used with a web application.

## Preparation

Install and import required libraries

In [1]:
import sys

!{sys.executable} -m pip install -r requirements.txt -q

In [2]:
import nltk

nltk.download(['punkt', 'stopwords', 'wordnet'], quiet=True)

True

In [3]:
# import libraries
import pandas as pd
import numpy as np
import re
import joblib

from sqlalchemy import create_engine

## Load data from database
- Define feature and target variables `X` and `y`

In [4]:
# load data from database
engine = create_engine('sqlite:///../data/processed/DisasterResponse.db')
df = pd.read_sql("SELECT * FROM Message", engine)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Convert `genre` to a categorical variable. This may help later if we add it to the training data as a feature.

In [5]:
df.genre.value_counts()

news      13035
direct    10747
social     2394
Name: genre, dtype: int64

In [6]:
GenreType = pd.CategoricalDtype(categories=["news", "direct", "social"], ordered=False)
df.genre = df.genre.astype(GenreType)

df.genre.dtype

CategoricalDtype(categories=['news', 'direct', 'social'], ordered=False)

In [7]:
X = df[["message", "genre"]]
X.head()

Unnamed: 0,message,genre
0,Weather update - a cold front from Cuba that c...,direct
1,Is the Hurricane over or is it not over,direct
2,Looking for someone but no name,direct
3,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,"says: west side of Haiti, rest of the country ...",direct


In [8]:
y = df.drop(columns=["id", "message", "genre", "original"])
y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Write a tokenization function to process your text data

In [9]:
# a class that holds and caches the stop words, regex, and lemmatizer for performance reasons

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

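class Tokenizer:
    """Cleans and tokenizes text.

    The lemmatizer, stop words, and compiled regex are created once and
    reused across calls.
    """

    def __init__(self):
        self._lemmatizer = WordNetLemmatizer()
        # a set gives constant-time membership checks in tokenize()
        self._stop_words = set(stopwords.words("english"))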
        self._pattern = re.compile(r"[^a-zA-Z0-9]")

    def tokenize(self, text):
        """Tokenize text into words.

        Args:
            text (str): Text to be tokenized.

        Returns:
            list: List of tokenized words.
        """
        
        text = self._pattern.sub(" ", text.lower())
        tokens = word_tokenize(text)

        tokens = [self._lemmatizer.lemmatize(token).strip() for token in tokens if token not in self._stop_words]

        return tokens
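
To make the normalization step concrete, here is a minimal stdlib-only sketch. It is not equivalent to the `Tokenizer` class: it assumes a tiny hand-picked stop list, uses `str.split` in place of `word_tokenize`, and skips lemmatization. The names `simple_tokenize` and `STOP_WORDS` are hypothetical.

```python
import re

# Same character class as the notebook's tokenizer: anything that is not a
# letter or digit becomes a space.
PATTERN = re.compile(r"[^a-zA-Z0-9]")

# Hypothetical miniature stop list; the notebook uses NLTK's English list.
STOP_WORDS = {"is", "the", "or", "it", "not"}


def simple_tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = PATTERN.sub(" ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


tokens = simple_tokenize("Is the Hurricane over, or is it not over?")
print(tokens)
```

Punctuation never reaches the vocabulary, and repeated words are kept so that the count step can see term frequency.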

## Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.compose import ColumnTransformer

# initial pipeline
token = Tokenizer()

pipeline = Pipeline([
    ("features", ColumnTransformer([
        ("text", Pipeline([
            ("count", CountVectorizer(tokenizer=token.tokenize)),
            ("tfidf", TfidfTransformer())
        ]), "message")
    ], remainder="drop")),
    ("clf", MultiOutputClassifier(MultinomialNB(), n_jobs=8))
])

## Train pipeline
- Split data into train and test sets
- Train pipeline

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
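
No `test_size` is passed here, so `train_test_split` falls back to its default of holding out 25% of the rows. The arithmetic below is a quick check, assuming no rows were dropped after the genre counts shown earlier; the result matches the support of 6544 reported in the metrics further down.

```python
import math

# Genre counts from df.genre.value_counts() above.
n_rows = 13035 + 10747 + 2394

# Assumed default hold-out fraction when test_size is not given.
test_fraction = 0.25

n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test
print(n_train, n_test)
```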

In [12]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('features',
                 ColumnTransformer(transformers=[('text',
                                                  Pipeline(steps=[('count',
                                                                   CountVectorizer(tokenizer=<bound method Tokenizer.tokenize of <__main__.Tokenizer object at 0x0000013A1301D160>>)),
                                                                  ('tfidf',
                                                                   TfidfTransformer())]),
                                                  'message')])),
                ('clf',
                 MultiOutputClassifier(estimator=MultinomialNB(), n_jobs=8))])

## Test model
Report the F1 score, precision, and recall for each output category of the dataset.

In [13]:
from sklearn.metrics import classification_report

def get_metrics(y_test, y_pred):
    """Flatten classification report dictionary to dataframe for easier processing.
    
    Args:
        y_test (pandas.DataFrame): Test set labels.
        y_pred (numpy.ndarray): Predicted labels, one column per category.

    Returns:
        pandas.DataFrame: Classification report in dataframe format.
    """

    scores = []

    for i, col in enumerate(y_test.columns):
        report = classification_report(y_test.iloc[:, i], y_pred[:, i], output_dict=True, zero_division=0)

        scores.append({ 
            "category": col,
            "precision_0": report["0"]["precision"],
            "recall_0": report["0"]["recall"],
            "f1_0": report["0"]["f1-score"],
            "support_0": report["0"]["support"],
            "precision_1": report["1"]["precision"],
            "recall_1": report["1"]["recall"],
            "f1_1": report["1"]["f1-score"],
            "support_1": report["1"]["support"],
            "accuracy": report["accuracy"],
            "precision_macro_avg": report["macro avg"]["precision"],
            "recall_macro_avg": report["macro avg"]["recall"],
            "f1_macro_avg": report["macro avg"]["f1-score"],
            "support_macro_avg": report["macro avg"]["support"],
            "precision_weighted_avg": report["weighted avg"]["precision"],
            "recall_weighted_avg": report["weighted avg"]["recall"],
            "f1_weighted_avg": report["weighted avg"]["f1-score"],
            "support_weighted_avg": report["weighted avg"]["support"]
        })

    return pd.DataFrame.from_records(scores)

In [14]:
y_pred = pipeline.predict(X_test)
get_metrics(y_test, y_pred).mean(numeric_only=True)

precision_0                  0.936725
recall_0                     0.967542
f1_0                         0.942102
support_0                 5957.200000
precision_1                  0.234976
recall_1                     0.073973
f1_1                         0.086258
support_1                  586.800000
accuracy                     0.934815
precision_macro_avg          0.585850
recall_macro_avg             0.520758
f1_macro_avg                 0.514180
support_macro_avg         6544.000000
precision_weighted_avg       0.910694
recall_weighted_avg          0.934815
f1_weighted_avg              0.911706
support_weighted_avg      6544.000000
dtype: float64
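
The gap between the high accuracy and the low `recall_1` is typical of imbalanced labels. As a minimal sketch with hypothetical data (5 positives in 100 rows), a model that never predicts the positive class still scores 95% accuracy. The `binary_metrics` helper is illustrative only.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Mirror zero_division=0: report 0.0 when the denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical rare label: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A classifier that always predicts the majority class.
y_pred = [0] * 100

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

For this reason the class-1 precision, recall, and F1 columns are more informative here than accuracy.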

## Improve your model
Use grid search to find better parameters.

In [15]:
def convert_params(params):
    """Wrap each best-parameter value in a list so the dictionary can be passed back to GridSearchCV as a parameter grid.

    Args:
        params (dict): Dictionary of best parameters.

    Returns:
        dict: The same parameters with each value wrapped in a single-element list.
    """

    # avoid naming a variable `dict`, which would shadow the built-in type
    return {key: [value] for key, value in params.items()}
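
A short usage sketch of this helper follows. The function is repeated in compact form so the snippet runs on its own, and `best` is a hypothetical stand-in for `cv.best_params_`.

```python
def convert_params(params):
    """Wrap each value in a list so the dict is a valid parameter grid."""
    return {key: [value] for key, value in params.items()}


# Hypothetical stand-in for cv.best_params_ after a search.
best = {
    "features__text__count__min_df": 10,
    "features__text__count__ngram_range": (1, 2),
}

grid = convert_params(best)
print(grid)
```

Because every value is now a one-element list, merging this dictionary into a larger grid pins those settings while other parameters vary.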

In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit

split = ShuffleSplit(test_size=0.20, n_splits=1, random_state=42)

parameters_text = {
    "features__text__count__ngram_range": [(1, 1), (1, 2)], # unigrams only, or unigrams and bigrams
    "features__text__count__min_df": [1, 10, 20] # ignore terms that appear in fewer than 1, 10, or 20 documents
}

cv = GridSearchCV(pipeline, parameters_text, cv=split, verbose=10)

print(parameters_text)

{'features__text__count__ngram_range': [(1, 1), (1, 2)], 'features__text__count__min_df': [1, 10, 20]}
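
A grid search tries every combination of the listed values. The sketch below enumerates them with `itertools.product` rather than scikit-learn itself; sorting the keys is an assumption made so the order lines up with the fit log.

```python
from itertools import product

parameters_text = {
    "features__text__count__ngram_range": [(1, 1), (1, 2)],
    "features__text__count__min_df": [1, 10, 20],
}

# Sort the keys so the enumeration order is deterministic.
keys = sorted(parameters_text)
candidates = [
    dict(zip(keys, values))
    for values in product(*(parameters_text[key] for key in keys))
]

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates")
```

Two n-gram ranges times three `min_df` values gives six candidates, each fitted once because the split has a single fold.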


In [17]:
cv.fit(X_train, y_train)


parameters_text = convert_params(cv.best_params_)
print("Best Parameters:", parameters_text)

Fitting 1 folds for each of 6 candidates, totalling 6 fits
[CV 1/1; 1/6] START features__text__count__min_df=1, features__text__count__ngram_range=(1, 1)
[CV 1/1; 1/6] END features__text__count__min_df=1, features__text__count__ngram_range=(1, 1);, score=0.171 total time=   6.6s
[CV 1/1; 2/6] START features__text__count__min_df=1, features__text__count__ngram_range=(1, 2)
[CV 1/1; 2/6] END features__text__count__min_df=1, features__text__count__ngram_range=(1, 2);, score=0.182 total time=   8.1s
[CV 1/1; 3/6] START features__text__count__min_df=10, features__text__count__ngram_range=(1, 1)
[CV 1/1; 3/6] END features__text__count__min_df=10, features__text__count__ngram_range=(1, 1);, score=0.215 total time=   6.4s
[CV 1/1; 4/6] START features__text__count__min_df=10, features__text__count__ngram_range=(1, 2)
[CV 1/1; 4/6] END features__text__count__min_df=10, features__text__count__ngram_range=(1, 2);, score=0.233 total time=   7.1s
[CV 1/1; 5/6] START features__text__count__min_df=20,
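
The scores in this log are far below the mean per-label accuracy of about 0.93 reported earlier. A likely explanation, assuming no `scoring` argument is passed, is that the search uses the estimator's own `score` method, which for multi-label output counts a row as correct only if every label matches (subset accuracy). The toy labels below are hypothetical and only illustrate the difference.

```python
# Hypothetical true and predicted labels: 4 messages, 3 categories each.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 0, 1]]

# Subset accuracy: a row counts only if all of its labels are right.
exact_rows = sum(1 for t, p in zip(y_true, y_pred) if t == p)
subset_accuracy = exact_rows / len(y_true)

# Per-label accuracy: every individual cell counts on its own.
cells = [(t, p) for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p)]
label_accuracy = sum(1 for t, p in cells if t == p) / len(cells)

print(subset_accuracy, label_accuracy)
```

With many categories per message, a single wrong label makes the whole row count as a miss, so subset accuracy is a much stricter number.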

## Test your tuned model
Show the accuracy, precision, and recall of the tuned model.


In [18]:
y_pred = cv.predict(X_test)

get_metrics(y_test, y_pred).mean(numeric_only=True)

precision_0                  0.943946
recall_0                     0.970796
f1_0                         0.955101
support_0                 5957.200000
precision_1                  0.533624
recall_1                     0.171188
f1_1                         0.215335
support_1                  586.800000
accuracy                     0.942578
precision_macro_avg          0.738785
recall_macro_avg             0.570992
f1_macro_avg                 0.585218
support_macro_avg         6544.000000
precision_weighted_avg       0.931649
recall_weighted_avg          0.942578
f1_weighted_avg              0.929333
support_weighted_avg      6544.000000
dtype: float64

## Try improving your model further
- Try other ML algorithms
- Add other features besides the TF-IDF

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# fast to train algorithms

parameters_clf = {
    "clf__estimator": [
        LogisticRegression(max_iter=1000),
        LinearSVC(),
        MultinomialNB()
    ]
}

parameters_clf.update(parameters_text)

cv = GridSearchCV(pipeline, param_grid=parameters_clf, cv=split, verbose=10)

print(parameters_clf)

{'clf__estimator': [LogisticRegression(max_iter=1000), LinearSVC(), MultinomialNB()], 'features__text__count__min_df': [10], 'features__text__count__ngram_range': [(1, 2)]}


In [20]:
cv.fit(X_train, y_train)

parameters_clf = convert_params(cv.best_params_)
print("Best Parameters:", parameters_clf)

Fitting 1 folds for each of 3 candidates, totalling 3 fits
[CV 1/1; 1/3] START clf__estimator=LogisticRegression(max_iter=1000), features__text__count__min_df=10, features__text__count__ngram_range=(1, 2)
[CV 1/1; 1/3] END clf__estimator=LogisticRegression(max_iter=1000), features__text__count__min_df=10, features__text__count__ngram_range=(1, 2);, score=0.292 total time=   8.1s
[CV 1/1; 2/3] START clf__estimator=LinearSVC(), features__text__count__min_df=10, features__text__count__ngram_range=(1, 2)
[CV 1/1; 2/3] END clf__estimator=LinearSVC(), features__text__count__min_df=10, features__text__count__ngram_range=(1, 2);, score=0.283 total time=   7.2s
[CV 1/1; 3/3] START clf__estimator=MultinomialNB(), features__text__count__min_df=10, features__text__count__ngram_range=(1, 2)
[CV 1/1; 3/3] END clf__estimator=MultinomialNB(), features__text__count__min_df=10, features__text__count__ngram_range=(1, 2);, score=0.233 total time=   6.8s
Best Parameters: {'clf__estimator': [LogisticRegress

In [21]:
from sklearn.preprocessing import OneHotEncoder

pipeline_genre = Pipeline([
    ("features", ColumnTransformer([
        ("genre_category", OneHotEncoder(dtype=int), ["genre"]),
        ("text", Pipeline([
            ("count", CountVectorizer(tokenizer=token.tokenize)),
            ("tfidf", TfidfTransformer())
        ]), "message")
    ], remainder="drop")),
    ("clf", MultiOutputClassifier(MultinomialNB()))
])

parameters_feat = parameters_clf
cv = GridSearchCV(pipeline_genre, parameters_feat, cv=split, verbose=10)

print(parameters_feat)

{'clf__estimator': [LogisticRegression(max_iter=1000)], 'features__text__count__min_df': [10], 'features__text__count__ngram_range': [(1, 2)]}
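
The `genre_category` step adds one indicator column per genre alongside the TF-IDF features. A minimal pure-Python sketch of that encoding follows; the alphabetical column order and the `one_hot` helper are assumptions for illustration, not a claim about how `OneHotEncoder` orders its output here.

```python
# Assumed column order for the illustration.
GENRES = ["direct", "news", "social"]


def one_hot(genre):
    """Return a 0/1 indicator list with a single 1 at the genre's position."""
    return [int(genre == candidate) for candidate in GENRES]


# Hypothetical genre values for four messages.
rows = ["direct", "news", "social", "direct"]
encoded = [one_hot(genre) for genre in rows]

for genre, vector in zip(rows, encoded):
    print(genre, vector)
```

Each message gains exactly three extra features, which is small compared with the text vocabulary.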


In [22]:
cv.fit(X_train, y_train)

Fitting 1 folds for each of 1 candidates, totalling 1 fits
[CV 1/1; 1/1] START clf__estimator=LogisticRegression(max_iter=1000), features__text__count__min_df=10, features__text__count__ngram_range=(1, 2)
[CV 1/1; 1/1] END clf__estimator=LogisticRegression(max_iter=1000), features__text__count__min_df=10, features__text__count__ngram_range=(1, 2);, score=0.294 total time=  12.5s


GridSearchCV(cv=ShuffleSplit(n_splits=1, random_state=42, test_size=0.2, train_size=None),
             estimator=Pipeline(steps=[('features',
                                        ColumnTransformer(transformers=[('genre_category',
                                                                         OneHotEncoder(dtype=<class 'int'>),
                                                                         ['genre']),
                                                                        ('text',
                                                                         Pipeline(steps=[('count',
                                                                                          CountVectorizer(tokenizer=<bound method Tokenizer.tokenize of <__main__.Tokenizer object at 0x0000013A1301D160>>)),
                                                                                         ('tfidf',
                                                                                       

In [23]:
y_pred = cv.predict(X_test)

In [24]:
metrics = get_metrics(y_test, y_pred)
metrics

Unnamed: 0,category,precision_0,recall_0,f1_0,support_0,precision_1,recall_1,f1_1,support_1,accuracy,precision_macro_avg,recall_macro_avg,f1_macro_avg,support_macro_avg,precision_weighted_avg,recall_weighted_avg,f1_weighted_avg,support_weighted_avg
0,related,0.747917,0.458786,0.568713,1565,0.848317,0.951396,0.896904,4979,0.833588,0.798117,0.705091,0.732809,6544,0.824306,0.833588,0.818417,6544
1,request,0.918681,0.976151,0.946545,5451,0.827128,0.569076,0.674255,1093,0.90816,0.872904,0.772614,0.8104,6544,0.903389,0.90816,0.901066,6544
2,offer,0.995874,1.0,0.997933,6517,0.0,0.0,0.0,27,0.995874,0.497937,0.5,0.498966,6544,0.991765,0.995874,0.993815,6544
3,aid_related,0.77741,0.855434,0.814558,3846,0.759516,0.650852,0.700998,2698,0.771088,0.768463,0.753143,0.757778,6544,0.770032,0.771088,0.767739,6544
4,medical_help,0.93152,0.991843,0.960735,6007,0.668919,0.184358,0.289051,537,0.925581,0.800219,0.5881,0.624893,6544,0.909971,0.925581,0.905617,6544
5,medical_products,0.959641,0.996948,0.977939,6225,0.753247,0.181818,0.292929,319,0.957213,0.856444,0.589383,0.635434,6544,0.94958,0.957213,0.944547,6544
6,search_and_rescue,0.9758,1.0,0.987752,6371,1.0,0.086705,0.159574,173,0.975856,0.9879,0.543353,0.573663,6544,0.97644,0.975856,0.965858,6544
7,security,0.982121,1.0,0.99098,6427,0.0,0.0,0.0,117,0.982121,0.491061,0.5,0.49549,6544,0.964562,0.982121,0.973262,6544
8,military,0.97171,0.99716,0.98427,6338,0.55,0.106796,0.178862,206,0.969132,0.760855,0.551978,0.581566,6544,0.958435,0.969132,0.958917,6544
9,water,0.963628,0.99134,0.977287,6120,0.78629,0.459906,0.580357,424,0.956907,0.874959,0.725623,0.778822,6544,0.952138,0.956907,0.951569,6544
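
The macro and weighted averages in this table can be reproduced by hand. The sketch below uses the precision and support values from the `medical_help` row; the macro average treats both classes equally, while the weighted average scales each class by its support.

```python
# Values copied from the medical_help row of the table above.
precision_0, support_0 = 0.93152, 6007
precision_1, support_1 = 0.668919, 537

# Macro average: plain mean over the two classes.
macro_avg = (precision_0 + precision_1) / 2

# Weighted average: each class contributes in proportion to its support.
total = support_0 + support_1
weighted_avg = (precision_0 * support_0 + precision_1 * support_1) / total

print(f"macro avg precision:    {macro_avg:.4f}")
print(f"weighted avg precision: {weighted_avg:.4f}")
```

Because class 0 dominates the support, the weighted average stays close to the class-0 precision and hides weak class-1 performance, which the macro average exposes.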


In [25]:
metrics.mean(numeric_only=True)

precision_0                  0.949179
recall_0                     0.974633
f1_0                         0.960444
support_0                 5957.200000
precision_1                  0.654204
recall_1                     0.241675
f1_1                         0.310005
support_1                  586.800000
accuracy                     0.948511
precision_macro_avg          0.801692
recall_macro_avg             0.608154
f1_macro_avg                 0.635224
support_macro_avg         6544.000000
precision_weighted_avg       0.941274
recall_weighted_avg          0.948511
f1_weighted_avg              0.938566
support_weighted_avg      6544.000000
dtype: float64

## Export your model as a pickle file

In [26]:
joblib.dump(cv, "../models/disaster_response_model.pkl")

['../models/disaster_response_model.pkl']