# F8 Disaster Response ML Pipeline

## Overview

This notebook explores different ML models and techniques for multi-label classification. The end goal is a model that automatically categorizes each message by the kind of disaster it relates to and by the nature of the message. The findings will then be used to write scripts that process new data and can be used with a web application.

## Preparation

Install and import required libraries

In [1]:
import sys

!{sys.executable} -m pip install -r ../requirements.txt -q

In [2]:
import nltk

nltk.download(['punkt', 'stopwords', 'wordnet'], quiet=True)

True

In [3]:
# import libraries
import pandas as pd
import numpy as np
import re
import joblib

from sqlalchemy import create_engine

## Load data from database
- Define feature and target variables `X` and `y`

In [4]:
# load data from database
engine = create_engine('sqlite:///../data/processed/DisasterResponse.db')
df = pd.read_sql("SELECT * FROM Message", engine)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Convert `genre` to a categorical variable. This may help later if we add it to the training features.

In [5]:
df.genre.value_counts()

news      13035
direct    10747
social     2394
Name: genre, dtype: int64

In [6]:
GenreType = pd.CategoricalDtype(categories=["news", "direct", "social"], ordered=False)
df.genre = df.genre.astype(GenreType)

df.genre.dtype

CategoricalDtype(categories=['news', 'direct', 'social'], ordered=False)

In [7]:
X = df[["message", "genre"]]
X.head()

Unnamed: 0,message,genre
0,Weather update - a cold front from Cuba that c...,direct
1,Is the Hurricane over or is it not over,direct
2,Looking for someone but no name,direct
3,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,"says: west side of Haiti, rest of the country ...",direct


In [8]:
y = df.drop(columns=["id", "message", "genre", "original"])
y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Write a tokenization function to process your text data

In [9]:
# cache the stopwords, regex, and lemmatizer at module level for performance reasons

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # a set gives fast membership tests
pattern = re.compile(r"[^a-zA-Z0-9]")

def tokenize(text):
    """Tokenize text into words.

    Args:
        text (str): Text to be tokenized.

    Returns:
        list: List of tokenized words.
    """
        
    text = pattern.sub(" ", text.lower())
    tokens = word_tokenize(text)

    tokens = [lemmatizer.lemmatize(token).strip() for token in tokens if token not in stop_words]

    return tokens
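
The normalization step at the top of `tokenize` can be checked on its own, without the NLTK data. The sketch below applies the same regular expression and lowercasing to a sample message. `normalize` is a hypothetical helper, `str.split` stands in for `word_tokenize` purely for illustration, and stop-word removal and lemmatization are left out.

```python
import re

# Same pattern as in the notebook: anything that is not a letter or digit
pattern = re.compile(r"[^a-zA-Z0-9]")

def normalize(text):
    """Lowercase the text and replace punctuation with spaces (hypothetical helper)."""
    return pattern.sub(" ", text.lower())

sample = "Is the Hurricane over, or is it not over?"
words = normalize(sample).split()  # split() is only a stand-in for word_tokenize
print(words)
```

Punctuation never reaches the tokenizer, so tokens such as `over,` and `over?` collapse into the same word.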

In [10]:
# test the tokenizer

tokens = (
    df.message.apply(tokenize)
    .explode()
    .value_counts()
    .rename_axis("token")
    .reset_index(name="count")  # does not depend on how pandas names the value_counts output
)
tokens.head(50)

Unnamed: 0,token,count
0,water,3037
1,people,3006
2,food,2898
3,help,2651
4,need,2490
5,please,2045
6,earthquake,1918
7,u,1756
8,area,1664
9,like,1527


## Build a machine learning pipeline
This machine learning pipeline takes the `message` column as input and outputs classification results for the other 36 categories in the dataset.

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.compose import ColumnTransformer

pipeline = Pipeline([
    ("features", ColumnTransformer([
        ("text", Pipeline([
            ("count", CountVectorizer(tokenizer=tokenize)),
            ("tfidf", TfidfTransformer())
        ]), "message")
    ], remainder="drop")),
    ("clf", MultiOutputClassifier(MultinomialNB(), n_jobs=8))
])
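
To make the `count` and `tfidf` steps more concrete, here is a minimal pure-Python sketch of the weighting, assuming the transformer's default settings (smoothed IDF, L2 normalization, no sublinear term frequency). Under those assumptions the IDF of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The toy documents and the `tfidf` helper are made up for illustration and are not part of the pipeline.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return (idf, rows) for a list of token lists, using smoothed IDF and L2 norm."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document

    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

    rows = []
    for tokens in tokenized_docs:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return idf, rows

docs = [["water", "food"], ["water", "help"]]  # made-up tokenized messages
idf, rows = tfidf(docs)
print(idf)
print(rows[0])
```

A term that appears in every document (`water` here) gets the lowest IDF, so rarer terms such as `food` carry more weight within a message.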

## Train pipeline
- Split data into train and test sets
- Train pipeline

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [13]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('features',
                 ColumnTransformer(transformers=[('text',
                                                  Pipeline(steps=[('count',
                                                                   CountVectorizer(tokenizer=<function tokenize at 0x000001CB8BEEBB80>)),
                                                                  ('tfidf',
                                                                   TfidfTransformer())]),
                                                  'message')])),
                ('clf',
                 MultiOutputClassifier(estimator=MultinomialNB(), n_jobs=8))])

## Test model
Report the F1 score, precision, and recall for each output category of the dataset.

In [14]:
from sklearn.metrics import classification_report

def get_metrics(y_test, y_pred):
    """Flatten classification report dictionary to dataframe for easier processing.
    
    Args:
        y_test (pandas.DataFrame): Test set labels.
        y_pred (numpy.ndarray): Predicted labels, one column per category.

    Returns:
        pandas.DataFrame: Classification report in dataframe format.
    """

    scores = []

    for i, col in enumerate(y_test.columns):
        report = classification_report(y_test.iloc[:, i], y_pred[:, i], output_dict=True, zero_division=0)

        scores.append({ 
            "category": col,
            "precision_0": report["0"]["precision"],
            "recall_0": report["0"]["recall"],
            "f1_0": report["0"]["f1-score"],
            "support_0": report["0"]["support"],
            "precision_1": report["1"]["precision"],
            "recall_1": report["1"]["recall"],
            "f1_1": report["1"]["f1-score"],
            "support_1": report["1"]["support"],
            "accuracy": report["accuracy"],
            "precision_macro_avg": report["macro avg"]["precision"],
            "recall_macro_avg": report["macro avg"]["recall"],
            "f1_macro_avg": report["macro avg"]["f1-score"],
            "support_macro_avg": report["macro avg"]["support"],
            "precision_weighted_avg": report["weighted avg"]["precision"],
            "recall_weighted_avg": report["weighted avg"]["recall"],
            "f1_weighted_avg": report["weighted avg"]["f1-score"],
            "support_weighted_avg": report["weighted avg"]["support"]
        })

    return pd.DataFrame.from_records(scores)

In [15]:
y_pred = pipeline.predict(X_test)
get_metrics(y_test, y_pred)[["category", "f1_macro_avg", "precision_macro_avg", "recall_macro_avg", "accuracy"]]

Unnamed: 0,category,f1_macro_avg,precision_macro_avg,recall_macro_avg,accuracy
0,related,0.509543,0.802531,0.537964,0.776284
1,request,0.632486,0.85539,0.603192,0.862317
2,offer,0.498966,0.497937,0.5,0.995874
3,aid_related,0.73318,0.742754,0.729214,0.747708
4,medical_help,0.482249,0.659092,0.501612,0.917787
5,medical_products,0.487509,0.475627,0.5,0.951253
6,search_and_rescue,0.493302,0.486782,0.5,0.973564
7,security,0.49549,0.491061,0.5,0.982121
8,military,0.492004,0.48426,0.5,0.968521
9,water,0.48326,0.467604,0.5,0.935208
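
Several rows above (for example `offer`, `security`, and `water`) combine a macro recall of exactly 0.5 with a macro precision of about half the accuracy. That pattern is consistent with a classifier that never predicts the positive class for those categories. The sketch below reproduces it with made-up labels. `macro_scores` is a hypothetical helper that treats undefined precision as 0, mirroring `zero_division=0`.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for binary 0/1 labels."""
    precisions, recalls = [], []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        actual = sum(1 for t in y_true if t == cls)
        precisions.append(tp / predicted if predicted else 0.0)  # zero_division=0
        recalls.append(tp / actual if actual else 0.0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return accuracy, sum(precisions) / 2, sum(recalls) / 2

# Made-up data: 1 positive in 100 samples, and a model that always predicts 0
y_true_toy = [0] * 99 + [1]
y_pred_toy = [0] * 100

accuracy, precision_macro, recall_macro = macro_scores(y_true_toy, y_pred_toy)
print(accuracy, precision_macro, recall_macro)
```

Accuracy stays high because the negative class dominates, which is why the macro averages are the more informative columns for rare categories.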


## Improve your model
Use grid search to find better parameters.

In [16]:
def convert_params(params):
    """Convert GridSearchCV best parameters to acceptable dictionary format that can be fed again to GridSearchCV.

    Args:
        params (dict): Dictionary of best parameters.

    Returns:
        dict: Dictionary of best parameters in acceptable format.
    """
    
    # wrap each best value in a list, as expected by GridSearchCV's param_grid
    return {key: [value] for key, value in params.items()}
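
`GridSearchCV` expects every entry of the parameter grid to be a list of candidate values, while `best_params_` maps each name to a single value. A short usage sketch of the conversion follows. The parameter values are hypothetical placeholders, not results from this notebook, and the helper is named `wrap_params` here to keep the example separate from the notebook's own function.

```python
def wrap_params(params):
    """Wrap each value in a one-element list (same idea as convert_params)."""
    return {key: [value] for key, value in params.items()}

# Hypothetical best parameters, for illustration only
best_params = {
    "features__text__count__min_df": 10,
    "features__text__count__ngram_range": (1, 2),
}

grid = wrap_params(best_params)
print(grid)
```

The resulting dictionary pins the text parameters to one value each, so it can be merged into a later grid without enlarging the search.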

In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit

split = ShuffleSplit(test_size=0.20, n_splits=1, random_state=42)

parameters_text = {
    "features__text__count__ngram_range": [(1, 1), (1, 2)], # unigram and bigram
    "features__text__count__min_df": [1, 10, 20] # ignore terms that appear in fewer than 1, 10, or 20 documents
}

cv = GridSearchCV(pipeline, parameters_text, cv=split, verbose=10)

print(parameters_text)

{'features__text__count__ngram_range': [(1, 1), (1, 2)], 'features__text__count__min_df': [1, 10, 20]}
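
An integer `min_df` is a document-frequency threshold: terms found in fewer documents than the threshold are dropped from the vocabulary, regardless of how often they occur inside a single document. The sketch below imitates that filter in pure Python on made-up tokenized messages. `filter_by_min_df` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def filter_by_min_df(tokenized_docs, min_df):
    """Keep terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # one count per document, not per occurrence
    return sorted(term for term, n in doc_freq.items() if n >= min_df)

toy_docs = [
    ["water", "food"],
    ["water", "help", "help", "help"],  # "help" repeats but is still one document
    ["water", "food", "tent"],
]

vocab_1 = filter_by_min_df(toy_docs, 1)
vocab_2 = filter_by_min_df(toy_docs, 2)
vocab_3 = filter_by_min_df(toy_docs, 3)
print(vocab_1, vocab_2, vocab_3)
```

With `min_df=1` nothing is removed, so that grid value acts as the baseline.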


In [18]:
cv.fit(X_train, y_train)


parameters_text = convert_params(cv.best_params_)
print("Best Parameters:", parameters_text)

Fitting 1 folds for each of 6 candidates, totalling 6 fits
[CV 1/1; 1/6] START features__text__count__min_df=1, features__text__count__ngram_range=(1, 1)
[CV 1/1; 1/6] END features__text__count__min_df=1, features__text__count__ngram_range=(1, 1);, score=0.171 total time=   9.8s
[CV 1/1; 2/6] START features__text__count__min_df=1, features__text__count__ngram_range=(1, 2)
[CV 1/1; 2/6] END features__text__count__min_df=1, features__text__count__ngram_range=(1, 2);, score=0.182 total time=  12.0s
[CV 1/1; 3/6] START features__text__count__min_df=10, features__text__count__ngram_range=(1, 1)
[CV 1/1; 3/6] END features__text__count__min_df=10, features__text__count__ngram_range=(1, 1);, score=0.215 total time=   8.7s
[CV 1/1; 4/6] START features__text__count__min_df=10, features__text__count__ngram_range=(1, 2)
[CV 1/1; 4/6] END features__text__count__min_df=10, features__text__count__ngram_range=(1, 2);, score=0.233 total time=   9.2s
[CV 1/1; 5/6] START features__text__count__min_df=20,
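
No `scoring` argument is passed to `GridSearchCV`, so the `score=` values in the log should come from the pipeline's own `score` method. For `MultiOutputClassifier` that is the fraction of samples whose entire label row is predicted correctly. This is a strict measure with 36 labels, which may explain why the scores look low next to the per-category accuracies. A toy sketch with made-up labels:

```python
# Made-up labels: 4 samples, 3 categories
y_true_toy = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
y_pred_toy = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]

n_samples = len(y_true_toy)
n_labels = len(y_true_toy[0])

# Subset accuracy: a sample counts only if every label in its row matches
subset_accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / n_samples

# Per-label accuracy: each category is scored independently
per_label_accuracy = [
    sum(t[j] == p[j] for t, p in zip(y_true_toy, y_pred_toy)) / n_samples
    for j in range(n_labels)
]

print(subset_accuracy, per_label_accuracy)
```

Two single-label mistakes are enough to halve the subset accuracy here, even though every category is at least 75% accurate on its own.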

## Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [None]:
y_pred = cv.predict(X_test)

get_metrics(y_test, y_pred)[["category", "f1_macro_avg", "precision_macro_avg", "recall_macro_avg", "accuracy"]]

## Try improving your model further
- Try other ML algorithms
- Add other features besides TF-IDF

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# algorithms that are fast to train

parameters_clf = {
    "clf__estimator": [
        LogisticRegression(max_iter=1000),
        LinearSVC(),
        MultinomialNB()
    ]
}

parameters_clf.update(parameters_text)

cv = GridSearchCV(pipeline, param_grid=parameters_clf, cv=split, verbose=10)

print(parameters_clf)
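
Because `parameters_text` was collapsed to single-value lists by `convert_params`, merging it into `parameters_clf` should leave only the estimator choice varying, so the search has three candidates to fit. The sketch below counts the combinations with `itertools.product`. The text parameter values are hypothetical placeholders, and the estimators are represented by name strings rather than instances.

```python
from itertools import product

# Hypothetical single-value grid, shaped like the output of convert_params
parameters_text_toy = {
    "features__text__count__ngram_range": [(1, 2)],
    "features__text__count__min_df": [10],
}

# Estimator names as strings; the notebook passes estimator instances instead
parameters_clf_toy = {
    "clf__estimator": ["LogisticRegression", "LinearSVC", "MultinomialNB"]
}
parameters_clf_toy.update(parameters_text_toy)

# Expand the grid into one dictionary per candidate
names = list(parameters_clf_toy)
candidates = [
    dict(zip(names, values)) for values in product(*parameters_clf_toy.values())
]
print(len(candidates))
```

Fixing the text parameters first keeps the second search small, at the cost of not exploring how those parameters interact with each estimator.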

In [None]:
cv.fit(X_train, y_train)

parameters_clf = convert_params(cv.best_params_)
print("Best Parameters:", parameters_clf)

In [None]:
from sklearn.preprocessing import OneHotEncoder

pipeline_genre = Pipeline([
    ("features", ColumnTransformer([
        ("genre_category", OneHotEncoder(dtype=int), ["genre"]),
        ("text", Pipeline([
            ("count", CountVectorizer(tokenizer=tokenize)),
            ("tfidf", TfidfTransformer())
        ]), "message")
    ], remainder="drop")),
    ("clf", MultiOutputClassifier(MultinomialNB()))
])

parameters_feat = parameters_clf
cv = GridSearchCV(pipeline_genre, parameters_feat, cv=split, verbose=10)

print(parameters_feat)

In [None]:
cv.fit(X_train, y_train)

In [None]:
y_pred = cv.predict(X_test)

In [None]:
metrics = get_metrics(y_test, y_pred)
metrics[["category", "f1_macro_avg", "precision_macro_avg", "recall_macro_avg", "accuracy"]]

In [None]:
metrics.mean(numeric_only=True)

## Export your model as a pickle file

In [None]:
# saves the whole fitted GridSearchCV object, not only the best estimator
joblib.dump(cv, "../models/disaster_response_model_nb.pkl")