# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [9]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import nltk
import re

# download nltk libraries
nltk.download(['punkt', 'wordnet', 'stopwords'])

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, f1_score, precision_score, recall_score, make_scorer

import pickle

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jordi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jordi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jordi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [3]:
# load data from database
database_filename = 'DisasterResponse.db'
engine = create_engine('sqlite:///db/'+database_filename)
df = pd.read_sql_table("messages", engine)
X = df.message.values
Y = df.drop(columns=['id', 'message', 'original', 'genre']).values
category_names = np.array(df.drop(columns=['id', 'message', 'original', 'genre']).columns)

### 2. Write a tokenization function to process your text data

In [4]:
url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

In [5]:
def replace_urls(text):
    """
    Returns an editted version of the input Python str object `text` with all urls in text replaced with the str 'urlplaceholder'.
    
    INPUT:
        - text - Python str object - A raw text data
        
    OUTPUT:
        - text - Python str object - An editted version of the input data `text` with all urls in text replaced with the str 'urlplaceholder'.
    """
    
     # get list of all urls using regex
    detected_urls = re.findall(url_regex, text)
    
    # replace each url in text string with placeholder
    for url in detected_urls:
        text = text.replace(url, 'urlplaceholder')
    return text

In [6]:
def tokenize(text):
    """
    Takes a Python string object and returns a list of processed words 
    of the text.
    
    INPUT:
        - text - Python str object - A raw text data
        
    OUTPUT:
        - stem_words - Python list object - A list of processed words from the input `text`.
    """
        
    text = replace_urls(text)
        
    # Text normalising process: 
    # 1. Remove punctuations and 
    # 2. Covert to lower case 
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text).lower()

    # tokenize text: 
    # That is, split the text into a list of words
    tokens = word_tokenize(text)
    
    # Remove stop words
    words = [w for w in tokens if w not in stopwords.words("english")]
    
    # Lemmatize verbs by specifying pos
    lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
    
    # Reduce words in lemmed to their stems
    stem_words = [PorterStemmer().stem(w) for w in lemmed]

    return stem_words

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [7]:
# build model pipeline
rfc_pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)), 
        ('tfdif', TfidfTransformer()), 
        ('clf', MultiOutputClassifier(estimator = RandomForestClassifier(n_jobs=-1)))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y)

# train classifier
rfc_pipeline.fit(X_train, y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [None]:
# predict on test data
y_pred_rfc = rfc_pipeline.predict(X_test)

In [None]:
#target_names = ['class 0', 'class 1']
for idx in range(y_test.shape[-1]):
    print(classification_report(y_test[:,idx], y_pred_rfc[:,idx], zero_division='warn'))
    print("------------------------------------------------------\n")

In [12]:
def get_scores(y_true, y_pred):
    """
    Returns the accuracy, precision and recall and f1 scores of the two same shape numpy arrays `y_true` and `y_pred`.

    INPUTS:
        y_true - Numpy array object - A (1 x n) vector of true values
        y_pred - Numpy array object - A (1 x n) vector of predicted values
        
    OUPUT:
        dict_scores - Python dict - A dictionary of accuracy, precision and recall and f1 scores of `y_true` and `y_pred`.
    """
    
    # Compute the accuracy score of y_true and y_pred
    accuracy = accuracy_score(y_true, y_pred)
    
    # Compute the precision score of y_true and y_pred
    precision =round( precision_score(y_true, y_pred, average='micro'))
    
    # Compute the recall score of y_true and y_pred
    recall = recall_score(y_true, y_pred, average='micro')
    
    # Compute the recall score of y_true and y_pred
    f_1 = f1_score(y_true, y_pred, average='micro')
    
    # A dictionary of accuracy, precision and recall and f1 scores of `y_true` and `y_pred`
    dict_scores = {
        'Accuracy': accuracy, 
        'Precision': precision, 
        'Recall': recall, 
        'F1 Score': f_1
    }
    
    return dict_scores

In [14]:
tabulate_metric_scores = lambda y_test, y_pred : pd.DataFrame([get_scores(y_test[:, idx], y_pred[:, idx]) for idx in range(y_test.shape[-1])], index=category_names)

tabulate_metric_scores(y_test, y_pred_rfc)

Unnamed: 0,Accuracy,Precision,Recall,F1 Score
related,0.82484,1,0.82484,0.82484
request,0.898383,1,0.898383,0.898383
offer,0.995423,1,0.995423,0.995423
aid_related,0.780287,1,0.780287,0.780287
medical_help,0.924016,1,0.924016,0.924016
medical_products,0.951327,1,0.951327,0.951327
search_and_rescue,0.974672,1,0.974672,0.974672
security,0.982301,1,0.982301,0.982301
military,0.968111,1,0.968111,0.968111
child_alone,1.0,1,1.0,1.0


In [17]:
non_refined_rfc_scores = tabulate_metric_scores(y_test, y_pred_rfc).mean()

non_refined_rfc_scores

Accuracy     0.949276
Precision    1.000000
Recall       0.949276
F1 Score     0.949276
dtype: float64

### 6. Improve your model
Use grid search to find better parameters. 

In [18]:
def avg_accuracy_score(y_true, y_pred):
    """
    Assumes that the numpy arrays `y_true` and `y_pred` ararys 
    are of the same shape and returns the average of the 
    accuracy score computed columnwise. 
    
    y_true - Numpy array - An (m x n) matrix 
    y_pred - Numpy array - An (m x n) matrix 
    
    avg_accuracy - Numpy float64 object - Average of accuracy score
    """
    
    # initialise an empty list
    accuracy_results = []
    
    # for each column index in either y_true or y_pred
    for idx in range(y_true.shape[-1]):
        # Get the accuracy score of the idx-th column of y_true and y_pred
        accuracy = accuracy_score(y_true[:,idx], y_pred[:,idx])
        
        # Update accuracy_results with accuracy
        accuracy_results.append(accuracy)
        
    # Take the mean of accuracy_results
    avg_accuracy = np.mean(accuracy_results)
    
    return avg_accuracy

In [19]:
average_accuracy_score = make_scorer(avg_accuracy_score)

In [20]:
rfc_pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x7fc11ee1f700>)),
  ('tfdif', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x7fc11ee1f700>),
 'tfdif': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfdif__norm': 'l2',
 'tfdif__smooth_idf': True,
 'tfdif__sublinear_tf

In [21]:
bool_entries = [True, False]

parameters = {
    #'vect__ngram_range': ((1, 1), (1, 2)), 
    #'clf__estimator__bootstrap': bool_entries, 
    #'clf__estimator__criterion' : ["gini", "entropy"], 
    'clf__estimator__max_depth': [25, 50],  
    'clf__estimator__n_estimators': [100, 250]
    #'clf__estimator__oob_score': bool_entries, 
    #'clf__estimator__warm_start': bool_entries
}

# create grid search object
clf_rfc = GridSearchCV(
    rfc_pipeline, 
    param_grid=parameters, 
    scoring=average_accuracy_score, 
    verbose=10, 
    return_train_score=True
    )

clf_rfc.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=100, score=(train=0.936, test=0.932), total= 1.3min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  2.1min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=100, score=(train=0.936, test=0.930), total= 1.4min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  4.4min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=100, score=(train=0.936, test=0.934), total= 1.4min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  6.6min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=100, score=(train=0.935, test=0.932), total= 1.4min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  8.8min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=100, score=(train=0.935, test=0.932), total= 1.4min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=250 ..


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 11.1min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=250, score=(train=0.936, test=0.932), total= 2.0min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=250 ..


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 14.2min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=250, score=(train=0.936, test=0.931), total= 2.1min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=250 ..


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 17.4min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=250, score=(train=0.935, test=0.934), total= 2.3min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=250 ..


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 20.7min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=250, score=(train=0.935, test=0.932), total= 2.2min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=250 ..


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 24.1min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=250, score=(train=0.936, test=0.933), total= 1.9min
[CV] clf__estimator__max_depth=50, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=50, clf__estimator__n_estimators=100, score=(train=0.953, test=0.937), total= 1.8min
[CV] clf__estimator__max_depth=50, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=50, clf__estimator__n_estimators=100, score=(train=0.954, test=0.937), total= 1.8min
[CV] clf__estimator__max_depth=50, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=50, clf__estimator__n_estimators=100, score=(train=0.953, test=0.940), total= 1.7min
[CV] clf__estimator__max_depth=50, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=50, clf__estimator__n_estimators=100, score=(train=0.953, test=0.938), total= 1.8min
[CV] clf__estimator__max_depth=50, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=50, clf__estimator__n_estima

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 58.7min finished


GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x7fc11ee1f700>)),
                                       ('tfdif', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)))]),
             param_grid={'clf__estimator__max_depth': [25, 50],
                         'clf__estimator__n_estimators': [100, 250]},
             return_train_score=True, scoring=make_scorer(avg_accuracy_score),
             verbose=10)

In [22]:
# Get the best score

clf_rfc.best_score_

0.9380376261691573

In [23]:
# Get the best parameters the model

clf_rfc.best_params_

{'clf__estimator__max_depth': 50, 'clf__estimator__n_estimators': 100}

In [24]:
# Get the best model or estimator

refined_rfc_pipeline = clf_rfc.best_estimator_ 

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [25]:
# Test best model 

y_pred_best_rfc = refined_rfc_pipeline.predict(X_test)

In [26]:
#target_names = ['class 0', 'class 1']
for idx in range(y_test.shape[-1]):
    print(classification_report(y_test[:,idx], y_pred_best_rfc[:,idx], zero_division='warn'))
    print("------------------------------------------------------\n")

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.75      0.08      0.14      1484
           1       0.78      0.99      0.87      5021
           2       0.00      0.00      0.00        49

    accuracy                           0.78      6554
   macro avg       0.51      0.36      0.34      6554
weighted avg       0.77      0.78      0.70      6554

------------------------------------------------------

              precision    recall  f1-score   support

           0       0.87      1.00      0.93      5422
           1       0.94      0.27      0.42      1132

    accuracy                           0.87      6554
   macro avg       0.90      0.64      0.68      6554
weighted avg       0.88      0.87      0.84      6554

------------------------------------------------------

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6524
           1       0.00      0.00      0.00        30

    accuracy    

In [27]:
refined_rfc_scores = tabulate_metric_scores(y_test, y_pred_best_rfc).mean()

refined_rfc_scores

Accuracy     0.939007
Precision    1.000000
Recall       0.939007
F1 Score     0.939007
dtype: float64

In [28]:
non_refined_rfc_scores

Accuracy     0.949276
Precision    1.000000
Recall       0.949276
F1 Score     0.949276
dtype: float64

The non-refined `RandomForestClassifier` model performs a little better than the refined. 

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [29]:
class DisasterWordExtrator(BaseEstimator, TransformerMixin):

    def contain_disaster_word(self, text):
        """
        INPUT:
            - text - Python str object - A raw text data

        OUTPUT:
            - bool - Python bool object - True or False
        """

        # Words that communicates ones necessary need during a disaster.
        # These can be updated as well. 
        dis_words = ['hunger', 
                     'hungry', 
                     'food', 
                     'water',
                     'drink', 
                     'eat',
                     'thirst', 
                     'medicine', 
                     'medicial', 
                     'cloth', 
                     'shelter', 
                     'help'
                    ]

        # Lemmatise the words in dis_words
        lemmed_dis_words = [WordNetLemmatizer().lemmatize(w, pos='v') for w in dis_words]

        # Get the stem words of each word in lemmed_dis_words
        stem_dis_words = [PorterStemmer().stem(w) for w in lemmed_dis_words]
        
        # Replace all urls in the input str object text
        text = replace_urls(text)

        # Tokenise the str object text
        stem_words = tokenize(text)

        # return whether stem_words contains any of words in stem_dis_words
        return any([words in stem_dis_words for words in stem_words])


    def fit(self, X, y=None):

        return self

    def transform(self, X):

        X_dis_word = pd.Series(X).apply(self.contain_disaster_word)
        
        return pd.DataFrame(X_dis_word)

In [30]:
k_nhb_pipeline = Pipeline([
    ('features', FeatureUnion([
        
        ('text_pipeline', Pipeline([('vect', CountVectorizer(tokenizer=tokenize)), 
                                    ('tfdif', TfidfTransformer())
                                    ])),
        
        ('disaster_words', DisasterWordExtrator())
        ])), 
       
    ('clf', MultiOutputClassifier(estimator = KNeighborsClassifier(n_jobs=-1)))
    ])

In [31]:
# train classifier

k_nhb_pipeline.fit(X_train, y_train)

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('text_pipeline',
                                                 Pipeline(steps=[('vect',
                                                                  CountVectorizer(tokenizer=<function tokenize at 0x7fc11ee1f700>)),
                                                                 ('tfdif',
                                                                  TfidfTransformer())])),
                                                ('disaster_words',
                                                 DisasterWordExtrator())])),
                ('clf',
                 MultiOutputClassifier(estimator=KNeighborsClassifier(n_jobs=-1)))])

In [33]:
y_pred_k_nhb = k_nhb_pipeline.predict(X_test)

In [34]:
#target_names = ['class 0', 'class 1']
for idx in range(y_test.shape[-1]):
    print(classification_report(y_test[:,idx], y_pred_k_nhb[:,idx], zero_division='warn'))
    print("------------------------------------------------------\n")

              precision    recall  f1-score   support

           0       0.59      0.36      0.45      1484
           1       0.83      0.92      0.87      5021
           2       0.25      0.33      0.28        49

    accuracy                           0.79      6554
   macro avg       0.56      0.54      0.53      6554
weighted avg       0.77      0.79      0.77      6554

------------------------------------------------------

              precision    recall  f1-score   support

           0       0.91      0.95      0.93      5422
           1       0.71      0.53      0.61      1132

    accuracy                           0.88      6554
   macro avg       0.81      0.74      0.77      6554
weighted avg       0.87      0.88      0.87      6554

------------------------------------------------------

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6524
           1       0.00      0.00      0.00        30

    accuracy    

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6453
           1       0.62      0.26      0.36       101

    accuracy                           0.99      6554
   macro avg       0.80      0.63      0.68      6554
weighted avg       0.98      0.99      0.98      6554

------------------------------------------------------

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      6386
           1       0.55      0.04      0.07       168

    accuracy                           0.97      6554
   macro avg       0.76      0.52      0.53      6554
weighted avg       0.96      0.97      0.96      6554

------------------------------------------------------

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6487
           1       0.00      0.00      0.00        67

    accuracy                           0.99      6554
   macro avg    

In [35]:
tabulate_metric_scores(y_test, y_pred_k_nhb)

Unnamed: 0,Accuracy,Precision,Recall,F1 Score
related,0.788221,1,0.788221,0.788221
request,0.880684,1,0.880684,0.880684
offer,0.995423,1,0.995423,0.995423
aid_related,0.699115,1,0.699115,0.699115
medical_help,0.920964,1,0.920964,0.920964
medical_products,0.949649,1,0.949649,0.949649
search_and_rescue,0.974825,1,0.974825,0.974825
security,0.982759,1,0.982759,0.982759
military,0.967348,1,0.967348,0.967348
child_alone,1.0,1,1.0,1.0


In [36]:
non_refined_k_nhb_scores = tabulate_metric_scores(y_test, y_pred_k_nhb).mean()

non_refined_k_nhb_scores

Accuracy     0.935998
Precision    1.000000
Recall       0.935998
F1 Score     0.935998
dtype: float64

In [37]:
k_nhb_pipeline.get_params()

{'memory': None,
 'steps': [('features',
   FeatureUnion(transformer_list=[('text_pipeline',
                                   Pipeline(steps=[('vect',
                                                    CountVectorizer(tokenizer=<function tokenize at 0x7fc11ee1f700>)),
                                                   ('tfdif',
                                                    TfidfTransformer())])),
                                  ('disaster_words', DisasterWordExtrator())])),
  ('clf', MultiOutputClassifier(estimator=KNeighborsClassifier(n_jobs=-1)))],
 'verbose': False,
 'features': FeatureUnion(transformer_list=[('text_pipeline',
                                 Pipeline(steps=[('vect',
                                                  CountVectorizer(tokenizer=<function tokenize at 0x7fc11ee1f700>)),
                                                 ('tfdif',
                                                  TfidfTransformer())])),
                                ('disaster_

In [38]:
parameters = {
    'clf__estimator__leaf_size': [30, 60], 
    'clf__estimator__n_neighbors': [5, 25, 50]
    
}


clf_k_nhb = GridSearchCV(k_nhb_pipeline, 
                         param_grid=parameters, 
                         scoring=average_accuracy_score, 
                         verbose=5,
                         return_train_score=True
                        )

clf_k_nhb.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5 .....


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, score=(train=0.948, test=0.937), total= 2.7min
[CV] clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5 .....


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  7.8min remaining:    0.0s


[CV]  clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, score=(train=0.944, test=0.934), total= 2.7min
[CV] clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5 .....


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 15.5min remaining:    0.0s


[CV]  clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, score=(train=0.943, test=0.937), total= 2.7min
[CV] clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5 .....


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 23.5min remaining:    0.0s


[CV]  clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, score=(train=0.943, test=0.935), total= 2.8min
[CV] clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5 .....


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 31.3min remaining:    0.0s


[CV]  clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, score=(train=0.950, test=0.937), total= 2.7min
[CV] clf__estimator__leaf_size=30, clf__estimator__n_neighbors=25 ....
[CV]  clf__estimator__leaf_size=30, clf__estimator__n_neighbors=25, score=(train=0.944, test=0.941), total= 2.7min
[CV] clf__estimator__leaf_size=30, clf__estimator__n_neighbors=25 ....
[CV]  clf__estimator__leaf_size=30, clf__estimator__n_neighbors=25, score=(train=0.943, test=0.939), total= 2.7min
[CV] clf__estimator__leaf_size=30, clf__estimator__n_neighbors=25 ....
[CV]  clf__estimator__leaf_size=30, clf__estimator__n_neighbors=25, score=(train=0.943, test=0.943), total= 2.7min
[CV] clf__estimator__leaf_size=30, clf__estimator__n_neighbors=25 ....
[CV]  clf__estimator__leaf_size=30, clf__estimator__n_neighbors=25, score=(train=0.943, test=0.941), total= 2.7min
[CV] clf__estimator__leaf_size=30, clf__estimator__n_neighbors=25 ....
[CV]  clf__estimator__leaf_size=30, clf__estimator__n_neighbors=25, sco

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed: 230.1min finished


GridSearchCV(estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('text_pipeline',
                                                                        Pipeline(steps=[('vect',
                                                                                         CountVectorizer(tokenizer=<function tokenize at 0x7fc11ee1f700>)),
                                                                                        ('tfdif',
                                                                                         TfidfTransformer())])),
                                                                       ('disaster_words',
                                                                        DisasterWordExtrator())])),
                                       ('clf',
                                        MultiOutputClassifier(estimator=KNeighborsClassifier(n_jobs=-1)))]),
             param_grid={'clf__estimator__leaf_si

In [39]:
# Get the best score

clf_k_nhb.best_score_

0.9411415192119177

In [40]:
# Get the best parameters the model

clf_k_nhb.best_params_

{'clf__estimator__leaf_size': 30, 'clf__estimator__n_neighbors': 25}

In [41]:
# Get the best model or estimator

refined_k_nhb_pipeline = clf_k_nhb.best_estimator_

In [42]:
# Test best model 

y_pred_best_k_nhb = refined_k_nhb_pipeline.predict(X_test)

In [43]:
refined_k_nhb_scores = tabulate_metric_scores(y_test, y_pred_best_k_nhb).mean()

refined_k_nhb_scores

Accuracy     0.94166
Precision    1.00000
Recall       0.94166
F1 Score     0.94166
dtype: float64

In [44]:
non_refined_k_nhb_scores

Accuracy     0.935998
Precision    1.000000
Recall       0.935998
F1 Score     0.935998
dtype: float64

Comaparing the outputs of `non_tuned_k_nhb_scores` and `tuned_k_nhb_scores`, I choose the refined pipeline `refined_k_nhb_pipeline` over the non-refined version `k_nhb_pipeline`.

In [45]:
rfc_feat_union_pipeline = Pipeline([
    ('features', FeatureUnion([
        
        ('text_pipeline', Pipeline([('vect', CountVectorizer(tokenizer=tokenize)), 
                                    ('tfdif', TfidfTransformer())
                                    ])),
        
        ('disaster_words', DisasterWordExtrator())
        ])), 
       
    ('clf', MultiOutputClassifier(estimator = RandomForestClassifier(n_jobs=-1)))
    ])

In [46]:
# train classifier

rfc_feat_union_pipeline.fit(X_train, y_train)

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('text_pipeline',
                                                 Pipeline(steps=[('vect',
                                                                  CountVectorizer(tokenizer=<function tokenize at 0x7fc11ee1f700>)),
                                                                 ('tfdif',
                                                                  TfidfTransformer())])),
                                                ('disaster_words',
                                                 DisasterWordExtrator())])),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)))])

In [47]:
y_pred_rfc_feat_union = rfc_feat_union_pipeline.predict(X_test)

In [48]:
#target_names = ['class 0', 'class 1']
for idx in range(y_test.shape[-1]):
    print(classification_report(y_test[:,idx], y_pred_rfc_feat_union[:,idx], zero_division='warn'))
    print("------------------------------------------------------\n")

              precision    recall  f1-score   support

           0       0.70      0.44      0.54      1484
           1       0.85      0.94      0.89      5021
           2       0.47      0.49      0.48        49

    accuracy                           0.83      6554
   macro avg       0.67      0.62      0.64      6554
weighted avg       0.81      0.83      0.81      6554

------------------------------------------------------

              precision    recall  f1-score   support

           0       0.91      0.98      0.94      5422
           1       0.84      0.54      0.65      1132

    accuracy                           0.90      6554
   macro avg       0.87      0.76      0.80      6554
weighted avg       0.90      0.90      0.89      6554

------------------------------------------------------

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6524
           1       0.00      0.00      0.00        30

    accuracy    

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.95      1.00      0.98      6212
           1       0.82      0.12      0.21       342

    accuracy                           0.95      6554
   macro avg       0.89      0.56      0.59      6554
weighted avg       0.95      0.95      0.94      6554

------------------------------------------------------

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      6415
           1       0.67      0.03      0.06       139

    accuracy                           0.98      6554
   macro avg       0.82      0.51      0.52      6554
weighted avg       0.97      0.98      0.97      6554

------------------------------------------------------

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6514
           1       0.00      0.00      0.00        40

    accuracy                           0.99      6554
   macro avg    

In [49]:
tabulate_metric_scores(y_test, y_pred_rfc_feat_union)

Unnamed: 0,Accuracy,Precision,Recall,F1 Score
related,0.825755,1,0.825755,0.825755
request,0.901892,1,0.901892,0.901892
offer,0.995423,1,0.995423,0.995423
aid_related,0.778761,1,0.778761,0.778761
medical_help,0.922643,1,0.922643,0.922643
medical_products,0.950717,1,0.950717,0.950717
search_and_rescue,0.974519,1,0.974519,0.974519
security,0.982301,1,0.982301,0.982301
military,0.968569,1,0.968569,0.968569
child_alone,1.0,1,1.0,1.0


In [64]:
non_refined_rfc_feat_union_scores = tabulate_metric_scores(y_test, y_pred_rfc_feat_union).mean()

non_refined_rfc_feat_union_scores

Accuracy     0.949585
Precision    1.000000
Recall       0.949585
F1 Score     0.949585
dtype: float64

In [51]:
parameters = {'clf__estimator__max_depth': [25, 50], 
              'clf__estimator__n_estimators': [100, 250]
             }

In [52]:
bool_entries = [True, False]

parameters = {
    #'vect__ngram_range': ((1, 1), (1, 2)), 
    #'clf__estimator__bootstrap': bool_entries, 
    #'clf__estimator__criterion' : ["gini", "entropy"], 
    'clf__estimator__max_depth': [25, 50],  
    'clf__estimator__n_estimators': [100, 250]
    #'clf__estimator__oob_score': bool_entries, 
    #'clf__estimator__warm_start': bool_entries
}

# create grid search object
clf_rfc_feat_union = GridSearchCV(rfc_feat_union_pipeline, 
                                  param_grid=parameters, 
                                  scoring=average_accuracy_score, 
                                  verbose=10, 
                                  return_train_score=True 
                                  )

clf_rfc_feat_union.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=100, score=(train=0.936, test=0.932), total= 2.2min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  3.7min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=100, score=(train=0.937, test=0.931), total= 2.2min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  7.3min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=100, score=(train=0.936, test=0.935), total= 2.2min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 11.0min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=100, score=(train=0.936, test=0.933), total= 2.2min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 14.7min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=100, score=(train=0.937, test=0.933), total= 2.2min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=250 ..


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 18.5min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=250, score=(train=0.936, test=0.932), total= 2.9min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=250 ..


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 23.0min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=250, score=(train=0.937, test=0.931), total= 2.7min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=250 ..


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 27.4min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=250, score=(train=0.936, test=0.934), total= 2.7min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=250 ..


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 31.7min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=250, score=(train=0.936, test=0.933), total= 2.7min
[CV] clf__estimator__max_depth=25, clf__estimator__n_estimators=250 ..


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 36.1min remaining:    0.0s


[CV]  clf__estimator__max_depth=25, clf__estimator__n_estimators=250, score=(train=0.937, test=0.933), total= 2.8min
[CV] clf__estimator__max_depth=50, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=50, clf__estimator__n_estimators=100, score=(train=0.954, test=0.938), total= 2.5min
[CV] clf__estimator__max_depth=50, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=50, clf__estimator__n_estimators=100, score=(train=0.954, test=0.937), total= 2.5min
[CV] clf__estimator__max_depth=50, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=50, clf__estimator__n_estimators=100, score=(train=0.954, test=0.940), total= 2.5min
[CV] clf__estimator__max_depth=50, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=50, clf__estimator__n_estimators=100, score=(train=0.954, test=0.938), total= 2.5min
[CV] clf__estimator__max_depth=50, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=50, clf__estimator__n_estima

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 86.1min finished


GridSearchCV(estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('text_pipeline',
                                                                        Pipeline(steps=[('vect',
                                                                                         CountVectorizer(tokenizer=<function tokenize at 0x7fc11ee1f700>)),
                                                                                        ('tfdif',
                                                                                         TfidfTransformer())])),
                                                                       ('disaster_words',
                                                                        DisasterWordExtrator())])),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)))]),
             param_grid={'clf__estimator__max_d

In [53]:
# Get the best score

clf_rfc_feat_union.best_score_

0.9387638150677515

In [54]:
# Get the best parameters the model

clf_rfc_feat_union.best_params_

{'clf__estimator__max_depth': 50, 'clf__estimator__n_estimators': 250}

In [55]:
# Get the best model or estimator

refined_rfc_feat_union_pipeline = clf_rfc_feat_union.best_estimator_

In [56]:
# Test best model 

y_pred_refined_rfc_feat_union = refined_rfc_feat_union_pipeline.predict(X_test)

In [57]:
refined_rfc_feat_union_scores = tabulate_metric_scores(y_test, y_pred_best_k_nhb).mean()

refined_rfc_feat_union_scores

Accuracy     0.94166
Precision    1.00000
Recall       0.94166
F1 Score     0.94166
dtype: float64

In [65]:
[non_refined_rfc_scores, 
              refined_rfc_scores, 
              non_refined_k_nhb_scores, 
              refined_k_nhb_scores, 
              non_refined_rfc_feat_union_scores, 
              refined_rfc_feat_union_scores
             ]

[Accuracy     0.949276
 Precision    1.000000
 Recall       0.949276
 F1 Score     0.949276
 dtype: float64,
 Accuracy     0.939007
 Precision    1.000000
 Recall       0.939007
 F1 Score     0.939007
 dtype: float64,
 Accuracy     0.935998
 Precision    1.000000
 Recall       0.935998
 F1 Score     0.935998
 dtype: float64,
 Accuracy     0.94166
 Precision    1.00000
 Recall       0.94166
 F1 Score     0.94166
 dtype: float64,
 Accuracy     0.949585
 Precision    1.000000
 Recall       0.949585
 F1 Score     0.949585
 dtype: float64,
 Accuracy     0.94166
 Precision    1.00000
 Recall       0.94166
 F1 Score     0.94166
 dtype: float64]

In [70]:
pd.DataFrame([non_refined_rfc_scores, 
              refined_rfc_scores, 
              non_refined_k_nhb_scores, 
              refined_k_nhb_scores, 
              non_refined_rfc_feat_union_scores, 
              refined_rfc_feat_union_scores
             ], 
             index=['RandomForestPipeline', 
                    'RefinedRandomForestPipeline', 
                    'KNeighborsPipeline', 
                    'RefinedKNeighborsPipeline', 
                    'RandomForestFeatureUnionPipeline', 
                    'RefinedRandomForestFeatureUnionPipeline'
                   ]
            ).sort_values(['Accuracy', 'Precision', 'Recall', 'F1 Score'], ascending=False)

Unnamed: 0,Accuracy,Precision,Recall,F1 Score
RandomForestFeatureUnionPipeline,0.949585,1.0,0.949585,0.949585
RandomForestPipeline,0.949276,1.0,0.949276,0.949276
RefinedKNeighborsPipeline,0.94166,1.0,0.94166,0.94166
RefinedRandomForestFeatureUnionPipeline,0.94166,1.0,0.94166,0.94166
RefinedRandomForestPipeline,0.939007,1.0,0.939007,0.939007
KNeighborsPipeline,0.935998,1.0,0.935998,0.935998


The output of the pandas DataFrame above the pipeline gives the scores of each pipeline in descending order. Since the Random Forest Feature Union Pipeline — `non_refined_rfc_feat_union_scores` — gives the highest scores, I choose it as the best among the rest.

### 9. Export your model as a pickle file

In [71]:
# save the model to disk
filename = 'random_forest_classifier_model.pkl'
pickle.dump(rfc_feat_union_pipeline, open(filename, 'wb'))

In [72]:
# save the model to disk
filename = 'refined_random_forest_classifier_model.pkl'
pickle.dump(refined_rfc_pipeline, open(filename, 'wb'))

In [73]:
# save the model to disk
filename = 'k_neighbors_classifier_model.pkl'
pickle.dump(k_nhb_pipeline, open(filename, 'wb'))

In [74]:
# save the model to disk
filename = 'refined_k_neighbors_classifier_model.pkl'
pickle.dump(refined_k_nhb_pipeline, open(filename, 'wb'))

In [75]:
# save the model to disk
filename = 'random_forest_classifier_feature_union_model.pkl'
pickle.dump(rfc_feat_union_pipeline, open(filename, 'wb'))

In [76]:
# save the model to disk
filename = 'refined_random_forest_classifier_feature_union_model.pkl'
pickle.dump(refined_rfc_feat_union_pipeline, open(filename, 'wb'))

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.