# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import re
import pickle
import numpy as np
import pandas as pd

from sqlalchemy import create_engine

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download(['punkt', 'stopwords', 'wordnet']);

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

[nltk_data] Downloading package punkt to /Users/pwolter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pwolter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/pwolter/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# load data from database
engine = create_engine('sqlite:///emergency_response.db')
conn = engine.connect()
df = pd.read_sql_table('messages', conn)

In [3]:
# Sample 5% of the dataframe to speed up modeling - set sample_rate to 1.0 for the full run
sample_rate = 0.05

In [4]:
# setting cross validation folds
cv = 5

# random state 
random_state = 42

In [5]:
# filename for the file to contain the model to save
file_name = 'model/classifier.pkl'

In [6]:
df = df.sample(frac=sample_rate, random_state=random_state)

In [7]:
df.shape

(1311, 40)

In [8]:
(df == 0).all()

id                        False
message                   False
original                  False
genre                     False
related                   False
request                   False
offer                     False
aid_related               False
medical_help              False
medical_products          False
search_and_rescue         False
security                  False
military                  False
child_alone                True
water                     False
food                      False
shelter                   False
clothing                  False
money                     False
missing_people            False
refugees                  False
death                     False
other_aid                 False
infrastructure_related    False
transport                 False
buildings                 False
electricity               False
tools                     False
hospitals                 False
shops                     False
aid_centers               False
other_in

In [9]:
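# columns excluded from the targets: the id/text columns plus request, offer, child_alone and shops
# (child_alone is all 0s in this sample, see the output above)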
columns = ['id', 'message', 'original', 'genre', 'request', 'offer', 'child_alone', 'shops']

X = df['message']
Y = df.drop(columns=columns)

category_names = Y.columns.tolist()
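
The hand-written column list above could also be derived from the data. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of `df`; the column values are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for df: two label columns, one of them all zeros.
toy = pd.DataFrame({
    "message": ["need water", "road blocked", "thank you"],
    "water": [1, 0, 0],
    "child_alone": [0, 0, 0],
})

label_cols = [c for c in toy.columns if c != "message"]

# A label with a single distinct value gives a classifier nothing to learn.
constant_cols = [c for c in label_cols if toy[c].nunique() == 1]

Y_toy = toy[label_cols].drop(columns=constant_cols)
print(constant_cols)
print(Y_toy.columns.tolist())
```

Because the check depends on the rows that were sampled, a column that is constant at a 5% sample may not be constant on the full data.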


In [10]:
# converting dataframes to numpy arrays
# X = X.values
# Y = Y.to_numpy()

In [11]:
X.shape

(1311,)

In [12]:
Y.shape

(1311, 32)

### 2. Write a tokenization function to process your text data

In [13]:
def tokenize(text):
    """Normalize a message and return a list of stemmed, lemmatized tokens."""

    # URL replacement (raw string so the backslashes are not treated as escapes)
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # Lowercase and punctuation removal
    tokens = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()

    # Stop word removal (build the set once instead of once per token)
    stop_words = set(stopwords.words("english"))
    words = [w for w in tokens if w not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(w) for w in words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmed = [lemmatizer.lemmatize(w) for w in stemmed]

    return lemmed
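
To see what the first two steps of `tokenize` do without needing the NLTK data, here is a small sketch that uses only `re`. The helper name `replace_urls` and the sample sentence are made up for this example.

```python
import re

# Same pattern as in tokenize, written as a raw string.
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def replace_urls(text, placeholder="urlplaceholder"):
    """Hypothetical helper: swap every URL match for a fixed token."""
    return re.sub(url_regex, placeholder, text)

sample = "Water needed, see http://example.com/help for details"
cleaned = replace_urls(sample)

# Lowercase, turn punctuation into spaces, split on whitespace.
tokens = re.sub(r"[^a-zA-Z0-9]", " ", cleaned.lower()).split()

print(cleaned)
print(tokens)
```

Replacing URLs first keeps a link from being split into many meaningless tokens once punctuation is removed.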

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the category columns in the dataset (36 in the full dataset; 32 here after the columns dropped above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
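
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on synthetic data. The array shapes and the `n_estimators=10` setting are arbitrary choices for the example, not values from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(42)

# Synthetic data (made up): 40 samples, 5 features, 3 binary targets.
X_toy = rng.rand(40, 5)
Y_toy = (rng.rand(40, 3) > 0.5).astype(int)

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=42))
clf.fit(X_toy, Y_toy)

# One independent estimator is fitted per target column.
n_fitted = len(clf.estimators_)
pred_shape = clf.predict(X_toy[:4]).shape
print(n_fitted, pred_shape)
```

Each target gets its own copy of the base estimator, so training time grows roughly with the number of categories.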

In [14]:
# Split the data in training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=random_state)

In [15]:
X_train.shape

(983,)

In [16]:
y_train.shape

(983, 32)

In [17]:
X_test.shape

(328,)

In [18]:
y_test.shape

(328, 32)
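
The 983/328 split follows from the `train_test_split` default of holding out 25% of the rows when no size is given. A quick arithmetic check of the shapes above:

```python
import math

n_samples = 1311   # rows after sampling, from df.shape above
test_size = 0.25   # train_test_split default when neither size is passed

# The test set size is rounded up; the training set gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```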

In [19]:
# creating the Random Forest Classifier (rfc) pipeline
rfc_pipeline = Pipeline(
    [
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=random_state)))
    ]
)

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [20]:
%%time
# fitting the rfc classifier with the baseline pipeline

rfc_pipeline.fit(X_train, y_train);

CPU times: user 4.48 s, sys: 901 ms, total: 5.38 s
Wall time: 5.39 s


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...           oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=None))])

In [21]:
%%time
# predicting

y_pred = rfc_pipeline.predict(X_test);

CPU times: user 929 ms, sys: 273 ms, total: 1.2 s
Wall time: 1.2 s


In [22]:
# printing the score
print(rfc_pipeline.score(X_test, y_test))

0.18902439024390244
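
This score looks low next to the per-category numbers in the next section because, for a multi-output classifier, `score` counts a message as correct only when every category is predicted correctly. A small NumPy sketch with made-up labels shows the difference between that exact-match accuracy and accuracy counted per label:

```python
import numpy as np

# Made-up labels: 4 samples, 3 categories.
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
y_hat = np.array([[1, 0, 0],
                  [1, 0, 0],
                  [0, 0, 1],
                  [0, 0, 0]])

# Exact-match accuracy: a row counts only if all of its labels agree.
subset_accuracy = np.mean(np.all(y_true == y_hat, axis=1))

# Per-label accuracy: every cell counts on its own.
label_accuracy = np.mean(y_true == y_hat)

print(subset_accuracy, label_accuracy)
```

With 32 categories per message, a single wrong label is enough to make the whole row count as a miss.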


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [23]:
# function to test models, based on Udacity's provided code
def test_model(y_test, y_pred, category_names):
    print(classification_report(y_test, y_pred, target_names=category_names))

In [24]:
# report of classification results
test_model(y_test, y_pred, category_names)

                        precision    recall  f1-score   support

               related       0.76      0.82      0.79       237
           aid_related       0.73      0.55      0.63       130
          medical_help       0.00      0.00      0.00        30
      medical_products       0.00      0.00      0.00        25
     search_and_rescue       0.00      0.00      0.00         2
              security       0.00      0.00      0.00         4
              military       0.00      0.00      0.00         7
                 water       0.44      0.21      0.29        19
                  food       0.93      0.68      0.79        38
               shelter       1.00      0.12      0.21        26
              clothing       0.00      0.00      0.00         3
                 money       0.00      0.00      0.00         8
        missing_people       0.00      0.00      0.00         5
              refugees       1.00      0.08      0.15        12
                 death       0.00      



### 6. Improve your model
Use grid search to find better parameters. 

In [25]:
# adding Vectorizer and TF-IDF parameters to GridSearch
parameters_rfc = {
    'tfidf__use_idf': (True, False),

    'clf__estimator__n_estimators': (50, 100, 150),
    'clf__estimator__min_samples_leaf': (2, 3)
}
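
The keys use the `step__parameter` convention: `clf__estimator__n_estimators` reaches the `clf` step, then its wrapped `estimator`, then `n_estimators`. The number of fits reported by the grid search log can be checked by expanding the grid by hand. This is a sketch with `itertools`, not how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Same grid as parameters_rfc above.
param_grid = {
    'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': (50, 100, 150),
    'clf__estimator__min_samples_leaf': (2, 3),
}
cv_folds = 5

# Every combination of values is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)
```

With `refit=True`, one more fit on the whole training set follows the search and is not part of that count.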

In [26]:
%%time
# running GridSearch and fitting the model

grid_search_rfc = GridSearchCV(estimator=rfc_pipeline, param_grid=parameters_rfc, n_jobs=-1, cv=cv,
                           refit=True, return_train_score=True, verbose=1)
grid_search_rfc.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  3.9min finished


CPU times: user 5.8 s, sys: 1.1 s, total: 6.9 s
Wall time: 4min


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...           oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=None))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'tfidf__use_idf': (True, False), 'clf__estimator__n_estimators': (50, 100, 150), 'clf__estimator__min_samples_leaf': (2, 3)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [27]:
# Best parameters for rfc
best_parameters_rfc = grid_search_rfc.best_estimator_.get_params()
print(best_parameters_rfc)

{'memory': None, 'steps': [('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function tokenize at 0x1a1ce9eef0>, vocabulary=None)), ('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False,
         use_idf=False)), ('clf', MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
    

In [28]:
# saving the best model found
best_model_rfc = grid_search_rfc.best_estimator_

In [29]:
# printing the best estimator found
print(grid_search_rfc.best_estimator_)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...           oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=None))])


In [30]:
# printing the best mean cross-validated score; the baseline test-set score above was 0.189
print(grid_search_rfc.best_score_)

0.16581892166836215


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [31]:
y_pred = best_model_rfc.predict(X_test)

In [32]:
test_model(y_test, y_pred, category_names)

                        precision    recall  f1-score   support

               related       0.74      0.98      0.85       237
           aid_related       0.65      0.67      0.66       130
          medical_help       1.00      0.03      0.06        30
      medical_products       0.00      0.00      0.00        25
     search_and_rescue       0.00      0.00      0.00         2
              security       0.00      0.00      0.00         4
              military       0.00      0.00      0.00         7
                 water       0.00      0.00      0.00        19
                  food       1.00      0.21      0.35        38
               shelter       1.00      0.04      0.07        26
              clothing       0.00      0.00      0.00         3
                 money       0.00      0.00      0.00         8
        missing_people       0.00      0.00      0.00         5
              refugees       0.00      0.00      0.00        12
                 death       0.00      



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
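
One way to add a feature next to TF-IDF is scikit-learn's `FeatureUnion`, which concatenates the outputs of several transformers. The sketch below is only an illustration: the `MessageLength` transformer and the two sample sentences are invented here, and the default tokenizer is used so the example runs without NLTK.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of whitespace-separated words."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(str(text).split())] for text in X])

features = FeatureUnion([
    ('text', Pipeline([
        ('vect', CountVectorizer()),   # in the notebook: CountVectorizer(tokenizer=tokenize)
        ('tfidf', TfidfTransformer()),
    ])),
    ('length', MessageLength()),
])

docs = ["we need water and food", "road blocked"]
matrix = features.fit_transform(docs)

# 7 vocabulary terms from the text branch plus 1 length column.
print(matrix.shape)
```

In a full pipeline, `features` would take the place of the separate `vect` and `tfidf` steps, with the classifier as the final step.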

In [33]:
# parameters and pipeline for Adaboost classifier
parameters_ab = {
    'tfidf__use_idf': (True, False),

    'clf__estimator__n_estimators': (50, 100, 150),
    'clf__estimator__learning_rate': (0.1, 0.15, 0.2),
}

ab_pipeline = Pipeline(
    [
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(AdaBoostClassifier(random_state=random_state)))
    ]
)

In [34]:
%%time
# run grid search and fit the model

grid_search_ada = GridSearchCV(estimator=ab_pipeline, param_grid=parameters_ab, n_jobs=-1, cv=cv,
                           refit=True, return_train_score=True, verbose=1)
grid_search_ada.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  6.3min finished


CPU times: user 6.8 s, sys: 919 ms, total: 7.72 s
Wall time: 6min 22s


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ator=None,
          learning_rate=1.0, n_estimators=50, random_state=42),
           n_jobs=None))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'tfidf__use_idf': (True, False), 'clf__estimator__n_estimators': (50, 100, 150), 'clf__estimator__learning_rate': (0.1, 0.15, 0.2)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [35]:
# save best estimator and print best parameters found
best_parameters_ada = grid_search_ada.best_estimator_.get_params()
print(best_parameters_ada)

{'memory': None, 'steps': [('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function tokenize at 0x1a1ce9eef0>, vocabulary=None)), ('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False,
         use_idf=False)), ('clf', MultiOutputClassifier(estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.1, n_estimators=50, random_state=42),
           n_jobs=None))], 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, s

In [36]:
# save best model
best_model_ada = grid_search_ada.best_estimator_

print(grid_search_ada.best_estimator_)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ator=None,
          learning_rate=0.1, n_estimators=50, random_state=42),
           n_jobs=None))])


In [37]:
# best mean cross-validated score for AdaBoost (about 0.20 on this sample)
print(grid_search_ada.best_score_)

0.20040691759918616


In [38]:
# predict with ada best model
y_pred = best_model_ada.predict(X_test)

In [39]:
# report of classification results
test_model(y_test, y_pred, category_names)

                        precision    recall  f1-score   support

               related       0.72      1.00      0.84       237
           aid_related       0.75      0.33      0.46       130
          medical_help       0.50      0.07      0.12        30
      medical_products       0.67      0.08      0.14        25
     search_and_rescue       0.00      0.00      0.00         2
              security       0.00      0.00      0.00         4
              military       0.00      0.00      0.00         7
                 water       0.93      0.74      0.82        19
                  food       0.83      0.66      0.74        38
               shelter       0.73      0.31      0.43        26
              clothing       1.00      0.33      0.50         3
                 money       0.00      0.00      0.00         8
        missing_people       0.67      0.40      0.50         5
              refugees       0.00      0.00      0.00        12
                 death       0.00      



### 9. Export your model as a pickle file

In [None]:
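# # save the model to use it in the app
# # (AdaBoost had the higher cross-validated score above; use best_model_rfc to export the random forest instead)
# with open(file_name, 'wb') as pickled_model:
#     pickle.dump(best_model_ada, pickled_model)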
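
A minimal sketch of the save-and-reload round trip follows, using a plain dictionary as a stand-in for the fitted pipeline and a temporary directory so nothing is written to the project folder. Both choices are assumptions made for this example.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be a fitted pipeline.
model = {"name": "placeholder-model", "params": {"n_estimators": 50}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model", "classifier.pkl")

    # open() does not create missing folders, so make the directory first.
    os.makedirs(os.path.dirname(path), exist_ok=True)

    with open(path, "wb") as f:
        pickle.dump(model, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Since pickle stores functions by reference, a pipeline that uses `tokenize` can only be loaded where that function is importable under the same name.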

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
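
The template file is not shown here, so the following is only a guess at how such a script might read the user's paths from the command line. The argument names `database_filepath` and `model_filepath` are hypothetical.

```python
import argparse

def parse_args(argv=None):
    """Parse the two paths a training script would need (names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Train and export the message classifier.")
    parser.add_argument("database_filepath", help="SQLite database holding the messages table")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)

# Passing a list makes the parser easy to try without a real command line.
args = parse_args(["emergency_response.db", "model/classifier.pkl"])
print(args.database_filepath, args.model_filepath)
```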