# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
import re

import nltk
nltk.download(['punkt', 'wordnet'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql("SELECT * FROM DisasterResponseMessages", engine)
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)
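
To make the feature/target split above concrete, here is a minimal sketch that uses only the standard library's `sqlite3` module instead of pandas and SQLAlchemy. The in-memory database and its two rows are hypothetical stand-ins for `DisasterResponse.db`; only the table name and the four non-target column names come from the cell above.

```python
import sqlite3

# Hypothetical in-memory stand-in for DisasterResponse.db with two target columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DisasterResponseMessages "
    "(id INTEGER, message TEXT, original TEXT, genre TEXT, "
    "related INTEGER, request INTEGER)"
)
conn.executemany(
    "INSERT INTO DisasterResponseMessages VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "We need water", None, "direct", 1, 1),
        (2, "Sunny weather today", None, "news", 0, 0),
    ],
)

cursor = conn.execute("SELECT * FROM DisasterResponseMessages ORDER BY id")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

# Everything except these four columns is treated as a target category.
non_target = {"id", "message", "original", "genre"}
target_names = [name for name in columns if name not in non_target]

message_index = columns.index("message")
X = [row[message_index] for row in rows]
Y = [[row[columns.index(name)] for name in target_names] for row in rows]

print(target_names)
print(X)
print(Y)
```

The same rule (drop `id`, `message`, `original` and `genre`, keep the rest as targets) is what `df.drop(...)` expresses in the pandas version.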

### 2. Write a tokenization function to process your text data

In [3]:
def tokenize(text):
    # remove punctuation and use lowercase letters
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # tokenize words
    tokens = word_tokenize(text)
    # create word lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # lemmatize words
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
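
As a rough illustration of the normalization step in `tokenize`, the sketch below applies the same regular expression to a made-up message. It is deliberately simplified: `str.split` stands in for `word_tokenize`, and lemmatization is left out, so the result is only an approximation of what the notebook function returns.

```python
import re


def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())


# Hypothetical sample message, not taken from the dataset.
sample = "Water needed in Port-au-Prince, URGENT!!"
normalized = normalize(sample)

# Whitespace splitting is a simplification of nltk's word_tokenize.
rough_tokens = normalized.split()
print(rough_tokens)
```

Note that the regular expression also splits hyphenated names such as "Port-au-Prince" into separate tokens.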

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [4]:
# create pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [5]:
# split data into train and test sets (a fixed random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)
# train classifier
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [6]:
# predict
y_pred = pipeline.predict(X_test)

# print report using sklearn classification_report
print('\n', classification_report(y_test, y_pred, target_names=Y.columns.values))


                         precision    recall  f1-score   support

               related       0.82      0.93      0.87      5001
               request       0.84      0.38      0.53      1093
                 offer       0.00      0.00      0.00        32
           aid_related       0.74      0.54      0.63      2700
          medical_help       0.58      0.09      0.15       532
      medical_products       0.68      0.08      0.15       345
     search_and_rescue       0.67      0.11      0.19       165
              security       0.00      0.00      0.00       127
              military       0.47      0.04      0.07       197
           child_alone       0.00      0.00      0.00         0
                 water       0.81      0.22      0.34       408
                  food       0.81      0.37      0.51       723
               shelter       0.80      0.27      0.41       590
              clothing       0.85      0.12      0.20        95
                 money       0.62    

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)
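
The instructions for this step suggest iterating over the output columns. To show what the per-category numbers mean, the following standard-library sketch computes precision, recall and F1 column by column. The category names appear in the report above, but the label values are hypothetical toy data. This is not how `classification_report` is implemented; it simply returns 0.0 wherever a ratio is undefined, similar to the 0.00 rows shown above.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class of one binary column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Undefined ratios (zero denominator) are reported as 0.0 in this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy labels: rows are messages, columns are categories.
category_names = ["related", "request", "child_alone"]
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0]]

scores = {}
for i, name in enumerate(category_names):
    column_true = [row[i] for row in y_true]
    column_pred = [row[i] for row in y_pred]
    scores[name] = precision_recall_f1(column_true, column_pred)
    precision, recall, f1 = scores[name]
    print(f"{name:>12}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```

A category with no positive examples and no positive predictions, like `child_alone` here, ends up with zeros in every column.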


### 6. Improve your model
Use grid search to find better parameters. 

In [7]:
parameters = {
    'vect__max_df': (0.5, 1.0),
    'tfidf__use_idf': (True, False),
    'clf__estimator__max_features': ['auto', 'sqrt'],
    'clf__estimator__max_depth': [5, 10, 20, None]
}

# use GridSearchCV for finding optimal parameters
cv = GridSearchCV(pipeline, param_grid=parameters, verbose=2)
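
The size of this grid determines how long the search takes. The sketch below enumerates the combinations with `itertools.product` to show where the candidate and fit counts come from. It assumes 3 cross-validation folds, which is what the fit log in this notebook reports. It also splits one key to show the `step__parameter` naming convention that the grid relies on.

```python
from itertools import product

parameters = {
    'vect__max_df': (0.5, 1.0),
    'tfidf__use_idf': (True, False),
    'clf__estimator__max_features': ['auto', 'sqrt'],
    'clf__estimator__max_depth': [5, 10, 20, None],
}

# Each combination of one value per key is one candidate configuration.
candidates = [
    dict(zip(parameters, combination))
    for combination in product(*parameters.values())
]

n_folds = 3  # assumption: taken from the "Fitting 3 folds" line in the log
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "fits")

# The part before the first double underscore names the pipeline step;
# the remainder is passed on to that step (here: the wrapped estimator).
step, _, nested_param = 'clf__estimator__max_depth'.partition('__')
print(step, nested_param)
```

Adding one more parameter with two values would double the number of fits, so it is worth counting before starting a long search.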

In [8]:
# fit model
cv.fit(X_train, y_train)

Fitting 3 folds for each of 32 candidates, totalling 96 fits
[CV] clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=0.5 
[CV]  clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=0.5, total=  11.2s
[CV] clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=0.5 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   17.8s remaining:    0.0s


[CV]  clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=0.5, total=  11.1s
[CV] clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=0.5 
[CV]  clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=0.5, total=  11.2s
[CV] clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=1.0 
[CV]  clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=1.0, total=  11.2s
[CV] clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=1.0 
[CV]  clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=1.0, total=  11.3s
[CV] clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf__use_idf=True, vect__max_df=1.0 
[CV]  clf__estimator__max_depth=5, clf__estimator__max_features=auto, tfidf_

[Parallel(n_jobs=1)]: Done  96 out of  96 | elapsed: 43.5min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__max_df': (0.5, 1.0), 'tfidf__use_idf': (True, False), 'clf__estimator__max_features': ['auto', 'sqrt'], 'clf__estimator__max_depth': [5, 10, 20, None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [9]:
# predict
y_pred_cv = cv.predict(X_test)

# print report using sklearn classification_report
print('\n', classification_report(y_test, y_pred_cv, target_names=Y.columns.values))


                         precision    recall  f1-score   support

               related       0.83      0.93      0.87      5001
               request       0.82      0.36      0.50      1093
                 offer       0.00      0.00      0.00        32
           aid_related       0.75      0.53      0.62      2700
          medical_help       0.69      0.11      0.19       532
      medical_products       0.67      0.09      0.16       345
     search_and_rescue       0.60      0.02      0.04       165
              security       0.33      0.01      0.02       127
              military       0.50      0.04      0.08       197
           child_alone       0.00      0.00      0.00         0
                 water       0.72      0.15      0.25       408
                  food       0.81      0.36      0.50       723
               shelter       0.82      0.26      0.39       590
              clothing       0.76      0.17      0.28        95
                 money       0.78    

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [10]:
from sklearn.naive_bayes import MultinomialNB
# try MultinomialNB as the classifier; it is also used for text classification in other sources such as [R1]

# create advanced pipeline
adv_pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(MultinomialNB()))
])

# set advanced model parameter
adv_parameters =  {
        'vect__max_df': (0.5, 1.0),
        'tfidf__use_idf': (True, False),
        'clf__estimator__alpha': [0.01, 0.05, 0.1]
        }

# use GridSearchCV for finding optimal parameters
adv_model = GridSearchCV(adv_pipeline, param_grid=adv_parameters, verbose=2)

# fit model
adv_model.fit(X_train, y_train)

# predict results
y_pred_adv = adv_model.predict(X_test)

# print report using sklearn classification_report
print('\n', classification_report(y_test, y_pred_adv, target_names=Y.columns.values))

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=0.5, total=   9.0s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=0.5 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   14.9s remaining:    0.0s
  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=0.5, total=   8.9s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=0.5, total=   8.9s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=1.0, total=   8.7s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=1.0, total=   8.7s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=True, vect__max_df=1.0, total=   8.7s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=0.5, total=   9.0s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=0.5, total=   8.9s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=0.5, total=   9.0s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=1.0, total=   8.9s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=1.0, total=   8.9s
[CV] clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.01, tfidf__use_idf=False, vect__max_df=1.0, total=   8.9s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=0.5, total=   8.9s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=0.5, total=   9.0s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=0.5, total=   8.8s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=1.0, total=   8.6s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=1.0, total=   8.7s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=True, vect__max_df=1.0, total=   8.8s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=0.5, total=   8.7s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=0.5, total=   8.9s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=0.5, total=   8.8s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=1.0, total=   9.1s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=1.0, total=   8.9s
[CV] clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.05, tfidf__use_idf=False, vect__max_df=1.0, total=   8.8s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=0.5, total=   9.0s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=0.5, total=   8.7s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=0.5, total=   8.9s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=1.0, total=   9.1s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=1.0, total=   8.9s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=True, vect__max_df=1.0, total=   8.9s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=0.5, total=   8.7s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=0.5, total=   8.8s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=0.5 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=0.5, total=   9.1s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=1.0, total=   9.0s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=1.0, total=   9.0s
[CV] clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=1.0 


  self.class_log_prior_ = (np.log(self.class_count_) -


[CV]  clf__estimator__alpha=0.1, tfidf__use_idf=False, vect__max_df=1.0, total=   9.0s


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  8.8min finished
  self.class_log_prior_ = (np.log(self.class_count_) -



                         precision    recall  f1-score   support

               related       0.84      0.94      0.89      5001
               request       0.71      0.61      0.66      1093
                 offer       0.00      0.00      0.00        32
           aid_related       0.71      0.67      0.69      2700
          medical_help       0.60      0.16      0.25       532
      medical_products       0.65      0.20      0.31       345
     search_and_rescue       0.40      0.01      0.02       165
              security       0.00      0.00      0.00       127
              military       0.54      0.24      0.34       197
           child_alone       0.00      0.00      0.00         0
                 water       0.66      0.16      0.26       408
                  food       0.75      0.37      0.49       723
               shelter       0.71      0.16      0.26       590
              clothing       0.59      0.27      0.37        95
                 money       0.57    

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)
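
To compare the three models at a glance, the sketch below averages the F1 scores copied from the reports above. Because the printed tables are cut off after `clothing`, it covers only the 14 fully visible categories and uses an unweighted mean, so it is a rough indication rather than a full comparison. The dictionary keys are labels chosen for this sketch.

```python
# F1 scores copied from the three reports above, in report order
# (related, request, offer, aid_related, medical_help, medical_products,
#  search_and_rescue, security, military, child_alone, water, food,
#  shelter, clothing). Later rows are truncated in the output.
f1_scores = {
    "random_forest": [
        0.87, 0.53, 0.00, 0.63, 0.15, 0.15, 0.19,
        0.00, 0.07, 0.00, 0.34, 0.51, 0.41, 0.20,
    ],
    "random_forest_grid_search": [
        0.87, 0.50, 0.00, 0.62, 0.19, 0.16, 0.04,
        0.02, 0.08, 0.00, 0.25, 0.50, 0.39, 0.28,
    ],
    "multinomial_nb_grid_search": [
        0.89, 0.66, 0.00, 0.69, 0.25, 0.31, 0.02,
        0.00, 0.34, 0.00, 0.26, 0.49, 0.26, 0.37,
    ],
}

# Unweighted mean over the visible categories only.
mean_f1 = {name: sum(values) / len(values) for name, values in f1_scores.items()}

for name, score in mean_f1.items():
    print(f"{name}: {score:.3f}")

best = max(mean_f1, key=mean_f1.get)
print("highest mean F1 on the visible rows:", best)
```

An unweighted mean treats rare categories such as `offer` the same as frequent ones such as `related`; weighting by support would give a different picture.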


### 9. Export your model as a pickle file

In [11]:
import pickle

# pickle the models
# save pipeline model
with open('DisasterResponse_model_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)
# save optimized GridSearchCV model
with open('DisasterResponse_model_cv.pkl', 'wb') as f:
    pickle.dump(cv, f)
# save advanced model (adv_model, not cv)
with open('DisasterResponse_model_adv.pkl', 'wb') as f:
    pickle.dump(adv_model, f)
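
A quick way to check an export is to load the file back and compare. The sketch below shows such a round trip with context managers, which close the file handles even if an error occurs. A plain dictionary is used as a hypothetical stand-in for a fitted model, and the file is written to a temporary directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model object.
stand_in_model = {"name": "adv_model", "alpha": 0.05}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")

    # Write the object; the file is closed when the block exits.
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back to confirm the export worked.
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

print(restored_model == stand_in_model)
```

With the real pipelines, keep in mind that pickle stores functions by reference, so the custom `tokenize` function must be importable in whatever script later loads the model.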

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.

In [12]:
import sys
import pandas as pd
from sqlalchemy import create_engine
import re

import nltk
nltk.download(['punkt', 'wordnet'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
import pickle

def load_data(database_filepath):
    """Load the data from database

    Parameters:
    database_filepath: SQLite file containing the clean dataset; the file is created by process_data.py

    Returns:
    X: input data containing sent messages
    Y: categories to be predicted
    category_names: names of the categories saved in variable Y
    """
    # load data from database
    database_filepath = 'sqlite:///{}'.format(database_filepath)
    engine = create_engine(database_filepath)
    # read data using pandas read_sql
    df = pd.read_sql("SELECT * FROM DisasterResponseMessages", engine)
    X = df['message']
    Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)
    # get category names which are the columns of Y
    category_names = Y.columns.values
    # return values
    return X, Y, category_names

def tokenize(text):
    """Tokenize input text

    Parameters:
    text: input text

    Returns:
    clean_tokens: list containing the normalized, tokenized and lemmatized words
    """
    # remove punctuation and use lowercase letters
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # tokenize words
    tokens = word_tokenize(text)
    # create word lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # lemmatize words
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).strip()
        clean_tokens.append(clean_tok)

    return clean_tokens


def build_model():
    """Build machine learning model using pipeline and GridSearch

    Returns:
    model: machine learning model with best parameters using GridSearch
    """
    # create pipeline
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    # parameter dictionary: a single parameter keeps this test run fast;
    # the full grid from step 6 is kept below for reference
    parameters = {'tfidf__use_idf': (True, False)}
#    parameters =  {
#        'vect__max_df': (0.5, 1.0),
#        'tfidf__use_idf': (True, False),
#        'clf__estimator__max_features': ['auto', 'sqrt'],
#        'clf__estimator__max_depth': [5,10, 20,None]
#        }
    # use GridSearchCV for finding optimal parameters
    model = GridSearchCV(pipeline, param_grid=parameters, verbose=10)
    # return model
    return model


def evaluate_model(model, X_test, Y_test, category_names):
    """Evaluate the machine learning model using X_test, Y_test and category_names

    Parameters:
    model: machine learning model with best parameters using GridSearch
    X_test: X values of the test subset
    Y_test: Y values of the test subset
    category_names: category names
    """
    # predict results
    Y_pred = model.predict(X_test)

    # print report using sklearn classification_report
    print('\n', classification_report(Y_test, Y_pred, target_names=category_names))


def save_model(model, model_filepath):
    """Save the machine learning model to a pickle file

    Parameters:
    model: machine learning model with best parameters using GridSearch
    model_filepath: filename the model is saved to
    """
    with open(model_filepath, 'wb') as f:
        pickle.dump(model, f)



database_filepath = 'DisasterResponse_test.db'
model_filepath = 'DisasterResponse_model_test.pkl'

print('Loading data...\n    DATABASE: {}'.format(database_filepath))
X, Y, category_names = load_data(database_filepath)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
        
print('Building model...')
model = build_model()
        
print('Training model...')
model.fit(X_train, Y_train)
        
print('Evaluating model...')
evaluate_model(model, X_test, Y_test, category_names)

print('Saving model...\n    MODEL: {}'.format(model_filepath))
save_model(model, model_filepath)

print('Trained model saved!')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Loading data...
    DATABASE: DisasterResponse_test.db
Building model...
Training model...
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] tfidf__use_idf=True .............................................
[CV] ... tfidf__use_idf=True, score=0.22614790444857674, total=  48.3s
[CV] tfidf__use_idf=True .............................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   57.4s remaining:    0.0s


[CV] ... tfidf__use_idf=True, score=0.23001001287369474, total=  47.6s
[CV] tfidf__use_idf=True .............................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.9min remaining:    0.0s


[CV] ... tfidf__use_idf=True, score=0.22675250357653792, total=  46.0s
[CV] tfidf__use_idf=False ............................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.8min remaining:    0.0s


[CV] .. tfidf__use_idf=False, score=0.22743527392361607, total=  48.8s
[CV] tfidf__use_idf=False ............................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  3.8min remaining:    0.0s


[CV] .. tfidf__use_idf=False, score=0.23644686024889144, total=  49.2s
[CV] tfidf__use_idf=False ............................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  4.8min remaining:    0.0s


[CV] .. tfidf__use_idf=False, score=0.22088698140200286, total=  48.1s


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  5.7min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  5.7min finished


Evaluating model...


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)



                         precision    recall  f1-score   support

               related       0.83      0.93      0.88      3999
               request       0.83      0.38      0.53       869
                 offer       0.00      0.00      0.00        21
           aid_related       0.75      0.53      0.62      2169
          medical_help       0.44      0.06      0.10       401
      medical_products       0.65      0.05      0.10       249
     search_and_rescue       0.33      0.02      0.03       132
              security       0.00      0.00      0.00        91
              military       0.65      0.09      0.16       169
           child_alone       0.00      0.00      0.00         0
                 water       0.92      0.17      0.29       332
                  food       0.86      0.31      0.46       577
               shelter       0.80      0.16      0.26       443
              clothing       0.82      0.11      0.20        79
                 money       0.86    
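
The cell above hardcodes `database_filepath` and `model_filepath`. Since `train.py` is meant to work on a dataset specified by the user, one possible way to take both paths from the command line is sketched below. The helper names `parse_paths` and `main` and the usage message are assumptions for illustration, not part of the provided template, and the training steps are only indicated by a comment.

```python
import sys


def parse_paths(argv):
    """Return (database_filepath, model_filepath) from an argv-style list.

    Returns None when the number of arguments is wrong.
    """
    if len(argv) != 3:
        return None
    return argv[1], argv[2]


def main(argv):
    """Run the training steps for the given arguments and return an exit code."""
    paths = parse_paths(argv)
    if paths is None:
        # Hypothetical usage message for this sketch.
        print("Usage: python train.py <database_filepath> <model_filepath>")
        return 1

    database_filepath, model_filepath = paths
    print('Loading data...\n    DATABASE: {}'.format(database_filepath))
    # load_data, build_model, model.fit, evaluate_model and save_model
    # from the cell above would be called here, in that order.
    print('Saving model...\n    MODEL: {}'.format(model_filepath))
    return 0


# Demo call with the file names used in the notebook cell; in a script,
# sys.argv would be passed instead of this literal list.
exit_code = main(["train.py", "DisasterResponse_test.db", "DisasterResponse_model_test.pkl"])
print(exit_code, len(sys.argv) >= 1)
```

Returning an exit code from `main` rather than calling `sys.exit` inside it keeps the function easy to call from a notebook or a test.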

# References
[R1] Working with text, scikit-learn: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html