# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [159]:
# import libraries
import pandas as pd
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
# pd.set_option('max_colwidth',200)
pd.set_option('expand_frame_repr', False)

from sqlalchemy import create_engine

import re
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
nltk.download(['punkt', 'wordnet', 'stopwords'])

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

[nltk_data] Downloading package punkt to /Users/kevin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/kevin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/kevin/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [134]:
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [121]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('InsertTableName', engine)  
# df.head()

X = df['message']
# y = df.iloc[:,4:]
y = df.iloc[:,5:]

# Note: the `related` column (index 4) contains a third class value, 2, which causes
# errors in model evaluation, so the targets start at column 5 and leave it out
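
The comment about the `related` column can be checked directly. The following is a minimal sketch, assuming a small made-up DataFrame in place of the real table (the values are illustrative and not taken from the dataset). It lists the label columns that contain anything other than 0 and 1.

```python
import pandas as pd

# Toy stand-in for the label columns; the values are made up for illustration.
labels = pd.DataFrame({
    "related": [1, 0, 2, 1],
    "request": [0, 1, 0, 0],
    "offer":   [0, 0, 0, 1],
})

# A column counts as binary when every value is either 0 or 1.
non_binary = [col for col in labels.columns if not labels[col].isin([0, 1]).all()]
print(non_binary)
```

Running the same list comprehension on `df.iloc[:, 4:]` would show which of the real category columns need cleaning or exclusion.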

In [122]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 40 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      103 non-null    int64 
 1   message                 103 non-null    object
 2   original                48 non-null     object
 3   genre                   103 non-null    object
 4   related                 103 non-null    int64 
 5   request                 103 non-null    int64 
 6   offer                   103 non-null    int64 
 7   aid_related             103 non-null    int64 
 8   medical_help            103 non-null    int64 
 9   medical_products        103 non-null    int64 
 10  search_and_rescue       103 non-null    int64 
 11  security                103 non-null    int64 
 12  military                103 non-null    int64 
 13  child_alone             103 non-null    int64 
 14  water                   103 non-null    int64 
 15  food  

### 2. Write a tokenization function to process your text data

In [10]:
def tokenize(text):
    """Replace URLs, tokenize, and lemmatize a message; returns a list of clean tokens."""
    # 1. replace each URL in the text with the string 'urlplaceholder'
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # 2. word tokenization
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()  # lemmatizes each token (noun form by default)

    clean_tokens = []
    for tok in tokens:
        # 3. lemmatize, convert to lowercase, and strip surrounding whitespace
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
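
The URL-replacement step of `tokenize` can be tried in isolation, without any NLTK data. This is a minimal sketch that reuses the same pattern but swaps the `findall`/`replace` loop for a single `re.sub` call; the helper name `replace_urls` and the sample message are made up for illustration.

```python
import re

# Same pattern as in tokenize(), written as a raw string.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL matched by URL_REGEX with a placeholder token."""
    return re.sub(URL_REGEX, placeholder, text)

sample = "Water needed, see http://example.com/help for details"
print(replace_urls(sample))  # Water needed, see urlplaceholder for details
```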

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [23]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

In [103]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x7fce0fbc8ee0>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x7fce0fbc8ee0>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,
 'tfidf_

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
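
As a small illustration of the split step, the sketch below uses placeholder arrays with 103 rows, matching the `RangeIndex` reported by `df.info()` above. The `random_state` value is an assumption added here to make the split repeatable; it is not part of the notebook's own call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 103 rows; the contents are irrelevant to the split sizes.
X_demo = np.arange(103).reshape(-1, 1)
y_demo = np.zeros(103, dtype=int)

# random_state is an illustrative choice; any fixed integer gives a repeatable split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(len(X_tr), len(X_te))  # 77 26
```

With `test_size=0.25`, scikit-learn rounds the test set up, so 103 rows give 26 test rows and 77 training rows.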

In [123]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# train classifier (pipeline)
pipeline.fit(X_train, y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x7fce0fbc8ee0>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
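
One way to iterate through the columns is sketched below. It assumes tiny made-up label arrays rather than the real `y_test` and predictions, and the names `category_names`, `y_true_demo`, and `y_pred_demo` are placeholders. It calls `classification_report` once per category.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up labels for two categories; rows are messages, columns are categories.
category_names = ["request", "offer"]
y_true_demo = np.array([[1, 0], [0, 0], [1, 1], [0, 0]])
y_pred_demo = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])

reports = {}
for i, name in enumerate(category_names):
    # zero_division=0 avoids the warning raised when a class is never predicted.
    reports[name] = classification_report(
        y_true_demo[:, i], y_pred_demo[:, i], zero_division=0, output_dict=True
    )
    print(name)
    print(classification_report(y_true_demo[:, i], y_pred_demo[:, i], zero_division=0))
```

With a DataFrame of true labels, the same loop could run over `y_test.columns` and index the prediction array by column position.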

In [124]:
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

In [125]:
y_true = y_test.values
# predict on test data
y_pred = pipeline.predict(X_test)

In [128]:
y_test.columns.values

array(['request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers',
       'other_infrastructure', 'weather_related', 'floods', 'storm',
       'fire', 'earthquake', 'cold', 'other_weather', 'direct_report'],
      dtype=object)

In [140]:
# print (f1_score(y_true, y_pred, average=None))
# print ('###################################')
# print (recall_score(y_true, y_pred, average=None))
# print ('###################################')
# print (precision_score(y_true, y_pred, average=None))
print('Random Forest Classifier on Train set \n')
print(classification_report(y_train.values, pipeline.predict(X_train), target_names=y_train.columns))

print('###################################')

print('Random Forest Classifier on Test set \n')
print(classification_report(y_true, y_pred, target_names=y_test.columns))

Random Forest Classifier on Train set
                        precision    recall  f1-score   support

               request       0.94      0.85      0.89        20
                 offer       1.00      1.00      1.00         1
           aid_related       0.93      1.00      0.96        41
          medical_help       1.00      0.67      0.80         6
      medical_products       1.00      1.00      1.00         3
     search_and_rescue       1.00      1.00      1.00         1
              security       0.00      0.00      0.00         0
              military       0.50      0.50      0.50         2
           child_alone       0.00      0.00      0.00         0
                 water       1.00      1.00      1.00         6
                  food       1.00      1.00      1.00        16
               shelter       1.00      0.89      0.94         9
              clothing       1.00      1.00      1.00         3
                 money       1.00      1.00      1.00         2
  

  _warn_prf(average, modifier, msg_start, len(result))


### 6. Improve your model
Use grid search to find better parameters. 

In [135]:
# specify parameters for grid search
# Use cv=3 instead of the default cv=5 to reduce the number of fits
# Use verbose=2 or 3 to see progress while the search is running
parameters = {'clf__estimator__n_estimators': [100, 200],
              'clf__estimator__max_depth': [5]}

# create grid search object
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3, verbose=3)

In [136]:
cv.fit(X_train,y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV 1/3] END clf__estimator__max_depth=5, clf__estimator__n_estimators=100; total time=   3.7s
[CV 2/3] END clf__estimator__max_depth=5, clf__estimator__n_estimators=100; total time=   4.8s
[CV 3/3] END clf__estimator__max_depth=5, clf__estimator__n_estimators=100; total time=   4.3s
[CV 1/3] END clf__estimator__max_depth=5, clf__estimator__n_estimators=200; total time=   7.3s
[CV 2/3] END clf__estimator__max_depth=5, clf__estimator__n_estimators=200; total time=   7.7s
[CV 3/3] END clf__estimator__max_depth=5, clf__estimator__n_estimators=200; total time=   7.1s


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x7fce0fbc8ee0>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'clf__estimator__max_depth': [5],
                         'clf__estimator__n_estimators': [100, 200]},
             verbose=3)

In [137]:
cv.best_params_

{'clf__estimator__max_depth': 5, 'clf__estimator__n_estimators': 100}
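
The double-underscore keys in the grid follow the nesting of the pipeline: `clf` is the step name, `estimator` is the parameter of `MultiOutputClassifier` that holds the wrapped model, and the last part is a `RandomForestClassifier` parameter. As a rough sketch, the valid keys can be listed with `get_params()`; the pipeline below uses the default `CountVectorizer` tokenizer so that it runs without the notebook's `tokenize` function.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

# Keys that reach into the random forest all start with "clf__estimator__".
forest_keys = sorted(
    key for key in demo_pipeline.get_params() if key.startswith("clf__estimator__")
)
print(forest_keys[:5])
```

Any key in `forest_keys` can be used in `param_grid`; a key that is not returned by `get_params()` makes the search fail with an invalid-parameter error.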

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [144]:
y_pred_train_cv = cv.predict(X_train)

y_pred_test_cv = cv.predict(X_test)

In [146]:
print('Random Forest Classifier on Train set \n')
print(classification_report(y_train.values, y_pred_train_cv, target_names=y_train.columns))

print('#########################################################################################################')

print('Random Forest Classifier on Test set \n')
print(classification_report(y_test.values, y_pred_test_cv, target_names=y_test.columns))

Random Forest Classifier on Train set 

                        precision    recall  f1-score   support

               request       1.00      0.45      0.62        20
                 offer       1.00      1.00      1.00         1
           aid_related       0.93      1.00      0.96        41
          medical_help       1.00      0.67      0.80         6
      medical_products       1.00      1.00      1.00         3
     search_and_rescue       1.00      1.00      1.00         1
              security       0.00      0.00      0.00         0
              military       0.00      0.00      0.00         2
           child_alone       0.00      0.00      0.00         0
                 water       1.00      1.00      1.00         6
                  food       1.00      0.81      0.90        16
               shelter       1.00      0.78      0.88         9
              clothing       1.00      0.67      0.80         3
                 money       1.00      1.00      1.00         2


  _warn_prf(average, modifier, msg_start, len(result))


In [148]:
# accuracy score on test set
print('GridSearch rfc Accuracy')
print((y_pred_test_cv == y_test).mean())

GridSearch rfc Accuracy
request                   0.807692
offer                     1.000000
aid_related               0.884615
medical_help              0.923077
medical_products          0.961538
search_and_rescue         0.961538
security                  1.000000
military                  0.961538
child_alone               1.000000
water                     0.961538
food                      0.884615
shelter                   0.846154
clothing                  1.000000
money                     1.000000
missing_people            1.000000
refugees                  0.923077
death                     0.961538
other_aid                 0.884615
infrastructure_related    0.923077
transport                 1.000000
buildings                 0.923077
electricity               1.000000
tools                     1.000000
hospitals                 1.000000
shops                     1.000000
aid_centers               1.000000
other_infrastructure      0.923077
weather_related           0.769
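
The per-category accuracy above treats each column independently. A stricter complementary figure is the share of messages for which every category is predicted correctly. The sketch below, using made-up arrays rather than the notebook's predictions, computes both with NumPy.

```python
import numpy as np

# Made-up labels: 4 messages, 3 categories.
y_true_demo = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]])
y_pred_demo = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]])

matches = y_true_demo == y_pred_demo

# Accuracy per category: fraction of correct predictions in each column.
per_column = matches.mean(axis=0)

# Exact-match accuracy: fraction of rows where all categories are correct.
exact_match = matches.all(axis=1).mean()

print(per_column)   # [1.   0.75 0.75]
print(exact_match)  # 0.5
```

With imbalanced labels, per-category accuracy can look high even when a rare class is never predicted, which is why precision and recall are reported alongside it.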

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

##### 8a. try other machine learning algorithms - classifier
1. GBM - gradient boosting classifier - GradientBoostingClassifier
2. KNN - k-nearest neighbours classifier - KNeighborsClassifier
3. SVM - support vector classifier - SVC (note: for larger datasets, consider using LinearSVC or SGDClassifier)
...
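
Swapping the classifier only changes the estimator wrapped by `MultiOutputClassifier`. The following is a minimal sketch with a handful of made-up messages and two made-up label columns, using the default `CountVectorizer` tokenizer and `n_neighbors=1` as arbitrary illustrative choices.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up training data; label columns stand for (aid_related, weather_related).
messages = [
    "need water and food", "storm destroyed the bridge", "we need food",
    "heavy storm and floods", "please send water", "flood near the river",
]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

knn_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(KNeighborsClassifier(n_neighbors=1))),
])
knn_pipeline.fit(messages, labels)

pred = knn_pipeline.predict(["need food"])
print(pred.shape)  # (1, 2): one message, two categories
```

The same pattern should apply to `GradientBoostingClassifier` or `SVC`; grid-search keys would then refer to that estimator's parameters, for example `clf__estimator__n_neighbors` here.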

##### 8b. add other features
1. add disaster word extractor

In [155]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

In [156]:
# A custom transformer that flags messages containing buzzwords signaling a disaster
class DisasterWordExtractor(BaseEstimator, TransformerMixin):

    def disaster_words(self, text):
        """
        INPUT: text - string, raw text data
        OUTPUT: bool - True if the text contains a disaster buzzword, otherwise False
        """
        # list of words that are commonly used during a disaster event
        disaster_words = ['tsunami','volcano','tornado','avalanche','earthquake','blizzard','drought','bushfire',
                          'tremor','duststorm','magma','twister','windstorm','heat','cyclone','fire','flood',
                          'hailstorm','lava','lightning','high-pressure','hail','hurricane','seismic','erosion',
                          'whirlpool','Richter scale','whirlwind','cloud','thunderstorm','barometer',
                          'gale','blackout','gust','force','low-pressure','volt','snowstorm','rainstorm','storm',
                          'nimbus','violentstorm','sandstorm','casualty','fatal','fatality','cumulonimbus','death',
                          'lost','destruction','money','tension','cataclysm','damage','uproot','underground',
                          'destroy','arsonist','arson','rescue','permafrost','disaster','fault','scientist','shelter']

        lemmatizer = WordNetLemmatizer()
        stemmer = PorterStemmer()

        # lemmatize the buzzwords
        lemmatized_words = [lemmatizer.lemmatize(w, pos='v') for w in disaster_words]
        # reduce each lemmatized buzzword to its stem (a set gives fast membership tests)
        stem_disaster_words = {stemmer.stem(w) for w in lemmatized_words}

        # tokenize the input text and stem each token so that it is comparable
        # with the stemmed buzzwords (an unstemmed token such as 'earthquake' would never match)
        clean_tokens = tokenize(text)
        for token in clean_tokens:
            if stemmer.stem(token) in stem_disaster_words:
                return True
        return False

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # apply the buzzword check to every message; returns a one-column DataFrame of booleans
        X_disaster_words = pd.Series(X).apply(self.disaster_words)
        return pd.DataFrame(X_disaster_words)

In [157]:
pipeline_2 = Pipeline([
    ('features',FeatureUnion([
        ('text_pipeline',Pipeline([
            ('vect',CountVectorizer(tokenizer=tokenize)),
            ('tfidf',TfidfTransformer())
            ])),
        ('disaster_words',DisasterWordExtractor()) # custom transformer - disaster word extractor
        ])),
    ('clf',MultiOutputClassifier(RandomForestClassifier()))
    ])
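
To see what `FeatureUnion` does with an extra transformer, the sketch below combines word counts with one hand-written flag column. `ContainsWord` is a hypothetical, simplified stand-in for the disaster word extractor, and the messages are made up. The combined matrix has one column per vocabulary word plus one column for the flag.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

class ContainsWord(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one 0/1 column flagging messages that contain a keyword."""

    def __init__(self, word="storm"):
        self.word = word

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per message, one column holding the flag.
        return np.array([[int(self.word in text.lower().split())] for text in X])

messages = ["Storm hit the coast", "we need water", "storm and flood"]

union = FeatureUnion([("counts", CountVectorizer()), ("flag", ContainsWord())])
features = union.fit_transform(messages)

# Vocabulary size of the fitted CountVectorizer inside the union.
n_vocab = len(union.transformer_list[0][1].vocabulary_)
print(features.shape, n_vocab)
```

Because the count matrix is sparse, the union returns a sparse matrix; the flag is simply appended as the last column.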

In [1]:
# train classifier (pipeline)
pipeline_2.fit(X_train, y_train)

y_true = y_test.values
# predict on test data
y_pred_2 = pipeline_2.predict(X_test)


In [163]:
# Note: the train-set report below is computed with `pipeline` (the original model), not `pipeline_2`;
# only the test-set report uses the predictions of `pipeline_2`
print('Random Forest Classifier on Train set \n')
print(classification_report(y_train.values, pipeline.predict(X_train), target_names=y_train.columns))

print('###################################')

print('Random Forest Classifier on Test set \n')
print(classification_report(y_true, y_pred_2, target_names=y_test.columns))

Random Forest Classifier on Train set 

                        precision    recall  f1-score   support

               request       0.94      0.85      0.89        20
                 offer       1.00      1.00      1.00         1
           aid_related       0.93      1.00      0.96        41
          medical_help       1.00      0.67      0.80         6
      medical_products       1.00      1.00      1.00         3
     search_and_rescue       1.00      1.00      1.00         1
              security       0.00      0.00      0.00         0
              military       0.50      0.50      0.50         2
           child_alone       0.00      0.00      0.00         0
                 water       1.00      1.00      1.00         6
                  food       1.00      1.00      1.00        16
               shelter       1.00      0.89      0.94         9
              clothing       1.00      1.00      1.00         3
                 money       1.00      1.00      1.00         2


  _warn_prf(average, modifier, msg_start, len(result))


### 9. Export your model as a pickle file

In [149]:
import pickle

In [None]:
# save the model to disk; the `with` block closes the file once the dump is complete
with open('classifier.pickle', 'wb') as f:
    pickle.dump(pipeline, f)
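
A quick round trip confirms that a saved file can be loaded again. The sketch below is a minimal example that uses a plain dictionary as a stand-in for the fitted pipeline and writes to a temporary directory, both of which are assumptions made to keep it self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted pipeline; any picklable object behaves the same way here.
model_stub = {"name": "demo-model", "params": {"n_estimators": 100}}

path = os.path.join(tempfile.mkdtemp(), "classifier.pickle")

# Write the object to disk.
with open(path, "wb") as f:
    pickle.dump(model_stub, f)

# Read it back and compare with the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)  # True
```

Note that pickle stores functions by reference, so a pipeline that uses the custom `tokenize` function can only be loaded where that function is importable under the same module and name.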

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
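
The template itself is not shown here, so the following is only a sketch of how such a script might read its inputs. The argument names `database_filepath` and `model_filepath` and the helper `base_name` are hypothetical. The helper derives a name from the database path in the same way as the cell below.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse the two paths a training script would need (hypothetical argument names)."""
    parser = argparse.ArgumentParser(description="Train the message classifier.")
    parser.add_argument("database_filepath", help="path to the SQLite database file")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)

def base_name(database_filepath):
    """Return the file name without directory or extension."""
    return os.path.splitext(os.path.basename(database_filepath))[0]

# Passing a list makes the example runnable without real command-line arguments.
args = parse_args(["data/DisasterResponse.db", "classifier.pickle"])
print(base_name(args.database_filepath))  # DisasterResponse
```

In a real script, `parse_args()` would be called without arguments so that it reads from `sys.argv`.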

In [6]:
import os

database_filename = 'data/DisasterResponse.db'
name = os.path.basename(database_filename).split('.')[0]

print (name)

DisasterResponse
