# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report, f1_score, accuracy_score
from sklearn.model_selection import GridSearchCV

import pickle

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [3]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('MESSAGES', engine)
print(df.shape)
print(df.head())
X = df['message'].values
Y = df.iloc[:, 4:]

(26028, 39)
   id                                            message  \
0   2  Weather update - a cold front from Cuba that c...   
1   7            Is the Hurricane over or is it not over   
2   8                    Looking for someone but no name   
3   9  UN reports Leogane 80-90 destroyed. Only Hospi...   
4  12  says: west side of Haiti, rest of the country ...   

                                            original   genre  related  \
0  Un front froid se retrouve sur Cuba ce matin. ...  direct        1   
1                 Cyclone nan fini osinon li pa fini  direct        1   
2  Patnm, di Maryani relem pou li banm nouvel li ...  direct        1   
3  UN reports Leogane 80-90 destroyed. Only Hospi...  direct        1   
4  facade ouest d Haiti et le reste du pays aujou...  direct        1   

   request  offer  aid_related  medical_help  medical_products      ...        \
0        0      0            0             0                 0      ...         
1        0      0         
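
The cell above selects the targets by position (`df.iloc[:, 4:]`), which silently breaks if the column order changes. A minimal sketch of selecting them by name instead, assuming the four leading columns are `id`, `message`, `original`, and `genre` as in the printed head; `df_demo`, `meta_cols`, and the toy values are made up for illustration:

```python
import pandas as pd

# Toy frame mimicking the layout printed above (values are invented).
df_demo = pd.DataFrame({
    "id": [2, 7],
    "message": ["Is the Hurricane over", "Looking for someone"],
    "original": ["...", "..."],
    "genre": ["direct", "direct"],
    "related": [1, 1],
    "request": [0, 1],
})

# Name the non-target columns explicitly instead of relying on position 4.
meta_cols = ["id", "message", "original", "genre"]
X_demo = df_demo["message"].values
Y_demo = df_demo.drop(columns=meta_cols)

print(list(Y_demo.columns))
```

On this toy frame both approaches give the same target columns; the name-based version just fails loudly if a metadata column is missing.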

### 2. Write a tokenization function to process your text data

In [4]:
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

In [5]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # remove stop words and lemmatize the remaining tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens
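
To see what the normalization step does before NLTK gets involved, here is a small sketch that applies the same regular expression to one of the sample messages. `normalize` is a hypothetical helper, and `str.split()` is only a stand-in for `word_tokenize` so the example runs without NLTK data; the real tokenizer may split some words differently.

```python
import re


def normalize(text):
    # Same substitution as in tokenize: lowercase, then replace every
    # character that is not an ASCII letter or digit with a space.
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())


sample = "UN reports Leogane 80-90 destroyed."
normalized = normalize(sample)

# Whitespace split as a stand-in for word_tokenize (assumption, see above).
rough_tokens = normalized.split()
print(rough_tokens)
```

Note that `80-90` becomes two separate tokens, because the hyphen is replaced by a space.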

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [6]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=42)))
])
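
As a rough illustration of what `MultiOutputClassifier` does in the last step, the sketch below fits it on a tiny numeric toy problem with two target columns. The arrays `X_toy` and `Y_toy` are invented; the point is only that one copy of the base estimator is fitted per target column and that predictions come back with one column per target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Six samples, two features, two binary target columns (toy data).
X_toy = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [1, 0], [0, 1]])
Y_toy = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [1, 0], [0, 1]])

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=42))
clf.fit(X_toy, Y_toy)

# One fitted forest per target column.
print(len(clf.estimators_))
# Predictions have shape (n_samples, n_targets).
print(clf.predict(X_toy).shape)
```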

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3)
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=1))])
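
The split above does not fix a seed, so the metrics reported in this notebook will vary from run to run. A small sketch, on made-up arrays, of how passing `random_state` to `train_test_split` makes the split repeatable (the value 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20)
Y_toy = np.arange(20) % 2

# Two calls with the same seed produce identical train/test partitions.
split_a = train_test_split(X_toy, Y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, Y_toy, test_size=0.3, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same, len(split_a[0]), len(split_a[1]))
```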

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [8]:
y_pred = pipeline.predict(X_test)

In [9]:
print(classification_report(y_test, y_pred, target_names=df.columns[4:], digits=3))

                        precision    recall  f1-score   support

               related      0.856     0.926     0.890      6026
               request      0.799     0.426     0.555      1351
                 offer      0.000     0.000     0.000        35
           aid_related      0.756     0.583     0.659      3318
          medical_help      0.687     0.084     0.150       678
      medical_products      0.746     0.133     0.226       375
     search_and_rescue      0.545     0.027     0.052       219
              security      0.000     0.000     0.000       144
              military      0.773     0.069     0.126       248
                 water      0.843     0.178     0.294       511
                  food      0.857     0.456     0.595       895
               shelter      0.803     0.284     0.420       718
              clothing      0.810     0.155     0.260       110
                 money      0.500     0.011     0.022       180
        missing_people      1.000     0
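
The instructions for this step suggest iterating through the columns. A minimal sketch of that loop on toy data, assuming the true labels are a DataFrame and the predictions are a NumPy array with columns in the same order (as `pipeline.predict` returns them); `y_true_demo`, `y_pred_demo`, and `reports` are illustrative names:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels: two categories, four messages.
y_true_demo = pd.DataFrame({"related": [1, 1, 0, 1], "request": [0, 1, 0, 1]})
y_pred_demo = np.array([[1, 0], [0, 1], [0, 1], [1, 1]])

reports = {}
for i, column in enumerate(y_true_demo.columns):
    # Compare one true column with the matching prediction column.
    reports[column] = classification_report(
        y_true_demo[column], y_pred_demo[:, i], digits=3
    )
    print(column)
    print(reports[column])
```

Each per-column report lists both classes (0 and 1), which makes it easier to spot categories where the positive class is never predicted.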



### 6. Improve your model
Use grid search to find better parameters. 

In [10]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fc964380620>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=42, v

In [11]:
parameters = {
    'vect__max_df': [0.5, 0.75, 1.0],
    'tfidf__use_idf': [True, False],
    'clf__estimator__n_estimators': [10, 25, 50]
}

cv = GridSearchCV(pipeline, param_grid=parameters)
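
Each key in the grid is a step name, a double underscore, and a parameter name; `clf__estimator__n_estimators` reaches through the `clf` step into the estimator wrapped by `MultiOutputClassifier`. To get a feel for the cost of the search, the sketch below enumerates the combinations with `itertools.product`. This only illustrates the grid size and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

parameters = {
    'vect__max_df': [0.5, 0.75, 1.0],
    'tfidf__use_idf': [True, False],
    'clf__estimator__n_estimators': [10, 25, 50],
}

# Cartesian product of all value lists, one dict per candidate setting.
combinations = [
    dict(zip(parameters, values)) for values in product(*parameters.values())
]

print(len(combinations))
print(combinations[0])
```

Every combination is fitted once per cross-validation fold, so the total number of fits is this count multiplied by the number of folds (which depends on the scikit-learn version when `cv` is left at its default).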

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!

In [12]:
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__max_df': [0.5, 0.75, 1.0], 'tfidf__use_idf': [True, False], 'clf__estimator__n_estimators': [10, 25, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [13]:
y_pred = cv.predict(X_test)

In [14]:
print(classification_report(y_test, y_pred, target_names=df.columns[4:], digits=3))

                        precision    recall  f1-score   support

               related      0.845     0.949     0.894      6026
               request      0.847     0.512     0.638      1351
                 offer      0.000     0.000     0.000        35
           aid_related      0.767     0.669     0.715      3318
          medical_help      0.786     0.081     0.147       678
      medical_products      0.846     0.117     0.206       375
     search_and_rescue      0.421     0.037     0.067       219
              security      0.000     0.000     0.000       144
              military      0.812     0.052     0.098       248
                 water      0.865     0.352     0.501       511
                  food      0.858     0.541     0.663       895
               shelter      0.853     0.364     0.510       718
              clothing      0.895     0.155     0.264       110
                 money      0.714     0.028     0.053       180
        missing_people      1.000     0
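
The report above covers precision and recall but not accuracy, which this step also asks for. A minimal sketch of computing all three per category on toy arrays; `y_true_demo`, `y_pred_demo`, `names`, and `rows` are made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels: two categories, four messages.
y_true_demo = np.array([[1, 0], [1, 1], [0, 0], [1, 1]])
y_pred_demo = np.array([[1, 0], [0, 1], [0, 1], [1, 1]])
names = ["related", "request"]

rows = []
for i, name in enumerate(names):
    rows.append((
        name,
        accuracy_score(y_true_demo[:, i], y_pred_demo[:, i]),
        precision_score(y_true_demo[:, i], y_pred_demo[:, i]),
        recall_score(y_true_demo[:, i], y_pred_demo[:, i]),
    ))

for name, acc, prec, rec in rows:
    print("{:<10} accuracy={:.3f} precision={:.3f} recall={:.3f}".format(name, acc, prec, rec))
```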



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
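
For the second idea, `FeatureUnion` (already imported at the top) can place extra features next to the TF-IDF matrix. The sketch below is one possible arrangement, not the approach taken in this notebook: `MessageLength` is a hypothetical transformer that adds the character count of each message, and the default `CountVectorizer` tokenizer is used so the example runs without NLTK.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters per message."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the transformer fits in a pipeline.
        return self

    def transform(self, X):
        # One column holding the length of each message.
        return np.array([[len(text)] for text in X])


features = FeatureUnion([
    ('text', Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
    ('length', MessageLength()),
])

docs = ["water needed", "send food and water please"]
matrix = features.fit_transform(docs)

# Six vocabulary columns from TF-IDF plus one length column.
print(matrix.shape)
```

In a full pipeline, `features` would replace the `vect` and `tfidf` steps, followed by the `clf` step as before.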

In [15]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier(random_state=42)))
])

In [16]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...timator=None,
          learning_rate=1.0, n_estimators=50, random_state=42),
           n_jobs=1))])

In [17]:
y_pred = pipeline.predict(X_test)

In [18]:
print(classification_report(y_test, y_pred, target_names=df.columns[4:], digits=3))

                        precision    recall  f1-score   support

               related      0.804     0.972     0.880      6026
               request      0.764     0.523     0.621      1351
                 offer      0.143     0.029     0.048        35
           aid_related      0.769     0.612     0.681      3318
          medical_help      0.625     0.240     0.347       678
      medical_products      0.605     0.307     0.407       375
     search_and_rescue      0.547     0.187     0.279       219
              security      0.242     0.056     0.090       144
              military      0.587     0.327     0.420       248
                 water      0.725     0.603     0.658       511
                  food      0.787     0.642     0.707       895
               shelter      0.776     0.552     0.645       718
              clothing      0.734     0.427     0.540       110
                 money      0.495     0.256     0.337       180
        missing_people      0.526     0

In [19]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fc964380620>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
             learning_rate=1.0, n_estimators=50, random_state=42),
              n_jobs=1))],
 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=

In [20]:
parameters = {
    'vect__max_df': [0.5, 0.75, 1.0],
    'tfidf__use_idf': [True, False],
    'clf__estimator__n_estimators': [10, 25, 50],
    'clf__estimator__learning_rate': [0.1, 0.2, 0.5]
}

cv = GridSearchCV(pipeline, param_grid=parameters)

In [21]:
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...timator=None,
          learning_rate=1.0, n_estimators=50, random_state=42),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__max_df': [0.5, 0.75, 1.0], 'tfidf__use_idf': [True, False], 'clf__estimator__n_estimators': [10, 25, 50], 'clf__estimator__learning_rate': [0.1, 0.2, 0.5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [22]:
y_pred = cv.predict(X_test)

In [23]:
print(classification_report(y_test, y_pred, target_names=df.columns[4:], digits=3))

                        precision    recall  f1-score   support

               related      0.795     0.982     0.879      6026
               request      0.825     0.466     0.596      1351
                 offer      0.000     0.000     0.000        35
           aid_related      0.787     0.540     0.641      3318
          medical_help      0.721     0.149     0.247       678
      medical_products      0.720     0.227     0.345       375
     search_and_rescue      0.634     0.119     0.200       219
              security      0.667     0.028     0.053       144
              military      0.706     0.194     0.304       248
                 water      0.770     0.597     0.673       511
                  food      0.808     0.699     0.750       895
               shelter      0.820     0.483     0.608       718
              clothing      0.844     0.345     0.490       110
                 money      0.556     0.167     0.256       180
        missing_people      0.706     0



In [24]:
print('best score  is: {}'.format(cv.best_score_))
print('best parameters are: {}'.format(cv.best_params_))

best score  is: 0.2438662934299358
best parameters are: {'clf__estimator__learning_rate': 0.5, 'clf__estimator__n_estimators': 50, 'tfidf__use_idf': True, 'vect__max_df': 0.5}
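
The best score of about 0.24 looks low next to the per-category numbers. A likely reason, assuming the search used the estimator's default scorer (the output shows `scoring=None`) and that `MultiOutputClassifier.score` counts a sample as correct only when every label matches, is that the score is a subset (exact-match) accuracy. The toy arrays below illustrate how much stricter that is than per-label accuracy:

```python
import numpy as np

# Toy labels: four messages, three categories.
y_true_demo = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 1]])
y_pred_demo = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]])

# A row counts as correct only if all of its labels are correct.
subset_accuracy = np.mean(np.all(y_true_demo == y_pred_demo, axis=1))

# Accuracy computed separately for each label column.
per_label_accuracy = (y_true_demo == y_pred_demo).mean(axis=0)

print(subset_accuracy)
print(per_label_accuracy)
```

Here two of the three labels are predicted perfectly, yet only half of the rows match exactly. Passing an explicit `scoring` argument to `GridSearchCV` would be one way to tune for a different metric.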


### 9. Export your model as a pickle file

In [25]:
with open('disaster_response_model.pkl', 'wb') as pkl_file:
    pickle.dump(cv, pkl_file)
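
A minimal sketch of the round trip, using a plain dictionary (`model_stub`, made up for illustration) in place of the fitted search object and a temporary directory so nothing is left on disk:

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the fitted model; any picklable object works the same way.
model_stub = {"best_params": {"vect__max_df": 0.5}, "labels": ["related", "request"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model_stub.pkl"
    with open(path, "wb") as pkl_file:
        pickle.dump(model_stub, pkl_file)
    with open(path, "rb") as pkl_file:
        restored = pickle.load(pkl_file)

print(restored == model_stub)
```

One caveat for the real model: pickle stores functions by reference, so the code that loads the file needs the `tokenize` function to be importable under the same module path that was used when the model was saved.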

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
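
The template itself is not shown here, so the following is only a guess at how such a script might read its inputs. The argument names `database_filepath` and `model_filepath` and the helper `parse_args` are hypothetical:

```python
import argparse


def parse_args(argv=None):
    # Hypothetical command-line interface: one input path, one output path.
    parser = argparse.ArgumentParser(description="Train and export the classifier.")
    parser.add_argument("database_filepath", help="SQLite database to read messages from")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


# Passing a list keeps the example independent of the real command line.
args = parse_args(["DisasterResponse.db", "disaster_response_model.pkl"])
print(args.database_filepath, args.model_filepath)
```

The loading, pipeline-building, training, evaluation, and export cells from this notebook would then become functions called in that order from a `main` function.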