# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [179]:
# import libraries
import pandas as pd
import numpy as np
import os
from sqlalchemy import create_engine

# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 
import re

#sklearn
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline,  FeatureUnion
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

[nltk_data] Downloading package punkt to /Users/jeffsan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jeffsan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jeffsan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [180]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql("SELECT * FROM messages", engine)
X = df[['message', 'genre']]
Y = df.drop(columns=['id', 'message', 'original','genre'])


### 2. Write a tokenization function to process your text data

In [181]:
def tokenize(text):
    #remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    #tokenize text
    tokens = word_tokenize(text)
    
    # initiate lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    #iterate for each tokens
    clean_tokens = []
    for tok in tokens:
        
        if tok not in stopwords.words('english'):
            # lemmatize, normalize case, and remove leading/trailing white space
            clean_tok = lemmatizer.lemmatize(tok).lower().strip()

            clean_tokens.append(clean_tok)
    
    return clean_tokens
    

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [182]:
from sklearn.preprocessing import FunctionTransformer

get_msg_data = FunctionTransformer(lambda x: x['message'], validate=False)
get_genre_data = FunctionTransformer(lambda x: pd.get_dummies(x['genre']), validate=False)

In [209]:
msg_pipeline = Pipeline([
    ('msg_selector', get_msg_data),
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
])

features_pipeline_union = FeatureUnion([
    ('msg_pipeline', msg_pipeline),
    ('genre_pipeline', get_genre_data)
])

pipeline = Pipeline([
    ('features', features_pipeline_union),
    ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=42), n_jobs=-1))
    #('clf', OneVsRestClassifier(LogisticRegression(random_state=42), n_jobs=-1))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [210]:
X_train, X_test, y_train, y_test = tts(X,Y,test_size=0.33, random_state= 42)

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('msg_pipeline', Pipeline(memory=None,
     steps=[('msg_selector', FunctionTransformer(accept_sparse=False,
          func=<function <lambda> at 0x1364bb510>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='dep...
            oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=-1))])

### 5. Test your model
Report the f1 score, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 

In [205]:
def show_report(model, X,y):
    """ Print out classification report """
    y_pred = model.predict(X)
    labels = y.columns.tolist()
    class_report = classification_report(y, y_pred, target_names=labels)
    print("\nClassification report:\n", class_report)
 

In [206]:
""" LR on Train """
show_report(pipeline, X_train, y_train)


Classification report:
                         precision    recall  f1-score   support

               related       0.88      0.97      0.92     13334
               request       0.87      0.61      0.72      2980
                 offer       0.00      0.00      0.00        82
           aid_related       0.86      0.74      0.80      7277
          medical_help       0.82      0.19      0.31      1391
      medical_products       0.88      0.19      0.32       906
     search_and_rescue       0.97      0.07      0.14       499
              security       0.00      0.00      0.00       324
              military       0.89      0.17      0.28       593
           child_alone       0.00      0.00      0.00         0
                 water       0.86      0.54      0.67      1155
                  food       0.87      0.62      0.73      1949
               shelter       0.87      0.44      0.58      1510
              clothing       0.83      0.21      0.33       283
              

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [207]:
""" LR on Test """
show_report(pipeline, X_test, y_test)


Classification report:
                         precision    recall  f1-score   support

               related       0.84      0.95      0.89      6542
               request       0.82      0.56      0.67      1484
                 offer       0.00      0.00      0.00        36
           aid_related       0.77      0.67      0.72      3564
          medical_help       0.60      0.14      0.23       690
      medical_products       0.79      0.18      0.29       405
     search_and_rescue       0.73      0.05      0.09       225
              security       0.00      0.00      0.00       147
              military       0.57      0.10      0.17       266
           child_alone       0.00      0.00      0.00         0
                 water       0.79      0.47      0.59       514
                  food       0.86      0.57      0.69       968
               shelter       0.85      0.40      0.54       798
              clothing       0.85      0.19      0.31       121
              

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [211]:
""" RF on Train """
show_report(pipeline, X_train, y_train)


Classification report:
                         precision    recall  f1-score   support

               related       0.99      1.00      0.99     13334
               request       1.00      0.92      0.96      2980
                 offer       1.00      0.68      0.81        82
           aid_related       1.00      0.97      0.98      7277
          medical_help       1.00      0.84      0.91      1391
      medical_products       1.00      0.84      0.91       906
     search_and_rescue       1.00      0.76      0.86       499
              security       1.00      0.74      0.85       324
              military       1.00      0.88      0.93       593
           child_alone       0.00      0.00      0.00         0
                 water       1.00      0.93      0.96      1155
                  food       1.00      0.94      0.97      1949
               shelter       1.00      0.90      0.95      1510
              clothing       1.00      0.86      0.93       283
              

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [None]:
""" RF on Test """
show_report(pipeline, X_test, y_test)

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
parameters = 

cv = 

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.