# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import re
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import numpy as np
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sqlalchemy import create_engine

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///disaster_response.db')
df = pd.read_sql('SELECT * FROM disaster_response', engine)
X = df['message']
y = df.drop(['id', 'message', 'original', 'genre'],  axis=1).astype(float)
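
The load-and-split step above can be tried without `disaster_response.db`. The following is a minimal sketch under stated assumptions: the toy rows are hypothetical, only two category columns are included, and a plain `sqlite3` in-memory connection stands in for the SQLAlchemy engine used in the notebook.

```python
import sqlite3

import pandas as pd

# Hypothetical toy rows standing in for the real disaster_response table.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'message': ['We need water', 'Road is blocked', 'Offering blankets'],
    'original': ['We need water', 'Road is blocked', 'Offering blankets'],
    'genre': ['direct', 'news', 'direct'],
    'related': [1, 1, 1],
    'request': [1, 0, 0],
})

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(':memory:')
toy.to_sql('disaster_response', conn, index=False)

df_toy = pd.read_sql('SELECT * FROM disaster_response', conn)

# Features are the raw message text; targets are every remaining category column.
X_toy = df_toy['message']
y_toy = df_toy.drop(['id', 'message', 'original', 'genre'], axis=1).astype(float)

print(X_toy.shape, y_toy.shape)
print(list(y_toy.columns))
```

Dropping the non-category columns by name, as the notebook does, means any category column added to the table later is picked up as a target automatically.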

### 2. Write a tokenization function to process your text data

In [3]:
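def tokenize(sentence):
    """Split a message into lemmatized, lowercased, whitespace-stripped tokens."""
    # Alternative (disabled): strip punctuation first, then lemmatize each
    # token as a noun and again as a verb.
    # sentence = re.sub('[^A-Za-z0-9]+', ' ', sentence)
    # tokens = word_tokenize(sentence.lower().strip())
    # lemmatizer = WordNetLemmatizer()
    # tokens = [lemmatizer.lemmatize(i, 'n') for i in tokens]
    # tokens = [lemmatizer.lemmatize(i, 'v') for i in tokens]
    # return tokens

    tokens = nltk.word_tokenize(sentence)
    lemmatizer = nltk.WordNetLemmatizer()

    # Lemmatize first, then normalize case and surrounding whitespace.
    return [lemmatizer.lemmatize(x).lower().strip() for x in tokens]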
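
The disabled variant in the cell above normalizes punctuation before tokenizing. Here is a rough sketch of just that normalization step using only the standard library. It is not the notebook's tokenizer: `simple_tokenize` is a hypothetical name, whitespace splitting stands in for `word_tokenize`, and there is no lemmatization.

```python
import re


def simple_tokenize(sentence):
    """Lowercase, replace non-alphanumeric runs with spaces, split on whitespace."""
    cleaned = re.sub('[^A-Za-z0-9]+', ' ', sentence)
    return cleaned.lower().strip().split()


print(simple_tokenize('Water needed, URGENTLY!'))
# → ['water', 'needed', 'urgently']
```

Without this step, punctuation marks come through `word_tokenize` as their own tokens and end up in the vocabulary.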

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [4]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
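
To see what `MultiOutputClassifier` does on its own, the sketch below fits it on hypothetical toy data with numeric features instead of text. It trains one copy of the wrapped estimator per target column; the `n_estimators` and `random_state` values are chosen only to keep the example small and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical toy data: 6 samples, 2 numeric features, 2 binary targets.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
Y_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X_demo, Y_demo)

# One fitted estimator per target column; predictions keep the target layout.
print(len(clf.estimators_))
print(clf.predict(X_demo).shape)
```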

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
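
The split above does not set `random_state`, so the reported metrics can shift from run to run. With the default settings, `train_test_split` holds out 25% of the rows, which matches the 6554 test rows and 19662 training rows in the reports. A small sketch on hypothetical toy lists shows how a fixed seed makes the split repeatable:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data.
messages = [f'message {i}' for i in range(8)]
labels = [i % 2 for i in range(8)]

# random_state pins the shuffle, so repeated calls give the same split.
a_train, a_test, b_train, b_test = train_test_split(messages, labels, random_state=42)
c_train, c_test, d_train, d_test = train_test_split(messages, labels, random_state=42)

print(len(a_train), len(a_test))
print(a_test == c_test)
```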

In [6]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 

In [7]:
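def classification_results(y_test, y_pred):
    """Print a classification report for each target column."""
    # Iterating over a DataFrame yields its column names.
    for i, col in enumerate(y_test):
        print(f'#######################:{col}:#######################')
        print(classification_report(y_test.iloc[:, i], y_pred[:, i]))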
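
The F1 score in each report row is the harmonic mean of that row's precision and recall. The sketch below re-derives it from values printed in this notebook's baseline test report. `f1_from` is a hypothetical helper, and because its inputs are already rounded to two decimals, the result may differ from a printed value in the last digit.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; taken as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the baseline test report in this notebook.
print(round(f1_from(0.82, 0.36), 2))  # request, class 1.0 → 0.5
print(round(f1_from(0.62, 0.34), 2))  # related, class 0.0 → 0.44
print(f1_from(0.0, 0.0))              # offer, class 1.0 → 0.0
```

The `offer` row shows why the per-class rows matter: its averaged scores are 0.99 even though the positive class is never predicted correctly.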

In [8]:
y_pred = pipeline.predict(X_test)

In [9]:
classification_results(y_test, y_pred)

#######################:related:#######################
             precision    recall  f1-score   support

        0.0       0.62      0.34      0.44      1508
        1.0       0.82      0.94      0.87      5001
        2.0       0.42      0.22      0.29        45

avg / total       0.77      0.79      0.77      6554

#######################:request:#######################
             precision    recall  f1-score   support

        0.0       0.88      0.98      0.93      5456
        1.0       0.82      0.36      0.50      1098

avg / total       0.87      0.88      0.86      6554

#######################:offer:#######################
             precision    recall  f1-score   support

        0.0       0.99      1.00      1.00      6513
        1.0       0.00      0.00      0.00        41

avg / total       0.99      0.99      0.99      6554

#######################:aid_related:#######################
             precision    recall  f1-score   support

        0.0       0.71


### 6. Improve your model
Use grid search to find better parameters. 

In [10]:
parameters = {'clf__estimator__min_samples_split': [2],
              'clf__estimator__n_estimators': [20, 40]}

cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=-1)
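
The double-underscore keys in the grid follow the nesting of the pipeline: the step name (`clf`), then the wrapper's `estimator` attribute, then the wrapped model's parameter. The sketch below lists the valid keys with `get_params()`. The `demo` pipeline is a stand-in that uses the default tokenizer, so it runs without NLTK.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

demo = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

# step name -> wrapper attribute -> wrapped model parameter
names = sorted(demo.get_params().keys())
print([n for n in names if n.startswith('clf__estimator__n_')])
```

The grid above has one value for `min_samples_split` and two for `n_estimators`, so the search evaluates two candidates, each fitted once per cross-validation fold.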

In [11]:
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'clf__estimator__min_samples_split': [2], 'clf__estimator__n_estimators': [20, 40]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.
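
The reports printed in this notebook list precision, recall, and F1 but have no accuracy row. The sketch below shows one way to get per-category accuracy on hypothetical toy arrays. With the notebook's objects, the equivalent would presumably compare `y_test.values` against `y_pred`.

```python
import numpy as np

# Hypothetical labels: 4 messages, 3 categories.
y_true = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 1]])
y_hat = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 0, 0]])

# Fraction of messages labeled correctly, computed per category column.
per_column_accuracy = (y_true == y_hat).mean(axis=0)
print(per_column_accuracy)
```

Accuracy alone can mislead on rare categories: a column that is always predicted as 0 still scores highly when positives are scarce.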

In [12]:
y_pred = cv.predict(X_train)

In [13]:
classification_results(y_train, y_pred)

#######################:related:#######################
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00      4614
        1.0       1.00      1.00      1.00     14905
        2.0       1.00      0.99      0.99       143

avg / total       1.00      1.00      1.00     19662

#######################:request:#######################
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00     16286
        1.0       1.00      0.99      1.00      3376

avg / total       1.00      1.00      1.00     19662

#######################:offer:#######################
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00     19585
        1.0       1.00      0.95      0.97        77

avg / total       1.00      1.00      1.00     19662

#######################:aid_related:#######################
             precision    recall  f1-score   support

        0.0       1.00

In [14]:
y_pred = cv.predict(X_test)

In [15]:
classification_results(y_test, y_pred)

#######################:related:#######################
             precision    recall  f1-score   support

        0.0       0.72      0.28      0.40      1508
        1.0       0.81      0.97      0.88      5001
        2.0       0.44      0.24      0.31        45

avg / total       0.79      0.80      0.77      6554

#######################:request:#######################
             precision    recall  f1-score   support

        0.0       0.89      0.99      0.94      5456
        1.0       0.88      0.42      0.57      1098

avg / total       0.89      0.89      0.88      6554

#######################:offer:#######################
             precision    recall  f1-score   support

        0.0       0.99      1.00      1.00      6513
        1.0       0.00      0.00      0.00        41

avg / total       0.99      0.99      0.99      6554

#######################:aid_related:#######################
             precision    recall  f1-score   support

        0.0       0.74


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
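
The second idea can be sketched with `FeatureUnion`, which places extra feature columns next to the TF-IDF matrix. This is an illustration, not part of the notebook's pipeline: `MessageLength` is a hypothetical transformer, and message length is only an example feature.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: character length of each message."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One column, one row per message.
        return np.array([[len(text)] for text in X])


features = FeatureUnion([
    ('text', Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
    ('length', MessageLength()),
])

docs = ['need water', 'need food and water']
matrix = features.fit_transform(docs)

# 4 vocabulary columns (and, food, need, water) plus 1 length column.
print(matrix.shape)
```

A classifier step such as `MultiOutputClassifier` could then follow `features` in an outer `Pipeline`, in the same position the `clf` step has above.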

In [16]:
pipeline_ada = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
])
parameters_ada = {
    'clf__estimator__learning_rate': [0.5, 1],
    'clf__estimator__n_estimators': [50, 100]
}

cv_ada = GridSearchCV(pipeline_ada, parameters_ada, n_jobs=-1)

In [17]:
cv_ada.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...mator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'clf__estimator__learning_rate': [0.5, 1], 'clf__estimator__n_estimators': [50, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [18]:
# Best parameters set
cv_ada.best_params_

{'clf__estimator__learning_rate': 0.5, 'clf__estimator__n_estimators': 100}

In [19]:
y_pred = cv_ada.predict(X_train)

In [20]:
classification_results(y_train, y_pred)

#######################:related:#######################
             precision    recall  f1-score   support

        0.0       0.61      0.17      0.26      4614
        1.0       0.79      0.97      0.87     14905
        2.0       0.75      0.27      0.40       143

avg / total       0.74      0.77      0.72     19662

#######################:request:#######################
             precision    recall  f1-score   support

        0.0       0.91      0.98      0.94     16286
        1.0       0.83      0.54      0.65      3376

avg / total       0.90      0.90      0.89     19662

#######################:offer:#######################
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00     19585
        1.0       1.00      0.13      0.23        77

avg / total       1.00      1.00      1.00     19662

#######################:aid_related:#######################
             precision    recall  f1-score   support

        0.0       0.76

In [21]:
y_pred = cv_ada.predict(X_test)

In [22]:
classification_results(y_test, y_pred)

#######################:related:#######################
             precision    recall  f1-score   support

        0.0       0.58      0.17      0.26      1508
        1.0       0.79      0.96      0.87      5001
        2.0       0.33      0.11      0.17        45

avg / total       0.74      0.77      0.72      6554

#######################:request:#######################
             precision    recall  f1-score   support

        0.0       0.91      0.97      0.94      5456
        1.0       0.78      0.53      0.63      1098

avg / total       0.89      0.90      0.89      6554

#######################:offer:#######################
             precision    recall  f1-score   support

        0.0       0.99      1.00      1.00      6513
        1.0       0.00      0.00      0.00        41

avg / total       0.99      0.99      0.99      6554

#######################:aid_related:#######################
             precision    recall  f1-score   support

        0.0       0.74


### 9. Export your model as a pickle file

In [23]:
with open('classifier.pkl', 'wb') as file:
    pickle.dump(cv, file)
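
Loading the model back is the mirror image of the dump. The sketch below round-trips a hypothetical stand-in object through a temporary file. With the real model, note that pickle stores functions by reference, so a pipeline built with a custom `tokenize` needs that function to be importable wherever the file is loaded.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model; any picklable object works.
model = {'name': 'demo-model', 'params': {'n_estimators': 40}}

path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')

with open(path, 'wb') as file:
    pickle.dump(model, file)

# Read the bytes back and rebuild the object.
with open(path, 'rb') as file:
    restored = pickle.load(file)

print(restored == model)
# → True
```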

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
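
The template file is not shown here, so the following is only a possible skeleton for the script's command-line entry point. The argument names `database_filepath` and `model_filepath` are assumptions, and the step list is a placeholder for the notebook's load, build, fit, evaluate, and export cells.

```python
import argparse


def parse_args(argv=None):
    """Parse the two paths the script needs (argument names are hypothetical)."""
    parser = argparse.ArgumentParser(description='Train and export a classifier.')
    parser.add_argument('database_filepath', help='SQLite database to read from')
    parser.add_argument('model_filepath', help='where to write the pickled model')
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder steps mirroring the notebook cells.
    steps = ['load data', 'build pipeline', 'fit', 'evaluate', 'export']
    print(f'{args.database_filepath} -> {args.model_filepath}: {", ".join(steps)}')
    return args


# Example invocation with explicit arguments instead of sys.argv.
example_args = main(['disaster_response.db', 'classifier.pkl'])
```

In a real script the final line would typically be replaced by a call to `main()` under an `if __name__ == '__main__':` guard, so the paths come from the command line.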