# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [88]:
# import libraries
import re

import nltk
import pandas as pd
import sqlalchemy
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline

# download the NLTK resources used for tokenization and lemmatization
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [89]:
# load data from database
engine = sqlalchemy.create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('InsertTableName', engine)
X = df.message
Y = df.drop(columns = ["id","message", "genre", "original"])

In [90]:
Y.columns

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')
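
The loading step can be tried without the project database. The sketch below is a stand-in built on assumptions: it writes a hypothetical three-row `messages` table to an in-memory SQLite database, reads it back with `pd.read_sql_query` over a plain `sqlite3` connection (swapped in for `read_sql_table` so that SQLAlchemy is not required), and then separates features from targets the same way as the cell above. The table name, the rows, and every `toy_` name are made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical miniature of the messages table (not the real dataset)
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["we need water", "roads are flooded", "thank you"],
    "original": ["orig 1", "orig 2", "orig 3"],
    "genre": ["direct", "news", "social"],
    "water": [1, 0, 0],
    "floods": [0, 1, 0],
})

# Write the frame to an in-memory database, then read it back
conn = sqlite3.connect(":memory:")
toy_df.to_sql("messages", conn, index=False)
loaded = pd.read_sql_query("SELECT * FROM messages", conn)
conn.close()

# Features are the raw text; targets are every remaining category column
toy_X = loaded["message"]
toy_Y = loaded.drop(columns=["id", "message", "genre", "original"])
print(toy_Y.columns.tolist())
```

Dropping the identifier and text columns leaves only the category indicators, which is what makes `Y` usable as a multi-output target.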

### 2. Write a tokenization function to process your text data

In [91]:
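def tokenize(text):
    """Split a message into lowercase, lemmatized word tokens."""
    # Replace every non-alphanumeric character (punctuation, symbols) with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        # Lemmatize, then normalize case and strip surrounding whitespace
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens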
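
To see what the regular-expression step contributes on its own, here is a simplified sketch that runs without any NLTK data. It is not the `tokenize` function above: the hypothetical helper `normalize_and_split` replaces `word_tokenize` with a plain whitespace split and skips lemmatization entirely.

```python
import re


def normalize_and_split(text):
    """Hypothetical simplified stand-in for tokenize(): regex cleanup plus whitespace split."""
    # Same substitution as in tokenize(): non-alphanumeric characters become spaces
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Lowercase and split on runs of whitespace (no lemmatization here)
    return cleaned.lower().split()


sample_tokens = normalize_and_split("Help! We need water, food & shelter.")
print(sample_tokens)
```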


In [92]:
# perform train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
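
As a quick check of what `test_size=0.33` and `random_state=42` do, the sketch below splits 100 made-up rows. The `toy_` and `part_` names are illustrative only.

```python
from sklearn.model_selection import train_test_split

# 100 hypothetical samples with alternating labels
toy_inputs = list(range(100))
toy_targets = [i % 2 for i in toy_inputs]

part_train, part_test, lab_train, lab_test = train_test_split(
    toy_inputs, toy_targets, test_size=0.33, random_state=42
)

# Repeating the call with the same seed reproduces the same partition
again_train, again_test, _, _ = train_test_split(
    toy_inputs, toy_targets, test_size=0.33, random_state=42
)

print(len(part_train), len(part_test))
print(part_test == again_test)
```

Fixing `random_state` is what keeps the two splits in this notebook identical, so the scores of the manual workflow and the pipeline are computed on the same test rows.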

In [93]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier
# Instantiate transformers and classifier
vect = CountVectorizer(tokenizer=tokenize)
tfidf = TfidfTransformer()
# Alternative classifier: MultiOutputClassifier(KNeighborsClassifier())
clf = MultiOutputClassifier(RandomForestClassifier())

# Fit and/or transform each to the training data
# Hint: you can use the fit_transform method
X_train_counts = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)


# Fit or train the classifier
clf.fit(X_train_tfidf, y_train)

MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1)

In [94]:
# Transform test data
X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)

# Predict test labels
y_pred = clf.predict(X_test_tfidf)

In [95]:
target_names = Y.columns.values
print(target_names)

['related' 'request' 'offer' 'aid_related' 'medical_help'
 'medical_products' 'search_and_rescue' 'security' 'military' 'child_alone'
 'water' 'food' 'shelter' 'clothing' 'money' 'missing_people' 'refugees'
 'death' 'other_aid' 'infrastructure_related' 'transport' 'buildings'
 'electricity' 'tools' 'hospitals' 'shops' 'aid_centers'
 'other_infrastructure' 'weather_related' 'floods' 'storm' 'fire'
 'earthquake' 'cold' 'other_weather' 'direct_report']


In [96]:
from sklearn.utils.multiclass import type_of_target
type_of_target(y_test)
type_of_target(y_pred)

'multiclass-multioutput'
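
The result is `'multiclass-multioutput'` rather than `'multilabel-indicator'`, presumably because at least one target column holds more than two distinct values (the `related` report further down lists a class `2`). A minimal sketch on made-up arrays shows the difference:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Every column is strictly 0/1: a multilabel indicator matrix
binary_targets = np.array([[0, 1], [1, 1], [0, 0]])

# One column also contains the value 2: no longer a pure indicator matrix
with_extra_class = np.array([[0, 1], [2, 1], [0, 0]])

binary_kind = type_of_target(binary_targets)
extra_kind = type_of_target(with_extra_class)
print(binary_kind, extra_kind)
```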

In [97]:
from sklearn.metrics import classification_report

# Print a precision/recall/F1 report for each target column
for idx, column in enumerate(Y.columns.values):
    print(column, classification_report(y_test.values[:, idx], y_pred[:, idx]))


related              precision    recall  f1-score   support

          0       0.62      0.36      0.46      2004
          1       0.82      0.93      0.87      6567
          2       0.82      0.13      0.23        68

avg / total       0.78      0.79      0.77      8639

request              precision    recall  f1-score   support

          0       0.89      0.98      0.93      7177
          1       0.83      0.38      0.52      1462

avg / total       0.88      0.88      0.86      8639

offer              precision    recall  f1-score   support

          0       1.00      1.00      1.00      8598
          1       0.00      0.00      0.00        41

avg / total       0.99      1.00      0.99      8639

aid_related              precision    recall  f1-score   support

          0       0.74      0.88      0.80      5070
          1       0.76      0.55      0.64      3569

avg / total       0.75      0.74      0.73      8639

medical_help              precision    recall  f1-sco



### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [98]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
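
To make the input and output shapes concrete, the following sketch fits the same three-step structure on four made-up messages with two hypothetical targets. It uses the default `CountVectorizer` tokenizer instead of `tokenize` so it runs without NLTK data, and the fixed `random_state` is an addition for repeatability.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

toy_messages = [
    "we need water and food",
    "water supply is gone",
    "our house collapsed we need shelter",
    "looking for shelter tonight",
]
# Two hypothetical binary targets per message: [water, shelter]
toy_labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

# Raw strings go in; one row per message and one column per target come out
toy_pred = toy_pipeline.predict(["please send water", "need shelter"])
print(toy_pred.shape)
```

Because the vectorizer and TF-IDF steps live inside the pipeline, `predict` accepts raw text directly and there is no separate `transform` call to keep in sync.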

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [99]:
# perform train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [100]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [101]:
from sklearn.metrics import classification_report

y_pred = pipeline.predict(X_test)

# Print a precision/recall/F1 report for each target column
for idx, column in enumerate(Y.columns.values):
    print(column, classification_report(y_test.values[:, idx], y_pred[:, idx]))

related              precision    recall  f1-score   support

          0       0.64      0.38      0.48      2004
          1       0.83      0.93      0.88      6567
          2       0.75      0.22      0.34        68

avg / total       0.78      0.80      0.78      8639

request              precision    recall  f1-score   support

          0       0.89      0.98      0.93      7177
          1       0.83      0.37      0.52      1462

avg / total       0.88      0.88      0.86      8639

offer              precision    recall  f1-score   support

          0       1.00      1.00      1.00      8598
          1       0.00      0.00      0.00        41

avg / total       0.99      1.00      0.99      8639

aid_related              precision    recall  f1-score   support

          0       0.73      0.87      0.79      5070
          1       0.75      0.53      0.62      3569

avg / total       0.73      0.73      0.72      8639

medical_help              precision    recall  f1-sco



In [102]:
# Per-category accuracy: share of test rows where the prediction matches the label
accuracy = (y_pred == y_test).mean()
accuracy

related                   0.801713
request                   0.881468
offer                     0.995254
aid_related               0.731682
medical_help              0.920477
medical_products          0.952772
search_and_rescue         0.972682
security                  0.981711
military                  0.970483
child_alone               1.000000
water                     0.949415
food                      0.930779
shelter                   0.922213
clothing                  0.986110
money                     0.979164
missing_people            0.989698
refugees                  0.963885
death                     0.959370
other_aid                 0.866998
infrastructure_related    0.931474
transport                 0.957981
buildings                 0.948142
electricity               0.980553
tools                     0.994212
hospitals                 0.988425
shops                     0.993749
aid_centers               0.987383
other_infrastructure      0.954740
weather_related     
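
High accuracy on a rare category can be misleading. The `offer` report above lists 8598 negative and 41 positive test rows with a recall of 0.00 for the positive class. The sketch below rebuilds that situation from the support counts alone, assuming a classifier that always predicts 0, and shows that such a classifier reaches the same 0.995254 accuracy.

```python
# Support counts taken from the 'offer' classification report
n_negative = 8598
n_positive = 41

true_labels = [0] * n_negative + [1] * n_positive
# Assumed behaviour: the classifier never predicts the positive class
always_zero = [0] * len(true_labels)

matches = sum(t == p for t, p in zip(true_labels, always_zero))
naive_accuracy = matches / len(true_labels)

true_positives = sum(t == 1 and p == 1 for t, p in zip(true_labels, always_zero))
positive_recall = true_positives / n_positive

print(round(naive_accuracy, 6), positive_recall)
```

For imbalanced categories like this one, per-class recall and F1 say more about the model than accuracy does.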

### 6. Improve your model
Use grid search to find better parameters. 

In [103]:
parameters = {
    # Other candidate parameters (currently disabled):
    # 'vect__ngram_range': ((1, 1), (1, 2)),
    # 'clf__estimator__min_samples_split': [2, 3],
    'clf__estimator__n_estimators': [20, 50],
}

cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__n_estimators': [20, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
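
The grid keys follow the `step__parameter` naming convention, with one extra `estimator__` level because the random forest is wrapped in `MultiOutputClassifier`. A small sketch, using a pipeline with the default tokenizer as a stand-in, lists the names that `get_params` exposes:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

# Every key returned here is a valid key for a GridSearchCV param_grid
param_names = sorted(demo_pipeline.get_params().keys())
forest_params = [name for name in param_names
                 if name.startswith("clf__estimator__")]
print(forest_params)
```

Printing the available names before writing the grid is a cheap way to catch a misspelled key, which otherwise only fails once `fit` is called.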

In [104]:
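# Note: grid_scores_ was removed in scikit-learn 0.20; newer versions expose the same information via cv_results_
cv.grid_scores_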



[mean: 0.22819, std: 0.00601, params: {'clf__estimator__n_estimators': 20},
 mean: 0.23617, std: 0.00306, params: {'clf__estimator__n_estimators': 50}]
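
The cross-validated means of about 0.23 look far below the per-category accuracies of 0.73 to 1.00 reported earlier. A likely explanation, assuming the default `scoring=None` shown in the `GridSearchCV` output, is that the search uses the classifier's own `score` method, which for `MultiOutputClassifier` counts a row as correct only when every one of its labels is predicted correctly. The sketch below contrasts the two measures on made-up arrays.

```python
import numpy as np

# Hypothetical labels and predictions: 4 rows, 3 categories
demo_true = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [1, 1, 0],
                      [0, 0, 1]])
demo_pred = np.array([[1, 0, 0],
                      [0, 0, 0],
                      [1, 1, 1],
                      [0, 0, 1]])

# Accuracy computed separately for each category (column)
per_label_accuracy = (demo_true == demo_pred).mean(axis=0)

# Exact-match accuracy: a row counts only if all of its labels are right
exact_match_accuracy = np.all(demo_true == demo_pred, axis=1).mean()

print(per_label_accuracy, exact_match_accuracy)
```

With 36 categories, a single wrong label per row is enough to pull the exact-match score down, even when each category on its own is predicted well.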

In [107]:
cv.cv_results_



{'mean_fit_time': array([  54.10484139,  129.32314825]),
 'std_fit_time': array([ 1.09324376,  0.90310958]),
 'mean_score_time': array([ 4.70520314,  8.16383251]),
 'std_score_time': array([ 0.07009158,  0.17862018]),
 'param_clf__estimator__n_estimators': masked_array(data = [20 50],
              mask = [False False],
        fill_value = ?),
 'params': [{'clf__estimator__n_estimators': 20},
  {'clf__estimator__n_estimators': 50}],
 'split0_test_score': array([ 0.22083476,  0.23195347]),
 'split1_test_score': array([ 0.22819022,  0.2374273 ]),
 'split2_test_score': array([ 0.23554567,  0.23913787]),
 'mean_test_score': array([ 0.22819022,  0.23617288]),
 'std_test_score': array([ 0.00600571,  0.00306421]),
 'rank_test_score': array([2, 1], dtype=int32),
 'split0_train_score': array([ 0.91421485,  0.9868286 ]),
 'split1_train_score': array([ 0.91507013,  0.98546014]),
 'split2_train_score': array([ 0.91515566,  0.9844338 ]),
 'mean_train_score': array([ 0.91481355,  0.98557418]),
 'st

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [105]:
y_pred = cv.predict(X_test)
accuracy = (y_pred == y_test).mean()
accuracy

related                   0.808890
request                   0.889455
offer                     0.995254
aid_related               0.766524
medical_help              0.921403
medical_products          0.953583
search_and_rescue         0.972451
security                  0.981711
military                  0.970251
child_alone               1.000000
water                     0.949878
food                      0.926149
shelter                   0.925454
clothing                  0.986920
money                     0.979396
missing_people            0.989698
refugees                  0.963653
death                     0.956708
other_aid                 0.868735
infrastructure_related    0.931358
transport                 0.959255
buildings                 0.948721
electricity               0.980669
tools                     0.994212
hospitals                 0.988425
shops                     0.993749
aid_centers               0.987267
other_infrastructure      0.954740
weather_related     

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [109]:
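class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Add the number of tokens in each message as a single numeric feature."""

    def count_length(self, text):
        tokens = word_tokenize(text)
        return len(tokens)

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the training data
        return self

    def transform(self, X):
        lengths = pd.Series(X).apply(self.count_length)
        # Return a single-column frame so FeatureUnion can stack it next to TF-IDF
        return pd.DataFrame(lengths)
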
pipeline = Pipeline([
    ('features', FeatureUnion([

        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),

        ('count_length', TextLengthExtractor())
    ])),

    ('clf', MultiOutputClassifier(KNeighborsClassifier()))
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

accuracy = (y_pred == y_test).mean()
accuracy

related                   0.774742
request                   0.854034
offer                     0.995254
aid_related               0.655979
medical_help              0.915847
medical_products          0.953467
search_and_rescue         0.971409
security                  0.981595
military                  0.969672
child_alone               1.000000
water                     0.936104
food                      0.889108
shelter                   0.905776
clothing                  0.986225
money                     0.978470
missing_people            0.989351
refugees                  0.962727
death                     0.954624
other_aid                 0.850214
infrastructure_related    0.928464
transport                 0.955319
buildings                 0.943396
electricity               0.980090
tools                     0.994212
hospitals                 0.988309
shops                     0.993749
aid_centers               0.987036
other_infrastructure      0.953467
weather_related     
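
To inspect what a `FeatureUnion` with a custom transformer produces, the sketch below combines TF-IDF features with a word count on two made-up messages. `WordCountExtractor` is a hypothetical simplification of `TextLengthExtractor` that splits on whitespace so it runs without NLTK data, and `TfidfVectorizer` stands in for the `CountVectorizer` plus `TfidfTransformer` pair.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class WordCountExtractor(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: number of whitespace-separated words per text."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per text, one column holding the word count
        return np.array([[len(str(text).split())] for text in X])


docs = ["we need water", "shelter needed now please"]
union = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("word_count", WordCountExtractor()),
])
features = union.fit_transform(docs)

# 7 distinct words give 7 TF-IDF columns, plus 1 word-count column
print(features.shape)
print(features[:, -1].toarray().ravel())
```

Note that the count column is on a different scale than the TF-IDF values, which lie between 0 and 1. With a distance-based model such as `KNeighborsClassifier`, an unscaled count can dominate the distance computation, so scaling that column may be worth trying.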

### 9. Export your model as a pickle file

In [110]:
import pickle

# Export the final model as a pickle file; the context manager closes the file handle
with open("classifier.pkl", "wb") as f:
    pickle.dump(cv.best_estimator_, f)
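
A round trip is a simple way to confirm that an exported file can be loaded again. The sketch below uses a plain dictionary as a hypothetical stand-in for the fitted model and a temporary directory so that no file is left behind.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model
stand_in_model = {"name": "toy model", "n_estimators": 50}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "classifier.pkl")

    # Write the object, then read it back from the same path
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

print(restored_model == stand_in_model)
```

When loading the real pipeline in another script, keep in mind that pickle stores a reference to the `tokenize` function rather than its code, so a function with the same name must be importable from the same module path at load time.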

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
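
The template itself is not reproduced here, so the following is only a sketch of how such a script might read its inputs from the command line. The function name `parse_args` and both argument names are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical command-line interface for a training script."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath",
                        help="path to the SQLite database with the cleaned messages")
    parser.add_argument("model_filepath",
                        help="path where the pickled model will be written")
    return parser.parse_args(argv)


# Passing a list makes the parser easy to try without a real command line
demo_args = parse_args(["messages.db", "classifier.pkl"])
print(demo_args.database_filepath, demo_args.model_filepath)
```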