# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [2]:
# import libraries
from sqlalchemy import create_engine
import sqlite3
import pandas as pd
import numpy as np
import re
import pickle

import nltk
nltk.download(['punkt', 'wordnet','stopwords'])
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, classification_report, make_scorer, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\zaplu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\zaplu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\zaplu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# load data from database
engine = create_engine('sqlite:///disaster_response.db')
df = pd.read_sql_table('messages', engine)

# Define feature and target variables X and Y
X = df['message']
Y = df.iloc[:,4:]

In [4]:
X.shape

(26180,)

In [5]:
Y.shape

(26180, 35)

In [6]:
Y.columns.values

array(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'water', 'food', 'shelter', 'clothing', 'money', 'missing_people',
       'refugees', 'death', 'other_aid', 'infrastructure_related',
       'transport', 'buildings', 'electricity', 'tools', 'hospitals',
       'shops', 'aid_centers', 'other_infrastructure', 'weather_related',
       'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather',
       'direct_report'], dtype=object)
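
Before modelling, it can help to check how many positive examples each category has. The following is a minimal sketch on a toy label matrix (`Y_toy` and its values are made up for illustration; the real `Y` has 35 columns), assuming every label is coded as 0/1.

```python
import pandas as pd

# Toy label matrix standing in for Y (hypothetical values).
Y_toy = pd.DataFrame({
    "related": [1, 1, 1, 0],
    "request": [0, 1, 0, 0],
    "offer":   [0, 0, 0, 0],
})

# For 0/1 labels the column mean equals the share of positive examples.
positive_rate = Y_toy.mean().sort_values()
print(positive_rate)

# Categories without any positive example give a classifier nothing to learn from.
empty_labels = positive_rate[positive_rate == 0].index.tolist()
print(empty_labels)
```

Running the same two lines on the real `Y` would show which categories are rare enough to deserve special attention during evaluation.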

### 2. Write a tokenization function to process your text data

In [7]:
def tokenize(text):
    #Transform the text to lowercase plus remove all characters that are not letters or numbers
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
    
    #Tokenize the text to words
    tokens = word_tokenize(text)
    
    #Remove stopwords and whitespace around the words (build the stopword set once, not once per token)
    stop_words = set(stopwords.words('english'))
    tokens_subset = [v.strip() for v in tokens if v.strip() not in stop_words]
    
    #Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    lemmed = [lemmatizer.lemmatize(w) for w in tokens_subset]
    
    return lemmed

In [8]:
tokenize(X[10])

['nothing', 'eat', 'water', 'starving', 'thirsty']
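
The normalisation and stopword steps of `tokenize` can be tried without any NLTK downloads. The sketch below is a simplified stand-in, not the function above: `simple_tokenize` and the tiny `STOPWORDS` set are hypothetical, whitespace splitting replaces `word_tokenize`, and lemmatization is left out.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOPWORDS = frozenset({"we", "are", "and", "the", "to", "is", "no"})

def simple_tokenize(text):
    # Lowercase, then replace every character that is not a letter or digit with a space.
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Split on whitespace and drop stopwords.
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

tokens = simple_tokenize("We are starving, and there is NO water!")
print(tokens)  # ['starving', 'there', 'water']
```

Note that "there" survives only because it is missing from the toy stopword set.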

In [9]:
#Explore the feature variable - message lengths
ls = X.str.len()

In [10]:
#75th percentile of message length per category: messages without the label (0) vs. with the label (1)
for col in Y.columns:
    print(col, ls[df[col]==0].quantile(0.75),ls[df[col]==1].quantile(0.75))

related 148.0 186.0
request 185.0 150.0
offer 179.0 200.5
aid_related 163.0 197.0
medical_help 174.0 219.0
medical_products 176.0 221.0
search_and_rescue 178.0 204.0
security 178.0 207.5
military 176.0 237.0
water 177.0 203.0
food 178.0 184.0
shelter 176.0 202.0
clothing 178.0 210.0
money 177.0 229.5
missing_people 178.0 199.0
refugees 176.0 229.5
death 176.0 216.25
other_aid 176.0 191.0
infrastructure_related 175.0 222.0
transport 176.0 214.0
buildings 177.0 211.0
electricity 178.0 215.5
tools 178.0 243.0
hospitals 178.0 228.0
shops 178.0 227.0
aid_centers 178.0 221.0
other_infrastructure 176.0 223.0
weather_related 169.0 201.0
floods 173.0 224.0
storm 174.0 212.25
fire 178.0 217.75
earthquake 179.0 171.0
cold 178.0 215.0
other_weather 176.0 218.0
direct_report 185.0 153.0


### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 35 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [11]:
#Build a machine learning pipeline 
pipeline = Pipeline([
    ('count', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))   
])
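
To see what `MultiOutputClassifier` does inside such a pipeline, here is a minimal sketch on toy data. The messages, the two label columns and the forest settings are made up for illustration, and the default `CountVectorizer` tokenizer is used so the block runs without NLTK.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy messages and labels.
toy_messages = [
    "we need water",
    "need food and water",
    "storm destroyed houses",
    "flood and storm warning",
    "please send food",
    "heavy storm tonight",
]
toy_labels = pd.DataFrame({
    "aid_related":     [1, 1, 0, 0, 1, 0],
    "weather_related": [0, 0, 1, 1, 0, 1],
})

toy_pipeline = Pipeline([
    ("count", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

# One independent forest is fitted per target column.
fitted = toy_pipeline.named_steps["clf"].estimators_
toy_pred = toy_pipeline.predict(["need water now"])
print(len(fitted), toy_pred.shape)
```

Because each target column gets its own estimator, training cost grows roughly linearly with the number of categories.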

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25)
pipeline.fit(X_train,y_train)
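
The support counts in the reports further down differ from run to run (for example 1487, 1586 and 1495 negatives for `related`), which suggests the split was redrawn between runs. A minimal sketch of a reproducible split, assuming a fixed `random_state` is acceptable (the value 42 and the toy data are arbitrary):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and Y.
X_toy = pd.Series([f"message {i}" for i in range(20)])
Y_toy = pd.DataFrame({"related": [i % 2 for i in range(20)]})

# Same random_state -> same rows end up in the test set on every call.
split_a = train_test_split(X_toy, Y_toy, test_size=0.25, random_state=42)
split_b = train_test_split(X_toy, Y_toy, test_size=0.25, random_state=42)

same_test_rows = split_a[1].index.equals(split_b[1].index)
print(same_test_rows)
```

With a fixed seed, scores from different models are computed on the same test rows and can be compared directly.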

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
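
Printing one report per column produces a lot of output. One compact alternative, sketched here on toy data, is to collect the positive-class metrics of every category in a single table with `precision_recall_fscore_support`; the column names and values below are hypothetical.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Toy true and predicted labels for two categories.
y_true_toy = pd.DataFrame({"water": [1, 1, 0, 0], "food": [0, 1, 1, 1]})
y_pred_toy = pd.DataFrame({"water": [1, 0, 0, 0], "food": [1, 1, 1, 0]})

rows = []
for col in y_true_toy.columns:
    # average="binary" reports the metrics of the positive class (label 1) only.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true_toy[col], y_pred_toy[col], average="binary"
    )
    rows.append({"category": col, "precision": precision, "recall": recall, "f1": f1})

summary = pd.DataFrame(rows).set_index("category")
print(summary.round(2))
```

Sorting `summary` by `f1` makes the weakest categories easy to spot.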

In [18]:
y_pred = pd.DataFrame(pipeline.predict(X_test), columns = y_test.columns)

In [21]:
#Combined f1_score
f1_score(y_test, y_pred, average='micro')

0.6081393282388483

In [15]:
#Binary classification metrics per output category
for col in y_test.columns:
    print(col)
    print(classification_report(y_test[col],y_pred[col]))

related
              precision    recall  f1-score   support

           0       0.61      0.48      0.54      1487
           1       0.86      0.91      0.88      5058

   micro avg       0.81      0.81      0.81      6545
   macro avg       0.73      0.70      0.71      6545
weighted avg       0.80      0.81      0.80      6545

request
              precision    recall  f1-score   support

           0       0.90      0.97      0.93      5454
           1       0.77      0.45      0.57      1091

   micro avg       0.89      0.89      0.89      6545
   macro avg       0.83      0.71      0.75      6545
weighted avg       0.88      0.89      0.87      6545

offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6515
           1       0.00      0.00      0.00        30

   micro avg       0.99      0.99      0.99      6545
   macro avg       0.50      0.50      0.50      6545
weighted avg       0.99      0.99      0.99      654

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      6418
           1       0.75      0.12      0.20       127

   micro avg       0.98      0.98      0.98      6545
   macro avg       0.87      0.56      0.60      6545
weighted avg       0.98      0.98      0.98      6545

other_weather
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      6196
           1       0.50      0.06      0.11       349

   micro avg       0.95      0.95      0.95      6545
   macro avg       0.72      0.53      0.54      6545
weighted avg       0.93      0.95      0.93      6545

direct_report
              precision    recall  f1-score   support

           0       0.86      0.97      0.91      5315
           1       0.72      0.32      0.44      1230

   micro avg       0.85      0.85      0.85      6545
   macro avg       0.79      0.64      0.68      6545
weighted avg       0.83      0.85      0.82   



### 6. Improve your model
Use grid search to find better parameters. 

In [13]:
#Parameters of the model
pipeline.get_params()

{'memory': None,
 'steps': [('count',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x000001D74972D0D0>,
           vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
               oob_score=Fa

In [14]:
#Grid Search
parameters = {'count__max_df': [0.75, 1.0], #0.75
              'count__ngram_range': [(1,1),(1,2)], #(1,1)
              #'count__max_features' : [100,200],
              'tfidf__smooth_idf':[True, False], #True
              #'clf__estimator__max_depth': [None,4,8],
              'clf__estimator__min_samples_split': [2, 10, 50], #10
              'clf__estimator__n_estimators': [10, 50] #50
             }

#Instead of averaging the f1 scores achieved on the individual categories,
#we run the grid search against the global f1_score, which pools the true positives, false negatives and false positives of all categories.
#This is achieved by defining a custom scorer with average='micro'.
#As a result, more emphasis is put on the categories with more positive labels in the testing set.
#We do not consider this a major concern, since we ultimately care about the overall performance
#rather than a high score in every category. Many categories are highly imbalanced and have very few positive labels,
#so we do not want poor performance on one such category to offset good performance on a more important category.

total_scorer = make_scorer(f1_score, average = 'micro')
cv = GridSearchCV(pipeline, param_grid = parameters, scoring = total_scorer, verbose = 3, n_jobs = 6)
cv.fit(X_train,y_train)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.


Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=6)]: Done  20 tasks      | elapsed: 49.0min
[Parallel(n_jobs=6)]: Done 116 tasks      | elapsed: 446.3min
[Parallel(n_jobs=6)]: Done 144 out of 144 | elapsed: 518.4min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('count', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri..._score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=None))]),
       fit_params=None, iid='warn', n_jobs=6,
       param_grid={'count__max_df': [0.75, 1.0], 'count__ngram_range': [(1, 1), (1, 2)], 'tfidf__smooth_idf': [True, False], 'clf__estimator__min_samples_split': [2, 10, 50], 'clf__estimator__n_estimators': [10, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=make_scorer(f1_score, average=micro), verbose=3)
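
The "48 candidates, totalling 144 fits" in the log follows directly from the grid: 2 x 2 x 2 x 3 x 2 combinations, each fitted on 3 folds. A short sketch that checks this count with `ParameterGrid` before starting a long search (the fold count is taken from the log line above):

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "count__max_df": [0.75, 1.0],
    "count__ngram_range": [(1, 1), (1, 2)],
    "tfidf__smooth_idf": [True, False],
    "clf__estimator__min_samples_split": [2, 10, 50],
    "clf__estimator__n_estimators": [10, 50],
}

# ParameterGrid enumerates every combination that GridSearchCV would try.
n_candidates = len(ParameterGrid(parameters))
n_folds = 3  # as reported in the log: "Fitting 3 folds for each of 48 candidates"
print(n_candidates, n_candidates * n_folds)  # 48 144
```

Since this search took more than eight hours, counting the fits first is a cheap way to decide whether the grid should be trimmed.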

In [15]:
pd.DataFrame(cv.cv_results_).sort_values(by=['rank_test_score']).iloc[:50,1:]



index,std_fit_time,mean_score_time,std_score_time,param_clf__estimator__min_samples_split,param_clf__estimator__n_estimators,param_count__max_df,param_count__ngram_range,param_tfidf__smooth_idf,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
24,13.790186,132.170944,9.292411,10,50,0.75,"(1, 1)",True,"{'clf__estimator__min_samples_split': 10, 'clf...",0.646104,0.647428,0.643834,0.645789,0.001484,1,0.946455,0.94573,0.946411,0.946199,0.000332
29,130.070811,140.338393,29.341547,10,50,1.0,"(1, 1)",False,"{'clf__estimator__min_samples_split': 10, 'clf...",0.645896,0.643297,0.644738,0.644644,0.001063,2,0.945473,0.946781,0.947829,0.946694,0.000964
28,25.810716,173.954488,10.83889,10,50,1.0,"(1, 1)",True,"{'clf__estimator__min_samples_split': 10, 'clf...",0.642513,0.644622,0.645628,0.644255,0.001298,3,0.946024,0.945802,0.947368,0.946398,0.000692
25,8.702223,114.033495,5.106103,10,50,0.75,"(1, 1)",False,"{'clf__estimator__min_samples_split': 10, 'clf...",0.644768,0.644787,0.642521,0.644025,0.001064,4,0.94637,0.946305,0.948117,0.946931,0.000839
45,59.576437,157.021001,20.917853,50,50,1.0,"(1, 1)",False,"{'clf__estimator__min_samples_split': 50, 'clf...",0.646768,0.6375,0.641883,0.64205,0.003785,5,0.882716,0.881446,0.882753,0.882305,0.000608
41,31.162243,113.612712,2.081775,50,50,0.75,"(1, 1)",False,"{'clf__estimator__min_samples_split': 50, 'clf...",0.64318,0.640166,0.641122,0.641489,0.001258,6,0.882125,0.881156,0.879875,0.881052,0.000921
40,6.189027,117.861618,2.680167,50,50,0.75,"(1, 1)",True,"{'clf__estimator__min_samples_split': 50, 'clf...",0.643297,0.639647,0.641343,0.641429,0.001491,7,0.881258,0.880984,0.88109,0.881111,0.000113
44,7.213611,197.770372,10.70874,50,50,1.0,"(1, 1)",True,"{'clf__estimator__min_samples_split': 50, 'clf...",0.640349,0.640589,0.6402,0.640379,0.00016,8,0.882277,0.881384,0.881425,0.881695,0.000412
13,2.194074,104.414408,4.075895,2,50,1.0,"(1, 1)",False,"{'clf__estimator__min_samples_split': 2, 'clf_...",0.635859,0.634178,0.635888,0.635308,0.0008,9,0.996328,0.996472,0.996516,0.996439,8e-05
9,7.382524,239.590553,19.331161,2,50,0.75,"(1, 1)",False,"{'clf__estimator__min_samples_split': 2, 'clf_...",0.638198,0.633463,0.633895,0.635185,0.002137,10,0.996159,0.996726,0.996563,0.996483,0.000238
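
The table also allows a comparison of training and validation scores. The sketch below copies three rows from the results above (ranks 1, 5 and 9) and computes the gap between `mean_train_score` and `mean_test_score`. On recent scikit-learn versions the train-score columns are, to my knowledge, only present when `GridSearchCV` is created with `return_train_score=True`.

```python
import pandas as pd

# Three rows copied from the grid search results above.
results = pd.DataFrame({
    "min_samples_split": [10, 50, 2],
    "mean_test_score": [0.645789, 0.642050, 0.635308],
    "mean_train_score": [0.946199, 0.882305, 0.996439],
})

# A large gap means the model fits the training folds far better than unseen data.
results["train_test_gap"] = results["mean_train_score"] - results["mean_test_score"]
print(results.sort_values("train_test_gap"))

smallest_gap_setting = results.loc[results["train_test_gap"].idxmin(), "min_samples_split"]
print(smallest_gap_setting)
```

In these rows, larger values of `min_samples_split` narrow the gap while the validation score changes only slightly.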


In [16]:
#Tuned model
best_model = cv.best_estimator_

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [24]:
total_scorer(best_model, X_test, y_test)

0.653718058298211

In [25]:
y_pred_tuned = pd.DataFrame(best_model.predict(X_test), columns = y_test.columns)
for col in y_test.columns:
    print(col)
    print(classification_report(y_test[col],y_pred_tuned[col]))

related
              precision    recall  f1-score   support

           0       0.74      0.41      0.53      1586
           1       0.84      0.95      0.89      4959

   micro avg       0.82      0.82      0.82      6545
   macro avg       0.79      0.68      0.71      6545
weighted avg       0.81      0.82      0.80      6545

request
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      5442
           1       0.83      0.50      0.62      1103

   micro avg       0.90      0.90      0.90      6545
   macro avg       0.87      0.74      0.78      6545
weighted avg       0.89      0.90      0.89      6545

offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6515
           1       0.00      0.00      0.00        30

   micro avg       1.00      1.00      1.00      6545
   macro avg       0.50      0.50      0.50      6545
weighted avg       0.99      1.00      0.99      654



              precision    recall  f1-score   support

           0       0.94      0.99      0.97      5994
           1       0.84      0.36      0.50       551

   micro avg       0.94      0.94      0.94      6545
   macro avg       0.89      0.68      0.73      6545
weighted avg       0.94      0.94      0.93      6545

clothing
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6464
           1       0.69      0.22      0.34        81

   micro avg       0.99      0.99      0.99      6545
   macro avg       0.84      0.61      0.67      6545
weighted avg       0.99      0.99      0.99      6545

money
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      6400
           1       1.00      0.02      0.04       145

   micro avg       0.98      0.98      0.98      6545
   macro avg       0.99      0.51      0.51      6545
weighted avg       0.98      0.98      0.97      6545

miss

In [26]:
f1_score(y_test, y_pred_tuned, average = 'macro')



0.26743351618105116

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [145]:
#Try GradientBoostingClassifier instead of RandomForestClassifier
pipeline2 = Pipeline([
    ('count', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(GradientBoostingClassifier()))   
])

pipeline2.fit(X_train,y_train)

y_pred2 = pd.DataFrame(pipeline2.predict(X_test), columns = y_test.columns)

for col in y_test.columns:
    print(col)
    print(classification_report(y_test[col],y_pred2[col]))

related
             precision    recall  f1-score   support

          0       0.74      0.17      0.28      1495
          1       0.80      0.98      0.88      5050

avg / total       0.79      0.80      0.74      6545

request
             precision    recall  f1-score   support

          0       0.91      0.98      0.94      5490
          1       0.82      0.48      0.61      1055

avg / total       0.89      0.90      0.89      6545

offer
             precision    recall  f1-score   support

          0       0.99      1.00      1.00      6507
          1       0.00      0.00      0.00        38

avg / total       0.99      0.99      0.99      6545

aid_related
             precision    recall  f1-score   support

          0       0.73      0.89      0.80      3828
          1       0.78      0.54      0.64      2717

avg / total       0.75      0.75      0.74      6545

medical_help
             precision    recall  f1-score   support

          0       0.93      0.99      0

In [146]:
f1_score(y_test,y_pred2, average='micro')

0.65440769826346645

### 9. Export your model as a pickle file

In [17]:
filename = 'classifier.pkl'
with open(filename, 'wb') as f:
    pickle.dump(best_model, f)

#with open(filename, 'rb') as f:
#    loaded_model = pickle.load(f)
#result = loaded_model.score(X_test, y_test)
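
A quick round trip is a cheap way to confirm that the exported file can be loaded again. The sketch below uses a toy model and a temporary directory (all names and data are hypothetical). One caveat for the real pipeline: pickle stores functions by reference, so the custom `tokenize` function must be importable under the same module path wherever the file is loaded.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier

# Toy model standing in for best_model.
toy_model = RandomForestClassifier(n_estimators=5, random_state=0)
toy_model.fit([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_classifier.pkl")
    # Context managers make sure the file handles are closed.
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
samples = [[0, 1], [1, 0]]
same_predictions = bool((toy_model.predict(samples) == restored.predict(samples)).all())
print(same_predictions)
```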

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.