# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np
import os
from sqlalchemy import create_engine

# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 
import re

#sklearn

from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline,  FeatureUnion
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MaxAbsScaler

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /Users/jeffsan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jeffsan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jeffsan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql("SELECT * FROM messages", engine)
X = df['message'].values
Y = df.drop(columns=['id', 'message', 'original','genre'])


### 2. Write a tokenization function to process your text data

In [2]:
def tokenize(text):
    #remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    #tokenize text
    tokens = word_tokenize(text)
    
    # initiate lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    #iterate for each tokens
    clean_tokens = []
    for tok in tokens:
        
        if tok not in stopwords.words('english'):
            # lemmatize, normalize case, and remove leading/trailing white space
            clean_tok = lemmatizer.lemmatize(tok).lower().strip()

            clean_tokens.append(clean_tok)
    
    return clean_tokens
    

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
""" LogisticRegression """
msg_pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
])


pipeline = Pipeline([
    ('features', msg_pipeline),
    ('scaler', MaxAbsScaler()),
    ('clf', OneVsRestClassifier(LogisticRegression(random_state=42), n_jobs=-1))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [4]:
#split data
X_train, X_test, y_train, y_test = tts(X,Y,test_size=0.33, random_state= 42)



In [7]:
%%time
#train pipeline
pipeline.fit(X_train, y_train)

CPU times: user 2min 17s, sys: 19.3 s, total: 2min 37s
Wall time: 2min 44s


Pipeline(memory=None,
     steps=[('features', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), pr...e=42, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=-1))])

### 5. Test your model
Report the f1 score, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 

In [7]:
def show_report(model, X,y):
    """ Print out classification report """
    y_pred = model.predict(X)
    labels = y.columns.tolist()
    class_report = classification_report(y, y_pred, target_names=labels)
    accuracy = (y_pred == y).mean().mean()
    print("\nClassification report:\n", class_report)
 

In [9]:
""" LogReg on Train  """
show_report(pipeline, X_train, y_train)


Classification report:
                         precision    recall  f1-score   support

               related       0.95      0.98      0.96     13334
               request       0.91      0.71      0.80      2980
                 offer       0.00      0.00      0.00        82
           aid_related       0.92      0.88      0.90      7277
          medical_help       0.95      0.53      0.68      1391
      medical_products       0.96      0.52      0.67       906
     search_and_rescue       0.99      0.32      0.49       499
              security       1.00      0.15      0.26       324
              military       0.98      0.69      0.81       593
           child_alone       0.00      0.00      0.00         0
                 water       0.94      0.80      0.86      1155
                  food       0.93      0.83      0.88      1949
               shelter       0.93      0.70      0.80      1510
              clothing       0.93      0.54      0.69       283
              

In [10]:
""" LogReg on Test  """
show_report(pipeline, X_test, y_test)


Classification report:
                         precision    recall  f1-score   support

               related       0.85      0.93      0.89      6542
               request       0.81      0.55      0.65      1484
                 offer       0.00      0.00      0.00        36
           aid_related       0.74      0.69      0.71      3564
          medical_help       0.61      0.21      0.31       690
      medical_products       0.68      0.23      0.35       405
     search_and_rescue       0.67      0.12      0.21       225
              security       0.00      0.00      0.00       147
              military       0.58      0.24      0.34       266
           child_alone       0.00      0.00      0.00         0
                 water       0.77      0.49      0.60       514
                  food       0.84      0.58      0.69       968
               shelter       0.83      0.46      0.59       798
              clothing       0.80      0.26      0.40       121
              

### 6. Improve your model
Use grid search to find better parameters. 

In [41]:
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV

parameters = {
        'features__msg_pipeline__vect__max_features': (None,1000, 5000, 10000),
        'clf__estimator__C': [1,3,5,7,9],
        'features__msg_pipeline__vect__ngram_range' : [(1,1), (1,2)]
        
    }


cv = GridSearchCV(pipeline, param_grid=parameters,scoring='f1_micro', verbose=10, n_jobs=7)
#cv = RandomizedSearchCV(pipeline, param_distributions=parameters,n_iter=20, scoring='f1_micro', verbose=10, n_jobs=7)

# Fit the grid search object to the training data 
grid_fit = cv.fit(X_train, y_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

Fitting 3 folds for each of 40 candidates, totalling 120 fits
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_fe

[Parallel(n_jobs=7)]: Done   4 tasks      | elapsed:  2.0min


[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6471824259789876, total= 1.3min
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6610179974024367, total= 1.4min
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6715278850610227, total= 1.4min
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1), s

[Parallel(n_jobs=7)]: Done  11 tasks      | elapsed:  3.7min


[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6560517716409303, total= 1.1min
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6504686095776094, total= 1.1min
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6566134185303514, total= 1.1min
[CV] clf__estimator__C=1, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1)

[Parallel(n_jobs=7)]: Done  18 tasks      | elapsed:  5.7min


[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6533928002548582, total= 1.4min
[CV] clf__estimator__C=3, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6450354836647274, total= 1.4min
[CV] clf__estimator__C=3, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.65230179028133, total= 1.4min
[CV] clf__estimator__C=3, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__C=1, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 2), 

[Parallel(n_jobs=7)]: Done  27 tasks      | elapsed:  7.8min


[CV]  clf__estimator__C=3, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6634961282848606, total= 1.4min
[CV] clf__estimator__C=3, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__C=3, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6618397028269218, total= 1.1min
[CV] clf__estimator__C=3, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__C=3, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6635100045822515, total= 1.3min
[CV] clf__estimator__C=3, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=3, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1), s

[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed: 11.2min


[CV]  clf__estimator__C=3, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.664794533370776, total= 1.1min
[CV]  clf__estimator__C=3, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6507715619878158, total= 1.1min
[CV] clf__estimator__C=3, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV] clf__estimator__C=3, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=3, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6435235408982426, total= 1.1min
[CV] clf__estimator__C=3, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__C=3, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1),

[Parallel(n_jobs=7)]: Done  47 tasks      | elapsed: 13.3min


[CV]  clf__estimator__C=3, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6500724838839025, total= 1.3min
[CV] clf__estimator__C=5, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=5, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6407934454506253, total= 1.3min
[CV] clf__estimator__C=5, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=5, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.652385349338258, total= 1.3min
[CV] clf__estimator__C=5, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=5, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1), s

[Parallel(n_jobs=7)]: Done  58 tasks      | elapsed: 16.7min


[CV]  clf__estimator__C=5, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6633886878807557, total= 1.0min
[CV] clf__estimator__C=5, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__C=5, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6627189428411144, total= 1.1min
[CV] clf__estimator__C=5, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=5, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6368131868131868, total= 1.1min
[CV] clf__estimator__C=5, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=5, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1),

[Parallel(n_jobs=7)]: Done  71 tasks      | elapsed: 20.0min


[CV]  clf__estimator__C=5, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6460468521229868, total= 1.2min
[CV] clf__estimator__C=7, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=7, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.638463423600049, total= 1.3min
[CV] clf__estimator__C=7, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=7, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6482184400636708, total= 1.2min
[CV] clf__estimator__C=7, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=7, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 1), s

[Parallel(n_jobs=7)]: Done  84 tasks      | elapsed: 23.0min


[CV]  clf__estimator__C=7, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6323169441842192, total= 1.1min
[CV] clf__estimator__C=7, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=7, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6385593220338983, total= 1.0min
[CV] clf__estimator__C=7, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=7, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6380034444209445, total= 1.2min
[CV] clf__estimator__C=7, features__msg_pipeline__vect__max_features=10000, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__C=7, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 2)

[Parallel(n_jobs=7)]: Done  99 tasks      | elapsed: 27.8min


[CV]  clf__estimator__C=9, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6623852655175547, total= 1.3min
[CV] clf__estimator__C=9, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__C=9, features__msg_pipeline__vect__max_features=None, features__msg_pipeline__vect__ngram_range=(1, 2), score=0.6729811778992107, total= 1.4min
[CV] clf__estimator__C=9, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__C=9, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1), score=0.6528278813818404, total= 1.2min
[CV] clf__estimator__C=9, features__msg_pipeline__vect__max_features=5000, features__msg_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__C=9, features__msg_pipeline__vect__max_features=1000, features__msg_pipeline__vect__ngram_range=(1, 1), s

[Parallel(n_jobs=7)]: Done 120 out of 120 | elapsed: 32.7min remaining:    0.0s
[Parallel(n_jobs=7)]: Done 120 out of 120 | elapsed: 32.7min finished


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

In [42]:
""" Tuned LogReg on Train  """
show_report(best_clf, X_train, y_train)

Accuracy: 1.00


Classification report:
                         precision    recall  f1-score   support

               related       1.00      1.00      1.00     13334
               request       1.00      0.99      0.99      2980
                 offer       1.00      0.96      0.98        82
           aid_related       1.00      1.00      1.00      7277
          medical_help       1.00      0.99      1.00      1391
      medical_products       1.00      0.99      1.00       906
     search_and_rescue       1.00      0.98      0.99       499
              security       1.00      0.97      0.99       324
              military       1.00      1.00      1.00       593
           child_alone       0.00      0.00      0.00         0
                 water       0.99      1.00      1.00      1155
                  food       0.99      1.00      1.00      1949
               shelter       1.00      1.00      1.00      1510
              clothing       1.00      0.99      0.99       28

In [43]:
""" Tuned LogReg on Test  """
show_report(best_clf, X_test, y_test)

Accuracy: 0.95


Classification report:
                         precision    recall  f1-score   support

               related       0.85      0.94      0.89      6542
               request       0.77      0.64      0.70      1484
                 offer       0.00      0.00      0.00        36
           aid_related       0.70      0.76      0.73      3564
          medical_help       0.57      0.27      0.37       690
      medical_products       0.66      0.25      0.37       405
     search_and_rescue       0.67      0.15      0.25       225
              security       0.25      0.01      0.03       147
              military       0.56      0.29      0.38       266
           child_alone       0.00      0.00      0.00         0
                 water       0.73      0.62      0.67       514
                  food       0.82      0.66      0.73       968
               shelter       0.78      0.53      0.63       798
              clothing       0.82      0.33      0.47       12

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [5]:
""" Pipeline LogReg """
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.decomposition import NMF, LatentDirichletAllocation, PCA, TruncatedSVD
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LogisticRegression

msg_pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1,2))),
    ('nlp_union', FeatureUnion([
        ('tfidf_pl', Pipeline([
            ('tfidf', TfidfTransformer()),
            ('dim_red', SelectKBest(chi2, 500))
        ])),
        ('svd_pl', Pipeline([
            ('tfidf_svd', TfidfTransformer()),
            ('truncated_svd', TruncatedSVD(500))
        ]))
    ]))         
])




pipeline = Pipeline([
    ('features', msg_pipeline),
    ('scale', MaxAbsScaler()),
    ('clf', OneVsRestClassifier(LogisticRegression(random_state=42), n_jobs=1))
    
])

In [None]:
%%time
#train pipeline on LogReg
pipeline.fit(X_train, y_train)

In [66]:
""" Improved LogReg on Train  """
show_report(pipeline, X_train, y_train)

Accuracy: 0.95


Classification report:
                         precision    recall  f1-score   support

               related       0.87      0.94      0.90     13334
               request       0.85      0.59      0.69      2980
                 offer       1.00      0.07      0.14        82
           aid_related       0.79      0.68      0.73      7277
          medical_help       0.70      0.28      0.40      1391
      medical_products       0.78      0.29      0.43       906
     search_and_rescue       0.82      0.23      0.36       499
              security       1.00      0.04      0.08       324
              military       0.73      0.37      0.49       593
           child_alone       0.00      0.00      0.00         0
                 water       0.84      0.63      0.72      1155
                  food       0.85      0.68      0.76      1949
               shelter       0.83      0.55      0.66      1510
              clothing       0.85      0.51      0.64       28

In [62]:
""" Improved LogReg on Test  """
show_report(pipeline, X_test, y_test)

Accuracy: 0.95


Classification report:
                         precision    recall  f1-score   support

               related       0.85      0.94      0.89      6542
               request       0.74      0.65      0.69      1484
                 offer       0.00      0.00      0.00        36
           aid_related       0.70      0.77      0.73      3564
          medical_help       0.57      0.34      0.43       690
      medical_products       0.57      0.34      0.43       405
     search_and_rescue       0.55      0.19      0.28       225
              security       0.26      0.04      0.07       147
              military       0.49      0.36      0.41       266
           child_alone       0.00      0.00      0.00         0
                 water       0.69      0.70      0.69       514
                  food       0.79      0.73      0.76       968
               shelter       0.72      0.62      0.67       798
              clothing       0.68      0.50      0.57       12

In [76]:
""" Grid search for improved LogReg model """

from sklearn.model_selection import RandomizedSearchCV
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.preprocessing import FunctionTransformer


parameters = {
        'clf__estimator__C': [1,3,5],
        'features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k' : [100,250,500],
        'features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components': [100,250, 500],
    }



cv = GridSearchCV(pipeline, param_grid=parameters,scoring='f1_micro', verbose=10, n_jobs=7)
#cv = RandomizedSearchCV(pipeline, param_distributions=parameters,n_iter=20, scoring='f1_micro', verbose=10, n_jobs=-1)
15
# Fit the grid search object to the training data 
grid_fit = cv.fit(X_train, y_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100 
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100 
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100 
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250 
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250 
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline

[Parallel(n_jobs=7)]: Done   4 tasks      | elapsed:  2.7min


[CV]  clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250, score=0.6551461863762207, total= 2.0min
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100 
[CV]  clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.655171365143083, total= 2.0min
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250 
[CV]  clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250, score=0.6525948164940873, total= 2.0min
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__sv

[Parallel(n_jobs=7)]: Done  11 tasks      | elapsed:  6.6min


[CV]  clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.6823648789805115, total= 3.2min
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100 
[CV]  clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.675973027327576, total= 3.2min
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100 
[CV]  clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250, score=0.6779681074524422, total= 2.7min
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__sv

[Parallel(n_jobs=7)]: Done  18 tasks      | elapsed: 11.0min


[CV]  clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.6883707458363504, total= 6.1min
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500 
[CV]  clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.6810810810810811, total= 6.2min
[CV] clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500 
[CV]  clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.686019467654406, total= 6.4min
[CV] clf__estimator__C=3, features__msg_pipeline__nlp_union__sv

[Parallel(n_jobs=7)]: Done  27 tasks      | elapsed: 18.4min


[CV]  clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.6544790874524714, total= 2.8min
[CV] clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500 
[CV]  clf__estimator__C=1, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500, score=0.6890434782608696, total= 5.2min
[CV] clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500 
[CV]  clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250, score=0.6537709326488204, total= 2.8min
[CV] clf__estimator__C=3, features__msg_pipeline__nlp_union__s

[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed: 23.1min


[CV]  clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.6806176593334526, total= 5.1min
[CV] clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500 
[CV]  clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.6764775553713135, total= 5.2min
[CV] clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500 
[CV]  clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.6774793937782505, total= 5.2min
[CV] clf__estimator__C=3, features__msg_pipeline__nlp_union__s

[Parallel(n_jobs=7)]: Done  47 tasks      | elapsed: 35.1min


[CV]  clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100, score=0.6796, total= 7.7min
[CV] clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100 
[CV]  clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250, score=0.6763990267639902, total= 7.5min
[CV] clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100 
[CV]  clf__estimator__C=3, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250, score=0.6840443245105358, total= 7.3min
[CV] clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__trunc

[Parallel(n_jobs=7)]: Done  58 tasks      | elapsed: 43.6min


[CV]  clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250, score=0.6550322442617439, total= 3.5min
[CV] clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=100 
[CV]  clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250, score=0.654693778156581, total= 3.3min
[CV] clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=250, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250 
[CV]  clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=100, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500, score=0.6537289787960029, total= 3.4min
[CV] clf__estimator__C=5, features__msg_pipeline__nlp_union__sv

[Parallel(n_jobs=7)]: Done  77 out of  81 | elapsed: 60.2min remaining:  3.1min


[CV]  clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=250, score=0.6777828363305258, total= 7.0min
[CV]  clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500, score=0.6725152818655196, total= 5.6min
[CV]  clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500, score=0.6830594835976412, total= 4.7min
[CV]  clf__estimator__C=5, features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components=500, features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k=500, score=0.6763855011628567, total= 4.7min


[Parallel(n_jobs=7)]: Done  81 out of  81 | elapsed: 64.9min finished


In [78]:
""" Tuned Logreg on Train   """
show_report(best_clf, X_train, y_train)

Accuracy: 0.95


Classification report:
                         precision    recall  f1-score   support

               related       0.87      0.94      0.90     13334
               request       0.84      0.58      0.69      2980
                 offer       1.00      0.01      0.02        82
           aid_related       0.79      0.67      0.73      7277
          medical_help       0.68      0.27      0.39      1391
      medical_products       0.77      0.26      0.39       906
     search_and_rescue       0.83      0.20      0.32       499
              security       0.83      0.02      0.03       324
              military       0.72      0.37      0.49       593
           child_alone       0.00      0.00      0.00         0
                 water       0.84      0.61      0.71      1155
                  food       0.85      0.68      0.75      1949
               shelter       0.82      0.55      0.65      1510
              clothing       0.84      0.51      0.63       28

In [79]:
""" Tuned Logreg on Test   """
show_report(best_clf, X_test, y_test)

Accuracy: 0.95


Classification report:
                         precision    recall  f1-score   support

               related       0.85      0.94      0.89      6542
               request       0.75      0.63      0.69      1484
                 offer       0.50      0.03      0.05        36
           aid_related       0.69      0.79      0.73      3564
          medical_help       0.58      0.32      0.41       690
      medical_products       0.62      0.32      0.43       405
     search_and_rescue       0.62      0.16      0.25       225
              security       0.50      0.01      0.01       147
              military       0.51      0.35      0.42       266
           child_alone       0.00      0.00      0.00         0
                 water       0.69      0.73      0.71       514
                  food       0.79      0.76      0.77       968
               shelter       0.73      0.63      0.68       798
              clothing       0.73      0.55      0.62       12

In [80]:
cv.best_params_

{'clf__estimator__C': 1,
 'features__msg_pipeline__nlp_union__svd_pl__truncated_svd__n_components': 500,
 'features__msg_pipeline__nlp_union__tfidf_pl__dim_red__k': 100}

### 9. Export your model as a pickle file

In [44]:
import pickle
def save(model, filename):
    pickle.dump(model, open(filename, 'wb'))

def load(filename):
    return pickle.load(open(filename, 'rb'))


In [81]:
save(best_clf, "Improved_LogReg_tuned.pkl")

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.