# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
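
As a rough illustration of the steps listed above, the sketch below uses only the standard-library `sqlite3` module and an in-memory database. The table name `message_categories` matches the one queried later in this notebook, but the columns, rows, and the `X_demo`/`Y_demo` names are invented for illustration.

```python
import sqlite3

# In-memory stand-in for etl.db; the rows and columns are illustrative only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message_categories "
    "(id INTEGER, message TEXT, related INTEGER, request INTEGER)"
)
conn.executemany(
    "INSERT INTO message_categories VALUES (?, ?, ?, ?)",
    [(1, "We need water", 1, 1), (2, "Sunny today", 0, 0)],
)

# Load every row, then split into the feature (message text) and the targets
rows = conn.execute(
    "SELECT message, related, request FROM message_categories ORDER BY id"
).fetchall()
conn.close()

X_demo = [row[0] for row in rows]         # feature: raw message text
Y_demo = [list(row[1:]) for row in rows]  # targets: one 0/1 flag per category

print(X_demo)
print(Y_demo)
```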

In [1]:
# %pip install --user xgboost
# !conda install -c anaconda py-xgboost

In [46]:
# import libraries
import sqlite3
import pandas as pd
from nltk.stem import WordNetLemmatizer 
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
import re, string
from sklearn.base import TransformerMixin
from joblib import dump, load
from workspace_utils import active_session
import nltk
import ssl
import numpy as np

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords


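class DenseTransformer(TransformerMixin):
    """
    Convert a sparse matrix to a dense array so that estimators which
    require dense input (such as GaussianNB) can follow TfidfVectorizer.
    Taken from: http://zacstewart.com/2014/08/05/
    pipelines-of-featureunions-of-pipelines.html
    """
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns a plain ndarray; todense() returns the legacy np.matrix type
        return X.toarray()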

stop_words = set(stopwords.words('english'))
engine = sqlite3.connect('etl.db')

[nltk_data] Downloading package stopwords to /Users/Jon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/Jon/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 2. Write a tokenization function to process your text data

In [47]:
# load data from database
df = pd.read_sql('select * from message_categories', engine)

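lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """Lowercase, keep ASCII letters only, drop stop words, and lemmatize each token."""
    # Replace every character that is not an ASCII letter with a space
    letters_only = re.sub(r"[^a-z]", " ", text.lower())
    # lemmatize() expects a single word, so apply it token by token
    tokens = [lemmatizer.lemmatize(w) for w in letters_only.split() if w not in stop_words]
    return ' '.join(tokens)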

cols = list(df)
cols.insert(4,'message_cleaned')
df['message_cleaned'] = df.message.apply(clean_text)
df = df[cols]
df.head()
X, Y = df['message_cleaned'], df[list(df)[5:]]
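
To make the cleaning steps easier to follow in isolation, here is a minimal standard-library sketch of the same idea: lowercase, keep letters only, split into tokens, and drop stop words. The three-word stop list and the `simple_tokenize` name are placeholders rather than NLTK's list, and lemmatization is left out.

```python
import re

# Tiny illustrative stop-word list; NLTK's English list is much longer
DEMO_STOP_WORDS = {"the", "is", "a"}

def simple_tokenize(text):
    """Lowercase, replace non-letters with spaces, split, and drop stop words."""
    letters_only = re.sub(r"[^a-z]", " ", text.lower())
    return [tok for tok in letters_only.split() if tok not in DEMO_STOP_WORDS]

tokens = simple_tokenize("The water is running low!!! Send 20 bottles.")
print(tokens)
```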

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
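
Conceptually, `MultiOutputClassifier` fits one independent copy of the base estimator per target column. The sketch below imitates that strategy in plain Python with a made-up `MajorityClassifier` stand-in, so it illustrates the idea only and is not scikit-learn's implementation.

```python
from collections import Counter

class MajorityClassifier:
    """Hypothetical stand-in estimator: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def fit_per_column(X, Y):
    """Fit one classifier per target column (the one-estimator-per-output idea)."""
    n_targets = len(Y[0])
    return [MajorityClassifier().fit(X, [row[j] for row in Y]) for j in range(n_targets)]

def predict_all(models, X):
    """Stack the per-column predictions back into one row per sample."""
    per_column = [m.predict(X) for m in models]
    return [list(row) for row in zip(*per_column)]

X_demo = ["need water", "need food", "hello"]
Y_demo = [[1, 0], [1, 1], [1, 0]]  # two target columns per message
models = fit_per_column(X_demo, Y_demo)
demo_preds = predict_all(models, ["new message"])
print(demo_preds)
```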

In [48]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
#     ('to_dense', DenseTransformer()),
#     ('clf', MultiOutputClassifier(GaussianNB()))
    ('clf', MultiOutputClassifier(XGBClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [5]:
%%time
# with active_session():
pipeline.fit(X_train, y_train)
dump(pipeline, 'model.joblib')

CPU times: user 8min 23s, sys: 4.35 s, total: 8min 27s
Wall time: 49.2 s


['model.joblib']

In [6]:
pipeline = load('model.joblib') 

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
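
For reference, the per-class numbers in `classification_report` come from simple counts of true positives, false positives, and false negatives. The sketch below computes precision, recall, and F1 for a single class from two lists; the function name and sample labels are made up for illustration, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

y_true_demo = [1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true_demo, y_pred_demo)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```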

In [7]:
preds = pipeline.predict(X_test)
preds

array([['1', '1', '0', ..., '0', '0', '0'],
       ['1', '0', '0', ..., '0', '0', '0'],
       ['1', '0', '0', ..., '1', '0', '0'],
       ...,
       ['1', '0', '0', ..., '0', '0', '0'],
       ['1', '0', '0', ..., '0', '0', '0'],
       ['1', '0', '0', ..., '0', '0', '0']], dtype=object)

In [8]:
print('>> Possible clf params <<\n')
for i in pipeline.steps[1][1].get_params().keys():
    print('\t', i)

>> Possible clf params <<

	 estimator__objective
	 estimator__base_score
	 estimator__booster
	 estimator__colsample_bylevel
	 estimator__colsample_bynode
	 estimator__colsample_bytree
	 estimator__gamma
	 estimator__gpu_id
	 estimator__importance_type
	 estimator__interaction_constraints
	 estimator__learning_rate
	 estimator__max_delta_step
	 estimator__max_depth
	 estimator__min_child_weight
	 estimator__missing
	 estimator__monotone_constraints
	 estimator__n_estimators
	 estimator__n_jobs
	 estimator__num_parallel_tree
	 estimator__random_state
	 estimator__reg_alpha
	 estimator__reg_lambda
	 estimator__scale_pos_weight
	 estimator__subsample
	 estimator__tree_method
	 estimator__validate_parameters
	 estimator__verbosity
	 estimator
	 n_jobs


### 6. Improve your model
Use grid search to find better parameters. 
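
A grid search tries every combination of the listed values, so the number of fits grows multiplicatively. The sketch below enumerates a grid with `itertools.product`; the grid mirrors the uncommented entries in the cell that follows, and the fold count of 5 is taken from that cell's log output.

```python
from itertools import product

grid = {
    "clf__estimator__max_depth": range(2, 10),
    "clf__estimator__n_estimators": range(60, 220, 40),
    "clf__estimator__learning_rate": [0.1, 0.01, 0.05],
}

# Every combination of one value per parameter is one candidate
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

n_folds = 5  # fold count reported in the grid search log
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "fits")
```

Adding one more parameter with three values would triple both numbers, which is worth keeping in mind given the multi-hour wall time reported for this search.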

In [10]:
%%time
"""Params adapted from:
https://towardsdatascience.com/nlp-with-pipeline-gridsearch-5922266e82f4
https://www.kaggle.com/tilii7/hyperparameter-grid-search-with-xgboost
https://mlfromscratch.com/gridsearch-keras-sklearn/#/
"""
params = {
#     'tfidf__max_features':[100, 2000],
#     'tfidf__ngram_range': [(1, 1), (1, 2), (2, 2)],
#     'tfidf__stop_words': [None, 'english'],
#     'clf__estimator__min_child_weight': [1, 5, 10],
#     'clf__estimator__gamma': [0.5, 1, 1.5, 2, 5],
#     'clf__estimator__subsample': [0.6, 0.8, 1.0],
#     'clf__estimator__colsample_bytree': [0.6, 0.8, 1.0],
#     'clf__estimator__max_depth': [3, 4, 5],
    
#     'clf__estimator__n_estimators': [400, 700, 1000],
#     'clf__estimator__colsample_bytree': [0.7, 0.8],
#     'clf__estimator__max_depth': [15,20,25],
#     'clf__estimator__reg_alpha': [1.1, 1.2, 1.3],
#     'clf__estimator__reg_lambda': [1.1, 1.2, 1.3],
#     'clf__estimator__subsample': [0.7, 0.8, 0.9]
    'clf__estimator__max_depth': range(2, 10),
    'clf__estimator__n_estimators': range(60, 220, 40),
    'clf__estimator__learning_rate': [0.1, 0.01, 0.05]
}
cv = GridSearchCV(pipeline, params, verbose=10, n_jobs=8)
cv.fit(X_train, y_train)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 tasks      | elapsed:   59.5s
[Parallel(n_jobs=8)]: Done   9 tasks      | elapsed:  2.6min
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:  4.3min
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:  6.0min
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed: 10.3min
[Parallel(n_jobs=8)]: Done  45 tasks      | elapsed: 13.1min
[Parallel(n_jobs=8)]: Done  56 tasks      | elapsed: 19.1min
[Parallel(n_jobs=8)]: Done  69 tasks      | elapsed: 24.1min
[Parallel(n_jobs=8)]: Done  82 tasks      | elapsed: 32.1min
[Parallel(n_jobs=8)]: Done  97 tasks      | elapsed: 42.1min
[Parallel(n_jobs=8)]: Done 112 tasks      | elapsed: 52.9min
[Parallel(n_jobs=8)]: Done 129 tasks      | elapsed: 66.7min
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed: 83.8min
[Parallel(n_jobs=8)]: Done 165 tasks      | elapsed: 95.6min
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed: 101.7min
[Parallel

CPU times: user 19min 10s, sys: 22.1 s, total: 19min 32s
Wall time: 5h 15min 33s


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [11]:
cv.best_estimator_.steps

[('tfidf',
  TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.float64'>, encoding='utf-8',
                  input='content', lowercase=True, max_df=1.0, max_features=None,
                  min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                  smooth_idf=True, stop_words=None, strip_accents=None,
                  sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, use_idf=True, vocabulary=None)),
 ('clf',
  MultiOutputClassifier(estimator=XGBClassifier(base_score=None, booster=None,
                                                colsample_bylevel=None,
                                                colsample_bynode=None,
                                                colsample_bytree=None, gamma=None,
                                                gpu_id=None,
                                                importance_type='gain',
                               

In [50]:
%%time
# Rebuild the pipeline with the tuned values. The arguments dropped here
# were either None or set to their default values in the original call.
pipeline2 = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultiOutputClassifier(
        estimator=XGBClassifier(
            objective='binary:logistic',
            learning_rate=0.1,
            max_depth=8,
            n_estimators=180,
        ),
        n_jobs=-1,
    )),
])
pipeline2.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...
                                                               learning_rate=0.1,
                                                               max_delta_step=None,
                                                               max_depth=8,
                                                            

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [168]:
%%time
predictions = pipeline2.predict(X_test)

CPU times: user 4.02 s, sys: 1 s, total: 5.03 s
Wall time: 6.84 s


In [169]:
%%time
pd.set_option('display.max_rows', 500)

def make_classification_report():
    """Build one DataFrame of per-class precision, recall, and f1-score for every category."""
    def classification_report_csv(col, report):
        """Parse the text report for one category into a DataFrame.

        Adapted from:
        https://stackoverflow.com/questions/39662398/scikit-learn-output-
        metrics-classification-report-into-csv-tab-delimited-format
        """
        lines = report.splitlines()
        # The first line holds the metric names; skip the blank line after it
        # and the four trailing summary lines
        header, rows = lines[0].split(), lines[2:-4]
        df = pd.DataFrame([ln.split() for ln in rows], columns=['class'] + header)
        df['label'] = col
        return df

    yhat_df = pd.DataFrame(predictions, columns=y_test.columns)
    # Collect one small frame per category, then concatenate once
    reports = [
        classification_report_csv(col, classification_report(y_test[col], yhat_df[col]))
        for col in y_test.columns
    ]
    return pd.concat(reports, ignore_index=True)

make_classification_report()

CPU times: user 6.91 s, sys: 19.4 ms, total: 6.93 s
Wall time: 6.94 s


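|   | class | precision | recall | f1-score | support | label |
|---|-------|-----------|--------|----------|---------|-------|
| 0 | 0 | 0.7 | 0.31 | 0.43 | 2054 | related |
| 1 | 1 | 0.81 | 0.96 | 0.88 | 6534 | related |
| 2 | 2 | 0.69 | 0.17 | 0.28 | 64 | related |
| 3 | 0 | 0.91 | 0.97 | 0.94 | 7180 | request |
| 4 | 1 | 0.8 | 0.55 | 0.65 | 1472 | request |
| 5 | 0 | 1.0 | 1.0 | 1.0 | 8614 | offer |
| 6 | 1 | 0.0 | 0.0 | 0.0 | 38 | offer |
| 7 | 0 | 0.78 | 0.86 | 0.82 | 5107 | aid_related |
| 8 | 1 | 0.77 | 0.64 | 0.7 | 3545 | aid_related |
| 9 | 0 | 0.94 | 0.98 | 0.96 | 7951 | medical_help |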


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [170]:
"""I believe I already did this step by adding in more params to the Grid Search"""

'I already did this step by adding in more params to the Grid Search'

### 9. Export your model as a pickle file
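
One possible way to do this uses the standard-library `pickle` module (the training cell earlier in this notebook used `joblib.dump` instead). In the sketch below a plain dictionary stands in for the fitted pipeline and the file goes to a temporary directory; both are placeholders.

```python
import os
import pickle
import tempfile

# Placeholder object; in the notebook this would be the fitted pipeline
model = {"name": "demo-model", "max_depth": 8}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Write the object in binary mode
with open(path, "wb") as f:
    pickle.dump(model, f)

# Read it back to confirm the round trip
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Pickle files can execute code when loaded, so only unpickle files from a source you trust.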

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
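
The template itself is not reproduced here, so the following is only a guess at the script's command-line skeleton. The argument names `database_filepath` and `model_filepath` are assumptions, and the training steps are left as comments.

```python
import argparse

def parse_args(argv=None):
    """Parse the two paths the script needs (argument names are assumed)."""
    parser = argparse.ArgumentParser(description="Train and export the message classifier.")
    parser.add_argument("database_filepath", help="SQLite database produced by the ETL step")
    parser.add_argument("model_filepath", help="where to write the trained model")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # 1. load X and Y from args.database_filepath
    # 2. build the pipeline and fit it on a training split
    # 3. evaluate on the test split
    # 4. export the fitted model to args.model_filepath
    return args

# Example call with explicit arguments instead of reading sys.argv
args = main(["etl.db", "model.pkl"])
print(args.database_filepath, args.model_filepath)
```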