# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [26]:
# import libraries
import numpy as np
import pandas as pd
import pickle
import re
import nltk
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, recall_score, precision_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neural_network import MLPClassifier

In [2]:
# Uncomment if you haven't downloaded these packages before
# ('stopwords' is needed by the GloveVectorizer in step 8)
#nltk.download(['punkt', 'wordnet', 'stopwords'])

In [8]:
# load data from database
engine = create_engine('sqlite:///../data/DisasterResponse.db')

df = pd.read_sql_table('disaster_messages', engine)
X = df['message']
Y = df.loc[:, 'related':'direct_report']

In [9]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [10]:
Y.head()

,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
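The `.loc[:, 'related':'direct_report']` slice used above is label-based, and pandas includes both endpoints in such a slice. A minimal sketch on a made-up frame (the columns and values here are illustrative assumptions, not the contents of the database):

```python
import pandas as pd

# Toy frame mimicking the layout: an id, the message text, then category flags.
toy = pd.DataFrame({
    "id": [1, 2],
    "message": ["need water", "storm coming"],
    "related": [1, 1],
    "request": [1, 0],
    "direct_report": [0, 0],
})

X_toy = toy["message"]
# Label slices with .loc are inclusive on both ends,
# so 'direct_report' is part of the result.
Y_toy = toy.loc[:, "related":"direct_report"]

print(list(Y_toy.columns))
```

Slicing by label rather than by position keeps the selection valid if extra columns are added before `related`.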


### 2. Write a tokenization function to process your text data

In [11]:
def tokenize(text):
    # replace all non-alphabets and non-numbers with blank space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Tokenize words
    tokens = word_tokenize(text)
    
    # instantiate lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # instantiate stemmer
    stemmer = PorterStemmer()
    
    clean_tokens = []
    for tok in tokens:
        # lemmatize token using noun as part of speech
        clean_tok = lemmatizer.lemmatize(tok)
        # lemmatize token using verb as part of speech
        clean_tok = lemmatizer.lemmatize(clean_tok, pos='v')
        # stem token
        clean_tok = stemmer.stem(clean_tok)
        # strip whitespace and append clean token to array
        clean_tokens.append(clean_tok.strip())
        
    return clean_tokens

In [12]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2."

tokenize(text)

['the',
 'first',
 'time',
 'you',
 'see',
 'the',
 'second',
 'renaiss',
 'it',
 'may',
 'look',
 'bore',
 'look',
 'at',
 'it',
 'at',
 'least',
 'twice',
 'and',
 'definit',
 'watch',
 'part',
 '2']
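To see the first step of `tokenize` in isolation, the sketch below applies only the regex normalisation. It uses the standard library alone; `normalize` is a hypothetical helper name, and `str.split()` is a stand-in for `word_tokenize`, so no lemmatizing or stemming happens here.

```python
import re

def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

sample = "Look at it at least twice, and definitely watch part 2!"
# str.split() only splits on whitespace; it stands in for word_tokenize here.
print(normalize(sample).split())
```

Because punctuation is turned into spaces before tokenizing, tokens such as `twice,` and `2!` lose their trailing punctuation.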

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [13]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
    ('clf', MultiOutputClassifier(AdaBoostClassifier(random_state=42)))
])
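As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on random toy data with three binary targets. The data and the choice of `DecisionTreeClassifier` as base estimator are assumptions made for speed, not what the notebook uses.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_toy = rng.rand(40, 5)
# Three binary targets, each derived from a different feature column.
Y_toy = np.column_stack([
    (X_toy[:, 0] > 0.5).astype(int),
    (X_toy[:, 1] > 0.5).astype(int),
    (X_toy[:, 2] > 0.5).astype(int),
])

clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
clf.fit(X_toy, Y_toy)

# One independent copy of the base estimator is fitted per target column.
print(len(clf.estimators_))
# Predictions have one column per target.
print(clf.predict(X_toy[:2]).shape)
```

Each target gets its own classifier, so training cost grows roughly linearly with the number of categories.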

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [16]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# train classifier
pipeline.fit(X_train, Y_train)

# predict on test data
Y_pred = pipeline.predict(X_test)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
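For reference, the three metrics can be computed by hand from true-positive, false-positive, and false-negative counts. The sketch below is a plain-Python illustration on made-up labels; `binary_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))

# A model that never predicts the positive class scores zero on all three.
print(binary_metrics([1, 0, 0], [0, 0, 0]))
```

The second call shows how a rare category can score 0.00 on its positive class even when overall accuracy is high.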

In [20]:
# Get names of all categories
category_names = Y_test.columns.tolist()

Y_pred_df = pd.DataFrame(Y_pred, columns = category_names)
Y_pred_df.head()

In [105]:
for i, category in enumerate(category_names):
    print(category,
          '\n',
          classification_report(Y_test.iloc[:, i], Y_pred_df.iloc[:, i]))

related 
              precision    recall  f1-score   support

          0       0.66      0.37      0.47      1245
          1       0.83      0.94      0.88      3998

avg / total       0.79      0.80      0.78      5243

request 
              precision    recall  f1-score   support

          0       0.91      0.97      0.94      4352
          1       0.77      0.54      0.64       891

avg / total       0.89      0.90      0.89      5243

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00      5219
          1       0.00      0.00      0.00        24

avg / total       0.99      0.99      0.99      5243

aid_related 
              precision    recall  f1-score   support

          0       0.75      0.87      0.81      3079
          1       0.76      0.59      0.66      2164

avg / total       0.75      0.75      0.75      5243

medical_help 
              precision    recall  f1-score   support

          0       0.94      0

### 6. Improve your model
Use grid search to find better parameters. 

In [13]:
pipeline.get_params()

{'memory': None,
 'steps': [('tfidf',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
           stop_words=None, strip_accents=None, sublinear_tf=False,
           token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x00000291CF893840>, use_idf=True,
           vocabulary=None)),
  ('clf',
   MultiOutputClassifier(estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
             learning_rate=1.0, n_estimators=50, random_state=42),
              n_jobs=1))],
 'tfidf': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1)

In [14]:
parameters = {
    #'tfidf__max_df': (0.9, 1.0),
    #'tfidf__min_df': (0.01, 1),
    'tfidf__ngram_range': ((1, 1),(1,3)),
    #'tfidf__stop_words': (None, 'english'),
    #'clf__estimator__learning_rate': (0.1,1.0),
    #'clf__estimator__n_estimators': (50, 100)
}

cv = GridSearchCV(pipeline, param_grid=parameters, verbose=1)

In [17]:
cv.fit(X_train, Y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 57.3min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...timator=None,
          learning_rate=1.0, n_estimators=50, random_state=42),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'tfidf__ngram_range': ((1, 1), (1, 3))},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
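The log above reports 6 fits in 57.3 minutes, which suggests why most grid entries were left commented out. The back-of-the-envelope sketch below counts the candidates in the full grid and extrapolates the runtime, assuming (unrealistically) that every fit costs the same as the average fit in that log.

```python
from itertools import product

# The grid from the cell above with the commented-out entries enabled.
full_grid = {
    "tfidf__max_df": (0.9, 1.0),
    "tfidf__min_df": (0.01, 1),
    "tfidf__ngram_range": ((1, 1), (1, 3)),
    "tfidf__stop_words": (None, "english"),
    "clf__estimator__learning_rate": (0.1, 1.0),
    "clf__estimator__n_estimators": (50, 100),
}

n_folds = 3                 # from the log: "Fitting 3 folds"
minutes_per_fit = 57.3 / 6  # from the log: 6 fits in 57.3 minutes

# Every combination of one value per parameter is one candidate.
n_candidates = len(list(product(*full_grid.values())))
n_fits = n_candidates * n_folds
est_hours = n_fits * minutes_per_fit / 60

print(n_candidates, n_fits, round(est_hours, 1))
```

Under that assumption the full grid would take on the order of 30 hours on a single core, so searching one parameter at a time or setting `n_jobs` on `GridSearchCV` may be worth considering.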

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!

In [22]:
Y_pred = cv.predict(X_test)
Y_pred_df = pd.DataFrame(Y_pred, columns = category_names)
Y_pred_df.head()

,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
for i, category in enumerate(category_names):
    print(category,
          '\n',
          classification_report(Y_test.iloc[:, i], Y_pred_df.iloc[:, i]))

related 
              precision    recall  f1-score   support

          0       0.66      0.37      0.47      1245
          1       0.83      0.94      0.88      3998

avg / total       0.79      0.80      0.78      5243

request 
              precision    recall  f1-score   support

          0       0.91      0.97      0.94      4352
          1       0.77      0.54      0.64       891

avg / total       0.89      0.90      0.89      5243

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00      5219
          1       0.00      0.00      0.00        24

avg / total       0.99      0.99      0.99      5243

aid_related 
              precision    recall  f1-score   support

          0       0.75      0.87      0.81      3079
          1       0.76      0.59      0.66      2164

avg / total       0.75      0.75      0.75      5243

medical_help 
              precision    recall  f1-score   support

          0       0.94      0

In [29]:
# Note: each column is a single-label target, so with average='micro' precision,
# recall and F1 all reduce to accuracy; that is why the four values printed for
# each category are identical.
for i, category in enumerate(category_names):
    y_true, y_hat = Y_test.iloc[:, i], Y_pred_df.iloc[:, i]
    accuracy = accuracy_score(y_true, y_hat)
    precision = precision_score(y_true, y_hat, average='micro')
    recall = recall_score(y_true, y_hat, average='micro')
    f1 = f1_score(y_true, y_hat, average='micro')
    print(category)
    print("\t Accuracy: %.4f \t Precision: %.4f \t Recall: %.4f \t F1-Score: %.4f \n" %
          (accuracy, precision, recall, f1))

related
	 Accuracy: 0.8043 	 Precision: 0.8043 	 Recall: 0.8043 	 F1-Score: 0.8043 

request
	 Accuracy: 0.8951 	 Precision: 0.8951 	 Recall: 0.8951 	 F1-Score: 0.8951 

offer
	 Accuracy: 0.9945 	 Precision: 0.9945 	 Recall: 0.9945 	 F1-Score: 0.9945 

aid_related
	 Accuracy: 0.7534 	 Precision: 0.7534 	 Recall: 0.7534 	 F1-Score: 0.7534 

medical_help
	 Accuracy: 0.9262 	 Precision: 0.9262 	 Recall: 0.9262 	 F1-Score: 0.9262 

medical_products
	 Accuracy: 0.9535 	 Precision: 0.9535 	 Recall: 0.9535 	 F1-Score: 0.9535 

search_and_rescue
	 Accuracy: 0.9754 	 Precision: 0.9754 	 Recall: 0.9754 	 F1-Score: 0.9754 

security
	 Accuracy: 0.9794 	 Precision: 0.9794 	 Recall: 0.9794 	 F1-Score: 0.9794 

military
	 Accuracy: 0.9706 	 Precision: 0.9706 	 Recall: 0.9706 	 F1-Score: 0.9706 

child_alone
	 Accuracy: 1.0000 	 Precision: 1.0000 	 Recall: 1.0000 	 F1-Score: 1.0000 

water
	 Accuracy: 0.9647 	 Precision: 0.9647 	 Recall: 0.9647 	 F1-Score: 0.9647 

food
	 Accuracy: 0.9510 	 Precision
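The four numbers in each row of this output are identical because micro-averaging pools counts over both classes of a single-label target. Every prediction is then either a true positive for its class or a false positive for it, so pooled precision equals the fraction of correct predictions. A plain-Python sketch on made-up labels (the helper names are hypothetical):

```python
def micro_precision(y_true, y_pred):
    """Micro-averaged precision: pool TP and FP over every class, then divide."""
    pairs = list(zip(y_true, y_pred))
    labels = set(y_true) | set(y_pred)
    tp = sum(1 for c in labels for t, p in pairs if p == c and t == c)
    fp = sum(1 for c in labels for t, p in pairs if p == c and t != c)
    return tp / (tp + fp)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(micro_precision(y_true, y_pred), accuracy(y_true, y_pred))
```

To report how well the positive class alone is detected, `average='binary'` (the default of `precision_score`, `recall_score`, and `f1_score`) is one alternative.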

In [30]:
cv.best_params_

{'tfidf__ngram_range': (1, 1)}
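The search selected `(1, 1)`, i.e. single words only. To make the two candidates concrete, the sketch below generates word n-grams for a given range. It is a simplified illustration of the idea, not scikit-learn's implementation, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(tokens, ngram_range):
    """Return all space-joined n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["need", "clean", "water"]
print(word_ngrams(tokens, (1, 1)))
print(word_ngrams(tokens, (1, 3)))
```

With `(1, 3)` the vocabulary also contains every two- and three-word sequence, which enlarges the feature matrix considerably.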

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [34]:
class GloveVectorizer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        word2vec = {}
        embedding = []
        idx2word = []
        
        # get pretrained glove vectors
        with open('../glove.6B/glove.6B.100d.txt', encoding='utf-8') as f:
            for line in f:
                values = line.split()
                # get word
                word = values[0]
                # get glove vector for word
                vec = np.asarray(values[1:], dtype='float32')
                word2vec[word] = vec
                embedding.append(vec)
                idx2word.append(word)
                
        self.word2vec = word2vec
        self.embedding = embedding
        self.idx2word = idx2word
        
        # Record vocabulary size and word-vector dimensionality
        self.vocab_size, self.dim = len(embedding), len(embedding[0])
                
    def tokenize(self, text):
        # replace all non-alphabets and non-numbers with blank space
        text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

        # Tokenize words
        tokens = word_tokenize(text)
        
        # Remove stopwords (build the set once; calling stopwords.words() per token is slow)
        stop_words = set(stopwords.words("english"))
        tokens = [word for word in tokens if word not in stop_words]

        # instantiate lemmatizer
        lemmatizer = WordNetLemmatizer()

        clean_tokens = []
        for tok in tokens:
            # lemmatize token using noun as part of speech
            clean_tok = lemmatizer.lemmatize(tok)
            # lemmatize token using verb as part of speech
            clean_tok = lemmatizer.lemmatize(clean_tok, pos='v')
            # strip whitespace and append clean token to array
            clean_tokens.append(clean_tok.strip())
         
        return clean_tokens
        
    
    def fit(self, x, y=None):
        # Nothing to learn; return self so the transformer follows the scikit-learn API
        return self
    
    def transform(self, X):       
        new_X = np.zeros((len(X), self.dim))
        
        # keep track of sentences without any glove vectors representation
        self.emptycount = 0
        
        n=0
        
        for message in X:
            clean_tokens = self.tokenize(message)
            vecs = []
            for word in clean_tokens:
                if word in self.word2vec:
                    vec = self.word2vec[word]
                    vecs.append(vec)
            if len(vecs) > 0:
                vecs = np.array(vecs)
                # Get mean of all glove vectors of each message
                new_X[n] = vecs.mean(axis=0)
            else:
                self.emptycount += 1
            n += 1
        return pd.DataFrame(new_X)
    
    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)
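The core of `transform` above is averaging the vectors of the words found in a message. The sketch below reproduces that step on a made-up three-dimensional table written in the same `word v1 v2 ...` text layout as the GloVe file. The table values and the `embed` helper are assumptions for illustration only.

```python
import io
import numpy as np

# A made-up 3-dimensional table in the GloVe text layout: "word v1 v2 v3".
fake_glove = io.StringIO(
    "water 1.0 0.0 2.0\n"
    "food 3.0 2.0 0.0\n"
)

word2vec = {}
for line in fake_glove:
    values = line.split()
    word2vec[values[0]] = np.asarray(values[1:], dtype="float32")

def embed(tokens, word2vec, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vecs = [word2vec[t] for t in tokens if t in word2vec]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# "zzz" is not in the table, so it is skipped.
print(embed(["water", "food", "zzz"], word2vec, 3))
# No known words at all: the message maps to the zero vector.
print(embed(["zzz"], word2vec, 3))
```

Averaging discards word order, so two messages with the same words in a different order get the same representation.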

In [35]:
glove_vect = GloveVectorizer()
X_glove = glove_vect.transform(X)

In [36]:
X_glove.head()

,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-0.373089,0.143821,0.526428,-0.165015,-0.085869,0.030974,0.20888,0.292311,-0.165434,0.036243,...,0.338162,-0.00076,-0.055677,-0.01669,-0.39653,0.146006,-0.059675,-0.185591,0.285596,0.12587
1,-0.29315,-0.001739,0.24216,-0.42287,-0.19497,-0.81765,0.66281,-0.1657,-0.19258,-0.070444,...,0.87229,0.60738,-0.39358,0.058054,-0.25089,0.1604,0.69389,0.10292,0.53148,-0.89686
2,0.06143,0.337297,0.6897,-0.369527,-0.074156,0.426358,-0.159638,-0.077936,0.2169,-0.014871,...,0.090407,-0.096784,0.278117,-0.267483,-0.177794,-0.22188,-0.676807,-0.273858,0.221307,0.095899
3,-0.276249,0.318849,0.048543,0.195906,-0.057357,0.08267,-0.049792,0.34458,0.072582,0.016215,...,0.142197,0.020486,0.188797,0.090719,-0.582998,0.225993,0.189824,-0.141658,0.402548,0.024831
4,-0.255748,0.299672,0.66505,-0.141468,-0.214566,0.192325,-0.155083,0.372069,-0.103434,-0.069906,...,0.117501,-0.13997,-0.183351,0.182389,-0.94733,0.172302,-0.034799,-0.005394,0.561409,0.257176


In [37]:
print("Number of messages with no words found: %s / %s" % (glove_vect.emptycount, len(X)))
#glove_vect.emptycount

Number of messages with no words found: 11 / 26215


In [38]:
pipeline = Pipeline([
    ('glove', GloveVectorizer()),
    ('clf', MLPClassifier(solver='lbfgs', random_state=42))
])

In [39]:
pipeline.get_params()

{'memory': None,
 'steps': [('glove', GloveVectorizer()),
  ('clf',
   MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
          beta_2=0.999, early_stopping=False, epsilon=1e-08,
          hidden_layer_sizes=(100,), learning_rate='constant',
          learning_rate_init=0.001, max_iter=200, momentum=0.9,
          nesterovs_momentum=True, power_t=0.5, random_state=42, shuffle=True,
          solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
          warm_start=False))],
 'glove': GloveVectorizer(),
 'clf': MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
        beta_2=0.999, early_stopping=False, epsilon=1e-08,
        hidden_layer_sizes=(100,), learning_rate='constant',
        learning_rate_init=0.001, max_iter=200, momentum=0.9,
        nesterovs_momentum=True, power_t=0.5, random_state=42, shuffle=True,
        solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
        warm_start=False)

In [40]:
parameters = {
    'clf__hidden_layer_sizes': ((32,),(64,))
    #'clf__learning_rate_init': (0.001, 0.01)
}

In [41]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [46]:
# train classifier
pipeline.fit(X_train, Y_train)

# predict on test data
Y_pred = pipeline.predict(X_test)

# Get names of all categories
category_names = Y_test.columns.tolist()

Y_pred_df = pd.DataFrame(Y_pred, columns = category_names)
Y_pred_df.head()

,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,1,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
for i, category in enumerate(category_names):
    print(category,
          '\n',
          classification_report(Y_test.iloc[:, i], Y_pred_df.iloc[:, i]))

related 
              precision    recall  f1-score   support

          0       0.69      0.52      0.59      1245
          1       0.86      0.93      0.89      3998

avg / total       0.82      0.83      0.82      5243

request 
              precision    recall  f1-score   support

          0       0.91      0.97      0.94      4352
          1       0.76      0.51      0.61       891

avg / total       0.88      0.89      0.88      5243

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00      5219
          1       0.00      0.00      0.00        24

avg / total       0.99      1.00      0.99      5243

aid_related 
              precision    recall  f1-score   support

          0       0.80      0.82      0.81      3079
          1       0.73      0.71      0.72      2164

avg / total       0.77      0.77      0.77      5243

medical_help 
              precision    recall  f1-score   support

          0       0.93      0



In [43]:
cv = GridSearchCV(pipeline, param_grid=parameters, verbose=5)

cv.fit(X_train, Y_train)

Y_pred = cv.predict(X_test)

# Get names of all categories
category_names = Y_test.columns.tolist()

Y_pred_df = pd.DataFrame(Y_pred, columns = category_names)
Y_pred_df.head()

for i, category in enumerate(category_names):
    print(category,
          '\n',
          classification_report(Y_test.iloc[:, i], Y_pred_df.iloc[:, i]))

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__hidden_layer_sizes=(32,) ...................................
[CV]  clf__hidden_layer_sizes=(32,), score=0.2609068802746388, total= 6.9min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 11.9min remaining:    0.0s


[CV] clf__hidden_layer_sizes=(32,) ...................................
[CV]  clf__hidden_layer_sizes=(32,), score=0.2667715634387069, total= 7.5min


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 24.2min remaining:    0.0s


[CV] clf__hidden_layer_sizes=(32,) ...................................
[CV]  clf__hidden_layer_sizes=(32,), score=0.26509298998569386, total= 7.1min


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 35.7min remaining:    0.0s


[CV] clf__hidden_layer_sizes=(64,) ...................................
[CV]  clf__hidden_layer_sizes=(64,), score=0.27549706765841797, total= 7.0min


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 47.6min remaining:    0.0s


[CV] clf__hidden_layer_sizes=(64,) ...................................
[CV]  clf__hidden_layer_sizes=(64,), score=0.2696323844943499, total= 6.7min
[CV] clf__hidden_layer_sizes=(64,) ...................................
[CV]  clf__hidden_layer_sizes=(64,), score=0.271101573676681, total= 6.4min


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 69.5min finished


related 
              precision    recall  f1-score   support

          0       0.69      0.54      0.61      1245
          1       0.87      0.92      0.89      3998

avg / total       0.83      0.83      0.83      5243

request 
              precision    recall  f1-score   support

          0       0.91      0.97      0.94      4352
          1       0.79      0.55      0.65       891

avg / total       0.89      0.90      0.89      5243

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00      5219
          1       0.00      0.00      0.00        24

avg / total       0.99      1.00      0.99      5243

aid_related 
              precision    recall  f1-score   support

          0       0.80      0.82      0.81      3079
          1       0.74      0.71      0.72      2164

avg / total       0.77      0.78      0.77      5243

medical_help 
              precision    recall  f1-score   support

          0       0.94      0



In [44]:
cv.best_params_

{'clf__hidden_layer_sizes': (64,)}

### 9. Export your model as a pickle file

In [59]:
filename = 'classifier.pkl'
# use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(cv, f)
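A saved model is only useful if it can be loaded again, so a quick round trip is a reasonable sanity check. The sketch below uses a plain dictionary as a stand-in for the fitted search object and writes to a temporary directory; both are assumptions to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted GridSearchCV object; any picklable object works the same way.
model = {"best_params": {"clf__hidden_layer_sizes": (64,)}}

path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

with open(path, "wb") as f:  # the with-block closes the file even on error
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Note that pickle stores classes and functions by reference, so a script that loads the real model needs to be able to import `GloveVectorizer` (or `tokenize`) under the same module path used when the model was saved.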

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
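One possible shape for such a script is sketched below. The template itself is not shown here, so the argument names `database_filepath` and `model_filepath` and the function layout are hypothetical and should be adapted to whatever the template defines.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    # Both argument names are hypothetical; match them to the provided template.
    parser.add_argument("database_filepath", help="SQLite database to read from")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # The steps from this notebook would go here, in order:
    # load data -> build pipeline -> fit -> evaluate -> pickle.dump
    print("Would train from %s and save to %s"
          % (args.database_filepath, args.model_filepath))
    return args

# Example invocation with explicit arguments instead of sys.argv.
main(["../data/DisasterResponse.db", "classifier.pkl"])
```

Passing `argv` explicitly keeps the parsing logic testable without touching the real command line.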