# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [26]:
# import libraries
import re
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

import sklearn
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier

import nltk
# proxy credentials redacted; supply your own or remove this line if no proxy is needed
nltk.set_proxy('rb-proxy-de02.bosch.com:8080', '<user>', '<password>')
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# raw string so that the backslash escapes reach the regex engine unchanged
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\zip9fe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\zip9fe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\zip9fe\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [27]:
# load data from database
engine = create_engine('sqlite:///../data/DisasterResponse.db')
df = pd.read_sql('DisasterResponseData',con=engine)
#TODO: think about adding genre to X
X = df["message"]
y = df.iloc[:,4:]
y.shape

(26028, 36)

### 2. Write a tokenization function to process your text data

In [28]:
def tokenize(text):
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
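
The URL replacement step at the top of `tokenize` can be checked in isolation, without NLTK. The following is a minimal sketch using only the standard library. The helper name `replace_urls` and the sample message are made up for illustration, and `re.sub` is used as a shorthand for the `findall` plus `replace` loop, which should give the same result in simple cases like this one.

```python
import re

# Same pattern as url_regex in the notebook, written as a raw string
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL matched by URL_REGEX with a fixed placeholder token."""
    return re.sub(URL_REGEX, placeholder, text)


sample = "Please help http://example.com/map?id=3 now"  # invented sample message
cleaned = replace_urls(sample)
print(cleaned)
```

Note that the pattern also swallows punctuation that directly follows a URL, such as a trailing period, because those characters are part of the allowed character classes.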

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [30]:
pipeline = Pipeline([
    ('vect',CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf',MultiOutputClassifier(RandomForestClassifier(n_jobs=-1))),
])
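
To make the first two pipeline steps more concrete, here is a rough pure-Python sketch of what `CountVectorizer` followed by `TfidfTransformer` computes with scikit-learn's default settings (smoothed idf, L2 normalization). The three toy documents are invented and already tokenized. This is an illustration of the arithmetic, not the library's implementation.

```python
import math
from collections import Counter

# Invented, already tokenized toy documents
docs = [
    ["water", "needed", "urgently"],
    ["food", "and", "water"],
    ["road", "blocked"],
]

vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(token for doc in docs for token in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Term counts weighted by idf, then scaled to unit Euclidean length."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


matrix = [tfidf_row(doc) for doc in docs]
for token, weight in zip(vocab, matrix[0]):
    print(f"{token:10s} {weight:.3f}")
```

Tokens that occur in many documents, such as "water" here, receive a lower idf than tokens that occur in only one document.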

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = pipeline
model.fit(X_train, y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x0000024AAC972620>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [32]:
y_pred_test = model.predict(X_test)
y_pred_train = model.predict(X_train)

In [34]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_test,target_names = y.columns))

                        precision    recall  f1-score   support

               related       0.81      0.97      0.89      4964
               request       0.90      0.46      0.61      1076
                 offer       0.00      0.00      0.00        29
           aid_related       0.79      0.59      0.67      2711
          medical_help       0.53      0.05      0.09       495
      medical_products       0.74      0.05      0.10       324
     search_and_rescue       0.65      0.06      0.11       185
              security       0.00      0.00      0.00       123
              military       0.70      0.04      0.07       200
           child_alone       0.00      0.00      0.00         0
                 water       0.94      0.21      0.34       431
                  food       0.89      0.42      0.57       712
               shelter       0.87      0.19      0.31       576
              clothing       0.78      0.08      0.14        91
                 money       0.83      

  _warn_prf(average, modifier, msg_start, len(result))


The data is strongly imbalanced. In the training data, for example, "child_alone" has zero entries and "offer" has 92, compared to 14929 for "related".
This is a multilabel classification problem, so there are many metrics for measuring model performance: accuracy, precision, recall, F1 score, F-beta score, and so on. Because the problem is multilabel, each of these scores also comes as a micro and a macro average. Which metric to choose depends heavily on the use case of the classifier and on how the surrounding system is built. Are all messages reviewed by humans afterwards anyway? Are messages classified as not "related" never looked at by humans? How severe is the impact when a message that is actually a request for help is not looked at? On the other hand, how much capacity is there to process messages? If we classify too many messages as relevant, is the human team overloaded with weeding out the false positives (FPs), so that less help reaches people? There are always tradeoffs.
Since no clear use case is defined, we have to make assumptions. The only thing we know is that the data is heavily imbalanced, so we should prefer the F1 score over accuracy.
The assumption for this project is that we want a healthy tradeoff between precision and recall. Optimizing only for recall would burden the users of our classifier with manually discarding the FPs. Optimizing only for precision would miss too many relevant messages. We therefore use the F1 score, which balances the two. The argument for weighted vs. micro vs. macro F1 score is similar. If we care more about the overall number of misclassified samples, and therefore the amount of manual work, we should take the micro or the weighted F1 score. If we want to value all classes equally, including underrepresented but probably important classes such as "medical_help", we would take the macro F1 score. Looking at the classes, however, it is not clear which ones are really important. Since we do not know the actual use case, we take the micro-averaged F1 score. With this decision we reduce the overall number of false classifications humans have to deal with and accept the risk of worse performance on underrepresented but probably important classes.
This choice is taken into account when selecting the best hyperparameters.
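
The difference between the macro, weighted and micro averages can be worked through by hand. The sketch below uses invented true positive, false positive and false negative counts for one frequent and one rare label. The counts are hypothetical and not taken from the reports above.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical (tp, fp, fn) counts for a frequent and a rare label
counts = {"related": (90, 10, 10), "offer": (0, 0, 10)}

per_label = {name: f1(*c) for name, c in counts.items()}
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}

# Macro: every label counts equally, so the rare label pulls the score down
macro_f1 = sum(per_label.values()) / len(per_label)

# Weighted: per-label scores weighted by the number of true instances
weighted_f1 = sum(per_label[n] * support[n] for n in counts) / sum(support.values())

# Micro: pool the counts over all labels first, then compute a single F1
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f} micro={micro_f1:.3f}")
```

With these counts the rare label is missed completely, which halves the macro average but barely moves the micro average. This is the tradeoff accepted above by choosing the micro F1 score.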

If we knew more about the business needs, other solutions would also be possible:
- Do we want a more distinct tradeoff in favor of precision or recall? Then we could use the F-beta score and increase the importance of either one: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html
- Are there categories or combinations of categories we really do not want to miss, like "request" and "medical_help"? Then we could pass a custom `class_weight` to the RandomForest classifiers: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
- ...
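
To illustrate the F-beta idea from the first bullet, the following sketch computes the score directly from its formula, (1 + beta^2) * P * R / (beta^2 * P + R), using the precision and recall of the "medical_help" row in the test report above. The function name `fbeta` is only a local helper for this example.

```python
def fbeta(precision, recall, beta):
    """F-beta score from precision and recall; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of "medical_help" on the test set (from the report above)
p, r = 0.53, 0.05

f1 = fbeta(p, r, beta=1.0)   # balanced
f2 = fbeta(p, r, beta=2.0)   # recall counts more
f05 = fbeta(p, r, beta=0.5)  # precision counts more

print(f"F1={f1:.3f} F2={f2:.3f} F0.5={f05:.3f}")
```

Because recall is very low for this label, weighting recall more heavily (beta = 2) lowers the score further, while weighting precision more heavily (beta = 0.5) raises it.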

In [35]:
print(classification_report(y_train, y_pred_train,target_names = y.columns))

                        precision    recall  f1-score   support

               related       1.00      1.00      1.00     14942
               request       1.00      1.00      1.00      3398
                 offer       1.00      0.98      0.99        89
           aid_related       1.00      1.00      1.00      8149
          medical_help       1.00      1.00      1.00      1589
      medical_products       1.00      1.00      1.00       989
     search_and_rescue       1.00      0.99      1.00       539
              security       1.00      1.00      1.00       348
              military       1.00      0.99      1.00       660
           child_alone       0.00      0.00      0.00         0
                 water       1.00      1.00      1.00      1241
                  food       1.00      1.00      1.00      2211
               shelter       1.00      1.00      1.00      1738
              clothing       1.00      1.00      1.00       314
                 money       1.00      

  _warn_prf(average, modifier, msg_start, len(result))



The classification of the training data is far better than that of the test data. This indicates overfitting. The micro F1 score is 1.0 on the training set and 0.61 on the test data. Individual classes differ even more: "security", for example, has an F1 score of 1.00 on the training data and 0.00 on the test data.
With the RandomForest classifier, a perfect training score does not necessarily mean overfitting, though: https://towardsdatascience.com/one-common-misconception-about-random-forest-and-overfitting-47cae2e2c23b. This can be investigated by varying the maximum depth (`max_depth`) in the grid search.



### 6. Improve your model
Use grid search to find better parameters. Here we have to define the scoring metric. By default, grid search uses the default score of the classifier, which is the mean accuracy on the given test data and labels. "In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted." https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. As previously discussed, we choose the micro F1 score instead.
To counter the potential overfitting we observed, we try out different values for `max_features`, `max_depth` and `min_samples_leaf` using cross-validation. See https://stackoverflow.com/questions/20463281/how-do-i-solve-overfitting-in-random-forest-of-python-sklearn
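
How harsh subset accuracy is compared to the micro F1 score can be seen on a tiny example. The sketch below uses three invented samples with three labels each and computes both metrics by hand.

```python
# Invented multilabel ground truth and predictions (3 samples x 3 labels)
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]

# Subset accuracy: a sample only counts if ALL of its labels are correct
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Micro F1: pool every single label decision over all samples
pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p)]
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
micro_f1 = 2 * tp / (2 * tp + fp + fn)

print(f"subset accuracy={subset_accuracy:.3f} micro F1={micro_f1:.3f}")
```

Two of the three samples contain a single wrong label, so subset accuracy drops to one third even though most individual label decisions are correct.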

Before using GridSearch we first try the StartingVerbExtractor from our previous exercise as a feature. We do that separately here so that our grid search does not take even longer.

In [36]:
# Let us first add the StartingVerbExtractor from our previous exercise
class StartingVerbExtractor(BaseEstimator, TransformerMixin):

    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # skip sentences that yield no tokens to avoid an IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True
        return False

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

In [37]:
pipeline_sv_feature = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            ('starting_verb', StartingVerbExtractor())
        ])),
        ('clf', MultiOutputClassifier(RandomForestClassifier(n_jobs=-1)))
    ])
model = pipeline_sv_feature
model.fit(X_train, y_train)
y_pred_test = model.predict(X_test)
y_pred_train = model.predict(X_train)

In [39]:
print(classification_report(y_test, y_pred_test,target_names = y.columns))

                        precision    recall  f1-score   support

               related       0.81      0.97      0.89      4964
               request       0.91      0.44      0.59      1076
                 offer       0.00      0.00      0.00        29
           aid_related       0.79      0.59      0.68      2711
          medical_help       0.56      0.05      0.09       495
      medical_products       0.81      0.07      0.13       324
     search_and_rescue       0.50      0.02      0.04       185
              security       0.00      0.00      0.00       123
              military       0.70      0.04      0.07       200
           child_alone       0.00      0.00      0.00         0
                 water       0.93      0.22      0.36       431
                  food       0.89      0.37      0.52       712
               shelter       0.88      0.18      0.30       576
              clothing       0.71      0.05      0.10        91
                 money       0.83      

  _warn_prf(average, modifier, msg_start, len(result))


It seems the starting verb extractor does not help us: no improvement is visible. So now let's investigate whether we overfit, using grid search over the estimator parameters.

In [42]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__bootstrap', 'clf__estimator__ccp_alpha', 'clf__estimator__class_weight', 'clf__estimator__criterion', 'clf__estimator__max_depth', 'clf__estimator__max_features', 'clf__estimator__max_leaf_nodes', 'clf__estimator__max_samples', 'clf__estimator__min_impurity_decrease', 'clf__estimator__min_impurity_split', 'clf__estimator__min_samples_leaf', 'clf__estimator__min_samples_split', 'clf__estimator__min_weight_fraction_leaf', 'clf__estimator__n_estimators', 'clf__estimator__n_jobs', 'clf__estimator__oob_score', 'clf_

First we try to learn whether an increased number of estimators helps us. This has no effect on overfitting according to: https://towardsdatascience.com/one-common-misconception-about-random-forest-and-overfitting-47cae2e2c23b

In [45]:
parameters = {
    'clf__estimator__n_estimators': [50, 100, 200, 500]
}
cv = GridSearchCV(pipeline, param_grid=parameters, scoring="f1_micro" , verbose=10)
cv.fit(X_train, y_train)
print("Best score:" + str(cv.best_score_) + " for: ")
print(cv.best_params_)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5; 1/4] START clf__estimator__n_estimators=50.............................
[CV 1/5; 1/4] END clf__estimator__n_estimators=50;, score=0.616 total time=  56.9s
[CV 2/5; 1/4] START clf__estimator__n_estimators=50.............................
[CV 2/5; 1/4] END clf__estimator__n_estimators=50;, score=0.611 total time=  56.2s
[CV 3/5; 1/4] START clf__estimator__n_estimators=50.............................
[CV 3/5; 1/4] END clf__estimator__n_estimators=50;, score=0.604 total time=  55.5s
[CV 4/5; 1/4] START clf__estimator__n_estimators=50.............................
[CV 4/5; 1/4] END clf__estimator__n_estimators=50;, score=0.603 total time=  59.9s
[CV 5/5; 1/4] START clf__estimator__n_estimators=50.............................
[CV 5/5; 1/4] END clf__estimator__n_estimators=50;, score=0.609 total time=  55.7s
[CV 1/5; 2/4] START clf__estimator__n_estimators=100............................
[CV 1/5; 2/4] END clf__estimator__n_est

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x0000024AAC972620>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)))]),
             param_grid={'clf__estimator__n_estimators': [50, 100, 200, 500]},
             scoring='f1_micro', verbose=10)

So we see that increasing the number of estimators helps us. 50 performs worse than the default of 100. 200 performs better than 100. 500 seems to have a slight advantage, but it is slim. Therefore increasing further will probably not help. So now let's see if overfitting is a problem:

In [None]:
# max_features, max_depth and min_samples_leaf
parameters = {
    'clf__estimator__max_depth': [None, 10, 50, 100],
    # "sqrt" is what "auto" meant for classifiers; "auto" was removed in newer scikit-learn versions
    'clf__estimator__max_features': ["sqrt", "log2", None],
    'clf__estimator__min_samples_leaf': [1, 2, 3, 4]
}
cv = GridSearchCV(pipeline, param_grid=parameters, scoring="f1_micro" , verbose=10)
cv.fit(X_train, y_train)
print("Best score:" + str(cv.best_score_) + " for: ")
print(cv.best_params_)
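
Before starting a grid like the one above, it can help to count how many model fits it implies. The following is a small sketch using only the standard library. It assumes the default of 5 cross-validation folds, which matches the "Fitting 5 folds" lines in the logs of the other searches.

```python
from math import prod

# Number of values tried per parameter in the grid above
values_per_parameter = {
    'clf__estimator__max_depth': 4,
    'clf__estimator__max_features': 3,
    'clf__estimator__min_samples_leaf': 4,
}
n_folds = 5  # assumed GridSearchCV default

n_candidates = prod(values_per_parameter.values())
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

This is twelve times the 20 fits of the `n_estimators` search, so running time is a real concern for this cell.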

In [43]:
# now optimize the text pipeline separately
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1,2)),
    'vect__max_features': (None, 5000,10000),
    'tfidf__use_idf': (True, False)
}
cv = GridSearchCV(pipeline, param_grid=parameters, scoring="f1_micro" , verbose=10)
cv.fit(X_train, y_train)
print("Best score:" + str(cv.best_score_))
print(cv.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 1/5; 1/36] START tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1)
[CV 1/5; 1/36] END tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1);, score=0.624 total time= 1.5min
[CV 2/5; 1/36] START tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1)
[CV 2/5; 1/36] END tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1);, score=0.623 total time= 1.5min
[CV 3/5; 1/36] START tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1)
[CV 3/5; 1/36] END tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1);, score=0.612 total time= 1.4min
[CV 4/5; 1/36] START tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1)
[CV 4/5; 1/36] END tfidf__use_idf=True, vect__max_df=0.5, vect__max_feature

KeyboardInterrupt: 

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [25]:
print(cv.best_params_)
print(cv.best_score_)


{'clf__min_samples_split': 4, 'clf__n_estimators': 100, 'features__text_pipeline__vect__max_df': 0.5, 'features__transformer_weights': {'text_pipeline': 1, 'starting_verb': 0.25}}
0.5676834989502053


In [13]:
y_pred_test = cv.predict(X_test)
print(classification_report(y_test, y_pred_test,target_names = y.columns))

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [77]:
# already added the starting verb feature before and tested with grid search
from sklearn.naive_bayes import BernoulliNB

pipeline_bernuilly = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultiOutputClassifier(BernoulliNB()))])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1,2)),
    'vect__max_features': (None, 5000,10000),
    'tfidf__use_idf': (True, False)
}
cv = GridSearchCV(pipeline_bernuilly, param_grid=parameters, scoring="f1_micro" , verbose=10)
cv.fit(X_train, y_train)
print("Best score:" + str(cv.best_score_))
print(cv.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 1/5; 1/36] START tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1)
[CV 1/5; 1/36] END tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1);, score=0.572 total time=   1.4s
[CV 2/5; 1/36] START tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1)
[CV 2/5; 1/36] END tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1);, score=0.570 total time=   1.3s
[CV 3/5; 1/36] START tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1)
[CV 3/5; 1/36] END tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1);, score=0.569 total time=   1.3s
[CV 4/5; 1/36] START tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1)
[CV 4/5; 1/36] END tfidf__use_idf=True, vect__max_df=0.5, vect__max_feature

KeyboardInterrupt: 

In [78]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
pipeline_linear_svc = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', OneVsRestClassifier(estimator = LinearSVC(random_state=0), n_jobs = -1))])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1,2)),
    'vect__max_features': (None, 5000,10000),
    'tfidf__use_idf': (True, False)
}
cv = GridSearchCV(pipeline_linear_svc, param_grid=parameters, scoring="f1_micro" , verbose=10)
cv.fit(X_train, y_train)
print("Best score:" + str(cv.best_score_))
print(cv.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 1/5; 1/36] START tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1)




[CV 1/5; 1/36] END tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1);, score=0.681 total time=   3.1s
[CV 2/5; 1/36] START tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1)


KeyboardInterrupt: 

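It seems like OneVsRestClassifier with LinearSVC outperforms RandomForest and naive Bayes (BernoulliNB): in the first cross-validation fold with the same text parameters it reaches a micro F1 score of 0.681, compared to 0.624 for RandomForest and 0.572 for BernoulliNB. So we do have a winner!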

### 9. Export your model as a pickle file

In [12]:
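import pickle
filename = 'disaster_classifier.pkl'
# note: this saves `pipeline` (the RandomForest pipeline), not the LinearSVC pipeline
# the context manager makes sure the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(pipeline, f)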
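
Loading the model back works the same way in reverse. The sketch below shows a save and load round trip with context managers. It uses a plain dictionary as a stand-in for the fitted pipeline and a temporary directory, both of which are assumptions made only to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline (a plain dict, purely for illustration)
stand_in_model = {"name": "disaster_classifier", "n_labels": 36}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "disaster_classifier.pkl")

    # Save: the file is closed automatically when the block ends
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load: read the object back from disk
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

One thing to keep in mind for the real pipeline: pickle stores functions by reference, so the code that loads the model needs an importable `tokenize` function under the same module path as when the model was saved.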

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.