# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
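The three steps above can be sketched on a throwaway in-memory database. This is only an illustration under stated assumptions: the two rows are made up, only two label columns are included, and a plain `sqlite3` connection with `pd.read_sql` stands in for a SQLAlchemy engine.

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory stand-in for the CleanMessages table (made-up rows).
conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({
    "id": [1, 2],
    "message": ["we need water", "the road is blocked"],
    "original": ["(untranslated text)", "(untranslated text)"],
    "genre": ["direct", "news"],
    "related": [1, 1],
    "water": [1, 0],
})
toy.to_sql("CleanMessages", conn, index=False)

# Load the table back, then split it into the input text and the label columns.
df = pd.read_sql("SELECT * FROM CleanMessages", conn)
X = df["message"]
Y = df.drop(["id", "message", "original", "genre"], axis=1)
conn.close()
```

Everything that is not an identifier or free text is treated as a label column, which is the same rule the notebook applies to the real table.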

In [1]:
import sys; sys.executable

'/opt/conda/bin/python'

In [2]:
! {sys.executable} -m pip install scikit-multilearn
! {sys.executable} -m pip install arff


Collecting scikit-multilearn
  Downloading https://files.pythonhosted.org/packages/bb/1f/e6ff649c72a1cdf2c7a1d31eb21705110ce1c5d3e7e26b2cc300e1637272/scikit_multilearn-0.2.0-py3-none-any.whl (89kB)
    100% |████████████████████████████████| 92kB 3.5MB/s
Installing collected packages: scikit-multilearn
Successfully installed scikit-multilearn-0.2.0
Collecting arff
  Downloading https://files.pythonhosted.org/packages/50/de/62d4446c5a6e459052c2f2d9490c370ddb6abc0766547b4cef585913598d/arff-0.9.tar.gz
Building wheels for collected packages: arff
  Running setup.py bdist_wheel for arff ... done
  Stored in directory: /root/.cache/pip/wheels/04/d0/70/2c73afedd3ac25c6085b528742c69b9587cbdfa67e5194583b
Successfully built arff
Installing collected packages: arff
Successfully installed arff-0.9


In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# import libraries
import pandas as pd
import sqlite3
import re

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from skmultilearn.model_selection import iterative_train_test_split

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [8]:
# load data from database
engine = create_engine('sqlite:///DisasterMessages.db')
df = pd.read_sql('SELECT * FROM CleanMessages', engine)

# define variables. X is input, Y is target
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis=1) 

In [9]:
# drop the 'child alone' tag because there are no messages with this tag.
Y = Y.drop('child_alone', axis=1)

# replace 2's with 1's in the related field
Y.loc[Y.related == 2, 'related'] = 1

In [10]:
# check out the shape of the data
X.shape, Y.shape

((26216,), (26216, 35))

In [None]:
# we have 26k messages with 35 possible labels

### 2. Write a tokenization function to process your text data

In [11]:
# define a function that will allow a treebank POS tag to be converted into a WordNet
# POS Tag so the lemmatizer will understand it
def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    # default to Noun 
    else:
        return wordnet.NOUN


def tokenize(text):
    # remove all non-alphanumeric characters
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    
    # lower and strip whitespace
    text = text.lower().strip()
    
    # tokenize words
    words = word_tokenize(text)
        
    # tag words with Part of Speech - list of (word, POS) tuples 
    words_with_pos_tag = pos_tag(words)
    
    # remove stop words
    words_with_pos_tag = [word for word in words_with_pos_tag if word[0] not in stop_words]
    
    # change pos tags to wordnet pos tags for lemmatizer
    words_with_wordnet_tag = []
    
    for word_with_tag in words_with_pos_tag:
        word, tag = word_with_tag
        tag = get_wordnet_pos(tag)
        words_with_wordnet_tag.append((word, tag))

    # lemmatize
    lemm = WordNetLemmatizer()
    # unpack the (word, pos) tuple into the Lemmatizer to give better lemmatization
    words = [lemm.lemmatize(*w) for w in words_with_wordnet_tag]
    
    return words
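The cleaning steps in `tokenize` can be tried without any NLTK downloads. The sketch below is a simplified stand-in, not the function above: it assumes a tiny made-up stop list, uses whitespace splitting in place of `word_tokenize`, skips lemmatization, and writes the WordNet POS constants as their one-letter codes.

```python
import re

# Assumption: a tiny illustrative stop list; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"we", "the", "is", "a", "of", "to", "in"}

# Assumption: one-letter codes standing in for wordnet.ADJ, VERB, NOUN and ADV.
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def simple_clean(text):
    """Replace non-alphanumerics with spaces, lowercase, split, drop stop words."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    tokens = text.lower().split()
    return [t for t in tokens if t not in TOY_STOP_WORDS]


def to_wordnet_pos(treebank_tag):
    """Map a Treebank tag to a WordNet POS code, defaulting to noun."""
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1], "n")


cleaned = simple_clean("We need WATER, food & shelter in the camp!")
print(cleaned)
print(to_wordnet_pos("VBD"), to_wordnet_pos("CD"))
```

The dictionary lookup behaves like the `if`/`elif` chain in `get_wordnet_pos`: any tag whose first letter is not J, V, N or R falls back to noun.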


In [12]:
# implement a custom transformer to determine if removing stops and/or lemmatizing improves model performance

class MessageTokenizer(BaseEstimator, TransformerMixin):
    def __init__(self, remove_stops=True, lemmatize=True):
        self.remove_stops = remove_stops
        self.lemmatize = lemmatize
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_transformed = []
        
        # iterate over supplied messages
        for text in X: 
            # remove all non-alphanumeric characters
            text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    
            # lower and strip whitespace
            text = text.lower().strip()
    
            # tokenize words - nltk.tokenize.word_tokenize
            words = word_tokenize(text)
            
            if self.lemmatize:
                # tag words with Part of Speech - list of (word, POS) tuples 
                # nltk.pos_tag()
                words_with_pos_tag = pos_tag(words)
                
                if self.remove_stops:
                    # remove stop words
                    # stop_words = nltk.corpus.stopwords for the 'english' language
                    words_with_pos_tag = [word for word in words_with_pos_tag if word[0] not in stop_words]
                
                # change pos tags to wordnet pos tags for lemmatizer
                words_with_wordnet_tag = []
    
                for word_with_tag in words_with_pos_tag:
                    word, tag = word_with_tag
                    tag = get_wordnet_pos(tag)
                    words_with_wordnet_tag.append((word, tag))

                # lemmatize
                lemm = WordNetLemmatizer()
                # unpack the (word, pos) tuple into the Lemmatizer to give better lemmatization
                words = [lemm.lemmatize(*w) for w in words_with_wordnet_tag]
                
            else:
                if self.remove_stops:
                    words = [word for word in words if word not in stop_words]

            # join cleaned words back into single document
            X_transformed.append(' '.join(words))
        
        return X_transformed    

In [13]:
text = ["We would not want these words taking up space in our database, or taking up valuable processing time. For this, we can remove them easily, by storing a list of words that you consider to be stop words"]
text.append("Here is another example of words. Isn't it great how words are?")

In [14]:
MessageTokenizer(remove_stops=True, lemmatize=True).transform(text)

['would want word take space database take valuable processing time remove easily store list word consider stop word',
 'another example word great word']
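Inheriting from `BaseEstimator` is what makes the constructor flags of a transformer like `MessageTokenizer` tunable: it supplies `get_params` and `set_params`, so a pipeline can address a flag as `<step name>__<flag>`. A minimal sketch of that mechanism, using a hypothetical `LowercaseTransformer` instead of the class above:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LowercaseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer with a single on/off switch."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return [doc.lower() if self.lowercase else doc for doc in X]


# get_params() is derived from the __init__ signature by BaseEstimator.
params = LowercaseTransformer().get_params()

pipe = Pipeline([("lower", LowercaseTransformer())])
out_default = pipe.fit_transform(["Hello World"])

# The step-name prefix reaches into the pipeline, as a parameter search would.
pipe.set_params(lower__lowercase=False)
out_switched = pipe.fit_transform(["Hello World"])
```

For this to work, each constructor argument has to be stored on `self` under the same name, which is how `MessageTokenizer.__init__` is written.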

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset (35 here, since `child_alone` was dropped above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [15]:
pipeline = Pipeline([
    ('msg_tokenizer', MessageTokenizer()),
    # Count Vectorizer with Tokenizer
    ('count_vec', CountVectorizer(tokenizer=tokenize)),
    # TF-IDF Transformer
    ('tfidf', TfidfTransformer()),
    # classifier - one classifier per label
    ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))
])
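To see what the wrapper in the last step does, here is a toy version of the same pipeline shape. The six messages and the two label columns are invented, and the default `CountVectorizer` tokenizer is used so the sketch runs without NLTK.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

messages = [
    "we need water", "send water and food", "the bridge collapsed",
    "road blocked by flood", "need drinking water", "flood destroyed houses",
]
# Two hypothetical binary labels per message: [water, infrastructure]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("count_vec", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(messages, labels)

toy_pred = toy_pipeline.predict(["water please"])
# MultiOutputClassifier keeps one fitted copy of the estimator per label column.
n_fitted = len(toy_pipeline.named_steps["clf"].estimators_)
```

`predict` returns one row per message and one column per label, so the output can be turned back into a DataFrame with the label names as columns.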

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

Since this is a multilabel classification problem, I looked into the iterative train-test split supplied by skmultilearn.
Here I check whether that split keeps the label proportions of the train set closer to the full dataset than a normal split does.

In [56]:
# get the proportion of labels in the original data
compare = pd.DataFrame(Y.mean(axis=0), columns=['dataset'])

In [57]:
compare.head()

| | dataset |
|---|---|
| related | 0.77365 |
| request | 0.170659 |
| offer | 0.004501 |
| aid_related | 0.414251 |
| medical_help | 0.079493 |


In [53]:
# employ skmultilearn's iterative train test split.
# have to reshape the X values to be multidimensional since that's what this expects

X_train, y_train, X_test, y_test = iterative_train_test_split(X.values.reshape(-1,1), Y.values, test_size = 0.25)

In [58]:
# we want to see how the iterative split did with label proportions
compare['train_set'] = y_train.mean(axis=0)

In [59]:
# normal train test split - how does it compare
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, Y, test_size=0.25)

In [60]:
compare['normal_split'] = y_train2.values.mean(axis=0)

In [61]:
compare

| | dataset | train_set | normal_split |
|---|---|---|---|
| related | 0.77365 | 0.771488 | 0.775201 |
| request | 0.170659 | 0.135795 | 0.17099 |
| offer | 0.004501 | 0.004526 | 0.004476 |
| aid_related | 0.414251 | 0.414251 | 0.413997 |
| medical_help | 0.079493 | 0.086919 | 0.079595 |
| medical_products | 0.050084 | 0.054165 | 0.049842 |
| search_and_rescue | 0.027617 | 0.028837 | 0.028176 |
| security | 0.017966 | 0.019225 | 0.017191 |
| military | 0.032804 | 0.041857 | 0.032499 |
| water | 0.063778 | 0.06271 | 0.062557 |


In [68]:
diff = pd.DataFrame(compare['dataset'] - compare['normal_split'])
diff.columns = ['dataset - normalsplit']
diff['dataset - iterative split'] = compare['dataset'] - compare['train_set']
diff

| | dataset - normalsplit | dataset - iterative split |
|---|---|---|
| related | -0.001551 | 0.002162 |
| request | -0.000331 | 0.034864 |
| offer | 2.5e-05 | -2.5e-05 |
| aid_related | 0.000254 | 0.0 |
| medical_help | -0.000102 | -0.007425 |
| medical_products | 0.000242 | -0.004081 |
| search_and_rescue | -0.000559 | -0.001221 |
| security | 0.000776 | -0.001259 |
| military | 0.000305 | -0.009053 |
| water | 0.001221 | 0.001068 |


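On the labels shown, a normal train-test split does a fine job: its train-set proportions stay within about 0.0016 of the full dataset, while the iterative split is off by as much as 0.035 (for `request`).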
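One way to condense the comparison into a single number per split is the largest absolute deviation from the dataset proportions. The sketch below uses only the ten label proportions printed in the table above; `max_abs_deviation` is a helper name introduced here for illustration.

```python
# Label proportions copied from the comparison table above (first ten labels).
dataset = [0.77365, 0.170659, 0.004501, 0.414251, 0.079493,
           0.050084, 0.027617, 0.017966, 0.032804, 0.063778]
iterative = [0.771488, 0.135795, 0.004526, 0.414251, 0.086919,
             0.054165, 0.028837, 0.019225, 0.041857, 0.06271]
normal = [0.775201, 0.17099, 0.004476, 0.413997, 0.079595,
          0.049842, 0.028176, 0.017191, 0.032499, 0.062557]


def max_abs_deviation(reference, candidate):
    """Largest absolute gap between two lists of label proportions."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


worst_normal = max_abs_deviation(dataset, normal)
worst_iterative = max_abs_deviation(dataset, iterative)
print(f"normal split:    {worst_normal:.6f}")
print(f"iterative split: {worst_iterative:.6f}")
```

This covers only the labels visible in the truncated output, so it summarizes the table rather than the full set of 35 labels.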

In [16]:
# normal train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

In [76]:
# fit the pipeline
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('msg_tokenizer', MessageTokenizer(lemmatize=True, remove_stops=True)), ('count_vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [18]:
#MessageTokenizer(remove_stops=True, lemmatize=True).transform(X_train.reshape(-1,))

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
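When calling `classification_report`, argument order matters: the signature is `classification_report(y_true, y_pred, ...)`. The pure-Python sketch below is not sklearn's implementation. It uses a hypothetical helper, `precision_recall`, on made-up labels to show that swapping the two arguments interchanges precision and recall.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: three true positives exist, the model flags two rows.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

correct_order = precision_recall(y_true, y_pred)
swapped_order = precision_recall(y_pred, y_true)
print(correct_order, swapped_order)
```

With the true labels first, precision is 1/2 and recall is 1/3. With the arguments swapped the two values trade places, which matters for rare labels where precision and recall differ widely.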

In [77]:
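# note: these predictions are made on the training data, so the report below
# measures how well the model fits, not how well it generalizes
y_pred = pipeline.predict(X_train)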

In [78]:
y_pred = pd.DataFrame(y_pred, columns=Y.columns)
y_train = pd.DataFrame(y_train, columns=Y.columns)

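# note: classification_report expects (y_true, y_pred); the predictions are passed first here,
# so precision and recall are interchanged in the output and support counts predicted labels
for col in Y.columns:
    print(col, '\n', classification_report(y_pred[col], y_train[col]))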

related 
              precision    recall  f1-score   support

          0       0.98      0.99      0.98      4556
          1       1.00      0.99      0.99     14986
          2       0.88      0.99      0.93       120

avg / total       0.99      0.99      0.99     19662

request 
              precision    recall  f1-score   support

          0       1.00      0.98      0.99     16548
          1       0.92      1.00      0.96      3114

avg / total       0.99      0.99      0.99     19662

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00     19604
          1       0.62      1.00      0.77        58

avg / total       1.00      1.00      1.00     19662

aid_related 
              precision    recall  f1-score   support

          0       1.00      0.98      0.99     11733
          1       0.97      0.99      0.98      7929

avg / total       0.98      0.98      0.98     19662

medical_help 
              precision    reca

In [17]:
# compute hamming loss as well
from sklearn.metrics import hamming_loss, make_scorer
hamming_scorer = make_scorer(hamming_loss, greater_is_better=False)

# drop related column while measuring this because there are some rows with related=2
hamming_loss(y_train.drop('related', axis=1), y_pred.drop('related', axis=1))

NameError: name 'y_pred' is not defined

In [86]:
# fit and predict on the iterative split to see if it performs better
pipeline2 = Pipeline([
    ('msg_tokenizer', MessageTokenizer()),
    # Count Vectorizer with Tokenizer
    ('count_vec', CountVectorizer(tokenizer=tokenize)),
    # TF-IDF Transformer
    ('tfidf', TfidfTransformer()),
    # classifier - one classifier per label
    ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))
])

X_train2, y_train2, X_test2, y_test2 = iterative_train_test_split(X.values.reshape(-1,1), Y.values, test_size = 0.25)

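# note: this fits on X_train from the normal split, paired with y_train2 from the iterative split;
# X_train2.reshape(-1,) would pair the iterative-split messages with their own labels
pipeline2.fit(X_train.values.reshape(-1,), y_train2)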

Pipeline(memory=None,
     steps=[('msg_tokenizer', MessageTokenizer(lemmatize=True, remove_stops=True)), ('count_vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [88]:
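# note: this calls `pipeline` (fit on the normal split), not `pipeline2`, so the report below
# scores the first model on the iterative-split training rows
y_pred2 = pipeline.predict(X_train2.reshape(-1,))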
y_pred2 = pd.DataFrame(y_pred2, columns=Y.columns)
y_train2 = pd.DataFrame(y_train2, columns=Y.columns)

for col in Y.columns:
    print(col, '\n', classification_report(y_pred2[col], y_train2[col]))

related 
              precision    recall  f1-score   support

          0       0.87      0.93      0.90      4269
          1       0.98      0.96      0.97     15316
          2       0.71      0.94      0.80        77

avg / total       0.95      0.95      0.95     19662

request 
              precision    recall  f1-score   support

          0       1.00      0.97      0.98     17519
          1       0.78      0.96      0.86      2143

avg / total       0.97      0.97      0.97     19662

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00     19620
          1       0.48      1.00      0.65        42

avg / total       1.00      1.00      1.00     19662

aid_related 
              precision    recall  f1-score   support

          0       0.96      0.91      0.94     12143
          1       0.87      0.94      0.90      7519

avg / total       0.93      0.92      0.92     19662

medical_help 
              precision    reca

In [89]:
# drop related column while measuring this because there are some rows with related=2
hamming_loss(y_train2.drop('related', axis=1), y_pred2.drop('related', axis=1))

0.018713313827209248

In [None]:
# again, doesn't seem to be any advantage to it - precision is lower and hamming loss is higher. 

In [90]:
y_train

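| | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | water | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15196 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11247 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 844 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1392 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25764 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9090 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 47 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 17534 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1732 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12025 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |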


### Test three different classifiers before tuning hyperparameters

In [91]:
(y_train.related==2).mean()

0.0068660360085444003

In [92]:
# replace the related = 2 labels - this will allow me to use hamming loss
y_train.loc[y_train.related == 2, 'related'] = 1

In [98]:
# tune the grid with hamming loss instead - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html
# make a custom scorer https://scikit-learn.org/stable/modules/model_evaluation.html#scoring
from sklearn.metrics import hamming_loss, make_scorer
hamming_scorer = make_scorer(hamming_loss, greater_is_better=False)

# search over a grid of possible classifiers to determine which might be the best fit for our data
# using hamming loss as the score metric. This gives the proportion of labels which are incorrect in a multilabel output, so lower is better.

#clf_params = {
#    'clf__estimator': [RandomForestClassifier(), SVC(kernel='linear'), GaussianNB()]  
#}

#clf_cv = GridSearchCV(pipeline, clf_params, scoring=hamming_scorer)
#clf_search = clf_cv.fit(X_train, y_train)
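For intuition, Hamming loss can be computed by hand: count the label cells where prediction and truth disagree and divide by the total number of cells. The sketch below does this in plain Python on invented labels. `hamming_loss_manual` is an illustrative helper, not sklearn's function.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label cells that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


# Two made-up messages with three labels each; two of the six cells are wrong.
y_true = [[1, 0, 0], [0, 1, 1]]
y_pred = [[1, 0, 1], [0, 1, 0]]

toy_loss = hamming_loss_manual(y_true, y_pred)
# A scorer built with greater_is_better=False reports the negated loss,
# so that "higher is better" still holds during a parameter search.
toy_score = -toy_loss
print(toy_loss, toy_score)
```

Because it is a loss, a search using this scorer will show negative `best_score_` values, with values closer to zero being better.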

In [99]:
def model_compare(y_pred, y_train):
    y_pred2 = pd.DataFrame(y_pred, columns=Y.columns)
    y_train2 = pd.DataFrame(y_train, columns=Y.columns)

    for col in Y.columns:
        print(col, '\n', classification_report(y_pred2[col], y_train2[col]))
        
    print(f"Hamming Loss: {hamming_loss(y_pred, y_train)}")
    

In [100]:
# RF supports direct multi label output - don't have to wrap in a MultiOutputClassifier

pipeline_rf = Pipeline([
    ('msg_tokenizer', MessageTokenizer()),
    # Count Vectorizer with Tokenizer
    ('count_vec', CountVectorizer(tokenizer=tokenize)),
    # TF-IDF Transformer
    ('tfidf', TfidfTransformer()),
    # classifier - a single RandomForest that predicts all labels at once
    ('clf', RandomForestClassifier())
])

pipeline_rf.fit(X_train, y_train)
y_pred_rf = pipeline_rf.predict(X_train)

model_compare(y_pred_rf, y_train)

related 
              precision    recall  f1-score   support

        0.0       0.98      0.98      0.98      4627
        1.0       0.99      0.99      0.99     15035

avg / total       0.99      0.99      0.99     19662

request 
              precision    recall  f1-score   support

        0.0       1.00      0.98      0.99     16572
        1.0       0.92      1.00      0.96      3090

avg / total       0.99      0.99      0.99     19662

offer 
              precision    recall  f1-score   support

        0.0       1.00      1.00      1.00     19594
        1.0       0.73      1.00      0.84        68

avg / total       1.00      1.00      1.00     19662

aid_related 
              precision    recall  f1-score   support

        0.0       1.00      0.97      0.98     11810
        1.0       0.96      1.00      0.98      7852

avg / total       0.98      0.98      0.98     19662

medical_help 
              precision    recall  f1-score   support

        0.0       1.00      0

0.8% of labels are predicted incorrectly with RF
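A quick check that a random forest takes a two-dimensional label matrix directly, without a `MultiOutputClassifier` wrapper. The features and the two label columns below are random toy data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_toy = rng.rand(20, 4)
# Two hypothetical binary labels derived from the first two features.
Y_toy = np.column_stack([X_toy[:, 0] > 0.5, X_toy[:, 1] > 0.5]).astype(int)

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_toy, Y_toy)

rf_pred = rf.predict(X_toy[:3])
print(rf_pred.shape, rf.n_outputs_)
```

One consequence for tuning: without the wrapper, forest parameters in a pipeline are addressed as `clf__n_estimators` rather than `clf__estimator__n_estimators`.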

In [101]:
pipeline_svc = Pipeline([
    ('msg_tokenizer', MessageTokenizer()),
    # Count Vectorizer with Tokenizer
    ('count_vec', CountVectorizer(tokenizer=tokenize)),
    # TF-IDF Transformer
    ('tfidf', TfidfTransformer()),
    # classifier - one classifier per label
    ('clf', MultiOutputClassifier(estimator=SVC(kernel='linear')))
])

pipeline_svc.fit(X_train, y_train)
y_pred_svc = pipeline_svc.predict(X_train)

model_compare(y_pred_svc, y_train)

related 
              precision    recall  f1-score   support

          0       0.72      0.88      0.79      3809
          1       0.97      0.92      0.94     15853

avg / total       0.92      0.91      0.91     19662

request 
              precision    recall  f1-score   support

          0       0.98      0.94      0.96     17116
          1       0.68      0.90      0.77      2546

avg / total       0.94      0.93      0.94     19662

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00     19662
          1       0.00      0.00      0.00         0

avg / total       1.00      1.00      1.00     19662

aid_related 
              precision    recall  f1-score   support

          0       0.92      0.88      0.90     12066
          1       0.82      0.88      0.85      7596

avg / total       0.88      0.88      0.88     19662

medical_help 
              precision    recall  f1-score   support

          0       1.00      0



other_weather 
              precision    recall  f1-score   support

          0       1.00      0.96      0.98     19418
          1       0.21      0.92      0.34       244

avg / total       0.99      0.96      0.97     19662

direct_report 
              precision    recall  f1-score   support

          0       0.98      0.91      0.94     17063
          1       0.60      0.87      0.71      2599

avg / total       0.93      0.91      0.91     19662

Hamming Loss: 0.034697821759158344


The SVC classifier is worse at distinguishing the majority from the minority classes. Per-label precision and recall are lower, and the Hamming loss is higher (about 0.035, or roughly 3.5% of labels wrong, versus 0.8% for the random forest).

In [None]:
# create a DenseTransformer class to make sure the output is dense for GaussianNB

class DenseTransformer(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns a plain ndarray; todense() returns a numpy.matrix
        return X.toarray()
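The two SciPy conversion methods return different types, which is worth knowing when feeding the result to an estimator such as `GaussianNB` that needs dense input. A small sketch on a made-up count matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A made-up 2x3 sparse count matrix, like the output of a vectorizer step.
sparse_counts = csr_matrix(np.array([[0, 2, 0], [1, 0, 0]]))

as_array = sparse_counts.toarray()    # plain numpy.ndarray
as_matrix = sparse_counts.todense()   # numpy.matrix

print(type(as_array).__name__, type(as_matrix).__name__)
```

As far as I know, recent scikit-learn releases reject `numpy.matrix` input, so `toarray()` is the safer choice. Either way, densifying a large TF-IDF matrix can use a lot of memory.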

In [None]:
pipeline_nb = Pipeline([
    ('msg_tokenizer', MessageTokenizer()),
    # Count Vectorizer with Tokenizer
    ('count_vec', CountVectorizer(tokenizer=tokenize)),
    # TF-IDF Transformer
    ('tfidf', TfidfTransformer()),
    # make sure the output is dense, not sparse - needed for GaussianNB
    ('dense', DenseTransformer()),
    # classifier - one classifier per label
    ('clf', MultiOutputClassifier(estimator=GaussianNB()))
])

pipeline_nb.fit(X_train.values, y_train)
y_pred_nb = pipeline_nb.predict(X_train.values)

model_compare(y_pred_nb, y_train)

I am going with the RandomForest model because it seems to yield the best results on the training set. These are training-set scores, so they do not yet show how well the model generalizes.

### 6. Improve your model
Use grid search to find better parameters. 

In [1]:
search_params = {
    'msg_tokenizer__remove_stops': [False, True],
    'msg_tokenizer__lemmatize': [False, True],
    'count_vec__ngram_range': [(1,1), (1,2), (1,3)],
    'count_vec__max_features': [None, 100, 500, 1000],
    'tfidf__norm': [None, 'l1', 'l2'],
    'tfidf__use_idf': [False, True],
    'tfidf__smooth_idf': [False, True],
    'clf__estimator__n_estimators': [10, 100, 500],
    'clf__estimator__max_depth': [None, 50, 100, 500],
    'clf__estimator__bootstrap': [True, False],
    'clf__estimator__class_weight': [None, 'balanced']
}

cv = RandomizedSearchCV(pipeline, search_params, n_iter=5, n_jobs=-1, scoring=hamming_scorer)
search = cv.fit(X_train, y_train)

NameError: name 'RandomizedSearchCV' is not defined

In [None]:
cv2 = RandomizedSearchCV(pipeline, search_params, n_iter=10, n_jobs=-1, scoring=hamming_scorer)
search2 = cv2.fit(X_train, y_train)
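The search mechanics can be tried on a problem small enough to finish in seconds. The sketch below assumes random toy features and two invented labels, and searches only two forest parameters. It keeps the `clf__estimator__` prefix that a `MultiOutputClassifier` step requires.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss, make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X_toy = rng.rand(60, 5)
Y_toy = np.column_stack([X_toy[:, 0] > 0.5, X_toy[:, 1] > 0.5]).astype(int)

toy_pipeline = Pipeline([
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Negated Hamming loss, so the search can maximize it.
hamming_scorer = make_scorer(hamming_loss, greater_is_better=False)

toy_params = {
    "clf__estimator__n_estimators": [5, 10],
    "clf__estimator__max_depth": [None, 3],
}
toy_search = RandomizedSearchCV(
    toy_pipeline, toy_params, n_iter=2, cv=3,
    scoring=hamming_scorer, random_state=0,
)
toy_search.fit(X_toy, Y_toy)
print(toy_search.best_params_, toy_search.best_score_)
```

With `n_iter` this small only a few combinations are sampled, so on the full grid above the result depends heavily on which ones happen to be drawn. Fixing `random_state` at least makes the draw repeatable.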

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file
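A minimal sketch of the pickle round trip, using a dictionary as a stand-in for the fitted pipeline and a hypothetical file name in a temporary directory:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable Python object works the same way.
model_stand_in = {"labels": ["related", "request"], "n_estimators": 10}

# Hypothetical output location for illustration only.
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

with open(model_path, "wb") as f:
    pickle.dump(model_stand_in, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)
```

For the real pipeline, note that pickle stores references to custom classes and functions rather than their code. A pipeline that contains `MessageTokenizer` or the `tokenize` function can only be loaded where those definitions are importable.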

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
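The command-line entry point of such a script might start as follows. The argument names and descriptions are assumptions made for this sketch, and the template file may define them differently.

```python
import argparse


def build_parser():
    """Command-line interface for a hypothetical train.py."""
    parser = argparse.ArgumentParser(description="Train the message classifier.")
    parser.add_argument("database_filepath",
                        help="SQLite database holding the cleaned messages")
    parser.add_argument("model_filepath",
                        help="where to write the pickled model")
    return parser


# Parse an explicit argument list here; a real script would call parse_args()
# with no arguments so that it reads sys.argv.
args = build_parser().parse_args(["DisasterMessages.db", "classifier.pkl"])
print(args.database_filepath, args.model_filepath)
```

From there the script would load the data, build and fit the pipeline, print the evaluation, and pickle the result, following the steps in this notebook.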