# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [19]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

import nltk
nltk.download(['punkt', 'wordnet'])
nltk.download('averaged_perceptron_tagger')
nltk.download('omw')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet,stopwords

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import f1_score,accuracy_score,precision_score,recall_score,make_scorer,classification_report


import re
import pickle

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw to /root/nltk_data...
[nltk_data]   Package omw is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
# load data from database
engine = create_engine('sqlite:///data/cleaned_data.db')
df = pd.read_sql('SELECT * FROM message', engine)


In [23]:
df_tmp = df.drop(['id','message','original','genre'],axis=1)
count_per_category = df_tmp[df_tmp!=0].sum()

### 2. Write a tokenization function to process your text data

In [18]:
def replace_URLs_with_placeholder(text):
    # Regular Expression to detect URLs for http and https urls (does not cater for uppercase HTTP/S or other protocols)
    url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    
    #detect all URLs in a text message
    url_list = re.findall(url_regex, text)
    
    #remove the URLs
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
        
    return text
    
    

In [4]:
def tokenize_sentences_by_words(text):
    # this will make every sentence a token by itself
    sentence_list = nltk.sent_tokenize(text)
    
    # iterate through the sentences and make each one an array of token seperately.
    array_of_tokenized_sentences = []
    for sentence in sentence_list:
        word_tokenized_sentence = word_tokenize(sentence.lower())
        array_of_tokenized_sentences.append(word_tokenized_sentence)
    
    return array_of_tokenized_sentences

In [5]:
def tag_POS_for_sentence_tokens(array_of_tokenized_sentences):
    # take the array of tokens for each sentence seperately and get its POS tags
    array_of_tagged_sentence_tokens = []
    for sentence_tokens in array_of_tokenized_sentences:
        pos_tags = nltk.pos_tag(sentence_tokens)
        array_of_tagged_sentence_tokens.append(pos_tags)    
    return array_of_tagged_sentence_tokens

In [6]:
def lemmatize_tokens_based_on_POS_tags(array_of_tagged_sentence_tokens):
    
    # this mapping is from the POS tags to the wordnet tags understood by the lemmatization function
    tag_dict = {"J": wordnet.ADJ,"N": wordnet.NOUN,"V": wordnet.VERB,"R": wordnet.ADV}
    
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = []
    for sentence_tokens in array_of_tagged_sentence_tokens:
        for token_pair in sentence_tokens:
            token = token_pair[0]
            stop_words = set(stopwords.words('english'))
            if (token not in stop_words) & token.isalpha():
                oldTag = token_pair[1].upper()
                newTag = tag_dict.get(oldTag, wordnet.NOUN)
                # Here we lemmatize based on the POS tag for better accuracy of lemmatization
                newToken = lemmatizer.lemmatize(token,newTag)
                lemmatized_tokens.append(newToken)
    return lemmatized_tokens

In [7]:
def tokenize(text):
    text = replace_URLs_with_placeholder(text)
    array_of_tokenized_sentences = tokenize_sentences_by_words(text)
    array_of_tagged_sentence_tokens = tag_POS_for_sentence_tokens(array_of_tokenized_sentences)
    lemmatized_tokens = lemmatize_tokens_based_on_POS_tags(array_of_tagged_sentence_tokens)
    return lemmatized_tokens

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [8]:
def model_pipeline():
    pipeline = Pipeline(
                            [
                                ('text_pipeline', Pipeline(
                                                                [
                                                                    ('vect', CountVectorizer(tokenizer=tokenize)),
                                                                    ('tfidf', TfidfTransformer())

                                                                ]
                                                            )),
                                ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=10,n_jobs=12)))
    
                            ]
                        )
    return pipeline

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [9]:
X = df['message']
y = df.drop(['id', 'message', 'original', 'genre'], axis = 1)

In [10]:
def train_valid_test_split(X,y):
    # split the dataset to training, validation, and testing sets
    X_others, X_test, y_others, y_test = train_test_split(X, y,test_size=0.1, random_state = 42)
    X_train, X_valid, y_train, y_valid = train_test_split(X_others, y_others,test_size=0.05, random_state = 42)
    return X_train,X_valid,X_test,y_train,y_valid,y_test

In [11]:
# validation sets will be used to quickly test the fitting function for code errors
X_train,X_valid,X_test,y_train,y_valid,y_test = train_valid_test_split(X,y)

In [12]:
model = model_pipeline()

In [13]:
model.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('text_pipeline',
                 Pipeline(memory=None,
                          steps=[('vect',
                                  CountVectorizer(analyzer='word', binary=False,
                                                  decode_error='strict',
                                                  dtype=<class 'numpy.int64'>,
                                                  encoding='utf-8',
                                                  input='content',
                                                  lowercase=True, max_df=1.0,
                                                  max_features=None, min_df=1,
                                                  ngram_range=(1, 1),
                                                  preprocessor=None,
                                                  stop_words=None,
                                                  strip_accents=None,
                                                  token_patter

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [14]:
y_test_pred = model.predict(X_test)

In [15]:
def print_model_metrics(y_pred,y_target,categories):
    y_target = pd.DataFrame(y_target,columns=categories)
    y_pred = pd.DataFrame(y_pred,columns=categories)
    
    for category in categories:
        print("Scores for Category '"+category+"'")
        temp = classification_report(y_target[category],y_pred[category])
        print(temp) 
        

In [16]:
print_model_metrics(y_test_pred,y_test,y_test.columns.values)

Scores for Category 'related'
              precision    recall  f1-score   support

           0       0.57      0.27      0.36       646
           1       0.79      0.93      0.85      1951
           2       0.00      0.00      0.00        21

    accuracy                           0.76      2618
   macro avg       0.45      0.40      0.41      2618
weighted avg       0.73      0.76      0.73      2618

Scores for Category 'request'
              precision    recall  f1-score   support

           0       0.85      0.98      0.91      2142
           1       0.73      0.20      0.32       476

    accuracy                           0.84      2618
   macro avg       0.79      0.59      0.61      2618
weighted avg       0.83      0.84      0.80      2618

Scores for Category 'offer'
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      2601
           1       0.00      0.00      0.00        17

    accuracy                           0

  'precision', 'predicted', average, warn_for)



Scores for Category 'transport'
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      2504
           1       0.00      0.00      0.00       114

    accuracy                           0.96      2618
   macro avg       0.48      0.50      0.49      2618
weighted avg       0.91      0.96      0.93      2618

Scores for Category 'buildings'
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      2482
           1       0.00      0.00      0.00       136

    accuracy                           0.95      2618
   macro avg       0.47      0.50      0.49      2618
weighted avg       0.90      0.95      0.92      2618

Scores for Category 'electricity'
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      2571
           1       0.00      0.00      0.00        47

    accuracy                           0.98      2618
   macro avg       0.49      

### 6. Improve your model
Use grid search to find better parameters. 

In [17]:
model = model_pipeline()

In [22]:
RandomForest_parameters = {
    'clf__estimator__n_estimators': list(range(50,151,25)),
    'clf__estimator__max_features': ["sqrt","log2"]
}

# 12 jobs are used to utilize the multiple cores of the CPU. 
# If it fails to execute try changing the number of jobs and run again. 
# If it keeps failing, remove the n_jobs parameter to run the optimization on a single core
cv_random_forest = GridSearchCV(estimator=model, param_grid=RandomForest_parameters, verbose=3,n_jobs=12)

In [23]:
cv_random_forest.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  18 out of  30 | elapsed: 42.7min remaining: 28.5min
[Parallel(n_jobs=12)]: Done  30 out of  30 | elapsed: 54.7min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('text_pipeline',
                                        Pipeline(memory=None,
                                                 steps=[('vect',
                                                         CountVectorizer(analyzer='word',
                                                                         binary=False,
                                                                         decode_error='strict',
                                                                         dtype=<class 'numpy.int64'>,
                                                                         encoding='utf-8',
                                                                         input='content',
                                                                         lowercase=True,
                                                                     

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [24]:
y_test_pred = cv_random_forest.predict(X_test)
print_model_metrics(y_test_pred,y_test,y_test.columns.values)

Scores for Category 'related'
              precision    recall  f1-score   support

           0       0.74      0.45      0.56       646
           1       0.84      0.94      0.89      1951
           2       0.38      0.67      0.48        21

    accuracy                           0.82      2618
   macro avg       0.65      0.68      0.64      2618
weighted avg       0.81      0.82      0.80      2618

Scores for Category 'request'
              precision    recall  f1-score   support

           0       0.89      0.98      0.93      2142
           1       0.83      0.45      0.59       476

    accuracy                           0.88      2618
   macro avg       0.86      0.72      0.76      2618
weighted avg       0.88      0.88      0.87      2618

Scores for Category 'offer'
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      2601
           1       0.00      0.00      0.00        17

    accuracy                           0

  'precision', 'predicted', average, warn_for)


              precision    recall  f1-score   support

           0       0.99      1.00      1.00      2592
           1       1.00      0.08      0.14        26

    accuracy                           0.99      2618
   macro avg       1.00      0.54      0.57      2618
weighted avg       0.99      0.99      0.99      2618

Scores for Category 'earthquake'
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      2407
           1       0.83      0.84      0.84       211

    accuracy                           0.97      2618
   macro avg       0.91      0.91      0.91      2618
weighted avg       0.97      0.97      0.97      2618

Scores for Category 'cold'
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      2576
           1       0.83      0.12      0.21        42

    accuracy                           0.99      2618
   macro avg       0.91      0.56      0.60      2618
weighted avg  

In [None]:
pickle.dump(cv_random_forest, open('RandomForestModel.pkl', 'wb'))

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [26]:
def model_pipeline2():
    pipeline = Pipeline(
                            [
                                ('text_pipeline', Pipeline(
                                                                [
                                                                    ('vect', CountVectorizer(tokenizer=tokenize)),
                                                                    ('tfidf', TfidfTransformer())

                                                                ]
                                                            )),
                                ('clf', MultiOutputClassifier(AdaBoostClassifier()))
    
                            ]
                        )
    return pipeline

In [27]:
model2 = model_pipeline2()

In [28]:
parameters_AdaBoost = {
    'clf__estimator__n_estimators' : list(range(50,151,25)),
    'clf__estimator__learning_rate': [0.01,0.05,0.1,0.25]
}


cv_AdaBoost = GridSearchCV(estimator=model2, param_grid=parameters_AdaBoost,verbose=3,n_jobs=12)

In [30]:
cv_AdaBoost.fit(X_train, y_train)

[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.


Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=12)]: Done   8 tasks      | elapsed: 12.9min
[Parallel(n_jobs=12)]: Done  58 out of  60 | elapsed: 62.5min remaining:  2.2min
[Parallel(n_jobs=12)]: Done  60 out of  60 | elapsed: 62.6min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('text_pipeline',
                                        Pipeline(memory=None,
                                                 steps=[('vect',
                                                         CountVectorizer(analyzer='word',
                                                                         binary=False,
                                                                         decode_error='strict',
                                                                         dtype=<class 'numpy.int64'>,
                                                                         encoding='utf-8',
                                                                         input='content',
                                                                         lowercase=True,
                                                                     

In [31]:
y_test_pred = cv_AdaBoost.predict(X_test)
print_model_metrics(y_test_pred,y_test,y_test.columns.values)

Scores for Category 'related'
              precision    recall  f1-score   support

           0       0.71      0.05      0.09       646
           1       0.75      0.99      0.86      1951
           2       1.00      0.05      0.09        21

    accuracy                           0.75      2618
   macro avg       0.82      0.36      0.35      2618
weighted avg       0.75      0.75      0.66      2618

Scores for Category 'request'
              precision    recall  f1-score   support

           0       0.88      0.98      0.93      2142
           1       0.80      0.42      0.55       476

    accuracy                           0.87      2618
   macro avg       0.84      0.70      0.74      2618
weighted avg       0.87      0.87      0.86      2618

Scores for Category 'offer'
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      2601
           1       0.00      0.00      0.00        17

    accuracy                           0

  'precision', 'predicted', average, warn_for)



Scores for Category 'other_weather'
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      2481
           1       0.60      0.07      0.12       137

    accuracy                           0.95      2618
   macro avg       0.78      0.53      0.55      2618
weighted avg       0.93      0.95      0.93      2618

Scores for Category 'direct_report'
              precision    recall  f1-score   support

           0       0.86      0.98      0.92      2107
           1       0.82      0.33      0.48       511

    accuracy                           0.86      2618
   macro avg       0.84      0.66      0.70      2618
weighted avg       0.85      0.86      0.83      2618



### 9. Export your model as a pickle file

In [32]:
pickle.dump(cv_AdaBoost, open('AdaBoost_model.pkl', 'wb'))

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.