# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np
import re, unicodedata, pickle
from sqlalchemy import create_engine
import utils

In [None]:
# download nltk libraries
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])

In [2]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report, f1_score, make_scorer

In [4]:
# load data from database
engine = create_engine('sqlite:///CategorisedMessages.db')
df = pd.read_sql_table("Messages", engine)
X = df['message'].values
Y = df[df.columns.difference(['id', 'message', 'original', 'genre'])] # all columns except these 4

In [10]:
type(Y.columns)

pandas.core.indexes.base.Index

### 2. Write a tokenization function to process your text data

Each Twitter message should be processed as follows:
 - Clean: remove URLs, punctuation marks, symbols, etc.
 - Normalise
 - Tokenise
 - Remove stop words
 - Apply part-of-speech tagging and named entity recognition
 - Stem and lemmatize

In [18]:
# check for html tags
utils.run_regex(X, utils.regex_html_tag, pout=-1)

Detected: [' ', 'a']: Index=9949
Source: Haitians in U.S. watch earthquake #news and worry http bit.ly 6PkPGv News #CNN favorited false in_reply_to_user_id null in_reply_to_status_id null in_reply_to_screen_name null source <a href= http ping.fm rel= nofollow >Ping.fm< a>
Detected: [' ', 'a']: Index=9984
Source: RT compassion Donate to our Disaster Relief Fund to help those affected by the Haiti earthquake http bit.ly 6l9Xhv favorited false in_reply_to_user_id null in_reply_to_status_id null in_reply_to_screen_name null source <a href= http www.tweetdeck.com rel= nofollow >TweetDeck< a>
Detected: ['(']: Index=23651
Source: Observe strict hygienic practice namely; wash hands every time before eating and during food preparation, clean and disinfect every surfaces and utensils used for cooking, store cooked food separate from fresh food, cooking the food thoroughly and store cooked food in appropriate temperature, do not left cooked food in the room temperature more than 2 hours, keep coo

These 3 data points should be fixed or removed!

In [19]:
# check for urls
utils.run_regex(X, utils.regex_url, pout=5)

Detected: ['http://www.jobpaw.com/']: Index=5026
Source: If you want to find a Job at an NGO or the Government, upload your resume at http://www.jobpaw.com/ 
Detected: ['http://welcome.topuertorico.org/government.shtml']: Index=5296
Source: NOTES: WHAT A JERK ,ALL HAITIANS DONT HAVE ANYTHING TO EAT ,AND ''HE'' ORDERING 3 DAYS WITHOUT FOOD LIKE SUPPORT FOR THOSE WITHOUT FOOD? http://welcome.topuertorico.org/government.shtml
Detected: ['http://wap.sina.comhttp://wap.sina.com']: Index=7343
Source: http://wap.sina.comhttp://wap.sina.com 
Detected: ['http://ea.mobile.nokia.com/ea/graphics']: Index=8850
Source: Nokia.com http://ea.mobile.nokia.com/ea/graphics 
Detected: ['http://172.16.3.136/mymain2.php', 'http://172.16.3.136/mymain2.php']: Index=9723
Source: BEGIN:VBKM VERSION:1.0 TITLE:Digicel Live Ha URL:http://172.16.3.136/mymain2.php BEGIN:ENV X-IRMC-URLQUOTED-PRINTABLE: InternetShortcut  URLhttp://172.16.3.136/mymain2.php END:ENV END:VBKM  
Total: 669


In [20]:
# check accented unicode characters
utils.check_accented_chars(X, pout=2)

Detected at Index=10634
Source: Clothes ( men 's , women , s girls 6-8 , baby 0-6 ) . Some blankets , canned food , baby formula , a baby bottle , soap , toothpaste .. . Looking for more stuff . I can also donate frozen homemade baby food , it 's all veg/fruit purée frozen into cubes .
Detected at Index=11219
Source: @meliithebest26 JUAA F #UCK SANDY I GOT MY PLÁTANOS AND AREPAS READY : D you already know how i am with food : D
Total: 263


In [17]:
# TODO: check if there are labels except 0 and 1
np.unique(Y)
# Y.related.loc[(Y['related']==2)] = 1
# Y.related[Y['related']==2]

array([0, 1, 2])
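
The output above shows a stray label `2`. The commented-out lines suggest it sits in the `related` column and could be folded into `1`. The following is a minimal sketch under that assumption; whether `2` should become `1`, become `0`, or be dropped is a judgment call the data does not settle. `Y_demo` is a hypothetical stand-in for `Y`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Y; the real frame has one column per category.
Y_demo = pd.DataFrame({"related": [0, 1, 2, 1], "storm": [0, 0, 1, 0]})

# List the columns that contain values other than 0 and 1.
non_binary = [c for c in Y_demo.columns if not Y_demo[c].isin([0, 1]).all()]
print(non_binary)

# Assumption: treat 2 as 1. A single .loc call avoids chained assignment.
Y_demo.loc[Y_demo["related"] == 2, "related"] = 1
print(np.unique(Y_demo.values))
```

If `Y` is a slice of `df`, taking an explicit `.copy()` first avoids pandas' chained-assignment warning.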

In [9]:
# TODO: check for data with no labels
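
One possible way to tackle this TODO is to count the rows whose labels sum to zero. The sketch below uses a hypothetical toy frame `Y_demo` in place of `Y` and assumes all labels are already binary.

```python
import pandas as pd

# Hypothetical stand-in for Y with one row (index 1) that has no labels set.
Y_demo = pd.DataFrame({"related": [1, 0, 1], "storm": [0, 0, 1]})

# A row is unlabeled when every category column is 0.
unlabeled = Y_demo.sum(axis=1) == 0
print("rows without labels:", int(unlabeled.sum()))
print("their indices:", Y_demo.index[unlabeled].tolist())
```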

The Twitter messages contain
 - URLs
 - HTML tag remnants
 - special and accented characters

These must be cleaned during or before tokenisation.

NOTE: Open questions. Should we also
- expand contractions (e.g. I'm -> I am)?
- annotate the text with PoS tags?
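
To make the contraction question concrete, here is a rough sketch of a regex-based expansion. The mapping `CONTRACTIONS` and the function `expand_contractions` are hypothetical and deliberately tiny; a real list would be much longer, and some contractions are ambiguous (for example "it's" can also mean "it has").

```python
import re

# Hypothetical, deliberately small mapping used only for illustration.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is", "can't": "cannot"}

# One alternation pattern over all known contractions, matched case-insensitively.
_pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b", flags=re.IGNORECASE
)

def expand_contractions(text: str) -> str:
    """Replace known contractions with their expanded, lowercase form."""
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm sure it's fine, don't worry"))
```

Most expanded words ("i", "am", "is", "do", "not") are in NLTK's English stop word list, so the main effect would be to stop fragments such as "m" or "don" from surviving the punctuation stripping as tokens.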

In [7]:
STOP_WORDS = set(stopwords.words("english")) # build the stop word set once, not once per token

def tokenize(text: str) -> list:
    # strip accents: decompose characters, then drop everything that is not ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    text = text.lower() # convert to lowercase
    text = re.sub(utils.regex_url, "urlplaceholder", text) # replace urls with a placeholder
    text = re.sub(utils.regex_non_alphanumeric, " ", text) # replace everything that is not a letter or number with a space
    tokens = word_tokenize(text) # tokenize words
    words = [w for w in tokens if w not in STOP_WORDS] # remove stopwords
    # TODO: Improve lemmatization and stemming
    lemmatizer = WordNetLemmatizer() # create once per call instead of once per word
    stemmer = PorterStemmer()
    lemmed = [lemmatizer.lemmatize(w, pos='v') for w in words] # lemmatize as verbs by specifying pos
    stem_words = [stemmer.stem(w) for w in lemmed] # reduce the lemmas to their stems

    return stem_words

Test tokenisation

In [8]:
text = X[0]
print(text)
tokenize(text)

Weather update - a cold front from Cuba that could pass over Haiti


['weather', 'updat', 'cold', 'front', 'cuba', 'could', 'pass', 'haiti']

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [8]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(), 
                                  n_jobs=-1))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,Y)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [14]:
utils.compare_test_train(100, X_test, y_test, y_pred)

X_test[100] = Stay tuned to CBS47 This Morning for updates on Haitian earthquake recovery efforts.. and storm updates. Have a good night Zara
columnname:predicted(label)
earthquake:1(1) infrastructure_related:0(1) other_infrastructure:0(1) related:1(1) storm:1(1) weather_related:1(1) 


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [15]:
col = 2
print(y_test.columns[col])
print(classification_report(y_test.iloc[:,col].values, y_pred[:,col]))

buildings
              precision    recall  f1-score   support

           0       0.95      1.00      0.98      6220
           1       0.72      0.11      0.20       334

    accuracy                           0.95      6554
   macro avg       0.84      0.56      0.59      6554
weighted avg       0.94      0.95      0.94      6554
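
The cell above reports a single column (`col = 2`). Iterating over every output column, as the instructions suggest, could look roughly like the sketch below. `y_test_demo` and `y_pred_demo` are hypothetical toy stand-ins for `y_test` (a DataFrame) and `y_pred` (a NumPy array).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical toy labels standing in for y_test and y_pred.
y_test_demo = pd.DataFrame({"buildings": [0, 1, 0, 1], "storm": [1, 1, 0, 0]})
y_pred_demo = np.array([[0, 1], [1, 1], [0, 0], [0, 1]])

reports = {}
for i, column in enumerate(y_test_demo.columns):
    # zero_division=0 reports 0 instead of warning when a class is never predicted
    reports[column] = classification_report(
        y_test_demo[column].values, y_pred_demo[:, i], zero_division=0
    )
    print(column)
    print(reports[column])
```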



Since this is a classification task, both precision and recall are important. The `classification_report` shows that label `0` is predicted very well, while recall for label `1` is only 0.11. The vast majority of the data is labelled `0`, so the model is clearly biased towards `0`. Therefore, to get a good balance between precision and recall across both labels, it's better to use the **macro average of the f1-score** as the scoring metric.

In [16]:
# model score
utils.model_score(y_test, y_pred)

0.6177008663917776

The current model has an average f1-score of about 0.62. This needs to be improved!
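
The implementation of `utils.model_score` is not shown in this notebook. One plausible definition, given purely as an assumption, is the mean of the macro-averaged f1-score over all output columns. The function name `mean_macro_f1` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

def mean_macro_f1(y_true, y_pred) -> float:
    """Average the macro f1-score over all output columns (assumed definition)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = [
        f1_score(y_true[:, i], y_pred[:, i], average="macro", zero_division=0)
        for i in range(y_true.shape[1])
    ]
    return float(np.mean(scores))

# Hypothetical toy labels: a perfect prediction and one with two mistakes.
y_true_demo = pd.DataFrame({"a": [0, 1, 0, 1], "b": [1, 1, 0, 0]})
y_pred_demo = np.array([[0, 1], [1, 1], [0, 0], [0, 1]])

print(mean_macro_f1(y_true_demo, y_true_demo.values))
print(mean_macro_f1(y_true_demo, y_pred_demo))
```

The macro average weights labels `0` and `1` equally, so a column where the rare label `1` is mostly missed pulls the score down even when accuracy is high.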

In [17]:
scores_table = utils.create_scores_table(y_test, y_pred)
scores_table

category,precision,recall,f1-score,accuracy
aid_centers,0.494202,0.5,0.497084,0.988404
aid_related,0.779454,0.769448,0.772875,0.78227
buildings,0.89138,0.555878,0.587143,0.95209
child_alone,1.0,1.0,1.0,1.0
clothing,0.909782,0.571196,0.61837,0.98581
cold,0.931463,0.552661,0.589359,0.980317
death,0.951929,0.5817,0.629685,0.961398
direct_report,0.82079,0.666439,0.702054,0.851388
earthquake,0.939184,0.890406,0.912988,0.972078
electricity,0.936423,0.535321,0.561492,0.983827


As I guessed, accuracy is driven mostly by the `0` values, because they make up the vast majority of the data points (roughly 95 percent). Therefore, the f1-score is a fairer scoring metric.

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
# current parameters
pipeline.get_params()

In [None]:
# create model scorer
model_scorer = make_scorer(utils.model_score)

# parameters to optimise
parameters = {
    'clf__estimator__bootstrap': [True, False],
    'clf__estimator__max_depth': [100, 200, None],
    # 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'clf__estimator__max_features': ['sqrt'],
    'clf__estimator__min_samples_leaf': [2, 4],
    'clf__estimator__min_samples_split': [5, 10],
    # the key was listed twice; only the last entry took effect, so that one is kept
    'clf__estimator__n_estimators': [100, 200, 500]
}

# cross validation
cv = GridSearchCV(pipeline, 
    param_grid=parameters,
    scoring=model_scorer, 
    verbose=3
    )

cv.fit(X_train, y_train)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [21]:
print(cv.best_estimator_)
print(cv.best_score_)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x7f14be5e1940>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=False,
                                                                        min_samples_leaf=2,
                                                                        min_samples_split=10),
                                       n_jobs=-1))])
0.5858000713775421


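This did not improve the model score: the best mean cross-validated f1-score is about 0.59, compared with about 0.62 on the test split for the untuned model. A different approach is needed.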

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [11]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn import model_selection

ml_models = {
    'KNN' : KNeighborsClassifier(),
    'SVM': OneVsRestClassifier(SVC(kernel='rbf')),
    'NN': MLPClassifier()
}

In [None]:
kfold = model_selection.KFold() # default of 5 folds, shared by all models
for name, model in ml_models.items():
    cv_pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(model, n_jobs=1)) ])
    try:
        cv_results = model_selection.cross_val_score(cv_pipeline, X_train, y_train, cv=kfold, scoring=model_scorer)
        print(f'{name:5s}: {cv_results.mean():.5f} ± {cv_results.std():.5f}')
    except Exception as exc: # report the failure instead of silently hiding it
        print(f'{name}: Error! {exc}')


Interesting results! The cross-validated f1-scores (mean ± standard deviation) are as follows:
 - `KNeighborsClassifier()` : 0.56259 ± 0.02197
 - `OneVsRestClassifier(SVC(kernel='rbf'))` : 0.63631 ± 0.00341
 - `MLPClassifier()`: 0.65270 ± 0.00195

Both SVC and MLP perform better than the random forest, even with default parameters. Good candidates for further study!

Things to keep in mind / optimise / learn further
 - SVC only worked with `n_jobs=1`
 - Some folds contain only a single label value for some categories. This needs to be rectified. The data set is imbalanced, so stratified splitting is recommended (this is not easy, since the data set is multi-label)
 - Some estimators do not accept the sparse matrix returned by tf-idf. You may need to convert it into a dense array
 - SVC does not work without OneVsRestClassifier. Why?
 - It's good to optimise the model, but also look for ways to improve the data. Better-quality data will improve the score
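
As an illustration of the sparse-matrix point, the sketch below inserts a densifying step between tf-idf and an estimator that rejects sparse input. `GaussianNB` is used only as an example of such an estimator; the helper `to_dense`, the toy texts and the labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def to_dense(matrix):
    """Convert a scipy sparse matrix into a dense NumPy array."""
    return matrix.toarray()

dense_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("dense", FunctionTransformer(to_dense)),  # densify before the estimator
    ("clf", GaussianNB()),                     # does not accept sparse input
])

# Hypothetical toy data, single-label for brevity.
texts = ["need water and food", "storm is coming", "food supplies needed", "heavy storm tonight"]
labels = [0, 1, 0, 1]

dense_pipeline.fit(texts, labels)
print(dense_pipeline.predict(["we need food"]))
```

Densifying a large vocabulary can use a lot of memory, so this is only practical for small feature spaces or after dimensionality reduction.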

Read these for more details:
 - [How to use sklearn train_test_split to stratify data for multi-label classification?](https://datascience.stackexchange.com/questions/45174/how-to-use-sklearn-train-test-split-to-stratify-data-for-multi-label-classificat)
 - [Scikit-Learn's Pipeline: A sparse matrix was passed, but dense data is required](https://stackoverflow.com/questions/28384680/scikit-learns-pipeline-a-sparse-matrix-was-passed-but-dense-data-is-required)
 - [UserWarning: Label not :NUMBER: is present in all training examples](https://stackoverflow.com/questions/42821315/userwarning-label-not-number-is-present-in-all-training-examples/42956097)
 - [Scikit-Learn: Label not x is present in all training examples](https://stackoverflow.com/questions/34561554/scikit-learn-label-not-x-is-present-in-all-training-examples)
 - [Multi-Label Classification in Python](http://scikit.ml/index.html)


### 9. Export your model as a pickle file

In [20]:
# use a context manager so the file is flushed and closed
with open('rfc_auto.pkl', 'wb') as f:
    pickle.dump(pipeline, f)
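
For completeness, here is a minimal sketch of the save-and-load round trip, using a hypothetical small pipeline and a temporary file rather than the trained model. Note that pickle stores functions by reference, so any script that loads the real pipeline must be able to import or define `tokenize` under the same module path.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical small pipeline standing in for the trained one.
demo = Pipeline([("vect", CountVectorizer()), ("clf", LogisticRegression())])
demo.fit(["need water", "storm coming", "need food", "big storm"], [0, 1, 0, 1])

# Write the model to a temporary location.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(demo, f)

# Load it back and check that it still predicts.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(["need water"]))
```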

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.