# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [179]:
# import libraries
import pandas as pd
import numpy as np
import os
from sqlalchemy import create_engine

# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 
import re

#sklearn
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline,  FeatureUnion
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

[nltk_data] Downloading package punkt to /Users/jeffsan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jeffsan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jeffsan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [180]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql("SELECT * FROM messages", engine)
X = df[['message', 'genre']]
Y = df.drop(columns=['id', 'message', 'original','genre'])


### 2. Write a tokenization function to process your text data

In [181]:
def tokenize(text):
    #remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    #tokenize text
    tokens = word_tokenize(text)
    
    # initiate lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    #iterate for each tokens
    clean_tokens = []
    for tok in tokens:
        
        if tok not in stopwords.words('english'):
            # lemmatize, normalize case, and remove leading/trailing white space
            clean_tok = lemmatizer.lemmatize(tok).lower().strip()

            clean_tokens.append(clean_tok)
    
    return clean_tokens
    

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [182]:
from sklearn.preprocessing import FunctionTransformer

get_msg_data = FunctionTransformer(lambda x: x['message'], validate=False)
get_genre_data = FunctionTransformer(lambda x: pd.get_dummies(x['genre']), validate=False)

In [184]:
msg_pipeline = Pipeline([
    ('msg_selector', get_msg_data),
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
])

features_pipeline_union = FeatureUnion([
    ('msg_pipeline', msg_pipeline),
    ('genre_pipeline', get_genre_data)
])

pipeline = Pipeline([
    ('features', features_pipeline_union),
    ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=42), n_jobs=-1))
    #('clf', OneVsRestClassifier(LogisticRegression()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [185]:
X_train, X_test, y_train, y_test = tts(X,Y,test_size=0.33, random_state= 42)

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('msg_pipeline', Pipeline(memory=None,
     steps=[('msg_selector', FunctionTransformer(accept_sparse=False,
          func=<function <lambda> at 0x1364bb510>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='dep...
            oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=-1))])

### 5. Test your model
Report the f1 score, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 

In [None]:
""" Performance on Training Dataset """
y_pred = pipeline.predict(X_train)
labels = y_train.columns.tolist()
class_report = classification_report(y_train, y_pred, target_names=labels)
accuracy = (y_pred == y_train).mean()

#print("Labels:\n", labels)
print("\nClassification report:\n", class_report)
print("\nAccuracy:\n", accuracy)

In [193]:
""" Performance on Training Dataset """
labels = y_test.columns.tolist()
class_report = classification_report(y_test, y_pred, target_names=labels)
accuracy = (y_pred == y_test).mean()

#print("Labels:\n", labels)
print("\nClassification report:\n", class_report)
print("\nAccuracy:\n", accuracy)

Labels:
 ['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']

Classification report:
                         precision    recall  f1-score   support

               related       0.85      0.92      0.88      6542
               request       0.80      0.46      0.58      1484
                 offer       0.00      0.00      0.00        36
           aid_related       0.75      0.60      0.67      3564
          medical_help       0.56      0.10      0.17       690
      medical_products       0.75      0.08      0.15       405
     search_and_rescue       0.65 

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [198]:
from sklearn.metrics import precision_recall_fscore_support

for i,label in enumerate(y_test.columns):
    prec, recall, f1, _ = precision_recall_fscore_support(y_test[label], y_pred[:,i], average='binary')
    print("label {} : precision = {:.2f}, recall = {:.2f}, f1 = {:.2f}".format(label, prec, recall, f1))

label related : precision = 0.85, recall = 0.92, f1 = 0.88
label request : precision = 0.80, recall = 0.46, f1 = 0.58
label offer : precision = 0.00, recall = 0.00, f1 = 0.00
label aid_related : precision = 0.75, recall = 0.60, f1 = 0.67
label medical_help : precision = 0.56, recall = 0.10, f1 = 0.17
label medical_products : precision = 0.75, recall = 0.08, f1 = 0.15
label search_and_rescue : precision = 0.65, recall = 0.14, f1 = 0.23
label security : precision = 0.12, recall = 0.01, f1 = 0.01
label military : precision = 0.61, recall = 0.11, f1 = 0.19
label child_alone : precision = 0.00, recall = 0.00, f1 = 0.00
label water : precision = 0.80, recall = 0.36, f1 = 0.50
label food : precision = 0.85, recall = 0.39, f1 = 0.53
label shelter : precision = 0.85, recall = 0.31, f1 = 0.46
label clothing : precision = 0.85, recall = 0.19, f1 = 0.31
label money : precision = 0.73, recall = 0.04, f1 = 0.07
label missing_people : precision = 0.67, recall = 0.02, f1 = 0.04
label refugees : precis

  'precision', 'predicted', average, warn_for)


In [129]:
report = pd.DataFrame(index = y_test.columns)
from sklearn.metrics import f1_score, recall_score, precision_score
for i,label in enumerate(y_test.columns):
    f_1 = f1_score(y_test[label], y_pred[:,i], average='micro')
    rec = recall_score(y_test[label], y_pred[:,i], average='micro')
    precision = precision_score(y_test[label], y_pred[:,i], average='micro')
    print("label {} : precision = {:.2f}, recall = {:.2f}, f1 = {:.2f}".format(label, precision, rec, f_1))

label related : precision = 0.81, recall = 0.81, f1 = 0.81
label request : precision = 0.89, recall = 0.89, f1 = 0.89
label offer : precision = 1.00, recall = 1.00, f1 = 1.00
label aid_related : precision = 0.75, recall = 0.75, f1 = 0.75
label medical_help : precision = 0.92, recall = 0.92, f1 = 0.92
label medical_products : precision = 0.95, recall = 0.95, f1 = 0.95
label search_and_rescue : precision = 0.97, recall = 0.97, f1 = 0.97
label security : precision = 0.98, recall = 0.98, f1 = 0.98
label military : precision = 0.97, recall = 0.97, f1 = 0.97
label child_alone : precision = 1.00, recall = 1.00, f1 = 1.00
label water : precision = 0.95, recall = 0.95, f1 = 0.95
label food : precision = 0.93, recall = 0.93, f1 = 0.93
label shelter : precision = 0.93, recall = 0.93, f1 = 0.93
label clothing : precision = 0.99, recall = 0.99, f1 = 0.99
label money : precision = 0.98, recall = 0.98, f1 = 0.98
label missing_people : precision = 0.99, recall = 0.99, f1 = 0.99
label refugees : precis

In [132]:
report = pd.DataFrame(index = y_test.columns)
from sklearn.metrics import f1_score, recall_score, precision_score
for i,label in enumerate(y_test.columns):
    f_1 = f1_score(y_test[label], y_pred[:,i], average='micro')
    rec = recall_score(y_test[label], y_pred[:,i], average='micro')
    #precision = precision_score(y_test[label], y_pred[:,i], average='micro')
    print("label {} : ,recall = {:.2f}, f1 = {:.2f}".format(label, rec, f_1))

label related : ,recall = 0.81, f1 = 0.81
label request : ,recall = 0.89, f1 = 0.89
label offer : ,recall = 1.00, f1 = 1.00
label aid_related : ,recall = 0.75, f1 = 0.75
label medical_help : ,recall = 0.92, f1 = 0.92
label medical_products : ,recall = 0.95, f1 = 0.95
label search_and_rescue : ,recall = 0.97, f1 = 0.97
label security : ,recall = 0.98, f1 = 0.98
label military : ,recall = 0.97, f1 = 0.97
label child_alone : ,recall = 1.00, f1 = 1.00
label water : ,recall = 0.95, f1 = 0.95
label food : ,recall = 0.93, f1 = 0.93
label shelter : ,recall = 0.93, f1 = 0.93
label clothing : ,recall = 0.99, f1 = 0.99
label money : ,recall = 0.98, f1 = 0.98
label missing_people : ,recall = 0.99, f1 = 0.99
label refugees : ,recall = 0.97, f1 = 0.97
label death : ,recall = 0.96, f1 = 0.96
label other_aid : ,recall = 0.87, f1 = 0.87
label infrastructure_related : ,recall = 0.94, f1 = 0.94
label transport : ,recall = 0.96, f1 = 0.96
label buildings : ,recall = 0.95, f1 = 0.95
label electricity : ,re

In [164]:
f1_score(y_test.iloc[:,0], y_pred[:,0], average='weighted')
    

0.7955196576327143

In [168]:
precision_score(y_test.iloc[:,0], y_pred[:,0])

ValueError: Target is multiclass but average='binary'. Please choose another average setting.

In [171]:
y_test.iloc[:,0]

10239    1
22968    1
11552    0
17980    1
18987    1
12903    1
5653     1
25637    1
14219    1
10338    1
22275    0
22444    1
6224     1
20195    0
21312    0
14169    1
14314    1
4936     0
25343    1
16321    1
17306    1
10856    1
6125     1
25033    1
6743     0
4192     1
19748    1
19595    1
6850     0
9951     1
        ..
9921     1
9717     1
9658     1
9196     1
13673    1
12498    0
6632     1
14221    1
1960     0
20882    1
12155    1
725      1
23304    1
1519     1
22934    1
10108    1
21810    1
12394    1
8755     1
1453     0
12491    1
9905     1
14099    1
20833    1
18143    1
4075     1
23102    1
8831     1
5738     0
23981    0
Name: related, Length: 8640, dtype: int64

In [158]:
acc = (y_test.iloc[:,0] == y_pred[:,0]).sum() / y_test.shape[0]
tp = ((y_pred[:,0] == 1) & (y_test.iloc[:,0]==1)).sum()
fp = ((y_pred[:,0] == 1) & (y_test.iloc[:,0]==0)).sum()
fn = ((y_pred[:,0] == 0) & (y_test.iloc[:,0]==1)).sum()

In [159]:
p_m = tp / (tp + fp)
r_m = tp / (tp + fn)

In [160]:
acc

0.8082175925925926

In [161]:
p_m

0.8481792717086835

In [162]:
r_m

0.9235931066036297

In [156]:
recall_score(y_test.iloc[:,0], y_pred[:,0], average='macro')

0.5524231393965601

In [99]:
y_pred[0]

array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

In [104]:
y_test.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
10239,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
22968,1,0,0,1,1,1,1,0,1,0,...,0,1,1,1,0,1,0,0,0,0
11552,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17980,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18987,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
parameters = 

cv = 

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.