# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [4]:
import nltk
nltk.download(['punkt', 'wordnet'])

[nltk_data] Downloading package punkt to /Users/muneerah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/muneerah/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (confusion_matrix, classification_report, accuracy_score,
                             average_precision_score, recall_score)
from sklearn.preprocessing import label_binarize
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
import numpy as np

In [6]:
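# load data from database
engine = create_engine('sqlite:///message_category.db')
conn = engine.connect()
df = pd.read_sql_table('message_category', conn)
# the message text is the single input feature
X = df['message']
# columns from position 4 onward are the category labels used as targets
Y = df.iloc[:, 4:]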

In [7]:
print(X.shape)
print(Y.shape)

(26216,)
(26216, 36)


### 2. Write a tokenization function to process your text data

In [8]:
def tokenize(text):
    """Lowercase, tokenize, and lemmatize a message into a list of clean tokens."""
    tokens = word_tokenize(text.lower())
    lemmatizer = WordNetLemmatizer()
    # lemmatize each token, then normalize case and surrounding whitespace
    return [lemmatizer.lemmatize(token).lower().strip() for token in tokens]
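
Raw messages often contain URLs and punctuation that end up as noisy tokens. The following is a minimal sketch of a cleaning step that could run before `word_tokenize`. The `normalize` helper, the regular expressions, and the `urlplaceholder` string are assumptions made for illustration and are not part of the notebook.

```python
import re

# Assumed patterns for this sketch: http(s) URLs and runs of non-alphanumerics.
URL_PATTERN = re.compile(r"https?://\S+")
NON_ALNUM = re.compile(r"[^a-z0-9]+")


def normalize(text):
    """Lowercase the text, mask URLs, and collapse punctuation into spaces."""
    text = URL_PATTERN.sub("urlplaceholder", text.lower())
    return NON_ALNUM.sub(" ", text).strip()


print(normalize("Need WATER & food!! See http://example.com/help now"))
```

The normalized string could then be passed to `word_tokenize` in place of `text.lower()`.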


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [14]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(KNeighborsClassifier()))
    ])
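
To see what a multi-output pipeline returns, here is a small sketch on made-up data. The four messages, the two label columns, and `n_neighbors=1` are assumptions chosen so the example runs on a tiny sample; it uses the default `CountVectorizer` tokenizer rather than the notebook's `tokenize`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy messages with two toy target columns: [water, search_and_rescue].
messages = [
    "we need water and food",
    "people are trapped under rubble",
    "please send drinking water",
    "rescue team needed for trapped family",
]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # one KNN classifier is fitted per target column
    ("clf", MultiOutputClassifier(KNeighborsClassifier(n_neighbors=1))),
])
toy_pipeline.fit(messages, labels)

# predict returns one row per message and one column per target
toy_pred = toy_pipeline.predict(["send water", "trapped under rubble"])
print(toy_pred.shape)
```

Each column of the prediction array lines up with the corresponding column of the label matrix, which is why the evaluation cells index predictions with `[:, i]`.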

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)

pipeline.fit(X_train, y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x125f38820>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=KNeighborsClassifier()))])
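
The split above is random, so scores can shift between runs. As a sketch, passing `random_state` to `train_test_split` makes the split repeatable. The toy data and the seed value `42` are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy inputs: 20 messages with two binary label columns each.
toy_X = [f"message {i}" for i in range(20)]
toy_Y = [[i % 2, (i // 2) % 2] for i in range(20)]

# The same seed yields the same partition on every call.
split_a = train_test_split(toy_X, toy_Y, test_size=0.25, random_state=42)
split_b = train_test_split(toy_X, toy_Y, test_size=0.25, random_state=42)

toy_X_train, toy_X_test, toy_Y_train, toy_Y_test = split_a
print(len(toy_X_train), len(toy_X_test))
print(split_a == split_b)
```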

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
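
As an alternative to printing a full report per column, the per-category numbers can be collected into a dictionary. This is a sketch on made-up labels; the two category names and the arrays `y_true` and `y_hat` are assumptions, and `precision_recall_fscore_support` is used here instead of `classification_report`.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy ground truth and predictions: rows are messages, columns are categories.
category_names = ["related", "request"]
y_true = np.array([[1, 0], [1, 1], [0, 0], [0, 0], [1, 1]])
y_hat = np.array([[1, 0], [0, 1], [0, 0], [1, 0], [1, 0]])

scores = {}
for i, name in enumerate(category_names):
    # average="binary" scores the positive class (label 1) of this column
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true[:, i], y_hat[:, i], average="binary"
    )
    scores[name] = (precision, recall, f1)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```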

In [None]:
y_pred = pipeline.predict(X_test)

In [None]:
for i, column in enumerate(y_test.columns):
    print('report for {}'.format(column))
    print(classification_report(y_test[column].values.astype(str), y_pred[:, i].astype(int).astype(str)), '\n')

In [None]:
for i, column in enumerate(y_test.columns):
    print('report for {}'.format(column))
    print('accuracy score:', accuracy_score(y_test[column].values.astype(str), y_pred[:, i].astype(int).astype(str)))
    print('average_precision_score:', average_precision_score(y_test[column].values.astype(int), y_pred[:, i].astype(int)), '\n')
#     print('recall score:', recall_score(y_test[column].values.astype(int), y_pred[:, i].astype(int)), '\n')

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
# candidate values for the number of neighbors used by each KNN classifier
params = {'clf__estimator__n_neighbors': [4, 5]}

cv = GridSearchCV(pipeline, param_grid=params)

cv.fit(X_train, y_train)
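
Grid keys must match parameter names that the pipeline exposes, with `__` joining each nesting level (step name, then wrapped estimator, then parameter). As a sketch, `get_params()` lists the valid names. The pipeline below is rebuilt with the default tokenizer so the snippet stands alone, and the name `sketch_pipeline` is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

sketch_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(KNeighborsClassifier())),
])

# Nested parameters are the keys that contain a double underscore.
tunable = sorted(name for name in sketch_pipeline.get_params() if "__" in name)

# Parameters of the KNN estimator wrapped by MultiOutputClassifier.
knn_params = [name for name in tunable if name.startswith("clf__estimator__")]
print(knn_params)
```

Any key passed to `param_grid` that is not in this list causes `GridSearchCV.fit` to raise an error about an invalid parameter.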

In [17]:
print("\nBest Parameters:", cv.best_params_)


Best Parameters: {'clf__max_depth': None, 'clf__max_features': 0.7, 'clf__min_samples_leaf': 1}


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
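
When a category is rare, accuracy alone can look strong even if the model never predicts the positive class, which is why recall is worth reporting alongside it. The sketch below uses made-up labels (6 positives out of 1000, an assumption) and two small helper functions written for this example.

```python
def accuracy(y_true, y_hat):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_hat) if t == p)
    return correct / len(y_true)


def recall(y_true, y_hat):
    """Fraction of true positives that were found (0.0 if there are none)."""
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    hits = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
    return hits / positives


# A rare category: 6 positive messages out of 1000.
rare_true = [1] * 6 + [0] * 994
# A model that always predicts the negative class.
always_negative = [0] * 1000

print("accuracy:", accuracy(rare_true, always_negative))
print("recall:", recall(rare_true, always_negative))
```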

In [18]:
#accuracy , precision and recall
grid_ypred = cv.predict(X_test)

In [37]:
for i, column in enumerate(y_test.columns):
    print('report for {}'.format(column))
#     print(classification_report(y_test[column].values.astype(str),y_pred[:,i].astype(int).astype(str)), '\n')
    print('accuracy score:', accuracy_score(y_test[column].values.astype(str), grid_ypred[:, i].astype(int).astype(str)))
    print('average_precision_score:', average_precision_score(y_test[column].values.astype(int), grid_ypred[:, i].astype(int)))
    print('recall score:', recall_score(y_test[column].values.astype(int), grid_ypred[:, i].astype(int)), '\n')

report for related
accuracy score: 0.804958677686
average_precision_score: 0.409820485808
recall score: 0.454496499731 

report for request
accuracy score: 0.890654799746
average_precision_score: 0.491447170322
recall score: 0.507947976879 

report for offer
accuracy score: 0.993897012079
average_precision_score: 0.00610298792117
recall score: 0.0 

report for aid_related
accuracy score: 0.740750158932
average_precision_score: 0.603501471387
recall score: 0.515086206897 

report for medical_help
accuracy score: 0.922695486332
average_precision_score: 0.130978804745
recall score: 0.0816326530612 

report for medical_products
accuracy score: 0.95041322314
average_precision_score: 0.11130497461
recall score: 0.0924574209246 

report for search_and_rescue
accuracy score: 0.974316592498
average_precision_score: 0.0739466365538
recall score: 0.0566037735849 

report for security
accuracy score: 0.983598219962
average_precision_score: 0.0162746344565
recall score: 0.0 

report

  recall = tps / tps[-1]

accuracy score: 0.935791481246
average_precision_score: 0.335256709464
recall score: 0.331395348837 

report for clothing
accuracy score: 0.983852511125
average_precision_score: 0.0406357279085
recall score: 0.0307692307692 

report for money
accuracy score: 0.976605212969
average_precision_score: 0.029785029785
recall score: 0.0222222222222 

report for missing_people
accuracy score: 0.989828353465
average_precision_score: 0.0265054063034
recall score: 0.0246913580247 

report for refugees
accuracy score: 0.96617927527
average_precision_score: 0.0576133298094
recall score: 0.0486891385768 

report for death
accuracy score: 0.96312778131
average_precision_score: 0.152766299357
recall score: 0.157407407407 

report for other_aid
accuracy score: 0.873617291799
average_precision_score: 0.200874267166
recall score: 0.146551724138 

report for infrastructure_related
accuracy score: 0.937190082645
average_precision_score: 0.0698899186937
recall score: 0.0161943319838 

report

### 9. Export your model as a pickle file

In [33]:
import pickle

filename = 'GridSearchv_disaster_model.sav'
# use a context manager so the file handle is closed after writing
with open(filename, 'wb') as model_file:
    pickle.dump(cv, model_file)
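
Loading the model back is the mirror image of saving it. The following is a sketch of a save and load round trip; the helpers `save_model` and `load_model`, the temporary path, and the dictionary standing in for the fitted `cv` object are all assumptions for illustration.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a model to disk, closing the file afterwards."""
    with open(path, "wb") as model_file:
        pickle.dump(model, model_file)


def load_model(path):
    """Read a previously pickled model back into memory."""
    with open(path, "rb") as model_file:
        return pickle.load(model_file)


# Stand-in for the fitted GridSearchCV object (an assumption for this sketch).
toy_model = {"name": "toy", "weights": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.sav")
    save_model(toy_model, model_path)
    restored = load_model(model_path)

print(restored == toy_model)
```

Note that pickle stores functions by reference rather than by value, so a pipeline built with `tokenizer=tokenize` needs the `tokenize` function to be importable in whatever script later loads the file. Pickle files should also only be loaded from trusted sources.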

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
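
The template itself is not shown here, so the following is only a hypothetical skeleton of how such a script might read its inputs. The argument names `database_filepath` and `model_filepath`, the example file names, and the step comments are assumptions, not the contents of the provided template.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths the script needs (hypothetical argument names)."""
    parser = argparse.ArgumentParser(
        description="Train a message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def describe_run(args):
    """List the notebook steps the script would carry out, in order."""
    return [
        f"load data from {args.database_filepath}",
        "split into train and test sets",
        "fit the pipeline with grid search",
        "report per-category metrics",
        f"save model to {args.model_filepath}",
    ]


# Example invocation with made-up paths instead of real command-line input.
example_args = parse_args(["message_category.db", "classifier.pkl"])
for step in describe_run(example_args):
    print(step)
```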