# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
import nltk
nltk.download(['punkt', 'wordnet'])

[nltk_data] Downloading package punkt to /Users/muneerah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/muneerah/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, average_precision_score, recall_score
from sklearn.preprocessing import label_binarize
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
import numpy as np

In [5]:
# load data from database
engine = create_engine('sqlite:///message_category.db')
conn = engine.connect()
df = pd.read_sql_table('message_category',conn)
X = df['message']
Y = df.iloc[:,4:]

In [6]:
print(X.shape)
print(Y.shape)

(26216,)
(26216, 36)


### 2. Write a tokenization function to process your text data

In [7]:
def tokenize(text):
    # Lowercase once, split into word tokens, then lemmatize each token.
    tokens = word_tokenize(text.lower())
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token).strip() for token in tokens]


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [8]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier())
    ])
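The pipeline above relies on `RandomForestClassifier` handling several label columns natively. As a rough sketch of the `MultiOutputClassifier` alternative mentioned in the instructions, the snippet below wraps the forest so that one model is fitted per label column. The toy messages, toy labels, and the names `toy_messages`, `toy_labels`, and `multi_pipeline` are made up for illustration and are not part of the project data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four messages, two binary label columns.
toy_messages = [
    "we need water and food",
    "the road is blocked",
    "please send drinking water",
    "bridge collapsed near the road",
]
toy_labels = [[1, 0], [0, 1], [1, 0], [0, 1]]

multi_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # One independent random forest is fitted per label column.
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
multi_pipeline.fit(toy_messages, toy_labels)

# The prediction has one row per message and one column per label.
toy_pred = multi_pipeline.predict(["need water"])
print(toy_pred.shape)
```

With this wrapper, grid search keys for the forest would take the form `clf__estimator__<parameter>` rather than `clf__<parameter>`.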


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [9]:
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size = 0.3)

pipeline.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
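A minimal sketch of that per-column loop, on made-up data, might look like the following. `toy_y_test` and `toy_y_pred` are hypothetical stand-ins for `y_test` and `y_pred`, and the `output_dict` and `zero_division` arguments assume a newer scikit-learn release than the one that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical stand-ins for y_test (DataFrame) and y_pred (2-D array).
toy_y_test = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
toy_y_pred = np.array([[1, 0], [0, 0], [0, 1], [0, 1]])

reports = {}
for i, column in enumerate(toy_y_test.columns):
    # output_dict=True returns the report as a nested dict instead of text;
    # zero_division=0 reports 0 for metrics that would divide by zero.
    reports[column] = classification_report(
        toy_y_test[column], toy_y_pred[:, i],
        output_dict=True, zero_division=0,
    )
    print(column, "f1 for class 1:",
          round(reports[column]["1"]["f1-score"], 2))
```

Keeping the reports in a dictionary makes it possible to compare models column by column later instead of reading printed tables.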

In [18]:
y_pred = pipeline.predict(X_test)

In [9]:
# y__pred= pd.DataFrame(y_pred, columns=y_test.columns).astype('int64')

In [11]:
for i,column in enumerate(y_test.columns):
    print('report for {}'.format(column))
    print(classification_report(y_test[column].values.astype(str),y_pred[:,i].astype(int).astype(str)), '\n')

report for related
             precision    recall  f1-score   support

          0       0.55      0.46      0.50      1773
          1       0.84      0.89      0.87      6041
          2       0.56      0.10      0.17        51

avg / total       0.78      0.79      0.78      7865
 

report for request
             precision    recall  f1-score   support

          0       0.88      0.99      0.93      6482
          1       0.87      0.37      0.52      1383

avg / total       0.88      0.88      0.86      7865
 

report for offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      7828
          1       0.00      0.00      0.00        37

avg / total       0.99      1.00      0.99      7865
 

report for aid_related
             precision    recall  f1-score   support

          0       0.68      0.92      0.78      4569
          1       0.78      0.40      0.53      3296

avg / total       0.72      0.70      0.67      7865
 

[output truncated]

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      7865

avg / total       1.00      1.00      1.00      7865
 

report for water
             precision    recall  f1-score   support

          0       0.94      1.00      0.97      7368
          1       0.94      0.12      0.22       497

avg / total       0.94      0.94      0.92      7865
 

report for food
             precision    recall  f1-score   support

          0       0.91      0.99      0.95      6981
          1       0.82      0.20      0.32       884

avg / total       0.90      0.91      0.88      7865
 

report for shelter
             precision    recall  f1-score   support

          0       0.92      1.00      0.96      7155
          1       0.86      0.13      0.23       710

avg / total       0.91      0.92      0.89      7865
 

report for clothing
             precision    recall  f1-score   support

          0       0.98      1.00      0.99      7731
      

In [60]:
for i,column in enumerate(y_test.columns):
    print('report for {}'.format(column))
    print('accuracy score:',accuracy_score(y_test[column].values.astype(str),y_pred[:,i].astype(int).astype(str)))
    print('average_precision_score:',average_precision_score(y_test[column].values.astype(int),y_pred[:,i].astype(int)), '\n')
#     print('recall score:',recall_score(y_test[column].values.astype(int),y_pred[:,i].astype(int)), '\n')

report for related
accuracy score: 0.335410044501
average_precision_score: 0.236017874425 

report for request
accuracy score: 0.790336935791
average_precision_score: 0.173137912186 

report for offer
accuracy score: 0.995168467896
average_precision_score: 0.00483153210426 

report for aid_related
accuracy score: 0.540241576605
average_precision_score: 0.416082683417 

report for medical_help
accuracy score: 0.919898283535
average_precision_score: 0.0783216783217 

report for medical_products
accuracy score: 0.945327399873
average_precision_score: 0.0518753973299 

report for search_and_rescue
accuracy score: 0.972282263191
average_precision_score: 0.0277177368086 

report for security
accuracy score: 0.980801017165
average_precision_score: 0.0191989828353 

report for military
accuracy score: 0.964780673872
average_precision_score: 0.0348378893833 

report for child_alone
accuracy score: 1.0
average_precision_score: nan 

report for water
accuracy score: 0.927272727273
[output truncated]

report for money
accuracy score: 0.978003814367
average_precision_score: 0.0247592134613 

report for missing_people
accuracy score: 0.986395422759
average_precision_score: 0.0134774316592 

report for refugees
accuracy score: 0.968086458996
average_precision_score: 0.031277813096 

report for death
accuracy score: 0.951176096631
average_precision_score: 0.0481014631485 

report for other_aid
accuracy score: 0.862301335029
average_precision_score: 0.132592960605 

report for infrastructure_related
accuracy score: 0.932485696122
average_precision_score: 0.0670057215512 

report for transport
accuracy score: 0.956134774317
average_precision_score: 0.04361093452 

report for buildings
accuracy score: 0.944691671964
average_precision_score: 0.0543982098379 

report for electricity
accuracy score: 0.978512396694
average_precision_score: 0.0213604577241 

report for tools
accuracy score: 0.994532739987
average_precision_score: 0.00546726001271 

report for hospitals
[output truncated]

### 6. Improve your model
Use grid search to find better parameters. 
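As a small illustration of how grid search addresses pipeline parameters, the sketch below tunes a toy pipeline on made-up data. The corpus, the labels, the parameter values, and the explicit `cv=3` and `scoring="f1"` settings are assumptions chosen for the example, not the settings used in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with a single binary label.
toy_messages = ["need water", "need food", "road blocked", "bridge down"] * 3
toy_labels = [1, 1, 0, 0] * 3

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Keys follow the "<step name>__<parameter>" convention.
toy_params = {
    "clf__min_samples_leaf": [1, 2],
    "clf__max_features": [0.5, 1.0],
}

# Every combination (2 x 2 = 4) is cross-validated on 3 folds.
toy_search = GridSearchCV(toy_pipeline, param_grid=toy_params,
                          cv=3, scoring="f1")
toy_search.fit(toy_messages, toy_labels)
print(toy_search.best_params_)
```

The number of fits grows with the product of the value lists and the number of folds, so small grids are a reasonable starting point on the full dataset.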

In [19]:
def convert_to_binary(data):
    # Work on a copy so the caller's DataFrame (a slice of Y) is not modified in place.
    data = data.copy()
    for i in range(data.shape[1]):
        labels = np.unique(data.iloc[:, i])
        data.iloc[:, i] = label_binarize(data.iloc[:, i], classes=labels)
    return data


In [20]:
# convert y_train to binary labels
y_train = convert_to_binary(y_train)
print(np.unique(y_train))

[0 1]


In [21]:
# convert y_test to binary labels
y_test = convert_to_binary(y_test)
print(np.unique(y_test))

[0 1]


In [31]:

params={'clf__max_features':[0.3, 0.5, 0.7],
        'clf__min_samples_leaf':[1, 2, 3],
        'clf__max_depth':[None]
        }


cv = GridSearchCV(pipeline, param_grid=params)

# y_train = label_binarize(y_train, classes=[0, 1, 2])

cv.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__max_features': [0.3, 0.5, 0.7], 'clf__min_samples_leaf': [1, 2, 3], 'clf__max_depth': [None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [17]:
print("\nBest Parameters:", cv.best_params_)


Best Parameters: {'clf__max_depth': None, 'clf__max_features': 0.7, 'clf__min_samples_leaf': 1}


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
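One way to show the three metrics side by side is to collect them into a single table, one row per category. The sketch below does this on made-up data. `toy_y_test` and `toy_y_pred` are hypothetical stand-ins for `y_test` and the tuned model's predictions, and `zero_division` assumes a recent scikit-learn release.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical stand-ins for y_test and the tuned model's predictions.
toy_y_test = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
toy_y_pred = np.array([[1, 0], [0, 0], [0, 1], [0, 1]])

rows = []
for i, column in enumerate(toy_y_test.columns):
    y_true, y_hat = toy_y_test[column], toy_y_pred[:, i]
    rows.append({
        "category": column,
        "accuracy": accuracy_score(y_true, y_hat),
        # zero_division=0 returns 0 when the positive class is never predicted.
        "precision": precision_score(y_true, y_hat, zero_division=0),
        "recall": recall_score(y_true, y_hat, zero_division=0),
    })

# One row per category, one column per metric.
summary = pd.DataFrame(rows).set_index("category")
print(summary)
```

A table like this makes it easier to spot categories where accuracy is high only because the positive class is rare.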

In [18]:
# accuracy, precision and recall
grid_ypred = cv.predict(X_test)


In [37]:
for i,column in enumerate(y_test.columns):
    print('report for {}'.format(column))
#     print(classification_report(y_test[column].values.astype(str),y_pred[:,i].astype(int).astype(str)), '\n')
    print('accuracy score:',accuracy_score(y_test[column].values.astype(str),grid_ypred[:,i].astype(int).astype(str)))
    print('average_precision_score:',average_precision_score(y_test[column].values.astype(int),grid_ypred[:,i].astype(int)))
    print('recall score:',recall_score(y_test[column].values.astype(int),grid_ypred[:,i].astype(int)), '\n')

report for related
accuracy score: 0.804958677686
average_precision_score: 0.409820485808
recall score: 0.454496499731

report for request
accuracy score: 0.890654799746
average_precision_score: 0.491447170322
recall score: 0.507947976879

report for offer
accuracy score: 0.993897012079
average_precision_score: 0.00610298792117
recall score: 0.0

report for aid_related
accuracy score: 0.740750158932
average_precision_score: 0.603501471387
recall score: 0.515086206897

report for medical_help
accuracy score: 0.922695486332
average_precision_score: 0.130978804745
recall score: 0.0816326530612

report for medical_products
accuracy score: 0.95041322314
average_precision_score: 0.11130497461
recall score: 0.0924574209246

report for search_and_rescue
accuracy score: 0.974316592498
average_precision_score: 0.0739466365538
recall score: 0.0566037735849

report for security
accuracy score: 0.983598219962
average_precision_score: 0.0162746344565
recall score: 0.0

[output truncated]

accuracy score: 0.935791481246
average_precision_score: 0.335256709464
recall score: 0.331395348837

report for clothing
accuracy score: 0.983852511125
average_precision_score: 0.0406357279085
recall score: 0.0307692307692

report for money
accuracy score: 0.976605212969
average_precision_score: 0.029785029785
recall score: 0.0222222222222

report for missing_people
accuracy score: 0.989828353465
average_precision_score: 0.0265054063034
recall score: 0.0246913580247

report for refugees
accuracy score: 0.96617927527
average_precision_score: 0.0576133298094
recall score: 0.0486891385768

report for death
accuracy score: 0.96312778131
average_precision_score: 0.152766299357
recall score: 0.157407407407

report for other_aid
accuracy score: 0.873617291799
average_precision_score: 0.200874267166
recall score: 0.146551724138

report for infrastructure_related
accuracy score: 0.937190082645
average_precision_score: 0.0698899186937
recall score: 0.0161943319838

[output truncated]

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
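As one possible reading of the second idea, the sketch below combines TF-IDF with a hand-written feature through `FeatureUnion`. The `MessageLength` transformer, the toy messages, and the toy labels are hypothetical and only illustrate the mechanism. Whether a length feature helps on the real data is untested here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of whitespace-separated words."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # Return one column holding the word count of each message.
        return np.array([[len(text.split())] for text in X])


union_pipeline = Pipeline([
    # FeatureUnion concatenates the columns of both transformers.
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
        ("length", MessageLength()),
    ])),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

toy_messages = ["need water", "the road to the village is blocked",
                "need food", "bridge down"]
toy_labels = [1, 0, 1, 0]
union_pipeline.fit(toy_messages, toy_labels)

# TF-IDF columns come first, followed by the single length column.
features = union_pipeline.named_steps["features"].transform(["send water now"])
print(features.shape)
```

The same pattern extends to any transformer that implements `fit` and `transform`, so further features can be added as extra entries in the union.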

In [23]:
knn_pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', KNeighborsClassifier(n_neighbors=3))
    ])

knn_pipeline.fit(X_train,y_train)

knn_ypred = knn_pipeline.predict(X_test)



In [24]:
for i,column in enumerate(y_test.columns):
    print('report for {}'.format(column))
    print('accuracy score:',accuracy_score(y_test[column].values.astype(str),knn_ypred[:,i].astype(int).astype(str)))
    print('average_precision_score:',average_precision_score(y_test[column].values.astype(int),knn_ypred[:,i].astype(int)), '\n')
#     print('recall score:',recall_score(y_test[column].values.astype(int),knn_ypred[:,i].astype(int)), '\n')

report for related
accuracy score: 0.790082644628
average_precision_score: 0.282740965644 

report for request
accuracy score: 0.839287984743
average_precision_score: 0.246654372265 

report for offer
accuracy score: 0.995295613477
average_precision_score: 0.00470438652257 

report for aid_related
accuracy score: 0.596185632549
average_precision_score: 0.437219471475 

report for medical_help
accuracy score: 0.917736808646
average_precision_score: 0.0916991664792 

report for medical_products
accuracy score: 0.953210425938
average_precision_score: 0.0693944595057 

report for search_and_rescue
accuracy score: 0.970375079466
average_precision_score: 0.0338984248075 

report for security
accuracy score: 0.979529561348
average_precision_score: 0.0191989828353 

report for military
accuracy score: 0.964144945963
average_precision_score: 0.0358550540369 

report for child_alone
accuracy score: 1.0
average_precision_score: nan 

report for water
accuracy score: 0.938970120788
[output truncated]

accuracy score: 0.976859504132
average_precision_score: 0.0397427547749 

report for missing_people
accuracy score: 0.987921169739
average_precision_score: 0.0120788302606 

report for refugees
accuracy score: 0.964272091545
average_precision_score: 0.0472934053976 

report for death
accuracy score: 0.955880483153
average_precision_score: 0.0671671748288 

report for other_aid
accuracy score: 0.866369993643
average_precision_score: 0.135669720078 

report for infrastructure_related
accuracy score: 0.935282898919
average_precision_score: 0.0753160525971 

report for transport
accuracy score: 0.950794659886
average_precision_score: 0.053174865432 

report for buildings
accuracy score: 0.947997457088
average_precision_score: 0.076080900088 

report for electricity
accuracy score: 0.979783852511
average_precision_score: 0.022212131303 

report for tools
accuracy score: 0.993642720915
average_precision_score: 0.00635727908455 

report for hospitals
accuracy score: 0.989828353465
[output truncated]

In [27]:
tree_pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', tree.DecisionTreeClassifier())
    ])

tree_pipeline.fit(X_train,y_train)

tree_ypred = tree_pipeline.predict(X_test)

In [28]:
for i,column in enumerate(y_test.columns):
    print('report for {}'.format(column))
    print('accuracy score:',accuracy_score(y_test[column].values.astype(str),tree_ypred[:,i].astype(int).astype(str)))
    print('average_precision_score:',average_precision_score(y_test[column].values.astype(int),tree_ypred[:,i].astype(int)), '\n')
#     print('recall score:',recall_score(y_test[column].values.astype(int),tree_ypred[:,i].astype(int)), '\n')

report for related
accuracy score: 0.754736172918
average_precision_score: 0.348697066043 

report for request
accuracy score: 0.8541640178
average_precision_score: 0.401593702378 

report for offer
accuracy score: 0.991608391608
average_precision_score: 0.00544908052239 

report for aid_related
accuracy score: 0.705403687222
average_precision_score: 0.565781841946 

report for medical_help
accuracy score: 0.891799109981
average_precision_score: 0.13127479096 

report for medical_products
accuracy score: 0.936427209154
average_precision_score: 0.092869748494 

report for search_and_rescue
accuracy score: 0.958931977114
average_precision_score: 0.0517508117264 

report for security
accuracy score: 0.969103623649
average_precision_score: 0.0201864207495 

report for military
accuracy score: 0.957024793388
average_precision_score: 0.121634710974 

report for child_alone
accuracy score: 1.0
average_precision_score: nan 

report for water
accuracy score: 0.953337571519
[output truncated]

average_precision_score: 0.0178495320881 

report for refugees
accuracy score: 0.951176096631
average_precision_score: 0.0650758398413 

report for death
accuracy score: 0.943674507311
average_precision_score: 0.12640275812 

report for other_aid
accuracy score: 0.829624920534
average_precision_score: 0.188003438219 

report for infrastructure_related
accuracy score: 0.90298792117
average_precision_score: 0.0752367276831 

report for transport
accuracy score: 0.935155753338
average_precision_score: 0.0735144956329 

report for buildings
accuracy score: 0.93159567705
average_precision_score: 0.0906555360882 

report for electricity
accuracy score: 0.972790845518
average_precision_score: 0.0532967032967 

report for tools
accuracy score: 0.98944691672
average_precision_score: 0.00635727908455 

report for hospitals
accuracy score: 0.982326764145
average_precision_score: 0.0140649041341 

report for shops
accuracy score: 0.992371265099
average_precision_score: 0.00508582326764 

[output truncated]

### 9. Export your model as a pickle file

In [33]:
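import pickle

# Alternative: export the decision tree pipeline instead.
# filename = 'Dtree_disaster_model.sav'
# pickle.dump(tree_pipeline, open(filename, 'wb'))

# Export the tuned grid search model.
filename = 'GridSearchv_disaster_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(cv, f)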
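To check that an exported model can be used again, it can be loaded back and asked for a prediction. The sketch below runs that round trip on a small stand-in model written to a temporary directory. `toy_model` and the file name `toy_model.sav` are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical small model standing in for the fitted grid search object.
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", KNeighborsClassifier(n_neighbors=1)),
])
toy_model.fit(["need water", "road blocked"], [1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.sav")
    # Write the fitted model to disk, then read it back.
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly like the original.
same = (restored.predict(["need water"]) == toy_model.predict(["need water"])).all()
print(same)
```

Note that pickle stores functions by reference, so a pipeline built with a custom `tokenize` function can only be loaded where that function is importable under the same name.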

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
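The template itself is not shown in this notebook, so the following is only a guess at how such a script could be organised. The function names, the two positional arguments, and the use of `sqlite3` in place of SQLAlchemy are assumptions made for this sketch. The table name and the column slice are taken from the loading cell above.

```python
import argparse
import pickle
import sqlite3

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def parse_args(argv=None):
    # Hypothetical command line: train.py <database_path> <model_path>
    parser = argparse.ArgumentParser(
        description="Train and export a message classifier.")
    parser.add_argument("database_path",
                        help="SQLite file holding the message_category table")
    parser.add_argument("model_path",
                        help="where to write the pickled model")
    return parser.parse_args(argv)


def load_data(database_path):
    # sqlite3 is used here to avoid the SQLAlchemy dependency in this sketch.
    conn = sqlite3.connect(database_path)
    try:
        df = pd.read_sql_query("SELECT * FROM message_category", conn)
    finally:
        conn.close()
    # Same split as in the notebook: messages in, label columns out.
    return df["message"], df.iloc[:, 4:]


def build_model():
    # Simplified pipeline; the notebook's tokenizer and grid search are omitted.
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
    ])


def main(argv=None):
    args = parse_args(argv)
    X, Y = load_data(args.database_path)
    model = build_model().fit(X, Y)
    with open(args.model_path, "wb") as f:
        pickle.dump(model, f)
```

In an actual script, `main()` would be called under an `if __name__ == "__main__":` guard so that the module can also be imported without side effects.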