# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
!pip install nltk
!pip install plotly



In [2]:
# import libraries
import pandas as pd
import os
import sys
import re
import pickle
import numpy as np
import sqlite3
pd.set_option('display.max_columns', None)


from sqlalchemy import create_engine
from sklearn.pipeline import Pipeline, FeatureUnion # for implementing pipelines and feature Union
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, GridSearchCV  # to split data into training and testing set
from sklearn.metrics import classification_report, confusion_matrix, fbeta_score, make_scorer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from scipy.stats import gmean

# We also need necessary NLTK packages
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [3]:
# Fetch the data with read_sql and an explicit SQL query.
# pd.read_sql_table would also work: pass the table name instead of a query.

engine = create_engine('sqlite:////home/workspace/data/disaster_response.db')
df = pd.read_sql("SELECT * FROM disaster_response", engine)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
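The two ways of reading a table mentioned in the cell above can be compared on a throwaway database. This is a minimal sketch, assuming an in-memory SQLite engine and a three-column toy table; the rows are illustrative stand-ins, not the real dataset.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database, so the sketch never touches a real file.
engine = create_engine("sqlite://")

# Toy stand-in for the real table; the rows are illustrative only.
toy = pd.DataFrame({
    "id": [2, 7],
    "message": ["Weather update", "Is the Hurricane over"],
    "related": [1, 1],
})
toy.to_sql("disaster_response", engine, index=False)

# Query-based read, as in the cell above.
via_query = pd.read_sql("SELECT * FROM disaster_response", engine)

# Table-based read: pass the table name, no SQL needed.
via_table = pd.read_sql_table("disaster_response", engine)

print(via_table)
```

`read_sql_table` needs a SQLAlchemy connectable, whereas `read_sql` with a query string is the more flexible choice when only some rows or columns are wanted.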


In [4]:
# Now that the table is loaded into a DataFrame,
# we can use pandas summary functions to understand the data.

df.describe()

# Inspect the .describe() output manually for anomalies.

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0
mean,15224.82133,0.77365,0.170659,0.004501,0.414251,0.079493,0.050084,0.027617,0.017966,0.032804,0.0,0.063778,0.111497,0.088267,0.015449,0.023039,0.011367,0.033377,0.045545,0.131446,0.065037,0.045812,0.050847,0.020293,0.006065,0.010795,0.004577,0.011787,0.043904,0.278341,0.082202,0.093187,0.010757,0.093645,0.020217,0.052487,0.193584
std,8826.88914,0.435276,0.376218,0.06694,0.492602,0.270513,0.218122,0.163875,0.132831,0.178128,0.0,0.244361,0.314752,0.283688,0.123331,0.150031,0.106011,0.179621,0.2085,0.337894,0.246595,0.209081,0.219689,0.141003,0.077643,0.103338,0.067502,0.107927,0.204887,0.448191,0.274677,0.2907,0.103158,0.29134,0.140743,0.223011,0.395114
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7446.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15662.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22924.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30265.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [5]:
# The 'child_alone' column is 0 for every row, so it carries no information.
# For the sake of our analysis, we can remove it.

df.drop(['child_alone'], axis=1, inplace=True)

In [6]:
df.describe()

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0
mean,15224.82133,0.77365,0.170659,0.004501,0.414251,0.079493,0.050084,0.027617,0.017966,0.032804,0.063778,0.111497,0.088267,0.015449,0.023039,0.011367,0.033377,0.045545,0.131446,0.065037,0.045812,0.050847,0.020293,0.006065,0.010795,0.004577,0.011787,0.043904,0.278341,0.082202,0.093187,0.010757,0.093645,0.020217,0.052487,0.193584
std,8826.88914,0.435276,0.376218,0.06694,0.492602,0.270513,0.218122,0.163875,0.132831,0.178128,0.244361,0.314752,0.283688,0.123331,0.150031,0.106011,0.179621,0.2085,0.337894,0.246595,0.209081,0.219689,0.141003,0.077643,0.103338,0.067502,0.107927,0.204887,0.448191,0.274677,0.2907,0.103158,0.29134,0.140743,0.223011,0.395114
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7446.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15662.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22924.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30265.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
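Spotting an all-zero column by eye in `describe()` works, but the same check can be automated. The sketch below assumes a small toy frame with hypothetical values; it flags every column with a single unique value and drops those columns.

```python
import pandas as pd

# Toy frame: 'child_alone' is constant, the other columns vary.
toy = pd.DataFrame({
    "related": [1, 0, 2, 1],
    "child_alone": [0, 0, 0, 0],
    "water": [0, 1, 0, 0],
})

# A column with exactly one unique value cannot help a classifier.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Return a new frame rather than mutating in place.
cleaned = toy.drop(columns=constant_cols)

print(constant_cols)
print(cleaned.columns.tolist())
```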


In [7]:
# The 'related' column looks interesting; let us see how many unique values it has

df.groupby('related').count()

Unnamed: 0_level_0,id,message,original,genre,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
related,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1
0,6122,6122,3395,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122
1,19906,19906,6643,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906
2,188,188,132,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188


In [8]:
# There are 3 unique values: 0 (not related), 1 (related), and 2 (unexpected for a binary label).
# It is fair to assume the 2s were entered by mistake, although other explanations are possible.
# We recode 2 as 1, assuming a mistyped 2 was more likely meant to be 1 than 0.

# Recode 'related' entries of 2 to 1

df.related = df.related.map(lambda x: 1 if x==2 else x)
df.groupby('related').count()

Unnamed: 0_level_0,id,message,original,genre,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
related,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1
0,6122,6122,3395,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122
1,20094,20094,6775,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094,20094
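A more compact way to inspect and recode a single column is `value_counts` together with `Series.replace`. This is a sketch on a hypothetical six-row series, not the real column; it performs the same 2-to-1 recoding as the `map` call above.

```python
import pandas as pd

# Hypothetical 'related' values, including the unexpected 2s.
related = pd.Series([0, 1, 2, 1, 2, 0], name="related")

# Count each distinct value before recoding.
counts_before = related.value_counts().sort_index()

# Recode 2 -> 1; all other values are left untouched.
recoded = related.replace(2, 1)
counts_after = recoded.value_counts().sort_index()

print(counts_before)
print(counts_after)
```

`value_counts` reports one number per category, which is easier to read than `groupby(...).count()` when only one column is of interest.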


Now the `related` column contains only 2 categories.

The data set is loaded and cleaned, so the project can move on to the next phase.


In [9]:
# To perform modeling we split the data into X (the message text) and y (the category labels to predict)

X = df.message
y = df.iloc[:, 4:]
display(X.head())
display(y.head())

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
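Positional slicing with `iloc[:, 4:]` depends on the column order staying fixed. As a hedged alternative, the targets can be selected by dropping the non-label columns by name. The toy frame below is an assumption for illustration, with only two label columns.

```python
import pandas as pd

# Toy frame with the same leading columns as the real table.
toy = pd.DataFrame({
    "id": [2, 7],
    "message": ["Weather update", "Is the Hurricane over"],
    "original": ["Un front froid", "Cyclone nan fini"],
    "genre": ["direct", "direct"],
    "related": [1, 1],
    "request": [0, 1],
})

# Columns that are not classification targets.
non_target_cols = ["id", "message", "original", "genre"]

X_toy = toy["message"]
y_toy = toy.drop(columns=non_target_cols)

# Same result as the positional slice, but robust to column reordering.
print(y_toy.equals(toy.iloc[:, 4:]))
```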


In [10]:
df.columns

Index(['id', 'message', 'original', 'genre', 'related', 'request', 'offer',
       'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
       'security', 'military', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

### 2. Write a tokenization function to process your text data

In [11]:
def tokenize(text):
    """
    This function will perform the tokenization process
    
    Arguments:
        - text: the message which needs to be tokenized
        
    Output:
        - token_msgs = A list of tokens which are derived from the input messages
        
    """
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

    # Replace the URLs in messages with a placeholder to reduce complexity

    detected_urls = re.findall(url_regex, text)  # finds all the urls

    # Replace each URL with a fixed placeholder token, not with the regex pattern itself
    for detected_url in detected_urls:
        text = text.replace(detected_url, 'urlplaceholder')
    
    # convert the words in text msgs into tokens
    tokens = nltk.word_tokenize(text)
    
    # Lemmatize the words to reduce them to their root form
    lemmatizer = nltk.WordNetLemmatizer()

    # Lowercase each lemmatized token and strip surrounding whitespace
    token_msgs = [lemmatizer.lemmatize(x).lower().strip() for x in tokens]
    
    return token_msgs
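The URL-masking step of the tokenizer can be isolated and checked without NLTK. The sketch below uses only the standard library; the helper name `mask_urls` and the placeholder `urlplaceholder` are assumptions for illustration, and `re.sub` stands in for the find-then-replace loop.

```python
import re

# Same URL pattern as in tokenize(), compiled once as a raw string.
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def mask_urls(text, placeholder="urlplaceholder"):
    """Replace every URL in `text` with a fixed placeholder token."""
    return URL_PATTERN.sub(placeholder, text)


masked = mask_urls("See http://example.com/a?b=1 now")
unchanged = mask_urls("No links in this message")
print(masked)
print(unchanged)
```

Masking URLs this way keeps the vocabulary small, because every link collapses to one token instead of many unique strings.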


### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset (35 in this notebook, because `child_alone` was dropped above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

I will create two pipelines: the first (`pipeline1`) uses TF-IDF features only, and the second (`pipeline2`) adds the `StartingVerbExtractor` feature.

In [12]:
pipeline1 = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('count_vectorizer', CountVectorizer(tokenizer=tokenize)),
            ('tfidf_transformer', TfidfTransformer())
        ]))
    ])),
    
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
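To see what a multi-output text pipeline returns, it helps to run one on a few rows. This is a minimal sketch, assuming four made-up messages, two made-up labels, and the default `CountVectorizer` tokenizer instead of `tokenize`, so it runs without NLTK data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages and labels, for illustration only.
messages = pd.Series([
    "we need water and food",
    "the storm destroyed our house",
    "please send drinking water",
    "heavy storm and floods here",
])
labels = pd.DataFrame({
    "water": [1, 0, 1, 0],
    "storm": [0, 1, 0, 1],
})

toy_pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer()),
    ("tfidf_transformer", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(messages, labels)

# One row per message, one column per label.
predictions = toy_pipeline.predict(messages)
print(predictions.shape)
```

`MultiOutputClassifier` fits one independent copy of the base estimator per label column, which is why the prediction array has as many columns as `labels`.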

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
model_1 = pipeline1.fit(X_train, y_train)
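`train_test_split` holds out 25% of the rows by default and shuffles with a fresh random seed on every call, so the reports below would differ slightly between runs. A hedged sketch of a reproducible split, assuming toy arrays and an arbitrary seed of 42:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with alternating binary labels.
X_toy = np.arange(100)
y_toy = np.arange(100) % 2

# Fixing random_state makes the shuffle repeatable.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_toy, y_toy, random_state=42)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_toy, y_toy, random_state=42)

same_split = np.array_equal(X_test_a, X_test_b)
print(len(X_train_a), len(X_test_a), same_split)
```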

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
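The per-column iteration described above can be sketched as follows. The labels and predictions are made up for illustration. Setting `zero_division=0` is an optional choice that reports 0 without a warning when a label is never predicted.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Made-up ground truth and predictions for two label columns.
y_true = pd.DataFrame({"related": [1, 1, 0, 1], "offer": [0, 1, 0, 0]})
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])

reports = {}
for i, column in enumerate(y_true.columns):
    # Dictionary form, convenient for later comparison between models.
    reports[column] = classification_report(
        y_true[column], y_pred[:, i], zero_division=0, output_dict=True)
    # Text form, one report per output category.
    print(column)
    print(classification_report(y_true[column], y_pred[:, i], zero_division=0))
```

A per-column report also shows the negative class, which the single multilabel report printed below does not.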

##### Testing Model 1

In [14]:
y_pred1_train = model_1.predict(X_train)
y_pred1_test = model_1.predict(X_test)

# Classification report on testing set

print(classification_report(y_test.values, y_pred1_test, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.83      0.93      0.88      5036
               request       0.82      0.35      0.49      1140
                 offer       0.00      0.00      0.00        26
           aid_related       0.75      0.50      0.60      2736
          medical_help       0.61      0.04      0.08       542
      medical_products       0.70      0.08      0.14       332
     search_and_rescue       0.54      0.08      0.14       185
              security       1.00      0.01      0.01       142
              military       0.56      0.10      0.17       223
                 water       0.83      0.18      0.29       439
                  food       0.88      0.30      0.45       750
               shelter       0.88      0.17      0.28       619
              clothing       0.64      0.09      0.16       100
                 money       0.89      0.05      0.10       152
        missing_people       0.75      

  'precision', 'predicted', average, warn_for)


In [15]:
# Classification report on training set to check model performance

print(classification_report(y_train.values, y_pred1_train, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.99      1.00      0.99     15058
               request       1.00      0.93      0.96      3334
                 offer       1.00      0.76      0.86        92
           aid_related       1.00      0.97      0.98      8124
          medical_help       1.00      0.86      0.92      1542
      medical_products       1.00      0.83      0.91       981
     search_and_rescue       1.00      0.78      0.87       539
              security       1.00      0.73      0.85       329
              military       1.00      0.86      0.93       637
                 water       1.00      0.91      0.95      1233
                  food       1.00      0.94      0.97      2173
               shelter       1.00      0.91      0.95      1695
              clothing       1.00      0.86      0.93       305
                 money       1.00      0.79      0.88       452
        missing_people       1.00      

### 6. Improve your model
Use grid search to find better parameters. 

In [16]:
list(pipeline1.get_params().keys())

['memory',
 'steps',
 'features',
 'clf',
 'features__n_jobs',
 'features__transformer_list',
 'features__transformer_weights',
 'features__text_pipeline',
 'features__text_pipeline__memory',
 'features__text_pipeline__steps',
 'features__text_pipeline__count_vectorizer',
 'features__text_pipeline__tfidf_transformer',
 'features__text_pipeline__count_vectorizer__analyzer',
 'features__text_pipeline__count_vectorizer__binary',
 'features__text_pipeline__count_vectorizer__decode_error',
 'features__text_pipeline__count_vectorizer__dtype',
 'features__text_pipeline__count_vectorizer__encoding',
 'features__text_pipeline__count_vectorizer__input',
 'features__text_pipeline__count_vectorizer__lowercase',
 'features__text_pipeline__count_vectorizer__max_df',
 'features__text_pipeline__count_vectorizer__max_features',
 'features__text_pipeline__count_vectorizer__min_df',
 'features__text_pipeline__count_vectorizer__ngram_range',
 'features__text_pipeline__count_vectorizer__preprocessor',
 'fe

In [17]:
# We grid-search over a small set of parameters to try to improve model performance

parameter_grid = {'clf__estimator__n_estimators': [10, 20]}
# Other options, not searched here:
# 'clf__estimator__max_depth': [10, 20]
# 'clf__estimator__learning_rate' (applies to boosting estimators, not RandomForestClassifier)

model_1_cv = GridSearchCV(pipeline1, param_grid=parameter_grid, n_jobs=-1)

model_1_cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('count_vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'clf__estimator__n_estimators': [10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
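After fitting, a `GridSearchCV` object exposes the winning parameters and the per-candidate scores. The sketch below assumes a toy dataset, a two-fold split, and a smaller grid than the one above. It also shows the `step__estimator__param` naming used to reach the forest inside `MultiOutputClassifier`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up data, repeated so that each fold sees both classes.
messages = pd.Series(
    ["need water", "storm damage", "water please", "big storm"] * 3)
labels = pd.DataFrame({"water": [1, 0, 1, 0] * 3, "storm": [0, 1, 0, 1] * 3})

toy_pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer()),
    ("tfidf_transformer", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# 'clf' is the pipeline step, 'estimator' the wrapped forest.
toy_grid = {"clf__estimator__n_estimators": [5, 10]}

search = GridSearchCV(toy_pipeline, param_grid=toy_grid, cv=2)
search.fit(messages, labels)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])
```

With the default `scoring=None`, the search ranks candidates by the pipeline's own `score` method, so passing an explicit scorer may be worth considering for imbalanced labels.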

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [18]:
y_pred_cv_test = model_1_cv.predict(X_test)
y_pred_cv_train = model_1_cv.predict(X_train)

In [19]:
# Classification report on testing set
print(classification_report(y_test.values, y_pred_cv_test, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.82      0.95      0.88      5036
               request       0.84      0.40      0.54      1140
                 offer       0.00      0.00      0.00        26
           aid_related       0.77      0.57      0.65      2736
          medical_help       0.63      0.06      0.11       542
      medical_products       0.71      0.07      0.12       332
     search_and_rescue       0.25      0.01      0.02       185
              security       0.00      0.00      0.00       142
              military       0.61      0.06      0.11       223
                 water       0.84      0.24      0.38       439
                  food       0.83      0.36      0.50       750
               shelter       0.84      0.22      0.35       619
              clothing       0.67      0.10      0.17       100
                 money       0.80      0.03      0.05       152
        missing_people       1.00      

  'precision', 'predicted', average, warn_for)


In [20]:
# Classification report on training set
print(classification_report(y_train.values, y_pred_cv_train, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       1.00      1.00      1.00     15058
               request       1.00      0.97      0.99      3334
                 offer       1.00      0.83      0.90        92
           aid_related       1.00      0.99      1.00      8124
          medical_help       1.00      0.93      0.96      1542
      medical_products       1.00      0.92      0.96       981
     search_and_rescue       1.00      0.88      0.93       539
              security       1.00      0.85      0.92       329
              military       1.00      0.95      0.97       637
                 water       1.00      0.97      0.99      1233
                  food       1.00      0.98      0.99      2173
               shelter       1.00      0.97      0.98      1695
              clothing       1.00      0.93      0.96       305
                 money       1.00      0.92      0.96       452
        missing_people       1.00      

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [21]:
# A custom transformer to extract the starting verb of a message

class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    """
    Extracts whether a sentence in the message starts with a verb; the result is used as an additional feature
    for the classification model
    """
    
    def starting_verb(self, text):
        sent_list = nltk.sent_tokenize(text)
        
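        for sent in sent_list:
            pos_tags = nltk.pos_tag(tokenize(sent))
            # Skip sentences that yield no tokens; indexing them would raise an IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            # tokenize() lowercases its tokens, so compare the retweet marker case-insensitively
            if first_tag in ['VB', 'VBP'] or first_word.upper() == 'RT':
                return True
        return False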
    
    def fit(self, X, y=None):
        return self
    
    
    def transform(self, X):
        X_tag = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tag)

In [22]:
pipeline2 = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('count_vectorizer', CountVectorizer(tokenizer=tokenize)),
            ('tfidf_transformer', TfidfTransformer())
        ])),
        ('starting_verb_transformer', StartingVerbExtractor())
    ])),
    ('classifier', MultiOutputClassifier(RandomForestClassifier()))
])
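`FeatureUnion` stacks the outputs of its transformers side by side, so the custom feature becomes one extra column next to the TF-IDF matrix. The sketch below illustrates this with a hypothetical transformer, `StartsWithPlease`, which stands in for `StartingVerbExtractor` so that no NLTK data is needed.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class StartsWithPlease(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for StartingVerbExtractor (no NLTK needed)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        flags = pd.Series(X).str.lower().str.startswith("please")
        # One float column: 1.0 if the message starts with 'please'.
        return flags.to_numpy(dtype=float).reshape(-1, 1)


messages = pd.Series(
    ["Please send water", "The storm is over", "please help us"])

features = FeatureUnion([
    ("text_pipeline", Pipeline([
        ("count_vectorizer", CountVectorizer()),
        ("tfidf_transformer", TfidfTransformer()),
    ])),
    ("starts_with_please", StartsWithPlease()),
])
combined = features.fit_transform(messages)

# Look up the fitted vectorizer to count the TF-IDF columns.
fitted_text = dict(features.transformer_list)["text_pipeline"]
vocab_size = len(fitted_text.named_steps["count_vectorizer"].vocabulary_)
print(combined.shape, vocab_size)
```

The combined matrix has one more column than the vocabulary, and that last column holds the custom flag.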

#### Training new model with Verb Extraction

In [23]:
model_2 = pipeline2.fit(X_train, y_train)

#### Testing new model with Verb Extraction

In [24]:
y_pred2_train = model_2.predict(X_train)
y_pred2_test = model_2.predict(X_test)

# Classification report on testing set

print(classification_report(y_test.values, y_pred2_test, target_names=y.columns.values))


                        precision    recall  f1-score   support

               related       0.83      0.93      0.88      5036
               request       0.83      0.38      0.52      1140
                 offer       0.00      0.00      0.00        26
           aid_related       0.77      0.54      0.63      2736
          medical_help       0.64      0.06      0.11       542
      medical_products       0.77      0.07      0.13       332
     search_and_rescue       0.55      0.09      0.15       185
              security       0.00      0.00      0.00       142
              military       0.53      0.09      0.15       223
                 water       0.85      0.24      0.38       439
                  food       0.84      0.26      0.40       750
               shelter       0.84      0.26      0.40       619
              clothing       0.62      0.18      0.28       100
                 money       0.80      0.03      0.05       152
        missing_people       0.00      

  'precision', 'predicted', average, warn_for)


In [25]:
# Classification report on training set

print(classification_report(y_train.values, y_pred2_train, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.99      1.00      0.99     15058
               request       1.00      0.92      0.96      3334
                 offer       1.00      0.76      0.86        92
           aid_related       1.00      0.96      0.98      8124
          medical_help       1.00      0.86      0.92      1542
      medical_products       1.00      0.84      0.91       981
     search_and_rescue       1.00      0.82      0.90       539
              security       1.00      0.71      0.83       329
              military       1.00      0.85      0.92       637
                 water       1.00      0.93      0.96      1233
                  food       1.00      0.94      0.97      2173
               shelter       1.00      0.91      0.95      1695
              clothing       1.00      0.88      0.94       305
                 money       1.00      0.82      0.90       452
        missing_people       1.00      

### 9. Export your model as a pickle file

In [27]:
Pkl_file = "classifier.pkl"  

with open(Pkl_file, 'wb') as file:  
    pickle.dump(model_2, file)
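Loading the saved file back is the mirror image of the dump above. This is a sketch of the round trip, assuming a small stand-in model and a temporary directory, so it does not overwrite the real `classifier.pkl`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small stand-in model; the real notebook pickles model_2 instead.
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
toy_model.fit(
    ["need water", "storm damage", "water please", "big storm"],
    [1, 0, 1, 0],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier.pkl")
    with open(path, "wb") as file:
        pickle.dump(toy_model, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model should predict exactly like the original.
samples = ["we need water", "storm coming"]
same_predictions = bool(
    (toy_model.predict(samples) == restored.predict(samples)).all())
print(same_predictions)
```

Pickle stores functions and classes by reference, so loading the real model requires `tokenize` and `StartingVerbExtractor` to be importable under the same names. Pickle files should only be loaded from trusted sources.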

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.