# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [None]:
# import libraries
import os
import pickle
import re
import sqlite3
import sys

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
from sqlalchemy import create_engine

# Natural Language Toolkit
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

from scipy.stats import gmean
# import relevant modules from scikit-learn
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.base import BaseEstimator,TransformerMixin



In [None]:
# load data from database
engine = create_engine('sqlite:///final_response.db')
df = pd.read_sql_table('final_responsetable', engine)

In [37]:
df.describe() # note from describe(): child_alone is zero in every row, so we can simply drop that column.

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0
mean,15224.82133,0.77365,0.170659,0.004501,0.414251,0.079493,0.050084,0.027617,0.017966,0.032804,0.0,0.063778,0.111497,0.088267,0.015449,0.023039,0.011367,0.033377,0.045545,0.131446,0.065037,0.045812,0.050847,0.020293,0.006065,0.010795,0.004577,0.011787,0.043904,0.278341,0.082202,0.093187,0.010757,0.093645,0.020217,0.052487,0.193584
std,8826.88914,0.435276,0.376218,0.06694,0.492602,0.270513,0.218122,0.163875,0.132831,0.178128,0.0,0.244361,0.314752,0.283688,0.123331,0.150031,0.106011,0.179621,0.2085,0.337894,0.246595,0.209081,0.219689,0.141003,0.077643,0.103338,0.067502,0.107927,0.204887,0.448191,0.274677,0.2907,0.103158,0.29134,0.140743,0.223011,0.395114
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7446.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15662.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22924.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30265.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


The summary above and the counts below show that only `related` looks wrong: it takes the values 0, 1 and 2, while every other label is binary.

In [38]:
# df['related'].value_counts()
df.groupby('related').count() # groupby needs an aggregation afterwards; count here, but mean would also work.

related,id,message,original,genre,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0.0,6122,6122,3395,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122,6122
1.0,19906,19906,6643,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906,19906
2.0,188,188,132,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188


In [39]:
def replace_with_majority(row):
    # import pdb; pdb.set_trace()
    if row == 2.0:
        return 1.0
    return row

df['related'] = df['related'].apply(replace_with_majority)
df['related'].value_counts() # great, we've gotten rid of the 2.0 which was most probably an error.

1.0    20094
0.0     6122
Name: related, dtype: int64
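
The same remapping can also be done without a row-wise `apply`. The following is a minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset), using pandas' `Series.replace`:

```python
import pandas as pd

# Illustrative label column containing the stray value 2.0.
related = pd.Series([0.0, 1.0, 2.0, 1.0, 2.0], name="related")

# Map 2.0 -> 1.0 in one vectorized call; all other values are left untouched.
related_fixed = related.replace(2.0, 1.0)
counts = related_fixed.value_counts()
print(counts)
```

On a column of this size the difference is negligible, but on the full dataset a vectorized call avoids one Python function call per row.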

## Remove the class that contains only 0.0 (child_alone)
If this class is not removed, fitting fails with the error below:
**ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0.0**

In [40]:
df.drop(['child_alone'], axis=1, inplace=True)
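
Rather than naming the column by hand, single-valued columns could be detected automatically. This is a sketch on a small made-up frame (`frame` and its values are assumptions for illustration), using `nunique` to find columns that hold only one distinct value:

```python
import pandas as pd

frame = pd.DataFrame({
    "related": [1, 0, 1],
    "child_alone": [0, 0, 0],  # constant column, like the one dropped above
    "water": [0, 1, 0],
})

# Collect every column that holds a single distinct value, then drop them all.
constant_cols = [col for col in frame.columns if frame[col].nunique() <= 1]
frame_clean = frame.drop(columns=constant_cols)
print(constant_cols, list(frame_clean.columns))
```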

In [41]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
# ------------------------------------
X = df['message']
y = df.iloc[:,4:]
# ------------------------------------
features = df['message']
labels = df.iloc[:,4:] # start at column 4: 'id', 'message', 'original' and 'genre' are not labels.
# features # note, we can do some cleaning here.
X.shape, y.shape

((26216,), (26216, 35))
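
Positional slicing with `iloc[:, 4:]` depends on the column order staying fixed. As a hedged alternative, the label columns can be selected by dropping the non-label columns by name. The frame below is a two-row stand-in with only two of the label columns:

```python
import pandas as pd

frame = pd.DataFrame({
    "id": [2, 7],
    "message": ["first message", "second message"],
    "original": ["premier", "second"],
    "genre": ["direct", "direct"],
    "related": [1.0, 1.0],
    "request": [0.0, 1.0],
})

# Everything that is not an identifier or text column is treated as a label.
non_label_cols = ["id", "message", "original", "genre"]
X_demo = frame["message"]
y_demo = frame.drop(columns=non_label_cols)
print(X_demo.shape, y_demo.shape)
```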

In [43]:
X # pandas Series of messages

0        Weather update - a cold front from Cuba that c...
1                  Is the Hurricane over or is it not over
2                          Looking for someone but no name
3        UN reports Leogane 80-90 destroyed. Only Hospi...
4        says: west side of Haiti, rest of the country ...
5                   Information about the National Palace-
6                           Storm at sacred heart of jesus
7        Please, we need tents and water. We are in Sil...
8          I would like to receive the messages, thank you
9        I am in Croix-des-Bouquets. We have health iss...
10       There's nothing to eat and water, we starving ...
11       I am in Petionville. I need more information r...
12       I am in Thomassin number 32, in the area named...
13       Let's do it together, need food in Delma 75, i...
14       More information on the 4636 number in order f...
15       A Comitee in Delmas 19, Rue ( street ) Janvier...
16       We need food and water in Klecin 12. We are dy.

In [44]:
y # pandas DataFrame with the 35 label columns.

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [45]:
#----------------------------------------------------------------------------

In [46]:
# custom function so you can inspect each text-cleaning step.
def normalize_clean(text):
    text = text.lower() # normalize case
    text = re.sub(r'http\S+', ' ', text) # remove urls first, while they are still intact
    text = re.sub(r"[^a-z]", " ", text) # replace punctuation, special chars and digits with spaces
    return text

features = features.apply(normalize_clean)
features

0        weather update   a cold front from cuba that c...
1                  is the hurricane over or is it not over
2                          looking for someone but no name
3        un reports leogane       destroyed  only hospi...
4        says  west side of haiti  rest of the country ...
5                   information about the national palace 
6                           storm at sacred heart of jesus
7        please  we need tents and water  we are in sil...
8          i would like to receive the messages  thank you
9        i am in croix des bouquets  we have health iss...
10       there s nothing to eat and water  we starving ...
11       i am in petionville  i need more information r...
12       i am in thomassin number     in the area named...
13       let s do it together  need food in delma     i...
14       more information on the      number in order f...
15       a comitee in delmas     rue   street   janvier...
16       we need food and water in klecin     we are dy.
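
The order of the substitutions matters, because a URL pattern such as `http\S+` only matches while the URL is still in one piece. A small sketch with a made-up message (the two helper names are hypothetical) compares both orders:

```python
import re

def strip_then_urls(text):
    # Punctuation is removed first, so the URL is already broken into words.
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return re.sub(r"http\S+", "", text)

def urls_then_strip(text):
    # URLs are removed while intact, then everything but letters is dropped.
    text = re.sub(r"http\S+", "", text.lower())
    return re.sub(r"[^a-z]", " ", text)

sample = "Flood map at http://example.com/map now"
tokens_a = strip_then_urls(sample).split()
tokens_b = urls_then_strip(sample).split()
print(tokens_a)  # URL fragments survive as separate words
print(tokens_b)  # URL is gone entirely
```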

Next we remove stop words.
Stop words are the most commonly used words in a language (such as “the”, “a”, “an”, “in”).
Removing them is an important step towards a more robust input. We do all of this in the tokenize function.
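
To make the idea concrete without any downloads, here is a sketch using a tiny hand-written stop-word set. `DEMO_STOP_WORDS` is a hypothetical stand-in; the notebook itself relies on NLTK's English list.

```python
# Hypothetical, tiny stop-word list for illustration only.
DEMO_STOP_WORDS = {"the", "a", "an", "in", "is", "it", "or", "not", "over"}

def remove_stop_words(tokens, stop_words=DEMO_STOP_WORDS):
    # Keep only the tokens that carry content.
    return [tok for tok in tokens if tok not in stop_words]

tokens = "is the hurricane over or is it not over".split()
kept = remove_stop_words(tokens)
print(kept)
```

Using a `set` keeps each membership test constant-time, which matters when the filter runs over tens of thousands of messages.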

### 2. Write a tokenization function to process your text data

In [47]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def tokenize_custom(text):
    # Extract the word tokens from the provided text row
    tokens = nltk.word_tokenize(text)
    tokens = [w for w in tokens if w not in stop_words] # stop-word removal

    # Lemmatizer reduces inflected forms of a word to a common base form
    lemmatizer = nltk.WordNetLemmatizer()
    cleaned_tokens = [lemmatizer.lemmatize(w).strip() for w in tokens]

    return cleaned_tokens

features = features.apply(tokenize_custom)
features

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0        [weather, update, cold, front, cuba, could, pa...
1                                              [hurricane]
2                                 [looking, someone, name]
3        [un, report, leogane, destroyed, hospital, st,...
4        [say, west, side, haiti, rest, country, today,...
5                          [information, national, palace]
6                            [storm, sacred, heart, jesus]
7                 [please, need, tent, water, silo, thank]
8                   [would, like, receive, message, thank]
9        [croix, de, bouquet, health, issue, worker, sa...
10                [nothing, eat, water, starving, thirsty]
11             [petionville, need, information, regarding]
12       [thomassin, number, area, named, pyron, would,...
13        [let, together, need, food, delma, didine, area]
14       [information, number, order, participate, see,...
15       [comitee, delmas, rue, street, janvier, impass...
16       [need, food, water, klecin, dying, hunger, imp.

In [48]:
def tokenize(text):    
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens


## Summary of text preprocessing
* Normalize - convert to lowercase and strip punctuation.
* Tokenize - split the text into individual words.
* Remove stop words - reduce the dimensionality we are working with.
* Stem/lemmatize - further reduce the number of distinct words.

In [49]:
# from lecture - here is a custom transformer for us to include in our pipeline
class StartingVerbExtractor(BaseEstimator, TransformerMixin):

    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence)) # tag the tokens produced by tokenize() above
            if not pos_tags: # skip sentences that yield no tokens
                continue
            first_word, first_tag = pos_tags[0]
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True
        return False

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)
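
The contract a custom transformer has to satisfy is small: `fit` returns `self`, and `transform` returns one row per input sample. As a sketch of that contract, here is a hypothetical `MessageLengthExtractor` (not part of the lecture code) that emits a single numeric feature:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one feature per message, its character count."""

    def fit(self, X, y=None):
        # Nothing to learn, so fit only returns self.
        return self

    def transform(self, X):
        # Return a 2-D array with one row per message and one column.
        lengths = [len(text) for text in X]
        return np.array(lengths, dtype=float).reshape(-1, 1)

lengths = MessageLengthExtractor().fit_transform(["need water", "help"])
print(lengths)
```

Returning a 2-D array (or DataFrame) is what lets a `FeatureUnion` stack the result next to other feature blocks.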

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other categories in the dataset (36 originally, 35 once `child_alone` is dropped). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
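
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on a tiny synthetic problem with two target columns (the arrays are made up for the example). It trains one copy of the base estimator per target column:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# One numeric feature and two binary targets, invented for illustration.
X_demo = np.array([[0.0], [0.1], [0.9], [1.0], [0.2], [0.8]])
Y_demo = np.array([[0, 1], [0, 1], [1, 0], [1, 0], [0, 1], [1, 0]])

clf = MultiOutputClassifier(LogisticRegression()).fit(X_demo, Y_demo)
pred = clf.predict(X_demo)

# One fitted estimator per target column, one prediction column per target.
print(len(clf.estimators_), pred.shape)
```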

In [None]:
# We will try various models and use GridSearchCV to get the best hyperparameters for each model.
models = [
        {
            "name": "LogisticRegression",
            "estimator": LogisticRegression(verbose=True),
            "hyperparameters":
                {
                    "classifier__estimator__solver": ["newton-cg", "lbfgs", "liblinear"]
                }
        },
        {
            "name": "AdaBoostClassifier",
            "estimator": AdaBoostClassifier(),
            "hyperparameters":
                {
                    'classifier__estimator__learning_rate': [0.01, 0.02, 0.05],
                    'classifier__estimator__n_estimators': [10, 20, 40]
                }
        },
        {
            "name": "RandomForestClassifier",
            "estimator": RandomForestClassifier(random_state=1, verbose=True),
            "hyperparameters":
                {
                    "classifier__estimator__n_estimators": [20],
                    "classifier__estimator__criterion": ["entropy", "gini"],
                    "classifier__estimator__max_depth": [2, 5, 10],
                    # max_features was listed twice; a dict keeps only the last value for a
                    # duplicate key, so ["log2", "sqrt"] was never searched. Only the effective grid is kept.
                    "classifier__estimator__max_features": [1, 5, 8],
                    "classifier__estimator__min_samples_split": [2, 3, 5]
                }
        }
    ]

In [51]:
# the instructions above suggest using MultiOutputClassifier,
# a strategy based on fitting one classifier per target.

def build_model(models, tune_hyperparameters=False):
    pipes, grids = [], []
    # create a pipeline + GridSearch obj for each model.
    for model in models:
        # import pdb; pdb.set_trace()
        pipeline = Pipeline([
            ('features', FeatureUnion([

                ('text_pipeline', Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()) # reweight the raw counts with TF-IDF.
                ])),

                ('starting_verb', StartingVerbExtractor())
            ])),

            ('classifier', MultiOutputClassifier(model['estimator']))
        ])
        # --------------------------------------------
        # Hyperparameter tuning
        # --------------------------------------------
        cv = GridSearchCV(pipeline, param_grid=model['hyperparameters'], scoring='f1_micro', n_jobs=-1)
        grids.append(cv)
        pipes.append(pipeline)
    return grids if tune_hyperparameters else pipes

def display_results(y_test, y_pred):
    pass

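From above, CountVectorizer and TfidfTransformer run in sequence inside `text_pipeline`, and `FeatureUnion` runs that pipeline in parallel with `StartingVerbExtractor`.
* CountVectorizer converts the text to numeric counts for the model to digest.
* TfidfTransformer reweights those counts, and StartingVerbExtractor adds one additional feature.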
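
To see how the branches of a `FeatureUnion` are combined, the sketch below stacks a count-then-TF-IDF pipeline next to a single extra column. The `message_length` helper is a hypothetical stand-in for `StartingVerbExtractor`, chosen so the example needs no NLTK data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def message_length(texts):
    # Hypothetical extra feature: character count of each message.
    return np.array([[len(t)] for t in texts], dtype=float)

# Sequential branch: counts first, then TF-IDF reweighting.
text_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])

# Parallel branches: their outputs are concatenated column-wise.
union = FeatureUnion([
    ("text_pipeline", text_pipeline),
    ("length", FunctionTransformer(message_length)),
])

docs = ["need water", "need food and water"]
text_only = text_pipeline.fit_transform(docs)
combined = union.fit_transform(docs)
print(text_only.shape, combined.shape)
```

The combined matrix has exactly one more column than the text branch alone, which is the extra feature appended by the second branch.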

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.33, 
                                                    random_state=42)

In [None]:
pipelines = build_model(models)
for pipe, model in zip(pipelines, models):
    # import pdb; pdb.set_trace()
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print("=" * 50, model['name'], "=" * 25)
    print(classification_report(y_test.values, y_pred, target_names=labels.columns.values))
    # import pdb; pdb.set_trace()

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
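
A minimal sketch of that column-by-column loop, on made-up arrays with two illustrative label names, might look like this. `zero_division=0` is set so that categories without predicted positives report 0 rather than emitting a warning.

```python
import numpy as np
from sklearn.metrics import classification_report, precision_recall_fscore_support

label_names = ["related", "request"]  # illustrative subset of the categories
y_true = np.array([[1, 0], [1, 1], [0, 0], [1, 1]])
y_hat = np.array([[1, 0], [1, 0], [0, 0], [0, 1]])

scores = {}
for i, name in enumerate(label_names):
    # Full text report for this category.
    print(name)
    print(classification_report(y_true[:, i], y_hat[:, i], zero_division=0))
    # Also keep the positive-class numbers for later comparison.
    p, r, f, _ = precision_recall_fscore_support(
        y_true[:, i], y_hat[:, i], average="binary", zero_division=0
    )
    scores[name] = (p, r, f)
```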

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
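# one possible completion, reusing the grids defined in `models` above
parameters = models[0]['hyperparameters']

cv = build_model(models, tune_hyperparameters=True)[0]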
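
Grid keys such as `classifier__estimator__solver` follow scikit-learn's nested-parameter naming: the pipeline step name, then the wrapped `estimator` inside `MultiOutputClassifier`, then the parameter, joined by double underscores. The sketch below shows this on synthetic numeric data; the scaler step and the `C` grid are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features, two binary targets.
rng = np.random.RandomState(0)
X_demo = rng.rand(40, 3)
Y_demo = np.column_stack([X_demo[:, 0] > 0.5, X_demo[:, 1] > 0.5]).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", MultiOutputClassifier(LogisticRegression())),
])

# step name "classifier" -> wrapped "estimator" -> parameter "C"
param_grid = {"classifier__estimator__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid=param_grid, scoring="f1_micro", cv=3)
search.fit(X_demo, Y_demo)
print(search.best_params_)
```

Calling `pipe.get_params().keys()` lists every valid key, which is a quick way to check a grid before starting a long search.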

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [19]:
y_inf_train = pipeline_trained.predict(X_train)
y_inf_test = pipeline_trained.predict(X_test)

# Print classification report on test data
print(classification_report(y_test.values, y_inf_test, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.81      0.95      0.87      6598
               request       0.78      0.50      0.61      1472
                 offer       0.12      0.05      0.07        38
           aid_related       0.74      0.61      0.67      3545
          medical_help       0.57      0.25      0.35       701
      medical_products       0.60      0.30      0.40       446
     search_and_rescue       0.57      0.17      0.26       226
              security       0.17      0.03      0.04       160
              military       0.56      0.30      0.39       267
           child_alone       0.00      0.00      0.00         0
                 water       0.74      0.65      0.69       543
                  food       0.81      0.66      0.73       965
               shelter       0.76      0.53      0.63       775
              clothing       0.70      0.41      0.52       127
                 money       0.47      


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file
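
A minimal sketch of the export step, assuming a small stand-in model and a hypothetical file name `classifier.pkl` written to a temporary directory:

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Small stand-in for the trained pipeline.
model = LogisticRegression().fit([[0.0], [1.0], [0.2], [0.9]], [0, 1, 0, 1])

# Hypothetical output location for the example.
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm the predictions are unchanged.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same = bool((restored.predict([[0.1], [0.8]]) == model.predict([[0.1], [0.8]])).all())
print(same)
```

Pickle stores functions and classes by reference, so a pipeline that uses `tokenize` or `StartingVerbExtractor` can only be loaded where those names are importable.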

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
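
One possible skeleton for such a script is sketched below. The argument names `database_filepath` and `model_filepath` are assumptions for illustration and may differ from the template; the step list only mirrors the order used in this notebook.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical command-line interface: where to read data, where to save the model.
    parser = argparse.ArgumentParser(description="Train the message classifier.")
    parser.add_argument("database_filepath", help="SQLite database with the cleaned messages")
    parser.add_argument("model_filepath", help="path for the pickled model")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Placeholders for the notebook steps, in the order they appear above.
    steps = ["load data", "build model", "train", "evaluate", "save model"]
    for step in steps:
        print(f"{step}: {args.database_filepath} -> {args.model_filepath}")
    return args

args = main(["final_response.db", "classifier.pkl"])
```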