# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
import pandas as pd
import sqlalchemy as db

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.model_selection import GridSearchCV
from tensorflow import keras
from scikeras.wrappers import KerasClassifier
from sklearn.base import TransformerMixin, BaseEstimator

# download the NLTK resources used by tokenize()
nltk.download(['punkt', 'stopwords', 'wordnet'])


[nltk_data] Downloading package punkt to
[nltk_data]     /nethome/m.chamanbaz/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /nethome/m.chamanbaz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /nethome/m.chamanbaz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# load data from database
engine = db.create_engine('sqlite:///messages_and_categories.db')
df = pd.read_sql_table('messages_and_categories', con=engine)
X = df['message']
column_names = df.columns
Y = df[column_names[4:]]
Y.head()


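| | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |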


### 2. Write a tokenization function to process your text data

In [3]:
def tokenize(text):
    '''
    INPUT
    text - the input text
    
    OUTPUT
    tokens - the tokenized text
    
    This function cleans and tokenizes the input text.
    '''
    stop_words = stopwords.words("english")
    lemmatizer = WordNetLemmatizer()
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens

In [4]:
# test the tokenize function
corpus = ["The first time you see The Second Renaissance it may look boring."]
tokens = tokenize(corpus[0])
tokens

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring']

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
# build a machine learning pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(KNeighborsClassifier(n_jobs=-1)))
])

# pipeline = Pipeline([
#     ('vect', CountVectorizer(tokenizer=tokenize)),
#     ('tfidf', TfidfTransformer()),
#     ('clf', MultiOutputClassifier(RandomForestClassifier(n_jobs=-1)))
# ])



### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [6]:
def display_results(y_test, y_pred):
    """
    INPUT 
    y_test - the actual label
    y_pred - the predicted label
    
    This function prints the unique predicted labels and the
    accuracy of each output category (column of y_test)
    """
    labels = np.unique(y_pred)
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
    print("Accuracy:", accuracy)

In [7]:
# split the dataset into test and training 
X_train, X_test, y_train, y_test = train_test_split(X, Y)

# fit the pipeline 
pipeline.fit(X_train, y_train)

# predict on test data
y_pred = pipeline.predict(X_test)

# display results
display_results(y_test, y_pred)

Labels: [0 1]
Accuracy: related                   0.778610
request                   0.837128
offer                     0.995264
aid_related               0.592208
medical_help              0.921008
medical_products          0.949274
search_and_rescue         0.971887
security                  0.980138
military                  0.969595
child_alone               1.000000
water                     0.936440
food                      0.893812
shelter                   0.917036
clothing                  0.983652
money                     0.978762
missing_people            0.988235
refugees                  0.966387
death                     0.951566
other_aid                 0.861115
infrastructure_related    0.935982
transport                 0.953858
buildings                 0.948969
electricity               0.979221
tools                     0.994500
hospitals                 0.990222
shops                     0.995111
aid_centers               0.986707
other_infrastructure      0.956

In [10]:
y_pred.shape

(6545, 36)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [8]:
for column_number, column_name in enumerate(y_test.columns):
    print(f"Classification Report for the feature {column_name} is:\n")
    print(classification_report(y_test[column_name], y_pred[:,column_number]))
y_test.shape

Classification Report for the feature related is:

              precision    recall  f1-score   support

           0       0.69      0.12      0.20      1549
           1       0.78      0.98      0.87      4996

    accuracy                           0.78      6545
   macro avg       0.74      0.55      0.54      6545
weighted avg       0.76      0.78      0.71      6545

Classification Report for the feature request is:

              precision    recall  f1-score   support

           0       0.84      0.99      0.91      5433
           1       0.72      0.07      0.12      1112

    accuracy                           0.84      6545
   macro avg       0.78      0.53      0.52      6545
weighted avg       0.82      0.84      0.78      6545

Classification Report for the feature offer is:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6514
           1       0.00      0.00      0.00        31

    accuracy                      

  _warn_prf(average, modifier, msg_start, len(result))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6481
           1       0.00      0.00      0.00        64

    accuracy                           0.99      6545
   macro avg       0.50      0.50      0.50      6545
weighted avg       0.98      0.99      0.99      6545

Classification Report for the feature shops is:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6513
           1       0.00      0.00      0.00        32

    accuracy                           1.00      6545
   macro avg       0.50      0.50      0.50      6545
weighted avg       0.99      1.00      0.99      6545

Classification Report for the feature aid_centers is:

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6458
           1       0.00      0.00      0.00        87

    accuracy                           0.99      6545
   macro avg       0.49   

  _warn_prf(average, modifier, msg_start, len(result))


(6545, 36)
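
The per-column reports above are long, and the repeated `_warn_prf` lines most likely come from categories where the model never predicts the positive class. As a rough sketch (not part of the original notebook), the positive-class precision, recall, and F1 can be collected into one table. The helper name `summarize_reports` and the toy data are assumptions made for illustration; `zero_division=0` sets the undefined scores to 0 without emitting the warning.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support


def summarize_reports(y_true, y_pred):
    """Return one row per category with precision, recall and F1 of class 1.

    y_true is a DataFrame of 0/1 columns; y_pred is a 2-D array whose
    columns are in the same order as y_true.columns.
    """
    rows = []
    for i, name in enumerate(y_true.columns):
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true[name], y_pred[:, i],
            labels=[1], average=None, zero_division=0,
        )
        rows.append({
            "category": name,
            "precision": precision[0],
            "recall": recall[0],
            "f1": f1[0],
            "support": int((y_true[name] == 1).sum()),
        })
    return pd.DataFrame(rows).set_index("category")


# Toy data standing in for y_test and y_pred (hypothetical values).
toy_true = pd.DataFrame({"related": [1, 1, 0, 1], "offer": [0, 0, 0, 1]})
toy_pred = np.array([[1, 0], [1, 0], [1, 0], [0, 0]])

summary = summarize_reports(toy_true, toy_pred)
print(summary.sort_values("f1"))
```

In the notebook this could be called as `summarize_reports(y_test, y_pred)`. Sorting by F1 makes rare categories such as `offer` easy to spot, where high accuracy hides a recall of zero.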

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
# hyperparameter grid: every combination of these values is evaluated
parameters = {
    'vect__binary': (True, False),
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__smooth_idf': (True, False),
    'tfidf__use_idf': (True, False),
    'clf__estimator__leaf_size': [10, 30, 60],
    'clf__estimator__n_neighbors': [5, 10, 20],
}

cv = GridSearchCV(pipeline, param_grid=parameters, cv=2, verbose=2)
cv.fit(X_train, y_train)
y_pred_cv = cv.predict(X_test)

Fitting 2 folds for each of 216 candidates, totalling 432 fits
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.3min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.2min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.3min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.2min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=1.0; total time= 2.3min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=5, tfidf__smo

[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.3min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.2min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.3min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.2min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=1.0; total time= 2.3min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vec

[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.3min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.2min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.3min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.3min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=1.0; total time= 2.3min
[CV] END clf__estimator__leaf_size=10, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vec

[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.3min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.2min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.3min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.2min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=1.0; total time= 2.3min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max

[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.4min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.4min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.4min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.2min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=1.0; total time= 2.3min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vec

[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.3min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.2min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.3min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.3min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=1.0; total time= 2.3min
[CV] END clf__estimator__leaf_size=30, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vec

[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.3min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.2min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.3min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.2min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=1.0; total time= 2.3min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=5, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max

[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.3min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.2min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.3min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.2min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=1.0; total time= 2.3min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=10, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vec

[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.3min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.5; total time= 2.2min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.3min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=0.75; total time= 2.2min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vect__max_df=1.0; total time= 2.3min
[CV] END clf__estimator__leaf_size=60, clf__estimator__n_neighbors=20, tfidf__smooth_idf=True, tfidf__use_idf=True, vect__binary=True, vec

In [None]:
display_results(y_test, y_pred_cv)
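
The log above reports 216 candidates and 432 fits at roughly 2.3 minutes each, which works out to about 16 to 17 hours for the full grid. The sketch below, on made-up toy messages, shows a much smaller grid and how to read the outcome from `best_params_`, `best_score_`, and `cv_results_`. It uses the default `CountVectorizer` tokenizer instead of `tokenize` so that it runs on its own; the data and grid values are assumptions. One further hedged observation: with sparse TF-IDF input, scikit-learn's `KNeighborsClassifier` falls back to brute-force search, so `leaf_size` probably has no effect and could be dropped from the grid.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy messages and two toy target categories (hypothetical data).
toy_X = ["need water", "storm coming", "need food", "big storm",
         "water please", "storm warning", "food please", "flood and storm"]
toy_Y = pd.DataFrame({"aid": [1, 0, 1, 0, 1, 0, 1, 0],
                      "weather": [0, 1, 0, 1, 0, 1, 0, 1]})

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(KNeighborsClassifier())),
])

# 2 x 2 = 4 candidates, times 2 folds = 8 fits.
small_grid = {
    "vect__binary": [True, False],
    "clf__estimator__n_neighbors": [1, 3],
}

search = GridSearchCV(toy_pipeline, param_grid=small_grid, cv=2)
search.fit(toy_X, toy_Y)

print("Best parameters:", search.best_params_)
print("Best mean CV score:", search.best_score_)

# One row per candidate, best first.
results = (pd.DataFrame(search.cv_results_)
           [["params", "mean_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(results)
```

After the search, `search.best_estimator_` holds the pipeline refitted on all the training data with the winning parameters.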

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [12]:
for column_number, column_name in enumerate(y_test.columns):
    print(f"Classification Report for the feature {column_name} is:\n")
    print(classification_report(y_test[column_name], y_pred_cv[:,column_number]))
y_test.shape

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x000001BF3175D8B0>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x000001BF3175D8B0>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__subl

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
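
As one possible illustration of the second idea, a `FeatureUnion` can place a hand-made feature next to the TF-IDF matrix. This is only a sketch: the `MessageLengthExtractor` class, the word-count feature, and the toy data are all hypothetical, and the default `CountVectorizer` tokenizer is used so the example runs on its own.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of words in each message."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # Return a column vector with one word count per message.
        return np.array([[len(str(text).split())] for text in X], dtype=float)


feature_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("text", Pipeline([
            ("vect", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
        ])),
        ("length", MessageLengthExtractor()),
    ])),
    ("clf", MultiOutputClassifier(KNeighborsClassifier(n_neighbors=1))),
])

# Toy messages with two toy target categories (hypothetical data).
toy_X = ["we need water and food now", "storm", "please send food", "big storm coming"]
toy_Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

feature_pipeline.fit(toy_X, toy_Y)
combined = feature_pipeline.named_steps["features"].transform(toy_X)
toy_pred = feature_pipeline.predict(toy_X)
print(combined.shape, toy_pred.shape)
```

The union stacks the TF-IDF columns and the length column side by side, so the classifier sees one wider feature matrix. Because the length feature is not scaled, it can dominate distance-based models such as k-nearest neighbours; scaling it or switching classifier may be worth trying.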

In [35]:
class modify_input(BaseEstimator, TransformerMixin):
    """Convert the sparse TF-IDF matrix to a dense array so that it
    can be passed to the neural network classifier."""

    def transform(self, X):
        
        return X.toarray()

    def fit(self, X, y=None, **fit_params):
        return self


(19635, 27147)


InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

In [None]:
def get_model(number_of_layers, depth_of_layers):
    """
    INPUT
    number_of_layers - the number of hidden layers in the neural network
    depth_of_layers - the number of units in each hidden layer
    
    OUTPUT
    A Keras model designed for multioutput classification
    
    """
    n_outputs = y_train.shape[1]
    model = keras.models.Sequential()
    for _ in range(number_of_layers):
        model.add(keras.layers.Dense(depth_of_layers, activation='relu'))
        model.add(keras.layers.Dropout(0.5))
    
    model.add(keras.layers.Dense(n_outputs, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])
    return model

Define a pipeline that uses a neural network classifier.
Note that the classifier is wrapped in `KerasClassifier` from the scikeras library, because the older scikit-learn wrapper of the same name that shipped with Keras does not handle multi-output models.

In [None]:
pipeline = Pipeline([
   ('vect', CountVectorizer(tokenizer=tokenize)),
   ('tfidf', TfidfTransformer()),
   ('debug', modify_input()),
   ('clf', KerasClassifier(model=get_model, 
                        number_of_layers=2, 
                        depth_of_layers=20, 
                        epochs=10, 
                        batch_size=512, 
                        verbose=0, 
                        loss='binary_crossentropy', 
                        optimizer='adam',
                        metrics=['accuracy']))])

Define the parameters to be tuned, including the number of hidden layers (`number_of_layers`) and the number of units per layer (`depth_of_layers`).

In [None]:
parameters = {'vect__binary': (True, False),
         'vect__max_df': (0.5, 1.0),
         'tfidf__use_idf': (True, False),
         'clf__epochs':[50, 100],
         'clf__number_of_layers':[2,5,10],
         'clf__depth_of_layers':[10,20,50]
             }

In [None]:
cv = GridSearchCV(pipeline, scoring='accuracy', param_grid=parameters, cv=2, verbose=2)
cv.fit(X_train, y_train)
# predict on test data
y_pred = cv.predict(X_test)

Test the model and show the accuracy, precision, and recall of the tuned model. 

In [32]:
display_results(y_test, y_pred)

for column_number, column_name in enumerate(y_test.columns):
   print(f"Classification Report for the feature {column_name} is:\n")
   print(classification_report(y_test[column_name], y_pred[:,column_number]))

(19635, 36)

### 9. Export your model as a pickle file
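
No export cell is present in the notebook, so the following is only a minimal sketch of how a fitted scikit-learn pipeline could be written to and read back from a pickle file. The file name `classifier.pkl`, the temporary directory, and the toy data are assumptions.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy stand-in for the trained model (hypothetical data).
toy_X = ["need water", "storm coming", "need food", "big storm"]
toy_Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(KNeighborsClassifier(n_neighbors=1))),
])
model.fit(toy_X, toy_Y)

# Hypothetical output location; a real script would take this from the user.
model_path = Path(tempfile.mkdtemp()) / "classifier.pkl"

# Serialize the fitted pipeline.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and check that predictions are unchanged.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(model.predict(toy_X), restored.predict(toy_X))
print("Predictions identical after reload:", same)
```

For a grid search, the object to pickle would typically be `cv.best_estimator_` rather than the whole search. A pipeline that references the custom `tokenize` function can only be unpickled where that function is importable under the same module path.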

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.