# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])
import re
import pickle
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.multioutput import MultiOutputClassifier
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
import warnings
#!pip install numpy --upgrade
warnings.simplefilter('ignore')

[nltk_data] Downloading package punkt to /Users/michaelt2/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/michaelt2/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/michaelt2/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
import sqlite3 
con = sqlite3.connect("DisasterResponse10.db")
cursor = con.cursor()

cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())

[('message_and_categories_ds10',)]


In [3]:
# load data from database
#df = cursor.execute("SELECT * FROM message_and_categories_ds4 ")
#engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('message_and_categories_ds10', con = 'sqlite:///DisasterResponse10.db')
df.to_csv('ML_Data.csv')
df.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,storm,fire,earthquake,cold,other_weather,direct_report,id,message,original,genre
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,1,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [4]:
X = df.loc[:,["message"]]
X = X.squeeze()
Y = df.iloc[:,list(range(36))]

In [5]:
for column in Y.columns:
    print(column, ': ',  Y[column].unique())
Y = Y.apply(pd.to_numeric)

related :  ['1' '0']
request :  ['0' '1']
offer :  ['0' '1']
aid_related :  ['0' '1']
medical_help :  ['0' '1']
medical_products :  ['0' '1']
search_and_rescue :  ['0' '1']
security :  ['0' '1']
military :  ['0' '1']
child_alone :  ['0']
water :  ['0' '1']
food :  ['0' '1']
shelter :  ['0' '1']
clothing :  ['0' '1']
money :  ['0' '1']
missing_people :  ['0' '1']
refugees :  ['0' '1']
death :  ['0' '1']
other_aid :  ['0' '1']
infrastructure_related :  ['0' '1']
transport :  ['0' '1']
buildings :  ['0' '1']
electricity :  ['0' '1']
tools :  ['0' '1']
hospitals :  ['0' '1']
shops :  ['0' '1']
aid_centers :  ['0' '1']
other_infrastructure :  ['0' '1']
weather_related :  ['0' '1']
floods :  ['0' '1']
storm :  ['0' '1']
fire :  ['0' '1']
earthquake :  ['0' '1']
cold :  ['0' '1']
other_weather :  ['0' '1']
direct_report :  ['0' '1']


In [6]:
Y.dtypes

related                   int64
request                   int64
offer                     int64
aid_related               int64
medical_help              int64
medical_products          int64
search_and_rescue         int64
security                  int64
military                  int64
child_alone               int64
water                     int64
food                      int64
shelter                   int64
clothing                  int64
money                     int64
missing_people            int64
refugees                  int64
death                     int64
other_aid                 int64
infrastructure_related    int64
transport                 int64
buildings                 int64
electricity               int64
tools                     int64
hospitals                 int64
shops                     int64
aid_centers               int64
other_infrastructure      int64
weather_related           int64
floods                    int64
storm                     int64
fire    

In [7]:
X

0        Weather update - a cold front from Cuba that c...
1                  Is the Hurricane over or is it not over
2                          Looking for someone but no name
3        UN reports Leogane 80-90 destroyed. Only Hospi...
4        says: west side of Haiti, rest of the country ...
                               ...                        
26023    The training demonstrated how to enhance micro...
26024    A suitable candidate has been selected and OCH...
26025    Proshika, operating in Cox's Bazar municipalit...
26026    Some 2,000 women protesting against the conduc...
26027    A radical shift in thinking came about as a re...
Name: message, Length: 26028, dtype: object

In [8]:
Y

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26023,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26024,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26025,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26026,1,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

def tokenize(text):
    # Convert text to lowercase and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    clean_tokens = []
 # Stem word tokens and remove stop words
    stemmer = PorterStemmer()
    stop_words = stopwords.words("english")
    
    stemmed = [stemmer.stem(word) for word in tokens if word not in stop_words]
    
    return stemmed




In [9]:
def tokenize(text):
    """ Normalize text string, tokenize text string and remove stop words from text string
    Args: 
        Text string with message
    Returns 
        Normalized text string with word tokens 

"""
    
    # Convert text to lowercase and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Tokenize words
    tokens = word_tokenize(text)
    
    # Stem word tokens and remove stop words
    stemmer = PorterStemmer()
    stop_words = stopwords.words("english")
    
    stemmed = [stemmer.stem(word) for word in tokens if word not in stop_words]
    
    return stemmed

def tokenize(text):
    """
    Function: tokenize the text
    Args:  source string
    Return:
    clean_tokens(str list): clean string list
    
    """
    #normalize text
    text = re.sub(r'[^a-zA-Z0-9]',' ',text.lower())
    
    #token messages
    words = word_tokenize(text)
    tokens = [w for w in words if w not in stopwords.words("english")]
    
    #sterm and lemmatizer
    lemmatizer = WordNetLemmatizer()
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [10]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [11]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2,shuffle=True, random_state=42 )

In [12]:
x_train

15272    The failure of rescuers to reach these areas p...
15797    NEW YORK, 21 September 2007 - The exceptionall...
7072     Human rights group Be a pillar on which those ...
3261          Please send us information concerning them. 
23655    In deference to the concerns of the Malian gov...
                               ...                        
21575    Having trekked a long distance, the herders' e...
5390     What about people who did get their dossier (f...
860      our people would like our aid we would like th...
15795    The work of the UN in response to the floods d...
23654    DEDAYE, Myanmar, May 31, 2008 (AFP) - Myanmar ...
Name: message, Length: 20822, dtype: object

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit only to the training data
scaler.fit(x_train)

# Now apply the transformations to the data:
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [13]:
np.random.seed(17)
pipeline.fit(x_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 MultiOutputClassifier(estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                   

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [14]:
def get_eval_metrics(actual, predicted, col_names):
    """Calculate evaluation metrics for ML model
    
    Args:
    actual: array. Array containing actual labels.
    predicted: array. Array containing predicted labels.
    col_names: list of strings. List containing names for each of the predicted fields.
       
    Returns:
    metrics_df: dataframe. Dataframe containing the accuracy, precision, recall 
    and f1 score for a given set of actual and predicted labels.
    """
    metrics = []
    
    # Calculate evaluation metrics for each set of labels
    for i in range(len(col_names)):
        #print(col_names[i])
        accuracy = accuracy_score(actual[:, i], predicted[:, i])
        precision = precision_score(actual[:, i], predicted[:, i])
        recall = recall_score(actual[:, i], predicted[:, i])
        f1 = f1_score(actual[:, i], predicted[:, i])
        
        metrics.append([accuracy, precision, recall, f1])
    
    # Create dataframe containing metrics
    metrics = np.array(metrics)
    metrics_df = pd.DataFrame(data = metrics, index = col_names, columns = ['Accuracy', 'Precision', 'Recall', 'F1'])
      
    return metrics_df    

In [15]:
# Calculate evaluation metrics for training set
y_train_pred = pipeline.predict(x_train)
y_train_pred =y_train_pred.astype(int)
col_names = list(Y.columns.values)

In [16]:
print(get_eval_metrics(np.array(y_train), y_train_pred, col_names))

                        Accuracy  Precision    Recall        F1
related                 0.997983   0.999623  0.997743  0.998682
request                 0.999184   1.000000  0.995241  0.997615
offer                   0.999952   1.000000  0.989247  0.994595
aid_related             0.998895   0.999540  0.997817  0.998678
medical_help            0.999568   1.000000  0.994555  0.997270
medical_products        0.999760   1.000000  0.995234  0.997611
search_and_rescue       0.999856   1.000000  0.994764  0.997375
security                0.999760   1.000000  0.986301  0.993103
military                0.999808   1.000000  0.994161  0.997072
child_alone             1.000000   0.000000  0.000000  0.000000
water                   1.000000   1.000000  1.000000  1.000000
food                    0.999904   1.000000  0.999144  0.999572
shelter                 0.999952   1.000000  0.999453  0.999726
clothing                0.999952   1.000000  0.996933  0.998464
money                   0.999952   1.000

In [17]:
# Calculate evaluation metrics for test set
y_test_pred = pipeline.predict(x_test)

eval_metrics0 = get_eval_metrics(np.array(y_test), y_test_pred, col_names)
print(eval_metrics0)

                        Accuracy  Precision    Recall        F1
related                 0.777180   0.848915  0.860066  0.854454
request                 0.863043   0.617391  0.550998  0.582308
offer                   0.992317   0.000000  0.000000  0.000000
aid_related             0.706493   0.647280  0.640074  0.643657
medical_help            0.906262   0.418803  0.341067  0.375959
medical_products        0.943143   0.432773  0.390152  0.410359
search_and_rescue       0.957933   0.234375  0.198675  0.215054
security                0.967730   0.081081  0.056604  0.066667
military                0.964464   0.459016  0.320000  0.377104
child_alone             1.000000   0.000000  0.000000  0.000000
water                   0.956396   0.676737  0.651163  0.663704
food                    0.933922   0.699013  0.725256  0.711893
shelter                 0.935267   0.668161  0.611910  0.638800
clothing                0.987130   0.581081  0.544304  0.562092
money                   0.971955   0.438

In [18]:
# Calculation the proportion of each column that have label == 1
#Y.sum()/len(Y)
#print(Y.sum())
#print(len(Y))
#type(Y)
#(Y>2).count()
#Y[Y >2].count()
# Calculation the proportion of each column that have label == 1
Y.sum()/len(Y)

related                   0.764792
request                   0.171892
offer                     0.004534
aid_related               0.417243
medical_help              0.080068
medical_products          0.050446
search_and_rescue         0.027816
security                  0.018096
military                  0.033041
child_alone               0.000000
water                     0.064239
food                      0.112302
shelter                   0.088904
clothing                  0.015560
money                     0.023206
missing_people            0.011449
refugees                  0.033618
death                     0.045874
other_aid                 0.132396
infrastructure_related    0.065506
transport                 0.046143
buildings                 0.051214
electricity               0.020440
tools                     0.006109
hospitals                 0.010873
shops                     0.004610
aid_centers               0.011872
other_infrastructure      0.044222
weather_related     

### 6. Improve your model
Use grid search to find better parameters. 

In [19]:
def performance_metric(y_true, y_pred):
    """Calculate median F1 score for all of the output classifiers
    
    Args:
    y_true: array. Array containing actual labels.
    y_pred: array. Array containing predicted labels.
        
    Returns:
    score: float. Median F1 score for all of the output classifiers
    """
    f1_list = []
    for i in range(np.shape(y_pred)[1]):
        f1 = f1_score(np.array(y_true)[:, i], y_pred[:, i])
        f1_list.append(f1)
        
    score = np.median(f1_list)
    return score

In [20]:
# Create grid search object
## Parameters for random forest
'''

parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf':[True, False],
              'clf__estimator__n_estimators':[10, 25], 
              'clf__estimator__min_samples_split':[2, 5, 10]}

scorer = make_scorer(performance_metric)
cv = GridSearchCV(pipeline, param_grid = parameters, scoring = scorer, verbose = 10)
'''

"\n\nparameters = {'vect__min_df': [1, 5],\n              'tfidf__use_idf':[True, False],\n              'clf__estimator__n_estimators':[10, 25], \n              'clf__estimator__min_samples_split':[2, 5, 10]}\n\nscorer = make_scorer(performance_metric)\ncv = GridSearchCV(pipeline, param_grid = parameters, scoring = scorer, verbose = 10)\n"

In [21]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__ccp_alpha', 'clf__estimator__class_weight', 'clf__estimator__criterion', 'clf__estimator__max_depth', 'clf__estimator__max_features', 'clf__estimator__max_leaf_nodes', 'clf__estimator__min_impurity_decrease', 'clf__estimator__min_impurity_split', 'clf__estimator__min_samples_leaf', 'clf__estimator__min_samples_split', 'clf__estimator__min_weight_fraction_leaf', 'clf__estimator__presort', 'clf__estimator__random_state', 'clf__estimator__splitter', 'clf__estimator', 'clf__n_jobs'])

In [22]:
# Create grid search object
# Parameters for Decision Tre
parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf':[True, False],
              'clf__estimator__min_samples_split':[2, 5]}

scorer = make_scorer(performance_metric)
cv = GridSearchCV(pipeline, param_grid = parameters, scoring = scorer, n_jobs = 8,cv =4,verbose = 10)

In [23]:
# Find best parameters
np.random.seed(81)
tuned_model = cv.fit(x_train, y_train)

Fitting 4 folds for each of 8 candidates, totalling 32 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 tasks      | elapsed:  2.3min
[Parallel(n_jobs=8)]: Done   9 tasks      | elapsed:  5.0min
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:  5.1min
[Parallel(n_jobs=8)]: Done  21 out of  32 | elapsed:  8.0min remaining:  4.2min
[Parallel(n_jobs=8)]: Done  25 out of  32 | elapsed:  9.7min remaining:  2.7min
[Parallel(n_jobs=8)]: Done  29 out of  32 | elapsed:  9.9min remaining:  1.0min
[Parallel(n_jobs=8)]: Done  32 out of  32 | elapsed:  9.9min finished


In [24]:
# Get results of grid search
tuned_model.cv_results_

{'mean_fit_time': array([176.93855244, 132.49799126, 156.74522251, 113.31946301,
        178.43825412, 134.36238652, 142.49773109, 100.02140796]),
 'std_fit_time': array([1.82659084, 1.74567214, 1.08803684, 0.66265749, 1.0679891 ,
        0.89138939, 1.01771392, 0.74623853]),
 'mean_score_time': array([5.68244696, 5.28034019, 6.16705126, 6.13113505, 5.15595704,
        5.09675729, 4.24671739, 4.70171636]),
 'std_score_time': array([0.37313172, 0.09034795, 0.12625859, 0.04766536, 0.0947315 ,
        0.21648621, 0.27823402, 0.42580739]),
 'param_clf__estimator__min_samples_split': masked_array(data=[2, 2, 2, 2, 5, 5, 5, 5],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_tfidf__use_idf': masked_array(data=[True, True, False, False, True, True, False, False],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_vect__m

In [25]:
# Best mean test score
np.max(tuned_model.cv_results_['mean_test_score'])

0.37452439054047354

In [26]:
# Parameters for best mean test score
tuned_model.best_params_

{'clf__estimator__min_samples_split': 5,
 'tfidf__use_idf': False,
 'vect__min_df': 1}

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [27]:
# Calculate evaluation metrics for test set
tuned_pred_test = tuned_model.predict(x_test)
eval_metrics1 = get_eval_metrics(np.array(y_test), tuned_pred_test, col_names)
eval_metrics1

Unnamed: 0,Accuracy,Precision,Recall,F1
related,0.789282,0.855793,0.869411,0.862549
request,0.861506,0.606096,0.573171,0.589174
offer,0.99443,0.0,0.0,0.0
aid_related,0.717057,0.660253,0.652597,0.656403
medical_help,0.907991,0.423077,0.306265,0.355316
medical_products,0.941606,0.415254,0.371212,0.392
search_and_rescue,0.960814,0.273504,0.211921,0.238806
security,0.97234,0.134615,0.066038,0.088608
military,0.96504,0.473684,0.36,0.409091
child_alone,1.0,0.0,0.0,0.0


In [28]:
# Get summary stats for first model
eval_metrics0.describe()

Unnamed: 0,Accuracy,Precision,Recall,F1
count,36.0,36.0,36.0,36.0
mean,0.93364,0.402934,0.375241,0.387779
std,0.067169,0.248042,0.244208,0.245347
min,0.706493,0.0,0.0,0.0
25%,0.929841,0.201435,0.194748,0.198531
50%,0.956877,0.429153,0.352352,0.384631
75%,0.980503,0.612582,0.566226,0.590393
max,1.0,0.848915,0.860066,0.854454


In [29]:
# Get summary stats for tuned model
eval_metrics1.describe()

Unnamed: 0,Accuracy,Precision,Recall,F1
count,36.0,36.0,36.0,36.0
mean,0.936532,0.419169,0.374478,0.394113
std,0.064513,0.250343,0.248664,0.248722
min,0.717057,0.0,0.0,0.0
25%,0.932242,0.2375,0.191009,0.21161
50%,0.959758,0.432692,0.351756,0.391652
75%,0.980311,0.617939,0.575155,0.596694
max,1.0,0.855793,0.869411,0.862549


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [30]:
for column in y_train.columns:
    print(column, ': ',  y_train[column].unique())

related :  [1 0]
request :  [0 1]
offer :  [0 1]
aid_related :  [1 0]
medical_help :  [0 1]
medical_products :  [0 1]
search_and_rescue :  [0 1]
security :  [0 1]
military :  [0 1]
child_alone :  [0]
water :  [0 1]
food :  [0 1]
shelter :  [0 1]
clothing :  [0 1]
money :  [0 1]
missing_people :  [0 1]
refugees :  [0 1]
death :  [1 0]
other_aid :  [0 1]
infrastructure_related :  [0 1]
transport :  [0 1]
buildings :  [0 1]
electricity :  [0 1]
tools :  [0 1]
hospitals :  [0 1]
shops :  [0 1]
aid_centers :  [0 1]
other_infrastructure :  [0 1]
weather_related :  [0 1]
floods :  [0 1]
storm :  [0 1]
fire :  [0 1]
earthquake :  [0 1]
cold :  [0 1]
other_weather :  [0 1]
direct_report :  [0 1]


Neural Network based Model
==

In [31]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(32,32,8), activation='relu', solver='adam', max_iter=200)


In [32]:
pipeline_mlp = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', mlp)
])

In [33]:
np.random.seed(17)
pipeline_mlp.fit(x_train,y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                               batch_size='auto', beta_1=0.9, beta_2=0.999,
                               early_stopping=False, epsilon=1e-08,
                               hidden_layer_sizes=(32, 32, 8),
                               learning_rate='constant',
                               learning_rate_init=0.001, max_fun=15000,

In [34]:
# Calculate evaluation metrics for training set
y_train_mlp_pred = pipeline.predict(x_train)
y_train_mlp_pred =y_train_mlp_pred.astype(int)
#col_names = list(y.columns.values)

In [35]:
mlp_result =get_eval_metrics(np.array(y_train).astype(int), y_train_mlp_pred.astype(int), col_names)
mlp_result

Unnamed: 0,Accuracy,Precision,Recall,F1
related,0.997983,0.999623,0.997743,0.998682
request,0.999184,1.0,0.995241,0.997615
offer,0.999952,1.0,0.989247,0.994595
aid_related,0.998895,0.99954,0.997817,0.998678
medical_help,0.999568,1.0,0.994555,0.99727
medical_products,0.99976,1.0,0.995234,0.997611
search_and_rescue,0.999856,1.0,0.994764,0.997375
security,0.99976,1.0,0.986301,0.993103
military,0.999808,1.0,0.994161,0.997072
child_alone,1.0,0.0,0.0,0.0


In [36]:
# Calculate evaluation metrics for training set
y_test_mlp_pred = pipeline.predict(x_test)
y_test_mlp_pred =y_test_mlp_pred.astype(int)
#col_names = list(y.columns.values)

In [37]:
mlp_result_test =get_eval_metrics(np.array(y_test).astype(int), y_test_mlp_pred.astype(int), col_names)
mlp_result_test

Unnamed: 0,Accuracy,Precision,Recall,F1
related,0.77718,0.848915,0.860066,0.854454
request,0.863043,0.617391,0.550998,0.582308
offer,0.992317,0.0,0.0,0.0
aid_related,0.706493,0.64728,0.640074,0.643657
medical_help,0.906262,0.418803,0.341067,0.375959
medical_products,0.943143,0.432773,0.390152,0.410359
search_and_rescue,0.957933,0.234375,0.198675,0.215054
security,0.96773,0.081081,0.056604,0.066667
military,0.964464,0.459016,0.32,0.377104
child_alone,1.0,0.0,0.0,0.0


In [38]:
# Get summary stats for tuned model
mlp_result_test.describe()

Unnamed: 0,Accuracy,Precision,Recall,F1
count,36.0,36.0,36.0,36.0
mean,0.93364,0.402934,0.375241,0.387779
std,0.067169,0.248042,0.244208,0.245347
min,0.706493,0.0,0.0,0.0
25%,0.929841,0.201435,0.194748,0.198531
50%,0.956877,0.429153,0.352352,0.384631
75%,0.980503,0.612582,0.566226,0.590393
max,1.0,0.848915,0.860066,0.854454


### 9. Export your model as a pickle file

In [39]:
# Pickle best model
pickle.dump(tuned_model, open('disaster_model.pkl', 'wb'))

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.