# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [359]:
# import libraries
import pandas as pd
import string
import numpy as np
import pickle
import re
from sqlalchemy import create_engine
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

In [153]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('disaster_messages', engine)
X = df['message']
y = df.iloc[:, 3:]

In [154]:
y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [155]:
df.head(4)

Unnamed: 0,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [156]:
# Genre is the source of the message
df['genre'].value_counts()

news      13053
direct    10766
social     2396
Name: genre, dtype: int64

### 2. Write a tokenization function to process your text data

In [325]:
def tokenize(text, stop_words=set(stopwords.words('english'))):
    '''Lowercase the text, remove punctuation and digits, tokenize,
    drop stop words and lemmatize the remaining tokens.
    
    Args:
        text: str
        stop_words: set of words to filter out (default: NLTK English stop words)
    Returns:
        clean_tokens: list of tokens
    '''
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'[0-9]+', '', text)
    tokens = word_tokenize(text)
    
    filtered_tokens = [w for w in tokens if w not in stop_words]
    
    lemmatizer = WordNetLemmatizer()
    
    clean_tokens = []
    for token in filtered_tokens:
        clean_token = lemmatizer.lemmatize(token).lower().strip()
        clean_tokens.append(clean_token)
    
    return clean_tokens

In [326]:
tokenize('Our Process Its Easy Simple and Secure 1. Select Token The first thing that you will need to do is select the token that you want to generate. 2. Confirm Payment Simply make the payment using your desired available payment method, when you are done with the process 3. Deploy Smart Contract')

['process',
 'easy',
 'simple',
 'secure',
 'select',
 'token',
 'first',
 'thing',
 'need',
 'select',
 'token',
 'want',
 'generate',
 'confirm',
 'payment',
 'simply',
 'make',
 'payment',
 'using',
 'desired',
 'available',
 'payment',
 'method',
 'done',
 'process',
 'deploy',
 'smart',
 'contract']
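
One detail of `tokenize` worth checking is how punctuation is removed: `str.translate` with a deletion table drops each punctuation character outright, so hyphenated or apostrophized words are fused rather than split. The sketch below compares that with mapping punctuation to spaces. The sample string and the `space_table` variant are illustrative assumptions, not part of the original notebook.

```python
import string

# Deletion table, as used in tokenize(): punctuation characters are dropped.
delete_table = str.maketrans('', '', string.punctuation)
# Hypothetical alternative: map every punctuation character to a space.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

sample = "water-food needed, don't wait"  # made-up message for illustration

deleted = sample.translate(delete_table).split()
spaced = sample.translate(space_table).split()

print(deleted)  # ['waterfood', 'needed', 'dont', 'wait']
print(spaced)   # ['water', 'food', 'needed', 'don', 't', 'wait']
```

Because the stop-word set stores contractions with their apostrophes (for example `shan't`), fused forms such as `dont` are not filtered out after deletion. Which behavior is preferable depends on the data, so this is a trade-off to evaluate rather than a fix.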

In [327]:
df['message'].apply(tokenize)

0        [weather, update, cold, front, cuba, could, pa...
1                                              [hurricane]
2                                 [looking, someone, name]
3        [un, report, leogane, destroyed, hospital, st,...
4        [say, west, side, haiti, rest, country, today,...
                               ...                        
26210    [training, demonstrated, enhance, micronutrien...
26211    [suitable, candidate, selected, ocha, jakarta,...
26212    [proshika, operating, cox, bazar, municipality...
26213    [woman, protesting, conduct, election, teargas...
26214    [radical, shift, thinking, came, result, meeti...
Name: message, Length: 26215, dtype: object

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [328]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [329]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [330]:
pipeline.fit(X_train, y_train)

In [331]:
y_pred = pipeline.predict(X_test)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [332]:
def display_results(y_test, y_pred):
    '''
    Print and return precision, recall and F1 score for one category,
    using the weighted average over its classes.
    
    Args:
        y_test: pandas Series of true labels for the category
        y_pred: 1-D array of predicted labels for the same category
    Returns:
        column name, precision, recall and F1 score
    '''
    # The fourth return value (support) is None when an average is requested.
    precision, recall, fscore, _ = score(y_test, y_pred, average='weighted')
    accuracy = (y_pred == y_test).mean()
    
    print(y_test.name)
    print("f1-score:\n", fscore)
    # Sum of the true labels, i.e. the number of positives for a 0/1 column.
    print("support:\n", y_test.sum())
    print("Accuracy:", accuracy)
    return y_test.name, precision, recall, fscore

In [310]:
y_test.iloc[:,0].name

'related'

In [311]:
display_results(y_test.iloc[:,0], y_pred[:,0])

related
f1-score:
 0.7655359910363685
support:
 5050
Accuracy: 0.8008849557522124


('related', 0.7912904533732408, 0.8008849557522124, 0.7655359910363685)

In [313]:
print(classification_report(y_test.iloc[:,0], y_pred[:,0], output_dict=False))

              precision    recall  f1-score   support

           0       0.74      0.28      0.41      1555
           1       0.81      0.97      0.88      4948
           2       0.90      0.35      0.51        51

    accuracy                           0.80      6554
   macro avg       0.82      0.53      0.60      6554
weighted avg       0.79      0.80      0.77      6554
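
`display_results` uses `average='weighted'`, which averages the per-class scores weighted by each class's support. As a rough check, the sketch below recomputes the weighted F1 from the rounded per-class rows of the report above. Because the inputs are rounded to two decimals, the result only approximately matches the 0.7655 returned earlier.

```python
# (f1-score, support) per class, copied from the report for `related`.
per_class = {0: (0.41, 1555), 1: (0.88, 4948), 2: (0.51, 51)}

total_support = sum(support for _, support in per_class.values())
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(total_support)          # 6554
print(round(weighted_f1, 2))  # 0.77
```

The majority class (1) dominates the average, which is why the weighted score sits well above the macro average of 0.60.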



In [335]:
result = {}

for i in range(0, 36):
    print(i)
    col_name, col_precision, col_recall, col_fscore = display_results(y_test.iloc[:,i], y_pred[:,i])
    result[col_name] = [col_precision, col_recall, col_fscore]

0
related
f1-score:
 0.7987344909624668
support:
 5074
Accuracy: 0.8167531278608483
1
request
f1-score:
 0.8824505496161951
support:
 1149
Accuracy: 0.8931949954226427
2
offer
f1-score:
 0.9926822025793574
support:
 32
Accuracy: 0.9951174855050351
3
aid_related
f1-score:
 0.7860664697703974
support:
 2733
Accuracy: 0.7880683552029295
4
medical_help
f1-score:
 0.8945675585357383
support:
 512
Accuracy: 0.923863289594141
5
medical_products
f1-score:
 0.9403407901190229
support:
 307
Accuracy: 0.9566676838571865
6
search_and_rescue
f1-score:
 0.9615038353084161
support:
 176
Accuracy: 0.9734513274336283
7
security
f1-score:
 0.9727716079601751
support:
 119
Accuracy: 0.9816905706438815
8
military
f1-score:
 0.9514745221613466
support:
 230
Accuracy: 0.9653646628013427
9
child_alone
f1-score:
 1.0
support:
 0
Accuracy: 1.0
10
water
f1-score:
 0.9496348187199509
support:
 414
Accuracy: 0.9580408910588953
11
food
f1-score:
 0.9331676151872501
support:
 769
Accuracy: 0.9385108330790357
12
she



floods
f1-score:
 0.9395329172395714
support:
 542
Accuracy: 0.9484284406469332
30
storm
f1-score:
 0.9323606716618682
support:
 659
Accuracy: 0.9382056759231004
31
fire
f1-score:
 0.9866690357719068
support:
 58
Accuracy: 0.9909978638999084
32
earthquake
f1-score:
 0.969833779069028
support:
 614
Accuracy: 0.9705523344522429
33
cold
f1-score:
 0.9768390482196966
support:
 122
Accuracy: 0.9832163564235581
34
other_weather
f1-score:
 0.9254238361362875
support:
 346
Accuracy: 0.9476655477570949
35
direct_report
f1-score:
 0.8368079138535194
support:
 1297
Accuracy: 0.8579493439121147
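
For rare categories, accuracy and weighted F1 can look strong even when a classifier never predicts the positive class. The sketch below builds a hypothetical all-zeros predictor for `offer`, using the 32 positives printed above and assuming the 6554-row test split shown in the earlier classification report. It illustrates how the metric behaves and is not a claim about what the fitted forest actually predicted.

```python
n_test = 6554      # test-set size, taken from the classification report
n_positive = 32    # support printed for `offer`

y_true = [1] * n_positive + [0] * (n_test - n_positive)
y_pred = [0] * n_test  # hypothetical classifier that never predicts `offer`

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_test
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_positive = true_positives / n_positive

print(round(accuracy, 4))  # 0.9951
print(recall_positive)     # 0.0
```

The result agrees with the accuracy printed for `offer` to four decimals, which is consistent with (though not proof of) the model rarely or never predicting that label. Per-class recall from `classification_report` makes this kind of failure visible.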


In [336]:
df_result_stopwords = pd.DataFrame.from_dict(result, orient='index', columns=['precision', 'recall', 'fscore'])

In [320]:
df_result['fscore'].mean()

0.9332718679737876

In [338]:
df_result_stopwords['fscore'].mean()

0.9373871478563294

### 6. Improve your model
Use grid search to find better parameters. 

In [342]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x00000208E93AB160>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x00000208E93AB160>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text, stop_words={'shouldn', 'whom', 'and', 'against', 'their', "shan't", 'from', 'couldn', 'am', 'yours', 'about', 

In [346]:
# specify parameters for grid search
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__estimator__n_estimators': [5, 10, 20],
              'clf__estimator__min_samples_split': [2, 5]
             }

# create grid search object
cv = GridSearchCV(pipeline, param_grid=parameters)
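
Before launching the search it helps to estimate its cost. The sketch below counts the parameter combinations in the grid with `itertools.product`. The fold count of 5 is an assumption (the default in recent scikit-learn releases when `cv` is not passed), and 36 is the number of output categories stated in step 3.

```python
from itertools import product

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__n_estimators': [5, 10, 20],
    'clf__estimator__min_samples_split': [2, 5],
}

n_folds = 5      # assumption: GridSearchCV default when cv is not given
n_outputs = 36   # one random forest per category in MultiOutputClassifier

combinations = list(product(*parameters.values()))
pipeline_fits = len(combinations) * n_folds
forest_fits = pipeline_fits * n_outputs

print(len(combinations))  # 12
print(pipeline_fits)      # 60
print(forest_fits)        # 2160
```

These counts exclude the final refit on the full training set, which `GridSearchCV` performs by default once the best combination is known.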

In [347]:
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

ValueError: multiclass-multioutput is not supported
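
The message refers to scikit-learn's target-type naming: a label matrix in which any column takes more than two values is treated as multiclass-multioutput rather than multilabel. The classification report in step 5 shows a class `2` for `related`, so a quick per-column check is a reasonable first diagnostic. The sketch below uses a small made-up frame in place of `y`. Recoding `2` to `1` is only one possible cleanup and is an assumption about what that value means in this dataset.

```python
import pandas as pd

# Toy stand-in for `y`; the real frame has 36 category columns.
y_toy = pd.DataFrame({
    'related': [1, 0, 2, 1],
    'request': [0, 1, 0, 0],
    'offer':   [0, 0, 0, 1],
})

# Columns with more than two distinct values are not binary indicators.
non_binary = y_toy.columns[y_toy.nunique() > 2].tolist()
print(non_binary)  # ['related']

# Assumed cleanup: treat 2 as 1 (alternatively, drop those rows).
y_binary = y_toy.replace({'related': {2: 1}})
print(y_binary['related'].tolist())  # [1, 0, 1, 1]
```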

In [348]:
cv.best_params_

{'clf__estimator__min_samples_split': 2,
 'clf__estimator__n_estimators': 20,
 'vect__ngram_range': (1, 2)}

In [None]:
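# Note: `result` still holds the untuned pipeline's scores here; the tuned scores are collected in step 7.
df_pipeline_best_params = pd.DataFrame.from_dict(result, orient='index', columns=['precision', 'recall', 'fscore'])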

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!

In [349]:
pipeline_best_param = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(min_samples_split=2, n_estimators=20)))
])

In [350]:
pipeline_best_param.fit(X_train, y_train)
y_pred = pipeline_best_param.predict(X_test)

In [351]:
result = {}

for i in range(0, 36):
    print(i)
    col_name, col_precision, col_recall, col_fscore = display_results(y_test.iloc[:,i], y_pred[:,i])
    result[col_name] = [col_precision, col_recall, col_fscore]

0
related
f1-score:
 0.7910831628816877
support:
 5074
Accuracy: 0.80103753433018
1
request
f1-score:
 0.8772166079809515
support:
 1149
Accuracy: 0.888617638083613
2
offer
f1-score:
 0.9926822025793574
support:
 32
Accuracy: 0.9951174855050351
3
aid_related
f1-score:
 0.7550778540165141
support:
 2733
Accuracy: 0.7616722612145255
4
medical_help
f1-score:
 0.8960682538591636
support:
 512
Accuracy: 0.9247787610619469
5
medical_products
f1-score:
 0.9400395372481504
support:
 307
Accuracy: 0.9562099481232835
6
search_and_rescue
f1-score:
 0.9635070912025965
support:
 176
Accuracy: 0.9742142203234666
7
security
f1-score:
 0.9731466246848942
support:
 119
Accuracy: 0.9818431492218492
8
military
f1-score:
 0.9508055079263923
support:
 230
Accuracy: 0.9655172413793104
9
child_alone
f1-score:
 1.0
support:
 0
Accuracy: 1.0
10
water
f1-score:
 0.9553407019858491
support:
 414
Accuracy: 0.9615501983521514
11
food
f1-score:
 0.9281197381191991
support:
 769
Accuracy: 0.9345437900518767
12
shelter
f1-score:
 0.9182699799987118
support:
 596
Accuracy: 0.9324076899603295
13
clothing
f1-score:
 0.9815899994262951
support:
 101
Accuracy: 0.9859627708269759
14
money
f1-score:
 0.967630092891369
support:
 154
Accuracy: 0.9774183704607873
15
missing_people
f1-score:
 0.981728047727816
support:
 80
Accuracy: 0.9877937137625877
16
refugees
f1-score:
 0.9529538956737929
support:
 217
Accuracy: 0.9670430271589868
17
death
f1-score:
 0.947458483897159
support:
 304
Accuracy: 0.9606347268843455
18
other_aid
f1-score:
 0.8218313795625634
support:
 847
Accuracy: 0.8719865730851388
19
infrastructure_related
f1-score:
 0.9120929490074616
support:
 386
Accuracy: 0.9401891974366798
20
transport
f1-score:
 0.9389438092358251
support:
 290
Accuracy: 0.9569728410131217
21
buildings
f1-score:
 0.9412832794360686
support:
 307
Accurac



In [352]:
df_pipeline_best_param = pd.DataFrame.from_dict(result, orient='index', columns=['precision', 'recall', 'fscore'])

In [353]:
df_pipeline_best_param['fscore'].mean()

0.9333688846748096

In [357]:
df_pipeline_best_param - df_result_stopwords

Unnamed: 0,precision,recall,fscore
related,-0.015083,-0.015716,-0.007651
request,-0.005743,-0.004577,-0.005234
offer,0.0,0.0,0.0
aid_related,-0.022853,-0.026396,-0.030989
medical_help,0.004334,0.000915,0.001501
medical_products,-0.002678,-0.000458,-0.000301
search_and_rescue,0.003775,0.000763,0.002003
security,0.009226,0.000153,0.000375
military,0.00169,0.000153,-0.000669
child_alone,0.0,0.0,0.0


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
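
As one illustration of the second idea, the sketch below adds a message-length column next to the TF-IDF features using `FeatureUnion`. The `MessageLength` transformer and the sample messages are hypothetical, and the default `CountVectorizer` tokenizer is used only to keep the example self-contained. Whether such a feature helps on this dataset is untested here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: character count of each message."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # One row per message, one column holding its length.
        return np.array([[len(text)] for text in X], dtype=float)


features = FeatureUnion([
    ('text', Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
    ('length', MessageLength()),
])

messages = ["help", "we need water"]  # made-up examples
matrix = features.fit_transform(messages)
print(matrix.shape)  # (2, 5): four TF-IDF columns plus one length column
```

In the notebook's pipeline, `features` would take the place of the `vect` and `tfidf` steps ahead of `clf`, with `CountVectorizer(tokenizer=tokenize)` restored.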

### 9. Export your model as a pickle file

In [358]:
# save the model to disk
filename = 'classifier.pkl'
# use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(pipeline_best_param, f)
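
A saved model is only useful if it can be loaded again. The sketch below round-trips a stand-in object through `pickle`, using context managers so the file handles are closed promptly. The dictionary and the temporary path are placeholders for `pipeline_best_param` and `classifier.pkl`.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; any picklable object behaves the same way.
model = {'name': 'stand-in for pipeline_best_param', 'n_estimators': 20}

path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')

with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

Because pickle stores functions by reference, any script that loads the real pipeline must be able to import or define `tokenize` under the same module path as when the model was saved.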

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.