## ML Pipeline Preparation

### Main blocks of code
* Importing libraries and loading data
* Writing a tokenization function to process text data
* Building a machine learning pipeline
* Improving model with grid search and testing new model
* Exporting model as a pickle file



______________________________

### Importing libraries and loading data

In [25]:
# importing libraries
import sys
import pandas as pd 

import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])

import re
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\BernadettKepenyes\AppData\Roaming\nltk_data..
[nltk_data]     .
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\BernadettKepenyes\AppData\Roaming\nltk_data..
[nltk_data]     .
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\BernadettKepenyes\AppData\Roaming\nltk_data..
[nltk_data]     .
[nltk_data]   Package stopwords is already up-to-date!


In [26]:
df = pd.read_csv(r"C:\Users\BernadettKepenyes\Documents\GitHub\disaster-response-webapp\data\DisasterResponse.csv")
df.head(2)

| | id | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | 2 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | 2 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |


In [27]:
# extract message column
X = df['message']

# classification labels
# Y = df.drop(['id', 'message', 'original', 'genre'], axis = 1), or:
Y = df.iloc[:, 4:]

# category names for visualization
category_names = Y.columns
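
The comment above mentions two ways of selecting the label columns: dropping the four non-label columns by name, or slicing by position. They give the same result as long as those four columns come first. A minimal sketch on a made-up two-row frame (the `toy` data and its values are hypothetical, not taken from DisasterResponse.csv):

```python
import pandas as pd

# Toy frame with the same leading columns as the dataset (values are made up).
toy = pd.DataFrame({
    "id": [2, 7],
    "message": ["first message", "second message"],
    "original": ["premier message", "deuxieme message"],
    "genre": [2, 2],
    "related": [1, 1],
    "request": [0, 1],
})

# Select the label columns by name and by position.
by_name = toy.drop(["id", "message", "original", "genre"], axis=1)
by_position = toy.iloc[:, 4:]

print(by_name.equals(by_position))  # True
print(list(by_position.columns))    # ['related', 'request']
```

Dropping by name is likely the safer choice if the column order of the CSV might change, since `iloc[:, 4:]` silently depends on it.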

_____________________

### Writing a tokenization function to process text data

In [28]:
# tokenization function to process text data
def tokenize(text):
    '''
    function: normalize, tokenize and lemmatize a message
    input: message text (str)
    output: list of cleaned, lemmatized tokens
    '''

    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # normalizing text
    words = word_tokenize(text) # tokenizing text
    stop_words = set(stopwords.words("english")) # loading the stop word list once per call
    words = [w for w in words if w not in stop_words] # removing stop words
    lemmatizer = WordNetLemmatizer() # initializing lemmatizer

    # lemmatizing each token
    clean_words = [lemmatizer.lemmatize(w) for w in words]

    return clean_words

# testing out function
for message in X[:5]:
    words = tokenize(message)
    print(message)
    print(words, '\n')

Weather update - a cold front from Cuba that could pass over Haiti
['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti'] 

Is the Hurricane over or is it not over
['hurricane'] 

Looking for someone but no name
['looking', 'someone', 'name'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogane', '80', '90', 'destroyed', 'hospital', 'st', 'croix', 'functioning', 'need', 'supply', 'desperately'] 

says: west side of Haiti, rest of the country today and tonight
['say', 'west', 'side', 'haiti', 'rest', 'country', 'today', 'tonight'] 
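
The output above shows `80-90` split into `'80'` and `'90'` and the trailing periods gone. Both effects come from the normalization step, which replaces every character that is not a letter or digit with a space before tokenizing. A small sketch of that step in isolation, using `str.split()` as a simplified stand-in for `word_tokenize` (an assumption made so the example needs no NLTK data):

```python
import re

message = "UN reports Leogane 80-90 destroyed."

# Same normalization as in tokenize(): lowercase, then turn every
# character that is not a letter or digit into a space.
normalized = re.sub(r"[^a-zA-Z0-9]", " ", message.lower())

# str.split() is used here only as a simple stand-in for word_tokenize.
tokens = normalized.split()
print(tokens)
# ['un', 'reports', 'leogane', '80', '90', 'destroyed']
```

Stop word removal and lemmatization are not applied in this sketch, which is why `'reports'` is still plural here.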



________________

### Building a machine learning pipeline

In [31]:
# defining pipeline
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

# splitting data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
# fit Random Forest Classifier
pipeline.fit(X_train, Y_train)
# prediction
Y_pred = pipeline.predict(X_test)
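
`train_test_split` is called above without `test_size` or `random_state`, so it holds out 25% of the rows by default and shuffles them differently on every run. The scores reported below can therefore change slightly between runs. A minimal sketch of how a fixed seed makes the split repeatable, on hypothetical stand-in data (the value `42` is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 samples with one label each.
samples = list(range(20))
labels = [i % 2 for i in samples]

# With a fixed random_state, two calls return the same split.
first = train_test_split(samples, labels, random_state=42)
second = train_test_split(samples, labels, random_state=42)

print(first == second)               # True
print(len(first[0]), len(first[1]))  # 15 5
```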

In [33]:
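# testing function
def model_report(Y_test, Y_pred):
    '''
    function: print a classification report per category and the accuracy
    input: true labels (DataFrame), predicted labels (array)
    '''
    for i, col in enumerate(Y_test):
        print('Category {}: {}'.format(i+1, col))
        print(classification_report(Y_test[col], Y_pred[:, i]))
    accuracy = (Y_pred == Y_test).mean() # accuracy per category
    print('Accuracy: ', accuracy)
    average_accuracy = accuracy.mean() # mean over all categories
    print('Average accuracy: ', average_accuracy)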

model_report(Y_test, Y_pred)

Category 1: related
              precision    recall  f1-score   support

           0       0.73      0.40      0.51      1553
           1       0.84      0.95      0.89      5001

    accuracy                           0.82      6554
   macro avg       0.78      0.68      0.70      6554
weighted avg       0.81      0.82      0.80      6554

Category 2: request
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      5430
           1       0.83      0.47      0.60      1124

    accuracy                           0.89      6554
   macro avg       0.86      0.73      0.77      6554
weighted avg       0.89      0.89      0.88      6554

Category 3: offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6511
           1       0.00      0.00      0.00        43

    accuracy                           0.99      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg    
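
In the `offer` report above, accuracy is 0.99 while precision and recall for class 1 are both 0.00. With only 43 positive messages among 6554, this is the pattern an always-negative predictor would produce, so accuracy alone says little about rare categories. A quick arithmetic check, using a hypothetical always-0 baseline rather than the fitted model:

```python
# Support values taken from the "offer" report: 6511 negatives, 43 positives.
negatives, positives = 6511, 43
total = negatives + positives

# A hypothetical baseline that predicts 0 for every message.
true_negatives = negatives
true_positives = 0

accuracy = (true_negatives + true_positives) / total
recall_positive = true_positives / positives

print(round(accuracy, 2))  # 0.99
print(recall_positive)     # 0.0
```

For imbalanced categories like this one, the per-class recall and f1-score in the report are probably more informative than the accuracy line.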

_________________

### Improving model with grid search and testing new model

In [34]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x0000025F08F94670>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x0000025F08F94670>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,


In [35]:
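# grid search over the number of trees; a single candidate value means
# only this one setting is cross-validated
parameters = {
    'clf__estimator__n_estimators': [60]
}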

cv = GridSearchCV(pipeline, param_grid=parameters)
cv

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x0000025F08F94670>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'clf__estimator__n_estimators': [60]})
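
The grid above contains a single candidate, so the search has nothing to compare. The sketch below shows what a grid with several candidates could look like. It runs on synthetic data from `make_multilabel_classification` instead of the disaster messages, and the parameter values, `cv=3` and all `toy_*` names are illustrative assumptions, not tuned recommendations:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Synthetic multi-label data as a stand-in for the vectorized messages.
X_toy, Y_toy = make_multilabel_classification(
    n_samples=60, n_features=10, n_classes=3, random_state=0
)

toy_pipeline = Pipeline([
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Several candidate values per parameter, so the search has something to compare.
toy_parameters = {
    "clf__estimator__n_estimators": [5, 10],
    "clf__estimator__min_samples_split": [2, 4],
}

toy_cv = GridSearchCV(toy_pipeline, param_grid=toy_parameters, cv=3)
toy_cv.fit(X_toy, Y_toy)

print(toy_cv.best_params_)
print(len(toy_cv.cv_results_["params"]))  # 4 parameter combinations
```

Each added value multiplies the number of fits, so on the full message data a larger grid can take considerably longer than the single-value run.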

In [37]:
cv.fit(X_train, Y_train)
Y_pred = cv.predict(X_test)
model_report(Y_test, Y_pred)

Category 1: related
              precision    recall  f1-score   support

           0       0.73      0.39      0.51      1553
           1       0.83      0.96      0.89      5001

    accuracy                           0.82      6554
   macro avg       0.78      0.67      0.70      6554
weighted avg       0.81      0.82      0.80      6554

Category 2: request
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      5430
           1       0.81      0.47      0.59      1124

    accuracy                           0.89      6554
   macro avg       0.86      0.72      0.76      6554
weighted avg       0.88      0.89      0.88      6554

Category 3: offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6511
           1       0.00      0.00      0.00        43

    accuracy                           0.99      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg    

_________________

### Exporting model as a pickle file

In [43]:
# saving model
import pickle
filename = 'classifier.pkl'
with open(filename, 'wb') as f:
    pickle.dump(cv, f)
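
Loading the file back follows the same pattern with `pickle.load`. A minimal round-trip sketch, using a plain dictionary as a hypothetical stand-in for the fitted search object and a temporary directory so nothing is overwritten:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted search object.
model = {"name": "classifier", "n_estimators": 60}

path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

# Context managers close the file even if dump or load raises.
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

One caveat for the real model: pickle stores functions by reference, not by value, so the script that loads `classifier.pkl` needs a `tokenize` function available under the same module and name as when the model was saved.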