# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
###Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd "/content/drive/My Drive/Data Job/Colab Notebooks/Data Science Nanodegree/Exercise Files/udacity-nanodegree-exercise/4_Data_Engineering/Lesson_5_Project"


/content/drive/My Drive/Data Job/Colab Notebooks/Data Science Nanodegree/Exercise Files/udacity-nanodegree-exercise/4_Data_Engineering/Lesson_5_Project


###Imports

In [None]:
# import libraries
from sqlalchemy import create_engine
from sqlalchemy import inspect
import pandas as pd
import math
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from matplotlib import rcParams
from wordcloud import WordCloud, STOPWORDS

from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
nltk.download(['punkt', 'wordnet', 'stopwords'])

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.metrics import  confusion_matrix, classification_report

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


###Load Data

In [None]:
# load data from database
engine = create_engine('sqlite:///disaster_response.db')
df = pd.read_sql("SELECT * FROM disaster_resp_mes;", engine)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:

inspector = inspect(engine)
schema = inspector.get_schema_names()[0]
colnames = []
table_name = inspector.get_table_names(schema=schema)[0]
for column in inspector.get_columns(table_name, schema=schema):
    colnames.append(column['name'])
target_colnames = colnames[4:]


In [None]:
X = df['message']
Y = df[target_colnames]

###Functions

In [None]:
def tokenize(text):
    """
    INPUT:
    text - string
    OUTPUT:
    tokens - list of strings
    
    function takes raw text, removes punctuation signs, substitutes
    with spaces. Puts all characters in lower case, tokenizes text
    by words, removes stop words, lemmatizes, and returns list of tokens 
    """
    
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

In [None]:
def prep_wordcloud(df, colnames):
  """
  Function that modifies NLP dataset into lists that are ready to create wordcloud from

  INPUT:
    df - dataframe
    colnames - list of target column names 
  OUTPUT
    list of string, where each element is a string from each target
  """
  wordcloud_list = []
  for colum in colnames:

    #tokenize messages of each target
    wordcloud_set = df[df[colum]==1]['message']
    wordcloud_plot = ' '.join(wordcloud_set)
    wordcloud_plot = tokenize(wordcloud_plot)
    wordcloud_plot = ' '.join(wordcloud_plot)

    
    wordcloud_list.append(wordcloud_plot)

  return wordcloud_list

In [49]:
def display_results(y_test, y_pred, grid = None):
  f1_scores = []
  for ind, cat in enumerate(y_test):
    print('Class - {}'.format(cat))
    print(classification_report(y_test.values[ind], y_pred[ind], zero_division = 1))
    #display_results(y_test.values[ind], y_pred[ind])
    f1_scores.append(f1_score(y_test.values[ind], y_pred[ind]))
  
  print('Base Model\nMinimum f1 score - {}\nBest f1 score - {}\nMean f1 score - {}'.format(min(f1_scores), max(f1_scores), round(sum(f1_scores)/len(f1_scores), 3)))
  if grid:
    print("\nBest Parameters:", grid.best_params_)

In [None]:
class message_length_word(BaseEstimator, TransformerMixin):

    def message_length_words(self, text):
      # tokenize by words, how many words in message
      word_list_tok = word_tokenize(text)

      return len(word_list)

      
    def fit(self, x, y=None):
        return self
    """
    def fit_transform(self, X):
        # apply length_word function to all values in X
        print(self.message_length_words)
        X_tagged_words = pd.Series(X).apply(self.message_length_words)


        return pd.DataFrame(X_tagged_words)
    """

    def transform(self, X):
        # apply length_word function to all values in X
        
        X_tagged_words = pd.Series(X).apply(self.message_length_words)


        return pd.DataFrame(X_tagged_words)

In [None]:
class message_length_char(BaseEstimator, TransformerMixin):
    #get how many characters in string
    def message_length_char(self, text):
          
      tran = len(text)
      return tran
      
    def fit(self, x, y=None):
        return self
    """
    def fit_transform(self, X):
        # apply length_char function to all values in X
        X_tagged_char = pd.Series(X).apply(self.message_length_char)

        return pd.DataFrame(X_tagged_char)
    """
    def transform(self, X):
        # apply length_char function to all values in X
        X_tagged_char = pd.Series(X).apply(self.message_length_char)

        return pd.DataFrame(X_tagged_char)

###WordCloud

In [None]:
word_list = prep_wordcloud(df, target_colnames)

In [None]:
stopwords = STOPWORDS
#on initial run, the word 'people' was present in almost all categories
stopwords.add('people')

for i, cloud in enumerate(word_list):
  if len(cloud):
    print('Word Cloud for {} category'.format(target_colnames[i]))
    wordcloud = WordCloud(stopwords=stopwords, background_color="white",
                          max_words=20).generate(cloud)
    rcParams['figure.figsize'] = 10, 30
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

Output hidden; open in https://colab.research.google.com to view.

###Feature Research
There are two features I would like to explore:
1. the original dataset has the column 'genre', most probably the origin of the message, could be transformed into dummy variables
2. It is possible length of messages could be related to a disaster as people write in a hurry shorter messages with fewer words

####Dummy Variables

In [None]:
df['genre'].unique()

array(['direct', 'social', 'news'], dtype=object)

In [None]:
df_cat = pd.get_dummies(df['genre'], dummy_na = False, drop_first=True)
df_cat.head()

Unnamed: 0,news,social
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0


####Length Plots

In [None]:
df['message_length_char'] = df['message'].apply(lambda x: len(x))
df['message_length_words'] = df['message'].apply(lambda x: len(x.split()))

df_char = pd.DataFrame(columns = ['message_length_char', 'message_length_words'])
for feat in target_colnames:
  row_to_append = df.groupby(by = feat).mean()[['message_length_char', 'message_length_words']].reset_index()
  df_char = df_char.append(row_to_append)

df_char = df_char[df_char.index == 1][['message_length_char', 'message_length_words']]
df_char['category'] = target_colnames

In [None]:
df_char

Unnamed: 0,message_length_char,message_length_words,category
1,154.600309,25.358384,related
1,128.424229,23.044703,request
1,151.533898,24.669492,offer
1,173.296961,28.5186,aid_related
1,245.829655,38.996161,medical_help
1,269.006093,42.623001,medical_products
1,220.209945,35.751381,search_and_rescue
1,229.157113,36.923567,security
1,252.360465,39.676744,military
1,242.026316,39.198565,water


In [None]:
fig = make_subplots(rows=2, cols=1, subplot_titles=['Length in characters', 'Length in words'], vertical_spacing=0.5)

fig.add_trace(go.Bar(x=df_char['category'], y=df_char['message_length_char'],
                     name='message_length_char'), row=1, col= 1)

fig.add_trace(go.Bar(x=df_char['category'], y=df_char['message_length_words'],
                     name='message_length_char'), row=2, col= 1)

fig.update(layout_showlegend=False)
fig.update_layout(height=600, width=1000)
fig.show()

###Messages by genre

In [None]:
genre_mes = df.groupby(by = 'genre').sum()

In [None]:
fig = make_subplots(rows=6, cols=6, subplot_titles=target_colnames)

for i, nam in enumerate(target_colnames):
  fig.add_trace(go.Bar(x=genre_mes[nam].index, y=genre_mes[nam].values, name=nam),
                row=math.ceil((i+1) / 6), col= (i % 6) + 1)
  
fig.update(layout_showlegend=False)
fig.update_layout(height=1200, width=1200, title_text="Genre of messages per target")
fig.show()

It looks like the messages that come from 'social' are rarely classified into any of categories, 'direct' and 'news' are most sources of important information 

###Train/Test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y)


###Base NLP pipeline (Random Forest Classifier)

In [None]:
pipeline_base = Pipeline([('vect', CountVectorizer(tokenizer=tokenize)),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultiOutputClassifier(RandomForestClassifier()))]) 

In [None]:

pipeline_base.fit(X_train, y_train)

y_pred_base = pipeline_base.predict(X_test)

In [None]:
#Evaluate Base Model
display_results(y_test, y_pred_base)

Class - related
              precision    recall  f1-score   support

           0       1.00      0.97      0.99        35
           1       0.00      1.00      0.00         0

    accuracy                           0.97        35
   macro avg       0.50      0.99      0.49        35
weighted avg       1.00      0.97      0.99        35

Class - request
              precision    recall  f1-score   support

           0       0.97      1.00      0.98        30
           1       1.00      0.80      0.89         5

    accuracy                           0.97        35
   macro avg       0.98      0.90      0.94        35
weighted avg       0.97      0.97      0.97        35

Class - offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        34
           1       1.00      1.00      1.00         1

    accuracy                           1.00        35
   macro avg       1.00      1.00      1.00        35
weighted avg       1.00     


F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.



###Base NLP model (Logistic Regression)

In [None]:
pipeline_log = Pipeline([('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(LogisticRegression()))
    ])

pipeline_log.fit(X_train, y_train)

y_pred_log = pipeline_log.predict(X_test)

In [None]:
#Evaluate Base Model
display_results(y_test, y_pred_log)

Class - related
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        35

    accuracy                           1.00        35
   macro avg       1.00      1.00      1.00        35
weighted avg       1.00      1.00      1.00        35

Class - request
              precision    recall  f1-score   support

           0       1.00      0.97      0.98        30
           1       0.83      1.00      0.91         5

    accuracy                           0.97        35
   macro avg       0.92      0.98      0.95        35
weighted avg       0.98      0.97      0.97        35

Class - offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        34
           1       1.00      1.00      1.00         1

    accuracy                           1.00        35
   macro avg       1.00      1.00      1.00        35
weighted avg       1.00      1.00      1.00        35

Class - aid_related
       


F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.



Class - other_weather
              precision    recall  f1-score   support

           0       1.00      0.97      0.99        35
           1       0.00      1.00      0.00         0

    accuracy                           0.97        35
   macro avg       0.50      0.99      0.49        35
weighted avg       1.00      0.97      0.99        35

Class - direct_report
              precision    recall  f1-score   support

           0       0.94      1.00      0.97        30
           1       1.00      0.60      0.75         5

    accuracy                           0.94        35
   macro avg       0.97      0.80      0.86        35
weighted avg       0.95      0.94      0.94        35

Base Model
Minimum f1 score - 0.0
Best f1 score - 1.0
Mean f1 score - 0.552


Logistic Regression is peforming slightly better, so we would continue exploring LogisticRegression based pipelines

###NLP model Feature Union, message length feature

In [None]:
#Feature Union Pipeline

pipeline_feat_base = Pipeline([('features', FeatureUnion([('nlp_pipeline', Pipeline([('vect', CountVectorizer()),
                                                                          ('tfidf', TfidfTransformer())])),
                                                      ('ml_wor', message_length_word()),
                                                      ('ml_char', message_length_char())
                                                      ])),
                      ('clf', MultiOutputClassifier(LogisticRegression(max_iter = 600)))])

In [None]:
pipeline_feat_base.fit(X_train, y_train)

y_pred_feat_base = pipeline_feat_base.predict(X_test)

In [None]:
display_results(y_test, y_pred_feat_base)

Class - related
              precision    recall  f1-score   support

           0       0.81      1.00      0.90        26
           1       1.00      0.33      0.50         9

    accuracy                           0.83        35
   macro avg       0.91      0.67      0.70        35
weighted avg       0.86      0.83      0.79        35

Class - request
              precision    recall  f1-score   support

           0       0.94      0.97      0.95        31
           1       0.67      0.50      0.57         4

    accuracy                           0.91        35
   macro avg       0.80      0.73      0.76        35
weighted avg       0.91      0.91      0.91        35

Class - offer
              precision    recall  f1-score   support

           0       0.97      1.00      0.98        31
           1       1.00      0.75      0.86         4

    accuracy                           0.97        35
   macro avg       0.98      0.88      0.92        35
weighted avg       0.97     


F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.



The score of the model with extra features perform worse than the base model, but we shall perform some grid search on it

###Attempt at column transformer

here is where the code breaks!
ErrorMessage:

TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. 

My custom functions had fit_transform methods (commented out so other pipelines would have a problem)

In [None]:
#for this one we would need to pull one more column from the original dataset
df_feat = df[['message', 'genre']]

X_traina, X_testa, y_traina, y_testa = train_test_split(df_feat, Y)


col_transform = ColumnTransformer(transformers=[("mes",pipeline_feat_base, ['message']),
                                                ("gen",OneHotEncoder(drop = 'first'),['genre'])], remainder='passthrough')




In [None]:
col_transform.fit_transform(X_traina)
y_pred_feat_base = pipeline_feat_base.predict(X_test)


TypeError: ignored

###GridSearch of the best performing base model

In [54]:
sam = 10000

X_train, X_test, y_train, y_test = train_test_split(X, Y)
X_train1, y_train1 = X_train[:sam], y_train[:sam]
X_test1, y_test1 = X_test[:sam], y_test[:sam]

In [None]:
pipeline_feat_base.get_params

<bound method Pipeline.get_params of Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('nlp_pipeline',
                                                 Pipeline(memory=None,
                                                          steps=[('vect',
                                                                  CountVectorizer(analyzer='word',
                                                                                  binary=False,
                                                                                  decode_error='strict',
                                                                                  dtype=<class 'numpy.int64'>,
                                                                                  encoding='utf-8',
                                                                                  input='content',
                                                 

In [55]:
parameters = {'features__nlp_pipeline__tfidf__norm': ['l1', 'l2'],
              'clf__estimator__max_iter': [600, 800],
              'clf__estimator__C': [0.5, 1.5]

    }

cv_feat_base = GridSearchCV(pipeline_feat_base, param_grid = parameters)

In [None]:
cv_feat_base.fit(X_train1, y_train1)
y_pred_feat_cv = cv_feat_base.predict(X_test1)

In [53]:
display_results(y_test1, y_pred_feat_cv, grid = cv_feat_base)


Class - related
              precision    recall  f1-score   support

           0       1.00      0.97      0.99        35
           1       0.00      1.00      0.00         0

    accuracy                           0.97        35
   macro avg       0.50      0.99      0.49        35
weighted avg       1.00      0.97      0.99        35

Class - request
              precision    recall  f1-score   support

           0       0.91      1.00      0.95        31
           1       1.00      0.25      0.40         4

    accuracy                           0.91        35
   macro avg       0.96      0.62      0.68        35
weighted avg       0.92      0.91      0.89        35

Class - offer
              precision    recall  f1-score   support

           0       1.00      0.97      0.99        35
           1       0.00      1.00      0.00         0

    accuracy                           0.97        35
   macro avg       0.50      0.99      0.49        35
weighted avg       1.00     

### 9. Export your model as a pickle file

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.