# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# Import libraries

# Data handling
import pandas as pd
import numpy as np

# Database access
from sqlalchemy import create_engine

# Standard-library utilities
import re
import pickle
import string
import sys

# Text processing with the Natural Language Toolkit (NLTK)
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# scikit-learn: pipeline, feature extraction, models, and metrics
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Mustafa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Mustafa\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mustafa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Load the dataset from the SQLite database
- Use `read_sql_table` to read the `DisasterResponse` table from the `DisasterResponse.db` database

In [2]:
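# Load data from the database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('DisasterResponse', engine)
X = df['message']
y = df[df.columns[4:]]  # every column after the first four is a target
# y.info() prints its summary itself and returns None, so call it on its own
y.info()
print(X.head())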

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26028 entries, 0 to 26027
Data columns (total 35 columns):
related                   26028 non-null int64
request                   26028 non-null int64
offer                     26028 non-null int64
aid_related               26028 non-null int64
medical_help              26028 non-null int64
medical_products          26028 non-null int64
search_and_rescue         26028 non-null int64
security                  26028 non-null int64
military                  26028 non-null int64
water                     26028 non-null int64
food                      26028 non-null int64
shelter                   26028 non-null int64
clothing                  26028 non-null int64
money                     26028 non-null int64
missing_people            26028 non-null int64
refugees                  26028 non-null int64
death                     26028 non-null int64
other_aid                 26028 non-null int64
infrastructure_related    26028 non-null int6
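
The positional slice `df.columns[4:]` only works while the first four columns are the non-label ones. A minimal sketch of a name-based alternative on a toy frame follows; the column names `id`, `original`, and `genre` are assumptions made for illustration, since the notebook does not list the leading columns.

```python
import pandas as pd

# Toy frame; only `message` and the two label names are taken from the notebook.
# `id`, `original`, and `genre` are assumed names for the leading columns.
df = pd.DataFrame({
    "id": [1, 2],
    "message": ["need water", "road blocked"],
    "original": [None, None],
    "genre": ["direct", "news"],
    "water": [1, 0],
    "infrastructure_related": [0, 1],
})

non_label_columns = ["id", "message", "original", "genre"]

X = df["message"]
y_by_position = df[df.columns[4:]]             # what the notebook does
y_by_name = df.drop(columns=non_label_columns)  # robust to column reordering

print(list(y_by_name.columns))
print(y_by_position.equals(y_by_name))
```

Selecting by name fails loudly if an expected column is missing, whereas the positional slice would silently shift the targets.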

### 2. Write a tokenization function to process your text data

In [3]:
def tokenize(text):
    """
    Normalize, tokenize, stem, and lemmatize a single message.

    INPUT - text - one message string from the `message` column
    Returns a list of processed tokens after the steps below:

    1. Remove punctuation and convert the text to lower case
    2. Tokenize the text and remove stop words
    3. Apply a stemmer and then a lemmatizer to reduce words to their root form
    """
    # Replace non-alphanumeric characters with spaces and lower-case the text
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    # Tokenize the text and remove stop words (a set gives fast lookups)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words("english"))
    words = [w for w in tokens if w not in stop_words]

    # Stemmer: reduce words to their stems
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(w) for w in words]

    # Lemmatizer: reduce the stemmed words to their root form
    lemmatizer = WordNetLemmatizer()
    lemm = [lemmatizer.lemmatize(w) for w in stemmed]

    return lemm
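
The normalization step of `tokenize` can be tried in isolation with only the standard library. This is a sketch, not the notebook's function: it swaps `word_tokenize` for a plain whitespace split and skips stop-word removal, stemming, and lemmatization, so it needs no NLTK data. The helper name `normalize` is hypothetical.

```python
import re

def normalize(text):
    """Lower-case the text, replace non-alphanumeric characters with spaces,
    and split on whitespace (a stand-in for NLTK's word_tokenize)."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return cleaned.split()

tokens = normalize("Help!! We need water, food & shelter.")
print(tokens)  # ['help', 'we', 'need', 'water', 'food', 'shelter']
```

Because punctuation is replaced by spaces rather than deleted, words joined by punctuation such as `food,water` still separate into two tokens.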

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset (the `y.info()` output in step 1 reports 35 target columns for this database, so check the width of `y` before relying on that number). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [4]:
#create pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
]) 
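
To see the shape of what such a pipeline produces, here is a minimal sketch on four made-up messages and three made-up labels. It uses the default `CountVectorizer` tokenizer instead of the notebook's `tokenize`, so it runs without NLTK; the data and the `n_estimators=10` setting are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy data: one row per message, one column per label (water, food, shelter)
messages = [
    "we need water and food",
    "house collapsed need shelter",
    "water supply is broken",
    "send food and tents for shelter",
]
labels = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),    # raw text -> token counts
    ("tfidf", TfidfTransformer()),  # counts -> TF-IDF weights
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])

toy_pipeline.fit(messages, labels)
pred = toy_pipeline.predict(["need water"])
print(pred.shape)  # one row per message, one column per label
```

`MultiOutputClassifier` fits one independent classifier per label column, which is why the prediction has the same number of columns as `labels`.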

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [5]:
#Split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit pipeline
pipeline.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                  

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
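
The per-column iteration described above can be sketched on toy data as follows. The label names and values are made up, and `zero_division=0` is an assumption that requires scikit-learn 0.22 or newer; it silences the undefined-metric warning for labels that are never predicted.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Toy ground truth (DataFrame, like y_test) and predictions (array, like y_pred)
y_true = pd.DataFrame({
    "water": [1, 0, 1, 0, 1, 0],
    "food":  [0, 1, 1, 0, 0, 1],
})
y_hat = np.array([
    [1, 0],
    [0, 1],
    [0, 1],
    [0, 0],
    [1, 0],
    [0, 0],
])

reports = {}
for i, column in enumerate(y_true.columns):
    # Column i of the prediction array corresponds to column i of the targets
    reports[column] = classification_report(
        y_true[column], y_hat[:, i], zero_division=0)
    print(column)
    print(reports[column])
```

Each per-column report lists both the negative and the positive class, which makes it easier to see when a rare label is never predicted.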

In [6]:
# Predict on test set 
y_pred = pipeline.predict(X_test)
categories = y.columns.tolist()
# Test model on test set
print(classification_report(y_test, y_pred, target_names=categories))

                        precision    recall  f1-score   support

               related       0.84      0.92      0.88      3959
               request       0.78      0.43      0.56       902
                 offer       0.00      0.00      0.00        25
           aid_related       0.74      0.61      0.67      2156
          medical_help       0.60      0.10      0.16       431
      medical_products       0.76      0.11      0.19       264
     search_and_rescue       0.64      0.06      0.11       151
              security       0.00      0.00      0.00       106
              military       0.69      0.06      0.12       175
                 water       0.81      0.39      0.52       344
                  food       0.83      0.46      0.59       586
               shelter       0.87      0.32      0.46       487
              clothing       0.81      0.22      0.34        79
                 money       0.89      0.06      0.11       131
        missing_people       0.00      



### 6. Improve your model
Use grid search to find better parameters. 

In [7]:
parameters = {
    'clf__estimator__n_estimators': [20, 50]
}

cv = GridSearchCV(estimator=pipeline, param_grid=parameters, cv=3, verbose=3)

cv.fit(X_train, y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__estimator__n_estimators=20 .................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ..... clf__estimator__n_estimators=20, score=0.244, total= 3.6min
[CV] clf__estimator__n_estimators=20 .................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  3.6min remaining:    0.0s


[CV] ..... clf__estimator__n_estimators=20, score=0.247, total= 3.3min
[CV] clf__estimator__n_estimators=20 .................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  7.0min remaining:    0.0s


[CV] ..... clf__estimator__n_estimators=20, score=0.255, total= 4.4min
[CV] clf__estimator__n_estimators=50 .................................
[CV] ..... clf__estimator__n_estimators=50, score=0.254, total= 6.3min
[CV] clf__estimator__n_estimators=50 .................................
[CV] ..... clf__estimator__n_estimators=50, score=0.253, total= 6.4min
[CV] clf__estimator__n_estimators=50 .................................
[CV] ..... clf__estimator__n_estimators=50, score=0.268, total= 6.4min


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 30.3min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
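
The fold scores in the log above (about 0.24 to 0.27) look low next to per-category metrics. A likely reason, given that no `scoring` argument is passed: `GridSearchCV` falls back to the estimator's own `score`, and for `MultiOutputClassifier` that is exact-match accuracy, where a sample counts as correct only if every label matches. The sketch below contrasts the two measures on made-up arrays using NumPy only.

```python
import numpy as np

# Made-up targets and predictions: 4 samples, 3 labels each
y_true = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
])
y_pred = np.array([
    [1, 0, 1],   # all labels correct
    [0, 1, 1],   # one label wrong
    [1, 0, 0],   # one label wrong
    [0, 0, 0],   # all labels correct
])

# Exact-match (subset) accuracy: a row counts only if every label is right
exact_match = np.all(y_true == y_pred, axis=1).mean()

# Label-wise accuracy: fraction of individual label decisions that are right
label_wise = (y_true == y_pred).mean()

print(exact_match)  # 0.5
print(label_wise)
```

With dozens of labels per message, a single wrong label is enough to fail the exact-match test, so this score is much stricter than label-wise accuracy.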
                                            

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
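
Accuracy alone can be misleading for rare categories. As a worked example, the `offer` category has a support of 25 in the step 5 report. The arithmetic below assumes the test split holds 20% of the 26028 rows, rounded up, and asks what accuracy a model would get on `offer` by never predicting it.

```python
import math

n_rows = 26028       # from the RangeIndex in the y.info() output
test_fraction = 0.2  # test_size passed to train_test_split
offer_support = 25   # positives for `offer` in the test-set report

# Assumed test-set size: 20% of the rows, rounded up
n_test = math.ceil(test_fraction * n_rows)

# A model that always predicts 0 for `offer` is wrong only on the positives
always_negative_accuracy = (n_test - offer_support) / n_test
always_negative_recall = 0 / offer_support

print(n_test)  # 5206
print(round(always_negative_accuracy, 4))
print(always_negative_recall)  # 0.0
```

An accuracy above 99% with a recall of zero is why precision, recall, and F1 per category are more informative here than accuracy.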

In [8]:
# Predict on test set 
y_pred = cv.predict(X_test)
categories = y.columns.tolist()

# Test model on test set
print(classification_report(y_test, y_pred, target_names=categories))

                        precision    recall  f1-score   support

               related       0.84      0.95      0.89      3959
               request       0.81      0.52      0.63       902
                 offer       0.00      0.00      0.00        25
           aid_related       0.76      0.69      0.73      2156
          medical_help       0.70      0.08      0.14       431
      medical_products       0.81      0.08      0.15       264
     search_and_rescue       0.76      0.11      0.19       151
              security       0.50      0.01      0.02       106
              military       0.87      0.07      0.14       175
                 water       0.90      0.37      0.52       344
                  food       0.82      0.61      0.70       586
               shelter       0.87      0.32      0.47       487
              clothing       0.81      0.16      0.27        79
                 money       0.88      0.05      0.10       131
        missing_people       1.00      



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [None]:
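# Use an AdaBoost classifier with class-balanced decision stumps as the base learner.
# Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2 and
# removed in 1.4; on newer versions pass `estimator=` instead.

pipeline_ada = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(
        AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1, class_weight='balanced'))
    ))
])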

parameters_ada = {
    'clf__estimator__learning_rate': [0.1, 0.3],
    'clf__estimator__n_estimators': [100, 200]
}

cv_ada = GridSearchCV(estimator=pipeline_ada, param_grid=parameters_ada, cv=3, scoring='f1_weighted', verbose=3)

cv_ada.fit(X_train, y_train)
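
Before launching a search like this, it helps to count the fits it implies. The sketch below enumerates the grid with the standard library only; it mirrors the `parameters_ada` dictionary and `cv=3` from the cell above and is not a replacement for `GridSearchCV`.

```python
from itertools import product

parameters_ada = {
    'clf__estimator__learning_rate': [0.1, 0.3],
    'clf__estimator__n_estimators': [100, 200],
}
n_folds = 3  # the cv argument

# Every combination of one value per parameter is one candidate
candidates = [
    dict(zip(parameters_ada, values))
    for values in product(*parameters_ada.values())
]
n_cv_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(len(candidates), n_cv_fits)  # 4 12
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set after the 12 cross-validation fits, so runtime grows quickly as values are added to the grid.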

In [None]:
# Best parameters set
cv_ada.best_params_

In [None]:
# Predict on test set 
y_pred = cv_ada.predict(X_test)
categories = y.columns.tolist()

# Test model on test set
print(classification_report(y_test, y_pred, target_names=categories))

### 9. Export your model as a pickle file

In [None]:
import joblib 

joblib.dump(cv_ada, 'DisasterResponseModel.pkl')
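
A saved model is only useful if it can be loaded again. The sketch below shows a dump-and-load round trip with the standard-library `pickle` module on a hypothetical stand-in object; `joblib.dump` and `joblib.load` follow the same pattern. The object and file name are made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: any picklable Python object
model_stub = {"best_params": {"n_estimators": 100, "learning_rate": 0.3}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_stub.pkl")

    # Serialize to disk
    with open(path, "wb") as fh:
        pickle.dump(model_stub, fh)

    # Deserialize and confirm nothing was lost
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == model_stub)  # True
```

One caveat that probably applies to the real model: the pipeline stores a reference to the custom `tokenize` function, and functions are pickled by name rather than by value, so the loading script needs to define or import `tokenize` under the same module path before calling `joblib.load`.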

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
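
One way to let the user specify the dataset is to take the file paths as command-line arguments. The sketch below is an assumption about how such a script could be organized, not the content of the template file; the argument names `database_filepath` and `model_filepath` are hypothetical.

```python
import argparse

def parse_args(argv=None):
    """Parse the two file paths the training script needs."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it.")
    parser.add_argument(
        "database_filepath",
        help="path to the SQLite database holding the cleaned messages")
    parser.add_argument(
        "model_filepath",
        help="path where the trained model should be saved")
    return parser.parse_args(argv)

# Example invocation with the file names used in this notebook
args = parse_args(["DisasterResponse.db", "DisasterResponseModel.pkl"])
print(args.database_filepath, args.model_filepath)
```

In the script itself, `parse_args()` would be called without arguments so that it reads from `sys.argv`, and the loading, training, evaluation, and export steps from this notebook would follow as separate functions.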