# ML Pipeline Preparation
Follow the steps below to build your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [63]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import fbeta_score, accuracy_score
from sklearn.pipeline import Pipeline

In [64]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('InsertTableName', engine)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
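
If you are unsure what to put in place of `InsertTableName`, one option is to list the tables stored in the SQLite file first. The snippet below is a minimal sketch using only the standard-library `sqlite3` module. The in-memory database and the table name `messages` are made up for illustration; point `sqlite3.connect` at your own `.db` file instead.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Demo on an in-memory database with a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER, message TEXT)")
tables = list_tables(conn)
print(tables)
conn.close()
```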


In [65]:
X = df['message']
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [83]:
Y = df.iloc[:,4:]
Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


### 2. Write a tokenization function to process your text data

In [116]:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.stem.wordnet import WordNetLemmatizer

def tokenize(text):
    '''Lowercase, tokenize and lemmatize a single message string.'''
    # Normalize
    text = text.lower()
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Lemmatize
    lmtzr = WordNetLemmatizer()
    lemmatized = [lmtzr.lemmatize(word) for word in tokens]

    return lemmatized
    

[nltk_data] Downloading package wordnet to /Users/jamesyu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jamesyu/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
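
To see what the normalize and tokenize stages do without downloading any NLTK data, here is a dependency-free sketch. It is not the `tokenize` function above: it swaps `nltk.word_tokenize` for a plain regular expression, drops punctuation, and skips lemmatization entirely. The name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    normalized = text.lower()
    # Keep runs of letters and digits; punctuation and whitespace act as separators.
    return re.findall(r"[a-z0-9]+", normalized)


tokens = simple_tokenize("Is the Hurricane over or is it not over")
print(tokens)
```

Because punctuation is treated as a separator here, a string such as `80-90` becomes two tokens, which may or may not be what you want for your data.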


### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [113]:
from sklearn.multioutput import MultiOutputClassifier

pipeline = Pipeline([('vect', CountVectorizer(tokenizer=tokenize)),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultiOutputClassifier(MultinomialNB())),])
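
To check how `MultiOutputClassifier` behaves before training on the full dataset, you can fit the same three steps on a handful of rows. The sketch below uses made-up messages and two made-up label columns, and it relies on the default `CountVectorizer` tokenizer rather than `tokenize` so that it runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up messages and two made-up label columns: [water, storm].
texts = [
    "we need drinking water",
    "no water in the village",
    "a storm is coming tonight",
    "heavy storm destroyed the road",
]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(MultinomialNB())),
])
toy_pipeline.fit(texts, labels)

# predict returns one row per message and one column per label.
pred = toy_pipeline.predict(["please send water", "storm warning"])
print(pred.shape)
```

`MultiOutputClassifier` fits one independent copy of the wrapped estimator per label column, so each column of the prediction comes from its own `MultinomialNB` model.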


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [114]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the messages (X) and category labels (Y) into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y, 
                                                    test_size = 0.2, 
                                                    random_state = 0)
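
Conceptually, the split above shuffles the rows with a fixed seed and holds out 20% of them for testing. The following standard-library sketch illustrates that idea on a small list. It is not scikit-learn's implementation, and the helper name `shuffled_split` is hypothetical.

```python
import math
import random


def shuffled_split(items, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed, then slice off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [items[i] for i in train_idx]
    test = [items[i] for i in test_idx]
    return train, test


train_rows, test_rows = shuffled_split(list(range(10)), test_size=0.2, seed=0)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, which is the role `random_state` plays in `train_test_split`.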

In [117]:
# Train
print(y_train.shape)
pipeline.fit(X_train, y_train)

(20890, 36)


  self.class_log_prior_ = (np.log(self.class_count_) -

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ssifier(estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
           n_jobs=1))])

### 5. Test your model
Report the accuracy, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 

In [126]:
from sklearn.metrics import classification_report

train_pred = pipeline.predict(X_train)
print("Training metrics")
print(classification_report(y_train, train_pred, target_names=list(Y)))

test_pred = pipeline.predict(X_test)
print("Testing metrics")
print(classification_report(y_test, test_pred, target_names=list(Y)))

Training metrics
                        precision    recall  f1-score   support

               related       0.99      1.00      1.00     20781
               request       0.00      0.00      0.00         0
                 offer       0.00      0.00      0.00         0
           aid_related       0.99      1.00      1.00     20781
          medical_help       0.00      0.00      0.00         0
      medical_products       0.00      0.00      0.00         0
     search_and_rescue       0.00      0.00      0.00         0
              security       0.00      0.00      0.00         0
              military       0.00      0.00      0.00         0
           child_alone       0.00      0.00      0.00         0
                 water       0.00      0.00      0.00         0
                  food       0.00      0.00      0.00         0
               shelter       0.00      0.00      0.00         0
              clothing       0.00      0.00      0.00         0
                 money

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


Testing metrics
                        precision    recall  f1-score   support

               related       0.99      1.00      1.00      5194
               request       0.00      0.00      0.00         0
                 offer       0.00      0.00      0.00         0
           aid_related       0.99      1.00      1.00      5194
          medical_help       0.00      0.00      0.00         0
      medical_products       0.00      0.00      0.00         0
     search_and_rescue       0.00      0.00      0.00         0
              security       0.00      0.00      0.00         0
              military       0.00      0.00      0.00         0
           child_alone       0.00      0.00      0.00         0
                 water       0.00      0.00      0.00         0
                  food       0.00      0.00      0.00         0
               shelter       0.00      0.00      0.00         0
              clothing       0.00      0.00      0.00         0
                 money 
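
To make the columns of the report easier to read, here is a small sketch of how precision, recall, F1, accuracy and support can be computed for a single binary label. It is written from the standard definitions rather than taken from scikit-learn, and it assumes that a metric with a zero denominator is reported as 0.0, which matches the 0.00 rows with support 0 above. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute per-label metrics for 0/1 ground truth and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "support": tp + fn,  # number of true positive-class rows
    }


mixed = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
no_positives = binary_metrics([0, 0, 0], [0, 0, 0])
print(mixed)
print(no_positives)
```

The second call shows why a label with no positive examples can score 0.00 on precision and recall while every prediction for it is still correct: support is 0, so there is nothing for those two metrics to measure.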

### 6. Improve your model
Use grid search to find better parameters. 
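
Before launching a search, it can help to count how many parameter combinations the grid contains, since each one is refitted once per cross-validation fold. The sketch below enumerates the grid used in this step with `itertools.product`; it only illustrates the idea of an exhaustive grid and does not call scikit-learn.

```python
from itertools import product

param_grid = {
    "tfidf__use_idf": (True, False),
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__smooth_idf": (True, False),
}

# Build every combination of one value per parameter.
names = sorted(param_grid)
combos = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]
print(len(combos))
```

The total number of fits is the number of combinations multiplied by the number of folds, and the default number of folds depends on the installed scikit-learn version.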

In [77]:
from sklearn.model_selection import GridSearchCV

parameters = {'tfidf__use_idf': (True, False),
              'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__smooth_idf': (True, False)}

cv = GridSearchCV(pipeline, parameters, n_jobs=-1)
cv = cv.fit(X_train, y_train)

  self.class_log_prior_ = (np.log(self.class_count_) -

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

In [87]:
print(cv.best_estimator_)
train_pred_gs = cv.best_estimator_.predict(X_train)
print(classification_report(y_train, train_pred_gs, target_names=list(Y)))

test_pred_gs = cv.best_estimator_.predict(X_test)
print(classification_report(y_test, test_pred_gs, target_names=list(Y)))

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ssifier(estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
           n_jobs=1))])
                        precision    recall  f1-score   support

               related       0.99      1.00      1.00     20781
               request       0.00      0.00      0.00         0
                 offer       0.00      0.00      0.00         0
           aid_related       0.99      1.00      1.00     20781
          medical_help       0.00      0.00      0.00         0
      medical_products       0.00      0.00      0.00         0
     search_and_rescue       0.00      0.00      0.00         0
              security       0.00      0.00      0.00  

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file

In [80]:
import pickle

filename = 'finalized_model.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(cv, f)
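
To confirm that an export can be read back, you can round-trip an object through `pickle`. The sketch below uses a plain dictionary as a stand-in for the fitted model and writes to a temporary directory; the file name `toy_model.pkl` is made up.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object works the same way.
model_like = {"name": "toy", "weights": [0.1, 0.2]}

path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")

# Write, then read back, closing the file each time.
with open(path, "wb") as f:
    pickle.dump(model_like, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_like)
```

When the pickled object is a pipeline that refers to a custom function such as `tokenize`, that function must be importable under the same name in the process that loads the file, because `pickle` stores functions by reference rather than by value.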

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
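
As a starting point for reading the user-specified inputs, a script can take the database path and the output model path from the command line. The sketch below shows one way to do that with `argparse`. The argument names `database_filepath` and `model_filepath` are assumptions, not taken from the template file, so adapt them to whatever the template expects.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) command-line arguments for train.py."""
    parser = argparse.ArgumentParser(
        description="Train a message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


# Example call with made-up paths; pass nothing to read from sys.argv instead.
args = parse_args(["messages.db", "model.pkl"])
print(args.database_filepath, args.model_filepath)

# The remaining steps would follow the notebook:
# load the table, build the pipeline, fit, evaluate, and pickle the result.
```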