# ML Pipeline Preparation
Follow the instructions below to build your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [6]:
# import libraries
import pandas as pd
import time
import os
from sqlalchemy import create_engine

# Global seed for random_state
seed = 2020

# Specific Machine Learning Algorithms
from sklearn.ensemble import RandomForestClassifier

# For text processing
import re
import nltk
nltk.download(['punkt', 'wordnet', 
               'averaged_perceptron_tagger', 'stopwords'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.corpus import stopwords
stop_words = stopwords.words("english")

#For the Machine Learning Model
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split 

# For Model Fit
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# To Save the Model
import pickle

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
# load data from database

database_filepath = "../data/messages.db"
engine = create_engine('sqlite:///' + database_filepath)
#table_name = os.path.basename(database_filepath).replace(".db","") + "_table"
#df = pd.read_sql_table(table_name,engine)
df = pd.read_sql("SELECT * FROM Messages", engine)

X = df['message']
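y = df.iloc[:, -36:].copy()  # copy so the edit below does not act on a slice of df
# Map any 2 in 'related' to 1 so the column is binary and does not trigger a multi-output error
y['related'] = y['related'].replace(to_replace=2, value=1)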
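The recoding of `related` can be checked on a small made-up frame. This is only a sketch: the `labels` DataFrame and its values are hypothetical and are not taken from the database.

```python
import pandas as pd

# Hypothetical label frame: 'related' carries a stray third value (2)
labels = pd.DataFrame({'related': [0, 1, 2, 1], 'request': [0, 1, 0, 0]})

# Assign the result back instead of using inplace=True on a column selection
labels['related'] = labels['related'].replace(to_replace=2, value=1)

# Every target column should now contain only 0 and 1
related_values = sorted(labels['related'].unique().tolist())
print(related_values)
```

Assigning the result back avoids modifying a column selection in place, which pandas may warn about when the frame is itself a slice of another frame.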

### 2. Write a tokenization function to process your text data

In [10]:
def tokenize(text):
    text = text.lower()
    text = re.sub(r'[^A-Za-z0-9]', ' ', text)
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    final_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return final_tokens
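To see what the normalization step does before NLTK gets involved, here is a minimal sketch that uses only `re`. The `normalize` helper and the sample sentence are hypothetical, and plain `str.split` stands in for `word_tokenize`, so no lemmatization or stop-word removal happens here.

```python
import re

def normalize(text):
    """Lowercase the text and replace every character that is not a letter or digit with a space."""
    return re.sub(r'[^a-z0-9]', ' ', text.lower())

sample = "Help! We need water, food & 20 tents in Zone-7."
tokens = normalize(sample).split()
print(tokens)
```

Note that the character class has to cover the whole alphabet: words such as "zone" would be broken apart if `y` or `z` were left out of the range.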

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [11]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer(smooth_idf=False)),
    ('clf', RandomForestClassifier(random_state=seed))
    ])
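`RandomForestClassifier` accepts a two-dimensional target on its own, which is why the cell above runs without a wrapper. For comparison, the sketch below shows how `MultiOutputClassifier` fits one estimator per target column. The toy arrays `X_toy` and `Y_toy` are made up for illustration and stand in for the TF-IDF features and the 36 category columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(2020)
X_toy = rng.rand(40, 5)                      # 40 samples, 5 numeric features
Y_toy = (X_toy[:, :3] > 0.5).astype(int)     # 3 binary targets derived from the features

# One independent random forest is trained for each target column
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=2020))
clf.fit(X_toy, Y_toy)

pred = clf.predict(X_toy[:4])
print(pred.shape)            # rows = samples, columns = targets
print(len(clf.estimators_))  # number of fitted per-target forests
```

If the wrapper were used as the `clf` step of the pipeline, the forest's parameters would be addressed as `clf__estimator__n_estimators` rather than `clf__n_estimators` in a parameter grid.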

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=seed)
start = time.time()
pipeline.fit(X_train, y_train)
# predict on test data
y_pred = pipeline.predict(X_test)
end = time.time()
print(end - start)
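The effect of `test_size` and `random_state` can be seen on a toy dataset. In this sketch the lists `X_toy` and `y_toy` are hypothetical placeholders for the messages and their labels.

```python
from sklearn.model_selection import train_test_split

X_toy = list(range(100))
y_toy = [i % 2 for i in X_toy]

# test_size=0.33 holds out roughly a third of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=2020)
print(len(X_tr), len(X_te))

# Reusing the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.33, random_state=2020)
print(X_te == X_te2)
```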

### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [None]:
issues = y_test.columns
y_pred_df = pd.DataFrame(y_pred, columns=issues)
for issue in issues:
    print(classification_report(y_test[issue], 
                            y_pred_df[issue]))
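Printing 36 separate reports is hard to scan, so one option is to collect the per-column scores into a single table. The sketch below does this with `precision_recall_fscore_support` on two made-up categories. The frames `y_true_toy` and `y_pred_toy` are hypothetical, and `zero_division=0` assumes scikit-learn 0.22 or later.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical ground truth and predictions for two categories
y_true_toy = pd.DataFrame({'water': [1, 0, 1, 1], 'food': [0, 0, 1, 1]})
y_pred_toy = pd.DataFrame({'water': [1, 0, 0, 1], 'food': [0, 1, 1, 1]})

rows = []
for col in y_true_toy.columns:
    # Scores for the positive class (label 1) of this column
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true_toy[col], y_pred_toy[col], average='binary', zero_division=0)
    rows.append({'category': col, 'precision': precision, 'recall': recall, 'f1': f1})

summary = pd.DataFrame(rows).set_index('category')
print(summary)
```

Unlike `classification_report`, this table only covers the positive class of each category, so it complements the full reports rather than replacing them.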

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
# Keys follow the '<step name>__<parameter>' convention of the pipeline
parameters = {
    'clf__n_estimators': [100, 300],
    'clf__max_features': [5, 8],
}

cv = GridSearchCV(pipeline, parameters, n_jobs=4, verbose=2)
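Before starting a long search, it helps to know how many models will be trained. The sketch below enumerates the grid with `ParameterGrid`. The fold count `n_folds = 5` is an assumption: it matches the default `cv` in recent scikit-learn releases, but older versions used 3.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'clf__n_estimators': [100, 300],
    'clf__max_features': [5, 8],
}

# Every combination of the listed values is one candidate
grid = list(ParameterGrid(parameters))
for combo in grid:
    print(combo)

n_folds = 5  # assumed default; check the installed scikit-learn version
total_fits = len(grid) * n_folds
print(total_fits, "cross-validation fits, plus one final refit on the full training set")
```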

In [None]:
start = time.time()
cv.fit(X_train, y_train)
# predict on test data
y_pred = cv.predict(X_test)
end = time.time()
print(end-start)

issues = y_test.columns
y_pred_df = pd.DataFrame(y_pred, columns=issues)
for issue in issues:
    print(classification_report(y_test[issue], 
                            y_pred_df[issue]))

In [None]:
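# Predict on the training set so the report can be compared with the test-set scores
y_prediction_train = cv.predict(X_train)
print('\n', classification_report(y_train.values, y_prediction_train, target_names=y.columns.values))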

In [None]:
#pickle.dump(model,open(model_filepath,'wb'))
#classifier = pickle.dumps(cv)

In [None]:
with open('classifier.pkl', 'wb') as classifier:
    pickle.dump(cv, classifier)
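As a reminder of the two pickle entry points: `pickle.dump` writes to an open binary file, while `pickle.dumps` returns a `bytes` object and takes no file argument. The round trip below is a minimal sketch. The dictionary `model_stub` is a hypothetical stand-in for the fitted `GridSearchCV` object, and the file goes to a temporary directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model
model_stub = {'n_estimators': 100, 'max_features': 5}

path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')

# Serialize to disk
with open(path, 'wb') as f:
    pickle.dump(model_stub, f)

# Load it back and confirm nothing was lost
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model_stub)
```

A pickled pipeline that uses a custom `tokenize` function can only be loaded where that function is importable under the same name, and pickle files should only be loaded from trusted sources.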