# Find Best Params using GridSearchCV
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# pin pandas to version 1.0.3
!pip3 install pandas==1.0.3



In [2]:
import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
import re

In [4]:
# import libraries
import numpy as np
import pandas as pd
import sklearn



In [5]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# raw string so the backslash escapes reach the regex engine unchanged
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'





In [6]:
import sqlite3
from sqlalchemy import create_engine

In [7]:
# load data from database
engine = create_engine('sqlite:///Disaster_response.db')
df = pd.read_sql_table('mescat', engine)
X = df['message'] 
Y = df.drop(['id', 'message', 'original', 'genre', 'child_alone'], axis=1) 

### 2. Write a tokenization function to process your text data

In [8]:
def tokenize(text):
    # find all URLs in the text using the regex
    detected_urls = re.findall(url_regex, text)

    # replace each URL with a placeholder
    for url in detected_urls:
        text = text.replace(url, 'urlplaceholder')

    # split the text into word tokens
    tokens = word_tokenize(text)

    # initialize the lemmatizer
    lemmatizer = WordNetLemmatizer()

    # lemmatize, lowercase, and strip surrounding whitespace from each token
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
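
The URL-replacement step of `tokenize` can be tried on its own, without the NLTK downloads. The sketch below is only an illustration: `replace_urls` is a hypothetical helper name and the sample message is made up.

```python
import re

# Same pattern as in the notebook, written as a raw string.
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder='urlplaceholder'):
    """Replace every URL found by url_regex with a fixed placeholder token.

    Hypothetical helper: it isolates the first step of tokenize().
    """
    for url in re.findall(url_regex, text):
        text = text.replace(url, placeholder)
    return text


# Made-up sample message, for illustration only.
sample = 'Water needed, see http://example.com/map for details'
cleaned = replace_urls(sample)
print(cleaned)
```

Replacing URLs with one shared token keeps the vocabulary small, since otherwise every distinct link would become its own feature.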

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [9]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

In [10]:
pipeline = Pipeline([
    ('vec', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('multi_out_clf', MultiOutputClassifier(LogisticRegression())),
])

### 4. Split into train and test sets
- Split data into train and test sets
- Train pipeline

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

### 5. Improve model
Use grid search to find better parameters. 
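
Grid-search keys are built from the pipeline step names joined with double underscores, and a wrapped estimator adds one more level (`estimator`). One way to discover the valid names is to list them with `get_params()`. The sketch below rebuilds a pipeline with the same step names but, as a simplifying assumption, leaves out the custom tokenizer so it runs without NLTK; `demo_pipeline` is a hypothetical name.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Same step names as the notebook's pipeline; the custom tokenizer is
# omitted here (assumption) so the example has no NLTK dependency.
demo_pipeline = Pipeline([
    ('vec', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('multi_out_clf', MultiOutputClassifier(LogisticRegression())),
])

# Every key returned by get_params() is a valid key for a parameter grid.
param_names = sorted(demo_pipeline.get_params().keys())

# Show only the parameters of the wrapped LogisticRegression.
for name in param_names:
    if name.startswith('multi_out_clf__estimator__'):
        print(name)
```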

In [24]:
# 5 hyperparameters; every added value multiplies the number of fits
parameters = {
    'vec__ngram_range': ((1, 1), (1, 2)),
    'vec__max_df': (0.75, 1.0),
    'vec__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    # 'l1' needs a solver that supports it (e.g. liblinear or saga)
    'multi_out_clf__estimator__penalty': ['l1', 'l2']
}
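
To get a feel for the cost of the search before running it, the number of candidate settings in the grid above can be counted with `ParameterGrid`. This is a rough sketch: the fold count is an assumption, since the default of `GridSearchCV` depends on the installed scikit-learn version.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above.
parameters = {
    'vec__ngram_range': ((1, 1), (1, 2)),
    'vec__max_df': (0.75, 1.0),
    'vec__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'multi_out_clf__estimator__penalty': ['l1', 'l2'],
}

# One candidate per combination of values: 2 * 2 * 3 * 2 * 2.
n_candidates = len(ParameterGrid(parameters))

# Assumption: 5 folds, the GridSearchCV default since scikit-learn 0.22
# (older releases default to 3 folds).
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Each fit trains one classifier per target column, so trimming the grid or passing a smaller `cv` value to `GridSearchCV` are the most direct ways to shorten the search.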

In [25]:
cv = GridSearchCV(pipeline, param_grid=parameters)

In [28]:
# Run the grid search: one fit per parameter combination per fold (slow)
cv.fit(X_train, Y_train.to_numpy());

In [29]:
cv.best_params_

{'multi_out_clf__estimator__penalty': 'l1',
 'tfidf__use_idf': True,
 'vec__max_df': 1.0,
 'vec__max_features': 5000,
 'vec__ngram_range': (1, 1)}
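
Once the search has finished, `cv.predict` uses the refitted best estimator, so the held-out split can be scored category by category. The sketch below shows that pattern on a tiny made-up dataset. All `toy_*` names and values are hypothetical stand-ins, and in the notebook `cv`, `X_test`, and `Y_test` would take their place.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages and labels, for illustration only.
toy_messages = [
    'we need clean water',
    'no drinking water left',
    'doctor needed for injured people',
    'medical supplies are missing',
    'water pipes are broken',
    'people are injured and need medicine',
]
toy_categories = ['water', 'medical']
toy_labels = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
    [1, 0],
    [0, 1],
])

toy_model = Pipeline([
    ('vec', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('multi_out_clf', MultiOutputClassifier(LogisticRegression())),
]).fit(toy_messages, toy_labels)

# Predictions have one column per category: shape (n_samples, n_categories).
# Scoring on the training messages keeps the sketch short; a real
# evaluation would use a held-out split.
toy_pred = toy_model.predict(toy_messages)

# One report per category column.
for i, name in enumerate(toy_categories):
    print(name)
    print(classification_report(toy_labels[:, i], toy_pred[:, i]))
```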
