# Grid Search

Grid search can be used to optimize the hyperparameters of a model. Create a dictionary of parameters to search, with the parameter names as keys and lists of candidate values as values. Then pass the model and the parameter grid to the grid search object.

When you call `fit` on this grid search object, it runs cross validation on every combination of these parameters to find the best combination for the model.

Example:

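    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # candidate values to try for each hyperparameter
    parameters = {
        'kernel': ['linear', 'rbf'],
        'C': [1, 10]
    }

    svc = SVC()
    clf = GridSearchCV(svc, parameters)
    clf.fit(X_train, y_train)  # X_train and y_train are assumed to exist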

<br>

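`parameters` tells scikit-learn to evaluate 2 x 2 = 4 combinations of `kernel` and `C` values. `GridSearchCV` also has a **cv** parameter, which sets the number of folds for cross validation. If cv=5, there will be 2 x 2 x 5 = 20 rounds of training during the search, plus one final refit of the best combination on the whole training set (the default `refit=True`).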
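To make the counting concrete, the sketch below enumerates the same grid with Python's `itertools.product`. It only illustrates which combinations a grid search would try; it is not how scikit-learn implements the search internally, and `n_folds` is an assumed setting for the **cv** parameter.

```python
from itertools import product

parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [1, 10],
}

# build every combination that takes one value per parameter
names = list(parameters)
combinations = [dict(zip(names, values))
                for values in product(*parameters.values())]

n_folds = 5  # assumed value of the cv parameter
n_search_fits = len(combinations) * n_folds

for combo in combinations:
    print(combo)
print("combinations:", len(combinations))
print("fits during the search:", n_search_fits)
```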

See also the [caution about data leakage](#leakage) below.

---

Let's incorporate grid search into your modeling process. To start, include an import statement for `GridSearchCV` below.

In [2]:
import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to /Users/jsuk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jsuk/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jsuk/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [8]:
import re
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV

In [4]:
# raw string so that backslash escapes such as \( are passed to the regex engine unchanged
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def tokenize(text):
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens


class StartingVerbExtractor(BaseEstimator, TransformerMixin):

    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            if not pos_tags:
                # skip sentences that yield no tokens to avoid an IndexError
                continue
            first_word, first_tag = pos_tags[0]
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True
        return False

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

### View parameters in pipeline
Before modifying your `build_model` function to include grid search, view the parameters in your pipeline here.

In [5]:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('nlp_transformer', Pipeline([
            ('count_vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),
        ('starting_verb', StartingVerbExtractor())
    ])),
    ('clf', RandomForestClassifier())
])

In [7]:
#pipeline.get_params()
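As a rough illustration of what `get_params()` returns, the sketch below builds a small stand-in pipeline and prints only the parameter names that end in `ngram_range` or `n_estimators`. The step names `features`, `text`, `vect`, `tfidf`, and `clf` are hypothetical, chosen just for this example. Nested parameters are addressed by joining step names with a double underscore, which is the naming needed for grid search keys.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

# stand-in pipeline with hypothetical step names
demo_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
        ])),
    ])),
    ('clf', RandomForestClassifier()),
])

# get_params() returns a dict keyed by the full parameter path
param_names = sorted(demo_pipeline.get_params().keys())
for name in param_names:
    if name.endswith('ngram_range') or name.endswith('n_estimators'):
        print(name)
```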

### Modify your `build_model` function to return a GridSearchCV object.
Try to grid search some parameters in your data transformation steps as well as those for your classifier! Browse the parameters you can search above.

In [13]:
def build_model():
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('text_pipeline', Pipeline([
                ('countVect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),
            ('starting_verb', StartingVerbExtractor())
        ])),
        ('clf', RandomForestClassifier())
    ])

    # set a dictionary of parameters; the keys (including those inside
    # transformer_weights) must match the step names used above
    parameters = {
#         'features__text_pipeline__countVect__ngram_range': ((1, 1), (1, 2)),
#         'features__text_pipeline__countVect__max_df': (0.75, 0.90, 1.0),
#         'features__text_pipeline__countVect__max_features': (None, 5000, 10000),
#         'features__text_pipeline__tfidf__use_idf': (True, False),
        'clf__n_estimators': (50, 100, 200),
        'features__transformer_weights': (
            {'text_pipeline': 1, 'starting_verb': 0.5},
            {'text_pipeline': 0.5, 'starting_verb': 1},
            {'text_pipeline': 0.8, 'starting_verb': 1},
        )
    }

    # create grid search object
    cv = GridSearchCV(pipeline, param_grid=parameters)  # 5 folds by default

    return cv
     

### Run program to test
Running grid search can take a while, especially if you are searching over a lot of parameters! If you want to reduce it to a few minutes, try commenting out some of your parameters to grid search over just 1 or 2 parameters with a small number of values each. Once you know that works, feel free to add more parameters and see how well your final model can perform! 
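Before launching a long search, it can help to estimate how many model fits a grid implies. The sketch below uses scikit-learn's `ParameterGrid` to count the combinations in a small grid and in a larger one. Both grids, the step names in their keys, and the `n_folds` value are illustrative assumptions rather than recommended settings.

```python
from sklearn.model_selection import ParameterGrid

small_grid = {
    'clf__n_estimators': [50, 100, 200],
}

# hypothetical step names; use the names from your own pipeline
larger_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 1), (1, 2)],
    'features__text_pipeline__vect__max_df': [0.75, 0.90, 1.0],
    'features__text_pipeline__tfidf__use_idf': [True, False],
    'clf__n_estimators': [50, 100, 200],
}

n_folds = 5  # assumed number of cross validation folds

for label, grid in [('small', small_grid), ('larger', larger_grid)]:
    # len() of a ParameterGrid is the number of parameter combinations
    n_combinations = len(ParameterGrid(grid))
    print(label, 'grid:', n_combinations, 'combinations,',
          n_combinations * n_folds, 'fits during the search')
```

Each extra parameter multiplies the number of fits, which is why searching over just one or two parameters first keeps the run short.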

In [14]:
def load_data():
    df = pd.read_csv('corporate_messaging.csv', encoding='latin-1')
    df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
    X = df.text.values
    y = df.category.values
    return X, y


def display_results(cv, y_test, y_pred):
    labels = np.unique(y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
    print("Confusion Matrix:\n", confusion_mat)
    print("Accuracy:", accuracy)
    print("\nBest Parameters:", cv.best_params_)


def main():
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    model = build_model()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    display_results(model, y_test, y_pred)


main()

Labels: ['Action' 'Dialogue' 'Information']
Confusion Matrix:
 [[ 79   0  21]
 [  3  34   7]
 [  4   0 453]]
Accuracy: 0.9417637271214643

Best Parameters: {'clf__n_estimators': 50, 'features__transformer_weights': {'text_pipeline': 1, 'starting_verb': 0.5}}


---
## Advantages of using pipeline
The text below is taken directly from the course notes. In the example above, without a pipeline I could easily have made mistakes such as:
- calling `fit_transform()` on the test set instead of `transform()`
- accidentally skipping the vector transformation for the test set

With a pipeline, you only need to fit the estimator on the training set and then predict or transform on the test set, which is simple and removes that mental burden.
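The difference can be sketched on a toy example. The snippet below trains the same model twice, once with manual transformation steps and once with a `Pipeline`. The tiny corpus, its labels, and the choice of `MultinomialNB` as the classifier are assumptions made only to keep the example small and deterministic.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# made-up toy data
train_texts = [
    "please call us today",
    "join our webinar now",
    "sign up for the newsletter",
    "we opened a new office",
    "our revenue grew last year",
    "the company was founded in 1990",
]
train_labels = ["Action", "Action", "Action",
                "Information", "Information", "Information"]
test_texts = ["call now", "the office opened last year"]

# manual version: fit_transform on the training set, transform only on the test set
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
X_train_tfidf = tfidf.fit_transform(vect.fit_transform(train_texts))
clf.fit(X_train_tfidf, train_labels)
X_test_tfidf = tfidf.transform(vect.transform(test_texts))
manual_pred = clf.predict(X_test_tfidf)

# pipeline version: the same steps, chained behind one fit and one predict
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

print("same predictions:", list(manual_pred) == list(pipe_pred))
```

The manual version has four places where `fit_transform` and `transform` could be mixed up; the pipeline version has none.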

### 1. Simplicity and Convenience

> **Automates repetitive steps** - Chaining all of your steps into one estimator allows you to fit and predict on all steps of your sequence automatically with one call. It handles smaller steps for you, so you can focus on implementing higher level changes swiftly and efficiently.

> **Easily understandable workflow** - Not only does this make your code more concise, it also makes your workflow much easier to understand and modify. Without Pipeline, your model can easily turn into messy spaghetti code from all the adjustments and experimentation required to improve your model.

> **Reduces mental workload** - Because Pipeline automates the intermediate actions required to execute each step, it reduces the mental burden of having to keep track of all your data transformations. Using Pipeline may require some extra work at the beginning of your modeling process, but it prevents a lot of headaches later on.

### 2. Optimizing Entire Workflow

> GRID SEARCH: Method that automates the process of testing different hyperparameters to optimize a model.
- By running grid search on your pipeline, you're able to optimize your entire workflow, including data transformation and modeling steps. This accounts for any interactions among the steps that may affect the final metrics.
- Without grid search, tuning these parameters can be painfully slow, incomplete, and messy.


### 3. Preventing Data Leakage <a id="leakage"></a>
- Using Pipeline, all transformations for data preparation and feature extractions occur within each fold of the cross validation process.

- This prevents common mistakes where you’d allow your training process to be influenced by your test data - for example, if you used the entire training dataset to normalize or extract features from your data.

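Example:

    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # leaky: the scaler is fitted on the whole training set before cross validation
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(X_train)

    parameters = {
        'kernel': ['linear', 'rbf'],
        'C': [1, 10]
    }

    svc = SVC()
    clf = GridSearchCV(svc, parameters)
    clf.fit(scaled_data, y_train)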


This may seem okay at first, but if you standardize your whole training dataset, and then use cross validation in grid search to evaluate your model, you've got data leakage. 

Grid search uses cross validation to score your model, meaning it splits your training data into folds of train and validation sets, trains your model on the train set, and scores it on the validation set, and does this multiple times.

However, in each fold the model already has knowledge of the validation set, because all the data was rescaled based on the distribution of the whole training dataset. Important statistics like the mean and standard deviation are influenced by the whole dataset. This means the model appears to perform better than it really would on unseen data, since information about the validation set is always baked into the rescaled values of your train dataset.
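A small numeric sketch of this effect, using made-up values and only the standard library: when the scaling statistics are computed on the whole training set, the held-out value shifts them, so the training rows are rescaled with information from the validation fold.

```python
from statistics import mean, pstdev

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up feature values
train_fold, validation_fold = values[:4], values[4:]

# leaky: statistics computed on all rows, including the validation fold
leaky_mean, leaky_std = mean(values), pstdev(values)

# correct: statistics computed on the training portion of the fold only
fold_mean, fold_std = mean(train_fold), pstdev(train_fold)


def scale(x, center, spread):
    """Standardize one value with the given mean and standard deviation."""
    return (x - center) / spread


print("leaky mean/std:", leaky_mean, leaky_std)
print("fold mean/std: ", fold_mean, fold_std)
print("first train row, leaky scaling:  ", scale(train_fold[0], leaky_mean, leaky_std))
print("first train row, in-fold scaling:", scale(train_fold[0], fold_mean, fold_std))
```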

The way to fix this is to fit the standard scaler only on the training portion of each cross validation fold, and then merely apply that fitted scaler to the validation portion. Pipelines allow you to do just this.

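Example:

    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', SVC())
    ])

    parameters = {
        'scaler__with_mean': [True, False],
        'clf__kernel': ['linear', 'rbf'],
        'clf__C': [1, 10]
    }

    cv = GridSearchCV(pipeline, param_grid=parameters)

    cv.fit(X_train, y_train)
    y_pred = cv.predict(X_test)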

Now, since the rescaling is part of the pipeline, the standardization doesn't happen until we run grid search. In each fold of cross validation, the scaler is fitted only on the data that the model is trained on, which prevents leakage from the validation set. As you can see, pipelines are very valuable for reducing the risk of data leakage during the data preparation process.
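Once the search has been fitted, the results can be inspected on the grid search object. The sketch below runs the same kind of pipeline on synthetic data from `make_classification` (an assumption, since the data behind `X_train` is not shown here) and reads `best_params_`, `best_score_`, and `cv_results_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC()),
])

parameters = {
    'scaler__with_mean': [True, False],
    'clf__kernel': ['linear', 'rbf'],
    'clf__C': [1, 10],
}

cv = GridSearchCV(pipeline, param_grid=parameters, cv=5)
cv.fit(X_train, y_train)

print("best parameters:", cv.best_params_)
print("mean cross validation score of the best combination:", cv.best_score_)

# one entry per parameter combination that was tried
for params, score in zip(cv.cv_results_['params'],
                         cv.cv_results_['mean_test_score']):
    print(round(score, 3), params)

# the refitted best pipeline is used for scoring on held-out data
print("held-out test accuracy:", cv.score(X_test, y_test))
```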