## 5. Model Optimization

In the previous notebook we decided to optimize the following models. 

 - Rank=1, Name=**SVC-linear-0.4**, Score=**0.9877151685937721 (+/- 0.003806562050586554)**
 - Rank=4, Name=**PassiveAggressiveClassifier**, Score=**0.9873060780235428 (+/- 0.005755384960731155)**
 - Rank=11, Name=**SGDClassifier**, Score=**0.986486033144723 (+/- 0.005575110872674419)**
 - Rank=14, Name=**KNeighborsClassifier-1**, Score=**0.9833439566419908 (+/- 0.007040876080775306)**
 
To optimize each model, we will use GridSearchCV, which automatically searches for the best parameters for a given model according to a target metric. As before, that metric is accuracy.

The CV part means that every parameter combination is scored with cross-validation, which gives a more reliable estimate of the model's performance than a single train/test split. As in the Model Evaluation phase, we cross-validate using 10 folds.

As we might expect, this process will take some time, because we are training many models: each parameter combination is trained ten times, once per fold of the cross-validation process.
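
To get a feel for the cost, the number of fits can be estimated ahead of time: it is the product of the number of candidate values per parameter, times the number of folds. The sketch below does this count for the PassiveAggressiveClassifier grid defined later in this notebook. `count_fits` is a hypothetical helper, not part of scikit-learn, and the count ignores the extra refit that GridSearchCV performs on the best combination by default.

```python
from math import prod

def count_fits(param_grid, folds):
    """Return (n_combinations, n_fits) for an exhaustive grid search."""
    n_combinations = prod(len(values) for values in param_grid.values())
    return n_combinations, n_combinations * folds

# Same grid as the PassiveAggressiveClassifier entry in define_models().
pa_grid = {
    'fit_intercept': (True, False),
    'C': (1, 0.1, 0.01, 1.5, 2),
    'loss': ('hinge', 'squared_hinge'),
}

n_combinations, n_fits = count_fits(pa_grid, folds=10)
print(f'{n_combinations} combinations x 10 folds = {n_fits} fits')
```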

Let's get started defining the helper functions we'll use to optimize.
 

### Data

This function will give us the relevant data needed to perform the optimization of the models.

In [1]:
import numpy as np

def load_dataset():
    X = np.load('global_max_features.npy')
    y = np.load('labels.npy')
    
    return X, y

### Models

This function returns a dict of dicts. Each inner dict holds the model to be optimized and the grid of parameters to search.

In [2]:
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

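def define_models(models=None):
    # Build a fresh dict on each call; a mutable default argument would be shared across calls.
    if models is None:
        models = {}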
    models['SGDClassifier'] = {
        'model': SGDClassifier(),
        'parameters': {
            'loss': ('hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'),
            'penalty': ('none','l2', 'l1'),
            'alpha': (1e-4, 1e-3, 5e-3, 1e-2),
            'fit_intercept': (True, False)
        }
    }
    
    models['KNeighborsClassifier-1'] = {
        'model': KNeighborsClassifier(n_neighbors=1),
        'parameters': {
            'weights': ('uniform', 'distance'),
            'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute')
        }
    }
    
    models['SVC-linear-0.4'] = {
        'model': SVC(kernel='linear', C=0.4),
        'parameters': {
            'probability': (True, False),
            'decision_function_shape': ('ovo', 'ovr'),
            'shrinking': (True, False)
        }
    }
    
    models['PassiveAggressiveClassifier'] = {
        'model': PassiveAggressiveClassifier(),
        'parameters': {
            'fit_intercept': (True, False),
            'C': (1, 0.1, 0.01, 1.5, 2),
            'loss': ('hinge', 'squared_hinge')
        }
    }
    
    print(f'Defined {len(models)} models.')

    return models

### Pipeline

As in the last notebook, this pipeline is used to pre-process the features that'll be fed to the model.

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

def make_pipeline(model):
    steps = [
        ('StandardScaler', StandardScaler()),
        ('MinMaxScaler', MinMaxScaler()),
        ('model', model)
    ]
    
    pipeline = Pipeline(steps=steps)
    
    return pipeline
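
A side note on the order of the two scalers: min-max scaling maps each feature to [0, 1] using only its minimum and maximum, so a per-feature standardization applied beforehand should not change the final values. The NumPy sketch below illustrates this on synthetic data (not the notebook's features). `standardize` and `min_max` are hand-written stand-ins for the default behaviour of `StandardScaler` and `MinMaxScaler`, assumed here for illustration.

```python
import numpy as np

def standardize(X):
    # Zero mean, unit variance per column (population standard deviation).
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    # Rescale each column to the [0, 1] range.
    X_min = X.min(axis=0)
    return (X - X_min) / (X.max(axis=0) - X_min)

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # synthetic features

both = min_max(standardize(X_demo))
only_min_max = min_max(X_demo)

# True when both routes give the same scaled features (up to rounding).
same = np.allclose(both, only_min_max)
print(same)
```

If the same holds for the real features, the `StandardScaler` step could likely be dropped without affecting the models.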

### Grid Search

Here we are defining a couple of auxiliary functions that will perform the actual grid search for each model. At the end of the process, the parameters that yield the best estimator will be printed, along with the best score of this estimator.

In [4]:
from sklearn.model_selection import GridSearchCV

def grid_search_model(X, y, model, folds, metric):
    m = model['model'] 
    parameters = model['parameters']
    
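    classifier = GridSearchCV(m, parameters, cv=folds, scoring=metric, n_jobs=-1)
    # The grid search is the last pipeline step, so the scalers are fitted once
    # on the full dataset and the search then runs on the scaled features.
    pipeline = make_pipeline(classifier)
    pipeline.fit(X, y)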
    
    return pipeline.steps[-1][1]
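
In the function above the scalers are fitted once on the whole dataset before the cross-validation starts, so each validation fold has already contributed to the scaling statistics. A common alternative is to pass the whole pipeline to GridSearchCV, so that the scalers are re-fitted on the training part of every fold. The sketch below shows that arrangement on synthetic data from `make_classification`. The `model__` prefix on the parameter name follows the step name chosen in the pipeline, and the grid values are placeholders, not the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline(steps=[
    ('StandardScaler', StandardScaler()),
    ('MinMaxScaler', MinMaxScaler()),
    ('model', SVC(kernel='linear')),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = {'model__C': (0.1, 0.4, 1.0)}

# The pipeline is the estimator, so scaling is re-fitted inside each fold.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'{search.best_score_ * 100:.2f}%')
```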

In [5]:
import warnings

def grid_search_models(X, y, models, folds=10, metric='accuracy'):
    with warnings.catch_warnings():
        warnings.filterwarnings('ignore')
        for model_name, model in models.items():
            m = grid_search_model(X, y, model, folds, metric)

            if m is not None:
                print(f'Best parameters for {model_name}: \n{m.best_params_}')
                print(f'Best model {metric}: {m.best_score_ * 100}%')
            else:
                print(f'{model_name}: error')
            
            print('----\n')

In [6]:
X, y = load_dataset()
models = define_models()
grid_search_models(X, y, models)

Defined 4 models.
Best parameters for SGDClassifier: 
{'alpha': 0.001, 'fit_intercept': False, 'loss': 'modified_huber', 'penalty': 'l2'}
Best model accuracy: 98.7849829351536%
----

Best parameters for KNeighborsClassifier-1: 
{'algorithm': 'auto', 'weights': 'uniform'}
Best model accuracy: 98.30716723549489%
----

Best parameters for SVC-linear-0.4: 
{'decision_function_shape': 'ovo', 'probability': True, 'shrinking': True}
Best model accuracy: 98.7849829351536%
----

Best parameters for PassiveAggressiveClassifier: 
{'C': 2, 'fit_intercept': False, 'loss': 'squared_hinge'}
Best model accuracy: 98.75767918088737%
----



## Conclusion

These are our results:

 - **SVC-linear-0.4** goes from 98.7715% to 98.7850%, an improvement of about 0.0135 percentage points.
 - **PassiveAggressiveClassifier** goes from 98.7306% to 98.7577%, an improvement of about 0.0271 percentage points.
 - **SGDClassifier** goes from 98.6486% to 98.7850%, an improvement of about 0.1364 percentage points.
 - **KNeighborsClassifier-1** goes from 98.3344% to 98.3072%, a worsening of about 0.0272 percentage points.
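
The differences above can be recomputed directly from the scores. The sketch below takes the previous-notebook scores and their +/- values from the list at the top of this notebook, and the grid-search scores from the output above. It also checks whether each change is smaller than the reported +/- value. The dictionary names are chosen here for illustration.

```python
# (score, +/- value) from the previous notebook, as fractions.
before = {
    'SVC-linear-0.4': (0.9877151685937721, 0.003806562050586554),
    'PassiveAggressiveClassifier': (0.9873060780235428, 0.005755384960731155),
    'SGDClassifier': (0.986486033144723, 0.005575110872674419),
    'KNeighborsClassifier-1': (0.9833439566419908, 0.007040876080775306),
}

# Best grid-search scores, as fractions.
after = {
    'SVC-linear-0.4': 0.987849829351536,
    'PassiveAggressiveClassifier': 0.9875767918088737,
    'SGDClassifier': 0.987849829351536,
    'KNeighborsClassifier-1': 0.9830716723549489,
}

deltas = {}
for name, (score, spread) in before.items():
    # Change in percentage points.
    delta = (after[name] - score) * 100
    deltas[name] = delta
    within = abs(delta) < spread * 100
    print(f'{name}: {delta:+.4f} pp (within +/- value: {within})')
```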
 
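We see that, apart from KNeighborsClassifier-1, every model shows at least a small improvement in accuracy. These gains are smaller than the +/- values reported for each model in the previous notebook (roughly 0.4 to 0.7 percentage points), so the grid search helped, but only marginally.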

Clearly, the model that benefited the most was SGDClassifier, with an improvement of about 0.1364 percentage points. It also achieves the best score, tied with SVC-linear-0.4.

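Sadly, KNeighborsClassifier-1 does worse after the grid search. With a single neighbor, `weights` cannot change the prediction and `algorithm` only changes how the nearest neighbor is found, so this grid had little room to help. Tweaking other parameters or adding more neighbors might yield better results.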

## Ideas

 - Try different models.
 - Try more parameters.
 - Try dimensionality reduction.
 - Use more feature engineering techniques.
 - Use stacking.
 
### A final note

We didn't use the flattened version of the data because it didn't fit in my computer's memory! :)