# In this exercise we will optimize parameters #

(Credits to https://github.com/codiply/blog-ipython-notebooks/blob/master/scikit-learn-estimator-selection-helper.ipynb )

To optimize the parameters we will use a grid search, and we will also compare classifiers.

<div class="alert alert-danger" role="alert">
  This example targets Python 2.x; it will not work in a 3.x virtualenv.
</div>


In [None]:
import sys
import IPython
import numpy as np
import pandas as pd
import sklearn as sk


# %-formatting prints a plain string whether print is a statement or a function.
print('Python version: %s.%s.%s' % sys.version_info[:3])
print('IPython version: %s' % IPython.__version__)
print('numpy version: %s' % np.__version__)
print('pandas version: %s' % pd.__version__)
print('scikit-learn version: %s' % sk.__version__)

This is a helper class for running a parameter grid search across different classification or regression models. The helper takes two dictionaries as its constructor parameters. The first dictionary contains the models to be scored, while the second contains the parameter grid for each model (see the examples below or the [GridSearchCV documentation](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) for the expected format). The `fit(X, y)` method runs a cross-validated parameter grid search for each model on the given training data. After calling `fit(X, y)`, the `score_summary()` method returns a data frame with one row per model and parameter combination, summarizing the cross-validation scores.

In [None]:
import numpy as np
import pandas as pd
# Legacy module: sklearn.grid_search was removed in scikit-learn 0.20;
# later releases provide GridSearchCV in sklearn.model_selection.
from sklearn.grid_search import GridSearchCV

class EstimatorSelectionHelper:
    def __init__(self, models, params):
        # Every model needs an entry in the parameter dictionary.
        if not set(models.keys()).issubset(set(params.keys())):
            missing_params = list(set(models.keys()) - set(params.keys()))
            raise ValueError("Some estimators are missing parameters: %s" % missing_params)
        self.models = models
        self.params = params
        self.keys = list(models.keys())
        self.grid_searches = {}

    def fit(self, X, y, cv=3, n_jobs=1, verbose=1, scoring=None, refit=False):
        # Run one cross-validated grid search per model and keep the fitted search.
        for key in self.keys:
            print("Running GridSearchCV for %s." % key)
            model = self.models[key]
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring, refit=refit)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            # One summary row per parameter combination.
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            merged = dict(params)
            merged.update(d)
            return pd.Series(merged)

        rows = [row(k, gsc.cv_validation_scores, gsc.parameters)
                for k in self.keys
                for gsc in self.grid_searches[k].grid_scores_]
        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        # Put the summary columns first, followed by the parameter columns.
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns]
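
On current installations (Python 3, scikit-learn 0.20 or later) the `sklearn.grid_search` module and the `grid_scores_` attribute are no longer available. The following is a minimal sketch of how an equivalent summary could be built there, assuming `sklearn.model_selection.GridSearchCV` and its `cv_results_` dictionary. The function name `summarize_grid_search` is hypothetical and is not part of the original helper.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def summarize_grid_search(name, gs):
    """Build one summary row per parameter combination from cv_results_."""
    results = gs.cv_results_
    # Per-fold test scores are stored as 'split0_test_score', 'split1_test_score', ...
    split_keys = [k for k in results
                  if k.startswith('split') and k.endswith('_test_score')]
    fold_scores = np.array([results[k] for k in split_keys])  # (folds, candidates)
    rows = []
    for i, params in enumerate(results['params']):
        scores = fold_scores[:, i]
        row = dict(params)
        row.update({
            'estimator': name,
            'min_score': scores.min(),
            'max_score': scores.max(),
            'mean_score': scores.mean(),
            'std_score': scores.std(),
        })
        rows.append(row)
    return pd.DataFrame(rows)


# Small illustrative run: one model, two candidates, three folds.
X, y = load_iris(return_X_y=True)
gs = GridSearchCV(SVC(), {'kernel': ['linear'], 'C': [1, 10]}, cv=3, refit=False)
gs.fit(X, y)
summary = summarize_grid_search('SVC', gs).sort_values('mean_score', ascending=False)
print(summary[['estimator', 'min_score', 'mean_score', 'max_score', 'std_score',
               'C', 'kernel']])
```

The column names match those produced by `score_summary()`, so the rest of the workflow would stay the same under these assumptions.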

Classification example
----

I load the data.

In [None]:
from sklearn import datasets

iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

print("The data are:")
print(iris.data[0:5])


We define two dictionaries.

- A dictionary of models.
- A dictionary of parameter sets (the grid search) to try with each model.



In [None]:
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier, 
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.svm import SVC
models1 = {
    'ExtraTreesClassifier': ExtraTreesClassifier(),
    'RandomForestClassifier': RandomForestClassifier(),
    'AdaBoostClassifier': AdaBoostClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'SVC': SVC()
}

params1 = {
    'ExtraTreesClassifier': { 'n_estimators': [16, 32] },
    'RandomForestClassifier': { 'n_estimators': [16, 32] },
    'AdaBoostClassifier':  { 'n_estimators': [16, 32] },
    'GradientBoostingClassifier': { 'n_estimators': [16, 32], 'learning_rate': [0.8, 1.0] },
    'SVC': [
        {'kernel': ['linear'], 'C': [1, 10]},
        {'kernel': ['rbf'], 'C': [1, 10], 'gamma': [0.001, 0.0001]},
    ]
}
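
Each entry of `params1` describes a grid: a single dictionary expands to the Cartesian product of its value lists, while a list of dictionaries (as for `SVC`) expands each dictionary separately and concatenates the results. The sketch below mimics that expansion in plain Python to count the candidates. `expand_grid` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
from itertools import product


def expand_grid(grid):
    """Return every parameter combination described by a dict or a list of dicts."""
    grids = [grid] if isinstance(grid, dict) else grid
    combos = []
    for g in grids:
        names = sorted(g)
        # Cartesian product of the value lists, one value per parameter name.
        for values in product(*(g[name] for name in names)):
            combos.append(dict(zip(names, values)))
    return combos


svc_grid = [
    {'kernel': ['linear'], 'C': [1, 10]},
    {'kernel': ['rbf'], 'C': [1, 10], 'gamma': [0.001, 0.0001]},
]
gb_grid = {'n_estimators': [16, 32], 'learning_rate': [0.8, 1.0]}

print(len(expand_grid(svc_grid)))  # 2 linear + 4 rbf candidates = 6
print(len(expand_grid(gb_grid)))   # 2 x 2 = 4
```

With `cv=3`, each candidate is fitted three times, so the `SVC` grid alone costs 6 x 3 = 18 fits.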

I create the helper and fit the data.

In [None]:
helper1 = EstimatorSelectionHelper(models1, params1)
# iris has three classes, so request an explicitly averaged F1 score.
helper1.fit(X_iris, y_iris, scoring='f1_weighted', n_jobs=2)
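
The `scoring` argument names the metric used to rank the candidates. Because iris has three classes, an F1 score has to be averaged over the classes, and recent scikit-learn releases expect the averaging to be named explicitly (for example `f1_macro` or `f1_weighted`). As a rough illustration of the difference, the sketch below computes both averages by hand on made-up labels. The helper `f1_per_class` and the toy data are assumptions for illustration only.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: (f1, support)} for every label present in y_true."""
    out = {}
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2.0 * tp / denom if denom else 0.0
        out[label] = (f1, tp + fn)  # support = number of true samples of this class
    return out


# Made-up labels for three classes (not taken from the iris data).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

scores = f1_per_class(y_true, y_pred)
# Macro: every class counts equally. Weighted: classes count by their support.
macro = sum(f1 for f1, _ in scores.values()) / len(scores)
weighted = sum(f1 * n for f1, n in scores.values()) / len(y_true)
print("macro F1: %.3f" % macro)
print("weighted F1: %.3f" % weighted)
```

The iris classes are balanced (50 samples each), so on that data the two averages differ only through how the cross-validation folds happen to split the classes.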

Finally, I print the summary.

In [None]:
helper1.score_summary(sort_by='min_score')

Regression example
----

I load the data.

In [None]:
diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target

I define the models and the grid search parameters.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso

models2 = { 
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso()
}

params2 = { 
    'LinearRegression': { },
    'Ridge': { 'alpha': [0.1, 1.0] },
    'Lasso': { 'alpha': [0.1, 1.0] }
}

I create the helper and fit the data.

In [None]:
helper2 = EstimatorSelectionHelper(models2, params2)
helper2.fit(X_diabetes, y_diabetes, n_jobs=-1)
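
No `scoring` is passed here, so each grid search falls back to the estimator's own `score` method, which for scikit-learn regressors is the coefficient of determination R^2. A minimal sketch of that quantity on made-up numbers follows; the function name `r2` and the data are illustrative assumptions.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
print(r2(y_true, y_true))                # perfect predictions give 1.0
print(r2(y_true, [6.0, 6.0, 6.0, 6.0]))  # always predicting the mean gives 0.0
print(r2(y_true, [9.0, 7.0, 5.0, 3.0]))  # worse than the mean gives a negative value
```

The scores in the regression summary are therefore not errors: higher is better, and a value can be negative.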

Finally, I print the summary.

In [None]:
helper2.score_summary()