# Practical Activity of 20.09.2022
### Solution

Use the Iris dataset to train different classification models (of your choice), fine-tune each of them, and then pass the `best_estimator_` of each one to `VotingClassifier()` to obtain the best model for the dataset.


In [70]:
# Basic libraries
import numpy as np
import pandas as pd

# Useful sklearn utilities
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Models to be tested
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

# Methods for evaluating the models
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

# GridSearch
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV

# Data preparation

Let's start by loading the Iris dataset and fitting a `StandardScaler()` on the selected features.

In [71]:
iris = datasets.load_iris()

In [72]:
X = iris["data"][:, (2, 3)]  # columns 2 and 3: petal length and petal width
X[:5]

array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])

In [73]:
y = (iris["target"] == 2).astype(np.float64)  # 1.0 for Iris virginica, 0.0 otherwise
y[:5]

array([0., 0., 0., 0., 0.])

In [74]:
scaler = StandardScaler()
scaler.fit(X)

StandardScaler()
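
Note that the scaler above is only fitted: `X` is never transformed, so the models below are trained on the raw features. One way to actually apply the scaling, sketched here under the assumption that a scikit-learn pipeline fits this exercise, is to chain the scaler with a model so that it is re-fitted on each training fold. The names `scaled_logist` and `scaled_scores` are illustrative and are not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                    # petal length and petal width
y = (iris["target"] == 2).astype(np.float64)   # 1.0 for Iris virginica, else 0.0

# Hypothetical pipeline: the scaler is fitted inside each CV fold, so it
# never sees the validation part of the split.
scaled_logist = make_pipeline(StandardScaler(), LogisticRegression())
scaled_scores = cross_val_score(scaled_logist, X, y, scoring="accuracy", cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scaled_scores.mean(), scaled_scores.std()))
```

Tree-based models are insensitive to feature scaling, so in this notebook the scaling would mainly matter for the logistic regression.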

# Fine-tuning the models

Now we will train and **fine-tune** three different models, which will then be passed to `VotingClassifier()` so we can determine which model makes the best predictions.

Models to be used:

    - Random Forest
    - Logistic Regression
    - Decision Tree

In [75]:
# Random Forest
x = list(range(1, 30))  # candidate max_depth values: 1 to 29

rf_param_grid = {'max_depth': x,
                 'min_samples_split': [2, 5, 10]}

base_estimator_rf = RandomForestClassifier(random_state=42)

rf_sh = HalvingGridSearchCV(base_estimator_rf, 
                            rf_param_grid, 
                            cv=5, factor=2, 
                            resource='n_estimators',
                            max_resources=30, 
                            verbose=True, n_jobs=-1).fit(X, y)

n_iterations: 5
n_required_iterations: 7
n_possible_iterations: 5
min_resources_: 1
max_resources_: 30
aggressive_elimination: False
factor: 2
----------
iter: 0
n_candidates: 87
n_resources: 1
Fitting 5 folds for each of 87 candidates, totalling 435 fits
----------
iter: 1
n_candidates: 44
n_resources: 2
Fitting 5 folds for each of 44 candidates, totalling 220 fits
----------
iter: 2
n_candidates: 22
n_resources: 4
Fitting 5 folds for each of 22 candidates, totalling 110 fits
----------
iter: 3
n_candidates: 11
n_resources: 8
Fitting 5 folds for each of 11 candidates, totalling 55 fits
----------
iter: 4
n_candidates: 6
n_resources: 16
Fitting 5 folds for each of 6 candidates, totalling 30 fits
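
The log above can be reproduced by hand. The following is a simplified re-derivation of the successive-halving schedule in plain Python, not scikit-learn's own code: `halving_schedule` and `floor_log` are hypothetical helpers, and `min_resources` is taken from the log instead of being computed. It assumes `aggressive_elimination=False`, as reported in the output.

```python
import math


def floor_log(n, base):
    """Integer floor(log_base(n)) for n >= 1, avoiding floating-point logs."""
    count = 0
    while n >= base:
        n //= base
        count += 1
    return count


def halving_schedule(n_candidates, min_resources, max_resources, factor):
    """Return a list of (n_candidates, n_resources), one pair per iteration."""
    n_required = 1 + floor_log(n_candidates, factor)
    n_possible = 1 + floor_log(max_resources // min_resources, factor)
    schedule = []
    for i in range(min(n_required, n_possible)):
        n_resources = min(min_resources * factor ** i, max_resources)
        schedule.append((n_candidates, n_resources))
        # Only the best 1/factor of the candidates survive to the next round.
        n_candidates = math.ceil(n_candidates / factor)
    return schedule


# Random forest search: 29 max_depth values x 3 min_samples_split values.
rf_schedule = halving_schedule(n_candidates=29 * 3, min_resources=1,
                               max_resources=30, factor=2)
print(rf_schedule)  # [(87, 1), (44, 2), (22, 4), (11, 8), (6, 16)]
```

Seven iterations would be needed to narrow 87 candidates down to one, but with `max_resources=30` only five are possible. The search therefore stops with six candidates at 16 trees, which is why the selected estimator has `n_estimators=16`.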


In [76]:
# Save the result of the fine-tuning
rf_model = rf_sh.best_estimator_
rf_model

RandomForestClassifier(max_depth=23, min_samples_split=10, n_estimators=16,
                       random_state=42)

In [77]:
# Decision Tree

tree_param_grid = {'max_depth': [2, 4, 6, 8, 10, 12]}

base_estimator_tree = DecisionTreeClassifier()

tree_sh = HalvingGridSearchCV(base_estimator_tree, tree_param_grid, verbose=True, n_jobs=-1).fit(X, y)

n_iterations: 2
n_required_iterations: 2
n_possible_iterations: 2
min_resources_: 50
max_resources_: 150
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 6
n_resources: 50
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 1
n_candidates: 2
n_resources: 150
Fitting 5 folds for each of 2 candidates, totalling 10 fits


In [78]:
# Save the result
tree_model = tree_sh.best_estimator_
tree_model

DecisionTreeClassifier(max_depth=10)

In [91]:
# Logistic Regression
logist_param_grid = {
    'C': np.logspace(-4, 4, 20),
    # 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
base_estimator_logist = LogisticRegression()

logist_sh = HalvingGridSearchCV(base_estimator_logist,
                                param_grid=logist_param_grid,
                                cv=5,
                                verbose=True, n_jobs=-1).fit(X, y)


n_iterations: 2
n_required_iterations: 3
n_possible_iterations: 2
min_resources_: 20
max_resources_: 150
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 20
n_resources: 20
Fitting 5 folds for each of 20 candidates, totalling 100 fits
----------
iter: 1
n_candidates: 7
n_resources: 60
Fitting 5 folds for each of 7 candidates, totalling 35 fits


In [92]:
logist_model = logist_sh.best_estimator_
logist_model

LogisticRegression(C=0.615848211066026)
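
To see where the selected `C` sits in the search space, the grid can be inspected directly. This is a small sketch; `C_grid`, `best_C` and `best_index` are illustrative names, and `best_C` is simply copied from the output above.

```python
import numpy as np

C_grid = np.logspace(-4, 4, 20)   # 20 values from 1e-4 to 1e4, evenly spaced in log10
best_C = 0.615848211066026        # value shown in the best estimator above

# Position of the selected value inside the grid (0-based).
best_index = int(np.argmin(np.abs(C_grid - best_C)))
print(best_index, C_grid[best_index])

# Neighbouring grid points, to show how coarse the grid is around the optimum.
print(C_grid[best_index - 1], C_grid[best_index + 1])
```

The selected value is the tenth point of the grid. Its neighbours are roughly 0.23 and 1.62, so a finer grid between those two values might be a reasonable next step.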

Now we can take the three trained and tuned models above and pass them to `VotingClassifier()`.

In [93]:
rnf_clf = rf_model
lgr_clf = logist_model
det_clf = tree_model
voting_clf = VotingClassifier(
    estimators=[("rnf", rnf_clf), ("det", det_clf), ("lf", lgr_clf)],
    voting="hard"
).fit(X, y)
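
With `voting="hard"`, the ensemble predicts the class that receives the most votes from its members. The rule can be illustrated on made-up predictions; the array below is hypothetical data, not output from the models trained in this notebook, and `majority_vote` is an illustrative helper.

```python
import numpy as np

# Hypothetical predictions of three classifiers (rows) for four samples (columns).
predictions = np.array([
    [0, 1, 1, 0],   # e.g. random forest
    [0, 1, 0, 0],   # e.g. decision tree
    [1, 1, 1, 0],   # e.g. logistic regression
])


def majority_vote(preds):
    """Return, for each sample (column), the most frequent class label."""
    return np.array([np.bincount(column).argmax() for column in preds.T])


voted = majority_vote(predictions)
print(voted)  # [0 1 1 0]
```

With three voters and two classes there can be no ties, so one dissenting model is always overruled by the other two.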

In [94]:
for clf in (rnf_clf, det_clf, lgr_clf, voting_clf):
    scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), clf.__class__.__name__))

Accuracy: 0.96 (+/- 0.03) [RandomForestClassifier]
Accuracy: 0.93 (+/- 0.05) [DecisionTreeClassifier]
Accuracy: 0.97 (+/- 0.02) [LogisticRegression]
Accuracy: 0.97 (+/- 0.02) [VotingClassifier]
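
These scores come from cross-validation on the same 150 samples that were used for the hyperparameter search, so they may be somewhat optimistic. `accuracy_score` is imported at the top of the notebook but never used; a minimal sketch of how it could be applied to a held-out split follows. The split itself, the name `holdout_clf` and the `random_state` on the decision tree are assumptions added for illustration, while the other hyperparameters are copied from the estimators printed above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]
y = (iris["target"] == 2).astype(np.float64)

# Assumed 80/20 stratified split; the original notebook uses all samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

holdout_clf = VotingClassifier(
    estimators=[
        ("rnf", RandomForestClassifier(max_depth=23, min_samples_split=10,
                                       n_estimators=16, random_state=42)),
        ("det", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("lf", LogisticRegression(C=0.615848211066026)),
    ],
    voting="hard",
).fit(X_train, y_train)

# Accuracy on samples the ensemble has not seen during fitting.
holdout_accuracy = accuracy_score(y_test, holdout_clf.predict(X_test))
print("Hold-out accuracy: %0.2f" % holdout_accuracy)
```

For a stricter estimate, the hyperparameter search itself would also need to be run on the training part only.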
