### Artificial Neural Networks - IFES - PPCOMP
### Exercise 02
### Comparison of Shallow Neural Networks on Multiple Datasets
### Perceptron (Activity 1), scikit-learn Perceptron, MLP (1 Hidden Layer), Linear SVM, SGD (Hinge Loss)
### Datasets: Breast Cancer, Dummy datasets (*)

##### (*) Uses the PerformanceEvaluator implementation developed in the Pattern Recognition course

In [72]:
import time
import sklearn
import numpy as np

from sklearn.base import BaseEstimator,ClassifierMixin

from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_digits
from sklearn.datasets import make_classification

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.metrics import mean_squared_error

# Classifiers
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

In [73]:
print('scikit-learn version {}.'.format(sklearn.__version__))

scikit-learn version 0.21.2.


In [78]:
# Binary datasets
dX_AllDatasets={}
dy_AllDatasets={}

# Breast Cancer
data = load_breast_cancer()
X,y = data.data,data.target
dX_AllDatasets['breast_cancer']=X
dy_AllDatasets['breast_cancer']=y

# Dummy Dataset 1 - sklearn.datasets.make_classification
# One informative feature, one cluster per class
X, y = make_classification(n_samples=1000,n_features=2, n_redundant=0, n_informative=1,
                             n_clusters_per_class=1)
dX_AllDatasets['dummy_ds_1']=X
dy_AllDatasets['dummy_ds_1']=y

# Dummy Dataset 2 - sklearn.datasets.make_classification
# Two informative features, one cluster per class
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                             n_clusters_per_class=1)
dX_AllDatasets['dummy_ds_2']=X
dy_AllDatasets['dummy_ds_2']=y

# Dummy Dataset 3 - sklearn.datasets.make_classification
# Two informative features, two clusters per class (the default)
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2)
dX_AllDatasets['dummy_ds_3']=X
dy_AllDatasets['dummy_ds_3']=y

# Dummy Dataset 4 - sklearn.datasets.make_classification
# 10,000 samples with 10% label "noise" (flip_y randomly reassigns that fraction of labels)
X, y = make_classification(
    n_samples=10000, 
    n_features=25,
    flip_y=0.1) 
dX_AllDatasets['dummy_ds_4_10000_10_noise']=X
dy_AllDatasets['dummy_ds_4_10000_10_noise']=y

# Dummy Dataset 5 - sklearn.datasets.make_classification
# 10,000 samples - hard to separate
X, y = make_classification(
    n_samples=10000, 
    n_features=25,
    class_sep=0.1) # default class_sep=1.0; the smaller the value, the harder the classification
dX_AllDatasets['dummy_ds_5_10000_hard_sep']=X
dy_AllDatasets['dummy_ds_5_10000_hard_sep']=y


# Dummy Dataset 6 - sklearn.datasets.make_classification
# 5,000 samples - adjusted feature contributions
X, y = make_classification(n_samples=5000, 
    n_features=25, 
    n_redundant=10, # 10 of the 25 features are linear combinations of the informative ones
    n_repeated=5) # and 5 of the 25 are duplicates of other features
dX_AllDatasets['dummy_ds_6_5000_feat_contrib']=X
dy_AllDatasets['dummy_ds_6_5000_feat_contrib']=y
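
To make the effect of `n_redundant` and `n_repeated` in Dummy Dataset 6 concrete, the sketch below regenerates a dataset with the same parameters and inspects its columns. The fixed `random_state=0` and `shuffle=False` are assumptions added here for reproducibility and to keep the column groups in order; the original cell sets neither, and the `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same parameters as Dummy Dataset 6; random_state and shuffle are
# assumptions added for this illustration only.
X_demo, y_demo = make_classification(
    n_samples=5000,
    n_features=25,
    n_redundant=10,
    n_repeated=5,
    shuffle=False,     # keep columns ordered: informative, redundant, repeated, noise
    random_state=0)

n_informative = 2                # make_classification default
n_useful = n_informative + 10    # informative + redundant columns

# Redundant columns are linear combinations of the informative ones, so
# the informative + redundant block has rank n_informative.
rank_useful = np.linalg.matrix_rank(X_demo[:, :n_useful])

# Each repeated column is an exact copy of one informative/redundant column.
repeated = X_demo[:, n_useful:n_useful + 5]
is_copy = [any(np.array_equal(repeated[:, j], X_demo[:, k]) for k in range(n_useful))
           for j in range(repeated.shape[1])]

print('shape:', X_demo.shape)
print('rank of informative + redundant block:', rank_useful)
print('every repeated column is a copy:', all(is_copy))
```

With these settings only the two informative columns carry independent signal; the remaining 8 columns are pure noise.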

In [79]:
class PerceptronPPCOMPClassifier(BaseEstimator, ClassifierMixin):
    """Perceptron with a step activation, for labels 0/1."""

    def __init__(self):
        pass

    def predict(self, X):
        # Net input: one value per sample, or a scalar for a single sample
        r = np.dot(X, self.w) + self.b
        if np.isscalar(r):
            return 1.0 if r >= 0.0 else 0.0
        # Step activation: 1 when the net input is non-negative, else 0
        return np.where(r >= 0.0, 1.0, 0.0)

    def fit(self, X, y, e=100, learn_r=0.001):
        # e: maximum number of epochs; learn_r: learning rate
        # Initialize weights (w) and bias (b)

        # Zero initialization
        #self.w = np.zeros((X.shape[1], )) # X.shape[1] = number of features in the dataset
        #self.b = 0.0

        # Random initialization, uniform in [0, 1)
        #self.w = np.random.normal(size=X.shape[1])
        self.w = np.random.random((X.shape[1], ))
        self.b = np.random.random()

        for _ in range(e):
            error_conv = 0 # misclassified samples in this epoch (convergence check)
            for xi, yi in zip(X, y):
                err = yi - self.predict(xi)
                if err != 0:
                    self.w += learn_r*err*xi # w <- w + α(y - f(x))x
                    self.b += learn_r*err
                    error_conv += 1
            if error_conv == 0:
                break
        return self
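
The update rule used in `fit` can be traced on a tiny, linearly separable problem. The sketch below is a standalone illustration, not the class above: it assumes zero initialization, a learning rate of 1.0, and the logical AND truth table as data, all chosen here so that the run is deterministic.

```python
import numpy as np

# Logical AND: linearly separable, so the perceptron rule converges.
X_and = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([0.0, 0.0, 0.0, 1.0])

w = np.zeros(2)   # assumption: zero init (the class above uses random init)
b = 0.0
learn_r = 1.0     # assumption: larger than the 0.001 default above, for a short trace

def step(r):
    # Same activation as the class above: 1 when r >= 0, else 0
    return 1.0 if r >= 0.0 else 0.0

epochs_run = 0
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X_and, y_and):
        err = yi - step(np.dot(xi, w) + b)
        if err != 0:
            w += learn_r * err * xi   # w <- w + lr * (y - f(x)) * x
            b += learn_r * err
            errors += 1
    epochs_run = epoch + 1
    if errors == 0:   # a full pass without mistakes: converged
        break

pred_and = np.array([step(np.dot(xi, w) + b) for xi in X_and])
print('epochs:', epochs_run, 'w:', w, 'b:', b, 'predictions:', pred_and)
```

The loop stops as soon as one full pass over the data produces no update, which is the same convergence check that `error_conv` implements in the class.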

In [80]:
class PerformanceEvaluator():
    def __init__(self, X, y, cv, scaler):
        self.X = X
        self.y = y
        self.cv = cv
        self.scaler = scaler

    def score(self, pipe):
        # With an integer cv, classifiers are scored with (Stratified)KFold
        scores = cross_val_score(pipe, self.X, self.y, cv=self.cv)
        return scores

    def evaluate(self, clfs):
        # Start below any possible accuracy so a best estimator is always selected
        best_overall = -1.0
        best_pipe = None
        best_clf_name = None
        for name, clf in clfs:
            if self.scaler:
                pipe = Pipeline(steps=[('scaler', StandardScaler()),
                                       ('classifier', clf)])
            else:
                pipe = clf
            t_inicio = time.time()
            scores = self.score(pipe)
            t_fim = time.time()
            print('Mean: %0.7f Std: %0.7f(+/-) Best: %0.7f Time: %.2f(s) [%s]' % (scores.mean(), scores.std(), scores.max(), t_fim-t_inicio, name))
            if scores.mean() > best_overall:
                best_overall = scores.mean()
                best_pipe = pipe
                best_clf_name = name
        print('Best Estimator: ', best_clf_name)
        # Illustrative confusion matrix for the best estimator (single 80/20 split)
        X_train, X_test, y_train, y_test = train_test_split(self.X, self.y, test_size=0.20)
        best_pipe.fit(X_train, y_train)
        y_p = best_pipe.predict(X_test)
        conf_mat = confusion_matrix(y_test, y_p)
        print(conf_mat)
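
One benefit of wrapping the classifier in a `Pipeline` when `scaler` is enabled is that `cross_val_score` refits the `StandardScaler` on the training part of every fold, so no statistics from the held-out part leak into training. A minimal sketch of that pattern, using the breast cancer data and the scikit-learn `Perceptron` from this notebook (the variable names are illustrative, not part of the original code):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Same classifier, evaluated on raw features and inside a scaling pipeline
raw_clf = Perceptron(tol=1e-3, random_state=0)
scaled_clf = Pipeline(steps=[('scaler', StandardScaler()),
                             ('classifier', Perceptron(tol=1e-3, random_state=0))])

# The pipeline is cloned and refit per fold, scaler included
raw_scores = cross_val_score(raw_clf, X_bc, y_bc, cv=5)
scaled_scores = cross_val_score(scaled_clf, X_bc, y_bc, cv=5)

print('raw    mean accuracy: %0.4f' % raw_scores.mean())
print('scaled mean accuracy: %0.4f' % scaled_scores.mean())
```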

In [81]:
print("Comparison of shallow neural networks on multiple datasets")

# Classifiers of interest with their hyperparameters
clfs = [
    ('PerceptronPPCOMP',PerceptronPPCOMPClassifier()),
    ('PerceptronSciKit',Perceptron(tol=1e-3, random_state=0)),
    ('LinearSVM',SVC(kernel="linear", C=0.025)),
    ('SGD_LossHinge',SGDClassifier(loss='hinge',max_iter=1000, tol=1e-3)),
    ('MLP',MLPClassifier(max_iter=500,early_stopping=True,hidden_layer_sizes=(100,)))
]

### Additional parameters ###
# cross-validation folds
cv = 5
# enable or disable the scaler (StandardScaler)
scaler = False
#################################

for key in dX_AllDatasets.keys():
    print("\n" +"="*40)
    print(key)
    print("-"*40)    
    X,y=dX_AllDatasets[key],dy_AllDatasets[key]
    pe = PerformanceEvaluator(X,y,cv,scaler)
    pe.evaluate(clfs)


Comparison of shallow neural networks on multiple datasets

breast_cancer
----------------------------------------
Mean: 0.8719969 Std: 0.0425969(+/-) Best: 0.9292035 Time: 0.90(s) [PerceptronPPCOMP]
Mean: 0.8025702 Std: 0.1697021(+/-) Best: 0.9217391 Time: 0.01(s) [PerceptronSciKit]
Mean: 0.9491343 Std: 0.0267773(+/-) Best: 0.9911504 Time: 0.17(s) [LinearSVM]
Mean: 0.9068103 Std: 0.0300324(+/-) Best: 0.9292035 Time: 0.01(s) [SGD_LossHinge]
Mean: 0.8290881 Std: 0.1041403(+/-) Best: 0.9217391 Time: 0.24(s) [MLP]
Best Estimator:  LinearSVM
[[31  3]
 [ 4 76]]

dummy_ds_1
----------------------------------------
Mean: 0.9929648 Std: 0.0098472(+/-) Best: 1.0000000 Time: 1.39(s) [PerceptronPPCOMP]
Mean: 0.9879698 Std: 0.0081538(+/-) Best: 1.0000000 Time: 0.01(s) [PerceptronSciKit]
Mean: 0.9909748 Std: 0.0092091(+/-) Best: 1.0000000 Time: 0.01(s) [LinearSVM]
Mean: 0.9959899 Std: 0.0058586(+/-) Best: 1.0000000 Time: 0.01(s) [SGD_LossHinge]
Mean: 0.9799647 Std: 0.0170605(+/-) Best: 1.0000000 Tim

In [82]:
### Enabling StandardScaler ###
scaler = True
#################################

for key in dX_AllDatasets.keys():
    print("\n" +"="*40)
    print(key)
    print("-"*40)    
    X,y=dX_AllDatasets[key],dy_AllDatasets[key]
    pe = PerformanceEvaluator(X,y,cv,scaler)
    pe.evaluate(clfs)


breast_cancer
----------------------------------------
Mean: 0.9701578 Std: 0.0068991(+/-) Best: 0.9823009 Time: 0.88(s) [PerceptronPPCOMP]
Mean: 0.9666795 Std: 0.0148693(+/-) Best: 0.9823009 Time: 0.01(s) [PerceptronSciKit]
Mean: 0.9718969 Std: 0.0065201(+/-) Best: 0.9823009 Time: 0.02(s) [LinearSVM]
Mean: 0.9718969 Std: 0.0128707(+/-) Best: 0.9826087 Time: 0.01(s) [SGD_LossHinge]
Mean: 0.9279877 Std: 0.0231880(+/-) Best: 0.9734513 Time: 0.19(s) [MLP]
Best Estimator:  LinearSVM
[[42  5]
 [ 0 67]]

dummy_ds_1
----------------------------------------
Mean: 0.9899797 Std: 0.0095189(+/-) Best: 1.0000000 Time: 1.50(s) [PerceptronPPCOMP]
Mean: 0.9899797 Std: 0.0071138(+/-) Best: 1.0000000 Time: 0.01(s) [PerceptronSciKit]
Mean: 0.9909748 Std: 0.0092091(+/-) Best: 1.0000000 Time: 0.02(s) [LinearSVM]
Mean: 0.9949749 Std: 0.0077849(+/-) Best: 1.0000000 Time: 0.01(s) [SGD_LossHinge]
Mean: 0.9639341 Std: 0.0248987(+/-) Best: 0.9950000 Time: 0.22(s) [MLP]
Best Estimator:  SGD_LossHinge
[[ 93   0]

#### Observations on the experiment:

* For some datasets (e.g. Breast Cancer), standardizing the features brought a significant improvement: the mean accuracy of the scikit-learn Perceptron rose from 0.803 to 0.967, and that of the MLP from 0.829 to 0.928.
* The MLP (1 hidden layer) shows a clear advantage over the linear classifiers in the hard-to-separate scenario (Dummy Dataset 5).

##### SGD (Hinge Loss) vs. Linear SVM
As I understand it, SGDClassifier is an optimizer for linear classifiers that uses SGD. By default it fits a linear SVM with the hinge loss. With a logistic (log) loss, for example, it would fit a logistic regression. Each update uses a single sample; mini-batch (online/out-of-core) learning is available through the partial_fit method.

>This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). SGD allows minibatch (online/out-of-core) learning via the partial_fit method...The model it fits can be controlled with the loss parameter; by default, it fits a linear support vector machine (SVM).
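
The `partial_fit` route mentioned in the quote can be sketched as follows. The synthetic dataset, the batch size of 100, and the fixed seeds are assumptions made for this illustration; they do not come from the experiment above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Assumption: a small synthetic binary problem with a fixed seed
X_mb, y_mb = make_classification(n_samples=1000, n_features=10, random_state=0)

clf_mb = SGDClassifier(loss='hinge', random_state=0)
classes = np.unique(y_mb)   # partial_fit needs all class labels on the first call

batch_size = 100            # assumption: illustrative batch size
for start in range(0, X_mb.shape[0], batch_size):
    stop = start + batch_size
    # One pass of SGD over this batch only; earlier batches are not revisited
    clf_mb.partial_fit(X_mb[start:stop], y_mb[start:stop], classes=classes)

print('coef shape:', clf_mb.coef_.shape)
print('training accuracy: %0.4f' % clf_mb.score(X_mb, y_mb))
```

Each `partial_fit` call makes one pass over the batch it receives and still updates the weights sample by sample, which is what makes the estimator usable when the data does not fit in memory.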


It is interesting to note that, in this experiment, SGD often failed to reach a good solution with the chosen parameters. Tuning these parameters was not analyzed, at least in this first round of experiments.
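
Since tuning was left out of this first round, one possible starting point is a small grid search over the regularization strength `alpha` of `SGDClassifier`, with scaling kept inside the pipeline. The grid values, the synthetic dataset, and the seeds below are hypothetical choices for illustration, not settings evaluated in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a small synthetic dataset so the search runs quickly
X_gs, y_gs = make_classification(n_samples=500, n_features=25, random_state=0)

pipe_gs = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3,
                                 random_state=0))])

# Hypothetical grid; step parameters are addressed as <step name>__<parameter>
param_grid = {'classifier__alpha': [1e-5, 1e-4, 1e-3, 1e-2]}

search = GridSearchCV(pipe_gs, param_grid, cv=5)
search.fit(X_gs, y_gs)

print('best params:', search.best_params_)
print('best CV accuracy: %0.4f' % search.best_score_)
```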
