# Laboratory 6 - Classification
Filipe Gomes Arante de Souza

1. Compare the accuracy of a decision tree that uses information gain with one that uses the Gini index to select the feature at each decision node, on the wine dataset. Run the comparison with 6 rounds of stratified 5-fold cross-validation. Apart from the feature selection criterion, use the default values for the remaining tree hyperparameters. State whether the difference between the two trees' results is significant according to Student's t-test.

In [12]:
from sklearn.datasets import load_wine

# Load the dataset
wine = load_wine()

wine_X = wine.data
wine_y = wine.target


In [13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from scipy import stats
import numpy as np

dtEntropy = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

rkf = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 6, random_state = 0)
scoresEntropy = cross_val_score(dtEntropy, wine_X, wine_y, scoring = 'accuracy', 
                         cv = rkf)

print (scoresEntropy)

mean = scoresEntropy.mean()
std = scoresEntropy.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scoresEntropy)))

print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print ("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n" % 
       (inf, sup)) 

[0.94444444 0.91666667 0.97222222 0.97142857 0.94285714 0.80555556
 0.91666667 0.94444444 0.91428571 0.94285714 0.91666667 0.88888889
 0.91666667 0.88571429 0.91428571 0.94444444 0.91666667 0.83333333
 0.97142857 0.88571429 0.88888889 0.91666667 0.94444444 0.94285714
 0.88571429 0.97222222 0.97222222 0.86111111 0.91428571 1.        ]

Mean Accuracy: 0.92 Standard Deviation: 0.04
Accuracy Confidence Interval (95%): (0.91, 0.94)
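
The mean, standard deviation, and interval computation is repeated in almost every cell of this notebook. A minimal sketch of a helper that wraps it is shown here; the name `mean_confidence_interval` and the example scores are hypothetical and not part of the original lab. It keeps the same normal approximation used in the cells.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(scores, confidence=0.95):
    """Return (mean, std, lower, upper) for an array of cross-validation scores."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    std = scores.std()  # population std (ddof=0), matching the notebook cells
    sem = std / np.sqrt(len(scores))  # standard error of the mean
    lower, upper = stats.norm.interval(confidence, loc=mean, scale=sem)
    return mean, std, lower, upper

# Hypothetical scores, used only to illustrate the call
example_scores = [0.90, 0.95, 1.00, 0.85, 0.90]
mean, std, lower, upper = mean_confidence_interval(example_scores)
print("Mean: %0.2f Std: %0.2f CI (95%%): (%0.2f, %0.2f)" % (mean, std, lower, upper))
```

Passing the score array itself to the helper avoids taking the sample size from an unrelated variable.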



In [14]:
dtGini = DecisionTreeClassifier(criterion = 'gini', random_state = 0)

rkf = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 6, random_state = 0)
scoresGini = cross_val_score(dtGini, wine_X, wine_y, scoring = 'accuracy', 
                         cv = rkf)

print (scoresGini)

mean = scoresGini.mean()
std = scoresGini.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scoresGini)))

print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print ("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n" % 
       (inf, sup)) 

[0.91666667 0.83333333 0.97222222 0.97142857 0.94285714 0.88888889
 0.86111111 0.94444444 0.94285714 0.91428571 0.94444444 0.88888889
 0.94444444 0.88571429 0.85714286 0.83333333 0.88888889 0.83333333
 0.94285714 0.97142857 0.88888889 0.94444444 0.86111111 0.82857143
 0.88571429 0.88888889 0.91666667 0.86111111 0.91428571 0.94285714]

Mean Accuracy: 0.90 Standard Deviation: 0.04
Accuracy Confidence Interval (95%): (0.89, 0.92)



In [15]:
from scipy.stats import ttest_rel

print('Paired T Test')
s, p = ttest_rel(scoresEntropy, scoresGini)
print("t: %0.2f p-value: %0.2f\n" % (s,p))

Paired T Test
t: 1.98 p-value: 0.06
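
With p = 0.06, the paired test does not reject equality of the two criteria at the usual 0.05 level. Scores from repeated cross-validation share most of their training data, so the plain paired t-test can understate the variance. The sketch below uses a different technique from the cell above: the corrected resampled t-test (the Nadeau and Bengio variance correction). The function name and the example scores are hypothetical, and the correction assumes both arrays come from the same k-fold splits.

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(scores_a, scores_b, n_splits):
    """Paired t-test with the Nadeau-Bengio variance correction.

    Assumes both score arrays come from the same repeated k-fold splits,
    so each test fold holds about 1/n_splits of the data.
    """
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(diff)
    test_train_ratio = 1.0 / (n_splits - 1)  # n_test / n_train for k-fold
    var = diff.var(ddof=1)  # sample variance of the paired differences
    t_stat = diff.mean() / np.sqrt((1.0 / n + test_train_ratio) * var)
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 1)  # two-sided p-value
    return t_stat, p_value

# Hypothetical fold scores, used only to illustrate the call
a = np.array([0.94, 0.92, 0.97, 0.97, 0.94, 0.81, 0.92, 0.94, 0.91, 0.94])
b = np.array([0.92, 0.83, 0.97, 0.97, 0.94, 0.89, 0.86, 0.94, 0.94, 0.91])

t_corr, p_corr = corrected_resampled_ttest(a, b, n_splits=5)
t_plain, p_plain = stats.ttest_rel(a, b)
print("plain:     t = %0.2f p-value = %0.2f" % (t_plain, p_plain))
print("corrected: t = %0.2f p-value = %0.2f" % (t_corr, p_corr))
```

The corrected statistic is never larger in magnitude than the plain one, so it can only make the conclusion more conservative.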



2. Determine which value of the ccp_alpha hyperparameter (pruning factor) achieves the best mean
accuracy in a grid search with 10-fold cross-validation on the wine dataset. Vary the
hyperparameter in steps of 0.1 over the interval from 0.1 to 0.7.

In [16]:
from sklearn.model_selection import GridSearchCV

parameters = {'ccp_alpha': [x / 10 for x in range(1, 8)]}

dt = DecisionTreeClassifier(random_state = 0)

gs = GridSearchCV(dt, parameters, cv = 10)

gs_results = gs.fit(wine_X, wine_y)

print("Best Mean Accuracy: %0.2f" % gs.best_score_)
print("Best Parameter Values: ", gs.best_params_)
print("Grid Search Result Infos: ", gs.cv_results_.keys())


Best Mean Accuracy: 0.80
Best Parameter Values:  {'ccp_alpha': 0.1}
Grid Search Result Infos:  dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_ccp_alpha', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'split5_test_score', 'split6_test_score', 'split7_test_score', 'split8_test_score', 'split9_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])
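
The cell above reports only the best value and the available result keys. To see how accuracy changes across the whole grid, the mean and standard deviation per ccp_alpha can be read from `cv_results_`. The following is a minimal sketch that repeats the same search; the variable names `search` and `rows` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Same grid as in the exercise: 0.1, 0.2, ..., 0.7
param_grid = {'ccp_alpha': [x / 10 for x in range(1, 8)]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=10)
search.fit(X, y)

results = search.cv_results_
alphas = [params['ccp_alpha'] for params in results['params']]
rows = list(zip(alphas, results['mean_test_score'], results['std_test_score']))

# One line per candidate value, in grid order
for alpha, mean_score, std_score in rows:
    print("ccp_alpha = %0.1f mean accuracy = %0.2f std = %0.2f"
          % (alpha, mean_score, std_score))
```

If the best score sits at the edge of the grid, as it does here at 0.1, values below that edge may be worth searching as well.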


3. Compare the macro F1 performance of the Naive Bayes classifier with that of the Decision Tree
classifier (with default hyperparameter values) and of the stratified random classifier,
using 10-fold cross-validation on the breast dataset.

In [31]:
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
breast_X = breast.data
breast_y = breast.target


In [32]:
from sklearn.naive_bayes import GaussianNB
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

naiveBayes = GaussianNB()

scorings = ['f1_macro']
scores = cross_validate(naiveBayes, breast_X, breast_y, scoring = scorings, cv = 10)
nb_f1 = scores['test_f1_macro']

mean = nb_f1.mean()
std = nb_f1.std()
inf, sup = stats.norm.interval(0.95, loc=mean, 
                               scale=std/np.sqrt(len(nb_f1)))

print("\nMean F1 Macro: %0.2f Standard Deviation: %0.2f" % (mean, std))
print ("F1 Macro Confidence Interval (95%%): (%0.2f, %0.2f)\n" % (inf, sup))



Mean F1 Macro: 0.93 Standard Deviation: 0.03
F1 Macro Confidence Interval (95%): (0.91, 0.95)



In [33]:
dt = DecisionTreeClassifier(random_state = 0)

scorings = ['f1_macro']
scores = cross_validate(dt, breast_X, breast_y, scoring = scorings, cv = 10)
dt_f1 = scores['test_f1_macro']  # separate name so the classifier `dt` is not overwritten

mean = dt_f1.mean()
std = dt_f1.std()
inf, sup = stats.norm.interval(0.95, loc=mean, scale=std/np.sqrt(len(dt_f1)))

print("\nMean F1 Macro: %0.2f Standard Deviation: %0.2f" % (mean, std))
print ("F1 Macro Confidence Interval (95%%): (%0.2f, %0.2f)\n" % (inf, sup))



Mean F1 Macro: 0.91 Standard Deviation: 0.04
F1 Macro Confidence Interval (95%): (0.89, 0.93)



In [34]:
randomStratified = DummyClassifier(strategy = 'stratified', random_state = 0)

scorings = ['f1_macro']
scores = cross_validate(randomStratified, breast_X, breast_y, scoring = scorings, cv = 10)
dummy_f1 = scores['test_f1_macro']  # separate name so the classifier is not overwritten

mean = dummy_f1.mean()
std = dummy_f1.std()
inf, sup = stats.norm.interval(0.95, loc=mean, scale=std/np.sqrt(len(dummy_f1)))

print("\nMean F1 Macro: %0.2f Standard Deviation: %0.2f" % (mean, std))
print ("F1 Macro Confidence Interval (95%%): (%0.2f, %0.2f)\n" % (inf, sup))


Mean F1 Macro: 0.56 Standard Deviation: 0.03
F1 Macro Confidence Interval (95%): (0.55, 0.58)
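
The three cells for this exercise differ only in the classifier. A minimal sketch of the same comparison as a single loop is given below; the dictionary keys and the name `f1_by_model` are illustrative, and `cross_val_score` is used in place of `cross_validate` because only one metric is needed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same three classifiers as in the cells above
classifiers = {
    'naive_bayes': GaussianNB(),
    'decision_tree': DecisionTreeClassifier(random_state=0),
    'stratified_dummy': DummyClassifier(strategy='stratified', random_state=0),
}

f1_by_model = {}
for name, clf in classifiers.items():
    # Macro F1 on each of the 10 folds
    f1_by_model[name] = cross_val_score(clf, X, y, scoring='f1_macro', cv=10)
    print("%-17s mean F1 macro = %0.2f std = %0.2f"
          % (name, f1_by_model[name].mean(), f1_by_model[name].std()))
```

Keeping the per-fold arrays in one dictionary also makes it straightforward to run paired tests between any two classifiers afterwards.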



4. Obtain the mean accuracy, the standard deviation, and the 95% confidence interval of the
Multilayer Perceptron classifier using 10-fold cross-validation on the wine dataset,
both standardized and not standardized. Manually change the initial learning rate
of the best classifier to 0.1, 0.01 and 0.0001 and observe the result.

In [40]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# With standardization
mlp = MLPClassifier(random_state = 0)
scaler = StandardScaler()
pipeline = Pipeline([('transformer', scaler), ('estimator', mlp)])

scores = cross_val_score(pipeline, wine_X, wine_y, scoring = 'accuracy', cv = 10)

mean = scores.mean()
std = scores.std()
inf, sup = stats.norm.interval(0.95, loc=mean, scale=std/np.sqrt(len(scores)))

print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print ("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n" % (inf, sup)) 





Mean Accuracy: 0.98 Standard Deviation: 0.03
Accuracy Confidence Interval (95%): (0.96, 0.99)





In [41]:
# Without standardization
mlp = MLPClassifier(random_state = 0)

scores = cross_val_score(mlp, wine_X, wine_y, scoring='accuracy', cv = 10)

mean = scores.mean()
std = scores.std()
inf, sup = stats.norm.interval(0.95, loc=mean, scale=std/np.sqrt(len(scores)))

print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print ("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n" % (inf, sup)) 




Mean Accuracy: 0.90 Standard Deviation: 0.06
Accuracy Confidence Interval (95%): (0.87, 0.94)
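
A likely reason for the gap between the two runs is that the wine features live on very different scales, which makes gradient-based training harder. The sketch below only inspects those scales; it fits the scaler on the full dataset for illustration, whereas the pipeline above refits it inside each training fold.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data

# Fit on the whole dataset purely for inspection (not for model evaluation)
X_scaled = StandardScaler().fit_transform(X)

raw_std = X.std(axis=0)
scaled_std = X_scaled.std(axis=0)

# Per-feature spread before and after standardization
for name, before, after in zip(wine.feature_names, raw_std, scaled_std):
    print("%-30s std before = %8.2f std after = %0.2f" % (name, before, after))

print("Largest / smallest raw std: %0.0f" % (raw_std.max() / raw_std.min()))
```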





In [48]:
# The best classifier was the one trained on standardized data

# Note: the exercise statement lists 0.0001 as the third rate; this run used 0.001
rates = [0.1, 0.01, 0.001]
for rate in rates:
    mlp = MLPClassifier(learning_rate_init = rate, random_state = 0, max_iter = 500)
    scaler = StandardScaler()
    pipeline = Pipeline([('transformer', scaler), ('estimator', mlp)])

    scores = cross_val_score(pipeline, wine_X, wine_y, scoring = 'accuracy', cv = 10)

    mean = scores.mean()
    std = scores.std()
    inf, sup = stats.norm.interval(0.95, loc=mean, scale=std/np.sqrt(len(scores)))

    print("Learning Rate: ", rate)
    print("Mean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
    print ("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n" % (inf, sup)) 

Learning Rate:  0.1
Mean Accuracy: 0.98 Standard Deviation: 0.03
Accuracy Confidence Interval (95%): (0.96, 0.99)

Learning Rate:  0.01
Mean Accuracy: 0.98 Standard Deviation: 0.03
Accuracy Confidence Interval (95%): (0.96, 0.99)

Learning Rate:  0.001
Mean Accuracy: 0.98 Standard Deviation: 0.03
Accuracy Confidence Interval (95%): (0.96, 0.99)
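
The three learning rates give the same accuracy to two decimals, so accuracy alone does not show what the rate changes. One way to look further is to compare how many epochs each fit takes and the final training loss. The following is a minimal sketch, assuming a single fit on the full dataset is acceptable for inspection; it also includes 0.0001, the value listed in the exercise statement, and the name `summary` is illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

summary = {}
for rate in [0.1, 0.01, 0.001, 0.0001]:
    pipeline = Pipeline([
        ('transformer', StandardScaler()),
        ('estimator', MLPClassifier(learning_rate_init=rate, random_state=0,
                                    max_iter=500)),
    ])
    pipeline.fit(X, y)  # may emit a ConvergenceWarning for small rates
    mlp = pipeline.named_steps['estimator']
    # n_iter_: epochs actually run; loss_: final training loss
    summary[rate] = (mlp.n_iter_, mlp.loss_)
    print("Learning Rate: %g epochs = %d final loss = %0.4f"
          % (rate, mlp.n_iter_, mlp.loss_))
```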

