<a href="https://colab.research.google.com/github/mvdj/mvdj.github.io/blob/master/QualidadeVinho_RedesNeuraisArtificiais.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Wine Quality - Classification with Artificial Neural Networks (ANN)**

Dataset (https://archive.ics.uci.edu/ml/datasets/wine+quality) with 8096 instances describing the quality of red and white wines. The data describe the chemical/biological components that contribute to a wine's quality.

* Supervised learning
* Mapping from a vector of attributes to a class attribute
* Let $x_i$, $i = 1,\dots,n$, be $n$ instances, each belonging to a class $c$
  * $x_i$ has dimension $d$
  * there are $m$ classes, $c \in \{c_1,\dots,c_m\}$
* Learning means identifying the function $f$ such that:
  * $f([x_{i1},x_{i2},...,x_{id}]) = c$
* Learning in an ANN consists of adjusting the weights
  * minimizing the error is the objective function
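
To make "adjusting the weights to minimize the error" concrete, here is a minimal sketch of a single logistic neuron trained by batch gradient descent. It is not the method used by `MLPClassifier` in this notebook; the function name `train_neuron`, the toy AND data, the learning rate, and the number of epochs are all assumptions chosen for illustration.

```python
import numpy as np

def train_neuron(X, y, lr=0.5, epochs=2000):
    """Fit one logistic neuron by batch gradient descent on cross-entropy."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])  # small random initial weights
    b = 0.0
    eps = 1e-12                                  # avoids log(0)
    losses = []
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # forward pass (sigmoid)
        losses.append(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
        grad = p - y                             # gradient of the loss w.r.t. the pre-activation
        w -= lr * (X.T @ grad) / len(y)          # weight update
        b -= lr * grad.mean()                    # bias update
    return w, b, losses

# Toy, linearly separable problem (logical AND); hypothetical data.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_toy = np.array([0.0, 0.0, 0.0, 1.0])

w, b, losses = train_neuron(X_toy, y_toy)
pred = (1.0 / (1.0 + np.exp(-(X_toy @ w + b))) > 0.5).astype(int)
print(losses[0], losses[-1], pred)
```

The error recorded in `losses` shrinks as the weights are updated, which is the objective described in the list above.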




In [3]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


# **1. Importing libraries**


In [4]:
# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt

# **2. Reading the data**


In [5]:
# reading the csv into a dataframe
dados = pd.read_csv('/content/drive/My Drive/Colab Notebooks/dataSets/QualidadeVinho/winequality-whiteAndRed.csv')

In [6]:
# inspecting the dataframe
dados.head(5)

Unnamed: 0,wine type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,45.0,45.0,170.0,1001.0,3.0,0.45,8.8,6
1,white,6.3,0.3,0.34,1.6,49.0,14.0,132.0,994.0,3.3,0.49,9.5,6
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,white,7.2,0.23,0.32,8.5,58.0,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,white,7.2,0.23,0.32,8.5,58.0,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [7]:
#checking the columns of the data
dados.columns

Index(['wine type', 'fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'quality'],
      dtype='object')

In [8]:
#checking the number of instances
dados.shape

(8096, 13)

# **3. Cleaning and organizing the data**

In [9]:
#drop rows with NaN/missing values (note: dropna() does not treat a literal '?' as missing)
dados = dados.dropna()
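
The comment above also mentions `?` placeholders, which `dropna()` leaves untouched because they are ordinary strings. A minimal sketch of one way to handle them, using a small hypothetical frame (`raw` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one '?' placeholder, one None, one NaN.
raw = pd.DataFrame({
    "pH": ["3.0", "?", "3.26", None],
    "alcohol": [8.8, 9.5, np.nan, 9.9],
})

only_dropna = raw.dropna()                       # the '?' row survives
clean = raw.replace("?", np.nan).dropna()        # '?' becomes NaN first, then is dropped
clean = clean.assign(pH=clean["pH"].astype(float))

print(len(only_dropna), len(clean))
```

If the file is known to use `?` for missing values, passing `na_values='?'` to `pd.read_csv` is another option that avoids the extra `replace` step.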

In [10]:
#dropping columns that are not useful in this context
dados = dados.drop(columns=['density']) # attribute whose values are on inconsistent scales (e.g. 1001.0 vs 0.9951)

In [11]:
#inspecting the first rows after dropping missing values and the 'density' column
dados.head()

Unnamed: 0,wine type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,45.0,45.0,170.0,3.0,0.45,8.8,6
1,white,6.3,0.3,0.34,1.6,49.0,14.0,132.0,3.3,0.49,9.5,6
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,3.26,0.44,10.1,6
3,white,7.2,0.23,0.32,8.5,58.0,47.0,186.0,3.19,0.4,9.9,6
4,white,7.2,0.23,0.32,8.5,58.0,47.0,186.0,3.19,0.4,9.9,6


In [12]:
#replacing the 'wine type' attribute with a numeric type
dados['wine type'] = dados['wine type'].replace(['white','red'],[0,1]) # 0 for white wine | 1 for red wine
dados.head()

Unnamed: 0,wine type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,pH,sulphates,alcohol,quality
0,0,7.0,0.27,0.36,20.7,45.0,45.0,170.0,3.0,0.45,8.8,6
1,0,6.3,0.3,0.34,1.6,49.0,14.0,132.0,3.3,0.49,9.5,6
2,0,8.1,0.28,0.4,6.9,0.05,30.0,97.0,3.26,0.44,10.1,6
3,0,7.2,0.23,0.32,8.5,58.0,47.0,186.0,3.19,0.4,9.9,6
4,0,7.2,0.23,0.32,8.5,58.0,47.0,186.0,3.19,0.4,9.9,6
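
An alternative way to encode the labels is `Series.map` with an explicit dictionary. The sketch below uses a hypothetical series (the `rosé` entry is invented) to show that labels missing from the dictionary become NaN instead of passing through unchanged, which makes unexpected categories easy to spot.

```python
import pandas as pd

# Hypothetical labels; 'rosé' is not in the mapping on purpose.
wine_type = pd.Series(["white", "red", "white", "rosé"])
encoded = wine_type.map({"white": 0, "red": 1})

print(encoded.tolist())
print(encoded.isna().sum())  # number of labels that were not in the mapping
```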


# **4. Rescaling the data**

**Min-max rescaling**

In [13]:
dados = (dados - dados.min())/(dados.max()-dados.min())
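
A small worked example of the same min-max formula, $x' = (x - x_{min})/(x_{max} - x_{min})$, on a hypothetical three-row frame. The `const` column is included as an assumption to show an edge case: a column with a single repeated value gives 0/0 and therefore NaN.

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to check by hand.
toy = pd.DataFrame({
    "alcohol": [8.0, 10.0, 14.0],
    "pH": [3.0, 3.2, 3.5],
    "const": [1.0, 1.0, 1.0],
})

scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
# alcohol: (10 - 8) / (14 - 8) = 0.333...
# pH:      (3.2 - 3.0) / (3.5 - 3.0) = 0.4
# const:   0 / 0 -> NaN
```

Because the expression in the notebook is applied to the whole dataframe, the `wine type` and `quality` columns are rescaled as well, which is why they appear as floats such as 0.0, 1.0 and 0.5 in the later outputs.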

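# **5. Preparing the data for modeling**

**Split the data into descriptive attributes and the class attribute (target)**

In [14]:
#splitting the data into descriptive attributes and class attribute
#note: X keeps every column after the first, so 'quality' is a feature and 'wine type' (column 0) is the target below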
X = dados.iloc[:,1:]
X.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,pH,sulphates,alcohol,quality
0,0.264463,0.00016,0.216867,0.308282,0.073619,0.152778,0.37788,0.217054,0.129213,0.115942,0.5
1,0.206612,0.000186,0.204819,0.015337,0.080166,0.045139,0.290323,0.449612,0.151685,0.217391,0.5
2,0.355372,0.000169,0.240964,0.096626,4.9e-05,0.100694,0.209677,0.418605,0.123596,0.304348,0.5
3,0.280992,0.000127,0.192771,0.121166,0.094897,0.159722,0.414747,0.364341,0.101124,0.275362,0.5
4,0.280992,0.000127,0.192771,0.121166,0.094897,0.159722,0.414747,0.364341,0.101124,0.275362,0.5


In [15]:
y = dados['wine type']
y.head()


0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: wine type, dtype: float64

**Split the data into training and test sets**



In [16]:
from sklearn.model_selection import train_test_split

* Splits the arrays into random training and test subsets
  * test_size: size of the test subset (as a fraction of the data, e.g. 0.3 for 30%)
  * random_state: sets the seed for the shuffling (if not set, the split changes on every run)

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # pass random_state=42 for a reproducible split
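
The effect of `random_state` can be checked on a small synthetic array. The sketch below also passes `stratify`, which is not used in this notebook; it is shown as an assumption about what may help when one class is much rarer than the other. `X_demo` and `y_demo` are made-up data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 15 of class 0 and 5 of class 1.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 15 + [1] * 5)

# Same seed twice: the two splits are identical.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape, y_te.sum())
```

With 20 rows and `test_size=0.3`, 6 rows go to the test set and 14 to training, in the same way that 8096 rows yield the 5667/2429 split shown in this notebook.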

In [18]:
X_train.head()


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,pH,sulphates,alcohol,quality
3643,0.247934,9.3e-05,0.198795,0.065951,0.076893,0.142361,0.285714,0.310078,0.191011,0.434783,0.5
2117,0.322314,0.000152,0.192771,0.009202,3.3e-05,0.086806,0.25576,0.379845,0.191011,0.405797,0.333333
175,0.256198,0.000127,0.240964,0.105828,3.3e-05,0.170139,0.334101,0.302326,0.02809,0.492754,0.5
7905,0.355372,0.000177,0.216867,0.02454,0.07853,0.118056,0.108295,0.426357,0.44382,0.637681,0.666667
6006,0.355372,0.000591,0.060241,0.041411,0.000115,0.010417,0.016129,0.496124,0.151685,0.217391,0.333333


In [19]:
X_train.shape

(5667, 11)

In [20]:
X_test.shape

(2429, 11)

In [21]:
y_train.head()

3643    0.0
2117    0.0
175     0.0
7905    1.0
6006    0.0
Name: wine type, dtype: float64

# **5. Defining the learning algorithm**

Multi-Layer Perceptron (MLP) neural network

MLP parameters:
* Number of neurons and layers - hidden_layer_sizes
  * tuple describing the architecture
  * e.g. (100,10) - two hidden layers with 100 and 10 neurons respectively
  * e.g. (100,50,10) - three hidden layers with 100, 50 and 10 neurons
* Activation function - activation
  * activation function of the hidden layers
  * identity - identity
  * logistic sigmoid - logistic
  * hyperbolic tangent - tanh
  * rectified linear unit - relu (max(0,x))
* Training - solver
  * how the network weights are optimized
  * stochastic gradient-based optimizer proposed by Diederik Kingma and Jimmy Ba - adam
  * stochastic gradient descent - sgd
  * quasi-Newton family of methods - lbfgs
* Learning rate schedule - learning_rate (only used when solver is sgd)
  * constant rate - constant
  * gradually decreasing - invscaling
  * adaptive - adaptive
* Maximum number of iterations - max_iter
  * for the sgd and adam solvers, the number of training epochs
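
As an illustration of how these parameters fit together, the sketch below builds an `MLPClassifier` on synthetic data from `make_classification`. The data, the architecture `(20, 5)`, and all parameter values are assumptions for the example, not the configuration used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary problem with 5 features (hypothetical data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = MLPClassifier(
    hidden_layer_sizes=(20, 5),   # two hidden layers: 20 and 5 neurons
    activation="tanh",            # activation of the hidden layers
    solver="sgd",                 # stochastic gradient descent
    learning_rate="adaptive",     # schedule; only takes effect with solver="sgd"
    learning_rate_init=0.01,
    max_iter=300,                 # epochs, since the solver is stochastic
    random_state=0,
)
model.fit(X_demo, y_demo)

# One weight matrix per pair of consecutive layers: input->20, 20->5, 5->output.
shapes = [w.shape for w in model.coefs_]
print(shapes)
```

The shapes of `coefs_` make the architecture tuple visible: each entry of `hidden_layer_sizes` adds one weight matrix between consecutive layers.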

In [28]:
from sklearn.neural_network import MLPClassifier

In [29]:
#defining the model; (100) is just the integer 100, i.e. a single hidden layer with 100 neurons
classificador = MLPClassifier(hidden_layer_sizes=(100),activation='logistic',max_iter=1000)

In [30]:
#training the model
classificador.fit(X_train,y_train)

MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
              beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=100, learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=1000,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [31]:
X_train

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,pH,sulphates,alcohol,quality
3643,0.247934,0.000093,0.198795,0.065951,0.076893,0.142361,0.285714,0.310078,0.191011,0.434783,0.500000
2117,0.322314,0.000152,0.192771,0.009202,0.000033,0.086806,0.255760,0.379845,0.191011,0.405797,0.333333
175,0.256198,0.000127,0.240964,0.105828,0.000033,0.170139,0.334101,0.302326,0.028090,0.492754,0.500000
7905,0.355372,0.000177,0.216867,0.024540,0.078530,0.118056,0.108295,0.426357,0.443820,0.637681,0.666667
6006,0.355372,0.000591,0.060241,0.041411,0.000115,0.010417,0.016129,0.496124,0.151685,0.217391,0.333333
...,...,...,...,...,...,...,...,...,...,...,...
2769,0.289256,0.000203,0.210843,0.012270,0.000049,0.024306,0.361751,0.403101,0.112360,0.391304,0.333333
3088,0.206612,0.000245,0.222892,0.013804,0.039248,0.038194,0.161290,0.170543,0.095506,0.623188,0.500000
6392,0.214876,0.000194,0.054217,0.012270,0.107990,0.048611,0.050691,0.542636,0.269663,0.289855,0.666667
7814,0.504132,0.000304,0.277108,0.024540,0.148908,0.031250,0.080645,0.356589,0.264045,0.565217,0.500000


In [32]:
y_train

3643    0.0
2117    0.0
175     0.0
7905    1.0
6006    0.0
       ... 
2769    0.0
3088    0.0
6392    0.0
7814    1.0
3497    0.0
Name: wine type, Length: 5667, dtype: float64

In [33]:
#classifying the test set
classificacao = classificador.predict(X_test)
classificacao

array([0., 0., 0., ..., 0., 0., 0.])

# **6. Evaluating the classifier**

Accuracy
* fraction of instances the classifier labels correctly

In [1]:
#computing accuracy
from sklearn.metrics import accuracy_score

In [34]:
acuracia = accuracy_score(y_test,classificacao)
round(acuracia,3)

0.806

Precision
* fraction of instances classified as positive that are actually positive

In [35]:
#computing precision
from sklearn.metrics import precision_score

In [36]:
precisao = precision_score(y_test,classificacao)
round(precisao,3)

0.493

Recall
* fraction of the actual positive instances that are classified as positive

In [37]:
#computing recall
from sklearn.metrics import recall_score

In [38]:
recall = recall_score(y_test,classificacao)
round(recall,3)

0.36

F1-score
* harmonic mean of precision and recall

In [39]:
#computing the f1-score
from sklearn.metrics import f1_score

In [41]:
f1 = f1_score(y_test,classificacao)
round(f1,3)

0.416
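
The four metrics above can all be derived from the counts of a confusion matrix. The sketch below uses hypothetical counts (`tp`, `fp`, `fn`, `tn` are invented, not taken from this notebook's results) so the arithmetic can be followed by hand.

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + fp + fn + tn)            # 170 / 200
precision = tp / (tp + fp)                            # 40 / 50
recall = tp / (tp + fn)                               # 40 / 60
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

Accuracy counts the negatives that were labeled correctly, while precision and recall look only at the positive class. That is one reason a model can show a fairly high accuracy together with a low recall, as in the results above.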

### ROC curve
* Graphical representation of the performance of a binary classifier
* Plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold varies
  * $tpr = \dfrac{tp}{tp+fn} = \dfrac{\text{true positives}}{\text{total positives}}$
    * (recall)
  * $fpr = \dfrac{fp}{tn+fp} = \dfrac{\text{false positives}}{\text{total negatives}}$
* Interpretation
  * the higher the tpr, the better
  * the lower the fpr, the better

<img src=https://upload.wikimedia.org/wikipedia/commons/3/36/ROC_space-2.png width=500>

In [None]:
#plotting the ROC curve
from sklearn.metrics  import roc_curve

In [None]:
fpr, tpr, _ = roc_curve(y_test,classificacao)

In [None]:
plt.plot(fpr,tpr,marker='.')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
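
The cells above pass hard 0/1 predictions to `roc_curve`, which leaves only one informative threshold. A ROC curve is usually drawn from continuous scores, for example the positive-class column of `predict_proba`. The sketch below is a plain NumPy threshold sweep on hypothetical labels and scores (`roc_points`, `auc_trapezoid`, `y_demo` and `proba` are invented names and data), comparing both inputs.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) obtained by sweeping a threshold over the scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # One threshold per distinct score, from strictest to loosest.
    thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    pos = (y_true == 1).sum()
    neg = (y_true == 0).sum()
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t
        tpr.append((pred & (y_true == 1)).sum() / pos)
        fpr.append((pred & (y_true == 0)).sum() / neg)
    return np.array(fpr), np.array(tpr)

def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical labels and predicted probabilities of the positive class.
y_demo = [0, 0, 1, 1, 1, 0]
proba = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6]
hard = [int(p >= 0.5) for p in proba]   # what predict() would return at threshold 0.5

fpr_s, tpr_s = roc_points(y_demo, proba)
fpr_h, tpr_h = roc_points(y_demo, hard)
auc_s = auc_trapezoid(fpr_s, tpr_s)
auc_h = auc_trapezoid(fpr_h, tpr_h)
print(len(fpr_s), auc_s, len(fpr_h), auc_h)
```

With scores the curve has one point per distinct threshold; with hard labels it collapses to three points, and the two areas generally differ.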

## Area under the curve (AUC)
* Area under the ROC curve
* Numerical summary of the ROC curve

In [43]:
#computing the area under the ROC curve
from sklearn.metrics import roc_auc_score

In [44]:
#despite its name, `erro` stores the AUC, where higher is better
erro = roc_auc_score(y_test,classificacao)
round(erro,3)

0.636

## **Cross-validation**

In [45]:
# evaluating the model with cross-validation
from sklearn.model_selection import cross_val_score

In [46]:
#defining the model
classificador = MLPClassifier(hidden_layer_sizes=(100),activation='logistic',max_iter=1000)

In [47]:
#computing the scores
scores = cross_val_score(classificador,X,y,cv=10)
scores

array([0.85308642, 0.87654321, 0.88518519, 0.87283951, 0.87160494,
       0.87407407, 0.84177998, 0.64894932, 0.30160692, 0.50432633])

In [48]:
round(scores.mean(),3),round(scores.std(),3)

(0.753, 0.192)
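
The fold scores above vary a lot (from about 0.30 to 0.89). With an integer `cv` and a classifier, `cross_val_score` uses stratified folds without shuffling, so one thing worth checking is whether the row order of the file influences the folds. This is an assumption, not something verified on this dataset. The sketch below shows how a shuffled splitter can be passed explicitly; it runs on synthetic data and uses a decision tree only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X and y (hypothetical).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Stratified folds with shuffling and a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores_demo = cross_val_score(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv)

print(round(scores_demo.mean(), 3), round(scores_demo.std(), 3))
```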

# **7. Comparing the MLP with a Decision Tree and a Random Forest**

### Cross-validation

In [49]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [50]:
#creating the decision tree
arvore = DecisionTreeClassifier()

#computing the scores
scores_arvore = cross_val_score(arvore,X,y,cv=10)

In [51]:
#creating the random forest
floresta = RandomForestClassifier()

#computing the scores
scores_floresta = cross_val_score(floresta,X,y,cv=10)

In [52]:
#creating the neural network
mlp = MLPClassifier(hidden_layer_sizes=(100),activation='logistic',max_iter=1000)

#computing the scores
scores_mlp = cross_val_score(mlp,X,y,cv=10)

In [53]:
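print('Decision Tree: ', round(scores_arvore.mean(),3),round(scores_arvore.std(),3))
print('Random Forest: ', round(scores_floresta.mean(),3),round(scores_floresta.std(),3))
print('MLP:', round(scores_mlp.mean(),3),round(scores_mlp.std(),3))

Decision Tree:  0.625 0.288
Random Forest:  0.753 0.192
MLP: 0.744 0.205

The Random Forest line in this recorded output repeats the mean and standard deviation of the earlier MLP run (0.753, 0.192), because the cell that produced it printed `scores` instead of `scores_floresta`. The actual Random Forest scores are not recorded here.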
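
When several models are compared, keeping the estimators and their scores in dictionaries keyed by name avoids mixing up score variables. A minimal sketch on synthetic data; the model settings and `make_classification` data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X and y (hypothetical).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=20, random_state=0),
}

# Each model's scores are stored under its own name.
results = {name: cross_val_score(model, X_demo, y_demo, cv=5) for name, model in models.items()}

for name, s in results.items():
    print(f"{name}: {s.mean():.3f} {s.std():.3f}")
```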


## 8. Parameter optimization

* Problem
  * which parameter configuration is best for the model
* Optimization
  * choose the best element of a set
  * the meaning of "best" is given by an objective function
    * error rate

  <img src=https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/ac3f2f5a-9199-4bb7-8ce6-47e4dc307a0e.png width=500>

* "Simplest" solution
  * try every possibility
  * high computational cost
* Heuristic solution
  * stochastic optimization
  * search over the solution space
* Random search
  * samples a few points of the space at random and keeps the best

  <img src= https://maelfabien.github.io/assets/images/expl4_4.jpg width=500>
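
The idea of random search can be sketched in a few lines of plain Python. The objective `toy_error` below is a hypothetical stand-in for a cross-validated error rate, and the ranges for the learning rate and the number of hidden neurons are assumptions; no model is trained.

```python
import random

def toy_error(learning_rate, n_hidden):
    """Hypothetical error surface standing in for a cross-validated error rate."""
    return (learning_rate - 0.01) ** 2 * 1e3 + abs(n_hidden - 60) / 1000.0

rng = random.Random(0)  # fixed seed so the sampled points are reproducible

trials = []
for _ in range(20):
    lr = 10 ** rng.uniform(-4, -1)    # log-uniform sample in [1e-4, 1e-1]
    n_hidden = rng.randint(10, 100)   # integer sample in [10, 100]
    trials.append((toy_error(lr, n_hidden), lr, n_hidden))

# Keep the sampled point with the lowest error.
best_error, best_lr, best_hidden = min(trials)
print(best_error, best_lr, best_hidden)
```

Only 20 points are evaluated, so the result is the best of the sampled points, not necessarily the best point in the space.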

In [54]:
from sklearn.model_selection import RandomizedSearchCV

In [55]:
param_grid = [
              {
                  'hidden_layer_sizes': [(10),(50),(100),(50,10),(100,50)],
                  'activation': ['identity', 'logistic', 'tanh', 'relu'],
                  'solver': ['lbfgs', 'sgd', 'adam'],
                  'max_iter': [500,1000,2000]
              }
              
]

In [56]:
mlp = RandomizedSearchCV(MLPClassifier(),param_grid,cv=5,scoring='accuracy')

In [57]:
mlp.fit(X,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                           batch_size='auto', beta_1=0.9,
                                           beta_2=0.999, early_stopping=False,
                                           epsilon=1e-08,
                                           hidden_layer_sizes=(100,),
                                           learning_rate='constant',
                                           learning_rate_init=0.001,
                                           max_fun=15000, max_iter=200,
                                           momentum=0.9, n_iter_no_change=10,
                                           nesterovs_momentum=True, power_t=0.5,
                                           random...
                                           verbose=False, warm_start=False),
                   iid='deprecated', n_iter=10, n_jobs=None,
                   param_distributions=[{

In [58]:
print(mlp.best_params_)

{'solver': 'adam', 'max_iter': 2000, 'hidden_layer_sizes': 10, 'activation': 'identity'}


In [59]:
print(round(mlp.best_score_,3))

0.717


* Grid search
  * builds a reduced solution space as a lattice (grid)
  * evaluates every solution, keeping the best

  <img src=https://maelfabien.github.io/assets/images/expl4_1.jpg width=500>

In [60]:
from sklearn.model_selection import GridSearchCV

In [61]:
mlp = GridSearchCV(MLPClassifier(),param_grid,cv=5,scoring='accuracy')
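
The cost difference between the two searches can be estimated before fitting. The sketch below counts the candidates of a grid shaped like `param_grid` using `ParameterGrid`; the hidden layer sizes are written as tuples here, and the count of 10 random candidates assumes the `n_iter=10` shown in the `RandomizedSearchCV` output above.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as param_grid in this notebook (sizes written as tuples).
param_grid_demo = {
    "hidden_layer_sizes": [(10,), (50,), (100,), (50, 10), (100, 50)],
    "activation": ["identity", "logistic", "tanh", "relu"],
    "solver": ["lbfgs", "sgd", "adam"],
    "max_iter": [500, 1000, 2000],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # 5 * 4 * 3 * 3 combinations
cv_folds = 5

grid_fits = n_candidates * cv_folds   # every combination, once per fold
random_fits = 10 * cv_folds           # 10 sampled combinations, once per fold
print(n_candidates, grid_fits, random_fits)
```

Each search also refits the best configuration once on the full data by default, so the totals are one fit higher in practice.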

In [None]:
mlp.fit(X,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

In [None]:
print(mlp.best_params_)

In [None]:
print(mlp.best_score_)

In [None]:
mlp.cv_results_
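
`cv_results_` is a dictionary of arrays and is easier to read as a table. The sketch below runs a small grid search on synthetic data and ranks the candidates. A decision tree and a two-parameter grid are used only to keep the example fast; both are assumptions and not the search performed in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X and y (hypothetical).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 8], "criterion": ["gini", "entropy"]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# One row per candidate, best rank first.
table = (
    pd.DataFrame(search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(table)
```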