# Activity: Generalization

### Evaluating the generalization of algorithms


Choose a classification dataset and compare scikit-learn's Logistic Regression and KNN classifiers.

Use at least two evaluation schemes and repeat each one at least 10 times.

Compute the mean over the repetitions of each evaluation.
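
The protocol above can be sketched as follows. This is a minimal illustration, not the notebook's own solution: it assumes synthetic data from `make_classification` as a stand-in for the real dataset, and it picks repeated stratified hold-out and repeated stratified 5-fold as the two evaluation schemes. The names `X_demo`, `y_demo`, `schemes`, and `mean_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    StratifiedShuffleSplit,
    cross_val_score,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): swap in the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1,
    n_clusters_per_class=1, random_state=0,
)

# Scaling lives inside each pipeline so it is refit on every training split.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Two evaluation schemes, each yielding at least 10 repetitions.
schemes = {
    "holdout x10": StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
    "5-fold x10": RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0),
}

mean_accuracy = {}
for model_name, model in models.items():
    for scheme_name, cv in schemes.items():
        scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
        mean_accuracy[(model_name, scheme_name)] = scores.mean()
        print(f"{model_name:20s} {scheme_name:12s} "
              f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the standard deviation next to the mean helps judge whether a difference between the two classifiers is larger than the variation between repetitions.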

## Dataset

### Dataset for classification of bank notes

The Banknote Dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. The variable names are as follows:

    c1: Variance of Wavelet Transformed image (continuous).
    c2: Skewness of Wavelet Transformed image (continuous).
    c3: Kurtosis of Wavelet Transformed image (continuous).
    c4: Entropy of image (continuous).
    category: Class (0 for authentic, 1 for inauthentic).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 56% (762 of the 1,372 observations belong to class 0).

https://archive.ics.uci.edu/ml/datasets/banknote+authentication

## Reading the Data

In [65]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [66]:
# Read the CSV file
df = pd.read_csv("../datasets/data_banknote_authentication.csv")
df.head()

|   | c1 | c2 | c3 | c4 | category |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


## Null-value check and descriptive statistics

In [67]:
# Check the data types and whether there are null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   c1        1372 non-null   float64
 1   c2        1372 non-null   float64
 2   c3        1372 non-null   float64
 3   c4        1372 non-null   float64
 4   category  1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


In [68]:
# Summary statistics for the dataset
df.describe()

|   | c1 | c2 | c3 | c4 | category |
|---|---|---|---|---|---|
| count | 1372.0 | 1372.0 | 1372.0 | 1372.0 | 1372.0 |
| mean | 0.433735 | 1.922353 | 1.397627 | -1.191657 | 0.444606 |
| std | 2.842763 | 5.869047 | 4.31003 | 2.101013 | 0.497103 |
| min | -7.0421 | -13.7731 | -5.2861 | -8.5482 | 0.0 |
| 25% | -1.773 | -1.7082 | -1.574975 | -2.41345 | 0.0 |
| 50% | 0.49618 | 2.31965 | 0.61663 | -0.58665 | 0.0 |
| 75% | 2.821475 | 6.814625 | 3.17925 | 0.39481 | 1.0 |
| max | 6.8248 | 12.9516 | 17.9274 | 2.4495 | 1.0 |


In [69]:
# Count the records in each class
df_numrecords = df.groupby('category').size().reset_index(name='qtd')
print("Number of records per class")
df_numrecords

Number of records per class


|   | category | qtd |
|---|---|---|
| 0 | 0 | 762 |
| 1 | 1 | 610 |
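
As a quick check, the majority-class baseline can be computed directly from these counts. The snippet below is a small sketch that hard-codes the two counts shown above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the table above.
class_counts = {0: 762, 1: 610}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy = class_counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Always predicting class 0 would therefore be right about 55.5% of the time, which is the figure any trained classifier should beat.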


## Splitting into training and test sets

In [70]:
from sklearn.model_selection import train_test_split
# Separate the features from the labels
y = df['category']
X = df.drop(['category'], axis=1)

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23, shuffle=True)

print(f"X_train shape :{X_train.shape}")
print(f"X_test shape :{X_test.shape}")
print(f"y_train shape :{y_train.shape}")
print(f"y_test shape :{y_test.shape}")

X_train shape :(1097, 4)
X_test shape :(275, 4)
y_train shape :(1097,)
y_test shape :(275,)


## Model development with pipelines

- Pipelines were created for the two classifiers: KNN and Logistic Regression.

- The features were standardized with **StandardScaler** before being fed to the models.

- The training set was split into 10 folds using **K-Fold** cross-validation (cv=10): in each round, 9/10 of the training data is used for fitting and 1/10 for validation. Because `cv` is passed as an integer and the estimators are classifiers, scikit-learn stratifies the folds by class.

- GridSearchCV was used to compare model performance across different hyperparameters, and the best-scoring configuration of each model was kept.

- The selected models were then evaluated on the held-out test set.
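
To make the cross-validation step concrete, the sketch below shows how scikit-learn resolves an integer `cv` for a classification target. It uses `check_cv`, the public helper in `sklearn.model_selection`, on a hypothetical label vector `y_demo` built with the same class counts as this dataset (762 and 610) rather than on the real data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

# Hypothetical labels with the same class counts as this dataset.
y_demo = np.array([0] * 762 + [1] * 610)

# Resolve cv=10 for a classifier with a binary target.
cv = check_cv(10, y_demo, classifier=True)
print(type(cv).__name__, cv.get_n_splits())

# Each validation fold keeps roughly the overall share of class 1 (about 44%).
X_placeholder = np.zeros((len(y_demo), 1))
fold_ratios = [
    y_demo[test_idx].mean() for _, test_idx in cv.split(X_placeholder, y_demo)
]
print([round(r, 3) for r in fold_ratios])
```

Stratification matters here because the classes are not perfectly balanced; it keeps every fold's class proportion close to that of the full training set.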

In [71]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_validate, KFold
from sklearn.preprocessing import StandardScaler

# Create the pipelines
pipe_reglog = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('LR', LogisticRegression())
])

pipe_knn = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('KNN', KNeighborsClassifier())
])

# Grid Search parameters
reglog_grid_params = [{'LR__penalty': ['l1', 'l2'],
                       'LR__C': [1, 2, 3, 4, 5, 6],
                       'LR__solver': ['liblinear']}]

knn_grid_params = [{'KNN__n_neighbors': [1, 2, 3, 4, 5, 6],
                    'KNN__weights': ['uniform', 'distance'],
                    'KNN__metric': ['euclidean', 'manhattan']}]


# Configure the grid searches
knnGridSearch = GridSearchCV(estimator=pipe_knn, param_grid=knn_grid_params, scoring='accuracy', verbose=3, cv=10)

reglogGridSearch = GridSearchCV(estimator=pipe_reglog, param_grid=reglog_grid_params, scoring='accuracy', verbose=3, cv=10)

list_of_grids = [knnGridSearch, reglogGridSearch]
name_of_grids = ["KNN", "Regressão Logística"]

In [72]:
# Fit each grid search on the training set
for grid in list_of_grids:
    grid.fit(X_train, y_train)


Fitting 10 folds for each of 24 candidates, totalling 240 fits
[CV 1/10] END KNN__metric=euclidean, KNN__n_neighbors=1, KNN__weights=uniform;, score=0.991 total time=   0.0s
[CV 2/10] END KNN__metric=euclidean, KNN__n_neighbors=1, KNN__weights=uniform;, score=1.000 total time=   0.0s
[CV 3/10] END KNN__metric=euclidean, KNN__n_neighbors=1, KNN__weights=uniform;, score=1.000 total time=   0.0s
[CV 4/10] END KNN__metric=euclidean, KNN__n_neighbors=1, KNN__weights=uniform;, score=1.000 total time=   0.0s
[CV 5/10] END KNN__metric=euclidean, KNN__n_neighbors=1, KNN__weights=uniform;, score=1.000 total time=   0.0s
[CV 6/10] END KNN__metric=euclidean, KNN__n_neighbors=1, KNN__weights=uniform;, score=1.000 total time=   0.0s
[CV 7/10] END KNN__metric=euclidean, KNN__n_neighbors=1, KNN__weights=uniform;, score=0.991 total time=   0.0s
[CV 8/10] END KNN__metric=euclidean, KNN__n_neighbors=1, KNN__weights=uniform;, score=1.000 total time=   0.0s
[CV 9/10] END KNN__metric=euclidean, KNN__n_neigh

## Model evaluation
On the test set, KNN achieved higher accuracy (1.000) than the Logistic Regression classifier (0.985).

In [73]:
# Evaluate the models on the test set
for name, model in zip(name_of_grids, list_of_grids):
    print(f'{name} Test Accuracy: {model.score(X_test, y_test)}')
    print(f'{name} Best Params: {model.best_params_}')

KNN Test Accuracy: 1.0
KNN Best Params: {'KNN__metric': 'euclidean', 'KNN__n_neighbors': 1, 'KNN__weights': 'uniform'}
Regressão Logística Test Accuracy: 0.9854545454545455
Regressão Logística Best Params: {'LR__C': 2, 'LR__penalty': 'l1', 'LR__solver': 'liblinear'}
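
The test accuracy above comes from a single split, so it is worth reading it alongside the per-fold mean and standard deviation that `GridSearchCV` stores in `cv_results_`. The sketch below shows one way to do that. It is self-contained and therefore fits a small hypothetical search (`demo_search`) on synthetic data; with the notebook's fitted objects, the same columns could be read from `knnGridSearch.cv_results_` and `reglogGridSearch.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), only to keep the example runnable.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_clusters_per_class=1, random_state=0,
)

demo_search = GridSearchCV(
    estimator=Pipeline(steps=[("scaler", StandardScaler()),
                              ("KNN", KNeighborsClassifier())]),
    param_grid={"KNN__n_neighbors": [1, 3, 5]},
    scoring="accuracy",
    cv=10,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of accuracy over the 10 folds.
results = pd.DataFrame(demo_search.cv_results_)
summary = results[["param_KNN__n_neighbors", "mean_test_score",
                   "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
print(f"Best mean CV accuracy: {demo_search.best_score_:.3f}")
```

If the gap between two candidates' mean scores is smaller than their fold-to-fold standard deviation, the ranking between them should be treated with caution.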
