@author Joubert Alexandrino de Souza
@version 2020-12-11

-----------------------------------------------
# Evaluating the generalization of algorithms
-----------------------------------------------

## Choose a classification dataset and compare the Logistic Regression and KNN classifiers from scikit-learn.
## Use at least two forms of evaluation and repeat each at least 10 times.
## Compute the mean over the repetitions of each evaluation.

## Dataset

<b>Indian Liver Patient Records</b>

<b>Data available at:</b> https://www.kaggle.com/uciml/indian-liver-patient-records

<b>Context</b>

The number of patients with liver disease has been continuously increasing because of excessive consumption of alcohol, inhalation of harmful gases, and intake of contaminated food, pickles and drugs. This dataset was used to evaluate prediction algorithms in an effort to reduce the burden on doctors.

<b>Content</b>

This data set contains 416 liver patient records and 167 non liver patient records collected from North East of Andhra Pradesh, India. The "Dataset" column is a class label used to divide groups into liver patient (liver disease) or not (no disease). This data set contains 441 male patient records and 142 female patient records.
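As a reference point for the accuracies reported later, the class counts quoted above imply how well a trivial classifier that always predicts the majority class would score on the full dataset. The sketch below is only that arithmetic. The variable names are illustrative and are not part of the original notebook.

```python
# Class counts quoted in the dataset description
liver_patients = 416
non_liver_patients = 167

total = liver_patients + non_liver_patients

# Accuracy of a classifier that always predicts the majority class
majority_baseline = max(liver_patients, non_liver_patients) / total

print(f"records: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy close to this value should be read with care, since it can be reached without learning anything from the features.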

Any patient whose age exceeded 89 is listed as being of age "90".

<b>Columns:</b>

    Age of the patient
    Gender of the patient
    Total Bilirubin
    Direct Bilirubin
    Alkaline Phosphatase
    Alanine Aminotransferase
    Aspartate Aminotransferase
    Total Proteins
    Albumin
    Albumin and Globulin Ratio
    Dataset: field used to split the data into two sets (patient with liver disease, or no disease)

<b>Objective</b>

Use these patient records to determine which patients have liver disease and which ones do not. 

In [1]:
# Install the required libraries
!pip install category_encoders  # category_encoders is a package of encoders for categorical variables.



In [2]:
# Import the required libraries
#import matplotlib.pyplot as plt
import pandas as pd
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
import numpy as np
from sklearn.model_selection import KFold

# Settings
%matplotlib inline

In [3]:
# Load the data
dados = pd.read_csv('https://raw.githubusercontent.com/joubert-alexandrino/reconhecimento-padroes/main/indian_liver_patient.csv', sep=',')
dados.head()

| | Age | Gender | Total_Bilirubin | Direct_Bilirubin | Alkaline_Phosphotase | Alamine_Aminotransferase | Aspartate_Aminotransferase | Total_Protiens | Albumin | Albumin_and_Globulin_Ratio | Dataset |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 1 |
| 1 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 1 |
| 4 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 1 |


In [4]:
dados.shape

(583, 11)

In [5]:
# Split the feature set from the target
X, y = dados.drop(['Dataset'], axis=1, inplace=False), dados.Dataset

In [6]:
# Show the shapes of the split data
X.shape, y.shape

((583, 10), (583,))

In [7]:
# Check for missing values
X.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
dtype: int64
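The four missing values in `Albumin_and_Globulin_Ratio` are handled later in the pipelines by `SimpleImputer(strategy='mean')`. A minimal sketch of what that step does, using a toy column whose values are illustrative and not the real data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing entry (illustrative values only)
toy = np.array([[0.9], [0.74], [np.nan], [1.0]])

# strategy='mean' replaces each NaN with the mean of the observed
# values in the same column, learned during fit
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(toy)

print(filled.ravel())
```

Because the imputer sits inside a `Pipeline`, the column mean is learned on the training folds only and then applied to the held-out fold, which avoids leaking information from the evaluation data.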

### Define the general parameters for the two approaches: with and without data standardization

In [8]:
# Define the general parameters

# Number of rounds
rodadas = 10

# Number of folds
nfolds = 5

# Number of classifiers
nclfs = 2

# Performance matrices (one row per round, one column per classifier)
desempenho = np.empty((rodadas, nclfs))
desempenho2 = np.empty((rodadas, nclfs))

# Define the random states (one seed per round)
rstates = np.arange(10)
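The settings above (5 folds, one seed per round) determine how the 583 rows are partitioned in each round. The sketch below inspects the fold sizes and checks that a fixed `random_state` reproduces the same partition. It works on row indices only, and names such as `outer` and `inner` are illustrative, not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 583  # number of rows reported by dados.shape
nfolds = 5
indices = np.arange(n_samples)

# Outer splitter: 5 shuffled folds with a fixed seed
outer = KFold(n_splits=nfolds, shuffle=True, random_state=0)
outer_test_sizes = [len(test) for _, test in outer.split(indices)]

# The same seed yields exactly the same partition
again = KFold(n_splits=nfolds, shuffle=True, random_state=0)
first = [test.tolist() for _, test in outer.split(indices)]
second = [test.tolist() for _, test in again.split(indices)]
reproducible = first == second

# Inner splitter: 4 folds over the training part of one outer fold
train_idx, _ = next(outer.split(indices))
inner = KFold(n_splits=nfolds - 1, shuffle=True, random_state=0)
inner_val_sizes = [len(val) for _, val in inner.split(train_idx)]

print("outer test fold sizes:", outer_test_sizes)
print("inner validation fold sizes:", inner_val_sizes)
print("reproducible:", reproducible)
```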

### Approach 1: with data standardization (StandardScaler)
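`StandardScaler` rescales each feature to zero mean and unit variance. As a small illustration, the sketch below applies it to two columns (`Alkaline_Phosphotase` and `Total_Bilirubin`) from the first four rows printed by `dados.head()` and compares the result with the z-score computed by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns from the first four rows shown by dados.head():
# Alkaline_Phosphotase (hundreds) and Total_Bilirubin (units)
toy = np.array([[187.0, 0.7],
                [699.0, 10.9],
                [490.0, 7.3],
                [182.0, 1.0]])

scaled = StandardScaler().fit_transform(toy)

# Equivalent manual z-score; StandardScaler uses the population
# standard deviation (ddof=0), which is also NumPy's default
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled.round(3))
```

After scaling, both columns contribute on a comparable scale to distance-based methods such as KNN, whereas the raw values would let the larger-valued column dominate the distance.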

In [9]:
# Iterate over the rounds
for i in range(rodadas):

    # KFold splitters for the outer and inner loops
    kfe = KFold(n_splits=nfolds, shuffle=True, random_state=rstates[i])
    kfi = KFold(n_splits=nfolds - 1, shuffle=True, random_state=rstates[i])

    ########################
    # Logistic Regression
    ########################
    pip_lr = Pipeline([
        ('trata-valores-categoricos', OneHotEncoder(use_cat_names=True)),  # one-hot encode categorical columns
        ('trata-valores-nan', SimpleImputer(strategy='mean')),             # fill NaN with the column mean
        ('faz-a-padronizacao', StandardScaler()),                          # standardize the features
        ('lr', LogisticRegression())
        ])

    # Hyperparameter grid for the model: Logistic Regression
    params_lr = {'lr__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

    # Build the model (inner loop: grid search)
    modelo_lr = GridSearchCV(pip_lr, params_lr, scoring='accuracy', cv=kfi)

    # Run the nested cross-validation (outer loop)
    scores_lr = cross_validate(modelo_lr, X, y, scoring='accuracy', cv=kfe)

    ########################
    # KNeighbors Classifier
    ########################
    pip_knn = Pipeline([
        ('trata-valores-categoricos', OneHotEncoder(use_cat_names=True)),  # one-hot encode categorical columns
        ('trata-valores-nan', SimpleImputer(strategy='mean')),             # fill NaN with the column mean
        ('faz-a-padronizacao', StandardScaler()),                          # standardize the features
        ('knn', KNeighborsClassifier())
        ])

    # Hyperparameter grid for the model: KNeighbors Classifier
    params_knn = {'knn__n_neighbors': [3, 5, 7]}

    # Build the model (inner loop: grid search)
    modelo_knn = GridSearchCV(pip_knn, params_knn, scoring='accuracy', cv=kfi)

    # Run the nested cross-validation (outer loop)
    scores_knn = cross_validate(modelo_knn, X, y, scoring='accuracy', cv=kfe)

    # Store the mean test accuracy of each classifier for this round
    desempenho[i,0], desempenho[i,1] = np.mean(scores_lr['test_score']), np.mean(scores_knn['test_score'])    
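The loop above is a nested cross-validation: `cross_validate` runs the outer 5 folds, and inside each outer training set `GridSearchCV` evaluates every candidate on 4 inner folds, then refits the best candidate once (its default `refit=True`). The sketch below counts the resulting pipeline fits. The helper `fits_per_round` is hypothetical and only illustrates the arithmetic.

```python
outer_folds = 5
inner_folds = outer_folds - 1
rounds = 10

# Number of candidates in each hyperparameter grid
grid_sizes = {"lr": 5, "knn": 3}  # 5 solvers, 3 values of n_neighbors


def fits_per_round(n_candidates):
    """Pipeline fits for one classifier in one round of nested CV."""
    # Per outer fold: every candidate on every inner split,
    # plus one refit of the best candidate on the outer training set
    return outer_folds * (n_candidates * inner_folds + 1)


per_round = {name: fits_per_round(n) for name, n in grid_sizes.items()}
total_fits = rounds * sum(per_round.values())

print(per_round)
print("fits over all rounds:", total_fits)
```

This count explains why the cell takes noticeably longer than a single cross-validation, and why the reported test accuracy is never computed on data used to pick the hyperparameters.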

### Show the results of approach 1 per round

In [10]:
# Build the results DataFrame
novo = np.append(desempenho, (desempenho[:,0] >= desempenho[:,1]).reshape(-1,1), axis=1)
novo = np.append(novo, (desempenho[:,0] < desempenho[:,1]).reshape(-1,1), axis=1)
index = [f"Rodada {i+1}" for i in range(rodadas)]
resultados = pd.DataFrame(data=novo, index=index, columns=['Logistic Regression (Acurácia)', 'KNeighbors Classifier(Acurácia)', 'LR >= KNN?', 'KNN > LR'])
resultados

| | Logistic Regression (Acurácia) | KNeighbors Classifier(Acurácia) | LR >= KNN? | KNN > LR |
|---|---|---|---|---|
| Rodada 1 | 0.723814 | 0.672384 | 1.0 | 0.0 |
| Rodada 2 | 0.717006 | 0.65868 | 1.0 | 0.0 |
| Rodada 3 | 0.71008 | 0.644916 | 1.0 | 0.0 |
| Rodada 4 | 0.711804 | 0.653596 | 1.0 | 0.0 |
| Rodada 5 | 0.717006 | 0.658665 | 1.0 | 0.0 |
| Rodada 6 | 0.71369 | 0.639847 | 1.0 | 0.0 |
| Rodada 7 | 0.717006 | 0.651813 | 1.0 | 0.0 |
| Rodada 8 | 0.72041 | 0.646714 | 1.0 | 0.0 |
| Rodada 9 | 0.698084 | 0.657073 | 1.0 | 0.0 |
| Rodada 10 | 0.713469 | 0.67931 | 1.0 | 0.0 |


### Show the mean results of approach 1

In [11]:
media = pd.DataFrame(data=np.mean(desempenho, 
                     axis=0).reshape(-1,2), 
                     index=[f'Resultado médio após {rodadas} rodadas'], 
                     columns=['Logistic Regression (Acurácia)', 'KNeighbors Classifier(Acurácia)'])
media

| | Logistic Regression (Acurácia) | KNeighbors Classifier(Acurácia) |
|---|---|---|
| Resultado médio após 10 rodadas | 0.714237 | 0.6563 |
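The mean alone hides how much the accuracy varies between rounds. The sketch below re-enters the per-round accuracies printed above for approach 1 and reports the sample standard deviation and the paired per-round difference. It is an optional complement, not part of the original notebook, and it recomputes nothing from the raw data.

```python
import numpy as np

# Per-round accuracies copied from the approach 1 results table
lr = np.array([0.723814, 0.717006, 0.71008, 0.711804, 0.717006,
               0.71369, 0.717006, 0.72041, 0.698084, 0.713469])
knn = np.array([0.672384, 0.65868, 0.644916, 0.653596, 0.658665,
                0.639847, 0.651813, 0.646714, 0.657073, 0.67931])

# Paired difference: both classifiers use the same splits in each round
diff = lr - knn

print(f"LR : mean={lr.mean():.4f}  std={lr.std(ddof=1):.4f}")
print(f"KNN: mean={knn.mean():.4f}  std={knn.std(ddof=1):.4f}")
print(f"LR - KNN: mean={diff.mean():.4f}  min={diff.min():.4f}")
```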


### Approach 2: without data standardization

In [12]:
# Iterate over the rounds
for i in range(rodadas):

    # KFold splitters for the outer and inner loops
    kfe = KFold(n_splits=nfolds, shuffle=True, random_state=rstates[i])
    kfi = KFold(n_splits=nfolds - 1, shuffle=True, random_state=rstates[i])

    ########################
    # Logistic Regression
    ########################
    pip_lr = Pipeline([
        ('trata-valores-categoricos', OneHotEncoder(use_cat_names=True)),  # one-hot encode categorical columns
        ('trata-valores-nan', SimpleImputer(strategy='mean')),             # fill NaN with the column mean
        ('lr', LogisticRegression())
        ])

    # Hyperparameter grid for the model: Logistic Regression
    params_lr = {'lr__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

    # Build the model (inner loop: grid search)
    modelo_lr = GridSearchCV(pip_lr, params_lr, scoring='accuracy', cv=kfi)

    # Run the nested cross-validation (outer loop)
    scores_lr = cross_validate(modelo_lr, X, y, scoring='accuracy', cv=kfe)

    ########################
    # KNeighbors Classifier
    ########################
    pip_knn = Pipeline([
        ('trata-valores-categoricos', OneHotEncoder(use_cat_names=True)),  # one-hot encode categorical columns
        ('trata-valores-nan', SimpleImputer(strategy='mean')),             # fill NaN with the column mean
        ('knn', KNeighborsClassifier())
        ])

    # Hyperparameter grid for the model: KNeighbors Classifier
    params_knn = {'knn__n_neighbors': [3, 5, 7]}

    # Build the model (inner loop: grid search)
    modelo_knn = GridSearchCV(pip_knn, params_knn, scoring='accuracy', cv=kfi)

    # Run the nested cross-validation (outer loop)
    scores_knn = cross_validate(modelo_knn, X, y, scoring='accuracy', cv=kfe)

    # Store the mean test accuracy of each classifier for this round
    desempenho2[i,0], desempenho2[i,1] = np.mean(scores_lr['test_score']), np.mean(scores_knn['test_score'])    

### Show the results of approach 2 per round

In [13]:
# Build the results DataFrame
novo2 = np.append(desempenho2, (desempenho2[:,0] >= desempenho2[:,1]).reshape(-1,1), axis=1)
novo2 = np.append(novo2, (desempenho2[:,0] < desempenho2[:,1]).reshape(-1,1), axis=1)
index2 = [f"Rodada {i+1}" for i in range(rodadas)]
resultados2 = pd.DataFrame(data=novo2, index=index2, columns=['Logistic Regression (Acurácia)', 'KNeighbors Classifier(Acurácia)', 'LR >= KNN?', 'KNN > LR'])
resultados2

| | Logistic Regression (Acurácia) | KNeighbors Classifier(Acurácia) | LR >= KNN? | KNN > LR |
|---|---|---|---|---|
| Rodada 1 | 0.718627 | 0.679325 | 1.0 | 0.0 |
| Rodada 2 | 0.708429 | 0.701562 | 1.0 | 0.0 |
| Rodada 3 | 0.701503 | 0.648335 | 1.0 | 0.0 |
| Rodada 4 | 0.69807 | 0.675818 | 1.0 | 0.0 |
| Rodada 5 | 0.703212 | 0.689596 | 1.0 | 0.0 |
| Rodada 6 | 0.717065 | 0.686133 | 1.0 | 0.0 |
| Rodada 7 | 0.701592 | 0.662039 | 1.0 | 0.0 |
| Rodada 8 | 0.715252 | 0.66898 | 1.0 | 0.0 |
| Rodada 9 | 0.699838 | 0.662202 | 1.0 | 0.0 |
| Rodada 10 | 0.708296 | 0.679134 | 1.0 | 0.0 |


### Show the mean results of approach 2

In [14]:
media2 = pd.DataFrame(data=np.mean(desempenho2, 
                     axis=0).reshape(-1,2), 
                     index=[f'Resultado médio após {rodadas} rodadas'], 
                     columns=['Logistic Regression (Acurácia)', 'KNeighbors Classifier(Acurácia)'])
media2

| | Logistic Regression (Acurácia) | KNeighbors Classifier(Acurácia) |
|---|---|---|
| Resultado médio após 10 rodadas | 0.707188 | 0.675312 |


### Show the mean results of both approaches

In [17]:
mmedias = pd.concat([media, media2], keys=["Com padronização - ", "Sem padronização - "])
mmedias

| | | Logistic Regression (Acurácia) | KNeighbors Classifier(Acurácia) |
|---|---|---|---|
| Com padronização - | Resultado médio após 10 rodadas | 0.714237 | 0.6563 |
| Sem padronização - | Resultado médio após 10 rodadas | 0.707188 | 0.675312 |


### Conclusion

For both approaches, with and without standardization, the Logistic Regression classifier achieved the higher mean accuracy: 0.714 versus 0.656 for KNN with standardization, and 0.707 versus 0.675 without.
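The size of the gap behind this conclusion can be read directly from the two mean rows. The sketch below copies the four mean accuracies reported above and computes the difference between the classifiers in each approach, along with the change each classifier shows when standardization is applied. The dictionary layout and key names are illustrative.

```python
# Mean accuracies copied from the summary of both approaches
means = {
    "with standardization": {"lr": 0.714237, "knn": 0.6563},
    "without standardization": {"lr": 0.707188, "knn": 0.675312},
}

# Gap between the classifiers within each approach (LR minus KNN)
gaps = {name: row["lr"] - row["knn"] for name, row in means.items()}

# Change in each classifier's mean accuracy when standardizing
effect = {
    clf: means["with standardization"][clf] - means["without standardization"][clf]
    for clf in ("lr", "knn")
}

for name, gap in gaps.items():
    print(f"LR - KNN, {name}: {gap:+.6f}")
for clf, delta in effect.items():
    print(f"{clf}: change with standardization {delta:+.6f}")
```

In these runs the gap is positive in both approaches, while standardization moves the two classifiers in opposite directions: slightly up for Logistic Regression and down for KNN.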