# Julio Patti Pereira 
- 14/02/2025

# Introduction

- This is a complementary work to the notebook "Explore_data_and_get_model_1.ipynb".
   - Read it for context.

- Among the studies carried out, this one stood out for two reasons:
   - 1st: while oversampling techniques were ineffective, UNDERSAMPLING produced an interesting result.
   - 2nd: the technique greatly increases Recall on the good-wine class, at the cost of lower Recall on the bad-wine class.
      - Because the bad-wine class has many more samples than the good-wine class, the Recall lost on bad wines turns into lost precision on good wines.
         - In other words, many more good wines are caught, but the number of bad wines the model labels "good" means it cannot be called a precise solution for identifying good wines.
      - On the other hand, the higher Recall on the good-wine class raises the precision of the bad-wine classification, since few good wines end up labeled "bad".
         - When the model evaluates a bad wine, it performs well overall but still makes a non-negligible number of errors.
            - Its precision, however, is excellent: if the model says a wine is bad, it can be trusted.
               - Across cross-validation, the error (in precision!) stayed below 5% in every fold with a small spread, and precision on the blind (test) set was 98%.

      - Models this precise, even in a single class, have interesting applications, whether because the result can be trusted when that class (bad wines) is assigned, or because they can be used in sequential models.
         - The model could be used, for example, as the first layer of a solution in which every sample classified as "bad" receives its final label, while the rest are passed to another model for re-evaluation.
            - One advantage of this strategy would be reduced class imbalance in the data that is passed on.

- Although this result is interesting, this will be the second and last model of this work, since the focus here is not machine learning.
   - This step therefore provides the second AI solution to be used in the API being developed.
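
The sequential ("first layer") idea described above can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions: `cascade_predict`, `permissive_stage` and `strict_stage` are hypothetical names, and the two toy rules stand in for trained models; none of this is part of the notebook.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a classifier is any callable mapping a feature row to 0 (bad) or 1 (good).
Classifier = Callable[[Sequence[float]], int]


def cascade_predict(first_stage: Classifier,
                    second_stage: Classifier,
                    rows: List[Sequence[float]]) -> List[int]:
    """Two-stage prediction: a 0 from the first stage is final, everything else is re-evaluated."""
    predictions = []
    for row in rows:
        if first_stage(row) == 0:
            # The first stage is trusted when it says "bad" (high class-0 precision).
            predictions.append(0)
        else:
            # Rows flagged as "good" go to the second model for the final decision.
            predictions.append(second_stage(row))
    return predictions


# Toy stand-ins for trained models (assumptions). Each row is (alcohol, volatile_acidity).
def permissive_stage(row):
    return 1 if row[0] >= 10.0 else 0


def strict_stage(row):
    return 1 if row[0] >= 11.5 and row[1] <= 0.4 else 0


wines = [(9.4, 0.70), (10.6, 0.32), (12.0, 0.35), (11.8, 0.60)]
final = cascade_predict(permissive_stage, strict_stage, wines)
print(final)
```

In this toy run only three of the four rows reach the second stage, which is the mechanism by which such a cascade would reduce the imbalance seen by the second model.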


# Library imports

In [45]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, precision_score
from lightgbm import LGBMClassifier
from skopt import gp_minimize
from skopt.space import Real, Categorical, Integer
import pickle

In [None]:
# Custom data-reduction function. Note: undersampling must not be (and is not) applied to validation or test data
def undersample(df):
    total = df.shape[0]
    n_78 = (df['quality']>=7).sum()   # good wines (class 1)
    n_43 = (df['quality']<=4).sum()   # lowest grades, always kept
    n5 = (df['quality']==5).sum()
    n6 = (df['quality']==6).sum()

    # Number of grade 5/6 rows to keep so that class 0 ends up the same size as class 1
    s56 = n_78 - n_43
    s6 = s56//2
    s5 = s56 - s6

    samp_df = df[~(df['quality'].isin([5,6]))].copy()
    samp_5 = df[df['quality']==5].sample(s5, random_state=2025).copy()
    samp_6 = df[df['quality']==6].sample(s6, random_state=2025).copy()

    samp_df = pd.concat([samp_df, samp_5, samp_6]).reset_index(drop=True)
    return samp_df

def dist_quality(df, column='quality'):
    count = df[column].value_counts()
    prop = df[column].value_counts(normalize=True)
    return pd.concat([count, prop], axis=1)


# Import train data

In [41]:
df_train = pd.read_csv('../data/df_train.csv')
df_train.describe()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
count,1087.0,1087.0,1087.0,1087.0,1087.0,1087.0,1087.0,1087.0,1087.0,1087.0,1087.0,1087.0
mean,8.264305,0.531463,0.267056,2.513937,0.087316,15.858786,46.559798,0.996671,3.313017,0.654802,10.432475,5.625575
std,1.707009,0.18663,0.193222,1.331599,0.046865,10.534171,33.126142,0.001839,0.154486,0.160925,1.073139,0.824693
min,4.6,0.12,0.0,1.2,0.012,1.0,6.0,0.9902,2.88,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.069,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.25,2.2,0.079,13.0,38.0,0.99668,3.31,0.62,10.1,6.0
75%,9.1,0.64,0.42,2.6,0.09,21.0,62.0,0.9978,3.4,0.72,11.1,6.0
max,15.9,1.58,0.76,15.5,0.611,72.0,289.0,1.00369,4.01,1.95,14.9,8.0


# Binarizing the problem
- A wine is considered "good" if its grade is 7 or higher
- The problem is reduced to the pair 0, 1, with 1 being the class of "good" wines

In [43]:
df_train['bin_quality'] = df_train['quality'].astype(int)>=7
df_train['bin_quality'] = df_train['bin_quality'].astype(int)

feature_columns = df_train.drop(columns=['quality', 'bin_quality']).columns.tolist()
dist_quality(df_train, column='bin_quality')

bin_quality,count,proportion
0,939,0.863845
1,148,0.136155


In [11]:
dist_quality(df_train[df_train['bin_quality']==1], column='quality')

quality,count,proportion
7,134,0.905405
8,14,0.094595


In [12]:
dist_quality(df_train[df_train['bin_quality']==0], column='quality')

quality,count,proportion
5,461,0.490948
6,428,0.455804
4,42,0.044728
3,8,0.00852
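
As a sanity check, the quotas that `undersample` would draw when applied to the full training set can be worked out by hand from the per-grade counts in the tables above. The sketch below only redoes that arithmetic; the variable names are illustrative, and inside cross-validation the counts of each training fold would differ.

```python
# Per-grade counts of the full training set, copied from the tables above.
counts = {3: 8, 4: 42, 5: 461, 6: 428, 7: 134, 8: 14}

n_good = counts[7] + counts[8]   # grades >= 7 (class 1)
n_low = counts[3] + counts[4]    # grades <= 4, always kept
quota_56 = n_good - n_low        # rows to draw from grades 5 and 6 together
quota_6 = quota_56 // 2
quota_5 = quota_56 - quota_6

n_bad_after = n_low + quota_5 + quota_6
print(n_good, n_low, quota_5, quota_6, n_bad_after)
```

After the reduction, class 0 has as many rows as class 1. Note that the scheme implicitly assumes there are more good wines than grade 3/4 wines and that each of grades 5 and 6 has at least as many rows as its quota.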


# Standard solution
- Cross-validation is deliberately run with a split stratified on "quality" so that the folds are more homogeneous
- It is also done in a "less automatic" way: if a future study on resampling the training data is carried out, the strategy can be customized more easily without the libraries that automate this procedure
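
To make the idea of a stratified split concrete, here is a minimal, dependency-free sketch. The notebook itself uses scikit-learn's `StratifiedKFold`; `stratified_folds` is a hypothetical helper written only for illustration, and the labels are toy values rather than the wine data.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed=2025):
    """Assign each sample index to a fold so every label is spread evenly across folds."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin so per-fold counts of a label differ by at most one.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy labels (not the wine data): 8 samples of grade 5 and 4 of grade 7.
labels = [5] * 8 + [7] * 4
folds = stratified_folds(labels, n_splits=4)

for k, val_idx in enumerate(folds):
    val_set = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in val_set]
    # Any resampling (e.g. undersampling) would be applied to train_idx only, never to val_idx.
    print(k, dict(Counter(labels[i] for i in val_idx)), len(train_idx))
```

Writing the loop by hand, as the notebook does, leaves a natural place to resample the training part of each fold while keeping the validation part untouched.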

In [13]:
def calcular_estatisticas(lista, output=True):
    # Mean, standard deviation, minimum and maximum of the scores, as percentages.
    # Local names avoid shadowing the built-in min() and max().
    mean = round(np.mean(lista) * 100, 1)
    std = round(np.std(lista) * 100, 1)
    min_val = round(np.min(lista) * 100, 1)
    max_val = round(np.max(lista) * 100, 1)

    if output:
        print(f"Média da validação cruzada (std): {mean}% ({std}%)\n(Min/Máx): ({min_val}%/{max_val}%)")

    return mean, std, min_val, max_val

In [None]:
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=2025)
auc = []
for train_index_fold, val_index_fold in skf.split(df_train, df_train['quality']):
    df_train_fold, df_val_fold = df_train.loc[train_index_fold], df_train.loc[val_index_fold]
    
    X_fold = df_train_fold[feature_columns]
    y_fold = df_train_fold['bin_quality']
    X_val = df_val_fold[feature_columns]
    y_val = df_val_fold['bin_quality']
    
    model = LGBMClassifier(class_weight='balanced', verbose=-1)
    model.fit(X_fold, y_fold)
    # y_val_pred = model.predict(X_val)
    y_val_proba = model.predict_proba(X_val)[:,1]
    auc.append(roc_auc_score(y_val, y_val_proba))
    
# Compute mean, standard deviation, minimum and maximum
print('AUC')
mean, std, min_auc, max_auc = calcular_estatisticas(auc)

AUC
Média da validação cruzada (std): 86.4% (2.4%)
(Min/Máx): (83.6%/89.0%)
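
The AUC reported above can be read as the probability that a randomly chosen good wine receives a higher score than a randomly chosen bad one. The following is a minimal, dependency-free sketch of that definition. The notebook relies on `roc_auc_score`; the `roc_auc` helper and the toy scores here are assumptions for illustration, not model output.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (illustrative values only).
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]
auc_value = roc_auc(y_true, scores)
print(round(auc_value, 3))
```

Because this pairwise view depends only on how scores are ordered within each class, AUC is not affected by the 0.5 decision threshold, which is one reason it is a convenient optimization target for an imbalanced problem like this one.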


# Hyperparameter optimization

In [15]:
space = [Real(low=1e-4, high=3e-1, prior='log-uniform'),   # learning_rate     = params[0]
         Integer(low=2, high=128),                         # num_leaves        = params[1]
         Integer(low=1, high=200),                         # min_child_samples = params[2]
         Integer(low=100, high=500),                       # max_bin           = params[3]
         Real(low=0.05, high=1.0),                         # subsample         = params[4]
         Real(low=0.1, high=1.0),                          # colsample_bytree  = params[5]
         Integer(low=100, high=300)                        # n_estimators      = params[6]
         ]


def get_model(params):

    model = LGBMClassifier(
        learning_rate     = params[0],
        num_leaves        = params[1],
        min_child_samples = params[2],
        max_bin           = params[3],
        subsample         = params[4],
        colsample_bytree  = params[5],
        n_estimators      = params[6],
        subsample_freq    = 1,
        class_weight      = 'balanced',
        random_state      = 2025
    )
    
    return model


def objective_minimize(params):
    # print(params)
    auc = []
    for train_index_fold, val_index_fold in skf.split(df_train, df_train['quality']):
        df_train_fold, df_val_fold = df_train.loc[train_index_fold], df_train.loc[val_index_fold]
        
        df_train_fold = undersample(df_train_fold)
        
        X_fold = df_train_fold[feature_columns]
        y_fold = df_train_fold['bin_quality']
        X_val = df_val_fold[feature_columns]
        y_val = df_val_fold['bin_quality']
        
        model = get_model(params)
        model.fit(X_fold, y_fold)
        # y_val_pred = model.predict(X_val)
        y_val_proba = model.predict_proba(X_val)[:,1]
        auc.append(roc_auc_score(y_val, y_val_proba))
    metric = float(np.mean(auc))
    
    return - metric

In [16]:
# (Hyper)parameter optimization
result = gp_minimize(objective_minimize, space, random_state=2025, n_calls=100, n_random_starts=30)

# Best parameters
best_params = result.x
print(best_params)

[0.016354585363571775, np.int64(49), np.int64(15), np.int64(500), 0.7629504231845382, 1.0, np.int64(146)]
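
Since `result.x` is a plain list, its values are easier to read when paired with the names annotated in the `space` definition. A small sketch, with the printed values copied by hand and the NumPy integers written as plain ints (`param_names`, `best_params_printed` and `named_params` are illustrative names, not part of the notebook):

```python
# Order of the search dimensions, as annotated in the `space` definition.
param_names = ["learning_rate", "num_leaves", "min_child_samples", "max_bin",
               "subsample", "colsample_bytree", "n_estimators"]

# Values printed above, copied by hand.
best_params_printed = [0.016354585363571775, 49, 15, 500, 0.7629504231845382, 1.0, 146]

named_params = dict(zip(param_names, best_params_printed))
for name, value in named_params.items():
    print(f"{name:>18}: {value}")
```

Two of the values sit on the edge of the search space (`max_bin` at its upper bound of 500 and `colsample_bytree` at 1.0), which may be worth keeping in mind if the search is ever revisited.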


# Model selection and evaluation
- First, evaluate the mean performance and its spread to decide whether the configuration used delivers the desired consistency
- Check whether there is an improvement over other strategies
- Finally, the model must be specified. Common ways of choosing it include:
    - Given the best parameters, use the best-performing model among the cross-validation folds
    - "Refit" the model, that is, train a new model with the best hyperparameters on the entire training set
- Because the minority class has few samples, the second approach was chosen, since it uses more data for training
    - Even so, cross-validation with the "best parameters" is run for comparison with the default-mode evaluation performed earlier.

In [47]:
# Cross-validation
auc = []
for train_index_fold, val_index_fold in skf.split(df_train, df_train['quality']):
    df_train_fold, df_val_fold = df_train.loc[train_index_fold], df_train.loc[val_index_fold]
    
    df_train_fold = undersample(df_train_fold)
    
    X_fold = df_train_fold[feature_columns]
    y_fold = df_train_fold['bin_quality']
    X_val = df_val_fold[feature_columns]
    y_val = df_val_fold['bin_quality']
    
    model = get_model(best_params)
    model.fit(X_fold, y_fold)
    y_val_pred = model.predict(X_val)
    y_val_proba = model.predict_proba(X_val)[:,1]
    auc.append(roc_auc_score(y_val, y_val_proba))
    print(confusion_matrix(y_val, y_val_pred))
    print(precision_score(y_val, y_val_pred, pos_label=0))
    
# Compute mean, standard deviation, minimum and maximum
print('AUC')
mean, std, min_auc, max_auc = calcular_estatisticas(auc)

[[191  45]
 [  5  31]]
0.9744897959183674
[[179  55]
 [  4  34]]
0.9781420765027322
[[177  57]
 [  9  29]]
0.9516129032258065
[[178  57]
 [  7  29]]
0.9621621621621622
AUC
Média da validação cruzada (std): 87.5% (3.4%)
(Min/Máx): (82.3%/90.5%)
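
The per-fold figures printed above can be summarized directly from the confusion matrices. The sketch below recomputes the class-0 ("bad") precision and the class-1 ("good") recall for each fold using only the numbers shown; it assumes scikit-learn's layout of rows as true classes and columns as predicted classes, and uses the population standard deviation to match `np.std`.

```python
import statistics

# Confusion matrices printed above, one per fold: [[TN, FP], [FN, TP]] with class 1 = "good".
fold_matrices = [
    [[191, 45], [5, 31]],
    [[179, 55], [4, 34]],
    [[177, 57], [9, 29]],
    [[178, 57], [7, 29]],
]

# Precision of class 0: of all wines predicted "bad", how many really are bad.
precision_bad = [tn / (tn + fn) for (tn, fp), (fn, tp) in fold_matrices]
# Recall of class 1: of all truly good wines, how many were predicted "good".
recall_good = [tp / (tp + fn) for (tn, fp), (fn, tp) in fold_matrices]

mean_precision_bad = statistics.mean(precision_bad)
std_precision_bad = statistics.pstdev(precision_bad)

print("class-0 precision per fold:", [round(p, 3) for p in precision_bad])
print("mean / std:", round(mean_precision_bad, 3), round(std_precision_bad, 3))
print("class-1 recall per fold:", [round(r, 3) for r in recall_good])
```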


In [19]:
# Final model
df_train = undersample(df_train)
X_train = df_train[feature_columns]
y_train = df_train['bin_quality']

model = get_model(best_params)
model.fit(X_train, y_train)

# Simulating performance in production/blind set
- General metrics
- Note: given the optimization metric, the blind-set result agrees closely with the cross-validation estimate: it is close to the mean and within one standard deviation.
   - Cross-validation: mean 87.5%, std 3.4% (AUC)
   - Blind test: 88.2% (AUC)

In [20]:
df_test = pd.read_csv('../data/df_test.csv')
df_test['bin_quality'] = (df_test['quality']>=7).astype(int)
df_test.head(2)

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,bin_quality
0,6.8,0.61,0.2,1.8,0.077,11.0,65.0,0.9971,3.54,0.58,9.3,5,0
1,11.3,0.36,0.66,2.4,0.123,3.0,8.0,0.99642,3.2,0.53,11.9,6,0


In [21]:
X_test = df_test.drop(columns=['quality','bin_quality'])
y_test = df_test['bin_quality']

y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_pred_proba)
report = classification_report(y_test, y_pred)

print(f'AUC: {round(100*auc,1)}%\n\n{report}')

AUC: 88.2%

              precision    recall  f1-score   support

           0       0.98      0.74      0.85       236
           1       0.35      0.92      0.51        36

    accuracy                           0.76       272
   macro avg       0.67      0.83      0.68       272
weighted avg       0.90      0.76      0.80       272



In [None]:
confusion_matrix(y_test, y_pred)

array([[175,  61],
       [  3,  33]])
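
The confusion matrix above makes the trade-off described in the introduction concrete. The sketch below recomputes the four per-class metrics from it and then, as a purely hypothetical scenario, asks what the good-wine precision would be if the same per-class recalls held on a set with as many bad wines as good ones. The balanced scenario is an assumption for illustration, not a result from the notebook.

```python
# Test-set confusion matrix from above: rows are true classes, columns are predictions.
tn, fp = 175, 61
fn, tp = 3, 33

n_bad, n_good = tn + fp, fn + tp   # 236 bad wines, 36 good wines
recall_bad = tn / n_bad
recall_good = tp / n_good
precision_bad = tn / (tn + fn)
precision_good = tp / (tp + fp)

# Hypothetical scenario (assumption): same per-class recalls,
# but only as many bad wines as good ones.
fp_balanced = (1 - recall_bad) * n_good
precision_good_balanced = tp / (tp + fp_balanced)

print(f"recall    bad/good: {recall_bad:.3f} / {recall_good:.3f}")
print(f"precision bad/good: {precision_bad:.3f} / {precision_good:.3f}")
print(f"good-wine precision if classes were balanced: {precision_good_balanced:.3f}")
```

The same recall on bad wines yields far fewer false positives in absolute terms when there are fewer bad wines, which is why the low good-wine precision is largely a consequence of the class imbalance.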

In [31]:
df_test['pred'] = y_pred
df_test['proba'] = y_pred_proba

cond_err = (df_test['bin_quality']==0) & (df_test['pred']==1)
df_test[cond_err]

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,bin_quality,proba,pred
1,11.3,0.36,0.66,2.4,0.123,3.0,8.0,0.99642,3.20,0.53,11.9,6,0,0.720154,1
8,6.1,0.32,0.25,2.3,0.071,23.0,58.0,0.99633,3.42,0.97,10.6,5,0,0.736648,1
11,10.7,0.40,0.37,1.9,0.081,17.0,29.0,0.99674,3.12,0.65,11.2,6,0,0.785936,1
15,7.7,0.18,0.34,2.7,0.066,15.0,58.0,0.99470,3.37,0.78,11.8,6,0,0.873059,1
21,7.0,0.54,0.09,2.0,0.081,10.0,16.0,0.99479,3.43,0.59,11.5,6,0,0.501178,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,9.3,0.36,0.39,1.5,0.080,41.0,55.0,0.99652,3.47,0.73,10.9,6,0,0.708085,1
261,7.0,0.43,0.02,1.9,0.080,15.0,28.0,0.99492,3.35,0.81,10.6,6,0,0.806639,1
263,10.0,0.42,0.50,3.4,0.107,7.0,21.0,0.99790,3.26,0.93,11.8,6,0,0.867850,1
268,8.0,0.18,0.37,0.9,0.049,36.0,109.0,0.99007,2.89,0.44,12.7,6,0,0.709203,1


In [32]:
df_test[cond_err].describe()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,bin_quality,proba,pred
count,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0
mean,9.07541,0.426721,0.353115,2.388525,0.079131,15.754098,35.983607,0.996657,3.291639,0.732295,11.244262,5.803279,0.0,0.750301,1.0
std,2.367323,0.151858,0.205804,0.853639,0.025565,9.202637,21.035281,0.002454,0.176297,0.107539,0.943844,0.400819,0.0,0.125127,0.0
min,5.6,0.16,0.0,0.9,0.038,3.0,8.0,0.99007,2.86,0.44,8.4,5.0,0.0,0.501178,1.0
25%,7.1,0.32,0.15,1.8,0.066,9.0,19.0,0.99494,3.2,0.67,10.6,6.0,0.0,0.658334,1.0
50%,8.8,0.4,0.39,2.2,0.076,15.0,32.0,0.99633,3.31,0.72,11.1,6.0,0.0,0.745536,1.0
75%,10.7,0.53,0.49,2.7,0.085,23.0,50.0,0.9983,3.42,0.79,11.8,6.0,0.0,0.849534,1.0
max,15.5,0.915,0.79,4.8,0.194,41.0,109.0,1.00315,3.68,0.97,14.0,6.0,0.0,0.950631,1.0


In [33]:
cond_acerto = (df_test['bin_quality']==1) & (df_test['pred']==1)
df_test[cond_acerto].describe()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,bin_quality,proba,pred
count,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0
mean,9.166667,0.394242,0.434242,3.486364,0.078424,14.651515,42.757576,0.996425,3.270606,0.770303,11.820202,7.090909,1.0,0.829242,1.0
std,1.982213,0.111643,0.169042,2.01995,0.029875,10.677433,49.296951,0.002571,0.133087,0.13166,1.022522,0.291937,0.0,0.108504,0.0
min,5.1,0.27,0.1,1.4,0.042,3.0,7.0,0.99182,2.95,0.51,10.3,7.0,1.0,0.568277,1.0
25%,8.0,0.32,0.34,2.2,0.064,6.0,17.0,0.9947,3.19,0.68,10.8,7.0,1.0,0.797345,1.0
50%,9.3,0.35,0.45,2.6,0.074,12.0,29.0,0.99658,3.27,0.8,12.0,7.0,1.0,0.842953,1.0
75%,10.1,0.43,0.5,4.6,0.085,18.0,43.0,0.99859,3.34,0.86,12.5,7.0,1.0,0.910059,1.0
max,15.6,0.685,0.76,8.9,0.216,38.0,278.0,1.0032,3.58,1.02,13.6,8.0,1.0,0.949523,1.0


# Subset evaluation (Recall point of view)

In [34]:
df_test['proba'] = y_pred_proba
df_test['pred'] = y_pred

df_test_quality_7 = df_test[df_test['quality']==7].copy()
dist_quality(df_test_quality_7, column='pred')

pred,count,proportion
1,30,0.909091
0,3,0.090909


In [35]:
df_test_quality_8 = df_test[df_test['quality']==8].copy()
dist_quality(df_test_quality_8, column='pred')

pred,count,proportion
1,3,1.0


# Conclusion about this model:
- Good wines that go through the model are well classified
- Bad wines that go through the model are sometimes misclassified
- When the model says a wine is bad, there is high confidence that it really is bad
    - the converse does not hold


# Save Model

In [36]:
path_model = '../ml_models/modelo_02.pkl'
with open(path_model, 'wb') as arquivo:
    pickle.dump(model, arquivo)

# Test loaded model

In [37]:
with open(path_model, 'rb') as arquivo:
    model = pickle.load(arquivo)

y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
report = classification_report(y_test, y_pred)

print(f'AUC: {round(100*auc,1)}%\n\n{report}')

AUC: 88.2%

              precision    recall  f1-score   support

           0       0.98      0.74      0.85       236
           1       0.35      0.92      0.51        36

    accuracy                           0.76       272
   macro avg       0.67      0.83      0.68       272
weighted avg       0.90      0.76      0.80       272


