# Assignment 2

## 4.2.1 SVM (alternative solution)

This notebook contains the code that uses SVMs for classification.

Pages consulted:
https://www.datacamp.com/tutorial/svm-classification-scikit-learn-python
https://www.kaggle.com/code/rishabhkmr/scikit-learn-svm-understanding-the-concept/notebook


### Imports

In [1]:
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.metrics import accuracy_score, f1_score

import utilidades as ut

### Initialization and variables

In [2]:
warnings.filterwarnings("ignore")  # Suppress all warnings.
# Use a centrally defined style that is shared by all plots.
plt.style.use("style/estilo.mplstyle")
# plt.style.use('ggplot')

%matplotlib inline

label_encoder = LabelEncoder()

ficheiro = "dados_preparados.csv"
colunas_numericas = ["Idade", "FCV", "NRP", "CA", "FAF", "TUDE", "IMC"]
colunas_classes = ["Genero", "Historico_obesidade_familiar", "FCCAC", "Fumador", "MCC", "CCER", "CBA", "TRANS", "Label"]
colunas_classes_binarias = ['Genero', 'Historico_obesidade_familiar', 'FCCAC', 'Fumador', 'MCC']
colunas_classes_multiplos = ["CCER", "CBA", "TRANS", "Label"]

## Reading the prepared data

The data were analysed previously (see the file ``4.1.1_a_4.1.4_analise_dados.ipynb``).
Here we work with a file whose data have already been prepared and filtered.

In [3]:
dados_trabalho = pd.read_csv(ficheiro)

Let us confirm that the data were loaded as expected.

In [4]:
dados_trabalho

Unnamed: 0,Genero,Idade,Historico_obesidade_familiar,FCCAC,FCV,NRP,CCER,Fumador,CA,MCC,FAF,TUDE,CBA,TRANS,Label,IMC
0,Feminino,21.000000,Sim,Nao,2.0,3.0,Ocasionalmente,Nao,2.000000,Nao,0.000000,1.000000,Nao,Transportes_Publicos,Peso_Normal,24.386526
1,Feminino,21.000000,Sim,Nao,3.0,3.0,Ocasionalmente,Sim,3.000000,Sim,3.000000,0.000000,Ocasionalmente,Transportes_Publicos,Peso_Normal,24.238227
2,Masculino,23.000000,Sim,Nao,2.0,3.0,Ocasionalmente,Nao,2.000000,Nao,2.000000,1.000000,Frequentemente,Transportes_Publicos,Peso_Normal,23.765432
3,Masculino,27.000000,Nao,Nao,3.0,3.0,Ocasionalmente,Nao,2.000000,Nao,2.000000,0.000000,Frequentemente,Caminhada,Excesso_Peso_Grau_I,26.851852
4,Masculino,22.000000,Nao,Nao,2.0,1.0,Ocasionalmente,Nao,2.000000,Nao,0.000000,0.000000,Ocasionalmente,Transportes_Publicos,Excesso_Peso_Grau_II,28.342381
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,Feminino,20.976842,Sim,Sim,3.0,3.0,Ocasionalmente,Nao,1.728139,Nao,1.676269,0.906247,Ocasionalmente,Transportes_Publicos,Obesidade_Mórbida,44.901475
2107,Feminino,21.982942,Sim,Sim,3.0,3.0,Ocasionalmente,Nao,2.005130,Nao,1.341390,0.599270,Ocasionalmente,Transportes_Publicos,Obesidade_Mórbida,43.741923
2108,Feminino,22.524036,Sim,Sim,3.0,3.0,Ocasionalmente,Nao,2.054193,Nao,1.414209,0.646288,Ocasionalmente,Transportes_Publicos,Obesidade_Mórbida,43.543817
2109,Feminino,24.361936,Sim,Sim,3.0,3.0,Ocasionalmente,Nao,2.852339,Nao,1.139107,0.586035,Ocasionalmente,Transportes_Publicos,Obesidade_Mórbida,44.071535


## Encoding the classes

Before training, the categorical classes must be encoded as numeric values. This is done with ``sklearn.preprocessing.LabelEncoder``.

In [5]:
ut.titulo("Valores codificados por atributo")

for coluna in colunas_classes:
    if dados_trabalho[coluna].dtype == 'object':
        dados_trabalho[coluna] = label_encoder.fit_transform(dados_trabalho[coluna].values)
        ut.etiqueta_e_valor(coluna, str(sorted(dados_trabalho[coluna].unique())))

Valores codificados por atributo
Genero: [0, 1]
Historico_obesidade_familiar: [0, 1]
FCCAC: [0, 1]
Fumador: [0, 1]
MCC: [0, 1]
CCER: [0, 1, 2, 3]
CBA: [0, 1, 2, 3]
TRANS: [0, 1, 2, 3, 4]
Label: [0, 1, 2, 3, 4, 5, 6, 7, 8]
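
The loop above reuses a single `LabelEncoder`, so once it finishes only the mapping of the last column is still available. As a minimal sketch (the toy frame and the names `exemplo` and `mapeamentos` are hypothetical, not part of the notebook), keeping one encoder per column makes it possible to see which integer corresponds to which class. `LabelEncoder` sorts the classes, so the codes follow alphabetical order rather than any natural ordering of the categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical values, reusing two column names from the notebook).
exemplo = pd.DataFrame({
    "Genero": ["Feminino", "Masculino", "Feminino"],
    "CCER": ["Ocasionalmente", "Nao", "Frequentemente"],
})

mapeamentos = {}
for coluna in exemplo.columns:
    encoder = LabelEncoder()  # one encoder per column, so its classes_ is not overwritten
    exemplo[coluna] = encoder.fit_transform(exemplo[coluna])
    # classes_ is sorted; the position of each class is its integer code.
    mapeamentos[coluna] = {str(classe): codigo for codigo, classe in enumerate(encoder.classes_)}

print(mapeamentos)
```

If the original labels are needed later (for example to label a confusion matrix), the same per-column encoder also offers `inverse_transform`.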


## Holdout

Split the data into training and test sets while also separating the target (``Label``) from the predictors.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    dados_trabalho.drop("Label", axis=1),
    dados_trabalho["Label"],
    test_size=0.2,
    random_state=100
)
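
Whether the classes in this dataset are balanced is not shown here. If they are not, one option is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both subsets. The sketch below uses hypothetical toy data (`X`, `y`) only to illustrate the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, two classes with a 75/25 imbalance.
X = pd.DataFrame({"IMC": range(20)})
y = pd.Series([0] * 15 + [1] * 5, name="Label")

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y
)

# With stratification the 75/25 ratio is preserved in the 4-row test set.
print(y_te.value_counts().to_dict())
```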

## Normalization

In [7]:
colunas = X_train.columns

In [8]:
x_train = X_train.values
x_test = X_test.values

min_max_scaler = MinMaxScaler()

x_train_scaled = min_max_scaler.fit_transform(x_train)
x_test_scaled = min_max_scaler.transform(x_test)

X_train = pd.DataFrame(x_train_scaled)
X_test = pd.DataFrame(x_test_scaled)

In [9]:
X_train.columns = colunas
X_test.columns = colunas
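
The three cells above can be condensed. As a sketch on hypothetical toy frames (`treino`, `teste`), the scaled arrays are wrapped back into DataFrames with their column names in one step. The scaler is fitted on the training data only, as in the notebook, so test values outside the training range end up outside the interval [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frames standing in for X_train and X_test.
treino = pd.DataFrame({"Idade": [20.0, 30.0, 40.0], "IMC": [18.0, 24.0, 30.0]})
teste = pd.DataFrame({"Idade": [25.0, 50.0], "IMC": [21.0, 33.0]})

scaler = MinMaxScaler()
# Fit on the training data only, then apply the same transformation to the test data.
treino_scaled = pd.DataFrame(
    scaler.fit_transform(treino), columns=treino.columns, index=treino.index
)
teste_scaled = pd.DataFrame(
    scaler.transform(teste), columns=teste.columns, index=teste.index
)

print(teste_scaled)
```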

In [10]:
X_train.head()

Unnamed: 0,Genero,Idade,Historico_obesidade_familiar,FCCAC,FCV,NRP,CCER,Fumador,CA,MCC,FAF,TUDE,CBA,TRANS,IMC
0,0.0,0.222222,1.0,1.0,1.0,0.666667,0.666667,0.0,0.868677,0.0,0.0,0.038047,0.666667,1.0,0.763853
1,0.0,0.111111,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,1.0,0.136989
2,1.0,0.161009,1.0,1.0,0.78105,0.701669,0.666667,0.0,0.086214,0.0,0.061639,0.461541,0.666667,1.0,0.33964
3,0.0,0.222222,1.0,1.0,1.0,0.666667,0.666667,0.0,0.837784,0.0,0.0,0.040965,0.666667,1.0,0.757783
4,0.0,0.024064,0.0,1.0,0.907579,0.666667,0.666667,0.0,0.955593,0.0,0.865043,0.690102,0.666667,1.0,0.082203


## Training the model

In [12]:
svm = SVC()

svm_grid_parameters = {
    "C": [0.01, 0.1, 1, 10, 100, 1000],
    "kernel": ["linear", "poly", "sigmoid", "rbf"],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001]
}
svm_grid_search = GridSearchCV(svm, svm_grid_parameters)

In [13]:
svm_grid_search.fit(X_train, y_train)
ut.etiqueta_e_valor("Melhores parâmetros para o SVM:", svm_grid_search.best_estimator_)
y_pred_svm_grid = svm_grid_search.predict(X_test)

Melhores parâmetros para o SVM:: SVC(C=1000, gamma=1, kernel='linear')
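
The grid above has 6 × 4 × 5 = 120 candidates, but `gamma` has no effect on the linear kernel, so many of the linear fits are repeats. A possible alternative, sketched below under assumptions, passes `GridSearchCV` a list of grids so that `gamma` is searched only for the kernels that use it, and wraps the scaler in a `Pipeline` so that it is refitted inside each cross-validation fold. The iris dataset, the reduced value lists, and the step names `scaler` and `svc` are stand-ins chosen for illustration, not the notebook's setup.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in dataset, not the notebook's data

# Scaling inside the pipeline means each CV fold fits its own scaler.
pipeline = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])

# gamma is only searched for the kernels that actually use it.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {
        "svc__kernel": ["rbf", "poly", "sigmoid"],
        "svc__C": [0.1, 1, 10],
        "svc__gamma": [1, 0.1, 0.01],
    },
]

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

When the best `C` sits at the edge of the searched range, as with `C=1000` above, it may be worth extending the range in that direction before settling on a value.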


## Metrics

Accuracy and F1-score

In [14]:
ut.etiqueta_e_valor("Accuracy (%)", f"{accuracy_score(y_test, y_pred_svm_grid) * 100:.3f}")
ut.etiqueta_e_valor("F1-score", f"{f1_score(y_test, y_pred_svm_grid, average='micro'):.3f}")

Accuracy (%): 94.326
F1-score: 0.943
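
The two numbers agree because, for single-label multiclass problems, the micro-averaged F1-score is equal to the accuracy. The sketch below uses hypothetical labels (`y_true`, `y_pred`) to show this, and also computes the macro average, which weights every class equally and can therefore differ when some classes are predicted worse than others.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for a 3-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")  # same value as accuracy here
f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1

print(acc, f1_micro, f1_macro)
```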
